Why you should use sysbench

Among all the benchmark tools available online, sysbench is one of my favorites. If I had to describe it in two words, they would be: simplicity and reliability. At Cloud Mercato we mainly focus on three of its parts (typical invocations are sketched just after the list):

  • CPU: Simple CPU benchmark
  • Memory: Memory access benchmark
  • OLTP: Collection of OLTP-like database benchmarks
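
For reference, here is a minimal sketch of how each mode is typically invoked with sysbench 1.x; the option values (prime limit, block sizes, table sizes, MySQL credentials) are placeholders to adapt to your environment:

# CPU: prime number search (detailed below)
sysbench cpu --cpu-max-prime=64000 --threads=8 --time=30 run
# Memory: memory access benchmark
sysbench memory --memory-block-size=1K --memory-total-size=100G run
# OLTP: database benchmark against an existing MySQL instance (prepare, then run)
sysbench oltp_read_write --mysql-db=sbtest --mysql-user=sbtest --mysql-password=secret --tables=10 --table-size=1000000 prepare
sysbench oltp_read_write --mysql-db=sbtest --mysql-user=sbtest --mysql-password=secret --tables=10 --table-size=1000000 --threads=8 --time=60 run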

On top of that, sysbench can be considered an agnostic benchmark tool: like wrk, it embeds a LuaJIT interpreter, letting you plug in the task of your choice and obtain rates, maximums, percentiles and more. But today, let's just focus on the CPU benchmark, a.k.a. prime number search.

Run it from the command line

sysbench --threads=$cpu_number --time=$time cpu --cpu-max-prime=$max_prime run

The output of this command is easy to digest:

sysbench 1.1.0 (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 8
Initializing random number generator from current time


Prime numbers limit: 64000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:   707.94

Throughput:
    events/s (eps):                      707.9361
    time elapsed:                        30.0112s
    total number of events:              21246

Latency (ms):
         min:                                   11.25
         avg:                                   11.30
         max:                                   19.08
         95th percentile:                       11.24
         sum:                               240041.14

Threads fairness:
    events (avg/stddev):           2655.7500/1.79
    execution time (avg/stddev):   30.0051/0.00

For the laziest among you, the most interesting value here is "events per second", i.e. the number of prime-number searches completed per second. Beyond that, the output is quite human readable and very easy to parse. It contains the basic statistics required for a benchmark, both in terms of context and mathematical aggregations.
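
Because the output format is stable, the headline value is trivial to extract in a script; a minimal sketch (options are the same as above):

sysbench --threads=8 --time=30 cpu --cpu-max-prime=64000 run \
  | awk '/events per second/ {print $NF}'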

But what did I just do?

"Never run a command if you don't know what it's supposed to do." (Napoleon I)

That being said, as sysbench is open source and available on GitHub, even with a beginner level in C it's not hard to understand. The code sample below is the core of what is timed during the test:

int cpu_execute_event(sb_event_t *r, int thread_id)
{
  unsigned long long c;
  unsigned long long l;
  double t;
  unsigned long long n=0;

  (void)thread_id; /* unused */
  (void)r; /* unused */

  /* So far we're using very simple test prime number tests in 64bit */

  for(c=3; c < max_prime; c++)
  {
    t = sqrt((double)c);
    for(l = 2; l <= t; l++)
      if (c % l == 0)
        break;
    if (l > t)
      n++;
  }

  return 0;
}

Source: GitHub akopytov/sysbench

Basically, if I translate the function into plain English, it would be: loop over the numbers below max_prime and count those that have no divisor between 2 and their square root, i.e. the primes.

And yes, the whole depth of this benchmark is just these 15 lines: simple loops and some arithmetic. For me, this is clearly the strength of the tool: again, simplicity and reliability. sysbench cpu has crossed the ages, and since it hasn't changed in more than 15 years, it lets you compare old chips such as Intel Sandy Bridge against the latest ARM.

It is a well coded prime-number test, and what I see is that it doesn't include any of the complex machinery developers often set up to achieve their goals, such as thread cooperation, encryption or advanced mathematics. It does a single task, answering a simple question: "How fast are my CPUs?"

The RAM involvement is almost nil, processors' special features generally don't improve performance for this kind of task, and for a cloud benchmarker like me this is essential to erase as much bias as possible. Where complex workloads give me performance data and an idea of overall system capacity, sysbench captures a raw strength that I correlate later with more complex results.

How do I use it?

[Chart: sysbench CPU events per second versus thread count (1 to 64) on T-Systems Open Telekom Cloud s3.8xlarge.1. Throughput grows linearly from about 89 events/s at 1 thread to about 2,825 at 32 threads (the vCPU count, marked by a plot line), then plateaus at 48 and 64 threads.]

The only parameter entirely specific to sysbench cpu is --cpu-max-prime: it defines the upper bound of the prime number search. The higher this value, the longer each event takes.

Our methodology uses an upper limit of 64000 and scales the thread count from 1 up to twice the number of CPUs present on the machine. It produces data like the chart above, where we can see that OTC's s3.8xlarge.1 scales almost linearly up to 32 threads, its physical limit.
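
A rough sketch of that scaling methodology as a shell loop (the thread counts and output file name are illustrative):

max_prime=64000
for threads in 1 2 3 4 6 8 12 16 24 32 48 64; do
  eps=$(sysbench --threads=$threads --time=30 cpu --cpu-max-prime=$max_prime run \
        | awk '/events per second/ {print $NF}')
  echo "$threads,$eps" >> sysbench-cpu-scaling.csv
done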


We use sysbench almost every day


New CPU architecture at Oracle Cloud: VM.Standard.A1.Flex

Ampere X Oracle

A decade ago, nobody was betting on the ARM architecture for the server side. It was, and still is, well established in embedded and mobile systems, but using it for any layer of a server application seemed like an odd idea. It only appeared for specific requirements, such as testing ARM applications with providers like Travis CI. From my perspective, the first attempts at ARM in the cloud market were:

  • Travis CI, for the sake of Android and iOS developers
  • Scaleway, with a series of ARM VMs from 2 to 64 CPUs
  • Packet (now Equinix), with the former bare-metal c1.large.arm and x.large.arm

The race is now well and truly on, and ARM is clearly a new category of products: we can already see several generations of Graviton at AWS and, of course, Oracle's new A1, the fruit of a collaboration between Ampere Computing and the hyperscaler.

What's under the hood?

Like the other VM offerings at Oracle, A1 benefits from a flexible design: you choose exactly the amount of CPU and RAM for your instances. From 1 to 80 oCPUs and up to 1 TB of RAM, you can shape instances that exactly match your requirements, without having to choose between what other providers call compute-optimized and memory-optimized.

In the previous x86 series, Oracle defined its VMs in oCPU, the equivalent of a physical core. With AMD and Intel, 1 core carries 2 hyperthreads (generally sold as 2 vCPUs in a virtual machine). Oracle kept the oCPU terminology for the new A1, and 1 oCPU still represents 1 core, but as this hardware has only 1 thread per core, 1 A1 oCPU equals 1 thread. This may look like a loss for consumers; the performance analysis in the next section will answer that.
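
This is easy to check from inside an instance with standard Linux tooling; for example, lscpu exposes the SMT topology (shown here as a hint, not as output captured on A1):

lscpu | grep -E '^(Socket|Core|Thread|CPU)\(s\)'
# "Thread(s) per core" should read 1 on an A1 shape,
# while AMD E3/E4 and Intel shapes typically report 2.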

Performance/price analysis

Oracle Cloud is known for very aggressive pricing, and by using ARM technology it can expect a performance boost and, above all, lower infrastructure costs, hence a better price for its customers. To understand this benefit we tested the three latest generations of Oracle VMs: E3, E4 and A1, with the following machines:

Name oCPU vCPU RAM (GB) Price (USD/hr)
E3.Flex.2-8 2 4 8 0.062
E4.Flex.2-8 2 4 8 0.062
A1.Flex.2-8 2 2 8 0.022
A1.Flex.4-8 4 4 8 0.032

We ran a bunch of tests to compare performance between series and also to analyse the scaling from 2 A1 oCPUs to 4. The goal is to answer: "Should I go to A1?" and "Should I increase the number of oCPUs?"

The charts below show the performance reported by sysbench CPU and the price/performance ratio calculated from the hourly price:

[Chart: Performance. sysbench CPU events per second (higher is better): A1.Flex.2-8 ≈ 496, A1.Flex.4-8 ≈ 1,007, E3.Flex.2-8 ≈ 248, E4.Flex.2-8 ≈ 631.]
[Chart: Price/Performance ratio computed from the hourly price (higher is better): A1.Flex.2-8 ≈ 30,877, A1.Flex.4-8 ≈ 43,104, E3.Flex.2-8 ≈ 5,490, E4.Flex.2-8 ≈ 13,938.]

What do we see?

  • Drop your E3: at the same price as E4 but with lower performance, it loses its interest.
  • At the same oCPU count, E4 performs better than A1, but at roughly 2.5 to 3 times the price.
  • At the same vCPU count, A1 is clearly far ahead in terms of both performance and price.

The main obstacle to A1 adoption is, of course, its CPU architecture. While you know any software will be available on x86, on aarch64 you may run into packages you have to compile yourself, or that are simply unavailable. But keep in mind that general-purpose building blocks such as programming languages or databases are already compiled and packaged.


Of course we ran more tests:


Amazon Web Services : M5 vs M5a vs M6g

AWS M5 / M5a / M6g benchmark

In other words: Intel vs AMD vs ARM. AWS recently released Graviton series for all their main instance types: R6g with extended memory, C6g for compute-optimized and M6g for general purpose. Their offering has always been based on Intel, but in recent years AMD appeared, and now, with Graviton 2, AWS relies on its own chips.

Amazon Web Services presents its Graviton processors as a new choice for customers to increase performance at a lower cost. But what is the difference between these solutions, identical on paper? Let's run a CPU benchmark to find out.

Specification

For our benchmark, we took an 8-CPU / 32 GB VM from each series:

Product Price (USD/hr) CPU Frequency
m5.2xlarge 0.38 Intel(R) Xeon(R) Platinum 8175M or 8259CL 2.5-3.2GHz
m5a.2xlarge 0.34 AMD EPYC 7571 2.4-3.0GHz
m6g.2xlarge 0.31 AWS Graviton2 (aarch64) N/A

This data was collected by our Cloud Transparency Platform; prices are for the us-east (N. Virginia) region.

Performance testing

Prime number search with sysbench CPU

sysbench CPU can be categorized as integer arithmetic operations.

Going from M5 to M6g, we observe an increase of about +100% in single thread, and close to +400% with 8 threads.

Encryption with AES-256 CBC

Where AMD's performance depends on block size, Intel and Graviton are homogeneous across sizes. The ARM chip encrypts at 1.2 GB/s, where the M5 and M5a cap at roughly 400 MB/s and 900 MB/s respectively, i.e. about a third and three quarters of the Graviton throughput.
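
The article doesn't name the tool behind these figures; a common way to reproduce this kind of AES-256-CBC throughput measurement across block sizes is OpenSSL's built-in benchmark:

# Single-threaded AES-256-CBC throughput for block sizes from 16 bytes to 16 KB
openssl speed -evp aes-256-cbc
# Add -multi $(nproc) to load every core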

Price

Product Hourly Monthly (estimation) Yearly (estimation) Discount
m5.2xlarge 0.38 280 3,360
m5a.2xlarge 0.34 251 3,012 -11%
m6g.2xlarge 0.31 224 2,688 -22%

Monthly is based on 730 hours and yearly on 8,760 hours, without any long-term subscription.

Prices leave no doubt: each new generation offers a lower cost, and M6g is the cheapest.

Conclusion

Depending on your workload, Graviton offers up to +400% of the performance of its Intel counterpart. Combined with lower pricing, M6g is definitely the best EC2 choice for any CPU-bound workload compatible with the ARM architecture.


Check out data in our Public Cloud Reference


Understand Object storage by its performance

How to qualify Object Storage performance

Nowadays, anyone who wants to smartly store cool or cold data will be guided toward an Object Storage solution. This cloud model has replaced a lot of usages, such as our old FTP servers, backup storage or static website hosting. The keywords here are "low price", "scalability", "unlimited". But as we observe with compute, not all Object Storage offerings are equal, firstly in terms of price, then in performance.

What qualifies Object Storage performance?

Latency

Depending on your architecture, latency can be a key factor for workloads dealing with small blobs. A common example is static website hosting: the average file size won't exceed 1 MB, so you expect clients to receive files almost instantly.

Keep in mind that an Object Storage service is (generally) a single point of service, so for inter-continental connections it's recommended to put a CDN in front of it. The table below shows worldwide averages for time-to-first-byte toward storage services:

Africa Asia China Europe N. America Pacific S. America
Asia 1.393 1.264 1.065 0.812 0.899 1.233 1.272
Europe 0.874 0.820 0.957 0.214 0.490 0.996 0.768
N. America 1.343 0.934 1.164 0.635 0.325 0.870 0.652
Pacific 2.534 1.094 1.117 1.763 1.161 0.760 1.570

TTFB in seconds

Bandwidth

If you work with large objects, bandwidth is a more interesting metric. This is especially visible in Big Data architectures: thanks to their low storage costs, Object Storage services are well suited to storing huge datasets, but between remote and local storage, network bandwidth is the main bottleneck.

As with latency, the factor is twofold: both client and server networks count, and at this game clouds aren't equal. Server-side bandwidth can be throttled at different layers:

  • Per connection: a maximum bandwidth is set for each incoming request
  • At the bucket layer: each bucket is limited
  • For the whole service: the limit is global to the tenant or to each deployed Object Storage service

Bucket scalability

While Object Storage often looks like a simple filesystem exposed over HTTP, under the hood many technical constraints appear on the cloud provider's side. Buckets are presented as more or less unlimited flat blob containers, but several factors can make your performance vary:

  • The total number of objects in your bucket
  • The total size of the objects in your bucket
  • The names of your objects, especially their prefixes

Burst handling

Something never shown on the landing pages is the capacity to handle a high load of connections. Again, the market isn't homogeneous: some vendors withstand traffic spikes worthy of a DDoS, others will see their performance degrade or simply return HTTP 429 Too Many Requests.

The solution may be to simply balance the load across services/buckets, or to use a CDN, which is more appropriate for intensive HTTP workloads.

Conclusion

There's no rule of thumb to tell whether an Object Storage service performs well from its specification sheet. Even if providers use standard software such as Ceph, the hardware and configuration create a genuine solution with its own constraints and advantages. That's why performance testing is always required to understand a product's profile.

New C5a benchmark: Performance/Price

AWS recently released the new C5a series, equipped with a custom AMD EPYC 7R32. It is a less expensive alternative to C5, similar to what they did with M5, R5 and T3. But cost isn't an appropriate metric if you don't take performance into account, so let's dive into a performance/price benchmark comparing C5 and C5a.

Lower pricing

Name vCPU RAM (GB) C5 (USD/hr) C5a (USD/hr)
large 2 4 0.085 0.077
xlarge 4 8 0.170 0.154
2xlarge 8 16 0.340 0.308
4xlarge 16 32 0.680 0.616
9xlarge 36 72 1.530 1.232
12xlarge 48 96 2.040 1.848
18xlarge 72 144 3.060 2.464
24xlarge 96 192 4.080 3.696
metal 96 192 4.080

Pricing is for US East (Ohio)

Slightly better performance

Before opening the hood, there are two things to keep in mind about C5. First, CPU performance is highly variable: behind the product name, several CPU models are sold, and we actually collected the following:

  • Intel(R) Xeon(R) Platinum 8124M
  • Intel(R) Xeon(R) Platinum 8275CL

Like the new AMD EPYC 7R32, both are custom models only available at AWS. Second, the same CPU model runs at different frequencies. Cloud providers generally pin their CPU frequency at the baseline or the turbo value; for the Platinum 8124M, we detected values from 3 GHz up to 3.45 GHz.
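
A quick way to see which CPU model and frequency a given VM actually landed on (plain Linux commands, given as a hint rather than our exact collection method):

# CPU model exposed to the guest
grep -m1 'model name' /proc/cpuinfo
# Current/min/max frequencies as reported by the kernel
lscpu | grep -i 'mhz'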

Geekbench 5

Kind c5.large c5a.large
Single score 934 909
Single Integer 902 815
Single Float 949 969
Single Crypto 1267 1782
Multi score 1115 1168
Multi Integer 1049 1067
Multi Float 1200 1256
Multi Crypto 1470 1952

From a Geekbench perspective, C5a excels especially in the cryptography realm, which is not to be underestimated: nowadays encryption is used everywhere, from volumes to HTTP connections to any backend. The other domains are also slightly more efficient, but without a huge gap.

sysbench RAM

c5.large c5a.large
Read 8201 9139
Write 6134 7091

RAM bandwidth is a good indicator of noisy neighbors, and as C5a had just been released, its values have a higher chance of being good. We will also re-check regularly whether C5 and C5a can still claim the same throughput.

Performance/Price

Looking at the results below and knowing the instance prices are about 10% lower, it's no surprise that C5a has the better profile in terms of performance per dollar spent.

Type Hourly Monthly Multi score Perf/price ratio
c5.large 0.085 62.05 1115 17.97
c5a.large 0.077 56.21 1168 18.82

Monthly price is calculated from 730 hours.
The perf/price ratio equals "Multi score / Monthly".

Conclusion

With this new custom CPU model, AWS lowers its pricing again while slightly increasing performance. With the previous C5 we observed a lot of performance variation, and it wouldn't be a surprise if future tests pull the average performance up or down.

As a full series cannot be described by its smallest instance type, we also tested bigger flavors. Feel free to consult their performance in our Public Cloud Reference.

AWS and the volume equation

Despite being one of the most used block storage solutions worldwide, Amazon's General Purpose SSD is far from being a general and versatile solution. Unlike other providers, which sell volumes based on device type and an hourly price per gigabyte, AWS chose to create products tailored to usages.

EBS: the Block Storage solutions

Behind the name Elastic Block Store, 5 storage classes are available:

  • Magnetic: the historical block storage solution provided by AWS. As its name indicates, this product is backed by spinning disks, making it inherently slow: 200 IOPS and 100 MB/s. But in the late 2000s it was sold as general purpose storage, not as a low-cost tier.
  • Throughput Optimized: dedicated to large chunk processing, this product aims at optimal throughput. Still HDD-backed, but efficient for Big Data or log processing.
  • Cold HDD: in the same family as Throughput Optimized but with lower price and performance. Useful for less frequently accessed data, such as cached data storage.
  • General Purpose SSD: this is the common volume type used by consumers, and it shouldn't be taken as a standard SSD block storage. Firstly, GP-SSD is capped at 16 KIOPS, which is pretty low for an intensive workload. Secondly, its maximum performance is constrained by a credit system that doesn't let you benefit permanently from the best performance. Both points make GP-SSD more appropriate for non-intensive workloads that do not require sustained load.
  • Provisioned IOPS SSD: the answer to General Purpose's variable performance. This product lets the user define and pay for a maximum IOPS figure going up to 64 KIOPS. It makes storage-bound workloads possible, but at the high price of $0.065 per provisioned IOPS.

Local storage

Block Storage isn't the only solution provided by Amazon Web Services: since the I3 series, local NVMe SSDs are available for high-IOPS workloads. Let's compare solutions that look similar on paper: i3.large vs r5.large + 500GB GP SSD.

Flavor CPU RAM (GB) Storage Monthly price
i3.large 2 16 475GB local NVMe-SSD $135
r5.large 2 16 500GB General Purpose SSD $168.4

As you can see in the table and chart below, for an equivalent solution in terms of basic specifications, it's much more worthwhile to opt for the i3. The NVMe devices are attached locally to I3 VMs, without block storage in between, creating a real gap in terms of IOPS and latency:

Features matter a lot

Comparing block versus local storage is inappropriate without taking features into account. In fact, despite its generally lower performance, block storage is a key component of a cloud's flexibility and reliability. Where a local device focuses on latency, block storage is attractive for all its features, such as snapshots/backups, replication, availability and more.
Here is a small comparative table outlining general pros and cons:

Block Local
Latency Low to high Very low
IOPS Low to high High to very high
Replication Yes No
SLA Yes No
Price Low to very high Included with instance
Size Up to +16TB Fixed at instance startup
Persistence Unlimited Instance lifespan
Hot-plug Yes No

We can clearly see two usages: non-guaranteed high performance on one side, and flexibility on the other.

Top 10s Cloud Compute debriefing

We recently released our Top 10 Cloud Compute reports for North America and Europe. With the help of our automated platform we tested nearly 20 cloud providers and selected the most interesting ones per region. These studies outline the performance/price value of cloud compute instances and their attached block storage. We focus on the maximum performance delivered by general purpose infrastructures, the associated costs and where the efficiency per dollar spent is best.

Context

For each provider, we tested 4 sets of VMs:

Category CPU RAM (GB) Extra storage (GB)
Small 2 4 100
Medium 4 8 150
Large 8 16 200
XLarge 16 32 500

From all the performance and pricing data we collected, the vendor selection was done agnostically, purely by the numbers, with the following key metrics:

  • Hourly price
  • CPU Multi-thread performance
  • Volume IOPS
  • Volume bandwidth

Inherent biases

1. Hourly prices

De facto, most of the hyperscalers are penalized by the documents' approach. Although they may offer computing power at the cutting edge of technology, the design of our study doesn't take into account long-term billing options such as 1-year or 3-year commitments. These options are mostly proposed by big players such as Alibaba, AWS or Azure, and you can expect up to 60% discount if you subscribe to them.

2. Volume throttling

Next, hyperscalers generally throttle volume performance: where small and medium-size vendors let you reach 3 GB/s and/or 1 MIOPS with block storage, the big players stop around 3,000 IOPS. This may seem low, but it is guaranteed, whereas the possible 1 MIOPS is neither stable nor predictable.

3. Compute focused

Finally, the documents focus on compute, i.e. virtual machines and volumes, but cloud providers, especially the big players, have much more to offer. Serverless, Object Storage, DBaaS: with the variety of existing services, the whole value of a cloud vendor cannot be judged on Cloud Compute alone.

Our insights at a glance

For those who don’t want to read the reports, here’s a small list of the leading providers:

Provider Price Compute Storage Region
Hetzner Very aggressive Average Average Europe
Kamatera Low Average High Worldwide
UpCloud Average High Guaranteed very high Worldwide
Oracle Cloud Average Average High Worldwide

What next?

These documents will be renewed and their methodology improved. We want to bring in more infrastructure characteristics, like network and RAM. In the pursuit of objectivity, we think we must diversify our reports to answer real-life questions, for instance:

  • Small and medium size providers
  • Hyperscalers
  • Country based
  • Provider origin based
  • Object storage, CDNs, DBaaS, Kubernetes, etc

We also want to digitalize this kind of report. Instead of just a PDF, we wish to let consumers explore the data with a web application. This will also let users compare more than 10 vendors without degrading readability.

In the meantime, do not hesitate to take a look at our document center.

 

New HTTP benchmark tool: pycurlb

Many tools exist to get performance data about an HTTP connection, but most of them can be considered stress tools: they focus on launching a large number of requests and output statistical aggregations of latency, throughput and rate. ApacheBench (ab), wrk, httperf: we regularly use this software for what it brings, but it imposes a methodology which isn't adapted to some of our goals. We looked for:

  • Running only a single request: we wanted to be able to test a link in idle state, or while it is being loaded by another stress tool
  • Getting the other TCP timings: DNS, SSL handshake and more have to be known

In the past, to fill these requirements, I used the well-known command line tool "client URL Request Library", also known as cURL. This software is considered the Swiss Army knife of HTTP clients, and although it supports many more protocols, most people use it just to download a file or talk to REST APIs. But if you dig under the surface, curl is actually just a user interface for its powerful library, libcurl: where cURL provides more than 50 command line options, libcurl gives you the opportunity to forge any kind of HTTP request.

For debugging purposes, curl has an option named --write-out allowing users to export data about the connection and the response. Here's an example:

$ curl --write-out '%{time_total}' https://www.cloud-mercato.com/ -o /dev/null -s
0.230652

The command above reaches our goal, but first there are options we will always need: -o and -s, because we don't care about curl's body output. Moreover, the desired output is actually difficult to produce from the command line: for a complex format like JSON, you have to create a template file or fight with character escaping. To ease our work, we decided to create a tool based on libcurl and designed for exactly two tasks: run a single HTTP request and report the connection information. This is how pycurlb was born.
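
To illustrate the escaping problem, here is what a hand-written JSON template looks like with plain curl; the selected fields are arbitrary:

$ curl -o /dev/null -s \
    -w '{"dns": %{time_namelookup}, "connect": %{time_connect}, "ttfb": %{time_starttransfer}, "total": %{time_total}}\n' \
    https://www.cloud-mercato.com/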

pycurlb, short for Python cURL Benchmark, is based on pycurl, a Python wrapper around libcurl. It is a very simple command line tool mimicking curl's behavior but outputting a JSON document with a lot of information. The command equivalent to the one presented above would be:

$ pycurlb https://www.cloud-mercato.com/
{
  "appconnect_time": 5.673696,
  "compressed": false,
  "connect_time": 5.581115,
  "connect_timeout": 300,
  "content_length_download": 219.0,
  "content_length_upload": -1.0,
  "content_type": "text/html; charset=UTF-8",
  "effective_url": "https://www.cloud-mercato.com/",
  "header_size": 516,
  "http_code": 200,
  "http_connectcode": 0,
  "httpauth_avail": 0,
  "local_ip": "10.0.0.1",
  "local_port": 34740,
  "max_time": 0,
  "method": "GET",
  "namelookup_time": 5.520988,
  "num_connects": 1,
  "os_errno": 0,
  "pretransfer_time": 5.673749,
  "primary_ip": "1.2.3.4",
  "primary_port": 443,
  "proxyauth_avail": 0,
  "redirect_count": 0,
  "redirect_time": 0.0,
  "redirect_url": "https://www.cloud-mercato.com/",
  "request_size": 190,
  "size_download": 219.0,
  "size_upload": 0.0,
  "speed_download": 38.0,
  "speed_upload": 0.0,
  "ssl_engines": [
    "rdrand",
    "dynamic"
  ],
  "ssl_verifyresult": 0,
  "starttransfer_time": 5.74181,
  "total_time": 5.741879
}

Easy and useful. Let's see which helpful metrics we have:

  • namelookup_time: time to resolve DNS
  • connect_time: time to establish the TCP connection
  • appconnect_time: time until the SSL/TLS handshake is done and HTTP communication can start
  • pretransfer_time: time just before the transfer starts
  • starttransfer_time: time when the first byte is received
  • total_time: total request/response time
  • speed_download and speed_upload: throughput

We see here that the value other benchmark tools call latency is split into six items, each describing a stage of an applicative TCP/IP connection. Detail is the key word here, yet we tried to stay fully compatible with the original curl and keep the same command line arguments, so even advanced scenarios such as adding headers remain possible.
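
Since the result is plain JSON on stdout, it combines naturally with other tools; for instance, assuming jq is installed, a minimal sketch:

$ pycurlb https://www.cloud-mercato.com/ \
    | jq '{dns: .namelookup_time, tcp: .connect_time, tls: .appconnect_time, ttfb: .starttransfer_time, total: .total_time}'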

This software is an open source project hosted on GitHub. Feel free to use it, contribute or open issues; you are very welcome.

Observe worldwide network latencies

Have you ever wondered which provider will give the best latency to your users? Not a theoretical value, but an accurate metric representing a real end-to-end connection. At Cloud Mercato our platform allows us to manage cloud components all around the globe. VPS, virtual machines, buckets or CDNs: we can easily set up worldwide client-server configurations and run network workloads. But this approach could be described as datacenter-to-datacenter: my client is an instance at provider X and it hits another machine at provider Y. As providers are always supposed to have low-latency connectivity, this scenario is not appropriate for testing a real end-user connection.

From our point of view, this performance test has to be done in the same conditions as an end user: from a 3G/4G/5G device, over WiFi, through ADSL or optical fiber. Instead of creating yet another Unix command, we decided to write Observer, a web application letting you test more than 100 locations directly from your browser.

What does it do?

Observer displays performance from live tests run by your browser. We set up endpoints among a bunch of Object Storage services and CDNs, and let you compare performance across the different solutions and providers.

Concretely, the application asks our CTP for a list of available endpoints serving a 1-byte file. For each item, an AJAX request is launched and its Time To First Byte (TTFB) is measured. This value is reported in milliseconds in the table on the left and as a heat color on the map.
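
For a rough command-line equivalent of what the browser does, curl can report the same TTFB metric (measured from wherever the command runs, not from an end-user device; the URL is a placeholder):

$ curl -o /dev/null -s -w 'TTFB: %{time_starttransfer}s\n' https://endpoint.example.com/1byte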

Some quick observations

  • If your audience is regional, CDNs may not bring you an advantage in terms of latency
  • Even without a CDN, Google benefits a lot from its private worldwide network

What is the future of this application?

It's still a beta/proof-of-concept, but it clearly fulfills our ambition: testing TTFB from anywhere. From this seed we already imagine many uses:

  • Smart integration directly on provider websites for live testing
  • Better data visualization with charts
  • Whole-dataset visualization to understand a geographical area's latency by provider and/or device
  • Bandwidth tests, with upload and download
  • Integration of our pricing data
  • And yes, changing the skin …

If you are a provider and would like your product integrated in this application, do not hesitate to contact us. In any case, we invite you to test it and give us feedback; we would love to see other insights.

dd is not a benchmarking tool

There is a widely held idea on the Internet that a written snippet will be universally valid and produce comparable results on any machine. Stated like that, the assertion is obviously false, but a piece of code that is valid in one context can travel a long way on the web and easily fool a good number of people. Benchmarks with dd are a good example. Which Unix nerd hasn't tested a brand new device with dd? The command outputs an accurate-looking value in MB/s, so what more could you want?

The problem starts with the benchmark's design

If I quickly open dd's user manual, or more simply its help text, I can read:

Copy a file, converting and formatting according to the operands.

If my goal is to benchmark a device, it already appears that this tool is not the most appropriate. Firstly, I don't aim to copy anything, just to read or write. Next, I don't want to work with files but with a block device. Finally, I don't need the advertised data-handling features. These three points really matter, because they show how far the tool is from the purpose.

Don't get me wrong, I'm not denigrating dd. It has personally saved me tons of hours with ISOs and disk migrations. But using it as a standard benchmark tool is more of a hack than a reliable idea.

The first issue: files

A major misconception in benchmarking lies in what I want to test and how I'll do it. Here, our goal is HDD/SSD performance, and going through a filesystem can introduce a big bias into the analysis. Here is the kind of command you can find on the Internet:

dd bs=1M count=1024 if=/dev/zero of=/root/test

For those not familiar with dd, the above command creates a 1 GB file containing only zeros in the root user's home: /root/test. The authors generally claim the goal is to measure the performance of the device where the file is stored; that goal is poorly reached. Storage performance is mainly affected by a stack of caches and buffers, from user space down to the blocks inside the SSD. The filesystem is the main entry point for users, but as it is software, it can hide the reality of your hardware, for better or for worse.

By default, dd writing through a filesystem uses an asynchronous method, meaning that if the written file is small enough to fit in RAM, the OS won't write it to the drive immediately and will wait for the most appropriate time to do so. In this configuration, the command's output does not represent the storage's performance at all, and as only volatile memory is involved, dd displays very good numbers.
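
If you still want a quick dd sanity check, GNU dd has flags to take the page cache out of the picture; a minimal sketch (still not a real benchmark, and the target path is only an example):

# Flush data to the device before reporting the timing
dd bs=1M count=1024 if=/dev/zero of=/root/test conv=fdatasync
# Or bypass the page cache entirely with direct I/O
dd bs=1M count=1024 if=/dev/zero of=/root/test oflag=direct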

At Cloud Mercato, as we want to reflect infrastructure performance, we bypass the filesystem as much as possible and test the device directly through its absolute path. From our benchmarks you therefore know your hardware's possibilities and can build on them with the filesystem of your choice. There are only a few cases where files are involved, such as testing a root volume in write mode: you must not write to your root device directly or you will erase its OS.

Second issue: a tool without data generation

dd is designed around the concept of copying, as is quite well explained by its long name, "Data Duplicator". Fortunately, in Unix everything is a file, and kernels provide pseudo-files that generate data:

  • /dev/zero
  • /dev/random
  • /dev/urandom

Under the hood, these pseudo-files are real software and suffer from it. /dev/zero is CPU bound, and because it only produces zeros it cannot represent a real workload. /dev/random is quite slow due to its high-quality randomness, and /dev/urandom is too intensive in terms of CPU cycles.

Basically, you may not reach the storage's maximum performance because you are limited by the CPU. Moreover, dd isn't multi-threaded, so only one thread at a time can stress the device, further decreasing the chances of reaching the peak.
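
You can measure the generator bottleneck on its own by copying to /dev/null, which removes storage from the equation entirely (sizes are illustrative):

# Throughput of the data sources alone, no storage involved
dd if=/dev/zero    of=/dev/null bs=1M count=4096
dd if=/dev/urandom of=/dev/null bs=1M count=4096
# If the /dev/urandom figure is lower than your device's write speed,
# a dd "benchmark" fed by it measures the generator, not the disk.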

Third issue: a lack of features

As said, dd is not a benchmarking tool. If you look at the open-source catalog of storage testing tools and their common features, dd, not being intended for this purpose, is out of the competition:

  • Single thread only
  • No optimized data generation
  • No access mode: Sequential or random
  • No deep control such as I/O depth
  • Only average bandwidth, no IOPS, latency or percentiles
  • No mixed patterns: read/write
  • No time control

This short list speaks for itself: Data Duplicator doesn't provide the features required of a performance testing tool.

The solution, then

Here are real benchmark tools that you can use:

FIO is really our daily tool (if not hourly); it brings possibilities unimaginable with dd, such as I/O depth or random access. vdbench is also very handy: with a similar concept to FIO, you can create complex scenarios, for instance involving multiple files in read/write access.
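
As a starting point, here is a minimal fio sketch for a random-read test against a raw device; the device path, queue depth and runtime are examples to adapt, and the warning above about not writing to a device that holds data still applies:

fio --name=randread --filename=/dev/nvme1n1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 \
    --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting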

In conclusion, a benchmark is not just a series of commands run in a shell. The tests executed and the expected output really depend on the context: What do you want to test? Which component should be involved? Why will this value represent something? Any snippet taken from the Internet may have its value in a certain environment and be misleading in another. It's up to the tester to understand these factors and choose the tool appropriate to their purpose.