Why you should use sysbench

Among all the benchmark tools available online, sysbench is one of my favorites. If I had to describe it in two words, they would be: simplicity and reliability. At Cloud Mercato we mainly focus on three of its components:

  • CPU: Simple CPU benchmark
  • Memory: Memory access benchmark
  • OLTP: Collection of OLTP-like database benchmarks

On top of that, sysbench can be considered an agnostic benchmark tool: like wrk, it embeds a LuaJIT interpreter that lets you plug in the task of your choice and obtain rates, maximums, percentiles and more. But today, let’s focus on the CPU benchmark, a.k.a. prime number search.

Run on the command line

sysbench --threads=$cpu_number --time=$time cpu --cpu-max-prime=$max_prime run

For example, with 8 threads, a duration of 30 seconds and a prime limit of 64,000 (the run shown below):

sysbench --threads=8 --time=30 cpu --cpu-max-prime=64000 run

The output of this command is nothing intimidating:

sysbench 1.1.0 (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 8
Initializing random number generator from current time


Prime numbers limit: 64000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:   707.94

Throughput:
    events/s (eps):                      707.9361
    time elapsed:                        30.0112s
    total number of events:              21246

Latency (ms):
         min:                                   11.25
         avg:                                   11.30
         max:                                   19.08
         95th percentile:                       11.24
         sum:                               240041.14

Threads fairness:
    events (avg/stddev):           2655.7500/1.79
    execution time (avg/stddev):   30.0051/0.00

For the laziest of you, the most interesting value here is “events per second”, i.e. how many times per second the prime-search loop completes. Aside from that, the output is quite human-readable and very easy to parse. It contains the base statistics required for a benchmark, both in terms of context and of mathematical aggregations.
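Because the report is plain text, a few lines are enough to pull the figures out programmatically. A minimal sketch in Python (the regex is mine and targets the 1.1.x layout shown above; the sample string is a fragment of that output):

```python
import re

# Fragment of the sysbench 1.1.x report shown above
report = """\
CPU speed:
    events per second:   707.94

Latency (ms):
         min:                                   11.25
         avg:                                   11.30
         max:                                   19.08
         95th percentile:                       11.24
"""

def parse_sysbench(text: str) -> dict:
    """Collect every 'name: number' pair from a sysbench text report."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w /%()]+?):\s+([\d.]+)\s*$", line)
        if m:
            stats[m.group(1)] = float(m.group(2))
    return stats

stats = parse_sysbench(report)
print(stats["events per second"])  # 707.94
```

The same function works on a full report: lines without a trailing number, such as section headers, are simply ignored.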

But what did I do?

“Never run a command if you don’t know what it’s supposed to do.” (Napoléon Ier)

That being said, since sysbench is open source and available on GitHub, even with a beginner level in C it’s not hard to understand. The code sample below is the core of what is timed during the test:

int cpu_execute_event(sb_event_t *r, int thread_id)
{
  unsigned long long c;
  unsigned long long l;
  double t;
  unsigned long long n=0;

  (void)thread_id; /* unused */
  (void)r; /* unused */

  /* So far we're using very simple test prime number tests in 64bit */

  for(c=3; c < max_prime; c++)
  {
    t = sqrt((double)c);
    for(l = 2; l <= t; l++)
      if (c % l == 0)
        break;
    if (l > t)
      n++;
  }

  return 0;
}

Source: GitHub akopytov/sysbench

Basically, if I translate the function into human words, it would be: loop over the numbers and check whether each one is divisible only by 1 and itself.
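Transposed almost line by line into Python, the logic reads like this (a sketch for illustration only; math.isqrt gives the same bound as the C code’s sqrt() without floating point):

```python
import math

def count_primes(max_prime: int) -> int:
    """Count the primes in [3, max_prime), exactly like sysbench's loop."""
    n = 0
    for c in range(3, max_prime):
        for l in range(2, math.isqrt(c) + 1):
            if c % l == 0:    # found a divisor: c is not prime
                break
        else:                 # no divisor up to sqrt(c): c is prime
            n += 1
    return n

print(count_primes(100))  # 24 (the 25 primes below 100, minus 2, which the loop skips)
```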

And yes, the depth of this benchmark is just these 15 lines: simple loops and some arithmetic. For me, this is clearly the strength of the tool: again, simplicity and reliability. sysbench cpu has crossed the ages, and as it hasn’t changed in more than 15 years, it allows you to compare old chips such as Intel Sandy Bridge against the latest ARM.

It is a well-coded prime-number test, and what I see is that there isn’t any of the complex machinery developers often set up to achieve their goals, such as thread cooperation, encryption or advanced mathematics. It just does a single task, answering a simple question: “How fast are my CPUs?”

The RAM involvement is almost nil, and processor features generally don’t improve performance for this kind of task; for a cloud benchmarker like me, this is essential to remove as much bias as possible. Where complex workloads give me an idea of overall system capacity, sysbench captures a raw strength that I correlate later with more complex results.

How do I use it?

[Chart: sysbench CPU, events per second vs. number of threads (1 to 64) on a T-Systems Open Telekom Cloud s3.8xlarge.1. Throughput scales linearly from ~89 events/s at 1 thread to ~2,825 events/s at 32 threads (the vCPU count), then plateaus at 48 and 64 threads.]

The only parameter entirely specific to sysbench cpu is max-prime: it defines the upper bound of the prime number search. The higher this value, the longer it takes to test all the candidate numbers.

Our methodology uses an upper limit of 64,000 and scales the thread count from 1 up to 2x the number of CPUs present on the machine. It produces data like the chart above, where we can see that OTC’s s3.8xlarge.1 scales at 100% up to 32 threads, which is its physical limit.


We use sysbench almost every day


New CPU architecture at Oracle Cloud: VM.Standard.A1.Flex

Ampere X Oracle

A decade ago, no one was betting on the ARM architecture for the server side. It was, and still is, well established in embedded and mobile systems, but using it for any layer of a server application seemed an odd idea. We only saw it for specific requirements, such as testing ARM applications with providers like Travis CI. From my perspective, the first attempts at ARM in the cloud market were:

  • Travis CI, for the sake of Android and iOS developers
  • Scaleway, with a series of ARM VMs from 2 to 64 CPUs
  • Packet (now Equinix Metal), with the former bare-metal c1.large.arm and x.large.arm

The race is now fully launched and ARM is clearly a new category of products: we can already see several generations of Graviton at AWS and, of course, Oracle’s new A1, fruit of a collaboration between Ampere Computing and the hyperscaler.

What’s under the hood?

Like Oracle’s other VM offerings, A1 benefits from a flexible design: you can choose exactly the amount of CPU and RAM for your instances. From 1 to 80 oCPU and up to 1 TB of RAM, you can shape instances that exactly match your requirements, without having to choose between what other providers call compute-optimized or memory-optimized.

In Oracle’s previous x86 series, the cloud defined its VMs in oCPU, the equivalent of a physical core. With AMD and Intel, one core carries two hyperthreads, generally sold as 2 vCPUs in a virtual machine. Oracle kept the oCPU terminology for the new A1, and 1 oCPU still represents 1 core, but as this hardware has only one thread per core, 1 A1 oCPU equals 1 thread. This may look like a loss for the consumer; the performance analysis in the next section addresses that.
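The arithmetic behind that comparison fits in a couple of lines; a sketch (the function and names are mine, not Oracle terminology):

```python
def vcpus(ocpu: int, threads_per_core: int) -> int:
    """1 oCPU = 1 physical core; vCPUs = cores x hardware threads per core."""
    return ocpu * threads_per_core

print(vcpus(2, 2))  # x86 E3/E4: 2 oCPU with 2 hyperthreads per core -> 4 vCPUs
print(vcpus(2, 1))  # A1: 2 oCPU, single-threaded cores -> 2 vCPUs
```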

Performance/price analysis

Oracle Cloud is known for very aggressive pricing, and by using ARM technologies they can expect a performance boost and, above all, lower infrastructure costs, hence a better price for their customers. To measure this benefit we tested the 3 latest generations of Oracle VMs: E3, E4 and A1, with the following machines:

Name         oCPU  vCPU  RAM (GB)  Price (USD/hr)
E3.Flex.2-8  2     4     8         0.062
E4.Flex.2-8  2     4     8         0.062
A1.Flex.2-8  2     2     8         0.022
A1.Flex.4-8  4     4     8         0.032

We ran a bunch of tests to compare performance between series, and also to analyse the scaling from 2 A1 oCPU to 4. The goal is to answer: “Should I go to A1?” and “Should I increase the number of oCPU?”

The graphs below show the performance measured by sysbench cpu, and the price/performance ratio calculated from the hourly price:

[Chart: Performance, in events per second (higher is better): A1.Flex.2-8: 495.9 / A1.Flex.4-8: 1,006.9 / E3.Flex.2-8: 248.5 / E4.Flex.2-8: 630.8]
[Chart: Price/Performance (higher is better): A1.Flex.2-8: 30,877 / A1.Flex.4-8: 43,104 / E3.Flex.2-8: 5,490 / E4.Flex.2-8: 13,938]

What do we see?

  • Drop your E3: at the same price as E4 but with lower performance, it has lost its interest.
  • With the same oCPU count, E4 performs better than A1, but at almost 3x the price.
  • With the same vCPU count, A1 is clearly far ahead in terms of both performance and price.
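For reference, the values in the price/performance chart can be reproduced from the performance figures and the hourly prices. My reading of the data is events per second per 1,000 USD of monthly cost, assuming 730 hours per month (an inferred formula, not an official one):

```python
def price_perf(events_per_sec: float, hourly_price: float,
               hours_per_month: int = 730) -> float:
    """Events/s obtained per 1,000 currency units of monthly spend."""
    return events_per_sec / (hourly_price * hours_per_month) * 1000

print(round(price_perf(495.885, 0.022), 2))  # A1.Flex.2-8 -> 30877.02
print(round(price_perf(248.498, 0.062), 2))  # E3.Flex.2-8 -> 5490.46
```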

The main obstacle to A1 adoption is of course its CPU architecture. While you know any software will be available on x86, under aarch64 you may encounter packages that you must compile yourself, or that are simply unavailable. But keep in mind that general-purpose software such as programming languages and databases is already compiled and packaged.


Of course we ran more tests:


Amazon Web Services : M5 vs M5a vs M6g

AWS M5 / M5a / M6g Benchmark

In other words: Intel vs AMD vs ARM. AWS recently released the Graviton series for all their main instance types: R6g with extended memory, C6g for compute-optimized and M6g for general purpose. Their offering has always been based on Intel, but in the past years we saw AMD arrive, and now, with Graviton 2, AWS relies on its own chips.

Amazon Web Services announces their Graviton processors as a new choice for customers to increase performance at a lower cost. But what’s the difference between all these solutions, identical on paper? Let’s run CPU benchmarks to answer that.

Specification

For our benchmark, we took an 8-CPU / 32 GB VM from each series:

Product      Price (USD/hr)  CPU                                        Frequency
m5.2xlarge   0.38            Intel Xeon Platinum 8175M or 8259CL        2.5-3.2GHz
m5a.2xlarge  0.34            AMD EPYC 7571                              2.4-3.0GHz
m6g.2xlarge  0.31            aarch64 (AWS Graviton 2)                   N/A

These data were collected by our Cloud Transparency Platform; prices are for the us-east-1 (N. Virginia) region.

Performance testing

Prime number search with sysbench CPU

Sysbench CPU can be categorized as arithmetic operations on integers.

Between M5 and M6g, we observe an increase of +100% in single thread, and close to +400% with 8 threads.

Encryption with AES-256 CBC

Where AMD’s performance depends on the block size, Intel and Graviton are homogeneous across sizes. The ARM chip is able to encrypt at 1.2 GB/s, where the M5 and M5a cap at 400 MB/s and 900 MB/s respectively (i.e. the Graviton is roughly 3x and 1.3x faster).

Price

Product      Hourly  Monthly (est.)  Yearly (est.)  Discount
m5.2xlarge   0.38    280             3,360          -
m5a.2xlarge  0.34    251             3,012          -11%
m6g.2xlarge  0.31    224             2,688          -22%

Monthly estimates are based on 730 hours, yearly estimates on 8,760 hours, without any long-term subscription.
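The estimates are straightforward to reproduce; a sketch (exact results differ slightly from the table because the hourly prices shown above are rounded):

```python
def estimate(hourly: float) -> tuple:
    """Monthly (730 h) and yearly (8,760 h) cost, without commitment."""
    return hourly * 730, hourly * 8760

def discount(hourly: float, baseline_hourly: float) -> float:
    """Percentage saved versus a baseline product, e.g. m5.2xlarge."""
    return (1 - hourly / baseline_hourly) * 100

monthly, yearly = estimate(0.31)       # m6g.2xlarge
print(round(monthly), round(yearly))   # 226 2716
print(round(discount(0.31, 0.38)))     # 18
```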

Prices leave no doubt: each new generation offers a lower cost, and M6g has the lowest.

Conclusion

Depending on your workload, Graviton offers up to +400% performance compared to its Intel analogue. Combined with lower pricing, M6g is definitively the best EC2 choice for any CPU-bound workload compatible with the ARM architecture.


Check out data in our Public Cloud Reference


AWS and the volume equation

Despite being one of the most used block storage solutions worldwide, Amazon’s General Purpose SSD is far from being a general and versatile solution. Unlike other providers, which sell volumes based on device type and an hourly price per gigabyte, AWS chose to create products adapted to usage.

EBS: The Block Storage solutions

Behind the name Elastic Block Store, 5 storage classes are available:

  • Magnetic: the historical Block Storage solution provided by AWS. As its name indicates, this product is backed by spinning disks, making it inherently slow: 200 IOPS and 100 MB/s. But at the end of the 2000s, it wasn’t the low-cost option, it was the general purpose one.
  • Throughput Optimized: dedicated to large chunk processing, this product aims at optimal throughput. Still HDD-based, but efficient for big data or log processing.
  • Cold HDD: in the same branch as Throughput Optimized, but with lower price and performance. Useful for less frequently accessed data, such as a cached data store.
  • General Purpose SSD: the common volume type used by consumers, and it shouldn’t be taken as a standard SSD Block Storage. Firstly, GP-SSD is capped at 16 KIOPS, which is quite low for an intensive workload. Secondly, its maximum performance is constrained by a credit system that doesn’t let you benefit from peak performance permanently. Both points make GP-SSD more appropriate for non-intensive workloads that don’t sustain a permanent load.
  • Provisioned IOPS SSD: the answer to General Purpose’s variable performance. This product lets users define and pay for a guaranteed amount of IOPS, up to 64 KIOPS. It makes storage-bound workloads possible, but at the high price of $0.065 per provisioned IOPS.
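At that rate, the provisioned IOPS quickly dominate the bill. A back-of-the-envelope sketch (assuming the $0.065 is per provisioned IOPS per month, and ignoring the per-GB storage charge):

```python
IOPS_PRICE = 0.065  # USD per provisioned IOPS per month, from the text above

def monthly_iops_cost(provisioned_iops: int) -> float:
    """Monthly cost of the IOPS alone, storage capacity not included."""
    return provisioned_iops * IOPS_PRICE

print(monthly_iops_cost(16_000))  # 1040.0 USD, at GP-SSD's 16 KIOPS cap
print(monthly_iops_cost(64_000))  # 4160.0 USD, at the 64 KIOPS maximum
```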

Local storage

Block Storage isn’t the only solution provided by Amazon Web Services: since the I3 series, local NVMe SSDs are available for high-IOPS workloads. Let’s compare two solutions similar on paper: i3.large vs r5.large + 500 GB GP-SSD.

Flavor    CPU  RAM (GB)  Storage                      Monthly price
i3.large  2    16        475 GB local NVMe SSD        $135
r5.large  2    16        500 GB General Purpose SSD   $168.4

As the table and the chart show, for an equivalent solution in terms of basic specifications, the i3 is much more worth opting for. Moreover, the NVMe devices are attached locally to I3 VMs, without going through block storage, creating a real gap in terms of IOPS and latency:

Features matter a lot

Comparing block versus local storage is inappropriate without taking features into account. In fact, despite its generally lower performance, block storage is a key component of the cloud’s flexibility and reliability. Where a local device may focus on latency, block storage is attractive for all its features, such as snapshots/backups, replication, availability and more.
Here is a small comparative table outlining general pros and cons:

             Block             Local
Latency      Low to high       Very low
IOPS         Low to high       High to very high
Replication  Yes               No
SLA          Yes               No
Price        Low to very high  Included with instance
Size         Up to 16 TB+      Fixed at instance startup
Persistence  Unlimited         Instance lifespan
Hot-plug     Yes               No

We clearly see two usage profiles: non-guaranteed high performance on one side, and flexibility on the other.