Why you should use sysbench

Among all the benchmark tools available online, sysbench is one of my favorites. If I had to sum it up in two words, they would be: simplicity and reliability. At Cloud Mercato we mainly focus on three of its parts:

  • CPU: Simple CPU benchmark
  • Memory: Memory access benchmark
  • OLTP: Collection of OLTP-like database benchmarks

On top of that, sysbench can be considered an agnostic benchmark tool: like wrk, it embeds a LuaJIT interpreter, letting you plug in the task of your choice and obtain rates, maxima, percentiles and more. But today, let's just focus on the CPU benchmark, aka the prime number search.

Run it on the command line

sysbench --threads=$cpu_number --time=$time cpu --cpu-max-prime=$max_prime run

The output of this command is nothing frightening:

sysbench 1.1.0 (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 8
Initializing random number generator from current time


Prime numbers limit: 64000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:   707.94

Throughput:
    events/s (eps):                      707.9361
    time elapsed:                        30.0112s
    total number of events:              21246

Latency (ms):
         min:                                   11.25
         avg:                                   11.30
         max:                                   19.08
         95th percentile:                       11.24
         sum:                               240041.14

Threads fairness:
    events (avg/stddev):           2655.7500/1.79
    execution time (avg/stddev):   30.0051/0.00

For the laziest of you, the most interesting value here is "events per second", i.e. the number of prime-number searches completed per second. But beyond that, the output is quite human-readable and very easy to parse. It contains the basic stats required for a benchmark, in terms of both context and mathematical aggregations.
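To back up the "easy to parse" claim, here is a minimal Python sketch; it assumes the run's text has already been captured into a string named output (a hypothetical variable):

import re

# Hypothetical: `output` holds the text of a sysbench run as shown above.
eps = float(re.search(r"events per second:\s+([\d.]+)", output).group(1))
p95 = float(re.search(r"95th percentile:\s+([\d.]+)", output).group(1))
print("events/s:", eps, "- 95th percentile (ms):", p95)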

But what did I do?

"Never run a command if you don't know what it's supposed to do." (Napoleon I)

That being said, as sysbench is open source and available on GitHub, even a beginner's level in C is enough to understand it. The code sample below is the core of what is timed during the test:

int cpu_execute_event(sb_event_t *r, int thread_id)
{
  unsigned long long c;
  unsigned long long l;
  double t;
  unsigned long long n=0;

  (void)thread_id; /* unused */
  (void)r; /* unused */

  /* So far we're using very simple test prime number tests in 64bit */

  for(c=3; c < max_prime; c++)
  {
    t = sqrt((double)c);
    for(l = 2; l <= t; l++)
      if (c % l == 0)
        break;
    if (l > t)
      n++;
  }

  return 0;
}

Source: GitHub akopytov/sysbench

Basically, if I translate the function into plain words, it would be: loop over numbers and count those divisible only by 1 and themselves.

And yes, the depth of this benchmark is just these 15 lines: simple loops and some arithmetic. For me, this is clearly the strength of this tool: again, simplicity and reliability. sysbench cpu has crossed the ages, and as it hasn't changed in more than 15 years, it lets you compare old chips such as Intel Sandy Bridge against the latest ARM.

It is a well-coded prime-number test, and what strikes me is that there is none of the complex machinery developers often set up to reach their goals, such as thread cooperation, encryption or advanced mathematics. It just performs a single task answering a simple question: "How fast are my CPUs?"

RAM involvement is almost nil, and processor features generally don't improve performance for this kind of task; for a cloud benchmarker like me, this is essential to remove as much bias as possible. Where complex workloads give me performance data and an idea of a system's overall capacity, sysbench captures a raw strength that I correlate later with more complex results.

How do I use it?

[Chart: sysbench CPU, events per second versus number of threads (1 to 64) on a T-Systems Open Telekom Cloud s3.8xlarge.1. Throughput grows linearly from ~88.5 events/s at 1 thread to ~2825 events/s at 32 threads, the instance's vCPU count, then stays flat up to 64 threads.]

The only parameter entirely specific to sysbench cpu is max-prime: it sets the upper bound of the prime-number search. The higher this value, the longer it takes to test all the candidates below it.

Our methodology sets this upper limit to 64000 and scales the thread count from 1 to 2x the number of CPUs present on the machine. It produces data like the chart above, where we can see that OTC's s3.8xlarge.1 scales at 100% up to 32 threads, its physical limit.
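For the curious, the data behind such a chart can be collected with a few lines of Python. This is only a sketch of the idea; the thread list and the 30-second duration are illustrative assumptions, not our exact harness:

import re
import subprocess

# Sweep thread counts from 1 up to 2x the machine's vCPUs (example list).
for threads in [1, 2, 4, 8, 16, 32, 64]:
    out = subprocess.run(
        ["sysbench", "--threads=%d" % threads, "--time=30",
         "cpu", "--cpu-max-prime=64000", "run"],
        capture_output=True, text=True, check=True,
    ).stdout
    eps = re.search(r"events per second:\s+([\d.]+)", out).group(1)
    print("%2d threads: %s events/s" % (threads, eps))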


We use sysbench almost every day


Understand Object Storage by its performance

Nowadays, anyone who wants to smartly store cool or cold data will be guided toward an Object Storage solution. This cloud model has replaced a lot of usages such as our old FTP servers, backup storage or static website hosting. The keywords here are "low price", "scalability", "unlimited". But as we can observe with compute, not all Object Storage offerings are equal, first in terms of price, then in performance.

What qualifies Object Storage performance?

Latency

Depending on your architecture, latency can be a key factor for workloads involving small blobs. A common example is static website hosting: the average file size won't exceed 1MB, so you may expect files to be received by clients almost instantly.

Keep in mind that (generally) an Object Storage is a single point of service, so for inter-continental connections it's recommended to pair it with a CDN. The table below shows worldwide averages for time-to-first-byte toward storage services:

              Africa   Asia    China   Europe  N. America  Pacific  S. America
Asia          1.393    1.264   1.065   0.812   0.899       1.233    1.272
Europe        0.874    0.820   0.957   0.214   0.490       0.996    0.768
N. America    1.343    0.934   1.164   0.635   0.325       0.870    0.652
Pacific       2.534    1.094   1.117   1.763   1.161       0.760    1.570

TTFB in seconds
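To collect this kind of number yourself, a single-object probe is enough. The sketch below uses pycurl and reads libcurl's start-transfer timer; the bucket URL is a placeholder:

from io import BytesIO

import pycurl

URL = "https://my-bucket.example-object-storage.com/small-object"  # placeholder

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, URL)
c.setopt(pycurl.WRITEDATA, buf)  # body goes to a throwaway buffer
c.perform()
print("TTFB: %.3fs" % c.getinfo(pycurl.STARTTRANSFER_TIME))
c.close()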

Bandwidth

If you work with large objects, bandwidth is a more interesting metric. It is especially visible in Big Data architectures: thanks to their low storage costs, Object Storage services are well suited to storing huge datasets, but between remote and local storage, network bandwidth is the main bottleneck.
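Measuring it can be as simple as downloading a large object and letting libcurl report the average speed. A sketch with pycurl, where the object URL is again a placeholder:

import os

import pycurl

URL = "https://my-bucket.example-object-storage.com/large-object"  # placeholder

with open(os.devnull, "wb") as sink:
    c = pycurl.Curl()
    c.setopt(pycurl.URL, URL)
    c.setopt(pycurl.WRITEDATA, sink)  # don't keep the payload
    c.perform()
    mbits = c.getinfo(pycurl.SPEED_DOWNLOAD) * 8 / 1e6  # bytes/s -> Mbit/s
    c.close()
print("Average download: %.1f Mbit/s" % mbits)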

As with latency, the factor is twofold: both client and server networks count, and at this game clouds aren't equal. Server-side bandwidth can be throttled at different layers:

  • Per connection: a maximum bandwidth is set for each incoming request
  • At the bucket layer: each bucket is limited
  • For the whole service: the limitation is global to the tenant or to each deployed Object Storage service

Bucket scalability

While Object Storage often appears as a simple filesystem exposed over HTTP, under the hood many technical constraints appear on the cloud provider's side. Buckets are presented as more or less unlimited flat blob containers, but several factors can make your performance vary:

  • The total number of objects in your bucket
  • The total size of the objects in your bucket
  • The names of your objects, especially their prefixes (a common workaround is sketched below)
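The prefix point deserves a note: on several platforms, keys sharing a long common prefix can end up on the same internal partition. A classic workaround, shown here with a hypothetical helper, is to spread keys with a short hash prefix:

import hashlib

def spread_key(name):
    # Prepend a short hash so similar names don't share a prefix.
    prefix = hashlib.md5(name.encode()).hexdigest()[:4]
    return "%s/%s" % (prefix, name)

print(spread_key("2020/01/01/backup.tar.gz"))
# -> something like "e0a3/2020/01/01/backup.tar.gz"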

Burst handling

Something never presented on landing pages is the capacity to handle a high load of connections. Here again, the market isn't homogeneous: some vendors sustain bursts worthy of a DDoS, while others will see their performance degrade or simply return HTTP 429 Too Many Requests.

The solution may be simply to balance the load across services/buckets, or to use a CDN, which is more appropriate for intensive HTTP workloads.
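On the client side, the usual defensive move against 429s is a retry with exponential backoff. A minimal sketch, assuming the Python requests library and a plain GET:

import time

import requests

def get_with_backoff(url, retries=5):
    # Retry on HTTP 429 Too Many Requests, doubling the wait each time.
    for attempt in range(retries):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    return resp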

Conclusion

There's no rule of thumb to establish whether an Object Storage service performs well from its specification alone. Even if providers use standard software such as Ceph, the hardware and configuration create a unique solution with its own constraints and advantages. That's why performance testing is always required to understand a product's profile.

New HTTP benchmark tool: pycurlb

Many tools exist around the globe to get performance data about an HTTP connection, but we can consider them stress tools: they focus on launching a number of requests and output statistical aggregations of latency, throughput and rate. ApacheBench (ab), wrk, httperf: we regularly use these tools for what they bring, but they also impose a methodology which isn't adapted to some of our goals. We were looking for:

  • Run only a single request: we wanted the ability to test a link in an idle state, or while it is being burst by another stress tool
  • Get the other TCP timings: DNS, SSL handshake and more have to be known

In the past, to fill these requirements, I used the well-known command-line tool "client URL Request Library", also known as cURL. This software is considered the Swiss Army knife of HTTP clients, and although it supports many more protocols, most people use it just to download a file or talk to REST APIs. But if you dig under the surface, curl is actually just a user interface for its powerful library, libcurl: where cURL provides more than 50 command options, libcurl lets you forge any kind of HTTP request.

For debugging purposes, curl has an option named --write-out that lets users export data about the connection and the response. Here's an example:

$ curl --write-out '%{time_total}' https://www.cloud-mercato.com/ -o /dev/null -s
0.230652

The command above reaches our goal, but first there are options we will always need: -o and -s, because we don't care about curl's regular output. Moreover, the desired output is actually difficult to shape from the command line: for a complex format like JSON, you have to create a template file or fight with character escaping. To ease our work, we decided to create a tool based on libcurl and designed for exactly two tasks: run a single HTTP request and report connection information. This is how pycurlb was born.

pycurlb, short for Python cURL Benchmark, is based on pycurl, a Python wrapper around libcurl. It is a very simple command-line tool mimicking curl's behavior but outputting JSON with a lot of information available. The command similar to the one presented above would be:

$ pycurlb https://www.cloud-mercato.com/
{
  "appconnect_time": 5.673696,
  "compressed": false,
  "connect_time": 5.581115,
  "connect_timeout": 300,
  "content_length_download": 219.0,
  "content_length_upload": -1.0,
  "content_type": "text/html; charset=UTF-8",
  "effective_url": "https://www.cloud-mercato.com/",
  "header_size": 516,
  "http_code": 200,
  "http_connectcode": 0,
  "httpauth_avail": 0,
  "local_ip": "10.0.0.1",
  "local_port": 34740,
  "max_time": 0,
  "method": "GET",
  "namelookup_time": 5.520988,
  "num_connects": 1,
  "os_errno": 0,
  "pretransfer_time": 5.673749,
  "primary_ip": "1.2.3.4",
  "primary_port": 443,
  "proxyauth_avail": 0,
  "redirect_count": 0,
  "redirect_time": 0.0,
  "redirect_url": "https://www.cloud-mercato.com/",
  "request_size": 190,
  "size_download": 219.0,
  "size_upload": 0.0,
  "speed_download": 38.0,
  "speed_upload": 0.0,
  "ssl_engines": [
    "rdrand",
    "dynamic"
  ],
  "ssl_verifyresult": 0,
  "starttransfer_time": 5.74181,
  "total_time": 5.741879
}

Easy and useful. Let's see which helpful metrics we have:

  • namelookup_time: time to resolve DNS
  • connect_time: time until the TCP connection is established
  • appconnect_time: time until HTTP communication can start, i.e. after the SSL handshake
  • pretransfer_time: time just before the transfer starts
  • starttransfer_time: time when the first byte is received
  • total_time: total request/response time
  • speed_download and speed_upload: throughput

We see here that the value called latency in other benchmark tools is split into six items, each describing a stage of an applicative TCP/IP connection. Detail is the key word here, and we tried to stay fully compatible with the original curl, keeping the same command-line arguments, so even advanced scenarios such as header inclusion should be possible.
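Note that these timers are cumulative, all measured from the start of the request, so the duration of each stage is the difference between consecutive fields. A sketch that derives them from pycurlb's JSON output:

import json
import subprocess

info = json.loads(subprocess.check_output(
    ["pycurlb", "https://www.cloud-mercato.com/"]))

print("DNS lookup:    %.3fs" % info["namelookup_time"])
print("TCP connect:   %.3fs" % (info["connect_time"] - info["namelookup_time"]))
print("SSL handshake: %.3fs" % (info["appconnect_time"] - info["connect_time"]))
print("TTFB:          %.3fs" % (info["starttransfer_time"] - info["pretransfer_time"]))
print("Transfer:      %.3fs" % (info["total_time"] - info["starttransfer_time"]))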

This software is an open-source project hosted on GitHub. Feel free to use it, contribute or open issues; you are very welcome.