Why you should use sysbench

Among all the benchmark tools available online, sysbench is one of my favorites. If I had to sum it up in two words, they would be simplicity and reliability. At Cloud Mercato we mainly focus on 3 of its parts:

  • CPU: Simple CPU benchmark
  • Memory: Memory access benchmark
  • OLTP: Collection of OLTP-like database benchmarks

On top of that, sysbench can be considered an agnostic benchmark tool: like wrk, it integrates a LuaJIT interpreter that lets you plug in the task of your choice and obtain rate, maximum, percentiles and more. But today, let’s just focus on the CPU benchmark, aka the prime number search.

Run on command line

sysbench --threads=$cpu_number --time=$time cpu --cpu-max-prime=$max_prime run

The output of this command is not terrifying:

sysbench 1.1.0 (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 8
Initializing random number generator from current time

Prime numbers limit: 64000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:   707.94

    events/s (eps):                      707.9361
    time elapsed:                        30.0112s
    total number of events:              21246

Latency (ms):
         min:                                   11.25
         avg:                                   11.30
         max:                                   19.08
         95th percentile:                       11.24
         sum:                               240041.14

Threads fairness:
    events (avg/stddev):           2655.7500/1.79
    execution time (avg/stddev):   30.0051/0.00

For the laziest of you, the most interesting value here is “events per second”, where one event is a complete pass of the prime number search up to the configured limit. Beyond that, the output is pretty human-readable and very easy to parse. It contains the base stats required for a benchmark in terms of context and mathematical aggregations.
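To illustrate how easy the output is to parse, here is a minimal Python sketch that extracts a few fields with regular expressions. The function name `parse_sysbench_cpu` and the dictionary keys are my own choices for illustration, and the sample text is a shortened copy of the output shown above.

```python
import re

# Shortened copy of the sysbench output shown above.
SAMPLE_OUTPUT = """\
CPU speed:
    events per second:   707.94

Latency (ms):
         min:                                   11.25
         avg:                                   11.30
         max:                                   19.08
         95th percentile:                       11.24
         sum:                               240041.14
"""

# Hypothetical key names mapped to the labels printed by sysbench.
PATTERNS = {
    "events_per_second": r"events per second:\s*([\d.]+)",
    "latency_min_ms": r"^\s*min:\s*([\d.]+)",
    "latency_avg_ms": r"^\s*avg:\s*([\d.]+)",
    "latency_max_ms": r"^\s*max:\s*([\d.]+)",
    "latency_p95_ms": r"95th percentile:\s*([\d.]+)",
}


def parse_sysbench_cpu(output: str) -> dict:
    """Return the numeric fields found in `sysbench cpu` text output."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, output, flags=re.MULTILINE)
        if match:
            result[key] = float(match.group(1))
    return result


print(parse_sysbench_cpu(SAMPLE_OUTPUT))
```

Fields that are missing from the text are simply left out of the result, so the same function can be fed partial output.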

But what did I do?

Napoleon Ier
Never run a command if you don’t know what it’s supposed to do

That being said, as sysbench is open-source and available on GitHub, with a beginner level in C, it’s not hard to understand it. The code sample below represents the core of what is timed during the testing:

int cpu_execute_event(sb_event_t *r, int thread_id)
{
  unsigned long long c;
  unsigned long long l;
  double t;
  unsigned long long n=0;

  (void)thread_id; /* unused */
  (void)r; /* unused */

  /* So far we're using very simple test prime number tests in 64bit */

  for(c=3; c < max_prime; c++)
  {
    t = sqrt((double)c);
    /* Trial division: stop at the first divisor found */
    for(l = 2; l <= t; l++)
      if (c % l == 0)
        break;
    /* No divisor up to sqrt(c): c is prime */
    if (l > t )
      n++;
  }

  return 0;
}

Source: GitHub akopytov/sysbench

Basically, if I translate the function into human words, it would be: loop over numbers and check whether each one is divisible only by 1 and itself.
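For readers more at ease with Python than C, the same logic can be sketched as follows. This is my own translation for illustration, not code from sysbench, and the name `count_primes_below` is hypothetical.

```python
import math


def count_primes_below(max_prime: int) -> int:
    """Count primes c with 3 <= c < max_prime using trial division."""
    count = 0
    for c in range(3, max_prime):
        t = math.sqrt(c)
        l = 2
        # Try every candidate divisor up to sqrt(c), stop at the first hit.
        while l <= t:
            if c % l == 0:
                break
            l += 1
        # The loop ran to the end without a divisor: c is prime.
        if l > t:
            count += 1
    return count


print(count_primes_below(20))
```

With a limit of 20 it counts 7 primes (3, 5, 7, 11, 13, 17 and 19). As in the C version, the loop starts at 3, so 2 is never counted.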

And yes, the depth of this benchmark is just these few lines: simple loops and some arithmetic. For me, this is clearly the strength of the tool: again, simplicity and reliability. sysbench cpu has crossed the ages and, as it hasn’t changed in more than 15 years, it lets you compare old chips such as Intel Sandy Bridge with the latest ARM.

It is a well-coded prime number test, but what I notice is that it contains none of the complex things developers often set up to achieve their goals, such as thread cooperation, encryption or advanced mathematics. It performs a single task that answers a simple question: “How fast are my CPUs?”

RAM involvement is almost null and processor features generally don’t improve performance for this kind of task. For a cloud benchmarker like me, this is essential to remove as much bias as possible. Where complex workloads give me performance data and an idea of the system’s capacity, sysbench captures a raw strength that I later correlate with those more complex results.

How do I use it?

[Chart: spline of sysbench CPU results for T-Systems Open Telekom Cloud s3.8xlarge.1. The Y axis is “Number per second”; a vertical marker on the X axis is labelled “vCPU number”. Plotted values, in order: 88.528, 176.994, 265.469, 353.960, 530.606, 707.815, 1060.856, 1414.078, 2120.651, 2824.994, 2825.998, 2826.035]

The only parameter entirely specific to sysbench cpu is max-prime: it defines the upper limit of the prime number search. The higher this value, the longer it takes to find all the prime numbers.

Our methodology uses an upper limit of 64000 and scales the number of threads from 1 to 2x the number of CPUs present on the machine. It produces data like in the chart above, where we can see that OTC’s s3.8xlarge.1 scales at 100% up to 32 threads, which is its physical limit.
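The methodology can be sketched in Python as below. This is only an illustration under assumptions: the helper names are hypothetical, every thread count from 1 to 2x the CPU count is generated (a real campaign may test only some of them), and the figures in the usage example are invented to show the calculation, not measurements.

```python
from typing import Dict, List


def thread_counts(cpu_count: int) -> List[int]:
    """Thread counts to test, from 1 up to 2x the number of CPUs."""
    return list(range(1, 2 * cpu_count + 1))


def sysbench_command(threads: int, seconds: int, max_prime: int = 64000) -> List[str]:
    """Build the sysbench cpu command line shown earlier in this article."""
    return [
        "sysbench",
        f"--threads={threads}",
        f"--time={seconds}",
        "cpu",
        f"--cpu-max-prime={max_prime}",
        "run",
    ]


def scaling_efficiency(eps_by_threads: Dict[int, float]) -> Dict[int, float]:
    """Ratio between measured events/s and perfect linear scaling (1.0 = 100%)."""
    base = eps_by_threads[1]
    return {n: eps / (n * base) for n, eps in eps_by_threads.items()}


# Invented numbers: a machine that stops scaling after 2 threads.
print(scaling_efficiency({1: 100.0, 2: 200.0, 4: 200.0}))
print(" ".join(sysbench_command(threads=8, seconds=30)))
```

An efficiency of 1.0 means the machine scales perfectly at that thread count; once the physical limit is passed, the ratio drops because extra threads no longer add events per second.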

We use sysbench almost everyday

Understand Object storage by its performance

How to qualify Object Storage perf

Nowadays anyone who wants to smartly store cool or cold data will be guided to an Object Storage solution. This cloud model has replaced a lot of usages such as our old FTP servers, our backup storage or static website hosting. The keywords here are “Low price”, “Scalability”, “Unlimited”. But as we can observe with compute, not all Object Storage services are equal, first in terms of price, then in performance.

What qualifies Object Storage performance?


Depending on your architecture, latency can be a key factor for workloads involving small blobs. A common example is static website hosting: the average file size won’t exceed 1MB, so you may expect files to be received by clients almost instantly.

Keep in mind that (generally) an Object Storage is a single point of service, so for inter-continental connections, it’s recommended to pair it with a CDN. The table below describes the worldwide average for time-to-first-byte toward storage services:

             Africa   Asia  China  Europe  N. America  Pacific  S. America
Asia          1.393  1.264  1.065   0.812       0.899    1.233       1.272
Europe        0.874  0.820  0.957   0.214       0.490    0.996       0.768
N. America    1.343  0.934  1.164   0.635       0.325    0.870       0.652
Pacific       2.534  1.094  1.117   1.763       1.161    0.760       1.570

TTFB in seconds


If you work with large objects, bandwidth is a more interesting metric. This is especially visible in Big Data architectures: thanks to its low cost, Object Storage is well suited to storing huge datasets, but between remote and local storage, network bandwidth is the main bottleneck.

As with latency, two factors are at play: both client and server networks count, and at this game clouds aren’t equal. The server’s bandwidth can be throttled at different layers:

  • Per connection: a maximum bandwidth is set for each incoming request
  • At bucket layer: each bucket is limited
  • For a whole service: the limitation is global for the tenant or for each deployed Object Storage service

Bucket scalability

While Object Storage often appears as a simple filesystem available over HTTP, under the hood many technical constraints exist for the cloud provider. Buckets are presented as more or less unlimited flat blob containers, but several factors can make your performance vary:

  • The total number of objects in your bucket
  • The total size of objects in your bucket
  • The name of your objects, especially the prefix

Burst handling

Something never presented on the landing pages is the capacity to handle a high load of connections. Here again, the market isn’t homogeneous: some vendors withstand peaks worthy of a DDoS, others will show decreasing performance or simply return an HTTP 429 Too Many Requests.

The solution may be to simply balance loads across services/buckets or use a CDN service which is more appropriate for intensive HTTP workloads.


There’s no rule of thumb to establish from its specification whether an Object Storage has good performance. Even if providers use standard software such as Ceph, the hardware and configuration create a unique solution with its own constraints and advantages. That’s why performance testing is always required to understand the product profile.

New HTTP benchmark tool: pycurlb

Many tools exist around the globe to get performance data for an HTTP connection, but we can consider them stress tools: they focus on launching a number of requests and outputting statistical aggregations of latency, throughput and rate. ApacheBench (ab), wrk, httperf: we regularly use this software for what it brings, but it also imposes a methodology which isn’t adapted to some of our goals. We were looking for a way to:

  • Run only a single request: we wanted the opportunity to test a link in an idle state or while it is loaded by another stress tool
  • Get other TCP timings: DNS, SSL handshake and more have to be known

In the past, to meet these requirements, I used the well-known command line tool “client URL Request Library”, also known as cURL. This software is considered the Swiss Army knife of HTTP clients and, although it supports many more protocols, most people use it just to download a file or communicate with REST APIs. But if you dig under the surface, curl is actually just a user interface for its powerful library libcurl: where curl provides more than 50 command options, libcurl gives you the opportunity to forge any kind of HTTP request.

For debugging purposes, curl has an option named --write-out that lets users export data about the connection and response. Here’s an example:

$ curl --write-out '%{time_total}' https://www.cloud-mercato.com/ -o /dev/null -s

The command above reaches our goal, but first, there are options we will always use: -o and -s, because we don’t care about curl’s output. Moreover, the desired output is actually difficult to obtain from the command line: for a complex format like JSON, you have to create a template file or fight with character escaping. To ease our work, we decided to create a tool based on libcurl and designed precisely for two tasks: run a single HTTP request and report connection information. This is how pycurlb was born.

pycurlb, short for Python cURL Benchmark, is based on pycurl, a Python wrapper around libcurl. It is a very simple command line tool mimicking curl’s behavior but outputting JSON with a lot of information. The command equivalent to the one presented above would be:

$ pycurlb https://www.cloud-mercato.com/
{
  "appconnect_time": 5.673696,
  "compressed": false,
  "connect_time": 5.581115,
  "connect_timeout": 300,
  "content_length_download": 219.0,
  "content_length_upload": -1.0,
  "content_type": "text/html; charset=UTF-8",
  "effective_url": "https://www.cloud-mercato.com/",
  "header_size": 516,
  "http_code": 200,
  "http_connectcode": 0,
  "httpauth_avail": 0,
  "local_ip": "",
  "local_port": 34740,
  "max_time": 0,
  "method": "GET",
  "namelookup_time": 5.520988,
  "num_connects": 1,
  "os_errno": 0,
  "pretransfer_time": 5.673749,
  "primary_ip": "",
  "primary_port": 443,
  "proxyauth_avail": 0,
  "redirect_count": 0,
  "redirect_time": 0.0,
  "redirect_url": "https://www.cloud-mercato.com/",
  "request_size": 190,
  "size_download": 219.0,
  "size_upload": 0.0,
  "speed_download": 38.0,
  "speed_upload": 0.0,
  "ssl_engines": [
  ],
  "ssl_verifyresult": 0,
  "starttransfer_time": 5.74181,
  "total_time": 5.741879
}

Easy and useful, let’s see which helpful metrics we have:

  • namelookup_time: Time until DNS resolution is done
  • connect_time: Time until the TCP connection is established
  • appconnect_time: Time until the SSL/TLS handshake is done and HTTP communication can start
  • pretransfer_time: Time until the transfer is about to start
  • starttransfer_time: Time until the first byte is received
  • total_time: Total request/response time
  • speed_download and speed_upload: Throughput

We see here that the value called latency in other benchmark tools is split into 6 items, each of them describing a stage of an application-level TCP/IP connection. Detail is the watchword here. We also tried to stay fully compatible with the original curl and keep the same command line arguments, so even advanced scenarios such as header inclusion should be possible.
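Since libcurl measures each of these timings from the start of the request, the duration of one stage is the difference between two consecutive values. Here is a minimal Python sketch of that subtraction, fed with the timings from the output above. The stage labels and the function name `stage_durations` are my own, and the sketch assumes an HTTPS request (for plain HTTP, appconnect_time is reported as 0 and should be skipped).

```python
# Timings copied from the pycurlb output above (seconds since request start).
SAMPLE = {
    "namelookup_time": 5.520988,
    "connect_time": 5.581115,
    "appconnect_time": 5.673696,
    "pretransfer_time": 5.673749,
    "starttransfer_time": 5.74181,
    "total_time": 5.741879,
}

# Hypothetical stage labels, in the order the timings occur for HTTPS.
STAGES = [
    ("dns", "namelookup_time"),
    ("tcp_connect", "connect_time"),
    ("tls_handshake", "appconnect_time"),
    ("pretransfer", "pretransfer_time"),
    ("wait_first_byte", "starttransfer_time"),
    ("transfer", "total_time"),
]


def stage_durations(timings: dict) -> dict:
    """Turn cumulative timings into the time spent in each stage."""
    durations = {}
    previous = 0.0
    for stage, key in STAGES:
        durations[stage] = round(timings[key] - previous, 6)
        previous = timings[key]
    return durations


for stage, seconds in stage_durations(SAMPLE).items():
    print(f"{stage:>16}: {seconds:.6f} s")
```

On this sample, DNS resolution alone accounts for about 5.52 of the 5.74 seconds, which a single aggregated latency figure would have hidden.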

This software is an open source project hosted on GitHub. Feel free to use it, contribute or open issues: you are very welcome.

Do you warm up volumes?

Nowadays, most cloud vendors provide different solutions to store your data and use it from other services. In the virtual machine realm, it is commonly accepted that block storage brings flexible consumption and local devices ensure low latency. At Cloud Mercato we continuously test and report storage metrics and, beyond the performance announced by providers, we often face bias effects all related to a phenomenon called “volume warming-up”.

Aren’t we talking about sport?

No, sorry, it isn’t even linked to temperature: your volumes are supposed to be in cool rooms somewhere with many other peers. Here the subject is your HDD/SSD performance when you have just received it. Brand new volumes may suffer from several kinds of phenomena mainly bound to block allocation. Our team sees this regularly and, in fact, the expression “warm-up” names the solution, not the issue. Here’s what we observe:

  • When you read your volume, you get very high performance. It’s not really disturbing, as a real user won’t read an empty disk. The problem is for testers like us, who risk collecting results that are sometimes too good to be true.
  • When you write, on the contrary, performance is low and a penalty of 50 to 95% can be seen. Here an end user is directly affected. Imagine a fresh new database node working at 30% of its capacity: just populating your database will take a while.

Why does it occur?

As you can guess, providers don’t sell underperforming drives. Some vendors clearly state in their documentation whether their volumes suffer from this issue, others let you find out by yourself. In that last case, we advise you to make sure that your usage won’t be degraded. As we are in virtualized environments, it’s difficult to give a general description of what’s going on, but the root of this handicap lies in block allocation.

Let’s explain these problems from the point of view of a volume controller, such as a block storage system or a device controller:

  • Read scenario: The OS asks for X blocks in an area of my volume where I have never written. I haven’t even set up a registry entry yet and I know this part is empty, so I can quickly answer “zero” whatever you ask me. This explains the high read performance.
  • Write scenario: The OS wants to store X blocks, so I first need to allocate space in storage and update my registry. These operations are done automatically the first time you use your volume; they represent the overhead and are why you should warm up your devices before using them.

How to resolve this issue

The fix consists in performing the block allocation before real usage: basically, you must write to the entire device and then read it. The intention is to allocate every block on the entire device with the write, and to be sure they are available with the read.

Despite the variety of hypervisors and distributed storage, this method works for most affected storage. On Unix platforms, only 2 lines are required:

# Replace /dev/vdX by your device path
dd if=/dev/zero of=/dev/vdX bs=1M  # Write
dd if=/dev/vdX of=/dev/null bs=1M  # Read

One problem remains: the time these operations take. First there is the latency caused by our base problem, then the elapsed time is proportional to the volume’s size. Do you see it coming? Imagine filling a 3TB disk at 1.5MB/sec: the setup could be highly time-consuming. Another solution is to parallelize jobs, but dd is not made for that. That’s where we use FIO:

# Replace /dev/vdX by your device path
fio --filename=/dev/vdX --rw=write --bs=1m --iodepth=32 --ioengine=libaio --numjobs=32 --direct=1 --name=fio  # Write
fio --filename=/dev/vdX --rw=read --bs=1m --iodepth=32 --ioengine=libaio --numjobs=32 --direct=1 --name=fio  # Read
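To put a number on the 3TB at 1.5MB/sec example, here is the arithmetic as a small Python sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant throughput, which is a simplification since performance rises as blocks get allocated. The function name `warmup_seconds` is hypothetical.

```python
TB = 10**12  # decimal terabyte, an assumption
MB = 10**6   # decimal megabyte, an assumption


def warmup_seconds(volume_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time needed to write a whole volume once at a constant throughput."""
    return volume_bytes / throughput_bytes_per_s


seconds = warmup_seconds(3 * TB, 1.5 * MB)
days = seconds / 86400
print(f"{seconds:.0f} s, about {days:.1f} days")
```

That is 2,000,000 seconds, or roughly 23 days for the write pass alone, which is why parallel jobs matter so much here.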

Even with simultaneous operations, warm-up is still a potentially long task. But we can put things in perspective: it only penalizes fresh blocks without allocation, so this operation has to be launched only at server startup. No need to launch it several times or periodically. On the other hand, this is something to take into consideration in infrastructure setup time. For example, take a modern application with an ordinary xSQL cluster supporting replication, configured with auto-scaling to spin up VMs and set up replicas. If my storage suffers from lazy allocation, I have two options:

  • I take the time to warm up, which could take 1 hour, and auto-scaling becomes useless
  • My RDBMS warms up the volume by writing replication data to its storage: the process will be very slow and you’ll have bad performance for any newly allocated block

So there isn’t any quick solution. As written above, we advise you to know precisely where you store your data. Volumes with this disadvantage are simply not suited to auto-scaling or other scenarios with time constraints.

Let’s visualize it

Here’s a graph representing writing on a fresh SSD through block storage.

I attached my device at 1:20pm and started to write continuously until reaching maximum performance. My test scenario writes randomly to the SSD, so I’m not sure to warm all the blocks, and that’s the point: a user writing to a filesystem doesn’t really choose which blocks will be filled. So what can we see?

  • Performance starts really low: 10 IOPS
  • The more I write to the disk, the more its throughput increases
  • After 10 minutes, the maximum is reached and stable between 450 and 500 IOPS

A stable plateau at 500 IOPS is a low number, revealing throttling set by the cloud provider. If this limit were 5M IOPS, I think we would have a clearer view of the phenomenon. Similarly, the bigger the volume, the longer it takes to be hot and ready.


If we place this data in a real infrastructure, the impact could be huge or null: it all depends on the kind of system you run. A classical 3/3 will just require a one-time operation at start-up, but a cloud-native architecture which claims flexibility will suffer either from a slow start or from a setup time due to warming up.


dd is not a benchmarking tool

There is a widely held idea on the Internet that a written snippet will be universally valid to test and produce comparable results on any machine. Put like that, this assertion is plainly false, but a piece of code that is valid in one context can travel a long way on the web and easily fool a good number of people. Benchmarks with dd are a good example. Which Unix nerd hasn’t tested a brand new device with dd? The command outputs an accurate value in MB/sec, so what more could you want?

The problem is already in the benchmark’s conception

If I quickly open dd’s user manual or, more simply, its help text, I can read:

Copy a file, converting and formatting according to the operands.

If my goal is to benchmark a device, it already appears that this tool is not the most appropriate. First, I don’t aim to copy anything but just to read or write. Next, I don’t want to work with files but with block devices. Finally, I don’t need the announced features about data handling. These three points are really important, because they show how inappropriate the tool is.

Don’t get me wrong, I’m not denigrating dd. It has personally saved me tons of hours with ISOs or disk migrations. But using it as a standard benchmark tool is more a hack than a reliable idea.

The first issue: The files

A major pitfall in benchmarking lies in what I want to test and how I’ll do it. Here, our goal is HDD/SSD performance, and going through a filesystem can create a big bias in your analysis. Here’s the kind of command you can find on the Internet:

dd bs=1M count=1024 if=/dev/zero of=/root/test

For those not familiar with dd, the above command creates a 1GB file containing only zeros in the root user’s home: /root/test. The authors generally claim the goal is to collect the performance of the device where the file is stored, but this goal is poorly reached. Storage performance is mainly affected by a set of caches/buffers from the user level down to the blocks located on the SSD. The file system is the main entry point for users but, as it is software, it can hide the reality of your hardware, for better or for worse.

By default, dd toward a file system uses an asynchronous method, meaning that if the written file is small enough to fit in RAM, the OS won’t write it to the drive immediately and will wait for the most appropriate time to do so. In this configuration, the command’s output will absolutely not represent the storage’s performance and, as only volatile memory is involved, dd displays very good performance.
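The effect of this caching can be observed without dd. The Python sketch below is only an illustration under assumptions: it writes a small file of zeros twice, once leaving the data in the OS cache and once forcing it to the device with os.fsync, and compares the elapsed times. The function name `timed_write` and the 32 MiB size are arbitrary choices. If your temporary directory lives in RAM (tmpfs), both timings will look similar.

```python
import os
import tempfile
import time


def timed_write(path: str, size_mb: int, sync: bool) -> float:
    """Write size_mb MiB of zeros to path and return the elapsed seconds."""
    block = b"\0" * (1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
        if sync:
            # Push Python's buffer to the OS, then ask the OS to flush to the device.
            f.flush()
            os.fsync(f.fileno())
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "test")
    buffered = timed_write(target, 32, sync=False)
    synced = timed_write(target, 32, sync=True)
    print(f"buffered: {32 / buffered:.0f} MiB/s")
    print(f"with fsync: {32 / synced:.0f} MiB/s")
```

On a typical disk-backed filesystem, the buffered figure is far higher because it mostly measures memory speed, which is exactly the trap described above.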

At Cloud Mercato, as we want to reflect infrastructure performance, we bypass the file system as much as possible and directly test the device by its absolute path. So from our benchmarks you know what your hardware can do, and you can boost it with the file system of your choice. There are only a few cases where files are involved, such as testing a root volume in write mode: you mustn’t write to your root device directly or you’ll erase its OS.

Second issue: A tool without data generation

dd is designed around the concept of copying, which is also quite well expressed by the name it is often given, “Data Duplicator”. Fortunately, in Unix everything is a file and kernels provide pseudo-files that generate data. They are:

  • /dev/zero
  • /dev/random
  • /dev/urandom

Under the hood, these pseudo-files are real software and suffer from it. /dev/zero is CPU-bound and, because it only produces zeros, it cannot represent a real workload. /dev/random is quite slow due to its high randomness and /dev/urandom is too intensive in terms of CPU cycles.

Basically, you may not reach the storage’s maximum performance if you are limited by the CPU. Moreover, dd isn’t multi-threaded, so only one thread at a time can stress the device, decreasing the chances of getting the best out of it.

Third: A lack of features

As said, dd is not a benchmarking tool. If you look at the open-source catalog of storage testing tools and their common features, dd, not being intended for this purpose, is out of the competition:

  • Single thread only
  • No optimized data generation
  • No access mode: Sequential or random
  • No deep control such as I/O depth
  • Only average bandwidth, no IOPS, latency or percentiles
  • No mixed patterns: read/write
  • No time control

This shortened list speaks for itself: Data Duplicator doesn’t provide the features necessary to be called a performance test tool.

Then the solution

Here are real benchmark tools that you can use:

FIO is really our daily tool (if not hourly): it gives us possibilities unimaginable with dd, like I/O depth or random access. vdbench is also very handy: built on a similar concept to FIO, it lets you create complex scenarios such as involving multiple files in read/write access.

In conclusion, a benchmark is not only a suite of commands run in a shell. The tests executed and the expected output really depend on context: What do you want to test? Which components should be involved? Why will this value represent something? Any snippet taken from the Internet may have value in a certain environment and be misleading in another. It’s up to the tester to understand these factors and choose the appropriate tool for her/his purpose.

Benchmark floating point computations with Python

Python and its nature

I am fundamentally a Python developer, a fact which has let me skip a bunch of C knowledge. I made the choice to stop placing all my focus on system performance and to gain in writing and learning velocity. This will be the first of many blog posts that will focus on a variety of programming aspects. I’ll start things off by diving into Python. Let’s go back 10 years, when I was launching my first Python interpreter and typing:

>>> sum([2, 2])

It immediately made me think: “So I don’t need to compile it” or “The syntax is soooo clear”. As a sysadmin, it didn’t take long before I created my first scripts, then applications: web page scraping, mailing, ncurses menus. Basically, the knight became a blacksmith and is now able to forge his own swords. To share my armory, I quickly decided to learn the well-known web framework Django and, seeing the learning curve and the quick results I was getting, I apparently made good choices.

Dive into the cobra performance

Whether you use the snake language or not, you have inevitably heard of its main problem: because it’s not a compiled language, Python cannot reach the performance of the fastest languages such as C/C++. This assumption is partially true and, as a performance benchmarker, I always thought that despite the slower performance I was able to create anything I needed with Python, so the trade-off was worth it. At least until I tried to write a benchmark tool in Python with a very small amount of overhead. This did not turn out exactly as I had expected.

The purpose of this kind of program is in total contradiction with the average web developer’s habits (cf. “I cache everything”). The idea here is generally to perform a precise action somewhere on a system with minimum overhead, and Python by nature didn’t seem suited to that. This is a second statement that is not entirely true: there are a bunch of ways to write and use C code in Python, such as Cython, Pyrex or C wrappers. The standard library itself includes more and more C code, and third-party alternatives even exist to speed up parts of the code.

In fact, most general usages already have their C implementation. Network, file I/O, regexp or TLS: the language has boosted a lot of topics. CPU-bound tasks remain and here, as in many interpreted languages, the GIL produces some overhead. But keep in mind that this only impacts multi-threaded applications. In a few words, single-threaded code is handsomely optimized but multi-threaded code suffers from the language design.

Python for scientists

Thanks to its various qualities, Python has become one of the favorite languages for scientific computing. We can mention NumPy, the foundation of scientific Python, or Anaconda, the scientific Python distribution, and its package manager conda. The ecosystem is really large and covers many purposes:

  • Pure mathematics
  • Data representation
  • Machine/deep learning
  • Interactive notebook

It is simple to use and lets people produce results quickly, but scientists generally have another main requirement: high computing capacity. Tons of numbers with tons of applied functions, sometimes with tons of dimensions. Scaling generally answers this issue, but let’s avoid the multi-tasking subject and focus on single-thread performance. Of course parallelism is part of the real landscape, but it requires techniques such as sharding or parallel computing.

To collect performance data, we created a simple tool named FPB: Floating Point Benchmark. It aims to run different kinds of operations in several Python ways, for instance computing an average with CPython or with third-party libraries. This project is free, so feel free to contribute. It will also be the subject of another article.
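To give an idea of what such a comparison looks like, here is a minimal sketch that computes the same average in two ways using only the standard library: plain Python and an in-memory SQLite table. This is not FPB’s code; the function names, the dataset size and the table layout are assumptions for illustration, and the SQLite timing includes loading the data into the table.

```python
import random
import sqlite3
import time


def mean_python(values):
    """Average with built-in sum() and len()."""
    return sum(values) / len(values)


def mean_sqlite(values):
    """Average computed by SQLite's AVG() on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE samples (value REAL)")
        conn.executemany("INSERT INTO samples VALUES (?)", ((v,) for v in values))
        return conn.execute("SELECT AVG(value) FROM samples").fetchone()[0]
    finally:
        conn.close()


def timed(func, values):
    """Return (result, elapsed seconds) for one call of func(values)."""
    start = time.perf_counter()
    result = func(values)
    return result, time.perf_counter() - start


random.seed(0)
data = [random.random() for _ in range(10_000)]
for name, func in (("python", mean_python), ("sqlite", mean_sqlite)):
    result, elapsed = timed(func, data)
    print(f"{name:>6}: mean={result:.6f} in {elapsed * 1000:.3f} ms")
```

A single run like this is noisy, so a real benchmark would repeat each measurement and scale the dataset size.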

Below you’ll find some charts and tables representing timings of math functions. We observe performance from vanilla Python to NumPy, passing through alternative built-in ways such as SQLite. Our test environment is the following:

  • T-Systems’ p2.2xlarge.8 powered by KVM
  • Intel Xeon CPU E5-2690 v4
  • 8 vCPU @ 2.6 GHz
  • 64GB of RAM
  • 1x Tesla V100 PCIe
  • FPB with float32


           Numpy  Pandas  Python  SQLite
100        0.015   1.004   0.002   0.030
500        0.013   1.015   0.003   0.056
1000       0.017   1.017   0.006   0.088
5000       0.017   1.019   0.026   0.392
10000      0.023   1.027   0.051   0.623
50000      0.041   1.035   0.245   3.152
100000     0.082   1.086   0.498   5.763
500000     0.260   1.289   2.662  29.889
1000000    0.501   1.533   5.622  60.059

           Numpy  Pandas  Python  SQLite
100        0.006   0.238   0.002   0.034
500        0.005   0.239   0.008   0.066
1000       0.006   0.244   0.016   0.091
5000       0.007   0.266   0.074   0.424
10000      0.010   0.284   0.147   0.641
50000      0.025   0.430   0.729   3.336
100000     0.035   0.567   1.446   6.044
500000     0.180   3.243   7.240  31.573
1000000    0.313   4.485  14.695  64.826

Numpy Pandas Python SQLite
100 0.003 0.184 0.014
500 0.010 0.187 0.062
1000 0.018 0.203 0.121
5000 0.087 0.275 0.577
10000 0.158 0.349 1.168
50000 0.999 1.093 6.023
100000 0.813 1.027 13.022
500000 10.135 10.343 66.921
1000000 16.541 16.639 134.221

           Numpy  Pandas  Python  SQLite
100        0.010   0.263   0.001   0.029
500        0.009   0.270   0.003   0.057
1000       0.011   0.270   0.006   0.087
5000       0.015   0.293   0.026   0.388
10000      0.027   0.307   0.050   0.598
50000      0.050   0.463   0.246   3.063
100000     0.405   0.620   0.502   5.774
500000     0.403   3.354   2.595  30.432
1000000    1.452   4.825   5.582  59.209

           Numpy  Pandas  Python  SQLite
100        0.080   0.341   0.072   0.038
500        0.315   0.593   0.323   0.086
1000       0.610   0.673   0.547   0.149
5000       3.038   2.556   3.008   0.670
10000      5.899   3.875   6.069   1.139
50000     29.400  23.922  28.636   5.818
100000    58.728  50.507  61.103  11.123
500000   294.646 268.885 374.352  57.669
1000000  585.783 509.264 590.154 116.219


  • 1st phenomenon: all methods generally give stable results until they drop off, meaning they aren’t designed to handle more
  • 2nd phenomenon: if a series doesn’t drop off, it may stop earlier, showing memory errors when the system isn’t able to hold the whole dataset
  • Python isn’t always slow: for instance, sum is very well implemented and offers good performance
  • The outsider SQLite can offer good results for multi-dimensional operations but can’t do, or is slow with, math operations
  • Pandas being based on NumPy, their performance is equivalent

From CPU to GPU

In case you didn’t know, GPUs are optimized for floating-point computation, and nowadays it’s not unusual to see gaming-focused PCs used as high-performance computers. Over the last decade, development for this kind of device has been eased a lot, mainly by CUDA (Compute Unified Device Architecture). This technology allows general-purpose computing on GPUs (GPGPU) with the C programming language. And so the Snake comes again: built on CUDA, multiple libraries have formed a complete Python ecosystem, from statistics to deep learning, where most topics have been implemented with GPU support.

In the landscape of Python and GPU, you’ll find several bricks, such as:

  • CuPy: Began in 2015 as part of Chainer, a neural network framework; this project is an implementation of Numpy using C/CUDA libraries
  • CuDF: Young project from 2017 implementing Pandas using CUDA technologies
  • CUDAMat: Math library using the GPU and compatible with NumPy. Its initial release was in 2013
  • GNumpy: NumPy GPU implementation from the University of Toronto. Although this project isn’t supported anymore, it seems to fit our goal

Here are more charts showing the performance of the GPU solutions. We keep Numpy to get an idea of CPU vs GPU.

100 0.181 0.111 0.015 0.459
500 0.182 0.136 0.013 0.464
1000 0.187 0.111 0.017 0.457
5000 0.210 0.134 0.017 0.547
10000 0.269 0.109 0.023 0.553
50000 0.246 0.140 0.041 0.534
100000 0.318 0.115 0.082 0.548
500000 0.359 0.290 0.260 0.561
1000000 0.372 0.632 0.501 0.900
5000000 0.529 4.020 0.924
10000000 0.527 8.087 0.940
50000000 0.951 40.224 1.273
100000000 1.477 78.201 1.735
500000000 5.455 391.084 6.220
1000000000 10.486 782.191 9.125
100 0.147 0.092 0.006 0.269
500 0.163 0.119 0.005 0.278
1000 0.145 0.093 0.006 0.264
5000 0.174 0.117 0.007 0.365
10000 0.167 0.085 0.010 0.362
50000 0.255 0.120 0.025 0.373
100000 0.358 0.121 0.035 0.371
500000 1.141 0.361 0.180 0.371
1000000 2.128 0.738 0.313 0.665
5000000 15.193 4.418 0.657
10000000 29.211 8.847 0.679
50000000 142.295 44.706 0.985
100000000 283.188 89.415 1.426
500000000 1411.501 447.529 4.645
1000000000 2821.986 895.139 9.640
100 0.082 0.003
500 0.104 0.010
1000 0.077 0.018
5000 0.103 0.087
10000 0.078 0.158
50000 0.166 0.999
100000 0.162 0.813
500000 0.805 10.135
1000000 1.010 16.541
5000000 4.021
10000000 15.713
50000000 81.924
100000000 170.693
500000000 777.003
1000000000 1623.245
100 0.182 0.096 0.010 0.268
500 0.198 0.119 0.009 0.262
1000 0.172 0.095 0.011 0.269
5000 0.187 0.129 0.015 0.359
10000 0.191 0.089 0.027 0.376
50000 0.226 0.117 0.050 0.370
100000 0.225 0.108 0.405 0.369
500000 0.317 0.286 0.403 0.373
1000000 0.334 0.618 1.452 0.647
5000000 0.445 3.979 0.658
10000000 0.511 8.007 0.675
50000000 0.960 38.784 1.009
100000000 1.496 77.704 1.443
500000000 5.345 389.102 5.063
1000000000 10.376 777.999 9.623
100 0.183 0.103 0.080
500 0.195 0.134 0.315
1000 0.226 0.101 0.610
5000 0.309 0.129 3.038
10000 0.311 0.249 5.899
50000 0.442 0.270 29.400
100000 0.443 2.395 58.728
500000 0.777 3.215 294.646
1000000 1.130 19.542 585.783
5000000 3.175 31.483
10000000 5.774 63.104


  • The GPU solutions unhook far beyond the point where the CPU hits memory errors
  • Small datasets (<100K) don’t really require a GPU
  • Most of the frameworks are able to handle 1 billion data points in a reasonable time
  • Some frameworks are still stable at that point and could handle more if they had more GPU RAM
  • The CPU is immediately out of the race for multi-dimensional arrays


  • Despite having less RAM than the system, the GPU frameworks can handle more data than the CPU
  • For simple one-dimensional operations, Numpy is faster than the others
  • GPU implementations aren’t equal: depending on the operation, each has its pluses and minuses

Distributed computing

Yes, I wrote that multi-processing wasn’t the goal of this article, but several solutions deserve a few lines. I’ll talk about:

  • Dask: Enables scaling for the main scientific computing frameworks
  • PySpark: Python API to use Spark, a Big Data analysis engine

With these approaches, data is generally chunked and computations are shared across threads, processes or nodes. This has several implications:

  • Interprocess communication produces an overhead, which is more significant over TCP/IP
  • It’s up to the developer to know the best way to parallelize computing for their application. From operation scheduling to chunk size, not everything is as easy as with Numpy, and it requires adaptation to the platform
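
To make the chunking idea concrete, here is a minimal sketch using only the standard library. It is not how Dask or Spark work internally; the helper names (chunked, parallel_sum), the worker count and the chunk sizes are assumptions chosen for illustration. Since CPython threads share the interpreter lock, the sketch shows the structure and the scheduling overhead rather than a real speedup.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def chunked(values, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def parallel_sum(values, chunk_size, workers=4):
    """Sum each chunk in a worker thread, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(math.fsum, chunked(values, chunk_size)))
    return math.fsum(partials)


# Arbitrary dataset size chosen for illustration only.
data = [float(i) for i in range(200_000)]

# Smaller chunks mean more tasks to schedule, hence more overhead per item.
for chunk_size in (1_000, 50_000):
    start = time.perf_counter()
    total = parallel_sum(data, chunk_size)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"chunk_size={chunk_size}: total={total:.0f} in {elapsed_ms:.2f} ms")
```

Even in this toy version the developer has to pick a chunk size and a number of workers, which is exactly the kind of tuning the frameworks above require.
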
CuPy Dask Dask+CuPy Numpy Spark
100 0.111 6.991 0.015 214.231
500 0.136 6.783 0.013 214.696
1000 0.111 6.888 0.017 212.697
5000 0.134 7.303 0.017 209.330
10000 0.109 6.611 0.023 218.304
50000 0.140 6.832 0.041 217.465
100000 0.115 6.819 0.082 273.697
500000 0.290 7.717 0.260 332.148
1000000 0.632 7.209 0.501
5000000 4.020 6.950
10000000 8.087 7.594
50000000 40.224 8.729
100000000 78.201 10.981
500000000 391.084 26.666
1000000000 782.191 45.295
CuPy Dask Dask+CuPy Numpy Spark
100 0.092 4.010 6.030 0.006 214.424
500 0.119 3.969 6.586 0.005 202.505
1000 0.093 5.334 6.053 0.006 205.601
5000 0.117 3.800 6.080 0.007 211.439
10000 0.085 5.008 5.944 0.010 218.309
50000 0.120 3.300 6.116 0.025 213.221
100000 0.121 11.348 6.505 0.035 232.708
500000 0.361 10.884 6.620 0.180
1000000 0.738 17.597 6.370 0.313
5000000 4.418 6.226
10000000 8.847 6.459
50000000 44.706 7.992
100000000 89.415 9.817
500000000 447.529 24.648
1000000000 895.139 42.768
CuPy Dask Dask+CuPy Numpy Spark
100 0.082 3.588 5.967 0.003 168.582
500 0.104 3.590 5.382 0.010 170.114
1000 0.077 4.558 5.301 0.018 174.239
5000 0.103 3.575 5.321 0.087 202.543
10000 0.078 4.504 5.381 0.158 198.266
50000 0.166 3.944 5.349 0.999 440.174
100000 0.162 9.210 5.394 0.813 590.974
500000 0.805 16.254 5.575 10.135
1000000 1.010 24.246 5.469 16.541
5000000 4.021 5.538
10000000 15.713 5.792
50000000 81.924 7.541
100000000 170.693 9.235
500000000 777.003 24.362
1000000000 1623.245 42.894
CuPy Dask Dask+CuPy Numpy Spark
100 0.096 4.260 6.931 0.010 216.131
500 0.119 4.109 6.589 0.009 201.460
1000 0.095 5.564 7.343 0.011 212.060
5000 0.129 3.873 6.847 0.015 221.375
10000 0.089 5.177 6.931 0.027 218.741
50000 0.117 3.360 6.800 0.050 216.568
100000 0.108 11.447 6.763 0.405 265.191
500000 0.286 11.154 7.296 0.403 338.511
1000000 0.618 18.069 7.232 1.452
5000000 3.979 6.986
10000000 8.007 7.221
50000000 38.784 8.990
100000000 77.704 10.670
500000000 389.102 26.002
1000000000 777.999 45.081
CuPy Dask Dask+CuPy Numpy Spark
100 0.103 515.880 91.227 0.080 629.306
500 0.134 584.842 95.750 0.315 644.168
1000 0.101 473.285 95.494 0.610 650.786
5000 0.129 637.091 94.269 3.038 742.905
10000 0.249 532.024 82.024 5.899 782.469
50000 0.270 733.753 93.942 29.400 1442.649
100000 2.395 709.022 95.728 58.728 2469.502
500000 3.215 962.462 204.780 294.646
1000000 19.542 948.553 207.883 585.783
5000000 31.483 488.644
10000000 63.104 853.314


  • As I played with a single host, we cannot appreciate the real benefits of these frameworks
  • We observe a minimum overhead of 6ms from CuPy to Dask+CuPy
  • PySpark has an overhead of 200ms, making it unsuitable for our tests
  • Moreover, PySpark doesn’t seem to handle memory as well as vanilla Python does
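
As a rough check of these overheads, the sketch below subtracts the single-framework times from the distributed ones, using the first seven rows of one of the tables above. The values are assumed to be in milliseconds, and comparing Spark against Numpy as a baseline is my own choice for illustration.

```python
from statistics import median

# First seven rows of one of the tables above (values assumed to be milliseconds).
sizes = [100, 500, 1000, 5000, 10000, 50000, 100000]
cupy_ms = [0.092, 0.119, 0.093, 0.117, 0.085, 0.120, 0.121]
dask_cupy_ms = [6.030, 6.586, 6.053, 6.080, 5.944, 6.116, 6.505]
numpy_ms = [0.006, 0.005, 0.006, 0.007, 0.010, 0.025, 0.035]
spark_ms = [214.424, 202.505, 205.601, 211.439, 218.309, 213.221, 232.708]

# Overhead = distributed time minus the time of the underlying single-host run.
dask_overhead = [d - c for d, c in zip(dask_cupy_ms, cupy_ms)]
spark_overhead = [s - n for s, n in zip(spark_ms, numpy_ms)]

print(f"Dask+CuPy over CuPy: min {min(dask_overhead):.2f} ms, "
      f"median {median(dask_overhead):.2f} ms")
print(f"Spark over Numpy:    min {min(spark_overhead):.2f} ms, "
      f"median {median(spark_overhead):.2f} ms")
```

On small datasets the overhead is nearly constant, which is why it dominates the measurement there.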

Of course, PySpark is here just for experimentation; in my mind, the small implementation that I used isn’t representative of real usage. Spark clearly belongs to the big data field, where even handling 1 trillion items would be a common task. Furthermore, a single-host Spark is… hmm… a joke.


Here my GPU has 4 times less RAM than my CPU, but we can see that it can handle 1,000 times more data, from 1M to 1G. Adding the multi-dimensional advantage, we can conclude without doubt that the GPU is superior. But due to its price, another question arises:

When should I choose GPU instead of CPU ?

From our results, Numpy doesn’t perform badly compared to the GPU solutions; the real problem here is memory allocation. There is just not enough RAM to run the test until it unhooks. 2D and complex operations such as sine are slower but acceptable. At this point we can say that 1D arithmetic with fewer than 1M data points seems to be the workload best suited to the CPU.
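
To get a feeling for the memory side, here is a back-of-the-envelope sketch. The data type used in the tests isn’t stated here, so the 8 bytes per value (a 64-bit float, Numpy’s default float type) is an assumption, and the list estimate ignores the list’s own header and any allocator overhead. The function names are hypothetical.

```python
import struct
import sys

POINTER_BYTES = struct.calcsize("P")     # one slot in a Python list
FLOAT_OBJECT_BYTES = sys.getsizeof(1.0)  # one Python float object (CPython)


def packed_bytes(n, itemsize=8):
    """Bytes for n values stored contiguously (assumed 8 bytes per value)."""
    return n * itemsize


def list_of_floats_bytes(n):
    """Rough bytes for a Python list holding n distinct float objects."""
    return n * (POINTER_BYTES + FLOAT_OBJECT_BYTES)


for n in (10**6, 10**9):
    print(f"{n:>13,} items: packed ~{packed_bytes(n) / 2**30:.2f} GiB, "
          f"list of floats ~{list_of_floats_bytes(n) / 2**30:.2f} GiB")
```

Under these assumptions a single 1G-item array already needs several gigabytes before any intermediate result is allocated, and a plain Python list needs a multiple of that.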

These basic operations can’t perfectly reflect end usage the way a real machine learning framework would; FPB’s goal is to understand the performance of these tasks. A future machine learning benchmark will fill that gap.