New CPU architecture at Oracle Cloud: VM.Standard.A1.Flex

Ampere X Oracle

A decade ago, nobody was betting on the ARM architecture for the server side. It was, and still is, well established in embedded and mobile systems, but using it for any layer of a server application seemed like an odd idea. We only saw it for some custom requirements, such as testing ARM applications with providers like Travis CI. From my perspective, the first attempts at ARM in the cloud market came from:

  • Travis CI, for the sake of Android and iOS developers
  • Scaleway, with a series of ARM VMs from 2 to 64 CPUs
  • Packet (now Equinix), with the former bare-metal c1.large.arm and x.large.arm

The race is now well and truly on, and ARM is clearly a new category of products: we can already see several generations of Graviton at AWS and, of course, Oracle's new A1, the fruit of a collaboration between Ampere Computing and the hyperscaler.

What's under the hood?

Like the other VM offerings at Oracle, A1 benefits from a flexible design: you choose exactly the amount of CPU and RAM for your instances. From 1 to 80 oCPUs and up to 1 TB of RAM, you can shape instances that exactly match your requirements, without having to choose between what other providers call compute-optimized or memory-optimized.

In Oracle's previous x86 series, the cloud provider defined its VMs in oCPUs, an oCPU being the equivalent of a physical core. With AMD and Intel, 1 core carries 2 hyperthreads (generally sold as 2 vCPUs in a virtual machine). Oracle kept the oCPU terminology for the new A1 and 1 oCPU still represents 1 core, but as this hardware has only 1 thread per core, 1 A1 oCPU equals 1 thread. This may look like a loss for the customer; the performance analysis in the next section addresses that.

Performance/price analysis

Oracle Cloud is known for very aggressive pricing, and by using ARM technologies they can expect a performance boost and, above all, lower infrastructure costs, hence a better price for their customers. To measure this benefit we tested the three latest generations of Oracle VMs: E3, E4 and A1. We took the following machines:

Name oCPU vCPU RAM (GB) Price (USD/hr)
E3.Flex.2-8 2 4 8 0.062
E4.Flex.2-8 2 4 8 0.062
A1.Flex.2-8 2 2 8 0.022
A1.Flex.4-8 4 4 8 0.032

We ran a series of tests to compare performance between the series and also to analyse how A1 scales from 2 to 4 oCPUs. The goal is to answer: “Should I move to A1?” and “Should I increase the number of oCPUs?”

The results below show the performance reported by the sysbench CPU benchmark and its price/performance ratio calculated from the hourly price:

Performance: sysbench CPU (number per second, higher is better)

Flavor Events per second
A1.Flex.2-8 495.89
A1.Flex.4-8 1006.91
E3.Flex.2-8 248.50
E4.Flex.2-8 630.85
Price/Performance (higher is better)

Flavor Price (USD/hr) Performance Price/Perf
A1.Flex.2-8 0.022 495.89 30877.02
A1.Flex.4-8 0.032 1006.91 43103.81
E3.Flex.2-8 0.062 248.50 5490.46
E4.Flex.2-8 0.062 630.85 13938.28

What do we see?

  • Drop your E3: at the same price as E4 but with lower performance, it has lost its appeal.
  • With the same number of oCPUs, E4 performs better than A1, but at nearly three times the price.
  • With the same number of vCPUs, A1 is clearly ahead in terms of both performance and price.

The main obstacle to A1 adoption is of course its CPU architecture. While virtually any software is available on x86, on aarch64 you may run into packages that are missing, that you have to compile yourself, or that are simply unavailable. But keep in mind that general-purpose software such as programming languages and databases is already compiled and packaged.


Of course we ran more tests:


Amazon Web Services : M5 vs M5a vs M6g

AWS M5 / M5a / M6g Benchmark

In other words: Intel vs AMD vs ARM. AWS recently released Graviton-based series for all of their main instance types: R6g for memory-optimized, C6g for compute-optimized and M6g for general purpose. Their offering has historically been based on Intel, but in recent years AMD options appeared, and now, with Graviton2, AWS relies on its own chips.

Amazon Web Services presents its Graviton processors as a new option for customers to increase performance at a lower cost. But what is the difference between these solutions that look identical on paper? Let's run CPU benchmarks to find out.

Specification

For our benchmark, we took an 8 CPU / 32 GB VM from each series:

Product Price (USD/hr) CPU Frequency
m5.2xlarge 0.38 Intel(R) Xeon(R) Platinum 8175M or 8259CL 2.5-3.2GHz
m5a.2xlarge 0.34 AMD EPYC 7571 2.4-3.0GHz
m6g.2xlarge 0.31 aarch64 N/A

These data were collected by our Cloud Transparency Platform; prices are for the us-east (N. Virginia) region.

Performance testing

Prime number search with sysbench CPU

The sysbench CPU test boils down to arithmetic operations on integers: it searches for prime numbers below a configurable limit.

We observe an increase of about 100% in single-thread performance, and close to 400% between M5 and M6g with 8 threads.
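For reference, this kind of measurement can be scripted in a few lines. The sketch below is a hypothetical example rather than our actual harness: it assumes sysbench 1.0+ is installed and simply parses the “events per second” figure for a given thread count.

    import re
    import subprocess

    def sysbench_cpu(threads, max_prime=10000, seconds=30):
        """Run the sysbench CPU test and return events per second."""
        out = subprocess.run(
            ["sysbench", "cpu",
             "--threads=%d" % threads,
             "--cpu-max-prime=%d" % max_prime,
             "--time=%d" % seconds,
             "run"],
            capture_output=True, text=True, check=True,
        ).stdout
        # sysbench prints a line such as "events per second:  1234.56"
        return float(re.search(r"events per second:\s+([\d.]+)", out).group(1))

    for threads in (1, 8):
        print(threads, "thread(s):", sysbench_cpu(threads))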

Encryption with AES-256 CBC

Where AMD's performance depends on block size, Intel and Graviton are homogeneous across sizes. The ARM chip is able to encrypt at 1.2 GB/s, where M5 and M5a respectively cap at 400 MB/s (200%) and 900 MB/s (130%).

Price

Product Hourly Monthly (estimation) Yearly (estimation) Discount
m5.2xlarge 0.38 280 3,360
m5a.2xlarge 0.34 251 3,012 -11%
m6g.2xlarge 0.31 224 2,688 -22%

Monthly is based on 730 hours and yearly on 8,760 hours, without any long-term subscription.

Prices leave no doubt: each new generation offers a lower cost, and M6g is the cheapest.

Conclusion

Depending on your workload, Graviton offers up to +400% performance compared to its Intel counterpart. Combined with lower pricing, M6g is definitely the best EC2 choice for any CPU-bound workload compatible with the ARM architecture.


Check out the data in our Public Cloud Reference


New C5a benchmark: Performance/Price

AWS recently released the new C5a series, equipped with a custom AMD EPYC 7R32. It is a less expensive alternative to C5, similar to what they did with M5, R5 and T3. But cost isn't an appropriate metric if you don't take performance into account, so let's dive into a performance/price benchmark comparing C5 and C5a.

A lower pricing

Name CPU RAM (GB) C5 (USD/hr) C5a (USD/hr)
large 2 4 0.085 0.077
xlarge 4 8 0.170 0.154
2xlarge 8 16 0.340 0.308
4xlarge 16 32 0.680 0.616
9xlarge 36 72 1.530 1.232
12xlarge 48 96 2.040 1.848
18xlarge 72 144 3.060 2.464
24xlarge 96 192 4.080 3.696
metal 96 192 4.080

Pricing is for US East (Ohio)

Slightly better performance

Before opening the hood, there are two things to keep in mind about C5, both of which make CPU performance highly variable. First, behind the product name, several CPU models are sold; we actually collected the following:

  • Intel(R) Xeon(R) Platinum 8124M
  • Intel(R) Xeon(R) Platinum 8275CL

Like the new AMD EPYC 7R32, both are custom models only available at AWS. Second, the same CPU model can run at different frequencies. Cloud providers generally set their CPUs at the base or turbo frequency; for the Platinum 8124M, we detected values from 3 GHz up to 3.45 GHz.
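As an illustration, the model name and running frequency can be read directly from the guest. The snippet below is a minimal sketch assuming a Linux VM exposing /proc/cpuinfo; it is not our actual collector:

    from collections import Counter

    models, freqs = Counter(), []
    with open("/proc/cpuinfo") as cpuinfo:
        for line in cpuinfo:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if key == "model name":
                models[value] += 1          # e.g. "Intel(R) Xeon(R) Platinum 8124M ..."
            elif key == "cpu MHz":
                freqs.append(float(value))  # current frequency of each logical CPU

    print(dict(models))
    if freqs:
        print("frequency range: %.0f-%.0f MHz" % (min(freqs), max(freqs)))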

Geekbench 5

Kind c5.large c5a.large
Single score 934 909
Single Integer 902 815
Single Float 949 969
Single Crypto 1267 1782
Multi score 1115 1168
Multi Integer 1049 1067
Multi Float 1200 1256
Multi Crypto 1470 1952

From a Geekbench perspective, C5a excels especially in cryptography, which is not a domain to underestimate: nowadays encryption is used everywhere, from volumes to HTTP connections to any backend. The other domains are also slightly more efficient, but not by a huge margin.

sysbench RAM

c5.large c5a.large
Read 8201 9139
Write 6134 7091

RAM bandwidth is a good indicator of noisy neighbors, and as C5a has just been released, its values have a better chance of being good. We will therefore keep checking regularly whether C5 and C5a can still claim the same throughput.

Performance/Price

Looking at the results below and knowing that C5a prices are about 10% lower, it's no surprise that C5a has the better profile in terms of performance per dollar spent.

Type Hourly Monthly Multi score Perf/price ratio
c5.large 0.085 62.05 1115 17.97
c5a.large 0.077 56.21 1168 20.78

Monthly price is calculated from 730 hours.
Perf/price ratio equals “Multi score / Monthly”
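The ratio is easy to reproduce; here is a small sketch using the figures from the table above:

    # Perf/price ratio = Geekbench multi score / monthly price (730 hours)
    flavors = {"c5.large": (0.085, 1115), "c5a.large": (0.077, 1168)}

    for name, (hourly, multi_score) in flavors.items():
        monthly = hourly * 730
        print("%s: %.2f USD/month, ratio %.2f" % (name, monthly, multi_score / monthly))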

Conclusion

With this new custom CPU model, AWS lowers its pricing again while slightly increasing performance. In the past, with C5, we observed a lot of performance variation, and it wouldn't be a surprise if future tests pulled the average performance up or down.

As the full series cannot be described by its smallest instance type alone, we also tested bigger flavors. Feel free to consult their performance in our Public Cloud Reference.

Top 10s Cloud Compute debriefing

We recently released our Top 10 for Cloud Compute in North America and Europe. With the help of our automated platform, we tested nearly 20 cloud providers and selected the most interesting ones per region. These studies outline the performance/price value of cloud compute instances and their attached block storage. We focus on the maximum performance delivered by general-purpose infrastructures, their associated costs, and where the efficiency per dollar spent is best.

Context

For each provider, we tested 4 sets of VMs:

Category CPU RAM (GB) Extra storage (GB)
Small 2 4 100
Medium 4 8 150
Large 8 16 200
XLarge 16 32 500

From all the performance and pricing data we collected, the vendor selection was done agnostically, purely by the numbers, using the following key metrics:

  • Hourly price
  • CPU Multi-thread performance
  • Volume IOPS
  • Volume bandwidth

Inherent biases

1. Hourly prices

De facto, most hyperscalers are penalized by these documents' approach. Although they may offer computing power at the cutting edge of technology, the study's design doesn't take into account long-term billing options such as 1 or 3 years. These options are only proposed by big players such as Alibaba, AWS or Azure, and you can count on up to a 60% discount if you subscribe to them.

2. Volume throttling

Next, hyperscalers generally throttle volume performance: where small and medium-sized vendors let you reach 3 GB/s and/or 1M IOPS with block storage, the big players stop at around 3,000 IOPS. This may seem low, but it is guaranteed, whereas the possible 1M IOPS are neither stable nor predictable.

3. Compute focused

Finally, the documents focus on compute: virtual machines and volumes. But cloud providers, especially the big players, have much more to offer. Serverless, object storage, DBaaS: with the variety of existing services, the whole value of a cloud vendor cannot be judged on cloud compute alone.

Our insights at a glance

For those who don’t want to read the reports, here’s a small list of the leading providers:

Provider Price Compute Storage Region
Hetzner Very aggressive Average Average Europe
Kamatera Low Average High Worldwide
UpCloud Average High Guaranteed very high Worldwide
Oracle Cloud Average Average High Worldwide

What next?

These documents will be renewed and their methodology improved. We want to bring in more infrastructure characteristics, such as network and RAM. In the pursuit of objectivity, we think we must diversify our reports to address real-life questions, such as:

  • Small and medium-sized providers
  • Hyperscalers
  • Country-based
  • Provider-origin-based
  • Object storage, CDNs, DBaaS, Kubernetes, etc.

We also want to digitalize this kind of report. Instead of just a PDF, we wish to let consumers explore the data through a web application. This will also let users evaluate more than 10 vendors without degrading readability.

In the meantime, do not hesitate to take a look at our document center.

 

Out of the wood #1: Kamatera

It's been almost 3 years now that our analysis platform has been running on computers all around the globe, automagically collecting data about the cloud market such as locations, instance sizes and, more importantly, price and performance. We currently cover close to 60 providers, counting IaaS, PaaS and CDN vendors, and this is a huge stack of knowledge that we want to share. Of course, our P2P is already there for people who want a price and performance comparison tool, but this application cannot convey all our knowledge. So before creating yet another super visualization tool to expose our data, we thought that laying words on electronic paper would be a good and quick solution. Here is the first article of a series presenting small and medium-sized cloud providers that aren't on everyone's lips but are worth it.

The first platform studied in this series, called “Out of the wood”, is Kamatera, a medium-sized vendor with an international offering.

Who are they

Firstly, Kamatera is characterized by its worldwide presence, with datacenters in North America, Europe, the Middle East and China. Not just single locations per continent, but a well-scattered footprint, covering for instance the Eastern, Western and Central USA.

Following our methodology, we class Kamatera as a medium-sized provider: they are mainly a cloud compute vendor providing IaaS. On top of that, they pay major attention to customer service, so you are free to use their infrastructure with high-level support or to benefit from managed services guaranteed by their teams.

In terms of cloud services, they present all the required features for a decent compute provider:

  • Virtual machines scaling up to 72 CPU and 384GB of RAM
  • VPC management
  • Block storage powered by SSD
  • Load balancer
  • Firewall
  • Multi-user management
  • API

Beyond IaaS, they also offer a sizable catalog of SaaS services based on their VMs. Called “services and apps”, these let users opt in to a preconfigured MongoDB, Rancher or WordPress at no extra cost.

What does their platform look like?

Let's dive into their cloud server design. Kamatera chose flexible shapes for their virtual machines, meaning that you set the number of CPUs and the amount of RAM for each server you launch. 8 CPU / 8 GB or 15 CPU / 200 GB: everything is possible, allowing an accurate composition of your infrastructure.

On top of that, 4 kinds of VM exist:

  • General Purpose (B): Dedicated CPU thread
  • Dedicated (D): Dedicated CPU core (2 threads)
  • Burstable (T): Dedicated CPU thread with extra costs after 10% of utilization
  • Availability (A): Non-dedicated CPU thread with no resources guaranteed

Again, by providing these types of vCPU, Kamatera allows customers to adjust pricing and performance to their workload. No need to pick from a memory-optimized series for your Redis cluster: just design servers fitting your requirements.

Performance insights

We've launched our machinery on their infrastructure, collecting hardware specifications and metrics such as Geekbench scores and CPU steal. From our analysis, we are in a VMware ecosystem with Intel processors. Here's a sample of the chips we discovered across their datacenters:

  • Intel Xeon CPU E5-2620 v2
  • Intel Xeon CPU E5-2660 v3
  • Intel Xeon CPU E5-2697A v4
  • Intel Xeon CPU Gold 6150
  • Intel Xeon CPU Platinum 8270

From the tests run by our automated platform, Kamatera obtains a good set of performance results. Below you'll find graphs representing their 2 CPU / 4 GB VMs against different families at Microsoft Azure. We picked all the different types of vCPU available at Kamatera:

Compared to this well-known big player, Kamatera really performs well. This is just a sample, and our extensive testing reveals that CPU performance increases almost linearly with the number of vCPUs. Moreover, the charts above represent the 4 kinds of vCPU pretty well: Dedicated performs the best, then General Purpose, then Burstable, then Availability.

Beyond this honorable performance, another great characteristic of Kamatera is their aggressive pricing. Although they don't have long-term billing options like 1 or 3 years, their general-purpose hourly rates are still lower than most of the competition. Here's a comparative table with the flavors used above:

Flavor Hourly price Monthly price
2ACPU 4GB 0.022 16.06
2BCPU 4GB 0.053 38.69
2DCPU 4GB 0.088 64.24
2TCPU 4GB 0.022 16.06
Standard_A2_v2 0.076 55.48
Standard_B2s 0.042 30.66
Standard_F2 0.099 72.27
Standard_F2s_v2 0.085 62.05

They also propose monthly billing at the same price as hourly, but 1 TB of outgoing traffic is included with this subscription. With VMs billed hourly, Kamatera charges a worldwide price of $0.01/GB, which is still as little as a tenth of the costs announced by the big players.

Concluding portrait

Kamatera is a good representative of this market segment: very valuable, offering worldwide infrastructure at a decent price. They don't have the plethora of specialized services found at hyperscalers, but their pricing and capabilities can match a great share of budgets and workloads.

To get more insight and create your own comparative charts or tables, I invite you to visit our Price/Performance Portal.

Benchmark floating point computations with Python

Python and its nature

I am fundamentally a Python developer, a fact which means I skipped a bunch of C knowledge. I chose to stop focusing entirely on system performance in order to gain writing and learning velocity. This is the first of many blog posts that will focus on a variety of programming aspects, and I'll start things off by diving into Python. Let's go back 10 years, when I launched my first Python interpreter and typed:

>>> sum([2, 2])
4

It immediately made me think: “So I don't need to compile it?” and “The syntax is soooo clear.” As a sysadmin, it didn't take long before I created my first scripts, then applications: web page scraping, mailing, ncurses menus. Basically, the knight became a blacksmith and could now forge his own swords. To share my armory, I quickly decided to learn the well-known web framework Django, and seeing the learning curve and the quick results I was getting, I apparently made good choices.

Diving into the cobra's performance

Whether you use the snake language or not, you have inevitably heard about its main problem: because it is not a compiled language, Python cannot reach the performance of the fastest languages such as C/C++. This assumption is only partially true, and as a performance benchmarker I always thought that despite the slower performance, I could create anything I needed with Python, so the trade-off was worth it. At least until I tried to write a benchmark tool in Python with a very small amount of overhead. This did not turn out exactly as I had expected.

The purpose of this kind of program is in total contradiction with the average web developer's behavior (cf. “I cache everything”). The idea is generally to perform a precise action somewhere on a system with minimal overhead, and Python by nature didn't seem suited to that. This second statement is not entirely true either: there are plenty of ways to write and use C code in Python, such as Cython, Pyrex or C wrappers. The standard library itself includes more and more C code, and even third-party alternatives exist to speed up parts of your code.

In fact, most general usages already have a C implementation: network, file I/O, regular expressions or TLS, the language has accelerated a lot of areas. What remains are CPU-bound tasks, and there, as in most interpreted languages, the GIL produces some overhead. But keep in mind that this only impacts multi-threaded applications; in a few words, single-threaded code is handsomely optimized, but multi-threaded code suffers from the language design.
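To make that last point concrete, here is a small, hypothetical illustration: a CPU-bound loop gets no speed-up from threads because of the GIL.

    import time
    from threading import Thread

    N = 10_000_000

    def cpu_bound(n):
        total = 0
        for i in range(n):
            total += i * i
        return total

    start = time.perf_counter()
    cpu_bound(N)
    single = time.perf_counter() - start

    # Two threads, half of the work each: the GIL serializes the execution,
    # so the wall-clock time stays roughly the same as the single-thread run.
    threads = [Thread(target=cpu_bound, args=(N // 2,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    multi = time.perf_counter() - start

    print("1 thread: %.2fs, 2 threads: %.2fs" % (single, multi))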

Python for scientists

Through its various qualities, Python has become one of the favorite languages for scientific computing. We can mention NumPy, the foundation of the scientific stack, or Anaconda, the scientific Python distribution, and its package manager conda. The ecosystem is really large, covering many purposes:

  • Pure mathematics
  • Data representation
  • Machine/deep learning
  • Interactive notebook

It is simple to use and lets people produce results quickly, but scientists generally have another main requirement: high computing capacity. Tons of numbers with tons of applied functions, sometimes with tons of dimensions. Scaling generally answers this issue, but let's set aside the multi-tasking subject and focus on single-thread performance. Of course parallelism is part of the real-world landscape, but it requires techniques such as sharding or parallel computing.

To collect performance data, we created a simple tool named FPB: Floating Point Benchmark. It aims to launch different kinds of operations across several Python approaches, for instance computing an average with vanilla CPython or with third-party libraries. This project is free, so feel free to contribute. It will also be the subject of another article.

Below you'll find tables representing the timing of math functions. We observe performance from vanilla Python to NumPy, by way of alternative built-in approaches such as SQLite. Our test environment was the following:

  • T-Systems’ p2.2xlarge.8 powered by KVM
  • Intel Xeon CPU E5-2690 v4
  • 8 vCPU @ 2.6 GHz
  • 64GB of RAM
  • 1x Tesla V100 PCIe
  • FPB with float32

The tables below show the timings for the different operations we tested.
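To give an idea of what FPB measures, here is a simplified sketch, not the actual FPB code, timing the same average over the four back-ends compared below; names and sizes are purely illustrative:

    import sqlite3
    import statistics
    import time

    import numpy as np
    import pandas as pd

    def timed(func, *args):
        start = time.perf_counter()
        func(*args)
        return (time.perf_counter() - start) * 1000  # milliseconds

    size = 100_000
    data = np.random.rand(size).astype("float32")
    series = pd.Series(data)

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (x REAL)")
    db.executemany("INSERT INTO t VALUES (?)", ((float(x),) for x in data))

    print("numpy ", timed(np.mean, data))
    print("pandas", timed(series.mean))
    print("python", timed(statistics.fmean, data.tolist()))
    print("sqlite", timed(lambda: db.execute("SELECT AVG(x) FROM t").fetchone()))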

Numpy Pandas Python SQLite
100 0.015 1.004 0.002 0.030
500 0.013 1.015 0.003 0.056
1000 0.017 1.017 0.006 0.088
5000 0.017 1.019 0.026 0.392
10000 0.023 1.027 0.051 0.623
50000 0.041 1.035 0.245 3.152
100000 0.082 1.086 0.498 5.763
500000 0.260 1.289 2.662 29.889
1000000 0.501 1.533 5.622 60.059
Numpy Pandas Python SQLite
100 0.006 0.238 0.002 0.034
500 0.005 0.239 0.008 0.066
1000 0.006 0.244 0.016 0.091
5000 0.007 0.266 0.074 0.424
10000 0.010 0.284 0.147 0.641
50000 0.025 0.430 0.729 3.336
100000 0.035 0.567 1.446 6.044
500000 0.180 3.243 7.240 31.573
1000000 0.313 4.485 14.695 64.826
Numpy Pandas Python SQLite
100 0.003 0.184 0.014
500 0.010 0.187 0.062
1000 0.018 0.203 0.121
5000 0.087 0.275 0.577
10000 0.158 0.349 1.168
50000 0.999 1.093 6.023
100000 0.813 1.027 13.022
500000 10.135 10.343 66.921
1000000 16.541 16.639 134.221
Numpy Pandas Python SQLite
100 0.010 0.263 0.001 0.029
500 0.009 0.270 0.003 0.057
1000 0.011 0.270 0.006 0.087
5000 0.015 0.293 0.026 0.388
10000 0.027 0.307 0.050 0.598
50000 0.050 0.463 0.246 3.063
100000 0.405 0.620 0.502 5.774
500000 0.403 3.354 2.595 30.432
1000000 1.452 4.825 5.582 59.209
Numpy Pandas Python SQLite
100 0.080 0.341 0.072 0.038
500 0.315 0.593 0.323 0.086
1000 0.610 0.673 0.547 0.149
5000 3.038 2.556 3.008 0.670
10000 5.899 3.875 6.069 1.139
50000 29.400 23.922 28.636 5.818
100000 58.728 50.507 61.103 11.123
500000 294.646 268.885 374.352 57.669
1000000 585.783 509.264 590.154 116.219

Observations:

  • First phenomenon: all methods generally give stable results until they unhook, meaning they aren't designed to handle more data
  • Second phenomenon: if a series doesn't unhook, it may stop earlier, with memory errors when the system isn't able to hold the whole dataset
  • Python isn't always slow: for instance, the built-in sum has an efficient implementation and offers good performance
  • The outsider SQLite can offer good results for multi-dimensional operations, but is slow at or incapable of math operations
  • Pandas being based on NumPy, their performance is equal

From CPU to GPU

In case you didn't know, GPUs are optimized for floating-point computation, and nowadays it's not unusual to see gaming-focused PCs used as high-performance computers. During the last decade, development for this kind of device has been eased a lot, mainly by CUDA: Compute Unified Device Architecture. This technology allows the GPU to be used for general-purpose computing (GPGPU) with the C programming language. And so the Snake comes again: built on CUDA, multiple libraries form a complete Python ecosystem, from statistics to deep learning, where most topics have been implemented with GPU support.

In the landscape of Python and GPU, you'll find several building blocks, such as:

  • CuPy: begun in 2015 as part of Chainer, a neural network framework, this project is an implementation of NumPy using C/CUDA libraries
  • CuDF: a young project from 2017 implementing Pandas using CUDA technologies
  • CUDAMat: a math library using the GPU and compatible with NumPy; its initial release was in 2013
  • GNumpy: a NumPy GPU implementation from the University of Toronto; although this project isn't maintained anymore, it seems to fit our goal
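To give an idea of how little code changes between CPU and GPU, here is a minimal, hypothetical CuPy sketch; it assumes a CUDA-capable GPU and the cupy package installed:

    import time

    import numpy as np
    import cupy as cp

    size = 10_000_000
    cpu_array = np.random.rand(size).astype("float32")
    gpu_array = cp.asarray(cpu_array)             # copy the data to GPU memory

    start = time.perf_counter()
    cpu_result = float(np.sin(cpu_array).mean())
    cpu_time = time.perf_counter() - start

    start = time.perf_counter()
    gpu_result = float(cp.sin(gpu_array).mean())  # converting to float waits for the GPU
    gpu_time = time.perf_counter() - start

    print("CPU %.3fs vs GPU %.3fs, results match: %s"
          % (cpu_time, gpu_time, np.isclose(cpu_result, gpu_result, rtol=1e-4)))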

Here are more results showing the GPU solutions' performance. We keep NumPy to get an idea of CPU vs GPU.

CUDAMat CuPy Numpy PyCUDA
100 0.181 0.111 0.015 0.459
500 0.182 0.136 0.013 0.464
1000 0.187 0.111 0.017 0.457
5000 0.210 0.134 0.017 0.547
10000 0.269 0.109 0.023 0.553
50000 0.246 0.140 0.041 0.534
100000 0.318 0.115 0.082 0.548
500000 0.359 0.290 0.260 0.561
1000000 0.372 0.632 0.501 0.900
5000000 0.529 4.020 0.924
10000000 0.527 8.087 0.940
50000000 0.951 40.224 1.273
100000000 1.477 78.201 1.735
500000000 5.455 391.084 6.220
1000000000 10.486 782.191 9.125
CUDAMat CuPy Numpy PyCUDA
100 0.147 0.092 0.006 0.269
500 0.163 0.119 0.005 0.278
1000 0.145 0.093 0.006 0.264
5000 0.174 0.117 0.007 0.365
10000 0.167 0.085 0.010 0.362
50000 0.255 0.120 0.025 0.373
100000 0.358 0.121 0.035 0.371
500000 1.141 0.361 0.180 0.371
1000000 2.128 0.738 0.313 0.665
5000000 15.193 4.418 0.657
10000000 29.211 8.847 0.679
50000000 142.295 44.706 0.985
100000000 283.188 89.415 1.426
500000000 1411.501 447.529 4.645
1000000000 2821.986 895.139 9.640
CUDAMat CuPy Numpy PyCUDA
100 0.082 0.003
500 0.104 0.010
1000 0.077 0.018
5000 0.103 0.087
10000 0.078 0.158
50000 0.166 0.999
100000 0.162 0.813
500000 0.805 10.135
1000000 1.010 16.541
5000000 4.021
10000000 15.713
50000000 81.924
100000000 170.693
500000000 777.003
1000000000 1623.245
CUDAMat CuPy Numpy PyCUDA
100 0.182 0.096 0.010 0.268
500 0.198 0.119 0.009 0.262
1000 0.172 0.095 0.011 0.269
5000 0.187 0.129 0.015 0.359
10000 0.191 0.089 0.027 0.376
50000 0.226 0.117 0.050 0.370
100000 0.225 0.108 0.405 0.369
500000 0.317 0.286 0.403 0.373
1000000 0.334 0.618 1.452 0.647
5000000 0.445 3.979 0.658
10000000 0.511 8.007 0.675
50000000 0.960 38.784 1.009
100000000 1.496 77.704 1.443
500000000 5.345 389.102 5.063
1000000000 10.376 777.999 9.623
CUDAMat CuPy Numpy PyCUDA
100 0.183 0.103 0.080
500 0.195 0.134 0.315
1000 0.226 0.101 0.610
5000 0.309 0.129 3.038
10000 0.311 0.249 5.899
50000 0.442 0.270 29.400
100000 0.443 2.395 58.728
500000 0.777 3.215 294.646
1000000 1.130 19.542 585.783
5000000 3.175 31.483
10000000 5.774 63.104

Observations:

  • The GPU unhooks far beyond the point where the CPU hits memory errors
  • Small datasets (<100K) don't really require a GPU
  • Most of the frameworks are able to handle a billion data points in reasonable time
  • Some frameworks are still stable at that point and could handle more if they had more GPU RAM
  • The CPU is simply out of the race for multi-dimensional arrays

Observations:

  • Despite having less RAM than the system, the GPU frameworks can handle more data than the CPU
  • For simple one-dimensional operations, NumPy is faster than the others
  • GPU implementations aren't equal: depending on the operation, each has its pluses and minuses

Distributed computing

Yes, I wrote that multi-processing wasn't the goal of this article, but several solutions deserve a few lines, so I'll talk about:

  • Dask: enables scaling for the main scientific computing frameworks
  • PySpark: the Python API for Spark, a big data analysis engine

With these approaches, data is generally chunked and computations are distributed across threads, processes or nodes. This has several implications:

  • Overhead is produced by inter-process communication, and it becomes more significant over TCP/IP
  • It's up to the developer to find the best way to parallelize the computation for their application: operation scheduling, chunk size, nothing is as straightforward as NumPy and everything requires adapting to the platform
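For the curious, a minimal Dask sketch looks like the following; it is a hypothetical example, not the FPB implementation. The array is cut into chunks and nothing is computed until .compute() is called.

    import dask.array as da

    # 100 million random floats, split into chunks of 1 million elements
    x = da.random.random(100_000_000, chunks=1_000_000)

    # Building the task graph is cheap; compute() schedules the chunked work
    # across the available threads (or processes, or cluster nodes)
    result = x.mean().compute()
    print(result)
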
CuPy Dask Dask+CuPy Numpy Spark
100 0.111 6.991 0.015 214.231
500 0.136 6.783 0.013 214.696
1000 0.111 6.888 0.017 212.697
5000 0.134 7.303 0.017 209.330
10000 0.109 6.611 0.023 218.304
50000 0.140 6.832 0.041 217.465
100000 0.115 6.819 0.082 273.697
500000 0.290 7.717 0.260 332.148
1000000 0.632 7.209 0.501
5000000 4.020 6.950
10000000 8.087 7.594
50000000 40.224 8.729
100000000 78.201 10.981
500000000 391.084 26.666
1000000000 782.191 45.295
CuPy Dask Dask+CuPy Numpy Spark
100 0.092 4.010 6.030 0.006 214.424
500 0.119 3.969 6.586 0.005 202.505
1000 0.093 5.334 6.053 0.006 205.601
5000 0.117 3.800 6.080 0.007 211.439
10000 0.085 5.008 5.944 0.010 218.309
50000 0.120 3.300 6.116 0.025 213.221
100000 0.121 11.348 6.505 0.035 232.708
500000 0.361 10.884 6.620 0.180
1000000 0.738 17.597 6.370 0.313
5000000 4.418 6.226
10000000 8.847 6.459
50000000 44.706 7.992
100000000 89.415 9.817
500000000 447.529 24.648
1000000000 895.139 42.768
CuPy Dask Dask+CuPy Numpy Spark
100 0.082 3.588 5.967 0.003 168.582
500 0.104 3.590 5.382 0.010 170.114
1000 0.077 4.558 5.301 0.018 174.239
5000 0.103 3.575 5.321 0.087 202.543
10000 0.078 4.504 5.381 0.158 198.266
50000 0.166 3.944 5.349 0.999 440.174
100000 0.162 9.210 5.394 0.813 590.974
500000 0.805 16.254 5.575 10.135
1000000 1.010 24.246 5.469 16.541
5000000 4.021 5.538
10000000 15.713 5.792
50000000 81.924 7.541
100000000 170.693 9.235
500000000 777.003 24.362
1000000000 1623.245 42.894
CuPy Dask Dask+CuPy Numpy Spark
100 0.096 4.260 6.931 0.010 216.131
500 0.119 4.109 6.589 0.009 201.460
1000 0.095 5.564 7.343 0.011 212.060
5000 0.129 3.873 6.847 0.015 221.375
10000 0.089 5.177 6.931 0.027 218.741
50000 0.117 3.360 6.800 0.050 216.568
100000 0.108 11.447 6.763 0.405 265.191
500000 0.286 11.154 7.296 0.403 338.511
1000000 0.618 18.069 7.232 1.452
5000000 3.979 6.986
10000000 8.007 7.221
50000000 38.784 8.990
100000000 77.704 10.670
500000000 389.102 26.002
1000000000 777.999 45.081
CuPy Dask Dask+CuPy Numpy Spark
100 0.103 515.880 91.227 0.080 629.306
500 0.134 584.842 95.750 0.315 644.168
1000 0.101 473.285 95.494 0.610 650.786
5000 0.129 637.091 94.269 3.038 742.905
10000 0.249 532.024 82.024 5.899 782.469
50000 0.270 733.753 93.942 29.400 1442.649
100000 2.395 709.022 95.728 58.728 2469.502
500000 3.215 962.462 204.780 294.646
1000000 19.542 948.553 207.883 585.783
5000000 31.483 488.644
10000000 63.104 853.314

Observations

  • As I played with a single host, we cannot appreciate the real benefits of these frameworks
  • We observe a minimum overhead of 6 ms going from CuPy to Dask+CuPy
  • PySpark has an overhead of around 200 ms, making it unsuitable for our tests
  • Moreover, PySpark doesn't seem to handle memory as well as vanilla Python does

Of course, PySpark is here just for the experiment; in my mind, the small implementation I used isn't representative of real usage. Spark is clearly in the big data field, where even handling 1 trillion items would be a common task. Furthermore, a single-host Spark is... hum... a joke.

Conclusion

Here my GPU has 4 times less RAM than the system, but we can see that it can handle 1,000 times more data, from 1M to 1G items. Adding the multi-dimensional advantage, we can conclude without doubt that the GPU is superior. But given its price, another question arises:

When should I choose GPU instead of CPU?

From our results, NumPy doesn't perform badly compared to the GPU solutions; the real problem here is memory allocation. There is just not enough RAM to run the test until it unhooks. 2D and complex operations such as sine are slower but acceptable. At this point we can say that 1D arithmetic on fewer than 1M data points seems to be the workload best suited to the CPU.

These basic operations can't perfectly reflect end usage the way a real machine learning framework could; FPB's goal is to understand the performance of these elementary tasks. A future machine learning benchmark will fill that gap.