Moving to AWS Graviton. Why and How?
- Oleksii Bebych
- Mar 15, 2024
- 8 min read
Updated: Apr 21, 2024
AWS continuously improves its cloud services and introduces new hardware generations, yet customers are usually in no rush to move to newer instance generations. AWS documentation states that newer generations are more powerful and cheaper, but what is the difference in numbers? In this post, I compare four generations of the general-purpose M instance type to show the difference in performance and price.
Comparing M4, M5, M6g and M7g instances
Four generations of the same instance type and family will be compared; all have 2 vCPUs and 8 GiB of RAM:
I first checked the price (in the us-east-1 region) and measured network performance via Speedtest:
### Test for m6g.large
# curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -
Retrieving speedtest.net configuration...
Testing from Amazon.com (52.205.53.191)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by eero (Ashburn, VA) [0.81 km]: 1.447 ms
Testing download speed................................................................................
Download: 3633.89 Mbit/s
Testing upload speed......................................................................................................
Upload: 3298.03 Mbit/s
Here is a table combining AWS-provided data with my first findings:
| Instance Size / Gen | vCPU | Memory (GiB) | Instance Storage | Network Bandwidth (Gbps) | Speedtest (approx. Mbit/s) | EBS Bandwidth (Gbps) | Hourly Price, $ (us-east-1) | AWS Declares |
|---|---|---|---|---|---|---|---|---|
| m4.large | 2 | 8 | EBS-only | Moderate | 500 | 0.45 | 0.10 | - |
| m5.large | 2 | 8 | EBS-only | Up to 10 | 3000 | Up to 4.75 | 0.096 | Up to 20% improvement in price/performance compared to M4 instances |
| m6g.large | 2 | 8 | EBS-only | Up to 10 | 3500 | Up to 4.75 | 0.077 | Up to 40% better price/performance over M5 instances |
| m7g.large | 2 | 8 | EBS-only | Up to 12.5 | 5000 | Up to 10 | 0.0816 | Up to 25% better performance over the sixth-generation AWS Graviton2-based M6g instances; DDR5 memory with 50% higher memory bandwidth than DDR4; 20% higher enhanced networking bandwidth compared to M6g instances |
Price difference
The price difference between M4 and M7g is about 18%.
M7g is slightly more expensive than M6g because M7g uses newer DDR5 memory instead of the DDR4 used in M6g.
M7g instances feature Double Data Rate 5 (DDR5) memory, which provides 50% higher memory bandwidth compared to DDR4 memory to enable high-speed access to data in memory.
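These percentages can be sanity-checked directly from the hourly prices in the table above; the short Python snippet below does the arithmetic (instance names and prices are taken from the table, nothing else is assumed):

```python
# Hourly on-demand prices, us-east-1, from the comparison table above
prices = {
    "m4.large": 0.10,
    "m5.large": 0.096,
    "m6g.large": 0.077,
    "m7g.large": 0.0816,
}

def savings_vs(base, other):
    """Percent saved by choosing `other` instead of `base`."""
    return round((prices[base] - prices[other]) / prices[base] * 100, 1)

print(savings_vs("m4.large", "m7g.large"))  # 18.4 - M7g vs M4
print(savings_vs("m5.large", "m6g.large"))  # 19.8 - M6g is ~20% cheaper than M5
```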
Network performance
AWS categorizes network performance for some instances with qualitative descriptors like "Low," "Moderate," "High," etc., rather than specifying exact numerical bandwidth values. For "Moderate" network performance, AWS does not publicly disclose precise bandwidth figures, as the actual throughput can vary based on multiple factors, including network congestion and the instance's physical location.
The Speedtest utility was used to obtain concrete numbers. Network performance increased significantly with each generation:
CPU performance check
Sysbench was used to test the CPU and memory performance.
Sysbench is a scriptable multi-threaded benchmark tool based on LuaJIT. It is most frequently used for database benchmarks but can also create arbitrarily complex workloads that do not involve a database server.
Sysbench comes with the following bundled benchmarks:
oltp_*.lua: a collection of OLTP-like database benchmarks
fileio: a filesystem-level benchmark
cpu: a simple CPU benchmark
memory: a memory access benchmark
threads: a thread-based scheduler benchmark
mutex: a POSIX mutex benchmark
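To make the `cpu` test concrete: sysbench's CPU benchmark counts how many "events" complete per second, where each event verifies the primality of every number up to the limit reported as "Prime numbers limit: 10000". Here is a simplified Python sketch of one such event (an approximation for illustration, not sysbench's actual C implementation):

```python
import math
import time

def cpu_event(max_prime=10000):
    """One sysbench-style CPU event: test 3..max_prime for primality by trial division."""
    primes = 0
    for c in range(3, max_prime + 1):
        is_prime = True
        for i in range(2, math.isqrt(c) + 1):
            if c % i == 0:
                is_prime = False
                break
        if is_prime:
            primes += 1
    return primes

# Count how many events complete in one second (sysbench runs for 10 s by default)
start = time.perf_counter()
events = 0
while time.perf_counter() - start < 1.0:
    cpu_event()
    events += 1
print(f"events per second: {events}")
```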
How to install the tool on Amazon Linux 2023:
yum -y install make automake libtool pkgconfig libaio-devel
yum -y install openssl-devel
sudo wget https://dev.mysql.com/get/mysql80-community-release-el9-1.noarch.rpm
sudo dnf install mysql80-community-release-el9-1.noarch.rpm -y
sudo rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2023
sudo dnf install mysql-community-client -y
sudo dnf install mysql-devel -y
git clone https://github.com/akopytov/sysbench.git
cd sysbench
./autogen.sh
./configure
make -j
make install
M4 instance CPU / Memory test
Info about the CPU:
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 1
BogoMIPS: 4599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
Virtualization features:
Hypervisor vendor: Xen
Virtualization type: full
Caches (sum of all):
L1d: 32 KiB (1 instance)
L1i: 32 KiB (1 instance)
L2: 256 KiB (1 instance)
L3: 45 MiB (1 instance)
This will run a single-threaded CPU benchmark.
$ sysbench cpu run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Prime numbers limit: 10000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 757.73
Throughput:
events/s (eps): 757.7278
time elapsed: 10.0010s
total number of events: 7578
Latency (ms):
min: 1.30
avg: 1.32
max: 1.68
95th percentile: 1.34
sum: 9987.17
Threads fairness:
events (avg/stddev): 7578.0000/0.00
execution time (avg/stddev): 9.9872/0.00
One more test with 16 threads:
# sysbench --threads=16 cpu run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 16
Initializing random number generator from current time
Prime numbers limit: 10000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 1270.80
Throughput:
events/s (eps): 1270.7967
time elapsed: 10.0055s
total number of events: 12715
Latency (ms):
min: 1.55
avg: 12.52
max: 141.00
95th percentile: 71.83
sum: 159180.16
Threads fairness:
events (avg/stddev): 794.6875/7.86
execution time (avg/stddev): 9.9488/0.04
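The "Threads fairness" block above is easy to decode: the events average is just the total event count divided by the thread count, and the standard deviation shows how evenly the scheduler spread work across threads. Checking with the 16-thread numbers:

```python
# Numbers from the 16-thread CPU run above
total_events = 12715
threads = 16

avg_events = total_events / threads
print(avg_events)  # 794.6875, matching "events (avg/stddev): 794.6875/7.86"
```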
Test memory (single thread):
$ sysbench memory run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 4285343 (428530.63 per second)
4184.91 MiB transferred (418.49 MiB/sec)
Throughput:
events/s (eps): 428530.6314
time elapsed: 10.0001s
total number of events: 4285343
Latency (ms):
min: 0.00
avg: 0.00
max: 0.15
95th percentile: 0.00
sum: 3419.16
Threads fairness:
events (avg/stddev): 4285343.0000/0.00
execution time (avg/stddev): 3.4192/0.00
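The reported memory throughput follows directly from the run parameters: each event writes one 1 KiB block, so total MiB transferred is the operation count divided by 1024, and MiB/sec is that divided by the elapsed time:

```python
# Numbers from the single-thread memory run above
ops = 4285343        # total operations (1 KiB writes)
elapsed = 10.0001    # seconds

mib_transferred = ops / 1024                # 1 KiB blocks -> MiB
print(round(mib_transferred, 2))            # 4184.91 MiB, as reported
print(round(mib_transferred / elapsed, 2))  # 418.49 MiB/sec, as reported
```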
Test memory (16 threads):
$ sysbench --threads=16 memory run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 16
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 5716923 (571674.03 per second)
5582.93 MiB transferred (558.28 MiB/sec)
Throughput:
events/s (eps): 571674.0298
time elapsed: 10.0003s
total number of events: 5716923
Latency (ms):
min: 0.00
avg: 0.01
max: 140.03
95th percentile: 0.00
sum: 54925.26
Threads fairness:
events (avg/stddev): 357307.6875/2433.99
execution time (avg/stddev): 3.4328/0.25
MUTEX benchmark
A mutex benchmark evaluates mutex implementations' performance, scalability, and overhead in a multi-threaded environment. The primary goal is to measure how efficiently a mutex can manage access to shared resources by multiple threads, especially under heavy concurrency.
Throughput refers to the number of operations (or events) completed within a given time frame when the mutex synchronizes access to shared resources. Higher throughput indicates better performance under concurrent access.
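The same idea can be illustrated with Python's standard `threading.Lock` (a minimal sketch for intuition only, not sysbench's benchmark): several threads contend for one lock, correctness is verified via the shared counter, and throughput is the number of completed lock/unlock cycles per second:

```python
import threading
import time

LOCK = threading.Lock()
ITERATIONS = 10_000
NUM_THREADS = 4
counter = 0

def worker():
    global counter
    for _ in range(ITERATIONS):
        with LOCK:  # all threads contend for the same mutex
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# The lock kept the shared counter consistent under concurrency
print(f"counter: {counter}, lock cycles/s: {counter / elapsed:,.0f}")
```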
$ sysbench mutex run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Initializing worker threads...
Threads started!
Throughput:
events/s (eps): 4.4504
time elapsed: 0.2247s
total number of events: 1
Latency (ms):
min: 224.58
avg: 224.58
max: 224.58
95th percentile: 223.34
sum: 224.58
Threads fairness:
events (avg/stddev): 1.0000/0.00
execution time (avg/stddev): 0.2246/0.00
M5 instance CPU / Memory test
Info about the CPU:
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
BIOS Model name: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 4
BogoMIPS: 4999.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 32 KiB (1 instance)
L1i: 32 KiB (1 instance)
L2: 1 MiB (1 instance)
L3: 33 MiB (1 instance)
The full sysbench output is omitted because all details will be provided in a table and graphs later:
$ sysbench cpu run
CPU speed:
events per second: 1064.75
$ sysbench --threads=16 cpu run
CPU speed:
events per second: 1671.36
M6g instance CPU / Memory test
Info about the CPU:
# lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: ARM
BIOS Vendor ID: AWS
Model name: Neoverse-N1
BIOS Model name: AWS Graviton2
Model: 1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
L1d: 128 KiB (2 instances)
L1i: 128 KiB (2 instances)
L2: 2 MiB (2 instances)
L3: 32 MiB (1 instance)
The full sysbench output is omitted because all details will be provided in a table and graphs later:
$ sysbench cpu run
CPU speed:
events per second: 2853.55
$ sysbench --threads=16 cpu run
CPU speed:
events per second: 5696.65
M7g instance CPU / Memory test
Info about the CPU:
# lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: ARM
BIOS Vendor ID: AWS
BIOS Model name: AWS Graviton3
Model: 1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
Caches (sum of all):
L1d: 128 KiB (2 instances)
L1i: 128 KiB (2 instances)
L2: 2 MiB (2 instances)
L3: 32 MiB (1 instance)
The full sysbench output is omitted because all details will be provided in a table and graphs later:
$ sysbench cpu run
CPU speed:
events per second: 3024.28
$ sysbench --threads=16 cpu run
CPU speed:
events per second: 6044.47
Benchmark results
Here is a table collecting all the results from the two experiments (single thread and 16 threads) for the four instance generations (M4, M5, M6g, and M7g):
| Instance Family | Instance Size | CPU, 1 thread (events/s) | Memory, 1 thread (events/s) | Memory, 1 thread (MiB/s) | Mutex, 1 thread (events/s) | CPU, 16 threads (events/s) | Memory, 16 threads (events/s) | Memory, 16 threads (MiB/s) | Mutex, 16 threads (events/s) |
|---|---|---|---|---|---|---|---|---|---|
| M4 | m4.large | 757.73 | 428530.63 | 418.49 | 4.45 | 1270.80 | 571674.03 | 558.28 | 4.51 |
| M5 | m5.large | 1064.75 | 5774973.91 | 5639.62 | 6.07 | 1671.36 | 9205780.94 | 8990.02 | 6.12 |
| M6g | m6g.large | 2853.55 | 5020851.87 | 4903.18 | 4.28 | 5696.65 | 3973599.35 | 3880.47 | 8.34 |
| M7g | m7g.large | 3024.28 | 5570464.39 | 5439.91 | 5.13 | 6044.47 | 5794674.12 | 5658.86 | 9.88 |
CPU results show a significant performance increase, but the memory test shows a curious result (M5 is the best).
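From the single-thread CPU column, the generational speedups relative to M4 can be computed directly (numbers taken from the results table above):

```python
# Single-thread CPU results (events/s) from the results table
cpu_single_thread = {"m4": 757.73, "m5": 1064.75, "m6g": 2853.55, "m7g": 3024.28}

base = cpu_single_thread["m4"]
for gen, eps in cpu_single_thread.items():
    print(f"{gen}: {eps / base:.2f}x vs m4")
# Graviton3 (m7g) works out to roughly a 4x single-thread CPU speedup over m4
```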
Consideration for migration to Graviton
The tests showed a significant increase in CPU and network performance on Graviton instances, along with some cost savings.
AWS Graviton is a family of processors designed to deliver the best price performance for your cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2).
AWS Graviton-based instances cost up to 20% less than comparable x86-based Amazon EC2 instances.
AWS Graviton-based instances use up to 60% less energy than comparable EC2 instances.
Is your application ready to run on ARM?
The "Porting Advisor for Graviton" tool analyzes source code for known code patterns and dependency libraries, then generates a report listing any incompatibilities with Graviton processors. It also suggests the minimum required and/or recommended versions of language runtimes and dependency libraries for running on Graviton instances.
Currently, the tool supports the following languages/dependencies:
Python 3+
Java 8+
Go 1.11+
C, C++, Fortran
You can run it as a Docker container. This option eliminates the need to worry about Python or Java versions or any other dependency that the tool needs, and it is the quickest way to get started:
docker build -t porting-advisor .
docker run --rm -v my/repo/path:/repo -v my/output:/output porting-advisor /repo --output /output/report.html
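Complementing the static scan, a quick runtime check (plain standard-library Python, shown here as an illustration) reports which architecture your code is actually executing on; this is handy in build or deployment scripts that pick architecture-specific artifacts:

```python
import platform

arch = platform.machine()  # e.g. "x86_64" on Intel/AMD, "aarch64" on Graviton
if arch in ("aarch64", "arm64"):
    print("Running on ARM (e.g., an AWS Graviton instance)")
elif arch == "x86_64":
    print("Running on x86_64")
else:
    print(f"Other architecture: {arch}")
```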
The tool was then run against sample Python, Java, and Go code.
PLEASE NOTE: Even though the tool does its best to find known incompatibilities, it's still recommended that you perform the appropriate tests on your application on a Graviton instance before going to Production.
Conclusion
Graviton instances look great. They are much more powerful and a bit cheaper than previous generations. In this post, I tested CPU, Memory, and Network performance for M4, M5, M6g, and M7g instances, compared costs, built graphs for visibility, and demonstrated a tool that can help you with the preliminary assessment of how ready your applications are for running on ARM instances.