Public Cloud MPI Network Benchmark Roundup

Over the years we have published a number of blog posts running MPI microbenchmarks against the offerings from the major public cloud providers. All of these providers have made significant networking improvements during that time, so we thought it would be useful to rerun these microbenchmarks against the latest generation of VMs. In particular, AWS has released a new version of “Enhanced Networking” that supports up to 20Gbps, and Azure has released the H-series family of VMs, which offers virtualized FDR InfiniBand.

My colleague Irwen recently ran the point-to-point latency (osu_latency) and bisection bandwidth (osu_bibw) tests from the OSU Micro-Benchmarks suite (version 5.3.2) against a number of different VM types on Google Compute Engine. For consistency, we’ll use the same suite here with Azure and AWS. The table below includes the best performing machine from Irwen’s post, the n1-highmem-32. The c4.8xlarge represents an AWS VM type from the previous Enhanced Networking generation, while the m4.32xlarge runs the newer version of Enhanced Networking.
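
To make it concrete what the latency test measures, here is a minimal MPI ping-pong sketch in the spirit of osu_latency. This is not the OSU source; the warm-up and iteration counts are arbitrary choices. Rank 0 bounces a zero-byte message off rank 1 and reports half of the average round-trip time, which is what the 0-byte latency column below captures.

```c
/*
 * Minimal MPI ping-pong latency sketch, in the spirit of osu_latency.
 * Not the OSU source; WARMUP and ITERS are arbitrary illustrative values.
 * Build: mpicc -O2 pingpong.c -o pingpong
 * Run:   mpirun -np 2 -hostfile hosts ./pingpong
 */
#include <mpi.h>
#include <stdio.h>

#define WARMUP 100
#define ITERS  10000

int main(int argc, char **argv) {
    int rank, size;
    char buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "Run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    double start = 0.0;
    for (int i = 0; i < WARMUP + ITERS; i++) {
        /* Start the clock only after the warm-up rounds. */
        if (i == WARMUP) {
            MPI_Barrier(MPI_COMM_WORLD);
            start = MPI_Wtime();
        }
        if (rank == 0) {
            MPI_Send(&buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        /* One-way latency is half the average round-trip time. */
        printf("0-byte latency: %.2f us\n", elapsed * 1e6 / (2.0 * ITERS));
    }

    MPI_Finalize();
    return 0;
}
```

The actual OSU binaries are launched in much the same way, e.g. something like mpirun -np 2 -hostfile hosts ./osu_latency, with one rank on each VM in the pair.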

In the table below, we list the results averaged over 3 trials. A new pair of VMs was provisioned from scratch for each trial:

                        0-byte latency (us)   1MB bisection bandwidth (MB/s)
GCE (n1-highmem-32)            41.04                     1076
AWS (c4.8xlarge)               37.07                     1176
AWS (m4.32xlarge)              32.43                     1152
Azure (H16r)                    2.63                    10807
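
For the bandwidth column, osu_bibw keeps traffic flowing in both directions at once. The sketch below loosely models that measurement loop; it is not the OSU implementation, and the 64-message window and iteration count are illustrative. Each rank posts a window of non-blocking receives and sends, waits for them all to complete, and the reported figure counts the bytes moved in both directions.

```c
/*
 * Bidirectional bandwidth sketch, loosely modeled on osu_bibw.
 * Not the OSU implementation; MSG_SIZE, WINDOW, and ITERS are illustrative.
 * Build: mpicc -O2 bibw.c -o bibw
 * Run:   mpirun -np 2 -hostfile hosts ./bibw
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)   /* 1 MB messages, matching the table above */
#define WINDOW   64          /* messages in flight per direction */
#define ITERS    100

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "Run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    int peer = 1 - rank;

    char *sbuf = malloc(MSG_SIZE);
    char *rbuf = malloc((size_t)MSG_SIZE * WINDOW);  /* one slot per pending receive */
    MPI_Request reqs[2 * WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        /* Post receives and sends together so data flows both ways at once. */
        for (int w = 0; w < WINDOW; w++)
            MPI_Irecv(rbuf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR, peer, 0,
                      MPI_COMM_WORLD, &reqs[w]);
        for (int w = 0; w < WINDOW; w++)
            MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 0,
                      MPI_COMM_WORLD, &reqs[WINDOW + w]);
        MPI_Waitall(2 * WINDOW, reqs, MPI_STATUSES_IGNORE);
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0) {
        /* Count bytes moved in both directions; MB = 1e6 bytes here. */
        double bytes = 2.0 * (double)MSG_SIZE * WINDOW * ITERS;
        printf("1MB bidirectional bandwidth: %.0f MB/s\n", bytes / elapsed / 1e6);
    }

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```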

As you might expect, the Azure H-series VMs seriously outpace the non-InfiniBand equipped competition in these tests. One of the frequent criticisms levied against using the public cloud for HPC is that networking performance is not up to the task of running a tightly-coupled workload. Microsoft’s Azure has shown that it is possible to run a virtualized high-performance networking fabric at hyperscale.

That said, while this is interesting from a raw networking performance perspective, it is important not to put too much stock in synthetic benchmarks like these. Application benchmarks are generally a much better representation of real-world performance, and it is certainly possible to achieve strong scaling with some CFD solvers over virtualized 10GigE. AWS has published STAR-CCM+ benchmarks showing close to linear scaling on a 16M cell model on runs up to 700 MPI processes. Microsoft has also published STAR-CCM+ benchmarks showing close to linear scaling on up to 1,024 MPI processes with an older generation of InfiniBand-equipped VMs (note that this is not an apples-to-apples comparison because Microsoft used a larger 100M cell model in their tests). It’s also important to highlight that specialized networking fabric typically comes at a higher price point.

Additionally, keep in mind that network speed is just one dimension of performance. Disk I/O, RAM, CPU core count and generation, as well as the type of simulation and model size, all need to be taken into consideration when deciding which hardware profiles to use. One of the advantages of a multi-cloud platform like Rescale’s ScaleX Platform is that it makes it easy to run benchmarks, and more importantly enterprise HPC workloads, across a variety of hardware configurations simply by changing the core type in your job submission request.

Finally, it is impressive to note how far things have come since the original Magellan report. There is a fierce battle going on right now between the public cloud heavyweights, and we are starting to see hardware refresh cycles that include not only high-performance interconnects but also modern CPU generations (Skylake), along with GPU and FPGA availability at large scale. The “commodity” public cloud is increasingly viable for a growing number of HPC workloads.