Investigating Applications on the A64FX

Adrian Jackson∗, Michèle Weiland, Nick Brown, Andrew Turner, Mark Parsons
EPCC, The University of Edinburgh, Edinburgh, United Kingdom
∗Email: [email protected]
Abstract—The A64FX processor from Fujitsu, designed for computational simulation and machine learning applications, has the potential for unprecedented performance in HPC systems. In this paper, we evaluate the A64FX by benchmarking against a range of production HPC platforms that cover a number of processor technologies. We investigate the performance of complex scientific applications across multiple nodes, as well as single node and mini-kernel benchmarks. This paper finds that the performance of the A64FX processor across our chosen benchmarks often significantly exceeds other platforms, even without specific application optimisations for the processor instruction set or hardware. However, this is not true for all the benchmarks we have undertaken. Furthermore, the specific configuration of applications can have an impact on the runtime and performance experienced.
I. INTRODUCTION
There is a long history of utilisation of traditional x86 architectures from processor manufacturers such as Intel and AMD for computational simulation and machine learning applications. However, we are now entering a period in which there has been a significant increase in the alternative processor technologies available for a wide range of tasks. Whilst Arm-based processors have been commonplace in the mobile and low-power marketplaces, a new range of Arm-based processor designs is now reaching maturity for server-class applications; foremost amongst these are Arm-based processors from manufacturers such as Marvell (ThunderX2), Ampere (eMAG), Huawei (Kunpeng 920), Fujitsu (A64FX) and Amazon (Graviton, Graviton2).

Arm-based processor designs provide manufacturers and technology companies with the ability to customise processor architectures for specific workloads or requirements, and to produce custom processors at volume much more affordably than was previously possible. Leading the way is Fujitsu with the A64FX processor. Designed in collaboration with RIKEN and the Japanese research community, this processor is heavily focused on a range of computational simulation and machine learning applications important to the Japanese research community.

Having recently debuted in the Fugaku supercomputer, the current number one system on the Top500 list [1], which impressively entered the list at over two times the performance of the previous number one system, the A64FX has also demonstrated impressive results in efficient computing, with a Green500 rating of 16.876 GFLOPS/watt [2].

However, performance is only one aspect of what is required to deliver a usable computing platform for varied workloads. Operating system, batch system, compiler, and library support are all required to provide a usable system and to ensure applications can be easily ported to such new hardware, as well as efficiently exploit it.

In this paper we evaluate a range of common computational simulation applications on an HPC system with A64FX processors connected with the TofuD network [3]. Our paper makes the following contributions to deepening the understanding of the performance of novel HPC processors, the usability of such systems, and the suitability of building a production HPC system based on the Fujitsu and Arm ecosystem:
1) We present and evaluate the multi-node performance of scientific applications with varying performance characteristics and compare it to the established Arm and x86 ecosystems.
2) We evaluate the ease of porting applications onto this new system, compared with equivalent systems based on other processor technologies.
3) We discuss the causes for the performance and scalability results that have been observed, and based on this we draw conclusions with regards to the maturity of this new wave of Arm-based systems for HPC.

II. RELATED WORK
The first Arm instruction set server-class processor to be widely available for typical HPC applications was the ThunderX2 processor from Marvell. The ThunderX2 uses the Armv8 instruction set and has been designed specifically for server workloads, although it does not implement the new Arm SVE vectorisation instruction set. The design includes eight DDR4 memory channels to deliver measured STREAM triad memory bandwidth in excess of 240 GB/s per dual-socket node, significantly greater than the comparable Intel processors available at the time. Evaluations of scientific applications on that platform have been documented [4] [5], demonstrating comparable performance for a range of applications when compared to similar Intel processors. This paper extends that work by adding the novel A64FX processor architecture, and by expanding the range of applications, as well as the systems, benchmarked.

This paper investigates the performance of distributed memory communications (MPI), as well as scientific applications that use MPI, on an A64FX based system, together with the associated libraries required for functionality and performance. It exploits the on-board high-bandwidth memory (HBM) for application performance; the A64FX is the first CPU-class processor to be widely available with such memory. As such, the paper presents some of the first results for user applications on both the A64FX processor and exploiting HBM.

HPC platforms have long been evaluated using a wide range of benchmarks, each targeting a different performance aspect; popular benchmark suites include [6] [7] [8]. These include application specific benchmarks [9], and have included benchmarking applications across multiple systems [10]. In this paper we follow these common benchmarking approaches to evaluate the performance of the A64FX processor against a set of other established systems using a range of different benchmarks.

The A64FX processor has been widely described [11] [12]. It consists of four Core Memory Groups (CMGs) connected together with an on-chip network (a ring bus). Each CMG has 12 cores available to the user (a 13th core is present, provided to run the operating system and associated functionality), a coherent shared L2 cache (8 MiB per CMG, 32 MiB per processor) and a memory controller. HBM is directly attached to the processor, with 8 GiB attached to each CMG, providing 256 GB/s of bandwidth per CMG, or approximately 1 TB/s of bandwidth for the whole processor.
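On Linux, each CMG and its attached HBM are usually exposed as a separate NUMA domain (an assumption about the operating system configuration rather than something stated above), so the per-CMG memory layout can be confirmed with a short libnuma query such as the sketch below before pinning processes to CMGs:

    /* Sketch: report NUMA (CMG) domains and their memory sizes.
     * Assumes each A64FX CMG appears as one NUMA node with ~8 GiB of HBM.
     * Build with: cc cmg_layout.c -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA support not available\n");
            return 1;
        }
        int nodes = numa_num_configured_nodes();  /* expect 4 on an A64FX node */
        for (int n = 0; n < nodes; n++) {
            long long free_b = 0;
            long long size_b = numa_node_size64(n, &free_b);
            printf("NUMA node %d: %.1f GiB total, %.1f GiB free\n",
                   n, size_b / (double)(1LL << 30), free_b / (double)(1LL << 30));
        }
        return 0;
    }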
III. BENCHMARKING METHODOLOGY

In order to fully evaluate the performance of the A64FX, we execute a range of benchmarks and applications that rely on the performance of different aspects of the architecture, i.e. memory bandwidth, floating point performance, network performance, etc., and compare our results with other production HPC systems. Our benchmarking methodology adheres to the following principles:
a) Reproducibility: We use process and thread pinning to cores to ensure our results are not impacted or skewed by the operating system's process/thread management policies, and are reproducible. We also list the compiler versions and flags, as well as the libraries used, in Table II. Benchmarks are run multiple times and any performance variation outside 5% of the average runtime is noted in the results.
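As an illustration of the kind of pinning check we rely on (a small sketch written for this paper, not a tool from the benchmark suites), the hybrid MPI+OpenMP snippet below reports the core each thread is running on, which makes any unintended migration by the operating system immediately visible:

    /* Sketch: report the core each MPI rank / OpenMP thread is bound to.
     * Build (for example): mpicc -fopenmp pin_check.c */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sched.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* sched_getcpu() returns the core the calling thread last ran on */
            printf("rank %d thread %d on core %d\n",
                   rank, omp_get_thread_num(), sched_getcpu());
        }

        MPI_Finalize();
        return 0;
    }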
b) Applications: The benchmarks and applications chosen for this investigation cover different scientific domains, programming languages, libraries and performance characteristics. They also represent widely used real-life applications. As we are primarily interested in the single node and multi-node performance of the applications, we disabled or reduced output I/O as much as possible to ensure that the I/O characteristics of the various systems considered do not dominate the observed performance.
c) Multi-node benchmarks: A range of node counts is used for most benchmarks, from 1 up to 16, allowing for the assessment and evaluation of any performance bottlenecks or benefits caused by the network or the communication libraries.
d) Performance comparison: A64FX results are compared with those from a range of different HPC systems in order to assess the relative performance. The results are generally compared on a per-node basis (rather than per-core or per-process) using the same benchmark configurations.

IV. BENCHMARKING SYSTEMS
The system under evaluation, a Fujitsu system containing A64FX processors, is compared against well-established HPC system architectures. Details on the systems used for this performance evaluation activity are given below, and Table I summarises the compute node specifications.
a) A64FX: The A64FX test system has 48 compute nodes, each with a single A64FX processor with 48 cores available to users, running at 2.2 GHz. The processor has 32 GB of HBM, and nodes are connected with the TofuD network.
b) ARCHER: This Cray XC30 system has 24 cores per node (two 12-core Intel Xeon E5-2697v2 processors running at 2.7 GHz) and 64 GB of DDR3 memory per node (128 GB on a small number of large memory nodes). Nodes are connected by the Cray Aries network.
c) Cirrus: This SGI ICE XA system has compute nodes each with two 2.1 GHz, 18-core Intel Xeon E5-2695 (Broadwell) series processors. They have 256 GB of memory shared between the two processors. The system has a single Mellanox FDR Infiniband (IB) fabric.
d) EPCC NGIO: This is a Fujitsu-built system where each node has two 24-core Intel Xeon Platinum 8260M processors, running at 2.4 GHz, with a total of 192 GB of DDR4 memory shared between the processors. The system uses Intel's OmniPath interconnect.
e) Fulhame: An HPE Apollo 70 cluster with dual-socket compute nodes connected with Mellanox EDR IB using a non-blocking fat tree topology. Each compute node consists of two 32-core Marvell ThunderX2 processors running at 2.2 GHz, and 256 GB of DDR4 memory.

V. BENCHMARKS
To evaluate the overall performance of the A64FX system and place it in context with the other systems considered in this paper, we ran the HPCG benchmark across the systems and evaluated the comparative performance.
A. HPCG
HPCG [13] (High Performance Conjugate Gradients) is an HPC system benchmark designed to be more representative than the traditional High Performance LINPACK (HPL [14]), as it has a more realistic resource usage pattern that is closer to full scale HPC applications. As such, HPCG performance is influenced by memory bandwidth, floating point capability and, to some extent, network performance. For the benchmarks presented here we compiled HPCG in MPI-only mode, with as many MPI processes used as there are cores on the node. To fit into the memory on a single A64FX node we used the following parameters for the benchmark across all systems: --nx=80 --ny=80 --nz=80 --t=3600.
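As a sanity check on the peak and percentage-of-peak figures used in Tables I and III, and assuming each A64FX core provides two 512-bit SVE FMA pipelines (a micro-architectural detail not stated above), the node peak and the fraction achieved by HPCG work out as:

\[
P_{\mathrm{peak}} = 48 \times 2.2\,\mathrm{GHz} \times \underbrace{(2 \times 8 \times 2)}_{\mathrm{pipes}\,\times\,\mathrm{DP\ lanes}\,\times\,\mathrm{FLOP/FMA}} \approx 3379\ \mathrm{GFLOP/s},
\qquad
\frac{38.26}{3379} \approx 1.1\%.
\]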
For the Fulhame and EPCC NGIO systems, two different versions of the benchmark were run: the first (unoptimised) is the standard HPCG source code, while the second (optimised) is a version of HPCG modified by Intel and Arm respectively for the target architectures used (although still compiled with the same compiler flags). Table III shows the HPCG performance for a single node across the test systems.

The table demonstrates that the A64FX processor achieves significantly higher performance than the unoptimised HPCG source code running on the dual-socket Cascade Lake node, whilst having the same number of cores. It also shows that the A64FX has higher performance than the ThunderX2 node (Fulhame) whilst having fewer cores, demonstrating the performance benefits of the wider vector units and/or the high bandwidth memory provided by the processor.

Both Intel and Arm have optimised versions of the HPCG source code that have been altered to use specific performance libraries or optimised versions of some of the computational kernels in the benchmark. We also ran these optimised versions and provide their performance in the table. Note that both the unoptimised and optimised versions of the benchmark on EPCC NGIO and Fulhame utilise the relevant optimised mathematical libraries (MKL on NGIO and Armpl on Fulhame), so the performance difference exhibited by the optimised version of the benchmark comes primarily from the changes that have been made to the computational routines.

The comparison between the optimised and unoptimised versions of HPCG on NGIO and Fulhame demonstrates the scope for performance improvements available. Given that we ran an unoptimised version of HPCG on the A64FX system, there is likely to be significant scope for increasing the performance through targeted code modifications on the A64FX processor, with our comparative benchmarks suggesting that similar performance improvements could be possible.

We also ran the same benchmark across multiple nodes to evaluate performance once the network is required for the calculations. Table IV shows the performance scaling up to 8 nodes, using the same configuration for the benchmark as for the single node case.

We can see from the multi-node benchmarks that the A64FX nodes still provide higher performance than the rest of the systems, with the difference between A64FX and EPCC NGIO more pronounced on multiple nodes. This demonstrates that there is no significant overhead from the network hardware or libraries on the A64FX as compared to the other systems, and indeed the network may be outperforming the EPCC NGIO system (although further in-depth analysis would be needed to verify this assertion).

TABLE I
COMPUTE NODE SPECIFICATIONS

Specification            | A64FX               | ARCHER                              | Cirrus                          | EPCC NGIO                                | Fulhame
Processor                | Fujitsu A64FX (SVE) | Intel Xeon E5-2697 v2 (Ivy Bridge)  | Intel Xeon E5-2695 (Broadwell)  | Intel Xeon Platinum 8260M (Cascade Lake) | Marvell ThunderX2 (Armv8)
Processor clock speed    | 2.2 GHz             | 2.7 GHz                             | 2.1 GHz                         | 2.4 GHz                                  | 2.2 GHz
Cores per processor      | 48                  | 12                                  | 18                              | 24                                       | 32
Cores per node           | 48                  | 24                                  | 36                              | 48                                       | 64
Threads per core         | 1                   | 1 or 2                              | 1 or 2                          | 1 or 2                                   | 1, 2, or 4
Vector width             | 512 bit             | 256 bit                             | 256 bit                         | 512 bit                                  | 128 bit
Maximum node DP GFLOP/s  | 3379                | 518.4                               | 1209.6                          | 2662.4                                   | 1126.4
Memory per node          | 32 GB               | 64 GB                               | 256 GB                          | 192 GB                                   | 256 GB
Memory per core          | 0.66 GB             | 2.66 GB                             | 7.11 GB                         | 4 GB                                     | 4 GB

VI. MINI-APPS
To enable in-depth analysis of the performance of the A64FX without requiring time consuming full application runs, we investigated a number of mini-apps. These are benchmarking programs designed to provide representative functionality of the core components of larger scale applications. In this paper we used two mini-apps: minikab, a parallel conjugate gradient (CG) solver, and Nekbone, a representative Navier-Stokes (NS) solver. This section describes those mini-apps and the performance experienced across systems for these benchmarks.
A. minikab
The Mini Krylov ASiMoV Benchmark (minikab) program is a simple parallel CG solver developed at EPCC to allow testing of a range of parallel implementation techniques. It is written in Fortran 2008 and parallelised using MPI, as well as MPI with OpenMP. It supports a range of command-line options to test the different methods that can be used when implementing a solver:
• the type of decomposition;
• the solver algorithm;
• the communication approach;
• the serial sparse-matrix routine, in plain Fortran or implemented via a numerical library (such as MKL).

We tested minikab using a sparse matrix (called Benchmark1) that has 9,573,984 degrees of freedom and 696,096,138 non-zero elements; the matrix represents a large structural problem.

To establish a baseline for further performance analysis, the test case was run on a single core on EPCC NGIO and Fulhame, in addition to the A64FX. As Table V presents, on a single core the A64FX shows the best performance by far: it is 7% faster than even a top of the range Intel Xeon core, and just over 2x faster than the ThunderX2.
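As an illustration of the algorithmic structure such a solver implements (a generic, serial, unpreconditioned CG written for this paper, not minikab's source), the loop below operates on a CSR matrix; in the MPI variants each dot product becomes an MPI_Allreduce and the sparse matrix-vector product requires a halo exchange of shared rows.

    /* Sketch: unpreconditioned CG solve of Ax = b with A in CSR format.
     * Generic illustration only; not taken from minikab. */
    #include <math.h>
    #include <stdlib.h>

    static double dot(int n, const double *x, const double *y) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i] * y[i];
        return s;
    }

    /* y = A*x for a CSR matrix (row_ptr, col_idx, val) */
    static void spmv(int n, const int *row_ptr, const int *col_idx,
                     const double *val, const double *x, double *y) {
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                s += val[k] * x[col_idx[k]];
            y[i] = s;
        }
    }

    int cg(int n, const int *row_ptr, const int *col_idx, const double *val,
           const double *b, double *x, double tol, int max_it) {
        double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p);
        double *Ap = malloc(n * sizeof *Ap);
        spmv(n, row_ptr, col_idx, val, x, Ap);            /* r = b - A*x0 */
        for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
        double rs_old = dot(n, r, r);
        int it = 0;
        while (it < max_it && sqrt(rs_old) > tol) {
            spmv(n, row_ptr, col_idx, val, p, Ap);
            double alpha = rs_old / dot(n, p, Ap);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            double rs_new = dot(n, r, r);
            for (int i = 0; i < n; i++) p[i] = r[i] + (rs_new / rs_old) * p[i];
            rs_old = rs_new;
            it++;
        }
        free(r); free(p); free(Ap);
        return it;    /* number of iterations performed */
    }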
TABLE II
COMPILERS, COMPILER FLAGS AND LIBRARIES
(System: Compiler | Compiler flags | Libraries)

HPCG
A64FX: Fujitsu 1.2.24 | -Nnoclang -O3 -Kfast | Fujitsu MPI
ARCHER: Intel 17 | -O3 | Cray MPI
Cirrus: Intel 17 | -O3 -cxx=icpc -qopt-zmm-usage=high | HPE MPI
EPCC NGIO: Intel 19 | -O3 -cxx=icpc -xCore-AVX512 -qopt-zmm-usage=high | Intel MPI
Fulhame: GCC 8.2 | -O3 -ffast-math -funroll-loops -std=c++11 -ffp-contract=fast -mcpu=native | OpenMPI

minikab
A64FX: Fujitsu 1.2.25 | -O3 -Kopenmp -Kfast -KA64FX -KSVE -KARMV8_3_A -Kassume=noshortloop -Kassume=memory_bandwidth -Kassume=notime_saving_compilation | Fujitsu MPI
EPCC NGIO: Intel 19 | -O3 -warn all | Intel MPI library
Fulhame: Arm Clang 20 | -O3 -armpl -mcpu=native -fopenmp | OpenMPI, ArmPL

nekbone
A64FX: Fujitsu 1.2.24 | -CcdRR8 -Cpp -Fixed -O3 -Kfast -KA64FX -KSVE -KARMV8_3_A -Kassume=noshortloop -Kassume=memory_bandwidth -Kassume=notime_saving_compilation | Fujitsu MPI
ARCHER: GCC 6.3 | -fdefault-real-8 -O3 | Cray MPICH2 library 7.5.5
EPCC NGIO: Intel 19.03 | -fdefault-real-8 -O3 | Intel MPI 19.3
Fulhame: GNU 8.2 | -fdefault-real-8 -O3 | OpenMPI 4.0.2
CASTEP
A64FX: Fujitsu 1.2.24 | -O3 | Fujitsu MPI, Fujitsu SSL2, FFTW 3.3.3
ARCHER: GCC 6.2 | -fconvert=big-endian -fno-realloc-lhs -fopenmp -fPIC -O3 -funroll-loops -ftree-loop-distribution -g -fbacktrace | Cray MPICH2 library 7.5.5, Intel MKL 17.0.0.098, FFTW 3.3.4.11
Cirrus: Intel 17 | -O3 -debug minimal -traceback -xHost | SGI MPT 2.16, Intel MKL 17, FFTW 3.3.5
EPCC NGIO: Intel 17 | -O3 -debug minimal -traceback -xHost | Intel MPI library 17.4, Intel MKL 17.4, FFTW 3.3.3
Fulhame: GCC 8.2 | -fconvert=big-endian -fno-realloc-lhs -fopenmp -fPIC -O3 -funroll-loops -ftree-loop-distribution -g -fbacktrace | HPE MPT MPI library (v2.20), ARM Performance Libraries 19.0.0, FFTW 3.3.8
COSA
A64FX: Fujitsu 1.2.24 | -X9 -Fwide -Cfpp -Cpp -m64 -Ad -O3 -Kfast -KA64FX -KSVE -KARMV8_3_A -Kassume=noshortloop -Kassume=memory_bandwidth -Kassume=notime_saving_compilation | Fujitsu MPI, Fujitsu SSL2, FFTW 3.3.3
ARCHER: GNU 7.2 | -g -fdefault-double-8 -fdefault-real-8 -fcray-pointer -ftree-vectorize -O3 -ffixed-line-length-132 | Cray MPI library (v7.5.5), Cray LibSci (v16.11.1)
Cirrus: GNU 8.2 | -g -fdefault-double-8 -fdefault-real-8 -fcray-pointer -ftree-vectorize -O3 -ffixed-line-length-132 | SGI MPT 2.16, Intel MKL 17.0.2.174
EPCC NGIO: Intel 18 | -g -fdefault-double-8 -fdefault-real-8 -fcray-pointer -ftree-vectorize -O3 -ffixed-line-length-132 | Intel MPI, Intel MKL 18
Fulhame: GNU 8.2 | -g -fdefault-double-8 -fdefault-real-8 -fcray-pointer -ftree-vectorize -O3 -ffixed-line-length-132 | HPE MPT MPI library (v2.20), ARM Performance Libraries (v19.0.0)
OpenSBLI
ARCHER: Cray Compiler v8.5.8 | -O3 -hgnu | Cray MPICH2 (v7.5.2), HDF5 (v1.10.0.1)
Cirrus: Intel 17.0.2.174 | -O3 -ipo -restrict -fno-alias | SGI MPT 2.16, HDF5 1.10.1
EPCC NGIO: Intel 17.4 | -O3 -ipo -restrict -fno-alias | Intel MPI 17.4, HDF5 1.10.1
Fulhame: Arm Clang 19.0.0 | -O3 -std=c99 -fPIC -Wall | OpenMPI 4.0.0, HDF5 1.10.4

TABLE III
SINGLE NODE HPCG PERFORMANCE

System                   | Performance (GFLOP/s) | % of Theoretical Peak Performance
A64FX                    | 38.26                 | 1.1
ARCHER                   | 15.65                 | 3.0
Cirrus                   | 17.27                 | 1.4
EPCC NGIO (unoptimised)  | 26.16                 | 1.4
EPCC NGIO (optimised)    | 37.61                 | 2.0
Fulhame (unoptimised)    | 23.58                 | 2.0
Fulhame (optimised)      | 33.80                 | 3.0

TABLE IV
MULTIPLE NODE HPCG PERFORMANCE (GFLOP/s)

System                  | 1 node | 2 nodes | 4 nodes | 8 nodes
A64FX                   | 38.26  | 78.94   | 157.46  | 313.50
ARCHER                  | 15.65  | 26.25   | 55.63   | 110.52
Cirrus                  | 17.27  | 34.26   | 68.44   | 136.06
EPCC NGIO (optimised)   | 37.61  | 73.90   | 147.94  | 292.60
Fulhame (optimised)     | 33.80  | 67.68   | 133.29  | 261.32
We also investigate the impact of different process-thread configurations for this benchmark. Our experiments, presented in Figure 1, confirm that using 1 process per CMG with 12 OpenMP threads per process gives the best performance for minikab. We compare a wide range of run configurations on 2 nodes for increasing numbers of cores used. The largest plain MPI configuration able to fit into the available memory is 48 MPI processes, i.e. under-populating the nodes by half. Unsurprisingly, the best performance is achieved when using all the cores, and out of the five options tested, using 8 MPI processes, each with 12 OpenMP threads, is fastest.

Fig. 1. Comparing the solver runtimes and GFLOP/s for different execution setups (plain MPI and mixed-mode MPI with OpenMP) on 2 A64FX nodes for increasing core counts. Note that the GFLOP/s are as reported by the Fujitsu profiler for the entire execution, and therefore include the setup phase, whereas the runtimes are for the solver only. In particular for higher MPI process counts, GFLOP/s may be high even though the runtime is also high.

TABLE V
SINGLE CORE MINIKAB PERFORMANCE (RUNTIME IN SECONDS)

CPU        | Runtime (s)
A64FX      | 1182
EPCC NGIO  | 1269
Fulhame    | 2415

Fig. 2. Performance of minikab with Benchmark1 on ThunderX2 (Fulhame) and A64FX, using up to 6 nodes on Fulhame and 8 on A64FX (strong scaling).

Figure 2 compares the scaling behaviour of the default setup of minikab on A64FX and Fulhame. On Fulhame, using plain MPI gives the best performance, and as memory limitations are not a concern on that system, we populated the nodes fully with MPI processes. As shown earlier, memory constraints on the A64FX mean that it is not possible to use fully populated nodes with a plain MPI configuration here. We therefore use the best performing setup in both cases as a comparison. The Fulhame results are for 1 to 6 nodes (64-384 cores) and the A64FX results are for 2 to 8 nodes (96-384 cores); three of the datapoints (192, 320 and 384 cores) match between the two systems.

We can see that the A64FX system outperforms Fulhame across the range of core counts, albeit with different numbers of nodes (i.e. 192 cores is 3 nodes on Fulhame, but 4 nodes on the A64FX). Even comparing node to node performance the A64FX is still significantly faster, although it does not scale as well as the Fulhame system.
B. Nekbone
The Nekbone mini-app benchmark captures the basic structure of the Nek5000 application, which is a high order, incompressible NS solver based on the spectral element method. Nekbone solves a standard Poisson equation using a conjugate gradient iterative method with a simple preconditioner on a block or linear geometry. As a mini-app, Nekbone represents the principal computational kernel of Nek5000, to enable exploration of the essential elements of the algorithmic features that are pertinent to Nek5000.

The solution phase consists of conjugate gradient iterations that call the main computational kernel, which accounts for over 75% of the runtime. This kernel, ax, performs a matrix-vector multiplication operation in an element-by-element fashion. Overall, each iteration of the solver involves vector operations, matrix-matrix multiply operations, nearest-neighbour communication, and MPI_Allreduce operations. The linear algebra operations are performed on an element-by-element basis, with each element consisting of a specific polynomial order configuration (for the tests executed here we use 16 by 16 by 16). This represents a challenging computational pattern, as relatively small vector and matrix-matrix multiply operations are performed on each element, rather than a single much larger operation of the kind libraries such as BLAS are often optimised for. Furthermore, different aspects of the kernel are bound by different limits; for instance some parts of the kernel are memory bound, whereas others are compute bound. This therefore makes it a very interesting study, exploring not only the benefits that the floating point performance of the A64FX can provide, but also whether the higher memory bandwidth can deliver benefit too.

The benchmarks are undertaken using a weak scaling methodology and leverage the largest test case in the Nekbone repository. This corresponds to a system comprising 200 local elements, each of 16 by 16 by 16 polynomial order. Unless otherwise stated, all compilation is performed at O3, with additional architecture specific flags for optimal performance. All results reported are averaged over three runs.

TABLE VI
NODE PERFORMANCE OF NEKBONE ACROSS NUMEROUS SYSTEMS

System     | Cores used | GFLOP/s | Ratio to A64FX | GFLOP/s (fast math) | Ratio to A64FX
A64FX      | 48         | 175.74  | 1.00           | 312.34              | 1.00
EPCC NGIO  | 48         | 127.19  | 0.72           | 90.37               | 0.29
Fulhame    | 64         | 121.63  | 0.69           | 132.65              | 0.42
ARCHER     | 24         | 66.55   | 0.40           | 68.22               | 0.21
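To make the computational pattern concrete, the fragment below applies one small dense operator element by element, in the spirit of the ax kernel described above; it is a schematic written for this paper, not Nekbone's actual routine, and real spectral element kernels apply such operators along all three directions of each element.

    /* Sketch: element-by-element application of a small dense operator,
     * schematic of the tensor-product work in a spectral element kernel
     * (not Nekbone's actual ax routine). P is the polynomial order, here 16. */
    enum { P = 16 };

    /* w(:,j,k) += D * u(:,j,k) for one P^3 element */
    static void apply_dx(const double D[P][P],
                         const double u[P][P][P], double w[P][P][P]) {
        for (int k = 0; k < P; k++)
            for (int j = 0; j < P; j++)
                for (int i = 0; i < P; i++) {
                    double s = 0.0;
                    for (int l = 0; l < P; l++)
                        s += D[i][l] * u[l][j][k];
                    w[i][j][k] += s;
                }
    }

    /* loop over the local elements held by each rank (200 in this test case) */
    void ax_like(int nelem, const double D[P][P],
                 const double (*u)[P][P][P], double (*w)[P][P][P]) {
        for (int e = 0; e < nelem; e++)
            apply_dx(D, u[e], w[e]);
    }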
1) Node performance:
Table VI illustrates the performance in GFLOP/s of the Nekbone weak scaling experiment run across single nodes of the different machines, which represent numerous architectures. We saw a significant performance improvement on the A64FX by compiling with -Kfast, and denote this as fast math in the table (it is similar to -ffast-math with GCC). It can be seen that using the -Kfast flag very significantly improves the performance on the A64FX, but similar compiler flags do not significantly improve performance on the other architectures.

The table demonstrates that the A64FX outperforms all the other technologies, and crucially it is the improved memory bandwidth of this chip that makes the difference here. Furthermore, with fast maths enabled the A64FX is likely able to keep the FPUs busy with data, whilst the other technologies are not, and hence are likely stalling on memory access. For comparison, with a similarly sized number of elements, the Nekbone performance experiments explored in [15] demonstrate approximately 200 GFLOP/s on a P100 GPU, and 300 GFLOP/s on a V100 GPU. Therefore, at 312 GFLOP/s, Nekbone on the A64FX with fast maths enabled is competitive against runs on a GPU, significantly outperforming a P100 and marginally faster than a V100. The -Kfast flag is critical here; without it the performance delivered by the A64FX falls short of both GPU technologies.

Fig. 3. Single node scaling across the number of cores of a processor (one MPI process per core).

Figure 3 illustrates the performance in MFLOP/s on a single node of the different machines across core counts, in log scale. It can be seen that the Arm technologies, both the A64FX and the ThunderX2, scale much better at higher core counts than the Intel technologies, and this in part makes the difference to performance. One can also see that the high core count of the ThunderX2 is a crucial factor here, as at 24 cores it performs comparably to the Ivy Bridge CPUs in ARCHER. More generally, it is also interesting that the Ivy Bridge in ARCHER performs very well initially, competitive with the Cascade Lake, but then experiences a significant relative performance decrease beyond four cores.
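One concrete effect of -Kfast and -ffast-math that is relevant to kernels like this (an illustration of the class of optimisation, not an analysis of Nekbone itself) is that the compiler is allowed to re-associate floating point operations, so a reduction such as the one below can be vectorised rather than executed as a strictly ordered chain of additions:

    /* Sketch: a reduction that compilers will only vectorise freely when
     * floating-point re-association is permitted (e.g. -ffast-math / -Kfast),
     * because changing the summation order changes the rounding. */
    double sum(int n, const double *restrict a)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];   /* strict IEEE order is a serial dependency chain */
        return s;
    }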
2) Scaling across nodes:
We ran some small inter-node scaling experiments of up to 16 nodes across the A64FX system, Fulhame and ARCHER. This is interesting for comparison, as Fulhame contains Mellanox EDR IB using a non-blocking fat tree topology, ARCHER uses Cray's Dragonfly topology via the Aries interconnect, and the A64FX uses the TofuD network. In all experiments nodes are fully populated, i.e. 48 processes per node on the A64FX, 64 on Fulhame, and 24 on ARCHER.

Table VII illustrates the inter-node parallel efficiency (defined as the speed-up divided by the number of nodes). Nekbone is known to scale well, so it is not surprising that the parallel efficiencies are so high, although the Infiniband of Fulhame does seem to provide slightly higher performance compared to the other two systems. However, this is a simple test, and we have not yet explored the options with the different topologies of the TofuD interconnect (relying simply on the defaults). As such, a larger and more challenging test would be instructive to explore the performance properties of the interconnect more fully.
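Written out, and taking the speed-up on n nodes as the ratio of aggregate performance to the single-node result (our reading of the definition above; the precise per-run metric is not spelled out), the parallel efficiency reported in Table VII is:

\[
\mathrm{PE}(n) = \frac{S(n)}{n}, \qquad S(n) = \frac{R(n)}{R(1)},
\]

where R(n) is the aggregate Nekbone performance on n nodes; a value of 1.0 corresponds to perfect scaling.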
TABLE VII
INTER-NODE PARALLEL EFFICIENCY ACROSS MACHINES

Node count | A64FX PE | Fulhame PE | ARCHER PE
2          | 0.99     | 0.99       | 0.98
4          | 0.97     | 0.99       | 0.98
8          | 0.97     | 0.97       | 0.97
16         | 0.96     | 0.98       | 0.97

VII. APPLICATIONS

Whilst benchmarks and mini-apps are important for characterising and exploring performance for HPC systems, usability issues and overall system performance evaluation require running fully functional applications. Therefore, we benchmark using three commonly used HPC applications, two focusing on computational fluid dynamics (COSA and OpenSBLI) and one on materials science (CASTEP). This section outlines those applications and the performance experienced across the systems benchmarked.

TABLE VIII
COSA: PROCESSES PER NODE FOR EACH SYSTEM BENCHMARKED

System             | A64FX | ARCHER | Cirrus | Fulhame | EPCC NGIO
Processes per node | 48    | 24     | 36     | 64      | 48
A. COSA
COSA [16] is a CFD application that supports steady, time-domain (TD), and frequency-domain (harmonic balance, or HB) solvers, implementing the numerical solution of the NS equations using a finite volume space-discretisation and multigrid (MG) integration. It is implemented in Fortran and has been parallelised using MPI, with each MPI process working on a set of grid blocks (geometric partitions) of the simulation. COSA has been shown to exhibit good parallel scaling to large numbers of MPI processes with a sufficiently large test case [17].
1) Test Case:
The benchmark we used to test the performance of COSA on the A64FX system was a HB test case with 4 harmonics and a grid composed of 800 blocks. This was chosen because it fits into approximately 60GB of memory, making it ideal for testing the scaling across a range of nodes on the system. To enable efficient benchmarking, the simulation was only run for 100 iterations, a significantly smaller number of iterations than a production run would typically use, but enough to evaluate performance sensibly.
2) Configuration:
Writing output data to storage can be a significant overhead in COSA, especially for simulations using small numbers of iterations; therefore I/O output is disabled to ensure variations in the I/O hardware of the platforms being benchmarked do not affect the performance results collected. The benchmark was run with a single MPI process per core, with all the cores in the node utilised. Table VIII outlines the cores used per node for the systems benchmarked.

The number of processes used does impact the efficiency of the domain decomposition employed in the application, with the best performance exhibited when the number of available decomposition blocks (800 for the test case presented here) is exactly divisible by the number of processes used. Furthermore, scaling up to all 800 blocks (i.e. using 800 processes) may also introduce some inefficiencies for a small test case such as the one used here, as individual processes may not have enough work to do for optimal performance.

Fig. 4. COSA performance across a range of node counts (strong scaling).
3) Results:
Figure 4 presents the results of running the benchmark (strong scaling) across a range of node counts for the systems under consideration. Each benchmark was run three times and the average runtime is presented. The benchmark would not fit on a single A64FX node, so the A64FX results start from two nodes. We can see from the graph that the A64FX consistently outperforms the other systems all the way up to 16 nodes, where its performance is overtaken by Fulhame (the ThunderX2 based system). It is worth taking into account the number of MPI processes used on each system, and the number of blocks in the simulation. There are 800 blocks in the simulation, meaning at most 800 MPI processes can be active. However, on Fulhame, using 16 nodes would give 1024 MPI processes, meaning some of the nodes are not actually undertaking work: only 13 of the Fulhame nodes are being used, whereas for all the other systems all 16 nodes are active. Furthermore, the number of processes used impacts the load balance, as the data decomposition distributes blocks to processes. Using 16 nodes on the A64FX means there are 800 blocks to be distributed amongst 768 processes, leaving 32 processes with 2 blocks and the rest with 1 block each. This load imbalance, along with the reduced number of nodes required on Fulhame, which minimises the amount of off-node MPI communication, is likely to contribute to Fulhame being faster at the highest node count.
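As a back-of-the-envelope illustration of that imbalance (assuming all blocks cost roughly the same and ignoring communication), the 16-node A64FX run is limited by the ranks that hold two blocks:

\[
\frac{800\ \text{blocks}}{768\ \text{ranks}}: \quad 32\ \text{ranks hold 2 blocks},\ 736\ \text{ranks hold 1}, \qquad
\text{load-balance efficiency} \approx \frac{800/768}{2} \approx 52\%.
\]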
B. CASTEP
CASTEP [18], [19], [20], [21] is a leading code for calculating the properties of materials from first principles. Using density functional theory, it can simulate a wide range of material properties including energetics, structure at the atomic level, vibrational properties, and electronic response properties. In particular it has a wide range of spectroscopic features that link directly to experiment, such as infra-red and Raman spectroscopies, NMR, and core level spectra.

In this benchmarking we used CASTEP release 18.1.0. CASTEP requires a high-performance FFT library to function; this is usually provided by FFTW3 or Intel MKL. Fujitsu kindly provided their early development version of FFTW3 for the A64FX platform. CASTEP also requires high-performance BLAS/LAPACK numerical libraries. We used the Fujitsu SSL2 libraries to provide these functions on the A64FX, MKL on the Intel based systems, and the Arm Performance Libraries (Armpl) on the ThunderX2 system.

Fig. 5. Single node CASTEP TiN benchmark performance as a function of core count.
1) Results:
The TiN CASTEP benchmark was run on the different systems with a variety of core counts up to one full node, and then with various process and thread combinations using all cores on a node. Note that the benchmark can only be run with total core counts that are either a factor or a multiple of 8. This means that on Cirrus, with 36 cores per node, we cannot use all cores on a node or socket; instead, we use the number of cores closest to the number available (32 cores for a full node, 16 cores for a socket). For the other systems, this means that some combinations of MPI process counts and OpenMP thread counts are impossible. In the majority of combinations we have run the benchmark a minimum of three times and used the best performance from the set of results in each case for comparisons.

Figure 5 shows the performance of CASTEP for the TiN benchmark on 1 node for the test systems. On all systems, the best performance was achieved using MPI only, with no OpenMP threading. We can see that the highest absolute single node performance is seen on the EPCC NGIO system, with the lowest absolute performance on a single node seen on the ARCHER system. Table IX below shows the performance of the best full-node benchmark runs for each system and the ratio of this performance to the A64FX system.

The A64FX processor is performing well, providing faster solutions than the ThunderX2 processor even with lower core counts. However, it is not quite matching the performance of the Intel Cascade Lake processors. As we were working with early versions of the FFT libraries, and have yet to attempt A64FX specific optimisations of CASTEP, it is likely this performance could be improved, but it is evident that the A64FX processor is competitive in terms of performance for CASTEP.

TABLE IX
CASTEP TiN BENCHMARK: BEST SINGLE NODE PERFORMANCE COMPARISON

System     | Cores used | Perf. (SCF cycles/s) | Ratio to A64FX
A64FX      | 48         | 0.145                | 1.00
ARCHER     | 24         | 0.074                | 0.51
EPCC NGIO  | 48         | 0.184                | 1.27
Cirrus     | 32         | 0.125                | 0.86
Fulhame    | 64         | 0.141                | 0.97
C. OpenSBLI
OpenSBLI is a Python-based modelling framework that is capable of expanding a set of differential equations written in Einstein notation, and automatically generating C code that performs the finite difference approximation to obtain a solution. This C code is then targeted with the OPS library towards specific hardware backends, such as MPI/OpenMP for execution on CPUs, and CUDA/OpenCL for execution on GPUs.

The main focus of OpenSBLI is on the solution of the compressible NS equations with application to shock-boundary layer interactions (SBLI). However, in principle, any set of equations that can be written in Einstein notation can be solved with this framework.
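As a flavour of the kind of kernel such a framework generates (an illustrative, hand-written central-difference stencil; it is not OpenSBLI output and does not use the OPS API):

    /* Sketch: second-order central difference d(u)/dx on a 1D periodic grid,
     * illustrative of the finite-difference kernels a code generator emits;
     * not generated by OpenSBLI. */
    void ddx_central(int n, double dx, const double *restrict u,
                     double *restrict dudx)
    {
        for (int i = 0; i < n; i++) {
            int ip = (i + 1) % n;          /* periodic neighbours */
            int im = (i - 1 + n) % n;
            dudx[i] = (u[ip] - u[im]) / (2.0 * dx);
        }
    }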
1) Test Case:
The benchmark test case set up using OpenSBLI is the Taylor-Green vortex problem in a cubic domain of length π. For this study, we have investigated the strong scaling properties of the benchmark on a cubic grid that is smaller than the sizes normally used for this benchmark; the size was chosen to allow comparisons between single nodes of different architectures, as larger benchmarks will not fit into the 32GB of memory available on the A64FX. This benchmark was configured to target pure MPI parallelism and performs minimal I/O.
2) Results:
We can see from the results presented in Table X that the A64FX underperforms compared with the other systems, being between roughly 2x and 3x slower than the fastest system (Fulhame). The EPCC NGIO and Fulhame systems present very similar performance, even though they have different characteristics (i.e. EPCC NGIO only has 48 cores and has lower overall memory bandwidth than Fulhame, but higher vectorisation capability). For the results presented, each test was run three times and the average value is used. There were no significant variations in performance between the individual runs on each system.

TABLE X
OPENSBLI PERFORMANCE (TOTAL RUNTIME IN SECONDS)

System     | 1 Node | 2 Nodes | 4 Nodes | 8 Nodes
A64FX      | 3.44   | 1.89    | 1.04    | 0.69
Cirrus     | 1.90   | 0.93    | 0.53    | 0.35
EPCC NGIO  | 1.18   | 0.75    | 0.46    | 0.31
Fulhame    | 1.17   | 0.74    | 0.65    | 0.28

Some initial analysis of OpenSBLI using profiling tools on the A64FX system has shown a large amount of time being spent in both instruction fetch waits and integer cache loads at the L2 cache level. Whilst further investigation is required, along with comparative profiling across the range of systems, there is definitely some evidence of potential to optimise this performance with source code modifications to OpenSBLI.

VIII. CONCLUSIONS
We have successfully ported a range of applications and benchmarks to the A64FX processor with minimal effort and no code changes required. This demonstrates the high level of readiness, for users, of the overall system platform surrounding the A64FX. We have also demonstrated extremely good performance from the A64FX processor for a range of applications, including outperforming other Arm processors and top of the range Intel processors.

However, not all applications exhibit the same performance characteristics on the A64FX processor, with a number of the benchmarks we undertook presenting slightly worse performance, and one benchmark (OpenSBLI) presenting significantly worse performance. When considering this it should be remembered that we have not yet attempted to optimise any of these benchmarks for the target system, aside from using the provided compilers and associated libraries. Indeed, the HPCG results demonstrate that on a range of systems there are significant performance benefits to be gained by optimising applications for the target architecture. Therefore, we can see that the A64FX processor, and computers built with the A64FX technology, offer the potential for very significant performance for computational simulation applications.

The benchmarking and evaluation process demonstrated the maturity of the software platform around the A64FX processor (i.e. the compilers, libraries, and batch system). However, it also highlighted that some applications and application domains may struggle with the small amount of memory available on the A64FX-based nodes. This may require work to further parallelise applications, or to improve their parallel performance, in order to exploit additional compute nodes and so obtain sufficient memory for the applications to operate. The Fujitsu maths libraries (SSL2) have been shown to be easy replacements for the Intel MKL and Arm Performance Libraries for some of the applications we have considered in this paper, but not for all the requirements we encountered (i.e. FFTW for CASTEP). Therefore, some further work on optimised libraries for the A64FX system would be beneficial.
ACKNOWLEDGMENTS

REFERENCES

[3] Y. Ajima et al., "The Tofu Interconnect D," in IEEE International Conference on Cluster Computing (CLUSTER), 2018, pp. 646–654.
[4] A. Jackson, A. Turner, M. Weiland, N. Johnson, O. Perks, and M. Parsons, "Evaluating the Arm ecosystem for high performance computing," in Proceedings of the Platform for Advanced Scientific Computing Conference, ser. PASC '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3324989.3325722
[5] E. Calore, A. Gabbana, F. Rinaldi, S. F. Schifano, and R. Tripiccione, "Early performance assessment of the ThunderX2 processor for lattice based simulations," in Parallel Processing and Applied Mathematics, R. Wyrzykowski, E. Deelman, J. Dongarra, and K. Karczewski, Eds. Cham: Springer International Publishing, 2020, pp. 187–198.
[6] M. Müller, M. van Waveren, R. Lieberman, B. Whitney, H. Saito, K. Kumaran, J. Baron, W. Brantley, C. Parrott, T. Elken, H. Feng, and C. Ponder, "SPEC MPI2007 – an application benchmark suite for parallel systems using MPI," Concurrency and Computation: Practice and Experience.
[11] Hot Chips 30 Symposium (HCS), ser. Hot Chips, 2018.
[12] Y. Kodama, T. Odajima, A. Asato, and M. Sato, "Evaluation of the RIKEN Post-K processor simulator," April 2019. [Online]. Available: https://arxiv.org/abs/1904.06451
[13] J. J. Dongarra, M. A. Heroux, and P. Luszczek, "HPCG benchmark: a new metric for ranking high performance computing systems," Knoxville, Tennessee, Tech. Rep. UT-EECS-15-736, November 2015.
[14] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, pp. 803–820, 2003.
[15] M. Karp, N. Jansson, A. Podobas, P. Schlatter, and S. Markidis, "Optimization of tensor-product operations in Nekbone on GPUs," arXiv preprint arXiv:2005.13425, 2020.
[16] W. Jackson, M. Campobasso, and J. Drofelnik, "Load balance and parallel I/O: optimising COSA for large simulations," Computers and Fluids, March 2018.
[17] A. Jackson and M. Campobasso, "Shared-memory, distributed-memory, and mixed-mode parallelisation of a CFD simulation code," Computer Science - Research and Development, vol. 26, no. 3-4, pp. 187–195, June 2011.
[18] S. J. Clark, M. D. Segall, C. J. Pickard, P. J. Hasnip, M. J. Probert, K. Refson, and M. Payne, "First principles methods using CASTEP," Z. Kristall., vol. 220, pp. 567–570, 2005.
[19] P. Hohenberg and W. Kohn, "Inhomogeneous electron gas," Phys. Rev., vol. 136, pp. B864–B871, 1964.
[20] W. Kohn and L. J. Sham, "Self-consistent equations including exchange and correlation effects," Phys. Rev., vol. 140, pp. A1133–A1138, 1965.
[21] M. C. Payne, M. P. Teter, D. C. Allan, T. Arias, and J. D. Joannopoulos, "Iterative minimization techniques for ab initio total-energy calculations: molecular-dynamics and conjugate gradients."