Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX
Christie L. Alappat, Jan Laukemann, Thomas Gruber, Georg Hager, Gerhard Wellein
Erlangen Regional Computing Center, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

Nils Meyer, Tilo Wettig
Department of Physics, University of Regensburg, Regensburg, Germany
Abstract—The A64FX CPU powers the current … SELL-C-σ format with suitable code optimizations can achieve bandwidth saturation for SpMV.

Index Terms—ECM model, A64FX, sparse matrix-vector multiplication
I. INTRODUCTION

A. Motivation: The A64FX CPU
The A64FX CPU is used in parallel computer designs from Fujitsu. It exists in several variants, the most basic of which (used, e.g., in the Fujitsu FX700 system) comprises 48 cores running at 1.8 GHz (see Table I for fundamental data). At a machine balance of 0.37 byte/flop, the architecture is expected to deliver a high fraction of its peak performance for optimized code. The chip is divided into four groups of twelve cores (core memory groups [CMGs]), each of which accesses its own ccNUMA domain. The 64 KiB L1 cache is core local, while 8 MiB of L2 are shared among the cores of each CMG. Figure 1 shows bandwidth scaling using compiler-generated OpenMP code with compact pinning for three elementary operations: the STREAM TRIAD (a[i]=b[i]+s*c[i]), a sum reduction (s+=a[i]), and a sparse matrix-vector multiplication with the HPCG [1] matrix using the Compressed Row Storage (CRS) format. All of these should be strongly memory bound, but only TRIAD shows the typical saturation pattern within the first ccNUMA domain; the other two, albeit scalable, top out at only 40% and 70% of the maximum TRIAD bandwidth, respectively. One goal of our work is to investigate the reasons for this failure and how to mitigate it. The Execution-Cache-Memory (ECM) performance model [2]–[4] will be instrumental in this, leading to valuable insights into performance bottlenecks of this new CPU architecture.

This work was supported in part by KONWIHR and by DFG in the framework of SFB/TRR 55.

Fig. 1. Scaling of STREAM TRIAD, SUM reduction, and SpMV in CRS format with the HPCG matrix using gcc version 10.1.1. The working set size is 4 GB for TRIAD and SUM; for HPCG the dimension is 128³.

B. Brief overview of the ECM model
Full coverage of the ECM model is beyond the scope of this work. We give a brief overview and refer to the most recent publication [4] for details. The model considers execution-time contributions for steady-state loops from the core (assuming all data is in L1), from the data paths in the cache hierarchy, and from the memory interface. For the core component, the loop's assembly code is analyzed to predict optimal throughput, the critical path, and the longest loop-carried dependency (the current development branch of the OSACA tool [5] has preliminary support for A64FX). Data transfer volumes through the memory hierarchy are obtained either by manual analysis or by the Kerncraft tool [6]; together with the known bandwidths of all data paths, time contributions for L1-L2 and L2-memory transfers are obtained. A machine model is constructed that makes assumptions on how all these contributions overlap in time in order to arrive at a runtime prediction for a given number of loop iterations. For the benchmarks under investigation here, manual predictions are straightforward and the Kerncraft tool is not needed.
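As a generic illustration in our own notation (not taken from [4]): if T_core, T_L1L2, and T_L2Mem denote the per-iteration time contributions just described, the two extreme overlap assumptions bracket the in-memory runtime prediction,

\[ \max\bigl(T_\mathrm{core},\, T_\mathrm{L1L2},\, T_\mathrm{L2Mem}\bigr) \;\le\; T_\mathrm{ECM} \;\le\; T_\mathrm{core} + T_\mathrm{L1L2} + T_\mathrm{L2Mem}, \]

with full overlap on the left and no overlap on the right. The machine model constructed for the A64FX in Sect. III lies between these extremes.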
C. Testbed and experimental methodology
All experiments were carried out on QPACE 4, the Fujitsu FX700 system running CentOS 8.2 at the physics department of the University of Regensburg. The core clock frequency is fixed to 1.8 GHz on this machine. All code was compiled with gcc 10.1.1, using the options -Ofast -msve-vector-bits=512 -march=armv8.2-a+sve. Apart from the cases shown in Fig. 1, SVE vector intrinsics (ACLE [7]) were employed to have better control over code generation. All benchmarks were run in double precision, leading to a vector length (VL) of eight elements. Performance event counting was done with likwid-perfctr [8] v5.0.1, whose current development branch contains support for A64FX. For benchmarking individual machine instructions we employed the ibench [9] framework and cross-checked our results with the A64FX Microarchitecture Manual [10]. We do not show any run-to-run statistics as the fluctuations in measurements were below 3%. Information about how to reproduce the results in this paper can be found in the artifact description [11].

TABLE I
KEY SPECIFICATIONS OF THE A64FX CPU IN THE FX700 SYSTEM.

Microarchitecture                       A64FX
Supported core frequency                1.8 GHz
Cores/threads                           48/48
Instruction set                         Armv8.2-A+SVE
Max. SVE vector length                  512 bit
Cache line size                         256 bytes
L1 cache capacity                       48 × 64 KiB
L1 bandwidth per core (b_Reg↔L1)        128 B/cy LD ⊕ 64 B/cy ST
L2 cache capacity                       4 × 8 MiB
L2 bandwidth per core (b_L1↔L2)         64 B/cy LD, 32 B/cy ST
Memory configuration                    4 × HBM2
CMG theor. mem. bandwidth               230 Gbyte/s = 128 byte/cy
CMG TRIAD bandwidth                     210 Gbyte/s = 117 byte/cy
CMG read-only bandwidth                 225 Gbyte/s = 125 byte/cy
Page size                               64 KiB
L1 translation lookaside buffer         16 entries
L2 translation lookaside buffer         1024 entries

This paper is organized as follows: Section II provides an analysis of the in-core architecture, including parts of the SVE instruction set, and of the memory hierarchy of the A64FX. In Sect. III we use microbenchmarks to construct and validate the ECM performance model for serial and parallel steady-state SVE loops. The insights gained are then used in Sect. IV to revisit the SpMV kernel and to motivate why an appropriate matrix storage format as well as sufficient unrolling are required to achieve bandwidth saturation. Finally, Sect. V gives a summary and an outlook to future work.

II. ARCHITECTURAL ANALYSIS
A. In-core
For creating an accurate in-core model of the A64FX microarchitecture, we analyze different instruction forms [12], i.e., assembly instructions in combination with their operand types, based on the methodology introduced in [13]. Table II lists the instruction forms relevant for this work. Standard SVE load (ld1d) instructions have a reciprocal throughput of 0.5 cy, while stores (st1d) have 1 cy. The throughput of gather instructions depends on the distribution of addresses: "Simple" access patterns are stride 0 (no stride), 1 (consecutive load), and 2, while larger strides and irregular patterns are considered "complex." The former have lower reciprocal throughput and latency than the latter. However, when occurring in combination with a standard LD, we observe an increase of the reciprocal throughput by 1.5 cy instead of the expected 0.5 cy. This is caused by the dependency of the gather instruction on the preceding index load operation, which the out-of-order (OoO) execution cannot hide completely.

TABLE II
IN-CORE INSTRUCTION THROUGHPUT AND LATENCY (IF APPLICABLE) FOR SELECTED INSTRUCTION FORMS.

Instruction                     Reciprocal throughput [cy]   Latency [cy]
ld1d (standard)                 0.5                          11
ld1d (gather, simple stride)    2.0
ld1d (gather, complex stride)   4.0
st1d (standard)                 1.0                          –
fadd / fmad / fmla / fmul                                    9
fadda (512 bit)                 18.5                         72
faddv (512 bit)                 11.5                         49
while{le|lo|ls|lt}

Note also the rather long latencies of arithmetic operations such as MUL, ADD, and FMA compared to other state-of-the-art architectures (e.g., on Intel Skylake or AMD Zen2 these are between 3 cy and 5 cy). Each core's front end feeds µ-ops from an instruction buffer to two pairs of reservation stations, which have ten (address generation and load units) or 20 (execution pipelines and store units) entries. These attributes emphasize the importance of a compiler that is capable of exploiting the theoretical in-core performance by intelligent code generation. The small reservation stations in combination with high instruction latencies can result in inefficient OoO execution. While these constraints cannot be overcome completely, appropriate loop unrolling, consecutive addressing, and interleaving of different instruction types have proven to be beneficial in our benchmarks; see Sections III and IV for details.

The SVE instruction set introduced a "while{cond}" instruction to set predicate registers in order to eliminate remainder loops (see Sect. IV for details). A port conflict analysis revealed that this instruction does not collide with floating-point instructions or data transfers, so it does not impact the kernel runtime compared to non-SVE execution.
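To illustrate the predication mechanism, the following sketch (our own code using ACLE intrinsics, not the benchmark code used in this work) shows a TRIAD loop in which the predicate produced by svwhilelt masks the excess lanes of the final iteration, so no scalar remainder loop is needed:

    #include <arm_sve.h>
    #include <stdint.h>

    /* Hedged sketch: SVE TRIAD a[i] = b[i] + s*c[i] without a remainder loop.
     * Function and variable names are our own. */
    void triad_sve(double *a, const double *b, const double *c, double s, int64_t n)
    {
        for (int64_t i = 0; i < n; i += svcntd()) {     /* svcntd() = 8 DP elements per 512-bit vector */
            svbool_t pg = svwhilelt_b64_s64(i, n);      /* while{lt} sets the loop predicate */
            svfloat64_t vb = svld1_f64(pg, &b[i]);
            svfloat64_t vc = svld1_f64(pg, &c[i]);
            svst1_f64(pg, &a[i], svmla_n_f64_x(pg, vb, vc, s));  /* vb + vc*s */
        }
    }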
B. Memory hierarchy

While parallel load/store from and to L1D is possible for general-purpose and NEON registers, different types of SVE data transfer instructions in L1D cannot be executed in one cycle: Using SVE, one A64FX core can either load up to 2 × 64 byte/cy or store 64 byte/cy from/to L1. The L2 cache can deliver 64 byte/cy to one L1D but tops out at 512 byte/cy per CMG. The L1D-L2 write bandwidth is half the load bandwidth, i.e., 32 byte/cy per core, and is capped at 256 byte/cy per CMG. Finally, the maximum bandwidth between memory and L2 cache is 128 byte/cy per CMG for loading and 64 byte/cy per CMG for storing data, resulting in a theoretical total peak load bandwidth of 922 Gbyte/s (230 Gbyte/s per CMG) and a peak store bandwidth of 461 Gbyte/s (115 Gbyte/s per CMG). In practice, about 91% of the peak load bandwidth can be attained using the STREAM TRIAD benchmark and 98% using a read-only benchmark (see Table I). These measured bandwidths will be used as baselines for the memory transfer bandwidth in the ECM model.
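For reference, the quoted peak numbers follow directly from the per-CMG transfer rates and the 1.8 GHz clock:

\[ 128\,\mathrm{byte/cy} \times 1.8\,\mathrm{Gcy/s} = 230.4\,\mathrm{Gbyte/s\ per\ CMG}, \qquad 4 \times 230.4\,\mathrm{Gbyte/s} \approx 922\,\mathrm{Gbyte/s}, \]

and analogously 64 byte/cy × 1.8 Gcy/s ≈ 115 Gbyte/s per CMG (461 Gbyte/s total) for stores.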
Fig. 2. Runtime of SVE loop kernels vs. problem size, comparing no unrolling (black) and eight-way unrolling (blue). Arrays were aligned to 1024-byte boundaries. (a) STREAM TRIAD; the extra green line denotes no special alignment (plain malloc()), while the orange data set denotes TLB misses per OS page. (b) 2d 5-point stencil. (c) Sum reduction. For 2D PT the outer and inner dimension was set at a ratio of 1:2.

III. CONSTRUCTION OF THE ECM MODEL
A. Overlap hypothesis
In order to find out which of the time contributions for data transfers through the cache hierarchy overlap, measurements for a test kernel are compared with predictions based on different hypotheses; see [4] for an in-depth description of the process. If a hypothesis works for the test kernel, it is tested against a collection of other kernels with different characteristics. Here we use the STREAM TRIAD kernel (a[i]=b[i]+s*c[i]) to narrow down the possible overlap scenarios. This kernel has two LD, one ST, and one FMA instruction per SVE-vectorized iteration. Figure 2a shows performance in cycles per VL (i.e., eight iterations) for different code variants: "u=1" denotes no unrolling (apart from SVE), and "u=8" is eight-way unrolled on top of SVE. Some level of manual unrolling (typically eight-way) is always required for best in-core performance. This is even more important in kernels where dependencies cannot be resolved easily by the out-of-order logic. In Fig. 2b we show data for a 2d five-point stencil, where SVE alone (without further unrolling) is up to 2× slower than the eight-way unrolled code, despite the lack of loop-carried dependencies. Figure 2c shows data for a sum reduction, which requires eight-way modulo variable expansion (MVE) on top of SVE to achieve optimal performance due to the large latency of the floating-point ADD instruction.

Optimal performance beyond the L1 cache is only achieved when aligning all arrays to 32-byte boundaries; within L1, 512-byte alignment is needed. In our experiments we used 1024-byte aligned arrays throughout. The standard malloc() function only guarantees 16-byte alignment and leads to performance loss ("u=8+malloc").

At a working set size of 64 MiB, performance drops significantly for all three kernels. We can correlate this with a sudden rise in L2 TLB misses (see Fig. 2a). Beyond this threshold, exactly one TLB miss per 64 KiB page occurs. This leads to the conclusion that L2 TLB misses are rather expensive on this machine. A larger OS page size or other configuration changes might help improve the situation but are left to future work. The effect will be ignored in the following.

In order to arrive at an overlap hypothesis, we ignore the FMA because it is fully overlapping when perfect OoO processing is assumed. Figure 3 compares three scenarios ((a), (b), and (c)) with measured cycles per VL (d). Note that there is a large number of possible overlap hypotheses, and we can only show a few here. The one leading to the best match to the STREAM TRIAD data is the following:
• L1D is partially overlapping: Cycles in which STs are retired in the core can overlap with L1-L2 (or L2-L1) transfers, but cycles with LDs retiring cannot.
• L2 is partially overlapping: Cycles in which the memory interface writes data out to memory can overlap with transfers between L2 and L1, while memory read cycles cannot.
As usual, the time required for compute instructions or, more generally, non-data-transfer time in the core, fully overlaps with all data transfers. Note that we include write-allocate transfers (due to store misses) in the analysis.
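To illustrate how these rules produce a prediction, the following back-of-the-envelope estimate (our own; the paper's actual predictions are the ones tabulated in Table III) applies them to the TRIAD, per VL of 64 bytes per array and with the bandwidths from Table I:

\[ T_\mathrm{core} = 2 \times 0.5\,\mathrm{cy}\ (\mathrm{LD}) + 1\,\mathrm{cy}\ (\mathrm{ST}) = 2\,\mathrm{cy}, \]
\[ T_\mathrm{L1L2} = \frac{3 \times 64\,\mathrm{B}}{64\,\mathrm{B/cy}} + \frac{64\,\mathrm{B}}{32\,\mathrm{B/cy}} = 5\,\mathrm{cy}, \qquad T_\mathrm{Mem}^\mathrm{read} = \frac{3 \times 64\,\mathrm{B}}{117\,\mathrm{B/cy}} \approx 1.6\,\mathrm{cy}, \]

where the factor 3 counts the two load streams plus the write-allocate on a[]. With the partial-overlap rules above (only the LD cycle of the core adds to the transfer times; store and memory-write cycles hide behind them), this gives roughly 2 cy/VL (data in L1), 1 + 5 = 6 cy/VL (L2), and 1 + 5 + 1.6 ≈ 7.6 cy/VL (memory).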
B. Validation of the single-core model for streaming kernels
With the in-core and data transfer models in place we can now test the ECM model against a variety of loop kernels. Table III shows a comparison of predictions and measurements. For each kernel, three numbers represent the cycles per VL with the data set in L1, L2, and memory, respectively. In case of the 2d five-point stencil, three cases are shown: layer condition (LC) satisfied at L1, broken at L1, and broken at L2 (see [3] for a comprehensive coverage of layer conditions in the context of the ECM model).

Fig. 3. Comparing different overlap scenarios (a), (b), and (c) for data transfers in the memory hierarchy with measured cycles per VL (d) on the STREAM TRIAD kernel.

TABLE III
ECM MODEL PREDICTIONS AND MEASUREMENTS IN [CY/VL] FOR DIFFERENT STREAMING AND STENCIL KERNELS. RED COLOR INDICATES A DEVIATION FROM THE MODEL OF AT LEAST 15%. THE SELECTED UNROLLING FACTOR FOR EACH MEASUREMENT IS SHOWN AS A SUBSCRIPT.

Kernel                                   Predictions   Measurements
COPY (a[i]=b[i])                         …             …
DAXPY (y[i]=a[i]*x+y[i])                 …             …
DOT (sum+=a[i]*b[i])                     …             …
INIT (a[i]=s)                            …             …
LOAD (load(a[i]))                        …             …
TRIAD (a[i]=b[i]+s*c[i])                 …             …
SUM (sum+=a[i])                          …             …
SCHÖNAUER (a[i]=b[i]+c[i]*d[i])          …             …
2D PT - LC satisfied                     …             …
2D PT - LC violated in L1                …             …
2D PT - LC violated                      …             …

The results have been obtained by running each kernel with unrolling factors from 1 to 16 and taking the best result. Entries in red color have a deviation from the model of 15% or more. The strongest deviations occur in L1: Even with eight-way MVE, the sum reduction cannot achieve the architectural limit of 0.5 cy/VL. A similar deviation can be observed for the stencil kernels. We attribute this failure to insufficient OoO resources: A modified stencil code without intra-iteration register dependencies achieves a performance within 10% of the prediction. Deviations from the model with L2 and memory working sets occur mainly with kernels that have a single data stream. Indeed, the memory hierarchy seems to work more effectively with multi-stream kernels, and a slight overlap for memory reads can be observed for all of them. This could be corrected by a refinement of the model if superior accuracy is required.
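The eight-way MVE variant of the sum reduction referred to above could look roughly as follows (a minimal sketch with ACLE intrinsics and our own naming, not the benchmark code used for the measurements); eight independent accumulators break the loop-carried dependency on the long-latency fadd:

    #include <arm_sve.h>
    #include <stdint.h>

    /* Hedged sketch: eight-way modulo variable expansion (MVE) on top of SVE
     * for the SUM kernel; the remainder is handled by a scalar tail loop. */
    double sum_mve8(const double *a, int64_t n)
    {
        const svbool_t all = svptrue_b64();
        const int64_t vl = svcntd();     /* 8 elements per 512-bit vector in DP */
        svfloat64_t s0 = svdup_n_f64(0.0), s1 = s0, s2 = s0, s3 = s0,
                    s4 = s0, s5 = s0, s6 = s0, s7 = s0;

        int64_t i = 0;
        for (; i + 8 * vl <= n; i += 8 * vl) {   /* eight independent add chains */
            s0 = svadd_f64_x(all, s0, svld1_f64(all, &a[i + 0 * vl]));
            s1 = svadd_f64_x(all, s1, svld1_f64(all, &a[i + 1 * vl]));
            s2 = svadd_f64_x(all, s2, svld1_f64(all, &a[i + 2 * vl]));
            s3 = svadd_f64_x(all, s3, svld1_f64(all, &a[i + 3 * vl]));
            s4 = svadd_f64_x(all, s4, svld1_f64(all, &a[i + 4 * vl]));
            s5 = svadd_f64_x(all, s5, svld1_f64(all, &a[i + 5 * vl]));
            s6 = svadd_f64_x(all, s6, svld1_f64(all, &a[i + 6 * vl]));
            s7 = svadd_f64_x(all, s7, svld1_f64(all, &a[i + 7 * vl]));
        }
        s0 = svadd_f64_x(all, svadd_f64_x(all, s0, s1), svadd_f64_x(all, s2, s3));
        s4 = svadd_f64_x(all, svadd_f64_x(all, s4, s5), svadd_f64_x(all, s6, s7));
        double sum = svaddv_f64(all, svadd_f64_x(all, s0, s4));  /* one horizontal add */

        for (; i < n; ++i)   /* scalar remainder */
            sum += a[i];
        return sum;
    }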
C. Multicore saturation
The "naive" scaling hypothesis of the ECM model assumes perfect performance scaling of a loop across the cores of a contention domain (characterized by a shared resource such as L3 or memory bandwidth) until a bandwidth bottleneck is hit. Figure 4 shows a comparison of model and measurement for the TRIAD, SUM, and 2D PT kernels. For SUM it is evident that insufficient MVE (as shown in the "u=1" data) is the root cause for non-saturation of the memory bandwidth due to the long ADD latency. For the stencil kernel, saturation is possible even without unrolling, but more cores are needed. The ECM model describes the scaling features qualitatively; the largest deviation occurs around the saturation point. This is a known effect that can be corrected for by introducing a latency term [14], which is left for future work.

Fig. 4. Multicore scaling within one ccNUMA domain for (a) TRIAD, (b) SUM, and (c) 2D PT kernels, comparing the ECM model with measurements. Data without unrolling are shown for reference. Note that the read-only memory bandwidth was used as a limit for SUM. The working set size for TRIAD and SUM was set to 4 GB. For 2D PT, the problem size was chosen as 10000 so that the layer condition is broken at L1 but fulfilled at L2.
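In the notation commonly used with the ECM model (our formulation; the text does not spell this out), the naive scaling hypothesis amounts to

\[ P(n) \approx \min\bigl(n\,P_\mathrm{ECM},\; P_\mathrm{sat}\bigr), \qquad n_s = \left\lceil \frac{T_\mathrm{ECM}^\mathrm{Mem}}{T_\mathrm{Mem}} \right\rceil, \]

where P_ECM is the single-core in-memory prediction, P_sat the performance at the memory-bandwidth limit, and n_s the estimated number of cores needed to reach saturation.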
IV. CASE STUDY: SPARSE MATRIX-VECTOR MULTIPLICATION
The analysis of the SUM reduction benchmark revealed that the long ADD latency prevents bandwidth saturation if proper unrolling with MVE is not employed. The CRS SpMV is similar as it requires a horizontal reduction along each row of the matrix (the inner loop). However, this loop is short since the average number of non-zeros per row, N_nzr, is typically in the range of 10–1000 for practical applications. For the HPCG matrix we have N_nzr ≈ 27, so four 512-bit-wide fmad/fmla instructions (eight iterations each) are required, which takes 4 × 9 cy = 36 cy because they form a dependency chain on the accumulator. The horizontal add (faddv) of the final vector result adds another 11.5 cy, which leads to a minimum of 47.5 cy per row (i.e., at most 2.05 Gflop/s). The memory data traffic for CRS SpMV is [N_nzr × (12 + 8α) + 20] bytes per row, where α characterizes the efficiency of right-hand side (RHS) access [15]. For the regular stencil-like HPCG matrix, α can be assumed to be close to the optimistic lower limit of α = 1/N_nzr, resulting in a data traffic of 352 bytes per row (measured: 363 bytes using likwid-perfctr). This leads to a single-core bandwidth of 352 bytes/row × f / 47.5 cy/row ≈ 13.3 Gbyte/s, where f is the clock frequency. Since 12 × 13.3 Gbyte/s ≈ 160 Gbyte/s, it is impossible for this code to hit the bandwidth limit of 210 Gbyte/s per CMG. In practice, the loop overheads caused by the extremely short loops and the poor OoO execution (see Sect. III-B) further decrease the single-core performance.
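For reference, a straightforward SVE version of the CRS kernel analyzed above might look as follows (our own sketch with ACLE intrinsics and hypothetical array names, not the implementation measured in this work); it exhibits exactly the per-row index load, gather, fmla chain, and final faddv discussed in the text:

    #include <arm_sve.h>
    #include <stdint.h>

    /* Hedged sketch: SVE-vectorized CRS SpMV y = A*x.
     * rowptr/col/val are the usual CRS arrays (names are ours). */
    void spmv_crs(int64_t nrows, const int64_t *rowptr, const uint32_t *col,
                  const double *val, const double *x, double *y)
    {
        for (int64_t r = 0; r < nrows; ++r) {
            svfloat64_t acc = svdup_n_f64(0.0);
            for (int64_t j = rowptr[r]; j < rowptr[r + 1]; j += svcntd()) {
                svbool_t pg = svwhilelt_b64_s64(j, rowptr[r + 1]);
                svuint64_t idx = svld1uw_u64(pg, &col[j]);               /* 32-bit indices, zero-extended */
                svfloat64_t xv = svld1_gather_u64index_f64(pg, x, idx);  /* gather RHS entries */
                svfloat64_t mv = svld1_f64(pg, &val[j]);
                acc = svmla_f64_m(pg, acc, mv, xv);                      /* acc += val * x[col] */
            }
            y[r] = svaddv_f64(svptrue_b64(), acc);                       /* horizontal add per row */
        }
    }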
Contrary to the SUM benchmark, unrolling does not fix the problem due to the extra work at the end of the short loop and the loss of efficiency when N_nzr is not a multiple of the vector length. Hence, we choose the SELL-C-σ sparse matrix format [15], which is a portable, SIMD-friendly data format for CPUs, GPUs, and vector machines. SELL-C-σ stores chunks of C consecutive rows (zero-padded to the longest row) in column-major format. The parameter C is tunable; for efficiency, it should be a multiple of VL as well as large enough to allow for sufficient unrolling. A further benefit of the format is the lack of an expensive horizontal add operation (faddv). The only drawback is that the zero padding can reduce the efficiency if rows have very different lengths. To mitigate this effect, rows are first sorted by descending length within a sorting window (σ) to reduce the padding. With proper selection of C and σ, the padding can be made negligible in most cases. For the matrices considered in this work it was never larger than 5%.
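Assuming C = 32 (the value chosen below), a minimal sketch of the resulting kernel could look as follows (our own ACLE code with hypothetical array names cs, cl, col, val; not the implementation measured in this work). Each SVE vector handles 8 of the 32 rows of a chunk, so the inner loop runs four independent fmla chains and no horizontal add is needed:

    #include <arm_sve.h>
    #include <stdint.h>

    /* Hedged sketch: SpMV for a SELL-32-sigma matrix, assuming VL = 8 (512-bit SVE).
     * cs[] holds chunk offsets, cl[] the padded chunk widths; val/col are stored
     * column-major within each chunk of 32 rows. */
    void spmv_sell32(int64_t nchunks, const int64_t *cs, const int64_t *cl,
                     const double *val, const uint32_t *col,
                     const double *x, double *y)
    {
        const svbool_t all = svptrue_b64();
        const int64_t vl = svcntd();                 /* 8, so C = 32 = 4*vl */
        for (int64_t c = 0; c < nchunks; ++c) {
            svfloat64_t y0 = svdup_n_f64(0.0), y1 = y0, y2 = y0, y3 = y0;
            for (int64_t j = 0; j < cl[c]; ++j) {    /* one padded "column" at a time */
                const double   *v = &val[cs[c] + 32 * j];
                const uint32_t *k = &col[cs[c] + 32 * j];
                y0 = svmla_f64_x(all, y0, svld1_f64(all, &v[0 * vl]),
                     svld1_gather_u64index_f64(all, x, svld1uw_u64(all, &k[0 * vl])));
                y1 = svmla_f64_x(all, y1, svld1_f64(all, &v[1 * vl]),
                     svld1_gather_u64index_f64(all, x, svld1uw_u64(all, &k[1 * vl])));
                y2 = svmla_f64_x(all, y2, svld1_f64(all, &v[2 * vl]),
                     svld1_gather_u64index_f64(all, x, svld1uw_u64(all, &k[2 * vl])));
                y3 = svmla_f64_x(all, y3, svld1_f64(all, &v[3 * vl]),
                     svld1_gather_u64index_f64(all, x, svld1uw_u64(all, &k[3 * vl])));
            }
            svst1_f64(all, &y[32 * c + 0 * vl], y0);  /* C results per chunk, no faddv */
            svst1_f64(all, &y[32 * c + 1 * vl], y1);
            svst1_f64(all, &y[32 * c + 2 * vl], y2);
            svst1_f64(all, &y[32 * c + 3 * vl], y3);
        }
    }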
On A64FX, C = 32 enables four-way unrolling on top of SVE, which should reduce the impact of the ADD latency by a factor of four (i.e., to 2.25 cy). According to the ECM model, the data transfer through the memory hierarchy should then become the bottleneck. Much of the time here is lost in the LD of the index array and the subsequent gather instruction from the L1 cache to the registers (5.5 cy according to Table II). Loading the matrix values costs an extra 0.5 cy, so we require at least (5.5 + 0.5) × 27/8 ≈ 20 cy per row of the HPCG matrix for the L1-to-register transfers. From L2 and memory, as shown above, at least 352 bytes are required per row, which translates to 352/64 cy = 5.5 cy and 352/117 cy ≈ 3 cy, respectively. Figure 5 shows the performance of SELL-C-σ in comparison to CRS on a collection of matrices from the SuiteSparse Matrix Collection [17]. All matrices saturated the main memory bandwidth with SELL-C-σ, and we see an improvement of about 1.5× compared to CRS on average. The SpMV performance attained with SELL-C-σ is on par with the NVIDIA V100 GPU [18] and the NEC SX-Aurora Tsubasa vector accelerators [19].
Matrix        Perf. [Gflop/s]
              SELL     CRS
af_shell10    124.0    68.5
BenElechi1    112.3    86.6
bone010       119.4    93.5
HPCG          110.8    57.0
ML_Geer       129.1   102.9
nlpkkt120     114.4    60.1
pwtk          105.7    78.3
Fig. 5. (Left) Scaling performance and ECM prediction on one CMG for SpMV in SELL-C-σ and CRS format with the HPCG matrix. (Right) Full-node SpMV performance of different matrices in Gflop/s in both formats. C was chosen as 32 and σ was tuned between 1 and 1024. Reverse Cuthill-McKee reordering [20] was done if it improved the performance.

V. CONCLUSION
A. Summary and outlook
Via an analysis of in-core features and data transfers, we have established an ECM machine model for the A64FX CPU in the Fujitsu FX700 system and applied it to simple streaming kernels and to sparse matrix-vector multiplication using the HPCG matrix. For in-memory data sets, the single-core ECM model was shown to be accurate within a maximum error of 20%. The memory hierarchy turned out to be partially overlapping, allowing for a substantial single-core memory bandwidth with optimized code. Long floating-point instruction latencies and limited out-of-order execution capabilities were identified as the main culprits for poor performance and lack of bandwidth saturation. With the current gcc compiler, vector intrinsics and manual unrolling are often required to achieve high performance. For SpMV, the SELL-C-σ matrix storage format was shown to achieve superior performance and memory-bandwidth saturation, though requiring almost all cores of the ccNUMA domain. This means that SpMV performance will be very sensitive to load imbalance and inefficiencies in data access.

In future work we will conduct a more comprehensive analysis of SpMV and a comparison with other high-end devices such as the Nvidia A100 GPU and the NEC SX-Aurora Tsubasa. We will also investigate advanced A64FX features such as the sector cache and cache line zero instructions. In addition, we will refine the ECM model with a more accurate saturation model using an additional latency term. We will also apply the refined model to the HPCG benchmark and to lattice QCD applications [21].

B. Related work
Since the A64FX CPU has only been available for a short time, the amount of performance-centric research is limited. Dongarra [22] reports on basic architectural features, HPC benchmarks (HPL, HPCG, HPL-AI), and the software environment of the Fugaku system. Jackson et al. [23] investigate some full applications and proxy apps in comparison to Intel and other Arm-based systems but do not use performance models for analysis.
REFERENCES

[1] J. Dongarra, M. A. Heroux, and P. Luszczek, "High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems," The International Journal of High Performance Computing Applications, vol. 30, no. 1, pp. 3–10, 2016. [Online]. Available: https://doi.org/10.1177/1094342015593158
[2] G. Hager, J. Treibig, J. Habich, and G. Wellein, "Exploring performance and power properties of modern multicore chips via simple machine models," Concurrency Computat.: Pract. Exper., 2013. [Online]. Available: http://dx.doi.org/10.1002/cpe.3180
[3] H. Stengel, J. Treibig, G. Hager, and G. Wellein, "Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model," in Proceedings of the 29th ACM International Conference on Supercomputing, ser. ICS'15. New York, NY, USA: ACM, 2015. [Online]. Available: http://doi.acm.org/10.1145/2751205.2751240
[4] J. Hofmann, C. Alappat, G. Hager, D. Fey, and G. Wellein, "Bridging the architecture gap: Abstracting performance-relevant properties of modern server processors," Supercomputing Frontiers and Innovations, vol. 7, no. 2, 2020. [Online]. Available: https://superfri.org/superfri/article/view/310
[5] J. Laukemann, J. Hammer, G. Hager, and G. Wellein, "Automatic throughput and critical path analysis of x86 and ARM assembly kernels," in 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019, pp. 1–6. [Online]. Available: http://dx.doi.org/10.1109/PMBS49563.2019.00006
[6] J. Hammer, J. Eitzinger, G. Hager, and G. Wellein, "Kerncraft: A tool for analytic performance modeling of loop kernels," in Tools for High Performance Computing 2016: Proceedings of the 10th International Workshop on Parallel Tools for High Performance Computing, October 2016, Stuttgart, Germany, C. Niethammer et al., Eds. Cham: Springer International Publishing, 2017, pp. 1–22. [Online]. Available: https://doi.org/10.1007/978-3-319-56702-0_1
[7] Arm, ARM C Language Extensions for SVE, accessed 2020-09-28. [Online]. Available: https://developer.arm.com/documentation/100987/0000/
[8] J. Treibig, G. Hager, and G. Wellein, "LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments," in ICPP Workshops, 2010, pp. 207–216. [Online]. Available: http://dx.doi.org/10.1109/ICPPW.2010.38
[9] J. Hofmann, "ibench – Measure Instruction Latency and Throughput," 2018. [Online]. Available: https://github.com/RRZE-HPC/ibench
[10] Fujitsu Limited, A64FX Microarchitecture Manual 1.2, July 2020. [Online]. Available: https://github.com/fujitsu/A64FX/blob/ffc361ef065378a9c3dae86fb5d3b23cc36ce975/doc/A64FX_Microarchitecture_Manual_en_1.2.pdf
[11] "Artifact Description: Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX." [Online]. Available: https://github.com/RRZE-HPC/pmbs2020-paper-artifact
[12] J. Laukemann, "Design and Implementation of a Framework for Predicting Instruction Throughput," Bachelor's Thesis, 2017. [Online]. Available: https://hpc.fau.de/files/2018/08/LaukemannJan_Design_and_Implementation_For_a_Framework_Predicting_Instruction_Throughput.pdf
[13] J. Laukemann, J. Hammer, J. Hofmann, G. Hager, and G. Wellein, "Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures," in 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2018, pp. 121–131. [Online]. Available: http://dx.doi.org/10.1109/PMBS.2018.8641578
[14] J. Hofmann, G. Hager, and D. Fey, "On the accuracy and usefulness of analytic energy models for contemporary multicore processors," in High Performance Computing, R. Yokota et al., Eds. Cham: Springer International Publishing, 2018, pp. 22–43. [Online]. Available: https://doi.org/10.1007/978-3-319-92040-5_2
[15] M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop, "A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units," SIAM Journal on Scientific Computing, vol. 36, no. 5, pp. C401–C423, 2014. [Online]. Available: https://doi.org/10.1137/130930352
[16] C. L. Alappat, J. Hofmann, G. Hager, H. Fehske, A. R. Bishop, and G. Wellein, "Understanding HPC benchmark performance on Intel Broadwell and Cascade Lake processors," in High Performance Computing, P. Sadayappan, B. L. Chamberlain, G. Juckeland, and H. Ltaief, Eds. Cham: Springer International Publishing, 2020, pp. 412–433. [Online]. Available: https://doi.org/10.1007/978-3-030-50743-5_21
[17] T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663
[18] Y. M. Tsai, T. Cojean, and H. Anzt, "Sparse linear algebra on AMD and NVIDIA GPUs – the race is on," in High Performance Computing, P. Sadayappan, B. L. Chamberlain, G. Juckeland, and H. Ltaief, Eds. Cham: Springer International Publishing, 2020, pp. 309–327. [Online]. Available: http://dx.doi.org/10.1007/978-3-030-50743-5_16
[19] C. Gómez, M. Casas, F. Mantovani, and E. Focht, "Optimizing sparse matrix-vector multiplication in NEC SX-Aurora Vector Engine," Technical Report, Barcelona Supercomputing Center, August 2020. [Online]. Available: https://upcommons.upc.edu/bitstream/handle/2117/192586/spmv_aurora_sc20.pdf
[20] E. Cuthill and J. McKee, "Reducing the bandwidth of sparse symmetric matrices," in Proceedings of the 1969 24th National Conference, ser. ACM '69. New York, NY, USA: Association for Computing Machinery, 1969, pp. 157–172. [Online]. Available: https://doi.org/10.1145/800195.805928
[21] N. Meyer et al.