Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels
Jan Laukemann, Julian Hammer, Georg Hager and Gerhard Wellein
{jan.laukemann, julian.hammer, georg.hager, gerhard.wellein}@fau.de
Erlangen Regional Computing Center
Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany

Abstract—Useful models of loop kernel runtimes on out-of-order architectures require an analysis of the in-core performance behavior of instructions and their dependencies. While an instruction throughput prediction sets a lower bound to the kernel runtime, the critical path defines an upper bound. Such predictions are an essential part of analytic (i.e., white-box) performance models like the Roofline and Execution-Cache-Memory (ECM) models. They enable a better understanding of the performance-relevant interactions between hardware architecture and loop code. The Open Source Architecture Code Analyzer (OSACA) is a static analysis tool for predicting the execution time of sequential loops. It previously supported only x86 (Intel and AMD) architectures and simple, optimistic full-throughput execution. We have heavily extended OSACA to support ARM instructions and critical path prediction including the detection of loop-carried dependencies, which turns it into a versatile cross-architecture modeling tool. We show runtime predictions for code on Intel Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on machine models from available documentation and semi-automatic benchmarking. The predictions are compared with actual measurements.
Index Terms—benchmarking, performance modeling, performance engineering, architecture analysis, static analysis
I. INTRODUCTION
Analytic performance modeling of compute-intensive applications during development or optimization can be a powerful tool and sheds light on how code executes on modern CPU architectures. Such models are hence not only constructed for the sake of prediction but also to study relevant bottlenecks and to assess the compiler's ability to generate optimal code. However, they require a deep understanding of the underlying micro-architecture in order to yield accurate results. Common (simplified) approaches for numerical kernels are the Roofline [1] model or the ECM [2] model, whose construction is supported by the Kerncraft open-source performance modeling tool [3]. For Roofline, the Roofline Model Toolkit [4] and Intel's Roofline Advisor (https://software.intel.com/en-us/advisor-user-guide-roofline-analysis) are also available.

In general, there are two approaches to predict runtime and performance behavior: simulation and static analysis. Our work implements the latter. Even though simulators may be more thorough and accurate if comprehensive implementations exist, their usage is complicated by obstacles like finding steady states for throughput analysis and pinpointing the runtime-defining hardware bottleneck. In addition, their implementation is much more complex. The analysis and modeling process is split into in-core execution time and data transfer time through the memory hierarchy. See [2], [3] for examples on how this is done. For a long time, the only capable tool for static in-core code analysis was Intel's Architecture Code Analyzer (IACA) [5], which was employed by Kerncraft. Besides being at end-of-life, it has multiple limitations: Intel-only architectures, an undisclosed model, and later versions restricted to the full-throughput assumption. To improve on this, we developed the Open Source Architecture Code Analyzer (OSACA) [6], which has, in addition to the features already known from IACA, extended x86 (Intel Cascade Lake and AMD) and AArch64 ARM support, and supports critical path (CP) latency analysis and loop-carried dependency detection. All three predictions can be combined into a more accurate performance model, with the throughput as a lower bound and the critical path as an upper bound of the kernel runtime. Like IACA, OSACA assumes that all data originates from the first-level (L1) cache.

With OSACA's semi-automatic benchmarking pipeline, compilers can benefit from an automated model construction [3], [4]. The instruction database is dynamically extendable, which enables users to adapt the tool to application scenarios beyond the numerical kernels found in HPC use cases.

This paper is organized as follows: In Section I-A, we cover related work. Section II details the model assumptions and construction for the underlying architecture and the general methodology of the throughput and critical path analysis as well as the loop-carried dependency detection. In Section III we describe the benchmarking hardware/software environment, validate the methodology against actual measurements, and compare with related tools. Section IV summarizes the work and gives an outlook on future developments.

The OSACA software is available for download under AGPLv3 [7]. Information about how to reproduce the results in this paper can be found in the artifact description [8].

This work was in part funded by the BMBF project METACCA.
A. Related Work
OSACA was inspired by IACA, the Intel Architecture Code Analyzer [5]. Developed by Israel Hirsh and Gideon S. [sic], Intel released the tool in 2012 and announced its end-of-life in April 2019. Therefore, no feature enhancements or new microarchitecture support can be expected. It is closed source, and the underlying model has neither been published by the authors nor peer reviewed. The latest version supports throughput analysis on Intel micro-architectures up to Skylake (including AVX-512), but is not capable of critical path analysis or loop-carried dependency detection.

LLVM Machine Code Analyzer (llvm-mca) [9] is a performance analysis tool based on LLVM's existing scheduling models. Currently it lacks support for HPC-relevant ARM architectures such as the ThunderX2, and some scheduling models need refinement. Also, llvm-mca cannot analyze CPs, even though one can manually identify a CP from the provided latency analysis. LLVM Machine Instruction Benchmark (llvm-exegesis) [10] is a micro-benchmarking framework for measuring throughput and latency of instruction forms. It could thus be used as a data source to feed the OSACA instruction database. Mendis et al. [11] apply a deep neural network approach to estimate block throughput on Intel x86 architectures from Ivy Bridge to Skylake. It is able to use IACA byte markers for indicating the code block to analyze and is currently not capable of detecting CPs or loop-carried dependencies. Code Quality Analyzer (CQA) [12] is a static performance analysis tool focused on single-core performance of loop-centric x86 code. Unlike OSACA, its goal is not to predict runtime, but rather to give the developer a quality estimate of the code based on static binary analysis. Uop Flow Simulation (UFS) [13] extends CQA with a simulator for the out-of-order execution, modeling aspects that OSACA assumes to be based on fixed (non-optimal) probabilities.

There are a fair number of simulators available: gem5, developed by Binkert et al. [14], ZSim by Sanchez et al. [15], and MARSSx86 by Patel et al. [16]. While gem5 even supports various non-x86 instruction set architectures (ARM, Power, and RISC-V, among others), all of them are considered "full-system" simulators, going above and beyond the scope of this work. Therefore, they give a coarse overview of complete (multi- or many-core) systems, rather than detailed insights pinpointing a bottleneck.

II. METHODOLOGY
When analyzing loop kernels, we assume for each CPU architecture a corresponding "port model": Each assembly instruction is (optionally) split into micro-ops (µ-ops), which get executed by multiple ports. A particular instruction may have multiple ports that can execute it (e.g., two integer ALUs), or, in the case of complex instructions, multiple ports that must execute it (e.g., combined load and floating-point addition). Shared resources, such as a divider pipeline or a data load unit, are modeled as additional ports.

Each port receives at most one instruction per cycle and may be blocked by an instruction for any number of cycles. To model parallel execution of the same instruction form on multiple ports, the cycles may be spread among multiple ports, also allowing the inverse of integers as acceptable cycle throughput of an instruction per port, but always adding up to at least one cycle per instruction over all ports.
Fig. 1: Assumed generic out-of-order port model. Other shared resources (e.g., DIV pipeline) are modeled as additional ports.

Both x86 and ARM allow memory references to be used in combination with arithmetic instructions. This is modeled by splitting the instruction into a load part and an arithmetic part, and accounting for their respective port pressures and dependencies separately (see below).

Figure 1 shows a diagram of the generic port model. Cascade Lake would be modeled with eight ports, plus one divider pipeline port and two data ports. A floating-point divide instruction would occupy port 0 for one cycle and the DIV port (i.e., pipeline) for four cycles, while an add instruction would use ports 0 and 1 for half a cycle each, because it may be executed on both.

We repeat here the assumptions behind our prediction model [6]:

• All data accesses hit the first-level cache. This is where the boundary between in-core and data transfer analysis is drawn. If a dataset fits in the first-level cache, no cache misses occur. Replacement strategies, prefetching, line buffering, etc., are insignificant on this level. Behavior beyond L1 can be modeled with Kerncraft [3], which relies on an in-core analysis from OSACA and combines it with data analysis to arrive at a unified model prediction.

• Multiple available ports per instruction are utilized with fixed probabilities.
If the exact amount of µ -ops perport per instruction form is unknown, we assume thatall suitable ports for the same instruction are used withfixed probabilities. E.g., an add instruction that mayuse one out of four possible ports and has a maximumthroughput of 1 instr./cy on any unit will be assigned0.25 cy on each of the four ports. This implies imperfectscheduling if ports are asymmetric. Asymmetry meansthat multiple ports can handle the same instruction, butother features of those ports differ (e.g., one port supports add and div , while another supports add and mul ).This may cause load imbalance since, e.g., a code withonly add and mul may be imperfectly scheduled. Theconsideration of the full kernel for a more realistic portpressure model is currently not supported, but is taken hroughput AnalysisCritical Path AnalysisLoop-carried Dependencies Analysis * - Instruction not bound to a port Port Pressure in cycles| 0 | 1 | 2 - 2D | 3 - 3D | 4 | 5 | 6 |-------------------------------------------------------------| | | | | | | | .L22:| | | 0.5 0.5 | 0.5 0.5 | | | | vmovapd 0(%r13,%rax),%ymm0| 0.50 | 0.50 | 0.5 0.5 | 0.5 0.5 | | | | vfmadd213pd (%r14,%rax), \%ymm1,%ymm0| | | 0.5 | 0.5 | 1.0 | | | vmovapd %ymm0,(%r12,%rax)| 0.25 | 0.25 | | | | 0.25 | 0.25 | addq $32,%rax| 0.25 | 0.25 | | | | 0.25 | 0.25 | cmpq %rax,%r15| | | | | | | |* jne .L221.00 1.00 1.5 1.0 1.5 1.0 1.00 0.50 0.50180 | 4.0 | | vmovapd 0(%r13,%rax), %ymm0181 | 4.0 | | vfmadd213pd (%r14,%rax), %ymm1, %ymm0182 | 5.0 | | vmovapd %ymm0, (%r12,%rax) 13.0183 | 1.0 | addq $32, %rax | [183] M a c h i n e F i l e s / D a t aba s e s - name: vfmadd213pd operands: - class: "register" name: "ymm" source: true destination: false - class: "register" name: "ymm" source: true destination: true throughput: 0.5 latency: 4 specific operanddescriptionmnemonicgeneric loadperformanceinformation load_latency: {gpr: 4, xmm: 4, ymm: 4, zmm: 4}load_throughput: {port_pressure: [0,0,0,0.5 ... ,0]} M a r k e d A ss e m b l y movl $111,%ebx Fig. 2: Structural design of OSACA and its workflow, forSTREAM triad (
A(:)=B(:)+s*C(:) ) loop.into account for future versions.
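To make this fixed-probability scheme concrete, the following minimal Python sketch (the helper name, the port labels, and the three-instruction kernel are illustrative and do not reflect OSACA's actual API) spreads each instruction form's reciprocal throughput evenly over its suitable ports; the most loaded port then gives the throughput bound:

    from collections import defaultdict

    def port_pressure(kernel):
        """Spread each instruction's cycles evenly over its suitable
        ports; the maximum per-port pressure is the TP bound in cy."""
        pressure = defaultdict(float)
        for ports, cycles in kernel:      # cycles = inverse throughput
            share = cycles / len(ports)   # fixed, equal probabilities
            for p in ports:
                pressure[p] += share
        return dict(pressure)

    # Hypothetical kernel: (suitable ports, cycles per instruction)
    kernel = [
        (["0", "1", "5", "6"], 1.0),  # add: 4 ports -> 0.25 cy each
        (["0", "1"], 1.0),            # mul: 2 ports -> 0.50 cy each
        (["2D", "3D"], 1.0),          # load: 2 data ports -> 0.50 cy each
    ]
    pressure = port_pressure(kernel)
    print("TP bound:", max(pressure.values()), "cy per iteration")  # 0.75

With asymmetric ports, this even split deliberately models the imperfect scheduling described above rather than an optimal assignment.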
A. Port Model Construction
The overall methodology of OSACA is exemplified using the STREAM triad A(:)=B(:)+s*C(:) loop in Figure 2. The x86 or AArch64 ARM assembly is parsed and the kernel in between the byte markers is extracted. For convenience, OSACA supports IACA's byte markers for x86 and uses the same instruction pattern for ARM assembly. For each parsed instruction form within the kernel, OSACA obtains the maximum inverse throughput and latency in cycles, and the ports it can be scheduled to, from its instruction database. Furthermore, it keeps track of source and destination operands for identifying register dependencies.

Possible sources for OSACA's database are microbenchmark databases like uops.info [17], Agner Fog's "Instruction Table" [18], or specific microbenchmarks using our own frameworks asmbench [19] and ibench [20]. For the latter, OSACA can automatically create benchmark files and import the output into its database, resulting in a semi-automatic benchmark pipeline.
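As an illustration of the database lookup, the following Python stand-in maps an instruction form (mnemonic plus operand classes) to its performance data; the dictionary layout and port assignment are assumptions of this sketch, while the vfmadd213pd throughput/latency values and the load latencies mirror the machine-file excerpt shown in Fig. 2:

    # Simplified in-memory stand-in for the machine file.
    DB = {
        ("vfmadd213pd", ("ymm", "ymm", "ymm")): {
            "throughput": 0.5,     # inverse throughput in cy (Fig. 2)
            "latency": 4,          # cy until the result is usable
            "ports": ["0", "5"],   # illustrative port assignment
        },
    }
    LOAD_LATENCY = {"gpr": 4, "xmm": 4, "ymm": 4, "zmm": 4}

    def lookup(mnemonic, operand_classes):
        """Return the database entry for one instruction form."""
        return DB[(mnemonic, tuple(operand_classes))]

    entry = lookup("vfmadd213pd", ["ymm", "ymm", "ymm"])
    print(entry["throughput"], entry["latency"])   # 0.5 4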
B. Instruction Throughput and Latency Analysis
To obtain the latency and throughput of an instruction, we automatically create assembly benchmarks for use with ibench, which offers the infrastructure to initialize, run, and accurately measure the desired parameters. A Python-based approach to micro-benchmarking using the asmbench framework is also intended, but not yet implemented at the time of writing.

Synthetic dependency chain generation within the assembly kernel allows measurement of throughput and latency of an instruction form and has been described in our previous work [6]. As stated in Section II-A, instruction forms combining memory references with register operands can be measured directly, which currently requires manual effort; alternatively, OSACA is able to calculate the throughput dynamically by taking the maximum of the load and arithmetic parts, and the latency by taking the sum of both parts. The throughput prediction assumes a fixed and balanced utilization of all suitable ports for any instruction form and perfect out-of-order scheduling without loop-carried dependencies. It thus yields a lower bound for the execution time.
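The composition rule for such load-plus-compute instruction forms is simple enough to state as two lines of arithmetic; the cycle counts in this Python sketch are illustrative, not taken from a specific machine file:

    def combine(arith_tp, arith_lat, load_tp, load_lat):
        """Split a load+compute instruction into its two parts:
        throughput is the slower (maximum) of the parts, since they
        occupy different ports; latency is their sum, since the load
        must complete before the arithmetic part can start."""
        return max(arith_tp, load_tp), arith_lat + load_lat

    # e.g., an FMA with a memory source operand (illustrative numbers)
    print(combine(arith_tp=0.5, arith_lat=4, load_tp=0.5, load_lat=4))
    # -> (0.5, 8)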
C. Critical Path Analysis
The critical path analysis is based on a directed acyclic graph (DAG) constructed from inter-instruction register dependencies following these rules:

1) A vertex is created for every instruction form in the marked piece of code.
2) From each instruction form's destination operands, edges are drawn to all instruction forms "further down" relying on these outputs, unless a break of dependency is found in between (e.g., by zeroing the register).
3) All edges are weighted with their source instruction's latency.
4) If a source memory reference has a dependency, an intermediate load-vertex is added along this edge and the additional edge is weighted with the load latency.

After creating the DAG, the longest path within it is determined using a weighted topological sort based on the approach of Manber [21], as sketched below. The CP is thus an upper bound for the execution time of a single instance of the loop body.
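A minimal Python version of this weighted longest-path computation is shown below; the edge-list representation is an assumption of the sketch, not OSACA's internal data structure:

    from collections import defaultdict, deque

    def critical_path(n, edges):
        """Longest path in a DAG with n vertices, edges = (src, dst, w),
        where w is the source instruction's latency (Manber-style DP)."""
        adj = defaultdict(list)
        indeg = [0] * n
        for u, v, w in edges:
            adj[u].append((v, w))
            indeg[v] += 1
        dist = [0] * n                # longest path ending at each vertex
        queue = deque(v for v in range(n) if indeg[v] == 0)
        while queue:                  # Kahn-style topological order
            u = queue.popleft()
            for v, w in adj[u]:
                dist[v] = max(dist[v], dist[u] + w)
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        return max(dist)

    # Toy chain: load (4 cy) -> FMA (4 cy) -> store
    print(critical_path(3, [(0, 1, 4), (1, 2, 4)]))   # 8

Because the graph is acyclic, every vertex is settled exactly once, so the cost is linear in the number of vertices and edges.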
D. Loop-Carried Dependency Detection
Dependencies between iterations, i.e., loop-carried dependencies (LCDs), can drastically influence the runtime prediction of loop kernels: Even with sufficient out-of-order execution resources, overlap of successive iterations is only possible up to the limit set by the LCD. The actual runtime is thus limited from below by the length of the LCD chain. OSACA can detect LCDs by creating a DAG of a code comprising two back-to-back copies of the loop body. It can thus analyze paths from each vertex of the first kernel section and detect most cyclic LCDs if there exists a dependency chain, observable through register dependencies, from one instruction form to its corresponding duplicate in the next iteration.
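Conceptually, the check reduces to a reachability problem on the doubled kernel: instruction i carries a dependency into the next iteration if its duplicate in the second copy is reachable from it. A minimal Python sketch, with illustrative vertex numbering and edges:

    def loop_carried_deps(n, edges):
        """Vertices 0..n-1 are the loop body, n..2n-1 its back-to-back
        copy; instruction i is part of an LCD if its duplicate i+n is
        reachable from i in the doubled dependency graph."""
        adj = {v: [] for v in range(2 * n)}
        for u, v in edges:
            adj[u].append(v)

        def reachable(src, dst):
            stack, seen = [src], set()
            while stack:             # iterative depth-first search
                u = stack.pop()
                if u == dst:
                    return True
                if u in seen:
                    continue
                seen.add(u)
                stack.extend(adj[u])
            return False

        return [i for i in range(n) if reachable(i, i + n)]

    # Toy body: instr. 0 feeds instr. 1, whose result feeds instr. 0
    # of the next iteration (vertex 2 is the duplicate of vertex 0).
    print(loop_carried_deps(2, [(0, 1), (1, 2), (2, 3)]))   # [0, 1]

The latency-weighted length of the longest such cyclic path is what enters the prediction as the LCD bound.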
III. RESULTS
The CP and LCD detection described in Section II are included in OSACA's analysis of loop code and presented together with the "classic" throughput results. For validation we use assembly representations generated by the Intel Fortran Compiler for x86 and the GNU Fortran Compiler for ARM, respectively. In the case of CLX we also compare to the IACA and LLVM-MCA predictions for Skylake-X, which does not differ in terms of the port model. Due to the proprietary nature of IACA, we cannot use it on any AMD- or ARM-based system; hence, we compare against LLVM-MCA on AMD Zen. For lack of other tools, on TX2 OSACA's prediction can only be compared to measurements.
A. Example: Gauss-Seidel method on CLX, ZEN, and TX2
An interesting floating-point benchmark for comparing predictions with the measured runtime is a 2D version of the "Gauss-Seidel" sweep [22]:

    do it=1,itmax
      do k=1,kmax-1
        do i=1,imax-1
          phi(i,k,t0) = 0.25 * (phi(i,k-1,t0) + phi(i+1,k,t0) +
                                phi(i,k+1,t0) + phi(i-1,k,t0))
        enddo
      enddo
    enddo

It has one multiplication and three additions per iteration. As the update of the matrix happens in place, each iteration depends on the previously calculated values of its "left" (i-1) and "bottom" (k-1) neighbors. This is the basic LCD that should govern the code's runtime; the CP may be longer since it may contain instructions that are not part of the LCD. If the hardware has sufficient out-of-order capabilities, it should be able to overlap that "extra" part across successive loop iterations. Finally, the pure throughput prediction (TP) should be much too optimistic since it ignores all dependencies.

Since we have demonstrated OSACA's TP analysis in previous work [6], we focus here on the refinement of runtime predictions via CP and LCD analysis. The total runtime is measured and combined with the number of iterations to get lattice site updates per second [LUP/s] and cycles per iteration [cy/it] in columns 3–4 of Table I.

Unrolling by the compiler must be considered when interpreting OSACA predictions since they strictly pertain to the assembly level. E.g., if a loop was unrolled four times, as is the case for our Gauss-Seidel examples, the prediction by OSACA will be for four original (high-level) iterations. This also applies to unrolling for SIMD vectorization, which is not possible here, however. In this paper, OSACA and IACA predictions in cycles are given for one assembly code iteration, whereas the unit "cy/it" always refers to high-level source code iterations. The total unrolling factor chosen by the compilers is 4x for all architectures. Table II shows the condensed OSACA output for the TX2. Predictions by OSACA, IACA, and LLVM-MCA can be found in Table I.

The predicted block throughput of all three analysis tools is far from the measurements, as expected. Even though IACA is no longer capable of detecting CPs and analyzing the latency of kernels, its analysis report states a block throughput of 14 cy/it, contrary to the pure port binding of 2 cy/it. No explanation for this behavior can be found in the output, although it matches exactly the LCD and the measurement.

Using the additional -timeline flag, LLVM-MCA provides a timeline view showing, for a configurable number of cycles or iterations, the expected cycle of dispatch, execution, and retirement of each instruction. Since it models register dependencies, we take this as its CP analysis and expect the time from the beginning of the first iteration to the retirement of its jump instruction to be the CP length, while all further executions have the length of the LCD. Both numbers can be found in the last column of Table I. While LLVM-MCA overestimates the execution time on ZEN by almost 50%, it predicts the runtime on CLX nearly exactly. For the ThunderX2, LLVM-MCA is capable of analyzing neither throughput nor latency at the time of writing.

OSACA provides a runtime bracket determined by the CP (upper bound) and the length of the longest cyclic LCD path (lower bound).
The measured execution time should usually lie between these limits unless bottlenecks apply that are beyond our model (e.g., instruction cache misses, bank conflicts, etc.). As seen in column 6 of Table I, the actual measurement lies within the prediction bracket in every analysis case, and the measurement is very close to the longest LCD path for this kernel. As expected, the runtime is faster than the pure CP length, since instructions that are not part of the LCD path can overlap across iterations.

The detailed OSACA analysis for ThunderX2 can be found in Table II. The CP and LCD columns show latency values for instruction forms along the CP and the longest cyclic LCD path, respectively. Fig. 3 depicts the graph generated by OSACA from the assembly.

Note that in cases where the LCD is very short or zero, the throughput prediction applies, and a deviation of the measurement from this lower limit points to either a shortage of OoO resources (physical registers, reorder buffer) or an architectural effect not covered by the machine model.
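Since per-assembly-iteration predictions must be divided by the unroll factor before they can be compared with the cy/it numbers in Table I, here is a small worked example in Python; the 72 cy input is back-computed from the published 18.00 cy/it and the 4x unrolling, not taken from the raw OSACA report:

    def per_source_iteration(cycles_per_asm_iteration, unroll_factor):
        """Convert a prediction for one assembly-code iteration into
        the cy/it unit of Table I (high-level source iterations)."""
        return cycles_per_asm_iteration / unroll_factor

    # TX2 LCD example: 72 cy per 4x-unrolled assembly iteration
    # -> 18.0 cy/it, close to the measured 18.50 cy/it and below
    # the 25.0 cy/it CP bound.
    print(per_source_iteration(72.0, 4))   # 18.0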
B. Validation Hardware, Software, and Runtime Environment
OSACA (version 0.3.1.dev0) was run with Python v3.6.8, and benchmarks were compiled using Intel ifort v19.0.2 and GNU Fortran (ARM-build-8) 8.2.0, respectively. All results presented were gathered on three machines, with fixed clock frequency and disabled turbo mode:

                            Unroll   Measured          OSACA [cy/it]        IACA [cy/it]        LLVM-MCA [cy/it]
    Architecture            factor   MLUP/s   cy/it    TP     LCD    CP     TP     LCD   CP     TP     LCD    CP
    Marvell ThunderX2       4x       118.9    18.50    2.46   18.00  25.00  —      —     —      —      —      —
    Intel Cascade Lake X    4x       178.3    14.02    2.19   14.00  18.00  14.00  —     —      2.00   14.75  19.00
    AMD Zen                 4x       194.4    11.83    2.00   11.50  15.00  —      —     —      3.00   18.00  24.00
TABLE I: Analysis and measurement of the Gauss-Seidel code on three architectures with OSACA, IACA, and LLVM-MCA predictions. Dashes denote unsupported analysis types or architectures. TP is the throughput prediction, a lower runtime bound; LCD is the loop-carried dependency prediction, an expected runtime; CP is the critical path prediction, an upper runtime bound.
    P0 | P1 | P2 | P3 | P4 | P5 | LCD | CP | LN | Assembly Instructions
    .L20:
    ldr   d31, [x15, x18, lsl 3]
    ldr   d0, [x15, 8]
    mov   x14, x15
    add   x16, x15, 24
    ldr   d2, [x15, x30, lsl 3]
    add   x15, x15, 32
    fadd  d1, d31, d0
    fadd  d3, d1, d30
    fadd  d4, d3, d2
    fmul  d5, d4, d9
    str   d5, [x14], 8
    ldr   d6, [x14, x18, lsl 3]
    ldr   d16, [x14, 8]
    add   x13, x14, 8
    ldr   d7, [x14, x30, lsl 3]
    fadd  d17, d6, d16
    fadd  d18, d17, d5
    fadd  d19, d18, d7
    fmul  d20, d19, d9
    str   d20, [x15, -24]
    ldr   d21, [x13, x18, lsl 3]
    ldr   d23, [x14, 16]
    ldr   d22, [x13, x30, lsl 3]
    fadd  d24, d21, d23
    fadd  d25, d24, d20
    fadd  d26, d25, d22
    fmul  d27, d26, d9
    str   d27, [x14, 8]
    ldr   d30, [x15]
    ldr   d28, [x16, x18, lsl 3]
    ldr   d29, [x16, x30, lsl 3]
    fadd  d31, d28, d30
    fadd  d2, d31, d27
    fadd  d0, d2, d29
    fmul  d30, d0, d9
    str   d30, [x15, -8]
    cmp   x7, x15
    bne   .L20
    (per-port pressures, LCD/CP latencies, and sums per high-level iteration not reproduced)
TABLE II: (Condensed) OSACA analysis of the Gauss-Seidel assembly code for the ARM-based ThunderX2 architecture. The LN column gives line numbers.
ThunderX2: ARM-based Marvell ThunderX2 9980 with ThunderX2 micro-architecture (formerly known as Cavium Vulcan) at 2.2 GHz (TX2); gfortran, options -mcpu=thunderx2t99+simd+fp -fopenmp-simd -funroll-loops -Ofast
Fig. 3: (Compressed) dependency graph of the Gauss-Seidel code on TX2, created by OSACA. Orange nodes are on the longest LCD, including the back edge. Pink dashed lines and outlined nodes make up the CP. Numbers in nodes are line numbers, as found in Table II, and weights along the edges are latency cycles.
Cascade Lake: Intel Xeon Gold 6248 with Cascade Lake X micro-architecture at 2.5 GHz (CLX); ifort, options -funroll-loops -xCASCADELAKE -Ofast
Zen: AMD EPYC 7451 with Zen micro-architecture at 2.3 GHz (ZEN); gfortran, options -funroll-loops -mavx2 -mfma -Ofast
The process was always bound to a physical core. In effect, statistical runtime variations were small enough to be ignored.

IV. CONCLUSION
A. Summary
We have shown that automatic extraction, throughput analysis, and critical path analysis of assembly loop kernels are feasible using our cross-platform tool OSACA. OSACA's results are accurate and sometimes even more precise and versatile than the predictions of comparable tools like IACA and LLVM-MCA. Additionally, direct critical path analysis including loop-carried dependencies is not supported by any other tool to date, although it can be inferred manually from LLVM-MCA's timeline information.
B. Future Work
In the future we intend to extend OSACA to support hidden dependencies, i.e., instructions accessing resources not named explicitly in the assembly, such as status flags and load-after-store dependencies, including stack operations. The LCD analysis is not perfect and may miss dependencies in some special cases, which can be improved by taking more than two iterations into account. Furthermore, we plan to increase the number of micro-benchmark interfaces and to support the semi-automatic usage of asmbench in the OSACA toolchain. Beyond the even distribution of µ-ops across multiple ports, we want to implement a more realistic scheduling scheme that takes port utilization into account. Support for new micro-architectures like AMD's Zen 2 and eventually IBM's Power9 is also planned. Another topic is the overlap of latency in complex instructions, which can change the outcome of the analysis slightly but may be significant in pathological cases (e.g., in a = a + b × c with an FMA instruction, the multiplication may already execute before a becomes available). The split, as well as the fusion, of µ-ops is currently not considered, but can be achieved with replacement rules in the architecture model description.

Accurately modeling the performance characteristics of the decode, reorder buffer, register allocation/renaming, retirement, and other stages, which all may limit the execution throughput and impose latency penalties, is currently out of scope for OSACA.

REFERENCES

[1] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
[2] H. Stengel, J. Treibig, G. Hager, and G. Wellein, "Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model," in Proceedings of the 29th ACM International Conference on Supercomputing, ser. ICS '15. New York, NY, USA: ACM, 2015, pp. 207–216, doi: 10.1145/2751205.2751240.
[3] J. Hammer, J. Eitzinger, G. Hager, and G. Wellein, "Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels," in Tools for High Performance Computing 2016, C. Niethammer, J. Gracia, T. Hilbrich, A. Knüpfer, M. M. Resch, and W. E. Nagel, Eds. Cham: Springer International Publishing, 2017, pp. 1–22, doi: 10.1007/978-3-319-56702-0_1.
[4] Y. Lo, S. Williams, B. Van Straalen, T. J. Ligocki, M. J. Cordery, N. J. Wright, M. W. Hall, and L. Oliker, "Roofline Model Toolkit: A Practical Tool for Architectural and Program Analysis," in High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, ser. Lecture Notes in Computer Science, S. A. Jarvis, S. A. Wright, and S. D. Hammond, Eds., vol. 8966. Springer International Publishing, 2015, pp. 129–148, doi: 10.1007/978-3-319-17248-4_7.
[5] Intel Architecture Code Analyzer. (2017, 11). [Online]. Available: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer
[6] J. Laukemann, J. Hammer, J. Hofmann, G. Hager, and G. Wellein, "Automated instruction stream throughput prediction for Intel and AMD microarchitectures," Nov 2018, pp. 121–131.
[7] J. Laukemann. (2017, 12) OSACA – Open Source Architecture Code Analyzer. [Online]. Available: https://github.com/RRZE-HPC/OSACA
[8] "Artifact description: Automatic throughput and critical path analysis of x86 and ARM assembly kernels." [Online]. Available: https://github.com/RRZE-HPC/OSACA-CP-2019
[9] D. Andric. [RFC] llvm-mca: a static performance analysis tool. [Online]. Available: http://llvm.1065342.n5.nabble.com/llvm-dev-RFC-llvm-mca-a-static-performance-analysis-tool-td117477.html
[10] llvm-exegesis – LLVM Machine Instruction Benchmark. [Online]. Available: https://llvm.org/docs/CommandGuide/llvm-exegesis.html
[11] C. Mendis, A. Renda, S. Amarasinghe, and M. Carbin, "Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks," in Proceedings of the 36th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, Jun 2019, pp. 4505–4515. [Online]. Available: http://proceedings.mlr.press/v97/mendis19a.html
[12] A. S. Charif-Rubial, E. Oseret, J. Noudohouenou, W. Jalby, and G. Lartigue, "CQA: A code quality analyzer tool at binary level," Dec 2014, pp. 1–10, doi: 10.1109/HiPC.2014.7116904.
[13] V. Palomares, D. C. Wong, D. J. Kuck, and W. Jalby, "Evaluating out-of-order engine limitations using uop flow simulation," in Tools for High Performance Computing 2015, A. Knüpfer, T. Hilbrich, C. Niethammer, J. Gracia, W. E. Nagel, and M. M. Resch, Eds. Cham: Springer International Publishing, 2016, pp. 161–181, doi: 10.1007/978-3-319-39589-0_13.
[14] N. Binkert, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, D. A. Wood, B. Beckmann, G. Black et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, p. 1, Aug 2011, doi: 10.1145/2024716.2024718.
[15] D. Sanchez and C. Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, 2013, doi: 10.1145/2485922.2485963.
[16] A. Patel, F. Afram, and K. Ghose, "MARSS-x86: A QEMU-based micro-architectural and systems simulator for x86 multicore processors," 2011, pp. 29–30.
[17] A. Abel and J. Reineke, "uops.info: Characterizing latency, throughput, and port usage of instructions on Intel microarchitectures," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems