GPA: A GPU Performance Advisor Based on Instruction Sampling
Keren Zhou [email protected]
Rice University, Houston, Texas, United States
Xiaozhu Meng [email protected]
Rice University, Houston, Texas, United States
Ryuichi Sai [email protected]
Rice University, Houston, Texas, United States
John Mellor-Crummey [email protected]
Rice University, Houston, Texas, United States
Abstract
Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained suggestions at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimization opportunities at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program's structure and the GPU to match inefficiency patterns with suggestions for optimization. To quantify each suggestion's potential benefits, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides an insightful report to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01× to 3.53×, with a geometric mean of 1.22×.

1 Introduction

Graphics Processing Units (GPUs) have been extensively employed in data centers and supercomputers as a building block to accelerate High-Performance Computing (HPC) and machine learning applications. However, fully utilizing the compute power of GPUs is challenging. Tuning GPU code
to achieve the maximum possible performance requires significant manual effort to cope with the complexity of GPU architectural features and programming models.

GPU profilers [1, 20, 24, 27, 28, 33, 42] are widely used for measuring GPU-accelerated applications. While these tools identify hot GPU code, they lack sophisticated analysis of performance bottlenecks and provide little insight into how to improve the code. nvprof and Nsight-Compute, for example, analyze performance measurement data and propose suggestions at the kernel level, but they do not identify specific lines that could be optimized nor estimate the potential gain after applying optimizations. As a result, even with GPU profilers, diagnosing and fixing performance problems requires expertise in interpreting measurement data and associating suggestions with corresponding bottlenecks.

Prior tools on GPUs [4, 8, 32] provide fine-grained suggestions using instrumentation-based methods to quantify the severity of performance problems and locate problematic code. These tools identify one or a few patterns, such as redundant values/addresses, insufficient cache utilization, or memory transaction bursts, but overlook others. Moreover, they do not correlate execution time with the patterns. As a result, one may fix specific problems indicated by the tools but not achieve any speedup.

Modern processors support fine-grained measurement using sampling [15-17, 19], which can be used to study instruction statistics in applications quantitatively. Unique among GPU vendors, NVIDIA implements PC sampling on its GPUs to sample instructions and associate them with stall reasons. Existing performance tools [20, 27, 33, 40, 42] that utilize PC sampling only associate instruction samples with source lines of GPU code where the stalls occur but lack the ability to derive performance insight based on stall reasons.

To complement the aforementioned approaches, we propose GPA, a GPU performance advisor that suggests effective optimizations for GPU code, and evaluate GPA on a V100 GPU with the Rodinia benchmarks [12], several larger application benchmarks, and a combustion application. Guided by GPA, we improved the performance of the GPU kernels studied by 1.03× to 3.86×.
Figure 1.
A mental model of PC sampling on an SM of NVIDIA's V100 GPU. Samples are taken every 𝑁 cycles. Samples at 𝑁, 4𝑁, and 6𝑁 are latency samples, and others are active samples. Samples at 𝑁, 3𝑁, 4𝑁, 5𝑁, and 6𝑁 are stall samples.

This paper describes the design and implementation of GPA, which consists of the following key components:
• An instruction blamer that attributes stalls to the instructions that cause them;
• Performance optimizers that match inefficiency patterns with optimization suggestions for lines, loops, and functions based on program structure, architectural features, measurement data, and control flow;
• Performance estimators that model GPU execution using instruction samples to estimate the speedup for each optimizer.
The rest of the paper is organized as follows. Section 2 reviews PC sampling and the instruction format on NVIDIA's GPUs. Section 3 introduces the workflow of GPA. Section 4 explains the details of GPA's instruction blamer. Section 5 describes the implementation of GPA's performance optimizers and estimators. Section 6 describes the analysis and optimization of GPU kernels using GPA. Section 7 presents case studies of four larger codes, including a combustion application. Section 8 reviews related work and distinguishes GPA. Finally, Section 9 summarizes our work and outlines our plans for future work.
2 Background

In this section, we describe background necessary to understand our work and our motivation for developing GPA. In Section 2.1, we introduce a model of the PC sampling mechanism implemented in recent NVIDIA GPUs. In Section 2.2, we describe the instruction format used by NVIDIA's GPUs, which is important for instruction dependency analysis. In Section 2.3, we show how raw PC sampling measurements are insufficient to provide insight for performance optimization.
NVIDIA's GPUs implement PC sampling to collect instruction samples. One can use NVIDIA's CUPTI API [25] to collect PC samples for GPU-accelerated applications. Each streaming multiprocessor (SM) in an NVIDIA GPU collects samples individually. When a buffer used to collect samples is full on an SM, CUPTI merges samples from all SMs and transfers the samples to the CPU.

Each SM on an NVIDIA V100 has four warp schedulers, and each warp scheduler is assigned a number of active warps. At the end of each sampling period, an SM records a sample for one of its warp schedulers, cycling through its warp schedulers in a round-robin fashion. When a warp is sampled, two classes of samples are recorded: an active sample when the warp scheduler is issuing an instruction, and a latency sample when no instruction is issuing. For the instruction sampled, a stall reason (e.g., waiting for a value from memory) is recorded, if any. Consider Figure 1 as an example. There are 5 samples with a stall reason; we call them stall samples or stalls in the remaining sections. Because there are three latency samples and three active samples, we estimate the stall ratio and the active ratio of the SM as 3/6. Assuming all SMs on the GPU have a similar workload, we estimate the stall ratio and the active ratio of the GPU kernel as 3/6.

A fixed-length instruction encoding is used on NVIDIA's GPUs. Pre-Volta GPUs use a 64-bit word for an instruction, but Volta and later architectures use a 128-bit word. In this paper, we focus on the Volta architecture used in two of the top three supercomputers, Summit and Sierra. Among the fields of a GPU instruction shown in Table 1, we focus on the following three key fields:
• Wait Mask and Write/Read Barrier. Every GPU instruction has a control code [21, 41] field that encodes information to guide the warp scheduler as it issues instructions, including stall cycles, a yielding flag, and dependencies. For each fixed latency instruction (e.g., most arithmetic instructions), the assembler sets stall cycles for the instruction to indicate how long the scheduler should wait before issuing the instruction. For each variable latency instruction, the assembler associates write/read barrier indices with it and associates a wait mask with instructions that depend on them to create dependencies.
• Predicate. If an instruction's predicate field is set, the instruction is executed when the predicate evaluates as true. There are both true and false predicate conditions: Pi is a true predicate condition, and !Pi is a false predicate condition, where 0 ≤ 𝑖 ≤ 6. In Table 1, the LDG instruction is executed if P0 is true.
• Opcode, Modifiers, and Operands. Each thread can use up to 255 32-bit regular registers ranging from R0 to R254. The opcode and modifiers together determine the length of the operands used.
Table 1.
Dissection of the fields of “ @P0 LDG.32 R0, [R2] ” instruction.
Wait Mask | Write Barrier | Read Barrier | Predicate | Opcode | Modifiers | Destination Operands | Source Operands
          | B0            | B1           | P0        | LDG    | 32        | R0                   | R2, R3

In Table 1, the modifier indicates each thread reads a 32-bit value from memory. Moreover, because the data is loaded from global memory, which has a 64-bit address space, the source operand is a 64-bit value comprised of two registers, R2 and R3.

We refer to a collection of instruction samples and their stall reasons as a raw PC sampling report, from which we can measure the stall reasons of a kernel. However, diagnosing the slowness of the kernel still requires interpretation of the measurement data to answer the following questions.
• Which GPU instructions cause stalls?
• How can we improve the performance by eliminating these stalls?
• What is the estimated speedup for each potential optimization?
To illustrate the importance of analyzing stall reasons and associating them with optimizations, we analyze the hotspot and the b+tree examples in the Rodinia benchmark.

for (int i = 0; i < iteration; i++) {
    temp_t[ty][tx] = temp_on_cuda[ty][tx] + step_div_Cap * (
        power_on_cuda[ty][tx] +
        (temp_on_cuda[S][tx] + temp_on_cuda[N][tx] - 2.0 * temp_on_cuda[ty][tx]) * ...
}

Listing 1. A hot loop in the hotspot example
Listing 1 shows a hot loop of the hotspot kernel. The raw PC sampling report for this kernel indicates large execution latency stalls on Line 2, but it provides little information regarding where the stalls come from and what optimizations apply. GPA attributes the latency to type conversion instructions that demote a 64-bit float to a 32-bit float. Though all arrays are composed of 32-bit values, the compiler generates conversion instructions because a float constant multiplies a 32-bit float value. GPA suggests specifying the type of the constant (2.0) as a 32-bit value to avoid conversion. After applying the optimization, we achieved a 1.14× speedup.

Listing 2 shows a costly loop in the b+tree code. The raw PC sampling report shows high memory dependency stalls on Line 2 but does not propose a suggestion to eliminate the bottleneck. By analyzing the assembly code, GPA concludes that the distance between the load instructions and the instruction that consumes the loaded values is short. Therefore, the instructions in the path are not enough to hide the latency. GPA suggests that users separate the subscripted loads from their uses by reordering code. We read the address of knodesD[currKnodeD[bid]].keys for the next iteration before the synchronization on Line 5 and obtained a 1.16× speedup.

for (int i = 0; i < height; i++) {
    if ((knodesD[currKnodeD[bid]].keys[thid] <= startD[bid]) &&
        (knodesD[currKnodeD[bid]].keys[thid+1] > startD[bid]))
        ...
    __syncthreads();
}

Listing 2.
A hot loop in the b+tree example
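To make these two suggestions concrete, the sketch below shows the kind of source changes described above. It reuses the names from Listings 1 and 2 but abbreviates both loop bodies, so treat it as an illustration of the idea rather than the exact patches we applied.

// Hotspot: write the constant as a 32-bit float literal so the compiler does
// not emit float-to-double and double-to-float conversions around it.
temp_t[ty][tx] = temp_on_cuda[ty][tx] + step_div_Cap * (
    power_on_cuda[ty][tx] +
    (temp_on_cuda[S][tx] + temp_on_cuda[N][tx] - 2.0f * temp_on_cuda[ty][tx]) * ...);

// b+tree: issue the loads needed by the next iteration before the barrier so
// their latency overlaps with __syncthreads(); key0/key1 are hypothetical
// temporaries and the elided body is unchanged.
auto key0 = knodesD[currKnodeD[bid]].keys[thid];
auto key1 = knodesD[currKnodeD[bid]].keys[thid + 1];
for (int i = 0; i < height; i++) {
    if ((key0 <= startD[bid]) && (key1 > startD[bid]))
        ...                                          // body updates currKnodeD[bid]
    key0 = knodesD[currKnodeD[bid]].keys[thid];      // reads for iteration i+1
    key1 = knodesD[currKnodeD[bid]].keys[thid + 1];
    __syncthreads();
}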
Based on the analysis above, we conclude that pure PC sampling information is insufficient to guide optimizations. To improve the quality of the analysis report, we analyze instruction dependencies to characterize stalls' causes. Furthermore, we can associate the stalls with the program's structure to suggest code optimizations, such as loop unrolling, function inlining, and code reordering.

3 Overview of GPA

Figure 2. Overview of GPA: the profiler produces profiles and CUBINs; the static analyzer recovers control flow, program structure, and GPU architectural features; the dynamic analyzer (instruction blamer, performance optimizers, and performance estimators) produces raw advice that can feed a GUI tool.

Figure 2 shows the workflow of GPA. GPA uses a profiler to collect PC samples and kernel launch statistics at runtime and attribute them to the calling context where the kernel is launched. The profiler dumps the profiles and records CUDA binaries (CUBINs) for offline analysis. GPA's static analyzer analyzes CUBINs to recover static information, which is ingested into the dynamic analyzer together with the profiles to generate comprehensive raw advice.

Static Analyzer.
In its static analyzer, GPA analyzes CUBINs to recover the following files:
• Control flow graphs. GPA employs NVIDIA's nvdisasm tool to decode instructions in CUBINs and dump raw control flow graphs. We modify the raw control flow graphs by splitting super blocks into basic blocks and ingest the modified control flow graphs into Dyninst [29] to analyze loop nests.
• Program structure. A program structure file contains function symbols, inline stacks, loop nests, and source line mappings. According to each function symbol's visibility field, we annotate global functions and device functions. We read DWARF information to parse information about inlined functions.
• Architectural features. Based on the architecture flag encoded in CUBINs, we fetch specific hardware configurations, such as instruction latencies, warp size, and register limitations, for analysis in the later stages.
Dynamic Analyzer.
The dynamic analyzer is comprised of three components: an instruction blamer, performance optimizers, and performance estimators. We analyze each GPU kernel's launch context separately. For each kernel invocation, the instruction blamer uses backward slicing [14, 35] to attribute stalls to the responsible instructions. Based on the stall counts and GPA's static analysis results, each performance optimizer attempts to match its optimization strategy to program regions that have high stall samples. Guided by performance models, performance estimators estimate each optimizer's speedup based on the matched samples. Finally, GPA generates an advice report that contains suggestions from its top optimizers, sorted by their estimated speedups.

In this paper, we focus on the implementation of GPA's dynamic analyzer, which tackles the following unique challenges: (1) It extends the backward slicing algorithm for special fields (e.g., barriers) of a GPU instruction to track dependencies among GPU instructions. (2) It attributes stalls to their sources accurately because it incorporates pruning rules to cut down dependency sources. (3) Without code annotation, it derives a general performance model to quantify the benefits of each GPU optimizer.

Utilization of GPA.
GPA is a command line tool that automates the profiling and analysis stages. Since GPA uses sampling-based profiles, users do not need to change their program source code. To provide advice at the source line level, the only change required is adding compiler options to ensure that the compiler includes line mapping information in the GPU binaries it generates. Users apply optimizations according to the raw advice generated by GPA. Today, GPA produces raw advice as ASCII text; however, its advice could be incorporated into a graphical user interface tool to analyze inefficient code regions and optimization suggestions.
LDG R0, [R2]   (writes barrier B0)
BRA 0x100      (wait mask reads barrier B0)
Figure 3.
An example of barrier register dependency
4 Instruction Blamer

CUPTI associates stall reasons [15] with instruction samples. Among the stall reasons, memory dependency, synchronization, and execution dependency stalls are caused by source instructions rather than by the instructions that suffer from the stalls. Other stall reasons, such as memory throttling, are caused by the instruction samples with the stall. To further characterize program bottlenecks with memory dependency, synchronization, and execution dependency stalls, we developed an instruction blamer that attributes stalls to the source instructions.

We first use backward slicing to analyze every instruction's def-use chain in the control flow graph. According to the def-use chains and measurement data, we build an instruction dependency graph where each node is an instruction, annotated with its stalls, and each edge represents a def-use relation. Since not all edges cause stalls, we prune edges according to several heuristic rules. In the end, we apportion the stalls of each node to its incoming edges based on the number of issued instructions and the length of each edge.
Backward slicing.
We target intra-function backward slicing [14] for GPU instructions because instructions in the same function cause most stalls. We find a stalled instruction's immediate dependency sources because transitive dependencies are unlikely to cause the stalls. According to Table 1, several fields of a GPU instruction impact instruction dependencies, including operands, barriers, and predicates. We can begin with a traditional backward slicing algorithm for CPU instructions to analyze GPU operands, but barriers and predicates need special processing.
Virtual barrier registers:
We define the six available barrier indices as six virtual barrier registers B0-B5. A write/read barrier index association can be represented as a write operation to one or more barrier registers. Likewise, we treat a wait mask association as a read of barrier registers. In this way, dependencies caused by barrier indices can be identified through def-use chains of the virtual barrier registers. It is worth noting that barriers can be set even if there is no dependency between regular registers. Take Figure 3 as an example: the LDG instruction loads a value into R0 and writes barrier B0, and the BRA instruction does not consume R0 but still reads B0. Observed memory dependency stalls on the BRA instruction should be attributed to the
LDG instruction.
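As a sketch of how the virtual barrier registers can be folded into an ordinary def-use analysis, the fragment below maps barrier indices onto register ids past the regular register file. The record layout and names (Inst, barrierReg, expandBarriers) are our own simplification, not GPA's actual data structures.

#include <set>

struct Inst {
  std::set<int> defs, uses;    // regular registers by index (R0..R254)
  std::set<int> setBarriers;   // write/read barrier indices this instruction sets
  std::set<int> waitMask;      // barrier indices this instruction waits on
};

// Barrier index b (0..5) becomes virtual register B_b, numbered after the
// regular registers, so def-use chains also capture barrier dependencies.
constexpr int kVirtualBarrierBase = 256;
inline int barrierReg(int b) { return kVirtualBarrierBase + b; }

void expandBarriers(Inst &inst) {
  for (int b : inst.setBarriers) inst.defs.insert(barrierReg(b));  // defines B_b
  for (int b : inst.waitMask)    inst.uses.insert(barrierReg(b));  // uses B_b
}
// In Figure 3, LDG defines B0 and BRA uses B0, so the slicer attributes the
// memory dependency stall observed at BRA to the LDG even though BRA reads
// none of LDG's regular destination registers.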
Predicated instructions:
Immediate dependency sources are not only the first def instruction of each of its operands on the search path. Consider Figure 4a as an example: suppose we observe a stall at the IADD instruction, which does not have a predicate; because the LDG instruction is executed only if P0 is true, it is possible that the stall comes from the LDC instruction earlier in the path, which is executed only if P0 is false. Therefore, the backward slicing search should proceed until the predicates of the def instructions on the path cover all conditions.

Let 𝑃 be the union of the def instructions' predicates on the path: 𝑃 = ∪𝑝, where 𝑝 ∈ {p_i} ∪ {!p_i} ∪ {_} and {p_i} ∪ {!p_i} = {_} for 0 ≤ 𝑖 ≤ 6. Here _ is a special predicate that covers both true and false predicates, and an instruction without a predicate has the same semantics as _. We say 𝑃 contains 𝑝′ iff 𝑝′ ∈ 𝑃 or _ ∈ 𝑃. The backward slicing search proceeds until the union of the def instructions' predicates on the search path (𝑃) contains the predicate of the use instruction (𝑝′).
Construct a dependency graph.

We build an instruction dependency graph from the def-use chains of collected instruction samples. For simplicity, in Figure 4b we only demonstrate memory dependency. Each node represents an instruction, and each edge represents a def-use relation associated with R0.

Prune cold edges.
Not all the dependent edges cause stalls. If an edge does not trigger stalls, we call it a "cold edge" and use the following three rules to prune it.
1. Opcode based pruning. Memory dependency stalls are attributed to memory instructions only. Synchronization dependency stalls are attributed to synchronization instructions only.
2. Dominator based pruning. For every edge 𝑒 from node 𝑖 to 𝑗 in a dependency graph, we remove 𝑒 if there is a non-predicate instruction 𝑘 that uses the same operands that 𝑖 defines and 𝑗 uses, and 𝑘 is in every path from 𝑖 to 𝑗 in the control flow graph, because we would have observed stalls at 𝑘 rather than 𝑗 if 𝑖 caused any stalls.
3. Instruction latency based pruning. For every edge 𝑒 from node 𝑖 to 𝑗 in a dependency graph, we remove 𝑒 if the number of instructions in every path from 𝑖 to 𝑗 in the control flow graph is greater than the latency of 𝑖.
For fixed latency instructions, we can use microbenchmarking [21] to determine their latencies; for variable latency instructions, we use their upper bounds for pruning. For instance, we use the TLB miss latency as the upper bound latency of global memory instructions.
According to the opcode pruning rule, we prune the edge from IMAD to IADD in Figure 4b to obtain the dependency graph in Figure 4c, because an
IMAD instruction cannot cause memory dependency stalls.
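The first and third pruning rules can be expressed as a simple predicate over dependency-graph edges, as in the sketch below; the edge record, opcode classes, and latency field are stand-ins for GPA's real analysis structures (the dominator-based rule needs the control flow graph and is omitted here).

enum class StallKind { MemoryDep, SyncDep, ExecDep };
enum class OpClass   { Memory, Sync, Arithmetic };

struct Edge {
  OpClass defClass;         // opcode class of the def (source) instruction
  int     defLatency;       // upper-bound latency of the def instruction
  int     shortestPathLen;  // fewest instructions on any CFG path from def to use
  StallKind stall;          // stall reason observed at the use instruction
};

bool keepEdge(const Edge &e) {
  // Rule 1: memory/synchronization dependency stalls can only come from
  // memory/synchronization instructions.
  if (e.stall == StallKind::MemoryDep && e.defClass != OpClass::Memory) return false;
  if (e.stall == StallKind::SyncDep   && e.defClass != OpClass::Sync)   return false;
  // Rule 3: if even the shortest path issues more instructions than the def's
  // latency, the def has already completed and cannot be the cause.
  if (e.shortestPathLen > e.defLatency) return false;
  return true;
}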
Attribute stalls.
After pruning cold edges, there are still some nodes that have multiple incoming edges. To measure the stalls caused by each edge, we use the following two heuristics.
1. Apportion the stalls based on each incoming node's issued samples. The more issued samples, the more stalls are blamed on the instruction.
2. Apportion the stalls based on the number of instructions in paths. The longer the path, the fewer stalls are blamed on the def instruction. If an instruction 𝑖 has multiple paths to instruction 𝑗 in the control flow graph, we use the longest one.
Finally, we associate the stalls of each dependency source (S_i) by apportioning the stalls of the observed instruction (S_j) using Equation 1, where R_i^issue is the ratio for each incoming node calculated by heuristic (1), and R_i^path denotes the ratio for each dependency source 𝑖 calculated by heuristic (2).

S_i = (R_i^path × R_i^issue) / (Σ_{k ∈ incoming(j)} R_k^path × R_k^issue) × S_j    (1)

Figure 4d shows the apportioned stalls using the above heuristics. While the LDC instruction has twice the issued samples of the LDG instruction, the number of path samples from LDC to IADD is also twice that of LDG to IADD. Thus, we assign each dependency source the same number of samples. Without loss of generality, the above heuristics and equation also apply for apportioning latency samples.

After attributing stalls to their sources, we further classify the stall reasons for execution and memory dependencies according to the opcode of each source instruction. As shown in Figure 5, we categorize memory dependency as local memory, constant memory, and global memory dependencies. Knowing where local memory stalls occur is important for register pressure analysis because it often indicates register spills. Likewise, we classify execution dependency as shared memory, arithmetic, and write-after-read (WAR) dependencies. A WAR dependency happens when a variable latency def instruction reads a value from a register, and the use instruction writes the same register.
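A sketch of the apportionment in Equation 1 follows. It reads the two heuristics as a per-edge weight that grows with the source's issued samples and shrinks with the path length, which reproduces the Figure 4d split; the field names are ours, and the exact form of R^path in GPA may differ.

#include <vector>

struct IncomingEdge {
  double issuedSamples;   // issued samples of the def (source) instruction
  double pathLen;         // instruction count of the longest CFG path def -> use
  double blamedStalls;    // output: stalls apportioned to this dependency source
};

// Heuristic 1: more issued samples, more blame. Heuristic 2: a longer path
// hides more of the def's latency, so it receives proportionally less blame.
static double weight(const IncomingEdge &e) { return e.issuedSamples / e.pathLen; }

void apportionStalls(std::vector<IncomingEdge> &in, double stallsAtUse) {
  double denom = 0.0;
  for (const auto &e : in) denom += weight(e);
  if (denom == 0.0) return;
  for (auto &e : in) e.blamedStalls = stallsAtUse * weight(e) / denom;
}
// Figure 4d: LDG (issue 1, path 5) and LDC (issue 2, path 10) both get weight
// 0.2, so the 4 stalls observed at IADD split evenly into 2 and 2.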
5 Performance Optimizers and Estimators

This section describes the implementation of the performance optimizers and estimators.
Performance Optimizers.

Performance optimizers take the program structure and the analysis results from the instruction blamer as input. Each optimizer encodes rules to calculate matching stalls. In this way, we lift the job of associating stalls with optimizations from users to the advisor.

We classify the available performance optimizers in GPA in Table 2. At a high level, we have parallel and code optimizers.
Figure 4. Steps to attribute stalls of the IADD instruction: (a) backward slicing, (b) construct a dependency graph, (c) prune cold edges, (d) apportion stalls. In (d), LDG has 1 issued sample and a path of 5 instructions, LDC has 2 issued samples and a path of 10 instructions, so the 4 stalls observed at IADD are apportioned 2 and 2.
Figure 5. Classification of detailed dependency stall reasons: (a) memory dependency is split into constant memory (LDC), local memory (LDL), and global memory (other) dependencies; (b) execution dependency is split into shared memory (LDS), WAR (ST/STS/STG/STL), and arithmetic (other) dependencies.
Table 2.
A brief description of GPU optimizers in GPA.
Code Optimizers / Stall Elimination:
• Register Reuse: match memory dependency stalls of local memory read/write instructions
• Strength Reduction: match execution dependency stalls of long latency arithmetic instructions
• Function Split: match instruction fetch stalls
• Fast Math: match stalls in CUDA math functions
• Warp Balance: match warp synchronization stalls
• Memory Transaction Reduction: match global memory throttling stalls
Code Optimizers / Latency Hiding:
• Loop Unrolling: match global memory and execution dependency stalls in loops
• Code Reordering: match global memory and execution dependency stalls
• Function Inlining: match stalls in device functions and their call sites
Parallel Optimizers:
• Block Increase: match if the number of blocks is less than the number of SMs
• Thread Increase: match if occupancy is limited by the number of threads per block

Parallel optimizers check if we can increase the parallelism level to improve performance. For instance, the
Block Increase optimizer investigates the potential of increasing the number of blocks. Code optimizers check if we can adjust code to improve performance. Based on optimization methods, we further categorize the code optimizers as stall elimination and latency hiding optimizers. Stall elimination optimizers provide suggestions to reduce stalls; latency hiding optimizers suggest rearranging issue orders to overlap stall latency.

Each optimizer maintains a workflow to match instruction samples. The
Loop Unrolling optimizer, for example, iterates through all the latency samples. It records a latency sample if it has either a memory dependency stall or an execution dependency stall and the def and use instructions are within the same loop. The optimizer suggests using a #pragma unroll annotation or manual unrolling for loops that the compiler fails to unroll automatically.
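As an illustration of the Loop Unrolling optimizer's advice, the fragment below shows the kind of annotation it suggests when the compiler has not unrolled a loop on its own; the kernel, names, and the unroll factor of 4 are placeholders, not taken from any of the benchmarks.

// Unrolling lets loads from different iterations be in flight at once, so
// their memory and execution dependency stalls can overlap.
__global__ void scale(float *out, const float *in, float s, int height) {
  #pragma unroll 4
  for (int i = 0; i < height; i++)
    out[i] = in[i] * s;   // placeholder body; the pragma is the suggestion
}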
Performance Estimators.

With performance optimizers, we associate optimization methods with stalls, whereas it is still unclear which methods have a better effect given the measurement data, the program structure, and the underlying GPU architecture. Performance estimators take the matched stalls as input and estimate the speedups by modeling the GPU's execution. The optimizers with the top estimated speedups output their suggestions to the performance advice report. According to the categories of optimizers, we classify estimators as code optimization estimators and parallel optimization estimators.
Code Optimization Estimators.

We first model the effect of the stall elimination optimizers. Suppose the total number of samples for a GPU kernel is T, and the matched samples for an optimizer is M. Stall elimination optimizers assume we can at best eliminate all the stalls by modifying the code. We use Equation 2 to estimate the speedup of stall elimination optimizers, S_e.

S_e = T / (T − M)    (2)

Latency hiding optimizers suppose we can at best eliminate latency samples by modifying code. Therefore, we can use Equation 3 to estimate the speedup of latency hiding optimizers, S_h, where M_L is the number of matched latency samples.

S_h = T / (T − M_L)    (3)
LDG R0, [R2]
STALL
STALL
IADD R5, R0, R5
IADD R6, R6, R6
IADD R7, R7, R7
Figure 6.
The mental model of latency hiding optimizers. Green code represents active samples, and red code represents latency samples. Latency hiding optimizers consider the effect of moving the code enclosed in dashed lines to fill stall slots.

Equation 3 models the execution at the kernel level. In practice, however, not all of M_L can be eliminated by rearranging code. Figure 6 explains the mental model of latency hiding optimization. We derive Equation 4 to refine the estimate of S_h, where A denotes the total number of active samples.

S_h = T / (T − Min(A, M_L))    (4)

We prove that the upper bound of S_h is two. We use L to denote the total number of latency samples, and T = A + L.

Theorem 5.1.
The speedup upper bound of latency hiding optimizations is 2×.

Proof.
• If Min(A, M_L) = A: T / (T − A) = (L + A) / ((L + A) − A) = 1 + A/L. Because A ≤ M_L ≤ L, T / (T − Min(A, M_L)) ≤ 2.
• If Min(A, M_L) = M_L: T / (T − M_L) = 1 / (1 − M_L/T) = 1 / (1 − M_L/(A + L)). Because L ≥ M_L and A ≥ M_L, M_L/(A + L) ≤ 1/2. Then T / (T − Min(A, M_L)) ≤ 2. □
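The kernel-level estimates in Equations 2 through 4 reduce to a few sample counts; the helper below is a minimal sketch of that arithmetic. The names are ours, and GPA additionally applies the per-scope refinement introduced next.

#include <algorithm>

struct KernelSamples {
  double total;           // T: all samples collected for the kernel
  double active;          // A: active samples
  double matchedStall;    // M: stall samples matched by a stall elimination optimizer
  double matchedLatency;  // M_L: latency samples matched by a latency hiding optimizer
};

double stallEliminationSpeedup(const KernelSamples &s) {      // Equation 2
  return s.total / (s.total - s.matchedStall);
}

double latencyHidingSpeedup(const KernelSamples &s) {         // Equation 4
  return s.total / (s.total - std::min(s.active, s.matchedLatency));
}
// Since min(A, M_L) can never exceed T/2 when T = A + L, the latency hiding
// estimate is bounded by 2x, matching Theorem 5.1.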
Scope Analysis.

We observe that optimizations such as loop unrolling only rearrange code within a specific scope, so only the active samples within that scope can be used to reduce latency samples. Based on this limitation, we propose Equation 5 to analyze optimization scopes representing loops and functions. S_{h,l} indicates the speedup for a specific scope l, and M_{L,l} is the matched latency samples for scope l.

S_{h,l} = T / (T − Min(Σ_{l′ ∈ nested(l)} A_{l′}, M_{L,l}))    (5)

Suppose we have a loop loop1 nested in another loop loop2; the speedup of loop2 is bounded by the active samples of loop2 and loop1 according to Equation 5.

Parallel Optimization Estimators.

Parallel optimizers adjust the number of blocks and threads within each block to change the parallelism level. To estimate the effect of adjusting blocks and threads, we take into account each warp scheduler's change of active warps, C_W (Equation 6), and change of issue rate, C_I (Equation 7). For instance, by increasing the number of blocks, we reduce the active warps per scheduler, and C_W is less than one. If the number of threads of each block is reduced, the rate at which a warp scheduler issues is reduced, and C_I is less than one.

C_W = W_new / W    (6)
C_I = I_new / I    (7)

Assuming every warp scheduler's issue rate is the same across different SMs, we derive Equation 8 and Equation 9 to calculate I and I_new respectively, where R_I is the ratio of issued samples among all samples. A warp scheduler is issuing if at least one warp on the scheduler is ready to issue an instruction.

I = 1 − (1 − R_I)^W    (8)
I_new = 1 − (1 − R_I)^{W_new}    (9)

Based on C_W and C_I, we estimate the speedup of parallel optimizations (S_p) using Equation 10, where f is a factor that varies between optimizers. Some optimizers may assume there are no pipeline, memory throttle, or select stalls if we reduce the number of active warps per block to a certain number (e.g., less than the number of schedulers per SM).

S_p = C_W × C_I × f    (10)

6 Evaluation

We evaluated GPA on an x86_64 system with two Intel E5-2695 processors and a single NVIDIA Volta V100 GPU. The following system software was used: Linux 3.10.0, NVIDIA CUDA Toolkit 11.0.194, NVIDIA Driver 450.51.06, and GCC 7.3.0. We evaluated GPA on the Rodinia benchmarks and the applications described below:
• Quicksilver [6] is a proxy application that solves a dynamic Monte Carlo particle transport problem. Quicksilver has a single large kernel that invokes many device functions consisting of thousands of lines of code. We studied Quicksilver with its default input.
• ExaTENSOR [3] is a library for large-scale numerical tensor algebra. We studied its tensor transpose kernel using a large six-dimensional tensor.
• PeleC [5] is an application for reacting flows using adaptive-mesh compressible hydrodynamics. We studied PeleC using its default input.
• Minimod [23] is a benchmark application for seismic modeling. We analyzed its higher-order stencil codes using grid sizes of .

Table 3. Achieved speedups averaged among ten runs. We improved each code according to the suggestion provided by GPA. Estimate error is computed as |Estimated Speedup − Achieved Speedup| / Achieved Speedup. Columns: Application, Kernel, Optimization, Original, Achieved Speedup, Estimated Speedup, Error.

Each row in Table 3 quantifies the speedup we achieved by applying the corresponding optimization suggested by GPA. For each benchmark, we focused on the dominant GPU kernel and implemented one of the top five optimization suggestions, based on its estimated speedup and ease of implementation. On average, we achieved a geometric mean speedup of 1.22×, with individual speedups ranging from 1.01× to 3.53×. GPA's estimated speedup is close to the speedup we achieved, with a geometric mean gap between the achieved and estimated speedups of 4.1%. In the rest of this section, we describe observations made while analyzing and optimizing benchmarks using GPA, including the optimization workflow, false positives, and single dependency coverage.

Optimization Workflow.

Before using GPA, one can apply a source-to-source transformation to separate variables that appear on a single line. Then, one can start by interpreting the top optimizations in the advice report produced by GPA. Not all optimizations are easy to implement. For example, for a code reordering suggestion, if the distance between the def and use instructions is long, it is hard to improve it further. Based on our experience studying the benchmarks, one can investigate a problem, modify the code, and achieve a speedup within half an hour. Typically, only a few lines need to be changed to achieve non-trivial speedups.
False Positives.

GPA could overestimate optimization opportunities. From Table 3, we observe that the loop unrolling and code reordering optimizations have the highest estimate errors.

The overestimation of the benefits of loop unrolling occurs because the loop unrolling optimizer lacks information about the number of iterations and other compiler information. After closely investigating the bfs benchmark, we found that the workload is highly unbalanced, such that most threads execute fewer than four iterations of the loop. Thus, loop unrolling benefits only a small number of threads.

The data dependency restriction causes the false positives of the code reordering optimization. GPA suggests reordering a global memory read in a loop of the pathfinder benchmark.
The estimated speedup is 26% higher than what we achieved because instructions after synchronizations depend on the results computed before the synchronizations. Therefore, the instructions we can use to hide latency are limited to a fine-grained scope in which the distance between the dependent instruction pairs is short no matter how we rearrange instructions.
Single Dependency Coverage.

In the instruction dependency graph, we say a node is a single dependency node if the node does not have any incoming edge, or each incoming edge represents a different dependency. We define single dependency coverage as the ratio of single dependency nodes to the total number of nodes. Figure 7 quantifies the single dependency coverage before and after pruning cold edges. After applying the edge pruning heuristics, most benchmarks have single dependency coverage greater than 0.8, so we can attribute the stalls to one edge without apportioning.

Two exceptions are the bfs and the nw benchmarks. The bfs benchmark is memory-intensive. Most of its instructions are global memory loads/stores, which have a 64-bit memory address stored in two 32-bit registers. The nw benchmark has many nodes with multiple incoming edges because of its intricate control flow. The dominant loop in nw is fully unrolled. Within the loop, there is a condition that decides whether values are calculated or not; if so, it compares four candidates and chooses the maximum one.

7 Case Studies

In this section, we study the optimizations for the four larger benchmark codes in Table 3, including ExaTENSOR, Quicksilver, PeleC, and Minimod, on the platform described in Section 6. The GPU code of the applications was compiled with -O3 -lineinfo. With the following case studies, we show that one can achieve non-trivial speedups without in-depth knowledge of the assembly code and the GPU architecture.
ExaTENSOR.

We studied a tensor transpose kernel in ExaTENSOR. We show a part of GPA's report in Figure 8. GPA ranks optimizers based on their estimated speedups. Each optimizer suggests a few methods to modify the code and lists several hotspots to focus on. Each hotspot consists of the def and use locations and their distance. In Figure 8, GPA reports that we can follow the suggestions of the strength reduction optimizer. Because the hotspot code performs an integer division, we can replace it with a multiplication by its reciprocal. This optimization renders a 1.11× speedup.

We analyzed the modified code again with GPA. This time GPA suggests a memory transaction reduction optimization to mitigate memory throttling stalls. In particular, GPA suggests that we replace global memory reads with constant memory reads if elements are shared between threads and not changed during execution. Following the suggestion, we achieved a 1.03× speedup.

Quicksilver.

We used GPA to analyze Quicksilver on a single GPU. GPA reports that the function inlining optimization may render the highest speedup. Applying the always_inline keyword to these functions fails to inline them due to the size/register limitation enforced by the compiler. Therefore, we manually inlined two small functions by integrating the whole function bodies into their callers. By modifying the code in this way, we obtained a . × speedup.

Next, GPA's register reuse optimizer indicates local memory stalls in a loop and points out the potential cause of register spilling. GPA suggests splitting the loop into two to save registers. Without GPA, the raw PC sampling reports produced by other tools only show global memory stalls without identifying register pressure. Applying the optimization yields a . × speedup.

PeleC.

We studied the react_state kernel of PeleC. GPA estimates that the code reordering optimization may result in the highest speedup. However, because the top five hotspots only account for 4% of all the matched stalls, there are many hotspots distributed across lines, so it is difficult to adjust the code manually. The second best optimizer suggests increasing the number of blocks. Since the kernel only occupies 16 blocks, GPA suggests reducing the number of threads per block while increasing the number of blocks to improve the parallelism. By increasing the number of blocks to 32, we achieved a 1.21× speedup.

Minimod.

We applied GPA to analyze the target_pml_3d kernel of Minimod, which performs higher-order multi-statement stencil computations. GPA first suggests using fast math functions to replace high-precision math functions. We applied the --use_fast_math compiler flag to achieve a 1.03× speedup.

Next, GPA suggests the code reordering optimization for the updated code. Adjusting the code to read subscripted values from global memory well in advance of their use hides more of the memory latency and yields an additional 1.04× speedup.

8 Related Work

GPU profilers are widely available for various GPU architectures. NVIDIA provides several tools [1, 27, 28] to measure GPU performance metrics. Intel develops VTune [31] to monitor executions on both CPUs and GPUs. AMD provides ROCProfiler [2] to read hardware counters and trace applications. There are also tools [24, 33, 39, 40, 42] that focus on large HPC applications.
Figure 7.
Single dependency coverage before and after pruning cold edges
Apply GPUStrengthReductionOptimizer optimization, ratio 5.805%, estimate speedup 1.062x
  Long latency non-memory instructions are used. Look for improvements that are mathematically
  equivalent, but the compiler is not intelligent to do so.
  1. Avoid integer division. Integer division requires using a special function unit to perform
     floating point transformations. One can use multiplication by a reciprocal instead.
  2. Avoid conversion. If the float constant is multiplied by a 32-bit float value, the compiler
     might transform the 32-bit value to a 64-bit value first.
  1. Hot BLAME GINS:LAT_IDEP_DEP code, ratio 0.444%, speedup 1.004x, distance 1
     From tensor_transpose at /home/kz21/Codes/GPA-Benchmark/ExaTENSOR/cuda2.cu:160x1620
       at Line 34 in Loop at Line 30
     To tensor_transpose at /home/kz21/Codes/GPA-Benchmark/ExaTENSOR/cuda2.cu:160x1630
       at Line 34 in Loop at Line 30
Figure 8.
A performance report for ExaTENSOR

Among the above tools, NVIDIA's Nsight Compute provides the most information at the GPU kernel level. It characterizes GPU kernels' bottlenecks at a high level but does not pinpoint bottlenecks or provide suggestions for specific code regions. In contrast, GPA analyzes control flow, program structure, and architectural features, and interprets measurement data to provide thorough suggestions and estimate speedups.

GPU vendors have also developed instrumentation tools [22, 26, 36, 37] for fine-grained performance measurement and analysis. These tools, however, introduce unavoidable overhead for GPU kernels. GPA adopts PC sampling [15], which introduces considerably less cost for kernel execution. There have been efforts that use instrumentation methods to diagnose specific types of inefficiencies. Yeh et al. [8] instrument GPU code as it is generated by LLVM to identify redundant instructions. CUDAAdvisor [32] also instruments code as it is generated by LLVM to monitor GPU memory accesses and decide if bypassing could be used. GVProf [4] instruments GPU binaries to detect both temporal and spatial redundant value patterns. These tools only identify a particular type of inefficiency and do not correlate the problem with hotness. In comparison, GPA performs a comprehensive analysis of stall reasons for instruction samples and derives various optimization suggestions for hot code regions.

On the CPU side, there exist several tools that examine code quality and provide optimization suggestions. PerfExpert [9] collects performance metrics using sampling, analyzes measurement data and system parameters, and estimates performance upper bounds. AutoScope [34] extends PerfExpert to suggest optimization strategies based on the detected bottlenecks. Unlike these two tools, CQA [11] builds a static model by emulating processor pipelines to check symptoms (e.g., vectorization) at the loop level. VTune [38] uses structured guidance to characterize bottlenecks by interpreting performance counters.

Profile-guided optimization takes measurement data as input to guide compilers in performing code transformations. Practical Path Profiling (PPP) [7] collects edge profiles using instrumentation to help compilers make decisions about function inlining and loop unrolling. Instrumentation-based methods require using representative inputs to dump meaningful profiles. To avoid the overhead of instrumentation-based approaches, AutoFDO [13] uses hardware counter based sampling to collect profiles for production applications and uses the profiles to guide optimizations. While most profile-guided optimization tools attribute measurement data to source lines to provide feedback for compilers, BOLT [30] is a post-link optimizer that attributes samples to machine instructions and uses this information to derive binary optimizations. Recently, there has also been research that incorporates machine learning to guide optimizations. Cavazos et al. [10] use profile data as input features to a regression model that predicts the best compiler flags. DeepFrame [18] incorporates deep learning methods to learn the most likely paths during execution and offload those regions to FPGAs. Though profile-guided optimizations can automatically adjust code based on rules or models, they only cover a subset of all the available optimizations. Many optimizations on GPUs need manual effort, such as warp balance, memory coalescing, and adjustments to the thread counts. Unlike other tools, GPA depends only on line-mapping information and is not tied to any specific compiler.
9 Conclusions

Tuning GPU kernels is difficult due to the complexity of GPU architectures and programming models. To free application developers from needing to interpret measurements from multiple performance counters and analyze program
inefficiencies, we introduce GPA. This performance advisor provides insightful optimization advice at the level of lines, loops, and kernels and estimates each optimization's speedup. GPA is organized in a modular fashion; users can add custom optimizers to match other inefficiency patterns (e.g., texture fetch combination).

GPA suffers from both hardware and software limitations. First, GPA apportions stalls to multiple dependency sources with an approximation method based on the instruction counts of the paths. If the underlying hardware implemented "paired sampling" [16], we could collect precisely both the stalled instruction and the instruction that causes the stall. Second, to obtain a more accurate speedup estimate, comprehensive compiler information, such as loop unroll counts, should be considered. Last, because PC sampling with NVIDIA's CUPTI serializes kernel executions, GPA's profiler is unable to measure the effect of concurrent kernel execution.

In the future, we plan to ingest compiler information into GPA to perform more accurate estimates. In addition, we can use the insights derived from GPA to guide compilers to apply code transformations for large-scale applications with hundreds of tiny hotspots.
References

[1] 2019. The user manual for NVIDIA profiling tools for optimizing performance of CUDA applications. https://docs.nvidia.com/cuda/profiler-users-guide [Accessed August 26, 2020].
[2] 2020. AMD ROCm ROCProfiler. https://rocmdocs.amd.com/en/latest/ROCm_Tools/ROCm-Tools.html [Accessed August 26, 2020].
[3] 2020. ExaTENSOR. https://iadac.github.io/projects/ [Accessed August 27, 2020].
[4] 2020. GVPROF: A Value Profiler for GPU-based Clusters. https://github.com/Jokeren/GVProf [Accessed August 26, 2020].
[5] 2020. PeleC. https://github.com/AMReX-Combustion/PeleC [Accessed August 27, 2020].
[6] 2020. Quicksilver. https://github.com/LLNL/Quicksilver [Accessed August 26, 2020].
[7] Michael D Bond and Kathryn S McKinley. 2005. Practical path profiling for dynamic optimizers. In International Symposium on Code Generation and Optimization. IEEE, 205–216.
[8] Lorenz Braun and Holger Fröning. 2019. CUDA Flux: A Lightweight Instruction Profiler for CUDA Applications. In Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) Workshop, collocated with the International Conference for High Performance Computing, Networking, Storage and Analysis (SC2019).
[9] Martin Burtscher, Byoung-Do Kim, Jeff Diamond, John McCalpin, Lars Koesterke, and James Browne. 2010. PerfExpert: An easy-to-use performance diagnosis tool for HPC applications. In SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11.
[10] John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F. P. O'Boyle, and Olivier Temam. 2007. Rapidly selecting good compiler optimizations using performance counters. In International Symposium on Code Generation and Optimization (CGO'07). IEEE, 185–197.
[11] Andres S Charif-Rubial, Emmanuel Oseret, José Noudohouenou, William Jalby, and Ghislain Lartigue. 2014. CQA: A code quality analyzer tool at binary level. IEEE, 1–10.
[12] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. IEEE, 44–54.
[13] Dehao Chen, Tipp Moseley, and David Xinliang Li. 2016. AutoFDO: Automatic feedback-directed optimization for warehouse-scale applications. IEEE, 12–23.
[14] Cristina Cifuentes and Antoine Fraboulet. 1997. Intraprocedural static slicing of binary executables. IEEE, 188–195.
[15] NVIDIA Corporation. 2019. PC Sampling. https://docs.nvidia.com/cupti/Cupti/r_main.html [Accessed January 26, 2019].
[16] Jeffrey Dean, James E Hicks, Carl A Waldspurger, William E Weihl, and George Chrysos. 1997. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In Proceedings of the 30th Annual International Symposium on Microarchitecture. IEEE, 292–302.
[17] Paul J Drongowski. 2007. Instruction-based sampling: A new performance analysis technique for AMD family 10h processors. Advanced Micro Devices (2007).
[18] Apala Guha, Naveen Vedula, and Arrvindh Shriraman. 2019. Deepframe: A Profile-Driven Compiler for Spatial Hardware Accelerators. IEEE, 68–81.
[19] Part Guide. 2011. Intel® 64 and IA-32 architectures software developer's manual. Volume 3B: System Programming Guide.
[20] Tools for High Performance Computing 2014. Springer, 25–35.
[21] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. 2018. Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826 (2018).
[22] Melanie Kambadur, Sunpyo Hong, Juan Cabral, Harish Patil, Chi-Keung Luk, Sohaib Sajid, and Martha A Kim. 2015. Fast computational GPU design with GT-Pin. IEEE, 76–86.
[23] Jie Meng, Andreas Atle, Henri Calandra, and Mauricio Araya-Polo. 2020. Minimod: A Finite Difference solver for Seismic Modeling. arXiv preprint arXiv:2007.06048v1 (2020).
[24] Dieter an Mey, Scott Biersdorf, Christian Bischof, Kai Diethelm, Dominic Eschweiler, Michael Gerndt, Andreas Knüpfer, Daniel Lorenz, Allen Malony, Wolfgang E. Nagel, Yury Oleynik, Christian Rössel, Pavel Saviankou, Dirk Schmidl, Sameer Shende, Michael Wagner, Bert Wesarg, and Felix Wolf. 2012. Score-P: A Unified Performance Measurement System for Petascale Applications. In Competence in High Performance Computing 2010, Christian Bischof, Heinz-Gerd Hegering, Wolfgang E. Nagel, and Gabriel Wittum (Eds.). Springer Berlin Heidelberg, 85–97.
[25] NVIDIA Corporation. 2019. CUPTI User's Guide DA-05679-001_v10.1. https://docs.nvidia.com/cuda/pdf/CUPTI_Library.pdf.
[26] NVIDIA Corporation. 2020. NVIDIA Compute Sanitizer. https://docs.nvidia.com/cuda/compute-sanitizer/index.html [Accessed August 26, 2020].
[27] NVIDIA Corporation. 2020. NVIDIA Nsight Compute. https://developer.nvidia.com/nsight-compute [Accessed August 26, 2020].
[28] NVIDIA Corporation. 2020. NVIDIA Nsight Systems. https://developer.nvidia.com/nsight-systems [Accessed August 26, 2020].
[29] University of Wisconsin-Madison. [n.d.]. Dyninst. https://github.com/dyninst/dyninst [Accessed January 26, 2020].
[30] Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019. BOLT: A practical binary optimizer for data centers and beyond. IEEE, 2–14.
[31] James Reinders. 2005. VTune performance analyzer essentials. Intel Press (2005).
[32] Du Shen, Shuaiwen Leon Song, Ang Li, and Xu Liu. 2018. CUDAAdvisor: LLVM-based runtime profiling for modern GPUs. In Proceedings of the 2018 International Symposium on Code Generation and Optimization. 214–227.
[33] Sameer S Shende and Allen D Malony. 2006. The TAU parallel performance system. The International Journal of High Performance Computing Applications 20, 2 (2006), 287–311.
[34] Olalekan A Sopeju, Martin Burtscher, Ashay Rane, and James Browne. 2011. AutoScope: Automatic suggestions for code optimizations using PerfExpert. Evaluation (2011).
[35] Venkatesh Srinivasan and Thomas Reps. 2016. An improved algorithm for slicing machine code. ACM SIGPLAN Notices 51, 10 (2016), 378–393.
[36] Mark Stephenson, Siva Kumar Sastry Hari, Yunsup Lee, Eiman Ebrahimi, Daniel R Johnson, David Nellans, Mike O'Connor, and Stephen W Keckler. 2015. Flexible software profiling of GPU architectures. In ACM SIGARCH Computer Architecture News, Vol. 43. ACM, 185–197.
[37] Oreste Villa, Mark Stephenson, David Nellans, and Stephen W Keckler. 2019. NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 372–383.
[38] Ahmad Yasin. 2014. A top-down method for performance analysis and counters architecture. IEEE, 35–44.
[39] Hui Zhang. 2018. Data-centric performance measurement and mapping for highly parallel programming models. Ph.D. Dissertation. University of Maryland, College Park.
[40] H. Zhang and J. Hollingsworth. 2019. Understanding the Performance of GPGPU Applications from a Data-Centric View. 1–8. https://doi.org/10.1109/ProTools49597.2019.00006
[41] Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou, and Mingyu Chen. 2017. Understanding the GPU microarchitecture to achieve bare-metal performance tuning. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 31–43.
[42] Keren Zhou, Mark W. Krentel, and John Mellor-Crummey. 2020. Tools for Top-down Performance Analysis of GPU-Accelerated Applications. In Proceedings of the 34th ACM International Conference on Supercomputing (ICS '20). Association for Computing Machinery, New York, NY, USA, Article 26, 12 pages.