AZP: Automatic Specialization for Zero Values in Gaming Applications
Mark W. Stephenson [email protected]
NVIDIA, Austin, TX, USA
Ram Rangan [email protected]
NVIDIA, Bangalore, Karnataka, India
Abstract
Recent research has shown that dynamic zeros in shader programs of gaming applications can be effectively leveraged with a profile-guided, code-versioning transform. This transform duplicates code, specializes one path assuming certain key program operands, called versioning variables, are zero, and leaves the other path unspecialized. Dynamically, depending on the versioning variable's value, either the specialized fast path or the default slow path will execute. Prior work applied this transform manually and showed promising gains on gaming applications. In this paper, we present AZP, an automatic compiler approach to perform the above code-versioning transform. Our framework automatically determines which versioning variables or combinations of them are profitable, and determines the code region to duplicate and specialize (called the versioning scope). AZP takes operand zero value probabilities as input and then uses classical techniques such as constant folding and dead-code elimination to determine the most profitable versioning variables and their versioning scopes. This information is then used to effect the final transform in a straightforward manner. We demonstrate that AZP is able to achieve an average speedup of 16.4% for targeted shader programs, amounting to an average frame-rate speedup of 3.5% across a collection of modern gaming applications on an NVIDIA GeForce RTX™ 2080 GPU.
Graphics processing units (GPUs) have enjoyed consistent generation-over-generation performance growth since their inception, thanks in large part to 2D CMOS scaling. However, as GPU performance growth fueled by Moore's Law comes to an end, the gaming industry is in urgent need of novel architectural and software solutions to deliver similar performance growth to sustain innovations in our quest for photo-realistic real-time rendering.

To that end, Rangan et al. recently introduced Zeroploit, a profile-guided optimization to exploit dynamically zero-valued operands in shader programs of gaming applications [29]. Zeroploit uses the knowledge that one of the input operands of a multiply (or a similar) operation is mostly zero dynamically to rearrange the code so as to avoid computing the other source operand whenever the former is zero, since anything multiplied by a zero results in a zero. Though this is not safe as per IEEE 754 rules [6], since the other source operand could evaluate to a floating point NaN, −∞, or +∞, game developers typically allow for such IEEE-unsafe floating point optimizations to opportunistically squeeze out additional performance as well as to ensure NaN values do not leak into render targets. (In Microsoft's high-level shading language (HLSL), this is achieved by explicitly setting the refactoringAllowed [12] global flag and dropping the precise storage class specifier for variable declarations [13].) Zeroploit leverages this developer-granted permission to transform a code region into a zero-specialized fast path and an unspecialized slow path. In the aforementioned work, Zeroploit was applied manually to graphics shader programs, both in terms of opportunity detection as well as the code-versioning transform itself. In this paper, we strive for an automatic compiler solution for the Zeroploit optimization for shader programs of gaming applications.

Figure 1 illustrates Zeroploit. It shows the two basic requirements of the Zeroploit transform, namely, the identification of a versioning variable and the identification of a versioning scope. In its basic form, a versioning variable is an operand that a value profiler identifies as having a high probability of being zero dynamically (e.g., the output of a saturating arithmetic operation). The versioning scope is the region of the input code that gets duplicated to create the specialized fast path and unspecialized slow path code versions.

An automatic compiler solution to Zeroploit must solve three key challenges. First, given that shader programs typically have fat and complex data flow graphs (DFGs) and the probability of their operands being dynamically zero is more than 11% [29], automatic identification of versioning variables and versioning scopes requires the ability to carefully consider, rank, and choose from among multiple potential versioning variables, each with their own zero probabilities and versioning scopes that may overlap. Besides, large versioning scopes can lead to instruction cache thrashing, and the ranking algorithm must appropriately trade this off against specialization benefits. Second, GPUs' mixed scalar-vector instruction set architectures (ISAs) require analysis and transform intelligence to optimally specialize concurrent DFGs originating at vector instructions (e.g., texture loads). Likewise, a versioning condition based on all components of a vector load being zero will need to automate the creation of a composite or combined versioning variable prior to effecting the transform. Finally, an automatic solution must

// r1 == 0, 50% of the time
1. r1 = expressionAvblEarly()
2. r0 = expensiveWork1()
3. r2 = r0 x r1
4. r3 = r2 x expensiveWork2()
5. r4 = r3 + r5

(a) Original.
1. r1 = expressionAvblEarly()
2. if (r1 == 0) { // fast path
3.   // specialization hint
4.   r1 = 0
5.   r0 = expensiveWork1()
6.   r2 = r0 x r1
7.   r3 = r2 x expensiveWork2()
8. } else { // slow path
9.   r0 = expensiveWork1()
10.  r2 = r0 x r1
11.  r3 = r2 x expensiveWork2()
12. }
13. r4 = r3 + r5

(b) After Zeroploit.

1. r1 = expressionAvblEarly()
2. if (r1 == 0) { // fast path
3.   r3 = 0
4. } else { // slow path
5.   r0 = expensiveWork1()
6.   r2 = r0 x r1
7.   r3 = r2 x expensiveWork2()
8. }
9. r4 = r3 + r5

(c) After Zeroploit + specialization.
Figure 1. Example illustrating Zeroploit. Here, r1 is the versioning variable. expensiveWork1() and expensiveWork2() get specialized away due to Zeroploit's backward and forward slice specialization, respectively. Instructions on lines 2, 3, and 4 of (a) form the versioning scope and get duplicated. One of the versions gets specialized to form the fast path, while the other remains unspecialized and serves as the default, slow path.

ensure that control flow divergence due to code versioning does not adversely impact performance on single-instruction, multiple-threads (SIMT) GPUs.
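Ignoring the IEEE 754 corner cases discussed above (NaN and infinities), the transform in Figure 1 can be mimicked in scalar Python terms. The sketch below is illustrative only; expensive_work_1 and expensive_work_2 are hypothetical stand-ins for the shader expressions in the example:

```python
# A scalar sketch of the Figure 1 transform; the expensive_work_* functions
# are hypothetical stand-ins for the shader expressions.
def expensive_work_1():
    return 3.0

def expensive_work_2():
    return 7.0

def original(r1, r5):
    # Figure 1a: the products are always computed.
    r0 = expensive_work_1()
    r2 = r0 * r1
    r3 = r2 * expensive_work_2()
    return r3 + r5  # r4

def zeroploit_specialized(r1, r5):
    # Figure 1c: a versioning check on r1 steers execution.
    if r1 == 0:  # fast path: the whole product tree folds to zero
        r3 = 0.0
    else:        # slow path: the unspecialized original computation
        r0 = expensive_work_1()
        r2 = r0 * r1
        r3 = r2 * expensive_work_2()
    return r3 + r5  # r4

# Both versions agree on finite inputs; the fast path skips both calls.
assert original(0.0, 2.0) == zeroploit_specialized(0.0, 2.0)
assert original(1.5, 2.0) == zeroploit_specialized(1.5, 2.0)
```

When r1 is zero (50% of the time in the example), the specialized path avoids both expensive computations entirely, which is the source of Zeroploit's gains.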
Contributions:
In this paper, we present AZP, an automatic technique for the Zeroploit optimization that effectively addresses the above challenges. Leveraging analysis techniques from classical optimizations such as constant folding and dead-code elimination, we devise novel heuristics to estimate benefit, rank, and identify profitable versioning variables and versioning scopes in practically linear time. We present techniques to effectively address challenges arising from mixed vector-scalar ISAs. AZP's transform relies on a SIMT-wide vote [16] or equivalent operation to ensure that versioning branches are dynamically convergent. We describe the AZP compiler pipeline in detail and show how the above solutions fit together within this framework.

We evaluate AZP on a suite of modern, heavily-optimized gaming applications and show that the automatic approach presented in this paper can provide an average speedup of 16.4% for targeted shaders on an NVIDIA GeForce RTX™ 2080 GPU, which translates to an average frame-rate speedup of 3.5%. We show that AZP's compile-time overhead is suitable for just-in-time environments such as GPU drivers generating code for interactive gaming applications.
This section provides a high-level introduction to three topics: graphics programs, the execution model of NVIDIA GPUs, and profile-guided optimizations.
Modern gaming applications are most popularly developed with the Direct3D 11 [9], Direct3D 12 [10], OpenGL [19], and Vulkan [20] APIs. Without exception, all graphics APIs use a two-level hierarchy to convey work from the CPU to the GPU: an API layer and a shader program layer. The API layer is used to set up GPU state (e.g., enable/disable depth testing, enable/disable blending, etc.); bind resources such as constant buffers, textures (read-only input data), render targets (write-only output data), or general multi-dimensional arrays called unordered access views (UAVs, which are read-write); and bind shader programs for use in GPU work calls (which could be graphics draw calls or compute dispatch calls). A GPU work call typically causes one or more threads of a corresponding bound shader program to be run. Shader programs are typically written in high-level languages such as the high-level shading language (HLSL) used with Direct3D APIs [11], or the OpenGL Shading Language (GLSL) used with OpenGL or Vulkan APIs [23].

A typical API sequence interleaves various state setup commands with GPU work calls. Several such API commands can be in flight at any given point in the GPU. State setup commands take very little time, and most of the GPU time is spent in GPU work calls. A single frame of a modern game typically requires hundreds or even thousands of state setup and GPU work calls. The hundreds or thousands of API calls that make up a frame can exhibit vastly different performance characteristics. An API call can be limited by the performance of fixed-function units, CPU-GPU data transfer latencies, or the performance of programmable shaders, whose performance can in turn be limited by memory bandwidth, instruction issue rate, or latency of memory loads. AZP is applicable only to the portion of a gaming application's frame time that is dominated by programmable shaders, namely pixel and compute shaders, which are the two most expensive shader types.
Using NVIDIA's terminology, shader programs execute on programmable processing cores called streaming multiprocessors (SMs). Threads of shader programs are grouped into convenient physical entities called warps. A warp can span at most 32 threads. Instruction fetch, decode, and scheduling in the SM happen at warp granularity, while operand fetch, execution in functional units, and register writeback happen at per-thread granularity. SM execution is most efficient when handling full warps, since all its functional units are fully utilized. The SM implements a single-instruction, multiple-threads (SIMT) execution model [16] whereby each thread of a warp maintains its own program counter (PC). If threads of a warp diverge, i.e., execute at different PCs, SM datapath utilization suffers. To mitigate divergence, the SM uses joint hardware-software mechanisms to make diverged threads re-converge at well-defined "synchronization" points in the program. We will build on the above background when discussing AZP's heuristics later in the paper.
Profile-guided optimizations (PGOs) have historically focused on collecting control flow profiles to determine the execution weights of various blocks of code. Compiler passes such as inlining, register allocation, predication, loop unrolling, etc. then make use of accurate information about execution weights to generate suitably optimized code. For example, a compiler can lay out code based on profile feedback to improve instruction cache performance by packing frequently taken blocks closer to each other. Such execution-weight-based PGOs have proven effective in a variety of programs, from general purpose programs [5] to Web browsers [3]. Many modern commercial compilers support PGOs (e.g., [8, 14]).

PGOs typically involve compiling and executing code in two different ways. First, the baseline code needs to be instrumented for profile collection, and profiles collected. Next, a second compilation may then use the collected profile information to transform the baseline code suitably to effect a PGO. At a high level, the AZP optimization presented in this paper uses a similar dual compilation approach. AZP differs from typical PGOs in its use of value profiles, instead of control flow profiles, to optimize code. A more detailed comparison with prior value specialization work is given in Section 5.
We now provide background information and terminology related to Zeroploit [29]. Rangan et al. define a zero-propagator as an instruction that produces zero as output due to one or more of its source operands being zero (e.g., a multiply operation) and a zero-originator as an instruction that produces zero as output from non-zero inputs, due to its semantics (e.g., a saturating arithmetic operation). They further define a useful zero-originator as a zero-originator whose output is consumed by a zero-propagator.

A typical Zeroploit opportunity has three main ingredients, namely:

1. A program variable that is produced by a useful zero-originator and is consumed by a zero-propagator. This variable (or its equivalent) will become the versioning variable in the transformed code; based on its dynamic value, execution will jump to either a fast path or a slow path.

2. The other operand of the zero-propagator is expensive to compute.

3. The expression computing the versioning variable is available early. At a minimum, this expression must be independent of the other operand. In general, the farther up a shader's program dependency graph the backward slices of the versioning variable and the other operand intersect (or if they do not intersect at all), the better it is for versioning scope formation, as it can allow for larger versioning scopes and thus more specialization.

We refer to the example in Figure 1 as we describe the steps of the Zeroploit transform below. The code in Figure 1a, which is assumed to satisfy all three prerequisites, can be Zeroploit-transformed to the code in Figure 1b as follows. First, Zeroploit identifies a set of operations that will be affected by either forward or backward slice (including recursive backward slice) specialization based on the versioning variable being zero. The region of code from the first affected operation to the last forms the versioning scope. Next, it duplicates operations in the versioning scope. One of the copies or versions, which will become the specialized, fast path, is prefixed with an explicit assignment of zero to the versioning variable to enable the compiler back-end to easily notice the specialization opportunity in that scope and apply classical optimizations such as constant folding, constant propagation, and dead code elimination to instructions in that path. The other version will serve as the default, unspecialized, slow path. Finally, a conditional branch predicated on the versioning variable being equal to zero is added to dynamically steer execution to the fast path or the slow path. Figure 1c shows the final optimized code after the fast path has been specialized by a compiler back-end.

As identified in Section 1, an automatic Zeroploit solution for gaming applications running on SIMT GPUs must solve three key challenges. It must be able to evaluate, rank, and choose from among several potential versioning variables, deal with challenges posed by mixed scalar-vector ISAs when determining versioning variables and scopes, and ensure SIMT warp divergence penalties do not negate specialization benefits. The next section describes how we address the above challenges with AZP.
We now describe AZP, which automatically transforms shader programs to profitably exploit likely zero operands. This section discusses how our compiler identifies candidate versioning variables, our zero-value profiler, and the compiler analyses and transformations AZP performs. In the remainder of the paper, when we generically say that the compiler specializes on a versioning variable, we implicitly mean that the compiler specializes a region of code in which the variable is guaranteed to be zero, and that it guards that region with a runtime check that invokes either the specialized region or the default fallback region. Without loss of generality, we assume the compiler's intermediate representation (IR) is in static single assignment (SSA) form.

Figure 2(a) shows where AZP runs with respect to other compiler phases. Our profiling pass, as well as the AZP analyses and transformations, run early in our backend compiler's flow, and as a result the IR that AZP sees is far from optimal. However, by running AZP after classical optimizations like dead code elimination and copy propagation, we ensure AZP operates on code without any static redundancy, which reduces the number of potential operands it needs to profile and evaluate. We choose to run AZP earlier than advanced backend optimizations because loop optimizations such as unrolling and software pipelining can increase the number of potential versioning variables our optimization has to consider. As with most compiler optimizations, we recognize that AZP is sensitive to phase orders. Future work will consider alternate phase orders.
Regardless of AZP's ordering with respect to other compiler passes, AZP's first task is to enumerate the set of possible versioning variables in a shader, which we henceforth refer to as candidates. We call them candidates because at this point in AZP's flow the compiler does not know the potential upside to specialization on any of the versioning variables. The candidate enumeration step of Figure 2(b) proceeds as follows:

1. Find scalar candidates: AZP makes the distinction between scalar candidates and vector candidates. Scalar candidates are trivial to enumerate since they are merely the virtual register writes in a program. To reduce the optimization search space, AZP does not consider constant literal moves as candidates, because traditional compiler transformations such as constant propagation and folding will likely optimize these away; nor does AZP consider register-to-register moves as candidates, because they simply rename the move's source operand and are therefore redundant. All other register writes are potential scalar candidates. In SSA form there is a one-to-one correspondence between candidates and virtual registers.

2. Find vector candidates: In addition to scalar candidates, AZP groups subsets of scalar candidates together to form vector candidates. In fact, many of AZP's biggest gains come from jointly specializing multiple scalar candidates in the IR, which is also supported by prior observations [29]. We have seen cases where jointly specializing on the conjunction of two or more scalar candidates exposes specialization
[Figure 2: (a) the backend compiler pipeline: lowering, branch optimizations, dead code removal, copy propagation, PGZ, unrolling, register allocation, post scheduling; (b) the flow within the PGZ pass: reconstruct HLSL vectors, enumerate candidates, then either instrument for profiling (profiling mode) or filter candidates and loop with i = 0, estimating each candidate's opportunity, selecting and transforming the best candidate, performing constant folding and DCE, and incrementing i, until the best candidate's opportunity is less than threshold or i >= xform_limit (PGO mode).]
Figure 2. AZP flow. AZP runs early in our prototype's compiler flow as (a) indicates, before many of our compiler's significant passes run. As (b) shows, after enumerating the potential candidates, AZP runs in one of two modes: instrumentation, or PGO. The instrumentation path (labelled profiling mode) is for collecting the offline profiles that drive the AZP optimization (labelled PGO mode). The PGO path iteratively evaluates and transforms candidates. In each iteration of the evaluate-transform loop, AZP estimates the effectiveness of transforming each candidate, and greedily transforms the candidate with the best estimated opportunity. The transformation may expose new opportunities on the next iteration.

opportunities, whereas specializing on any scalar candidate in isolation exposes none. Blindly considering arbitrary combinations of candidates is combinatorial in the number of versioning variables and is therefore not tractable. To gain insights into what combinations
AZP should consider, we manually inspected dozens of shader programs and found that many specializable combinations of variables are explicitly combined by the programmer into short vectors, typically into RGBA or XYZW components. Manual inspection of shader program code led to the insight that some specialization opportunities require whole vectors to be zero. So while AZP does not consider packing arbitrary combinations of candidates, we found that packing candidates that are elements of a programmer-specified vector into a vector candidate is sufficient to expose additional specialization opportunities.

A challenge we face when attempting to pack candidates into their programmer-specified vectors is that short vector information does not propagate from the programming language to the compiler's backend IR. In fact, for DirectX 12, the explicit vectors in the high-level shader language (HLSL) are lost before backend compilation begins. To reconstruct programmer-specified vectors, we leverage prior art on compiling for short, single-instruction multiple-data (SIMD) instruction sets. In particular, for this work we implemented portions of Larsen and Amarasinghe's "SLP" algorithm for packing independent and isomorphic instructions into SIMD superwords [25]. We note that for our purposes, since we are not actually packing candidates into wide registers nor scheduling instructions to execute in SIMD fashion, a best-effort implementation that forms natural groups of candidates suffices. AZP creates a new set of vector candidates from the superword-level vectors formed by applying SLP [25].
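As a concrete illustration of this best-effort grouping, one might cluster isomorphic register writes by a simple key. The sketch below is our own simplification, not the SLP implementation from [25]; Instr and the grouping key are assumptions for illustration:

```python
from collections import defaultdict
from dataclasses import dataclass

# Simplified IR instruction: the destination register, its opcode, and the
# resource it reads from (hypothetical fields for this sketch).
@dataclass(frozen=True)
class Instr:
    dest: str      # SSA virtual register written
    opcode: str
    src_base: str  # e.g., the resource a texture load reads from

def vector_candidates(instrs, max_width=4):
    """Group isomorphic register writes into vector candidates of width 2-4."""
    groups = defaultdict(list)
    for i in instrs:
        # Isomorphic instructions: same opcode and same source base.
        groups[(i.opcode, i.src_base)].append(i.dest)
    return [tuple(g[:max_width]) for g in groups.values() if len(g) > 1]

# Four components of one texture load form a single vector candidate.
instrs = [
    Instr("r0", "TEX", "t0"), Instr("r1", "TEX", "t0"),
    Instr("r2", "TEX", "t0"), Instr("r3", "TEX", "t0"),
    Instr("r4", "MUL", "t1"),
]
print(vector_candidates(instrs))  # [('r0', 'r1', 'r2', 'r3')]
```

A full SLP implementation would also verify independence among the grouped instructions; the point here is only that natural RGBA/XYZW-style groups fall out of a cheap structural key.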
3. Enumerate candidates: AZP enumerates the union of scalar candidates (S) and vector candidates (V). That is, AZP assigns each candidate an identifier that it uses to associate candidates with runtime profiles.

Formalizing our discussion of candidates, each scalar candidate s ∈ S corresponds to exactly one SSA virtual register definition vr_s; and a virtual register vr associates with at most one scalar candidate, s. Virtual registers can also correspond to both a vector candidate v ∈ V and a scalar candidate s ∈ S. Each vector candidate represents a set of virtual registers, e.g., vr_v = {vr_x, vr_y, vr_z, vr_w}, where x, y, z, w ∈ S and are distinct. We limit the number of virtual registers associated with a vector candidate to four.

As Figure 2(b) shows, after candidate enumeration AZP can take one of two distinct paths. On one path AZP instruments the shader to collect runtime profiles, and on the other path AZP ingests a runtime profile and attempts to specialize the shader for likely zeros. We first discuss program instrumentation for profiling.
For each candidate enumerated above, our profiler records the likelihood that the candidate is zero at runtime. Rangan et al. present a general value profiling framework [29] that builds upon the ideas presented in [1]. However, they also note that they did not see significant opportunities to leverage values other than zero. Applying this insight, instead of implementing a general value profiler, we designed a simple profiler that discovers the likelihood that a candidate is zero. Our profiler instruments each scalar candidate s ∈ S to record both the total number of writes, writes_s, to the candidate's associated virtual register vr_s, in addition to the number of those writes that were zero, zero_s. Our zero-value profiler leverages NVIDIA's global memory atomic increment operations for updating these counters.

For each scalar candidate s, we can easily estimate the likelihood, P(vr_s = 0), that the associated versioning variable is zero according to P(vr_s = 0) ≈ zero_s / writes_s. Vector candidates, on the other hand, require a different approach. Ideally we would estimate the likelihood that all elements of the vector are jointly zero, e.g., P(vr_v = 0) = P(vr_x = vr_y = vr_z = vr_w = 0). A profiler could determine the joint likelihood by only incrementing writes_v after all elements of the vector are written, and by only incrementing zero_v if all elements of the vector are zero. Alternatively, and this is the approach we use in our prototype, we estimate P(vr_v = 0) assuming independence among the individual candidates, i.e., P(vr_v = 0) ≈ P(vr_x = 0) · P(vr_y = 0) · P(vr_z = 0) · P(vr_w = 0).

The final profiling decision we discuss relates specifically to SIMT execution. If the versioning variable is not uniformly zero across the threads of a warp, should we still consider specializing code on that variable?
AZP only considers warp-uniform zero specialization, primarily because it is the conservative choice. One of the primary metrics GPU programmers use to gauge the performance of their code is SIMT efficiency, i.e., how many threads in a warp are simultaneously active on average. Were AZP to introduce divergent branches (where the guarding runtime specialization check causes the warp to serially execute the specialized region and the fallback region), it would decrease SIMT efficiency and quite likely reduce performance. For this reason, AZP's profiler only increments a candidate c's zero counter zero_c when vr_c is zero for all threads in the warp, by using the vote.all flavor of NVIDIA's vote instruction to check for warp-wide convergence [16].

We now discuss the steps our prototype performs to specialize a shader. As shown in Figure 2(b), when the driver supplies the backend compiler with a valid runtime profile,
AZP attempts to specialize a shader to exploit likely zeros. The basic flow of our profile-guided specialization is to iteratively estimate each candidate's performance potential and then transform the best candidate if it is above a given threshold. We now describe each step of this flow.

As we soon discuss, each step of the estimate-transform loop shown in Figure 2(b) is expensive. We can significantly improve the compile-time overhead by filtering out candidates where P(vr_c = 0) < P_thresh. In Section 4 we show how P_thresh affects compile time, but for all other results we present, P_thresh = 0.32. For vector candidates, we remove any element from the candidate if its zero likelihood is less than P_thresh. We can also skip the evaluation of any zero-propagator candidate zp with probability P(vr_zp = 0) if one of its zero-propagating inputs is generated by a candidate c with P(vr_c = 0) >= P(vr_zp = 0). This follows because, in this scenario, specializing for candidate c will also specialize candidate zp. Eliminating zero-propagator candidates in this fashion through a transitivity analysis helps whittle down the list of candidates further to just root zero-originators.

The authors of Zeroploit note that it is still beneficial to introduce divergent branches when the shader is limited by texture throughput and specialization removes texture operations [29]. Such scenarios are difficult for tools to identify without hardware performance monitor feedback. The vote.all instruction returns True if a Boolean predicate is True across all active threads of a SIMT warp and False otherwise.

[Figure 3: a dataflow graph in which r0 = TEX feeds a chain of multiplies (r4 = MUL r0, rBackward0; r5 = MUL r4, rBackward1; pixout = MUL r5, rBackward2), with Backward Slices 0, 1, and 2 feeding the respective multiplies, evaluating the opportunity of specializing r0.]

Figure 3. Evaluating a candidate. Constant propagation and folding can convert some instructions to constant literal moves. A subsequent dead code elimination pass removes backward slices of computation that are no longer needed.
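The profile-derived probability estimates and the candidate filter described above can be sketched as follows. The names and the sample profile are ours, and the threshold value is only illustrative:

```python
P_THRESH = 0.32  # illustrative filtering threshold

def p_zero(zeros, writes):
    # P(vr_s = 0) is estimated as zero_s / writes_s from the counters.
    return zeros / writes if writes else 0.0

def joint_p_zero(element_probs):
    # Vector candidates: assume independence among the elements.
    p = 1.0
    for pe in element_probs:
        p *= pe
    return p

def filter_candidates(profile):
    """profile maps candidate -> (zeros, writes); keep likely-zero candidates."""
    kept = {}
    for cand, (zeros, writes) in profile.items():
        p = p_zero(zeros, writes)
        if p >= P_THRESH:
            kept[cand] = p
    return kept

profile = {"r1": (900, 1000), "r2": (10, 1000), "r3": (400, 1000)}
print(filter_candidates(profile))  # {'r1': 0.9, 'r3': 0.4}
```

The transitivity pruning described in the text would additionally drop any zero-propagator whose probability is dominated by one of its inputs, leaving only root zero-originators for the expensive evaluation loop.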
AZP introduces control flow, which not only bloats code, but also has the potential to negatively impact scheduling by splitting independent long-latency instructions across basic blocks and artificially serializing them. Therefore, we need to balance the deleterious effects of AZP's code transformation with its potential upside. With a set of candidates to consider, AZP estimates the benefit of each candidate. Our algorithm greedily selects the candidate with the best estimated opportunity, and transforms the shader to specialize for the candidate's associated variable(s). While the actual specialization transformation is straightforward, quickly estimating a candidate's benefits is more difficult. Note that shaders commonly have hundreds of candidates, so we must limit the compile-time complexity of opportunity estimation.

Figure 3 illustrates AZP's approach on a simple example. At a high level, AZP estimates benefits by first finding the set of instructions that become specialized constants in the forward slice. In the figure, which shows the evaluation of variable r0, constant propagation causes all subsequent instructions to evaluate to zero. AZP then runs a dead code elimination analysis to remove the backward slices of computation that are no longer required (shown with dotted lines in the figure). While constant propagation and dead code elimination form
AZP ’s foundation, candidate estimationperforms these supporting analyses:1.
Versioning check hoisting : This step aims to determinelegal, “earlier” locations in the program to which to hoist thecomputation needed to perform the candidate’s versioningcheck. Hoisting the versioning check can increase the scopeof the computation’s backward slices. While we could haveimplemented a sophisticated code motion approach ( e.g. ,loop invariant code motion, partial redundancy elimination,and unification), our prototype performs simple intra-blockhoisting of the candidate’s versioning check. This limita-tion trades off missing some specialization opportunities forimproved compile-time speed.2.
Region identification : This step identifies the set of ba-sic blocks, R , in which any specialization is legal for a given candidate, and it depends on the location of the version-ing check. The set is formally defined as { 𝑏 | 𝑏 ∈ 𝐺 and 𝑣 ∈ 𝐷𝑜𝑚 ( 𝑏 )} , where 𝐺 represents the set of basic blocks inthe control flow graph, 𝑣 is the block with the versioningcheck, and 𝐷𝑜𝑚 is the function that returns the dominators ofthe given block. We can copy the blocks to a new set
S ← R ,and insert an if-then-else hammock that branches either to R or S , depending on the outcome of the versioning check. We will specialize the instructions in S .3. Constant propagation : Within S we perform a standardforward constant propagation analysis. When propagatingconstants, if the definition of a source operand reaches froma block 𝑏 ∉ S we conservatively assume the operand isnot constant ( i.e. , it is ⊤ ). This analysis discovers the set ofinstructions for which all destination results will provablyreturn constants.4. Dead code elimination : Source registers are not requiredto produce a statically known constant, and therefore newlydiscovered constant expressions in S might eliminate ad-ditional expressions in the backward slice. The followingdataflow equations, which are slight modifications to livevariable analysis, show how the backward analysis proceedsat the instruction level. 𝐾𝑖𝑙𝑙 𝑖 = 𝐷𝑒 𝑓 𝑠 ( 𝑖 ) (1) 𝐺𝑒𝑛 𝑖 = ∅ ∀ 𝑑 ∈ 𝐷𝑒𝑓 𝑠 ( 𝑖 ) 𝑑 ∉ 𝑂𝑢𝑡 𝑖 or 𝑑 is constant 𝑈 𝑠𝑒𝑠 ( 𝑖 ) otherwise (2)If all of an instruction’s destination registers ( i.e. , the LHSof the instruction) are marked as constants then Equation 2does not generate any upward exposed flows. Furthermore, ifan instruction’s definition is not live at the point of definition,then we can effectively remove the instruction and forgoits upward exposed flows. Of course, this only applies toinstructions that do not change global state such as branchesand stores. In addition, we logically treat fused-multiply-addinstructions as a multiply followed by an add, which allowsus to remove backward slices of the “other” multiplicand,when one of the multiplicands is zero.At the block level, Equation 3 is exactly the same as it isfor live variable analysis. However, we slightly modify thelive variable analysis flow in Equation 4 so that we need onlyconsider computing flows in the region S . This relies on apre-computed set of “live in” virtual registers for the originalshader, 𝐿𝐼𝑛 . 
In_n = (Out_n − Kill_n) ∪ Gen_n   (3)

Out_n = ⋃_{s ∈ succ(n)} { In_s,  if s ∈ S
                        { LIn_s, if s ∉ S   (4)

We perform all of the analyses for opportunity estimation without actually transforming the shader. This helps us improve the efficiency of our optimization.
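As a concrete illustration, the region-identification step above can be sketched with a classic iterative dominator analysis. This is a simplified model over a toy CFG; the helper names `dominators` and `find_region` are illustrative, not AZP's actual implementation:

```python
# Sketch of AZP's region identification: R = {b | b in G and v in Dom(b)}.
# Toy CFG model: blocks are ints, `succ` maps each block to its successor list.
# Assumes every block is reachable from `entry`.

def dominators(succ, entry):
    """Classic iterative dominator analysis: shrink Dom(b) to a fixed point."""
    blocks = set(succ)
    preds = {b: [] for b in blocks}
    for b, ss in succ.items():
        for s in ss:
            preds[s].append(b)
    dom = {b: set(blocks) for b in blocks}   # initially, everything dominates b
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks - {entry}:
            new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

def find_region(succ, entry, v):
    """R: every block dominated by the versioning-check block v."""
    dom = dominators(succ, entry)
    return {b for b in succ if v in dom[b]}
```

For a diamond CFG 0→1, 1→{2,3}, 2→4, 3→4, a versioning check placed in block 1 yields the region {1, 2, 3, 4}, while a check placed in block 2 yields only {2}.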
[Figure 4 here: (a) the original CFG, containing TEX R8, TEX R0, MULF R2,R0,R8, and a use of R2; (b) the transformed CFG, where a VOTE.ALL cc, R8==0 check branches between the specialized in-scope region, in which R2 = 0, and the original out-of-scope region.]
Figure 4. The AZP transformation. In this example, AZP specializes the CFG in (a) for cases when R8 is zero (the result of the highlighted texture operation). The specialization, shown in (b), splits the basic block in which R8 is defined to insert a versioning check. The vote.all instruction ensures the versioning branch is SIMT warp-convergent.

5. Region refinement: Remember that the versioning variable's basic block dominates S. However, constant propagation and dead code elimination may have found specialization opportunities in only a subset of S's basic blocks. This optional step attempts to limit the scope of the versioning variable to reduce the number of new basic blocks specialization introduces.

Heuristic:
After we have run the analyses mentioned above, we estimate the candidate's opportunity. The heuristic we use to determine a candidate's effectiveness follows:

T_AZP = P(vr_c = 0) · T_S + (1 − P(vr_c = 0)) · T_R + T_check
T_δ = T_R − T_AZP   (5)

T_check is an estimate of the number of cycles spent performing the versioning check; T_R estimates how many dynamic cycles are spent in R, and is based on the basic block frequencies (from the profile) as well as the mixture of instructions in R. Likewise, T_S is a cycle estimate using basic block frequency information as well as the set of instructions remaining in S. Intuitively, T_δ is the number of cycles saved via specialization. Initially we transformed any candidate if T_δ exceeded 25 cycles, but that approach failed for a couple of reasons. First, it did not take into account the speedup of specializing the candidate. For example, if R is large and expensive to execute, then saving 25 cycles is probably not worth the negative effects of AZP's transformation. In addition, we want to forgo transforming candidates that introduce too many additional basic blocks. Finally, we give an extra boost to candidates where specialization removes memory operations:

su = (T_R − (T_S + T_check)) / T_R
su_thresh = e^(−(|S| + k · MemKilled))   (6)

where |S| is the number of blocks in S, MemKilled is the number of memory operations in R that are not in S, and k is a tuned constant. We will not transform any candidate unless T_δ > 0 and su exceeds su_thresh. These refinements to our heuristic avoid transforming some empirically negative candidates. When discussing results in Section 4, we show that this heuristic effectively identifies several outstanding opportunities, but we also show that there are still cases where our heuristic fails. Future work will consider alternative heuristics.

Figure 4 provides a simple example of the
AZP transformation. In the example, we specialized the CFG in Figure 4a for executions where R8 is zero. While the transformation is conceptually simple, this example lets us recap some interesting tradeoffs. As during the evaluation phase, we determine the scope of the candidate's versioning variable. In our example, the grayed basic block is the candidate's scope, since this block is the only block in the CFG that the candidate dominates. Our transformation creates the region S (the grayed block in Figure 4b), and sets all occurrences of R8 in S to zero. It inserts a versioning check that either branches to S or the original region R. AZP passes the versioning check predicate through the vote.all operation [16], thus making the dependent branch warp-convergent. This avoids SIMT warp-divergence penalties in the transformed code. At this point,
AZP invokes constant folding and dead code elimination on the shader.

This example makes clear that the AZP transformation increases the size and complexity of a shader's CFG. In addition, the runtime versioning check introduces overheads. In this example, depending on P(R8 = 0), the added complexity may be worthwhile because specialization removed an expensive texture lookup (TEX). Note, however, that specialization also delays the execution of the "TEX R0 ..." instruction. In the original CFG in (a), the two texture operations could potentially execute concurrently, whereas specialization defers the execution of a texture operation when R8 ≠ 0. These tradeoffs are adequately captured by various parameters used in our heuristic.

Finally, Figure 4a demonstrates different scoping alternatives. In
Zeroploit the authors discuss hoisting the versioning variable as "early" as possible, which might expose additional opportunities. Thus the Zeroploit approach, which we remind the reader was not automated, would presumably hoist the versioning variable to the grayed block's dominator, where the subsequent analysis would determine that R2 is zero everywhere when R8 is zero. We have previously mentioned that our prototype does not consider inter-block code motion to increase a candidate's scope, and we made this decision primarily in the name of compile-time complexity. Partial redundancy elimination and related techniques are expensive enough that some compiler engineers are reluctant to apply them once [24], let alone potentially hundreds of times, once per candidate.

Later, in Section 4, we show the correlation between compile time and the number of candidates AZP considers and the number of transforms it performs (up to xform_limit in Figure 2). We can reduce compilation time with better candidate filtering, including increasing the "probability of zero" threshold for a candidate. Though we have not done so, an implementation could also skip testing scalar candidates whose versioning variable is already part of a vector candidate.
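To make the cost-benefit test of Equations 5 and 6 concrete, the candidate scoring can be sketched as follows. This is a schematic only: the threshold shape and the weight `K` are illustrative stand-ins, not the production heuristic's tuned values:

```python
import math

# Schematic of AZP's candidate test (Equations 5 and 6). The weight K below
# is an illustrative stand-in, not a tuned production value.
K = 1.0  # hypothetical weight on memory operations eliminated from S

def should_transform(p_zero, t_spec, t_orig, t_check, num_blocks, mem_killed):
    """p_zero:     profiled probability that the versioning variable is zero
    t_spec:     estimated cycles spent in the specialized region S
    t_orig:     estimated cycles spent in the original region R
    t_check:    estimated cost of the versioning check
    num_blocks: |S|; mem_killed: memory ops in R that are not in S"""
    t_azp = p_zero * t_spec + (1.0 - p_zero) * t_orig + t_check
    t_delta = t_orig - t_azp                       # expected cycles saved
    su = (t_orig - (t_spec + t_check)) / t_orig    # fast-path relative speedup
    su_thresh = math.exp(-(num_blocks + K * mem_killed))
    return t_delta > 0 and su > su_thresh
```

For instance, a candidate with a 90% zero probability that shrinks a 1000-cycle region to a 100-cycle fast path passes easily, while a candidate that barely shrinks its region fails on the expected savings.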
This section describes our experimental methodology, presents runtime speedups and compile-time overheads, and concludes with a discussion of some interesting insights regarding the results.
Table 1 lists the gaming applications evaluated in this paper. Using an internal frame-capture tool, similar to publicly available tools like Renderdoc [22] and Nsight [15], one or more random single frames are captured from a built-in benchmark run or from actual gameplay of each gaming application, depending on whether the game has a built-in benchmark. These single-frame captures, called APICs, contain all the information needed to replay a game frame, i.e., both the API sequence (including all relevant state) as well as the shader programs used in individual calls. For applications for which we captured more than one APIC, we made sure to capture them from visually different scenes. We evaluate AZP on a total of 14 APICs spanning 8 applications, as shown in Table 1. We hyphenate an application's short name with the APIC number to uniquely refer to an APIC in this section (e.g., SS-3 refers to the third APIC of Serious Sam 4).
Table 1. Gaming applications evaluated in this paper.

Application                           | Short | APICs | dx11 | dx12
Ashes of the Singularity - Escalation | Ashes | 3     |      |
Deus Ex Mankind Divided               | DXMD  | 1     |      |
Final Fantasy XV                      | FFXV  | 1     |      |
Metro Exodus                          | Metro | 1     |      |
PlayerUnknown's Battlegrounds         | PUBG  | 3     |      |
Horizon Zero Dawn                     | HZD   | 1     |      |
WatchDogs Legion                      | Watch | 1     |      |
Serious Sam 4                         | SS    | 3     |      |
We implemented AZP as a research prototype on top of a recent branch of NVIDIA's GeForce Game Ready driver. Value profiling is performed in dedicated offline runs, while just-in-time compilation based on these value profiles happens online.

We perform our experiments on an NVIDIA GeForce RTX™ 2080 GPU, locked to base clock settings of 1515 MHz for the GPU core and production DRAM frequency settings. The full specification of this GPU can be found in [28]. We compare the final rendered image against a golden reference image to ensure that our transforms are functionally correct. We measure GPU-only time using accurate in-house profiling tools, after locking GPU clocks and power state to production active game-play settings. These runs use production settings for driver and compiler optimizations, which include classical optimizations like constant folding, dead code elimination, loop unrolling, instruction scheduling, etc., in addition to various machine-specific optimizations.
AZP's speedups over baseline execution at the targeted-region and full-frame levels are shown in Figure 5a and Figure 5b, respectively. We define a "targeted region" of an application as the set of shader programs that AZP's heuristic chooses for transformation. We calculate targeted-region speedup as the sum of the baseline time for all chosen shaders divided by the sum of the execution times for the same shaders in an AZP-transformed run. Both graphs show performance speedup with default AZP heuristics as well as an
(a) Speedup for targeted regions with default and oracle heuristics.
(b) Full-frame speedups with default and oracle heuristics.
Figure 5. Performance of AZP on modern gaming applications.

"oracle" heuristic. The latter is calculated by simply ignoring all heuristic-induced slowdowns.

Since our goal in this paper is to demonstrate an automatic compiler technique for
Zeroploit and not to explore its effectiveness across scenes of gaming applications, we use the same APICs for profiling and testing. However, in the few applications for which we were able to obtain multiple APICs corresponding to different scenes, we can see that AZP is able to extract Zeroploit benefit effectively across scenes. For example, in Figure 5a, notice the similar upsides among PUBG APICs, among Ashes APICs, and in APICs 2 and 3 of SS. This augurs well for a dynamic adaptive compilation system based on AZP, which we plan to explore in the future.

At the full-frame level, we see average speedups of 3.5% and 3.9% with default and oracle heuristics, while for targeted regions AZP achieves average speedups of 16.4% and 18%, with default and oracle heuristics, respectively. From these graphs, we can see that the default AZP heuristics are able to effectively target Zeroploit opportunities in most applications.

Figure 6.
Performance of 404 shaders across our test suite that AZP's heuristic deemed worthy of transforming.
Although AZP is reserved for compiling hot shaders, it does run in the context of a just-in-time compiler where compilation time is a consideration. To determine the overhead of our compiler analyses and transformations, we collected the compile-time slowdown over the driver's baseline compilation for each shader that AZP managed to transform. We focus on the transformed shaders to guarantee that the evaluate-transform loop executes at least once. We use a proprietary tool to measure and record the Direct3D driver's per-shader compile times.

On a 3.4 GHz Intel Core i7-6800K CPU with 32 GB of RAM, the geometric mean compile-time slowdown was 1.× over 138 varied shaders across all of our applications. If we increase the "probability of zero" threshold (P_thresh) for candidates that AZP considers from 0.32 to 0.9, the mean slowdown decreases slightly to 1.×. If, on the other hand, we limit the number of iterations of the evaluate-transform loop to one, we see the mean slowdown decrease to 1.×. Not surprisingly, these results show a clear correlation between the number of candidates evaluated and the resultant overhead. Of note, modern drivers carefully choose when to asynchronously compile optimized shaders and will not waste CPU cycles compiling additional shaders unless performance is limited by the GPU. In such cases, where the CPU is not limiting performance, spending additional time optimizing shaders can be prudent.

Figure 6 shows how
AZP's heuristics performed at the individual shader level. Here, we plot individual shader speedups on the y-axis against both an unweighted abscissa (blue line) and a weighted abscissa (magenta points). The unweighted abscissa is simply a monotonically increasing shader identifier, whereas the weighted abscissa is calculated based on a given shader's normalized frame-time contribution in its baseline execution. We can see that AZP is positive or neutral on most shaders. On about 25% of the shaders, it produces slowdowns, as can be seen on the right end of the blue line. However, as the steep fall of the cluster of magenta points at the right end shows, these slowdowns predominantly occur in shaders with negligible frame-time contributions. This implies AZP's heuristics work well in practice on the shader programs that matter.

Through manual inspection of a few of the negative examples in Figure 6, we found that AZP's code-versioning transform at times interacts pathologically with downstream passes. AZP currently runs early in the compiler flow, before loop unrolling, scheduling, and register allocation. We have seen examples where AZP inhibits unrolling an important loop, and we have seen it negatively affect texture scheduling and register usage. As future work, we plan to investigate ways to avoid such cases through heuristic enhancements.
AZP in PUBG. PUBG contains an interesting, positive example. AZP progressively optimized an important shader over three iterations of the evaluate-transform loop. In the first iteration, the best candidate removed 30 multiplies and several multiply-accumulate and add instructions. The candidates that AZP transformed on the second and third iterations scored poorly on the first iteration; the first transformation exposed opportunities that did not previously exist. On the second iteration, a vector candidate removed dozens of additional math operations. Again, the final and most influential candidate that was transformed scored poorly on the second iteration. On the final iteration of the evaluate-transform loop, AZP removed a large swath of instructions, including dozens of memory operations.
Zeroploit, a profile-guided code transform for forward- and backward-slice specialization based on zero values, was shown to be effective in gaming applications with manually optimized codes [29]. In this paper, we highlight the various challenges that need to be solved in order to automate Zeroploit, and present and evaluate a full compiler solution for it.

Value-dependent code specialization has previously been automated in partial evaluation systems for functional and imperative language programs [7, 21], general-purpose programs [2, 18, 27, 30], embedded software [4], Java just-in-time compilers [17, 31], etc. Unlike the aforementioned value-specialization efforts, which specialize for generic values and thus perform only forward-slice specialization, AZP achieves both forward- and backward-slice specialization by specializing for just zeros. For example, Muth et al.'s code-versioning approach for generic value specialization used heuristics that estimate savings from specializing forward slices [27]. Grant et al. used an annotation-based system without heuristics to target dynamic zero candidates at function scopes in the DyC dynamic compilation system [18]. In contrast, AZP estimates forward- and backward-slice specialization benefits by using expected savings from constant propagation and folding as well as dead-code elimination.

Recently, Leobas and Pereira identified the type of opportunity targeted by Zeroploit as the mathematical notion of semirings. As a proof-of-concept semiring optimization, they automated a profile-driven transform to target silent stores and evaluated it on CPU programs in the LLVM test suite [26]. They also presented a novel online profiler that works well for detecting silent stores in hot loops. AZP differs from semiring optimization in a few key points. First, AZP's focus on gaming applications means it can take advantage of developer-granted permission to perform IEEE-754-unsafe operations to optimize floating-point code. Second, AZP uses a cost-benefit analysis based on constant folding and dead code elimination to evaluate, rank, and select from among several versioning variables per shader program, whereas the semiring silent-store optimization did not need any heuristics and was applied to all stores their profiler detected as likely silent. Third, their loop-iteration-sampling profiler cannot be directly applied to graphics programs, since most shader programs either have no loops or have loops that iterate only a few times. Therefore, AZP uses a profiling strategy better suited to gaming applications: instruction-granular zero-value profiling, sampled over several frames. And fourth, we describe the AZP compiler pipeline in full detail, complete with dataflow equations, which complements Leobas and Pereira's theoretical proofs for semiring optimizations, thus enabling provably correct practical application of Zeroploit and similar semiring optimizations.

Finally, unlike prior work, AZP, when applied to gaming applications, has to contend with the SIMT execution model of NVIDIA and similar GPUs. AZP accomplishes this by using NVIDIA's warp-wide vote instruction [16] in both its zero-value profiler and its code transformer.
Recent work called Zeroploit identified a profile-guided optimization that exploits certain dynamically zero-valued operands in graphics shader programs to execute highly specialized code. However, this prior work relied on manual effort to identify opportunities and transform them. In this paper, we advanced the above line of work with AZP, which uses novel compile-time heuristics to automatically identify opportunities and transform code. With its default heuristics, AZP achieves an average frame-rate improvement of 3.5% and an average targeted-region speedup of 16.4% over a heavily optimized production driver, across a suite of modern gaming applications.

Our current results based on single-frame testing are encouraging. Future work will study AZP's effectiveness both as a PGO based on offline profiling as well as an adaptive compilation strategy based on continuous online profiling, in multi-frame scenarios such as demos and game-play.

References

[1] Brad Calder, Peter Feller, and Alan Eustace. 1997. Value Profiling. In
Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 30). 259–269.
[2] Brad Calder, Peter Feller, and Alan Eustace. 1999. Value Profiling and Optimization. Journal of Instruction Level Parallelism (JILP).
[3] Max Christoff. 2020. Chrome just got faster with Profile Guided Optimization. https://blog.chromium.org/2020/08/chrome-just-got-faster-with-profile.html
[4] Eui-Young Chung, B. Luca, G. De Micheli, G. Luculli, and M. Carilli. 2002. Value-Sensitive Automatic Code Specialization for Embedded Software. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 21, Issue 9 (September 2002).
[5] Robert Cohn and P. Geoffrey Lowney. 1999. Feedback Directed Optimization in Compaq's Compilation Tools for Alpha. In Proceedings of the 2nd ACM Workshop on Feedback-Directed Optimization.
[6] Microprocessor Standards Committee. 2019. https://ieeexplore.ieee.org/servlet/opac?punumber=8766227
[7] Charles Consel, Luke Hornof, François Noël, Jacques Noyé, and Nicolae Volansche. 1996. A Uniform Approach for Compile-Time and Run-Time Specialization. In Selected Papers from the International Seminar on Partial Evaluation.
[8] Intel Corporation. 2020. Profile-Guided Optimization.
[9] Microsoft Corporation. 2018. Direct3D 11 graphics. https://docs.microsoft.com/en-us/windows/win32/direct3d11/atoc-dx-graphics-direct3d-11
[10] Microsoft Corporation. 2018. Direct3D 12 graphics. https://docs.microsoft.com/en-us/windows/win32/direct3d12/direct3d-12-graphics
[11] Microsoft Corporation. 2018. High Level Shading Language. https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl
[12] Microsoft Corporation. 2018. Shader Model 4 Assembly (DirectX HLSL) - dcl_globalFlags. https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dcl-globalflags
[13] Microsoft Corporation. 2018. Variable Syntax. https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-variable-syntax
[14] Microsoft Corporation. 2019. Profile-Guided Optimizations. https://github.com/MicrosoftDocs/cpp-docs/blob/master/docs/build/profile-guided-optimizations.md
[15] NVIDIA Corporation. 2019. Nsight 2019.6. https://developer.nvidia.com/nsight-graphics
[16] NVIDIA Corporation. 2019. Parallel Thread Execution ISA: Application Guide. https://docs.nvidia.com/pdf/ptx_isa_6.5.pdf
[17] Igor Costa, Pericles Alves, Henrique Nazare Santos, and Fernando Magno Quintao Pereira. 2013. Just-in-time Value Specialization. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO '13). 1–11.
[18] Brian Grant, Matthai Philipose, Markus Mock, Craig Chambers, and Susan J. Eggers. 1999. An Evaluation of Staged Run-time Optimizations in DyC. In Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation (PLDI '99). 293–304.
[19] The Khronos Group Inc. [n.d.]. OpenGL Overview.
[20] The Khronos Group Inc. 2018. Vulkan Overview.
[21] Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. 1993. Partial Evaluation and Automatic Program Generation. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
[22] Baldur Karlsson. 2019. Renderdoc v1.5. https://renderdoc.org/docs/index.html
[23] John Kessenich, Dave Baldwin, and Randi Rost. 2017. The OpenGL® Shading Language.
[24] U. Khedker, A. Sanyal, and B. Sathe. 2017. Data Flow Analysis: Theory and Practice. CRC Press.
[25] Samuel Larsen and Saman Amarasinghe. 2000. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI '00). 145–156.
[26] Guilherme Vieira Leobas and Fernando Magno Quintão Pereira. 2020. Semiring Optimizations: Dynamic Elision of Expressions with Identity and Absorbing Elements. In OOPSLA 2020: Conference on Object-Oriented Programming Systems, Languages, and Applications.
[27] Robert Muth, Scott A. Watterson, and Saumya K. Debray. 2000. Code Specialization Based on Value Profiles. In Proceedings of the 7th International Symposium on Static Analysis (SAS '00). Springer-Verlag, London, UK, 340–359.
[28] Tech Powerup. 2018. NVIDIA Geforce RTX 2080.
[29] Ram Rangan, Mark W. Stephenson, Aditya Ukarande, Shyam Murthy, Virat Agarwal, and Marc Blackstein. 2020. Zeroploit: Exploiting Zero Valued Operands in Interactive Gaming Applications. ACM Transactions on Architecture and Code Optimization 17, 3, Article 17 (Aug. 2020), 26 pages. https://doi.org/10.1145/3394284
[30] S. Subramanya Sastry, Rastislav Bodik, and James E. Smith. 2000. Characterizing Coarse-Grained Reuse of Computation. In Proceedings of the ACM Workshop on Feedback Directed and Dynamic Optimization.
[31] Ajeet Shankar, S. Subramanya Sastry, Rastislav Bodík, and James E. Smith. 2005. Runtime Specialization with Optimistic Heap Analysis. In