Learning from Shader Program Traces
Yuting Yang, Connelly Barnes, Adam Finkelstein

Abstract
Deep networks for image processing typically learn from RGB pixels. This paper proposes instead to learn from program traces, the intermediate values computed during program execution. We study this idea in the context of pixel shaders – programs that generate images, typically running in parallel (for each pixel) on GPU hardware. The intermediate values computed at each pixel during program execution form the input to the learned model. In a variety of applications, models learned from program traces outperform baseline models learned from RGB, even when augmented with hand-picked shader-specific features. We also investigate strategies for selecting a subset of trace features for learning; using just a small subset of the trace still outperforms the baselines.
1. Introduction
Deep learning applications in graphics and vision typically work on images encoded as pixels in RGB color space. For images representing 3D scenes, researchers have also explored augmenting the RGB data with other hand-picked features like depth or surface normals (Chaitanya et al., 2017; Vogels et al., 2018). These auxiliary features are picked based on domain expertise, and vary for different applications or programs.

This paper proposes augmenting the data from which a neural network learns with the program trace. In software engineering, a trace generally refers to the record of all states that a program visits during its execution (Feiler & Humphrey, 1993; Larus, 1993), including all instructions and data. We explore this idea in the context of procedural shader programs, like the ones shown in Figure 1. The sequence of instructions tends to be similar or identical from pixel to pixel, so we rely on just the intermediate values for learning, referring to these as the "program trace."

Princeton University, USA; Adobe Research, USA. Correspondence to: Yuting Yang <[email protected]>, Connelly Barnes <[email protected]>, Adam Finkelstein <[email protected]>.
Figure 1. Learned outputs for four example shader programs: (a) Gear, (b) Oceanic, (c) Bricks, (d) Trippy Heart. Each is split: the upper left shows the reference for the learning task, and the lower right shows learned outputs. (a) and (b) show learning to remove sampling noise (Section 4.1), while (c) and (d) show learning to reconstruct from a simplified shader program (Section 4.2).
Shader programs can be used to flexibly construct complex and even fantastical appearances by combining sequences of mathematical operations to create texture patterns, produce lighting, perturb surface normals to produce effects such as bump mapping, apply noise functions, or determine ray intersections with procedurally generated geometry (Akenine-Möller et al., 2008). Typical pixel shaders receive as input pixel locations, and output color independently per-pixel. Therefore these programs can leverage the parallelism available on modern GPU hardware to efficiently perform per-pixel computation across a full image, making such shaders essential for games, VR, visualization, and other interactive applications. There exist other kinds of shaders, such as vertex and geometry shaders, but this paper focuses on pixel shaders. A range of example shaders are shown in Figures 1 and 3; many more examples are available from shader websites online. Note that while the example shaders appearing here are simpler than those typical of production or games, they embody the key features that appear in production-level shaders.

Since the pixel shader program operates independently per pixel, we can consider the program trace as a vector of values computed at each pixel – generalizing from RGB, a vector of just three values. As noted above, auxiliary features are typically manually identified by an expert on a per-shader basis. Moreover, the extent to which these auxiliary features help learning depends on the choice of features, the particular shader, and the learning objective. We hypothesize that other shader-specific information potentially useful to the learner remains hidden within the program execution, and that a learning process could automatically identify and leverage that information. Thus, we propose a learning-based approach that utilizes all of the information produced during the execution of a shader program. The learner could automatically identify which features are useful, obviating the need for manual feature selection in the midst of an experimental process.

To illustrate the applicability of learning from program traces, we introduce four applications. Three of them work from pixel data: learning to predict low-noise output, learning to reconstruct full computation from a program with partial computation, as well as learning the output of a postprocessing filter. The fourth application shows that the idea of learning from program traces can be applied to non-imagery data: it learns to simulate the position and velocity of a flock of "boids" (Reynolds, 1987), which emulate flocking behavior similar to that of birds or fish.

In most of our experiments, we train a separate model for each shader on each application. Scene-specific learning is commonly used, for example, in recent work on novel view synthesis (Sitzmann et al., 2019; Thies et al., 2019; 2020; Mildenhall et al., 2020). Appendix C.4 describes how a single network can be trained over multiple shaders.

The primary contribution of this paper is the idea that a shader program trace can be used as a feature vector for machine learning. Nevertheless, it is neither obvious how to use such a feature, nor that it would help in any particular application. Thus, a secondary contribution is to introduce a framework for learning from program traces, and demonstrate that it outperforms baseline methods in several applications.
The third contribution is to investigate the relative importance of individual trace features, and how the input trace size under various trace subsampling strategies can affect the performance of the model.

Section 2 describes related work. In order to both record program traces and perform learning on GPUs, Section 3 describes a compiler that collects intermediate values computed during the shader execution. Section 4 presents experiments evaluating our method on a variety of applications applied to different shaders; the proposed method compares favorably with baseline methods (Figure 2). Section 5 analyzes the importance of individual program trace features, as well as trace subsampling strategies.
2. Related Work
Program traces in machine learning.
Program traces have proven helpful in malware detection (Chen et al., 2018), program induction (Reed & Freitas, 2016) and program synthesis (Chen et al., 2019; Jeppu et al., 2020). Researchers have also explored using partial execution or partial rendering to synthesize graphics programs (Ganin et al., 2018) or infer parameters for procedural models (Ritchie et al., 2016). Instead of developing specialized learning models for a particular application, we explore a generic architecture that can learn over a range of applications. Nevertheless, this work focuses on learning from program traces for shaders, which enjoy certain unique properties such as an emphasis on pixel outputs and an enormous degree of parallelism.
Features for deep learning on imagery data.
Researchers have explored the use of a variety of features beyond simple RGB as inputs to learned functions. Nalbach et al. (2017) have arguably made the most comprehensive exploration of such features as part of a deferred shading pipeline. Xie et al. (2018) consider auxiliary features including flow velocity and vorticity when learning density super-resolution for fluid simulation. To our knowledge, our paper is the first to propose augmenting such features with the full program trace. The benefit is that the trace of a program that computes such manually picked features inherently includes them, as well as other potentially useful information; moreover, the extent to which various features are useful for a particular application and shader is discovered automatically by the learning process.
Feature space reduction.
In deep networks, an overly large feature space can exhaust memory, increase training time, or even make learning tasks harder. Researchers have explored methods to reduce the feature space by pruning whole convolutional filters (Li et al., 2016; Luo et al., 2017; Molchanov et al., 2016). In our method, we focus mostly on reducing the input feature space because the dimension of the program trace can be large. We use a method similar to that of Molchanov et al. (2019) to evaluate the importance of each trace input and show a trade-off between the runtime and visual fidelity as we change our feature reduction strategies. Our experimental results suggest that without prior execution or learning, we could find no subsampling strategy that consistently outperforms a simple uniform subsampling. However, if the shader is allowed the overhead of learning a model over the full program trace, we can select important trace features from the learned model.
Remove sampling noise.
One of our applications addresses Monte Carlo noise reduction in low-budget rendering. One strategy for low-noise rendering involves carefully distributing the samples; see Zwicker et al. (2015) for a survey. Another uses symbolic compilation techniques to analytically approximate the integral that produces the smoothed shader (Dorn et al., 2015; Yang & Barnes, 2018), but is hard to scale to complicated shaders such as the ones shown in Figure 1. On the other hand, learning-based algorithms train regressors such as neural networks to predict the rendering. The inputs to these networks are usually augmented with auxiliary features (Chaitanya et al., 2017; Vogels et al., 2018). Unlike previous work, our approach gathers customized information per shader, and is orthogonal to learning-based denoising network design in the sense that it can be combined with an existing network.
Shader simplification.
As the complexity of a shader program grows, it is common to apply lossy optimizations that yield a program that only approximates the original but has better runtime performance (Sitthi-Amorn et al., 2011; Wang et al., 2014; He et al., 2015). We show experiments that explore how the trace from the simplified program can provide information that helps to recover the missing details in the target shader. Thies et al. (2019) identify a similar task where they learn novel view synthesis from a coarse proxy geometry. Nevertheless, to our knowledge this is the first paper to propose the application of learning from a simplified shader program to restore details in the original program; and we show that using the program trace can help in this application.
Neural networks for image processing.
Researchers have investigated a variety of learning-based methods for image processing tasks, such as image enhancement and filtering (Li et al., 2018; Gharbi et al., 2017; Wu et al., 2018; Li et al., 2017). Our postprocessing application demonstrates that the proposed method is also helpful when learning these imagery operations as postprocessing filters.
Learning simulation programs.
High-quality simulation usually executes the program over many tiny time steps, which is expensive. Researchers have developed reinforcement learning based methods (Kim et al., 2020; Hahn et al., 2019) to replace the program entirely, or execute the program at a lower spatial resolution and learn a super-resolution deep model (Werhahn et al., 2019). Our simulation application instead learns from the program's execution trace on a larger time step, and corrects the output as if the program were executed for multiple smaller steps.
3. Compiler and Preprocessing
This section introduces a compiler that can collect traces from shader programs. It translates shader programs from a domain-specific language to TensorFlow code that logs the trace (Section 3.1). To stay within the hardware memory budget, the compiler also restricts the trace length to an arbitrary size cap (Section 3.2). We briefly summarize the training process as we introduce applications in Section 4. More training details can be found in Appendix B.
3.1. Compiling Shaders and Collecting Traces

Our compiler takes as input an arbitrary procedural shader program written in a domain-specific language (DSL) and translates it to a TensorFlow (TF) program that outputs a rendered image as well as a collected program trace. We embed the DSL in Python, which allows us to use Pythonic features such as operator overloading. We also include common shader operations such as trigonometric functions, and dot and cross products. For simplicity, we assume that the shader program manipulates numerical scalars or vectors of known size. We handle branching by computing both branches of conditionals. Likewise, loops are unrolled to the maximum possible number of iterations; this limit is set by the programmer for each loop. These are not fundamental limitations of the approach: we experimented with emulating branching and variable-length loops by writing dummy values of zero to the trace in the branch or iteration not executed, and this gives visually and quantitatively identical results to our current approach. See Appendix C.2 for details. These policies permit us to express the trace of any shader as a fixed-length vector of the computed scalar values, regardless of the pixel at which the shader is evaluated.
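To make the trace-collection idea concrete, the following is a minimal sketch of how such a DSL can be embedded in Python via operator overloading. The class and function names here are illustrative assumptions; our actual compiler emits TensorFlow code and applies the optimizations of Section 3.2 rather than logging naively as shown.

```python
import math

TRACE = []  # intermediate values logged during one shader evaluation

class Expr:
    """A scalar DSL value; constructing one appends it to the trace."""
    def __init__(self, value):
        self.value = value
        TRACE.append(value)           # every computed scalar becomes a trace feature
    def __add__(self, other): return Expr(self.value + _val(other))
    def __sub__(self, other): return Expr(self.value - _val(other))
    def __mul__(self, other): return Expr(self.value * _val(other))

def _val(x):
    return x.value if isinstance(x, Expr) else x

def sin(x):
    # Built-in function: only its return value is logged (see Section 3.2).
    return Expr(math.sin(_val(x)))

def stripes_shader(u, v):
    """A toy pixel shader: a sinusoidal stripe pattern over pixel location (u, v)."""
    x, y = Expr(u), Expr(v)
    return sin(x * 10.0 + y * 4.0) * 0.5 + 0.5

del TRACE[:]
intensity = stripes_shader(0.3, 0.7)
print(len(TRACE), "trace features logged for this pixel")
```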
3.2. Restricting the Trace Length

Large program traces can produce unnecessarily large feature vectors, from which learning becomes unwieldy, or worse, exhausts memory. Loop unrolling is a common contributor to large traces, because the program trace is scaled by the number of iterations. This section describes several strategies for reducing the size of the feature vector. All strategies described in this section can be reused when targeting a different language, e.g. the OpenGL Shading Language (GLSL) or CUDA.
Compiler optimizations.
The compiler omits constant values and duplicate nodes, and retains just one among neighboring nodes in the computation graph that differ by a constant addition or multiplication, since such features would be redundant in the learning network. The compiler also identifies common built-in functions and iterative improvement loops to eliminate trace features that are highly correlated. Built-in functions (e.g., sin) should typically be treated as a black box. Our DSL provides widely used shader operations such as noise functions and a normal computation functor. The compiler logs only the return values of such built-in functions, not the intermediate values found when computing them. This is a natural choice since in principle one could trace down to a very low level, such as including details about the microarchitecture, but we believe that learning will gain the most benefit if it occurs at a similar abstraction level as used by the programmer.

An iterative improvement loop repeatedly improves an approximate result to obtain a more accurate result (Sidiroglou-Douskos et al., 2011). A commonly used iterative improvement pattern in shader prototyping is a ray marching loop that computes the distance from the camera to objects in the scene (a sketch appears at the end of this subsection). Because each iteration computes a more accurate approximation than the previous iterations, the final iteration is the most informative. Therefore, the compiler will only log the trace from the final iteration of such loops. We automatically handle common cases of iterative improvement loops found in shaders by classifying loops based on pattern matching: the output of the loop is either iterative additive or can be written as a parametric form of the iterative additive variable. Detailed classification rules appear in Appendix A.

We also investigate several other strategies inspired by previous work on loop perforation (Sidiroglou-Douskos et al., 2011) and image perforation (Lou et al., 2016). In our case, however, we always run the full computation, but simply select a subset of those computations as input to the learning task, as follows.
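Before turning to those strategies, here is a concrete illustration of the iterative-improvement rule above: a sphere-tracing (ray marching) loop, sketched in plain Python rather than our DSL. In a traced shader, only values computed in the final iteration of such a loop would be logged.

```python
import numpy as np

def ray_march(origin, direction, scene_sdf, n_iter=64):
    """Sphere tracing: t is refined every iteration, so intermediate iterations
    are increasingly accurate approximations of the same final distance.
    Logging all of them would add n_iter highly correlated trace features;
    the compiler keeps only the final iteration's values."""
    t = 0.0
    for _ in range(n_iter):
        t = t + scene_sdf(origin + t * direction)  # t is iterative additive
    return t  # output variable: used outside the loop

# Example: distance from a camera at (0, 0, -3) to a unit sphere at the origin.
sdf = lambda p: np.linalg.norm(p) - 1.0
print(ray_march(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]), sdf))  # ~2.0
```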
Uniform feature subsampling.
The most straightforward strategy is to subsample the vector by some factor n, retaining only every n-th trace feature as ordered in a depth-first traversal of the compute graph. This approach tends to work well in our experiments, and we speculate that it does so because nearby nodes in the compute graph tend to be related by simple computations and thus are redundant.
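A minimal sketch of this strategy (the function name and flat-list representation are assumptions; the compiler operates on its computation graph directly):

```python
def uniform_subsample(trace_features, target_len=200):
    """Keep every n-th trace feature, in depth-first compute-graph order,
    choosing n so the kept length is as close as possible to target_len."""
    n = max(1, round(len(trace_features) / target_len))
    return trace_features[::n]
```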
Other sampling schemes. We explored a variety of other schemes for reducing the length of the feature vector, including clustering features based on observed statistical correlation, and various loop subsampling strategies that log features from only certain iterations in the loop execution. Yet none of them outperformed the above straightforward scheme consistently enough to justify their use in our subsequent experiments.

These options are combined as follows. We first apply compiler optimizations, then subsample the features with a subsampling rate that makes the trace length most similar to a fixed target length. For all experiments in Section 4, we target a length of 200, except as noted for the simulation example. Because raw trace features can be extremely large or small, we apply a clamping and whitening process described in Appendix B.1. After compiling and executing the shader, we have for every pixel a vector of dimension N: the number of recorded intermediate values in the trace.

In training, we generate input program traces on the fly each time one is needed, rather than loading precomputed traces from disk. There are two benefits to this approach. First, precomputed traces are large, and it is typically faster to recompute the trace than to load it from disk. Second, each time a trace is generated, we use a new randomly sampled sub-pixel location for evaluating the trace for any given pixel (a common strategy to reduce aliasing). Therefore, the input traces will generally have different values in each epoch even though we use the same ground truth solution. This approach helps the network avoid overfitting.
4. Evaluation
This section evaluates our method for various applications and scenarios: denoising pixel shaders (Section 4.1), learning to reconstruct simplified shaders (Section 4.2), learning postprocessing effects (Section 4.3), and learning non-imagery simulation programs (Section 4.4). We also discuss learning temporal coherence in Appendix C.1. The architectures and training schemes in these applications include fully connected networks, traditional CNNs and GANs, demonstrating our method's wide applicability to various deep learning models. A summary of performance in all the applications is shown in Figure 2; in all cases our method outperforms the strongest baseline.

Figure 2. Evaluation in four applications: denoising (§4.1), reconstruction from simplified shaders (§4.2), learned postprocessing effects (§4.3) and simulation (§4.4). In each case, our method yields reduced perceptual error (Zhang et al., 2018) compared to the strongest baseline: RGBx in all cases, except simulation which uses the I/O baseline (§4.4). Appx. C reports more comparisons.

We use a dilated convolutional network similar to that of Chen et al. (2017) for the denoising and postprocessing applications. In addition, we use a spatial GAN model for reconstructing from all simplified shaders. The generator uses the same dilated convolutional network, and the discriminator uses a PatchGAN network similar to that of Wang et al. (2018a; 2018b). The model is trained using a weighted sum of color and perceptual (Zhang et al., 2018) losses, together with a GAN loss where applicable. In all cases, we select the model at the epoch with the lowest validation loss.

For imagery learning tasks (Sections 4.1, 4.2, 4.3), the model trains on a dataset of 1200 tiles at 320 × 320 resolution, and 120 validation tiles at the same resolution. Testing includes 30 full-size images at 960 × 640 resolution. Please refer to Appendix B for further details on training.

Our implementation is trained on a single GPU. For consistent timing in evaluation, we use a 4-core Intel Xeon E5-2620 v4 2.10 GHz CPU with a single Nvidia GeForce RTX 2080 Ti GPU across all models. During training, we always train 400 epochs for models without a GAN and 800 epochs for models with a GAN. Timing results reported throughout appear as speedup relative to ground truth. The actual shader plus inference time ranges from 30+80 ms (Trippy Heart) to 21+0.2 s (Oceanic) per frame. These shaders are relatively slow because they are implemented as computational graphs in TensorFlow. They could be greatly accelerated through engineering a GLSL or CUDA implementation. Note the shader's runtime is invariant to whether program traces are collected or not; therefore it is not a limitation of our proposed method. All experiments presented in this section are trained per shader. We also demonstrate in Appendix C.4 that multiple shaders can be trained together with a shared network and a lightweight shader-specific encoder.

Our strongest baseline is called RGBx. It uses the same network and training as ours, but with the input features consisting of RGB color plus manually picked auxiliary features that are commonly used for learning with shader programs. We use all the auxiliary features found in recent denoising papers (Chaitanya et al., 2017; Vogels et al., 2018). Because the RGBx baseline generally has fewer input channels compared to our method, we increase the number of channels in the first convolutional layer of the baseline model such that the number of trainable weights matches that of our model. Unlike our automatic method, RGBx requires additional manual expertise to pick auxiliary features for every shader program. An automatic baseline that resembles ours would be RGB, which uses only RGB color without any auxiliary features. However, the RGB baseline is always outperformed by RGBx, so we only report the RGBx result.
4.1. Removing Sampling Noise

Here we describe the application of removing sampling noise. Our goal is to approximate a low-noise reference image collected using 1000 samples per pixel (SPP). Our method is evaluated using 1 SPP, with sub-pixel sample locations drawn from a Gaussian spatial distribution.

We evaluate our method and compare it against two baselines. The first baseline is RGBx, described before. Our second baseline is supersampling. Supersampling draws a number of samples at each pixel, evaluates the shader to obtain RGB colors for each sample, and takes the mean of the colors. We supersample by choosing a constant sample budget per pixel to achieve approximately the same run time as ours, including the overhead for neural network inference.

Training for 400 epochs typically takes between 6 and 32 hours. However, the Oceanic shader is slower, and takes about 7 days to train. Note that all shaders are trained using the same process over an identical architecture with a similar number of input channels; therefore the great variation in training time derives primarily from the cost of sampling from shader programs, not from the learning model itself.

We report both the pixelwise color error and the perceptual loss for all the experiments in Appendix Table 1. Our method outperforms the two baselines in all cases. In terms of the arithmetic average over all shaders, our method has a relative perceptual error of 67% compared to the RGBx baseline. The supersampling baseline is consistently worse than RGBx, with relative perceptual error ranging from 3x to 21x compared to RGBx. We believe the dramatic improvements in relative perceptual error of our method over the baselines correspond with the qualitatively better reconstruction of high-frequency details that we observe in the renderings, as shown in the zoom regions on the first row of Figure 3. Our supplemental video shows comparisons of several of these shaders rendered with a moving camera.

4.2. Reconstructing Simplified Shaders

We also explore a more challenging task: learning to reconstruct the appearance of a shader from its simplified variant. Shader simplification is commonly used as a lossy optimization that improves runtime while approximating the output of the original program. However, simplified programs often lose texture or geometry detail as compared with the original. For example, the simplified version of Venice shown in Figure 3d lacks details at the far distance compared to its original counterpart in Figure 3c. We therefore propose an application that learns to recover the denoised output of the original shader from the traces of the simplified shader program sampled at 1 SPP. To our knowledge, this paper is the first to propose this learning task.

We use two different techniques to simplify the shader programs: loop perforation (Sidiroglou-Douskos et al., 2011) and genetic programming simplification (Sitthi-Amorn et al., 2011).

Because the model needs to synthesize unseen texture, we use a spatial discriminator for this application, described in Appendix B.4. Training for 800 epochs takes between 10 and 60 hours. Similar to the denoising application, the great variation in training time mostly comes from generating input samples from the shader.
Figure 3. Learning tasks for three different applications: denoising (Mandelbrot, first row), reconstructing a simplified shader (Venice, second row), and a postprocessed defocus blur (Mandel-bulb, third row). Columns: (a) Reference, (b) Our Result, (c) Reference zoom, (d) Input, (e) RGBx, (f) Ours. In all cases ours (b) faithfully recovers the reference (a). Part of the input features to the learning model is visualized in (d), which corresponds to the RGB pixel colors generated by the input shader program sampled at 1 SPP. Zooming into the regions boxed in green (c)-(f) reveals approximation error, where ours compares favorably with the RGBx baseline. We report relative perceptual error over a test set compared with the RGBx baseline (e) and relative speedup compared with the reference solution (c), in the format error/speedup. Our method better recovers both the orientation and the high-frequency detail than the baseline in all applications. Full images for more experiments can be found in the supplemental material.
We report error statistics and more qualitative examples in Appendix C. On average, our method has a relative perceptual error of 67% compared to the RGBx baseline. Note that on the second row in Figure 3e-f, ours reconstructs the missing texture better without introducing additional color artifacts.
4.3. Learning Postprocessing Effects

Our method can be useful for learning not only denoising, but also additional image-space postprocessing filters. We implement two postprocessing filters on the CPU: an edge-aware sharpening filter (Paris et al., 2011) and defocus blur (Rokita, 1993). The network learns simultaneously to denoise and apply the postprocessing filter on the GPU. The third row of Figure 3 shows results on learning the defocus blur filter for Mandel-bulb. As can be seen, the program trace in combination with the network architecture learns to reproduce the complex effect more faithfully, as compared to the RGBx baseline. See Appendix C for more quantitative and qualitative results. The average relative perceptual error for ours is 74% compared to RGBx.
4.4. Learning Simulation Programs

Departing from learning from procedural pixel shader programs, we also explore learning to predict the future for shader programs that perform simulations. This section describes simulation of flocking behavior, while Appendix C.3 presents learning to approximate fluid simulations.

Our shader simulates a flock of "boids" (Reynolds, 1987), which emulate the flocking behavior of birds or fish. Each boid has a 4-vector state representing 2D position and velocity. For a flock of K boids, the simulation program takes as input a K × 4 tensor that represents each individual boid's initial state, then updates the state based on repulsion and alignment forces. The updated state then becomes the input to the next simulation step, and so forth. The interaction between boids forms a complex flocking behavior that is difficult to predict. We run the ground truth simulation using a small step size δ, targeting 20δ per frame at 30 fps. During training we further augment the data by randomly permuting boid indices. The learning task is to correct the simulation output from a larger time step m·δ in order to approximate the boids' states as if the simulation ran m times with step size δ. (We train with m ∈ [20, 64].) A sketch of the kind of per-step update the simulation performs appears below.
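The sketch shows one plausible small-step boids update in NumPy; the force constants and exact repulsion/alignment rules are assumptions, not our shader's precise model. The learned network consumes the trace of one large step m·δ and predicts the state as if m such small steps had run.

```python
import numpy as np

def boids_step(state, delta, radius=1.0, w_rep=0.5, w_align=0.1):
    """One small-step update for K boids. state: [K, 4] rows of (x, y, vx, vy)."""
    pos, vel = state[:, :2], state[:, 2:]
    diff = pos[:, None, :] - pos[None, :, :]            # pairwise offsets [K, K, 2]
    dist = np.linalg.norm(diff, axis=-1)                # pairwise distances [K, K]
    near = ((dist < radius) & (dist > 0))[..., None]    # neighbor mask [K, K, 1]
    repulsion = (diff / np.maximum(dist, 1e-8)[..., None] * near).sum(axis=1)
    n_near = np.maximum(near.sum(axis=1), 1.0)          # neighbor counts [K, 1]
    alignment = (vel[None, :, :] * near).sum(axis=1) / n_near - vel
    new_vel = vel + delta * (w_rep * repulsion + w_align * alignment)
    new_pos = pos + delta * new_vel
    return np.concatenate([new_pos, new_vel], axis=1)
```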
We compare our method with two baselines: a naive baseline that directly takes the larger step simulation without any correction, and an input/output (I/O) baseline that uses the input and the output of the larger step simulation as the input to a neural network. The learning model is a combination of 1D convolution layers with 3 fully connected layers. For details please refer to Appendix B.2. For reported results, we simulate 40 individual boids and log every program trace from the boids program. We choose a larger program trace length than for the pixel shaders because the simulation considers all pairwise interactions between boids, and a larger program trace budget better captures these interactions.

Figure 4. (a) Visualization for flocks of boids. Both the I/O baseline (red) and our method (blue) start from the same initial state and have taken 80 inference steps with step size 20. Ours is more faithful to the ground truth (green). Click on the image to play the boids animation in a browser. (b) We plot average error as a function of step size, where training ranges from step size 20 to 64 (gray). Ours consistently outperforms both I/O and a more naive baseline, in the training range and beyond it.

Figure 4 shows that our method outperforms the baselines both visually (a) and numerically (b), evaluating over a range of step sizes m that extends beyond the training range to show generalization. The supplemental video also shows that ours recovers individual boids' interaction behaviors more faithfully with a step size of m = 20, while the I/O baseline mainly learns the average position and velocity for the entire flock but fails to recover a reasonable distance between the boids.

Although our method is beneficial in all the previously described experiments, we also find a null result for our second simulation example: a 2D fluid simulation. Our method is marginally better numerically (92% relative color error and 96% relative perceptual error) compared to the I/O baseline, and the visual quality of the two methods is almost identical. We hypothesize that this learning task is less suitable for our method because it is relatively simple and lacks complicated hidden state: the neural network can easily approximate solving the Navier-Stokes equations given initial and output states. For details please refer to Appendix C.3.
5. Trace Analysis
This section presents a series of analyses that help to understand how program traces are beneficial for learning. We start by analyzing which trace features contribute the most to a learned model. Based on trace importance, we then investigate which subset of the trace can be used for learning. We empirically find that if one cannot afford to first execute and learn from the full shader trace, then the Uniform subsampling used throughout Section 4 always gives reasonable performance, and we were not able to find any strategy that consistently outperforms Uniform. However, if one is able to train an additional initial network that first uses the full program trace, then we can do better than Uniform, using a strategy that we call Oracle that selects important features.
5.1. Trace Feature Importance

We characterize the importance of the trace features by quantifying the change in training loss when removing each of the trace inputs. Inspired by Molchanov et al. (2019; 2016), we use a first-order Taylor expansion to approximate the importance of each input trace feature. Specifically, for a model trained with loss L and trace length T, the importance score Θ of the input trace feature $z_l$ ($l = 1, \ldots, T$) with image dimension M × N across K examples is:

$$\Theta(z_l) = \frac{1}{K} \sum_{k=1}^{K} \left| \frac{1}{M \cdot N} \sum_{m=1}^{M} \sum_{n=1}^{N} \frac{\partial L}{\partial z^{l}_{m,n}} \cdot z^{l}_{m,n} \right| \qquad (1)$$

We computed the importance score on the denoising model for two shaders: Bricks and Mandelbrot. Only a small fraction of the trace results in a very high importance score. We manually inspected what the top 10% most important trace features represent and verified that the learned importance corresponds to human intuition. For example, in Bricks we found that the most important trace features include those that determine the distance to the nearest brick edges: this helps prevent edges from being broken.
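The following sketch computes this importance score with TensorFlow automatic differentiation; the tensor layout and function names are assumptions:

```python
import tensorflow as tf

def trace_importance(model, loss_fn, traces, targets):
    """Approximate eq. (1). traces: [K, M, N, T] per-pixel program traces for
    K examples; returns a length-T importance score per trace feature."""
    with tf.GradientTape() as tape:
        tape.watch(traces)
        loss = loss_fn(model(traces), targets)
    grads = tape.gradient(loss, traces)                                # dL/dz everywhere
    per_example = tf.abs(tf.reduce_mean(grads * traces, axis=[1, 2]))  # [K, T]
    return tf.reduce_mean(per_example, axis=0)                         # average over examples
```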
5.2. Trace Subsampling Strategies

As discussed in Section 3.2, program traces can be arbitrarily long, and we could input only a subset of the trace for efficient learning and inference, such as the Uniform subsampling used in Section 4. Therefore, a natural question to ask is: given a fixed input trace length budget, what subsets of the program trace are good for learning? The best way to answer this question would be to enumerate all possible subsets of the program trace and train a separate model for each. However, for a shader program that has T trace features before subsampling and a fixed input budget N, this strategy introduces a combinatorial number $\binom{T}{N}$ of learning tasks, which is intractable.

Figure 5. Error vs. time trade-off for Opponent, Uniform, and Oracle subsampling strategies with varying trace length (for four shaders, lower-left). The x-axis shows in log scale each model's relative trace ratio compared to the full program trace length T. Circles show (vertically) relative perceptual error compared to the RGBx baseline (green square). Pentagons and squares show relative inference time increasing with N, arriving at that of the Full Trace model on the right. Arrows in the plot indicate the directions of decreasing error and increasing runtime.

To investigate how different subsets of the trace could affect learning in a practical fashion, we propose subsampling strategies we call Oracle and Opponent. Both the Oracle and Opponent strategies are based on the feature importance score (Section 5.1) from a Full Trace model trained with all of the program trace. Oracle always chooses the trace features that have the highest importance scores, while Opponent always chooses the ones with the lowest scores. In an analogy to the lottery ticket hypothesis in machine learning (Frankle & Carbin, 2018), we hypothesize that the Oracle exploits a winning "lottery ticket" found within the Full Trace model, and selects out the relevant trace subset: a "lottery ticket trace." The Opponent likewise selects losing tickets.

To better understand the trade-offs associated with the subsampled trace length, we experimented with varying trace lengths using Opponent, Uniform, and Oracle subsampling and compared them with the RGBx baseline, as shown in Figure 5. For each shader, the trace is subsampled by a relative sample budget compared to the full program trace length T (e.g. N = T/2, T/4). Under a fixed budget N, in most cases the inference error decreases in the ordering of Opponent, Uniform, Oracle. This corresponds to our intuition, because Oracle selects trace features that are beneficial to training based on prior knowledge from the Full Trace model, and similarly Opponent selects trace features that are unimportant based on the same prior knowledge. Statistically, the null hypotheses that Uniform performs no worse than Oracle, and that Opponent performs no worse than Uniform, are each rejected with p-values smaller than 0.025, so we conclude that the ordering Oracle outperforms Uniform outperforms Opponent is significant. For details please see Appendix D.

It is also worth noting that even when N is small (e.g. the leftmost two data points in the plots correspond to N below 50), the extra information from the program trace can still substantially reduce the relative perceptual error without significant extra cost in inference time. Because the x-axis in the plot is on a log scale, the actual performance gain would have a more drastic slope starting from RGBx to a small N. Additionally, the current comparison is advantageous to RGBx, as its learning capacity matches that of the Full Trace model as discussed in Section 4, which is more capacity than any of the subsampled models in Figure 5.

In practice, subsampling strategies can be chosen based on the resources allowed for training and inference. If there is no limit at all, training a model with the Full Trace can always give the best performance. If N is only limited by inference time, but extra cost and memory can be permitted during training, one could use the Oracle strategy. However, when training also becomes a practical concern, our results suggest that without actually learning from the full trace in advance, there may not be a single subsampling strategy that could consistently outperform all others, as discussed in Section 3.2.
Thus, Uniform subsampling provides an effective proxy that follows the performance of Oracle, and always outperforms the worst-case scenario Opponent.
6. Conclusion
This paper proposes the idea of learning from shader program traces. It demonstrates the efficacy of this idea in a range of learning contexts: denoising, simplified shaders, postprocessing filters and simulation. We describe a compiler that can produce program traces suitable for learning, as well as practical considerations like how to handle large traces and how to process the trace data to make it amenable to learning. Our method is agnostic to the learning architecture, loss function and training process; however, we also discuss a particular set of these that worked well in our experiments. We evaluate our method on a range of shaders, over which it compares favorably with baselines. We also analyze which features are important in the trace, and explain how one can select subsets of the trace for learning.

The experiments described in this paper were performed using computer graphics shaders. Future work could explore how well the ideas introduced herein generalize to other kinds of programs that can rely on (and tolerate) approximate solutions, for example those relying on stochastic algorithms or Markov-like decision processes.
References
Akenine-Möller, T., Haines, E., and Hoffman, N. Real-Time Rendering. CRC Press, 2008.
Burt, P. and Adelson, E. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983.
Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Oliva, A., and Torralba, A. MIT saliency benchmark. http://saliency.mit.edu/, 2019.
Chaitanya, C. R. A., Kaplanyan, A. S., Schied, C., Salvi, M., Lefohn, A., Nowrouzezahrai, D., and Aila, T. Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Trans. Graph., 36(4):98:1–98:12, July 2017. ISSN 0730-0301.
Chen, L., Sultana, S., and Sahita, R. HeNet: A deep learning approach on Intel processor trace for effective exploit detection. CoRR, abs/1801.02318, 2018. URL http://arxiv.org/abs/1801.02318.
Chen, Q., Xu, J., and Koltun, V. Fast image processing with fully-convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
Chen, X., Liu, C., and Song, D. Execution-guided neural program synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gfOiAqYm.
Cornia, M., Baraldi, L., Serra, G., and Cucchiara, R. A deep multi-level network for saliency prediction. In International Conference on Pattern Recognition (ICPR), pp. 3488–3493, 2016.
Dorn, J., Barnes, C., Lawrence, J., and Weimer, W. Towards automatic band-limited procedural shaders. In Computer Graphics Forum, volume 34, pp. 77–87. Wiley Online Library, 2015.
Feiler, P. H. and Humphrey, W. S. Software process development and enactment: Concepts and definitions. In Software Process, 1993. Continuous Software Process Improvement, Second International Conference on the, pp. 28–40. IEEE, 1993.
Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
Ganin, Y., Kulkarni, T., Babuschkin, I., Eslami, S. M. A., and Vinyals, O. Synthesizing programs for images using reinforced adversarial learning. CoRR, abs/1804.01118, 2018. URL http://arxiv.org/abs/1804.01118.
Gharbi, M., Chen, J., Barron, J. T., Hasinoff, S. W., and Durand, F. Deep bilateral learning for real-time image enhancement. ACM Transactions on Graphics (TOG), 36(4):118, 2017.
Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. CoRR, abs/1701.00160, 2017.
Hahn, C., Phan, T., Gabor, T., Belzner, L., and Linnhoff-Popien, C. Emergent escape-based flocking behavior using multi-agent reinforcement learning. CoRR, abs/1905.04077, 2019. URL http://arxiv.org/abs/1905.04077.
He, Y., Foley, T., Tatarchuk, N., and Fatahalian, K. A system for rapid, automatic shader level-of-detail. ACM Transactions on Graphics (TOG), 34(6):1–12, 2015.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 5967–5976. IEEE, 2017.
Jeppu, N. Y., Melham, T., Kroening, D., and O'Leary, J. Learning concise models from long execution traces. In Design Automation Conference (DAC), pp. 1–6, 2020. doi: 10.1109/DAC18072.2020.9218613.
Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Springer, 2016.
Kim, S. W., Zhou, Y., Philion, J., Torralba, A., and Fidler, S. Learning to simulate dynamic environments with GameGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2020.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Kummerer, M., Wallis, T. S., Gatys, L. A., and Bethge, M. Understanding low- and high-level contributions to fixation prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4789–4798, 2017.
Larus, J. R. Efficient program tracing. Computer, 26(5):52–61, 1993.
Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.
Li, T.-M., Gharbi, M., Adams, A., Durand, F., and Ragan-Kelley, J. Differentiable programming for image processing and deep learning in Halide. ACM Transactions on Graphics (TOG), 37(4):139, 2018.
Li, Y., Huang, J.-B., Ahuja, N., and Yang, M.-H. Joint image filtering with deep convolutional networks. arXiv preprint arXiv:1710.04200, 2017.
Lou, L., Nguyen, P., Lawrence, J., and Barnes, C. Image perforation: Automatically accelerating image pipelines by intelligently skipping samples. ACM Transactions on Graphics (TOG), 35(5):153, 2016.
Luo, J.-H., Wu, J., and Lin, W. ThiNet: A filter level pruning method for deep neural network compression. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. NeRF: Representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934, 2020.
Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016. URL http://arxiv.org/abs/1611.06440.
Molchanov, P., Mallya, A., Tyree, S., Frosio, I., and Kautz, J. Importance estimation for neural network pruning. CoRR, abs/1906.10771, 2019. URL http://arxiv.org/abs/1906.10771.
Nalbach, O., Arabadzhiyska, E., Mehta, D., Seidel, H.-P., and Ritschel, T. Deep shading: Convolutional neural networks for screen-space shading. Computer Graphics Forum, 36(4), 2017.
Paris, S., Hasinoff, S. W., and Kautz, J. Local Laplacian filters: Edge-aware image processing with a Laplacian pyramid. In ACM SIGGRAPH 2011 Papers, SIGGRAPH '11, pp. 68:1–68:12, 2011.
Reed, S. and Freitas, N. D. Neural programmer-interpreters. CoRR, abs/1511.06279, 2016.
Reynolds, C. W. Flocks, herds, and schools: A distributed behavioral model. SIGGRAPH Computer Graphics, 21(4):25–34, July 1987. ISSN 0097-8930.
Ritchie, D., Thomas, A., Hanrahan, P., and Goodman, N. Neurally-guided procedural models: Amortized inference for procedural graphics programs using neural networks. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29, pp. 622–630. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/40008b9a5380fcacce3976bf7c08af5b-Paper.pdf.
Rokita, P. Fast generation of depth of field effects in computer graphics. Computers & Graphics, 17(5):593–595, 1993.
Sidiroglou-Douskos, S., Misailovic, S., Hoffmann, H., and Rinard, M. Managing performance vs. accuracy trade-offs with loop perforation. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pp. 124–134. ACM, 2011.
Sitthi-Amorn, P., Modly, N., Weimer, W., and Lawrence, J. Genetic programming for shader simplification. ACM Trans. Graph., 30:152, 12 2011.
Sitzmann, V., Thies, J., Heide, F., Nießner, M., Wetzstein, G., and Zollhöfer, M. DeepVoxels: Learning persistent 3D feature embeddings. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2019.
Thies, J., Zollhöfer, M., and Nießner, M. Deferred neural rendering: Image synthesis using neural textures. ACM Transactions on Graphics (TOG), 2019.
Thies, J., Zollhöfer, M., Theobalt, C., Stamminger, M., and Nießner, M. Image-guided neural object rendering. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyg9anEFPS.
Vogels, T., Rousselle, F., McWilliams, B., Röthlin, G., Harvill, A., Adler, D., Meyer, M., and Novák, J. Denoising with kernel prediction and asymmetric loss functions. ACM Transactions on Graphics (TOG), 37(4):124, 2018.
Wang, R., Yang, X., Yuan, Y., Chen, W., Bala, K., and Bao, H. Automatic shader simplification using surface signal approximation. ACM Transactions on Graphics (TOG), 33(6):1–11, 2014.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., and Catanzaro, B. Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2018a.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018b.
Werhahn, M., Xie, Y., Chu, M., and Thuerey, N. A multi-pass GAN for fluid flow super-resolution. CoRR, abs/1906.01689, 2019. URL http://arxiv.org/abs/1906.01689.
Wu, H., Zheng, S., Zhang, J., and Huang, K. Fast end-to-end trainable guided filter. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1838–1847, 2018.
Xie, Y., Franz, E., Chu, M., and Thuerey, N. tempoGAN: A temporally coherent, volumetric GAN for super-resolution fluid flow. ACM Transactions on Graphics (TOG), 37(4):95, 2018.
Yang, Y. and Barnes, C. Approximate program smoothing using mean-variance statistics, with application to procedural shader bandlimiting. Comput. Graph. Forum, 37(2):443–454, 2018.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
Zwicker, M., Jarosz, W., Lehtinen, J., Moon, B., Ramamoorthi, R., Rousselle, F., Sen, P., Soler, C., and Yoon, S.-E. Recent advances in adaptive sampling and reconstruction for Monte Carlo rendering. In Computer Graphics Forum, volume 34, pp. 667–681. Wiley Online Library, 2015.
A. Classifying Iterative Improvement Loops
Here we present our detailed classification rules for iterative improvement loops. For a loop variable X at iteration n, we denote its value as $X_n$.

Definition 1. A loop variable X is iterative additive if it matches the following pattern or its equivalent forms:

$$X_n = X_{n-1} + Z \qquad (2)$$

Here Z can be any arbitrary variable.

Definition 2. A variable Y is dependent on an iterative additive variable X if it matches the following pattern or its equivalent forms:

$$Y_n = \mathrm{select}(cond, Y_{n-1}, f(X_n, X_{n-1}, C)) \qquad (3)$$

Here, cond is an arbitrary Boolean variable, f is an arbitrary function, and C is a variable computed outside the loop, i.e. C can be viewed as constant inside the loop.

Definition 3. A loop variable X is an output variable if for any iteration n, its value $X_n$ is used outside the loop.

Definition 4. A loop is classified as an iterative improvement loop if all of its output variables are either iterative additive or are dependent on an iterative additive variable.
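For concreteness, the loop below (a sketch in plain Python) satisfies these rules: X matches the iterative additive pattern of Definition 1, Y matches the select pattern of Definition 2, and both are output variables, so the loop is classified as iterative improvement by Definition 4.

```python
def example_loop(zs, C):
    X, Y = 0.0, 0.0
    for Z in zs:
        X = X + Z                          # eq. (2): X is iterative additive
        Y = Y if Y > 0.0 else (X - C)      # eq. (3): select(cond, Y_prev, f(X, C))
    return X, Y                            # both are output variables (Def. 3)
```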
B. Training Details
This section summarizes the training details of our experiments, such as data preprocessing, network architecture and the loss function. For all of our applications we selected a basic architecture, described in Appendix B.2, that makes a good trade-off between visual quality and network run time. Nevertheless, our method could be coupled with any deep learning architecture. This section thus serves as an example of how to select a network architecture and carry out training.
B.1. Preprocessing the Collected Trace
We apply preprocessing to the traces collected by the compiler, rescaling the data to a fixed range.

We find that some intermediate values in shader programs can vary over a large range. For example, a shader program might at training time calculate an intermediate value that is extremely large, ±∞, or not a number (NaN), even when most values of this shader computation remain near zero. This can happen, for instance, near object silhouettes where textures have high frequency in image space. Furthermore, even if a shader program at training time only generates intermediate values in some reasonable range, at inference time it might produce an extreme value such as ∞ or NaN.

Conventional wisdom suggests that deep learning libraries tend to perform best when input data has a mean close to zero and a variance close to one, and such scaling is often part of a "data whitening" preprocessing step. Indeed, we find that without some form of clamping and rescaling, the training tends to diverge entirely. Thus, for each intermediate value in the program trace, we clamp extreme values and rescale others by the following process.

We clamp extreme values by collecting statistics for each intermediate value's distribution at training time. For each intermediate value, we first decide whether its distribution merits clamping. If we detect that the distribution has only a small number of finite, discrete values (10 or fewer), we do not apply clamping to the corresponding intermediate value. For the rest of the intermediate values, we first discard infinite values and then find from their distributions the lowest and the highest p-th percentiles, denoted $P_{lo}$ and $P_{hi}$, and use these to compute clamping thresholds. Next we clamp all values to the range $[P_{lo} - \gamma(P_{hi} - P_{lo}),\; P_{hi} + \gamma(P_{hi} - P_{lo})]$. We also set NaN values to the low end of this range. Empirically, we found in our experiments that p = 5 and γ = 2 work well, and we use these values for all results. Finally, for each intermediate feature, we rescale the clamped values to the fixed range [-1, 1], and record the corresponding scale and bias used. In both training and testing, the collected program traces are used directly by applying the same precomputed scale and bias, but the values are clamped to the range [-2, 2] to allow data extrapolation. A sketch of this process appears below.
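The NumPy sketch below follows the procedure just described; function names and the exact handling of degenerate cases are assumptions.

```python
import numpy as np

def fit_clamp_and_scale(samples, p=5.0, gamma=2.0):
    """Fit clamp range and affine rescale for one trace feature from training samples."""
    finite = samples[np.isfinite(samples)]
    if np.unique(finite).size <= 10:                   # few discrete values: no clamping
        lo, hi = finite.min(), finite.max()
    else:
        p_lo, p_hi = np.percentile(finite, [p, 100.0 - p])
        lo = p_lo - gamma * (p_hi - p_lo)
        hi = p_hi + gamma * (p_hi - p_lo)
    scale = 2.0 / max(hi - lo, 1e-8)                   # map [lo, hi] to [-1, 1]
    bias = -1.0 - lo * scale
    return lo, hi, scale, bias

def apply_clamp_and_scale(x, lo, hi, scale, bias):
    x = np.where(np.isnan(x), lo, x)                   # NaNs go to the low end
    x = np.clip(x, lo, hi)
    return np.clip(x * scale + bias, -2.0, 2.0)        # allow mild extrapolation
```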
B.2. Network Architecture

Our experiments use the dilated convolutional network depicted in Figure 6, similar to that of Chen et al. (2017). This network architecture is used directly in our denoising (Section 4.1) and post-processing (Section 4.3) applications, and also serves as a generator model in other applications and scenarios, each of which relies on a GAN model: a conditional spatial GAN (Isola et al., 2017) for learning from a simplified shader (Section 4.2) and a temporal GAN (Wang et al., 2018a) for learning temporally coherent sequences (Appendix C.1). Details about the GAN models are discussed in Appendix B.4.

Figure 6. Network architecture used in our experiments. The input from the program trace has N channels. The output layer has three channels for color images. The first feature reduction layer has K channels. We use K = 48 in our method. When training the baseline method, K is increased to a larger value to match the total number of trainable weights to be the same as training with the program trace at the maximum length. All other intermediate layers have 48 channels. The input feature maps are first analyzed by four 1x1 convolutional layers, followed by five 3x3 convolutional layers with increasing dilation rates. Finally, four additional 1x1 convolutional layers are applied to output a three-channel image. Note that the first and last convolutional blocks, indicated in lighter blue, each reduce the number of channels (from N to 48, and from 48 to 3, respectively).

The boids simulation (Section 4.4) works with non-imagery data, and we therefore use a combination of 1D convolution and fully connected layers for that learning task. The input to the network has size B × N, where B represents the number of boids (40 in our experiments) and N represents either the length of the program trace in our method or 8 for the I/O baseline. We first reduce the dimensionality of the trace to K using a 1D convolution with kernel size one, followed by 3 additional 1D convolutions with kernel size one and 48 output channels. This is an analogy to the 2D feature reduction layer and 1x1 convolutions described in Figure 6, where K = 48 for our method and K = 1173 for the I/O baseline to match the number of trainable weights in both models. We then flatten the B × 48 tensor as an input to a 3-layer fully connected network, where each layer has 256 hidden neurons, followed by an output fully connected layer with B · 4 neurons, representing the output state for each boid.
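A hedged Keras sketch of the imagery architecture follows; the dilation schedule and activations are assumptions where the text above leaves them unspecified.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_trace_network(n_trace_channels, k_reduce=48):
    x = inp = layers.Input(shape=(None, None, n_trace_channels))
    x = layers.Conv2D(k_reduce, 1, activation='relu')(x)   # feature reduction (K channels)
    for _ in range(3):                                     # remaining 1x1 analysis layers
        x = layers.Conv2D(48, 1, activation='relu')(x)
    for rate in (1, 2, 4, 8, 16):                          # assumed dilation schedule
        x = layers.Conv2D(48, 3, dilation_rate=rate,
                          padding='same', activation='relu')(x)
    for _ in range(3):                                     # trailing 1x1 layers
        x = layers.Conv2D(48, 1, activation='relu')(x)
    return tf.keras.Model(inp, layers.Conv2D(3, 1)(x))     # three-channel RGB output
```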
B.3. Loss Functions

This section describes the loss terms used during our training process. In our basic model, the loss includes one component $L_c$ that encourages pixel-wise color values to be similar to the ground truth, and a second term $L_p$ that encourages perceptual similarity to the ground truth. The overall loss combines these terms:

$$L_b = L_c + \alpha L_p \qquad (4)$$

The parameter α is a weight that balances the color and perceptual loss terms. We fix a single value of α for all of our experiments, chosen to roughly balance the magnitude of the gradients due to $L_c$ and $L_p$ during back-propagation.

The color term $L_c$ is simply a standard pixelwise loss on the RGB image. The other loss term $L_p$ uses the learned image perceptual dissimilarity metric of Zhang et al. (2018). We found this gave slightly better results than other common perceptual losses such as using layers of a pretrained VGG network (Johnson et al., 2016). We used the code of Zhang et al. (2018) in the default configuration with the default network of AlexNet (Krizhevsky et al., 2012).

For the boids simulation in Section 4.4, we learn a four-channel state (2D position and velocity) for each boid, rather than RGB. Therefore we use only a direct regression loss on these coordinates after separately normalizing position and velocity over the training set.

This section describes the basic loss $L_b$ used in training. For details about the GAN losses, see Appendix B.4.
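A minimal sketch of this combined loss, assuming a hypothetical lpips_tf binding for the metric of Zhang et al. (2018); the choice of L2 for the color term and the value of α are assumptions, not the paper's exact settings.

```python
import tensorflow as tf
import lpips_tf  # hypothetical TensorFlow binding for LPIPS (Zhang et al., 2018)

ALPHA = 0.05  # assumed weight; the paper fixes one small alpha for all experiments

def basic_loss(pred, target):
    color = tf.reduce_mean(tf.square(pred - target))            # L_c (L2 assumed)
    perceptual = tf.reduce_mean(lpips_tf.lpips(pred, target))   # L_p
    return color + ALPHA * perceptual                           # eq. (4)
```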
B.4. Detail Regarding GAN Models

Our spatial GAN model is a conditional GAN, where the conditional labels are the RGB channels of the 1 SPP rendering from the shader program, denoted as $c_x$. Because $c_x$ is already part of the program trace, we directly use the model from Figure 6 as our generator, and the generator's output is naturally conditioned on $c_x$. We then train the model to match the ground truth, denoted as $c_y$. Additionally, we use a PatchGAN architecture similar to that of Isola et al. (2017) with receptive field 34 × 34 as our discriminator $D_S$.

Our temporal GAN model uses a similar architecture to the spatial GAN, with modifications following Wang et al. (2018a). The generator is conditioned on imagery from three consecutive frames: the current predicted frame and the two previous ones. This involves five 3-channel images as conditional labels: shader RGB output from all three frames plus the generator's output from the two previous frames. Because neither the shader output nor the generator output from the previous two frames is part of the program trace for the current frame, we modified the generator architecture in Figure 6 to concatenate the additional four conditional label images after the feature reduction layer. The rest of the architecture remains unchanged. We use the same discriminator architecture as for our spatial GAN, but it takes as input sequences of frames and their corresponding conditional labels.

We now introduce the variation on the basic loss function (equation (4)) that incorporates the GAN loss. We use a modified cross-entropy loss (Goodfellow, 2017) for both spatial and temporal GAN models. Our spatial GAN model is conditioned on the RGB channels of the shader program $c_x$ to approximate the distribution of the ground truth $c_y$, while our temporal GAN loss is applied to sequences $\tilde{c}_x$ and $\tilde{c}_y$. The training objective (that we minimize) for the generator, $L_G$, and the loss for the spatial discriminator, $L_{D_S}$, can be expressed as:

$$L_G = L_b - \beta\, \mathbb{E}_{c_x} \log(D_S(G(c_x), c_x))$$
$$L_{D_S} = -\mathbb{E}_{c_x, c_y} \log(D_S(c_y, c_x)) - \mathbb{E}_{c_x} \log(1 - D_S(G(c_x), c_x)) \qquad (5)$$

Similarly, the training objective on temporal sequences for the generator and the temporal discriminator $D_T$ can be expressed as:

$$L_G = L_b - \beta\, \mathbb{E}_{\tilde{c}_x} \log(D_T(G(\tilde{c}_x), \tilde{c}_x))$$
$$L_{D_T} = -\mathbb{E}_{\tilde{c}_x, \tilde{c}_y} \log(D_T(\tilde{c}_y, \tilde{c}_x)) - \mathbb{E}_{\tilde{c}_x} \log(1 - D_T(G(\tilde{c}_x), \tilde{c}_x)) \qquad (6)$$

The parameter β is a weight that balances the GAN loss against the regular color and perceptual loss in equation (4). In all our experiments with GAN loss, we fix a single value of β to roughly balance the magnitude of gradients from all loss terms. Note that in equation (6) we did not include spatial discriminators for simplicity, but it is possible to combine both equation (5) and equation (6). For example, in Appendix C.1, we trained with both discriminators to produce a temporally coherent model for simplified shaders.

We also skip the back-propagation of the GAN loss for any mini-batch with constant color to avoid training instability.
B.5. Generating the Dataset

Our experiments generate the dataset from 800 images for training, 80 images for validation, and 30 images for testing (each 960 × 640). The test set contains 20 similar-distance images with camera pose sampled from the same distribution as the training set, as well as 10 different-distance images that are closer or further than the training set.

We find it beneficial to further divide the training and validation sets into tiles. One advantage is that certain features in the shader may be visually salient to humans, so we can emphasize such features to ensure they are learned well. In principle this could be accomplished with automatic saliency models (e.g., Kummerer et al., 2017; Bylinskii et al., 2019; Cornia et al., 2016). However, off-the-shelf saliency models are trained for natural imagery, whereas our shaders are non-photorealistic; we therefore combine both a saliency model (Cornia et al., 2016) and a traditional Laplacian pyramid representation to robustly and automatically select salient tiles. Another benefit of tiled training is that it reduces memory, and it also accelerates convergence: we can use larger mini-batches with more varied content within the same GPU memory, obtaining a gradient estimator with lower mean squared error.

We sample training and validation tiles as follows. We first generate saliency maps for each of our 800 training images and 80 validation images using Cornia et al. (2016). Saliency models usually incorporate a center bias that tends to give lower saliency scores to pixels closer to image boundaries. This behavior is not ideal for our framework because our training images are generated from randomly sampled camera poses, so salient content could appear anywhere in the image. Therefore, we run the saliency model on images with an extended field of view (each 1280 × 960), whose center 960 × 640 regions are our original training images. This keeps every pixel in the original training dataset away from image boundaries, avoiding center bias in the resulting saliency maps.

We then subdivide each of the training and validation images into six 320 × 320 tiles. For each tile, we estimate its intensity at low, middle, and high frequencies by taking the average over its first, third, and fifth levels of the Laplacian pyramid (Burt & Adelson, 1983). Together with the average saliency score, these four metrics are combined to robustly sample salient and interesting tiles for learning. We use identical sampling rules to draw one quarter of the sampling budget from each of the four metrics. For each metric, we rank the tiles according to their associated score and only sample from tiles whose score is within the top 25% of nonzero scores. The scores of the qualifying tiles are further normalized to [0, 1], and each tile is sampled with probability proportional to its normalized score.

Apart from the rules described above, we find it helpful to also include a small portion of constant-color tiles in the training dataset, e.g., the black background in Bricks (Figure 1). These uninformative, constant-color tiles can be easily selected with a low color-variance threshold. Although some salient tiles already contain both informative and uninformative regions, they are usually close to object silhouettes and could still pose challenges when extrapolating to uninformative regions far away from the object.

We sample a total of 1200 tiles for training and 120 tiles for validation. If the shader does not contain constant-color tiles, all of the sampling budget is used to sample equally from the four saliency metrics described above. Otherwise, 95% of the sampling budget is drawn from saliency, and the remaining 5% from low color-variance tiles. Testing still relies on 30 full images.
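A minimal sketch of this per-metric tile selection rule follows; the metric arrays are stand-in random data, and sampling with replacement is an assumption the paper does not specify.

    # Sketch: per metric, keep the top 25% of nonzero tile scores, normalize
    # to [0, 1], and sample with probability proportional to the score.
    import numpy as np

    def sample_from_metric(scores, budget, rng):
        idx = np.argsort(scores)[::-1]          # rank tiles high to low
        idx = idx[scores[idx] > 0]              # keep nonzero scores only
        top = idx[: max(1, len(idx) // 4)]      # top 25% of nonzero scores
        s = scores[top].astype(float)
        s = (s - s.min()) / (s.max() - s.min() + 1e-8)  # normalize to [0, 1]
        p = s / s.sum() if s.sum() > 0 else np.full(len(top), 1.0 / len(top))
        return rng.choice(top, size=budget, p=p)

    n_tiles = 6 * 800   # six 320x320 tiles per training image
    rng = np.random.default_rng(0)
    # Stand-ins for the four per-tile metrics: average saliency, plus average
    # magnitudes of Laplacian-pyramid levels 1, 3, and 5.
    metrics = [rng.random(n_tiles) for _ in range(4)]
    picked = np.concatenate(
        [sample_from_metric(m, 1200 // 4, rng) for m in metrics])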
C. Additional Evaluation Results
We first report additional qualitative and quantitative results on the applications discussed in the main paper (Section 4). After that, we discuss extra experiments that are not covered in the main paper.
Denoising pixel shaders.
Table 1 reports the test-set error statistics for 7 shaders evaluated in the denoising application (Section 4.1), compared with two baselines: RGBx and supersampling. In every case we tested, ours outperforms the baselines. This includes test datasets whose camera distances to the geometry are both similar to and different from those in the training set. We also show qualitative results in Figure 7.

[Figure 7: rows Bricks, Oceanic, Gear; panels (a) Reference, (b) Our Result, (c) Reference, (d) Input, (e) RGBx, (f) Ours]
Figure 7. Learning to reduce sampling noise in procedural shaders. The reference low-noise solution (a) relies on 1000 samples per pixel (SPP). Our method (b) approximates the reference well at only 1 SPP. Zooming into the region boxed in green (c, f) reveals approximation error, which compares favorably with the RGBx baseline (e). We report relative perceptual error over a test set compared with the RGBx baseline (e) and relative speedup compared with the reference solution (c), in the format error/speedup. Our method recovers both the orientation and the high-frequency detail better than the baselines. We believe the dramatic improvements in relative perceptual error of our method over the baselines correspond with the qualitatively better reconstruction of high-resolution details that we observe in the renderings, as shown in the zoom regions. Bricks and Gear, as well as Mandelbrot in Figure 3, are gamma corrected to emphasize visual differences when viewed on screens of various brightness. Full images for all experiments can be found in the supplemental material.

Table 1. Error statistics for denoising (Section 4.1). We report both perceptual error (Zhang et al., 2018) and mean RGB L2 error (written as perceptual / L2). The numbers reported are averaged across the entire test dataset. For all experiments, we report the absolute error for the RGBx baseline, and for other methods their errors relative to that of the RGBx baseline (% or x).

    Shader         RGBx    Ours     Super
    Bricks                 % / %    21x / 22x
    Gear                   % / %    17x / 28x
    Mandelbrot             % / %    15x / 22x
    Mandel-bulb            % / %    8.5x / 6.1x
    Oceanic                % / %    11x / 9.9x
    Trippy Heart           % / %    3.1x / 2.5x
    Venice                 % / %    9.3x / 7.9x

Reconstructing simplified shaders.
Table 2 reports error statistics for 5 shaders evaluated in the simplified-shader application (Section 4.2), compared with the RGBx baseline. Additional qualitative results can be found in Figure 8.
Learning postprocessing filters.
Table 2 reports error statistics for two types of postprocessing filters: defocus blur on Mandel-bulb (shown in Figure 3), and edge-aware sharpening on both the full and the simplified Trippy Heart program. We show in Figure 10 an additional example of learning the sharpening filter from the partial computation of a simplified Trippy Heart shader program.

[Figure 8: rows Mandelbrot, Mandel-bulb, Trippy Heart; panels (a) Reference, (b) Our Result, (c) Reference, (d) Input, (e) RGBx, (f) Ours]
Figure 8. Learning from simplified shaders Mandelbrot, Mandel-bulb, and Trippy Heart. Errors and speedups are reported as in Figure 7. In Mandelbrot our method better reconstructs regions missing due to oversimplification in the input. In Mandel-bulb our method better recovers the orientation of the texture. In Trippy Heart ours has fewer color artifacts.
C.1. Training Temporally Coherent Sequences
[Figure 9: panels (a) RGBx Baseline, (b) Our result]
Figure 9. Learning temporally coherent sequences for Trippy Heart, with the same ground truth as in Figure 8. We report relative perceptual error compared to the RGBx baseline. Both RGBx (a) and ours (b) show the 90th frame of a synthesized temporally coherent sequence. Note how our method generalizes well to long sequences, whereas the RGBx baseline presents obvious artifacts such as color residue from previous frames near the silhouette of the pink foreground.

[Figure 10: panels (a) Reference, (b) RGBx Baseline, (c) Our Result; Trippy Heart simplified sharpen: 0%/1x, 100%/340x (baseline), 80%/270x]
Figure 10. Learning postprocessing effects. The reference solution (a) shows the result of a postprocessing filter applied to a low-noise shader rendering sampled at 1000 SPP. Both the RGBx baseline (b) and our method (c) approximate the reference at 1 SPP. Our method more faithfully recovers the high-frequency detail and color pattern in Trippy Heart. We report relative perceptual error and speedup as in Figure 7.

Temporal coherence in a graphics or vision context refers to a strong correlation between each frame and the next. Training only on individual images can introduce temporal incoherence in rendered video. One straightforward fix would be to apply a temporal filter to the output sequences to blur out the noise. Alternatively, we implemented a temporal discriminator to directly train temporally coherent sequences, using a training scheme similar to that of Wang et al. (2018a). Each frame in a sequence is synthesized conditioned on the two previous frames. In training, frames are synthesized in groups of six consecutive frames, relying on eight-frame ground-truth sequences to bootstrap the initial frame. We train temporally coherent sequences both for the task of denoising and for learning from simplified programs, and compare with an RGBx baseline as in Sections 4.1 & 4.2. A summary of quantitative error is shown in Table 2. In all cases ours outperforms the RGBx baseline, and produces a more temporally coherent sequence than its non-temporal counterpart (Sections 4.1 & 4.2) while retaining similar visual quality in still images.

We additionally verify that the temporal models generate more temporally stable sequences by computing the perceptual loss between two adjacent frames. For each of the 30 test sequences, we use the last two frames of the length-30 sequence and average the score across ten renders with different random seeds. We then average the score across the test dataset and compare our temporal and non-temporal models. In all cases, the temporal model has a lower error between adjacent frames: the temporal models have 94% perceptual error relative to the non-temporal models on average, and 80% in the best case. Our supplementary video does not present temporally coherent animation as a separate application, but rather shows this training scheme in the denoising and simplification applications. Figure 9 shows an example where our method generalizes better to long sequences than the RGBx baseline. Our result correctly learns both temporal coherence and the complicated structure of each individual frame, whereas the RGBx baseline introduces additional color artifacts in the output. The video shows even longer sequences (180 frames).
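To make the temporal conditioning concrete, here is a minimal sketch of how the four additional conditional label images from Appendix B.4 might be concatenated after the feature-reduction layer. All tensor and function names are illustrative assumptions.

    # Sketch: assemble the temporal generator's extra conditioning. The
    # current frame's shader RGB is already in the program trace, so only
    # four extra images are concatenated.
    import tensorflow as tf

    def temporal_condition(prev_shader_rgb, prev_outputs, reduced_feat):
        # prev_shader_rgb: shader RGB for frames t-2 and t-1 ([B,H,W,3] each).
        # prev_outputs: generator outputs for frames t-2 and t-1.
        # reduced_feat: activations after the feature-reduction layer.
        extra = tf.concat(prev_shader_rgb + prev_outputs, axis=-1)  # 12 chans
        return tf.concat([reduced_feat, extra], axis=-1)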
C.2. Branching and Loop Emulation
As discussed in Section 3.1, our compiler currently handles conditional execution by simply evaluating both branches, and unrolls loops to their maximum possible iteration count. Variable-length loops are handled using a user-given compile-time pragma specifying a ceiling on the possible number of loop iterations; it is common to have such ceilings on iteration counts in shader programs because of the need to maintain consistent shading performance. Values from unused iterations are replaced with the values from the final computed iteration. We made these choices because they are much easier to implement in TensorFlow. However, in a practical application, shaders would typically be compiled to code that takes only one branch, or that exits a loop early based on its termination condition. Therefore, we ran an experiment to determine what the effect of handling branches and loops the traditional way would have been. For branching, we simply wrote dummy values of zero to the trace of the branch not taken. We applied this branch emulation to a shader called Texture Maps, which, similar to aspects of Venice in Figure 1, uses a conditional statement to select a texture based on whether a ray has hit a plane. For loops, we wrote zero values to the trace after the loop termination condition is met, and applied this emulation to Mandelbrot. In both cases we found that the emulation gives results that are visually and quantitatively identical to our compiler's implementation.
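The emulation amounts to masking trace values per pixel. A minimal sketch follows, assuming per-pixel boolean masks broadcastable against the trace tensors; names are hypothetical.

    # Sketch of branch/loop emulation: zero out trace values the program
    # would not have computed under traditional compilation.
    import tensorflow as tf

    def emulate_branch(cond, trace_if, trace_else):
        # cond: [H,W,1] boolean predicate of the conditional.
        # Zero the trace of whichever side was not taken, per pixel.
        trace_if = tf.where(cond, trace_if, tf.zeros_like(trace_if))
        trace_else = tf.where(cond, tf.zeros_like(trace_else), trace_else)
        return trace_if, trace_else

    def emulate_loop(active_masks, traces_per_iter):
        # Zero trace values recorded after the loop termination condition.
        return [tf.where(m, t, tf.zeros_like(t))
                for m, t in zip(active_masks, traces_per_iter)]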
Table 2. Error statistics for four applications: reconstructing from simplified shaders (Section 4.2), learning postprocessing filters (Section 4.3), learning temporally coherent sequences (Appendix C.1), and learning a shared denoising network trained with four shaders (Appendix C.4). Metrics reported are as in Table 1. For the postprocessing application (Post), each filter is trained with a specific shader: blur (Mandel-bulb), sharpen (Trippy Heart), and simplified sharpen (Trippy Heart). The temporal application is trained both on shaders with full computation and on simplified shaders with partial computation (simp). For each experiment trained with temporal coherence, we generate a 30-frame sequence and compute the error with respect to ground truth using the last frame. The reported numbers are averaged across 30 different sequences.

    App          Shader              RGBx                 Ours
    Simplified   Bricks                                   % / %
    Simplified   Mandelbrot                               % / %
    Simplified   Mandel-bulb                              % / %
    Simplified   Trippy Heart                             % / %
    Simplified   Venice                                   % / %
    Post         blur                1.2e-02 / 3.6e-04    % / %
    Post         sharpen             8.8e-02 / 3.8e-03    % / %
    Post         simp sharpen        2.7e-01 / 2.0e-02    % / %
    Temporal     Mandelbrot                               % / %
    Temporal     simp Mandelbrot                          % / %
    Temporal     Mandel-bulb                              % / %
    Temporal     simp Mandel-bulb                         % / %
    Temporal     Trippy Heart                             % / %
    Temporal     simp Trippy Heart                        % / %
    Shared       Mandelbrot                               % / %
    Shared       Mandel-bulb                              % / %
    Shared       Gear                                     % / %
    Shared       Trippy Heart                             % / %

C.3. Fluid Simulation
[Figure 11: panels (a) Reference, (b) I/O Baseline, (c) Our Method; relative perceptual error: 0%, 100%, 96%]
Figure 11. An example of fluid simulation where our method (c) gives a very similar result to the I/O baseline (b). This indicates our method may not be advantageous for simple learning tasks where the baseline already suffices to reconstruct the reference (a). The relative perceptual error compared to the I/O baseline is reported below each image.

Although our method is beneficial in all the previously described experiments, we also find a null result for our second simulation example: a 2D fluid simulation. The state of the simulation on a 2D grid can be viewed as a 7D feature: 3D for the RGB color of the fluid and 4D for the internal state: velocity, density, and vorticity. The simulation takes the 7-channel fluid state as input, solves the Navier-Stokes equations with a hard-coded external force to compute the new internal state, then applies color advection in image space to output the new 7D state. The color advection step size controls the trade-off between how fast the fluid propagates and how accurate the simulation is. We ran the simulation with step size δ as ground truth. The learning task is to run the simulation at a coarser step, and predict the intermediate states in between the 10 fine steps as if they were run at the fine-scale simulation with step size δ.

We use the same architecture as in Section 4.1 for this task and compare our method with an I/O baseline that takes the initial and output fluid states as learning features. While our method is marginally better numerically than the baseline (ours has 92% L2 error and 96% perceptual error relative to the baseline), the visual quality of the two methods is almost identical. We hypothesize that this learning task is not suitable for our method because it is relatively simple and lacks complicated hidden state: the neural network can easily approximate solving the Navier-Stokes equations given the initial and output states. Additionally, because the fluid state changes slowly even after 10 simulation steps, the network can easily hallucinate a reasonable correction using the initial state as a good starting point; the baseline features therefore already suffice. In Figure 11 we show that both the baseline and our method can reasonably approximate the reference with almost identical results.
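For concreteness, here is a minimal sketch of how the coarse/fine training pairs for this task could be constructed, assuming a hypothetical single-step solver step() that advances the 7-channel state; none of these names come from the paper.

    # Sketch: one coarse step is the network input; ten fine steps are the
    # ground-truth intermediate states the network must predict.
    import numpy as np

    def step(state, dt):
        # Stand-in for one solver step (Navier-Stokes + color advection)
        # advancing a [H, W, 7] state by time dt.
        return state + dt * np.zeros_like(state)

    def make_training_pair(state, delta, substeps=10):
        coarse_out = step(state, substeps * delta)   # coarse step (input)
        fine, s = [], state
        for _ in range(substeps):                    # fine steps (targets)
            s = step(s, delta)
            fine.append(s)
        return (state, coarse_out), np.stack(fine)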
C.4. Can Multiple Shaders be Learnt Together?
In this section, we explore whether part of the model can be shared across shaders for the same task. Because program traces are unique to each shader, we propose to train a separate shallow encoder for each shader, followed by a task-specific model shared across shaders. The setup is similar to the source-aware encoder of Vogels et al. (2018).

Four shaders (Mandelbrot, Mandel-bulb, Gear, and Trippy Heart) are trained together for the denoising task. The encoder consists of four 1x1 convolutional layers, where the first layer outputs K channels and the rest output 48 channels. In our method K = 48, while in the RGBx baseline K varies as in Section 4.1. The encoder is identical to the four 1x1 convolutions that analyze the input program trace in Figure 6. The 48-dimensional encoding is then input to a shared denoising network, whose architecture is identical to Figure 6 excluding the four initial 1x1 convolutions. All four shaders use Uniform subsampling to bring their N closest to 200. Training alternates among the 4 shaders after each epoch, and each shader is trained for 400 epochs.

We report the error statistics for the shared model in Table 2. Although one might expect this experiment to benefit the RGBx baseline, since the RGBx features are more similar across shaders, ours in fact outperforms RGBx in all cases.
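A minimal sketch of this shared-model setup follows, assuming Keras layers; the stand-in shared network and all names are illustrative, not the paper's code.

    # Sketch: per-shader encoders of four 1x1 convolutions (K, 48, 48, 48
    # channels) feeding a task network shared across shaders.
    import tensorflow as tf

    SHADERS = ['Mandelbrot', 'Mandel-bulb', 'Gear', 'Trippy Heart']

    def make_encoder(k=48):
        conv = lambda c: tf.keras.layers.Conv2D(c, 1, activation='relu')
        return tf.keras.Sequential([conv(k), conv(48), conv(48), conv(48)])

    encoders = {name: make_encoder() for name in SHADERS}  # K = 48 for ours
    # Stand-in for the Figure 6 network minus its initial 1x1 convolutions.
    shared_net = tf.keras.Sequential([tf.keras.layers.Conv2D(3, 1)])

    def denoise(shader_name, trace):
        # Route the shader's trace through its own encoder, then the
        # shared denoising network.
        return shared_net(encoders[shader_name](trace))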
D. Statistical Evidence for Subsampling Strategies

In this section, we provide statistical evidence for our findings when investigating trade-offs between different subsampling strategies and subsampling budgets, described in Section 5.2. Our first null hypothesis makes the following assumption about the performance of Uniform versus Oracle subsampling: the ratio of relative error between Uniform and Oracle (µ_1) is less than or equal to 1. This hypothesis has a p-value of p_1 = 7.×10^- . Similarly, we propose a second null hypothesis regarding the performance of Opponent versus Uniform subsampling: the ratio of relative error between Opponent and Uniform (µ_2) is less than or equal to 1, which has a p-value of p_2 = 5.×10^- . If we choose a significance level of 0.05 and apply a Bonferroni correction over the 2 hypotheses, we have both p_1 < 0.025 and p_2 < 0.025, indicating significant evidence that Oracle outperforms Uniform (µ_1 > 1) and Uniform outperforms Opponent (µ_2 > 1). These statistics are computed using all possible N available for all 4 shaders: N ∈ {T/2, T/4, T/8, T/16}.
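The paper does not state which test produced these p-values; as an illustration, a one-sided one-sample t-test on the error ratios with a Bonferroni-corrected threshold could be computed as follows. The ratio arrays here are stand-in data, not the paper's measurements.

    # Sketch: test H0 (mean ratio <= 1) against H1 (mean ratio > 1) for the
    # two strategy comparisons, with Bonferroni correction over 2 tests.
    import numpy as np
    from scipy import stats

    # Stand-in ratio samples; in the paper these come from all N in
    # {T/2, T/4, T/8, T/16} across the 4 shaders.
    uniform_over_oracle = np.array([1.4, 1.2, 1.6, 1.3, 1.5, 1.1, 1.7, 1.2])
    opponent_over_uniform = np.array([1.3, 1.1, 1.5, 1.2, 1.4, 1.2, 1.6, 1.1])

    _, p1 = stats.ttest_1samp(uniform_over_oracle, 1.0, alternative='greater')
    _, p2 = stats.ttest_1samp(opponent_over_uniform, 1.0, alternative='greater')

    alpha = 0.05
    print(p1 < alpha / 2 and p2 < alpha / 2)  # significant after Bonferroni?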