Modular Primitives for High-Performance Differentiable Rendering

SAMULI LAINE, NVIDIA
JANNE HELLSTEN, NVIDIA
TERO KARRAS, NVIDIA
YEONGHO SEOL, NVIDIA
JAAKKO LEHTINEN, NVIDIA and Aalto University
TIMO AILA, NVIDIA
We present a modular differentiable renderer design that yields performance superior to previous methods by leveraging existing, highly optimized hardware graphics pipelines. Our design supports all crucial operations in a modern graphics pipeline: rasterizing large numbers of triangles, attribute interpolation, filtered texture lookups, as well as user-programmable shading and geometry processing, all in high resolutions. Our modular primitives allow custom, high-performance graphics pipelines to be built directly within automatic differentiation frameworks such as PyTorch or TensorFlow. As a motivating application, we formulate facial performance capture as an inverse rendering problem and show that it can be solved efficiently using our tools. Our results indicate that this simple and straightforward approach achieves excellent geometric correspondence between rendered results and reference imagery.

CCS Concepts: • Computing methodologies → Rasterization; Shape inference.

Additional Key Words and Phrases: Differentiable rendering, rasterization, motion capture.
ACM Reference Format:
Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. 2020. Modular Primitives for High-Performance Differentiable Rendering. ACM Trans. Graph. 39, 6, Article 194 (December 2020), 14 pages. https://doi.org/10.1145/3414685.3417861

Authors' addresses: Samuli Laine, NVIDIA, [email protected]; Janne Hellsten, NVIDIA, [email protected]; Tero Karras, NVIDIA, [email protected]; Yeongho Seol, NVIDIA, [email protected]; Jaakko Lehtinen, NVIDIA and Aalto University, [email protected]; Timo Aila, NVIDIA, [email protected].

© 2020 Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3414685.3417861.
Differentiable rendering is a fundamental building block in machine learning of 3D geometry. Typically, training data is available only as images, and finding a corresponding 3D representation requires analysis by synthesis, i.e., rendering candidate images, computing a loss between the training and candidate images, and propagating the errors back to 3D positions and other scene attributes. Many classical computer graphics and vision problems, including the estimation of reflectance, geometry, lighting, and camera parameters, can be cast into this inverse rendering framework [Patow and Pueyo 2003].

Much of modern machine learning makes use of first-order (gradient-based) optimization techniques implemented using backpropagation. From a computational point of view, the explosive growth in model sizes and capabilities has, for a large part, relied on the availability of primitive operations that allow massively parallel and coherent execution of both the forward and backward (gradient) passes, allowing highly complex computation graphs to be built out of them. However, the majority of effort in creating efficient primitive operations has focused on densely sampled data stored in multidimensional regular grids. These kinds of operations alone are not sufficient for 3D rendering, because the mapping from a scene representation to pixel values is highly irregular and dynamic. As such, specialized differentiable rendering algorithms have emerged for solving two problems:
(1) Forward pass: Given the factors that affect the shape and appearance of a 3D scene, render a 2D image; and
(2) Backward pass: Given the gradient of a loss function defined on the output image pixels, compute the gradient of the loss with respect to the input shape and appearance factors.

This interface is universally used by autodifferentiation frameworks such as PyTorch and TensorFlow, and it allows 3D rendering to be used as a building block in a complex model that is trained by modern stochastic first-order optimization techniques.

Despite its long history, differentiable rendering can be considered a nascent field due to the recent proliferation of algorithms and applications. Most previous research is targeted towards a specific use case (e.g., pose or shape estimation), and is typically only evaluated on downstream tasks as part of a larger machine learning system [Chen et al. 2019; Kato et al. 2018; Liu et al. 2019]. These specific use cases and data sets allow optimizations and design choices that do not scale to other uses. For example, low geometric complexity may make it acceptable to not parallelize over triangles, but this quickly backfires on a larger scene; single objects viewed in a vacuum may enable one to disregard the effects of mutual occlusion between triangles when computing gradients, but this approximation is untenable in a scene with non-trivial depth complexity.

Another parallel line of research studies differentiable physically-based light transport simulation that models complex effects such as area light sources and indirect illumination [Li et al. 2018; Loubet et al. 2019; Nimier-David et al. 2019]. Built on Monte Carlo sampling, these methods seek the best attainable image quality and accuracy at the cost of longer rendering times.

We build on the rich literature on real-time graphics systems that has long sought efficient solutions for managing the complex, dynamic mapping between world points and image pixels, and has delivered extremely efficient and practical hardware implementations.
In particular, we seek to formulate and implement a differentiable rendering system that makes use of these pipelines to the maximal extent, without sacrificing their desirable properties such as correct outputs, a high degree of user control through programmable shading and geometry processing, massive parallelization in all operations, and the ability to render high-resolution images of scenes consisting of millions of geometric primitives.

Concretely, we describe a differentiable rendering system based on deferred shading [Deering et al. 1988], and identify four primitive operations for which we provide custom, high-performance implementations: rasterization, attribute interpolation, texture filtering, and antialiasing. Our modular primitives enable rendering high-resolution images of complex scenes, using arbitrary user-specified shading, directly inside automatic differentiation (AD) frameworks such as TensorFlow or PyTorch.

As a motivating example, we cast facial performance capture as an inverse rendering problem and show that it can be efficiently solved using direct photometric optimization of shape and surface texture in megapixel resolutions. While our proof-of-concept solution does not aim to reconstruct the mouth, eyes, or complex material appearance, the high accuracy of the results in comparison to a state-of-the-art commercial solution demonstrates the viability of high-performance differentiable rendering in solving this problem.

Our differentiable rasterization primitives are publicly available at https://github.com/NVlabs/nvdiffrast.

There is a large body of work on using rendering as part of an optimization process that infers properties of the world from images. These include inverse rendering algorithms that fit 3D and appearance models to photographs in an analysis-by-synthesis loop [Patow and Pueyo 2003], as well as techniques that use a 3D renderer as part of a more complex machine learning model.

By far the most common approach in previous work is to design a special-purpose differentiable image synthesis pipeline focusing on the particular requirements of the downstream task, with no particular emphasis on flexibility or generality [Chen et al. 2019; Kato et al. 2018; Liu et al. 2019]. A notable exception is OpenDR [Loper and Black 2014] that, despite its limited shading model, explicitly sets out to develop a general-purpose differentiable rendering system. Research on general-purpose differentiable rendering divides into two categories depending on whether the primary motivation is image quality or performance. Table 1 summarizes various characteristics of the previous methods analyzed below.

Table 1. Comparison of characteristics of selected differentiable rendering systems.

|               | OpenDR [2014] | NMR [2018] | SoftRas [2019] | DIB-R [2019] | Li et al. [2018] | Mitsuba 2 [2019] | Our |
|---------------|---------------|------------|----------------|--------------|------------------|------------------|-----|
| Performance*  | ✔ | ✗ | ✗ | ✗ | ✖ | ✖ | ✔ |
| Scalability   | ✔ | ✖ | ✖ | ✖ | ✔ | ✔ | ✔ |
| Flexibility   | ✖ | ✖ | ✖ | ✖ | ✔ | ✔ | ✔ |
| Antialiasing  | ✖ | ✖ | ✗ | ✖ | ✔ | ✔ | ✔ |
| Occlusion     | ✔ | ✔ | ✖ | ✗ | ✔ | ✔ | ✔ |
| Gradients     | ✖ | ✖ | ✔ | ✗ | ✔ | ✗ | ✔ |
| Noise-free    | ✔ | ✔ | ✔ | ✔ | ✖ | ✖ | ✔ |
| No tuning     | ✔ | ✔ | ✖ | ✖ | ✔ | ✗ | ✔ |
| GI            | ✖ | ✖ | ✖ | ✖ | ✔ | ✔ | ✖ |

*Performance considers suitability to intensive optimization such as in deep learning; Scalability refers to performance with respect to surface tessellation and image resolution; Flexibility is whether the system is designed to support arbitrary shading; Antialiasing requires that geometric edges are smoothed in the result image; Occlusion considers whether geometrically obscured surfaces are guaranteed to not affect the resulting image; Gradients refers to the correctness of gradients with respect to the rendered image; Noise-free systems do not rely on random sampling; No tuning refers to the lack of tunable parameters that affect the rendered image or gradients. GI denotes global illumination, i.e., support for physically-based illumination and shadowing including indirect effects. See text for detailed analysis.

Li et al. [2018] introduce differentiable, physically-based rendering using Monte Carlo ray tracing with proper visibility gradients. Light transport is integrated using random sampling, which leads to noise in the images that diminishes with more sampling. Mitsuba 2 [Loubet et al. 2019; Nimier-David et al. 2019] is a versatile rendering framework that can target a wide array of rendering problems. The primary problem in using these renderers as parts of an intensive optimization or learning task is their lack of performance — to not become a bottleneck over millions of iterations, the rendering times in sufficiently high resolutions should be measured in milliseconds, instead of the seconds or minutes needed for recursive light transport simulation.
These systems can also be configured to compute only primary visibility and local shading, which is sufficient for many applications. This boosts the performance of Li et al. [2018] considerably, but it is still orders of magnitude too slow for many applications.

The second category of differentiable rendering aims at higher performance. Primarily targeted at solving tasks such as shape or pose inference [Chen et al. 2019; de La Gorce et al. 2011; Liu et al. 2019; Loper and Black 2014], these methods render and shade 3D meshes using local shading only. Strong emphasis is placed on obtaining useful visibility gradients to facilitate shape inference via gradient descent.

Soft Rasterizer [Liu et al. 2019] rasterizes each triangle as a probabilistic cloud with a configurable blur radius; these clouds are combined heuristically based on other configurable parameters. This blur makes coverage a continuous function of vertex positions, which is necessary for obtaining visibility gradients. However, the blur also means that opaque surfaces become transparent around edges, leading to an incorrect image. Optimization thus requires tuning these parameters to reach a balance between image correctness and gradient quality. DIB-R [Chen et al. 2019] renders colors without antialiasing and outputs an additional alpha channel that extends outside the covered pixels by a configurable blur radius. The alpha channel can be used for approximating visibility gradients, but only if an alpha mask is available for the reference images as well. Also, these gradients are affected by all triangles regardless of occlusion. Because the color channels are point sampled, no visibility gradients are obtained for silhouettes that are in front of other geometry. This is insufficient for, e.g., determining hand poses [de La Gorce et al. 2011], and would also fail if one were to render, e.g., a skybox behind the mesh, so the method cannot be considered general-purpose. As with Soft Rasterizer, it is not obvious how the blur parameter should be set. Neural Mesh Rendering [Kato et al. 2018] produces the image using point sampling and no antialiasing, and in the backward pass hallucinates image-based gradients on triangle edges based on the geometry. The gradients are thus not consistent with the rendered image.
In contrast to these approximations, our aim is to differentiate the standard hardware graphics pipeline without altering its image formation principles. This places special emphasis on occlusion by opaque surfaces. For the gradient to be consistent with the forward imaging model, a 3D primitive that has no effect on the image — for instance, due to being off the screen or occluded by other primitives — should, by our premise, receive a zero gradient.

The reparameterization technique of Loubet et al. [2019] used in Mitsuba 2 [Nimier-David et al. 2019] produces correct visibility gradients only when occluders and occludees can be inferred from four samples. Although more complex occlusion scenarios are not handled correctly, Loubet et al. introduce a bandwidth parameter that can be adjusted to reduce the errors at the cost of increased noise. It is argued that handling the most common, simple cases is sufficient for practical purposes, and our approach to visibility gradients is founded on the same premise. OpenDR [Loper and Black 2014] approximates all gradients based on the final image and knowledge of which triangle was rendered into each pixel. This has the unfortunate effect of producing incorrect gradients for effects such as highlights, because all shading is assumed to be "glued" onto the surfaces. As an example, OpenDR's gradients falsely indicate that moving a planar surface tangentially would move the highlights and reflections on it as well. In addition, inferring gradients from the final pixels can be seen as equivalent to taking finite differences instead of analytic gradients. Flexibility is limited because textures can modulate appearance only multiplicatively, and differentiation with respect to textures is not supported.
Given a 3D scene description in the form of geometric shapes, materials, and camera and lighting models, rendering 2D images boils down to two computational problems: figuring out what is visible in each pixel, and what color it appears to be. A proper differentiable renderer has to provide gradients for all the parameters — e.g., lighting and material parameters, as well as the contents of texture maps — used in the process.

For what follows, it is useful to break the rendering process down into the following form, where the final color $I_i$ of the pixel at screen coordinates $(x_i, y_i)$ is given by

\[ I_i = \operatorname{filter}_{x,y}\Big(\operatorname{shade}\big(M(P(x,y)),\ \mathrm{lights}\big)\Big)(x_i, y_i). \tag{1} \]

Here, $P(x,y)$ denotes the world point visible at (continuous) screen coordinates $(x,y)$ after projection from 3D to 2D, and $M(P)$ denotes all the spatially-varying factors (texture maps, normal vectors, etc.) that live on the surfaces of the scene. The shade function typically models light–surface interactions. The 2D antialiasing filter, crucial for both image quality and differentiability, is applied to the shading results in continuous $(x,y)$, and the final color is obtained by sampling the result at the pixel center $(x_i, y_i)$. In real-time graphics, these steps are typically approximated by techniques like multisample antialiasing (MSAA).

The geometry, projection, and lights can all be considered as parametric functions. The visible world point is affected by the geometry, parameterized by $\theta_G$, as well as the projection, parameterized by $\theta_C$. Similarly, the surface factors are parameterized by $\theta_M$, and the light sources by $\theta_L$. We follow the common view and take differentiable rendering to mean computing the gradients $\partial L(I)/\partial\{\theta_G, \theta_M, \theta_C, \theta_L\}$ of a scalar function $L(I)$ of the rendered image $I$ with respect to the scene parameters. Note that this does not require computing the (very large) Jacobian matrices $[\partial I/\partial\theta_G]$, etc., but rather only the ability to implement multiplication with the Jacobian transpose ("backpropagation"), yielding the final result through the chain rule:

\[ \left[\frac{\partial L(I)}{\partial \theta_G}\right] = \left[\frac{\partial I}{\partial \theta_G}\right]^{T} \left[\frac{\partial L}{\partial I}\right], \]

and similarly for the other parameter vectors.

Two main factors make the design of efficient rendering algorithms challenging. First, the mapping $P(x,y)$ between world points and screen coordinates is dynamic: it is affected by changes in both the scene geometry and the 3D-to-2D projection. Furthermore, it is discontinuous due to occlusion boundaries. These two factors are also central points of difficulty in computing the gradients we seek. The following sections outline our approach to addressing them.

Our overall aim is to implement an efficient differentiable real-time graphics pipeline, with the following specific design goals:
G1 Efficiency. Support modern graphics pipelines' ability to render, in high resolution, 3D scenes that are complex in terms of geometric detail, occlusion, and appearance.

G2 Minimalism. Easy integration with modern automatic differentiation (AD) frameworks, such as PyTorch and TensorFlow, without duplication of features.

G3 Freedom. Support arbitrary user-specified shading, as well as arbitrary parameterizations of input geometry, without committing to specific forms such as the Phong model or blendshapes.

G4 Quality. Support the texture filtering operations required by shaders that implement complex appearance models, while making no assumptions about the contents of the textures. In addition to quality, this is also important for optimization dynamics.
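Goal G2 amounts to each primitive behaving like any other backpropagation-capable operation in the host framework: a forward pass that produces image-space outputs, and a backward pass that multiplies incoming gradients by the Jacobian transpose. The toy PyTorch sketch below illustrates only this interface; it is not the paper's implementation, and the linear "renderer" is a stand-in for the real rasterization pipeline.

```python
# Minimal sketch of the interface a differentiable rendering primitive exposes
# to an AD framework. Here the "renderer" is just I = A @ theta, so that the
# forward pass and the vector-Jacobian product are easy to verify by hand.
import torch

class ToyRenderOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, theta, A):
        ctx.save_for_backward(A)
        return A @ theta                    # forward pass: parameters -> "pixels"

    @staticmethod
    def backward(ctx, dL_dI):
        (A,) = ctx.saved_tensors
        dL_dtheta = A.t() @ dL_dI           # backward pass: [dI/dtheta]^T [dL/dI]
        return dL_dtheta, None              # no gradient for the fixed mapping A

theta = torch.randn(8, requires_grad=True)  # hypothetical scene parameters
A = torch.randn(16, 8)                      # hypothetical fixed scene-to-pixel mapping
I = ToyRenderOp.apply(theta, A)             # "rendered image"
loss = (I - torch.ones(16)).pow(2).mean()   # scalar loss L(I)
loss.backward()                             # populates theta.grad = dL/dtheta
```

Only this vector–Jacobian product is ever needed; the full Jacobian is never formed.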
Goal G1 immediately precludes algorithms that do not parallelize over both the geometric primitives and the pixels, and those that do not properly account for occlusion of overlapping primitives. In the remaining space, we make the following design choices:
C1 Modularity. We identify four modular primitive operations that implement crucial operations in a graphics pipeline. Each primitive is exposed as a backpropagation-capable operation with a fixed input/output interface to the host AD framework. Much like today's configurable and programmable hardware graphics pipelines, this non-monolithic design enables easy construction of potentially complex custom rendering pipelines.
C2 Positions and Textures are Tensors. Our system takes the input geometry and texture maps in the form of tensors from the host AD system. This allows parameterizing both in a freely chosen manner, and enables our rendering primitives to be used as building blocks of a complex learning system. (In the simplest case, $\theta_G$ and $\theta_M$ could describe, say, the vertex coordinates of a triangle mesh of fixed topology and a diffuse albedo stored at the vertices and interpolated into the interiors of triangles; we use the abstract notation to allow for complex parameterizations fed into the renderer from within a deep learning model.)
Fig. 1. A simple differentiable rendering pipeline with our proposed primitive operations highlighted in red. The input data for rendering (blue) may be generated by, e.g., a neural network if the pipeline is part of a larger computation graph. In simpler setups the geometry processing might include only the model/view/perspective transformations for vertex positions, with other inputs being constants or learnable parameters. All intermediate buffers (green) are in image space. Connections with gradients are denoted by a white triangle. Channel counts are fixed only for vertex positions and indices, and in the intermediate buffers produced by the rasterization operation. There are no restrictions on the channel counts for vertex attributes, textures, related intermediate data, or the output image.
C3 Operate in Clip Space. Contrary to common differentiable rendering systems, we leave it as the user's responsibility to perform the world, view, and homogeneous perspective transformations — but not the perspective division — on the geometry using the host AD system. By this, we follow the separation between geometry and pixel processing made by all major graphics APIs. We feel this offers the cleanest possible interface between the host AD system and the renderer, further amplifying the benefits of their co-existence.
C4 Deferred Shading. We build on the concept of deferred shading [Deering et al. 1988]. This entails first computing, for each pixel, the $M(P(x,y))$ terms from Equation (1) and storing the intermediate results in an image-space regular grid. The grid is subsequently consumed by the shading function. As shading is performed on a regular grid, it can be implemented entirely outside our rendering primitives using the efficient dense tensor operations in the host AD library, in line with G2 and G3.

C5 Image-space Antialiasing. We approach differentiation of coverage in image space, approximating the inputs of the antialiasing filter in Equation (1) by the output grid of the deferred shading pass. Effectively, we assume shading to be constant with respect to the coverage effects at silhouette boundaries, but not with respect to other effects in appearance.
C6 Triangles. We focus on triangle meshes as the basic geometric primitive, and seek to utilize the modern graphics pipelines' immensely optimized rasterization subsystem to the maximal extent.

We build pipelines out of the following four primitive operations customized for gradient computation. Figure 1 illustrates an example graphics pipeline built out of them.
Rasterization implements the dynamic mapping between world coordinates and discrete pixel coordinates. Leveraging the hardware rasterizer, we store per-pixel auxiliary data in the form of barycentric coordinates and triangle IDs in the forward pass. Using barycentrics as a base coordinate system allows easy coupling of shading and interpolation, as well as combining texture gradients with geometry gradients in the backward pass.

Interpolation is a pipeline operation that expands user-defined per-vertex data (i.e., vertex attributes) to pixel space. Making use of the barycentrics computed by the rasterizer, the interpolator module manages this mapping in both directions.

Texture filtering is a key operation in a shading system. Taking as inputs the interpolated texture coordinates and their screen-space derivatives for MIP-mapping, as well as the texture data tensors, our texture filtering module performs trilinear MIP-mapping with gradients correctly propagated through both the input texture coordinates and the contents of the (MIP-mapped) texture maps.

Antialiasing is performed on the output of the deferred shading operation, taking as additional inputs the barycentrics, triangle IDs, and vertex positions and indices.

We now proceed to describe each primitive operation in detail. For simplicity, the following discussion assumes that a single image is being rendered. However, a differentiable renderer is typically used in stochastic gradient descent -type schemes using minibatches of multiple rendered images. All our operations efficiently support minibatching.
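As a concrete illustration of how the four primitives chain together, the sketch below assembles the pipeline of Figure 1 in PyTorch. The function names follow the public nvdiffrast API (https://github.com/NVlabs/nvdiffrast), but the exact signatures and tensor layouts should be treated as illustrative rather than normative.

```python
# Sketch of a deferred-shading pipeline built from the four primitives.
import torch
import nvdiffrast.torch as dr

# glctx = dr.RasterizeGLContext()   # created once, outside the training loop

def render(glctx, pos_clip, tri, uv, uv_idx, tex, resolution=(512, 512)):
    # 1) Rasterize: per-pixel (u, v, z/w, triangle id) plus barycentric derivatives.
    rast, rast_db = dr.rasterize(glctx, pos_clip, tri, resolution=resolution)
    # 2) Interpolate texture coordinates and their screen-space derivatives.
    texc, texd = dr.interpolate(uv, rast, uv_idx, rast_db=rast_db, diff_attrs='all')
    # 3) Filtered (mipmapped) texture lookup.
    color = dr.texture(tex, texc, texd, filter_mode='linear-mipmap-linear')
    # 4) User-defined shading would go here as ordinary tensor ops (deferred shading).
    color = color * torch.clamp(rast[..., 3:], 0, 1)   # mask out background pixels
    # 5) Antialias silhouettes so vertex positions receive visibility gradients.
    return dr.antialias(color, rast, pos_clip, tri)
```

Everything between the texture lookup and the antialiasing step is ordinary tensor code in the host framework, which is where arbitrary user-specified shading lives (design choice C4).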
As per widely adopted graphics API standards, our rasterization module consumes triangles with vertex positions given as an array of clip-space homogeneous coordinates $(x_c, y_c, z_c, w_c)$. We leave it as the user's responsibility to compute the clip-space positions — often, this comprises only a few homogeneous 4×4 matrix multiplications performed in the host AD framework. Correspondingly, the backward pass of the rasterizer produces gradients $\partial L/\partial\{x_c, y_c, z_c, w_c\}$ of the loss $L$ with respect to the clip-space positions, leaving differentiation with respect to any higher-level parameterizations to the host AD library.
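For example, a minimal sketch of such a clip-space transform in the host framework, assuming a hypothetical combined model-view-projection matrix `mvp`:

```python
# Sketch: object-space vertices (N, 3) to clip-space positions (N, 4).
# `mvp` is an assumed 4x4 model-view-projection matrix; no perspective divide here.
import torch

def to_clip_space(v, mvp):
    v_hom = torch.cat([v, torch.ones_like(v[:, :1])], dim=1)   # append w = 1
    return v_hom @ mvp.t()                                     # (x_c, y_c, z_c, w_c)
```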
Forward pass. In the forward pass, the rasterizer outputs a 2D sample grid, with each position storing a tuple $(\mathrm{ID}, u, v, z_c/w_c)$, where ID identifies the triangle covering the sample, $(u, v)$ are barycentric coordinates specifying the relative position along the triangle, and $z/w$ corresponds to the depth in normalized device coordinates (NDC).
A special ID is reserved for blank pixels. Barycentrics serve as a convenient base domain for interpolation and texture mapping computations further down the pipeline. The NDC depth is utilized only by the subsequent antialiasing module, and does not propagate gradients. As a secondary output, the rasterizer outputs a buffer with the 2×2 Jacobians $J_{uv} = \partial\{u,v\}/\partial\{x,y\}$ for each pixel. These are later used for determining the footprint of filtered texture lookups.

Internally, the rasterization is performed through OpenGL, leveraging the hardware graphics pipeline. Using the hardware graphics pipeline ensures that the rasterization is accurate and there are, e.g., no visibility leaks due to precision issues. We also automatically get proper view frustum clipping as performed by the hardware. (On compute clusters, we use OpenGL with EGL for displayless, hardware-accelerated rendering.) The output values, including the per-pixel Jacobians between barycentrics and screen coordinates, are calculated using an OpenGL fragment shader.

Both TensorFlow and PyTorch implement GPU tensor operations in CUDA. To bridge them with OpenGL, we use the driver's OpenGL/CUDA interoperability API. The API minimizes data copies, using the same physical memory when possible, and never requires data to leave GPU memory.
Backward pass.
The backward pass receives, for each pixel, the gradient $\partial L/\partial\{u,v\}$ with respect to the barycentrics output by the rasterizer, and computes the gradients $\partial L/\partial\{x_c, y_c, z_c, w_c\}$ for each input vertex. The perspective mapping between barycentrics and clip-space positions is readily differentiated analytically, and the necessary output is obtained through

\[ \left[\frac{\partial L}{\partial \{x_c, y_c, z_c, w_c\}}\right] = \left[\frac{\partial L}{\partial \{u, v\}}\right] \left[\frac{\partial \{u, v\}}{\partial \{x_c, y_c, z_c, w_c\}}\right]. \tag{2} \]

The gradients w.r.t. the screen-space derivatives of the barycentrics ($\partial L/\partial J_{uv}$) are taken into account in a similar fashion. The backward pass is implemented as a dense operation over the output pixels, using a scatter-add operation to accumulate the gradients from the pixels to the correct vertices based on the triangle IDs. It can thus be trivially parallelized using a CUDA kernel.

Attribute interpolation is a standard part of the graphics pipeline. Specifically, it entails computing weighted sums of vertex attributes, with the weights given by the barycentrics, thereby creating a mapping between the pixels and the attributes.

Generally, vertex attributes can be used for arbitrary purposes. One of their typical uses, however, is to provide 2D coordinates for texture mapping. Because of this, our interpolator module supports computing, in the forward pass, the screen-space derivatives $J_A = \partial A/\partial\{x,y\}$ of all or a subset of attributes for later use in determining texture filter footprints and other purposes. (Note that with our rendering primitives, one can supply a different index buffer for attribute interpolation than was used for rasterization. This is convenient when source data comes from a modeler such as Autodesk Maya that associates each vertex with a 3D position and a texture coordinate from separately indexed arrays. If attribute interpolation were bundled with rasterization, such flexibility would not be possible.)
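Both the rasterizer above and the interpolation operation described next funnel per-pixel gradients back onto shared vertices in their backward passes using a scatter-add. A minimal sketch of that accumulation, with illustrative tensor shapes:

```python
# Sketch: accumulate per-pixel gradients onto the vertices of the covering triangles.
import torch

def scatter_vertex_grads(grad_pixel, tri, tri_id, num_vertices):
    """grad_pixel: (H, W, 3, C) per-pixel gradients w.r.t. the three vertices of the
    covering triangle; tri: (T, 3) int64 vertex indices; tri_id: (H, W) int64."""
    idx = tri[tri_id].reshape(-1)                       # (H*W*3,) target vertex indices
    src = grad_pixel.reshape(-1, grad_pixel.shape[-1])  # (H*W*3, C) contributions
    out = torch.zeros(num_vertices, grad_pixel.shape[-1])
    out.index_add_(0, idx, src)                         # atomic per-vertex accumulation
    return out
```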
Forward pass.
Consider a single pixel at $(x, y)$. Denoting the vector of attributes associated with the $i$th vertex by $A_i$, the attribute indices of the triangle visible in the pixel by $i_0, i_1, i_2$, and the barycentrics generated by the rasterizer by $u = u(x,y)$ and $v = v(x,y)$, the interpolated vector $A$ is defined as

\[ A = u\,A_{i_0} + v\,A_{i_1} + (1 - u - v)\,A_{i_2}. \tag{3} \]

Given the rasterizer's outputs (per-pixel triangle IDs and barycentrics), the implementation of the forward pass is trivial.

The screen-space derivatives for attributes tagged as requiring them are computed using the barycentric Jacobians output by the rasterizer via $\partial A/\partial\{x,y\} = [\partial\{u,v\}/\partial\{x,y\}]\,[\partial A/\partial\{u,v\}]$, where the last Jacobian is simple to derive from Equation (3).
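A minimal tensor-form sketch of Equation (3): gather the three attribute vectors of the covering triangle and blend them with the barycentrics. Names, shapes, and the handling of background pixels are illustrative.

```python
# Sketch of Equation (3): barycentric interpolation of vertex attributes per pixel.
import torch

def interpolate_attributes(attr, tri, tri_id, bary):
    """attr: (V, C) vertex attributes; tri: (T, 3) int64 vertex indices;
    tri_id: (H, W) int64 triangle per pixel; bary: (H, W, 2) barycentrics (u, v)."""
    idx = tri[tri_id]                                   # (H, W, 3) attribute indices
    a0, a1, a2 = attr[idx[..., 0]], attr[idx[..., 1]], attr[idx[..., 2]]
    u, v = bary[..., 0:1], bary[..., 1:2]
    return u * a0 + v * a1 + (1.0 - u - v) * a2         # (H, W, C) interpolated values
```

When `attr` requires gradients, reverse-mode differentiation of this gather is precisely the scatter-add accumulation sketched earlier.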
Backward pass. The inputs to the backward pass are the per-pixel gradients $\partial L/\partial A$ w.r.t. the interpolated attributes, as well as the gradients w.r.t. the screen-space derivatives of the attributes. Much like in the backward pass of the rasterizer, the gradients w.r.t. the attribute tensor are computed by a scatter-add into the tensor, applying the Jacobians $\partial A/\partial\{A_{i_0}, A_{i_1}, A_{i_2}\} = \{u,\ v,\ 1-u-v\}$ to the per-pixel input gradients. By simple differentiation, the gradients w.r.t. the input barycentrics are given by

\[ \left[\frac{\partial L}{\partial u}\right] = \big[A_{i_0} - A_{i_2}\big]^{T} \left[\frac{\partial L}{\partial A}\right], \qquad \left[\frac{\partial L}{\partial v}\right] = \big[A_{i_1} - A_{i_2}\big]^{T} \left[\frac{\partial L}{\partial A}\right]. \tag{4} \]

In the same vein, the gradients w.r.t. the screen-space derivatives of the input barycentrics, $\partial L/\partial J_{uv}$, are computed based on the incoming gradients w.r.t. the screen-space derivatives of the attributes, $\partial L/\partial J_A$.

We perform texture mapping using trilinear, MIP-mapped texture fetches. In the general case, this entails picking a fractional MIP-map pyramid level (i.e., level-of-detail, LOD) based on the incoming screen-space derivatives of the attributes used as texture coordinates, and performing a trilinear interpolation from the eight texels on the two appropriate MIP pyramid levels. Figure 2 illustrates our implementation. We choose the MIP level based on the texture-space length of the major axis of the sample footprint, as defined by the screen-space derivatives of the texture coordinates. This is conservative in the sense that grazing angles result in blurring instead of aliasing.

Once a pair of MIP-map levels has been picked, the operation of the forward and backward passes closely resembles attribute interpolation: on each level, the four closest texels take the place of the three triangle vertices, and the two sub-texel coordinates that determine the exact position within the four-texel ensemble take the place of the barycentrics. Consequently, our implementation is highly similar, with the forward pass requiring a gather, and the backward pass a scatter-add, with the related Jacobians computed with equal simplicity from the bilinear basis functions and the texture contents. As the derivations are highly similar, we omit them for space. Note, however, that gradients are computed also for the texture coordinate attributes, as well as for the screen-space derivatives of the texture coordinates.

MIP-mapped texturing differs from attribute interpolation by its multiscale nature: gradients are accumulated on various levels of the MIP-map pyramid in the backward pass. As all MIP-map levels are obtained from the finest-level texture during the construction in the forward pass, the backward pass needs to finish by transposing the construction operation and flattening the gradient pyramid so that the gradient is specified densely at the finest level. Fortunately, this is implemented easily by starting at the coarsest level, recursively upsampling the result and adding in the gradients from the next level, precisely like collapsing a Laplacian pyramid.

Fig. 2. Filtered differentiable texture lookup with a non-constant texture. (a) In the beginning of the forward pass, prefiltered MIP levels $c_{\mathrm{lod}}$ are constructed from the full-resolution texture $c[i,j]$ by repeated downsampling using a 2×2 box filter. (b) In the forward pass, each lookup $g = f(s, t, \mathrm{lod})$ interpolates prefiltered values on the appropriate MIP level as determined by the size of the sample footprint. In the backward pass, we receive incoming gradients $\partial L/\partial g$. Texture coordinate gradients $\partial L/\partial s$ and $\partial L/\partial t$ for each lookup are computed based on these and the contents of the texels that were used in the interpolation. Simultaneously, texture image gradients $\partial L/\partial c_{\mathrm{lod}}[i,j]$ are accumulated into each MIP level. In a trilinear lookup, these calculations are performed on two adjacent levels and weighted according to the fractional part of lod. (c) To produce the outgoing full-resolution texture image gradients $\partial L/\partial c[i,j]$, we sum the accumulated gradients from all MIP levels.
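A sketch of this pyramid handling, assuming 2×2 box-filter downsampling: the forward construction and the matching collapse of the gradient pyramid back to the finest level.

```python
# Sketch of MIP pyramid construction and of flattening the gradient pyramid.
import torch
import torch.nn.functional as F

def build_mip_pyramid(tex, levels):
    """tex: (1, C, H, W) base-level texture; returns [level 0, level 1, ...]."""
    pyramid = [tex]
    for _ in range(levels):
        pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2))   # 2x2 box filter
    return pyramid

def collapse_gradient_pyramid(grads):
    """grads[i]: gradient accumulated on MIP level i; returns gradient w.r.t. level 0."""
    g = grads[-1]                                   # start at the coarsest level
    for finer in reversed(grads[:-1]):
        # Transpose of 2x2 average pooling: replicate-upsample and scale by 1/4.
        g = finer + F.interpolate(g, scale_factor=2, mode='nearest') * 0.25
    return g
```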
To the best of our knowledge, no previous differentiable renderer except for Li et al. [2018] has supported differentiable, filtered texture sampling. While we currently do not do so, we note that it would be possible to utilize the hardware texture unit in the forward pass, and retain the CUDA kernel only for the backward pass. The key challenge is that we cannot be certain of the implementation (e.g., numerical precision) of the hardware texture unit, and thus the gradients might not match the forward pass. This will be especially true for anisotropic texture fetches, where considerable freedom exists in covering the footprint [Schilling et al. 1996]. Hardware texture units also require specific memory layouts.

As usual in real-time graphics, we expect shading to be band-limited via filtered texture lookups and other means, and thus not to exhibit aliasing within surfaces. However, point-sampled visibility causes aliasing at visibility discontinuities and, more crucially, cannot produce visibility-related gradients for the vertex positions. Antialiasing converts the discontinuities into smooth changes, from which the gradients can be computed [Li et al. 2018]. Note that antialiasing can only be performed after shading, and therefore must be implemented as a separate stage instead of being bundled into rasterization.

We follow the same approach as several previous methods [de La Gorce et al. 2011; Loper and Black 2014] and approach the problem by analytic post-process edge antialiasing.
Image-based post-process antialiasing is an old and widely-used technique in real-time graphics, with famous techniques such as FXAA recently being superseded by deep learning algorithms [NVIDIA 2018]. For an overview, see Jimenez et al. [2011]. Our method is a variant of distance-to-edge antialiasing (DEAA) [Malan 2010] and geometric post-process antialiasing (GPAA) [Persson 2011]. The main differences are in how visibility discontinuities are detected and attributed to vertex positions, as required for computing gradients.

Fig. 3. Illustration of our analytic antialiasing method. A vertical silhouette edge $p$–$q$ passes between the centers of horizontally adjacent pixels $A$ and $B$. This is detected by the pixels having a different triangle ID rasterized into them. Pixel pair $A$, $B$ is processed together, and one of the following cases may occur. (a) The edge crosses the segment connecting the pixel centers inside pixel $B$, causing the color of $A$ to blend into $B$. (b) The crossing happens inside pixel $A$, so blending is done in the opposite direction. To approximate the geometric coverage between the surfaces, the blending factor is a linear function of the location of the crossing point — from zero at the midpoint to 50% at a pixel center. This antialiasing method is differentiable because the resulting pixel colors are continuous functions of the positions of $p$ and $q$.
Forward pass. Figure 3 illustrates our antialiasing method. We first detect potential visibility discontinuities by finding all neighboring horizontal and vertical pixel pairs with mismatching triangle IDs. For each potential discontinuity, we fetch the triangle associated with the surface closer to the camera, as determined from the NDC depths computed during rasterization. We then examine the edges of the triangle to see if any of them are silhouettes that pass between the neighboring pixel centers. For horizontal pixel pairs, we consider only vertically oriented edges, i.e., those with $|w_{c,0}\,y_{c,1} - w_{c,1}\,y_{c,0}| > |w_{c,0}\,x_{c,1} - w_{c,1}\,x_{c,0}|$, and vice versa. (We consider an edge to be on a silhouette if it has only one connecting triangle, or if it connects two triangles that lie on the same side of it after projection. This includes silhouettes that have another surface behind them, unlike DIB-R [Chen et al. 2019] that only considers silhouettes against the background.)

If a silhouette edge crosses the segment between the pixel centers, we compute a blend weight by examining where this crossing happens. Pixel colors are then adjusted to reflect the approximated coverage of either surface in the pixels. Essentially, this approach approximates the exact surface coverage per pixel [Jalobeanu et al. 2004] using an axis-aligned slab. Consequently, the coverage estimate is exact only for perfectly vertical and horizontal edges that extend beyond the pixel. For a long diagonal edge that passes exactly between the pixel centers, the error in coverage is a small fraction of a pixel.

Finely tessellated surfaces reveal two further approximations. Theoretically, the silhouette between two pixel centers can take any poly-line shape, and the axis-aligned slab approximation can be arbitrarily poor. However, additional tessellation typically manifests itself on the pixel scale as slightly rounded silhouettes, and for these the approximation accuracy is only slightly worse than for long edges, although exact error bounds cannot be given.

A potentially more serious approximation results from the assumption that all triangles that contain silhouette edges overlap pixel centers and are thus stored during rasterization. Clearly, if we tessellate a surface enough, it becomes rare for a triangle with a silhouette edge to get rasterized. In this situation, some silhouette edges are not found during antialiasing and no visibility gradient is obtained for these pixels. Occasionally missing a gradient can slow down optimization but rarely prevents it from succeeding, as we show in Section 4.
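A minimal sketch of the blend applied to one horizontal pixel pair in Figure 3, assuming the crossing point of the silhouette edge with the segment between the pixel centers has already been computed; $t$ parameterizes that segment ($t = 0$ at the center of $A$, $t = 1$ at the center of $B$). The pixel colors are continuous in $t$, and hence in the positions of the edge endpoints.

```python
# Sketch of the coverage-based blend for one pixel pair (A, B). `t` is assumed to be
# a scalar tensor that depends differentiably on the clip-space edge endpoints.
import torch

def blend_pair(color_a, color_b, t):
    alpha = abs(t - 0.5)                   # 0 at the midpoint, 0.5 at a pixel center
    if t > 0.5:                            # crossing inside pixel B: A leaks into B
        color_b = (1.0 - alpha) * color_b + alpha * color_a
    else:                                  # crossing inside pixel A: B leaks into A
        color_a = (1.0 - alpha) * color_a + alpha * color_b
    return color_a, color_b
```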
Backward pass.
To prepare for the gradient computation, we store the results of the discontinuity analysis in the forward pass so that we do not have to repeat it in the backward pass. Gradient computation with the stored data is then easy — for each pixel pair that was antialiased in the forward pass, we determine how both vertex positions influence the blending coefficient, and transfer the incoming pixel gradients to the vertex positions accordingly using scatter-add operations.
Our design aims to do as little as possible apart from managing the complex and dynamic mappings between the pixels and the input vertices and textures, leaving the field open for utilizing novel parameterizations for geometry, textures, and lighting models. In particular, the modular design allows the units to be chained many times in a single rendering pipeline. The deferred shading design also leaves many options open. For example, it is easy to perform texture-space shading instead of shading in the pixel domain: after computing the shading results using array operations in texture space, one would simply look up the surface colors from the texture in the deferred shading pass.
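For instance, a deferred Lambertian shading step written purely with dense tensor operations of the host framework might look as follows; the buffer names and the light model are illustrative, not part of our primitives.

```python
# Sketch of deferred shading: the rasterizer and interpolator have already produced
# image-space buffers, so shading is ordinary, fully differentiable tensor math.
import torch
import torch.nn.functional as F

def lambert_shade(albedo, normals, light_dir, light_color):
    """albedo, normals: (B, H, W, 3) image-space buffers; light_dir, light_color: (3,)."""
    n = F.normalize(normals, dim=-1)
    ndotl = torch.clamp((n * light_dir).sum(-1, keepdim=True), min=0.0)
    return albedo * light_color * ndotl          # (B, H, W, 3) shaded color
```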
In this section, we validate our design principles via targeted, synthetic tests. We first examine the properties of the visibility gradients resulting from our antialiasing, as well as the effects of filtered texture lookups on texture convergence. Then, we construct a rendering pipeline with nontrivial shading and use it to demonstrate an optimization task that involves indirect texture lookups. We also examine a pose fitting task with a difficult optimization landscape. Finally, we measure the performance of our system and compare it to previous differentiable rasterizers.
To examine the validity of the gradients, we perform a synthetic test where we attempt to infer the vertex positions and colors of a simple unit cube. We initialize the solution by taking the true vertex positions and perturbing them randomly. The vertex colors are initialized to random RGB values. We then run the Adam optimizer [Kingma and Ba 2015] for 5000 iterations, where in each iteration we render the reference mesh and the optimized mesh from the same, random viewpoint, and take the image-space $L_2$ loss between the images. Based on this loss, we learn both vertex positions and colors simultaneously. The learning rate was ramped down exponentially over the course of the optimization.

We implemented two modes for coloring the vertices. In the continuous coloring mode, the vertex colors at each corner are shared. This yields a coloring that is continuous across the surface of the cube and has 8 unique colors to optimize. In the discontinuous coloring mode, each face has four unique colors at its corners, i.e., a total of 24 unique colors. Consequently, the coloring is not continuous across the edges or vertices of the cube. It can be expected that the latter mode is more difficult to optimize because of the larger number of unknowns and the presumably less smooth gradients due to color discontinuities.

Fig. 4. To validate that our visibility gradients provide useful information even for small triangles, we infer vertex positions and colors of a simple mesh in extremely small resolutions. The geometry of the current solution is superimposed on the rasterized images for illustration purposes only. The rightmost column shows the final, optimized mesh rendered in high resolution.

Figure 4 illustrates the results in the continuous coloring mode. To our surprise, the optimization succeeds even in 4×4 resolution.
Fig. 5. Convergence of the cube shape and color optimization test (average of 10 successful optimizations). The vertical axis shows the average distance between the vertices and their true positions in the unit cube. The solid curves indicate convergence in the continuous coloring mode (Figure 4), and the dashed curves correspond to the discontinuous coloring mode. As expected, the latter is somewhat more difficult to optimize. Note the logarithmic horizontal axis.

Figure 5 shows the convergence of the optimization. A small number of runs failed when the mesh folded over itself, which is consistent with the observation from Mitsuba 2 [Nimier-David et al. 2019] that some rendering-related components require careful initialization. We could have made these configurations less likely by lowering the learning rate, initializing the mesh further away from self-intersecting states, or by using a suitable regularization term that pushes apart geometry that is in danger of folding over itself. We chose not to use any regularizers in this test because we explicitly wanted our gradients to be based on the image-space loss only.
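A sketch of the kind of optimization loop used in this test, with illustrative hyperparameters and a hypothetical `render` function standing in for a differentiable pipeline such as the one sketched earlier:

```python
# Sketch of the fitting loop: Adam on an image-space L2 loss with an exponential
# learning-rate ramp. Perturbation amounts and learning rates are illustrative.
import torch

def fit(render, views, ref_pos, ref_col, iters=5000, lr_start=1e-2, lr_stop=1e-3):
    pos = (ref_pos + 0.5 * (torch.rand_like(ref_pos) * 2 - 1)).detach().requires_grad_(True)
    col = torch.rand_like(ref_col).detach().requires_grad_(True)
    opt = torch.optim.Adam([pos, col], lr=lr_start)
    for it in range(iters):
        for g in opt.param_groups:                              # exponential LR ramp
            g['lr'] = lr_start * (lr_stop / lr_start) ** (it / max(iters - 1, 1))
        mvp = views[torch.randint(len(views), (1,)).item()]     # random viewpoint
        loss = (render(pos, col, mvp) - render(ref_pos, ref_col, mvp)).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return pos, col
```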
To measure the importance of texture filtering via mipmaps, we constructed a test where we attempt to learn a texture based on synthetic, high-quality reference images that exhibit large variations in scale. We then measure how well the texture is learned with and without mipmapping.

Figure 6a shows example reference images for this task. The reference images are first rendered in 4096×4096 resolution and then downsampled to 512×512 using a high-quality downsampling filter. The reference images are thus well band-limited and display no aliasing, blurring, or other artifacts.

The goal of the optimization is to learn a cube-map-parameterized texture with 512×512 pixel faces, mapped onto a unit sphere, based on the reference images. We again use Adam [Kingma and Ba 2015] as the optimizer and run it for 20 000 iterations, ramping the learning rate down during the course of the optimization. This learning rate schedule was chosen to be optimal for the case without mipmapping. The training images are rendered directly in 512×512 resolution, from the same, random viewpoints as the reference images, and the optimizer attempts to minimize the $L_2$ loss between the training and reference images. The same mesh and texture parameterization are used for all images.

Learning the texture is made difficult by the sphere being placed at a random distance from the camera. Hence some reference images view a close-up patch of the surface, whereas most are too distant to infer texel-level details. This replicates the effects of a highly variable pixel-to-texel ratio in reference imagery, which we expect to be present in many kinds of real-world data such as street view images or sets of in-the-wild photographs.

Figures 6b,c illustrate that with mipmapped texture filtering, the learned texture converges to a solution much closer to the reference (32.9 dB vs 25.6 dB). The convergence failure of the non-mipmapped version can be explained by a simple thought experiment. When the reference image has a faraway pixel with a large texture footprint, its value is determined by a weighted mean of the reference texture over that footprint. Without mipmapping, we will sample whichever full-resolution texel quad lands under that pixel center. If this value deviates from the large-area average in the reference image, the gradients will pull the texels in the learned texture towards this average. Over many such updates, this pull towards the mean leads to attenuated high-contrast details, which can be seen in Figure 6b where even the converged non-mipmapped solution has less visible contrast than the solution obtained via proper texture filtering.

To demonstrate the flexibility of our modules, we construct a rendering pipeline that computes reflections via environment mapping [Greene 1986] and adds a highlight from an additional light source using a Phong BRDF [Phong 1975]. Figure 7 illustrates the use of this rendering pipeline for solving the environment map contents and Phong BRDF parameters based on the reflections from an irregular object with known geometry and pose.

At the beginning of the optimization, the BRDF parameters are initialized to random values, whereas the environment map is initialized to uniform gray. In each iteration, the camera angle and light direction are randomized. Optimization is done using Adam with a fixed learning rate and a simple image-space $L_2$ loss. In this synthetic test, the unknown environment texture and BRDF parameters rapidly converge to the reference solutions.

The rendering pipeline is constructed as follows. We start by rasterizing the geometry as usual, obtaining a frame buffer with per-triangle barycentrics and their screen-space derivatives. We also calculate a normalized reflection vector for each vertex. These reflection vectors are then used as attributes for interpolation, which yields per-pixel reflection vectors and screen-space derivatives for each of their components. We represent the environment map as a cube map, so for each per-pixel reflection vector we determine the corresponding cube map face and the 2D texture coordinates within it. The same calculation also yields the screen-space derivatives of the texture coordinates, and we perform a trilinear texture fetch from the environment map. This is important because reflections from curved surfaces introduce highly variable distortions and texture footprint sizes.
The cosine between the reflection vector and the light direction vector required by the Phong BRDF model is computed based on the per-pixel reflection vectors.

The shading computation involves 18 lines of Python code using standard TensorFlow operations. In our opinion, this is a small price to pay for the complete freedom to tailor the shading, data representations, etc., to the needs of the application, compared to incorporating a fixed set of shading models into the differentiable rendering system itself. It would not be possible to implement a similar setup in previous rasterization-based differentiable renderers without modifying their internals.
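A condensed sketch of the shading math described above, written here with PyTorch tensor operations rather than the TensorFlow code used in the paper; the environment-map lookup is abstracted into a callable, and all names are illustrative.

```python
# Sketch: mirror reflection from an environment map plus a Phong highlight.
import torch
import torch.nn.functional as F

def phong_env_shade(refl, env_lookup, light_dir, phong_rgb, phong_exp):
    """refl: (B, H, W, 3) per-pixel reflection vectors from the interpolator;
    env_lookup: callable mapping reflection vectors to environment-map colors."""
    r = F.normalize(refl, dim=-1)
    base = env_lookup(r)                                           # mirror reflection
    cos = torch.clamp((r * light_dir).sum(-1, keepdim=True), min=0.0)
    return base + phong_rgb * cos ** phong_exp                     # add Phong highlight
```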
Fig. 6. Filtered texture sampling via mipmaps helps considerably when learning a texture in a difficult geometric setup. See text for a full description. (a) Example synthetic reference images in 512×512 resolution. (b) Closeups of the learned texture. With mipmapping enabled, sampling is prefiltered and gradients are routed to the correct detail levels. Without mipmaps, faraway views yield badly filtered samples and spurious, noisy texture updates that do not converge to the correct solution. (c) Convergence of the learned texture compared to the reference texture. The learning rate schedule was optimized for the "no mipmaps" case, and the same schedule was used with mipmaps.
Fig. 7. Optimizing the environment map texture and Phong BRDF parameters in a synthetic test case. Top: Example renderings at various iteration counts. The texture converges slightly unevenly due to the distribution of indirect texture lookups, as seen at 100 iterations. Bottom: Convergence of the learned parameters (texture color RMSE, Phong color RMSE, and relative error of the Phong exponent) over the course of the optimization.
Limitations.
Local shading models, such as the one demonstrated here, cannot accurately model global phenomena such as interreflections. If such fidelity is required, a path-tracing-based differentiable renderer will be necessary [Li et al. 2018; Loubet et al. 2019; Nimier-David et al. 2019]. However, there is no inherent limit on the complexity of the local shading model, so e.g. microfacet [Cook and Torrance 1982] or Gaussian mixture model [Herholz et al. 2016] BRDFs could be used to seek a better fit to data. Ultimately, it depends on the intended use how physically accurate the shading should be — even a crude approximation of appearance may be sufficient for inferring other unknowns such as pose or geometry. In Section 5, we demonstrate that in the context of facial performance capture, the per-frame geometry of skin areas can be accurately recovered without any shading at all. In situations like this, striving for physical fidelity would unnecessarily slow down the optimization.
Fig. 8. Two example cases from the cube pose optimization test. With trivial noise-based regularization, we obtain an average error of 48.62°, which is an improvement over the 63.57° of Liu et al. [2019], indicating that the blur and transparency offered by SoftRas are not necessary in this task. The average error is dominated by local minima where the pose looks correct but the colors are wrong (see the example on the right — the yellow face should be on top instead of white). Customizing the optimization method to suit the task better lowers the average error to 2.61°. Some cases are impossible due to only one face of the cube being visible in the reference image, but they are rare enough to not contribute significantly to the averages.

In the SoftRas paper, Liu et al. [2019] investigate the problem of resolving the pose of a rendered cube using gradient-based optimization of an image-space loss. The task is made difficult by an optimization landscape with many local minima (Figure 8). The image synthesis model of SoftRas allows turning all surfaces partially transparent and blurring them by an arbitrary amount. This results in a smoother loss function and a modest improvement in the resolved poses.

To demonstrate that the nonstandard image synthesis model of SoftRas is not necessary for solving this task, we focus on the optimization process instead of manipulating the rendering model. A simple and efficient way to discourage local minima in a stochastic fashion is to add noise to the unknown parameters during the optimization. (We represent the pose as a quaternion. Noise is applied by constructing a random quaternion and mixing it with the pose using spherical interpolation [Shoemake 1985].) Indeed, running the optimization for 10k iterations using Adam and ramping the noise strength down over the course of the optimization yields an average pose error of 48.62° measured over 100 random trials. This is an improvement over the best result of 63.57° reported by Liu et al. [2019], indicating that noise-based regularization is at least as effective as their approach based on transparency and blur.
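A sketch of the quaternion noise injection described above: mix the current pose with a random unit quaternion by spherical interpolation, with the interpolation amount acting as the noise strength that is ramped down during optimization. The helper names are illustrative.

```python
# Sketch: stochastic pose perturbation via quaternion slerp [Shoemake 1985].
import torch

def random_unit_quaternion():
    q = torch.randn(4)
    return q / q.norm()

def slerp(q0, q1, t):
    d = torch.clamp(torch.dot(q0, q1), -1.0, 1.0)
    if d < 0.0:                     # take the shorter arc
        q1, d = -q1, -d
    theta = torch.acos(d)
    if theta < 1e-6:
        return q0
    s0 = torch.sin((1.0 - t) * theta) / torch.sin(theta)
    s1 = torch.sin(t * theta) / torch.sin(theta)
    return s0 * q0 + s1 * q1

def add_pose_noise(pose_q, strength):
    noisy = slerp(pose_q, random_unit_quaternion(), strength)
    return noisy / noisy.norm()
```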
However, we note that gradient descent from a random initial state is an ineffective way to solve this problem. Splitting the optimization into two phases — first greedily seeking a good initial pose by applying ramped-down noise in a gradient-free fashion, and then continuing with Adam from the pose with the smallest image-space loss — lowers the average error to 22.49°. As a further task-specific optimization, we can take the symmetries of the cube into account and customize the noise to incorporate random symmetry-preserving rotations. This effectively bridges the local minima with similar pose but different color combinations and lowers the average error to 2.61°.

To assess the performance of our method, we selected 14 meshes of varying triangle counts from the ShapeNet database [Chang et al. 2015]. We rendered these meshes using both our method and two comparison methods in multiple resolutions. As comparison methods we used the official implementation of Soft Rasterizer [Liu et al. 2019] and PyTorch3D [Ravi et al. 2020], a more recent differentiable rasterization library for PyTorch. The test was set up to include both forward and gradient evaluations, reflecting the total cost of including a rendering operation in an optimization task. Default γ, σ parameter values were used for Soft Rasterizer. PyTorch3D was set up to use a one-pixel blur radius, one face per pixel, and soft compositing (SoftGouraudShader). The default bin size heuristic was enabled. We originally intended to include interior scenes in our tests, but Soft Rasterizer could not render "in-scene" viewpoints due to a lack of clipping, making triangles behind the camera render erroneously in front of the camera. Therefore, we limited our test to individual objects rendered in front of the camera. All tests were run on a single NVIDIA TITAN V GPU with 12 GB of memory.

Results of the test are summarized in Table 2. We can see that our method is much less sensitive to triangle and pixel counts than the comparison methods. Soft Rasterizer slows down quickly because it tests each triangle for every pixel, and consequently loses to our method by several orders of magnitude with nontrivial triangle counts and resolutions. PyTorch3D fares better than Soft Rasterizer thanks to its coarse-to-fine rasterization architecture. Still, the performance difference is more than an order of magnitude in our favor and grows with high resolutions and triangle counts, highlighting the better scalability of our method.
Occluded vs visible geometry.
It is generally desirable that rendering performance is not affected by the amount of geometry that is not visible in the rendered image. Our method employs deferred shading, and is therefore mostly oblivious to occluded or out-of-view geometry except at the rasterization step. The same holds for PyTorch3D when storing just one face per pixel, but its rasterization step is expensive, so it is not obvious how hidden geometry affects the overall performance.

To quantify the effects of depth complexity, we constructed pairs of synthetic scenes where the number of triangles and covered pixels are held constant but the depth complexity is varied.
Fig. 9. The effect of geometric configuration on forward + gradient passexecution times in 1024 × × we repeat a simple base mesh either so that all copies are visible andtogether cover the image, or so that only one is visible and coversthe image while all other copies are hidden behind it. Figure 9 showsthe measured performance as function of geometric complexity andgeometric setup. Resolution is fixed to 1024 × ∼
50 ms, scales strongly with the total area of geometry,occluded or not. This is due to its software rasterizer resolving visi-bility so late that practically no work is saved if the tested fragmentis found to be occluded. The performance of our method is mostlyunaffected by the depth complexity, as it uses the hardware raster-izer with efficient hierarchical depth tests. SoftRas scales linearlywith geometric complexity regardless of the geometric setup, anddoes not compare favorably to the other two methods.
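The two scene configurations can be built from any base mesh by replicating its vertices with different offsets; a minimal sketch follows, with the base mesh, copy spacing, and camera setup left as placeholders.

import torch

def replicate_mesh(verts, faces, copies, stacked=True):
    # Replicate a base mesh into 'copies' instances with equal total triangle count.
    #   stacked=True  -> all copies share the same (x, y) and are pushed further
    #                    away in z, so only the nearest copy is visible.
    #   stacked=False -> copies laid out side by side in x ("adjacent").
    #   verts: [V, 3] float32, faces: [F, 3] int32
    all_v, all_f = [], []
    for i in range(copies):
        offset = torch.tensor([0.0, 0.0, float(i)] if stacked
                              else [float(i) * 2.0, 0.0, 0.0])
        all_v.append(verts + offset)
        all_f.append(faces + i * verts.shape[0])   # reindex into this copy's vertices
    return torch.cat(all_v, dim=0), torch.cat(all_f, dim=0)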
To illustrate the performance and utility of the design of our differentiable rendering pipeline, we examine how to use it to solve markerless facial performance capture, i.e., inferring time-varying facial geometry based on multiple camera streams. This is a non-trivial, classical computer graphics problem that has been approached in several ways in the past. Many methods utilize morphable 3D models [Blanz and Vetter 1999] that enable approximating the facial geometry even from monocular data. The downside of this class of methods is that the obtained geometry is approximate and cannot fully reproduce intricate motion. High-quality markerless capture often requires complex capture setups involving structured light or special cameras [Alexander et al. 2009; Bradley et al. 2010]. The passive capture method of Beeler et al. [2011] first reconstructs each frame using a single-shot method [Beeler et al. 2010] and then builds frame-to-frame correspondences iteratively. While these methods yield great results, they are fairly complex and consequently difficult to implement.
Table 2. Performance comparison between our method, the official implementation of Soft Rasterizer (SoftRas) [Liu et al. 2019], and PyTorch3D [Ravi et al. 2020].

Triangles:           284     352     514    1236    4474    5216    5344   10908   21695   25643   43448   91145  196179  308170

Rendering + gradients time* (ms)
Our method
  256 × 256         2.13    2.03    1.95    1.89    1.95    2.02    2.02    2.08    2.02    2.05    2.00    2.18    2.97    3.12
  512 × 512         2.26    2.29    2.07    2.08    2.23    2.12    2.08    2.11    2.14    2.20    2.27    2.44    2.66    3.33
SoftRas
  256 × 256         7.48    6.76    7.84    7.73   18.11   14.29   10.91   29.21   30.93   49.06   65.21  144.57  331.30  788.92
  512 × 512        10.01   10.42    8.78   10.68   24.05   24.33   24.40   50.09   82.48   99.76  163.51  375.39  865.63 1430.17
PyTorch3D
  256 × 256        27.12   27.19   26.81   27.95   27.67   27.14   26.94   28.62   27.93   30.08   32.03   37.52   54.82  115.87
  512 × 512        31.83   30.93   31.08   31.19   31.70   31.82   30.83   32.06   34.84   38.41   41.50   55.53   95.21  158.71

Speedup factor
Our vs SoftRas
  256 × 256         3.51    3.33    4.02    4.09    9.29    7.07    5.40   14.04   15.31   23.93   32.60   66.32  111.55  252.86
  512 × 512         4.43    4.55    4.24    5.13   10.78   11.48   11.73   23.74   38.54   45.35   72.03  153.85  325.42  429.48
Our vs PyTorch3D
  256 × 256        12.73   13.39   13.75   14.79   14.19   13.44   13.34   13.76   13.83   14.67   16.02   17.21   18.46   37.14
  512 × 512        14.08   13.51   15.01   15.00   14.22   15.01   14.82   15.19   16.28   17.46   18.28   22.76   35.79   47.66
* Execution times include both forward and gradient evaluations for rendering one frame. Each mesh was rendered several times from multiple angles and the results were averaged to reduce random variation. The exact same meshes, viewpoints, and camera parameters were used for all methods. For our method, we perform rasterization, attribute interpolation, and antialiasing, but no texturing. For Soft Rasterizer, rasterization with default lighting is computed. PyTorch3D was set up to perform Gouraud shading, i.e., attribute interpolation.

As a result, the state of the art in many cases is to use commercial capture systems such as DI4D PRO [2020] or commercial software such as Agisoft Metashape [2020] and R3DS Wrap [2020]. Our goal is not to attempt to surpass this state of the art, but to illustrate how far we can get with a near-trivial formulation as an inverse rendering problem. In particular, we will not attempt to reconstruct tricky regions such as the mouth, eyes, or hair, but instead focus on skin areas only.

Our test material consists of three performances captured in a DI4D PRO rig at 29.97 frames per second. There are 9 synchronized camera feeds, 3 in color and 6 in monochrome; for simplicity, we convert the color reference images to monochrome as well prior to processing. Resolution of the reference images is 3008 × … pixels.

Our goal is to find a global texture and a per-frame mesh so that, when rendered from the known camera positions, the textured meshes match the reference footage as closely as possible, measured using an image-space L2 loss. We learn the geometry as a per-frame deformation of a fixed-topology base mesh that has 16 521 vertices and 16 472 original faces that were triangulated into 32 916 triangles for rendering. The base mesh has texture coordinates referencing a 5:1 aspect ratio texture atlas, and our learned texture is a single-channel 10240 × 2048 texel image. We high-pass filter both the rendered and the reference images before comparing them in order to suppress low-frequency illumination and shading effects that occur due to changes in head orientation.
In principle, the vertex positions for every frame could be encoded in one large matrix to be optimized. However, we decompose the vertex positions into a matrix product in order to make the optimization landscape more tractable. One could see our model as a three-layer dense neural network that maps a one-hot animation frame index vector into a vector of per-vertex deltas from the base mesh. Even though using a chain of matrices without nonlinearities does not increase the expressiveness of the representation, it accelerates optimization similarly to how overparameterization accelerates deep learning [Arora et al. 2018].

Let us denote the vertex count by $n$ and the number of reference frames by $m$. Denoting the $3n$-element column vector encoding the base mesh vertex positions as $V_\mathrm{base} = [x_1, y_1, z_1, x_2, \ldots]^T$, the vertex positions for frame $i$ are computed as $V_i = V_\mathrm{base} + M_3 M_2 M_1 w_i$, where $w_i$ is a one-hot column vector of length $m$ with the entry at index $i$ being one. Matrices $M_1$ and $M_2$ are of size $m \times m$, and $M_3$ has size $3n \times m$. The square matrices $M_1$ and $M_2$ are initialized to identity, whereas $M_3$ is initialized to zero. $M_3$ can be seen as a learned basis for the mesh deltas, with $M_2$ and $M_1$ acting as a mapping from the frame index to this basis. Finding the geometry thus corresponds to finding values for $M_1$, $M_2$, and $M_3$ via optimization.

This formulation has no inherent propensity to, e.g., keep the surface tessellation of the mesh intact. Because of this, we apply a mesh Laplacian regularization term that penalizes local curvature changes compared to the base mesh. Sorkine [2005] gives an overview of Laplacian-based methods for mesh processing. Using their notation, the uniformly-weighted differential $\delta_j$ of vertex $v_j$ is $\delta_j = v_j - \frac{1}{|N_j|} \sum_{k \in N_j} v_k$, where $N_j$ is the set of one-ring neighbors of vertex $v_j$. In other words, $\delta_j$ is the difference between the position of vertex $v_j$ and the average position of its neighbors. Our Laplacian regularization term is $L_\delta = \frac{1}{n} \sum_{1 \le j \le n} \lVert \delta_j - \delta_j^\mathrm{base} \rVert^2$, i.e., the average squared Euclidean difference between the vertex differentials of the base mesh and those of the deformed mesh. Although this representation is not rotation-invariant [Lipman et al. 2005], we have not found this to be a problem in practice. Similar regularization has been used in earlier work on shape inference [Liu et al. 2019].
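A direct way to realize this parameterization is as three plain matrices with the stated initialization. The sketch below is our own illustration (the module structure and names are assumptions); it produces per-frame vertex positions from a frame index.

import torch
import torch.nn as nn

class DecomposedVertexDeltas(nn.Module):
    # V_i = V_base + M3 @ M2 @ M1 @ w_i, with w_i a one-hot frame indicator.
    #   M1, M2: [m, m], initialized to identity.
    #   M3:     [3n, m], initialized to zero, so optimization starts at the base mesh.
    def __init__(self, v_base, num_frames):
        super().__init__()
        n3 = v_base.numel()                        # 3n: flattened xyz coordinates
        self.register_buffer("v_base", v_base.reshape(-1))
        self.M1 = nn.Parameter(torch.eye(num_frames))
        self.M2 = nn.Parameter(torch.eye(num_frames))
        self.M3 = nn.Parameter(torch.zeros(n3, num_frames))

    def forward(self, frame_idx):
        # Multiplying by a one-hot w_i is just selecting column frame_idx of M1.
        w = self.M1[:, frame_idx]                  # [m]
        deltas = self.M3 @ (self.M2 @ w)           # [3n]
        return (self.v_base + deltas).reshape(-1, 3)

Because M3 starts at zero, the first iterations render the unmodified base mesh regardless of the contents of M1 and M2.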
In addition to texture and geometry, we learn global, per-camera brightness and contrast adjustment values that are applied to the rendered images during training. This accounts for differences between reference images originating from color vs. monochrome cameras. We do not use any form of temporal regularization, i.e., there are no terms that would prefer nearby frames to be similar to each other. Regardless of this, the solution is temporally stable, as can be seen in the accompanying video.

To resolve geometry and texture for a sequence of frames, we initialize the geometry representation as explained above. In each iteration, we choose random frame and camera indices and render the corresponding mesh using a pipeline similar to the one shown in Figure 1. We perform rendering in the same resolution as the reference images and optimize using Adam; the learning rate is decayed during the last 25% of optimization. High-pass filtering is computed as $x' = x - c \cdot \mathrm{blur}(x)$ for a fixed constant $c$, where $x$ is the rendered/reference image and $\mathrm{blur}(x)$ downsamples the image by a factor of 32 × 32 and upsamples it back using an approximate Gaussian filter. Images are compared using an L2 loss, so the overall loss function with the Laplacian term included is $L = \lVert x' - y' \rVert^2 + L_\delta$, where $x'$ and $y'$ are the high-pass filtered rendered and reference images, respectively. A typical optimization run takes 60–70 minutes on a single NVIDIA Tesla V100 GPU, with each optimization iteration taking approximately 40 milliseconds.
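The filtering and loss terms could be sketched as follows; the bilinear approximation of the Gaussian blur, the value of the constant c, and the neighbor bookkeeping are illustrative assumptions rather than the exact code used for the results.

import torch
import torch.nn.functional as F

def highpass(img, c=0.9):
    # x' = x - c * blur(x): blur() downsamples by 32x and upsamples back.
    # img: [B, 1, H, W]; c is an assumed value (not specified in the text above).
    low = F.interpolate(img, scale_factor=1.0 / 32.0, mode="bilinear", align_corners=False)
    low = F.interpolate(low, size=img.shape[-2:], mode="bilinear", align_corners=False)
    return img - c * low

def uniform_laplacian(verts, neighbor_idx, neighbor_mask):
    # delta_j = v_j - mean of one-ring neighbors.
    #   neighbor_idx:  [n, K] padded neighbor indices (long)
    #   neighbor_mask: [n, K] 1.0 for valid entries, 0.0 for padding
    neigh = verts[neighbor_idx] * neighbor_mask.unsqueeze(-1)         # [n, K, 3]
    counts = neighbor_mask.sum(dim=1, keepdim=True).clamp(min=1.0)    # [n, 1]
    return verts - neigh.sum(dim=1) / counts

def total_loss(rendered, reference, verts, base_delta, neighbor_idx, neighbor_mask):
    # Image term on high-pass filtered images plus the Laplacian regularizer.
    x, y = highpass(rendered), highpass(reference)
    image_term = ((x - y) ** 2).mean()
    delta = uniform_laplacian(verts, neighbor_idx, neighbor_mask)
    laplacian_term = ((delta - base_delta) ** 2).sum(dim=1).mean()
    return image_term + laplacian_term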
In an additional test, we resolved all three capture sequences in a single optimization run. This task is complicated by the actor being positioned slightly differently in each sequence; with just one base mesh, the alignment is off and large motion of the mesh is required. To circumvent this problem, we also learn a rigid transformation matrix $R_j$ for each sequence $j$ and apply it before the vertex deltas, i.e., $V_i = R_j V_\mathrm{base} + M_3 M_2 M_1 w_i$. In addition, we initialize the texture with the one previously solved for the "Neutral" performance, limit the middle matrix $M_2$ to 100 × 100 elements, and adjust the dimensions of $M_1$ and $M_3$ accordingly. With these modifications, and extending the computation to 800 000 iterations, the optimization successfully found a rigid transformation for each sequence to handle the misalignments, and subsequently solved the vertex deltas for every frame of the combined set along with a texture that best fits all three sequences. As a consequence of optimizing a single texture for the entire material, the mapping between surface and texture-space points becomes automatically consistent between all sequences.

Figure 10 shows an example result of the geometry and texture optimization in the "Neutral" sequence. For clarity, the wireframes in Figure 10c–e include only the edges of the original base mesh instead of the triangulated version. See the accompanying video for the sequences and our reconstructions, as well as the progression of geometry and texture during training.

In our base mesh, the holes for the mouth and eyes are simply covered with triangles in order to avoid spurious visibility leaks to the opposite side of the mesh. Obviously, this does not allow faithful reconstruction of the mouth and eyes, because eyes have strong view-dependent reflections and the mouth has complex internal geometry. As such, the optimization ends up texturing these covering triangles only somewhat believably. Hair becomes similarly approximated by the learned texture.

Figure 11 shows closeups of the nose region in five frames selected from the sequences "Neutral" and "Disgust". There is a surprising amount of fine-grained motion, and our method captures it very accurately, as illustrated in the figure. We obtained a 3D reconstruction from DI4D [DI4D 2020] for comparison purposes; it shows markedly less deformation and does not align properly with the camera images.
Fig. 10. An example frame from the reconstruction of an 89-frame sequence. Panels: (a) reference, (b) our reconstruction, (c) with wireframe, (d) wireframe only, (e) base mesh. (a) One of the 9 camera images for the frame. (b–d) Our reconstructed texture and geometry, rendered from the same viewpoint using the known camera parameters. (e) Base mesh $V_\mathrm{base}$ used as the starting point for optimization.

Fig. 11. Closeup of the nose reveals significant motion and deformation. Panels: reference, our rendering, DI4D [2020], difference (ours), difference (DI4D). Our solution reproduces the changes in geometry faithfully, while the solution from DI4D [2020] severely attenuates the deformations. False-color difference images (red/blue = brighter/darker than reference) highlight the geometric discrepancies around the nostrils. In our solution, the geometric silhouettes are located correctly and the differences are only due to defocus blur in the reference images that is lacking in our renderings. Right: the DI4D solution did not attempt to recover the motion of the ears. We placed no constraints on which vertices are allowed to move during optimization, and thus also captured this motion.

Fig. 12. Example reconstructions from the optimization of three sequences as a single 419-frame sequence, starting from the same base mesh as shown in Figure 10e: "Neutral" (89 frames), "Disgust" (123 frames), and "Anger" (207 frames); top row reference, bottom row our reconstruction. Because a single texture is solved for all frames, the vertex-to-skin correspondence is consistent across all sequences.

Their optical flow based reconstruction does not attempt to track areas that lack high-quality multi-view observations, such as ears and nostrils, whereas our solution automatically reconstructs this motion as well.

Figure 12 shows three example frames from the test where all three performances were resolved at once. In these selected frames, the mouth region looks reasonable, but the "Anger" sequence exhibits artifacts around the mouth in many frames. This is not surprising: our model is unable to render the mouth adequately, so the optimum may be far from correct. Nonetheless, the solution is otherwise temporally stable, and the motion and texture of skin
areas are reconstructed well. As the same texture and texture coordinates are used for all frames, our method provides highly consistent vertex-to-skin correspondence across all sequences.
We have demonstrated a modular differentiable renderer design capable of rendering high-resolution images of complex 3D scenes up to several orders of magnitude faster than prior approaches, while supporting crucial features such as filtered texture mapping with correct gradients. We believe that a high-performance differentiable renderer enables countless uses in inverse graphics, generative modeling, and other computer vision and AI problems, and to help this development, we have made our library publicly available at https://github.com/NVlabs/nvdiffrast.

As a practical use case whose success hinges on high-performance differentiable rendering, we have demonstrated that multi-view facial performance capture from synchronized high-resolution video cameras can be solved accurately by casting it as a simple inverse rendering problem. It will be interesting to extend this solution to joint material appearance capture, dynamic textures, as well as custom solutions for the eyes, mouth, and hair, integrated as a single optimization problem.

There may exist specific circumstances and applications (e.g., discovery of occluded geometry) where the all-transparent image formation model used by several earlier differentiable renderers may be beneficial for optimization. However, we believe that such problems can also be approached in a principled way via careful choices for mesh parameterization, regularization, and optimization methods, while following the standard occlusion model of the modern graphics hardware pipeline.
ACKNOWLEDGMENTS
We thank Simon Yuen for providing input and comparison data for the facial performance capture experiment, David Luebke for comments, and Sanja Fidler and Wenzheng Chen for discussions on previous work.
REFERENCES
ACM SIGGRAPH 2009 Courses (SIGGRAPH '09).
Sanjeev Arora, Nadav Cohen, and Elad Hazan. 2018. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization. In ICML (Proceedings of Machine Learning Research, Vol. 80). 244–253.
Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. 2010. High-Quality Single-Shot Capture of Facial Geometry. ACM Trans. Graph. 29, 4 (2010).
Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman, Robert W. Sumner, and Markus Gross. 2011. High-Quality Passive Facial Performance Capture Using Anchor Frames. ACM Trans. Graph. 30, 4 (2011).
Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In SIGGRAPH '99. 187–194.
Derek Bradley, Wolfgang Heidrich, Tiberiu Popa, and Alla Sheffer. 2010. High Resolution Passive Facial Performance Capture. ACM Trans. Graph. 29, 4 (2010).
Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. 2015. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR]. Stanford University / Princeton University / Toyota Technological Institute at Chicago.
Wenzheng Chen, Jun Gao, Huan Ling, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. 2019. Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer. In Advances in Neural Information Processing Systems.
R. L. Cook and K. E. Torrance. 1982. A Reflectance Model for Computer Graphics. ACM Trans. Graph. 1, 1 (1982), 7–24.
Martin de La Gorce, David J. Fleet, and Nikos Paragios. 2011. Model-Based 3D Hand Pose Estimation from Monocular Video. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 9 (2011), 1793–1805.
Michael Deering, Stephanie Winner, Bic Schediwy, Chris Duffy, and Neil Hunt. 1988. The Triangle Processor and Normal Vector Shader: A VLSI System for High Performance Graphics. In SIGGRAPH '88.
IEEE Computer Graphics and Applications 6, 11 (1986), 21–29.
Sebastian Herholz, Oskar Elek, Jiří Vorba, Hendrik Lensch, and Jaroslav Křivánek. 2016. Product Importance Sampling for Light Transport Path Guiding. Computer Graphics Forum 35, 4 (2016), 67–77.
André Jalobeanu, Frank O. Kuehnel, and John C. Stutz. 2004. Modeling Images of Natural 3D Surfaces: Overview and Potential Applications. (2004).
Jorge Jimenez, Diego Gutierrez, Jason Yang, Alexander Reshetov, Pete Demoreuille, Tobias Berghoff, Cedric Perthuis, Henry Yu, Morgan McGuire, Timothy Lottes, Hugh Malan, Emil Persson, Dmitry Andreev, and Tiago Sousa. 2011. Filtering Approaches for Real-Time Anti-Aliasing. In ACM SIGGRAPH Courses.
Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Neural 3D Mesh Renderer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
Tzu-Mao Li, Miika Aittala, Frédo Durand, and Jaakko Lehtinen. 2018. Differentiable Monte Carlo Ray Tracing through Edge Sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37, 6 (2018), 222:1–222:11.
Yaron Lipman, Olga Sorkine, David Levin, and Daniel Cohen-Or. 2005. Linear Rotation-Invariant Coordinates for Meshes. ACM Trans. Graph. 24, 3 (2005), 479–487.
Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. 2017. Material Editing Using a Physically Based Rendering Network. ICCV (2017), 2280–2288.
Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. 2019. Soft Rasterizer: A Differentiable Renderer for Image-based 3D Reasoning. In ICCV.
Matthew M. Loper and Michael J. Black. 2014. OpenDR: An Approximate Differentiable Renderer. In ECCV 2014, Vol. 8695. 154–169.
Guillaume Loubet, Nicolas Holzschuch, and Wenzel Jakob. 2019. Reparameterizing Discontinuous Integrands for Differentiable Rendering. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 38, 6 (2019).
Hugh Malan. 2010. Edge Anti-aliasing by Post-Processing. In GPU Pro, Wolfgang Engel (Ed.). A K Peters, 265–289.
Steve Marschner, Stephen H. Westin, Eric P. Lafortune, Kenneth E. Torrance, and Donald P. Greenberg. 1999. Image-Based BRDF Measurement Including Human Skin. In Rendering Techniques.
Merlin Nimier-David, Delio Vicini, Tizian Zeltner, and Wenzel Jakob. 2019. Mitsuba 2: A Retargetable Forward and Inverse Renderer. ACM Trans. Graph. 38, 6 (2019).
NVIDIA. 2018. NVIDIA Turing GPU Architecture. Technical Report.
Gustavo Patow and Xavier Pueyo. 2003. A Survey of Inverse Rendering Problems. Computer Graphics Forum.
Commun. ACM.
IEEE Computer Graphics and Applications 16, 3 (1996), 32–41.
Ken Shoemake. 1985. Animating Rotation with Quaternion Curves. SIGGRAPH Comput. Graph. 19, 3 (1985), 245–254.
Olga Sorkine. 2005. Laplacian Mesh Processing. In