Light Stage Super-Resolution: Continuous High-Frequency Relighting
TIANCHENG SUN and ZEXIANG XU,
University of California, San Diego
XIUMING ZHANG,
Massachusetts Institute of Technology
SEAN FANELLO, CHRISTOPH RHEMANN, and PAUL DEBEVEC, Google
YUN-TA TSAI and JONATHAN T. BARRON,
Google Research
RAVI RAMAMOORTHI,
University of California, San Diego

Fig. 1. Though the light stage is a powerful tool for relighting human subjects, its renderings suffer because adjacent lights of the stage are separated by some distance: (a) captured light stage images $I_i$ and $I_j$ for adjacent lights $\ell_i$ and $\ell_j$; (b) the blended image $(I_i + I_j)/2$; (c) our rendering $\hat{I}((\ell_i + \ell_j)/2)$. Using conventional image blending techniques to reconstruct the image corresponding to a "virtual" light that lies between the stage's actual lights results in ghosting in shadowed and specular regions (b), seen here on the subject's eyes and cheek. By training a deep neural network to regress from a light direction to an image, our model is able to synthesize accurate renderings of the subject under arbitrary virtual light directions: as the light moves, highlights and shadows move smoothly instead of incorrectly blending together, thereby enabling realistic high-frequency relighting effects (c). These images have been manually but uniformly brightened and color-corrected, and are rendered with insets to highlight detail.
The light stage has been widely used in computer graphics for the past two decades, primarily to enable the relighting of human faces. By capturing the appearance of the human subject under different light sources, one obtains the light transport matrix of that subject, which enables image-based relighting in novel environments. However, due to the finite number of lights in the stage, the light transport matrix only represents a sparse sampling on the entire sphere. As a consequence, relighting the subject with a point light or a directional source that does not coincide exactly with one of the lights in the stage requires interpolating and resampling the images corresponding to nearby lights, and this leads to ghosting shadows, aliased specularities, and other artifacts. To ameliorate these artifacts and produce better results under arbitrary high-frequency lighting, this paper proposes a learning-based solution for the "super-resolution" of scans of human faces taken from a light stage. Given an arbitrary "query" light direction, our method aggregates the captured images corresponding to neighboring lights in the stage, and uses a neural network to synthesize a rendering of the face that appears to be illuminated by a "virtual" light source at the query location. This neural network must circumvent the inherent aliasing and regularity of the light stage data that was used for training, which we accomplish through the use of regularized traditional interpolation methods within our network. Our learned model is able to produce renderings for arbitrary light directions that exhibit realistic shadows and specular highlights, and is able to generalize across a wide variety of subjects. Our super-resolution approach enables more accurate renderings of human subjects under detailed environment maps, or the construction of simpler light stages that contain fewer light sources while still yielding comparable quality renderings as light stages with more densely sampled lights.

Authors' addresses: Tiancheng Sun, [email protected]; Zexiang Xu, [email protected], University of California, San Diego; Xiuming Zhang, [email protected], Massachusetts Institute of Technology; Sean Fanello, [email protected]; Christoph Rhemann, [email protected]; Paul Debevec, [email protected], Google; Yun-Ta Tsai, [email protected]; Jonathan T. Barron, [email protected], Google Research; Ravi Ramamoorthi, [email protected], University of California, San Diego.
CCS Concepts: • Computing methodologies → Image-based rendering; Neural networks.

Additional Key Words and Phrases: Portrait relighting, image-based relighting.
ACM Reference Format:
Tiancheng Sun, Zexiang Xu, Xiuming Zhang, Sean Fanello, Christoph Rhemann, Paul Debevec, Yun-Ta Tsai, Jonathan T. Barron, and Ravi Ramamoorthi. 2020. Light Stage Super-Resolution: Continuous High-Frequency Relighting.
ACM Trans. Graph.
39, 6, Article 260 (December 2020), 12 pages. https://doi.org/10.1145/3414685.3417821
1 INTRODUCTION

A central problem in computer graphics and computer vision is that of acquiring some observations of an object, and then producing photorealistic relit renderings of that object. Of particular interest are renderings of human faces, which have many practical uses within consumer photography and the visual effects industry, but also serve as a particularly challenging case due to their complexity and the high sensitivity of the human visual system to facial appearance. A light stage represents an effective solution for this task: by programmatically activating and deactivating several LED lights arranged in a sphere while capturing synchronized images, the light stage acquires a full reflectance field for a human subject, which we refer to as a "one-light-at-a-time" (OLAT) image set. Because light is additive, this OLAT scan represents a lighting "basis", and the subject can be relit according to some desired environment map by simply projecting that environment map onto the light stage basis [Debevec et al. 2000].

Though straightforward and theoretically elegant, this classic relighting approach has a critical limitation. The lights on the light stage are usually designed to be small and distant from the subject, so that they are well-approximated as directional light sources. As a consequence, realistic high-frequency effects such as sharp cast shadows and specular highlights are present in the captured OLAT images. In order to achieve photorealistic relighting results under all possible lighting conditions, the lights must be placed closely enough on the sphere of the stage such that shadows and specularities in the captured images of adjacent lights "move" by less than one pixel. However, practical constraints (the cost and size of each light, and the difficulty of powering and synchronizing many lights) discourage the construction of light stages with very high densities of lights. Even if such a high-density light stage could be built, the time to acquire an OLAT increases linearly with the number of lights, and this makes human subjects (which must be stationary during OLAT acquisition) difficult to capture. For these reasons, even the most sophisticated light stages in existence today contain only a few hundred lights that are spaced many degrees apart. This means that the OLAT scans from a light stage are undersampled with respect to the angular sampling of lights, and the rendered images using conventional approaches will likely contain ghosting. Attempting to render an image using a "virtual" light source that lies in between the real lights of the stage by applying a weighted average on adjacent OLAT images will not produce a soft shadow or a streaking specularity, but will instead produce the superposition of multiple sharp shadows and specular dots (see Fig. 1b).

This problem can be mitigated by imaging subjects that only exhibit low-frequency reflectance variation, or by performing relighting using only low-frequency environment maps. However, most human subjects have complicated material properties (specularities, scattering, etc.) and real-world environment maps frequently exhibit high-frequency variation (bright light sources at arbitrary locations), which often results in noticeable artifacts as shown in Fig. 1b.
To this end, we propose a learning-based solution for super-resolving the angular resolution of light stage scans of human faces. Given an OLAT scan of a human face with finitely many images and the direction of a desired "virtual" light, our model predicts a complete high-resolution RGB image that appears to have been lit by a light source from that direction, even though that light is not present in our light stage (see Fig. 1c). Our robust solution for "upsampling" the number of lights, which we refer to as light stage super-resolution, can additionally enable the construction of simpler light stages with fewer lights, thereby reducing cost and increasing the frame rate at which subjects can be scanned. Our algorithm can also produce better rendered images for applications that require light stage data for training, such as portrait relighting or shadow removal. Casual users can then utilize these algorithms on a single cellphone without requiring capture inside a light stage. Note that we focus only on human face relighting within a light stage. While we believe the methods herein could be applied more broadly, a comprehensive system for general object relighting remains a topic of future work.

Our algorithm (Sec. 3) must work with the inherent aliasing and regularity of the light stage data used for training. We address this by combining the power of deep neural networks with the efficiency and generality of conventional linear interpolation methods. Specifically, we use an active set of closest lights within our network (Sec. 3.1) and develop a novel alias-free pooling approach to combine their network activations (Sec. 3.2) using a weighting operator guaranteed to be smooth when lights enter or exit the active set. Our network allows us to super-resolve an OLAT scan of a human face: we can take our learned model and repeatedly query it with thousands of light directions, and treat the resulting set of synthesized images as though they were acquired by a physically-unconstrained light stage with an unbounded sampling density. As we will demonstrate, these super-resolved "virtual" OLAT scans allow us to produce photorealistic renderings of human faces with arbitrarily high-frequency illumination content.
2 RELATED WORK

The angular undersampling from the light stage relates to much work over the past two decades on the frequency analysis of light transport [Ramamoorthi and Hanrahan 2001; Sato et al. 2003; Durand et al. 2005], and can also be related to analyses of sampling rate in image-based rendering [Chai et al. 2000] for the related problem of view synthesis [Mildenhall et al. 2019]. This problem also bears some similarities to multi-image super-resolution [Milanfar 2010] and angular super-resolution in the light field [Kalantari et al. 2016; Cheng et al. 2019], where aliased observations are combined to produce interpolated results. In this paper, we leverage priors and deep learning to go beyond these sampling limits, upsampling or super-resolving a sparse input light sampling on the light stage to achieve continuous high-frequency relighting.

Recently, many approaches for acquiring a sparse light transport matrix have been developed, including methods based on compressive sensing [Peers et al. 2009; Sen and Darabi 2009], kernel Nyström [Wang et al. 2009], optical computing [O'Toole and Kutulakos 2010] and neural networks [Ren et al. 2013, 2015; Kang et al. 2018]. However, these methods are not designed for the light stage and are largely orthogonal to our approach. They seek to acquire the transport matrix for a fixed light sampling resolution with a sparse set of patterns, while we seek to take this initial sampling resolution and upsample or super-resolve it to much higher-resolution
lighting (and indeed enable continuous high-frequency relighting). Most recently, Xu et al. [2018] proposed a deep learning approach for image-based relighting from only five lighting directions, but it cannot reproduce very accurate shadows. While we do use many more lights, we achieve significantly higher-quality results with accurate shadows.

The general approach of using light stages for image-based relighting stands in contrast to more model-based approaches. Traditionally, instead of super-resolving a light stage scan, one could use that scan as input to a photometric stereo algorithm [Woodham 1980], and attempt to recover the normal and the albedo maps of the subject. More advanced techniques were developed to produce a parametric model of the geometry and reflectance for even highly specular objects [Tunwattanapong et al. 2013]. There are also works that focus on recovering a parametric model from a single image [Sengupta et al. 2018], constructing a volumetric model for view synthesis [Lombardi et al. 2018], or even a neural representation of a scene [Tewari et al. 2020]. However, the complicated reflectance and geometry of human subjects is difficult to even parameterize analytically, let alone recover. Though recent progress may enable the accurate capture of human faces using parametric models, there are additional difficulties in capturing a complete portrait due to the complexity of human hair, eyes, ears, etc. Indeed, this complexity has motivated the use of image-based relighting via light stages in the visual effects industry for many years [Tunwattanapong et al. 2011; Debevec 2012].

Interpolating a reflectance function has also been investigated in the literature. Masselus et al. [2004] compare the errors of fitting the sampled reflectance function to various basis functions and conclude that multilevel B-splines can preserve the most features. More recently, Rainer et al. [2019] utilize neural networks to compress and interpolate sparsely sampled observations. However, these algorithms interpolate the reflectance function independently for each pixel and do not consider local information in neighboring pixels. Thus, their results are smooth and consistent in the light domain, but might not be consistent in the image domain. Fuchs et al. [2007] treat the problem as a light super-resolution problem, and their approach is the most similar to our work. They use heuristics to decompose the captured images into diffuse and specular layers, and apply optical-flow and level-set algorithms to interpolate highlights and light visibility respectively. This approach works well on highly reflective objects, but as we will demonstrate, it usually fails on human skin, which contains high-frequency bumps and cannot be well modeled using only diffuse and specular terms.

In recent years, light stages have also been demonstrated to be invaluable tools for generating training data for use in deep learning tasks [Meka et al. 2019; Guo et al. 2019; Sun et al. 2019; Nestmeyer et al. 2019]. This enables user-facing effects that do not require acquiring a complete light stage scan of the subject, such as "portrait relighting" from a single image [Sun et al. 2019; Apple 2017] or VR experiences [Guo et al. 2019]. These learning-based applications suffer from the same undersampling issue as do conventional uses of light stage data. For example, Sun et al.
[2019] observe artifacts when relighting with environment maps that contain high-frequency illumination. We believe our method can provide better training data and significantly improve many of these methods in the future.
3 METHOD

An OLAT scan of a subject captured by a light stage consists of n images, where each image is lit by a single light in the stage. The conventional way to relight the captured subject with an arbitrary light direction is to linearly blend the images captured under nearby lights in the OLAT scan. As shown in Fig. 1, this often results in "ghosting" artifacts on shadows and highlights. The goal of this work is to use machine learning instead of simple linear interpolation to produce higher-quality results. Our model takes as input a query light direction ℓ and a complete OLAT scan consisting of a set of paired images and light directions {I_i, ℓ_i}, and uses a deep neural network Φ to obtain the predicted image Î:

$$\hat{I}(\ell) = \Phi\big(\{I_i, \ell_i\}_{i=1}^{n},\ \ell\big). \tag{1}$$

This formalization is broad enough to describe some prior works on learning-based relighting [Xu et al. 2018; Meka et al. 2019]. While these methods usually operate by training a U-Net [Ronneberger et al. 2015] to map from a sparse set of input images to an output image, we focus on producing as high-quality as possible rendering results given the complete OLAT scan. However, feeding all the captured images into a conventional CNN is not tractable in terms of speed or memory requirements. In addition, this naive approach seems somewhat excessive for practical applications involving human faces. While complex translucency and interreflection may require multiple lights to reproduce, it is unlikely that all images in the OLAT scan are necessary to reconstruct the image for any particular query light direction, especially given that barycentric interpolation requires only three nearby lights to produce a somewhat plausible rendering. Our work attempts to find an effective and tractable compromise between these two extremes, in which the power of deep neural networks is combined with the efficiency and generality of nearest-neighbor approaches. This is accomplished by a linear blending approach that (like barycentric blending) ensures the output rendering is a smooth function of the input, where the blending is performed on the activations of a neural network's encoding of our input images instead of on the raw pixel intensities of the input images.

Our complete network structure is shown in Fig. 2. Given a query light direction ℓ, we identify the k captured images in the OLAT scan whose corresponding light directions are nearby the query light direction, which we call the active set A(ℓ). These OLAT images I_i and their corresponding light directions ℓ_i are then each independently processed in parallel by the encoder Φ_e(·) of our convolutional neural network (or equivalently, they are processed as a single "batch"), thereby producing a multi-scale set of internal neural network activations that describe all k images. After that, the set of k activations at each layer of the network are pooled into a single set of activations at each layer, which is performed using a weighted averaging where the weighting is a function of the query light and each input light, W(ℓ, ℓ_i). This weighted average is designed to remove the aliasing introduced by the nearest-neighbor sampling in the active set selection stage. Together with the query light direction ℓ, these pooled feature maps are then fed into the decoder Φ_d(·) by means of skip links from each level of the encoder, thereby producing the final predicted image Î(ℓ). Formally, our final image synthesis procedure is:

$$\hat{I}(\ell) = \Phi_d\Bigg(\sum_{i \in \mathcal{A}(\ell)} W(\ell, \ell_i)\, \Phi_e(I_i, \ell_i),\ \ell\Bigg). \tag{2}$$

This hybrid approach of nearest-neighbor selection and neural network processing allows us to learn a single neural network that produces high quality results, and generalizes well across query light directions and across subjects in our OLAT dataset. Our active set construction approach is explained in Section 3.1, our alias-free pooling is explained in Section 3.2, the network architecture is described in Section 3.3, and our progressive training procedure is discussed in Section 3.4.

Fig. 2. A visualization of our model architecture. The encoder of our model Φ_e(·) takes as input a concatenation of the nearby OLAT images in the active set and their light directions, which are processed by a series of stride-2 conv layers. The resulting encoded activations of these 8 images at each level are then combined using the alias-free pooling described in Section 3.2, and skip-connected to the decoder. The decoder Φ_d(·) takes as input the query light direction ℓ, processes it with fully connected layers and then upsamples it (along with the skip-connected encoder activations), and decodes the image using a series of stride-2 transposed conv layers. Whether or not a conv or transposed conv changes resolution is indicated by whether or not its edge spans two spatial scales.
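To make the data flow of Eq. (2) concrete, the following Python sketch traces one query through the model. The names `encode`, `decode`, and `weights_fn` stand in for Φ_e, Φ_d, and the weighting W defined in Sec. 3.2 below; they are illustrative placeholders, not identifiers from the authors' implementation.

```python
import numpy as np

def synthesize(query, images, light_dirs, encode, decode, weights_fn, k=8):
    """Sketch of Eq. (2): encode the k nearest OLAT images, blend their
    activations with the alias-free weights W, and decode at the query
    light. All light directions are unit 3-vectors."""
    dots = light_dirs @ query          # (n,) cosines to every stage light
    active = np.argsort(-dots)[:k]     # active set A(query): k nearest lights
    w = weights_fn(query, light_dirs[active])   # sums to 1 over the set
    # Blend the per-image encoder activations. In the real architecture
    # (Fig. 2) this happens at every encoder scale; a single feature map
    # is shown here for brevity.
    feats = sum(w[j] * encode(images[i], light_dirs[i])
                for j, i in enumerate(active))
    return decode(feats, query)
```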
3.1 Active Set Selection

Light stages are conventionally constructed by placing lights on a regular hexagonal tessellation of a sphere (with some "holes" for cameras and other practical concerns), as shown in Fig. 3. As discussed, at test time our model works by identifying the OLAT images and lights that are nearest to the desired query light direction, and averaging their neural activations. But this natural approach, when combined with the regularity of the sampling of lights in the light stage, presents a number of problems for training our model. First, we can only supervise our super-resolution model using "virtual" lights that exactly coincide with the real lights of the light stage, as these are the only light directions for which we have ground-truth images (this will also be a problem when evaluating our model, as will be discussed in Sec. 4). Second, this regular hexagonal sampling means that, for any given light in the stage, the distances between it and its neighbors will always exhibit a highly regular pattern (Fig. 3a). For example, the 6 nearest neighbors of every point on a hexagonal tiling are guaranteed to have exactly the same distance to that point. In contrast, at test time we would like to be able to produce renderings for query light directions that correspond to arbitrary points on the sphere, and those points will likely have irregular distributions of neighboring lights (Fig. 3c). This represents a significant deviation between our training data and our test data, and as such we should expect poor generalization at test time if we were to naively train on highly-regular sets of nearest neighbors.

Fig. 3. The OLAT images taken from a light stage have a uniform hexagonal pattern, which means that the distances between each light and its nearest neighbors are highly regular (a). In contrast, at test time we want to synthesize images corresponding to unseen light directions that do not lie on this hexagonal grid, and whose neighboring distances will therefore be irregular (c). During training we therefore sample a random subset of nearest neighbors for use in the active set of our model (b), which forces the network to adapt to challenging and irregular distributions of neighbor-distances that better match those that will be seen at test time.
Fig. 4. Varying the query light direction will cause OLAT images to leave and enter the active set of our model, which introduces aliasing that, if unaddressed, results in jarring temporal artifacts in our renderings. To address this, we use an "alias-free pooling" technique to ensure that the network activations of each OLAT image are averaged in a way that suppresses this aliasing. We use a weighted average where the weights are smooth, and are exactly zero at the point where lights enter and leave the active set.
To address this issue, we adopt a different technique for sampling neighbors for use in our active set than what is used during test time. For each training iteration, we first identify a larger set of m nearest neighbors near the query light (which in this case is identical to one of the real lights in the stage), and among them randomly select only k < m neighbors to use in the active set (in practice, we use m = 16 and k = 8).

3.2 Alias-Free Pooling

A critical component in our model is the design of the skip links from each level of the encoder of our model to its corresponding level in the decoder. This model component is responsible for taking the network activations corresponding to the 8 images in our active set and reducing them to one set of activations corresponding to a single output, which will then be decoded into an image. This requires a pooling operator for these 8 images. This pooling operator must be permutation-invariant, as the images in our active set may correspond to any OLAT light direction and may be presented in any order. Standard permutation-invariant pooling operators, such as average-pooling or max-pooling, are not sufficient for our case, because they do not suppress aliasing. As the query light direction moves across the sphere, images will enter and leave the active set of our model, which will cause the network activations within our encoder to change suddenly (see Fig. 4). If we use simple average-pooling or max-pooling, the activations in our decoder will also vary abruptly, resulting in unrealistic flickering artifacts or temporal instability in our output renderings as the light direction varies. In other words, the point-sampled signal should go through an effective prefiltering process in order to suppress the artifacts.

The root cause of this problem is that our active set is an aliased observation of the input images, and average- or max-pooling allows this aliasing to persist. We therefore introduce a technique for alias-free pooling to address this issue. We use a weighted average as our pooling operator, where the weight of each item in our active set is a continuous function of the query light direction, and where the weight of each item is guaranteed to be zero at the moment it enters or leaves the active set. We define our weighting function between the query light direction ℓ and each OLAT light direction ℓ_i as follows:

$$\widetilde{W}(\ell, \ell_i) = \max\Big(0,\ e^{s(\ell \cdot \ell_i - 1)} - \min_{j \in \mathcal{A}(\ell)} e^{s(\ell \cdot \ell_j - 1)}\Big), \qquad W(\ell, \ell_i) = \frac{\widetilde{W}(\ell, \ell_i)}{\sum_j \widetilde{W}(\ell, \ell_j)}, \tag{3}$$

where s is a learnable parameter that adjusts the decay of the weight with respect to the distance, and each ℓ is a normalized vector in 3D space. During training, the parameter s will be automatically adjusted to balance between selecting the nearest neighbor (s = +∞) and taking an unweighted average of all neighbors (s = 0).
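Eq. (3) and the training-time sampling of Sec. 3.1 are simple enough to transcribe directly. Below is a minimal sketch, with s as a plain float rather than a learned parameter, and with one labeled assumption about excluding the query light:

```python
import numpy as np

def alias_free_weights(query, active_dirs, s=10.0):
    """Direct transcription of Eq. (3). All directions are unit vectors.
    The farthest light in the active set gets weight exactly zero, so
    lights enter and leave the set smoothly as the query moves."""
    e = np.exp(s * (active_dirs @ query - 1.0))  # in (0, 1], 1 when aligned
    w = np.maximum(0.0, e - e.min())             # zero at the set boundary
    return w / max(w.sum(), 1e-12)               # guard the degenerate case

def training_active_set(query_idx, light_dirs, m=16, k=8, rng=None):
    """Sec. 3.1 sampling: take the m nearest stage lights to the (real)
    query light, then randomly keep k of them, mimicking the irregular
    neighbor distances seen at test time. Excluding the query light's
    own image is our assumption; the text does not spell this out."""
    rng = rng or np.random.default_rng()
    order = np.argsort(-(light_dirs @ light_dirs[query_idx]))
    candidates = order[order != query_idx][:m]
    return rng.choice(candidates, size=k, replace=False)
```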
3.3 Network Architecture

The remaining components of our model consist of the conventional building blocks used in constructing convolutional neural networks, and can be seen in Fig. 2. The encoder of our network consists of 3 × 3 convolutional layers with stride 2. The input to our encoder is a set of 8 RGB input images corresponding to the nearby OLAT images in our active set, each of which has been concatenated with the xyz coordinate of its source light (tiled to every pixel), giving us 8 6-channel input images. These images are processed along the "batch" dimension of our network, and so are treated identically at each level of the encoder. These 8 images are then pooled down to a single "image" (i.e., a single batch) of activations using the alias-free pooling approach of Section 3.2, each of which is concatenated onto the internal activations of the network's decoder.

The decoder of the network begins with a series of fully-connected (aka "dense") neural network blocks that take as input the query light direction ℓ, each of which is followed by instance normalization [Ulyanov et al. 2016] and a PReLU activation function. These activations are then upsampled from a 4 × 4 resolution (along with the skip-connected encoder activations from each level) and decoded into the output image, whose values lie in [0, 1], using a series of stride-2 transposed conv layers. Because our network is fully convolutional [Long et al. 2015], it can be evaluated on images of arbitrary resolution, with GPU memory being the only limiting factor. We train on 512 × 512 resolution images for the sake of speed, and evaluate and test on 1024 × 1024 resolution images.
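The exact layer widths are not recoverable from this scan, so the following PyTorch-style sketch shows only the overall shape implied by Fig. 2 and the text: stride-2 3 × 3 convs in the encoder, alias-free pooling over the k = 8 active-set images at every scale, and a transposed-conv decoder driven by the query light direction. Channel counts, depth, and the light-direction MLP are our placeholders, and the instance normalization and PReLU layers mentioned above are omitted for brevity.

```python
import torch
import torch.nn as nn

class LightSRNet(nn.Module):
    """Sketch of Fig. 2 (not the authors' code): encoder/decoder with
    per-scale alias-free pooling over the k active-set images."""
    def __init__(self, levels=4, ch=64):
        super().__init__()
        chans = [6] + [ch * 2**i for i in range(levels)]   # 6 = RGB + xyz
        self.enc = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1)
            for i in range(levels))
        self.light_mlp = nn.Sequential(nn.Linear(3, chans[-1]), nn.PReLU())
        self.dec = nn.ModuleList(
            nn.ConvTranspose2d(2 * chans[i + 1], chans[i], 3, stride=2,
                               padding=1, output_padding=1)
            for i in reversed(range(levels)))
        self.to_rgb = nn.Conv2d(chans[0], 3, 3, padding=1)

    def forward(self, imgs, weights, query):
        # imgs: (k, 6, H, W) active-set images with tiled light xyz;
        # weights: (k,) alias-free weights; query: (3,) light direction.
        # H and W are assumed divisible by 2**levels.
        skips, x = [], imgs
        for conv in self.enc:
            x = torch.relu(conv(x))
            # Alias-free pooling: weighted average over the k images.
            skips.append((weights.view(-1, 1, 1, 1) * x).sum(0, keepdim=True))
        # Broadcast the encoded query light over the coarsest feature map.
        y = self.light_mlp(query).view(1, -1, 1, 1).expand_as(skips[-1])
        for up, skip in zip(self.dec, reversed(skips)):
            y = torch.relu(up(torch.cat([y, skip], dim=1)))
        return self.to_rgb(y)
```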
3.4 Training Procedure

We supervise the training of our model using an L1 loss on pixel intensities. Formally, our loss function is:

$$\mathcal{L}_d = \sum_i \big\| M \odot \big(I_i - \hat{I}(\ell_i)\big) \big\|_1, \tag{4}$$

where I_i is the ground truth image under light i, and Î(ℓ_i) is our prediction. When computing the loss over the image, we use a precomputed binary image M to mask out pixels that are known to belong to the background of the subject.

During training, we construct each training data instance by randomly selecting a human subject in our training dataset and then randomly selecting one OLAT light direction i. The image corresponding to that light, I_i, will be used as the ground-truth image our model will attempt to reconstruct, and the "query" light direction for our model will be the light corresponding to that image, ℓ_i. We then identify a set of 8 neighboring images/lights to include in our active set using the selection procedure described in Section 3.1. Our only data augmentation is a randomly-positioned 512 × 512 crop. We optimize using Adam [Kingma and Ba 2015] with its default hyperparameter settings (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸).
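Assembled from the pieces above, one training example could be scored as follows. This sketch assumes the norm in Eq. (4) is an L1 norm (its subscript did not survive this scan), reuses the hypothetical helpers from the earlier sketches, and elides array/tensor conversions:

```python
import numpy as np

def training_example_loss(model, olat_images, light_dirs, mask, rng):
    """One training example per Sec. 3.4: reconstruct the image of a real
    stage light from its randomly-thinned active set, scored by masked L1."""
    query_idx = int(rng.integers(len(light_dirs)))   # a real stage light
    active = training_active_set(query_idx, light_dirs, m=16, k=8, rng=rng)
    w = alias_free_weights(light_dirs[query_idx], light_dirs[active])
    pred = model(olat_images[active], w, light_dirs[query_idx])
    # Eq. (4): L1 distance over foreground pixels only (M masks background).
    return np.abs(mask * (pred - olat_images[query_idx])).sum()
```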
4 RESULTS

We use the OLAT portrait dataset from Sun et al. [2019], which contains 22 subjects with multiple facial expressions captured using a light stage and a 7-camera system. The light stage consists of 302 LEDs uniformly distributed on a spherical dome, and capturing a subject takes roughly 6 seconds. Each capture process produces an OLAT scan of a specific facial expression on each camera, which consists of 302 images, and we treat the OLAT scans from different cameras as independent OLAT scans. Because the subject is asked to stay still (and an optical flow algorithm [Wenger et al. 2005] is applied to correct the small movements), the captured 302 images in each OLAT are aligned and only differ in lighting directions. We manually select 4 OLAT scans with a mixture of subjects and views for use as our validation set, and choose another 16 OLAT scans with good coverage of gender and diverse skin tones for use as training data. Our 16 training scans only cover 5 of the 7 cameras, and the remaining 2 are covered by the validation data. We train our network using all lights from our OLAT data in a canonical global lighting coordinate frame, which allows us to train a single network for all viewpoints in our training data. We train one single model for all subjects in our training dataset, which we found to match the performance of training an individual model for each subject.

Empirically evaluating our model presents a significant challenge: our model is attempting to super-resolve an undersampled scan from a light stage, which means that the only ground truth that is available for benchmarking is also undersampled. In other words, the goal of our model is to accurately synthesize images that correspond to virtual lights in between the real lights of the stage, but we do not have ground-truth images that correspond to those virtual lights. In addition, the model also needs to generalize to unseen views and subjects. For these reasons, qualitative results (figures, videos) are preferred, and we encourage readers to view our figures and the accompanying video. In the quantitative results presented here, we use held-out real images lit by real lights on our light stage as a validation set. When evaluating one of these validation images, we do not use the active-set selection technique of Section 3.1, and instead just sample the k = 8 nearest lights. This validation approach is not ideal, as all such evaluations will follow the same regular sampling pattern of our light stage. This evaluation task is therefore more biased than the real task of predicting images away from the sampling pattern of the light stage.

Selecting an appropriate metric for measuring image reconstruction accuracy for our task is not straightforward. Conventional image interpolation techniques often result in ghosting artifacts or duplicated highlights, which are perceptually salient but often not penalized heavily by traditional image metrics such as per-pixel RMSE. We therefore evaluate image quality using multiple image metrics: RMSE, the Sobolev H1 norm [Ng et al. 2003], DSSIM [Wang et al. 2004], and E-LPIPS [Kettunen et al. 2019]. RMSE measures pixel-wise error, the H1 norm emphasizes image gradient error, while DSSIM and E-LPIPS approximate an overall perceptual difference between the predicted image and the ground truth. Still, images and videos are preferred for comparison.

Fig. 5. A visualization of how our learned model synthesizes renderings in which shadows move smoothly as a function of light direction. In (a) we show a rendering from our model for some virtual light ℓ with a horizontal angle of θ, and highlight one image strip that includes a horizontal cast shadow. In (b) we repeatedly query our model with θ values that should induce a linear horizontal translation of the shadow boundary in the image plane, and by stacking these image strips we can see this linear trend emerge (highlighted in red). In (c) and (d) we do the same for ablations of our model that do not have our active-set random selection procedure nor our alias-free pooling, and we see that the resulting shadow boundary does not vary smoothly or linearly. Panels: (a) one rendering, for reference; (b) our model; (c) our model w/ naive neighbors; (d) our model w/ avg pooling.

Table 1. Here we benchmark our model against prior work and ablations of our model on our validation dataset. We report the arithmetic mean of each metric across the validation set. The top three results of each metric are highlighted in red, orange, and yellow, respectively, in the original. While "Ours w/naive neighbors" has the lowest error according to this evaluation, "Our model" performs better in our real test-time scenario where the synthesized light does not lie in a regular hexagonal grid (see text and Fig. 5 for details).

Algorithm                             RMSE     H1       DSSIM    E-LPIPS
Our model                             0.0160   0.0203   0.0331   0.00466
Ours w/naive neighbors                0.0156   0.0199   0.0322   0.00449
Ours w/avg-pooling                    0.0203   0.0241   0.0413   0.00579
Linear blending                       0.0191   0.0232   0.0366   0.00503
Fuchs et al. [2007]                   0.0195   0.0258   0.0382   0.00485
Photometric stereo                    0.0284   0.0362   0.0968   0.00895
Xu et al. [2018] w/ 8 optimal lights  0.0410   0.0437   0.1262   0.01666
Xu et al. [2018] w/ adaptive input    0.0259   0.0291   0.1156   0.00916
Meka et al. [2019]                    0.0505   0.0561   0.1308   0.01482

We first evaluate against ablated versions of our model, with results shown in Tab. 1. In the "Ours w/naive neighbors" ablation we use the k = 8 nearest lights during training rather than randomly sampling the active set from a larger candidate pool. Although this ablation scores slightly better on our validation set (whose lights lie on the stage's regular grid), in the real test-time scenario in which we synthesize with lights that do not lie on the regular hexagonal grid of our light stage, we see this ablated model generalizes poorly. In Fig. 5 we visualize the output of our model and ablations of our model as a function of the query light direction. We see that our model is able to synthesize a cast shadow that is a smooth linear function, in the image plane, of the angle of the query light (after accounting for foreshortening, etc.).
Ablations of our technique do not reproduce this linearly-varying shadow, due to the aliasing and overfitting problems described earlier. See the supplemental video for additional visualizations.

In the "Ours w/avg-pooling" ablation we replace the alias-free pooling of our model with simple average pooling. As shown in Tab. 1, ablating this component reduces performance. But more importantly, ablating this component also causes flickering during our real test-time scenario in which we smoothly vary our light source, and this is not reflected in our quantitative evaluation. Because average pooling assigns a non-zero weight to images as they enter and exit our active set, renderings from this model will contain significant temporal instability. See the supplemental video for examples.

We compare our results against related approaches that are capable of solving the relighting problem. The "Linear blending" baseline in Tab. 1 produces competitive results, despite being a very simple algorithm: we simply blend the input images of our light stage according to our alias-free weights. Because linear blending directly interpolates aligned pixel values, it is often able to retain accurate high-frequency details in flat regions, and this strategy works well for minimizing our error metrics. However, linear blending produces significant ghosting artifacts in shadows and highlights, as shown in Fig. 6. Though these errors are easy to detect visually, they appear to be hard to measure empirically.

Fig. 6. Here we present a qualitative comparison between our method and other light interpolation algorithms. Panels: (a) ours (full image), (b) ground truth, (c) ours, (d) linear blending, (e) Fuchs et al. [2007], (f) photometric stereo, (g) Xu et al. [2018] w/ optimal sample, (h) Xu et al. [2018] w/ adaptive sample, (i) Meka et al. [2019]. Traditional methods (linear blending, Fuchs et al. [2007], photometric stereo) retain detail but suffer from ghosting artifacts in shadowed regions. Results from Xu et al. [2018] and Meka et al. [2019] exhibit significant oversmoothing and brightness changes. Our method retains details and synthesizes shadows that resemble the ground truth.

We evaluate against the layer-based technique of Fuchs et al. [2007] by decomposing an OLAT into diffuse, specular, and visibility layers, and interpolating the illumination individually for each layer. Although the method works well on specular objects as shown in the original paper, it performs less well on OLATs of human subjects, as shown in Tab. 1. This appears to be due to the complex specularities on human skin not being tracked accurately by the optical flow algorithm of Fuchs et al. [2007]. Additionally, the interpolation of the visibility layer sometimes contains artifacts, which results in cast shadows being predicted incorrectly. That being said, the algorithm results in fewer ghosting artifacts than the linear blending algorithm, as shown in Fig. 6 and as reflected by the E-LPIPS metric.

Using the layer decomposition produced by Fuchs et al. [2007], we additionally perform photometric stereo on the OLAT data by simple linear regression to estimate a per-pixel albedo image and normal map. Using this normal map and albedo image we can then use Lambertian reflectance to render a new diffuse image corresponding to the query light direction, which we add to the specular layer from Fuchs et al. [2007] to produce our final rendering. As shown in Tab. 1, this approach underperforms that of Fuchs et al. [2007], likely due to the reflectance of human faces being non-Lambertian. Additionally, the scattering effect of human hair is poorly modeled in terms of a per-pixel albedo and normal vector. These limiting assumptions result in overly sharpened and incorrect shadow predictions, as shown in Fig. 6. In contrast to this photometric stereo approach and the layer-based approach of Fuchs et al. [2007], our model does not attempt to factorize the human subject into a predefined reflectance model wherein interpolation can be explicitly performed. Our model is instead trained to identify a latent vector space of network activations in which naive linear interpolation results in accurate non-linearly interpolated images, which results in more accurate renderings.

The technique of Xu et al. [2018] (retrained on our training data) represents another possible candidate for addressing our problem. This technique does not natively solve our problem: in order to find the optimal lighting directions for relighting, it requires as input all 302 high-resolution images in each OLAT scan in the first step, which significantly exceeds the memory constraints of modern GPUs. To address this, we first jointly train the Sample-Net and the Relight-Net on our images (downsampled by a factor of 4 due to memory constraints) to identify 8 optimal directions from the 302 total directions of the light stage. Using those 8 optimal directions, we then retrain the Relight-Net using the full-resolution images from our training data, as prescribed in Xu et al. [2018]. Table 1 shows that this approach works poorly on our task. This may be because this technique is built around 8 fixed input images and is naturally disadvantaged compared to our approach, which is able to use any of the 302 light stage images as input. We therefore also evaluate a variant of Xu et al. [2018] where we use the same active-set selection approach used by our model to select the images used to train their Relight-Net. By using our active-set selection approach (Sec. 3.1) this baseline is able to better reason about local information, which improves performance as shown in Tab. 1. However, this baseline still results in flickering artifacts when rendering with moving lights, because (unlike our approach) it is sensitive to the aliasing induced when images leave and enter the active set.

We also evaluate Deep Reflectance Fields [Meka et al. 2019] for our task, which is also outperformed by our model. This is likely because their model is specifically designed for fast and approximate video relighting and uses only two images as input, while our model has access to the entire OLAT scan and is designed to prioritize high-quality rendering.
Fig. 7. Here we compare the performance of our model against linear blending as we reduce n, the number of lights in our light stage (rows: (a) ours, (b) linear blending; the rightmost column shows the ground truth from the complete stage). As we decrease the number of available lights, the quality of our model's rendered shadow degrades slowly. Linear blending, in contrast, is unable to produce an accurate rendering even with access to all lights.
Fig. 8. The image quality of relighting algorithms gradually degrades as we remove lights from the light stage (plotted as DSSIM against the number of lights, for our model and for linear blending). However, our algorithm is able to retain image quality to a greater extent with fewer lights compared to naive linear blending.
An interesting question in light transport acquisition is how many images (light samples) are needed to reconstruct the full light transport function. To address this question, we present an experiment in which we remove some lights from our training set and use only this subsampled data during training and inference. We reduce the number of lights n on the light stage (while maintaining a uniform distribution on the sphere), and correspondingly reduce the number of candidates m and the active set size k. Image quality on the complete validation dataset (with all 302 lights) as a function of the number of subsampled training/input lights is shown in Fig. 8. As expected, relighting quality decreases as we remove the lights, but we see that the rendering quality of our method decreases more slowly than that of linear blending. This can also be observed in Fig. 7, where we present relit renderings using these subsampled light stages. We see that removing lights reduces accuracy compared to the ground truth, but that our synthesized shadows remain relatively sharp; ghosting artifacts only appear at the smallest numbers of lights. During test time, our model can also produce accurate shadows and sharp highlights. Please refer to our supplementary video for our qualitative comparison.

5 APPLICATIONS

A key benefit of our method is the ability to "super-resolve" an OLAT scan with virtual lights at a higher resolution than the original light stage data, thereby enabling continuous high-frequency relighting with an essentially continuous lighting distribution (or equivalently, with a light stage whose sampling frequency is unbounded). In this section, we present three applications of this idea.
Precise Directional Light Relighting.
Traditional image-based relighting methods produce accurate results near the observed lights of the stage, but may introduce ghosting effects or inaccurate shadows when no observed light is nearby. In Fig. 9 we interpolate images between two lights on the stage. As shown in the second and third rows, linear blending and Xu et al. [2018] with adaptive sampling do not produce realistic results, and always contain multiple superposed shadows or highlights. The shadows produced by Meka et al. [2019] are sharp, but do not move smoothly when the light moves. In contrast, our method is able to produce sharp and realistic images for arbitrary light directions: highlights and cast shadows move smoothly as we change the light direction, and our results have comparable sharpness to the (non-interpolated) ground-truth images that are available.
Fig. 9. Here we produce interpolated images corresponding to "virtual" lights between two real lights of the light stage (columns run from a captured image under light A, through interpolations between the captured lights, to a captured image under light B). Our model (a) produces renderings where sharp shadows and accurate highlights move realistically. Linear blending (b) and Xu et al. [2018] with adaptive sampling (c) result in ghosting artifacts and duplicated highlights. The results from Meka et al. [2019] (d) contain blurry highlights and shadows with unrealistic motion.

Fig. 10. Our model (a) is able to produce accurate relighting results under high-frequency environments by super-resolving the light stage before performing image-based relighting [Debevec et al. 2000]. Using the light stage data as-provided (b) results in ghosting.
High Frequency Environment Relighting.
OLAT scans captured from a light stage can be linearly blended to reproduce images that appear to have been captured under a specific environment. The pixel values of the environment map are usually distributed to the nearest or neighboring lights on the light stage for blending. This traditional approach may cause ghosting artifacts in shadows and specularities, due to the finite sampling of light directions on the light stage. Although this ghosting is hardly noticeable when the lighting is low-frequency, it can be significant when the environment contains high-frequency lighting, such as the sun in the sky. These ghosting artifacts can be ameliorated by using our model. Given an environment map, our algorithm can predict the image corresponding to the light direction of each pixel in the environment map. By taking a linear combination of all such images (weighted by their pixel values and solid angles), we are able to produce a rendering that matches the sampling resolution of the environment map. As shown in Fig. 10, this approach produces images with sharp shadows and minimal ghosting when given a high-frequency environment, while linear blending does not. In this example, we use an environment resolution of 256 × 128, i.e., 32,768 lights. Please see our video for more environment relighting results.
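Below is a sketch of this environment rendering, assuming a latitude-longitude environment map and a `render` callable that maps a unit light direction to an image; the axis conventions and names are ours. In practice the tens of thousands of per-texel queries would be batched on the GPU and reused across environments.

```python
import numpy as np

def relight_with_environment(render, env):
    """Render under an environment map by querying the model once per
    texel and summing, with each rendering weighted by the texel's
    radiance and solid angle. env: (h, w, 3) lat-long map."""
    h, w, _ = env.shape
    theta = (np.arange(h) + 0.5) / h * np.pi            # polar angle
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi        # azimuth
    d_omega = (np.pi / h) * (2.0 * np.pi / w) * np.sin(theta)  # per row
    out = 0.0
    for i in range(h):
        for j in range(w):
            l = np.array([np.sin(theta[i]) * np.cos(phi[j]),
                          np.sin(theta[i]) * np.sin(phi[j]),
                          np.cos(theta[i])])             # unit direction
            out = out + render(l) * (env[i, j] * d_omega[i])
    return out
```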
We now analyze the relationship between the image quality gain from our model and the frequency of the lighting. Specifically, we evaluate for which environments, and at what frequencies, our algorithm will be required for accurate rendering, and conversely how our model performs in low-frequency lighting environments where previous solutions are adequate. For this purpose, we use one OLAT scan, and render it under 380 high-quality indoor and outdoor environment maps (downloaded from hdrihaven.com) using both our model and linear blending. We then measure the image quality gain from our model by computing the DSSIM value between our rendering and that from linear blending. We measure the frequency of the environmental lighting by decomposing it into spherical harmonics (up to degree 50), and finding the degree below which 90% of the energy can be recovered.

Fig. 11. In the top figure, each blue dot represents a lighting environment. We render a portrait under this environment using both linear blending and our method, and measure the image difference using SSIM to evaluate the quality gain of our algorithm. The image quality improvement produced by our model becomes more apparent when the environment map has more high-frequency variation. In the bottom figure, we compare the rendered images using our model and linear blending under environment maps with different frequencies. Our model produces similar results to linear blending when the lighting variation is low frequency (left columns). As the lighting variation becomes higher frequency, our model produces better renderings with fewer artifacts and sharper shadows (right columns).

As shown in Fig. 11, the benefit of using our model becomes larger as the frequency of the environment increases. For low-frequency lighting (up to degree-15 spherical harmonics), our model produces almost identical results compared to the traditional linear blending method. This is a desired property, showing that our method reduces gracefully to linear blending for low-frequency lighting, and thus produces high-quality results for any low- or high-frequency environment. As the frequency of the lighting becomes higher, the renderings of our model contain sharper and more accurate shadows without ghosting artifacts. Note that there is some variation among the environment maps, as expected; even a very high-frequency environment could coincidentally have its brightest lights aligned with one of the lights in the light stage, leading to low error in linear blending and comparable results to our method. Nevertheless, the trend is clear in Fig. 11, with many high-frequency environments requiring our algorithm for lighting super-resolution. According to the plot, we conclude that our model is necessary when the light frequency is equal to or larger than about 20, which means more than 21² = 441 basis functions are needed to recover the lighting. This number is of the same order as the number of lights in the stage (n = 302).
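The frequency measure used in Fig. 11 can be reproduced with a direct (if slow) spherical-harmonic projection; here is a sketch using scipy, assuming a monochrome (e.g., luminance) latitude-longitude map:

```python
import numpy as np
from scipy.special import sph_harm

def lighting_frequency(env, max_deg=50, energy_frac=0.9):
    """Smallest SH degree L such that degrees 0..L capture `energy_frac`
    of the environment's energy. env: (h, w) monochrome lat-long map."""
    h, w = env.shape
    theta = (np.arange(h) + 0.5) / h * np.pi           # polar
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi       # azimuth
    pp, tt = np.meshgrid(phi, theta)                   # both (h, w)
    d_omega = (np.pi / h) * (2.0 * np.pi / w) * np.sin(tt)
    energy = []
    for l in range(max_deg + 1):
        e = 0.0
        for m in range(-l, l + 1):
            y = sph_harm(m, l, pp, tt)     # scipy order: (m, l, azim, polar)
            c = np.sum(env * np.conj(y) * d_omega)     # projection coeff
            e += np.abs(c) ** 2
        energy.append(e)                   # energy in degree l
    cum = np.cumsum(energy)
    return int(np.searchsorted(cum, energy_frac * cum[-1]))
```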
Lighting Softness Control.

Our model's ability to render images under arbitrary light directions also allows us to control the softness of the shadow. Given a light direction, we can densely synthesize images corresponding to the light directions around it, and average those images together to produce a rendering with realistic soft shadows (the sampling radius of these lights determines the softness of the resulting shadow). As shown in Fig. 12, our model is able to synthesize realistic shadows with controllable softness, which is not possible using traditional linear blending methods.

Fig. 12. Soft shadows can be rendered by synthesizing and averaging images corresponding to directional light sources within some area on the sphere (columns show our full image, followed by crops with increasing shadow radius). Soft shadows rendered by our method (a) are more realistic and contain fewer ghosting artifacts than those rendered using linear blending (b).
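A sketch of this softness control: average renderings over directions drawn from a spherical cap around the key light, with the cap's angular radius controlling softness. The sampling scheme is ours and is only approximately uniform for small radii.

```python
import numpy as np

def soft_shadow_render(render, light_dir, radius, n_samples=64, rng=None):
    """Average model renderings over directions within `radius` radians
    of `light_dir` (cf. Fig. 12); larger radius gives softer shadows."""
    rng = rng or np.random.default_rng()
    l = light_dir / np.linalg.norm(light_dir)
    # Build an orthonormal frame (u, v, l) around the key light direction.
    up = np.array([0., 1., 0.]) if abs(l[1]) < 0.9 else np.array([1., 0., 0.])
    u = np.cross(up, l); u /= np.linalg.norm(u)
    v = np.cross(l, u)
    out = 0.0
    for _ in range(n_samples):
        # Sample inside a spherical cap; sqrt gives roughly uniform
        # area coverage for small angular radii.
        ang = radius * np.sqrt(rng.uniform())
        azi = rng.uniform(0.0, 2.0 * np.pi)
        d = (np.cos(ang) * l
             + np.sin(ang) * (np.cos(azi) * u + np.sin(azi) * v))
        out = out + render(d)
    return out / n_samples
```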
6 CONCLUSION

The light stage is a crucial tool for enabling the image-based relighting of human subjects in novel environments. But as we have demonstrated, light stage scans are undersampled with respect to the angle of incident light, which means that synthesizing virtual lights by simply combining images results in ghosting on shadows and specular highlights. We have presented a learning-based solution for super-resolving light stage scans, thereby allowing us to create a "virtual" light stage with a much higher angular lighting resolution, which allows us to render accurate shadows and highlights in high-frequency environment maps. Our network works by embedding input images from the light stage into a learned space where network activations can then be averaged, and decoding those activations according to some query light direction to reconstruct an image. In constructing this model, we have identified two critical issues: an overly regular sampling pattern in light stage training data, and aliasing introduced when pooling activations of a set of nearest neighbors. These issues are addressed through our use of a dropout-like supersampling of neighbors in our active set, and our alias-free pooling technique. By combining ideas from conventional linear interpolation with the expressive power of deep neural networks, our model is able to produce renderings where shadows and highlights move smoothly as a function of the light direction.

This work is by no means the final word for the task of light stage super-resolution or image-based rendering. Approaches similar to ours could be applied to other general light transport acquisition problems, to other physical scanning setups, or to other kinds of objects besides human subjects. Though our network can work on inputs with different image resolutions, GPU memory has been a major bottleneck in applying our approach to images with much higher resolutions, such as 4K. A much more memory-efficient approach to light stage super-resolution will be needed for production-level usage in the visual effects industry. Though we exclusively pursue the one-light-at-a-time light stage scanning approach, alternative patterns where multiple lights are active simultaneously could be explored, which may enable a sparser light stage design. Though the undersampling of the light stage is self-evident in our visualizations, it may be interesting to develop a formal theory of this undersampling with respect to materials and camera resolution, so as to understand what degree of undersampling can be tolerated in the limit. We have made a first step in this direction with the graph in Fig. 11. Finally, it would be interesting to extend our approach to enable the synthesis of novel viewpoints in addition to lighting directions. We believe that light stage super-resolution represents an exciting direction for future research, and has the potential to further decrease the time and resource constraints required for reproducing accurate high-frequency relighting effects.

ACKNOWLEDGMENTS
ACKNOWLEDGMENTS

This work was supported in part by NSF grants 1617234 and 1703957, ONR grants N000141712687 and N000142012529, a Google Fellowship, the Ronald L. Graham Chair, and the UC San Diego Center for Visual Computing. We thank the anonymous reviewers for their valuable feedback.