Generative Modelling of BRDF Textures from Flash Images
PHILIPP HENZLER,
University College London, United Kingdom
VALENTIN DESCHAINTRE,
Imperial College London, United Kingdom
NILOY J. MITRA,
University College London and Adobe Research, United Kingdom
TOBIAS RITSCHEL,
University College London, United Kingdom
Fig. 1. Random samples from our generative model of BRDF maps (e.g., circular insets), assigned to a 3D object of a shoe. Training samples were obtained using flash images. Our model reveals a BRDF material space that can be sampled from, projected to, and interpolated across.
We learn a latent space for easy capture, semantic editing, consistent interpolation, and efficient reproduction of visual material appearance. When users provide a photo of a stationary natural material captured under flashlight illumination, it is converted in milliseconds into a latent material code. In a second step, conditioned on the material code, our method, again in milliseconds, produces an infinite and diverse spatial field of BRDF model parameters (diffuse albedo, specular albedo, roughness, normals) that allows rendering in complex scenes and illuminations, matching the appearance of the input picture. Technically, we jointly embed all flash images into a latent space using a convolutional encoder, and – conditioned on these latent codes – convert random spatial fields into fields of BRDF parameters using a convolutional neural network (CNN). We condition these BRDF parameters to match the visual characteristics (statistics and spectra of visual features) of the input under matching light. A user study confirms that the semantics of the latent material space agree with user expectations and compares our approach favorably to previous work.

Project webpage: https://henzler.github.io/publication/neuralmaterial/.
Rendering realistic images for feature films or computer games requires adequate simulation of light transport. Besides geometry and illumination, an important factor is material appearance.
Authors’ addresses: Philipp Henzler, Department of Computer Science, University College London, United Kingdom, [email protected]; Valentin Deschaintre, Imperial College London, United Kingdom, [email protected]; Niloy J. Mitra, University College London and Adobe Research, United Kingdom, [email protected]; Tobias Ritschel, Department of Computer Science, University College London, United Kingdom, [email protected].
Material appearance has three aspects of variation: First, when view or light direction change, reflected light changes. The physics of this process are well understood and can be simulated provided the input parameters are available. Second, behavior changes across materials. For example, leather reacts differently to light or view changes than paper would, yet different forms of leather clearly share visual properties, i.e., they form a (material) space. Third, appearance details depend on spatial position. Different locations in one leather exemplar behave differently but share the same visual statistics [Portilla and Simoncelli 2000], i.e., they form a texture.

Classic computer graphics captures appearance by reflection models, which predict, for a given i) light-view configuration, ii) material, and iii) spatial position, how much light is reflected. Typically, the first variation (light and view direction) is covered by BRDF models, analytic expressions, such as Phong [1975], which map the light and view direction vectors to scalar reflectance. The second variation (material) is covered by choosing BRDF model parameters, such as Phong glossiness. In practice, it can be difficult, given a desired appearance, to choose those parameters, e.g., how to make a leather look more like fabric. Measuring BRDF model parameters requires complex capture hardware. The third variation (spatial) is addressed by storing multiple BRDF model parameters in images of finite size – often referred to as Spatially-varying Bi-directional Reflectance Distribution Function (svBRDF) maps – or writing functional expressions to reproduce their behaviour. It is even more challenging to choose these parameters to produce something coherent like leather, in particular over a large or even infinite spatial extent. Additionally, storing all these values requires substantial memory, and programming functional expressions to mimic their statistics requires expert skills and time. Capturing the spatial variation of BRDF model parameters over space using sensors is even more involved [Schwartz et al. 2013].

Addressing those issues, we provide a reflectance model that jointly generalizes across all three of these axes. Instead of using analytic parameters, we parametrize appearance by latent codes from a learned space, allowing for generation, interpolation and semantic editing. Without involved capture equipment, these codes are produced by presenting the system a simple 2D flash image, which is then embedded into the latent space. Avoiding storing any finite image texture, we learn a second mapping to produce svBRDF maps from an infinite random field (noise) on-the-fly, conditioned on the latent material code. Instead of using any advanced capture device for learning, flash images will be the only supervision we use.

Fig. 2.
BRDF space. From a flash image, which contains sparse observations across material, space and view-light (left), we map to a latent code (middle), so that changes in that code can be decoded to enable (right) material synthesis (holding material fixed and moving spatially), material morphing (holding space and view/light fixed and changing material), or classical shading and material generation (points in the latent space).
A use case of our approach is shown in Fig. 1. First, a user provides a “flash image”, a photo of a flat material sample under flash illumination. This sample is embedded as a code into a latent space using a CNN. This code is a very compact description that can be manipulated, e.g., interpolated with a different material or changed along semantic axes. Previous work has considered spaces of BRDFs [Matusik 2003] or RGB textures [Matusik et al. 2005] but no spatial variation of reflectance. Conditioned on this code, a second CNN can generate an infinite field of BRDF parameters to be directly used in rendering. For training, we solely rely on real flash images. The key insight, inspired by Aittala et al. [2016], is that these flash images reveal the same material at different image locations – they are stationary – but under different view and light angles. Using this constraint, Aittala et al. [2016] were able to decompose a single input image to capture the parameters of a material model that could then be rendered under novel view or light directions. However, this covers only part of the generalization we are targeting: it generalizes across view and light, but not across location or material. Further, they require performing an optimization for every exemplar, requiring time in the order of hours, while ours requires only milliseconds.

In summary, our main contributions are
• a generative model of a BRDF material texture space;
• generation of maps that are diverse over the infinite plane;
• a flash image dataset of materials enabling our training with no BRDF parameter supervision or synthetic data; and
• feed-forward embedding of exemplar flash images into the space in milliseconds using a CNN, without per-material optimization.
Our implementation will be publicly available upon acceptance.

Our work has background in texture analysis, appearance modelling and design spaces.
A classic definition of texture is due to Julesz [1965]: a texture is an image full of features that in some representation have the same statistics.
Portilla and Simoncelli [2000] provided a practical method to compute representations to do statistics on. They use linear filters on multiple scales.

Perlin [1985] was the first to capture the fractal [Mandelbrot 1983] stochastic variation of appearance in a model applicable to Computer Graphics. His approach is simple – a linear combination of noise at different scales – yet extremely powerful, and has led to extensive use in computer games and production rendering. Wavelet noise [Cook and DeRose 2005] moved this idea further by band-limiting the noise that is combined. Such methods can be used to model materials, e.g., for gloss maps, bump maps, etc. Regrettably, they do not provide a solution to acquire a texture from an exemplar, which is left to manual adjustment.

To generate textures from exemplars, non-parametric sampling [Efros and Leung 1999], vector quantization [Wei and Levoy 2000], optimization [Kwatra et al. 2005] and nearest-neighbour field synthesis (PatchMatch [Barnes et al. 2009]) have been proposed. These are used less in production rendering or games, due to issues in computational scalability and lack of intuitive control.

The word “texture” can ambiguously refer to stochastic variation as well as to images attached to surfaces to localize color features. Here, we focus on stochastic variation in the sense of Julesz [1965] or Portilla and Simoncelli [2000].

Our approach builds on deep learning-based texture synthesis [Bergmann et al. 2017; Gatys et al. 2015; Johnson et al. 2016; Karras et al. 2019; Sendik and Cohen-Or 2017; Shaham et al. 2019; Simonyan and Zisserman 2014; Ulyanov et al. 2016, 2017; Zhou et al. 2018]. Besides applying them to BRDFs, we extend these ideas. We detail their background in Sec. 3.2.
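Returning to the procedural-noise idea above: the "linear combination of noise at different scales" attributed to Perlin [1985] can be made concrete with a short, generic sketch. This is not code from any of the cited works (Perlin uses gradient noise with smooth interpolation); it only illustrates the octave-summation principle, and all names and constants are ours.

```python
import numpy as np
from scipy.ndimage import zoom

def fractal_noise(size=256, octaves=5, persistence=0.5, seed=0):
    """Linear combination of noise at different scales (octave/fractal noise)."""
    rng = np.random.default_rng(seed)
    result = np.zeros((size, size))
    amplitude, total = 1.0, 0.0
    for o in range(octaves):
        res = 2 ** (o + 2)                        # coarse random grid for this octave
        coarse = rng.random((res, res))
        layer = zoom(coarse, size / res, order=3)[:size, :size]  # smooth upsampling
        result += amplitude * layer
        total += amplitude
        amplitude *= persistence                  # finer scales contribute less
    return result / total                         # roughly in [0, 1]

bump = fractal_noise()                            # e.g., usable as a bump or gloss map
```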
Representing appearance in simulation-based graphics has been an active research field for decades. The survey by Guarnera et al. [2016] presents a detailed discussion of the many different material models and BRDF acquisition approaches. In our method, we use a state-of-the-art micro-facet BRDF model [Cook and Torrance 1982], and focus on deep-learning based material modelling and acquisition. Many methods have been proposed to acquire materials using data-driven approaches. Matusik [2003] proposed a data-driven
BRDF model, using a linear model. More recently, Rematas et al. [2016] extract reflectance maps from 2D images using a CNN trained in a supervised manner. Material and illumination acquisition was further explored by Georgoulis et al. [2017]. Deschaintre et al. [2018] proposed a rendering loss to capture svBRDFs from flash images. Pix2Pix [Isola et al. 2016] inspired many other approaches for image-to-image translation that translate RGB pixels to material attributes [Li et al. 2017, 2018a,b]. Most work now includes a differentiable shading step [Deschaintre et al. 2018, 2019; Guo et al. 2020; Li et al. 2018b; Liu et al. 2017], as we do here. Gao et al. [2019] proposed to use a post-optimization in an encoded latent space, improving an initial material estimation and comparing renderings of their results directly to their input pictures. All these approaches focus on capturing a single instance of an svBRDF map, but with little or no editing options across materials (semantic space) or generalization across the spatial domain (diversity). Kuznetsov et al. [2019] modeled an important aspect of physically-based rendering using adversarial training: the Normal Distribution Function (NDF). We take key inspiration from Aittala et al. [2016], who extended the approach of Gatys et al. [2015] to generate svBRDF parameter maps from a single picture of a stationary material exemplar.
Spaces of color [Nguyen et al. 2015], materials [Matusik 2003], textures [Matusik et al. 2005], faces [Blanz et al. 1999], human bodies [Allen et al. 2003], and more have been useful in graphics. Closest to our approach, Matusik et al. [2005] devised a space of textures. Here, users can interpolate combinations of visually similar textures. They warp all pairs of exemplars to each other and construct graph edges for interpolation when there is evidence that the warping is admissible. To blend between them, histogram adjustments are made. Consequently, interpolation between exemplars does not take a straight path in pixel space from one to the other, but traverses only valid regions. Photoshape [Park et al. 2019] learns the relation of given material textures over a database of 3D objects. Serrano et al. [2016] allow users to semantically control captured BRDF data. They represent BRDFs using the derived principal component basis [Matusik 2003] and map the first five PCA components to semantic attributes through learned radial basis functions. We take inspiration from this body of work and build a continuous space allowing svBRDF generation, interpolation and semantic control.
Aittala et al. [2016] leveraged the fact that a single flash image of a stationary material reveals multiple realizations of the same reflectance statistics under different light and view angles. We will now recall a simplified definition of their approach.

A flash image is an RGB image of a material, taken in conditions where a mobile phone’s flashlight is the dominant light source. We write 𝐿(x) to denote the RGB radiance value at every image location x. The image is assumed to be taken in – or converted to – linear space. The illumination is assumed to be an isotropic point light collocated with the camera. Further, the geometry is assumed to be flat and captured in a fronto-parallel setting, so that the direction from the light to every image location in 3D is known. Self-occlusion and parallax are assumed to be negligible.

Reflectance is parameterized by a material, represented as a function 𝑓(x) mapping image location x to shading model parameters, including the shading normal. Under these conditions, the reflected radiance is 𝐿 = R𝑓, where R is the differentiable rendering operator, mapping shading model parameters to radiance.

A material 𝑓 explains a flash image 𝐿 if it is visually similar to 𝐿 when rendered. Unfortunately, without further constraints, there are many materials that explain the flash image. This ambiguity can be resolved when assuming that the material 𝑓 is stationary. We say a material is stationary if local statistics of the shading model parameters 𝑓 do not change across the image.

Putting both – visual similarity and stationarity – together, the best material from a family 𝑓_𝜃 of material mapping functions parameterized by a vector 𝜃 can be found by minimizing a loss:

\mathcal{L}'(\theta) := \mathcal{T}(L, \mathcal{R} f_\theta) + \lambda\, \mathcal{S}(f_\theta),   (1)

where T(𝐿, R𝑓_𝜃) is a metric of visual similarity between a flash image 𝐿 and a differentiable rendering R𝑓_𝜃, and S(𝑓) is a measure of stationarity of a material map 𝑓.

Comparison, T, of two textures is not trivial. Pixel-by-pixel comparison is typically not suitable to evaluate visual statistical similarity. Instead, images are mapped to a feature space in which images that are perceived as similar textures map to similar points [Portilla and Simoncelli 2000]. Different mappings are possible here. Classic texture synthesis [Heeger and Bergen 1995] uses moments of linear multi-scale filter responses. Gatys et al. [2015] proposed to use Gram matrices of non-linear multi-scale filter responses, such as those of the VGG [Simonyan and Zisserman 2014] detection network. Such a characterization of textures was also used by Aittala et al. [2016] and, without loss of generality, will be used and extended in this work as well.

While 𝑓 is stationary, 𝐿 is not and has features at different random positions x, which are compared as

\mathcal{T}'(L_1, L_2) := \mathbb{E}_{\mathbf{x},\, s} \left[\, \left| \mathcal{P}(L_1, \mathbf{x}, s) - \mathcal{P}(L_2, \mathbf{x}, s) \right|_1 \,\right],   (2)

where P(𝐿, x, s) crops a patch at the location x, resamples it to the input resolution of VGG [Simonyan and Zisserman 2014], and computes the filter responses and ultimately their Gram matrices:

\mathcal{P}(L, \mathbf{x}, s) := \mathrm{gram}(\mathrm{vgg}(\mathrm{resample}(\mathrm{crop}(L, \mathbf{x}, s)))).   (3)

Here, 𝑠 is a crop scale parameter chosen by the user.

Minimizing Eq. 1 with respect to 𝜃 for a given 𝐿 results in a material. 𝑓_𝜃 can represent different approaches. Aittala et al. [2016] directly use the pixel basis and optimize discrete material maps for 𝜃 using a single input flash image 𝐿.
With their approach, optimizing for both visual similarity and stationarity is challenging. In particular, the reflectance stationarity term S requires a “spectral preconditioning” step, as explained in their paper. Instead, we propose a novel approach in the form of a neural model 𝑓 that is (i) defined on the entire infinite domain and (ii) stationary by construction. Thus, our loss does not include the stationarity term. Next, we describe how to generate RGB textures using deep learning (Sec. 3.2), before combining the two components (flash images and neural texture spaces) into our approach (Sec. 4).
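To make the patch statistic of Eq. 2 and 3 concrete, the following is a minimal PyTorch sketch. The VGG layer cut-off, crop distribution, Gram normalization and single-scale sampling are illustrative assumptions, not the exact implementation of Aittala et al. [2016] or of our method.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 feature extractor; using the layers up to relu4_1 is an assumption,
# the text only states that VGG features are used.
_vgg = vgg19(pretrained=True).features[:21].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def gram(feat):
    """Gram matrix of a feature map, normalized by its number of elements."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def patch_statistic(image, x, y, size, out_res=224):
    """P(L, x, s): crop a patch at (x, y), resample it to VGG resolution,
    compute VGG features and their Gram matrix (Eq. 3)."""
    patch = image[:, :, y:y + size, x:x + size]
    patch = F.interpolate(patch, size=(out_res, out_res),
                          mode='bilinear', align_corners=False)
    return gram(_vgg(patch))

def texture_distance(img_a, img_b, n_samples=4, size=192):
    """Monte-Carlo estimate of Eq. 2: expected L1 distance between patch
    statistics at random positions (single crop scale here for brevity)."""
    dist = 0.0
    for _ in range(n_samples):
        x = torch.randint(0, img_a.shape[-1] - size, (1,)).item()
        y = torch.randint(0, img_a.shape[-2] - size, (1,)).item()
        dist = dist + (patch_statistic(img_a, x, y, size)
                       - patch_statistic(img_b, x, y, size)).abs().mean()
    return dist / n_samples
```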
Fig. 3.
Our architecture. Starting from an exemplar (top-left), training encodes the image to a compact latent space variable 𝑧. Additionally, a random infinite field is cropped with the same spatial dimensions as the flash input image. The noise crop is then reshaped based on a convolutional U-Net architecture. Each convolution in the network is followed by an Adaptive Instance Normalization (AdaIN) layer [Huang and Belongie 2017] reshaping the statistics (mean 𝜇 and standard deviation 𝜎) of features. A learned affine transformation per layer maps 𝑧 to the desired 𝜇’s and 𝜎’s. The output of the network are the diffuse, specular, roughness and normal parameters of an svBRDF that, when rendered (using a flash light), looks the same as the input.

Julesz [1965] defines textures by their feature statistics across space. The choice of which features to use remains an important open problem. With the advent of deep learning, Gatys et al. [2015] suggested to use Gram matrices of the activations of filters learned in deep convolutional neural networks (e.g., VGG [Simonyan and Zisserman 2014]) for neural style transfer. By optimizing directly over pixel values, their method can produce images with the desired texture properties. Aittala et al. [2016] rely on the same statistics to recover material parameters of stationary materials. These methods require a different optimization to be run for each different material.

Another group of recent methods [Johnson et al. 2016; Ulyanov et al. 2016] introduce neural networks capable of producing RGB textures directly, in milliseconds. While these approaches use a network to generate the textures, they are still limited to the input texture exemplar, and do not show further variations in their results. Ulyanov et al. [2017] introduced an explicit diversity term enforcing results in a batch to be different. This diversity is however limited and restricts the result quality: they add a diversity term to the loss, but the architecture is not modified to enable it. Alternatively, adversarial training has been used to capture the essence of textures [Bergmann et al. 2017; Shaham et al. 2019], including the non-stationary case [Zhou et al. 2018] or even within a single image [Shaham et al. 2019]. In particular, StyleGAN [Karras et al. 2019] generates images with details by transforming noise using adversarial training. As opposed to these approaches, we do not rely on challenging adversarial training, but directly learn a neural network to produce the target VGG statistics.

Instead of incentivizing stationarity in the loss, Henzler et al. [2020] suggest a learnable texture representation that is built on mapping an infinite noise field to a field that has the statistics of the exemplar texture. Their method is a point operation, implemented by an MLP that is fed exclusively with noise sampled at different scales, as done by Perlin [1985]. By explicitly preventing the network from accessing any absolute position, this approach is stationary by design.
An overview of our approach is shown in Fig. 3. We train a neural network which acts as a decoder 𝑓_𝜃(x | z) that generalizes across spatial positions x as well as across materials, expressed as latent material codes z. The material codes z are produced by an encoder 𝑔 with z = 𝑔(𝐿). Both encoder and decoder are trained jointly over a set of flash images using the loss:

\mathcal{L}(\theta) := \mathbb{E}_L \left[\, \mathcal{T}\!\left(L,\ \mathcal{R} f_\theta(\cdot \mid g_\theta(L))\right) \,\right].   (4)

This equation is an adapted version of Eq. 1 to fit our objectives. In particular, we propose a neural network-based 𝑓_𝜃, leveraging the expectation E_𝐿 over all flash images in our training set and removing the stationarity term, as it is enforced by construction in our network architecture. We describe the flash image encoder 𝑔 (Sec. 4.1), the material texture decoder 𝑓 (Sec. 4.2) and the texture comparison model T (Sec. 4.3) next.

The encoder 𝑔 maps a flash image 𝐿 to a latent code z. It is implemented using ResNet-50 [He et al. 2016]. The ResNet starts at a resolution of 512 × 384 and maps to a compact latent code. Empirically, we find an 𝑛_z = 64-dimensional latent code to be sufficient.
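A minimal sketch of such an encoder follows, anticipating the VAE training described later (the encoder outputs a 64-D mean and variance); the exact head layout and the use of a pretrained backbone are our assumptions, not details given in the text.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FlashImageEncoder(nn.Module):
    """Maps a 512x384 flash image L to a latent material code z (sketch only)."""
    def __init__(self, n_z=64):
        super().__init__()
        backbone = resnet50(pretrained=True)
        backbone.fc = nn.Identity()              # keep the 2048-D pooled feature
        self.backbone = backbone
        self.to_mu = nn.Linear(2048, n_z)        # mean of q(z | L)
        self.to_logvar = nn.Linear(2048, n_z)    # log-variance of q(z | L)

    def forward(self, image):
        h = self.backbone(image)                 # (B, 2048)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        if self.training:                        # reparameterization trick
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        else:                                    # the mean is used at test time
            z = mu
        return z, mu, logvar

encoder = FlashImageEncoder()
z, mu, logvar = encoder(torch.randn(1, 3, 384, 512))   # (H, W) = (384, 512)
```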
The decoder 𝑓 maps a location x, conditioned on a material code z, to a set of material parameter maps. The key idea is to provide the architecture with access to noise, as previously done for style transfer [Huang and Belongie 2017], generative modelling [Karras et al. 2019] or 3D texturing [Henzler et al. 2020]. In particular, we sample rectangular patches with edge lengths of 𝑛 × 𝑚 pixels from an infinite random field and convert them to material maps using a U-net architecture [Ronneberger et al. 2015]. The U-net starts at the desired output resolution 𝑛 × 𝑚 and reduces resolution four times using max-pooling before bilinearly upsampling to 𝑛 × 𝑚 again. Let 𝐹 be the array of input features. For 𝑖 = 0, the first level, in full resolution, these features are sampled from the random field at x. Then the output features 𝐹′ are

F' := \mathrm{adaIN}\!\left(\mathrm{conv}_\theta(F),\ \mathbf{T}_\theta\, \mathbf{z}\right),   (5)

where adaIN is Adaptive Instance Normalization (AdaIN) [Huang and Belongie 2017], conv a convolution (including up- or down-sampling and a ReLU non-linearity), and T is an affine transformation. Components with learned parameters are denoted with subscript 𝜃. We use AdaIN as defined by Huang and Belongie [2017],

\mathrm{adaIN}(\boldsymbol{\xi}, \{\boldsymbol{\mu}, \boldsymbol{\sigma}\}) := \frac{\boldsymbol{\sigma}}{\boldsymbol{\sigma}_F}\,(\boldsymbol{\xi} - \boldsymbol{\mu}_F) + \boldsymbol{\mu},   (6)

which remaps the input features with mean 𝝁_F and variance 𝝈_F to a distribution with mean 𝝁 and variance 𝝈.

The affine mapping T is implemented as (𝑛_z + 1) × (2 × 𝑐_𝑖) matrices multiplied with the latent code, augmented with a 1 in dimension 𝑛_z + 1. Here, 2 × 𝑐_𝑖 represents a different mean and variance for each channel dimension 𝑐_𝑖 of a layer. It provides the link between the material code and the noise statistics. Each material code z is mapped to a mean and variance that control how the statistics of features are shaped at every channel on every layer of the decoder.

Our control of noise statistics from latent codes is similar to StyleGAN [Karras et al. 2019], with the key difference that we do not sample noise at different scales, but learn how to produce noise with different, complex characteristics at different scales by repeatedly filtering it from high resolutions.
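As an illustration of Eq. 5 and 6, the sketch below shows one decoder block: a convolution followed by AdaIN whose target mean and standard deviation are an affine function of the latent code. Layer sizes and names are illustrative assumptions; note that an nn.Linear with bias corresponds to the (𝑛_z + 1)-augmented matrix described above.

```python
import torch
import torch.nn as nn

class AdaINConv(nn.Module):
    """One decoder block: conv + ReLU, then AdaIN driven by the material code z."""
    def __init__(self, in_ch, out_ch, n_z=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.affine = nn.Linear(n_z, 2 * out_ch)   # T: z -> (mu, sigma) per channel

    def forward(self, feat, z, eps=1e-5):
        xi = torch.relu(self.conv(feat))            # conv_theta(F)
        mu, sigma = self.affine(z).chunk(2, dim=1)  # target statistics from z
        mu = mu[:, :, None, None]
        sigma = sigma[:, :, None, None]
        mu_f = xi.mean(dim=(2, 3), keepdim=True)    # statistics of current features
        std_f = xi.std(dim=(2, 3), keepdim=True) + eps
        return sigma * (xi - mu_f) / std_f + mu     # adaIN(xi, {mu, sigma}), Eq. 6

block = AdaINConv(in_ch=1, out_ch=32)
noise = torch.randn(1, 1, 256, 256)                 # crop of the random field
z = torch.randn(1, 64)                               # latent material code
out = block(noise, z)                                # (1, 32, 256, 256)
```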
We propose to compare images based on a loss that accounts both for the statistics of activations [Gatys et al. 2015] and their spectrum [Liu et al. 2016] on multiple scales across the infinite spatial field,

\mathcal{T}(L_1, L_2) := \mathbb{E}_{\mathbf{x} \sim \mathbb{R}^2,\ s \sim \mathcal{U}(s_{\min}, s_{\max})} \left[\, \left| \mathcal{P}(L_1, \mathbf{x}, s) - \mathcal{P}(L_2, \mathbf{x}, s) \right|_1 \,\right],   (7)

\mathcal{P}(L, \mathbf{x}, s) := \mathrm{gram}(V) + \lambda \cdot \mathrm{powerSpectrum}(V),   (8)

V := \mathrm{vgg}(\mathrm{resample}(\mathrm{crop}(L, \mathbf{x}, s))).   (9)
Spectrum.
VGG Gram matrices capture the frequency of a feature’s appearance, unless it forms a regular pattern [Liu et al. 2016]. Liu et al. [2016] proposed to include the L1 norm of the power spectra of two RGB images into the texture metric for learning a single texture. We therefore combine both ideas and use VGG, but do not limit ourselves to its statistics, and also leverage its spectrum.
Scale.
As VGG works at a specific scale of features it was trained for, it behaves differently at different scales. As the material should be visually plausible regardless of its scale, we include multiple scales 𝑠, ranging from a minimum 𝑠_min to a maximum 𝑠_max.
Infinity.
Expectation over the infinite plane is implemented by simply training with different random seeds for the noise field. This results in the generation of statistically similar, but locally different variations of materials. As, given a seed, every generated patch is a coherent material, combinations of multiple patches remain coherent as well. This allows querying an endless, seamless and diverse stream of patches without repetition. It also avoids overfitting and is crucial to guarantee stationarity by design.
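A sketch of the full comparison of Eq. 7–9, combining the Gram term with a power-spectrum term over random crop positions and scales, is given below. The VGG layer, FFT normalization, crop-size range and the weight λ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

_vgg = vgg19(pretrained=True).features[:21].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def power_spectrum(feat):
    """Squared FFT magnitude of each feature channel (un-normalized sketch)."""
    return torch.fft.fft2(feat).abs() ** 2 / feat.numel()

def P(image, x, y, size, lam=1e-3, out_res=224):
    """Eq. 8/9: V = vgg(resample(crop(L, x, s))), P = gram(V) + lam * spectrum(V)."""
    patch = image[:, :, y:y + size, x:x + size]
    patch = F.interpolate(patch, (out_res, out_res), mode='bilinear',
                          align_corners=False)
    V = _vgg(patch)
    return gram(V), lam * power_spectrum(V)

def texture_loss(render, exemplar, n_samples=4, s_min=128, s_max=256):
    """Monte-Carlo estimate of Eq. 7 over random positions and crop scales."""
    loss = 0.0
    for _ in range(n_samples):
        size = int(torch.randint(s_min, s_max + 1, (1,)))
        xa = int(torch.randint(0, render.shape[-1] - size, (1,)))
        ya = int(torch.randint(0, render.shape[-2] - size, (1,)))
        xb = int(torch.randint(0, exemplar.shape[-1] - size, (1,)))
        yb = int(torch.randint(0, exemplar.shape[-2] - size, (1,)))
        ga, sa = P(render, xa, ya, size)
        gb, sb = P(exemplar, xb, yb, size)
        loss = loss + (ga - gb).abs().mean() + (sa - sb).abs().mean()
    return loss / n_samples
```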
We use the Cook-Torrance [1982] micro-facet BRDF model, with Smith’s geometric term [Heitz 2014], Schlick’s Fresnel [1994] and GGX [Walter et al. 2007]. Hence, parameters are diffuse albedo, specular albedo, roughness and height, i.e., eight dimensions. For practical purposes we differentiate height into normal vectors.
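To illustrate the rendering operator R under the flash assumptions of Sec. 3 (point light collocated with the camera over a fronto-parallel plane), here is a simplified per-pixel shading sketch. The roughness-to-alpha mapping, the Schlick-GGX form of the Smith term, and the omission of light falloff and tone mapping are our assumptions; the actual differentiable renderer may differ.

```python
import math
import torch

def ggx_shade(diffuse, specular, roughness, normal, wi, wo):
    """Cook-Torrance microfacet shading: GGX distribution, Smith geometry,
    Schlick Fresnel. wi, wo are unit vectors towards light and camera (last dim 3)."""
    alpha = roughness ** 2                         # one common roughness-to-alpha convention
    h = torch.nn.functional.normalize(wi + wo, dim=-1)
    ndl = (normal * wi).sum(-1, keepdim=True).clamp(min=1e-4)
    ndv = (normal * wo).sum(-1, keepdim=True).clamp(min=1e-4)
    ndh = (normal * h).sum(-1, keepdim=True).clamp(min=1e-4)
    vdh = (wo * h).sum(-1, keepdim=True).clamp(min=1e-4)

    # GGX normal distribution D
    d = alpha ** 2 / (math.pi * ((ndh ** 2) * (alpha ** 2 - 1) + 1) ** 2)
    # Smith masking-shadowing G (Schlick-GGX approximation)
    k = alpha / 2
    g = (ndl / (ndl * (1 - k) + k)) * (ndv / (ndv * (1 - k) + k))
    # Schlick Fresnel F
    f = specular + (1 - specular) * (1 - vdh) ** 5

    spec = d * g * f / (4 * ndl * ndv)
    return (diffuse / math.pi + spec) * ndl        # outgoing radiance for unit light

# Flash setup: light and camera are collocated, so the light direction equals the
# view direction at every pixel of the fronto-parallel plane.
H, W = 4, 4
normal = torch.tensor([0.0, 0.0, 1.0]).expand(H, W, 3)
wi = wo = torch.nn.functional.normalize(torch.tensor([0.1, 0.0, 1.0]), dim=-1).expand(H, W, 3)
img = ggx_shade(diffuse=torch.full((H, W, 3), 0.5),
                specular=torch.full((H, W, 3), 0.04),
                roughness=torch.full((H, W, 1), 0.3),
                normal=normal, wi=wi, wo=wo)
```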
Many flash images entail a slight rotation, as it can be difficult to take a completely fronto-parallel image. This was handled by Aittala et al. [2016] by locating the brightest pixel and cropping, but we found our, more abstract, training to struggle with such a solution. Instead, we add a horizontal and a vertical rotation angle to the parameter vector generated from the latent code (not shown in Fig. 3 for clarity). During training, these are used to rotate the plane, including the normals. During testing, these angles are not applied, meaning that the output is in the local space of the exemplar.

We use a branch of the encoder to perform the alignment task, allowing to jointly align all images based on their visual features. A byproduct is that the encoder returns the angular distance to fronto-parallelity, which could be used to guide users during capture.
For better generalization, we train the system as a Variational Auto-encoder (VAE) [Kingma and Welling 2013]: instead of mapping to a single 64-D latent material code, the encoder 𝑔 maps to a 64-D mean and variance vector, from which we sample during training. At test time we use the 64-D mean. We have omitted the additional VAE terms enforcing z to be normally distributed from Eq. 4 and Fig. 3 for clarity. We trained our model for 4 days, with a batch size of 4, on an NVIDIA Tesla V100 using the ADAM optimizer with a learning rate of 1e-4 and weight decay 1e-5.

Our approach allows embedding flash pictures into material codes that can be decoded to corresponding materials. Furthermore, every latent material code results in a plausible material. Yet this is not enough for intuitive manipulation. Given a material code z, we now explore how we can modify it to perform high-level semantic control, e.g., making a material more leather-like.

We manually label our dataset with the classes wood, leather, stone, fabric, metal, rubber, dirty, paint, and plastic. For every class 𝐶 we compute the direction n_𝐶 in latent space that is normal to a plane best separating the in-class and out-class exemplars. If 𝐶 is the set of all exemplar indices in the class, this is

\mathbf{n}_C = \mathbb{E}_{i \in C}[\mathbf{z}_i] - \mathbb{E}_{i \notin C}[\mathbf{z}_i],   (10)

a vector pointing from the barycenter of class 𝐶 to the barycenter of the non-class. This allows relating semantic parameters to a position in the latent space. For a material with code z to become “more like 𝐶”, it is changed to z + 𝛼 n_𝐶, where 𝛼 is a user control.
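A sketch of this semantic control (Eq. 10): computing a class direction from labelled latent codes and shifting a material code along it. Variable names and the stand-in data are illustrative only.

```python
import torch

def class_direction(codes, labels, target_class):
    """Eq. 10: n_C = E_{i in C}[z_i] - E_{i not in C}[z_i].

    codes:  (N, 64) latent material codes of the labelled exemplars
    labels: length-N list of class names (e.g., 'leather', 'fabric', ...)
    """
    in_class = torch.tensor([l == target_class for l in labels])
    return codes[in_class].mean(0) - codes[~in_class].mean(0)

def make_more_like(z, n_c, alpha=1.0):
    """Move a material code towards class C: z + alpha * n_C (alpha is the user control)."""
    return z + alpha * n_c

# usage: push an embedded flash image towards 'leather' (stand-in data below)
codes = torch.randn(356, 64)
labels = ['leather'] * 100 + ['fabric'] * 256
n_leather = class_direction(codes, labels, 'leather')
z_edited = make_more_like(torch.randn(64), n_leather, alpha=0.5)
```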
Fig. 4. Infinite spatial extent.
Our learned BRDF space can be sampled at any query (x, y) location without producing visible repetition artifacts. The network architecture, by construction, does not require any special boundary alignment to avoid tiling artifacts. All results are sampled from the learned BRDF space.
We created an extended dataset of flash images for testing and training of our approach. It comprises 356 images of various types of materials we captured using smartphones. We reserve 50 images for testing, augmented by all images from Aittala et al. [2016]. Hence, no image from Aittala et al. [2016] was seen during training. For our control experiment, we leverage the semantic labels described in Sec. 4.7.
For quantitative analysis we compare our approach to a range of alternative methods with respect to different metrics.
Methods.
We compare to three methods: (i) Aittala et al. [2016], (ii) Deschaintre et al. [2018], and (iii) Zhao et al. [2020]. All methods were applied to our test data and re-rendered under the same lighting conditions with the material model described in their respective paper. Examples are shown in Fig. 5.
Metrics.
We quantify style, diversity, and computational speed. Style is captured by the L1 difference of the VGG Gram matrices of rendered images. A good agreement in style has a low number, i.e., less is better. Diversity is captured as the mean pairwise VGG L1 across all realizations. Here, more is better. Finally, we measure compute time, where less is better.

Table 1. Comparison of different methods.

Method                       Style Error   Diversity   Time
Aittala et al. [2016]        1.25          0           <1 h
Deschaintre et al. [2018]    1.0           0           <0.5 s
Zhao et al. [2020]           1.01          0           <1 h
Ours
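The two quality metrics can be sketched as follows; the exact VGG layers and normalization behind the reported numbers are not specified here and are our assumptions.

```python
import torch
from torchvision.models import vgg19

_vgg = vgg19(pretrained=True).features[:21].eval()

def _gram(img):
    f = _vgg(img)
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_error(rendering, reference):
    """Style: L1 difference of VGG Gram matrices of two rendered images."""
    return (_gram(rendering) - _gram(reference)).abs().mean().item()

def diversity(realizations):
    """Diversity: mean pairwise VGG L1 across all realizations of one material."""
    feats = [_vgg(r) for r in realizations]
    dists = [(fa - fb).abs().mean().item()
             for i, fa in enumerate(feats) for fb in feats[i + 1:]]
    return sum(dists) / max(len(dists), 1)
```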
Results.
Results are shown in Fig. 7. We see that our method shows a lower style error than other approaches, including Deschaintre et al. [2018], which decomposes individual exemplars into shading channels. The approaches of Zhao et al. [2020] and Aittala et al. [2016] target an objective closer to ours, the statistical reproduction of a material, but demonstrate a higher style error.

Note that only our approach and Aittala et al. [2016] produce a stationary result. Only our result is diverse, i.e., we produce infinitely many realizations of a texture while all other approaches produce only one. Thus, diversity is zero for all others, while it is a constant for ours. We summarize this in Tbl. 1.

In terms of computational speed, Aittala et al. [2016] and Zhao et al. [2020] both require around one hour of compute time for each per-exemplar optimization to produce a stationary texture. Deschaintre et al. [2018] is as fast as our approach (less than 0.5 s), executing only a decoder, but does not produce a stationary texture. Our approach benefits from the same speed and allows generating diverse, infinite materials. We measured an average speed per forward pass of 0.196 s over all test steps.

Fig. 5. Comparison with competitors. Comparison of our results to methods that decompose an image into svBRDF parameters (columns) for different flash input images (rows). Nearest Neighbor (NN) recovers the closest material in our training set in terms of style. Note that Deschaintre et al. [2018] additionally require 4-channel BRDF measurements for training, while Zhao et al., Aittala et al. and ours can operate directly on flash images. However, Zhao et al. and Aittala et al. are overfitted to single exemplars, whereas our method is able to re-synthesize unseen flash images from our test data set.
Decomposition.
Qualitative examples of our decomposition are depicted in Fig. 6. We see that our approach captures a range of different materials, reproducing the style, yet being diverse enough to produce an infinite field of values, as shown in Fig. 4.
Fig. 6.
Qualitative results. Results produced by our method for six materials. For each material, we show the re-rendering on the left, followed by insets showing crops of the diffuse, specular, roughness, and normal channels. At inference time, our network runs in milliseconds.
Interpolation.
Interpolation of latent BRDF texture codes is shown in Fig. 8. Please see the caption for additional discussion.
Semantic control.
Fig. 9 shows an application exploring the latent space with semantic controls. Please see the caption for a discussion.
Texturing.
Fig. 1 shows examples of applying maps produced by our approach to a complex 3D shape. Thanks to our generative model, we can produce an endless stream of sneakers, without spatial or material repetition. At any point, a user can perform semantic control using our space, randomize the generated material, generate new materials from pictures, or interpolate between new materials and old ones.

Fig. 7. Ordered error (Zipf) plot comparing four methods (colors). Each curve is produced by sorting the errors of each method. The median can be compared in the middle. The vertical axis is logarithmic style error, less is better. The horizontal axis is rank (low error left, high error right).
Generation.
As any point in our space is a material, we can simply sample randomly to produce a coverage of all materials available in the space. Fig. 10 shows random samples from this space applied to a set of 3D cubes. Note that no materials are similar and all look plausible, with spatially varying appearance.
Interactive demo.
The visual quality is best inspected in our interactive WebGL demo in the supplemental. It allows exploring the space by relighting, changing the random seed and visualizing individual BRDF model channels and their combinations. The same package contains all channels of all materials as images. See the accompanying video for a demonstration of our interactive interface.
We perform a user experiment to better understand the properties of our method. In particular, we ask three questions: (i) Would a semantic change in our space be recognized as such? (ii) Is our semantic labeling consistent with that of users? (iii) Which method’s result (between [Aittala et al. 2016; Deschaintre et al. 2018; Zhao et al. 2020] and ours) is most similar to the input flash picture?

Fig. 8. Interpolation of latent BRDF texture codes. In each row, a left and a right latent code z_1 and z_2 are obtained by encoding two flash images, respectively. The intermediate, continuous field of BRDF parameters is computed by interpolating, in the learned BRDF space, from z_1 to z_2 and conditioning the decoder Convolutional Neural Network (CNN) with the intermediate code. The result is lit with a fronto-parallel light source to demonstrate the changes in appearance. For comparison, the last row shows image-space linear interpolation – compare against the second-to-last row showing latent-space interpolation.

Methods.
Subjects (Ss) anonymously completed an online form without time limit. Secondary variables relating to experience and task difficulty were recorded. The form was composed of three parts. The first part showed 10 pairs of renderings of materials. One random image in the pair was a re-synthesis from a flash image input, the other a re-synthesis from the same image, but changed by a unit step in the direction of a semantic class. In a Two-alternative Forced Choice (2AFC), Ss were asked to indicate which of the images is semantically more similar to the label.

The second part presented 20 images, and Ss had to choose from the list of our material classes (Fabric, Leather, Metal, Paint, Plastic, Stone, Wood) which class the material belongs to.

In the third part, 12 sets of five images each were shown, where the first one was marked as the reference and the four others were re-synthesis results by ours and the three other methods (Deschaintre et al. [2018], Aittala et al. [2016] and Zhao et al. [2020]), shown in random order. Ss were asked to say which result matched the reference best.
Analysis.
We performed binomial and 𝑡-tests against the hypotheses that Ss answered randomly, resp. that they had identical preferences. Effect size for correct/incorrect and preference answers is reported as a percentage. A total of 𝑁 =
57 Ss completed the experiment. They quantified their own experience in Computer Graphics as 3.06, in general science as 3.39, and in the arts as 2.86 on a five-point Likert scale.
Results.
For the first task, we can reject the hypothesis that Ss answered randomly: fabric was identified at 100%, followed by leather at 95% and wood at 96.92%, trailed by paint at 63.43% and stone at 60.47% (significant for all classes). For the third task, our result was significantly preferred (𝑡-test) over all others: it was chosen best in 47.7% of the cases, followed by Deschaintre et al. [2018] at 23.3%, with Zhao et al. [2020] trailing at under 10%.

Fig. 9. Semantic control. After a flash exemplar (first column) has been embedded into our learned BRDF space (second column), its latent material code can be manipulated such that certain semantic attributes are enhanced or suppressed (last three columns). Note that the semantic change maximizes the attribute, while retaining properties not in conflict. A move towards fabric in the first and fourth examples makes both become more like fabric (e.g., less shiny) but retains color. A move towards rubber imposes some structure on the normal map, but is compatible with both shiny and rough.

Fig. 10. Random samples from the space of all materials.
We have presented an approach to generate a space of BRDF textures using a small set of flash images in an unsupervised way. Comparing this approach to the literature shows competitive metrics for re-renderings, with the unique advantage of being able to generate an infinite and diverse field of BRDF parameters.

In future work, more refined differentiable rendering material models could be used to derive stochastic textures, including shadows, displacement, or scattering, as well as volumetric or time-varying textures. We believe that our framework will represent a stepping stone for more complex infinite and diverse BRDF acquisition as well as their semantic manipulation.
REFERENCES
Miika Aittala, Timo Aila, and Jaakko Lehtinen. 2016. Reflectance modeling by neural texture synthesis. ACM Trans Graph (Proc. SIGGRAPH) 35, 4 (2016), 65.
Brett Allen, Brian Curless, and Zoran Popović. 2003. The space of human body shapes: reconstruction and parameterization from range scans. ACM Trans Graph (Proc. SIGGRAPH) 22, 3 (2003), 587–94.
Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. 2009. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans Graph (Proc. SIGGRAPH) 28, 3 (2009), 24.
Urs Bergmann, Nikolay Jetchev, and Roland Vollgraf. 2017. Learning texture manifolds with the periodic spatial GAN. In JMLR. 469–477.
Volker Blanz, Thomas Vetter, et al. 1999. A morphable model for the synthesis of 3D faces. In SIGGRAPH, Vol. 99. 187–194.
Robert L Cook and Tony DeRose. 2005. Wavelet noise. ACM Trans Graph 24, 3 (2005), 803–11.
Robert L Cook and Kenneth E Torrance. 1982. A reflectance model for computer graphics. ACM Trans Graph 1, 1 (1982), 7–24.
Valentin Deschaintre, Miika Aittala, Fredo Durand, George Drettakis, and Adrien Bousseau. 2018. Single-image SVBRDF capture with a rendering-aware deep network. ACM Trans Graph (Proc. SIGGRAPH) 37, 4 (2018), 128.
Valentin Deschaintre, Miika Aittala, Frédo Durand, George Drettakis, and Adrien Bousseau. 2019. Flexible SVBRDF Capture with a Multi-Image Deep Network. Comp Graph Forum 38, 4 (2019), 1–13.
Alexei A Efros and Thomas K Leung. 1999. Texture synthesis by non-parametric sampling. In ICCV, Vol. 2.
Duan Gao, Xiao Li, Yue Dong, Pieter Peers, Kun Xu, and Xin Tong. 2019. Deep inverse rendering for high-resolution SVBRDF estimation from an arbitrary number of images. ACM Trans Graph (Proc. SIGGRAPH Asia) 38, 4 (2019), 134.
Leon Gatys, Alexander S Ecker, and Matthias Bethge. 2015. Texture synthesis using convolutional neural networks. In NIPS.
Stamatios Georgoulis, Konstantinos Rematas, Tobias Ritschel, Efstratios Gavves, Mario Fritz, Luc Van Gool, and Tinne Tuytelaars. 2017. Reflectance and natural illumination from single-material specular objects using deep learning. PAMI 40, 8 (2017), 1932–1947.
Darya Guarnera, Giuseppe Claudio Guarnera, Abhijeet Ghosh, Cornelia Denk, and Mashhuda Glencross. 2016. BRDF representation and acquisition. In Comp Graph Forum, Vol. 35. 625–650.
Yu Guo, Cameron Smith, Miloš Hašan, Kalyan Sunkavalli, and Shuang Zhao. 2020. MaterialGAN: Reflectance Capture using a Generative SVBRDF Model. ACM Trans Graph 39, 6 (2020).
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–8.
David J Heeger and James R Bergen. 1995. Pyramid-based texture analysis/synthesis. In Proc. SIGGRAPH. 229–38.
Eric Heitz. 2014. Understanding the masking-shadowing function in microfacet-based BRDFs. (2014).
Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. 2020. Learning a Neural 3D Texture Space from 2D Exemplars. CVPR (2020).
Xun Huang and Serge Belongie. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV. 1501–10.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2016. Image-to-Image Translation with Conditional Adversarial Networks. arXiv (2016).
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In ECCV.
Bela Julesz. 1965. Texture and visual perception. Scientific American 212, 2 (1965).
Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR. 4401–10.
Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
Alexandr Kuznetsov, Milos Hasan, Zexiang Xu, Ling-Qi Yan, Bruce Walter, Nima Khademi Kalantari, Steve Marschner, and Ravi Ramamoorthi. 2019. Learning generative models for rendering specular microgeometry. ACM Trans Graph (Proc. SIGGRAPH Asia) 38, 6 (2019).
Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. 2005. Texture optimization for example-based synthesis. In ACM Trans Graph, Vol. 24.
Xiao Li, Yue Dong, Pieter Peers, and Xin Tong. 2017. Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Trans Graph 36, 4 (2017), 45.
Zhengqin Li, Kalyan Sunkavalli, and Manmohan Chandraker. 2018a. Materials for masses: SVBRDF acquisition with a single mobile phone image. In ECCV. 72–87.
Zhengqin Li, Zexiang Xu, Ravi Ramamoorthi, Kalyan Sunkavalli, and Manmohan Chandraker. 2018b. Learning to reconstruct shape and spatially-varying reflectance from a single image. In SIGGRAPH Asia 2018. ACM, 269.
Guilin Liu, Duygu Ceylan, Ersin Yumer, Jimei Yang, and Jyh-Ming Lien. 2017. Material editing using a physically based rendering network. In ICCV. 2261–2269.
Gang Liu, Yann Gousseau, and Gui-Song Xia. 2016. Texture synthesis through convolutional neural networks and spectrum constraints. In ICPR. 3234–9.
Benoit B Mandelbrot. 1983. The fractal geometry of nature. Vol. 173. WH Freeman, New York.
W Matusik. 2003. A data-driven reflectance model. ACM Trans Graph 22, 3 (2003), 759–769.
Wojciech Matusik, Matthias Zwicker, and Frédo Durand. 2005. Texture design using a simplicial complex of morphable textures. ACM Trans Graph (Proc. SIGGRAPH) 24, 3 (2005).
Chuong H. Nguyen, Tobias Ritschel, and Hans-Peter Seidel. 2015. Data-driven color manifolds. ACM Trans Graph 34, 2 (2015), 20.
Keunhong Park, Konstantinos Rematas, Ali Farhadi, and Steven M Seitz. 2019. Photoshape: photorealistic materials for large-scale shape collections. ACM Trans Graph 37, 6 (2019), 192.
Ken Perlin. 1985. An Image Synthesizer. SIGGRAPH Comput. Graph. 19, 3 (1985).
Bui Tuong Phong. 1975. Illumination for computer generated pictures. Commun. ACM 18, 6 (1975), 311–317.
Javier Portilla and Eero P Simoncelli. 2000. A parametric texture model based on joint statistics of complex wavelet coefficients. Int J Comp Vis 40, 1 (2000), 49–70.
K. Rematas, T. Ritschel, M. Fritz, E. Gavves, and T. Tuytelaars. 2016. Deep Reflectance Maps. In CVPR.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI. 234–41.
Christophe Schlick. 1994. An inexpensive BRDF model for physically-based rendering. In Comp Graph Forum, Vol. 13. 233–246.
Christopher Schwartz, Ralf Sarlette, Michael Weinmann, and Reinhard Klein. 2013. DOME II: A Parallelized BTF Acquisition System. In MAM. 25–31.
Omry Sendik and Daniel Cohen-Or. 2017. Deep correlations for texture synthesis. ACM Trans Graph 36, 5 (2017), 161.
Ana Serrano, Diego Gutierrez, Karol Myszkowski, Hans-Peter Seidel, and Belen Masia. 2016. An intuitive control space for material appearance. ACM Trans Graph (Proc. SIGGRAPH Asia) 35, 6 (2016).
Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. 2019. SinGAN: Learning a generative model from a single natural image. In ICCV.
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S Lempitsky. 2016. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. In ICML. 4.
Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2017. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In CVPR.
Bruce Walter, Stephen R Marschner, Hongsong Li, and Kenneth E Torrance. 2007. Microfacet models for refraction through rough surfaces. In Proc. EGSR. 195–206.
Li-Yi Wei and Marc Levoy. 2000. Fast texture synthesis using tree-structured vector quantization. In Proc. SIGGRAPH.
Yezi Zhao, Beibei Wang, Yanning Xu, Zheng Zeng, Lu Wang, and Nicolas Holzschuch. 2020. Joint SVBRDF Recovery and Synthesis From a Single Image using an Unsupervised Generative Adversarial Network. In EGSR.
Yang Zhou, Zhen Zhu, Xiang Bai, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. 2018. Non-stationary texture synthesis by adversarial expansion. arXiv preprint arXiv:1805.04487.