Flexible SVBRDF Capture with a Multi-Image Deep Network
Eurographics Symposium on Rendering 2019 / T. Boubekeur and P. Sen (Guest Editors)
Volume 38 (2019), Number 4
Valentin Deschaintre¹,³, Miika Aittala², Fredo Durand², George Drettakis¹ and Adrien Bousseau¹
¹Université Côte d’Azur, Inria   ²MIT CSAIL   ³Optis for Ansys
Figure 1:
Our deep learning method for SVBRDF capture supports a variable number of input photographs taken with uncalibrated light-view directions (a, rectified). While a single image is enough to obtain a first plausible estimate of the SVBRDF maps, more images provide new cues to our method, improving its prediction. In this example, adding images reveals fine normal variations (b), removes highlight residuals in the diffuse albedo (c), and reveals the difference of roughness between the stone, the stripe, and the thin pattern (d). Please see supplemental materials for animated re-renderings.
Abstract
Empowered by deep learning, recent methods for material capture can estimate a spatially-varying reflectance from a single photograph. Such lightweight capture is in stark contrast with the tens or hundreds of pictures required by traditional optimization-based approaches. However, a single image is often simply not enough to observe the rich appearance of real-world materials. We present a deep-learning method capable of estimating material appearance from a variable number of uncalibrated and unordered pictures captured with a handheld camera and flash. Thanks to an order-independent fusing layer, this architecture extracts the most useful information from each picture, while benefiting from strong priors learned from data. The method can handle both view and light direction variation without calibration. We show how our method improves its prediction with the number of input pictures, and reaches high quality reconstructions with as little as 1 to 10 images, a sweet spot between existing single-image and complex multi-image approaches.

CCS Concepts
• Computing methodologies → Reflectance modeling; Image processing;
Keywords:
Material capture, Appearance capture, SVBRDF, Deep learning

This paper is a low resolution version of our full paper, available here: …
1. Introduction
The appearance of most real-world materials depends on both viewing and lighting directions, which makes their capture a challenging task. While early methods achieved faithful capture by densely sampling the view-light conditions [Mca02, DVGNK99], this exhaustive strategy requires expensive and time-consuming hardware setups. In contrast, lightweight methods attempt to only perform a few measurements, but require strong prior knowledge on the solution to fill the gaps. In particular, recent methods produce convincing spatially-varying material appearances from a single flash photograph thanks to deep neural networks trained from large quantities of synthetic material renderings [DAD∗18]. However, in many cases a single photograph simply does not contain enough information to make a good inference for a given material. Figure 1(b-d) illustrates typical failure cases of single-image methods, where the flash lighting provides insufficient cues of the relief of the surface, and leaves highlight residuals in the diffuse albedo and specular maps. Only additional pictures with side views or lights reveal fine geometry and reflectance details.

We propose a method that leverages the information provided by additional pictures, while retaining a lightweight capture procedure. When few images are provided, our method harnesses the power of learned priors to make an educated guess; when additional images are available, our method improves its prediction to best explain all observations. We achieve this flexibility thanks to a deep network architecture capable of processing an arbitrary number of input images with uncalibrated light-view directions. The key observation is that such image sets are fundamentally unstructured. They do not have a meaningful ordering, nor a pre-determined type of content for any given input. Following this reasoning, we adopt a pooling-based network architecture that treats the inputs in a perfectly order-invariant manner, giving it powerful means to extract and combine subtle joint appearance cues scattered across the inputs.

Our flexible approach allows us to capture spatially-varying materials with 1 to 10 images, providing a significant improvement over single-image methods while requiring far fewer images and a less constrained capture than traditional multi-image methods.
2. Related Work
We first review prior work on appearance capture, focusing on methods working with few images. We then discuss deep learning methods capable of processing multiple images.
Appearance capture.
The problem of acquiring real-world appearance has been extensively studied in computer graphics and computer vision, as surveyed by Guarnera et al. [GGG∗16]. Lightweight methods compensate for sparse measurements with prior knowledge on materials and lighting [LN16, DCP∗14, RRFG17]. In particular, deep learning is nowadays the method of choice to automatically build priors from data, which allows the most recent methods to only use one picture to recover a plausible estimate of the spatially-varying appearance of flat samples [LDPT17, YLD∗18, DAD∗18, LSC18], and even the geometry of isolated objects [LXR∗18]. Compared to these single-image methods [DAD∗18, LSC18], our multi-image approach produces results of increasing quality as more images are provided. Compared to optimization-based multi-image methods [RPG16, HSL∗17], our approach requires far fewer images and a less constrained capture procedure.

Multi-image deep networks.
Many computer vision tasks become better posed as the number of observations increases, which calls for methods capable of handling a variable number of input images. For example, classical optimization approaches assign a data fitting error to each observation and minimize their sum. However, implementing an analogous strategy in a deep learning context remains a challenge because most neural network architectures, such as the popular U-Net used in prior work [LDPT17, DAD∗18], expect a fixed number of inputs. Choy et al. [CXG∗16] faced this challenge in the context of multi-view 3D reconstruction and proposed a recurrent architecture that processes a sequence of images to progressively refine its prediction. However, the drawback of such an approach is that the solution still depends on the order in which the images are provided to the method: the first image has a great impact on the overall solution, while subsequent images tend to only modify details. This observation motivated Wiles et al. [WZ17] to process each image of a multi-view set through separate encoders before combining their features through max-pooling, an order-agnostic operation. Aittala et al. [AD18] and Chen et al. [CHW18] apply a similar strategy to the problems of burst image deblurring and photometric stereo, respectively. In the field of geometry processing, Qi et al. [QSMG17] also apply a pooling scheme for deep learning on point sets, and show that such an architecture is a universal approximator for functions whose inputs are set-valued. Zaheer et al. [ZKR∗17] further analyze the theoretical properties of pooling architectures and demonstrate superior performance over recurrent architectures on multiple tasks involving loosely-structured set-valued input data. We build on this family of work to offer a method that processes images captured in an arbitrary order, and that can handle uncalibrated viewing and lighting conditions.
3. Capture Setup
We designed our method to take as input a variable number of images, captured under uncalibrated light and view directions. Figure 2 shows the capture setup we experimented with, where we place the material sample within a white paper frame and capture it by holding a smartphone in one hand and a flash in the other, or by using the flash of the smartphone as a co-located light source. Similarly to Paterson et al. [PCF05] and Hui et al. [HSL∗17], we use the four corners of the frame to compute a homography that rectifies the images, and crop the paper pixels away before processing the images with our method. We capture pictures of 3456 × … pixels, and of … × 256 pixels after cropping.
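For illustration, this rectification step can be implemented with a standard homography estimated from the four frame corners. The sketch below uses OpenCV; the function name, corner ordering and output resolution are illustrative choices, not the paper's exact pipeline.

```python
import cv2
import numpy as np

def rectify_sample(image, corners_px, out_size=1024):
    """Warp the material sample seen through the paper frame to a square crop.

    `corners_px` lists the pixel coordinates of the four inner corners of the
    white paper frame, in clockwise order starting from the top-left corner
    (corner detection or manual annotation is outside the scope of this sketch).
    """
    src = np.asarray(corners_px, dtype=np.float32)
    dst = np.float32([[0, 0], [out_size - 1, 0],
                      [out_size - 1, out_size - 1], [0, out_size - 1]])
    H = cv2.getPerspectiveTransform(src, dst)   # homography from 4 point correspondences
    return cv2.warpPerspective(image, H, (out_size, out_size))

# Example (hypothetical corner coordinates):
# rectified = rectify_sample(cv2.imread("photo.jpg"),
#                            [(812, 540), (2710, 598), (2650, 2480), (760, 2390)])
```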
4. Multi-Image Material Inference
Our goal is to estimate the spatially-varying bi-directional reflectance distribution function (SVBRDF) of a flat material sample given a few aligned pictures of that sample. We adopt a parametric representation of the SVBRDF in the form of four maps representing the per-pixel surface normal, diffuse albedo, specular albedo and specular roughness of a Cook-Torrance [CT82] BRDF model.

The core of our method is a multi-image network composed of several copies of a single-image network, as illustrated in Figure 3. The number of copies is dynamically chosen to match the number of inputs provided by the user (or the training sample). All copies are identical in their architecture and weights, meaning that each input receives an identical treatment by its respective network copy. The findings from each single-image network are then fused by a common order-agnostic pooling layer before being subsequently processed into a joint estimate of the SVBRDF.

We now detail the single-image network and the fusion mechanism, before describing the loss we use to compare the network prediction against a ground-truth SVBRDF. We detail our generation of synthetic training data in Section 5. The source code of our network architecture along with pre-trained weights is available at https://team.inria.fr/graphdeco/projects/multi-materials/
Figure 2: We use a simple paper frame to help register pictures taken from different viewpoints. We use either a single smartphone and its flash, or two smartphones to cover a larger set of view/light configurations.

We base our architecture on the single-image network of Deschaintre et al. [DAD∗18], which processes the input image through a U-Net encoder-decoder complemented by a parallel track of global features, as in prior single-image work [DAD∗18, LSC18]. Since we are targeting a lightweight capture scenario, we do not provide the network with any explicit knowledge of the light and view position. We rather count on the network to deduce related information from visual cues.
The second part of our architecture fuses the multiple feature maps produced by the single-image networks to form a single feature map of fixed size.

Specifically, the encoder-decoder track of each single-image network produces a 256 × 256 × 64 intermediate feature map corresponding to the input image it processed. These maps are fused into a single joint feature map of the same size by picking the maximum value reported by any single-image network at each pixel and feature channel. This max-pooling procedure gives every single-image network equal means to contribute to the content of the joint feature map in a perfectly order-independent manner [AD18, CHW18]. The pooled intermediate feature map is finally decoded by 3 layers of convolutions and non-linearities, which provide the network sufficient expressivity to transform the extracted information into four SVBRDF maps. The global features in the fully-connected tracks are max-pooled and decoded in a similar manner. Through end-to-end training, the single-image networks learn to produce features which are meaningful with respect to the pooling operation and useful for reconstructing the final estimate.

While we vary the number of copies of the single-image network between 1 and 5 during training, an important property of this architecture is that it can process an arbitrarily large number of images during testing because all copies share the same weights, and are ultimately fused by the pooling layer to form a fixed-size feature map. In our experiments, we vary the number of input images from 1 to 10 at testing time.
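The order-invariant fusion is easy to express in code. The sketch below is a PyTorch re-formulation (the authors' implementation is in TensorFlow); `single_image_net` and `decoder` are placeholders for the encoder-decoder track and the final convolutions, and the channel counts are only indicative.

```python
import torch
import torch.nn as nn

class MultiImageFusion(nn.Module):
    """Shared-weight per-image encoding followed by channel-wise max-pooling."""

    def __init__(self, single_image_net, decoder):
        super().__init__()
        self.single_image_net = single_image_net   # same weights applied to every input photo
        self.decoder = decoder                     # a few convolutions producing the SVBRDF maps

    def forward(self, images):
        # images: (N, 3, H, W), where N is the variable number of input photographs
        feats = torch.stack([self.single_image_net(img.unsqueeze(0)).squeeze(0)
                             for img in images])   # (N, C, H, W)
        fused = feats.max(dim=0).values            # (C, H, W): per-pixel, per-channel maximum
        return self.decoder(fused.unsqueeze(0))    # the order of the inputs no longer matters

# Minimal usage with stand-in sub-networks (the real tracks are a U-Net and 3 conv layers):
# net = MultiImageFusion(nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 10, 3, padding=1))
# maps = net(torch.rand(4, 3, 256, 256))           # four uncalibrated photos of one sample
```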
We evaluate the quality of the network prediction with a differentiable rendering loss [LSC18, LXR∗18, DAD∗18]: we re-render the predicted and ground-truth SVBRDFs under a number of random view and light configurations and compare the resulting images, using an ℓ1 norm on the logarithmic values of the renderings to compress the high dynamic range of specular peaks. Following Li et al. [LSC18], we complement this rendering loss with four ℓ1 losses, each measuring the difference between one of the predicted maps and its ground-truth counterpart. We found this direct supervision to stabilize training. Our final loss is a weighted mixture of all losses:

L = L_Render + λ (L_Normal + L_Diffuse + L_Specular + L_Roughness),

where λ is a fixed weight.
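A compact sketch of this loss follows, again in PyTorch rather than the authors' TensorFlow code. Here `render_fn`, the list of configurations and the weight `lam` are placeholders, and `log1p` stands in for the logarithmic compression of the renderings.

```python
import torch
import torch.nn.functional as F

def svbrdf_loss(pred_maps, gt_maps, render_fn, lightings, lam=0.1):
    """Rendering loss plus direct per-map L1 supervision (sketch).

    `render_fn(maps, cfg)` re-renders an SVBRDF (dict of maps) under one
    light/view configuration `cfg`; `lightings` is a list of random
    configurations. `lam` is an assumed mixing weight, not the paper's value.
    """
    render_loss = torch.stack([
        F.l1_loss(torch.log1p(render_fn(pred_maps, cfg)),
                  torch.log1p(render_fn(gt_maps, cfg)))
        for cfg in lightings]).mean()

    map_loss = sum(F.l1_loss(pred_maps[k], gt_maps[k])
                   for k in ("normal", "diffuse", "specular", "roughness"))

    return render_loss + lam * map_loss
```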
Figure 3: Overview of our deep network architecture. Each input image is processed by its own copy of the encoder-decoder to produce a feature map. While the number of images and network copies can vary, a pooling layer fuses the output maps to obtain a fixed-size representation of the material, which is then processed by a few convolutional layers to produce the SVBRDF maps.
We train our network for 7 days on an Nvidia GTX 1080 Ti. We let the training run for 1 million iterations with a batch size of 2 and input sizes of 256 × 256 pixels. We use the Adam optimizer [KB15] with a learning rate set to 0.… and β = 0.…
5. Online Generation of Training Data
Following prior work on deep-learning for inverse rendering [RGR∗17, LDPT17, DAD∗18, LSC18, LXR∗18, LCY∗17], we train our network on synthetic data. Generating and storing the large number of renderings needed for training is however costly; prior methods report training datasets of 150,000 and 200,000 images. This practical challenge motivated us to implement an online renderer that generates a new SVBRDF and its multiple renderings at each iteration of the training, yielding up to 2 million training images in practice. We first explain how we generate numerous ground-truth SVBRDFs, before describing the main features of our SVBRDF renderer.
We rely on procedural, artist-designed SVBRDFs to obtain our training data. Starting from a small set of such SVBRDF maps, Deschaintre et al. [DAD∗18] perform data augmentation by computing 20,000 convex combinations of random pairs of SVBRDFs. We follow the same strategy, although we implemented this material mixing within TensorFlow [AAB∗15], so that new combinations can be generated on the fly during training. Our base set contains …,850 SVBRDFs covering common material classes such as plastic, metal, wood, leather, etc., all obtained from Allegorithmic Substance Share [All18].
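For concreteness, the convex-combination augmentation can be sketched as below (NumPy, not the authors' TensorFlow graph); the mixing-weight range and the renormalization of blended normals are assumptions on our part.

```python
import numpy as np

def mix_svbrdfs(maps_a, maps_b, rng=None):
    """Blend two SVBRDFs into a new training sample by a random convex combination.

    Each argument is a dict with 'normal' (H, W, 3), 'diffuse' (H, W, 3),
    'specular' (H, W, 3) and 'roughness' (H, W, 1) maps of identical size.
    """
    rng = rng or np.random.default_rng()
    alpha = rng.uniform(0.1, 0.9)   # one weight per material pair (assumed range)
    mixed = {k: alpha * maps_a[k] + (1.0 - alpha) * maps_b[k] for k in maps_a}
    n = mixed["normal"]
    mixed["normal"] = n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-8)
    return mixed
```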
We implemented our SVBRDF renderer in TensorFlow, so that it can be called at each iteration of the training process. Since our network takes rectified images as input, we do not need to simulate perspective projection of the material sample. Instead, our renderer simply takes as input four SVBRDF maps along with a light and view position, and evaluates the resulting rendering equation at each pixel. We augment this basic renderer with several features that simulate common effects encountered in real-world captures:
Viewing conditions.
We distribute the camera positions over a hemisphere centered on the material sample, and vary its distance by a random amount to allow a casual capture scenario where users may not be able to maintain an exact distance from the target. We also perform random perturbations of the field-of-view (set to 40° by default) to simulate different types of cameras. Finally, we apply a random rotation and scaling to the SVBRDF maps before cropping them to 256 × 256 pixels, which simulates materials of different orientations and scales.
Lighting conditions.
We simulate a flash light as a point light with angular fall-off. We again distribute the light positions over a hemisphere at a random distance to simulate a handheld flash. Other random perturbations include the angular fall-off to simulate different types of flash, the light intensity to simulate varying exposure, and the light color to simulate varying white-balance. Finally, we also include the simulation of a surrounding lighting environment in the form of a second light with random position, intensity and color, which is kept fixed for a given input SVBRDF.
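The random placement of the camera and the flash can be sketched as follows; the distance bounds are illustrative values, not the ones used for training.

```python
import numpy as np

def sample_hemisphere_position(rng, min_dist=1.5, max_dist=4.0):
    """Random position above the (z = 0) material plane, at a random distance.

    Directions are drawn uniformly over the upper hemisphere; the same routine
    can place the camera and the flash independently.
    """
    z = rng.uniform(0.0, 1.0)                  # cos(theta), uniform over hemisphere area
    phi = rng.uniform(0.0, 2.0 * np.pi)
    r = np.sqrt(max(1.0 - z * z, 0.0))
    direction = np.array([r * np.cos(phi), r * np.sin(phi), z])
    return direction * rng.uniform(min_dist, max_dist)

# Example: one random view and one random light per training rendering
# rng = np.random.default_rng(0)
# cam_pos, light_pos = sample_hemisphere_position(rng), sample_hemisphere_position(rng)
```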
Image post-processing.
We have implemented several common image degradations: additive Gaussian noise, clipping of radiance values to 1 to simulate low-dynamic range images, gamma correction and quantization over 8 bits per channel.

While rendering our training data on the fly incurs additional computation, we found that this overhead is compensated by the time gained in data loading. In our experiments, training our system with online data generation takes approximately as much time as training it with pre-computed data stored on disk, making the actual rendering virtually free.
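The sketch below illustrates the kind of per-pixel evaluation performed by such a renderer. It is written in NumPy for readability (the authors' renderer is a differentiable TensorFlow implementation), uses a GGX-style microfacet lobe with a simplified geometry/Fresnel term as a stand-in for the full Cook-Torrance model, and treats the flash fall-off exponent, light power and gamma as illustrative values.

```python
import numpy as np

def render_point_light(maps, light_pos, cam_pos, light_power=3.0, falloff=2.0):
    """Render a flat SVBRDF sample lying in the z = 0 plane under a flash-like point light.

    `maps` holds 'normal' (unit vectors, H x W x 3), 'diffuse' (H x W x 3),
    'specular' (H x W x 3) and 'roughness' (H x W x 1). Pixels span [-1, 1]^2.
    Returns a gamma-corrected LDR image clipped to [0, 1].
    """
    H, W, _ = maps["diffuse"].shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    pos = np.stack([xs, ys, np.zeros_like(xs)], axis=-1)

    def normed(v):
        return v / np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), 1e-8)

    wi = normed(light_pos - pos)                           # direction towards the light
    wo = normed(cam_pos - pos)                             # direction towards the camera
    h = normed(wi + wo)                                    # half vector
    n = maps["normal"]
    ndl = np.clip(np.sum(n * wi, -1, keepdims=True), 0.0, 1.0)
    ndv = np.clip(np.sum(n * wo, -1, keepdims=True), 1e-4, 1.0)
    ndh = np.clip(np.sum(n * h, -1, keepdims=True), 0.0, 1.0)

    alpha2 = np.clip(maps["roughness"], 1e-3, 1.0) ** 4    # GGX alpha^2 with alpha = roughness^2
    D = alpha2 / (np.pi * ((ndh ** 2) * (alpha2 - 1.0) + 1.0) ** 2)
    specular = maps["specular"] * D / (4.0 * ndv)          # geometry and Fresnel terms folded in
    brdf = maps["diffuse"] / np.pi + specular

    dist2 = np.sum((light_pos - pos) ** 2, -1, keepdims=True)
    flash = np.clip(wi[..., 2:3], 0.0, 1.0) ** falloff     # angular fall-off of a downward-facing flash
    radiance = light_power * flash * brdf * ndl / np.maximum(dist2, 1e-4)

    return np.clip(radiance, 0.0, 1.0) ** (1.0 / 2.2)      # clip to LDR and gamma-correct

# Example: img = render_point_light(maps, np.array([0.3, -0.2, 2.0]), np.array([0.0, 0.0, 2.5]))
```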
6. Results and Evaluation
We evaluate our method using a test dataset of 32 ground truth SVBRDFs not present in the set used for training data generation. We also use measured Bidirectional Texture Functions (BTFs) [WGK14] to compare the re-renderings of our predictions to real-world appearances. Finally, we used our method to acquire a set of around 80 real-world materials. Since our method does not assume a controlled lighting, we used either the camera flash or a separate smartphone as the light source for those acquisitions. All results in the figures of the main paper were taken with two phones; please see supplemental for all results and examples acquired with a single phone. Resulting quality is similar in both cases.
A strength of our method is its ability to cope with a variable number of photographs. We first evaluate whether additional images improve the result using synthetic SVBRDFs, for which we have ground truth maps. We measure the error of our prediction by re-rendering our predicted maps under many views and lights, as done by the rendering loss used for training. Figure 4 plots the SSIM similarity metric of these re-renderings averaged over the test set for an increasing number of images, along with the SSIM of the individual SVBRDF maps. While most improvements happen with the first five images, the similarity continues to increase with subsequent inputs, stabilizing at around 10 images. The diffuse albedo is the fastest to stabilize, consistent with the intuition that few measurements suffice to recover low-frequency signals. Surprisingly, the quality of the roughness prediction seems on average independent of the number of images, suggesting that the method struggles to exploit additional information for this quantity. In contrast, the normal prediction improves with each additional input, as also observed in our experiments with real-world data detailed next. We provide RMSE plots of the same experiment as supplemental materials.

Figure 4: SSIM of our predictions with respect to the number of input images, averaged over our synthetic test dataset. The SSIM of re-renderings increases quickly for the first images, before stabilizing at around 10 images. The normal maps strongly benefit from new images. Diffuse and specular albedos also improve with additional inputs, which is not the case of the roughness that remains stable overall. We provide similar RMSE plots as supplemental materials.

Using the same procedure, in Figure 5 we perform an ablation study to evaluate the impact of including random perturbations of the viewing and lighting conditions in the training data. As expected, the network trained without perturbation does not perform as well as our complete method on our test dataset that includes view and light variations similar to those in casual real world capture. We trained both networks for 750,000 iterations for this experiment.

Figure 6 shows our predictions on a measured BTF material from the Bonn database [WGK14], using 1, 2, 3 and 10 inputs. For this material, normals, diffuse albedo and roughness estimations improve with more inputs. In particular, the normal map progressively captures more relief, the diffuse albedo map becomes almost uniform, and the embossed part on the upper right is quickly recognized as shinier than the remaining of the sample.

For a real material capture we performed (Figure 7), we see similar effects: normals are improved with more inputs, and the difference of roughness between different parts is progressively recovered. However, we do not have access to ground truth maps for these real-world captures.

Overall, our results in Figures 4 to 10 and in supplemental material illustrate that our method achieves our goals: adding more pictures greatly improves the results, notably removing artifacts in the diffuse albedo while improving normal estimation. Our method enhances the quality of recovered materials while maintaining a casual capture.
Figure 5: Ablation study. Comparison of SSIM between our method (green) and a restricted version (black) where the network is trained with lighting and viewing directions chosen on a perfect hemisphere, and with all lighting parameters constant (falloff exponent, power, etc.). Our complete method achieves higher SSIM when tested on a dataset with small variations of these parameters, showing that it is robust to such perturbations that are frequent in casual real world capture.

We compare our data-driven approach to a traditional optimization that takes as input multiple images captured under the assumption of known and precisely calibrated light and viewing conditions. Given these conditions, we solve for the SVBRDF maps that minimize the re-rendering error of the input images, as measured by our rendering loss. We further regularize this optimization by augmenting the loss with a total-variation term that favors piecewise-smooth maps. We solve the optimization with the Adam algorithm [KB15]. While the optimization stabilizes after 900K iterations, we let it run for a total of 2M iterations to ensure full convergence, which takes approximately 3 …
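As an illustration of the regularizer, an anisotropic total-variation term and its use in such a per-material optimization could look as follows (a PyTorch sketch; the map parameterization, learning rate and weight `tv_weight` are assumptions, not the settings used for the comparison).

```python
import torch

def total_variation(m):
    """Anisotropic total variation of a (H, W, C) map; penalizes non-smooth solutions."""
    return (m[1:, :, :] - m[:-1, :, :]).abs().mean() + \
           (m[:, 1:, :] - m[:, :-1, :]).abs().mean()

# Sketch of the optimization loop (rendering_loss re-renders the maps under the
# calibrated light/view conditions of the input photographs and compares them):
# maps = {k: torch.zeros(256, 256, c, requires_grad=True)
#         for k, c in [("normal", 3), ("diffuse", 3), ("specular", 3), ("roughness", 1)]}
# optimizer = torch.optim.Adam(maps.values(), lr=1e-3)
# for _ in range(num_iterations):
#     optimizer.zero_grad()
#     loss = rendering_loss(maps, calibrated_inputs) \
#            + tv_weight * sum(total_variation(m) for m in maps.values())
#     loss.backward()
#     optimizer.step()
```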
We first compare our architecture to a simple baseline composed of the network by Deschaintre et al. [DAD∗18] augmented to take 5 images instead of one. This baseline achieves an average SSIM of 0.…, comparable to the 0.847 produced by our method for the same number of inputs. This evaluation demonstrates that our multi-image network performs as well as a fixed network while providing the freedom to vary the number of input images.

We next compare to the recent single-image methods of Deschaintre et al. [DAD∗18] and Li et al. [LSC18], which both take as input a fronto-parallel flash photo. Figure 9 provides a visual comparison on synthetic SVBRDFs with ground truth maps, Figure 12 provides a similar comparison on BTFs measured from 81 × 81 pictures, which allow ground-truth re-renderings, and Figures 10 and 11 provide a comparison on real pictures. While developed concurrently, both single-image approaches suffer from the same limitations. The co-located lighting tends to produce low-contrast shading, reducing the cues available for the network to fully retrieve normals. Adding side-lit pictures of the material helps our approach retrieve these missing details. The fronto-parallel flash also often produces a saturated highlight in the middle of the image, which both single-image methods struggle to in-paint convincingly in the different maps. While the strength of the highlight could be reduced by careful tuning of exposure, saturated pixels are difficult to avoid in real-world capture. In contrast, our method benefits from additional pictures to recover information about those pixels. Another limitation of these two single-image methods is that the flash highlight cannot cover all parts of the material sample. This lack of information can cause erroneous estimations, especially when the sample is composed of multiple materials with different shininess. Providing more pictures gives a chance to our method to observe highlights over all parts of the sample, as is the case in Figure 7, where the difference in roughness in the upper right only becomes apparent with the 4th input.
Since our method builds on the single-image network of Deschaintre et al. [DAD∗18], …

Figure 6: Evaluation on a measured BTF. Three images are enough to capture most of the normal and roughness maps. Adding images further improves the result by removing lighting residual from the diffuse albedo, and adding subtle details to the normal and specular maps.
7. Conclusion
With the advance of deep learning, the holy grail of single-image SVBRDF capture recently became a reality. Yet, despite impressive results, single-image methods offer little margin to users to correct for erroneous predictions. We address this fundamental limitation with a deep network architecture that accepts a variable number of input images, allowing users to capture as many images as needed to exhibit all the visual effects they want to capture of a material. Our method bridges the gap between single-image and many-image methods, allowing faithful material capture with a handful of images captured from uncalibrated light-view directions.
Acknowledgments
References

[AAB∗15] Abadi M., Agarwal A., Barham P., Brevdo E., Chen Z., Citro C., Corrado G. S., Davis A., Dean J., Devin M., Ghemawat S., Goodfellow I., Harp A., Irving G., Isard M., Jia Y., Jozefowicz R., Kaiser L., Kudlur M., Levenberg J., Mané D., Monga R., Moore S., Murray D., Olah C., Schuster M., Shlens J., Steiner B., Sutskever I., Talwar K., Tucker P., Vanhoucke V., Vasudevan V., Viégas F., Vinyals O., Warden P., Wattenberg M., Wicke M., Yu Y., Zheng X.: TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[AAL16] Aittala M., Aila T., Lehtinen J.: Reflectance modeling by neural texture synthesis. ACM Transactions on Graphics (Proc. SIGGRAPH) 35, 4 (2016).

[AD18] Aittala M., Durand F.: Burst image deblurring using permutation invariant convolutional neural networks. In The European Conference on Computer Vision (ECCV) (2018).

[All18] Allegorithmic: Substance Share, 2018. URL: https://share.allegorithmic.com/

[AWL13] Aittala M., Weyrich T., Lehtinen J.: Practical SVBRDF capture in the frequency domain. ACM Transactions on Graphics (Proc. SIGGRAPH) 32, 4 (2013).

[AWL15] Aittala M., Weyrich T., Lehtinen J.: Two-shot SVBRDF capture for stationary materials. ACM Transactions on Graphics (Proc. SIGGRAPH) 34, 4 (July 2015), 110:1–110:13. doi:10.1145/2766967.

[CHW18] Chen G., Han K., Wong K.-Y. K.: PS-FCN: A flexible learning framework for photometric stereo. In The European Conference on Computer Vision (ECCV) (2018).

[CT82] Cook R. L., Torrance K. E.: A reflectance model for computer graphics. ACM Transactions on Graphics 1, 1 (1982), 7–24.

[CXG∗16] Choy C. B., Xu D., Gwak J., Chen K., Savarese S.: 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision (ECCV) (2016), pp. 628–644.

[DAD∗18] Deschaintre V., Aittala M., Durand F., Drettakis G., Bousseau A.: Single-image SVBRDF capture with a rendering-aware deep network. ACM Transactions on Graphics (SIGGRAPH Conference Proceedings) 37, 128 (Aug. 2018), 15.

[DCP∗14] Dong Y., Chen G., Peers P., Zhang J., Tong X.: Appearance-from-motion: Recovering spatially varying surface reflectance under unknown lighting. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 33, 6 (2014).

[DVGNK99] Dana K. J., van Ginneken B., Nayar S. K., Koenderink J. J.: Reflectance and texture of real-world surfaces. ACM Transactions on Graphics (TOG) 18, 1 (1999), 1–34.

[DWT∗10] Dong Y., Wang J., Tong X., Snyder J., Ben-Ezra M., Lan Y., Guo B.: Manifold bootstrapping for SVBRDF capture. ACM Transactions on Graphics (Proc. SIGGRAPH) 29, 4 (2010).

[GCP∗09] Ghosh A., Chen T., Peers P., Wilson C. A., Debevec P.: Estimating specular roughness and anisotropy from second order spherical gradient illumination. Computer Graphics Forum 28, 4 (June 2009).

[GGG∗16] Guarnera D., Guarnera G. C., Ghosh A., Denk C., Glencross M.: BRDF representation and acquisition. Computer Graphics Forum (2016).

[GTHD03] Gardner A., Tchou C., Hawkins T., Debevec P.: Linear light source reflectometry. ACM Transactions on Graphics 22, 3 (July 2003), 749–758. doi:10.1145/882262.882342.

[HSL∗17] Hui Z., Sunkavalli K., Lee J. Y., Hadap S., Wang J., Sankaranarayanan A. C.: Reflectance capture using univariate sampling of BRDFs. In IEEE International Conference on Computer Vision (ICCV) (2017).

[KB15] Kingma D. P., Ba J.: Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR) (2015).

[KCW∗18] Kang K., Chen Z., Wang J., Zhou K., Wu H.: Efficient reflectance capture using an autoencoder. ACM Transactions on Graphics (Proc. SIGGRAPH) 37, 4 (July 2018).

[LCY∗17] Liu G., Ceylan D., Yumer E., Yang J., Lien J.-M.: Material editing using a physically based rendering network. In IEEE International Conference on Computer Vision (ICCV) (2017), pp. 2261–2269.

[LDPT17] Li X., Dong Y., Peers P., Tong X.: Modeling surface appearance from a single photograph using self-augmented convolutional neural networks. ACM Transactions on Graphics (Proc. SIGGRAPH) 36, 4 (2017).

[LLM∗18] Liu R., Lehman J., Molino P., Such F. P., Frank E., Sergeev A., Yosinski J.: An intriguing failing of convolutional neural networks and the CoordConv solution. CoRR abs/1807.03247 (2018).

[LN16] Lombardi S., Nishino K.: Reflectance and illumination recovery in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 38 (2016), 129–141.

[LSC18] Li Z., Sunkavalli K., Chandraker M.: Materials for masses: SVBRDF acquisition with a single mobile phone image. In Proceedings of ECCV (2018).

[LXR∗18] Li Z., Xu Z., Ramamoorthi R., Sunkavalli K., Chandraker M.: Learning to reconstruct shape and spatially-varying reflectance from a single image. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) (2018).

[Mca02] McAllister D. K.: A Generalized Surface Appearance Representation for Computer Graphics. PhD thesis, 2002.

[PCF05] Paterson J. A., Claus D., Fitzgibbon A. W.: BRDF and geometry capture from extended inhomogeneous samples using flash photography. Computer Graphics Forum (Proc. Eurographics) 24, 3 (Sept. 2005), 383–391.

[QSMG17] Qi C. R., Su H., Mo K., Guibas L. J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

[RGR∗17] Rematas K., Georgoulis S., Ritschel T., Gavves E., Fritz M., Gool L. V., Tuytelaars T.: Reflectance and natural illumination from single-material specular objects using deep learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (2017).

[RPB15] Ronneberger O., Fischer P., Brox T.: U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015), vol. 9351 of LNCS, pp. 234–241.

[RPG16] Riviere J., Peers P., Ghosh A.: Mobile surface reflectometry. Computer Graphics Forum 35, 1 (2016).

[RRFG17] Riviere J., Reshetouski I., Filipi L., Ghosh A.: Polarization imaging reflectometry in the wild. ACM Transactions on Graphics (Proc. SIGGRAPH) (2017).

[RWS∗11] Ren P., Wang J., Snyder J., Tong X., Guo B.: Pocket reflectometry. ACM Transactions on Graphics (Proc. SIGGRAPH) 30, 4 (2011).

[WGK14] Weinmann M., Gall J., Klein R.: Material classification based on training data synthesized using a BTF database. In European Conference on Computer Vision (ECCV) (2014), pp. 156–171.

[WSM11] Wang C.-P., Snavely N., Marschner S.: Estimating dual-scale properties of glossy surfaces from step-edge lighting. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 30, 6 (2011).

[WZ17] Wiles O., Zisserman A.: SilNet: Single- and multi-view reconstruction by learning from silhouettes. British Machine Vision Conference (BMVC) (2017).

[YLD∗18] Ye W., Li X., Dong Y., Peers P., Tong X.: Single image surface appearance modeling with self-augmented CNNs and inexact supervision. Computer Graphics Forum 37, 7 (2018), 201–211.

[ZKR∗17] Zaheer M., Kottur S., Ravanbakhsh S., Poczos B., Salakhutdinov R. R., Smola A. J.: Deep sets. In Advances in Neural Information Processing Systems (NIPS) (2017).

Figure 7: A single flash picture hardly provides enough information for surfaces composed of several materials. In this example, adding images allows the recovery of normal details, and the capture of different roughness values in different parts of the image. Note in particular how the 4th image helps capture a discontinuity of the roughness on the right part.

Figure 8: SSIM on re-renderings for the maps obtained by our method with 5 images (dotted blue) and by a classical optimization method with an increasing number of input images (black). The classical optimization requires several dozens of calibrated pictures to outperform our method on rather diffuse or uniform materials (stones, tiles, scales), while requiring many more for a more complex material (wood).
Figure 9: Comparison against single-image methods on synthetic SVBRDFs. Our method leverages additional input images to obtain SVBRDF maps closer to ground truth. In particular, single-image methods under-estimate normal variations and fail to remove the saturated highlight on shiny materials. See supplemental materials for more comparisons and results.
Figure 10: Comparison against single-image methods on real-world pictures. Our method recovers more normal details, and better removes highlight and shading residuals from the diffuse albedo. See supplemental materials for more comparisons and results.
Figure 11: Comparison to real-world relighting (columns: Deschaintre et al. 18, Li et al. 18, ours with 5 inputs, ground truth). Each column shows re-renderings of a captured material, except the last column which shows a picture of that material under a similar lighting condition (not used as input). We manually adjusted the position of the virtual light to best match the ground truth. Similarly, we adjusted the light power for each method separately since each has its own arbitrary scale factor. Overall, our method better reproduces the normal and gloss variations of the materials. In particular, single-image methods tend to flatten the bumps of the leather and orient them towards the center of the picture, where the flash highlight appeared in the input. For individual result maps, see supplemental materials.

Figure 12: Comparison against single-image methods on a measured BTF with ground truth re-renderings. Our method globally captures the material features better.
Figure 13: Limitations. We inherit some of the limitations of the method by Deschaintre et al. [DAD∗18], such as the tendency to produce correlated maps and to interpret dark pixels as shiny (top). Our SVBRDF representation, training data and loss do not model cast shadows. As a result, shadows in the input pollute some of the maps (bottom).