CGIntrinsics: Better Intrinsic Image Decomposition through Physically-Based Rendering
Zhengqi Li and Noah Snavely
Department of Computer Science & Cornell Tech, Cornell University
Abstract.
Intrinsic image decomposition is a challenging, long-standing computer vision problem for which ground truth data is very difficult to acquire. We explore the use of synthetic data for training CNN-based intrinsic image decomposition models, then applying these learned models to real-world images. To that end, we present CGIntrinsics, a new, large-scale dataset of physically-based rendered images of scenes with full ground truth decompositions. The rendering process we use is carefully designed to yield high-quality, realistic images, which we find to be crucial for this problem domain. We also propose a new end-to-end training method that learns better decompositions by leveraging CGIntrinsics, and optionally IIW and SAW, two recent datasets of sparse annotations on real-world images. Surprisingly, we find that a decomposition network trained solely on our synthetic data outperforms the state of the art on both IIW and SAW, and performance improves even further when IIW and SAW data are added during training. Our work demonstrates the surprising effectiveness of carefully rendered synthetic data for the intrinsic images task.
Intrinsic images is a classic vision problem involving decomposing an input image $I$ into a product of reflectance (albedo) and shading images $R \cdot S$. Recent years have seen remarkable progress on this problem, but it remains challenging due to its ill-posedness. An attractive proposition has been to replace traditional hand-crafted priors with learned, CNN-based models. For such learning methods data is key, but collecting ground-truth data for intrinsic images is extremely difficult, especially for images of real-world scenes.

One way to generate large amounts of training data for intrinsic images is to render synthetic scenes. However, existing synthetic datasets are limited to images of single objects [1,2] (e.g., via ShapeNet [3]) or images of CG animation that utilize simplified, unrealistic illumination (e.g., via Sintel [4]). An alternative is to collect ground truth for real images using crowdsourcing, as in the Intrinsic Images in the Wild (IIW) and Shading Annotations in the Wild (SAW) datasets [5,6]. However, the annotations in such datasets are sparse and difficult to collect accurately at scale.

Inspired by recent efforts to use synthetic images of scenes as training data for indoor and outdoor scene understanding [7,8,9,10], we present the first large-scale scene-level intrinsic images dataset based on high-quality physically-based rendering, which we call CGIntrinsics (CGI). CGI consists of over 20,000 images of indoor scenes, based on the SUNCG dataset [11]. Our aim with CGI is to help drive significant progress towards solving the intrinsic images problem for Internet photos of real-world scenes. We find
Fig. 1. Overview and network architecture.
Our work integrates physically-based rendered images from our CGIntrinsics dataset and reflectance/shading annotations from IIW and SAW in order to train a better intrinsic decomposition network.

that high-quality physically-based rendering is essential for our task. While SUNCG provides physically-based scene renderings [12], our experiments show that the details of how images are rendered are of critical importance, and certain choices can lead to massive improvements in how well CNNs trained for intrinsic images on synthetic data generalize to real data.

We also propose a new partially supervised learning method for training a CNN to directly predict reflectance and shading, by combining ground truth from CGI and sparse annotations from IIW/SAW. Through evaluations on IIW and SAW, we find that, surprisingly, decomposition networks trained solely on CGI can achieve state-of-the-art performance on both datasets. Combined training using both CGI and IIW/SAW leads to even better performance. Finally, we find that CGI generalizes better than existing datasets by evaluating on MIT Intrinsic Images, a very different, object-centric dataset.
Optimization-based methods.
The classical approach to intrinsic images is to integrate various priors (smoothness, reflectance sparseness, etc.) into an optimization framework [13,14,15,16,17,5]. However, for images of real-world scenes, such hand-crafted priors are difficult to design and are often violated. Several recent methods seek to improve decomposition quality by integrating surface normals or depths from RGB-D cameras [18,19,20] into the optimization process. However, these methods assume depth maps are available during optimization, preventing them from being used for a wide range of consumer photos.
Learning-based methods.
Learning methods for intrinsic images have recently been explored as an alternative to models with hand-crafted priors, or as a way to set the parameters of such models automatically. Barron and Malik [21] learn parameters of a model that utilizes sophisticated priors on reflectance, shape and illumination. This approach works on images of objects (such as in the MIT dataset), but does not generalize to real-world scenes. More recently, CNN-based methods have been deployed, including work that regresses directly to the output decomposition based on various training datasets, such as Sintel [22,23], MIT Intrinsics and ShapeNet [2,1]. Shu et al. [24] also propose a CNN-based method specifically for the domain of facial images, where ground-truth geometry can be obtained through model fitting. However, as we show in the evaluation section, networks trained on such prior datasets perform poorly on images of real-world scenes.

Two recent datasets are based on images of real-world scenes. Intrinsic Images in the Wild (IIW) [5] and Shading Annotations in the Wild (SAW) [6] consist of sparse, crowd-sourced reflectance and shading annotations on real indoor images. Subsequently, several papers train CNN-based classifiers on these sparse annotations and use the classifier outputs as priors to guide decomposition [6,25,26,27]. However, we find these annotations alone are insufficient to train a direct regression approach, likely because they are sparse and are derived from just a few thousand images. Finally, very recent work has explored the use of time-lapse imagery as training data for intrinsic images [28], although this provides a very indirect source of supervision.
Synthetic datasets for real scenes.
Synthetic data has recently been utilized to improve predictions on real-world images across a range of problems. For instance, [7,10] created large-scale datasets and benchmarks based on video games for the purpose of autonomous driving, and [29,30] use synthetic imagery to form small benchmarks for intrinsic images. SUNCG [12] is a recent, large-scale synthetic dataset for indoor scene understanding. However, many of the images in the PBRS database of physically-based renderings derived from SUNCG have low signal-to-noise ratio (SNR) and non-realistic sensor properties. We show that higher-quality renderings yield much better training data for intrinsic images.

The CGIntrinsics Dataset
To create our CGIntrinsics (CGI) dataset, we started from the SUNCG dataset [11], which contains over 45,000 3D models of indoor scenes. We first considered the PBRS dataset of physically-based renderings of scenes from SUNCG [12]. For each scene, PBRS samples cameras from good viewpoints, and uses the physically-based Mitsuba renderer [31] to generate realistic images under reasonably realistic lighting (including a mix of indoor and outdoor illumination sources), with global illumination. Using such an approach, we can also generate ground-truth data for intrinsic images by rendering a standard RGB image $I$, then asking the renderer to produce a reflectance map $R$ from the same viewpoint, and finally dividing to get the shading image $S = I/R$. Examples of such ground-truth decompositions are shown in Figure 2. Note that we automatically mask out light sources (including illumination from windows looking outside) when creating the decomposition, and do not consider those pixels when training the network.

However, we found that the PBRS renderings are not ideal for use in training real-world intrinsic image decomposition networks. In fact, certain details in how images are rendered have a dramatic impact on learning performance:
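The ground-truth generation step just described (render $I$ and $R$, divide to get $S$, and mask out light sources) can be sketched in a few lines. This is a minimal numpy sketch; the function name, array layout, and the small validity threshold are our own illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np

def shading_ground_truth(rgb, reflectance, light_mask, eps=1e-6):
    """Derive the shading image S = I / R from a rendered RGB image and
    its reflectance pass, skipping pixels covered by light sources
    (including windows to the outdoors), which receive no ground truth.

    rgb, reflectance: (H, W, 3) float arrays of linear radiance.
    light_mask:       (H, W) bool array, True at light-source pixels.
    Returns (shading, valid): valid marks pixels usable for training.
    """
    valid = (~light_mask) & (reflectance.min(axis=-1) > eps)
    shading = np.zeros_like(rgb)
    shading[valid] = rgb[valid] / reflectance[valid]
    return shading, valid
```

Only the `valid` pixels would contribute to the training losses described later.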
Fig. 2. Visualization of ground truth from our CGIntrinsics dataset.
Top row: rendered RGB images. Middle: ground truth reflectance. Bottom: ground truth shading. Note that light sources are masked out when creating the ground truth decomposition.
Rendering quality.
Mitsuba and other high-quality renderers support a range of rendering algorithms, including various flavors of path tracing methods that sample many light paths for each output pixel. In PBRS, the authors note that bidirectional path tracing works well but is very slow, and opt for Metropolis Light Transport (MLT) with a sample rate of 512 samples per pixel [12]. In contrast, for our purposes we found that bidirectional path tracing (BDPT) with very large numbers of samples per pixel was the only algorithm that gave consistently good results for rendering SUNCG images. Comparisons between selected renderings from PBRS and our new CGI images are shown in Figure 3. Note the significantly decreased noise in our renderings.

This extra quality comes at a cost. We find that using BDPT with 8,192 samples per pixel yields acceptable quality for most images. This increases the render time per image significantly, from a reported 31s [12] to approximately 30 minutes. One reason for the need for large numbers of samples is that SUNCG scenes are often challenging from a rendering perspective: the illumination is often indirect, coming from open doorways or constrained in other ways by geometry. However, rendering is highly parallelizable, and over the course of about six months we rendered over ten thousand images on a cluster of about 10 machines.
Tone mapping from HDR to LDR.
We found that another critical factor in image generation is how rendered images are tone mapped. Renderers like Mitsuba generally produce high dynamic range (HDR) outputs that encode raw, linear radiance estimates for each pixel. In contrast, real photos are usually low dynamic range (LDR). The process that takes an HDR input and produces an LDR output is called tone mapping, and in real

(Footnote: While high, this render time is still a fair ways off of reported render times for animated films. For instance, each frame of Pixar's Monsters University took a reported 29 hours to render [32].)
Fig. 3. Visual comparisons between our CGI and the original SUNCG dataset.
Top row: images from SUNCG/PBRS. Bottom row: images from our CGI dataset. The images in our dataset have higher SNR and are more realistic.

| Dataset | Size | Setting | Rendered/Real | Illumination | GT type |
|---|---|---|---|---|---|
| MPI Sintel [34] | 890 | Animation | non-PB | spatially-varying | full |
| MIT Intrinsics [35] | 110 | Object | Real | single global | full |
| ShapeNet [2] | 2M+ | Object | PB | single global | full |
| IIW [5] | 5230 | Scene | Real | spatially-varying | sparse |
| SAW [6] | 6677 | Scene | Real | spatially-varying | sparse |
| CGIntrinsics (ours) | 20,000+ | Scene | PB | spatially-varying | full |
Table 1. Comparisons of existing intrinsic image datasets with our CGIntrinsics dataset.
PB indicates physically-based rendering and non-PB indicates non-physically-based rendering.

cameras the analogous operations are the auto-exposure, gamma correction, etc., that yield a well-exposed, high-contrast photograph. PBRS uses the tone mapping method of Reinhard et al. [33], which is inspired by photographers such as Ansel Adams, but which can produce images that are very different in character from those of consumer cameras. We find that a simpler tone mapping method produces more natural-looking results. Again, Figure 3 shows comparisons between PBRS renderings and our own. Note how color and illumination features, such as shadows, are better captured in our renderings (we noticed that shadows often disappear with the Reinhard tone mapper).

In particular, to tone map a linear HDR radiance image $I_{\mathrm{HDR}}$, we find the 90th percentile intensity value $r_{90}$, then compute the image $I_{\mathrm{LDR}} = \alpha I_{\mathrm{HDR}}^{\gamma}$, where $\gamma = 1/2.2$ is a standard gamma correction factor, and $\alpha$ is computed such that $r_{90}$ maps to the value 0.8. The final image is then clipped to the range $[0, 1]$. This mapping ensures that at most 10% of the image pixels (and usually many fewer) are saturated after tone mapping, and tends to result in natural-looking LDR images.

Using the above rendering approach, we re-rendered ~20,000 images of SUNCG scenes, which have more sophisticated structure and illumination (cast shadows, spatially-varying lighting, etc.). Compared to IIW and SAW, which include images of real scenes, CGI has full ground truth and is much more easily collected at scale.
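The tone-mapping step described above is simple to implement. The following is a minimal numpy sketch; computing the percentile over the mean-channel intensity is our assumption about an unspecified detail:

```python
import numpy as np

def tone_map(hdr, gamma=1.0 / 2.2, percentile=90, target=0.8):
    """Map a linear HDR radiance image to LDR:
    I_LDR = clip(alpha * I_HDR**gamma, 0, 1), with alpha chosen so the
    given intensity percentile maps to `target` (0.8), leaving at most
    ~10% of pixels saturated."""
    intensity = hdr.mean(axis=-1) if hdr.ndim == 3 else hdr
    r = np.percentile(intensity, percentile)
    alpha = target / max(r ** gamma, 1e-8)
    return np.clip(alpha * hdr ** gamma, 0.0, 1.0)
```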
In this section, we describe how we use CGIntrinsics to jointly train an intrinsic decomposition network end-to-end, incorporating additional sparse annotations from IIW and SAW. Our full training loss considers training data from each dataset:

$$L = L_{\text{CGI}} + \lambda_{\text{IIW}} L_{\text{IIW}} + \lambda_{\text{SAW}} L_{\text{SAW}},\qquad(1)$$

where $L_{\text{CGI}}$, $L_{\text{IIW}}$, and $L_{\text{SAW}}$ are the losses we use for training from the CGI, IIW, and SAW datasets respectively. The most direct way to train would be to simply incorporate supervision from each dataset. In the case of CGI, this supervision consists of full ground truth. For IIW and SAW, this supervision takes the form of sparse annotations for each image, as illustrated in Figure 1. However, in addition to supervision, we found that incorporating smoothness priors into the loss also improves performance. Our full loss functions thus incorporate a number of terms:

$$L_{\text{CGI}} = L_{\text{sup}} + \lambda_{\text{ord}} L_{\text{ord}} + \lambda_{\text{rec}} L_{\text{reconstruct}}\qquad(2)$$
$$L_{\text{IIW}} = \lambda_{\text{ord}} L_{\text{ord}} + \lambda_{\text{rs}} L_{\text{rsmooth}} + \lambda_{\text{ss}} L_{\text{ssmooth}} + L_{\text{reconstruct}}\qquad(3)$$
$$L_{\text{SAW}} = \lambda_{\text{S/NS}} L_{\text{S/NS}} + \lambda_{\text{rs}} L_{\text{rsmooth}} + \lambda_{\text{ss}} L_{\text{ssmooth}} + L_{\text{reconstruct}}\qquad(4)$$

We now describe each term in detail.

Since the images in our CGI dataset are equipped with a full ground-truth decomposition, the learning problem for this dataset can be formulated as a direct regression problem from input image $I$ to output images $R$ and $S$. However, because the decomposition is only up to an unknown scale factor, we use a scale-invariant supervised loss, $L_{\text{siMSE}}$ (for "scale-invariant mean-squared error"). In addition, we add a gradient-domain multi-scale matching term $L_{\text{grad}}$. For each training image in CGI, our supervised loss is defined as $L_{\text{sup}} = L_{\text{siMSE}} + L_{\text{grad}}$, where

$$L_{\text{siMSE}} = \frac{1}{N}\sum_{i=1}^{N} (R^{*}_{i} - c_{r} R_{i})^{2} + (S^{*}_{i} - c_{s} S_{i})^{2}\qquad(5)$$
$$L_{\text{grad}} = \sum_{l=1}^{L}\frac{1}{N_{l}}\sum_{i=1}^{N_{l}} \left\| \nabla R^{*}_{l,i} - c_{r} \nabla R_{l,i} \right\|_{1} + \left\| \nabla S^{*}_{l,i} - c_{s} \nabla S_{l,i} \right\|_{1}.\qquad(6)$$

$R_{l,i}$ ($R^{*}_{l,i}$) and $S_{l,i}$ ($S^{*}_{l,i}$) denote reflectance prediction (resp. ground truth) and shading prediction (resp. ground truth), at pixel $i$ and scale $l$ of an image pyramid. $N_{l}$ is the number of valid pixels at scale $l$ and $N = N_{1}$ is the number of valid pixels at the original image scale. The scale factors $c_{r}$ and $c_{s}$ are computed via least squares.
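The scale-invariant term of Eq. 5 can be sketched for a single channel as follows (a minimal numpy sketch; the function name and masking of invalid light-source pixels via a boolean array are our own conventions):

```python
import numpy as np

def si_mse(pred, gt, valid):
    """Scale-invariant MSE (one channel of Eq. 5): fit the least-squares
    scale c = <gt, pred> / <pred, pred> over valid pixels, then average
    the squared residual (gt_i - c * pred_i)**2."""
    p, g = pred[valid], gt[valid]
    c = (g * p).sum() / max((p * p).sum(), 1e-12)
    return np.mean((g - c * p) ** 2)
```

By construction, a prediction that is correct up to any global scale incurs zero loss.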
Fig. 4. Examples of predictions with and without IIW training data.
Adding real IIW data can qualitatively improve reflectance and shading predictions. Note for instance how the quilt highlighted in the first row has a more uniform reflectance after incorporating IIW data, and similarly for the floor highlighted in the second row. (Columns: Image, CGI (R), CGI (S), CGI+IIW (R), CGI+IIW (S).)
In addition to the scale invariance of $L_{\text{siMSE}}$, another important aspect is that we compute the MSE in the linear intensity domain, as opposed to the all-pairs pixel comparisons in the log domain used in [22]. In the log domain, pairs of pixels with large absolute log-difference tend to dominate the loss. As we show in our evaluation, computing $L_{\text{siMSE}}$ in the linear domain significantly improves performance.

Finally, the multi-scale gradient matching term $L_{\text{grad}}$ encourages decompositions to be piecewise smooth with sharp discontinuities.

Ordinal reflectance loss.
IIW provides sparse ordinal reflectance judgments between pairs of points (e.g., "point $i$ has brighter reflectance than point $j$"). We introduce a loss based on this ordinal supervision. For a given IIW training image and predicted reflectance $R$, we accumulate losses for each pair of annotated pixels $(i, j)$ in that image: $L_{\text{ord}}(R) = \sum_{(i,j)} e_{i,j}(R)$, where

$$e_{i,j}(R) = \begin{cases} w_{i,j}\,(\log R_{i} - \log R_{j})^{2}, & r_{i,j} = 0 \\ w_{i,j}\,\left(\max(0,\, m - \log R_{i} + \log R_{j})\right)^{2}, & r_{i,j} = +1 \\ w_{i,j}\,\left(\max(0,\, m - \log R_{j} + \log R_{i})\right)^{2}, & r_{i,j} = -1 \end{cases}\qquad(7)$$

and $r_{i,j}$ is the ordinal relation from IIW, indicating whether point $i$ is darker ($-1$), $j$ is darker ($+1$), or they have equal reflectance ($0$). $w_{i,j}$ is the confidence of the annotation, provided by IIW. Example predictions with and without IIW data are shown in Fig. 4.

We also found that adding a similar ordinal term derived from CGI data can improve reflectance predictions. For each image in CGI, we over-segment it using superpixel segmentation [36]. Then in each training iteration, we randomly choose one pixel from every segmented region, and for each pair of chosen pixels, we evaluate $L_{\text{ord}}$ similarly to Eq. 7, with $w_{i,j} = 1$ and the ordinal relation derived from the ground-truth reflectance.

SAW shading loss.
The SAW dataset provides images containing annotations of smooth (S) shading regions and non-smooth (NS) shading points, as depicted in Figure 1. These annotations can be further divided into three types: regions of constant shading, shadow boundaries, and depth/normal discontinuities.

We integrate all three types of annotations into our supervised SAW loss $L_{\text{S/NS}}$. For each constant shading region (with $N_c$ pixels), we compute a loss $L_{\text{constant-shading}}$
Fig. 5. Examples of predictions with and without SAW training data.
Adding SAW training data can qualitatively improve reflectance and shading predictions. Note the pictures/TV highlighted in the decompositions in the first row, and the improved assignment of texture to the reflectance channel for the paintings and sofa in the second row. (Columns: Image, CGI (R), CGI (S), CGI+SAW (R), CGI+SAW (S).)

encouraging the variance of the predicted shading in the region to be zero:

$$L_{\text{constant-shading}} = \frac{1}{N_c}\sum_{i=1}^{N_c} (\log S_i)^2 - \frac{1}{N_c^2}\left(\sum_{i=1}^{N_c} \log S_i\right)^2.\qquad(8)$$

SAW also provides individual point annotations at cast shadow boundaries. As noted in [6], these points are not localized precisely on shadow boundaries, and so we apply a morphological dilation with a radius of 5 pixels to the set of marked points before using them in training. This results in shadow boundary regions. We find that most shadow boundary annotations lie in regions of constant reflectance, which implies that for all pairs of shading pixels within a small neighborhood, their log difference should be approximately equal to the log difference of the image intensity. This is equivalent to encouraging the variance of $\log S_i - \log I_i$ within this small region to be zero [37]. Hence, we define the loss for each shadow boundary region (with $N_{sd}$ pixels) as:

$$L_{\text{shadow}} = \frac{1}{N_{sd}}\sum_{i=1}^{N_{sd}} (\log S_i - \log I_i)^2 - \frac{1}{N_{sd}^2}\left(\sum_{i=1}^{N_{sd}} (\log S_i - \log I_i)\right)^2\qquad(9)$$

Finally, SAW provides depth/normal discontinuities, which are also usually shading discontinuities. However, since we cannot derive the actual shading change at such discontinuities, we simply mask out such regions in our shading smoothness term $L_{\text{ssmooth}}$ (Eq. 11), i.e., we do not penalize shading changes in such regions. As above, we first dilate these annotated regions before use in training. Example predictions before/after adding SAW data into our training are shown in Fig. 5.
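The ordinal term (Eq. 7) and the two SAW variance losses (Eqs. 8 and 9) are easy to express directly. Below is a minimal numpy sketch for a single image; the margin value and the pair/region data layout are illustrative assumptions:

```python
import numpy as np

def ordinal_loss(log_r, pairs, margin=0.5):
    """Ordinal reflectance loss (Eq. 7). `pairs` holds (i, j, rel, w):
    pixel coordinates i, j, the human judgement rel in {-1, 0, +1}
    (i darker: -1, j darker: +1, equal: 0) and its confidence w."""
    total = 0.0
    for i, j, rel, w in pairs:
        d = log_r[i] - log_r[j]                        # log R_i - log R_j
        if rel == 0:
            total += w * d ** 2                        # equal reflectance
        elif rel == +1:
            total += w * max(0.0, margin - d) ** 2     # R_i should exceed R_j
        else:
            total += w * max(0.0, margin + d) ** 2     # R_j should exceed R_i
    return total

def constant_shading_loss(log_s_region):
    """Eq. 8: variance of log-shading over a smooth-shading region."""
    return np.mean(log_s_region ** 2) - np.mean(log_s_region) ** 2

def shadow_boundary_loss(log_s_region, log_i_region):
    """Eq. 9: variance of (log S - log I) over a dilated shadow-boundary
    region; zero when reflectance is constant across the boundary."""
    d = log_s_region - log_i_region
    return np.mean(d ** 2) - np.mean(d) ** 2
```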
To further constrain the decompositions for real images in IIW/SAW, following classical intrinsic image algorithms we add reflectance smoothness $L_{\text{rsmooth}}$ and shading smoothness $L_{\text{ssmooth}}$ terms. For reflectance, we use a multi-scale $\ell_1$ smoothness term to encourage reflectance predictions to be piecewise constant:

$$L_{\text{rsmooth}} = \sum_{l=1}^{L}\frac{1}{N_l\,2^{l}}\sum_{i=1}^{N_l}\sum_{j\in\mathcal{N}(l,i)} v_{l,i,j}\,\|\log R_{l,i} - \log R_{l,j}\|_1\qquad(10)$$

where $\mathcal{N}(l,i)$ denotes the 8-connected neighborhood of the pixel at position $i$ and scale $l$. The reflectance weight $v_{l,i,j} = \exp\left(-\tfrac{1}{2}(\mathbf{f}_{l,i} - \mathbf{f}_{l,j})^{T}\Sigma^{-1}(\mathbf{f}_{l,i} - \mathbf{f}_{l,j})\right)$, and the feature vector $\mathbf{f}_{l,i}$ is defined as $[\,\mathbf{p}_{l,i}, I_{l,i}, c^{1}_{l,i}, c^{2}_{l,i}\,]$, where $\mathbf{p}_{l,i}$ and $I_{l,i}$ are the spatial position and image intensity respectively, and $c^{1}_{l,i}$ and $c^{2}_{l,i}$ are the first two elements of chromaticity. $\Sigma$ is a covariance matrix defining the distance between two feature vectors.

We also include a densely-connected $\ell_2$ shading smoothness term, which can be evaluated in linear time in the number of pixels $N$ using bilateral embeddings [38,28]:

$$L_{\text{ssmooth}} = \frac{1}{2N}\sum_{i}^{N}\sum_{j}^{N} \hat{W}_{i,j}\,(\log S_i - \log S_j)^2 \approx \frac{1}{N}\,\mathbf{s}^{\top}\left(I - N_b S_b^{\top} \bar{B}_b S_b N_b\right)\mathbf{s}\qquad(11)$$

where $\hat{W}$ is a bistochastic weight matrix derived from $W$ and $W_{i,j} = \exp\left(-\left\|\frac{\mathbf{p}_i - \mathbf{p}_j}{\sigma_p}\right\|^2\right)$. We refer readers to [38,28] for a detailed derivation. As shown in our experiments, adding such smoothness terms for real data can yield better generalization.

Finally, for each training image in each dataset, we add a loss expressing the constraint that the reflectance and shading should reconstruct the original image:

$$L_{\text{reconstruct}} = \frac{1}{N}\sum_{i=1}^{N} (I_i - R_i S_i)^2.\qquad(12)$$

Our network architecture is illustrated in Figure 1. We use a variant of the "U-Net" architecture [28,39]. Our network has one encoder and two decoders with skip connections. The two decoders output log reflectance and log shading, respectively.
Each layer of the encoder mainly consists of a stride-2 convolutional layer followed by batch normalization [40] and leaky ReLU [41]. For the two decoders, each layer is composed of a deconvolutional layer followed by batch normalization and ReLU, and a convolutional layer is appended to the final layer of each decoder.

We conduct experiments on two datasets of real-world scenes, IIW [5] and SAW [6] (using test data unseen during training), and compare our method with several state-of-the-art intrinsic images algorithms. Additionally, we evaluate the generalization of our CGI dataset by evaluating on the MIT Intrinsic Images benchmark [35].
| Method | Training set | WHDR |
|---|---|---|
| Retinex-Color [35] | - | 26.9% |
| Garces et al. [17] | - | 24.8% |
| Zhao et al. [14] | - | 23.8% |
| Bell et al. [5] | - | 20.6% |
| Zhou et al. [25] | IIW | 19.9% |
| Bi et al. [44] | - | 17.7% |
| Nestmeyer et al. [45] | IIW | 19.5% |
| Nestmeyer et al. [45]* | IIW | 17.7% |
| DI [22] | Sintel | 37.3% |
| Shi et al. [2] | ShapeNet | 59.4% |
| Ours (log, L_siMSE) | CGI | 22.7% |
| Ours (w/o L_grad) | CGI | 19.7% |
| Ours (w/o L_ord) | CGI | 19.9% |
| Ours (w/o L_rsmooth) | All | 16.1% |
| Ours | SUNCG | 26.1% |
| Ours† | CGI | 18.4% |
| Ours | CGI | 17.8% |
| Ours* | CGI | 17.1% |
| Ours | CGI+IIW(O) | 17.5% |
| Ours | CGI+IIW(A) | 16.2% |
| Ours | All | |
| Ours* | All | |
Table 2. Numerical results on the IIW test set.
Lower is better for WHDR. The "Training set" column specifies the training data used by each learning-based method; "-" indicates an optimization-based method. IIW(O) indicates original IIW annotations and IIW(A) indicates augmented IIW comparisons. "All" indicates CGI+IIW(A)+SAW. † indicates the network was validated on CGI; others were validated on IIW. * indicates CNN predictions are post-processed with a guided filter [45].

Network training details.
We implement our method in PyTorch [42]. For all three datasets, we perform data augmentation through random flips, resizing, and crops. For all evaluations, we train our network from scratch using the Adam [43] optimizer with a mini-batch size of 16; we refer readers to the supplementary material for the detailed hyperparameter settings, including the initial learning rate.

We follow the train/test split for IIW provided by [27], also used in [25]. We also conduct several ablation studies using different loss configurations. Quantitative comparisons of Weighted Human Disagreement Rate (WHDR) between our method and other optimization- and learning-based methods are shown in Table 2.

Comparing direct CNN predictions, our CGI-trained model is significantly better than the best learning-based method [45], and similar to [44], even though [45] was directly trained on IIW. Additionally, running the post-processing from [45] on the results of the CGI-trained model achieves a further performance boost. Table 2 also shows that models trained on SUNCG (i.e., PBRS), Sintel, MIT Intrinsics, or ShapeNet generalize poorly to IIW, likely due to the lower quality of training data (SUNCG/PBRS) or the larger domain gap with respect to images of real-world scenes, compared to CGI. The comparison to SUNCG suggests the key importance of our rendering decisions.

We also evaluate networks trained jointly using CGI and real imagery from IIW. As in [25], we augment the pairwise IIW judgments by globally exploiting their transitivity and symmetry. The right part of Table 2 demonstrates that including IIW training data leads to further improvements in performance, as does also including SAW training data. Table 2 also shows various ablations on variants of our method, such as evaluating
| Method | Training set | AP% (unweighted) | AP% (challenge) |
|---|---|---|---|
| Retinex-Color [35] | - | 91.93 | 85.26 |
| Garces et al. [17] | - | 96.89 | 92.39 |
| Zhao et al. [14] | - | 97.11 | 89.72 |
| Bell et al. [5] | - | 97.37 | 92.18 |
| Zhou et al. [25] | IIW | 96.24 | 86.34 |
| Nestmeyer et al. [45] | IIW | 97.26 | 89.94 |
| Nestmeyer et al. [45]* | IIW | 96.85 | 88.64 |
| DI [22] | Sintel+MIT | 95.04 | 86.08 |
| Shi et al. [2] | ShapeNet | 86.62 | 81.30 |
| Ours (log, L_siMSE) | CGI | 97.73 | 93.03 |
| Ours (w/o L_grad) | CGI | 98.15 | 93.74 |
| Ours (w/o L_ssmooth) | CGI+IIW(A)+SAW | 98.60 | 94.87 |
| Ours | SUNCG | 96.56 | 87.09 |
| Ours† | CGI | 98.16 | 93.21 |
| Ours | CGI | 98.39 | 94.05 |
| Ours | CGI+IIW(A) | 98.56 | 94.69 |
| Ours | CGI+IIW(A)+SAW | | |

Table 3. Numerical results on the SAW test set. Higher is better for AP%. The second column is described in Table 2. The third and fourth columns show performance on the unweighted SAW benchmark and our more challenging gradient-weighted benchmark, respectively.
Fig. 6. Precision-Recall (PR) curve for shading images on the SAW test set.
Left: PR curves generated using the unweighted SAW error metric of [28]. Right: curves generated using our more challenging gradient-weighted metric.

losses in the log domain and removing terms from the loss functions. Finally, we test a network trained only on IIW/SAW data (and not CGI), or trained on CGI and fine-tuned on IIW/SAW. Although such a network achieves ~19% WHDR, we find that the decompositions are qualitatively unsatisfactory. The sparsity of the training data causes these networks to produce degenerate decompositions, especially for shading images.
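For reference, the WHDR metric reported in Table 2 counts the confidence-weighted fraction of IIW human judgments that a predicted reflectance violates. A numpy sketch follows; the relative-difference threshold `delta` is the standard IIW choice, stated here as an assumption since the text does not specify it:

```python
import numpy as np

def whdr(reflectance, judgements, delta=0.1):
    """Weighted Human Disagreement Rate on one image.

    judgements: list of (i, j, rel, w) with pixel coordinates i, j, the
    human label rel in {-1, 0, +1} (i darker: -1, j darker: +1, equal: 0)
    and confidence weight w, following the convention of Eq. 7.
    """
    wrong, total = 0.0, 0.0
    for i, j, rel, w in judgements:
        ri, rj = reflectance[i], reflectance[j]
        if ri > (1.0 + delta) * rj:
            pred = +1            # point j is darker
        elif rj > (1.0 + delta) * ri:
            pred = -1            # point i is darker
        else:
            pred = 0             # roughly equal reflectance
        total += w
        if pred != rel:
            wrong += w
    return wrong / max(total, 1e-12)
```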
To evaluate our shading predictions, we test our models on the SAW [6] test set, utilizing the error metric introduced in [28]. We also propose a new, more challenging error metric for SAW evaluation. In particular, we found that many of the constant-shading regions annotated in SAW also have smooth image intensity (e.g., textureless walls), making their shading easy to predict. Our proposed metric downweights such regions as follows. For each annotated region of constant shading, we compute the average image gradient magnitude over the region. During evaluation, when we add the pixels belonging to a region of constant shading into the confusion matrices, we multiply the number of pixels by this average gradient. This proposed metric leads to more distinguishable performance differences between methods, because regions with rich textures contribute more to the error than under the unweighted metric.

Figure 6 and Table 3 show precision-recall (PR) curves and average precision (AP) on the SAW test set with both the unweighted [28] and our proposed challenge error metrics. As with IIW, networks trained solely on our CGI data achieve state-of-the-art performance, even without using SAW training data. Adding real IIW data improves the AP in terms of both error metrics. Finally, the last column of Table 3 shows that integrating SAW training data can significantly improve performance on shading predictions, suggesting the effectiveness of our proposed losses for SAW sparse annotations.

Note that the previous state-of-the-art algorithms on IIW (e.g., Zhou et al. [25] and Nestmeyer et al. [45]) tend to overfit to reflectance, hurting the accuracy of shading predictions. This is especially evident in terms of our proposed challenge error metric. In contrast, our method achieves state-of-the-art results on both reflectance and shading predictions, in terms of all error metrics.
Note that models trained on the original SUNCG, Sintel, MIT Intrinsics, or ShapeNet datasets perform poorly on the SAW test set, indicating the much improved generalization to real scenes of our CGI dataset.
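The gradient weighting used by the challenge metric described above can be sketched as follows (numpy; the choice of grayscale gradient operator is our assumption):

```python
import numpy as np

def region_weight(image_gray, region_mask):
    """Weight of one constant-shading region under the proposed
    gradient-weighted ('challenge') metric: the mean image-gradient
    magnitude over the region. Pixel counts from the region are
    multiplied by this weight before entering the confusion matrices,
    so textured regions contribute more than flat, textureless ones."""
    gy, gx = np.gradient(image_gray)
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    return grad_mag[region_mask].mean()
```

A textureless wall thus receives a weight near zero, while a textured region keeps its full influence.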
Qualitative results on IIW/SAW.
Figure 7 shows qualitative comparisons between our network trained on all three datasets and two other state-of-the-art intrinsic images algorithms (Bell et al. [5] and Zhou et al. [25]) on images from the IIW/SAW test sets. In general, our decompositions show significant improvements. In particular, our network is better at avoiding attributing surface texture to the shading channel (for instance, the checkerboard patterns evident in the first two rows, and the complex textures in the last four rows) while still predicting accurate reflectance (such as the mini-sofa in the images of the third row). In contrast, the other two methods often fail in such difficult settings. In particular, [25] tends to overfit to reflectance predictions, and their shading estimates strongly resemble the original image intensity. However, our method still makes mistakes, such as the non-uniform reflectance prediction for the chair in the fifth row, as well as residual textures and shadows in the shading and reflectance channels.
For the sake of completeness, we also test the ability of our CGI-trained networks to generalize to the MIT Intrinsic Images dataset [35]. In contrast to IIW/SAW, the MIT dataset contains 20 real objects with 11 different illumination conditions. We follow the same train/test split as Barron et al. [21] and, as in the work of Shi et al. [2], we directly apply our CGI-trained networks to the MIT test set; we additionally test fine-tuning them on the MIT training set.

We compare our models with several state-of-the-art learning-based methods using the same error metrics as [2]. Table 4 shows quantitative comparisons and Figure 8 shows
Fig. 7. Qualitative comparisons on the IIW/SAW test sets.
Our predictions show significant improvements compared to state-of-the-art algorithms (Bell et al. [5] and Zhou et al. [25]). In particular, our predicted shading channels include significantly less surface texture in several challenging settings. (Columns: Image, Bell et al. (R), Bell et al. (S), Zhou et al. (R), Zhou et al. (S), Ours (R), Ours (S).)

qualitative results. Both show that our CGI-trained model yields better performance than ShapeNet-trained networks, both qualitatively and quantitatively, even though, like MIT, ShapeNet consists of images of rendered objects, while our dataset contains images of scenes. Moreover, our CGI-pretrained model also performs better than networks pretrained on ShapeNet and Sintel. These results further demonstrate the improved generalization ability of our CGI dataset compared to existing datasets. Note that SIRFS still achieves the best results, but as described in [22,2], their method is designed specifically for single objects and generalizes poorly to real scenes.
We presented a new synthetic dataset for learning intrinsic images, and an end-to-end learning approach that learns better intrinsic image decompositions by leveraging datasets with different types of labels. Our evaluations illustrate the surprising effectiveness of
Method           Training set    MSE (R)  MSE (S)  LMSE (R)  LMSE (S)  DSSIM (R)  DSSIM (S)
DI [22]          Sintel+MIT      0.0277   0.0154   0.0585    0.0295    0.1526     0.1328
Shi et al. [2]   ShapeNet        0.0468   0.0194   0.0752    0.0318    0.1825     0.1667
Shi et al. [2] ★ ShapeNet+MIT    0.0278   0.0126   0.0503    0.0240    0.1465     0.1200
Ours             CGI             0.0221   0.0186   0.0349    0.0259    0.1739     0.1652
Ours ★           CGI+MIT         0.0167   0.0127   –         –         –          –

Table 4. Quantitative results on the MIT intrinsics test set. For all error metrics, lower is better. The second column shows the dataset used for training. ★ indicates models fine-tuned on MIT.
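The MIT benchmark metrics above, following [2], are computed scale-invariantly: each prediction is first aligned to the ground truth by a least-squares scalar before the error is measured, since the R · S decomposition is only defined up to scale. A minimal NumPy sketch of scale-invariant MSE and its locally windowed variant LMSE (the window and step sizes here are illustrative assumptions, not necessarily the exact benchmark settings):

```python
import numpy as np

def si_mse(pred, gt):
    """Scale-invariant MSE: solve for the scalar alpha that best aligns
    pred to gt in the least-squares sense, then measure the MSE."""
    alpha = (pred * gt).sum() / max((pred * pred).sum(), 1e-12)
    return float(((alpha * pred - gt) ** 2).mean())

def lmse(pred, gt, window=20, step=10):
    """Local MSE: average scale-invariant MSE over overlapping windows,
    so each local region gets its own best-fit scale."""
    h, w = gt.shape[:2]
    errs = []
    for i in range(0, max(h - window, 0) + 1, step):
        for j in range(0, max(w - window, 0) + 1, step):
            errs.append(si_mse(pred[i:i + window, j:j + window],
                               gt[i:i + window, j:j + window]))
    return float(np.mean(errs)) if errs else si_mse(pred, gt)
```

For example, a prediction that differs from the ground truth only by a global scale factor incurs zero scale-invariant MSE, whereas per-region scale errors are still penalized by LMSE.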
Fig. 8. Qualitative comparisons on the MIT intrinsics test set. Columns, left to right: Image, GT, SIRFS [21], DI [22], Shi et al. [2], Shi et al. [2] ★, Ours, Ours ★.
Odd rows: reflectance predictions. Even rows: shading predictions. ★ indicates predictions fine-tuned on MIT.

our synthetic dataset on Internet photos of real-world scenes. We find that the details of rendering matter, and hypothesize that improved physically-based rendering may benefit other vision tasks, such as normal prediction and semantic segmentation [12].
Acknowledgments.
We thank Jingguang Zhou for his help with data generation. This work was funded by the National Science Foundation through grant IIS-1149393, and by a grant from Schmidt Sciences.
References
1. Janner, M., Wu, J., Kulkarni, T., Yildirim, I., Tenenbaum, J.B.: Self-supervised intrinsic image decomposition. In: Neural Information Processing Systems. (2017)
2. Shi, J., Dong, Y., Su, H., Yu, S.X.: Learning non-Lambertian object intrinsics across ShapeNet categories. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2017) 5844–5853
3. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
4. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: Proc. European Conf. on Computer Vision (ECCV). (2012) 611–625
5. Bell, S., Bala, K., Snavely, N.: Intrinsic images in the wild. ACM Trans. Graphics 33(4) (2014)
6. Kovacs, B., Bell, S., Snavely, N., Bala, K.: Shading annotations in the wild. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2017) 850–859
7. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: Proc. European Conf. on Computer Vision (ECCV). (2016) 102–118
8. Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.M.: The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2016) 3234–3243
9. Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2016) 4340–4349
10. Richter, S.R., Hayder, Z., Koltun, V.: Playing for benchmarks. In: Proc. Int. Conf. on Computer Vision (ICCV). (2017) 2232–2241
11. Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2017) 190–198
12. Zhang, Y., Song, S., Yumer, E., Savva, M., Lee, J.Y., Jin, H., Funkhouser, T.: Physically-based rendering for indoor scene understanding using convolutional neural networks. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2017) 5057–5065
13. Land, E.H., McCann, J.J.: Lightness and retinex theory. J. Opt. Soc. Am. 61(1) (1971) 1–11
14. Zhao, Q., Tan, P., Dai, Q., Shen, L., Wu, E., Lin, S.: A closed-form solution to retinex with nonlocal texture constraints. Trans. on Pattern Analysis and Machine Intelligence 34(7) (2012) 1437–1444
15. Rother, C., Kiefel, M., Zhang, L., Schölkopf, B., Gehler, P.V.: Recovering intrinsic images with a global sparsity prior on reflectance. In: Neural Information Processing Systems. (2011) 765–773
16. Shen, L., Yeo, C.: Intrinsic images decomposition using a local and global sparse representation of reflectance. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2011) 697–704
17. Garces, E., Munoz, A., Lopez-Moreno, J., Gutierrez, D.: Intrinsic images by clustering. Computer Graphics Forum (Proc. EGSR 2012) 31(4) (2012)
18. Chen, Q., Koltun, V.: A simple model for intrinsic image decomposition with depth cues. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2013) 241–248
19. Barron, J.T., Malik, J.: Intrinsic scene properties from a single RGB-D image. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2013) 17–24
20. Jeon, J., Cho, S., Tong, X., Lee, S.: Intrinsic image decomposition using structure-texture separation and surface normals. In: Proc. European Conf. on Computer Vision (ECCV). (2014)
21. Barron, J.T., Malik, J.: Shape, illumination, and reflectance from shading. Trans. on Pattern Analysis and Machine Intelligence 37(8) (2015) 1670–1687
22. Narihira, T., Maire, M., Yu, S.X.: Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In: Proc. Int. Conf. on Computer Vision (ICCV). (2015) 2992–2992
23. Kim, S., Park, K., Sohn, K., Lin, S.: Unified depth prediction and intrinsic image decomposition from a single image via joint convolutional neural fields. In: Proc. European Conf. on Computer Vision (ECCV). (2016) 143–159
24. Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., Samaras, D.: Neural face editing with intrinsic image disentangling. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2017) 5444–5453
25. Zhou, T., Krahenbuhl, P., Efros, A.A.: Learning data-driven reflectance priors for intrinsic image decomposition. In: Proc. Int. Conf. on Computer Vision (ICCV). (2015) 3469–3477
26. Zoran, D., Isola, P., Krishnan, D., Freeman, W.T.: Learning ordinal relationships for mid-level vision. In: Proc. Int. Conf. on Computer Vision (ICCV). (2015) 388–396
27. Narihira, T., Maire, M., Yu, S.X.: Learning lightness from human judgement on relative reflectance. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2015) 2965–2973
28. Li, Z., Snavely, N.: Learning intrinsic image decomposition from watching the world. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2018)
29. Beigpour, S., Serra, M., van de Weijer, J., Benavente, R., Vanrell, M., Penacchio, O., Samaras, D.: Intrinsic image evaluation on synthetic complex scenes. Int. Conf. on Image Processing (2013)
30. Bonneel, N., Kovacs, B., Paris, S., Bala, K.: Intrinsic decompositions for image editing. Computer Graphics Forum (Eurographics State of the Art Reports 2017) (2017)
37. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Neural Information Processing Systems. (2014) 2366–2374
38. Barron, J.T., Adams, A., Shih, Y., Hernández, C.: Fast bilateral-space stereo for synthetic defocus. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2015) 4466–4474
39. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proc. Computer Vision and Pattern Recognition (CVPR). (2017) 5967–5976
40. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proc. Int. Conf. on Machine Learning. (2015) 448–456
41. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proc. Int. Conf. on Computer Vision (ICCV). (2015)
42. PyTorch (2016) http://pytorch.org
43. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR abs/1412.6980 (2014)
44. Bi, S., Han, X., Yu, Y.: An L1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM Trans. Graph. 34(4) (2015)