Self-Supervised Shadow Removal
Florin-Alexandru Vasluianu, Andres Romero, Luc Van Gool, Radu Timofte
ETH Zurich
Abstract
Shadow removal is an important computer vision task aiming at the detection and successful removal of the shadow produced by an occluded light source, and at a photo-realistic restoration of the image contents. Decades of research produced a multitude of hand-crafted restoration techniques and, more recently, learned solutions from shadowed and shadow-free training image pairs. In this work, we propose an unsupervised single image shadow removal solution via self-supervised learning by using a conditioned mask. In contrast to existing literature, we do not require paired shadowed and shadow-free images; instead, we rely on self-supervision and jointly learn deep models to remove and add shadows to images. We validate our approach on the recently introduced ISTD and USR datasets. We largely improve quantitatively and qualitatively over the compared methods and set a new state-of-the-art performance in single image shadow removal.
1. Introduction
In an image, a shadow [33] is the direct effect of the occlusion of a light source. By inducing a steep variation in an image region, the shadow impacts the performance of other vision tasks such as image segmentation [6, 1], semantic segmentation [30, 9], object recognition [34, 2, 14] or tracking [20, 23, 4].

In contrast to the unshadowed pixels, the shadow alters the observation of the scene contents by a combination of degradations in illumination, color, detail, and noise levels. The shadow removal task is, essentially, an image restoration task aiming at recovering the underlying content. Many methods [29, 26] have been proposed for detecting and removing shadows from images.

The introduction of large-scale datasets of shadowed and shadow-free image pairs such as SRD [28], ISTD [36] or USR [16] allowed the formulation of the shadow removal process as a regression problem. One of the major challenges is to learn a physically plausible transformation, regardless of the semantic or illumination inconsistencies that may be encountered in the data. Thanks to the advent of Generative Adversarial Networks [12] and their flexibility in learning complex distributions, recent efforts [16, 38] have successfully modeled the shadow removal task as an image-to-image translation problem [18]. However, it has been found [27] that the learned shadow removal transformations are highly prone to artifacts produced in the downsampling/upsampling phases of the translation encoder/decoder model, and, moreover, the deshadowed image regions tend to be blurry [16, 39]. In order to circumvent these problems, recent solutions [36, 16, 24] have proposed carefully designed robust loss functions, thus producing high quality photo-realistic deshadowed results with low pixel-wise restoration errors.

As shadow removal is a perceptual transformation, the usage of a perceptual score based on learnt features at different levels of complexity [40] enabled the exploitation of some invariants over the shadow removal or addition transformation. The increased amount of information used in training is expected to induce additional degrees of control, such that the learning procedure can be faster and the results better, both in terms of fidelity metrics and perceptual scores.

In this work, we propose an unsupervised single image shadow removal solution via self-supervised learning. Our method exploits several observations made on the shadow formation process and employs the cyclic consistency and the Generative Adversarial Nets (GANs) [12] paradigms, as inspired by CycleGAN [41], a seminal architecture for learning image-to-image translation between two image domains without paired data.

An overview of our method is depicted in Figure 1. By assuming that each dataset contains shadow images with respective masks, in either paired or unpaired settings, the core of our method is to exploit the given mask information in a self-supervised fashion by inserting randomly created shadow masks into the training framework (input $m$ during the forward step in Figure 1a), and reconstructing the original input using cycle-consistency (Figure 1b).

Figure 1: The forward (top) and the reconstruction (bottom) steps. As a convention, red lines are used for operations involving the shadow affected input, blue lines for the shadow free input, and black lines for the mask computation operations. $u$ and $v$ may or may not be paired data. Our training framework uses $v$'s mask ($m$) as an input to $G_s$. In a paired setting $\hat{v}$ should resemble $v$, which is not the case in the unpaired setting.

A critical observation is that we do not need to impose strong pixel-wise fidelity losses in our solution, but rather capture contents and general texture and colors, which are inherently perceptual. This helps our model produce higher quality results in a smaller number of training epochs with respect to the literature.
2. Related Work
Shadow removal is not a recent problem in the computer vision community, and despite recent efforts in deep learning and generative modeling it is still a challenging problem.

Early methods tackled this problem by using the underlying physical properties of the shadow formation. They were based on image decomposition into a combination of shadow and shadow-free layers [8, 7], or on an early shadow detection followed by a color transfer from the shadow-free region to the shadow affected region in the local neighborhood [37, 31, 35].

The variety of shadow generation systems (e.g. shapes, size, scale, illumination, etc.) implies an increased complexity in the computation of shadow model parameters, and consequently, models parameterized with these properties are known for not being able to handle shadow removal in complex situations [22]. A step forward in this direction is a two-stage model for shadow detection and removal, respectively, thus increasing the generalization performance. In order to successfully detect the shadow, earlier works [13, 11] proposed to include hand-crafted features such as image intensity, texture, or gradients.

The evolution of Convolutional Neural Networks (CNNs) enabled the propagation of learnable features along the layers of the model, and [22, 21] proposed solutions using CNNs for shadow detection and a Bayesian model for shadow removal. Moreover, [11] pioneered the use of an unsupervised end-to-end auto-encoder model to learn a cross-domain mapping between shadow and shadow-free images. However, the need for manual labeling in the pre-processing step, in order to produce an estimate of the shadow mask, limits this method, both in terms of the complexity of the addressed light-occluder systems and the bias injection.

Qu et al. [28] proposed a model based on three networks extracting relevant features from multiple views and aggregating them to recover the shadow-free image. The G-net (global localization network) extracts a high-level representation of the scene, which the A-net (appearance modeling network) and the S-net (semantic modeling network) use to produce a shadow matte employed in the shadow removal task.

The importance of the localization information is acknowledged by Hu et al. [15], where the shadows are detected and removed using the idea of a Spatial Recurrent Neural Network [3] by exploring the direction-aware context.

Subsequently, since the introduction of Generative Adversarial Networks (GANs) [12], the dominant strategy is to learn an image-to-image mapping function using an encoder/decoder architecture. The de-facto methodologies for image-to-image translation on paired and unpaired data are pix2pix [18] and CycleGAN [41], respectively. In the former, a single transformation between shadow and deshadowed regions is assumed, while in the latter, in order to deal with the unsupervised nature of the data, there are two different models for shadow removal and addition, and a cycle-consistency loss ensures the flow from one domain to the other.

Following this trend, [36] proposed a model based on Conditional GANs [25] using paired data, where two stacked conditional GANs perform shadow detection and then, with the computed information, shadow removal.

Recently, Le et al. [24] proposed a model based on two neural networks able to learn the shadow model parameters and the shadow matte.
The main limitation of this model is the usage of a simple linear relation as the light model. However, for the same occluder there can be multiple light sources producing non-homogeneous shadow areas that cannot be described by a linear model, and therefore the performance of the model is expected to drop. Nonetheless, if the assumption made about the uniqueness of the light source holds, the method is able to produce realistic results.

Despite the recent effort in the shadow removal literature, most prior methods rely on the assumption of paired datasets of shadow and shadow-free images. Going in an unsupervised direction, [16] developed Mask-ShadowGAN using a vanilla CycleGAN [41] approach, where the shadow masks are computed as a binarization of the difference image, thresholding it using Otsu's algorithm. In contrast, we formulate a component in the training objective that offers bounds on the evolution of the synthetically-generated shadow masks (used as input in the reconstruction), which increases the degree of control in the training procedure, improving the performance in both fidelity and perceptual metrics. Moreover, the decrease of the loss during training/validation is more pronounced, even though the generators we use have a smaller number of parameters.
3. Proposed Method
Considering the shadow image domain $X$ and the shadow-free image domain $Y$, we are mainly interested in learning the mapping function $G_f : X \rightarrow Y$. Existing techniques [18] rely on a critical dataset assumption of having access to paired images, i.e. the same scene with and without shadows. As we will show in Section 4, this assumption does not always hold, and an unsupervised approach leads, surprisingly, to better performance. To this end, we assume a subset of unpaired images $T = \{(u, v) \mid u \in Y, v \in X\}$.

The overall scheme of our method is presented in Figure 1. Our system is based on the vanilla CycleGAN approach [41], with one generator and one discriminator for each domain: $G_s$ and $G_f$ perform shadow addition and shadow removal, respectively. In detail, the shadow addition network receives two inputs: an image and a binary mask depicting the shadow. The shadow removal network only receives an image. Formally, $\hat{u} = G_f(v)$ and $\hat{v} = G_s(u, m)$ for removal and addition respectively (Figure 1a), where $m$ is a randomly generated mask. To close the cycle-consistency loop, we use self-supervision to reconstruct the original inputs (Figure 1b). We explain this process in detail later in this section.

Besides our self-supervised training framework, an important ingredient of our method relies on the carefully designed loss functions explained in the following. For simplicity and sake of clarity, it is important to mention that we define our losses regardless of the training setting (paired or unpaired) and the transformation mapping (inserting or removing shadow), and instead use placeholders. We make a clear distinction at the end of this section.
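To make the data flow of Figure 1 concrete, the following is a minimal sketch of the forward and reconstruction steps, assuming PyTorch-style generator modules; passing the mask to $G_s$ by channel-wise concatenation and the helper name `binarize` (the thresholding described later in this section) are our assumptions, not the authors' released code.

```python
# Minimal sketch of the forward and reconstruction steps of Figure 1.
# G_f: shadow removal (image -> image), G_s: shadow addition (image + mask -> image).
import torch

def cycle_step(G_f, G_s, u, v, m, binarize):
    """u: shadow-free image, v: shadow image, m: random binary shadow mask.
    `binarize` turns a difference image into a binary mask (hypothetical helper)."""
    # Forward step: remove the shadow from v, add a shadow to u guided by m.
    u_hat = G_f(v)                                # synthetic shadow-free image
    v_hat = G_s(torch.cat([u, m], dim=1))         # synthetic shadow image (mask as 4th channel)
    # Synthetic mask: where the estimated shadow-free image differs from v.
    m_hat_f = binarize(u_hat - v)
    # Reconstruction step (cycle consistency): recover the original inputs.
    u_rec = G_f(v_hat)                            # should resemble u
    v_rec = G_s(torch.cat([u_hat, m_hat_f], dim=1))  # should resemble v
    return u_hat, v_hat, u_rec, v_rec, m_hat_f
```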
Our main motivation to build an unsupervised shadow removal solution comes from an observation we illustrate in Figure 2. On the one hand, there are pixel-wise inconsistencies (e.g., different lighting conditions outside the shadow or content misalignment) for paired images in the ISTD dataset [36], so building a model under the paired assumption compromises the performance. On the other hand, using loss functions defined solely at the pixel level (L1, L2, etc.) is also not a suitable learning indicator, as it can lead to quite blurry outputs while still minimizing the objective.

Figure 2: Examples of ISTD paired images that are not perfectly aligned and consistent: a) illumination changes; b) semantic changes (check corner).
We aim at removing shadows while preserving the non-shadowed areas as unaltered as possible. Therefore, inspired by recent literature on photo-enhancement [17], style transfer [10] and perceptual super-resolution [19], we form a perceptual ensemble loss for color, content, and style:

$$\mathcal{L}_{perceptual} = \alpha_1 \cdot \mathcal{L}_{color} + \alpha_2 \cdot \mathcal{L}_{content} + \alpha_3 \cdot \mathcal{L}_{style}, \quad (1)$$

where the weights $\alpha_1$, $\alpha_2$ and $\alpha_3$ were chosen empirically in relation to the amplitude of each loss on a subset of the training data. Code will be made publicly available at https://github.com/fvasluianu-cvl/SSSR.git.

The introduction of a color loss can be explained, firstly, by the need to capture and preserve the color information of the image. Under ideal settings, this could be done by imposing a pixel-level loss (e.g., L1, L2, MSE). However, we consider that color is a lower frequency component than the textural information of the image (our eyes are less sensitive to color than to intensity changes), and pixel-level observations are generally noisy (e.g., the pixel-wise inconsistencies in ISTD image pairs). To this end, inspired by [5], we apply a Gaussian filter to the real and fake images and compute the Mean Squared Error (MSE):

$$\mathcal{L}_{color} = MSE(I_{smoothed}, \hat{I}_{smoothed}). \quad (2)$$

Building on the assumption that an image with shadows and its shadow-free counterpart should have similar content in terms of semantically relevant regions, $\mathcal{L}_{content}$ is defined as

$$\mathcal{L}_{content} = \frac{1}{N_l} \sum_{i=1}^{N_l} MSE(C^{i}_{I}, C^{i}_{\hat{I}}), \quad (3)$$

where $C^{i}_{I}$ is the feature representation extracted at the $i$-th target layer of the ImageNet-pretrained VGG-16 network [32] for the input image $I$. $\mathcal{L}_{style}$ is defined as

$$\mathcal{L}_{style} = \frac{1}{N_l} \sum_{i=1}^{N_l} MSE(H^{i}_{I}, H^{i}_{\hat{I}}), \quad (4)$$

$$H^{l}_{I}[i,j] = \sum_{k=1}^{D} C^{l}_{I}[i,k]\, C^{l}_{I}[k,j], \quad (5)$$

where $H^{l}_{I}$ is the Gram matrix of the features extracted at the $l$-th layer of the VGG-16 network. The Gram matrix defines a style for the feature set extracted at each layer for the input image $I$. By minimizing the mean squared error between the styles computed for feature sets at different levels of complexity, the produced results are characterized by better perceptual properties.
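A sketch of the perceptual ensemble loss of Equations (1)-(5) is shown below, assuming PyTorch and torchvision; the chosen VGG-16 layer indices, the Gaussian blur size, and the default weights are illustrative placeholders rather than the values used in the paper (ImageNet input normalization is also omitted for brevity).

```python
# Sketch of the color / content / style ensemble loss (Eqs. 1-5).
import torch
import torch.nn.functional as F
from torchvision import models, transforms

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

blur = transforms.GaussianBlur(kernel_size=21, sigma=5.0)   # smoothing for the color loss
LAYERS = [3, 8, 15, 22]  # relu1_2, relu2_2, relu3_3, relu4_3 (assumed layer choice)

def vgg_feats(x):
    feats, h = [], x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in LAYERS:
            feats.append(h)
    return feats

def gram(f):
    b, c, hh, ww = f.shape
    f = f.view(b, c, hh * ww)
    return torch.bmm(f, f.transpose(1, 2)) / (c * hh * ww)  # normalized Gram matrix

def perceptual_loss(real, fake, a1=1.0, a2=1.0, a3=1.0):
    l_color = F.mse_loss(blur(real), blur(fake))             # Eq. (2)
    fr, ff = vgg_feats(real), vgg_feats(fake)
    l_content = sum(F.mse_loss(a, b) for a, b in zip(fr, ff)) / len(fr)          # Eq. (3)
    l_style = sum(F.mse_loss(gram(a), gram(b)) for a, b in zip(fr, ff)) / len(fr)  # Eq. (4)
    return a1 * l_color + a2 * l_content + a3 * l_style      # Eq. (1)
```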
The formulation of the problem using adversarial learning implies the introduction of two new components, $D_f$ and $D_s$. The main idea behind the learning procedure is that, for each domain, the discriminator distinguishes between synthetic and real results, forcing the counterpart generator to produce better outputs in terms of semantic content and image properties.

As the discriminators are characteristic of the shadow domain $X$ and the shadow-free domain $Y$, the adversarial losses are defined, for the synthetic results produced in the forward step ($\hat{I}_f$, $\hat{I}_s$), as stated in Equations 6 and 7. The image pair $(I_s, I_f)$ is the ground truth shadow/shadow-free pair used as input, and $(\hat{I}_{s*}, \hat{I}_{f*})$ is a pair of randomly sampled synthetic results.

$$\mathcal{L}^{s}_{GAN}(I_s, I_f) = \frac{1}{2}\left( MSE(J, D_s(\hat{I}_s, I_s)) + MSE(O, D_s(\hat{I}_{f*}, I_s)) \right), \quad \forall \hat{I}_{f*} \notin X \quad (6)$$

$$\mathcal{L}^{f}_{GAN}(I_s, I_f) = \frac{1}{2}\left( MSE(J, D_f(\hat{I}_f, I_f)) + MSE(O, D_f(\hat{I}_{s*}, I_f)) \right), \quad \forall \hat{I}_{s*} \notin Y \quad (7)$$

The standard outputs of the discriminators for the positive and negative training examples are defined as $J$ and $O$, as a consequence of using the PatchGAN concept. Thus, $J$ and $O$ are the all-ones and all-zeros matrices, respectively, with a size equal to the receptive field of the discriminator.

The last and core ingredient of our system is the self-supervised shadow loss (S3 loss). In order to generate an image with shadows, the shadow insertion generator receives a binary mask in addition to the RGB image. Our rationale comes from realizing that shadow addition can happen at any random shape, scale, size, and position, so tackling this problem in an unconditional way is ill-posed. Additionally, by guiding the shadow insertion using a conditional mask, we can use a randomly inserted shadow mask on a deshadowed image in order to apply a cycle consistency loss that recovers the mask.

Considering $u$ the shadow-free image and $v$ the shadow image, the generated images have the form $\hat{u} = G_f(v)$ and $\hat{v} = G_s(u, m)$, where $m$ is the shadow mask between the images $u$ and $v$. Similarly, we compute the reconstructed cycle-consistency images as $u_r = G_f(\hat{v})$ and $v_r = G_s(\hat{u}, \hat{m}_f)$, where $\hat{m}_f = Bin(\hat{u} - v)$ is the synthetic shadow mask computed as the binarization of the difference between the synthetic shadow-free image and the true shadow image; the synthetic masks are computed on the part of the cycle where a real mask does not exist. The shadow mask loss is defined as the L1 distance between the masks produced from the intermediate results along the cycles.

The synthetic mask is a very important detail of the proposed model because it is involved in two of the constraints we enforce on the trained model. For the fully supervised training, where the ground truth shadow mask is provided in the training/validation splits, one constraint is the invariance of the shadow mask along the forward step of the cycle, in the shadow removal procedure. The second constraint is the invariance of the mask along the reconstruction step, both in shadow removal and addition, which can be applied in both the paired and unpaired settings.

For the model trained in an unsupervised manner, the mask used in the shadow addition step is sampled from a memory buffer containing the masks produced in the shadow removal procedure of the forward step; this circumvents the problem of creating realistic random masks by reusing the mask from the counterpart forward step. As the generator uses this result in the reconstruction step, its quality is crucial for a qualitative reconstruction.
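The adversarial terms can be sketched as follows, assuming a PatchGAN-style discriminator that receives the candidate and the conditioning image concatenated on the channel axis; the replay-buffer sampling of $\hat{I}_{s*}, \hat{I}_{f*}$ is omitted and the function names are illustrative.

```python
# Sketch of the least-squares patch adversarial losses of Eqs. (6)-(7),
# written here from the discriminator's perspective; the generator term
# reuses the same MSE with the all-ones target J.
import torch
import torch.nn.functional as F

def d_loss(D, real, fake, cond):
    """Push D's patch map towards J (ones) for real inputs and O (zeros)
    for synthetic ones sampled from the replay buffer."""
    pred_real = D(torch.cat([real, cond], dim=1))
    pred_fake = D(torch.cat([fake.detach(), cond], dim=1))
    J = torch.ones_like(pred_real)   # all-ones target over the patch map
    O = torch.zeros_like(pred_fake)  # all-zeros target over the patch map
    return 0.5 * (F.mse_loss(pred_real, J) + F.mse_loss(pred_fake, O))

def g_adv_loss(D, fake, cond):
    pred = D(torch.cat([fake, cond], dim=1))
    return F.mse_loss(pred, torch.ones_like(pred))
```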
Even if a good mask is the result of a realistic transformation, the usage of this proxy loss enables faster learning and, at the same time, the learning of more realistic mappings.

As the images fed to the model in the unpaired setting do not represent the same scene, the loss function has to be carefully chosen such that the equilibrium point can be reached and the learning procedure converges to the required solution. With the GAN losses defined as in Equations 6 and 7, we use replay buffers such that the discriminators are less likely to rely on the simple difference between frames representing similar contents. Thus, the degree of control is weaker, as we sample intermediary results to feed the negative samples of the binary classification problem. Hence, additional information can be exploited by using features of different complexity extracted by a pretrained neural network.

Moreover, another crucial goal is controlling, as much as possible, the quality of the intermediary results, by minimizing the distance between the topology information localizing the hallucinated area in the forward pass and the one observed in the reconstruction pass of the training step. We thus exploit the observation that, under convergence conditions, the shadow removal and shadow addition procedures are inverse to each other, as we formulated shadow removal as a bijective mapping in the problem formulation.

$$\begin{aligned} \mathcal{L}_{gen}(u, v) = \; & \gamma_1 \cdot (\mathcal{L}^{f}_{GAN}(\hat{u}, u) + \mathcal{L}^{s}_{GAN}(\hat{v}, v) + \mathcal{L}^{f}_{GAN}(u_r, u) + \mathcal{L}^{s}_{GAN}(v_r, v)) \\ & + \gamma_2 \cdot (\mathcal{L}_{content}(u, \hat{v}) + \mathcal{L}_{content}(v, \hat{u})) \\ & + \gamma_3 \cdot (\mathcal{L}_{pix}(u, u_r) + \mathcal{L}_{pix}(v, v_r)) \\ & + \gamma_4 \cdot (\mathcal{L}_{perceptual}(u, u_r) + \mathcal{L}_{perceptual}(v, v_r)) \\ & + \gamma_5 \cdot (\mathcal{L}_{mask}(\hat{m}_f, m_{fr}) + \mathcal{L}_{mask}(\hat{m}_s, m_{sr})) + \beta_2 \cdot \mathcal{L}_{mask}(\hat{m}_*, \hat{m}_f) \quad (8) \end{aligned}$$

We choose the total loss for the unpaired case as a linear combination of the described losses, where the $\gamma$ and $\beta$ parameters control the contribution of each loss. Note that each component is easily identified in Figure 1. As the shadow position and shape are invariant under the implied transformations, adding this term to the training objective makes the learnt transformations naturally plausible. As the datasets are not characterized by a high variation in terms of shadow region shapes and positions, the model benefits from adding a loss term such that the mask produced in the reconstruction procedure is similar to the sampled mask $\hat{m}_*$.

As our method can be extended to paired datasets, in Equation 9 we show the modifications to the loss function for this scenario:

$$\begin{aligned} \mathcal{L}_{gen}(u, v) = \; & \gamma_1 \cdot (\mathcal{L}^{f}_{GAN}(\hat{u}, u) + \mathcal{L}^{s}_{GAN}(\hat{v}, v) + \mathcal{L}^{f}_{GAN}(u_r, u) + \mathcal{L}^{s}_{GAN}(v_r, v)) \\ & + \gamma_2 \cdot (\mathcal{L}_{content}(u, \hat{v}) + \mathcal{L}_{content}(v, \hat{u})) \\ & + \gamma_3 \cdot (\mathcal{L}_{pix}(u, u_r) + \mathcal{L}_{pix}(v, v_r)) + \beta_1 \cdot \mathcal{L}_{pix}(u, \hat{u}) \\ & + \gamma_4 \cdot (\mathcal{L}_{perceptual}(u, u_r) + \mathcal{L}_{perceptual}(v, v_r)) \\ & + \gamma_5 \cdot (\mathcal{L}_{mask}(\hat{m}_f, m_{fr}) + \mathcal{L}_{mask}(\hat{m}_s, m_{sr})) + \beta_2 \cdot \mathcal{L}_{mask}(m, \hat{m}_f) \quad (9) \end{aligned}$$

When training with paired data, an additional constraint can be used to speed up convergence, by adding the term $\mathcal{L}_{pix}(u, \hat{u})$, the L1 pixel-wise loss between the input shadow-free image and the shadow-free image generated in the forward step of the cycle.
The $\beta_1$ parameter is needed to force the model to create a suitable transformation in the forward step of the cycle, as the reconstruction process uses this intermediate representation of the shadow-free image. As the mask shape and position should be the same, the shadow masks should not differ along the cycle, so $\mathcal{L}_{mask}$, the L1 distance between the two shadow masks, was introduced. The weights used in the linear combination of GAN, L1, perceptual, and mask losses were determined with respect to the magnitude of each term and their speed of decrease. The exact values are provided in Table 1, for both the paired and the unpaired settings.

Table 1: Parameters of the total loss function (Equations 8 and 9) defined for our training framework.

Setting           | γ1  | γ2 | γ3  | γ4 | γ5 | β1 | β2
Unpaired training | 250 | 10 | 100 | 30 | 60 | 0  | 100
Paired training   | 250 | 20 | 60  | 50 | 60 | 10 | 100
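As an illustration of how the terms are combined, a minimal sketch of the unpaired objective of Equation 8 with the Table 1 weights is shown below; which $\gamma_i$ multiplies which loss group follows the order of Equation 8 and is our reading of the table, not an authors' statement.

```python
# Sketch of the weighted combination in Eq. (8), unpaired setting (Table 1).
def total_loss_unpaired(terms, gammas=(250, 10, 100, 30, 60), beta2=100):
    """`terms` maps loss-group names (illustrative keys) to scalar tensors."""
    g1, g2, g3, g4, g5 = gammas
    return (g1 * terms["gan"]             # four adversarial terms
            + g2 * terms["content"]       # VGG content terms
            + g3 * terms["pix_cycle"]     # L1 cycle-reconstruction terms
            + g4 * terms["perceptual"]    # perceptual ensemble on reconstructions
            + g5 * terms["mask_cycle"]    # mask consistency along the cycle
            + beta2 * terms["mask_forward"])  # mask consistency in the forward step
```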
Generator. Details of the generator implementation are provided in Table 2. The downsampling operation is a convolution with kernel size 4, stride 2 and padding 1, and the upsampling operation is its counterpart using a transposed convolution. Skip connections were added between the downsampling blocks and their upsampling counterparts. The operation in the last layer of the architecture is a convolution with kernel size 4 and padding 1, preceded by an upsampling with scale factor 2 and a zero padding of 1 at the top and the bottom of the feature tensor. The result is passed through a tanh activation, producing the corresponding pixels of the output image.

Table 2: General details about the architecture of the generators. LR is LeakyReLU(0.2), R is ReLU, and TH is the hyperbolic tangent activation function. "down", "up" and "out" denote the downsampling, upsampling and output operations described above.

Layer         | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11   | 12   | 13   | 14   | 15   | 16
Channels in   | 3/4  | 64   | 128  | 256  | 512  | 512  | 512  | 512  | 512  | 1024 | 1024 | 1024 | 1024 | 512  | 256  | 64
Channels out  | 64   | 128  | 256  | 512  | 512  | 512  | 512  | 512  | 512  | 512  | 512  | 512  | 256  | 128  | 64   | 3
Operation     | down | down | down | down | down | down | down | down | up   | up   | up   | up   | up   | up   | up   | out
Normalization | 0    | 1    | 1    | 1    | 1    | 1    | 1    | 0    | 1    | 1    | 1    | 1    | 1    | 1    | 1    | 0
Activation    | LR   | LR   | LR   | LR   | LR   | LR   | LR   | LR   | R    | R    | R    | R    | R    | R    | R    | TH
Dropout       | 0    | 0    | 0    | 0.5  | 0.5  | 0.5  | 0.5  | 0.5  | 0.5  | 0.5  | 0.5  | 0.5  | 0    | 0    | 0    | 0
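The block structure described above can be sketched as follows (assuming PyTorch); the channel widths per layer are those of Table 2, while the exact module composition shown here is our interpretation of the description.

```python
# Sketch of one downsampling / upsampling block pair of the generator.
import torch
import torch.nn as nn

class Down(nn.Module):
    def __init__(self, c_in, c_out, norm=True, drop=0.0):
        super().__init__()
        layers = [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1)]
        if norm:
            layers.append(nn.InstanceNorm2d(c_out))
        layers.append(nn.LeakyReLU(0.2))
        if drop:
            layers.append(nn.Dropout(drop))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    def __init__(self, c_in, c_out, drop=0.0):
        super().__init__()
        layers = [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                  nn.InstanceNorm2d(c_out), nn.ReLU()]
        if drop:
            layers.append(nn.Dropout(drop))
        self.block = nn.Sequential(*layers)

    def forward(self, x, skip):
        # Concatenate the skip connection from the matching Down block, which
        # is why the "Channels in" of the following layers double in Table 2.
        return torch.cat([self.block(x), skip], dim=1)
```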
Discriminator. The discriminator consists of four convolutional blocks, each having a convolution operator with $k = 4$, instance normalization, and LeakyReLU as activation ($a = 0.2$). The final output size of the discriminator is the size of the patch described as the receptive field of the model. The depth of the initial input tensor is explained by the fact that the discriminator receives as input a pair of images, each with three channels, as they are RGB images.
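A sketch of such a discriminator is given below, under the assumption that the intermediate channel widths double at each block (64-128-256-512, not specified in the text).

```python
# Sketch of the PatchGAN-style discriminator: four blocks with kernel size 4,
# instance normalization and LeakyReLU(0.2), taking a concatenated pair of
# RGB images (6 input channels) and producing a per-patch score map.
import torch.nn as nn

def disc_block(c_in, c_out, norm=True):
    layers = [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(c_out))
    layers.append(nn.LeakyReLU(0.2))
    return layers

discriminator = nn.Sequential(
    *disc_block(6, 64, norm=False),
    *disc_block(64, 128),
    *disc_block(128, 256),
    *disc_block(256, 512),
    nn.Conv2d(512, 1, 4, padding=1),  # patch map compared against J / O
)
```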
Initialization. The weights of both the discriminators and the generators were initialized by drawing from a Gaussian distribution with mean 0 and variance 0.2.
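A minimal sketch of this initialization in PyTorch; applying it only to the (transposed) convolution weights is our assumption.

```python
# Sketch of the zero-mean Gaussian weight initialization described above.
import torch.nn as nn

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        # variance 0.2 as stated in the text, i.e. std = sqrt(0.2)
        nn.init.normal_(m.weight, mean=0.0, std=0.2 ** 0.5)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# usage: generator.apply(init_weights); discriminator.apply(init_weights)
```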
4. Experimental Results
Datasets.
We validate our system on the ISTD [36] and USR [16] datasets. On the one hand, the ISTD dataset contains paired shadow and shadow-free images. Given the illumination inconsistency problem in this dataset, Le et al. [24] proposed a compensation method, thus creating the ISTD+ dataset. On the other hand, the USR dataset is a collection of unpaired shadow and shadow-free images used for unsupervised tasks.

For unpaired training and testing on the ISTD dataset, the shadow and shadow-free images were randomly sampled. The random mask inserted at each iteration comes from a buffer bank of real masks.
Experimental Framework.
We train our system for 100 epochs, with λ-decay scheduling of the learning rate after the first 40 epochs, using the Adam optimizer.

For both the paired and unpaired settings, the masks were computed as a binarization of the difference between the shadow-free image and the shadow image, thresholding it at the median value of the difference.
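The mask computation can be sketched as follows; averaging the difference over the color channels before thresholding is our choice, not specified in the text.

```python
# Sketch of the shadow-mask binarization: threshold the difference between
# the shadow-free and shadow images at its median value.
import torch

def shadow_mask(shadow_free, shadow):
    """Inputs: tensors of shape (B, 3, H, W) in the same value range."""
    diff = (shadow_free - shadow).mean(dim=-3, keepdim=True)  # per-pixel difference
    thr = diff.median()  # global median for brevity; a per-image median may be preferable
    return (diff > thr).float()  # 1 inside the shadow region, 0 outside
```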
Evaluation measures. For the quantitative evaluation of our method, we use the Root Mean Square Error (RMSE) and the Peak Signal to Noise Ratio (PSNR) between the output deshadowed image and the reference/ground truth image. We compute these pixel-wise fidelity measures in both the RGB and Lab color spaces. It is well established that RMSE and PSNR do not correlate well with perceptual quality, so, complementary to the fidelity measures, we also employ the LPIPS [40] score in order to assess the photo-realism of the produced deshadowed images with respect to the ground truth.
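A sketch of the fidelity measures, assuming scikit-image for the Lab conversion and images given as float arrays in [0, 1]; the peak value used for the Lab-space PSNR is a modelling choice here, not taken from the paper.

```python
# Sketch of RMSE / PSNR evaluation in RGB and Lab color spaces.
import numpy as np
from skimage.color import rgb2lab

def rmse_psnr(pred, gt, peak):
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return np.sqrt(mse), 10.0 * np.log10(peak ** 2 / mse)

def evaluate(pred_rgb, gt_rgb):
    """pred_rgb, gt_rgb: (H, W, 3) float arrays in [0, 1]."""
    rmse_rgb, psnr_rgb = rmse_psnr(pred_rgb * 255.0, gt_rgb * 255.0, peak=255.0)
    rmse_lab, psnr_lab = rmse_psnr(rgb2lab(pred_rgb), rgb2lab(gt_rgb), peak=100.0)
    return {"RMSE_RGB": rmse_rgb, "PSNR_RGB": psnr_rgb,
            "RMSE_Lab": rmse_lab, "PSNR_Lab": psnr_lab}
```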
Compared methods. We directly compare our proposed solution to two other methods capable of learning from unpaired data: CycleGAN [41] and Mask-ShadowGAN [16]. Moreover, in order to compare with prior systems using paired learning, we report qualitative and quantitative results for the following methods: DSC [15], ST-CGAN [36] and DeShadowNet [28].
In Table 3 we report our results on ISTD for different settings. Each configuration was trained for 100 epochs on the ISTD dataset. When switching from a learning procedure based on both fidelity and perceptual losses to an objective based only on fidelity losses ($\gamma_4 = 0$), the results improve in fidelity but lack in perceptual terms. The removal of the mask loss ($\gamma_5 = 0$) produces similar results in terms of both perceptual and fidelity measures, but the standard deviation of the LPIPS score is higher, due to a more pronounced difficulty of the model in dealing with more complex textures. The forward pixel-wise loss (removed when $\beta_1 = 0$) is very important in order to produce realistic results in the forward step of the cycle, even though the learnt latent representations ($\hat{u}$, $\hat{v}$), for both the shadow and shadow-free domains, produce the best results in terms of reconstruction error (either fidelity or perceptual score).
Table 3: The impact of various loss function parameters on the performance.
Method                  | LPIPS avg | LPIPS stddev | RMSE RGB | RMSE Lab | PSNR RGB | PSNR Lab
Paired setting (γ = 0)  | 0.032     | 0.029        | 14.75    | 4.15     | 27.1     | 38.05
Paired setting (γ = 0)  | 0.033     | 0.022        | 14.07    | 4.01     | 27.39    | 38.31
Paired setting (γ = 0)  | 0.031     | 0.027        | –        | –        | –        | –
Paired setting (β = 0)  | 0.035     | –            | –        | –        | –        | –
Paired setting (β = 0)  | 0.021     | 0.046        | 5.96     | 2.61     | 34.27    | 41.67

As a better mask is a consequence of a better reconstruction, dropping the forward mask term ($\beta_2 = 0$) produces results with similar pixel-wise properties, but lacking in perceptual terms.

Table 4: Ablative results in terms of pixel-wise error (RMSE and PSNR, both in Lab space) and perceptual quality (LPIPS) for different settings of the loss function.

Setting description        | RMSE ↓ | PSNR ↑ | LPIPS ↓
default set of parameters  | 5.73   | 33.40  | –
default and γ = 0          | –      | –      | –
γ = 10                     | –      | –      | –
γ = 0                      | –      | –      | –
γ = 0                      | –      | –      | –

One configuration was not considered due to the random forward mapping effect.

To quantitatively evaluate the performance of our shadow removal solution, we adhere to the ISTD and USR benchmarks [16, 36] and report the results in Table 5.
Table 5: Comparison with state-of-the-art methods on the ISTD and USR datasets.
ISTD test images:

Method              | LPIPS avg | LPIPS stddev | RMSE RGB | RMSE Lab | PSNR RGB | PSNR Lab
Unpaired data training:
MaskShadowGAN [16]  | 0.25      | 0.09         | 28.34    | 7.32     | 19.78    | 31.65
CycleGAN [41]       | 0.118     | 0.07         | 25.4     | 6.95     | 20.59    | 31.83
ours (unpaired)     | 0.041     | 0.033        | 7.58     | 5.12     | 31.18    | 34.45
Paired data training:
DeShadowNet [28]    | 0.080     | 0.055        | 31.96    | 7.98     | 19.30    | 31.27
DSC [15]            | 0.202     | 0.087        | 23.36    | 6.03     | 21.85    | 33.63
ST-CGAN [36]        | 0.067     | 0.043        | 22.11    | 5.93     | 22.66    | 34.05
ours (paired)       | 0.031     | 0.025        | 15.05    | 4.18     | 26.88    | 37.90

USR test images (unpaired data training):

Method              | LPIPS avg | LPIPS stddev | RMSE RGB | RMSE Lab | PSNR RGB | PSNR Lab
MaskShadowGAN [16]  | 0.31      | 0.11         | 27.53    | 7.06     | 19.97    | 31.76
CycleGAN [41]       | 0.147     | 0.07         | 30.04    | 9.66     | 19.04    | 29.06
ours (unpaired)     | 0.009     | 0.004        | 5.70     | 2.21     | 33.26    | 41.06
Table 6: Lab color space results for both shadow and shadow-free pixels on the ISTD [36] and ISTD+ [24] datasets.
Method           | Train | Test  | All RMSE | All PSNR | Shadow RMSE | Shadow PSNR | Shadow-free RMSE | Shadow-free PSNR
ours (unpaired)  | ISTD  | ISTD  | 5.12     | 34.45    | 6.98        | 32.65       | 4.94             | 34.71
ours (paired)    | ISTD  | ISTD  | 4.18     | 37.90    | 4.63        | 36.87       | 4.07             | 38.22
ours (paired)    | ISTD+ | ISTD+ | –        | –        | –           | –           | –                | –
[24] (paired)    | ISTD+ | ISTD+ | 3.8      | n/a      | 7.4         | n/a         | 3.1              | n/a

For all the reported results, we used our models trained on the training partition of the ISTD dataset, for 100 epochs, for both the paired and the unpaired settings. For the unpaired setting, the shadow and shadow-free training images were sampled without replacement. The USR dataset provides a collection of shadow-free images and two splits of shadow images, for training and validation, which do not represent the same scenes as the shadow-free images. The same sampling procedure was deployed for the USR dataset.

As shown in Table 5, our models largely improve over the state-of-the-art in both fidelity (RMSE, PSNR) and perceptual measures (LPIPS) on both benchmarks.
Figure 3 shows visual results obtained on randomly picked ISTD test images. We note that the results achieved by our solutions are the closest to the reference shadow-free images, while the other methods generally produce strong artifacts. Our results clearly improve over the unpaired state-of-the-art Mask-ShadowGAN, producing more appealing and artifact-free images. In the paired setting, our method completely removes the shadow while the related methods leave visible traces.
Figure 3: Visual results for the proposed solution and comparison with state-of-the-art learned methods. Best zoomed in on screen for details.
Figure 4: Visual results for the proposed solutions trained with unpaired and paired data, and corresponding error heatmaps. Best zoomed in on screen for details.
For both the paired and unpaired settings, our system produces the best perceptual metric (lower LPIPS) and the best pixel-wise error metrics (PSNR and RMSE) with respect to state-of-the-art methods, by large margins.

Figure 4 shows results and the squared L2 norm of the residuals in image space for our models. We observe that the paired version of the model has problems recovering the unshadowed region in neighborhoods characterized by sharp variations in color and illumination. This could be due to the ISTD dataset used for training the model: ISTD has a limited shadow formation diversity in its pairs, and therefore the model provides poorer results on images representing much more complex scenes.

Furthermore, as we show in Figure 2, the semantic differences and the differences in illumination are expected to induce a certain degree of uncertainty when using heavily weighted L1 loss terms between ground truth images and the synthetically generated images in the same domain. Therefore, as can be seen in Figure 4, the error is not concentrated in the shadow affected area, but in regions with steep variations in texture, where some error peaks can be observed. When training in an unpaired manner, by simply dropping this loss term we can overcome this issue, improving our results on the ISTD dataset compared to the paired setting.

The unpaired version also benefits from both sampling processes deployed, i.e., for the shadow mask (using a mask buffer) and for the negative examples in discriminator training. Since the sample sets are dynamically generated from synthetic data, the variation of the provided examples is expected to be higher. Therefore, the generalization ability of the model increases (as can be observed in Table 5), producing better results in terms of both pixel-wise and perceptual metrics. This behaviour can be explained by the model benefiting from the variety of random localization/shape combinations characterizing the shadowed region.

Additionally, the higher generalization ability of the discriminators provides another degree of control, such that a robust learning procedure converges to a realistic mapping from the shadow domain $X$ to the shadow-free domain $Y$. Although the degree of control is weak under the unpaired setting, the exploitation of both deep features and the proxy loss defined for the transformed region provides sufficient information for the learnt mapping to be realistic.
5. Conclusions
In this work we proposed a novel unsupervised single image shadow removal solution. We rely on self-supervision and jointly learn shadow removal from and shadow addition to images. As our experimental results on the ISTD and USR datasets show, we set a new state-of-the-art in single image shadow removal, largely outperforming prior works in both fidelity (RMSE, PSNR) and perceptual quality (LPIPS) for both paired and unpaired settings.
Acknowledgments
This work was partly supported by the ETH Zürich Fund (OK), a Huawei project, an Amazon AWS grant, and an Nvidia hardware grant.
References

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
[2] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 328–335, 2014.
[3] Sean Bell, C. Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016.
[4] Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg, and Joost Van de Weijer. Adaptive color attributes for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1090–1097, 2014.
[5] Etienne de Stoutz, Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, and Luc Van Gool. Fast perceptual image enhancement. In The European Conference on Computer Vision (ECCV) Workshops, September 2018.
[6] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
[7] Graham D. Finlayson, Mark S. Drew, and Cheng Lu. Entropy minimization for shadow removal. International Journal of Computer Vision, 85(1):35–57, 2009.
[8] Graham D. Finlayson, Steven D. Hordley, and Mark S. Drew. Removing shadows from images. In European Conference on Computer Vision, pages 823–836. Springer, 2002.
[9] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, Pablo Martinez-Gonzalez, and Jose Garcia-Rodriguez. A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing, 70:41–65, 2018.
[10] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[11] Han Gong and Darren Cosker. Interactive shadow removal and ground truth for variable scene categories. In Proceedings of the British Machine Vision Conference, 2014.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[13] Ruiqi Guo, Qieyun Dai, and Derek Hoiem. Paired regions for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2956–2967, 2013.
[14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[15] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, Jing Qin, and Pheng-Ann Heng. Direction-aware spatial context features for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019. To appear.
[16] Xiaowei Hu, Yitong Jiang, Chi-Wing Fu, and Pheng-Ann Heng. Mask-ShadowGAN: Learning to remove shadows from unpaired data. In ICCV, 2019.
[17] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool. DSLR-quality photos on mobile devices with deep convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3277–3285, 2017.
[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[19] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
[20] Pakorn KaewTraKulPong and Richard Bowden. An improved adaptive background mixture model for real-time tracking with shadow detection. In Video-Based Surveillance Systems, pages 135–144. Springer, 2002.
[21] Salman Khan, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. Automatic shadow detection and removal from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38:1–1, 2015.
[22] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Automatic feature learning for robust shadow detection. In CVPR, pages 1939–1946, June 2014.
[23] Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Luka Cehovin, Gustavo Fernandez, Tomas Vojir, Gustav Hager, Georg Nebehay, and Roman Pflugfelder. The visual object tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1–23, 2015.
[24] Hieu Le and Dimitris Samaras. Shadow removal via shadow image decomposition. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
[25] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[26] Vu Nguyen, Tomas F. Yago Vicente, Maozheng Zhao, Minh Hoai, and Dimitris Samaras. Shadow detection with conditional generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4510–4518, 2017.
[27] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10):e3, 2016.
[28] L. Qu, J. Tian, S. He, Y. Tang, and R. W. H. Lau. DeshadowNet: A multi-context embedding deep network for shadow removal. In CVPR, pages 2308–2316, July 2017.
[29] Andres Sanin, Conrad Sanderson, and Brian C. Lovell. Shadow detection: A survey and comparative evaluation of recent methods. Pattern Recognition, 45(4):1684–1695, 2012.
[30] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2016.
[31] Yael Shor and Dani Lischinski. The shadow meets the mask: Pyramid-based shadow removal. Computer Graphics Forum, 27(2):577–586, April 2008.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Marc Stamminger and George Drettakis. Perspective shadow maps. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 557–562, 2002.
[34] Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[35] T. F. Y. Vicente, M. Hoai, and D. Samaras. Leave-one-out kernel optimization for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):682–695, March 2018.
[36] Jifeng Wang, Xiang Li, and Jian Yang. Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1788–1797, 2018.
[37] Tai-Pang Wu, Chi-Keung Tang, Michael S. Brown, and Heung-Yeung Shum. Natural shadow matting. ACM Transactions on Graphics, 26(2):8–es, June 2007.
[38] Ling Zhang, Chengjiang Long, Xiao-Long Zhang, and Chunxia Xiao. RIS-GAN: Explore residual and illumination with generative adversarial networks for shadow removal. In AAAI, 2020.
[39] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.
[40] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[41] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.