Deep Retinex Network for Estimating Illumination Colors with Self-Supervised Learning
Kouki Seo
Department of Computer Science, Tokyo Metropolitan University
Tokyo, Japan
[email protected]

Yuma Kinoshita
Department of Computer Science, Tokyo Metropolitan University
Tokyo, Japan
[email protected]

Hitoshi Kiya
Department of Computer Science, Tokyo Metropolitan University
Tokyo, Japan
[email protected]
Abstract—We propose a novel Retinex image-decomposition network that can be trained in a self-supervised manner. Retinex image decomposition aims to decompose an image into illumination-invariant and illumination-variant components, referred to as "reflectance" and "shading," respectively. Although there are three consistencies that the reflectance and shading should satisfy, most conventional work considers only one or two of them. For this reason, all three consistencies are considered in the proposed network. In addition, by using generated pseudo-images for training, the proposed network can be trained with self-supervised learning. Experimental results show that our network can decompose images into reflectance and shading components. Furthermore, it is shown that the proposed network can be used for white-balance adjustment.
Index Terms—Retinex decomposition, intrinsic image decomposition, white balance, self-supervised learning
I. INTRODUCTION
In Retinex theory [1], a natural image consists of the reflectance and the shading of a scene. The reflectance and the shading are an illumination-invariant component and an illumination-variant component, respectively. Retinex image decomposition aims to decompose a natural image into these two components. Various methods have been proposed to enable the decomposition [2]-[10], most of which are based on deep neural networks (DNNs).

In the Retinex decomposition, there are three premises regarding consistency: reconstruction consistency, reflectance consistency in terms of exposures, and reflectance consistency in terms of illumination colors. Most conventional methods consider only some of them. The methods in [4], [5] consider all of the premises, but their performance is limited by the difficulty of preparing a large amount of real or synthetic data for training. Several decomposition methods trained with supervised learning have also been proposed [7]-[10]. They often use a highly synthetic dataset or a human-labeled dataset of real scenes [11]-[13]. However, such datasets are insufficient for generalizing to real scenes.

To solve these problems, in this paper, we propose a novel Retinex image-decomposition network that addresses both the three premises and the data problem. For training the proposed network, we generate pseudo images that correspond to images taken under various exposure and illumination-color conditions. By using such training data, the proposed network can be trained with self-supervised learning, and the difficulty of preparing a large amount of data can be overcome. The proposed network decomposes an input image I into reflectance R_I, gray shading GS_I, and a single RGB vector c_I that represents an illumination color. Shading S_I, which includes the effect of the illumination color, is obtained by multiplying the outputs GS_I and c_I.

We evaluate the performance of the decomposition and of the estimation of illumination colors in terms of mean squared error (MSE) and the hue difference ΔH of CIEDE2000 [15]. Experimental results show that our network can decompose input images and identify their illumination colors.

Fig. 1. Retinex image decomposition.

II. PRELIMINARIES
A. Retinex decomposition
In Retinex theory, a natural image I can be written as the pixel-wise product of reflectance R_I and shading S_I, as shown in Fig. 1:

I(x, y) = R_I(x, y) · S_I(x, y),  (1)

where (x, y) indicates a pixel coordinate, R_I(x, y) is in the range [0, 1], and S_I(x, y) is in the range [0, ∞). The goal of Retinex decomposition is to estimate reflectance R_I and shading S_I from a given image I. Shading S_I will be spatially smooth because it is a map of the illumination intensity. In contrast, reflectance R_I will be spatially discontinuous, since it is expected to include the textures and edges of objects.

B. Effects of exposure change on Retinex decomposition

A change of the brightness (or exposure) of an image affects its Retinex decomposition. Here, we discuss the effects of an exposure change.

The exposure of an image is usually expressed in terms of an exposure value (EV), and the proper exposure for a scene is automatically decided by a camera [16]-[19]. The exposure value is commonly controlled by changing the shutter speed, although it can also be controlled by adjusting various other camera parameters. Here, we assume that all camera parameters except for the shutter speed are fixed. Let v = 0 [EV] and I_v be the proper exposure value and the corresponding captured image under the given conditions, respectively. By assuming that the camera response is linear with respect to the light intensity, an image I_{v_i} exposed at v_i [EV] is written as

I_{v_i}(x, y) = 2^{v_i} I_v(x, y).  (2)

From Eqs. (1) and (2), the Retinex decompositions of I_v and I_{v_i} are given as

I_v(x, y) = R_{I_v}(x, y) · S_{I_v}(x, y),  (3)
I_{v_i}(x, y) = R_{I_{v_i}}(x, y) · S_{I_{v_i}}(x, y) = 2^{v_i} R_{I_v}(x, y) · S_{I_v}(x, y),  (4)

respectively. Since images I_v and I_{v_i} capture the same scene, we obtain the following relations:

R_{I_{v_i}} = R_{I_v},  (5)
S_{I_{v_i}} = 2^{v_i} S_{I_v}.  (6)
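These exposure relations can be checked numerically with a toy example; the reflectance and shading arrays below are synthetic stand-ins (not network outputs), and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scene: reflectance in (0, 1], shading in (0, inf)
R = rng.uniform(0.1, 1.0, size=(4, 4))
S = rng.uniform(0.1, 2.0, size=(4, 4))

I_0 = R * S              # image at the proper exposure v = 0 [EV], Eq. (1)

v_i = 1.0                # change the exposure by +1 EV
I_vi = 2.0 ** v_i * I_0  # Eq. (2): linear camera response

# The shading that decomposes I_vi with the SAME reflectance
S_vi = I_vi / R          # consistent with Eqs. (3) and (4)

assert np.allclose(S_vi, 2.0 ** v_i * S)  # Eq. (6): shading scales by 2^{v_i}
assert np.allclose(I_vi / S_vi, R)        # Eq. (5): reflectance is unchanged
```

The check confirms that an exposure change is absorbed entirely by the shading component, leaving the reflectance invariant.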
C. Effects of illumination color on Retinex decomposition

Similarly to an exposure change, a change of the illumination color also affects shading S_I. Let c = (1, 1, 1)^T, I_c, and S_{I_c} be the white illumination color, an image taken under that illumination, and its corresponding shading, respectively. Then, the shading S_{I_{c_j}} corresponding to an image I_{c_j} taken under illumination color c_j = (r, g, b)^T is given as

S_{I_{c_j}}(x, y) = M_{c_j} S_{I_c}(x, y),  (7)

where M_{c_j} = diag(c_j). For this reason, the Retinex decompositions of I_c and I_{c_j} are given by

I_c(x, y) = diag(R_{I_c}(x, y)) S_{I_c}(x, y),  (8)
I_{c_j}(x, y) = diag(R_{I_{c_j}}(x, y)) S_{I_{c_j}}(x, y) = diag(R_{I_c}(x, y)) M_{c_j} S_{I_c}(x, y) = M_{c_j} diag(R_{I_c}(x, y)) S_{I_c}(x, y),  (9)

where we used the relation

R_{I_{c_j}} = R_{I_c}.  (10)

Therefore, the relationship between I_c and I_{c_j} is written as

I_{c_j}(x, y) = M_{c_j} I_c(x, y).  (11)
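The illumination-color relations can be verified the same way; since M_{c_j} is diagonal, it reduces to per-channel scaling. The arrays and the example color below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic scene under white illumination c = (1, 1, 1)^T
R = rng.uniform(0.1, 1.0, size=(4, 4, 3))  # RGB reflectance
S = rng.uniform(0.1, 2.0, size=(4, 4, 3))  # shading under white light
I_white = R * S                            # Eq. (8), diag(R) S computed pixel-wise

c_j = np.array([0.9, 0.7, 0.5])            # an illumination color (r, g, b)
S_cj = S * c_j                             # Eq. (7): M_{c_j} S as per-channel scaling
I_cj = R * S_cj                            # Eq. (9)

assert np.allclose(I_cj, c_j * I_white)    # Eq. (11)
assert np.allclose(I_cj / S_cj, R)         # Eq. (10): reflectance is unchanged
```

As with exposure, the illumination color is absorbed by the shading component only.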
D. Scenario

In the Retinex decomposition, there are three premises:
Reconstruction consistency: The product of the estimated reflectance and shading matches the corresponding original image, as shown in Eq. (1).

Reflectance consistency (exposure): Reflectances are invariant against a change of exposure values, as in Eq. (5).

Reflectance consistency (color): Reflectances are invariant against a change of illumination colors, as in Eq. (10).

Most conventional work considers only a part of these premises, e.g., reconstruction consistency and reflectance consistency (exposure). In such a case, the resulting reflectance components are affected by the exposure and illumination-color conditions. In [4], all three premises are considered by training a DNN on videos taken by a fixed-point camera. However, that DNN still has limited performance due to the limited amount of real data available for training.

For these reasons, in this paper, we propose a novel DNN for the Retinex decomposition that considers all three premises as well as the data problem. For training our network, we generate pseudo images from original images; these correspond to images taken under various exposure and illumination-color conditions. By using them, our network can be trained with self-supervised learning while considering the above three premises. In addition, since we generate the pseudo images from general datasets, the problem of the amount of data can be overcome.

III. PROPOSED RETINEX NETWORK
In this paper, we aim to decompose an image I into reflectance R_I and shading S_I by using a deep neural network. The key idea of our approach is to consider the three premises in Section II-D. The proposed network can be trained in a self-supervised manner while satisfying the premises.

A. Network architecture
Figure 2 illustrates the architecture of the proposed network. The network receives an input image I and outputs reflectance R_I, gray shading GS_I, and an RGB vector c_I. Our network has a single encoder and three decoders. The encoder transforms the input image I into feature maps that are fed into the decoders. Reflectance R_I, with RGB color channels, is obtained directly as the output of one decoder. In contrast, shading S_I is given as the product of the RGB vector c_I and the gray-scale shading GS_I, which are output by the other two decoders, respectively.
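The paper does not specify the layer configurations, so the sketch below only mirrors the one-encoder, three-decoder wiring to make the tensor shapes concrete; random per-pixel linear maps stand in for the convolutional encoder and decoders, and all sizes are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, F = 8, 8, 16           # image size and feature width (illustrative)

# Random per-pixel linear maps stand in for the conv encoder/decoders.
W_enc = rng.normal(size=(3, F))
W_ref = rng.normal(size=(F, 3))
W_gs  = rng.normal(size=(F, 1))
W_col = rng.normal(size=(F, 3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decompose(I):
    """I: (H, W, 3) -> reflectance R, gray shading GS, color vector c."""
    feat = np.maximum(I @ W_enc, 0.0)             # shared encoder (ReLU features)
    R  = sigmoid(feat @ W_ref)                    # (H, W, 3), values in [0, 1]
    GS = np.exp(feat @ W_gs)                      # (H, W, 1), values in (0, inf)
    c  = sigmoid(feat.mean(axis=(0, 1)) @ W_col)  # (3,), global illumination color
    return R, GS, c

I = rng.uniform(0.0, 1.0, size=(H, W, 3))
R, GS, c = decompose(I)
S = c * GS     # shading including the illumination color
I_hat = R * S  # reconstruction, Eq. (1)
```

The point of the sketch is the data flow: one shared feature extractor, three heads, and a shading S_I assembled as c_I times GS_I.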
B. Data generation for self-supervised learning

To account for the three premises in Section II-D, images of a single scene taken under various exposure and illumination-color conditions are required for training the proposed network. However, it is very costly to collect such images. For this reason, we generate pseudo images from raw images and use them for training the proposed network.

Fig. 2. Network architecture.
Because raw images are not affected by the non-linear response of a camera, multiplying their pixel values by a scalar corresponds to the exposure change in Eq. (2). In addition, Eq. (11) is equivalent to applying a color-transfer matrix, as used in white-balance adjustment in the RGB color space, to an image. Hence, images generated from raw images in accordance with Eqs. (12) and (13) can be used for training the proposed network.

We utilize three color-transferred multi-exposure images I_{v_i, c_i} (i ∈ {1, 2, 3}) having exposure value v_i and illumination color c_i for calculating the loss. Images I_{v_i, c_i} are generated from a raw image I_raw as follows:

1) Obtain an RGB image I_RGB by demosaicing the raw image I_raw.
2) Generate three multi-exposure images I_{v_i} with different exposure values v_i [EV] from I_RGB in accordance with Eq. (2):

I_{v_i} = (2^{v_i} / g(I_RGB)) I_RGB,  (12)

where g(I_RGB) denotes the geometric mean of the luminance of I_RGB.
3) Generate the color-transferred multi-exposure images I_{v_i, c_i} by multiplying I_{v_i} by M_{c_i} = diag(c_i):

I_{v_i, c_i} = M_{c_i} I_{v_i},  (13)

where c_i is a random vector whose components are drawn from a fixed interval.
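The three steps above can be sketched as follows. The Rec. 709 luma weights, the exposure set {-1, 0, 1} [EV], and the color range [0.5, 1.0] are our assumptions (the concrete values are not legible in the source), and the random array stands in for a demosaiced raw image:

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_mean_luminance(img):
    # Rec. 709 luma weights (assumed); epsilon avoids log(0).
    lum = img @ np.array([0.2126, 0.7152, 0.0722])
    return np.exp(np.mean(np.log(lum + 1e-6)))

def generate_pseudo_images(I_rgb, exposures, rng):
    """Steps 2-3: exposure scaling (Eq. (12)) then color transfer (Eq. (13))."""
    out = []
    for v in exposures:
        I_v = 2.0 ** v / geometric_mean_luminance(I_rgb) * I_rgb
        c = rng.uniform(0.5, 1.0, size=3)  # random illumination color (range assumed)
        out.append((c, c * I_v))           # M_c I_v as per-channel scaling
    return out

I_rgb = rng.uniform(0.05, 1.0, size=(8, 8, 3))  # stand-in for a demosaiced raw image
pairs = generate_pseudo_images(I_rgb, exposures=[-1.0, 0.0, 1.0], rng=rng)
```

Each returned pair (c_i, I_{v_i, c_i}) carries the ground-truth illumination color alongside the pseudo image, which is exactly what the color loss in Eq. (17) needs.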
C. Loss functions

To fulfill the above premises, our network is trained to minimize the following loss function:

L = L_recon + L_reflect + L_other,  (14)

where L_recon is the image-reconstruction loss between the input images and the reconstructed ones, and L_reflect and L_other are loss functions constraining the outputs R̂_{I_i}, and Ŝ_{I_i} and ĉ_{I_i}, respectively.

For the reconstruction consistency, in accordance with Eq. (1), we use the image-reconstruction loss L_recon so that the pixel-wise product of R̂_{I_i}(x, y) and Ŝ_{I_i}(x, y) is equal to the input image I_i ≜ I_{v_i, c_i}. We calculate L_recon over all combinations of the input images and the pixel-wise products R̂_{I_j}(x, y) · Ŝ_{I_i}(x, y):

L_recon = Σ_{i=1}^{3} Σ_{j=1}^{3} { λ1 ‖I_i(x, y) − R̂_{I_j}(x, y) · Ŝ_{I_i}(x, y)‖ + λ2 ‖1 − SSIM(I_i(x, y), R̂_{I_j}(x, y) · Ŝ_{I_i}(x, y))‖ + λ3 ‖ΔE(I_i(x, y), R̂_{I_j}(x, y) · Ŝ_{I_i}(x, y))‖ },  (15)

where λ1, λ2, and λ3 are weights of the loss terms, ‖·‖ is the L2 norm, SSIM(·) calculates a structural similarity (SSIM) value, and ΔE(·) calculates the CIEDE2000 color difference [15]. By using SSIM and ΔE(·) as loss terms, the images reconstructed from the outputs R̂_{I_i} and Ŝ_{I_i} reproduce the details of the input images I_i, and moreover the output reflectance R̂_{I_i} can be consistent regardless of the exposure and illumination-color conditions.

We also use a reflectance loss L_reflect to improve the consistency of the output reflectance R̂_{I_i}:

L_reflect = Σ_{i=1}^{3} Σ_{j=1}^{3} { λ4 ‖R̂_{I_i}(x, y) − R̂_{I_j}(x, y)‖ + λ5 |r̄ − mean(R̂_{I_i})| },  (16)

where λ4 and λ5 are weights of the loss terms, mean(·) calculates the mean value over the whole reflectance, and r̄ is a fixed target mean. By adjusting the mean value to this target, our network can output the normalized color information of the input images I_i as reflectance R̂_{I_i}.

To add smoothness to the output shading Ŝ_{I_i}, the total variation tv(·) is utilized as a loss function for the shading. Combining tv(·) with a loss function on the output RGB vector ĉ_{I_i}, we calculate L_other as

L_other = Σ_{i=1}^{3} { λ6 tv(Ŝ_{I_i}) + λ7 ‖c_i − ĉ_{I_i}‖ },  (17)

where λ6 and λ7 are weights of the loss terms. In practice, we empirically set λ1 = 3, λ2 = 1, λ3 = 2, λ4 = 3, λ5 = 1, λ6 = 10, and λ7 = 20.
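A simplified version of Eqs. (14)-(17) can be written down directly. For brevity, the SSIM and CIEDE2000 terms of Eq. (15) are omitted here, and the target mean value of 0.5 is our assumption, so this is a sketch of the loss structure rather than the full training objective:

```python
import numpy as np

def tv(x):
    """Anisotropic total variation of an (H, W, C) map."""
    return np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()

def simplified_loss(inputs, R_hat, S_hat, c_true, c_hat,
                    lam1=3.0, lam4=3.0, lam5=1.0, lam6=10.0, lam7=20.0,
                    target_mean=0.5):
    """Simplified Eqs. (14)-(17): the SSIM/CIEDE2000 terms of Eq. (15)
    are omitted and the target mean is assumed to be 0.5."""
    n = len(inputs)
    # Eq. (15), L2 term only: all (i, j) combinations of R_hat and S_hat
    L_recon = sum(lam1 * np.linalg.norm(inputs[i] - R_hat[j] * S_hat[i])
                  for i in range(n) for j in range(n))
    # Eq. (16): pairwise reflectance consistency plus the mean constraint
    L_reflect = sum(lam4 * np.linalg.norm(R_hat[i] - R_hat[j])
                    + lam5 * abs(target_mean - R_hat[i].mean())
                    for i in range(n) for j in range(n))
    # Eq. (17): shading smoothness plus illumination-color supervision
    L_other = sum(lam6 * tv(S_hat[i]) + lam7 * np.linalg.norm(c_true[i] - c_hat[i])
                  for i in range(n))
    return L_recon + L_reflect + L_other  # Eq. (14)
```

With a perfect decomposition (identical reflectances, constant shading, exact color estimate, and mean reflectance at the target), every term vanishes and the loss is zero.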
IV. SIMULATION

We performed two simulations to confirm the performance of the proposed network. For training our network, we used 3640 raw images from the HDR+ Burst Photography Dataset [20].
A. Result of Retinex Decomposition
Figure 3 shows an example of the images output by our network as the Retinex decomposition and reconstruction. From Fig. 3, our network was confirmed to generate almost the same reflectance from three input images with different exposures. Figure 3 also shows that the input images with different exposures could be reconstructed from the output components. From these results, our network was demonstrated to work well.

Fig. 3. Example of images generated by our network. (a) Input image. (b) Output shading. (c) Output reflectance. (d) Reconstructed image using the output components.

Fig. 4. WB adjustment with our network.
B. Result of white-balance adjustment
To evaluate the estimation performance of illumination colors, a WB adjustment was applied to input images, where the input images were prepared as white-unbalanced images by using only color transfer, i.e., steps (1) and (3) in Sec. III-B. Figure 4 shows the WB-adjustment process used in this experiment. In this process, the output RGB vectors were not used for reconstructing the output images, so that the effects of the illumination color included in the input images were eliminated.

In this experiment, 100 color-transferred images, generated from 100 raw images in the RAISE Dataset [21], were applied to the trained network as input images. The output images produced by the proposed network were evaluated in terms of MSE and the hue difference ΔH of CIEDE2000 [15]. To confirm the decomposition performance of our network, the scores of the output images were compared with those of the input images, where the original images that were not color-transferred were used as references for calculating the scores.

TABLE I
SCORES OF WB ADJUSTMENT SIMULATION

         MSE      ΔH
Input    0.0259   3.5017
Output

Fig. 5. Example of WB adjustment with our network. (a) Original image (reference). (b) Input image. (c) Output image.
Table I shows the MSE and hue-difference ΔH scores, averaged over all 100 images. From Table I, both scores of the output images were lower than those of the input images, where a smaller value indicates a better result. Figure 5 shows an example of the reference, input, and output images used in this experiment. From Fig. 5, the white balance of the output image is closer to the reference than that of the input image. Therefore, our network was confirmed to be able to eliminate the effects of the illumination color of the input image.
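The WB adjustment of Fig. 4 amounts to reconstructing with the gray shading only, discarding the estimated color vector. A sketch of the ideal case, assuming the decomposition is recovered exactly (the arrays and the example color are synthetic stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth components of a scene (synthetic stand-ins)
R  = rng.uniform(0.1, 1.0, size=(8, 8, 3))  # reflectance
GS = rng.uniform(0.1, 2.0, size=(8, 8, 1))  # gray shading
c  = np.array([1.0, 0.8, 0.6])              # illumination color of the input

I_input = R * (c * GS)   # white-unbalanced input, per Eqs. (1) and (7)

# Assume the network recovers the components exactly (the ideal case):
R_hat, GS_hat, c_hat = R, GS, c

# WB adjustment: reconstruct WITHOUT the estimated color vector c_hat
I_wb = R_hat * GS_hat

assert np.allclose(I_wb, I_input / c)  # the color cast is removed per channel
```

Dropping ĉ_I from the reconstruction is what removes the illumination color, which is why the output scores in Table I improve over the inputs.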
V. CONCLUSION

In this paper, we proposed a novel Retinex image-decomposition network that considers the premises of the Retinex decomposition. The proposed network can be trained in a self-supervised manner by using pseudo-generated images with various exposures and illumination colors. In an experiment, our network was demonstrated to generate almost the same reflectance from input images with different exposures and to estimate illumination colors.

REFERENCES
[1] E. H. Land, "The retinex theory of color vision," Scientific American, vol.237, no.6, pp.108–129, 1977.
[2] C. Chien, Y. Kinoshita, S. Shiota, and H. Kiya, "A Retinex-based Image Enhancement Scheme with Noise Aware Shadow-up Function," Proc. SPIE 11049, IWAIT, pp.501–506, 2019.
[3] Y. Liu, Y. Li, S. You, and F. Lu, "Unsupervised Learning for Intrinsic Image Decomposition from a Single Image," Proc. CVPR, pp.3248–3257, 2020.
[4] Z. Li and N. Snavely, "Learning Intrinsic Image Decomposition from Watching the World," Proc. CVPR, pp.9039–9048, 2018.
[5] L. Lettry, K. Vanhoey, and L. Van Gool, "Unsupervised Deep Single-Image Intrinsic Decomposition using Illumination-Varying Image Sequences," Computer Graphics Forum, vol.37, no.7, pp.409–419, 2018.
[6] W. Ma, H. Chu, B. Zhou, R. Urtasun, and A. Torralba, "Single image intrinsic decomposition without a single intrinsic image," Proc. ECCV, pp.201–217, 2018.
[7] Q. Fan, J. Yang, G. Hua, B. Chen, and D. Wipf, "Revisiting deep intrinsic image decompositions," Proc. CVPR, pp.8944–8952, 2018.
[8] T. Zhou, P. Krahenbuhl, and A. A. Efros, "Learning data-driven reflectance priors for intrinsic image decomposition," Proc. ICCV, pp.3469–3477, 2015.
[9] Z. Li and N. Snavely, "CGIntrinsics: Better Intrinsic Image Decomposition through Physically-Based Rendering," Proc. ECCV, pp.371–387, 2018.
[10] Z. Wang and F. Lu, "Single image intrinsic decomposition with discriminative feature encoding," Proc. ICCVW, 2019.
[11] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman, "Ground truth dataset and baseline evaluations for intrinsic image algorithms," Proc. ICCV, pp.2335–2342, 2009.
[12] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A naturalistic open source movie for optical flow evaluation," Proc. ECCV, pp.611–625, 2012.
[13] S. Bell, K. Bala, and N. Snavely, "Intrinsic Images in the Wild," ACM Trans. Graph., vol.33, no.4, 2014.
[14] B. Kovacs, S. Bell, N. Snavely, and K. Bala, "Shading Annotations in the Wild," Proc. CVPR, pp.6998–7007, 2017.
[15] G. Sharma, W. Wu, and E. N. Dalal, "The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations," Color Research & Application, vol.30, no.1, pp.21–30, 2005.
[16] Y. Kinoshita and H. Kiya, "Scene Segmentation-Based Luminance Adjustment for Multi-Exposure Image Fusion," IEEE Trans. Image Processing, vol.28, no.8, pp.4101–4116, 2019.
[17] Y. Kinoshita, S. Shiota, and H. Kiya, "Automatic Exposure Compensation for Multi-Exposure Image Fusion," Proc. ICIP, pp.883–887, 2018.
[18] Y. Kinoshita and H. Kiya, "Automatic Exposure Compensation Using an Image Segmentation Method for Single-Image-Based Multi-Exposure Fusion," APSIPA Trans. Signal and Information Processing, vol.7, p.e22, 2018.
[19] K. Seo, C. Go, Y. Kinoshita, and H. Kiya, "Hue-Correction Scheme Considering Non-Linear Camera Response for Multi-Exposure Image Fusion," IEICE Trans. Fundamentals, vol.E103-A, no.12, pp.1562–1570, 2020.
[20] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy, "Burst photography for high dynamic range and low-light imaging on mobile cameras," ACM Trans. Graph., vol.35, no.6, 2016.
[21] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, "RAISE: A Raw Images Dataset for Digital Image Forensics," Proc. ACM MMSys, 2015.