Single Image HDR Reconstruction Using a CNN with Masked Features and Perceptual Loss
MARCEL SANTANA SANTOS, Centro de Informática, Universidade Federal de Pernambuco
TSANG ING REN, Centro de Informática, Universidade Federal de Pernambuco
NIMA KHADEMI KALANTARI, Texas A&M University
Fig. 1. We propose a novel deep learning system for single image HDR reconstruction by synthesizing visually pleasing details in the saturated areas. We introduce a new feature masking approach that reduces the contribution of the features computed on the saturated areas, to mitigate halo and checkerboard artifacts. To synthesize visually pleasing textures in the saturated regions, we adapt the VGG-based perceptual loss function to the HDR reconstruction application. Furthermore, to effectively train our network on limited HDR training data, we propose to pre-train the network on the inpainting task. Our method can reconstruct regions with high luminance, such as the bright highlights of the windows (red inset), and generate visually pleasing textures (green inset). See Figure 7 for comparison against several other approaches. All images have been gamma corrected for display purposes.
Digital cameras can only capture a limited range of real-world scenes' luminance, producing images with saturated pixels. Existing single image high dynamic range (HDR) reconstruction methods attempt to expand the range of luminance, but are not able to hallucinate plausible textures, producing results with artifacts in the saturated areas. In this paper, we present a novel learning-based approach to reconstruct an HDR image by recovering the saturated pixels of an input LDR image in a visually pleasing way. Previous deep learning-based methods apply the same convolutional filters on well-exposed and saturated pixels, creating ambiguity during training and leading to checkerboard and halo artifacts. To overcome this problem, we propose a feature masking mechanism that reduces the contribution of the features from the saturated areas. Moreover, we adapt the VGG-based perceptual loss function to our application to be able to synthesize visually pleasing textures. Since the number of HDR images for training is limited, we propose to train our system in two stages. Specifically, we first train our system on a large number of images for the image inpainting task and then fine-tune it on HDR reconstruction. Since most of the HDR examples contain smooth regions that are simple to reconstruct, we propose a sampling strategy to select challenging training patches during the HDR fine-tuning stage. We demonstrate through experimental results that our approach can reconstruct visually pleasing HDR results, better than the current state of the art on a wide range of scenes.

CCS Concepts: • Computing methodologies → Computational photography.

Additional Key Words and Phrases: high dynamic range imaging, convolutional neural network, feature masking, perceptual loss
Authors' addresses: Marcel Santana Santos, Centro de Informática, Universidade Federal de Pernambuco, [email protected]; Tsang Ing Ren, Centro de Informática, Universidade Federal de Pernambuco, [email protected]; Nima Khademi Kalantari, Texas A&M University, [email protected].

© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3386569.3392403.
ACM Reference Format:
Marcel Santana Santos, Tsang Ing Ren, and Nima Khademi Kalantari. 2020. Single Image HDR Reconstruction Using a CNN with Masked Features and Perceptual Loss. ACM Trans. Graph. 39, 4, Article 80 (July 2020), 10 pages. https://doi.org/10.1145/3386569.3392403
1 INTRODUCTION
The illumination of real-world scenes is high dynamic range, but standard digital camera sensors can only capture a limited range of luminance. Therefore, these cameras typically produce images with under/over-exposed areas. A large number of approaches propose to generate a high dynamic range (HDR) image by combining a set of low dynamic range (LDR) images of the scene at different exposures [Debevec and Malik 1997]. However, these methods either have to handle the scene motion [Hu et al. 2013; Kalantari and Ramamoorthi 2017; Kang et al. 2003; Oh et al. 2014; Sen et al. 2012; Wu et al. 2018] or require specialized bulky and expensive optical systems [McGuire et al. 2007; Tocci et al. 2011]. Single image dynamic range expansion approaches avoid these limitations by reconstructing an HDR image using one image. These approaches can work with images captured with any standard camera or even recover the full dynamic range of legacy LDR content. As a result, they have attracted considerable attention in recent years.

Several existing methods extrapolate the light intensity using heuristic rules [Banterle et al. 2006; Bist et al. 2017; Rempel et al. 2007], but are not able to properly recover the brightness of saturated areas as they do not utilize context. On the other hand, recent deep learning approaches [Eilertsen et al. 2017; Endo et al. 2017; Lee et al. 2018a] systematically utilize contextual information using convolutional neural networks (CNNs) with large receptive fields. However, these methods usually produce results with blurriness, checkerboard, and halo artifacts in saturated areas.

In this paper, we propose a novel learning-based technique to reconstruct an HDR image by recovering the missing information in the saturated areas of an LDR image. We design our approach based on two main observations. First, applying the same convolutional filters on well-exposed and saturated pixels, as done in previous approaches, results in ambiguity during training and leads to checkerboard and halo artifacts. Second, using simple pixel-wise loss functions, utilized by most existing approaches, the network is unable to hallucinate details in the saturated areas, producing blurry results. To address these limitations, we propose a feature masking mechanism that reduces the contribution of features generated from the saturated content by multiplying them by a soft mask. With this simple strategy, we are able to avoid checkerboard and halo artifacts, as the network only relies on the valid information of the input image to produce the HDR image. Moreover, inspired by image inpainting approaches, we leverage the VGG-based perceptual loss function, introduced by Gatys et al. [2016], and adapt it to the HDR reconstruction task. By minimizing our proposed perceptual loss function during training, the network can synthesize visually realistic textures in the saturated areas.

Since a large number of HDR images, required for training a deep neural network, are currently not available, we perform the training in two stages. In the first stage, we train our system on a large set of images for the inpainting task. During this process, the network leverages a large number of training samples to learn an internal representation that is suitable for synthesizing visually realistic texture in the incomplete regions. In the next step, we fine-tune this network on the HDR reconstruction task using a set of simulated LDR and their corresponding ground truth HDR images.
Since most of the HDR examples contain smooth regions that are simple to reconstruct, we propose a simple method to identify the textured patches and only use them for fine-tuning.

Our approach can reconstruct regions with high luminance and hallucinate textures in the saturated areas, as shown in Figure 1. We demonstrate that our approach can produce better results than the state-of-the-art methods both on simulated images (Figure 7) and on images taken with real-world cameras (Figure 9). In summary, we make the following contributions:

(1) We propose a feature masking mechanism to avoid relying on the invalid information in the saturated regions (Section 3.1). This masking approach significantly reduces the artifacts and improves the quality of the final results (Figure 10).
(2) We adapt the VGG-based perceptual loss function to the HDR reconstruction task (Section 3.2). Compared to pixel-wise loss functions, our loss can better reconstruct sharp textures in the saturated regions (Figure 12).
(3) We propose to pre-train the network on inpainting before fine-tuning it on HDR generation (Section 3.3). We demonstrate that the pre-training stage is essential for synthesizing visually pleasing textures in the saturated areas (Figure 11).
(4) We propose a simple strategy for identifying the textured HDR areas to improve the performance of training (Section 3.4). This strategy improves the network's ability to reconstruct sharp details (Figure 11).

2 RELATED WORK
The problem of single image HDR reconstruction, also known as inverse tone-mapping [Banterle et al. 2006], has been extensively studied in the last couple of decades. However, this problem remains a major challenge as it requires recovering the details from regions with missing content. In this section, we discuss the existing techniques by classifying them into two categories of non-learning and learning methods.
Several approaches propose to perform inverse tone-mapping using global operators. Landis [2002] applies a linear or exponential function to the pixels of the LDR image above a certain threshold. Bist et al. [2017] approximate tone expansion by a gamma function. They use the characteristics of the human visual system to design the gamma curve. Luzardo et al. [2018] improve the brightness of the result by utilizing an operator based on mid-level mapping. A number of techniques propose to handle this application through local heuristics. Banterle et al. [2006] use median-cut [Debevec 2005] to find areas with high luminance. They then generate an expand-map to extend the range of luminance in these areas, using an inverse operator. Rempel et al. [2007] also utilize an expand-map but use a Gaussian filter followed by an edge-stopping function to enhance the brightness of saturated areas. Kovaleski and Oliveira [2014] extend the approach by Rempel et al. [2007] using a cross bilateral filter. These approaches simply extrapolate the light intensity by using heuristics and, thus, often fail to recover saturated highlights, introducing unnatural artifacts.

A few approaches propose to handle this application by incorporating user interactions in their system. Didyk et al. [2008] enhance bright luminous objects in video sequences by using a semi-automatic classifier to classify saturated regions as lights, reflections, or diffuse surfaces. Wang et al. [2007] recover the textures in the saturated areas by transferring details from the user-selected regions. Their approach demands user interactions that take several minutes, even for an expert user. In contrast to these methods, we propose a learning-based approach to systematically reconstruct HDR images from a wide range of different scenes, instead of relying on heuristic strategies and user inputs.
In recent years, several approaches have proposed to tackle this application using deep convolutional neural networks (CNNs). Given a single input LDR image, Endo et al. [2017] use an auto-encoder [Hinton and Salakhutdinov 2006] to generate a set of LDR images with different exposures. These images are then combined to reconstruct the final HDR image. Lee et al. [2018a] chain a set of CNNs to sequentially generate the bracketed LDR images. Later, they propose [Lee et al. 2018b] to handle this application through a recursive conditional generative adversarial network (GAN) [Goodfellow et al. 2014] combined with a pixel-wise loss.

In contrast to these approaches, a few methods [Eilertsen et al. 2017; Marnerides et al. 2018; Yang et al. 2018] directly reconstruct the HDR image without generating bracketed images. Eilertsen et al. [2017] use a network with a U-Net architecture to predict the values of the saturated areas, whereas the linear non-saturated areas are obtained from the input. Marnerides et al. [2018] present a novel dedicated architecture for end-to-end image expansion. Yang et al. [2018] reconstruct an HDR image for the image correction application. They train a network for HDR reconstruction to recover the missing details from the input LDR image, and then a second network transfers these details back to the LDR domain.

While these approaches produce state-of-the-art results, their synthesized images often contain halo and checkerboard artifacts and lack textures in the saturated areas. This is mainly because of using standard convolutional layers and pixel-wise loss functions. Note that several recent methods [Kim et al. 2019; Lee et al. 2018b; Ning et al. 2018; Xu et al. 2019] use adversarial loss instead of pixel-wise loss functions, but they still do not demonstrate results with high-quality textures. This is potentially because the problem of HDR reconstruction is constrained in the sense that the synthesized content should properly fit the input image using a soft mask. Unfortunately, GANs are known to have difficulty handling these scenarios [Bau et al. 2019]. In contrast, we propose a feature masking strategy and a more constrained VGG-based perceptual loss to effectively train our network and produce results with visually pleasing textures.

3 ALGORITHM
Our goal is to reconstruct an HDR image from a single LDR image by recovering the missing information in the saturated highlights. We achieve this using a convolutional neural network (CNN) that takes an LDR image as the input and estimates the missing HDR information in the bright regions. We compute the final HDR image by combining the well-exposed content of the input image and the output of the network in the saturated areas. Formally, we reconstruct the final HDR image $\hat{H}$ as follows:

    $\hat{H} = M \odot T^{\gamma} + (1 - M) \odot \left[\exp(\hat{Y}) - 1\right],$    (1)

where $\gamma$ is the gamma used to linearize the input and $\odot$ denotes element-wise multiplication. Here, $T$ is the input LDR image in the range $[0, 1]$, $\hat{Y}$ is the network output in the logarithmic domain (Section 3.2), and $M$ is a soft mask with values in the range $[0, 1]$ that defines how well-exposed each pixel is. We obtain this mask by applying the function $\beta(\cdot)$ (see Figure 2) to the input image, i.e., $M = \beta(T)$.

Fig. 2. We use this function to measure how well-exposed a pixel is. The value 1 indicates that the pixel is well-exposed, while 0 is assigned to the pixels that are fully saturated; pixels above the threshold $\alpha$ are progressively down-weighted.

In the following sections, we discuss our proposed feature masking approach, loss function, as well as the training process.
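To make the blending concrete, the following is a minimal PyTorch sketch of Eq. 1 and the mask function $\beta$. The linear ramp shape of $\beta$ and the values of alpha and gamma are illustrative assumptions on our part; the paper specifies $\beta$ only through the plot in Figure 2.

import torch

def beta(t: torch.Tensor, alpha: float = 0.95) -> torch.Tensor:
    # Soft well-exposedness mask: 1 for pixels below the threshold alpha,
    # ramping linearly down to 0 at full saturation (t = 1). Ramp shape
    # and alpha are illustrative choices, not the paper's exact settings.
    return torch.clamp((1.0 - t) / (1.0 - alpha), 0.0, 1.0)

def blend_hdr(t_ldr: torch.Tensor, y_log: torch.Tensor,
              gamma: float = 2.2) -> torch.Tensor:
    # Eq. 1: keep the linearized input where well exposed and use the
    # network output (predicted in the log domain) where saturated.
    m = beta(t_ldr)  # M = beta(T)
    return m * t_ldr.pow(gamma) + (1.0 - m) * (torch.exp(y_log) - 1.0)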
3.1 Feature Masking
Standard convolutional layers apply the same filter to the entire image to extract a set of features. This is reasonable for a wide range of applications, such as image super-resolution [Dong et al. 2015], style transfer [Gatys et al. 2016], and image colorization [Zhang et al. 2016], where the entire image contains valid information. However, in our problem, the input LDR image contains invalid information in the saturated areas. Since meaningful features cannot be extracted from the saturated contents, naïve application of standard convolution introduces ambiguity during training and leads to visible artifacts (Figure 10).

We address this problem by proposing a feature masking mechanism (Figure 3) that reduces the magnitude of the features generated from the invalid content (saturated areas). We do this by multiplying the feature maps in each layer by a soft mask, as follows:

    $Z^l = X^l \odot M^l,$    (2)

where $X^l \in \mathbb{R}^{H \times W \times C}$ is the feature map of layer $l$ with height $H$, width $W$, and $C$ channels. $M^l \in [0, 1]^{H \times W \times C}$ is the mask for layer $l$ and has values in the range $[0, 1]$. The value of one indicates that the features are computed from valid input pixels, while zero is assigned to the features that are computed from invalid pixels. At the first layer, $X^l$ is the input LDR image itself, and $M^l$ is the input mask $M = \beta(T)$. Note that, since our masks are soft, weak signals in the saturated areas are not discarded using this strategy. In fact, by suppressing the invalid pixels, these weak signals can propagate through the network more effectively. Once the features of the current layer $l$ are masked, the features in the next layer $X^{l+1}$ are computed as usual:

    $X^{l+1} = \phi^l(W^l * Z^l + b^l),$    (3)

where $W^l$ and $b^l$ refer to the weight and bias of the current layer, respectively. Moreover, $\phi^l$ is the activation function and $*$ is the standard convolution operation.

We compute the masks at each layer by applying the convolutional filter to the masks at the previous layer (see Figure 4 for a visualization of some of the masks). The basic idea is that, since the features are computed by applying a series of convolutions, the same filters can be used to compute the contribution of the valid pixels in the features.
Fig. 3. Illustration of the proposed feature masking mechanism. The features at each layer are multiplied with the corresponding mask before going through the convolution process. The masks at each layer are obtained by updating the masks at the previous layer using Eq. 4.
Fig. 4. On the left, we show the input image and the corresponding mask. On the right, we visualize a few masks at different layers of the network. Note that, as we move deeper through the network, the masks become blurrier and more uniform. This is expected since the receptive field of the features becomes larger in the deeper layers.

However, since the masks are in the range $[0, 1]$ and measure the percentage of the contributions, the magnitude of the filters is irrelevant. Therefore, we normalize the filter weights before convolving them with the masks as follows:

    $M^{l+1} = \left(\frac{|W^l|}{\|W^l\|_1 + \epsilon}\right) * M^l,$    (4)

where $\|\cdot\|_1$ is the $\ell_1$ norm and $|\cdot|$ is the absolute value operator. Here, $|W^l|$ is an $\mathbb{R}^{H \times W \times C}$ tensor and $\|W^l\|_1$ is an $\mathbb{R}^{1 \times 1 \times C}$ tensor. To perform the division, we replicate the values of $\|W^l\|_1$ to obtain a tensor with the same size as $|W^l|$. The constant $\epsilon$ is a small value to avoid division by zero.

Note that a couple of recent approaches have proposed strategies to overcome similar issues in image inpainting [Liu et al. 2018; Yu et al. 2019]. Specifically, Liu et al. [2018] propose to modify the convolution process to only apply the filter to the pixels with valid information. Unfortunately, this approach is specially designed for cases with binary masks. However, the masks in our application are soft and, thus, this method is not applicable. Yu et al. [2019] propose to multiply the features at each layer with a soft mask, similar to our feature masking strategy. The key difference is that their mask at each layer is learnable, and it is estimated using a small network from the features in the previous layer. Because of the additional parameters and complexity, training this approach on limited HDR images is difficult. Therefore, this approach is not able to produce high-quality HDR images (see Section 5.3).
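The PyTorch module below sketches one way to implement the feature masking of Eqs. 2-4 on top of a standard convolution. Detaching the weights in the mask update and the default layer hyper-parameters are our assumptions, not details specified by the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Module):
    # Sketch of a feature-masking layer: features are multiplied by a soft
    # mask (Eq. 2), convolved as usual (Eq. 3), and the mask is propagated
    # by convolving it with the normalized absolute filter weights (Eq. 4).
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1,
                 padding=1, eps=1e-6):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              stride=stride, padding=padding)
        self.eps = eps

    def forward(self, x, mask):
        z = x * mask                         # Eq. 2: suppress saturated features
        out = self.conv(z)                   # Eq. 3 (activation applied by the caller)
        w = self.conv.weight.detach().abs()  # |W|; detached so the mask update
                                             # receives no gradients (our choice)
        # Normalize each filter to sum to 1 so the mask update is a weighted
        # average and the mask values stay in [0, 1] (Eq. 4).
        w = w / (w.sum(dim=(1, 2, 3), keepdim=True) + self.eps)
        new_mask = F.conv2d(mask, w, stride=self.conv.stride,
                            padding=self.conv.padding)
        return out, new_mask

Because each mask-update filter sums to one, repeated updates behave like local averaging, which is consistent with the observation in Figure 4 that the masks become blurrier and more uniform in deeper layers.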
3.2 Loss Function
The choice of the loss function is critical in each learning system. Our goal is to reconstruct an HDR image by synthesizing plausible textures in the saturated areas. Unfortunately, using only pixel-wise loss functions, as utilized by most previous approaches, the network tends to produce blurry images (Figure 12). Inspired by the recent image inpainting approaches [Han et al. 2019; Liu et al. 2018; Yang et al. 2017], we train our network using a VGG-based perceptual loss function. Specifically, our loss function is a combination of an HDR reconstruction loss $\mathcal{L}_r$ and a perceptual loss $\mathcal{L}_p$, as follows:

    $\mathcal{L} = \lambda_1 \mathcal{L}_r + \lambda_2 \mathcal{L}_p,$    (5)

where $\lambda_1$ and $\lambda_2$ are the weights of the two terms.

Reconstruction Loss: The HDR reconstruction loss is a simple pixel-wise $\ell_1$ distance between the output and ground truth images in the saturated areas. Since the HDR images could potentially have large values, we define the loss in the logarithmic domain. Given the estimated HDR image $\hat{Y}$ (in the log domain) and the linear ground truth image $H$, the reconstruction loss is defined as:

    $\mathcal{L}_r = \|(1 - M) \odot (\hat{Y} - \log(H + 1))\|_1.$    (6)

The multiplication by $(1 - M)$ ensures that the loss is computed in the saturated areas.
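Assuming the distance in Eq. 6 is an $\ell_1$ norm (the subscript was lost in extraction), the reconstruction loss is a one-liner:

import torch

def reconstruction_loss(y_log: torch.Tensor, h_gt: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    # Eq. 6 (up to normalization): masked distance in the log domain, so
    # the loss is only measured where the input is saturated (1 - M large).
    return torch.abs((1.0 - mask) * (y_log - torch.log(h_gt + 1.0))).mean()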
Perceptual Loss: Our perceptual term is a combination of the VGG and style loss functions as follows:

    $\mathcal{L}_p = \lambda_3 \mathcal{L}_v + \lambda_4 \mathcal{L}_s,$    (7)

where $\lambda_3$ and $\lambda_4$ weight the two terms. The VGG loss function $\mathcal{L}_v$ evaluates how well the features of the reconstructed image match the features extracted from the ground truth. This allows the model to produce textures that are perceptually similar to the ground truth. This loss term is defined as follows:

    $\mathcal{L}_v = \sum_l \|\phi_l(\mathcal{T}(\tilde{H})) - \phi_l(\mathcal{T}(H))\|_1,$    (8)

where $\phi_l$ is the feature map extracted from the $l$-th layer of the VGG network. Moreover, the image $\tilde{H}$ is obtained by combining the information of the ground truth $H$ in the well-exposed regions and the content of the network's output $\hat{Y}$ in the saturated areas using the mask $M$, as follows:

    $\tilde{H} = M \odot H + (1 - M) \odot \hat{Y}.$    (9)

We use $\tilde{H}$ in our loss functions to ensure that the supervision is only provided in the saturated areas. Finally, $\mathcal{T}(\cdot)$ in Eq. 8 is a function that compresses the range to $[0, 1]$. Specifically, we use the differentiable $\mu$-law range compressor:

    $\mathcal{T}(H) = \frac{\log(1 + \mu H)}{\log(1 + \mu)},$    (10)

where $\mu$ is a parameter defining the amount of compression.
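A sketch of the $\mu$-law compressor (Eq. 10) and the VGG term (Eq. 8), using torchvision's VGG-19. The value of mu, the pooling-layer indices, and the $\ell_1$ distance are assumptions on our part.

import math
import torch
import torchvision.models as models

_vgg = models.vgg19(pretrained=True).features.eval()
for p in _vgg.parameters():
    p.requires_grad = False  # VGG is a fixed feature extractor

def mu_law(h: torch.Tensor, mu: float = 5000.0) -> torch.Tensor:
    # Eq. 10: differentiable range compression to [0, 1]; mu is illustrative.
    return torch.log(1.0 + mu * h) / math.log(1.0 + mu)

def vgg_features(x: torch.Tensor, layers=(4, 9, 18)):
    # Indices 4/9/18 are pool1/pool2/pool3 in torchvision's VGG-19 features.
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in layers:
            feats.append(h)
        if i == max(layers):
            break
    return feats

def vgg_loss(h_blend: torch.Tensor, h_gt: torch.Tensor) -> torch.Tensor:
    # Eq. 8: feature distance between the compressed blended prediction
    # (the image H-tilde of Eq. 9) and the compressed ground truth.
    fa, fb = vgg_features(mu_law(h_blend)), vgg_features(mu_law(h_gt))
    return sum(torch.abs(a - b).mean() for a, b in zip(fa, fb))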
Fig. 5. A few example patches selected by our patch sampling approach. These are challenging examples as the HDR images corresponding to these patches contain complex textures in the saturated areas.
The style loss in Eq. 7 ($\mathcal{L}_s$) captures style and texture by comparing global statistics, collected over the entire image, with a Gram matrix [Gatys et al. 2015]. Specifically, the style loss is defined as:

    $\mathcal{L}_s = \sum_l \|G_l(\mathcal{T}(\tilde{H})) - G_l(\mathcal{T}(H))\|_1,$    (11)

where $G_l(X)$ is the Gram matrix of the features in layer $l$ and is defined as follows:

    $G_l(X) = K_l \, \phi_l(X)^T \phi_l(X).$    (12)

Here, $K_l$ is a normalization factor computed as $1/(C_l H_l W_l)$. Note that the feature $\phi_l$ is a matrix of shape $(H_l W_l) \times C_l$ and, thus, the Gram matrix has a size of $C_l \times C_l$. In our implementation, we use the VGG-19 [Simonyan and Zisserman 2015] network and extract features from layers pool1, pool2, and pool3.
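The style term (Eqs. 11 and 12) compares Gram matrices of the same features; this sketch reuses mu_law and vgg_features from the previous snippet and again assumes an $\ell_1$ distance.

import torch

def gram(feat: torch.Tensor) -> torch.Tensor:
    # Eq. 12: Gram matrix of VGG features, normalized by K = 1 / (C*H*W).
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(h_blend: torch.Tensor, h_gt: torch.Tensor) -> torch.Tensor:
    # Eq. 11: distance between Gram matrices of the compressed images.
    fa, fb = vgg_features(mu_law(h_blend)), vgg_features(mu_law(h_gt))
    return sum(torch.abs(gram(a) - gram(b)).mean() for a, b in zip(fa, fb))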
3.3 Training
Training our system is difficult as large-scale HDR image datasets are currently not available. Existing techniques [Eilertsen et al. 2017] overcome this limitation by pre-training their network on simulated HDR images that are created from standard image datasets like the MIT Places [Zhou et al. 2014]. They then fine-tune their network on real HDR images. Unfortunately, our network is not able to learn to synthesize plausible textures with this strategy (see Figure 11), as the saturated areas are typically in the bright and smooth regions.

To address this problem, we propose to pre-train our network on the image inpainting task. Intuitively, during inpainting, our network leverages a large number of training data to learn an appropriate internal representation that is capable of synthesizing visually pleasing textures. In the HDR fine-tuning stage, the network adapts the learned representation to the HDR domain to be able to synthesize HDR textures. We follow Liu et al.'s approach [2018] and use their loss function and mask generation strategy during pre-training. Note that we still use our feature masking mechanism for pre-training, but the input masks are binary. We fine-tune the network on real HDR images using the loss function discussed in Section 3.2.

One major problem is that the majority of the bright areas in the HDR examples are smooth and textureless. Therefore, during fine-tuning, the network adapts to these types of patches and, as a result, has difficulty producing textured results (see Figure 11). In the next section, we discuss our strategy to select textured and challenging patches.

Algorithm 1 Patch Sampling
procedure PatchMetric(H, M)          ▷ H: HDR image, M: mask
    σc ← color sigma of the bilateral filter
    σs ← space sigma of the bilateral filter
    I ← RgbToGray(H)
    L ← log(I + 1)
    B ← bilateralFilter(L, σc, σs)   ▷ base layer
    D ← L − B                        ▷ detail layer
    Gx ← getGradX(D)
    Gy ← getGradY(D)
    G ← abs(Gx) + abs(Gy)
    return mean(G ⊙ (1 − M))
Fig. 6. The proposed network architecture. The model takes as input the RGB LDR image and outputs an HDR image. We use the feature masking mechanism in all the convolutional layers.
3.4 Patch Sampling
Our goal is to select the patches that contain texture in the saturated areas. We perform this by first computing a score for each patch and then choosing the patches with a high score. The main challenge here is finding a good metric that properly detects the textured patches. One way to do this is to compute the average of the gradient magnitude in the saturated regions. However, since our images are in HDR and can have large values, this approach can detect a smooth region with bright highlights as textured.

To avoid this issue, we propose to first decompose the HDR image into base and detail layers using a bilateral filter [Durand and Dorsey 2002]. We use the average of the gradients (Sobel operator) of the detail layer in the saturated areas as our metric to detect the textured patches. We consider all the patches with a mean gradient above a certain threshold (0.85 in our implementation) as textured, and the rest are classified as smooth. Since the detail layer only contains variations around the base layer, this metric can effectively measure the amount of texture in an HDR patch. Figure 5 shows examples of patches selected using this metric. As shown in Figure 11, this simple patch sampling approach is essential for synthesizing HDR images with sharp and artifact-free details in the saturated areas. The summary of our patch selection strategy is listed in Algorithm 1.
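A Python rendering of Algorithm 1 with OpenCV could look like the following; the bilateral filter neighborhood size and sigmas are placeholders, since the paper's σ values were not recoverable from the text.

import cv2
import numpy as np

def patch_metric(hdr: np.ndarray, mask: np.ndarray) -> float:
    # Score a patch by the mean detail-layer gradient in the saturated areas.
    gray = cv2.cvtColor(hdr.astype(np.float32), cv2.COLOR_RGB2GRAY)
    log_l = np.log(gray + 1.0)                      # work in the log domain
    base = cv2.bilateralFilter(log_l, 9, 0.5, 5.0)  # base layer (placeholder sigmas)
    detail = log_l - base                           # detail layer
    gx = cv2.Sobel(detail, cv2.CV_32F, 1, 0)        # Sobel gradients
    gy = cv2.Sobel(detail, cv2.CV_32F, 0, 1)
    grad = np.abs(gx) + np.abs(gy)
    return float(np.mean(grad * (1.0 - mask)))      # weight by saturation (1 - M)

# Patches whose score exceeds the threshold (0.85 in the paper) are kept.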
4 IMPLEMENTATION
Architecture. We use a network with the U-Net architecture [Ronneberger et al. 2015], as shown in Figure 6. We use the feature masking strategy in all the convolutional layers and up-sample the features in the decoder using nearest neighbor. All the encoder layers use the Leaky ReLU activation function [Maas et al. 2013]. On the other hand, we use ReLU [Nair and Hinton 2010] in all the decoder layers, with the exception of the last one, which has a linear activation function. We use skip connections between all the encoder layers and their corresponding decoder layers.
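A compact sketch of such a network, built from the MaskedConv2d layer sketched in Section 3.1, is shown below. The depth and channel widths are illustrative (the paper's exact configuration appears only in Figure 6), and concatenating the masks at the skip connections is our assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

def up(x):
    return F.interpolate(x, scale_factor=2, mode="nearest")  # NN upsampling

class MaskedUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.e1 = MaskedConv2d(3, 64, 4, stride=2, padding=1)
        self.e2 = MaskedConv2d(64, 128, 4, stride=2, padding=1)
        self.e3 = MaskedConv2d(128, 256, 4, stride=2, padding=1)
        self.e4 = MaskedConv2d(256, 512, 4, stride=2, padding=1)
        self.d3 = MaskedConv2d(512 + 256, 256)
        self.d2 = MaskedConv2d(256 + 128, 128)
        self.d1 = MaskedConv2d(128 + 64, 64)
        self.out = MaskedConv2d(64, 3)  # linear activation on the last layer

    def forward(self, x, m):
        # Encoder: Leaky ReLU activations; masks propagated with the features.
        f1, m1 = self.e1(x, m);   f1 = F.leaky_relu(f1)
        f2, m2 = self.e2(f1, m1); f2 = F.leaky_relu(f2)
        f3, m3 = self.e3(f2, m2); f3 = F.leaky_relu(f3)
        f4, m4 = self.e4(f3, m3); f4 = F.leaky_relu(f4)
        # Decoder: nearest-neighbor upsampling, ReLU, skip connections;
        # skip masks are concatenated with the upsampled masks.
        g3, n3 = self.d3(torch.cat([up(f4), f3], 1), torch.cat([up(m4), m3], 1))
        g3 = F.relu(g3)
        g2, n2 = self.d2(torch.cat([up(g3), f2], 1), torch.cat([up(n3), m2], 1))
        g2 = F.relu(g2)
        g1, n1 = self.d1(torch.cat([up(g2), f1], 1), torch.cat([up(n2), m1], 1))
        g1 = F.relu(g1)
        y, _ = self.out(up(g1), up(n1))  # log-domain HDR prediction Y-hat
        return y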
Fig. 7. We compare our method against the state-of-the-art approaches of Endo et al. [2017], Eilertsen et al. [2017], and Marnerides et al. [2018] on a diverse set of synthetic scenes. Our method is able to synthesize textures in the saturated areas better than the other approaches (rows one to four), while producing results with similar or better quality in the bright highlights (fifth row).
Dataset. We use different datasets for each training step. For the image inpainting step, we use the MIT Places [Zhou et al. 2014] dataset with the original train, test, and validation splits. We choose Places for this step because it contains a large number of scenes with diverse textures. We use the method of Liu et al. [2018] to generate masks of random streaks and holes of arbitrary shapes and sizes. On the other hand, for the HDR fine-tuning step, we collect approximately 2,000 HDR images, drawn from 735 HDR still images and 34 HDR videos. From each HDR image, we extract 250 random patches of size 512 × 512 and generate the input LDR patches following the approach by Eilertsen et al. [2017]. We then select a subset of these patches using our patch selection strategy. We also discard patches with no saturated content, since they do not provide any source of learning to the network. Our final training dataset is a set of 100K input and corresponding ground truth patches.
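As a rough illustration of how LDR inputs can be simulated from HDR patches (the actual procedure follows Eilertsen et al. [2017]; the parameter ranges and camera curve below are our placeholders):

import numpy as np

def simulate_ldr(hdr: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Random exposure, sensor clipping, a simple gamma camera curve, and
    # 8-bit quantization; ranges and the curve are illustrative only.
    exposure = 2.0 ** rng.uniform(-1.0, 1.0)
    clipped = np.clip(hdr * exposure, 0.0, 1.0)
    ldr = clipped ** (1.0 / 2.2)
    return np.round(ldr * 255.0) / 255.0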
Training. We initialize our network using the Xavier approach [Glorot and Bengio 2010] and train it on the image inpainting task until convergence. We then fine-tune the network on HDR reconstruction. We use the same initial learning rate in both stages; however, during the second stage, we reduce the learning rate by a factor of 2. We use the Adam optimizer [Kingma and Ba 2015] with β₁ = 0.9, β₂ = 0.999, and a mini-batch size of 4. The entire training takes approximately 11 days on a machine with an Intel Core i7, 16GB of memory, and an Nvidia GTX 1080 Ti GPU.
5 RESULTS
We implement our network in PyTorch [Paszke et al. 2019], but write the data pre-processing, data augmentation, and patch sampling code in C++. We implement the feature masking mechanism using the existing standard convolutional layer in PyTorch. We compare our approach against three existing learning-based single image HDR reconstruction approaches of Endo et al. [2017], Eilertsen et al. [2017], and Marnerides et al. [2018]. We use the source code provided by the authors to generate the results for all the other approaches.
Table 1. Numerical comparison in terms of mean square error (MSE) and HDR-VDP-2 [Mantiuk et al. 2011] against existing learning-based single image HDR reconstruction approaches.

Method                      MSE      HDR-VDP-2
Endo et al. [2017]          0.0390   55.67
Eilertsen et al. [2017]     0.0387   59.11
Marnerides et al. [2018]    0.0474   54.31
Ours
We begin by quantitatively comparing our approach against the other methods in terms of mean squared error (MSE) and HDR-VDP-2 [Mantiuk et al. 2011] in Table 1. The errors are computed on a test set of 75 randomly selected HDR images, with resolutions ranging from 1024 × 768 up to 2084 pixels in width.

We show the generality of our approach by producing results on a set of real images, captured with standard cameras, in Figure 9. Specifically, the top three images are from the Google HDR+ dataset [Hasinoff et al. 2016], captured with a variety of smartphones, such as the Nexus 5/6/5X/6P, Pixel, and Pixel XL. The image in the last row is captured by a Canon 5D Mark IV camera. All the other approaches are not able to properly reconstruct the saturated regions, producing results with discoloration and blurriness, as indicated by the arrows. On the other hand, our method is able to properly increase the dynamic range by synthesizing realistic textures.
Fig. 8. We compare the performance of the proposed method against previous methods for various amounts of saturated areas. The numbers indicate the percentage of the total number of pixels that are saturated in the input. Although our method slightly degrades as the saturation increases, we consistently present better results than the previous methods.

Table 2. We evaluate the effectiveness of our masking and pre-training strategies by comparing against other alternatives in terms of MSE and HDR-VDP-2 [Mantiuk et al. 2011]. Here, SConv, GConv, IMask, and FMask refer to standard convolution, gated convolution [Yu et al. 2019], only masking the input image, and our full feature masking approach, respectively. Moreover, Inp. pre-training and HDR pre-training correspond to our proposed pre-training on inpainting and HDR reconstruction tasks, respectively.
Method (Masking + Pre-training)     MSE      HDR-VDP-2
SConv + HDR pre-training            0.0402   58.43
SConv + Inp. pre-training           0.0374   60.03
GConv + HDR pre-training            0.0398   53.32
GConv + Inp. pre-training           0.1017   43.13
IMask + HDR pre-training            0.0398   58.39
IMask + Inp. pre-training           0.0369   61.27
FMask + HDR pre-training            0.0393   58.81
FMask + Inp. pre-training (Ours)
Inpainting Pre-training. We begin studying the effect of the proposed inpainting pre-training step by comparing it against the commonly-used synthetic HDR pre-training in Table 2 and Figure 11. As seen, our pre-training ("FMask + Inp. pre-training (Ours)") performs better than HDR pre-training ("FMask + HDR pre-training") both numerically and visually. Specifically, as shown in Figure 11, our network using inpainting pre-training is able to learn better features and synthesizes sharp textures in the saturated areas.
Fig. 9. Comparison against state-of-the-art approaches on images captured by standard cameras. Zoom in to the electronic version to see the differences.
Fig. 10. In regions with both saturated and well-exposed content (boundaries of sky and mountain and bright building lights), the response of the invalid saturated areas in standard convolution dominates the feature maps. Therefore, the network cannot properly utilize the content of the valid regions, introducing high frequency checkerboard artifacts (top row) and blurriness and halo (bottom row). Our approach suppresses the features from the saturated content and allows the network to synthesize the image using the well-exposed information.
Feature Masking. Here, we compare our feature masking strategy against several other approaches in Table 2. Specifically, we compare our method against standard convolution (SConv), gated convolution [Yu et al. 2019] (GConv), and the simpler version of our masking strategy where the mask is only applied to the input (IMask). For completeness, we include the result of each method with both inpainting and HDR pre-training. As seen, our masking strategy is considerably better than the other methods. It is worth noting that, unlike the other methods, the performance of gated convolution with inpainting pre-training is worse than with HDR pre-training. This is mainly because gated convolution estimates the masks at each layer using a separate set of networks, which become unstable after transitioning from inpainting pre-training to HDR fine-tuning.

We also visually compare our feature masking method against standard convolution in Figure 10. Standard convolution produces results with checkerboard artifacts (top) and halo and blurriness (bottom), while our network with feature masking produces considerably better results. Moreover, we visually compare our approach against other masking strategies in Figure 11. Note that, for each masking strategy, we only show the combination of masking and pre-training that produces the best numerical results in Table 2, i.e., gated convolution (GConv) with HDR pre-training and input masking (IMask) with inpainting pre-training. Gated convolution is not able to produce high frequency textures in the saturated areas. Input masking performs reasonably well, but still introduces noticeable artifacts. Our feature masking method, however, is able to synthesize visually pleasing textures.
Patch Sampling. We show our result without patch sampling (Section 3.4) to demonstrate its effectiveness in Figure 11. As seen, by training on the textured patches (ours), the network is able to synthesize textures with more details and fewer objectionable artifacts.
Loss Function. Finally, we compare the proposed perceptual loss function against a simple pixel-wise ($\ell_1$) loss. As seen in Figure 12, using only the pixel-wise loss function, our network tends to produce blurry images, while the network trained using the proposed perceptual loss function can produce visually realistic textures in the saturated regions.

6 LIMITATIONS AND FUTURE WORK
Single image HDR reconstruction is a notoriously challenging problem. Although our method can recover the luminance and hallucinate textures, it is not always able to reconstruct all the details. One such case is shown in Figure 13 (top), where our approach fails to reconstruct the wrinkles on the curtain. Nevertheless, our result is still better than the other approaches, as they overestimate the brightness of the window and produce blurry results.
Fig. 11. From left to right, we compare our method against two other masking strategies as well as a pre-training method, and evaluate the effect of patch sampling. Here, GConv, IMask, and FMask refer to gated convolution [Yu et al. 2019], only masking the input image, and our full feature masking method, respectively. Moreover, Inp. pre-training refers to our proposed pre-training on the inpainting task.
Fig. 12. We compare the results of our network trained with only a pixel-wise loss ($\ell_1$) and the proposed perceptual loss. Using the perceptual loss function, our network can synthesize visually realistic textures, while the network trained with only a pixel-wise loss produces blurry results.

Moreover, as shown in Figure 13 (middle), when the input lacks sufficient information about the underlying texture, our method could potentially introduce patterns that do not exist in the ground truth image. Despite that, our result is still comparable to or better than the other approaches. Additionally, in some cases, our method reconstructs the saturated areas with an incorrect color, as shown in Figure 13 (bottom). It is worth noting that the network reconstructs the building in blue since trees and skies are usually next to each other in the training data. As seen, other approaches also reconstruct parts of the building in blue.

Although our network can be used to reconstruct an HDR video from an LDR video, our result is not temporally stable. This is mainly because we synthesize the content of every frame independently. In the future, it would be interesting to address this problem through temporal regularization [Eilertsen et al. 2019]. Moreover, we would like to experiment with the architecture of the networks to increase the efficiency of our approach and reduce the memory footprint.

7 CONCLUSION
We present a novel learning-based system for single image HDR reconstruction using a convolutional neural network.
Fig. 13. Failure cases of our approach. From top to bottom, our method fails to reconstruct the wrinkles on the curtain, introduces textures that are not in the ground truth, and incorrectly reconstructs the building with the sky color. Note that the top two examples are synthetic, but the bottom one is real, for which we do not have access to the ground truth image.

To alleviate the artifacts caused by conditioning the convolutional layers on the saturated pixels, we propose a feature masking mechanism with an automatic mask updating process. We show that this strategy reduces the halo and checkerboard artifacts caused by standard convolutions. Moreover, we propose a perceptual loss function that is designed specifically for the HDR reconstruction application. By minimizing this loss function during training, the network is able to synthesize visually realistic textures in the saturated areas. We further propose to train the system in two stages, where we pre-train the network on inpainting before fine-tuning it on HDR generation. To encourage the network to synthesize textures, we propose a sampling strategy to select challenging patches in the HDR examples. Our model can robustly handle saturated areas and can reconstruct high-frequency details in a realistic manner. We show quantitatively and qualitatively that our method outperforms previous methods on both synthetic and real-world images.
ACKNOWLEDGMENTS
We thank the reviewers for their constructive comments. M. Santos is funded by the Brazilian agency CNPq grant 161268/2018-8. T. Ren is partially supported by FACEPE grant APQ-0192-1.03/14. N. Kalantari is in part funded by a TAMU T3 grant 246451.
REFERENCES
Francesco Banterle, Patrick Ledda, Kurt Debattista, and Alan Chalmers. 2006. Inverse tone mapping. In Proceedings of the 4th International Conference on Computer Graphics and Interactive Techniques in Australasia and Southeast Asia. ACM, 349–356.
David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. 2019. Semantic Photo Manipulation with a Generative Image Prior. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH) 38, 4 (2019).
Cambodge Bist, Rémi Cozot, Gérard Madec, and Xavier Ducloux. 2017. Tone expansion using lighting style aesthetics. Computers & Graphics 62 (2017), 77–86.
Paul Debevec. 2005. A median cut algorithm for light probe sampling. In ACM SIGGRAPH 2005 Posters. ACM, 66.
PE Debevec and J Malik. 1997. Recovering high dynamic range images. In Proceeding of the SPIE: Image Sensors, Vol. 3965. 392–401.
Piotr Didyk, Rafal Mantiuk, Matthias Hein, and Hans-Peter Seidel. 2008. Enhancement of bright video features for HDR displays. In Computer Graphics Forum, Vol. 27. Wiley Online Library, 1265–1274.
Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2015. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2 (2015), 295–307.
Frédo Durand and Julie Dorsey. 2002. Fast bilateral filtering for the display of high-dynamic-range images. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques. 257–266.
Gabriel Eilertsen, Joel Kronander, Gyorgy Denes, Rafał K Mantiuk, and Jonas Unger. 2017. HDR image reconstruction from a single exposure using deep CNNs. ACM Transactions on Graphics (TOG) 36, 6 (2017), 178.
Gabriel Eilertsen, Rafał Mantiuk, and Jonas Unger. 2019. Single-frame Regularization for Temporally Stable CNNs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Yuki Endo, Yoshihiro Kanamori, and Jun Mitani. 2017. Deep reverse tone mapping. ACM Transactions on Graphics (TOG) 36, 6 (2017), Article 177.
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2015. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015).
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2414–2423.
Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS). 249–256.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS). 2672–2680.
Xintong Han, Zuxuan Wu, Weilin Huang, Matthew R Scott, and Larry S Davis. 2019. FiNet: Compatible and Diverse Fashion Image Inpainting. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 4481–4491.
Samuel W Hasinoff, Dillon Sharlet, Ryan Geiss, Andrew Adams, Jonathan T Barron, Florian Kainz, Jiawen Chen, and Marc Levoy. 2016. Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics (TOG) 35, 6 (2016), 192.
Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504–507.
Jun Hu, Orazio Gallo, Kari Pulli, and Xiaobai Sun. 2013. HDR deghosting: How to deal with saturation?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1163–1170.
Nima Khademi Kalantari and Ravi Ramamoorthi. 2017. Deep high dynamic range imaging of dynamic scenes. ACM Transactions on Graphics (TOG) 36, 4 (2017), Article 144.
Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. 2003. High dynamic range video. In ACM Transactions on Graphics (TOG), Vol. 22. ACM, 319–325.
Soo Ye Kim, Jihyong Oh, and Munchurl Kim. 2019. JSI-GAN: GAN-based joint super-resolution and inverse tone-mapping with pixel-wise task-specific filters for UHD HDR video. arXiv preprint arXiv:1909.04391 (2019).
Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Rafael P Kovaleski and Manuel M Oliveira. 2014. High-quality reverse tone mapping for a wide range of exposures. In 2014 27th SIBGRAPI Conference on Graphics, Patterns and Images. IEEE, 49–56.
Hayden Landis. 2002. Production-ready global illumination. SIGGRAPH Course Notes 16 (2002), 11.
Siyeong Lee, Gwon Hwan An, and Suk-Ju Kang. 2018a. Deep chain HDRI: Reconstructing a high dynamic range image from a single low dynamic range image. IEEE Access (2018).
Siyeong Lee, Gwon Hwan An, and Suk-Ju Kang. 2018b. Deep recursive HDRI: Inverse tone mapping using generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV). 596–611.
Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. 2018. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV). 85–100.
Gonzalo Luzardo, Jan Aelterman, Hiep Luong, Wilfried Philips, Daniel Ochoa, and Sven Rousseaux. 2018. Fully-Automatic Inverse Tone Mapping Preserving the Content Creator's Artistic Intentions. In 2018 Picture Coding Symposium (PCS). IEEE, 199–203.
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the International Conference on Machine Learning (ICML), Vol. 30. 3.
Rafał Mantiuk, Kil Joong Kim, Allan G Rempel, and Wolfgang Heidrich. 2011. HDR-VDP-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions. ACM Transactions on Graphics (TOG) 30, 4 (2011), 40.
Demetris Marnerides, Thomas Bashford-Rogers, Jonathan Hatchett, and Kurt Debattista. 2018. ExpandNet: A deep convolutional neural network for high dynamic range expansion from low dynamic range content. In Computer Graphics Forum, Vol. 37. Wiley Online Library, 37–49.
Morgan McGuire, Wojciech Matusik, Hanspeter Pfister, Billy Chen, John F Hughes, and Shree K Nayar. 2007. Optical splitting trees for high-precision monocular imaging. IEEE Computer Graphics and Applications 27, 2 (2007), 32–42.
Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML). 807–814.
Shiyu Ning, Hongteng Xu, Li Song, Rong Xie, and Wenjun Zhang. 2018. Learning an inverse tone mapping network with a generative adversarial regularizer. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1383–1387.
Tae-Hyun Oh, Joon-Young Lee, Yu-Wing Tai, and In So Kweon. 2014. Robust high dynamic range imaging by rank minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 6 (2014), 1219–1232.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS). 8024–8035.
Allan G Rempel, Matthew Trentacoste, Helge Seetzen, H David Young, Wolfgang Heidrich, Lorne Whitehead, and Greg Ward. 2007. Ldr2Hdr: on-the-fly reverse tone mapping of legacy video and photographs. In ACM Transactions on Graphics (TOG), Vol. 26. ACM, 39.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.
Pradeep Sen, Nima Khademi Kalantari, Maziar Yaesoubi, Soheil Darabi, Dan B Goldman, and Eli Shechtman. 2012. Robust patch-based HDR reconstruction of dynamic scenes. ACM Transactions on Graphics (TOG) 31, 6 (2012), Article 203.
Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR).
Michael D Tocci, Chris Kiser, Nora Tocci, and Pradeep Sen. 2011. A versatile HDR video production system. In ACM Transactions on Graphics (TOG), Vol. 30. ACM, 41.
Lvdi Wang, Li-Yi Wei, Kun Zhou, Baining Guo, and Heung-Yeung Shum. 2007. High dynamic range image hallucination. In Proceedings of the 18th Eurographics Conference on Rendering Techniques. Eurographics Association, 321–326.
Shangzhe Wu, Jiarui Xu, Yu-Wing Tai, and Chi-Keung Tang. 2018. Deep high dynamic range imaging with large foreground motions. In Proceedings of the European Conference on Computer Vision (ECCV). 117–132.
Yucheng Xu, Shiyu Ning, Rong Xie, and Li Song. 2019. GAN Based Multi-Exposure Inverse Tone Mapping. In 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 1–5.
Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. 2017. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6721–6729.
Xin Yang, Ke Xu, Yibing Song, Qiang Zhang, Xiaopeng Wei, and Rynson WH Lau. 2018. Image correction via deep reciprocating HDR transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1798–1807.
Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. 2019. Free-form image inpainting with gated convolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 4471–4480.
Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In European Conference on Computer Vision (ECCV). Springer, 649–666.
Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. 2014. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems (NeurIPS). 487–495.