UPHDR-GAN: Generative Adversarial Network for High Dynamic Range Imaging with Unpaired Data
Ru Li a, Chuan Wang b, Shuaicheng Liu a, Jue Wang b, Guanghui Liu a, Bing Zeng a

a School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
b Megvii Technology, Chengdu 610000, China
Abstract
The paper proposes a method to effectively fuse multi-exposure inputs and generate high-quality high dynamic range (HDR) images with unpaired datasets. Deep learning-based HDR image generation methods rely heavily on paired datasets: the ground truth provides information that lets the network produce HDR images without ghosting, and datasets without ground truth are hard to use for training deep neural networks. Recently, Generative Adversarial Networks (GAN) have demonstrated their potential for translating images from a source domain X to a target domain Y in the absence of paired examples. In this paper, we propose a GAN-based network for solving such problems while generating pleasing HDR results, named UPHDR-GAN. The proposed method relaxes the constraint of paired datasets and learns the mapping from the LDR domain to the HDR domain. Although paired data are missing, UPHDR-GAN can properly handle the ghosting artifacts caused by moving objects or misalignments with the help of a modified GAN loss, an improved discriminator network and a useful initialization phase. The proposed method preserves the details of important regions and improves the overall perceptual quality. Qualitative and quantitative comparisons against other methods demonstrate the superiority of our method.

Keywords: Multi-exposure HDR imaging, generative adversarial network, unpaired data
Preprint submitted to Elsevier, February 4, 2021

Figure 1: LDR images with different exposures are shown in (a), and our result is shown in (b). (c) Result of Hu et al.'s method [1]. (d) Result of Sen et al.'s method [2]. (e) Result of Kalantari et al.'s method [3]. (f) Result of Wu et al.'s method [4]. (g) Zoomed-in areas of our result. The proposed UPHDR-GAN handles moving objects better and generates results with fewer ghosting artifacts.
1. Introduction
Dynamic ranges of natural scenes are much wider than those captured by commercial imaging products. Most digital photography sensors fail to capture the irradiance range that is visible to the human eye. High dynamic range (HDR) imaging techniques have attracted considerable interest because they can overcome such limitations. Some specialized hardware devices [5, 6] have been introduced to produce HDR images directly, but they are usually too expensive to be widely adopted. As a result, computational HDR imaging methods have drawn more attention. The most common strategy is to capture a stack of low dynamic range (LDR) images at different exposure levels and then merge them into an HDR image [7, 8, 9].

Since their first introduction in the 1990s, HDR imaging techniques have evolved quickly. Some methods [7, 8, 9] were first proposed to reconstruct an HDR image through the camera response function (CRF) and then apply tone mapping for display. These original approaches work only for static scenes because they typically assume constant radiance at each pixel over all exposures. If the scene contains moving content, these methods produce ghosting artifacts from even small misalignments between exposures. Later, some methods [10, 11, 12] were introduced to fuse inputs with scene motions or dynamic objects using homography or optical flow. Generally, one of the LDR images is considered the reference image and the other inputs are aligned to it. These algorithms can generate high-quality HDR results if the inputs are fully aligned, but suffer from significant ghosting and blurring artifacts when alignment errors remain. Some patch-based methods [1, 2] were then proposed to reconstruct the input images by patch-based synthesis according to one selected reference image, so as to form a fully registered image stack. However, patch-based reconstruction is not always robust in complicated situations.

Recent works [3, 4, 13] learn the composition process using deep neural networks (DNN). Kalantari et al. [3] first proposed a DNN-based HDR generation method, which merges the LDR images after an optical flow-based alignment process. However, their method fails to handle the artifacts and distortions caused by inevitable optical flow estimation errors. Wu et al. [4] applied a simple homography to align the background first and used the network to achieve the main alignment and fusion. Yan et al. [13] introduced an attention-guided network to detect and align dynamic objects before fusion. However, these methods still suffer from artifacts when the LDR images contain large motions or significant misalignment, due to the unreliability of image registration. Different techniques have been introduced to improve fusion results; however, the most important problem of deep learning-based methods is that they rely heavily on paired inputs and ground truth.

To relax the constraint on the dataset, we propose a GAN-based fusion method, named UPHDR-GAN, that optimizes the network using an unpaired dataset. First, compared to famous single-image enhancement methods [14, 15, 16] and some recent GAN-based image fusion methods [17, 18, 19] that are trained on paired datasets, the proposed method trains on unpaired data and learns the mapping from the LDR domain to the HDR domain. Second, unlike some methods designed for unpaired datasets that mainly concentrate on processing a single input [20], our UPHDR-GAN is a multi-input method that takes moving objects into consideration.
Third, even with multiple inputs, simply concatenating multi-exposure inputs suffers from severe ghosting. In contrast, the proposed method successfully fuses multiple inputs and generates high-quality HDR images without ghosting artifacts by introducing an initialization phase, a modified GAN loss and min-patch training.

Fig. 1 shows comparisons with several typical de-ghosting methods, including two patch-based methods [1, 2] and two deep learning-based methods [3, 4]. The two patch-based methods use patch matching to align the inputs but ignore image integrity; information loss and undesired halo artifacts appear in their final results (Fig. 1(c) and (d)). Kalantari et al. [3] applied optical flow to align the inputs before they are sent to the network, and flow-based methods often produce deformations caused by parallax or dynamic content. Wu et al. [4] improved Kalantari's method and embedded the alignment into the network. Although the two DNN-based methods produce satisfactory results on some examples, they still suffer from ghosting artifacts if there are large-scale movements between images (Fig. 1(e) and (f)). Our method deals with dynamic objects properly with the assistance of the modified GAN loss and min-patch training, which pay more attention to the dynamic regions and emphasize the edges. In addition to visual comparisons, we also organize quantitative comparisons against several typical approaches to validate the superiority of the proposed UPHDR-GAN; the proposed method achieves the best scores on average.

Overall, the main contributions are summarized as follows:

• We propose a GAN-based multi-exposure HDR fusion network, which relaxes the constraint of paired training datasets and learns the mapping between the input and target domains. To the best of our knowledge, this work is the first GAN-based approach for unpaired HDR reconstruction.

• The proposed method can not only be trained on unpaired datasets but also generate HDR results with fewer ghosting artifacts. We apply an initialization phase, a modified GAN loss and min-patch training to avoid ghosting and improve image quality.

• We provide qualitative and quantitative comparisons of UPHDR-GAN with several state-of-the-art methods, demonstrating its superiority over existing methods.

2. Related Works
HDR imaging has been the subject of extensive research over the past decades, and existing work can mainly be divided into two categories: static and dynamic scene methods.
Static scene methods.
Mann et al. [7] and Debevec et al. [8] first proposed to take sequential LDR images at different exposure levels and then merge them into an HDR image. These original approaches produced spectacular results for static scenes and static cameras. Several variants were then introduced, generating disparity maps or using neural networks. Sun et al. computed the disparity map between stereo images and applied it to compute the CRF [21]. Hashimoto et al. enhanced hard-to-view or nonviewable features and content of color images with a new tone reproduction algorithm [22]. Sheth et al. proposed an LDR2HDR network to generate an HDR map of a static scene using differently exposed LDR images [23]. There are also numerous static fusion methods [24, 25, 26, 27, 28, 29] that do not generate HDR outputs but directly obtain informative LDR results. However, due to the lack of explicit detection of dynamic objects, the aforementioned methods are unaware of any motion in the scene and are therefore suitable for static scenes only.
Dynamic scene methods.
To extend the scope of application, many algorithms approach the de-ghosting problem from different perspectives, providing solutions that range from rudimentary heuristics to advanced computer vision techniques [30]. The motion-based methods include global exposure registration [31, 32, 33], moving object removal [34], moving object selection [35] and moving object registration [2, 1]. A number of methods reject the moving pixels using weightings in the merging process [36, 37]. Another approach is to detect and resolve ghosting after the merging [38, 39]. Since such methods simply ignore the misaligned pixels, they fail to fully utilize the available content to generate HDR images. Besides, there are more complicated methods that rely on image registration, which reconstruct each HDR region by searching for the best matching region in the LDR images. This is achieved by pixel-based (optical flow) or patch-based dense correspondences. For example, Bogoni et al. introduced a two-phase alignment strategy, which first performs a global affine registration and then estimates optical flow between the input and the reference [40]. Zimmer et al. presented an energy-based method for estimating the optical flow, which is robust in the presence of noise and occlusion [10]. Sen et al. proposed a patch-based energy minimization approach that integrates alignment and HDR reconstruction in a joint optimization [2]. Hu et al. handled saturated regions and moving objects simultaneously by optimizing image alignment based on brightness and gradient consistencies [1]. Although optical flow-based methods are able to align images with complex motions, they usually suffer from deformations in regions with no correspondences, due to occlusions caused by parallax or dynamic content. On the other hand, patch-based methods sometimes produce excellent results, but they commonly suffer from low efficiency and usually fail in the presence of large motions and saturated regions. To overcome the issues above, some deep learning approaches [3, 4, 13] have been developed recently. Deep learning methods have the advantage of exploiting information extracted from training data to identify and compensate for image regions that do not meet the assumptions underlying the HDR process. However, each of these methods only addresses part of the issues and needs paired data to optimize the network. We propose UPHDR-GAN to comprehensively handle the existing issues, including removing ghosting artifacts and relaxing the constraint of paired data.
GAN-based methods.

The concept of GAN was originally proposed by Goodfellow et al. [41] and has drawn substantial attention in the field of deep learning. The vanilla GAN consists of two components, a generator G and a discriminator D, where G is responsible for capturing the data distribution while D tries to distinguish whether a sample comes from the real training data or from the generator. This framework corresponds to a min-max two-player game, which provides a powerful way to estimate a target distribution and generate new samples. GANs have achieved impressive results in face manipulation [42, 43, 44], image blending [45], image generation [46, 47], image editing [48], representation learning [46, 49] and style transfer [50, 20, 51, 52]. Generally, the inputs of the aforementioned methods are noise or a single image. Some recent works fuse multi-input images [53, 54, 55, 56, 57]. Perera et al. extended unsupervised image-to-image translation to multiple-input settings using GAN and validated their method on several tasks, including multi-spectral-to-visible and synthetic-to-real image translation [53]. Joo et al. applied GAN to generate a fused image with the identity of input image x and the shape of input image y with the help of min-patch training [57]. Guo et al. introduced a GAN-based multi-focus image fusion system, which utilizes G to generate the desired mask maps [54]. Ma et al. used GAN to fuse infrared and visible information, obtaining a fused image with major infrared intensities together with additional visible gradients [55, 56]. Recently, some GAN-based methods have been proposed to handle multi-exposure images [17, 18, 19]. Xu et al. [17] and Yang et al. [18] fused two inputs, the under-exposed image and the over-exposed image, to generate an informative output. Niu et al. [19] proposed a reference-based residual merging block for aligning large object motions in the feature domain, and a deep HDR supervision scheme for eliminating artifacts of the reconstructed HDR images. However, these GAN-based methods heavily rely on paired training datasets, so their performance is greatly limited. In comparison, we propose UPHDR-GAN to fuse multi-exposure inputs, which is compatible with unpaired datasets, so the flexibility and robustness of our proposed network are significantly improved.
3. Method
We propose a GAN-based framework, which is the first method designed for handling unpaired multi-exposure fusion datasets. As in a common GAN framework, the generator G learns the mapping function between different domains, while the discriminator D aims to optimize G by distinguishing source-domain images from generated ones. Our collected dataset consists of scenes with and without ground truth. To obtain the unpaired dataset, we shuffle the organization of the input dataset and the target dataset. To better describe the framework, we adopt X as the source-domain data, including a wide diversity of multi-exposure sequences x = {x_1, x_2, x_3}, and a collection of HDR images {y_j}_{j=1,...,M} ∈ Y as the target-domain data. The data distributions of the two domains are denoted as x ~ p_data(x) and y ~ p_data(y), respectively.

Figure 2: The proposed method seeks to generate high-quality HDR results with an unpaired dataset. The architecture of UPHDR-GAN consists of two components with different functions. The generator processes the inputs through down-convolution layers, residual blocks and up-convolution layers so as to generate results with less ghosting. The discriminator distinguishes the generated HDR images from real ones.

The proposed UPHDR-GAN is capable of generating HDR images with fewer ghosting artifacts in the absence of paired datasets. Its network architecture is introduced in Section 3.1, and the objective function is presented in Section 3.2 in detail.

3.1. Network Architecture

UPHDR-GAN is an images-to-image task with three inputs and one output. The structure of UPHDR-GAN is illustrated in Fig. 2. The generator and the discriminator are simultaneously optimized during the training process, and their detailed layer configurations are presented in Table 1. Instead of feeding the original full-size images into our model, we break down the training images into overlapping patches of size 256 ×
256 with a stride of 64. The encoder contains three branches, and the input size of each branch is 256 × 256 × 6, which is the channel-wise concatenation of one input x_i ∈ x = {x_1, x_2, x_3} with its mapped HDR image H_i ∈ H_m = {H_1, H_2, H_3}. H_m is obtained using simple gamma encoding:

    H_i = x_i^γ / t_i,  γ > 1,    (1)

where t_i is the exposure time of x_i. The LDRs facilitate the detection of misalignments and saturation, while the exposure-adjusted HDRs improve the robustness of the network across LDRs with various exposure levels.

Table 1: Detailed parameter settings of the network, in which 'BN' indicates batch normalization and 'ES' represents elementwise sum.

Inputs: 3 × [256, 256, 6]

Generator
  Module            Kernel  Stride  Channel  BN channel  Activation
  Encoder   E1        7       1       64        64        ReLU
            E2        3       2      128         -          -
                      3       1      128       128        ReLU
            E3        3       2      256         -          -
                      3       1      256       256        ReLU
  Residual blocks     3       1      256       256        ReLU
                      3       1      256       256        ES
  Decoder   D1        3      1/2     128         -          -
                      3       1      128       128        ReLU
            D2        3      1/2      64         -          -
                      3       1       64        64        ReLU
            D3        7       1        3         -          -

Discriminator
  C1                  3       1       32         -        LReLU
  C2                  3       2       64         -        LReLU
                      3       1       64        64        LReLU
  C3                  3       2      128         -        LReLU
                      3       1      128       128        LReLU
  C4                  3       1      256       256        LReLU
  C5                  3       1        1         -          -

Output: HDR H_o: [256, 256, 3]; Tonemapped HDR T(H_o): [256, 256, 3]

After getting the HDR output H_o, we apply a μ-law [3] post-processing step to compress the range of the image, since computing the loss functions on the tone-mapped HDR images is more effective than computing them directly in the HDR domain:

    T(H_o) = log(1 + μ·H_o) / log(1 + μ),    (2)

where H_o is the output HDR image and μ is a parameter defining the amount of compression, set to 5,000 in our implementation.

The generator network is composed of three parts: the encoder, eight residual blocks and the decoder. Specifically, the encoder consists of three convolutional blocks, E1, E2 and E3, as described in Table 1. Useful local signals are extracted by the encoder for downstream transformation. Afterward, eight residual blocks with identical layout construct the content and the manifold feature. The decoder consists of two identical transposed convolutional blocks (D1 and D2) and a final convolutional layer (D3).
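To make the input formatting and range compression concrete, the following PyTorch sketch implements Eq. (1) and Eq. (2). The gamma value and the exposure times are assumptions for illustration; the paper only requires γ > 1.

```python
import torch

MU = 5000.0  # amount of compression in Eq. (2)

def branch_input(ldr, exposure_time, gamma=2.2):
    """Build the 6-channel input of one encoder branch: the LDR image x_i
    concatenated with its gamma-encoded HDR-domain image H_i = x_i^gamma / t_i
    (Eq. 1). gamma = 2.2 is an assumed value; the paper only states gamma > 1."""
    hdr = ldr.clamp(0.0, 1.0).pow(gamma) / exposure_time
    return torch.cat([ldr, hdr], dim=1)            # [N, 6, 256, 256]

def mu_law(hdr, mu=MU):
    """Differentiable mu-law compression T(H_o) of Eq. (2); the losses are
    computed on this tone-mapped output rather than on linear HDR values."""
    return torch.log1p(mu * hdr) / torch.log1p(torch.tensor(mu))

# One branch per exposure; the exposure times t_1 < t_2 < t_3 are hypothetical.
ldrs = [torch.rand(1, 3, 256, 256) for _ in range(3)]
times = [0.01, 0.04, 0.16]
inputs = [branch_input(x, t) for x, t in zip(ldrs, times)]
```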
The discriminator network is complementary to the generator. We leverage the technique from PatchGANs [14, 58] to classify each image patch as real or fake rather than using a full-image discriminator; 256 × 256 overlapped patches are cropped from generated or real HDR images for training. Such a simple patch-level discriminator uses fewer parameters and can work on images of arbitrary sizes. However, not all regions in a patch contribute to the discriminator optimization during training. If a small part of a generated image is strange enough to differ from the real images, it can be considered an undesirable ghosting artifact; therefore, paying more attention to the strangest parts is essential. We apply min-patch training, which adds a minimum-pooling layer at the last part of the PatchGAN (Fig. 3). Note that minimum pooling is applied to the final output of the discriminator only when the generator is being trained. We acquire the feature maps before the min-pooling layer (F) and after the min-pooling layer (F_pool), respectively. The generator is optimized with F_pool, which helps the network focus on the most important parts of the generated images, such as erroneous or strange regions. The discriminator distinguishes real images from fake ones as a common PatchGAN and is optimized with the features before the min-pooling layer, F. Min-patch training focuses on the most suspicious parts of fake images and helps to avoid ghosting artifacts. In our implementation, the PatchGAN discriminator outputs 64 × 64 feature maps, and min-patch training uses 16 × 16 min-pooling, so the generator uses a 4 × 4 output.

Figure 3: (a) The last feature map of PatchGAN. (b) Minimum pooling applied to (a). (c) Corresponding receptive fields in the input image.
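A minimal sketch of min-patch training follows, assuming a PatchGAN discriminator that outputs a 64 × 64 map of logits; the binary cross-entropy form of the generator objective is our assumption based on the log formulation in Eq. (5).

```python
import torch
import torch.nn.functional as F

def min_pool(score_map, k=16):
    """Minimum pooling over the PatchGAN output: a 64x64 score map shrinks
    to 4x4, keeping only the lowest ('most fake-looking') response in each
    16x16 region. Implemented as a negated max-pooling."""
    return -F.max_pool2d(-score_map, kernel_size=k)

# f_full: discriminator logits on a generated patch (the map F, 64x64);
# D itself is optimized on this full map, as in a standard PatchGAN.
f_full = torch.randn(1, 1, 64, 64)

# G is optimized on F_pool, so its gradient concentrates on the strangest
# regions of the fake image, e.g. ghosted areas.
f_pool = min_pool(f_full)                       # [1, 1, 4, 4]
g_loss = F.binary_cross_entropy_with_logits(
    f_pool, torch.ones_like(f_pool))            # push the worst regions toward 'real'
```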
3.2. Objective Function

As GAN is a min-max optimization system, the proposed method aims to solve:

    G*, D* = arg min_G max_D L(G, D).    (3)

Based on HDR imaging properties, we design our objective function to include the following two losses: (1) the adversarial loss L_GAN(G, D), which drives the generator network to achieve the desired manifold transformation; (2) the content loss L_con(G, D), which preserves the image content during transformation. The full objective function is:

    L(G, D) = L_GAN(G, D) + w_con · L_con(G, D),    (4)

where w_con is a hyper-parameter controlling the relative importance of the content loss, so as to balance the effects of transformation and content preservation.

As in classic GAN networks, the adversarial loss constrains the results of G to look like target-domain images. In our task, the adversarial loss pushes G to generate outputs that are similar to real HDR images in the absence of corresponding ground truth. Meanwhile, D aims to distinguish whether a given image belongs to the synthesized or the real target manifold. However, simply applying the common adversarial loss is not sufficient for preserving clear edge information. For this reason, Chen et al. [51] proposed to confuse D with a blurred dataset to push the generator to produce images with sharp edges, which has proven useful for style transformation. Similarly, we add a blurred HDR dataset to push G to generate clearer output. Specifically, from the target images {y_j}_{j=1,...,M} ∈ Y, we generate the same number of blurred images {c_j}_{j=1,...,M} ∈ C by removing clear edges in Y. In detail, for each y_j we apply the following three steps (a code sketch is given at the end of this section):

1. Detect edge pixels using a standard Canny edge detector;
2. Dilate the edge regions;
3. Apply Gaussian smoothing in the dilated edge regions.

In other words, the discriminator tries to correctly classify an image into three categories: the generated image G(x), the real HDR image y, and the blurred HDR image c, as formulated in Eq. 5:

    L_GAN = E_{y~p_data(y)}[log D(y)] + E_{c~p_data(c)}[log(1 - D(c))] + E_{x~p_data(x)}[log(1 - D(G(x)))].    (5)

The adversarial loss only ensures that the generated image looks realistic in the target domain; alone, it is inadequate for a transformation that preserves the semantic information of the input. It is essential to enforce a stricter constraint to ensure semantic consistency between the inputs and the output. Generally, the middle-exposure image is selected as the reference and the other inputs are aligned to it. We define the image content loss as the difference between the middle-exposure input and the generated HDR result. Instead of using a common pixel-wise loss function, we apply the perceptual loss [50] to measure high-level semantic differences. As such, the complete image content loss is defined as:

    L_con(G, D) = E_{x~p_data(x)}[ ||VGG_l(G(x)) - VGG_l(x_2)|| ],    (6)

where l refers to a specific layer in the VGG network; we use the 'conv4_4' layer in our implementation.

The hyper-parameter w_con is added to balance the adversarial loss and the content loss. A larger w_con produces images with more content information from the middle-exposure input, generating images that do not look like desired HDR images. However, a small w_con makes the network learn the translation excessively, so that the semantic content information cannot be well preserved. To strike a balance, we set w_con to 1.5 at the initial stage to correctly preserve content information with moderate transformation. Then, w_con is gradually decreased to obtain better transformation results as training becomes increasingly stable and the semantic content information has been properly reconstructed in the initial stage. The variation trend of w_con is described as:

    w_con = w_con × α^⌊N_e / K⌋,    (7)

where N_e is the number of epochs in the training process, and the decay base α (< 1) and epoch interval K are fixed constants.
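The three edge-removal steps above can be sketched with OpenCV as follows. The Canny thresholds, dilation kernel and Gaussian parameters are illustrative choices, not values from the paper.

```python
import cv2
import numpy as np

def blurred_negative(y, canny_lo=100, canny_hi=200, ksize=5, sigma=3.0):
    """Build a blurred copy c_j of a (tone-mapped, uint8 BGR) HDR image y_j
    by smoothing only its edge regions; c_j serves as the extra 'fake'
    class in Eq. (5). All parameter values here are illustrative."""
    gray = cv2.cvtColor(y, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, canny_lo, canny_hi)                  # 1. detect edge pixels
    mask = cv2.dilate(edges, np.ones((ksize, ksize), np.uint8))  # 2. dilate edge regions
    smoothed = cv2.GaussianBlur(y, (ksize, ksize), sigma)        # 3. Gaussian smoothing
    c = y.copy()
    c[mask > 0] = smoothed[mask > 0]    # replace only the dilated edge regions
    return c
```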
Table 2: Detailed source information of our dataset.

Source              URL
HDReye              https://mmspg.epfl.ch/downloads/hdr-eye/
Fairchild HDRPS     http://rit-mcsl.org/fairchild//HDRPS/HDRthumbs.html
EmpaMT              http://empamedia.ethz.ch/hdrdatabase/index.php
Kalantari (SIG17)   https://cseweb.ucsd.edu/~viscomp/projects/SIG17HDR
Akyüz (METU)        http://user.ceng.metu.edu.tr/~akyuz/files/eg2016/index.html
4. Experiments
We first illustrate the datasets and implementation details in Section 4.1, and then conduct comprehensive experiments to verify the performance of the proposed method, including quantitative comparisons (Section 4.2), qualitative comparisons (Section 4.3) and ablation studies (Section 4.4). Specifically, we first compare our method with several methods that are only suitable for static inputs [25, 27, 26, 28, 29, 17], and then with several classic de-ghosting methods, including two patch-based methods [2, 1] and two deep neural network (DNN) mergers, with and without optical flow registration, respectively [3, 4]. To make a fair comparison, we obtained the codes from the authors of the above algorithms and generate their results with default settings. Xu et al. combine two inputs, the under-exposed image and the over-exposed image, to generate an output [17]; we remove the mid-exposure images from the test set when evaluating their method.
4.1. Datasets and Implementation Details

Figure 4: Results of the initialization phase. (a) A middle-exposure input. (b) Generated result after 10 epochs of pre-training.

Training deep networks usually requires a large number of training examples, which potentially determines the upper bound of a network. The datasets of common deep learning-based fusion methods usually include a set of LDR images of a dynamic scene and their corresponding HDR image. However, most existing HDR datasets either lack ground-truth images [30, 59] or have a small number of scenes with only rigid motion [60]. Although Kalantari et al. introduced the first such HDR dataset, the variety of its scenes is limited [3]. UPHDR-GAN relaxes the constraint of paired inputs: it learns the mapping between the input and target domains and transforms the multi-exposure inputs into an informative HDR output. We have collected a total of 266 groups of images from various sources, including static/dynamic, indoor/outdoor and daytime/nighttime scenes; see Table 2 for detailed information. By shuffling the correspondence between the inputs and the ground truth, the unpaired dataset is obtained. Some groups include approximately 10 multi-exposure inputs, from which we select 3 images with minimum, medium and maximum exposure as training inputs. For the training data, we first align the images using a simple homography transformation, which makes learning more effective than training directly without background alignment. Then, we break down the training images into overlapping patches of size 256 ×
256 with a stride of 64, and then perform data augmentation (flipping and rotation), further increasing the training data by 8 times.

We implement UPHDR-GAN in PyTorch, and all the experiments are performed on an NVIDIA RTX 2080Ti GPU. The Adam optimizer is applied with the same learning rate for the generator and the discriminator. An initialization phase is introduced to improve convergence and avoid the training being trapped in a suboptimal local minimum. We initialize the generator so that it reconstructs the content of the middle-exposure input and ignores the translation; for this purpose, we pre-train the generator G using merely L_con. An example is presented in Fig. 4. After 10 epochs of pre-training, the network properly reconstructs the content information of the middle-exposure input. Since we select it as the reference, the initialization helps to avoid ghosting.
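The patch extraction and the 8× flip/rotation augmentation described above can be sketched as follows; the helper names are ours.

```python
import numpy as np

def extract_patches(img, size=256, stride=64):
    """Break a full-resolution image into overlapping size x size patches
    with the given stride, as done for the training images."""
    h, w = img.shape[:2]
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, stride)
            for c in range(0, w - size + 1, stride)]

def augment_8x(patch):
    """Flipping and rotation: 4 rotations x 2 mirrorings = 8 variants,
    increasing the training data by 8 times."""
    out = []
    for k in range(4):
        rot = np.rot90(patch, k)
        out += [rot, np.fliplr(rot)]
    return out
```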
4.2. Quantitative Comparisons

The main goal of image fusion is to integrate complementary information from multiple sources so that the fused images are more suitable for human visual perception and computer processing. Although the proposed method can generate high-quality HDR images without ground truth, we select the test set for quantitative comparisons from paired datasets that include multi-exposure inputs and HDR images. As the ground truth is available, we can conduct various quantitative evaluations and comparisons. We first compute the Peak Signal-to-Noise Ratio (PSNR) [61] and Structural Similarity Index Measure (SSIM) [62] scores between the generated image and the real image. We also compute HDR-VDP-2 [63], a metric specifically designed for measuring the visual quality of HDR images. For the two parameters used to compute the HDR-VDP-2 scores, we set the diagonal display size to 24 inches and the viewing distance to 0.5 meters.

Table 3: Quantitative comparison of the proposed method with several classic static methods on twenty static scenes. Red color indicates the best performance and blue color indicates the second best results.

Method         PSNR    SSIM    HDR-VDP-2
Mertens [25]   29.817  0.9511  54.056
Li2012 [26]    30.289  0.9527  53.925
Li2013 [27]    31.020  0.9575  55.142
Paul [28]      31.876  0.9584  55.953
Ma [29]        -       -       -
Xu [17]        32.582  0.9651  53.631
Ours           -       -       -

Table 4: Quantitative comparison of the proposed method with several state-of-the-art de-ghosting methods on twenty dynamic scenes. Red color indicates the best performance and blue color indicates the second best results.

Method          PSNR    SSIM    HDR-VDP-2
Sen [2]         40.924  0.9806  56.239
Hu [1]          34.785  0.9725  55.754
Kalantari [3]   -       -       -
Wu [4]          -       -       -
Ours            -       -       -

Twenty static scenes and twenty dynamic scenes, which include multi-exposure inputs and corresponding ground truth, are collected as the test set for quantitative comparisons. The static sequences are either captured by cameras mounted on a tripod or aligned before the fusion calculation. Our method is compared with several classic methods that are suitable for static inputs [25, 26, 27, 28, 29, 17] on these static scenes, and the comparison results are shown in Table 3. These static methods fuse multi-exposure inputs in the absence of ground truth, resulting in lower scores when computing the difference between output and ground truth. The proposed UPHDR-GAN also abandons the constraint of ground truth, but can extract information from the target HDR dataset, hence providing results with better PSNR, SSIM and HDR-VDP-2 values on average. Table 4 exhibits the comparison results of UPHDR-GAN with several de-ghosting methods. Two patch-based methods [2, 1] reconstruct the underlying image stacks by patch-match-oriented optimization. Kalantari et al. [3] and Wu et al. [4] obtain HDR results through deep neural networks. Although the methods in Table 4 achieve similar quantitative scores, the proposed method shows superior performance overall.

Figure 5: Visual comparisons on the testing data from the EmpaMT dataset. (a) Input LDR images. (b) Our result. (c) Result of Mertens et al.'s method [25]. (d) Result of Li et al.'s method [26]. (e) Result of Li et al.'s method [27]. (f) Result of Paul et al.'s method [28]. (g) Result of Ma et al.'s method [29]. (h) Result of Xu et al.'s method [17].

Figure 6: Visual comparisons with de-ghosting methods. (a) Input images. (b) Result of Hu et al.'s method [1]. (c) Result of Sen et al.'s method [2]. (d) Result of Kalantari et al.'s method [3]. (e) Result of Wu et al.'s method [4]. (f) Our result. We compare the zoomed-in local areas of UPHDR-GAN and other de-ghosting methods. Our method shows the best overall performance.
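A sketch of the PSNR/SSIM evaluation, assuming scores are computed on μ-law tone-mapped images normalized to [0, 1]; HDR-VDP-2 is a separate MATLAB tool and is not reproduced here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

MU = 5000.0

def tonemap(hdr, mu=MU):
    """mu-law compression (Eq. 2) applied before comparison."""
    return np.log1p(mu * hdr) / np.log1p(mu)

def evaluate(pred_hdr, gt_hdr):
    """PSNR and SSIM between the tone-mapped prediction and ground truth."""
    p, g = tonemap(pred_hdr), tonemap(gt_hdr)
    psnr = peak_signal_noise_ratio(g, p, data_range=1.0)
    ssim = structural_similarity(g, p, channel_axis=-1, data_range=1.0)
    return psnr, ssim
```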
4.3. Qualitative Comparisons

In this section, our method is first compared with [25, 27, 26, 28, 29, 17] on a static scene (Fig. 5). The compared methods are mature enough to handle static images, but they ignore tiny motions such as leaves moving in the wind. Some slight movements of the leaves on the left in Fig. 5 lead to ghosting artifacts in the compared methods. Although some of them have strategies to deal with dynamic content, such as the median and recursive filters of [26], they still produce unsatisfactory ghosting. Xu et al. solely obtain information from the under-exposed and over-exposed images [17], which leads to a mediocre result with color deviation.

Fig. 6 and Fig. 7 show the qualitative comparisons against several state-of-the-art de-ghosting methods [1, 2, 3, 4].
Figure 7: Visual comparisons with de-ghosting methods. (a) Input images. (b) Our result. (c) Result of Hu et al.'s method [1]. (d) Result of Sen et al.'s method [2]. (e) Result of Kalantari et al.'s method [3]. (f) Result of Wu et al.'s method [4]. (g) Zoomed-in areas of our result. The scene is challenging because there are large foreground motions between the input LDR images. The proposed UPHDR-GAN can properly deal with the motions caused by moving people.

Two patch-based approaches [1, 2] aim to reconstruct ghost regions in the output image by transferring information from the inputs as determined by patch matching. However, they cannot recover structured regions properly. Although patch-based methods generally produce relatively good results, errors in motion estimation are difficult to avoid in the presence of tiny random motions. Hu's method generates results with undesirable artifacts around the building (red arrow in Fig. 6(b)), and Sen's method produces results with serious halo artifacts (yellow arrows in Fig. 6). Note that Hu's method tends to suffer from color drifting in some cases, which occurs when computing the generic intensity mapping function and then renders the radiance consistency measure ineffective.

Deep learning methods have the advantage that they can exploit information extracted from training data to identify and compensate for problematic image regions. However, they only perform well in one way or another. Kalantari et al. [3] apply optical flow to align the inputs first and then send them to a convolutional neural network to obtain fused results, which may produce artifacts for two main reasons: misalignment of the optical flow and limitations of the merging process. Wu et al. [4] improved Kalantari et al.'s method and embedded the alignment into the network. The two methods adopt similar network architectures and produce unnatural transformations in the sky region in Fig. 6 (green arrows) and ghosting artifacts in Fig. 7. Our method properly handles such problems and obtains results with a better visual experience.

Figure 8: The effect of different w_con (w_con = 1.5, 1, 0.5, 0.25). We set w_con to 1.5 at the initial stage to keep a balance between transformation and content preservation.

Figure 9: The influence of min-patch training. (a) The generated result without min-patch training. (b) The generated result with min-patch training.
4.4. Ablation Studies

We perform ablation studies on the different items to understand how the main modules contribute to the final results; the studies show that each component plays an important role in our architecture.

Effect of w_con. We first conduct experiments to illustrate why we set w_con = 1.5 at the initial stage. Fig. 8 shows the corresponding results when we select different w_con. A smaller w_con cannot generate the desired details and may suffer from ghosting artifacts, because the network tends to learn the translation while failing to preserve the semantic content information. We set w_con to 1.5 at the initial stage to keep a balance between transformation and content preservation; it correctly maintains semantic content with a pleasing transformation (Fig. 8(a)-(c)). If we continue to increase the value of w_con, the results become similar to the middle-exposure LDR image, because they bring more content information from the input photo, so that the details of the yard and floor regions cannot be generated.

Figure 10: The influence of the blur dataset. (a) Result without the blur dataset (PSNR: 42.466). (b) Result with the blur dataset (PSNR: 42.515).
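A sketch of the w_con schedule of Eq. (7); the initial value of 1.5 comes from the paper, while the decay base and the epoch interval are placeholders.

```python
W_CON_INIT = 1.5     # initial value used in the paper
DECAY_BASE = 0.9     # hypothetical decay base (< 1)
DECAY_EVERY = 10     # hypothetical epoch interval

def w_con_at(epoch):
    """w_con starts high to preserve content, then decays exponentially
    as training stabilizes, shifting emphasis toward the translation."""
    return W_CON_INIT * DECAY_BASE ** (epoch // DECAY_EVERY)
```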
Min-patch training. A common discriminator can distinguish generated images from real ones. However, not all regions contribute to the discriminator optimization during training. If a small part of a generated image is strange enough to differ from the real images, it can be considered an undesirable ghosting artifact. We add min-patch training to detect such regions and avoid ghosting artifacts. Fig. 9 shows the effectiveness of min-patch training: with it, UPHDR-GAN produces results with fewer artifacts (Fig. 9(b)) compared to the result without it (Fig. 9(a)).
Blur dataset. Simply applying the GAN loss is not sufficient for generating sharp HDR images: clear edges are an important characteristic of HDR images, but the common GAN loss may produce results with unclear edges. To solve this problem, we add the blur dataset C as fake images to confuse the discriminator and push the generator to produce images with sharp edges. Fig. 10 shows the influence of the blur dataset: the result with the blur dataset (Fig. 10(b)) has sharper edges and a higher PSNR value.

5. Conclusion

We have presented a novel method for fusing multi-exposure images with unpaired datasets. The proposed method relaxes the constraint that deep learning-based methods need paired inputs and ground truth by introducing generative adversarial networks. It learns the translation between the input domain and the target domain and transforms the multi-exposure inputs into an informative HDR output, offering the prospect of more extensive applications of HDR imaging. However, generative adversarial networks sometimes produce unclear results; we applied several techniques to generate images with sharp edges and clear content information, and we will continue to explore more effective algorithms in this direction. We conduct comprehensive comparisons, including quantitative assessments and qualitative comparisons, with several typical methods to demonstrate the effectiveness of UPHDR-GAN.