A Review of Tone-Mapping Operators for Wide Dynamic Range Images
Ziyi Liu, University of Calgary
Abstract—The dynamic range of scenes in everyday life can exceed 120 dB, whereas smartphone cameras and conventional digital cameras can only capture a dynamic range of about 90 dB, which sometimes leads to a loss of detail in the recorded image. Professional hardware and image fusion algorithms have now been devised to capture wide dynamic range (WDR) images, but unfortunately existing display devices cannot present WDR images directly. Tone mapping (TM), which converts a WDR image into a low dynamic range (LDR) image, has therefore become an essential step for exhibiting WDR images on ordinary screens. More and more researchers are focusing on this topic, devoting their efforts to designing an excellent tone-mapping operator (TMO) that shows image details as close as possible to what human eyes would perceive. It is therefore important to know the history, development, and trends of TM before proposing a practicable TMO. In this paper, we present a comprehensive study of the most well-known TMOs, dividing them into traditional and machine learning-based categories.
Index Terms—Tone mapping, wide dynamic range image, image processing, machine learning.
I. INTRODUCTION
The dynamic range of a scene is defined by the ratio between the light intensities of its brightest and darkest spots. In the real world, this ratio is extremely broad, ranging from 0 to about 1,000,000. The human visual system (HVS) can perceive a range of around 24 EV when viewing a scene, whereas digital cameras can only capture a range of approximately 9 EV. If the dynamic range exceeds 9 EV, no combination of aperture and shutter speed will enable us to capture the entire dynamic range of the real scene; all we can do is optimize the exposure for the highlight details or for the shadow details, inevitably missing one or the other. As a consequence, the wide dynamic range (WDR) image was proposed [1], [2]. Unlike a low dynamic range (LDR) image, it records the darkest and brightest areas of a scene at the same time, something a camera could not achieve in a single shot. Although WDR images (16/32-bit) can provide fascinating imagery, they cannot be displayed on most regular devices (8-bit) and must first be converted to LDR. This process is called tone mapping (TM) and is widely adopted in image processing and computer graphics today. TM is a technique that converts one set of colors to another, approximating the WDR image information on instruments with a limited dynamic range. An ideal tone-mapping process is shown in Fig. 1. It solves the problem of sharply reducing contrast from scene radiance to the displayable range while retaining the image details and color appearance that are critical to appreciating the original scene content. Tone mapping originates from art, where artists make full use of a finite palette to depict high-contrast natural scenes. It was later applied to television and photography. It leverages the fact that the HVS is more sensitive to relative than to absolute luminance levels [3]. The rest of the paper is organized as follows: Section II reviews six traditional TMOs, three global and three local; Section III discusses four learning-based TMOs; the conclusion is given in Section IV.
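Because the figures above mix units, it helps to recall that sensor dynamic range in decibels is 20·log10 of the intensity ratio, while photographic EV (stops) is log2 of the same ratio. The short sketch below is a reader's aid, not part of any cited method:

```python
import math

def ratio_to_db(ratio: float) -> float:
    """Dynamic range in decibels (sensor convention: 20*log10)."""
    return 20.0 * math.log10(ratio)

def ratio_to_ev(ratio: float) -> float:
    """Dynamic range in EV (photographic stops: log2)."""
    return math.log2(ratio)

# A 120 dB scene corresponds to a 10**6 : 1 intensity ratio,
# i.e. roughly 20 stops; a 9 EV camera covers only a 512 : 1 ratio.
print(ratio_to_db(1_000_000))   # ~120.0 dB
print(ratio_to_ev(1_000_000))   # ~19.9 EV
print(2 ** 9)                   # 512
```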
II. TRADITIONAL TMOS

Global TMOs apply the same mapping operation to all pixels, so they are very fast compared to local TMOs, but most global TMOs cannot maintain image contrast. Local TMOs, on the other hand, vary the mapping for each pixel, taking the properties of neighboring pixels into account; they are slower and might introduce artifacts in the output images.

Fig. 1: Ideal tone mapping process
A. Global TMOs
Ward et al. [4] designed a quite simple tone-mapping operator that applies a computed scaling factor to the whole input WDR image, preserving contrast rather than absolute luminance. The method is based on the subject studies of Blackwell [5], which established an association between adaptation luminance and the minimum noticeable difference in brightness. From this study, Ward assumed that the effect of adaptation can be seen as a shift in the absolute difference in brightness required for the viewer to notice a change, which means that visible luminance differences in the real world can be mapped to luminance differences on the display medium. After this operation, the contrast relationship between the displayed image and the real scene is fixed. The method is computationally efficient because it merely applies a single scaling factor to the input image and displays the resulting output, but it leads to a loss of visibility as a consequence of clipping the highest and lowest pixel values.

Ferwerda et al. [6] developed a TM model derived from psychophysical experiments they performed to infer the changes in visual function that accompany light adaptation. The model simulates the variation of sensitivity, color appearance, visual acuity, and visibility as a function of the light level. Deploying this model maps the just noticeable differences (JNDs) of the real world into JNDs on the medium, discarding luminance values in the input image that cannot be perceived. This model is important because its experiments covered the full visual field, using an immersive display system such that the viewer's visual state could be determined over the entire display [7], but it does not fully capture the early stages of visual adaptation.

A histogram adjustment TMO was proposed by Ward et al. [8], building on the earlier work [4], [6]. First, they computed the cumulative distribution of the brightness histogram, which is used to identify clusters of light levels. To overcome the dynamic range limitation, the authors applied a mapping over the entire image, performing a histogram adjustment based on the obtained distribution. This adjustment is established on the following principles: luminance is not constant across a full image, yet it is consistent within a small region; human eyes are sensitive to contrast rather than absolute brightness, so as long as a bright area remains brighter than a dim region in the processed image, the absolute values are not important; and eyes adapt rapidly to a small angle in the field of view (about 1°) around the fixation point. This method reproduces overall contrast better than the others while guaranteeing that the contrast of the resulting image does not exceed what the viewer could perceive. It also avoids the clipping problem of single-adaptation methods. In spite of that, it still suffers some loss of information.
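To make the flavor of these global operators concrete, the following minimal sketch implements Ward's [4] contrast-preserving scale factor; the formula follows the published operator, but the log-average adaptation luminance and the 100 cd/m² display maximum are illustrative assumptions on our part:

```python
import numpy as np

def ward_scale_factor(l_wa: float, l_dmax: float = 100.0) -> float:
    """Ward's [4] contrast-based scale factor.

    l_wa   -- world adaptation luminance (cd/m^2); log-average assumed here
    l_dmax -- maximum display luminance (cd/m^2); 100 is an assumption
    """
    return ((1.219 + (l_dmax / 2.0) ** 0.4) /
            (1.219 + l_wa ** 0.4)) ** 2.5

def tone_map_ward(luminance: np.ndarray, l_dmax: float = 100.0) -> np.ndarray:
    """Apply one global scale factor to the whole luminance map."""
    eps = 1e-6
    l_wa = np.exp(np.mean(np.log(luminance + eps)))  # log-average adaptation
    sf = ward_scale_factor(l_wa, l_dmax)
    display = sf * luminance / l_dmax                # normalized display value
    return np.clip(display, 0.0, 1.0)                # clipping costs visibility
```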
B. Local TMOs

Schlick [9] came up with a quantization technique that uses a so-called rational mapping function to transfer the WDR input into displayable values. The quantization function has three versions: the first and second are simply logarithmic and exponential mappings, respectively, which are hard pressed to produce a satisfying outcome. The third is a rational mapping curve with an asymmetric shape that treats high and low pixel values reciprocally, providing smoother results. Schlick also introduced an approach for color treatment: the dynamic range is first compressed in the luminance channel, and colors are reproduced in a post-processing operation. All pixel values are obtained through a color ratio:

$$C_{out} = \frac{C_{in}}{L_{in}}\, L_{out} \qquad (1)$$

where $C$ represents a color channel (red, green, or blue) of a colorful image, $L$ represents the luminance image, and the subscripts in/out denote the input WDR image and the tone-mapped output, respectively. Although this algorithm reduces computational resources and the number of parameters, it is easily affected by halo artifacts.

Durand and Dorsey's [10] operator decomposes the input WDR image into two separate layers, a base layer and a detail layer, using a non-linear edge-preserving filter known as the bilateral filter [11]. The contrast of the base layer is reduced while the details are preserved in the detail layer; the detail layer is further polished to emphasize small-scale details. To speed up the decomposition, they adopted two strategies: a piecewise-linear approximation in the intensity domain and sub-sampling in the spatial domain. Finally, the tone-mapped image is obtained by recombining the two processed layers. This operator solves the halo artifacts that occur in [9], but it occasionally over-enhances the local details of the image, losing realism.
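A compact sketch of the base/detail decomposition idea in [10] is given below; it uses OpenCV's stock bilateral filter in the log-luminance domain instead of the authors' fast piecewise-linear approximation, and the compression target and filter parameters are illustrative assumptions:

```python
import cv2
import numpy as np

def durand_style_tonemap(lum: np.ndarray, target_contrast: float = 50.0) -> np.ndarray:
    """Base/detail tone mapping in the spirit of [10] (simplified sketch).

    lum -- linear luminance map, float32
    """
    eps = 1e-6
    log_l = np.log10(lum + eps).astype(np.float32)

    # Edge-preserving base layer; parameters are illustrative, not from [10].
    base = cv2.bilateralFilter(log_l, d=9, sigmaColor=0.4, sigmaSpace=8.0)
    detail = log_l - base                      # detail layer is the residual

    # Compress only the base layer so details survive the range reduction.
    scale = np.log10(target_contrast) / (base.max() - base.min() + eps)
    log_out = base * scale + detail

    out = 10.0 ** log_out
    return out / out.max()                     # normalize to [0, 1]
```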
Yang et al. [12] invented a TMO that operates pixel by pixel, and they also designed tailored hardware to implement the algorithm. The TMO first splits the WDR input into m×n blocks, then applies maximum and minimum operations to each block, obtaining two matrices ($M_{max}$ and $M_{min}$). After a bilinear interpolation, the matrices are expanded to the same dimensions as the input WDR image. The final mapping is:

$$d(i,j) = \frac{\log\big(p(i,j)\big) - \log\big(M_{min}(i,j)\big)}{\log\big(M_{max}(i,j)\big) - \log\big(M_{min}(i,j)\big)} \times (D_{max} - D_{min}) + D_{min} \qquad (2)$$

where $p(i,j)$ is the original pixel value at position $(i,j)$ of the input WDR image, and $D_{max}$ and $D_{min}$ are the maximum and minimum display levels of the visualization device. Using the logarithmic function together with the interpolation takes neighboring pixels into account, making full use of local information. The hardware implementation consists of six modules that mirror the steps of the TM approach: the parameter module collects user parameters such as image width and height; the pixel status module stores the location $(i,j)$ of each pixel and passes it to the block div module, where the maximum and minimum of each image block are calculated; the results are stored in the RegFile module; the interp module then performs the bilinear interpolation of the maxima and minima; and finally Eq. 2 is evaluated in the compute module. This method produces brighter images and is energy efficient.
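The block-based mapping of Eq. 2 is straightforward to prototype in software; the sketch below, with an assumed block grid and an assumed display range of [0, 255], uses bilinear resizing to expand the block extrema, roughly mirroring the interp module:

```python
import cv2
import numpy as np

def block_tonemap(p: np.ndarray, m: int = 8, n: int = 8,
                  d_min: float = 0.0, d_max: float = 255.0) -> np.ndarray:
    """Local TMO following Eq. 2 (illustrative block size and display range).

    Assumes the image height/width are divisible by m/n for simplicity.
    """
    eps = 1e-6
    h, w = p.shape
    bh, bw = h // m, w // n

    # Per-block extrema: one M_max and M_min value per block.
    blocks = p[:m * bh, :n * bw].reshape(m, bh, n, bw)
    m_max = blocks.max(axis=(1, 3)).astype(np.float32)
    m_min = blocks.min(axis=(1, 3)).astype(np.float32)

    # Bilinear expansion back to full image size (the "interp" step).
    m_max = cv2.resize(m_max, (w, h), interpolation=cv2.INTER_LINEAR)
    m_min = cv2.resize(m_min, (w, h), interpolation=cv2.INTER_LINEAR)

    num = np.log(p + eps) - np.log(m_min + eps)
    den = np.log(m_max + eps) - np.log(m_min + eps) + eps
    return num / den * (d_max - d_min) + d_min
```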
III. MACHINE LEARNING-BASED TMOS

A. General Background
Convolutional Neural Network.
Convolutional Neural Networks (CNNs) are a class of neural networks that apply a mathematical operation called convolution in at least one layer, in place of general matrix multiplication. As shown in Fig. 2, a CNN typically includes an input layer, an output layer, and several specialized hidden layers such as convolutional layers.

Fig. 2: Convolutional neural network architecture

Following the achievements of CNNs in image classification [13], medical image segmentation [14], object detection [15], and natural language processing [16], researchers began to apply this knowledge to image-to-image translation tasks. Weight sharing reduces the complexity of a CNN, especially for images, whose multidimensional input vectors can be fed directly to the network, avoiding the complexity of data reconstruction during feature extraction and classification. Nevertheless, a CNN needs a specific loss function to learn how to minimize the difference between its outputs and the labeled images, and this loss function is not always reliable. For example, the L2 loss causes blur in the produced images because it averages over the many plausible outputs consistent with the input, which does not make the image more realistic; yet as far as the L2 loss is concerned, that blurred average is the best image the CNN could construct.
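The blur argument can be made concrete with a few lines of NumPy: when several sharp outputs are equally plausible, the image minimizing the mean L2 loss against all of them is their pixel-wise average, which is blurry. This toy demonstration is ours, not from any cited paper:

```python
import numpy as np

# Three equally plausible "sharp" targets: an edge at slightly
# different positions (a common ambiguity in image-to-image tasks).
targets = np.zeros((3, 1, 16))
for k, pos in enumerate((7, 8, 9)):
    targets[k, 0, pos:] = 1.0

# The prediction minimizing the average L2 loss over all targets
# is their pixel-wise mean -- a soft, blurred edge.
l2_optimal = targets.mean(axis=0)
print(np.round(l2_optimal[0], 2))
# -> [0. 0. 0. 0. 0. 0. 0. 0.33 0.67 1. 1. 1. 1. 1. 1. 1.]
```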
Generative Adversarial Network.
A Generative Adversarial Network (GAN) is another kind of neural network; it excels at mimicking images. CNNs are good at extracting image features, but the quality of their outputs depends heavily on the choice of loss function, and manually finding or designing an appropriate loss function takes a very long time. Many solutions have been devised for this issue, among which the GAN is the most prevalent. It consists of two parts: a generator and a discriminator. The generator is like a painter who keeps painting images, aiming to produce a masterpiece like the Mona Lisa. Meanwhile, the discriminator is like an art connoisseur able to distinguish whether a painting is real or not; whenever a fake image from the generator is given to the discriminator, the discriminator penalizes the generator, until the generator can create paintings so close to reality that the discriminator can no longer tell them apart. To achieve the same effect without a GAN, we would conventionally have to design a complex mathematical loss function, but a GAN in effect trains its discriminator to fit the training data automatically. Nevertheless, a plain GAN cannot control the pattern of the data it generates. The Conditional GAN (CGAN) changes this by feeding a tag Y to the generator as an additional input so that the corresponding image is generated; the CGAN can thus condition the generated data on the tag. This constrains the output space and reduces the differences between the input and output images. The conditional information gives the GAN a significant head start on what to look for, and the productions of a CGAN are therefore perceived to be better.
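As a minimal sketch of the adversarial game described above (ours, assuming PyTorch; the tiny fully connected networks stand in for real generators and discriminators):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))   # generator
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))    # discriminator
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real: torch.Tensor) -> None:
    noise = torch.randn(real.size(0), 16)
    fake = G(noise)

    # Discriminator: label real samples 1, generated samples 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(real.size(0), 1))
    loss_d.backward()
    opt_d.step()

    # Generator: try to make the discriminator output 1 on fakes.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    loss_g.backward()
    opt_g.step()

train_step(torch.randn(8, 32))  # one step on a dummy "real" batch
```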
U-Net.
U-Net [17] is a prevalent neural network model for biomedical image segmentation [18] and image translation problems [19], owing to how well it processes and learns pixel-level information. It consists of an encoder and a decoder joined by skip connections, as shown in Fig. 3, which extract features and reconstruct the image, respectively. The encoder-decoder model was first proposed for medical image segmentation and then attracted wider attention; it is now universally used in image conversion, language translation, and other fields. The encoder extracts features from the input data (such as images or audio signals), converting the information from low-dimensional to high-dimensional form and capturing the core information of the data. The decoder transforms these features into data in the target domain, mapping the high-dimensional core information to the low-dimensional target space. An encoder-decoder model generally incorporates convolution layers, upsampling layers, deconvolution layers, and pooling layers. The high-dimensional features extracted by the encoder are less sensitive to the precise content of the image but retain a large amount of its semantic information, which benefits tasks such as image segmentation and object detection. The skip connection in this model, as the name implies, skips certain layers of the network and provides the output of one layer as input to later layers. Some information captured in the initial layers is needed again when the network upsamples; without skip connections this information would be lost (or, more precisely, would become too abstract for further use). The skip-connection structure therefore explicitly passes information from early layers to the corresponding later ones.

Fig. 3: U-Net architecture
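A minimal sketch of an encoder-decoder with one skip connection (ours, assuming PyTorch; a real U-Net has more scales and wider channels):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-scale U-Net: one downsampling step, one skip connection."""

    def __init__(self) -> None:
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Conv2d(16, 32, 3, stride=2, padding=1)     # encoder
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)         # decoder
        # The skip concatenation doubles the channels entering the last conv.
        self.out = nn.Conv2d(16 + 16, 1, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = self.enc(x)                      # features at full resolution
        bottleneck = torch.relu(self.down(skip))
        up = torch.relu(self.up(bottleneck))
        # Skip connection: reuse early, detail-rich features directly.
        return self.out(torch.cat([up, skip], dim=1))

y = TinyUNet()(torch.randn(1, 1, 64, 64))       # -> shape (1, 1, 64, 64)
```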
Supervised Learning vs. Unsupervised Learning.
In supervised learning, the neural network is trained on a labeled dataset, learning how to map input data to the ground truth (GT) provided by the labels, whereas unsupervised learning provides no labeled dataset: the model must discover the features, patterns, and relationships among the unlabeled data on its own.
B. TMOs
Patel et al. [20] came up with a novel GAN that learns to map real-world luminance to the luminance of the output device. The authors built their dataset from 20 data sources, some of which lack labels, yet supervised training requires true labels. To solve this problem, they ran 12 TMOs on each image to obtain 12 tone-mapped WDR images. Since only one image is needed as the label, they kept the tone-mapped image with the highest tone mapped image quality index (TMQI) score and discarded the other 11; TMQI is a metric that combines a multi-scale structural fidelity measure with an image naturalness measure into a single quality score for the entire image. In this method, the networks learn from the existing tone-mapping operators used to generate the GT, creating an optimized combination of them. The authors explored three different networks: a CNN model, a GAN, and a GAN with skip connections. The CNN is the most basic and naive neural network (NN), shown in Fig. 4, consisting of downsampling and upsampling stages trained with a mean squared error loss. The generator of the second network has the same architecture as the first CNN; it uses an L1 loss between the real and generated images, with a binary cross-entropy loss between generator and discriminator. This GAN produced results superior to those of the CNN because the discriminator learns the distribution of the GT, guiding the generator to produce more vivid images. The discriminator is a conventional CNN, shown in Fig. 5, limited to six convolutional downsampling layers in order to leave the discriminator freedom in differentiating true labels from generated images. The third network introduces skip connections between corresponding upsampling and downsampling layers to fix the loss of output clarity observed with the plain GAN; the skip connections maintain signal strength and avoid pixel-level information loss in the downsampling layers, yielding higher-quality output than the second network. Because this operator learns from many prevalent TMOs to make visually pleasing images, its performance cannot significantly exceed those TMOs.

Fig. 4: Network details of the generator

Fig. 5: Network details of the discriminator
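The label-selection step in [20] can be summarized in a few lines; in this sketch both `tone_map_operators` (the 12 candidate TMOs) and `tmqi` (a TMQI implementation) are hypothetical stand-ins for whatever implementations one has at hand:

```python
import numpy as np

def best_label(wdr: np.ndarray, tone_map_operators, tmqi) -> np.ndarray:
    """Pick the tone-mapped candidate with the highest TMQI score.

    tone_map_operators -- list of callables, each WDR image -> LDR image
    tmqi               -- callable (wdr, ldr) -> quality score (hypothetical)
    """
    candidates = [tmo(wdr) for tmo in tone_map_operators]   # 12 in [20]
    scores = [tmqi(wdr, ldr) for ldr in candidates]
    return candidates[int(np.argmax(scores))]               # keep 1, drop 11
```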
Montulet et al. [21] focused on using GANs for low-light image enhancement, applying a deep convolutional generative adversarial network (DCGAN) to the luminance channel rather than directly to the color components in order to avoid color bias. A U-Net is used in the generator, as shown in Fig. 6, a PatchGAN is adopted as the discriminator, and a VGG-based perceptual loss is used for the generator. Ledig et al. [22] introduced the VGG loss for the task of super-resolution and achieved success in practice. The VGG loss, also called perceptual loss, is an alternative to a pixel-wise loss such as MSE: the convolution layers of a pre-trained model extract features and patterns from both the generated images and the GT labels, and the network is trained to reduce the gap between the two feature representations. Unlike a traditional GAN, which applies a deep CNN discriminator to classify the full image, this model opts for a PatchGAN, shown in Fig. 7, which is designed to classify whether each part of the input image is real or fake. The number of layers in the PatchGAN is configured so that the effective receptive field of each output unit corresponds to a patch of a specific size in the input image. The output of the network is a single feature map of real/fake prediction values, which can be averaged to give an overall score. Since Montulet et al.'s [21] method relies on traditional TM methods to generate the target LDR images, its behavior does not differ noticeably from that of those traditional methods.

Fig. 6: DCGAN U-Net generator network architecture

Fig. 7: DCGAN discriminator network architecture
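A minimal PatchGAN-style discriminator (ours, assuming PyTorch) shows the two properties just described: a map of patch-wise predictions rather than a single logit, and an average that yields one score:

```python
import torch
import torch.nn as nn

# Each output unit sees only a local patch of the input (its receptive
# field), so the map below scores patches, not the whole image.
patch_disc = nn.Sequential(
    nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=1, padding=1),   # 1-channel real/fake map
)

score_map = patch_disc(torch.randn(1, 1, 64, 64))
print(score_map.shape)          # torch.Size([1, 1, 15, 15])
print(score_map.mean().item())  # averaged into a single score
```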
Rana et al. [23] note a similar advantage of GANs over CNNs, since convolutional neural networks can produce widely varying output quality depending on the choice of loss function. Because tone-mapping operators also depend on well-adjusted parameters, the authors created DeepTMO, a parameter-free operator based on a CGAN. An advantage of using a CGAN for tone mapping compared to a GAN is that the CGAN constrains each output pixel to be conditionally dependent on at least one neighboring pixel in the input. The network is thus penalized for structural differences between input and output, a desirable property in tone mapping, where the output should remain visually similar to the input. The drawback of a CGAN is that generating a high-resolution result is more challenging due to instability and optimization problems. The method explores four CGAN settings covering the possible combinations of single-scale and multi-scale generators and discriminators. Fig. 8 displays the single-scale generator, which uses an encoder-decoder architecture, while the single-scale discriminator is essentially a PatchGAN, shown in Fig. 9, chosen for its simplicity and smaller parameter count compared with a full-size discriminator; a PatchGAN discriminator can also be used for spatially localized improvement. Fig. 10 presents a residual block consisting of two sequential convolution layers.

Fig. 8: Generator (single scale)

Fig. 9: Discriminator (single scale)

The single-generator, single-discriminator setting gives acceptable global-level reconstruction, but the output contains high levels of noise in some regions. The multi-generator with multi-discriminator setting produces output of strong quality with no artifacts. The multi-discriminator is fundamentally the same as the single discriminator but is applied at two input scales. The multi-generator, shown in Fig. 11, consists of a global network operating at the original resolution and a down-sampled network; the down-sampled model is similar to the single generator, and its generated images are fed into the global network to combine coarse and fine-grained details. The single-generator/multi-discriminator and multi-generator/single-discriminator combinations give results between the fully single-scale and fully multi-scale structures. The dynamic range is initially compressed in the luminance channel, and colors are reproduced in post-processing. This TMO caters to a vast range of scene content (e.g., outdoor, indoor, landscapes, structures, humans) and solves problems (e.g., saturation, pattern blurring, tiling artifacts) exhibited by earlier GAN-based TMOs [21], but when input images differ in size from the training images, it can produce artifacts.

Fig. 10: Residual block

Fig. 11: DeepTMO multi-scale generator architecture
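The residual block of Fig. 10 admits a very short sketch (ours, assuming PyTorch; the channel count and kernel size are illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two sequential conv layers plus an identity shortcut (cf. Fig. 10)."""

    def __init__(self, channels: int = 64) -> None:
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)   # the shortcut eases gradient flow

y = ResidualBlock()(torch.randn(1, 64, 32, 32))   # shape preserved
```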
Su et al. [24] developed a multimodal tone-mapping method composed of two networks, EdgePreservingNet and ToneCompressingNet, shown in Fig. 12. EdgePreservingNet is trained to filter the input WDR image while preserving its high-frequency information; the WDR image is separated into a detail layer and a base layer. The detail layer is enhanced with a tan function to produce better-quality LDR images, while the base layer is compressed by ToneCompressingNet. Finally, the enhanced detail layer and the compressed base layer are fused into a grayscale LDR output, and color correction then yields the final colorful tone-mapped image. The authors chose U-Net as the architecture of EdgePreservingNet, while ToneCompressingNet consists of a series of convolution layers followed by a fully connected layer.
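A rough numerical sketch of the fusion step, under the strong assumptions that the two network outputs can be stood in for by a compressed base layer and its residual, and that the paper's tan function behaves like the monotone expansion below:

```python
import numpy as np

def fuse_layers(base_compressed: np.ndarray, detail: np.ndarray,
                strength: float = 0.8) -> np.ndarray:
    """Enhance the detail layer and fuse it with the compressed base.

    The tan-based expansion is an assumption about [24], not its exact form:
    tan is monotone and steepens away from zero, boosting larger details.
    """
    detail_enhanced = np.tan(np.clip(detail, -1.0, 1.0) * strength)
    out = base_compressed + detail_enhanced
    return np.clip(out, 0.0, 1.0)
```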
Fig. 12: Overall block diagram of Su et al.'s [24] method

Kim et al. [25] suggested a TMO for X-ray images that can be divided into two parts (see Fig. 13). A detail-recovery network (DR-Net) is responsible for restoring details in the WDR image, while a tone-mapping network (TM-Net) compresses the dynamic range. This is the first learning-based TMO used for X-ray inspection. For training DR-Net, the authors presented a data synthesis technique, based on the Beer-Lambert law [26], to generate the ground truth (GT). DR-Net uses a guided filter [27] to decompose the input WDR image into two layers, a detail layer and a base layer; the detail layer is used to restore missing details and is then integrated with the base layer. After detail restoration by DR-Net, the image is passed to TM-Net. However, no standard X-ray database exists for TM, making it impossible to train TM-Net in a supervised manner. To solve this problem, Kim et al. present a novel loss function, named structural similarity loss, which not only enhances details but also alleviates halo artifacts. The authors chose U-Net as the basis of both DR-Net and TM-Net.

Fig. 13: Kim et al.'s [25] tone-mapping method
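The guided-filter decomposition used by DR-Net can be sketched as follows (ours; assumes the opencv-contrib `cv2.ximgproc` module, and the radius/eps values are illustrative):

```python
import cv2
import numpy as np

def guided_decompose(wdr: np.ndarray, radius: int = 8,
                     eps: float = 1e-3) -> tuple[np.ndarray, np.ndarray]:
    """Split a WDR image into base and detail layers with a guided filter [27].

    Requires opencv-contrib-python for cv2.ximgproc.
    """
    img = wdr.astype(np.float32)
    # Self-guided filtering yields edge-preserving smoothing (base layer).
    base = cv2.ximgproc.guidedFilter(guide=img, src=img,
                                     radius=radius, eps=eps)
    detail = img - base          # the residual carries the fine details
    return base, detail
```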
IV. CONCLUSION
Evidently, the usage of GANs for tone mapping produces desirable results, and GANs overcome the limitations of CNNs. As the popularity of GANs rises and the need for well-mapped images grows, their intersection may yield an optimized display for every type of image. Although the TMQI metric is a good measurement for comparing outputs to existing operators, it can also restrict the network, as it is only an approximation of the HVS. To address this problem, a stronger formulation of what makes an image look "best" to most people, beyond noise and artifacts, would be a strong start. Nowadays, traditional TM algorithms are being surpassed by machine learning models. Moreover, considering the new displays that offer more than 8 bits of depth (e.g., recent Apple devices), we believe that in the near future tone mapping may no longer be required thanks to advances in display devices.
REFERENCES

[1] O. Yadid-Pecht, "Wide-dynamic-range sensors," Optical Engineering, vol. 38, no. 10, pp. 1650–1660, 1999.
[2] A. Spivak, A. Belenky, A. Fish, and O. Yadid-Pecht, "Wide-dynamic-range CMOS image sensors—comparative performance analysis," IEEE Transactions on Electron Devices, vol. 56, no. 11, pp. 2446–2461, 2009.
[3] M. C. Barris, "Vision and art: The biology of seeing," 2005.
[4] G. Ward, "A contrast-based scalefactor for luminance display," Graphics Gems, vol. 4, pp. 415–421, 1994.
[5] H. Blackwell, O. Blackwell, and H. Bodmann, "An analytical model for describing the influence of lighting parameters upon visual performance," Technical Foundations, vol. 1, no. 5, 1981.
[6] J. A. Ferwerda, S. N. Pattanaik, P. Shirley, and D. P. Greenberg, "A model of visual adaptation for realistic image synthesis," in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996, pp. 249–258.
[7] A. McNamara, "Visual perception in realistic image synthesis," in Computer Graphics Forum, vol. 20, no. 4. Wiley Online Library, 2001, pp. 211–224.
[8] G. W. Larson, H. Rushmeier, and C. Piatko, "A visibility matching tone reproduction operator for high dynamic range scenes," IEEE Transactions on Visualization and Computer Graphics, vol. 3, no. 4, pp. 291–306, 1997.
[9] C. Schlick, "Quantization techniques for visualization of high dynamic range pictures," in Photorealistic Rendering Techniques. Springer, 1995, pp. 7–20.
[10] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," in Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, 2002, pp. 257–266.
[11] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Sixth International Conference on Computer Vision (IEEE Cat. No. 98CH36271). IEEE, 1998, pp. 839–846.
[12] J. Yang, A. Hore, and O. Yadid-Pecht, "Local tone mapping algorithm and hardware implementation," Electronics Letters, vol. 54, no. 9, pp. 560–562, 2018.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[14] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in 2016 Fourth International Conference on 3D Vision (3DV). IEEE, 2016, pp. 565–571.
[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[16] B. Hu, Z. Lu, H. Li, and Q. Chen, "Convolutional neural network architectures for matching natural language sentences," Advances in Neural Information Processing Systems, vol. 27, pp. 2042–2050, 2014.
[17] R. Bermúdez-Chacón, P. Márquez-Neila, M. Salzmann, and P. Fua, "A domain-adaptive two-stream U-Net for electron microscopy image segmentation," in 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI). IEEE, 2018, pp. 400–404.
[18] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[19] S. Wu, J. Xu, Y.-W. Tai, and C.-K. Tang, "Deep high dynamic range imaging with large foreground motions," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 117–132.
[20] V. A. Patel, P. Shah, and S. Raman, "A generative adversarial network for tone mapping WDR images," in National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics. Springer, 2017, pp. 220–231.
[21] R. Montulet, A. Briassouli, and N. Maastricht, "Deep learning for robust end-to-end tone mapping," in BMVC, 2019, p. 194.
[22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., "Photo-realistic single image super-resolution using a generative adversarial network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[23] A. Rana, P. Singh, G. Valenzise, F. Dufaux, N. Komodakis, and A. Smolic, "Deep tone mapping operator for high dynamic range images," IEEE Transactions on Image Processing, vol. 29, pp. 1285–1298, 2019.
[24] C.-C. Su, R. Wang, H.-J. Lin, Y.-L. Liu, C.-P. Chen, Y.-L. Chang, and S.-C. Pei, "Explorable tone mapping operators," arXiv preprint arXiv:2010.10000, 2020.
[25] H.-Y. Kim, S. Park, Y.-G. Shin, S.-W. Jung, and S.-J. Ko, "Detail restoration and tone mapping networks for x-ray security inspection," IEEE Access, vol. 8, pp. 197473–197483, 2020.
[26] H. H. Barrett and W. Swindell, Radiological Imaging: The Theory of Image Formation, Detection, and Processing. Academic Press, 1996.
[27] K. He, J. Sun, and X. Tang, "Guided image filtering,"