A Deep Decomposition Network for Image Processing: A Case Study for Visible and Infrared Image Fusion
Yu Fu, Xiao-Jun Wu, Josef Kittler, Life Fellow, IEEE
Abstract—Image decomposition is a crucial subject in the field of image processing. It can extract salient features from the source image. We propose a new image decomposition method based on a convolutional neural network. This method can be applied to many image processing tasks. In this paper, we apply the image decomposition network to the image fusion task. We input an infrared image and a visible light image and decompose each of them into three high-frequency feature images and one low-frequency feature image. The two sets of feature images are fused using a specific fusion strategy to obtain fused feature images. Finally, the feature images are reconstructed to obtain the fused image. Compared with the state-of-the-art fusion methods, this method achieves better performance in both subjective and objective evaluation.
Index Terms—image fusion, image decomposition, deep learning, infrared image, visible image.
I. INTRODUCTION

IMAGE fusion is an important task in image processing. It aims to extract important features from images of multi-modality signal sources and to use certain fusion strategies to generate a fused image containing the complementary information of multiple pictures. Our work addresses one of the common image fusion tasks, namely the fusion of visible light images and infrared images [1]. The fused images not only contain the radiation information of occluded objects, but also retain sufficient texture detail. At present, such methods are widely used in production and daily life, in fields such as security monitoring, autonomous driving, target tracking and target recognition.

There are many excellent fusion methods, which can be divided into two categories: traditional methods and deep learning based methods [2]. Most of the traditional methods rely on signal processing techniques to obtain the high-frequency bands and low-frequency bands of the image and then merge them. With the development of deep learning, methods based on deep neural networks have also shown great potential in image fusion, because neural networks can extract features of the source images and perform feature fusion.

Traditional methods can be broadly divided into two categories: one is based on multi-scale decomposition, and the other is based on representation learning. In the multi-scale domain, the image is decomposed into multi-scale representation feature maps, and the multi-scale feature representations are then fused through a specific fusion strategy. Finally, the corresponding inverse transform is used to obtain the fused image. There are many representative multi-scale decomposition methods, such as the pyramid [3], curvelet [4], contourlet [5] and discrete wavelet transform [6]. In the representation learning domain, most methods are based on sparse representation, such as sparse representation (SR) with gradient histograms (HOG) [7], joint sparse representation (JSR) [8], and approximate sparse representation with a multi-selection strategy [9]. In the low-rank domain, Li and Wu et al. proposed a low-rank representation (LRR) based fusion method [10]. The most recent approaches, such as MDLatLRR [11], are based on image decomposition with Latent LRR, which can extract source image features in the low-rank domain.

Although the methods based on multi-scale decomposition and representation learning have achieved good performance, they still have some problems. These methods are very complicated, and dictionary learning is a time-consuming operation, especially for online training. If the source image is complex, these methods cannot extract its features well.

To solve this problem, many deep learning based methods have been proposed in recent years [2], owing to the powerful feature extraction capabilities of neural networks. In 2017, Liu et al. proposed a method based on a convolutional neural network for multi-focus image fusion [12]. In ICCV 2017, Prabhakar et al. proposed DeepFuse [13] to solve the problem of multi-exposure image fusion. In 2018, Li and Wu et al. proposed a new infrared and visible light image fusion method based on dense blocks and an autoencoder structure [14]. In the following two years, with the rapid development of deep learning, a large number of excellent methods emerged, including IFCNN [15] proposed by Zhang et al., the GAN-based fusion network (FusionGan) [16] proposed by Ma et al., and the multi-scale fusion network framework (NestFuse) [17] proposed by Li et al. in 2020.
Most of the methods based on neural networks use the powerful feature extraction ability of neural networks, then perform fusion at the feature level, and obtain the final fused image with some specific fusion strategy.

However, methods based on deep networks also have some shortcomings: 1. As a feature extraction tool, a neural network cannot explain the meaning of the extracted features. 2. The networks are complex and time-consuming. 3. The amount and scale of available infrared and visible light data is small, and many methods use other datasets for training, which is not necessarily suitable for extracting features from infrared and visible light images.

To solve these problems, we propose a novel network that can be used to decompose images. Drawing on both traditional methods and deep learning based methods, our proposed network decomposes infrared and visible light images into high-frequency feature images and low-frequency feature images, achieving a better decomposition effect than traditional methods. At the same time, we design fusion rules to fuse the high- and low-frequency feature images and obtain the fused feature images. Finally, these fused feature images are reconstructed into a fused image. The proposed method not only utilizes the powerful feature extraction capabilities of neural networks, but also realizes the decomposition of the image. Compared with the state-of-the-art methods, our fusion framework achieves better performance in both subjective and objective evaluation.

This paper is structured as follows. In Section II, we introduce some related work. In Section III, we introduce our proposed fusion method in detail. In Section IV, we illustrate the experimental settings, and we analyze and compare our experimental results. Finally, in Section V, we draw the conclusions of this paper.

II. RELATED WORKS

Both traditional image signal processing methods and deep learning based methods are very reasonable and excellent. In this section we introduce some related works that inspired us.
A. Wavelet Decomposition and Laplacian Filter
Wavelet transforms have been successfully applied to many image processing tasks. The most common wavelet transform technique for image fusion is the Discrete Wavelet Transform (DWT) [18] [19].

The DWT is a signal processing tool that can decompose a signal into high-frequency information and low-frequency information. Generally speaking, the low-frequency information contains the main characteristics of the signal, and the high-frequency information contains its detailed information. In the field of image processing, the 2-D DWT is usually used to decompose images. The wavelet decomposition of an image is given as follows:

M_{LL}(x, y) = \phi(x)\phi(y)
M_{LH}(x, y) = \phi(x)\psi(y)
M_{HL}(x, y) = \psi(x)\phi(y)
M_{HH}(x, y) = \psi(x)\psi(y)    (1)

where \phi(\cdot) is a low-pass filter and \psi(\cdot) is a high-pass filter. The input signal M(x, y) is an image with signals in two directions; high-pass and low-pass filtering are performed along the x direction and the y direction, respectively. As shown in Fig. 1, we obtain one low-frequency image, which is an approximate representation, and three high-frequency images, which are the vertical, diagonal and horizontal details, respectively.

Fig. 1. Wavelet decomposition. We perform wavelet decomposition on the image to get a low-frequency image (b) and three high-frequency images (c)(d)(e) in three directions: (a) source image; (b) approximate representation; (c) vertical detail; (d) diagonal detail; (e) horizontal detail.

The Laplacian operator is a simple differential operator with rotation invariance. The Laplacian of a two-dimensional image function is the isotropic second derivative, defined as:

\nabla^2 f(x, y) = \frac{\partial^2 f(x, y)}{\partial x^2} + \frac{\partial^2 f(x, y)}{\partial y^2}    (2)

To make it more suitable for digital image processing, the equation is approximated in a discrete form:

\nabla^2 f(x, y) \approx f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4 f(x, y)    (3)

The Laplacian operator can also be expressed in the form of a convolution template and used as a filtering kernel:

G_1 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad
G_2 = \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}    (4)

G_1 and G_2 are the template and the extended template of the discrete Laplacian operator, and the second-difference characteristic of these templates can be used to determine the position of edges. They are often used in image edge detection and image sharpening, as shown in Fig. 2.

Fig. 2. Laplacian filter. We use the extended Laplacian template to filter the image and obtain its high-frequency image (b), and magnify a local area of the high-frequency image (c): (a) source image; (b) Laplacian filter; (c) red box.
We can easily observe that traditional edge filtering is usually just high-frequency filtering: while highlighting the edges, it also highlights the noise.
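As a concrete illustration of the two decompositions above, the following is a minimal sketch assuming NumPy, PyWavelets and SciPy; the choice of the Haar wavelet and of the boundary mode are illustrative assumptions, not choices made in this paper.

```python
import numpy as np
import pywt
from scipy.ndimage import convolve

def dwt_decompose(img):
    """One-level 2-D DWT: approximation (low-frequency) band plus the
    horizontal, vertical and diagonal detail (high-frequency) bands."""
    cA, (cH, cV, cD) = pywt.dwt2(img.astype(np.float64), 'haar')
    return cA, cH, cV, cD

# Discrete Laplacian templates of Eq. (4): the 4-neighbour kernel G1 and
# the extended 8-neighbour kernel G2.
G1 = np.array([[0,  1, 0],
               [1, -4, 1],
               [0,  1, 0]], dtype=np.float64)
G2 = np.array([[1,  1, 1],
               [1, -8, 1],
               [1,  1, 1]], dtype=np.float64)

def laplacian_filter(img, kernel=G2):
    """High-frequency (edge) image obtained by convolving with a Laplacian kernel."""
    return convolve(img.astype(np.float64), kernel, mode='nearest')
```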
B. Decomposition-based Fusion Methods
Li and Wu et al. proposed a method [11] that decomposes images using low-rank representation [20]. First, LatLRR [20] can be described as the following optimization problem:

\min_{Z, L, E} \|Z\|_* + \|L\|_* + \mu \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E    (5)

where \mu is a hyper-parameter, \|\cdot\|_* is the nuclear norm, and \|\cdot\|_1 is the l1 norm. X is the observed data matrix, Z is the low-rank coefficient matrix, L is a projection matrix, and E is a sparse noise matrix.

The authors use this method to decompose an image into a detail image I_d and a base image I_b. We can see from Fig. 3 that I_d is a high-frequency image and I_b is a low-frequency image.

Fig. 3. The framework of MDLatLRR.
As shown in Fig. 3, the low-frequency image I_b is repeatedly decomposed to obtain several high-frequency images I_d1, I_d2 and I_d3. Finally, this method decomposes the infrared image and the visible light image into high-frequency and low-frequency images, which are then fused with a certain strategy to get the fused image I_f.
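To make the scheme in Fig. 3 concrete, the loop below is a hedged sketch of the multi-level decomposition, not the authors' code: each level extracts a detail part from the current base image and subtracts it to form the next base. The function latlrr_detail is a hypothetical stand-in for the LatLRR-based detail extraction of [11].

```python
def multi_level_decompose(image, levels, latlrr_detail):
    """Repeatedly split a base image into detail and base parts (MDLatLRR-style)."""
    details, base = [], image
    for _ in range(levels):
        detail = latlrr_detail(base)  # high-frequency part I_d at this level
        base = base - detail          # remaining low-frequency part I_b
        details.append(detail)
    return details, base              # [I_d1, ..., I_dk] and the final base I_b
```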
C. Deep Learning-based Fusion Methods

In 2017, Liu et al. proposed a neural network-based method [12]. The authors divide the picture into many small patches, and a CNN is used to predict whether each small patch is blurry or clear. The network builds a decision activation map to indicate which pixels of the original image are clear and in focus. A well-trained network can accomplish multi-focus fusion tasks very well. However, due to the limitations of the network design, this method is only suitable for multi-focus image fusion.

To enable a network to fuse visible light images and infrared images, Li and Wu et al. proposed a deep neural network (DenseFuse) [14] based on an autoencoder. First, they train a sufficiently powerful encoder and decoder which can fully extract the features of the original image and reconstruct the image with as little information loss as possible. Then the infrared image and the visible light image are fed into the encoder to obtain the encoded features, and the two sets of features are fused with a specific strategy to obtain the fused features. Finally, the fused features are fed into the decoder to obtain the fused image. These methods use the encoder to decompose the image into several latent features, which are then fused and reconstructed to obtain a fused image.

In the past few years, Generative Adversarial Networks (GANs) have also been applied to many fields, including image fusion. In [16], FusionGan first uses GANs to generate a fused image: the generator takes infrared and visible light images as input and outputs a fused image. To improve the quality of the generated image, the authors designed an appropriate loss function. Finally, the generator can be used to fuse any infrared image and visible light image.

In view of the advantages of these two kinds of methods, we propose a multi-layer image decomposition method based on a neural network, and we propose an image fusion framework for infrared and visible light images based on this method.

III. PROPOSED FUSION METHOD

In this section, the proposed multi-scale decomposition-based fusion network is introduced in detail. First, the network structure and the training phase are described in Section III-A. Then, the design of the loss function is given in Section III-B. Next, the fusion framework is presented in Section III-C. Finally, we present the different fusion strategies in Section III-D.
A. Network Structure
In the training phase, we discard the fusion strategy and train the decomposition network. Our training goal is to make the decomposition network decompose the source image into several high-frequency images and one low-frequency image, which are used for the subsequent operations. The structure of the network is shown in Fig. 4, and the detailed network settings are given in Table I.
TABLE I
THE PARAMETERS OF THE NETWORK

Block      Layer         Channel (in)  Channel (out)  Kernel  Size (in)  Size (out)  Activation
Cin        Conv(Cin-1)   1             16             3       256        256         LeakyReLU
           Conv(Cin-2)   16            32             3       256        256         LeakyReLU
           Conv(Cin-3)   32            64             3       256        256         LeakyReLU
C1         Conv(C1)      64            64             3       256        256         LeakyReLU
C2         Conv(C2)      64            64             3       256        256         LeakyReLU
C3         Conv(C3)      64            64             3       256        256         LeakyReLU
R1         Conv(R1)      64            64             1       256        256         -
R2         Conv(R2)      64            64             1       256        256         -
R3         Conv(R3)      64            64             1       256        256         -
Detail     Conv(D0)      64            32             3       256        256         LeakyReLU
           Conv(D1)      32            16             3       256        256         LeakyReLU
           Conv(D2)      16            1              3       256        256         Tanh
C-res      Conv(C-res1)  64            64             3       256        256         ReLU
           Conv(C-res2)  64            64             3       256        256         ReLU
           Conv(C-res3)  64            64             3       256        256         ReLU
Semantic   Conv(S0)      64            32             3       256        128         ReLU
           Conv(S1)      32            16             3       128        64          ReLU
           Conv(S2)      16            1              3       64         64          Tanh
Upsample   Upsample      1             1              -       64         256         -
In Fig. 4 and Table I, I_ori is the original input image and I_re is the reconstructed image. The backbone of the network consists of four feature extraction convolutional blocks (Cin, C1, C2, C3).
Fig. 4. The framework of the training process.
Fig. 5. We use three convolutions to reduce the channels to one and get a high-frequency image.
Fig. 6. We use three convolutions and two downsampling steps to reduce the channels to one and get a low-frequency image.
The low-frequency feature extraction part follows, that is, the 'Semantic' block in the figure. The 'Semantic' block, shown in Fig. 6, includes two down-sampling convolutional layers (S0, S1) with a stride of 2 and a common convolutional layer (S2), which generate a low-resolution semantic image I_s. Then I_s is up-sampled to the same size as I_ori to obtain the low-frequency image I_ups.

We copy the features of different depths (C1, C2, C3) and reshuffle their channels with convolutional layers (R1, R2, R3). After that, we feed them into the weight-shared 'Detail' branch to obtain three high-frequency images I_g1, I_g2 and I_g3. The Detail branch, shown in Fig. 5 and Table I, includes three convolutions (D0, D1, D2) that reduce the number of channels to 1 to obtain a high-frequency image.

The reason for adding the reshuffle layers (R1, R2, R3) is that the Detail block is weight-shared, so the feature maps from which it extracts high-frequency information should follow the same channel distribution. We therefore add a 1x1 convolutional layer that does not share weights, and reshuffle and sort the channels of the features so that they can adapt to the weight-shared Detail block.

Finally, the three high-frequency images (I_g1, I_g2, I_g3) and the one low-frequency image (I_ups) are added pixel by pixel to obtain the final reconstructed image I_re.

Since the final reconstructed image is obtained by adding the high-frequency images to the low-frequency image, the high-frequency and low-frequency images should be complementary in the data distribution space: when the network learns to generate images, the high-frequency images should be the residual of the low-frequency image. We therefore design the residual branch (the 'C-res' block). We skip-connect the result of Cin to the front of the Semantic block, add it to the result of C3, and feed the sum into the following layers. In this way, what C1, C2 and C3 obtain is the residual between the source image and the semantic image, which is both enforced and natural. To make the skip-connected data match the deep features of C3 more closely, we perform three convolutions in the 'C-res' block to increase the semantics of the skip-connected features.

As shown in the activation column of Table I, considering the properties of low-frequency and high-frequency images, we choose LeakyReLU [21] as the activation function of the convolutional layers of the backbone network and the high-frequency part
(Cin, C1, C2, C3, Detail), and the ReLU function is used as the activation function of the convolutional layers of the residual branch ('C-res') and the Semantic block (S0, S1, S2). The output of ReLU has a certain degree of sparsity, which allows our low-frequency features to filter out more useless information and retain more blurred but semantic information. Finally, in order to constrain the pixel values of the obtained images to a controllable range, we use the Tanh activation function after the last layer of the Detail and Semantic blocks.

In general, we first apply a convolutional block (Cin) to obtain a set of feature maps containing various features. Then, after three identical convolution operations (C1, C2, C3), three sets of shallow features are obtained, and after two downsampling steps (Semantic), a deep feature is obtained. We believe that the shallow features contain more low-level information, such as texture and detail, so we reshuffle their channels and feed these three sets of shallow features into the high-frequency branch (Detail) to obtain three high-frequency images. Furthermore, we believe that the deep feature carries more semantic and global information, so we convolve and upsample it to get our low-frequency image. At the same time, we use the residual branch (C-res) to explicitly establish the residual relationship between the high-frequency features and the low-frequency feature. Lastly, we add these feature images pixel by pixel to get a reconstructed image.
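To summarize the data flow described above, the following is a minimal PyTorch sketch of the network in Table I. The module names mirror the blocks in the table; the padding, the LeakyReLU slope and the bilinear upsampling mode are assumptions made here, since they are not specified in the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv(in_c, out_c, k=3, s=1, act='lrelu'):
    """k x k convolution followed by the activation listed in Table I."""
    layers = [nn.Conv2d(in_c, out_c, k, stride=s, padding=k // 2)]
    if act == 'lrelu':
        layers.append(nn.LeakyReLU(0.2, inplace=True))
    elif act == 'relu':
        layers.append(nn.ReLU(inplace=True))
    elif act == 'tanh':
        layers.append(nn.Tanh())
    return nn.Sequential(*layers)

class DecompositionNet(nn.Module):
    def __init__(self):
        super().__init__()
        # backbone: Cin (1->16->32->64) followed by three 64->64 blocks C1, C2, C3
        self.cin = nn.Sequential(conv(1, 16), conv(16, 32), conv(32, 64))
        self.c1, self.c2, self.c3 = conv(64, 64), conv(64, 64), conv(64, 64)
        # 1x1 reshuffle layers R1-R3 (no activation, not weight-shared)
        self.r1, self.r2, self.r3 = (nn.Conv2d(64, 64, 1) for _ in range(3))
        # weight-shared Detail branch: 64 -> 32 -> 16 -> 1
        self.detail = nn.Sequential(conv(64, 32), conv(32, 16), conv(16, 1, act='tanh'))
        # residual branch C-res applied to the Cin features
        self.c_res = nn.Sequential(*[conv(64, 64, act='relu') for _ in range(3)])
        # Semantic branch: two stride-2 convolutions, then a 1-channel output
        self.semantic = nn.Sequential(conv(64, 32, s=2, act='relu'),
                                      conv(32, 16, s=2, act='relu'),
                                      conv(16, 1, act='tanh'))

    def forward(self, x):
        f0 = self.cin(x)
        f1 = self.c1(f0)
        f2 = self.c2(f1)
        f3 = self.c3(f2)
        # three high-frequency images from the shared Detail branch
        g1 = self.detail(self.r1(f1))
        g2 = self.detail(self.r2(f2))
        g3 = self.detail(self.r3(f3))
        # low-frequency semantic image: C-res(Cin) is added to C3 before Semantic
        s = self.semantic(self.c_res(f0) + f3)
        ups = F.interpolate(s, size=x.shape[-2:], mode='bilinear', align_corners=False)
        recon = g1 + g2 + g3 + ups  # I_re = I_g1 + I_g2 + I_g3 + I_ups
        return (g1, g2, g3), s, ups, recon
```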
Fig. 7. The components of the loss function.
B. Loss Function
In the training phase, the loss function (Loss_total) of our network consists of three parts: the gradient loss (L_detail) of the high-frequency images, the distribution loss (L_semantic) of the low-frequency image, and the content reconstruction loss (L_reconstruction) of the reconstructed image. The loss function is defined as follows:

Loss_{total} = L_{detail} + \alpha L_{semantic} + \beta L_{reconstruction}    (6)

where \alpha and \beta are hyper-parameters that balance the three losses.

As shown in Fig. 7, L_detail accumulates the mean square error between each high-frequency feature map (I_g1, I_g2, I_g3) and the gradient image of the original image. The detailed calculation of L_detail is as follows:

L_{detail} = L_{grd-1} + L_{grd-2} + L_{grd-3}
L_{grd-i} = MSE(Gradient(I_{ori}), I_{g-i}), \; i \in \{1, 2, 3\}
MSE(X, Y) = \frac{1}{N} \sum_{n=1}^{N} (X_n - Y_n)^2    (7)

where I_ori is the input source image and I_g-i is the i-th high-frequency image. MSE(X, Y) is the mean square error between X and Y. The gradient image of the original image is obtained with the Laplacian gradient operator Gradient(·), which performs the convolution with the template in Eq. (4).

In Eq. (6), L_semantic is a data distribution loss. We compute a strongly supervised loss for the high-frequency images above and a strongly supervised loss for the reconstructed image below, but we hope that the low-frequency Semantic block learns to extract deep semantic information rather than being given an answer to memorize. At the same time, we cannot give the network a suitable low-frequency reference image, because the low-frequency information is certainly not simply a down-sampled image. However, if the network has no loss function at all, it is difficult to obtain the low-frequency image we really want. Therefore, we use the down-sampled image as an approximation of the data distribution of low-frequency images, so that the low-frequency results generated by our network lie in the "low-frequency domain". The experiment in the next section shows that this loss is indeed very effective.

L_{semantic} = L_{adv}(I_s, I_{down})    (8)

where I_s is the low-frequency semantic image generated by the network, I_down is the low-frequency blurred image obtained by downsampling the source image twice, and L_adv is the adversarial loss:

L_G = \frac{1}{N} \sum_{n=1}^{N} (D(I_s^n) - 1)^2
L_D = \frac{1}{N} \sum_{n=1}^{N} (D(I_{down}^n) - 1)^2 + \frac{1}{N} \sum_{n=1}^{N} (D(I_s^n))^2    (9)

where n \in N and N is the number of images. The loss function we use here is defined in LSGAN [22].

In Eq. (6), L_reconstruction is the image content reconstruction loss of the reconstructed image. It consists of two parts: the pixel-level reconstruction loss L_pix and the structural similarity loss L_ssim, as follows:

L_{reconstruction} = L_{pix} + \gamma L_{ssim}    (10)
Fig. 8. Decomposition and fusion of infrared and visible images.
where \gamma is a hyper-parameter that balances the two losses. L_pix and L_ssim are calculated as follows:

L_{pix} = MSE(I_{ori}, I_{re})
L_{ssim} = 1 - SSIM(I_{ori}, I_{re})
SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}    (11)

As shown in Fig. 7, the total loss function Loss_total is given as follows:

Loss_{total} = L_{detail} + \alpha L_{semantic} + \beta L_{reconstruction}
             = L_{grd-1} + L_{grd-2} + L_{grd-3} + \lambda_1 L_{adv} + \lambda_2 L_{pix} + \lambda_3 L_{ssim}    (12)

where \lambda_1, \lambda_2 and \lambda_3 are hyper-parameters used to balance the losses.
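A hedged sketch of how these loss terms could be computed in PyTorch is given below. The pytorch_msssim dependency, the discriminator D and the exact reduction choices are illustrative assumptions; here I_s is treated directly as the generator output, as in Eq. (9).

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # any differentiable SSIM implementation works

# extended Laplacian template of Eq. (4), used as Gradient() in Eq. (7)
LAPLACIAN = torch.tensor([[1., 1., 1.],
                          [1., -8., 1.],
                          [1., 1., 1.]]).view(1, 1, 3, 3)

def gradient(img):
    """Laplacian gradient image, the target of the detail loss (single-channel input)."""
    return F.conv2d(img, LAPLACIAN.to(img.device), padding=1)

def detail_loss(i_ori, highs):
    """L_detail: sum of MSE losses between each I_g-i and Gradient(I_ori) (Eq. 7)."""
    target = gradient(i_ori)
    return sum(F.mse_loss(g, target) for g in highs)

def reconstruction_losses(i_ori, i_re):
    """L_pix and L_ssim of Eq. (11)."""
    l_pix = F.mse_loss(i_ori, i_re)
    l_ssim = 1.0 - ssim(i_ori, i_re, data_range=1.0)
    return l_pix, l_ssim

def lsgan_losses(D, i_s, i_down):
    """LSGAN generator/discriminator losses for the semantic branch (Eq. 9)."""
    l_g = ((D(i_s) - 1.0) ** 2).mean()
    l_d = ((D(i_down) - 1.0) ** 2).mean() + (D(i_s.detach()) ** 2).mean()
    return l_g, l_d

# Eq. (12): Loss_total = L_grd-1 + L_grd-2 + L_grd-3 + λ1·L_adv + λ2·L_pix + λ3·L_ssim
```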
C. Image Fusion

In the testing phase, our fusion structure is divided into two parts, decomposition and fusion, as shown in Fig. 9. The decomposition network decomposes each image into three high-frequency images and one low-frequency image. The fusion strategy ("FS" in Fig. 9) fuses the corresponding feature images and reconstructs them to obtain the final image.

In Fig. 9, I_ir and I_vi represent the infrared image and the visible light image, respectively. The two images are fed into the decomposition network to obtain two sets of feature images. One group of feature images comes from the visible light image, including three visible light high-frequency images (I_vi-g1, I_vi-g2, I_vi-g3) and one visible light low-frequency image (I_vi-ups). The other group comes from the infrared image, including three infrared high-frequency images (I_ir-g1, I_ir-g2, I_ir-g3) and one infrared low-frequency image (I_ir-ups). For the corresponding four pairs of feature images, our fusion strategy contains a variety of fusion methods to obtain the final fused image I_f.
Fig. 9. The framework of the proposed method. The "Decomposition Network" decomposes the image and "FS" indicates the fusion strategy.
In the following subsection, we introduce the fusion strategy.
D. Fusion Strategy
We design fusion strategies to obtain the fused image. As shown in Fig. 8, we first use the decomposition network to decompose the visible light image I_vi and the infrared image I_ir to obtain two sets of high- and low-frequency feature images. The corresponding high-frequency and low-frequency feature images (such as I_vi-g1 and I_ir-g1) are fused using specific fusion strategies to obtain the fused high-frequency and low-frequency feature images (I_f-g1, I_f-g2, I_f-g3, I_f-ups). Finally, the fused feature images are added pixel by pixel to obtain the fused image I_f, in the same way as an image is reconstructed in the training phase.

We designed two fusion strategies for high-frequency image fusion, namely pixel-wise addition (add) and taking the maximum of the corresponding pixels (max). In addition, we designed two fusion methods for the low-frequency images: adding and averaging pixel by pixel (avg), and taking the maximum of the corresponding pixels (max), as shown in Fig. 10.
Fig. 10. High- and low-frequency image fusion strategies. Here a and b are any pair of pixels from the feature images, and c is the corresponding fused pixel.

The formulas of the fused high-frequency features I_f-gi and the fused low-frequency feature I_f-ups are as follows:

I_{f-gi}^n = \max(I_{vi-gi}^n, I_{ir-gi}^n) \; (Max) \quad \text{or} \quad I_{f-gi}^n = I_{vi-gi}^n + I_{ir-gi}^n \; (Add)
I_{f-ups}^n = (I_{vi-ups}^n + I_{ir-ups}^n)/2 \; (Avg) \quad \text{or} \quad I_{f-ups}^n = \max(I_{vi-ups}^n, I_{ir-ups}^n) \; (Max)
i \in \{1, 2, 3\}, \; n \in N    (13)

where i indexes the three high-frequency images and N denotes all pixels in the image. I_{vi-gi}^n and I_{ir-gi}^n are any pixel in the corresponding three pairs of high-frequency images, and I_{vi-ups}^n and I_{ir-ups}^n are any pixel in the low-frequency images. We fuse the corresponding pixels to get the pixels of the fused high-frequency images I_{f-gi} and the fused low-frequency image I_{f-ups}. Finally, the fused features are added to obtain the final fused image I_f as follows:

I_f = I_{f-g1} + I_{f-g2} + I_{f-g3} + I_{f-ups}    (14)
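A minimal sketch of these fusion rules is given below, assuming PyTorch tensors of identical shape for the decomposed feature images; the function names are ours, not part of the paper.

```python
import torch

def fuse_high(a, b, mode='add'):
    """Fuse a pair of high-frequency images (Eq. 13): pixel-wise 'add' or 'max'."""
    return a + b if mode == 'add' else torch.maximum(a, b)

def fuse_low(a, b, mode='avg'):
    """Fuse the low-frequency images (Eq. 13): pixel-wise average or maximum."""
    return (a + b) / 2 if mode == 'avg' else torch.maximum(a, b)

def fuse(vi_highs, vi_low, ir_highs, ir_low, high_mode='add', low_mode='avg'):
    """Fuse decomposed visible/infrared feature images and reconstruct I_f (Eq. 14)."""
    fused_highs = [fuse_high(v, r, high_mode) for v, r in zip(vi_highs, ir_highs)]
    fused_low = fuse_low(vi_low, ir_low, low_mode)
    return sum(fused_highs) + fused_low
```

For example, high_mode='max' with low_mode='avg' corresponds to the Ours (max+avg) variant reported in the experiments.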
IV. EXPERIMENTS AND ANALYSIS

A. Training and Testing Details

For the selection of the hyper-parameters, we make the values of the losses as close to the same order of magnitude as possible. So, in Eq. (12), we set \lambda_1 = 0.1, \lambda_2 = 100 and \lambda_3 = 10 by cross validation.

Our goal is to train a powerful decomposition network that can decompose images into high-frequency and low-frequency images well. Therefore, our input images in the training phase are not limited to infrared and visible light images; MS-COCO [23], ImageNet [24] or other images can also be used for this purpose. In our experiments, we use MS-COCO as the training set of our decomposition network. We select about 80,000 images as input images. These images are converted to grayscale and resized to 256 × 256 resolution. In the testing phase, we decompose the test images one by one and measure the average computation time; it takes about 2 ms to decompose each image.
B. The Role of the Adversarial Loss
As shown in Fig. 11, if we do not constrain the low-frequency image I_s, it is difficult for the network to learn to produce the semantic low-frequency image we want. Without the distribution loss, the high-frequency images learned by the network contain too much semantic information, such as the distribution of colors, which is not high-frequency information, and the low-frequency image loses a lot of semantic information.

To allow the Semantic block to learn the real low-frequency information we want, we give it a hint in the form of a weakly supervised loss. As in Eq. (9), we regard the down-sampled image I_down as an approximate solution of the low-frequency image, so that the low-frequency image I_s generated by the network follows the distribution of low-frequency images.

Fig. 11. The effect of the adversarial loss: (a) source image; (b) low-frequency image without the adversarial loss; (c) high-frequency images without the adversarial loss; (d) low-frequency image with the adversarial loss; (e) high-frequency images with the adversarial loss.
C. The Details of the Decomposed Images
Although the loss function of our high-frequency images is computed against a gradient map obtained with the Laplacian operator, the resulting high-frequency images are quite different from the Laplacian gradient images. In Fig. 12, we show the high-frequency images decomposed by the proposed decomposition network and by the Laplacian operator. It can be seen that the Laplacian gradient images only extract part of the high-frequency information and contain a lot of noise. The high-frequency images decomposed by our network not only retain almost all the high-frequency information of the Laplacian gradient images, but also completely extract the contour and detail information of the objects. In addition, our high-frequency images have a certain degree of semantic recognizability and can clearly express semantic features through the outlines of the objects.
D. Comparison with State-of-the-Art Methods
We select ten classic and state-of-the-art fusion methods against which to compare the fusion results of our proposed method: the Curvelet Transform (CVT) [29], the dual-tree complex wavelet transform (DTCWT) [30], Multi-resolution Singular Value Decomposition (MSVD) [31], DenseFuse [14], the GAN-based fusion network (FusionGAN) [16], a general end-to-end fusion network (IFCNN) [15], MDLatLRR [11], NestFuse [17], FusionDN [26] and U2Fusion [32]. We use the public code of these methods with the parameters given in the corresponding papers to obtain the fused images.

Because there are currently no definitive evaluation indicators for measuring the quality of a fused image, we compare the methods according to both subjective and objective evaluation.
1) Subjective evaluation:
In different fields and for different tasks, everyone has his or her own judging criteria. We consider the subjective impression of the picture, such as lightness, fidelity, noise and clarity. In Fig. 13 and Fig. 14, our method is compared with the other methods. It can be clearly seen that our fused images not only retain the radiation information of the infrared image, but also fully retain the detailed texture information of the visible light image. More importantly, our images do not contain much noise. We mark some salient areas with red boxes. For example, in Fig. 13, the canopy of the shop has less noise, and in Fig. 14, the outline of the person in the distance is clearly visible.
2) Objective evaluation:
Subjective impressions involve strong personal factors, so it is not sufficient to rely solely on subjective evaluation. We select eleven objective evaluation indicators from the popular objective metrics for a comprehensive evaluation: Edge Intensity (EI) [33], SF [34], Entropy (EN) [35], Sum of Correlation Coefficients (SCD) [36], Fast Mutual Information (FMI_w and FMI_dct) [37], Mutual Information (MI) [38], Standard Deviation of the image (SD), Definition (DF) [39], Average Gradient (AG) [40] and QG [33].
Fig. 12. High-frequency images decomposed by the proposed decomposition network and the Laplacian operator.
Fig. 13. Experiment on street images: (a) Visible image; (b) Infrared image; (c) CVT; (d) DTCWT; (e) MSVD; (f) DenseFuse; (g) FusionGan; (h) IFCNN; (i) MDLatLRR; (j) NestFuse; (k) FusionDN; (l) U2Fusion; (m) Ours (max+avg); (n) Ours (max+max); (o) Ours (add+avg); (p) Ours (add+max).

TABLE II
OBJECTIVE EVALUATION OF CLASSIC AND LATEST FUSION ALGORITHMS ON THE TNO DATASET
Methods     EI       SF       EN      SCD     FMI_w   FMI_dct  MI       SD       DF      AG      QG
CVT         42.9631  11.1129  6.4989  1.5812  0.4240  0.3945   12.9979  27.4613  5.4530  4.2802  0.4623
DTCWT       42.4889  11.1296  6.4791  1.5829
MSVD        27.6098  8.5538   6.2807  1.5857  0.2828  0.2470   12.5613  24.0288  4.2283  2.8773  0.3375
DenseFuse   36.4838  9.3238   6.8526  1.5329  0.4389  0.3897   13.7053  38.0412  4.6176  3.6299  0.4569
FusionGan   32.5997  8.0476   6.5409  0.6876  0.4083
The objective evaluation indicators fall into two categories. One category evaluates the fused image itself, such as the edge intensity (EI), the number of abrupt changes in the image (SF), the average gradient (AG), the entropy (EN), the clarity (DF) and the contrast of the image (SD). The other category evaluates the fused image against the source images, such as
Fig. 14. Experiment on road images: (a) Visible image; (b) Infrared image; (c) CVT; (d) DTCWT; (e) MSVD; (f) DenseFuse; (g) FusionGan; (h) IFCNN; (i) MDLatLRR; (j) NestFuse; (k) FusionDN; (l) U2Fusion; (m) Ours (max+avg); (n) Ours (max+max); (o) Ours (add+avg); (p) Ours (add+max).

TABLE III
OBJECTIVE EVALUATION OF CLASSIC AND LATEST FUSION ALGORITHMS ON THE ROADSCENE DATASET
Methods     EI       SF       EN      SCD     FMI_w   FMI_dct  MI       SD       DF      AG      QG
CVT         59.7642  14.7379  7.0159  1.3418  0.4138  0.3631   14.0319  36.0884  6.9618  5.7442  0.4499
DTCWT       57.3431  14.7318  6.9211  1.3329  0.3458  0.2383   13.8421  34.7264  6.7810  5.5228  0.4402
MSVD        36.0475  11.3182  6.6960  1.3458  0.2659  0.2195   13.3919  30.9643  5.0926  3.6171  0.3600
DenseFuse   34.0135  8.5541   6.6740  1.3491  0.4173  0.3857   13.3480  30.6655  3.9885  3.2740  0.3916
FusionGan   35.4048  8.6400   7.1753  0.8671  0.3410  0.3609   14.3507  42.3040  3.9243  3.3469  0.2591
IFCNN       57.6653  15.0677  6.9730
MDLatLRR    36.9468  9.3638   6.7171  1.3636  0.4241
max+avg     39.4996  10.7670  6.7575  1.3394  0.3151  0.2311   13.5150  31.6977  4.8180  3.8643  0.3591
max+max     39.7592  10.6215  6.8186  1.3223  0.3171  0.2217   13.6371  39.6907  4.7275  3.8575  0.3505
F M I w and F M I dct ) and somecomplex calculation methods(QG).We compare the proposed method with ten other excellentmethods, and the results of the average values for all fused images shown in Table II and Table III respectively. The bestvalue in the quality table is made bold in red and bold , andthe second best value is given in bold and italic .It can be seen from Table II and Table III that our proposed
method obtains eleven best values and four second best values. On the TNO dataset, our method (add+max) obtains one best result and six second best results. On the RoadScene dataset, our method (add+avg) obtains two best results and four second best results. Compared with the other operators in the fusion strategy, the addition operator for the high-frequency information achieves better results on several indicators. Although our method does not obtain the best result for every indicator, it achieves the second best result on many of them.

V. CONCLUSIONS

In this paper, we propose a novel network for image decomposition. We also develop a decomposition-network-based fusion framework to fuse infrared and visible light images. First, with the help of the decomposition network, the infrared image and the visible light image are each decomposed into multiple high-frequency feature images and one low-frequency feature image. Second, the corresponding feature images are fused with a specific fusion strategy to obtain the fused feature images. Finally, the fused feature images are added pixel by pixel to obtain the fused image. This kind of image decomposition network is universal, and any number of images can be decomposed quickly and effectively by the neural network. At the same time, neural networks can easily use GPUs to accelerate matrix computation, so the image decomposition can also be very fast. We have performed a subjective and objective evaluation of the proposed method, and the experimental results show that it reaches the state of the art. Although the network structure is simple, it demonstrates the feasibility of decomposing images with a neural network. We conjecture that the CNN uses the semantics of the image to filter the noise while preserving the edges, and thus obtains a very good high-frequency image. We will continue to study image decomposition based on deep learning, including simplifying some originally complex image decomposition computations such as wavelet transforms and low-rank decomposition, and designing more reasonable network structures for other image processing applications. We believe that the proposed network can be used for different image processing tasks, including multi-focus fusion, medical image fusion, multi-exposure fusion, and basic computer vision tasks such as detection, recognition and classification. We will experiment with and test this method on other image tasks in the future.

REFERENCES
[1] J. Ma, Y. Ma, and C. Li, "Infrared and visible image fusion methods and applications: A survey," Information Fusion, vol. 45, pp. 153–178, 2019.
[2] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, and X. Wang, "Deep learning for pixel-level image fusion: Recent advances and future prospects," Information Fusion, vol. 42, pp. 158–173, 2018.
[3] T. Mertens, J. Kautz, and F. Van Reeth, "Exposure fusion: A simple and practical alternative to high dynamic range photography," Computer Graphics Forum, vol. 28, no. 1, pp. 161–171, 2009.
[4] Z. Zhang and R. S. Blum, "A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application," Proceedings of the IEEE, vol. 87, no. 8, pp. 1315–1326, 1999.
[5] K. P. Upla, M. V. Joshi, and P. P. Gajjar, "An edge preserving multiresolution fusion: Use of contourlet transform and MRF prior," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 6, pp. 3210–3220, 2014.
[6] A. B. Hamza, Y. He, H. Krim, and A. S. Willsky, "A multiscale approach to pixel-level image fusion," Computer-Aided Engineering, vol. 12, no. 2, pp. 135–146, 2005.
[7] J.-J. Zong and T.-S. Qiu, "Medical image fusion based on sparse representation of classified image patches," Biomedical Signal Processing and Control, vol. 34, pp. 195–205, 2017.
[8] Q. Zhang, Y. Fu, H. Li, and J. Zou, "Dictionary learning method for joint sparse representation-based image fusion," Optical Engineering, vol. 52, no. 5, p. 057006, 2013.
[9] Y. Bin, Y. Chao, and H. Guoyu, "Efficient image fusion with approximate sparse representation," International Journal of Wavelets, Multiresolution and Information Processing, vol. 14, no. 04, p. 1650024, 2016.
[10] H. Li and X.-J. Wu, "Multi-focus image fusion using dictionary learning and low-rank representation," in International Conference on Image and Graphics. Springer, 2017, pp. 675–686.
[11] H. Li, X.-J. Wu, and J. Kittler, "MDLatLRR: A novel decomposition method for infrared and visible image fusion," IEEE Transactions on Image Processing, 2020.
[12] Y. Liu, X. Chen, H. Peng, and Z. Wang, "Multi-focus image fusion with a deep convolutional neural network," Information Fusion, vol. 36, pp. 191–207, 2017.
[13] K. R. Prabhakar, V. S. Srikar, and R. V. Babu, "DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs," in ICCV, 2017, pp. 4724–4732.
[14] H. Li and X.-J. Wu, "DenseFuse: A fusion approach to infrared and visible images," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2614–2623, 2018.
[15] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, "IFCNN: A general image fusion framework based on convolutional neural network," Information Fusion, vol. 54, pp. 99–118, 2020.
[16] J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang, "FusionGAN: A generative adversarial network for infrared and visible image fusion," Information Fusion, vol. 48, pp. 11–26, 2019.
[17] H. Li, X.-J. Wu, and T. Durrani, "NestFuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models," IEEE Transactions on Instrumentation and Measurement, 2020.
[18] H. Li, B. S. Manjunath, and S. K. Mitra, "Multisensor image fusion using the wavelet transform," Graphical Models and Image Processing, vol. 57, no. 3, pp. 235–245, 1995.
[19] L. J. Chipman, T. M. Orr, and L. N. Graham, "Wavelets and image fusion," vol. 3, p. 3248, 1995.
[20] G. Liu and S. Yan, "Latent low-rank representation for subspace segmentation and feature extraction," pp. 1615–1622, 2011.
[21] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
[22] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," pp. 2813–2821, 2017.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR. IEEE, 2009, pp. 248–255.
[25] A. Toet et al., "TNO image fusion dataset," Figshare data, 2014.
[26] H. Xu, J. Ma, Z. Le, J. Jiang, and X. Guo, "FusionDN: A unified densely connected network for image fusion," in AAAI, 2020, pp. 12484–12491.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint, 2014.
[28] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint, 2012.
[29] F. Nencini, A. Garzelli, S. Baronti, and L. Alparone, "Remote sensing image fusion using the curvelet transform," Information Fusion, vol. 8, no. 2, pp. 143–156, 2007.
[30] J. J. Lewis, R. J. O'Callaghan, S. G. Nikolov, D. R. Bull, and N. Canagarajah, "Pixel- and region-based image fusion with complex wavelets," Information Fusion, vol. 8, no. 2, pp. 119–130, 2007.
[31] V. Naidu, "Image fusion technique using multi-resolution singular value decomposition," Defence Science Journal, vol. 61, no. 5, p. 479, 2011.
[32] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, "U2Fusion: A unified unsupervised image fusion network," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[33] C. S. Xydeas and V. S. Petrovic, "Objective pixel-level image fusion performance measure," in Sensor Fusion: Architectures, Algorithms, and Applications IV, vol. 4051. International Society for Optics and Photonics, 2000, pp. 89–98.
[34] A. M. Eskicioglu and P. S. Fisher, "Image quality measures and their performance," IEEE Transactions on Communications, vol. 43, no. 12, pp. 2959–2965, 1995.
[35] J. W. Roberts, J. A. van Aardt, and F. B. Ahmed, "Assessment of image fusion procedures using entropy, image quality, and multispectral classification," Journal of Applied Remote Sensing, vol. 2, no. 1, p. 023522, 2008.
[36] V. Aslantas and E. Bendes, "A new image quality metric for image fusion: the sum of the correlations of differences," AEU - International Journal of Electronics and Communications, vol. 69, no. 12, pp. 1890–1896, 2015.
[37] M. Haghighat and M. A. Razian, "Fast-FMI: non-reference image fusion metric," IEEE, 2014, pp. 1–3.
[38] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[39] X. Desheng, "Research of measurement for digital image definition," Journal of Image and Graphics, 2004.
[40] G. Cui, H. Feng, Z. Xu, Q. Li, and Y. Chen, "Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition," Optics Communications, vol. 341, pp. 199–209, 2015.
Yu Fu received the B.M. degree in professional engineering management from the North China Institute of Science and Technology, China. He is currently a Master's student in the Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University. His research interests include image fusion, machine learning and deep learning.
Xiao-Jun Wu received the B.Sc. degree in mathematics from Nanjing Normal University, Nanjing, China, in 1991, and the M.S. and Ph.D. degrees in pattern recognition and intelligent systems from the Nanjing University of Science and Technology, Nanjing, in 1996 and 2002, respectively. From 1996 to 2006, he taught at the School of Electronics and Information, Jiangsu University of Science and Technology, where he was promoted to Professor. He was a Fellow of the International Institute for Software Technology, United Nations University, from 1999 to 2000. He was a Visiting Researcher with the Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey, U.K., from 2003 to 2004. Since 2006, he has been with the School of Information Engineering, Jiangnan University, where he is currently a Professor of pattern recognition and computational intelligence. His current research interests include pattern recognition, computer vision, and computational intelligence. He has published over 300 articles in his fields of research. He was a recipient of the Most Outstanding Postgraduate Award from the Nanjing University of Science and Technology.