Guided Deep Decoder: Unsupervised Image Pair Fusion
Tatsumi Uezato, Danfeng Hong, Naoto Yokoya, and Wei He

RIKEN AIP, Tokyo, Japan; German Aerospace Center, Wessling, Germany; Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France; The University of Tokyo, Tokyo, Japan
{tatsumi.uezato, naoto.yokoya, wei.he}@riken.jp, danfeng.hong@dlr.de

Abstract.
The fusion of input and guidance images that have a tradeoff in their information (e.g., hyperspectral and RGB image fusion or pansharpening) can be interpreted as one general problem. However, previous studies applied a task-specific handcrafted prior and did not address the problems with a unified approach. To address this limitation, in this study, we propose a guided deep decoder network as a general prior. The proposed network is composed of an encoder-decoder network that exploits multi-scale features of a guidance image and a deep decoder network that generates an output image. The two networks are connected by feature refinement units to embed the multi-scale features of the guidance image into the deep decoder network. The proposed network allows the network parameters to be optimized in an unsupervised way without training data. Our results show that the proposed network can achieve state-of-the-art performance in various image fusion problems.
Keywords:
Deep Image Prior, Deep Decoder, Image Fusion, Hyperspectral Image, Super-resolution, Pansharpening.
Some image fusion tasks address the fusion of image pairs in the same modality. These tasks consider a pair of images that capture the same region but have a tradeoff between the two images (Fig. 1). For example, a low spatial resolution hyperspectral (LR-HS) image has greater spectral resolution at lower spatial resolution [46]. However, an RGB image acquires much lower spectral resolution at higher spatial resolution. Likewise, panchromatic and multispectral (MS) images have a tradeoff between spatial and spectral resolution [35]. No-flash images capture ambient illumination but are very noisy, while flash images capture artificial light but are less noisy [29]. Image fusion enables an image that overcomes the tradeoff to be generated. Hyperspectral super-resolution or pansharpening aims to generate a high resolution (HR) HS or MS image. The denoising of a
no-flash image with a flash image can also be interpreted as a special case of image fusion.

Fig. 1: Illustration of image pair fusion of the same modality. Each row pairs a guidance image and an input image to produce an enhanced image: HS super-resolution (RGB imagery + low spatial resolution HS → high spatial resolution HS), pansharpening (panchromatic imagery + low spatial resolution MS → high spatial resolution MS), and denoising (flash + noisy no-flash → denoised image).

Although these tasks share a common goal (i.e., enhancing input images with the help of guidance images), the tasks have been studied independently. This occurs because a different handcrafted prior is considered to incorporate the specific property of an output image. In HS super-resolution, a prior exploiting the low-rankness of HS has been extensively used [47,8,21]. In pansharpening, a prior representing spatial smoothness has been considered [27]. The denoising task assumes that the spatial structure of a restored image is similar to that of a guidance image [29]. While these handcrafted priors share the same goal, the priors need to be designed for each task to exploit the specific properties of the data. It is highly desirable to develop a prior applicable to various image fusion problems.

Deep learning (DL) approaches avoid the assumption of explicit priors for each specific task. Although network architectures themselves need to be handcrafted, properly designed network architectures have been shown to solve various problems [31,16]. Most DL approaches rely on training data. However, for pansharpening and hyperspectral super-resolution, it is difficult to collect a large amount of training data including reference images (i.e., HR-HS or HR-MS) because of cost or hardware limitations. Thus, previous studies [43,32] have frequently used synthetic data for training, which may limit generalization performance. In addition, different sensors provide different spectral response functions. Networks trained on data acquired by a particular sensor may not work well on new data acquired by a different sensor.
A natural question arises: is it possible to use DL approaches without training data? Ulyanov et al. [34] have shown that network architectures have an inductive bias and can be used as a deep image prior (DIP) without any training data. This intriguing property of DIP has been successfully used for various problems [14,45,33]. In [34], the guided denoising task of a flash and no-flash image pair has been addressed using a no-flash image as an input and a flash image as an output. Although this approach can potentially be used to address the problems shown in Fig. 1, the network architecture does not fully exploit the semantic features or image details of a guidance image. It is still unclear how the network architecture is conditioned on the features of a guidance image. Although DIP has great potential, these uncertainties prevent DIP from achieving state-of-the-art (SOTA) performance in various image fusion problems.

As discussed above, previous studies face two major problems (task-specific handcrafted priors and the requirement of training data) when addressing various image fusion problems in a unified framework. In this study, we propose a new network architecture, called a guided deep decoder (GDD), that overcomes these problems and can achieve SOTA performance in different image fusion problems. Specifically, the proposed architecture is composed of two networks: an encoder-decoder network designed to extract multi-scale semantic features from a guidance image, and a deep decoder network that generates an output image from random noise. The two networks are connected by feature refinement units incorporating attention gates to embed the multi-scale features of the guidance image into the deep decoder network.

The contributions of this paper are as follows. (1) We propose a new unsupervised DL method that does not require training data and can be adapted to different image fusion tasks in a unified framework. We achieve SOTA results for various image fusion problems.
(2) We propose a new network architecture as a regularizer for unsupervised image fusion problems. The attention gates used in the proposed architecture guide the generation of an output image using the multi-scale semantic features from a guidance image. The guidance of the multi-scale features can lead to an effective regularizer for an ill-posed optimization problem.
Most previous works have independently addressed one of the image fusion problems shown in Fig. 1, although the common goal is to generate an image that overcomes the tradeoff. This study focuses on data acquired in the same modality and differs from image fusion problems of different modalities, where the sensors capture different physical quantities (e.g., fusion of RGB images and depth maps [24]). To address the ill-posed fusion problems, similar approaches have been developed for different image fusion tasks.
Classical approach: The classical approach is to specifically design a handcrafted prior for each task. For example, handcrafted priors exploiting the low-rankness or sparsity of HS have been developed for HS and MS image fusion problems [47,19,20,41]. In panchromatic and MS image fusion, handcrafted priors, which assume that the spatial details of PAN are similar to those of MS, have been widely used [22,27,12,7]. In addition, flash and no-flash image fusion uses a prior that promotes similar spatial details between the paired images [29]. The classical approach can reconstruct an enhanced image without any training data by explicitly assuming prior knowledge. However, the priors designed for a specific task may not be effective when applied to other tasks. In addition, an optimization method needs to be tailored for each different prior.
Supervised DL approach: DL methods that use training data have recently achieved SOTA performance in different image fusion problems. DL methods are usually built upon a popular network (e.g., [31,16]). In HS and RGB image fusion, DL methods use LR-HS and RGB images as inputs and an HR-HS image as an output, and learn the mapping function between the inputs and the output [43,9]. Similarly, in pansharpening, the methods consider panchromatic and LR-MS images as inputs and HR-MS as an output and learn the mapping function [32,42,44]. As long as training data are available, DL methods can potentially be applied to different image fusion problems in a unified framework by slightly changing the network architecture or the loss function. However, it may be difficult to acquire training data, including reference data, for HS or MS images because of cost or hardware limitations.
Unsupervised DL approach: To bridge the gap between the classical and supervised DL approaches, an unsupervised DL approach has been considered in some studies. Unsupervised DL methods have been developed to address the HS and RGB image fusion problem [30,13]. In [30,13], the network architecture has been specifically designed to exploit the properties of the HS image, and different handcrafted priors have been combined to achieve optimal performance. However, this approach may not achieve SOTA performance in other tasks because of the specifically designed network and handcrafted priors. DIP, which can apply DL in an unsupervised way, has been recently developed by [34] and has been applied to a variety of problems [14,45,33]. Although DIP can potentially be applied to various image fusion problems, it has not been explored yet. The simple application of DIP cannot achieve SOTA performance in different image fusion tasks, as shown in the following experiments. Our study borrows the idea of DIP and proposes a robust network architecture that achieves SOTA performance in these tasks.
Let us denote a low resolution or noisy input image $\mathbf{Y} \in \mathbb{R}^{C \times w \times h}$ and a guidance image $\mathbf{G} \in \mathbb{R}^{c \times W \times H}$, where $C$, $W$, and $H$ represent the number of channels, the image width, and the image height, respectively. When considering HS super-resolution or pansharpening, $w \ll W$, $h \ll H$, and $c \ll C$. In the unsupervised image fusion problem, the corresponding output $\mathbf{X} \in \mathbb{R}^{C \times W \times H}$ can be estimated
Fig. 2: The structure of a guided deep decoder. The semantic features are extracted from the guidance image by the U-net-like encoder-decoder network. The blue layers represent the features of the encoder, the red layers represent the features of the decoder, and the green layers represent the features of the deep decoder network. The semantic features of G are used to guide the features of the deep decoder in the upsampling and feature refinement units (URU and FRU).

by solving the following optimization problem:

$$\min_{\mathbf{X}} \; \mathcal{L}(\mathbf{X}, \mathbf{Y}, \mathbf{G}) + R(\mathbf{X}), \qquad (1)$$

where $\mathcal{L}$ is a loss function that is different for each task. Because the problem is ill-posed, existing methods commonly add a handcrafted regularization term $R$. However, a task-specific regularization term (e.g., the low-rank property of HS images) cannot be easily applied to other tasks. Instead of using handcrafted regularization terms, DIP estimates $\mathbf{X}$ using a convolutional neural network (CNN)-based mapping function:

$$\mathbf{X} = f_{\theta}(\mathbf{Z}), \qquad (2)$$

where $f_{\theta}$ represents the mapping function with the network parameters $\theta$, and $\mathbf{Z}$ is the input representing a random code tensor. The optimization problem can be rewritten as:

$$\min_{\theta} \; \mathcal{L}(f_{\theta}(\mathbf{Z}), \mathbf{Y}, \mathbf{G}). \qquad (3)$$

In this formulation, only one input image $\mathbf{Y}$ and a guidance image $\mathbf{G}$ are used for the optimization problem; thus, training data are not required. $\mathbf{X}$ is regularized by the implicit prior of the network architecture. Different types of architectures can lead to different regularizers. An architecture that effectively incorporates the multi-scale spatial details and semantic features of the guidance image can be a powerful regularizer for the optimization problem. In the following section, we propose a new architecture, called the guided deep decoder, as a regularizer that can be used for various image fusion problems.
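The fitting procedure of Eqs. (2)-(3) can be sketched with a toy decoder. This is a minimal NumPy sketch under stated assumptions: `f_theta` here is a hypothetical two-layer mapping rather than the guided CNN of GDD, and the loss is a plain squared error. It only illustrates that the parameters are optimized against a single observation, with no training set involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for f_theta: a tiny two-layer decoder mapping a FIXED
# random code z to an image vector x. In GDD, f_theta is a guided CNN.
z = rng.normal(size=16)                  # fixed random code tensor Z
W1 = 0.1 * rng.normal(size=(32, 16))     # network parameters theta = (W1, W2)
W2 = 0.1 * rng.normal(size=(64, 32))
y = rng.normal(size=64)                  # the single observed image Y

def f(W1, W2, z):
    h = np.tanh(W1 @ z)
    return W2 @ h, h

def loss_of(x):
    return 0.5 * np.sum((x - y) ** 2)

x, _ = f(W1, W2, z)
loss0 = loss_of(x)

# Gradient descent on theta for the single observation (Eq. 3):
# no training data are used; the architecture acts as the regularizer.
lr = 0.005
for _ in range(1000):
    x, h = f(W1, W2, z)
    r = x - y                                      # dL/dx
    W2_new = W2 - lr * np.outer(r, h)              # dL/dW2 = r h^T
    gh = W2.T @ r                                  # backprop to hidden layer
    W1 -= lr * np.outer(gh * (1.0 - h ** 2), z)    # through tanh
    W2 = W2_new

x, _ = f(W1, W2, z)
final = loss_of(x)
```

With a CNN in place of this toy mapping, early stopping and the architecture itself determine which solutions are favored; the sketch only shows the optimization mechanics.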
Fig. 3: The structure of upsampling and feature refinement units (URU and FRU), which combine 1 × 1 convolutions, LeakyReLU, channel-wise normalization (CN), and sigmoid gating.
GDD is composed of an encoder-decoder network with skip connections and a deep decoder network, as shown in Fig. 2. The encoder-decoder network is similar to the architecture of U-net [31] and produces the features of a guidance image at multiple scales. The multi-scale features represent hierarchical semantic features of the guidance image from low to high levels. The semantic features are used to guide the parameter estimation in the deep decoder. Let $\Gamma_k$ denote the features of the encoder at the $k$th scale, and $\Xi_k$ the $k$th-scale features in the decoder part of the encoder-decoder network. The mapping function is conditioned on the multi-scale features as $f_{\theta}(\mathbf{Z} \mid \Gamma_1, \cdots, \Gamma_K, \Xi_1, \cdots, \Xi_K)$. The multi-scale features are incorporated in the deep decoder by the two proposed units shown in Fig. 3.

Upsampling refinement unit (URU).
Upsampling is a vital part of DIP [5]. Bilinear or nearest neighbor upsampling promotes piecewise constant patches or smoothness across all channels [17]. However, this prior is too strong to recover exact spatial structures or boundaries of an image. Although the problem is alleviated using skip connections, the spatial details of a guidance image are still lost in the features of the decoder. URU incorporates an attention gate for weighting the features derived after upsampling and channel-wise normalization (CN) in the deep decoder. The features from the guidance image are gated by a $1 \times 1$ convolution and a sigmoid function. Given the features $\mathbf{F}$ of the deep decoder, the transformation is carried out as:

$$\mathrm{URU}(\mathbf{F} \mid \Gamma_k) = \mathbf{F} \otimes \Gamma_k, \qquad (4)$$

where $\otimes$ represents element-wise multiplication. Note that the dimensions of $\mathbf{F}$ and $\Gamma_k$ are the same at each scale. Both channel-wise and spatial-wise conditional weights are considered in URU.

Feature refinement unit (FRU).
FRU is different from URU in that the features of the deep decoder are weighted by the high-level semantic features of the guidance image. FRU promotes semantic alignment with the features of the guidance image, while URU promotes similar spatial locality. Using an attention gate, the high-level features are gated by a $1 \times 1$ convolution and a sigmoid function:

$$\mathrm{FRU}(\mathbf{F} \mid \Xi_k) = \mathbf{F} \otimes \Xi_k. \qquad (5)$$

Note that the dimensions of $\mathbf{F}$ and $\Xi_k$ are the same at each scale. The features of the deep decoder are weighted in URU and FRU, which leads to a deep prior that can exploit the spatial details or semantic features of the guidance image more explicitly than DIP.

GDD is closely related to existing network architectures. URU and FRU in GDD generate multiplicative transformation parameters from the guidance image for spatial and channel-wise feature modulation. A similar feature modulation has also been used in [38,11,4,28,23]. In [38], affine transformation parameters have been generated from segmentation probability maps for feature modulation to achieve more realistic textures in image super-resolution. The affine transformation has also been considered in style transfer [18]. Although the affine transformation considers both scaling and bias values, URU and FRU consider only the scaling values because we find that similar results can be obtained at lower computational cost for unsupervised optimization problems.

The conditional weights can also be interpreted as attention layers across all channels. In [11,26,4], attention gates are incorporated to refine spatial details and highlight salient features. Conditional attention weights are generated from a label map for semantic image synthesis in [23]. GDD is closely related to the conditional attention weights in that it uses the multi-scale features from a guidance image to generate the conditional attention weights.
However, all of the aforementioned studies require a large amount of training data. Network architectures have not been fully explored as regularizers for unsupervised optimization problems. Our study differs from previous studies in that it uses the network architecture as a regularizer to solve a variety of unsupervised image fusion problems.
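As a minimal illustration, the scaling-only modulation of URU and FRU (Eqs. (4) and (5)) can be sketched as follows. The shapes and the single 1 × 1 convolution weight are hypothetical; the actual units also include channel-wise normalization, LeakyReLU, and (for URU) bilinear upsampling, as shown in Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1x1(feat, weight):
    # A 1x1 convolution is a per-pixel linear map over channels:
    # feat has shape (C_in, H, W), weight has shape (C_out, C_in).
    return np.einsum('oc,chw->ohw', weight, feat)

def refine(F, guide, weight):
    # Scaling-only modulation of Eqs. (4)-(5): the guidance features are
    # turned into multiplicative attention weights in (0, 1) by a 1x1
    # convolution and a sigmoid, then applied element-wise to F. Both
    # channel-wise and spatial-wise weighting fall out of the element-wise
    # product, since the gate varies over channels and pixels.
    gate = sigmoid(conv1x1(guide, weight))
    return F * gate

C, H, W = 8, 4, 4
F = rng.normal(size=(C, H, W))      # deep-decoder features at scale k
Gk = rng.normal(size=(C, H, W))     # guidance features (Gamma_k or Xi_k)
Wg = 0.1 * rng.normal(size=(C, C))  # toy 1x1 conv weights
out = refine(F, Gk, Wg)
```

Because the gate lies in (0, 1), the unit can only attenuate decoder features, which is the scaling-only (no bias) choice discussed above.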
The loss function is different for each task. In this section, the loss functions used for HS super-resolution, pansharpening, and denoising are discussed.
HS super-resolution.
When fusing RGB and HS images, the loss function is usually designed to preserve the spectral information from the HS image while keeping the spatial information from the RGB image. For simplicity, the matrix forms of $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{G}$ are denoted as $\tilde{\mathbf{X}} \in \mathbb{R}^{C \times WH}$, $\tilde{\mathbf{Y}} \in \mathbb{R}^{C \times wh}$, and $\tilde{\mathbf{G}} \in \mathbb{R}^{c \times WH}$, respectively. Given the estimated HR-HS $\tilde{\mathbf{X}}$, the loss function can be defined as:

$$\mathcal{L}(\mathbf{X}, \mathbf{Y}, \mathbf{G}) = \mu \|\tilde{\mathbf{X}}\mathbf{S} - \tilde{\mathbf{Y}}\|_F^2 + \|\mathbf{R}\tilde{\mathbf{X}} - \tilde{\mathbf{G}}\|_F^2, \qquad (6)$$

where $\|\cdot\|_F$ is the Frobenius norm, $\mathbf{S}$ is the spatial downsampling operator with blurring, and $\mathbf{R}$ is the spectral response function that integrates the spectra into R, G, and B channels. The first term encourages the spectral similarity between the spatially downsampled $\mathbf{X}$ and $\mathbf{Y}$. The second term encourages the spatial similarity between the spectrally downsampled $\mathbf{X}$ and $\mathbf{G}$. $\mu$ is a scalar controlling the balance between the two terms. This loss function has been widely used with handcrafted priors in HS super-resolution [21,47] because the optimization problem is highly ill-posed. Our approach differs from previous studies in that it uses GDD as a regularizer.

Pansharpening.
Like HS super-resolution, pansharpening also considers two terms that balance the tradeoff between spatial and spectral information. Although the first term in (6) can also be used for the loss function of pansharpening, the second term may not be effective. This is because the spectral response function of the panchromatic image may only partially cover the spectral range captured by the MS image. Thus, the second term cannot effectively measure the spatial similarity between panchromatic and MS images. To address this problem, the second term measuring the spatial similarity is defined as follows:

$$\mathcal{L}(\mathbf{X}, \mathbf{Y}, \mathbf{G}) = \mu \|\tilde{\mathbf{X}}\mathbf{S} - \tilde{\mathbf{Y}}\|_F^2 + |\mathbf{D}\nabla\tilde{\mathbf{X}} - \nabla\tilde{\mathbf{G}}|_1, \qquad (7)$$

where $\tilde{\mathbf{Y}}$ is the MS image, $\tilde{\mathbf{G}}$ is the panchromatic image expanded to the same number of bands as $\tilde{\mathbf{X}}$, $\nabla\tilde{\mathbf{X}}$ is the image gradient of $\tilde{\mathbf{X}}$, $\nabla\tilde{\mathbf{G}}$ is the image gradient of $\tilde{\mathbf{G}}$, $|\cdot|_1$ is the $l_1$ norm, and $\mathbf{D}$ is a diagonal matrix weighting each channel of $\nabla\tilde{\mathbf{X}}$ so that the magnitude of $\nabla\tilde{\mathbf{X}}$ is scaled to that of $\nabla\tilde{\mathbf{G}}$. Note that $\mathbf{D}$ can be learned with the other parameters within the GDD optimization framework. The $l_1$ norm is chosen because it more explicitly encourages the edges of the output and guidance images to be similar than other norms (e.g., the $l_2$ norm). The first term encourages spectral similarity while the second term promotes spatial similarity. A similar loss function has also been explored in [6].

Denoising.
For the denoising of the no-flash image, the following loss function is used:

$$\mathcal{L}(\mathbf{X}, \mathbf{Y}) = \|\tilde{\mathbf{X}} - \tilde{\mathbf{Y}}\|_F^2, \qquad (8)$$

where $\tilde{\mathbf{Y}}$ is the no-flash image. Only $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{Y}}$ are considered in the loss function. $\tilde{\mathbf{G}}$ is considered only in the network architecture because, in the detail transfer of flash and no-flash images, the spatial structures or colors are not necessarily consistent [29]. To fairly compare with the results derived by DIP [34], we adopt the same loss function.

Different handcrafted priors are usually considered with task-specific loss functions. As a result, the optimization framework can also differ for each
Fig. 4: Comparison of DD, DIP, and GDD. The left figure shows PSNR at different iterations. The right figure shows the images derived at 5000 iterations. From top to bottom: RGB images, enlarged RGB images, and the error maps of the compared methods.

task. Our approach is different from previous studies in that GDD is used as a common prior for all of the tasks in a unified optimization framework.
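To make the task-specific objectives concrete, the two fusion losses above can be sketched in NumPy. This is a minimal illustration with hypothetical toy shapes: the pooling stand-in for S omits the blurring, the spectral response R is random, the image gradient uses one possible forward-difference discretization, and D is fixed to the identity rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Eq. (6): HS super-resolution loss (hypothetical toy sizes) ----
C, c, ratio = 31, 3, 4                  # 31 HS bands, RGB guidance, 4x scale
Wd, Hd = 8, 8                           # high-resolution width and height

def downsample(Xmat):
    # Stand-in for S: per-band average pooling (the paper's S also blurs).
    img = Xmat.reshape(-1, Wd, Hd)
    img = img.reshape(img.shape[0], Wd // ratio, ratio, Hd // ratio, ratio)
    return img.mean(axis=(2, 4)).reshape(img.shape[0], -1)

def hs_loss(X, Y, G, R, mu=10.0):
    # mu * ||X S - Y||_F^2 + ||R X - G||_F^2
    return mu * np.sum((downsample(X) - Y) ** 2) + np.sum((R @ X - G) ** 2)

X = rng.random((C, Wd * Hd))            # candidate HR-HS (matrix form)
R = rng.random((c, C)); R /= R.sum(1, keepdims=True)  # toy spectral response
Y, G = downsample(X), R @ X             # simulated LR-HS and RGB observations
assert hs_loss(X, Y, G, R) == 0.0       # zero at the generating image

# ---- Eq. (7): l1 gradient-matching term for pansharpening ----
def grad(img):
    # Forward differences per channel (one possible discretization).
    gx = np.diff(img, axis=1, append=img[:, -1:, :])
    gy = np.diff(img, axis=2, append=img[:, :, -1:])
    return np.concatenate([gx, gy], axis=0)

def spatial_term(Xc, Gc, d):
    # | D grad(X) - grad(G) |_1 with d the per-channel diagonal of D.
    return np.sum(np.abs(grad(Xc) * np.concatenate([d, d])[:, None, None]
                         - grad(Gc)))

Xc = rng.random((4, 8, 8))              # candidate HR-MS image
Gc = np.repeat(Xc.mean(0, keepdims=True), 4, axis=0)  # expanded pan image
val = spatial_term(Xc, Gc, np.ones(4))
```

In GDD, these losses are evaluated at X = f_theta(Z) and minimized over theta, with the network architecture supplying the regularization that the handcrafted priors provide in classical methods.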
In this section, we compare a deep decoder (DD), DIP, and GDD to discuss how GDD outperforms the compared methods. The pansharpening problem was chosen as an example to evaluate the methods. Extensive experiments, including other applications, are shown in the following section. Fig. 4 shows the peak signal-to-noise ratio (PSNR) at different iterations. DD uses a tensor representing random noise as an input. DD corresponds to the deep decoder part of GDD and is considered for comparison to validate whether the features guided by the encoder-decoder network are really useful. DIP (Z) represents the deep image prior that uses a random tensor as an input, while DIP (G) uses a guidance image (i.e., panchromatic imagery) as an input to the encoder-decoder network. Because DD considers only the decoder part, the information lost in the process of upsampling cannot be recovered. DIP (Z) can use the features derived by a skip connection as a bias term and try to compensate for the lost information. This led to slightly better results for DIP (Z). GDD and DIP (G), which incorporate the guidance image, produced high PSNR at early iterations. This shows that the use of the guidance image leads to a high-quality HR-MS image in fewer iterations. Although both GDD and DIP (G) use the guidance image, GDD considerably outperformed DIP in terms of PSNR. Fig. 4 also shows the RGB images of the reconstructed images, the enlarged RGB images, and the corresponding error maps. The enlarged RGB image derived from DD is