Guided Deep Decoder: Unsupervised Image Pair Fusion
Tatsumi Uezato, Danfeng Hong, Naoto Yokoya, and Wei He

RIKEN AIP, Tokyo, Japan; German Aerospace Center, Wessling, Germany; Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France; The University of Tokyo, Tokyo, Japan
{tatsumi.uezato, naoto.yokoya, wei.he}@riken.jp, danfeng.hong@dlr.de

Abstract.
The fusion of input and guidance images that have a tradeoff in their information (e.g., hyperspectral and RGB image fusion or pansharpening) can be interpreted as one general problem. However, previous studies applied a task-specific handcrafted prior and did not address the problems with a unified approach. To address this limitation, in this study, we propose a guided deep decoder network as a general prior. The proposed network is composed of an encoder-decoder network that exploits multi-scale features of a guidance image and a deep decoder network that generates an output image. The two networks are connected by feature refinement units to embed the multi-scale features of the guidance image into the deep decoder network. The proposed network allows the network parameters to be optimized in an unsupervised way without training data. Our results show that the proposed network can achieve state-of-the-art performance in various image fusion problems.
Keywords:
Deep Image Prior, Deep Decoder, Image Fusion, Hyperspectral Image, Super-resolution, Pansharpening.
Some image fusion tasks address the fusion of image pairs in the same modality. These tasks consider a pair of images that capture the same region but have a tradeoff between the two images (Fig. 1). For example, a low spatial resolution hyperspectral (LR-HS) image has greater spectral resolution at lower spatial resolution [46]. However, an RGB image acquires much lower spectral resolution at higher spatial resolution. Likewise, panchromatic and multispectral (MS) images have a tradeoff between spatial and spectral resolution [35]. No-flash images capture ambient illumination but are very noisy, while flash images capture artificial light but are less noisy [29]. Image fusion enables an image that overcomes the tradeoff to be generated. Hyperspectral super-resolution or pansharpening aims to generate a high resolution (HR) HS or MS image. The denoising of a
no-flash image with a flash image can also be interpreted as a special case of image fusion.

Fig. 1: Illustration of image pair fusion of the same modality. Each row pairs a guidance image and an input image to produce an enhanced image: HS super-resolution (RGB imagery + low spatial resolution HS → high spatial resolution HS), pansharpening (panchromatic imagery + low spatial resolution MS → high spatial resolution MS), and denoising (flash + noisy no-flash → denoised image).

Although these tasks share a common goal (i.e., enhancing input images with the help of guidance images), the tasks have been studied independently. This occurs because a different handcrafted prior is considered to incorporate the specific property of an output image. In HS super-resolution, a prior exploiting the low-rankness of HS has been extensively used [47,8,21]. In pansharpening, a prior representing spatial smoothness has been considered [27]. The denoising task assumes that the spatial structure of a restored image is similar to that of a guidance image [29]. While these handcrafted priors share the same goal, the priors need to be designed for each task to exploit the specific properties of the data. It is highly desirable to develop a prior applicable to various image fusion problems.

Deep learning (DL) approaches avoid the assumption of explicit priors for each specific task. Although network architectures themselves need to be handcrafted, properly designed network architectures have been shown to solve various problems [31,16]. Most DL approaches rely on training data. However, for pansharpening and hyperspectral super-resolution, it is difficult to collect a large amount of training data including reference images (i.e., HR-HS or HR-MS) because of cost or hardware limitations. Thus, previous studies [43,32] have frequently used synthetic data for training, which may limit generalization performance. In addition, different sensors provide different spectral response functions. Networks trained on data acquired by a particular sensor may not work well on new data acquired by a different sensor.
A natural question arises: is it possible to use DL approaches without training data? Ulyanov et al. [34] have shown that network architectures have an inductive bias and can be used as a deep image prior (DIP) without any training data. This intriguing property of DIP has been successfully used for various problems [14,45,33]. In [34], the guided denoising task of a flash and no-flash image pair has been addressed using a no-flash image as an input and a flash image as an output. Although this approach can potentially be used to address the problems shown in Fig. 1, the network architecture does not fully exploit the semantic features or image details of a guidance image. It is still unclear how the network architecture is conditioned on the features of a guidance image. Although DIP has great potential, these uncertainties prevent DIP from achieving state-of-the-art (SOTA) performance in various image fusion problems.

As discussed above, previous studies face two major problems (task-specific handcrafted priors and the requirement of training data) when addressing various image fusion problems in a unified framework. In this study, we propose a new network architecture, called a guided deep decoder (GDD), that overcomes these problems and can achieve SOTA performance in different image fusion problems. Specifically, the proposed architecture is composed of two networks: an encoder-decoder network designed to extract multi-scale semantic features from a guidance image, and a deep decoder network that generates an output image from random noise. The two networks are connected by feature refinement units incorporating attention gates to embed the multi-scale features of the guidance image into the deep decoder network.

The contributions of this paper are as follows. (1) We propose a new unsupervised DL method that does not require training data and can be adapted to different image fusion tasks in a unified framework. We achieve SOTA results for various image fusion problems.
(2) We propose a new network architecture as a regularizer for unsupervised image fusion problems. The attention gates used in the proposed architecture guide the generation of an output image using the multi-scale semantic features from a guidance image. The guidance of the multi-scale features can lead to an effective regularizer for an ill-posed optimization problem.
Most previous works have independently addressed one of the image fusion problems shown in Fig. 1, although the common goal is to generate an image that overcomes the tradeoff. This study focuses on data acquired in the same modality and differs from image fusion problems of different modalities, where the sensors capture different physical quantities (e.g., fusion of RGB images and depth maps [24]). To address the ill-posed fusion problems, similar approaches have been developed for different image fusion tasks.
Classical approach: The classical approach is to specifically design a handcrafted prior for each task. For example, handcrafted priors exploiting the low-rankness or sparsity of HS have been developed for HS and MS image fusion problems [47,19,20,41]. In panchromatic and MS image fusion, handcrafted priors, which assume that the spatial details of PAN are similar to those of MS, have been widely used [22,27,12,7]. In addition, flash and no-flash image fusion uses a prior that promotes similar spatial details between the paired images [29]. The classical approach can reconstruct an enhanced image without any training data by explicitly assuming prior knowledge. However, the priors designed for a specific task may not be effective when applied to other tasks. In addition, an optimization method needs to be tailored for each different prior.
Supervised DL approach: DL methods that use training data have recently achieved SOTA performance in different image fusion problems. DL methods are usually built upon a popular network (e.g., [31,16]). In HS and RGB image fusion, DL methods use LR-HS and RGB images as inputs and an HR-HS image as an output, and learn the mapping function between the inputs and the output [43,9]. Similarly, in pansharpening, the methods consider panchromatic and LR-MS images as inputs and HR-MS as an output and learn the mapping function [32,42,44]. As long as training data are available, DL methods can potentially be applied to different image fusion problems in a unified framework by slightly changing the network architecture or the loss function. However, it may be difficult to acquire training data, including reference data, for HS or MS images because of cost or hardware limitations.
Unsupervised DL approach: To bridge the gap between the classical and supervised DL approaches, an unsupervised DL approach has been considered in some studies. Unsupervised DL methods have been developed to address the HS and RGB image fusion problem [30,13]. In [30,13], the network architecture has been specifically designed to exploit the properties of the HS image, and different handcrafted priors have been combined to achieve optimal performance. However, this approach may not achieve SOTA performance in other tasks because of the specifically designed network and handcrafted priors. DIP, which can apply DL in an unsupervised way, has been recently developed by [34] and has been applied to a variety of problems [14,45,33]. Although DIP can potentially be applied to various image fusion problems, it has not been explored yet. The simple application of DIP cannot achieve SOTA performance in different image fusion tasks, as shown in the following experiments. Our study borrows the idea of DIP and proposes a robust network architecture that achieves SOTA performance in these tasks.
Let us denote a low resolution or noisy input image $\mathbf{Y} \in \mathbb{R}^{C \times w \times h}$ and a guidance image $\mathbf{G} \in \mathbb{R}^{c \times W \times H}$, where $C$, $W$, and $H$ represent the number of channels, the image width, and the image height, respectively. When considering HS super-resolution or pansharpening, $w \ll W$, $h \ll H$, and $c \ll C$. In the unsupervised image fusion problem, the corresponding output $\mathbf{X} \in \mathbb{R}^{C \times W \times H}$ can be estimated
Fig. 2: The structure of a guided deep decoder. The semantic features are extracted from the guidance image by the U-net-like encoder-decoder network. The blue layers represent the features of the encoder, the red layers represent the features of the decoder, and the green layers represent the features of the deep decoder network. The semantic features of G are used to guide the features of the deep decoder in the upsampling and feature refinement units (URU and FRU).

by solving the following optimization problem:

$$\min_{\mathbf{X}} \; \mathcal{L}(\mathbf{X}, \mathbf{Y}, \mathbf{G}) + R(\mathbf{X}), \qquad (1)$$

where $\mathcal{L}$ is a loss function that is different for each task. Because the problem is ill-posed, existing methods commonly add a handcrafted regularization term $R$. However, a task-specific regularization term (e.g., the low-rank property of HS images) cannot be easily applied to other tasks. Instead of using handcrafted regularization terms, DIP estimates $\mathbf{X}$ using a convolutional neural network (CNN)-based mapping function:

$$\mathbf{X} = f_{\theta}(\mathbf{Z}), \qquad (2)$$

where $f_{\theta}$ represents the mapping function with the network parameters $\theta$, and $\mathbf{Z}$ is the input representing a random code tensor. The optimization problem can be rewritten as:

$$\min_{\theta} \; \mathcal{L}(f_{\theta}(\mathbf{Z}), \mathbf{Y}, \mathbf{G}). \qquad (3)$$

In this formulation, only one input image $\mathbf{Y}$ and a guidance image $\mathbf{G}$ are used for the optimization problem; thus, training data are not required. $\mathbf{X}$ is regularized by the implicit prior of the network architecture. Different types of architectures can lead to different regularizers. An architecture that effectively incorporates the multi-scale spatial details and semantic features of the guidance image can be a powerful regularizer for the optimization problem. In the following section, we propose a new architecture, called the guided deep decoder, as a regularizer that can be used for various image fusion problems.
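The fitting procedure of Eqs. (2)-(3) can be sketched with a toy decoder. This is a minimal NumPy sketch under stated assumptions: `f_theta` here is a hypothetical two-layer mapping rather than the guided CNN of GDD, and the loss is a plain squared error. It only illustrates that the parameters are optimized against a single observation, with no training set involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for f_theta: a tiny two-layer decoder mapping a FIXED
# random code z to an image vector x. In GDD, f_theta is a guided CNN.
z = rng.normal(size=16)                  # fixed random code tensor Z
W1 = 0.1 * rng.normal(size=(32, 16))     # network parameters theta = (W1, W2)
W2 = 0.1 * rng.normal(size=(64, 32))
y = rng.normal(size=64)                  # the single observed image Y

def f(W1, W2, z):
    h = np.tanh(W1 @ z)
    return W2 @ h, h

def loss_of(x):
    return 0.5 * np.sum((x - y) ** 2)

x, _ = f(W1, W2, z)
loss0 = loss_of(x)

# Gradient descent on theta for the single observation (Eq. 3):
# no training data are used; the architecture acts as the regularizer.
lr = 0.005
for _ in range(1000):
    x, h = f(W1, W2, z)
    r = x - y                                      # dL/dx
    W2_new = W2 - lr * np.outer(r, h)              # dL/dW2 = r h^T
    gh = W2.T @ r                                  # backprop to hidden layer
    W1 -= lr * np.outer(gh * (1.0 - h ** 2), z)    # through tanh
    W2 = W2_new

x, _ = f(W1, W2, z)
final = loss_of(x)
```

With a CNN in place of this toy mapping, early stopping and the architecture itself determine which solutions are favored; the sketch only shows the optimization mechanics.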
Fig. 3: The structure of upsampling and feature refinement units (URU and FRU), which combine 1 × 1 convolutions, LeakyReLU, channel-wise normalization (CN), and sigmoid gating.
GDD is composed of an encoder-decoder network with skip connections and a deep decoder network, as shown in Fig. 2. The encoder-decoder network is similar to the architecture of U-net [31] and produces the features of a guidance image at multiple scales. The multi-scale features represent hierarchical semantic features of the guidance image from low to high levels. The semantic features are used to guide the parameter estimation in the deep decoder. Let $\Gamma_k$ denote the features of the encoder at the $k$th scale, and $\Xi_k$ the $k$th-scale features in the decoder part of the encoder-decoder network. The mapping function is conditioned on the multi-scale features as $f_{\theta}(\mathbf{Z} \mid \Gamma_1, \cdots, \Gamma_K, \Xi_1, \cdots, \Xi_K)$. The multi-scale features are incorporated in the deep decoder by the two proposed units shown in Fig. 3.

Upsampling refinement unit (URU).
Upsampling is a vital part of DIP [5]. Bilinear or nearest neighbor upsampling promotes piecewise constant patches or smoothness across all channels [17]. However, this prior is too strong to recover exact spatial structures or boundaries of an image. Although the problem is alleviated using skip connections, the spatial details of a guidance image are still lost in the features of the decoder. URU incorporates an attention gate for weighting the features derived after upsampling and channel-wise normalization (CN) in the deep decoder. The features from the guidance image are gated by a $1 \times 1$ convolution and a sigmoid function. Given the features $\mathbf{F}$ of the deep decoder, the transformation is carried out as:

$$\mathrm{URU}(\mathbf{F} \mid \Gamma_k) = \mathbf{F} \otimes \Gamma_k, \qquad (4)$$

where $\otimes$ represents element-wise multiplication. Note that the dimensions of $\mathbf{F}$ and $\Gamma_k$ are the same at each scale. Both channel-wise and spatial-wise conditional weights are considered in URU.

Feature refinement unit (FRU).
FRU is different from URU in that the features of the deep decoder are weighted by the high-level semantic features of the guidance image. FRU promotes semantic alignment with the features of the guidance image, while URU promotes similar spatial locality. Using an attention gate, the high-level features are gated by a $1 \times 1$ convolution and a sigmoid function:

$$\mathrm{FRU}(\mathbf{F} \mid \Xi_k) = \mathbf{F} \otimes \Xi_k. \qquad (5)$$

Note that the dimensions of $\mathbf{F}$ and $\Xi_k$ are the same at each scale. The features of the deep decoder are weighted in URU and FRU, which leads to a deep prior that can exploit the spatial details or semantic features of the guidance image more explicitly than DIP.

GDD is closely related to existing network architectures. URU and FRU in GDD generate multiplicative transformation parameters from the guidance image for spatial and channel-wise feature modulation. A similar feature modulation has also been used in [38,11,4,28,23]. In [38], affine transformation parameters have been generated from segmentation probability maps for feature modulation to achieve more realistic textures in image super-resolution. The affine transformation has also been considered in style transfer [18]. Although the affine transformation considers both scaling and bias values, URU and FRU consider only the scaling values because we find that similar results can be obtained at lower computational cost for unsupervised optimization problems.

The conditional weights can also be interpreted as attention layers across all channels. In [11,26,4], attention gates are incorporated to refine spatial details and highlight salient features. Conditional attention weights are generated from a label map for semantic image synthesis in [23]. GDD is closely related to the conditional attention weights in that it uses the multi-scale features from a guidance image to generate the conditional attention weights.
However, all of the aforementioned studies require a large amount of training data. Network architectures have not been fully explored as regularizers for unsupervised optimization problems. Our study differs from previous studies in that it uses the network architecture as a regularizer to solve a variety of unsupervised image fusion problems.
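As a minimal illustration, the scaling-only modulation of URU and FRU (Eqs. (4) and (5)) can be sketched as follows. The shapes and the single 1 × 1 convolution weight are hypothetical; the actual units also include channel-wise normalization, LeakyReLU, and (for URU) bilinear upsampling, as shown in Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conv1x1(feat, weight):
    # A 1x1 convolution is a per-pixel linear map over channels:
    # feat has shape (C_in, H, W), weight has shape (C_out, C_in).
    return np.einsum('oc,chw->ohw', weight, feat)

def refine(F, guide, weight):
    # Scaling-only modulation of Eqs. (4)-(5): the guidance features are
    # turned into multiplicative attention weights in (0, 1) by a 1x1
    # convolution and a sigmoid, then applied element-wise to F. Both
    # channel-wise and spatial-wise weighting fall out of the element-wise
    # product, since the gate varies over channels and pixels.
    gate = sigmoid(conv1x1(guide, weight))
    return F * gate

C, H, W = 8, 4, 4
F = rng.normal(size=(C, H, W))      # deep-decoder features at scale k
Gk = rng.normal(size=(C, H, W))     # guidance features (Gamma_k or Xi_k)
Wg = 0.1 * rng.normal(size=(C, C))  # toy 1x1 conv weights
out = refine(F, Gk, Wg)
```

Because the gate lies in (0, 1), the unit can only attenuate decoder features, which is the scaling-only (no bias) choice discussed above.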
The loss function is different for each task. In this section, the loss functions used for HS super-resolution, pansharpening, and denoising are discussed.
HS super-resolution.
When fusing RGB and HS images, the loss function is usually designed to preserve the spectral information from the HS image while keeping the spatial information from the RGB image. For simplicity, the matrix forms of $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{G}$ are denoted as $\tilde{\mathbf{X}} \in \mathbb{R}^{C \times WH}$, $\tilde{\mathbf{Y}} \in \mathbb{R}^{C \times wh}$, and $\tilde{\mathbf{G}} \in \mathbb{R}^{c \times WH}$, respectively. Given the estimated HR-HS $\tilde{\mathbf{X}}$, the loss function can be defined as:

$$\mathcal{L}(\mathbf{X}, \mathbf{Y}, \mathbf{G}) = \mu \|\tilde{\mathbf{X}}\mathbf{S} - \tilde{\mathbf{Y}}\|_F^2 + \|\mathbf{R}\tilde{\mathbf{X}} - \tilde{\mathbf{G}}\|_F^2, \qquad (6)$$

where $\|\cdot\|_F$ is the Frobenius norm, $\mathbf{S}$ is the spatial downsampling operator with blurring, and $\mathbf{R}$ is the spectral response function that integrates the spectra into R, G, and B channels. The first term encourages the spectral similarity between the spatially downsampled $\mathbf{X}$ and $\mathbf{Y}$. The second term encourages the spatial similarity between the spectrally downsampled $\mathbf{X}$ and $\mathbf{G}$. $\mu$ is a scalar controlling the balance between the two terms. This loss function has been widely used with handcrafted priors in HS super-resolution [21,47] because the optimization problem is highly ill-posed. Our approach differs from previous studies in that it uses GDD as a regularizer.

Pansharpening.
Like HS super-resolution, pansharpening also considers two terms that balance the tradeoff between spatial and spectral information. Although the first term in (6) can also be used for the loss function of pansharpening, the second term may not be effective. This is because the spectral response function of the panchromatic image may only partially cover the spectral range captured by the MS image. Thus, the second term cannot effectively measure the spatial similarity between panchromatic and MS images. To address this problem, the second term measuring the spatial similarity is defined as follows:

$$\mathcal{L}(\mathbf{X}, \mathbf{Y}, \mathbf{G}) = \mu \|\tilde{\mathbf{X}}\mathbf{S} - \tilde{\mathbf{Y}}\|_F^2 + |\mathbf{D}\nabla\tilde{\mathbf{X}} - \nabla\tilde{\mathbf{G}}|_1, \qquad (7)$$

where $\tilde{\mathbf{Y}}$ is the MS image, $\tilde{\mathbf{G}}$ is the panchromatic image expanded to the same number of bands as $\tilde{\mathbf{X}}$, $\nabla\tilde{\mathbf{X}}$ is the image gradient of $\tilde{\mathbf{X}}$, $\nabla\tilde{\mathbf{G}}$ is the image gradient of $\tilde{\mathbf{G}}$, $|\cdot|_1$ is the $l_1$ norm, and $\mathbf{D}$ is a diagonal matrix weighting each channel of $\nabla\tilde{\mathbf{X}}$ so that the magnitude of $\nabla\tilde{\mathbf{X}}$ is scaled to that of $\nabla\tilde{\mathbf{G}}$. Note that $\mathbf{D}$ can be learned with the other parameters within the GDD optimization framework. The $l_1$ norm is chosen because it more explicitly encourages the edges of the output and guidance images to be similar than other norms (e.g., the $l_2$ norm). The first term encourages spectral similarity while the second term promotes spatial similarity. A similar loss function has also been explored in [6].

Denoising.
For the denoising of the no-flash image, the following loss function is used:

$$\mathcal{L}(\mathbf{X}, \mathbf{Y}) = \|\tilde{\mathbf{X}} - \tilde{\mathbf{Y}}\|_F^2, \qquad (8)$$

where $\tilde{\mathbf{Y}}$ is the no-flash image. Only $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{Y}}$ are considered in the loss function. $\tilde{\mathbf{G}}$ is considered only in the network architecture because, in the detail transfer of flash and no-flash images, the spatial structures or colors are not necessarily consistent [29]. To fairly compare with the results derived by DIP [34], we adopt the same loss function.

Different handcrafted priors are usually considered with task-specific loss functions. As a result, the optimization framework can also differ for each
Fig. 4: Comparison of DD, DIP, and GDD. The left figure shows PSNR at different iterations. The right figure shows the images derived at 5000 iterations. From top to bottom: RGB images, enlarged RGB images, and the error maps of the compared methods.

task. Our approach is different from previous studies in that GDD is used as a common prior for all of the tasks in a unified optimization framework.
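To make the task-specific objectives concrete, the two fusion losses above can be sketched in NumPy. This is a minimal illustration with hypothetical toy shapes: the pooling stand-in for S omits the blurring, the spectral response R is random, the image gradient uses one possible forward-difference discretization, and D is fixed to the identity rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Eq. (6): HS super-resolution loss (hypothetical toy sizes) ----
C, c, ratio = 31, 3, 4                  # 31 HS bands, RGB guidance, 4x scale
Wd, Hd = 8, 8                           # high-resolution width and height

def downsample(Xmat):
    # Stand-in for S: per-band average pooling (the paper's S also blurs).
    img = Xmat.reshape(-1, Wd, Hd)
    img = img.reshape(img.shape[0], Wd // ratio, ratio, Hd // ratio, ratio)
    return img.mean(axis=(2, 4)).reshape(img.shape[0], -1)

def hs_loss(X, Y, G, R, mu=10.0):
    # mu * ||X S - Y||_F^2 + ||R X - G||_F^2
    return mu * np.sum((downsample(X) - Y) ** 2) + np.sum((R @ X - G) ** 2)

X = rng.random((C, Wd * Hd))            # candidate HR-HS (matrix form)
R = rng.random((c, C)); R /= R.sum(1, keepdims=True)  # toy spectral response
Y, G = downsample(X), R @ X             # simulated LR-HS and RGB observations
assert hs_loss(X, Y, G, R) == 0.0       # zero at the generating image

# ---- Eq. (7): l1 gradient-matching term for pansharpening ----
def grad(img):
    # Forward differences per channel (one possible discretization).
    gx = np.diff(img, axis=1, append=img[:, -1:, :])
    gy = np.diff(img, axis=2, append=img[:, :, -1:])
    return np.concatenate([gx, gy], axis=0)

def spatial_term(Xc, Gc, d):
    # | D grad(X) - grad(G) |_1 with d the per-channel diagonal of D.
    return np.sum(np.abs(grad(Xc) * np.concatenate([d, d])[:, None, None]
                         - grad(Gc)))

Xc = rng.random((4, 8, 8))              # candidate HR-MS image
Gc = np.repeat(Xc.mean(0, keepdims=True), 4, axis=0)  # expanded pan image
val = spatial_term(Xc, Gc, np.ones(4))
```

In GDD, these losses are evaluated at X = f_theta(Z) and minimized over theta, with the network architecture supplying the regularization that the handcrafted priors provide in classical methods.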
In this section, we compare a deep decoder (DD), DIP, and GDD to discuss how GDD outperforms the compared methods. The pansharpening problem was chosen as an example to evaluate the methods. Extensive experiments, including other applications, are shown in the following section. Fig. 4 shows the peak signal-to-noise ratio (PSNR) at different iterations. DD uses a tensor representing random noise as an input. DD corresponds to the deep decoder part of GDD and is considered for comparison to validate whether the features guided by the encoder-decoder network are really useful. DIP (Z) represents the deep image prior that uses a random tensor as an input, while DIP (G) uses a guidance image (i.e., panchromatic imagery) as an input to the encoder-decoder network. Because DD considers only the decoder part, the information lost in the process of upsampling cannot be recovered. DIP (Z) can use the features derived by a skip connection as a bias term and try to compensate for the lost information. This led to slightly better results for DIP (Z). GDD and DIP (G), which incorporate the guidance image, produced high PSNR at early iterations. This shows that the use of the guidance image leads to a high-quality HR-MS image in fewer iterations. Although both GDD and DIP (G) use the guidance image, GDD considerably outperformed DIP in terms of PSNR. Fig. 4 also shows the RGB images of the reconstructed images, the enlarged RGB images, and the corresponding error maps. The enlarged RGB image derived from DD is