[PDF] Hyperspectral Denoising Using Unsupervised Disentangled Spatio-Spectral Deep Priors

Abstract

Image denoising is often empowered by accurate prior information. In recent years, data-driven neural network priors have shown promising performance for RGB natural image denoising. Compared to classic handcrafted priors (e.g., sparsity and total variation), the "deep priors" are learned using a large number of training samples -- which can accurately model the complex image generating process. However, data-driven priors are hard to acquire for hyperspectral images (HSIs) due to the lack of training data. A remedy is to use the so-called unsupervised deep image prior (DIP). Under the unsupervised DIP framework, it is hypothesized and empirically demonstrated that proper neural network structures are reasonable priors of certain types of images, and the network weights can be learned without training data. Nonetheless, the most effective unsupervised DIP structures were proposed for natural images instead of HSIs. The performance of unsupervised DIP-based HSI denoising is limited by a couple of serious challenges, namely, network structure design and network complexity. This work puts forth an unsupervised DIP framework that is based on the classic spatio-spectral decomposition of HSIs. Utilizing the so-called linear mixture model of HSIs, two types of unsupervised DIPs, i.e., U-Net-like network and fully-connected networks, are employed to model the abundance maps and endmembers contained in the HSIs, respectively. This way, empirically validated unsupervised DIP structures for natural images can be easily incorporated for HSI denoising. Besides, the decomposition also substantially reduces network complexity. An efficient alternating optimization algorithm is proposed to handle the formulated denoising problem. Semi-real and real data experiments are employed to showcase the effectiveness of the proposed approach.

Full PDF

11 Hyperspectral Denoising Using UnsupervisedDisentangled Spatio-Spectral Deep Priors

Yu-Chun Miao, Xi-Le Zhao ∗ , Xiao Fu ∗ , Jian-Li Wang, and Yu-Bang Zheng Abstract —Image denoising is often empowered by accurateprior information. In recent years, data-driven neural networkpriors have shown promising performance for RGB naturalimage denoising. Compared to classic handcrafted priors (e.g.,sparsity and total variation), the “deep priors” are learned usinga large number of training samples—which can accurately modelthe complex image generating process. However, data-drivenpriors are hard to acquire for hyperspectral images (HSIs) dueto the lack of training data. A remedy is to use the so-called unsupervised deep image prior (DIP). Under the unsupervised DIPframework, it is hypothesized and empirically demonstrated that proper neural network structures are reasonable priors of certaintypes of images, and the network weights can be learned withouttraining data. Nonetheless, the most effective unsupervised DIPstructures were proposed for natural images instead of HSIs.The performance of unsupervised DIP-based HSI denoising islimited by a couple of serious challenges, namely, networkstructure design and network complexity. This work puts forth anunsupervised DIP framework that is based on the classic spatio-spectral decomposition of HSIs. Utilizing the so-called linearmixture model of HSIs, two types of unsupervised DIPs, i.e., U-Net-like network and fully-connected networks, are employed tomodel the abundance maps and endmembers contained in theHSIs, respectively. This way, empirically validated unsupervisedDIP structures for natural images can be easily incorporatedfor HSI denoising. Besides, the decomposition also substantiallyreduces network complexity. An efﬁcient alternating optimiza-tion algorithm is proposed to handle the formulated denoisingproblem. Semi-real and real data experiments are employed toshowcase the effectiveness of the proposed approach.

Index Terms —Hyperspectral image denoising, unsuperviseddeep image prior, spatio-spectral decomposition

I. I

NTRODUCTION H YPERSPECTRAL images (HSIs) contain rich spectraland spatial information of areas/objects of interest. HSIshave been widely used across many disciplines, e.g., biology,ecology, geoscience, and food/medicine science [1]. However,the acquired HSIs are often corrupted by various types ofnoise. Heavy noise may affect the performance of downstream ∗ Corresponding authors. Tel.: +86 28 61831016, Fax: +86 28 61831280.This work is supported by NSFC (No. 61876203, 61772003), theApplied Basic Research Project of Sichuan Province (No. 21YYJC3042),the Key Project of Applied Basic Research in Sichuan Province (No.2020YJ0216), and National Key Research and Development Program of China(No. 2020YFA0714001). The work of X. Fu is supported by NSF ECCS-2024058 and NSF ECCS-1808159.Y.-C. Miao, X.-L. Zhao, J.-L. Wang, and Y.-B. Zheng are with theResearch Center for Image and Vision Computing, School of MathematicalSciences, University of Electronic Science and Technology of China, Chengdu611731, P.R.China (e-mails: [email protected]; [email protected];[email protected]; [email protected]).X. Fu is with the School of Electrical Engineering and ComputerScience, Oregon State University (OSU), Corvallis, OR 97331, United States(e-mail: [email protected]). analytical tasks (e.g., hyperspectral pixel classiﬁcation). In thepast two decades, a plethora of HSI denoising techniques wereproposed to address this challenge; see [2]–[5].At a high level, the idea of many HSI denoising methodsis to ﬁt the acquired image using an estimated image withprior information-induced priors. The rationale is that noisedoes not obey the HSI priors, and thus such a ﬁtting processcan effectively extract the “clean” HSI from the noisy version.Under this principle, early HSI denoising methods used spatialpriors such as sparsity [6]–[8] and total variation (TV) [9].Methods that exploit spectral priors were also proposed; see[10]–[12]. A number of denoising methods incorporated with implicit priors such as low matrix/tensor rank that is a result ofmulti-dimensional correlations; some examples can be foundin [2]–[4], [13]–[18].More recently, data-driven priors have drawn much attentionin the vision and imaging communities [19]. In a nutshell, deepneural networks are used to learn a generative model of imagesfrom a large number of training samples. Deep generativemodels have been successful in computer vision, see, e.g.,[20]–[22]. In particular, these models are able to map low-dimensional random vectors to visually authentic images—which means that they capture the essence of the imagegenerating process. Hence, the learned generative network isnaturally a good prior of clean images. This idea has also beenused in HSI denoising; see, e.g., [23]–[27].Although the methods mentioned above have attained sat-isfactory results for HSI denoising, these models’ expressiveability is limited by the training data’s adversity and quantity.That is, there is a lack of training data for HSIs. This is becauseHSIs are, in general, much more costly to acquire relativeto natural RGB images. In addition, different hyperspectralsensors often admit largely diverse speciﬁcations (e.g., thefrequency band used, the spectral resolution, and the spatialresolution)—data acquired from one sensor may not be usefulfor training deep priors for images from other sensors.Recently, Ulyanov et al. proposed an unsupervised imagerestoration framework, namely, deep image prior (DIP) [28].DIP directly learns a generator network from a single noisyimage—instead of learning the generator from a large numberof training samples. The work in [28] showed that properdeep neural network architectures, without training on anysamples, can already “encode” much critical information inthe natural image generating process. This discovery hashelped design unsupervised

DIPs for tasks such as imagedenoising, inpainting, and super-resolution. This work has thusattracted much attention. Since the DIP approach does not useany training data, it is particularly suitable for data-starved a r X i v : . [ ee ss . I V ] F e b LRTFL0DIP2D selected frontal slices selected tube

Deep Spectral Prior Deep Spatial Prior

Band index P i x e l v a l u e OriginalLRTFL0

Band index P i x e l v a l u e OriginalDIP2D

Band index P i x e l v a l u e OriginalDS2DP

DS2DP S S S r c c c r Fig. 1. The LMM for HSI and the proposed unsupervised disentangled spatio-spectral deep priors (

DS2DP ). applications like hyperspectral imaging. Indeed, Sidorov et al. [29] extended the DIP idea to HSI denoising and observedpositive results.Nonetheless, capitalizing on the power of DIP for HSIdenoising still faces a series of challenges. Unlike RGB imagesthat only have three spectral channels, HSIs are often measuredover hundreds of spectral channels. Therefore, directly usingthe DIP method that is originally proposed for RGB imagesto handle HSIs may not be as promising. First, it is unclearif the network structures used in [28] are still effective forHSIs. Second, due to the large size of HSIs, the scalabilitychallenge is much more severe compared to the natural imagecases. Indeed, as one will see in Sec. V, the two neural networkstructures used in [29] for modeling the generator of a standardHSI induce 2.150 and 2.342 million parameters, respectively—which makes the learning process challenging. Third, due tothe special data acquisition process of HSIs, outlying pixelsand structured noise (other than Gaussian noise) often arise.The DIP denoising loss function used in [28], [29] did nottake these aspects into consideration. Contributions.

In this work, our interest lies in an unsuper-vised DIP-based denoising framework tailored for HSIs. Ourdetailed contributions are summarized as follows: • Disentangled Spatio-Spectral Deep Prior for HSI.

Wepropose an unsupervised DIP structure that is inspired bythe well-established linear mixture model (LMM) for HSIs[30]; see Fig. 1. The LMM views every hyperspectral pixelas a linear combination of spectral signatures of a number ofmaterials ( endmembers ). The linear combination coefﬁcientsof different endmembers across the image give rise to the abundance maps (i.e., spatial distribution patterns) of theendmembers [31]. The LMM is effective in capturing thevast majority of information in HSIs (empirically, about 98%energy of typical HSI datasets can be explained by LMM [32]).Using LMM, the spatial and spectral information embeddedin the HSI can be “disentangled”. This way, the spectral andspatial priors can be designed and modeled individually . Thatis, one only needs to learn deep priors of all the endmembers (1D vectors) and abundance maps (2D images)—and thenumber of endmembers is often not large. As a result, themodeling and computational complexities can be substantiallyreduced—which often leads to improved accuracy. By ourdesign, empirically validated unsupervised DIP structures fornatural images can be much more easily capitalized for HSIdenoising. • Structured Noise-robust Optimization.

To handle struc-tured noise (e.g., stripe-shaped or deadlines), we propose atraining loss that models the structured noise as sparse outliers.We use an alternating optimization process to handle theformulated structured-noise robust deep prior-based denoisingmethod. The algorithm alternates between learning generativemodels of endmembers/abundance maps and structured-noiseidentiﬁcation and removal, and both stages admit efﬁcient andlightweight updates. • Extensive Experiments.

We test the proposed approachon a large variety of semi-real and real datasets. The experi-ments support our design—we observe substantially improveddenoising performance relative to classic methods and morerecent neural prior-based methods over all the datasets undertest. In particular, due to our disentangled network design, theproposed method outperforms the existing unsupervised DIP-based HSI denoising methods in [29] in terms of both accuracyand memory/computational efﬁciency.

Notation.

A scalar, a vector, a matrix, and a tensor are denotedas x , x , X , and X , respectively. [ x ] i , [ X ] i,j , and [ X ] i,j,k denote the i -th, ( i, j ) -th, and ( i, j, k ) -th element of x ∈ R I , X ∈ R I × J , and X ∈ R I × J × K , respectively. The Frobeniusnorms of X and X are denoted as (cid:107) X (cid:107) F = (cid:113)(cid:80) i,j [ X ] i,j and (cid:107) X (cid:107) F = (cid:113)(cid:80) i,j,k [ X ] i,j,k , respectively. Given y ∈ R N and amatrix X ∈ R I × J , the outer product is deﬁned as X ◦ y . Inparticular, X ◦ y ∈ R I × J × N and [ X ◦ y ] i,j,n = [ X ] i,j [ y ] n . Thematrix unfolding operator for a tensor is deﬁned as mat( X ) ,which denotes the mode-3 unfolding of X (see details of theunfolding of HSI in [33]). The vec( X ) operator represents vec( X ) = [[ X ] T : , , . . . , [ X ] T : ,J ] T . II. P

RELIMINARIES

In this section, we brieﬂy review pertinent backgroundinformation.

A. HSI Denoising

The acquired HSIs are three-dimensional arrays (i.e., tensors[34]). Denote X ∈ R I × J × K as the HSI captured by aremotely deployed hyperspectral sensor, where I × J is thenumber of pixels presenting in the 2D spatial domain, and K is the number of spectral bands. Unlike natural images thatare measured with the R, G, and B channels (i.e., K = 3 ),HSIs are measured over tens or hundreds of frequency bands,depending on the speciﬁcations of the employed sensors.In general, X is a noise-contaminated version of the under-lying “clean” HSI (denoted by X (cid:92) ). There are many factorscontributing to noise in the hyperspectral acquisition process,i.e., thermal electronics, dark current, and stochastic error ofphoton-counting. If the noise is additive, we have X = X (cid:92) + V , (1)where V ∈ R I × J × K denotes the noise. The objective of HSIdenoising is to “extract” X (cid:92) from X . B. Prior-Regularization Based HSI Denoising

Note that even under the additive noise model in (1),this problem is ill-posed—this is essentially a disaggregationproblem which admits an inﬁnite number of solutions. Toovercome such ambiguity, prior information of the HSI is usedto conﬁne the solution space. A generic formulation can besummarized as follows: (cid:99) X = arg min M (cid:107) X − M (cid:107) F + λR ( M ) , (2a) subject to M ∈ M , (2b)where (cid:99) X denotes the estimate for X (cid:92) using the above estima-tor, M and R ( · ) : R I × J × K → R + are the constraint set andregularization function imposed according to prior knowledgeabout the clean image X (cid:92) , respectively, and λ ≥ is theregularization parameter that balances the data ﬁdelity term(i.e., the ﬁrst term in (2a)) and the regularization.

1) From Analytical Priors to Data-Driven Priors:

A vari-ety of regularization/constraints have been considered in theliterature. For example, in [2], [35], R ( · ) = (cid:107) · (cid:107) TV is the TV across the two spatial dimensions, since image dataexhibits certain slow changing properties over the space. In[36], [37], M represents the nonnegative orthant, since HSIsare always nonnegative. In [13], [38]–[40], low tensor andmatrix rank constraints are added to M through low-rankparameterization, respectively. Such prarameterization-basedregularization can be written as (cid:98) z = arg min z (cid:107) X − G ( z ) (cid:107) F , (3)where G : R N → R I × J × K is a pre-speciﬁed parameterizationfunction that represents the I × J × K HSI using N parameters, i.e., z . For example, if mat ( X ) is believed to be a low-rankmatrix, mat ( G ( z )) = AB T and z = [vec( A ) T , vec( B ) T ] T .After estimating the parameters z , the clean image can besimply estimated via (cid:99) X = G ( (cid:98) z ) . Classic priors are useful but often insufﬁcient to capture thecomplex nature of the underlying structure of HSIs.A number of works used deep neural networks to parame-terize the regularization—i.e., these works use a deep neuralnetwork G θ ( · ) : R N → R I × J × K whose network weightsare collected in θ ∈ R D to act as the regularization in (2a)[23]–[27]. Instead of having an analytical expression, suchregularizers are “trained” using a large number of trainingsamples. As deep neural networks are universal functionapproximators, such learned “deep priors” are believed to beable to approximate complex generative processes of HSIs andthus are more effective priors for denoising. (cid:98) z = arg min z (cid:107) X − G θ ( z ) (cid:107) F , (4)However, unlike natural RGB images that have tens ofthousands of training samples for learning G θ , HSI (especiallyremotely sensed HSI) datasets are relatively rare due to theircostly acquisition process. Without a large amount of (diverse)HSIs, training such a regularizer may be out of reach.

2) Unsupervised Deep Image Prior:

Very recently, Ulyanov et al. proposed the so-called DIP [28] to circumvent thelack of training samples. The major discovery in [28] is thata proper neural network architecture (without knowing theneural network weights θ ) can already encode much priorinformation of images. As a result, tasks such as imagedenoising can be done by learning a neural network G θ ( z ) to ﬁt X with a random but known z .With this idea, the denoising problem can be formulated asfollows: (cid:98) θ = arg min θ (cid:107) X − G θ ( z ) (cid:107) F , (5)and the denoised image can be estimated via (cid:99) X = G (cid:98) θ ( z ) . (6)The idea of DIP is quite different compared to the superviseddeep prior-based approaches such as those in [23]–[26] [cf.Eq. (4)]. In DIP, the network weights θ is learned from asingle degraded image in an unsupervised manner, and z isgiven instead of learned.At ﬁrst glance, it may be surprising that an untrainedneural network can be used for image denoising (and alsoinpainting and super-resolution as revealed in [28]). The keyrationale behind this approach may be understood as follows:First, some carefully designed neural network structures (e.g.,convolutional neural network with proper modiﬁcations) areable to capture much information in the generating processof some types of images of interest. That is, not all neuralnetwork structures could work well for all types of images.Different structures may need to be carefully handcrafted fordifferent types of images. The handcrafted neural networkstructure is analogous to the handpicked priors such as the Encoder Decoder

Loss function

Update

Forward Backward

Deep Spatial Prior

Deep Spectral Prior

X S z C w Y Y r r rR r r F ( ) ( )  Y Y

Y X S z C w r r rR r r soft th _ ( ) ( ) /2 1  S z r r θθ ( ) C w r r ζζ ( ) S z r r θθ ( ) C w r r ζζ ( ) C r ζζ S r θθ w r z r Loss r Loss r skip connection Fig. 2. Illustration of the proposed

DS2DP . The generative networks C ζ r and S θ r are applied to capture the deep spectral prior of the spectral signaturesand the deep spatial prior of the abundance matrices, respectively. L norm, Tikhonov regularization, and TV regularization—which are also not learned from training samples. In theoriginal paper [28], the U-Net-like "hourglass" architecturewas shown to be powerful in natural RGB image restorationtasks under the DIP framework. In [29], various networkstructures (namely, DIP2D and DIP3D) were experimented forHSI denoising—and the results can be quite different, as onewill also see in Sec. IV. Second, in image restoration tasks,the degraded (noisy) X still contains much information in theunderlying image. Hence, the ﬁtting loss in (5) also “forces”the G θ to faithfully capture the essential information in X .In particular, since G θ has a structured underlying generativeprocess (by construction), the learned G θ is more likely tocapture the “structured signal part” (i.e., the clean image X (cid:92) )in X other than the random noise part.Since the DIP procedure does not use any training examples,it is particularly attractive to data-starved applications suchas hyperspectral imaging. In addition, although it involvescareful structure handcrafting, DIP still inherits many goodproperties of neural networks, e.g., being capable of modelingcomplex generative processes. Consequently, it often exhibitsmore appealing image restoration performance compared toclassic regularizer/parameterization based methods (e.g., TVand low matrix/tensor rank); see [28], [29]. C. Challenges

The unsupervised DIP-based approaches are attractive sincethey are effective without using any training data. However,ﬁnding a proper network structure to serve as prior of HSIsand learning the corresponding θ is by no means a trivial task.A couple of notable new challenges that arise in the domainof hyperspectral imaging are as follows:

1) Challenge 1 - Network Structure:

Since HSIs are quitedifferent compared to natural RGB images (in terms ofsensors, sensing processes, resolutions, and frequency bandsused), directly using the neural network structure in [28] inhyperspectral imaging may not be best practice. The work in [29] proposed two structures crafted for this, but it isnot clear if these two structures are “optimal” due to thelack of extensive experiments. In fact, as we will show inSec. IV, these two unsupervised DIP structures are sometimesnot as promising as some classic models (e.g., low-ranktensor decomposition-based denoising) in terms of denoisingperformance. To capitalize on the power of unsupervised DIPfor HSI denoising, it is critical to design the structure of G θ so that it suits the nature of HSIs.

2) Challenge 2 - Network Size:

Another challenge thatarises in unsupervised DIP-based HSI denoising is that theHSIs are large-scale images due to the large number ofspectral bands contained in the pixels. Directly modeling thegenerative process of a large-scale 3D image (or a third-ordertensor) inevitably leads to an overly sized neural network G θ .Although the work in [29] employed a number of tricks fornetwork size reduction, the ﬁnal constructions still yield a largenumber of network parameters. This leads to a computationallyheavy optimization problem [cf. Eq. (5)]. Since the problem isalready nonconvex and challenging, the excessive scale of theoptimization problem only makes the denoising procedure lessefﬁcient. The challenging nature of numerical optimizationmay also affect the denoising performance since "bad" localminima may be easier to happen.III. P ROPOSED A PPROACH

To circumvent the challenges, we will leverage the well-established LMM of HSI to come up with our customizedunsupervised DIPs in the next section. To this end, we brieﬂyreview the main idea of LMM.

A. Linear Mixture Model of HSI

The LMM of X is as follows (when the noise is absent): X = R (cid:88) r =1 S r ◦ c r , (7) where S r ∈ R I × J and c r ∈ R K represent the r -th endmem-ber’s abundance map and the spectral signature, respectively,and R is the number of endmembers contained in the HSI.The LMM can also be expressed as [ X ] i,j,k = R (cid:88) r =1 [ S r ] i,j [ c r ] k ; see [1], [30]. Physically, it means that every pixel is anon-negative combination of the spectral signatures of theconstituting endmembers in the HSI. Note that S r ≥ , c r ≥ according to their physical meanings—and thus the modelin (7) is often related to non-negative matrix factorization(NMF) [41]. An illustration of the LMM can be found inFig. 1. The LMM model with a relatively small R can oftencapture around 98% of the energy of the HSI [32]. Hence,it is a reliable model for HSIs. Indeed, the LMM has beenutilized for a large variety of hyperspectral imaging tasks,e.g., hyperspectral unmixing [1], [31], [42]–[45], hyperspectralsuper-resolution [46], pansharpening [47], compression andrecovery [48], and denoising [49], just to name a few. In thiswork, we propose to use the LMM to help design unsupervisedDIP neural network structures and denoising algorithms. B. LMM-Aided Unsupervised DIP for HSI

Notably, the LMM disentangles the spectral and spatialinformation into two sets of latent factors, i.e., { S r } Rr =1 and { c r } Rr =1 . Our motivations for using the LMM representationto design unsupervised DIP for HSIs are as follows:First, the physical meaning of the latent factors entailsthe opportunity to employ known effective neural networkstructures of unsupervised DIP. The abundance matrix S r canbe understood as how the material r spreads over space. Thehypothesis is that the abundance maps exhibit similar proper-ties of natural images that focus on capturing and conveyingspatial information. Under this hypothesis, it is reasonableto use unsupervised DIP neural network structures that areknown to work well for natural images to model S r . Moreover,the c r vector can be understood as the spectral signatureof the r -th material, which is the variation of reﬂectance oremittance of material over different wavelengths. It is knownthat fully connected neural networks (FCNs) can approximatesuch relatively simple 1-D continuous smooth functions well.Second, by disentanglement and LMM, the model size ofthe HSI is substantially reduced. Instead of directly imposingunsupervised DIP on the whole HSI, we employ two types ofunsupervised DIPs (i.e., the deep spatial and spectral priors) tomodel abundance maps and spectral signatures, respectively.Since the number of endmembers is often not large, thecomputational complexity is substantially reduced.Following the above argument, we model the HSI using thefollowing: X = R (cid:88) r =1 S θ r ( z r ) ◦ C ζ r ( w r ) , (8) where S θ r ( · ) : R N a → R I × J is the unsupervised DIPneural network of the r -th endmember’s abundance map, and θ r collects all the corresponding network weights; similarly, C ζ r ( · ) : R N s → R K and ζ r denote the unsupervised DIP ofthe r -th endmember and its corresponding network weights,respectively; the vectors z r ∈ R N a and w r ∈ R N s are low-dimensional random vectors that are responsible for generatingthe r -th abundance map and endmember, respectively. Ourdetailed design for S θ r and C ζ r are as follows:

1) Unsupervised DIP for Abundance Maps:

As mentioned,the abundance maps capture the spatial information of thecorresponding materials. We propose to employ the U-Net-like“hourglass” architecture in [28] for modeling S θ r . Note thatthis network architecture was shown to be able to capture thespatial prior of nature images. The U-Net is an asymmetricautoencoder [50] with skip connections, whose structure isshown in Fig. 2.

2) Unsupervised DIP for Endmembers:

The endmembersare relatively simple to model—since they can be understoodas one-dimensional smooth functions. Hence, we employFCNs as the unsupervised DIP for C ζ r . We use FCNs withthree layers; also see Fig. 2.Besides the above unsupervised DIP design, in this work,we also take into consideration of impulsive noise and grosslycorrupted pixels (outliers) that often arise in HSIs. Unlike nat-ural images whose sensing environment can be well controlled,remotely sensed HSIs often suffer from heavily corruptedpixels or spectral bands due to various reasons; see [39], [40].If not accounted for, the HSI denoising performance could beseverely hindered by such noise. To this end, we consider anoisy data acquisition model as follows: X = R (cid:88) r =1 S θ r ( z r ) ◦ C ζ r ( w r ) (cid:124) (cid:123)(cid:122) (cid:125) X (cid:92) + Y + V , (9)where V represents ubiquitous noise, e.g., the Gaussian noise,and Y denotes the impulsive noise or outliers. Accordingly,We propose the following denoising criterion: arg min { θ r ,ζ r } Rr =1 (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) X − R (cid:88) r =1 S θ r ( z r ) ◦ C ζ r ( w r ) − Y (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) F + λ (cid:107) Y (cid:107) , (10)where λ ≥ and (cid:107) Y (cid:107) = (cid:80) Ii =1 (cid:80) Jj =1 (cid:80) Kk =1 | [ Y ] i,j,k | isused for imposing the sparsity prior on Y , since outliershappen sparsely. C. Optimization Algorithm

Let us denote the objective function in (10) using thefollowing shorthand notation: arg min { θ r ,ζ r } Rr =1 , Y Loss (cid:16) { θ r , ζ r } Rr =1 , Y (cid:17) . (11) We propose the following algorithmic structure: { θ t +1 r , ζ t +1 r } Rr =1 ← arg (cid:103) min { θ r , ζ r } Rr =1 Loss (cid:0) { θ r , ζ r } Rr =1 , Y t (cid:1) (12) Y t +1 ← arg min Y Loss (cid:0) { θ t +1 r , ζ t +1 r } Rr =1 , Y (cid:1) , (13)where the superscript “ t ” is the iteration index. In (12), weuse (cid:103) min to denote inexact minimization since exactly solvingthe subproblem w.r.t. the network parameters may not bepossible—due to its large size and nonconvexity.

1) Solution for (12) : Note that the subproblem w.r.t. { θ r , ζ r } Rr =1 is nothing but a regression problem using neuralmodels. Hence, any off-the-shelf neural network optimizer canbe employed for updating { θ r , ζ r } Rr =1 . In this work, we usethe (sub-)gradient descent algorithm with momentum that hasbeen proven effective in complex network learning problems[51]: θ t +1 r ← θ tr − α t ∇ θ r Loss (cid:0) { θ r , ζ tr } Rr =1 , Y t (cid:1) (14a) ζ t +1 r ← ζ tr − α t ∇ ζ r Loss (cid:0) { θ tr , ζ r } Rr =1 , Y t (cid:1) , (14b)for all r = 1 , . . . , R . Note that the gradient w.r.t. θ r and ζ r can be computed by the standard back-propagation algorithm[52]. Here, α t is the step size (i.e., learning rate) of iteration t .There are multiple ways of determining α t . In this work, weuse the step size rule advocated in the Adam algorithm [51].

2) Solution for (13) : The subproblem (13) is convex—whose solution is the well-known soft-thresholding proximaloperator [53]. Hence, the update of Y can be expressed as Y t +1 = soft _ th λ/ (cid:32) X − R (cid:88) r =1 (cid:98) S t +1 r ◦ (cid:98) c t +1 r (cid:33) . (15)where (cid:98) S t +1 r = S θ t +1 r ( z r ) , (cid:98) c t +1 r = C ζ t +1 r ( w r ) and soft _ th λ/ ( · ) applies soft-thresholding to every entry ofits input, in which the entry-wise thresholding is deﬁned as soft _ th δ ( x ) = sgn( x ) max( | x | − δ, . (16) Algorithm 1

DS2DP for HSI Denoising.

Input:

HSI X ∈ R I × J × K . sample random z r and w r from uniform distribution; for t = 1 to T do (repeat until convergence) (cid:98) S r = S θ t − r ( z r ) , (cid:98) c r = C ζ t − r ( w r ) ; update θ r , ζ r for all r ; using the Adam [51]; update Y according to (13); end for (cid:99) X = (cid:80) Rr =1 (cid:98) S r ◦ (cid:98) c r ; Output: denoised HSI (cid:99) X .The algorithm is summarized in Algorithm 1, which wename as the unsupervised disentangled spatio-spectral deep Since the ReLU activation functions used in the U-Net and the FCNare not differentiable at one point, the algorithm is subgradient based.Nonetheless, we use ∇ (usually for denoting gradient) to denote subgradientfor notation simplicity. prior ( DS2DP ) algorithm. The algorithm falls into the cate-gory of inexact block coordinate descent [54]. Under somerelatively mild conditions, the algorithm produces a solutionsequence that converges to a stationary point of the optimiza-tion problem in (10); see detailed discussions in [54].IV. E

XPERIMENTS

In this section, we use semi-real and real data to demonstratethe effectiveness of the proposed approach.

A. Baselines

To thoroughly evaluate the performance of

DS2DP , weimplemented ﬁve state-of-the-art methods as the baselines.These methods include two unsupervised methods, i.e., deepimage prior based on 2D convolution ( DIP2D ) [29] and deep image prior based on 3D convolution ( DIP3D ) [29], amatrix optimization-based method, i.e., hyperspectral imagerestoration using low-rank matrix recovery ( LRMR ) [38], andtwo tensor optimization-based methods, i.e.,

TV-regularizedlow-rank tensor decomposition ( LRTDTV ) [39] and hyperspec-tral restoration via L gradient regularized low-rank tensorfactorization ( LRTFL0 ) [40].For

DIP2D and

DIP3D , we set the maximum number ofiterations to be 6,000 and report the best results during theiterations. For

LRMR , LRTDTV , and

LRTFL0 , their parametersare set as suggested in [38]–[40]—with parameter ﬁne-tuningeffort to uplift its performance in some cases. The experimentsof

DIP2D , DIP3D , and

DS2DP are executed using

Python on a computer with a six-core Intel(R) Core(TM) i7-9750HCPU @ 2.60GHz, 32.0 GB of RAM, and an NVIDIA GeForceRTX 2070 GPU. The experiments of

LRMR , LRTDTV , and

LRTFL0 are implemented in Matlab (2019a) on the samecomputer.

B. Semi-Real Data Experiments

Evaluation Metrics.

We adopt three frequently used evalua-tion metrics, namely, peak signal-to-noise ratio (PSNR), struc-ture similarity (SSIM), and spectral angle mapper (SAM) [40].Generally, better-restored denoising performance is reﬂectedby higher PSNR and SSIM values and lower SAM values.

Semi-Real Data.

For semi-real data, we use a number of HSIsto serve as our ground truth, which include Washington DCMall (WDC Mall) of size 256 × × of size 200 × ×

80 that is clipped into 192 × ×

80, and Pavia University of size 256 × ×

87. Themultispectral images (MSIs) in the CAVE dataset of size 256 × ×

31 are also used to serve as our clean data X (cid:92) . Scenarios.

We consider a series of scenarios with varioustypes of noise:

Case (Gaussian Noise) : In this basic scenario, the i.i.d. zero-mean Gaussian noise is added to all bands with the variance setto be 0.1. The signal-to-noise ratios (SNRs) (see deﬁnition in[55]) associated with different datasets can be found in TableII. One can see the noise levels in different datasets are similar.Note that the HSIs with SNR being 6dB to 8dB are consideredas severely corrupted data. http://lesun.weebly.com/hyperspectral-data-set.html TABLE IQ

UANTITATIVE COMPARISON OF THE DENOISING RESULTS BY DIFFERENT METHODS . T HE BEST

AND SECOND BEST VALUES ARE HIGHLIGHTED IN BOLDAND UNDERLINED , RESPECTIVELY . Case Case 1 Case 2 Case 3 Case 4 Case 5 Case 6Dataset Method PSNR SSIM SAM PSNR SSIM SAM PSNR SSIM SAM PSNR SSIM SAM PSNR SSIM SAM PSNR SSIM SAMWDC Mall DIP2D 30.408 0.871 0.122 26.540 0.770 0.163 24.043 0.708 0.228 22.679 0.678 0.271 23.366 0.696 0.227 21.759 0.594 0.282DIP3D * * * * * * * * * * * * * * * * * *LRMR 34.954 0.951 0.130 34.954 0.951 0.130 32.422 0.933 0.156 32.058 0.925 0.148 32.358 0.920 0.159 29.815 0.907 0.210LRTDTV 35.293 0.952 0.106 35.087 0.950 0.106 33.307 0.925 0.148 33.024 0.919 0.136 33.464 0.914 0.113 31.691 0.894 0.136LRTFL0 36.043 0.964 0.112 35.796 0.961 0.111 34.151 0.948 0.133 35.278 0.941 0.115 34.296 0.949 0.123 33.224 0.943 0.163DS2DP

Pavia Centre DIP2D 31.965 0.897 0.068 29.603 0.876 0.072 25.319 0.758 0.186 23.587 0.728 0.232 24.885 0.768 0.164 22.175 0.551 0.180DIP3D 26.969 0.694 0.075 26.338 0.691 0.078 25.421 0.651 0.094 23.445 0.637 0.104 24.173 0.672 0.091 23.039 0.627 0.131LRMR 33.293 0.926 0.090 33.293 0.926 0.090 30.398 0.816 0.052 32.398 0.916 0.142 31.409 0.901 0.106 24.667 0.742 0.724LRTDTV 33.511 0.921 0.095 33.608 0.923 0.065 31.465 0.901 0.104 33.096 0.903 0.147 31.415 0.881 0.104 31.882 0.894 0.101LRTFL0 33.833 0.923 0.088 33.310 0.935 0.089 31.751 0.917 0.096 32.756 0.927 0.089 32.676 0.928 0.090 32.003 0.920 0.101DS2DP

Pavia University DIP2D 33.103 0.852 0.107 25.818 0.770 0.177 25.157 0.727 0.223 24.047 0.714 0.269 24.024 0.719 0.283 21.549 0.574 0.382DIP3D 30.070 0.804 0.111 24.968 0.705 0.151 25.307 0.701 0.156 24.198 0.683 0.166 24.265 0.701 0.166 23.509 0.640 0.173LRMR 33.063 0.862 0.113 31.582 0.787 0.149 31.155 0.860 0.119 31.858 0.861 0.115 31.385 0.829 0.139 27.615 0.747 0.240LRTDTV 33.136 0.875 0.108 32.223 0.861 0.110 31.497 0.841 0.151 32.190 0.866 0.112 32.123 0.851 0.136 31.027 0.830 0.187LRTFL0 34.312 0.890 0.092 33.724 0.879 0.099 32.972 0.867 0.123 33.642 0.877 0.103 33.146 0.863 0.124 32.735 0.858 0.126DS2DP

CAVE DIP2D 29.643 0.636 0.339 23.839 0.589 0.421 23.204 0.562 0.449 21.955 0.526 0.506 22.416 0.538 0.484 22.416 0.539 0.484DIP3D 28.960 0.709 0.332 23.397 0.571 0.447 23.377 0.566 0.449 22.157 0.534 0.471 22.435 0.549 0.460 21.405 0.509 0.501LRMR 30.633 0.661 0.418 30.633 0.661 0.418 27.724 0.607 0.466 31.809 0.807 0.334 29.015 0.680 0.445 26.404 0.659 0.536LRTDTV 35.529 0.883 0.165 34.769 0.877 0.210 32.792 0.843 0.260 34.036 0.862 0.232 31.779 0.772 0.361 31.063 0.773 0.430LRTFL0 33.241 0.877 0.233 33.191 0.891 0.262 32.978 0.846 0.209 33.743 0.852 0.264 32.139 0.781 0.352 30.956 0.855 0.301DS2DP

Band index PS N R WDC Mall: Case 1

Band index PS N R WDC Mall: Case 2

Band index PS N R WDC Mall: Case 3

Band index PS N R WDC Mall: Case 4

Band index PS N R WDC Mall: Case 5

Band index PS N R WDC Mall: Case 6

Band index SS I M WDC Mall: Case 1

Band index SS I M WDC Mall: Case 2

Band index SS I M WDC Mall: Case 3

Band index SS I M WDC Mall: Case 4

50 100 150 191

Band index SS I M WDC Mall: Case 5 Band index SS I M WDC Mall: Case 6

Fig. 3. PSNR and SSIM values of all bands obtained by different methods on HSI WDC Mall under Cases 1-6.TABLE IIT HE SNR

OF THE DEGRADED IMAGES UNDER C ASE

Case (Gaussian Noise + Impulse Noise) : In this case, theGaussian noise in Case 1 in kept. We also additionally considerimpulse noise that often happens in real HSI analysis. Theimpulsive noise is also added to each band. Such noise isgenerated following the i.i.d. zero-mean Laplacian distributionwith the density parameter being 0.1. Observed DIP2D LRMR LRTDTV LRTFL0 DS2DP Ground truthFig. 4. Denoising results obtained by different methods. (From Left to Right) The observed image, the denoising results by

DIP2D , LRMR , LRTDTV , LRTFL0 , DS2DP (proposed), and the ground truth, respectively. The ﬁrst two rows are the denoising results of the WDC Mall under Cases 4 and 6, respectively. Thesecond two rows are the denoising results of the Pavia Centre under Cases 4 and 6, respectively. The last two rows are the denoising results of the PaciaUniversity under Cases 4 and 6, respectively.

Case (Gaussian Noise + Impulse Noise + Deadlines) : Tomake the case more challenging, we include deadlines on topof Case 2; see Fig. 4 for illustration of deadlines. The deadlinesare generated by nullifying some selected pixels and bands. Weassume that the deadlines randomly affect 30% of the bands.Moreover, for each selected band, the number of deadlines israndomly generated from 10 to 15, and the spatial width ofthe deadlines is randomly selected from 1 to 3 pixels. Case (Gaussian Noise + Impulse Noise + Diagonal Stripes) :In this case, we replace the deadlines in Case 3 by diagonalstripes; see Fig. 4 for illustration. The the elements of thediagonal stripes are all ones, which is used to simulate theconstant brightness. As before, we assume that the stripesaffect 30% of the bands. Moreover, for each selected band,the number of diagonal stripes is randomly generated from 15 to 30. Case (Gaussian Noise + Impulse Noise + Vertical Stripes) :In this case, we use the setting as in Case 4, except that vertical(other than diagonal) stripes are added; see Fig. 4. For eachaffected band, the number of vertical stripes is randomly gen-erated from 10 to 15. In this case, the elements of each verticalstripe are set to a certain value randomly generated from therange of [0.6, 0.8], to diversify our simulated scenarios. Case (Gaussian Noise + Impulse Noise + Deadlines +Diagonal Stripes + Vertical Stripes) : To create an extrachallenging case, Gaussian noise, impulse noise, and deadlinesare added as in Case 3. Moreover, diagonal stripes and verticalstripes are added as in Case 4 and Case 5, respectively. Parameter Setting. In DS2DP , there are two parameters tobe manually tuned, namely, λ and R . For the parameter Observed DIP2D LRMR LRTDTV LRTFL0 DS2DP Ground truthFig. 5. Denoising results obtained by different methods under Case 6. (From Top to Bottom) The band 4 in Beads, the band 4 in Pompoms, and the band31 in Flowers, respectively. (From Left to Right) The observed image, the denoising results of

DIP2D , LRMR , LRTDTV , LRTFL0 , DS2DP , and the groundtruth, respectively.DIP2D DIP3D LRMR LRTDTV LRTFL0 DS2DP

Band index P i x e l v a l u e Beads: Case 6

OriginalDIP2D

Band index P i x e l v a l u e Beads: Case 6

OriginalDIP3D

Band index P i x e l v a l u e Beads: Case 6

OriginalLRMR

Band index P i x e l v a l u e Beads: Case 6

OriginalLRTDTV

Band index P i x e l v a l u e Beads: Case 6

OriginalLRTFL0

Band index P i x e l v a l u e Pompoms: Case 6

OriginalDS2DP