Advances in Deep Learning for Hyperspectral Image Analysis – Addressing Challenges Arising in Practical Imaging Scenarios
Xiong Zhou and Saurabh Prasad

Xiong Zhou, Amazon Web Services, AI, Seattle, USA. e-mail: [email protected]

Saurabh Prasad, Hyperspectral Image Analysis Group, Department of Electrical and Computer Engineering, University of Houston, Houston, USA. e-mail: [email protected]
Abstract
Deep neural networks have proven to be very effective for computer vision tasks, such as image classification, object detection, and semantic segmentation – these are primarily applied to color imagery and video. In recent years, there has been an emergence of deep learning algorithms being applied to hyperspectral and multispectral imagery for remote sensing and biomedicine tasks. These multi-channel images come with their own unique set of challenges that must be addressed for effective image analysis. Challenges include limited ground truth (annotation is expensive and extensive labeling is often not feasible) and the high dimensional nature of the data (each pixel is represented by hundreds of spectral bands), despite the availability of a large amount of unlabeled data and the potential to leverage multiple sensors/sources that observe the same scene. In this chapter, we will review recent advances in the community that leverage deep learning for robust hyperspectral image analysis despite these unique challenges – specifically, we will review unsupervised, semi-supervised and active learning approaches to image analysis, as well as transfer learning approaches for multi-source (e.g. multi-sensor, or multi-temporal) image analysis.
Since AlexNet [1] won the ImageNet challenge in 2012, deep learning approaches have gradually replaced traditional methods, becoming a predominant tool in a variety of computer vision applications. Researchers have reported remarkable results
with deep neural networks in visual analysis tasks such as image classification, object detection, and semantic segmentation. A major differentiating factor that separates deep learning from conventional neural network based learning is the number of parameters in a model. With hundreds of thousands, even millions or billions, of parameters, deep neural networks use techniques such as error backpropagation [2], weight decay [3], pretraining [4], dropout [5], and batch normalization [6] to prevent the model from overfitting or simply memorizing the data. Combined with increased computing power and specially designed hardware such as Graphics Processing Units (GPUs), deep neural networks are able to learn from and process unprecedented large-scale data to generate abstract yet discriminative features and classify them.

Although there is significant potential to leverage deep learning advances for hyperspectral image analysis, such data come with unique challenges which must be addressed in the context of deep neural networks for effective analysis. It is well understood that deep neural networks are notoriously data hungry insofar as training the models is concerned. This is attributed to the manner in which neural networks are trained. A typical training of a network comprises two steps: 1) pass data through the network and compute a task-dependent loss; and 2) minimize the loss by adjusting the network weights by back-propagating the error [2]. During such a process, a model could easily end up overfitting [7], particularly if we do not provide sufficient training data. Data annotation has always been a major obstacle in machine learning research – and this requirement is amplified with deep neural networks. Acquiring extensive libraries such as ImageNet [8] for various applications may be very costly and time consuming. This problem becomes even more acute when working with hyperspectral imagery for applications in remote sensing and biomedicine. Not only does one need specific domain expertise to label the imagery, annotation itself is challenging due to the resolution, scale and interpretability of the imagery even by domain experts. For example, it can be argued that it is much more difficult to tell different types of soil tillage apart by looking at a hyperspectral image than to discern everyday objects in color imagery. Further, the "gold standard" in annotating remotely sensed imagery would be through field campaigns where domain experts verify the objects at exact geolocations corresponding to the pixels in the image. This can be very time consuming and, for many applications, infeasible. It is hence common in hyperspectral image analysis tasks to have a very small set of labeled ground truth data to train models from.

In addition to label scarcity, the large intra-class variance of hyperspectral data also increases the complexity of the underlying classification task. Given the same material or object, the spectral reflectance (or absorbance) profiles from two hyperspectral sensors could be dramatically different because of differences in wavelength range and spectral resolution.
Even when the same sensor is used to collect images, one can get significant spectral variability due to variations in view angle, atmospheric conditions, sensor altitude, geometric distortions, etc. [9]. Another reason for high spectral variability is mixed pixels, which arise from imaging platforms with low spatial resolution – as a result, the spectrum of one pixel corresponds to more than one object on the ground [10].
For robust machine learning and image analysis, there are two essential components – deploying an appropriate machine learning model, and leveraging a library of training data that is representative of the underlying inter-class and intra-class variability. For image analysis, specifically for classification tasks, deep learning models are typically variations of convolutional neural networks (CNNs) [11], which conduct a series of 2D convolutions between input images and (spatial) filters in a hierarchical fashion. It has been shown that such hierarchical representations are very efficient in recognizing objects in natural images [12]. When working with hyperspectral images, however, CNN-based features [13] such as color blobs, edges, shapes etc. may not be the only features of interest for the underlying analysis. There is important information encoded in the spectral profile which can be very helpful for analysis. Unfortunately, in traditional applications of CNNs to hyperspectral imagery, modeling of spectral content in conjunction with spatial content is ignored. Although one can argue that spectral information could still be picked up when 2D convolutions are applied channel by channel, or when features from different channels are stacked together, such approaches would not constitute optimal modeling of spectral reflectance/absorbance characteristics. It is well understood that when spectral correlations are explicitly exploited, spectral-spatial features are more discriminative – from traditional wavelet based feature extraction [14, 15] to modern CNNs [16, 17, 18, 19]. In Chapters 3 and 4, we reviewed variations of convolutional and recurrent neural networks that model the spatial and spectral properties of hyperspectral data. In this chapter, we review recent works that specifically address issues arising from deploying deep neural networks in challenging scenarios. In particular, our emphasis is on challenges presented by (1) limited labeled data, wherein one must leverage the vast amount of available unlabeled data in conjunction with limited labeled data for robust learning, and (2) multi-source optical data, wherein it is important to transfer a model learned from one source (e.g. a specific sensor/platform/viewpoint/timepoint) to a different source (a different sensor/platform/viewpoint/timepoint), with the assumption that one source is rich in the quality and/or quantity of labeled training data while the other source is not.
To address labeled data scarcity, one strategy is to recruit resources (time and money, for example) with the goal of expanding the training library by annotating more data. However, for many applications, human annotation is neither scalable nor sustainable. An alternate (and more practical) strategy is to design algorithms that do not require a large library of training data, but can instead learn from extensive unlabeled data in conjunction with the limited amount of labeled data. Within this broad theme, we will review unsupervised feature learning, semi-supervised, and active learning strategies. We will present results of several methods discussed in this chapter with three hyperspectral datasets
– two of these are benchmark hyperspectral datasets, University of Pavia [20] and University of Houston [21], and represent urban land-cover classification tasks. The University of Pavia dataset is a hyperspectral scene representing 9 urban land cover classes, with 103 spectral channels spanning the visible through near-infrared region. The 2013 University of Houston dataset is a hyperspectral scene acquired over the University of Houston campus, representing 15 urban land cover classes; it has 144 spectral channels in the visible through near-infrared region. The third dataset is a challenging multi-source (multi-sensor/multi-viewpoint) hyperspectral dataset [22] that is particularly relevant in a transfer learning context – details of this dataset are presented later in Section 3.1.
In contrast to labeled data, unlabeled data are often easy and cheap to acquire for many applications, including remotely sensed hyperspectral imagery. Unsupervised learning techniques do not rely on labels, and that makes this class of methods very appealing. Compared to supervised learning, where labeled data are used as a "teacher" for guidance, models trained with unsupervised learning tend to learn relationships between data samples and estimate data properties without class-specific labelings of samples. In the sense that most deep networks are comprised of two components – a feature extraction front-end and an analysis backend (e.g. undertaking tasks such as classification, regression etc.) – an approach can be completely unsupervised relative to the training labels (e.g. a neural network tasked with fusing sensors for super-resolution), or completely supervised (e.g. a neural network wherein both the features and the backend classifiers are learned with the end goal of maximizing inter-class discrimination). There are also scenarios wherein the feature extraction part of the network is unsupervised (where the labeled data are not used to train model parameters), but the backend (e.g. classification) component of the network is supervised. In this chapter, whenever the feature extraction component of a network is unsupervised (whether the backend model is supervised or unsupervised), we refer to this class of methods as carrying out "unsupervised feature learning".

The benefit of unsupervised feature learning is that we can learn useful features (in an unsupervised fashion) from a large amount of unlabeled data (e.g. spatial features representing the natural characteristics of a scene) despite not having sufficient labeled data to learn object-specific features, with the assumption that features learned in an unsupervised manner can still positively impact a downstream supervised learning task.

In traditional feature learning (e.g. dimensionality reduction, subspace learning or spatial feature extraction), the processing operators are often based on assumptions or prior knowledge about data characteristics. Optimizing feature learning for the task at hand is hence non-trivial. Deep learning-based methods address this problem in a data-adaptive manner, where feature learning is undertaken in the context of the overall analysis task in the same network.

Deep learning-based strategies, such as autoencoders [23] and their variants, restricted Boltzmann machines (RBM) [24, 25], and deep belief networks (DBN) [26], have exhibited a potential to effectively characterize hyperspectral data. For classification tasks, the most common way to use unsupervised feature learning is to extract (learn) features from the raw data that can then be used to train classifiers downstream. Section 3.1 in Chapter 3 describes such a use of autoencoders for extracting features for tasks such as classification.

In Chen et al. [27], the effectiveness of autoencoder-derived features was demonstrated for hyperspectral image analysis. Although they attempted to incorporate spatial information by feeding the autoencoder with image patches, a significant amount of information is potentially lost due to the flattening process. To capture the multi-scale nature of objects in remotely sensed images, image patches with different sizes were used as inputs for a stacked sparse autoencoder in [28].
To extract similar multi-scale spatial-spectral information, Zhao et al. [29] applied a scale transformation by upsampling the input images before sending them to the stacked sparse autoencoder. Instead of manipulating the spatial size of the inputs, Ma et al. [30] proposed to enforce a local constraint as a regularization term in the energy function of the autoencoder. By using a stacked denoising autoencoder, Xing et al. [31] sought to improve feature stability and robustness with partially corrupted inputs. Although these approaches have been effective, they still require input signals (frames/patches) to be reshaped into one-dimensional vectors, which inevitably results in a loss of spatial information. To better leverage the spatial correlations between adjacent pixels, several works have proposed to use the convolutional autoencoder to extract features from hyperspectral data [32, 33, 34].

Stacking layers has been shown to be an effective way to increase the representation power of an autoencoder model. The same principle applies to deep belief networks [35], where each layer is represented by a restricted Boltzmann machine. With the ability to extract a hierarchical representation from the training data, promising results have been shown for DBNs/RBMs for hyperspectral image analysis [36, 37, 38, 39, 40, 41, 42]. In recent works, some alternate strategies for unsupervised feature learning for hyperspectral image analysis have also emerged. In [43], a convolutional neural network was trained in a greedy layer-wise unsupervised fashion. A special learning criterion called enforcing population and lifetime sparsity (EPLS) [44] was utilized to ensure that the generated features are unique, sparse and robust at the same time. In [45], the hourglass network [46], which shares a similar network architecture with an autoencoder, was trained for super-resolution using unlabeled samples in conjunction with noise. The reconstructed image was downsampled and compared with the real low-resolution image. The offset between these two was used as the loss function that was minimized to train the entire network. A minimized loss (offset) indicates the reconstruction from the network would be a good super-resolved estimate of the original image.
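To make the convolutional route concrete, the following is a minimal PyTorch sketch of a convolutional autoencoder for unsupervised spatial-spectral feature learning from hyperspectral patches, in the spirit of [32, 33, 34]. The layer widths, the 7 × 7 patch size, and the 103-band input (University of Pavia) are illustrative assumptions rather than the exact architectures of those works.

```python
# A minimal sketch of a convolutional autoencoder for unsupervised
# spatial-spectral feature learning from hyperspectral patches.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, n_bands: int = 103, n_features: int = 64):
        super().__init__()
        # Encoder: 2D convolutions over small spatial patches keep the
        # neighborhood structure that flattening into 1D vectors destroys.
        self.encoder = nn.Sequential(
            nn.Conv2d(n_bands, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, n_features, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Decoder mirrors the encoder and reconstructs all spectral bands.
        self.decoder = nn.Sequential(
            nn.Conv2d(n_features, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, n_bands, kernel_size=3, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)            # features for a downstream classifier
        return self.decoder(z), z

# Unsupervised training step: minimize reconstruction error on unlabeled patches.
model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
patches = torch.randn(32, 103, 7, 7)   # stand-in for a batch of unlabeled patches
recon, features = model(patches)
loss = nn.functional.mse_loss(recon, patches)
loss.backward()
optimizer.step()
```

The encoder output can then be fed to any supervised classifier trained on the (limited) labeled data, in the "unsupervised feature learning" sense defined above.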
Although the feature learning strategy allows us to extract informative features from unlabeled data, the classification part of the network still requires labeled training samples, and methods that rely entirely on unsupervised learning may not yield features that are discriminative enough for challenging classification tasks. Semi-supervised deep learning is an alternate approach where unlabeled data are used in conjunction with a small amount of labeled data to train deep networks (both the feature extraction and classification components). It falls between supervised learning and unsupervised learning and leverages the benefits of both. In the context of classification, semi-supervised learning often provides better performance compared to unsupervised feature learning, but without the annotation/labeling cost needed for fully supervised learning [47].

Semi-supervised learning has been shown to be beneficial for hyperspectral image classification in various scenarios [48, 49, 50, 51, 52, 53]. Recent research [50] has shown that the classification performance of a multilayer perceptron (MLP) can be improved by adding an unsupervised loss. In addition to the categorical cross entropy loss, a symmetric decoder branch was added to the MLP, and multiple reconstruction losses, measured by the mean squared error between the encoder and decoder, were enforced to help the network generate effective features. The reconstruction loss in fact served as a regularizer to prevent the model from overfitting. A similar strategy has been used with convolutional neural networks in [48].

A variant of semi-supervised deep learning, proposed by Wu and Prasad in [52], entails learning a deep network that extracts features that are discriminative from the perspective of the intrinsic clustering structure of the data (i.e., these deep features can discriminate between cluster labels – also referred to as pseudo-labels in this work) – in short, the cluster labels generated from clustering of unlabeled data can be used to boost the classification performance. To this end, a constrained Dirichlet Process Mixture Model (DPMM) was used, and a variational inference scheme was proposed to learn the underlying clustering from data. The cluster labels of the data were used as pseudo-labels for training a convolutional recurrent neural network, where a CNN was followed by a few recurrent layers (akin to pretraining with pseudo-labels). Figure 1 depicts the architecture of the network. The network configuration is specified in Table 1, where convolutional layers are denoted as "conv<filter size>-<number of filters>" (with recurrent, fully connected, and softmax layers denoted analogously).
Table 1 Network configuration summary for the aerial view wetland hyperspectral dataset. Every convolutional layer is followed by a max pooling layer, which is omitted for the sake of simplicity. (Source: adapted from [52])

input-103 → conv3-32 → conv3-32 → conv3-64 → conv3-64 → recur-256 → recur-512 → fc-64 → fc-64 → softmax-9
Fig. 1 Architecture of the convolutional recurrent neural network. Cluster labels are used for pretraining the network. (Source: adapted from [52])
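A minimal PyTorch sketch of the configuration in Table 1 is given below: 1D convolutions over the 103-band spectrum of each pixel, two recurrent layers over the resulting feature sequence, and fully connected layers ending in a 9-class softmax. The choice of GRU cells and a pooling size of 2 are our assumptions; see [52] for the exact design.

```python
# A minimal sketch of the convolutional recurrent network in Table 1.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_classes: int = 9):
        super().__init__()
        # conv3-32 -> conv3-32 -> conv3-64 -> conv3-64, each with max pooling.
        self.convs = nn.Sequential(
            nn.Conv1d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.recur1 = nn.GRU(64, 256, batch_first=True)   # recur-256
        self.recur2 = nn.GRU(256, 512, batch_first=True)  # recur-512
        self.head = nn.Sequential(                        # fc-64 -> fc-64 -> softmax-9
            nn.Linear(512, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_classes),  # softmax is applied inside the loss
        )

    def forward(self, x):                     # x: (batch, 103) spectral pixels
        h = self.convs(x.unsqueeze(1))        # (batch, 64, reduced length)
        h = h.transpose(1, 2)                 # sequence for the recurrent layers
        h, _ = self.recur1(h)
        h, _ = self.recur2(h)
        return self.head(h[:, -1, :])         # last step summarizes the spectrum

logits = CRNN()(torch.randn(8, 103))  # pretraining would use cluster pseudo-labels
```

During pretraining, the cross entropy between these logits and the DPMM cluster pseudo-labels is minimized; no ground truth labels are needed at this stage.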
After pretraining with unlabeled data and associated pseudo-labels, the network was fine-tuned with labeled data. This entails adding a few more layers to the previously trained network and learning only these layers from the labeled data. Compared to traditional semi-supervised methods, the pseudo-label-based network, PL-SSDL, achieved higher accuracy on the wetland data (a detailed description of this dataset is provided in Section 3.1), as shown in Table 2. The effect of varying the depth of the
pretrained network on the classification performance is shown in Fig. 2: accuracy increases as the model goes deeper, i.e., with more layers. In addition to the environmental monitoring application represented by the wetland dataset, the efficacy of PL-SSDL was also verified for urban land-cover classification tasks using the University of Pavia [20] and the University of Houston [21] datasets, having 9 and 15 land cover classes, with 103 and 144 spectral channels spanning the visible through near-infrared regions, respectively. As we can see from Figure 3, features extracted with pseudo-labels (middle column) are separated better than the raw hyperspectral data (left column), which implies that pretraining with unlabeled data makes the features more discriminative. Compared to a network that is trained solely using labeled data, the semi-supervised method requires far fewer labeled samples due to the pretrained model. With only a few labeled samples per class, features are further improved by fine-tuning (right column) the network. Similar to this idea, Kang [53] later trained a CNN with pseudo-labels to extract deep spatial features through pretraining.
Table 2 Overall classification accuracies of different methods on the aerial view wetland dataset. (Source: adapted from [52])

Method              Overall accuracy (%)
Label propagation   89.– ± .04
TSVM                92.– ± .81
SS-LapSVM           95.– ± .85
Ladder Networks     93.– ± .49
PL-SSDL             97.– ± .–
Fig. 2 Classification accuracy as a function of the depth of the pretrained model. (Source: adapted from [52])
Fig. 3 t-SNE visualization of features at different training stages on the University of Pavia [20] (top row) and University of Houston [21] (bottom row) datasets. The left column shows raw image features, the middle column shows features after pretraining with unlabeled data, and the right column shows features after fine-tuning with labeled data. (Source: adapted from [52])
Leveraging unlabeled data is the underlying principle of unsupervised and semi-supervised learning. Active learning, on the other hand, aims to make the acquisition of labeled data as efficient as possible. Figure 4 shows a typical active learning loop, which contains four components: a labeled training set, a machine learning model, an unlabeled pool of data, and an oracle (a human annotator / domain expert). The labeled set is initially used for training the model. Based on the model's predictions, queries are then selected from the unlabeled pool and sent to the oracle for labeling. The loop is iterated until a pre-determined convergence criterion is met. The criterion used for selecting samples to query determines the efficiency of model training – efficiency here refers to the machine learning model reaching its full discriminative potential using as few queried labeled samples as possible. If every queried sample provides significant information to the model when labeled and incorporated into training, the annotation requirement will be small. A large part of active learning research is focused on designing suitable metrics to quantify the information contained in an unlabeled sample, which can then be used for querying samples from the data pool. A common thread in these works is the notion that choosing the samples that confuse the machine the most results in better (more efficient) active learning performance.

Fig. 4 Illustration of an active learning system.

Active learning with deep neural networks has attracted increasing attention within the remote sensing community in recent years [54, 55, 56, 57, 58]. Liu et al. [55] used features produced by a DBN to estimate the representativeness and uncertainty of samples. Both [56] and [57] explored using an active learning strategy to facilitate transferring knowledge from one dataset to another. In [56], a stacked sparse autoencoder was initially trained in the source domain and then fine-tuned in the target domain. To overcome the labeled data bottleneck, an uncertainty-based metric was used to select the most informative samples from the source domain for active learning. Similarly, Lin et al. [57] trained two separate autoencoders on the source and target domains. Representative samples were selected based on the density in the neighborhood of the samples in the feature space. This allowed the autoencoders to be effectively trained using limited data. In order to transfer supervision from the source to the target domain, features in both domains were aligned by maximizing their correlation in a latent space.

Unlike autoencoders and DBNs, convolutional neural networks (CNNs) provide an effective framework to exploit the spatial correlation between pixels in a hyperspectral image. However, when it comes to training with small data, CNNs tend to overfit due to the large number of trainable network parameters. To address this problem, Haut et al. [58] presented an active learning algorithm that uses a special network called a Bayesian CNN [59]. Gal and Ghahramani [59] have shown that dropout in a neural network can be considered an approximation to a Gaussian process, which offers nice properties such as uncertainty estimation and robustness to overfitting. By performing dropout after each convolutional layer, the training of a Bayesian CNN can be cast as approximate Bernoulli variational inference. During evaluation, outputs of a Bayesian CNN are averaged over several forward passes, which allows the model's prediction uncertainty to be estimated, and the model suffers less from overfitting. Multiple uncertainty-based query criteria were then deployed to select samples for active learning.
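The following sketch illustrates one such uncertainty-based query step using Monte Carlo dropout, in the spirit of [58, 59]. The predictive-entropy criterion and the number of stochastic passes T are illustrative choices, and `model` stands for any network containing dropout layers.

```python
# A minimal sketch of an uncertainty-based active learning query with
# Monte Carlo dropout. `model` outputs class logits for a batch of samples.
import torch

def mc_dropout_query(model, unlabeled_x, n_queries=10, T=20):
    model.train()  # keep dropout active at inference (the MC-dropout trick)
    with torch.no_grad():
        # Average class probabilities over T stochastic forward passes.
        probs = torch.stack([
            torch.softmax(model(unlabeled_x), dim=1) for _ in range(T)
        ]).mean(dim=0)
    # Predictive entropy: high entropy = samples the model is most unsure about.
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return entropy.topk(n_queries).indices  # indices to send to the oracle
```

The returned indices identify the pool samples to be annotated by the oracle; the loop of Fig. 4 then retrains the model with the enlarged labeled set.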
Another common image analysis scenario entails learning with multiple sources, in particular where one source is label "rich" (in the quantity and/or quality of labeled data), and the other source is label "starved". Sources in this scenario could imply different sensors, different sensing platforms (e.g. ground-based imagers, drones or satellites), different time-points, or different imaging viewpoints. In this situation, when it is desired to undertake analysis in the label-starved domain (often referred to as the target domain), a common strategy is to transfer knowledge from the label-rich domain (often referred to as the source domain).
Effective training has always been a challenge with deep learning models. Besides requiring large amounts of data, the training itself is time-consuming and often comes with convergence and generalization problems. One major breakthrough in the effective training of deep networks is the pretraining technique introduced by Hinton et al. in [4], where a DBN was pretrained with unlabeled data in a greedy layer-wise fashion, followed by supervised fine-tuning. In particular, the DBN was trained one layer at a time by reconstructing outputs from the previous layer during the unsupervised pretraining. At the last training stage, all parameters were fine-tuned together by optimizing a supervised training criterion. In Erhan et al. [60], the authors suggested that unsupervised pretraining works as a form of regularization: it not only provides a good initialization but also helps the generalization performance of the network. Similar to unsupervised pretraining, networks pretrained with supervision have also achieved huge success. In fact, using pretrained models as a starting point for new training has become common practice for many analysis tasks [61, 62].

The main idea behind transfer learning is that knowledge gained from related tasks or a related data source can be transferred to a new task by fine-tuning on the new data. This is particularly useful when there is a data shortage in the new domain. In the computer vision community, a common approach to transfer learning is to initialize the network with weights that are pretrained for image classification on the ImageNet dataset [8]. The rationale for this is that ImageNet contains millions of manually annotated natural images, and models trained with it tend to provide a "baseline performance" with generic and basic features commonly seen in natural images. Researchers have shown that features from the lower layers of deep networks are color blobs, edges, and shapes [13]. These basic features are usually readily transferable across datasets (e.g. data from different sources) [63].

In [64], Penatti et al. discussed feature generalization in the remote sensing domain. Empirical results suggested that transferred features are not always better than hand-crafted features, especially when dealing with scenes unique to remote sensing images. Windrim et al. [65] unveiled valuable insights into transfer learning in the context of hyperspectral image classification. In order to test the effect of filter size and wavelength interval, multiple hyperspectral datasets were acquired with different sensors. The performance of transfer learning was examined through a comparison with training the network from scratch, i.e., randomly initializing the network weights. Extensive experiments were carried out to investigate the impact of data size, network architecture, and so on. The authors also discussed the training convergence time and feature transferability under various conditions.

Despite the open questions that require more investigation, extensive studies have empirically shown the effectiveness of transfer learning for hyperspectral image analysis [66, 67, 68, 69, 70, 71, 72, 39, 73, 74, 51, 75, 76]. Marmanis et al. [66] introduced the pretrained model idea [1] for hyperspectral image classification: a pretrained AlexNet [1] was used as a fixed feature extractor and a two-layer CNN was attached for the final classification.
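A minimal PyTorch sketch of this fixed-feature-extractor style of transfer is given below; the torchvision AlexNet weights, the 15-class head, and the three-channel input it implies (e.g. after reducing the hyperspectral cube to three channels) are illustrative stand-ins rather than the exact setup of [66].

```python
# A minimal sketch of transfer by pretraining and fine-tuning: reuse an
# ImageNet-pretrained backbone and train only a small task-specific head.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False          # freeze the generic low-level features

# Replace the 1000-way ImageNet classifier with a head for our (assumed) 15 classes.
backbone.classifier[-1] = nn.Linear(4096, 15)

# Only the new head's parameters are updated during fine-tuning.
optimizer = torch.optim.Adam(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-4
)
```

Unfreezing some of the deeper backbone layers (rather than only the head) is the usual middle ground when the target dataset is moderately sized.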
Yang et al. [68] proposed a two-branch CNN for extracting spectral-spatial features; to solve the data scarcity problem, the weights of the lower layers were pretrained on another dataset and the entire network was then fine-tuned on the dataset of interest. Similar strategies have also been followed in [77, 69, 74].

Along with pretraining and fine-tuning, domain adaptation is another mechanism to transfer knowledge from one domain to another. Domain adaptation algorithms aim at learning a model from source data that can perform well on the target data. It can be considered a sub-category of transfer learning, where the input distribution p(X) changes while the conditional label distribution p(Y|X) remains the same across the two domains. Unlike the pretraining and fine-tuning method, which can be used when both distributions change, domain adaptation usually assumes the class-specific properties of the features within the two domains are correlated. This allows us to enforce stronger connections while transferring knowledge.

Othman et al. [70] proposed a domain adaptation network that can handle cross-scene classification when there is no labeled data in the target domain. Specifically, the network used three loss components for training: a classification loss (cross entropy) in the source domain, a domain matching loss based on maximum mean discrepancy (MMD) [78], and a graph regularization loss that aims to retain the geometrical structure of the unlabeled data in the target domain. The cross entropy loss ensures that features produced by the network are discriminative. Having discriminative features in the original domain has also been found to be beneficial for the domain matching process [22]. In order to undertake domain adaptation, features from the two domains were aligned by minimizing the distribution difference. Zhou and Prasad [76] proposed to align domains (more specifically, features in these domains) based on domain adaptation transformation learning (DATL) [22] – DATL aligns class-specific features in the two domains by projecting the two domains onto a common latent subspace such that the ratio of within-class distance to between-class distance is minimized in that latent space.

Next, we briefly review how a projection such as DATL can be used to align deep networks for domain adaptation, and present some results with multi-source hyperspectral data. Consider the distance between a source sample x_i^s and a target sample x_j^t in the latent space,

d(x_i^s, x_j^t) = \lVert f_s(x_i^s) - f_t(x_j^t) \rVert^2,   (1)

where f_s and f_t are feature extractors, e.g., CNNs, that transform samples from both domains to a common feature space. To make the feature space robust to small perturbations in the original source and target domains, stochastic neighborhood embedding is used to measure classification performance [79]. In particular, the probability p_{ij} of the target sample x_j^t being the neighbor of the source sample x_i^s is given as

p_{ij} = \frac{\exp(-\lVert f_s(x_i^s) - f_t(x_j^t) \rVert^2)}{\sum_{x_k^s \in \mathcal{D}^s} \exp(-\lVert f_s(x_k^s) - f_t(x_j^t) \rVert^2)},   (2)

where \mathcal{D}^s is the source domain. Given a target sample with its label (x_j^t, y_j^t = c), the source domain \mathcal{D}^s can be split into a same-class set \mathcal{D}_c^s = \{x_k^s \mid y_k = c\} and a different-class set \mathcal{D}_{\neq c}^s = \{x_k^s \mid y_k \neq c\}. In the classification setting, one wants to maximize the probability of making the correct prediction for x_j^t,
p_j = \frac{\sum_{x_i^s \in \mathcal{D}_c^s} \exp(-\lVert f_s(x_i^s) - f_t(x_j^t) \rVert^2)}{\sum_{x_k^s \in \mathcal{D}_{\neq c}^s} \exp(-\lVert f_s(x_k^s) - f_t(x_j^t) \rVert^2)}.   (3)

Maximizing the probability p_j is equivalent to minimizing the ratio of intra-class distances to inter-class distances in the latent space. This ensures that classes from the target domain and the source domain are aligned in the latent space. Note that the labeled data from the target domain (albeit limited) can be further used to make the features more discriminative. The final objective function of DATL can then be written as

L = \beta \frac{\sum_{x_i^s \in \mathcal{D}_c^s} \exp(-\lVert f_s(x_i^s) - f_t(x_j^t) \rVert^2)}{\sum_{x_k^s \in \mathcal{D}_{\neq c}^s} \exp(-\lVert f_s(x_k^s) - f_t(x_j^t) \rVert^2)} + (1 - \beta) \frac{\sum_{x_i^t \in \mathcal{D}_c^t} \exp(-\lVert f_t(x_i^t) - f_t(x_j^t) \rVert^2)}{\sum_{x_k^t \in \mathcal{D}_{\neq c}^t} \exp(-\lVert f_t(x_k^t) - f_t(x_j^t) \rVert^2)}.   (4)

The first term can be seen as a domain alignment term and the second term as a class separation term. \beta is a data-dependent trade-off parameter: the greater the difference between the source and target data, the larger the value of \beta that should be used, so as to put more emphasis on domain alignment.

Depending on the feature extractors, Eq. 4 can either be solved using conjugate gradient-based optimization [22] or treated as a loss and optimized using stochastic gradient descent [80]. DATL has been shown to be effective for addressing large domain shifts, such as between street-view and satellite hyperspectral images [22] acquired with different sensors and imaged from different viewpoints.

Figure 5 shows the architecture of the feature alignment neural network (FANN) that leverages DATL. Two convolutional recurrent neural networks (CRNN) were trained separately for the source and target domains. Features from corresponding layers were connected through an adaptation module, which is composed of a DATL term and a trade-off parameter that balances domain alignment and class separation. Specifically, the trade-off parameter \beta is automatically estimated via a proxy A-distance (PAD) [81],

\beta = \mathrm{PAD}/2 = 1 - 2\epsilon,   (5)

where PAD is defined as \mathrm{PAD} = 2(1 - 2\epsilon) and \epsilon \in [0, 1] is the generalization error of a linear SVM trained to discriminate between the two domains. Aligned features were then concatenated and fed to a final softmax layer for classification.
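A minimal sketch of Eq. (4) as a differentiable loss, for the stochastic gradient descent route of [80], is shown below. The batch-wise formulation (summing over all target anchors in a batch rather than a single x_j^t) and the small numerical constant are implementation assumptions.

```python
# A minimal sketch of the DATL objective of Eq. (4) as a loss for SGD.
import torch

def datl_loss(fs_feats, ys, ft_feats, yt, beta):
    """fs_feats/ft_feats: (n_s, d) and (n_t, d) latent features; ys/yt: labels."""
    def ratio(feats, labels, anchors, anchor_labels):
        d2 = torch.cdist(feats, anchors) ** 2   # pairwise squared distances
        sim = torch.exp(-d2)                    # exp(-||f(x) - f(x_j)||^2)
        same = (labels[:, None] == anchor_labels[None, :]).float()
        # Same-class similarity over different-class similarity, as in Eq. (4).
        return (sim * same).sum() / ((sim * (1.0 - same)).sum() + 1e-12)

    align = ratio(fs_feats, ys, ft_feats, yt)   # source-to-target alignment term
    # Target class-separation term (self-pairs contribute a constant and are
    # ignored here for brevity; in practice they would be masked out).
    separate = ratio(ft_feats, yt, ft_feats, yt)
    # Eq. (4) is to be maximized, so its negation serves as the training loss.
    return -(beta * align + (1.0 - beta) * separate)
```

In a FANN-style network, one such loss term would be attached at each pair of corresponding layers, with \beta set per Eq. (5).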
Fig. 5 The architecture of the feature alignment neural network. (Source: adapted from [76])
Table 3 Network configuration summary for the aerial and street view wetland hyperspectral dataset (A-S view wetland). (Source: adapted from [76])

FANN (A-S view wetland)
CRNN (Street) → DATL ← CRNN (Aerial)
(conv4-128 + maxpooling) → DATL ← (conv5-512 + maxpooling)
(conv4-128 + maxpooling) → DATL ← (conv5-512 + maxpooling)
(conv4-128 + maxpooling) → DATL ← (conv5-512 + maxpooling)
(conv4-128 + maxpooling) → DATL ← (conv5-512 + maxpooling)
(conv4-128 + maxpooling) → DATL ← (conv5-512 + maxpooling)
recur-64 → DATL ← recur-128
fully connected-12

The performance of FANN was evaluated on a challenging domain adaptation dataset introduced in [22]; see Fig. 6 for true color images of the source and target domains. The dataset consists of hyperspectral images of ecologically sensitive wetland vegetation in Galveston, TX, collected by different sensors from two viewpoints – "aerial" and "street-view" (using sensors with different spectral characteristics).
Fig. 6 Aerial and street view wetland hyperspectral dataset. Left: aerial view of the wetland data (target domain). Right: street view of the wetland data (source domain). (Source: adapted from [76])
Fig. 7 Mean spectral signature of the aerial view (target domain) wetland data (a) and the street view (source domain) wetland data (b). Different wetland vegetation species (classes c1-c12) are indicated by colors. (Source: adapted from [76])

Specifically, the aerial data were acquired using the ProSpecTIR VS sensor aboard an aircraft, and have 360 spectral bands ranging from 400 nm to 2450 nm with a 5 nm spectral resolution. The aerial view data were radiometrically calibrated and corrected; the resulting reflectance data have a spatial coverage of 3462 × …
Table 4 Overall classification accuracies of different domain adaptation methods on the aerial and street view wetland dataset. (Source: adapted from [76])
Fig. 8 t-SNE feature visualization of the aerial and street view wetland hyperspectral data at different stages of FANN. (a) Raw spectral features of street view data in the source domain. (b) CRNN features of street view data in the source domain. (c) Raw spectral features of aerial view data in the target domain. (d) CRNN features of aerial view data in the target domain. (e) FANN features for both domains in the latent space. (Source: adapted from [76])
As can be seen from Fig. 8, the raw hyperspectral features from the source and target domains are not aligned with each other. Due to the limited labeled data in the aerial view data, classes are mixed to a certain degree; the cluster structures are improved only slightly by the CRNN (compare Fig. 8 (c) and (d)). In contrast, the source data, i.e., the street view data, have a well-separated cluster structure. However, the classes are not aligned between the two domains; therefore, labels from the source domain cannot be used to directly train a classifier for the target domain. After passing all samples through the FANN, the two domains are aligned class-by-class in the latent space, as shown in Fig. 8 (e).
Table 5 Overall accuracy of the features of the alignment layers and the concatenated features for the aerial and street view wetland dataset. (Source: adapted from [76])

Layer  FA-1  FA-2  FA-3  FA-4  FA-5  FA-6  FANN
OA     …     …     …     …     …     …     …
To better understand the feature adaptation process, features from all layers were investigated individually and compared to the concatenated features. The performance of each alignment layer is shown in Table 5. Consistent with observations in [63], accuracies drop from the first layer to the fifth layer as features become more and more specialized toward the training data, and the resulting larger domain gap makes domain adaptation more challenging. The last layer (FA-6) was able to mitigate this problem because the recurrent layer has the ability to capture contextual information along the spectral direction of the hyperspectral data. Features from the last layer are the most discriminative ones, which allows the aligning module (DATL) to put more weight on domain alignment (cf. β in Eq. 4 and Eq. 5). The concatenated features obtained the highest accuracy compared to individual layers. As mentioned in [76], an improvement of this idea would be to learn combination weights for the different layers instead of a simple concatenation.

In addition to image classification / semantic segmentation tasks, the notion of transferring knowledge between sources and datasets has also been used for many other tasks, such as object detection [67], image super-resolution [71], and image captioning [72].

Compared to image-level labels, training an object detection model requires object-level labels and corresponding annotations (e.g. through bounding boxes). This increases the labeling requirements/costs for efficient model training. Effective feature representation is hence crucial to the success of these methods. As an example, in order to detect aircraft in remote sensing images, Zhang et al. [67] proposed to use the UC Merced land use dataset [82] as a background class to pretrain Faster R-CNN [83]. By doing this, the model gained an understanding of remote sensing scenes, which facilitated robust object detection. The underlying assumption in such an approach is that even though the foreground objects may not be the same, the background information remains largely unchanged across the sources (e.g. datasets), and can hence be transferred to a new domain.

Another important application for remotely sensed images is pansharpening, where a panchromatic image (which has a coarse/broad spectral resolution, but very high spatial resolution) is used to improve the spatial resolution of a multi-/hyperspectral image. However, a high-resolution panchromatic image is not always available for the same area that is covered by the hyperspectral images. To solve this problem, Yuan et al. [71] pretrained a super-resolution network with natural images and applied the model to the target hyperspectral image band by band. The underlying assumption in this work is that the spatial features in the high- and low-resolution images are the same in both domains, irrespective of the spectral content.

Traditional visual tasks like classification, object detection, and segmentation interpret an image at either the pixel or object level. Image captioning takes this notion a step further and aims to summarize a scene in a language that can be interpreted easily. Although many image captioning methods have been proposed for natural images, this topic has not been rigorously developed in the remote sensing domain. Shi et al. [72] proposed satellite image captioning using a pretrained fully convolutional network (FCN) [84]. The base network was pretrained for image classification on ImageNet.
To understand the images, three losses were defined at the object, environment, and landscape scales, respectively. Predicted labels at the different levels were then sent to a language generation model for captioning. In this work, the task in the target domain is very different from the one in the source domain; despite that, the pretrained model still provided features that are generic enough to help in understanding the target domain images.
Flipping and rotating images usually does not affect the class labels of objects within the image. A machine learning model can benefit if the training library is augmented with samples produced by these simple manipulations. By changing the input training images in a way that does not affect the class, algorithms can train on more examples of each object, and the models hence generalize better to test data. Data generation and augmentation share the same philosophy – to generate synthetic or transformed data that is representative of real-world data and can be used to boost training.

Data augmentation operations such as flipping, rotation, cropping and color jittering have been shown to be very helpful for training deep neural networks [1, 85, 86]; these operations have in fact become common practice when training models for natural image analysis tasks. Despite the differences between hyperspectral and natural images, standard augmentation methods like rotation, translation and flipping have been proven useful in boosting the classification accuracy of hyperspectral image analysis tasks [87, 88]. To simulate the variance in the at-sensor radiance and mixed pixels arising during the imaging process, Chen et al. [16] created virtual samples by multiplying existing samples by random factors and by linearly combining samples with random weights, respectively (a minimal sketch of these two operations is given after this paragraph). Li et al. [89] showed that the performance can be further improved by integrating spatial similarity through pixel block pairs built from 3 × 3 neighborhoods. A related idea was explored in [90], where unlabeled pixels were assigned labels for training if their k-nearest neighbors (in both the spatial and spectral domains) belong to the same class. Haut et al. [91] used a random occlusion idea to augment data in the spatial domain: regions are randomly erased from the hyperspectral images during training. As a consequence, the variance in the spatial domain is increased, leading to a model that generalizes better.
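Below is a minimal NumPy sketch of the two virtual-sample operations attributed to Chen et al. [16]; the scaling-factor range and the same-class pairing details are illustrative assumptions rather than the exact settings of that work.

```python
# A minimal sketch of virtual-sample augmentation for hyperspectral spectra:
# (1) random scaling simulates illumination/radiance variation, and
# (2) convex combinations of same-class spectra simulate mixed pixels.
import numpy as np

rng = np.random.default_rng(0)

def augment_spectra(X, y):
    """X: (n_samples, n_bands) labeled spectra; y: (n_samples,) labels."""
    # 1) Multiply each spectrum by a random scalar factor (assumed range).
    scaled = X * rng.uniform(0.9, 1.1, size=(X.shape[0], 1))
    # 2) Convex combinations of random same-class pairs.
    mixed = np.empty_like(X)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        a, b = rng.choice(idx, size=(2, idx.size))   # random pairs within class c
        w = rng.uniform(0, 1, size=(idx.size, 1))
        mixed[idx] = w * X[a] + (1 - w) * X[b]
    # Augmented samples inherit the labels of the originals.
    return np.vstack([scaled, mixed]), np.concatenate([y, y])
```

The augmented set is simply appended to the original training library before network training.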
Some flavors of data fusion algorithms can be thought of as playing the role of data augmentation, wherein supplemental data sources help the training of the models. For instance, a building roof and a paved road can both be made from similar materials – in such a case, it may be difficult for a model to differentiate these classes from the reflectance spectra alone. However, this distinction can easily be made by comparing their topographic information (e.g. using LiDAR data). A straightforward approach to fusing hyperspectral and LiDAR data would be training separate networks – one for each source/sensor – and combining their features either through concatenation [92, 93] or some other scheme such as a composite kernel [94]. Zhao et al. [95] presented data fusion of multispectral and panchromatic images. Instead of applying a CNN to the entire image, features were extracted for superpixels that were generated from the multispectral image. In particular, a fixed-size window around each superpixel was split into multiple regions and the image patch in each region was fed into a CNN for extracting local features. These local features were sent to an autoencoder for fusion, and a softmax layer was added at the end for prediction. Due to its relatively high spatial resolution, the panchromatic image can produce spatial segments at a finer scale than the multispectral image. This was leveraged to refine the predictions by further segmenting each superpixel based on the panchromatic image.

Aside from augmenting the input data, generating synthetic data that resembles real-life data is another approach to increasing the number of training samples. The generative adversarial network (GAN) [96] introduced a trainable approach to generating new synthetic samples. A GAN consists of two sub-networks, a generator and a discriminator. During training, the two components play a game with each other: the generator tries to fool the discriminator by producing samples that are as realistic as possible, and the discriminator tries to discern whether a sample is synthetically generated or belongs to the training data. After the training process converges, the generator is able to produce samples that look similar to the training data. Since it does not require any labeled data, there has been increasing interest in using GANs for data augmentation in many applications, and they have recently been applied to hyperspectral image analysis [49, 73, 97, 98]. Both [49] and [98] used a GAN for hyperspectral image classification, where a softmax layer was attached to the discriminator. Fake data were treated as an additional class in the training set; since a large amount of unlabeled data was used for training the GAN, the discriminator became good at classifying all samples. A transfer learning idea was proposed for super-resolution in [73], where a GAN is pretrained on a relatively large dataset and fine-tuned on the UC Merced land use dataset [82].

In this chapter, we reviewed recent advances in deep learning for hyperspectral image analysis. Although a lot of progress has been made in recent years, there are still many open problems and related research opportunities. In addition to making advances in algorithms and network architectures (e.g. networks for multi-scale, multi-sensor data analysis, data fusion, image super-resolution etc.), there is a need to address fundamental issues that arise from insufficient data and the fundamental nature of the data being acquired. Towards this end, the following directions are suggested:

• Hyperspectral ImageNet. We have witnessed the immense success brought about in part by the ImageNet dataset for traditional image analysis. The benefit of building a similar dataset for hyperspectral images is compelling. If such libraries can be created for various image analysis tasks (e.g. urban land-cover classification, ecosystem monitoring, material characterization etc.), they will enable learning truly deep networks that learn highly discriminative spatial-spectral features.

• Interdisciplinary collaboration. Developing an effective model for analyzing hyperspectral data requires a deep understanding of both the properties of the data itself and machine learning techniques. With this in mind, networks that reflect the optical characteristics of the sensing modalities (e.g. inter-channel correlations) and the variability caused during acquisition (e.g. varying atmospheric conditions) should add more information for the underlying analysis tasks compared to "black-box" networks.
References
1. A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
2. D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., Learning representations by back-propagating errors, Cognitive Modeling 5 (3) (1988) 1.
3. A. Krogh, J. A. Hertz, A simple weight decay can improve generalization, in: Advances in Neural Information Processing Systems, 1992, pp. 950–957.
4. G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (7) (2006) 1527–1554.
5. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
6. S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167.
7. R. Caruana, S. Lawrence, C. L. Giles, Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping, in: Advances in Neural Information Processing Systems, 2001, pp. 402–408.
8. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
9. G. Camps-Valls, D. Tuia, L. Bruzzone, J. A. Benediktsson, Advances in hyperspectral image classification: Earth monitoring with statistical learning methods, IEEE Signal Processing Magazine 31 (1) (2014) 45–54.
10. J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader, J. Chanussot, Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5 (2) (2012) 354–379.
11. Y. LeCun, K. Kavukcuoglu, C. Farabet, Convolutional networks and applications in vision, in: Proceedings of 2010 IEEE International Symposium on Circuits and Systems, IEEE, 2010, pp. 253–256.
12. Y.-L. Boureau, F. Bach, Y. LeCun, J. Ponce, Learning mid-level features for recognition, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 2559–2566.
13. J. Yosinski, J. Clune, T. Fuchs, H. Lipson, Understanding neural networks through deep visualization, in: ICML Workshop on Deep Learning, 2015.
14. L. Shen, S. Jia, Three-dimensional Gabor wavelets for pixel-based hyperspectral imagery classification, IEEE Transactions on Geoscience and Remote Sensing 49 (12) (2011) 5039–5046.
15. X. Zhou, S. Prasad, M. M. Crawford, Wavelet-domain multiview active learning for spatial-spectral hyperspectral image classification, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (9) (2016) 4047–4059.
16. Y. Chen, H. Jiang, C. Li, X. Jia, P. Ghamisi, Deep feature extraction and classification of hyperspectral images based on convolutional neural networks, IEEE Transactions on Geoscience and Remote Sensing 54 (10) (2016) 6232–6251.
17. Y. Li, H. Zhang, Q. Shen, Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network, Remote Sensing 9 (1) (2017) 67.
18. M. Paoletti, J. Haut, J. Plaza, A. Plaza, A new deep convolutional neural network for fast hyperspectral image classification, ISPRS Journal of Photogrammetry and Remote Sensing 145 (2018) 120–147.
19. Z. Zhong, J. Li, Z. Luo, M.
Chapman, Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework, IEEE Transactions on Geoscience and Remote Sensing 56 (2) (2018) 847–858.
20. Pavia University hyperspectral data.
21. University of Houston hyperspectral data, http://hyperspectral.ee.uh.edu/?page_id=459