Incremental Import Vector Machines for Classifying Hyperspectral Data
Ribana Roscher, Björn Waske, Member, IEEE, and Wolfgang Förstner, Member, IEEE

The authors are with the Institute of Geodesy and Geoinformation, Faculty of Agriculture, University of Bonn, 53115 Bonn, Germany (e-mail: [email protected], [email protected], [email protected]). This work was supported by the CROP.SENSe.net project, funded by the German Federal Ministry of Education and Research (BMBF) within the scope of the competitive grants program "Networks of excellence in agricultural and nutrition research" (FKZ: 0315529). Manuscript received May 4, 2011.
Abstract—In this paper we propose an incremental learning strategy for import vector machines (IVM), which is a sparse kernel logistic regression approach. We use the procedure for the concept of self-training for sequential classification of hyperspectral data. The strategy comprises the inclusion of new training samples to increase the classification accuracy and the deletion of non-informative samples to be memory- and runtime-efficient. Moreover, we update the parameters in the incremental IVM model without re-training from scratch. Therefore, the incremental classifier is able to deal with large data sets. The performance of the IVM in comparison to support vector machines (SVM) is evaluated in terms of accuracy, and experiments are conducted to assess the potential of the probabilistic outputs of the IVM. Experimental results demonstrate that the IVM and SVM perform similarly in terms of classification accuracy. However, the number of import vectors is significantly lower than the number of support vectors, and thus the computation time during classification can be decreased. Moreover, the probabilities provided by IVM are more reliable than the probabilistic information derived from an SVM's output. In addition, the proposed self-training strategy can increase the classification accuracy. Overall, the IVM and its incremental version are worthwhile for the classification of hyperspectral data.
Index Terms—Import vector machines, incremental learning, hyperspectral data, self-training.
I. INTRODUCTION
Hyperspectral imaging, also known as imaging spectroscopy, has been used for more than two decades for monitoring the Earth [1]. The spectrally continuous data range from the visible to the short-wave infrared region of the electromagnetic spectrum and thus enable a detailed separation of similar surface materials. Therefore, hyperspectral imagery is used for classification problems that require a precise differentiation in spectral feature space [2]–[4]. Hyperspectral applications become even more attractive regarding the increased availability of hyperspectral imagery through future space-borne missions, such as the German EnMAP (Environmental Mapping and Analysis Program) [5] and the Italian PRISMA (Hyperspectral Precursor of the Application Mission). Nevertheless, the special properties of hyperspectral imagery demand more sophisticated image (pre-)processing and
analysis [6], [7]. Conventional methods, such as the maximum likelihood classifier, can be limited when applied to hyperspectral imagery, due to the high-dimensional feature space and a finite number of training samples. Consequently, the classification accuracy often decreases with an increasing number of bands (i.e., the well-known Hughes phenomenon). Thus, more flexible classifiers, such as the spectral angle mapper, neural networks and support vector machines (SVM), are applied to hyperspectral imagery [2], [4], [8], [9].

Among the various developments in the field of pattern recognition, SVM [10] are perhaps the most popular approach in recent hyperspectral applications [7]. SVM can outperform other methods in terms of classification accuracy [11], [12] and are still subject to further modification and improvement, e.g., in the context of modified kernel functions [13] and semi-supervised learning [14]. Whereas other classifiers can directly solve multi-class problems, the binary nature of SVM requires an adequate multi-class strategy (e.g., [15]). In contrast to other classifiers, which directly provide class labels or probabilities, SVM provide the distance of each pixel to the hyperplane of the binary classification problem. This information is used to determine the final class membership. Although the output of SVM can be transformed to probabilities (e.g., [16]), the reliability of these values can be inadequate [17], [18]. Nevertheless, probabilities are of interest and can be used, e.g., as input to a Markov random field model or for uncertainty analysis [19]–[21].

Logistic regression, an alternative probabilistic discriminative classification model, has already been used in the context of classification and feature selection of hyperspectral imagery [22], [23]. The approach was extended to kernel logistic regression, e.g., [24], [25], showing a better accuracy but a higher complexity. To overcome the limitations regarding efficiency and computation time, several sparse realizations of (kernel) logistic regression have been developed, including the explicit usage of a sparsity-enforcing prior [26]–[29], an implicit prior used in the relevance vector machines (RVM) [17], [18], [30], or a greedy subset selection in the concept of import vector machines (IVM) [31].

Recent classifier developments, such as SVM, usually perform well on high-dimensional data sets. Nevertheless, the classification accuracy can be affected when a limited number of training samples is available [2], [32]. One possibility to overcome this problem is active learning, which has been studied extensively in the literature, e.g., [27], [33]–[35]. In this paper we use the specialized concept of self-training [36]. Self-training is based on the sequential training of a classifier with new training samples, i.e., starting with an initial classifier model learned from a few labeled training samples, an iterative classification procedure is performed.
After each classification, new relevant training samples are selected and usually the classifier is re-trained from scratch. That means the classifier is learned using old and new training samples without taking into account the previously learned classifier model. For the acquisition of new training samples we use spectral as well as spatial information by using a discriminative random field (DRF) [37]. The gain of integrating spatial information has already been discussed in several studies in the context of classifying hyperspectral data [20], [21], [26], [34], [38]. Moreover, we consider the probabilistic outputs of the IVM to assess the reliability of the classification result.

However, to be time- and memory-efficient we need an incremental learning strategy within the self-training approach. Several incremental learning methods have been proposed, extending classical and state-of-the-art off-line methods. Incremental generative models, e.g., [39], provide probabilities but tend to become very complex. Discriminative models, on the other hand, like incremental linear discriminant analysis [40] or incremental support vector machines, e.g., [41], show a good performance in classification tasks. Kernel-based algorithms have achieved considerable success in incremental and off-line learning settings [42], [43]. However, incremental kernel-based learning settings need strategies for dealing with existing and new samples. The challenge is to perform efficient, incremental update steps without suffering a loss in performance.

The main objective is to propose an incremental learning strategy to update the trained IVM model. The classifier is called incremental IVM. The IVM model is therefore not re-trained from scratch, as usually done in the context of active learning and self-training approaches. The incremental IVM consists of the selection of new training samples, the deletion of irrelevant training samples, and the update of the IVM model. Therefore, the classifier is able to deal with an unbounded stream of new training samples.

The potential of the IVM and its incremental version in the context of classifying hyperspectral data is evaluated. The performance is compared to SVM in terms of accuracy. In addition, the reliability of the probabilistic outputs provided by IVM and SVM is evaluated by using a discriminative random field, and the effect is further investigated by rejecting uncertain test samples. Our study aims at the classification of three different hyperspectral data sets, i.e., two urban areas from the city of Pavia and an agricultural area from Indiana, USA, using SVM and IVM.

The paper is organized as follows. Section II discusses logistic regression, kernel logistic regression and the IVM algorithm. Moreover, the concept of SVM and related classifiers is briefly introduced and compared to IVM. Section III introduces the proposed strategy for self-training, including the DRF model. Also the incremental IVM is explained. The experimental setup is given in Section IV. The results are presented and discussed in Section V. We conclude in Section VI.
II. THEORETICAL BACKGROUND
In this section the IVM model is introduced. Starting from logistic regression, we discuss kernels and sparsity and finally the sparse kernel logistic regression model, i.e., the IVM. Also a brief introduction to SVM and sparse multinomial logistic regression is given. In Section II-D the discriminative random field, which incorporates spatial information, is introduced.
A. Logistic Regression and Kernel Logistic Regression

a) Logistic Regression:
We assume a training set $(x_n, y_n)$, $n = 1, \ldots, N$ of $N$ labeled samples with feature vectors $x_n \in \mathbb{R}^M$ and class labels $y_n \in \mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_K\}$. The observations are collected in a matrix $X = [x_1, \ldots, x_N]$, while the corresponding labels are summarized in the vector $y = [y_1, \ldots, y_N]$.

In the two-class case the posterior probability $p_n$ of a feature vector $x_n$ is assumed to follow the logistic regression model

$$p_n = p(y_n = \mathcal{C}_1 \mid x_n; w) = \frac{1}{1 + \exp(-w^T x_n)} \quad (1)$$

with the extended feature vector $x_n^T = [1, \tilde{x}_n^T] \in \mathbb{R}^{M+1}$ and the extended parameters $w^T = [w_0, \tilde{w}^T] \in \mathbb{R}^{M+1}$ containing the bias $w_0$ and the weight vector $\tilde{w}$.

b) Kernel Logistic Regression: In linearly non-separable cases, the original observations $X$ are implicitly mapped from the input space to a higher-dimensional kernel space with the kernel matrix $K = [k_{nm}]$ via the kernel function $k_{nm} = k(x_n, x_m)$. The kernel matrix $K$ consists of affinities between the points, depending on the distance measure defined by the kernel function.

The parameters, referred to as $\alpha$ in the kernel-based approach, are determined in an iterative way with

$$\alpha^{(i)} = \left( \tfrac{1}{N} K^T R K + \lambda K \right)^{-1} K^T R z \quad (2)$$

$$z = \tfrac{1}{N} \left( K \alpha^{(i-1)} + R^{-1} (p - t) \right) \quad (3)$$

by optimizing the objective function

$$Q^{(i)} = -\tfrac{1}{N} \sum_n \left[ t_n \log p_n + (1 - t_n) \log(1 - p_n) \right] + \lambda \, \alpha^{(i)T} K \alpha^{(i)} \quad (4)$$

using the Newton-Raphson procedure. The $(N \times N)$-dimensional diagonal matrix $R$ has the elements $r_{nn} = p_n (1 - p_n)$ with $p_n = 1/(1 + \exp(-k_n \alpha))$ and $k_n$ as the $n$-th row of the kernel matrix $K$. The binary target vector $t \in \{0, 1\}^N$ of length $N$ codes the labels with $t_n = 1$ for $y_n = \mathcal{C}_1$ and $t_n = 0$ for $y_n = \mathcal{C}_2$. Additionally, we add an $L_2$-norm regularization term with parameter $\lambda$ to prevent overfitting.
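To make the update rule concrete, the following is a minimal NumPy sketch of one possible implementation of the IRLS/Newton-Raphson iteration in (2)–(4), assuming an RBF kernel. All function and variable names are illustrative and not taken from the authors' implementation, and the sign of the working residual follows the standard IRLS derivation for the target coding given above.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF affinities k(a, b) = exp(-gamma * ||a - b||^2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def klr_train(X, t, gamma=1.0, lam=1e-3, n_iter=50, tol=1e-6):
    """Two-class kernel logistic regression trained with IRLS, cf. (2)-(4)."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    alpha, Q_old = np.zeros(N), np.inf
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))           # p_n = sigma(k_n alpha)
        r = np.clip(p * (1.0 - p), 1e-10, None)         # diagonal of R
        z = (K @ alpha + (t - p) / r) / N               # working response, cf. (3)
        A = (K.T * r) @ K / N + lam * K                 # (1/N) K^T R K + lambda K
        alpha = np.linalg.solve(A, (K.T * r) @ z)       # Newton step, cf. (2)
        p = 1.0 / (1.0 + np.exp(-K @ alpha))            # posteriors after the step
        Q = -np.mean(t * np.log(p + 1e-12) + (1 - t) * np.log(1 - p + 1e-12)) \
            + lam * alpha @ K @ alpha                    # objective, cf. (4)
        if abs(Q_old - Q) < tol * max(abs(Q), 1e-12):
            break
        Q_old = Q
    return alpha
```

Because all $N$ training samples enter $K$, one iteration costs on the order of $N^3$ operations, which motivates the sparse subset selection of the IVM discussed next.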
B. Import Vector Machines

The kernel logistic regression includes all training samples to train the classifier, which is computationally expensive and memory intensive for data sets with many training samples. Similar to the SVM, the IVM algorithm [31] chooses a subset $\mathcal{V}$ of feature vectors out of the training set with $V = |\mathcal{V}|$ samples $X_{\mathcal{V}} = [x_{\mathcal{V},m}]$, $m = 1, \ldots, V$, obtaining a sparse solution of the kernel logistic regression. These feature vectors are called import vectors.

Following (2) and (3), the parameters in iteration $i$ are determined by

$$\alpha^{(i)} = \left( \tfrac{1}{N} K_{\mathcal{V}}^T R K_{\mathcal{V}} + \lambda K_R \right)^{-1} K_{\mathcal{V}}^T R z \quad (5)$$

$$z = \tfrac{1}{N} \left( K_{\mathcal{V}} \alpha^{(i-1)} + R^{-1} (p - t) \right). \quad (6)$$

The $(N \times V)$-dimensional kernel matrix is given by $K_{\mathcal{V}} = [k(x_n, x_{\mathcal{V},m})]$ and the $(V \times V)$-dimensional regularization matrix by $K_R = [k(x_{\mathcal{V},l}, x_{\mathcal{V},m})]$, $\{l, m\} = 1, \ldots, V$.

The IVM procedure is illustrated in Algorithm 1.

  Initialize $\mathcal{V} := \{\}$, $X := \{x_1, \ldots, x_N\}$, $i := 0$;
  repeat
    Compute $z^{(i)}$ from the current set $\mathcal{V}^{(i)}$;
    foreach $x_n \in X^{(i)}$ do
      Let $\mathcal{V}^{(i)}_n := \mathcal{V}^{(i)} \cup x_n$;
      Compute $\alpha^{(i)}_n$ from $\mathcal{V}^{(i)}_n$ in a one-step iteration;
      Evaluate the error function $Q^{(i)}_n$;
    end
    Find the best point $x^* = x_n$ with $n = \arg\min_n Q^{(i)}_n$;
    Update $\mathcal{V}^{(i+1)} := \mathcal{V}^{(i)} \cup x^*$, $X^{(i+1)} := X^{(i)} \setminus x^*$, $i := i + 1$;
  until $Q$ converged;
Algorithm 1 — IVM: In every iteration $i$ each point $x_n \in X^{(i)}$ from the current training set $X^{(i)}$ is tested for inclusion in the set of import vectors $\mathcal{V}^{(i)}$. The point $x^*$ yielding the lowest error $Q^{(i)}_n$ is included. The algorithm stops as soon as $Q$ has converged.

The convergence criterion is given by the ratio $\epsilon = |Q^{(i)} - Q^{(i - \Delta i)}| / |Q^{(i)}|$ with a small integer $\Delta i$.

The original algorithm selects the import vectors in a greedy forward selection procedure. The approach is extended to a forward stepwise selection, which allows forward and backward steps. The advantage of this procedure is that import vectors which once entered the model can be dropped if they are no longer relevant. In all experiments an improvement of the results could be observed. Furthermore, an incremental update procedure is used to compute the inverse in (5) depending on the last iteration, which makes the algorithm more efficient. The incremental update is described in detail in Section III-C.

The two-class model can be generalized to the multi-class model. Then the objective function is

$$Q = -\tfrac{1}{N} \sum_n t_n^T \log p_n + \lambda \sum_k \alpha_k^T K_R \alpha_k \quad (7)$$

with the probabilities $P = [p_1, \ldots, p_N]$ obtained by

$$p_{nk} = \frac{\exp(k_{\mathcal{V},n} \alpha_k)}{\sum_l \exp(k_{\mathcal{V},n} \alpha_l)}. \quad (8)$$

The binary target vector $t_n$ of length $K$ uses the 1-of-$K$ coding scheme, so that all components but $t_{nk}$ are zero if the point $x_n$ is from class $\mathcal{C}_k$. In the Newton-Raphson procedure in (5) and (6) we have to use one $R_k$ and one $z_k$ for each class.
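As an illustration of Algorithm 1, the sketch below runs the greedy forward selection for the two-class case, performing a one-step IRLS update of (5)–(6) for every candidate and keeping the point with the lowest objective (4). It omits the backward steps of the forward stepwise variant and the incremental inverse update; all names are illustrative.

```python
import numpy as np

def ivm_greedy_select(K_full, t, lam=1e-3, eps=1e-3, delta_i=1, max_vectors=200):
    """Greedy forward selection of import vectors (cf. Algorithm 1, two-class case).

    K_full : (N, N) kernel matrix of all training samples,
    t      : (N,)  binary targets in {0, 1}.
    """
    N = K_full.shape[0]
    V, Q_hist, alpha_best = [], [], None
    candidates = list(range(N))
    while candidates and len(V) < max_vectors:
        best_Q, best_n, best_alpha = np.inf, None, None
        for n in candidates:
            idx = V + [n]
            K_V = K_full[:, idx]                      # (N, |V|+1) kernel matrix K_V
            K_R = K_full[np.ix_(idx, idx)]            # (|V|+1, |V|+1) regularization matrix K_R
            alpha = np.zeros(len(idx))
            p = 1.0 / (1.0 + np.exp(-K_V @ alpha))
            r = np.clip(p * (1.0 - p), 1e-10, None)
            z = (K_V @ alpha + (t - p) / r) / N       # one-step IRLS update, cf. (5)-(6)
            A = (K_V.T * r) @ K_V / N + lam * K_R
            alpha = np.linalg.solve(A, (K_V.T * r) @ z)
            p = 1.0 / (1.0 + np.exp(-K_V @ alpha))
            Q = -np.mean(t * np.log(p + 1e-12) + (1 - t) * np.log(1 - p + 1e-12)) \
                + lam * alpha @ K_R @ alpha           # objective (4) restricted to the subset
            if Q < best_Q:
                best_Q, best_n, best_alpha = Q, n, alpha
        V.append(best_n)
        candidates.remove(best_n)
        Q_hist.append(best_Q)
        alpha_best = best_alpha
        # stop as soon as Q has converged: |Q(i) - Q(i - delta_i)| / |Q(i)| < eps
        if len(Q_hist) > delta_i and \
           abs(Q_hist[-1] - Q_hist[-1 - delta_i]) / abs(Q_hist[-1]) < eps:
            break
    return V, alpha_best
```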
C. Related Classifiers

Recently, several algorithms have been developed which enforce sparseness to control both the generalization capability of the learned classifier model and its complexity, e.g., [44]–[47]. In these algorithms the model consists of a sparse weighted linear combination of basis functions, which are the input features themselves, nonlinear transformations of them, or kernels centered on them. In this section we restrict ourselves to a review of realizations of sparse (kernel) logistic regression and of SVM, whereby the latter is used for comparison in the experiments.
1) Realizations of Sparse (Kernel) Logistic Regression:
Using logistic regression or its kernel realization can be prohibitive regarding memory and time requirements if the dimension of the features or the number of training samples is large. Several sparse algorithms have been developed in recent years to overcome this problem.

The relevance vector machine [18] uses the same model as the kernel logistic regression in combination with an implicit prior as regularization term, the so-called ARD (automatic relevance determination) prior [48], to induce sparseness. The prior includes several regularization parameters, also called hyperparameters, which are determined during the optimization process. The algorithm has been shown to be very sparse, but it also tends to underfit, leading to a model that does not generalize well [29]. Additionally, the RVM uses an expectation-maximization (EM)-like learning method and can therefore suffer from local minima, leading to non-optimal classification results.

Alternatively, [28], [29] use a Laplace prior to enforce sparseness, whose regularization parameter is determined via cross-validation. These approaches propose sparse multinomial (kernel) logistic regression (SMLR) using different methods for a fast computation. In the field of hyperspectral image classification these approaches have been applied and further developed in, e.g., [26], [27].
2) Support Vector Machines:
The SVM finds an optimal nonlinear decision boundary by minimizing the objective function

$$Q_{SVM} = \tfrac{1}{N} \sum_n \left[ 1 - y_n f(x_n) \right]_+ + \lambda \|f\|^2 \quad (9)$$

with $f(x_n) = \sum_n \alpha_n K(X, x_n)$.

Contrary to the IVM, which maximizes the posterior probabilities, the SVM aims to maximize the margin between the hyperplane and the closest training samples, the so-called support vectors. The SVM is a binary classifier, with the decision rule given by the sign of $f(x_n)$.
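For reference, the post-processing mentioned in the Introduction can be sketched as fitting a sigmoid to the SVM decision values, in the spirit of [16]. The simplified maximum-likelihood fit below uses illustrative names and omits the regularized targets of the full method; it is a sketch, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(f_val, y):
    """Fit p(y=1 | f) = 1 / (1 + exp(a*f + b)) to decision values f_val
    and binary labels y in {0, 1} by minimizing the cross-entropy."""
    def nll(params):
        a, b = params
        p = np.clip(1.0 / (1.0 + np.exp(a * f_val + b)), 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    a, b = minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead").x
    return lambda f: 1.0 / (1.0 + np.exp(a * f + b))
```

Such a mapping yields values in [0, 1], but, as discussed in the following comparison, they are not necessarily well-calibrated posteriors.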
3) Comparison with Import Vector Machines:
The main properties of SVM, IVM, SMLR and RVM are summarized in Table I and briefly compared below.

The objective function of the SVM is quadratic and therefore convex, and it can be solved efficiently with the sequential minimal optimization (SMO) algorithm [49]. The objective function of the IVM is convex, but non-quadratic. The function
can be solved with iterated re-weighted least squares (IRLS) and a greedy forward selection of import vectors.

All models are sparse, i.e., the model parameters are mostly zero. The sparseness arises from different techniques, namely the usage of a prior in the SMLR models and the RVM, or the greedy selection of a fraction of the training samples in the IVM model. However, IVM and RVM have been shown to be sparser than SVM [30], [31] and thus require less computation time during classification. The training time of the IVM, on the other hand, can be longer than for the SVM, because of the non-quadratic objective function. Nonetheless, the training time depends on the number of training samples and the attainable sparseness, so that SMLR and IVM can also reach a faster training time.

In comparison to the RVM and SMLR approaches, which use the whole kernel matrix during the optimization procedure, the IVM algorithm is suitable for large data sets, since only a subset of the training samples is used for the computations during the optimization. Consequently, only a fraction of the whole kernel matrix has to be computed and stored. In contrast, e.g., the standard RVM algorithm has to solve a large matrix inversion of the size of the number of training samples during the optimization process and therefore can be slow and intractable for large data sets.

While IVM, SMLR and RVM directly provide a probabilistic output, the output of the SVM has to be transformed to probabilities in a post-processing step [16]. However, as Tipping [18] shows, the transformed output is not necessarily statistically interpretable.

IVM, SMLR and RVM are introduced as multi-class classifiers. The standard SVM can also be applied to multi-class problems, but it needs a coupling strategy to obtain multi-class classification results. Moreover, these coupling approaches are more suitable for practical use than direct multi-class formulations of the SVM [50].

TABLE I
Summary of the SVM, RVM, SMLR and IVM algorithms regarding several characteristics. Here "–/0/+" means that the algorithm satisfies a certain property "barely/partially/completely".

Algorithm  Objective function, optimization procedure                       Sparse  Training time  Testing time  Probabilistic  Multi-class
SVM        convex, greedy with SMO algorithm                                0       +              0             0              0
RVM        nonconvex, IRLS, EM-like                                         +       –              +             +              +
SMLR       convex, see Sec. II-C1                                           0/+     0/+            0/+           +              +
IVM        convex, IRLS with greedy forward stepwise selection              +       0/+            +             +              +
           (see Sec. II-B and III-C)
D. Discriminative Random Fields
A discriminative random field is employed to model prior knowledge about the neighborhood relations within the image. The final classification is assumed to be smooth, i.e., neighboring pixels are more likely to belong to the same class than to different classes. With this, the best classification $C_{DRF}$ is given by the argument of the minimum of the energy

$$E(y) = -\sum_{j \in \mathcal{I}} \log p(y_j \mid x_j) - \beta \sum_{\{m,j\} \in \mathcal{N}} \delta(y_j, y_m), \quad (10)$$

where $x_j$ is the observed feature vector of the $j$-th pixel, $\mathcal{I}$ is the set of all pixels and $\delta$ is the Kronecker delta function. The weighting parameter is given by $\beta$, which can be determined via cross-validation. The first term in (10) models the probability of a class assignment $y_j$ of the $j$-th pixel, defined by the probabilistic output of the IVM. The second term describes the interaction potential as a Potts model over a 2D lattice, penalizing every dissimilar pair of labels and therefore heterogeneous regions. The set of all neighboring pixels is given by $\mathcal{N}$.
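The energy in (10) can be evaluated directly for a candidate labeling. The NumPy sketch below assumes a 4-connected neighborhood and uses illustrative argument names; in the experiments the minimization itself is carried out with the graph-cut algorithm [58], while the snippet only computes $E(y)$.

```python
import numpy as np

def drf_energy(labels, log_post, beta):
    """Energy of eq. (10) for a label image.

    labels   : (H, W) integer class labels y_j,
    log_post : (H, W, K) log posteriors log p(y_j = k | x_j) from the IVM,
    beta     : weight of the Potts interaction potential.
    """
    rows, cols = np.indices(labels.shape)
    unary = -log_post[rows, cols, labels].sum()              # -sum_j log p(y_j | x_j)
    # Potts term over the 4-neighborhood: each pair of equal labels lowers the energy by beta.
    same_right = (labels[:, :-1] == labels[:, 1:]).sum()
    same_down  = (labels[:-1, :] == labels[1:, :]).sum()
    return unary - beta * (same_right + same_down)
```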
III. INCREMENTAL LEARNING STRATEGY FOR SELF-TRAINING

In this section we introduce the self-training concept and the learning strategy for the incremental IVM. Self-training refers to the sequential selection of new training data and the adaptation of the previous classifier model. The previous classifier model can (i) be neglected and re-trained from scratch or (ii) be updated incrementally. Following (i), a regular classifier training is performed, using the whole training sample set (i.e., previous + new training samples). The latter approach "simply" updates the previous model, using the newly selected training samples.
A. Self-training Concept
Fig. 1. Self-training scheme consisting of two steps: In the first step (Fig. 1a) the image is classified with the learned incremental IVM model and the DRF. The probabilistic output of the IVM and the classification result of the DRF are used to acquire new training samples. In the second step (Fig. 1b) the classifier model is incrementally updated.
In the first step (Fig. 1a) we identify potential new training samples. For this selection step, we use the classification result provided by the DRF and the probabilistic outputs of the incremental IVM. In the second step (Fig. 1b) we use the incremental learning strategy to update the classifier model without re-training from scratch. The procedure is repeated until no additional training samples can be selected. The latter step consists of the inclusion of new training samples, the deletion of irrelevant training samples and, finally, the update of the IVM model. Therefore the approach can handle large data sets or even infinite data streams. However, this step is independent of the used self-training strategy, and other approaches can be used to identify new training samples. Both steps are explained in detail in the next paragraphs.
B. Acquisition of New Training Samples for Self-training
This acquisition step, illustrated in Fig. 2, is subdivided into three parts: First, we use the DRF as an expert to identify worthwhile samples by evaluating the disagreement between the classification $C$ yielded by the IVM classifier and the DRF result $C_{DRF}$. This implies that the samples with estimated label $y_{DRF,j}$ derived from the DRF are sufficiently uncertain, i.e., the maximum posterior probability $\max(p_j)$ obtained from the IVM classifier lies below a fixed threshold. Using such samples we ensure progress in self-training.

Second, we exclude from the selected samples those whose probability is too small, i.e., the influence of the new samples is restricted to ensure that the model is changed gradually and remains stable.

Finally, we sort the chosen samples by their potential influence on the model, to enable the flexibility of the model during self-training. The potential influence uses the concept of leverage points in regression [51]. The leverage values in a weighted regression are contained in the vector

$$l = \mathrm{diag}\left( K_{pot} \left( K_{pot}^T R_{pot} K_{pot} \right)^{-1} K_{pot}^T R_{pot} \right). \quad (11)$$

The kernel matrix obtained from the potential new training samples $X_{pot}$ is given by $K_{pot} = [k(x_{pot,j}, x_{\mathcal{V},m})]$, and $R_{pot}$ is the weight matrix of the class the training sample belongs to. Training samples with a high leverage value are considered first, since they cause large effects in the learned model. The self-training is stopped if no more training samples can be acquired.

To prevent an acquisition which leads to an imbalanced number of training samples per class, we ensure sampling an equal number of training samples for each class. If not enough new training samples can be acquired with the proposed self-training approach, we also consider training samples with a high probability and whose labels in $C$ and $C_{DRF}$ are the same. These samples have a small influence on the model and should not change the result too much, but they balance the number of samples of each class. If there are still not enough samples for a class, we follow the over-sampling approach of duplicating existing samples from the concerned class at random and adding a small amount of noise to them [52].
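The leverage-based ranking in (11) translates directly into a few lines of NumPy. The sketch below uses illustrative names and an illustrative probability window; the disagreement between $C$ and $C_{DRF}$ is assumed to be tested beforehand.

```python
import numpy as np

def leverage_values(K_pot, R_pot):
    """Leverage of candidate samples, cf. eq. (11).

    K_pot : (P, V) kernel values between candidates and the import vectors,
    R_pot : (P,)   IVM weights p(1 - p) of the class each candidate is assigned to.
    """
    A = (K_pot.T * R_pot) @ K_pot                         # K_pot^T R_pot K_pot
    H = K_pot @ np.linalg.solve(A, K_pot.T * R_pot)       # hat matrix of the weighted regression
    return np.diag(H)

def rank_candidates(K_pot, R_pot, p_max, low=0.5, high=0.9):
    """Keep candidates whose maximum IVM posterior lies in (low, high), i.e.,
    uncertain but not too unreliable, and sort them by decreasing leverage.
    The thresholds are illustrative placeholders, not the values used in the paper."""
    lev = leverage_values(K_pot, R_pot)
    keep = np.where((p_max > low) & (p_max < high))[0]
    return keep[np.argsort(-lev[keep])]
```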
C. Incremental Learning
To update the learned classifier, we consider the following aspects:
• New training samples acquired with the proposed self-training approach (see Section III-B) are included.
• Non-informative samples are deleted.
• The set of import vectors is updated.

a) Update Training Vectors:
For the two-class case the incremental learning procedure is stated as follows. We add training vectors $X_\Delta$ with targets $t_\Delta$ so that $N_{(s)} := N_{(s-1)} + \Delta N$ with $\Delta N$ as the number of new training samples. At each self-training iteration $s$ we extend the matrices and vectors

$$K_{(s)} = \begin{bmatrix} K_{(s-1)} \\ K_\Delta \end{bmatrix}, \qquad t_{(s)} = \begin{bmatrix} t_{(s-1)} \\ t_\Delta \end{bmatrix}$$

to obtain the updated parameters $\alpha^+_{(s)}$ given by (12) and (13), yielding

$$z_{(s)} = \begin{bmatrix} z_{(s-1)} \\ z_\Delta \end{bmatrix}, \qquad p_{(s)} = \begin{bmatrix} p_{(s-1)} \\ p_\Delta \end{bmatrix}.$$

The Sherman-Morrison-Woodbury (SMW) formula [53] is used to compute the inverse in (12), yielding (14) with $A = \tfrac{1}{N_{(s)}} K_{(s-1)}^T R_{(s-1)} K_{(s-1)} + \lambda K_{R,(s-1)}$. Note that the update (14) only incorporates an inverse of size $\Delta N \times \Delta N$, since the inverse $A^{-1}$ was computed in time step $s - 1$. With these steps the parameters can be updated in an incremental way without re-training from scratch.

The update (12) can also be formulated for a decreasing number of training vectors in a similar manner, leading to an efficient update rule as in (14), only with the sign of $R_\Delta$ changed. Training vectors can be removed based on their "age", so that they follow the first-in-first-out strategy. This can lead to unstable results. Therefore, we identify training vectors which can be removed using Cook's distance [54]. Cook's distance measures the effect of deleting a training vector; the higher the value, the more informative the training vector is:

$$d_n = \frac{(p_n - t_n)^T (p_n - t_n)}{a \, \mathrm{MSE}} \, \frac{l_n}{(1 - l_n)^2} \quad (15)$$

with $a$ as the number of parameters and $\mathrm{MSE}$ as the mean squared error summed over all classes, given by the difference between 1 and the mean probability of all training samples belonging to class $\mathcal{C}_k$. The leverage value of each training vector is given by $l_n$. The training vectors with the lowest distance are removed in a greedy backward selection until the value of the optimization function increases by more than 5%.

b) Update the Set of Import Vectors: To add and remove import vectors we use the forward stepwise selection as described in Section II. We also make use of the SMW formula and proceed in the same way as described in Algorithm 1 until a convergence criterion is reached.

To generalize the two-class model to the multi-class model we use the class-specific values $R_{k,(s-1)}$ and $z_{k,(s-1)}$.
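A compact sketch of the SMW-based parameter update in (12)–(14) for the two-class case is given below: $A^{-1}$ is the inverse kept from the previous self-training iteration, $\Delta N$ new samples are added, and only a $\Delta N \times \Delta N$ system has to be solved. All names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def incremental_alpha_update(A_inv, K_old, R_old, z_old, K_new, R_new, z_new, N_total):
    """Two-class incremental update of the IVM parameters, cf. (12)-(14).

    A_inv : inverse of A = (1/N) K_old^T R_old K_old + lam * K_R from the previous step,
    K_*   : (n, V) kernel matrices of old / new samples w.r.t. the import vectors,
    R_*   : (n,)   diagonal weights p(1 - p),  z_* : (n,) working responses.
    """
    # Sherman-Morrison-Woodbury:
    # (A + (1/N) K_new^T R_new K_new)^{-1}
    #   = A^{-1} - (1/N) A^{-1} K_new^T (R_new^{-1} + (1/N) K_new A^{-1} K_new^T)^{-1} K_new A^{-1}
    core = np.diag(1.0 / R_new) + (K_new @ A_inv @ K_new.T) / N_total       # only dN x dN
    A_inv_new = A_inv - (A_inv @ K_new.T) @ np.linalg.solve(core, K_new @ A_inv) / N_total
    rhs = (K_old.T * R_old) @ z_old + (K_new.T * R_new) @ z_new             # right-hand side of (12)
    alpha_plus = A_inv_new @ rhs
    return alpha_plus, A_inv_new
```

Removing training vectors works analogously with the sign of $R_\Delta$ changed, and the same identity is reused when import vectors are added or dropped in the forward stepwise selection.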
Fig. 2. Schematic diagram of the selection of new training samples: Worthwhile samples are identified by comparing the classifications $C$ and $C_{DRF}$. Samples with a relatively small posterior probability are excluded. The remaining samples are sorted according to their leverage value, i.e., samples with high influence are preferred.

$$\alpha^+_{(s)} = \left( \tfrac{1}{N_{(s)}} \left( K_{(s-1)}^T R_{(s-1)} K_{(s-1)} + K_\Delta^T R_\Delta K_\Delta \right) + \lambda K_{R,(s-1)} \right)^{-1} \left( K_{(s-1)}^T R_{(s-1)} z_{(s-1)} + K_\Delta^T R_\Delta z_\Delta \right) \quad (12)$$

$$z_{(s)} = \tfrac{1}{N_{(s)}} \begin{bmatrix} K_{(s-1)} \alpha_{(s-1)} + R_{(s-1)}^{-1} \left( p_{(s-1)} - t_{(s-1)} \right) \\ K_\Delta \alpha_{(s-1)} + R_\Delta^{-1} \left( p_\Delta - t_\Delta \right) \end{bmatrix} \quad (13)$$

$$\alpha^+_{(s)} = \left( A^{-1} - \tfrac{1}{N_{(s)}} A^{-1} K_\Delta^T \left( R_\Delta^{-1} + \tfrac{1}{N_{(s)}} K_\Delta A^{-1} K_\Delta^T \right)^{-1} K_\Delta A^{-1} \right) \left( K_{(s-1)}^T R_{(s-1)} z_{(s-1)} + K_\Delta^T R_\Delta z_\Delta \right) \quad (14)$$

IV. EXPERIMENTAL SETUP
A. Data Sets
We use three hyperspectral data sets – Center of Pavia, University of Pavia and Indian Pines – from study sites with different environmental settings. The data sets have been used in a multitude of studies, e.g., [7], [13], [32], [55], [56].

The Center of Pavia image was acquired by the ROSIS-3 sensor in 2003. The spatial resolution of the image is 1.3 m per pixel. The data cover the range from 0.43 µm to 0.86 µm of the electromagnetic spectrum. However, some bands have been removed due to noise and finally 102 channels have been used in the classification. The image strip, with 1096 × 492 pixels in size, lies around the center of Pavia. The classification aims at 9 land cover classes. The University of Pavia data set was also acquired by the ROSIS-3 sensor and covers 610 × 340 pixels. The Indian Pines data set was acquired by the AVIRIS instrument in 1992. The study site lies in a predominantly agricultural region in NW Indiana, USA. AVIRIS operates from the visible to the short-wave infrared region of the electromagnetic spectrum, ranging from 0.4 µm to 2.4 µm. The data set covers 145 × 145 pixels, with a spatial resolution of 20 m per pixel. The experiments aim at the classification of 16 classes (Table II).
B. Methods
In the experiments the IVM for classifying hyperspectral data is analyzed. In addition, SVM were applied to the data sets. SVM are perhaps the most popular approach in more recent applications and seem particularly advantageous when classifying high-dimensional data sets. Thus, the method is regarded as a kind of benchmark classifier for comparison with new approaches.

Moreover, a DRF is applied to the respective probabilistic output. Besides serving as input for the DRF, the probabilistic outputs are used to analyze the uncertainty of the classification result. We assess the reliability of the probabilities by rejecting uncertain test samples and deriving the classification accuracy on the non-rejected test points [19]. The rejection rate is given by a threshold on the posterior probability, whereby the accuracy provided by SVM and IVM is reported as a function of the rejection rate in discrete intervals.

In addition, the incremental IVM is evaluated by applying the self-training approach on the three data sets in terms of classification accuracy and sparsity.

To investigate the impact of the number of training samples on the performance of the model (e.g., in terms of sparsity and accuracy) we use two different training sets, containing (i) all initial training samples and (ii) 10% of each class (with a minimum of at least 10 samples per class). For (ii), we performed a stratified random sampling, selecting 10% of the samples of each class from the initial training set. The final results were averaged.

For the SVM and the (incremental) IVM we use a radial basis function kernel. The kernel parameters are determined by a 5-fold cross-validation. Also the DRF parameter $\beta$ is determined by 5-fold cross-validation. The result provided by the common IVM is used for the initialization of the self-training. The self-training procedure is repeated until no more training samples can be selected.

The IVM algorithm is implemented in MATLAB and C++. The SVM classification is performed in MATLAB, using the LIBSVM approach by Chang and Lin [57]. To compute the result in (10) we use the graph-cut algorithm [58] (code available at http://vision.csd.uwo.ca/code/). Besides the standard SVM classification, we use the method of [16] to convert the output of the SVM to probabilities (from now on referred to as probabilistic SVM).
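For the uncertainty analysis described above, the accuracy-versus-rejection curves of Fig. 3 can be produced with a few lines. The sketch below uses illustrative names and thresholds: it rejects all test samples whose maximum posterior falls below a threshold and reports the overall accuracy on the remaining ones.

```python
import numpy as np

def accuracy_vs_rejection(p_max, y_pred, y_true, thresholds=np.linspace(0.0, 0.9, 10)):
    """Overall accuracy on the non-rejected test samples as a function of the
    rejection threshold on the maximum posterior probability (cf. Fig. 3)."""
    curve = []
    for th in thresholds:
        keep = p_max >= th                          # reject samples below the threshold
        if not keep.any():
            break                                    # everything rejected, stop
        oa = np.mean(y_pred[keep] == y_true[keep])
        curve.append((th, 1.0 - keep.mean(), oa))    # (threshold, rejection rate, OA)
    return curve
```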
TABLE II
Number of training and test samples per class for the Center of Pavia, University of Pavia and Indian Pines data sets.
TABLE III
Overall accuracy (OA), average accuracy (AA) and kappa coefficient (Kappa) of SVM, SVM with transformed probabilities (Prob. SVM), IVM, incremental IVM (iIVM) with self-training (ST) and additional DRF. For the smaller data sets (10%) we report the mean and standard deviation in brackets over the runs.

                                Center of Pavia                    University of Pavia                Indian Pines
Train data  Algorithm           OA[%]      AA[%]      Kappa        OA[%]      AA[%]      Kappa        OA[%]      AA[%]      Kappa
100%        SVM                 97.1       94.6       0.95         79.0       87.9       0.73         61.6       69.6       0.57
100%        Prob. SVM           97.0       94.2       0.95         78.4       87.7       0.73         60.7       68.9       0.56
100%        IVM                 97.2       94.2       0.95         78.3       87.4       0.73         63.7       67.8       0.59
100%        iIVM+ST             97.3       95.2       0.95         86.6       90.0       0.83         63.7       67.8       0.59
100%        Prob. SVM+DRF       97.8       96.4       0.96         82.6       90.5       0.78         66.3       71.6       0.62
100%        IVM+DRF             98.4       96.5       0.97         85.9       91.4       0.82         70.6       80.0       0.67
100%        iIVM+ST+DRF         98.6       97.1       0.98         94.2       93.8       0.92         70.6       80.2       0.67
10%         SVM                 92.1 (0.6) 85.1 (1.0) 0.86 (0.01)  70.2 (3.8) 77.5 (2.4) 0.62 (0.04)  54.7 (2.3) 63.0 (1.6) 0.49 (0.03)
10%         Prob. SVM           93.0 (1.3) 86.2 (2.1) 0.87 (0.02)  69.4 (3.8) 77.3 (2.4) 0.61 (0.04)  54.9 (2.5) 63.2 (2.3) 0.50 (0.03)
10%         IVM                 92.2 (1.1) 85.4 (1.6) 0.86 (0.02)  70.3 (3.5) 77.3 (0.8) 0.62 (0.04)  58.2 (0.1) 65.2 (0.8) 0.53 (<0.01)
V. EXPERIMENTAL RESULTS
Fig. 4. F.l.t.r.: (a) Data, (b) training ground truth, (c) test ground truth, (d) classification result of SVM, (e) IVM, (f) IVM + DRF and (g) self-training procedure with incremental IVM. The upper row shows the Center of Pavia data set, the middle row the Indian Pines data set and the bottom row the University of Pavia data set.

Fig. 4 shows the ground truth and the classification results for all three data sets. As shown in Table III, the IVM is competitive with SVM in terms of accuracy and results in almost similar overall accuracies and kappa coefficients, irrespective of the number of training samples. However, the SVM outperforms the IVM in terms of the average class accuracy (AA) for the Indian Pines data set when using all training samples. As expected, the accuracies are increased by increasing the number of training samples, independently of the classifier method.

It is interesting to underline that in many cases the probabilistic SVM provides (slightly) lower accuracies (i.e., OA and AA) than a standard SVM. Given this fact, the reliability of the probabilistic output of the SVM can be questioned. This assumption is confirmed by the results provided by the DRF. The accuracies of both methods, the probabilistic SVM and the IVM, are increased by the DRF. This is in accordance with other studies that have successfully integrated spatial information when classifying hyperspectral imagery [20], [21], [26], [34], [38]. However, the improvement of the IVM is usually higher, sometimes to a degree that the combination of IVM and DRF outperforms the accuracy provided by the probabilistic SVM in combination with a DRF.

Table IV shows the class-specific accuracies achieved on the University of Pavia data set. The results confirm the previous findings, e.g., SVM and IVM show similar overall accuracies. While some classes are more accurately classified by SVM, IVM is more adequate for the separation of other classes.

To underline this finding, an analysis of the probabilistic output is shown in Fig. 3. With an increasing rejection threshold, IVM provides a higher OA on the Indian Pines data and in most cases on the Center of Pavia and University of Pavia data sets. Consequently, it can be assumed that samples with high class probabilities are more accurately classified by IVM than by SVM, whereas relatively low class probabilities from the IVM more likely correspond to misclassified samples.

Table V shows that the number of import vectors is lower than the number of support vectors.
TABLE IV
Class-specific accuracies of the University of Pavia data set of SVM, SVM with transformed probabilities (Prob. SVM), IVM, incremental IVM (iIVM) with self-training (ST) and additional DRF. The results are given in percent ([%]).

Class          SVM    Prob. SVM  IVM    iIVM+ST  Prob. SVM+DRF  IVM+DRF  iIVM+ST+DRF
Asphalt        85.4   83.3       83.8   84.3     94.7           95.5     94.1
Bare Soil      93.7   93.6       89.1   93.9     97.7           98.5     98.2
Bitumen        90.5   90.5       90.3   90.7     94.1           95.7     94.5
Gravel         68.8   71.2       69.8   68.3     65.5           60.7     67.2
Meadows        65.9   65.2       66.2   82.8     68.6           66.6     93.7
Metal Sheets   99.4   99.5       99.6   99.4     99.8           99.9     99.7
Bricks         92.5   91.8       92.7   93.7     98.7           99.3     98.8
Shadow         97.5   97.0       98.7   99.5     97.7           99.6     100.0
Trees          97.0   97.4       95.9   97.0     97.9           98.3     98.0
Fig. 3. The overall accuracy of SVM and IVM as a function of rejected test points on the Center of Pavia data set (left), the University of Pavia data set (middle) and the Indian Pines data set (right). The overall accuracy is computed on the non-rejected test points.
TABLE V
Number of support/import vectors of SVM and IVM. For the smaller data sets (10%) we report the mean and standard deviation in brackets over the runs.
This is in accordance with the results of a previous study in the context of machine learning data sets [31]. Comparing the numbers of support vectors and import vectors, respectively, the results confirm that the number of support vectors clearly increases with an increasing number of training samples, whereas the number of import vectors increases slowly or remains almost constant. Only for the University of Pavia data set does the number increase when all training points are used, because a smaller kernel parameter was chosen and more import vectors are necessary to train the classifier. Consequently, the computation time of the IVM during classification is much shorter than that of the SVM (see Table VI), since the number of required mathematical operations depends on the number of support and import vectors. This is particularly important in the context of high-dimensional hyperspectral data sets, which are usually classified with a large number of training data.

Table VI reports the training and testing time of SVM and IVM on an Intel(R) Dual Core with 3.0 GHz. In contrast to the SVM implementation, the current MATLAB/C++ implementation of the IVM is not optimized, so there is still potential for acceleration.

Finally, Table III also shows the positive impact of self-training on the classification accuracy. The classification result was improved in all cases. As expected, the improvement on the training sets with 10% of all training samples is higher when compared to the classification results generated with the whole training sample set. Table VII shows that the number of import vectors remains nearly the same during the self-training procedure. Moreover, the proposed self-training strategy deletes irrelevant training samples. Therefore, the final number of training samples is significantly lower when compared to the initial number of training samples and the samples added during the self-training procedure. This fact is particularly obvious in the case of the Center of Pavia data using the whole training sample set.
VI. CONCLUSION AND OUTLOOK
We proposed the incremental IVM classifier, which includes the addition and deletion of training samples as well as the update of the set of import vectors. The incremental learning strategy efficiently updates the classifier model without re-training from scratch, which makes it capable of handling large data sets. To evaluate the incremental IVM, we have introduced a self-training strategy which uses the probabilistic output of the classifier and a DRF.
TABLE VI
Training and test time of SVM, probabilistic SVM and IVM. For the smaller data sets (10%) we report the mean and standard deviation in brackets over the runs.

                       Center of Pavia         University of Pavia     Indian Pines
Data   Algorithm       Train[sec]  Test[sec]   Train[sec]  Test[sec]   Train[sec]  Test[sec]
100%   SVM             3.8         187.9       4.3         126.4       20.0        68.9
100%   Prob. SVM       106.1       0.1         41.7        0.1         12.45       0.1
100%   IVM             225.2       2.3         484.9       1.8         344.3       0.1

TABLE VII
Number of added training samples in the self-training (ST) procedure and number of training and import vectors before and after self-training. The number of training samples after self-training is given by the number of training samples before self-training plus the number of added training samples minus the removed irrelevant training samples. For the smaller data sets (10%) we report the mean and standard deviation in brackets over the runs.
We evaluated the performance of IVM in the context of classifying hyperspectral imagery. IVM constitute a feasible approach and a useful alternative for the classification of remote sensing data, particularly when probabilities are of interest. The experimental results underline that SVM and IVM perform almost similarly in terms of classification accuracy. In addition, the results show the strong dependency of the number of support vectors on the number of available training samples. In contrast, the number of import vectors is significantly lower than the number of support vectors and remains constant or only slightly increases with an increasing number of training samples. As confirmed by the experimental results, the probabilities provided by IVM are more reliable than the probabilistic outputs provided by SVM. This fact is particularly interesting because the probabilities are useful for further image analysis, e.g., (i) as input to a DRF that increases the classification accuracy, (ii) to detect mislabeled samples by an uncertainty analysis, and (iii) to identify relevant training samples for a self-training strategy. Particularly for hyperspectral data sets, which require a sufficiently large number of training samples to ensure an adequate accuracy, the self-training strategy including the incremental IVM is interesting and can further increase the classification accuracy. Moreover, the computation time is reduced by the incremental learning approach compared to re-training the classifier with all training samples. The incremental IVM can further be incorporated into other active learning approaches or more sophisticated DRF models. Therefore, the approach seems attractive as well as feasible for operational applications.

Overall, the IVM and its incremental version appear worthwhile for the classification of remote sensing data, especially when the user is interested in reliable class probabilities and a fast classification. More efficient implementation strategies and further modifications will be investigated in the future.
ACKNOWLEDGMENT
The authors would like to thank D. Landgrebe and L. Biehl (Purdue University, USA) for providing the Indian Pines data (available on: http://cobweb.ecn.purdue.edu/~biehl/MultiSpec/) and P. Gamba (University of Pavia, Italy) for providing the Pavia data set.

REFERENCES

[1] A. Goetz, "Three Decades of Hyperspectral Remote Sensing of the Earth: A Personal View," Remote Sens. Environ., vol. 113, pp. 5–16, 2009.
[2] B. Waske, S. van der Linden, J. Benediktsson, A. Rabe, and P. Hostert, "Sensitivity of Support Vector Machines to Random Feature Selection in Classification of Hyperspectral Data," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2880–2889, 2010.
[3] G. Mitri and I. Gitas, "Mapping Postfire Vegetation Recovery Using EO-1 Hyperion Imagery," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp. 1613–1618, 2010.
[4] J. Benediktsson, J. Palmason, and J. Sveinsson, "Classification of Hyperspectral Data from Urban Areas based on Extended Morphological Profiles," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 480–491, 2005.
[5] L. Guanter, K. Segl, and H. Kaufmann, "Simulation of Optical Remote-Sensing Scenes With Application to the EnMAP Hyperspectral Mission," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2340–2351, 2009.
[6] J. Richards, "Analysis of Remotely Sensed Data: The Formative Decades and the Future," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 422–432, 2005.
[7] A. Plaza, J. Benediktsson, J. Boardman, J. Brazile, L. Bruzzone, G. Camps-Valls, J. Chanussot, M. Fauvel, P. Gamba, A. Gualtieri, M. Marconcini, J. Tilton, and G. Trianni, "Recent Advances in Techniques for Hyperspectral Image Processing," Remote Sens. Environ., vol. 113, pp. 110–122, 2009.
[8] X. Chen, T. Warner, and D. Campagna, "Integrating Visible, Near-Infrared and Short-Wave Infrared Hyperspectral and Multispectral Thermal Imagery for Geological Mapping at Cuprite, Nevada," Remote Sens. Environ., vol. 110, no. 3, pp. 344–356, 2007.
[9] S. van der Linden, A. Janz, B. Waske, M. Eiden, and P. Hostert, "Classifying Segmented Hyperspectral Data from a Heterogeneous Urban Environment using Support Vector Machines," J. Appl. Remote Sens., vol. 1, no. 1, 2007.
[10] V. Vapnik, The Nature of Statistical Learning Theory. Springer, 2000.
[11] M. Pal and P. Mather, "Some Issues in the Classification of DAIS Hyperspectral Data," Int. J. Remote Sens., vol. 27, no. 14, pp. 2895–2916, 2006.
[12] B. Waske, J. Benediktsson, K. Arnason, and J. Sveinsson, "Mapping of Hyperspectral AVIRIS Data using Machine-Learning Algorithms," Can. J. Remote Sensing, vol. 35, pp. 106–116, 2009.
[13] G. Camps-Valls, N. Shervashidze, and K. M. Borgwardt, "Spatio-Spectral Remote Sensing Image Classification with Graph Kernels," IEEE Geosci. Remote Sens. Lett., vol. 7, no. 4, pp. 741–745, 2010.
[14] J. Muñoz-Marí, F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls, "Semisupervised One-Class Support Vector Machines for Classification of Remote Sensing Data," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 8, pp. 3188–3197, 2010.
[15] A. Mathur and G. Foody, "Multiclass and binary SVM classification: implications for training and classification users," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 241–245, 2008.
[16] J. Platt, N. Cristianini, and J. Shawe-Taylor, "Large Margin DAGs for Multiclass Classification," Adv. Neural Inf. Process. Syst., vol. 12, no. 3, pp. 547–553, 2000.
[17] G. Foody, "RVM-Based Multi-Class Classification of Remotely Sensed Data," Int. J. Remote Sens., vol. 29, no. 6, pp. 1817–1823, 2008.
[18] M. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," J. Mach. Learn. Research, vol. 1, pp. 211–244, 2001.
[19] F. Giacco, C. Thiel, L. Pugliese, S. Scarpetta, and M. Marinaro, "Uncertainty Analysis for the Classification of Multispectral Satellite Images Using SVMs and SOMs," IEEE Trans. Geosci. Remote Sens., no. 99, pp. 1–11, 2010.
[20] P. Zhong and R. Wang, "Learning Conditional Random Fields for Classification of Hyperspectral Images," IEEE Trans. Image Process., vol. 19, pp. 1890–1907, 2010.
[21] Y. Tarabalka, M. Fauvel, J. Chanussot, and J. Benediktsson, "SVM- and MRF-Based Method for Accurate Classification of Hyperspectral Images," IEEE Geosci. Remote Sens. Lett., vol. 7, pp. 736–740, 2010.
[22] P. Zhong, P. Zhang, and R. Wang, "Dynamic Learning of SMLR for Feature Selection and Classification of Hyperspectral Data," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 280–284, Apr. 2008.
[23] Q. Cheng, P. Varshney, and M. Arora, "Logistic regression for feature selection and soft classification of remote sensing data," IEEE Geosci. Remote Sens. Lett., vol. 3, no. 4, pp. 491–494, 2006.
[24] S. Keerthi, K. Duan, S. Shevade, and A. Poo, "A Fast Dual Algorithm for Kernel Logistic Regression," Mach. Learn., vol. 61, no. 1, pp. 151–165, 2005.
[25] G. Cawley and N. Talbot, "Efficient Model Selection for Kernel Logistic Regression," Pattern Recogn., vol. 2, pp. 439–442, 2004.
[26] J. Borges, J. Bioucas-Dias, and A. Marcal, "Bayesian Hyperspectral Image Segmentation With Discriminative Class Learning," IEEE Trans. Geosci. Remote Sens., vol. 49, pp. 2151–2164, 2011.
[27] J. Li, J. Bioucas-Dias, and A. Plaza, "Hyperspectral Image Segmentation Using a New Bayesian Approach With Active Learning," IEEE Trans. Geosci. Remote Sens., pp. 1–14, 2010.
[28] G. Cawley, N. Talbot, and M. Girolami, "Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation," in Adv. Neural Inf. Process. Syst., 2007.
[29] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink, "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds," IEEE Trans. Pattern Anal. Mach. Intell., pp. 957–968, 2005.
[30] B. Demir and S. Erturk, "Hyperspectral Image Classification Using Relevance Vector Machines," IEEE Geosci. Remote Sens. Lett., vol. 4, pp. 586–590, 2007.
[31] J. Zhu and T. Hastie, "Kernel Logistic Regression and the Import Vector Machine," J. Comput. Graph. Stat., vol. 14, no. 1, pp. 185–205, 2005.
[32] M. Pal and G. Foody, "Feature Selection for Classification of Hyperspectral Data by SVM," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2297–2307, 2010.
[33] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Munoz-Mari, "A survey of active learning algorithms for supervised remote sensing image classification," IEEE J. Sel. Topics Signal Process., vol. 5, pp. 606–617, 2011.
[34] J. Li, J. Bioucas-Dias, and A. Plaza, "Semisupervised Hyperspectral Image Segmentation Using Multinomial Logistic Regression with Active Learning," IEEE Trans. Geosci. Remote Sens., vol. 48, pp. 4085–4098, 2010.
[35] S. Rajan, J. Ghosh, and M. Crawford, "An Active Learning Approach to Hyperspectral Data Classification," IEEE Trans. Geosci. Remote Sens., vol. 46, pp. 1231–1242, 2008.
[36] V. Ng and C. Cardie, "Weakly Supervised Natural Language Learning without Redundant Views," in NAACL, 2003, pp. 94–101.
[37] S. Kumar and M. Hebert, "Discriminative Random Fields," Int. J. Comput. Vision, vol. 68, no. 2, pp. 179–201, 2006.
[38] P. Zhong and R. Wang, "Learning Sparse CRFs for Feature Selection and Classification of Hyperspectral Imagery," IEEE Trans. Geosci. Remote Sens., vol. 46, pp. 4186–4197, 2008.
[39] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories," in CVIU, 2007, pp. 59–70.
[40] S. Pang, S. Ozawa, and N. Kasabov, "Incremental Linear Discriminant Analysis for Classification of Data Streams," Trans. Systems, Man, and Cybernetics, vol. 35, pp. 905–914, 2005.
[41] G. Cauwenberghs and T. Poggio, "Incremental and Decremental Support Vector Machine Learning," in Adv. Neural Inf. Process. Syst., 2001, pp. 409–415.
[42] M. Karasuyama and I. Takeuchi, "Multiple Incremental Decremental Learning of Support Vector Machines," Trans. Neural Netw., vol. 21, no. 7, pp. 1048–1059, 2010.
[43] G. Fung and O. Mangasarian, "Incremental Support Vector Machine Classification," in SIAM, 2002, pp. 247–260.
[44] M. Figueiredo, "Adaptive Sparseness for Supervised Learning," IEEE Trans. Pattern Anal. Mach. Intell., pp. 1050–1159, 2003.
[45] M. Figueiredo and A. Jain, "Bayesian Learning of Sparse Classifiers," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.
[46] L. Csató and M. Opper, "Sparse On-line Gaussian Processes," Neural Computation, vol. 14, pp. 641–668, 2002.
[47] N. Lawrence, M. Seeger, and R. Herbrich, "Fast Sparse Gaussian Process Methods: The Informative Vector Machine," in Adv. Neural Inf. Process. Syst., 2003.
[48] R. Neal, Bayesian Learning for Neural Networks. Springer Verlag, 1996, vol. 118.
[49] J. Platt, Advances in Kernel Methods – Support Vector Learning. MIT Press, 1999, ch. Fast Training of Support Vector Machines using Sequential Minimal Optimization, pp. 185–208.
[50] C. Hsu and C. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, 2002.
[51] P. Rousseeuw, A. Leroy, and J. Wiley, Robust Regression and Outlier Detection. Wiley Online Library, 1987, vol. 3.
[52] N. Japkowicz, "The class imbalance problem: Significance and strategies," in Int. Conf. Artificial Intelligence, vol. 1. Citeseer, 2000, pp. 111–117.
[53] N. Higham, Accuracy and Stability of Numerical Algorithms. Society for Industrial Mathematics, 2002.
[54] R. Cook and S. Weisberg, Residuals and Influence in Regression. Chapman and Hall, New York, 1982.
[55] J.-M. Yang, B.-C. Kuo, P.-T. Yu, and C.-H. Chuang, "A Dynamic Subspace Method for Hyperspectral Image Classification," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2840–2853, 2010.
[56] F. Melgani and L. Bruzzone, "Classification of Hyperspectral Remote Sensing Images with Support Vector Machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, 2004.
[57] C. Chang and C. Lin, "LIBSVM: A Library for Support Vector Machines," 2001.
[58] Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Anal. Mach. Intell., pp. 1222–1239, 2001.
Ribana Roscher received her Dipl.-Ing. degree in Geodesy from the University of Bonn, Germany, in 2008. She is currently a PhD student at the Institute of Geodesy and Geoinformation at the University of Bonn, Germany. Her current research activities concentrate on sequential learning and discriminative models for semantic segmentation, especially import vector machines. She is a reviewer for IEEE Transactions on Geoscience and Remote Sensing and IEEE Transactions on Pattern Analysis and Machine Intelligence.
Björn Waske (S'06–M'08) received his degree in Applied Environmental Sciences with a major in Remote Sensing from Trier University, Germany, in 2002. Until mid 2004 he was a research assistant at the Department of Geosciences at the Munich University, Germany. From 2004 until the end of 2007 he pursued a PhD at the Center for Remote Sensing of Land Surfaces (ZFL) at the University of Bonn, Germany, and received the PhD degree in Geography. From the beginning of 2008 until August 2009 he was a postdoctoral researcher at the Faculty of Electrical and Computer Engineering, University of Iceland. Since September 2009 he has been a (junior) professor for Remote Sensing in Agriculture at the University of Bonn, Germany. His current research activities concentrate on advanced concepts for image classification and data fusion. Currently he is an Associate Editor of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS). He is a reviewer for different international journals, including IEEE Transactions on Geoscience and Remote Sensing, IEEE Geoscience and Remote Sensing Letters, and IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.