Incremental Import Vector Machines for Classifying Hyperspectral Data
Ribana Roscher, Björn Waske, Member, IEEE, and Wolfgang Förstner, Member, IEEE

The authors are with the Institute of Geodesy and Geoinformation, Faculty of Agriculture, University of Bonn, 53115 Bonn, Germany (e-mail: [email protected], [email protected], [email protected]). This work was supported by the CROP.SENSe.net project, funded by the German Federal Ministry of Education and Research (BMBF) within the scope of the competitive grants program "Networks of excellence in agricultural and nutrition research" (FKZ: 0315529). Manuscript received May 4, 2011.
Abstract—In this paper we propose an incremental learning strategy for import vector machines (IVM), which is a sparse kernel logistic regression approach. We use the procedure for the concept of self-training for sequential classification of hyperspectral data. The strategy comprises the inclusion of new training samples to increase the classification accuracy and the deletion of non-informative samples to be memory- and runtime-efficient. Moreover, we update the parameters in the incremental IVM model without re-training from scratch. Therefore, the incremental classifier is able to deal with large data sets. The performance of the IVM in comparison to support vector machines (SVM) is evaluated in terms of accuracy, and experiments are conducted to assess the potential of the probabilistic outputs of the IVM. Experimental results demonstrate that the IVM and SVM perform similarly in terms of classification accuracy. However, the number of import vectors is significantly lower than the number of support vectors, and thus the computation time during classification can be decreased. Moreover, the probabilities provided by IVM are more reliable than the probabilistic information derived from an SVM's output. In addition, the proposed self-training strategy can increase the classification accuracy. Overall, the IVM and its incremental version are worthwhile for the classification of hyperspectral data.
Index Terms—Import vector machines, incremental learning, hyperspectral data, self-training.
I. INTRODUCTION
Hyperspectral imaging, also known as imaging spectroscopy, has been used for more than two decades for monitoring the Earth [1]. The spectrally continuous data range from the visible to the short-wave infrared region of the electromagnetic spectrum and thus enable a detailed separation of similar surface materials. Therefore, hyperspectral imagery is used for classification problems that require a precise differentiation in spectral feature space [2]–[4]. Hyperspectral applications become even more attractive regarding the increased availability of hyperspectral imagery through future space-borne missions, such as the German EnMAP (Environmental Mapping and Analysis Program) [5] and the Italian PRISMA (Hyperspectral Precursor of the Application Mission). Nevertheless, the special properties of hyperspectral imagery demand more sophisticated image (pre-)processing and
analysis [6], [7]. Conventional methods, such as the maximum likelihood classifier, can be limited when applied to hyperspectral imagery, due to the high-dimensional feature space and a finite number of training samples. Consequently, the classification accuracy often decreases with an increasing number of bands (i.e., the well-known Hughes phenomenon). Thus, more flexible classifiers, such as the spectral angle mapper, neural networks and support vector machines (SVM), are applied to hyperspectral imagery [2], [4], [8], [9].

Among the various developments in the field of pattern recognition, SVM [10] are perhaps the most popular approach in recent hyperspectral applications [7]. SVM can outperform other methods in terms of classification accuracy [11], [12] and are still subject to further modification and improvement, e.g., in the context of modified kernel functions [13] and semi-supervised learning [14]. Whereas other classifiers can directly solve multi-class problems, the binary nature of SVM requires an adequate multi-class strategy (e.g., [15]). In contrast to other classifiers, which directly provide class labels or probabilities, SVM provide the distance of each pixel to the hyperplane of the binary classification problem. This information is used to determine the final class membership. Although the output of SVM can be transformed to probabilities (e.g., [16]), the reliability of these values can be inadequate [17], [18]. Nevertheless, probabilities are of interest and can be used, e.g., as input to a Markov random field model or for uncertainty analysis [19]–[21].

Logistic regression, an alternative probabilistic discriminative classification model, has already been used in the context of classification and feature selection of hyperspectral imagery [22], [23]. The approach was extended to kernel logistic regression, e.g., [24], [25], showing a better accuracy but a higher complexity. To overcome the limitations regarding efficiency and computation time, several sparse realizations of (kernel) logistic regression have been developed, including the explicit usage of a sparsity-enforcing prior [26]–[29], an implicit prior used in the relevance vector machines (RVM) [17], [18], [30], or a greedy subset selection in the concept of import vector machines (IVM) [31].

Recent classifier developments, such as SVM, usually perform well on high-dimensional data sets. Nevertheless, the classification accuracy can be affected when a limited number of training samples is available [2], [32]. One possibility to overcome this problem is active learning, which has been studied extensively in the literature, e.g., [27], [33]–[35]. In this paper we use the specialized concept of self-training [36]. Self-training is based on the sequential training of a classifier with new training samples, i.e., starting with an initial classifier model learned from a few labeled training samples, an iterative classification procedure is performed.
After each classification, new relevant training samples are selected and usually the classifier is re-trained from scratch. That means the classifier is learned using old and new training samples without taking into account the previously learned classifier model. For the acquisition of new training samples we use spectral as well as spatial information by using a discriminative random field (DRF) [37]. The gain of integrating spatial information has already been discussed in several studies in the context of classifying hyperspectral data [20], [21], [26], [34], [38]. Moreover, we consider the probabilistic outputs of the IVM to assess the reliability of the classification result.

However, to be time- and memory-efficient we need an incremental learning strategy within the self-training approach. Several incremental learning methods have been proposed, extending classical and state-of-the-art off-line methods. Incremental generative models, e.g., [39], provide probabilities but tend to become very complex. Discriminative models, on the other hand, like incremental linear discriminant analysis [40] or incremental support vector machines, e.g., [41], show a good performance in classification tasks. Kernel-based algorithms have achieved considerable success in incremental and off-line learning settings [42], [43]. However, incremental kernel-based learning settings need strategies for dealing with existing and new samples. The challenge is to perform efficient, incremental update steps without suffering a loss in performance.

The main objective is to propose an incremental learning strategy to update the trained IVM model. The classifier is called incremental IVM. The IVM model is therefore not re-trained from scratch, as usually done in the context of active learning and self-training approaches. The incremental IVM consists of the selection of new training samples, the deletion of irrelevant training samples, and the update of the IVM model. Therefore, the classifier is able to deal with an unbounded stream of new training samples.

The potential of the IVM and its incremental version in the context of classifying hyperspectral data is evaluated. The performance is compared to SVM in terms of accuracy. In addition, the reliability of the probabilistic outputs provided by IVM and SVM is evaluated by using a discriminative random field, and the effect is further investigated by rejecting uncertain test samples. Our study aims at the classification of three different hyperspectral data sets, i.e., two urban areas from the city of Pavia and an agricultural area from Indiana, USA, using SVM and IVM.

The paper is organized as follows. Section II discusses logistic regression, kernel logistic regression and the IVM algorithm. Moreover, the concept of SVM and related classifiers is briefly introduced and compared to IVM. Section III introduces the proposed strategy for self-training, including the DRF model. Also the incremental IVM is explained. The experimental setup is given in Section IV. The results are presented and discussed in Section V. We conclude in Section VI.
II. THEORETICAL BACKGROUND
In this section the IVM model is introduced. Starting from logistic regression, we discuss kernels and sparsity and finally the sparse kernel logistic regression model, i.e., the IVM. Also a brief introduction to SVM and sparse multinomial logistic regression is given. In Section II-D the discriminative random field, which incorporates spatial information, is introduced.
A. Logistic Regression and Kernel Logistic Regression

a) Logistic Regression:
We assume a training set $(x_n, y_n)$, $n = 1, \ldots, N$ of $N$ labeled samples with feature vectors $x_n \in \mathbb{R}^M$ and class labels $y_n \in \mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_K\}$. The observations are collected in a matrix $X = [x_1, \ldots, x_N]$, while the corresponding labels are summarized in the vector $y = [y_1, \ldots, y_N]$.

In the two-class case the posterior probability $p_n$ of a feature vector $x_n$ is assumed to follow the logistic regression model

$$p_n = p(y_n = \mathcal{C}_1 \mid x_n; w) = \frac{1}{1 + \exp(-w^T x_n)} \quad (1)$$

with the extended feature vector $x_n^T = [1, \tilde{x}_n^T] \in \mathbb{R}^{M+1}$ and the extended parameters $w^T = [w_0, \tilde{w}^T] \in \mathbb{R}^{M+1}$ containing the bias $w_0$ and the weight vector $\tilde{w}$.

b) Kernel Logistic Regression: In linearly non-separable cases, the original observations $X$ are implicitly mapped from the input space to a higher-dimensional kernel space with the kernel matrix $K = [k_{nm}]$ via the kernel function $k_{nm} = k(x_n, x_m)$. The kernel matrix $K$ consists of affinities between the points, depending on the distance measure defined by the kernel function.

The parameters, referred to as $\alpha$ in the kernel-based approach, are determined in an iterative way with

$$\alpha^{(i)} = \left( \tfrac{1}{N} K^T R K + \lambda K \right)^{-1} K^T R z \quad (2)$$

$$z = \tfrac{1}{N} \left( K \alpha^{(i-1)} + R^{-1} (p - t) \right) \quad (3)$$

by optimizing the objective function

$$Q^{(i)} = -\tfrac{1}{N} \sum_n \left[ t_n \log p_n + (1 - t_n) \log(1 - p_n) \right] + \lambda \, \alpha^{(i)T} K \alpha^{(i)} \quad (4)$$

using the Newton-Raphson procedure. The $(N \times N)$-dimensional diagonal matrix $R$ has the elements $r_{nn} = p_n (1 - p_n)$ with $p_n = 1/(1 + \exp(-k_n \alpha))$ and $k_n$ as the $n$-th row of the kernel matrix $K$. The binary target vector $t \in \{0, 1\}^N$ of length $N$ codes the labels with $t_n = 1$ for $y_n = \mathcal{C}_1$ and $t_n = 0$ for $y_n = \mathcal{C}_2$. Additionally, we add an $L_2$-norm regularization term with parameter $\lambda$ to prevent overfitting.
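To make the update rule concrete, the following is a minimal NumPy sketch of one possible implementation of the IRLS/Newton-Raphson iteration in (2)–(4), assuming an RBF kernel. All function and variable names are illustrative and not taken from the authors' implementation, and the sign of the working residual follows the standard IRLS derivation for the target coding given above.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF affinities k(a, b) = exp(-gamma * ||a - b||^2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def klr_train(X, t, gamma=1.0, lam=1e-3, n_iter=50, tol=1e-6):
    """Two-class kernel logistic regression trained with IRLS, cf. (2)-(4)."""
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    alpha, Q_old = np.zeros(N), np.inf
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-K @ alpha))           # p_n = sigma(k_n alpha)
        r = np.clip(p * (1.0 - p), 1e-10, None)         # diagonal of R
        z = (K @ alpha + (t - p) / r) / N               # working response, cf. (3)
        A = (K.T * r) @ K / N + lam * K                 # (1/N) K^T R K + lambda K
        alpha = np.linalg.solve(A, (K.T * r) @ z)       # Newton step, cf. (2)
        p = 1.0 / (1.0 + np.exp(-K @ alpha))            # posteriors after the step
        Q = -np.mean(t * np.log(p + 1e-12) + (1 - t) * np.log(1 - p + 1e-12)) \
            + lam * alpha @ K @ alpha                    # objective, cf. (4)
        if abs(Q_old - Q) < tol * max(abs(Q), 1e-12):
            break
        Q_old = Q
    return alpha
```

Because all $N$ training samples enter $K$, one iteration costs on the order of $N^3$ operations, which motivates the sparse subset selection of the IVM discussed next.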
B. Import Vector Machines

The kernel logistic regression includes all training samples to train the classifier, which is computationally expensive and memory intensive for data sets with many training samples. Similar to the SVM, the IVM algorithm [31] chooses a subset $\mathcal{V}$ of feature vectors out of the training set with $V = |\mathcal{V}|$ samples $X_{\mathcal{V}} = [x_{\mathcal{V},m}]$, $m = 1, \ldots, V$, obtaining a sparse solution of the kernel logistic regression. These feature vectors are called import vectors.

Following (2) and (3), the parameters in iteration $i$ are determined by

$$\alpha^{(i)} = \left( \tfrac{1}{N} K_{\mathcal{V}}^T R K_{\mathcal{V}} + \lambda K_R \right)^{-1} K_{\mathcal{V}}^T R z \quad (5)$$

$$z = \tfrac{1}{N} \left( K_{\mathcal{V}} \alpha^{(i-1)} + R^{-1} (p - t) \right). \quad (6)$$

The $(N \times V)$-dimensional kernel matrix is given by $K_{\mathcal{V}} = [k(x_n, x_{\mathcal{V},m})]$ and the $(V \times V)$-dimensional regularization matrix by $K_R = [k(x_{\mathcal{V},l}, x_{\mathcal{V},m})]$, $\{l, m\} = 1, \ldots, V$.

The IVM procedure is illustrated in Algorithm 1.

  Initialize $\mathcal{V} := \{\}$, $X := \{x_1, \ldots, x_N\}$, $i := 0$;
  repeat
    Compute $z^{(i)}$ from the current set $\mathcal{V}^{(i)}$;
    foreach $x_n \in X^{(i)}$ do
      Let $\mathcal{V}^{(i)}_n := \mathcal{V}^{(i)} \cup x_n$;
      Compute $\alpha^{(i)}_n$ from $\mathcal{V}^{(i)}_n$ in a one-step iteration;
      Evaluate the error function $Q^{(i)}_n$;
    end
    Find the best point $x^* = x_n$ with $n = \arg\min_n Q^{(i)}_n$;
    Update $\mathcal{V}^{(i+1)} := \mathcal{V}^{(i)} \cup x^*$, $X^{(i+1)} := X^{(i)} \setminus x^*$, $i := i + 1$;
  until $Q$ converged;
Algorithm 1 — IVM: In every iteration $i$ each point $x_n \in X^{(i)}$ from the current training set $X^{(i)}$ is tested for inclusion in the set of import vectors $\mathcal{V}^{(i)}$. The point $x^*$ yielding the lowest error $Q^{(i)}_n$ is included. The algorithm stops as soon as $Q$ has converged.

The convergence criterion is given by the ratio $\epsilon = |Q^{(i)} - Q^{(i - \Delta i)}| / |Q^{(i)}|$ with a small integer $\Delta i$.

The original algorithm selects the import vectors in a greedy forward selection procedure. The approach is extended to a forward stepwise selection, which allows forward and backward steps. The advantage of this procedure is that import vectors which once entered the model can be dropped if they are no longer relevant. In all experiments an improvement of the results could be observed. Furthermore, an incremental update procedure is used to compute the inverse in (5) depending on the last iteration, which makes the algorithm more efficient. The incremental update is described in detail in Section III-C.

The two-class model can be generalized to the multi-class model. Then the objective function is

$$Q = -\tfrac{1}{N} \sum_n t_n^T \log p_n + \lambda \sum_k \alpha_k^T K_R \alpha_k \quad (7)$$

with the probabilities $P = [p_1, \ldots, p_N]$ obtained by

$$p_{nk} = \frac{\exp(k_{\mathcal{V},n} \alpha_k)}{\sum_l \exp(k_{\mathcal{V},n} \alpha_l)}. \quad (8)$$

The binary target vector $t_n$ of length $K$ uses the 1-of-$K$ coding scheme, so that all components but $t_{nk}$ are zero if the point $x_n$ is from class $\mathcal{C}_k$. In the Newton-Raphson procedure in (5) and (6) we have to use one $R_k$ and one $z_k$ for each class.
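As an illustration of Algorithm 1, the sketch below runs the greedy forward selection for the two-class case, performing a one-step IRLS update of (5)–(6) for every candidate and keeping the point with the lowest objective (4). It omits the backward steps of the forward stepwise variant and the incremental inverse update; all names are illustrative.

```python
import numpy as np

def ivm_greedy_select(K_full, t, lam=1e-3, eps=1e-3, delta_i=1, max_vectors=200):
    """Greedy forward selection of import vectors (cf. Algorithm 1, two-class case).

    K_full : (N, N) kernel matrix of all training samples,
    t      : (N,)  binary targets in {0, 1}.
    """
    N = K_full.shape[0]
    V, Q_hist, alpha_best = [], [], None
    candidates = list(range(N))
    while candidates and len(V) < max_vectors:
        best_Q, best_n, best_alpha = np.inf, None, None
        for n in candidates:
            idx = V + [n]
            K_V = K_full[:, idx]                      # (N, |V|+1) kernel matrix K_V
            K_R = K_full[np.ix_(idx, idx)]            # (|V|+1, |V|+1) regularization matrix K_R
            alpha = np.zeros(len(idx))
            p = 1.0 / (1.0 + np.exp(-K_V @ alpha))
            r = np.clip(p * (1.0 - p), 1e-10, None)
            z = (K_V @ alpha + (t - p) / r) / N       # one-step IRLS update, cf. (5)-(6)
            A = (K_V.T * r) @ K_V / N + lam * K_R
            alpha = np.linalg.solve(A, (K_V.T * r) @ z)
            p = 1.0 / (1.0 + np.exp(-K_V @ alpha))
            Q = -np.mean(t * np.log(p + 1e-12) + (1 - t) * np.log(1 - p + 1e-12)) \
                + lam * alpha @ K_R @ alpha           # objective (4) restricted to the subset
            if Q < best_Q:
                best_Q, best_n, best_alpha = Q, n, alpha
        V.append(best_n)
        candidates.remove(best_n)
        Q_hist.append(best_Q)
        alpha_best = best_alpha
        # stop as soon as Q has converged: |Q(i) - Q(i - delta_i)| / |Q(i)| < eps
        if len(Q_hist) > delta_i and \
           abs(Q_hist[-1] - Q_hist[-1 - delta_i]) / abs(Q_hist[-1]) < eps:
            break
    return V, alpha_best
```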
C. Related Classifiers

Recently, several algorithms have been developed which enforce sparseness to control both the generalization capability of the learned classifier model and its complexity, e.g., [44]–[47]. In these algorithms the model consists of a sparse weighted linear combination of basis functions, which are the input features themselves, nonlinear transformations of them, or kernels centered on them. In this section we restrict ourselves to a review of realizations of sparse (kernel) logistic regression and of SVM, whereby the latter is used for comparison in the experiments.
1) Realizations of Sparse (Kernel) Logistic Regression:
Using logistic regression or its kernel realization can be prohibitive regarding memory and time requirements if the dimension of the features or the number of training samples is large. Several sparse algorithms have been developed in recent years to overcome this problem.

The relevance vector machine [18] uses the same model as the kernel logistic regression in combination with an implicit prior as regularization term, the so-called ARD (automatic relevance determination) prior [48], to induce sparseness. The prior includes several regularization parameters, also called hyperparameters, which are determined during the optimization process. The algorithm has been shown to be very sparse, but it also tends to underfit, leading to a model that does not generalize well [29]. Additionally, the RVM uses an expectation-maximization (EM)-like learning method and can therefore suffer from local minima, leading to non-optimal classification results.

Alternatively, [28], [29] use a Laplace prior to enforce sparseness, whose regularization parameter is determined via cross-validation. These approaches propose sparse multinomial (kernel) logistic regression (SMLR) using different methods for a fast computation. In the field of hyperspectral image classification these approaches have been applied and further developed in, e.g., [26], [27].
2) Support Vector Machines:
The SVM finds an optimal nonlinear decision boundary by minimizing the objective function

$$Q_{SVM} = \tfrac{1}{N} \sum_n \left[ 1 - y_n f(x_n) \right]_+ + \lambda \|f\|^2 \quad (9)$$

with $f(x_n) = \sum_n \alpha_n K(X, x_n)$.

Contrary to the IVM, which maximizes the posterior probabilities, the SVM aims to maximize the margin between the hyperplane and the closest training samples, the so-called support vectors. The SVM is a binary classifier, with the decision rule given by the sign of $f(x_n)$.
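For reference, the post-processing mentioned in the Introduction can be sketched as fitting a sigmoid to the SVM decision values, in the spirit of [16]. The simplified maximum-likelihood fit below uses illustrative names and omits the regularized targets of the full method; it is a sketch, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(f_val, y):
    """Fit p(y=1 | f) = 1 / (1 + exp(a*f + b)) to decision values f_val
    and binary labels y in {0, 1} by minimizing the cross-entropy."""
    def nll(params):
        a, b = params
        p = np.clip(1.0 / (1.0 + np.exp(a * f_val + b)), 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    a, b = minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead").x
    return lambda f: 1.0 / (1.0 + np.exp(a * f + b))
```

Such a mapping yields values in [0, 1], but, as discussed in the following comparison, they are not necessarily well-calibrated posteriors.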
3) Comparison with Import Vector Machines:
The main properties of SVM, IVM, SMLR and RVM are summarized in Table I and briefly compared below.

The objective function of the SVM is quadratic and therefore convex, and it can be solved efficiently with the sequential minimal optimization (SMO) algorithm [49]. The objective function of the IVM is convex, but non-quadratic. The function
can be solved with iterated re-weighted least squares (IRLS) and a greedy forward selection of import vectors.

All models are sparse, i.e., the model parameters are mostly zero. The sparseness arises from different techniques, namely the usage of a prior in the SMLR models and the RVM, or the greedy selection of a fraction of the training samples in the IVM model. However, IVM and RVM have been shown to be sparser than SVM [30], [31] and thus require less computation time during classification. The training time of the IVM, on the other hand, can be longer than for the SVM, because of the non-quadratic objective function. Nonetheless, the training time depends on the number of training samples and the attainable sparseness, so that SMLR and IVM can also reach a faster training time.

In comparison to the RVM and SMLR approaches, which use the whole kernel matrix during the optimization procedure, the IVM algorithm is suitable for large data sets, since only a subset of the training samples is used for the computations during the optimization. Consequently, only a fraction of the whole kernel matrix has to be computed and stored. In contrast, e.g., the standard RVM algorithm has to solve a large matrix inversion of the size of the number of training samples during the optimization process and therefore can be slow and intractable for large data sets.

While IVM, SMLR and RVM directly provide a probabilistic output, the output of the SVM has to be transformed to probabilities in a post-processing step [16]. However, as Tipping [18] shows, the transformed output is not necessarily statistically interpretable.

IVM, SMLR and RVM are introduced as multi-class classifiers. The standard SVM can also be applied to multi-class problems, but it needs a coupling strategy to obtain multi-class classification results. Moreover, these coupling approaches are more suitable for practical use than direct multi-class formulations of the SVM [50].

TABLE I
Summary of the SVM, RVM, SMLR and IVM algorithms regarding several characteristics. Here "–/0/+" means that the algorithm satisfies a certain property "barely/partially/completely".

Algorithm  Objective function, optimization procedure                       Sparse  Training time  Testing time  Probabilistic  Multi-class
SVM        convex, greedy with SMO algorithm                                0       +              0             0              0
RVM        nonconvex, IRLS, EM-like                                         +       –              +             +              +
SMLR       convex, see Sec. II-C1                                           0/+     0/+            0/+           +              +
IVM        convex, IRLS with greedy forward stepwise selection              +       0/+            +             +              +
           (see Sec. II-B and III-C)
D. Discriminative Random Fields
A discriminative random field is employed to model prior knowledge about the neighborhood relations within the image. The final classification is assumed to be smooth, i.e., neighboring pixels are more likely to belong to the same class than to different classes. With this, the best classification $C_{DRF}$ is given by the argument of the minimum of the energy

$$E(y) = -\sum_{j \in \mathcal{I}} \log p(y_j \mid x_j) - \beta \sum_{\{m,j\} \in \mathcal{N}} \delta(y_j, y_m), \quad (10)$$

where $x_j$ is the observed feature vector of the $j$-th pixel, $\mathcal{I}$ is the set of all pixels and $\delta$ is the Kronecker delta function. The weighting parameter is given by $\beta$, which can be determined via cross-validation. The first term in (10) models the probability of a class assignment $y_j$ of the $j$-th pixel, defined by the probabilistic output of the IVM. The second term describes the interaction potential as a Potts model over a 2D lattice, penalizing every dissimilar pair of labels and therefore heterogeneous regions. The set of all neighboring pixels is given by $\mathcal{N}$.
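The energy in (10) can be evaluated directly for a candidate labeling. The NumPy sketch below assumes a 4-connected neighborhood and uses illustrative argument names; in the experiments the minimization itself is carried out with the graph-cut algorithm [58], while the snippet only computes $E(y)$.

```python
import numpy as np

def drf_energy(labels, log_post, beta):
    """Energy of eq. (10) for a label image.

    labels   : (H, W) integer class labels y_j,
    log_post : (H, W, K) log posteriors log p(y_j = k | x_j) from the IVM,
    beta     : weight of the Potts interaction potential.
    """
    rows, cols = np.indices(labels.shape)
    unary = -log_post[rows, cols, labels].sum()              # -sum_j log p(y_j | x_j)
    # Potts term over the 4-neighborhood: each pair of equal labels lowers the energy by beta.
    same_right = (labels[:, :-1] == labels[:, 1:]).sum()
    same_down  = (labels[:-1, :] == labels[1:, :]).sum()
    return unary - beta * (same_right + same_down)
```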
III. INCREMENTAL LEARNING STRATEGY FOR SELF-TRAINING

In this section we introduce the self-training concept and the learning strategy for the incremental IVM. Self-training refers to the sequential selection of new training data and the adaptation of the previous classifier model. The previous classifier model can (i) be neglected and re-trained from scratch or (ii) be updated incrementally. Following (i), a regular classifier training is performed, using the whole training sample set (i.e., previous + new training samples). The latter approach "simply" updates the previous model, using the newly selected training samples.
A. Self-training Concept
Fig. 1. Self-training scheme consisting of two steps: In the first step (Fig. 1a) the image is classified with the learned incremental IVM model and the DRF. The probabilistic output of the IVM and the classification result of the DRF are used to acquire new training samples. In the second step (Fig. 1b) the classifier model is incrementally updated.
In the first step (Fig. 1a) we identify potential new training samples. For this selection step, we use the classification result provided by the DRF and the probabilistic outputs of the incremental IVM. In the second step (Fig. 1b) we use the incremental learning strategy to update the classifier model without re-training from scratch. The procedure is repeated until no additional training samples can be selected. The latter step consists of the inclusion of new training samples, the deletion of irrelevant training samples and, finally, the update of the IVM model. Therefore the approach can handle large data sets or even infinite data streams. However, this step is independent of the used self-training strategy, and other approaches can be used to identify new training samples. Both steps are explained in detail in the next paragraphs.
B. Acquisition of New Training Samples for Self-training
This acquisition step, illustrated in Fig. 2, is subdivided into three parts: First, we use the DRF as an expert to identify worthwhile samples by evaluating the disagreement between the classification $C$ yielded by the IVM classifier and the DRF result $C_{DRF}$. This implies that the samples with estimated label $y_{DRF,j}$ derived from the DRF are sufficiently uncertain, i.e., the maximum posterior probability $\max(p_j)$ obtained from the IVM classifier lies below a fixed threshold. Using such samples we ensure progress in self-training.

Second, we exclude from the selected samples those whose probability is too small, i.e., the influence of the new samples is restricted to ensure that the model is changed gradually and remains stable.

Finally, we sort the chosen samples by their potential influence on the model, to enable the flexibility of the model during self-training. The potential influence uses the concept of leverage points in regression [51]. The leverage values in a weighted regression are contained in the vector

$$l = \mathrm{diag}\left( K_{pot} \left( K_{pot}^T R_{pot} K_{pot} \right)^{-1} K_{pot}^T R_{pot} \right). \quad (11)$$

The kernel matrix obtained from the potential new training samples $X_{pot}$ is given by $K_{pot} = [k(x_{pot,j}, x_{\mathcal{V},m})]$, and $R_{pot}$ is the weight matrix of the class the training sample belongs to. Training samples with a high leverage value are considered first, since they cause large effects in the learned model. The self-training is stopped if no more training samples can be acquired.

To prevent an acquisition which leads to an imbalanced number of training samples per class, we ensure sampling an equal number of training samples for each class. If not enough new training samples can be acquired with the proposed self-training approach, we also consider training samples with a high probability and whose labels in $C$ and $C_{DRF}$ are the same. These samples have a small influence on the model and should not change the result too much, but they balance the number of samples of each class. If there are still not enough samples for a class, we follow the over-sampling approach of duplicating existing samples from the concerned class at random and adding a small amount of noise to them [52].
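The leverage-based ranking in (11) translates directly into a few lines of NumPy. The sketch below uses illustrative names and an illustrative probability window; the disagreement between $C$ and $C_{DRF}$ is assumed to be tested beforehand.

```python
import numpy as np

def leverage_values(K_pot, R_pot):
    """Leverage of candidate samples, cf. eq. (11).

    K_pot : (P, V) kernel values between candidates and the import vectors,
    R_pot : (P,)   IVM weights p(1 - p) of the class each candidate is assigned to.
    """
    A = (K_pot.T * R_pot) @ K_pot                         # K_pot^T R_pot K_pot
    H = K_pot @ np.linalg.solve(A, K_pot.T * R_pot)       # hat matrix of the weighted regression
    return np.diag(H)

def rank_candidates(K_pot, R_pot, p_max, low=0.5, high=0.9):
    """Keep candidates whose maximum IVM posterior lies in (low, high), i.e.,
    uncertain but not too unreliable, and sort them by decreasing leverage.
    The thresholds are illustrative placeholders, not the values used in the paper."""
    lev = leverage_values(K_pot, R_pot)
    keep = np.where((p_max > low) & (p_max < high))[0]
    return keep[np.argsort(-lev[keep])]
```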
C. Incremental Learning
To update the learned classifier, we consider the following aspects:
• New training samples acquired with the proposed self-training approach (see Section III-B) are included.
• Non-informative samples are deleted.
• The set of import vectors is updated.

a) Update Training Vectors:
For the two-class case the incremental learning procedure is stated as follows. We add training vectors $X_\Delta$ with targets $t_\Delta$ so that $N_{(s)} := N_{(s-1)} + \Delta N$ with $\Delta N$ as the number of new training samples. At each self-training iteration $s$ we extend the matrices and vectors

$$K_{(s)} = \begin{bmatrix} K_{(s-1)} \\ K_\Delta \end{bmatrix}, \qquad t_{(s)} = \begin{bmatrix} t_{(s-1)} \\ t_\Delta \end{bmatrix}$$

to obtain the updated parameters $\alpha^+_{(s)}$ given by (12) and (13), yielding

$$z_{(s)} = \begin{bmatrix} z_{(s-1)} \\ z_\Delta \end{bmatrix}, \qquad p_{(s)} = \begin{bmatrix} p_{(s-1)} \\ p_\Delta \end{bmatrix}.$$

The Sherman-Morrison-Woodbury (SMW) formula [53] is used to compute the inverse in (12), yielding (14) with $A = \tfrac{1}{N_{(s)}} K_{(s-1)}^T R_{(s-1)} K_{(s-1)} + \lambda K_{R,(s-1)}$. Note that the update (14) only incorporates an inverse of size $\Delta N \times \Delta N$, since the inverse $A^{-1}$ was computed in time step $s - 1$. With these steps the parameters can be updated in an incremental way without re-training from scratch.

The update (12) can also be formulated for a decreasing number of training vectors in a similar manner, leading to an efficient update rule as in (14), only with the sign of $R_\Delta$ changed. Training vectors can be removed based on their "age", so that they follow the first-in-first-out strategy. This can lead to unstable results. Therefore, we identify training vectors which can be removed using Cook's distance [54]. Cook's distance measures the effect of deleting a training vector; the higher the value, the more informative the training vector is:

$$d_n = \frac{(p_n - t_n)^T (p_n - t_n)}{a \, \mathrm{MSE}} \, \frac{l_n}{(1 - l_n)^2} \quad (15)$$

with $a$ as the number of parameters and $\mathrm{MSE}$ as the mean squared error summed over all classes, given by the difference between 1 and the mean probability of all training samples belonging to class $\mathcal{C}_k$. The leverage value of each training vector is given by $l_n$. The training vectors with the lowest distance are removed in a greedy backward selection until the value of the optimization function increases by more than 5%.

b) Update the Set of Import Vectors: To add and remove import vectors we use the forward stepwise selection as described in Section II. We also make use of the SMW formula and proceed in the same way as described in Algorithm 1 until a convergence criterion is reached.

To generalize the two-class model to the multi-class model we use the class-specific values $R_{k,(s-1)}$ and $z_{k,(s-1)}$.
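A compact sketch of the SMW-based parameter update in (12)–(14) for the two-class case is given below: $A^{-1}$ is the inverse kept from the previous self-training iteration, $\Delta N$ new samples are added, and only a $\Delta N \times \Delta N$ system has to be solved. All names are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def incremental_alpha_update(A_inv, K_old, R_old, z_old, K_new, R_new, z_new, N_total):
    """Two-class incremental update of the IVM parameters, cf. (12)-(14).

    A_inv : inverse of A = (1/N) K_old^T R_old K_old + lam * K_R from the previous step,
    K_*   : (n, V) kernel matrices of old / new samples w.r.t. the import vectors,
    R_*   : (n,)   diagonal weights p(1 - p),  z_* : (n,) working responses.
    """
    # Sherman-Morrison-Woodbury:
    # (A + (1/N) K_new^T R_new K_new)^{-1}
    #   = A^{-1} - (1/N) A^{-1} K_new^T (R_new^{-1} + (1/N) K_new A^{-1} K_new^T)^{-1} K_new A^{-1}
    core = np.diag(1.0 / R_new) + (K_new @ A_inv @ K_new.T) / N_total       # only dN x dN
    A_inv_new = A_inv - (A_inv @ K_new.T) @ np.linalg.solve(core, K_new @ A_inv) / N_total
    rhs = (K_old.T * R_old) @ z_old + (K_new.T * R_new) @ z_new             # right-hand side of (12)
    alpha_plus = A_inv_new @ rhs
    return alpha_plus, A_inv_new
```

Removing training vectors works analogously with the sign of $R_\Delta$ changed, and the same identity is reused when import vectors are added or dropped in the forward stepwise selection.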
Fig. 2. Schematic diagram of the selection of new training samples: Worthwhile samples are identified by comparing the classifications $C$ and $C_{DRF}$. Samples with a relatively small posterior probability are excluded. The remaining samples are sorted according to their leverage value, i.e., samples with high influence are preferred.

$$\alpha^+_{(s)} = \left( \tfrac{1}{N_{(s)}} \left( K_{(s-1)}^T R_{(s-1)} K_{(s-1)} + K_\Delta^T R_\Delta K_\Delta \right) + \lambda K_{R,(s-1)} \right)^{-1} \left( K_{(s-1)}^T R_{(s-1)} z_{(s-1)} + K_\Delta^T R_\Delta z_\Delta \right) \quad (12)$$

$$z_{(s)} = \tfrac{1}{N_{(s)}} \begin{bmatrix} K_{(s-1)} \alpha_{(s-1)} + R_{(s-1)}^{-1} \left( p_{(s-1)} - t_{(s-1)} \right) \\ K_\Delta \alpha_{(s-1)} + R_\Delta^{-1} \left( p_\Delta - t_\Delta \right) \end{bmatrix} \quad (13)$$

$$\alpha^+_{(s)} = \left( A^{-1} - \tfrac{1}{N_{(s)}} A^{-1} K_\Delta^T \left( R_\Delta^{-1} + \tfrac{1}{N_{(s)}} K_\Delta A^{-1} K_\Delta^T \right)^{-1} K_\Delta A^{-1} \right) \left( K_{(s-1)}^T R_{(s-1)} z_{(s-1)} + K_\Delta^T R_\Delta z_\Delta \right) \quad (14)$$

IV. EXPERIMENTAL SETUP
A. Data Sets
We use three hyperspectral data sets – Center of Pavia, University of Pavia and Indian Pines – from study sites with different environmental settings. The data sets have been used in a multitude of studies, e.g., [7], [13], [32], [55], [56].

The Center of Pavia image was acquired by the ROSIS-3 sensor in 2003. The spatial resolution of the image is 1.3 m per pixel. The data cover the range from 0.43 µm to 0.86 µm of the electromagnetic spectrum. However, some bands have been removed due to noise and finally 102 channels have been used in the classification. The image strip, with 1096 × 492 pixels in size, lies around the center of Pavia. The classification aims at 9 land cover classes. The University of Pavia data set was also acquired by the ROSIS-3 sensor and covers 610 × 340 pixels. The Indian Pines data set was acquired by the AVIRIS instrument in 1992. The study site lies in a predominantly agricultural region in NW Indiana, USA. AVIRIS operates from the visible to the short-wave infrared region of the electromagnetic spectrum, ranging from 0.4 µm to 2.4 µm. The data set covers 145 × 145 pixels, with a spatial resolution of 20 m per pixel. The experiments aim at the classification of 16 classes (Table II).
B. Methods
In the experiments the IVM for classifying hyperspectral data is analyzed. In addition, SVM were applied to the data sets. SVM are perhaps the most popular approach in more recent applications and seem particularly advantageous when classifying high-dimensional data sets. Thus, the method is regarded as a kind of benchmark classifier for comparison with new approaches.

Moreover, a DRF is applied to the respective probabilistic output. Besides serving as input for the DRF, the probabilistic outputs are used to analyze the uncertainty of the classification result. We assess the reliability of the probabilities by rejecting uncertain test samples and deriving the classification accuracy on the non-rejected test points [19]. The rejection rate is given by a threshold on the posterior probability, whereby the accuracy provided by SVM and IVM is reported as a function of the rejection rate in discrete intervals.

In addition, the incremental IVM is evaluated by applying the self-training approach on the three data sets in terms of classification accuracy and sparsity.

To investigate the impact of the number of training samples on the performance of the model (e.g., in terms of sparsity and accuracy) we use two different training sets, containing (i) all initial training samples and (ii) 10% of each class (with a minimum of at least 10 samples per class). For (ii), we performed a stratified random sampling, selecting 10% of the samples of each class from the initial training set. The final results were averaged.

For the SVM and the (incremental) IVM we use a radial basis function kernel. The kernel parameters are determined by a 5-fold cross-validation. Also the DRF parameter $\beta$ is determined by 5-fold cross-validation. The result provided by the common IVM is used for the initialization of the self-training. The self-training procedure is repeated until no more training samples can be selected.

The IVM algorithm is implemented in MATLAB and C++. The SVM classification is performed in MATLAB, using the LIBSVM approach by Chang and Lin [57]. To compute the result in (10) we use the graph-cut algorithm [58] (code available at http://vision.csd.uwo.ca/code/). Besides the standard SVM classification, we use the method of [16] to convert the output of the SVM to probabilities (from now on referred to as probabilistic SVM).
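For the uncertainty analysis described above, the accuracy-versus-rejection curves of Fig. 3 can be produced with a few lines. The sketch below uses illustrative names and thresholds: it rejects all test samples whose maximum posterior falls below a threshold and reports the overall accuracy on the remaining ones.

```python
import numpy as np

def accuracy_vs_rejection(p_max, y_pred, y_true, thresholds=np.linspace(0.0, 0.9, 10)):
    """Overall accuracy on the non-rejected test samples as a function of the
    rejection threshold on the maximum posterior probability (cf. Fig. 3)."""
    curve = []
    for th in thresholds:
        keep = p_max >= th                          # reject samples below the threshold
        if not keep.any():
            break                                    # everything rejected, stop
        oa = np.mean(y_pred[keep] == y_true[keep])
        curve.append((th, 1.0 - keep.mean(), oa))    # (threshold, rejection rate, OA)
    return curve
```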
TABLE II
Number of training and test samples per class for the Center of Pavia, University of Pavia and Indian Pines data sets.
TABLE III
Overall accuracy (OA), average accuracy (AA) and kappa coefficient (Kappa) of SVM, SVM with transformed probabilities (Prob. SVM), IVM, incremental IVM (iIVM) with self-training (ST) and additional DRF. For the smaller data sets (10%) we report the mean and standard deviation in brackets over the runs.

                                Center of Pavia                    University of Pavia                Indian Pines
Train data  Algorithm           OA[%]      AA[%]      Kappa        OA[%]      AA[%]      Kappa        OA[%]      AA[%]      Kappa
100%        SVM                 97.1       94.6       0.95         79.0       87.9       0.73         61.6       69.6       0.57
100%        Prob. SVM           97.0       94.2       0.95         78.4       87.7       0.73         60.7       68.9       0.56
100%        IVM                 97.2       94.2       0.95         78.3       87.4       0.73         63.7       67.8       0.59
100%        iIVM+ST             97.3       95.2       0.95         86.6       90.0       0.83         63.7       67.8       0.59
100%        Prob. SVM+DRF       97.8       96.4       0.96         82.6       90.5       0.78         66.3       71.6       0.62
100%        IVM+DRF             98.4       96.5       0.97         85.9       91.4       0.82         70.6       80.0       0.67
100%        iIVM+ST+DRF         98.6       97.1       0.98         94.2       93.8       0.92         70.6       80.2       0.67
10%         SVM                 92.1 (0.6) 85.1 (1.0) 0.86 (0.01)  70.2 (3.8) 77.5 (2.4) 0.62 (0.04)  54.7 (2.3) 63.0 (1.6) 0.49 (0.03)
10%         Prob. SVM           93.0 (1.3) 86.2 (2.1) 0.87 (0.02)  69.4 (3.8) 77.3 (2.4) 0.61 (0.04)  54.9 (2.5) 63.2 (2.3) 0.50 (0.03)
10%         IVM                 92.2 (1.1) 85.4 (1.6) 0.86 (0.02)  70.3 (3.5) 77.3 (0.8) 0.62 (0.04)  58.2 (0.1) 65.2 (0.8) 0.53 (<0.01)
V. EXPERIMENTAL RESULTS
Fig. 4. F.l.t.r.: (a) Data, (b) training ground truth, (c) test ground truth, (d) classification result of SVM, (e) IVM, (f) IVM + DRF and (g) self-training procedure with incremental IVM. The upper row shows the Center of Pavia data set, the middle row the Indian Pines data set and the bottom row the University of Pavia data set.

Fig. 4 shows the ground truth and the classification results for all three data sets. As shown in Table III, the IVM is competitive with SVM in terms of accuracy and results in almost similar overall accuracies and kappa coefficients, irrespective of the number of training samples. However, the SVM outperforms the IVM in terms of the average class accuracy (AA) for the Indian Pines data set when using all training samples. As expected, the accuracies are increased by increasing the number of training samples, independently of the classifier method.

It is interesting to underline that in many cases the probabilistic SVM provides (slightly) lower accuracies (i.e., OA and AA) than a standard SVM. Given this fact, the reliability of the probabilistic output of the SVM can be questioned. This assumption is confirmed by the results provided by the DRF. The accuracies of both methods, the probabilistic SVM and the IVM, are increased by the DRF. This is in accordance with other studies that have successfully integrated spatial information when classifying hyperspectral imagery [20], [21], [26], [34], [38]. However, the improvement of the IVM is usually higher, sometimes to a degree that the combination of IVM and DRF outperforms the accuracy provided by the probabilistic SVM in combination with a DRF.

Table IV shows the class-specific accuracies achieved on the University of Pavia data set. The results confirm the previous findings, e.g., SVM and IVM show similar overall accuracies. While some classes are more accurately classified by SVM, IVM is more adequate for the separation of other classes.

To underline this finding, an analysis of the probabilistic output is shown in Fig. 3. With an increasing rejection threshold, IVM provides a higher OA on the Indian Pines data and in most cases on the Center of Pavia and University of Pavia data sets. Consequently, it can be assumed that samples with high class probabilities are more accurately classified by IVM than by SVM, whereas relatively low class probabilities from the IVM more likely correspond to misclassified samples.

Table V shows that the number of import vectors is lower than the number of support vectors.
TABLE IV
Class-specific accuracies of the University of Pavia data set of SVM, SVM with transformed probabilities (Prob. SVM), IVM, incremental IVM (iIVM) with self-training (ST) and additional DRF. The results are given in percent ([%]).

Class          SVM    Prob. SVM  IVM    iIVM+ST  Prob. SVM+DRF  IVM+DRF  iIVM+ST+DRF
Asphalt        85.4   83.3       83.8   84.3     94.7           95.5     94.1
Bare Soil      93.7   93.6       89.1   93.9     97.7           98.5     98.2
Bitumen        90.5   90.5       90.3   90.7     94.1           95.7     94.5
Gravel         68.8   71.2       69.8   68.3     65.5           60.7     67.2
Meadows        65.9   65.2       66.2   82.8     68.6           66.6     93.7
Metal Sheets   99.4   99.5       99.6   99.4     99.8           99.9     99.7
Bricks         92.5   91.8       92.7   93.7     98.7           99.3     98.8
Shadow         97.5   97.0       98.7   99.5     97.7           99.6     100.0
Trees          97.0   97.4       95.9   97.0     97.9           98.3     98.0
Fig. 3. The overall accuracy of SVM and IVM as a function of rejected test points on the Center of Pavia data set (left), the University of Pavia data set (middle) and the Indian Pines data set (right). The overall accuracy is computed on the non-rejected test points.
TABLE V
Number of support/import vectors of SVM and IVM. For the smaller data sets (10%) we report the mean and standard deviation in brackets over the runs.
This is in accordance with the results of a previous study in the context of machine learning data sets [31]. Comparing the numbers of support vectors and import vectors, respectively, the results confirm that the number of support vectors clearly increases with an increasing number of training samples, whereas the number of import vectors increases slowly or remains almost constant. Only for the University of Pavia data set does the number increase when all training points are used, because a smaller kernel parameter was chosen and more import vectors are necessary to train the classifier. Consequently, the computation time of the IVM during classification is much shorter than that of the SVM (see Table VI), since the number of required mathematical operations depends on the number of support and import vectors. This is particularly important in the context of high-dimensional hyperspectral data sets, which are usually classified with a large number of training data.

Table VI reports the training and testing time of SVM and IVM on an Intel(R) Dual Core with 3.0 GHz. In contrast to the SVM implementation, the current MATLAB/C++ implementation of the IVM is not optimized, so there is still potential for acceleration.

Finally, Table III also shows the positive impact of self-training on the classification accuracy. The classification result was improved in all cases. As expected, the improvement on the training sets with 10% of all training samples is higher when compared to the classification results generated with the whole training sample set. Table VII shows that the number of import vectors remains nearly the same during the self-training procedure. Moreover, the proposed self-training strategy deletes irrelevant training samples. Therefore, the final number of training samples is significantly lower when compared to the initial number of training samples and the samples added during the self-training procedure. This fact is particularly obvious in the case of the Center of Pavia data using the whole training sample set.
VI. CONCLUSION AND OUTLOOK
We proposed the incremental IVM classifier, which includes the addition and deletion of training samples as well as the update of the set of import vectors. The incremental learning strategy efficiently updates the classifier model without re-training from scratch, which makes it capable of handling large data sets. To evaluate the incremental IVM, we have introduced a self-training strategy which uses the probabilistic output of the classifier and a DRF.
TABLE VI
Training and test time of SVM, probabilistic SVM and IVM. For the smaller data sets (10%) we report the mean and standard deviation in brackets over the runs.

                       Center of Pavia         University of Pavia     Indian Pines
Data   Algorithm       Train[sec]  Test[sec]   Train[sec]  Test[sec]   Train[sec]  Test[sec]
100%   SVM             3.8         187.9       4.3         126.4       20.0        68.9
100%   Prob. SVM       106.1       0.1         41.7        0.1         12.45       0.1
100%   IVM             225.2       2.3         484.9       1.8         344.3       0.1

TABLE VII
Number of added training samples in the self-training (ST) procedure and number of training and import vectors before and after self-training. The number of training samples after self-training is given by the number of training samples before self-training plus the number of added training samples minus the removed irrelevant training samples. For the smaller data sets (10%) we report the mean and standard deviation in brackets over the runs.
We evaluated the performance of IVM in the context of classifying hyperspectral imagery. IVM constitute a feasible approach and a useful alternative for the classification of remote sensing data, particularly when probabilities are of interest. The experimental results underline that SVM and IVM perform almost similarly in terms of classification accuracy. In addition, the results show the strong dependency of the number of support vectors on the number of available training samples. In contrast, the number of import vectors is significantly lower than the number of support vectors and remains constant or only slightly increases with an increasing number of training samples. As confirmed by the experimental results, the probabilities provided by IVM are more reliable than the probabilistic outputs provided by SVM. This fact is particularly interesting because the probabilities are useful for further image analysis, e.g., (i) as input to a DRF that increases the classification accuracy, (ii) to detect mislabeled samples by an uncertainty analysis, and (iii) to identify relevant training samples for a self-training strategy. Particularly for hyperspectral data sets, which require a sufficiently large number of training samples to ensure an adequate accuracy, the self-training strategy including the incremental IVM is interesting and can further increase the classification accuracy. Moreover, the computation time is reduced by the incremental learning approach compared to re-training the classifier with all training samples. The incremental IVM can further be incorporated into other active learning approaches or more sophisticated DRF models. Therefore, the approach seems attractive as well as feasible for operational applications.

Overall, the IVM and its incremental version appear worthwhile for the classification of remote sensing data, especially when the user is interested in reliable class probabilities and a fast classification. More efficient implementation strategies and further modifications will be investigated in the future.
ACKNOWLEDGMENT
The authors would like to thank D. Landgrebe and L. Biehl (Purdue University, USA) for providing the Indian Pines data (available on: http://cobweb.ecn.purdue.edu/~biehl/MultiSpec/) and P. Gamba (University of Pavia, Italy) for providing the Pavia data set.

REFERENCES

[1] A. Goetz, "Three Decades of Hyperspectral Remote Sensing of the Earth: A Personal View," Remote Sens. Environ., vol. 113, pp. 5–16, 2009.
[2] B. Waske, S. van der Linden, J. Benediktsson, A. Rabe, and P. Hostert, "Sensitivity of Support Vector Machines to Random Feature Selection in Classification of Hyperspectral Data," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2880–2889, 2010.
[3] G. Mitri and I. Gitas, "Mapping Postfire Vegetation Recovery Using EO-1 Hyperion Imagery," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 3, pp. 1613–1618, 2010.
[4] J. Benediktsson, J. Palmason, and J. Sveinsson, "Classification of Hyperspectral Data from Urban Areas based on Extended Morphological Profiles," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 480–491, 2005.
[5] L. Guanter, K. Segl, and H. Kaufmann, "Simulation of Optical Remote-Sensing Scenes With Application to the EnMAP Hyperspectral Mission," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 7, pp. 2340–2351, 2009.
[6] J. Richards, "Analysis of Remotely Sensed Data: The Formative Decades and the Future," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 422–432, 2005.
[7] A. Plaza, J. Benediktsson, J. Boardman, J. Brazile, L. Bruzzone, G. Camps-Valls, J. Chanussot, M. Fauvel, P. Gamba, A. Gualtieri, M. Marconcini, J. Tilton, and G. Trianni, "Recent Advances in Techniques for Hyperspectral Image Processing," Remote Sens. Environ., vol. 113, pp. 110–122, 2009.
[8] X. Chen, T. Warner, and D. Campagna, "Integrating Visible, Near-Infrared and Short-Wave Infrared Hyperspectral and Multispectral Thermal Imagery for Geological Mapping at Cuprite, Nevada," Remote Sens. Environ., vol. 110, no. 3, pp. 344–356, 2007.
[9] S. van der Linden, A. Janz, B. Waske, M. Eiden, and P. Hostert, "Classifying Segmented Hyperspectral Data from a Heterogeneous Urban Environment using Support Vector Machines," J. Appl. Remote Sens., vol. 1, no. 1, 2007.
[10] V. Vapnik, The Nature of Statistical Learning Theory. Springer, 2000.
[11] M. Pal and P. Mather, "Some Issues in the Classification of DAIS Hyperspectral Data," Int. J. Remote Sens., vol. 27, no. 14, pp. 2895–2916, 2006.
[12] B. Waske, J. Benediktsson, K. Arnason, and J. Sveinsson, "Mapping of Hyperspectral AVIRIS Data using Machine-Learning Algorithms," Can. J. Remote Sensing, vol. 35, pp. 106–116, 2009.
[13] G. Camps-Valls, N. Shervashidze, and K. M. Borgwardt, "Spatio-Spectral Remote Sensing Image Classification with Graph Kernels," IEEE Geosci. Remote Sens. Lett., vol. 7, no. 4, pp. 741–745, 2010.
[14] J. Muñoz-Marí, F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls, "Semisupervised One-Class Support Vector Machines for Classification of Remote Sensing Data," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 8, pp. 3188–3197, 2010.
[15] A. Mathur and G. Foody, "Multiclass and binary SVM classification: implications for training and classification users," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 241–245, 2008.
[16] J. Platt, N. Cristianini, and J. Shawe-Taylor, "Large Margin DAGs for Multiclass Classification," Adv. Neural Inf. Process. Syst., vol. 12, no. 3, pp. 547–553, 2000.
[17] G. Foody, "RVM-Based Multi-Class Classification of Remotely Sensed Data," Int. J. Remote Sens., vol. 29, no. 6, pp. 1817–1823, 2008.
[18] M. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," J. Mach. Learn. Research, vol. 1, pp. 211–244, 2001.
[19] F. Giacco, C. Thiel, L. Pugliese, S. Scarpetta, and M. Marinaro, "Uncertainty Analysis for the Classification of Multispectral Satellite Images Using SVMs and SOMs," IEEE Trans. Geosci. Remote Sens., no. 99, pp. 1–11, 2010.
[20] P. Zhong and R. Wang, "Learning Conditional Random Fields for Classification of Hyperspectral Images," IEEE Trans. Image Process., vol. 19, pp. 1890–1907, 2010.
[21] Y. Tarabalka, M. Fauvel, J. Chanussot, and J. Benediktsson, "SVM- and MRF-Based Method for Accurate Classification of Hyperspectral Images," IEEE Geosci. Remote Sens. Lett., vol. 7, pp. 736–740, 2010.
[22] P. Zhong, P. Zhang, and R. Wang, "Dynamic Learning of SMLR for Feature Selection and Classification of Hyperspectral Data," IEEE Geosci. Remote Sens. Lett., vol. 5, no. 2, pp. 280–284, Apr. 2008.
[23] Q. Cheng, P. Varshney, and M. Arora, "Logistic regression for feature selection and soft classification of remote sensing data," IEEE Geosci. Remote Sens. Lett., vol. 3, no. 4, pp. 491–494, 2006.
[24] S. Keerthi, K. Duan, S. Shevade, and A. Poo, "A Fast Dual Algorithm for Kernel Logistic Regression," Mach. Learn., vol. 61, no. 1, pp. 151–165, 2005.
[25] G. Cawley and N. Talbot, "Efficient Model Selection for Kernel Logistic Regression," Pattern Recogn., vol. 2, pp. 439–442, 2004.
[26] J. Borges, J. Bioucas-Dias, and A. Marcal, "Bayesian Hyperspectral Image Segmentation With Discriminative Class Learning," IEEE Trans. Geosci. Remote Sens., vol. 49, pp. 2151–2164, 2011.
[27] J. Li, J. Bioucas-Dias, and A. Plaza, "Hyperspectral Image Segmentation Using a New Bayesian Approach With Active Learning," IEEE Trans. Geosci. Remote Sens., pp. 1–14, 2010.
[28] G. Cawley, N. Talbot, and M. Girolami, "Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation," in Adv. Neural Inf. Process. Syst., 2007.
[29] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink, "Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds," IEEE Trans. Pattern Anal. Mach. Intell., pp. 957–968, 2005.
[30] B. Demir and S. Erturk, "Hyperspectral Image Classification Using Relevance Vector Machines," IEEE Geosci. Remote Sens. Lett., vol. 4, pp. 586–590, 2007.
[31] J. Zhu and T. Hastie, "Kernel Logistic Regression and the Import Vector Machine," J. Comput. Graph. Stat., vol. 14, no. 1, pp. 185–205, 2005.
[32] M. Pal and G. Foody, "Feature Selection for Classification of Hyperspectral Data by SVM," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 5, pp. 2297–2307, 2010.
[33] D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Munoz-Mari, "A survey of active learning algorithms for supervised remote sensing image classification," IEEE J. Sel. Topics Signal Process., vol. 5, pp. 606–617, 2011.
[34] J. Li, J. Bioucas-Dias, and A. Plaza, "Semisupervised Hyperspectral Image Segmentation Using Multinomial Logistic Regression with Active Learning," IEEE Trans. Geosci. Remote Sens., vol. 48, pp. 4085–4098, 2010.
[35] S. Rajan, J. Ghosh, and M. Crawford, "An Active Learning Approach to Hyperspectral Data Classification," IEEE Trans. Geosci. Remote Sens., vol. 46, pp. 1231–1242, 2008.
[36] V. Ng and C. Cardie, "Weakly Supervised Natural Language Learning without Redundant Views," in NAACL, 2003, pp. 94–101.
[37] S. Kumar and M. Hebert, "Discriminative Random Fields," Int. J. Comput. Vision, vol. 68, no. 2, pp. 179–201, 2006.
[38] P. Zhong and R. Wang, "Learning Sparse CRFs for Feature Selection and Classification of Hyperspectral Imagery," IEEE Trans. Geosci. Remote Sens., vol. 46, pp. 4186–4197, 2008.
[39] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories," in CVIU, 2007, pp. 59–70.
[40] S. Pang, S. Ozawa, and N. Kasabov, "Incremental Linear Discriminant Analysis for Classification of Data Streams," Trans. Systems, Man, and Cybernetics, vol. 35, pp. 905–914, 2005.
[41] G. Cauwenberghs and T. Poggio, "Incremental and Decremental Support Vector Machine Learning," in Adv. Neural Inf. Process. Syst., 2001, pp. 409–415.
[42] M. Karasuyama and I. Takeuchi, "Multiple Incremental Decremental Learning of Support Vector Machines," Trans. Neural Netw., vol. 21, no. 7, pp. 1048–1059, 2010.
[43] G. Fung and O. Mangasarian, "Incremental Support Vector Machine Classification," in SIAM, 2002, pp. 247–260.
[44] M. Figueiredo, "Adaptive Sparseness for Supervised Learning," IEEE Trans. Pattern Anal. Mach. Intell., pp. 1050–1159, 2003.
[45] M. Figueiredo and A. Jain, "Bayesian Learning of Sparse Classifiers," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.
[46] L. Csató and M. Opper, "Sparse On-line Gaussian Processes," Neural Computation, vol. 14, pp. 641–668, 2002.
[47] N. Lawrence, M. Seeger, and R. Herbrich, "Fast Sparse Gaussian Process Methods: The Informative Vector Machine," in Adv. Neural Inf. Process. Syst., 2003.
[48] R. Neal, Bayesian Learning for Neural Networks. Springer Verlag, 1996, vol. 118.
[49] J. Platt, Advances in Kernel Methods – Support Vector Learning. MIT Press, 1999, ch. Fast Training of Support Vector Machines using Sequential Minimal Optimization, pp. 185–208.
[50] C. Hsu and C. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, 2002.
[51] P. Rousseeuw, A. Leroy, and J. Wiley, Robust Regression and Outlier Detection. Wiley Online Library, 1987, vol. 3.
[52] N. Japkowicz, "The class imbalance problem: Significance and strategies," in Int. Conf. Artificial Intelligence, vol. 1. Citeseer, 2000, pp. 111–117.
[53] N. Higham, Accuracy and Stability of Numerical Algorithms. Society for Industrial Mathematics, 2002.
[54] R. Cook and S. Weisberg, Residuals and Influence in Regression. Chapman and Hall, New York, 1982.
[55] J.-M. Yang, B.-C. Kuo, P.-T. Yu, and C.-H. Chuang, "A Dynamic Subspace Method for Hyperspectral Image Classification," IEEE Trans. Geosci. Remote Sens., vol. 48, no. 7, pp. 2840–2853, 2010.
[56] F. Melgani and L. Bruzzone, "Classification of Hyperspectral Remote Sensing Images with Support Vector Machines," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 8, pp. 1778–1790, 2004.
[57] C. Chang and C. Lin, "LIBSVM: A Library for Support Vector Machines," 2001.
[58] Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Anal. Mach. Intell., pp. 1222–1239, 2001.
Ribana Roscher received her Dipl.-Ing. degree in Geodesy from the University of Bonn, Germany, in 2008. She is currently a PhD student at the Institute of Geodesy and Geoinformation at the University of Bonn, Germany. Her current research activities concentrate on sequential learning and discriminative models for semantic segmentation, especially import vector machines. She is a reviewer for IEEE Transactions on Geoscience and Remote Sensing and IEEE Transactions on Pattern Analysis and Machine Intelligence.
Björn Waske (S'06–M'08) received his degree in Applied Environmental Sciences with a major in Remote Sensing from Trier University, Germany, in 2002. Until mid 2004 he was a research assistant at the Department of Geosciences at the Munich University, Germany. From 2004 until the end of 2007 he pursued a PhD at the Center for Remote Sensing of Land Surfaces (ZFL) at the University of Bonn, Germany, and received the PhD degree in Geography. From the beginning of 2008 until August 2009 he was a postdoctoral researcher at the Faculty of Electrical and Computer Engineering, University of Iceland. Since September 2009 he has been a (junior) professor for Remote Sensing in Agriculture at the University of Bonn, Germany. His current research activities concentrate on advanced concepts for image classification and data fusion. Currently he is an Associate Editor of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS). He is a reviewer for different international journals, including IEEE Transactions on Geoscience and Remote Sensing, IEEE Geoscience and Remote Sensing Letters, and IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.