Faster feature selection with a Dropping Forward-Backward algorithm
Thu Nguyen
University of Louisiana at Lafayette, Lafayette LA 70504, USA
Abstract.
In this era of big data, feature selection techniques, which have long been shown to simplify models, make them more comprehensible, and speed up learning, have become more and more important. Among the many methods developed, forward and stepwise feature selection remain widely used due to their simplicity and efficiency. However, they all involve rescanning the unselected features again and again. Moreover, the backward steps in stepwise selection often turn out to be unnecessary, as we illustrate in our examples. These observations motivate us to introduce a novel algorithm that can reduce the running time by up to 65.77% compared to the stepwise procedure while maintaining good performance in terms of the number of selected features and error rates. Our experiments also illustrate that feature selection procedures may be a better choice for high-dimensional problems where the number of features greatly exceeds the number of samples.
Keywords: feature selection · classification · regression

1 Introduction

In this era of big data, the growth of data poses challenges for effective data management and inference. Real-world data usually contain many redundant or irrelevant features that can derail learning performance. Moreover, for high-dimensional data, a critical issue is that the number of features greatly surpasses the number of samples, which can cause models to overfit and performance on test data to suffer. This is well known as the curse of dimensionality, or the p ≫ n problem. To deal with this issue, various feature extraction and feature selection methods have been developed (see [7,6] and related work for reviews). However, feature extraction methods create sets of new features that we cannot directly interpret. Moreover, since those approaches use all the available features during training, they do not help reduce the cost of collecting data in the future. Feature selection, on the other hand, preserves the meanings of the original features and reduces the cost of storing and collecting data in the future by removing irrelevant or redundant features.

Feature selection techniques can be classified into three categories: filter, wrapper, and embedded. Filter approaches (Markov blanket filtering, the t-test, etc.) score features without involving any learning. Wrapper methods (genetic algorithms, sequential search, etc.), on the other hand, use learning techniques to evaluate the importance of the features. Finally, embedded approaches (random forests, the least absolute shrinkage and selection operator, etc.) combine the feature selection step with the construction of the classifier.

Among wrapper approaches, some of the most popular methods are forward, backward, and stepwise feature selection. Forward selection starts with an empty model and sequentially adds the feature that improves the fit the most in terms of the criterion being used. This method is well known for its speed, but it may select some features at early steps and later add other features that make the earlier inclusions redundant. Backward selection avoids this problem by sequentially removing the least useful feature, one at a time. However, it is computationally expensive and can only be applied when the number of samples is much larger than the number of features (see [2]). Stepwise regression is a combination of these two methods. It first adds features to the model sequentially, as in forward selection; in addition, after adding a new feature, it removes the features that are no longer important in the model after the inclusion of the new one. Related methods have been developed to boost the efficiency of these procedures. [8] incorporates Gram-Schmidt and Givens orthogonal transforms into forward and backward procedures for classification tasks, respectively; this de-correlates the features in the orthogonal space so that each feature can be evaluated and selected independently. [1] proposes a forward orthogonal algorithm aided by mutual information for regression, at the cost of performing orthogonal transformations.

In this paper, we introduce a novel feature selection method with strong performance in terms of speed, error rates, and the number of selected features. The contributions of this paper are five-fold: (1) We point out some deficiencies of forward and stepwise feature selection.
(2) We propose a new scheme that runs much faster than stepwise selection while maintaining good results in terms of the number of selected features and error rates. (3) We demonstrate the power of our approach on regression and classification tasks using simulated and real-world data. (4) We point out that feature selection may be preferable to feature extraction in p ≫ n problems. (5) We illustrate that the time it takes for a feature selection procedure to stop depends not only on the dimension of the data but also on the sparsity of the resulting model.

The remainder of this paper is structured as follows. In Section 2, we review the forward and stepwise feature selection algorithms and point out the issues associated with these approaches; these serve as motivation for our method. In Section 3, we introduce our dropping forward-backward algorithm. In Section 4, we show how our approach surpasses stepwise selection and another intuitive forward-backward scheme, both in terms of error reduction and the number of selected features, on simulated and real datasets. Finally, in Section 5, we summarize the main points of this paper.

2 Forward and stepwise feature selection

Forward feature selection algorithm:
– Input: a set of features $C = \{X_1, X_2, \dots, X_p\}$, a response $Y$, an $\alpha$-to-enter value, and a selection criterion.
– Output: a set $R \subset C$ of relevant features.
– Procedure: sequentially add to $R$ the feature that improves the fit the most in terms of the criterion being used. Stop when no feature can improve the fit by more than $\alpha$.

The forward algorithm has been widely used due to its computational efficiency and its ability to handle $p \gg n$ problems, where the number of features greatly exceeds the number of observations. However, some features included by the forward steps may become redundant after the inclusion of other features. For sufficient conditions under which forward feature selection recovers the original model, and for its stability, we refer the reader to [9] and [3].
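For concreteness, a minimal sketch of this forward procedure might look as follows. This is our own illustration rather than a reference implementation: `criterion` stands for any fit measure to be maximized (for example, the negative of Mallows's $C_p$), and `alpha` is the to-enter threshold.

```python
import numpy as np

def forward_select(X, y, criterion, alpha):
    """Greedy forward selection with a generic score to maximize."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    current = -np.inf
    while remaining:
        # every iteration rescans all remaining features -- the cost
        # that the dropping scheme introduced later tries to reduce
        scores = [criterion(X[:, selected + [j]], y) for j in remaining]
        best = int(np.argmax(scores))
        if scores[best] - current <= alpha:
            break  # no feature improves the fit by more than alpha
        current = scores[best]
        selected.append(remaining.pop(best))
    return selected
```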
Stepwise feature selection algorithm:

– Input: a set of features $C = \{X_1, X_2, \dots, X_p\}$, a response $Y$, an $\alpha$-to-enter value, a $\beta$-to-remove value, and a selection criterion.
– Output: a set $R \subset C$ of relevant features.
– Procedure:
  1. (a) Forward step: add to $R$ the feature that improves the fit the most in terms of the criterion being used.
     (b) Backward step: sequentially remove the least useful feature in the model, one at a time, if its removal worsens the fit by no more than an amount $\beta$ in terms of the given criterion. Stop when removing any feature in the model would worsen the fit by more than $\beta$.
  2. Stop when no feature can improve the fit by more than an amount $\alpha$.

Stepwise selection appears to be a remedy for the error committed by forward selection. It adds features to the model sequentially, as in forward feature selection; in addition, after adding a new feature, it removes the features that are no longer important in the model. However, the backward steps that remove unnecessary features carry a computational cost, which raises the question of how often the forward scheme actually commits such an error. [10] proposes an algorithm that takes a backward step only when the resulting increase in squared error is no more than half of the squared error decrease achieved in the earlier forward steps. Still, this leaves the same question of whether checking for backward steps in this way is worth the effort.
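For reference, the backward check in question corresponds to the inner sweep in the following sketch, again our own illustration with a generic `criterion` to maximize and `alpha`/`beta` as the to-enter/to-remove thresholds:

```python
import numpy as np

def stepwise_select(X, y, criterion, alpha, beta):
    """Stepwise selection: a forward step, then a backward sweep after every inclusion."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    current = -np.inf
    while remaining:
        # forward step: rescan all remaining features
        scores = [criterion(X[:, selected + [j]], y) for j in remaining]
        best = int(np.argmax(scores))
        if scores[best] - current <= alpha:
            break
        current = scores[best]
        selected.append(remaining.pop(best))
        # backward sweep: drop any feature whose removal costs at most beta
        improved = True
        while improved and len(selected) > 1:
            improved = False
            drop_scores = [criterion(X[:, [f for f in selected if f != j]], y)
                           for j in selected]
            worst = int(np.argmax(drop_scores))
            if current - drop_scores[worst] <= beta:
                current = drop_scores[worst]
                remaining.append(selected.pop(worst))
                improved = True
    return selected
```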
This motivates us to run some experiments to gain insight into the problem. Tables 1 and 2 show the results of a Monte Carlo simulation with data drawn from an 80-dimensional multivariate normal distribution with sample size n = 80. For the first table, we vary the number of features included in the model, repeat each experiment 1000 times, and compute the average number of backward steps taken by the stepwise procedure. For the second table, we vary the maximum correlation among the features, generating correlation values for the multivariate normal distribution between 0 and that maximum; we again repeat each experiment 1000 times and compute the average number of backward steps taken by the stepwise procedure.

Table 1. The average number of backward steps taken by the stepwise procedure according to the number of features in the model.

| Number of features included in the model | 4 | 8 | 12 | 16 | 20 |
| Average number of backward steps taken by the stepwise procedure | 0 | 0.013 | 0.004 | 0.003 | 0.003 |
Table 2. The average number of backward steps taken by the stepwise procedure according to correlation.

| Maximum correlation among different features | 0 | ... | ... | ... | ... |
| Average number of backward steps taken by the stepwise procedure | ... | ... | ... | ... | ... |

From these tables, we see that the effort of checking whether to take a backward step is often not worth the computational price. Instead, we could simply run forward feature selection to obtain a list R, and then run backward selection on R to correct any mistakes the forward scheme may have made. We shall refer to this as the forward-backward algorithm. For regression, this is reasonable, since the order of the features in the model does not affect their corresponding coefficients. That can be seen directly from the following theorem:
Theorem 1.
Suppose that we have a regression model
$$ Y = X\beta + \epsilon, \qquad (1) $$
where $\epsilon \sim N(0, \sigma^2 I)$. Suppose $X = [x_1, x_2, \dots, x_p]$, where $x_i$ is the $i$-th column vector of $X$, and let $\hat\beta = (\hat\beta_1, \dots, \hat\beta_p)'$ be the least squares estimate of $\beta$. Let $Z$ be the matrix that results from swapping two columns $x_i, x_j$ ($i < j$) of $X$, and consider the model
$$ Y = Z\gamma + \epsilon'. \qquad (2) $$
Then we can obtain the least squares estimate $\hat\gamma$ of $\gamma$ by swapping the $i$-th and $j$-th positions of the old $\hat\beta$, i.e.,
$$ \hat\gamma = (\hat\beta_1, \dots, \hat\beta_{i-1}, \hat\beta_j, \hat\beta_{i+1}, \dots, \hat\beta_{j-1}, \hat\beta_i, \hat\beta_{j+1}, \dots, \hat\beta_p)'. \qquad (3) $$

Another point worth noticing is that all of the algorithms mentioned above require rescanning the remaining features in the pool over and over again whenever a new feature is added. This makes the algorithms computationally more expensive than necessary. Therefore, in the next section, we introduce a new algorithm that remedies these inefficiencies.
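Before moving on, Theorem 1 is easy to check numerically. The following short script (our own sketch, not part of the original material) fits least squares before and after swapping two columns of a random design matrix and confirms that the estimated coefficients are simply permuted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, i, j = 50, 6, 1, 4                 # swap columns i and j (0-based indices here)

X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=0.1, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

Z = X.copy()
Z[:, [i, j]] = Z[:, [j, i]]              # interchange the two columns
gamma_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)

expected = beta_hat.copy()
expected[[i, j]] = expected[[j, i]]      # Theorem 1: coefficients are permuted the same way
print(np.allclose(gamma_hat, expected))  # True
```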
3 The dropping forward-backward algorithm

Having pointed out the deficiencies of the forward and stepwise feature selection algorithms in the previous section, we introduce the following dropping forward-backward scheme to address them.

General dropping forward-backward scheme:
1. Input: a set of features $C = \{X_1, X_2, \dots, X_p\}$, a response $Y$, an $\alpha$-to-enter threshold, a $\beta$-to-remove threshold, and a selection criterion.
2. Output: a set $R \subset C$ of relevant features.
3. Forward dropping steps: sequentially add to $R$ the feature that improves the fit the most in terms of the criterion being used. Remove from $C$ this feature and all features that cannot improve the fit by more than an amount $\beta$. Stop when no feature can improve the fit by more than $\alpha$.
4. Re-forward steps: (a) set $C = \{X_1, X_2, \dots, X_p\} \setminus R$; (b) sequentially add to $R$ the feature that best improves the fit in terms of the criterion being used. Stop when no feature can improve the fit by more than $\alpha$.
5. Backward steps: sequentially remove from $R$ the least useful feature, one at a time, if removing it causes the fit to worsen by no more than an amount $\beta$, until the removal of any feature in the model would worsen the fit by more than $\beta$, in terms of the criterion being used.

Note that in the dropping forward-backward scheme above, the forward steps are very similar to those of the forward algorithm, except that we temporarily remove from the pool all features that cannot improve the fit by more than an amount $\beta$ in terms of the criterion being used. This reduces the cost of repeatedly rescanning features that, for the moment, do not seem able to improve the model much compared to the others. We then run forward steps again in the re-forward steps, over all features not yet included in the model, to account for correlations that may now make them useful. Moreover, instead of taking a backward step after every forward move, we take backward steps only once, at the end of all forward steps, to remove the redundant features that remain in the model. This corrects the errors the forward steps may have made while avoiding the cost of checking for a backward move after every inclusion of a new feature.

Note that a higher $\beta$ results in more features being dropped and less rescanning during the forward dropping moves. However, depending on the data and the chosen criterion, we may prefer a lower $\beta$ for high-dimensional data: a $\beta$ that is too high drops too many features, so far fewer features have a chance to enter the model during the forward dropping moves. The forward dropping moves then terminate early, and many features must be rescanned during the re-forward steps.

Another point worth noticing is that the re-forward steps come after the forward dropping steps. Consequently, once the forward dropping steps finish, the algorithm has already ranked the importance of the features, and it is possible to specify a maximum number of features to include in the model in case one wishes for a smaller set of features than what the thresholds would produce. Finally, if we want more flexibility, we can use different $\beta$ thresholds for the forward dropping steps and the backward moves.

As an illustration, we have the following algorithm (see the code sketch after the listing).

Dropping forward-backward algorithm with Mallows's $C_p$ for regression:
1. Input: a set of features $C = \{X_1, X_2, \dots, X_p\}$, a response $Y$, an $\alpha$-to-enter threshold, and a $\beta$-to-remove threshold.
2. Output: a set $R \subset C$ of relevant features.
3. Forward-dropping steps: sequentially add to $R$ the feature that minimizes $C_p$. Remove from $C$ this feature and all features that cannot reduce $C_p$ by more than an amount $\beta$. Stop when no feature can reduce $C_p$ by more than $\alpha$.
4. Re-forward steps: (a) set $C = \{X_1, X_2, \dots, X_p\} \setminus R$; (b) sequentially add to $R$ the feature that minimizes $C_p$, until no feature can reduce $C_p$ by more than $\alpha$.
5. Backward steps: sequentially remove from $R$ the least useful feature, one at a time, if removing that feature increases $C_p$ by no more than an amount $\beta$.
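A minimal sketch of how this $C_p$ instantiation could be implemented is given below. It is our own illustration under stated assumptions, not the paper's reference code: we use the textbook form of Mallows's criterion, $C_p = \mathrm{SSE}_k/\hat\sigma^2 - n + 2(k+1)$ for a submodel with $k$ features plus an intercept, and we leave the noise-variance estimate $\hat\sigma^2$ to the caller (for example, estimated from a full or regularized model).

```python
import numpy as np

def mallows_cp(X_sub, y, sigma2):
    """Textbook Mallows's Cp of the submodel X_sub (an intercept is added)."""
    n, k = X_sub.shape
    A = np.column_stack([np.ones(n), X_sub])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ coef) ** 2))
    return sse / sigma2 - n + 2 * (k + 1)

def dropping_forward_backward(X, y, sigma2, alpha, beta):
    """Sketch of the dropping forward-backward scheme with Cp as the criterion."""
    n, p = X.shape
    R = []                                       # selected features
    current = mallows_cp(X[:, []], y, sigma2)    # Cp of the intercept-only model

    def forward_pass(pool, dropping):
        nonlocal current
        while pool:
            cps = [mallows_cp(X[:, R + [j]], y, sigma2) for j in pool]
            b = int(np.argmin(cps))
            if current - cps[b] <= alpha:        # nothing reduces Cp by more than alpha
                break
            chosen = pool[b]
            if dropping:
                # forward *dropping*: also discard features that did not
                # reduce Cp by more than beta in this scan
                pool = [j for j, c in zip(pool, cps)
                        if current - c > beta and j != chosen]
            else:
                pool = [j for j in pool if j != chosen]
            R.append(chosen)
            current = cps[b]

    forward_pass(list(range(p)), dropping=True)                        # forward dropping steps
    forward_pass([j for j in range(p) if j not in R], dropping=False)  # re-forward steps

    while R:                                                           # backward steps
        cps = [mallows_cp(X[:, [f for f in R if f != j]], y, sigma2) for j in R]
        b = int(np.argmin(cps))
        if cps[b] - current > beta:   # removing any feature would raise Cp by more than beta
            break
        current = cps[b]
        R.pop(b)
    return R
```

In a hypothetical call, one might estimate `sigma2` from the residuals of the full model when n > p and then run `dropping_forward_backward(X, y, sigma2, alpha=0.01, beta=0.01)`, mirroring the α = β = 0.01 setting used in the simulations of the next section.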
4 Experiments

In this section, we illustrate the power of our method by comparing the dropping forward-backward algorithm to the stepwise algorithm and the intuitive forward-backward scheme mentioned at the end of Section 2, on artificial and real data. Throughout all the experiments, we apply standard normalization procedures to every dataset.

For the simulation, we generate n = 80 samples of dimension p, where p varies from 50 to 80. The original regression model is $Y = 4.X_1 + 2.X_2 + 3.X_3 + 0.X_4 + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$, and for each p we report the average error sum of squares and the average number of selected features. We choose α = β = 0.01 and use $C_p$ as the selection criterion.
The results are shown in Table 4. We do not report the regression errors here, as they are very low and identical across methods when rounded to five decimal places.

The experiments on real data concern feature selection for classification based on the trace criterion. The trace criterion is a popular class-separability measure for feature selection in classification tasks (see [4],[5] for more details). There are many equivalent versions; one way to define it, for $C$ classes with $n_i$ observations in the $i$-th class, is
$$ \operatorname{trace}(S_w^{-1} S_b), \qquad (4) $$
where
$$ S_b = \sum_{i=1}^{C} n_i (\bar x_i - \bar x)(\bar x_i - \bar x)' \qquad (5) $$
and
$$ S_w = \sum_{i=1}^{C} \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)(x_{ij} - \bar x_i)' \qquad (6) $$
are the between-class and within-class scatter matrices, respectively. Here, $\bar x_i$ is the mean of the $i$-th class and $\bar x$ is the overall mean. Since this criterion measures the separability of the classes, we want to maximize it. After selecting the relevant features, we classify the samples using a support vector machine (SVM) classifier and a linear discriminant analysis (LDA) classifier, and we compare the results among the feature selection methods and against using all the features. For Parkinson, since the dimension greatly exceeds the number of samples, we use principal component analysis (PCA) to extract the first 200 components, which explain 98.58% of the variance, and then use SVM or LDA to classify the samples.

The UCI repository datasets used are summarized in Table 3. For the datasets that do not come with separate training and testing sets, we randomly split the data into training and testing sets.
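A minimal sketch of the separability score in (4)-(6) is shown below; it is our own illustration, assuming integer-coded class labels, and uses a pseudo-inverse to guard against a singular within-class scatter matrix:

```python
import numpy as np

def trace_criterion(X, labels):
    """trace(S_w^{-1} S_b) for an n x d feature matrix X and integer class labels."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        diff = class_mean - overall_mean
        S_b += len(Xc) * np.outer(diff, diff)    # between-class scatter, eq. (5)
        centered = Xc - class_mean
        S_w += centered.T @ centered             # within-class scatter, eq. (6)
    return float(np.trace(np.linalg.pinv(S_w) @ S_b))
```

Plugging this score into the selection procedures above as the criterion to maximize would reproduce, up to implementation details, the classification setting described here.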
Table 3. UCI datasets used in the experiments.
Table 4. Performance of the three procedures on simulated data, reported as (time in seconds, number of selected features).

| Dimension | Dropping forward-backward | Forward-backward | Stepwise |
| p = 50 | (0.2576, 3.915) | (0.3005, 3.914) | (0.3083, 3.914) |
| p = 60 | (0.2965, 4.002) | (0.3692, 4) | (0.3776, 4) |
| p = 70 | (0.3437, 4.001) | (0.44, 4) | (0.4470, 4) |
| p = 80 | (0.4140, 4) | (0.4937, 4) | (0.5012, 4) |

Table 5. Speed and number of selected features of the procedures on real data, reported as (time (s), number of selected features), with α = β = 0.05, except for Parkinson, where we use α = 0. , β = 0.01, according to the discussion in Section 3.

| Dataset | Dropping forward-backward | Stepwise | Forward-backward |
| Biodegradation | (0.1375, 5) | (0.212, 6) | (0.204, 6) |
| Ionosphere | (0.258, 14) | (0.437, 14) | (0.340, 14) |
| Optic | (24.045, 49) | (70.250, 49) | (36.282, 49) |
| Satellite | (2.801, 17) | (3.595, 14) | (2.902, 14) |
| Parkinson | (2.691, 10) | (25.486, 24) | (25.685, 24) |
Fig. 1. Error rates (%) of the three approaches on the real datasets (biodegradation, ionosphere, optic, parkinson, satellite) using the SVM and LDA classifiers; methods compared: all features, dropping forward-backward, and forward-backward/stepwise. Note that stepwise and forward-backward selection give the same error rates, so we plot them on the same line.

From Table 4, we see that the dropping forward-backward procedure significantly surpasses the other two methods in terms of speed (for p = 70, its running time is 21.89% lower than that of the forward-backward algorithm and 23.11% lower than that of the stepwise algorithm), while it rarely increases the number of features in the model (at most twice in a thousand runs, at p = 60, in this simulation study).

For the real data, we can see from Table 5 that the dropping forward-backward procedure again greatly surpasses the other two methods in terms of speed. Specifically, for the Optic dataset, its running time is only 34.23% of that of the stepwise procedure and only 66.29% of that of the forward-backward procedure. Combining this with Figure 1, we also see that the dropping forward-backward approach performs comparably to, and in many cases better than, the stepwise and forward-backward procedures, depending on the classifier. For the biodegradation and satellite datasets, the feature selection methods produce subsets of features that attain error rates close to those obtained with all features, and several times the three feature selection methods reach lower error rates than using all the available features.

Note that for the Parkinson dataset, all three feature selection methods greatly surpass the PCA feature extraction approach, especially with the LDA classifier. One possible explanation is that PCA suffers from the poor estimation of the covariance matrix.

Another point worth noticing from Table 5 is that the time it takes for the three feature selection procedures to terminate on the Parkinson dataset is far less than on Optic, even though Parkinson has 753 features and Optic has only 64. This implies that the running time of the procedures depends not only on the dimension of the data but also on the sparsity of the resulting model.

5 Conclusion

In this paper, we point out some issues with forward and stepwise feature selection and, building on them, propose a new, faster scheme that maintains good performance. We illustrate the power of our method through simulations and experiments on real datasets. We also give an example showing that for datasets where the number of features greatly exceeds the number of samples, feature selection may be preferable, since feature extraction with PCA may suffer from the poor estimation of the covariance matrix. Our examples also illustrate that the amount of time it takes for selection procedures to run depends not only on the dimension of the data but also on the sparsity of the resulting model.

Regarding how the algorithm should be used, we pointed out that the choice of β can play a crucial role in the speed of the algorithm and should be chosen according to the criterion used and the dimension of the dataset. Sometimes the maximum number of features we would like to include is much smaller than what the α, β thresholds would produce; in such a case, we may simply specify the maximum number of features to be included. Note that we can also have the algorithm report the number of features included by the forward dropping moves and, since the re-forward moves come after them, this provides a ranking of the features.

References
1. Billings, S.A., Wei, H.L.: Sparse model identification using a forward orthogonal regression algorithm aided by mutual information. IEEE Transactions on Neural Networks (1), 306–310 (2007)
2. Couvreur, C., Bresler, Y.: On the optimality of the backward greedy algorithm for the subset selection problem. SIAM Journal on Matrix Analysis and Applications (3), 797–808 (2000)
3. Donoho, D.L., Elad, M., Temlyakov, V.N.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory (1), 6–18 (2005)
4. Fukunaga, K.: Introduction to Statistical Pattern Recognition. Elsevier (2013)
5. Johnson, R.A., Wichern, D.W., et al.: Applied Multivariate Statistical Analysis, vol. 5. Prentice Hall, Upper Saddle River, NJ (2002)
6. Kumar, V., Minz, S.: Feature selection: a literature review. SmartCR (3), 211–229 (2014)
7. Liu, H., Motoda, H.: Feature Selection for Knowledge Discovery and Data Mining, vol. 454. Springer Science & Business Media (2012)
8. Mao, K.Z.: Orthogonal forward selection and backward elimination algorithms for feature subset selection. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) (1), 629–634 (2004)
9. Tropp, J.A.: Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory (10), 2231–2242 (2004)
10. Zhang, T.: Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory (7), 4689–4708 (2011)

A Appendix
Proof of Theorem 1:
Let $K = Z'Z$ and denote by $[U]_{rs}$ the $(r,s)$ entry of a matrix $U$. The proof of the theorem follows from these remarks:

– Remark 1: the determinant of $X'X$ does not change if we interchange any two columns of $X$, i.e., $|K| = |X'X|$.
– Remark 2: $[K^{-1}]_{is} = [(X'X)^{-1}]_{js}$ and $[K^{-1}]_{js} = [(X'X)^{-1}]_{is}$ for $s \neq i, j$.
– Remark 3: $[K^{-1}]_{jj} = [(X'X)^{-1}]_{ii}$ and $[K^{-1}]_{ii} = [(X'X)^{-1}]_{jj}$.
– Remark 4: $[K^{-1}]_{ij} = [(X'X)^{-1}]_{ji}$ and $[K^{-1}]_{ji} = [(X'X)^{-1}]_{ij}$.

From Remarks 1-4, we can obtain $K^{-1}$ from $(X'X)^{-1}$ by interchanging its $i$-th and $j$-th rows and then its $i$-th and $j$-th columns. Moreover, $\hat\beta = (X'X)^{-1}X'Y$, $\hat\gamma = (Z'Z)^{-1}Z'Y$, and
$$ Z'Y = (x_1', \dots, x_j', \dots, x_i', \dots, x_p')'\,Y = (x_1'Y, \dots, x_j'Y, \dots, x_i'Y, \dots, x_p'Y)', $$
which implies that we can obtain $Z'Y$ from $X'Y$ by swapping its $i$-th and $j$-th entries. Hence, we can get $\hat\gamma$ by swapping the $i$-th and $j$-th positions of $\hat\beta$, i.e.,
$$ \hat\gamma = (\hat\beta_1, \dots, \hat\beta_{i-1}, \hat\beta_j, \hat\beta_{i+1}, \dots, \hat\beta_{j-1}, \hat\beta_i, \hat\beta_{j+1}, \dots, \hat\beta_p)'. $$

Proof of the remarks:
– Proof of Remark 1: We have
$$ K = Z'Z = \begin{pmatrix} x_1'x_1 & \cdots & x_1'x_j & \cdots & x_1'x_i & \cdots & x_1'x_p \\ \vdots & & \vdots & & \vdots & & \vdots \\ x_j'x_1 & \cdots & x_j'x_j & \cdots & x_j'x_i & \cdots & x_j'x_p \\ \vdots & & \vdots & & \vdots & & \vdots \\ x_i'x_1 & \cdots & x_i'x_j & \cdots & x_i'x_i & \cdots & x_i'x_p \\ \vdots & & \vdots & & \vdots & & \vdots \\ x_p'x_1 & \cdots & x_p'x_j & \cdots & x_p'x_i & \cdots & x_p'x_p \end{pmatrix} \qquad (7) $$
and
$$ X'X = \begin{pmatrix} x_1'x_1 & \cdots & x_1'x_i & \cdots & x_1'x_j & \cdots & x_1'x_p \\ \vdots & & \vdots & & \vdots & & \vdots \\ x_i'x_1 & \cdots & x_i'x_i & \cdots & x_i'x_j & \cdots & x_i'x_p \\ \vdots & & \vdots & & \vdots & & \vdots \\ x_j'x_1 & \cdots & x_j'x_i & \cdots & x_j'x_j & \cdots & x_j'x_p \\ \vdots & & \vdots & & \vdots & & \vdots \\ x_p'x_1 & \cdots & x_p'x_i & \cdots & x_p'x_j & \cdots & x_p'x_p \end{pmatrix}. \qquad (8) $$
Hence, after interchanging the two columns of $X$, we can obtain $K$ from the original $X'X$ by interchanging its $i$-th and $j$-th rows and then its $i$-th and $j$-th columns. One row swap and one column swap leave the determinant unchanged, so $|K| = |X'X|$.

From (7) and (8), for $r, s \notin \{i, j\}$ the minor $M_{rs}$, the determinant of the $(p-1) \times (p-1)$ matrix obtained by deleting row $r$ and column $s$ of $K$, equals the corresponding minor $N_{rs}$ obtained by deleting row $r$ and column $s$ of $X'X$: the swapped rows and columns are both still present, and again one row swap plus one column swap do not change the determinant. Therefore, using Remark 1,
$$ [K^{-1}]_{rs} = \frac{(-1)^{r+s} M_{rs}}{|K|} = \frac{(-1)^{r+s} N_{rs}}{|X'X|} = [(X'X)^{-1}]_{rs}, \qquad r, s \notin \{i, j\}. $$

– Proof of Remark 2: for $s \neq i, j$,
$$ [K^{-1}]_{is} = \frac{(-1)^{i+s}}{|X'X|}\,|A_{is}|, $$
where $A_{is}$ is the $(p-1)\times(p-1)$ matrix obtained by deleting row $i$ and column $s$ of $K$ in (7). Moreover,
$$ [(X'X)^{-1}]_{js} = \frac{(-1)^{j+s}}{|X'X|}\,|B_{js}|, $$
where $B_{js}$ is the matrix obtained by deleting row $j$ and column $s$ of $X'X$ in (8). We can get $B_{js}$ from $A_{is}$ by the adjacent row swaps $(j-1)$-th row $\leftrightarrow$ $(j-2)$-th row, $\dots$, $(i+1)$-th row $\leftrightarrow$ $i$-th row, and then by swapping the original $i$-th and $j$-th columns. Hence we made $(j-i-1)+1$ swaps, and therefore
$$ [K^{-1}]_{is} = \frac{(-1)^{i+s}(-1)^{j-i}}{|X'X|}\,|B_{js}| = \frac{(-1)^{j+s}}{|X'X|}\,|B_{js}| = [(X'X)^{-1}]_{js}. $$
Similarly, we can prove that $[K^{-1}]_{js} = [(X'X)^{-1}]_{is}$ for $s \neq i, j$.

– Proof of Remark 3:
$$ [K^{-1}]_{ii} = \frac{(-1)^{i+i}}{|X'X|}\,|A_{ii}|, \qquad [(X'X)^{-1}]_{jj} = \frac{(-1)^{j+j}}{|X'X|}\,|B_{jj}|, $$
where $A_{ii}$ is obtained by deleting row $i$ and column $i$ of $K$, and $B_{jj}$ by deleting row $j$ and column $j$ of $X'X$. We can get $B_{jj}$ from $A_{ii}$ by the adjacent row swaps $(j-1)$-th row $\leftrightarrow$ $(j-2)$-th row, $\dots$, $(i+1)$-th row $\leftrightarrow$ $i$-th row, and then the adjacent column swaps $(j-1)$-th column $\leftrightarrow$ $(j-2)$-th column, $\dots$, $(i+1)$-th column $\leftrightarrow$ $i$-th column. Hence we made $2(j-i-1)$ swaps, so $|A_{ii}| = |B_{jj}|$, which implies $[K^{-1}]_{ii} = [(X'X)^{-1}]_{jj}$. Similarly, we can prove that $[K^{-1}]_{jj} = [(X'X)^{-1}]_{ii}$.

– Proof of Remark 4:
$$ [K^{-1}]_{ij} = \frac{(-1)^{i+j}}{|X'X|}\,|A_{ij}|, \qquad [(X'X)^{-1}]_{ji} = \frac{(-1)^{i+j}}{|X'X|}\,|B_{ji}|, $$
where $A_{ij}$ is obtained by deleting row $i$ and column $j$ of $K$, and $B_{ji}$ by deleting row $j$ and column $i$ of $X'X$. We can get $B_{ji}$ from $A_{ij}$ by the adjacent row swaps $(j-1)$-th row $\leftrightarrow$ $(j-2)$-th row, $\dots$, $(i+1)$-th row $\leftrightarrow$ $i$-th row, and then the adjacent column swaps $(i+1)$-th column $\leftrightarrow$ $(i+2)$-th column, $\dots$, $(j-2)$-th column $\leftrightarrow$ $(j-1)$-th column. Hence we made $2(j-i-1)$ swaps, so $|A_{ij}| = |B_{ji}|$, which implies $[K^{-1}]_{ij} = [(X'X)^{-1}]_{ji}$. Similarly, we can prove that $[K^{-1}]_{ji} = [(X'X)^{-1}]_{ij}$.