On-the-Fly Joint Feature Selection and Classification

Yasitha Warahena Liyanage, Student Member, IEEE, Daphney-Stavroula Zois, Member, IEEE, and Charalampos Chelmis, Member, IEEE
Abstract—Joint feature selection and classification in an online setting is essential for time-sensitive decision making. However, most existing methods treat this coupled problem independently. Specifically, online feature selection methods can handle either streaming features or data instances offline to produce a fixed set of features for classification, while online classification methods classify incoming instances using full knowledge about the feature space. Nevertheless, all existing methods utilize a set of features, common for all data instances, for classification. Instead, we propose a framework to perform joint feature selection and classification on-the-fly, so as to minimize the number of features evaluated for every data instance and maximize classification accuracy. We derive the optimum solution of the associated optimization problem and analyze its structure. Two algorithms are proposed, ETANA and F-ETANA, which are based on the optimum solution and its properties. We evaluate the performance of the proposed algorithms on several public datasets, demonstrating (i) the dominance of the proposed algorithms over the state-of-the-art, and (ii) their applicability to a broad range of application domains, including clinical research and natural language processing.
Index Terms—large-scale data mining, big data analytics, feature selection, classification.

● Y. Warahena Liyanage and D.-S. Zois are with the Department of Electrical and Computer Engineering, University at Albany, SUNY, NY, 12222. E-mail: [email protected], [email protected]
● C. Chelmis is with the Department of Computer Science, University at Albany, SUNY, NY, 12222. E-mail: [email protected]
1 INTRODUCTION

FEATURE selection is the process of selecting a subset of the most informative features from a large set of potentially redundant features with the objective of maximizing classification accuracy, alleviating the effect of the curse of dimensionality, speeding up the training process and improving interpretability [1], [2].

Most existing work on feature selection [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20] extracts a subset of discriminative features that can globally describe the data well, where the same feature subset is used to classify all instances during classification (see Fig. 1(a)). Only a handful of feature selection methods have considered the feature evaluation cost and costs associated with misclassification, which play a key role in many real-world applications [21], [22]. On the other hand, existing online classification techniques [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33] update model parameters by examining incoming data instances one at a time; such methods are not only affected by noisy and missing data, but also face scalability constraints [33].

In a departure from existing feature selection and classification methods, we study the problem of on-the-fly joint feature selection and classification (OJFC) in an online setting. Specifically, the goal of OJFC is to minimize the number of feature evaluations for classification individually for each instance, while achieving high classification accuracy across all instances. This is particularly important and necessary for real-world applications requiring time-sensitive decisions such as weather forecasting [34], transportation [35], stock market prediction [36], clinical research [37], and natural disaster prediction [38]. Therefore, the proposed method utilizes a varying number of features to classify each data instance (see Fig. 1(b)).

Fig. 1: Using a K × D matrix, (a) existing feature selection (FS) methods extract a set of L << K features, which is common for all data instances during classification. In contrast, (b) the proposed on-the-fly joint feature selection and classification approach utilizes a varying number of features to classify each data instance online, using a model learned offline from all features.

To address the challenges associated with the problem of OJFC, we define an optimization problem which simultaneously minimizes the number of features evaluated and maximizes classification accuracy. The solution to this optimization problem leads to an approach that sequentially reviews features and classifies a data instance once it determines that including additional features cannot further improve the quality of classification. However, the computational complexity of the optimum solution increases exponentially with the number of classes in multi-class classification tasks. To improve the scalability of our approach, we propose an efficient implementation, which exploits the structure of the optimum solution. Specifically, the functions related to the optimum solution are shown to be concave, continuous and piecewise linear on the domain of a sufficient statistic. As a result, the optimum solution exhibits a threshold structure to decide between continuing the feature evaluation process and stopping. A stochastic gradient algorithm is utilized to estimate the optimal linear thresholds. Extensive experimental evaluation using seven publicly available datasets shows the superiority of the proposed approach in terms of classification accuracy, average number of features used per data instance, and time required for joint feature selection and classification compared to the state-of-the-art. Further, our evaluation results indicate that the proposed efficient implementation drastically reduces training time without a drop in accuracy as compared to the optimum solution. All proofs are included in Appendices A and B.

2 RELATED WORK
In this section, we summarize the most relevant prior work on (i) online feature selection and (ii) online classification techniques.
In contrast to offline feature selection methods that are designed for static datasets with a fixed number of features and data samples, online feature selection methods are capable of handling either streaming features or streaming data samples to choose a subset of features from a larger set of potentially redundant features [14], [19], [20]. Online feature selection methods can generally be grouped into two categories:

(a) Streaming Features: In this branch of online feature selection problems, the number of data instances is considered constant while features arrive one at a time [3], [4], [5], [6], [7], [8], [9], [10]. In [3], a newly arriving feature is selected if the improvement in the model is greater than a predefined threshold. [5], [6] try to extract features in the Markov blanket of the class variable using a forward algorithm, where thresholds on probability approximations that measure conditional independence (e.g., the G² test [5], Fisher's Z-test [6]) are employed. Such threshold-based methods [3], [4], [5], [6] require prior information about the feature space [7]. Recently, rough set theory based methods [7], [8], [9], [10] have been explored. Such methods do not require any domain knowledge [8]. However, the methods proposed in [8], [10] are not applicable to numerical features, while the methods in [7], [9] are much slower in feature selection compared to the state-of-the-art streaming feature selection methods.

(b) Streaming Data: In this problem setting, the number of features is considered constant, while data instances arrive over time [11], [12], [13]. Such methods [11], [12], [13] are limited to binary classification and/or impose hard constraints on the number of non-zero elements in the model, requiring the user to define the number of features that need to be selected a priori [33].

Online classification methods, also referred to as online learning, use sequentially arriving data to update the function of a classifier. This is in contrast to batch learning techniques, where a collection of training data is used to train a classifier offline, without further updates once training is complete [33]. The most widely used online learning methods [23], [24], [25] are either limited to binary classification [23], [24] or require solving a complex optimization problem at each iteration and prior information to tune model parameters [25]. On the other hand, traditional gradient based methods [26], [27], [28], [29] not only require computing the gradient of a cost function, but also require solving an optimization problem at each iteration. Cost-sensitive extensions of traditional online classification methods, which account for misclassification costs, have been recently explored [25], [30], [31], [32]. Unlike our approach, [25], [30] do not optimize the misclassification cost directly [33], while [30], [31], [32] are limited to binary classification. Last but not least, most existing methods are highly susceptible to noise and/or incomplete data [33].

3 ON-THE-FLY JOINT FEATURE SELECTION AND CLASSIFICATION
Consider a set S of data instances, with each data instance s ∈ S being described by an assignment of values f = {f_1, f_2, ..., f_K} to a set F = {F_1, F_2, ..., F_K} of K features. Each data instance s is drawn from some probability distribution over the feature space such that for each assignment f to F, we have a probability P(F = f). Further, each instance s may belong to one of N classes, with corresponding a priori probability P(T = T_i) = p_i for each assignment T_i, i = 1, 2, ..., N, of the class variable T. Moreover, coefficients c_k > 0, k = 1, 2, ..., K, represent the cost of evaluating features F_k, respectively, and coefficients M_{i,j} ≥ 0, i, j ∈ {1, ..., N}, denote the misclassification cost of selecting class T_j when class T_i is true.

To select one out of N possible classes for each data instance s, our proposed approach evaluates features sequentially, where at each step it has to decide between stopping and continuing the feature evaluation process based on the accumulated information thus far and the cost of evaluating the remaining features. Herein, we introduce a pair of random variables (R, D_R), where 0 ≤ R ≤ K (referred to as a stopping time [39] in decision theory) denotes the feature at which the framework assigns s to a specific class, and D_R ∈ {1, ..., N}, which depends on R, denotes the decision to select among the N classes. The event {R = k} depends only on the feature set {F_1, F_2, ..., F_k}, whereas the event {D_R = j} represents choosing class T_j based on information accumulated up to feature R. The goal is to select random variables R and D_R by solving the following optimization problem:

    minimize_{R, D_R}  J(R, D_R),    (1)

where the cost function is defined as:

    J(R, D_R) ≜ E{ ∑_{k=1}^{R} c_k } + ∑_{j=1}^{N} ∑_{i=1}^{N} M_{ij} P(D_R = j, T = T_i),    (2)

in which the first term denotes the cost of evaluating features, and the second term penalizes misclassification errors.

To solve the optimization problem defined in Eq. (1), we define a sufficient statistic of accumulated information, the a posteriori probability vector π_k, as follows:

    π_k ≜ [π_k^1, π_k^2, ..., π_k^N]^T,    (3)

where the k-th feature is evaluated to generate outcome f_k, and π_k^i ≜ P(T_i | F_1, ..., F_k). To simplify the notation, P(T_i | F_1, ..., F_k) is used in lieu of P(T = T_i | F_1 = f_1, ..., F_k = f_k) subsequently. Assuming that features in set F are independent given the class variable T,¹ π_k can be computed recursively as in Lemma 1.

Lemma 1.
The a posteriori probability vector π_k ∈ [0, 1]^N can be recursively computed as:

    π_k = diag(Δ_k(F_k)) π_{k−1} / (Δ_k^T(F_k) π_{k−1}),    (4)

where Δ_k(F_k) ≜ [P(F_k | T_1), P(F_k | T_2), ..., P(F_k | T_N)]^T, diag(A) denotes a diagonal matrix with diagonal elements being the elements of vector A, and π_0 ≜ [p_1, p_2, ..., p_N]^T.
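For concreteness, the recursion of Lemma 1 amounts to an element-wise product followed by a normalization. The following is a minimal sketch (not the authors' implementation); it assumes the class-conditional likelihoods P(F_k = f_k | T_i) are available, e.g., from the discretized estimates described in Section 5.

```python
import numpy as np

def update_posterior(pi_prev, delta_k):
    """One step of Eq. (4).

    pi_prev : (N,) array, posterior pi_{k-1} over the N classes.
    delta_k : (N,) array, likelihoods [P(F_k = f_k | T_1), ..., P(F_k = f_k | T_N)]
              for the observed value f_k of feature F_k.
    Returns pi_k, the updated posterior.
    """
    unnormalized = delta_k * pi_prev          # diag(Delta_k(F_k)) pi_{k-1}
    return unnormalized / unnormalized.sum()  # divide by Delta_k^T(F_k) pi_{k-1}

# Example: N = 3 classes with a uniform prior and one observed feature value.
pi0 = np.array([1/3, 1/3, 1/3])
delta = np.array([0.7, 0.2, 0.1])   # hypothetical likelihoods of the observed value
pi1 = update_posterior(pi0, delta)  # -> array([0.7, 0.2, 0.1])
```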
Next, we simplify the probability P(D_R = j, T = T_i) by exploiting the definition of the a posteriori probability π_R^i.

Lemma 2.
Based on the fact that x_R = ∑_{k=0}^{K} x_k 1_{R=k} for any sequence of random variables {x_k}, where 1_A is the indicator function for event A (i.e., 1_A = 1 when A occurs, and 1_A = 0 otherwise), the probability P(D_R = j, T_i) can be written as follows:

    P(D_R = j, T_i) = E{ π_R^i 1_{D_R = j} }.    (5)

Using Lemma 2, the average cost in Eq. (2) can be written compactly as:

    J(R, D_R) = E{ ∑_{k=1}^{R} c_k + ∑_{j=1}^{N} ( ∑_{i=1}^{N} M_{ij} π_R^i ) 1_{D_R = j} },    (6)

which in turn can be rewritten as follows:

    J(R, D_R) = E{ ∑_{k=1}^{R} c_k + ∑_{j=1}^{N} M_j^T π_R 1_{D_R = j} },    (7)

where M_j ≜ [M_{1,j}, M_{2,j}, ..., M_{N,j}]^T.

To obtain the optimum stopping time R, we must first obtain the optimum decision rule D_R for any given R. In the process of finding the optimum decision rule, we need to find a lower bound (independent of D_R) for the second term inside the expectation in Eq. (7), which is the part of the equation that depends on D_R. Theorem 1 provides such a bound.
1. Even though validation of this assumption is beyond the scope of this paper, we find our proposed method to work well in practice.
Theorem 1.
For any classification rule D_R given stopping time R, ∑_{j=1}^{N} M_j^T π_R 1_{D_R = j} ≥ g(π_R), where g(π_R) ≜ min_{1 ≤ j ≤ N} [M_j^T π_R]. The optimum rule is defined as follows:

    D_R^optimum = arg min_{1 ≤ j ≤ N} [M_j^T π_R].    (8)

From Theorem 1, we conclude that:

    J(R, D_R) ≥ J(R, D_R^optimum), where J(R, D_R^optimum) = min_{D_R} J(R, D_R).    (9)

Thus, we can reduce the cost function in Eq. (7) to one which depends only on the stopping time R as follows:

    J̃(R) = E{ ∑_{k=1}^{R} c_k + g(π_R) }.    (10)

To optimize the cost function in Eq. (10) with respect to R, we need to solve the following optimization problem:

    min_{R ≥ 0} J̃(R) = min_{R ≥ 0} E{ ∑_{k=1}^{R} c_k + g(π_R) }.    (11)

Since R ∈ {0, 1, ..., K}, the optimum strategy consists of a maximum of K + 1 stages, where the optimum solution must minimize the corresponding average cost going from stage 0 to K. The solution can be obtained using dynamic programming [40].
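Before describing the dynamic program, note that for a fixed posterior the stopping cost g(π_R) and the rule of Eq. (8) reduce to a single matrix-vector product followed by a minimum. The snippet below is an illustrative sketch with a hypothetical 0/1 cost matrix (not values from the paper).

```python
import numpy as np

def bayes_stop_cost_and_decision(pi, M):
    """g(pi) = min_j M_j^T pi and D^optimum = argmin_j M_j^T pi (Eq. (8)).

    pi : (N,) posterior over classes.
    M  : (N, N) cost matrix, M[i, j] = cost of deciding class j when class i is true.
    """
    expected_costs = M.T @ pi          # entry j equals M_j^T pi
    j_star = int(np.argmin(expected_costs))
    return expected_costs[j_star], j_star

# Hypothetical 3-class example with 0/1 misclassification costs.
M = np.ones((3, 3)) - np.eye(3)
g, decision = bayes_stop_cost_and_decision(np.array([0.7, 0.2, 0.1]), M)
# g = 0.3 (probability of error if we stop now), decision = 0
```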
Theorem 2.
For k = K−1, ..., 0, the function J̄_k(π_k) is related to J̄_{k+1}(π_{k+1}) through the equation:

    J̄_k(π_k) = min[ g(π_k),  c_{k+1} + ∑_{F_{k+1}} Δ_{k+1}^T(F_{k+1}) π_k × J̄_{k+1}( diag(Δ_{k+1}(F_{k+1})) π_k / (Δ_{k+1}^T(F_{k+1}) π_k) ) ],    (12)

where J̄_K(π_K) = g(π_K).

The optimum stopping strategy derived from Eq. (12) has a very intuitive structure. Specifically, it stops at stage k, where the cost of stopping (the first expression in the minimization) is no greater than the expected cost of continuing given all information accumulated at the current stage k (the second expression in the minimization). Equivalently, at each stage k, our method faces two options given π_k: (i) stop evaluating features and select optimally between the N classes, or (ii) continue with the next feature. The cost of stopping is g(π_k), whereas the cost of continuing is c_{k+1} + ∑_{F_{k+1}} Δ_{k+1}^T(F_{k+1}) π_k × J̄_{k+1}( diag(Δ_{k+1}(F_{k+1})) π_k / (Δ_{k+1}^T(F_{k+1}) π_k) ).

Based on Lemma 1, and Theorems 1 and 2, we present ETANA, an on-the-fly fEature selecTion and clAssificatioN Algorithm. Initially, the posterior probability vector π_0 is set to [p_1, p_2, ..., p_N]^T, and the two terms in Eq. (12) are compared. If the first term is less than or equal to the second term, ETANA classifies the instance under examination to the appropriate class, based on the optimum rule in Eq. (8). Otherwise, the first feature is evaluated. ETANA repeats these steps until either it decides to classify the instance using fewer than K features, or using all K features.

To implement ETANA, we first need to solve the dynamic programming recursion in Eq. (12). This can be achieved by quantizing the interval [0, 1] along the N dimensions of π_k such that ∑_{i=1}^{N} π_k^i = 1, to generate d different possible vectors π_k. We can then compute a (K + 1) × d matrix, where each row contains values of the function J̄_k(π_k), k = 0, 1, ..., K, evaluated using Theorem 2 for all d possible vectors π_k. Although this computation requires only a priori information, the size of this matrix (i.e., d) grows exponentially with the number N of classes, resulting in a computationally expensive solution. To address this challenge, we propose an efficient implementation of ETANA in Section 4.
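To make the implementation route just described concrete, the sketch below tabulates the recursion of Theorem 2 on a quantized simplex grid and then runs ETANA's stop/continue loop for a single instance. It is a simplified illustration rather than the authors' code: features are assumed to be discretized into V bins, the tables lik[k][v, i] = P(F_{k+1} = v | T_i) are assumed to be estimated beforehand (Section 5.1), and continuation values are read off at the nearest grid point.

```python
import itertools
import numpy as np

def simplex_grid(n_classes, steps):
    """All probability vectors whose entries are multiples of 1/steps."""
    pts = [np.array(c) / steps
           for c in itertools.product(range(steps + 1), repeat=n_classes)
           if sum(c) == steps]
    return np.array(pts)

def g(pi, M):
    """Stopping cost g(pi) = min_j M_j^T pi."""
    return (M.T @ pi).min()

def fit_value_table(lik, M, costs, grid):
    """J[k, m]: optimal expected cost-to-go of Eq. (12) at stage k and grid point m."""
    K, d = len(lik), len(grid)
    J = np.empty((K + 1, d))
    J[K] = [g(p, M) for p in grid]
    for k in range(K - 1, -1, -1):                 # backward recursion, Theorem 2
        for m, pi in enumerate(grid):
            cont = costs[k]                        # c_{k+1}
            for v in range(lik[k].shape[0]):       # expectation over values of F_{k+1}
                delta = lik[k][v]                  # [P(F_{k+1}=v | T_1), ..., P(.|T_N)]
                pr_v = float(delta @ pi)
                if pr_v > 0:
                    nxt = delta * pi / pr_v        # Lemma 1 update
                    idx = np.abs(grid - nxt).sum(axis=1).argmin()
                    cont += pr_v * J[k + 1, idx]
            J[k, m] = min(g(pi, M), cont)
    return J

def etana_classify(x_bins, prior, lik, M, costs, grid, J):
    """Evaluate features one at a time; stop once stopping is no costlier than continuing."""
    pi = prior.copy()
    for k in range(len(lik)):
        idx = np.abs(grid - pi).sum(axis=1).argmin()
        if g(grid[idx], M) <= J[k, idx] + 1e-12:   # stopping region of the k-th stage
            return int((M.T @ pi).argmin()), k     # Eq. (8) decision, #features used
        delta = lik[k][x_bins[k]]
        pi = delta * pi / (delta @ pi)             # Lemma 1 update with the observed value
    return int((M.T @ pi).argmin()), len(lik)
```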
4 EFFICIENT IMPLEMENTATION OF ETANA

In this section, we present a fast version of ETANA, namely F-ETANA, that exploits structural properties of the optimum classification rule in Eq. (8) and the optimum stopping strategy in Eq. (12).

Consider a general form of the function g(π_R) used to derive the optimum classification rule in Eq. (8) as follows:

    g(ϖ) ≜ min_{1 ≤ j ≤ N} [M_j^T ϖ],   ϖ ∈ [0, 1]^N,    (13)

where ϖ = [ω^1, ..., ω^N]^T, such that ω^i ≥ 0 and ∑_{i=1}^{N} ω^i = 1. Here, the domain of g(ϖ) is the probability space of ϖ, which is the (N−1)-dimensional unit simplex. Function g(ϖ) has some interesting properties, as described in Lemma 3.

Lemma 3.
Function g(ϖ) is concave, continuous, and piecewise linear, and consists of at most N hyperplanes.

Fig. 2 shows a visualization of Lemma 3 when N = 3, so that the domain of g(ϖ) is a 2-dimensional unit simplex (i.e., an equilateral triangle).

Fig. 2: Illustration of Lemma 3 when the number N of classes equals 3: the unit simplex is partitioned into regions Decision 1, Decision 2, and Decision 3 by the hyperplanes of g(ϖ).
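As a small worked example of Lemma 3 (with hypothetical costs chosen only for illustration), take N = 2 with 0/1 misclassification costs, so that M_1 = [0, 1]^T and M_2 = [1, 0]^T. Writing ϖ = [ω, 1 − ω]^T,

    g(ϖ) = min{ M_1^T ϖ, M_2^T ϖ } = min{ 1 − ω, ω },

which equals ω on [0, 1/2] and 1 − ω on [1/2, 1]: a concave "tent" made of two line segments (at most N = 2 hyperplanes, as Lemma 3 states), continuous on the 1-dimensional unit simplex, and largest at ω = 1/2, where both decisions are equally costly.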
Next, we consider the general form of the optimum stopping strategy in Eq. (12) as follows:

    J̄_k(ϖ) = min[ g(ϖ),  c_{k+1} + ∑_{F_{k+1}} Δ_{k+1}^T(F_{k+1}) ϖ × J̄_{k+1}( diag(Δ_{k+1}(F_{k+1})) ϖ / (Δ_{k+1}^T(F_{k+1}) ϖ) ) ].    (14)

Lemma 4 summarizes the key properties enjoyed by this function.

Lemma 4.
The functions J̄_k(ϖ), k = 0, ..., K−1, are concave, continuous, and piecewise linear.

The fact that g(ϖ) and J̄_k(ϖ) are concave and piecewise linear allows for a compact representation of these functions. Recall that according to Theorem 2, we stop at stage k whenever g(ϖ) ≤ C̄_{k+1}(ϖ), where C̄_{k+1}(ϖ) is the optimum cost-to-go at stage k given by c_{k+1} + ∑_{F_{k+1}} Δ_{k+1}^T(F_{k+1}) ϖ × J̄_{k+1}( diag(Δ_{k+1}(F_{k+1})) ϖ / (Δ_{k+1}^T(F_{k+1}) ϖ) ). In particular, to decide between continuing and stopping, it is sufficient to keep track of the thresholds at the intersections of g(ϖ) with every C̄_{k+1}(ϖ), as stated in Theorem 3 below.

Theorem 3.
At every stage k, there exist at most N threshold curves that separate the unit simplex into regions which alternately switch between continuation to the next stage and stopping. In particular, the region starting from every corner of the (N−1)-dimensional unit simplex always corresponds to stopping the feature evaluation process.

Theorem 3 turns out to be very important. Specifically, the region in which the a posteriori probability vector π_R falls decides between continuing to the next stage and stopping. This provides an alternative, fast implementation of the optimum solution using thresholds. Fig. 3 shows a visualization of Theorem 3; both sub-figures contain the maximum number of threshold curves (i.e., 3, since N = 3).

Fig. 3: Illustration of Theorem 3 (better seen in color): (a) and (b) show two different stages, using the MLL dataset. The MLL dataset contains 72 samples, each of which comprises 5,848 gene expression values belonging to one of 3 diagnostic classes [41]. In both cases, the blue region corresponds to continuation to the next stage, while the red region corresponds to stopping.

We propose a stochastic gradient algorithm to estimate the threshold curves described in Theorem 3. For ease of implementation, we restrict the approximation to linear threshold curves of the form given in Eq. (15). Let θ_{D̃_R̃} ≜ [θ^1_{D̃_R̃}, θ^2_{D̃_R̃}, ..., θ^N_{D̃_R̃}] denote the parameters of a linear hyperplane, where R̃ is the number of features evaluated so far, and D̃_R̃ = j, R̃ ∈ {0, 1, ..., K}, j ∈ {1, ..., N}, represents a decision choice. Then, the decision Z_θ to "stop" or "continue" at each stage R̃ under the decision choice D̃_R̃ = j, as a function of ϖ, is defined as follows:

    Z_{θ_D̃}(ϖ) = stop,      if θ^T_{D̃_R̃} ϖ ≤ 0;
                  continue,  otherwise.    (15)

Decision Z_{θ_D̃} is indexed by θ_D̃ to show the explicit dependency of the parameters on the decision, where θ_D̃ ≜ [θ_{D̃_0}, θ_{D̃_1}, ..., θ_{D̃_{K−1}}] ∈ R^{K×N}, D̃_R̃ = j ∀ R̃, is the concatenation of the θ_{D̃_R̃} vectors, one for each stage R̃. Now, recall the cost function in Eq. (10). Since we are interested in finding linear thresholds for each decision choice D̃_R̃ = j independently, we use a modified version of the cost function in Eq. (10) as follows:

    H̃(θ_D̃) = E_{Z_{θ_D̃}} { ∑_{k=1}^{R̃} c_k + M^T_{D̃_R̃} π_R̃ }.    (16)

Algorithm 1 generates a sequence of estimates θ_{D̃,t}, where θ_{D̃,t} denotes the estimate of θ_D̃ at iteration t, by computing the gradient ∇_{θ_D̃} H̃(θ).
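To illustrate how the decision of Eq. (15) and the cost inside the expectation of Eq. (16) are evaluated for a single training instance (essentially what Function 2 below does), here is a minimal Python sketch; it is illustrative only, the names are ours, and the posterior update of Lemma 1 is inlined.

```python
import numpy as np

def empirical_cost(theta, pi0, feature_bins, lik, M_j, costs):
    """Sample cost inside the expectation of Eq. (16) for one instance.

    theta        : (K, N) array, one linear threshold vector per stage (Eq. (15)).
    pi0          : (N,) prior over classes.
    feature_bins : length-K list of observed (discretized) feature values.
    lik          : lik[k][v, i] = P(F_{k+1} = v | T_i), assumed pre-estimated.
    M_j          : (N,) misclassification cost vector of the decision choice j.
    costs        : (K,) feature evaluation costs c_1, ..., c_K.
    """
    pi, total, k = pi0.copy(), 0.0, 0
    while k < len(costs) and theta[k] @ pi > 0:     # Eq. (15): continue while theta^T pi > 0
        delta = lik[k][feature_bins[k]]
        pi = delta * pi / (delta @ pi)              # Lemma 1 update
        total += costs[k]                           # pay c_{k+1}
        k += 1
    return total + M_j @ pi                         # add the misclassification term of Eq. (16)
```

Averaging this quantity over training instances yields a Monte Carlo estimate of H̃(θ_D̃), which is exactly what the SPSA iterations described next evaluate at perturbed parameter values.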
Although evaluating the gradient in closed form is intractable due to the non-linear dependency of H̃(θ_D̃) on θ_D̃, an estimate ∇̂_{θ_D̃} H̃(θ_D̃) can be computed using a simulation-based gradient estimator. For simplicity, we opted for the SPSA algorithm [42], among the several simulation-based gradient estimators in the literature [43]. The SPSA algorithm estimates the gradient at each iteration t using a finite difference method and a random direction α_t, as follows:

    ∇̂_{θ_D̃} H̃(θ_{D̃,t}) = [ H̃(θ_{D̃,t} + β_t α_t) − H̃(θ_{D̃,t} − β_t α_t) ] / (2 β_t α_t),    (17)

where α_t^i = −1 with probability 0.5 and +1 with probability 0.5. Using the gradient estimate in Eq. (17), parameter θ_{D̃,t} is updated as follows:

    θ_{D̃,t+1} = θ_{D̃,t} − a_t ∇̂_{θ_D̃} H̃(θ_{D̃,t}),    (18)

where a_t and β_t are typically chosen as in [42]:

    a_t = ε (t + 1 + ς)^{−κ},   β_t = µ (t + 1)^{−υ},    (19)

with the positive constants ε, ς, κ, µ and υ selected as recommended in [42]. Algorithm 1 is guaranteed to converge to a local minimum with probability one [42]. We consider the following stopping criteria: ||∇̂_{θ_D̃} H̃(θ_{D̃,t})|| ≤ ε, or the algorithm stops when it reaches a user-defined maximum number of iterations. Finally, H̃(·) in Eq. (16) is estimated using Function 2.

Algorithm 1 Stochastic Gradient Algorithm for Estimating Optimal Linear Thresholds
Require: Initial parameters θ_{D̃,0}
Output: Optimal parameters θ_{D̃,opt}
  for iterations t = 0, 1, 2, ... do
      Evaluate H̃(θ_{D̃,t} + β_t α_t) and H̃(θ_{D̃,t} − β_t α_t) using Function 2, based on Eq. (16)
      Estimate ∇̂_{θ_D̃} H̃(θ_{D̃,t}) using Eq. (17)
      Update θ_{D̃,t} to θ_{D̃,t+1} using Eq. (18)
      Stop if ||∇̂_{θ_D̃} H̃(θ_{D̃,t})|| ≤ ε or the maximum number t_max of iterations is reached
  end for
  return θ_{D̃,opt}

Function 2 H̃(·)
Require: Parameter θ_D̃ and π_0
Output: H̃(θ_D̃)
  Initialization: k = 0 and H̃ = 0
  while θ^T_{D̃_k} π_k ≥ 0 do
      k = k + 1
      Obtain a new feature F_k
      Update π_k using Eq. (4)
      H̃ = H̃ + c_k
  end while
  H̃ = H̃ + M^T_{D̃_k} π_k
  return H̃
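A compact loop corresponding to Algorithm 1 might look as follows. This is a hedged sketch under simplifying assumptions, not the authors' code: estimate_H stands for an average of the per-instance cost shown earlier over the training set, and the gain-sequence constants are illustrative (the decay exponents 0.602 and 0.101 are Spall's commonly used defaults, not values from the paper).

```python
import numpy as np

def spsa_fit_thresholds(estimate_H, K, N, t_max=1000, tol=1e-3,
                        eps=0.1, sigma=10.0, kappa=0.602, mu=0.1, upsilon=0.101,
                        seed=0):
    """Estimate the linear thresholds theta (K x N) by minimizing H~ via SPSA (Eqs. (17)-(19))."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((K, N))                                 # theta_{D~,0}
    for t in range(t_max):
        a_t = eps * (t + 1 + sigma) ** (-kappa)              # gain sequences of Eq. (19);
        b_t = mu * (t + 1) ** (-upsilon)                     # exponents are illustrative choices
        alpha = rng.choice([-1.0, 1.0], size=(K, N))         # random +/-1 perturbation directions
        diff = estimate_H(theta + b_t * alpha) - estimate_H(theta - b_t * alpha)
        grad = diff / (2.0 * b_t * alpha)                    # SPSA gradient estimate, Eq. (17)
        theta = theta - a_t * grad                           # update, Eq. (18)
        if np.linalg.norm(grad) <= tol:                      # stopping criterion of Algorithm 1
            break
    return theta
```

One such set of thresholds θ_D̃ is fitted per decision choice j, with estimate_H computed using the corresponding cost vector M_j.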
5 EXPERIMENTAL RESULTS

In this section, we conduct an extensive set of experiments to evaluate the performance of ETANA and F-ETANA using seven benchmark datasets: 4 DNA microarray datasets (Lung Cancer, Lung2, MLL, Car) [41], 2 NIPS 2003 feature selection challenge datasets (Dexter, Madelon) [44], and 1 high dimensional dataset (News20) [45]. Table 1 summarises these datasets. For the Madelon, Dexter and MLL datasets, we use the originally provided training and validation sets, while for the Lung Cancer, Lung2 and Car datasets, we report five-fold cross validated results. All experiments are conducted on a PC with an Intel(R) Core(TM) i7-7700 @3.60 GHz CPU with 16 GB memory, running Windows 10 Pro, 64 bit operating system.

TABLE 1: Datasets used in our experiments.

Dataset     # Samples    # Features    # Classes
Madelon     2,000        500           2
Lung        181          12,533        2
MLL         72           5,848         3
Dexter      300          20,000        2
Car         174          9,182         11
Lung2       203          3,312         5
News20      19,996       1,355,191     2

Fig. 4: Variation of (a) accuracy, (b) average number of features, and (c) training time (sec) as a function of the number V of bins, using the Lung Cancer, Dexter, Madelon and MLL datasets.
Here, we discuss some practical considerations. We use a smoothed maximum likelihood estimator to estimate P(F_k | T_i), k = 1, ..., K, i = 1, ..., N, after quantizing the feature space. Specifically, p̂(F_k | T_i) = (S_{k,i} + 1) / (S_i + V), where S_{k,i} denotes the number of samples that satisfy F_k = f_k and belong to class T_i, S_i denotes the total number of samples belonging to class T_i, and V is the number of bins considered. The effect of the number V of bins on the performance of our algorithm is studied in Section 5.2. We estimate the a priori probabilities as P(T_i) = S_i / ∑_{j=1}^{N} S_j, i = 1, ..., N.

Feature ordering is crucial for early stopping. Different features can hinder or facilitate the quick identification of the class to which an instance may belong. Consider an example of classifying fruits as either 'Apple' or 'Orange' using two features F_1 and F_2, where F_1 is the color of the fruit, and F_2 is the weight of the fruit. Intuitively, the color of the fruit can potentially simplify the classification process as compared to the weight of the fruit. As a result, if feature F_2 was to be examined first, it would be very probable for feature F_1 to be examined as well to improve the chances of accurate classification. Instead, if F_1 was to be evaluated first, a decision could be made using one feature only. To avoid the computational complexity of evaluating all K! possible feature orderings, we sort features in increasing order of the sum of type I and type II errors (considering the true class as the positive class and all the remaining classes as a single negative class), scaled by the cost coefficient of the corresponding feature, to promote low cost features that at the same time are expected to result in few errors. Finally, for F-ETANA, we set the tolerance ε and the maximum number t_max of iterations as the stopping criteria in Algorithm 1.

In Section 5.1, we estimated the conditional probabilities of features given the class using a data binning technique (i.e., p̂(F_k | T_i) = (S_{k,i} + 1) / (S_i + V)). In this subsection, we analyze the effect of the number V of bins on ETANA using four datasets (Lung, Dexter, Madelon, MLL). In Fig. 4, we plot the variation of the accuracy, the average number of features used for classification, and the training time as a function of V. ETANA's accuracy and the average number of features used for classification are relatively robust to the number of bins (see Fig. 4(a) and Fig. 4(b)). However, increasing the number of bins from 50 to 100 results in a drop in accuracy for the MLL dataset. Most probably, this is due to overfitting as a result of increasing the resolution of the feature space to a very high value. The linear relationship between training time and the number of bins (see Fig. 4(c)) is due to Eq. (12). Without any loss of generality, in the rest of the experiments we set V to the number of classes (i.e., N) in the dataset under evaluation.

To study the behavior of ETANA for varying values of the feature evaluation cost c, when all features incur the same cost (i.e., c_k = c), we measured accuracy for constant misclassification costs (i.e., M_{i,j} = 1 ∀ i ≠ j, M_{i,i} = 0, i, j ∈ {1, ..., N}) and a range of values of c. Different c values result in a different number of features and levels of accuracy. Intuitively, using a small portion of the total feature set leads to low accuracy, whereas when the average number of features used increases, the performance improves dramatically. From here onwards, unless specified, we report results for a fixed value of c.
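A compact prototype of these estimation and ordering steps is sketched below; it is illustrative only (the names and the equal-width binning are ours), uses the add-one smoothed estimator described above, and scores each feature by one plausible reading of the ordering criterion: the sum over classes of one-vs-rest type I and type II error rates of a single-feature likelihood test, scaled by the feature's cost.

```python
import numpy as np

def fit_tables(X, y, V, costs):
    """Estimate binned likelihoods, priors, and a feature order (Section 5.1 style).

    X: (n, K) raw feature matrix; y: (n,) labels in {0, ..., N-1};
    V: number of bins; costs: (K,) feature evaluation costs c_k.
    """
    n, K = X.shape
    N = int(y.max()) + 1
    edges = [np.linspace(X[:, k].min(), X[:, k].max(), V + 1)[1:-1] for k in range(K)]
    B = np.stack([np.digitize(X[:, k], edges[k]) for k in range(K)], axis=1)      # (n, K) bin ids

    S_i = np.array([(y == i).sum() for i in range(N)])
    prior = S_i / S_i.sum()                                   # P(T_i) = S_i / sum_j S_j
    lik = [np.empty((V, N)) for _ in range(K)]                # lik[k][v, i] = P(F_k = v | T_i)
    for k in range(K):
        for i in range(N):
            counts = np.bincount(B[y == i, k], minlength=V)
            lik[k][:, i] = (counts + 1) / (S_i[i] + V)        # smoothed maximum likelihood estimate

    # Sum of one-vs-rest type I and type II error rates of each feature, scaled by its cost.
    score = np.empty(K)
    for k in range(K):
        err = 0.0
        for i in range(N):
            rest = (lik[k] @ prior - lik[k][:, i] * prior[i]) / (1.0 - prior[i])  # P(F_k=v | not T_i)
            predict_i = lik[k][:, i] > rest                   # single-feature likelihood comparison
            fn = ((~predict_i[B[:, k]]) & (y == i)).sum() / max(S_i[i], 1)        # type II rate
            fp = ((predict_i[B[:, k]]) & (y != i)).sum() / max(n - S_i[i], 1)     # type I rate
            err += fn + fp
        score[k] = costs[k] * err
    return B, prior, lik, np.argsort(score)                   # cheap, informative features first
```

Features are then presented to ETANA and F-ETANA in ascending order of this score.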
TABLE 2: Comparison of the accuracy of ETANA, F-ETANA, OFS-Density, OFS-A3M, SAOLA, Fast-OSFS, OSFS, and Alpha-Investing on the seven datasets of Table 1. The highest accuracy is bolded and gray-shaded; the second highest value is also gray-shaded. Cells are marked with '- -' if the corresponding method was unable to generate results within a cutoff time of 12 days.

TABLE 3: Comparison of the average number of features used by each method on the same datasets. Values corresponding to the highest and second highest accuracy are bolded and gray-shaded, and gray-shaded, respectively. Cells are marked with '- -' if the corresponding method was unable to generate results within a cutoff time of 12 days.
TABLE 4: Comparison of the time (in seconds) required for feature selection (F), classification (C), joint feature selection and classification (F+C), and model training (T). Values corresponding to the highest and second highest accuracy are bolded and gray-shaded, and gray-shaded, respectively. Cells are marked with '- -' if the corresponding method was unable to generate results within a cutoff time of 12 days.
For ETANA and F-ETANA, Table 4 reports the time for joint feature selection and classification (F+C) and for model training (T), while for the baselines it reports feature selection (F), classification (C), and training (T) times separately. For example, joint feature selection and classification with ETANA takes 3.678 s on Madelon, 0.003 s on Lung Cancer, 0.047 s on Dexter, 0.034 s on Lung2, and 346.47 s on News20.

In this subsection, we compare ETANA and F-ETANA with the following state-of-the-art online feature selection methods: OFS-Density [7], OFS-A3M [9], SAOLA [6], OSFS [5], Fast-OSFS [5], and Alpha-Investing [4]. We use a KNN classifier with three neighbours to evaluate a selected feature subset, which has been shown to outperform SVM, CART, and J48 classifiers on the datasets used in [6], [7]. For SAOLA, OSFS, and Fast-OSFS, the parameter α is set as in [5], [6]. For Alpha-Investing, parameters are set to the values used in [4].

We summarize our observations from Tables 2, 3 and 4 by dataset as follows.

Madelon:
ETANA achieves the highest accuracy using only a handful of features on average, improving in accuracy over Alpha-Investing, which has the highest accuracy among all the baselines using the same number of features. At the same time, ETANA is several times faster in joint feature selection and classification, and in model training, compared to Alpha-Investing.

Lung Cancer: SAOLA and Fast-OSFS achieve the highest accuracy, but require several times more features compared to ETANA for a modest difference in accuracy. Further, ETANA is much faster in joint feature selection and classification compared to SAOLA and Fast-OSFS.

MLL: ETANA and F-ETANA achieve the highest accuracy using only a few features on average. This corresponds to an improvement in accuracy with far fewer features used compared to OFS-Density and Alpha-Investing, which achieve the highest accuracy among all the baselines. At the same time, both ETANA and F-ETANA are much faster in joint feature selection and classification compared to OFS-Density and Alpha-Investing.

Dexter: OFS-Density achieves the highest accuracy, but requires many more features. This results in a significant slowdown for joint feature selection and classification compared to ETANA.

Car: F-ETANA achieves the highest accuracy; however, ETANA is a close second while using only 13.85 features on average. This corresponds to an improvement in accuracy and in the average number of features used, while at the same time leading to a faster runtime compared to SAOLA, the best performing baseline.

Lung2: OFS-Density achieves the highest accuracy with considerably more features as compared to ETANA, and a much slower runtime.

In this subsection, we discuss the performance of our algorithms, ETANA and F-ETANA, and the state-of-the-art online feature selection methods on the News20 dataset. Experiments on this dataset are conducted using the high performance computing cluster provided by the Information Technology Services at the University at Albany, SUNY. We used one node with 20 Intel(R) Xeon(R) E5-2680 v2 @2.80 GHz CPUs with 256 GB memory. Except for SAOLA, the rest of the online feature selection methods were unable to generate results within a cutoff time of 12 days. Although SAOLA achieves the highest accuracy, it requires many more features and is ∼20 times slower in joint feature selection and classification compared to ETANA (see the last row in Tables 2, 3 and 4).

Fig. 5: Comparison of (a) model training time (seconds), and (b) accuracy achieved by F-ETANA and ETANA using the Dexter (2 classes), MLL (3 classes), Lung2 (5 classes), and Car (11 classes) datasets.

Thus far we have shown that ETANA outperforms all baselines in terms of accuracy, number of features, and time required for joint feature selection and classification. The limitation of ETANA is its training time (see Table 4), due to the construction of a (K + 1) × d matrix which grows exponentially with the number of classes (see Section 3). F-ETANA drastically reduces the time required for model training as compared to ETANA without sacrificing accuracy (see Fig. 5). At the same time, F-ETANA requires more features per data instance compared to ETANA (see Table 3).
6 CONCLUSION

This paper investigated a new research problem, on-the-fly joint feature selection and classification, which aims to minimize the number of feature evaluations per data instance for fast and accurate classification. Specifically, an optimization problem was defined in terms of the cost of evaluating features and the Bayes risk associated with the classification decision. The optimum solution was derived using dynamic programming, and it was shown that the corresponding functions are concave, continuous and piecewise linear. Two algorithms, ETANA and F-ETANA, were proposed based on the optimum solution and its properties. The proposed algorithms outperformed state-of-the-art feature selection methods in terms of the average number of features used, classification accuracy, and the time required for on-the-fly joint feature selection and classification. Furthermore, F-ETANA resulted in a drastic reduction in model training time compared to ETANA. As part of our future work, we plan to exploit feature dependencies, which may improve performance even more.

APPENDIX A

A.1 Proof of Lemma 1
We start from the definition of the a posteriori probability vector, i.e., π_k ≜ [π_k^1, π_k^2, ..., π_k^N]^T, and consider any element in this vector, i.e., π_k^i. Specifically, we use Bayes' rule and the law of total probability to get the following result:

    π_k^i = P(T_i | F_1, ..., F_k)
          = P(F_1, ..., F_k | T_i) P(T_i) / P(F_1, ..., F_k)    (20)
          = P(F_1, ..., F_k | T_i) P(T_i) / ∑_{j=1}^{N} P(F_1, ..., F_k, T_j)
          = P(F_1, ..., F_k | T_i) P(T_i) / ∑_{j=1}^{N} P(F_1, ..., F_k | T_j) P(T_j).    (21)

Note that we can further simplify Eq. (21) by exploiting the conditional independence of the features in set F given the class variable T as follows:

    π_k^i = P(T_i) ∏_{n=1}^{k} P(F_n | T_i) / ∑_{j=1}^{N} P(T_j) ∏_{n=1}^{k} P(F_n | T_j)
          = p_i ∏_{n=1}^{k} P(F_n | T_i) / ∑_{j=1}^{N} p_j ∏_{n=1}^{k} P(F_n | T_j).    (22)

Similarly, π_{k−1}^i takes the following form:

    π_{k−1}^i = p_i ∏_{n=1}^{k−1} P(F_n | T_i) / ∑_{j=1}^{N} p_j ∏_{n=1}^{k−1} P(F_n | T_j).    (23)

We can now rewrite π_k^i in Eq. (22) in terms of π_{k−1}^i in Eq. (23) as follows:

    π_k^i = p_i ∏_{n=1}^{k} P(F_n | T_i) / ∑_{j=1}^{N} p_j ∏_{n=1}^{k} P(F_n | T_j)
          = P(F_k | T_i) ( p_i ∏_{n=1}^{k−1} P(F_n | T_i) ) / ∑_{j=1}^{N} p_j ∏_{n=1}^{k} P(F_n | T_j)
          = P(F_k | T_i) ( π_{k−1}^i ∑_{j=1}^{N} p_j ∏_{n=1}^{k−1} P(F_n | T_j) ) / ∑_{j=1}^{N} p_j ∏_{n=1}^{k} P(F_n | T_j)
          = P(F_k | T_i) π_{k−1}^i / ( ∑_{j=1}^{N} p_j ∏_{n=1}^{k} P(F_n | T_j) / ∑_{j=1}^{N} p_j ∏_{n=1}^{k−1} P(F_n | T_j) )
          = P(F_k | T_i) π_{k−1}^i / ∑_{j=1}^{N} P(F_k | T_j) π_{k−1}^j.    (24)

Finally, using the above result, the a posteriori probability vector takes the following form:

    π_k = [π_k^1, π_k^2, ..., π_k^N]^T
        = diag([P(F_k | T_1), ..., P(F_k | T_N)]) [π_{k−1}^1, ..., π_{k−1}^N]^T / ( [P(F_k | T_1), ..., P(F_k | T_N)] [π_{k−1}^1, ..., π_{k−1}^N]^T )
        = diag(Δ_k(F_k)) π_{k−1} / (Δ_k^T(F_k) π_{k−1}),    (25)

where Δ_k(F_k) ≜ [P(F_k | T_1), P(F_k | T_2), ..., P(F_k | T_N)]^T, diag(A) denotes a diagonal matrix with diagonal elements being the elements of vector A, and π_0 ≜ [p_1, p_2, ..., p_N]^T.

A.2 Proof of Lemma 2
Using the law of total probability, we can write the probability P(D_R = j, T_i) as follows:

    P(D_R = j, T_i) = ∑_{k=0}^{K} P(R = k, D_k = j, T_i)
                    = ∑_{k=0}^{K} P(T_i) P(R = k, D_k = j | T_i).    (26)

Using the fact that the event {R = k, D_k = j} depends only on the set {F_1, F_2, ..., F_k}, and by the definition P(R = k, D_k = j) = E{ 1_{R=k} 1_{D_k=j} }, Eq. (26) can be written as follows:

    P(D_R = j, T_i) = ∑_{k=0}^{K} ∑_{F_1, F_2, ..., F_k} 1_{R=k} 1_{D_k=j} P(T_i) P(F_1, ..., F_k | T_i).    (27)

Then, using the result in Eq. (20), we can incorporate π_k^i into Eq. (27) as follows:

    P(D_R = j, T_i) = ∑_{k=0}^{K} ∑_{F_1, F_2, ..., F_k} 1_{R=k} 1_{D_k=j} P(F_1, ..., F_k) π_k^i.    (28)

Further, from the definition of the expectation operator (i.e., if a random variable Y has a set D of possible values and probability mass function P(Y), then the expected value E{H(Y)} of any function H(Y) equals ∑_{Y ∈ D} P(Y) H(Y)), Eq. (28) can be rewritten as follows:

    P(D_R = j, T_i) = ∑_{k=0}^{K} E{ 1_{R=k} 1_{D_k=j} π_k^i }.    (29)

By the linearity of expectation in Eq. (29), we get:

    P(D_R = j, T_i) = E{ ∑_{k=0}^{K} 1_{R=k} 1_{D_k=j} π_k^i }.    (30)

Finally, using the fact that x_R = ∑_{k=0}^{K} x_k 1_{R=k}, Eq. (30) ends up in the desired form, as shown below:

    P(D_R = j, T_i) = E{ π_R^i 1_{D_R=j} }.    (31)

A.3 Proof of Theorem 1
At any stopping time R, the optimum decision D_R takes only one out of N possibilities such that the misclassification cost is minimum. In the process of finding this optimum D_R, it is important to note that ∑_{j=1}^{N} 1_{D_R=j} = 1, where 1_A is the indicator function for event A (i.e., 1_A = 1 when A occurs, and 1_A = 0 otherwise), which implies that only one of the terms in the sum becomes 1, while the remaining terms are equal to zero. We note that:

    M_j^T π_R ≥ g(π_R),   ∀ j ∈ {1, 2, ..., N},    (32)

where g(π_R) ≜ min_{1 ≤ j ≤ N} [M_j^T π_R]. Then, using the fact that 1_{D_R=j} is non-negative, we get the following result:

    ∑_{j=1}^{N} (M_j^T π_R) 1_{D_R=j} ≥ g(π_R) ∑_{j=1}^{N} 1_{D_R=j} = g(π_R).    (33)

We underscore that the lower bound g(π_R) derived above is independent of the decision D_R. Thus, it is obvious that this lower bound can be achieved only by the rule defined in Eq. (8), which is therefore the optimum decision for a given stopping time R.

A.4 Proof of Theorem 2
At the end of the K-th stage, assuming that all the features have been examined, the only remaining expected cost is the optimum misclassification cost of selecting among N decision choices at stage k = K, which is J̄_K(π_K) = g(π_K) (see Theorem 1).

Then, consider any intermediate stage k = 0, 1, ..., K−1. Being at stage k, with the available information π_k, the optimum strategy has to choose between either terminating the feature evaluation process and incurring cost g(π_k), which is the optimum misclassification cost of selecting among N decision choices (see Theorem 1), or continuing and incurring cost c_{k+1} to evaluate feature F_{k+1} plus an additional cost J̄_{k+1}(π_{k+1}) to continue optimally. Thus, the total cost of continuing optimally (referred to as the optimum cost-to-go [40]) is c_{k+1} + J̄_{k+1}(π_{k+1}). It is important to note that at stage k, we do not know the outcome of examining feature F_{k+1}. Thus, we need to consider the expected optimum cost-to-go, which is equal to c_{k+1} + E{ J̄_{k+1}(π_{k+1}) | π_k }. Using Lemma 1 to express π_{k+1} in terms of π_k, and by the definition of the expectation operator (i.e., if a random variable Y has a set D of possible values and probability mass function P(Y), then the expected value E{H(Y)} of any function H(Y) equals ∑_{Y ∈ D} P(Y) H(Y)), we get the optimum cost-to-go C̄_{k+1}(π_k):

    C̄_{k+1}(π_k) ≜ c_{k+1} + E{ J̄_{k+1}(π_{k+1}) | π_k }
                 = c_{k+1} + ∑_{F_{k+1}} P(F_{k+1} | F_1, F_2, ..., F_k) × J̄_{k+1}( diag(Δ_{k+1}(F_{k+1})) π_k / (Δ_{k+1}^T(F_{k+1}) π_k) ).    (34)

Let us simplify the term P(F_{k+1} | F_1, F_2, ..., F_k) separately. Specifically, using Bayes' rule and the law of total probability, we observe that:

    P(F_{k+1} | F_1, ..., F_k) = P(F_1, ..., F_{k+1}) / P(F_1, ..., F_k)
                               = ∑_{j=1}^{N} P(F_1, ..., F_{k+1}, T_j) / ∑_{j=1}^{N} P(F_1, ..., F_k, T_j)
                               = ∑_{j=1}^{N} P(F_1, ..., F_{k+1} | T_j) P(T_j) / ∑_{j=1}^{N} P(F_1, ..., F_k | T_j) P(T_j).    (35)

Note that we can further simplify Eq. (35) by exploiting the fact that the random variables F_k are independent under each class T_j as follows:

    P(F_{k+1} | F_1, ..., F_k) = ∑_{j=1}^{N} P(T_j) ∏_{n=1}^{k+1} P(F_n | T_j) / ∑_{j=1}^{N} P(T_j) ∏_{n=1}^{k} P(F_n | T_j)
                               = ∑_{j=1}^{N} ( p_j ∏_{n=1}^{k} P(F_n | T_j) ) P(F_{k+1} | T_j) / ∑_{j=1}^{N} p_j ∏_{n=1}^{k} P(F_n | T_j).    (36)

Using the result in Eq. (22), we can simplify Eq. (36) as follows:

    P(F_{k+1} | F_1, ..., F_k) = ∑_{j=1}^{N} π_k^j P(F_{k+1} | T_j) = Δ_{k+1}^T(F_{k+1}) π_k.    (37)

Finally, substituting Eq. (37) into Eq. (34), we get the desired result:

    C̄_{k+1}(π_k) = c_{k+1} + ∑_{F_{k+1}} Δ_{k+1}^T(F_{k+1}) π_k × J̄_{k+1}( diag(Δ_{k+1}(F_{k+1})) π_k / (Δ_{k+1}^T(F_{k+1}) π_k) ).    (38)

APPENDIX B

B.1 Proof of Lemma 3
Let us consider the definition of g(ϖ):

    g(ϖ) ≜ min_{1 ≤ j ≤ N} [M_j^T ϖ],   ϖ ∈ [0, 1]^N.

The term M_j^T ϖ is linear with respect to ϖ, and since the minimum of linear functions is a concave, piecewise linear function, we conclude that g(ϖ) is a concave, piecewise linear function as well. Concavity also assures the continuity of this function. Finally, minimization over a finite number N of hyperplanes guarantees that the function g(ϖ) is made up of at most N hyperplanes.

B.2 Proof of Lemma 4
First, let us consider the function J̄_{K−1}(ϖ) given by:

    J̄_{K−1}(ϖ) = min[ g(ϖ),  c_K + ∑_{F_K} Δ_K^T(F_K) ϖ × J̄_K( diag(Δ_K(F_K)) ϖ / (Δ_K^T(F_K) ϖ) ) ].    (39)

Using the fact that J̄_K(π_K) = g(π_K), we can rewrite Eq. (39) as follows:

    J̄_{K−1}(ϖ) = min[ g(ϖ),  c_K + ∑_{F_K} Δ_K^T(F_K) ϖ × g( diag(Δ_K(F_K)) ϖ / (Δ_K^T(F_K) ϖ) ) ].    (40)

We focus our attention on the following function inside the summation of Eq. (40):

    Q(ϖ) ≜ Δ_K^T(F_K) ϖ g( diag(Δ_K(F_K)) ϖ / (Δ_K^T(F_K) ϖ) ).    (41)

Using the definition of g(ϖ) in Eq. (13), we can rewrite Eq. (41) as follows:

    Q(ϖ) = Δ_K^T(F_K) ϖ min_{1 ≤ j ≤ N} [ M_j^T diag(Δ_K(F_K)) ϖ / (Δ_K^T(F_K) ϖ) ]
         = min_{1 ≤ j ≤ N} [ Δ_K^T(F_K) ϖ M_j^T diag(Δ_K(F_K)) ϖ / (Δ_K^T(F_K) ϖ) ]
         = min_{1 ≤ j ≤ N} [ M_j^T diag(Δ_K(F_K)) ϖ ].    (42)

Note that the term M_j^T diag(Δ_K(F_K)) ϖ is linear with respect to ϖ. Using the fact that the minimum of linear functions is a concave, piecewise linear function implies that Q(ϖ) is a concave, piecewise linear function. Furthermore, we recall that: i) the non-negative sum of concave/piecewise linear functions is also a concave/piecewise linear function, and ii) the minimum of two concave/piecewise linear functions is also a concave/piecewise linear function. Based on these two facts, and the fact that Δ_K(F_K) is a probability vector, which is non-negative, we conclude that the function J̄_{K−1}(ϖ) in Eq. (39) is concave and piecewise linear. Concavity also assures the continuity of this function.

Then, let us consider the function J̄_{K−2}(ϖ) given by:

    J̄_{K−2}(ϖ) = min[ g(ϖ),  c_{K−1} + ∑_{F_{K−1}} Δ_{K−1}^T(F_{K−1}) ϖ × J̄_{K−1}( diag(Δ_{K−1}(F_{K−1})) ϖ / (Δ_{K−1}^T(F_{K−1}) ϖ) ) ].    (43)

We have already proved that the functions J̄_{K−1}(ϖ) and g(ϖ) are concave and piecewise linear. Using the facts that i) the non-negative sum of concave/piecewise linear functions is also a concave/piecewise linear function, and ii) the minimum of two concave/piecewise linear functions is also a concave/piecewise linear function, we conclude that the function J̄_{K−2}(ϖ) is also concave and piecewise linear. Concavity also assures the continuity of this function. Using similar arguments, the concavity, continuity and piecewise linearity of the functions J̄_k(ϖ), k = 0, ..., K−3, can also be guaranteed.

B.3 Proof of Theorem 3
Let us start the proof by showing that all N corners of the (N−1)-dimensional unit simplex always correspond to stopping, irrespective of the stage. In other words, when ϖ = e_i, g(ϖ) < C̄_{k+1}(ϖ) for all k = 0, ..., K−1, where e_i denotes the column vector with a 1 in the i-th coordinate and 0's elsewhere. At stage k = K−1, we have that:

    C̄_K(ϖ) = c_K + ∑_{F_K} Δ_K^T(F_K) e_i J̄_K( diag(Δ_K(F_K)) e_i / (Δ_K^T(F_K) e_i) )
            = c_K + ∑_{F_K} Δ_K^T(F_K) e_i g( diag(Δ_K(F_K)) e_i / (Δ_K^T(F_K) e_i) )
            = c_K + ∑_{F_K} P(F_K | T_i) g( P(F_K | T_i) e_i / P(F_K | T_i) )
            = c_K + g(e_i) ∑_{F_K} P(F_K | T_i)
            = c_K + g(e_i)
            > g(e_i),    (44)

where the last inequality holds since c_K > 0. From Eq. (39), we see that J̄_{K−1}(e_i) = g(e_i). Then, let us consider the case k = K−2 as follows:

    C̄_{K−1}(ϖ) = c_{K−1} + ∑_{F_{K−1}} Δ_{K−1}^T(F_{K−1}) e_i × J̄_{K−1}( diag(Δ_{K−1}(F_{K−1})) e_i / (Δ_{K−1}^T(F_{K−1}) e_i) )
               = c_{K−1} + ∑_{F_{K−1}} P(F_{K−1} | T_i) J̄_{K−1}( P(F_{K−1} | T_i) e_i / P(F_{K−1} | T_i) )
               = c_{K−1} + J̄_{K−1}(e_i) ∑_{F_{K−1}} P(F_{K−1} | T_i)
               = c_{K−1} + g(e_i) ∑_{F_{K−1}} P(F_{K−1} | T_i)
               = c_{K−1} + g(e_i)
               > g(e_i),    (45)

where the last inequality holds since c_{K−1} > 0. Using similar arguments, the latter result can be proven for all k = 0, ..., K−3. The rest of the proof is very intuitive. Using the facts that: (i) the functions C̄_{k+1}(ϖ) are concave (see the proof of Lemma 4), and (ii) the simplex corners always correspond to stopping, we see that the hyperplanes of g(ϖ) connected to the N corners of the unit simplex can have only one intersection with each C̄_{k+1}(ϖ). Finally, using the fact that g(ϖ) is made up of at most N hyperplanes (see Lemma 3), we conclude that at every stage k, there are at most N threshold curves which split up the probability space of ϖ (i.e., the (N−1)-dimensional unit simplex) into areas that correspond to either continuing or stopping.

ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant No. ECCS-1737443.

REFERENCES

[1] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, no. Mar, pp. 1157–1182, 2003.
[2] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, no. 4, pp. 491–502, 2005.
[3] S. Perkins and J. Theiler, "Online feature selection using grafting," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 592–599.
[4] J. Zhou, D. Foster, R. Stine, and L. Ungar, "Streaming feature selection using alpha-investing," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 2005, pp. 384–393.
[5] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, "Online feature selection with streaming features," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1178–1192, 2012.
[6] K. Yu, X. Wu, W. Ding, and J. Pei, "Towards scalable and accurate online feature selection for big data," in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 660–669.
[7] P. Zhou, X. Hu, P. Li, and X. Wu, "OFS-Density: A novel online streaming feature selection method," Pattern Recognition, vol. 86, pp. 48–61, 2019.
[8] S. Eskandari and M. M. Javidi, "Online streaming feature selection using rough sets," International Journal of Approximate Reasoning, vol. 69, pp. 35–57, 2016.
[9] P. Zhou, X. Hu, P. Li, and X. Wu, "Online streaming feature selection using adapted neighborhood rough set," Information Sciences, vol. 481, pp. 258–279, 2019.
[10] M. M. Javidi and S. Eskandari, "Online streaming feature selection: a minimum redundancy, maximum significance approach," Pattern Analysis and Applications, vol. 22, no. 3, pp. 949–963, 2019.
[11] S. C. Hoi, J. Wang, P. Zhao, and R. Jin, "Online feature selection for mining big data," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications. ACM, 2012, pp. 93–100.
[12] J. Wang, P. Zhao, S. C. H. Hoi, and R. Jin, "Online feature selection and its applications," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 3, pp. 698–710, March 2014.
[13] Y. Wu, S. C. Hoi, T. Mei, and N. Yu, "Large-scale online feature selection for ultra-high dimensional sparse data," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 11, no. 4, p. 48, 2017.
[14] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu, "Feature selection: A data perspective," ACM Computing Surveys (CSUR), vol. 50, no. 6, p. 94, 2018.
[15] J. Wang, J.-M. Wei, Z. Yang, and S.-Q. Wang, "Feature selection by maximizing independent classification information," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 4, pp. 828–841, 2017.
[16] Z. Zhang and K. K. Parhi, "MUSE: Minimum uncertainty and sample elimination based binary feature selection," IEEE Transactions on Knowledge and Data Engineering, 2018.
[17] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, 2014.
[18] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.
[19] X. Hu, P. Zhou, P. Li, J. Wang, and X. Wu, "A survey on online feature selection with streaming features," Frontiers of Computer Science, vol. 12, no. 3, pp. 479–493, 2018.
[20] N. AlNuaimi, M. M. Masud, M. A. Serhani, and N. Zaki, "Streaming feature selection algorithms for big data: A survey," Applied Computing and Informatics, 2019.
[21] V. Bolón-Canedo, I. Porto-Díaz, N. Sánchez-Maroño, and A. Alonso-Betanzos, "A framework for cost-based feature selection," Pattern Recognition, vol. 47, no. 7, pp. 2481–2489, 2014.
[22] W. Shu and H. Shen, "Multi-criteria feature selection on cost-sensitive data with missing values," Pattern Recognition, vol. 51, pp. 268–280, 2016.
[23] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, p. 386, 1958.
[24] A. B. Novikoff, "On convergence proofs for perceptrons," Stanford Research Institute, Menlo Park, CA, Tech. Rep., 1963.
[25] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, "Online passive-aggressive algorithms," Journal of Machine Learning Research, vol. 7, no. Mar, pp. 551–585, 2006.
[26] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[27] T. van Erven and W. M. Koolen, "MetaGrad: Multiple learning rates in online learning," in Advances in Neural Information Processing Systems, 2016, pp. 3666–3674.
[28] H. Luo, A. Agarwal, N. Cesa-Bianchi, and J. Langford, "Efficient second order online learning by sketching," in Advances in Neural Information Processing Systems, 2016, pp. 902–910.
[29] L. Zhang, S. Lu, and Z.-H. Zhou, "Adaptive online learning in dynamic environments," in Advances in Neural Information Processing Systems, 2018, pp. 1323–1333.
[30] Y. Li and P. M. Long, "The relaxed online maximum margin algorithm," Machine Learning, vol. 46, no. 1, pp. 361–387, 2002.
[31] J. Wang, P. Zhao, and S. C. Hoi, "Cost-sensitive online classification," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 10, pp. 2425–2438, 2013.
[32] P. Zhao, Y. Zhang, M. Wu, S. C. Hoi, M. Tan, and J. Huang, "Adaptive cost-sensitive online classification," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 2, pp. 214–228, 2018.
[33] S. C. Hoi, D. Sahoo, J. Lu, and P. Zhao, "Online learning: A comprehensive survey," arXiv preprint arXiv:1802.02871, 2018.
[34] C. Zhang, H. Wei, J. Zhao, T. Liu, T. Zhu, and K. Zhang, "Short-term wind speed forecasting using empirical mode decomposition and feature selection," Renewable Energy, vol. 96, pp. 727–737, 2016.
[35] S. Yang, "On feature selection for traffic congestion prediction," Transportation Research Part C: Emerging Technologies, vol. 26, pp. 160–169, 2013.
[36] M.-C. Lee, "Using support vector machine with a hybrid feature selection method to the stock trend prediction," Expert Systems with Applications, vol. 36, no. 8, pp. 10896–10904, 2009.
[37] C. Ding and H. Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, vol. 3, no. 2, pp. 185–205, 2005.
[38] J.-H. Seo, Y. H. Lee, and Y.-H. Kim, "Feature selection for very short-term heavy rainfall prediction using evolutionary computation," Advances in Meteorology, vol. 2014, 2014.
[39] A. N. Shiryaev, Optimal Stopping Rules. Springer Science & Business Media, 2007, vol. 8.
[40] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 2005, vol. 1.
[41] K. Yang, Z. Cai, J. Li, and G. Lin, "A stable gene selection in microarray data analysis," BMC Bioinformatics, vol. 7, no. 1, p. 228, 2006.
[42] J. C. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. John Wiley & Sons, 2005, vol. 65.
[43] G. C. Pflug, Optimization of Stochastic Models: The Interface Between Simulation and Optimization. Kluwer Academic Publishers, 1996.
[44] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror, "Result analysis of the NIPS 2003 feature selection challenge," in Advances in Neural Information Processing Systems, 2005.
[45] C.-J. Lin, "LIBSVM data: Classification, regression, and multi-label datasets," https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

Yasitha Warahena Liyanage received the B.S. degree in electrical and electronic engineering from the University of Peradeniya, Sri Lanka, in 2016. Currently, he is working toward the Ph.D. degree in electrical and computer engineering at the University at Albany, SUNY. His research interests include quickest change detection, optimal stopping theory and machine learning.
Daphney-Stavroula Zois received the B.S. degree in computer engineering and informatics from the University of Patras, Patras, Greece, and the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, Los Angeles, CA, USA. Previous appointments include the University of Illinois, Urbana-Champaign, IL, USA. She is an Assistant Professor in the Department of Electrical and Computer Engineering, University at Albany, State University of New York, Albany, NY, USA. She received the Viterbi Dean's and Myronis Graduate Fellowships. She has served and is serving as Co-Chair, TPC member or reviewer in international conferences and journals, such as GlobalSIP, Globecom, and ICASSP, and IEEE Transactions on Signal Processing. Her research interests include decision making under uncertainty, machine learning, detection & estimation theory, intelligent systems design, and signal processing.