Feature Selection via Mutual Information: New Theoretical Insights
Mario Beraha*†, Alberto Maria Metelli†, Matteo Papini†, Andrea Tirinzoni†, Marcello Restelli†
* Università degli Studi di Bologna, Bologna, Italy
† DEIB, Politecnico di Milano, Milan, Italy
Email: {mario.beraha, albertomaria.metelli, matteo.papini, andrea.tirinzoni, marcello.restelli}@polimi.it

Abstract—Mutual information has been successfully adopted in filter feature-selection methods to assess both the relevancy of a subset of features in predicting the target variable and the redundancy with respect to other variables. However, existing algorithms are mostly heuristic and do not offer any guarantee on the proposed solution. In this paper, we provide novel theoretical results showing that conditional mutual information naturally arises when bounding the ideal regression/classification errors achieved by different subsets of features. Leveraging on these insights, we propose a novel stopping condition for backward and forward greedy methods which ensures that the ideal prediction error using the selected feature subset remains bounded by a user-specified threshold. We provide numerical simulations to support our theoretical claims and compare to common heuristic methods.
Index Terms—feature selection, mutual information, regression, classification, supervised learning, machine learning
I. INTRODUCTION
The abundance of massive datasets composed of thousands of attributes and the widespread use of learning models with large representational power pose a significant challenge to machine learning algorithms. Feature selection makes it possible to effectively address some of these challenges, with potential benefits in terms of computational costs, generalization capabilities and interpretability. A large variety of approaches has been proposed by the machine learning community [1]. A simple dimension for classifying feature selection methods is whether they are aware of the underlying learning model. A first group of methods takes advantage of this knowledge and tries to identify the best subset of features for the specific model class. This group can be further split into wrapper and embedded methods. Wrappers [2] employ the learning process as a subroutine of the feature selection process, using the validation error of the trained model as a score to decide whether to keep or discard a feature. Clearly, this potentially leads to good generalization capabilities, at the cost of iterating the learning process multiple times, which might become impractical for high-dimensional datasets. Embedded methods [3] still assume knowledge of the model class, but the feature selection and the learning process are carried out together (a remarkable example is [4], in which a generalization bound on the SVM is optimized for learning both the features and the model). Although less demanding than wrappers from a computational standpoint, embedded methods heavily rely on the peculiar properties of the model class. A second group of methods does not incorporate knowledge of the model class. These approaches are known as filters. Filters [5] perform the feature selection using scores that are independent of the underlying learning model. For this reason, they tend not to overfit, but they might be less effective than wrappers and embedded methods, as they are general across all possible model classes. From a computational perspective, filters are the most efficient feature selection methods.

Filter methods have been deeply studied in the supervised learning field [6]. A relevant amount of literature focused on using the mutual information (MI) as a score for identifying a suitable subset of features [7]. The MI [8] is an index of statistical dependence between random variables. Intuitively, the MI measures how much knowing the value of one variable reduces the uncertainty on the other. Differently from other indexes, like the Pearson correlation coefficient, the MI is able to capture also non-linear dependences and is invariant under invertible and differentiable transformations of the random variables [8]. Thanks to these properties, the MI has been employed extensively as a score for filter methods [9]–[14]. Nonetheless, all these techniques are rather empirical, as they try to encode with the MI the intuition that "a feature can be discarded if it is useless for predicting the target or it is predictable from the other features". This intuition can be made more formal by introducing the notions of relevance, redundancy and complementarity [7].

To the best of our knowledge, the only work that draws a connection among the several approaches based on the MI is [15]. The authors claim that selecting features using the conditional mutual information (CMI) as a score is equivalent to maximizing the conditional likelihood of the target given the features.
This observation provides a justification for the well-known iterative backward and forward algorithms in which the features are considered one-by-one for insertion in or removal from the feature set, like in the Markov Blanket approach [16]. Although this work offers a wide perspective on the feature selection methods based on the MI, it does not investigate the relation between the mutual information of a feature set and the prediction error, which, of course, will depend on the specific choice of the model class.

In this paper, we address the problem of controlling the prediction (regression and classification) error when performing the feature selection process via CMI. We claim that selecting features using the CMI has the effect of controlling the ideal error, i.e., the error attained by the Bayes classifier for classification and by the minimum MSE (Mean Squared Error) model for regression. We start in Section II by reviewing some fundamental concepts of information theory. In Section III, we introduce our main theoretical contribution. We derive a pair of inequalities, one for regression (Section III-A) and one for classification (Section III-C), that upper bound the increment of the ideal error obtained by removing a set of features. Such increment is expressed in terms of the CMI between the target and the removed features, given the remaining features. These results support the intuition that a set of features can be safely removed if it does not add significant "information" about the target, once the remaining features are observed. Since the result holds for the ideal error, we assert that a filter method based on the CMI selects the features assuming that the model employed for solving the regression/classification problem has "infinite capacity". We show that, when considering linear models for regression, the bound does not hold, and we propose an adaptation for this specific case (Section III-B). These results can be effectively employed to derive a novel and principled stopping condition for the feature selection process (Section IV). Differently from the typical stopping conditions, such as a fixed number of features or the increment of the score, our approach allows us to explicitly control the ideal error introduced by the feature selection process. After contextualizing our work in the feature selection literature (Section V), we evaluate our approach in comparison with several different stopping criteria on both synthetic and real datasets (Section VI).

II. PRELIMINARIES
We indicate with X ⊆ R^d the feature space and with Y the target space. In the case of classification, Y = {y_1, y_2, ..., y_m} is a finite set of classes, whereas in the case of regression, Y ⊆ R is a subset of the real numbers. We consider a distribution p(X, Y) over X × Y from which a finite dataset D = {(x_i, y_i) : i ∈ {1, ..., N}} of N i.i.d. instances is drawn, i.e., (x_i, y_i) ∼ p(X, Y) for all i. For regression problems, we assume there exists B ∈ R such that |Y| ≤ B almost surely. A key object in a regression/classification problem is the conditional distribution p(Y | x), which allows predicting the target associated with any given x ∈ X.

A. Notation
Given a (random) vector X ∈ X and a set of indices A ⊆ {1, 2, ..., d}, we denote by X_A the vector of the components of X whose indices are in A. Notice that the vectors X_A and X_Ā, for Ā = {1, 2, ..., d} \ A, form a partition of X. For a d-dimensional random vector X we indicate with E_X[X] the d-dimensional vector of the expectations of each component. Given two random vectors X and Y, we indicate with Cov_{X,Y}[X, Y] = E_{X,Y}[(X − E_X[X])(Y − E_Y[Y])ᵀ] the covariance matrix between the two. We indicate with Cov_X[X] = Cov_X[X, X] the covariance matrix of X, and with Var_X[X] = tr(Cov_X[X, X]) the trace of the covariance matrix of X. Whenever clear from the context, we will remove the subscripts from E, Var and Cov. Given two (scalar) random variables X and Y, we denote with ρ(X, Y) = Cov[X, Y] / √(Var[X] Var[Y]) the Pearson correlation coefficient between X and Y.

B. Entropy and Mutual Information
We now introduce the basic concepts from information theory that we employ in the remainder of this paper. For simplicity, we provide the definitions for continuous random variables, although all these concepts straightforwardly generalize to discrete variables [8].

The entropy H(X) of a random variable X, having p as probability density function, is a common measure of uncertainty:

  H(X) := E_X[−log p(X)] = −∫ p(x) log p(x) dx.   (1)

Given two distributions p and q, we define the Kullback-Leibler (KL) divergence as:

  D_KL(p ‖ q) := E_X[log (p(X)/q(X))] = ∫ p(x) log (p(x)/q(x)) dx.

The mutual information (MI) between two random variables X and Y is defined as:

  I(X; Y) := H(Y) − H(Y | X) = E_X[D_KL(p(Y | X) ‖ p(Y))] = ∫∫ p(x, y) log [p(x, y) / (p(x) p(y))] dx dy.

Intuitively, the MI between X and Y represents the reduction in the uncertainty of Y after observing X (and vice versa). Notice that the MI is symmetric, i.e., I(X; Y) = I(Y; X). This definition can be straightforwardly extended by conditioning on a third random variable Z, obtaining the conditional mutual information (CMI) between X and Y given Z:

  I(X; Y | Z) := E_Z[I(X; Y | Z = z)] = E_Z[E_X[D_KL(p(Y | X, Z) ‖ p(Y | Z))]]
             = ∫ p(z) ∫∫ p(x, y | z) log [p(x, y | z) / (p(x | z) p(y | z))] dx dy dz.

The CMI fulfills the useful chain rule:

  I(X; Y, Z) = I(X; Z) + I(X; Y | Z).   (2)

As we shall see later, the CMI can be used to define a score of both relevancy and redundancy for our feature selection problem, which arises naturally when bounding the ideal regression/classification error. Given a set of indices A, we denote the CMI between Y and X_A given X_Ā as:

  ν(A) := I(Y; X_A | X_Ā).

This quantity intuitively represents the importance of the feature subset X_A in predicting the target Y, given that we are also using X_Ā.
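The following short script (ours, not part of the paper) makes these definitions concrete: it enumerates a small discrete joint distribution p(x, y, z), evaluates the MI and the CMI by direct summation, and checks the chain rule (2) numerically. All names and values are arbitrary illustration choices.

```python
# Numerical illustration of MI, CMI and the chain rule (2) on a discrete joint distribution.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 3, 2))   # unnormalized joint over (x, y, z)
p /= p.sum()                # p[x, y, z] = Pr(X=x, Y=y, Z=z), all entries > 0

def mi(joint, axes_a, axes_b):
    """I(A; B) for the variables on axes `axes_a` and `axes_b` of a joint array."""
    every = set(range(joint.ndim))
    pa = joint.sum(axis=tuple(every - set(axes_a)), keepdims=True)
    pb = joint.sum(axis=tuple(every - set(axes_b)), keepdims=True)
    pab = joint.sum(axis=tuple(every - set(axes_a) - set(axes_b)), keepdims=True)
    return float((pab * np.log(pab / (pa * pb))).sum())

def cmi(joint, a, b, c):
    """I(A; B | C) = sum p(a,b,c) log[ p(c) p(a,b,c) / (p(a,c) p(b,c)) ]."""
    every = set(range(joint.ndim))
    pc = joint.sum(axis=tuple(every - {c}), keepdims=True)
    pac = joint.sum(axis=tuple(every - {a, c}), keepdims=True)
    pbc = joint.sum(axis=tuple(every - {b, c}), keepdims=True)
    return float((joint * np.log(pc * joint / (pac * pbc))).sum())

lhs = mi(p, (0,), (1, 2))                    # I(X; Y, Z)
rhs = mi(p, (0,), (2,)) + cmi(p, 0, 1, 2)    # I(X; Z) + I(X; Y | Z)
print(f"I(X;Y,Z) = {lhs:.6f}   I(X;Z) + I(X;Y|Z) = {rhs:.6f}")   # the two values coincide
```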
III. FEATURE SELECTION VIA MUTUAL INFORMATION

In this section, we introduce our novel theoretical results that shed light on the relationship between the CMI and the ideal prediction error. Then, in the next section, we employ these results to propose a new stopping condition that ensures a bounded error. We discuss relationships to existing bounds in Section V.
A. Bounding the Regression Error
We start by analyzing an ideal regression problem under the mean squared error (MSE) criterion. Consider the subspace X_Ā of X which includes only the features with indices in Ā, and define G_Ā = {g : X_Ā → Y} as the space of all functions mapping X_Ā to Y. The ideal regression problem consists of finding the function g* ∈ G_Ā minimizing the expected MSE,

  inf_{g ∈ G_Ā} E_{X,Y}[(Y − g(X_Ā))²],   (3)

where the expectation is taken under the full distribution p(X, Y), i.e., under all features and the target. The following result relates this ideal error to the CMI ν(A).

Theorem 1.
Let σ² = E_{X,Y}[(Y − E[Y | X])²] be the irreducible error and let A be a set of indices; then the regression error obtained by removing the features X_A can be bounded as:

  inf_{g ∈ G_Ā} E_{X,Y}[(Y − g(X_Ā))²] ≤ σ² + 2B² ν(A).   (4)

Proof.
The infimum inf_{g ∈ G_Ā} E_{X,Y}[(Y − g(X_Ā))²] is attained by the minimum-MSE regression function g(x_Ā) = E[Y | x_Ā]. Therefore, we have

  inf_{g ∈ G_Ā} E_{X,Y}[(Y − g(X_Ā))²] = E_{X,Y}[(Y − E[Y | X_Ā])²]
   = ∫ p(x) ∫ p(y | x) (y − E[Y | x_Ā] ± E[Y | x])² dy dx
   = σ² + ∫ p(x) (E[Y | x] − E[Y | x_Ā])² dx
   = σ² + ∫ p(x) (∫ y (p(y | x) − p(y | x_Ā)) dy)² dx
   ≤ σ² + B² ∫ p(x) (∫ |p(y | x) − p(y | x_Ā)| dy)² dx
   ≤ σ² + 2B² ∫ p(x) D_KL(p(· | x) ‖ p(· | x_Ā)) dx = σ² + 2B² ν(A).

The second inequality follows from Pinsker's inequality [17]–[19], by noting that ∫ |p(y | x) − p(y | x_Ā)| dy = 2 D_TV(p(· | x) ‖ p(· | x_Ā)) is twice the total variation distance between p(· | x) and p(· | x_Ā).

Theorem 1 tells us that the minimum possible MSE that we can achieve by predicting Y only with the feature subset Ā can be bounded by the CMI between Y and X_A, conditioned on X_Ā. This result formalizes the intuitive belief that whenever a subset of features A has low relevancy or high redundancy (i.e., ν(A) is small), such features can be safely removed without affecting the resulting prediction error too much. In fact, when ν(A) = 0, Theorem 1 proves that it is possible to achieve the irreducible MSE σ² without using any of the features in A.

Interestingly, this score accounts for both the relevancy of X_A in the prediction of Y and its redundancy with respect to the other features X_Ā. To better see this fact, we can rewrite ν(A) as:

  ν(A) = ∫ p(x_Ā) ∫∫ p(y, x_A | x_Ā) log [ p(y, x_A | x_Ā) / (p(y | x_Ā) p(x_A | x_Ā)) ] dx_A dy dx_Ā,   (5)

and notice that the inner integral is zero whenever: i) x_Ā perfectly predicts y, i.e., x_A is irrelevant, or ii) x_Ā perfectly predicts x_A, i.e., x_A is redundant. In both cases we have p(y, x_A | x_Ā) = p(y | x_Ā) p(x_A | x_Ā) and, thus, ν(A) = 0.
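To make the two degenerate cases concrete, here is a toy construction of our own (not from the paper): Y depends only on X_1, and X_2 is an exact copy of X_1. Then ν({2}) = I(Y; X_2 | X_1) = 0 even though I(Y; X_2) alone is large, so the score correctly flags X_2 as removable.

```python
# Toy check: a redundant feature has zero score nu(A) although it is highly relevant on its own.
import math

# Joint p(x1, x2, y): X1 ~ Bernoulli(0.5), X2 = X1 (exact copy), Y = X1 flipped with prob. 0.1.
p = {(x1, x1, y): 0.5 * (0.9 if y == x1 else 0.1) for x1 in (0, 1) for y in (0, 1)}

def marg(keep):
    """Marginal over the variable positions in `keep` (0 = X1, 1 = X2, 2 = Y)."""
    out = {}
    for cell, mass in p.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + mass
    return out

p_x2y, p_x2, p_y = marg((1, 2)), marg((1,)), marg((2,))
mi = sum(m * math.log(m / (p_x2[(x2,)] * p_y[(y,)])) for (x2, y), m in p_x2y.items())

p_x1, p_x1x2, p_x1y = marg((0,)), marg((0, 1)), marg((0, 2))
cmi = sum(m * math.log(p_x1[(x1,)] * m / (p_x1x2[(x1, x2)] * p_x1y[(x1, y)]))
          for (x1, x2, y), m in p.items())

print(f"I(Y; X2)      = {mi:.3f} nats")   # large: X2 alone is very informative about Y
print(f"I(Y; X2 | X1) = {cmi:.3f} nats")  # zero: given X1, X2 adds nothing, so nu({2}) = 0
```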
B. Regression Error in Linear Models

As previously mentioned, the actual error introduced by removing a set of features depends on the choice of the model class. We remark that Theorem 1 bounds the ideal prediction error, i.e., the error achieved by a model of infinite capacity. Unfortunately, in practical applications the chosen model often has very limited capacity (e.g., linear). In such cases, our bound, and all CMI-based methods, might be over-optimistic. Indeed, there are situations in which the CMI leads to discarding an apparently redundant feature that would reveal itself to be useful once the finite capacity of the chosen model is taken into account. Let us consider the following example.
Example 1.
Consider a regression problem with two features, X_1 and X_2, and target Y = aX_1 + bX_2, for two scalars a and b. Furthermore, assume that X_1 = Z and X_2 = e^Z, for Z ∼ N(0, σ²) with σ² ≫ 1. It is clear that ν({1}) ≃ 0 and ν({2}) ≃ 0, since the two features can be perfectly recovered from one another. However, if the chosen model is linear, both features are fundamental for predicting Y. In fact, the squared Pearson correlation coefficients ρ²(X_1, Y) and ρ²(X_2, Y) are high, while ρ²(X_1, X_2) is small.

We now show that, when linear models are involved, the correlation between the features and between a feature and the target can be used to bound the regression error.
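Before stating that result, here is a small simulation of Example 1 (our own sketch; the constants a, b and σ below are arbitrary choices, since the example only requires σ² to be large). It contrasts the MI view, under which either feature looks removable, with what a linear model can actually do once a feature is gone.

```python
# Simulation of Example 1: X1 = Z, X2 = exp(Z), Y = a*X1 + b*X2 (constants are illustrative).
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a, b, sigma = 5.0, 1.0, 1.5
z = rng.normal(0.0, sigma, size=5_000)
x1, x2 = z, np.exp(z)
y = a * x1 + b * x2
X = np.column_stack([x1, x2])

# Pearson view: both features correlate with Y noticeably more than they correlate with each other.
corr = np.corrcoef(np.column_stack([x1, x2, y]), rowvar=False)
print("rho^2(X1,Y) =", round(corr[0, 2] ** 2, 3), " rho^2(X2,Y) =", round(corr[1, 2] ** 2, 3),
      " rho^2(X1,X2) =", round(corr[0, 1] ** 2, 3))

# MI view: X2 is a deterministic function of X1, which is why nu({1}) and nu({2}) vanish.
print("kNN estimate of I(X1;X2):", round(mutual_info_regression(x1.reshape(-1, 1), x2)[0], 2), "nats")

# Linear-model view: dropping either feature leaves a sizable part of the variance unexplained.
for cols, name in [([0, 1], "X1 and X2"), ([0], "X1 only"), ([1], "X2 only")]:
    r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
    print(f"linear fit, {name}: R^2 = {r2:.3f}")
```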
Theorem 2.
Let σ²_{X→Y} = min_{w,b} E_{X,Y}[(Y − wᵀX − b)²] be the minimum MSE of the linear model that predicts Y with all the features, and let (w*, b*) be the optimal weights and bias. Let A be a set of indices and let σ²_{X_Ā→X_i} = min_{w_{i,Ā}, b_{i,Ā}} E_{X_i, X_Ā}[(X_i − w_{i,Ā}ᵀ X_Ā − b_{i,Ā})²] be the minimum MSE of the linear model that predicts X_i from the features X_Ā. Then the minimum MSE of the linear model that predicts Y from the features X_Ā can be bounded as:

  min_{w_Ā, b_Ā} E_{X,Y}[(Y − w_Āᵀ X_Ā − b_Ā)²] ≤ σ²_{X→Y} + √|A| Σ_{i∈A} (w*_i)² σ²_{X_Ā→X_i}.

Furthermore, let σ²_Y = Var[Y]. If ρ(X_i, X_j) = 0 for all i, j ∈ A with i ≠ j, and ρ(X_i, X_j) = 0 for all i, j ∈ Ā with i ≠ j (footnote: we are requiring that all features in X_A are uncorrelated and that all features in X_Ā are uncorrelated; but, of course, there might exist i ∈ A and j ∈ Ā such that ρ(X_i, X_j) ≠ 0), then it holds that:

  min_{w_Ā, b_Ā} E_{X,Y}[(Y − w_Āᵀ X_Ā − b_Ā)²] ≤ σ²_{X→Y} + √|A| σ²_Y Σ_{i∈A} ρ²(Y, X_i) (1 − Σ_{j∈Ā} ρ²(X_i, X_j)).

Proof. Consider the linear regression problem for predicting Y with all the features, min_{w,b} E_{Y,X}[(Y − wᵀX − b)²], having (w*, b*) as the optimal solution. The expressions of the optimal weights and of the minimum MSE are:

  w* = Cov[X]⁻¹ Cov[X, Y],
  σ²_{X→Y} = Var[Y] − Cov[Y, X] Cov[X]⁻¹ Cov[X, Y].

Consider now a partition of X into X_Ā and X_A and the linear regression problem that predicts X_A from X_Ā, i.e., min_{W_{A,Ā}, b_{A,Ā}} E_{X_A, X_Ā}[‖X_A − W_{A,Ā} X_Ā − b_{A,Ā}‖²]. We can express the optimal weights and the minimum MSE as:

  W*_{A,Ā} = Cov[X_A, X_Ā] Cov[X_Ā]⁻¹,
  σ²_{X_Ā→X_i} = Var[X_i] − Cov[X_i, X_Ā] Cov[X_Ā]⁻¹ Cov[X_Ā, X_i].

Let us now consider the linear regression problem for predicting Y from the features X_Ā:

  min_{w_Ā, b_Ā} E_{X,Y}[(Y − w_Āᵀ X_Ā − b_Ā)²]
   ≤ E_{X,Y}[(Y − w*_Āᵀ X_Ā − b* − w*_Aᵀ (W*_{A,Ā} X_Ā + b*_{A,Ā}))²]
   ≤ E_{X,Y}[(Y − w*_Āᵀ X_Ā − w*_Aᵀ X_A − b*)²] + E_{X,Y}[(w*_Aᵀ (X_A − W*_{A,Ā} X_Ā − b*_{A,Ā}))²]   (6)
   ≤ σ²_{X→Y} + √|A| E_X[Σ_{i∈A} (w*_i)² (X_i − w*_{i,Ā}ᵀ X_Ā − b*_{i,Ā})²]   (7)
   = σ²_{X→Y} + √|A| (Σ_{i∈A} (w*_i)² E_X[(X_i − w*_{i,Ā}ᵀ X_Ā − b*_{i,Ā})²])   (8)
   = σ²_{X→Y} + √|A| Σ_{i∈A} (w*_i)² σ²_{X_Ā→X_i},

where (6) derives from the Minkowski inequality after having summed and subtracted w*_Aᵀ X_A, (7) is obtained from the Cauchy–Schwarz inequality (for d-dimensional vectors, (aᵀb)² ≤ d Σ_{i=1}^d a_i² b_i²), and (8) derives from the subadditivity of the square root. By recalling that w*_i = Σ_j [Cov[X]⁻¹]_{ij} Cov[X_j, Y], for uncorrelated features we get:

  w*_i = Var[X_i]⁻¹ Cov[X_i, Y] = (Var[Y] / Var[X_i])^{1/2} ρ(Y, X_i).
If the features in X_Ā are uncorrelated as well, we have that Cov[X_Ā] is diagonal. Therefore, we have:

  σ²_{X_Ā→X_i} = Var[X_i] − Σ_{j∈Ā} Cov[X_i, X_j]² Var[X_j]⁻¹ = Var[X_i] (1 − Σ_{j∈Ā} ρ²(X_i, X_j)),

from which the result follows directly.

This result allows highlighting two relevant points. First, when considering linear models, what matters is the correlation among the features and the correlation between the features and the target. Most importantly, the Pearson correlation coefficient is a weaker index of dependence between random variables than the MI, as it identifies linear dependence only. As suggested by Example 1, using the MI for discarding features when the model used for prediction is too weak might be dangerous. Second, Theorem 2 highlights once again two relevant properties of the features. In the linear case, a feature X_i is relevant if it is highly correlated with the target Y, i.e., ρ²(Y, X_i) ≫ 0, and a feature is redundant if it is highly correlated with the others, i.e., ρ²(X_i, X_j) ≫ 0. Both these contributions appear clearly in Theorem 2.
C. Bounding the Classification Error

A result similar to Theorem 1 can be obtained for an ideal classification problem. Here the goal is to find the function g* ∈ G_Ā minimizing the ideal prediction loss,

  inf_{g ∈ G_Ā} E_{X,Y}[𝟙{Y ≠ g(X_Ā)}],   (9)

where 𝟙_E denotes the indicator function of an event E.

Theorem 3.
Let ε = E_{X,Y}[𝟙{Y ≠ argmax_{y∈Y} p(y | X)}] be the Bayes error and let A be a set of indices; then the classification error obtained by removing the features X_A can be bounded as:

  inf_{g ∈ G_Ā} E_{X,Y}[𝟙{Y ≠ g(X_Ā)}] ≤ ε + √(2 ν(A)).   (10)

Proof.
Let us denote by y*(x) = argmax_{y∈Y} p(y | x) the optimal prediction given x and by y*_Ā(x_Ā) = argmax_{y∈Y} p(y | x_Ā) the optimal prediction given the subset of features in Ā. We have:

  inf_{g ∈ G_Ā} E_{X,Y}[𝟙{Y ≠ g(X_Ā)}] = E_{X,Y}[𝟙{Y ≠ argmax_y p(y | X_Ā)}]
   = E_{X,Y}[𝟙{Y ≠ argmax_y p(y | X_Ā)} ± 𝟙{Y ≠ argmax_y p(y | X)}]
   = ε + ∫ p(x) (p(y* | x) − p(y*_Ā | x)) dx
   = ε + ∫ p(x) (p(y* | x) ± p(y*_Ā | x_Ā) − p(y*_Ā | x)) dx.

Let us now bound the term inside the integral point-wise. For the term p(y* | x) − p(y*_Ā | x_Ā), we have:

  p(y* | x) − p(y*_Ā | x_Ā) = max_{y∈Y} p(y | x) − max_{y∈Y} p(y | x_Ā) ≤ max_{y∈Y} |p(y | x) − p(y | x_Ā)| ≤ D_TV(p(· | x) ‖ p(· | x_Ā)).

Following a similar argument for the term p(y*_Ā | x_Ā) − p(y*_Ā | x), each of the two terms is less than or equal to the total variation distance between p(· | x) and p(· | x_Ā). Then, by applying Pinsker's inequality:

  inf_{g ∈ G_Ā} E_{X,Y}[𝟙{Y ≠ g(X_Ā)}] ≤ ε + 2 ∫ p(x) D_TV(p(· | x) ‖ p(· | x_Ā)) dx
   ≤ ε + ∫ p(x) √(2 D_KL(p(· | x) ‖ p(· | x_Ā))) dx
   ≤ ε + √(2 ∫ p(x) D_KL(p(· | x) ‖ p(· | x_Ā)) dx) = ε + √(2 ν(A)).

Here the last inequality follows from Jensen's inequality and the concavity of the square root.
Similarly to the result for regression problems, Theorem 3 bounds the minimum ideal classification error achievable by a model that uses only the subset of features in Ā by means of the score ν(A). The astute reader might have noticed a slightly better dependence on ν(A) with respect to the regression case (square root versus linear). This is due to the fact that minimizing the MSE gives rise to a squared total variation distance between the conditional distributions p(· | x) and p(· | x_Ā), which leads to a linear dependence on ν(A).
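As a sanity check of Theorem 3 (with the √(2ν(A)) constant as reconstructed above), the following script (ours) draws a random discrete joint p(y, x_1, x_2), removes X_2 (i.e., A = {2}), and compares the exact restricted Bayes error with the bound, all computed by enumeration.

```python
# Exact check of the classification bound on a random discrete joint distribution.
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((4, 3, 3))   # p[y, x1, x2]: 4 classes, two ternary features
p /= p.sum()

# Bayes errors: with both features, and with X1 only (X2 removed).
eps_full = 1.0 - p.max(axis=0).sum()               # eps = Bayes error using (x1, x2)
eps_rest = 1.0 - p.sum(axis=2).max(axis=0).sum()   # error of the best predictor using x1 only

# Score nu(A) = I(Y; X2 | X1) = sum p(y,x1,x2) log[ p(x1) p(y,x1,x2) / (p(y,x1) p(x1,x2)) ].
p_x1 = p.sum(axis=(0, 2))
p_yx1 = p.sum(axis=2)
p_x1x2 = p.sum(axis=0)
nu = float((p * np.log(p * p_x1[None, :, None] / (p_yx1[:, :, None] * p_x1x2[None, :, :]))).sum())

bound = eps_full + np.sqrt(2.0 * nu)
print(f"restricted Bayes error = {eps_rest:.4f}  <=  eps + sqrt(2*nu) = {bound:.4f}")
assert eps_rest <= bound    # holds for any joint distribution
```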
IV. ALGORITHMS

In this section, we rephrase the forward and backward feature selection algorithms based on the findings of Section III. Furthermore, we propose a novel stopping condition that allows bounding the error introduced by removing a set of features, assuming the predictor will make the best possible use of the remaining features. Actively searching for the optimal subset of features is combinatorial in the number of features and, thus, unfeasible [20]. Instead, we can start from the complete feature set and remove one feature at a time, greedily minimizing the score. In this spirit, we propose the following iterative procedure.
Algorithm 1 (Backward Elimination). Given a dataset X, Y, select a threshold δ ≥ 0, the maximum error that the filter is allowed to introduce. Then:
• Start with the full feature set, i.e., A_1 = ∅, where A_t denotes the index set of the features removed prior to step t.
• For each step t = 1, 2, ..., remove the feature that minimizes the conditional mutual information between itself and the target Y given the remaining features, i.e.:

  i_t = argmin_i I(Y; X_i | X_{Ā_t} \ X_i),   (11)
  I_t = I(Y; X_{i_t} | X_{Ā_t} \ X_{i_t}),   (12)
  A_{t+1} = A_t ∪ {i_t}.   (13)

• Stop as soon as Σ_{h=1}^{t} I_h ≥ δ/(2B²) for regression and Σ_{h=1}^{t} I_h ≥ δ²/2 for classification. The selected features are the remaining ones, indexed by Ā_T, where T is the last step.

This algorithm, apart from the stopping condition, is described by Brown et al. [15] as
Backward Elimination with Mutual Information. The same authors show that this procedure greedily maximizes the conditional likelihood of the target given the selected features, as long as I_t is always zero. This would correspond to selecting δ = 0 as a threshold in our algorithm. The same backward elimination step is used as a subroutine in the IAMB algorithm [16]. Our stopping condition allows selecting the maximum error that the feature selection procedure is allowed to add to the ideal error, i.e., the unavoidable error that even a perfect predictor using all the features would commit. The fact that the threshold will actually be respected is guaranteed by the following result.

Theorem 4.
Algorithm 1 achieves an error of at most σ² + δ for regression, where σ² is the irreducible error, and of at most ε + δ for classification, where ε is the Bayes error.

Proof.
We prove the result for regression using Theorem 1; the proof for classification is analogous, but based on Theorem 3. We have:

  inf_{g ∈ G_{Ā_t}} E_{X,Y}[(Y − g(X_{Ā_t}))²] ≤ σ² + 2B² ν(A_t),   (14)

where t is any iteration of the algorithm. By repeatedly applying the chain rule of the CMI (2), we can rewrite the score as:

  ν(A_{t+1}) = I(Y; X_{A_{t+1}} | X_{Ā_{t+1}})
   = I(Y; X) − I(Y; X_{Ā_{t+1}})
   = I(Y; X_{A_t}, X_{Ā_t}) − I(Y; X_{Ā_{t+1}})
   = I(Y; X_{A_t} | X_{Ā_t}) + I(Y; X_{Ā_t}) − I(Y; X_{Ā_{t+1}})
   = ν(A_t) + I(Y; X_{i_t}, X_{Ā_{t+1}}) − I(Y; X_{Ā_{t+1}})
   = ν(A_t) + I(Y; X_{i_t} | X_{Ā_{t+1}})
   = ν(A_t) + I(Y; X_{i_t} | X_{Ā_t} \ X_{i_t}) = ν(A_t) + I_t.   (15)

Noting that ν(A_1) = I(Y; ∅ | X) = 0, we can unroll this recursive equation, obtaining:

  ν(A_T) = Σ_{t=1}^{T−1} I_t ≤ δ/(2B²),   (16)

where the inequality is due to the stopping condition. Plugging (16) into (14), we get the thesis.

Our Theorems 1 and 3 suggest that a backward elimination procedure allows keeping the error controlled. In the following, we argue that we can also resort to forward selection methods and still have a guarantee on the error. Using the chain rule of the CMI, we can express our score ν(A) as:

  ν(A) = I(Y; X) − I(Y; X_Ā),

where X_Ā is the set of features that have not been eliminated yet. If we plug this equation into the bounds of Theorems 1 and 3, we get:

  inf_{g ∈ G_Ā} E_{X,Y}[(Y − g(X_Ā))²] ≤ σ² + 2B² [I(Y; X) − I(Y; X_Ā)],
  inf_{g ∈ G_Ā} E_{X,Y}[𝟙{Y ≠ g(X_Ā)}] ≤ ε + √(2 [I(Y; X) − I(Y; X_Ā)]),

for the regression and classification cases, respectively. Since I(Y; X) does not depend on the selected features X_Ā, in order to minimize the bound we need to maximize the term I(Y; X_Ā). This matches the intuition that we should select the features that provide the maximum information on the target. Using this result, we can easily provide a forward feature selection algorithm.

Algorithm 2 (Forward Selection). Given a dataset X, Y, select a threshold δ ≥ 0, the maximum error that the filter is allowed to introduce. Then:
• Start with the empty feature set, i.e., A_1 = ∅, where A_t denotes the index set of the features selected prior to step t.
• For each step t = 1, 2, ..., add the feature that maximizes the conditional mutual information between itself and the target Y given the already selected features, i.e.:

  i_t = argmax_i I(Y; X_i | X_{A_t}),   (17)
  I_t = I(Y; X_{i_t} | X_{A_t}),   (18)
  A_{t+1} = A_t ∪ {i_t}.   (19)

• Stop as soon as Σ_{h=1}^{t} I_h ≥ δ/(2B²) for regression and Σ_{h=1}^{t} I_h ≥ δ²/2 for classification. The selected features are those indexed by A_T, where T is the last step.

Apart from the stopping condition, this algorithm was also presented in Brown et al. [15] and named Forward Selection with Mutual Information. Like for the backward case, we are able to provide a guarantee on the final error.
Theorem 5.
Algorithm 2 achieves an error of σ² − δ + 2B² I(Y; X) for regression, where σ² is the irreducible error, and of ε − δ + √(2 I(Y; X)) for classification, where ε is the Bayes error.

Proof.
We prove the result just for the regression case, as the derivation for classification is analogous. Using the chain rule (2), we have the following recursion:

  I(Y; X_{A_{t+1}}) = I(Y; X_{A_t}, X_{i_t}) = I(Y; X_{i_t} | X_{A_t}) + I(Y; X_{A_t}) = I_t + I(Y; X_{A_t}).

By observing that I(Y; X_{A_1}) = I(Y; ∅) = 0, we unroll the recursion and get

  I(Y; X_{A_T}) = Σ_{t=1}^{T} I_t ≥ δ/(2B²),

from which the result follows.
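Below is a compact sketch of both greedy procedures with the proposed stopping rule (our own code, not the authors'). The argument `cmi` is a placeholder for any estimator of I(Y; X_i | X_S) — when `X_S` has no columns it should return the plain MI — and the thresholds δ/(2B²) and δ²/2 follow the stopping conditions as reconstructed above; a crude estimator matching this interface is sketched at the end of the next subsection.

```python
# Sketch of Algorithms 1 and 2 with the CMI-based stopping condition.
import numpy as np

def threshold(delta, task, B=1.0):
    """Stopping threshold on the accumulated CMI (regression: delta/(2B^2), classification: delta^2/2)."""
    return delta / (2.0 * B ** 2) if task == "regression" else delta ** 2 / 2.0

def backward_elimination(X, y, cmi, delta, task="classification", B=1.0):
    """Algorithm 1: repeatedly drop the feature with the smallest CMI score, stopping before the budget is exceeded."""
    remaining = list(range(X.shape[1]))
    accumulated = 0.0
    while len(remaining) > 1:
        scores = [(cmi(y, X[:, i], X[:, [j for j in remaining if j != i]]), i) for i in remaining]
        score, worst = min(scores)                       # least informative feature given the others
        if accumulated + score >= threshold(delta, task, B):
            break                                        # removing it would break the error guarantee
        accumulated += score
        remaining.remove(worst)
    return remaining                                     # indices of the selected features

def forward_selection(X, y, cmi, delta, task="classification", B=1.0):
    """Algorithm 2: repeatedly add the feature with the largest CMI score until the threshold is reached."""
    selected, accumulated = [], 0.0
    candidates = list(range(X.shape[1]))
    while candidates and accumulated < threshold(delta, task, B):
        scores = [(cmi(y, X[:, i], X[:, selected]), i) for i in candidates]
        score, best = max(scores)                        # most informative feature given those already selected
        selected.append(best)
        candidates.remove(best)
        accumulated += score
    return selected
```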
A. Estimation of the Conditional Mutual Information

So far, we have assumed to be able to compute the CMI terms I(Y; X_i | X_{Ā_t} \ X_i) and I(Y; X_i | X_{A_t}) exactly. In practice, they need to be estimated from data. Estimating the MI can be reduced to the estimation of several entropies [21]; numerous methods have been employed in feature selection, either based on nearest-neighbor approaches [22] or on histograms [15]. The main challenge arises in classification, where we need to estimate the CMI between a discrete variable (the class) and possibly continuous features. For this reason, we resort to the recent nearest neighbor estimator proposed by [23], which collapses to the more traditional KSG estimator [24] when both X and Y have a continuous density. These estimators are proved to be consistent as the number of samples and the number of neighbors grow to infinity [23].
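The nearest-neighbor estimator of [23] is not reproduced here. As a crude, self-contained stand-in with the interface assumed by the sketch above, one can discretize each feature and use the plug-in (histogram) estimate of the CMI; this is only reasonable for small conditioning sets, since the joint histogram becomes sparse as more features are conditioned on.

```python
# Plug-in (histogram) estimator of I(Y; X_i | X_S) for a discrete target and continuous features.
import numpy as np
from collections import Counter

def cmi_histogram(y, xi, XS, bins=8):
    """Estimate I(Y; X_i | X_S): bin the continuous columns, then use empirical counts."""
    n = len(y)
    def discretize(col):
        edges = np.quantile(col, np.linspace(0, 1, bins + 1)[1:-1])
        return np.digitize(col, edges)
    xi_d = discretize(np.asarray(xi, dtype=float))
    XS = np.asarray(XS, dtype=float).reshape(n, -1)
    if XS.shape[1] > 0:
        zs = list(zip(*[discretize(XS[:, j]) for j in range(XS.shape[1])]))
    else:
        zs = [()] * n                                 # empty conditioning set: reduces to plain MI
    c_z, c_xz, c_yz, c_yxz = Counter(), Counter(), Counter(), Counter()
    for yk, xk, zk in zip(y, xi_d, zs):
        c_z[zk] += 1
        c_xz[(xk, zk)] += 1
        c_yz[(yk, zk)] += 1
        c_yxz[(yk, xk, zk)] += 1
    # plug-in CMI: sum over observed cells of p(y,x,z) log[ p(z) p(y,x,z) / (p(x,z) p(y,z)) ]
    return sum((cnt / n) * np.log(c_z[zk] * cnt / (c_xz[(xk, zk)] * c_yz[(yk, zk)]))
               for (yk, xk, zk), cnt in c_yxz.items())
```

Plugged into the previous sketch, e.g. `forward_selection(X, y, cmi_histogram, delta=0.1)`, this reproduces the overall pipeline up to the quality of the estimator; the experiments in Section VI instead rely on the mixed-type nearest-neighbor estimator.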
V. RELATED WORKS

A related theoretical study of feature selection via MI has recently been proposed by Brown et al. [15]. The authors show that the problem of finding the minimal feature subset such that the conditional likelihood of the targets is maximized is equivalent to minimizing the CMI. Based on this result, common heuristics for information-theoretic feature selection can be seen as iteratively maximizing the conditional likelihood. Similarly, we show a connection between the CMI and the optimal prediction error. Differently from [15], we additionally propose a novel stopping condition that is well motivated by our theoretical findings.

In the information theory literature, [25] also analyzes the connection between the CMI and the minimum mean square error, deriving a result similar to our Theorem 1. However, classification problems (i.e., minimum zero-one loss) are not considered and the focus is not on feature selection.

The authors of [22] propose a nearest neighbor estimator for the CMI and show how it can be used in a classic forward feature selection algorithm. One of the authors' questions is how to devise a suitable stopping condition for such methods. Here we propose a possible answer: our stopping criterion (Section IV) is intuitive, applicable to both forward and backward algorithms, and theoretically well-grounded.

Several existing approaches use linear correlation measures to score the different features [26]–[30]. Such algorithms are mostly based on the heuristic intuition that a good feature should be highly correlated with the class and lowly correlated with the other features. Instead, we provide a more theoretical justification for this claim (Section III), showing a connection between these two properties and the minimum MSE.
VI. EXPERIMENTS
We evaluate the performance of our stopping condition on synthetic and real-world datasets, by comparing different stopping criteria while employing a backward feature selection approach:
• error (ER): stop when the bound on the prediction error, as in Theorem 4, is greater than a fixed threshold δ;
• feature score (FS): stop when a feature with a CMI score greater than a fixed threshold δ is encountered;
• delta feature score (∆FS): stop when the difference between the scores of two consecutive features is greater than a threshold δ (as in knee-elbow analysis);
• number of features: stop when exactly k features have been selected.
For all the experiments, we use Python's scikit-learn implementation of SVM with default parameters (RBF kernel and C = 1).

A. Synthetic Data
The synthetic data consist of several binary classification problems. Each dataset is composed of i.i.d. samples generated, similarly to [31], as follows: fix the number of useful features k (i.e., the number of features that are actually needed to predict the class); given Y = 1, the features X_1, ..., X_k are standard normal variables conditioned on Σ_{i=1}^{k} X_i exceeding a threshold that depends on k, while the remaining features are independent N(0, 1). The choice of k will be specified for each experiment.
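A sketch of this generator and of the downstream classifier follows (ours). The sample size, the class balance and the exact threshold on the sum of the useful features are not recoverable from the text, so the values below are placeholders.

```python
# Synthetic generator sketched above (threshold, sample size and class balance are assumed values).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def make_dataset(n_samples=1000, n_features=30, k=10, threshold=5.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    y = rng.integers(0, 2, size=n_samples)
    X = rng.normal(size=(n_samples, n_features))
    for i in np.flatnonzero(y == 1):
        # useful features of positive samples: N(0,1) conditioned on their sum exceeding the threshold
        while X[i, :k].sum() <= threshold:
            X[i, :k] = rng.normal(size=k)
    return X, y

X, y = make_dataset()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC().fit(X_tr, y_tr)                # scikit-learn defaults: RBF kernel, C = 1
print("test accuracy with all features:", clf.score(X_te, y_te))
```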
Stopping Condition Comparison. The first experiment is meant to compare the stopping conditions presented above across classification datasets with a different number of useful features. We generate independent problems in which only k ∈ {9, 12, 15, 18, 21, 24} of the features are useful to predict the target. In Figure 1, we show the accuracy of the SVM for the different datasets and the different stopping conditions.

Fig. 1. SVM test accuracy for different choices of the number of features that generate the problem and different stopping criteria.

We can see that our stopping condition (ER) performs better than choosing a fixed number of features, while ∆FS (which is similar to the knee-elbow analysis) is highly inefficient (as the outputs are almost identical for both choices of the threshold) and is clearly the worst performer. The feature score (FS) stopping criterion is highly sensitive to the threshold, achieving good performance with a low threshold and a significantly worse performance when the threshold is increased. The choice of the threshold in both ∆FS and FS poses a significant problem, as it has no relation to the prediction error and its optimal value is highly problem-specific. On the contrary, for ER the threshold is directly related to the prediction error.
Robustness.
To get a better grasp of the proposed stopping criterion, we generate further binary classification problems, with the only difference of choosing k uniformly at random (with a minimum value of 3) and of using a smaller total number of features. In Figure 2, we show the accuracy of an SVM classifier on a test set, after the feature selection has been performed, as a function of the error threshold δ. Moreover, we overlay the fraction of selected features out of the original total. We can notice two interesting facts: i) even with a threshold close to zero, a great number of features is discarded; ii) the classification accuracy is rather constant even for high error thresholds, while the number of selected features decreases significantly. We can conclude that our method is effectively able to identify irrelevant features and discard them.
CMI Estimation. To see how the estimation of the (conditional) mutual information impacts the performance of the stopping condition, we consider one last problem, generated as before, with N = 30 features and fixed k = 10. In Figure 3, we look at the performance of an SVM classifier on the same test set for increasing sizes (number of samples) of the training set. We select the number of neighbors in the mutual information estimation as a fixed fraction of the training set size. Notice how, when the data points are too few, the estimated mutual information "overfits", and few or no features are discarded in the feature selection step. As a consequence, the SVM classifier also overfits the training set, leading to poor performance on the test set. On the other hand, as the number of samples increases, the estimation of the mutual information becomes more precise and the appropriate set of features is selected, resulting in a great increase in the classification accuracy on the test set. Moreover, for a small number of data points, the number of neighbors used in the MI estimation is not too relevant, while it is evident that for a large enough sample size it is better to increase the number of neighbors.

(Footnote: Since the CMI is estimated from data as well, we cannot set the threshold to exactly 0; thus, we used a small positive value in the experiments.)

TABLE I
REAL DATA RESULTS
Dataset     | δ = 0.   | δ = 0.   | δ = 0.   | δ = 0.   | δ = 1.
ORL         | 0.8      | 0.75     | 0.7      | 0.7375   | 0.7125
warpAR10P   | 0.97     | 0.98     | 0.98     | 0.98     | 0.98
glass*      | 0.99     | 0.99     | 0.99     | 0.99     | 0.99
wine        | 0.96     | 0.96     | 0.96     | 0.95     | 0.83
ALLAML      | 1.0      | 1.0      | 1.0      | 0.92     | 0.78
*: no feature removed
B. Real-World Data
We further tested the proposed feature selection algorithm on several popular real-world datasets, publicly available on the ASU feature selection website and in the UCI repository [32]. In Table I, we report the classification accuracy on a test set after the feature selection procedure, for different values of the threshold δ. We notice that the upper bound on the error is tighter in some examples and looser in others. In particular, the actual classification accuracy follows the theoretical error bound in cases where the dataset has a larger number of samples and a number of features that is not too big, for example ORL. Conversely, if the number of features is too big in comparison to the number of samples, the error bound tends to be pessimistic and the actual accuracy is much higher than the expected one (warpAR10P, ALLAML). Interestingly enough, the number of classes does not play a significant role.

VII. DISCUSSION AND CONCLUSION
Conditional mutual information is an effective statistical tool to perform feature selection via filter methods. In this paper, we proposed a novel theoretical analysis showing that using the CMI allows controlling the ideal prediction error, i.e., the error under the assumption that the trained model has infinite capacity. This is a rather new insight, as filter methods are typically employed when no assumptions are made on the underlying trained model. We proved that, when using linear models, the correlation coefficient becomes a suitable criterion for ranking and selecting features. On the basis of our findings, we proposed a new stopping condition, applicable to both forward and backward feature selection, with theoretical guarantees on the prediction error. The experimental evaluation showed that, compared against classical filter methods and stopping criteria, our approach, besides its theoretical foundation, is less sensitive to the choice of the threshold hyper-parameter and allows reaching state-of-the-art results.

Fig. 2. Classification accuracy and fraction of selected features as a function of the error threshold δ. Estimates are reported as mean values ± standard deviation.
Fig. 3. Classification accuracy on a binary classification dataset, generated with N = 30 features and k = 10, for different values of the number of samples and of the number of neighbors used for estimating the MI.

REFERENCES
[1] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, 2014.
[2] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, no. 1–2, pp. 273–324, 1997.
[3] T. N. Lal, O. Chapelle, J. Weston, and A. Elisseeff, "Embedded methods," in Feature Extraction. Springer, 2006, pp. 137–165.
[4] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," in Advances in Neural Information Processing Systems, 2001, pp. 668–674.
[5] W. Duch, T. Winiarski, J. Biesiada, and A. Kachel, "Feature selection and ranking filters," in International Conference on Artificial Neural Networks (ICANN) and International Conference on Neural Information Processing (ICONIP), vol. 251. Citeseer, 2003, p. 254.
[6] W. Duch, "Filter methods," in Feature Extraction. Springer, 2006, pp. 89–117.
[7] J. R. Vergara and P. A. Estévez, "A review of feature selection methods based on mutual information," Neural Computing and Applications, vol. 24, no. 1, pp. 175–186, 2014.
[8] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[9] D. D. Lewis, "Feature selection and feature extraction for text categorization," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 212–217.
[10] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.
[11] H. H. Yang and J. Moody, "Data visualization and feature selection: New algorithms for nongaussian data," in Advances in Neural Information Processing Systems, 2000, pp. 687–693.
[12] F. Fleuret, "Fast binary feature selection with conditional mutual information," Journal of Machine Learning Research, vol. 5, pp. 1531–1555, 2004.
[13] D. Lin and X. Tang, "Conditional infomax learning: An integrated framework for feature extraction and fusion," in European Conference on Computer Vision. Springer, 2006, pp. 68–82.
[14] G. Cheng, Z. Qin, C. Feng, Y. Wang, and F. Li, "Conditional mutual information-based feature selection analyzing for synergy and redundancy," ETRI Journal, vol. 33, no. 2, pp. 210–218, 2011.
[15] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, "Conditional likelihood maximisation: A unifying framework for information theoretic feature selection," Journal of Machine Learning Research, vol. 13, pp. 27–66, 2012.
[16] I. Tsamardinos, C. F. Aliferis, A. R. Statnikov, and E. Statnikov, "Algorithms for large scale Markov blanket discovery," in FLAIRS Conference, vol. 2, 2003, pp. 376–380.
[17] M. S. Pinsker, "Information and information stability of random variables and processes," 1960.
[18] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observation," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 229–318, 1967.
[19] S. Kullback, "A lower bound for discrimination information in terms of variation (corresp.)," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 126–127, 1967.
[20] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 121–129.
[21] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[22] A. Tsimpiris, I. Vlachos, and D. Kugiumtzis, "Nearest neighbor estimate of conditional mutual information in feature selection," Expert Systems with Applications, vol. 39, no. 16, pp. 12697–12708, 2012.
[23] W. Gao, S. Kannan, S. Oh, and P. Viswanath, "Estimating mutual information for discrete-continuous mixtures," in Advances in Neural Information Processing Systems, 2017, pp. 5986–5997.
[24] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, p. 066138, Jun 2004. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevE.69.066138
[25] Y. Wu and S. Verdú, "Functional properties of minimum mean-square error and mutual information," IEEE Transactions on Information Theory, vol. 58, no. 3, pp. 1289–1301, 2012.
[26] S. K. Das, "Feature selection with a linear dependence measure," IEEE Transactions on Computers, vol. 100, no. 9, pp. 1106–1109, 1971.
[27] M. A. Hall, "Correlation-based feature selection for machine learning," 1999.
[28] L. Yu and H. Liu, "Feature selection for high-dimensional data: A fast correlation-based filter solution," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 856–863.
[29] J. Biesiada and W. Duch, "Feature selection for high-dimensional data: A Pearson redundancy based filter," in Computer Recognition Systems 2. Springer, 2007, pp. 242–249.
[30] H. F. Eid, A. E. Hassanien, T.-h. Kim, and S. Banerjee, "Linear correlation-based feature selection for network intrusion detection model," in Advances in Security of Information and Communication Networks. Springer, 2013, pp. 240–248.
[31] J. Chen, M. Stern, M. J. Wainwright, and M. I. Jordan, "Kernel feature selection via conditional covariance minimization," in Advances in Neural Information Processing Systems, 2017.