Variational Information Maximization for Feature Selection
Shuyang Gao Greg Ver Steeg Aram Galstyan
University of Southern California, Information Sciences Institute
[email protected], [email protected], [email protected]

Abstract
Feature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. Practical methods are forced to rely on approximations due to the difficulty of estimating mutual information. We demonstrate that approximations made by existing methods are based on unrealistic assumptions. We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with a proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.
1 Introduction

Feature selection is one of the fundamental problems in machine learning research [1, 2]. Many problems include a large number of features that are either irrelevant or redundant for the task at hand. In these cases, it is often advantageous to pick a smaller subset of features to avoid over-fitting, to speed up computation, or simply to improve the interpretability of the results.

Feature selection approaches are usually categorized into three groups: wrapper, embedded and filter [3, 4, 5]. The first two, wrapper and embedded methods, are considered classifier-dependent, i.e., the selection of features depends in some way on the classifier being used. Filter methods, on the other hand, are classifier-independent and define a scoring function between features and labels in the selection process.

Because filter methods may be employed in conjunction with a wide variety of classifiers, it is important that their scoring function be as general as possible. Since mutual information (MI) is a general measure of dependence with several unique properties [6], many MI-based scoring functions have been proposed as filter methods [7, 8, 9, 10, 11, 12]; see [5] for an exhaustive list.

Owing to the difficulty of estimating mutual information in high dimensions, most existing MI-based feature selection methods rely on various low-order approximations of mutual information. While those approximations have been successful in certain applications, they are heuristic in nature and lack theoretical guarantees. In fact, as we demonstrate below (Sec. 2.2), a large family of approximate methods is based on two assumptions that are mutually inconsistent.

To address these shortcomings, in this paper we introduce a novel feature selection method based on a variational lower bound on mutual information; a similar bound was previously studied within the Infomax learning framework [13]. We show that instead of maximizing the mutual information, which is intractable in high dimensions (hence the introduction of many heuristics), we can maximize a lower bound on the MI with a proper choice of tractable variational distributions. We use this lower bound to define an objective function and derive a forward feature selection algorithm. We provide a rigorous proof that the forward feature selection is optimal under tree graphical models by choosing an appropriate variational distribution. This is in contrast with previous information-theoretic feature selection methods, which lack any performance guarantees. We also conduct empirical validation on various datasets and demonstrate that the proposed approach outperforms state-of-the-art information-theoretic feature selection methods.

In Sec. 2 we introduce general MI-based feature selection methods and discuss their limitations. Sec. 3 introduces the variational lower bound on mutual information and proposes two specific variational distributions. In Sec. 4, we report results from our experiments and compare the proposed approach with existing methods.

2 Information-Theoretic Feature Selection

Consider a supervised learning scenario where x = {x_1, x_2, ..., x_D} is a D-dimensional input feature vector and y is the output label. In filter methods, the mutual information-based feature selection task is to select T features x_{S*} = {x_{f_1}, x_{f_2}, ..., x_{f_T}} such that the mutual information between x_{S*} and y is maximized. Formally,

S* = argmax_S I(x_S : y)   s.t. |S| = T    (1)

where I(·) denotes the mutual information [6].
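To see why Eq. 1 is hard, consider solving it by brute force on a tiny discrete dataset. The sketch below (our illustration, not the paper's released code) uses a plug-in MI estimate and enumerates all C(D, T) subsets; the combinatorial blow-up of this search is what motivates the greedy strategy described next.

```python
import itertools
import numpy as np

def mi_discrete(xs, y):
    """Plug-in estimate of I(x_S : y) for integer-coded discrete data.
    xs: (N, k) array of selected feature columns; y: (N,) labels."""
    states = [tuple(row) for row in xs]
    n = len(y)
    p_xy, p_x, p_y = {}, {}, {}
    for s, label in zip(states, y):
        p_xy[(s, label)] = p_xy.get((s, label), 0) + 1.0 / n
        p_x[s] = p_x.get(s, 0) + 1.0 / n
        p_y[label] = p_y.get(label, 0) + 1.0 / n
    return sum(p * np.log(p / (p_x[s] * p_y[l])) for (s, l), p in p_xy.items())

def exhaustive_selection(X, y, T):
    """Solve Eq. 1 by brute force: C(D, T) subsets, tractable only for tiny D."""
    D = X.shape[1]
    return max(itertools.combinations(range(D), T),
               key=lambda S: mi_discrete(X[:, list(S)], y))
```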
2.1 Forward Sequential Feature Selection

Maximizing the objective function in Eq. 1 is generally NP-hard. Many MI-based feature selection methods therefore adopt a greedy approach, where features are selected incrementally, one feature at a time. Let S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}} be the selected feature set after time step t−1. According to the greedy method, the next feature f_t at step t is selected such that

f_t = argmax_{i ∉ S_{t-1}} I(x_{S_{t-1} ∪ i} : y)    (2)

where x_{S_{t-1} ∪ i} denotes x's projection onto the feature space S_{t-1} ∪ i. As shown in [5], the mutual information term in Eq. 2 can be decomposed as

I(x_{S_{t-1} ∪ i} : y)
  = I(x_{S_{t-1}} : y) + I(x_i : y | x_{S_{t-1}})
  = I(x_{S_{t-1}} : y) + I(x_i : y) − I(x_i : x_{S_{t-1}}) + I(x_i : x_{S_{t-1}} | y)
  = I(x_{S_{t-1}} : y) + I(x_i : y) − (H(x_{S_{t-1}}) − H(x_{S_{t-1}} | x_i)) + (H(x_{S_{t-1}} | y) − H(x_{S_{t-1}} | x_i, y))    (3)

where H(·) denotes the entropy [6]. Omitting the terms in Eq. 3 that do not depend on x_i, we can rewrite Eq. 2 as follows:

f_t = argmax_{i ∉ S_{t-1}} { I(x_i : y) + H(x_{S_{t-1}} | x_i) − H(x_{S_{t-1}} | x_i, y) }    (4)

The greedy learning algorithm has been analyzed in [14].

2.2 Limitations of Previous MI-Based Feature Selection Methods

Estimating high-dimensional information-theoretic quantities is a difficult task. Therefore, most MI-based feature selection methods propose low-order approximations to H(x_{S_{t-1}} | x_i) and H(x_{S_{t-1}} | x_i, y) in Eq. 4. A general family of methods relies on the following approximations [5]:

H(x_{S_{t-1}} | x_i) ≈ Σ_{k=1}^{t-1} H(x_{f_k} | x_i),   H(x_{S_{t-1}} | x_i, y) ≈ Σ_{k=1}^{t-1} H(x_{f_k} | x_i, y)    (5)

The approximations in Eq. 5 become exact under the following two assumptions [5]:

Assumption 1 (Feature Independence Assumption). p(x_{S_{t-1}} | x_i) = Π_{k=1}^{t-1} p(x_{f_k} | x_i)

Assumption 2 (Class-Conditioned Independence Assumption). p(x_{S_{t-1}} | x_i, y) = Π_{k=1}^{t-1} p(x_{f_k} | x_i, y)

Assumption 1 and Assumption 2 mean that the selected features are independent and class-conditionally independent, respectively, given the unselected feature x_i under consideration.
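Since H(x_{f_k} | x_i) − H(x_{f_k} | x_i, y) = I(x_{f_k} : y | x_i), substituting Eq. 5 into Eq. 4 gives the generic low-order criterion score(i) = I(x_i : y) + Σ_k I(x_{f_k} : y | x_i). The following is a minimal sketch of greedy selection under this criterion for integer-coded discrete data (our illustration; the named methods [8, 9, 10, 15, 16] each use their own variant of such a score):

```python
import numpy as np

def entropy(cols):
    """Plug-in entropy of the joint distribution of the given (N, k) columns."""
    _, counts = np.unique(cols, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def cond_mi(a, b, c):
    """I(a : b | c) = H(a,c) + H(b,c) - H(a,b,c) - H(c), all plug-in estimates."""
    col = lambda *vs: np.column_stack(vs)
    return (entropy(col(a, c)) + entropy(col(b, c))
            - entropy(col(a, b, c)) - entropy(c.reshape(-1, 1)))

def greedy_low_order(X, y, T):
    """Forward selection with the Eq. 4 objective under the Eq. 5 approximations:
    score(i) = I(x_i : y) + sum_k I(x_{f_k} : y | x_i)."""
    N, D = X.shape
    S = []
    for _ in range(T):
        best, best_score = None, -np.inf
        for i in set(range(D)) - set(S):
            # I(x_i : y), computed as conditional MI given a constant
            score = cond_mi(X[:, i], y, np.zeros(N, int))
            score += sum(cond_mi(X[:, k], y, X[:, i]) for k in S)
            if score > best_score:
                best, best_score = i, score
        S.append(best)
    return S
```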
Figure 1: The first two graphical models show the assumptions of traditional MI-based feature selection methods; the third graphical model shows a scenario when both Assumption 1 and Assumption 2 are true. (Panel titles: Assumption 1; Assumption 2; Satisfying both Assumption 1 and Assumption 2.) A dashed line indicates that there may or may not be a correlation between two variables.

We now demonstrate that the two assumptions cannot be valid simultaneously unless the data has a very specific (and unrealistic) structure. Indeed, consider the graphical models consistent with either assumption, as illustrated in Fig. 1.
If Assumption 1 holds true, then x_i is the only common cause of the previously selected features S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}}, so that those features become independent when conditioned on x_i. On the other hand, if Assumption 2 holds, then the features depend both on x_i and on the class label y; therefore, generally speaking, the distribution over those features does not factorize by solely conditioning on x_i: there will be remnant dependencies due to y. Thus, if Assumption 2 is true, then Assumption 1 cannot be true in general, unless the data is generated according to the very specific model shown in the rightmost panel of Fig. 1. Note, however, that in this case x_i becomes the most important feature, because I(x_i : y) > I(x_{S_{t-1}} : y); then we should have selected x_i at the very first step, contradicting the feature selection process.

As we mentioned above, most existing methods implicitly or explicitly adopt both assumptions or their stronger versions, as shown in [5], including mutual information maximization (MIM) [15], joint mutual information (JMI) [8], conditional mutual information maximization (CMIM) [9], maximum relevance minimum redundancy (mRMR) [10], and conditional infomax feature extraction (CIFE) [16], among others. Approaches based on global optimization of mutual information, such as quadratic programming feature selection (QPFS) [11] and the state-of-the-art conditional mutual information-based spectral method (SPEC_CMI) [12], are derived from the previous greedy methods and therefore also implicitly rely on those two assumptions.
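As a sanity check on this argument, the following self-contained simulation (ours, not from the paper) constructs data in which Assumption 2 holds exactly by design; conditioning on x_i alone then leaves a clear residual dependence between the selected features, so Assumption 1 necessarily fails.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.integers(0, 2, n)    # class label
xi = rng.integers(0, 2, n)   # candidate feature under consideration

# Selected features depend on BOTH xi and y, with independent noise flips,
# so they are conditionally independent given (xi, y): Assumption 2 holds.
f1 = xi ^ y ^ (rng.random(n) < 0.1)
f2 = xi ^ y ^ (rng.random(n) < 0.1)

# Assumption 2: within each (xi, y) cell, f1 and f2 are uncorrelated (~0).
for a in (0, 1):
    for b in (0, 1):
        m = (xi == a) & (y == b)
        print("corr | xi,y:", np.corrcoef(f1[m], f2[m])[0, 1])

# Assumption 1 fails: conditioning on xi alone leaves dependence through y,
# so the correlation is clearly positive (about 0.64 for these parameters).
for a in (0, 1):
    m = xi == a
    print("corr | xi:", np.corrcoef(f1[m], f2[m])[0, 1])
```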
In the next section we address these issues by introducing a novel information-theoretic framework for feature selection. Instead of estimating mutual information and making mutually inconsistent assumptions, our framework formulates a tractable variational lower bound on mutual information, which allows a more flexible and general class of assumptions via appropriate choices of variational distributions.

3 Method

3.1 Variational Mutual Information Lower Bound

Let p(x, y) be the joint distribution of input (x) and output (y) variables. Barber & Agakov [13] derived the following lower bound for the mutual information I(x : y) by using the non-negativity of the KL-divergence, i.e., Σ_x p(x|y) log( p(x|y) / q(x|y) ) ≥ 0, which gives

I(x : y) ≥ H(x) + ⟨ln q(x|y)⟩_{p(x,y)}    (6)

where the angled brackets represent averages and q(x|y) is an arbitrary variational distribution. This bound becomes exact if q(x|y) ≡ p(x|y).

It is worthwhile to note that, in the context of unsupervised representation learning, p(y|x) and q(x|y) can be viewed as an encoder and a decoder, respectively. In that case, y is learned by maximizing the lower bound in Eq. 6, iteratively adjusting the parameters of the encoder and decoder, as in [13, 17].

3.2 Variational Information Maximization for Feature Selection

Naturally, in terms of information-theoretic feature selection, we could also try to optimize the variational lower bound in Eq. 6 by choosing a subset of features S* in x, such that

S* = argmax_S { H(x_S) + ⟨ln q(x_S|y)⟩_{p(x_S,y)} }    (7)

However, the H(x_S) term on the RHS of Eq. 7 is still intractable when x_S is very high-dimensional. Nonetheless, noticing that the variable y is the class label, which is usually discrete, so that H(y) is fixed and tractable, by symmetry we switch x and y in Eq. 6 and rewrite the lower bound as follows:

I(x : y) ≥ H(y) + ⟨ln q(y|x)⟩_{p(x,y)} = ⟨ ln( q(y|x) / p(y) ) ⟩_{p(x,y)}    (8)

The equality in Eq. 8 is obtained by noticing that H(y) = ⟨−ln p(y)⟩_{p(y)}.

Using Eq. 8, the lower-bound-optimal subset S* of x becomes

S* = argmax_S { ⟨ ln( q(y|x_S) / p(y) ) ⟩_{p(x_S,y)} }    (9)

Here q(y|x_S) in Eq. 9 can be any distribution as long as it is normalized. We need to choose q(y|x_S) to be as general as possible while still keeping the term ⟨ln q(y|x_S)⟩_{p(x_S,y)} in Eq. 9 tractable. As a result, we set q(y|x_S) to

q(y|x_S) = q(x_S, y) / q(x_S) = q(x_S|y) p(y) / Σ_{y'} q(x_S|y') p(y')    (10)

One can verify that Eq. 10 is normalized even if q(x_S|y) is not normalized. If we further denote

q(x_S) = Σ_{y'} q(x_S|y') p(y')    (11)

then by combining Eqs. 9 and 10 we get

I(x_S : y) ≥ ⟨ ln( q(x_S|y) / q(x_S) ) ⟩_{p(x_S,y)} ≡ I_LB(x_S : y)    (12)
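To make Eqs. 10-12 concrete, here is a minimal Monte Carlo estimator of I_LB (our sketch; logq stands for the log of whatever variational family q(x_S|y) has been chosen, which per Eq. 10 need not be normalized):

```python
import numpy as np

def variational_mi_lower_bound(logq, y, p_y):
    """Estimate I_LB(x_S : y) of Eq. 12 from samples.

    logq: (N, L) array, logq[k, l] = ln q(x_S^{(k)} | y = y_l) under the
          chosen variational distribution (need not be normalized).
    y:    (N,) array of class indices in {0, ..., L-1}.
    p_y:  (L,) empirical class prior.
    """
    n = len(y)
    # ln q(x_S | y) evaluated at each sample's own label
    log_q_given_y = logq[np.arange(n), y]
    # ln q(x_S) = ln sum_y' q(x_S | y') p(y')  (Eq. 11), via log-sum-exp
    a = logq + np.log(p_y)
    m = a.max(axis=1)
    log_q_marg = m + np.log(np.exp(a - m[:, None]).sum(axis=1))
    # Monte Carlo average over the empirical joint p(x_S, y)
    return float((log_q_given_y - log_q_marg).mean())
```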
Auto-Regressive Decomposition. Now that q(y|x_S) is defined, all we need to do is model q(x_S|y) in Eq. 10; q(x_S) is then easy to compute from q(x_S|y) via Eq. 11. We decompose q(x_S|y) as an auto-regressive distribution, assuming T features in S:

q(x_S|y) = q(x_{f_1}|y) Π_{t=2}^{T} q(x_{f_t} | x_{f_{1:t-1}}, y)    (13)

The conditionals in Eq. 13 can then be restricted to tractable families. We consider two concrete choices: a Naive Bayes Q-distribution, which sets q(x_{f_t} | x_{f_{1:t-1}}, y) = p(x_{f_t} | y), and a Pairwise Q-distribution, which takes a geometric mean of the pairwise conditionals p(x_{f_t} | x_{f_k}, y) over the previously selected features (see supplementary Sec. A for the exact recursions). Forward feature selection then proceeds by greedily adding, at each step, the feature that maximizes the resulting lower bound; we refer to these two methods as VMI_naive and VMI_pairwise, respectively.

It is worth noting that the lower bound does not always increase at each step. A decrease in the lower bound at step t indicates that the Q-distribution approximates the underlying distribution worse than it did at the previous step t−1. In this case, the algorithm re-maximizes the lower bound from zero with only the remaining unselected features. We summarize the concrete implementation of our algorithms in supplementary Sec. A.

Time Complexity. Although our algorithm needs to calculate the distributions at each step, we only need to calculate the probability value at each sample point. For both VMI_naive and VMI_pairwise, the total computational complexity is O(NDT), where N is the number of samples, D the total number of features, and T the number of finally selected features. The detailed time analysis is left for supplementary Sec. A. As shown in Table 1, our methods VMI_naive and VMI_pairwise have the same time complexity as mRMR [10], while the state-of-the-art global optimization method SPEC_CMI [12] must precompute the pairwise mutual information matrix, which gives a time complexity of O(ND^2).

Table 1: Time complexity in terms of the number of features D, the number of selected features T, and the number of samples N.

Method      mRMR     VMI_naive   VMI_pairwise   SPEC_CMI
Complexity  O(NDT)   O(NDT)      O(NDT)         O(ND^2)

Optimality Under Tree Graphical Models. Although our method VMI_naive assumes a Naive Bayes model, we can prove that this method is still optimal if the data is generated according to a tree graphical model. Indeed, both of our methods, VMI_naive and VMI_pairwise, will always prioritize the first-layer features, as shown in Fig. 3. This optimality is summarized in Theorem B.1 in supplementary Sec. B.

4 Experiments

4.1 Synthetic Data

We begin with experiments on a synthetic model generated according to the tree structure illustrated in the left part of Fig. 3. The detailed data generating process is shown in supplementary Sec. D. The root node y is a binary variable, while the other variables are continuous. We use VMI_naive to optimize the lower bound I_LB(x_S : y); samples drawn from the model are used to fit the variational Q-distributions by kernel density estimation. We can see from the plot in the right part of Fig. 3 that our algorithm, VMI_naive, selects x_1, x_2, x_3 as the first three features, although x_2 and x_3 are only weakly correlated with y. If we continue to add the deeper-level features {x_4, ..., x_9}, the lower bound decreases. For comparison, we also report the mutual information between each single feature x_i and y in Table 2, which shows that the maximum relevance criterion [15], ranking features by I(x_i : y) alone, would choose a different set of top three features.

Figure 3: (Left) The generative model used for the synthetic experiments; edge thickness represents the strength of the relationship. (Right) Optimizing the lower bound by VMI_naive, showing the ground truth I(x_{S_t} : y) and the lower bound Î_LB(x_{S_t} : y). Variables under the blue line denote the features selected at each step; the dotted blue line shows the decreasing lower bound as more features are added. Ground-truth mutual information is obtained using N = 100,000 samples.
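For concreteness, here is a compact sketch of the full selection procedure for discrete features under the Naive Bayes Q-distribution, including the restart rule described above. This is our illustration rather than the authors' released implementation [24]; the smoothing pseudo-count alpha is our addition.

```python
import numpy as np

def vmi_naive(X, y, T, alpha=0.1):
    """Forward selection maximizing I_LB under the Naive Bayes Q-distribution,
    q(x_S | y) = prod_j p(x_j | y), for integer-coded discrete X."""
    N, D = X.shape
    classes, y_idx = np.unique(y, return_inverse=True)
    L = len(classes)
    p_y = np.bincount(y_idx, minlength=L) / N

    def log_p_xj_given_y(j):
        # Smoothed estimate of ln p(x_j^{(k)} | y_l) for every sample/class
        vals, v_idx = np.unique(X[:, j], return_inverse=True)
        counts = np.zeros((len(vals), L))
        np.add.at(counts, (v_idx, y_idx), 1.0)
        probs = (counts + alpha) / (counts.sum(axis=0) + alpha * len(vals))
        return np.log(probs[v_idx])                      # (N, L)

    def lb(logq):
        # I_LB of Eq. 12, with ln q(x_S) computed via log-sum-exp (Eq. 11)
        a = logq + np.log(p_y)
        m = a.max(axis=1)
        log_marg = m + np.log(np.exp(a - m[:, None]).sum(axis=1))
        return (logq[np.arange(N), y_idx] - log_marg).mean()

    F, remaining = [], set(range(D))
    logq, cur, fresh = np.zeros((N, L)), 0.0, True       # empty working set S
    while len(F) < T and remaining:
        scores = {i: lb(logq + log_p_xj_given_y(i)) for i in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= cur and not fresh:            # bound decreased:
            logq, cur, fresh = np.zeros((N, L)), 0.0, True   # clear S, keep F
            continue
        F.append(best); remaining.discard(best)
        logq, cur, fresh = logq + log_p_xj_given_y(best), scores[best], False
    return F
```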
Table 2: Mutual information I(x_i : y) between y and each feature x_i of the model in Fig. 3, estimated using N = 100,000 samples; the top three variables with the highest mutual information are highlighted in bold. [Numeric entries not recoverable.]

4.2 Real-World Data

We compare our algorithms VMI_naive and VMI_pairwise with other popular information-theoretic feature selection methods, including mRMR [10], JMI [8], MIM [15], CMIM [9], CIFE [16], and SPEC_CMI [12]. We use 17 well-known datasets from previous feature selection studies [5, 12] (all data are discretized); the dataset summaries are given in supplementary Sec. C. We use the average cross-validation error rate over the range of 10 to 100 selected features to compare the different algorithms, under the same setting as [12]. 10-fold cross-validation is employed for the larger datasets and leave-one-out cross-validation otherwise. The 3-Nearest-Neighbor classifier is used for Gisette and Madelon, following [5]; for the remaining datasets, the classifier is a linear SVM, following [11, 12].

The experimental results are shown in Table 3. Entries with * and ** indicate the best and the second-best performance, respectively (in terms of average error rate). We also use the paired t-test at the 5% significance level to test the hypothesis that VMI_naive or VMI_pairwise performs significantly better than the other methods, or vice versa. Overall, we find that both of our methods, VMI_naive and VMI_pairwise, strongly outperform the other methods, indicating that our variational feature selection framework is a promising addition to the current literature on information-theoretic feature selection.

Table 3: Average cross-validation error rate comparison of VMI against other methods. The last two lines indicate win (W) / tie (T) / loss (L) for VMI_naive and VMI_pairwise, respectively. Results for MIM and CIFE are omitted due to space limitations; the complete results are shown in supplementary Sec. C. Missing means are marked with –.

Dataset     mRMR         JMI          CMIM         SPEC_CMI     VMI_naive    VMI_pairwise
Lung        –±(4.7)**    –±(4.7)      11.4±(3.0)   11.6±(5.6)   –±(3.6)*     –±(6.0)
Colon       19.7±(2.6)   17.3±(3.0)   18.4±(2.6)   16.1±(2.0)   –±(2.7)*     –±(1.7)**
Leukemia    0.4±(0.7)    1.4±(1.2)    1.1±(2.0)    1.8±(1.3)    –±(0.1)*     –±(0.5)**
Lymphoma    5.6±(2.8)    6.6±(2.2)    8.6±(3.3)    12.0±(6.6)   –±(1.9)*     –±(3.1)**
Splice      –±(0.4)*     –±(0.5)**    –±(0.3)      –±(0.5)**    –±(0.5)**    –±(0.5)**
Landsat     19.5±(1.2)   18.9±(1.0)   19.1±(1.1)   21.0±(3.5)   –±(0.8)*     –±(1.0)**
Waveform    –±(0.5)*     –±(0.5)*     –±(0.7)      –±(0.6)**    –±(0.6)**    –±(0.5)*
KrVsKp      –±(0.7)**    –±(0.6)      5.3±(0.5)    –±(0.6)*     –±(0.5)      –±(0.7)**
Ionosphere  12.8±(0.9)   16.6±(1.6)   13.1±(0.8)   16.8±(1.6)   –±(1.9)**    –±(1.0)*
Semeion     23.4±(6.5)   24.8±(7.6)   16.3±(4.4)   26.0±(9.3)   –±(4.0)*     –±(3.9)**
Multifeat.  4.0±(1.6)    4.0±(1.6)    3.6±(1.2)    4.8±(3.0)    –±(1.1)*     –±(1.1)**
Optdigits   7.6±(3.3)    7.6±(3.2)    –±(3.4)**    –±(6.0)      –±(2.5)*     –±(3.6)
Musk2       –±(0.7)*     –±(0.7)      13.0±(1.0)   15.1±(1.8)   12.8±(0.6)   –±(0.5)**
Spambase    6.9±(0.7)    7.0±(0.8)    –±(0.7)**    –±(2.3)      –±(0.3)*     –±(0.3)*
Promoter    21.5±(2.8)   22.4±(4.0)   22.1±(2.9)   24.0±(3.7)   –±(3.9)**    –±(3.1)*
Gisette     5.5±(0.9)    5.9±(0.7)    5.1±(1.3)    7.1±(1.3)    –±(0.9)**    –±(0.8)*
Madelon     30.8±(3.8)   –±(2.6)*     –±(2.6)      –±(2.5)**    –±(2.7)      16.6±(2.9)
W/T/L (VMI_naive):     11/4/2   10/6/1   10/7/0   13/2/2
W/T/L (VMI_pairwise):  9/6/2    9/6/2    13/3/1   12/3/2

Figure 4: Number of selected features versus average cross-validation error on the Semeion and Gisette datasets, for mRMR, JMI, MIM, CMIM, CIFE, SPEC_CMI, VMI_naive, and VMI_pairwise.
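The evaluation protocol above can be sketched as follows (our rendering, assuming scikit-learn; the classifier is a linear SVM or 3-NN depending on the dataset, as stated in the text, and the 10-feature step grid over the 10-100 range is our choice):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def average_cv_error(X, y, ranking, classifier=None, cv=10):
    """Average cross-validation error over the top-10, 20, ..., 100 features
    of a fixed feature ranking, as produced by any selection method above.
    Use cv=10 for larger datasets; leave-one-out otherwise, per the text."""
    clf = classifier if classifier is not None else LinearSVC()
    errors = []
    for t in range(10, 101, 10):
        acc = cross_val_score(clf, X[:, ranking[:t]], y, cv=cv).mean()
        errors.append(1.0 - acc)
    return float(np.mean(errors))
```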
We also plot the average cross-validation error with respect to the number of selected features. Fig. 4 shows the two most distinguishable datasets, Semeion and Gisette; both of our methods, VMI_naive and VMI_pairwise, achieve lower error rates on these two datasets.

5 Related Work

There has been a significant amount of work on information-theoretic feature selection in the past twenty years: [5, 7, 8, 9, 10, 15, 11, 12, 20], to name a few. Most of these methods are based on combinations of so-called relevant, redundant and complementary information. Such combinations represent low-order approximations of mutual information derived from two assumptions, and, as shown above, it is unrealistic to expect both assumptions to be true. Inspired by group testing [21], more scalable feature selection methods have been developed, but this approach also requires the calculation of high-dimensional mutual information as a basic scoring function.

Estimating mutual information from data requires a large number of observations, especially when the dimensionality is high. The proposed variational lower bound can be viewed as a way of estimating the mutual information between a high-dimensional continuous variable and a discrete variable; only a few examples of this setting exist in the literature [22]. We hope our method will shed light on new ways to estimate mutual information, similar to estimating divergences in [23].

6 Conclusion

Feature selection has been a significant endeavor over the past decade. Mutual information gives a general basis for quantifying the informativeness of features; despite its conceptual clarity, it is difficult to estimate. While a large number of information-theoretic methods exist, they rely on mutually inconsistent assumptions about the underlying data distributions. We introduced a unifying variational mutual information lower bound to address these issues. We showed that, through an auto-regressive decomposition, feature selection can be done in a forward manner by progressively maximizing the lower bound. We also presented two concrete methods using the Naive Bayes and Pairwise Q-distributions, which strongly outperform the existing methods. VMI_naive assumes only a Naive Bayes model, yet even this simple model outperforms the existing information-theoretic methods, indicating the effectiveness of our variational information maximization framework. We hope that our framework will inspire new mathematically rigorous algorithms for information-theoretic feature selection, such as optimizing the variational lower bound globally and developing more powerful variational approaches for capturing complex dependencies.

References

[1] Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.
[2] Huan Liu and Hiroshi Motoda.
Feature Selection for Knowledge Discovery and Data Mining, volume 454. Springer Science & Business Media, 2012.
[3] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1):273–324, 1997.
[4] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
[5] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13(1):27–66, 2012.
[6] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[7] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions on, 5(4):537–550, 1994.
[8] Howard Hua Yang and John E. Moody. Data visualization and feature selection: New algorithms for nongaussian data. In NIPS, volume 99, pages 687–693, 1999.
[9] François Fleuret. Fast binary feature selection with conditional mutual information. The Journal of Machine Learning Research, 5:1531–1555, 2004.
[10] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226–1238, 2005.
[11] Irene Rodriguez-Lujan, Ramon Huerta, Charles Elkan, and Carlos Santa Cruz. Quadratic programming feature selection. The Journal of Machine Learning Research, 11:1491–1516, 2010.
[12] Xuan Vinh Nguyen, Jeffrey Chan, Simone Romano, and James Bailey. Effective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 512–521. ACM, 2014.
[13] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems 16, page 201. MIT Press, 2004.
[14] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1057–1064, 2011.
[15] David D. Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language, pages 212–217. Association for Computational Linguistics, 1992.
[16] Dahua Lin and Xiaoou Tang. Conditional infomax learning: an integrated framework for feature extraction and fusion. In Computer Vision–ECCV 2006, pages 68–82. Springer, 2006.
[17] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.
[18] Kiran S. Balagani and Vir V. Phoha. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7):1342–1343, 2010.
[19] Nguyen Xuan Vinh, Shuo Zhou, Jeffrey Chan, and James Bailey. Can high-order dependencies improve mutual information based feature selection? Pattern Recognition, 2015.
[20] Hongrong Cheng, Zhiguang Qin, Chaosheng Feng, Yong Wang, and Fagen Li. Conditional mutual information-based feature selection analyzing for synergy and redundancy.
ETRI Journal, 33(2):210–218, 2011.
[21] Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q. Ngo, Long Nguyen, Christopher Ré, and Venu Govindaraju. Parallel feature selection inspired by group testing. In Advances in Neural Information Processing Systems, pages 3554–3562, 2014.
[22] Brian C. Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 9(2):e87357, 2014.
[23] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56(11):5847–5861, 2010.
[24] Shuyang Gao. Variational feature selection code. http://github.com/BiuBiuBiLL/InfoFeatureSelection.
[25] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(02):185–205, 2005.
[26] Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013.

Supplementary Material for "Variational Information Maximization for Feature Selection"

A Detailed Algorithm for Variational Forward Feature Selection

We describe the detailed algorithm for our approach. We also provide open source code implementing VMI_naive and VMI_pairwise [24].

Concretely, suppose the class label y is discrete and takes L different values {y_1, y_2, ..., y_L}. We then define the distribution vector Q^(k)_t, of size L, for each sample (x^(k), y^(k)) at step t:

Q^(k)_t = [ q̂(x^(k)_{S_t} | y = y_1), ..., q̂(x^(k)_{S_t} | y = y_L) ]^T    (19)

where x^(k)_{S_t} denotes the projection of sample x^(k) onto the feature space x_{S_t}. We further denote by Y, of size L × 1, the distribution vector of y:

Y = [ p̂(y = y_1), p̂(y = y_2), ..., p̂(y = y_L) ]^T    (20)

Then we are able to rewrite q(x_{S_{t-1}}) and q(x_{S_{t-1}} | y) in terms of Q^(k)_{t-1} and Y, and substitute them into Î_LB(x_{S_{t-1}} : y). To illustrate, at step t−1 we have

Î_LB(x_{S_{t-1}} : y) = (1/N) Σ_{x^(k), y^(k)} log( p(x^(k)_{S_{t-1}} | y = y^(k)) ) − (1/N) Σ_k log( Y^T Q^(k)_{t-1} )    (21)

To select a feature i at step t, let us define the conditional distribution vector C^(k)_{i,t-1} for each feature i ∉ S_{t-1} and each sample (x^(k), y^(k)):

C^(k)_{i,t-1} = [ q(x^(k)_i | x^(k)_{S_{t-1}}, y = y_1), ..., q(x^(k)_i | x^(k)_{S_{t-1}}, y = y_L) ]^T    (22)

At step t, we use the previously stored C^(k)_{i,t-1} and Q^(k)_{t-1} to get

Î_LB(x_{S_{t-1} ∪ i} : y) = (1/N) Σ_{x^(k), y^(k)} log( p(x^(k)_{S_{t-1}} | y = y^(k)) p(x^(k)_i | x^(k)_{S_{t-1}}, y = y^(k)) ) − (1/N) Σ_k log( Y^T diag(Q^(k)_{t-1}) C^(k)_{i,t-1} )    (23)

We summarize our detailed implementation in Algorithm 1. Updating Q^(k)_t and C^(k)_{i,t} in Algorithm 1 may vary according to the different Q-distributions. But we can verify that, under the Naive Bayes Q-distribution or the Pairwise Q-distribution, Q^(k)_t and C^(k)_{i,t} can be obtained recursively from Q^(k)_{t-1} and C^(k)_{i,t-1}, by noticing that q(x_i | x_{S_t}, y) = p(x_i | y) for the Naive Bayes Q-distribution and q(x_i | x_{S_t}, y) = ( p(x_i | x_{f_t}, y) · q(x_i | x_{S_{t-1}}, y)^{t-1} )^{1/t} for the Pairwise Q-distribution.

Let N denote the number of samples, D the total number of features, T the number of selected features, and L the number of distinct values of the class variable y. The computational complexity of Algorithm 1 involves calculating the lower bound for each feature i at every step, which is O(NDL); updating C^(k)_{i,t} costs O(NDL) for the Pairwise Q-distribution and O(1) for the Naive Bayes Q-distribution; updating Q^(k)_t costs O(NDL). We need to select T features, and therefore the total time complexity is O(NDT), where we ignore L because the number of classes is usually much smaller than the other quantities.

Algorithm 1: Variational Forward Feature Selection (VMI)
Data: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))
Input: T ← number of features to select
Output: F ← final selected feature set
F ← ∅; S ← ∅; t ← 1
Initialize Q^(k)_0 and C^(k)_{i,0} for every feature i; calculate Y
while |F| < T do
    Î_LB(x_{S_{t-1} ∪ i} : y) ← Eq. 23 for each i ∉ F
    f_t ← argmax_{i ∉ S_{t-1}} Î_LB(x_{S_{t-1} ∪ i} : y)
    if Î_LB(x_{S_{t-1} ∪ f_t} : y) ≤ Î_LB(x_{S_{t-1}} : y) then
        Clear S; set t ← 1
    else
        F ← F ∪ f_t; S_t ← S_{t-1} ∪ f_t
        Update Q^(k)_t and C^(k)_{i,t}
        t ← t + 1
    end if
end while
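As an illustration of how Eq. 23 turns into array operations, the sketch below (our rendering, not the released implementation [24]) scores one candidate feature from the stored vectors; Q, C_i, Y, and y_idx are assumed to be numpy arrays holding Q^(k)_{t-1}, C^(k)_{i,t-1}, Y, and the per-sample class indices, respectively.

```python
import numpy as np

def score_candidate(Q, C_i, Y, y_idx):
    """Eq. 23: lower bound after adding feature i.

    Q:     (N, L) array, rows are Q^(k)_{t-1}.
    C_i:   (N, L) array, rows are C^(k)_{i,t-1}.
    Y:     (L,) class prior vector.
    y_idx: (N,) class index of each sample.
    """
    n = len(y_idx)
    QC = Q * C_i                          # q(x_S, x_i | y_l) per sample and class
    # First term: (1/N) sum_k ln q(x_{S ∪ i}^(k) | y = y^(k))
    first = np.log(QC[np.arange(n), y_idx]).mean()
    # Second term: (1/N) sum_k ln( Y^T diag(Q^(k)) C_i^(k) ) = ln(QC @ Y)
    second = np.log(QC @ Y).mean()
    return first - second
```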
If node x j that isnot in the first layer is selected at step t = 1 , denote ances ( x j ) as x j ’s ancestor in layer 1, then I ( x j : y ) = I ( ances ( x j ) : y ) which means that the information is not lost from ances ( x j ) → x j .In this case, one can always switch ances ( x j ) with x j and let x j be in the first layer, which doesnot conflict with the model assumption.Therefore, condition I and II are satisfied in step t = 1 .2) Assuming condition I and II are satisfied in step t , then we have the following argument in step t + 1 :We discuss the candidate nodes in three classes, and argue that nodes in Remaining-Layer 1 Class are always being selected. Redundant Class For any descendant desc ( S t ) of selected feature set S t , we have, I (cid:0) x S t ∪ desc ( S t ) : y (cid:1) = I ( x S t : y ) = I LB ( x S t : y ) (25)Eq. 25 comes from the fact that the desc ( S t ) carries no additional information about y other than S t . The second equality is by induction.Based on Eq. 12 and 25, we have, I LB (cid:0) x S t ∪ desc ( S t ) : y (cid:1) < I (cid:0) x S t ∪ desc ( S t ) : y (cid:1) = I ( x S t : y ) (26)We assume here that the LHS is strictly less than RHS in Eq. 26 without loss of generality. Thisis because if the equality holds, we have p (x S t | y ) p ( desc ( S t ) | y ) = p ( x t , desc ( S t ) | y ) due toTheorem 3.1. In this case, we can always rearrange desc ( S t ) to the first layer, which does notconflict with the model assumption.Note that by combining Eqs. 25 and 26, we can also get I LB (cid:0) x S t ∪ desc ( S t ) : y (cid:1) < I LB ( x S t : y ) (27)Eq. 27 means that adding a feature in Redundant Class will actually decrease the value of lowerbound I LB . Remaining-Layer1 Class For any other unselected node j of the first layer, i.e., j ∈ L \ S t , we have I ( x S t : y ) ≤ I ( x S t ∪ j : y ) = I LB ( x S t ∪ j : y ) (28)The inequality in Eq. 28 is obvious which comes from the data processing inequality [6]. And theequality in Eq. 28 comes directly from Theorem 3.1. Descendants-of-Remaining-Layer1 Class For any node desc ( j ) that is the descendant of j where j ∈ L \ S t , we have, I LB (cid:0) x S t ∪ desc ( j ) : y (cid:1) ≤ I (cid:0) x S t ∪ desc ( j ) : y (cid:1) I (cid:0) x S t ∪ desc ( j ) : y (cid:1) ≤ I ( x S t ∪ j : y ) (29)The second inequality of Ineq. 29 also comes from data processing inequality.12ombining Eqs. 26 and 28, we get, I LB (cid:0) x S t ∪ desc ( S t ) : y (cid:1) < I LB ( x S t ∪ j : y ) (30)Combining Eqs. 28 and 29, we get, I LB (cid:0) x S t ∪ desc ( j ) : y (cid:1) ≤ I LB ( x S t ∪ j : y ) (31)Ineq. 30 essentially tells us the forward feature selection will always choose Remaining-Layer1Class other than Redundant Class .Ineq. 31 is saying we are choosing Remaining-Layer1 Class other than Descendants-of-Remaining-Layer1 Class without loss of generality (for the equality concern, we can have the same argumentin step t = 1 ).Considering Ineqs. 30 and 31, in step t + 1 , the algorithm chooses node j in Remaining-Layer1Class , i.e., j ∈ L \ S t .Therefore, condition I and II hold at step t + 1 .At step t + 1 , if I LB ( x S t ∪ j : y ) = I LB ( x S t : y ) for any j ∈ L \ S t , that means I ( x S t ∪ j : y ) = I ( x S t : y ) . Then we have, I ( x S t : y ) = I ( x L : y ) = I ( x : y ) (32)The first equality in Eq. 32 holds because adding any j in L \ S t will not increase the mutualinformation. 
The second equality is due to the data processing inequality under the tree graphical model assumption. Therefore, if I_LB(x_{S_t ∪ j} : y) = I_LB(x_{S_t} : y) for every j ∈ L \ S_t, we set T = t. Combining Condition II and Eq. 32, we have

I_LB(x_{S_T} : y) = I(x_{S_T} : y) = I(x : y)    (33)

and Condition III holds.

C Datasets and Results

Table 4 summarizes the datasets used in the experiments, and Table 5 shows the complete results.

Table 4: Dataset summary. N: number of samples; d: number of features; L: number of classes.

Data        N     d     L   Source
Lung        73    325   20  [25]
Colon       62    2000  2   [25]
Leukemia    72    7070  2   [25]
Lymphoma    96    4026  9   [25]
Splice      3175  60    3   [26]
Landsat     6435  36    6   [26]
Waveform    5000  40    3   [26]
KrVsKp      3196  36    2   [26]
Ionosphere  351   34    2   [26]
Semeion     1593  256   10  [26]
Multifeat.  2000  649   10  [26]
Optdigits   3823  64    10  [26]
Musk2       6598  166   2   [26]
Spambase    4601  57    2   [26]
Promoter    106   57    2   [26]
Gisette     6000  5000  2   [4]
Madelon     2000  500   2   [4]

Table 5: Average cross-validation error rate comparison of VMI against all baseline methods (mRMR, JMI, MIM, CMIM, CIFE, SPEC_CMI, VMI_naive, VMI_pairwise); the last two lines indicate win (W) / tie (T) / loss (L) for VMI_naive and VMI_pairwise, respectively. [Numeric entries not recoverable.]

D Generating Synthetic Data

The synthetic tree-structured data used in the experiments are generated as follows. The root node y is drawn from a Bernoulli distribution. Each first-layer feature x_1, x_2, x_3 is drawn from a Gaussian with unit standard deviation whose mean is a scaled function of y, with x_1 coupled to y more strongly than x_2 and x_3. Each deeper-level feature in {x_4, ..., x_9} is drawn from a unit-variance Gaussian whose mean equals the value of its first-layer parent.
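For completeness, here is a generator matching the description above. The coupling coefficients and the parent assignments of x_4 through x_9 below are illustrative assumptions rather than the paper's exact values, as is the sample count.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                              # sample count: our choice

# Tree-structured synthetic data following supplementary Sec. D.
# ASSUMED parameters: coupling of 1.0 for x1 (strong) and 0.5 for x2, x3
# (weak); x4/x5, x6/x7, x8/x9 are assumed children of x1, x2, x3.
y = rng.binomial(1, 0.5, n)             # binary root node
x1 = rng.normal(loc=1.0 * y, scale=1.0)
x2 = rng.normal(loc=0.5 * y, scale=1.0)
x3 = rng.normal(loc=0.5 * y, scale=1.0)
x4 = rng.normal(loc=x1, scale=1.0)      # assumed parent: x1
x5 = rng.normal(loc=x1, scale=1.0)
x6 = rng.normal(loc=x2, scale=1.0)      # assumed parent: x2
x7 = rng.normal(loc=x2, scale=1.0)
x8 = rng.normal(loc=x3, scale=1.0)      # assumed parent: x3
x9 = rng.normal(loc=x3, scale=1.0)
X = np.column_stack([x1, x2, x3, x4, x5, x6, x7, x8, x9])
# X can then be discretized (or handled with kernel density estimates)
# and fed to the forward selection procedure of Algorithm 1.
```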