Variational Information Maximization for Feature Selection
Shuyang Gao Greg Ver Steeg Aram Galstyan
University of Southern California, Information Sciences Institute
[email protected], [email protected], [email protected]

Abstract
Feature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection exists which is based on maximizing mutual information between subsets of features and class labels. Practical methods are forced to rely on approximations due to the difficulty of estimating mutual information. We demonstrate that approximations made by existing methods are based on unrealistic assumptions. We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds for mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with a proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.
1 Introduction

Feature selection is one of the fundamental problems in machine learning research [1, 2]. Many problems include a large number of features that are either irrelevant or redundant for the task at hand. In these cases, it is often advantageous to pick a smaller subset of features to avoid over-fitting, to speed up computation, or simply to improve the interpretability of the results.

Feature selection approaches are usually categorized into three groups: wrapper, embedded and filter [3, 4, 5]. The first two, wrapper and embedded methods, are considered classifier-dependent, i.e., the selection of features depends in some way on the classifier being used. Filter methods, on the other hand, are classifier-independent and define a scoring function between features and labels in the selection process.

Because filter methods may be employed in conjunction with a wide variety of classifiers, it is important that their scoring function be as general as possible. Since mutual information (MI) is a general measure of dependence with several unique properties [6], many MI-based scoring functions have been proposed as filter methods [7, 8, 9, 10, 11, 12]; see [5] for an exhaustive list.

Owing to the difficulty of estimating mutual information in high dimensions, most existing MI-based feature selection methods rely on various low-order approximations of mutual information. While those approximations have been successful in certain applications, they are heuristic in nature and lack theoretical guarantees. In fact, as we demonstrate below (Sec. 2.2), a large family of approximate methods is based on two assumptions that are mutually inconsistent.

To address these shortcomings, in this paper we introduce a novel feature selection method based on a variational lower bound on mutual information; a similar bound was previously studied within the Infomax learning framework [13]. We show that instead of maximizing the mutual information, which is intractable in high dimensions (hence the introduction of many heuristics), we can maximize a lower bound on the MI with a proper choice of tractable variational distributions. We use this lower bound to define an objective function and derive a forward feature selection algorithm. We provide a rigorous proof that the forward feature selection is optimal under tree graphical models by choosing an appropriate variational distribution. This is in contrast with previous information-theoretic feature selection methods, which lack any performance guarantees. We also conduct empirical validation on various datasets and demonstrate that the proposed approach outperforms state-of-the-art information-theoretic feature selection methods.

In Sec. 2 we introduce general MI-based feature selection methods and discuss their limitations. Sec. 3 introduces the variational lower bound on mutual information and proposes two specific variational distributions. In Sec. 4, we report results from our experiments and compare the proposed approach with existing methods.

2 Information-Theoretic Feature Selection

Consider a supervised learning scenario where x = {x_1, x_2, ..., x_D} is a D-dimensional input feature vector and y is the output label. In filter methods, the mutual information-based feature selection task is to select T features x_{S*} = {x_{f_1}, x_{f_2}, ..., x_{f_T}} such that the mutual information between x_{S*} and y is maximized. Formally,

S* = argmax_S I(x_S : y)   s.t. |S| = T    (1)

where I(·) denotes the mutual information [6].
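To see why Eq. 1 is hard, consider solving it by brute force on a tiny discrete dataset. The sketch below (our illustration, not the paper's released code) uses a plug-in MI estimate and enumerates all C(D, T) subsets; the combinatorial blow-up of this search is what motivates the greedy strategy described next.

```python
import itertools
import numpy as np

def mi_discrete(xs, y):
    """Plug-in estimate of I(x_S : y) for integer-coded discrete data.
    xs: (N, k) array of selected feature columns; y: (N,) labels."""
    states = [tuple(row) for row in xs]
    n = len(y)
    p_xy, p_x, p_y = {}, {}, {}
    for s, label in zip(states, y):
        p_xy[(s, label)] = p_xy.get((s, label), 0) + 1.0 / n
        p_x[s] = p_x.get(s, 0) + 1.0 / n
        p_y[label] = p_y.get(label, 0) + 1.0 / n
    return sum(p * np.log(p / (p_x[s] * p_y[l])) for (s, l), p in p_xy.items())

def exhaustive_selection(X, y, T):
    """Solve Eq. 1 by brute force: C(D, T) subsets, tractable only for tiny D."""
    D = X.shape[1]
    return max(itertools.combinations(range(D), T),
               key=lambda S: mi_discrete(X[:, list(S)], y))
```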
2.1 Forward Sequential Feature Selection

Maximizing the objective function in Eq. 1 is generally NP-hard. Many MI-based feature selection methods therefore adopt a greedy approach, where features are selected incrementally, one feature at a time. Let S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}} be the selected feature set after time step t−1. According to the greedy method, the next feature f_t at step t is selected such that

f_t = argmax_{i ∉ S_{t-1}} I(x_{S_{t-1} ∪ i} : y)    (2)

where x_{S_{t-1} ∪ i} denotes x's projection onto the feature space S_{t-1} ∪ i. As shown in [5], the mutual information term in Eq. 2 can be decomposed as

I(x_{S_{t-1} ∪ i} : y)
  = I(x_{S_{t-1}} : y) + I(x_i : y | x_{S_{t-1}})
  = I(x_{S_{t-1}} : y) + I(x_i : y) − I(x_i : x_{S_{t-1}}) + I(x_i : x_{S_{t-1}} | y)
  = I(x_{S_{t-1}} : y) + I(x_i : y) − (H(x_{S_{t-1}}) − H(x_{S_{t-1}} | x_i)) + (H(x_{S_{t-1}} | y) − H(x_{S_{t-1}} | x_i, y))    (3)

where H(·) denotes the entropy [6]. Omitting the terms in Eq. 3 that do not depend on x_i, we can rewrite Eq. 2 as follows:

f_t = argmax_{i ∉ S_{t-1}} { I(x_i : y) + H(x_{S_{t-1}} | x_i) − H(x_{S_{t-1}} | x_i, y) }    (4)

The greedy learning algorithm has been analyzed in [14].

2.2 Limitations of Previous MI-Based Feature Selection Methods

Estimating high-dimensional information-theoretic quantities is a difficult task. Therefore, most MI-based feature selection methods propose low-order approximations to H(x_{S_{t-1}} | x_i) and H(x_{S_{t-1}} | x_i, y) in Eq. 4. A general family of methods relies on the following approximations [5]:

H(x_{S_{t-1}} | x_i) ≈ Σ_{k=1}^{t-1} H(x_{f_k} | x_i),   H(x_{S_{t-1}} | x_i, y) ≈ Σ_{k=1}^{t-1} H(x_{f_k} | x_i, y)    (5)

The approximations in Eq. 5 become exact under the following two assumptions [5]:

Assumption 1 (Feature Independence Assumption). p(x_{S_{t-1}} | x_i) = Π_{k=1}^{t-1} p(x_{f_k} | x_i)

Assumption 2 (Class-Conditioned Independence Assumption). p(x_{S_{t-1}} | x_i, y) = Π_{k=1}^{t-1} p(x_{f_k} | x_i, y)

Assumption 1 and Assumption 2 mean that the selected features are independent and class-conditionally independent, respectively, given the unselected feature x_i under consideration.
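Since H(x_{f_k} | x_i) − H(x_{f_k} | x_i, y) = I(x_{f_k} : y | x_i), substituting Eq. 5 into Eq. 4 gives the generic low-order criterion score(i) = I(x_i : y) + Σ_k I(x_{f_k} : y | x_i). The following is a minimal sketch of greedy selection under this criterion for integer-coded discrete data (our illustration; the named methods [8, 9, 10, 15, 16] each use their own variant of such a score):

```python
import numpy as np

def entropy(cols):
    """Plug-in entropy of the joint distribution of the given (N, k) columns."""
    _, counts = np.unique(cols, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def cond_mi(a, b, c):
    """I(a : b | c) = H(a,c) + H(b,c) - H(a,b,c) - H(c), all plug-in estimates."""
    col = lambda *vs: np.column_stack(vs)
    return (entropy(col(a, c)) + entropy(col(b, c))
            - entropy(col(a, b, c)) - entropy(c.reshape(-1, 1)))

def greedy_low_order(X, y, T):
    """Forward selection with the Eq. 4 objective under the Eq. 5 approximations:
    score(i) = I(x_i : y) + sum_k I(x_{f_k} : y | x_i)."""
    N, D = X.shape
    S = []
    for _ in range(T):
        best, best_score = None, -np.inf
        for i in set(range(D)) - set(S):
            # I(x_i : y), computed as conditional MI given a constant
            score = cond_mi(X[:, i], y, np.zeros(N, int))
            score += sum(cond_mi(X[:, k], y, X[:, i]) for k in S)
            if score > best_score:
                best, best_score = i, score
        S.append(best)
    return S
```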
Figure 1: The first two graphical models show the assumptions of traditional MI-based feature selection methods; the third graphical model shows a scenario when both Assumption 1 and Assumption 2 are true. (Panel titles: Assumption 1; Assumption 2; Satisfying both Assumption 1 and Assumption 2.) A dashed line indicates that there may or may not be a correlation between two variables.

We now demonstrate that the two assumptions cannot be valid simultaneously unless the data has a very specific (and unrealistic) structure. Indeed, consider the graphical models consistent with either assumption, as illustrated in Fig. 1.
If Assumption 1 holds true, then x_i is the only common cause of the previously selected features S_{t-1} = {x_{f_1}, x_{f_2}, ..., x_{f_{t-1}}}, so that those features become independent when conditioned on x_i. On the other hand, if Assumption 2 holds, then the features depend both on x_i and on the class label y; therefore, generally speaking, the distribution over those features does not factorize by solely conditioning on x_i: there will be remnant dependencies due to y. Thus, if Assumption 2 is true, then Assumption 1 cannot be true in general, unless the data is generated according to the very specific model shown in the rightmost panel of Fig. 1. Note, however, that in this case x_i becomes the most important feature, because I(x_i : y) > I(x_{S_{t-1}} : y); then we should have selected x_i at the very first step, contradicting the feature selection process.

As we mentioned above, most existing methods implicitly or explicitly adopt both assumptions or their stronger versions, as shown in [5], including mutual information maximization (MIM) [15], joint mutual information (JMI) [8], conditional mutual information maximization (CMIM) [9], maximum relevance minimum redundancy (mRMR) [10], and conditional infomax feature extraction (CIFE) [16], among others. Approaches based on global optimization of mutual information, such as quadratic programming feature selection (QPFS) [11] and the state-of-the-art conditional mutual information-based spectral method (SPEC_CMI) [12], are derived from the previous greedy methods and therefore also implicitly rely on those two assumptions.
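As a sanity check on this argument, the following self-contained simulation (ours, not from the paper) constructs data in which Assumption 2 holds exactly by design; conditioning on x_i alone then leaves a clear residual dependence between the selected features, so Assumption 1 necessarily fails.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.integers(0, 2, n)    # class label
xi = rng.integers(0, 2, n)   # candidate feature under consideration

# Selected features depend on BOTH xi and y, with independent noise flips,
# so they are conditionally independent given (xi, y): Assumption 2 holds.
f1 = xi ^ y ^ (rng.random(n) < 0.1)
f2 = xi ^ y ^ (rng.random(n) < 0.1)

# Assumption 2: within each (xi, y) cell, f1 and f2 are uncorrelated (~0).
for a in (0, 1):
    for b in (0, 1):
        m = (xi == a) & (y == b)
        print("corr | xi,y:", np.corrcoef(f1[m], f2[m])[0, 1])

# Assumption 1 fails: conditioning on xi alone leaves dependence through y,
# so the correlation is clearly positive (about 0.64 for these parameters).
for a in (0, 1):
    m = xi == a
    print("corr | xi:", np.corrcoef(f1[m], f2[m])[0, 1])
```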
In the next section we address these issues by introducing a novel information-theoretic framework for feature selection. Instead of estimating mutual information and making mutually inconsistent assumptions, our framework formulates a tractable variational lower bound on mutual information, which allows a more flexible and general class of assumptions via appropriate choices of variational distributions.

3 Method

3.1 Variational Mutual Information Lower Bound

Let p(x, y) be the joint distribution of input (x) and output (y) variables. Barber & Agakov [13] derived the following lower bound for the mutual information I(x : y) by using the non-negativity of the KL-divergence, i.e., Σ_x p(x|y) log( p(x|y) / q(x|y) ) ≥ 0, which gives

I(x : y) ≥ H(x) + ⟨ln q(x|y)⟩_{p(x,y)}    (6)

where the angled brackets represent averages and q(x|y) is an arbitrary variational distribution. This bound becomes exact if q(x|y) ≡ p(x|y).

It is worthwhile to note that, in the context of unsupervised representation learning, p(y|x) and q(x|y) can be viewed as an encoder and a decoder, respectively. In that case, y is learned by maximizing the lower bound in Eq. 6, iteratively adjusting the parameters of the encoder and decoder, as in [13, 17].

3.2 Variational Information Maximization for Feature Selection

Naturally, in terms of information-theoretic feature selection, we could also try to optimize the variational lower bound in Eq. 6 by choosing a subset of features S* in x, such that

S* = argmax_S { H(x_S) + ⟨ln q(x_S|y)⟩_{p(x_S,y)} }    (7)

However, the H(x_S) term on the RHS of Eq. 7 is still intractable when x_S is very high-dimensional. Nonetheless, noticing that the variable y is the class label, which is usually discrete, so that H(y) is fixed and tractable, by symmetry we switch x and y in Eq. 6 and rewrite the lower bound as follows:

I(x : y) ≥ H(y) + ⟨ln q(y|x)⟩_{p(x,y)} = ⟨ ln( q(y|x) / p(y) ) ⟩_{p(x,y)}    (8)

The equality in Eq. 8 is obtained by noticing that H(y) = ⟨−ln p(y)⟩_{p(y)}.

Using Eq. 8, the lower-bound-optimal subset S* of x becomes

S* = argmax_S { ⟨ ln( q(y|x_S) / p(y) ) ⟩_{p(x_S,y)} }    (9)

Here q(y|x_S) in Eq. 9 can be any distribution as long as it is normalized. We need to choose q(y|x_S) to be as general as possible while still keeping the term ⟨ln q(y|x_S)⟩_{p(x_S,y)} in Eq. 9 tractable. As a result, we set q(y|x_S) to

q(y|x_S) = q(x_S, y) / q(x_S) = q(x_S|y) p(y) / Σ_{y'} q(x_S|y') p(y')    (10)

One can verify that Eq. 10 is normalized even if q(x_S|y) is not normalized. If we further denote

q(x_S) = Σ_{y'} q(x_S|y') p(y')    (11)

then by combining Eqs. 9 and 10 we get

I(x_S : y) ≥ ⟨ ln( q(x_S|y) / q(x_S) ) ⟩_{p(x_S,y)} ≡ I_LB(x_S : y)    (12)
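To make Eqs. 10-12 concrete, here is a minimal Monte Carlo estimator of I_LB (our sketch; logq stands for the log of whatever variational family q(x_S|y) has been chosen, which per Eq. 10 need not be normalized):

```python
import numpy as np

def variational_mi_lower_bound(logq, y, p_y):
    """Estimate I_LB(x_S : y) of Eq. 12 from samples.

    logq: (N, L) array, logq[k, l] = ln q(x_S^{(k)} | y = y_l) under the
          chosen variational distribution (need not be normalized).
    y:    (N,) array of class indices in {0, ..., L-1}.
    p_y:  (L,) empirical class prior.
    """
    n = len(y)
    # ln q(x_S | y) evaluated at each sample's own label
    log_q_given_y = logq[np.arange(n), y]
    # ln q(x_S) = ln sum_y' q(x_S | y') p(y')  (Eq. 11), via log-sum-exp
    a = logq + np.log(p_y)
    m = a.max(axis=1)
    log_q_marg = m + np.log(np.exp(a - m[:, None]).sum(axis=1))
    # Monte Carlo average over the empirical joint p(x_S, y)
    return float((log_q_given_y - log_q_marg).mean())
```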
Auto-Regressive Decomposition. Now that q(y|x_S) is defined, all we need to do is model q(x_S|y) in Eq. 10; q(x_S) is then easy to compute from q(x_S|y) via Eq. 11. We decompose q(x_S|y) as an auto-regressive distribution, assuming T features in S:

q(x_S|y) = q(x_{f_1}|y) Π_{t=2}^{T} q(x_{f_t} | x_{f_{1:t-1}}, y)    (13)

The conditionals in Eq. 13 can then be restricted to tractable families. We consider two concrete choices: a Naive Bayes Q-distribution, which sets q(x_{f_t} | x_{f_{1:t-1}}, y) = p(x_{f_t} | y), and a Pairwise Q-distribution, which takes a geometric mean of the pairwise conditionals p(x_{f_t} | x_{f_k}, y) over the previously selected features (see supplementary Sec. A for the exact recursions). Forward feature selection then proceeds by greedily adding, at each step, the feature that maximizes the resulting lower bound; we refer to these two methods as VMI_naive and VMI_pairwise, respectively.

It is worth noting that the lower bound does not always increase at each step. A decrease in the lower bound at step t indicates that the Q-distribution approximates the underlying distribution worse than it did at the previous step t−1. In this case, the algorithm re-maximizes the lower bound from zero with only the remaining unselected features. We summarize the concrete implementation of our algorithms in supplementary Sec. A.

Time Complexity. Although our algorithm needs to calculate the distributions at each step, we only need to calculate the probability value at each sample point. For both VMI_naive and VMI_pairwise, the total computational complexity is O(NDT), where N is the number of samples, D the total number of features, and T the number of finally selected features. The detailed time analysis is left for supplementary Sec. A. As shown in Table 1, our methods VMI_naive and VMI_pairwise have the same time complexity as mRMR [10], while the state-of-the-art global optimization method SPEC_CMI [12] must precompute the pairwise mutual information matrix, which gives a time complexity of O(ND^2).

Table 1: Time complexity in terms of the number of features D, the number of selected features T, and the number of samples N.

Method      mRMR     VMI_naive   VMI_pairwise   SPEC_CMI
Complexity  O(NDT)   O(NDT)      O(NDT)         O(ND^2)

Optimality Under Tree Graphical Models. Although our method VMI_naive assumes a Naive Bayes model, we can prove that this method is still optimal if the data is generated according to a tree graphical model. Indeed, both of our methods, VMI_naive and VMI_pairwise, will always prioritize the first-layer features, as shown in Fig. 3. This optimality is summarized in Theorem B.1 in supplementary Sec. B.

4 Experiments

4.1 Synthetic Data

We begin with experiments on a synthetic model generated according to the tree structure illustrated in the left part of Fig. 3. The detailed data generating process is shown in supplementary Sec. D. The root node y is a binary variable, while the other variables are continuous. We use VMI_naive to optimize the lower bound I_LB(x_S : y); samples drawn from the model are used to fit the variational Q-distributions by kernel density estimation. We can see from the plot in the right part of Fig. 3 that our algorithm, VMI_naive, selects x_1, x_2, x_3 as the first three features, although x_2 and x_3 are only weakly correlated with y. If we continue to add the deeper-level features {x_4, ..., x_9}, the lower bound decreases. For comparison, we also report the mutual information between each single feature x_i and y in Table 2, which shows that the maximum relevance criterion [15], ranking features by I(x_i : y) alone, would choose a different set of top three features.

Figure 3: (Left) The generative model used for the synthetic experiments; edge thickness represents the strength of the relationship. (Right) Optimizing the lower bound by VMI_naive, showing the ground truth I(x_{S_t} : y) and the lower bound Î_LB(x_{S_t} : y). Variables under the blue line denote the features selected at each step; the dotted blue line shows the decreasing lower bound as more features are added. Ground-truth mutual information is obtained using N = 100,000 samples.
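For concreteness, here is a compact sketch of the full selection procedure for discrete features under the Naive Bayes Q-distribution, including the restart rule described above. This is our illustration rather than the authors' released implementation [24]; the smoothing pseudo-count alpha is our addition.

```python
import numpy as np

def vmi_naive(X, y, T, alpha=0.1):
    """Forward selection maximizing I_LB under the Naive Bayes Q-distribution,
    q(x_S | y) = prod_j p(x_j | y), for integer-coded discrete X."""
    N, D = X.shape
    classes, y_idx = np.unique(y, return_inverse=True)
    L = len(classes)
    p_y = np.bincount(y_idx, minlength=L) / N

    def log_p_xj_given_y(j):
        # Smoothed estimate of ln p(x_j^{(k)} | y_l) for every sample/class
        vals, v_idx = np.unique(X[:, j], return_inverse=True)
        counts = np.zeros((len(vals), L))
        np.add.at(counts, (v_idx, y_idx), 1.0)
        probs = (counts + alpha) / (counts.sum(axis=0) + alpha * len(vals))
        return np.log(probs[v_idx])                      # (N, L)

    def lb(logq):
        # I_LB of Eq. 12, with ln q(x_S) computed via log-sum-exp (Eq. 11)
        a = logq + np.log(p_y)
        m = a.max(axis=1)
        log_marg = m + np.log(np.exp(a - m[:, None]).sum(axis=1))
        return (logq[np.arange(N), y_idx] - log_marg).mean()

    F, remaining = [], set(range(D))
    logq, cur, fresh = np.zeros((N, L)), 0.0, True       # empty working set S
    while len(F) < T and remaining:
        scores = {i: lb(logq + log_p_xj_given_y(i)) for i in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= cur and not fresh:            # bound decreased:
            logq, cur, fresh = np.zeros((N, L)), 0.0, True   # clear S, keep F
            continue
        F.append(best); remaining.discard(best)
        logq, cur, fresh = logq + log_p_xj_given_y(best), scores[best], False
    return F
```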
Table 2: Mutual information I(x_i : y) between y and each feature x_i of the model in Fig. 3, estimated using N = 100,000 samples; the top three variables with the highest mutual information are highlighted in bold. [Numeric entries not recoverable.]

4.2 Real-World Data

We compare our algorithms VMI_naive and VMI_pairwise with other popular information-theoretic feature selection methods, including mRMR [10], JMI [8], MIM [15], CMIM [9], CIFE [16], and SPEC_CMI [12]. We use 17 well-known datasets from previous feature selection studies [5, 12] (all data are discretized); the dataset summaries are given in supplementary Sec. C. We use the average cross-validation error rate over the range of 10 to 100 selected features to compare the different algorithms, under the same setting as [12]. 10-fold cross-validation is employed for the larger datasets and leave-one-out cross-validation otherwise. The 3-Nearest-Neighbor classifier is used for Gisette and Madelon, following [5]; for the remaining datasets, the classifier is a linear SVM, following [11, 12].

The experimental results are shown in Table 3. Entries with * and ** indicate the best and the second-best performance, respectively (in terms of average error rate). We also use the paired t-test at the 5% significance level to test the hypothesis that VMI_naive or VMI_pairwise performs significantly better than the other methods, or vice versa. Overall, we find that both of our methods, VMI_naive and VMI_pairwise, strongly outperform the other methods, indicating that our variational feature selection framework is a promising addition to the current literature on information-theoretic feature selection.

Table 3: Average cross-validation error rate comparison of VMI against other methods. The last two lines indicate win (W) / tie (T) / loss (L) for VMI_naive and VMI_pairwise, respectively. Results for MIM and CIFE are omitted due to space limitations; the complete results are shown in supplementary Sec. C. Missing means are marked with –.

Dataset     mRMR         JMI          CMIM         SPEC_CMI     VMI_naive    VMI_pairwise
Lung        –±(4.7)**    –±(4.7)      11.4±(3.0)   11.6±(5.6)   –±(3.6)*     –±(6.0)
Colon       19.7±(2.6)   17.3±(3.0)   18.4±(2.6)   16.1±(2.0)   –±(2.7)*     –±(1.7)**
Leukemia    0.4±(0.7)    1.4±(1.2)    1.1±(2.0)    1.8±(1.3)    –±(0.1)*     –±(0.5)**
Lymphoma    5.6±(2.8)    6.6±(2.2)    8.6±(3.3)    12.0±(6.6)   –±(1.9)*     –±(3.1)**
Splice      –±(0.4)*     –±(0.5)**    –±(0.3)      –±(0.5)**    –±(0.5)**    –±(0.5)**
Landsat     19.5±(1.2)   18.9±(1.0)   19.1±(1.1)   21.0±(3.5)   –±(0.8)*     –±(1.0)**
Waveform    –±(0.5)*     –±(0.5)*     –±(0.7)      –±(0.6)**    –±(0.6)**    –±(0.5)*
KrVsKp      –±(0.7)**    –±(0.6)      5.3±(0.5)    –±(0.6)*     –±(0.5)      –±(0.7)**
Ionosphere  12.8±(0.9)   16.6±(1.6)   13.1±(0.8)   16.8±(1.6)   –±(1.9)**    –±(1.0)*
Semeion     23.4±(6.5)   24.8±(7.6)   16.3±(4.4)   26.0±(9.3)   –±(4.0)*     –±(3.9)**
Multifeat.  4.0±(1.6)    4.0±(1.6)    3.6±(1.2)    4.8±(3.0)    –±(1.1)*     –±(1.1)**
Optdigits   7.6±(3.3)    7.6±(3.2)    –±(3.4)**    –±(6.0)      –±(2.5)*     –±(3.6)
Musk2       –±(0.7)*     –±(0.7)      13.0±(1.0)   15.1±(1.8)   12.8±(0.6)   –±(0.5)**
Spambase    6.9±(0.7)    7.0±(0.8)    –±(0.7)**    –±(2.3)      –±(0.3)*     –±(0.3)*
Promoter    21.5±(2.8)   22.4±(4.0)   22.1±(2.9)   24.0±(3.7)   –±(3.9)**    –±(3.1)*
Gisette     5.5±(0.9)    5.9±(0.7)    5.1±(1.3)    7.1±(1.3)    –±(0.9)**    –±(0.8)*
Madelon     30.8±(3.8)   –±(2.6)*     –±(2.6)      –±(2.5)**    –±(2.7)      16.6±(2.9)
W/T/L (VMI_naive):     11/4/2   10/6/1   10/7/0   13/2/2
W/T/L (VMI_pairwise):  9/6/2    9/6/2    13/3/1   12/3/2

Figure 4: Number of selected features versus average cross-validation error on the Semeion and Gisette datasets, for mRMR, JMI, MIM, CMIM, CIFE, SPEC_CMI, VMI_naive, and VMI_pairwise.
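The evaluation protocol above can be sketched as follows (our rendering, assuming scikit-learn; the classifier is a linear SVM or 3-NN depending on the dataset, as stated in the text, and the 10-feature step grid over the 10-100 range is our choice):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def average_cv_error(X, y, ranking, classifier=None, cv=10):
    """Average cross-validation error over the top-10, 20, ..., 100 features
    of a fixed feature ranking, as produced by any selection method above.
    Use cv=10 for larger datasets; leave-one-out otherwise, per the text."""
    clf = classifier if classifier is not None else LinearSVC()
    errors = []
    for t in range(10, 101, 10):
        acc = cross_val_score(clf, X[:, ranking[:t]], y, cv=cv).mean()
        errors.append(1.0 - acc)
    return float(np.mean(errors))
```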
We also plot the average cross-validation error with respect to the number of selected features. Fig. 4 shows the two most distinguishable datasets, Semeion and Gisette; both of our methods, VMI_naive and VMI_pairwise, achieve lower error rates on these two datasets.

5 Related Work

There has been a significant amount of work on information-theoretic feature selection in the past twenty years: [5, 7, 8, 9, 10, 15, 11, 12, 20], to name a few. Most of these methods are based on combinations of so-called relevant, redundant and complementary information. Such combinations represent low-order approximations of mutual information derived from two assumptions, and, as shown above, it is unrealistic to expect both assumptions to be true. Inspired by group testing [21], more scalable feature selection methods have been developed, but this approach also requires the calculation of high-dimensional mutual information as a basic scoring function.

Estimating mutual information from data requires a large number of observations, especially when the dimensionality is high. The proposed variational lower bound can be viewed as a way of estimating the mutual information between a high-dimensional continuous variable and a discrete variable; only a few examples of this setting exist in the literature [22]. We hope our method will shed light on new ways to estimate mutual information, similar to estimating divergences in [23].

6 Conclusion

Feature selection has been a significant endeavor over the past decade. Mutual information gives a general basis for quantifying the informativeness of features; despite its conceptual clarity, it is difficult to estimate. While a large number of information-theoretic methods exist, they rely on mutually inconsistent assumptions about the underlying data distributions. We introduced a unifying variational mutual information lower bound to address these issues. We showed that, through an auto-regressive decomposition, feature selection can be done in a forward manner by progressively maximizing the lower bound. We also presented two concrete methods using the Naive Bayes and Pairwise Q-distributions, which strongly outperform the existing methods. VMI_naive assumes only a Naive Bayes model, yet even this simple model outperforms the existing information-theoretic methods, indicating the effectiveness of our variational information maximization framework. We hope that our framework will inspire new mathematically rigorous algorithms for information-theoretic feature selection, such as optimizing the variational lower bound globally and developing more powerful variational approaches for capturing complex dependencies.

References

[1] Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.
[2] Huan Liu and Hiroshi Motoda.
Feature Selection for Knowledge Discovery and Data Mining, volume 454. Springer Science & Business Media, 2012.
[3] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1):273–324, 1997.
[4] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
[5] Gavin Brown, Adam Pocock, Ming-Jie Zhao, and Mikel Luján. Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13(1):27–66, 2012.
[6] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[7] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. Neural Networks, IEEE Transactions on, 5(4):537–550, 1994.
[8] Howard Hua Yang and John E. Moody. Data visualization and feature selection: New algorithms for nongaussian data. In NIPS, volume 99, pages 687–693, 1999.
[9] François Fleuret. Fast binary feature selection with conditional mutual information. The Journal of Machine Learning Research, 5:1531–1555, 2004.
[10] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226–1238, 2005.
[11] Irene Rodriguez-Lujan, Ramon Huerta, Charles Elkan, and Carlos Santa Cruz. Quadratic programming feature selection. The Journal of Machine Learning Research, 11:1491–1516, 2010.
[12] Xuan Vinh Nguyen, Jeffrey Chan, Simone Romano, and James Bailey. Effective global approaches for mutual information based feature selection. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 512–521. ACM, 2014.
[13] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems 16, page 201. MIT Press, 2004.
[14] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1057–1064, 2011.
[15] David D. Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language, pages 212–217. Association for Computational Linguistics, 1992.
[16] Dahua Lin and Xiaoou Tang. Conditional infomax learning: an integrated framework for feature extraction and fusion. In Computer Vision–ECCV 2006, pages 68–82. Springer, 2006.
[17] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.
[18] Kiran S. Balagani and Vir V. Phoha. On the feature selection criterion based on an approximation of multidimensional mutual information. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7):1342–1343, 2010.
[19] Nguyen Xuan Vinh, Shuo Zhou, Jeffrey Chan, and James Bailey. Can high-order dependencies improve mutual information based feature selection? Pattern Recognition, 2015.
[20] Hongrong Cheng, Zhiguang Qin, Chaosheng Feng, Yong Wang, and Fagen Li. Conditional mutual information-based feature selection analyzing for synergy and redundancy.
ETRI Journal, 33(2):210–218, 2011.
[21] Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q. Ngo, Long Nguyen, Christopher Ré, and Venu Govindaraju. Parallel feature selection inspired by group testing. In Advances in Neural Information Processing Systems, pages 3554–3562, 2014.
[22] Brian C. Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 9(2):e87357, 2014.
[23] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. Information Theory, IEEE Transactions on, 56(11):5847–5861, 2010.
[24] Shuyang Gao. Variational feature selection code. http://github.com/BiuBiuBiLL/InfoFeatureSelection.
[25] Chris Ding and Hanchuan Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3(02):185–205, 2005.
[26] Kevin Bache and Moshe Lichman. UCI machine learning repository, 2013.

Supplementary Material for "Variational Information Maximization for Feature Selection"

A Detailed Algorithm for Variational Forward Feature Selection

We describe the detailed algorithm for our approach. We also provide open source code implementing VMI_naive and VMI_pairwise [24].

Concretely, suppose the class label y is discrete and takes L different values {y_1, y_2, ..., y_L}. We then define the distribution vector Q^(k)_t, of size L, for each sample (x^(k), y^(k)) at step t:

Q^(k)_t = [ q̂(x^(k)_{S_t} | y = y_1), ..., q̂(x^(k)_{S_t} | y = y_L) ]^T    (19)

where x^(k)_{S_t} denotes the projection of sample x^(k) onto the feature space x_{S_t}. We further denote by Y, of size L × 1, the distribution vector of y:

Y = [ p̂(y = y_1), p̂(y = y_2), ..., p̂(y = y_L) ]^T    (20)

Then we are able to rewrite q(x_{S_{t-1}}) and q(x_{S_{t-1}} | y) in terms of Q^(k)_{t-1} and Y, and substitute them into Î_LB(x_{S_{t-1}} : y). To illustrate, at step t−1 we have

Î_LB(x_{S_{t-1}} : y) = (1/N) Σ_{x^(k), y^(k)} log( p(x^(k)_{S_{t-1}} | y = y^(k)) ) − (1/N) Σ_k log( Y^T Q^(k)_{t-1} )    (21)

To select a feature i at step t, let us define the conditional distribution vector C^(k)_{i,t-1} for each feature i ∉ S_{t-1} and each sample (x^(k), y^(k)):

C^(k)_{i,t-1} = [ q(x^(k)_i | x^(k)_{S_{t-1}}, y = y_1), ..., q(x^(k)_i | x^(k)_{S_{t-1}}, y = y_L) ]^T    (22)

At step t, we use the previously stored C^(k)_{i,t-1} and Q^(k)_{t-1} to get

Î_LB(x_{S_{t-1} ∪ i} : y) = (1/N) Σ_{x^(k), y^(k)} log( p(x^(k)_{S_{t-1}} | y = y^(k)) p(x^(k)_i | x^(k)_{S_{t-1}}, y = y^(k)) ) − (1/N) Σ_k log( Y^T diag(Q^(k)_{t-1}) C^(k)_{i,t-1} )    (23)

We summarize our detailed implementation in Algorithm 1. Updating Q^(k)_t and C^(k)_{i,t} in Algorithm 1 may vary according to the different Q-distributions. But we can verify that, under the Naive Bayes Q-distribution or the Pairwise Q-distribution, Q^(k)_t and C^(k)_{i,t} can be obtained recursively from Q^(k)_{t-1} and C^(k)_{i,t-1}, by noticing that q(x_i | x_{S_t}, y) = p(x_i | y) for the Naive Bayes Q-distribution and q(x_i | x_{S_t}, y) = ( p(x_i | x_{f_t}, y) · q(x_i | x_{S_{t-1}}, y)^{t-1} )^{1/t} for the Pairwise Q-distribution.

Let N denote the number of samples, D the total number of features, T the number of selected features, and L the number of distinct values of the class variable y. The computational complexity of Algorithm 1 involves calculating the lower bound for each feature i at every step, which is O(NDL); updating C^(k)_{i,t} costs O(NDL) for the Pairwise Q-distribution and O(1) for the Naive Bayes Q-distribution; updating Q^(k)_t costs O(NDL). We need to select T features, and therefore the total time complexity is O(NDT), where we ignore L because the number of classes is usually much smaller than the other quantities.

Algorithm 1: Variational Forward Feature Selection (VMI)
Data: (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(N), y^(N))
Input: T ← number of features to select
Output: F ← final selected feature set
F ← ∅; S ← ∅; t ← 1
Initialize Q^(k)_0 and C^(k)_{i,0} for every feature i; calculate Y
while |F| < T do
    Î_LB(x_{S_{t-1} ∪ i} : y) ← Eq. 23 for each i ∉ F
    f_t ← argmax_{i ∉ S_{t-1}} Î_LB(x_{S_{t-1} ∪ i} : y)
    if Î_LB(x_{S_{t-1} ∪ f_t} : y) ≤ Î_LB(x_{S_{t-1}} : y) then
        Clear S; set t ← 1
    else
        F ← F ∪ f_t; S_t ← S_{t-1} ∪ f_t
        Update Q^(k)_t and C^(k)_{i,t}
        t ← t + 1
    end if
end while
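As an illustration of how Eq. 23 turns into array operations, the sketch below (our rendering, not the released implementation [24]) scores one candidate feature from the stored vectors; Q, C_i, Y, and y_idx are assumed to be numpy arrays holding Q^(k)_{t-1}, C^(k)_{i,t-1}, Y, and the per-sample class indices, respectively.

```python
import numpy as np

def score_candidate(Q, C_i, Y, y_idx):
    """Eq. 23: lower bound after adding feature i.

    Q:     (N, L) array, rows are Q^(k)_{t-1}.
    C_i:   (N, L) array, rows are C^(k)_{i,t-1}.
    Y:     (L,) class prior vector.
    y_idx: (N,) class index of each sample.
    """
    n = len(y_idx)
    QC = Q * C_i                          # q(x_S, x_i | y_l) per sample and class
    # First term: (1/N) sum_k ln q(x_{S ∪ i}^(k) | y = y^(k))
    first = np.log(QC[np.arange(n), y_idx]).mean()
    # Second term: (1/N) sum_k ln( Y^T diag(Q^(k)) C_i^(k) ) = ln(QC @ Y)
    second = np.log(QC @ Y).mean()
    return first - second
```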
If node x j that isnot in the first layer is selected at step t = 1 , denote ances ( x j ) as x j ’s ancestor in layer 1, then I ( x j : y ) = I ( ances ( x j ) : y ) which means that the information is not lost from ances ( x j ) → x j .In this case, one can always switch ances ( x j ) with x j and let x j be in the first layer, which doesnot conflict with the model assumption.Therefore, condition I and II are satisfied in step t = 1 .2) Assuming condition I and II are satisfied in step t , then we have the following argument in step t + 1 :We discuss the candidate nodes in three classes, and argue that nodes in Remaining-Layer 1 Class are always being selected. Redundant Class For any descendant desc ( S t ) of selected feature set S t , we have, I (cid:0) x S t ∪ desc ( S t ) : y (cid:1) = I ( x S t : y ) = I LB ( x S t : y ) (25)Eq. 25 comes from the fact that the desc ( S t ) carries no additional information about y other than S t . The second equality is by induction.Based on Eq. 12 and 25, we have, I LB (cid:0) x S t ∪ desc ( S t ) : y (cid:1) < I (cid:0) x S t ∪ desc ( S t ) : y (cid:1) = I ( x S t : y ) (26)We assume here that the LHS is strictly less than RHS in Eq. 26 without loss of generality. Thisis because if the equality holds, we have p (x S t | y ) p ( desc ( S t ) | y ) = p ( x t , desc ( S t ) | y ) due toTheorem 3.1. In this case, we can always rearrange desc ( S t ) to the first layer, which does notconflict with the model assumption.Note that by combining Eqs. 25 and 26, we can also get I LB (cid:0) x S t ∪ desc ( S t ) : y (cid:1) < I LB ( x S t : y ) (27)Eq. 27 means that adding a feature in Redundant Class will actually decrease the value of lowerbound I LB . Remaining-Layer1 Class For any other unselected node j of the first layer, i.e., j ∈ L \ S t , we have I ( x S t : y ) ≤ I ( x S t ∪ j : y ) = I LB ( x S t ∪ j : y ) (28)The inequality in Eq. 28 is obvious which comes from the data processing inequality [6]. And theequality in Eq. 28 comes directly from Theorem 3.1. Descendants-of-Remaining-Layer1 Class For any node desc ( j ) that is the descendant of j where j ∈ L \ S t , we have, I LB (cid:0) x S t ∪ desc ( j ) : y (cid:1) ≤ I (cid:0) x S t ∪ desc ( j ) : y (cid:1) I (cid:0) x S t ∪ desc ( j ) : y (cid:1) ≤ I ( x S t ∪ j : y ) (29)The second inequality of Ineq. 29 also comes from data processing inequality.12ombining Eqs. 26 and 28, we get, I LB (cid:0) x S t ∪ desc ( S t ) : y (cid:1) < I LB ( x S t ∪ j : y ) (30)Combining Eqs. 28 and 29, we get, I LB (cid:0) x S t ∪ desc ( j ) : y (cid:1) ≤ I LB ( x S t ∪ j : y ) (31)Ineq. 30 essentially tells us the forward feature selection will always choose Remaining-Layer1Class other than Redundant Class .Ineq. 31 is saying we are choosing Remaining-Layer1 Class other than Descendants-of-Remaining-Layer1 Class without loss of generality (for the equality concern, we can have the same argumentin step t = 1 ).Considering Ineqs. 30 and 31, in step t + 1 , the algorithm chooses node j in Remaining-Layer1Class , i.e., j ∈ L \ S t .Therefore, condition I and II hold at step t + 1 .At step t + 1 , if I LB ( x S t ∪ j : y ) = I LB ( x S t : y ) for any j ∈ L \ S t , that means I ( x S t ∪ j : y ) = I ( x S t : y ) . Then we have, I ( x S t : y ) = I ( x L : y ) = I ( x : y ) (32)The first equality in Eq. 32 holds because adding any j in L \ S t will not increase the mutualinformation. 
The second equality is due to the data processing inequality under the tree graphical model assumption. Therefore, if I_LB(x_{S_t ∪ j} : y) = I_LB(x_{S_t} : y) for every j ∈ L \ S_t, we set T = t. Combining Condition II and Eq. 32, we have

I_LB(x_{S_T} : y) = I(x_{S_T} : y) = I(x : y)    (33)

and Condition III holds.

C Datasets and Results

Table 4 summarizes the datasets used in the experiments, and Table 5 shows the complete results.

Table 4: Dataset summary. N: number of samples; d: number of features; L: number of classes.

Data        N     d     L   Source
Lung        73    325   20  [25]
Colon       62    2000  2   [25]
Leukemia    72    7070  2   [25]
Lymphoma    96    4026  9   [25]
Splice      3175  60    3   [26]
Landsat     6435  36    6   [26]
Waveform    5000  40    3   [26]
KrVsKp      3196  36    2   [26]
Ionosphere  351   34    2   [26]
Semeion     1593  256   10  [26]
Multifeat.  2000  649   10  [26]
Optdigits   3823  64    10  [26]
Musk2       6598  166   2   [26]
Spambase    4601  57    2   [26]
Promoter    106   57    2   [26]
Gisette     6000  5000  2   [4]
Madelon     2000  500   2   [4]

Table 5: Average cross-validation error rate comparison of VMI against all baseline methods (mRMR, JMI, MIM, CMIM, CIFE, SPEC_CMI, VMI_naive, VMI_pairwise); the last two lines indicate win (W) / tie (T) / loss (L) for VMI_naive and VMI_pairwise, respectively. [Numeric entries not recoverable.]

D Generating Synthetic Data

The synthetic tree-structured data used in the experiments are generated as follows. The root node y is drawn from a Bernoulli distribution. Each first-layer feature x_1, x_2, x_3 is drawn from a Gaussian with unit standard deviation whose mean is a scaled function of y, with x_1 coupled to y more strongly than x_2 and x_3. Each deeper-level feature in {x_4, ..., x_9} is drawn from a unit-variance Gaussian whose mean equals the value of its first-layer parent.
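For completeness, here is a generator matching the description above. The coupling coefficients and the parent assignments of x_4 through x_9 below are illustrative assumptions rather than the paper's exact values, as is the sample count.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                              # sample count: our choice

# Tree-structured synthetic data following supplementary Sec. D.
# ASSUMED parameters: coupling of 1.0 for x1 (strong) and 0.5 for x2, x3
# (weak); x4/x5, x6/x7, x8/x9 are assumed children of x1, x2, x3.
y = rng.binomial(1, 0.5, n)             # binary root node
x1 = rng.normal(loc=1.0 * y, scale=1.0)
x2 = rng.normal(loc=0.5 * y, scale=1.0)
x3 = rng.normal(loc=0.5 * y, scale=1.0)
x4 = rng.normal(loc=x1, scale=1.0)      # assumed parent: x1
x5 = rng.normal(loc=x1, scale=1.0)
x6 = rng.normal(loc=x2, scale=1.0)      # assumed parent: x2
x7 = rng.normal(loc=x2, scale=1.0)
x8 = rng.normal(loc=x3, scale=1.0)      # assumed parent: x3
x9 = rng.normal(loc=x3, scale=1.0)
X = np.column_stack([x1, x2, x3, x4, x5, x6, x7, x8, x9])
# X can then be discretized (or handled with kernel density estimates)
# and fed to the forward selection procedure of Algorithm 1.
```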