Target-Focused Feature Selection Using a Bayesian Approach
Orpaz Goldstein, Mohammad Kachuee, Kimmo Karkkainen, Majid Sarrafzadeh
Orpaz Goldstein
Department of Computer Science, University of California, Los Angeles
[email protected]

Mohammad Kachuee
Department of Computer Science, University of California, Los Angeles
[email protected]

Kimmo Karkkainen
Department of Computer Science, University of California, Los Angeles
[email protected]

Majid Sarrafzadeh
Department of Computer Science, University of California, Los Angeles
[email protected]
Abstract
In many real-world scenarios where data is high dimensional, test time acquisition of features is a non-trivial task due to costs associated with feature acquisition and evaluating feature value. The need for highly confident models with an extremely frugal acquisition of features can be addressed by allowing a feature selection method to become target aware. We introduce an approach to feature selection that is based on Bayesian learning, allowing us to report target-specific levels of uncertainty, false positive, and false negative rates. In addition, measuring uncertainty lifts the restriction on feature selection being target agnostic, allowing for feature acquisition based on a single target of focus out of many. We show that acquiring features for a specific target is at least as good as common linear feature selection approaches for small non-sparse datasets, and surpasses these when faced with real-world healthcare data that is larger in scale and sparser.
1 Introduction

As big data becomes ubiquitous, so does the increase in the dimensionality of data. As the number of available features grows, feature selection becomes a necessary tool in the evaluation and acquisition of features [19], and in turn in the training of learning models. This is increasingly true in the healthcare domain, where data is accumulated and under-utilized [11; 25]. Moreover, in the healthcare domain, both the budget for features and model uncertainty should be taken into account for a feature selection model to be practical. Since in many cases our main target of interest is a minority target, we would rather focus on reducing the uncertainty of a specific target of interest than the general uncertainty, while maintaining a budget for features. For example, two types of heart disease might display similar symptoms, but we would rather focus our resources on understanding whether the patient has the less common disease, which is more fatal.

Classic approaches to feature selection focus on maximizing information gain and inferring feature relevance [2; 12]. Health informatics methods of feature selection take into account real-world costs associated with the acquisition of features and the need to maintain a budget. Costs of tests, physician time, and patient discomfort should all be taken into account when reasoning on which feature is to be acquired, by using cost-sensitive decision methods or active sensing [8; 26]. In addition to costs, changes in medical data availability might call for iterative feature aggregation at training time, requiring an online cost-sensitive budgeted approach [15].
Preprint. Under review.

Healthcare data tends to be imbalanced: some conditions or variants of a disease are more common, and some targets carry more significance or are more relevant to a specific diagnosis. Acquisition of relevant data is made possible using an active learning approach [22]. Contributing to the imbalance is also the sparseness of the data: due to its high dimensionality, not all data points will have all features. For medical domain feature selection and prediction, ensemble methods have been used to reduce the effects of imbalanced data and of inherent missingness [13; 20], and more recently a robust feature selection framework has been proposed [27]. While addressing the imbalance in data is closely related to our work, the acquisition of features that are germane to a specific target of focus is not addressed.

Uncertainty measurement in a machine learning model flows from applying a probabilistic approach to learning, also known as Bayesian learning. Sampling a trained probabilistic model for latent variables allows us to capture the inherent uncertainty in the model. The usage of Gaussian weight distributions to estimate uncertainty was first discussed in [6]; later work includes [3; 21; 10] and many more. An application of uncertainty to feature selection robustness appears in Same-Decision Probability (SDP) [4], which measures the effect of feature acquisition on the shift in the decision making of a model. SDP measures the uncertainty in the model while acquiring features, and reasons on stopping criteria based on a threshold of confidence and budget. More recently, an expected SDP query and an optimal feature selection algorithm based on SDP were proposed [5]. SDP queries are generally PP^PP-complete, which makes them costly for many high dimensional real-world applications.

In this paper, we propose a novel probabilistic uncertainty-based method for target-specific feature acquisition. Our first contribution is a method to focus resources on a single target of interest that is generalizable, scalable, and consistent in selecting informative features for a specific single target of interest out of many. Our second contribution is a polynomial time, threshold-based method, allowing us to reason on model confidence in predictions while learning a representation of the data, and to decide whether to ask for more data or declare readiness to start making predictions on real-world scenarios.
In order to capture uncertainty in a model, we need to learn a representation of a latent distribution over a set of parameters defining that distribution, and be able to sample the learned parameters in order to associate the captured uncertainty with test time examples of the data and targets. Our optimization function, therefore, takes a probabilistic approach.

Using Variational Inference, we estimate λ* using the Kullback-Leibler (KL) divergence such that:

λ* = argmin_λ KL( q(z; λ) || p(z | X) ),   (1)

where q(z; λ) is the estimate of the posterior distribution p(z | X), optimized over parameters λ. Since the posterior p(z | X) is unknown to us, we resort to maximizing the Evidence Lower Bound (ELBO) as an optimization function:

ELBO(λ) = E_{q(z; λ)} [ log p(X, z) − log q(z; λ) ],   (2)

which is equivalent to minimizing the KL divergence [14; 3; 17]. Gradient optimization of the ELBO is done via the reparameterization trick [17]:

∇_λ ELBO(λ) ≈ (1/S) Σ_{s=1}^{S} [ ∇_λ ( log p(X, z(ε_s; λ)) − log q(z(ε_s; λ); λ) ) ],   (3)

where S is the number of samples drawn.

2 Target-Focused Feature Selection
Using a minimal number of features, our goal is to achieve reasonable confidence for a specific class, as described in our objective function:

argmax_{FS} ( confidence_θ − Σ_{f_i ∈ FS} v_i ),   (4)

subject to: |FS| < β,

such that FS is the set of acquired features we wish to minimize, |FS| is the cardinality of the set, and v_i is the value associated with each feature. The objective is to frugally acquire the most valuable features while achieving maximum confidence in a specific class θ, without exhausting our budget for features β.

Evaluation of features per target considers the contribution of each feature towards minimizing uncertainty for our target of interest, jointly evaluated with the features already acquired. In addition to confidence scores, feature vectors are scored for their cosine similarity as well as their Hamming weight scores, in order to gauge the potential information gain from a candidate feature. In order to use
the ELBO as our optimization function, we model the linear regression case in which our λ contains the input X, a single layer of weights W, and a bias b, such that λ = (W, b, X). Here X ∈ R^{c×r}, W ∈ R^{c×d}, b ∈ R^{d}. X has r data points and c features, and the model learns the distribution over d targets. Assuming independence given our parameters:

p(z | W, b, X) = Π_{n=1}^{r} p(z_n | X_n^T W + b, σ_z),   (5)

where z is the ELBO-optimized posterior estimate. We define the priors on both parameters to be the standard normal distribution.
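To make Equations (3) and (5) concrete, the following is a minimal, dependency-free sketch (not the authors' implementation): a one-feature Bayesian linear regression with standard-normal priors, where the means of a fixed-width Gaussian q(w, b) are fit by stochastic gradient ascent on the ELBO via reparameterized samples. Gradients of the log-joint are taken numerically, and the step size, sample counts, and noise scale are all illustrative choices.

```python
import random

def log_joint(w, b, xs, ys, sigma_y=0.5):
    """log p(y, w, b | x) up to constants: standard-normal priors on w and b
    plus a Gaussian likelihood, as in Eq. (5)."""
    lp = -0.5 * (w * w + b * b)  # log prior
    for x, y in zip(xs, ys):
        lp += -0.5 * ((y - (w * x + b)) / sigma_y) ** 2
    return lp

def fit(xs, ys, steps=2000, lr=0.01, n_samples=8, q_std=0.05):
    """Learn the means of a factorized Gaussian q(w, b) by stochastic gradient
    ascent on the ELBO, using the reparameterization trick of Eq. (3).
    With q_std held fixed, the entropy term of the ELBO is constant and
    drops out of the gradient."""
    mu_w, mu_b = 0.0, 0.0
    for _ in range(steps):
        g_w = g_b = 0.0
        for _ in range(n_samples):
            w = mu_w + q_std * random.gauss(0, 1)  # reparameterized draws
            b = mu_b + q_std * random.gauss(0, 1)
            # Numeric gradient of the log-joint at the sampled point;
            # chain rule gives d(sample)/d(mu) = 1.
            eps = 1e-4
            g_w += (log_joint(w + eps, b, xs, ys) - log_joint(w - eps, b, xs, ys)) / (2 * eps)
            g_b += (log_joint(w, b + eps, xs, ys) - log_joint(w, b - eps, xs, ys)) / (2 * eps)
        mu_w += lr * g_w / n_samples
        mu_b += lr * g_b / n_samples
    return mu_w, mu_b

random.seed(0)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated by y = 2x + 1
mu_w, mu_b = fit(xs, ys)
# mu_w approaches 2 and mu_b approaches 1, slightly shrunk toward the prior.
```

Predictive uncertainty then comes from drawing w ~ q and b ~ q and reading off the spread of w·x + b across draws, mirroring the sampling used in the next section.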
For each feature not already in our feature set, f_i ∉ FS, a model estimating ELBO(λ) is trained on f_i ∪ FS. Once trained, each feature is scored on its contribution to model confidence in predicting a specific target on a validation set, in addition to the cosine similarity and co-variance scores between the feature f_i and all the features already in FS. We then select a single feature f_i ∉ FS to be aggregated together with the features already in FS, based on the scores received in the previous step. With each feature added, a new model is trained, trying to estimate the latent target variables on a previously unseen test set. We continue aggregating features until we reach a stopping condition or exhaust our budget, as described in Algorithm 1. Our algorithm runs in O(n) time; a complete time complexity analysis is provided in the supplemental material.

Our available data is split into a training set X_train and a testing set X_test. To obtain our input X, we sample the training data X_train in a balanced way. For example, if we are trying to predict 3 targets, then X will have one third of its data points correspond to each of our targets, regardless of the original distribution. In order to generate a validation input dataset X', we sample X_train according to its original distribution (no balancing).

At each iteration, a subset of all available features, f_i ∪ FS, is trained to learn λ = (W, b, X). Once trained, we score the feature subset on the validation set X' by measuring the effect acquired features had on per-target uncertainty. Using our learned distribution, we sample each of our parameters such that W' ∼ W, b' ∼ b, and calculate the probability vector:

prob = X'^T W' + b',   (6)

Algorithm 1: Target-Focused Feature Selection
Input: β: budget; FPT, FNT, CT: thresholds; X: train set; X': validation set; X_test: test set; y: targets for X, X', and X_test; F: {f_1, f_2, ..., f_n}, set of available features
Parameters: FS ← {}: features selected; M ← model optimizing ELBO(λ); Val ← function for computing v_i, the value for feature i; FP_θ, FN_θ, confidence_θ: current false positive, false negative, and confidence for the specific target θ
Output: FS ⊆ F within budget β

while |FS| < β and FP_θ > FPT and FN_θ > FNT and confidence_θ < CT do
    for f_i ∈ F \ FS do
        Train M(X, FS ∪ f_i, y)
        v_i ← Val(M(X', FS ∪ f_i, y), FS, f_i)
    end for
    FS ← FS ∪ {f_i | i = argmax_i v_i}
    FP_θ, FN_θ, confidence_θ ← M(X_test, FS, y)
end while
return FS

where prob ∈ R^{r×d} has the probability of each data point belonging to each possible target. We then obtain the prediction vector by calculating the softmax for each prob_i:

y_i = argmax( softmax(prob_i) ),  where softmax(prob_i)_d = exp(prob_{i,d}) / Σ_{d'} exp(prob_{i,d'}),   (7)

Next, we evaluate precision, represented by the fraction of times that a prediction y_{θ,i} corresponding to target θ was equal to the correct target at position i. Note that y_θ ⊆ y, and it is of subset size |y_θ|:

precision_θ = (1/|y_θ|) Σ_{i=1}^{|y_θ|} 1(y_{θ,i} = θ),   (8)

where 1(y_{θ,i} = θ) equals 1 if data point y_{θ,i} has the target value θ, and 0 otherwise. Repeating (6)-(8) for l iterations, sampling the distribution of our parameters each time, our confidence score becomes the precision averaged over multiple iterations. Therefore, the confidence for a specific target is:

confidence_θ = (1/l) Σ_{j=1}^{l} precision_{θ,j},   (9)

where l is the number of times we sample our learned distributions. The trade-off in choosing l is between a more accurate representation of the model confidence and a faster model. We have found that 300 iterations were accurate enough in reporting confidence in our case.
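Equations (6)-(9) can be sketched in a few lines of pure Python. This is one plausible reading, not the authors' code: we treat Eq. (8) as the precision for the focus target θ among points predicted as θ, and average it over posterior draws of the parameters.

```python
import math

def softmax(v):
    """Numerically stable softmax over one row of logits (Eq. 7)."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def precision_for_target(preds, labels, theta):
    """Eq. (8) under our reading: among points predicted as theta,
    the fraction whose true label is theta."""
    hits = [(p, y) for p, y in zip(preds, labels) if p == theta]
    if not hits:
        return 0.0
    return sum(1 for p, y in hits if y == theta) / len(hits)

def confidence(prob_draws, labels, theta):
    """Eq. (9): precision for theta averaged over l posterior draws,
    where each draw is a full matrix of logits X'^T W' + b' (Eq. 6)."""
    scores = []
    for probs in prob_draws:
        preds = []
        for row in probs:
            sm = softmax(row)
            preds.append(max(range(len(sm)), key=sm.__getitem__))
        scores.append(precision_for_target(preds, labels, theta))
    return sum(scores) / len(scores)

labels = [0, 0, 1]
draw = [[2.0, 0.0], [1.5, 0.2], [0.0, 2.0]]  # hypothetical logits for one draw
conf = confidence([draw, draw], labels, theta=0)
print(conf)  # 1.0: both draws predict target 0 only where it is correct
```

In the paper's setting each draw would come from sampling W' ~ W and b' ~ b, with l on the order of 300.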
In addition to the confidence scores, we wish to capture the potential information gain of the current candidate feature f_i given the existing features in FS. We use two computed similarity scores: a co-variance distance score and a cosine similarity score. We sum the inverse scores over all such pairwise comparisons and then normalize to the range [0, 1]:

CovScore = N_{0,1}( Σ_{g_i ∈ FS} 1 − cov(g_i, f_i) ),   (10)

CosScore = N_{0,1}( Σ_{g_i ∈ FS} 1 − cos(g_i, f_i) ).   (11)

CovScore and CosScore are the summed inverse co-variance distances and cosine similarities, transferred to the [0, 1] range by applying the normalization N_{0,1}. Our final feature value for the current feature f_i is then

v_i = ω_1 · confidence_θ + ω_2 · CovScore + ω_3 · CosScore,   (12)

where ω_1, ω_2, ω_3 are hyperparameters. Once all features f_i ∉ FS have been scored and evaluated for their contribution towards class θ as part of the set FS, we append the single feature that maximized v_i to the set FS.

Here we provide an empirical evaluation of our target-focused method (TF), compared with prevalent linear feature selection techniques.
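A small sketch of the scoring in Equations (10)-(12), under our reading that the "inverse" similarity terms are summed as (1 − score) and then min-max normalized across candidates. The ω weights and the toy vectors below are arbitrary illustrative choices, not the paper's values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two raw feature columns."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def covariance(u, v):
    """Population covariance between two feature columns."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def normalize01(scores):
    """Min-max normalization N_{0,1} across the candidate set."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def feature_values(candidates, selected, confidences, w1, w2, w3):
    """Eq. (12): v_i = w1*confidence_theta + w2*CovScore + w3*CosScore,
    with CovScore/CosScore the normalized sums of (1 - similarity)
    against already-selected features (Eqs. 10-11)."""
    cov = normalize01([sum(1 - covariance(c, g) for g in selected) for c in candidates])
    cos = normalize01([sum(1 - cosine(c, g) for g in selected) for c in candidates])
    return [w1 * cf + w2 * cv + w3 * cs for cf, cv, cs in zip(confidences, cov, cos)]

selected = [[1.0, 2.0, 3.0]]                     # one feature already acquired
candidates = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]  # an exact clone vs. a dissimilar column
vals = feature_values(candidates, selected, [0.6, 0.5], w1=0.8, w2=0.1, w3=0.1)
best = max(range(len(vals)), key=vals.__getitem__)
print(vals, best)  # the dissimilar candidate wins despite lower confidence
```

The diversity terms let a slightly less confident but non-redundant feature overtake a clone of something already in FS, which is the intended behavior of Eqs. (10)-(11).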
Mutual Information (MI) estimates statistical dependency for feature selection [18], and is widely used as a non-parametric approach to evaluating data dependencies. The MI approach works by estimating the correlation level based on entropy from k-nearest neighbor distances.
Max-relevance min-redundancy (mRMR) [23] is a first-order incremental feature selection method based on Mutual Information that eliminates redundancy among features while selecting relevant ones.
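The greedy mRMR idea can be illustrated with a toy sketch using a plug-in discrete MI estimator (the "MID" difference criterion; the synthetic data below is constructed so that a redundant duplicate loses to a non-redundant feature):

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Plug-in estimate of discrete mutual information I(X; Y) in nats."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr(features, target, k):
    """Greedy mRMR ('MID'): repeatedly pick the feature maximizing
    I(f; target) minus its mean MI with already-selected features."""
    selected = []
    while len(selected) < k:
        def score(name):
            relevance = mutual_info(features[name], target)
            redundancy = (sum(mutual_info(features[name], features[g])
                              for g in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        remaining = [f for f in features if f not in selected]
        selected.append(max(remaining, key=score))
    return selected

target = [0, 0, 0, 0, 1, 1, 1, 1]
feats = {
    "relevant":  [0, 0, 0, 1, 1, 1, 1, 1],  # strongly predicts the target
    "duplicate": [0, 0, 0, 1, 1, 1, 1, 1],  # identical copy, fully redundant
    "diverse":   [0, 0, 1, 1, 0, 0, 1, 1],  # uninformative but non-redundant
}
picked = mrmr(feats, target, 2)
print(picked)  # ['relevant', 'diverse']: the duplicate is penalized away
```

The second pick shows the redundancy term at work: the exact duplicate has high relevance but pays its full self-information as a penalty, so the independent column wins.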
Least absolute shrinkage and selection operator (Lasso) [24] is an L1-based feature selection approach. Performing some regularization in addition to filtering out unwanted features, Lasso is an "automatic" approach to feature selection.
Extremely randomized trees (Extra Trees) [9] is a tree-based model performing feature selection based on the importance values computed by the model.

We remind the reader that all methods mentioned above are target agnostic; we therefore compare confidence in both the specific target of interest and the classic general confidence of a model over all targets in the data (the latter can be seen in the supplemental material).
We evaluated our model on an image classification task as well as a breast cancer detection task, both chosen from the UCI machine learning repository [7], in addition to various disease prediction tasks assembled using the Centers for Disease Control and Prevention's (CDC) National Health and Nutrition Examination Survey (NHANES) [1] data.

For each of our sets, we select a target of special interest that we would like our model to focus on when deciding which features to acquire. Projecting this to the real world, the focus target would be a specific health issue in a dataset of symptoms and possible tests or images pointing to more than one possible target class.

The data is as follows. From the UCI machine learning repository, we use the Statlog data, a dataset for evaluating image data and identifying a particular type of soil in satellite images. Also from UCI, we use the Breast Cancer Wisconsin dataset, providing features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. From NHANES, we construct two datasets ourselves based on the approach described by [16]: one for evaluating diabetes, and one for evaluating heart diseases. To construct our datasets, we join all possible NHANES tables that are correlated with our targets. For example, for the heart disease dataset, we join all tables that have features with correlation to any of 5 heart conditions. This causes the resulting sets to have a vast number of possible features.

Dataset statistics, as well as the target chosen for each dataset, are listed in Table 1. For the NHANES datasets, targets are renamed from the original data for convenience. Blood glucose refers to the feature LBXGLU, the amount of glucose in the blood when fasting, used here to indicate whether or not an individual has diabetes.
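The table-joining step can be pictured with a small, purely illustrative sketch. NHANES files are keyed by the respondent sequence number SEQN; an outer join on that key leaves missing entries as None, which is where the sparseness reported in Table 1 comes from. The variable names and values below are placeholders.

```python
def join_tables(tables, key="SEQN"):
    """Outer-join a list of NHANES-style tables (lists of row dicts) on the
    respondent id; respondents absent from a table get None for its columns."""
    ids = sorted({row[key] for t in tables for row in t})
    index = [{row[key]: row for row in t} for t in tables]
    joined = []
    for i in ids:
        rec = {key: i}
        for t, idx in zip(tables, index):
            for col in t[0]:  # assumes rows of a table share the same columns
                if col != key:
                    rec[col] = idx[i].get(col) if i in idx else None
        joined.append(rec)
    return joined

demo = join_tables([
    [{"SEQN": 1, "LBXGLU": 95.0}, {"SEQN": 2, "LBXGLU": 180.0}],
    [{"SEQN": 2, "MCQ160B": 1}],  # the second table covers fewer respondents
])
print(demo)  # respondent 1 ends up with MCQ160B = None
```

Joining many such tables explains both the large feature counts (hundreds of columns) and the roughly 25% missingness of the constructed NHANES datasets.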
Congestive heart failure (CHF) refers to the feature MCQ160B, one of the 5 heart conditions we construct the dataset for (MCQ160E, MCQ160F, MCQ160C, MCQ160B, MCQ180B).

Available here: UCI Statlog (Landsat Satellite). Available here: UCI Breast Cancer Wisconsin.

Table 1: Dataset statistics

Dataset           | Size  | Features | Targets | Focus target   | Missingness
UCI Breast cancer | 569   | 32       | 2       | Malignant      | 0%
UCI Satlog        | 4435  | 37       | 6       | Damp grey soil | 0%
NHANES Diabetes   | 25474 | 581      | 2       | Blood Glucose  | 25%
NHANES Heart      | 49346 | 555      | 5       | CHF            | 25%

Table 2: Comparing F1 scores for feature selection on low feature count sets. f indicates the number of features acquired. (Columns: f; then MI, mRMR, Lasso, Extra Trees, TF for Breast Cancer and for Satlog.)
Assuming a constant budget for features, we run all feature selection approaches on the same training subset of the data and iteratively evaluate for each feature we add. We select a single target to act as the focus of our method, and put an emphasis on the model confidence for that specific target value, which we wish to maximize over all targets. The target chosen for each dataset is listed in Table 1. The compared models were constructed with the following parameters:

(i) Mutual information (MI) between our training data and the training target was calculated using a different number of neighbors. Balancing the estimation variance and bias, we evaluated several values of the number of neighbors k ∈ {1, …}. The k = 3 instance, giving the best average result in all cases, was selected.
(ii) mRMR was evaluated on both the "MIQ" and "MID" feature selection methods.
(iii) Lasso with cross-validation was used in this experiment. In order to find the best α value for the regularization process, we considered several values of α ∈ {1, …}. In addition, the best setup of Lasso for the average case was as follows: the maximum number of iterations was set to 1000, tolerance was set to 0.1, and the number of cross-validation folds was set to 10.
(iv) An Extra Trees classifier was used in our experiments. The number of estimators in this model was set to 1000, with no maximum depth defined. In order to split a node, the minimum number of samples was set to 2, and the quality of a split was measured by Gini impurity.
(v) Our Target-Focused (TF) feature selection was trained using ω_1 = 0.…, ω_2 = ω_3 = 0.… .

The machine used for evaluation had the following specification: Intel 12-core i9-7920x (2.90GHz) CPU, 128 GB RAM, and 4 GeForce RTX 2080TI GPUs.

In this section, we report confidence, false positive, and false negative scores, as well as F1 scores of our model, at intervals as features are acquired.
When plotting model trends of the aforementioned metrics, we denote the variance scores of model confidence prior to applying the argmax function (Equation 7) as the line margin on confidence plots, as can be seen in the plots below. On datasets with low feature count and little missingness, our method was able to achieve better overall scores, faster than the comparable methods, for a specific target value, as can be seen in Table 2.

Figure 1: Comparing model confidence in predicting malignant breast cancer. Line thickness indicates variance.

Figure 2: Comparing model confidence in predicting one class out of the Satlog dataset. Line thickness indicates variance.

Figure 3: Analysis of UCI Satlog dataset comparing FP/FN rates as features are acquired. 3a shows evaluation using our approach, 3b shows evaluation using the mRMR "MIQ" method, 3c shows evaluation using the Lasso method, and 3d shows evaluation using Extra Trees.

Figure 1 shows the confidence trend for acquiring 30 features on the Breast Cancer Wisconsin dataset using our method, comparing specific-target confidence across the 4 compared models. In this case, our model can be seen to be on par with the confidence achieved by the Mutual Information and Extra Trees methods. As can be seen in Table 2, our model is able to produce slightly better F1 scores, indicating a faster false positive and false negative reduction. The breast cancer dataset proved to be a relatively simple prediction problem; all methods performed relatively well, achieving good model confidence and F1 scores.

Figure 2 shows the confidence trend for acquiring 30 features on the Satlog dataset using our method and the compared methods. Here we can see that our target-focused method is able to achieve a more confident model faster, as well as a better overall F1 score. In this case, the chosen target of focus appeared to be the hardest target to model out of the available targets, since all compared models struggled to find features that best model the data, in addition to it being one of the minority classes. Despite that, our model gained the most confidence while using a low number of features. Figures 3a to 3d show the FP/FN evaluation over the Satlog dataset.
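The per-target FP/FN rates tracked in these figures can be computed with a simple sketch, under our reading of the metrics for a single focus target θ (the prediction and label vectors below are made up for illustration):

```python
def fp_fn_rates(preds, labels, theta):
    """False positive and false negative rates for the focus target theta:
    FP rate over points whose true label is not theta, FN rate over points
    whose true label is theta."""
    fp = sum(1 for p, y in zip(preds, labels) if p == theta and y != theta)
    fn = sum(1 for p, y in zip(preds, labels) if p != theta and y == theta)
    neg = sum(1 for y in labels if y != theta)
    pos = sum(1 for y in labels if y == theta)
    return (fp / neg if neg else 0.0, fn / pos if pos else 0.0)

preds  = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 0]
fp, fn = fp_fn_rates(preds, labels, theta=1)
print(fp, fn)  # one false positive out of 3 negatives, one false negative out of 3 positives
```

Tracking these two rates per acquired feature, rather than overall accuracy, is what makes the plots sensitive to the minority focus target.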
We see our method shows a consistent, non-volatile decline in FP rates while maintaining a low FN rate throughout.

In Table 3 we can see the F1 scores of the different feature selection methods compared on high feature count datasets, with missing values and many features. Here our model heuristic is evaluated on publicly available real-world healthcare data. Our method, being aware of the specific target, is able to consistently pick out a good subset of the features, consequently using fewer features that in turn contribute most to maximizing confidence in the selected target class in focus.

As can be seen in the diabetes confidence evaluation in Figure 5, and in the FP/FN evaluation in Figures 6a to 6d, our method outperformed the compared methods in minimizing FP and FN scores quickly, in addition to achieving a consistent amount of confidence in the target of interest relatively fast. It can be seen in Figure 5 that all comparable methods achieve a high amount of confidence more quickly than our target-focused method. However, comparing Figures 6a to 6d, we can see that our model minimizes false positive and false negative scores more quickly and therefore receives higher F1 scores.

Confidence evaluation for the heart disease dataset can be seen in Figure 4. Our model gains confidence using fewer features, as before, and keeps a relatively increasing trend of confidence. Other models failed to increase their confidence significantly, as this was the hardest task, with multiple targets and high dimensionality.
Table 3: Comparing F1 scores for feature selection on high feature count sets. f indicates the number of features acquired.

Figure 4: Comparing model confidence in predicting congestive heart failure. Line thickness indicates variance.

Figure 5: Comparing model confidence in predicting diabetes. Line thickness indicates variance.

Figure 6: Analysis of the NHANES Diabetes constructed dataset comparing FP/FN rates as features are acquired. 6a shows evaluation using our approach, 6b shows evaluation using the mRMR "MIQ" method, 6c shows evaluation using the Lasso method, and 6d shows evaluation using Extra Trees.

Since the other feature selection models are unaware of the single-target uncertainty in the model, the results obtained by the compared models could be largely dependent on the distribution of targets; i.e., the selected target of focus might get better results if it is also the majority target. All compared models were able to find features to construct an efficient frugal model on at least one of the sets, but our method has shown higher consistency across all sets. While real-world health data is normally sparse and feature-rich, we can see that even on smaller datasets with fewer features, our method provides a good heuristic as to the value of features when acquired towards a single target.
In this paper, we have investigated the approach of acquiring features based on a specific target of interest out of two or more targets. We see a frugal approach as an important addition to the process of feature selection, especially as data availability grows dramatically and utilization of data remains somewhat inefficient, particularly in the domain of healthcare. We have discussed the application of our target-focused approach to both well-known sources of machine learning datasets and real-world public healthcare data converted into datasets. On these, we have clearly demonstrated the value of having a target-aware method of feature selection, as compared to feature selection methods that are target-agnostic. We have introduced a Bayesian confidence-based scoring mechanism that we proceeded to show is robust in both scalability and consistency on different types of datasets. Practically, we were able to minimize uncertainty in a specific target of interest with a minimal budget, while also minimizing the general uncertainty, false positive, and false negative rates.

References

[1] National Health and Nutrition Examination Survey, 2018.
[2] H. Bentz, M. Hagstroem, and G. Palm. Selection of relevant features and examples in machine learning. Neural Networks, 2(4):289–293, 1997.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
[4] Arthur Choi, Yexiang Xue, and Adnan Darwiche. Same-decision probability: A confidence measure for threshold-based decisions. Int. J. Approx. Reasoning, 53:1415–1428, 2012.
[5] YooJung Choi, Adnan Darwiche, and Guy Van den Broeck. Optimal feature selection for decision robustness in Bayesian networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017.
[6] J. S. Denker and Y. LeCun. Transforming neural-net output levels to probability distributions. In R. Lippmann, J. Moody, and D. Touretzky, editors, Advances in Neural Information Processing Systems (NIPS 1990), volume 3, Denver, CO, April 1991. Morgan Kaufmann.
[7] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
[8] Alberto Freitas, Altamiro Costa-Pereira, and Pavel Brazdil. Cost-sensitive decision trees applied to medical data. In International Conference on Data Warehousing and Knowledge Discovery, pages 303–312. Springer, 2007.
[9] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[10] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521:452–459, 2015.
[11] Peter Groves, Basel Kayyali, David Knott, and Steve Van Kuiken. The 'big data' revolution in healthcare. McKinsey Quarterly, 2(3), 2013.
[12] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.
[13] Shamsul Huda, John Yearwood, Herbert F. Jelinek, Mohammad Mehedi Hassan, Giancarlo Fortino, and Michael Buckland. A hybrid feature selection with ensemble classification for imbalanced healthcare data: A case study for brain tumor diagnosis. IEEE Access, 4:9145–9154, 2016.
[14] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, Nov 1999.
[15] Mohammad Kachuee, Orpaz Goldstein, Kimmo Kärkkäinen, and Majid Sarrafzadeh. Opportunistic learning: Budgeted cost-sensitive learning from data streams. In International Conference on Learning Representations, 2019.
[16] Mohammad Kachuee, Kimmo Karkkainen, Orpaz Goldstein, Davina Zamanzadeh, and Majid Sarrafzadeh. Nutrition and health data for cost-sensitive learning. arXiv preprint arXiv:1902.07102, 2019.
[17] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
[18] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
[19] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P. Trevino, Jiliang Tang, and Huan Liu. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):94, 2018.
[20] Peng Liu, Lei Lei, Junjie Yin, Wei Zhang, Wu Naijun, and Elia El-Darzi. Healthcare data mining: Prediction inpatient length of stay. In Intelligent Systems, 2006 3rd International IEEE Conference on, pages 832–837. IEEE, 2006.
[21] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[22] Sriraam Natarajan, Srijita Das, Nandini Ramanan, Gautam Kunapuli, and Predrag Radivojac. On whom should I perform this lab test next? An active feature elicitation approach. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 3498–3505. International Joint Conferences on Artificial Intelligence Organization, 7 2018.
[23] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence, (8):1226–1238, 2005.
[24] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[25] Yichuan Wang, LeeAnn Kung, and Terry Anthony Byrd. Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126:3–13, 2018.
[26] Shipeng Yu, Balaji Krishnapuram, Romer Rosales, and R. Bharat Rao. Active sensing. In Artificial Intelligence and Statistics, pages 639–646, 2009.
[27] Wei Zheng, Xiaofeng Zhu, Yonghua Zhu, and Shichao Zhang. Robust feature selection on incomplete data. In