Opportunistic Learning: Budgeted Cost-Sensitive Learning from Data Streams
Mohammad Kachuee, Orpaz Goldstein, Kimmo Karkkainen, Sajad Darabi, Majid Sarrafzadeh
Department of Computer Science, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
{mkachuee,orpgol,kimmo,sajad.darabi,majid}@cs.ucla.edu

Abstract
In many real-world learning scenarios, features are only acquirable at a cost constrained under a budget. In this paper, we propose a novel approach for cost-sensitive feature acquisition at prediction-time. The suggested method acquires features incrementally based on a context-aware feature-value function. We formulate the problem in the reinforcement learning paradigm and introduce a reward function based on the utility of each feature. Specifically, MC dropout sampling is used to measure expected variations of the model uncertainty, which is used as a feature-value function. Furthermore, we suggest sharing representations between the class predictor and value function estimator networks. The suggested approach is completely online and is readily applicable to stream learning setups. The solution is evaluated on three different datasets including the well-known MNIST dataset as a benchmark as well as two cost-sensitive datasets: Yahoo Learning to Rank and a dataset in the medical domain for diabetes classification. According to the results, the proposed method is able to efficiently acquire features and make accurate predictions.
1 Introduction
In traditional machine learning settings, it is usually assumed that a training dataset is freely available and the objective is to train models that generalize well. In this paradigm, the feature set is fixed, and we are dealing with complete feature vectors accompanied by class labels that are provided for training. However, in many real-world scenarios, there are certain costs for acquiring features as well as budgets limiting the total expenditure. Here, the notion of cost is more general than financial cost and also covers other concepts such as computational cost, privacy impacts, energy consumption, patient discomfort in medical tests, and so forth (Krishnapuram et al., 2011). Take the example of disease diagnosis based on medical tests. Creating a complete feature vector from all the relevant information is synonymous with conducting many tests such as an MRI scan, blood tests, etc., which would not be practical. A physician, on the other hand, approaches the problem by asking for a set of basic easy-to-acquire features, and then incrementally prescribes other tests based on the currently known information (i.e., the context) until a reliable diagnosis can be made. Furthermore, in many real-world use cases, due to the volume of data or the necessity of prompt decisions, learning and prediction should take place in an online and stream-based fashion. In the medical diagnosis example, this is consistent with the fact that the latency of diagnosis is vital (e.g., for urgent cases), and it is often impossible to defer decisions. Here, by online we mean processing samples one at a time as they are being received.

A version of the source code and the health dataset preprocessing code for this paper is available at: https://github.com/mkachuee/Opportunistic

Various approaches have been suggested in the literature for cost-sensitive feature acquisition. To begin with, traditional feature selection methods suggested limiting the set of features being used for training (Greiner et al., 2002; Ji & Carin, 2007).
For instance, L1 regularization for linear classifiers results in models that effectively use a subset of features (Efron et al., 2004). Note that these methods focus on finding a fixed subset of features to be used (i.e., feature selection), while a more optimal solution would make feature acquisition decisions based on the sample at hand and at prediction-time. More recently, probabilistic methods have been suggested that measure the value of each feature based on the current evidence (Chen et al., 2015). However, these methods are usually applicable only to Bayesian networks or similar probabilistic models and make limiting assumptions such as having binary features and binary classes (Chen et al., 2014). Furthermore, these probabilistic methods are computationally expensive and intractable in large-scale problems (Chen et al., 2015).

Motivated by the success of discriminative learning, cascade and tree-based classifiers were suggested as an intuitive way to incorporate feature costs (Karayev et al., 2012; Chen et al., 2012; Xu et al., 2012; 2014). Nevertheless, these methods are limited to the modeling capability of tree classifiers and to fixed, predetermined structures. A recent work by Nan & Saligrama (2017) suggested a gating method that employs adaptive linear or tree-based classifiers, alternating between low-cost models for easy-to-handle instances and higher-cost models for more complicated cases. While this method outperforms much of the previous work on tree-based and cascade cost-sensitive classifiers, the low-cost model being used is limited to simple linear classifiers or pruned random forests.

As an alternative approach, sensitivity analysis of trained predictors has been suggested to measure the importance of each feature given a context (Early et al., 2016a; Kachuee et al., 2017; 2018). These approaches either require an exhaustive measurement of sensitivities or rely on approximations of sensitivity. They are easy to use as they work without any significant modification to the predictor models being trained. However, finding the global sensitivity is theoretically a difficult and computationally expensive problem. Therefore, approximate or local sensitivities are frequently used in these methods, which may lead to suboptimal solutions.

Another approach suggested in the literature is modeling the feature acquisition problem as a learning problem in the imitation learning (He et al., 2012) or reinforcement learning (He et al., 2016; Shim et al., 2017; Janisch et al., 2017) domain. These approaches are promising in terms of performance and scalability. However, the value functions used in these methods are usually not intuitive and require tuning hyper-parameters to balance the cost vs. accuracy trade-off. More specifically, they often rely on one or more hyper-parameters to adjust the average cost at which these models operate. On the other hand, in many real-world scenarios it is desirable to adjust the trade-off at prediction-time rather than training-time.
For instance, it might be desirable to spend more for a certain instance or to continue the feature acquisition until a desired level of prediction confidence is achieved.

This paper presents a novel method based on deep Q-networks for cost-sensitive feature acquisition. The proposed solution employs uncertainty analysis in neural network classifiers as a measure for finding the value of each feature given a context. Specifically, we use variations in the certainty of predictions as a reward function measuring the value per unit of cost given the current context. In contrast to recent feature acquisition methods that use reinforcement learning ideas (He et al., 2016; Shim et al., 2017; Janisch et al., 2017), the suggested reward function does not require any hyper-parameter tuning to balance the cost versus performance trade-off. Here, features are acquired incrementally, while maintaining a certain budget or a stopping criterion. Moreover, in contrast to many other works in the literature that assume an initial complete dataset (He et al., 2012; Kusner et al., 2014; Chen et al., 2015; Early et al., 2016b; Nan & Saligrama, 2017), the proposed solution is stream-based and online, learning and optimizing acquisition costs during both training and prediction. This can be beneficial since, in many real-world use cases, it may be prohibitively expensive to collect all features for all training data. Furthermore, this paper suggests a method for sharing representations between the class predictor and action-value models that increases the training efficiency.
2 Preliminaries

2.1 Problem Settings
In this paper, we consider the general scenario of having a stream of samples as input (S_i). Each sample S_i corresponds to a data point of a certain class in R^d, where there is a cost for acquiring each feature (c_j; 1 ≤ j ≤ d). For each sample, initially, we do not know the value of any feature. Subsequently, at each time step t, we only have access to a partial realization of the feature vector, denoted by x_i^t, which consists of the features acquired so far. There is a maximum feature acquisition budget (B) available for each sample. Note that the acquisition may also be terminated before reaching the maximum budget, based on any other termination condition such as reaching a certain prediction confidence. Furthermore, for each S_i, there is a ground-truth target label ỹ_i. It is also worth noting that we consider the online stream processing task in which acquiring features is only possible for the current sample being processed. In other words, any decision should take place in an online fashion.

In this setting, the goal of an Opportunistic Learning (OL) solution is to make accurate predictions for each sample by acquiring as many features as necessary. At the same time, learning should take place by updating the model while maintaining the budgets. Please note that, in this setup, we assume that the feature acquisition algorithm is processing a stream of input samples and there are no distinct training or test samples. However, we assume that ground-truth labels are only available to us after the prediction and only for a subset of samples.

More formally, we define a mask vector k_i^t ∈ {0, 1}^d where each element of k indicates whether the corresponding feature is available in x_i^t. Using this notation, the total feature acquisition cost at each time step can be represented as

    C_total,i^t = (k_i^t − k_i^0)^T c.    (1)

Furthermore, we define the feature query operator (q) as

    x_i^{t+1} = q(x_i^t, j),  where k_{i,j}^{t+1} − k_{i,j}^t = 1.    (2)

In Section 3, we use these primitive operations and notations for presenting the suggested solution.
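To make the notation above concrete, the following minimal NumPy sketch shows one possible way to represent a partially observed sample with a mask vector k and to apply the query operator q of equation 2 while accumulating the cost of equation 1. The class name, the `oracle_value` argument, and the NaN convention are illustrative assumptions rather than the authors' implementation (which is linked above).

```python
import numpy as np

class PartialSample:
    """Partially observed feature vector x_i^t with acquisition mask k_i^t."""

    def __init__(self, d, costs):
        self.x = np.full(d, np.nan)                   # unknown features kept as NaN
        self.k = np.zeros(d, dtype=bool)              # k_i^t: True if feature j acquired
        self.costs = np.asarray(costs, dtype=float)   # per-feature acquisition costs c_j
        self.total_cost = 0.0                         # C_total,i^t of equation 1

    def unknown_features(self):
        """Valid actions A_i^t: indices of features not acquired yet."""
        return np.flatnonzero(~self.k)

    def query(self, j, oracle_value):
        """Query operator q(x_i^t, j) of equation 2: acquire feature j and pay c_j."""
        assert not self.k[j], "feature already acquired"
        self.x[j] = oracle_value
        self.k[j] = True
        self.total_cost += self.costs[j]
        return self.x
```

Here `oracle_value` stands in for whatever mechanism actually supplies the feature value (a lab test, a cached column of the stream, etc.).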
2.2 Prediction Certainty

As prediction certainty is used extensively throughout this paper, we devote this section to certainty measurement. The softmax output layer in neural networks is traditionally used as a measure of prediction certainty. However, interpreting softmax values as probabilities is an ad hoc approach prone to errors and inaccurate certainty estimates (Szegedy et al., 2013). In order to mitigate this issue, we follow the idea of Bayesian neural networks and Monte Carlo dropout (MC dropout) (Williams, 1997; Gal & Ghahramani, 2016). Here we consider the distribution of model parameters at each layer l in an L-layer neural network as:

    ω̂_l ∼ p(ω_l),  where 1 ≤ l ≤ L,    (3)

where ω̂_l is a realization of layer parameters from the probability distribution p(ω_l). In this setting, a probability estimate conditioned on the input and the stochastic model parameters is represented as:

    p(y | x, ω̂) = softmax(f_D^ω̂(x)),    (4)

where f_D^ω̂ is the output activation of a neural network with parameters ω̂ trained on dataset D. In order to find the uncertainty of final predictions with respect to inputs, we integrate equation 4 with respect to ω:

    p(y | x, D) = ∫ p(y | x, ω) p(ω | D) dω.    (5)

Finally, MC dropout suggests interpreting the dropout forward-path evaluations as Monte Carlo samples (ω_t) from the ω distribution and approximating the prediction probability as:

    p(y | x, D) = (1/T) Σ_{t=1}^{T} p(y | x, ω_t).    (6)

With a reasonable dropout probability and number of samples, the MC dropout estimate can be considered an accurate estimate of the prediction uncertainty. Readers are referred to Gal & Ghahramani (2016) for a more detailed discussion. In this paper, we denote the certainty of prediction for a given sample, Cert(x_i^t), as a vector providing the probability of the sample belonging to each class as in equation 6.
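As a rough illustration of equation 6, the snippet below averages softmax outputs over repeated stochastic forward passes of a dropout-equipped PyTorch classifier; the function name and the default number of samples are placeholders, not the settings used in the experiments.

```python
import torch

def mc_dropout_certainty(p_network, x, n_samples=100):
    """Approximate p(y|x, D) of equation 6 by averaging softmax outputs over
    stochastic forward passes with dropout kept active (MC dropout)."""
    p_network.train()  # keep dropout layers stochastic at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(p_network(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0)  # Cert(x): averaged per-class probabilities
```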
3 Proposed Solution

3.1 Cost-Sensitive Feature Acquisition
We formulate the problem at hand as a generic reinforcement learning problem. Each episode consists of a sequence of interactions between the suggested algorithm and a single data instance (i.e., sample). At each point, the current state is defined as the current realization of the feature vector (i.e., x_i^t) for a given instance. At each state, the set of valid actions consists of acquiring any feature that has not been acquired yet (i.e., A_i^t = {j = 1 ... d | k_{i,j}^t = 0}). In this setting, each action, along with the state transition as well as a reward defined in the following, characterizes an experience.

We suggest incremental feature acquisition based on the value per unit cost of each feature. Here, the value of acquiring a feature is defined as the expected amount of change in the prediction uncertainty that acquiring the feature causes. Specifically, we define the value of each unknown feature as

    r_{i,j}^t = ||Cert(x_i^t) − Cert(q(x_i^t, j))|| / c_j,    (7)

where r_{i,j}^t is the value of acquiring feature j for sample i at time step t. It can be interpreted as the expected change of the hypothesis due to acquiring each feature per unit of cost. Other reinforcement-learning-based feature acquisition methods in the literature usually use the final prediction accuracy and the feature acquisition costs as components of the reward function (He et al., 2016; Shim et al., 2017; Janisch et al., 2017). The reward function of equation 7, in contrast, models the weighted change of the hypothesis after acquiring each feature. Consequently, it results in an incremental solution which selects the most informative feature to be acquired at each point. As demonstrated in our experiments, this property is particularly beneficial when a single model is to be used under a budget determined at prediction-time or under any other termination condition that is not predefined.

While it is possible to directly use the measure introduced in equation 7 to find the features to be acquired at each time step, this would be computationally expensive, because it requires exhaustively measuring the value function for all features at each step. Instead, in addition to a predictor model, we train an action-value (i.e., feature-value) function which estimates the gain of acquiring each feature based on the current context. For this purpose, we follow the idea of the deep Q-network (DQN) (Mnih et al., 2015; 2013). Briefly, DQN suggests end-to-end learning of the action-value function. This is achieved by exploring the space through taking actions using an ε-greedy policy, storing experiences in a replay memory, and gradually updating the value function used for exploration. Due to space limitations, readers are referred to Mnih et al. (2015) for a more detailed discussion.

Figure 1 presents the network architecture of the proposed method for prediction and feature acquisition. In this architecture, a predictor network (P-Network) is trained jointly with an action-value network (Q-Network). The P-Network is responsible for making predictions and consists of dropout layers that are sampled in order to find the prediction uncertainty. The Q-Network estimates the value of each unknown feature being acquired.

Here, we suggest sharing the representations learned by the P-Network with the Q-Network. Specifically, the activations of each layer in the P-Network serve as input to the adjacent layers of the Q-Network (see Figure 1).
Note that, in order to increase model stability during training, we do not allow back-propagation from Q-Network outputs to P-Network weights. We also explored other architectures and sharing methods, including using fully-shared layers between the P- and Q-Networks trained jointly, or sharing only the first few layers. According to our experiments, the suggested sharing method of Figure 1 is reasonably efficient, while introducing a minimal burden on the prediction performance.
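A minimal PyTorch sketch of the sharing scheme of Figure 1 could look like the following; the layer widths and dropout rate are arbitrary, and the detach() calls implement the rule that gradients from the Q-Network are not propagated back into the P-Network weights. It illustrates the idea rather than reproducing the exact networks of Table 1.

```python
import torch
import torch.nn as nn

class PQNetworks(nn.Module):
    """P-Network (class predictor) whose hidden activations are fed, detached,
    into the adjacent layers of the Q-Network (feature-value estimator)."""

    def __init__(self, d, n_classes, hidden=(128, 64), q_extra=(64, 16)):
        super().__init__()
        # P-Network: dropout layers are later sampled for MC dropout certainty.
        self.p1 = nn.Sequential(nn.Linear(d, hidden[0]), nn.ReLU(), nn.Dropout(0.5))
        self.p2 = nn.Sequential(nn.Linear(hidden[0], hidden[1]), nn.ReLU(), nn.Dropout(0.5))
        self.p_out = nn.Linear(hidden[1], n_classes)
        # Q-Network: each layer takes the corresponding (detached) P activation
        # concatenated with its own Q-specific units.
        self.q1 = nn.Sequential(nn.Linear(d, q_extra[0]), nn.ReLU())
        self.q2 = nn.Sequential(nn.Linear(hidden[0] + q_extra[0], q_extra[1]), nn.ReLU())
        self.q_out = nn.Linear(hidden[1] + q_extra[1], d)  # one value per candidate feature

    def forward(self, x):
        h1 = self.p1(x)
        h2 = self.p2(h1)
        logits = self.p_out(h2)                               # class prediction
        g1 = self.q1(x)
        g2 = self.q2(torch.cat([h1.detach(), g1], dim=-1))    # no gradient into P
        q_values = self.q_out(torch.cat([h2.detach(), g2], dim=-1))
        return logits, q_values
```

In this sketch the Q-Network head outputs one value per feature, so the greedy action is simply the argmax of the Q values over the currently unknown features.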
Figure 1: Network architecture of the proposed approach for prediction and action-value estimation.

Algorithm 1 summarizes the procedures for cost-sensitive feature acquisition and for training the networks. This algorithm is designed to operate on a stream of input instances, actively acquire features, make predictions, and optimize the models. In this algorithm, if any features are available for free, we include them in the initial feature vector; otherwise, we start with no features available initially. Here, the feature acquisition is terminated when a maximum budget is exceeded, a user-defined stopping function decides to stop, or there is no unknown feature left to acquire. It is worth noting that, in Algorithm 1, to simplify the presentation, we assumed that ground-truth labels are available at the beginning of each episode. In the actual implementation, however, we store the experiences within an episode in a temporary buffer, excluding the label. After the feature acquisition procedure terminates, a prediction is made, and upon the availability of the label for that sample, the temporary experiences along with the ground-truth label are pushed to the experience replay memory.

In our experiments, for simplicity of presentation, we assume that all features are independently acquirable at a certain cost, while in many scenarios features are bundled and acquired together (e.g., certain clinical measurements). It should be noted, however, that the formulation presented in this paper allows for bundled feature sets. In this case, each action would be acquiring a bundle, and the reward function would be evaluated for the acquisition of the bundle by measuring the variation of uncertainty before and after acquiring it.
3.2 Implementation Details
In this paper, the PyTorch numerical computation library (Paszke et al., 2017) is used for the implementation of the proposed method. The experiments took between a few hours and a couple of days on a GPU server, depending on the experiment. Here, we explored fully-connected multi-layer neural network architectures; however, the approach taken can be readily applied to other neural network and deep learning architectures. We statistically normalize features (μ = 0, σ = 1) prior to our experiments and impute missing features with zeros. Note that, in our implementation, for efficiency reasons, we use NaN (not a number) values to represent features that are not available and impute them with zeros during the forward/backward computation.

Cross-entropy and mean squared error (MSE) loss functions were used as the objective functions for the P and Q networks, respectively. Furthermore, the Adam optimization algorithm (Kingma & Ba, 2014) was used throughout this work for training the networks. We used dropout for all hidden layers of the P-Network and no dropout for the Q-Network. The target Q-Network was updated softly (i.e., by gradually blending in the online Q-Network weights). We update the P, Q, and target Q networks every n_fe experiences, where n_fe is the total number of features in an experiment. In addition, the replay memory is sized to store a multiple of the n_fe most recent experiences. The random exploration probability is decayed so that it eventually reaches a fixed final value. We determined these hyper-parameters using the validation set. Based on our experiments, the suggested solution is not very sensitive to these values, and any reasonable setting, given enough training iterations, would result in reasonable performance. A more detailed explanation of implementation details for each specific experiment is provided in Section 4.
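For illustration, the missing-value convention and the soft target update mentioned above could be realized along the following lines; the helper names and the `tau` parameter are ours, and the actual update rate and dropout probability are not specified here.

```python
import torch
import torch.nn as nn

def impute_unknown(x):
    """Features that have not been acquired are stored as NaN and are imputed
    with zeros (the mean of the standardized features) before the forward pass."""
    return torch.nan_to_num(x, nan=0.0)

def soft_update(target_q, online_q, tau):
    """Soft (Polyak) update of the target Q-Network parameters."""
    with torch.no_grad():
        for p_t, p_o in zip(target_q.parameters(), online_q.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p_o)

# Hypothetical wiring of the two objectives described in the text.
p_criterion = nn.CrossEntropyLoss()  # P-Network: class prediction
q_criterion = nn.MSELoss()           # Q-Network: regression to target action values
```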
Algorithm 1: Suggested algorithm for cost-sensitive feature acquisition, prediction, and training

Input: total budget (B), stream of samples (S_i), acquisition cost of features (c_j)
Initialize: experience replay memory, random exploration probability (Pr_rand)

for S_i in the stream do
    Pr_rand ← decay_factor × Pr_rand
    t ← 0
    x_i^t ← known features of S_i                   // if there are any features available
    total_cost ← 0
    ỹ_i ← class label of S_i
    terminate_flag ← False
    while not terminate_flag do                      // collect experiences from each episode
        if new random in [0, 1) < Pr_rand then       // random exploration
            j ← index of a randomly selected unknown feature
        else                                         // greedy action using the current policy
            j ← index of the unknown feature with maximum Q value
        x_i^{t+1} ← q(x_i^t, j)                      // acquire feature j
        total_cost ← total_cost + c_j                // pay the cost of j
        r_{i,j}^t ← ||Cert(x_i^t) − Cert(x_i^{t+1})|| / c_j
        push (x_i^t, x_i^{t+1}, j, r_{i,j}^t, ỹ_i) into the replay memory
        t ← t + 1
        if total_cost ≥ B or stop_condition() or no unknown feature left then
            terminate_flag ← True                    // terminate if all the budget is used
        if update_condition() then
            train_batch ← random mini-batch from the replay memory
            update P, Q, and target Q networks using train_batch   // jointly train P & Q
    end
end
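Putting the pieces together, one episode of Algorithm 1 might be rendered in Python roughly as below; `PartialSample`, `certainty_fn`, `q_values_fn`, `feature_oracle`, and the replay buffer are the hypothetical components sketched earlier, and the network updates, label handling, and exploration decay are omitted for brevity.

```python
import random
import numpy as np

def run_episode(sample, feature_oracle, certainty_fn, q_values_fn,
                replay_memory, budget, pr_rand, stop_condition=lambda: False):
    """One feature-acquisition episode of Algorithm 1 (condensed sketch)."""
    while True:
        unknown = sample.unknown_features()
        if len(unknown) == 0 or sample.total_cost >= budget or stop_condition():
            break
        if random.random() < pr_rand:                 # random exploration
            j = int(random.choice(unknown))
        else:                                         # greedy action from the Q-Network
            q = q_values_fn(sample.x)
            j = int(max(unknown, key=lambda a: q[a]))
        x_before = sample.x.copy()
        cert_before = certainty_fn(x_before)          # class-probability vector
        sample.query(j, feature_oracle(j))            # acquire feature j and pay c_j
        cert_after = certainty_fn(sample.x)
        reward = np.linalg.norm(cert_before - cert_after) / sample.costs[j]  # equation 7
        replay_memory.append((x_before, sample.x.copy(), j, reward))
```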
Table 1: The summary of datasets and experimental settings.

Dataset                          Instances  Features  Classes  P-Net Architecture  Q-Net Architecture
MNIST (LeCun et al., 1998)       –          –         –        [..., 64]           [...+512, ...+256, ...+65, ...+16]
LTRC (Chapelle & Chang, 2011)    –          –         –        [..., 32]           [...+128, ...+8]
Diabetes (Sec. 4.1)              –          –         –        [..., 16]           [...+64, ...+16, ...+10]
4 Results and Experiments

4.1 Datasets and Experiments
We evaluated the proposed method on three different datasets: MNIST handwritten digits (LeCun et al., 1998), Yahoo Learning to Rank (LTRC) (Chapelle & Chang, 2011), and a health informatics dataset. The MNIST dataset is used as it is a widely used benchmark. For this dataset, we assume an equal feature acquisition cost for all features. It is worth noting that we consider the permutation-invariant setup for MNIST, where each pixel is a feature and spatial information is discarded. Regarding the LTRC dataset, we use the feature acquisition costs provided by Yahoo!, which correspond to the computational cost of each feature. Furthermore, we evaluated our method using a real-world health dataset for diabetes classification, where feature acquisition costs and budgets are natural and essential considerations. The National Health and Nutrition Examination Survey (NHANES) data (nha, 2018) was used for this purpose. A feature set including: (i) demographic information (age, gender, ethnicity, etc.), (ii) lab results (total cholesterol, triglyceride, etc.), (iii) examination data (weight, height, etc.), and (iv) questionnaire answers (smoking, alcohol, sleep habits, etc.) is used here. An expert with experience in medical studies was asked to suggest costs for each feature based on the overall financial burden, patient privacy, and patient inconvenience. Finally, the fasting glucose values were used to define three classes based on standard threshold values: normal, pre-diabetes, and diabetes. The final dataset consists of the resulting samples and features.
Figure 2: Evaluation of the proposed method on the MNIST dataset. Accuracy vs. number of acquired features for OL, RADIN (Contardo et al., 2016), GreedyMiser (Xu et al., 2012), and a recent work based on reinforcement learning (RL-Based) (Janisch et al., 2017).
Figure 3: Evaluation of the proposed method on the LTRC dataset. NDCG vs. cost of acquired features for OL, CSTC (Xu et al., 2014), Cronus (Chen et al., 2012), and Early Exit (Cambazoglu et al., 2010) approaches.

In the current study, we use reinforcement learning as an optimization algorithm, while processing data in a sequential manner. Throughout all experiments, we used fully-connected neural networks with ReLU non-linearities and dropout applied to hidden layers. We apply MC dropout sampling using repeated evaluations of the predictor network for confidence measurement and for computing the reward values, and likewise use repeated evaluations for prediction at test-time. We selected these settings for our experiments as they showed stable prediction and uncertainty estimates. Each dataset was randomly split into test, validation, and training partitions. During the training and validation phases, we use the random exploration mechanism. When comparing results with other work in the literature, as they are all offline methods, random exploration is not used during feature acquisition. However, we believe that, for datasets with non-stationary distributions, it may be helpful to keep random exploration active, as it helps capture concept drift. Furthermore, we train the model multiple times for each experiment and average the outcomes. It is also worth noting that, as the proposed method is incremental, we continued feature acquisition until all features were acquired and report the average accuracy corresponding to each feature acquisition budget.

Table 1 presents a summary of the datasets and network architectures used throughout the experiments. In this table, we report the number of hidden neurons at each network layer of the P and Q networks. For the Q-Network architecture, the number of neurons in each hidden layer is reported as the number of shared neurons from the P-Network plus the number of neurons specific to the Q-Network.
4.2 Performance of the Proposed Approach
Figure 2 presents the accuracy versus acquisition cost curve for the MNIST dataset. Here, we compare the results of the proposed method (OL) with a feature acquisition method based on recurrent neural networks (RADIN) (Contardo et al., 2016), a tree-based feature acquisition method (GreedyMiser) (Xu et al., 2012), and a recent work using reinforcement learning ideas (RL-Based) (Janisch et al., 2017). As can be seen from this figure, our cost-sensitive feature acquisition method achieves higher accuracies at a lower cost than the other competitors. Regarding the RL-Based method (Janisch et al., 2017), to make a fair comparison, we used similar network sizes and learning algorithms as for the OL method. It is also worth mentioning that the RL-Based curve is the result of training many models with different values of the cost-accuracy trade-off hyper-parameter, while training the OL model gives us a complete curve. Accordingly, evaluating the method of Janisch et al. (2017) took more than 10 times as long as evaluating OL.

Figure 3 presents the performance versus acquisition cost curve for the LTRC dataset. As LTRC is a ranking dataset, in order to have a fair comparison with other work in the literature, we use the normalized discounted cumulative gain (NDCG) (Järvelin & Kekäläinen, 2002) performance measure. In short, NDCG is the ratio of the discounted relevance achieved using a suggested ranking method to the discounted relevance achieved using the ideal ranking.
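For reference, a generic NDCG computation consistent with this definition is sketched below (one common gain/discount variant, not Yahoo!'s official evaluation script; the function names are ours).

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in rank order."""
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(rel) + 2))
    return np.sum((2.0 ** rel - 1.0) / discounts)

def ndcg(predicted_scores, true_relevances, k=None):
    """NDCG: DCG of the ranking induced by the predicted scores over the ideal DCG."""
    rel = np.asarray(true_relevances, dtype=float)
    order = np.argsort(predicted_scores)[::-1][:k]   # ranking induced by the model
    ideal = np.sort(rel)[::-1][:k]                   # ideal ranking
    ideal_dcg = dcg(ideal)
    return 0.0 if ideal_dcg == 0 else dcg(rel[order]) / ideal_dcg
```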
Figure 4: Evaluation of the proposed method on the diabetes dataset. (a) Visualization of feature acquisition orders for test samples (warmer colors represent higher priority). (b) Accuracy vs. cost of acquired features for this paper (OL), an exhaustive sensitivity-based method (Exhaustive) (Early et al., 2016a), the method suggested by Janisch et al. (2017) (RL-Based), and a method using gating functions and adaptively trained random forests (Nan & Saligrama, 2017) (Adapt-GBRT).

Inferring from Figure 3, the proposed method is able to achieve higher NDCG values using a much lower acquisition budget compared to tree-based approaches in the literature, including CSTC (Xu et al., 2014), Cronus (Chen et al., 2012), and Early Exit (Cambazoglu et al., 2010).

Figure 4a shows a visualization of the OL feature acquisition on the diabetes dataset. In this figure, the y-axis corresponds to random test samples and the x-axis corresponds to each feature. Here, warmer colors represent features that were acquired with higher priority and colder colors represent lower acquisition priority. It can be observed from this figure that OL acquires features based on the available context rather than having a static feature importance and ordering. It can also be seen that OL gives more priority to less costly and yet informative features such as demographics and examinations. Furthermore, Figure 4b presents the accuracy versus acquisition cost for the diabetes classification. As can be observed from this figure, OL achieves superior accuracy at a lower acquisition cost compared to the other approaches. Here, we used the exhaustive feature query method suggested by Early et al. (2016a) using sensitivity as the utility function, the method suggested by Janisch et al. (2017) (RL-Based), as well as a recent method using gating functions and adaptively trained random forests (Nan & Saligrama, 2017) (Adapt-GBRT).
4.3 Analysis

4.3.1 Ablation Study
In this section, we show the effectiveness of three ideas suggested in this paper: using model uncertainty as a feature-value measure, sharing representations between the P and Q networks, and using MC dropout as a measure of prediction uncertainty. Additionally, we study the influence of the available budget on the performance of the algorithm. In these experiments, we used the diabetes dataset. A comparison between the feature-value function suggested in this paper (OL) and a traditional feature-value function (RL-Based) was presented in Figure 2 and Figure 4b. We implemented the RL-Based method such that it uses a similar architecture and learning algorithm as OL, while its reward function is simply the negative of the feature cost for acquiring each feature and a positive value for making correct predictions. As can be seen from the comparison of these approaches, the reward function suggested in this paper results in a more efficient feature acquisition.

In order to demonstrate the importance of MC dropout, we measured the average accuracy at each certainty value. Statistically, confidence values should indicate the average accuracy of predictions (Guo et al., 2017). For instance, if the measured certainty of prediction for a group of samples is a given value, we expect the fraction of correctly classified samples in that group to match that value. Figure 5 shows the average prediction accuracy versus the certainty of samples, reported using the MC dropout method and directly using the softmax output values. As can be inferred from this figure, MC dropout estimates are highly accurate, while softmax estimates are mostly over-confident and inaccurate.
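The calibration check described here amounts to binning predictions by their reported certainty and comparing each bin's average accuracy with that certainty; a small sketch of such a check (bin count and names are arbitrary) is given below.

```python
import numpy as np

def accuracy_per_certainty(certainties, predictions, labels, n_bins=10):
    """Average prediction accuracy within each certainty bin (cf. Figure 5a).
    `certainties` are, e.g., the maximum class probabilities reported by the model."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(certainties, bins) - 1, 0, n_bins - 1)
    out = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            acc = float(np.mean(predictions[mask] == labels[mask]))
            out.append((bins[b], bins[b + 1], acc))
    return out  # (bin lower edge, bin upper edge, mean accuracy) per non-empty bin
```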
Figure 5: (a) The average prediction accuracy versus the certainty of samples, reported using the MC dropout method and directly using the softmax output values. (b) The accuracy versus cost curves when using the MC dropout method and when directly using the softmax output values.
Figure 6: The speed of convergence using the suggested sharing between the P and Q networks (W/ Sharing) compared with not using the sharing architecture (W/O Sharing).
Figure 7: Accuracy versus cost curves achieved using different budget levels: 25%, 50%, 75%, and 100% of the cost of acquiring all features.

Note that the accuracy of the certainty estimates is crucially important to us, as any inaccuracy in these values results in inaccurate reward values. Figure 5b shows the accuracy versus cost curves that the suggested architecture achieves using the accurate MC dropout certainty estimates and using the inaccurate softmax estimates. It can be seen from this figure that the more accurate MC dropout estimates are essential.

Figure 6 demonstrates the speed of convergence using the suggested sharing between the P and Q networks (W/ Sharing) as well as without the sharing architecture (W/O Sharing). Here, we use the normalized area under the accuracy-cost curve (AUACC) as a measure of acquisition performance at each episode. Please note that we adjust the number of hidden neurons such that the number of Q-Network parameters is the same for each corresponding layer in the two cases. As can be seen from this figure, the suggested representation sharing between the P and Q networks increases the speed of convergence.

Figure 7 shows the performance of the OL method under various limited budgets during operation. Here, we report the accuracy-cost curves for 25%, 50%, 75%, and 100% of the budget required to acquire all features. As can be inferred from this figure, the suggested method is able to efficiently operate under different enforced budget constraints.
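The AUACC values reported here are, in essence, areas under accuracy-versus-cost curves; one plausible reading of the normalization (our own helper, not the authors' exact evaluation code) is a trapezoidal area computed over a cost axis rescaled to [0, 1], as sketched below.

```python
import numpy as np

def auacc(costs, accuracies):
    """Normalized area under an accuracy-cost curve via the trapezoidal rule."""
    costs = np.asarray(costs, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    order = np.argsort(costs)
    c, a = costs[order], accuracies[order]
    c = (c - c.min()) / (c.max() - c.min())   # rescale the cost axis to [0, 1]
    return float(np.trapz(a, c))
```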
4.3.2 Convergence Analysis

Figures 8a and 8b demonstrate the validation accuracy and AUACC values measured during the processing of the data stream at each episode for the MNIST and Diabetes datasets, respectively.
Figure 8: The validation set accuracy and AUACC values versus the number of episodes for the (a) MNIST and (b) Diabetes datasets.

As can be seen from this figure, as the algorithm observes more data samples, it achieves higher validation accuracy/AUACC values, and it eventually converges after a certain number of episodes. It should be noted that, in general, convergence in reinforcement learning setups depends on the training algorithm and parameters used. For instance, the random exploration strategy, the update condition, and the update strategy for the target Q network influence the overall time behavior of the algorithm. In this paper, we use conservative and reasonable strategies, as reported in Section 3.2, that result in stable behavior across a wide range of experiments.
5 Conclusion
In this paper, we proposed an approach for cost-sensitive learning in stream-based settings. We demonstrated that certainty estimation in neural network classifiers can be used as a viable measure of the value of features. Specifically, the variation of the model certainty per unit of cost is used as a measure of feature value. In this paradigm, a reinforcement learning solution is suggested which is efficient to train using a shared representation. The introduced method is evaluated on three different real-world datasets representing different applications: MNIST digit recognition, the Yahoo LTRC web ranking dataset, and diabetes prediction using health records. Based on the results, the suggested method is able to learn from data streams, make accurate predictions, and effectively reduce the prediction-time feature acquisition cost.

References
National health and nutrition examination survey, 2018.

B. Barla Cambazoglu, Hugo Zaragoza, Olivier Chapelle, Jiang Chen, Ciya Liao, Zhaohui Zheng, and Jon Degenhardt. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 411–420. ACM, 2010.

Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge, pp. 1–24, 2011.

Minmin Chen, Zhixiang Xu, Kilian Weinberger, Olivier Chapelle, and Dor Kedem. Classifier cascade for minimizing feature evaluation cost. In Artificial Intelligence and Statistics, pp. 218–226, 2012.

Suming Jeremiah Chen, Arthur Choi, and Adnan Darwiche. Algorithms and applications for the same-decision probability. Journal of Artificial Intelligence Research, 49:601–633, 2014.

Suming Jeremiah Chen, Arthur Choi, and Adnan Darwiche. Value of information based on decision robustness. In AAAI, pp. 3503–3510, 2015.

Gabriella Contardo, Ludovic Denoyer, and Thierry Artières. Recurrent neural networks for adaptive feature acquisition. In International Conference on Neural Information Processing, pp. 591–599. Springer, 2016.

Kirstin Early, Stephen E. Fienberg, and Jennifer Mankoff. Test time feature ordering with FOCUS: interactive predictions with minimal user burden. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 992–1003. ACM, 2016a.

Kirstin Early, Jennifer Mankoff, and Stephen E. Fienberg. Dynamic question ordering in online surveys. arXiv preprint arXiv:1607.04209, 2016b.

Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.

Russell Greiner, Adam J. Grove, and Dan Roth. Learning cost-sensitive active classifiers. Artificial Intelligence, 139(2):137–174, 2002.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.

He He, Hal Daumé III, and Jason Eisner. Cost-sensitive dynamic feature selection. In ICML Inferning Workshop, 2012.

He He, Paul Mineiro, and Nikos Karampatziakis. Active information acquisition. arXiv preprint arXiv:1602.02181, 2016.

Jaromír Janisch, Tomáš Pevný, and Viliam Lisý. Classification with costly features using deep reinforcement learning. arXiv preprint arXiv:1711.07364, 2017.

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.

Shihao Ji and Lawrence Carin. Cost-sensitive feature acquisition and classification. Pattern Recognition, 40(5):1474–1485, 2007.

Mohammad Kachuee, Anahita Hosseini, Babak Moatamed, Sajad Darabi, and Majid Sarrafzadeh. Context-aware feature query to improve the prediction performance. In Signal and Information Processing (GlobalSIP), 2017 IEEE Global Conference on, pp. 838–842. IEEE, 2017.

Mohammad Kachuee, Sajad Darabi, Babak Moatamed, and Majid Sarrafzadeh. Dynamic feature acquisition using denoising autoencoders. IEEE Transactions on Neural Networks and Learning Systems, 2018.

Sergey Karayev, Tobias Baumgartner, Mario Fritz, and Trevor Darrell. Timely object recognition. In Advances in Neural Information Processing Systems, pp. 890–898, 2012.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Balaji Krishnapuram, Shipeng Yu, and R. Bharat Rao. Cost-sensitive Machine Learning. CRC Press, 2011.

Matt J. Kusner, Wenlin Chen, Quan Zhou, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Yixin Chen. Feature-cost sensitive learning with submodular trees of classifiers. In AAAI, pp. 1939–1945, 2014.

Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Feng Nan and Venkatesh Saligrama. Adaptive classification for prediction under a budget. In Advances in Neural Information Processing Systems, pp. 4727–4737, 2017.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Hajin Shim, Sung Ju Hwang, and Eunho Yang. Why pay more when you can pay less: A joint learning framework for active feature acquisition and classification. arXiv preprint arXiv:1709.05964, 2017.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Christopher K. I. Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pp. 295–301, 1997.

Zhixiang Xu, Kilian Weinberger, and Olivier Chapelle. The greedy miser: Learning under test-time budgets. arXiv preprint arXiv:1206.6451, 2012.

Zhixiang Eddie Xu, Matt J. Kusner, Kilian Q. Weinberger, Minmin Chen, and Olivier Chapelle. Classifier cascades and trees for minimizing feature evaluation cost, 2014.