Dynamic Feature Acquisition Using Denoising Autoencoders
Mohammad Kachuee, Sajad Darabi, Babak Moatamed, Majid Sarrafzadeh
ACCEPTED FOR PUBLICATION IN IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (TNNLS)
Abstract—In real-world scenarios, different features have different acquisition costs at test-time, which necessitates cost-aware methods to optimize the cost and performance trade-off. This paper introduces a novel and scalable approach for cost-aware feature acquisition at test-time. The method incrementally asks for features based on the available context, that is, the known feature values. The proposed method is based on sensitivity analysis in neural networks and density estimation using denoising autoencoders with binary representation layers. In the proposed architecture, a denoising autoencoder is used to handle unknown features (i.e., features that are yet to be acquired), and the sensitivity of predictions with respect to each unknown feature is used as a context-dependent measure of informativeness. We evaluated the proposed method on eight different real-world datasets as well as one synthesized dataset and compared its performance with several other approaches in the literature. According to the results, the suggested method is capable of efficiently acquiring features at test-time in a cost- and context-aware fashion.
Index Terms—Feature acquisition, test-time, context-aware, cost-aware, denoising autoencoder
I. INTRODUCTION

Feature selection methods have been largely studied in the literature. Usually, the main goal of feature selection is defined as selecting a subset of available features to increase the prediction performance and to reduce over-fitting. In real-world scenarios, however, the cost of extracting or acquiring each feature is different from other features. The cost difference can be due to various factors such as differences in the computational load of feature extraction [1], [2], user disruptions in computer and user interactions [3], patient pain in medical procedures and tests [4], and so forth [5]. In these scenarios, selecting a feature that only marginally contributes to an increase in the prediction accuracy while entailing a high cost would be unacceptable. In other words, there exists a trade-off between the feature cost and prediction performance that should be considered in the algorithm design. To overcome this issue, there are methods suggested in the literature trying to adapt feature selection algorithms to consider the cost of each feature [6], [7], [8], [9]. However, another point of concern is that selecting a fixed set of features during the training phase and using them at test-time would not be an optimal solution, as it neglects the potential interdependence between features.
Authors are with the UCLA Department of Computer Science.
In many scenarios, there are features that are either freely available or easy to acquire at test-time. An optimal decision about which other features to include in the analysis can be highly dependent on them. For instance, a doctor decides whether to prescribe an MRI scan based on the patient's currently available information such as age, gender, symptoms, and so on. In this example, having a fixed list of required tests and asking patients to provide the results of these tests for a clinical visit would incur the high cost of MRI for all patients. In other words, the decision to include each feature should be based on the learned system dynamics as well as the information available at test-time.

In this paper, we suggest a novel approach for feature acquisition considering costs at test-time (FACT). The proposed solution is capable of incrementally asking for features to be included in the prediction based on the currently available context and user-defined feature costs. The rest of the paper is organized as follows. Section II briefly reviews the relevant literature. Section III introduces the suggested approach including theoretical and implementation details. Section IV presents the results of using the suggested method and compares them with state-of-the-art approaches in the literature. Finally, Section VI concludes the paper.

II. RELATED WORK
One approach to incorporating feature acquisition costs, or feature costs in general, is to consider the feature costs during the training phase and trade off the prediction accuracy against the prediction cost. An example of these approaches is limiting the number of features that are actually used in the predictor model by using L regularization [10]. In this method, the regularization enforces weights corresponding to certain features to be zero, and hence those features can be omitted during the test phase. There are other methods in the literature that define and solve optimization problems over both the prediction performance and prediction costs [11], [7], [12]. Nevertheless, in all these methods, the final set of selected features is fixed, and these methods fail to capture and take advantage of the contextual information available at test-time.

One intuitive approach to incorporate feature costs during the training phase, while considering the available context during the test phase, is using the idea of decision trees. One of the most famous examples of this approach is the face detection cascade classifier by Viola and Jones [13]. While their goal was to increase the prediction speed by rejecting negative samples as soon as possible within a cascade of classifiers, many papers followed their architecture and incorporated feature cost in creating cascade predictors [14], [15]. One main drawback of cascade approaches is that cascades are only applicable to problems with a considerable class imbalance, such as face detection or spam email detection. In these cases, the number of negative samples is significantly higher than the number of positive samples. However, there are many real-world applications in which the classes are relatively balanced, such as document classification or image classification.
To overcome this issue, the authors of [16], [17] suggested the idea of classifier trees instead of classifier cascades to handle the problems where cascades are not applicable.

While cascade and tree based test-time feature acquisition methods are shown to perform reasonably well in many scenarios, there are many problems and applications, such as large-scale image classification, voice recognition, and natural language processing, where tree and cascade classifiers are not intrinsically strong enough to make accurate predictions. Another important limitation of cascade and tree based approaches is that, while they include the context information to some extent, their feature query decisions are not truly instance specific. Specifically, they are limited by the fixed predetermined structure of the tree that dictates the features to be acquired at each tree node.

In order to address these issues, there has recently been great attention toward using learning methods to solve the generic problem of cost-sensitive and context-aware feature acquisition. He et al. [18] suggested a method based on imitation learning that trains a model able to predict the optimal feature query decision given the available features. Contardo et al. [19], [20] introduced the idea of formulating the task as a reinforcement learning problem and solving it as a separate problem. While these methods are successful in terms of truly incorporating the test-time context information into the decisions, they require the extra effort of training a feature query model in addition to the target predictor.

An alternative idea for measuring the informativeness of features given the context is using sensitivity analysis at test-time to measure the influence of each feature on the predictions. Early et al. [21] introduced a method based on sensitivity analysis that exhaustively measures the impact of acquiring each feature on the prediction outcome.
Their solution does not require training any other model, and it works in conjunction with almost any supervised learning algorithm. However, exhaustive sensitivity measurement is computationally expensive. It is impractical in problems with a large number of features to exhaustively examine the sensitivity with respect to each unknown feature.

In this paper, we suggest a novel approach based on the idea of sensitivity analysis. The proposed approach incrementally asks for features based on the feature acquisition costs and the expected effect each feature can induce on the prediction. Furthermore, the devised method uses back-propagation of gradients and binary representation layers in neural networks to address the computational load as well as scalability concerns. In an earlier work [22], we introduced the idea of sensitivity analysis as a method for dynamic feature selection. However, in this paper, we extend the idea by considering feature acquisition costs, introducing improvements such as feature encoding, and conducting more detailed experiments and analysis.

III. PROPOSED METHOD
A. Problem Definition
In this paper, we consider the problem of predicting target classes (y ∈ R^r) corresponding to a given feature vector (x ∈ R^d). Each feature vector consists of known features as well as unknown features (i.e., missing values) that are set to zero. The complete feature vector without any missing values is denoted by x̃. To indicate unknown features, a vector k ∈ {0, 1}^d is defined that acts as a mask and indicates known and unknown features with one and zero values, respectively. In addition, we define a feature acquisition cost vector (c ∈ R^d) that defines the cost of acquiring each feature.

For simplicity of analysis, we consider the incremental problem of having a feature vector (x^t) and the corresponding mask vector (k^t) at time step t. Additionally, we consider the cost values to be time dependent and defined for each time step (c^t). Using this notation, at each time step t, the current feature vector can be represented as

x^t_j = { 0 if k^t_j = 0; x̃_j if k^t_j = 1 }, (1)

which is acquired at the total cost of

C^t_total = (k^t − k^0)^T c^t. (2)

Apart from this, at each time step, we have an expected prediction value (y^t) using a predictor function (h) that takes x^t as input:

y^t = h(x^t) = h(x^t_1, x^t_2, ..., x^t_d). (3)

In this setup, we define the feature query operator (q) as a function that acquires the value of feature j in the incomplete feature vector x^t and outputs the feature vector of the next time step, x^{t+1}:

x^{t+1} = q(x^t, j), where k^{t+1}_j − k^t_j = 1 and k^{t+1}_i − k^t_i = 0 (i ≠ j). (4)

Furthermore, we define the desired feature to be queried at time t as the feature that decreases the prediction error significantly, while at the same time incurring a low acquisition cost. Mathematically speaking, we can use the prediction accuracy improvement per acquisition cost as a measure of efficiency for the feature query.
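The bookkeeping behind Eqs. (1)-(4) can be sketched as follows; the function names and toy feature/cost values are ours, not the paper's:

```python
import numpy as np

# A toy sketch of the masking, query, and cost bookkeeping of Eqs. (1)-(4).
# Names (masked, acquire, total_cost) and all values are illustrative.

def masked(x_complete, k):
    """Eq. (1): unknown features (k_j = 0) are represented as zeros."""
    return x_complete * k

def acquire(k, j):
    """Eq. (4): the query operator q reveals feature j by setting k_j = 1."""
    k_next = k.copy()
    k_next[j] = 1.0
    return k_next

def total_cost(k_t, k_0, c):
    """Eq. (2): C^t_total = (k^t - k^0)^T c^t."""
    return float((k_t - k_0) @ c)

x_complete = np.array([0.2, 0.8, 0.5])   # hidden ground-truth feature values
c = np.array([1.0, 4.0, 2.0])            # per-feature acquisition costs
k0 = np.zeros(3)                         # everything unknown at t = 0

k = acquire(acquire(k0, 1), 2)           # query features 1 and 2
print(masked(x_complete, k))             # feature 0 is still masked to zero
print(total_cost(k, k0, c))              # 4.0 + 2.0 = 6.0
```
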
Accordingly, the desired feature to be queried at time step t can be found by

j^t_sel = argmin_{j ∈ {1...d} | k^t_j = 0} ( |ỹ − h(q(x^t, j))| + ε ) · c^t_j, (5)

where ỹ is the ground-truth target value, and ε is a bias value in order to prevent the first term from becoming zero. It is worth mentioning that the solution introduced here is basically
an incremental solution that greedily selects features to be acquired at each step.

Table I presents a summary of the notations used throughout the paper.

TABLE I: The summary of the notations used throughout the paper.

ỹ ∈ R^r: Ground-truth target values
y ∈ R^r: Predicted target values
x ∈ R^d: Incomplete feature vector at test-time
x̃ ∈ R^d: Complete feature vector without any missing values
x_bin ∈ R^{d×l}: Binary representation of the feature vector
x′ ∈ R^d: Reconstructed feature vector
x′_bin ∈ R^{d×l}: Binary representation of the reconstructed feature vector
k ∈ {0, 1}^d: Mask vector indicating known and unknown features
c ∈ R^d: Feature acquisition cost vector
z ∈ R^{d′}: Encoded feature vector
B. Sensitivity-based Feature Acquisition
While (5) suggests which features to acquire at each step, directly using this equation is not practical. The reason is that the first term in this equation is usually not known and is difficult to estimate. To resolve this issue, it is possible to use the sensitivity of model predictions with respect to each missing feature as a measure of the potential impact of that feature on the final predictions. As a result, (5) can be rewritten using the suggested sensitivity measure as:

j^t_sel = argmax_{j ∈ {1...d} | k^t_j = 0} Sensitivity(h(x^t), j) / c^t_j. (6)

Note that because a higher sensitivity is synonymous with a more informative feature to select, the argmin function in (5) is replaced by argmax. Furthermore, the prediction sensitivity with respect to input j can be defined as

Sensitivity(h(x^t), j) = E_{x_j}( |∂h(x^t)/∂x_j| ) = ∫ |∂h(x^t)/∂x_j| p(x_j | x^t; h^t) dx_j. (7)

In this equation, the first term corresponds to the derivative of the predictor function with respect to each missing feature. The second term is the confidence of inferring the j'th feature given the context and model parameters. By substituting this into (6), the feature query criterion can be written as

j^t_sel = argmax_{j ∈ {1...d} | k^t_j = 0} ( ∫ |∂h(x^t)/∂x_j| p(x_j | x^t; h^t) dx_j ) / c^t_j. (8)

Furthermore, the continuous integral in (8) can be approximated by a discrete summation:

j^t_sel = argmax_{j ∈ {1...d} | k^t_j = 0} ( Σ_{x_j ∈ RS} |∂h(x^t)/∂x_j| p(x_j | x^t; h^t) ) / c^t_j, (9)

where RS is a set of samples from the range of possible values that can be taken by each feature. By adjusting the granularity of the values in the RS set, one can trade off between the approximation accuracy and the computational load of the expected value approximation.

C. Proposed Solution
The terms required in (9) for finding the feature to query include: the cost of acquiring each feature, the derivative of the prediction function with respect to each input at different input values, and the probability of each value for each feature given the available context. Feature query costs are assumed to be given by the user and known for each time step. For the latter two, while it is possible to model and estimate each term using conventional modeling methods, evaluating the summation exhaustively may be computationally expensive and impractical in many applications. Here, we introduce a novel method based on autoencoders with binary representation layers that can estimate the whole summation with a single forward and backward propagation in neural networks.

The left and the upper right parts of Fig. 1 show the architecture of the proposed network for the context-aware and one-shot estimation of the distribution of each feature. As depicted, an autoencoder architecture is designed to convert each feature in the feature vector (x) to a binary representation (x_bin). Then, it encodes the features to a more compact representation (z), and finally reconstructs the original feature vectors (x′) by creating a binary decoded vector (x′_bin). Here, in order to have an estimate of the probability of each bit being set, the sigmoid non-linearity is used as the activation function for the binary reconstruction layer (x′_bin). For the other layers, however, we used the rectified linear unit (ReLU) [23] non-linearity. Additionally, the network optimization cost function is defined as the weighted sum of cross-entropies over the binary feature words. Here, the term word refers to the set of encoded bits that represent a feature. The weights are adjusted to offset the importance of the reconstruction error caused by errors in bits of different significance within each word.
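The binary "word" representation above can be sketched as a fixed-point binarization; this assumes features normalized to [0, 1] and bit b carrying significance 2^-b, which also supplies the per-bit word weights just mentioned:

```python
import numpy as np

# A minimal sketch of the binarization described above, assuming features
# normalized to [0, 1) and bit b (b = 1..l) carrying significance 2^-b.

def to_binary(x, l=8):
    """Bit-by-bit recursive conversion of x in [0, 1) to l binary digits."""
    x = np.asarray(x, dtype=float).copy()
    bits = np.empty(x.shape + (l,))
    for b in range(l):
        x = x * 2
        bits[..., b] = np.floor(x)   # most significant remaining bit
        x = x - bits[..., b]
    return bits

def from_binary(bits):
    """Weighted summation of bit values (a single matrix product)."""
    l = bits.shape[-1]
    weights = 2.0 ** -(np.arange(l) + 1)
    return bits @ weights

x = np.array([0.5, 0.8125, 0.0])
print(to_binary(np.array([0.8125]), l=4))   # 0.8125 = 0.1101 in binary
print(from_binary(to_binary(x)))            # exact round-trip for multiples of 2^-8
```
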
It is worth mentioning that the trained autoencoder, as explained here, takes an input feature vector where missing features are set to zero, and it is capable of estimating the probability of each bit being set in the binary decode layer (x′_bin).

In addition to the autoencoder part, in the network of Fig. 1, we create a predictor model by stacking a few layers on top of the encoded representation (z) and training the encoder as well as the predictor parts of the network in a supervised fashion. Here, in order to measure the sensitivity of the output predictions with respect to changes in each feature, we suggest using the summation of the absolute derivatives of the output layer neurons with respect to each bit of the missing features. The final estimation of the summation in (9) is achieved by an element-wise multiplication of the bit probabilities estimated from the autoencoder's binary reconstruction layer and the sensitivities calculated from the derivative of the output layer with respect to each input feature bit. Specifically, this paper suggests defining RS as

RS = { 2^{−l}, ..., 2^{−2}, 2^{−1} }, (10)
Fig. 1: Network architecture of the proposed method including encoder, decoder, and predictor parts. The encoder part is responsible for handling missing features. The decoder part is used for feature density estimation. The predictor is responsible for making predictions; additionally, its derivatives with respect to the inputs are used for measuring sensitivities.

where l is the total number of bits used in the binary representation of each feature. Using (9) and (10), the feature to be acquired is given by

j^t_sel = argmax_{j ∈ {1...d} | k^t_j = 0} ( Σ_{b=1}^{l} |∂h(x^t)/∂x_bin_{j,b}| · x′_bin_{j,b} ) / c^t_j, (11)

where the sensitivity term is defined as

|∂h(x^t)/∂x_bin_{j,b}| = Σ_{i=1}^{r} |∂y_i/∂x_bin_{j,b}|. (12)

It is worth mentioning that, in addition to the common neural network hyper-parameters, the only hyper-parameter added by the suggested method is l, which controls the accuracy of the binary representation. Additionally, as we make no assumptions on the values used as feature costs and the proposed method is incremental, it can be applied to scenarios where feature costs are subject to change during the course of operation at test-time. However, in our experiments, in order to make the comparison of results easier, we evaluate the proposed method on scenarios where feature acquisition costs are constant in time.

D. Implementation Details
Prior to the analysis, we normalized all feature values in the dataset to the range of zero to one. Throughout the experiments, we used the TensorFlow numerical computation library [31] and explored feed-forward neural network architectures. The ReLU non-linearity [23] is used for all hidden layers except the binary representation layers. For converting feature values to the suggested binary representation, we implemented the bit-by-bit recursive conversion in an efficient and parallel manner. For converting back from the binary representation, we implemented the weighted summation of bit values using a fully parallel matrix multiplication, reshape, and addition. In this work, the Adaptive Moment (Adam) optimization algorithm [32] is used to train each network, with its hyper-parameters being the learning rate (α), the decay rate for the first moment (β1), and the decay rate for the second raw moment (β2).

The process of training the network starts with training the autoencoder part using a weighted cross-entropy loss between the binary representation of the complete feature vectors and the estimated probabilities from the binary reconstruction
TABLE II: The summary of datasets and experimental settings.

Dataset | Instances | Features | Classes | Network Architecture | Latent Missing Distribution
MNIST [24] (a) | – | 697 | 10 | Encoder: [697×8, 64, 32]; Predictor: [16, 10] | Beta, α = 3., β = 1.
Yahoo LTRC [25] | – | – | – | Encoder: [–×8, 128, 32]; Predictor: [16, 8] | Beta, α = 1., β = 1.
HAPT [26] | – | – | – | Encoder: [–×8, 64, 32]; Predictor: [16] | Beta, α = 1., β = 1.
Reuters R8 [27] | – | – | – | Encoder: [–×8, 64, 32]; Predictor: [16, 8] | Beta, α = 3., β = 1.
UCI Mushroom [28] (b) | – | 116 | – | Encoder: [–×8, 16]; Predictor: [4] | Beta, α = 5., β = 1.
UCI Landsat [29] | – | – | – | Encoder: [–×8, 16, 8]; Predictor: [4] | Beta, α = 1., β = 1.
UCI CTG [30] | – | – | – | Encoder: [–×8, 8]; Predictor: [4] | Beta, α = 1., β = 1.
Synthesized (Section IV-D) | – | – | – | Encoder: [–×8, 16, 10]; Predictor: [8, 4] | Beta, α = 1., β = 1.
Thyroid (Section IV-E) | 279 | 16 | 3 | Encoder: [16×8, 8]; Predictor: [4] | Beta, α = 1., β = 1.

(a) 697 features after omitting features with near-zero standard deviation, corresponding to margin pixels.
(b) 116 features after one-hot encoding of categorical features.
layer:

f(x, θ) = − Σ_{j=1}^{d} Σ_{b=0}^{l} 2^{−b} ( x̃_bin_{j,b} log(x′_bin_{j,b}) + (1 − x̃_bin_{j,b}) log(1 − x′_bin_{j,b}) ). (13)

In order to train the denoising autoencoder, for each training instance, we sample random values from a latent Beta distribution and use the sampled values as the probability of missing each feature in the training data. After training the autoencoder part, the trained autoencoder network weights are stored, and a few prediction layers are added on top of the encoder part. The reason we store the autoencoder weights is that fine-tuning the weights for the prediction task would affect the distribution estimation functionality of the originally trained autoencoder. In other words, we fine-tune for the supervised prediction task, while a copy of the original, not fine-tuned autoencoder is used for probabilistic modeling. Here, to train the predictor network, we use a smaller learning rate for the pre-trained encoder and a larger learning rate for the new predictor weights.

For the efficient calculation of derivatives, we use back-propagation from the values in the output prediction layer to each binary input bit. In this section, we only described the general architecture and training procedures; as we have conducted various experiments on different datasets, the exact network architecture of each case is explained in Section IV.

IV. EXPERIMENTAL RESULTS
A. Datasets and Experiments
The proposed method is evaluated on seven different real-world datasets including human activity recognition (HAPT) [26], hand-written character recognition (MNIST) [24], document classification (Reuters R8) [27], and web ranking (Yahoo LTRC) [25], as well as three other classification datasets. Apart from these, we have evaluated the method on a synthesized dataset, which is explained in Section IV-D, and a dataset in the health domain, explained in Section IV-E. Table II presents a summary of the conducted experiments. The table also includes the network architecture and the missing value distribution used during the training phase. In each case, the architecture column contains layer sizes for the binary layers, encoder layers, and predictor layers. In this table, the decoder layers are not shown and are equal to the encoder layer sizes in reverse order. We used an 8-bit binary representation throughout the experiments. Regarding the feature size of each dataset, we report both the nominal feature count and the number of features we have used in our experiments. Specifically, for the MNIST dataset, each pixel is considered as a feature, and we removed features corresponding to pixels near the margin that are almost always zero (i.e., with near-zero standard deviation across all samples). Also, regarding the Mushroom dataset, one-hot encoding of the categorical features resulted in 116 features to be used as input. An important point to consider for these features is that, during sensitivity measurement and acquisition, we should consider the acquisition of all one-hot features corresponding to a categorical feature as a single feature to acquire.

Regarding the feature acquisition costs, for the LTRC dataset, we used real cost values based on the time required to extract each feature, as suggested by [16].
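The one-hot grouping described above can be sketched as follows: per-column scores (the Eq. (11) numerators, sensitivity times bit probability, summed over bits) are aggregated over each one-hot group before dividing by the group's single acquisition cost. The group names, scores, and costs below are toy values of ours:

```python
import numpy as np

# Sketch of acquiring a categorical feature as a group of one-hot columns.
# scores[j] stands for the Eq. (11) numerator of column j; groups map each
# categorical feature to its column indices. All values are illustrative.

def select_group(scores, groups, costs, acquired):
    """Pick the unacquired group with the best aggregate score per cost."""
    best, best_val = None, -np.inf
    for g, members in groups.items():
        if g in acquired:
            continue
        val = sum(scores[j] for j in members) / costs[g]
        if val > best_val:
            best, best_val = g, val
    return best

scores = np.array([0.1, 0.2, 0.3, 0.5])      # per one-hot column scores
groups = {"color": [0, 1, 2], "odor": [3]}   # columns per categorical feature
costs = {"color": 2.0, "odor": 1.0}          # one cost per group
print(select_group(scores, groups, costs, acquired=set()))   # 'odor': 0.5 > 0.3
```
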
In order to introduce feature costs to the MNIST dataset, we followed the method suggested by [33]. Using this method, we create feature vectors by concatenating MNIST images at four different resolutions, with a different feature acquisition cost for features at each resolution. For the Landsat dataset, each sample consists of features from four different frequency bands. Here, we considered features of the same frequency band to have the same acquisition cost, equal to the frequency band number. Lastly, for the CTG dataset, we assumed that the features measuring event counts, the features measuring statistical information, and the histogram features each have a distinct cost value. For the synthesized and diabetes datasets, a complete explanation of the experiments is presented in Section IV-D and Section IV-E, respectively. For the other three datasets, we assumed feature costs to be equal for all features.

Based on the aforementioned experimental setup, in the following parts of this section, we evaluate the proposed method for feature acquisition considering costs at test-time (FACT) on each dataset.

B. Performance Evaluation
We split each dataset into three parts: one for testing, one for validation, and the rest for training. During the training phase, we use the validation and training sets to train the networks as explained in Section III-D, following the setups introduced in Table II. It is worth noting that the proposed method does not necessarily require all training features to be known. In fact, as explained in Section III-D, we use a latent Beta distribution to simulate the existence of unknown features. After the training phase, we use the test set to simulate the case of feature acquisition using the proposed method by assuming all the features to be initially unknown and using the feature query criterion of (11) to ask for features incrementally. In our experiments (except the experiments in Section IV-D), we continue the feature queries until all features are acquired and report the accuracy as well as the cost at each point during feature acquisition. In real applications, however, the incremental acquisition should be stopped after reaching a certain criterion such as a minimum confidence of predictions.

Table III presents the results of the proposed method on each dataset. The table contains the test accuracy on each dataset while asking for different percentages of the total cost of the original feature set. Here, the total cost is defined as the cost of acquiring all features. As a baseline, we report the test performance of a random forest classifier (RFC) on the complete feature set (i.e., acquiring all features and spending the maximum cost). The table also reports the denoising percentage of the trained denoising autoencoder calculated as

100 × ( ||x − x̃|| − ||x′ − x̃|| ) / ||x − x̃||. (14)

Additionally, in this table, the area under the accuracy-cost curves (AUACC) of the proposed feature query method as well as the AUACC of randomly asking for the unknown features are presented.
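The two evaluation quantities just described can be sketched as below: the denoising percentage of (14), assuming a Euclidean norm (the text does not specify which norm is used), and one plausible AUACC computation (trapezoidal area under the accuracy-vs-cost curve, truncated where accuracy first reaches its maximum and normalized; the paper's exact conventions may differ). All example values are toy data:

```python
import numpy as np

# Sketches of the denoising percentage (Eq. 14) and a plausible AUACC.
# Norm choice, truncation, and normalization are our assumptions.

def denoising_percentage(x_missing, x_recon, x_complete):
    before = np.linalg.norm(x_missing - x_complete)
    after = np.linalg.norm(x_recon - x_complete)
    return 100.0 * (before - after) / before

def auacc(costs, accs):
    c, a = np.asarray(costs, float), np.asarray(accs, float)
    stop = int(np.argmax(a)) + 1                  # truncate once accuracy converges
    c, a = c[:stop], a[:stop]
    area = float(np.sum((a[1:] + a[:-1]) / 2 * np.diff(c)))   # trapezoidal rule
    return area / ((c[-1] - c[0]) * a.max())      # normalize to [0, 1]

x_complete = np.array([1.0, 1.0, 1.0, 1.0])
x_missing = np.array([1.0, 0.0, 0.0, 1.0])        # two features zeroed out
x_recon = np.array([1.0, 0.5, 0.5, 1.0])          # decoder recovers them halfway
print(denoising_percentage(x_missing, x_recon, x_complete))        # 50.0
print(round(auacc([0.0, 1.0, 2.0, 3.0], [0.25, 0.75, 0.9, 0.9]), 4))
```
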
We have also included the AUACC results of a cost-aware version of the method suggested in [22], in which sensitivities are normalized by feature costs (see the DPFQ column). It is worth mentioning that the AUACC values are calculated as the normalized area under the accuracy versus acquisition cost curve, from a cost of zero to the cost at which the accuracy converges to the maximum accuracy.

According to the results presented in Table III, the proposed method can be used to effectively reduce the cost of the features acquired at test-time for accurate predictions. Also, compared with the baseline accuracy of using the complete feature set, the results show that our method trains predictor models that can make viable predictions using only a subset of the features without sacrificing prediction performance. Regarding the denoising percentage, in most cases the achieved denoising percentage is significant, confirming that a denoising autoencoder is capable of encoding features into a more compact representation that is more robust to the presence of missing values. Regarding the area under the curve values, in all cases the AUACC of FACT is considerably higher than its random selection counterpart. It is also noteworthy that for a few datasets (i.e., Reuters R8, UCI Landsat, HAPT, and UCI CTG) there is a considerable class imbalance that affects the baseline accuracies.

To further illustrate the performance of the proposed approach, we used the MNIST dataset to visualize the effect of cost and context on the selection of features to be queried. Here, features are pixel values at different locations across each image, and the context is the available pixel values at each time step. Fig. 2 shows the effect of context and cost on the order in which features are queried by the proposed algorithm, as well as a static order measured based on the mutual information between pixels and target classes.
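The static baseline just mentioned can be sketched as ranking features once by the mutual information between each (discretized) feature and the class label; the binning choice and toy data below are ours:

```python
import numpy as np

# Sketch of a static, context-free feature ranking by mutual information
# between each discretized feature and the target class. The two-bin
# discretization and the synthetic data are illustrative assumptions.

def mutual_information(f, y, bins=2):
    edges = np.linspace(f.min(), f.max(), bins + 1)[1:-1]
    f_disc = np.digitize(f, edges)
    mi = 0.0
    for fv in np.unique(f_disc):
        for yv in np.unique(y):
            p_xy = np.mean((f_disc == fv) & (y == yv))
            p_x, p_y = np.mean(f_disc == fv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))   # in nats
    return mi

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
informative = y + 0.1 * rng.standard_normal(400)   # tracks the label closely
noise = rng.standard_normal(400)                   # independent of the label
features = [informative, noise]
ranking = sorted([0, 1], key=lambda j: -mutual_information(features[j], y))
print(ranking)   # the informative feature is ranked first
```
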
Here, we present results for the original MNIST dataset with single-resolution images and equal feature costs (see Fig. 2a) as well as the introduced multi-resolution setup with different feature costs for each resolution (see Fig. 2c). In this figure, pixels with higher importance to be queried are indicated by warmer colors and less important pixels by colder colors. As is evident from this figure, the proposed context-aware method acquires features in different orders based on the available pixels, scanning for digit edges or discriminative areas. On the other hand, the static feature acquisition method only asks for the central pixels of each image, in a fixed order (see Fig. 2b). In addition, regarding the multi-resolution case, as can be seen from the figure, informative pixels from lower resolutions that incur lower costs are preferred to more costly higher-resolution pixels. For instance, in the lower left corner image of Fig. 2c, the parts that are first acquired from
the high-resolution pixels are the parts that create the difference between the digits 4 and 9, which are not clear enough at lower resolutions.

TABLE III: Results of evaluating the proposed method on different datasets. FACT accuracy (%) is reported at 0%, 25%, 50%, 75%, and 100% of the total cost (1) used; RFC accuracy (%) is on the complete feature set.

Dataset | 0% | 25% | 50% | 75% | 100% | RFC (100%) | Denoising (%) | Rand AUACC (2) | DPFQ (3) AUACC | FACT AUACC
MNIST | .63 | 94.95 | 96.19 | 96.17 | 96.17 | 97.07 | 60.08 | 0.63 | 0.71 | 0.
Yahoo LTRC | .70 | 47.16 | 49.41 | 50.48 | 50.39 | 50.51 | 94.03 | 0.47 | 0.47 | 0.
HAPT | .60 | 87.57 | 90.22 | 90.71 | 90.77 | 91.75 | 90.09 | 0.65 | 0.69 | 0.
Reuters R8 | .74 | 94.78 | 95.56 | 94.78 | 94.70 | 94.61 | 14.92 | 0.87 | 0.89 | 0.
UCI Mushroom | .41 | 99.26 | 99.83 | 99.91 | 99.93 | 99.99 | 57.69 | 0.61 | 0.85 | 0.
UCI Landsat | .66 | 80.10 | 83.73 | 87.35 | 88.49 | 91.08 | 93.88 | 0.65 | 0.68 | 0.
UCI CTG | .52 | 84.90 | 88.68 | 88.99 | 90.88 | 89.62 | 77.48 | 0.83 | 0.84 | 0.
Synthesized | .43 | 96.72 | 98.19 | 98.25 | 98.33 | 99.41 | 88.78 | 0.64 | 0.73 | 0.
Thyroid | .83 | 65.85 | 70.73 | 75.61 | 78.05 | 80.85 | 44.07 | 0.59 | 0.59 | 0.

(1) Total cost is defined as the total cost of acquiring all features.
(2) AUACC is defined as the area under the accuracy versus feature acquisition cost curve.
(3) Cost-aware version of the method suggested in [22].
C. Comparison with Other Work
Fig. 3 presents a comparison of the proposed feature acquisition method with a feature acquisition method based on recurrent neural networks (RADIN) [19] and a tree-based feature acquisition method (GreedyMiser) [15]. In this comparison, we have used the MNIST dataset to evaluate the performance of each method based on their accuracy using different numbers of features. As can be inferred from the figure, when the number of features to be queried is significantly smaller than the total number of features, the accuracy achieved using FACT is lower than that of the other methods. Nevertheless, the rate of increase in accuracy with respect to the number of queried features is significantly higher for the presented method, which makes it superior in the other cases. In this plot and in similar cost versus accuracy plots in this section, we provide 95% confidence intervals, presented as error bars, measured by running each experiment multiple times using different random initializations.

Fig. 4 presents a comparison between the feature acquisition cost curve of the proposed method and three other approaches in the literature that use classifier cascades or trees, namely CSTC [16], Cronus [14], and Early Exit [34]. Here, we used the LTRC dataset with the feature costs suggested by [16] to plot the feature acquisition cost versus the corresponding normalized discounted cumulative gain (NDCG) [35] performance measure. NDCG is a well-known measure of ranking quality that is used to measure the effectiveness and relevance of ranking results in search engines. As can be seen from this figure, FACT is significantly more powerful and more efficient compared to the others.

In order to evaluate the performance of the proposed method for the estimation of sensitivity values, we have implemented an exhaustive feature query method as suggested by [21], using sensitivity as the utility function.
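One way such an exhaustive sensitivity-based query can be realized is sketched below. This is an illustrative reconstruction, not the authors' implementation: the toy linear-softmax `predict`, the two-point value grids, and all parameter values are assumptions made here for demonstration.

```python
import numpy as np

def exhaustive_sensitivity(predict, x, known, bins):
    """Exhaustively score each unknown feature of one sample.

    predict: callable mapping a feature vector to class probabilities.
    x:       current feature vector (unknown entries zero-filled here).
    known:   set of indices already acquired.
    bins:    per-feature candidate (value, probability) pairs, e.g.
             histogram bin centers estimated from training data.
    Tries every candidate value for every unknown feature and averages
    the change it induces in the predicted probabilities.
    """
    base = predict(x)
    scores = {}
    for j in range(len(x)):
        if j in known:
            continue
        total = 0.0
        for value, prob in bins[j]:          # sum over the value range
            x_try = x.copy()
            x_try[j] = value
            total += prob * np.abs(predict(x_try) - base).sum()
        scores[j] = total
    return scores

# Toy two-class predictor: softmax over a fixed linear map.
W = np.array([[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]])
predict = lambda x: np.exp(W @ x) / np.exp(W @ x).sum()
x = np.zeros(3)
bins = {j: [(-1.0, 0.5), (1.0, 0.5)] for j in range(3)}
print(exhaustive_sensitivity(predict, x, known={0}, bins=bins))
```

Note the nested loop over unknown features and candidate values: this is exactly the per-feature, per-value cost that the proposed approximation avoids.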
To make a fair comparison, we have used the trained predictor network and exhaustively measured the effect of changing each input on the prediction probabilities. Here, in order to estimate the probability of each change, we have used a 5-bin histogram for each feature. Fig. 5 presents a comparison between the accuracy achieved using FACT and the exhaustive sensitivity-based method on the Landsat dataset. As a baseline, we have also included the curve corresponding to randomly selecting and acquiring features. As is evident from the figure, the proposed method is almost equivalent to the exhaustive method in terms of the accuracy achieved at each total acquisition cost. This is promising considering the fact that the proposed approach tries to approximate the exhaustive sensitivity measurement in an efficient and scalable manner. In other words, compared to the proposed method, the exhaustive method is significantly slower and less efficient at test-time. Specifically, the average processing time of the exhaustive method was about ms for each sample, while the corresponding processing time for FACT was about ms (i.e., about times faster). In the other test cases with more features, comparing with the exhaustive method was not possible due to the exponentially increasing computational load of evaluating the exhaustive method.

It is worth mentioning that this computational advantage comes from the fact that, for each incremental feature query, we approximate the summation of (11) for all unknown features using one forward and one backward network computation. However, the exhaustive sensitivity measurement computes the summation for each feature and over the range of all possible values, separately. In other words, the proposed method scales linearly with the growth in the number of unknown features, while the exhaustive method scales in polynomial time.

D. Evaluation using Synthesized Data
Fig. 2: Visualization of: (a) the proposed approach on the MNIST dataset with equal feature costs, (b) static feature acquisition using the mutual information between pixels and targets on the MNIST dataset, and (c) the proposed approach on the multi-resolution MNIST dataset with different feature costs at each resolution. Pixels with more importance/priority to be queried are indicated by warmer colors.

Fig. 3: Comparison of the proposed method (FACT) with the RADIN [19] and GreedyMiser [15] methods on the MNIST dataset.

Fig. 4: Comparison of the proposed method (FACT) with the CSTC [16], Cronus [14], and Early Exit [34] methods on the LTRC dataset.

Fig. 5: Comparison of the proposed method (FACT) with the exhaustive sensitivity-based (Exhaustive) [21] and random selection (Rand) methods on the Landsat dataset. This figure shows that the proposed method is able to approximate the ground-truth sensitivity values accurately and efficiently.

In order to get more insight into the performance of the proposed method, we have used a synthesized dataset to evaluate the suggested approach. The synthesized dataset is generated as follows: first, we randomly sampled cluster centers from a -dimensional space. Then, points were sampled around each cluster center from a normal distribution with a mean of zero and a variance of . Afterwards, we randomly assigned each cluster to a class from a set of two different classes. To each feature vector created so far, containing features, we appended another features with random values drawn from a normal distribution. These are features without any predictive value. Accordingly, the resulting feature vectors are of size . Finally, we made the dataset cost-sensitive by defining the feature costs for the first and second features to be a monotonically increasing function from to . See Fig. 6a for a visualization of the feature costs and the static importance computed using the mutual information between features and labels.

Fig. 6b demonstrates the order in which each feature is acquired using the proposed method. In this figure, each row corresponds to a test sample (here, only 50 samples are visualized) and each column represents a feature. The features that are acquired earlier are indicated with warmer colors. Here, we continued the feature acquisition until we reach of the maximum achievable accuracy. As can be seen, the second half of the features, which are not informative, is mainly skipped by the proposed method. Specifically, only about . of the features from the second half are selected by FACT, which means that most of the uninformative features are not acquired by the algorithm. On the other hand, based on the cost and value of each feature from the first half, the proposed method acquired the features that are more informative and have lower cost values. Apart from this, Fig. 6c presents the accuracy versus total feature acquisition cost on the test set. As can be seen from the curve, FACT converges to the maximum accuracy much faster than the static and random acquisition methods. This is mainly due to the fact that the proposed method highly prefers informative features with low cost, while the other methods disregard this information.

E. Evaluation using Real-World Health Data
In order to evaluate the performance of the proposed method on a dataset in the health domain, where feature acquisition costs are inherently important, we have used the thyroid disease classification dataset [36] (available at: http://archive.ics.uci.edu/ml/datasets/thyroid+disease). Here, we have features from different categories including demographics, questionnaire, examination, and lab results. Furthermore, this dataset provides the acquisition cost corresponding to each feature, which ranges from . for features such as age to . for certain blood tests.

Fig. 6: Evaluation of the proposed method on synthesized data. (a) Cost and static importance of each feature. (b) The feature acquisition order for 50 different test samples (warmer colors mean more priority). (c) Accuracy versus acquisition cost curves for the proposed method (FACT), acquisition using a static order, and random selection.

Fig. 7: Evaluation of the proposed method on the thyroid disease classification task. (a) The feature acquisition order for 50 different test samples (warmer colors mean more priority). (b) Accuracy versus acquisition cost curves for the proposed method (FACT), acquisition using a static order, and random selection.

Figure 7a presents a visualization of the order in which each feature is acquired for 40 randomly selected test samples. As can be observed from this visualization, FACT gives more priority to low-cost features, while costly but informative features are acquired with lower priority. Apart from this, Figure 7b presents the accuracy versus acquisition cost curves for the FACT, static, and random methods. As can be seen from this figure, FACT outperforms the other baseline approaches.
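Across these experiments, FACT is contrasted with static and random acquisition orders. The greedy, budgeted loop underlying such comparisons can be sketched as follows. This is a simplified caricature, not the paper's procedure: `toy_utility` is a hypothetical stand-in for the network-based sensitivity estimate, and the costs and budget are invented.

```python
def acquire(costs, utility, budget):
    """Greedy cost-aware acquisition: repeatedly buy the unknown feature
    with the highest utility per unit of cost until the budget runs out.

    costs:   per-feature acquisition costs.
    utility: callable (feature index, set of known features) -> float,
             standing in for a context-dependent informativeness score.
    """
    known = set()
    spent = 0.0
    order = []
    while True:
        affordable = [j for j in range(len(costs))
                      if j not in known and spent + costs[j] <= budget]
        if not affordable:
            break
        # Informativeness per unit of cost, as in the paper's criterion.
        best = max(affordable, key=lambda j: utility(j, known) / costs[j])
        known.add(best)
        spent += costs[best]
        order.append(best)
    return order

# Toy example: feature 2 is cheap and, by this made-up utility, informative.
costs = [4.0, 2.0, 1.0, 3.0]
toy_utility = lambda j, known: [0.4, 0.5, 0.9, 0.1][j]
print(acquire(costs, toy_utility, budget=5.0))  # prints [2, 1]
```

With a budget of 5.0, the loop buys features 2 and 1 (total cost 3.0) and then stops, since every remaining feature would exceed the budget.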
Fig. 8: Influence of different beta distribution parameters ((α, β) = (1.5, 1.5), (2.5, 1.5), (1.5, 2.5), (1.5, 5.0), and (5.0, 1.5)) on the accuracy versus acquisition cost curve for the synthesized dataset.

V. DISCUSSION
There are many methods, such as mutual information, information gain, etc., that are traditionally used in the literature to measure the value of each feature [37]. However, these methods are usually limited to considering only linear relationships or considering a single feature rather than the joint distribution of features. For instance, given evidence about a subset of features, the correlations between the rest of the features may be affected in ways that traditional approaches are usually incapable of capturing. In this paper, we suggest inferring the dynamics between features and classes by sensitivity analysis of trained predictors. This approach employs the hidden information captured inside a black-box network to measure the value of acquiring each feature given the available context. In this paper, we use (6) as a measure of feature informativeness per unit of cost to make feature acquisition decisions. However, an alternative approach, which may result in better accuracies in a certain context, would be to define an objective function that balances the cost versus performance trade-off using a hyper-parameter.

Furthermore, this paper suggests an encoding and decoding approach to create a range of changes so that the final summation of sensitivities is a better approximation of the total sensitivity with respect to each feature. Aside from binary quantization, we have explored different methods such as variable-length and constant-length quantizations; however, while they usually work reasonably well, we decided to use binary encoding as it is more efficient to implement and readers are more familiar with it.

In this paper, a beta corruption function is used to introduce missing features and to train the denoising autoencoder. Based on our experiments, as long as it is chosen reasonably, it does not have any direct influence on the performance of the predictor or the feature acquisition functionality.
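As an illustration of such a beta corruption function, one can draw a per-sample missingness rate from a beta distribution and then mask features independently. The implementation below is an assumed sketch (the (α, β) values, the zero-fill convention for unknowns, and the array shapes are choices made here for demonstration, not necessarily the paper's exact setup).

```python
import numpy as np

def beta_corrupt(x, alpha, beta, rng):
    """Mask features to simulate missingness during DAE training.

    For each sample, a missingness rate is drawn from Beta(alpha, beta);
    each feature is then dropped independently with that probability,
    so no particular feature is biased toward being removed.
    """
    n, d = x.shape
    rates = rng.beta(alpha, beta, size=(n, 1))   # per-sample drop rate
    mask = rng.random((n, d)) < rates            # independent per-feature drops
    corrupted = x.copy()
    corrupted[mask] = 0.0                        # 0 encodes "unknown" here
    return corrupted, mask

rng = np.random.default_rng(0)
x = rng.random((4, 6))
xc, mask = beta_corrupt(x, alpha=1.5, beta=1.5, rng=rng)
print(mask.mean())  # fraction of masked entries
```

Because the mask is sampled independently per feature, averaging over many samples corrupts all features at the same expected rate.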
Specifically, we measured the influence of changing the beta parameters from to , and the changes in the area under the accuracy versus cost curve were less than (see Fig. 8 for an example). In this paper, we suggest beta distribution parameters of α = 1.5 and β = 1.5 for most datasets, and parameters of α = 5.0 and β = 1.5 for sparse datasets such as Mushroom. It is also worth mentioning that the corruption function is applied to all features independently. Therefore, it does not introduce any bias toward certain features.

VI. CONCLUSION
In this paper, we introduced a novel method for cost- and context-aware feature acquisition at test-time. The proposed method, based on denoising autoencoders with binary representation layers, efficiently estimates context-dependent feature distributions and measures the sensitivity of the output with respect to each unknown feature. Furthermore, we evaluated the proposed approach on eight different real-world datasets covering various problem scenarios and applications. Finally, we compared the results of the introduced method with the results of other state-of-the-art approaches in the literature. According to the results, the suggested method is capable of dynamically deciding which features to acquire, based on the feature costs and the available context, in an efficient manner.

REFERENCES

[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[2] M. Kachuee, M. M. Kiani, H. Mohammadzade, and M. Shabany, "Cuffless blood pressure estimation algorithms for continuous health-care monitoring," IEEE Transactions on Biomedical Engineering, vol. 64, no. 4, pp. 859–869, 2017.
[3] K. Early, J. Mankoff, and S. E. Fienberg, "Dynamic question ordering in online surveys," arXiv preprint arXiv:1607.04209, 2016.
[4] P. K. Sharpe and R. Solly, "Dealing with missing values in neural network-based diagnostic systems," Neural Computing & Applications, vol. 3, no. 2, pp. 73–77, 1995.
[5] B. Krishnapuram, S. Yu, and R. B. Rao, Cost-sensitive Machine Learning. CRC Press, 2011.
[6] M. Liu, C. Xu, Y. Luo, C. Xu, Y. Wen, and D. Tao, "Cost-sensitive feature selection via f-measure optimization reduction," in AAAI, 2017, pp. 2252–2258.
[7] H. Ghasemzadeh, N. Amini, R. Saeedi, and M. Sarrafzadeh, "Power-aware computing in wearable sensor networks: An optimal feature selection," IEEE Transactions on Mobile Computing, vol. 14, no. 4, pp. 800–812, 2015.
[8] F. Min, Q. Hu, and W. Zhu, "Feature selection with test cost constraint," International Journal of Approximate Reasoning, vol. 55, no. 1, pp. 167–179, 2014.
[9] P. Cao, D. Zhao, and O. Zaiane, "An optimized cost-sensitive SVM for imbalanced data learning," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2013, pp. 280–292.
[10] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani et al., "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[11] R. Greiner, A. J. Grove, and D. Roth, "Learning cost-sensitive active classifiers," Artificial Intelligence, vol. 139, no. 2, pp. 137–174, 2002.
[12] S. Ji and L. Carin, "Cost-sensitive feature acquisition and classification," Pattern Recognition, vol. 40, no. 5, pp. 1474–1485, 2007.
[13] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[14] M. Chen, Z. Xu, K. Weinberger, O. Chapelle, and D. Kedem, "Classifier cascade for minimizing feature evaluation cost," in Artificial Intelligence and Statistics, 2012, pp. 218–226.
[15] Z. Xu, K. Weinberger, and O. Chapelle, "The greedy miser: Learning under test-time budgets," arXiv preprint arXiv:1206.6451, 2012.
[16] Z. E. Xu, M. J. Kusner, K. Q. Weinberger, M. Chen, and O. Chapelle, "Classifier cascades and trees for minimizing feature evaluation cost," Journal of Machine Learning Research, vol. 15, no. 1, pp. 2113–2144, 2014.
[17] S. Karayev, T. Baumgartner, M. Fritz, and T. Darrell, "Timely object recognition," in Advances in Neural Information Processing Systems, 2012, pp. 890–898.
[18] H. He, H. Daumé III, and J. Eisner, "Cost-sensitive dynamic feature selection," in ICML Inferning Workshop, 2012.
[19] G. Contardo, L. Denoyer, and T. Artières, "Recurrent neural networks for adaptive feature acquisition," in International Conference on Neural Information Processing. Springer, 2016, pp. 591–599.
[20] G. Contardo, L. Denoyer, and T. Artières, "Sequential cost-sensitive feature acquisition," in International Symposium on Intelligent Data Analysis. Springer, 2016, pp. 284–294.
[21] K. Early, S. E. Fienberg, and J. Mankoff, "Test time feature ordering with FOCUS: interactive predictions with minimal user burden," in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2016, pp. 992–1003.
[22] M. Kachuee, A. Hosseini, B. Moatamed, S. Darabi, and M. Sarrafzadeh, "Context-aware feature query to improve the prediction performance," in Signal and Information Processing (GlobalSIP), 2017 IEEE Global Conference on. IEEE, 2017, pp. 838–842.
[23] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[24] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," 1998.
[25] O. Chapelle and Y. Chang, "Yahoo! learning to rank challenge overview," in Proceedings of the Learning to Rank Challenge, 2011, pp. 1–24.
[26] J.-L. Reyes-Ortiz, L. Oneto, A. Sama, X. Parra, and D. Anguita, "Transition-aware human activity recognition using smartphones," Neurocomputing, vol. 171, pp. 754–767, 2016.
[27] D. D. Lewis, "Reuters-21578 text categorization test collection, distribution 1.0," 1997.
[28] J. Schlimmer, "Mushroom records drawn from the Audubon Society field guide to North American mushrooms," GH Lincoff (Pres), New York.
[29] Department of Information and Computer Science, vol. 55, 1998.
[30] D. Ayres-de Campos, J. Bernardes, A. Garrido, J. Marques-de Sa, and L. Pereira-Leite, "SisPorto 2.0: a program for automated analysis of cardiotocograms," Journal of Maternal-Fetal Medicine, vol. 9, no. 5, pp. 311–318, 2000.
[31] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[32] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[33] K. Trapeznikov and V. Saligrama, "Supervised sequential classification under budget constraints," in Artificial Intelligence and Statistics, 2013, pp. 581–589.
[34] B. B. Cambazoglu, H. Zaragoza, O. Chapelle, J. Chen, C. Liao, Z. Zheng, and J. Degenhardt, "Early exit optimizations for additive machine learned ranking systems," in Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 2010, pp. 411–420.
[35] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422–446, 2002.
[36] D. Dheeru and E. Karra Taniskidou, "UCI Machine Learning Repository," 2017.
[37] H. Liu and R. Setiono, "Chi2: Feature selection and discretization of numeric attributes," in