A Scalable Saliency-Based Feature Selection Method with Instance-Level Information
Brais Cancela, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, João Gama
Brais Cancela∗
CITIC, Universidade da Coruña, 15006 A Coruña, Spain. [email protected]
Verónica Bolón-Canedo
CITIC, Universidade da Coruña, 15006 A Coruña, Spain. [email protected]
Amparo Alonso-Betanzos
CITIC, Universidade da Coruña, 15006 A Coruña, Spain. [email protected]
João Gama
LIAAD, INESC TEC, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal. [email protected]
May 1, 2019

ABSTRACT
Classic feature selection techniques remove those features that are either irrelevant or redundant, achieving a subset of relevant features that helps to provide better knowledge extraction. This allows the creation of compact models that are easier to interpret. Most of these techniques work over the whole dataset, but they are unable to provide the user with useful information when only instance-level information is needed. In short, given any example, classic feature selection algorithms do not give any information about which features are the most relevant for that particular sample. This work aims to overcome this handicap by developing a novel feature selection method, called Saliency-based Feature Selection (SFS), based on deep-learning saliency techniques. Our experimental results show that this algorithm can be successfully used not only with Neural Networks, but also under any architecture trained using Gradient Descent techniques.
With the rise of the so-called Big Data, there is an increasing need for techniques that allow us to reduce the input space [1]. These techniques are often divided into two broad groups [2]: Feature Selection (FS) and Feature Extraction (FE). Fig. 1 shows a graphic representation of how these two approaches work.

On the one hand, Feature Extraction approaches reduce the number of characteristics by combining (either linearly or not) the input space features [3]. For instance, a Feature Extraction technique in Deep Learning is the so-called deep features, the data representation obtained if we remove the last Neural Network (NN) layer. Thus, these techniques are able to create a new feature set, often more compact and with higher discrimination capacity. This is the most used technique in image analysis, signal processing and information retrieval [4, 5, 6].

On the other hand, Feature Selection approaches achieve dimensionality reduction by removing irrelevant and redundant features [7]. Because these techniques preserve the original features, they are a useful approach especially in applications where these attributes are essential both for understanding the model and for knowledge inference [8, 9, 10].

FS techniques are often divided into three big groups: filters, which work independently of the inductive model (they can be viewed as a data pre-processing step); and wrappers and embedded methods, which measure feature relevance according to a classifier's performance. However, the idea is quite similar: given any dataset, the task is to

∗ Corresponding author.
select the features that contain the most discriminative information, in a global way. We, however, aim to create an algorithm that, instead of using dataset-level information, uses sample-level information to build a FS model.

Figure 1: Feature Extraction (FE) creates new features by combining elements of the original input, whereas Feature Selection (FS) removes those features that are considered either irrelevant or redundant, while the rest are kept unaltered.

For example, suppose we have a medical dataset and we aim to predict whether a patient is prone to have cancer. Classic Feature Selection algorithms select the most important features that help the classifier to make a good prediction. On the contrary, we propose to infer, for any specific patient, which features the classifier is using to predict a certain output, and to build a FS algorithm using this information. We believe this approach can keep all the advantages of classic Feature Selection, while adding a new powerful tool: the ability to provide customized feature relevance information for each sample.

The most important work in model explainability is the LIME algorithm [11]. The idea is to apply small perturbations to the input, establishing the importance of each feature in the model's decision. Although this is a black-box method that can be used with any classifier model, it has two main drawbacks: 1) it is slow, as it requires too much time to evaluate each sample (up to 10 minutes on the ImageNet dataset, as the authors mention in their work); and 2) it cannot help the model improve its explainability, as it is designed to be used after the model is properly trained.

In this work we propose to solve this problem by introducing a novel Feature Selection algorithm, called Saliency-based Feature Selection (SFS). Our proposal consists in using techniques that are able to provide personalized information for any given example. Once this information is collected, a Feature Selection algorithm is created by keeping those features with a higher discrimination coefficient. As scalability is nowadays also an important requirement for Machine Learning algorithms, our algorithm must also fulfill another specific requisite: it must be able to work in Big Data environments.

The rest of the paper is organized as follows: section 2 explains the personalized information algorithm we are going to use; section 3 provides the metrics used to compute our saliency; section 4 proposes our novel SFS algorithm; section 5 shows an ablation study regarding our approach; finally, section 6 shows some experimental results over public datasets, and section 7 offers some conclusions and future work.
There are only a few research works that address the problem of FS in Big Data environments. In [12, 8] the authors use Deep Bayesian Networks (Deep BN) to select the most relevant features. Although Deep BNs can be used in Big Data, they only provide results on small datasets (either few examples or few features). In [13], the authors claim to use a deep feature selection technique to reduce the input space in short-term wind forecasting models. However, their approach uses Recursive Feature Elimination (RFE), which requires training an exponential number of different models, making it unusable with a high number of features.

Perhaps the most interesting work is presented in [14]. It is called Deep Feature Selection (DFS), and consists of an Elastic Net variant that can be introduced as an extra layer into any Neural Network (NN). However, the authors state that the number of layers has to be reduced for the method to work properly. Thus, it is not suitable for Convolutional Neural Network (CNN) models. This method introduces a mask on the input data, adding an l1- and l2-regularization to that mask, in the same way the Elastic Net is defined. Furthermore, Elastic Net penalties are also applied to the hidden layers.

However, as mentioned earlier, none of these approaches is able to give personalized information for any given example. To address this issue, we propose to use a well-known computer vision technique called saliency.

Saliency is a technique that was first developed for computer vision problems. Its goal is to evaluate the degree of importance of each pixel within an image [15]. Often, NNs are seen as a black box where, given any input and any desired output, it is possible to obtain an accurate prediction that is somehow close to what we should expect. However, NNs do not provide any kind of transparent explanation about how the system reaches the predicted solution. Saliency was created with the purpose of seeing what is happening inside the NN. Nowadays, there are two different approaches to calculate this saliency: supervised and semi-supervised.

The first works used a semi-supervised approach [15, 16].
The idea is simple: given any trained classification NN and any image, a back-propagation routine is used to detect which pixels most influenced the desired output.

Supervised approaches are more recent [17, 18], and they are trained in a different way. In this case, the NN has the same input and output size. Furthermore, for any given image in the training set, we also know which features are the most relevant. For instance, if we are detecting a cat, the important features are the pixels in the image that belong to the cat. The model is trained so that the predicted output matches our previous segmentation of important features. This is the reason why these networks are also called Semantic Segmentation Networks. They are also called Attention Models [19, 20] whenever a Recurrent Neural Network is used at the end to evaluate the quality of the features.

Supervised techniques achieve better results than semi-supervised ones, but they have a major drawback: it is necessary to know, a priori, which are the most relevant features for each instance in order to successfully train the model. For an image dataset this is easy, but it is not always possible to obtain this information in other environments, like DNA microarrays, for instance. Furthermore, supervised techniques can only be used in classification problems; they are not suitable for other tasks, like regression. Because of that, we propose to use a semi-supervised saliency technique.
For our model, we use a generalization of the idea proposed in [15]. Let X ∈ R^{N×R} be our input data, with N being the number of samples and R the number of features, and let Ỹ = f(X; Θ) ∈ R^{N×C} be our classification model (just for the purpose of explaining the model; later we will show how our approach can also be applied to regression problems). It does not matter which type of model we employ, as long as it can be trained by minimizing a loss function (NN, CNN, SVM, ...). C is the number of different classes to evaluate, and Θ are the classifier weights, which are adjusted during the training procedure.

For the purpose of explaining the idea, we can assume that f(X; Θ) is the result of applying the softmax function to a one-layer model (Θ ∈ R^{C×R}). Thus, we have that

    Σ_{c=1}^{C} y_c^(i) = 1,   (1)

where

    y_c^(i) = softmax(θ_c^T x^(i)).   (2)

y_c^(i) is the probability that instance i belongs to class c, and θ_c is the c-th column of Θ.
To train this model, a loss function ℓ(Θ; f, X, Y) is minimized, where Y ∈ R^{N×C} is the one-hot encoding of the classes. Since we are using the softmax function as our output, our minimization function is the categorical cross-entropy, defined as

    ℓ(Θ; f, X, Y) = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_c^(i) log( f(x^(i); Θ)_c ).   (3)

In order to know which features contribute most to activating class c, the solution proposed in [15] is to evaluate the gradient of y_c^(i) with respect to the input. In mathematical terms,

    σ_c^(i) = | ∂y_c^(i) / ∂x^(i) |.   (4)

To give an intuition behind the idea, this gradient indicates how we should modify the input instance in order to maximize its belonging to class c. This technique, however, only works in classification problems. Thus, we need to create a generalization of this method for our purposes.

Instead of applying a gradient function for each class c, our idea is similar to the one used to update the weights during the training procedure. As the loss function usually gives large gradients whenever there is a strong training misclassification, we propose the definition of a Gain Function g(Θ; f, X, Y) that yields large gradients whenever a sample is correctly classified. Thus, our Saliency Function σ is defined as

    σ(ỹ^(i), y^(i)) = | ∂g(ỹ^(i)) / ∂x^(i) |,   (5)

where ỹ^(i) = f(x^(i); Θ) is our model's predicted output for instance i. Below we show how to create this Gain Function g, depending on the loss function we are trying to minimize.

As mentioned earlier, our aim is to develop a saliency system that can work with multiple types of problems.
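For concreteness, the class gradient of Eq. 4 can be computed in closed form for the one-layer softmax model of Eq. 2. The following minimal NumPy sketch illustrates this; the function names are ours, not the paper's:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def saliency(theta, x, c):
    """Class-c saliency of Eq. (4) for the one-layer softmax model of
    Eq. (2): sigma_c = |d y_c / d x|, computed analytically via the
    softmax Jacobian, d y_c / d x = y_c * (theta_c - sum_k y_k theta_k)."""
    y = softmax(theta @ x)                 # class probabilities, shape (C,)
    grad = y[c] * (theta[c] - y @ theta)   # gradient w.r.t. the input
    return np.abs(grad)                    # one score per input feature
```

Features with a large score are the ones the model leans on to assign instance x to class c; in a deep model the same quantity is obtained by back-propagating y_c to the input.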
To that end, we explain how to create our Gain Function g in three different scenarios: one for regression and two for classification (NNs and SVMs).

First, we introduce the Regression Gain Function, as it is the most intuitive one. As a simplification, we assume our model is trained using the Mean Squared Error (MSE) loss:
    ℓ_MSE(Ỹ, Y) = (1/N) Σ_{i=1}^{N} ( ỹ^(i) − y^(i) )²,   (6)

where Ỹ = f(X; Θ) is our model's predicted output. It does not matter if a different loss function is chosen, because all regression losses have the same structure: the loss is zero if the prediction is perfect, and it increases as the prediction moves away from the expected result.

On the contrary, our Gain Function must behave in the opposite way: it must have high values whenever the prediction is good, and values close to zero with poor predictions. Thus, our solution is to use the inverse of the MSE loss function:

    g_MSE(Ỹ, Y) = α / ( ℓ_MSE(Ỹ, Y) + ε ),   (7)

where α is a multiplication factor and ε > 0 is a factor to avoid division by zero. By default, we set α = 1 and ε to a small positive value.
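The regression Gain Function of Eq. 7 is a one-liner. The NumPy sketch below (our naming; the ε value here is illustrative, not the paper's) shows its key property: it is large for good predictions and vanishes as the error grows.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # Eq. (6): mean squared error over the N samples
    return np.mean((y_pred - y_true) ** 2)

def mse_gain(y_pred, y_true, alpha=1.0, eps=1e-3):
    # Eq. (7): inverse of the MSE loss; alpha scales the gain and
    # eps > 0 avoids division by zero
    return alpha / (mse_loss(y_pred, y_true) + eps)
```

A perfect prediction yields the maximum gain α/ε; the saliency of Eq. 5 is then the absolute gradient of this gain with respect to the input.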
Although Eq. 7 might also be used as the Gain Function in classification problems, we found it not very suitable. The reason is its behavior when a total misclassification occurs (ỹ_c^(i) = 0 and y_c^(i) = 1, for instance). Recall that we are using the saliency function to build a feature selection algorithm. Thus, we do not want to receive any information when a total misclassification occurs; that is, our Gain Function should return zero values when this happens. Unfortunately, the Gain Function of Eq. 7 does not satisfy this requirement.

Consequently, we have developed two different gain functions for the two most commonly used classification losses: cross-entropy and hinge loss.

The cross-entropy loss (Eq. 3) is the most commonly used loss function when dealing with NNs for classification, including CNNs. Since our Gain Function should operate in the opposite way, as discussed above, our proposed solution is

    g_CE(Ỹ, Y) = −(α/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_c^(i) log( 1 − ŷ_c^(i) ),   (8)

where

    ŷ_c^(i) = min{ 1 − ε, ỹ_c^(i) },   (9)

in order to avoid a zero-valued logarithm. α and ε behave as in Eq. 7. This function ensures that no saliency is obtained whenever a total misclassification occurs (log(1 − 0) = log(1) = 0). Note that we do not want to kill the gradient when clipping the value in Eq. 9; the gradient should remain unaltered (∇ŷ_c^(i) = 1).

Note that this idea is somewhat similar to the one proposed in [15]. It differs in one point: while the method proposed in [15] decomposes the last network layer to obtain the saliency, we apply our model directly to a Gain Function, which makes it suitable for use in other machine learning approaches, such as regression. Another advantage of our approach is that it returns close-to-zero gradients whenever there is a misclassification. In this sense, our algorithm is able to indicate that there are no relevant features in a misclassified sample. On the contrary, the saliency defined in [15] does not take this crucial information into account.

The hinge loss is often used to train SVMs. In a multiclass problem, it can be defined as

    ℓ_H(Ỹ, Y) = (1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_c^(i) max(0, 1 − ỹ_c^(i)) + (1 − y_c^(i)) max(0, 1 + ỹ_c^(i)),   (10)

that is, our correct class prediction should reach values higher than 1, whereas values lower than −1 are expected for the incorrect classes.

We may note that the function has no gradient when the values are higher than 1 in the correct class (and, symmetrically, lower than −1 in the incorrect ones). Thus, a predicted output of 1 should carry the same information as any higher predicted value. We therefore modified the Gain Function as follows:

    g_H(Ỹ, Y) = −(α/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_c^(i) log( 1 − y̆_c^(i) ),   (11)

where

    y̆_c^(i) = min{ 1 − ε, ( min(1, max(−1, ỹ_c^(i))) + 1 ) / 2 }.   (12)

Again, we do not kill the gradient after clipping (∇y̆_c^(i) = 1).

Our approach is named Saliency-based Feature Selection, or SFS. It is a ranker-based feature selection method; that is, it returns an ordered vector of all features, based on their importance. In Algorithm 1 we show its schema.
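The two classification Gain Functions can be sketched numerically as follows. This is our own NumPy illustration of the forward values only; in a real implementation the clipping of Eqs. 9 and 12 would be done with an identity gradient, and the ε shown is illustrative:

```python
import numpy as np

def ce_gain(y_pred, y_true, alpha=1.0, eps=1e-3):
    """Eq. (8): cross-entropy Gain Function. y_pred and y_true have shape
    (N, C); y_true is one-hot. A total misclassification contributes
    log(1 - 0) = 0, i.e. no saliency."""
    y_hat = np.minimum(1.0 - eps, y_pred)          # Eq. (9)
    n = y_true.shape[0]
    return -(alpha / n) * np.sum(y_true * np.log(1.0 - y_hat))

def hinge_gain(y_pred, y_true, alpha=1.0, eps=1e-3):
    """Eq. (11): hinge Gain Function. Raw SVM scores are clipped to
    [-1, 1] and mapped to [0, 1] (Eq. 12), so scores beyond the margin
    carry no extra information."""
    y_breve = np.minimum(1.0 - eps,
                         (np.clip(y_pred, -1.0, 1.0) + 1.0) / 2.0)
    n = y_true.shape[0]
    return -(alpha / n) * np.sum(y_true * np.log(1.0 - y_breve))
```

A confident correct prediction yields a gain close to −α·log(ε), while a total misclassification yields exactly zero, which is precisely the property motivating both definitions.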
Data: X, Y, f, ℓ, Θ, γ, ε, reps
Result: feature ranking r
n_f ← R   // n_f is the number of alive features
r ← [1 ... n_f]
while n_f > ε do
    X̂ ← X
    X̂[:, r[n_f + 1 : R]] ← 0
    σ_fs ← zeros(n_f)
    for rep ← 1 to reps do
        Initialize f(X̂; Θ)
        Train f(X̂; Θ) given Y
        Ỹ ← f(X̂; Θ)
        σ_fs ← σ_fs + GetSaliency(Ỹ, Y, σ)
    end
    index ← argsort(σ_fs, descend)
    r[1 : n_f] ← r[index]
    n_f ← int(n_f · γ)
end
Algorithm 1: Pseudocode of the SFS Algorithm

The proposed algorithm only contains three hyper-parameters: ε ≥ 0, a stopping criterion; 1 > γ ≥ 0, which controls the fraction of alive features that are kept for the next iteration; and reps, which controls the number of times a model is trained, in order to avoid overfitting.

We start by training the model f with all the features in the feature set and computing the saliency. After that, we sum up the saliency scores and sort all features, obtaining the feature ranking r. Then, we discard the least relevant features, and we repeat the operation until the stopping criterion is satisfied.

Function GetSaliency(Ỹ, Y, σ):
    σ_t ← 0
    C ← number of classes in Y
    for c ← 1 to C do
        σ_c ← Σ_{i_c = 1}^{N_c} σ(ỹ^(i_c), y^(i_c))
        σ_t ← σ_t + σ_c / ‖σ_c‖
    end
    return σ_t
end
Algorithm 2: Saliency Function for Classification

The way to compute the saliency differs depending on whether we are dealing with a classification or a regression problem. For classification, the procedure is explained in Algorithm 2: basically, we compute and normalize the saliency for each class, and after that we sum over all classes. For a regression problem, we simply sum up all the saliency scores, as described in Algorithm 3.
Function GetSaliency(Ỹ, Y, σ):
    σ_t ← Σ_{i=1}^{N} σ(ỹ^(i), y^(i))
    return σ_t
end
Algorithm 3: Saliency Function for Regression

The complexity of this algorithm is variable, as it depends entirely on the γ parameter. In the best scenario, γ = 0, the complexity is linear in the number of instances (O(N)), whereas in the worst scenario, γ ≈ 1, the complexity also depends on the number of variables (O(RN)), as we barely remove one feature in each loop.
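Putting Algorithms 1 and 2 together, the whole SFS loop can be sketched in NumPy as follows. All names are ours: in particular, `train_and_saliency` stands in for a user-supplied routine that trains a fresh model on the masked data and returns one accumulated saliency score per feature; it is an assumption of this sketch, not the paper's API.

```python
import numpy as np

def aggregate_class_saliency(per_class):
    # Algorithm 2: L2-normalize each class's saliency vector and sum,
    # so that no single class dominates the ranking
    total = np.zeros_like(per_class[0], dtype=float)
    for s in per_class:
        total += s / np.linalg.norm(s)
    return total

def sfs_ranking(X, train_and_saliency, gamma=0.2, eps=1, reps=3):
    """Algorithm 1: iteratively train, rank features by accumulated
    saliency, and keep only a gamma-fraction alive for the next round."""
    R = X.shape[1]
    n_f, rank = R, np.arange(R)
    while n_f > eps:
        X_hat = X.copy()
        X_hat[:, rank[n_f:]] = 0.0          # silence discarded features
        sal = np.zeros(n_f)
        for _ in range(reps):               # average over re-trainings
            sal += train_and_saliency(X_hat)[rank[:n_f]]
        order = np.argsort(-sal)            # descending saliency
        rank[:n_f] = rank[:n_f][order]
        n_f = int(n_f * gamma)
    return rank
```

With γ = 0 the loop performs a single one-shot ranking (the O(N) best case discussed above), while γ close to 1 barely removes one feature per round, yielding the O(RN) worst case.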
Table 1: NIPS 2003 Feature Selection Challenge datasets.
In this section we show how the parameters γ and reps affect the behavior of our algorithm. Furthermore, we show how decoupling the model used to obtain the feature ranking from the model used to classify can affect our approach. To accomplish that, we first introduce the datasets that will be used to test our methodology.

We have selected the 5 datasets proposed in the NIPS 2003 Feature Selection challenge. They are synthetic datasets designed with the sole purpose of measuring the quality of feature selection algorithms for classification. The specific characteristics of each dataset are shown in Table 1. Although each dataset only contains two classes, they are challenging because of these factors: few training examples (Arcene), unbalanced data (Dorothea) or a low number of relevant features (Madelon).

Due to their small number of examples, the NIPS 2003 FS challenge datasets are not suitable to test the behavior of our approach in Big Data scenarios. To overcome this issue, we have included four classic computer vision datasets:

• MNIST. The most classic computer vision classification challenge [21]. It contains more than 50,000 handwritten digits (10 classes) stored in grayscale 28 × 28 images.
• FASHION-MNIST. A dataset of Zalando's article images [22]. It contains 60,000 images (10 classes) stored in grayscale 28 × 28 resolution.
• CIFAR-10 and CIFAR-100. Proposed in [23], each one contains more than 50,000 tiny RGB images (32 × 32 × 3) of different objects (car, truck, plane, ...). The first contains images belonging to 10 different labels, whereas the other contains images from 100 different classes.

As mentioned earlier, and differently from classic information-based FS algorithms [24], our model is able to perform FS in regression problems. To test it we have used two different Big Data datasets:

• Relative location of CT slices on axial axis. Originally published in [25], its aim is to discover the relative location of a CT image on the axial axis. It contains 53,500 CT images belonging to 74 different patients. Each image is reduced to two different histograms, resulting in 385 total features.
• Energy Molecule Dataset. Originally published in [26], this dataset contains the ground state energies of 16,242 molecules, each with 1275 features. The aim is to use Machine Learning techniques to quickly estimate the atomization energy, as the simulations needed to compute it require a high computational time.
Figure 2: Effect of the number of training repetitions on our algorithm. The number of repetitions significantly affects the obtained result.
As is well known, it is almost impossible to train the same machine learning model more than once and achieve exactly the same result each time. Aspects like random initialization of the model's weights, or random permutations of the training set, cause the model output to be similar but not identical each time the model is trained. Thus, our first aim was to evaluate how the number of training repetitions can affect the output. Our objective was to check whether it is necessary to, at each step, train the model more than once, computing the saliency ranking as the mean of all repetitions.

We have trained our algorithm using the NIPS 2003 FS Challenge datasets, fixing γ = 0 and trying different numbers of repetitions. We have trained a 3-layer fully-connected NN, using Batch Normalization (BN) [27] and ReLU(x) = max(0, x) activations. The softmax function is used as output, and Eq. 3 as loss function. We have included an l2 weight-decay regularization. We have trained the model for 100 epochs, using Adam [28] as optimizer. To deal with unbalanced data, we have replicated examples until achieving a true balance between classes. All models were created using the Keras framework, with TensorFlow [29] as back-end.

Fig. 2 shows the results obtained. It can be seen that the number of repetitions affects accuracy in most datasets, and that this effect also depends on the number of features. We have conducted a Friedman test, which shows no significant difference between the models when the number of repetitions is higher than 2, except on the Madelon dataset, where there is a low number of relevant features. We also found significant differences between these models and the model with just one repetition. Thus, we recommend using a number of repetitions higher than 2 to achieve a better performance.
In order to analyze the behavior of the γ parameter, we have fixed reps = 1, while the rest of the parameters were the same as in the previous subsection. Fig. 3 shows that the accuracy of our algorithm improves as the γ value increases. This occurs because, when removing some features, some redundancy is also eliminated, causing other features to become more important. Although increasing the γ value helps our algorithm to achieve better scores, the Friedman tests we have conducted suggest there are no significant differences among the models with positive γ, except on the Madelon dataset. Furthermore, a Wilcoxon test suggests there is a significant difference between the γ = 0 model and the models with positive γ over the datasets Arcene, Madelon and Gisette. However, we found no significant difference over the Dexter and Dorothea datasets; in these datasets, we presume the high features/samples ratio is causing this effect.

Our Feature Selection algorithm is an embedded model, as both selection and classification/regression tasks can be performed at the same time. An embedded system selects features that achieve good results in the same model that is used to perform either the classification or the regression task. However, this might lead us to select features that are only valid for the machine learning model we are using to obtain the ranking. This fact raises one question:
How good is our selection? Consequently, we tested our proposal separating the problem into two different tasks: ranking and classifying. By using different models for each task, we test how dependent our ranking is on the model used to obtain it. To this end, we used the four kernel implementations provided by scikit-learn's SVC [30], fixing the parameters C = 1 and degree = 3, and leaving coef0 at its default. Our algorithm's meta-parameters were fixed to a small positive γ and reps = 1; we do not need more repetitions, as the SVM SMO training algorithm [31] is very stable. Table 2 shows the results obtained over the NIPS datasets.

Two main answers arise:

1. Only in a minority of cases is the best result obtained by using the same algorithm for both the ranking and the classification. Since we are using 4 different models, this proportion is close to random.

http://clopinet.com/isabelle/Projects/NIPS2003/
https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis
https://keras.io/
Figure 3: Effect of the hyper-parameter γ on our algorithm. Using γ = 0 leads the system to achieve poor results when using a low number of features; however, there is no significant performance improvement beyond small positive values of γ.
Table 2: Decoupling ranker and classifier: accuracy results. In brackets, the number of features used to achieve the best score. In bold, the best score achieved by each classifier; if multiple rankers achieve the best score, only the one with fewer features is highlighted.

Dataset    Ranker    Linear        Poly          RBF           Sigmoid
Arcene     Linear    (             (9750)        85.0 (48)     (327)
           Poly      84.0 (5578)   (9750)        81.0 (1088)   73.0 (438)
           RBF       83.0 (1269)   (9750)        (487)         74.0 (3622)
           Sigmoid   (1145)        (9750)        81.0 (1088)   72.0 (3715)
Dexter     Linear    (1419)        93.3 (99)     (105)         (102)
           Poly      93.6 (7071)   91.0 (66)     90.7 (33)     90.3 (68)
           RBF       94.0 (7071)   92.0 (93)     92.7 (37)     91.7 (40)
           Sigmoid   93.7 (3384)   (66)          92.3 (54)     92.0 (64)
Dorothea   Linear    94.9 (222)    94.6 (216)    94.6 (79)     (11318)
           Poly      (69)          (88)          (65)          94.9 (150)
           RBF       94.6 (26)     94.6 (25)     94.6 (23)     94.9 (27)
           Sigmoid   94.3 (71)     94.6 (75)     94.9 (96)     94.9 (85)
Gisette    Linear    (1080)        98.3 (714)    98.1 (351)    97.6 (351)
           Poly      98.1 (168)    98.2 (360)    98.2 (470)    97.9 (240)
           RBF       98.2 (275)    98.2 (581)    98.2 (412)    98.0 (412)
           Sigmoid   98.0 (315)    (483)         (351)         (351)
Madelon    Linear    58.2 (218)    71.7 (94)     73.7 (271)    (374)
           Poly      61.0 (9)      (18)          88.0 (11)     52.3 (500)
           RBF       60.5 (5)      70.3 (96)     89.3 (12)     53.7 (3)
           Sigmoid   (4)           71.3 (108)    (17)          52.3 (500)

2. In three different datasets (Arcene, Dorothea and Gisette), the best ranker is the same for three different classifiers. This suggests the selected features are not substantially affected by the type of classifier selected: a good ranker model performs well no matter which classifier we use afterwards.

Thus, it seems clear that it is more important to have a good classifier when selecting the features than to use the same model for both ranking and classifying.
As seen in the previous subsection, a good classifier selection is crucial in order to obtain a solid feature selection, and thus we also want to evaluate how the overfitting of a model might affect the selection of features. In this case, we have only used the Arcene dataset to illustrate the scenario, as it provides very meaningful results. In the previous subsection it can be seen that the best results for the Arcene dataset were achieved using SVM-RBF for both ranking and classifying, with C = 1 as meta-parameter. Thus, our experiment consists in, using the same classifier configuration, checking how the result varies as we modify the C parameter during the FS step. Our algorithm's meta-parameters were fixed to their minimum values (γ = 0 and reps = 1).

Fig. 4 shows the results obtained. The best results are achieved when using the same C value that was also employed during the classification step. It can be seen that using a lower value does not substantially affect the result. However, when introducing overfitting (high C values), the quality of the result is drastically reduced. Thus, we strongly recommend the use of classifiers that are able to prevent overfitting during the FS step, as this problem could introduce noise into the FS ranking system.

In order to test the adequacy of our proposal, we have compared it against the most representative feature selection methods available nowadays, namely: (a) scikit-learn's LASSO and Elastic Net implementations [32, 33]; (b) the MIM revision developed in [34], also implemented in scikit-learn; (c) the ReliefF algorithm [35], implemented in the Skrebate package [36]; and (d) the DFS algorithm [14], using our own Keras implementation. In this last case, and in order to make a fair comparison, we have used the DFS mask values as the feature ranking. This approach
Figure 4: Effect of the overfitting problem during the FS step. As the classifier increases its overfitting, the quality of the selection decreases substantially.

was not mentioned in the original paper [14], as the authors just train the model with the new mask and constraint (like the LASSO and Elastic Net behavior). However, experimental results show that this approach achieves better results when using our variation. The code with our implementations is available on GitHub.

As the number of instances is relatively low in all datasets, we have followed the same approach explained in the ablation study of the previous section; that is, we have used four SVM variants to test the performance of every algorithm. For the DFS algorithm, we have used the three-layer NN that was also explained in the previous section as the feature selection algorithm. We have also tried the same network with our proposed approach.
Table 3 shows the results obtained. Overall, the best results are achieved when using the polynomial kernel. However, for all the other tested algorithms but our approach, these are achieved using practically all features (in the most favourable case, 81.6% of the total). Differently, our algorithm is able to achieve the same best accuracy as the polynomial kernel, but using an RBF kernel and only 4.87% of the whole feature set.

The problem found with the polynomial kernel on the Arcene dataset also appears on the Dexter dataset, although this time with the linear kernel, as we can see in Table 4. Again, our algorithm outperforms the other techniques, independently

https://github.com/braisCB/SFS
Table 3: Arcene accuracy results and, in parenthesis, the number of features used to achieve the best score. The baseline is the classifier accuracy without removing any feature. Same ranker means the same SVM model is used both as ranker and as classifier. Bold shows the result of the best ranker-classifier combination.

of the SVM kernel selected. This time our proposal not only considerably reduces the number of features used (100% for the best result in other methods, while ours uses 7.1%), but also slightly improves the accuracy obtained.

Table 5 shows that the best result is achieved when using the DFS algorithm with an RBF kernel. However, our approach is able to achieve the same accuracy, both with polynomial and RBF kernels, but requiring more features (57 for the DFS algorithm and 283 for our proposal, respectively 0.057% and 0.3% of the complete set of features).
Table 6 shows that our algorithm achieves the highest score when using the Neural Network as ranker, although DFS obtains similar accuracy with fewer features (130 for DFS versus 291 for our approach, respectively 2.6% and 5.8% of the total set of features). Compared with either MIM or ReliefF, our algorithm is able to systematically select fewer features.
The ReliefF algorithm achieves the best score, as Table 7 shows. However, we may note that our algorithm is also able to achieve the same score, but using more features (1.8% and 5% of the total number of features). To sum up, our algorithm is able to achieve the same accuracy results as those obtained by the state-of-the-art algorithms for all NIPS datasets, or even improve them slightly. Regarding the number of features, our proposal
Table 5: Dorothea accuracy results and, in parenthesis, the number of features used to achieve the best score. The baseline is the classifier accuracy without removing any feature. Same means the same SVM model is used both as ranker and as classifier; Best shows the result of the best ranker-classifier combination.

Table 6: Gisette accuracy results and, in parenthesis, the number of features used to achieve the best score. The baseline is the classifier accuracy without removing any feature. Same means the same SVM model is used both as ranker and as classifier; Best shows the result of the best ranker-classifier combination.
Table 8: SFS performance in regression problems: Relative location of CT slices on axial axis dataset, MAE results.

behaves extraordinarily well in some datasets, in which the reduction of the features needed is remarkable, while in other cases, in which the existing methods accomplish an important reduction, our approach needs approximately double.

We conducted two experiments to show the behavior of our SFS algorithm in a regression problem. As mentioned earlier, we have used both the
Relative location of CT slices on axial axis and the
Energy Molecule dataset. To perform the experiments, we selected the same approach presented in Section 5.2.1: a 3-layer NN, along with BN and the ReLU activation function. Both differ only in the output (now it is just one node) and the loss function (MSE). A 5-fold cross-validation was performed.

Table 8 shows the obtained results for the Relative location of CT slices on axial axis dataset. We compare our approach with the DFS algorithm, using an input mask with a weight penalty. Additionally, we also consider using our SFS method adding the DFS mask at the input. We refer to that configuration as SFS + DFS. We have fixed the parameter reps to the same value in all experiments. The results show that SFS and the combination SFS + DFS, both with a 3-layer NN, achieved the best results. For SFS alone, 192 features (that is, the whole set of features) were needed, while for the combination of DFS with our approach, only 96 (50%) were needed for an almost equal accuracy.

Using the same configuration, Table 9 shows the obtained results for the
Energy Molecule dataset. The results show that SFS with γ = 0. is able to achieve almost the best score using just 127 features (10% of the whole dataset). Either SFS or SFS + DFS achieves the best score in all the different configurations.
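The DFS-style input mask used in the SFS + DFS configuration can be illustrated with a small numpy sketch: a trainable elementwise mask m multiplies the input, and an l1 penalty on m shrinks the entries of uninformative features toward zero. The linear model, data, and hyperparameters below are toy assumptions for illustration only, not the paper's configuration.

```python
import numpy as np

# Toy sketch of a DFS-style trainable input mask: f(x) = w . (m * x),
# with an l1 penalty on the mask m. Data, model and hyperparameters are
# illustrative assumptions, not the paper's configuration.
X = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
y = X[:, 0]                 # target depends on feature 0 only
w = np.array([1.0, 1.0])    # fixed "trained" model weights
m = np.array([0.5, 0.5])    # trainable mask, initialised uniformly
lr, lam = 0.05, 0.1         # learning rate and l1 strength

for _ in range(200):
    pred = X @ (w * m)                        # forward pass through the mask
    grad = 2 * X.T @ (pred - y) * w / len(X)  # MSE gradient w.r.t. m
    m -= lr * (grad + lam * np.sign(m))       # gradient step + l1 shrinkage

print(np.round(m, 2))  # mask keeps feature 0, drives feature 1 to ~0
```

In the actual SFS + DFS configuration, the mask sits in front of the network and is trained jointly with the rest of the weights; the effect on irrelevant features is the same as in this toy version.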
One of the main advantages of our method is that it is possible to use it in Big Data environments, as it was developed to be used in state-of-the-art architectures like Convolutional Neural Networks. To that end, we have tested the behavior of our approach over four different datasets using the Wide Residual Network WRN-16-4 [37] as classifier.

As ranking method, we have tried two different configurations: the previously mentioned WRN-16-4 and a standard 3-layer CNN. The latter consists of two convolutional layers with 16 and 32 channels, respectively, each followed by a Max-Pooling layer. Finally, two Fully Connected layers are used: the first one contains 1024 nodes, and the last one has the size of the number of labels in the dataset (100 for CIFAR-100, 10 for the others). Both Batch Normalization (BN) [27] and the ReLU activation function are applied right after all hidden layers, and the Softmax function is applied to the output. The Adam optimizer [28] is used to train the model. In both networks we use the categorical cross-entropy as loss function, together with a weight penalty. We have conducted experiments on four well-known image databases described in the sections above: MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100.
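To make the layer sizes of this ranker concrete, the following sketch traces the tensor shapes through the described 3-layer CNN. The 2×2 pooling window and 'same' convolution padding are our assumptions, since the text does not fully specify the kernel and pooling sizes.

```python
# Sketch of the tensor shapes through the 3-layer CNN ranker described
# above. 'Same' convolutions and 2x2 max-pooling are assumptions, as the
# text does not fully specify kernel and pooling sizes.
def cnn_shapes(side, in_channels, n_labels, conv_channels=(16, 32), fc=1024):
    """Trace (H, W, C) through conv -> pool -> conv -> pool -> FC -> FC."""
    shapes = [(side, side, in_channels)]
    h = side
    for c in conv_channels:
        h //= 2                      # 'same' conv keeps H/W; 2x2 pool halves
        shapes.append((h, h, c))
    shapes.append((h * h * conv_channels[-1],))  # flatten before the FCs
    shapes.append((fc,))                         # first Fully Connected layer
    shapes.append((n_labels,))                   # output layer (Softmax)
    return shapes

print(cnn_shapes(28, 1, 10))    # MNIST / Fashion-MNIST: 28x28 grayscale
print(cnn_shapes(32, 3, 100))   # CIFAR-100: 32x32 RGB (10 labels for CIFAR-10)
```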
Table 10: MNIST accuracy results using a WRN-16-4 [37] as classifier.

Table 11: Fashion-MNIST accuracy results using a WRN-16-4 [37] as classifier.

Table 12: CIFAR-10 accuracy results using a WRN-16-4 [37] as classifier.

Table 13: CIFAR-100 accuracy results using a WRN-16-4 [37] as classifier.
No data augmentation techniques were used to train the models. The 3-layer CNN was trained for 40 epochs, while the WRN-16-4 required 80 epochs. Table 10 shows the obtained results. As can be observed, our approach achieved the best scores balancing accuracy and number of features used. It is worth mentioning that the accuracy of the classifier improves considerably when using a high γ value. In the next section we will show our intuition behind this effect.

In this case, as data augmentation we used random horizontal flips, along with horizontal and vertical random shifts. As this dataset is more complex than MNIST, we increased the number of training epochs for both the 3-layer CNN and the WRN-16-4. Table 11 shows the obtained results. Overall, our SFS approach with γ = 0. achieved the best scores.

We used for this case the same data augmentation and training configuration presented for the Fashion-MNIST dataset, but increasing the random shifts. Table 12 shows the obtained results. The use of DFS helps the model to achieve better results whenever the number of kept features is low. On the contrary, a high γ value is useful with a higher number of features.

We used the same training configuration as in the previous CIFAR-10 experiment. Table 13 shows the obtained results. In this dataset our results are not clearly better than those of DFS. We believe that the overfitting problem is causing these results (although the training accuracy is close to 100%, the test accuracy never reaches 70%).
As we mentioned before, to our knowledge this is the first Feature Selection algorithm, besides tree-based techniques, that is able to provide the feature importance for each sample. The main advantage of our proposal is that we can provide an explanation about the classifier's decision. However, we would like to focus on a different scenario: checking the classifier's reliability.

If we take a look at the MNIST results (Table 10), we can see that the best results are achieved when using the 3-layer CNN as ranker instead of the WRN-16-4 model. This result was not expected, because the latter is a better classifier than the former. Thus, our intuition about this effect was reduced to two points:

1. Max-pooling: We think the main disadvantage of our algorithm is that it does not manage feature correlation, unless the training algorithm does it. In the case of a CNN model, local correlations can be detected by using pooling layers. Contrary to the WRN-16-4 classifier, the 3-layer CNN contains two max-pooling layers, which could help our algorithm to achieve better scores.

2. Over-fitting: During the training step, the training set accuracy always reaches a perfect score when using the WRN-16-4. This could lead our algorithm to a bad generalization behavior, as explained in the ablation study section (see Fig. 4).

Thus, we conclude that the quality of the classifier highly affects the behavior of our algorithm. The MNIST results also show that test accuracy is not a good metric to determine how suitable a classifier is to be used as ranker in our algorithm.

For this reason, we carried out a different experiment, taking advantage of the saliency's properties. As we have previously defined, our saliency function is the gradient of the gain function with respect to the input. Thus, it measures how we have to modify our input to increase the probability of belonging to the desired class. So, we decided to use our saliency function to do exactly the opposite. The idea is to answer this simple question:

• How much do I have to change a sample to change the classifier's output?

This topic, called
Adversarial Examples, has gained a lot of relevance in recent years. Several techniques have been created in order to evaluate the reliability of a classifier. For instance, the Fast Gradient Sign Method (FGSM) [38] and its variants [39], and Projected Gradient Descent (PGD) [40] use the saliency gradient to evaluate how much an input sample must be modified to cheat the classifier. However, we want to extend these works,
Table 14: Adversarial images. Given an original image X, minimal perturbation needed to force a trained classifier to predict each label (0 through 9) with high certainty, for a WRN-16-4 trained without and with input noise.

in order to evaluate how reliable a classifier is. We do not focus on how much an image has to be modified, but on whether the generated image can also cheat a human eye.

To answer this question, we have conducted an experiment, which is shown in Table 14. Given an original image and a trained classifier, we use the saliency output to make perturbations in this image, in order to cheat the classifier and obtain wrong predictions with high confidence. The first row shows the perturbations needed when using a WRN-16-4 model, which achieves a high accuracy on the testing set. We can see that the resulting images have no substantial visual difference with respect to the original one. It looks like only random noise was added to the original image. To put it in different words, ten different images that look extremely similar are able to obtain completely different predictions from our WRN-16-4 model.

As we concluded that introducing white noise can easily fool our WRN-16-4 model, we re-trained it, but this time introducing some random Gaussian noise in the training images. Apparently, the quality of the classifier decreases, as we obtain a slightly lower accuracy on the testing set. However, when we take a look at the images (Table 14, second row), we can see that the perturbations look completely different from the original image, and that it should be easy to establish the classifier prediction by just looking at the input. Therefore, we conclude that, although the second model achieves a lower accuracy, it is a more reliable model.

This is a very interesting effect that could lead to potential implications in the training algorithm.
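The perturbation procedure behind Table 14 can be sketched in a few lines: repeatedly nudge the input along the sign of the saliency gradient of the target class until the classifier's confidence crosses a threshold. The logistic toy model, step size, and 0.9 threshold below are illustrative assumptions, not the WRN-16-4 setup used in the experiment.

```python
import numpy as np

# Toy sketch of the saliency-driven perturbation behind Table 14: FGSM-style
# steps along the sign of the target-class gradient until the classifier is
# confidently fooled. Model, step size and threshold are illustrative.
w, b = np.array([1.0, -1.0]), 0.0   # toy logistic classifier for class 1

def p_class1(x):
    """Predicted probability of the target class."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def perturb_until_fooled(x, eps=0.1, conf=0.9, max_steps=200):
    """Perturb x until p(class 1) exceeds `conf`."""
    x = x.copy()
    for _ in range(max_steps):
        if p_class1(x) > conf:
            break
        # for this linear model, the saliency of the class-1 score w.r.t.
        # the input is proportional to w; take a signed step along it
        x += eps * np.sign(w)
    return x

x0 = np.array([-2.0, 2.0])          # confidently class 0 to begin with
x_adv = perturb_until_fooled(x0)
print(p_class1(x0), p_class1(x_adv))  # low confidence -> above 0.9
```

With a deep network, the same loop replaces the hand-derived gradient with an automatically differentiated saliency, which is exactly how FGSM-style attacks operate.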
As we can easily obtain images that are able to fool our model, we can make adjustments to our training set so that we can improve the reliability of our model. This is a very important factor in topics like decision support systems, as we can base our conclusions on a more solid explanation. It could also have potential implications in medical analysis since, given any sample, we can show doctors our conclusion and how the input should be modified to change the prediction. Together with a dictionary of potential treatments and their effects on our input parameters, it could also lead to treatment recommendations.

In this paper we have presented a novel Feature Selection ranking approach, called Saliency-based Feature Selection (SFS). Contrary to classic Feature Selection approaches, our algorithm is able to rank the importance of each feature at an instance level, rather than over the whole dataset. Besides this advantage, experimental results on challenging datasets show that our algorithm is able to achieve state-of-the-art results under different configurations, making it suitable to be used in any kind of problem, be it classification or regression. Contrary to classic information-based Feature Selection techniques, the reduced complexity of our SFS algorithm (it can be computed simultaneously with the classification or regression training) allows it to be used on high-dimension datasets.

As future research, we aim to use our SFS technique to define a metric that evaluates a model's degree of robustness, that is, to measure how hard it is to fool either a classification or a regression architecture, instead of just using visual cues. We also want to test our adversarial images as part of an explainable model in a real scenario like medical information.
This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN2015-65069-C2-1-R), and by the Xunta de Galicia (research projects ED431C 2018/34 and Centro singular de investigación de Galicia, accreditation 2016-2019), and the European Union (European Regional Development Fund - ERDF). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Brais Cancela acknowledges the support of the Xunta de Galicia under its postdoctoral program.
References

[1] Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, and Samee Ullah Khan. The rise of "big data" on cloud computing: Review and open research issues. Information Systems, 47:98–115, 2015.
[2] Shigeo Abe. Feature selection and extraction. In Support Vector Machines for Pattern Classification, pages 331–341. Springer, 2010.
[3] Isabelle Guyon and André Elisseeff. An introduction to feature extraction. In Feature Extraction, pages 1–25. Springer, 2006.
[4] Adriana Romero, Carlo Gatta, and Gustau Camps-Valls. Unsupervised deep feature extraction for remote sensing image classification. IEEE Transactions on Geoscience and Remote Sensing, 54(3):1349–1362, 2016.
[5] Yushi Chen, Hanlu Jiang, Chunyang Li, Xiuping Jia, and Pedram Ghamisi. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 54(10):6232–6251, 2016.
[6] Praveen Krishnan, Kartik Dutta, and CV Jawahar. Deep feature embedding for accurate recognition and retrieval of handwritten text. In Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on, pages 289–294. IEEE, 2016.
[7] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.
[8] Qin Zou, Lihao Ni, Tong Zhang, and Qian Wang. Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters, 12(11):2321–2325, 2015.
[9] Yahong Han, Yi Yang, Yan Yan, Zhigang Ma, Nicu Sebe, and Xiaofang Zhou. Semisupervised feature selection via spline regression for video semantic recognition. IEEE Transactions on Neural Networks and Learning Systems, 26(2):252–264, 2015.
[10] Jasmina Novaković. Toward optimal feature selection using ranking methods and classification algorithms. Yugoslav Journal of Operations Research, 21(1), 2016.
[11] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
[12] Rania Ibrahim, Noha A. Yousri, Mohamed A. Ismail, and Nagwa M. El-Makky. Multi-level gene/miRNA feature selection using deep belief nets and active learning. In Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, pages 3957–3960. IEEE, 2014.
[13] Cong Feng, Mingjian Cui, Bri-Mathias Hodge, and Jie Zhang. A data-driven multi-model methodology with deep feature selection for short-term wind forecasting. Applied Energy, 190:1245–1257, 2017.
[14] Yifeng Li, Chih-Yu Chen, and Wyeth W. Wasserman. Deep feature selection: theory and application to identify enhancers and promoters. Journal of Computational Biology, 23(5):322–336, 2016.
[15] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[16] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 5188–5196. IEEE, 2015.
[17] Rui Zhao, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Saliency detection by multi-context deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1265–1274, 2015.
[18] Dingwen Zhang, Deyu Meng, and Junwei Han. Co-saliency detection via a self-paced multiple-instance learning framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5):865–878, 2017.
[19] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
[23] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[24] Verónica Bolón-Canedo, Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos. A review of feature selection methods on synthetic data. Knowledge and Information Systems, 34(3):483–519, 2013.
[25] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian Pölsterl, and Alexander Cavallaro. 2D image registration in CT images using radial image descriptors. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 607–614. Springer, 2011.
[26] Burak Himmetoglu. Tree based machine learning framework for predicting ground state energies of molecules. The Journal of Chemical Physics, 145(13):134101, 2016.
[27] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[29] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[31] John Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. 1998.
[32] Seung-Jean Kim, Kwangmoo Koh, Michael Lustig, Stephen Boyd, and Dimitry Gorinevsky. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.
[33] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[34] Brian C. Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 9(2):e87357, 2014.
[35] Igor Kononenko. Estimating attributes: analysis and extensions of RELIEF. In European Conference on Machine Learning, pages 171–182. Springer, 1994.
[36] Ryan J. Urbanowicz, Randal S. Olson, Peter Schmitt, Melissa Meeker, and Jason H. Moore. Benchmarking relief-based feature selection methods. arXiv preprint arXiv:1711.08477, 2017.
[37] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[38] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[39] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
[40] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.