Contingency Training
Danilo Vasconcellos Vargas, Hirotaka Takano and Junichi Murata
Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan
Abstract:
When applied to high-dimensional datasets, feature selection algorithms might still leave dozens of irrelevant variables in the dataset. Therefore, even after feature selection has been applied, classifiers must be prepared for the presence of irrelevant variables. This paper investigates a new training method called Contingency Training, which increases the accuracy as well as the robustness against irrelevant attributes. Contingency training is classifier independent. By subsampling and removing information from each sample, it creates a set of constraints. These constraints aid the method in automatically finding proper importance weights for the dataset's features. Experiments are conducted with contingency training applied to neural networks over traditional datasets as well as datasets with additional irrelevant variables. In all of the tests, contingency training surpassed the unmodified training on datasets with irrelevant variables and even slightly outperformed it when only a few or no irrelevant variables were present.
Keywords:
Irrelevant Variables, Contingency Training, Classification, Neural Networks, Feature Weighting, Dimensional Reduction.
1. INTRODUCTION
Real world classification problems often involve multiple features (the feature space is high dimensional). Beyond being computationally expensive, these problems suffer from the curse of dimensionality, which makes them harder for various reasons (distance metrics become less useful, sampling becomes more expensive as the volume grows exponentially with the number of dimensions, etc.) [1-3]. One way of alleviating the curse of dimensionality is to use feature selection algorithms to decrease the number of features [4].

Feature selection methods have been applied for some time and their success is widely known [5-7]. Employing feature selection has further advantages, for example speeding up the learning process and facilitating the understanding of the model. However, it is difficult to select exactly all the relevant features. To avoid the risk of excluding relevant variables and losing information, dozens of irrelevant variables might still remain in the dataset.

To prepare a classifier for the eventual presence of irrelevant variables, in this paper we propose a new training method for supervised learning, hereby called Contingency Training. The purpose of contingency training is to improve the accuracy and the robustness to irrelevant variables of any classifier by modifying its training dataset. The proposed method creates, from the initial dataset, a bigger and more difficult dataset with some of its values missing.
In fact, the proposed method generates a mixture of constraints with the interesting property that they do not possess a bias, since they are generated by a uniform distribution.

The following are highlighted as the most salient advantages of the approach:
• Robustness against irrelevant variables;
• Simplicity: contingency training can be implemented easily by just modifying the samples of the training dataset;
• Accuracy: it provides a relevant increase in accuracy;
• Classifier independence: it works for any classifier algorithm without any required modification of the classifier itself.

Contingency training is not a feature selection algorithm. In fact, the two methods have different objectives and should work together to enable a system to solve high-dimensional problems. On the one hand, the objective of feature selection is to remove irrelevant features and decrease dimensionality; on the other hand, contingency training aims to improve the performance of any classifier in the presence of irrelevant variables.
2. RELATED METHODS
Functionally, contingency training relates to the idea of feature weighting. The idea of weighting the importance of variables is not new; it was developed in various methods for feature selection [6, 8]. However, few methods can use a feature importance weighting vector directly (one example is k-means [9]). Most methods can only make use of a weighting vector by setting a threshold and removing the least important variables [10, 11]. Contingency training is the first to perturb the dataset to force feature weighting inside the learning phase; it is algorithm independent and can make any supervised learning algorithm take the importance of the variables into account.

Structurally, the proposed method relates to methods that pre-process the dataset [12]. But there is no dataset pre-processing method with an objective similar to that of the proposed method. Their objectives range from efficiency improvement for large datasets (e.g., subsampling the dataset to decrease its overall size [13]) to privacy preservation (e.g., perturbing the dataset to preserve privacy while keeping the important information [14]).
3. CONTINGENCY TRAINING
Let S be the final set of training samples used to train a given classifier. Here, n and nv are defined to be, respectively, the number of samples and the number of variables. {s_1, ..., s_n} are the initial training samples, where s_i = {v_{i1}, ..., v_{i,nv}} defines a sample composed of nv variables (v). Using these definitions, contingency training is explicitly described in Table 1.

Table 1 Contingency Training Algorithm
1. S = {s_1, ..., s_n}
2. CS = ∅
3. while (checkCriterion())
   (a) newsample = replaceWithMissing(S)
   (b) CS = CS ∪ {newsample}
4. S = CS ∪ S

The function for creating missing values (replaceWithMissing()) replaces some values in the dataset with missing values uniformly, with probability prob over all variables. The criterion to stop creating samples (checkCriterion()) may be set to a maximum constant over the size of the dataset, or may even be based on the learning error of the algorithm. In this article, we use a maximum constant for checkCriterion(). The next section covers a notation for specifying the contingency training which determines precisely both the maximum constant and the probability prob.

3.1 Notation

This section describes the notation which specifies exactly how the dataset is created in the tests. Let p_a be the size of the artificial dataset (composed of only artificial samples) created from samples of the initial dataset by replaceWithMissing(), and let p_i be the original size of the initial dataset. Furthermore, let r_a = p'_a / p_i and r_i = p'_i / p_i, where p'_a and p'_i are, respectively, the sizes of the artificial and initial datasets that were used to compose the final dataset S. Note that usually p'_a = p_a, but p'_i = 0 may often hold true and therefore p'_i ≠ p_i. Now it is possible to define a notation r_a A / r_i I / prob which uniquely specifies a given contingency training setting, where prob is the same probability of replacing values with missing values explained above.
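The algorithm of Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation (the paper used R); the function names and the use of None as the missing-value marker are our own choices, and checkCriterion() is instantiated here as a fixed maximum over the artificial dataset's size, as the paper does.

```python
import random

def replace_with_missing(samples, prob):
    """Pick one training sample at random and independently replace each of
    its values with a missing-value marker (None) with probability prob."""
    base = random.choice(samples)
    return [None if random.random() < prob else v for v in base]

def contingency_training_set(samples, prob, max_artificial):
    """Build the final training set S = CS ∪ S_initial, where CS holds
    max_artificial artificial samples (checkCriterion() realized as a
    fixed maximum constant)."""
    cs = []
    while len(cs) < max_artificial:        # checkCriterion()
        cs.append(replace_with_missing(samples, prob))
    return cs + [list(s) for s in samples]  # S = CS ∪ S
```

In the paper's notation, calling this with max_artificial = r_a * p_i and keeping (or dropping) the original samples corresponds to a given r_a A / r_i I / prob setting.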
In the sections above, it was pointed out that some values are replaced by missing values, but no specific information was given on how to represent the missing values explicitly. This paper represents every missing value by the zero value, independent of the dataset. Additionally, it concatenates each sample with a binary vector of the same size as the sample itself, containing for each variable of the original sample either one (if the respective value is not missing) or zero (if the respective value is missing). Figure 1 gives an example of the representation used.
Fig. 1 Example of how the method replaces non-missing values with missing values and constructs the final representation.

This representation is used to provide the dataset with enough information to enable learning even when the dataset originally has plenty of zero values. In preliminary tests, we observed that the concatenation performed better than other forms of representation (such as adding an offset value).
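The zero-plus-mask encoding described above can be sketched as follows (a minimal illustration; the function name is ours, and None is assumed as the missing-value marker):

```python
def to_masked_representation(sample):
    """Encode a possibly incomplete sample: missing values (None) become 0,
    and a binary presence vector of the same length is appended
    (1 = value present, 0 = value missing)."""
    values = [0.0 if v is None else float(v) for v in sample]
    mask = [0.0 if v is None else 1.0 for v in sample]
    return values + mask
```

The appended mask is what lets the classifier distinguish a genuine zero in the data from a value that was removed by the training procedure.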
4. EXPERIMENTS
The tests are conducted with trials using different train/test splits (the training dataset has a size of 75% of the entire dataset, while the testing dataset has the remaining 25%) over the following datasets from the UCI machine learning repository [15]: Glass, Wine, Zoo, Iris and "Pima Indians Diabetes" (which we will refer to as "Diabetes"), plus an additional dataset called Az-5000 corresponding to a character recognition problem created by [16]. The number of classes, samples and variables of each dataset is provided in Table 2. Moreover, Table 2 includes two datasets (Iris and Diabetes) which were modified to contain additional irrelevant variables created from a uniform distribution Unif(0, α), where α is itself drawn from a second uniform distribution with lower bound 1 and fixed for all samples of the same input.

For the classifier implementation we used the nnet package from R [17], which is a feed-forward multi-layer perceptron (MLP) [18, 19] with a single hidden layer and Broyden-Fletcher-Goldfarb-Shanno (BFGS) as the learning algorithm [20]. Table 3 shows the parameters for the neural network. (The dataset from UCI has an index as its first attribute and an ordered class; this index was removed to avoid a trivial direct mapping from index to class.)

Table 2 Characteristics of the Datasets
Datasets            Number of variables  Number of classes  Number of samples
Wine                13                   3                  178
Zoo                 16                   7                  102
Diabetes            8                    2                  768
Glass               10                   7                  214
Az-5000             18                   26                 5000
Modified Iris       24                   3                  150
Modified Diabetes   28                   2                  768
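The construction of the irrelevant variables described above can be sketched in Python. This is an illustration under stated assumptions: the paper fixes each column's scale α ~ Unif(1, ·) across all samples, but the upper bound of that distribution is not legible in this copy, so alpha_max below is a hypothetical parameter and the function name is ours.

```python
import random

def add_irrelevant_variables(samples, n_extra, alpha_max=10.0):
    """Append n_extra irrelevant columns to every sample. Each new column j
    gets its own scale alpha_j ~ Unif(1, alpha_max), fixed across all samples
    (alpha_max is an assumed bound, not taken from the paper); the column's
    values are then drawn from Unif(0, alpha_j)."""
    alphas = [random.uniform(1.0, alpha_max) for _ in range(n_extra)]
    return [list(s) + [random.uniform(0.0, a) for a in alphas] for s in samples]
```

Fixing α per column (rather than per value) makes each irrelevant variable look like a consistent, but uninformative, feature.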
Table 3 Neural Network's Parameters
Initial weights       uniform in [−., .]
Weight decay          e−.
Hidden nodes          15
Maximum iterations    1000000
Output units          logistic function

The contingency training parameters will be defined per test using the notation defined in Section 3.1.
This section focuses on datasets with irrelevant variables. Figures 2 and 3 show the results for the datasets with additional irrelevant variables, including the Az-5000 dataset (character recognition task). Exceptionally, the test on the Az-5000 dataset had ten trials instead of the usual number of trials. With the proposed method, the median accuracy increased for the Az-5000, Iris, Diabetes, Wine, Glass and Zoo datasets. These results demonstrate that contingency training is a promising way of preparing classifiers for the eventual presence of irrelevant variables. The reason why the proposed method surpasses the usual training method will be explained in Section 5.

The objective of this section is to test the algorithm on datasets which have only a few or no irrelevant variables and check whether the performance remains the same. Figures 4 and 5 show the results for the usual training (training with the unmodified dataset) and contingency training under two different settings of the algorithm. Contingency training performed similarly to, and even slightly surpassed, the usual training in the majority of the tests. Furthermore, the similarity of the two figures reveals that it is possible to replace the original dataset entirely with the artificially created one (i.e., a setting with r_i = 0) and still obtain equally improved results.
5. EXPLANATION
The proposed method creates a mixture of constraints by removing information from the samples. An interesting feature of this mixture of constraints is that it possesses zero bias. This happens because the constraints are generated by the procedure replaceWithMissing(S), which selects non-missing values uniformly at random for replacement with missing values. This point diverges from the common use of constraints, where they induce a bias over the learning procedure.

Furthermore, in the dataset constructed by contingency training it may even be impossible to satisfy all constraints at the same time, and the learning procedure must find a compromise that respects the majority of them. Finding this compromise is possible because the error is measured over the sum of the errors of all the constraints, and therefore the fewer constraints the classifier violates, the better. Thus, no constraint configuration is forced; the algorithm itself indirectly searches for the best set of constraints. Naturally, since the created constraints cannot be entirely satisfied, the learning procedure must weight their importance and satisfy as many as possible. For example, if some variables are less important than others, the constraint of them being zero may be satisfied more easily.

To exemplify, consider a simple system with two inputs and one output. The samples are of the form:

{x1, x2, y} (1)

With a small probability of substituting the values of the inputs with zero, the following samples can be artificially built:

{x1, x2, y} (2)
{0, x2, y} (3)
{x1, 0, y} (4)
{0, 0, y} (5)

The probabilities of these samples are, from top to bottom, (1 − prob)², prob(1 − prob), prob(1 − prob) and prob².

Let us consider two situations to highlight how the method works. In the first case, suppose both variables are important and necessary to predict y. Then, all artificially created samples would retain a big prediction error. In the second case, suppose variable x1 is not important to predict y.
Then the last two artificial samples will retain a big prediction error, but both the first and second samples can be predicted accurately. In other words, in this case the first and second samples can be satisfied at the same time.
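The probabilities of the four masking patterns above follow directly from the independent zeroing of each input. A small sketch makes the bookkeeping explicit (the function name and dictionary keys are ours, introduced for illustration):

```python
def masked_variant_probabilities(prob):
    """Probability of each masking pattern for a two-input sample {x1, x2, y}
    when each input is independently replaced by zero with probability prob."""
    return {
        "none_masked": (1 - prob) ** 2,    # {x1, x2, y}
        "x1_masked": prob * (1 - prob),    # {0, x2, y}
        "x2_masked": (1 - prob) * prob,    # {x1, 0, y}
        "both_masked": prob ** 2,          # {0, 0, y}
    }
```

For a small prob, the unmasked pattern dominates the artificial dataset, so the original learning signal is preserved while the masked variants supply the feature-weighting constraints.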
Fig. 2 Accuracy on (from left to right) the Az-5000, Iris and Diabetes datasets (both Iris and Diabetes have additional irrelevant variables) using an unmodified dataset (usual training) and contingency training.
Fig. 3 Accuracy on (from left to right) the Wine, Glass and Zoo datasets (all of them possess additional irrelevant variables) using an unmodified dataset (usual training) and contingency training.
Fig. 4 Accuracy on (from left to right) the Zoo, Wine and Diabetes datasets using an unmodified dataset (usual training) and contingency training.
Fig. 5 Accuracy on (from left to right) the Zoo, Wine and Diabetes datasets using an unmodified dataset (usual training) and contingency training under a second setting.
6. CONCLUSIONS
This article proposed a method for training classifiers called contingency training. Contingency training works by removing information from the initial dataset to create a bigger and less informative artificial dataset. For the analysis of the proposed method, tests were conducted over traditional datasets as well as datasets with irrelevant variables. Contingency training presented superior performance on datasets with irrelevant variables and similar accuracy on the remaining datasets.

Moreover, the tests allowed us to delineate some explanation of how the proposed method achieves its performance. In summary, contingency training was shown to possess two nice direct or indirect implications:
1. Feature importance weighting through the satisfaction of constraints;
2. Absence of bias.
To achieve this, the method only adds a small computational overhead caused by the augmented dataset.

Thus, with its accuracy improvement, ease of implementation and capability of being applied to any classifier, contingency training is a promising algorithm that should soon be employed in real-world applications. Nonetheless, tests with other classifiers, different missing value representations and use in an ensemble of classifiers remain as future work.
REFERENCES
[1] C. Aggarwal, "On k-anonymity and the curse of dimensionality," in Proceedings of the 31st International Conference on Very Large Data Bases, pp. 901-909, VLDB Endowment, 2005.
[2] J. Friedman, "On bias, variance, 0/1-loss, and the curse-of-dimensionality," Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 55-77, 1997.
[3] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pp. 604-613, ACM, 1998.
[4] J. Biesiada and W. Duch, "Feature selection for high-dimensional data: A Kolmogorov-Smirnov correlation-based filter," Computer Recognition Systems, pp. 95-103, 2005.
[5] Y. Yang and J. Pedersen, "A comparative study on feature selection in text categorization," in Machine Learning: International Workshop, pp. 412-420, 1997.
[6] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," The Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[7] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," Advances in Neural Information Processing Systems 13, vol. 13, pp. 668-674, 2000.
[8] R. Gnanadesikan, J. Kettenring, and S. Tsao, "Weighting and selection of variables for cluster analysis," Journal of Classification, vol. 12, no. 1, pp. 113-136, 1995.
[9] J. Huang, M. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668, 2005.
[10] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," Advances in Neural Information Processing Systems, pp. 668-674, 2001.
[11] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, no. 1, pp. 245-271, 1997.
[12] S. Kotsiantis, "Supervised machine learning: A review of classification techniques," Informatica, vol. 31, pp. 249-268, 2007.
[13] J. Kniss, P. McCormick, A. McPherson, J. Ahrens, J. Painter, A. Keahey, and C. Hansen, "Interactive texture-based volume rendering for large datasets," IEEE Computer Graphics and Applications, vol. 21, no. 4, pp. 52-61, 2001.
[14] K. Chen and L. Liu, "Privacy preserving data classification with rotation perturbation," in Proceedings of the Fifth IEEE International Conference on Data Mining, IEEE, 2005.
[15] A. Frank and A. Asuncion, "UCI machine learning repository," 2010.
[16] A. Ng, "Mechanistician." http://mechanistician.blogspot.jp/2009/04/lec9-bias-variance-tradeoff.html
[17] W. N. Venables and B. D. Ripley, Modern Applied Statistics with S. New York: Springer, fourth ed., 2002. ISBN 0-387-95457-0.
[18] S. Haykin, Neural Networks and Learning Machines, vol. 3. Prentice Hall, 2009.
[19] D. Rumelhart, Backpropagation: Theory, Architectures, and Applications. Lawrence Erlbaum, 1995.
[20] D. Shanno, "On Broyden-Fletcher-Goldfarb-Shanno method,"