Auto-CASH: Autonomous Classification Algorithm Selection with Deep Q-Network
Tianyu Mu ([email protected])
Harbin Institute of Technology, Department of Computer Science, Harbin, China

Hongzhi Wang ([email protected])
Harbin Institute of Technology, Department of Computer Science, Harbin, China

Chunnan Wang ([email protected])
Harbin Institute of Technology, Department of Computer Science, Harbin, China

Zheng Liang ([email protected])
Harbin Institute of Technology, Department of Computer Science, Harbin, China
ABSTRACT
The large number of datasets generated by various data sources has posed a challenge to machine learning algorithm selection and hyperparameter configuration. For a specific machine learning task, it usually takes domain experts plenty of time to select an appropriate algorithm and configure its hyperparameters. If the problem of algorithm selection and hyperparameter optimization can be solved automatically, the task can be executed more efficiently with a performance guarantee. This problem is also known as the CASH problem. Early work either requires a large amount of human labor or suffers from high time or space complexity. In our work, we present Auto-CASH, a pre-trained model based on meta-learning, to solve the CASH problem more efficiently. Auto-CASH is the first approach that utilizes Deep Q-Network to automatically select the meta-features for each dataset, thus reducing the time cost tremendously without introducing much human labor. To demonstrate the effectiveness of our model, we conduct extensive experiments on 120 real-world classification datasets. Compared with classical and state-of-the-art CASH approaches, experimental results show that Auto-CASH achieves better performance within a shorter time.
KEYWORDS
Meta-feature, Algorithm selection, Hyperparameter optimization,CASH problem, Classification algorithm, Deep Q-Network
1 INTRODUCTION

Machine learning (ML) approaches have been widely used in recent years to solve problems in the data science field [25], such as data mining, data preprocessing, etc. Many algorithms (or models) have been developed for a specific problem [10, 20]. However, the performance of these algorithms varies considerably across datasets; no single learning algorithm outperforms all others in every aspect and on every problem [17]. Therefore, domain experts usually choose the most suitable algorithm and hyperparameters based on their experience and a series of experiments to optimize performance.

However, with the explosive growth of both data and ML algorithms, this approach can hardly work. Each algorithm has a large hyperparameter configuration space. Even for an expert with adequate domain knowledge, it is hard to make an ideal selection among various algorithms and their complex hyperparameter spaces. Facing such a situation, Thornton et al. presented the combined algorithm selection and hyperparameter optimization (CASH) problem [21], aiming at helping other researchers find a solution that selects a suitable algorithm and configures its hyperparameters in different scenarios automatically.

An effective approach to the CASH problem is meta-learning, also known as learning how to learn. With meta-feature vectors representing previous experience, meta-learning is capable of recommending the same algorithm for similar tasks [7, 8]. Meta-learning requires less human labor and fewer computation resources, making it more suitable for the automatic and lightweight demands of practice.

Therefore, solving the CASH problem in an automatic and lightweight way raises two main challenges. On the one hand, the whole workflow should be automatic. An effective strategy must be determined to automatically choose the meta-features used. The correlations among meta-feature candidates are complicated, and their influence on the algorithm selection result is hard to interpret, which makes selecting the optimal meta-features crucial. On the other hand, CASH suffers from the buckets effect: the measurement of HPO results is multi-faceted on real-world tasks, and the usability depends on the weakest aspect. The HPO algorithm adopted should have a performance guarantee, an acceptable time cost, and the potential to deal with various data types.

Auto-WEKA [21] is the first approach that provides a solution to the CASH problem. It uses a hyperparameter to represent candidate algorithms, thereby converting the CASH problem into a hyperparameter optimization (HPO) problem. However, Auto-WEKA iterates online round by round to find the best solution, thus suffering from high time and space cost. Different from Auto-WEKA, Auto-Model [22] extracts experimental results from previously published ML papers to create a knowledge base, making the selection of algorithms more intelligent and automated. The knowledge base can be updated with continuous training, and a steady flow of training data enhances it, gradually replacing the experience of experts. To the best of our knowledge, Auto-Model performs better than Auto-WEKA on classification problems. Nevertheless, the quality of the papers used affects the effectiveness of the entire model, and too much manual work is needed to evaluate each paper's contribution to the knowledge base. As a consequence, Auto-Model is not a fully automated CASH processing model.

From the above discussion, early works cannot address these challenges well, which makes them inefficient in practice. Thus, we present Auto-CASH, a pre-trained model based on meta-learning, to solve the CASH problem in an efficient way. For the first challenge, Auto-CASH utilizes Deep Q-Network [12], a reinforcement learning (RL) approach, to automatically select meta-features. Then, given each training dataset, we use its meta-features [2, 6], along with the most suitable algorithm tested for it, to train a Random Forest (RF) classifier, which is the key to the algorithm selection process: the prediction function of a trained RF can infer the most suitable algorithm for a new task instance. With the RF, Auto-CASH achieves good performance at an acceptable time cost. For the second challenge, we adopt the Genetic Algorithm (GA), one of the fastest and most effective HPO approaches, to improve the efficiency of finding the optimal hyperparameter setting. Our experimental results show that GA spends a quarter less time on HPO than earlier work and achieves better results.

The major contributions of our work are summarized as follows:
1) We propose Auto-CASH, a meta-learning based approach for the CASH problem. By sufficiently utilizing the experience of training datasets, our approach is more lightweight and efficient in practice.
2) We transform the selection of meta-features into a continuous action decision problem. Deep Q-Network is introduced to automatically choose the meta-features used in the algorithm selection process. To the best of our knowledge, Auto-CASH is the first study that introduces an RL approach and meta-learning to the CASH problem.
3) We conduct extensive experiments and demonstrate the effectiveness of Auto-CASH on 120 real-world classification datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) and Kaggle. Compared with Auto-Model and Auto-WEKA, experimental results show that Auto-CASH deals with the CASH problem better.

The structure of this paper is as follows. Section 2 introduces the concepts of DQN and GA, which are crucial in Auto-CASH. Section 3 describes the workflow and some implementations of our model. Section 4 introduces the methodology for automatic meta-feature selection. Section 5 evaluates the performance of Auto-CASH and compares it with earlier work. Finally, we draw a conclusion and give future research directions in Section 6.

2 PRELIMINARIES

In this section, we introduce the basic concepts of Deep Q-Network and HPO algorithms, respectively. Both of them are crucial in Auto-CASH.
2.1 Deep Q-Network

In the first part of Auto-CASH, the selection of meta-features is transformed into a continuous action decision problem, which can be solved by RL approaches. An RL approach includes two entities: the agent and the environment. They interact as follows: under state s_t, the agent takes an action a_t, gets a corresponding reward r_t from the environment, and then enters the next state s_{t+1}. This action decision process is repeated until the agent meets a termination condition.

Continuous action decision problems have the following characteristics:
• For different actions, the corresponding rewards are usually different.
• The reward for an action is delayed.
• The reward for an action is influenced by the current state.

Q-learning [23] is a classical value-based RL algorithm for continuous action decision problems. Let the Q(s_t, a_t) value represent the reward of action a_t in state s_t. The main idea of Q-learning is to fill in a Q-table with Q(s_t ∈ allStates, a_t ∈ allActions) values by iterative training. At the beginning of the training phase, i.e., when exploring the environment, the Q-table is filled with the same random initial value. As the agent explores the environment continuously, the Q-table is updated using Equation (1):

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]   (1)

Under state s, the agent selects the action a that obtains the maximum cumulative reward r according to the Q-table, then enters state s'. The Q-table is updated immediately. γ and α denote the discount factor and the learning rate, respectively. The difference between the true Q-value and the estimated Q-value is α [R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)] [11]. The value of α determines the speed of learning new information, and the value of γ reflects the importance of future rewards. The specific execution workflow of Q-learning is shown in Figure 1.

Figure 1: Q-learning workflow.

Q-learning uses tables to store each state and the Q value of each action under that state. However, as problems get complicated, it becomes difficult to describe the environment with an acceptable number of states the agent could possibly enter. If we still used a Q-table, the space cost would be heavy, and searching such a complex table would also need a lot of time and computing resources. Deep Q-Network (DQN) [12] was therefore proposed; it uses a neural network (NN) instead of a Q-table to estimate the reward of each action under a specific state. The input of DQN is the state values, and the output is the estimated reward value for each action. The agent randomly chooses actions with probability ε (0 < ε < 1) and chooses the action with the maximal estimated reward with probability 1 − ε. This is the ε-greedy exploration strategy, which balances exploration and exploitation. In the beginning, the system explores the space completely at random. As training continues, ε gradually decreases from 1 to 0 and is finally fixed at a stable exploration rate.

In the training phase, DQN uses the same strategy as Q-learning to update the parameter values of the NN. Besides, DQN has two mechanisms that make it act more like a human being: Experience Replay and Fixed Q-target. Every time the DQN is updated, we can randomly extract some previous experiences stored in the experience base to learn from. Random extraction disrupts the correlation between experiences and makes the update process more efficient. Fixed Q-target is also a mechanism that disrupts correlations: we use two NNs with the same structure but different parameters to predict the estimated and target Q-values, respectively. The NN for the estimated Q-value has the latest parameter values, while the NN for the target Q-value has earlier parameters. With these two mechanisms, DQN becomes more intelligent.
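A minimal tabular sketch of the update in Equation (1) with ε-greedy exploration may clarify the mechanics. The environment interface (`reset`, `step`) is a hypothetical stand-in, not part of Auto-CASH; DQN replaces the table below with a neural network.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, eps_start=1.0, eps_end=0.05):
    """Tabular Q-learning with epsilon-greedy exploration (Equation (1))."""
    Q = np.zeros((n_states, n_actions))
    eps = eps_start
    for _ in range(episodes):
        s = env.reset()          # hypothetical: returns an integer state
        done = False
        while not done:
            # Explore with probability eps, otherwise exploit the best known action.
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # hypothetical: (state, reward, done)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        # Decay eps from 1 toward a stable exploration rate, as described above.
        eps = max(eps_end, eps - (eps_start - eps_end) / episodes)
    return Q
```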
2.2 Hyperparameter Optimization

Let A denote a learning algorithm with n hyperparameters λ_1, λ_2, ..., λ_n, and let D denote a dataset. The domain of λ_i is denoted by Λ_i, so the overall hyperparameter space Λ is a subset of the Cartesian product of these domains: Λ ⊆ Λ_1 × Λ_2 × ... × Λ_n. Given a score function F(A, D), the HPO problem can be written as Equation (2), where A_λ means algorithm A with hyperparameter configuration λ (λ ∈ Λ):

λ* = argmax_{λ ∈ Λ} F(A_λ, D)   (2)

The performance of mainstream modern ML and deep learning algorithms and models is sensitive to their hyperparameter settings. To solve HPO efficiently and automatically, several classical approaches have been proposed: Grid Search [13], Random Search [1], Hyperband [9], Bayesian Optimization [3, 15], and the Genetic Algorithm [24], etc. Among them, the most famous and effective HPO approaches are Bayesian Optimization (BO) [4, 18, 19] and the Genetic Algorithm (GA) [14, 27].

BO is a black-box global optimization approach with almost the best performance among the above-mentioned HPO approaches. It uses a surrogate model (e.g., a Gaussian Process) to fit the target function, then iteratively updates the distribution of the surrogate model based on Bayesian theory. Finally, BO returns the best result it has explored as the HPO solution. However, exploring the surrogate model using Bayesian theory and historical data is time-consuming. For a model or algorithm with high time complexity or a high-dimensional hyperparameter space, BO can provide optimal HPO results, but its execution time is hardly acceptable. So in Auto-CASH, we use GA, another HPO approach with similar performance but a lower time cost.

GA originates from the computer simulation study of biological systems. It is a stochastic global search and optimization method developed by imitating the biological evolution mechanism of nature. In essence, it is an efficient, parallel, global search method that automatically accumulates knowledge about the search space during the search.
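To make the objective in Equation (2) concrete, the following sketch maximizes a score function F over randomly sampled configurations, i.e., Random Search [1] rather than the GA used in Auto-CASH; `score_fn` and the discrete domains are illustrative assumptions. GA replaces the sampling loop below with selection, crossover, and mutation (see Section 3.4).

```python
import random

def random_search_hpo(score_fn, domains, n_trials=100, seed=0):
    """Approximate lambda* = argmax_{lambda in Lambda} F(A_lambda, D)
    by uniform sampling from the Cartesian product of the domains Lambda_i."""
    rng = random.Random(seed)
    best_lam, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter domain Lambda_i.
        lam = {name: rng.choice(values) for name, values in domains.items()}
        score = score_fn(lam)  # trains A_lambda on D and returns F(A_lambda, D)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score

# Usage sketch: tune two (hypothetical) Random Forest hyperparameters.
# domains = {"n_estimators": [50, 100, 200], "max_depth": [4, 8, 16, None]}
# best, score = random_search_hpo(my_cv_score, domains)
```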
3 THE AUTO-CASH MODEL

In the era of algorithm and data explosion, it is increasingly challenging to select the most suitable algorithm for different datasets in a particular task (e.g., classification). One of the best ways to solve such problems is to train a pre-trained model based on previous experience. In our work, we use the training datasets and their optimal algorithms as the previous experience, which is the most intuitive form of experience and easy to apply. After training, the pre-trained model is like an expert who has learned all the previous experience and can work effectively offline.

When selecting the optimal algorithm for the training datasets, the metric criterion is crucial. For classification algorithms, the most commonly used metric is accuracy. However, in some cases (e.g., unbalanced classification), higher accuracy does not mean better performance. A more balanced metric should be considered that measures the performance of an algorithm on a dataset from multiple perspectives. Combining accuracy and AUC (area under the ROC curve), we therefore propose a more comprehensive metric criterion based on multi-objective optimization.

In our approach, we use DQN to select the meta-features representing a whole training dataset. To develop the RL environment for DQN, we need to define the reward for the action of selecting a meta-feature. We randomly select batches of meta-features to construct RFs, then test their performance on the training datasets. By repeating this procedure, we can estimate the influence of each meta-feature on the classification algorithm recommended by the RF, which serves as the reward in the DQN environment.

In this section, we first introduce the workflow of Auto-CASH in Section 3.1. Then we discuss the metric criterion used in Auto-CASH and its advantages in Section 3.2. Eventually, we give the implementations of algorithm selection and HPO in Sections 3.3 and 3.4, respectively.
3.1 Workflow

The workflow of our Auto-CASH approach is shown in Figure 2 and Algorithm 1. The whole workflow is divided into two phases: the training phase and the offline working phase. First of all, in Line 1 of Algorithm 1, we select the optimal candidate algorithm using our new metric criterion. Next, we use DQN to automatically determine the meta-features for representing datasets, as shown in Line 2. In this way (Lines 3 and 4), the training datasets are transformed into meta-feature vectors which, together with their optimal algorithms, are used to train an RF. Given the meta-feature vector of a new dataset, the trained RF can predict the label for it (autonomous algorithm selection), as shown in Line 5. Eventually, in Line 6, we apply the Genetic Algorithm to search for the optimal hyperparameter configuration.

Figure 2: The Auto-CASH workflow. After training, our model can work offline.

Algorithm 1 Auto-CASH approach.
Input: The training datasets D_train; the candidate algorithm set Alg; the dataset D that needs autonomous algorithm selection and HPO;
Output: An optimal algorithm alg and its hyperparameter setting λ_alg;
1: Select the optimal algorithm in Alg for D_train;
2: Use DQN to select the meta-feature list according to D_train;
3: Input the meta-feature vectors of D_train and the optimal algorithms to the RF;
4: Train the RF;
5: Utilize the trained RF to predict alg for D;
6: Utilize the Genetic Algorithm to search for λ_alg;
7: return alg, λ_alg;

To clearly present Auto-CASH, we now introduce some critical concepts. Their notations are summarized in Table 1.

Table 1: Notations and their meanings.
Notation   Meaning
Alg        Algorithm candidates list
alg        An algorithm in Alg
D_train    All training datasets
MF         Meta-feature candidates list
mf         A meta-feature in MF
M_list     The eventually selected optimal meta-feature list
3.2 Metric Criterion

AUC is the area under the ROC curve [16]. For a dataset with an unbalanced distribution, the AUC value represents the classifier's ability to separate positive and negative examples [5]. When selecting the optimal algorithm for each training dataset, the common practice is to use accuracy, which is highly influenced by the train-test split. To eliminate such influence, we use a score function combining AUC and accuracy.

The accuracy and AUC of a classification algorithm often conflict on an unbalanced dataset. For example, a cancer dataset may contain only 1% cancer records (negative cases). If a classifier assigns all records to the positive class, the accuracy value is 0.99, but the AUC value is only 0.5. Therefore, optimizing both accuracy and AUC can be treated as a multi-objective optimization problem.

A classic multi-objective optimization method [26] is the weighted sum, which in our problem takes the form F_score = w_1 · accuracy + w_2 · AUC. However, it requires more complicated calculations to optimize accuracy and AUC separately and to set reasonable weight coefficients. We instead use a concise score function in Auto-CASH, shown in Equation (3):

F_score = accuracy · AUC   (3)
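A sketch of Equation (3) using scikit-learn metrics for the binary case; the assumption that the fitted classifier `clf` exposes `predict_proba` is ours, and the 80/20 split mirrors the setup in Section 5.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def f_score(clf, X_test, y_test):
    """F_score = accuracy * AUC (Equation (3)) for a fitted binary classifier."""
    y_pred = clf.predict(X_test)
    # Use the positive-class probability as the AUC score input.
    y_prob = clf.predict_proba(X_test)[:, 1]
    return accuracy_score(y_test, y_pred) * roc_auc_score(y_test, y_prob)
```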
3.3 Algorithm Selection

An RF model is used for the autonomous algorithm selection process, which has two advantages. First, we use some complex meta-features to represent the datasets, and a training RF is sensitive to the internal interactions among these meta-features. Second, RF achieves high prediction accuracy without the need for hyperparameter tuning. The trained RF contains the knowledge of previous experience and can work offline. For a new dataset, the RF recommends an algorithm that has the best performance with high probability. In this way, the autonomous algorithm selection process costs only a few seconds. Training an RF needs much less human labor than Auto-Model, since Auto-Model has to extract rules from published papers. We compare the RF with other well-known classification models (e.g., KNN, SVM) in our experiments, and the results in Section 5 show that it is the most effective.
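A minimal sketch of this step, assuming `meta_vectors` (one meta-feature vector per training dataset) and `best_algorithms` (the per-dataset optimal algorithm names) come from the training phase described above; the scikit-learn hyperparameters are illustrative defaults, not our tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier

def train_selector(meta_vectors, best_algorithms):
    """Fit the RF that maps a dataset's meta-feature vector to its best algorithm."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(meta_vectors, best_algorithms)  # labels are algorithm names
    return rf

def recommend(rf, new_meta_vector):
    """Autonomous algorithm selection for a new dataset: a single RF prediction."""
    return rf.predict([new_meta_vector])[0]
```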
3.4 Hyperparameter Optimization

The Genetic Algorithm is used for the HPO process. Since Auto-WEKA has a complicated hyperparameter space and HPO is its major step, the first thing we consider is the number of hyperparameters for each alg ∈ Alg. We utilize GA to tune the hyperparameters of each alg and determine which hyperparameters will be tuned in HPO according to the performance improvement after tuning. Following Occam's razor, to reduce the complexity of the HPO step we only select for tuning the hyperparameters that bring a relatively large performance improvement.

The workflow of GA is shown in Figure 3. In the beginning, we use binary code to encode hyperparameters and initialize the original population. Then, we select the batch of individuals with the best fitness, i.e., the algorithm performance under a specific hyperparameter configuration, for the subsequent generation. To introduce random disturbance, we adopt crossover and mutation as genetic operators, shown in Figure 4. In crossover, two binary sequences (individuals) randomly exchange subsequences at the same position; in mutation, binary digits of an individual are altered randomly. For each subsequent generation, the hyperparameter configuration is returned as the HPO result if the termination condition has been reached; otherwise, the above steps will be executed iteratively. Our experimental results in Section 5 show that the fitness of individuals converges to the optimal value within 50 generations in most cases.

Figure 3: Genetic Algorithm workflow.

Figure 4: Crossover and mutation examples.
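The following sketch illustrates the binary encoding and the genetic operators of Figures 3 and 4; the population size, mutation rate, and truncation-selection scheme are illustrative assumptions, and `fitness` stands for the F_score of the algorithm under the decoded hyperparameter configuration.

```python
import random

def crossover(a, b, rng):
    """Exchange suffixes of two bit-string individuals at a random position."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return "".join(('1' if c == '0' else '0') if rng.random() < rate else c
                   for c in bits)

def genetic_search(fitness, n_bits, pop_size=20, generations=50, seed=0):
    """Minimal GA loop: keep the fittest half, refill via crossover + mutation."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(n_bits))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)   # evaluate and rank individuals
        survivors = pop[:pop_size // 2]       # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            c1, c2 = crossover(*rng.sample(survivors, 2), rng)
            children += [mutate(c1, rng), mutate(c2, rng)]
        pop = survivors + children[:pop_size - len(survivors)]
    return max(pop, key=fitness)              # decode into hyperparameters outside
```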
4 AUTOMATIC META-FEATURE SELECTION

The major interfering factor in the algorithm selection process is the quality of the meta-features. Unfortunately, because meta-features have complicated correlations with each other, it is difficult to reconfigure their priorities after a specific selection action. A well-studied approach focusing on the influence of multiple candidate selections is DQN. However, DQN is designed to solve automatic continuous decision problems, so we transform meta-feature selection into such a problem. In the rest of this section, we discuss the methodology of constructing the DQN environment and using DQN to select meta-features.

First of all, we introduce the elements of DQN, i.e., the state, action, and reward in the environment.
Definition 1. Given a collection of candidate meta-features MF (|MF| = m), a state s is the set of meta-features selected from MF so far. Each action a selects a specific meta-feature mf ∈ MF. The eventually selected meta-features constitute an optimal meta-feature list M_list (M_list ⊊ MF, |M_list|_max = n, n < m). The reward r_a of action a is the probability of selecting the optimal algorithm after performing action a.

In Auto-CASH, we use an m-bit binary number to encode all states. Each bit represents a meta-feature in MF. In a specific state s, if the meta-feature mf is selected, its corresponding bit is encoded as 1; otherwise, it is encoded as 0. Thus, there are 2^m states and m actions in total. The example in Figure 5 explains the transitions between states more clearly. Under the start state S_0, no meta-feature has been selected, so all m bits are 0. After performing some actions, the agent is in state S_j. The next action chooses mf_m, so the m-th bit of the number is set to 1. These steps are repeated until the termination state.

Figure 5: Examples of transitions between states.
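A sketch of this m-bit encoding, using integers as bitmasks so that each action sets one bit; the helper names are ours.

```python
def select_action(state, mf_index, m):
    """Set the bit for meta-feature mf_index in an m-bit state (Figure 5)."""
    assert 0 <= mf_index < m
    return state | (1 << mf_index)

def selected_features(state, m):
    """Decode a state back into the list of selected meta-feature indices."""
    return [i for i in range(m) if state & (1 << i)]

# Start state S_0: no meta-feature selected, all m bits are 0.
s = 0
s = select_action(s, 3, m=23)    # action: choose mf_3
s = select_action(s, 14, m=23)   # action: choose mf_14
print(selected_features(s, 23))  # -> [3, 14]
```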
To make sufficient preparation for the RL environment, we consider several characteristics of a classification dataset and form several types of meta-features. For category attributes, we concentrate on the inter-class dispersion degree and the maximum range of class proportions. For numeric attributes, we are more concerned with the center and extent of fluctuation. Besides, we also take into account the global numeric information of records and attributes. The 5 basic types of meta-features are as follows:
• Type 1: Category information entropy.
• Type 2: Proportion of classes in different types of attributes.
• Type 3: Average value.
• Type 4: Variance.
• Type 5: Number of instances.

On the basis of the above discussion, we construct MF, made up of constrained (e.g., class number in the category attribute with the fewest classes) and combined (e.g., variance of the average values of numeric attributes) meta-features. The details are shown in Section 5.

There is no precise approach to measure or calculate the reward of each meta-feature, so we can only estimate the rewards from experimental results on the training datasets. The meta-features influence one another in complicated ways, so evaluating the reward of a single meta-feature independently is not persuasive. Therefore, for each meta-feature, we randomly select some batches of meta-features containing it and construct an RF with each batch. We repeat the above steps multiple times for each batch size, and the average accuracy of the RFs is the reward; a sketch of this estimation follows.
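Below is a minimal sketch of the reward estimation, assuming a hypothetical helper `evaluate_rf(batch)` that trains and tests an RF restricted to the given meta-feature indices; the batch sizes 2 to 8 match the experimental setting in Section 5.

```python
import random

def estimate_rewards(n_features, evaluate_rf, batch_sizes=range(2, 9),
                     repeats=10, seed=0):
    """Average RF accuracy over random batches containing each meta-feature."""
    rng = random.Random(seed)
    rewards = {}
    for mf in range(n_features):
        scores = []
        for size in batch_sizes:
            for _ in range(repeats):
                # Random batch of meta-features that always contains mf.
                others = [i for i in range(n_features) if i != mf]
                batch = [mf] + rng.sample(others, size - 1)
                scores.append(evaluate_rf(batch))  # hypothetical: RF accuracy
        rewards[mf] = sum(scores) / len(scores)
    return rewards
```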
All meta-feature selection steps are summarized in Algorithm 2. At first, as shown in Line 1, we construct the state set S and action set A. Then we estimate the reward R of each action, as shown in Line 2. The DQN environment is initialized with S, A, and R. In each episode, DQN starts from S_0 and chooses the action with the maximal reward at each step (Lines 4-10). After decoding the termination state, the training result for one episode is obtained. We repeat the above steps and eventually obtain the optimal meta-feature list from the numerous training results (Line 14).

Algorithm 2 Automatic meta-feature selection.
Input: The meta-feature candidates list MF (|MF| = m); the limit of the optimal meta-feature list |M_list|_max = n (n < m); the episode limit e_max;
Output: The optimal meta-feature list M_list;
1: Construct state set S and action set A;
2: Estimate the reward R of each a ∈ A;
3: for i = 0; i < e_max; i++ do
4:   Initialize M_list = ∅ and the set of all candidate lists M_all = ∅;
5:   Start state = s_0, current state s = s_0;
6:   while |M_list| < n do
7:     Initialize the DQN environment using S, A, and R;
8:     Use DQN to find the optimal action a_best for s;
9:     M_list ← M_list ∪ {a_best};
10:    s ← s after performing a_best;
11:  end while
12:  M_all ← M_all ∪ {M_list};
13: end for
14: return M_list = argmax(M_all);

In the beginning, the lack of experience makes the selection of DQN deviate from reality. As training progresses, DQN adjusts parameters such as the learning rate and discount rate according to this deviation, and the selection becomes reasonable, just as a human being corrects his actions by absorbing previous experience and the results get better. Eventually, the network parameters become stable and the selected meta-features have the best performance.

With M_list and alg, all original training datasets can be transformed into a new dataset to train the RF model. Assuming that |D_train| = p and |M_list| = y, we have the new training dataset D' of size p × (y + 1), in which column y + 1 represents alg. After the training phase, our Auto-CASH model works offline. Benefiting from the excellent prediction performance of RF and the high efficiency of GA, the performance of Auto-CASH surpasses earlier work, as shown in Section 5.3.

5 EXPERIMENTAL EVALUATION

In this section, we evaluate our Auto-CASH approach on the classification CASH problem. Given a dataset, we use Auto-CASH to automatically select an algorithm and search for its optimal hyperparameter settings. Then we utilize the new metric criterion from Section 3.2 to examine the performance of the results given by Auto-CASH. Eventually, we compare Auto-CASH with the classical CASH approach Auto-WEKA and the state-of-the-art CASH approach Auto-Model and discuss the experimental results.
5.1 Experimental Setup

For all experiments in this paper, the setup is as follows:
1) We implement all experiments in Python 3.7 and run them on a computer with a 2.6 GHz Intel(R) Core(TM) i7-6700HQ CPU and 16 GB RAM.
2) All datasets used are real-world datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) and Kaggle. The most significant advantage of using real-world datasets is that it improves the effect of our model in real life and lays the foundation for future research work. Missing values in a dataset are replaced by random other values of the same attribute. The implementation of all classification algorithms is from WEKA (source code at https://svn.cms.waikato.ac.nz/svn/weka/branches/stable-3-8/; we wrap the jar package and invoke it from Python), which is consistent with Auto-WEKA and Auto-Model.
3) The performance of Auto-WEKA and Auto-Model is related to the tuning time, so we set their timeLimit parameter to 5 minutes.
4) When calculating the AUC and accuracy values in the metric criterion, we use 80% and 20% of each dataset as training data and test data, respectively.
5) AUC is defined for binary classification. For multi-class problems, we binarize the output of the classification algorithm using the function in Equation (4):

f(output) = (…)   (4)

Figure 6: Examples of selecting hyperparameters for the HPO process: (a) GA tuning curve for -K; (b) GA tuning curve for -depth; (c) GA tuning curve for -I; (d) improvement for different hyperparameters of Random Forest.
5.2 Training Phase

Referring to the methodology in Section 3.4, we first test the performance improvement of tuning each hyperparameter of each alg. Examples for the Random Forest algorithm on the ecoli dataset (https://archive.ics.uci.edu/ml/datasets/Ecoli) are shown in Figure 6. Figures 6(a), 6(b), and 6(c) show the GA tuning curves for the hyperparameters -K, -depth, and -I, respectively; the x-axis represents the generations of GA, and the y-axis represents the average F_score value (performance) of each generation. Although these curves all converge by about the fifth generation, the effect of each hyperparameter on the final performance improvement differs, as shown in Figure 6(d). Tuning -depth yields a 6 percent improvement, while -I improves performance by only about 1 percent. Therefore, for the alg RF, we decide to tune -depth and -K in the HPO process. Table 2 shows the number of hyperparameters that need to be tuned for each algorithm in Auto-CASH.

Table 2: The number of hyperparameters to be tuned for each algorithm in Auto-CASH. We utilize 23 well-known and effective classification algorithms in total.
Algorithm                     Number    Algorithm              Number
AdaBoost                      3         Bagging                3
AttributeSelectedClassifier   2         BayesNet               1
ClassificationViaRegression   2         IBK                    4
DecisionTable                 2         J48                    8
JRip                          4         KStar                  2
Logistic                      1         LogitBoost             3
LWL                           3         MultiClass             3
MultilayerPerceptron          5         NaiveBayes             2
RandomCommittee               2         RandomForest           2
RandomSubSpace                3         RandomTree             4
SMO                           6         Vote                   1
LMT                           5

After selecting the hyperparameters to be tuned in HPO, we test the optimal algorithm for each dataset in D_train. We then compare the performance of all algorithm candidates on the training datasets and show the distribution of optimal algorithms in Figure 7.

Figure 7: Distribution of the optimal algorithm on 104 training datasets. For each algorithm, we list the number of datasets for which it is the optimal algorithm.
The meta-features used to represent a dataset in our experiments are summarized as follows:
• mf_0: Class number in the target attribute.
• mf_1: Class information entropy of the target attribute.
• mf_2: Maximum proportion of a single class in the target attribute.
• mf_3: Minimum proportion of a single class in the target attribute.
• mf_4: Number of numeric attributes.
• mf_5: Number of category attributes.
• mf_6: Proportion of numeric attributes.
• mf_7: Total number of attributes.
• mf_8: Number of records in the dataset.
• mf_9: Class number in the category attribute with the fewest classes.
• mf_10: Class information entropy of the category attribute with the fewest classes.
• mf_11: Maximum proportion of a single class in the category attribute with the fewest classes.
• mf_12: Minimum proportion of a single class in the category attribute with the fewest classes.
• mf_13: Class number in the category attribute with the most classes.
• mf_14: Class information entropy of the category attribute with the most classes.
• mf_15: Maximum proportion of a single class in the category attribute with the most classes.
• mf_16: Minimum proportion of a single class in the category attribute with the most classes.
• mf_17: Minimum average value among numeric attributes.
• mf_18: Maximum average value among numeric attributes.
• mf_19: Minimum variance among numeric attributes.
• mf_20: Maximum variance among numeric attributes.
• mf_21: Variance of the average values of numeric attributes.
• mf_22: Variance of the variances of numeric attributes.

The type (defined in Section 4) of each mf is shown in Table 3.

Table 3: Type of each meta-feature.
Type     mf index
Type 1   1, 10, 14
Type 2   2, 3, 6, 11, 12, 15, 16
Type 3   17, 18
Type 4   19, 20, 21, 22
Type 5   0, 4, 5, 7, 8, 9, 13

These meta-features are easy to calculate, which reduces the computational cost of algorithm selection; a sketch of a few of them follows.
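A sketch computing one representative meta-feature per type with pandas; the column-role handling is simplified and the helper names are ours, so this is illustrative rather than our exact extraction code.

```python
import numpy as np
import pandas as pd

def class_entropy(col):
    """Type 1: information entropy of a categorical column (e.g., mf_1)."""
    p = col.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def meta_features(df, target):
    """Compute a handful of the 23 candidates; indices follow Table 3."""
    numeric = df.drop(columns=[target]).select_dtypes(include="number")
    props = df[target].value_counts(normalize=True)
    return {
        "mf_0": df[target].nunique(),          # Type 5: classes in target
        "mf_1": class_entropy(df[target]),     # Type 1: target entropy
        "mf_2": float(props.max()),            # Type 2: max class proportion
        "mf_8": len(df),                       # Type 5: number of records
        "mf_18": float(numeric.mean().max()),  # Type 3: max average value
        "mf_20": float(numeric.var().max()),   # Type 4: max variance
    }
```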
After determining Alg and MF, we utilize DQN to obtain M_list. Too many meta-features do not bring enough information gain while increasing the computational complexity. Therefore, we set the upper limit of |M_list| to 8 and evaluate each mf with batch sizes ranging from 2 to 8. The evaluation results, i.e., the estimated reward of each mf, are shown in Figure 8. From the results, we can see that the influence of each mf varies over a large range. Following the methodology in Section 4, there are 23 actions and 2^23 states in total. The experience memory size of the DQN is set to 200 and is randomly updated after each action decision. We then obtain M_list, a list of seven selected meta-features, from the outputs of the DQN. We utilize these selected meta-features and each dataset's optimal algorithm to train the RF, and the trained RF predicts the optimal algorithm for test datasets. Eventually, in HPO, we use GA with the maximum number of generations set to 50.

Figure 8: Estimated reward (prediction accuracy) for each meta-feature.

5.3 Results and Discussion

We evaluate the F_score performance of Auto-CASH on the 20 classification datasets listed in Table 4. The average time cost of each phase is shown in Table 5. From the table, we can see that Auto-CASH spends little time on autonomous algorithm selection. After selecting the hyperparameters to tune, the HPO time is greatly reduced, which guarantees the efficiency of Auto-CASH. We also evaluate the F_score performance of Auto-WEKA and Auto-Model on the same datasets; the detailed experimental results are shown in Table 6.

Table 4: Datasets.
Dataset       Records   Attributes   Classes   Symbol
Avila         20867     10           12        D_1
Nursery       12960     8            3         D_2
Absenteeism   740       21           36        D_3
Climate       540       19           2         D_4
Australian    690       14           2         D_5
Iris.2D       150       2            3         D_6
Heart-c       303       14           5         D_7
Sick          3772      30           2         D_8
Anneal        798       38           6         D_9
Hypothyroid   3772      27           2         D_10
Squash        52        24           3         D_11
Vowel         990       14           11        D_12
Zoo           101       18           7         D_13
Breast-W      699       9            2         D_14
Iris          150       4            3         D_15
Diabetes      768       9            2         D_16
Dermatology   336       34           6         D_17
Musk          476       166          2         D_18
Promoter      106       57           2         D_19
Blood         748       5            2         D_20

The meta-features selected by DQN can comprehensively represent the datasets. Compared with Auto-Model, we use fewer meta-features while Auto-CASH achieves better performance in most cases, as shown in Table 6. This proves that DQN is more effective. Our approach significantly reduces human labor in the training phase, which makes it a fully automated model. Auto-CASH can handle …
Table 5: The average time of each phase in Auto-CASH.
Phase                           Time
DQN training                    10 CPU hours
Calculate meta-feature values   0.96 seconds
Algorithm selection             0.5 seconds
HPO                             229.3 seconds
Total CASH                      230.76 seconds
Table 6: F_score of Auto-CASH, Auto-Model, and Auto-WEKA on the test datasets D_1-D_20. Bold font denotes the best result. Auto-Model cannot give a result for some cases, so we use -1 there.

6 CONCLUSION AND FUTURE WORK
In this paper, we present Auto-CASH, a pre-trained model based on meta-learning for the CASH problem. By transforming the selection of meta-features into a continuous action decision problem, we are able to solve it automatically using Deep Q-Network, which significantly reduces human labor in the training process. For a particular task, Auto-CASH enhances the performance of the recommended algorithm within an acceptable time by means of Random Forest and the Genetic Algorithm. Experimental results demonstrate that Auto-CASH outperforms classical and state-of-the-art CASH approaches in efficiency and effectiveness. In future work, we plan to extend Auto-CASH to deal with more problems, e.g., regression and image processing. Besides, we intend to develop an approach to automatically extract the meta-feature candidates according to the task and its datasets.
REFERENCES
[1] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, Feb (2012), 281–305.
[2] Besim Bilalli, Alberto Abelló, and Tomas Aluja-Banet. 2017. On the predictive power of meta-features in OpenML. International Journal of Applied Mathematics and Computer Science 27, 4 (2017), 697–712.
[3] Eric Brochu, Vlad M Cora, and Nando De Freitas. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).
[4] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 8609–8613.
[5] Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27, 8 (2006), 861–874.
[6] Andrey Filchenkov and Arseniy Pendryak. 2015. Datasets meta-feature description for recommending feature selection algorithm. In 2015 Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT). IEEE, 11–18.
[7] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated Machine Learning. Springer.
[8] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2017. Building machines that learn and think like people. Behavioral and Brain Sciences 40 (2017).
[9] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2017. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18, 1 (2017), 6765–6816.
[10] Marius Lindauer, Jan N van Rijn, and Lars Kotthoff. 2019. The algorithm selection competitions 2015 and 2017. Artificial Intelligence 272 (2019), 86–100.
[11] Francisco S Melo. 2001. Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep. (2001), 1–4.
[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[13] Douglas C Montgomery. 2017. Design and Analysis of Experiments. John Wiley & Sons.
[14] Randal S Olson and Jason H Moore. 2019. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Automated Machine Learning. Springer, 151–160.
[15] Martin Pelikan, David E Goldberg, Erick Cantú-Paz, et al. 1999. BOA: The Bayesian optimization algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, Vol. 1. 525–532.
[16] David Martin Powers. 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2, 1 (2011), 37–63.
[17] Cullen Schaffer. 1994. Cross-validation, stacking and bi-level stacking: Meta-methods for classification learning. In Selecting Models from Data. Springer, 51–59.
[18] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems. 2951–2959.
[19] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. 2015. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning. 2171–2180.
[20] Ben Taylor, Vicent Sanz Marco, Willy Wolff, Yehia Elkhatib, and Zheng Wang. 2018. Adaptive deep learning model selection on embedded systems. ACM SIGPLAN Notices 53, 6 (2018), 31–43.
[21] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 847–855.
[22] Chunnan Wang, Hongzhi Wang, Tianyu Mu, Jianzhong Li, and Hong Gao. 2019. Auto-Model: Utilizing research papers and HPO techniques to deal with the CASH problem. arXiv preprint arXiv:1910.10902 (2019).
[23] Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3-4 (1992), 279–292.
[24] Darrell Whitley. 1994. A genetic algorithm tutorial. Statistics and Computing 4, 2 (1994), 65–85.
[25] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
[26] Lei Xiujuan and Shi Zhongke. 2004. Overview of multi-objective optimization methods. Journal of Systems Engineering and Electronics 15, 2 (2004), 142–146.
[27] G Zames, NM Ajlouni, NM Ajlouni, NM Ajlouni, JH Holland, WD Hills, and DE Goldberg. 1981. Genetic algorithms in search, optimization and machine learning. (1981).