Auto-CASH: Autonomous Classification Algorithm Selection with Deep Q-Network
Tianyu Mu ([email protected])
Harbin Institute of Technology, Department of Computer Science, Harbin, China

Hongzhi Wang ([email protected])
Harbin Institute of Technology, Department of Computer Science, Harbin, China

Chunnan Wang ([email protected])
Harbin Institute of Technology, Department of Computer Science, Harbin, China

Zheng Liang ([email protected])
Harbin Institute of Technology, Department of Computer Science, Harbin, China
ABSTRACT
The large number of datasets generated by various data sources has posed a challenge to machine learning algorithm selection and hyperparameter configuration. For a specific machine learning task, it usually takes domain experts plenty of time to select an appropriate algorithm and configure its hyperparameters. If the problem of algorithm selection and hyperparameter optimization can be solved automatically, the task can be executed more efficiently with a performance guarantee. This problem is also known as the CASH problem. Early work either requires a large amount of human labor or suffers from high time or space complexity. In our work, we present Auto-CASH, a pre-trained model based on meta-learning, to solve the CASH problem more efficiently. Auto-CASH is the first approach that utilizes Deep Q-Network to automatically select the meta-features for each dataset, thus reducing the time cost tremendously without introducing much human labor. To demonstrate the effectiveness of our model, we conduct extensive experiments on 120 real-world classification datasets. Compared with classical and state-of-the-art CASH approaches, experimental results show that Auto-CASH achieves better performance within a shorter time.
KEYWORDS
Meta-feature, Algorithm selection, Hyperparameter optimization,CASH problem, Classification algorithm, Deep Q-Network
1 INTRODUCTION

Machine learning (ML) approaches have been widely used in recent years to solve problems in the data science field [25], such as data mining, data preprocessing, etc. Many algorithms (or models) have been developed for a specific problem [10, 20]. However, the performance of these algorithms varies considerably across datasets; no single learning algorithm outperforms all others in every aspect and on every problem [17]. Therefore, domain experts usually choose the most suitable algorithm and hyperparameters based on their experience and a series of experiments to optimize performance.

However, with the explosive growth of both data and ML algorithms, this approach can hardly work. Each algorithm has a large hyperparameter configuration space. Even for an expert with adequate domain knowledge, it is hard to make an ideal selection among various algorithms and their complex hyperparameter spaces. Facing such a situation, Thornton et al. presented the combined algorithm selection and hyperparameter optimization (CASH) problem [21], aiming at helping other researchers find a solution that selects a suitable algorithm and configures its hyperparameters in different scenarios automatically.

An effective approach to the CASH problem is meta-learning, also known as learning how to learn. With meta-feature vectors representing previous experience, meta-learning is capable of recommending the same algorithm for similar tasks [7, 8]. Meta-learning requires less human labor and fewer computation resources, making it more suitable for the automatic and lightweight demands of practice.

Therefore, solving the CASH problem in an automatic and lightweight way raises two main challenges. On the one hand, the whole workflow should be automatic. An effective strategy must be determined to automatically choose the meta-features used. The correlations among meta-feature candidates are complicated, and their influence on the algorithm selection result is hard to interpret, which makes selecting the optimal meta-features crucial. On the other hand, CASH suffers from the buckets effect: the measurement of HPO results is multi-faceted on real-world tasks, and the usability depends on the weakest aspect. The HPO algorithm adopted should have a performance guarantee, an acceptable time cost, and the potential to deal with various data types.

Auto-WEKA [21] is the first approach that provides a solution to the CASH problem. It uses a hyperparameter to represent candidate algorithms, thereby converting the CASH problem into a hyperparameter optimization (HPO) problem. However, Auto-WEKA iterates online round by round to find the best solution, thus suffering from high time and space cost. Different from Auto-WEKA, Auto-Model [22] extracts experimental results from previously published ML papers to create a knowledge base, making the selection of algorithms more intelligent and automated. The knowledge base can be updated with continuous training, and a steady flow of training data enhances it, gradually replacing the experience of experts. To the best of our knowledge, Auto-Model performs better than Auto-WEKA on classification problems. Nevertheless, the quality of the papers used affects the effectiveness of the entire model, and too much manual work is needed to evaluate each paper's contribution to the knowledge base. As a consequence, Auto-Model is not a fully automated CASH processing model.

From the above discussion, early works cannot address these challenges well, which makes them inefficient in practice. Thus, we present Auto-CASH, a pre-trained model based on meta-learning, to solve the CASH problem in an efficient way. For the first challenge, Auto-CASH utilizes Deep Q-Network [12], a reinforcement learning (RL) approach, to automatically select meta-features. Then, given each training dataset, we use its meta-features [2, 6], along with the most suitable algorithm tested for it, to train a Random Forest (RF) classifier, which is the key to the algorithm selection process: the prediction function of a trained RF can infer the most suitable algorithm for a new task instance. With the RF, Auto-CASH achieves good performance at an acceptable time cost. For the second challenge, we adopt the Genetic Algorithm (GA), one of the fastest and most effective HPO approaches, to improve the efficiency of finding the optimal hyperparameter setting. Our experimental results show that GA spends a quarter less time on HPO than earlier work and achieves better results.

The major contributions of our work are summarized as follows:
1) We propose Auto-CASH, a meta-learning based approach for the CASH problem. By sufficiently utilizing the experience of training datasets, our approach is more lightweight and efficient in practice.
2) We transform the selection of meta-features into a continuous action decision problem. Deep Q-Network is introduced to automatically choose the meta-features used in the algorithm selection process. To the best of our knowledge, Auto-CASH is the first study that introduces an RL approach and meta-learning to the CASH problem.
3) We conduct extensive experiments and demonstrate the effectiveness of Auto-CASH on 120 real-world classification datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) and Kaggle. Compared with Auto-Model and Auto-WEKA, experimental results show that Auto-CASH deals with the CASH problem better.

The structure of this paper is as follows. Section 2 introduces the concepts of DQN and GA, which are crucial in Auto-CASH. Section 3 describes the workflow and some implementations of our model. Section 4 introduces the methodology for automatic meta-feature selection. Section 5 evaluates the performance of Auto-CASH and compares it with earlier work. Finally, we draw a conclusion and give future research directions in Section 6.

2 PRELIMINARIES

In this section, we introduce the basic concepts of Deep Q-Network and HPO algorithms, respectively. Both of them are crucial in Auto-CASH.
2.1 Deep Q-Network

In the first part of Auto-CASH, the selection of meta-features is transformed into a continuous action decision problem, which can be solved by RL approaches. An RL approach includes two entities: the agent and the environment. They interact as follows: under state s_t, the agent takes an action a_t, gets a corresponding reward r_t from the environment, and then enters the next state s_{t+1}. This action decision process is repeated until the agent meets a termination condition.

Continuous action decision problems have the following characteristics:
• For different actions, the corresponding rewards are usually different.
• The reward for an action is delayed.
• The reward for an action is influenced by the current state.

Q-learning [23] is a classical value-based RL algorithm for continuous action decision problems. Let the Q(s_t, a_t) value represent the reward of action a_t in state s_t. The main idea of Q-learning is to fill in a Q-table with Q(s_t ∈ allStates, a_t ∈ allActions) values by iterative training. At the beginning of the training phase, i.e., when exploring the environment, the Q-table is filled with the same random initial value. As the agent explores the environment continuously, the Q-table is updated using Equation (1):

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]   (1)

Under state s, the agent selects the action a that obtains the maximum cumulative reward r according to the Q-table, then enters state s'. The Q-table is updated immediately. γ and α denote the discount factor and the learning rate, respectively. The difference between the true Q-value and the estimated Q-value is α [R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)] [11]. The value of α determines the speed of learning new information, and the value of γ reflects the importance of future rewards. The specific execution workflow of Q-learning is shown in Figure 1.

Figure 1: Q-learning workflow.

Q-learning uses tables to store each state and the Q value of each action under that state. However, as problems get complicated, it becomes difficult to describe the environment with an acceptable number of states the agent could possibly enter. If we still used a Q-table, the space cost would be heavy, and searching such a complex table would also need a lot of time and computing resources. Deep Q-Network (DQN) [12] was therefore proposed; it uses a neural network (NN) instead of a Q-table to estimate the reward of each action under a specific state. The input of DQN is the state values, and the output is the estimated reward value for each action. The agent randomly chooses actions with probability ε (0 < ε < 1) and chooses the action with the maximal estimated reward with probability 1 − ε. This is the ε-greedy exploration strategy, which balances exploration and exploitation. In the beginning, the system explores the space completely at random. As training continues, ε gradually decreases from 1 to 0 and is finally fixed at a stable exploration rate.

In the training phase, DQN uses the same strategy as Q-learning to update the parameter values of the NN. Besides, DQN has two mechanisms that make it act more like a human being: Experience Replay and Fixed Q-target. Every time the DQN is updated, we can randomly extract some previous experiences stored in the experience base to learn from. Random extraction disrupts the correlation between experiences and makes the update process more efficient. Fixed Q-target is also a mechanism that disrupts correlations: we use two NNs with the same structure but different parameters to predict the estimated and target Q-values, respectively. The NN for the estimated Q-value has the latest parameter values, while the NN for the target Q-value has earlier parameters. With these two mechanisms, DQN becomes more intelligent.
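A minimal tabular sketch of the update in Equation (1) with ε-greedy exploration may clarify the mechanics. The environment interface (`reset`, `step`) is a hypothetical stand-in, not part of Auto-CASH; DQN replaces the table below with a neural network.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, eps_start=1.0, eps_end=0.05):
    """Tabular Q-learning with epsilon-greedy exploration (Equation (1))."""
    Q = np.zeros((n_states, n_actions))
    eps = eps_start
    for _ in range(episodes):
        s = env.reset()          # hypothetical: returns an integer state
        done = False
        while not done:
            # Explore with probability eps, otherwise exploit the best known action.
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)  # hypothetical: (state, reward, done)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        # Decay eps from 1 toward a stable exploration rate, as described above.
        eps = max(eps_end, eps - (eps_start - eps_end) / episodes)
    return Q
```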
2.2 Hyperparameter Optimization

Let A denote a learning algorithm with n hyperparameters λ_1, λ_2, ..., λ_n, and let D denote a dataset. The domain of λ_i is denoted by Λ_i, so the overall hyperparameter space Λ is a subset of the Cartesian product of these domains: Λ ⊆ Λ_1 × Λ_2 × ... × Λ_n. Given a score function F(A, D), the HPO problem can be written as Equation (2), where A_λ means algorithm A with hyperparameter configuration λ (λ ∈ Λ):

λ* = argmax_{λ ∈ Λ} F(A_λ, D)   (2)

The performance of mainstream modern ML and deep learning algorithms and models is sensitive to their hyperparameter settings. To solve HPO efficiently and automatically, several classical approaches have been proposed: Grid Search [13], Random Search [1], Hyperband [9], Bayesian Optimization [3, 15], and the Genetic Algorithm [24], etc. Among them, the most famous and effective HPO approaches are Bayesian Optimization (BO) [4, 18, 19] and the Genetic Algorithm (GA) [14, 27].

BO is a black-box global optimization approach with almost the best performance among the above-mentioned HPO approaches. It uses a surrogate model (e.g., a Gaussian Process) to fit the target function, then iteratively updates the distribution of the surrogate model based on Bayesian theory. Finally, BO returns the best result it has explored as the HPO solution. However, exploring the surrogate model using Bayesian theory and historical data is time-consuming. For a model or algorithm with high time complexity or a high-dimensional hyperparameter space, BO can provide optimal HPO results, but its execution time is hardly acceptable. So in Auto-CASH, we use GA, another HPO approach with similar performance but a lower time cost.

GA originates from the computer simulation study of biological systems. It is a stochastic global search and optimization method developed by imitating the biological evolution mechanism of nature. In essence, it is an efficient, parallel, global search method that automatically accumulates knowledge about the search space during the search.
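To make the objective in Equation (2) concrete, the following sketch maximizes a score function F over randomly sampled configurations, i.e., Random Search [1] rather than the GA used in Auto-CASH; `score_fn` and the discrete domains are illustrative assumptions. GA replaces the sampling loop below with selection, crossover, and mutation (see Section 3.4).

```python
import random

def random_search_hpo(score_fn, domains, n_trials=100, seed=0):
    """Approximate lambda* = argmax_{lambda in Lambda} F(A_lambda, D)
    by uniform sampling from the Cartesian product of the domains Lambda_i."""
    rng = random.Random(seed)
    best_lam, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter domain Lambda_i.
        lam = {name: rng.choice(values) for name, values in domains.items()}
        score = score_fn(lam)  # trains A_lambda on D and returns F(A_lambda, D)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam, best_score

# Usage sketch: tune two (hypothetical) Random Forest hyperparameters.
# domains = {"n_estimators": [50, 100, 200], "max_depth": [4, 8, 16, None]}
# best, score = random_search_hpo(my_cv_score, domains)
```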
3 THE AUTO-CASH MODEL

In the era of algorithm and data explosion, it is increasingly challenging to select the most suitable algorithm for different datasets in a particular task (e.g., classification). One of the best ways to solve such problems is to train a pre-trained model based on previous experience. In our work, we use the training datasets and their optimal algorithms as the previous experience, which is the most intuitive form of experience and easy to apply. After training, the pre-trained model is like an expert who has learned all the previous experience and can work effectively offline.

When selecting the optimal algorithm for the training datasets, the metric criterion is crucial. For classification algorithms, the most commonly used metric is accuracy. However, in some cases (e.g., unbalanced classification), higher accuracy does not mean better performance. A more balanced metric should be considered that measures the performance of an algorithm on a dataset from multiple perspectives. Combining accuracy and AUC (area under the ROC curve), we therefore propose a more comprehensive metric criterion based on multi-objective optimization.

In our approach, we use DQN to select the meta-features representing a whole training dataset. To develop the RL environment for DQN, we need to define the reward for the action of selecting a meta-feature. We randomly select batches of meta-features to construct RFs, then test their performance on the training datasets. By repeating this procedure, we can estimate the influence of each meta-feature on the classification algorithm recommended by the RF, which serves as the reward in the DQN environment.

In this section, we first introduce the workflow of Auto-CASH in Section 3.1. Then we discuss the metric criterion used in Auto-CASH and its advantages in Section 3.2. Eventually, we give the implementations of algorithm selection and HPO in Sections 3.3 and 3.4, respectively.
3.1 Workflow

The workflow of our Auto-CASH approach is shown in Figure 2 and Algorithm 1. The whole workflow is divided into two phases: the training phase and the offline working phase. First of all, in Line 1 of Algorithm 1, we select the optimal candidate algorithm using our new metric criterion. Next, we use DQN to automatically determine the meta-features for representing datasets, as shown in Line 2. In this way (Lines 3 and 4), the training datasets are transformed into meta-feature vectors which, together with their optimal algorithms, are used to train an RF. Given the meta-feature vector of a new dataset, the trained RF can predict the label for it (autonomous algorithm selection), as shown in Line 5. Eventually, in Line 6, we apply the Genetic Algorithm to search for the optimal hyperparameter configuration.

Figure 2: The Auto-CASH workflow. After training, our model can work offline.

Algorithm 1 Auto-CASH approach.
Input: The training datasets D_train; the candidate algorithm set Alg; the dataset D that needs autonomous algorithm selection and HPO;
Output: An optimal algorithm alg and its hyperparameter setting λ_alg;
1: Select the optimal algorithm in Alg for D_train;
2: Use DQN to select the meta-feature list according to D_train;
3: Input the meta-feature vectors of D_train and the optimal algorithms to the RF;
4: Train the RF;
5: Utilize the trained RF to predict alg for D;
6: Utilize the Genetic Algorithm to search for λ_alg;
7: return alg, λ_alg;

To clearly present Auto-CASH, we now introduce some critical concepts. Their notations are summarized in Table 1.

Table 1: Notations and their meanings.
Notation   Meaning
Alg        Algorithm candidates list
alg        An algorithm in Alg
D_train    All training datasets
MF         Meta-feature candidates list
mf         A meta-feature in MF
M_list     The eventually selected optimal meta-feature list
3.2 Metric Criterion

AUC is the area under the ROC curve [16]. For a dataset with an unbalanced distribution, the AUC value represents the classifier's ability to separate positive and negative examples [5]. When selecting the optimal algorithm for each training dataset, the common practice is to use accuracy, which is highly influenced by the train-test split. To eliminate such influence, we use a score function combining AUC and accuracy.

The accuracy and AUC of a classification algorithm often conflict on an unbalanced dataset. For example, a cancer dataset may contain only 1% cancer records (negative cases). If a classifier assigns all records to the positive class, the accuracy value is 0.99, but the AUC value is only 0.5. Therefore, optimizing both accuracy and AUC can be treated as a multi-objective optimization problem.

A classic multi-objective optimization method [26] is the weighted sum, which in our problem takes the form F_score = w_1 · accuracy + w_2 · AUC. However, it requires more complicated calculations to optimize accuracy and AUC separately and to set reasonable weight coefficients. We instead use a concise score function in Auto-CASH, shown in Equation (3):

F_score = accuracy · AUC   (3)
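A sketch of Equation (3) using scikit-learn metrics for the binary case; the assumption that the fitted classifier `clf` exposes `predict_proba` is ours, and the 80/20 split mirrors the setup in Section 5.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def f_score(clf, X_test, y_test):
    """F_score = accuracy * AUC (Equation (3)) for a fitted binary classifier."""
    y_pred = clf.predict(X_test)
    # Use the positive-class probability as the AUC score input.
    y_prob = clf.predict_proba(X_test)[:, 1]
    return accuracy_score(y_test, y_pred) * roc_auc_score(y_test, y_prob)
```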
3.3 Algorithm Selection

An RF model is used for the autonomous algorithm selection process, which has two advantages. First, we use some complex meta-features to represent the datasets, and a training RF is sensitive to the internal interactions among these meta-features. Second, RF achieves high prediction accuracy without the need for hyperparameter tuning. The trained RF contains the knowledge of previous experience and can work offline. For a new dataset, the RF recommends an algorithm that has the best performance with high probability. In this way, the autonomous algorithm selection process costs only a few seconds. Training an RF needs much less human labor than Auto-Model, since Auto-Model has to extract rules from published papers. We compare the RF with other well-known classification models (e.g., KNN, SVM) in our experiments, and the results in Section 5 show that it is the most effective.
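A minimal sketch of this step, assuming `meta_vectors` (one meta-feature vector per training dataset) and `best_algorithms` (the per-dataset optimal algorithm names) come from the training phase described above; the scikit-learn hyperparameters are illustrative defaults, not our tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier

def train_selector(meta_vectors, best_algorithms):
    """Fit the RF that maps a dataset's meta-feature vector to its best algorithm."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(meta_vectors, best_algorithms)  # labels are algorithm names
    return rf

def recommend(rf, new_meta_vector):
    """Autonomous algorithm selection for a new dataset: a single RF prediction."""
    return rf.predict([new_meta_vector])[0]
```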
3.4 Hyperparameter Optimization

The Genetic Algorithm is used for the HPO process. Since Auto-WEKA has a complicated hyperparameter space and HPO is its major step, the first thing we consider is the number of hyperparameters for each alg ∈ Alg. We utilize GA to tune the hyperparameters of each alg and determine which hyperparameters will be tuned in HPO according to the performance improvement after tuning. Following Occam's razor, to reduce the complexity of the HPO step we only select for tuning the hyperparameters that bring a relatively large performance improvement.

The workflow of GA is shown in Figure 3. In the beginning, we use binary code to encode hyperparameters and initialize the original population. Then, we select the batch of individuals with the best fitness, i.e., the algorithm performance under a specific hyperparameter configuration, for the subsequent generation. To introduce random disturbance, we adopt crossover and mutation as genetic operators, shown in Figure 4. In crossover, two binary sequences (individuals) randomly exchange subsequences at the same position; in mutation, binary digits of an individual are altered randomly. For each subsequent generation, the hyperparameter configuration is returned as the HPO result if the termination condition has been reached; otherwise, the above steps will be executed iteratively. Our experimental results in Section 5 show that the fitness of individuals converges to the optimal value within 50 generations in most cases.

Figure 3: Genetic Algorithm workflow.

Figure 4: Crossover and mutation examples.
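The following sketch illustrates the binary encoding and the genetic operators of Figures 3 and 4; the population size, mutation rate, and truncation-selection scheme are illustrative assumptions, and `fitness` stands for the F_score of the algorithm under the decoded hyperparameter configuration.

```python
import random

def crossover(a, b, rng):
    """Exchange suffixes of two bit-string individuals at a random position."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return "".join(('1' if c == '0' else '0') if rng.random() < rate else c
                   for c in bits)

def genetic_search(fitness, n_bits, pop_size=20, generations=50, seed=0):
    """Minimal GA loop: keep the fittest half, refill via crossover + mutation."""
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(n_bits))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)   # evaluate and rank individuals
        survivors = pop[:pop_size // 2]       # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            c1, c2 = crossover(*rng.sample(survivors, 2), rng)
            children += [mutate(c1, rng), mutate(c2, rng)]
        pop = survivors + children[:pop_size - len(survivors)]
    return max(pop, key=fitness)              # decode into hyperparameters outside
```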
4 AUTOMATIC META-FEATURE SELECTION

The major interfering factor in the algorithm selection process is the quality of the meta-features. Unfortunately, because meta-features have complicated correlations with each other, it is difficult to reconfigure their priorities after a specific selection action. A well-studied approach focusing on the influence of multiple candidate selections is DQN. However, DQN is designed to solve automatic continuous decision problems, so we transform meta-feature selection into such a problem. In the rest of this section, we discuss the methodology of constructing the DQN environment and using DQN to select meta-features.

First of all, we introduce the elements of DQN, i.e., the state, action, and reward in the environment.
Definition 1. Given a collection of candidate meta-features MF (|MF| = m), a state s is the set of meta-features selected from MF so far. Each action a selects a specific meta-feature mf ∈ MF. The eventually selected meta-features constitute an optimal meta-feature list M_list (M_list ⊊ MF, |M_list|_max = n, n < m). The reward r_a of action a is the probability of selecting the optimal algorithm after performing action a.

In Auto-CASH, we use an m-bit binary number to encode all states. Each bit represents a meta-feature in MF. In a specific state s, if the meta-feature mf is selected, its corresponding bit is encoded as 1; otherwise, it is encoded as 0. Thus, there are 2^m states and m actions in total. The example in Figure 5 explains the transitions between states more clearly. Under the start state S_0, no meta-feature has been selected, so all m bits are 0. After performing some actions, the agent is in state S_j. The next action chooses mf_m, so the m-th bit of the number is set to 1. These steps are repeated until the termination state.

Figure 5: Examples of transitions between states.
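A sketch of this m-bit encoding, using integers as bitmasks so that each action sets one bit; the helper names are ours.

```python
def select_action(state, mf_index, m):
    """Set the bit for meta-feature mf_index in an m-bit state (Figure 5)."""
    assert 0 <= mf_index < m
    return state | (1 << mf_index)

def selected_features(state, m):
    """Decode a state back into the list of selected meta-feature indices."""
    return [i for i in range(m) if state & (1 << i)]

# Start state S_0: no meta-feature selected, all m bits are 0.
s = 0
s = select_action(s, 3, m=23)    # action: choose mf_3
s = select_action(s, 14, m=23)   # action: choose mf_14
print(selected_features(s, 23))  # -> [3, 14]
```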
To make sufficient preparation for the RL environment, we consider several characteristics of a classification dataset and form several types of meta-features. For category attributes, we concentrate on the inter-class dispersion degree and the maximum range of class proportions. For numeric attributes, we are more concerned with the center and extent of fluctuation. Besides, we also take into account the global numeric information of records and attributes. The 5 basic types of meta-features are as follows:
• Type 1: Category information entropy.
• Type 2: Proportion of classes in different types of attributes.
• Type 3: Average value.
• Type 4: Variance.
• Type 5: Number of instances.

On the basis of the above discussion, we construct MF, made up of constrained (e.g., class number in the category attribute with the fewest classes) and combined (e.g., variance of the average values of numeric attributes) meta-features. The details are shown in Section 5.

There is no precise approach to measure or calculate the reward of each meta-feature, so we can only estimate the rewards from experimental results on the training datasets. The meta-features influence one another in complicated ways, so evaluating the reward of a single meta-feature independently is not persuasive. Therefore, for each meta-feature, we randomly select some batches of meta-features containing it and construct an RF with each batch. We repeat the above steps multiple times for each batch size, and the average accuracy of the RFs is the reward; a sketch of this estimation follows.
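Below is a minimal sketch of the reward estimation, assuming a hypothetical helper `evaluate_rf(batch)` that trains and tests an RF restricted to the given meta-feature indices; the batch sizes 2 to 8 match the experimental setting in Section 5.

```python
import random

def estimate_rewards(n_features, evaluate_rf, batch_sizes=range(2, 9),
                     repeats=10, seed=0):
    """Average RF accuracy over random batches containing each meta-feature."""
    rng = random.Random(seed)
    rewards = {}
    for mf in range(n_features):
        scores = []
        for size in batch_sizes:
            for _ in range(repeats):
                # Random batch of meta-features that always contains mf.
                others = [i for i in range(n_features) if i != mf]
                batch = [mf] + rng.sample(others, size - 1)
                scores.append(evaluate_rf(batch))  # hypothetical: RF accuracy
        rewards[mf] = sum(scores) / len(scores)
    return rewards
```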
All meta-feature selection steps are summarized in Algorithm 2. At first, as shown in Line 1, we construct the state set S and action set A. Then we estimate the reward R of each action, as shown in Line 2. The DQN environment is initialized with S, A, and R. In each episode, DQN starts from S_0 and chooses the action with the maximal reward at each step (Lines 4-10). After decoding the termination state, the training result for one episode is obtained. We repeat the above steps and eventually obtain the optimal meta-feature list from the numerous training results (Line 14).

Algorithm 2 Automatic meta-feature selection.
Input: The meta-feature candidates list MF (|MF| = m); the limit of the optimal meta-feature list |M_list|_max = n (n < m); the episode limit e_max;
Output: The optimal meta-feature list M_list;
1: Construct state set S and action set A;
2: Estimate the reward R of each a ∈ A;
3: for i = 0; i < e_max; i++ do
4:   Initialize M_list = ∅ and the set of all candidate lists M_all = ∅;
5:   Start state = s_0, current state s = s_0;
6:   while |M_list| < n do
7:     Initialize the DQN environment using S, A, and R;
8:     Use DQN to find the optimal action a_best for s;
9:     M_list ← M_list ∪ {a_best};
10:    s ← s after performing a_best;
11:  end while
12:  M_all ← M_all ∪ {M_list};
13: end for
14: return M_list = argmax(M_all);

In the beginning, the lack of experience makes the selection of DQN deviate from reality. As training progresses, DQN adjusts parameters such as the learning rate and discount rate according to this deviation, and the selection becomes reasonable, just as a human being corrects his actions by absorbing previous experience and the results get better. Eventually, the network parameters become stable and the selected meta-features have the best performance.

With M_list and alg, all original training datasets can be transformed into a new dataset to train the RF model. Assuming that |D_train| = p and |M_list| = y, we have the new training dataset D' of size p × (y + 1), in which column y + 1 represents alg. After the training phase, our Auto-CASH model works offline. Benefiting from the excellent prediction performance of RF and the high efficiency of GA, the performance of Auto-CASH surpasses earlier work, as shown in Section 5.3.

5 EXPERIMENTAL EVALUATION

In this section, we evaluate our Auto-CASH approach on the classification CASH problem. Given a dataset, we use Auto-CASH to automatically select an algorithm and search for its optimal hyperparameter settings. Then we utilize the new metric criterion from Section 3.2 to examine the performance of the results given by Auto-CASH. Eventually, we compare Auto-CASH with the classical CASH approach Auto-WEKA and the state-of-the-art CASH approach Auto-Model and discuss the experimental results.
5.1 Experimental Setup

For all experiments in this paper, the setup is as follows:
1) We implement all experiments in Python 3.7 and run them on a computer with a 2.6 GHz Intel(R) Core(TM) i7-6700HQ CPU and 16 GB RAM.
2) All datasets used are real-world datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php) and Kaggle. The most significant advantage of using real-world datasets is that it improves the effect of our model in real life and lays the foundation for future research work. Missing values in a dataset are replaced by random other values of the same attribute. The implementation of all classification algorithms is from WEKA (source code at https://svn.cms.waikato.ac.nz/svn/weka/branches/stable-3-8/; we wrap the jar package and invoke it from Python), which is consistent with Auto-WEKA and Auto-Model.
3) The performance of Auto-WEKA and Auto-Model is related to the tuning time, so we set their timeLimit parameter to 5 minutes.
4) When calculating the AUC and accuracy values in the metric criterion, we use 80% and 20% of each dataset as training data and test data, respectively.
5) AUC is defined for binary classification. For multi-class problems, we binarize the output of the classification algorithm using the function in Equation (4):

f(output) = (…)   (4)

Figure 6: Examples of selecting hyperparameters for the HPO process: (a) GA tuning curve for -K; (b) GA tuning curve for -depth; (c) GA tuning curve for -I; (d) improvement for different hyperparameters of Random Forest.
5.2 Training Phase

Referring to the methodology in Section 3.4, we first test the performance improvement of tuning each hyperparameter of each alg. Examples for the Random Forest algorithm on the ecoli dataset (https://archive.ics.uci.edu/ml/datasets/Ecoli) are shown in Figure 6. Figures 6(a), 6(b), and 6(c) show the GA tuning curves for the hyperparameters -K, -depth, and -I, respectively; the x-axis represents the generations of GA, and the y-axis represents the average F_score value (performance) of each generation. Although these curves all converge by about the fifth generation, the effect of each hyperparameter on the final performance improvement differs, as shown in Figure 6(d). Tuning -depth yields a 6 percent improvement, while -I improves performance by only about 1 percent. Therefore, for the alg RF, we decide to tune -depth and -K in the HPO process. Table 2 shows the number of hyperparameters that need to be tuned for each algorithm in Auto-CASH.

Table 2: The number of hyperparameters to be tuned for each algorithm in Auto-CASH. We utilize 23 well-known and effective classification algorithms in total.
Algorithm                     Number    Algorithm              Number
AdaBoost                      3         Bagging                3
AttributeSelectedClassifier   2         BayesNet               1
ClassificationViaRegression   2         IBK                    4
DecisionTable                 2         J48                    8
JRip                          4         KStar                  2
Logistic                      1         LogitBoost             3
LWL                           3         MultiClass             3
MultilayerPerceptron          5         NaiveBayes             2
RandomCommittee               2         RandomForest           2
RandomSubSpace                3         RandomTree             4
SMO                           6         Vote                   1
LMT                           5

After selecting the hyperparameters to be tuned in HPO, we test the optimal algorithm for each dataset in D_train. We then compare the performance of all algorithm candidates on the training datasets and show the distribution of optimal algorithms in Figure 7.

Figure 7: Distribution of the optimal algorithm on 104 training datasets. For each algorithm, we list the number of datasets for which it is the optimal algorithm.
The meta-features used to represent a dataset in our experiments are summarized as follows:
• mf_0: Class number in the target attribute.
• mf_1: Class information entropy of the target attribute.
• mf_2: Maximum proportion of a single class in the target attribute.
• mf_3: Minimum proportion of a single class in the target attribute.
• mf_4: Number of numeric attributes.
• mf_5: Number of category attributes.
• mf_6: Proportion of numeric attributes.
• mf_7: Total number of attributes.
• mf_8: Number of records in the dataset.
• mf_9: Class number in the category attribute with the fewest classes.
• mf_10: Class information entropy of the category attribute with the fewest classes.
• mf_11: Maximum proportion of a single class in the category attribute with the fewest classes.
• mf_12: Minimum proportion of a single class in the category attribute with the fewest classes.
• mf_13: Class number in the category attribute with the most classes.
• mf_14: Class information entropy of the category attribute with the most classes.
• mf_15: Maximum proportion of a single class in the category attribute with the most classes.
• mf_16: Minimum proportion of a single class in the category attribute with the most classes.
• mf_17: Minimum average value among numeric attributes.
• mf_18: Maximum average value among numeric attributes.
• mf_19: Minimum variance among numeric attributes.
• mf_20: Maximum variance among numeric attributes.
• mf_21: Variance of the average values of numeric attributes.
• mf_22: Variance of the variances of numeric attributes.

The type (defined in Section 4) of each mf is shown in Table 3.

Table 3: Type of each meta-feature.
Type     mf index
Type 1   1, 10, 14
Type 2   2, 3, 6, 11, 12, 15, 16
Type 3   17, 18
Type 4   19, 20, 21, 22
Type 5   0, 4, 5, 7, 8, 9, 13

These meta-features are easy to calculate, which reduces the computational cost of algorithm selection; a sketch of a few of them follows.
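A sketch computing one representative meta-feature per type with pandas; the column-role handling is simplified and the helper names are ours, so this is illustrative rather than our exact extraction code.

```python
import numpy as np
import pandas as pd

def class_entropy(col):
    """Type 1: information entropy of a categorical column (e.g., mf_1)."""
    p = col.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def meta_features(df, target):
    """Compute a handful of the 23 candidates; indices follow Table 3."""
    numeric = df.drop(columns=[target]).select_dtypes(include="number")
    props = df[target].value_counts(normalize=True)
    return {
        "mf_0": df[target].nunique(),          # Type 5: classes in target
        "mf_1": class_entropy(df[target]),     # Type 1: target entropy
        "mf_2": float(props.max()),            # Type 2: max class proportion
        "mf_8": len(df),                       # Type 5: number of records
        "mf_18": float(numeric.mean().max()),  # Type 3: max average value
        "mf_20": float(numeric.var().max()),   # Type 4: max variance
    }
```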
After determining Alg and MF, we utilize DQN to obtain M_list. Too many meta-features do not bring enough information gain while increasing the computational complexity. Therefore, we set the upper limit of |M_list| to 8 and evaluate each mf with batch sizes ranging from 2 to 8. The evaluation results, i.e., the estimated reward of each mf, are shown in Figure 8. From the results, we can see that the influence of each mf varies over a large range. Following the methodology in Section 4, there are 23 actions and 2^23 states in total. The experience memory size of the DQN is set to 200 and is randomly updated after each action decision. We then obtain M_list, a list of seven selected meta-features, from the outputs of the DQN. We utilize these selected meta-features and each dataset's optimal algorithm to train the RF, and the trained RF predicts the optimal algorithm for test datasets. Eventually, in HPO, we use GA with the maximum number of generations set to 50.

Figure 8: Estimated reward (prediction accuracy) for each meta-feature.

5.3 Results and Discussion

We evaluate the F_score performance of Auto-CASH on the 20 classification datasets listed in Table 4. The average time cost of each phase is shown in Table 5. From the table, we can see that Auto-CASH spends little time on autonomous algorithm selection. After selecting the hyperparameters to tune, the HPO time is greatly reduced, which guarantees the efficiency of Auto-CASH. We also evaluate the F_score performance of Auto-WEKA and Auto-Model on the same datasets; the detailed experimental results are shown in Table 6.

Table 4: Datasets.
Dataset       Records   Attributes   Classes   Symbol
Avila         20867     10           12        D_1
Nursery       12960     8            3         D_2
Absenteeism   740       21           36        D_3
Climate       540       19           2         D_4
Australian    690       14           2         D_5
Iris.2D       150       2            3         D_6
Heart-c       303       14           5         D_7
Sick          3772      30           2         D_8
Anneal        798       38           6         D_9
Hypothyroid   3772      27           2         D_10
Squash        52        24           3         D_11
Vowel         990       14           11        D_12
Zoo           101       18           7         D_13
Breast-W      699       9            2         D_14
Iris          150       4            3         D_15
Diabetes      768       9            2         D_16
Dermatology   336       34           6         D_17
Musk          476       166          2         D_18
Promoter      106       57           2         D_19
Blood         748       5            2         D_20

The meta-features selected by DQN can comprehensively represent the datasets. Compared with Auto-Model, we use fewer meta-features while Auto-CASH achieves better performance in most cases, as shown in Table 6. This proves that DQN is more effective. Our approach significantly reduces human labor in the training phase, which makes it a fully automated model. Auto-CASH can handle …
Table 5: The average time of each phase in Auto-CASH.
Phase                           Time
DQN training                    10 CPU hours
Calculate meta-feature values   0.96 seconds
Algorithm selection             0.5 seconds
HPO                             229.3 seconds
Total CASH                      230.76 seconds
Table 6: F_score of Auto-CASH, Auto-Model, and Auto-WEKA on the test datasets D_1-D_20. Bold font denotes the best result. Auto-Model cannot give a result for some cases, so we use -1 there.

6 CONCLUSION AND FUTURE WORK
In this paper, we present Auto-CASH, a pre-trained model based on meta-learning for the CASH problem. By transforming the selection of meta-features into a continuous action decision problem, we are able to solve it automatically using Deep Q-Network, which significantly reduces human labor in the training process. For a particular task, Auto-CASH enhances the performance of the recommended algorithm within an acceptable time by means of Random Forest and the Genetic Algorithm. Experimental results demonstrate that Auto-CASH outperforms classical and state-of-the-art CASH approaches in efficiency and effectiveness. In future work, we plan to extend Auto-CASH to deal with more problems, e.g., regression and image processing. Besides, we intend to develop an approach to automatically extract the meta-feature candidates according to the task and its datasets.
REFERENCES
[1] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, Feb (2012), 281–305.
[2] Besim Bilalli, Alberto Abelló, and Tomas Aluja-Banet. 2017. On the predictive power of meta-features in OpenML. International Journal of Applied Mathematics and Computer Science 27, 4 (2017), 697–712.
[3] Eric Brochu, Vlad M Cora, and Nando De Freitas. 2010. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599 (2010).
[4] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 8609–8613.
[5] Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27, 8 (2006), 861–874.
[6] Andrey Filchenkov and Arseniy Pendryak. 2015. Datasets meta-feature description for recommending feature selection algorithm. In 2015 Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT). IEEE, 11–18.
[7] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. 2019. Automated Machine Learning. Springer.
[8] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2017. Building machines that learn and think like people. Behavioral and Brain Sciences 40 (2017).
[9] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2017. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18, 1 (2017), 6765–6816.
[10] Marius Lindauer, Jan N van Rijn, and Lars Kotthoff. 2019. The algorithm selection competitions 2015 and 2017. Artificial Intelligence 272 (2019), 86–100.
[11] Francisco S Melo. 2001. Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep. (2001), 1–4.
[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[13] Douglas C Montgomery. 2017. Design and Analysis of Experiments. John Wiley & Sons.
[14] Randal S Olson and Jason H Moore. 2019. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Automated Machine Learning. Springer, 151–160.
[15] Martin Pelikan, David E Goldberg, Erick Cantú-Paz, et al. 1999. BOA: The Bayesian optimization algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, Vol. 1. 525–532.
[16] David Martin Powers. 2011. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2, 1 (2011), 37–63.
[17] Cullen Schaffer. 1994. Cross-validation, stacking and bi-level stacking: Meta-methods for classification learning. In Selecting Models from Data. Springer, 51–59.
[18] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems. 2951–2959.
[19] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. 2015. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning. 2171–2180.
[20] Ben Taylor, Vicent Sanz Marco, Willy Wolff, Yehia Elkhatib, and Zheng Wang. 2018. Adaptive deep learning model selection on embedded systems. ACM SIGPLAN Notices 53, 6 (2018), 31–43.
[21] Chris Thornton, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 847–855.
[22] Chunnan Wang, Hongzhi Wang, Tianyu Mu, Jianzhong Li, and Hong Gao. 2019. Auto-Model: Utilizing research papers and HPO techniques to deal with the CASH problem. arXiv preprint arXiv:1910.10902 (2019).
[23] Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3-4 (1992), 279–292.
[24] Darrell Whitley. 1994. A genetic algorithm tutorial. Statistics and Computing 4, 2 (1994), 65–85.
[25] Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
[26] Lei Xiujuan and Shi Zhongke. 2004. Overview of multi-objective optimization methods. Journal of Systems Engineering and Electronics 15, 2 (2004), 142–146.
[27] G Zames, NM Ajlouni, NM Ajlouni, NM Ajlouni, JH Holland, WD Hills, and DE Goldberg. 1981. Genetic algorithms in search, optimization and machine learning. (1981).