Towards Automated Machine Learning: Evaluation and Comparison of AutoML Approaches and Tools
Anh Truong, Austin Walters, Jeremy Goodsitt, Keegan Hines, C. Bayan Bruss, Reza Farivar
Anh Truong*, Austin Walters*, Jeremy Goodsitt*, Keegan Hines*, C. Bayan Bruss*, Reza Farivar*†
*Applied Research, Center for Machine Learning, Capital One, McLean, VA, USA
{anh.truong, austin.walters, jeremy.goodsitt, keegan.hines, bayan.bruss}@capitalone.com
†Department of Computer Science, University of Illinois Urbana-Champaign, Urbana-Champaign, IL, USA
[email protected]
Abstract—There has been considerable growth and interest in industrial applications of machine learning (ML) in recent years. ML engineers, as a consequence, are in high demand across the industry, yet improving the efficiency of ML engineers remains a fundamental challenge. Automated machine learning (AutoML) has emerged as a way to save time and effort on repetitive tasks in ML pipelines, such as data pre-processing, feature engineering, model selection, hyperparameter optimization, and prediction result analysis. In this paper, we investigate the current state of AutoML tools aiming to automate these tasks. We conduct various evaluations of the tools on many datasets, in different data segments, to examine their performance, and compare their advantages and disadvantages on different test cases.
Index Terms—AutoML, automated machine learning, driverless AI, model selection, hyperparameter optimization
I. INTRODUCTION
Automated Machine Learning (AutoML) promises major productivity boosts for data scientists, ML engineers and ML researchers by reducing repetitive tasks in machine learning pipelines. There are currently a number of different tools and platforms (both open-source and commercial solutions) that try to automate these tasks. The goal of this paper is to address the following questions: (i) what ML functionalities do the tools provide; (ii) how do the tools perform when facing a wide spectrum of real-world datasets; (iii) what is the trade-off between optimization speed and result accuracy; and (iv) how reproducible are the results (i.e., how robust are the tools)?

The rest of the paper is organized as follows. Section II covers the history and background of AutoML tools. Next, in Section III, we compare the tools' features and functionality across an automated ML pipeline comprising data preprocessing, model selection, hyperparameter optimization, and model interpretation. After that, in Section IV, we experimentally evaluate the performance of a selected subset of these tools on a large variety of datasets and a range of supervised ML tasks. Finally, we conclude the paper in Section V.

II. BACKGROUND AND HISTORY
Between 1995 and 2015, many ML libraries and tools were developed, including Weka (1990s), RapidMiner (2001), Scikit-learn (2007-2010), H2O (2011), and Spark MLlib (2013), among many others. Deep neural network platforms have also gained popularity in the last five years: Tensorflow (2015), Keras (2015) and MXNet (2015) contributed to the wide adoption of deep learning models.

During this time period, it became evident to many ML practitioners that extracting the best performance from machine learning models requires substantial human expertise. Developing good models from a dataset is almost an art form, involving intuition, experience, and many tedious manual tasks to tune algorithmic parameters. The combination of market pressure for more ML engineers and the tedious nature of developing 'optimal' ML solutions sparked the idea of automating ML tasks.

AutoML's initial efforts came first from academia and ML practitioners, and later from startups. One of the first attempts was Auto-Weka (2013) [1] from the Universities of British Columbia (UBC) and Freiburg, which utilizes algorithms provided by Weka [2]. Auto-sklearn (2014) [3] came next, from the University of Freiburg. TPOT [4] was developed at the University of Pennsylvania (2015). Auto-ml [5], an open-source Python package, was released in 2016 (to avoid confusion with the general term 'AutoML', please note the spelling of this tool). Auto-sklearn, Auto-ml, and TPOT are all built on the well-known Scikit-learn ML package. Other tools followed, including Auto-Keras (2017) [6] from Texas A&M University, running on top of Keras, Tensorflow and Scikit-learn. MLjar (2018) [7] also uses Scikit-learn, in conjunction with Tensorflow. In the same timeframe, some startups started developing their own AutoML tools. Datarobot [8], [9], [10] launched its automated machine learning tool in 2015. H2O-Automl [11], [12] was introduced by H2O (2016), using ML models from the H2O platform. The H2O team later released their commercial H2O-DriverlessAI product (2017) [13], and SparkCognition introduced Darwin (2018) [14], utilizing their own ML platform.

After a while, the large cloud providers and technology companies followed suit, offering Automated Machine Learning as a Service (AMLaaS) or standalone products. Google Cloud AutoML (2017) [15] runs on the Google Cloud platform, Microsoft AzureML (2018) [16] takes advantage of algorithms on Azure, Salesforce's TransmogrifAI (2018) [17] runs on top of Spark ML, and Uber's Ludwig (2019) [18] runs model training on Horovod, Uber's open-source distributed training framework.

The aforementioned platforms emphasize different aspects of the AutoML space. For example, Darwin, H2O-DriverlessAI and DataRobot provide the functionality of detecting and processing time-series data. They also offer interactive UIs to help customers experiment quickly with different machine learning tasks. H2O-DriverlessAI exports a Plain Old Java Object (POJO) or a Model ObJect Optimized (MOJO) so that the optimized models can easily be deployed on any Java-supported platform. TPOT exports optimized code for developers. Auto-ml offers 'categorical ensembling', where segments of categories in a column can have different models. Google Cloud AutoML and Auto-keras conduct neural architecture search [19], [20] for both image and text data.

III. AUTOML PLATFORMS' FEATURES AND FUNCTIONALITY COMPARISON: THE COMMON PIPELINE

Fig. 1. The common AutoML pipeline.
Most AutoML tools follow a common three-stage pipeline, illustrated in Figure 1. In general, these three components are optimized iteratively to obtain the best outcome. Figure 2 briefly summarizes the comparison across the surveyed tools; more detailed comparisons follow in the subsequent sections.
A. Data Preprocessing and Feature Engineering
Data preprocessing is typically the first task in ML pipelines. At the moment, this task is not handled very well by any of the AutoML tools and still requires considerable human intervention. In particular, it requires data type and schema detection, which are not widely supported among the AutoML tools. However, once data types are identified, the tools provide appropriate feature engineering for the next component in the pipeline. TransmogrifAI seems to be furthest ahead in this regard by supporting detection of many detailed data types (e.g., addresses, phone numbers, names, currency, etc.); however, this functionality does not appear to be very stable across multiple datasets. H2O-Automl, H2O-DriverlessAI, DataRobot, MLjar and Darwin gain some advantage by offering the ability to detect basic data types or schemas, currently limited to numerical, categorical and time-series data. Auto-ml, Auto-sklearn, AzureML and Ludwig are less favorable here, in the sense that they can only do feature engineering from user-input specifications, e.g., data types for each column. The remaining tools need much more human interaction for feature engineering: Auto-sklearn requires users to convert categorical data into integers (e.g., using a label encoder) before any other transformation, while TPOT and Auto-keras provide neither data preprocessing nor feature generation, instead requiring users to manually perform data pre-processing, and accepting only numerical feature matrices. A sketch of this manual step follows below.
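For illustration, the snippet below is a minimal sketch (not taken from any of the tools) of the manual categorical encoding these tools expect; the CSV path and column layout are hypothetical.

```python
# Minimal sketch of the manual preparation some tools require:
# categorical columns are label-encoded into integers before the
# data is handed to the AutoML tool. "train.csv" is a hypothetical file.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("train.csv")
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
# df is now an all-numerical matrix: the only input format accepted by
# tools such as TPOT, and the expected categorical format for Auto-sklearn.
```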
B. Model Selection, Hyperparameter Optimization, and Architecture Search
In this step, the extracted features from the previous step are used to train many different types of models, each with many different sets of parameters (hyperparameter optimization), and then the best model (or an ensemble of models) is selected as the final model. Each tool supports a collection of existing machine learning algorithms to build models, including, but not limited to, logistic regression, tree-based algorithms, SVMs, and neural network models. H2O-Automl, Ludwig, DataRobot, Darwin, Auto-ml, Auto-sklearn, MLjar, TransmogrifAI, and TPOT all work in this fashion for supervised methods. DataRobot, H2O-DriverlessAI and Darwin provide additional unsupervised methods such as clustering and outlier detection. TPOT and Darwin also utilize genetic algorithms to iteratively select the best traits of each model and pass them to the next generation. Google Cloud AutoML and Auto-keras work differently, utilizing neural architecture search to select the best neural network model.

For hyperparameter optimization, some of the most popular methods are grid search, random search, and Bayesian search. Auto-Weka uses SMAC (Sequential Model-based Algorithm Configuration [21]), while Auto-sklearn utilizes SMAC3, a re-implementation of SMAC, to efficiently perform Bayesian optimization. H2O-Automl and MLjar apply random search on the parameter spaces, while H2O-DriverlessAI, Ludwig, Auto-ml, TransmogrifAI and Auto-keras use both random and Bayesian search.

To reduce the time spent on model search and hyperparameter optimization, it is common to prune the parameter space. In the first approach, the tools attempt to quickly find a good initial parameter set. Auto-sklearn and Darwin use pre-processed 'meta-features' from previously trained datasets, each with a known 'meta-learner'. Given a target dataset, they find a similar dataset based on its 'meta-features' and use the closest 'meta-learners' as the initial models. The second approach is to exploit the relationship between model selection and hyperparameter optimization: H2O-Automl combines random grid search with stacked ensembles, as diversified models improve the accuracy of the ensemble. The third approach is to fix an allowed runtime within which the tools search for the best model; all AutoML tools except Auto-ml currently offer this option, as sketched below. The fourth approach (which applies only to H2O-Automl and Auto-sklearn) is to restrict parameter combinations that cause slow optimization; for example, non-linear feature approximation combined with KNN models is restricted, as it dramatically slows down the optimization.
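As a concrete illustration of the runtime-budget option, the sketch below sets time limits on two of the open-source tools. The parameter names come from the tools' public APIs at the versions we tested; exact behavior may differ across versions, and the budgets shown are illustrative.

```python
# Sketch: bounding the model/hyperparameter search with a time budget.
from autosklearn.classification import AutoSklearnClassifier
from tpot import TPOTClassifier

# Auto-sklearn: overall budget and per-model budget, in seconds.
askl = AutoSklearnClassifier(time_left_for_this_task=900,
                             per_run_time_limit=90)

# TPOT: overall budget in minutes for the genetic pipeline search.
tpot = TPOTClassifier(max_time_mins=15, verbosity=2)

# Both expose the familiar scikit-learn interface, e.g.:
# askl.fit(X_train, y_train); tpot.fit(X_train, y_train)
```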
Fig. 2. Comparison table of functionality for AutoML tools. (+): commercialized tools; (*): the function is not very stable and fails for some datasets; (2*): categorical input must be converted into integers; (3*): datasets have to include headers; (4*): missing values must be represented as NA; (5*): multiclass classification not provided; (6*): needs some user input for the dataset description, such as column types; (7*): ability to detect primitive and rich data types, such as text (id, url, phone) and numerical (integer, real); (8*): advanced feature processing: bucketing of values, removing features with zero variance or features with drift over time; (9*): supervised learning includes binary classification, multiclass classification, and regression; (10*): unsupervised learning includes clustering and anomaly detection; (11*): model interpretation and explainability refers to techniques such as LIME, Shapley, decision tree surrogate, partial dependence, individual conditional expectation, lift chart, feature fit, prediction distribution plot, accuracy over time, hot spot, and reason codes; (12*): confirmed by a company spokesperson; we could not find public documentation at the time of publication. For a few empty cells, it is unclear from the tools' documentation, to the best of our knowledge, whether the functionality is provided.

C. Model Interpretation and Prediction Analysis
This component is currently offered mainly by commercialized tools such as H2O-DriverlessAI, DataRobot and Darwin, whereas it is not the focus of the non-commercialized tools. In essence, it provides detailed result representation through model dashboards, feature importance, and different visualization methods, e.g., lift charts and prediction distributions. These tools even highlight outlier data points on which the best model was not confident in its predictions, and support 'reason codes', LIME, Shapley values, partial dependence, etc., for better model interpretation.
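To make the explainability output concrete, the following is a rough sketch using the open-source shap package on a toy model, rather than any of the surveyed tools' built-in dashboards; the dataset and model choice are illustrative only.

```python
# Sketch: post-hoc Shapley-value explanations of the kind these
# dashboards expose, computed here with the open-source `shap` package.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier().fit(data.data, data.target)

explainer = shap.TreeExplainer(model)           # exact Shapley values for tree models
shap_values = explainer.shap_values(data.data)  # one attribution per feature per sample

# Global feature-importance view, analogous to a dashboard summary plot.
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```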
IV. EXPERIMENTAL EVALUATION
We evaluate a selected subset of AutoML tools on nearly 300 datasets collected from OpenML [22], which allows users to query data for different use cases. Detailed descriptions of the datasets are given in Table I in the Appendix. Due to the unavailability of licensed or trial versions, we have not evaluated most commercialized tools; some other open-source tools have not been evaluated due to the lack of widely supported Python wrappers. Using OpenML datasets has two advantages: (i) the datasets are already pre-processed into numerical features (a few datasets containing text or other non-numerical features are not included in this paper), so the same data is fed to all AutoML tools, minimizing the risk of bias from the data selection process; and (ii) it guarantees a fair comparison among the tools, as some do not provide pre-processing steps for raw datasets.

Fig. 3. Data segments used for evaluation. Each cell is referred to as a 'data segment'. For example, in the first row, 'Less than one third' stands for datasets with a categorical proportion less than 1/3.

In order to evaluate the AutoML tools on a variety of dataset characteristics, we selected multiple datasets according to the criteria depicted in Figure 3. For the sake of clarity, each cell in this table is referred to as a 'data segment', each containing datasets with different sample sizes, feature dimensions, categorical features ratio (defined as the ratio of the number of categorical features to the total number of features), missing proportion (proportion of samples with at least one missing feature), and class imbalance (samples in the minority class vs. in the majority class). Each dataset is divided into two parts, one for training and the other for testing, with a fixed train/test ratio; all AutoML tools are applied to the same training and testing splits of all datasets. For all evaluations, the following tools and associated versions are used: Darwin 1.6, Auto-sklearn 0.5.2, Auto-keras 0.4.0, Auto-ml 2.9.10, Ludwig 0.1.2, H2O-Automl 3.24.0.5, TPOT 0.10.1.
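As a minimal sketch of this setup, the snippet below pulls one OpenML dataset and produces a train/test split using scikit-learn's OpenML fetcher; the dataset name, split ratio, and seed are illustrative rather than the paper's exact configuration.

```python
# Sketch: fetch a single OpenML dataset and split it, as done for each
# of the ~300 benchmark datasets. "credit-g" stands in for any dataset.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml("credit-g", version=1, return_X_y=True, as_frame=False)

# The same split is then fed to every AutoML tool under comparison.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```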
In the next subsections, we evaluate the AutoML tools on different test cases, each with three different supervised learning tasks: binary classification, multiclass classification, and regression. All experiments are run on Amazon EC2 p2.xlarge instances, which provide 1 Tesla K80 GPU, 4 vCPUs (Intel Xeon E5-2686, 2.30 GHz) and 61 GiB of host memory.

Setting a time limit for all experiments is not straightforward. On the one hand, we would like to let each tool run as long as it takes to produce the best results. On the other hand, with 3 ML tasks, 300 datasets and 6 tools, we have 5,400 experiments to run. To keep the experiment run-time and cost within practical limits, we aim for a 'completion target' of 70%, i.e., we select a run-time within which all tools are able to finish the AutoML tasks for 70% of the datasets. All the AutoML tools proved capable of hitting the 70% target within 15 minutes for binary classification. Five out of the six tools (all but Darwin) hit the 70% target for regression, and four out of six tools hit the 70% target in multiclass classification (TPOT nearly reaches the target, and Darwin misses the target again). As Darwin appears to be slow in convergence, and to be fair to the other tools, it is excluded from our completion-target analysis. We therefore decided to run all our extensive experiments (5,400) with 15-minute time limits, for a total of 1,350 hours of EC2 run-time (which includes the overhead of benchmark harness code); the results are detailed in Section IV-A. We then ran another experiment on a randomized subset of our datasets with longer time limits, to evaluate the performance of the tools when they are given more time to finish. The results of this latter experiment (3 tasks, 5 data segments, 6 run-time periods, 7 tools, for a total of 717 EC2 hours including benchmark harness overhead) are detailed in Section IV-B. Note that the Auto-ml tool was not included in the extensive experiments, as it does not offer an option to limit its run-time to a user-input value (15 minutes in our case); it can only run to completion. As such, its results are only included among the experiments in Section IV-B.

Fig. 4. Evaluation of AutoML tools on binary classification task across ten data segments (depicted in Figure 3). Each diagram refers to a data segment. All experiments are run for up to 15 minutes; some complete faster, while in other cases several tools (Ludwig, H2O-Automl, TPOT, Darwin, Auto-sklearn) cannot obtain results within that limit.

Fig. 5. Evaluation of AutoML tools on multiclass classification task across ten data segments (depicted in Figure 3). Each diagram refers to a data segment. All experiments are run for up to 15 minutes; some complete faster, while in other cases several tools (Auto-keras, H2O-Automl, Ludwig, Auto-sklearn, TPOT, Darwin) cannot obtain results within that limit.

A. Evaluation on multiple data segments
In this section, we investigate the performance of the tools across many datasets and applications (see Table I in the Appendix for detailed descriptions of the datasets). To that end, the evaluated data is divided into ten segments (as shown in Figure 3), each including ten random datasets. Accuracy is the comparison metric used for the binary and multiclass classification tasks, and mean squared error (MSE) is used for the regression tasks.

Figure 4 shows the performance of the AutoML tools for the binary classification task in different data segments. In this figure, the performance is represented in box-whisker format, where each box shows the median, with the first and third quartiles of the performance at the two ends. Note that, for the data segment with class imbalance (third row in Figure 4), F1-score is used instead of regular accuracy, as it is a more appropriate metric for imbalanced data.

It can be observed from Figure 4 that the performance of the AutoML tools fluctuates more with a larger number of categorical features, and fluctuates less with more data samples. This makes intuitive sense: the tools learn better with more data samples, and each tool takes a different approach to encoding categorical values, which results in different performance. In addition, most tools suffer on the imbalanced datasets, except Ludwig and Darwin. Comparing the tools against each other, H2O-Automl and Darwin slightly outperform the rest; however, it is worth reiterating that Darwin cannot deliver results for a sizable share of the datasets. Auto-sklearn and TPOT perform slightly worse than the aforementioned tools. Auto-keras does not perform as well as the other tools for most datasets in binary classification. As noted before, in this experiment we limit the optimization time to 15 minutes. Here, TPOT manages to complete and deliver results within the 15-minute time limit for most datasets, while Darwin and Auto-sklearn suffer slightly higher non-delivery ratios. Ludwig's performance appears to fluctuate the most compared to the other tools.

The performance of the tools for multiclass classification is illustrated in Figure 5. Here, minimal variation was found when evaluating between data segments of the same categories (the two graphs in each row). For this multiclass classification task, Auto-keras and Auto-sklearn slightly outperform the rest, even though Auto-sklearn cannot deliver results within the time limit for a portion of the datasets. TPOT comes next after these two tools. Finally, Ludwig, H2O-Automl and Darwin perform slightly worse than the rest.
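To make the metric choices concrete, the toy example below (not one of the benchmark datasets) shows why F1-score replaces plain accuracy on the class-imbalance segments; the data and model are illustrative only.

```python
# Sketch: on imbalanced data, accuracy can look deceptively high while
# F1 exposes weak minority-class performance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for a class-imbalance data segment.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

print("accuracy:", accuracy_score(y_te, y_pred))  # inflated by the majority class
print("F1:      ", f1_score(y_te, y_pred))        # fairer view under imbalance
```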
Fig. 6. Evaluation of AutoML tools on regression task across ten data segments (depicted in Figure 3). Each diagram refers to a data segment. All experiments are run for up to 15 minutes; some complete faster, while in other cases several tools (Auto-keras, H2O-Automl, Auto-sklearn, Ludwig, TPOT, Darwin) cannot obtain results within that limit.

Figure 6 shows the performance of the tools for the regression task. The results for this task show similar trends to binary classification with respect to categorical features. Furthermore, the performance variance tends to increase for all tools when the feature dimension decreases or the missing proportion increases. For this task, H2O-Automl and Auto-sklearn slightly outperform Auto-keras and TPOT, while Darwin cannot deliver results on half of the datasets.

To summarize what we have seen from the three different ML tasks: Auto-keras does not perform as well as the other tools for some datasets in binary classification; in other words, whether Auto-keras performs well in binary classification depends significantly on the nature of the dataset. For multiclass classification, H2O-Automl performs slightly worse than the rest. For the regression task, Auto-keras, H2O-Automl and Auto-sklearn outperform the rest for most data segments (even though Auto-sklearn struggles somewhat more to complete within the allotted 15 minutes, failing on a portion of the datasets). TPOT performs slightly worse than those three tools, Ludwig's performance varies across the datasets, and Darwin can only complete work on about half of the datasets in the allotted 15 minutes.
B. Evaluation on time limit

Our next evaluation explores the impact of the time limit, to investigate how quickly the tools deliver results and whether they consistently produce better results when given more time. We performed various time-limit experiments for datasets with different sample sizes. Here, we randomly select a dataset given a sample-size range (i.e., we pick a uniformly random dataset among all datasets in each sample-size range) and evaluate each tool's accuracy bounded by the time limits: 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours and 3 hours. Since the dataset sizes do not exceed one million samples, the maximum allotted time of 3 hours should allow the tools to converge. Figure 7 shows the results of this evaluation. As observed from the figure, most tools generally improve their performance (increasing accuracy for the classification tasks and decreasing mean squared error for the regression task) when given more time for optimization. Among the tools, H2O-Automl, Auto-keras and Ludwig converge to their optimal performance very quickly in most cases, roughly within 15 minutes. Auto-sklearn needs almost 2-3 hours to obtain reasonable results, while TPOT converges slightly faster. Darwin's performance appears to fluctuate even when given more time for optimization.

Fig. 7. Evaluation of AutoML tools on multiple time limits. The left (middle) subgraphs show the accuracy of the tools for binary (multiclass) classification; the right subgraphs show the mean squared error of the tools for regression. From top to bottom, each row shows a random dataset, in increasing order of sample size from 1,000 to 100,000. Note: in the left graph of the third row, all tools except Darwin obtain the same performance, although the graph displays only the result for TPOT; in the graphs at rows 3 and 4 of column 3, all tools except Auto-keras and Auto-ml cannot deliver results due to the large number of features; in the second graph of the third column, we omit the results of Ludwig, as its error is many times larger than the others'.
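The following sketch shows what one cell of this sweep could look like for a single tool (Auto-sklearn here); the real harness additionally loops over tools, tasks, and datasets, and the data split is assumed to come from the OpenML sketch earlier in this section.

```python
# Sketch: one cell of the time-limit sweep, run for a single tool.
# Budgets are in seconds and mirror the limits listed above.
from autosklearn.classification import AutoSklearnClassifier
from sklearn.metrics import accuracy_score

budgets = [300, 900, 1800, 3600, 7200, 10800]  # 5 min ... 3 hours
scores = {}
for budget in budgets:
    clf = AutoSklearnClassifier(time_left_for_this_task=budget)
    clf.fit(X_train, y_train)   # split from the OpenML sketch above
    scores[budget] = accuracy_score(y_test, clf.predict(X_test))
print(scores)                   # accuracy as a function of time budget
```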
C. Evaluation on robustness
In this evaluation, we test the robustness of the AutoML tools, i.e., whether the tools deliver similar results across multiple runs on the same input dataset. For each task, we select a random dataset with a sample size between 10,000 and 50,000 (a common sample size for many real-world datasets) and run each tool on it ten separate times, each time for 10 minutes. The results are illustrated in Figure 8. We observe that H2O-Automl and Ludwig obtain very stable performance across the three tasks. Darwin, Auto-keras and Auto-ml are slightly less stable than H2O-Automl. TPOT and Auto-sklearn are somewhat unstable on the regression task. It is worth noting that even though Ludwig's performance is very stable, it deviates considerably from the others'.

Fig. 8. Evaluation of AutoML tools on robustness.
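A simplified version of this repeated-run protocol is sketched below, shown with TPOT and assuming the train/test split from earlier in Section IV; run count, time budget, and seeding are illustrative.

```python
# Sketch: the same tool, the same data, ten independent runs, then the
# spread of scores as a simple robustness measure.
import statistics
from tpot import TPOTClassifier
from sklearn.metrics import accuracy_score

scores = []
for seed in range(10):
    tpot = TPOTClassifier(max_time_mins=10, random_state=seed)
    tpot.fit(X_train, y_train)   # split reused from the OpenML sketch
    scores.append(accuracy_score(y_test, tpot.predict(X_test)))

print("mean:", statistics.mean(scores), "stdev:", statistics.stdev(scores))
```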
V. CONCLUSIONS AND FUTURE WORK
In this paper, we have evaluated AutoML tools on their capabilities across the common machine learning pipeline. At the current state, different tools take different approaches to model selection and hyperparameter optimization. Commercialized tools such as H2O-DriverlessAI, DataRobot and Darwin extend their functionality to the first and third components of the pipeline, where they are able to detect the data schema, run feature engineering, and analyze detailed results for interpretation purposes. In contrast, open-source tools focus more on the second task in the pipeline: training and selecting the best model itself.

In addition, we have evaluated the tools across many datasets in different data segments. We observed that most AutoML tools obtain reasonable performance across many datasets. However, there is no perfect tool yet; no tool managed to outperform all others on a plurality of tasks. Across the various evaluations and benchmarks we have tested, H2O-Automl, Auto-keras and Auto-sklearn performed better than Ludwig, Darwin, TPOT and Auto-ml. In particular, H2O-Automl slightly outperforms the rest for binary classification and regression, and converges quickly to optimal results; however, it suffers from low performance in multiclass classification. Auto-keras is very stable across all tasks; it performs slightly better than the rest for multiclass classification and ties with H2O-Automl for regression, but suffers from low performance in binary classification. For a production environment where computation speed and performance stability are key requirements, these two tools might be good candidates, depending on the application and machine learning task. Auto-sklearn ties with H2O-Automl and Auto-keras across all tasks, but it is comparatively slower and usually requires a longer run-time. Other tools such as Ludwig, Darwin, TPOT and Auto-ml showed more varying results depending on the dataset and task.

Ultimately, there is no single AutoML tool at this point that clearly outperforms every other tool. We are at an early juncture for automated machine learning, and many innovations are announced at a rapid pace. We believe that as the tools mature and borrow ideas from each other, they will gain more strength in their core task. We also observed a gap in the AutoML tools' support for the first and third stages of the pipeline, and expect major developments in those areas in the near future.
Disclaimer: For commercialized tools, our analyses and descriptions are consistent with our understanding derived from publicly available documentation and product descriptions. In some of these cases, we are unable to explore source code, and we regret any factual errors that may arise. A subsidiary of Capital One, Capital One Ventures, is an investor in H2O.ai. During the course of our research we were in contact with neither H2O.ai nor Capital One Ventures.

REFERENCES

[1] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown, "Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 826-830, Jan. 2017.
[2] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: An update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18, 2009.
[3] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Efficient and robust automated machine learning," in Advances in Neural Information Processing Systems 28, 2015, pp. 2962-2970.
[4] R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore, "Evaluation of a tree-based pipeline optimization tool for automating data science," in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO) 2016. New York, NY, USA: ACM, 2016, pp. 485-492.
[5] "Auto-ml: Automated machine learning for production and analytics," https://github.com/ClimbsRocks/auto_ml, accessed: 2019-04-10.
[6] H. Jin, Q. Song, and X. Hu, "Auto-keras: An efficient neural architecture search system," in arXiv, 2018.
[7] "Mljar," https://mljar.com, accessed: 2019-04-10.
[8], [9], [10] "Datarobot," https://www.datarobot.com, accessed: 2019-04-10.
[11] "H2o.ai automl github," https://github.com/h2oai/h2o-3, accessed: 2019-04-10.
[12] "H2o.ai automl documentation," http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html, accessed: 2019-04-10.
[13] "H2o-driverlessai," http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/index.html, accessed: 2019-04-10.
[14] "Darwin-sparkcognition," https://github.com/sparkcognition/darwin-sdk, accessed: 2019-04-10.
[15] "Google cloud automl," https://cloud.google.com/automl/, accessed: 2019-04-10.
[16] "Automated machine learning with azureml," https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning, accessed: 2019-04-10.
[17] "Transmogrifai," https://github.com/salesforce/TransmogrifAI, accessed: 2019-04-10.
[18] "Ludwig," https://github.com/uber/ludwig, accessed: 2019-04-10.
[19] B. Zoph and Q. Le, "Neural architecture search with reinforcement learning," in arXiv, Nov. 2016.
[20] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, "Efficient neural architecture search via parameters sharing," in Proceedings of the 35th International Conference on Machine Learning, vol. 80, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018, pp. 4095-4104.
[21] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "Sequential model-based optimization for general algorithm configuration," in Proceedings of the 5th International Conference on Learning and Intelligent Optimization, ser. LION'05, 2011, pp. 507-523.
[22] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo, "OpenML: Networked science in machine learning," ACM SIGKDD Explorations Newsletter, vol. 15, pp. 49-60, Jun. 2014.

APPENDIX
TABLE I
DATASET DESCRIPTIONS