The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development
Micah J. Smith, Carles Sala, James Max Kanter, Kalyan Veeramachaneni
ABSTRACT
As machine learning is applied more widely, data scientists often struggle to find or create end-to-end machine learning systems for specific tasks. The proliferation of libraries and frameworks and the complexity of the tasks have led to the emergence of “pipeline jungles”: brittle, ad hoc ML systems. To address these problems, we introduce the Machine Learning Bazaar, a new framework for developing machine learning and automated machine learning software systems. First, we introduce ML primitives, a unified API and specification for data processing and ML components from different software libraries. Next, we compose primitives into usable ML pipelines, abstracting away glue code, data flow, and data storage. We further pair these pipelines with a hierarchy of AutoML strategies: Bayesian optimization and bandit learning. We use these components to create a general-purpose, multi-task, end-to-end AutoML system that provides solutions to a variety of data modalities (image, text, graph, tabular, relational, etc.) and problem types (classification, regression, anomaly detection, graph matching, etc.). We demonstrate 5 real-world use cases and 2 case studies of our approach. Finally, we present an evaluation suite of 456 real-world ML tasks and describe the characteristics of 2.5 million pipelines searched over this task suite.
CCS CONCEPTS
• Computing methodologies → Machine learning; • Software and its engineering → Abstraction, modeling and modularity; Software development techniques.
KEYWORDS
machine learning; AutoML; software development; ML pipelines; ML primitives
SIGMOD’20, June 14–19, 2020, Portland, OR, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD’20), June 14–19, 2020, Portland, OR, USA, https://doi.org/10.1145/3318464.3386146.
ACM Reference Format:
Micah J. Smith, Carles Sala, James Max Kanter, and Kalyan Veeramachaneni. 2020. The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD’20), June 14–19, 2020, Portland, OR, USA. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3318464.3386146
Once limited to conventional commercial applications, machine learning (ML) is now being widely applied in physical and social sciences, in policy and government, and in a variety of industries. This diversification has led to difficulties in actually creating and deploying real-world systems, as key functionality becomes fragmented across ML-specific or domain-specific software libraries created by independent communities. In addition, the process of building problem-specific end-to-end systems continues to be marked by ML and data management challenges, such as formulating achievable learning problems [29], managing and cleaning data and metadata [6, 38, 49], scaling tuning procedures [17, 34], and deploying models and serving predictions [3, 12]. In practice, engineers and data scientists often spend significant effort developing ad hoc programs for new problems: writing “glue code” to connect components from different software libraries, processing different forms of raw input, and interfacing with external systems. These steps are tedious and error-prone and lead to the emergence of brittle “pipeline jungles” [43].

These points raise the question, “How can we make building ML systems easier in practical settings?” A new approach is needed to designing and developing software systems that solve specific ML tasks. Such an approach should address a wide variety of input data modalities, such as images, text, audio, signals, tables, and graphs, and learning problem types, such as regression, classification, clustering, anomaly detection, community detection, and graph matching; it should cover the intermediate stages involved, such as data preprocessing, munging, featurization, modeling, and evaluation; and it should enable AutoML functionality to fine-tune solutions, such as hyperparameter tuning and algorithm selection. Moreover, it should offer coherent APIs, fast iteration on ideas, and easy integration of new ML innovations. In sum, this ambitious goal would allow almost all end-to-end learning problems to be solved or built using a single framework.

To address these challenges, we present the Machine Learning Bazaar, a framework for designing and developing ML and AutoML systems. We organize the ML ecosystem into composable software components, ranging from basic building blocks like individual classifiers to full AutoML systems. With our design, a user specifies a task, provides a raw dataset, and either composes an end-to-end pipeline out of pre-existing, annotated ML primitives or requests a curated pipeline for their task (Figure 1). The resulting pipelines can be easily evaluated and deployed across a variety of software and hardware settings and tuned using a hierarchy of AutoML approaches. Using our own framework, we have created an AutoML system which we have entered in DARPA’s Data-Driven Discovery of Models (D3M) program; ours is the first end-to-end, modular, publicly released system designed to meet the program’s goal.

To preview the potential of development using our framework, we highlight the Orion project within MIT for ML-based anomaly detection in satellite telemetry (Figure 2), as one of several successful real-world applications that use ML Bazaar for effective ML system development (Section 4.1). The Orion pipeline processes a telemetry signal using several custom preprocessors, an LSTM predictor, and a dynamic thresholding postprocessor to identify anomalies. The entire pipeline can be represented in a short Python snippet, custom processing steps are easily implemented as modular components, two external libraries are integrated without glue code, and the pipeline can be tuned using composable AutoML functionality.

Our contributions in this paper include:
• A composable framework for representing and developing ML and AutoML systems: Our framework enables users to specify a pipeline for any ML task, ranging from image classification to graph matching, through a unified API (Sections 2 and 3).
• The first general-purpose automated machine learning system: Our system, AutoBazaar, is, to the best of our knowledge, the first open-source, publicly available system with the ability to reliably compose end-to-end, automatically tuned solutions for 15 data modalities and problem types (Section 3.3).
• Successful applications: We describe 5 successful applications of our framework on real-world problems (Section 4).
• A comprehensive evaluation: We evaluated our AutoML system against a suite of 456 ML tasks/datasets covering 15 ML task types, analyzing 2.5 million scored pipelines (Section 5).
• Open-source libraries: Components of our framework have been released as the open-source libraries MLPrimitives, MLBlocks, BTB, piex, and AutoBazaar.

Just as one open-source community was described as “a great babbling bazaar of different agendas and approaches” [41], our framework is characterized by the availability of many compatible alternatives, a wide variety of libraries and custom solutions, a space for new contributions, and more.

The ML Bazaar is a composable framework for developing ML and AutoML systems based on a hierarchical organization of and unified API for the ecosystem of ML software and algorithms. One can use curated or custom software components for every aspect of the practical ML process, from featurizers for relational datasets to signal processing transformers to neural networks to pre-trained embeddings. From these primitives, data scientists can easily and efficiently construct ML solutions for a variety of ML task types, and ultimately, automate much of the work of tuning these models.

A primitive is a reusable, self-contained software component for ML paired with the structured annotation of its metadata. It has a well-defined fit/produce interface wherein it receives input data in one of several formats or types, performs computations, and returns the data in another format or type. With this categorization and abstraction, widely varying functionality required to construct ML pipelines can be collected in a single location. Primitives can be re-used in chained computations while minimizing glue code written by callers. An example primitive annotation is shown in Listing 1.

Primitives encapsulate different types of functionality. Many have a learning component, such as a random forest classifier. Many others, categorized as transformers, have no learning component and only a produce method, but are very important nonetheless. For example, the Hilbert and Hadamard transforms from signal processing would be important primitives to include when building an ML system to solve a task in Internet-of-Things.

Some primitives do not change the values in the data, but simply prepare or reshape the data. These glue primitives are intended to reduce glue code required to connect primitives into a full system. An example of this type of primitive is pandas.DataFrame.unstack.
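The three kinds of primitive described above can be illustrated with a minimal sketch. The names MeanCenterer, absolute_values, and to_rows are hypothetical and are not part of MLPrimitives; they only show the difference between a primitive with a learning component (fit and produce), a transformer with produce alone, and a glue primitive that reshapes data without changing values.

```python
from statistics import mean


class MeanCenterer:
    """Primitive with a learning component: fit learns a statistic, produce applies it."""

    def fit(self, X):
        self.mean_ = mean(X)
        return self

    def produce(self, X):
        return [x - self.mean_ for x in X]


def absolute_values(X):
    """Transformer primitive: no learning component, only a produce step."""
    return [abs(x) for x in X]


def to_rows(columns):
    """Glue primitive: reshape a dict of columns into a list of row dicts."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*(columns[k] for k in keys))]


# Chain the three primitives by hand; a pipeline would automate this wiring.
centerer = MeanCenterer().fit([1.0, 2.0, 3.0])
centered = centerer.produce([1.0, 2.0, 3.0])
magnitudes = absolute_values(centered)
rows = to_rows({"t": [0, 1, 2], "value": magnitudes})
```

Even in this toy chain the caller has to remember which object exposes which method, which is the kind of glue the annotations are meant to absorb.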
Figure 1: Various ML task types that can be solved in ML Bazaar using composition of ML primitives (abbreviated here from fully-qualified names). Primitives are categorized into preprocessors, feature processors, estimators, and postprocessors and are drawn from many different ML libraries, such as scikit-learn, Keras, OpenCV, and NetworkX, as well as custom implementations. Many additional primitives and pipelines are available in our curated catalog.

Each primitive is annotated with machine-readable metadata that enables its usage and automatic integration within an execution engine. Annotations allow us to unify a variety of primitives from disparate libraries, reduce the need for glue code, and provide information about the tunable hyperparameters. This full annotation is provided in a single JSON file and has three major sections:
• Meta-information.
This section has the name of the primitive, the fully-qualified name of the underlying implementation as a Python object, and other detailed metadata, such as the author, description, documentation URL, categorization, and the data modalities it is most used for. This information enables searching and indexing primitives.
• Information required for execution.
This section has the names of the methods pertaining to fit/produce in the underlying implementation as well as the data types of the primitive’s inputs and outputs. When applicable, for each primitive, we annotate the ML data types of declared inputs and outputs, i.e., recurring objects in ML that have a well-defined semantic meaning, such as a feature matrix X, a target vector y, or a space of class labels classes. We provide a mapping between ML data types and synonyms used by specific libraries as necessary. This logical structure will help dramatically decrease the amount of glue code developers must write (Section 2.2.1).
• Information about hyperparameters. The third section details all the hyperparameters of the primitive: their names, descriptions, data types, ranges, and whether they are fixed or tunable. It also captures any conditional dependencies between the hyperparameters.

The primitive annotation specification is described and documented in full in the associated MLPrimitives library.

    { "name" : "cv2.GaussianBlur", "contributors" : ["Carles Sala

Listing 1: GaussianBlur transformer primitive using MLPrimitives. (Some fields are abbreviated or elided.) This primitive does not annotate any tunable hyperparameters but such a section marks hyperparameter types, defaults, and feasible values.

    primitives = [
        'mlp.custom.ts_preprocessing.time_segments_average',
        'sklearn.impute.SimpleImputer',
        'sklearn.preprocessing.MinMaxScaler',
        'mlp.custom.ts_preprocessing.rolling_window_sequences',
        'keras.Sequential.LSTMTimeSeriesRegressor',
        'mlp.custom.ts_anomalies.regression_errors',
        'mlp.custom.ts_anomalies.find_anomalies',
    ]
    options = {'init_params': {'mlp.custom.ts_preprocessing.time_segments_average

(a) Python representation.

(b) Graph representation. [Graph nodes: time_segments_average, SimpleImputer, MinMaxScaler, rolling_window_sequences, LSTMTimeSeriesRegressor, regression_errors, find_anomalies; edges are labeled with the ML data types X, y, errors, and index.]

    from mlblocks import MLPipeline
    from orion.data import load_signal
    from orion.pipelines.lstm_dt import primitives, options

    train = load_signal('S-1-train')
    test = load_signal('S-1-test')
    ppl = MLPipeline(primitives, **options)
    ppl.fit(train)
    anomalies = ppl.predict(test)

(c) Usage with Python SDK.

Figure 2: Representation and usage of the Orion pipeline for anomaly detection using the ML Bazaar framework. ML system developers or researchers describe the pipeline in a short Python snippet by a sequence of primitives annotated from several libraries (and optional additional parameters). Our framework compiles this into a graph representation (Section 2.2.2) by consulting meta-information associated with the underlying primitives (Section 2.1.1). Developers can then use our Python SDK to train the pipeline on “normal” signals and then identify anomalies in test signals. The MLPipeline provides a familiar interface but enables more general data engineering and ML processing. It also can expose the entire underlying hyperparameter configuration space for tuning by our AutoML libraries or others (Section 3).

We have developed the open-source MLPrimitives library (https://github.com/HDI-Project/MLPrimitives), which contains a number of primitives adapted from different libraries (Table 1). For libraries that already provide a fit/produce interface or similar (e.g. scikit-learn), a primitive developer has to write the JSON specification and point to the underlying estimator class.

To support integration of primitives from libraries that need significant adaptation to the fit/produce interface, MLPrimitives also provides a powerful set of adapter modules that assist in wrapping common patterns. These adapter modules then allow us to integrate many functionalities as primitives from the library without having to write a separate object for each, thus requiring us to write only an annotation file for each primitive. Keras is an example of such a library.

    Source                  Count
    scikit-learn               39
    MLPrimitives (custom)      27
    Keras                      25
    pandas                     16
    Featuretools                4
    NumPy                       3
    NetworkX                    2
    XGBoost                     2
    LightFM                     1
    OpenCV                      1
    python-louvain              1
    scikit-image                1
    statsmodels                 1

Table 1: Primitives in the curated catalog of MLPrimitives, by library source. Catalogs maintained by individual projects may contain more primitives.
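To make the three annotation sections concrete, here is a minimal sketch of how an execution engine might consume one. The key names in this JSON document are illustrative assumptions and should not be read as the MLPrimitives schema; the wrapped callable, statistics.quantiles from the Python standard library, is chosen only because it needs no third-party dependency. The helper names load_object, default_hyperparameters, and make_produce are likewise hypothetical.

```python
import importlib
import json

# Hypothetical annotation: metadata, execution information, and hyperparameters.
ANNOTATION = json.loads("""
{
  "name": "statistics.quantiles",
  "primitive": "statistics.quantiles",
  "metadata": {"description": "Cut points of a numeric sample", "modalities": ["single_table"]},
  "produce": {"args": [{"name": "data", "type": "list"}],
              "output": [{"name": "cut_points", "type": "list"}]},
  "hyperparameters": {
    "fixed": {"method": {"type": "str", "default": "exclusive"}},
    "tunable": {"n": {"type": "int", "default": 4, "range": [2, 100]}}
  }
}
""")


def load_object(qualified_name):
    """Resolve a fully-qualified name such as 'package.module.attr' to a Python object."""
    module_name, _, attribute = qualified_name.rpartition(".")
    return getattr(importlib.import_module(module_name), attribute)


def default_hyperparameters(annotation):
    """Collect default values from both the fixed and tunable sections."""
    defaults = {}
    for section in annotation["hyperparameters"].values():
        for name, spec in section.items():
            defaults[name] = spec["default"]
    return defaults


def make_produce(annotation, **overrides):
    """Bind the underlying callable to its hyperparameters, yielding a produce function."""
    func = load_object(annotation["primitive"])
    hyperparameters = {**default_hyperparameters(annotation), **overrides}
    return lambda data: func(data, **hyperparameters)


produce = make_produce(ANNOTATION, n=10)
cut_points = produce(list(range(1, 21)))
```

Because the callable is located purely from the annotation, integrating another compatible function in this sketch would mean writing another JSON document rather than another wrapper class.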
For developers, domain experts, and researchers, MLPrimitives enables easy contribution of new primitives in several ways by providing primitive templates, example annotations, and detailed tutorials and documentation. We also provide procedures to validate proposed primitives against the formal specification and a unit test suite. Finally, contributors can also write custom primitives.

Currently (as of MLPrimitives v0.2.4), MLPrimitives maintains a curated catalog of high-quality, useful primitives from 12 libraries, as well as custom primitives that we have created (Table 1). Each primitive is identified by a fully-qualified name to differentiate primitives across catalogs. The JSON annotations can then be mined for additional insights.

We considered multiple alternatives to the primitives API, such as representing all of them as Python data structures or classes, regardless of their type (i.e. transformers or estimators). One disadvantage of these alternatives is that they make it more difficult for domain experts to contribute primitives. We have found that domain experts, such as engineers and scientists in the satellite industry, prefer writing functions rather than other constructs such as classes, and many domain-specific processing methods are simply transformers without a learning component.

Another option we considered was to enforce that every primitive, whether brought over from a library with a compatible API or otherwise, be integrated via a Python class with wrapper methods. We opted against this approach as it led to excessive wrapper code and created redundancy, which made it more difficult to write primitives. Instead, for libraries that are compatible, our design requires that we only create the annotation file.

In this work, we focus on the wealth of ML functionality that exists in the Python ecosystem. Through ML Bazaar’s careful design, we could also support other common languages in data science like R, MATLAB, and Julia and enable multi-language pipelines. Starting from our JSON primitive annotation format, a multi-language pipeline execution backend would be built that uses language-specific kernels or containers and relies on an interoperable data format such as Apache Arrow. A language-independent format like JSON provides several additional advantages. It is both machine- and human-readable and writeable. It is also a natural format for storage and querying in NoSQL document stores, allowing developers to easily query a knowledge base of primitives for the subset appropriate for a specific ML task type, for example.
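As a rough illustration of querying a knowledge base of annotations, the sketch below filters an in-memory list of JSON records. The field names and values ("category", "modalities") are assumptions made for this example, and find_primitives is a hypothetical helper; a document store would offer an equivalent query natively.

```python
import json

# Illustrative catalog; the metadata fields and values are assumptions for this sketch.
CATALOG_JSON = """
[
  {"name": "TextCleaner", "category": "preprocessor", "modalities": ["text"]},
  {"name": "GaussianBlur", "category": "preprocessor", "modalities": ["image"]},
  {"name": "StandardScaler", "category": "preprocessor",
   "modalities": ["single_table", "timeseries"]},
  {"name": "LSTMTextClassifier", "category": "estimator", "modalities": ["text"]}
]
"""


def find_primitives(catalog, **criteria):
    """Return the names of annotations that match every criterion.

    A criterion matches if the field equals the value or, for list-valued
    fields, if the list contains the value.
    """
    def matches(annotation, field, value):
        actual = annotation.get(field)
        return value in actual if isinstance(actual, list) else actual == value

    return [annotation["name"] for annotation in catalog
            if all(matches(annotation, field, value) for field, value in criteria.items())]


catalog = json.loads(CATALOG_JSON)
text_primitives = find_primitives(catalog, modalities="text")
text_estimators = find_primitives(catalog, modalities="text", category="estimator")
```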
To solve practical learning problems, we must be able to instantiate and compose primitives into usable programs. These programs must be easy to specify with a natural interface, such that developers can easily compose primitives without sacrificing flexibility. We aim to support both end-users trying to build an ML solution for their specific problem who may not be savvy about software engineering, as well as system developers wrapping individual ML solutions in AutoML components. In addition, we provide an abstracted execution layer, such that learning, data flow, data storage, and deployment are handled automatically by various configurable and pluggable backends. As one realization of these ideas, we have implemented MLBlocks, a library for composing, training, and deploying end-to-end ML pipelines.

    from mlblocks import MLPipeline
    from mlblocks.datasets import load_umls
    from mlblocks.discovery import load_pipeline

    dataset = load_umls()
    X_train, X_test, y_train, y_test = dataset.get_splits(1)
    graph = dataset.graph
    node_columns = ['source', 'target']
    ppl = MLPipeline(load_pipeline('graph.link_prediction.nx.xgb'))
    ppl.fit(X_train, y_train, graph=graph, node_columns=node_columns)
    y_pred = ppl.predict(X_test, graph=graph, node_columns=node_columns)

Listing 2: Usage of the MLBlocks library for a graph link prediction task. Curated pipelines in the MLPrimitives library can be easily loaded. Pipelines provide a familiar API but enable more general data engineering and ML.
MLBlocks: https://github.com/HDI-Project/MLBlocks

We introduce ML pipelines, which collect multiple primitives into a single computational graph. Each primitive in the graph is instantiated in a pipeline step, which loads and interprets the underlying primitive and provides a common interface to run a step in a larger program.

We define a pipeline as a directed acyclic multigraph $L = \langle V, E, \lambda \rangle$, where $V$ is a collection of pipeline steps, $E$ are the directed edges between steps representing data flow, and $\lambda$ is a joint hyperparameter vector for the underlying primitives. A valid pipeline (and its generalizations, Section 3.1) must also satisfy acceptability constraints that require the inputs to each step to be satisfied by the outputs of another step connected by a directed edge.

The term “pipeline” is used in the literature to refer to an ML-specific sequence of operations, and sometimes abused (as we do here) to refer to a more general computational graph or analysis. In our conception, we bring foundational data processing operations of raw inputs into this scope, like featurization of graphs, multi-table relational data, time series, text, and images, as well as simple data transforms, like encoding integer or string targets. This gives our pipelines a greatly expanded role, providing solutions to any ML task type and spanning the entire ML process beginning with the raw dataset.

    primitives = [
        'UniqueCounter',
        'TextCleaner',
        'VocabularyCounter',
        'Tokenizer',
        'SequencePadder',
        'LSTMTextClassifier',
    ]

(a) Python representation.

(b) Graph representation. [Graph nodes: UniqueCounter, TextCleaner, VocabularyCounter, Tokenizer, SequencePadder, LSTMTextClassifier; edges are labeled with the ML data types X, y, vocabulary size, and classes.]

Figure 3: Recovery of ML computational graph from pipeline description for a text classification pipeline. The ML data types that enable extraction of the graph, and stand for data flow, are labeled along edges.
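The idea of pipeline steps sharing a common interface, with each step's inputs satisfied by earlier outputs, can be sketched as follows. This is a simplified illustration under stated assumptions, not the MLBlocks engine: run_pipeline is a hypothetical helper, steps are plain functions, and the step functions only loosely mimic the first primitives of the text classification pipeline in Figure 3.

```python
def run_pipeline(steps, context):
    """Run steps in order over a context mapping ML data type names to values.

    Each step is a tuple (function, input_names, output_names).
    """
    context = dict(context)
    for function, input_names, output_names in steps:
        missing = [name for name in input_names if name not in context]
        if missing:
            # Acceptability constraint: every input must come from an earlier output.
            raise KeyError(f"unsatisfied inputs: {missing}")
        results = function(*(context[name] for name in input_names))
        if len(output_names) == 1:
            results = (results,)
        context.update(zip(output_names, results))
    return context


def clean(X):
    return [text.strip().lower() for text in X]


def count_classes(y):
    return len(set(y))


def vocabulary_size(X):
    return len({word for text in X for word in text.split()})


steps = [
    (count_classes, ["y"], ["classes"]),
    (clean, ["X"], ["X"]),
    (vocabulary_size, ["X"], ["vocabulary_size"]),
]
context = run_pipeline(steps, {"X": [" Good movie ", "Bad MOVIE"], "y": ["pos", "neg"]})
```

Note that clean overwrites X in the context, so vocabulary_size sees the cleaned text; this is the sense in which steps that modify the same ML data type form a path.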
Large graph-structured workloads can be difficult to specify for end-users due to the complexity of the data structure and such workloads are an active area of research in data management. In ML Bazaar, we consider three aspects of pipeline representation: ease of composition, readability, and computational issues. First, we prioritize easily composing complex ML pipelines by providing a pipeline description interface (PDI) in which developers specify only the topological ordering of all pipeline steps in the pipeline without requiring any explicit dependency declarations. These steps can be passed to our libraries as Python data structures or loaded from JSON files. Full training-time (fit) and inference-time (produce) computational graphs can then be recovered (Algorithm 1). This is made possible by the meta-information provided in the primitive annotations, in particular, the ML data types of the primitive inputs and outputs. We leverage the observation that steps that modify the same ML data type can be grouped into the same subpath. In cases where this information does not uniquely identify a graph, the user can additionally provide an input-output map which serves to explicitly add edges to the graph, as well as other parameters to customize the pipeline.

Though it may be more difficult to read and understand these pipelines from the PDI alone as the edges are not shown nor labeled, it is easy to accompany them with the recovered graph representation (Figures 2 and 3).

    Input: pipeline description S = (v_1, ..., v_n), source node v_0, sink node v_{n+1}
    Output: directed acyclic multigraph ⟨V, E⟩
    begin
        S ← v_0 ∪ S ∪ v_{n+1}
        V ← ∅, E ← ∅
        U ← ∅                                // unsatisfied inputs
        while S ≠ ∅ do
            v ← popright(S)                  // last pipeline step remaining
            M ← popmatches(U, outputs(v))
            if M ≠ ∅ then
                V ← V ∪ {v}
                for (v′, σ) ∈ M do
                    E ← E ∪ {(v, v′, σ)}
                for σ ∈ inputs(v) do         // unsatisfied inputs of v
                    U ← U ∪ {(v, σ)}
            else                             // isolated node
                return INVALID
        if U ≠ ∅ then                        // unsatisfied inputs remain
            return INVALID
        return ⟨V, E⟩

Algorithm 1: Pipeline-Graph Recovery. Pipeline steps are added to the graph in reverse order and edges are iteratively added when the step under consideration produces an output that is required by an existing step. Exactly one graph is recovered if a valid graph exists. In cases where multiple graphs have the same topological ordering, the user can additionally provide an input-output map (which modifies the result of inputs(v)/outputs(v) above) to explicitly add edges and thereby select from among several possible graphs.

The resulting graphs describe abstract computational workloads, but we must be able to actually execute them for purposes of learning and inference. From the recovered graphs, we could re-purpose many existing data engineering systems as backends for scheduling and executing the workloads [42, 45, 53]. In our MLBlocks execution engine, a collection of objects and a metadata tracker in a key-value store are iteratively transformed through sequential processing of pipeline steps. The Orion pipeline would be executed using MLBlocks as shown in Figure 2c.
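Algorithm 1 can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements about the authors' implementation: the sink's inputs are used to seed the set of unsatisfied inputs (the extracted pseudocode does not show how the first popped node is handled), and the input and output data types given to each step of the text classification example are guesses made for illustration. The function name recover_graph is hypothetical.

```python
def recover_graph(steps, pipeline_inputs, pipeline_outputs):
    """Recover (nodes, edges) from a topologically ordered pipeline description.

    steps: list of (name, inputs, outputs), with ML data type names as strings.
    Returns None when no valid graph exists.
    """
    source = ("source", [], list(pipeline_inputs))
    remaining = [source, *steps]
    nodes = ["sink"]
    edges = []
    # Assumption: the sink requires the pipeline outputs, which seeds the search.
    unsatisfied = [("sink", dtype) for dtype in pipeline_outputs]

    while remaining:
        name, inputs, outputs = remaining.pop()  # last pipeline step remaining
        matches = [(consumer, dtype) for consumer, dtype in unsatisfied
                   if dtype in outputs]
        if not matches:
            return None  # isolated node: nothing downstream uses its outputs
        unsatisfied = [pair for pair in unsatisfied if pair not in matches]
        nodes.append(name)
        edges.extend((name, consumer, dtype) for consumer, dtype in matches)
        unsatisfied.extend((name, dtype) for dtype in inputs)

    if unsatisfied:
        return None  # unsatisfied inputs remain
    return nodes[::-1], edges


# Illustrative input/output annotations for the pipeline of Figure 3.
steps = [
    ("UniqueCounter", ["y"], ["classes"]),
    ("TextCleaner", ["X"], ["X"]),
    ("VocabularyCounter", ["X"], ["vocabulary_size"]),
    ("Tokenizer", ["X"], ["X"]),
    ("SequencePadder", ["X"], ["X"]),
    ("LSTMTextClassifier", ["X", "y", "classes", "vocabulary_size"], ["y"]),
]
nodes, edges = recover_graph(steps, pipeline_inputs=["X", "y"], pipeline_outputs=["y"])
```

Each edge is a triple (producer, consumer, data type), so two steps can be joined by several edges, which is why the result is a multigraph.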
Why not scikit-learn?
Several alternatives exist to our new ML pipeline abstraction (Section 2.2), such as scikit-learn’s Pipeline [11]. Ultimately, while our pipeline is inspired by these alternatives, it aims to provide more general data engineering and ML functionality. While the scikit-learn pipeline sequentially applies a list of transformers to X and y only before outputting a prediction, our pipeline supports general computational graphs, accepts multiple data modalities as input simultaneously, produces multiple outputs, manages evolving metadata, and can use software from outside the scikit-learn ecosystem/design paradigm. For example, we can use our pipeline to construct entity sets [28] from multi-table relational data for input to other pipeline steps. We can also support pipelines in an unsupervised learning paradigm, such as in Orion, where we create the target y “on-the-fly” (Figure 2).

Where’d the glue go?
To connect learning components from different libraries with incompatible APIs, data scientists end up writing “glue code”. Typically, this glue code is written within pipeline bodies. In ML Bazaar, we mitigate the need for this glue by pushing the need for API adaptation down to the level of primitive annotations, which are written once and reside in central locations, amortizing the adaptation cost. Moreover, the need for glue code arises in creating intermediate outputs and shaping of the data. We created a number of primitives that support these common programming patterns and miscellaneous needs in development of an ML pipeline. These are, for example, data reshaping primitives like pandas.DataFrame.unstack, data preparation primitives like pad_sequences required for Keras-based LSTMs, and utilities like UniqueCounter that count the number of unique classes.
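Glue primitives of this kind tend to be very small. The sketch below gives simplified stand-ins for two of them; these are not the MLPrimitives or Keras implementations, and the signatures are assumptions made for illustration (the padding helper pads and truncates on the left only).

```python
def unique_counter(y):
    """Count the number of unique class labels in a target vector."""
    return len(set(y))


def pad_sequences(sequences, maxlen, value=0):
    """Left-pad, and left-truncate, each sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    padded = []
    for sequence in sequences:
        kept = list(sequence)[-maxlen:]  # keep the most recent items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


classes = unique_counter(["spam", "ham", "spam"])
batch = pad_sequences([[1, 2, 3], [4]], maxlen=2)
```

Packaging such helpers as annotated primitives means their outputs, such as the class count an LSTM classifier needs, can be wired in by data type rather than by hand.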
Interactive development.
Interactivity is an important aspect of data science development for beginners and experts alike, as they build understanding of the data and iterate on different modeling ideas. In ML Bazaar, the level of interactivity possible depends on the specific runtime library. For example, our MLBlocks library supports interactive development in a shell or notebook environment by allowing the inspection of intermediate pipeline outputs and by allowing pipelines to be iteratively expanded starting from a loaded pipeline description. Alternatively, ML primitives could be used as a backend pipeline representation for software that provides more advanced interactivity such as drag-and-drop. For interfaces that require low latency pipeline scoring to provide user feedback such as [13], ML Bazaar’s performance depends mainly on the underlying primitive implementations (Section 5).
Supporting new task types.
While ML Bazaar handles 15 ML task types (Table 4), there are many more task types for which we do not currently provide pipelines in our default catalog (Section 5.5). To extend our approach to support new task types, it is generally sufficient to write several new primitive annotations for pre-processing input and post-processing output; no changes are needed to the core ML Bazaar software libraries such as MLPrimitives and MLBlocks. For example, for the anomaly detection task type from the Orion project, several new simple primitives were implemented: rolling_window_sequences, regression_errors, and find_anomalies. Indeed, support for a certain task type is predicated on the availability of a pipeline for that task type rather than any characteristics of our software libraries.

Primitive versioning.
The default catalog of primitives from the MLPrimitives library is versioned together, and library conflicts are resolved manually by maintainers through carefully specifying minimum and maximum dependencies. This strategy ensures that the default catalog can always be used, even if there are incompatible updates to the underlying libraries. Automated tools can be integrated to aid both end-users and maintainers in understanding potential conflicts and safely bumping library-wide versions.
From the components of the
ML Bazaar , data scientists caneasily and effectively build ML pipelines with fixed hyper-parameters for their specific problems. To improve the per-formance of these solutions, we introduce the more gen-eral pipeline templates and pipeline hypertemplates and thenpresent the design and implementation of AutoML primitiveswhich facilitate hyperparameter tuning and model selection,either using our own library for Bayesian optimization orexternal AutoML libraries. Finally, we describe AutoBazaar,one specific AutoML system we have built on top of thesecomponents.
Frequently, pipelines require hyperparameters to be specified at several places. Unless these values are fixed at annotation time, hyperparameters must be exposed in a machine-friendly interface. This motivates pipeline templates and pipeline hypertemplates, which generalize pipelines by allowing a hierarchical tunable hyperparameter configuration space and provide first-class tuning support.
We define a pipeline template as a directed acyclic multigraph $T = \langle V, E, \Lambda \rangle$, where $\Lambda$ is the joint hyperparameter configuration space for the underlying primitives. By providing values $\lambda \in \Lambda$ for the unset hyperparameters of a pipeline template, a specific pipeline is created.
In some cases, certain values of hyperparameters can affect the domains of other hyperparameters. For example, the type of kernel for a support vector machine results in different kernel hyperparameters, and preprocessors used to adjust for class imbalance can affect the training procedure of a downstream classifier. We call these conditional hyperparameters, and accommodate them with pipeline hypertemplates.
We define a pipeline hypertemplate as a directed acyclic multigraph $H = \langle V, E, \bigcup_j \Lambda_j \rangle$, where $V$ is a collection of pipeline steps, $E$ are directed edges between steps, and $\Lambda_j$ is the hyperparameter configuration space for pipeline template $T_j$. A number of pipeline templates can be derived from one pipeline hypertemplate by fixing the conditional hyperparameters.
Just as primitives represent components of ML computation, AutoML primitives represent components of an AutoML system. We separate AutoML primitives into tuners and selectors. In our extensible AutoML library for developing AutoML systems, BTB, we provide various instances of these AutoML primitives.
Given a pipeline template, an AutoML system must find a specific pipeline with fully-specified hyperparameter values to maximize some utility. Given pipeline template $T$ and a function $f$ that assigns a performance score to pipeline $L_\lambda$ with hyperparameters $\lambda \in \Lambda$, the tuning problem is defined as $\lambda^* = \arg\max_{\lambda \in \Lambda} f(L_\lambda)$. We introduce tuners, AutoML primitives which provide a record/propose interface in which evaluation results are recorded to the tuner by the user or by an AutoML controller and new hyperparameters are proposed in return.
Hyperparameter tuning is widely studied and its effective use is instrumental to maximizing the performance of ML systems [4, 5, 18, 46]. One widely used approach to hyperparameter tuning is Bayesian optimization (BO), a black-box optimization technique in which expensive evaluations of $f$ are kept to a minimum by forming and updating a meta-model for $f$. At each iteration, the next hyperparameter configuration to try is chosen according to an acquisition function. We structure these meta-models and acquisition functions as separate BO-specific AutoML primitives that can be combined to form a tuner. Researchers have argued for different formulations of meta-models and acquisition functions [39, 46, 51]. In our BTB library for AutoML, we implement the GP-EI tuner, which uses a Gaussian Process meta-model primitive and an Expected Improvement (EI) acquisition function primitive, among several other tuners. Many other tuning paradigms exist, such as those based on evolutionary strategies [37, 40], adaptive execution [27, 33], meta-learning [21], or reinforcement learning [16]. Though we have not provided implementations of these in BTB, one could do so using our common API.
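The record/propose interface described above can be illustrated with a minimal sketch. The class name `RandomTuner`, its constructor arguments, and the `best` helper are hypothetical and are not BTB's actual API; uniform random sampling stands in for the meta-model and acquisition function that a tuner such as GP-EI would use.

```python
import random


class RandomTuner:
    """Hypothetical tuner exposing a record/propose interface.

    Uniform random sampling stands in for the meta-model and acquisition
    function of a Bayesian optimization tuner such as GP-EI.
    """

    def __init__(self, space, seed=None):
        # space maps hyperparameter name -> (low, high) float range
        self.space = space
        self.trials = []  # list of (params, score) pairs
        self._rng = random.Random(seed)

    def propose(self):
        """Return a new hyperparameter configuration to evaluate."""
        return {
            name: self._rng.uniform(low, high)
            for name, (low, high) in self.space.items()
        }

    def record(self, params, score):
        """Store the score observed for a proposed configuration."""
        self.trials.append((params, score))

    def best(self):
        """Return the (params, score) pair with the highest score."""
        return max(self.trials, key=lambda trial: trial[1])


def toy_score(params):
    # Stand-in for f(L_lambda): peaks at learning_rate = 0.1.
    return -(params["learning_rate"] - 0.1) ** 2


tuner = RandomTuner({"learning_rate": (0.0, 1.0)}, seed=0)
for _ in range(50):
    params = tuner.propose()
    tuner.record(params, toy_score(params))

best_params, best_score = tuner.best()
```

Because the caller owns the evaluation step, the same loop works whether the scores come from a user script or from an AutoML controller; only the body of `propose` would change for a model-based tuner.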
For many ML task types, there may be multiple pipeline templates or pipeline hypertemplates available, each with their own tunable hyperparameters. The aim is to balance the exploration-exploitation tradeoff while selecting promising pipeline templates to tune. For a set of pipeline templates $\mathcal{T}$, we define the selection problem as $T^* = \arg\max_{T \in \mathcal{T}} \max_{\lambda_T \in \Lambda_T} f(L_{\lambda_T})$. We introduce selectors, AutoML primitives which provide a compute_rewards/select API.
Algorithm selection is often treated as a multi-armed bandit problem where the score returned from a selected template can be assumed to come from an unknown underlying probability distribution. In BTB (https://github.com/HDI-Project/BTB), we implement the UCB1 selector, which uses the upper confidence bound method [2], among several other selectors. Users or AutoML controllers can use selectors and tuners together to perform joint algorithm selection and hyperparameter tuning.
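To make the bandit view of template selection concrete, the following is a minimal sketch of a UCB1-style selector. The class `UCB1Selector`, the identity reward in `compute_rewards`, and the toy template names are assumptions made for illustration, not BTB's implementation; the bound used is the standard UCB1 rule (mean reward plus $\sqrt{2 \ln n / n_i}$), which assumes rewards in $[0, 1]$.

```python
import math


class UCB1Selector:
    """Hypothetical selector with a compute_rewards/select interface."""

    def __init__(self, choices):
        self.choices = list(choices)

    def compute_rewards(self, scores):
        # Assumption: rewards are the raw scores, expected to lie in [0, 1].
        return list(scores)

    def select(self, choice_scores):
        """Pick a template given a dict of template name -> list of scores."""
        # Play every template once before applying the confidence bound.
        for choice in self.choices:
            if not choice_scores.get(choice):
                return choice

        total = sum(len(choice_scores[c]) for c in self.choices)

        def upper_bound(choice):
            rewards = self.compute_rewards(choice_scores[choice])
            mean = sum(rewards) / len(rewards)
            return mean + math.sqrt(2 * math.log(total) / len(rewards))

        return max(self.choices, key=upper_bound)


# Toy usage: deterministic scores, "xgb" is the better template.
true_scores = {"xgb": 0.9, "rf": 0.6}
selector = UCB1Selector(true_scores)
history = {name: [] for name in true_scores}
for _ in range(100):
    template = selector.select(history)
    history[template].append(true_scores[template])
```

In a joint search, the template returned by `select` would be handed to that template's own tuner, which proposes the hyperparameters to evaluate next; the weaker template is still sampled occasionally because its confidence bound widens as it is neglected.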
Using the ML Bazaar framework, we have built AutoBazaar (https://github.com/HDI-Project/AutoBazaar), an open-source, end-to-end, general-purpose, multi-task, automated machine learning system. It consists of several components: an AutoML controller; a pipeline execution engine; data stores for metadata and pipeline evaluation results; loaders and configuration for ML tasks, primitives, etc.; a Python language client; and a command-line interface. AutoBazaar is an open-source variant of the AutoML system we have developed for the DARPA D3M program.
We focus here on the core pipeline search and evaluation algorithms (Algorithm 2). The input to the search is a computational budget and an ML task, which consists of the raw data and task and dataset metadata: dataset resources, problem type, dataset partition specifications, and an evaluation procedure for scoring. Based on these inputs, AutoBazaar searches through its catalog of primitives and pipeline templates for the most suitable pipeline that it can build. First, the controller loads the train and test dataset partitions, $D^{(train)}$ and $D^{(test)}$, following the metadata specifications. Next, it loads from its default catalog and the user’s custom catalog a collection of candidate pipeline templates suitable for the ML task type. Using the BTB library, it initializes a UCB1 selector and a collection of GP-EI tuners for joint algorithm selection and hyperparameter tuning. The search process begins and continues for as long as the computation budget has not been exhausted. In each iteration, the selector is queried to select a template, the corresponding tuner is queried to propose a hyperparameter configuration, a pipeline is generated and scored using cross validation over $D^{(train)}$, and the score is reported back to the selector and tuner. The best overall pipeline found during the search, $L^*$, is re-fit on $D^{(train)}$ and scored over $D^{(test)}$. Its specification is returned to the user alongside the score obtained, $s^*$.
Input: task $t = (M, f, D^{(train)}, D^{(test)})$, budget $B$
Output: best pipeline $L^*$, best score $s^*$
begin
    $\mathcal{T} \leftarrow$ load_available_templates($M$)
    $A \leftarrow$ init_automl($\mathcal{T}$)  // bookkeeping
    $s^* \leftarrow +\infty$, $L^* \leftarrow \emptyset$
    while $B > 0$ do
        $T \leftarrow$ select($A$)  // uses selector.select
        $\lambda \leftarrow$ propose($A$, $T$)  // uses $T$’s tuner.propose
        $L \leftarrow (T, \lambda)$
        $s \leftarrow$ cross_validate_score($f$, $L$, $D^{(train)}$)
        record($A$, $L$, $s$)  // update selector and tuners
        if $s < s^*$ then
            $s^* \leftarrow s$, $L^* \leftarrow L$
        decrease($B$)
    $s^* \leftarrow$ fit_and_score($f$, $L^*$, $D^{(train)}$, $D^{(test)}$)
    return $L^*$, $s^*$
Algorithm 2: Search and evaluation of pipelines in AutoBazaar. Detailed task metadata $M$ is used by the system to load relevant pipeline templates and scorer function $f$ is used to score pipelines.
In this paper, we claim that ML Bazaar makes it easier to develop ML systems. We provide evidence for this claim in this section by describing 5 real-world use cases in which ML Bazaar is currently used to create both ML and AutoML systems. Through these industrial applications we examine the following questions: Does ML Bazaar support the needs of ML system developers? If not, how easy was it to extend?
Anomaly detection for satellite telemetry. ML Bazaar is used by a communications satellite operator which provides video and data connectivity globally. This company wanted to monitor more than 10,000 telemetry signals from their satellites and identify anomalies, which might indicate a looming failure severely affecting the satellite’s coverage. This time series/anomaly detection task was not initially supported by any of the pipelines in our curated catalog. Our collaborators were able to easily implement a recently developed end-to-end anomaly detection method [26] using pre-existing transformation primitives in ML Bazaar and by adding several new primitives: a primitive for the specific LSTM architecture used in the paper and new time series anomaly detection postprocessing primitives, which take as input a time series and time series forecast, and produce as output a list of anomalies, identified by intervals $\{[t_i, t_{i+1}]\}$. This design enabled rapid experimentation through substituting different time series forecasting primitives and comparing the results. In current work, they apply ML pipelines to 82 publicly available satellite telemetry signals from NASA and evaluate the anomaly detections against 105 known anomalies. The work has been released as the open-source Orion project (https://github.com/D3-AI/Orion) and is currently under active development.
Predicting clinical outcomes from electronic health records. Cardea (https://github.com/D3-AI/Cardea) is an open-source, automated framework for predictive modeling in health care on electronic health records following the FHIR schema. Its developers formulated a number of prediction problems including predicting length of hospital stay, missed appointments, and hospital readmission. All tasks in Cardea are multitable regression or classification. From ML Bazaar, Cardea uses the featuretools.dfs primitive to automatically engineer features for this highly-relational data and multiple other primitives for classification and regression. The framework also presents examples on a publicly available patient no-show prediction problem. The framework has been released as an open-source project.
Failure prediction in wind turbines. ML Bazaar is also used by a multinational energy utility to predict critical failures and stoppages in their wind turbines. Most prediction problems here pertain to the time series classification ML task type. ML Bazaar has several time series classification pipelines available in its catalog and they enable usage of time series from 140 turbines to develop multiple pipelines, tune them, and produce prediction results. Multiple outcomes are predicted, ranging from stoppage and pitch failure to less common issues, such as gearbox failure. This library is released as the open-source GreenGuard project (https://github.com/D3-AI/GreenGuard).
Leaks and crack detection in water distribution systems. A global water technology provider uses ML Bazaar for a variety of ML needs, ranging from image classification for detecting leaks from images, to crack detection from time series data, to demand forecasting using water meter data. ML Bazaar provides a unified framework for these disparate needs. The team also builds custom primitives internally and uses them directly with the MLBlocks backend.
DARPA’s Data-Driven Discovery of Models (D3M) program, of which we are participants, aims to spur development of automated systems for model discovery for use by non-experts. Among other goals, participants aim to design and implement AutoML systems that can produce solutions to arbitrary ML tasks without any human involvement. We used ML Bazaar to create an AutoML system to be evaluated against other teams from US academic institutions. Participants include ourselves (MIT), CMU, UC Berkeley, Brown, Stanford, TAMU, and others. Our system relies on AutoML primitives (Section 3) and other features of our framework, but does not use our primitive and pipeline implementations (neither MLPrimitives nor MLBlocks).
We present results comparing our system against other teams in the program. DARPA organizes an evaluation every 6 months (Winter and Summer). During evaluation, AutoML systems submitted by participants are run by DARPA on 95 tasks spanning several task types for three hours per task. At the end of the run, the best pipeline identified by the AutoML system is evaluated on held-out test data. Results are also compared against two independently-developed expert baselines (MIT Lincoln Laboratory and Exline).
Results from one such evaluation from Spring 2018 were presented by [44]. We make comparisons from the Summer 2019 evaluation, the results of which were released in August 2019 (the most recent evaluation as of this writing). Table 2 compares our AutoML system against 9 other teams. Given the same tasks and same machine learning primitives, this comparison highlights the efficacy of the AutoML primitives (BTB) in ML Bazaar only; it does not provide any evaluation of our other libraries. In its implementation, our system uses a GP-MAX tuner and a UCB1 selector. Across all metrics, our system places 2nd out of the 10 teams.
System      Top pipeline   Beats Expert 1   Beats Expert 2   Rank
System 1         29              57               31            1
ML Bazaar        18              56               28            2
System 3         15              47               22            3
System 4         14              46               21            4
System 5         10              42               14            5
System 6          8              43               15            6
System 7          8              33               12            7
System 8          6              24               11            8
System 9          4              25               13            9
System 10         2              27               12           10
Table 2: Results from the DARPA D3M Summer 2019 evaluation (most recent conducted). Entries represent the number of ML tasks. “Top pipeline” is the number of tasks for which a system created a winning pipeline. “Beats Expert 1” and “Beats Expert 2” are the number of tasks for which a system beat the two expert team baselines. We highlight Systems 6 and 7 as they belong to the same teams as [44] and [16], respectively. (We are unable to comment on other systems as they have not yet provided public reports.) Rank is given based on the number of top pipelines produced. The top 4 teams are consistent in their ranking even if a different column is chosen.
Through these applications using the components of the ML Bazaar, several advantages surfaced.
One important aspect of ML Bazaar is that it does not restrict the user to a single monolithic system; rather, users can pick and choose the parts of the framework they want to use. For example, Orion uses only MLPrimitives/MLBlocks, Cardea uses MLPrimitives but integrates the hyperopt library for hyperparameter tuning, our D3M AutoML system submission mainly uses AutoML primitives and BTB, and AutoBazaar uses every component.
The ease of developing ML systems for the task at hand freed up time for teams to think through and design a comprehensive ML infrastructure. In the case of Orion and GreenGuard, this led to the development of a database that catalogues the metadata from every ML experiment run using ML Bazaar. This had several positive effects: it allowed for easy sharing between team members, and it allowed the company to transfer the knowledge of what worked from one system to another. For example, the satellite company plans to use the pipelines that worked on a previous generation of satellites on the newer ones from the beginning. With multiple entities finding uses for such a database, creation of such infrastructure could be templatized.
Our framework allowed the water technology company to solve many different ML task types using the same framework and API.
Once a baseline pipeline has been designed to solve a problem, we notice that users can quickly shift focus to developing and improving the primitives that are responsible for learning.
A fitted pipeline maintains all the learned parameters as well as all the data manipulations. A user is able to serialize the pipeline and load it into production. This shortens the development-to-production lifecycle.
In this section, we experimentally evaluate ML Bazaar along several dimensions. We also leverage our evaluation results to perform several case studies in which we show how a general-purpose evaluation setting can be used to assess the value of specific ML and AutoML primitives.
The ML Bazaar Task Suite is a comprehensive corpus of tasks and datasets to be used for evaluation, experimentation, and diagnostics. It consists of 456 ML tasks spanning 15 task types. Tasks, which encompass raw datasets and annotated task descriptions, are assembled from a variety of sources, including MIT Lincoln Laboratory, Kaggle, OpenML, Quandl, and Crowdflower. We created train/test splits and organized the folder structure. Other than this, we do not do any preprocessing (sampling, outlier detection, imputation, featurization, scaling, encoding, etc.), presenting data in its raw form as it would be ingested by end-to-end pipelines. Our publicly-available task suite can be browsed online (https://mlbazaar.github.io) or through piex (https://github.com/HDI-Project/piex), our library for exploration and meta-analysis of ML tasks and pipelines. The covered task types are shown in Table 4 and a summary of the tasks is shown in Table 3.
                      min   p25   p50    p75        max
Number of examples      7   202   599  3,634  6,095,521
Number of classes † X
Table 3: Summary of tasks in ML Bazaar Task Suite (n=456). † for classification tasks.
We made every effort to curate a corpus that was evenly balanced across ML task types. Unfortunately, in practice, available datasets are heavily skewed to traditional ML problems of single-table classification and our task suite reflects this deficiency (though 49% are not single-table classification). Indeed, among other evaluation suites, the OpenML 100 and the AutoML Benchmark [8, 19] are both composed exclusively of single-table classification problems. Similarly, evaluation approaches for AutoML methods usually target the black-box optimization aspect in isolation [15, 20, 23] without considering the larger context of an end-to-end pipeline.
We run the search process for all tasks in parallel on a heterogeneous cluster of 400 AWS EC2 nodes. Each ML task is solved independently on a node of its own over a 2-hour time limit. Metadata and fine-grained details about every pipeline evaluated are stored in a MongoDB document store. The best pipelines for each task, after checkpoints at 10, 30, 60, and 120 minutes of search, are selected by considering the cross-validation score on the training set and are then re-scored on the held-out test set. Exact replication files and detailed instructions for the experiments in this section are available at https://github.com/micahjsmith/ml-bazaar-2019, and the results can be further analyzed using our piex library.
Figure 4: Execution time of AutoBazaar pipeline search attributable to different libraries/components (x-axis: ABZ, I/O, MLB, MLB Ext., BTB, BTB Ext.; y-axis: execution time as % of total, 0 to 100). The box plot shows quartiles of the distribution, 1.5 × IQR, and outliers. MLB Ext and BTB Ext refer to calls to external libraries providing underlying implementations, like the scikit-learn GaussianProcessRegressor used in the GP-EI tuner. The vast majority of execution time is attributed to the underlying primitives implemented in external libraries.
We first evaluate the computational bottlenecks of the AutoBazaar system. To assess this, we instrument AutoBazaar and our framework libraries (MLBlocks, MLPrimitives, BTB) to determine what portion of overall execution time for pipeline search is due to our runtime libraries vs. other factors such as I/O and underlying component implementation. The results are shown in Figure 4. Overall, the vast majority of execution time is due to execution of the underlying primitives (p25=90.2%, p50=96.2%, p75=98.3%). A smaller portion is due to the AutoBazaar runtime (p50=3.1%) and a negligible (p50<0.1%) portion of execution time is due to our other framework libraries and I/O. Thus, performance of pipeline execution/search is largely limited by the performance of the underlying physical implementation from the external library.
One important attribute of AutoBazaar is the ability to improve pipelines for different tasks through tuning and selection. We measure the improvement in the best pipeline per task, finding that the average task improves its best score by 1.06 standard deviations over the course of tuning, and that 31.7% of tasks improve by more than 1 standard deviation (Figure 5). This indicates the improvement that a user may expect to obtain from AutoBazaar pipeline search. However, as we describe in Section 3, there are many possible AutoML primitives that can be implemented using our tuner/selector APIs; a comprehensive comparison is beyond the scope of our work.
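The improvement measure can be sketched in a few lines, following the definition given with Figure 5: the best score minus the score of the initial default pipeline, divided by the standard deviation of all pipeline scores for the task. The function name, the toy scores, and the use of the population standard deviation are assumptions made for illustration; the sketch also assumes that higher scores are better and that the default pipeline is evaluated first.

```python
import statistics


def improvement_in_std(scores):
    """Improvement of the best pipeline over the initial default pipeline.

    scores: pipeline scores for one task in evaluation order (higher is
    better), with the default pipeline first.
    """
    # Assumption: population standard deviation over all evaluated pipelines.
    spread = statistics.pstdev(scores)
    if spread == 0:
        return 0.0
    return (max(scores) - scores[0]) / spread


# Hypothetical scores for a single task, default pipeline first.
task_scores = [0.70, 0.72, 0.68, 0.80, 0.75]
improvement = improvement_in_std(task_scores)
```

Normalizing by the per-task spread makes improvements comparable across tasks whose metrics have very different scales, which is what allows a single average to be reported over the whole suite.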
Data Modality   Problem Type              Tasks   Pipeline Template
graph           community detection           2   CommunityBestPartition
graph           graph matching                9   link_prediction_feat_extr graph_feat_extr CategoricalEncoder SimpleImputer StandardScaler XGBClassifier
graph           link prediction               1   link_prediction_feature_extraction CategoricalEncoder SimpleImputer StandardScaler XGBClassifier
graph           vertex nomination             1   graph_feature_extraction categorical_encoder SimpleImputer StandardScaler XGBClassifier
image           classification                5   ClassEncoder preprocess_input MobileNet XGBClassifier ClassDecoder
image           regression                    1   preprocess_input MobileNet XGBRegressor
multi table     classification                6   ClassEncoder dfs SimpleImputer StandardScaler XGBClassifier ClassDecoder
multi table     regression                    7   dfs SimpleImputer StandardScaler XGBRegressor
single table    classification              234   ClassEncoder dfs SimpleImputer StandardScaler XGBClassifier ClassDecoder
single table    collaborative filtering       4   dfs LightFM
single table    regression                   87   dfs SimpleImputer StandardScaler XGBRegressor
single table    timeseries forecasting       35   dfs SimpleImputer StandardScaler XGBRegressor
text            classification               18   UniqueCounter TextCleaner VocabularyCounter Tokenizer pad_sequences LSTMTextClassifier
text            regression                    9   StringVectorizer SimpleImputer XGBRegressor
timeseries      classification               37   ClassEncoder dfs StandardImputer StandardScaler XGBClassifier ClassDecoder
Table 4: ML task types (data modality and problem type pairs) and associated ML task counts in the ML Bazaar Task Suite, along with default templates from AutoBazaar (i.e., where we have curated appropriate pipeline templates to solve a task).
Figure 5: Distribution of task performance improvement from ML Bazaar AutoML (x-axis: standard deviations; y-axis: density). Improvement for each task is measured as the score of the best pipeline less the score of the initial default pipeline, in standard deviations of all pipelines evaluated for that task.
ML Bazaar
To further examine the expressiveness of ML Bazaar to solve a wide variety of tasks, we randomly selected 23 Kaggle competitions from 2018, comprising tasks ranging from image and time series classification to object detection and multi-table regression. For each task, we attempted to develop a solution using existing primitives and catalogs.
Overall, we were able to immediately solve 11 tasks. We did not yet support 4 task types: image matching (2 tasks), object detection within images (4 tasks), multi-label classification (1 task), and video classification (1 task). We could readily support these within our framework by developing new primitives and pipelines. In the remaining tasks, multiple data modalities were provided to participants (i.e., some combination of image, text, and tabular data). To support these tasks, we would need to develop a new “glue” primitive for concatenating separately-featurized data from each resource to create a single feature matrix. Though our evaluation suite contains many examples of tasks with multiple data resources of different modalities, we had written pipelines customized to operate on certain common subsets (i.e., tabular + graph). We can never expect to have already implemented pipelines for the innumerable diversity of ML task types, but we can still write new primitives and pipelines using our framework to solve these problems.
When new primitives are contributed by the ML community, they can be incorporated into pipeline templates and pipeline hypertemplates, either to replace similar pipeline steps or to form the basis of new topologies. By running the end-to-end system on our evaluation suite, we can assess the impact of the primitive on general-purpose ML workloads (rather than overfit baselines).
In this first case study, we compare two similar primitives: annotations for the XGBoost (XGB) and random forest (RF) classifiers. We ran two experiments, one in which RF is used in pipeline templates and one in which XGB is substituted instead.
We consider 1 . × relevant pipelines to determine the best scores produced for 367 tasks. We find that the XGB pipelines substantially outperformed the RF pipelines, winning 64.9% of the comparisons. This is consistent with the experience of practitioners, who widely report that XGBoost is one of the most powerful ML methods for classification and regression.
The design of the ML Bazaar AutoML system and our extensive evaluation corpus allow us to easily swap in new AutoML primitives (Section 3.2) to see to what extent changes in components like tuners and selectors can improve performance in general settings.
In this case study, we revisit [46], which was influential for bringing about widespread use of Bayesian optimization for tuning ML models in practice. Their contributions include: (C1) proposing the usage of the Matérn 5/2 kernel for the tuner meta-model (see Section 3.2.1), (C2) describing an integrated acquisition function that integrates over uncertainty in the GP hyperparameters, (C3) incorporating a cost model into an EI per second acquisition function, and (C4) explicitly modeling pending parallel trials. How important was each of these contributions to a resulting tuner?
Using ML Bazaar, we show how a more thorough ablation study [36], not present in [46], would be conducted to address these questions, by assessing the performance of our general-purpose AutoML system using different combinations of these four contributions. Here we focus on contribution C1. We run experiments using a baseline tuner with a squared exponential kernel (GP-SE-EI) and compare it with a tuner using the Matérn 5/2 kernel (GP-Matern52-EI). In both cases, kernel hyperparameters are set by optimizing the marginal likelihood. Thus we can isolate the contributions of the proposed kernel in the context of general-purpose ML workloads.
In total, 4 . × pipelines were evaluated to find the best pipelines for a subset of 414 tasks. We find that there is no improvement from using the Matérn 5/2 kernel over the SE kernel; in fact, the GP-SE-EI tuner outperforms, winning 60.1% of the comparisons. One possible explanation for this negative result is that the Matérn kernel is sensitive to hyperparameters which are set more effectively by optimization of the integrated acquisition function. This is supported by the over-performance of the tuner using the integrated acquisition function in the original work; however, the integrated acquisition function is not tested with the baseline SE kernel, and more study is needed.
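For reference, the two kernels compared in this ablation can be written in their standard textbook forms as functions of the distance $r$ between two hyperparameter configurations. The sketch below is illustrative only: the function names and the `length_scale` parameter are assumptions, unit signal variance is assumed, and it is not the code used in BTB.

```python
import math


def squared_exponential(r, length_scale=1.0):
    """SE kernel value for two points at distance r."""
    return math.exp(-0.5 * (r / length_scale) ** 2)


def matern52(r, length_scale=1.0):
    """Matérn 5/2 kernel value for two points at distance r."""
    z = math.sqrt(5.0) * r / length_scale
    return (1.0 + z + z ** 2 / 3.0) * math.exp(-z)


distances = [0.0, 0.5, 1.0, 2.0]
se_values = [squared_exponential(r) for r in distances]
m52_values = [matern52(r) for r in distances]
```

Both kernels equal 1 at zero distance and decay as configurations move apart; the Matérn 5/2 kernel decays more slowly at large distances and yields less smooth sample functions, which is the modeling difference the ablation isolates.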
Researchers have developed numerous algorithmic and software innovations to make it possible to create ML and AutoML systems in the first place.
ML libraries. High-quality ML libraries have been developed over a period of decades. For general ML applications, scikit-learn implements many different algorithms using a common API centered on the influential fit/predict paradigm [11]. For specialized analysis, libraries have been developed in separate academic communities, often with different and incompatible APIs [1, 7, 10, 24, 28, 32]. In ML Bazaar, we connect and link components of these libraries, only creating missing functionality ourselves.
ML systems. Prior work has provided several approaches for making it easier to develop ML systems. For example, caret [31] standardizes interfaces and provides utilities for the R ecosystem, but without enabling more complex pipelines. Recent systems have attempted to provide graphical interfaces, like [22] and Azure Machine Learning Studio. Development of ML systems is closely tied to the execution environments needed to train, deploy, and update the resulting models. In SystemML [9] and Weld [45], implementations of specific ML algorithms are optimized for specific runtimes. Velox [12] is an analytics stack component that efficiently serves predictions and manages model updates.
AutoML libraries. AutoML research has often been limited to solving sub-problems of an end-to-end ML workflow, such as data cleaning [14], feature engineering [28, 30], hyperparameter tuning [18, 21, 25, 27, 33, 40, 46, 48], or algorithm selection [25, 50]. Thus AutoML solutions are often not widely applicable or deployed in practice without human support. In contrast, ML Bazaar integrates many of these existing approaches and designs one coherent and configurable structure for joint tuning and selection of end-to-end pipelines.
AutoML systems. These AutoML libraries, if deployed, are typically one component within a larger system that aims to manage several practical aspects such as parallel and distributed training, tuning, and model storage, and even serving, deployment, and graphical interfaces for model building. These include ATM [47], Vizier [20], and Rafiki [52], as well as commercial platforms like Google AutoML, DataRobot, and Azure Machine Learning Studio. While these systems provide many benefits, they have several limitations. First, they each focus on a specific subset of ML use cases, such as computer vision, NLP, forecasting, or hyperparameter tuning. Second, these systems are designed as proprietary applications and do not support community-driven integration of new innovations. ML Bazaar provides a new approach to developing such systems in the first place: it supports a wide variety of ML task types, and builds on top of a community-driven ecosystem of ML innovations. Indeed, it could serve as the backend for such ML services or platforms.
The DARPA D3M program [35], of which we are participants, aims to spur development of automated systems for model discovery for use by non-experts. Several differing approaches are being developed within this context. For example, Alpine Meadow [44] focuses on efficient search for producing interpretable ML pipelines with low latencies for interactive usage. It combines existing techniques from query optimization, Bayesian optimization, and multi-armed bandits to efficiently search for pipelines. AlphaD3M [16] formulates a pipeline synthesis problem and uses reinforcement learning to construct pipelines. In contrast, ML Bazaar is a framework to develop ML or AutoML systems in the first place. While we present our open-source AutoBazaar system, it is not the primary focus of our work and represents a single point in the design space of AutoML systems using our framework libraries. Indeed, one could use specific AutoML approaches like the ones described by Alpine Meadow or AlphaD3M for pipeline search within our own framework.
Throughout this paper, we have built up abstractions, interfaces, and software components for data scientists, data engineers, and other practitioners to effectively develop machine learning systems. Developers can use ML Bazaar to compose one-off pipelines, tunable pipeline templates, or full-fledged AutoML systems. Researchers can contribute individual ML or AutoML primitives and make them easily accessible to a broad base for inclusion in end-to-end solutions.
We have applied this approach to several real-world ML problems and entered our AutoML system in an automated modeling program. As we collect more and more scored pipelines, we expect opportunities will emerge for meta-learning and debugging on ML tasks and pipelines, as well as the ability to track progress and transfer knowledge within data science organizations. We will focus on several complementary extensions in future work. These include continuing to improve our AutoML system and making it more robust for everyday use by a diverse user base, and studying how to best support users of different backgrounds in using and interacting with ML and AutoML systems.
ACKNOWLEDGMENTS
The authors would like to thank Plamen Valentinov Kolev for contributions running experiments and testing datasets. The authors also acknowledge the contributions of the following people: Laura Gustafson, William Xue, Akshay Ravikumar, Ihssan Tinawi, Alexander Geiger, Sarah Alnegheimish, Saman Amarasinghe, Stefanie Jegelka, Zi Wang, Benjamin Schreck, Seth Rothschild, Manual Alvarez Campo, Sebastian Mir Peral, Peter Fontana, and Brian Sandberg. The authors are part of the DARPA Data-Driven Discovery of Models (D3M) program, and would like to thank the D3M community for the discussions around the design. This material is based on research sponsored by DARPA and Air Force Research Laboratory (AFRL) under agreement number FA8750-17-2-0126. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA and Air Force Research Laboratory (AFRL) or the U.S. Government. Authors MJS, CS, and KV acknowledge support and user feedback from Iberdrola S.A. and SES S.A.
REFERENCES
Machine Learning 47, 2-3 (2002), 235–256.
[3] Denis Baylor, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal, Mustafa Ispir, Vihan Jain, Levent Koc, Chiu Yuen Koo, Lukasz Lew, Clemens Mewald, Akshay Naresh Modi, Neoklis Polyzotis, Sukriti Ramesh, Sudip Roy, Steven Euijong Whang, Martin Wicke, Jarek Wilkiewicz, Xin Zhang, and Martin Zinkevich. 2017. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In KDD.
[4] James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-Parameter Optimization. JMLR 13 (2012), 281–305.
[5] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In NIPS. 2546–2554.
[6] Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J Elmore, Samuel Madden, and Aditya Parameswaran. 2015. DataHub: Collaborative Data Science & Dataset Version Management at Scale. In CIDR.
[7] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
[8] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G Mantovani, Jan N van Rijn, and Joaquin Vanschoren. 2017. OpenML benchmarking suites and the OpenML100. (2017). arXiv:1708.03731
[9] Matthias Boehm, Michael W Dusenberry, Deron Eriksson, Alexandre V Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R Reiss, Prithviraj Sen, Arvind C Surve, and Shirish Tatikonda. 2016. SystemML: Declarative machine learning on Spark. PVLDB 9, 13 (2016), 1425–1436.
[10] G. Bradski. 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000).
[11] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 108–122.
[12] Daniel Crankshaw, Peter Bailis, Joseph E. Gonzalez, Haoyuan Li, Zhao Zhang, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan. 2015. The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox. In
CIDR .[13] Andrew Crotty, Alex Galakatos, Emanuel Zgraggen, Carsten Binnig,and Tim Kraska. 2015. Vizdom: Interactive Analytics Through Penand Touch.
Proc. VLDB Endow.
8, 12 (Aug 2015), 2024–2027.[14] Dong Deng, Raul Castro, Fernandez Ziawasch, Abedjan Sibo, AhmedElmagarmid, Ihab F Ilyas, Samuel Madden, Mourad Ouzzani, and NanTang. 2017. The Data Civilizer System.
CIDR (2017).[15] Ian Dewancker, Michael McCourt, Scott Clark, Patrick Hayes, Alexan-dra Johnson, and George Ke. 2016. A Strategy for Ranking OptimizationMethods using Multiple Criteria.
ICML AutoML workshop
ICML 2018 AutoML Workshop , 1–8.[17] Stefan Falkner, Aaron Klein, and Frank Hutter. 2018. BOHB: Robust andEfficient Hyperparameter Optimization at Scale. In
ICML . 1436–1445.[18] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springen-berg, et al. 2015. Efficient and robust automated machine learning. In
NIPS . 2962–2970.[19] Pieter Gijsbers, Erin Ledell, Janek Thomas, Sébastien Poirier, BerndBischl, and Joaquin Vanschoren. 2019. An Open Source AutoML Bench-mark. (2019), 1–8.[20] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski,John Karro, and D Sculley. 2017. Google Vizier: A Service for Black-BoxOptimization. In
KDD .[21] Taciana A.F. Gomes, Ricardo B.C. Prudˆcncio, Carlos Soares, André L.D.Rossi, and André Carvalho. 2012. Combining meta-learning and searchtechniques to select parameters for support vector machines.
Neuro-computing
75, 1 (2012), 3–13.[22] Ming Gong, Linjun Shou, Wutao Lin, Zhijie Sang, Quanjia Yan, ZeYang, and Daxin Jiang. 2019. NeuronBlocks – Building Your NLP DNNModels Like Playing Lego. (2019). arXiv:arXiv:1904.09535[23] Isabelle Guyon, Kristin Bennett, Gavin Cawley, Hugo Jair Escalante,Sergio Escalera, Tin Kam Ho, Núria Macià, Bisakha Ray, MehreenSaeed, Alexander Statnikov, and Evelyne Viegas. 2015. Design of the2015 ChaLearn AutoML challenge. In
IJCNN .[24] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploringnetwork structure, dynamics, and function using NetworkX. In
SciPy ,Gaël Varoquaux, Travis Vaught, and Jarrod Millman (Eds.). 11–15. [25] Martin Hirzel, Kiran Kate, Avraham Shinnar, Subhrajit Roy, and Parik-shit Ram. 2019. Type-Driven Automated Learning with Lale. (May2019). arXiv:arXiv:1906.03957[26] Kyle Hundman, Valentino Constantinou, Christopher Laporte, IanColwell, and Tom Soderstrom. 2018. Detecting Spacecraft AnomaliesUsing LSTMs and Nonparametric Dynamic Thresholding. In
KDD .[27] Kevin Jamieson and Ameet Talwalkar. 2016. Non-stochastic Best ArmIdentification and Hyperparameter Optimization. In
AISTATS . 240–248.[28] James Max Kanter. 2015. Deep Feature Synthesis:Towards AutomatingData Science Endeavors. In
DSAA . 1–10.[29] James Max Kanter, Owen Gillespie, and Kalyan Veeramachaneni. 2016.Label, segment, featurize: A cross domain framework for predictionengineering. In
DSAA . 430–439.[30] Udayan Khurana, Deepak Turaga, Horst Samulowitz, and SrinivasanParthasrathy. 2016. Cognito: Automated feature engineering for su-pervised learning. In
ICDMW . 1304–1307.[31] Max Kuhn. 2008. Building Predictive Models in R Using the caretPackage.
Journal of Statistical Software
28, 5 (2008), 159–160.[32] Maciej Kula. 2015. Metadata Embeddings for User and Item Cold-startRecommendations. In
Proceedings of the 2nd Workshop on New Trendson Content-Based Recommender Systems , Vol. 1448. 14–21.[33] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, andAmeet Talwalkar. 2017. Hyperband: a novel bandit-based approach tohyperparameter optimization.
JMLR
18, 1 (2017), 6765–6816.[34] Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina,Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. 2018. Massivelyparallel hyperparameter tuning. (2018). arXiv:arXiv:1810.05934[35] Richard Lippmann, William Campbell, and Joseph Campbell. 2016.An Overview of the DARPA Data Driven Discovery of Models (D3M)Program. In
NIPS Workshop on Artificial Intelligence for Data Science .[36] Zachary C Lipton and Jacob Steinhardt. 2018. Troubling trends inmachine learning scholarship. (2018). arXiv:arXiv:1807.03341[37] Ilya Loshchilov and Frank Hutter. 2016. CMA-ES for HyperparameterOptimization of Deep Neural Networks. (2016). arXiv:arXiv:1604.07269[38] Hui Miao, Ang Li, Larry S Davis, and Amol Deshpande. 2017. Model-hub: Deep learning lifecycle management. In
ICDE . 1393–1394.[39] ChangYong Oh, Efstratios Gavves, and Max Welling. 2018.BOCK: Bayesian Optimization with Cylindrical Kernels. (2018).arXiv:arXiv:1806.01619[40] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H.Moore. 2016. Evaluation of a Tree-based Pipeline Optimization Toolfor Automating Data Science. In
GECCO .[41] Eric Raymond. 1999. The cathedral and the bazaar.
Knowledge, Tech-nology & Policy
12, 3 (1999), 23–49.[42] Matthew Rocklin. 2015. Dask: Parallel Computation with Blockedalgorithms and Task Scheduling. In
SciPy , Kathryn Huff and JamesBergstra (Eds.). 130–136.[43] D Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips,Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Cre-spo, and Dan Dennison. 2015. Hidden technical debt in machinelearning systems. In
NIPS . 2503–2511.[44] Zeyuan Shang, Emanuel Zgraggen, Benedetto Buratti, Ferdinand Koss-mann, Philipp Eichmann, Yeounoh Chung, Carsten Binnig, Eli Upfal,and Tim Kraska. 2019. Democratizing data science through interactivecuration of ml pipelines. In
SIGMOD . 1171–1188.[45] Palkar Shoumik, James Thomas, Deepak Naryanan, PratiskhaThaker, Rahul Palamuttam, Parimajan Negi, Anil Shanbhag, MalteSchwarzkopf, Holger Pirk, Saman Amarasinghe, Samuel Madden, andMatei Zaharia. 2018. Evaluating end-to-end optimization for dataanalytics applications in weld.
PVLDB
11, 9 (2018), 1002–1015.
IGMOD’20, June 14–19, 2020, Portland, OR, USA Smith et al. [46] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practicalbayesian optimization of machine learning algorithms. In
NIPS . 2951–2959.[47] Thomas Swearingen, Will Drevo, Bennett Cyphers, et al. 2017. ATM:A distributed, collaborative, scalable system for automated machinelearning. In
BigData .[48] Chris Thornton, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2013. Auto-WEKA: combined selection and hyperparameteroptimization of classification algorithms. In
KDD .[49] Tom van der Weide, Dimitris Papadopoulos, Oleg Smirnov, MichalZielinski, and Tim van Kasteren. 2017. Versioning for End-to-EndMachine Learning Pipelines. In
DEEM . 1–9.[50] Jan N. van Rijn, Salisu Mamman Abdulrahman, Pavel Brazdil, andJoaquin Vanschoren. 2015. Fast Algorithm Selection Using LearningCurves. In
IDA . 298–309. [51] Hao Wang, Bas van Stein, Michael Emmerich, and Thomas Back. 2017.A new acquisition function for Bayesian optimization based on themoment-generating function. In
SMC . 507–512.[52] Wei Wang, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen,Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad. 2018. Rafiki:machine learning as an analytics service system.
PVLDB
12, 2 (2018),128–140.[53] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das,Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shiv-aram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez,Scott Shenker, and Ion Stoica. 2016. Apache Spark: A Unified Enginefor Big Data Processing.