Machine Learning 2.0: Engineering Data-Driven AI Products

Max Kanter
Feature Labs, Inc., Boston, MA 02116
[email protected]

Benjamin Schreck
Feature Labs, Inc., Boston, MA 02116
[email protected]

Kalyan Veeramachaneni
MIT LIDS, Cambridge, MA 02139
[email protected]

v0.1.0, Date: 2018-03-06
Abstract

ML 2.0: In this paper, we propose a paradigm shift from the current practice of creating machine learning models – which requires months-long discovery, exploration and "feasibility report" generation, followed by re-engineering for deployment – in favor of a rapid, 8-week process of development, understanding, validation and deployment that can be executed by developers or subject matter experts (non-ML experts) using reusable APIs. This accomplishes what we call a "minimum viable data-driven model," delivering a ready-to-use machine learning model for problems that have not been solved before using machine learning. We provide provisions for the refinement and adaptation of the "model," with strict enforcement of and adherence to both the scaffolding/abstractions and the process. We imagine that this will bring forth a second phase in machine learning, in which discovery is subsumed by more targeted goals of delivery and impact.
Attempts to embed machine learning-based predictive models into products and services in order to make them smarter, faster, cheaper, and more personalized will dominate the technology industry for the foreseeable future. Currently, these applications aid industries ranging from financial services systems, which often employ simple fraud detection models, to patient care management systems in intensive care units, which employ more complex models that predict events. If the many research articles, news articles, blogs, and data science competitions based on new data-driven discoveries are to be believed, future applications will be fueled by applying machine learning to numerous data stores — medical (see [1], [9]), financial and others. But it is arguable how many of the predictive models described in these announcements have actually been deployed — or have been effective, serving their intended purposes of saving costs, increasing revenue and/or enabling better experiences [12]. (The authors of [8] offer one perspective on the challenges of deploying and maintaining a machine learning model, and others in the tech industry have highlighted these challenges as well. The core problem, however, goes deeper than deployment challenges.)

To explicate this, we highlight a few important observations about how machine learning and AI systems are currently built, delivered and deployed. In most cases, the development of these models involves the following criteria: 1. It relies on making an initial discovery from the data; 2. It uses a historically defined and deeply entrenched workflow; and 3. It struggles to find the functional interplay between a robust software engineering practice and the complex landscape of mathematical concepts required to build the models. We call this current state of machine learning ML 1.0, defined by its focus on discovery.

(These observations have been gathered from our experiences creating, evaluating, validating, publishing and delivering machine learning models using data from BBVA, Accenture, Kohl's, Nielsen, Monsanto, Jaguar and Land Rover, edX, MIMIC, GE, Dell, and others. Additionally, we have entered our automation tools into numerous publicly-held data science competitions held by Kaggle and similar websites. We are also a part of the MIT team involved with the DARPA D3M initiative, and development of the automation system Featuretools is funded under DARPA. The findings and opinions expressed here are ours alone, and do not represent any of our clients, funders, or collaborators.)

The "discovery first" paradigm: Most of the products within which we are currently trying to embed intelligence have already been collecting data as part of their normal, day-to-day functioning. Machine learning or predictive modeling is usually an afterthought. Machine learning models attempt to predict a future outcome – one that is legible to the humans using it – and it is not always certain whether the data at hand will be able to provide such conclusions. This makes machine learning unlike other software engineering projects that involve developing or adding a new feature to a product. In those cases, the end outcome is deterministic and visible. Designers, architects, and managers can make a plan, establish a workflow, release the feature, and manage it.

In contrast, a new machine learning project usually starts with a discovery phase, in which attempts are made to answer the quintessential question: "Is this predictable using our data?". If the answer is yes, further questions emerge: With how much accuracy? What variables mattered? Which one of the numerous modeling possibilities works best? If the data happens to be temporal, which one of the numerous time series models (with latent states) can model it best? What explainable patterns did we notice in the data, and what models surfaced those patterns better? And so on, in an endless list.

It is not uncommon to see dozens of research papers focused on the same prediction problem, each providing slightly different answers to any of the questions above. A recently established prediction problem – "predicting dropout in Massive Open Online Courses, to enable interventions" – resulted in at least 100 research papers, all written in a span of 4 years, and a competition at a Tier 1 data mining conference, KDD Cup 2015 [10, 7, 4, 3]. (These papers are published in premier AI venues – NIPS, IJCAI, KDD, and AAAI – and one of our first predictive modeling projects, for weblog data in 2013-14, focused on this very problem.)

The expectation is that once this question has been answered, a continuously working predictive model can be developed, integrated with the product, and put to use rather quickly. This expectation might not be so unrealistic if it weren't for the deeply entrenched workflow generally used for discovery, as we describe below.

A "deeply entrenched workflow": Developing and evaluating machine learning models for industrial-scale problems predates the emergence of internet-scale data. Our observations of machine learning in practice over the past decade, as well as our own experiences developing predictive models, have enabled us to delineate a number of codified steps that make up these projects. These steps are shown in Table 1. Over the years, surprisingly little has changed about how researchers approach a data store. For example, you can easily compare the structure of these three papers, one written in 1994 ([2]) and two written in 2017 ([1, 9]).
So what is the problem?: Although the generalized, codified steps shown in Table 1 make up a good working template, the workflow in which they are executed contains problems that are now entrenched. In Figure 3, we depict how these steps are currently executed as if they are three disjoint parts. Recent tools and collaborative systems tend to apply to just one of these parts, accelerating discovery only when the data is ready to be used in that part. We present our detailed commentary on the current state of applied machine learning in Section 6.
Table 1: Steps to making a discovery using machine learning from a data warehouse. This process does not define a robust software engineering practice, nor does it account for model testing or deployment. The end result of the process is to generate a report or research paper. The typical workflow breaks the process down into silos, presented in Figure 3.

1. Extract relevant data subset: Data warehouses (usually centrally organized) contain several data elements that may not be related to the predictive problem at hand, and may cover a large time period. In this step, a subset of tables and fields is determined, and a specific time period is selected. Based on these two choices, a number of filters are applied, and the resulting data is passed on to Step 2.
2. Formulate problem and assemble training examples: Developing a model that can predict future events requires finding past examples to learn from. Typical predictive model development thus involves first defining the outcome in question, and then finding past occurrences of that outcome that could be used to learn a model.
3. Prepare data: To train a model, we use data retrospectively to emulate the prediction scenario – that is, we use data prior to the occurrence of the outcome to learn a model, and to evaluate its ability to predict the outcome. This requires careful annotation regarding which data elements can be used for modeling. In this step, most of these annotations are added, and the data is filtered to create a dataset ready for machine learning.
4. Engineer features: For each training example, given the usable data, one computes features (a.k.a. variables) and creates a machine learning-ready matrix. Each column in the matrix is a feature, the last column is the label, and each row is a different training example.
5. Learn a model: Given the feature matrix, in this step a model – either a classifier (if the outcome is categorical) or a regressor (if the outcome is continuous) – is learned. Numerous models are evaluated and selected based on a user-specified evaluation metric.
6. Evaluate the model and report: Once a modeling approach has been selected and trained, it is evaluated using the training data. A number of metrics, including precision and recall, are reported, and the impact of variables on predictive accuracy is evaluated.

ML 2.0: Delivery and impact: In this paper, we propose a paradigm shift, turning from the current practice of creating machine learning models – which involves a months-long process of discovery, exploration, "feasibility report" generation, and re-engineering for deployment – in favor of a rapid, 8-week-long process of development, understanding, validation and deployment that can be executed by developers or subject matter experts (non-ML experts) using reusable APIs. This accomplishes what we call a "minimum viable data-driven model," delivering a ready-to-use machine learning model for problems that haven't been solved before using machine learning. (Throughout this paper, our core focus is on data that is temporal, multi-entity, multi-table, and relational (and/or transactional). In most cases, we are attempting to predict using a machine learning model, and in some cases we are predicting ahead of time.)

We posit that any system claiming to be ML 2.0 should deliver on the key goals and requirements listed in Figure 1:

(1) Subsume discovery: deliver a machine learning model M for a problem that has not been solved before, and for which no prior solution exists or no prior steps or partial steps have been taken. The system subsumes discovery by enabling generation of a report of accuracy metrics, feature importances – all things originally done as part of discovery;
(2) Be end-to-end: start with data D from the warehouse, with several tables containing data for several entities and several years up to a current time t, and enable a developer to execute all steps 1 through 5, and build and validate a model M;
(3) Enable confidence: enable testing and validation of the end-to-end system under different simulated scenarios and conditions, and establish developer confidence;
(4) Provide deployable code: enable deployment of the model at t for making predictions at T > t by providing code that carries new data from its raw shape to predictions; and
(5) Provide provisions for updates: provide provisions for updating the model, re-assessing/validating and re-deploying it.

Figure 1: Goals and requirements for ML 2.0.

Each of these steps is important for different reasons: (1) ensures that we are bringing new intelligence services to life; (2) requires that the system considers the entire process and not just one step; and (3), (4) and (5) guarantee the model's deployment and use.

All of the above steps should:
– require a minimal amount of coding (possibly with simple API calls);
– use the same abstractions/software;
– quickly deliver the first version of the model;
– require a minimal amount of manpower and no "machine learning expertise."

This will ensure a reduction in design complexity and speed up the democratization of AI. We imagine that this will bring forth a second phase in artificial intelligence, in which the elevation of discovery is subsumed by more targeted goals of delivery and impact.
The path that led to ML 2.0
Over the last decade, the demand for predictive models has grown at an increasing rate. As the data scientists who build these models, we have found that our biggest challenge typically isn't creating accurate solutions for incoming prediction problems, but rather the time it takes us to build an end-to-end model, which makes it difficult to match the supply of expertise to the incoming demand. We have also seen, disappointingly, that even when we build models, they are often not deployed. ML 2.0 is the result of a series of efforts to address these problems by automating those aspects of creating and deploying machine learning solutions that have, in the past, involved time- and skill-intensive human effort.

In 2013, we focused our efforts on the problem of selecting a machine learning method and then tuning its hyperparameters. This eventually resulted in Auto-Tuned Models (ATM) [11], which takes advantage of cloud-based computing to perform a high-throughput search over modeling options and find the best possible modeling technique for a particular problem. (ATM is available at https://github.com/HDI-Project/ATM. It was published and open-sourced in 2017 to aid in the release of ML 2.0, but was originally developed in 2014.)

While ATM helped us quickly build more accurate models, we couldn't take advantage of it unless the data we were using took the form of a feature matrix. As most of the real-world use cases we worked on did not provide data in such a form, we decided to move our focus earlier in the data science process. This led us to explore automating feature engineering, or the process of using domain-specific knowledge to extract predictive patterns from raw datasets. Unlike most data scientists who work in a single domain, we worked with a wide range of industries. This gave us the unique opportunity to develop innovative solutions for each of the diverse problems we faced. Across industries including finance, education, and health care, we started to see commonalities in how features were constructed. These observations led to the creation of the Deep Feature Synthesis (DFS) algorithm for automated feature engineering [5].

We put DFS to the test against human data scientists to see if it accomplished the goal of saving us significant time while preparing raw data. We used DFS to compete in 3 different worldwide data science competitions and found that we could build models in a fraction of the time it took human competitors, while achieving similar predictive accuracies.

After this success, we took our automation algorithms to industry. Now that we could quickly build accurate models using just a raw dataset and a specific problem, a new automation question arose – "How do we figure out what problem to solve?" This question was particularly pertinent to companies and teams new to machine learning, who frequently had a grasp of high-level business issues, but struggled to translate them into specific prediction problems.

We discovered that there was no existing process for systematically defining a prediction problem. In fact, by the time labels reached data scientists, they had typically been reduced to a list of true/false values, with no annotations regarding their origins. Even if a data scientist had access to the code that extracted the labels, it was implemented such that it could not be tweaked based on domain expert feedback or reused for different problems. To remedy this, we started to explore "prediction engineering," or the process of defining prediction problems in a structured way.
In a 2016 paper [6], we laid out the Label Segment Featurize (LSF) abstraction and associated traversal algorithm to search for labels. As we explored prediction engineering and defined the LSF abstraction, the concept of time began to play a huge role. Because they involve both simulating past predictive scenarios and carefully extracting training examples in order to prevent label leakage while training, predictive tasks are intricately tied to time. This inspired perhaps the most important requirement – a way to annotate every data point with the timestamp at which it occurred. We then revisited each data-processing algorithm we had written for feature engineering, prediction engineering, or preprocessing and programmed it to accept only time periods where data was valid by introducing a simple yet powerful notion: cutoff_time. When a cutoff_time is specified, these algorithms use only the data known before that point in time. To organize the meta-information about the data itself, we defined metadata.json (our current version is available at https://github.com/HDI-Project/MetaData.json). We also started to define a more concrete way to organize the meta-information about the processing done on the data from a predictive modeling standpoint, which led to the development of model_provenance.json (available at https://github.com/HDI-Project/model-provenance-json).

In our first attempt at creating each of these automation approaches, we focused on defining the right abstractions, process organization and algorithms, because of their foundational role in data science automation. We knew from the beginning that the right abstractions would enable us to write code that could be reused across developments, deployments, and domains. This not only increases the rate at which we build models, but increases the likelihood of deployment by enabling the involvement of those with subject matter expertise and those who maintain the production environments. As we have applied our new tools to industrial-scale problems over the last three years, these abstractions and algorithms have transformed into production-ready implementations. With this paper, we are formally releasing multiple tools, as well as data and model organization schemas: Featuretools, Entityset, metadata.json, model_provenance.json, and ATM. In the sections below, we show how abstractions, algorithms, and implementations all come together to enable ML 2.0.

In ML 2.0, users of the system start with raw, untapped data in the warehouse and a rough idea of the problem they want to solve. The user formulates or specifies a prediction problem, follows steps 1-5, and delivers a stable, tested and validated, deployable model in a few weeks, along with a report that documents the discovery. Although this workflow seems challenging and unprecedented, we are motivated by recent developments in proper abstractions, algorithms, and data science automation tools, which lend themselves to the restructuring of the entire predictive modeling process. In Figure 2, we provide an end-to-end recipe for ML 2.0.
We next describe each step in detail, including its input and output, its automated aspects, and relevant hyperparameters that allow developers to control the process. We highlight both what the human contributes and what the automation enables, illustrating how they form a perfect symbiotic relationship. In an accompanying paper, we present the first industrial use case that follows this paradigm, as well as the accompanying successfully deployed model.

→ Form an Entityset (Data organization)

An Entityset is a unified API for organizing and querying relational datasets to facilitate building predictive models. By providing a single API, an Entityset can be reused throughout the life cycle of a modeling project and reduces the possibility of human error while manipulating data. An Entityset is integrated with the typical industrial-scale data warehouse or data lake that contains multiple tables of information, each with multiple fields or columns.

The first, and perhaps the most human-driven, step of a modeling project consists of identifying and extracting from the data source the subset of tables and columns that are relevant to the prediction problem at hand. Each table added to the Entityset is called an entity, and has at least one column, called the index, that uniquely identifies each row. Other columns in an entity can be used to relate instances to another entity, similar to a foreign-key relationship in a traditional relational database. An Entityset further tracks the semantic variable types of the data (e.g., the underlying data may be stored as a float, but in context it is important to know that the data represents a latitude or longitude).

In predictive modeling, it is common to have at least one entity that changes over time as new information becomes available. This data is appended as a row to the entity it is an instance of, and must contain a column which stores the value for the time that this particular row became available. This column is known as the time index for the entity, and is used by automation tools to understand the order in which information became available for modeling. Without this functionality, it would not be possible to correctly simulate historical states of data while training, testing, and validating the predictive model.

The index, time index, relationships, and semantic variable types for each entity in an Entityset are documented in ML 2.0 using a metadata.json (documentation is available at https://github.com/HDI-Project/MetaData.json). After extracting and annotating the data, an Entityset provides the interface to this data for each of the modeling steps ahead.

The open source library Featuretools offers an in-memory implementation of the Entityset API in Python, to be used by humans and automated tools (documentation is available at https://docs.featuretools.com/loading_data/using_entitysets.html). Featuretools represents each entity as a Pandas DataFrame, provides standardized access to metadata about the entities, and offers an existing list of semantic types for variables in the dataset. It offers additional functionality that is pertinent to predictive modeling, including:
– querying by time – select a subset of data across all entities up until a point in time (uses time indices);
– incorporating new data – append to existing entities as new data becomes available or is collected;
– checking data consistency – verify that new data matches old data to avoid unexpected behavior;
– transforming the relational structure – create new entities and relationships through normalization (i.e., create a new entity with repeated information from an existing entity and add a relationship).
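For illustration, the following is a minimal sketch of assembling an Entityset with the Featuretools 0.x API. The customers and sessions tables, their column names, and the relationship between them are hypothetical placeholders, not part of any specific ML 2.0 deployment.

```python
import pandas as pd
import featuretools as ft

# Illustrative tables; each time-varying entity carries a time index column.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "join_date": pd.to_datetime(["2014-01-01", "2014-02-15"]),
})
sessions = pd.DataFrame({
    "session_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "session_start": pd.to_datetime(["2014-03-01", "2014-03-05", "2014-03-02"]),
})

es = ft.EntitySet(id="retail")

# Each table becomes an entity with an index and, when applicable, a time index.
es = es.entity_from_dataframe(entity_id="customers", dataframe=customers,
                              index="customer_id", time_index="join_date")
es = es.entity_from_dataframe(entity_id="sessions", dataframe=sessions,
                              index="session_id", time_index="session_start")

# A foreign-key style relationship: one customer has many sessions.
es = es.add_relationship(
    ft.Relationship(es["customers"]["customer_id"],
                    es["sessions"]["customer_id"]))
```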
→ Assemble training examples (Prediction engineering)

To learn an ML model that can predict a future outcome, past occurrences of that outcome are required in order to train it. In most cases, defining which outcome one is interested in predicting is itself a process riddled with choices. For example:

Consider a developer maintaining an online learning platform. He wants to deploy a predictive model that predicts when a student is likely to drop out. But dropout could be defined in several ways: as "student never returns to the website," "student stops watching videos," or "student stops submitting assignments."

Labeling function: Our first goal is to provide a way of describing the outcome of interest easily. This is achieved by allowing the developer to write a labeling function given by

label, cutoff_time = f(E, e_i, timestamp, hparams)   (1)

where e_i is the id of the i-th instance of the entity (the i-th student, in the example above), E is the Entityset, and timestamp is the time point in the future at which we are evaluating whether or not the outcome occurred. It is worth emphasizing that this outcome can be a complex function that uses a number of columns in the data. The output of the function is the label, which the predictive model will learn to predict using data up until cutoff_time.

Table 2: Hyperparameters for prediction engineering.
– prediction_window: the period of time to look for the outcome of interest
– lead: the amount of time before the prediction window to make a prediction
– gap: the minimum amount of time between successive training examples for a single instance
– examples_per_instance: the maximum number of training examples per entity instance
– min_training_data: the minimum amount of training data required before making a prediction

Search: Given the labeling function, the next task is to search the data to find past occurrences of the outcome in question. One can simply iterate over multiple instances of the entity at different time points, and call Equation 1 in each iteration to generate a label. A number of conditions are applied to this search. Having executed this search for several different problems across many different domains, we were able to precisely and programmatically describe this process and expose all of the hyperparameters that a developer may want to control during this search. Table 2 presents a subset of these hyperparameters, and we refer an interested reader to [6] for a detailed description of these and other search settings.

Given the labeling function, the settings for search, and the Entityset, a search algorithm searches for past occurrences across all instances of an entity and returns a list of three-tuples called label-times:

e_{1...n}, label_{1...n}, cutoff_time_{1...n} = search_training_examples(E, f(.), hparams)   (2)

where e_i is the i-th instance of the entity, label_i is a binary 1 or 0 (or multi-category, depending upon the labeling) representing whether or not the outcome of interest happened for the i-th instance of the entity, and cutoff_time_i represents the time point after which label_i is known. A list of these tuples provides the possible training examples. While this process is applied in any predictive modeling endeavor, be it in healthcare, educational data mining, or retail, there have been no prior attempts made to abstract this process and provide widely applicable interfaces for it.

Subject matter expert → implements the labeling function and sets the hyperparameters.
Automation → searches for and compiles a list of training examples that satisfy a number of constraints.
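To make the abstraction in Equation 1 concrete, the sketch below encodes the "student stops submitting assignments" definition of dropout. The events entity, its column names, and the hparams keys mirror the hyperparameters in Table 2 but are otherwise hypothetical.

```python
from datetime import timedelta

def dropout_labeling_function(es, student_id, timestamp, hparams):
    """Return (label, cutoff_time) for one student, following Equation (1).

    label is True when the student submits no assignments inside the
    prediction window; cutoff_time is the last moment whose data may be
    used to compute features for this training example.
    """
    cutoff_time = timestamp
    window_start = cutoff_time + hparams["lead"]
    window_end = window_start + hparams["prediction_window"]

    # Featuretools 0.x exposes an entity's rows as a DataFrame via .df;
    # the "events" entity and its columns are illustrative.
    events = es["events"].df
    submissions = events[
        (events["student_id"] == student_id)
        & (events["event_type"] == "assignment_submitted")
        & (events["timestamp"] >= window_start)
        & (events["timestamp"] < window_end)
    ]
    label = len(submissions) == 0  # no submissions in the window => dropout
    return label, cutoff_time

hparams = {"lead": timedelta(days=7), "prediction_window": timedelta(days=14)}
```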
→ Generate features (Feature engineering)

The next step in machine learning model development involves generating features for each training example. Consider:

At this stage, the developer has a list of those students who dropped out of a course and those who did not, based on his previously specified definition of dropout. He posits that the amount of time a student spends on the platform could predict whether he is likely to drop out. Using data prior to the cutoff_time, he writes software and computes the value for this feature.

Table 3: Hyperparameters for feature engineering.
– target_entity: entity in the Entityset to create features for
– training_window: amount of historical data before cutoff_time used to calculate features
– aggregation_primitives: reusable functions that create new features using data at the intersection of entities
– transform_primitives: reusable functions that create new features from existing features within an entity
– ignore_variables: variables in an Entityset that shouldn't be used for feature engineering

Features, a.k.a. variables, are quantities derived by applying mathematical operations to the data associated with an entity-instance – but only prior to the cutoff_time specified in the label_times. This step is usually done manually and takes up a lot of developer time. Our recent work enables us to automate this process by allowing the developer to specify a set of hyperparameters for an algorithm called Deep Feature Synthesis [5]. The algorithm exploits the relational structure in the data and applies different statistical aggregations at different levels.

feature_list = create_features(E, hparams)   (3)

feature_matrix = calculate_feature_matrix(E, e_{i...n}, cutoff_time_{i...n}, label_{i...n}, feature_list, hparams)   (4)

Given the Entityset and hyperparameter settings, the automatic feature generation algorithm in Equation 3 outputs a feature_list containing a list of feature descriptions defined for the target_entity using transform_primitives, aggregation_primitives, and ignore_variables. These definitions are passed to Equation 4, which generates a matrix where each column is a feature and each row pertains to an entity-instance e_i at the corresponding cutoff_time_i and training_window. The format of the feature_matrix is also shown in Figure 2. The feature_list is stored as a serialized file for use in deployment. Users can also apply a feature selection method at this stage to reduce the number of features.

Subject matter expert → guides the process by suggesting which primitives to use or variables to ignore, as well as how much historical data to use to calculate the features.
Automation → suggests the features based on the relational structure, and precisely calculates the features for each training example within the allotted window, from cutoff_time - training_window to cutoff_time.
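The two calls in Equations 3 and 4 map closely onto the Deep Feature Synthesis entry point in Featuretools. The sketch below assumes the Entityset es built earlier and a label_times DataFrame with columns ["instance_id", "time"] produced by prediction engineering; the primitive choices and the training window are illustrative.

```python
import featuretools as ft

# Equation (3): generate feature definitions only (no values yet).
feature_list = ft.dfs(entityset=es,
                      target_entity="customers",
                      agg_primitives=["mean", "sum", "count"],
                      trans_primitives=["month", "weekday"],
                      features_only=True)

# Equation (4): compute the feature matrix, using only data observed before
# each example's cutoff time and inside the training window.
# (Recent releases accept a string window; older ones expect ft.Timedelta.)
feature_matrix = ft.calculate_feature_matrix(features=feature_list,
                                             entityset=es,
                                             cutoff_time=label_times,
                                             training_window="60 days")

# Persist the feature definitions for reuse at deployment time.
ft.save_features(feature_list, "feature_list.p")
```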
→ Generate a model M (Modeling and operationalization)

Given the feature_matrix and its associated labels, the next task is to learn a model. The goal of training is to use examples to learn a model that will generalize well to future situations, which will be evaluated using a validation set. Typically, a model is not able to predict accurately for all training examples, and must make trade-offs. While the training algorithm optimizes a loss function, metrics like f-score, auc-roc, recall, and precision are used for model selection and evaluation on a validation set. These metrics are derived solely from counting how many correct and incorrect predictions are made within each class. In practice, we find that these measures are not enough to capture domain-specific needs, a conclusion we expand upon below:

Consider a developer working on a fraud prediction problem. S/he compares two models, M_1 and M_2. While both have the same false positive rate, M_2 has a 3% higher true positive rate – that is, it was able to detect more fraudulent activities. When fraud goes undetected, the bank must reimburse the customer – so the developer decides to measure the financial impact by adding up the cost of the fraudulent transactions that were missed by M_1 and M_2 respectively. He finds that, by this metric, M_2 actually performs worse. Ultimately, the number of frauds M_2 missed does not matter, because missing a fraudulent transaction that is worth $10 is far less costly than missing a $10,000 transaction.

This problem can be handled by using a domain-specific cost function, g(.), implemented by the developer.

Cost function g(.): Given predictions, the true labels, and the Entityset, a cost function calculates the domain-specific evaluation of the model's performance. Its abstraction is specified as:

cost = g(E, predictions, labels)   (5)

Table 4: Hyperparameters for the model search process.
– methods: list of machine learning methods to search
– budget: the maximum amount of time or number of models to search
– automl_method: path to a file describing the automl technique to use for optimization

Search for a model: Given the cost function, we can now search for the best possible model, M. An overwhelming number of possibilities exist, and the search space includes all possible methods. In addition, each method contains hyperparameters, and choosing certain hyperparameters may necessitate choosing even more in turn. For example:

A developer is ready to finally learn a model label ← M(X). He must choose between numerous modeling techniques, including svm, decision trees, and neural networks. If svm is chosen, s/he must then choose the kernel, which could be polynomial, rbf, or Gaussian. If polynomial is chosen, then he has to choose another hyperparameter – the value of the degree.

To fully exploit the search space, we designed an approach in which humans fully and formally specify the search space by creating a json specification for each modeling method (see Figure 4 for an example). The automated algorithm then searches the space for the model that optimizes the cost assessed using predictions [11].

M = search_model(g(.), feature_matrix, labels, cutoff_times, hparams)   (6)

(Once the modeling method has been chosen, numerous search algorithms for the following steps have been published in academia and/or are available as open source tools.)
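One simple way to consume such a specification is to sample hyperparameters uniformly from the declared ranges and instantiate the corresponding scikit-learn class. This sketch uses random sampling purely for illustration – ATM [11] applies more sophisticated AutoML strategies – and it ignores the conditions field, which encodes conditional hyperparameters.

```python
import importlib
import json
import random

def sample_model(spec_path, seed=None):
    """Draw one candidate model from a Figure 4-style search-space specification."""
    rng = random.Random(seed)
    with open(spec_path) as f:
        spec = json.load(f)

    params = {}
    for name, p in spec["parameters"].items():
        if p["type"] == "int":
            params[name] = rng.randint(p["range"][0], p["range"][1])
        elif p["type"] == "float":
            params[name] = rng.uniform(p["range"][0], p["range"][1])
        else:  # "string": pick one of the enumerated values
            params[name] = rng.choice(p["range"])

    # "class" holds a fully qualified name such as sklearn.tree.DecisionTreeClassifier.
    module_name, class_name = spec["class"].rsplit(".", 1)
    model_class = getattr(importlib.import_module(module_name), class_name)
    return model_class(**params)

# model = sample_model("decision_tree.json", seed=0)  # a spec like the one in Figure 4
```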
To search for a model based on the cost function g(.), we split the feature_matrix into three sets: a set to train the model, a set to tune the decision function, and a set to test/validate the model. These splits can be made based on time, and could also be made much earlier in the process, allowing for different label search strategies to extract training examples from different splits. A typical model search works as follows:
– Train a model M_i on the training split.
– Use the trained model to make predictions on the threshold tuning set.
– Pick a decision-making threshold such that it minimizes the cost function g(.).
– Use the threshold and the model M_i to make predictions on the test split and evaluate the cost function g(.).
– Repeat steps 1-4 k times for model M_i and determine the average statistics.
– Repeat steps 1-5 for different models and pick the one that performs best (on average) on the test split.

The output of this step is a full model specification, including the threshold and the model M. M is stored as a serialized file, and the remaining settings are stored in a json called model_provenance.json. An example json is shown in Figure 5.
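The search loop above can be sketched with scikit-learn components. The cost function, the candidate list, and the pre-built time-based splits below are illustrative stand-ins for what ATM and the model_provenance.json configuration manage in practice; the k repetitions and averaging are omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def false_negative_cost(predictions, labels, amounts):
    """Illustrative domain-specific cost g(.): dollars of fraud that slipped through."""
    missed = (labels == 1) & (~predictions)
    return amounts[missed].sum()

def pick_threshold(scores, labels, amounts):
    """Choose the decision threshold that minimizes the cost on the tuning split."""
    candidates = np.unique(scores)
    costs = [false_negative_cost(scores >= t, labels, amounts) for t in candidates]
    return candidates[int(np.argmin(costs))]

def search_model(splits, candidate_models):
    """splits = {"train"/"tune"/"test": (X, y, amounts)} built from time-based cuts."""
    best = None
    for model in candidate_models:
        model.fit(splits["train"][0], splits["train"][1])
        tune_scores = model.predict_proba(splits["tune"][0])[:, 1]
        threshold = pick_threshold(tune_scores, splits["tune"][1], splits["tune"][2])
        test_scores = model.predict_proba(splits["test"][0])[:, 1]
        cost = false_negative_cost(test_scores >= threshold,
                                   splits["test"][1], splits["test"][2])
        if best is None or cost < best[0]:
            best = (cost, model, threshold)
    return best  # (test cost, fitted model M, decision threshold)

candidate_models = [RandomForestClassifier(n_estimators=100),
                    LogisticRegression(max_iter=1000)]
```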
→ Integration testing in production

The biggest advantage of the automation and abstractions designed for the first 4 steps is that they enable an identical and repeatable process, with exactly the same APIs, in the production environment – a requirement of ML 2.0. In a production environment, new data is added to the data warehouse D as time goes on. To test a model trained for a specific purpose, after loading the model M, the feature_list and the Entityset E_t, a developer can simply call the following three APIs:

E_{t+} = add_new_data(E_t, new_data_path, metadata.json)   (7)

feature_matrix = calculate_feature_matrix(E_{t+}, <e_i, current_time>, feature_list)   (8)

predictions ← generate_predictions(M, feature_matrix)   (9)

This allows the developer to test the end-to-end software on new data, ensuring that the software will work in a production environment.
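In Featuretools terms, Equations 8 and 9 could reduce to something like the sketch below: reload the serialized feature definitions and rerun the same calculation on the updated Entityset. The file names, the es_new argument, and the default threshold (taken from the deployment parameters in Figure 5) are illustrative.

```python
import pickle
import pandas as pd
import featuretools as ft

def score_new_data(es_new, threshold=0.212):
    """Equations 8-9: reuse the saved feature definitions and model M on fresh data.

    es_new is the Entityset after new warehouse data has been appended (Equation 7).
    """
    feature_list = ft.load_features("feature_list.p")   # saved during feature engineering
    with open("model.p", "rb") as f:
        model = pickle.load(f)                          # serialized model M from the search step

    # Equation (8): same feature definitions, evaluated at the current time for each instance.
    cutoffs = pd.DataFrame({"instance_id": es_new["customers"].df["customer_id"],
                            "time": pd.Timestamp.now()})
    feature_matrix = ft.calculate_feature_matrix(features=feature_list,
                                                 entityset=es_new,
                                                 cutoff_time=cutoffs)

    # Equation (9): scores thresholded with the deployed decision threshold.
    return model.predict_proba(feature_matrix)[:, 1] >= threshold
```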
→ Validate in production

No matter how robustly the model is evaluated during the training process, a robust practice also involves evaluating the model in the production environment. This is paramount for establishing trust, as it identifies issues such as whether the model was trained on a biased data sample, or whether the data distributions have shifted since the time the data was extracted for steps 1-4. An ML 2.0 system, after loading the model M and the feature_list and undergoing an update to the Entityset E_{t+}, enables this evaluation by a simple tweaking of parameters within the APIs:

feature_matrix = calculate_feature_matrix(E_{t+}, <e_i, arbitrary_timestamp>, feature_list)   (10)

label, arbitrary_timestamp = f(E_{t+}, <e_i, arbitrary_timestamp>, hparams)   (11)

predictions ← generate_predictions(M, feature_matrix)   (12)

cost = g(predictions, labels, E_{t+})   (13)

After testing and validation, a developer can deploy the model using the three commands specified by Equations 7, 8, and 9.
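Following Equations 10-13, this kind of backtest can reuse the pieces defined earlier. The sketch below is illustrative: it builds on the hypothetical helpers from the previous sketches, assumes a "customers" target entity, and passes in a user-chosen historical timestamp.

```python
import pandas as pd
import featuretools as ft

def validate_in_production(es_new, model, feature_list, labeling_function,
                           cost_function, hparams, arbitrary_timestamp,
                           threshold=0.212):
    """Equations 10-13: recompute features, labels, predictions, and cost at a chosen time."""
    instance_ids = es_new["customers"].df["customer_id"]
    cutoffs = pd.DataFrame({"instance_id": instance_ids, "time": arbitrary_timestamp})

    feature_matrix = ft.calculate_feature_matrix(features=feature_list,
                                                 entityset=es_new,
                                                 cutoff_time=cutoffs)          # Eq. 10
    labels = pd.Series([labeling_function(es_new, i, arbitrary_timestamp, hparams)[0]
                        for i in instance_ids])                                # Eq. 11
    predictions = model.predict_proba(feature_matrix)[:, 1] >= threshold       # Eq. 12
    return cost_function(predictions, labels.values, es_new)                   # Eq. 13, g(.)
```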
While going through steps 1-7, we maintain information about which settings/hyperparameters were set at each stage, as well as the results of the modeling stage, in an attempt to maintain full records of the provenance of the process. Our current proposal for this provenance is model_provenance.json, documented at https://github.com/HDI-Project/model-provenance-json. This allows us to check for data drift and the unavailability of certain columns, and provides a provenance for what was tried during the training phase and did not work. It is also hierarchical, as it points to three other jsons: one that stores the metadata, a second that stores the search space for each of the methods over which a model search was performed, and a third that stores information about the automl method used.

Table 5: Different intermediate data and domain-specific code generated during the end-to-end process.
– metadata.json: file containing a description of an Entityset
– label_times: the list of labeled training examples and the points in time at which prediction will occur
– feature_matrix: table of data with one row per entry in label_times and one column for each feature
– M: serialized model file returned by the model search
– feature_list: serialized file specifying the feature_list returned by the feature engineering step
– f(.): user-defined function used to create label_times
– g(.): cost function used during the model search
– model_provenance.json: description of the pipelines considered, testing results, and the final deployable model
– predictions: the output of the model when passed a feature_matrix
ML 2.0 system overcomes the significant bottlenecks and complexities currently hinderingthe integration of machine learning into products. We frame this as the first proposal to deliberately breakout of
ML 1.0 , and anticipate that as we grow, more robust versions of this paradigm, as well as theirimplementations, will emerge. Powered by a number of automation tools, we have used the system described12bove to solve two real industrial use cases – one from Accenture, and one from BBVA. Below, we highlightthe most important contributions of the system described above, and more deeply examine each of itsinnovations.–
– Standardized representations and process: In the past, converting complex relational data first into a learning task and then into data representations had largely been considered an "art." The process lacked scientific form and rigor, precise definitions and abstractions, and – most importantly – a generalizable framework. It also took up about 80% of the average data scientist's time. The ability to structurally represent this process, and to define the algorithms and intermediate representations that make it up, enabled us to formulate ML 2.0. If we aim to broadly deliver the promise of machine learning, this process, much like any other software development practice, must be streamlined.

– The concept of "time": A few years ago, we participated in a Kaggle competition using our automatic feature generation algorithm [5]. The algorithm automatically applied mathematical operations to data by finding the relationships within it. This process is typically done manually, and doing it automatically meant a significant boost in data scientists' productivity while maintaining competitive accuracy. When we then applied this algorithm to industrial datasets – keeping its core ability, generating features from data, intact – we recognized that, in this new context, it was necessary to first define a concrete machine learning task. When a task is defined, performing step 2 (assembling training examples) implies that "valid" data has been identified, and since task definition requires flexibility, which data was valid and which was not changed each time we changed the definition – a key problem we identified in ML 1.0. (Competition websites perform this step before they even release the data.) This inspired us to consider the third and most important dimension of our automation: "time." To identify valid data, we needed our algorithms to filter data based on the time of their occurrence, which meant giving every data point an annotation pinpointing when it was first "known." (This is not necessarily equivalent to the time it was recorded and stored in the database, although sometimes it is.) A complex interplay between the storage and usage of these annotations ensued, eventually resulting in ML 2.0. Just as our ability to automate the search process led to step 2, accounting for time allows unprecedented flexibility in defining machine learning tasks by maintaining provenance.

– Automation of processes: While our work in early 2013 focused on the automation of model search (step 4), we moved to automating step 3 in 2015, and subsequently step 2 in 2016. These automation systems enabled us to go through the entire machine learning process end-to-end. Once we had fully automated these steps, we were then able to design interfaces and APIs that enabled easy human interaction and a seamless transition to deployment.

– Same APIs in training and production: The APIs used in steps 5-7 are exactly the same as the ones used in steps 2 and 4. This implies that the same software stack is used both in training and deployment. This software stack allows the same team to execute all the steps, requires a minimal learning curve when things need to be changed, and, most importantly, provides proof of the model's provenance – how it was validated both in the development environment and in the testing/production environment.

– No redundant copies of data: Throughout ML 2.0, new data representations made up of computations over the data – including features, training examples, metadata, and model descriptions – are created and stored. However, nowhere in the process is a redundant copy of a data element made. (Unless the computation is as simple as identity – for example, if the age column is available in the data and we use this column as a feature as-is, it will appear in the original data as well as in the feature matrix.) Arguably, this mitigates a number of data management and provenance issues elucidated by researchers, and ML 2.0 eliminates multiple hangups that are usually a part of machine learning or predictive modeling practice.

– Perfect interplay between humans and automation: After deliberate attempts to take our tools and systems to industry, and debates over who it is we are enabling when we talk about "democratizing machine learning," we have come to the conclusion that a successful ML 2.0 system will require perfect interplay between humans and automated systems. Such an interplay ensures that: (1) Humans control aspects that are closely related to the domain in question – such as domain-specific cost functions, the outcome they want to predict, how far in advance they want to predict, etc. – which are exposed as hyperparameters set in steps 2-5. (2) The process abstracts away the nitty-gritty details of machine learning, including data processing based on the various forms and shapes in which it is organized, generating data representations, searching for models, provenance, testing, validation and evaluation. (3) A structure and common language for the end-to-end process of machine learning aids in sharing.
In an accompanying paper called "The AI Project Manager," we demonstrate how we used the seven steps above to deliver a new AI product for the global enterprise Accenture. Not only did this project take up an ML problem that didn't have an existing solution, it presented us with a unique opportunity to work with subject matter experts to develop a system that would be put to use. In this case, the goal was to create an AI Project Manager that could augment human project managers while they manage complex software projects. Using over 3 years of historical data, which contained 40 million reports and 3 million comments, we trained a machine learning-based model to predict, weeks in advance, the performance of software projects in terms of a host of delivery metrics. The subject matter experts provided these delivery metrics for prediction engineering, as well as valuable insights into the subsets of data that should be considered for automated feature engineering. Under the ML 2.0 process, the end-to-end project spanned 8 weeks. In live testing, the AI Project Manager correctly predicts potential issues 80% of the time, helping to improve key performance indicators related to project delivery. As a result, the AI Project Manager has been integrated into Accenture's myWizard Automation Platform and serves predictions on a weekly basis.
Figure 2: ML 2.0 workflow. The end result is a stable, validated, tested, deployable model. An ML 2.0 system must support each of these steps. In the middle, we show each of the steps, along with the well-defined, high-level, functional APIs and arguments (hyperparameters) that allow for exploration and discovery. Data transformation APIs do not change in deployment, ensuring a repeatable process. The rightmost column presents the result of each step.

References

[1] Anand Avati, Kenneth Jung, Stephanie Harman, Lance Downing, Andrew Ng, and Nigam H. Shah. Improving palliative care with deep learning. arXiv preprint arXiv:1711.06402, 2017.
[2] Sushmito Ghosh and Douglas L. Reilly. Credit card fraud detection with a neural-network. In Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, volume 3, pages 621–630. IEEE, 1994.
[3] Sherif Halawa, Daniel Greene, and John Mitchell. Dropout prediction in MOOCs using learner activity features. Experiences and Best Practices in and around MOOCs, 7:3–12, 2014.
[4] Jiazhen He, James Bailey, Benjamin I. P. Rubinstein, and Rui Zhang. Identifying at-risk students in massive open online courses. In AAAI, pages 1749–1755, 2015.
[5] James Max Kanter and Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. IEEE, 2015.
[6] James Max Kanter, Owen Gillespie, and Kalyan Veeramachaneni. Label, segment, featurize: A cross domain framework for prediction engineering. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 430–439. IEEE, 2016.
[7] Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daumé III, and Lise Getoor. Modeling learner engagement in MOOCs using probabilistic soft logic. In NIPS Workshop on Data Driven Education, volume 21, page 62, 2013.
[8] D. Sculley, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: The high-interest credit card of technical debt. 2014.
[9] Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, and Eric P. Xing. Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075, 2017.
[10] Tanmay Sinha, Patrick Jermann, Nan Li, and Pierre Dillenbourg. Your click decides your fate: Inferring information processing and attrition behavior from MOOC video clickstream interactions. arXiv preprint arXiv:1407.7131, 2014.
[11] Thomas Swearingen, Will Drevo, Bennett Cyphers, Alfredo Cuesta-Infante, Arun Ross, and Kalyan Veeramachaneni. ATM: A distributed, collaborative, scalable system for automated machine learning.
[12] Kiri Wagstaff. Machine learning that matters. arXiv preprint arXiv:1206.4656, 2012.
Perhaps the fundamental problem with the ML 1.0 workflow is that it was never intended to create a usable model. Instead, it was designed to make discoveries. In order to achieve the end goal of an analytic report or research paper introducing the discovery, numerous usability tradeoffs are made. Below, we highlight some of these tradeoffs, and how they are at odds with creating a deployable, useful model.
– Ad hoc, non-repeatable software engineering practice for steps 1-4: The software that carries out steps 1-4 is often ad hoc and lacks a robust, well-defined software stack. This is predominantly due to a lack of recognized structures, abstractions and methodologies. Most academic research in machine learning has ignored this lack for the past two decades, and has only focused on defining a robust practice when the data is ready to be used at step 5, and/or the data fits the form of a clean, time-stamped time series (X, t) (for example, the time series of a stock price). (Take, for example, "Elements of Statistical Learning," a well-recognized and immensely useful textbook that describes numerous methods and concepts relevant only to data in a matrix form.) This implies that a subset of the transformations done to the data during discovery will have to be re-implemented later – probably by a different team, requiring extensive back-and-forth communication.

– Ad hoc data practices: In almost all cases, as users go through steps 1, 2, and 3 and insert their domain knowledge, the data output at each stage is stored on a disk and passed around with some notes (sometimes through emails). To ensure confidence in discoveries made, data fields that are questionable or cannot be explained easily or without caveats are dropped. No provenance is maintained for the entire process.

– Siloed and non-iterative: These steps are usually executed in a siloed fashion, often by different individuals within an enterprise. Step 1 is usually executed by people managing the data warehouse. Step 2 is done by domain experts in collaboration with the software engineers maintaining the data warehouse. Steps 3 and 4 fall under the purview of data science and machine learning experts. At each step, a subset of intermediate data is stored and passed on. Once a step is finished, very rarely does anyone want to revisit it or ask for additional data, as this often entails some kind of discussion or back-and-forth.
– Over-optimizing steps 4 and 5, especially step 5: While the data inputs for steps 1 and 2 are complicated and nuanced, the data is usually simplified and shrunk down to a goal-specific subset by the time it arrives at step 4. It is simplified further when it arrives at step 5, at which stage it is just a matrix. Steps 1, 2 and 3 are generally considered tedious and complex, and so when work is done to bring the data to step 4, engineers make the most of it. This has led to steps 4 and 5 being overemphasized, engineered and optimized. For example, all data science competitions, including those on Kaggle, start at step 4, and almost all machine learning startups, tools and systems start at step 5. At this point, there are several ways to handle these two steps, including crowdsourcing, competitions, automation, and scaling. Sometimes, these over-engineered solutions do not translate to real-time use environments. Take the 2009 Netflix challenge: the winning solution was comprised of a mixture of more than 100 models, which made it difficult to engineer for real-time use (see https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429).

– Departure from the actual system collecting the data: Since data is pulled out of the warehouse at time t, and the discovery ensues at some arbitrary time in the future, it may be the case that the system is no longer recording certain fields at that time. This could be due to policy shifts, the complexity or cost involved in making those recordings, or simply a change in the underlying product or its instrumentation. While everything written as part of the discovery is still valid, the model may not actually be completely implementable, and so might not be put to use.

Ultimately, a discovery has to make its way to real use by integrating with real-time data. This implies that new incoming data has to be processed/transformed in the precise way it was when learning the model, and that a robust prediction-making software stack must be developed and integrated with the product. Usually, this responsibility falls on the shoulders of the team responsible for the product, who most likely have not been involved in the end-to-end model discovery process. Thus the team starts auditing the discovery, trying to understand the process while simultaneously building the software, learning concepts that may not be in their skill set (for example, cross validation), and attempting to gain confidence in the software as well as its predictive capability, all before putting it to use. The lack of well-documented steps, of a robust practice for steps 1-4, and of the ability to replicate a model's accuracy can often result in confusion, distrust in the possibility of the model delivering the intended result, and ultimately abandonment of deployment altogether.

Figure 3: ML 1.0 workflow. This traditional workflow is split into three disjoint parts, each of which contains a subset of steps, which are often executed by different people. Intermediate data representations generated at the end of each part are stored and passed along to the next people in the chain. At the top, a relevant subset of the data is extracted from the data warehouse, the prediction problem is decided, past training examples are identified within the data, and the data is prepared so as to enable model building whilst avoiding issues like "label leakage." This step is usually executed in collaboration with people who manage/collect the data (as well as the product that creates it), domain/business experts who help define the prediction problem, and data scientists who use the data to build the predictive models. The middle part of the workflow is feature engineering, where, for each training example, a set of features is extracted by the data scientists. If statistics and machine learning experts are involved, this part could be executed by software engineers or data engineers, or the data could be released as a public competition on Kaggle. The last part involves building, validating, and analyzing the predictive model. This part is executed by someone familiar with machine learning. A number of automated tools and open-source libraries are available to help with this part, including a few shown on the left. The goal of this process has mostly been discovery, ultimately resulting in a research paper, a competition leaderboard, or a blog post. In these cases, putting the machine learning model to use is an afterthought. This traditional workflow has a couple of problems: decisions made during previous parts are rarely revisited; the prediction problem is decided early on, without an exploration of how small changes in it can garner significant benefits; and end-to-end software cannot be compiled in one place so that it can be repeated with new data, or lead to the creation of a deployable model.

Appendix

Figure 4: A specification of search parameters for a machine learning method.

{
  "name": "dt",
  "class": "sklearn.tree.DecisionTreeClassifier",
  "parameters": {
    "criterion": {"type": "string", "range": ["entropy", "gini"]},
    "max_features": {"type": "float", "range": [0.1, 1.0]},
    "max_depth": {"type": "int", "range": [2, 10]},
    "min_samples_split": {"type": "int", "range": [2, 4]},
    "min_samples_leaf": {"type": "int", "range": [1, 3]}
  },
  "root_parameters": ["criterion", "max_features", "max_depth", "min_samples_split", "min_samples_leaf"],
  "conditions": {}
}

Figure 5: An example JSON prediction log file detailing the provenance of the accompanying model, including instructions for labeling, data splitting, modeling, tuning, and testing (continued in Figure 6). The most up-to-date specification of this json can be found at https://github.com/HDI-Project/model-provenance-json.
{ "metadata": "/path/to/metadata.json", "prediction_engineering": { "labeling_function": "/path/to/labeling_function.py", "prediction_window": "56 days", "min_training_data": "28 days", "lead": "28 days" }, "feature_engineering":[ { "method": "Deep Feature Synthesis", "training_window": "2 years", "aggregate_primitives": ["TREND", "MEAN", "STD"], "transform_primitives": ["WEEKEND", "PERCENTILE"], "ignore_variables":{ "customers": ["age", "zipcode"], "products": ["brand"] }, "feature_selection": { "method": "Random Forest", "n_features": 20 } } ], "modeling": { "methods": [{"method": "RandomForestClassifer", "hyperparameter_options": "/path/to/random_forest.json" ← (cid:45) }, {"method": "MLPClassifer", "hyperparameter_options": "/path/to/ ← (cid:45) multi_layer_perceptron.json"}], "budget": "2 hours", "automl_method": "/path/to/automl_specs.json", "cost_function": "/path/to/cost_function.py" }, "data_splits": [ { "id": "train", "start_time": "2014/01/01", "end_time": "2014/06/01", "label_search_parameters": { "strategy": "random", "examples_per_instance": 10, "offset": "7 days", "gap": "14 days" } }, { "id": "threshold-tuning", "start_time": "2014/06/02", "end_time": "2015/01/01", "label_search_parameters": {"offset": "7 days"} }, { "id": "test", "start_time": "2015/01/02", "end_time": "2015/06/01", "label_search_parameters": {"offset": "7 days"} } ], "training_setup": { "training": {"data_split_id": "train", "validation_method": "/path/to/validation_spec_train. ← (cid:45) json"}, "tuning": {"data_split_id": "threshold-tuning", "validation_method": "/path/to/validation_spec_tune.json" ← (cid:45) }, "testing": {"data_split_id": "test", "validation_method": "/path/to/validation_spec_test.json ← (cid:45) "} }, "results": { "test": [ { "random_seed": 0, "threshold": 0.210, "precision": 0.671, "recall": 0.918, "fpr": 0.102, "auc": 0.890 }, { "random_seed": 1, "threshold": 0.214, "precision": 0.702, "recall": 0.904, "fpr": 0.113, "auc": 0.892 } ] }, "deployment": { "deployment_executable": "/path/to/executable", "deployment_parameters": { "feature_list_path": "/path/to/serialized_feature_list.p", "model_path": "/path/to/serialized_fitted_model.p", "threshold": 0.212 }, "integration_and_validation": { "data_fields_used": { "customers": ["name"], "orders": ["Order Id", "Timestamp"], "products": ["Product ID", "Category"], "orders_products": ["Product Id", "Order Id", "Price", " ← (cid:45) Discount"] }, "expected_feature_value_ranges":{ "MEAN(orders_products.Price)": {"min": 9.50, "max": 332.30}, "PERCENT(WEEKEND(orders.Timestamp))": {"min": 0, "max": 1.0} } } }}