AutoDS: Towards Human-Centered Automation of Data Science
Dakuo Wang, Josh Andres, Justin Weisz, Erick Oduor, Casey Dugan
Dakuo Wang, IBM Research, USA
Josh Andres, IBM Research Australia, Australia
Justin Weisz, IBM Research, USA
Erick Oduor, IBM Research Africa, Kenya
Casey Dugan, IBM Research, USA
ABSTRACT
Data science (DS) projects often follow a lifecycle that consists of laborious tasks for data scientists and domain experts (e.g., data exploration, model training, etc.). Only recently have machine learning (ML) researchers developed promising automation techniques to aid data workers in these tasks. This paper introduces
AutoDS, an automated machine learning (AutoML) system that aims to leverage the latest ML automation techniques to support data science projects. Data workers only need to upload their dataset; the system can then automatically suggest ML configurations, preprocess data, select algorithms, and train models. These suggestions are presented to the user via a web-based graphical user interface and a notebook-based programming user interface. We studied AutoDS with 30 professional data scientists, where one group used AutoDS and the other did not, to complete a data science project. As expected, AutoDS improves productivity. Yet, surprisingly, we find that the models produced by the AutoDS group have higher quality and fewer errors, but lower human confidence scores. We reflect on these findings by presenting design implications for incorporating automation techniques into human work in the data science lifecycle.
CCS CONCEPTS
• Human-centered computing → User studies; Empirical studies in HCI; • Computing methodologies → Artificial intelligence.
KEYWORDS
Data science, automated data science, automated machine learning, AutoML, AutoDS, model building, human-in-the-loop, AI, human-AI collaboration, XAI, collaborative AI
ACM Reference Format:
Dakuo Wang, Josh Andres, Justin Weisz, Erick Oduor, and Casey Dugan. 2021. AutoDS: Towards Human-Centered Automation of Data Science. In CHI Conference on Human Factors in Computing Systems (CHI '21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3411764.3445526
Data Science (DS) refers to the practice of applying statistical and machine learning approaches to analyze data and generate insights for decision making or knowledge discovery [11, 44, 48]. It involves a wide range of tasks, from understanding a technical problem to coding and training a machine learning model [48, 65]. Together, these steps constitute a data science project's lifecycle [63]. As illustrated in Figure 1, the data science literature often uses a circular diagram to represent the entire data science lifecycle. In one version of the DS lifecycle model, [63] synthesizes multiple papers and suggests a 10-stage view of DS work. Because of the complex nature of a DS lifecycle, a DS project often requires an interdisciplinary DS team (e.g., domain experts and data scientists) [29, 52, 70]. Previous literature suggests that as much as 80% of a DS team's time [55] is spent on low-level activities, such as manually tweaking data [25] or trying to select among candidate algorithms [71] with Python and Jupyter notebooks [30]; thus, they do not have enough time for valuable knowledge discovery activities to create better models. To cope with this challenge, researchers have started exploring the use of ML algorithms (e.g., Bayesian optimization) to automate these low-level activities, such as automatically training and selecting the best algorithm from all the candidates [35, 38, 40, 71]; this group of work is called Automated Data Science (or
AutoDS for short). Many research communities, universities, and companies have recently made significant investments in AutoDS research, under the belief that the world has ample data and, therefore, an unprecedented demand for data scientists [24]. For example, Google released AutoML in 2018 [19]. Startups like H2O [23] and Data Robot [10] both have their branded products. There are also Auto-sklearn [16] and TPOT [14, 50] from the open source communities. While ML researchers and practitioners keep investing in and advancing AutoDS technologies, HCI researchers have begun to investigate how these AI systems may change the future of data scientists' work. For example, a recent qualitative study revealed that data scientists' attitudes toward and perceived interactions with AutoDS systems are beginning to shift to a collaborative mindset, rather than a competitive one, where the relationship between data scientists and AutoDS systems cautiously opens the "inevitable future of automated data science" [65]. However, little is known about how data scientists would actually interact with an AutoDS system to solve data science problems. To fill this gap, we designed the AutoDS system and conducted a between-subject user study to learn how data workers use it in practice. (Some researchers also refer to Automated Artificial Intelligence (AutoAI) or Automated Machine Learning (AutoML); in this paper, we use
AutoDS to refer to the collection of all these technologies.)
Figure 1: A 10-Stage, 43-Sub-task (not shown) DS/ML Lifecycle. This is a synthesized version from reviewing multiple scholarly publications and marketing reports [16, 18, 39, 41, 65].

We recruited 30 professional data scientists and assigned them to one of two groups to complete the same task: using up to 30 minutes to build the best-performing model for the given dataset. Half of the participants used AutoDS (experiment group), and the other half built models using Python in a Jupyter notebook, which aims to replicate their daily model building work practice (control group). We collected various measurements to quantify 1) the participant's productivity (e.g., how long a participant spent building or improving the model), 2) the final model's performance (e.g., its accuracy), and 3) the participant's confidence score in the final model (i.e., how much confidence a participant has in the final model if asked to deploy it). Before the experiment, we also collected measurements of 4) each participant's data science expertise level and 5) their prior experience with and general attitude towards AutoDS, as control variables.

Our results show that, as expected, participants with AutoDS support created more models (on average 8 models vs. 3.13 models) and did so much faster (on average 5 minutes vs. 15 minutes) than participants with Python and notebooks. More interestingly, the final models from the AutoDS group had higher quality (.919 ROC AUC vs. .899) and fewer human errors (0 out of 15 vs. 7 out of 15) than the models from the control group with Python and notebooks. The most intriguing finding is that, although participants acknowledged the AutoDS models were of better quality, they had lower confidence in these models than in the manually crafted models if they were to deploy the model (2.4 vs. 3.3 on a 5-point Likert scale). This result indicates that "better" models are not equal to "confident" models, and the "trustworthiness" of AutoDS is critical for user adoption in the future. We discuss potential explanations for this seemingly conflicting result, and the design implications stemming from it.

In summary, our work makes the following contributions to the CHI community:
• We present an automated data science prototype system with various novel feature designs (e.g., end-to-end, human-in-the-loop, and automatic export of models to notebooks);
• We offer a systematic investigation of user interaction and perceptions when using an AutoDS system to solve a data science task, which yields many expected (e.g., higher productivity) and novel findings (e.g., performance is not equal to confidence, and a shift of focus); and
• Based on these novel findings, we present design implications for AutoDS systems to better fit into data science workers' workflows.
Many researchers have studied how data scientists work with data and models. For example, it has been suggested that 80 percent of the time spent on a data science project goes to data preparation [22, 48, 70, 71]. As a result, data scientists often do not have enough time to complete a comprehensive data analysis [61].

A popular research topic in this domain is interactive machine learning, which aims to design better user experiences for human users to interact with machine learning tools [1]. These human users are often labelers of a data sample or domain experts who have rich domain knowledge but not much machine learning expertise.

Based on the findings from these empirical studies, many tools have been built to support data science workers [34, 44, 53]. For example, Jupyter Notebook [30] and its variations, such as Google Colab [20] and JupyterLab [31], are widely adopted by the data science community. These systems provide an easy code-and-test environment with a graphical user interface so that data scientists can quickly iterate their model crafting and testing process [37, 48]. Another group of tools includes the Data Voyager [26] and TensorBoard [9] systems, which provide visual analytics support for data scientists to explore their data; however, they often stop at the data preparation stage and thus do not provide automated support for the model building tasks in the lifecycle (as shown in Figure 1).
AutoDS refers to a group of technologies that can automate the manual processes of data pre-processing, feature engineering, model selection, etc. [71]. Several technologies have been developed with different specializations. For example, Google has developed a set of AutoDS products under the umbrella of Cloud AutoML, such that even non-technical users can build models for visual, text, and tabular data [19]. H2O is Java-based software for data modelling that provides a Python module, which data scientists can import into their code in order to use its automation capability [23].

Despite the extensive work in building AutoDS systems and algorithms, only a few recent efforts have focused on the interaction between humans and AutoDS. Gil and collaborators propose a guideline for designing AutoDS systems [18]. However, they envision new design features for AutoDS based on their understanding of how data scientists manually build models, and their understanding arises from surveying the previous literature and the authors' personal experience. In our study, we aim to fill this empirical understanding gap by conducting an experiment to systematically examine how data scientists actually use an AutoDS system.

Another recent work studies data scientists' perceptions of and attitudes towards AutoDS systems through interviews with 30 data scientists [65]. The interviewees, who had never interacted with AutoDS before, believe that AutoDS could potentially change their work practice. They also believe automation in data science work is the future, but they certainly hope that such automation will support their jobs instead of sabotaging them. We follow this line of research and aim to provide an account of actual user behaviors when a data scientist uses an AutoDS system to build a model.

One more research strand worth mentioning is information visualization design for AutoDS systems [67, 68]. For example, ATMSeer enables users to browse AutoDS processes at the algorithm, hyperpartition, and hyperparameter levels [71]. Its results indicate that users with a higher level of expertise in machine learning are more willing to interact with ATMSeer [67]. We referenced the findings of this existing visualization work while designing and implementing our prototype system's user interface. But these papers focus only on the visualization aspect and on revealing information about existing AutoDS pipelines, and they do not measure data scientists' behaviors. Thus, the feedback from these studies is limited to AutoDS visualization user interface design and offers little for improving AutoDS functionality.
The recent "AI Summer" [21] is remarked with machine learningsystem demonstrations such as IBM DeepBlue [6] and WatsonJeopardy [45], and Google’s AlphaGo [66]. In these user scenarios,AI has largely been portrayed as a competitor to humans. Thegeneral public and news media began to worry about when the“singularity” will arrive, with AI replacing humans [3].More recently, a few researchers have started to argue that in-stead of worrying about the singularity, why do not we work to-gether to design AI systems that can collaborate with humans? [42,59] Following this trend, HCI projects have reported various casestudies of designing AI systems to work together with humans in-stead of replacing them. For example, Cranshaw and collaboratorsdeveloped a calendar scheduling assistant system that combinesthe complementary capabilities of humans and AI systems for taskssuch as scheduling meetings [8]. They coined this architecture a“human-in-the-loop AI system”. More towards the hardware side ofthe AI spectrum, a group of researchers from Cornell Universityhave designed a robot that can work together with humans as ateam and complete the tasks such as distributing resources to hu-man collaborators [7]. IBM researchers also experimented the usecase of putting an embodied conversational agent into a recruit-ing team of two human participants, with the agent and humansworking together to complete a CV review task [57].There was a seminal debate 20 years ago on “Agency v.s. DirectManipulation User Interfaces Design” [58]. With more and more deep neural network AI technologies, we, as human users, findit harder and harder to understand what happens inside the AI“black box”. In parallel, we designate more and more agency andproactiveness to many of today’s AI system user interfaces, such asthe conversational systems and autonomous cars. Some researchershave reported that users already look at these AI systems differ-ently than the traditional computer systems but more like humanpartners [69], where anthropomorphism effect plays a critical rolein user perceptions (e.g. [49, 62]. Researchers and designers areactively asking: what are the updated frameworks and theories thatwe can leverage to help us design better AI systems to work withhuman?Various human-centered design guidelines for AI have beenpublished in the past year from big companies such as Google, Mi-crosoft, and IBM [2, 28, 56, 60]. But these guidelines often focusonly on usability of AI system’s interface design, but fall short indiscussing the integration of AI systems into human workflowsas a collaborative partner. We argue that with an AI system beingperceived more like a collaborator teaming up with humans, wemay be able to reference classical theories from human-humancollaborations to guide our design of an cooperative AI system .For example, we may learn from the “collaboration awareness” con-cept [12] to guide our design of AI system’s “transparency”. The“awareness” is a bi-directional concept, thus maybe “transparentAI” should not only have human better understand AI’s runtimestate and logic, but also have AI keep track of human’s states andintention. Thus, in this study, we are also interested in exploringthe human-AutoDS interaction through a collaborative work per-spective.
Based on machine learning technique’s advances and inspired bydesign insights from related literature, we implement an AutoDSprototype system, shown in Figure 2, to support data science work-ers on data science tasks.From the user perceptive, a user is only required to upload theirdataset to trigger the AutoDS execution, and then they can waitfor the final model result. They interact with the system primarilyfrom two graphical user interface screens: a configuration screen(Figure 2a) and a result screen(Figure 2b) . At the configuration screen,after the user uploads a dataset, AutoDS suggests a specific the MLtask configuration for users to approve or adjust (e.g., classificationor regression); then at the result screen, the users can monitor Au-toDS’s execution progress visualization in real time and eventuallysee the final results in a model leaderboard view.When looking closer at the result screen in Figure 2b, we designa tree-based progress visualization at the top. Each leave dot inthe tree-based visualization represents one of the candidate modelpipeline; and the path represents its composition flow. The screen-shot illustrates an AutoDS execution is in progress. Two algorithmsare being tested (XGB Classifier (blue) and Gradient Boosting Clas-sifier (purple) ). For the XGB Classifier, four models (P1 to P4) aregenerated; and for Gradient Boosting Classifier, only two models(P5 and P6) are generated, and two more (P7 and P8) are in training.Some models use the same algorithm but their composition stepsare different. For example, P2 has an extra step of hyperparameter
Figure 2: The two steps of AutoDS's graphical user interface: (a) the Configuration UI screen and (b) the Result UI screen.

For example, P2 has an extra hyperparameter optimization step compared to P1, and P3 has an additional feature engineering step, even though all of these models use the same XGB algorithm.

At the bottom of the result screen in Figure 2b, we include a model result leaderboard. Each row in the leaderboard represents a candidate model pipeline, which corresponds to a model node in the tree visualization at the top. It displays the following model information:
Rank, Pipeline ID, Algorithm, Accuracy Score, Enhancement Steps (such as Transformation and Hyperparameter Optimization), Training Runtime, and
the ROC AUC model performance score on training or holdout data splits.

We designed three user functions for each model result, all of which can be triggered by interacting with the model leaderboard: 1) a user can further examine the details of a model (such as the features included or excluded in the model, the prediction plot, etc.) by clicking on a row in the leaderboard (not shown in Figure 2b); 2) if a user is satisfied with a model, they can click on the "Save as" button to save the "Model" (shown in Figure 2b), which will be automatically deployed as a Cloud API endpoint; and 3) if a user is interested in checking the details of a model in a Python notebook, they can click on the "Save as Notebook" button (shown in Figure 2b) and download the AutoDS-generated notebook for further improvement. Many of these design considerations (e.g., automatic generation of human-readable Python notebooks) are also novel and practical contributions to AutoDS system design.

From the system and algorithm perspective, AutoDS reads in the user-uploaded dataset and suggests a problem configuration (e.g., classification or regression) based on the data structure. Then, AutoDS jointly optimizes the sequence of the model pipeline, which includes selecting the appropriate methods for data preprocessing, feature engineering, algorithm selection, and hyperparameter optimization. AutoDS can choose a sequence of transformation steps before training an estimator, or it can simply decide not to use any transformation to generate new features. Once the sequence is decided, AutoDS searches for and decides which particular transformation to use inside each transformation step, and which estimator to use inside the modeling step. Finally, AutoDS fine-tunes the hyperparameters of the chosen estimators and transformations simultaneously. (For more technical details about the AutoDS joint optimization algorithm, readers can refer to [38, 43] to replicate the backend.)

By design, our AutoDS system can provide automation support for the end-to-end data science lifecycle shown in Fig. 1: from Requirement Gathering and Problem Formulation (i.e., suggesting configurations), to Model Building and Training (i.e., selecting algorithms), and to Decision Making and Optimization (i.e., deploying models).
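To make the pipeline composition concrete, the sketch below expresses, in scikit-learn terms, the kind of search space described above: candidate pipelines that differ in whether they include a transformation step, which estimator they use, and which hyperparameter values are tried, ranked by cross-validated ROC AUC. This is only an illustrative sketch under those assumptions, not the AutoDS backend; for the actual joint optimization algorithm, see [38, 43].

```python
# Illustrative sketch only (not the AutoDS backend): enumerate candidate
# pipeline compositions and rank them by cross-validated ROC AUC.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def candidate_pipelines():
    """Yield (name, pipeline, hyperparameter grid) for each composition."""
    estimators = {
        "logistic_regression": (LogisticRegression(max_iter=1000),
                                {"model__C": [0.01, 0.1, 1.0, 10.0]}),
        "random_forest": (RandomForestClassifier(random_state=0),
                          {"model__n_estimators": [50, 100, 200],
                           "model__max_depth": [None, 3, 5, 10]}),
    }
    for name, (estimator, grid) in estimators.items():
        for scaled in (False, True):  # optional transformation step
            steps = ([("scale", StandardScaler())] if scaled else []) + [("model", estimator)]
            yield (f"{name}{'+scaling' if scaled else ''}", Pipeline(steps), grid)

def build_leaderboard(X, y):
    """Tune each candidate composition and sort by ROC AUC, like the UI leaderboard."""
    rows = []
    for name, pipeline, grid in candidate_pipelines():
        search = GridSearchCV(pipeline, grid, scoring="roc_auc", cv=5)
        search.fit(X, y)
        rows.append((name, search.best_score_, search.best_params_))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```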
Following a lab experiment study guideline [17], we conducted a between-subject user evaluation to understand how people interact with AutoDS in a data science project. Participants were asked to try their best at building a machine learning model for the same UCI Heart Disease dataset [5], either by writing Python code in a Jupyter Notebook (Notebook Condition in Figure 3) or by using the AutoDS prototype system (AutoDS Condition in Figure 2).
The experiment design follows a time-constrained clinical trial setting: each participant was given 30 minutes to build their best model. We decided on 30 minutes because, in our three pilot study sessions, all the pilot participants (data scientists) finished the task within 12 minutes, and they needed to write no more than 10 lines of code to get the expected result: split train and test data subsets, define a model variable, and run a cross-validation function to train the model and report the accuracy score. We considered a participant to have completed the task if s/he got at least one model. The final model's performance is evaluated based on the ROC AUC score, one of the widely used model performance metrics [51]; for simplicity, we refer to this metric as "accuracy" in the rest of the paper. During the 30 minutes, participants were allowed to try out one or more models, but they understood that their performance and productivity were measured not by the quantity of models but rather by the quality of the final model. They were allowed to submit early if they were satisfied with their model. All experiment sessions were conducted remotely using a video conferencing system, and participants used their own laptops during the experiment, as both the notebook and AutoDS were hosted on our experiment cloud server. These requirements were specified in the experiment instructions, and all participants verbally acknowledged them.

It is worth noting that this data science task is a simplified version of participants' daily data science work. We selected the UCI Heart Disease dataset, which contains only 303 patients' basic medical record information and 13 features (both continuous and categorical values), to predict whether a patient has heart disease (a binary target feature, with 0 representing no disease). It is a widely used benchmark dataset in research papers as well as in machine learning competitions (e.g., Kaggle [32]). Data cleaning, preprocessing, and feature engineering steps are not critical to building a valid model, but in order to achieve better model performance, participants do need to try out further model improvement steps. In both conditions, the dataset and required libraries in the Notebook were pre-loaded, so participants did not need to find the data or set up the environment. We asked the participants to share their screens and, with their consent, recorded each session for further analysis.
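For illustration, a minimal solution of the kind the pilot participants produced might look like the sketch below; the file path and variable names are hypothetical (in the study, the dataset and libraries were pre-loaded in the notebook): split the data, define a model, and report a cross-validated ROC AUC score.

```python
# Hypothetical sketch of the roughly ten-line baseline solution; the study's
# pre-loaded notebook may have used different names and paths.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("heart.csv")                      # UCI Heart Disease data (hypothetical path)
X, y = df.drop(columns=["target"]), df["target"]   # 13 features; 1 = disease, 0 = no disease
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
print("cross-validated ROC AUC:", cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean())

model.fit(X_train, y_train)
print("holdout ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```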
Participants in the
AutoDS Condition used our AutoDS prototype, shown in Figure 2. As a default problem configuration, AutoDS generates a list of four model pipelines with one single algorithm (shown in Fig. 2). A model pipeline consists of all of the composition steps in generating the model, from reading data to optimizing hyperparameters. As mentioned earlier, not all pipelines had the hyperparameter optimization step or the feature engineering step, leading to the variety of generated models and their accuracy scores. Participants could modify AutoDS's configuration and get up to 8 model pipelines.

As described in the AutoDS system design section, participants had a user function to inspect the generated model results, with a visualization and a leaderboard showing information such as the confusion matrix, a feature importance chart, and tables with various evaluation metrics (e.g., ROC AUC, precision, F1, etc.) and descriptions of feature transformations. In addition, through the automated notebook generation feature in the AutoDS system, participants were also able to download, execute, or edit the generated code and further improve the AutoDS model through coding.
Participants in the
Notebook Condition were provided with an online Jupyter Notebook environment as a replication of their current work environment. To simplify the task, we provided notebooks with pre-scripted code sections and a skeleton of instructions. The notebooks were preloaded with the data and necessary libraries, and we provided instructions in the markdown cells of the different sections. For example, we listed an "(Optional) Feature Engineering" section, a "(Required) Model Training and Selection" section, and blank code cells for users to fill in their code. In the model training and selection section, we suggested five commonly used algorithms for this particular task: Logistic Classifier, KNN, SVM, Decision Tree, and Random Forest [51]. Note that these pre-scripted code skeletons were meant to help participants write code easily, and we specifically stated in the task sheet that they were not required to use them.
Figure 3: Screenshot of the Jupyter Notebook used by the participant group with Python and notebooks (Control Condition). The Notebook contained skeleton code to load relevant data science libraries (numpy, pandas, scikit-learn) as well as the dataset. Instructional sections in the Notebook reminded participants to perform the following tasks: data preprocessing (optional), feature engineering (optional), model training (with 5 different algorithms recommended), and model comparison (optional).

From our observation of the experiment, some participants in the Notebook Condition chose not to use the skeleton and wrote all their code in one cell.
Our research goal was to explore how differently the participants (i.e., data scientists) complete the data science task in these two conditions. To that end, we manually coded the 30 video-recorded sessions (each 45 minutes to one hour long) to extract various measurements, such as how many models each participant generated, what final algorithm they selected, and what its accuracy was. We also captured time-related measures, such as how long they spent on the model building task, to evaluate how participants allocated their time differently.

In both the AutoDS Condition and the Notebook Condition, we collected participants' background information and their attitudes and perceptions towards AutoDS before the session. We also collected their confidence score for the final submitted model after the session. In the AutoDS Condition alone, we additionally collected participants' perceptions of the AutoDS prototype they had just used through an XAI questionnaire [27], as its four dimensions of an AI system being predictable, reliable, efficient, and believable are applicable in our context. In addition, we collected participants' ratings of trust in AutoDS technologies a second time in the AutoDS Condition, as their scores may change after experiencing the system. Thus, we can compare the two pre-study trust scores across the two conditions, and compare the pre-study and post-study trust scores within the AutoDS Condition.

It is worth noting that we collected the ROC AUC score, among other scores, as an indicator of model performance in both conditions, as we simply needed a standardized metric to reflect each
participant's task performance. We acknowledge that the accuracy score of a model should not be overstated [36, 54]; in an actual data science project, there are various other considerations (e.g., runtime efficiency) in evaluating a model's performance beyond the accuracy score.

From our observation, some participants in the Notebook Condition made human errors in the submitted model solutions. For example, they forgot to split the data, and thus reported a very high but incorrect accuracy score because they used the training data split. Other participants reported an F1 score although the instructions asked for a ROC AUC score. Thus, the scores from these erroneous solutions are not comparable to the scores from other participants' solutions. The participants who made these mistakes acknowledged in the post-study interview that they were indeed human errors and said they forgot to follow the instruction sheet. To capture this, we created one additional measurement to indicate whether a participant successfully completed the task with no human error.

In summary, we have four groups of quantitative measurements:
• Participant's background information, collected via a pre-study survey (P1, P2 for both conditions);
• Participant's perceptions toward general AutoDS-type technologies, collected via a pre-study survey (A1, A2, A3 for both conditions) and a post-study survey (AA2, AA3, AA5 for the AutoDS condition only);
• Participant's behavioral measurements in the task (B3, BNx, BAx) and the final model performance (B1, B2), collected by coding the video recordings (both conditions); and
• Whether the participant successfully completed the task without human error, collected by examining the final model's code (F1 for both conditions).

A summary of all the collected quantitative measurements is listed in Table 1. We also conducted a semi-structured interview at the end of each session to gather users' qualitative feedback to enrich our results.
We recruited 30 professional data scientists in a multinational IT company and randomly assigned them to one of the two conditions (AutoDS vs. Notebook). These participants come from different locations in the U.S., South Africa, and Australia. Seven of the 30 participants were female (23%), which is similar to the 16.8% ratio reported in the Kaggle 2018 survey of data science [33].

Our recruitment criteria were that the participants were professional data scientists and that they practiced data science or machine learning in their day-to-day work. Participants reported practicing data science for an average of 3.5 years (SD = 2.7). Six participants (20%) rated themselves as beginners, 17 (57%) as intermediates, and 7 (23%) as experts in data science. Participants rated themselves as having a moderate amount of experience (3.2 SD=.97 out of 5) with the Python scikit-learn library, which is used in our study. Later, in the modeling part of result section 5.5.2, we test whether these background factors (e.g., years of expertise and prior experience with scikit-learn) influence model performance; the result is not significant.
Background Measures
P1. Years spent practicing data science
P2. Prior experience with scikit-learn (1=low, 5=high)
Attitudinal Measures ("A" questions used in both conditions, "AA" used only in AutoDS)
A1. Previous familiarity with general AutoDS (1=low, 5=high)
A2. Trust in general AutoDS (first time)
AA2. Trust in general AutoDS (second time)
A3. Belief that general AutoDS will replace humans (first time)
AA3. Belief that general AutoDS will replace humans (second time)
A4. Participant's confidence in the selected final model
AA5. XAI Scale [27]
Behavioral Measures ("B" questions used in both conditions, "BA" used only in AutoDS, "BN" used only in Notebook)
B1. Accuracy of final model (ROC AUC)
B2. Type of final model (e.g., logistic regression, random forest, etc.)
B3. Total time spent searching for information on the web
BN1. Amount of time spent preparing data
BN2. Amount of time spent until the first model was produced
BN3. Number of different models tried
BN4. Did participant perform exploratory data analysis (EDA)?
BN5. Did participant perform feature engineering?
BA1. Total time spent configuring AutoDS
BA2. Total time spent running AutoDS
BA3. Total time spent examining the leaderboard
BA4. Total time spent examining pipeline details
BA5. Number of times AutoDS was run
BA6. Number of pipelines examined (i.e., viewed pipeline details)
BA7. Number of pipeline notebooks viewed
BA8. Number of pipeline notebooks edited
BA9. Which AutoDS pipeline was chosen at the end?
BA10. Did participant change the code of the final selected pipeline?
BA11. Accuracy of the modified pipeline
Successful Completion
F1. (Binary) Did the participant successfully complete the task without human errors (e.g., mistakenly reporting the accuracy score from the training data split instead of the holdout)?
Table 1: Summary of background, behavioral, and attitudinal measures captured in the study.
We first describe the overall results; then we present the results in the following order: attitudes toward general AutoDS systems; behavioral measures from the Notebook and AutoDS conditions, respectively; and lastly, an extensive comparison of the two conditions in terms of participant behaviors, model outcomes, and participant attitudes.
All 15 participants (100%) in the AutoDS condition and eight participants (53%) in the Notebook condition finished the model building task without any human error. We refer to these participants as the "successful" participants. In contrast, seven participants (47%) in the Notebook condition made one or more mistakes in the process. Common human errors included forgetting to split the dataset into training and testing subsets, or reporting an accuracy metric other than the required ROC AUC score.
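As a sketch of the most common error (scoring on the data the model was trained on), the snippet below uses synthetic data to show how evaluating on the training split inflates the ROC AUC relative to the held-out split; the names are illustrative and not taken from participants' code.

```python
# Sketch with synthetic data: scoring on the training split inflates the metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# The error: reporting the score on the training split (nearly perfect by construction)
print("train ROC AUC:", roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))
# The required measure: the score on the held-out split
print("holdout ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```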
The experiment time was limited to 30 minutes in both conditions. Out of all 30 participants, the fastest participant to reach a state of completion was P23 (Notebook), who finished the task early, in 17.9 minutes. Comparing across the two conditions, one could frame the difference as "a 300% increase in productivity". But this result was not the goal of our study, and it was somewhat expected. Also, it was not a fair comparison, so we would caution readers not to use this number bluntly.

As for the final submitted models, 18 participants chose the Random Forest algorithm (also the best one according to our own experiment), 9 chose Logistic Regression, 2 used SVM, and 1 chose ExtraTree. The AutoDS condition outcome was dominated by Random Forest (13 out of 15), with 2 participants choosing Logistic Regression. In the Notebook condition, participants' choices were more diverse, with only 5 out of 15 choosing Random Forest.

Before the task, participants in both conditions were asked about their familiarity with and attitudes toward general AutoDS technologies (A1, A2, A3); at the end of the AutoDS condition, participants were again asked the attitude questions (AA2, AA3). In this section, we report the comparison of the A1, A2, and A3 measurements between the two conditions. We reserve the comparison of pre-study and post-study trust within the AutoDS condition for Sec 5.5.4. In general, participants had some familiarity with AutoDS/AutoML technologies (2.27 SD = 1.08 out of 5), but not much. They may have heard about it before or even used it once, but they had not been using it frequently (A1).

Regarding the "Trust in General AutoDS/AutoML" question (A2), we found participants had conflicting opinions: 13 participants (43%) agreed with this statement, 13 remained neutral, and 4 (13%) disagreed.

Lastly, participants did not believe AutoDS would replace human data scientists (A3): 15 (50%) disagreed with this statement and only 3 (10%) agreed, with the rest remaining neutral.
In this subsection, we report how participants built models using AutoDS. Participants needed to explore the data to understand the problem; they did so in the AutoDS configuration step. Thus, although AutoDS automatically and instantly suggested default configurations (such as the prediction type), participants still spent some time in this step to further examine the dataset, e.g., by checking the data distributions (2.3 SD=.96 min) (BA1).
Feature engineering, model training, and model selection steps were fully automated by AutoDS; we tracked the time participants waited for AutoDS to complete its runs (2.1 SD=.59 min) (BA2).

"It's fast and it gives you visualizations to compare the models, this saves me a lot of time" (P8, M, AutoDS)

Because it was fast for data scientists in the AutoDS condition to generate models (in total less than 5 minutes), participants had some extra time to spend on other activities (BA5). (In our study, the dataset was simple, so it took AutoDS only a few minutes to generate the first model; with a more complicated dataset, e.g., a data file of hundreds of MB, this training step could take hours.) For example, they spent more time examining or further refining the model pipeline results generated by AutoDS. They spent most of their time examining the leaderboard and visualization shown in Figure 2b (9.7 SD=3.4 min) (BA3). They also went into a particular model's detailed information page for fine-grained information (3.9 SD=2.2 min) (BA4).

Participants examined the details of an average of 3.8 (SD=1.2) pipelines (BA6), and some of them (7 out of 15 participants) downloaded the generated notebook and examined the code to further understand the model, spending an average of 2.3 (SD=.70) minutes (BA7). Some participants (7 out of 15) even further revised this code, and most of them edited only one Notebook (BA8). Participants who chose to edit code did so with mixed outcomes: out of 7, two were able to improve the accuracy score of the pipeline produced by AutoDS, two submitted final models that actually had slightly worse accuracy, and three abandoned their changes and submitted the original scores (BA10).

The path to selecting the best model in the AutoDS condition seemed straightforward: a participant ran AutoDS and selected the model with the "highest performance ROC AUC score". But some participants did not select the top-performing model as their final choice. Of the 8 possible models that could be produced with AutoDS, 8 participants chose the highest-scoring one (called "P7"), built with a Random Forest algorithm, and 7 chose a different model with a lower ROC AUC score than P7. It appears that even though P7, with the highest ROC AUC score, was available, in some cases participants had already invested time and effort in inspecting and tweaking parameters within an earlier generated pipeline, in the end selecting a pipeline with a lower ROC AUC score than P7 (BA9). The mean ROC AUC score of the final selected model was 0.919 (SD=.01) (BA11).

Now we shift our discussion from user perception to user behavior. Participants' workflow in the Notebook condition was very similar to prior findings, e.g., [48]. They went through data exploration, feature generation, model building, hyperparameter tuning, and model selection steps. As this is a lab study centered around model building, participants were not asked to perform model deployment steps.

Regarding the time-related measures, as mentioned above, participants spent 15 minutes (SD=6.8) generating the first model (BN2). Note that this BN2 measure also includes the time participants spent pre-processing and splitting the data (BN1, 9.4 minutes, SD=5.5),
so the actual model building task did not take that long. A few participants (N = 2) were marked "unsuccessful" as they did not perform the necessary data splitting step (F1).

In investigating the dataset, seven participants (47%) performed some form of exploratory data analysis, either by looking at tables that summarized descriptive statistics of the data or by producing graphs and charts (BN4). But none of the participants wrote new code to explore the data distribution or generate visualizations to support their findings. In the post-study interview, almost all participants mentioned that, if they had had more time, they would have loved to conduct more data exploration, build more understanding of the domain, and consult with domain experts. By doing so, they believe they could have built better models and increased their confidence in the models they generated.

"... I need to talk with domain experts or doctors to understand the data and features ... whether my current model makes sense ... Then, I need to think about how to create new features to improve the model." (P29, M, Notebook)

Only four participants (27%) performed some form of feature engineering, by transforming, scaling, or combining existing features to create new features (BN5). Some argued that they did not have time, while a few others stated they did not do so for this simple, small dataset.

"I know SVM's and Random Forest, in problems like this you don't need to pre-normalize [the data], RF can handle it." (P28, M, Notebook)

Regarding the quantity of the outcome, Notebook participants tried on average 3.13 (SD = 1.69) models (BN3). Three participants focused all of their attention on just one model, and four participants tried all five suggested models. Despite finishing the task minutes early, participants who finished only one model argued that they did not have enough time to try more models, or so they believed. All of the participants used only the recommended models in the Notebook skeleton, justifying this as follows:

"I would [still] try KNN, RF, Logistic Regression [without your recommendation], because it's a simple binary classification." (P18, M, Notebook)

In summary, the results from the Notebook condition are not surprising. Data scientists followed the general pattern of the data science workflow, as reported previously. This is a good result for us, as it means that we successfully replicated data scientists' day-to-day data science jobs in the lab environment. And by tracking various behavioral and perceptional measures, we now have a solid baseline condition to compare with data scientists' new way of working with AutoDS.
In this section we delve deeper into the comparison of the two ways of working for data scientists building models. We start with various comparison analyses [46] of participants' behavioral patterns, then the outcomes of the task, and finally participant attitudes.
AutoDS easily generated 8 models with various algorithm and hyperparameter combinations, whereas participants in the Notebook condition generated 3.13 (SD = 1.69) models (BN3). To provide a fairer comparison, participants in the AutoDS condition examined in detail 3.8 (SD = 1.2) of the 8 models (BA6), which served as the candidate set from which they selected the final model. That count (BA6) is still slightly higher than the corresponding count in the Notebook condition (BN3), though a t-test shows the difference is not significant. We report effect sizes as partial η², the proportion of variance accounted for by each factor in the model, controlling for all other factors. To interpret effect sizes, Miles & Shevlin [47] give guidance that partial η² ≥ .14 is a large effect, ≥ .06 is a medium-sized effect, and ≥ .01 is a small effect. The final models from the AutoDS condition had higher accuracy than those from the Notebook condition (.919 vs. .899 mean ROC AUC; partial η² = .27, a large effect) (B1).

As mentioned above, participants spent much more time creating models in the Notebook condition than in the AutoDS condition. Even if we compare the time participants spent on the first model in the Notebook condition (time to first model, 15.0 SD=6.8 min) (BN2) with the time to generate all 8 models in the AutoDS condition (configuration plus run time, 4.3 SD=1.0 min) (BA1+BA2), the difference is significant.
Factor | M | SD | t | p | partial η² | Direction
Years spent practicing data science (P1) | 3.5 | 2.7 | .955 | n.s. | .47 | +
Prior experience with scikit-learn (P2) | † | † | | | |

Table 2: Factors that predicted higher ROC AUC scores for AutoDS participants. (* and **) indicate effects that are both significant and sizable. (†) Prior experience with scikit-learn and AutoDS were rated on a 5-point scale (1=low, 5=high).

To understand what factors led participants to choose a more accurate model while working with AutoDS, we created a regression model to predict ROC AUC scores in the AutoDS condition (B1). We included a variety of behavioral measures (B3, BA1, BA3, BA4, BA6, BA7, BA8) and three self-reported background measures (P1, P2, A1) to control for expertise effects: years practicing data science, prior experience with scikit-learn, and prior experience with AutoDS. We also controlled for how long the participants took to complete the task. (We included all the behavioral measures except BA5; because BA5 is the AutoDS computation time, a constant number of seconds, the model would treat it as a constant and omit it.) For AutoDS participants, we found a number of significant predictors of the ROC AUC score, detailed in Table 2.

We found that the longer participants searched online for documentation and task-related information in the AutoDS condition (B3), the lower the accuracy score of their selected model. We observed a similar trend in the Notebook condition, where more experienced data scientists seemed to spend less time searching for documentation. The longer participants spent examining the leaderboard and the pipeline details, the better the model they obtained (BA3, BA4). This suggests that the more time and effort participants dedicated to understanding the AutoDS model results, the better their outcome. There were quite a few models generated by AutoDS, and participants only needed to inspect a number of them in detail (BA6). They may have been interested in viewing a pipeline's code in a Notebook as well (BA7). Both behaviors led to higher accuracy in the model they selected. The behavior of changing pipeline code did not necessarily increase the accuracy score (BA8), partially because participants changed the code to make sure they could see corresponding changes in the model score, even if the score went down. That gave participants reassurance about the AutoDS-generated models.
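A minimal sketch of this kind of analysis is shown below, assuming the coded measures sit in a pandas DataFrame with hypothetical column names: it fits an OLS model predicting ROC AUC from the behavioral and background measures and derives partial η² for each factor from the ANOVA table. The paper's exact model specification may differ.

```python
# Hypothetical sketch of the regression analysis; column names are illustrative.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("coded_measures.csv")   # one row per AutoDS participant (hypothetical file)
model = smf.ols(
    "roc_auc ~ web_search_min + leaderboard_min + pipeline_detail_min"
    " + pipelines_examined + notebooks_viewed + notebooks_edited"
    " + years_ds + sklearn_exp + autods_familiarity + task_min",
    data=df,
).fit()
print(model.summary())

# Partial eta-squared per factor: SS_effect / (SS_effect + SS_residual)
aov = anova_lm(model, typ=2)
ss_resid = aov.loc["Residual", "sum_sq"]
print((aov["sum_sq"] / (aov["sum_sq"] + ss_resid)).drop("Residual"))
```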
Other than the accuracy score of the final model, we also collected how confident each participant was in their final selected model (A4) at the end of the study. Overall ratings of confidence landed in the middle of the scale (2.9 SD=.97 out of 5), suggesting that participants felt they could have produced even better models if they had had more time. Participants in the AutoDS condition were significantly less confident (2.4 SD=.98) in their models than participants in the Notebook condition (3.3 SD=.72), p < .01. An important consideration here is that more confidence did not equate to better models across the two conditions. This may suggest that while AutoDS users had more options to compare and study, their counterparts in the Notebook condition were invested in writing code from the start. This investment of time and effort in a model could influence how confident users were. This is backed by the qualitative interviews:

"I'm pretty confident that this is the best model I can get with this dataset in 30 mins" (P18, M, Notebook)
To form a deeper understanding of people's trust in AutoDS, we decomposed trust into various fine-grained dimensions. We asked AutoDS condition participants to complete a survey developed by Hoffman et al. [27] at the end of the study (AA5). The survey was intended to evaluate trust in Explainable AI systems (XAI) and contains 8 items, each with 5-point Likert scale responses ranging from Strongly Disagree to Strongly Agree (Table 3). Factor analysis indicated that removing item Q2 increased the reliability of the scale, so we aggregated the final scale omitting this item.
Q1. I am confident in the AutoDS. I feel that it works well.
Q2. The outputs of the AutoDS are very predictable.
Q3. The AutoDS is very reliable. I can count on it to be correct all the time.
Q4. I feel safe that when I rely on the [AutoDS] I will get the right answers.
Q5. AutoDS is efficient in that it works very quickly.
Q6. I am wary of the AutoDS.
Q7. AutoDS can perform the task better than a novice human user.
Q8. I like using the system for decision making.
Table 3: XAI survey from [27]. With the removal of Q2, this scale has acceptable reliability (Cronbach's α).
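A small sketch of how such a scale can be scored is shown below (the column names and file are hypothetical): compute Cronbach's α with the standard formula and aggregate the remaining items into an overall XAI score after dropping Q2. Note that a negatively worded item such as Q6 would typically be reverse-coded before aggregation; the paper does not state this, so it is flagged here as an assumption.

```python
# Hypothetical sketch of the reliability check and scale aggregation.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

responses = pd.read_csv("xai_survey.csv")       # rows: AutoDS participants; columns: Q1..Q8 (1-5)
responses["Q6"] = 6 - responses["Q6"]           # assumption: reverse-code the negatively worded item
kept = responses.drop(columns=["Q2"])           # Q2 removed, as the factor analysis suggested
print("Cronbach's alpha:", cronbach_alpha(kept))
print("mean overall XAI score:", kept.mean(axis=1).mean())
```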
Previous literature [27] suggests using the XAI scale as an overall score. The overall XAI score was quite positive (3.6 SD=.41 out of 5) (AA5). Participants found AutoDS efficient and predictable, and liked it for decision making.

"It's fast and it gives you visualisations to compare the models, this saves me a lot of time" (P8, M, AutoDS)

However, participants wondered about the reliability of AutoDS, trying to understand the rationale behind AutoDS's decisions.

"I'm not sure about the features it created for me" (P5, M, AutoDS); "It selected in one case random forest and in another logistic regression, but why? Can I choose the classifier?" (P6, F, AutoDS)

To gauge how participants felt about AutoDS after having used it, we measured their attitudes toward AutoDS twice, once at the beginning of the study (A2, A3) and once at the end (AA2, AA3). Participants' opinions generally shifted in a positive direction after experiencing AutoDS. Participants had significantly higher levels of trust in AutoDS after using it (3.5 SD=.74) than before (2.7 SD=1.1), p = .03. This is consistent with the finding that all 15 participants reported wanting to adopt AutoDS in their day-to-day model building work after the experiment. As to whether our participants felt that AutoDS would one day replace data scientists, there was a small shift in opinion toward feeling that it would, but it still falls in the negative direction, and this difference was not significant (A3) (pre-study 2.4 SD=.83, post-study 2.9 SD=1.2, p = n.s.).

Motivated by the literature, we began this investigation of how data scientists use an AutoDS system in DS tasks. Through our analyses, we found that participants working with AutoDS not only could create more models and do so faster, but, more importantly, these models had higher accuracy and fewer human errors, compared to participants working in a Jupyter Notebook coding environment. With AutoDS's help, data scientists could afford to spend more time understanding the results and inspecting the code generated by AutoDS, whereas in the baseline condition participants had to spend much of their effort and time searching for library documentation and writing code from scratch. However, although participants acknowledged all these benefits of using AutoDS, they still trusted their manually crafted models more than the AutoDS-generated models. These results suggest design improvements for AutoDS technology and inform the future of human-centered automated data science work.
Our results suggest that the AutoDS prototype's code generation feature, which automatically translates a resulting model into human-readable Python code, was very welcome to participants. They could inspect the code to see the decisions made by AutoDS, they could rationalize some of those decisions, and they could even change the code and see whether that changed the result as expected. Furthermore, we observed a couple of participants who kept working on the code generated by AutoDS to achieve a better model; this use case again supports our claim that data scientists and AutoDS together can deliver better results. Beyond reusing the entire code, data scientists could also easily migrate or replicate a segment of the generated code (perhaps a transformation function, some hyperparameters, or other decisions made by AutoDS) into a different task and context. Thus, we highly recommend that AutoDS systems provide such a code-generation feature.

We also learned that participants had trouble trusting the AutoDS system: they had more confidence in their worse-performing, manually created models than in the higher-quality AutoDS-generated models. Thus, the explainable and trustworthy AutoDS research agenda should be prioritized. The fact that people loved the code-generation feature can also be attributed to the need for a transparent and trusted AutoDS. Participants wanted to understand not only what the decisions were (code generation satisfied that), but also why AutoDS made those decisions. The generated code for the resulting model alone was not enough. We need to design AutoDS systems that provide better explanations for why certain decisions were made. For example, when showing users the top-performing algorithm and the optimized set of hyperparameter values, we could also reveal information about all the other candidates considered, and why AutoDS eliminated those options. Another example is that, for new features engineered by AutoDS, users should have a better view of how those features are generated and chosen during the feature engineering process, because users may have domain knowledge that can prevent nonsensical transformations (e.g., the absolute value of gender). We are glad to see some researchers recently moving in this direction [13]. We should also caution readers that we are not arguing for showing users all the information AutoDS has. Full transparency causes information overload and is no better than no transparency. This links back to a well-established CSCW theory, "social translucence" [15], where researchers argue for showing information at the right moment and at the right level of detail; we should learn from our past.
Our results showed that some participants explored a wide range of AutoDS results, while others focused on improving the code of one particular model (BN6 and BA7). The more successful participants were the ones who tried a wider range of approaches rather than focusing all of their attention on just one model. This finding echoes wisdom from the artistic domain, where the most creative works emerge when favoring quantity over quality [4]. Nor is this an isolated link between data science work and practices in the art and design domain: as [48] argued, data science workers' work is a new type of crafting, and in the AutoDS context, users craft the AutoDS-generated results as a material.

This collaborative practice that emerged between the data scientist and AutoDS appears to us as a demythification of AutoDS replacing the data scientist's job, and instead offers a perspective on how the relationship between these parties is maturing towards a collaborative partnership. We argue that the
Human-AI Collaboration [65] paradigm is also emerging in the data science domain, as in many other domains (e.g., education [69], healthcare [64], human resources [57]). In the data science context, human data scientists can work on understanding domain knowledge and context, while AutoDS systems work on the computation tasks. Then human data scientists can work on result interpretation tasks, while AutoDS works on deployment tasks.
They hand over information to each other and share responsibility. Together, AutoDS plays a partner role in the future of human-centered automated data science practice.

We believe AutoDS technologies will become available to more and more data scientists, affording them a chance to craft models together with AI. We agree with the majority of our participants that AutoDS will not replace data scientists' jobs. Instead, a human-centered AutoDS will play an increasingly important assistant role to human data scientists in a new paradigm of data science work. In this new paradigm, data scientists' roles will shift, as suggested by our findings: they can spend less time on the tedious model training and hyperparameter tuning tasks, and more time understanding the data, communicating with domain experts, and selecting the model that best fits the domain and context. In addition, they will need to spend a larger proportion of their time rationalizing the AutoDS decisions and results, and translating those decisions to present to other human stakeholders.

In this new paradigm, not only professional data scientists but also end customers and domain experts can build ML solutions to answer their own simple questions with their data. Our system provides a GUI and automatically exports human-readable notebooks while automatically building models. The future in which various human stakeholders can work together with AutoDS is promising, but there is work to be done to achieve that vision. This work is just the first step.
We acknowledge our work's limitations as follows. This is a lab experiment study, and thus it has the common limitations of any lab experiment [17]. For example, in the Notebook Condition, we provided participants with a code skeleton, which may have biased their behaviors in that condition. Another example is that the task and dataset were tailored to be simple and specific to emphasize model building; thus, the reported behaviors of participants interacting with AutoDS may differ from those when they actually adopt AutoDS in their daily DS projects. In an actual data science project, data acquisition and curation are more significant problems for data scientists during the ML lifecycle. The AutoDS system presented in this paper is just the starting point of a long-term project. We have another paper in submission that discusses human-AI collaboration in the automated data preparation and curation process. In that work, we used reinforcement learning instead of AutoML, because it is better suited for the task.

Another limitation is that all the participants are professional data scientists from a technology company, so there may be some selection bias in their backgrounds. For example, on average, they rated themselves at a moderate level of data science expertise. This participant population may also limit the generalizability of our findings. We call out these external validity limitations for readers who plan to apply the findings to other contexts.
Our work presents an AutoDS prototype system and a between-subject user experiment in which 30 professional data scientists used either the AutoDS system or Python in notebooks to complete a data science model building task. Grounded in the results, we present design implications for building AutoDS systems that better incorporate into human workflows. We also discuss future research directions for human-centered automated data science.
ACKNOWLEDGMENTS
We thank all the participants who contributed their time to participate in our experiment. We want to specifically thank Arunima Chaudhary, Abel Valente, and Carolina Spina for implementing the presented AutoDS system. We also want to thank Alex Swain, Dillon Eversman, Voranouth Supadulya, and Daniel Karl for their inspiring UX design sketches, and Gregory Bramble, Theodoros Salonidis, Peter Kirchner, and Alex Gray for their input on the algorithm implementation.
REFERENCES