Themisto: Towards Automated Documentation Generation in Computational Notebooks
April Yi Wang, Dakuo Wang, Jaimie Drozdal, Michael Muller, Soya Park, Justin D. Weisz, Xuye Liu, Lingfei Wu, Casey Dugan
APRIL YI WANG∗, University of Michigan, USA
DAKUO WANG∗, IBM Research, USA
JAIMIE DROZDAL, Rensselaer Polytechnic Institute, USA
MICHAEL MULLER, IBM Research, USA
SOYA PARK, MIT CSAIL, USA
JUSTIN D. WEISZ, IBM Research, USA
XUYE LIU, Rensselaer Polytechnic Institute, USA
LINGFEI WU, IBM Research, USA
CASEY DUGAN, IBM Research, USA
Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations, which leads to challenges in sharing their notebooks with others and future selves. Inspired by human documentation practices from analyzing 80 highly-voted Kaggle notebooks, we design and implement Themisto, an automated documentation generation system, to explore the Human-AI Collaboration opportunity in the code documentation scenario. Themisto facilitates the creation of different types of documentation via three approaches: a deep-learning-based approach to generate documentation for source code (fully automated), a query-based approach to retrieve the online API documentation for source code (fully automated), and a user prompt approach to motivate users to write more documentation (semi-automated). We evaluated Themisto in a within-subjects experiment with 24 data science practitioners, and found that automated documentation generation techniques reduced the time for writing documentation, reminded participants to document code they would have ignored, and improved participants' satisfaction with their computational notebook.

CCS Concepts: • Human-centered computing → Interactive systems and tools; Empirical studies in HCI; • Computing methodologies → Natural language generation; • Software and its engineering → Documentation.

Additional Key Words and Phrases: code summarization, computational notebooks, code documentation

∗Both authors contributed equally to this research.

© 2021 Association for Computing Machinery.
ACM Reference Format:
April Yi Wang, Dakuo Wang, Jaimie Drozdal, Michael Muller, Soya Park, Justin D. Weisz, Xuye Liu, Lingfei Wu, and Casey Dugan. 2021. Themisto: Towards Automated Documentation Generation in Computational Notebooks. 1, 1 (February 2021), 29 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Documenting the story behind code and results is critical for data scientists to work effectively with others, as well as with their future selves [25]. The story, code, and computational results together construct a computational narrative. Unfortunately, data scientists often write messy and drafty analysis code in computational notebooks as they quickly test hypotheses and experiment with alternatives. It is a tedious process for data scientists to then manually document and refactor the raw notebook into a more readable computational narrative, and many people neglect to do so [49].

Many efforts have sought to address the tension between exploration and explanation in computational notebooks. For example, researchers have explored the use of code gathering techniques to help data scientists organize cluttered and inconsistent notebooks [14], as well as algorithmic and visualization approaches to help data scientists forage past analysis choices [24]. But these efforts focus on cleaning and organizing existing notebook content, instead of supporting the creation of new content. Another work developed a chat feature that enables data scientists to have simultaneous discussions while coding in a notebook [57], and linked their chat messages as documentation to relevant notebook elements. However, these chat messages are too fragmented and colloquial to be used as documentation; besides, in real practice it is not common that multiple data scientists work on the same copy of a notebook and actively message each other.

We began our project by asking, "What makes a well-documented notebook?" To address this question, we first conducted an in-depth analysis of the human documentation practices in notebooks. Publicly available notebooks are often not well documented [49], thus we turned to the publicly-shared and highly-voted notebooks on the Kaggle competition website and used 80 of them as an approximation. Based on our analysis, these 80 notebooks are indeed much better documented in comparison to the corpus in previous literature [49]. Thus, in this paper, we refer to them as "well-documented" notebooks.

Through an iterative in-depth coding process, we also abstracted a taxonomy for the content of documentation with nine categories (e.g., Reason, Process, Result), which reflects the thought processes and decisions made by the notebook owner. These findings, together with insights from related work, motivate us to consider AI automation as a potential solution to support the human process of crafting documentation. We re-imagine that the documentation task can be conducted in a Human-AI Collaboration fashion, and are interested in how this joint effort compares to the solo effort of a human alone.

We propose Themisto, an automated documentation generation system that integrates into the Jupyter Notebook environment. To support the diverse types of documentation content, Themisto incorporates three distinct approaches: a deep-learning-based approach to automatically generate new documentation for source code (fully automated); a query-based approach to retrieve existing documentation from online Application Programming Interface (API) websites for third-party packages and libraries (fully automated); and a prompt-based approach that gives users the start of a sentence and encourages them to complete the sentence that serves as documentation (semi-automated).

We evaluated Themisto in a within-subjects experiment with 24 data science practitioners.
We found that Themisto reduced the time for data scientists to create documentation, reminded them to document code they would have ignored,
and improved their satisfaction with their computational notebooks. Meanwhile, the quality of the documentation produced with Themisto is about the same as what data scientists produced on their own.

Our paper provides a three-fold contribution to the HCI and data science practitioner communities:
• providing an empirical understanding of best practices for how humans document a notebook, through an analysis of highly-rated Kaggle notebooks,
• demonstrating the design of an AI system that can collaborate with human data scientists to create high-quality computational narratives,
• reporting empirical evidence that Themisto can collaborate with data scientists to generate high-quality and highly satisfying computational notebooks in much less time.

Our work builds on top of both the Human-Computer Interaction (HCI) and Machine Learning (ML) fields. Thus, our literature review briefly summarizes the work of both, with a focus on the following three topics: computational notebook management, code documentation supporting systems, and neural-network-based code summarization.

Fig. 1. Computational notebooks allow data scientists to create (A) markdown cells and (B) code cells, and view (C) code output in the same environment. Together, the variety of media, including text explanations, graphs, forms, interactive visualizations, code segments and their outputs, weaves into computational narratives.
Computational notebooks allow data scientists to weave together a variety of media, including text explanations, graphs, forms, interactive visualizations, code segments, and their outputs, into computational narratives (as shown in Figure 1). These computational narratives enable literate programming [27] and allow data scientists to effectively create,
communicate, and collaborate on their analysis work. The data scientist community has widely adopted notebook systems (e.g., Jupyter Notebook [22] and Jupyter Lab [21]) as their main working environment [43].

However, it is not easy for data scientists to create a computational narrative while they are coding for rapid exploration. Data scientists often need to explore diverse sets of hypotheses and theories [31, 45]. Active exploration of alternatives increases the workload for data scientists to track the history of their experimentation [26]. Thus, documenting those alternatives imposes more workload on data scientists, sometimes interferes with their cognitive process of coding, and is hardly rewarding because many of those alternatives will be discarded in later versions.

Because creating and maintaining a clean computational narrative is often an expensive and tedious process, many computational notebooks shared within open communities are not appropriately documented. For example, Rule et al. examined 1 million open-source computational notebooks from Github and found that one in four lacked any sort of written documentation [49]. In addition, they analyzed a sample of 221 academic computational notebooks, which they considered to be higher-quality notebooks, and found that academic computational notebooks contained text cells for introduction, describing analytical steps, explaining the reasoning, and discussing results.

Poor documentation hinders the readability and reusability of the notebooks that are shared with other collaborators or even with one's future self [6]. Recently, various groups of researchers have developed a wide range of tools to help data scientists manage their "messy" computational notebooks. Notably, Kery et al. designed Verdant [24], a lightweight local versioning plugin for Jupyter Lab that uses algorithmic and visualization techniques for data science workers to better forage their past analysis choices; Head et al. used code gathering tools to help data scientists trace back to the computational code from an end result [14]; Wenskovitch et al. designed an interactive tool that produces a visual summary of the structure of a computational notebook [62]; Wang et al. proposed capturing the contextual connections between notebook content and discussion messages to help data science teams reflect on their decision making process [57]; Lau et al. summarized the design space of computational notebooks, which covers aspects like versioning, collaboration, order, and liveness [29].

However, despite the wide variety of approaches to helping data scientists manage their notebooks, none of these tools directly aids data scientists in creating new, rich, descriptive content to document their computational notebooks and to improve the quality of the computational narrative. This research gap motivates us to design and build a system that supports data scientists in better documenting their code and producing higher-quality computational narratives.

But what makes up a good computational narrative? Despite the portrait of not-so-good notebooks on Github [49], we need further understanding of, and role models for, well-documented computational narratives. Thus, we decided to first conduct an in-depth analysis of some highly-voted notebooks from Kaggle competitions. Kaggle provides a platform where organizations post datasets as challenges, and many data scientists submit their notebooks as solutions to a challenge.
If a solution has the highest accuracy, it wins the competition. But those winning solutions are often not the most voted ones, as community members vote based on the readability and completeness of the computational narrative.

Documentation plays an important role in software programming. Programmers write comments in their source code to make the code easier for both themselves and others to understand [39]. Writing clear and comprehensive documentation is critical to software development and maintenance [7, 23, 35, 47, 54]. However, writing documentation itself is a time-consuming task, and that is why documentation practices in open source communities are widely perceived to be of low quality, due in part to low levels of intrinsic enjoyment for doing documentation work [13].

To save time in creating documentation, template-based approaches are often used to help developers annotate their source code. For example, tools like JavaDoc (https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javadoc.html) and JSDoc (https://devdocs.io/jsdoc/) allow programmers to annotate their code with tags (e.g., @param, @return) and then automatically generate documentation using these tags. This approach helps programmers create documentation for others and works especially well for documenting APIs, where method signatures and variable types are important pieces of information and can easily be documented from tags. However, such methods may not work in the rapid, experimental nature of data science work, because data scientists may be particularly reluctant to create and maintain high-quality documentation of their work. Furthermore, these methods cannot capture other aspects of documentation important in data science, such as how a data set was constructed, the intent behind an analysis, or a description of why an experiment was successful or not.

Recently, some researchers have put forth proposals for better documenting the specific artifacts involved in a data science workflow, i.e., the data set and the machine learning model [3, 12, 15, 37]. Notably, Gebru et al. [12] and Holland et al. [15] proposed both qualitative and quantitative guidelines for documenting a dataset, so that the dataset creators and maintainers can follow these guidelines to document useful information about the data. Similarly, Mitchell et al. [37] and Arnold et al. [3] explored using formulas to document the machine learning model artifacts and sharing such formulas with others. These approaches are in line with what is called provenance, which refers to tracking what has been done with code and data over time, typically to aid reproducibility of results, using applications such as noWorkflow and YesWorkflow [44], ReproduceMeGit [52], and Provbook [51]. However, this line of work focuses more on the dataset and the model artifact in the final product of a data science project, and on supporting data scientists in creating a "factsheet" for these artifacts for non-technical consumers. Instead, we want to support data scientists in creating documentation during the process of building models and data science products, and such documentation, together with the code as a computational narrative, is primarily for other technical users to understand and to reuse.

In addition to the various ways of generating new documentation for code, there is another research line that focuses on improving the usability of documentation, as novice programmers may find it difficult to read and use API documentation [16]. Oney et al.
proposed linking interactive documentation and example code in an editor to help novice programmers better understand the external documentation and write code [38]. We believe this approach of linking code with external documentation is a promising way to help data scientists create more usable documentation, and we also implement this retrieval-based approach in our system.

Automatic code summarization is a rapidly-expanding research topic in the Natural Language Processing (NLP) and ML communities [2, 9, 18, 19, 30]. The automatic code summarization task can be considered a translation task, which takes a code snippet as the input sequence and "translates" it into a natural language description of the code as the output sequence.

Early work primarily used predefined templates and heuristics to produce code summaries (e.g., [9, 56]). Recent studies have taken advantage of modern deep neural network architectures to generate summaries for code (e.g., [2, 18, 20]). Motivated by the language translation task (e.g., English to French), most of these learning-based
approaches are based on the Neural Machine Translation (NMT) model architecture [34]. This architecture breaks code into a sequence of input tokens and produces the summarization text as a sequence of output tokens.

However, this sequence-to-sequence approach does not work well in practice because source code is not just a stream of tokens. There is additional semantic information that is lost when processing source code in this way. LeClair et al. proposed improving code summarization through the use of Graph Neural Networks (GNNs) [30]. The GNN model can take in both the code sequence and its Abstract Syntax Tree (AST) structure (refer to Fig 5 for an example) as input to generate summary sentences as output. Their approach achieved better accuracy than the baseline algorithms.

Our work explores neural-network-based automatic code summarization techniques to support document writing in computational notebooks. To our knowledge, there has been little discussion in the HCI community on leveraging automatic code summarization techniques to improve documentation. Furthermore, we suspect that the automation approach alone may not work well for the documentation creation task, as data science is a highly interdisciplinary field that requires various human expertise to explain and interpret. Inspired by prior studies that implement Artificial Intelligence (AI) systems to work together with humans [32, 59], we believe the system will work better if it has both the automated documentation capability and the capability that allows users to directly manipulate the documentation. However, what types of documentation may be better suited for the AI to do, and what work should the system leave to human data scientists? This is a design question that requires further exploration of the best practices for creating notebooks. Thus, we start this project with a formative study to fill this research gap.
In order to build a useful system that can support data scientists in creating documentation and improving their computational narratives' quality, we first need to explore and understand the characteristics of good documentation in high-quality notebooks.
What does a well-documented computational narrative look like?
We identify "well-documented" computational narratives using ratings from a broader data scientist community (Kaggle), and analyze their characteristics, specifically around the documentation. We consider the community voting number a good indicator of a computational notebook's quality for our research goal. Based on this premise, we then conduct a formative study to analyze the characteristics of a set of the most voted computational narratives, and explore how data scientists create documentation for these notebooks.
We collected notebooks from two popular Kaggle competitions — House Price Prediction and Titanic Survival Prediction. We chose these two competitions because they are the most popular competitions (5280 notebooks submitted for House Price and 6300 notebooks submitted for Titanic Survival) and because many data science courses use these two competitions as tutorials for beginners [4, 10].

We collected the top 1% of the submitted notebooks from each competition based on their voting numbers, which resulted in 53 for House Price and 63 for Titanic Survival. We then filtered out the notebooks that were not written in English and the ones that are not relevant to the particular challenge (e.g., a computational notebook written as a tutorial on how to save memory can win lots of votes, but it is not a solution to the challenge), which left 80 valid notebooks for analysis (39 for House Price and 41 for Titanic Survival).

Fig. 2. We replicated the notebook-level descriptive analysis by Rule et al. [49] on the 80 well-documented notebooks on Kaggle. The left side represents the descriptive visualization of the 80 well-documented computational notebooks from Kaggle (noted as Sample A) and the right side represents the descriptive visualization of the 1 million computational notebooks on Github (noted as Sample B). The highly-voted notebooks on Kaggle are better documented compared to the Github notebooks.
Five members of the research team conducted an iterative open coding practice to analyze the collected notebooks. Differing from [49], where the qualitative coding stopped at the notebook level, our analysis goes down to the cell granularity: we coded each cell's purposes and types of content, and which step (stage) of the data science lifecycle the cell belongs to (e.g., data cleaning or model training [58, 65]). Our analysis covered 4427 code cells and 3606 markdown cells within the 80 notebooks. Each notebook took around 1 hour to code, as we coded the notebook at the cell level.

Each coder independently analyzed the same six notebooks to develop a codebook. After discussing and refining the codebook, they again went back to recode the six notebooks and achieved pair-wise inter-rater reliability ranging from 0.78 to 0.95 (Cohen's 𝜅). After this step, the five coders divided and coded the remaining notebooks.

We found that these 80 well-documented computational notebooks all contain rich documentation. In total, we identified nine categories for the content of the markdown cells. In addition, we found the markdown cells covered four stages and 13 tasks of the data science workflow [58]. Note that a markdown cell may belong to multiple categories.
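For illustration, pair-wise agreement of the kind reported above can be computed with scikit-learn; the following is a minimal sketch with hypothetical cell labels from two coders, not the analysis script used in the study:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical category labels that two coders assigned to the same six cells
coder_a = ["Process", "Headline", "Result", "Process", "Reason", "Process"]
coder_b = ["Process", "Headline", "Result", "Process", "Process", "Process"]

# Pair-wise inter-rater reliability (Cohen's kappa) for this pair of coders
kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")
```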
We found that, on average, each notebook contains 55.3 code cells and 45.1 markdown cells. We replicated the notebook-level descriptive analysis that Rule et al. used to analyze 1 million computational notebooks on Github [49]. As shown in Figure 2, the left side represents the descriptive visualization of the 80 well-documented computational notebooks from Kaggle (noted as Sample A) and the right side represents the descriptive visualization of the 1 million computational notebooks on Github (noted as Sample B). We found that Sample A has more total cells per notebook (Median = 95) than Sample B (Median = 18). Sample A has a roughly equal ratio of markdown cells to code cells per notebook, while Sample B is unbalanced, with the majority of cells being code cells. Notably, Sample A has more total words in markdown cells (Median = 1728) than Sample B (Median = 218). This result indicates that the 80 well-documented computational notebooks are better documented than general Github notebooks.

Table 1. We identified 9 categories based on the purpose of markdown cells. Note that a markdown cell may belong to multiple categories of content or none of the categories.

Process, 2115 (58.65%): The markdown cell describes what the following code cell is doing. This always appears before the relevant code cell. Example: "Transforming Feature X to a new binary variable"
Headline, 1167 (32.36%): The markdown cell contains a headline in markdown syntax. The cell is used for navigation purposes or marking the structure of the notebook. It may be relevant to a nearby code cell.
Result, 692 (19.19%): The markdown cell explains the output. This type always appears after the relevant code cell. Example: "It turns out there is a long tail of outlying properties..."
Education, 414 (11.48%): The markdown cell provides rich content as an educational tutorial, but may not be relevant to a specific code cell. Example: "Multicollinearity increases the standard errors of the coefficients."
Reason, 227 (6.30%): The markdown cell explains the reasons why certain functions are used or why a task is performed. This may appear before or after the relevant code cell. Example: "We do this manually, because ML models won't be able to reliably tell the differences."
Todo, 202 (5.60%): The markdown cell describes a list of actions for future implementations. This normally is not relevant to a specific code cell. Example: "1. Apply models 2. Get cross validation scores 3. Calculate the mean"
Reference, 200 (5.55%): The markdown cell contains an external reference. This is also relevant to the adjacent code cell. Example: "Gradient Boosting Regression. Refer [here](https://...)"
Meta-Information, 141 (3.91%): The markdown cell contains meta-information such as a project overview, the author's information, and a link to the data sources. This often is not relevant to a specific code cell. Example: "The purpose of this notebook is to build a model with Tensorflow."
Summary, 51 (1.41%): The markdown cell summarizes what has been done so far for a section or a series of steps. This often is not relevant to a specific code cell. Example: "**In summary** By EDA we found a strong impact of features like Age, Embarked.."
As shown in Table 1, our analysis revealed that markdown cells are mostly used to describe what the adjacent code cell is doing (Process, 58.65%). Second to the Process category, 32.36% of markdown cells are used to specify a headline for organizing the notebook into separate functional sections and for navigation purposes (Headline).

Markdown cells can also be used to explain beyond the adjacent code cells. We found that many markdown cells are created to describe the outputs from code execution (Result, 19.19%), to explain results or critical decisions (Reason, 6.30%), to provide an outline for the readers to know what they are going to do in a list of todo actions (Todo, 5.60%), and/or to recap what has been done so far (Summary, 1.41%).

We observed that 11.48% of markdown cells explain what a general data science concept means or how a function works (Education), while 5.54% of markdown cells are connected with external references for readers to further explore the topics (Reference). We believe these are extra efforts that the notebook owners dedicated to attracting a broader audience, especially beginners in the Kaggle community. In addition, some authors want to leave their own signature, and so they spend space at the beginning of the notebooks to debrief the project, to add the author's information, or even to add their mottos (Meta-Information, 3.91%).

Table 2. We coded each markdown cell according to which data science stage (or task) it belongs to. We identified 4 stages with 13 tasks out of the data science lifecycle [58]. Note that a markdown cell may belong to multiple stages or none of the stages.

Environment Configuration, total 162 (4.49%): Library Loading 33 (0.92%); Data Loading 129 (3.58%)
Data Preparation and Exploration, total 1336 (37.05%): Data Preparation 91 (2.52%); Exploratory Data Analysis 960 (26.62%); Data Cleaning 285 (7.90%)
Feature Engineering and Selection, total 375 (10.40%): Feature Engineering 120 (3.32%); Feature Transformation 178 (4.94%); Feature Selection 77 (2.14%)
Model Building and Selection, total 994 (27.57%): Model Building 247 (6.85%); Data Sub-Sampling and Train-Test Splitting 61 (1.69%); Model Training 377 (10.45%); Model Parameter Tuning 81 (2.25%); Model Validation and Assembling 288 (6.32%)
We coded markdown cells based on where they belong in the data science workflow [60]. We identified four stages and 13 tasks. The four stages include environment configuration (4.50%), data preparation and exploration (37.05%), feature engineering and selection (10.40%), and model building and selection (27.57%). At the finer-grained task level, notebook authors create more markdown cells for documenting exploratory data analysis tasks (26.62%) and model training tasks (10.45%). The rest of the markdown cells are evenly distributed across the other tasks.
In summary, our analysis of markdown cells in well-documented notebooks suggests that data scientists document various types of content in a notebook, and the distribution of these markdown cells generally follows the order of the data science lifecycle, starting with data cleaning and ending with model building and selection. Based on these findings, we synthesize the following actionable design considerations:

• The system should support more than one type of documentation generation. Data scientists benefit from documenting not only the behavior of the code, but also from interpreting the output and explaining rationales. Thus, a good system should be flexible enough to support more than one type of documentation generation.

• Some types of documentation are highly related to the adjacent code cell. We found that at least the Process, Result, Reason, and Reference types of documentation are highly related to the adjacent code cell. Automatically generating interpretations of results or rationales for a decision may be hard, as both involve deep human expertise. But, with the latest neural network algorithms, we believe we can build an automation system to generate the Process type of documentation, and we can also retrieve References for a given code cell.

• Certain types of documentation are irrelevant to the code. Various types of documentation do not have a relevant code piece upon which the automation algorithm can be trained. Together with the Reason and Result types, the system should also provide a function so that the human user can easily switch to a manual creation mode for these types.

• Depending on its type, documentation may go at the top or the bottom of the related code cell. This design insight is particularly important for the Process, Result, and Reason types of documentation. It may be less preferable to put Result documentation before the code cell, where the result is yet to be rendered. The system should be flexible to render documentation at different locations relative to the code cell.

• External resources such as Uniform Resource Locators (URLs) and the official API descriptions may also be useful. Some types of documentation, such as Education and Reference, are not easy to generate with NN-based models, but they are easy to retrieve from the Internet. So the system should incorporate the capability to fetch relevant web content as candidate documentation.

• There is an ordinality in markdown cells that is aligned with the data science project's lifecycle. The system should consider that Library Loading types of cells are often in the beginning section of the notebook, and that the Model Training type of content may be more likely to appear near the end of the notebook. Though we did not take this design consideration into account in our system prototype, it will be our future work.

• The notebook would benefit from documentation with a problem overview at the beginning and a summary at the end. We addressed this design implication not in the system design, but in our evaluation study design. For the two barebone notebooks we used in the experiment, we always provide a problem overview as a markdown cell at the top of the notebook.
Based on the findings from the formative study and design insights from related work, we design and implement Themisto, an automatic documentation generation system that supports data scientists in writing better-documented computational narratives. In this section, we present the system architecture, the user interface design, and the core technical capability of generating documentation.
The Themisto system has two components: the client-side User Interface (UI) is implemented as a Jupyter Notebook plugin using TypeScript code, and the server-side backend is implemented as a server using Python and Flask.

The client-side program is responsible for rendering the user interface, and also for monitoring user actions on the notebook, such as edits in code cells. When the user's cursor is focused on a code cell, the UI sends the current code cell content to the server-side program through Hypertext Transfer Protocol (HTTP) requests.

The server-side program takes the code content and generates documentation using both the deep-learning-based approach and the query-based approach. For the deep-learning-based approach, the server-side program first tokenizes the code content and generates the AST. It then generates the prediction with the pre-trained model. For the query-based
approach, the server-side program matches the curated API calls with the code snippets and returns the pre-collected descriptions. For the prompt-based approach, the server-side program sends different prompts (e.g., for interpreting a result or for explaining a reason) based on the output type of the code cell.

Fig. 3. The Themisto user interface is implemented as a Jupyter Notebook plugin: (A) When the recommended documentation is ready, a lightbulb icon shows up to the left of the currently-focused code cell. (B–D) show the three options in the dropdown menu generated by Themisto: (B) a documentation candidate generated for the code with a deep-learning model, (C) a documentation candidate retrieved from the online API documentation for the source code, and (D) a prompt message that nudges users to write documentation on a given topic.
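As a rough illustration of this client-server exchange, the following is a minimal sketch of such a Flask backend, assuming a hypothetical /documentation endpoint and placeholder generation functions; the paper does not specify Themisto's actual endpoint names or model-serving code:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_with_model(code: str) -> str:
    # Placeholder for the deep-learning-based approach: tokenize the code,
    # build its AST, and run the pre-trained GNN model on both.
    return "this code cell trains the model ..."

def retrieve_api_docs(code: str) -> str:
    # Placeholder for the query-based approach: match curated API names
    # against the code and return their pre-collected descriptions.
    return "DataFrame.fillna: Fill NA/NaN values using the specified method."

def select_prompt(has_output: bool) -> str:
    # Placeholder for the prompt-based approach, chosen by the cell's output type.
    return "The table/figure shows ..." if has_output else "This code cell is for ..."

@app.route("/documentation", methods=["POST"])  # hypothetical endpoint name
def documentation():
    payload = request.get_json()
    code = payload.get("code", "")
    has_output = payload.get("has_output", False)
    return jsonify({
        "deep_learning": generate_with_model(code),
        "query_based": retrieve_api_docs(code),
        "prompt": select_prompt(has_output),
    })

if __name__ == "__main__":
    app.run(port=5000)
```

In such a design, the client-side plugin would POST the focused cell's source (and whether it produced output) to this endpoint and render the returned candidates in the dropdown menu.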
Figure 3 shows the user interface of Themisto as a Jupyter Notebook plugin. Each time the user changes her focus to a code cell, as she may be inspecting or working on the cell, the plugin is triggered. The plugin sends the focused code cell's content to the backend. Using this content, the backend generates a code summarization using the model and retrieves a piece of documentation from the API webpage. When the documentation generation process is done, the generated documentation is sent from the server side to the frontend, and a light bulb icon appears next to the code cell, indicating that there are recommended markdown cells for the selected code cell (as shown in Figure 3.A).

When a user clicks on the light bulb icon, she can see three options rendered in the dropdown menu: (1) a deep-learning-based approach to generate documentation for source code (Figure 3.B); (2) a query-based approach to retrieve the online API documentation for source code (Figure 3.C); and (3) a user prompt approach to nudge users to write more documentation (Figure 3.D). If the user likes one of these three candidates, she can simply click on it, and the selected documentation candidate will be inserted above the code cell (if it is the Process, Reference, or Reason type) or below it (if it is the Result type).
In this subsection, we describe the rationale and implementation details of the three different approaches for documentation generation (Figure 4):
Fig. 4. An illustration of the three different approaches for documentation generation in Themisto.

• Our formative study suggests that the system should be able to generate multiple types of documentation (e.g., Process, Result, Education, Reason, and Reference).

• Some types of documentation can be directly derived from the code, thus the automated approaches can help. The Process type of documentation directly describes the coding process, and the existing ML literature suggests that the deep-learning-based approach is most suitable for generating it. The Reference type does not need a learning-based approach; it can be achieved with a traditional query-based approach, which locates and retrieves the most relevant online documentation as candidates.

• Some other types of documentation (e.g., Education, Result, and Reason) are not directly related to the code, thus the fully automated approaches are not capable of generating such content. We design the prompt-based approach for users to complete the generation process.
We trained a deep learning model using the Graph Neural Network architecture based on LeClair et al. [30]. These GNN models can take both the source code's structure (extracted as an AST) and the source code's content as input. Thus, this architecture outperforms the traditional sequence-to-sequence model architectures, which only take the source code's content as an input sequence, in source code summarization tasks for Python code (all the collected data science notebooks are in Python). We did not consider T5, BERT, or GPT-3 architectures, as these models can take minutes to make one inference (i.e., generate one summary) even with a cluster of GPUs (costing thousands of dollars per hour), whereas our GNN-based model can make an inference within a second with one GPU.
Fig. 5. A code summarization model for the deep-learning-based documentation generation approach via GNN. There are three steps of data pre-processing: (1) we first extract text-code pairs from existing notebooks; (2) we generate an AST from the code; (3) we tokenize each word and translate it into embeddings. And (4), the GNN model architecture.
In order to fine-tune the model, we constructed a training dataset for our particular context. We collected the top 10% of highly-voted notebooks from two popular Kaggle competitions (N = 1158). For each notebook, we first extracted code cells and the markdown cells adjacent above them as pairs of input and output (similar to the data collection approach in [1]). If there is an inline comment in the first line of the code cell, we replaced the output of the pair with the inline comment. In total, our dataset has 5,912 pairs of code and its corresponding documentation. Following the best practice of model training, we split the dataset into training, testing, and validation subsets with an 8:1:1 ratio.

Before feeding data into the training process, we have a three-step pre-processing stage, as illustrated in Figure 5. Step 1 removes the style decoration, formats, and special characters that are not in Python grammar (e.g., Notebook magics). We also generate an AST for the source code input in step 2 with the Python AST library (https://docs.python.org/3/library/ast.html). The AST result is equivalent to the source code but carries more contextual and relational information. In step 3, we tokenize the source code into a sequence of tokens with an input dictionary, and parse the AST nodes as a sequence of tokens with the same input dictionary. We parse the relationships between AST nodes as a matrix of edges. Finally, we tokenize the output documentation as a sequence of tokens with a separate output dictionary. After this process, all the tokens are transformed into arrays of word embeddings (vectors of real numbers). We use these data to train the network for 100 epochs, with a batch size of 30 and an early stopping patience of 15, on a cluster of two Tesla V100 GPUs. Out of all the epochs, we selected the model with the highest validation accuracy score.

To evaluate our model's performance against baseline models, we conducted both quantitative and qualitative evaluations, as suggested by [46]. For the automated quantitative evaluation, we use BLEU scores [40] as the model performance metric. BLEU scores are commonly used in source code summarization tasks; they evaluate the word similarity between the generated text and the ground truth text. We selected and trained the Code2Seq model [2] and the Graph2Seq model [63] with the same data split as baselines.

Fig. 6. Example output from the model. (A) The generated text describes the code well. (B) The generated text vaguely describes the code. (C) The generated text is poorly readable, but still captures the keywords of the descriptions.
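The following is a minimal sketch of the pair-extraction and AST pre-processing steps described above, assuming notebooks stored as standard .ipynb JSON files; the file name and the token handling are simplified for illustration and are not the paper's actual pipeline:

```python
import ast
import json

def extract_pairs(notebook_path):
    """Pair each code cell with the markdown cell immediately above it."""
    with open(notebook_path) as f:
        cells = json.load(f)["cells"]
    pairs = []
    for prev, cell in zip(cells, cells[1:]):
        if cell["cell_type"] == "code" and prev["cell_type"] == "markdown":
            code = "".join(cell["source"])
            doc = "".join(prev["source"])
            # If the code cell starts with an inline comment, use it as the target instead
            first_line = code.splitlines()[0].strip() if code else ""
            if first_line.startswith("#"):
                doc = first_line.lstrip("#").strip()
            pairs.append((code, doc))
    return pairs

def code_to_ast_tokens(code):
    """Flatten the AST into a sequence of node-type tokens (the edge matrix is omitted here)."""
    try:
        tree = ast.parse(code)  # assumes notebook magics were already stripped (step 1)
    except SyntaxError:
        return []
    return [type(node).__name__ for node in ast.walk(tree)]

pairs = extract_pairs("notebook.ipynb")  # hypothetical file name
for code, doc in pairs[:3]:
    print(code_to_ast_tokens(code), "->", doc)
```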
Our model achieves a BLEU-a score of 11.41, which outperforms the baseline models Code2Seq (BLEU-a = 9.61) and Graph2Seq (BLEU-a = 11.05). These scores also suggest that the data science documentation task is more difficult than the benchmark code summarization tasks in the software engineering field. For example, in data science, a notebook code cell can contain multiple code snippets and functions.

In addition to the automated quantitative evaluation, we also conducted a qualitative analysis of the generated documentation pieces. We found that despite the low word-to-word similarity score, the general quality of the content is reasonable and satisfying for building a prototype system. As an illustration, we provide three examples with both the input and the model-generated output, as shown in Figure 6. In the Appendix, we provide full code cells and model-generated outputs for the two experimental notebooks that we used in the user study.
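For reference, a corpus-level BLEU score of this kind can be computed with NLTK; the snippet below is a minimal sketch with hypothetical tokenized summaries, not the evaluation script behind the reported numbers:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical ground-truth (reference) and generated (candidate) summaries, tokenized into words
references = [[["drop", "the", "missing", "values", "from", "the", "dataframe"]]]
candidates = [["drop", "missing", "values", "in", "the", "dataframe"]]

smooth = SmoothingFunction().method1  # smoothing avoids zero scores on short sentences
score = corpus_bleu(references, candidates, smoothing_function=smooth)
print(f"BLEU: {score * 100:.2f}")
```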
Our formative study indicates that the well-documented Kaggle notebooks often contain descriptions of frequently-used data science code functions for educational purposes, and that data scientists sometimes directly paste in a link or a reference to the external API documentation for a code function. Thus, we implement a query-based approach that curates a list of APIs from commonly used data science packages, together with the short descriptions from external documentation sites. In our system, we only cover the Pandas (https://pandas.pydata.org/docs/reference/index.html), Numpy (https://numpy.org/doc/stable/reference/), and Scikit-learn (https://scikit-learn.org/stable/modules/classes.html) libraries as a starting point to explore this approach. We argue that it can be easily expanded to include other packages in the future. We collected both the API names and the short descriptions by building a crawling script with Python. When users trigger this query-based approach for a code cell, Themisto matches the API names with the code snippets and concatenates all the corresponding descriptions.

Lastly, the system provides a prompt-based approach that allows users to manually create the documentation, because our formative study found that a well-documented notebook not only documents the process of the code, but also interprets the output and explains rationales. These types of documentation are hard to generate with automated solutions. To achieve this, the prompt-based approach detects whether the code cell has a cell output or not: if the cell outputs a result, Themisto assumes that the user is more likely to add an interpretation of the output result, thus the corresponding prompt will be inserted below the code cell. Otherwise, the system assumes the user may want to insert a reason or some educational type of documentation, thus it changes its prompt message.
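A minimal sketch of the query-based matching and the prompt selection described above is shown below; the curated descriptions are paraphrased placeholders and the matching is simplified, whereas the real system crawls the official documentation sites for its entries:

```python
import re

# A tiny slice of a curated API-name-to-description lookup
# (descriptions paraphrased; real entries are crawled from the Pandas, Numpy, and Scikit-learn docs)
API_DESCRIPTIONS = {
    "read_csv": "pandas.read_csv: Read a comma-separated values (csv) file into DataFrame.",
    "fillna": "DataFrame.fillna: Fill NA/NaN values using the specified method.",
    "train_test_split": "train_test_split: Split arrays into random train and test subsets.",
}

def query_based_documentation(code: str) -> str:
    """Concatenate the descriptions of every curated API name that appears in the code cell."""
    hits = [desc for name, desc in API_DESCRIPTIONS.items() if re.search(rf"\b{name}\b", code)]
    return "\n".join(hits)

def prompt_for_cell(has_output: bool) -> str:
    """Prompt to interpret the result if the cell produced output; otherwise prompt for intent or rationale."""
    return "The table/figure shows ..." if has_output else "This code cell is for ..."

cell = "df = pd.read_csv('train.csv')\ndf = df.fillna(0)"
print(query_based_documentation(cell))
print(prompt_for_cell(has_output=False))
```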
To evaluate the usability of Themisto and its effectiveness in supporting data scientists to create documentation in notebooks, we conducted a within-subject controlled experiment with 24 data scientists. The task was to add documentation to a given notebook, and each participant was asked to finish two sessions, one with the Themisto support and one without its support. The evaluation aims to understand (1) how well Themisto can facilitate documenting notebooks and (2) how data scientists perceive the three approaches that Themisto uses for generating documentation.

Table 3. Demographics of participants

PID, Gender, Job Role, Work Experience in Data Science
P1, M, Expert Data Scientist, 5-10 years
P2, M, Application Developer, 3-5 years
P3, M, Novice Data Scientist, less than 3 years
P4, M, Novice Data Scientist, 0 years (just started learning data science)
P5, M, AI Operator or ML Engineer, 3-5 years
P6, M, Novice Data Scientist, less than 3 years
P7, M, Application Developer, 3-5 years
P8, M, Novice Data Scientist, less than 3 years
P9, F, Expert Data Scientist, 3-5 years
P10, M, Expert Data Scientist, 5-10 years
P11, M, Expert Data Scientist, more than 10 years
P12, F, Novice Data Scientist, 3-5 years
P13, F, Expert Data Scientist, 5-10 years
P14, M, Novice Data Scientist, 0 years (just started learning data science)
P15, M, Expert Data Scientist, more than 10 years
P16, M, AI Operator or ML Engineer, 3-5 years
P17, M, Subject Matter Expert, 3-5 years
P18, M, Expert Data Scientist, more than 10 years
P19, F, Application Developer, 3-5 years
P20, F, Expert Data Scientist, 3-5 years
P21, M, Novice Data Scientist, 3-5 years
P22, M, Novice Data Scientist, less than 3 years
P23, M, Application Developer, less than 3 years
P24, M, Novice Data Scientist, less than 3 years
We recruited 24 data science professionals as our evaluation participants (5 female, 19 male) from a multinational IT company. We used a snowball sampling approach to recruit participants, where we sent recruitment messages to friends and colleagues, various internal mailing lists, and Slack channels. We then asked participants to refer their friends and colleagues. Our recruitment criteria were that participants must have had experience in data science projects and be familiar with Python and the Jupyter Notebook environment. As shown in Table 3, participants reported diverse job role backgrounds, including expert data scientists (N = 8), novice data scientists (N = 9), AI Operators (AIOps) or ML engineers (N = 2), subject matter experts (N = 1), and application developers (N = 4).
We conducted a within-subject controlled experiment with 24 data scientist participants. Their task was to add documentation to a given draft notebook, which only has code and no documentation at all. The participants were told that they were adding documentation for the purpose of sharing the documented notebooks as tutorials for data science students who have just started learning data science. Each participant was asked to finish two sessions, one with the support of Themisto (the experiment condition) and one without it (the control condition).
To counterbalance the order effect, we randomized the order of the control condition and the experiment condition for each participant, so some participants encountered Themisto in their first session, and others experienced it in their second session. Each participant was given up to 12 minutes (720 seconds) to finish one session. (We conducted three pilot sessions, and all three pilot participants were able to finish a single task within 10 minutes, with or without the support of Themisto.) Before the experiment condition session, we gave the participant a 1-minute quick demo of the functionality of Themisto. All study sessions were conducted remotely via a teleconferencing tool. We asked the participants to share their screens and we video recorded the entire session with their permission. After finishing both sessions, we conducted a post-experiment semi-structured interview session to ask about their experience and feedback. We had a few pre-defined questions such as "How do you compare the experience of the documenting task with or without the support of Themisto?" or "Did you notice the multiple candidates in the dropdown menu? Which one do you like the most?" In addition, participants were encouraged to tell their stories and experiences outside these structured questions. The interview sections of the video recordings were transcribed into text.

We have three data sources: the observational notes and video recording for each session (N = 48), the final notebook artifact from each session (N = 48), and the post-task questionnaire and interview transcripts (N = 24).

Our first group of measurements comes from coding participants' behavioral data from the session recordings. In particular, we counted the task completion time (in seconds) for all sessions. Then, for the experiment condition only, we also counted the following: how many times a participant clicked on the light bulb icon to check for suggestions (code cells checked for suggestions); how many times a participant directly used the generated documentation (markdown cells created by Themisto); how many times a participant ignored the generated recommendations and manually created the documentation (markdown cells created by human); and how many times a markdown cell was co-created by the human and Themisto (markdown cells co-created by human and Themisto).

Secondly, to evaluate the quality of the final notebook artifact, we define our second group of measurements by counting the number of added markdown cells and the number of added words, as these two are indicators of the quantity of documentation and the effort each participant spent on a notebook. We also asked the participants to give an evaluation score to each of the two documented notebooks. We consider that number (-2 to 2) a self-evaluation of the notebook quality. In addition, we asked two experts to code each of the 48 notebooks with a 3-dimensional rubric (based on [11]) to evaluate the documentation's readability, accuracy, and completeness in a notebook. The two experts' independent ratings achieved agreement (r(48) = 0.52, Spearman's Rho test).

For the experiment session only, we also asked the participants to complete a post-experiment survey (5-point Likert scale, -2 as strongly disagree and 2 as strongly agree, Fig 7) to collect their feedback on the system's usability, helpfulness, accuracy, trust, satisfaction, and adoption propensity (based on [61]).

Lastly, for the interview transcripts, four researchers of this research project conducted an iterative open coding method to derive codes, themes, and representative quotes as the third group of data.
They each independently coded a subset of interview transcripts, and discussed the codes and themes together. After the discussion, they went back and reiterated the coding practice to apply the codes and themes to their assigned transcripts. Some examples of the identified themes are: pros and cons of Themisto; preference among the three documentation generation approaches; future adoption; and suggestions for design improvement. We will report the qualitative results as supporting material together with the quantitative results.

Table 4. Performance data in the two conditions (M: mean, SD: standard deviation): the task completion time (secs), participants' satisfaction with the final notebook (from -2 to 2), graded notebook quality, number of markdown cells, and number of words. In particular, participants spent less time completing the task in the experimental condition than in the control condition (p = .001); participants were more satisfied with the final notebook in the experimental condition than in the control condition (p = .04); the final notebooks in the experimental condition are less readable than those in the control condition (p = .001).

Number of Added Markdown Cells: Experiment M = 8.04, SD = 2.40; Control M = 7.79, SD = 1.91
Number of Added Words: Experiment M = 95.75, SD = 50.56; Control M = 100.92, SD = 53.27
**Task Completion Time (secs): Experiment M = 391.12, SD = 200.15; Control M = 494.75, SD = 184.28
*Satisfaction with the Final Notebook (-2 to 2): Experiment M = 0.96, SD = 0.69; Control M = 0.54, SD = 0.83
Expert Rating of the Final Notebook Quality: Accuracy (-2 to 2): Experiment M = 1.62, SD = 0.45; Control M = 1.71, SD = 0.41
**Expert Rating of the Final Notebook Quality: Readability (-2 to 2): Experiment M = 0.67, SD = 0.56; Control M = 1.04, SD = 0.66
Expert Rating of the Final Notebook Quality: Completeness (-2 to 2): Experiment M = 0.21, SD = 0.57; Control M = 0.48, SD = 0.67
In this section, we present the user study results on: how Themisto improved participants' performance on the task, how participants perceived the documentation generation methods in Themisto, and how participants described the practical applicability of Themisto.
Our experiment revealed that Themisto improved participants' performance on the task by reducing task completion time and improving their satisfaction with the final notebooks.

We performed a two-way repeated measures ANOVA to examine the effect of the two notebooks and the two conditions (with or without Themisto) on task completion time. As shown in Table 4, participants spent significantly less time (p < .001) completing the task using Themisto in the experiment condition (M(SD) = 391.12 (200.15)) than in the control condition (M(SD) = 494.75 (184.28)). In addition, there was not a statistically significant effect of notebooks on task completion time, nor a statistically significant interaction between the effects of notebooks and conditions on completion time, which indicates that the two notebooks used in the task were properly designed.
Table 5. Usage data of the plugin in the experimental condition (mean: x̄, standard deviation: 𝜎). The results indicate that participants used the plugin for recommended documentation on most code cells (x̄ = 86.11%). For markdown cells in the final notebooks, x̄ = 45.31% were directly adopted from the plugin's recommendation, while x̄ = 41.52% were modified from the plugin's recommendation and x̄ = 12.16% were created by participants from scratch.

Percentage of Code Cells Checked for Suggestions: x̄ = 86.11, 𝜎 = 18.89
Percentage of Markdown Cells Created by Themisto: x̄ = 45.31, 𝜎 = 33.16
Percentage of Markdown Cells Co-created by Human and Themisto: x̄ = 41.52, 𝜎 = 30.18
Percentage of Markdown Cells Created by Human: x̄ = 12.16, 𝜎 = 18.85

Through coding the video recordings of the experiment-condition sessions, we further examined how participants used Themisto while it was available: Did they check the recommendations it generated? Did they actually use those recommendations in the documentation they added to the notebooks?

We found that while Themisto was available, participants checked the recommended documentation for 80% of code cells by clicking on the light bulb icon to show the dropdown menu. Then, 45% of the created markdown cells were directly adopted from Themisto's recommendation, while 12% of the created markdown cells were manually crafted by the human alone.
The most interesting finding is that 41% of markdown cells were co-created by Themisto and human participants together: Themisto suggests a markdown cell, and the human takes it and modifies it. This result suggests that most participants used Themisto in the creation of documentation, and some of them formed a small collaboration between the human and the AI.

The post-experiment survey results supported our findings. Most participants believed it was easier to accomplish the task with Themisto's help (22 out of 24 rated agree or higher), as shown in Figure 7, and that the recommendations Themisto generated were accurate (20 out of 24 rated agree or higher).

Looking into the qualitative interview data, we can find some potential explanations for why participants believed so. Participants reported that Themisto provided them something to begin with, thus it was easier than starting from scratch: “
The plugin makes it easy to just pick it and have something simple. And then I got a couple of times where I went back and said, ‘Oh let me add a few more words.’” (P21)

Also, the accuracy of the generated recommendations plays a role in participants' experience: “
My experience with the plugin is definitely better. For the most part, the suggestions are pretty accurate. Although sometimes I did make a few minor changes like rearranging the text.” (P5)
The post-task questionnaire revealed that participants were more satisfied with the final notebook after using Themisto in the experiment condition than in the control condition (p = .04). The interview results also supported this finding. P14 believed that Themisto helped with wording: “
Sometimes I knew what the cells were doing but I did not know how to put things in a really good sentence for others.”

Themisto also motivated participants to document the analysis details. Although we did not see a difference between the two conditions in the number of markdown cells created or in the total number of words, Themisto helped participants overcome the procrastination of writing documentation and reminded them to document things that they might otherwise ignore.

    I think I definitely overlooked some details when I was commenting without the tool, because I just made the assumption that people should know from the code... To be honest, I do not usually follow a good
    coding practice. My notebooks are really messy and I am the only person who can understand it. I feel sorry for anybody else that has to see it. (P19)

Fig. 7. Results of the post-task questionnaire. Participants rated seven statements on a five-point scale from Strongly Disagree to Strongly Agree: “The plugin was difficult to use” (11 Strongly Disagree, 11 Disagree, 1 Neutral, 1 Agree, 0 Strongly Agree), “It was easier to accomplish this task with the plugin”, “The plugin’s recommendations are accurate”, “I trust the plugin’s recommendations”, “I was confused by the plugin’s recommendations”, “If the plugin is available, I would use it in the future”, and “If the plugin is available, I would recommend it to others”.

Moreover, participants believed that Themisto can help them form a better documenting practice in the long term: “
It very useful to remind me to always put some documentation in a timely manner.” (P13).

The two experts’ gradings for notebook quality suggest that notebooks from both conditions have roughly the same quality level. Experts rated notebooks written in the control condition as more readable than those written with Themisto (p < .001); a higher readability score indicates fewer formatting, spelling, or grammar issues. For the other two dimensions of the quality rubric (accuracy and completeness), there was no significant difference.

In the post-task interview, participants mentioned that the documentation generated by Themisto was not always well written, particularly the documentation generated by the deep-learning-based approach. Some participants also mentioned that they needed to edit the format of the generated documentation to fit their context. We believe that while Themisto offers convenience that improves data scientists’ productivity and saves their time, it may not provide the same level of readability as notebooks carefully articulated by humans. Thus, data scientists may want to further revise the formatting and wording of the Themisto-generated documentation.

In summary, our experiment indicated that Themisto improves participants’ productivity in creating documentation. It also increases their perceived satisfaction with the final notebook, compared to notebooks written by participants themselves, though the quality of the final notebook may suffer slightly in readability.
In this section, we present an in-depth analysis of how participants perceived the three approaches that Themisto implements to generate documentation: the deep-learning-based approach, the query-based approach, and the prompt-based approach. In the post-experiment interview, we explained how Themisto generated the documentation with these three approaches, and asked participants whether they liked or disliked any particular approach.

Participants reported that the deep-learning-based approach provided concise and general descriptions of the analysis process: “
I think the AI suggestion gives me an overview. It is short, and has some useful keywords.” (P12).

Participants also suggested that the deep-learning-based approach sometimes generated inaccurate or very vague documentation: “
The first one gives me a very short summary, though it didn’t always say what the cell is doing.” (P1). But the deep-learning-based approach was still perceived as useful. As it is short and contains only a couple of keywords, many
“This one gives you really good information. For some specific methods or calls, you don’t have to come up with a high-level summary for others and you can directly use it.” (P14).

Participants also acknowledged that the query-based approach may not work in some scenarios. For example, participants found that it was not useful for summarizing very basic data manipulations, as those cells contained no core API method. Some participants mentioned that the usefulness of the query-based approach depends on the audience.

    The [deep-learning-based approach] was really useful. The [query-based approach]... it depends on the audience. It is much more appropriate for a novice programmer. (P18)

We observed in the video recordings that participants rarely used the prompt-based approach during the session. The interview data confirmed our speculation. Some participants said that they liked the idea of user prompts but did not use them because the deep-learning-based approach and the query-based approach already gave them the actual content. Other participants pointed out that the prompts were not intelligent enough, so they did not use them: “
It always asks the same thing and I just ignored the prompts.” (P18).

Participants suggested that the prompts could be designed to better fit the context.

    Perhaps the system can infer what the code cell was doing [from the deep-learning-based approach], and show prompts accordingly. Like if I delete a data point from the dataset, there is a prompt asking why I considered it as an outlier or something. (P5)

Last but not least, many participants preferred a hybrid approach combining the deep-learning-based approach and the query-based approach. For example, P12 mentioned,

    The first one (deep-learning-based) tells me what the code cell is doing in general and the second one (query-based) tells me the details of the function. I would go with a hybrid approach. (P12)
Most participants indicated in the survey (Figure 7) that they would like to use Themisto in the future. The interview data provides more detail and evidence to elaborate on this result. Participants suggested various scenarios in which Themisto could be useful in their future work: when they need to add documentation during the exploration process for their future selves, when they need to document a notebook post hoc before sharing it with collaborators, when they need to mentor a team member who is a novice data scientist, or when they need to refactor an ill-documented notebook written by others.

    When I am doing data analysis, I tend to write the code first because there is a flow in my head of what I need to do. And then I will go back afterward to use the plugin and add the comments needed. I will definitely do that before sharing that file or handing it over to others. (P12)

One participant did not think Themisto could fit into his workflow: “
I always write documentation before writing code. Maybe Themisto does not work for people like me.” (P3)

In our experiment, the given scenario asked participants to document the notebook as a tutorial for data science students. In the interview, we asked how participants would document a notebook differently if they were creating documentation for an audience of non-technical domain experts. Some participants suggested that
computational notebooks may not be a good medium for presenting the analysis to non-technical domain experts; they would prefer to curate all the textual annotations in a standalone report or slide deck. Others believed that the notebook could work as the medium, but that they would change the documentation by using less technical terminology and adding more detail on topics that non-technical domain experts would be interested in (e.g., how the data was collected, potential bias in the analysis).
In the interview, participants provided various design suggestions for improving Themisto and for designing future technologies that support data scientists in documenting notebooks.

Participants expected Themisto to have more functionality than simply generating documentation for the code. For example, P13 proposed that Themisto could also create descriptions that document version changes and the editing histories of different team members. P3 and P4 believed that automatically generating the reasons behind decisions, such as why a particular algorithm was selected, is very much needed. P19 wanted the system to automatically add explanations for execution errors. Participants also mentioned that Themisto should add more variety to the generated content’s formatting; they would like to see suggested documentation with better presentation.

Lastly, some participants suggested that such a documentation generation system could take into consideration the purpose of the notebook, domain-specific terminology, or individuals’ habits for writing documentation.
In summary, our study found that Themisto can support data scientists in generating documentation by significantly reducing the time spent on the task and improving the perceived satisfaction with the final notebook. When Themisto was available, participants were very likely to check the generated documentation as a reference. Many of them directly used the generated documentation, a few still preferred to type the documentation manually, and many adopted a human-AI co-creation approach in which they used the AI-generated text as a baseline and kept improving on top of it. Participants perceived the documentation generated by the deep-learning-based approach as a short and concise overview, and the documentation generated by the query-based approach as descriptive and useful for educational purposes, but they rarely used the prompt-based approach. Overall, participants enjoyed Themisto and would like to use it in the future for various documenting purposes.
We argue that AI-supported documentation work for computational notebooks can be viewed as a co-creative process in which humans and machine teammates work together to create documents in a notebook. Viewed this way, we found that the combined human + AI effort produced a satisfactory level of quality at a more rapid pace than either human or AI could achieve alone. The notion of a partnership between human data scientists and AI has been discussed by Wang et al. [60], and is part of a larger research discussion involving many others (e.g., [5, 36, 53]). These researchers propose that the research community should focus more on designing AI techniques that integrate into human work and cooperate with humans, as opposed to competing with humans as portrayed by AlphaGo or DeepBlue. In our study, we found that the interaction pattern in which the AI creates “first draft” quality documentation, followed by human review and editing, resulted in a final artifact that not only met the bar for quality but exceeded it in terms of satisfaction. Participants were happier with their documentation when they worked with the AI system to create it than when they worked alone. Thus, we conclude that the benefits of having an AI partner in this task stem from
the combination of human control and the initiative of a machine teammate, as proposed in Shneiderman’s recent two-dimensional model [55].
The practice of documentation in data science both overlaps with and contrasts strongly with software engineering in many facets [33, 50, 65]. Software engineers write inline comments in their work-in-progress code to help collaborators understand the behavior of the code without the burden of going through thousands of lines of code; they document changes to their code for better version management and to improve their collaborators’ awareness; and when others need to build upon their work, they write formal documentation and Readme files to describe how to use the functions and APIs in their packages or services. Data scientists write computational narratives as a practice of literate programming [25, 42], and as a way to think and explore alternatives. Thus, notebooks often have orphan code cells or out-of-order code snippets, which lowers the reusability of the notebook and further highlights the importance of documentation in the notebook. As we found in our formative study, well-documented notebooks explain more than the behavior of the code. Notebooks cover various topics, including describing and interpreting the output of the code, explaining the reasons for choosing certain algorithms or models, educating audiences with different levels of expertise, and so on.

Thus, many interventions and lessons learned about documentation in software engineering may not work the same way in data science. For example, how can we evaluate the quality of the documentation? Software documentation can be assessed based on attributes like completeness, organization, relevance of content, readability, and accuracy [11]. Our experiment found that the quality score assessed by these rubrics does not reflect users’ satisfaction with the final notebooks. Despite many efforts to create a standard documentation practice [28, 48], it remains questionable whether there is a one-size-fits-all solution. For example, Rule et al. [48] suggested ten rules for writing and sharing computational analyses in Jupyter notebooks. The first rule they proposed is to tell a story to the audience. However, this description is very general, as people may approach storytelling differently. As we observed in Kaggle notebooks, some notebook authors prefer to use concise and accurate language to convey important information, while others write documentation in more creative and entertaining ways, for example describing the analysis of target variables as a “dating process” or making analogies between the data science workflow and starting a business. These creative notebooks stand out and receive many votes and compliments from the Kaggle community. As we recognize documentation in data
science as a fluid activity, traditional template-based approaches to aid documentation writing may not work in data science, because they cannot capture the broader aspects of documentation and they limit the expressiveness of storytelling. We argue that future work should recognize the differences between data science and software engineering, and tailor the documentation experience for data scientists. For example, Callisto [57] harnessed the fact that data scientists engage in synchronous work and discussion, and used contextual links between discussion messages and notebook content to aid the explanation of notebooks.
Participants reported that they did not find the prompts useful; they complained that the prompts were invariant. We speculate that a more adaptive or responsive set of prompts might prove more useful. For example, prompts could be based on the contents of the code cell the user is trying to document, as sketched below. Another possibility is that prompts could be based on the user’s own history of writing markdown cells, and could either appeal to the user’s strengths or anticipate and accommodate the user’s weaknesses. We suggest that adaptive intelligent prompting may be a worthwhile research topic.

In a more socially oriented approach, users within an organization might rate the initial set of prompts, voting prompts up or down depending on their usefulness. An evolution of this idea might allow users to propose new prompts for use by themselves and others (e.g., [8]).
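To make the idea of content-aware prompting concrete, the following minimal sketch is purely hypothetical (it is not part of Themisto’s implementation); the keyword rules and prompt texts are illustrative assumptions.

    # Hypothetical sketch: choose a prompt template from keywords found in a code cell.
    PROMPT_RULES = [
        ("drop(",    "Why were these rows or columns removed? Note any outlier criteria here: ___"),
        (".fit(",    "Which model is trained here, and why was it chosen? ___"),
        ("read_csv", "Where does this data come from, and how was it collected? ___"),
    ]

    def suggest_prompt(cell_source: str) -> str:
        """Return a prompt tailored to the code cell, falling back to a generic one."""
        for keyword, prompt in PROMPT_RULES:
            if keyword in cell_source:
                return prompt
        return "This code cell is for ___"  # generic fallback, mirroring Themisto's prompt

    print(suggest_prompt("model_lasso = LassoCV().fit(X_train, y)"))

Even such a simple keyword-based scheme would already vary the prompts across cells; a deployed system could instead reuse the deep-learning-based summary as the context signal.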
We might be able to adopt a similar strategy with the Themisto-generated cells that users modified. Over time, an enhanced version of Themisto could include a learning component that learns from the user’s modifications to Themisto-proposed texts and anticipates the user’s preferences in subsequent markdown texts. A related feature would allow an organization to enact a “house style” of documentation in the markdown texts. This kind of feature could be used to “standardize” documentation practices while preserving users’ diverse writing characteristics, or to fulfill certain legal requirements.
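As one possible starting point for such a learning component, a sketch like the following (hypothetical, not an existing Themisto feature; the file name and fields are assumptions) could log each suggestion together with the text the user ultimately kept, producing pairs that a preference model or a “house style” checker could later be trained on.

    # Hypothetical logging sketch for (suggested, accepted) documentation pairs.
    import json
    import time
    from pathlib import Path

    LOG_PATH = Path("doc_edit_log.jsonl")

    def record_edit(cell_source: str, suggested: str, accepted: str) -> None:
        """Append one suggestion/edit pair as a JSON line for later learning."""
        entry = {"timestamp": time.time(), "code": cell_source,
                 "suggested": suggested, "accepted": accepted}
        with LOG_PATH.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")

    record_edit("train.head()", "Let's see the values",
                "Preview the first five rows of the training data.")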
The premise of Themisto was to generate descriptive material based on program code. Following some of the ideas in Seeber et al. [53], we might invert this strategy. We recall that P3 told us that he wrote documentation in advance of writing the code itself. If other people follow the same discipline as P3, could we generate code from descriptive text? We suspect that this idea would not work for just any textual description. However, there could be certain stylized ways of writing descriptions that might be translatable into code; pseudocode could provide a starting point for the design of such a stylized type of description. We recognize that this kind of approach would need a representation of code packages and libraries, so that it could generate code appropriately structured for those packages. Of course, package documentation could be used to construct such a representation.
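A minimal sketch of this inverted direction, assuming a tiny set of stylized phrasings (everything below is hypothetical and far simpler than what a real system would need), could map descriptions to code templates:

    # Hypothetical sketch: translating stylized, pseudocode-like descriptions into code.
    import re

    RULES = [
        (re.compile(r"load csv (\S+) into (\w+)"), r"\2 = pd.read_csv('\1')"),
        (re.compile(r"preview (\w+)"),             r"\1.head()"),
        (re.compile(r"one-hot encode (\w+)"),      r"\1 = pd.get_dummies(\1)"),
    ]

    def text_to_code(description: str):
        """Return code for a recognized stylized description, or None if unrecognized."""
        text = description.strip().lower()
        for pattern, template in RULES:
            match = pattern.fullmatch(text)
            if match:
                return match.expand(template)
        return None

    print(text_to_code("load csv train.csv into train"))  # -> train = pd.read_csv('train.csv')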
Our experiment has several limitations: it focuses only on the documenting (rather than the coding) process, it is a controlled experiment, and participants did not work on notebooks they had created themselves. Thus, for example, we do not know how participants would perceive the usefulness of the tool on their own code, which may be longer and more complicated (e.g., containing out-of-order cells) than the notebooks we provided. However, we believe the results are still promising and shed light on future research and system design. Future work can explore the generalizability of Themisto through a long-term deployment study.
In this paper, we have designed and built Themisto to support human data scientists in the notebook documentation task. This research prototype also serves as a prompt to explore the human-AI collaboration research agenda within the automated notebook documentation scenario. The system design is driven by insights from previous literature and by a formative study that analyzed 80 highly-voted Kaggle notebooks to understand how human data scientists document notebooks. The follow-up user evaluation suggested that the collaboration between data scientists and Themisto significantly reduced task completion time and resulted in a final artifact that not only met the bar for quality, but also exceeded it in terms of satisfaction.
ACKNOWLEDGMENTS
We thank all of our participants for their help in the study, and the anonymous reviewers for their valuable feedback.
REFERENCES
[1] Rajas Agashe, Srinivasan Iyer, and Luke Zettlemoyer. 2019. Juice: A large scale distantly supervised dataset for open domain context-based code generation. arXiv preprint arXiv:1910.02216 (2019). [2] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400 (2018). [3] Matthew Arnold, Rachel KE Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilović, Ravi Nair, K Natesan Ramamurthy, Alexandra Olteanu, David Piorkowski, et al. 2019. FactSheets: Increasing trust in AI services through supplier’s declarations of conformity.
IBMJournal of Research and Development
63, 4/5 (2019), 6–1.[4] Liang Bai and Yanli Hu. 2018. Problem-driven teaching activities for the capstone project course of data science. In
Proceedings of ACM TuringCelebration Conference-China . 130–131.[5] Carrie J Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S Corrado, Martin CStumpe, et al. 2019. Human-centered tools for coping with imperfect algorithms during medical decision-making. In
Proceedings of the 2019 CHIConference on Human Factors in Computing Systems . 1–14.[6] Souti Chattopadhyay, Ishita Prasad, Austin Z Henley, Anita Sarma, and Titus Barik. 2020. What’s Wrong with Computational Notebooks? PainPoints, Needs, and Design Opportunities. In
Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems . 1–12.[7] Sergio Cozzetti B de Souza, Nicolas Anquetil, and Káthia M de Oliveira. 2005. A study of the documentation essential to software maintenance. In
Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information . 68–75.[8] Joan Morris DiMicco, Werner Geyer, David R Millen, Casey Dugan, and Beth Brownholtz. 2009. People sensemaking and relationship building on anenterprise social network site. In . IEEE, 1–10.[9] Brian P Eddy, Jeffrey A Robinson, Nicholas A Kraft, and Jeffrey C Carver. 2013. Evaluating source code summarization techniques: Replication andexpansion. In . IEEE, 13–22.[10] Jesus Fernandez-Bes, Jerónimo Arenas-García, and Jesús Cid-Sueiro. [n.d.]. Energy generation prediction: Lessons learned from the use of Kaggle inMachine Learning Course.
Group
7, 8 ([n. d.]), 9.[11] Golara Garousi, Vahid Garousi, Mahmoud Moussavi, Guenther Ruhe, and Brian Smith. 2013. Evaluating usage and quality of technical softwaredocumentation: an empirical study. In
Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering. 24–35. [12] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010 (2018). [13] R Stuart Geiger, Nelle Varoquaux, Charlotte Mazel-Cabasse, and Chris Holdgraf. 2018. The types, roles, and practices of documentation in data analytics open source software libraries.
Computer Supported Cooperative Work (CSCW)
27, 3-6 (2018), 767–802.[14] Andrew Head, Fred Hohman, Titus Barik, Steven M. Drucker, and Robert DeLine. 2019. Managing Messes in Computational Notebooks. In
Proceedingsof the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19) . ACM, New York, NY, USA, Article 270, 12 pages.https://doi.org/10.1145/3290605.3300500[15] Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. The dataset nutrition label: A framework to drive higherdata quality standards. arXiv preprint arXiv:1805.03677 (2018).[16] Amber Horvath, Mariann Nagy, Finn Voichick, Mary Beth Kery, and Brad A Myers. 2019. Methods for investigating mental models for learners ofAPIs. In
Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems . 1–6.[17] Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. In
Proceedings of the SIGCHI conference on Human Factors in Computing Systems .159–166.[18] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In . IEEE, 200–20010.[19] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Berlin, Germany, 2016-08). Associationfor Computational Linguistics, 2073–2083. https://doi.org/10.18653/v1/P16-1195[20] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In
Proceedingsof the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . 2073–2083.[21] Project Jupyter. [n.d.].
JupyterLab: the next generation of the Jupyter Notebook . https://blog.jupyter.org/jupyterlab-the-next-generation-of-the-jupyter-notebook-5c949dabea3[22] Project Jupyter. 2015. Project Jupyter: Computational Narratives as the Engine of Collaborative Data Science. Retrieved September 15, 2019 fromhttps://blog.jupyter.org/project-jupyter-computational-narratives-as-the-engine-of-collaborative-data-science-2b5fb94c3c58.[23] Mira Kajko-Mattsson. 2005. A survey of documentation practice within corrective maintenance.
Empirical Software Engineering
10, 1 (2005), 31–55.[24] Mary Beth Kery, Bonnie E. John, Patrick O’Flaherty, Amber Horvath, and Brad A. Myers. 2019. Towards Effective Foraging by Data Scientists toFind Past Analysis Choices. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk) (CHI ’19) .ACM, New York, NY, USA, Article 92, 13 pages. https://doi.org/10.1145/3290605.3300322[25] Mary Beth Kery and Brad A. Myers. 2017. Exploring exploratory programming. In (2017-10). 25–29. https://doi.org/10.1109/VLHCC.2017.8103446[26] Mary Beth Kery, Marissa Radensky, Mahima Arya, Bonnie E. John, and Brad A. Myers. 2018. The Story in the Notebook: Exploratory Data ScienceUsing a Literate Programming Tool. In
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal QC, Canada) (CHI’18) . ACM, New York, NY, USA, Article 174, 11 pages. https://doi.org/10.1145/3173574.3173748[27] Donald E. Knuth. 1984. Literate Programming.
Comput. J.
27, 2 (1984), 97–111. https://doi.org/10.1093/comjnl/27.2.97 [28] Markus Konkol, Daniel Nüst, and Laura Goulier. 2020. Publishing computational research–A review of infrastructures for reproducible and transparent scholarly communication. arXiv preprint arXiv:2001.00484 (2020). [29] Sam Lau, Ian Drosos, Julia M Markel, and Philip J Guo. 2020. The Design Space of Computational Notebooks: An Analysis of 60 Systems in Academia and Industry. In . IEEE, 1–11. [30] Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. arXiv preprint arXiv:2004.02843 (2020). [31] Jiali Liu, Nadia Boukhelifa, and James R Eagan. 2019. Understanding the role of alternatives in data analysis practices.
IEEE transactions onvisualization and computer graphics
26, 1 (2019), 66–76.[32] Ryan Louie, Andy Coenen, Cheng Zhi Huang, Michael Terry, and Carrie J Cai. 2020. Novice-AI Music Co-Creation via AI-Steering Tools for DeepGenerative Models. In
Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems . 1–13.[33] Julia S Stewart Lowndes, Benjamin D Best, Courtney Scarborough, Jamie C Afflerbach, Melanie R Frazier, Casey C O’Hara, Ning Jiang, and Benjamin SHalpern. 2017. Our path to better science in less time using open data science tools.
Nature ecology & evolution
1, 6 (2017), 1–7.[34] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXivpreprint arXiv:1508.04025 (2015).[35] Walid Maalej and Martin P Robillard. 2013. Patterns of knowledge in API reference documentation.
IEEE Transactions on Software Engineering
39, 9(2013), 1264–1282.[36] Thomas W Malone. 2018. How human-computer’Superminds’ are redefining the future of work.
MIT Sloan Management Review
59, 4 (2018), 34–41.[37] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, andTimnit Gebru. 2019. Model cards for model reporting. In
Proceedings of the conference on fairness, accountability, and transparency . 220–229.[38] Stephen Oney and Joel Brandt. 2012. Codelets: linking interactive documentation and example code in the editor. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2697–2706. [39] Yoann Padioleau, Lin Tan, and Yuanyuan Zhou. 2009. Listening to programmers—Taxonomies and characteristics of comments in operating system code. In . IEEE, 331–341. [40] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In
Proceedingsof the 40th annual meeting of the Association for Computational Linguistics . 311–318.[41] Raja Parasuraman, Thomas B Sheridan, and Christopher D Wickens. 2000. A model for types and levels of human interaction with automation.
IEEE Transactions on systems, man, and cybernetics-Part A: Systems and Humans
30, 3 (2000), 286–297.[42] Soya Park, Amy X. Zhang, and David R. Karger. 2018. Post-literate Programming: Linking Discussion and Code in Software Development Teams. In
The 31st Annual ACM Symposium on User Interface Software and Technology Adjunct Proceedings (Berlin, Germany) (UIST ’18 Adjunct) . ACM, NewYork, NY, USA, 51–53. https://doi.org/10.1145/3266037.3266098[43] Jeffrey M. Perkel. 2018. Why Jupyter is data scientists’ computational notebook of choice.
Nature
563 (2018), 145. https://doi.org/10.1038/d41586-018-07196-1[44] João Felipe Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, and BertramLudäscher. 2016. Yin & Yang: demonstrating complementary provenance from noWorkflow & YesWorkflow. In
International Provenance andAnnotation Workshop . Springer, 161–165.[45] Mohammed Suhail Rehman. 2019. Towards Understanding Data Analysis Workflows using a Large Notebook Corpus. In
Proceedings of the 2019International Conference on Management of Data . 1841–1843.[46] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. arXiv preprint arXiv:2005.04118 (2020).[47] Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software?. In . IEEE, 255–265.[48] Adam Rule, Amanda Birmingham, Cristal Zuniga, Ilkay Altintas, Shih-Cheng Huang, Rob Knight, Niema Moshiri, Mai H Nguyen, Sara BrinRosenthal, Fernando Pérez, et al. 2019. Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks.[49] Adam Rule, Aurélien Tabard, and James D Hollan. 2018. Exploration and explanation in computational notebooks. In
Proceedings of the 2018 CHIConference on Human Factors in Computing Systems . 1–12.[50] Jeffrey Saltz, Kevin Crowston, et al. 2017. Comparing data science project management methodologies via a controlled experiment. (2017).[51] Sheeba Samuel and Birgitta König-Ries. 2018. ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility.. In
International Semantic Web Conference (P&D/Industry/BlueSky) .[52] Sheeba Samuel and Birgitta König-Ries. 2020. ReproduceMeGit: A Visualization Tool for Analyzing Reproducibility of Jupyter Notebooks. arXivpreprint arXiv:2006.12110 (2020).[53] Isabella Seeber, Eva Bittner, Robert O Briggs, Triparna de Vreede, Gert-Jan De Vreede, Aaron Elkins, Ronald Maier, Alexander B Merz, SarahOeste-Reiß, Nils Randrup, et al. 2020. Machines as teammates: A research agenda on AI in team collaboration.
Information & management
57, 2(2020), 103174.[54] Lin Shi, Hao Zhong, Tao Xie, and Mingshu Li. 2011. An empirical study on evolution of API documentation. In
International Conference onFundamental Approaches To Software Engineering . Springer, 416–431.[55] Ben Shneiderman. 2020. Human-centered artificial intelligence: Reliable, safe & trustworthy.
International Journal of Human–Computer Interaction
36, 6 (2020), 495–504.[56] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary commentsfor java methods. In
Proceedings of the IEEE/ACM international conference on Automated software engineering . 43–52.[57] April Yi Wang, Zihan Wu, Christopher Brooks, and Steve Oney. 2020. Callisto: Capturing the “Why” by Connecting Conversations with ComputationalNarratives. In
Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20) . ACM.[58] Dakuo Wang, Q. Vera Liao, Yunfeng Zhang, Udayan Khurana, Horst Samulowitz, Soya Park, Michael Muller, and Lisa Amini. 2021. How MuchAutomation Does a Data Scientist Want?. In preprint .[59] Dakuo Wang, Parikshit Ram, Daniel Karl I Weidele, Sijia Liu, Michael Muller, Justin D Weisz, Abel Valente, Arunima Chaudhary, Dustin Torres,Horst Samulowitz, et al. 2020. AutoAI: Automating the End-to-End AI Lifecycle with Humans-in-the-Loop. In
Proceedings of the 25th InternationalConference on Intelligent User Interfaces Companion . 77–78.[60] Dakuo Wang, Justin D Weisz, Michael Muller, Parikshit Ram, Werner Geyer, Casey Dugan, Yla Tausczik, Horst Samulowitz, and Alexander Gray.2019. Human-AI Collaboration in Data Science: Exploring Data Scientists’ Perceptions of Automated AI.
Proceedings of the ACM on Human-ComputerInteraction
3, CSCW (2019), 1–24.[61] Justin D Weisz, Mohit Jain, Narendra Nath Joshi, James Johnson, and Ingrid Lange. 2019. BigBlueBot: teaching strategies for successful human-agentinteractions. In
Proceedings of the 24th International Conference on Intelligent User Interfaces . 448–459.[62] John Wenskovitch, Jian Zhao, Scott Carter, Matthew Cooper, and Chris North. 2019. Albireo: An Interactive Tool for Visually SummarizingComputational Notebook Structure. In . IEEE, 1–10.[63] Kun Xu, Lingfei Wu, Zhiguo Wang, Yansong Feng, Michael Witbrock, and Vadim Sheinin. 2018. Graph2seq: Graph to sequence learning withattention-based neural networks. arXiv preprint arXiv:1804.00823 (2018).[64] Ying Xu, Dakuo Wang, Penelope Collins, Hyelim Lee, and Mark Warschauer. [n.d.]. Same benefits, different communication patterns: ComparingChildren’s reading with a conversational agent vs. a human partner.
Computers & Education
161 ([n. d.]), 104059. [65] Amy X Zhang, Michael Muller, and Dakuo Wang. 2020. How do Data Science Workers Collaborate? Roles, Workflows, and Tools. arXiv preprint arXiv:2001.06684 (2020).
A EXAMPLE OF DOCUMENTATION GENERATION IN THEMISTO
Table 6. Example Notebook (House Prediction) and Documentation Generation in Themisto
Source Code: import pandas as pd; import numpy as np; from sklearn.linear_model import LassoCV; from sklearn.model_selection import cross_val_score
DL-Based: Importing libraries | Query-Based: Pandas is for data manipulation and analysis; NumPy is a library for large, multi-dimensional arrays and matrices; Scikit-learn is a machine learning library. | Prompt-Based: This code cell is for ___

Source Code: train = pd.read_csv('train.csv'); test = pd.read_csv('test.csv')
DL-Based: Read the data | Query-Based: Read a comma-separated values (csv) file into DataFrame; Return the first 5 rows. | Prompt-Based: This code cell is for ___

Source Code: train.head()
DL-Based: Let's see the values | Query-Based: Return the first 5 rows | Prompt-Based: The table shows ___

Source Code: all_data = pd.concat((train.loc[:,'SubClass':'SaleCond'], test.loc[:,'SubClass':'SaleCond']))
DL-Based: A generator for feature | Query-Based: Concatenate pandas objects along a particular axis with optional set logic along the other axes. | Prompt-Based: This code cell is for ___

Source Code: all_data = pd.get_dummies(all_data)
DL-Based: Convert all the data | Query-Based: Convert categorical variable into dummy/indicator variables | Prompt-Based: This code cell is for ___

Source Code: all_data = all_data.fillna(all_data.mean())
DL-Based: Check the missing values | Query-Based: Fill NA/NaN values using the specified method | Prompt-Based: This code cell is for ___

Source Code: X_train = all_data[:train.shape[0]]; X_test = all_data[train.shape[0]:]; y = train.SalePrice
DL-Based: Create the target and the test data | Query-Based: Slice string | Prompt-Based: This code cell is for ___

Source Code: model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)
DL-Based: Model | Query-Based: Lasso linear model with iterative fitting along a regularization path. | Prompt-Based: This code cell is for ___

Source Code:
    def rmse_cv(model):
        rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
        return (rmse)
    rmse_cv(model_lasso).mean()
DL-Based: A simple example model with the lasso | Query-Based: Evaluate a score by cross-validation | Prompt-Based: The result indicates that ___
Table 7. Example Notebook (Covid Prediction) and Documentation Generation in Themisto
Source Code: import numpy as np; import pandas as pd; from sklearn.ensemble import RandomForestClassifier
DL-Based: Importing libraries | Query-Based: Pandas is for data manipulation and analysis; NumPy is a library for large, multi-dimensional arrays and matrices; Scikit-learn is a machine learning library. | Prompt-Based: This code cell is for ___

Source Code: train = pd.read_csv("train.csv"); test = pd.read_csv("test.csv"); train.head()
DL-Based: Read the data | Query-Based: Read a comma-separated values (csv) file into DataFrame; Return the first 5 rows. | Prompt-Based: The table shows ___

Source Code: train.describe()
DL-Based: Let's see the values | Query-Based: Generate descriptive statistics. Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. | Prompt-Based: The table shows ___

Source Code: train["Date"] = train["Date"].apply(lambda x: x.replace("-","")); train["Date"] = train["Date"].astype(int); train.head()
DL-Based: Convert all the data | Query-Based: Replace a specified phrase with another specified phrase | Prompt-Based: The table shows ___

Source Code: train.isnull().sum()
DL-Based: Check the missing values | Query-Based: Detect missing values for an array-like object | Prompt-Based: The result indicates that ___

Source Code: test["Date"] = test["Date"].apply(lambda x: x.replace("-",""))
DL-Based: Convert all the data | Query-Based: Replace a specified phrase with another specified phrase | Prompt-Based: This code cell is for ___

Source Code: x = train[['Lat', 'Long', 'Date']]; y = train[['ConfirmedCases']]; x_test = test[['Lat', 'Long', 'Date']]
DL-Based: Create the target and the test data | Query-Based: Select subsets of data | Prompt-Based: This code cell is for ___

Source Code: Tree_model = RandomForestClassifier(max_depth=200, random_state=0); Tree_model.fit(x, y)
DL-Based: Model | Query-Based: A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. | Prompt-Based: This code cell is for ___

Source Code: pred = Tree_model.predict(x_test); pred = pd.DataFrame(pred); pred.columns = ["ConfirmedCases_prediction"]
DL-Based: Predicate to use a predicate function for tests | Query-Based: A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. | Prompt-Based: This code cell is for ___
B CODING BOOK FOR THE INTERVIEW TRANSCRIPTS
Table 8. Coding Book for the Interview Transcripts
Theme Code