Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models
Soya Park, April Wang, Ban Kawas, Q. Vera Liao, David Piorkowski, Marina Danilevsky
SOYA PARK, Massachusetts Institute of Technology, USA
APRIL WANG, University of Michigan, USA
BAN KAWAS, Q. VERA LIAO, DAVID PIORKOWSKI, and MARINA DANILEVSKY, IBM, USA
Data scientists face a steep learning curve in understanding a new domain for which they want to build machine learning (ML) models. While input from domain experts could offer valuable help, such input is often limited, expensive, and generally not in a form readily consumable by a model development pipeline. In this paper, we propose Ziva, a framework to guide domain experts in sharing essential domain knowledge with data scientists for building NLP models. With Ziva, experts are able to distill and share their domain knowledge using domain concept extractors and five types of label justification over a representative data sample. The design of Ziva is informed by preliminary interviews with data scientists, conducted to understand current practices of the domain knowledge acquisition process for ML development projects. To assess our design, we run a mixed-method case study to evaluate how Ziva can facilitate interaction between domain experts and data scientists. Our results highlight that (1) domain experts are able to use Ziva to provide rich domain knowledge while maintaining low mental load and stress levels; and (2) data scientists find Ziva's output helpful for learning essential information about the domain, offering scalability of information, and lowering the burden on domain experts to share knowledge. We conclude this work by experimenting with building NLP models using the Ziva output from our case study.

CCS Concepts: • Human-centered computing → Empirical studies in HCI; Interactive systems and tools.

Additional Key Words and Phrases: Human-in-the-loop machine learning, CSCW, Multi-disciplinary collaboration
In recent decades, machine learning (ML) technologies are sought by an increasing number of professionals to automate their work tasks or augment their decision-making [86]. Broad areas of applications are benefiting from the integration of ML, such as healthcare [16, 17], finance [23], employment [51], and so on. However, building an ML model in a specialized domain is still expensive and time-consuming for at least two reasons. First, a common bottleneck in developing modern ML technologies is the requirement of a large quantity of labeled data. Second, many steps in an ML development pipeline, from problem definition to feature engineering to model debugging, necessitate an understanding of domain-specific knowledge and requirements. Data scientists therefore often require input from domain experts to obtain labeled data, as well as to understand model requirements, inspire feature engineering, and get feedback on model behavior. In practice, such knowledge transfer between domain experts and data scientists is very much ad-hoc, with no standardized practices or proven effective approaches, and requires significant direct interaction between data scientists and domain experts. Building a high-quality legal, medical, or financial classifier will inevitably require a data scientist to consult with professionals in such domains. In practice these are often costly and frustrating iterative conversations and labeling exercises that can go on for weeks and months, and usually still do not yield output in a form readily consumable by a model development pipeline.

In this work, we set out to develop methods and interfaces that facilitate knowledge sharing from domain experts to data scientists for model development. We chose to focus on natural language processing (NLP) modeling tasks, and we are especially motivated by real-world cold-start scenarios where labeled data is scarce or nonexistent. Informed by a formative interview with data scientists regarding current practices and challenges of learning from domain experts, we developed a domain-knowledge acquisition interface Ziva (With Zero knowledge, how do I deVelop A machine learning model?). Rather than a data-labeling tool, Ziva intends to provide a diverse set of elicitation methods to gather knowledge from domain experts, then present the results as a repository to data scientists to serve their domain understanding needs and to build ML models for specialized domains. Ziva scaffolds the knowledge sharing in desired formats and allows asynchronous exchange between domain experts and data scientists. It also allows flexible re-use of the knowledge repository for different modeling tasks in the domain.

Specifically, informed by findings from the formative interview and requirements of NLP modeling tasks, Ziva focuses on eliciting key concepts in the text data of a domain (concept creation), and rationale justifying a label that a domain expert gives to a representative data instance (justification elicitation).
In the current version of Ziva, we provide five different justification elicitation methods – bag of words, simplification, perturbation, concept bag of words, and concept annotation.

To evaluate and inform future development of Ziva, we conducted a case study assessing its coupled design goals: 1) to provide an efficient and user-friendly experience for domain experts to supply domain knowledge; 2) to support data scientists building NLP models, especially in cold-start scenarios.

We conducted a lab study (N=12) and a crowd-deployment study (N=88) in which participants acted as domain experts of a restaurant reviewing domain and used Ziva to provide concepts and justification-based knowledge. We found that the completion time and subjective workload varied across elicitation methods. Interestingly, the popular keyword-based justification (bag of words) approach led to higher self-reported task success but was considered more stressful.

We conducted an interview study with 7 data scientists to investigate whether and how Ziva could help them build NLP models. Through the study, we identified design requirements for domain knowledge-sharing tools in the ML development workflow – scalability of information and lowering the workload for domain experts. Participants also reflected on how the shared domain knowledge facilitated by Ziva may be utilized, including bootstrapping labels, supporting feature engineering, improving explainability, and training few-shot learning models. Based on these suggestions, we experimented with building a rule-based model using the data from our user study, and report the outcomes using knowledge elicited with different methods. In summary, the contributions of the paper are as follows:

• Through a formative interview with data scientists who built models in a specialized domain, we identified their under-supported needs to learn about a domain from domain experts.
• We developed Ziva, a tool providing concept creation and five kinds of justification elicitation interfaces to gather domain knowledge from domain experts in formats that could help data scientists build NLP models.
• We conducted a case study using Ziva to elicit domain knowledge, then presented the output to data scientists in an interview study. Their feedback validated the utility of Ziva and provided design insights for tools that support knowledge sharing and collaboration between domain experts and data scientists.
• We also investigated the experience of domain experts using Ziva. We believe that our analysis could inform the design of knowledge elicitation methods for domain experts.
We are informed by recent studies of data science practices, as well as ML and HCI work that leverages domain experts' input to train or improve models, and research to facilitate knowledge sharing in teams and organizations.

Recently the data science domain has spurred great research interest in the HCI community. Besides developing numerous tools to support specific data science tasks (e.g. [4, 36, 37, 89]), an emerging area of research has focused on studying the practices of data scientists in model development work. Many have recognized the collaborative nature of data science projects, with both intra-disciplinary (among data scientists) [45] and multi-disciplinary (with domain experts) collaboration [87]. In particular, data scientists rely heavily on domain experts during core model building stages, such as data access and feature extraction. Domain experts also feature prominently in later stages of data science projects such as model evaluation and communication of results. However, data scientists' work faces significant challenges as such collaborative activities are currently not well supported [52, 59], and they are often left with no choice but to rely on "an intuitive sense of their data and processes" [55].

Computational notebooks are positioned as a potential solution to both support collaborative coding and communicate results to stakeholders [80]. However, a recent study reported reluctance of data scientists to directly communicate in-progress model work in notebooks [67]. While tools are emerging to address the technology gaps in supporting collaborative data science practices [22, 34, 81], to our knowledge they tend to focus on supporting teams of data scientists and offer domain experts only limited elicitation. In this work, we explore the approach of providing interfaces in which domain experts can create a knowledge repository for a sophisticated domain, so that it can be consumed by data scientists asynchronously and flexibly when the availability of domain experts is limited.
There has been a long-standing desire to increase the involvement of domain experts in model building in both the ML and HCI communities. For example, tasks like text annotation and image annotation involve massive input from domain experts to provide domain-related feedback. The Ziva interface is inspired by NLP text annotation tools like Doccano [56] or Prodigy [62], which ease the burden of manual labeling through various visual designs (e.g., using colors to highlight different entities). We take this design further to acquire domain knowledge for model development and data scientists. Recognizing the challenge of having domain experts label a large quantity of data, many ML works have explored more efficient learning algorithms to reduce the workload [70, 75], or utilize domain experts' input as rules [47], constraints [19, 58], prior information [11, 72], or feedback to re-weigh features [25, 63]. Recently, given the prominence of label-hungry ML algorithms, weak supervision has become a popular approach to bootstrap labels based on a small amount of labeled input from domain experts [24, 65, 66].

The HCI community is further concerned with the isolation of domain experts from the model development process, which requires data scientists to go through lengthy and asynchronous iterations to get their input [9, 60]. To tackle the problem, the sub-field of interactive machine learning (iML) is motivated to enable domain experts or end-users to directly drive model behaviors [9, 38, 83]. Since domain experts might not have training in ML or programming, iML systems elicit their input through intuitive and interactive interfaces (e.g., visualization [41], graphical user interfaces [9], conversational interfaces [18]) and a tight feedback loop for them to adjust their input. A variety of user inputs have been explored in prior work for different tasks in model development, including unseen training data to help correct the model's mistakes [18, 26], new feature-level input [46] or adjustment to feature weights [70], assessment of model performance [10, 27], error preferences [43], parameter choices [54], model ensembles [77], etc.

Research on iML has been especially fruitful for NLP modeling tasks, partly because text data and features (e.g., bag of words) are often comprehensible to people without ML training, increasing the likelihood of obtaining effective feedback from domain experts or end-users. For example, for the document classification task, DUALIST [71] solicits feedback for both labels and learned features (keywords that the model believes to be informative of the target class). FeatureInsight [14] further supports feature ideation by the users, e.g. by adding words or creating dictionaries that the user believes to be informative of the target class. EluciDebug [48] is an end-user debugging tool that allows critiquing model weights based on the model's explanation of how it classifies a document. Interactive topic modeling is another well-explored area for incorporating domain experts' input [21, 39, 40, 73], for example, by moving documents around or adding words to refine clusters of topics.

Our work is informed by prior work on iML but takes a complementary approach by facilitating knowledge sharing from domain experts to data scientists. iML is not a panacea for effectively leveraging domain experts' input. There are known issues with letting ML novices directly adjust models [74], such as lacking generalization or over-fitting [84].
In practice it is not always feasible to set up an iML system for domain experts to work with. Currently most ML projects still rely on data scientists to write code and set up the pipelines [61]. Moreover, having data scientists mediate the knowledge input offers the flexibility to apply it to different kinds of ML algorithms, and allows domain experts to provide reusable knowledge not constrained by a particular modeling task.

In general, it is possible to elicit diverse kinds of knowledge from people, not all of which can be consumed directly by a given ML model. For example, Stumpf et al. [76] and Ghai et al. [29] explored what kinds of feedback people naturally want to give when seeing model explanations. Only a small subset of the various forms of feedback is readily consumable by existing ML algorithms. However, as the ML field rapidly advances, many novel usages of domain knowledge are being explored. For example, since ML models might use low-level features that are not human-understandable (e.g., pixels of an image), interpretable ML work has explored eliciting human-interpretable concepts in the domain (e.g., an object in the image) and using the concepts to explain model decisions [30, 44]. Elicited domain concepts have also been used to create sub-groups for labeled data to enable "structured labeling", which could lower the re-labeling burden when a target class changes [49]. We further envision that elicited domain concepts could help data scientists head-start their model building, as revealed in our preliminary interview. By facilitating knowledge sharing from domain experts, we also hope to inspire novel algorithmic work that could leverage such a knowledge repository.

Ziva is also motivated by prior work on technologies that facilitate knowledge sharing in enterprises and organizations. Knowledge sharing has long been studied in the Computer Supported Cooperative Work (CSCW) community [5, 69]. Ackerman et al. [5] summarized two generations of research in this area: the first generation focuses on repository models that elicit knowledge as information artifacts to enable sharing, storing and re-using, while the second generation centers around expertise location and communication. Knowledge repository tools elicit various formal and informal information including manuals, standard procedures, best practices, common questions, and so on. For example, Goldberg et al. studied collaborative tagging and filtering mechanisms for workers to construct a knowledge repository [31]; Answer Garden is a system that builds a growing information repository through people asking and answering questions [4]; Terveen et al. designed a memory framework for large-scale software engineering where groups collectively build a shared memory [78]; Nam and Ackerman studied methods for eliciting informal information into more organized forms [57].

Knowledge sharing in ML projects poses unique challenges [8, 15] in making the knowledge transferrable into ML specifications. The challenges are amplified in sophisticated domains. For example, for a medical ML model, a clinician may have to help data scientists understand complex drug information. We inform the design of Ziva both by prior work on involving domain experts in data science projects and model development, and by a preliminary interview study to understand how data scientists learn from domain experts. Meanwhile, studies have warned that knowledge sharing and repository tools often fail in practice [35, 85, 91] if the design fails to take into account the social dynamics, including what benefits and demands these technologies bring for both the knowledge providers and the knowledge consumers [5, 32]. Thereby we evaluate Ziva by involving both the knowledge consumers – data scientists – and the knowledge providers – domain experts.
Table 1. Interviewees information

Pn (domain) | Model (reasons for choosing the model) | Methods of knowledge exchange
P1 (Legal, law) | Rule-based (transparency, few labels) | Instance perturbation
P2 (Disaster recovery) | Supervised neural net (sufficient labelers) | Education session of domain overview; domain experts labeling
P2 (") | Rule-based (transparency, few labels) | Domain experts think aloud while labeling data
P3 (Customer categorization) | Random forest (transparency) | Pair-authoring (go over analysis together with domain experts [80])
P4 (Sports) | AutoML (time) | Brute-force model building
To understand how data scientists grasp a domain, we conducted semi-structured interviews with four data scientists working on NLP models (2 females, 2 males). Each interview was 45 minutes long and driven by a script that covers questions related to their recent projects collaborating with domain experts and their typical interactions with domain experts. We recruited participants via posting on Slack channels of an international technology company. Each interviewee was compensated $15 for their time. We summarize our interviewees' projects and challenges in Table 1. As a result, we identified the current practices of learning from domain experts and design requirements for our tool.

All of our data scientist interviewees indicated they often need domain experts' help and feedback. However, domain experts are busy and have little time to spare. One said: "The first issue is getting hold of their time... I think hardly I was getting one day a week, you can say one hour a week, not even an entire day." So data scientists try to extract as much knowledge as they can in the limited time they have. They have to spend significant time preparing for these discussions. For example, they often manually curated examples such as mis-classified instances and instances that contain unfamiliar keywords to ground the discussion during the meetings with domain experts. Even though there is no standard way to extract domain knowledge across different domains, through mutual effort they find what works best for a project. We identified the following approaches to learning domain knowledge from domain experts:
Example-driven conversation:
The first fold of approaches is domain knowledge sharing based on examples. By inquiring about how and why domain experts would label or make decisions for these examples, data scientists learn rationales of how the model should behave for the instances. There are three tactics mentioned by our interviewees. P2 observed domain experts during labeling to learn the domain experts' thought process: "They would go line by line in front of me so that I can also see what their brain is looking at classifying them." P1 initially took P2's approach, but due to the complexity of the law domain, explaining rationale required extensive background knowledge. Oftentimes, it is unclear to data scientists how to connect the explanations provided by the domain experts to model specifications. P1 used a strategy called instance perturbation – for a given instance, the domain experts were asked to minimally change the instance until the model changes the label, and to discuss the reasons. With this, data scientists were able to narrow in on the parts of the instance that should be the most important to the model's decision. Instead of aiming to build a perfect model right off the bat, P4 deployed their model first and incrementally improved the model upon domain experts' request. Whenever domain experts encountered mis-classified results, they shared the instances with data scientists and discussed why they were mis-classified.
Fig. 1. To facilitate domain knowledge sharing, Ziva presents representative instances and interfaces for reviewing them to domain experts; the resulting domain knowledge is then used by data scientists.
General background knowledge acquisition:
Concepts are key units of information for a given domain, such as notions, entities, components or properties. A set of domain concepts can be seen as a taxonomy. Understanding them could help data scientists make sense of the domain. Participants reported approaches to learning concepts in an unfamiliar domain. P2 and P3 said domain experts in one of their projects offered a lecture to explain key concepts of their domain. For P2, domain experts gave an overview and touched on the basic concepts of each class. P3 pair-authored [80] with domain experts to bridge concepts and a mathematical formula that encapsulates the information. With this iterative learning process, data scientists were able to kick-start model building. P2 said "I think that was very helpful because after that, my dependency reduced a bit. I could myself assess that what category they belong to."
Summary:
From our interview, we derived several design requirements for Ziva. We found that the usage of domain knowledge is not limited to labeling but extends to other parts of ML development, sometimes open-ended learning. Thus, the interfaces of Ziva ought to facilitate data scientists' domain-knowledge learning in general throughout the development workflow. More specifically, we found that the tool should scaffold domain experts to efficiently elicit domain knowledge within a short amount of time (R1). Next, the tool should help data scientists extract basic domain concepts (R2). Lastly, data scientists indicated that they often learn from domain experts' rationale, especially how they justify a decision or label. Hence, the tool needs to facilitate label justification sharing (R3).

This section introduces the interface of Ziva. Ziva provides features for domain experts to create domain concepts and elicit justification from representative instances that are automatically curated by Ziva. We discuss Ziva's different components and how they meet the design requirements in detail.
As highlighted in our formative interview, domain experts have limited time for labeling or sharing domain knowledge (R1). Hence, it is important to ask them to review only a few instances, while the sample should cover most concepts in the domain.
Fig. 2. Ziva interface: domain experts first extract domain knowledge with curated instances. Then they review each instance one by one using one of the justification-elicitation interfaces.

Ziva extracts such a representative sample of m instances from a large training set of N text instances by the simple method of transforming the original text into tf-idf space, clustering the result using an algorithm such as k-means (setting k = m), and, for each cluster, returning the text instance closest to the cluster center. This method is not deterministic, but provides a reasonable set of representative instances for cases where m << N.

Creating a taxonomy is an effective way of organizing information [20, 50]. Ziva provides an interface where SMEs can extract domain concepts (R2). Users are asked to categorize each example instance, presented as a card, via a card-sorting activity. Users first group cards by topic (general concepts of the domain such as atmosphere, food, service, price). Cards in each topic are then further divided into descriptions referencing specific attributes of a topic (e.g., cool, tasty, kind, high). The interface (Figure 2) was implemented as a drag-and-drop UI using LMDD [2].
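For concreteness, the following is a minimal sketch of the instance-curation method described above (not the authors' code; the function and variable names are ours): instances are embedded in tf-idf space, clustered with k-means (k = m), and the instance closest to each cluster center is returned.

```python
# Minimal sketch of the curation step: tf-idf + k-means, one representative per cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min


def select_representatives(texts, m=10):
    """Return m instances, one per k-means cluster, each closest to its cluster center."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    km = KMeans(n_clusters=m).fit(vectors)
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, vectors)
    return [texts[i] for i in closest]


# e.g. representatives = select_representatives(training_reviews, m=10)
```

Without a fixed random seed the k-means initialization varies between runs, which is consistent with the non-determinism noted above.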
Once a domain expert finishes the concept extraction, they review each instance using one of the elicitation interfaces, which ask the domain expert to justify an instance's label (this information is then intended for consumption by data scientists (R3)). We used Materialize to implement the justification elicitation conditions. The justification elicitation interfaces were designed through an iterative process of paper prototyping, starting with initial designs inspired by our preliminary interviews. As we conducted paper prototyping, we examined whether (1) the answers from different participants were consistent and (2) the information from participants' answers was useful to data scientists. We now describe the five different justification elicitation methods that we created and evaluated, and highlight the design rationale where appropriate.
Bag of words.
This base condition reflects the most common current approach. Given an instance and a label (e.g., positive, negative), the domain experts are asked to highlight the text snippets that justify the label assignment.
Instance perturbation.
Inspired by one of our data scientists in the formative study, this condition asks a domain expert to perturb (edit) a part of the instance such that the assigned label is no longer justifiable by the resulting text. For example, in the restaurant domain, "our server was kind" can be modified to no longer convey a positive sentiment by either negating an aspect (e.g., "our server was not kind") or altering it (e.g., "our server was rude").

This strategy is also inspired by the research area of generating natural language adversarial examples [7]. Such approaches algorithmically alter training examples to create similar adversarial examples that fool well-trained NLP models. In our scenario, the domain expert is seeking to alter training examples in order to point out the most salient characteristics to the data scientist; the latter learns from this information, combining it with syntactic and semantic analysis of the original and perturbed instances.
Instance simplification.
This condition asks domain experts to shorten an instance as much as possible, leaving only text that justifies the assigned label of the original instance. For example, "That's right. The red velvet cake... ohhhh.. it was rich and moist" can be simplified to "The cake was rich and moist", as the rest of the content does not convey any sentiment and can therefore be judged irrelevant to the sentiment analysis task.

This condition is inspired by the plethora of methods for sentence simplification used in extractive text summarization [79]. In particular, the domain expert is performing sentence reduction as in [42]. The output can be considered a concise summary of the original instance, keeping only the content that is directly relevant to the sentiment analysis task. The result for the data scientist is clean, compact, and fully relevant high-quality training examples.
Concept bag of words.
This condition incorporates the concepts extracted in the prior step. Similar to the bag of words condition, domain experts are asked to highlight relevant text within each instance to justify the assigned label; however, each highlight must be grouped into one of the concepts. If, during concept creation, the domain expert copied a card to assign multiple topics and descriptions, then the interface prompts multiple times to highlight relevant text for each one. For example, if they classified the instance "That's right. The red velvet cake... ohhhh.. it was rich and moist" into the concept "food is tasty", they can select rich, moist and cake as indicative words for that concept.

Concept annotation.
This condition is similar to the above concept bag of words condition. However, when annotating the instance text, domain experts are directed to distinguish between words relevant to the topic and words relevant to the description. Given the above sample instance, the domain expert would need to indicate which part of the sentence applies to food (e.g., cake) and which to tasty (e.g., rich and moist). Both this and the previous concept condition are motivated by the well-established knowledge that a variety of NLP tasks, such as relation extraction, question answering, clustering and text generation, can benefit from tapping into the conceptual relationships present in the hierarchies of human knowledge [88]. Learning taxonomies from text corpora is a significant NLP research direction, especially for long-tailed and domain-specific knowledge acquisition [82].

In the rest of the paper, we present a case study to evaluate the utility of the Ziva interface in two parts. In Section 5, we conduct a lab experiment and a crowd experiment in which participants acted as domain experts using Ziva. We choose the domain of restaurant reviews (Yelp Open Dataset [3]) and the NLP task of sentiment analysis, as being extremely familiar and easy enough to understand for most people to qualify as domain experts. In Section 6, we conduct an interview study with data scientists to evaluate the utility of the domain knowledge collected in the above experiments. We instruct the data scientists to assume no previous knowledge of the domain, so we could use the elicited knowledge about restaurant reviewing as a proxy to understand how Ziva could help them build NLP models.
We recruited participants to act as domain experts of restaurant reviewing and use Ziva. In a lab study (N=12), we compared participants' task completion and experience with all concept and justification elicitation methods, and gathered their qualitative feedback. To allow quantitative comparison of the results of different justification elicitation methods, we conducted a follow-up crowd experiment (N=88).
Table 2. Average task completion time (standard deviation) of lab study participants

Bag of words | Simplification | Perturbation | Concept bag of words | Concept annotation
Study protocol: To avoid noisiness in labeling, we pre-labeled the set of Yelp review instances so we could focus on comparing the elicitation methods. We created binary labels based on ground-truth ratings: if the number of stars for a review is 1 or 2, we labeled it as negative; 4 or 5 as positive [90]. We then took a random balanced sample of 10,000 instances. 8,000 were used as a (balanced) training set, from which we extracted ten representative instances to use in the study (see Section 4.1). We set aside the other 2,000 (balanced) instances as a test set for analyzing the performance of models built from the study output (see Section 6).

We recruited participants (5 female, 7 male), who self-reported little or no knowledge of ML, via posting on Slack channels of an international technology company. Participants are designers, graduate students, researchers, trained professionals, skilled laborers, software engineers and project managers. To compensate their time, we ran a $30 raffle. Participants were given an introduction to the project and a tutorial of the Ziva interface. They were also given a practice task in a different domain, i.e., clothing. For the concept extraction task, all participants used the same interface. For the justification interface, we randomly assigned each participant one treatment from the elicitation methods without concepts (bag of words, label perturbation and simplification), and one from those with concepts (concept bag of words and concept annotation). Thus, each participant experienced two elicitation interfaces and reviewed 5 instances each. After each interface, participants were asked to fill out the NASA TLX form [33] to evaluate their subjective workload and share their feedback. The entire session lasted up to one hour.
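A sketch of this pre-labeling and balanced-sampling step is below; the file and column names follow the Yelp Open Dataset conventions but are assumptions, not details taken from the paper.

```python
# Sketch: derive binary sentiment labels from star ratings and draw a balanced sample.
import pandas as pd

reviews = pd.read_json("yelp_academic_dataset_review.json", lines=True)  # assumed file name
reviews = reviews[reviews["stars"] != 3].copy()        # 3-star reviews fall in neither class
reviews["label"] = reviews["stars"].map(lambda s: "positive" if s >= 4 else "negative")

# Balanced 10,000-instance sample: 4,000 training + 1,000 test per class.
sample = reviews.groupby("label", group_keys=False).apply(lambda g: g.sample(5000, random_state=0))
train = sample.groupby("label", group_keys=False).apply(lambda g: g.sample(4000, random_state=0))
test = sample.drop(train.index)
```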
Task Results: One participant could not complete the second justification interface. We report a summary of the concepts generated by participants, as well as quantitative and qualitative experience using the justification methods.
Concept creation.
Participants took 879.7 seconds on average (σ=385.4). They created 3.92 topics on average (σ=1.11). Everyone included Food quality and Customer service in their topics. To assess the taxonomy from each domain expert, we examined the consistency between domain experts and the coverage of the restaurant domain.

• Consistency between domain experts: The union of all topics across all participants includes the following 10 topics: ambiance, cuisine, food quality, customer service, additional service, complaint, speciality, reservation, location, and price. For each topic, we rated whether each taxonomy intersects with the topic or not. The inter-rater reliability (IRR) across all domain experts was 58% using Fleiss' κ.

• Coverage of the domain: We selected 3 additional instances which were not shown to the participants. We used our curation method to pick another set of representative instances. We then inspected how many instances can be categorized using each taxonomy, resulting in a coverage of 69% (25 out of 36 instances).
Justification elicitation.
The average task completion time in each condition is summarized in Table 2. Since each participant was assigned two out of five justification elicitation methods, there were only a few data points per elicitation technique (3 to 5 per technique). To further investigate with a larger population, we deployed Ziva on a crowd platform, described in the following section. Plots of the post-questionnaire results are attached in supplemental materials.

Most participants found the bag of words condition easy to complete. One participant said: "This was easy because a lot of words were clearly positive or negative, such as "terrible" or "delicious"". However, some considered it tricky to identify words that are indicative of the overall sentiment. For example, one participant said, "this can be just descriptive without any positive or negative feelings without the context. So it's difficult to isolate the context out of the words."

For the simplification task, participants indicated the task was straightforward. Participants said "easy as it had eliminated redundant and unnecessary words" and "quite easy and intuitive, paraphrasing keeping the original intent is what I usually do as part of minutes of meetings". One participant said sometimes the task became hard because some instances could not be obviously shortened and instead needed to be entirely rewritten.

Participants said perturbation is also straightforward but it required them to understand the entire instance thoroughly. One participant commented, "It was kind of hard because I don't know some of the words". Another participant suggested that if the interface suggested antonyms, it would be easier to finish the task.

With concept bag of words, participants said it allowed subjective and nuanced elicitation, as they could pick words associated with a concept without judging their sentiment. However, it led to more varied results among participants. For example, for the concept Food is tasty and the instance "Ohhhh... The red velvet cake is rich and moist", most participants selected rich and moist. One participant said "Even red velvet cake could be the indicative words if you personally like the cake". Others said "maybe ohhhh part can be included" and "moist doesn't necessarily mean delicious".

For the concept annotation task, participants said it is straightforward to choose words directly mapped to each token. On the other hand, it complicated the articulation to have to label at such a fine granularity. One participant commented, "slightly tedious as it required me to comprehend on how best to label the words accordingly".

To assess the different justification methods with a larger population, we deployed the Ziva interface on a crowd platform.
Study Protocol: Using our curation method, we extracted 10 reviews from the datasets used in the lab study. We pre-populated a taxonomy: in order to provide a representative sampled concept, we recruited 5 volunteers and asked them to extract concepts of the restaurant domain using the concept extraction interface, and two of the authors aggregated the taxonomy. The resulting taxonomy is attached in supplemental materials.

We installed 5 test questions for each condition with ground truth created by the authors. If a crowd worker did not pass more than half of the test questions, they could not continue to the Human Intelligence Task (HIT). Each worker was given one of the five justification elicitation interfaces, and reviewed 10 instances.

We recruited our study participants from Appen [1]. We compensated them with $0.5 per HIT, and they were rewarded an additional $2.5 for the test questions. From the lab study, we observed that each HIT took less than 2 minutes on average, which makes for an hourly wage of about $15. After the tasks, we asked them to fill out the same NASA TLX form to report on their subjective workload. Participants were rewarded an additional $3 for the survey. A total of 88 crowd workers completed our study, resulting in 857 instances with elicited data.
Result: We analyzed participants' survey responses using a one-way Kruskal–Wallis ANOVA, as summarized in Table 3. There was a marginal difference in self-reported success of task accomplishment and a significant difference in stress level across justification elicitation methods.

As a post-hoc analysis, we ran one-tailed Mann-Whitney U tests. The results revealed that participants who completed the tasks using bag of words perceived higher success in accomplishing the tasks than participants with simplification (U=76.5, z=2.31; p=.01) and concept annotation (U=97.5, z=2.02; p=.02). Concept bag of words users also perceived higher success than simplification (U=75.5, z=2.34; p=.009) and concept annotation users (U=103, z=1.85; p=.03). As for the stress level, bag of words users reported significantly higher stress than perturbation (U=55, z=2.90; p=.002), concept bag of words (U=84.5, z=-2.24; p=.01), and concept annotation users (U=81.5, z=-2.34; p=.01).
Table 3. Crowd experiment Likert result. H statistics (p-value), significance level 0.05

Mentally demanding | Successfully accomplishing | Hard to accomplish | Insecure, Stressed
(.06641) | 8.0609 (.08937) | — | 9.9411 (.04143)

Fig. 3. Post-question responses in NASA TLX (1 - Very low, 7 - Very high) (Crowd experiment participants, n=88)
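The non-parametric analysis above maps onto standard library calls; the sketch below (with a hypothetical data structure for the Likert ratings) shows the omnibus Kruskal–Wallis test followed by a one-tailed Mann–Whitney U post-hoc comparison.

```python
# Sketch of the analysis reported above: omnibus Kruskal-Wallis across the five
# conditions, then one-tailed Mann-Whitney U tests for pairwise post-hoc comparisons.
from scipy.stats import kruskal, mannwhitneyu

def analyze(stress_by_condition):
    """stress_by_condition: dict mapping condition name -> list of 1-7 Likert ratings."""
    h, p = kruskal(*stress_by_condition.values())
    print(f"Kruskal-Wallis: H = {h:.4f}, p = {p:.5f}")
    # Post-hoc example: is bag of words rated more stressful than perturbation?
    u, p = mannwhitneyu(stress_by_condition["bag_of_words"],
                        stress_by_condition["perturbation"],
                        alternative="greater")
    print(f"BoW vs. perturbation: U = {u}, p = {p:.4f}")
```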
To investigate what and how domain knowledge extracted with Ziva helps data scientists, we conducted an interview study with data scientists. We showed them the concept and the different justification results extracted by domain experts, and asked them how they could use them in their ML development workflow.
Study Protocol:
Participants were given an introduction to the project, the prompts shown to domain experts, and the corresponding outputs of each part of the interface. Each interview was 1 hour long and driven by a questionnaire that posed questions comparing the domain knowledge extracted by domain experts using Ziva to their current practice. Finally, they were asked to rank the usefulness of the justification interfaces to their workflow. We recruited 7 data scientists who have between 4 and 20 years of experience building models with domain experts in sophisticated domains, using the Slack channels of an international technology company and word-of-mouth.
Results:
The full ranking of each technique is reported in supplemental materials. We re-ranked the scores on a linear scale, with a data scientist's favorite at 5 points, the second-most favorite at 4, and so on. If two techniques were tied for N-th rank, we averaged the scores for both techniques (e.g., if two techniques tied for 4th, they were each given 1.5). As a result, the concept annotation technique scored the most (30), followed by concept bag of words and perturbation (22.5), simplification (17.5), and bag of words (12.5). Data scientists had several reasons why they prefer one justification technique over another, and various applications for different techniques. Through these metrics, we were able to identify design requirements and important factors for domain-knowledge learning.
Standardized protocols.
As revealed in our preliminary interview and previous work [52], there is no set protocol of communication or common ground between the two parties, and interviewees expressed a need for a protocol of communication with domain experts. The steep learning curve of a specialized domain and the lack of guidance in how to extract domain knowledge exacerbate the collaboration with domain experts. Three interviewees said that having such a concept and examples provided upfront by domain experts had helped them build a model in prior projects. One said, "They describe what are the component information and examples. It was not very difficult to understand after reading the documents." In light of this, interviewees preferred justification techniques that inform them about the domain. For example, interviewees found concept annotation helpful because it is tightly connected with the concept, hence they can learn from examples how different components of the domain are expressed in the instances. Simplification is also helpful, as it is a simpler version of instances without rhetoric.

One interviewee suggested using justification techniques to explain model decisions. They said, "I work on active learning ML a lot. So I work with users. And so far all the interactions I expect for the user, fairly simple, either binary feedback – correct or incorrect. I have any incorporated explanation of when the user provide feedback. What's the explanation behind this feedback? I think that that would be very useful to generate some explanation or learn how to generate explanation." While model explanation is not the intended usage of Ziva's justification techniques, the data scientist found the techniques helpful for debugging a model.
Scalability of domain knowledge.
Interviewees were also interested in how they would scale the Ziva output. Since they received only 10 labeled instances and justifications, the data was too small for training a model. One application of the Ziva output mentioned by data scientists is to label more instances by generalizing the concept and justification, i.e., weakly-supervised learning [64]. One interviewee said, "They are trying to give me guidance on how to propagate the labels. So one, the concept is going to be able to give me some notion on how to bucket my data, right, like, just in an unsupervised fashion."
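As an illustration of this bootstrapping idea, each concept from the repository can act as a simple labeling function over unlabeled reviews. The concept-keyword lists in the sketch below are invented for the example; they are not the elicited study output.

```python
# Sketch: propagate labels to unlabeled reviews using concept keywords elicited via Ziva.
# Reviews matching no concept are left unlabeled (abstain), as in weak supervision.
CONCEPTS = {
    ("food", "tasty"): (["rich", "moist", "delicious"], "positive"),
    ("food", "not tasty"): (["tasteless", "bland"], "negative"),
    ("service time", "slow"): (["forever", "slow"], "negative"),
}

def weak_label(review: str):
    text = review.lower()
    votes = [label for keywords, label in CONCEPTS.values()
             if any(k in text for k in keywords)]
    if not votes:
        return None                           # abstain: no concept matched
    return max(set(votes), key=votes.count)   # majority vote across matched concepts

# weak_label("The red velvet cake is tasteless.")  ->  "negative"
```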
Interviewees stressed the importance of domain knowledge in feature engineering. During meetings with domain experts, they focus on identifying features for their model: "I immediately start looking at what are the different features or abstractions of features that seem to be important to the domain expert." However, data scientists expressed the difficulty of feature ideation when building models in a specialized domain. Repeated meetings were required to go over many instances together in the hope of covering the complete set of features. Three interviewees said they would use the Ziva output to facilitate feature engineering by using the concepts created by domain experts as features. A participant explained: "Vector that we can convert each restaurant record into a some feature vector." When it comes to the best justification techniques for feature engineering, one said "The one with the highest resolution would be more beneficial for feature learning potentially because it allow me to generalize better". One data scientist suggested that they can propagate a feature across different components (e.g., food/food quality, service) of the domain expert's concepts using distributional signatures [13]. For instance, in a restaurant domain, once they identified positive-sentiment words related to food, they can find words of similar sentiment related to service using the distribution of words.
Reduced burden on domain experts.
We also found data scientists were mindful of domain experts' cognitive load when Ziva output was generated, because domain experts were often busy. Another reason is that if eliciting justification is difficult, data scientists would not get a reliable result. One interviewee said: "I would say there's also the question of what I think would be more easier for people, if it's difficult, then they're probably not going to do it very well. I wouldn't give it to them because I would think it's going to be more noisy."

Elicitation and learning outcomes.
To demonstrate the feasibility of translating the Ziva output into useful features for model building, we constructed a real implementation. Inspired by a use case suggested by a data scientist in our study, we built 5 rule-based models for weakly-supervised learning, mimicking a real-world cold-start scenario with extremely limited labeling resources and no pre-trained model available. With such constraints, no one can expect state-of-the-art performance after a few training examples. Instead, a valuable characteristic at this early stage is intra-class consistency, demonstrating parallel improvement in precision and recall performance on the various classes (here, positive and negative sentiment). This would suggest that the model is indeed learning something relevant to the entire task rather than guessing wildly, and hints at a good robustness that can be reliably improved upon with additional examples.

Excepting the bag of words condition, the models primarily focused on recognizing the semantic pattern of 'Noun is Adjective'. Of course, this can take several forms ('food is good', 'good food', 'food is not bad', etc.). We built rule-based models that extend a generic semantic role labeling model [6] which can easily handle such variations. The generic model identifies all existing semantic roles, and the ten instances, annotated in each condition, are used to populate the relevant dictionaries.
Table 4. Performance of rule-based models on 2,000 test instances, for different justification conditions on 10 training instances. Because the test dataset is balanced, the Recall (R) value is equivalent to Accuracy. The last three columns are the really meaningful ones, as they highlight the absolute differences in Precision/Recall/F1 between the two classes (lower is better; values below 0.10 are highlighted). The Trivial model, which always assigns a positive label to each instance, is shown for reference.

Model | Positive Class (P, R, F) | Negative Class (P, R, F) | Delta Between Classes (P, R, F)
Trivial (Always Pos)
Bag of Words
Perturbation
Concept Annotation
Bag of words.
This was simple keyword matching on the terms identified in this condition. The positive terms output from this condition were mostly generic ('amazing', 'delicious'), whereas many negative terms were very specific ('over-hyped', 'small quantities'). This is an artifact of both the domain (restaurant reviews) and the labels. The performance on the two classes reflects this: the positive class has pretty bad precision but great recall, as it severely over-generalizes, whereas the negative class has amazing precision but barely finds any examples, because it is so specific.
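A minimal sketch of this condition's model is below; the term sets are illustrative examples from the text above, not the full elicited lists, and the abstain-on-tie behavior is our assumption.

```python
# Sketch of the bag-of-words rule model: count matches against elicited positive and
# negative terms; predict the majority side, abstain when there is no signal or a tie.
POSITIVE_TERMS = {"amazing", "delicious"}
NEGATIVE_TERMS = {"over-hyped", "small quantities"}

def bow_classify(review: str):
    text = review.lower()
    pos = sum(term in text for term in POSITIVE_TERMS)
    neg = sum(term in text for term in NEGATIVE_TERMS)
    if pos == neg:
        return None          # no signal (or tied signal)
    return "positive" if pos > neg else "negative"
```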
Perturbation.
The perturbed parts of the instances were treated as local high-quality training instances for both labels. All possible 'Noun is Adjective' signals were extracted from those instances to populate the relevant dictionaries. If a verb was negated, or an adjective transformed into an antonym (e.g., changing 'delicious' to 'disgusting' in 'There were delicious burgers', assigned a positive label), this meant that the topic ('burgers') is highly relevant, the original text ('delicious burgers') was a good training example for the given label, and the perturbed result ('disgusting burgers') was a good example for the opposite label.
Simplification.
The simplified instances were treated as high-quality training instances. All possible 'Noun is Adjective' signals were extracted from those instances to populate the relevant dictionaries. These signals did not overlap much in content, so the model could do little generalizing. Much like the perturbation condition, the recall for both classes is therefore extremely low, and the precision is respectable for only 10 training examples. Perturbation recall results are slightly better because each perturbed instance yields both a positive and a negative signal.
Concept bag of words and concept annotation.
The concept taxonomy described in Section 5.2 follows the 'Noun is Adjective' format by definition, so it was encoded accordingly for both of these conditions. The outputs of each condition were then used to extend the possible dictionaries. For concept bag of words, each annotation was added to both the 'Noun' and 'Adjective' dictionaries (whenever grammatically possible). For concept annotation, the 'Noun' and 'Adjective' elements were elicited separately, and thus were added to their respective dictionaries. It is unclear whether either of these conditions is more successful than the other at this stage. The recall is markedly better than for simplification and perturbation, owing to the well-structured concept taxonomy, which lends itself well to generalization. But this comes at a price, as the delta in performance between the classes is similarly worse.
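For intuition, a toy version of the 'Noun is Adjective' matching is sketched below. The actual models extend a semantic role labeling system [6] and use dictionaries populated from the elicited output; the dictionaries and the regular-expression matcher here are illustrative only.

```python
# Toy sketch of 'Noun is Adjective' rule matching with concept-derived dictionaries.
# The real models use a generic SRL model [6] to handle variations robustly.
import re

NOUNS = {"food": ["cake", "burger"], "service": ["waiter", "server"]}
ADJECTIVES = {"positive": ["rich", "moist", "kind", "delicious"],
              "negative": ["rude", "disgusting", "tasteless"]}

def rule_classify(review: str):
    text = review.lower()
    for nouns in NOUNS.values():
        for noun in nouns:
            for label, adjs in ADJECTIVES.items():
                for adj in adjs:
                    # Match 'noun ... adj' or 'adj ... noun' within the review.
                    if re.search(rf"\b{noun}\b.*\b{adj}\b|\b{adj}\b.*\b{noun}\b", text):
                        # Flip the label if the adjective is negated ('not tasty').
                        if re.search(rf"\bnot\s+(\w+\s+)?{adj}\b", text):
                            return "negative" if label == "positive" else "positive"
                        return label
    return None  # abstain when no pattern matches
```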
Capturing nuanced domain knowledge.
While Ziva provides the basic components of a domain, data scientists pointed out that there is some domain knowledge that the current design of Ziva does not capture. For instance, domain experts provide insight about data, such as the sparsity of a certain column. Data scientists find such information helpful; however, it cannot be captured in the Ziva output. More investigation is required on how to extract such nuanced data. One possible direction is to leverage proposed documentation for data [28] or for models [12, 53]. Another tactic is to bring a set of guided questions, similar to the ones proposed in [68], into discussions between domain experts and data scientists. The structure provided by these artifacts can facilitate transfer of domain knowledge and get teams to a common ground quickly.
Re-evaluating the old normal: Bag of words.
Bag of words is one of the dominant ways in the NLP domain to elicit signals. It is seemingly the most simple and straightforward task for domain experts. However, to our surprise, our work suggests otherwise. Participants in our user study indicated that bag of words is in fact more mentally demanding, harder to accomplish and more stressful than other justification techniques. Furthermore, in our exercise of building a rule-based model with different justification methods, the other methods outperformed bag of words. This informed us that both domain experts and data scientists can benefit from our justification techniques during collaboration. We believe our justification methods could be used throughout the ML development workflow and provide an outlet for stakeholders to efficiently communicate their model building.
Limitations.
Various use cases of the Ziva output validated the efficacy of our interfaces drawn from our preliminary study and literature review, demonstrating that domain experts' elicited knowledge can facilitate model building. However, this paper only considers the concrete setting of a sentiment classifier for Yelp restaurant reviews. Future work should examine the generalizability of the approach for other tasks (e.g., document classification, clustering, machine translation, and question answering) and other domains (e.g., education, health science). Nevertheless, the overall approach described in this paper is domain-agnostic, and in fact much more relevant to real-life scenarios with complex tasks, specialized domains, and significant constraints on the resources to generate large amounts of labeled data. We therefore believe we have identified a number of interesting design requirements of domain-knowledge sharing in the ML development workflow that are not currently addressed, and that are applicable across tasks and domains.
In this paper, we presented a system and a case study on how data scientists can get help from domain experts in the ML development lifecycle. Along the way, we were able to identify the current practice of how data scientists go about domain-knowledge learning. Inspired by the workarounds used to extract domain knowledge, we designed an interface that facilitates domain knowledge-sharing. We presented the interface output to ML practitioners to reflect on their experience building an ML model in a specialized domain. They shared that the scalability of a piece of domain knowledge and a low cognitive load on domain experts are important factors in such a domain knowledge-bootstrapping tool. We continued the work by investigating the cognitive load of the different methods in our interface. We found that the traditional elicitation method "bag of words" is least preferred by domain experts in terms of mental load and stress level, and provides the least knowledge scalability compared to other elicitation methods.
ACKNOWLEDGMENTS
We thank Dakuo Wang and David Karger for their feedback. Soya Park is partly supported by the Kwanjeong fellowship.

REFERENCES
ACM Transactions on Information Systems (TOIS) 16, 3 (1998), 203–224.
[5] Mark S Ackerman, Juri Dachtera, Volkmar Pipek, and Volker Wulf. 2013. Sharing knowledge and expertise: The CSCW view of knowledge management. Computer Supported Cooperative Work (CSCW) 22, 4-6 (2013), 531–573.
[6] A. Akbik and Yunyao Li. 2016. K-SRL: Instance-based Learning for Semantic Role Labeling. In COLING.
[7] Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating Natural Language Adversarial Examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 2890–2896. https://doi.org/10.18653/v1/D18-1316
[8] Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software engineering for machine learning: A case study. IEEE, 291–300.
[9] Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120.
[10] Saleema Amershi, Max Chickering, Steven M Drucker, Bongshin Lee, Patrice Simard, and Jina Suh. 2015. ModelTracker: Redesigning performance analysis tools for machine learning. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 337–346.
[11] David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning. 25–32.
[12] Matthew Arnold, Rachel KE Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, A Mojsilović, Ravi Nair, K Natesan Ramamurthy, Alexandra Olteanu, David Piorkowski, et al. 2019. FactSheets: Increasing trust in AI services through supplier's declarations of conformity. IBM Journal of Research and Development 63, 4/5 (2019), 6–1.
[13] Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. 2019. Few-shot text classification with distributional signatures. arXiv preprint arXiv:1908.06039 (2019).
[14] Michael Brooks, Saleema Amershi, Bongshin Lee, Steven M Drucker, Ashish Kapoor, and Patrice Simard. 2015. FeatureInsight: Visual support for error-driven feature ideation in text classification. IEEE, 105–112.
[15] Carrie J Cai and Philip J Guo. 2019. Software Developers Learning Machine Learning: Motivations, Hurdles, and Desires. IEEE, 25–34.
[16] Carrie Jun Cai, Emily Reif, Narayan G Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viégas, Greg Corrado, Martin Stumpe, and Michael Terry. 2019. Human-Centered Tools for Coping with Imperfect Algorithms during Medical Decision-Making. https://arxiv.org/abs/1902.02960
[17] Carrie Jun Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making.
[18] Maya Cakmak and Andrea L Thomaz. 2012. Designing robot learners that ask good questions. IEEE, 17–24.
[19] Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint-driven learning. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. 280–287.
[20] Lydia B Chilton, Greg Little, Darren Edge, Daniel S Weld, and James A Landay. 2013. Cascade: Crowdsourcing taxonomy creation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1999–2008.
[21] Jaegul Choo, Changhyun Lee, Chandan K Reddy, and Haesun Park. 2013. UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics 19, 12 (2013), 1992–2001.
[22] Kevin Crowston, Jeff S Saltz, Amira Rezgui, Yatish Hegde, and Sangseok You. 2019. Socio-technical Affordances for Stigmergic Coordination Implemented in MIDST, a Tool for Data-Science Teams. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–25.
[23] Robert Culkin and Sanjiv R Das. 2017. Machine learning in finance: The case of deep learning for option pricing. Journal of Investment Management 15, 4 (2017), 92–100.
[24] Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W Bruce Croft. 2017. Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 65–74.
[25] Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 81–90.
[26] Jerry Alan Fails and Dan R Olsen Jr. 2003. Interactive machine learning. In
Proceedings of the 8th international conference on Intelligent user interfaces .39–45.[27] James Fogarty, Desney Tan, Ashish Kapoor, and Simon Winder. 2008. CueFlik: interactive concept learning in image search. In
Proceedings of thesigchi conference on human factors in computing systems . 29–38. 15 . Park et al. [28] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018.Datasheets for datasets. arXiv preprint arXiv:1803.09010 (2018).[29] Bhavya Ghai, Q Vera Liao, Yunfeng Zhang, Rachel Bellamy, and Klaus Mueller. 2020. Explainable Active Learning (XAL): An Empirical Study ofHow Local Explanations Impact Annotator Experience. arXiv preprint arXiv:2001.09219 (2020).[30] Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. 2019. Towards automatic concept-based explanations. In
Advances in NeuralInformation Processing Systems . 9273–9282.[31] Yaron Goldberg, Marilyn Safran, and Ehud Shapiro. 1992. Active mail—a framework for implementing groupware. In
Proceedings of the 1992 ACMconference on Computer-supported cooperative work . 75–83.[32] Jonathan Grudin. 1988. Why CSCW applications fail: problems in the design and evaluationof organizational interfaces. In
Proceedings of the 1988ACM conference on Computer-supported cooperative work . 85–93.[33] Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In
Advances in psychology . Vol. 52. Elsevier, 139–183.[34] Andrew Head, Fred Hohman, Titus Barik, Steven M Drucker, and Robert DeLine. 2019. Managing messes in computational notebooks. In
Proceedingsof the 2019 CHI Conference on Human Factors in Computing Systems . 1–12.[35] Sven Hoffmann, Aparecido Fabiano Pinatti de Carvalho, Darwin Abele, Marcus Schweitzer, Peter Tolmie, and Volker Wulf. 2019. Cyber-PhysicalSystems for Knowledge and Expertise Sharing in Manufacturing Contexts: Towards a Model Enabling Design.
Computer Supported CooperativeWork (CSCW)
28, 3-4 (2019), 469–509.[36] Fred Hohman, Andrew Head, Rich Caruana, Robert DeLine, and Steven M Drucker. 2019. Gamut: A design probe to understand how data scientistsunderstand machine learning models. In
Proceedings of the 2019 CHI conference on human factors in computing systems . 1–13.[37] Fred Hohman, Minsuk Kahng, Robert Pienta, and Duen Horng Chau. 2018. Visual analytics in deep learning: An interrogative survey for the nextfrontiers.
IEEE transactions on visualization and computer graphics
25, 8 (2018), 2674–2693.[38] Andreas Holzinger. 2016. Interactive machine learning for health informatics: when do we need the human-in-the-loop?
Brain Informatics
3, 2(2016), 119–131.[39] Enamul Hoque and Giuseppe Carenini. 2015. Convisit: Interactive topic modeling for exploring asynchronous online conversations. In
Proceedingsof the 20th International Conference on Intelligent User Interfaces . 169–180.[40] Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014. Interactive topic modeling.
Machine learning
95, 3 (2014), 423–469.[41] Liu Jiang, Shixia Liu, and Changjian Chen. 2019. Recent research advances on interactive machine learning.
Journal of Visualization
22, 2 (2019),401–417.[42] Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In
Proceedings of the sixth conference on Applied natural languageprocessing (ANLC ’00) . Association for Computational Linguistics, 310–315. https://doi.org/10.3115/974147.974190[43] Ashish Kapoor, Bongshin Lee, Desney Tan, and Eric Horvitz. 2010. Interactive optimization for steering machine classification. In
Proceedings of theSIGCHI Conference on Human Factors in Computing Systems . 1343–1352.[44] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. 2018. Interpretability beyond feature attribution:Quantitative testing with concept activation vectors (tcav). In
International conference on machine learning . PMLR, 2668–2677.[45] Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel. 2016. The emerging role of data scientists on software development teams.In
Proceedings of the 38th International Conference on Software Engineering . ACM, 96–107.[46] Josua Krause, Adam Perer, and Enrico Bertini. 2014. INFUSE: interactive feature selection for predictive modeling of high dimensional data.
IEEEtransactions on visualization and computer graphics
20, 12 (2014), 1614–1623.[47] Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu. 2009. SystemT: a system fordeclarative information extraction.
ACM SIGMOD Record
37, 4 (2009), 7–13.[48] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of explanatory debugging to personalize interactivemachine learning. In
Proceedings of the 20th international conference on intelligent user interfaces . 126–137.[49] Todd Kulesza, Denis Charles, Rich Caruana, Saleema Amin Amershi, and Danyel Aharon Fisher. 2019. Structured labeling to facilitate conceptevolution in machine learning. US Patent 10,318,572.[50] David Laniado, Davide Eynard, Marco Colombetti, et al. 2007. Using WordNet to turn a folksonomy into a hierarchy of concepts. In
Semantic WebApplication and Perspectives-Fourth Italian Semantic Web Workshop . 192–201.[51] James Manyika, Michael Chui, Mehdi Miremadi, et al. 2017. A future that works: AI, automation, employment, and productivity.
McKinsey GlobalInstitute Research, Tech. Rep
60 (2017).[52] Yaoli Mao, Dakuo Wang, Michael Muller, Kush R Varshney, Ioana Baldini, Casey Dugan, and Aleksandra Mojsilović. 2019. How Data ScientistsWork Together With Domain Experts in Scientific Collaborations: To Find The Right Answer Or To Ask The Right Question?
Proceedings of theACM on Human-Computer Interaction
3, GROUP (2019), 1–23.[53] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, andTimnit Gebru. 2019. Model cards for model reporting. In
Proceedings of the conference on fairness, accountability, and transparency . 220–229.[54] Thomas Mühlbacher, Lorenz Linhardt, Torsten Möller, and Harald Piringer. 2017. Treepod: Sensitivity-aware selection of pareto-optimal decisiontrees.
IEEE transactions on visualization and computer graphics
24, 1 (2017), 174–183.16 acilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models [55] Michael Muller, Ingrid Lange, Dakuo Wang, David Piorkowski, Jason Tsay, Q. Vera Liao, Casey Dugan, and Thomas Erickson. 2019. How DataScience Workers Work with Data: Discovery, Capture, Curation, Design, Creation. In
Proceedings of the 2019 CHI Conference on Human Factors inComputing Systems (Glasgow, UK) (CHI ’19) . ACM, New York, NY, USA, Forthcoming.[56] Hiroki Nakayama, Takahiro Kubo, Junya Kamura, Yasufumi Taniguchi, and Xu Liang. 2018. doccano: Text Annotation Tool for Human. https://github.com/doccano/doccano Software available from https://github.com/doccano/doccano.[57] Kevin K Nam and Mark S Ackerman. 2007. Arkose: reusing informal information from online discussions. In
Proceedings of the 2007 internationalACM conference on Supporting group work . 137–146.[58] Radu Stefan Niculescu, Tom M Mitchell, and R Bharat Rao. 2006. Bayesian network learning with parameter constraints.
Journal of machine learningresearch
7, Jul (2006), 1357–1383.[59] Samir Passi and Steven J Jackson. 2018. Trust in data science: collaboration, translation, and accountability in corporate data science projects.
Proceedings of the ACM on Human-Computer Interaction
2, CSCW (2018), 1–28.[60] Claudio Pinhanez. 2019. Machine Teaching by Domain Experts: Towards More Humane, Inclusive, and Intelligent Machine Learning Systems. arXivpreprint arXiv:1908.08931 (2019).[61] David Piorkowski, Soya Park, April Yi Wang, Dakuo Wang, Michael Muller, and Felix Portnoy. 2021. How AI Developers Overcome CommunicationChallenges in a Multidisciplinary Team: A Case Study. arXiv:2101.06098 [cs.CY][62] prodigy. 2018. prodigy. https://prodi.gy.[63] Hema Raghavan, Omid Madani, and Rosie Jones. 2006. Active learning with feedback on features and instances.
Journal of Machine LearningResearch
7, Aug (2006), 1655–1686.[64] Alex Ratner, Stephen Bach, Paroma Varma, and Chris Ré. 2019. Weak supervision: the new programming paradigm for machine learning.
HazyResearch. Available via https://dawn. cs. stanford. edu//2017/07/16/weak-supervision/. Accessed (2019), 05–09.[65] Alexander Ratner, Stephen H Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation withweak supervision. In
Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases , Vol. 11. NIH Public Access, 269.[66] Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets,quickly. In
Advances in neural information processing systems . 3567–3575.[67] Adam Rule, Ian Drosos, Aurélien Tabard, and James D Hollan. 2018. Aiding collaborative reuse of computational notebooks with annotated cellfolding.
Proceedings of the ACM on Human-Computer Interaction
2, CSCW (2018), 1–12.[68] Shems Saleh, William Boag, Lauren Erdman, and Tristan Naumann. 2020. Clinical Collabsheets: 53 Questions to Guide a Clinical Collaboration. In
Machine Learning for Healthcare Conference . PMLR, 783–812.[69] A Th Schreiber, Guus Schreiber, Hans Akkermans, Anjo Anjewierden, Nigel Shadbolt, Robert de Hoog, Walter Van de Velde, Bob Wielinga, R Nigel,et al. 2000.
Knowledge engineering and management: the CommonKADS methodology . MIT press.[70] Burr Settles. 2009.
Active learning literature survey . Technical Report. University of Wisconsin-Madison Department of Computer Sciences.[71] Burr Settles. 2011. Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In
Proceedings of the 2011Conference on Empirical Methods in Natural Language Processing . 1467–1478.[72] Patrice Y Simard, Saleema Amershi, David M Chickering, Alicia Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh,Johan Verwey, Mo Wang, et al. 2017. Machine teaching: A new paradigm for building machine learning systems. arXiv preprint arXiv:1707.06742 (2017).[73] Alison Smith, Varun Kumar, Jordan Boyd-Graber, Kevin Seppi, and Leah Findlater. 2018. Closing the loop: User-centered design and evaluation of ahuman-in-the-loop topic modeling system. In . 293–304.[74] Alison Smith-Renner, Varun Kumar, Jordan Boyd-Graber, Kevin Seppi, and Leah Findlater. 2020. Digging into user control: perceptions of adherenceand instability in transparent models. In
Proceedings of the 25th International Conference on Intelligent User Interfaces . 519–530.[75] Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In
Advances in neural information processingsystems . 4077–4087.[76] Simone Stumpf, Vidya Rajaram, Lida Li, Margaret Burnett, Thomas Dietterich, Erin Sullivan, Russell Drummond, and Jonathan Herlocker. 2007.Toward harnessing user feedback for machine learning. In
Proceedings of the 12th international conference on Intelligent user interfaces . 82–91.[77] Justin Talbot, Bongshin Lee, Ashish Kapoor, and Desney S Tan. 2009. EnsembleMatrix: interactive visualization to support machine learning withmultiple classifiers. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems . 1283–1292.[78] Loren G Terveen, Peter G Selfridge, and M David Long. 1995. Living design memory: framework, implementation, lessons learned.
Human-ComputerInteraction
10, 1 (1995), 1–37.[79] Rafaella Vale, Rafael Dueire Lins, and Rafael Ferreira. 2020. An Assessment of Sentence Simplification Methods in Extractive Text Summarization. In
Proceedings of the ACM Symposium on Document Engineering 2020 (Virtual Event, CA, USA) (DocEng ’20) . Association for Computing Machinery,New York, NY, USA, Article 9, 9 pages. https://doi.org/10.1145/3395027.3419588[80] April Yi Wang, Anant Mittal, Christopher Brooks, and Steve Oney. 2019. How Data Scientists Use Computational Notebooks for Real-TimeCollaboration.
Proceedings of the ACM on Human-Computer Interaction
3, CSCW (2019), 1–30.[81] April Yi Wang, Zihan Wu, Christopher Brooks, and Steve Oney. 2020. Callisto: Capturing the “Why” by Connecting Conversations with ComputationalNarratives. In
Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20) . ACM.17 . Park et al. [82] Chengyu Wang, Xiaofeng He, and Aoying Zhou. 2017. A Short Survey on Taxonomy Learning from Text Corpora: Issues, Resources and RecentAdvances. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing . Association for Computational Linguistics,Copenhagen, Denmark, 1190–1203. https://doi.org/10.18653/v1/D17-1123[83] Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, and Ian H Witten. 2001. Interactive machine learning: letting users build classifiers.
International Journal of Human-Computer Studies
55, 3 (2001), 281–292.[84] Tongshuang Wu, Daniel S Weld, and Jeffrey Heer. 2019. Local Decision Pitfalls in Interactive Machine Learning: An Investigation into FeatureSelection in Sentiment Analysis.
ACM Transactions on Computer-Human Interaction (TOCHI)
26, 4 (2019), 1–27.[85] Chi-Lan Yang, Chien Wen Yuan, and Hao-Chuan Wang. 2019. When Knowledge Network is Social Network: Understanding Collaborative KnowledgeTransfer in Workplace.
Proceedings of the ACM on Human-Computer Interaction
3, CSCW (2019), 1–23.[86] Qian Yang, Aaron Steinfeld, and John Zimmerman. 2019. Unremarkable ai: Fitting intelligent decision support into critical, clinical decision-makingprocesses. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems . 1–11.[87] Amy X Zhang, Michael Muller, and Dakuo Wang. 2020. How do Data Science Workers Collaborate? Roles, Workflows, and Tools. arXiv preprintarXiv:2001.06684 (2020).[88] Hao Zhang, Zhiting Hu, Yuntian Deng, Mrinmaya Sachan, Zhicheng Yan, and Eric Xing. 2016. Learning Concept Taxonomies from Multi-modal Data.In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Association for ComputationalLinguistics, Berlin, Germany, 1791–1801. https://doi.org/10.18653/v1/P16-1169[89] Jiawei Zhang, Yang Wang, Piero Molino, Lezhi Li, and David S Ebert. 2018. Manifold: A model-agnostic framework for interpretation and diagnosisof machine learning models.
IEEE transactions on visualization and computer graphics
25, 1 (2018), 364–373.[90] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-Level Convolutional Networks for Text Classification . arXiv:1509.01626 [cs] (Sept.2015). arXiv:1509.01626 [cs][91] Xiaomu Zhou, Mark Ackerman, and Kai Zheng. 2011. CPOE workarounds, boundary objects, and assemblages. In