Towards Teachable Conversational Agents
Nalin Chhibber
Department of Computer Science, University of Waterloo, Waterloo, Canada
[email protected]
Edith Law
Department of Computer Science, University of Waterloo, Waterloo, Canada
[email protected]
Abstract
The traditional process of building interactive machine learning systems can be viewed as a teacher-learner interaction scenario where the machine-learners are trained by one or more human-teachers. In this work, we explore the idea of using a conversational interface to investigate the interaction between human-teachers and interactive machine-learners. Specifically, we examine whether teachable AI agents can reliably learn from human-teachers through conversational interactions, and how this learning compares with traditional supervised learning algorithms. Results validate the concept of teachable conversational agents and highlight the factors relevant for the development of machine learning systems that intend to learn from conversational interactions.
Introduction

Recent progress in artificial intelligence has resulted in the development of intelligent agents that can direct their activities towards achieving a goal. Moreover, rapidly advancing infrastructure around conversational technologies has resulted in a wide range of applications around these agents, including intelligent personal assistants (like Alexa, Cortana, Siri, and Google Assistant), guides in public places (like Edgar [1], Ada and Grace [2]), smart-home controllers [3], and virtual assistants in cars [4]. This growing ecosystem of applications supporting conversational capabilities has the potential to affect all aspects of our lives, including healthcare, education, work, and leisure. Consequently, agent-based interactions have attracted a lot of attention from various research communities [5–8, 3]. The success of these agents will depend on their ability to efficiently learn from non-expert humans in a natural way.

In this paper, we explore the idea of using conversational interactions to incorporate human feedback in machine learning systems. We evaluate this concept through a crowdsourcing experiment where humans teach text classification to a conversational agent, with the assumption that the agent will assist them with annotations at a later time. Overall, this paper contributes towards a larger goal of using conversations as a possible interface between humans and machine learning systems, with the following key contributions:

• The idea of leveraging conversational interactions to tune the performance of machine learning systems, which can be extended to personalize assistants in the future.
• An interactive machine learning algorithm that learns from human feedback, and considers statistical as well as user-defined likelihoods of words for text classification.
Related Work

Traditional machine learning systems that only make data-driven predictions tend to be only as good as the quality of the training data. However, the data itself may suffer from various biases and may not accurately represent all human-specific use-cases. Interactive machine learning attempts to overcome this by involving users in the process of training and optimizing the machine learning models. By allowing rapid, focused, and incremental updates to the model, it enables users to interactively examine the impact of their actions and adapt subsequent inputs. In essence, interactive machine learning is a way to allow meaningful human feedback to guide machine learning systems. One of the earliest works in this area is from Ankerst et al., who worked on an interactive visualization of classification trees [9]. They created an interface that provides sliders to adjust the number of features or threshold values for each node in the decision tree, and interactively displays the classification error. Ware et al. [10] demonstrated that humans can produce better classifiers than traditional automatic techniques when assisted by a tool that provides visualizations about the operation of specific machine learning algorithms. Fails and Olsen studied the difference between classical and interactive machine learning and introduced an interactive feature selection tool for image recognition [11]. Von Ahn et al. introduced reCAPTCHA as a human computation system for transcribing old books and newspapers for which OCR was not very effective [12]. Fiebrink et al. created a machine learning system that enables people to interactively create novel gesture-based instruments [13]. Their experiments found that as users trained their respective instruments, they also got better and even adjusted their goals to match the observed capabilities of the machine learner.
These examples illustrate how rapid, focused, and incremental interaction cycles can facilitate end-user involvement in the machine-learning process. Porter et al. [14] formally break down the interactive machine-learning process into three dimensions: task decomposition, training vocabulary, and training dialogue. These dimensions define the level of coordination, type of input, and level/frequency of interaction between the end-users and machine learners. Later, Amershi et al. examined the role of humans in interactive machine learning, and highlighted various areas where humans have interactively helped machine learning systems to solve a problem [15]. Their case study covered various situations where humans were seen as peers, learners, or even teachers while engaging with interactive systems across different disciplines like image segmentation and gestured interactions. A form of interactive machine learning has been studied under apprenticeship learning (also called learning by watching, imitation learning, or learning from demonstration), where an expert directly demonstrates the task to machine learners rather than telling them the reward function [16]. However, this is tangential to our current work, as we specifically focus on providing active guidance through conversational interactions instead of passive demonstrations. A special case of interactive machine learning is active learning, which focuses on improving the machine learner's performance by actively querying a human oracle to obtain labels [17]. However, several studies reveal that active learning can cause problems when applied to truly interactive settings [18–20]. Simard et al. [21] formalize the role of teachers as those who transfer knowledge to learners in order to generate useful models.
Past work on algorithmic teaching shows that while human teachers can significantly improve the learning rate of a machine learning algorithm [22–24], they often do not spontaneously generate optimal teaching sequences, as human teaching is mostly optimized for humans, not for machine learning systems. Cakmak et al. examined several ways to elicit good teaching from humans for machine learners [25]. They proposed the use of teaching guidance in the form of algorithms or heuristics. Algorithmic guidance is mostly studied under algorithmic teaching [22], and aims to characterize the teachability of concepts by exploring compact (polynomial-size) representations of instances in order to avoid enumerating all possible example sequences. On the other hand, heuristic-based guidance aims to capture the intuition of an optimal teacher and enable them to approximate the informativeness of examples for the learner. While algorithmic guidance can provide guaranteed optimality bounds, heuristic-based guidance is easier to understand and use [25]. Consequently, recent work in this area has started focusing on the human-centric part of these interactive systems, such as the efficacy of human-teachers, their interaction with data, as well as ways to scale interactive machine learning systems with the complexity of the problem or the number of contributors [21]. However, these solutions have not been studied within the context of conversational systems.
In this work, we introduce a teachable agent that learns to classify text using human feedback from conversational interactions. In this section, we describe the task environment used for teaching the agent, the architecture of the dialogue system, and finally the learning mechanism of the agent.

Task Environment
Figure 1: Task environment: (a) teaching interface, (b) testing interface

The teachable agent was deployed as a textual conversational bot embedded into a web-based learning environment. In the task interface, participants read an article and converse with the conversational agent to teach it how to classify that article. There were two modes, teaching and testing, as described in Figure 1. In the teaching mode, while reading the article, participants could enter or highlight words to explain why an article should be classified in a particular way (Figure 1a). The agent asked questions to the human-teacher and revealed what it did not understand about the topic, or what else it wanted to know. In answering the agent's questions, the human teachers were prompted to reflect on their own knowledge. The assumption was that through this process, human teachers may gain a better understanding of how to perform the classification task themselves. Every human teacher taught their own agent. In the testing mode, participants could present new articles to the teachable agent and ask it to classify them in real-time based on what it had learned from the conversational interaction (Figure 1b). After the agent's prediction, correctly classified articles were coloured green by the system, whereas incorrectly classified articles were coloured red. During the entire interaction, participants were encouraged to frequently test the agent to assess their teaching performance and how well the agent was handling unseen examples.
Dialogue System

The agent's dialogue system was designed using a conversational tree, a branching data structure where each node represents a place where a conversation may branch, based on what the user says [26]. Edges in a conversational tree can be traversed backward or forward because of the nature of conversational interaction; for example, the traversal is backwards if the agent is asked to repeat a sentence. Besides the conversational tree, the state of the conversation was maintained using a hierarchical state machine. The top-most level of this hierarchy was the split between the learning and testing modes. In the learning mode, the teachable agent was focused on learning new features through conversations related to a given topic, whereas in the testing mode, the agent predicted the category of unseen articles and asked for more samples from the human teachers. Each of these modes further contained multiple contexts that defined the agent's current understanding of the relevance of features. The agent could switch between different contexts in order to capture new features that were relevant or irrelevant to the topic under discussion. This switching between different contexts was made possible by explicit user actions, as well as intent identification. For the latter, we used a rule-based approach to identify different intents during the conversational interactions. In addition, we also developed agent strategies, loosely consistent with Speech Act theory [27], that direct the user to ask about content within the agent's dialogue system repertoire. In certain cases in which no input was recognized, the agent would default to one of several fallback options: asking users to paraphrase, asking them to repeat, or simply ignoring the input and moving to the next article. Sample conversational interactions during the teaching and testing modes are shown in Figures 2a and 2b respectively.

Table 1 summarizes the different types of heuristic teaching guidance that the human teacher can provide.
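A minimal sketch of this dialogue control, assuming hypothetical node prompts and keyword rules (the node names, intents, and rules below are illustrative, not taken from the paper's implementation):

```python
# Sketch: a conversational tree traversed forward/backward, with a rule-based
# intent matcher deciding which branch to take. All prompts and rules are
# invented for illustration.

class DialogueNode:
    def __init__(self, prompt):
        self.prompt = prompt      # what the agent says at this node
        self.parent = None        # enables backward traversal ("repeat")
        self.children = {}        # intent -> next DialogueNode

    def add_child(self, intent, node):
        node.parent = self
        self.children[intent] = node
        return node

# Rule-based intent identification: the first matching keyword rule wins.
INTENT_RULES = [
    ("repeat", ["repeat", "say that again"]),
    ("teach",  ["relevant", "describes", "keyword"]),
    ("test",   ["classify", "predict"]),
]

def identify_intent(utterance):
    text = utterance.lower()
    for intent, keywords in INTENT_RULES:
        if any(k in text for k in keywords):
            return intent
    return "fallback"  # ask to paraphrase, repeat, or skip the article

def step(current, utterance):
    intent = identify_intent(utterance)
    if intent == "repeat" and current.parent is not None:
        return current.parent  # backward traversal along the tree
    return current.children.get(intent, current)

root = DialogueNode("Which words are most relevant to this category?")
teach = root.add_child("teach", DialogueNode("Got it. Any words that are irrelevant?"))

node = step(root, "I think 'election' is a relevant keyword")
print(node.prompt)  # -> Got it. Any words that are irrelevant?
```

The hierarchical state machine described above would sit one level higher, selecting which tree (learning or testing mode, and which context within each) `step` operates on.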
We identified these three teaching heuristics based on MacGregor [28], who proposed teaching heuristics for optimizing classification algorithms. Features identified through these heuristics were meant to supplement the classifier by proposing new features, amplifying relevant ones, or discounting the irrelevant ones.

Figure 2: Interaction with the agent during (a) teaching, and (b) testing mode

Table 1: Three types of heuristic teaching guidance

Heuristic | Description | Conversational Guidance
Externally relevant words | Words 'outside' the text that will most likely describe the category | "Can you tell me a few more words that should describe the category but are not in the text?"
Internally relevant words | Words from the text that are most relevant to the category | "I wonder which words are most relevant while categorizing this text to the category?"
Internally irrelevant words | Words from the text that are least relevant to the category | "Which words are least relevant while categorizing this text to the category?"
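The three guidance types in Table 1 can be captured as keyword sets tagged by heuristic type and category; a hypothetical sketch (the function name, store layout, and example replies are ours, not the paper's):

```python
# Sketch: store each teacher reply as a set of keywords keyed by (category,
# heuristic), so the classifier can later amplify or discount those features.
from collections import defaultdict

HEURISTICS = ("externally_relevant", "internally_relevant", "internally_irrelevant")

def record_guidance(store, heuristic, category, reply):
    """Add the comma-separated words of a teacher's reply to the store."""
    assert heuristic in HEURISTICS
    words = {w.strip().lower() for w in reply.split(",") if w.strip()}
    store[(category, heuristic)] |= words
    return words

store = defaultdict(set)
record_guidance(store, "internally_relevant", "Sports", "stadium, referee, goal")
record_guidance(store, "internally_irrelevant", "Sports", "yesterday, reported")
record_guidance(store, "externally_relevant", "Sports", "tournament")

print(sorted(store[("Sports", "internally_relevant")]))
# -> ['goal', 'referee', 'stadium']
```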
Learning Mechanism

The agent learns to classify articles using an enhanced version of the Naive Bayes algorithm that incorporates human teaching as additional input. Naive Bayes is a generative classifier, which computes the posterior probability P(y|x) (i.e., the probability of a class y given data x); for text classification, the assumption is that the data is a bag of words and that the presence of a particular word in a class is independent of the presence of other words in that class. One advantage of Naive Bayes, especially in the context of interactive teaching, is that it can be trained quickly and incrementally. Formally, the Naive Bayes model can be expressed as:

P(C_k | w_1, w_2, ..., w_n) ∝ P(C_k) ∏_{i=1}^{n} P(w_i | C_k)    (1)

Here the variable C_k represents a document class from (World, Sports, Business, or SciTech) and w = (w_1, w_2, ..., w_n) are the individual words from the respective document. Naive Bayes is known to perform well for many classification tasks even when the conditional independence assumption on which it is based is violated [29]. However, many researchers have tried to boost its classification accuracy by relaxing this conditional independence assumption through locally weighted learning methods [30, 31]. We adopt a similar idea of relaxing the feature independence assumption by considering the relevant and irrelevant features (conversational keywords) that a human teacher mentions during interaction on a particular topic. We infer the class of a test document by considering its constituent words, as well as similar conversational keywords captured from the teaching conversation. Given the set of words in a test document, the conditional probability of those words in the training data under respective classes is represented as P(w_i | C_k), and the conditional probability of conversational keywords that are similar to the words in the corpus is represented as P(s_i | C_k).
P(s_i | C_k) = (count of word_i in the test document for class C_k) / (total count for C_k)    (2)

To determine the similarity between conversational keywords and words from the test document, we used the cosine similarity of their vector representations as a proxy for semantic closeness. Cosine similarity has a range between -1 and 1, with negative values indicating dissimilar word vectors, and positive values indicating greater similarity between the word vectors. These word vectors were obtained using a Word2vec model: a shallow neural network that is trained to reconstruct the linguistic contexts of words in vector space [32]. We used 300-dimension word vectors trained on 300,000 words from the Google News dataset, cross-referenced with English dictionaries. Conversational keywords for which the similarity coefficient was below a threshold (e.g., 0.2) were not considered in (2). Having determined the set of conversational keywords that are similar to the document words, we modify the posterior probability in two different ways:

Case 1: Without supervised pre-training. In this case, the posterior probability is inferred only from the conditional probability of the conversational keywords captured during teaching. Thus, equation (1) can be expressed as:

P(C_k | w_1, ..., w_n, s_1, ..., s_n) ∝ P(C_k) ∏_{i=1}^{n} P(s_i | C_k)    (3)

Case 2: With supervised pre-training. In this case, the posterior probability is inferred from both the conditional probability of the conversational keywords captured during teaching and the conditional probability of the words in the original corpus.
Thus, equation (1) can be expressed as:

P(C_k | w_1, ..., w_n, s_1, ..., s_n) ∝ P(C_k) ∏_{i=1}^{n} P(w_i | C_k) P(s_i | C_k)    (4)

Note that the conditional probability of a word appearing in the training corpus, P(w_i | C_k), and the conditional probability of similar words being discussed during the conversational interaction, P(s_i | C_k), are considered as two independent events, and hence their combined probability can be expressed as the product of the individual probabilities. To get the final classification, we output the class with the highest posterior probability. For equations (3) and (4), this can be calculated as y = argmax_{C_k} P(C_k) ∏_{i=1}^{n} P(s_i | C_k) and y = argmax_{C_k} P(C_k) ∏_{i=1}^{n} P(w_i | C_k) P(s_i | C_k), respectively.

Experiment

We conducted a formative experiment to investigate whether humans can interactively teach text classification to conversational agents. We validate this by comparing the performance of the underlying Naive Bayes algorithm with and without supervised pre-training, as well as against baseline text classification algorithms with no human feedback.
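The learning mechanism above, similarity filtering followed by the modified posterior of equation (4), can be sketched as follows. Toy two-dimensional vectors stand in for the 300-dimension Word2vec embeddings, and all counts, likelihoods, and keywords are invented for illustration:

```python
# Sketch of the interactive Naive Bayes scoring (equations (2)-(4)), computed
# in log space for numerical stability. All data here is toy data.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy stand-ins for the pretrained word embeddings (assumption, not real vectors).
VEC = {
    "match":  (0.9, 0.1), "game":   (0.8, 0.2),
    "stocks": (0.1, 0.9), "market": (0.2, 0.8),
}

def similar_keywords(doc_words, conv_keywords, threshold=0.2):
    """Keep conversational keywords whose cosine similarity to some document
    word exceeds the threshold (0.2, as in the paper)."""
    kept = set()
    for k in conv_keywords:
        for w in doc_words:
            if k in VEC and w in VEC and cosine(VEC[k], VEC[w]) > threshold:
                kept.add(k)
    return kept

def posterior(doc_words, conv_keywords, prior, word_lik, conv_lik):
    """Case 2 (eq. 4): log P(C_k) + sum log P(w_i|C_k) + sum log P(s_i|C_k),
    with a small floor (1e-6) for unseen words."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        s += sum(math.log(word_lik[c].get(w, 1e-6)) for w in doc_words)
        s += sum(math.log(conv_lik[c].get(k, 1e-6))
                 for k in similar_keywords(doc_words, conv_keywords))
        scores[c] = s
    return max(scores, key=scores.get)  # class with highest posterior

prior    = {"Sports": 0.5, "Business": 0.5}
word_lik = {"Sports": {"match": 0.3, "game": 0.2},
            "Business": {"stocks": 0.3, "market": 0.2}}
conv_lik = {"Sports": {"game": 0.4}, "Business": {}}

print(posterior(["match", "game"], ["game"], prior, word_lik, conv_lik))
# -> Sports
```

Dropping the `word_lik` term from the score recovers Case 1 (equation (3)), where only the conversational keywords drive the prediction.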
We recruited sixty crowdworkers from Amazon Mechanical Turk (10 females, 50 males), 23 to 53 years old (M=30.9, SD=5.29). The study was conducted by posting Human-Intelligence-Tasks (HITs) with the title: "Teach How to Classify News Articles to a Chatbot". 87% of the participants were native English speakers, but all reported some prior experience with conversational agents on a 7-point scale (M=5.76, SD=1.15). 53.4% of the participants reported prior experience in teaching a classification task to someone else; the other half had no prior experience in teaching (46.6%). Regarding prior knowledge of the given news categories, participants rated themselves highest for World (M=5.85, SD=1.20), followed by SciTech (M=5.63, SD=1.27), Business (M=5.55, SD=1.47) and Sports (M=5.07, SD=1.78). The experiment took approximately 20-30 minutes to complete.

After accepting the HIT, providing consent, and completing the demographic questionnaire, participants were given a short tutorial on the task interface. During the main phase of the experiment, there were 20 articles to teach, equally distributed across all four news categories. Participants were expected to teach at least one word from each article and were also allowed to switch between teaching and testing modes in order to check their agent's performance. During the teaching process, the agent asked questions that participants would answer in order to teach it how to classify articles into one of the four categories. In the test mode, the agent would predict the category of unseen articles based on words that were taught during the teaching interaction. Participants were free to switch between the "Teach" and "Test" modes by clicking the respective buttons below the chatbox. Articles for text classification were sampled from a subset of the AG News Classification Dataset [33], with the 4 largest classes representing the topics World, Sports, Business and SciTech. Each class contained 30,000 training samples and 1,900 testing samples.
The total number of training samples in the dataset was 120,000 and the number of test samples was 7,600. We used standard data pre-processing techniques including tokenization, stop-word removal and lemmatization. Tokenization was done using the word_tokenize() function from NLTK, which splits raw sentences into separate word tokens. This was followed by a text normalization step where we converted individual tokens into lowercase to maintain consistency during training and prediction. Stop-word filtering was also done using NLTK, to filter out words that did not contain vital information for text classification. Finally, we used WordNetLemmatizer with part-of-speech tags to obtain the canonical form (lemmas) of the tokens. Conversion of tokens to their base form was done to reduce the language inflections from words expressing different grammatical categories, including tense, case, voice, aspect, person, number, gender, and mood.

Throughout the study, a total of 31,199 dialogues were exchanged between the sixty users (12,020) and the conversational agent (19,179), with an average of 520 total dialogues per session (200.3 by the user and 319.6 by the conversational agent). The average F1-score of the agent was 0.48 (SD=0.15). Participants' background did not show any significant impact on their agent's F1-score, but as the number of dialogues exchanged by the participants increased, their agent's performance also significantly increased (β = 0. , t(56) = 3. , p < . ). Native English speakers tended to exchange more dialogues than non-native English speakers throughout the experiment (β = −. , t(54) = −. , p = . ). There was also a significant increase in the F1-score with an increasing number of agent tests (β = 0. , t(56) = 4. , p < . ). However, the overall F1-score seemed to decrease when more external words were taught (β = −. , t(55) = −. , p = 0. ).
Figure 3: Change in F1-scores of the agent when taught by the 3 (a) most successful and (b) least successful crowdworkers, with no supervised pre-training of the interactive Naive Bayes classifier

We calculated the classification performance of the agent after each news article that was discussed during the conversational interaction. Although the classifier was trained online on the keywords captured from conversations on an article, along with the keywords captured from all previous conversations, the performance was calculated "offline" on the entire test set of 7,600 articles from the AG News Dataset, treating each individual article as an epoch. For this, we used the interactive variant of the Multinomial Naive Bayes classifier as described in equation (3). Since the classifier was used without supervised pre-training, the initial performance was around 20% before the interaction. After the interaction, some of the most successful crowdworkers were able to increase the performance of the agent to around 70%, while for the least successful ones, the performance decreased to 10%. Results indicate that the final performance of the classifier varied significantly across different participants. We did not find a direct correlation between the number of words taught and the classification performance. This indicates that the quantity of the words captured alone does not impact the classifier's performance. Figure 3 shows the progression of F1-score with each article for the three most successful and least successful teachers who trained an interactive machine learner without supervised pre-training.

Table 2: Comparison of baseline classifiers with interactive variants of Naive Bayes with supervised pre-training, for the best teacher, worst teacher and all teachers.
Model | Precision | Recall | F1-Score

Without Teachers (Baseline)
Bernoulli Naive Bayes | 0.8626 | 0.8584 | 0.8593
Multinomial Naive Bayes | 0.8899 | 0.8902 | 0.8900

Best Teacher
Interactive Bernoulli Naive Bayes
Interactive Multinomial Naive Bayes

Worst Teacher
Interactive Bernoulli Naive Bayes | 0.8145 | 0.8247 | 0.8196
Interactive Multinomial Naive Bayes | 0.8729 | 0.8709 | 0.8719

All Teachers
Interactive Bernoulli Naive Bayes | 0.8532 | 0.8578 | 0.8558
Interactive Multinomial Naive Bayes | 0.8847 | 0.8830 | 0.8838

Next, we investigated the classifier's performance for the other interactive variants of Naive Bayes, with supervised pre-training, as described in equation (4). These results were obtained "offline", by simulating the learning conditions after the experiment. Both the statistical likelihood of words from relevant classes and the user-defined likelihood obtained from conversations were used to calculate the posterior probability of test documents. The classification performance of the interactive variants of Naive Bayes was compared with the two baselines for Bernoulli Naive Bayes and Multinomial Naive Bayes respectively. The comparison was made between the most successful, least successful, and combination of all crowdworkers who taught the teachable agent during the experiment. Surprisingly, the combined effect of teaching from all the participants seemed to reduce the overall performance of the learner in an interactive conversational setting. Precision, recall and F1 scores for all interactive variants are given in Table 2.
Discussion

In this work, we described the concept of leveraging conversational interactions as an interface between humans and an interactive machine learning system. We found that the performance of the agent improved with an increase in the number of dialogues exchanged by participants and the number of times it was tested during the session. This implies that participants who were concerned about their agent's performance, as evidenced by repeated testing, were more successful in training the agent on the news classification task. Further, the classification performance of the agent seemed to degrade when it was taught more external words from outside the given article. An interesting finding is that the combined effect of teaching from all the crowdworkers may actually reduce the overall performance of the learner in an interactive conversational setting (Table 2). This indicates that learning from many sources may hurt the performance of the learner if the proportion of ineffective teachers is significantly higher than that of effective ones, and teaching from effective and ineffective sources is not easily distinguishable. It was also observed that native English speakers tended to exchange more dialogues throughout the experiment. This implies that localization of dialogue systems is useful for longer engagement.

Limitations and Future Work
The performance of our proposed interactive machine learning algorithm depends on the cosine similarities obtained from the vector representations of words. We used a compressed variant of Word2vec trained on a smaller dataset for performance reasons, which limits the quality of the word embeddings used. Future investigations can focus on contextual embeddings (like BERT) trained on more relevant and richer datasets for better outcomes. Further, results from the experiment show that effective human teaching leads to better machine-learners. However, it remains unclear what characteristics are specific to a good teacher and which factors influence the quality of teaching. Moreover, it will be interesting to explore different modalities of interaction with teachable agents beyond a textual conversational interface. Follow-up experiments may involve the use of voice-based agents or embodied agents like physical robots to validate the results in different contexts. Finally, while the proposed algorithm focuses on transparency by using a Naive Bayes classifier as the baseline machine learning model, it remains unclear how the idea of teachable conversational agents will extend to state-of-the-art systems. Future work can investigate how human feedback through conversational interactions can be used to improve machine learners based on modern deep learning architectures.

In conclusion, this paper takes one step in the direction of building teachable conversational agents and understanding how they learn from human teachers. Understanding the various nuances across these facets will be useful for building interactive machine learners that aim to reliably learn through conversational interactions.
References

[1] Pedro Fialho, Luísa Coheur, Sérgio Curto, Pedro Cláudio, Ângela Costa, Alberto Abad, Hugo Meinedo, and Isabel Trancoso. Meet Edgar, a tutoring agent at Monserrate. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 61–66, 2013.
[2] David Traum, Priti Aggarwal, Ron Artstein, Susan Foutz, Jillian Gerten, Athanasios Katsamanis, Anton Leuski, Dan Noren, and William Swartout. Ada and Grace: Direct interaction with museum visitors. In International Conference on Intelligent Virtual Agents, pages 245–251. Springer, 2012.
[3] Alex Sciuto, Arnita Saini, Jodi Forlizzi, and Jason I Hong. Hey Alexa, what's up?: A mixed-methods study of in-home conversational agent usage. In Proceedings of the 2018 Designing Interactive Systems Conference, pages 857–868. ACM, 2018.
[4] Giuseppe Lugano. Virtual assistants and self-driving cars. In , pages 1–5. IEEE, 2017.
[5] Justine Cassell. More than just another pretty face: Embodied conversational interface agents. Communications of the ACM, 43(4):70–78, 2000.
[6] Dominic W Massaro, Michael M Cohen, Sharon Daniel, and Ronald A Cole. Developing and evaluating conversational agents. In Human Performance and Ergonomics, pages 173–194. Elsevier, 1999.
[7] Ewa Luger and Abigail Sellen. Like having a really bad PA: The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 5286–5297. ACM, 2016.
[8] Irene Lopatovska, Katrina Rink, Ian Knight, Kieran Raines, Kevin Cosenza, Harriet Williams, Perachya Sorsche, David Hirsch, Qi Li, and Adrianna Martinez. Talk to me: Exploring user interactions with the Amazon Alexa. Journal of Librarianship and Information Science, page 0961000618759414, 2018.
[9] Mihael Ankerst, Christian Elsen, Martin Ester, and Hans-Peter Kriegel. Visual classification: An interactive approach to decision tree construction. In KDD, volume 99, pages 392–396, 1999.
[10] Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, and Ian H Witten. Interactive machine learning: Letting users build classifiers. International Journal of Human-Computer Studies, 55(3):281–292, 2001.
[11] Jerry Alan Fails and Dan R Olsen Jr. Interactive machine learning. In Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 39–45. ACM, 2003.
[12] Luis Von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.
[13] Rebecca Fiebrink, Perry R Cook, and Dan Trueman. Human model evaluation in interactive supervised learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 147–156. ACM, 2011.
[14] Reid Porter, James Theiler, and Don Hush. Interactive machine learning in data exploitation. Computing in Science & Engineering, 15(5):12–20, 2013.
[15] Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. Power to the people: The role of humans in interactive machine learning. AI Magazine, 35(4):105–120, 2014.
[16] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
[17] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
[18] Maya Cakmak and Andrea L Thomaz. Optimality of human teachers for robot learners. In , pages 64–69. IEEE, 2010.
[19] Maya Cakmak, Crystal Chao, and Andrea L Thomaz. Designing interactions for robot active learners. IEEE Transactions on Autonomous Mental Development, 2(2):108–118, 2010.
[20] Andrew Guillory and Jeff A Bilmes. Simultaneous learning and covering with adversarial noise. 2011.
[21] Patrice Y Simard, Saleema Amershi, David M Chickering, Alicia Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh, Johan Verwey, Mo Wang, et al. Machine teaching: A new paradigm for building machine learning systems. arXiv preprint arXiv:1707.06742, 2017.
[22] Frank J Balbach and Thomas Zeugmann. Recent developments in algorithmic teaching. In International Conference on Language and Automata Theory and Applications, pages 1–18. Springer, 2009.
[23] H David Mathias. A model of interactive teaching. Journal of Computer and System Sciences, 54(3):487–501, 1997.
[24] Sally A Goldman and Michael J Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.
[25] Maya Cakmak and Andrea L Thomaz. Eliciting good teaching from humans for machine learners. Artificial Intelligence, 217:198–215, 2014.
[26] Ernest Adams. Fundamentals of Game Design. Pearson Education, 2014.
[27] John R Searle, Ferenc Kiefer, Manfred Bierwisch, et al. Speech Act Theory and Pragmatics, volume 10. Springer, 1980.
[28] James N MacGregor. The effects of order on learning classifications by example: Heuristics for finding the optimal order. Artificial Intelligence, 34(3):361–370, 1988.
[29] Pedro Domingos and Michael Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proc. 13th Intl. Conf. Machine Learning, pages 105–112, 1996.
[30] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning. In Lazy Learning, pages 11–73. Springer, 1997.
[31] Eibe Frank, Mark Hall, and Bernhard Pfahringer. Locally weighted naive Bayes. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pages 249–256. Morgan Kaufmann Publishers Inc., 2002.
[32] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[33] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, 2015.