Conceptual Metaphors Impact Perceptions of Human-AI Collaboration
Pranav Khadpe, Ranjay Krishna, Li Fei-Fei, Jeffrey Hancock, Michael Bernstein
PRANAV KHADPE, Stanford University, USA
RANJAY KRISHNA, Stanford University, USA
LI FEI-FEI, Stanford University, USA
JEFFREY T. HANCOCK, Stanford University, USA
MICHAEL S. BERNSTEIN, Stanford University, USA

With the emergence of conversational artificial intelligence (AI) agents, it is important to understand the mechanisms that influence users' experiences of these agents. In this paper, we study one of the most common tools in the designer's toolkit: conceptual metaphors. Metaphors can present an agent as akin to a wry teenager, a toddler, or an experienced butler. How might a choice of metaphor influence our experience of the AI agent? Sampling a set of metaphors along the dimensions of warmth and competence—defined by psychological theories as the primary axes of variation for human social perception—we perform a study (N = …) where we manipulate the metaphor, but not the behavior, of a Wizard-of-Oz conversational agent. Following the experience, participants are surveyed about their intention to use the agent, their desire to cooperate with the agent, and the agent's usability. Contrary to the current tendency of designers to use high competence metaphors to describe AI products, we find that metaphors that signal low competence lead to better evaluations of the agent than metaphors that signal high competence. This effect persists despite both high and low competence agents featuring identical, human-level performance and the wizards being blind to condition. A second study confirms that intention to adopt decreases rapidly as the competence projected by the metaphor increases. In a third study, we assess the effects of metaphor choices on potential users' desire to try out the system, and find that users are drawn to systems that project higher competence and warmth. These results suggest that projecting competence may help attract new users, but those users may discard the agent unless it can quickly correct with a lower competence metaphor. We close with a retrospective analysis that finds similar patterns between metaphors and user attitudes towards past conversational agents such as Xiaoice, Replika, Woebot, Mitsuku, and Tay.

CCS Concepts: • Human-centered computing → Empirical studies in collaborative and social computing; Empirical studies in HCI; Human computer interaction (HCI).

Additional Key Words and Phrases: expectation shaping; adoption of AI systems; perception of human-AI collaboration; conceptual metaphors
ACM Reference Format:
Pranav Khadpe, Ranjay Krishna, Li Fei-Fei, Jeffrey T. Hancock, and Michael S. Bernstein. 2020. Conceptual Metaphors Impact Perceptions of Human-AI Collaboration. Proc. ACM Hum.-Comput. Interact. 4, CSCW2, Article 163 (October 2020), 26 pages. https://doi.org/10.1145/3415234
Fig. 1. We explore how the metaphors used to describe an AI agent, by influencing pre-use expectations, have a downstream impact on evaluations of those AI agents.
Collaboration between people and conversational artificial intelligence (AI) agents—AI systems that communicate through natural language [30]—is now prevalent. As a result, there is increasing interest in designing these agents and studying how users interact with them [1, 14, 23, 30, 49, 66]. While the technical underpinnings of these systems continue to improve, we still lack a fundamental understanding of the mechanisms that influence our experience of them. What mechanisms cause some conversational AI agents to succeed at their goals, while others are discarded? Why would Xiaoice [68] amass millions of monthly users, while the same techniques powering Tay [35] led to the agent being discontinued for eliciting anti-social troll interactions? Many AI agents have received polarized receptions despite offering very similar functionality: for example, Woebot [55] and Replika [43] continue to evoke positive user behavior, while Mitsuku [79] is often subjected to dehumanization. Even with millions of similar AI systems available online [11, 46], only a handful are not abandoned [30, 83]. The emergence of social robots and human-AI collaborations has driven home a need to understand the mechanisms that inform users' evaluations of such systems.

In HCI, experiences of a system are typically understood as being mediated by a person's mental model of that system [54]. Conveying an effective understanding of the system's behavior can enable users to build mental models that increase their desire to cooperate with the system [5, 10, 37, 41]. However, a mental model explanation is insufficient to answer the present question: in the case of Xiaoice and Tay, both agents were based on the same underlying technology from Microsoft, but they resulted in very different reactions by users. Likewise, other agents such as Replika and Mitsuku elicit very different evaluations while existing within the same cultural context. While theories of mental models and culture each help us understand how users experience conversational AI agents, we require additional theoretical scaffolding to understand the phenomenon.

An important and unexamined difference between these otherwise similar agents is the different metaphors that they project. Conceptual metaphors are short descriptions attached to a system that are suggestive of its functionality and intentions [15, 50]. For instance, Microsoft described Tay as an "AI that's got no chill" [62], while it markets Xiaoice as an "empathetic ear"—two very different metaphors. Metaphors are a central mechanism in the designer's toolkit. Unlike mental models, they offer more than just functional understandings of the system—they shape users' expectations of the system. And while most existing expectation-shaping mechanisms depend on the functionality of the specific AI system or task [41], metaphors are agnostic to the specificities of a system and can be used to shape expectations for nearly any AI system. Prior theory suggests that pre-use expectations of AI systems influence both initial behaviors [31, 40, 76] and long-term behaviors [42], even if the system itself remains unchanged while user expectations vary [58]. We propose that these metaphors are a powerful mechanism to shape expectations and mediate experiences of AI systems. If, for example, the metaphor primes people to expect an AI that is
highly competent and capable of understanding complex commands, they will evaluate the same interaction with the system differently than if they expect their AI to be less competent and to comprehend only simple commands (Figure 1). Similarly, if users expect a warm, welcoming experience, they will evaluate an AI agent differently than if they expect a colder, professional experience, even if the interaction with the agent is identical in both cases.

In this paper, we test the effect of metaphors on evaluations of AI agents. We draw on the Stereotype Content Model (SCM) from psychology [16, 24], which demonstrates that the two dimensions of warmth and competence are the principal axes of human social perception. Judgements along these dimensions provoke systematic cognitive, emotional, and behavioral reactions [16]. The SCM suggests that user expectations, and therefore evaluations, are mediated by judgements of warmth and competence. We crowdsource the labeling of a set of metaphors along these axes to identify metaphors that appear in different quadrants of the SCM, e.g., a toddler, who is high warmth and low competence, and a shrewd executive, who is low warmth and high competence.

We perform an experiment (N = …). […] In
Study 3, we test the negative effects of portraying a low competence metaphor by studying the effect that warmth and competence have on participants' interest in using the system in the first place. Finally, we discuss the implications of our findings for the choice of metaphors when designers face the dual objective of attracting more users and ensuring a positive user experience.
Pre-use expectations play a critical role in users' initial usage of a system or design [32, 40, 76]. Setting positive or negative expectations colors users' evaluation of what would otherwise be identical experiences [58]. These pre-use expectations can affect evaluations even after weeks of interaction with a service [42].

In the case of AI systems, which are often data-driven and probabilistic, there exists no simple method of setting user expectations. Providing users with performance metrics does not establish an accurate expectation of how the system behaves [41]. In the absence of effective mental models of AI systems, users instead develop folk theories — intuitive, informal theories — as expansive guiding beliefs about the system and its goals [26, 29, 45, 64].

Prior work has shown that subjective evaluations of interface agents are strongly influenced by the face, voice, and other design aspects of the agent [53, 82], beyond the agent's actual capabilities. These results motivate our study of how metaphors set expectations that affect how users view and interact with conversational AI systems. Inaccurate expectations can be consequential. Interviews have established that expectations of conversational agents such as Siri, Google Assistant, and Alexa are out of sync with the actual capabilities and performance of the systems [49, 83]. So, after repeatedly hitting the agent's capability limits, users retreat to using the agents only for menial, low-level tasks [49]. While these prior interview-based studies have demonstrated that a mismatch between user expectations and system operation is detrimental to user experience [49], they have not been able to establish causality or quantify the magnitude of this effect. This gap motivates our inquiry into the mechanisms that might shape these expectations and into measuring the effect of expectations on user experiences and attitudes. We are guided by the following research question:

Research Question: How do metaphors impact evaluations of interactions with conversational AI systems?
Conceptual metaphors are one of the most common and powerful means that a designer has to influence user expectations. We refer to a conceptual metaphor (or user interface metaphor, or just metaphor) as the understanding and expression of complex or abstract ideas using simple terms [45]. Metaphors are attached to all types of AI systems, both by designers to communicate aspects of the system and by users to express their understanding of the system. For instance, Google describes its search algorithm as a "robotic nose" [26], and YouTube users think of the recommendation algorithm as a "drug dealer" [81]. Starting with the desktop metaphor for personal computing in the Xerox Star [39], conceptual metaphors proliferated through the design of user interfaces: trash cans for deleted files, notepads for freetext notes, analog shutter clicking sounds for mobile phone cameras, and more.

Some AI agents utilize metaphors based in personas or human roles, for example an administrative assistant, a teenager, a friend, or a psychotherapist; others use metaphors grounded in other contexts, for example a Jetsons-style humanoid servant robot. Such metaphors are meant to help human-AI collaboration in complex domains by aiding users' ability to understand and predict the agent's behavior [5, 18]. Metaphors include system descriptions outside of those rooted in
human roles as well: Google describing its search algorithm as a "robotic nose" [26], and Microsoft's Zo marketed as a bot that "Will make you LOL". The notion of "metaphors" thus extends beyond conversational AI to non-anthropomorphic systems that "personas" or "roles" may be ill-equipped to describe. Metaphors are effective: they influence a person's folk theories of an AI system even before they use it [19]. Prior work has developed methods to extract the conceptual metaphors [45, 64] by which people understand AI systems and to aggregate them into underlying folk theories [26].

Metaphors impact expectations, sometimes implicitly, by activating different norms, biases, and expectations. For example, social robots that are racialized as Black or Asian are more likely to be subject to antisocial behaviour such as aggression and objectification [70]. Similarly, female-gendered robots can elicit higher levels of dehumanisation than male-gendered bots. Antisocial behavior leads to verbal disinhibition toward AI systems [69], and in some extreme cases, to physical abuse and even dismemberment [8, 61]. Female voice agents are viewed as friendlier but less intelligent [53], and users have a higher tendency to disclose information to female-gendered agents [53]. The race and gender of pedagogical agents affect learning outcomes: agents racialized as Black or female-gendered lead to improved attention and learning [6]. Beyond race and gender, agents portrayed as less intelligent, taking on roles such as "motivator" or "mentor", promote more self-efficacy than agents projected as "experts" [6]. Young, urban users respond positively to bots in the role of a "friend" that can add value to their lives by suggesting recommendations [71].

However, designers typically aim to use metaphors to affect expectations in more explicit, controlled, and pro-social ways. Most obviously, a metaphor communicates expectations of what can and cannot be done with an AI agent [39]. Just as we expect an administrative assistant to know our calendar but not to know the recipe for the best stoat sandwiches, an AI agent that communicates an "administrative assistant" metaphor projects the same skills and boundaries. In a similar vein, describing an agent as a "toddler" suggests that the agent can interact in natural language and understand some, but not all, of our communication.

While other expectation-shaping mechanisms for AI agents, such as tutorials and instructions, have been studied [41], the effect of metaphors on user expectations and evaluations has not. Our work also bridges to research suggesting that people already form metaphor-based theories of socio-technical systems [26] and suggests design implications for how designers should choose their metaphors.
As people view AI agents as social agents [59], the metaphor—and thus the nature of that agent—is likely to influence their experience. However, the literature presents two competing theories for how changes to the metaphor, and thus to expectations, will impact user evaluation of an AI system. Assimilation theory [67] states that people adapt their perceptions to match their expectations, and thus adjust their evaluations to be positively correlated with their initial expectations. (As Dumbledore points out to Snape in Harry Potter and the Deathly Hallows, "You see what you expect to see, Severus.") Assimilation theory argues that users don't perceive a difference between their pre-use expectations and actual experiences. Prior work supports that, for interactive systems, users' expectations do influence evaluations [31, 73]. For example, users rate an interactive system higher when they are shown a positive review of that system before using it, and rate the system lower if they are shown a negative review before using it [58]. Likewise, humor and other human-like characteristics that create high social intelligence expectations can be crucial in producing positive evaluations [36, 48].

Assimilation theory would predict that a metaphor signaling high competence will set positive expectations and subsequently lead to positive evaluation:
Hypothesis 1 (H1).
Positive metaphors (e.g., high competence, high warmth) will lead to higher average intention to adopt and desire to cooperate with an AI agent than no metaphor or negative metaphors.
Contrast theory [67], on the other hand, attributes user evaluations to the difference users perceive between expectations and actual experience. Contrast theory argues that we are attuned not to absolute experiences, but to differences between our expectations and our experiences: exceeding expectations results in high satisfaction, whereas falling short of expectations results in lower satisfaction. This suggests that it is beneficial to set users' initial expectations low (with practitioners reasoning in the manner of George Weasley in Harry Potter and the Order of the Phoenix: "'E' is for 'Exceeds Expectations' and I've always thought Fred and I should've got 'E' in everything, because we exceeded expectations just by turning up for the exams."). Users of conversational AI agents such as Alexa stumble onto humorous easter egg commands that raise their expectations of what the system can do, but then report disappointment at the contrast upon discovering the system's actual limits [49]. Likewise, ratings of interactive games are driven in part by contrasting players' experiences against their expectations of the game [51].

Contrast theory predicts that positive metaphors will backfire because AI agents inevitably make mistakes and have limits:

Hypothesis 2 (H2).
Positive metaphors (e.g., high competence, high warmth) will lead to lower average intention to adopt and desire to cooperate with an AI agent than no metaphor or negative metaphors.
Our research aim is to study the effect of metaphors on experiences with AI agents. We therefore seek an experimental setup where participants accomplish a task in collaboration with an AI system, while avoiding effects introduced by the idiosyncrasies of any particular AI system. We situate our method in goal-oriented conversational agents (or task-focused bots), as these systems represent a broad class of agents in research and products [17, 35, 55, 68, 78, 79].
Goal-oriented AI systems, such as those for booking flights, hotel rooms, or navigating customer service requests, have become pervasive on social media platforms including Kik, Slack, and Facebook Messenger, with as many as one million flooding the Web between 2015 and 2019 [30]. Surveys revealed that as of 2018, as many as 60% of surveyed millennials had used a chatbot [2] and 15% of surveyed internet users had used customer-service chatbots [20]. Their prevalence means that interaction with such an agent is an ecologically valid task, and that many users online are familiar with how to interact with them. We draw on a common set of transactional tasks, such as appointment booking, scheduling, and purchasing, which require people to engage with the agent in task-focused dialogue to acquire information or complete their task [27]. Inspired by the popular Maluuba Frames [4] data collection task templates, used to evaluate conversational agents in the natural language processing community, we utilize a travel planning task. More concretely, the task is a vacation planning endeavor where users must pick a vacation package that meets a set of experimenter-specified requirements through a search-compare-decide process. Specifically, every participant is presented with the following prompt:

You are considering going to New York, Berlin or Paris from Montreal. You want to travel sometime between August 23rd and September 1st. You are traveling alone. Ask for information about options available in all cities. Compare the alternatives and make your decision.
Participants were further instructed to determine what they could get for their money and to take into consideration factors they would weigh while actually planning a vacation, including wifi, breakfast options, and a spa. The task is structured around three sub-goals: finalize a hotel package, an outgoing flight, and an incoming flight back to Montreal.
We sought a conversational AI agent whose actual performance was strong enough for our results to generalize as the underlying AI models improve. So, following a pattern in prior work [12, 34, 71, 74], we adopt a Wizard-of-Oz study paradigm.

We hire and train customer-support professionals from the Upwork platform to act as wizards in our experiment and pay them their posted hourly rate of $10–$20 per hour. The wizards play the role of the conversational AI agent in the text chat. We filtered for workers who had at least a 90% job success rating from past work and had already earned $20k.

We place participants into treatment groups, each defined by the metaphor used to describe their AI collaborator. Instead of randomly sampling metaphors, we draw on the Stereotype Content Model (SCM) [16, 24], an influential psychological theory that articulates two major axes in social perception: warmth and competence. These two dimensions have proven to far outweigh others and repeatedly come up as prime factors in the literature [3, 38, 77]. Judgements on warmth and competence
are made within 100 milliseconds [72], and a change in these traits alone can wholly change impressions [84]. The SCM proposes a quadrant structure: cognitive notions of warmth and competence, better understood as discrete, are characterized as being low or high [16]. Warmth is characterised by notions such as good-naturedness and sincerity, while competence is characterised by notions of intelligence, responsibility, and skillfulness. For example, a "shrewd travel executive" can be described as high competence and low warmth. We sample metaphors such that they have either high or low values of warmth and competence.

Fig. 2. Average warmth and competence measured for the conceptual metaphors sampled for our studies. Both axes ranged from 1 to 5.

Table 1. Warmth and competence values (average ± standard deviation) for the metaphors we use across the studies: toddler, middle schooler, inexperienced teenager, young student, shrewd travel executive, recent graduate, and trained professional. High competence and warmth values are in bold.

In our first study, we use four metaphors, one in each quadrant, to study the impact of competence and warmth. We pre-tested a set of metaphors for a conversational agent, measuring the perceived competence and warmth of conversational AI agents described with these metaphors on a 5-point Likert scale. We captured 50 ratings for each metaphor from workers on Amazon Mechanical Turk, a mutually exclusive set of workers from those who would later be involved in the experiment. Based on the results (see Figure 2), we chose "trained professional travel assistant" (high competence, high warmth), "shrewd travel executive" (high competence, low warmth), "toddler" (low competence, high warmth), and "inexperienced teenager" (low competence, low warmth). We selected metaphors that were otherwise agendered, with similar socio-cultural connotations across the world, and representative of actual metaphors that could be associated with a travel assistant bot. These four metaphors form our four treatment groups in Study 1; their mean and standard deviation values of competence and warmth are reported in Table 1.

In Study 2, we follow the same procedure to characterize several additional metaphors: "middle schooler", "young student", and "recent graduate". These metaphors offer intermediate levels of competence, with "toddler" less competent than "middle schooler", "middle schooler" less competent than "young student", and "young student" less competent than "trained professional". "Young student" is associated with higher competence than "middle schooler", suggesting that people's impression of a "young student" is a high schooler or college student, somewhere between a "middle schooler" and a "recent graduate". In Study 3, we revisit the metaphors analyzed in Study 1 to understand the effects of metaphors on potential users' likelihood of trying out the system and their intentions of cooperating with it prior to use.
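As a concrete illustration of this pre-test aggregation (50 AMT Likert ratings per metaphor, summarized as mean ± standard deviation, as in Table 1), a minimal pandas sketch; the file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical ratings file: one row per AMT judgment, with columns
# "metaphor", "warmth", and "competence" on 1-5 Likert scales.
ratings = pd.read_csv("metaphor_ratings.csv")

# Mean and standard deviation per metaphor, as summarized in Table 1.
summary = ratings.groupby("metaphor")[["warmth", "competence"]].agg(["mean", "std"])
print(summary.round(2))

# A midpoint split on the 1-5 scale flags which metaphors project
# high warmth and/or high competence (the SCM quadrants).
print(ratings.groupby("metaphor")[["warmth", "competence"]].mean().gt(3.0))
```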
In our first study, we examine the effects of metaphors attached to the conversational system on users' pre-use expectations and post-use evaluations. Specifically, we examine participants' perceived pre- and post-use usability and warmth of the AI system. Additionally, we measure their post-use intention to adopt and their desire to cooperate with such a system given their treatment metaphor condition. Finally, we analyze the chat logs to explore whether there are behavioral differences between the participants in different conditions.
We perform a between-subjects experiment where participants in each treatment condition are primed with a metaphor to associate with the system. As described in the previous section, metaphors were chosen to vary as low/high warmth × low/high competence, resulting in four treatment conditions. In addition, we included a control condition where participants were not primed with a metaphor, resulting in five total conditions.

After consenting to the study, participants were introduced to one of the study conditions, i.e., they were shown one of the four metaphors, or the control condition of no metaphor:

The bot you are about to interact with is modeled after a "shrewd travel executive".
With the study condition revealed, participants were asked questions about their pre-use expectations of the AI system's competence and warmth. Next, participants were shown the goal-oriented task description and allowed to interact with the wizard, posing as a conversational agent, via the chat widget until they completed their task.

After finalizing their travel plans, participants were asked to evaluate their experience with the AI system and answer the manipulation check question. Finally, participants were debriefed, informed of the actual purpose of the study, and made aware that they were talking to a human and not an AI system. A high-level workflow is depicted in Figure 1.
User evaluation measures.
To test contrast theory, it is important to measure a user's evaluation of the experience without drawing explicit attention to the contrast between their expectations and their experience, since this makes the contrast salient [42]. So, we independently measure pre-use expectations and post-use evaluations without explicitly asking whether expectations were met or violated. To gauge participants' pre-use expectations and post-use perceptions of the system's competence and warmth, we ask participants to report how strongly, on a 5-point Likert scale (where 1 = strongly disagree and 5 = strongly agree), they agree with the following statements, both before and after they interacted with the AI system. Questions asked before use simply replaced the past tense of the verb with the future tense; question ordering was randomized. (A sketch of the composite-index construction follows this list.)

• Usability: Since our notion of a system's competence is akin to the notion of usability in previous studies, we adapt questions from previous surveys that examine usability [42]: "Using the AI system was (will be) a frustrating experience." "The AI system was (will be) easy to use." "I spent (will spend) too much time correcting things with this AI system." "The AI system met (will meet) my requirements." Responses from before using the system are combined to form a pre-use usability index (α = .91), while responses from after the conversation are combined to form a post-use usability index (α = .…).

• Warmth: To measure the warmth of the AI system, we draw on the warmth levels articulated in the stereotype content model [25]: "This AI system was (will be) good-natured." "This AI system was (will be) warm." Responses from before using the system are combined to form a pre-use warmth index (α = .…) and responses from after use to form a post-use warmth index (α = .…).

• Intention to Adopt and Desire to Cooperate: We borrow from prior work [42] that captures user evaluations through their intentions to adopt the system. Since we increasingly have situations in which humans work alongside AI systems that augment human efforts, it also becomes necessary to understand users' behavioural tendencies towards these systems. So, we draw on prior work in HRI [52] and capture users' behavioural tendencies through their desire to help and cooperate with the AI system. After their interaction with the system, participants are probed for their intentions to adopt as well as their desire to cooperate with the system. To probe intention to adopt, we asked the following two questions on 5-point Likert scales: "Based on your experience, how willing are you to continue using the service?" and "How likely is it that you will be using the service in the future?" Like previous work [42], these two questions are combined to form an intention to adopt index (α = .…). To probe desire to cooperate, we asked: "How likely would you be to cooperate with this AI?" and "How likely would you be to help this AI?" Like previous work [52], these two questions are combined to form a cooperation index (α = .…).

Conversational behavior measures. To investigate whether participant behavior changes across conditions, we include measures that analyze differences in the conversational behavior of users.

• Language measures: To measure differences in the chat logs produced by the participants and by the wizards across the various conditions, we utilize the popular linguistic dictionary Linguistic Inquiry and Word Count (LIWC) [57]. LIWC uses 82 dimensions to determine whether a text uses positive or negative emotions, self-references, and causal words, helping assess the physical and mental health, intentions, and expectations of writers. We categorize all words used by the participants and the wizards into LIWC categories and create normalized frequency histograms of these categories. We compare the words used by participants across all conditions to test for significant differences in the LIWC categories used; similarly, we compare the wizards' words across all conditions. Finally, we combine the words used by wizard and participant and check for differences between whole conversations across the different conditions.

• Conversation measures: We also investigate differences across conditions, at the level of individual messages and whole conversations, in the number of words used and the duration of interaction.
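Each composite index above is formed by combining its Likert items, with internal consistency reported as Cronbach's α. A minimal sketch of that computation, using made-up responses and illustrative column names:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of Likert items (rows = participants)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Made-up post-use responses to the two intention-to-adopt questions.
items = pd.DataFrame({
    "continue_using": [4, 5, 3, 4, 2],
    "use_in_future":  [4, 4, 3, 5, 2],
})
print(f"alpha = {cronbach_alpha(items):.2f}")
items["intention_to_adopt"] = items.mean(axis=1)  # the composite index
```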
For all the studies in this paper, we hired participants to interact with our Wizard-of-Oz AI system from Amazon Mechanical Turk (AMT). Participants were all US citizens aged 18+. Each participant was allowed to take part in the experiment only once. Participants were compensated $4 for a survey lasting an average of 15 ± … minutes. …7% of our participants were female, and the mean age of participants was 41 ± …. Assuming an effect size of f = .…, a power analysis with a significance level of 0.05, powered at 80%, indicated that we require 25 participants per condition, or 125 total participants. Thirteen participants' responses were discarded because the raters had coded their WoZ manipulation check as expressing suspicion that the agent might be human. After these exclusions, we had a sample size of 140 participants, which met the requirements from our power analysis.
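The sample-size computation can be reproduced along these lines; the effect size below is an assumed placeholder (the value reported in the paper was lost in extraction), chosen so the result lands near 25 participants per condition:

```python
from statsmodels.stats.power import FTestAnovaPower

# Required total N for a five-condition one-way ANOVA at alpha = .05
# and 80% power. f = 0.32 is an illustrative assumption, not the
# paper's reported effect size.
n_total = FTestAnovaPower().solve_power(
    effect_size=0.32, k_groups=5, alpha=0.05, power=0.80
)
print(f"total N = {n_total:.0f} (~{n_total / 5:.0f} per condition)")
```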
To ensure that our study was not compromised by participants who identified that they were speaking to a wizard instead of an AI, we included a manipulation check at the end of the survey. The manipulation check gauges whether the participant was suspicious of the AI without explicitly drawing their attention to the possibility. So, drawing on prior Wizard-of-Oz studies [33], we asked participants how they thought the system worked from a technical standpoint.

The responses were sent to two coders, who inspected each response individually and marked all responses that suspected a person was pretending to be an AI system. Participants who expressed suspicion that the system might be human were excluded from further analysis. Some participants were very confident they knew how such a conversational AI could be built: "I am a programmer so i understand the bot has a vocabulary of words it attempts to parse through, then it takes what it finds from the user and checks against a database to output information it thinks is relevant." Others talked about how "Most chat bots go through 'training' beforehand to be able to parse commonly asked questions and phrasing," or how it "must be using a database full of responses."

Out of all our participants, both coders identified the same 13 participants (κ = 1) who failed the manipulation check by calling out the agent as a human, resulting in a suspicion level of 10.…%. One such response: "I'm like sure it's not a bot, but if it were a bot, machine learning, though I don't know exactly what THAT means." These 13 participants were excluded from our analysis.
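Agreement between the two coders can be quantified as Cohen's κ; a minimal sketch with made-up binary codes (1 = the response expressed suspicion):

```python
from sklearn.metrics import cohen_kappa_score

# Made-up binary codes from the two raters over ten responses
# (1 = response expressed suspicion that the "AI" was human).
rater_a = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
rater_b = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Identical codings yield kappa = 1.0, as reported in the study.
print(cohen_kappa_score(rater_a, rater_b))
```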
We compare the impact of setting expectations by varying the competence and warmth of the metaphors used. We perform our analysis using a pair of two-way analyses of variance (ANOVAs), where competence and warmth are two categorical independent variables, and pre-use usability and pre-use warmth are the dependent variables. We also compared the conceptual metaphors used to describe our system against the control condition, to measure whether they shift participants' default expectations of conversational AI systems. The independent variables are therefore categorized into high, low, or control categories.
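This pipeline (a two-way ANOVA followed by Tukey post-hoc comparisons) could be reproduced along the following lines; the file and column names are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical long-format table: one row per participant, with factor
# labels ("high"/"low"/"control") and the pre-use usability index.
df = pd.read_csv("study1_responses.csv")  # columns: competence, warmth, pre_usability

# Two-way ANOVA on the main effects of competence and warmth.
model = ols("pre_usability ~ C(competence) + C(warmth)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# The 2x2 interaction is testable after excluding the no-metaphor control,
# which does not cross with the other factor levels.
crossed = df[df["competence"] != "control"]
interaction = ols("pre_usability ~ C(competence) * C(warmth)", data=crossed).fit()
print(sm.stats.anova_lm(interaction, typ=2))

# Tukey HSD post-hoc comparisons across competence levels.
print(pairwise_tukeyhsd(df["pre_usability"], df["competence"]))
```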
Fig. 3. (a) Metaphors that signal high competence lead to higher pre-use usability scores. (b) Similarly, metaphors that signal high warmth lead to higher pre-use warmth scores. We also notice from both (a) and (b) that participants are naturally predisposed to have high expectations of usability and warmth from conversational systems; however, priming them with metaphors reduces the variance of their expectations relative to when their expectations are uninformed.

Pre-use usability is affected by the metaphor's competence. For pre-use usability (Figure 3(a)), we find that competence has a large main effect (F(…, …) = …, p < …, η² = …). By default, participants have high expectations of competence. A post-hoc Tukey test revealed that pre-use usability was significantly lower in the low competence condition (… ± …) than in the high competence (… ± …, p < …) or control (… ± …, p = …) conditions. We found no main effects of warmth.

Pre-use warmth is likewise affected by the metaphor's warmth. For pre-use warmth (see Figure 3(b)), we find that warmth has a large main effect (F(…, …) = …, p < …, η² = …). By default, participants have high expectations of warmth. A post-hoc Tukey test revealed that pre-use warmth was significantly lower in the low warmth condition (… ± …) than in both the high warmth (… ± …, p < …) and control (… ± …, p = …) conditions. We found no main effects of competence.

Together, these tests imply that participants, by default, expect conversational AI to possess high competence and high warmth. However, the change in expectation caused by low competence and low warmth metaphors implies that conceptual metaphors do affect participants' expectations of how the AI system will perform and behave. We visualize the means and standard errors for these conditions in Figure 3(a, b).
Fig. 4. The low competence metaphor condition features the highest post-use usability, intention to adopt, and desire to cooperate. This result suggests that metaphors that undersell the AI agent's competence are most likely to succeed.

We next compare the impact of varying the competence and warmth of the metaphors on participants' post-use evaluations of the AI system's usability and warmth. We perform our analysis using a pair of two-way ANOVAs where competence and warmth are categorical independent variables, and post-use usability and post-use warmth are the two dependent variables.

Participants perceive agents with low competence to be more usable after interaction. For post-use usability (Figure 4(a)), competence has a main effect (F(…, …) = …, p < …, η² = …). Competence has a smaller effect on post-use usability than on pre-use usability, implying that the participant's actual interaction with the system affects their final evaluations. In the low competence condition, post-use usability was rated (… ± …); in the high competence condition, it was rated (… ± …). These results suggest that users perceive a difference between their experience and their expectations of the agent's competence. The means for post-use usability are also higher than for pre-use usability in both the high (t(…) = …, p < …) and low (t(…) = …, p < …) competence conditions. We found no main effects of warmth or interaction effects between competence and warmth.

We observe post-use warmth ratings to be higher in the high warmth condition than in the low warmth condition, though the difference is not significant. For post-use warmth (Figure 4(d)), we find no main effects of competence or warmth and no interaction effects. In the high warmth condition, warmth was rated (… ± …); in the low warmth condition, it was rated (… ± …). There is no significant difference between the means of pre-use and post-use warmth in the high warmth condition (t(…) = …, p = …), but there is in the low warmth condition (t(…) = …, p < …).

Using the composites described in the study design, we measure the effect of conceptual metaphors on participants' intention to adopt and desire to cooperate after interacting with the system.

Low competence metaphors increase participants' likelihood of adopting the AI agent. For intention to adopt (Figure 4(b)), competence has a main effect (F(…, …) = …, p < …, η² = …). In the high competence condition, intention to adopt was rated (… ± …); in the low competence condition, it was rated (… ± …). These results support Hypothesis 2, in line with contrast theory: participants are more likely to adopt an agent that they originally expected to have low competence but that outperforms that expectation. They are less forgiving of mistakes made by AI systems they expect to have high competence. We found no main effects of warmth or interaction effects between competence and warmth.

Participants prefer to cooperate with agents that have high warmth and low competence. For desire to cooperate with the AI system, we found that both competence (F(…, …) = …, p < …, η² = …) and warmth (F(…, …) = …, p < …, η² = …) had main effects but no interaction effect (see Figure 4(c, e)). The means increase from high (… ± …) to low competence (… ± …); similarly, the means decrease from high (… ± …) to low warmth (… ± …). These results provide mixed support for both Hypothesis 1 and Hypothesis 2, as we see support for assimilation theory along the warmth dimension and contrast theory along the competence dimension. If participants are told that the AI system is high warmth, they are more likely to cooperate with it. But if the AI system is described as high competence, they are less likely to cooperate.
Fig. 5. Segments of two example conversations between a participant and our conversational AI system. In both cases, the participant expects the AI system to have low competence. The left conversation is in the high warmth metaphor condition; the right conversation is in the low warmth metaphor condition. Participants in the high warmth condition ask more questions and explore the space of possible interactions, asking the agent details about checked luggage and hotel amenities. Wizards, acting as conversational agents, are given a fixed knowledge set, mimicking how today's systems are designed, and reply with apologies when asked about details outside of their knowledge.
We analyze the chat logs with LIWC features, following a standard LIWC analysis protocol of building a frequency count of how often words belonging to a specific LIWC category were used. We contrasted these counts across the various conditions and observed no significant differences (p > .05) in language-level phenomena in the chat logs across the conditions. This result implies that the post-use evaluations are driven primarily by the expectations set by the metaphors, not by the actual content of the conversation. In other words, evaluations differed between conditions, but the actual conversations themselves did not. The wizard was blind to condition, so any differences would have needed to be prompted by the participant. However, we acknowledge that there might be language shifts that LIWC categories cannot capture.
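The frequency-count protocol can be sketched as follows; since LIWC's dictionary is proprietary, the tiny lexicon below is a stand-in for illustration only:

```python
from collections import Counter

# Stand-in lexicon mapping words to LIWC-style categories; the real LIWC
# dictionary covers 82 dimensions and is licensed separately.
LEXICON = {"sorry": "negemo", "great": "posemo", "i": "self", "because": "cause"}

def category_frequencies(messages: list[str]) -> dict[str, float]:
    """Normalized per-category word frequencies for one condition's chat logs."""
    counts, total = Counter(), 0
    for message in messages:
        for word in message.lower().split():
            total += 1
            if word in LEXICON:
                counts[LEXICON[word]] += 1
    return {category: n / total for category, n in counts.items()}

# These histograms are then contrasted across the metaphor conditions.
print(category_frequencies(["I am sorry", "Great choice because it has wifi"]))
```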
Participants use more words and spend more time speaking to agents with high warmth. We find a significant main effect of warmth on the number of words used per conversation (F(…, …) = …, p < …, η² = …). The number of words increases from (… ± …) in the low warmth condition to (… ± …) in the high warmth condition. We also find that participants in the high warmth condition typically spend an average of 4 ± … more minutes in conversation.

Our results support contrast theory (Hypothesis 2) for the competence axis. Users are more tolerant of gaps in knowledge of systems with low competence but are less forgiving of high competence systems making mistakes. The intention to adopt and desire to cooperate decrease as the competence of the AI system's metaphor increases. For the warmth axis, our results provide some support for assimilation theory (Hypothesis 1): users are more likely to cooperate and interact longer with agents portraying high warmth, but we do not observe a significant impact of warmth on users' intention to adopt.
Study 1 established that setting low expectations and positively violating them increased the likelihood of users adopting the system. In Study 2, we zoom in to understand how user evaluations change as the magnitude of that gap changes. We sample additional metaphors and use the same experimental procedure as before to characterize how users' intention to adopt varies as the gap between users' expectations and their experience changes. For this purpose, we rely on the same measure of intention to adopt as Study 1.
To precisely traverse the range of perceived competence, we sampled three additional metaphors: "middle schooler", "young student", and "recent graduate". Our pre-experiment survey revealed that these metaphors had perceived competence levels between "toddler" and "trained professional". Together, these five metaphors formed five treatment conditions. As Figure 2 demonstrates, all five metaphors lie in the high warmth half of the space, minimizing any interfering effects of variation in warmth. Participants in these five conditions were primed with the respective metaphor.
Fig. 6. The Competence-Adoption Curve: extreme contrast has stronger effects. A larger positive violation of expectation increases adoption intentions. Intention to adopt (y-axis) decreases monotonically with an increase in the expected competence of the system (x-axis: toddler, middle schooler, young student, recent graduate, trained professional, ordered by perceived competence). The red vertical line shows the average score users in the control condition assigned the system (its estimated post-use competence), and the yellow shaded region around the vertical line depicts the standard deviation.
To understand users' unprimed evaluations of the system, we asked a sixth control group to participate without a metaphor (similar to the control condition in Study 1). Afterwards, we asked them to pick from the list of five metaphors the one they felt described the system most accurately after use.
Similar to Study 1, we measure participants' intention to adopt and desire to cooperate across all five metaphor conditions.
Similar to the protocol in Study 1, participants were recruited on AMT. We recruited 20 participants for each condition, for a total of 120 participants. The duration of the study was similar to Study 1, and participants were compensated at the same rate. The average age of participants was 39 ± …. The two coders were consistent and identified the same 5 participants (κ = 1) as being suspicious, implying a low suspicion level of 4.…%. These participants were excluded from analysis.

The five metaphors we sampled are shown in Figure 6, where the x-axis depicts workers' perceived competence of a system with that metaphor and the y-axis depicts a different set of users' intention to adopt. The red vertical line shows the average score users in the control condition assigned the system, and the yellow shaded region around the vertical line depicts the standard deviation. The unprimed system was viewed as roughly as competent as a recent graduate.

Over-performing low competence expectations leads to higher adoption than over-performing medium competence expectations, and projecting any more competence than the toddler metaphor incurs an immediate cost (Figure 6). These results paint a fuller picture of contrast theory at play, as the intention to adopt decreases monotonically as the expected competence of the system increases. Consistent with prior literature, the effect is greater as the contrast is greater [7, 28]. However, the effect is nonlinear, with only the lowest competence metaphor receiving a substantial benefit.

The "toddler" metaphor, furthest from the vertical line, sees the largest (beneficial) violation and the greatest intention to adopt and desire to cooperate. There was a statistically significant difference between groups as determined by one-way ANOVAs for intention to adopt (F(…, …) = …, p = …) and for desire to cooperate (F(…, …) = …, p < …). A Tukey post-hoc test revealed that intention to adopt was statistically significantly higher for "toddler" (… ± …) than "young student" (… ± …, p = …), "recent graduate" (… ± …, p = …), and "trained professional" (… ± …, p = …); there was no statistically significant difference between the other metaphors. Similarly, a Tukey post-hoc test revealed that desire to cooperate was statistically significantly higher for "toddler" (… ± …) than "middle schooler" (… ± …, p = …), "young student" (… ± …, p = …), "recent graduate" (… ± …, p = …), and "trained professional" (… ± …, p = …); there was no statistically significant difference between the other metaphors.

Our results further support contrast theory (Hypothesis 2) for the competence axis: users are more likely to adopt a lower competence agent than one with high competence, even though all conditions were exposed to human-level performance. We additionally see an asymmetry: users are even more likely to adopt an agent that exceeds extremely low expectations than one that exceeds slightly higher (but still low) expectations. And as the agent begins to under-perform expectations, intentions to adopt decrease further.
From our results so far, it might appear that designers should pick metaphors that project lower competence and high warmth regardless of experience, as these conditions are most conducive to cooperative and patient user interactions. However, such a conclusion might be myopic: metaphors attached to a system also have the ability to attract people or drive them away.

To test the effect of metaphor on pre-use intention to adopt and pre-use desire to cooperate with an AI system, we ran a third study. In this study, we present participants with AI systems described using conceptual metaphors, and ask them to identify which systems they would be more likely to try out and potentially adopt, prior to using the system.
We perform a between-subjects experiment. Each participant was introduced to an AI agent described using one of the metaphors from Study 1. Unlike the previous experiments, participants do not actually interact with an AI system (or wizard). Instead, they are asked to rate their likelihood of trying out a new AI service described by the metaphor.
To probe for participants' intentions to try the system, we asked them the following two questions on 5-point Likert scales: "How likely are you to try out this AI system?" and "Do you envision yourself engaging in long-term use of such an AI system?" These two questions are combined to form a trial index (α = .…). To probe for participants' pre-use desire to cooperate, we asked: "How likely are you to cooperate with such an AI system?" and "How likely are you to tolerate errors made by this AI system?" These two questions are combined to form a pre-use desire to cooperate index (α = .…).

Similar to the previous studies, we recruited participants from AMT. 80 new participants took part in this survey: 20 exposed to a metaphor from each quadrant. We ensured that none of the participants in this study had participated in any of our other studies.
Participants were more interested in trying out AI systems described with high competence and high warmth. A two-way ANOVA revealed that competence (F(…, …) = …, p < …, η² = …) and warmth (F(…, …) = …, p = …, η² = …) both had a significant impact on intention to try out the AI system. The average trial index response was (… ± …) for high competence and (… ± …) for low competence. Following a similar pattern, the average trial index response was (… ± …) for high warmth as opposed to (… ± …) for low warmth. The ANOVA also showed an interaction effect between competence and warmth (F(…, …) = …, p = …, η² = …): the combination of high competence and high warmth produced a substantial benefit (3.… ± …) compared to the effects of warmth and competence individually: low competence and high warmth (1.… ± …), high competence and low warmth (2.… ± …), and both low competence and low warmth (1.… ± …).

Participants also expected to behave more positively with high competence, high warmth systems. A second two-way ANOVA revealed that competence (F(…, …) = …, p < …, η² = …) and warmth (F(…, …) = …, p = …, η² = …) both had a significant impact on the pre-use desire to cooperate index. The average pre-use desire to cooperate index response was (… ± …) for high competence as opposed to (… ± …) for low competence. Similarly, the average response was (… ± …) for high warmth as opposed to (… ± …) for low warmth. The ANOVA also showed an interaction effect between competence and warmth (F(…, …) = …, p = …, η² = …): people expected to behave more positively with high competence and high warmth AI systems (3.… ± …) over the low competence and high warmth (1.… ± …), high competence and low warmth (2.… ± …), and both low competence and low warmth (1.… ± …) bots.
While our previous studies demonstrated the detrimental effects of presenting an AI system with a high competence metaphor, this study shows a positive benefit of high competence: people are more likely to try out a new service if it is described with a high competence metaphor. This study also shows that metaphors projecting high warmth increase people's likelihood of trying out a service and of behaving positively with it. We discuss the implications of these findings and suggest guidelines for choosing metaphors given the competing objectives of attracting more users and ensuring favorable evaluations and cooperative behavior.
Metaphors, as an expectation setting mechanism, are task- and model-agnostic. Users reason about complex algorithmic systems, including news feeds [19], content curation, and recommender systems, using metaphors. This implies their effects are not limited to conversational agents or even to AI systems, and they can be used to set expectations of any algorithmic system (e.g., is Facebook's newsfeed algorithm a gossipy teen, an information butler, or a spy?), although the implications of our study might differ depending on the task, interaction, and context.

With our findings in mind, this section explores their design implications and limitations, and situates our work amongst existing literature in HCI. We end with a retrospective analysis of existing and previous conversational AI products, reinterpreting their metaphors and adoption/user cooperation patterns through the lens of our results.
Our work contributes to a growing body of work in HCI that seeks to understand how people reason about algorithmic systems, with the aim of facilitating more informed and engaging interactions [19, 22, 26]. Previous work has looked at how users form informal theories about the technical mechanisms behind social media feeds [19, 21, 26] and how these "folk theories" drive their interactions with those systems. People's conceptual understanding of such systems has been found to be metaphorical in nature, leading them to form folk theories of socio-technical systems in terms of metaphors. Folk theories for Facebook and Twitter news feeds include metaphors rooted in personas such as "rational assistant" and "unwanted observer" as well as metaphors tied to more abstract concepts such as "corporate black box" and "transparent platform". More recent work has sought to study the social roles of algorithms by looking at how people personify algorithms and attach personas to them [81]. Prior work in the domain of interactive systems and embodied agents has observed that the mental schemas people apply to agents affect the way they behave with the agent, and that it is possible to detect users' schematic orientation through initial interaction [47]. Diverging from previous work on folk theories, our work takes a complementary route: instead of studying which metaphors users attach to systems, we study how metaphors explicitly attached to the system, by designers, impact experiences.
Studies 1 and 2 demonstrate that low competence metaphors lead to the highest evaluations of an AI agent, but Study 3 counters that agents with low competence metaphors are least likely to be tried out. What should a designer do?

From Study 3, it is clear that associating a high warmth metaphor is always beneficial; the competence level projected by the metaphor, however, is a more nuanced decision. One possible approach is to choose a higher-competence metaphor but to lower competence expectations right after interaction begins (e.g., "Great question! One thing I should mention: I'm still learning how to best respond to questions like yours, so please have patience if I get something wrong."). Another approach is to age the metaphor over time: present a high competence metaphor such as a professional, but when a user first encounters the agent, have it introduce itself via a lower-competence version such as a professional trainee and tell the user that it will evolve over time into a full professional [65]; a sketch of this strategy follows at the end of this subsection.

If designers are unwilling to change or adapt their high-competence metaphor, their designs run the risk of being abandoned for being less effective than users expect. There may be other ways to disarm the contrast between expectations and reality. Having the agent blame itself or the user for errors creates challenging issues, but blaming an intermediary might work [53]: for example, "I've seen that previous folks who asked that question meant multiple different things by it. To make sure I can help effectively, can you reword that question?"
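As one hypothetical realization of the metaphor-aging strategy above, an agent could select its self-introduction from a schedule keyed to the user's experience with it. The thresholds and wording below are invented for illustration.

# Hypothetical sketch: "aging" a metaphor from trainee to professional.
def introduction(sessions_completed: int) -> str:
    if sessions_completed == 0:
        # First encounter: project the lower-competence version of the metaphor.
        return ("Hi! I'm a travel-assistant trainee. I'm still learning, "
                "so please bear with me if I get something wrong.")
    if sessions_completed < 5:
        # Signal gradual improvement as the user gains experience.
        return "Welcome back! Your trainee assistant is getting a little better every day."
    # Graduate to the full high-competence metaphor.
    return "Hello again! Your travel assistant is ready to help."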
The scope of the study was limited to a conversational AI, as an instance of an algorithmic system, where interaction is devoid of strong visual cues [63]. In the case of embodied agents and systems where visual communication is a major aspect of the interaction, visual factors might have a strong effect on expectations. It is important to understand how users factor in these visual signals in forming an impression of the system. Additionally, our conceptual metaphors were solely textual; future work should explore how these findings translate to visual metaphors such as the abstract shape associated with Siri, or the cartoonish rendering of Clippy, because such visual abstractions also inform users' judgements of a system's competence and warmth.

Since the task in our study was highly structured and participants had no incentive to explore peripheral conversational topics, we did not observe significant differences in user vocabulary across the conditions. This result surprised us: evaluations differed even though the interactions themselves showed no major differences between conditions. Future work should explore user behavior in open-ended conversations, which are more likely to contain personal stories and anecdotes that can elicit greater behavior changes. The conversations, and therefore interactions with the AI system, were limited to 15-20 minutes, so further research needs to establish the effects of metaphors under prolonged exposure to the AI system. Additionally, our service is not commonly used by people today to book flights or hotels, and it is possible that the novelty of performing this task with a conversational agent skewed evaluations. This highlights the need to understand how prior experience with similar technology changes people's susceptibility to such expectation shaping and, subsequently, their evaluations.

We observed partial support for Assimilation Theory along the warmth axis: participants preferred to cooperate with agents projecting higher warmth, but at the same time they perceived a difference between the agent's projected warmth and its actual warmth. Future work is needed to develop more robust theories along the warmth axis. One potential direction could create conditions of more extreme violation: sampling metaphors that signal either extremely low or extremely high warmth, and measuring whether participants' attitudes towards the agent are still driven by their pre-use warmth perceptions or whether the larger perceived difference in warmth alters their attitudes towards the agent.

Our study explored the effect of metaphors on adoption and behavioral intentions, but metaphors could also impact many other factors, including the perceived trustworthiness of the system. Along this direction, future work should explore the impact on user evaluations when the interaction with the AI system results in a failure to accomplish the task. Finally, the actual competence and warmth of the AI system should be varied to analyze the effects of metaphors as the AI system's competence is lowered from our human-level performance.
Studies have repeatedly shown that initial user expectations of conversational agents (including Google Now, Siri, and Cortana) are not met [36, 49, 83], causing users to abandon these services after initial use. Initial experiences with conversational agents are often decisive: users reported that initial failures in achieving a task with Siri caused them to retreat to simple tasks they were sure the system could handle. While the bloated expectations users hold before interacting with AI systems have been acknowledged, little work has explored what those expectations are and how they contribute to user adoption and behavior.

Are today's conversational agents being set up for failure? Our studies establish that the descriptions and metaphors attached to these systems can play a key role in shaping expectations.
Woebot was introduced as "Tiny conversations to feel your best"; Replika was presented as "The AI companion who cares"; and Mitsuku was revealed as "a record breaking five-time winner of the Loebner Prize Turing Test [...] the world's best conversational chatbot". We collected the descriptions associated with 5 popular social chatbots (Xiaoice, Mitsuku, Tay, Replika, and Woebot) and deployed the exact same warmth-competence measurement of those descriptions with 50 participants from AMT as we used for the metaphors in our study.

[Figure 7: Average warmth and competence measured for popular social chatbots; annotations mark bots with similar competence and with similar warmth.]

We find that today's social chatbots signal high competence (Figure 7), between "recent graduate" and "trained professional". (Tay, incidentally, also projects very low warmth.) Descriptions of this kind, as we have shown, might be setting such systems up for failure. As Ars Technica reported: "You might come away thinking that Apple found some way to shrink Mad Men's Joan Holloway and pop her into a computer chip. Though Siri shows real potential, these kinds of high expectations are bound to be disappointed" [13]. It is important to note, then, that users often report disappointment after using these agents, especially since Apple's announcement of Siri included the sentence "Ask Siri and get the answer back almost instantly without having to type a single character", and Google Assistant was heralded as "the first virtual assistant that truly anticipates your needs".

With the recent glut of Twitter bots and other social agents that learn from their interactions and are adaptive in nature [56], it also becomes important to understand what drives users' antisocial behavior towards such bots and what factors contribute to it. Previous work has sought explanations through the lens of user profiling, gender attributions, and racial representations. Our work provides another lens on why otherwise similar systems such as Xiaoice and Tay (both female, teen-aged, and not representative of marginalized communities in their respective countries) might have elicited vastly different responses from their users. While Tay's official Twitter account described it as "Microsoft's AI fam from the internet that's got zero chill!", signaling high competence and low warmth, Xiaoice was set up to be a "Sympathetic ear" and an "Empathetic social chatbot" [85], very clearly signaling high warmth and even priming behaviors around warmth such as personal disclosure. Our study suggests that people are more likely to cooperate with a bot that is perceived as higher warmth before use, a result consistent with the fact that Xiaoice continued to be a friend and remained popular with its user base while Tay was pulled down within 16 hours of its release for attracting trolls.

Xiaoice is not an isolated case. Other bots, such as Woebot and Replika, which were set up with high warmth, have had success in garnering users. Even though Woebot and Replika carried competence expectations comparable to Tay's, they were far warmer and obtained an altogether different outcome. Similarly, among the bots perceived as high warmth, Mitsuku stood out as exceptionally competent; consistent with our finding that perceptions of very high competence decrease the desire to cooperate with the AI system, up to 30% of the messages received by Mitsuku are antisocial [80]. For their part, Microsoft may have absorbed the lesson: Tay's successor, Zo, was described more warmly as "Always down to chat. Will make you LOL".

While we acknowledge that there are several variables that affect user reception of these systems, the fact that our findings are consistent with the in-the-wild outcomes of extant conversational systems is notable. It is, of course, impossible to prove that the expectations set by attached metaphors are a causal factor in users' reception of these specific systems, and we caution readers against concluding that metaphors alone are responsible.
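For readers who want to replicate this retrospective measurement, the sketch below (pandas; hypothetical file and column names, not the authors' pipeline) aggregates per-participant warmth and competence ratings of each chatbot description into the bot-level averages plotted in Figure 7.

# Minimal sketch: averaging AMT ratings of chatbot descriptions per bot.
import pandas as pd

# One row per (participant, bot): columns bot, warmth, competence.
ratings = pd.read_csv("chatbot_description_ratings.csv")  # hypothetical file
summary = ratings.groupby("bot")[["warmth", "competence"]].agg(["mean", "sem"])
print(summary)  # mean and standard error of each dimension per chatbot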
We explore metaphors as a causal factor in determining users' evaluations of AI agents. We demonstrate experimentally that these conceptual metaphors change users' pre-use expectations as well as their post-use evaluations of the system, their intention to adopt it, and their desire to cooperate with it. While people are more likely to cooperate with agents that they expect to be warm, they are more likely to adopt and cooperate with agents that project low competence. This result runs counter to designers' usual default of projecting high competence to attract more users.
ACKNOWLEDGMENTS
We thank Jacob Ritchie, Mitchell Gordon, and Mark Whiting for their valuable comments and feedback. This work was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (TRI), but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
REFERENCES
[1] Norah Abokhodair, Daisy Yoo, and David W McDonald. 2015. Dissecting a social botnet: Growth, content and influence in Twitter. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 839-851.
[2] Andrew Arnold. 2018. How Chatbots Feed Into Millennials' Need For Instant Gratification. (2018).
[3] Solomon E Asch. 1946. Forming impressions of personality. The Journal of Abnormal and Social Psychology 41, 3 (1946), 258.
[4] Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems. CoRR abs/1704.00057 (2017). arXiv:1704.00057 http://arxiv.org/abs/1704.00057
[5] Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S Lasecki, Daniel S Weld, and Eric Horvitz. 2019. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. (2019).
[6] Amy L Baylor and Yanghee Kim. 2004. Pedagogical agent design: The impact of agent realism, gender, ethnicity, and instructional role. In International Conference on Intelligent Tutoring Systems. Springer, 592-603.
[7] Susan A Brown, Viswanath Venkatesh, and Sandeep Goyal. 2012. Expectation confirmation in technology use. Information Systems Research 23, 2 (2012), 474-487.
[8] Drazen Brscić, Hiroyuki Kidokoro, Yoshitaka Suehiro, and Takayuki Kanda. 2015. Escaping from children's abuse of social robots. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction. ACM, 59-66.
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1721-1730.
[10] Justine Cassell, Tim Bickmore, Lee Campbell, Hannes Vilhjálmsson, and Hao Yan. 2000. Human Conversation As a System Framework: Designing Embodied Conversational Agents. In Embodied Conversational Agents. MIT Press, Cambridge, MA, USA, 29-63. http://dl.acm.org/citation.cfm?id=371552.371555
[11] Elaine Chang and Vishwac Kannan. 2018. Conversational AI: Best practices for building bots. https://medius.studios.ms/Embed/Video/BRK3225
[12] Ana Paula Chaves and Marco Aurélio Gerosa. 2019. How should my chatbot interact? A survey on human-chatbot interaction design. CoRR abs/1904.02743 (2019). arXiv:1904.02743 http://arxiv.org/abs/1904.02743
[13] Jacqui Cheng. 2011. iPhone 4S: A Siri-ously slick, speedy smartphone. (2011).
[14] Justin Cranshaw, Emad Elwany, Todd Newman, Rafal Kocielnik, Bowen Yu, Sandeep Soni, Jaime Teevan, and Andrés Monroy-Hernández. 2017. Calendar.help: Designing a workflow-based scheduling agent with humans in the loop. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 2382-2393.
[15] L Elizabeth Crawford. 2009. Conceptual metaphors of affect. Emotion Review 1, 2 (2009), 129-139.
[16] Amy JC Cuddy, Susan T Fiske, and Peter Glick. 2008. Warmth and competence as universal dimensions of social perception: The stereotype content model and the BIAS map. Advances in Experimental Social Psychology 40 (2008), 61-149.
[17] Sonam Damani, Nitya Raviprakash, Umang Gupta, Ankush Chatterjee, Meghana Joshi, Khyatti Gupta, Kedhar Nath Narahari, Puneet Agrawal, Manoj Kumar Chinnakotla, Sneha Magapu, et al. 2018. Ruuh: A Deep Learning Based Conversational Social Agent. arXiv preprint arXiv:1810.12097 (2018).
[18] Leslie A DeChurch and Jessica R Mesmer-Magnus. 2010. The cognitive underpinnings of effective teamwork: A meta-analysis. Journal of Applied Psychology 95, 1 (2010), 32.
[19] Michael A. DeVito, Jeremy Birnholtz, Jeffery T. Hancock, Megan French, and Sunny Liu. 2018. How People Form Folk Theories of Social Media Feeds and What It Means for How We Study Self-Presentation. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI '18). ACM, New York, NY, USA, Article 120, 12 pages. https://doi.org/10.1145/3173574.3173694
[20] Salesforce Drift, Survey Monkey. 2018. 2018 State of Chatbots Report. (2018).
[21] Motahhare Eslami, Karrie Karahalios, Christian Sandvig, Kristen Vaccaro, Aimee Rickman, Kevin Hamilton, and Alex Kirlik. 2016. First I "like" It, Then I Hide It: Folk Theories of Social Feeds. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (San Jose, California, USA) (CHI '16). ACM, New York, NY, USA, 2371-2382. https://doi.org/10.1145/2858036.2858494
[22] Motahhare Eslami, Aimee Rickman, Kristen Vaccaro, Amirhossein Aleyasen, Andy Vuong, Karrie Karahalios, Kevin Hamilton, and Christian Sandvig. 2015. "I Always Assumed That I Wasn't Really That Close to [Her]": Reasoning about Invisible Algorithms in News Feeds. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (Seoul, Republic of Korea) (CHI '15). ACM, New York, NY, USA, 153-162. https://doi.org/10.1145/2702123.2702556
[23] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Commun. ACM 59, 7 (2016), 96-104.
[24] Susan T Fiske, Amy JC Cuddy, Peter Glick, and Jun Xu. 2018. A model of (often mixed) stereotype content: Competence and warmth respectively follow from perceived status and competition (2002). In Social Cognition. Routledge, 171-222.
[25] Susan T. Fiske, Juan Xu, Amy C. Cuddy, and Peter Glick. 1999. (Dis)respecting versus (Dis)liking: Status and Interdependence Predict Ambivalent Stereotypes of Competence and Warmth. Journal of Social Issues 55, 3 (1999), 473-489. https://doi.org/10.1111/0022-4537.00128
[26] Megan French and Jeff Hancock. 2017. What's the folk theory? Reasoning about cyber-social systems. (February 2, 2017).
[27] Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural Approaches to Conversational AI. CoRR abs/1809.08267 (2018). arXiv:1809.08267 http://arxiv.org/abs/1809.08267
[28] Andrew L Geers and G Daniel Lassiter. 1999. Affective expectations and information gain: Evidence for assimilation and contrast effects in affective experience. Journal of Experimental Social Psychology 35, 4 (1999), 394-413.
[29] Susan A Gelman and Cristine H Legare. 2011. Concepts and folk theories. Annual Review of Anthropology 40 (2011), 379-398.
[30] Jonathan Grudin and Richard Jacques. 2019. Chatbots, Humbots, and the Quest for Artificial General Intelligence. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). ACM, New York, NY, USA, Article 209, 11 pages. https://doi.org/10.1145/3290605.3300439
[31] Jan Hartmann, Antonella De Angeli, and Alistair Sutcliffe. 2008. Framing the user experience: information biases on website quality judgement. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Florence, Italy) (CHI '08). ACM, New York, NY, USA, 855-864. https://doi.org/10.1145/1357054.1357190
[33] P. J. Hinds, T. L. Roberts, and H. Jones. 2004. Whose job is it anyway? A study of human-robot interaction in a collaborative task. (2004).
[34] Annabell Ho, Jeff Hancock, and Adam S Miner. 2018. Psychological, relational, and emotional effects of self-disclosure after conversations with a chatbot. Journal of Communication 68, 4 (2018), 712-733.
[35] Elle Hunt. 2016. Tay, Microsoft's AI chatbot, gets a crash course in racism from Twitter. The Guardian 24 (2016).
[36] Mohit Jain, Pratyush Kumar, Ramachandra Kota, and Shwetak N Patel. 2018. Evaluating and informing the design of chatbots. In Proceedings of the 2018 Designing Interactive Systems Conference. ACM, 895-906.
[37] Maurice Jakesch, Megan French, Xiao Ma, Jeffrey T Hancock, and Mor Naaman. 2019. AI-Mediated Communication: How the Perception that Profile Text was Written by AI Affects Trustworthiness. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 239.
[38] Charles M Judd, Laurie James-Hawkins, Vincent Yzerbyt, and Yoshihisa Kashima. 2005. Fundamental dimensions of social judgment: understanding the relations between judgments of competence and warmth. Journal of Personality and Social Psychology 89, 6 (2005), 899.
[39] Ralph Kimball, B. Verplank, and E. Harslem. 1982. Designing the Star user interface. Byte 7 (1982), 242-282.
[40] Kristen J Klaaren, Sara D Hodges, and Timothy D Wilson. 1994. The role of affective expectations in subjective experience and decision-making. Social Cognition 12, 2 (1994), 77-101.
[41] Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI?: Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). ACM, New York, NY, USA, Article 411, 14 pages. https://doi.org/10.1145/3290605.3300641
[42] Sari Kujala, Ruth Mugge, and Talya Miron-Shatz. 2017. The role of expectations in service evaluation: A longitudinal study of a proximity mobile payment service. International Journal of Human-Computer Studies 98 (2017), 51-61.
[43] Eugenia Kuyda and Phil Dudchuk. 2015. Replika [computer program]. https://replika.ai/
[44] Isaac Lage, Andrew Ross, Samuel J Gershman, Been Kim, and Finale Doshi-Velez. 2018. Human-in-the-loop interpretability prior. In Advances in Neural Information Processing Systems. 10159-10168.
[45] George Lakoff and Mark Johnson. 2008. Metaphors We Live By. University of Chicago Press.
[46] Justin Lee. 2018. Chatbots were the next big thing: what happened? https://medium.com/swlh/chatbots-were-the-next-big-thing-what-happened-5fc49dd6fa61
[47] Min Kyung Lee, Sara Kiesler, and Jodi Forlizzi. 2010. Receptionist or information kiosk: how do people talk with a robot?. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work. 31-40.
[48] Q Vera Liao, Mas-ud Hussain, Praveen Chandar, Matthew Davis, Yasaman Khazaeni, Marco Patricio Crasso, Dakuo Wang, Michael Muller, N Sadat Shami, Werner Geyer, et al. 2018. All Work and No Play? Conversations with a question-and-answer chatbot in the wild. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 3.
[49] Ewa Luger and Abigail Sellen. 2016. Like having a really bad PA: the gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 5286-5297.
[50] Matthew S McGlone. 1996. Conceptual metaphors and figurative language interpretation: Food for thought? Journal of Memory and Language 35, 4 (1996), 544-565.
[51] Jaroslav Michalco, Jakob Grue Simonsen, and Kasper Hornbæk. 2015. An exploration of the relation between expectations and user experience. International Journal of Human-Computer Interaction 31, 9 (2015), 603-617.
[52] H. Mieczkowski, S. X. Liu, J. Hancock, and B. Reeves. 2019. Helping Not Hurting: Applying the Stereotype Content Model and BIAS Map to Social Robotics. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI). 222-229. https://doi.org/10.1109/HRI.2019.8673307
[53] Clifford Ivar Nass and Scott Brave. 2005. Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. MIT Press, Cambridge, MA.
[54] Donald A Norman. 1988. The Psychology of Everyday Things. Basic Books.
[55] AE Nutt. 2017. The Woebot will see you now: the rise of chatbot therapy. Washington Post (2017).
[56] Junwon Park, Ranjay Krishna, Pranav Khadpe, Li Fei-Fei, and Michael Bernstein. 2019. AI-based Request Augmentation to Increase Crowdsourcing Participation. In AAAI Conference on Human Computation and Crowdsourcing. ACM.
[57] James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates (2001).
[58] Eeva Raita and Antti Oulasvirta. 2011. Too good to be bad: Favorable product expectations boost subjective usability ratings. Interacting with Computers 23, 4 (2011), 363-371.
[59] Byron Reeves and Clifford Ivar Nass. 1996. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press.
[60] Cynthia Rudin. 2018. Please stop explaining black box models for high stakes decisions. arXiv preprint arXiv:1811.10154 (2018).
[61] Pericle Salvini, Gaetano Ciaravella, Wonpil Yu, Gabriele Ferri, Alessandro Manzi, Barbara Mazzolai, Cecilia Laschi, Sang-Rok Oh, and Paolo Dario. 2010. How safe are service robots in urban environments? Bullying a robot. IEEE, 1-7.
[62] C. Sandvig. 2015. Seeing the Sort: The Aesthetic and Industrial Defense of "The Algorithm". Journal of the New Media Caucus 11 (2015), 35-51.
[63] Ayse Pinar Saygin, Thierry Chaminade, Hiroshi Ishiguro, Jon Driver, and Chris Frith. 2011. The thing that should not be: predictive coding and the uncanny valley in perceiving human and humanoid robot actions. Social Cognitive and Affective Neuroscience 7, 4 (2011), 413-422.
[64] Robin Sease. 2008. Metaphor's role in the information behavior of humans interacting with computers. Information Technology and Libraries 27, 4 (2008), 9-16.
[65] Joseph Seering, Michal Luria, Connie Ye, Geoff Kaufman, and Jessica Hammer. 2020. It Takes a Village: Integrating an Adaptive Chatbot into an Online Gaming Community. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI '20).
[66] Ameneh Shamekhi, Q Vera Liao, Dakuo Wang, Rachel KE Bellamy, and Thomas Erickson. 2018. Face Value? Exploring the effects of embodiment for a group facilitation agent. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 391.
[67] Muzafer Sherif, Daniel Taub, and Carl I Hovland. 1958. Assimilation and contrast effects of anchoring stimuli on judgments. Journal of Experimental Psychology 55, 2 (1958), 150.
[68] Heung-Yeung Shum, Xiao-dong He, and Di Li. 2018. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering 19, 1 (2018), 10-26.
[69] Megan Strait, Virginia Contreras, and Christian Duarte Vela. 2018. Verbal Disinhibition towards Robots is Associated with General Antisociality. arXiv preprint arXiv:1808.01076 (2018).
[70] Megan Strait, Ana Sánchez Ramos, Virginia Contreras, and Noemi Garcia. 2018. Robots Racialized in the Likeness of Marginalized Social Identities are Subject to Greater Dehumanization than those Racialized as White. IEEE, 452-457.
[71] Indrani Medhi Thies, Nandita Menon, Sneha Magapu, Manisha Subramony, and Jacki O'Neill. 2017. How do you want your chatbot? An exploratory Wizard-of-Oz study with young, urban Indians. In IFIP Conference on Human-Computer Interaction. Springer, 441-459.
[72] Alexander Todorov, Chris P Said, Andrew D Engell, and Nikolaas N Oosterhof. 2008. Understanding evaluation of faces on social dimensions. Trends in Cognitive Sciences 12, 12 (2008), 455-460.
[73] Paul Van Schaik and Jonathan Ling. 2008. Modelling user experience with web sites: Usability, hedonic value, beauty and goodness. Interacting with Computers 20, 3 (2008), 419-432.
[74] Peter Wallis and Emma Norling. 2005. The Trouble with Chatbots: social skills in a social world. Virtual Social Agents 29 (2005).
[75] Mark E Whiting, Grant Hugh, and Michael S Bernstein. 2019. Fair Work: Crowd Work Minimum Wage with One Line of Code. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 7. 197-206.
[76] Timothy D Wilson, Douglas J Lisle, Dolores Kraft, and Christopher G Wetzel. 1989. Preferences as expectation-driven inferences: Effects of affective expectations on affective experience. Journal of Personality and Social Psychology 56, 4 (1989), 519.
[77] Bogdan Wojciszke. 2005. Affective concomitants of information on morality and competence. European Psychologist 10, 1 (2005), 60-70.
[78] Victoria Woollaston. 2016. Following the failure of Tay, Microsoft is back with new chatbot Zo. (2016).
[79] S Worswick. 2015. Mitsuku [computer program].
[80] Steve Worswick. 2018. The Curse of the Chatbot Users. (2018).
[81] Eva Yiwei Wu, Emily Pedersen, and Niloufar Salehi. 2019. Agent, Gatekeeper, Drug Dealer: How Content Creators Craft Algorithmic Personas. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 219.
[82] Jun Xiao, John Stasko, and Richard Catrambone. 2004. An empirical study of the effect of agent competence on user performance and perception. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 1. IEEE Computer Society, 178-185.
[83] Jennifer Zamora. 2017. I'm Sorry, Dave, I'm Afraid I Can't Do That: Chatbot Perception and Expectations. In Proceedings of the 5th International Conference on Human Agent Interaction (Bielefeld, Germany) (HAI '17). ACM, New York, NY, USA, 253-260. https://doi.org/10.1145/3125739.3125766
[84] Mark P Zanna and David L Hamilton. 1972. Attribute dimensions and patterns of trait inferences. Psychonomic Science 27, 6 (1972), 353-354.
[85] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot.