Chatbots language design: the influence of language variation on user experience

ANA PAULA CHAVES, Northern Arizona University and Federal University of Technology–Paraná, Brazil
JESSE EGBERT, Department of English, Northern Arizona University
TOBY HOCKING, ECK DOERRY, and MARCO AURELIO GEROSA, School of Informatics, Computing, and Cyber Systems, Northern Arizona University

Authors' addresses: Ana Paula Chaves, [email protected], Northern Arizona University, 1295 S Knoles Dr, Flagstaff, Arizona, 86011, and Federal University of Technology–Paraná, R. Rosalina Maria dos Santos, 1233, Campo Mourão, Brazil; Jesse Egbert, [email protected], Department of English, Northern Arizona University, 705 S Beaver St, Flagstaff, Arizona, 86011; Toby Hocking, [email protected]; Eck Doerry, [email protected]; Marco Aurelio Gerosa, [email protected], School of Informatics, Computing, and Cyber Systems, Northern Arizona University.
Chatbots are often designed to mimic social roles attributed to humans. However, little is known about the impact on users' perceptions of using language that fails to conform to the associated social role. Our research draws on sociolinguistic theory to investigate how a chatbot's language choices can adhere to the expected social role the agent performs within a given context. In doing so, we seek to understand whether chatbot design should account for linguistic register. This research analyzes how register differences play a role in shaping the user's perception of the human-chatbot interaction. Ultimately, we want to determine whether register-specific language influences users' perceptions and experiences with chatbots. We produced parallel corpora of conversations in the tourism domain with similar content and varying register characteristics and evaluated users' preferences among the chatbot's linguistic choices in terms of appropriateness, credibility, and user experience. Our results show that register characteristics are strong predictors of users' preferences, which points to the need to design chatbots with register-appropriate language to improve acceptance and users' perceptions of chatbot interactions.
CCS Concepts: • Human-centered computing → Human computer interaction (HCI); Empirical studies in HCI; Natural language interfaces.

Additional Key Words and Phrases: chatbots, conversational agents, language design, register, user perceptions
Recent advances in conversational technologies have promoted the increasing popularity of chatbots [131], which are disembodied conversational interfaces that interact with users in natural language via a text-based messaging interface. A recent report on the chatbot market [55] attests to their increasing demand, predicting a global chatbot market of USD 1.25 billion by 2025. The skyrocketing interest in chatbot technologies has brought new challenges for the HCI field [23, 48, 101] and, despite improvements in design, users may not always be satisfied with their experiences [75, 91], which may affect their attitudes towards the technology [74].

For chatbots, natural language conversation is the primary mechanism for achieving interactional goals. Therefore, developing a more comprehensive understanding of the linguistic design of chatbot conversations and its effects on users' perceptions is critical to the success of chatbot technologies. Previous research on chatbot design suggests that when chatbots misuse language (e.g., conveying excessive (in)formality or using incoherent style), the conversation sounds strange to the user and leads to frustration [38, 73, 93]. To date, language design for chatbots has focused primarily on ensuring that chatbots produce coherent and grammatically correct responses, and on improving functional performance and accuracy (see, e.g., [69, 94, 95, 134]). Although current chatbots may, at some functional level, provide users with the answers they seek, the utterances portray arbitrary patterns of language that may not take into account the interactional situation. For instance, one would expect a chatbot representing a financial advisor to employ a different tone and linguistic features than a chatbot advising teens on current fashion choices. Currently, the design of a particular chatbot's linguistic choices is often based on ad-hoc analyses of user characteristics or the chatbot's persona. For machine language generation, models are trained using available corpora in the target domain, but they do not consider the particular context of the corpora's conversations.

Little is known about the effect of these design decisions on users' perceptions, much less about how to tailor chatbot design to the particular situation of use. When exploring a list of key factors that could influence a user's perceptions of chatbots, scholars have even argued that using an appropriate language style is not relevant as long as the user can understand the chatbot's answer [9, 21]. In contrast, empirical studies have repeatedly demonstrated that a chatbot's linguistic choices influence users' perceptions and behavior toward chatbots [5, 41, 110, 121, 125]. Using appropriate linguistic choices potentially increases human-likeness [53, 59, 68] and believability [68, 98, 99, 120], as well as enhances the overall perception of the quality of the interaction [67]. Resendez [110] showed that a chatbot's linguistic style evokes competence and trust as well as likeability and usefulness. Developing a strong basis for designing not just what a chatbot says but also how it says it must be a priority for creating the next generation of chatbots. This research establishes a framework for analyzing the effect of linguistic choices on users' perceptions, and takes a first step toward developing a prescriptive basis for tailoring chatbot linguistic choices to specific interactional situations.

Humans have developed a sense of how to adapt their tone, idioms, and formulations to various conversational contexts. According to sociolinguists [34], human linguistic choices are not arbitrary but are closely tailored by speakers to convey not just the informational payload, but also a host of subtle but important social cues [70, 72]. This concept is called register, and it has emerged as one of the most important predictors of linguistic variation in human-human communication [13]. Although the relevance of register in human-human communication has been well-established in the sociolinguistic community, the potential applicability of these insights has not yet been explored in the context of chatbot language design.

Argamon [6] suggests that the sociolinguistic concept of register could be formalized to provide a theoretical basis for machine language generation.
To achieve that, chatbots would need to be enriched with computational models that can evaluate the conversational situation and adapt the chatbot's linguistic choices to conform with the expected register, similar to humans' subconscious language production process. The first step toward this goal is to understand whether register theory applies to chatbot interactions and how it might inform chatbot designs. The cornerstone for this effort is an analysis to expose the perceived effect of chatbot language choices, while reducing the effect of variables other than language (e.g., variations in conversation context and content). This requires parallel conversations that are similar in content, but each representing a different linguistic register.

In our previous work [27, 28], we explored language variation in the context of tourism-related interactions, aiming to expose the relationships between register and the interactional situations within this domain. Our results conformed with established sociolinguistic theories: core linguistic features vary as the situational parameters vary, resulting in different language patterns. In this paper, we move to the next step, analyzing the extent to which the register differences we previously identified play a role in shaping the user's perception of the human-chatbot interaction. The question guiding our investigation is: Does the use of register-specific language influence the users' perceptions of tourist assistant chatbots?
To answer this question, we analyzed the register of two corpora of conversations (FLG and DailyDialog) to characterize their register differences and used the outcomes to produce a parallel corpus (FLGmod) in the tourism domain, which has similar informational content as FLG, but varies in language use patterns. Then, we performed a user study where we asked participants to compare answers from the two corpora and express their preferences in terms of appropriateness, credibility, and user experience. Finally, we conducted a statistical analysis of our user study data using a Generalized Linear Model [50] (binary response) to identify associations between the frequency of core linguistic features and user preferences with respect to language use.

Our results showed that there is an association between linguistic features and users' perceptions of appropriateness, credibility, and user experience, and that register is a stronger predictor of this association than other variables of individual biases (participants, their social orientation, and the answers' authors). These outcomes have important implications for the design of chatbots, e.g., the need to design chatbots to generate register-specific language for hard-coded and dynamically generated utterances to improve acceptance and users' perceptions of chatbot interactions.
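To make the modeling step concrete, a minimal sketch of a binary-response GLM in Python with statsmodels is shown below. The data layout, column names, and file name are hypothetical illustrations rather than the study's actual variables; the full model specification is described in Section 7.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical layout: one row per participant judgment; chose_original is 1
# when the FLG answer was preferred, and the feature columns hold normalized
# rates of core linguistic features for the pair of answers shown.
df = pd.read_csv("preferences.csv")  # hypothetical file name

# Binomial family with the default logit link models the binary preference.
model = smf.glm(
    "chose_original ~ private_verbs + first_person + second_person + nouns",
    data=df,
    family=sm.families.Binomial(),
).fit()
print(model.summary())  # feature coefficients indicate predictive strength
```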
In the following, we summarize the literature on chatbot language design and the rationale for applying sociolinguistic register analysis as the theoretical foundation for this research. We also discuss our choice of the tourism domain, specifically information search interactions in tourism contexts, as a testbed for our research.
Chatbots are typically designed to mimic the social roles usually associated with a human conversational partner, for example, a buddy [123], a tutor [40, 122], a healthcare provider [45, 97], a salesperson [53, 136], a hotel concierge [80], or, as in this research, a tourist assistant [28, 29]. Research on mind perception theory [58, 71, 81] suggests that although artificial agents are presumed to have sub-standard intelligence, people still apply certain social stereotypes to them. It is reasonable, then, to assume that "machines may be treated differently when attributed with higher-order minds" [81]. As chatbots enrich their communication and social skills, user expectations will likely grow as the conversational competence and perceived social role of chatbots approach the human profiles they aim to represent. A variety of factors influence how people perceive chatbot communication skills [30, 43, 121] and, as user expectations of proficiency increase, one important way to enhance chatbot interactions is by carefully planning their use of language [54, 73].

Most linguists agree that the language choices made by humans are systematic [72], and previous research has provided ample evidence that variation within a language can often be accounted for by factors such as individual author/speaker style (e.g., [7, 84]), dialect (e.g., [78, 119]), genre (e.g., [70, 105]), and register (e.g., [11, 34]). Among these factors, style has particularly captured the attention of researchers on conversational agents [43, 67, 87, 103, 125], with explorations ranging from consistently mimicking the style of a particular character [87, 118] to dynamically matching the style to the conversational partner [60, 103].

A number of studies have sought to empirically evaluate the influence of conversational style on user experiences with chatbots [5, 41]. For example, Elsholz et al. [41] compared interactions with chatbots that use modern English to those that use a Shakespearean language style. Users perceived the chatbot that used the modern English style as easy to use, while the chatbot that used Shakespearean English was seen as more fun to use. Araujo [5] evaluated the influence of anthropomorphic design cues on users' perceptions of companies represented by a chatbot, where perceptions include attitudes, satisfaction, and emotional connection with the company; one cue, for instance, was the use of an informal language style. Results showed that anthropomorphic cues resulted in significantly higher scores for adjectives like likeable, sociable, friendly, and personal in user evaluations of the interactions (though the relative impact of individual anthropomorphic cues on the outcomes was not evaluated). Similarly, based on an exploratory analysis, Tariverdiyeva [121] concluded that "appropriate degrees of formality" (renamed "appropriate language style" in a subsequent work [9]) directly correlates with user satisfaction. We note that these studies define "appropriate language" as the "ability of the chatbot to use appropriate language style for the context." This linkage of perceived appropriateness of language to context is important and reflects clear evidence that appropriateness of language is not absolute, but rather influenced by the user's specific expectations concerning the chatbot's communicative behavior and the stereotypes of the social category [67, 77]. For example, when assessing the effects of language style on brand trust in online interactions with customers, Jakic et al. [67] concluded that the perceived language fit between the brand and the product/service category increases the quality of interaction. Proficiency in human-like language style may also influence users' perceptions of chatbot credibility. Jenkins et al. [68] observed that chatbots are deemed sub-standard when users see them "acting as a machine"; similarly, in analyzing the naturalness of chatbots, Morrissey and Kirakowski [99] found that correct language usage was a determinant in perceived chatbot quality. The failure to convey linguistic expertise compromises credibility [137], i.e., the chatbot's ability to convey believability and competence [92, 117].

Although some scholars define style as "the meaningful deployment of language variation in a message" [43], sociolinguists define style as a set of linguistic variants that reflect aesthetic preferences, usually associated with particular speakers or historical periods [34] (e.g., Shakespearean vs. modern English). Sociolinguistic studies also emphasize that the "core linguistic features like pronouns and verbs are functional" rather than aesthetic [34], which points to register. Register theory states that for each interactional situation there is a subset of norms and expectations for using language to accomplish communicative functions [34]. In a conversation, every utterance is influenced by the social atmosphere [8, 66], which is represented in the form of situational parameters, such as the relationship between participants, the purpose of the interaction, and the topic of the conversation [34, 70]. This results in the emergence of situationally-defined language varieties, which ultimately determine the interlocutor's linguistic choices [13, 34].

Although the relevance of register in human-human communication has been extensively underlined [13], the extent to which this theory applies to human-chatbot interactions has yet to be widely investigated. There is some evidence suggesting that chatbots should use language appropriate to the service category that the chatbot represents [9]. Still, there has been no systematic analysis of how users' perceptions might be influenced by expectations regarding chatbot language, or exploration of the specific core linguistic features that determine the appropriateness of language fit. In the next section, we focus on how register theory applies to human-human communication and why it should be considered in chatbot interactions.
Register theory states that for each interactional situation there is a subset of norms and expectations for using language to accomplish communicative functions [34]; every utterance is influenced by the social atmosphere [8, 66], which is represented in the form of situational parameters, such as the relationship between participants, the purpose of the interaction, and the topic of the conversation [34, 70]. The influence of these parameters results in the emergence of situationally-defined language varieties [13, 34]. Hence, the register can be interpreted as the distribution of the linguistic features in a conversation, given the context; the linguistic features consist of the set of words or grammatical characteristics that occur in the conversation, and the context consists of a set of situational parameters that characterize the situation in which the conversation occurs, e.g., the participants, the channel, the production circumstances, and so on.

Several recent studies have shown that register is crucial for linguistic research on language variation and use: most linguistic features are adopted in different ways and to varying extents across different registers. For example, some studies have focused on describing register variation in the use of a narrower set of features, such as grammatical complexity features [18, 19], lexical bundles [15, 63], and evaluation [62]. The Longman Grammar of Spoken and Written English [20] documents the systematic patterns of register variation for most grammatical features in English, exposing the power of register as a significant predictor of linguistic variation. Indeed, we draw on this grammar to support our discussion surrounding the use of particular linguistic features in the context of tourist assistant discourse (see Section 8).

The value of register for understanding conversational structure is further emphasized by studies showing that failing to account for register in linguistic analyses and computational language models can, and often does, result in incorrect conclusions about language use. Biber [13], for instance, offers many examples of how failing to account for conversational register in a linguistic analysis can result in faulty conclusions.

Given the crucial role of register in shaping human-human communication, we suggest that register must be accounted for in the design of chatbot language; users' perception of chatbots as competent and trustworthy conversational partners depends on the chatbot's correct use of register. This paper works to provide a practical cornerstone for this broad endeavor by exposing associations between the frequency of core linguistic features (which comprise the conversational register) and user evaluations of the quality of the chatbot interactions. We focus our study on the domain of "tourist information search" to (a) analyze the linguistic features relevant to characterizing conversational register in this domain, and (b) show how adherence or failure to adhere to that register impacts user perceptions of conversational quality. To identify the typical patterns of language present in the interactions, we applied register analysis, as developed by Conrad and Biber [34], consisting of two main steps. First, situational analysis aims to characterize the interactions in the target corpus using a conversational taxonomy based around seven situational parameters: participants, relationship, channel, production, setting, purpose, and topic. Second, register characterization analyzes and aggregates the results of the situational analysis to yield a statistical characterization of the linguistic features typically used in domain interactions (in this case, within the tourist assistant domain); the result is a register model for the domain, i.e., a concrete representation of the appropriate register for the given domain.
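For intuition about what a statistical characterization of linguistic features looks like, here is a minimal, self-contained sketch that computes rate-normalized counts (per 1,000 words) for a few illustrative feature classes. The study itself relies on the Biber grammatical tagger (Section 5); the tiny word lists below are hypothetical stand-ins for its much richer feature inventory.

```python
import re

# Illustrative (deliberately incomplete) word lists for three feature classes.
PRIVATE_VERBS = {"think", "know", "believe", "feel", "guess", "suppose"}
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our"}
SECOND_PERSON = {"you", "your", "yours"}

def feature_rates(text: str) -> dict:
    """Return occurrences per 1,000 words for each feature class."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1  # avoid division by zero on empty texts
    rate = lambda hits: 1000 * hits / n
    return {
        "private_verbs": rate(sum(w in PRIVATE_VERBS for w in words)),
        "first_person": rate(sum(w in FIRST_PERSON for w in words)),
        "second_person": rate(sum(w in SECOND_PERSON for w in words)),
    }

print(feature_rates("Well, I think you would love downtown Flagstaff."))
```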
Tourism is one of the fastest-growing economic sectors in the world and is a major category of international trade in services [130]. As the sector grows, the demand for timely and accurate information on destinations also increases; a recent survey [90] revealed that the Internet is the top source for travel planning. As the penetration of smartphone devices with data access has increased, travelers increasingly search for information and make decisions en-route [24, 124, 132]. However, conducting en-route travel information searches using small-screen mobile devices can be an overwhelming experience [79, 106], due to the information overload compounded by a lack of reliable mechanisms for finding accurate, trustworthy, and relevant information [79].

Radlinski and Craswell [107] suggest that complex information searches could benefit from conversational interfaces and, indeed, several conversational agents with a wide range of characteristics have been developed to improve tourism information search and travel planning [2, 65, 82, 88]. Loo [90] claims that over one in three travelers across countries are interested in using digital assistants to research or book travel and that travel-related searches for "tonight" and "today" have grown over 150% on mobile devices in just two years. In particular, a report on the chatbot market [55] places the Travel & Tourism sector as one of the top five markets with the best revenue prospects by 2025. In response to this trend, the number of chatbots within the online tourism sector has increased. The BotList website, for example, lists nearly a hundred available chatbots under the travel category; some examples include the Expedia (https://botlist.co/bots/expedia) and Marina Alterra (https://botlist.co/bots/marina-alterra) virtual assistants.

Travel planning is also a domain in which perceived competence and trustworthiness are central to user experience; advice that is not appropriately presented is unlikely to be trusted and utilized. From a more pragmatic perspective, the popularity of tourist assistants (both human and chatbot) means that a growing corpus of conversations in this domain exists and can serve as the seed for our analysis.

In sum, we selected the tourism advising domain for developing a practical framework for including register in the design of chatbot conversational engines because proper use of conversational register is likely to be particularly critical for user experience in this domain, there is real-world demand for chatbots for travel advice, and several corpora of human and chatbot interactions in this domain are available.

As aforementioned, this research aimed to explore the extent to which user experience (in terms of perceived appropriateness, credibility, and overall user experience) is related to the conversational register used by a chatbot. For this purpose, we compared conversations expressed in different registers, presenting them to users for evaluation. To isolate the effect of register on perceived user experience, we compared conversations that are equivalent in content but vary in language patterns.

Finding such parallel data—natural language texts that have the same semantic content, but are expressed in different forms [102]—is difficult.
Previous studies requiring such parallel data have typically used written texts with multiple versions, e.g., versions of the Bible or Shakespearean texts in the original and modern language forms (see [127]). Although perhaps useful for analysis in an abstract context of NLP research, these corpora portray archaic language centered around topics not likely to be relevant to most modern chatbot users, much less in the design of tourist assistants.

Our approach, therefore, was based on the production of a parallel corpus. This corpus was based on actual conversations, which were carefully manipulated based on register theory to produce conversations of equivalent content, but in differing registers. Unlike previous studies that focus on style [41, 60, 121] (i.e., preferences associated with authors or historical periods), we relied on register theory to identify and reproduce language variations that would be plausible for a tourist assistant chatbot to use. Moreover, developing a concrete basis for explicitly manipulating conversational register in the design of chatbot language requires an explicit characterization of the register. Therefore, we identified a set of linguistic features that together characterize the register and show how varying these features affects user perceptions of conversational quality.

Our approach comprised four steps, as illustrated in Figure 1. We collected two corpora of conversations in the tourism domain and used them to develop parallel corpora that, while equivalent in content, differ across varying dimensions of conversational register. These conversations were then presented to users to generate a multi-faceted evaluation of subjective conversational quality. Finally, the user perceptions were analyzed with respect to the variations in register to expose a relationship between language patterns and user perceptions. These steps are further described below.
Fig. 1. Overview of the research method. The method consists of four main steps ((1) data collection, (2) register characterization, (3) text modification, (4) user preferences evaluation), and the outcomes of one step are seeded into the next step. The figure illustrates, for example, the FLG answer "There are always a lot of great options in downtown Flagstaff..." paired with the modified FLGmod version "We always have places you can visit..." (similar content, varying features).

(1) Data collection: to provide a foundation for our analysis, we collected conversations of human domain experts (tourist assistants) interacting with tourists in a text-based tourist information search scenario; we refer to this as the FLG corpus. Because conversational register is characterized by comparing linguistic expression in varying interactional situations, we also selected another corpus of conversations in the tourism domain that is available online and is commonly used in natural language research, namely DailyDialog [85]. Conversations in this corpus span a random variety of daily-life topics, from ordinary life to politics, health, tourism, and other topics. Details about the corpora collection are provided in Section 4.

(2) Register characterization: our next aim was to characterize the conversational registers present in our two corpora. Based on a broad set of linguistic features that have been previously identified as relevant for characterizing conversational register [11], we performed register analysis of the FLG and DailyDialog corpora individually, and then statistically compared the patterns of language between them, similar to the analysis performed by Chaves et al. [27]. The register characterization step is detailed in Section 5.

(3) Text modification: having identified discrete register variations present in the two corpora, our focus shifted to using these register characterizations to produce a parallel corpus in which conversations had equivalent information content, but used a different linguistic format. Specifically, for every answer provided by a tourist assistant in the FLG corpus, we performed linguistic modifications to produce a new corresponding answer that portrays a language pattern that mimics the register characteristics from the DailyDialog corpus; we call this produced parallel corpus FLGmod. To assess whether the modified answers in FLGmod preserved the informational content of the original, we performed a study to validate the text modification. We invited participants to compare the parallel answers in terms of naturalness and content preservation. Section 6 details the text modification and validation. After performing these foundational steps, we ended up with two parallel corpora (FLG and FLGmod).

(4) Users preferences evaluation: after developing the parallel corpora that differ solely in the portrayed register, we performed a study to reveal whether users perceived register variations and, if so, which linguistic variations within a register characterization appear to have the greatest impact on aspects of user experience. Overall, we expected to find a preference for the original answers from the FLG corpus, since these are register-specific language produced by humans. To perform the analysis, we selected a subset of tourist questions and their corresponding answers from both FLG and FLGmod. Participants were presented with these individual question-answer exchanges and, for each, were asked to choose which answer they preferred based on three distinct measures of quality: appropriateness, credibility, and overall experience. Then, we fitted a statistical learning model to identify the linguistic features that best predict the users' choices. This study is detailed in Section 7.

Having established an overview of our study method, the following sections detail each of the steps outlined above.
The data collection comprises two sub-steps: (i) collecting a baseline corpus of human-human conversations between tourist assistants and tourists; and (ii) selecting an existing corpus of conversations in the tourism domain. In the following, we introduce the collected corpora.
To collect the FLG corpus, we hired three experienced professionals from the Flagstaff Visitor Center in Flagstaff, Arizona, USA, to answer tourist questions about the city and nearby tourist destinations during summer 2018. The official government website reports that Flagstaff receives over 5 million visitors per year [46], including in-state, out-of-state, and international visitors. According to the 2017-2018 Flagstaff Visitor Survey [126], Flagstaff is the central hub for visiting tourist destinations such as Grand Canyon National Park, Arizona Snow Bowl, the Navajo and Hopi reservations, and many other local attractions. Regional tourism is significant as well, with a large number of visitors seeking to escape the heat and crowding of the Phoenix metropolitan area.

The three tourist assistants were native English speakers, female, had some post-secondary education, and had four or more years of experience as tourist assistants. Two of them were 25-34 years old; the other was in the 35-44 age range. Although they had more than four years of experience in providing tourist information in in-person conversations at the Visitor Center, they had never professionally provided information through an online platform.

To recruit tourists to interact with the tourist assistants, we advertised the free tourist assistant online service in the city of Flagstaff through flyers and intercepted tourists at the Flagstaff Visitor Center in Historic Downtown, directing them to a booth to use the service. About 30 tourists participated in the interactions. We also collected tourism-related questions about Flagstaff from websites such as Quora, Google Maps, and TripAdvisor, and a researcher posted these questions to the tourist assistants. The tourist assistants were unaware of the origin of the questions and thought they were always interacting with real tourists.

The tourist advising conversations were performed through a Facebook Messenger account [42] over the summer of 2018. The human tourist assistants participated in the study from our lab. Before the first interaction, the tourist assistants participated in a training session in which we presented the environment and the tools. During the study, a researcher observed the interactions and took notes on comments made by the tourist assistants. Because we wanted to understand the natural linguistic variation in tourism-related interactions, both tourists and tourist assistants were free to interact according to their needs, interests, and knowledge. No tasks were proposed to the tourists, nor were any scripts provided to the tourist assistants. The textual exchanges were exported from Facebook Messenger and archived to create the FLG corpus; the corpus comprises 144 interactions with about 540 question-answer pairs. To analyze the register of the conversation, we only used the answers from the tourist assistants.
The second corpus we selected is DailyDialog [85], which is available online and used as a reference for research on natural language generation in the tourism domain. DailyDialog consists of conversations about daily life crawled from websites for English language learning, with topics ranging from ordinary life to politics, health, and tourism. Similar to someone who would use this corpus to train a chatbot, we filtered the original corpus to select only conversations that were originally labeled as "tourism" and show customer-service provider interactions, e.g., hotel guest-concierge, business person-receptionist, tourist-tour guide, etc. We chose DailyDialog because it contains a large set of conversations in the tourism domain and it is likely to be used as a baseline model for chatbot conversations [51].

We downloaded the DailyDialog corpus from its website (http://yanran.li/dailydialog). After filtering to focus on tourism-related interactions, the subset of DailyDialog used in this research comprises 999 interactions. Because we are only interested in the utterances produced by the service providers (DailyDialog) and tourist assistants (FLG), we edited the conversations to remove the tourists' utterances.
We followed the situational analytical framework proposed by Conrad and Biber [34] to identify the situational parameters in which the conversations took place. The main outcome of the situational analysis is presented in Chaves et al. [28] and summarized in Table 1.
Table 1. Situational analysis. Situational parameters are extracted from the situational analytical framework [34].

Situational parameter | DailyDialog | FLG
Participants | Customer and service providers | Tourists and tour guides
Relationship | Role, power, and knowledge relations vary | Tourist-tour guide, the latter owns the knowledge
Channel | Human-written, representing face-to-face | Written, instant messaging tool
Production | Planned | Quasi-real-time
Setting | Private, shared time, and mostly physically shared place | Private, shared time, virtually shared place
Purpose | Provide a service or information | Information search
Topic | Varies within the context of tourism | Local information (e.g., activities, attractions)
According to register theory [34], differences in situational parameters result in varying register; people use different patterns of language depending on the context. DailyDialog presents larger variability in terms of situational parameters (e.g., participants, purpose, and topic) than FLG. Given the differences in the situational parameters, we expect that the language characteristics in FLG differ from the language characteristics in DailyDialog. We investigate this claim in the next section, where we discuss our register characterization analysis, which identifies the linguistic features that determine the register of each corpus and how the typical language varies among different situations in the tourism domain.
To characterize the varying conversational registers used in our two corpora, we performed a register analysis [11]. Register analysis consists of identifying the linguistic features typically used in a corpus, which is based on tagging and counting the linguistic features present in the utterances and interpreting them according to their function in the sentence [11, 34]. We performed register analysis for both the FLG and DailyDialog corpora and then compared the outcomes to identify the variations in language use across corpora; the following subsections present this analysis and its outcomes in detail.
Our register analysis relied on information from the Biber grammatical tagger [14] to identify the linguistic variation present in each corpus. Given a set of texts, this tool tags and counts the linguistic features present in each text and produces dimension scores for each text, which are based on aggregations of subsets of features derived using a multidimensional analysis algorithm. The dimension scores reveal the prevailing characteristics of the register (i.e., the levels of personal involvement, narrative flow, contextual references, persuasion, and formality present in the texts) [11]. Details about the tagger can be found elsewhere [11, 14].

We first analyzed the dimension scores to understand the linguistic characteristics and varieties of the discourse in each corpus. Following Biber [11], we applied a one-way multivariate analysis method (MANOVA) to generate a statistical comparison of the dimension scores across corpora, where the dependent variables are the values of the five dimension scores, and the independent variables are the groups: DailyDialog (the control group) and the three tourist assistants from the FLG corpus (TA1, TA2, and TA3). Each text corresponds to one observation in our model, where a text is a set of one or more contiguous sentences produced by an interlocutor (i.e., one answer). Given the significant overall MANOVA test, we also performed a one-way univariate analysis (ANOVA) per dimension, where the control group is DailyDialog and the experimental groups are each of the three tourist assistants from the FLG corpus. All the reported statistics use a 5% significance level (α = .05).

The MANOVA revealed that our three tourist assistants' dimension scores are significantly different from the average DailyDialog discourse (Wilks' Λ test, p < .05).
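A sketch of this analysis in Python (statsmodels) follows; the data frame layout and file name are hypothetical, with one row per text, its five dimension scores, and a group label in {DailyDialog, TA1, TA2, TA3}.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("dimension_scores.csv")  # hypothetical file name

# One-way MANOVA: five dimension scores as dependent variables, group as the
# independent variable; mv_test() reports Wilks' lambda among other statistics.
manova = MANOVA.from_formula("dim1 + dim2 + dim3 + dim4 + dim5 ~ group", data=df)
print(manova.mv_test())

# Follow-up one-way ANOVA per dimension, comparing each tourist assistant
# against the DailyDialog control group.
for dim in ["dim1", "dim2", "dim3", "dim4", "dim5"]:
    fit = smf.ols(f"{dim} ~ group", data=df).fit()
    print(dim)
    print(sm.stats.anova_lm(fit))
```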
Table 2. Univariate analysis of dimension scores. For each dimension (Dim. 1: Involvement through Dim. 5), the table shows the estimated dimension score and the standard error per group (DailyDialog, TA1, TA2, TA3), and the corresponding F- and p-values. (e.g., Dim. 1: Involvement, DailyDialog estimate 30.50; remaining values omitted.)

DailyDialog portrays an oral discourse, while the tourist assistants in FLG are more literal and informational than involved (Dimension 1), which can be explained by both the face-to-face nature of the conversations in DailyDialog and the variation in the participants' roles and power. DailyDialog also has a more extreme negative score for contextual references (Dimension 3), which might be explained by the shared space and common ground provided by face-to-face interactions. DailyDialog also has a slightly more formal discourse and elaborated language, although this difference is not significant. Both corpora show descriptive rather than narrative language (negative estimates for Dimension 2) and slightly persuasive language (positive estimates for Dimension 4).
Since our ultimate goal is to reproduce the patterns of language (i.e., register) of DailyDialog within the conversations of the FLG corpus to produce our parallel corpora, we need to identify not only the main linguistic characteristics present in the discourse (e.g., how involved or persuasive the discourse is), but also how these characteristics emerge in each corpus. Although the dimensional analysis reveals the overall register characterization, it is not sensitive enough to identify the prevailing linguistic features that influence the overall discourse. For example, although Dimension 4 is not significantly different, the prevailing linguistic features that contribute to the dimension score varied across corpora. Thus, we statistically compared the occurrences of every linguistic feature per dimension. The left side of Table 3 lists the linguistic features that vary significantly between the two original corpora (FLG and DailyDialog), as revealed by the ANOVA analysis per feature. The table includes the estimates for DailyDialog (the control group) and each tourist assistant in FLG (TA1, TA2, TA3) as well as the F-values. A more complete table that includes the non-significant linguistic features and the p-values per feature is presented in the supplementary materials, which also include a glossary with examples of the features.
Table 3. ANOVA results for individual features comparison between DailyDialog and both the original and modified corpora. The left side of the table presents the estimates and standard error for each independent variable (DailyDialog, TA1, TA2, TA3) and the F-value. The right side of the table shows the estimates, standard error, and F-values for the three experimental groups (TA1mod, TA2mod, TA3mod) after the modifications were performed (the DailyDialog column is omitted on the right side to avoid repetition). Features are grouped by dimension (e.g., Dimension 1: personal involvement, which includes private verbs, with a DailyDialog estimate of 14.7). (Remaining numeric values omitted.)

As indicated in Table 3, the register analysis reveals 22 linguistic features that vary significantly across corpora through all of the five register dimensions. As we anticipated, differences in the situational parameters influenced the patterns of language observed in the corpora, with the typical language presented in DailyDialog varying significantly from that presented in FLG for a core set of linguistic features.
Having characterized the differences in register between the FLG and DailyDialog corpora, our next step was to clone the FLG corpus and then use the register characterization to modify its utterances, mimicking the register characteristics observed in the DailyDialog corpus. In psycholinguistics, linguistic modification consists of changing the language of a text while preserving the text's content and integrity, which includes using familiar or frequently used words [22, 89, 112]. We applied this technique to alter the linguistic features in FLG to approximate its language to the patterns presented in DailyDialog. The new corpus, which we call FLGmod, is paired with the original FLG corpus to form a pair of parallel corpora equivalent in topic, participants, and informational content, but expressed in varying patterns of language.

One important aspect of text modification is that both the informational content and the basic linguistic integrity of the text should be preserved [22, 89, 112]; otherwise, the quality of the produced sentences can be compromised. Therefore, we performed a validation study where participants proof-read the original and modified answers and assessed the modifications in terms of content preservation and quality attributes, such as naturalness and meaningfulness.
We manipulated the answers in FLG to approximate the estimated values (presented in Table 3) of a particular feature to the corresponding estimate in DailyDialog. For example, the estimate for private verbs in DailyDialog is 14.70, whereas the greatest rate for private verbs among the tourist assistants in FLG is 7.29; therefore, we wanted to increase the occurrences of private verbs until we reached an estimate that is closer to, and not significantly different from, 14.70 for every tourist assistant in the FLG corpus.
Inspired by previous studies on chatbot language use (see, e.g., [41]), modifications were performed semi-manually, using the AntConc tool [3] and a Python script as support tools. The Python script took in a list of paired inputs, where the first parameter is a text present in the original data that needs to be changed (the target), and the second parameter is the text that will replace the original (the goal). The script then searched for all the occurrences of the first parameter in the FLG corpus and replaced them with the second. This process was repeated until there were no more entries in the list. The AntConc tool was used to help identify the targets to be added to the Python input list. The script then saved the modified interactions to generate the complete FLGmod corpus. The new texts were then submitted to the Biber tagger and subjected to renewed register analysis, as described in Section 5. If the features of interest in FLGmod were still statistically different from the estimates in DailyDialog, the process was repeated until the cumulative changes gathered in the modified corpus yielded a register analysis result that was as similar as possible to the register profile of DailyDialog.
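The script itself is not part of the published materials; a minimal sketch of the substitution pass it describes might look like the following, with file names and data structures as illustrative assumptions. Each (target, goal) pair was curated by hand with AntConc.

```python
import json

# Hypothetical input: a hand-curated list of [target, goal] pairs, e.g.,
# [["there are always", "we always have"], ...]
with open("replacement_pairs.json") as f:
    pairs = json.load(f)

with open("flg_answers.txt") as f:  # one tourist-assistant answer per line
    answers = f.read().splitlines()

modified = []
for answer in answers:
    for target, goal in pairs:
        # Replace every occurrence of the target phrase with the goal phrase.
        answer = answer.replace(target, goal)
    modified.append(answer)

with open("flg_mod_answers.txt", "w") as f:  # corpus to re-tag and re-analyze
    f.write("\n".join(modified))
```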
The right side of Table 3 shows the comparison between DailyDialog and FLGmod. The table shows the estimates for each tourist assistant in FLGmod and the F-value compared to DailyDialog (the control group). For three features, namely nouns, first-person pronouns, and second-person pronouns, the linguistic modification did not reach a non-significant difference, although we substantially reduced the F-values. The problem with these features is that the differences between DailyDialog and FLG were so extreme that forcing them to non-significant levels in the modified corpus could affect the distribution of the co-occurring features (e.g., increasing verbs associated with pronouns) or produce artificial changes that could harm the content preservation and the naturalness of the resulting answer. For all the other features, the modifications reached non-significant differences. Table 4 shows an example of a modified answer. Although both the counts for the co-occurring features and the length of the answers (which influences the normalized counts) changed due to the modifications, features that were not statistically significant when compared to FLG were still not significant when compared to FLGmod. The statistical results for the non-significant features can be found in the supplementary materials.
Table 4. Example of a modified answer. The left side shows the answer provided by a tourist assistant in the original data collection. The right side shows the corresponding answer, modified to portray features that mimic the DailyDialog linguistic form. Modified words are highlighted in bold and the tags attributed to the words are between square brackets.

Original answer from the FLG corpus: "Well there are always a lot of [preposition] great options [attributive adjective, noun] in downtown Flagstaff [preposition, nouns] for [preposition] live music at bars like The State Bar and the Hotel Monte Vista. You can see what activities [nominalization] are going on [preposition] for music and events [preposition, nouns]."

Modified answer (FLG content, DailyDialog form): "Well we always have [1st person pronoun, present verb] places you can visit [2nd person pronoun, present verb] that have [present verb] live music. I'd suggest [1st person pronoun, contraction, prediction modal, suasive verb] ..."
In summary, for each question-answer pair in the original FLG, there is a corresponding question-answer pair in FLGmod, where the answer has equivalent informational content but is expressed in a different register. The patterns of language use in FLGmod mimic those in DailyDialog, which is, on average, more personally involved and oral, with additional features for persuasion and formality. To evaluate the quality of the modifications, we asked human subjects to compare the content in the original and modified versions of the answers and assess the naturalness of the modified answers, as detailed in the following section.
We performed a validation study to verify whether the modifications preserved the content and the naturalness of the original text. Details of this validation study are presented in the following subsections.
We randomly selected 54 (10%) question-answer pairs from the parallel corpora and asked participants to judge the content preservation and naturalness of the answers. We collected data via an online questionnaire, where each participant assessed two blocks of questions, presented in random order. In one block, participants were presented with a tourist question and two possible answers (A and B), which correspond to the original answer from FLG and the modified version from FLGmod. Participants were unaware of how answers A and B were produced. For each question-answer pair presented on the screen, participants were invited to rate how similar the information provided in the answers was, regardless of how the messages were written. Participants used a slider to select the similarity level, where the extremities of the slider were labeled with "Completely different" (0) and "Exactly the same" (100). An example of a content preservation question is presented in Figure 2. Each participant assessed 10 randomly selected question-answer pairs for content preservation.

In the second block of the assessment, we were interested in the naturalness of the answers. We selected a subset of 27 question-answer pairs from the original set of 54. Participants were presented with a question and one single answer, either from FLG or FLGmod, at a time. Considering the question-answer pair on the screen, participants were invited to use a seven-point Likert scale (1: completely disagree / 7: completely agree) to rate the answer on four dimensions: natural, complete, meaningful, and well-written. Each participant rated nine question-answer pairs randomly selected from the 54 possible (27 answers from FLG and 27 answers from FLGmod), as well as one attention check.
Fig. 2. Example of a content preservation question. Participants selected their answer to the question using the slider, where 0 represents that the content in the answers is completely different, while 100 represents that the content in answers A and B is exactly the same.
To evaluate the content preservation, we fitted an intercept-only, mixed effects linear model [10] with the score for content similarity as the dependent variable. The random effects are the questions and the participants' identifications. We considered the content preservation for a modification to be reasonable (i.e., content essentially the same) if the estimate for the intercept (taking into account the standard error) stayed above the upper quartile (θ ± σ > 75).

Participants were recruited through Prolific in February 2020. Prolific is an online recruitment service explicitly designed for the scientific community to enable large-scale recruitment of willing research participants (see [104]). We recruited a total of 90 participants, but two were later discarded due to failure to answer the attention check (N = 88).

Our content preservation dataset contains 880 observations (10 observations per participant). Each question-answer pair was evaluated from 14 to 18 times. The model shows that the estimated mean for content similarity is 86.78 (SE = 1.39, df ≈ 99), which is reasonably above the upper quartile (75). Both random effects are significant, and the effect of participants has the largest variance (see Table 5). Hence, we conclude that the linguistic modifications made in producing the FLGmod corpus reasonably preserved the content of the answers in FLG.
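A sketch of such an intercept-only model with crossed random effects for participants and questions, using statsmodels' variance-components idiom, is shown below; column and file names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("content_similarity.csv")  # hypothetical file name
df["const_group"] = 1  # single top-level group so that participant (pid) and
                       # question both enter as crossed variance components

model = smf.mixedlm(
    "similarity ~ 1",  # intercept-only fixed part: the mean similarity score
    df,
    groups="const_group",
    vc_formula={"pid": "0 + C(pid)", "question": "0 + C(question)"},
).fit()
print(model.summary())  # compare the intercept (minus its SE) against 75
```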
Table 5. Random effects. The variance explained by the participants is larger than the residual variance, which shows that a significant portion of the variation is influenced by participants' biases.

Groups | Variance | Std.Dev.
PID (Intercept) | 143.674 | 11.986
Question (Intercept) | 9.112 | 3.019
Residual | 102.486 | 10.124

Regarding naturalness, the number of positive ratings is consistently higher than the number of negative ratings for both corpora (see Table 6). The CLMM models show that the scores for every item do not significantly vary, as presented in Table 7. Additionally, the participants' biases account for a lot of the variance (see the supplementary materials for the random intercepts results). Hence, we conclude that the answers in FLGmod are not significantly different in terms of naturalness from the answers in FLG.

Table 6. Number of times the original and modified answers received negative, neutral, and positive scores. Both original and modified answers consistently received more positive than negative scores.

Group | Negative (1-3) | Neutral (4) | Positive (5-7)
Original (FLG) | 180 | 84 | 1317
Modified (FLGmod) | 255 | 105 | 1210

Table 7. CLMM results per evaluated item. The table shows the estimate, standard error, z-values, and p-values for each item.

Item | Estimate | SE | z | Pr(>|z|)
Natural | -0.61 | 0.32 | -1.90 | 0.06
Meaningful | 0.07 | 0.051 | 1.42 | 0.16
Complete | -0.10 | 0.29 | -0.34 | 0.74
Well-written | -0.58 | 0.35 | -1.66 | 0.10
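statsmodels has no direct CLMM equivalent; as a simplified, fixed-effects-only approximation, an ordinal (cumulative link) logit of the 7-point ratings on corpus version could be sketched as below, ignoring the random intercepts for participants and questions that the full CLMM includes. Column and file names are hypothetical.

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("naturalness_ratings.csv")  # hypothetical file name
df["rating"] = pd.Categorical(df["rating"], categories=range(1, 8), ordered=True)

# 0/1 indicator for the modified corpus; its coefficient plays the role of the
# per-item estimates reported in Table 7.
exog = pd.get_dummies(df["version"], drop_first=True).astype(float)

model = OrderedModel(df["rating"], exog, distr="logit").fit(method="bfgs")
print(model.summary())
```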
In summary, our text modification process produced a parallel corpus, FLGmod, with equivalent informational content as the FLG corpus, but with the register characteristics of the DailyDialog conversations. Our validation study indicates that the modifications introduced to generate the FLGmod corpus preserved the content and the naturalness expressed in FLG.

In our last research step, we finally address the motivating question driving our effort: investigating whether users are sensitive to changes in conversational register and how such differences in register impact the perceived quality of the interaction and the overall user experience. To explore this issue, we compared the original and modified corpora in a study to identify which answers the participants perceive as more appropriate, more credible, and providing the best user experience. Considering that human tourist assistants produced the language in FLG, and therefore it is likely to be register-appropriate to the proposed interactional situation, we hypothesized that findings would show a preference for answers from FLG; the answers from FLGmod should score lower, as we artificially modified them to introduce a register that is less likely to be appropriate to the situational parameters in FLG. Our analysis additionally included variables to represent the individual interlocutors (both assistants and participants) to compare the strength of these variables against the register variation expressed in the corpora.
User experience (UX) refers to the overall experience of a person using a software product, which includes their perceptions and attitudes, such as emotions, beliefs, behaviors, and accomplishments [64]. Because user experience is a very broad concept, it is crucial to delimit the scope of the term for this particular study. User experience is often measured in terms of usability metrics, such as effectiveness, efficiency, and satisfaction [44, 96, 108]. (Effectiveness, efficiency, and satisfaction are part of the general definition of usability, according to ISO 9241-11 [64]. Usability, in turn, is part of the overall user experience and, in many cases, the two terms are used interchangeably [128]. In this research, we differentiate usability from user experience, where usability is the ability to carry out the task successfully and user experience focuses on the user perceptions and behavior resulting from the interaction [128].) In this research, however, we controlled for the chatbot's technology and knowledge as well as the users' tasks, since the main goal was to evaluate user perceptions regarding the varying patterns of language use. Thus, the usability constructs, such as task success, robustness, and ease of use, were equivalent across treatments.

Thus, for this research, user experience is defined in terms of attitudinal metrics. Specifically, we measured specific attributes that are potentially influenced by the user's expectations and perceptions of the chatbot's language use. Since the conversation's participants and the relationship among them are pointed out in the situational analysis framework [34] as characteristics that influence the register, we evaluated whether the register variation influences how appropriate the language is, given the chatbot's social role. Appropriateness speaks to the fit between what the user expects in that conversational context in terms of linguistic form and content and what they actually encounter in the answer. We expected this to be influenced by the tourist's expectations concerning the assistant's communicative behavior and the stereotypes of the social category [67, 77].

The perceived communicative behavior might also influence how users perceive the chatbot's credibility. Credibility is presented as a rating of confidence in the accuracy of the information contained in the answer; the failure to convey expertise through language compromises credibility [137] and trustworthiness [92, 117]. According to Corritore et al. [36], credibility consists of four factors: honesty, expertise, reputation, and predictability. In this study, we focus on the expertise and honesty factors, which represent a chatbot's perceived competence and believability. Since the participants had no previous experiences with the studied chatbot, the reputation and predictability factors did not apply.

Finally, the overall user experience represents the user's general satisfaction with the provided answer and, consequently, the interaction's outcome. Considering that what is said is similar between FLG and FLGmod, user experience should be mostly influenced by how it is said. In the following subsections, we present the method used to evaluate user perceptions of the two parallel corpora and its outcomes.
We selected 10% of the question-answer pairs from the parallel corpora to be evaluated. Given the semi-manual modification process, in some cases the answers to a given question were very similar between the two corpora; that is, a particular answer may not have been modified at all as part of our register-shifting process. To focus our comparison on answers with distinct differences in register, we selected answers that had been substantially modified: we calculated the Levenshtein distance [37, 133] between the pairs of original and modified answers and selected the 54 question-answer pairs with the highest distance values. The Levenshtein distance corresponds to the minimum cost (i.e., the number of insertions, deletions, or substitutions) of transforming one text into the other [37]. See the supplementary materials for the evaluated question-answer pairs and the corresponding Levenshtein distance values.

We collected participant responses via an online questionnaire. After reading and agreeing with the informed consent, participants were introduced to the task: given a tourist question, identify the answer that best represents a tourist assistant's discourse. For this experiment, participants were told that the answers would be provided by a chatbot. For each tourist question, presented on the screen one at a time, participants could choose one out of three options: the original answer (from 𝐹𝐿𝐺), the modified version (from 𝐹𝐿𝐺𝑚𝑜𝑑), and "I don't know" (see an example in Figure 3). Original and modified answers were presented in a randomized order, whereas "I don't know" was always the last option on the list. The supplementary materials contain a sample of this instrument for the other constructs as well.
Fig. 3. Example of a question from the study. In this example, the participant was invited to select the answer that portrays the most appropriate language. Participants selected their responses by clicking on their preferred answer or on the "I don't know" option.
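Returning to the pair-selection step above, the following is a minimal R sketch of the distance computation, assuming character vectors original and modified (hypothetical names) that hold the parallel answers in matching order:

    # base R's adist() computes the generalized Levenshtein (edit) distance,
    # i.e., the number of insertions, deletions, or substitutions needed to
    # transform one text into the other.
    d <- mapply(function(a, b) adist(a, b), original, modified)
    selected <- order(d, decreasing = TRUE)[1:54]  # the 54 most heavily modified pairs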
Constructs were also evaluated one at a time. Thus, we first showed a definition of the construct of interest (e.g., appropriateness) and then the ten question-answer pairs to be evaluated for that construct: nine question-answer pairs extracted from the corpora and one attention check. In total, each participant evaluated 27 different question-answer pairs (9 per construct, without repetition across constructs) randomly selected from the possible 54; a sketch of this assignment follows below. The order of the constructs was also randomized. In the end, participants answered the demographics and social orientation questionnaire. We used the social orientation items proposed by Liao et al. [86]; social orientation toward chatbots determines a participant's preferences regarding human-like social interactions with chatbots [86]. The outcome of the questionnaire consists of the users' preferences regarding which answer (original vs. modified) is the most appropriate and credible for a tourist assistant, as well as the answer that would result in the best user experience.
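As a minimal sketch of that per-participant assignment (all names hypothetical):

    # Randomize the construct order, draw 27 of the 54 pairs without
    # replacement, and allocate 9 pairs to each construct, so no pair
    # repeats across constructs for the same participant.
    constructs <- sample(c("appropriateness", "credibility", "ux"))
    drawn <- sample(54, 27)
    assignment <- split(drawn, rep(constructs, each = 9))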
Participants were recruited through Prolific in March 2020. We received a total of 193 submissions, 15 of which were discarded due to either technical issues in the data collection or failure to answer the attention checks, leaving 𝑁 = 178 participants (𝜇 = ….10 years old, 𝜎 = …).
To model the users' preferences between the original and modified versions of 𝐹𝐿𝐺, we fitted a generalized linear model (GLM) for two-class logistic regression, using the glmnet package in R [50]. Because we are interested in the difference between the original and modified versions of 𝐹𝐿𝐺 for each linguistic feature of interest (listed in Table 3), we calculated the original − modified counts and input this difference into the model. For example, suppose that one particular answer provided by the tourist assistants in the original 𝐹𝐿𝐺 corpus has 31.2 private verbs (normalized per 10K words), and that after linguistic modification the corresponding answer has 62.4 private verbs. Then, for this particular answer, the value for private verbs input into the model is 31.2 − 62.4 = −31.2. A negative value means that the occurrences of that feature were increased in the modification process for that particular answer. In contrast, a positive value for a feature means that the occurrences of that feature were reduced in the modification process.
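A minimal sketch of this setup in R, assuming hypothetical data frames orig_counts and mod_counts (one row per evaluated answer, one column per linguistic feature, counts normalized per 10K words) and a response vector y coded 0 when the participant chose the original answer and 1 when they chose the modified one; the interlocutor indicator variables are omitted for brevity:

    library(glmnet)

    # original - modified count per feature; a negative entry means the
    # modification process increased that feature's occurrences.
    X <- as.matrix(orig_counts - mod_counts)
    fit <- glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1 selects the L1 (lasso) penalty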
Consider a model with the response variable 𝑌 = {0, 1}, where 0 represents the original answers (𝐹𝐿𝐺) and 1 represents the modified answers. The L1-regularized logistic regression algorithm models class-conditional probabilities through a linear function of the predictors [50]. The prediction function is defined as

𝑓(𝑥) = 𝑤ᵀ𝑥 + 𝛽

where 𝑥 is a feature vector of 𝑝 real numbers representing (i) the difference between original and modified counts per linguistic feature, and (ii) variables representing the participant who answered the question (1 if the observation was answered by that participant, 0 otherwise), the participant's self-assessed social orientation (1 to 7), and the author of the answer (1 if the answer was authored by that tourist assistant, 0 otherwise). We want to learn a 𝑝-vector 𝑤 of weights and a real scalar intercept 𝛽. The L1 regularization ensures that the learned model has a sparse, interpretable 𝑤: some entries will be exactly zero, and these entries correspond to features that are not used or important for prediction. The prediction function 𝑓(𝑥) gives real-valued predictions for the given feature vector 𝑥. The logistic link function that yields the predicted probability in [0, 1] is

𝑝(𝑥) = 1 / (1 + exp(−𝑓(𝑥)))

The function predicts the negative answer (i.e., 0: original) when 𝑓(𝑥) < 0, i.e., 𝑝(𝑥) < 0.5, while the positive answer (i.e., 1: modification) is predicted when 𝑓(𝑥) > 0, i.e., 𝑝(𝑥) > 0.5. For comparison purposes (to determine an upper bound on prediction accuracy), we fitted two non-linear learning models: random forest and gradient boosting. For the random forest, we used the party package in R, which provides an implementation of random forests with conditional inference trees [61, 115, 116]. For the gradient boosting, we used the xgboost package in R [31], an efficient implementation of gradient tree boosting. The measures were also compared to a baseline model, which always predicts the most frequent class in the training data (and provides a lower bound for prediction accuracy). The evaluation metrics were the accuracy, the ROC curve, and the area under the curve (AUC). To calculate these measures, we used 10-fold cross-validation. First, we randomly assigned every observation in the data set to one of 𝑘 = 10 folds. For each fold ID from 1 to 10, we created a test set comprising all observations with the matching fold ID and used all other observations for the training set. We then used the training set to learn model parameters and the test set to evaluate prediction accuracy. The cross-validation pseudo-algorithm is available in the supplementary material, and the R code and datasets are available on [26].
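A minimal sketch of this evaluation loop, reusing the hypothetical X and y from the sketch above; the pROC package (an assumption here, not named in the paper) supplies the AUC computation:

    library(pROC)

    set.seed(1)
    fold <- sample(rep(1:10, length.out = nrow(X)))  # random fold assignment
    acc <- aucs <- numeric(10)
    for (k in 1:10) {
      test  <- fold == k
      cvfit <- cv.glmnet(X[!test, ], y[!test], family = "binomial", alpha = 1)
      prob  <- as.numeric(predict(cvfit, X[test, ], type = "response", s = "lambda.min"))
      acc[k]  <- mean((prob > 0.5) == (y[test] == 1))  # held-out accuracy
      aucs[k] <- auc(roc(y[test], prob))               # held-out AUC
    }
    base_acc <- max(mean(y == 0), mean(y == 1))  # baseline: always predict the majority class
    c(accuracy = mean(acc), auc = mean(aucs), baseline = base_acc)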
Our evaluation dataset started with a total of 4,806 observations (178 participants, 27 evaluations per participant). From this total, participants skipped the question without answering in 11 observations. In 77 others, the participants signaled that they did not have a preference (the "I don't know" option). These observations were discarded from the analysis, resulting in a dataset with 4,718 observations. Each question-answer pair was evaluated from 24 to 35 times per construct. As we expected, participants overall preferred the answers from the original corpus, although the modified version was preferred for a few answers. A table with the number of votes per question is presented in the supplementary materials.

Figure 4 shows the prediction accuracy and AUC plots for the four fitted models. Since participants generally preferred the original 𝐹𝐿𝐺 corpus answers, the fitted models come close to always predicting the most frequent class (original); this is particularly true for the random forest model. The prediction accuracies of glmnet and xgboost are only slightly better than the baseline. Nevertheless, the AUC plot indicates that the models are learning something important (the ROC curve plots are available in the supplementary materials), as the AUC values are consistently better than the baseline. Additionally, the glmnet model consistently selects the same variables across the 𝑘 folds.

Note that the non-linear models are not considerably more accurate than the linear model. Although there seems to be some non-linear trend in the data, it does not justify the complexity of the non-linear models. Hence, we used the results from the glmnet model to interpret the coefficients and identify the linguistic features that determine user preferences.

Fig. 4. Accuracy (a) and AUC (b) results per model for each construct (appropriateness, credibility, and user experience). The baseline represents a model that always predicts the most frequent class (original). Accuracy percentages show that glmnet, random forest, and xgboost perform only slightly better than the baseline model. AUC, however, is reasonably better than the baseline for the three models.
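As a minimal sketch of how such coefficients can be read off, one can refit the regularized model on all observations and keep the entries that survive the L1 penalty (continuing the hypothetical objects from the sketches above):

    cvfit_all <- cv.glmnet(X, y, family = "binomial", alpha = 1)
    beta <- coef(cvfit_all, s = "lambda.min")              # sparse column matrix, intercept included
    nonzero <- beta[as.vector(beta) != 0, , drop = FALSE]  # the selected variables
    # Negative entries push predictions toward the original class (0);
    # positive entries push them toward the modified class (1).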
Table 8 presents the coefficients of the linguistic features selected in six or more folds. The first and second columns indicate, respectively, the linguistic feature of interest and the sign of the original − modified calculation, which indicates whether a particular feature was increased or decreased in the text modification process. A positive sign (+) for a feature 𝑓𝑖 indicates that count_original(𝑓𝑖) > count_modified(𝑓𝑖), while a negative sign (−) indicates that count_original(𝑓𝑖) < count_modified(𝑓𝑖). The following three columns present the mean and standard deviation of the coefficients for each construct. Features with negative coefficients increase the likelihood of the model predicting the original class; features with positive coefficients increase the likelihood of the model predicting the modified class. The supplementary materials include plots of the coefficients for each construct.

Table 8. Coefficients and standard deviations (mean of coefficients ± std. deviation) of the non-zero variables per construct. Only linguistic features were selected as relevant variables for predicting users' choices, and most of the selected linguistic features are relevant for all three constructs. The column orig. − mod. indicates whether the original answers have more (+) or fewer (−) occurrences of that particular feature. Dots indicate that the corresponding feature was not selected for that particular construct.

Linguistic features | orig. − mod. | Appropriateness | Credibility | User Experience
[Per-feature mean coefficients ± standard deviations for the three constructs; see the supplementary materials for the corresponding coefficient plots.]

Original answers have significantly more coordinating conjunction clauses and attributive adjectives than the modified versions. These features have a negative coefficient for all three constructs, which indicates that frequent occurrences of these features increase the likelihood of original answers being chosen; participants are more likely to prefer answers in which these features are more frequent. The same conclusion applies to suasive verbs and causative subordination, although these features show up as relevant only for the user experience construct.

Original answers also have significantly more nouns, WH-relative clauses (subject position), and final prepositions. These features have a positive coefficient for all three constructs, which indicates that frequent occurrences of these features increase the likelihood of modified answers being chosen. This outcome suggests that participants are more likely to prefer answers in which these features are less frequent. The same conclusion applies to third-person pronouns, but this feature is relevant only for the user experience construct.

Two features showed inconsistent outcomes across constructs. Conditional subordination has a negative coefficient for appropriateness but a positive coefficient for credibility, suggesting that frequent occurrences of this feature influence appropriateness positively while influencing credibility negatively. Additionally, prepositions have a positive coefficient for both appropriateness and credibility, indicating that participants are more likely to choose answers where this feature is less frequent. In contrast, prepositions have a negative coefficient for user experience, indicating that, for this construct, participants tend to choose answers where the feature is more frequent. However, when we add the standard deviation to the estimate, the interval includes zero, suggesting that this outcome may be noise.

Modifications have significantly more adverbial conjuncts and prediction modals than the original answers. These features have a negative coefficient for all three constructs, which indicates that frequent occurrences of these features increase the likelihood of original answers being chosen. This outcome suggests that participants are more likely to prefer answers in which these features are less frequent. The same inference applies to contractions and second-person pronouns, although these features did not show up as relevant for user experience.

Modifications also have a larger number of first-person pronouns, present verbs, and time adverbials. These features have a positive coefficient for all three constructs, which indicates that increasing the occurrences of these features increased the likelihood of modified answers being chosen. This outcome highlights the participants' preference for answers in which these features are more consistently present. The same conclusion applies to private verbs, that-deletion, and indefinite pronouns; however, private verbs help to predict appropriateness only, that-deletion predicts appropriateness and user experience only, and indefinite pronouns are relevant for credibility only.

In conclusion, our results clearly show that the use of register-specific language has a significant impact on user perceptions of conversational quality for tourist assistant chatbots. There is an association between register-specific use of particular linguistic features and perceived quality, and linguistic features are stronger predictors of appropriateness, credibility, and user experience than individual characteristics of interlocutors (either assistants or users). The variables representing the tourist assistants who produced the original answers, as well as those representing the individual participants and their social orientation, were not selected as predictors of user preferences regarding chatbot language use, since these factors were not relevant.

The results of the study presented here show that users are sensitive to conversational register and, specifically, that register has a significant impact on user perceptions of conversational quality; a chatbot that adopts the wrong conversational register risks losing credibility and acceptance by users. In this section, we discuss our findings and their implications for the design of register-specific language engines for chatbots.

User perceptions of conversational quality are important in chatbot design because the central interactional goal of chatbots is to fluidly interact using natural language. Chatbots are often deployed to perform social roles traditionally associated with humans, particularly in contexts where there may be consequences for a human if they choose to act on the chatbot's information.
This means that user perceptions of chatbot competence and credibility are crucial for a chatbot's success. Some previous studies have found that appropriate language style is not relevant for determining user satisfaction as long as the user can understand the chatbot's answer, advising only that the chatbot's language style should be "mildly appropriate to the service the chatbot provides" [9]. We suggest, however, that this narrow focus on traditional usability metrics, such as effectiveness and efficiency, fails to adequately capture the broader context of user experience, which goes beyond simply comprehending a chatbot's utterance to, more importantly, whether the user trusts and ultimately uses the chatbot's advice. Specifically, user experience also involves the "user's perceptions and responses that result from the use of the system" [64], including emotions, beliefs, preferences, and perceptions, among others. The results presented here suggest that these crucial user perceptions are specifically shaped by how information is conveyed, as characterized by the conversational register, rather than by the explicit information content of conversational exchanges. In this sense, our results support earlier behavioral observations that language fit can impact the quality of the interactions as well as users' perceptions of and behaviors toward chatbots [67]. Importantly, the work presented here refines these observations by adapting register theory to provide a sound theoretical framework for concretely characterizing the concept of "conversational style" and for analytically exploring how variations in the patterns of language impact user perceptions of chatbot quality, including factors critical to the user experience with chatbots, like appropriateness and credibility. Using register analysis to characterize different situations and how they map to the most appropriate register profiles also provides a way forward in the design of the next generation of chatbots; one could imagine a chatbot language engine that, given a particular situational profile for planned conversations, could automatically configure its language to present information in the most appropriate register. As a start in this direction, in the following we summarize some of the key insights regarding the complex relationship between conversational register and user experience.
In this section, we begin with several broad observations drawn from our findings before moving on to discuss the specific linguistic features that have the greatest impact on user perceptions.
Register is an established theory in the sociolinguistics domain (see, e.g., [6, 11, 13, 20]) and has been shown to be a reliable predictor of language variation across conversational contexts. We aimed to explore the applicability of these results to human-chatbot interactions, develop a strong rationale for accounting for register in chatbot design, and provide a concrete mechanism for implementing theory in design practice. Our results show that, in terms of perceived appropriateness, credibility, and overall user experience, register characteristics are more relevant than individual preferences or personal habits; none of the covariates representing individual biases (participants, their social orientation, and the answers' authors) were identified as predictors of participants' choices (see Section 7.5.1). Thus, register characteristics can be seen as primary drivers of these perceptions, which designers should certainly consider to ensure chatbot acceptance and success.
Like other information search interactions, the interactions in our corpus are goal-oriented. Particularly for en-route tourist information searches (see Section 2), users are often pressed for time; efficiency in finding the target information is critical [124, 129, 132]. Efficiency is also a priority in other task-oriented domains where chatbots operate, such as customer service [23]. Our statistical modeling revealed several linguistic features as particularly important in supporting compact, efficient information sharing. First, the linguistic feature coordinating conjunction had the largest coefficient in our analysis; it is common in conversations due to real-time production constraints. Complex sentences in a conversation are often a linear combination of short clauses with a simple grammatical structure [20], usually connected by coordinating conjunctions. 𝐹𝐿𝐺𝑚𝑜𝑑 mimics the form of 𝐷𝑎𝑖𝑙𝑦𝐷𝑖𝑎𝑙𝑜𝑔, which portrays language that is more elaborate and carefully planned to achieve educational goals. The elaborate complexity leads to varying levels of grammatical structure, which can reduce efficiency when providing information. Similarly, participants also often preferred answers with more verbs and fewer nouns, a pattern present in the modified versions of the answers (𝐹𝐿𝐺𝑚𝑜𝑑). This outcome indicates a preference for active rather than descriptive language [11]. For example, when comparing the answers

𝐹𝐿𝐺: There are $2 off coupons inside the visitor center behind the desk.
𝐹𝐿𝐺𝑚𝑜𝑑: The visitor center has $2 off coupons you can get.

a participant stated that the first one (𝐹𝐿𝐺) "gives more information, but it's unnecessary" and "takes more time" [LabP2].

Participants were also more likely to choose answers with fewer WH-relative clauses and adverbial conjuncts. WH-relatives are "often an extra piece of information that might be of interest" [20]; for example, regarding the answers

𝐹𝐿𝐺: There is the Discovery Map, which is more geared toward visitors [...]
𝐹𝐿𝐺𝑚𝑜𝑑: The Discovery Map is more geared toward visitors

[LabP8] stated that the 𝐹𝐿𝐺𝑚𝑜𝑑 version was "more clear" with "less fluff to the sentence." Adverbial conjuncts are used to connect sentences in discourse [20]. Some are more frequent in face-to-face conversations (e.g., so, then, anyway), while others are more common in written language (e.g., however, therefore, although) [20]. As 𝐹𝐿𝐺𝑚𝑜𝑑 mimics 𝐷𝑎𝑖𝑙𝑦𝐷𝑖𝑎𝑙𝑜𝑔, which focuses on language learning, its most common adverbial conjuncts align with those that are common in written registers. These conjuncts increase the formality of the answers [11], and interactive chatbot users may see these words as unnecessary filler words. In the lab sessions, participants stated that "when there is a lot of information, some filler words can be left out" [LabP1]. For example, when comparing the answers

𝐹𝐿𝐺: You cannot leave it in 15-minute parking for an extended period of time. On the Amtrak side of the building, there is a paid parking lot.
𝐹𝐿𝐺𝑚𝑜𝑑: You cannot leave it in 15-minute parking for more than that. However, the Amtrak side of the building has a paid parking lot you could use.

a participant mentioned that the reason for the preference is that "they don't have the 'however,' and lead directly to the next sentence" [LabP8]. Additionally, these conjuncts imbue a style of formality that may create distance between the chatbot and the user.

Finally, the preference for that-deletion shown by the analysis is likely influenced by its frequent co-occurrence with suasive and private verbs. However, that-deletion "has colloquial associations and it is therefore common in conversations" [89], since conversations favor the omission of unnecessary words to accommodate real-time production constraints. Hence, user preferences for that-deletion in our data could be associated with the preference for efficiency in conversations.
The literature on human-chatbot interactions has extensively explored the need for chatbots to be "human-like." On the one hand, scholars grounded in media equation theory [47, 100] have shown that people prefer agents who reflect human social and conversational protocols, e.g., conform to gender stereotypes associated with tasks [49], self-disclose and show reciprocity when recommending [83], demonstrate a positive mood [57], and so on. On the other hand, overly humanized agents can create inaccurate expectations in users [53] and result in the "uncanny effect" [4, 33], which eventually leads to more frustration when the chatbot fails to live up to these increased expectations [53]. However, the idea of assigning a social role to a conversational agent does not necessarily imply deceiving people into thinking the software is human; a chatbot can be clearly identified as such, and still benefit from approximating its conversational register to the patterns of human-human communication. This study brought to light a crucial aspect of human-chatbot interaction, namely the need to balance the chatbot's anthropomorphic cues; several specific findings in our analysis support this observation.

In the Natural Language Generation field, the aggregation of sentences using coordinating conjunctions is commonly used to increase fluency and readability [109]. According to Reiter and Dale [109], aggregation of sentences can be de-emphasized if the text is obviously produced by a computer; this suggests that participants would not care about slightly stilted language, since they were told that the answers were produced by a chatbot. Our analysis reveals, however, that users prefer language that is more human-like, with fewer pauses and more coordination.
First-person pronouns can also increase human-likeness, although the plural form is preferable. The singular form ("I") unambiguously indicates the speaker, whereas referring to the speaker's identity in the plural form ("we") varies according to the context [20]. Choosing between singular and plural forms is a strong indicator of the identity that the chatbot conveys. When the chatbot says "I," it clearly refers to itself, but when it says "we," it can be interpreted as a general reference to its social category. Using "we" softens the role of the chatbot and highlights its role as a representative of a broader entity (e.g., professional tourist assistants, visitor center representatives, Flagstaff's tourism personnel). Both singular and plural forms convey personal involvement, but "we" may convey more credibility because the chatbot is more likely to be recognized as part of a community of knowledgeable individuals. Although both singular and plural pronouns are measured under the same linguistic feature, our evidence shows that participants preferred the plural use of this feature. As often occurs in 𝐷𝑎𝑖𝑙𝑦𝐷𝑖𝑎𝑙𝑜𝑔, the singular form ("I") co-occurred with the prediction modal "would" in 𝐹𝐿𝐺𝑚𝑜𝑑 to make suggestions or to give advice. Noticeably, participants preferred answers where prediction modals are less frequent, which indicates that the singular form of first-person pronouns is unlikely to influence users' preference for this feature. Quotes from the lab sessions also support that participants preferred the plural form. For example, regarding the singular form, [LabP1] stated: "I don't like when the chatbot says 'I,' it seems the developer is trying to trick me to think the bot has opinions. When chatbots use 'I' it sounds too much like pandering." In contrast, when comparing the answers

𝐹𝐿𝐺: "There are 50 miles of trails within Flagstaff [...]"
𝐹𝐿𝐺𝑚𝑜𝑑: "We have 50 miles of trails within Flagstaff [...]"

[LabP3] stated that "the use of the word 'we' makes it more personable than simply saying 'there is this.'"

Since the language in the baseline corpus (𝐹𝐿𝐺) was human-written and not tailored to represent the identity of a chatbot, participants likely perceived some of the chatbot's answers as overly human-like. As these findings reveal, chatbot language should conform to the expectations of its social category, as previous literature has suggested [53]. Still, the register of chatbot conversation must also consider the chatbot's artificial nature, particularly by positioning the agent as a representative of a broader entity. As register theory suggests [34], the interlocutors' identities and the relationship among them are influential parameters when defining the interactional context. Therefore, tailoring the chatbot's language to the appropriate register includes not only adapting to the language of the professional category it represents, but also revealing the chatbot's social identity as an artificial agent. These observations strengthen the relevance of conversational register as a theoretical foundation for the design of chatbot utterances.
Previous literature has shown the benefits of personalized interactions with chatbots [39, 114, 123], particularly in domains where the chatbot needs to build rapport and trustful relationships with the user, such as financial services [38], companion (or buddy) chatbots [114, 123], and recommendation systems [25]. At the same time, users might feel uncomfortable with some aspects of personalized content [123]. In our study, tourist assistants did not provide personally tailored information, as they had few or no clues about the tourists' identities or preferences due to the text-based nature of the conversations. In this inherently rather impersonal context, our results showed that participants preferred general rather than personalized information. Participants preferred lower frequencies of second-person pronouns (e.g., "you"), which are used as a direct address to the user [20]. Given that information search interactions focus on the assistant providing information without necessarily sharing a personal relationship with the user, it may sound inappropriate for a chatbot to use second-person pronouns rather than simply stating the information. In the lab sessions, for instance, a participant stated that "[the chatbot] saying 'you' implies a lot of personalization but in a pressuring type of way" [LabP6]. Second-person pronouns (as well as singular first-person pronouns) often co-occur with contractions, which may explain the participants' preference for answers with a low frequency of contractions.

The appropriateness of conditional subordination also relates to personalization. The use of conditional clauses as a mechanism for inserting suggestions, requests, and offers is common in conversation [20]. The tourist assistants in 𝐹𝐿𝐺 used conditional subordination to offer different options to the users (e.g., "if you..., you can/will/would"), since their individual preferences were unknown. In the lab session, [LabP1] mentioned "prefer the [original answer] because it says 'if you stop' rather than 'I'd suggest you stop.'" Moreover, conditional subordination helps to frame the subsequent discourse [20], which was also observed by [LabP1]: "I like the 'if you are looking for maps' as it sets up the scenario that this information would be useful for." However, conditional subordination is negatively associated with credibility, which may indicate that when the chatbot gives options, it sounds as if it is not confident about the information provided. It is important to observe the impact of varying communication purposes here. Although the chatbot is presented as an information provider, the tourist assistants interpreted some tourist questions as requests for recommendations (see more in Chaves et al. [27]) rather than for information. Recommendations are more inherently personalized than information search results and consequently require more personalization in their expression (see, e.g., [52, 76, 111]). Thus, participants may have expected the tourist assistants to provide more personal, tailored content rather than conditional options. Clearly, the dynamics that shape these interactions are subtle, and the influence of sub-registers, i.e., the variation in language use to match specific communicative purposes, will need to be explored in more detail to evaluate these assumptions.

The study we presented in this paper accomplished its primary goal and, at the same time, exposed deeper complexities that point to the need for further exploration. For instance, we found inconsistent results regarding private verbs when comparing the cross-validation results to the quotes from the lab sessions' participants. The cross-validation model found an association between a high frequency of private verbs and appropriateness, with no effect on the other two constructs (credibility and overall user experience). However, several quotes from lab session participants suggest that private verbs did make certain answers less credible. For example:

"'I guess' is not the tone I want." [LabP1]
"I don't like the bot saying 'I guess', it sounds passive aggressive." [LabP2]
"I don't like how the [modified answer] says 'okay, I believe'. This makes it sound like it doesn't know." [LabP4]

Considering such qualitative feedback, we believe that the lack of any negative influence of private verbs on credibility shown by the analysis may be conditioned by their co-occurrence with other features; this needs to be further investigated.

Our results also showed a positive association between attributive adjectives and all three evaluated metrics (appropriateness, credibility, and user experience). However, this association may be too coarse-grained, as the way in which attributive adjectives were often used in the specific conversational context of tourism advising is somewhat atypical. The typical use of attributive adjectives in conversation is to describe some physical attribute of an object [20], e.g., "new," "big," or "smelly." In contrast, attributive adjectives in the 𝐹𝐿𝐺 corpus were more often used to classify rather than describe the corresponding nouns; common attributive adjectives include "local business," "national park," and "natural landmark." Using classifying attributive adjectives adds detail to the information without loss of efficiency. Participants mentioned that the attributive adjective "makes [the answer] more interesting" [LabP7], explaining the association with quality metrics, but this may apply only to classifying attributive adjectives. Here too, further investigation will be needed to clarify the validity of the observed association.

Finally, suasive verbs are typically used to express the degree of certainty associated with the information that the sentence communicates [20]. For example, when the tourist assistant says "I recommend," it conveys how strongly it believes the tourist should take that advice. [LabP3] observed this fact by stating that "'I would recommend' seems like more of a suggestion." Our analysis showed that suasive verbs influence the overall user experience, but they were not shown to make the language more credible or appropriate. Closer analysis shows that this association, too, deserves further investigation. In our data, suasive verbs co-occurred mainly with the singular first-person pronoun. As noted earlier, the use of plural first-person pronouns was clearly linked to credibility; further investigation should evaluate whether using plural first-person pronouns with suasive verbs would increase their impact on credibility.
Register theory aims to link the appearance of certain linguistic features in utterances to the situational parameters of the conversation [34]. Although characterizations of situational parameters and their detailed impacts on the selection of conversational register may continue to evolve and be refined over time, the patterns of language should be similar in domains that share similar situational parameters. For instance, information search, the core interactional purpose of the interactions studied in this research, is also a common interactional purpose in customer service interactions, i.e., two participants working to fulfill an information request. The two domains share other situational parameters as well, such as channel, production, and setting. Abu Shawar and Atwell [1] observed that conversations from a corpus of Spoken Professional American English portray more coordinating conjunctions, subordinating conjunctions, and plural personal pronouns than transcripts from the chatbot ALICE. These same linguistic features were selected in our analysis as predictors of appropriateness, credibility, and user experience (see Section 7.5.1). In contrast, we expect that a sales chatbot would require more persuasion (features in Dimension 4; see the supplementary materials for the full list of linguistic features), and recommendation-based chatbots would require more personalization, as discussed previously in this section. In any case, a register analysis, as presented in this paper, could be used as a tool to analyze the conversational register used by expert humans in such conversational scenarios, and to identify specific linguistic features of that register that are relevant to producing the desired impact on user perceptions.

This research has important implications for designers of the next generation of chatbots. For chatbots that find and retrieve knowledge snippets from external sources, utterances should be adapted to the conversational situation in which the chatbot is embedded. This is not generally done in the current generation of chatbots; it is not uncommon to find chatbots that extract and present information directly from websites, books, or manuals without any adaptation. For example, Golem is a chatbot designed to guide tourists through Prague (Czech Republic); its utterances are extracted from an online travel magazine without any adaptation to the new interactional situation (which differs in production, channel, and setting). (Golem is available in Facebook Messenger at http://m.me/praguevisitor; last accessed June 2020.) Moreover, new generations of chatbots will be expected to dynamically generate their own custom-constructed utterances. They will need sophisticated language engines that are able to dynamically adapt their conversational register to changing situational parameters. In this context, corpora such as 𝐷𝑎𝑖𝑙𝑦𝐷𝑖𝑎𝑙𝑜𝑔 are likely to become a baseline for training the conversation models [51] at the heart of such language engines; our study emphasizes that designers should carefully ensure that the register found in any corpus used to train such models matches the optimal register implied by the situational parameters, or that the learning algorithms can adapt the language accordingly.

This paper provides a list of linguistic features that conform with user expectations about chatbots' language use in the context of tourism information searches, which can be directly applied to the design of chatbots for this domain.
Researchers could leverage these outcomes to evaluate the application of these results to similar interactional situations in other domains. Section 8 discusses the relation of these outcomes to other task-oriented domains, such as information searches and customer services. Additionally, the methodology presented in this paper can be applied to other domains that are more distant from 𝐹𝐿𝐺 in terms of interactional situations, aiming to identify the associations between new interactional situations and the core linguistic features used in the domain. As our study shows, the analysis does not necessarily require a large number of individuals to identify the linguistic features that characterize the register of a particular situation or domain.

Finally, the parallel corpora 𝐹𝐿𝐺 and 𝐹𝐿𝐺𝑚𝑜𝑑 are available for researchers and practitioners who are interested in (i) developing chatbots for tourism information searches, as they comprise a set of frequently asked tourism questions with register-specific answers; or (ii) research on natural language generation that requires parallel data. The corpora and associated materials are available online [26].

Register characterization relies on the multidimensional approach proposed by Biber [11, 12], which is the main theoretically motivated approach taken within register analysis [6]. Other approaches, such as register classification [6], could be explored in the context of human-chatbot interactions. Additionally, the register analysis relies on Biber's grammatical tagger to automatically tag the linguistic features. The tagger has been used in many previous large-scale corpus investigations, including multidimensional studies of register variation (e.g., [11, 12, 35]), the Longman Grammar of Spoken and Written English [20], and a study of register variation on the Internet [16, 17]. Although this tagger achieves accuracy levels comparable to other existing taggers [11], mis-taggings are possible. To mitigate this effect, we manually inspected a small subset of tagged files to search for mis-tags that could potentially impact the outcomes.

We performed manual linguistic modifications to produce
𝐹𝐿𝐺𝑚𝑜𝑑, which inherently introduced a subjective element into the exact choice of changes applied to shift the register. We mitigated this threat by manually inspecting the 𝐷𝑎𝑖𝑙𝑦𝐷𝑖𝑎𝑙𝑜𝑔 corpus for every feature we modified, to understand the function of the feature in the corpus, and by producing modifications that follow similar patterns. We also performed a validation with human participants for content preservation and quality of the modifications.

We included in the cross-validation model only features that vary between 𝐹𝐿𝐺 and 𝐷𝑎𝑖𝑙𝑦𝐷𝑖𝑎𝑙𝑜𝑔 and therefore were manipulated during the text modification. We considered that linguistic features that do not significantly vary across the corpora are the standard in those particular contexts and are unlikely to signal users' preferences. We claim that 𝐷𝑎𝑖𝑙𝑦𝐷𝑖𝑎𝑙𝑜𝑔 is an appropriate dataset for this study, since it has been widely used in natural language and dialogue generation research (see, e.g., [56, 113, 135]) and might eventually become a baseline for learning conversation models [51]. However, it is also important to compare 𝐹𝐿𝐺 against corpora produced in other interactional situations to evaluate varying sets of features and confirm the inferences presented in this paper.

The register analysis presented in this paper was based on counts of the occurrences of features, normalized per 10,000 words. However, it does not consider sentence structure, i.e., where the features occur in the sentence. Additionally, because linguistic features are best understood in terms of co-occurrence patterns, it is important to extend this study to consider both the effect of linguistic features individually and the effect of the features that typically co-occur with them. Future research is needed to expand the analysis to incorporate these aspects.

This research was performed in the context of conversations in American English. The core linguistic features and their usage change from one language to another; thus, these results may not apply to other languages, and further investigation is necessary.

Three tourist assistants, all female, answered tourist questions in the 𝐹𝐿𝐺 corpus, and tourists were recruited in Flagstaff, AZ, USA. To increase the diversity of tourists' questions, we mined questions from websites, as discussed in Section 4. Concerning tourist assistants, an interesting extension of this study would be to hire tourist assistants with a more diverse profile and to include questions about other tourist cities. It is important to note that, even with only three tourist assistants, we were able to identify the impact of certain linguistic features on the three metrics (appropriateness, credibility, and overall experience) used to represent perceived conversational quality; this suggests that a very large corpus is not necessary to identify the core linguistic features of chatbot dialogues that influence user perceptions.

Finally, with limited qualitative observations from our in-lab sessions to support the quantitative findings, our ability to draw conclusions based on participants' statements is incomplete. The purpose of adding this qualitative element was to augment and clarify our quantitative findings by identifying participants' impressions aligned with our quantitative analysis and by comparing the impressions of our participants to the interpretations of linguistic features found in the register literature, such as the Longman Grammar [20]. A deeper qualitative investigation is needed to draw stronger conclusions about the motivations behind participants' choices.
10 CONCLUSIONS
This paper focuses on investigating the impact of chatbot language use on user perceptions of language appropriateness, credibility, and the overall user experience. We collected two corpora of conversations between tourists and tourist assistants in different interactional situations and compared the language variation in these corpora by adapting techniques associated with register analysis, which are well established among sociolinguists. Based on this analysis, we produced two parallel corpora of conversational exchanges that were equivalent in informational content but differed in linguistic form. We then used the parallel corpora to perform a study of the impact of register on user preferences, asking participants to rate parallel responses drawn from the corpora on three metrics of perceived conversational quality: appropriateness, credibility, and overall experience. Participants were told that a tourist assistant chatbot had generated all responses.

Our results revealed statistically relevant associations between certain linguistic features present in utterances within the two corpora and user perceptions of appropriateness, credibility, and overall user experience. Additionally, the results showed that linguistic features are stronger predictors of this association than the variables representing individual biases (participants, their social orientation, and the answers' authors). This outcome strongly suggests that attention to an appropriate conversational register is an important factor in the perceived quality of chatbot conversations and, therefore, critical to future chatbots' success. Although our study focused on the tourism domain, we expect these outcomes to be applicable to other interactions that share similar situational parameters (e.g., different information search scenarios). More generally, this study demonstrates that the theoretical foundation of register analysis introduced in this paper can be an effective tool for characterizing the conversational register used in other target domains and can systematically expose the specific linguistic features within conversational utterances that most strongly impact user perceptions of conversational quality. This makes register a promising cornerstone for rationalizing the design of future generations of chatbots, by reproducing this study in other domains as well as by developing empirical studies that extend this work to other aspects of register theory, such as the impact of sentence structure. Specific future directions for our research will focus on evaluating the effect of register-specific language within dynamically generated chatbot interactions on user perceptions of conversational quality.
ACKNOWLEDGMENTS
We thank Caitlin Abuel and Tyler Conger, NAU CS undergraduate students, for their contributions to the recruitment and qualitative data collection during the lab sessions. This work is supported by the National Science Foundation under Grant No. 1815503.
REFERENCES
[1] B Abu Shawar and E Atwell. 2004. Evaluation of Chatbot Information System. In Proceedings of the Eighth Maghrebian Conference on Software Engineering and Artificial Intelligence. Centre de Publication Universitaire, Tunis, 12.
[2] Papathanassis Alexis. 2017. R-Tourism: Introducing the Potential Impact of Robotics and Service Automation in Tourism. Ovidius University Annals, Series Economic Sciences 17, 1 (2017), 211–216.
[3] Laurence Anthony. 2005. AntConc: design and development of a freeware corpus analysis toolkit for the technical writing classroom. In IPCC 2005. Proceedings. International Professional Communication Conference, 2005. IEEE, Limerick, Ireland, 729–737.
[4] Markus Appel, David Izydorczyk, Silvana Weber, Martina Mara, and Tanja Lischetzke. 2020. The uncanny of mind in a machine: Humanoid robots as tools, agents, and experiencers. Computers in Human Behavior 102 (2020), 274–286.
[5] Theo Araujo. 2018. Living up to the chatbot hype: The influence of anthropomorphic design cues and communicative agency framing on conversational agent and company perceptions. Computers in Human Behavior 85 (2018), 183–189.
[6] Shlomo Argamon. 2019. Register in computational language research. Register Studies 1, 1 (2019), 100–135.
[7] Shlomo Argamon, Moshe Koppel, and Galit Avneri. 1998. Routing documents according to style. In First International Workshop on Innovative Information Systems. Citeseer, Pisa, Italy, 85–92.
[8] Mikhail Mikhaĭlovich Bakhtin. 2010. Speech genres and other late essays. University of Texas Press, Austin, TX.
[9] Divyaa Balaji. 2019. Assessing user satisfaction with information chatbots: a preliminary investigation. Master's thesis. University of Twente. https://essay.utwente.nl/79785/1/Balaji_MA_BMS.pdf
[10] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (2015), 1–48. https://doi.org/10.18637/jss.v067.i01
[11] Douglas Biber. 1988. Variation across speech and writing. Cambridge University Press, Cambridge, UK.
[12] Douglas Biber. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press, New York, NY, USA.
[13] Douglas Biber. 2012. Register as a predictor of linguistic variation. Corpus Linguistics and Linguistic Theory 8, 1 (2012), 9–37.
[14] Douglas Biber. 2017. MAT–Multidimensional Analysis Tagger. Available at: https://goo.gl/u7h9gb.
[15] Douglas Biber, Susan Conrad, and Viviana Cortes. 2004. If you look at...: Lexical bundles in university teaching and textbooks. Applied Linguistics 25, 3 (2004), 371–405.
[16] Douglas Biber and Jesse Egbert. 2016. Using Multi-Dimensional Analysis to Study Register Variation on the Searchable Web. Corpus Linguistics Research 2 (2016), 1–23.
[17] D. Biber and J. Egbert. 2018. Register variation online. Cambridge University Academic Press, Cambridge.
[18] Douglas Biber and Bethany Gray. 2010. Challenging stereotypes about academic writing: Complexity, elaboration, explicitness. Journal of English for Academic Purposes 9, 1 (2010), 2–20.
[19] Douglas Biber, Bethany Gray, and Kornwipa Poonpon. 2011. Should we use characteristics of conversation to measure grammatical complexity in L2 writing development? TESOL Quarterly 45, 1 (2011), 5–35.
[20] Douglas Biber, Stig Johansson, Geoffrey Leech, Susan Conrad, Edward Finegan, and Randolph Quirk. 1999. Longman grammar of spoken and written English. Vol. 2. Pearson Longman, London, UK.
[21] Nina Böcker. 2019. Usability of information-retrieval chatbots and the effects of avatars on trust. B.S. thesis. University of Twente.
[22] Susan Bosher and Melissa Bowles. 2008. The Effects of Linguistic Modification on ESL Students' Comprehension of Nursing Course Test Items: A collaborative process is used to modify multiple-choice questions for comprehensibility without damaging the integrity of the item. Nursing Education Perspectives 29, 4 (2008), 174.
[23] Petter Bae Brandtzaeg and Asbjørn Følstad. 2018. Chatbots: changing user needs and motivations. Interactions 25, 5 (2018), 38–43.
[24] Dimitrios Buhalis and Soo Hyun Jun. 2011. E-tourism. In Contemporary tourism reviews, Chris Cooper (Ed.). Goodfellow Publishers Limited, Woodeaton, Oxford, 1–38.
[25] Jhonny Cerezo, Juraj Kubelka, Romain Robbes, and Alexandre Bergel. 2019. Building an expert recommender chatbot. In 2019 IEEE/ACM 1st International Workshop on Bots in Software Engineering (BotSE). IEEE, New York, 59–63.
[26] Ana Paula Chaves. 2020. GitHub Repository. Available at: https://github.com/chavesana/chatbots-register.
[27] Ana Paula Chaves, Eck Doerry, Jesse Egbert, and Marco Gerosa. 2019. It's How You Say It: Identifying Appropriate Register for Chatbot Language Design. In Proceedings of the 7th International Conference on Human-Agent Interaction (HAI '19). ACM, New York, NY, USA, 8. https://doi.org/10.1145/3349537.3351901
[28] Ana Paula Chaves, Jesse Egbert, and Marco Aurelio Gerosa. 2019. Chatting like a robot: the relationship between linguistic choices and users' experiences. In ACM CHI 2019 Workshop on Conversational Agents: Acting on the Wave of Research and Development. https://convagents.org/, Glasgow, UK, 8.
[29] Ana Paula Chaves and Marco Aurelio Gerosa. 2018. Single or Multiple Conversational Agents? An Interactional Coherence Comparison. In ACM SIGCHI Conference on Human Factors in Computing Systems. ACM, New York, NY, USA, 191:1–191:13.
[30] Ana Paula Chaves and Marco Aurelio Gerosa. 2020. How Should My Chatbot Interact? A Survey on Social Characteristics in Human–Chatbot Interaction Design. International Journal of Human–Computer Interaction 0, 0 (2020), 1–30. https://doi.org/10.1080/10447318.2020.1841438
[31] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). Association for Computing Machinery, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
[32] R. H. B. Christensen. 2019. ordinal—Regression Models for Ordinal Data. R package version 2019.12-10. https://CRAN.R-project.org/package=ordinal.
[33] Leon Ciechanowski, Aleksandra Przegalinska, Mikolaj Magnuski, and Peter Gloor. 2018. In the shades of the uncanny valley: An experimental study of human–chatbot interaction. Future Generation Computer Systems 92 (2018), 539–548.
[34] Susan Conrad and Douglas Biber. 2009. Register, genre, and style. Cambridge University Press, New York, NY, USA.
[35] Susan Conrad and Douglas Biber. 2014. Multi-dimensional Studies of Register Variation in English. Routledge, New York, NY, USA.
Sofia University, 19–21.
[66] Muayyad Jabri, Allyson D Adrian, and David Boje. 2008. Reconsidering the role of conversations in change communication: A contribution based on Bakhtin. Journal of Organizational Change Management 21, 6 (2008), 667–685.
[67] Ana Jakic, Maximilian Oskar Wagner, and Anton Meyer. 2017. The impact of language style accommodation during social media interactions on brand trust. Journal of Service Management 28, 3 (2017), 418–441.
[68] Marie-Claire Jenkins, Richard Churchill, Stephen Cox, and Dan Smith. 2007. Analysis of user interaction with service oriented chatbot systems. In Human-Computer Interaction. HCI Intelligent Multimodal Interaction Environments, Julie A. Jacko (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 76–83.
[69] Ridong Jiang and Rafael E Banchs. 2017. Towards Improving the Performance of Chat Oriented Dialogue System. In 2017 International Conference on Asian Language Processing (IALP). IEEE, New York, NY, USA, 23–26.
[70] George Kamberelis. 1995. Genre as institutionally informed social practice. J. Contemp. Legal Issues 6 (1995), 115.
[71] Merel Keijsers and Christoph Bartneck. 2018. Mindless Robots get Bullied. In Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. ACM, New York, NY, USA, 205–214.
[72] Adam Kilgarriff. 2005. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1, 2 (2005), 263–276.
[73] Jurek Kirakowski, Anthony Yiu, et al. 2009. Establishing the hallmarks of a convincing chatbot-human dialogue. In Human-Computer Interaction. InTech, London, UK.
[74] Julia Kiseleva, Kyle Williams, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Predicting user satisfaction with intelligent assistants. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 45–54.
[75] Takanori Komatsu, Rie Kurosawa, and Seiji Yamada. 2012. How does the difference between users' expectations and perceptions about a robotic agent affect their behavior? International Journal of Social Robotics 4, 2 (2012), 109–116.
[76] Sherrie YX Komiak and Izak Benbasat. 2006. The effects of personalization and familiarity on trust and adoption of recommendation agents. MIS Quarterly 30, 4 (2006), 941–960.
[77] Robert M Krauss and Chi-Yue Chiu. 1998. Language and social behavior. In The handbook of social psychology, D. T. Gilbert, S. T. Fiske, and G. Lindzey (Eds.). McGraw-Hill, New York, NY, US, 41–88.
[78] William Labov, Sharon Ash, and Charles Boberg. 2005. The atlas of North American English: Phonetics, phonology and sound change. Walter de Gruyter, Boston, MA, USA.
[79] Tania C Lang. 2000. The effect of the Internet on travel consumer purchasing behaviour and implications for travel agencies. Journal of Vacation Marketing 6, 4 (2000), 368–385.
[80] Mirosława Lasek and Szymon Jessa. 2013. Chatbots for Customer Service on Hotels' Websites. Information Systems in Management 2, 2 (2013), 146–158.
[81] Minha Lee, Gale Lucas, Johnathan Mell, Emmanuel Johnson, and Jonathan Gratch. 2019. What's on Your Virtual Mind?: Mind Perception in Human-Agent Negotiations. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents (Paris, France) (IVA '19). ACM, New York, NY, USA, 38–45. https://doi.org/10.1145/3308532.3329465
[82] Min Kyung Lee, Sara Kiesler, and Jodi Forlizzi. 2010. Receptionist or information kiosk: How do people talk with a robot? In Proceedings of the 2010 ACM CSCW. ACM, New York, NY, USA, 31–40.
[83] SeoYoung Lee and Junho Choi. 2017. Enhancing user experience with conversational agent for movie recommendation: Effects of self-disclosure and reciprocity. International Journal of Human-Computer Studies 103 (2017), 95–105.
[84] Geoffrey N Leech and Mick Short. 2007. Style in fiction: A linguistic introduction to English fictional prose. Number 13 in English Language Series. Pearson Education, London, UK.
[85] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In International Joint Conference on Natural Language Processing (IJCNLP). Asian Federation of Natural Language Processing, Taipei, Taiwan, 986–995.
[86] Vera Q Liao, Matthew Davis, Werner Geyer, Michael Muller, and N Sadat Shami. 2016. What can you do?: Studying social-agent orientation and agent proactive interactions with an agent for employees. In Proceedings of the 2016 ACM Conference on Designing Interactive Systems (Brisbane, QLD, Australia). ACM, New York, NY, USA, 264–275.
[87] Grace I Lin and Marilyn A Walker. 2017. Stylistic Variation in Television Dialogue for Natural Language Generation. In Proceedings of the Workshop on Stylistic Variation. Association for Computational Linguistics, Copenhagen, Denmark, 85–93.
[88] Greg Linden, Steve Hanks, and Neal Lesh. 1997. Interactive assessment of user preference models: The automated travel assistant. In User Modeling. Springer, Vienna, 67–78.
[89] Michael H Long and Steven Ross. 1993. Modifications That Preserve Language and Content. Technical Report. ERIC.
[90] Jaclyn Loo. 2017. The future of travel: New consumer behavior and the technology giving it flight. Google/Phocuswright Travel Study 2017.
[91] Ewa Luger and Abigail Sellen. 2016. Like Having a Really Bad PA: The Gulf between User Expectation and Experience of Conversational Agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, New York, NY, USA, 5286–5297.
[92] Rhonda W Mack, Julia E Blose, and Bing Pan. 2008. Believe it or not: Credibility of blogs in tourism. Journal of Vacation Marketing 14, 2 (2008), 133–144.