Mental Workload and Language Production in Non-Native Speaker IPA Interaction
Yunhan Wu, University College Dublin
Justin Edwards, University College Dublin
Orla Cooney, University College Dublin
Anna Bleakley, University College Dublin
Philip R. Doyle, University College Dublin
Leigh Clark, Swansea University
Daniel Rough, University College Dublin
Benjamin R. Cowan, University College Dublin
ABSTRACT
Through proliferation on smartphones and smart speakers, intelligent personal assistants (IPAs) have made speech a common interaction modality. Yet, due to linguistic coverage and varying levels of functionality, many speakers engage with IPAs using a non-native language. This may impact the mental workload and pattern of language production displayed by non-native speakers. We present a mixed-design experiment, wherein native (L1) and non-native (L2) English speakers completed tasks with IPAs through smartphones and smart speakers. We found significantly higher mental workload for L2 speakers during IPA interactions. Contrary to our hypotheses, we found no significant differences between L1 and L2 speakers in terms of number of turns, lexical complexity, diversity, or lexical adaptation when encountering errors. These findings are discussed in relation to language production and processing load increases for L2 speakers in IPA interaction.
CCS CONCEPTS
• Human-centered computing → User studies; Natural language interfaces; HCI theory, concepts and models.

KEYWORDS
speech interface; voice user interface; intelligent personal assistants; non-native language speakers
ACM Reference Format:
Yunhan Wu, Justin Edwards, Orla Cooney, Anna Bleakley, Philip R. Doyle, Leigh Clark, Daniel Rough, and Benjamin R. Cowan. 2020. Mental Workload and Language Production in Non-Native Speaker IPA Interaction. In 2nd Conference on Conversational User Interfaces (CUI '20), July 22–24, 2020, Bilbao, Spain. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3405755.3406118
CUI '20, July 22–24, 2020, Bilbao, Spain
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7544-3/20/07...$15.00
https://doi.org/10.1145/3405755.3406118
INTRODUCTION
Intelligent personal assistants (IPAs) like Google Assistant have increased the popularity of speech as an interaction modality [9]. Primarily used on smart speakers and smartphones [34], these assistants can be used in a number of different languages, but coverage and functionality across these languages is not comprehensive [27], requiring many users to interact using a non-native language. This includes those using English as a second language, hereby referred to as L2 speakers. Interacting with IPAs in this way is likely to be significantly more challenging than the interaction experienced by those using English as their native language (L1 speakers). For instance, L2 speakers tend to experience difficulty in lexical retrieval [21, 43] because of an incomplete knowledge of the language being used [47], with production being less automatized when compared to L1 users [15]. Alongside increased demands in processing and planning utterances in a second language, this means L2 users may experience a significantly higher mental workload [14, 47] when engaging with IPAs. These factors may also lead them to approach the interaction differently [40, 48]. Our research explores this empirically, by comparing the mental workload and language choices made by L1 and L2 speakers when interacting with IPAs across smart speakers and smartphones.

Our study identified significant differences in cognitive demand between the two speaker groups. Specifically, we found L2 speakers experience significantly higher levels of mental workload when interacting with IPAs in their non-native language compared to L1 speakers. Contrary to expectations, L1 and L2 speakers did not significantly vary in the number of commands needed to complete tasks, the number of words used per command, the diversity of their lexicon, nor their levels of adaptation when they experienced errors during interaction. Our findings are the first to focus on the cognitive and linguistic aspects of L2 IPA use. We discuss the findings in relation to the cognitive mechanisms that may be present when interacting with IPAs as an L2 speaker.
RELATED WORK

Language production in speech interface interaction
Current work on language production in speech interface interaction almost universally observes the language choices of L1 speakers. Even then, the volume of work on this topic is limited [9], with a focus on comparing language production in interactions between human-machine and human-human interlocutors. Such work finds that users tend to vary significantly in how they interact with systems compared to how they interact with humans [2], although similar mechanisms may influence language production [11, 12]. People tend to use fewer topic shifts, use more words, and use fewer anaphora and fillers when interacting with computers as opposed to human partners. Similarly, people tend to use more basic lexical choices and grammatically simpler utterances [7] when interacting with computers compared to other people [26].

This tendency to vary speech choices based on partner type is thought to be driven by the perception of a computer's competence as a dialogue partner (i.e., a user's partner model), whereby people see voice user interfaces as at-risk listeners [36]. This is similar to the mechanisms for adaptation proposed in the psycholinguistics literature, which highlight the tendency for partners to select their language with the perceptions of their audience in mind, termed audience design [6]. A similar effect has recently been shown to operate on lexical choice in speech interface interaction, whereby participants interacting with a US-accented system were significantly more likely to use US lexical terms than when interacting with an Irish-accented system [12].
L2 speakers and IPA interaction
Recent work comparing IPA use by L1 and L2 speakers has focused on user experience, as opposed to observing their interaction from a cognitive and linguistic perspective. L2 speakers see IPAs as more difficult to use than do L1 speakers [39, 40]. Recent work has also found that L2 speakers perceive difficulties in trying to use the right sentence structures or retrieving the right lexical terms [48] when speaking to IPAs, with L2 speakers feeling they have to rephrase utterances, causing frustration [40]. Research on L2 language production offers potential explanations for these perceived difficulties. It is widely acknowledged that L2 speakers tend to have an incomplete knowledge of the non-native language being used when compared to L1 speakers [15, 47]. Along with a comparative lack of automatisation of the cognitive processes for language production within a second language [15], this means L2 users must resort to specific production strategies to mitigate these production barriers. These include replacing lexical items, reducing message complexity, or describing the meaning of words that are hard to retrieve [15]. Paired with the need to process non-native speech when in dialogue, this means L2 speakers experience considerable cognitive load when having to converse in a second language [14, 47].

Accented speech and the need for longer planning time may also lead to L2 users experiencing difficulties in commands being understood, with the system either not recognising commands or interrupting the user before commands are complete [25, 48]. When they encounter communication breakdowns in IPA use, L2 speakers tend to use common strategies to repair commands, such as repeating and rephrasing utterances [33]. Yet the effective planning of error repair may depend on the type of device being used. For example, L2 speakers have emphasised the benefit of visual feedback [40], allowing them to use further visual information (e.g., transcriptions of the conversation) to diagnose errors in their commands as well as to process system prompts, making them more effective when using IPAs [33, 48].
RESEARCH AIMS AND HYPOTHESES
Although a number of users engage with IPAs in their non-native language, research on cognitive concepts such as mental workload, and on the language these users produce in interaction, is scant. It is therefore critical that we widen research to include the experiences of non-native speakers [39, 40]. Our study focuses on linguistic and cognitive aspects of L2 speaker interaction. We focus on the mental workload experienced by L2 IPA users in comparison to L1 users, while also exploring the differences in language production between the two groups when completing tasks with an IPA.

We hypothesise that, due to planning, generating and processing speech utterances in a different language, L2 speakers are likely to experience significantly higher mental workload in IPA interaction compared to L1 speakers (H1). We also hypothesise that, due to speech recognition and planning time difficulties [48], L2 speakers may need significantly more turns when conducting a task than L1 speakers (H2). Due to lexical retrieval and knowledge constraints compared to L1 speakers, we also hypothesise that L2 speakers will use significantly fewer words per utterance (H3), show lower lexical diversity than L1 speakers in interaction (H4), and may vary in their levels of adaptation in comparison to L1 speakers when experiencing errors (H5).

Based on work emphasising the importance of visual modalities in supporting L2 speaker IPA use [40, 48], we also hypothesise that these effects may vary significantly by device. Specifically, the visual feedback afforded by Google Assistant on a smartphone may lead to reduced mental workload for L2 speakers, due to visual output supporting error diagnosis and system query understanding (H6). As visual support helps users diagnose and correct errors, we also hypothesise that using a smartphone may significantly affect the number of commands per task (H7) and the number of words per command (H8), while also impacting lexical diversity (H9) and levels of adaptation (H10) for L2 speakers.
METHOD
To investigate these hypotheses, we designed a study that enabled us to quantitatively compare the cognitive workload and linguistic properties of L1 and L2 speakers in their interaction with IPAs. The study received ethical approval through the university's ethics procedures for low-risk projects.
Participants
A sample of 33 participants (F=14, M=18, prefer not to say=1) with a mean age of 28.1 years (SD=9.8 years) took part in the study. These were all recruited from students and staff at a European university via email, campus-wide posters, and snowball sampling. One participant was removed due to technical difficulties in recording their utterances, leaving 32 participants in the sample. 16 (F=8, M=7, prefer not to say=1) were native English speakers, and 16 (F=6, M=10) were native Mandarin speakers who used English as their non-native language. These Mandarin speakers self-reported their English proficiency as moderate (7-point Likert scale: 1 = not at all proficient; 7 = extremely proficient; Mean=4.21, SD=0.7). 78.1% (N=25) of our sample had used IPAs before, with 9.4% (N=3) using IPAs frequently or very frequently. For those that had used IPAs before, Siri (56%) was most commonly used, followed by Amazon Alexa (36%) and Google Assistant (12%). Each participant was given a €10 voucher as an honorarium for taking part.
Design
The study included two device conditions. Participants interacted with Google Assistant using both a Moto G6 smartphone (Smartphone condition) and a Google Home Mini smart speaker (Smart speaker condition) in a within-subjects design. We selected Google Assistant because it is commonly used on both smartphones and smart speakers [34], minimising potential variation due to differences in the IPAs being used across devices. The order of device interaction was fully counterbalanced across L1 and L2 speaker groups.
Experimental tasks
Participants used Google Assistant to complete a total of 12 tasks (6 with each device) across the experimental session. Experimental tasks focused on 6 common IPA tasks [3, 17]: 1) playing music, 2) setting an alarm, 3) converting values, 4) asking for the time in a particular location, 5) controlling device volume and 6) requesting weather information. To reduce practice effects, two versions of each task were generated, creating two sets of six tasks. Each set of tasks was used in only one of the device conditions. To eliminate the influence of written tasks on user utterances, and the potential confound of written tasks increasing L2 speaker cognitive load, all tasks were delivered to participants as pictograms (see Figure 1; all pictograms are included in the supplementary material). The order of task sets was arbitrarily assigned, ensuring they were counterbalanced as much as possible across device and speaker conditions. Task order was randomised within sets for each participant.
Measures
Mental workload. To assess participants' mental workload during interaction with each of the devices, participants completed the NASA-TLX [24] after completing each task set. The NASA-TLX is a 6-item questionnaire (20-point scale per item) measuring 6 constituent factors of mental workload: Mental Demand, Physical Demand, Temporal Demand, Performance, Effort, and Frustration. Scores on the questionnaire were summed to create an overall workload (Raw TLX) score (range: 0–120; see [23]).
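For readers unfamiliar with Raw TLX scoring, the computation is simply the unweighted sum of the six subscale ratings. A minimal sketch in R (the language used for the analyses reported later); the function and argument names are illustrative, not taken from the study materials:

# Raw TLX: unweighted sum of the six NASA-TLX subscales.
# Each subscale is rated on a 20-point scale, so the summed
# score falls in the 0-120 range described above.
raw_tlx <- function(mental, physical, temporal, performance, effort, frustration) {
  mental + physical + temporal + performance + effort + frustration
}

raw_tlx(mental = 12, physical = 3, temporal = 8,
        performance = 6, effort = 10, frustration = 9)  # returns 48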
Language production. To assess language production in interaction, user task commands were transcribed. From these transcripts, a number of measures were derived: number of commands per task, lexical complexity, lexical diversity per task, dynamic lexical adaptation, and lexical adaptation from initial command.

Number of commands per task is defined as the number of utterances, starting with a wake phrase (i.e., "Hey Google" or "OK Google"), that a participant used to complete a task.

Lexical complexity (measured through word count per command) was derived by dividing the total word count used to complete a task by the number of turns taken. This measure represents the complexity of the utterance, and follows measures of L2 linguistic complexity used in text-based research [35]. As commands to speech interfaces tend to be concise, formulaic statements [19, 26], we used word count per command rather than measuring numbers of clauses as is done in other L2 complexity research [35].

Guiraud's index of lexical diversity [22] was also calculated to identify the number of unique words used when completing a task (lexical diversity per task). This measure compares unique words in a command to the root of the total words in a command. It is considered to be a robust alternative to diversity measures that use a direct ratio of unique words to total words, as those measures tend to inflate diversity as utterance lengths increase [45].
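Written out, Guiraud's index for a span of speech with V unique word types and N total word tokens is:

\[ G = \frac{V}{\sqrt{N}} \]

For example, an eight-word command with no repeated words scores G = 8/\sqrt{8} ≈ 2.83; repeating words lowers V without lowering N, pulling the index down. This is in line with the per-task means of roughly 2.5–2.6 reported later in Table 1.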
To gauge levels of lexical adaptation for tasks that required multiple utterances to complete, we measured the Guiraud index of lexical diversity for each pair of consecutive commands within a task (dynamic lexical adaptation). We also measured the Guiraud index of lexical diversity for each utterance paired with the first utterance of a task, to determine how much participants varied their lexical choices away from their initial command (lexical adaptation from initial command). Both measures of adaptation were used so that different styles of adaptation would be detected. For instance, participants may make a command, try a different phrasing, then return to their original phrasing. This would result in high dynamic lexical adaptation but low lexical adaptation from initial command. Participants may alternatively adapt by changing few words across many commands, resulting in low dynamic lexical adaptation but high lexical adaptation from initial command, as each utterance increasingly departs from the first attempt. Using both measures allows us to detect these differences.
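To make these definitions concrete, here is a small R sketch of how the measures could be computed for the commands within a single task. The tokeniser, helper names and example commands are our own illustrations, not the study's analysis code:

# Guiraud's index: unique word types over the square root of total tokens.
guiraud <- function(words) length(unique(words)) / sqrt(length(words))

# Naive whitespace tokeniser for transcribed commands (illustrative only).
tokenize <- function(cmd) strsplit(tolower(cmd), "\\s+")[[1]]

# Hypothetical transcribed commands from one task, in order of production.
cmds <- c("hey google set an alarm for seven tomorrow",
          "ok google set an alarm for seven tomorrow morning")
tokens <- lapply(cmds, tokenize)

n_commands <- length(cmds)                  # number of commands per task
complexity <- mean(sapply(tokens, length))  # lexical complexity: words per command
diversity  <- guiraud(unlist(tokens))       # lexical diversity per task

# Pairwise Guiraud indices: consecutive pairs give dynamic lexical
# adaptation; pairing each later command with the first command gives
# lexical adaptation from the initial command.
pair_guiraud <- function(a, b) guiraud(c(a, b))
dynamic      <- mapply(pair_guiraud, tokens[-length(tokens)], tokens[-1])
from_initial <- sapply(tokens[-1], pair_guiraud, a = tokens[[1]])

On this reading, a pair of commands that share most of their words yields a lower pairwise index (fewer unique types relative to total tokens), so lower values indicate less lexical adaptation.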
Procedure
Upon arrival, participants were welcomed by the experimenter, given an information sheet with details about the experiment and asked to give written consent to take part in the study. Participants then completed a demographic questionnaire, giving information about their age, sex, nationality, native language, experience with IPAs and speech interfaces, and their self-reported level of English proficiency. Participants were then given instructions for the study. Within these, they were asked to look at 6 practice pictograms with the same visual structure as those in the experimental session but different in the information requested, and to write what they would say to the IPA to complete the task depicted. From these responses, experimenters ensured participants were interpreting the pictograms correctly before conducting the experimental tasks.

Figure 1: Example set of task pictograms

Participants were then asked to complete a number of tasks with Google Assistant on two devices: a smartphone and a smart speaker. These tasks were displayed on a laptop, one at a time. Participants were asked to complete a task using the Assistant and, once they felt they had done so, were asked to move to the next task. After completing a set of 6 tasks with one of the devices, participants then completed the NASA-TLX. This was then repeated for the next 6 tasks, wherein they interacted with Google Assistant through the other device. After finishing all tasks with both devices, participants completed a short post-interaction interview, were fully debriefed as to the aims of the study, and thanked for their participation. To capture participant utterances, the sessions were recorded using Audacity v2.3.0.
RESULTS
Out of the total of 384 tasks, 315 were successfully completed (82%), with 14 partially completed (3.6%) (i.e., participants completed the task but varied the information requested). 45 tasks (11.7%) were not successfully completed, of which 24 (6.2%) were not completed due to technical errors. Unsuccessful and technical-error tasks were excluded from the dataset analysed. Before analysis, all data were screened for outliers, with these being replaced by values of the mean ± a fixed number of standard deviations.

Mental workload
Due to violation of the assumption of normal distribution (p<.05), a robust mixed ANOVA on 10% trimmed means was run using the WRS2 package (version 1.0) [30] in R (version 3.6) [41]. There was a statistically significant main effect of speaker on the mental workload experienced, whereby L1 speakers reported significantly lower NASA-TLX scores (Mean=27.0, SD=19.07) than L2 speakers (Mean=42.0, SD=14.37) [Q=11.74, p=.002] (see Figure 2). This supports our first hypothesis (H1). However, there was no statistically significant main effect of device type [Q=0.28, p=.60], nor an interaction between speaker type and device type [Q=0.81, p=.37], on mental workload. H6 was therefore not supported.
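As a pointer for readers wishing to reproduce this style of analysis, a robust between-within ANOVA on trimmed means can be fitted with WRS2's bwtrim(). The data frame and column names below are assumptions, not the study's actual variables:

library(WRS2)

# tlx_data: one row per participant x device condition, with columns
#   tlx     - Raw TLX score for that task set
#   speaker - L1 vs L2 (between-subjects factor)
#   device  - smart speaker vs smartphone (within-subjects factor)
#   id      - participant identifier
fit <- bwtrim(tlx ~ speaker * device, id = id, data = tlx_data, tr = 0.1)
fit  # prints Q statistics and p-values for speaker, device and their interaction

Setting tr = 0.1 applies the 10% trimming reported above; bwtrim()'s Q statistics correspond to the values quoted in the text.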
Figure 2: Mean Raw TLX scores (10% trimmed means with trimmed standard error) for each speaker group

Language production
To analyse the language production data, linear mixed-effects models (LMMs) were run using the lme4 package (version 1.1.21) [5] in R (version 3.6) [41]. This type of analysis allows for the modelling of fixed (i.e., device and speaker type) and random (i.e., participant and task variations) effects on specific outcomes such as lexical diversity. LMMs also allow us to model individual differences through random intercepts, as well as differences in how the fixed effects vary by participant and by task through modelling random slopes. We take the approach of modelling the maximal random effect structure determined by the experiment [4], reducing the complexity of random effects by removing higher-order random slopes to facilitate convergence. We report LMM results in the text, following recent best-practice guidelines [31] by also reporting all LMM analyses fully. These appear in the supplementary material. We include fixed and random effect results, as well as all model syntax, to improve model reproducibility.
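By way of illustration, a model of this shape can be specified in lme4 as below. The authors' exact model syntax is in their supplementary material, so the variable names and random-effects structure here are assumptions in the spirit of the description above:

library(lme4)

# Fixed effects: speaker, device and their interaction.
# Random effects: intercepts for participant and task, plus a
# by-participant random slope for device (the within-subjects factor).
m <- lmer(commands ~ speaker * device +
            (1 + device | participant) + (1 | task),
          data = lang_data)
summary(m)                   # unstandardized betas, SEs and t values
confint(m, method = "Wald")  # 95% CIs for the fixed effects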
Number of commands per task. Across the data set there was a total of 933 user commands. The LMM showed no statistically significant effect of speaker [unstandardized β=-0.39, SE=0.37, 95% CI [-1.12, 0.34], t=-1.06, p=.29], device [β=0.12, SE=0.27, 95% CI [-0.41, 0.63], t=0.43, p=.67] or speaker and device interaction [β=0.33, SE=0.38, 95% CI [-0.41, 1.07], t=0.88, p=.38] on the number of user commands per task. This means that our hypotheses (H2 and H7) were not statistically supported.
Table 1: Descriptive statistics by speaker and device type, Mean (SD)

Speaker | Device | NASA-TLX (10% trimmed) | Commands per task | Lexical complexity | Lexical diversity per task | Dynamic lexical adaptation | Lexical adaptation from initial command
L1 | Smart speaker | 27.36 (18.59) | 2.24 (2.18) | 8.08 (2.87) | 2.61 (0.53) | 2.12 (0.67) | 0.98 (0.96)
L1 | Smartphone | 29.00 (13.38) | 2.35 (2.14) | 8.57 (3.07) | 2.58 (0.58) | 2.24 (0.61) | 0.80 (0.91)
L1 | Total | 27.50 (14.13) | 2.29 (2.15) | 8.32 (2.97) | 2.60 (0.56) | 2.19 (0.66) | 0.88 (0.93)
L2 | Smart speaker | 47.00 (12.20) | 1.84 (1.28) | 7.39 (2.58) | 2.45 (0.60) | 2.05 (0.66) | 0.71 (0.93)
L2 | Smartphone | 40.64 (8.69) | 2.30 (2.02) | 7.43 (2.11) | 2.55 (0.53) | 1.96 (0.71) | 0.88 (0.89)
L2 | Total | 43.23 (8.11) | 2.07 (1.70) | 7.42 (2.55) | 2.50 (0.57) | 2.00 (0.68) | 0.80 (0.91)
Total | Smart speaker | 36.89 (16.16) | 2.04 (1.79) | 7.74 (2.74) | 2.53 (0.57) | 2.09 (0.66) | 0.85 (0.95)
Total | Smartphone | 35.27 (10.25) | 2.33 (2.07) | 8.01 (2.69) | 2.57 (0.56) | 2.10 (0.67) | 0.84 (0.90)
Lexical complexity. Across the dataset there were 7112 words used to command the IPAs, with an average of 7.62 words per command. There was no statistically significant effect of speaker [β=-0.65, SE=0.59, 95% CI [-1.83, 0.53], t=-1.11, p=.27], device [β=0.50, SE=0.34, 95% CI [-0.17, 1.18], t=1.45, p=.15] or speaker and device interaction [β=-0.45, SE=0.49, 95% CI [-1.41, 0.51], t=-0.92, p=.36] on the number of words used per command. Therefore our hypotheses in relation to lexical complexity (H3 and H8) were not statistically supported.

Lexical diversity per task. The LMM showed no statistically significant effect of speaker [β=-0.15, SE=0.11, 95% CI [-0.38, 0.07], t=-1.38, p=.18], device [β=-0.01, SE=0.08, 95% CI [-0.16, 0.14], t=-0.19, p=.85] or speaker and device interaction [β=0.12, SE=0.11, 95% CI [-0.09, 0.33], t=1.14, p=.26] on levels of lexical diversity per task. Therefore our hypotheses in relation to lexical diversity (H4 and H9) were not statistically supported.

Dynamic lexical adaptation. Over the 315 successful tasks, 116 required more than one command to complete. Tasks that participants completed in a single turn (N=199) were excluded from the dataset. There was no statistically significant effect of speaker [β=-0.04, SE=0.16, 95% CI [-0.36, 0.28], t=-0.28, p=.78], device [β=0.14, SE=0.14, 95% CI [-0.14, 0.42], t=0.98, p=.32] or speaker and device interaction [β=-0.24, SE=0.20, 95% CI [-0.64, 0.16], t=-1.20, p=.23] on the level of lexical diversity from a preceding turn. Therefore, L1 and L2 speakers did not vary in their levels of lexical adaptation from a previous utterance when having to use more than one command to complete a task. There was also no impact of device type on levels of lexical adaptation from a previous command, so H5 and H10 were not supported.

Lexical adaptation from initial command. Again, tasks where participants used only one utterance to complete the task were excluded from analysis. The LMM showed no statistically significant effect of speaker [β=-0.26, SE=0.18, 95% CI [-0.61, 0.10], t=-1.43, p=.16], device [β=-0.17, SE=0.17, 95% CI [-0.51, 0.17], t=-1.01, p=.32] or speaker and device interaction [β=0.33, SE=0.25, 95% CI [-0.15, 0.82], t=1.35, p=.18] on the level of lexical diversity from the first turn. It seems that both L1 and L2 speakers tend to use similar levels of lexical adaptation from their first turn, with this adaptation not being influenced by device type. This means that, again, H5 and H10 were not supported.

DISCUSSION
Our work set out to identify how using IPAs in a non-native language impacted mental workload and language production. We found L2 speakers experienced significantly higher mental workload than L1 speakers in IPA interactions across both smart speakers and smartphone devices. Although there were significant levels of workload for L2 users, there were no significant differences between L1 and L2 speakers in terms of the number of turns, words used and diversity of lexical choice. They also did not vary in the level of lexical adaptation from their initial utterances, nor in their level of lexical adaptation when compared to a preceding turn. We discuss the interpretations of these findings below.
Mental workload in L2 IPA interaction
Our work highlights that, even though they may show similar types of language use, L2 speakers experience significantly higher mental workload than L1 users in IPA interaction. Reasons for this are likely to involve the increased load of producing and processing utterances in a non-native language [14, 15]. The effort needed for lexical retrieval in production and processing may be of particular influence. Multilingual speakers store significantly more words in their mental lexicon than monolinguals, to facilitate accurate word retrieval in processing and production when using other languages. This is thought to lead to less frequent access of words across their lexicon, making activation lower and thus leading to difficulties in recalling and retrieving these lexical items [15, 21, 43]. The lack of automatisation of language production processes [15] is also likely to contribute to this load.
In addition to production issues, many L2 speakers also find it more cognitively challenging to process and understand non-native synthetic speech [46]. Non-native speakers find synthesis in a non-native language significantly less intelligible than do native speakers [1, 42, 46]. This is proposed to derive from L2 speakers' comparative unfamiliarity with their non-native language's phonological system, common syntactic structures and lexicon, which may increase cognitive load when interpreting and processing speech output [46]. In real-world IPA use, this mental workload may be even higher, as background noise negatively affects non-native speakers' ratings of intelligibility compared to native speakers [42]. A challenge for future HCI research is to investigate ways to mitigate this load for L2 users.
Language production in L2 IPA interaction
Contrary to our hypotheses, number of commands, lexical adaptation, complexity and diversity did not vary across speaker groups or device types. There may be a number of reasons for this. Although L2 users may experience more load in lexical retrieval, IPA interaction still tends to be lexically constrained. Consequently, complex and diverse lexical choices may not be a priority, as IPAs are often seen as basic dialogue partners [8, 10, 32]. This contrasts with more open-ended interactions in which people have been shown to use conversational and complex linguistic structures (e.g., with automotive interfaces [28, 29]). L1 and L2 speaker variance may be more stark in these types of interaction.

The opportunity for lexical variation may be further limited by the requirement to use the wake word at the start of commands, reducing the potential for variability. Additionally, although adaptation has been noted as a common strategy for error repair in human-machine dialogue [26, 36], it may be that lexical adaptation in this instance is not the primary adaptation strategy for users. Although L2 users have suggested they may use lexical strategies in IPA use (e.g., substitution or describing the meaning of words they cannot retrieve) [48], adaptation of pronunciation is much more strongly emphasised by L2 speakers in previous work [40, 48]. L2 speakers tend to vary significantly from L1 speakers in other speech dimensions, like tempo and rate of hesitations (e.g., filled pauses, repetitions and corrections [47]), while also adapting syntactically or semantically [37]. Our findings suggest that, at a lexical level, L2 speakers and L1 speakers do not vary in the limited lexical context of IPA interaction. Future work should explore other forms of adaptation, as well as other linguistic cues, in language production with IPAs across these user groups.
LIMITATIONS AND FUTURE WORK
Although we found no significant difference between speakers in lexical diversity and complexity, this may be due to the proficiency of the participants recruited. L2 participants rated themselves as moderately proficient, and all attended an English-speaking university. These factors, together with the relative simplicity of the commands required for IPA use, may explain the lack of effect in our analysis. Increased proficiency significantly improves IPA user experience for L2 speakers [39, 40]. Increased fluency in a second language is also linked to the proceduralisation of syntactic and lexical knowledge of that language [44]. Although we found no effect in our sample, there may be differences between beginner and more advanced L2 speakers. Future work should look to identify the role that proficiency plays in language production within IPA interactions.
Along with L2 users being recruited from a European university where English is the primary language, all L2 users were native Mandarin speakers, which may influence the wider generalisability of results to other native and non-native language combinations. It may be that the cognitive effects seen in our work vary based on similarities and differences between the languages being used, such as the phonetic or structural similarity of a non-native language to participants' native tongue. L2 speakers whose native languages are more closely related to English may show even less evident language production effects than Mandarin speakers. It is therefore important that future work explores whether similar effects are seen for L2 speakers with different native languages, as well as the differing levels of language ability mentioned above. Future work should also look to increase sample size, so as to identify whether the findings replicate across larger samples of users.

To increase ecological validity, participants were able to control when to move on to the next task. This meant that participants could complete the tasks at their own pace, and may more accurately reflect how many attempts participants are willing to give a task before abandoning it. Individual differences in this willingness are likely to influence the number of commands users made. Some were willing to try several times in order to successfully complete tasks, whereas others preferred to skip to the next task after relatively few attempts, even if they were not successful at completing the task (although we note only 5.5% of tasks in our data were abandoned by participants). Although the experimenters encouraged participants to try as many times as necessary, they had the freedom to move on before a successful response, which could have influenced the number of commands recorded per task.

In relation to ecological validity, it is also important to note that our research was lab based. This allowed us to minimise potential confounds such as background noise and user distraction. Yet this context may also have made users aware that they were being recorded. Real-world IPA use is likely to vary on these dimensions in comparison to a lab-based environment. Future work should therefore aim to replicate our findings in a real-world deployment.

Rather than using text-based task instructions, we used pictograms to inform participants what to complete during the study. This was to ensure that the processing of non-native language in task instructions did not confound any mental workload effects for L2 users. The use of pictograms also ensured that text-based instructions did not influence the subsequent language used when making commands. Future studies with L2 speakers should investigate the mental workload and language production impact of delivering written tasks to speakers in such studies.

Our findings are limited to a relatively constrained linguistic task of IPA interaction. IPAs are generally designed to perform simple tasks [3, 13] through question-answer adjacency pair dialogues [20, 38], rather than being designed for more conversational or open-ended speech tasks [10, 16]. It is important that future research considers the nature of L2 speech behaviours in these more open-ended scenarios.
CONCLUSION
Although IPA use has grown, fuelled by their inclusion on smart speakers and smartphones, not all languages are fully supported, leading some users to interact in a non-native language. Our study focused on these non-native (L2) speakers, to understand how their experience of IPAs differs from that of native (L1) speakers from a cognitive and linguistic perspective. We found that L2 speakers experienced significantly higher mental workload than L1 speakers, irrespective of the device they were using. Even though they experience higher load in producing and interpreting language from the IPA, they did not vary in the way they interacted linguistically with the IPAs, showing a similar number of commands, lexical complexity, lexical diversity and lexical adaptation to L1 speakers. Our work sheds light on this under-researched set of users. CUI-based research needs to study this group in more detail to identify ways to support their IPA interactions, reducing the cognitive burden they experience.
ACKNOWLEDGMENTS
This work was conducted with the financial support of the UCD China Scholarship Council (CSC) Scheme under grant No. 201908300016, the Science Foundation Ireland ADAPT Centre under grant No. 13/RC/2106, and the Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (D-REAL) under grant No. 18/CRT/6224.
REFERENCES
[1] Diane Mayasari Alamsaputra, Kathryn J. Kohnert, Benjamin Munson, and Joe Reichle. 2006. Synthesized speech intelligibility among native speakers and non-native speakers of English. Augmentative and Alternative Communication 22, 4 (2006), 258–268. https://doi.org/10.1080/00498250600718555
[2] René Amalberti, Noëlle Carbonell, and Pierre Falzon. 1993. User representations of computer systems in human-computer speech interaction. International Journal of Man-Machine Studies 38, 4 (April 1993), 547–566. https://doi.org/10.1006/imms.1993.1026
[3] Tawfiq Ammari, Jofish Kaye, Janice Y. Tsai, and Frank Bentley. 2019. Music, Search, and IoT: How People (Really) Use Voice Assistants. ACM Trans. Comput.-Hum. Interact. 26, 3, Article 17 (April 2019), 28 pages. https://doi.org/10.1145/3311956
[4] Dale J. Barr, Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language 68, 3 (2013), 255–278.
[5] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (2015), 1–48. https://doi.org/10.18637/jss.v067.i01
[6] Allan Bell. 1984. Language style as audience design. Language in Society 13, 2 (June 1984), 145. https://doi.org/10.1017/S004740450001037X
[7] Linda Bell and Joakim Gustafson. 1999. Repetition and its phonetic realizations: investigating a Swedish database of spontaneous computer directed speech. In Proceedings of the XIVth International Congress of Phonetic Sciences, Vol. 99. 1221–1224.
[8] Holly P. Branigan, Martin J. Pickering, Jamie Pearson, Janet F. McLean, and Ash Brown. 2011. The role of beliefs in lexical alignment: Evidence from dialogs with humans and computers. Cognition 121, 1 (2011), 41–57.
[9] Leigh Clark, Philip R. Doyle, Diego Garaialde, Emer Gilmartin, Stephan Schlögl, Jens Edlund, Matthew Aylett, João Cabral, Cosmin Munteanu, Justin Edwards, and Benjamin R. Cowan. 2019. The State of Speech in HCI: Trends, Themes and Challenges. Interacting with Computers 31, 4 (2019), 349–371. https://doi.org/10.1093/iwc/iwz016
[10] Leigh Clark, Nadia Pantidi, Orla Cooney, Philip Doyle, Diego Garaialde, Justin Edwards, Brendan Spillane, Emer Gilmartin, Christine Murad, Cosmin Munteanu, et al. 2019. What Makes a Good Conversation? Challenges in Designing Truly Conversational Agents. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). Association for Computing Machinery, New York, NY, USA, Paper 475, 12 pages. https://doi.org/10.1145/3290605.3300705
[11] Benjamin R. Cowan, Holly P. Branigan, Habiba Begum, Lucy McKenna, and Eva Szekely. 2017. They Know as Much as We Do: Knowledge Estimation and Partner Modelling of Artificial Partners. In CogSci 2017, 39th Annual Meeting of the Cognitive Science Society.
[12] Benjamin R. Cowan, Philip Doyle, Justin Edwards, Diego Garaialde, Ali Hayes-Brady, Holly P. Branigan, João Cabral, and Leigh Clark. 2019. What's in an Accent? The Impact of Accented Synthetic Speech on Lexical Choice in Human-Machine Dialogue. In Proceedings of the 1st International Conference on Conversational User Interfaces (CUI '19). Association for Computing Machinery, New York, NY, USA, Article 23, 8 pages. https://doi.org/10.1145/3342775.3342786
[13] Benjamin R. Cowan, Nadia Pantidi, David Coyle, Kellie Morrissey, Peter Clarke, Sara Al-Shehri, David Earley, and Natasha Bandeira. 2017. "What can I help you with?": infrequent users' experiences of intelligent personal assistants. In Proceedings of the 19th International Conference on Human-Computer Interaction with Mobile Devices and Services. ACM, New York, NY, USA, 43.
[14] S. Dornic. 1979. Information processing in bilinguals: Some selected issues. Psychological Research 40, 4 (1979), 329–348.
[15] Zoltán Dörnyei and Judit Kormos. 1998. Problem-solving mechanisms in L2 communication: A psycholinguistic perspective. Studies in Second Language Acquisition 20, 3 (1998), 349–385.
[16] Philip R. Doyle, Justin Edwards, Odile Dumbleton, Leigh Clark, and Benjamin R. Cowan. 2019. Mapping Perceptions of Humanness in Intelligent Personal Assistant Interaction. In Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '19). Association for Computing Machinery, New York, NY, USA, Article 5, 12 pages. https://doi.org/10.1145/3338286.3340116
[17] Mateusz Dubiel, Martin Halvey, and Leif Azzopardi. 2018. A Survey Investigating Usage of Virtual Personal Assistants. CoRR abs/1807.04606 (2018). arXiv:1807.04606 http://arxiv.org/abs/1807.04606
[18] Andy P. Field, Jeremy Miles, and Zoë Field. 2012. Discovering Statistics Using R.
[19] Emer Gilmartin, Francesca Bonin, Loredana Cerrato, Carl Vogel, and Nick Campbell. 2015. What's the game and who's got the ball? Genre in spoken interaction. In .
[20] Emer Gilmartin, Marine Collery, Ketong Su, Yuyun Huang, Christy Elias, Benjamin R. Cowan, and Nick Campbell. 2017. Social talk: making conversation with people and machine. In Proceedings of the 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents (ISIAA 2017). ACM Press, Glasgow, UK, 31–32. https://doi.org/10.1145/3139491.3139494
[21] Tamar H. Gollan and Lori-Ann R. Acenas. 2004. What is a TOT? Cognate and translation effects on tip-of-the-tongue states in Spanish-English and Tagalog-English bilinguals. Journal of Experimental Psychology: Learning, Memory, and Cognition 30, 1 (2004), 246.
[22] Pierre Guiraud. 1954. Les Charactères Statistiques du Vocabulaire. Essai de méthodologie. Presses Universitaires de France, Paris, France.
[23] Sandra G. Hart. 2006. NASA-Task Load Index (NASA-TLX); 20 years later. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 50. Sage Publications, Los Angeles, CA, 904–908.
[24] Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology. Vol. 52. Elsevier, 139–183.
[25] Abhinav Jain, Minali Upreti, and Preethi Jyothi. 2018. Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning. In Interspeech. 2454–2458.
[26] Alan Kennedy, A. Wilkes, L. Elder, and Wayne Murray. 1988. Dialogue with machines. Cognition 30 (1988), 37–72. https://doi.org/10.1016/0010-0277(88)90003-0
[27] Bret Kinsella. 2019. Google Assistant Now Supports Simplified Chinese on Android Smartphones. http://bit.ly/30Yg8qN. Accessed 27 Jan 2020.
[28] David R. Large, Leigh Clark, Gary Burnett, Kyle Harrington, Jacob Luton, Peter Thomas, and Pete Bennett. 2019. "It's Small Talk, Jim, but Not as We Know It.": Engendering Trust through Human-Agent Conversation in an Autonomous, Self-Driving Car. In Proceedings of the 1st International Conference on Conversational User Interfaces (CUI '19). Association for Computing Machinery, New York, NY, USA, Article 22, 7 pages. https://doi.org/10.1145/3342775.3342789
[29] David R. Large, Leigh Clark, Annie Quandt, Gary Burnett, and Lee Skrypchuk. 2017. Steering the conversation: A linguistic exploration of natural language interactions with a digital assistant during simulated driving. Applied Ergonomics 63 (Sept. 2017), 53–61. https://doi.org/10.1016/j.apergo.2017.04.003
[30] Patrick Mair and Rand Wilcox. 2019. Robust Statistical Methods in R Using the WRS2 Package. Behavior Research Methods (2019). Forthcoming.
[31] Lotte Meteyard and Robert A. I. Davies. 2020. Best practice guidance for linear mixed-effects models in psychological science. Journal of Memory and Language 112 (June 2020), 104092. https://doi.org/10.1016/j.jml.2020.104092
[32] Roger K. Moore. 2017. Is spoken language all-or-nothing? Implications for future speech-based human-machine interaction. In Dialogues with Social Robots. Springer, 281–291.
[33] Souheila Moussalli and Walcir Cardoso. 2019. Intelligent personal assistants: can they understand and be understood by accented L2 learners? Computer Assisted Language Learning (2019), 1–26. https://doi.org/10.1080/09588221.2019.1595664
[34] Christie Olson and Kelli Kemery. 2019. Technical Report. Microsoft.
[35] Lourdes Ortega. 2003. Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics 24, 4 (2003), 492–518.
[36] Sharon Oviatt, Jon Bernard, and Gina-Anne Levow. 1998. Linguistic Adaptations During Spoken and Multimodal Error Resolution. Language and Speech 41, 3-4 (July 1998), 419–442. https://doi.org/10.1177/002383099804100409
[37] Andrew Pawley and Frances Hodgetts Syder. 1983. Natural selection in syntax: Notes on adaptive variation and change in vernacular and literary grammar. Journal of Pragmatics 7, 5 (1983), 551–579.
[38] Martin Porcheron, Joel E. Fischer, Stuart Reeves, and Sarah Sharples. 2018. Voice Interfaces in Everyday Life. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 640.
[39] Aung Pyae and Paul Scifleet. 2018. Investigating Differences between Native English and Non-Native English Speakers in Interacting with a Voice User Interface: A Case of Google Home. In Proceedings of the 30th Australian Conference on Computer-Human Interaction (OzCHI '18). Association for Computing Machinery, New York, NY, USA, 548–553. https://doi.org/10.1145/3292147.3292236
[40] Aung Pyae and Paul Scifleet. 2019. Investigating the Role of User's English Language Proficiency in Using a Voice User Interface: A Case of Google Home Smart Speaker. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems (CHI EA '19). Association for Computing Machinery, New York, NY, USA, 6 pages. https://doi.org/10.1145/3290607.3313038
[41] R Core Team. 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[42] Augmentative and Alternative Communication 12, 1 (1996), 32–36.
[43] Norman Segalowitz and Jan Hulstijn. [n.d.]. Automaticity in bilingualism and second language learning. Handbook of Bilingualism: Psycholinguistic Approaches ([n.d.]), 371–388.
[44] Richard Towell, Roger Hawkins, and Nives Bazergui. 1996. The development of fluency in advanced learners of French. Applied Linguistics 17, 1 (1996), 84–119.
[45] Roeland Van Hout and Anne Vermeer. 2007. Comparing measures of lexical richness. Modelling and Assessing Vocabulary Knowledge (2007), 93–115.
[46] Catherine Watson, Wei Liu, and Bruce MacDonald. 2013. The effect of age and native speaker status on synthetic speech intelligibility.
[47] Richard Wiese. 1984. Language Production in Foreign and Native Languages: Same or different? In