Methods for the Design and Evaluation of HCI+NLP Systems
Hendrik Heuer
Institute for Information Management, University of Bremen, Bremen, Germany
[email protected]
Daniel Buschek
Department of Computer Science, University of Bayreuth, Bayreuth, Germany
[email protected]
Abstract
HCI and NLP traditionally focus on different evaluation methods. While HCI involves a small number of people directly and deeply, NLP traditionally relies on standardized benchmark evaluations that involve a larger number of people indirectly. We present five methodological proposals at the intersection of HCI and NLP and situate them in the context of ML-based NLP models. Our goal is to foster interdisciplinary collaboration and progress in both fields by emphasizing what the fields can learn from each other.
1 Introduction

NLP is the subset of AI that is focused on the scientific study of linguistic phenomena (Association for Computational Linguistics, 2021). Human-computer interaction (HCI) is "the study and practice of the design, implementation, use, and evaluation of interactive computing systems" (Rogers, 2012). Grudin described HCI and AI as two fields divided by a common focus (Grudin, 2009): while both are concerned with intelligent behavior, the two fields have different priorities, methods, and assessment approaches. In 2009, Grudin argued that while AI research traditionally focused on long-term projects running on expensive systems, HCI focused on short-term projects running on commodity hardware. For successful HCI+NLP applications, a synthesis of both approaches is necessary. As a first step towards this goal, this article, informed by our sensibility as HCI researchers, provides five concrete methods from HCI to study the design, implementation, use, and evaluation of HCI+NLP systems.

One promising pathway for fostering interdisciplinary collaboration and progress in both fields is to ask what each field can learn from the methods of the other. On the one hand, while HCI directly and deeply involves the end-users of a system, NLP involves people as providers of training data or as judges of the output of the system. On the other hand, NLP has a rich history of standardized evaluation metrics with freely available datasets and comparable benchmarks. HCI methods that enable deep involvement are needed to better understand the perspective of people using NLP, or being affected by it, their experiences, as well as related challenges and benefits.

As a synthesis of this user focus and the standardized benchmarks, HCI+NLP systems could combine more standardized evaluation procedures and material (data, tasks, metrics) with user involvement. This could lead to better comparability and clearer measures of progress. It may also spur systematic work towards "grand challenges", that is, uniting HCI researchers under a common goal (Kostakos, 2015).

To facilitate a productive collaboration between HCI and NLP, clearly defined tasks that attract a large number of researchers would be helpful. These tasks could be accompanied by data to train models, as a methodological approach from NLP, and by methodological recommendations on how to evaluate these systems, as a methodological approach from HCI. One task could, for example, define which questions should be posed to experiment participants. If the questions regarding the evaluation of an experiment are fixed, the results of different experiments become more comparable. This would not only unite a variety of research results, but could also increase the visibility of the researchers who participate. Complementarily, NLP could benefit from asking further questions about use cases and usage contexts, and from subsequently evaluating contributions in situ, including use by the intended target group (or indirectly affected groups) of NLP.

In conclusion, both fields stand to gain an enriched set of methodological procedures, practices, and tools.
Method: Description
1. User-Centered NLP: user studies ensure that users understand the output and the explanations of the NLP system.
2. Co-Creating NLP: deep involvement from the start enables users to actively shape a system and the problem that the system is solving.
3. Experience Sampling: richer data collected by (active) users enables a deeper understanding of the context and the process in which certain data was created.
4. Crowdsourcing: an evaluation at scale with humans-in-the-loop ensures high system performance and could prevent biased results or discrimination.
5. User Models: simulating real users computationally can automate routine evaluation tasks to speed up the development.
Table 1: The five methodological proposals for HCI+NLP that we present in this paper.

In the following, we propose five HCI+NLP methods that we consider useful in advancing research in both fields. Table 1 provides a short description of each of the five HCI+NLP methods that this paper highlights. With our non-exhaustive overview, we hope to inspire interdisciplinary discussions and collaborations, ultimately leading to better interactive NLP systems – "better" both in terms of NLP capabilities and regarding usability, user experience, and relevance for people.
2 Methods

This section presents and discusses a set of concrete ideas and directions for developing evaluation methods at the intersection of HCI and NLP.
2.1 User-Centered NLP

Our experience as researchers at the intersection of HCI+AI taught us that systems that may work from an AI perspective may not be helpful to users. One example of this is an unpublished machine-learning-based fake news detection system based on text style. Even though it worked in principle, with F1-scores of 80 and higher, pilot studies showed that the style-based explanations were not meaningful to users. Even for educated participants, comprehending such explanations about an ML-based system may be an overextension. This relates to previous work that showed an explanatory gap between what is available to explain ML-based systems and what users need to understand such systems (Heuer, 2020). Far too frequently, NLP systems are built on assumptions about users, not on insights about users. We argue that all ML systems aimed at users need to be evaluated with users. Following ISO 9241-210, user-centered design is an iterative process that involves repeatedly 1. specifying the context of use, 2. specifying requirements, 3. developing solutions, and 4. evaluating solutions, all in close collaboration with users (Normalizacyjnych, 2011).

Our review of prior work indicates that HCI and NLP follow different approaches regarding the requirements analysis and the evaluation of complex information systems. To the best of our knowledge, there are no good examples of true interdisciplinary collaborations that contribute to both fields. While there are HCI contributions that leverage NLP technology, they rarely make a fundamental contribution to computational linguistics, merely applying existing approaches. On the other hand, where NLP aims to make a contribution to an HCI-related field, this contribution is commonly presented without empirical evidence in the form of user studies. Our most fundamental and important contribution in this position paper is a call to recenter efforts in natural language processing around users. We argue that empirical studies with and of users are central to successful HCI+NLP applications. A contribution on a system for recognizing fake news, for example, has to empirically show that the way the system presents its predictions is helpful to users. Training an ML-based system with good intentions is not enough for real progress.
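To make this contrast concrete, consider the following minimal sketch in Python, with invented data: it computes an offline benchmark score and a user-centered comprehension measure for the same hypothetical fake news detector. The labels, predictions, and the 5-point comprehension scale are illustrative assumptions, not results from our pilot studies.

```python
# Minimal sketch with invented data: an offline metric alone can look
# good while a user-centered measure reveals an explanatory gap.
from statistics import mean
from sklearn.metrics import f1_score

# Offline evaluation: how well does the model classify?
y_true = [1, 0, 1, 1, 0, 1]   # hypothetical gold labels (1 = fake news)
y_pred = [1, 0, 1, 0, 0, 1]   # hypothetical model predictions
print(f"Offline F1: {f1_score(y_true, y_pred):.2f}")   # ~0.86

# User-centered evaluation: do users understand the style-based
# explanations? Hypothetical 5-point Likert ratings from a pilot study.
comprehension = [2, 1, 3, 2, 2]
print(f"Mean explanation comprehension: {mean(comprehension):.1f}/5")

# A high F1 combined with low comprehension indicates that the system
# "works" from an AI perspective but is not yet helpful to users.
```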
2.2 Co-Creating NLP Applications

While user-centered design is already a great improvement over developing systems based on assumptions, HCI has moved beyond it, involving users much more deeply. With so-called co-creation, users are not just objects that are studied to build better systems, but subjects that actively shape the system. We therefore argue that HCI+NLP researchers should co-create services with users. Jarke (2021), among others, describes co-creation as joint problem-making and problem-solving by researchers and users. This deep involvement of users enables novel ways of sharing expertise and control over design decisions.

Prior research showed how challenging it can be for users to understand complex, machine-learning-based systems like the recommendation system on YouTube (Alvarado et al., 2020). The field of HCI therefore recognized the importance of involving users in the design, implementation, and evaluation of interactive computing systems. While users are frequently the subject of investigation, recent trends in interaction design aim to involve users much earlier and more deeply.

If users are deeply involved in the design and development of NLP systems, they can share their expertise on the task at hand. On the one hand, this can yield insights into UI and interaction design for the NLP system (Yang et al., 2019). On the other hand, it is relevant regarding the output. Sharing control is also crucial considering the potential biases enacted by such systems. Deep involvement of a diverse set of users could help prevent problematic applications of machine learning and prevent discrimination based on gender (Bolukbasi et al., 2016) or ethnicity (Buolamwini and Gebru, 2018).
2.3 Experience Sampling

The need for very large text datasets in NLP has motivated and favored certain methods for data collection, such as scraping text from the web. These methods assume that text is "already there", i.e. they do not consider or facilitate its creation: for example, scraping Wikipedia neither supports Wikipedia authors, nor does it consider whether authors would want to have their texts included in such models.

To advance future HCI+NLP applications, it could be helpful to create and deploy tools for more interactive data collection. One important method here is the experience sampling method (ESM) (Csikszentmihalyi and Larson, 2014; van Berkel et al., 2017), which is used widely in HCI and could be deployed for NLP as well. This method of data collection repeatedly asks short questions throughout participants' daily lives, and thus captures data in context: for instance, an ESM smartphone app could prompt users to describe their current environment, an experience they had today, or to "donate" input and language data (e.g. from messaging) in an anonymous way (Bemmann and Buschek, 2020; Buschek et al., 2018). This could be enriched with further context (e.g. location, date, time, weather, phone sensors) to answer novel research questions, such as how a language model for a chatbot can improve its text generation and understanding by making use of the location or other context data. One important example of such experience sampling is work on citizen sociolinguistics, which explores how citizens can participate (often through mobile technologies) in sociolinguistic inquiry (Rymes and Leone, 2014). Although it would be challenging to collect massive amounts of text using this method, ESM-based data collection could be used to complement data collected via scraping (e.g. via fine-tuning with ESM data). ESM also supports more personalized and context-rich language data and models, from specific communities or contexts. This might cater to novel research questions, e.g. on context-based and personalized language modeling. More generally, methods like ESM give the people that act as data sources more of a "say" in the data collection for NLP, for instance via explicitly sharing data through an interactive ESM application, or via their rich daily contexts being better represented in metadata.
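As an illustration, the following sketch shows what a context-enriched ESM record and a simple signal-contingent prompt schedule could look like; the app, field names, and context attributes are assumptions made for the sake of the example, not an existing tool.

```python
# Illustrative ESM data model and prompt schedule (all names hypothetical).
import random
from dataclasses import dataclass, field
from datetime import datetime, time

@dataclass
class EsmResponse:
    participant_id: str      # pseudonymous ID, not a real identity
    prompt: str              # the short question shown on the phone
    text: str                # the language data the participant "donates"
    timestamp: datetime
    context: dict = field(default_factory=dict)  # e.g. location type, weather

def sample_prompt_times(n_prompts=5):
    """Draw random prompt times within waking hours (signal-contingent ESM)."""
    return sorted(time(hour=random.randint(9, 20), minute=random.randint(0, 59))
                  for _ in range(n_prompts))

# One record, as it might later be used to fine-tune a context-aware model.
record = EsmResponse(
    participant_id="p17",
    prompt="Describe your current environment in one sentence.",
    text="Crowded train, reading the news on my phone.",
    timestamp=datetime.now(),
    context={"location_type": "transit", "weather": "rain"},
)
print(record.context, sample_prompt_times())
```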
2.4 Crowdsourced Evaluation

As described, NLP has a strong tradition of using and reusing benchmark datasets, which are beneficial for comparable and standardized evaluations. However, some aspects cannot be evaluated in this way. First, comparisons with human language understanding or generation are limited to the (few) humans that originally provided data for the limited set of examples that these people had been given. Yet language understanding and use change over time, and vary between people and their backgrounds and contexts. Second, "offline" evaluations without people cannot assess interactive use of NLP systems by people (e.g. chatting with a bot, writing with AI text suggestions). Therefore, at the intersection of HCI and NLP, one may ask: is it possible to keep the benefits of (large) standardized benchmark evaluations while involving humans?
Crowdsourcing may provide one approach to address this: HCI and NLP researchers should create evaluation tools that streamline large-scale evaluations with remote participants. Practically speaking, one would then still set a benchmark task running "with one click", yet this would trigger the creation, distribution, and collection of crowd tasks. One example of this is GENIE, a system and leaderboard for human-in-the-loop evaluation of text generation (Khashabi et al., 2021).
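To sketch what such tooling might emit, the snippet below wraps a system's benchmark outputs into self-contained crowd rating tasks. The JSON schema and the rating dimensions are our own illustrative assumptions, not the format of GENIE or of any crowdsourcing platform.

```python
# Illustrative sketch: turn benchmark outputs into crowd rating micro-tasks.
import json
import uuid

def make_crowd_tasks(system_outputs, instruction,
                     ratings=("fluency", "accuracy")):
    """Wrap each (input, output) pair in a self-contained rating task."""
    tasks = []
    for example in system_outputs:
        tasks.append({
            "task_id": str(uuid.uuid4()),
            "instruction": instruction,
            "input": example["input"],
            "output": example["output"],
            "ratings": list(ratings),  # dimensions workers rate, e.g. 1-5
        })
    return tasks

outputs = [{"input": "Summarize: ...", "output": "A short summary."}]
tasks = make_crowd_tasks(outputs, "Rate the generated text on each dimension.")
print(json.dumps(tasks, indent=2))  # ready for upload to a crowd platform
```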
[Figure 1 shows an NLP system with its input and output, annotated with the five methods: 1. User-Centered NLP, 2. Co-Creating NLP Applications, 3. Experience Sampling, 4. Crowdsourced Evaluation, 5. User Models as Proxies.]

Figure 1: The model situates the five methodological proposals in the context of an NLP system.
2.5 User Models as Proxies

In addition to involving users deeply and collecting context-rich data, relevant aspects of people's interaction behavior with interactive NLP systems may also be modeled explicitly. HCI, psychology, and related fields offer a variety of models, for example relating to pointing at user interface targets or selecting elements from a list. Extending and improving those modeled aspects is particularly pursued in the emerging area of Computational HCI (Oulasvirta et al., 2018). Even though such models cannot replace humans, they may help evaluate certain aspects and parameter choices of an interactive NLP system in a standardized and rapid manner.

For instance, Todi et al. (2021) showed that approaches based on reinforcement learning can be used to automatically adapt related user interfaces. For interactive NLP, Buschek et al. (2021) investigated how different numbers of phrase suggestions from a neural language model impact user behavior while writing, collecting a dataset of 156 people's interactions. In the future, data such as this might be used, for example, to train a model that replicates users' selection strategies for text suggestions from an NLP system. Such a model might then be used in lieu of actual users to gauge general usage patterns for HCI+NLP systems, e.g. for interactive text generation.
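As a toy illustration of this idea, the following sketch defines a simulated user that accepts or ignores phrase suggestions based on word overlap with the intended text. The decision rule and both parameters are assumptions that would have to be calibrated on logged interaction data; this is not the model from Buschek et al. (2021).

```python
# Toy simulated user for phrase suggestions (all parameters hypothetical).
import random

class SimulatedUser:
    """Accepts a suggestion if it overlaps enough with the intended text."""

    def __init__(self, accept_threshold=0.5, attention=0.8):
        self.accept_threshold = accept_threshold  # required word overlap
        self.attention = attention                # chance of reading suggestions

    def choose(self, intended, suggestions):
        if random.random() > self.attention:      # suggestion bar ignored
            return None
        target = set(intended.lower().split())
        def overlap(s):
            words = set(s.lower().split())
            return len(words & target) / max(len(words), 1)
        best = max(suggestions, key=overlap, default=None)
        if best is not None and overlap(best) >= self.accept_threshold:
            return best                           # suggestion accepted
        return None                               # user keeps typing

user = SimulatedUser()
print(user.choose("see you tomorrow morning",
                  ["see you tomorrow", "talk to you", "good morning"]))
```

Running such a proxy thousands of times over a language model's suggestions would give a cheap, repeatable estimate of acceptance rates before any study with real participants.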
3 Discussion

Figure 1 situates the different methods in the context of HCI+NLP systems. The figure illustrates that two approaches are focused on the model side and three methods are focused on the user side. Methods 1 and 2 are focused on the NLP system itself. Method 1, User-Centered NLP, is at the heart of the model and focuses on users' understanding of the output and the explanations of the NLP system. While Method 2, Co-Creating an NLP system, is also strongly related to the user, we put it on the system side to highlight that the goal is not just to evaluate the experience with an NLP system, but to enable users to actively shape the system. This does not only include what the system looks like, but means involving users in the problem formulation stage and allowing them to shape what problem is being solved. Considering the input that an NLP system is trained on, Method 3, Experience Sampling, provides a simpler way of collecting metadata and more actively involving people in the collection of the dataset. Regarding the output of an NLP system, we showed the utility of Method 4, Crowdsourcing the Evaluation of NLP systems, which puts users into the loop to evaluate existing NLP systems at scale. The advantage of this is that a large number of users can be involved in the evaluation of the system. Finally, Method 5 proposes simulating real users through other ML-based systems. These User Models can act as proxies for real users and allow a fast, automated evaluation of NLP systems at scale. We hope that this work informs novel approaches on how to standardize tools for large-scale interactive evaluations that will generate comparable and actionable benchmarks.
4 Conclusion

The five methods presented in Figure 1 cover the whole spectrum of HCI+NLP systems, including the input, the NLP system, and the output of the system. Though each method has merits on its own, for successful future HCI+NLP applications, we believe that the whole will be greater than the sum of its parts. The design of future HCI+NLP applications should be centered around users (1) and involve them not only in the evaluation but also in the development and the problem formulation of an NLP system (2). Rich metadata (3) that shapes the input of such a system is equally important as a thorough investigation of the output of the system, both by humans-in-the-loop (4) and by approaches based on computational methods that automate certain key aspects of such systems (5).

We hope that this overview of HCI and NLP methods is a useful starting point to engage interdisciplinary collaborations and to foster an exchange of what HCI and NLP have to offer each other methodologically. With this work, we hope to stimulate a discussion that brings HCI and NLP together and that advances the methodologies for technical and human-centered system design and evaluation in both fields.
Acknowledgments

This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under project number 374666841, SFB 1342. This project is also partly funded by the Bavarian State Ministry of Science and the Arts and coordinated by the Bavarian Research Institute for Digital Transformation (bidt).
References
Oscar Alvarado, Hendrik Heuer, Vero Vanden Abeele, Andreas Breiter, and Katrien Verbert. 2020. Middle-aged video consumers' beliefs about algorithmic recommendations on YouTube. Proc. ACM Hum.-Comput. Interact., 4(CSCW2).

Association for Computational Linguistics. 2021. What is the ACL and what is Computational Linguistics?

Florian Bemmann and Daniel Buschek. 2020. LanguageLogger: A mobile keyboard application for studying language use in everyday text communication in the wild. Proc. ACM Hum.-Comput. Interact., 4(EICS).

N. van Berkel, D. Ferreira, and V. Kostakos. 2017. The experience sampling method on mobile devices. ACM Computing Surveys, 50(6):93:1–93:40.

Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4356–4364, Red Hook, NY, USA. Curran Associates Inc.

Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 77–91, New York, NY, USA. PMLR.

Daniel Buschek, Benjamin Bisinger, and Florian Alt. 2018. ResearchIME: A mobile keyboard application for studying free typing behaviour in the wild, pages 1–14. Association for Computing Machinery, New York, NY, USA.

Daniel Buschek, Martin Zürn, and Malin Eiband. 2021. The impact of multiple parallel phrase suggestions on email input and composition behaviour of native and non-native English writers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '21, New York, NY, USA. ACM. (forthcoming).

M. Csikszentmihalyi and R. Larson. 2014. Validity and reliability of the experience-sampling method. In M. Csikszentmihalyi, editor, Flow and the Foundations of Positive Psychology: The Collected Works of Mihaly Csikszentmihalyi, pages 35–54.

Jonathan Grudin. 2009. AI and HCI: Two fields divided by a common focus. AI Magazine, 30(4):48–48.

Hendrik Heuer. 2020. Users & Machine Learning-based Curation Systems. Ph.D. thesis, University of Bremen.

Juliane Jarke. 2021. Co-creating Digital Public Services for an Ageing Society: Evidence for User-centric Design. Springer Nature.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel S. Weld. 2021. GENIE: A leaderboard for human-in-the-loop evaluation of text generation.

Vassilis Kostakos. 2015. The big hole in HCI research. Interactions, 22(2):48–51.

Polska. Polski Komitet Normalizacyjny. Wydział Wydawnictw Normalizacyjnych. 2011. Ergonomics of Human-system Interaction – Part 210: Human-centred Design for Interactive Systems (ISO 9241-210:2010). Polski Komitet Normalizacyjny.

Antti Oulasvirta, Xiaojun Bi, and Andrew Howes. 2018. Computational Interaction. Oxford University Press.

Yvonne Rogers. 2012. HCI Theory: Classical, Modern, and Contemporary, 1st edition. Morgan & Claypool Publishers.

Betsy Rymes and Andrea R. Leone. 2014. Citizen sociolinguistics: A new media methodology for understanding language and social life. Working Papers in Educational Linguistics (WPEL), 29(2):4.

Kashyap Todi, Luis A. Leiva, Gilles Bailly, and Antti Oulasvirta. 2021. Adapting user interfaces with model-based reinforcement learning.

Qian Yang, Justin Cranshaw, Saleema Amershi, Shamsi T. Iqbal, and Jaime Teevan. 2019. Sketching NLP: A case study of exploring the right things to design with language intelligence. In