A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents
Pieter Wolfert, Member, IEEE, Nicole Robinson, Member, IEEE, and Tony Belpaeme, Member, IEEE
Abstract—Embodied conversational agents (ECAs) take on different forms, including virtual avatars or physical agents, such as humanoid robots. ECAs are often designed to produce nonverbal behaviour to complement or enhance their verbal communication. One form of nonverbal behaviour is co-speech gesturing, which involves movements that the agent makes with its arms and hands that are paired with verbal communication. Co-speech gestures for ECAs can be created using different generation methods, such as rule-based and data-driven processes. However, reports on gesture generation methods use a variety of evaluation measures, which hinders comparison. To address this, we conducted a systematic review of co-speech gesture generation methods for iconic, metaphoric, deictic or beat gestures, including their evaluation methods. We reviewed 22 studies that had an ECA with a human-like upper body that used co-speech gesturing in a social human-agent interaction, including a user study to evaluate its performance. We found that most studies used a within-subject design and relied on a form of subjective evaluation, but lacked a systematic approach. Overall, methodological quality was low-to-moderate and few systematic conclusions could be drawn. We argue that the field requires rigorous and uniform tools for the evaluation of co-speech gesture systems. We propose recommendations for future empirical evaluation, including standardised phrases and test scenarios to test generative models, and a research checklist that can be used to report relevant information for the evaluation of generative models as well as to evaluate co-speech gesture use.
Index Terms—human-robot interaction, virtual interaction, human-computer interface, social robotics
I. INTRODUCTION

Humans that interact with embodied conversational agents (ECAs) expect them to respond in a similar manner as human interlocutors would. Human communication involves a large non-verbal component, with some suggesting that a large portion of communicative semantics is drawn from the non-linguistic elements of face-to-face interaction [1]. Non-verbal behaviour can be broken down into several elements, such as posture, gestures, facial expressions, gaze, proxemics and haptics (i.e. touch during communicative interactions). All these elements convey different types of meaning, which can complement or alter the linguistic component of communication. Even minimal elements can provide a marked contribution to the interaction. Eye blinking, for example, in combination with head nodding has been found to influence
P. Wolfert and T. Belpaeme are with IDLab, Ghent University - imec, Technologiepark 126, 9052, Gent, Belgium, e-mail: [email protected]. N. Robinson is with Monash University, Department of Electrical and Computer Systems Engineering, Turner Institute for Brain and Mental Health, Victoria, Australia, and Queensland University of Technology, Australian Research Council Centre of Excellence for Robotic Vision, Brisbane, Australia. Manuscript received ...; revised ...
the duration of a response in a Q&A session [2]. A significant component of nonverbal communication is the use of gestures, such as movements of the hand, arm or body to express an idea, feeling or message [1]. Humans often use gestures in daily life, for example to point at objects in our visual space or to signal the size of an object. McNeill [3] categorized four kinds of human gestures: beat gestures, iconic gestures, metaphorical gestures, and deictic gestures. Iconic and metaphorical gestures both carry meaning and are used to visually enrich our communication [4]. An iconic gesture can be an up and down movement to indicate the action of slicing a tomato, whereas a metaphoric gesture can involve an empty palm hand that is used to symbolize 'presenting a problem'. In other words, metaphoric gestures are used for abstract symbols, whereas iconic gestures communicate more concrete content. Iconic and metaphoric gestures thus differ in terms of content, and they are also processed differently in the brain [5]. Beat gestures do not carry semantic meaning and are often used to emphasize the rhythm of speech; they have been shown to facilitate both speech production and word recall [6], [7]. Deictic gestures are used to point out elements of interest or to communicate directions, which facilitates learning [8]. Lastly, co-speech gesturing involves the use of movements that the agent makes with its arms and hands that are paired with verbal communication.

Fig. 1: Pepper on the left [9], a virtual avatar on the right [10]
A. Gesture Use in Human-Machine Interaction
Nonverbal behavior plays an important role in human-human interaction, and substantial efforts have been put into the generation of nonverbal behavior for ECAs. ECAs such as social robots have been built with a range of nonverbal behaviors, including the hardware and software capacity to make gesture-like movements [11]. Gesturing can have an important impact on human-agent interaction, influencing human behaviour and actions, which is central to effective communication [12], [13]. Gestures used by ECAs can influence human behavior in a variety of different ways, such as through social presence [14]. To demonstrate, deictic and beat gestures performed by ECAs can lead to better academic performance compared to agents that do not use them [15], [16]. In addition, humans were more willing to cooperate when an ECA showed appropriate gesturing (deictic, iconic and metaphoric gestures) in comparison to when an ECA did not use gestures or when the gestures did not match the verbal utterances [17]. In relation to embodiment, robotic ECAs can be perceived as more persuasive when they combine gestures with other social behaviors, such as eye gazing, in comparison with when they do not use either of these techniques [18], [19]. This collection of work demonstrates the impact that nonverbal behavior from ECAs can have on people, and its importance for consideration in human-agent interactions.

Over the years, AI systems have been built for intelligent gesture generation in ECAs, covering beat gestures, iconic gestures, metaphorical gestures, and deictic gestures. Gesture generation engines have been based on connecting language and gesture, given that the rhythm and semantic content signalled through gestures are highly correlated with the verbal utterance [3]. Early examples of ECA gesture generation relied on rule-based systems to generate gestures and nonverbal behavior, e.g. [20]. For example, the BEAT system for generating nonverbal behavior can autonomously analyse input text on a linguistic and contextual level, and the system assigns nonverbal behaviors, such as beat and iconic gestures, based on predefined rules [21]. A notable initiative was the Behavior Markup Language (BML), which provided a unified multimodal behavior generation framework [22]. BML describes physical behavior in an XML format and could be coupled with rule-based generation systems that provide descriptions in this format. To capture all aspects of nonverbal behavior generation, BML aimed to integrate not only gesturing but also other forms such as body pose, head nodding and gaze.

Instead of hand-coding, gesture generation systems can also be created from human conversational data, known as the data-driven approach [23], [24]. These data-driven methods have predominantly relied on neural networks. Paired with the rise of deep learning techniques, data-driven methods are capable of unprecedented generalisation, an invaluable property when generating high-dimensional temporal output such as gestures. Data-driven approaches using neural networks can generate dynamic and unique-looking gestures, dependent on the available training data. Audio-signal-based methods are now much better at creating dynamic and fluent beat gestures, whereas text-based methods show improved generation of iconic and metaphoric gestures. Some approaches learn a mapping from acoustic features of the speech signal to gestures [25], [26].
However, relying only on speech audio means that semantic details are lost, hence these approaches often only generate beat gestures. Recent work by Kucherenko et al. [27] combines neural networks for beat gesture generation with sequential neural networks for generating iconic gestures, dispensing with the need for a rule-based hybrid approach. In work by Yoon et al. [28], an encoder-decoder neural network was trained on combinations of subtitles and human poses extracted from public TED(x) videos. This allowed the network to infer a semantic relationship between written language, extracted from the video's subtitles, and gesture, and was used to generate beat and iconic gestures for a humanoid robot. This method was a notable advance in gesture generation, given that videos contain a wealth of human conversational data and are abundantly available. The data used to build data-driven gesture generation can vary: some use data collected from a large number of individuals [28], whereas others make use of data sets containing a single actor [29].
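To make the data-driven pipeline concrete, the following is a minimal sketch, not a reimplementation of any of the cited systems, of the encoder-decoder idea: a recurrent encoder reads a sequence of speech features and a decoder emits a pose sequence, trained against ground-truth motion with a regression loss. All dimensions and names are illustrative assumptions.

import torch
import torch.nn as nn

class Speech2Gesture(nn.Module):
    # Hypothetical sizes: 26 speech features (e.g. MFCCs) per frame,
    # 30 pose values (e.g. 10 upper-body joints x 3 rotations) per frame.
    def __init__(self, speech_dim=26, pose_dim=30, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(speech_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, speech_dim)
        enc_out, _ = self.encoder(speech_feats)
        dec_out, _ = self.decoder(enc_out)
        return self.to_pose(dec_out)  # (batch, frames, pose_dim)

model = Speech2Gesture()
speech = torch.randn(8, 100, 26)        # 8 clips of 100 frames each
target_poses = torch.randn(8, 100, 30)  # ground-truth motion-capture poses
loss = nn.MSELoss()(model(speech), target_poses)  # regression loss against ground truth

Note that such a regression loss is exactly the kind of objective criticised below: it measures closeness to the ground truth, not how natural the motion appears.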
B. Objective and Subjective Methods for Gesture Evaluation
A central component for any method that can generate human-like behaviour is the ability to evaluate the quality of the generated signals. To date, a variety of different methods have been used to evaluate gesture generation systems. One way is by using objective evaluations, often consisting of metrics for joint speed or joint trajectories. This can either be done at training time, when data-driven methods are used, or at run time. Some of the aforementioned studies have relied on objective evaluations, which often only report the loss functions used for training the neural network [30]. However, loss functions only tell how close the generated stimuli are to the ground truth; they do not provide information on whether the generated motion is dynamic or natural enough. Others include subjective evaluations from user studies to measure the effect of gestures on an external observer. A subjective analysis consists of a user study, where human participants evaluate the performance of the gestures used by the ECA. Examples include the listener's comprehension and recall of spoken material [15], [16], the naturalness of the generated motion [31], or the perceived appropriateness of timing [23]. Subjective evaluations are often collected by letting participants complete questionnaires on the nonverbal behavior. These can contain questions to evaluate naturalness, timing and semantic appropriateness using rating scales, such as Likert scales. Often, latent variables such as 'speech-gesture correlation' or 'naturalness' are evaluated using several items in one Likert scale. In human-robot interaction [28], researchers have used questionnaires for general robot evaluation, such as the Godspeed questionnaire, or a selected subset of items from such instruments. The Godspeed questionnaire can evaluate the perception of ECAs in a non-domain-specific manner, including how different gesture generation methods impact the humanlikeness, animacy, likeability and perceived intelligence of ECAs [32]. Objective and subjective measures provide vital insight into the evaluation of gesture generation, and a method that will help to improve the quality of the field, particularly if systematic conclusions can be drawn across studies.
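As an illustration of the objective metrics discussed above, the following sketch computes the mean squared error and an average position error between generated and ground-truth joint positions; the array shapes are assumptions, not a prescribed format.

import numpy as np

def mse(generated, ground_truth):
    # Mean squared error over all frames, joints and coordinates.
    return float(np.mean((generated - ground_truth) ** 2))

def ape(generated, ground_truth):
    # Average position error: mean Euclidean distance between
    # corresponding joints; inputs shaped (frames, joints, 3).
    return float(np.mean(np.linalg.norm(generated - ground_truth, axis=-1)))

generated = np.random.rand(100, 15, 3)     # 100 frames, 15 joints, xyz
ground_truth = np.random.rand(100, 15, 3)
print(mse(generated, ground_truth), ape(generated, ground_truth))

Both numbers quantify deviation from the recorded motion only; neither says anything about how natural or appropriate the motion appears to an observer, which is why subjective evaluation remains necessary.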
C. Review Aim and Objectives
The described collection of objective, subjective and questionnaire methods do not directly evaluate factors important to gesture generation and evaluation, for example the naturalness, semantic correctness and timing of co-speech gestures. However, they can contribute notable insight by indirect measurement of people's perception of the ECA, including robots. Given the impact that both effective and ineffective gestures can have on human-machine interaction, the ability to effectively identify and evaluate gesture suitability is vital, and an important next step in the field. This includes better understanding an individual score, rating or perception of use for the ECA across different agents, gesture sets, time points, or data-generation methods. However, there is no standardized generation and evaluation protocol available for the field of co-speech gesture generation for ECAs. A standardized questionnaire, measure or protocol would make comparing work from different sources more effective, and would allow for more reliable reporting of results to demonstrate improvement over time. To create a standardized method of gesture generation, the first step is knowing how to design systematic reporting methods. The completion of a comprehensive review and analysis of previous work in the field will assist in the synthesis of previous work, and help establish a proposed protocol that can be used for more robust evaluation of gesture generation methods and their resulting gestures.

In this paper, we present a systematic review that followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol [33] to identify and assess evaluation methods used in co-speech gestures. We conducted a review of studies with two aims: 1) to evaluate the quality of gesture work and 2) to analyse gesture generation and evaluation methods for co-speech gestures, with a specific focus on the evaluation methods. This review is considered timely given that work in co-verbal gesture generation is expanding, new techniques are emerging for creating novel gesture sets, and no systematic evaluation method has been provided to date. Central to this review, three core research questions were presented:
1) What evaluation methods are used to evaluate co-speech gesture generation?
2) Which evaluation methodology is considered the most effective for assessing co-speech gestures (i.e. objective or subjective methods)?
3) What evaluation methods and related metrics should be adapted to create a standardised evaluation or reporting protocol?
These research questions will be used to create a formulated approach for the field on how to make use of objective and subjective metrics to evaluate co-speech gesture performance of ECAs, including to create a standardised testing and reporting method.

II. METHODS
A. Search Strategy
This review focuses on evaluation studies of co-speech gesture generation methods for embodied conversational agents. Three databases were consulted for data extraction: IEEE Xplore, Web of Science and Google Scholar. IEEE Xplore was selected given that it captures a substantial number of publications in computer science and engineering. Web of Science and Google Scholar were used because they provide access to multiple databases with a wide coverage extending beyond computer science and engineering. Data and record extraction occurred on April 8, 2020, with a second pass on June 25, 2020 to collect new records. Two authors conducted independent data extraction steps to reduce the chance of relevant papers being missed from the review, which included inter-rater checks on the included records. The databases were queried using four different keyword strings: 1) "gesture generation for social robots", 2) "co speech gesture generation", 3) "non verbal gesture generation", and 4) "nonverbal behavior generation".
B. Eligibility - Inclusion and Exclusion
The following inclusion criteria were used:
1) The paper must report on gesture generation for either a robot or an embodied agent.
2) The ECA system must be humanoid in nature, with one or two human-like arms and/or hands that can be used to gesture information or messages to the human.
3) The ECA system must have displayed multiple gestures (i.e. a minimum of 2 different gestures, one of which must be a beat, iconic, metaphoric or deictic gesture).
4) Gestures created by the ECA system must be those that would be seen during a multi-modal social interaction.
5) The paper must report on a user study (i.e. not evaluated using technical collaborators or authors) conducted in a laboratory or performed remotely through online platforms.
6) The ECA system must be evaluated by human raters on its performance (i.e. directly or indirectly).
Papers were excluded when there was not a clear focus on the evaluation of co-speech gestures. Extracted records that only included beat gesture generation were recorded but excluded from the main analysis, because the review intends to focus on gesture methods that involve semantic information. Instead, a separate analysis outside of the PRISMA protocol is provided for work on beat gestures only. Unpublished and other survey work, such as PhD dissertations, reviews, technical papers and pre-prints, was not included in the search protocol. Only papers published in English were considered for this review. Papers could have been published in any year, given the recent emergence of gesture generation methods. Authors were not directly contacted for any unpublished works to be included in this review. Backward and forward searches were conducted on the final papers selected for the main analysis.

III. RESULTS
A. Selected Articles
The initial search conducted across three separate databases resulted in 295 papers, which contained 92 duplicate records. A total of 203 papers were screened on title and abstract for an initial exclusion step, resulting in 113 papers being omitted for not meeting inclusion criteria. The 90 remaining papers were assessed in detail by reviewing the main text for eligibility, which resulted in 22 papers that met all inclusion criteria. Extracted information from included manuscripts included publication year, venue, design and conditions, method of generation, objective metrics, subjective metrics, type of ECA, evaluation type (online or in laboratory), participants, characteristics of participants, and other important notes related to the experiment.

Fig. 2: PRISMA Flow Chart
B. Embodied Conversational Agents
In the 22 included studies, 16 studies (73%) used different human-like robots, such as Nao (n = 3, 14%), ASIMO (n = 3, 14%) or Wakamaru (n = 2, 9%). Only 6 (27%) reported the use of a virtual agent (e.g. [34]–[39]). All the virtual agents (n = 6, 27%) were modelled in 3D as a virtual human, and there were no consistent features across the agents between studies. Of the 6 studies, 4 used female avatars [34], [36], [37], [39], 1 used both [38] and 1 used a male avatar [35]. Half of the studies that used avatars showed only the upper body [35], [37], [39], whereas the other half showed full-body avatars [31], [36], [38]. Specific descriptions of the hands were not provided in all the studies that used avatars. Nineteen (86%) of the studies had the ECA perform iconic gestures, combined with other gestures [15]–[17], [28], [31], [34], [37], [39]–[49]. Seventeen (77%) of the studies (n = 22) included metaphoric gestures, combined with other gestures [15]–[17], [28], [31], [34]–[37], [39]–[42], [45]–[48]. Thirteen (59%) of the studies (n = 22) included deictic gestures, combined with other gestures [15]–[17], [28], [31], [34], [37], [39]–[41], [45]–[50]. Lastly, seventeen (77%) studies (n = 22) included beat gestures [15], [16], [28], [31], [34]–[36], [38], [39], [42]–[47], [49], [50]. Half of the studies had the ECA perform 'random gestures' that were included in the evaluation (i.e. gestures that had no alignment between gestures and speech). Other studies (n = 4) had the ECA present the user with a variety of different nonverbal behavior schemes, such as gestures that were based on text, speech or a combination of the two [17], [34], [43], [44].
C. Participants
A total of 1060 participants were involved across all of the studies, with numbers per study ranging from 13 to 250 (Mean = 50, SD = 50, Median = 35). Of these studies, 86% (n = 19) were conducted in the laboratory, and 14% (n = 3) were conducted either online through Amazon Mechanical Turk (n = 2) or during an exhibition (n = 1, i.e. 'in the wild'). In the 54% of studies (n = 12) that did report the mean age of the participants, the mean reported age across studies was 30.10 years (SD = 6.6). The remaining 46% (n = 11) did not provide demographic data for gender and age. In relation to trial location, 73% of studies (n = 16) were performed outside of English-speaking countries, with the top 3 countries being Germany (n = 5), Japan (n = 3) and France (n = 3). For participant recruitment, 27% of studies (n = 6) reported the use of university students or a 'convenience sample' to evaluate gesture generation. Table I provides a more detailed overview of the different studies, countries of origin and characteristics.
D. Research Experiment and Assessment
In research design, 68% of the studies (n = 15) used a within-subject design and 32% (n = 7) used a between-subject design. Most experimental methods involved participants being recruited to attend a university research laboratory (n = 18, 82%) to have an interaction with an ECA. Other methods used Amazon Mechanical Turk (AMT) (n = 2, 9%). In 41% of studies (n = 9), 'naturalness' was the most common metric for evaluating generated gestures. This was followed by synchronisation (n = 6, 27%), likeability (n = 4, 18%), and human-likeness (n = 2, 9%). Two studies (9%) [36], [41] let participants match gestures with audio. Forty-one percent of studies (n = 9) made use of models that learn to generate co-speech gestures.

For gesture generation assessment, questionnaires were used as a tool to evaluate ECA gesture performance in 73% of studies (n = 16). Only one study [41] included a previous iteration of its gesture model for evaluation. Four studies (18%) used ground truth as part of the gesture generation evaluation. Three studies (14%) relied on pairwise comparisons, such as two or more videos put side by side with the user selecting the video that best matches the speech audio, e.g. [38], [44], [46]. Other evaluation methods involved robot performance, e.g. [15], [16].
E. Objective and Subjective Evaluation
Table II provides a summary of studies that involved objective evaluation. Only 5 studies (23%) involved some form of objective evaluation metric as a key method in their evaluation. These metrics included a variation on the mean squared error between the generated and ground-truth gestures (n = 1), qualitative analyses of joint velocities and positions (n = 2), the log-likelihood of the generated motion (n = 1), and a cost function on kinematic parameters (n = 1).
TABLE I: Participants in Studies
Study | Country | Gender | Mean Age (SD) | N | Characteristics | Lab/Remote Evaluation
[28] | South Korea | 23M/23F | 37 (-) | 46 | 45 USA, 1 Australia | Amazon Mechanical Turk
[44] | Spain | - | - | 50 | Non-native English Sample | In Lab
[34] | Japan | - | - | 10 | Age + Gender not specified | In Lab
[31] | Japan | - | - | 20 | Age + Gender not specified | In Lab
[43] | Japan | - | - | 13 | - | In Lab
[39] | Slovenia | 22M/8F | - | 30 | - | In Lab
[36] | U.S.A. | - | - | 250 | One 'worker' per comparison | Amazon Mechanical Turk
[16] | U.S.A. | 16M/13F | 22.62 (4.35) | 29 | Convenience Sample | In Lab
[41] | Germany | 10M/10F | 28.5 (4.53) | 20 | Native Germans | In Lab
[42] | France | 14M/7F | 21-30 | 21 | Convenience Sample | In Lab
[37] | Slovenia | 23M/7F | 26.73 (4.88) | 30 | Convenience Sample | In Lab
[15] | U.S.A. | 16M/16F | 24.34 (8.64) | 32 | Convenience Sample | In Lab
[17] | Germany | 30M/32F | 30.90 (9.82) | 62 | Convenience Sample | In Lab
[40] | Germany | 30M/30F | 31 (10.21) | 60 | Native Germans | In Lab
[50] | South Korea | - | - | 65 | - | In Lab
[45] | France | 36M/27F | 37 (12.14) | 63 | Convenience Sample | In Lab
[47] | France | - | - | 63 | French Speakers | In Lab
[48] | Germany | 20M/20F, 20M/20F | 31.31 (10.55) / 31.54 (10.96) | 81 | Two Studies | In Lab
[38] | U.S.A. | 21M/14F | 23 (-) | 35 | Convenience Sample | In Lab
[46] | U.S.A. | - | - | 54 | - | In Lab
[35] | U.S.A. | 20M/6F | 24-26 (-) | 26 | Non-experts | In Lab
[49] | Germany | - | - | - | - | Exhibition
F. Additional Results - Beat Gestures
Research work that focused on beat gesture generation only was excluded from the main analysis, given that beat gestures do not carry semantic information. Methods used to evaluate the performance of beat gesture generation systems in ECAs were similar to those used in work on semantic gesture generation. Ten papers were selected that met the criteria [23], [25], [51]–[58]. A total of 70% of studies (n = 7) mentioned the number of participants, with a total of 236 participants. Only 40% (n = 4) reported statistics on age and gender. Of the ten studies, three (30%) were performed in a lab, and 5 (50%) online or via AMT. One study evaluated through participation in an exhibition, and two studies did not evaluate at all. As beat gesture generation mostly relied on prosody information, 80% (n = 8) of papers used a data-driven approach. Only four of the eight studies that relied on data-driven methods reported the metrics used for an objective evaluation, with either the average position error (APE) or the mean squared error (MSE). Up to 70% of papers (n = 7) ran their evaluation on a virtual avatar or stick figure with no discernible face. The subjective evaluations performed in these studies were often similar to those in the main analysis. Sixty percent of the studies (n = 6) used a post-experiment questionnaire to assess the quality of the gestures generated by the ECA, 30% (n = 3) relied on pairwise comparisons, and one study relied on the time spent with focused attention while looking at an ECA [53]. All studies (n = 10) relied on a within-subject evaluation, where participants are shown stimuli of multiple conditions. The questionnaire items used most often were naturalness (4 times) and time consistency (4 times).

IV. PRINCIPAL FINDINGS AND IMPLICATIONS
In this section, we examine the above results in more detail, including a discussion of their impact and implications for gesture generation methods. Due to the high variation and diversity in the experiments presented in the main analysis, a meta-analysis of the experiments was not performed. Instead, an in-depth evaluation is provided.
A. Participant Sample
More than half of the studies involved in the main analysis did not report the average age and gender of their participants. This is a notable weakness for generalisation from current gesture generation and evaluation methods, given that these statistics provide important insight into the sample that has been used to evaluate gestures, and their absence reduces the generalisability and reproducibility of the body of work conducted to date. A large number of studies (30%) used participants that were readily available, for example from a higher education campus. However, such a convenience sample of students is not representative of the general population, and may result in a predominately young adult cohort from higher socioeconomic backgrounds being involved in user studies [59]. Subsequently, the evaluation of gestures generated from models represents a more narrow cultural and social viewpoint, and some gestures that are acceptable and natural in other cultures may have been misrepresented or rated poorly in the evaluation process from the use of a more restricted sample. The reported participant samples also came from diverse geographical locations, which are known to have developed unique gesture styles and communication methods; this would have influenced the accuracy, naturalness and perceived performance ratings of gesture generation [60]. While it is a strength to have diverse viewpoints, these studies cannot be easily pooled together, due to variation in each trial, to represent a more inclusive cohort, which causes difficulty in drawing conclusions on gesture generation evaluation practices conducted to date using a collective participant sample.

TABLE II: Objective Evaluation Methods
Study | Generation Method | Objective Metrics | Agent
[28] | Data Driven | Variation on Mean Squared Error | NAO
[44] | Rule Based | - | REEM-C
[34] | Data Driven | - | Virtual Agent (3D)
[31] | Data Driven | - | Android Erica
[43] | Data Driven | Log-likelihood of generated motion | Pepper
[39] | Data Driven | - | Virtual Agent (3D)
[36] | Rule Based | - | Virtual Agent (3D)
[16] | Data Driven | - | Wakamaru
[41] | Rule Based | Qualitative Analysis of Joint Positions | ASIMO
[42] | Rule Based | - | NAO
[37] | Data Driven | - | Virtual Agent (3D)
[15] | Rule Based | - | Wakamaru
[17] | Rule Based | - | ASIMO
[40] | Rule Based | Qualitative Analysis of Joint Positions | ASIMO
[50] | Rule Based | - | Industrial Service Robot
[45] | Rule Based | - | NAO
[47] | Rule Based | - | NAO
[48] | Rule Based | - | ASIMO
[38] | Data Driven | Cost Function on Kinematic Parameters | Virtual Agent (3D)
[46] | Rule Based | - | ASIMO
[35] | Data Driven | - | Virtual Agent (3D)
[49] | Rule Based | - | Fritz
B. Recruitment and Trial Location
The use of online workers through services such as Amazon Mechanical Turk (AMT) or Prolific has its merits, but also has important implications for the evaluation of gesture generation. Large amounts of data can be collected for a modest budget and in a very short period of time, at times reaching participants from different global regions with very diverse backgrounds. In addition, studies have shown that crowd-sourced data can be of comparable quality to lab-based studies [61]. However, Mechanical Turk is known for having a US-heavy user base, which provides a skewed cultural perspective, especially for studies related to gesture evaluation for communication purposes. While it is possible to select for data response quality using ratings and checker questions, there are limited ways to control what the person is doing during the testing session, and the time taken to evaluate the gesture may not have been sufficient. More importantly, the user experience when interacting with a video of a robot is vastly different from interacting with a real robot, and some semantic meaning can be lost via digital translation. It is therefore important to consider studies conducted only with crowd-sourcing methods with some caution, including for subjective user evaluations of interactive robots [62].
C. Experimental Set-up and Assessment
In the main analysis, 68% of the studies relied on a within-subject design, which helps to evaluate iterations of gestures over multiple exposures, introduces less variation in the participant scores, and requires fewer participants to achieve an outcome. This demonstrates that systematic evaluation of multiple gestures had been established, given that the same participants were used to review multiple gestures, leading to more in-depth data around preference reporting. It is, however, somewhat problematic that not all studies relied on ground truth comparisons. A ground truth condition could have involved recordings of gestures by a human, which were then compared to the artificially generated gestures. Human ground truth can serve as a concrete baseline, and should score the highest on scales for appropriateness and naturalness, providing a clear comparison to participant evaluation scores. A number of studies also involved random movement generation as a control condition. Random movement is interpreted in different ways: some take random samples from their data set, which are then put on top of the original speech [28], while others insert random parameters for generating gestures [16]. Random gestures are an important control condition for this type of work, as they test whether people are simply attributing meaning to every gesture seen in the experiment, whether it was a relevant co-speech gesture or not. Overall, the quality of the experimental set-up for gesture generation and evaluation was moderate, and some improvements could be made in future work.
D. Evaluation Methods
The papers demonstrated that there was not a consistent set of evaluation metrics to employ for the evaluation of gesture generation, with different research groups focusing on their own features of interest. In most cases, evaluation methods such as questionnaires were used for assessing the quality of co-speech gestures in ECAs. Different questionnaires did extract information around similar outcomes, but there was no gold-standard instrument, or agreement on a single questionnaire to evaluate the perception of generated gestures. Many items were evaluated on one dimension, which can miss evaluation of the full gesture presentation. Questionnaires often involved the use of Likert scales, which have been brought into debate for their often incorrect use in HRI [63], such as failing to report internal consistency; only [15], [16] reported this. Objective evaluations were also highly varied, from mean squared error to joint velocities and positions. Overall, evaluation methods were still preliminary and emergent, and the presented studies did not allow for a recommended instrument or methodology to be nominated.
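Reporting internal consistency is straightforward; as a minimal sketch (with a hypothetical participants-by-items response matrix), Cronbach's alpha for a multi-item Likert scale can be computed as follows.

import numpy as np

def cronbach_alpha(ratings):
    # ratings: (n_participants, n_items) matrix of Likert responses.
    n_items = ratings.shape[1]
    sum_item_var = ratings.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = ratings.sum(axis=1).var(ddof=1)       # variance of summed scores
    return (n_items / (n_items - 1)) * (1 - sum_item_var / total_var)

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(30, 5)).astype(float)  # 30 raters, 5 items, 5-point scale
print(round(cronbach_alpha(ratings), 2))  # values >= 0.7 are commonly considered acceptable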
TABLE III: Subjective Evaluation Methods
Study | Design | Conditions | Gesture Types | Evaluation | Questionnaire items
[28] | Within Subject | Ground truth, proposed method, nearest neighbours, random or manual | Iconic, Beat, Deictic, Metaphoric | Questionnaire | Anthropomorphism, Likeability, Speech-gesture correlation
[44] | Within Subject | Part-of-Speech-Based, Prosody-Based, Combined | Iconic, Beat | Pairwise + Questionnaire | Timing, Appropriateness, Naturalness
[34] | Within Subject | None, Random, Proposed Method | Iconic, Beat, Deictic, Metaphoric | Questionnaire | Naturalness of Movement, Consistency in utterance and movement, Likeability, Humanness
[31] | Within Subject | No hand motion, Direct Human mapping, Text-based gestures, Text-based + prosody-based gestures | Iconic, Beat, Deictic, Metaphoric | Questionnaire | Human-likeness, Gesture-speech suitability, Gesture Naturalness, Gesture Frequency, Gesture Timing
[43] | Within Subject | Ground truth, seq2seq, seq2seq (model) + semantic, seq2seq TTS + semantic | Iconic, Beat | Questionnaire | Naturalness, Skill of presentation, Utilization of gesture, Vividness, Enthusiasm
[39] | Within Subject | Text+Speech (no avatar), Gestures | Iconic, Beat, Deictic, Metaphoric | Questionnaire | Content Match, Synchronization, Fluidity, Dynamics, Density, Understanding, Vividness
[36] | Within Subject | Hands never go into relax position, hands always go into rest position | Beat, Metaphoric | Match preference | N.A.
[16] | Between Subject | Learning-based, unimodal, random, conventional | Iconic, Beat, Deictic, Metaphoric | Questionnaire + Retelling Performance | Immediacy, Naturalness, Effectiveness, Likeability, Credibility
[41] | Within Subject | Old version, new version of model | Iconic, Deictic, Metaphoric | Match preference | N.A.
[42] | Within Subject | Introverted versus Extraverted Robot, Adapted Speech and Behavior versus Adapted Speech | Iconic, Beat, Metaphoric | Questionnaire | 24 questions on personality, interaction with the robot, speech and gesture synchronization and matching
[37] | Between Subject | Virtual avatar versus iCub robot | Iconic, Deictic, Metaphoric | Questionnaire | Content Matching, Synchronization, Fluidness, Speech-Gesture Matching, Execution Speed, Amount of Gesticulation
[15] | Between Subject | Amount of gestures, randomly selected | Iconic, Beat, Deictic, Metaphoric | Questionnaire + Retelling Performance | Naturalness, Competence, Effective use of Gestures
[17] | Between Subject | Unimodal (speech only), congruent multimodal, incongruent multimodal | Iconic, Deictic, Metaphoric | Questionnaire | Humanlikeness, Likeability, Shared Reality, Future Contact Intentions
[40] | Between Subject | Unimodal versus multimodal (speech + gestures) in a kitchen task | Iconic, Deictic, Metaphoric | Questionnaire | Gesture Quantity, Gesture Speed, Gesture Fluidity, Speech-Gesture Content, Speech-Gesture Timing, Naturalness
[50] | Within Subject | - | Deictic, Beat | Questionnaire | Suitability of Gestures, Synchronization, Scheduling
[45] | Within Subject | Synchronized Gestures, not Synchronized Gestures, Gestures with Expressivity, Gestures without Expressivity | Iconic, Beat, Deictic, Metaphoric | Questionnaire | Synchronization, Naturalness, Expressiveness, Contradictiveness, Gestures are complementary, Gesture-speech Redundancy
[47] | Within Subject | One Condition | Iconic, Beat, Deictic, Metaphoric | Questionnaire | Speech-Gesture Synchronization, Expressiveness, Naturalness
[48] | Between Subject | Study 1: Unimodal versus Multimodal; Study 2: Same | Iconic, Deictic, Metaphoric | Questionnaire | Appearance, Naturalness, Liveliness, Friendliness
[38] | Within Subject | Generated versus Ground Truth | Iconic, Beat | Pairwise | -
[46] | Within Subject | 4 studies: Audio vs Wrong Audio; Excited vs Calm Gestures; Low/Medium/High Expressivity; Slow/Medium/Fast Gesticulation | Iconic, Beat, Deictic, Metaphoric | Pairwise | -
[35] | Within Subject | Speaker 1, speaker 2 | Beat, Metaphoric | Match style to speaker | -
[49] | - | - | Iconic, Beat, Deictic | Public Exhibition | -
Conditions and Questionnaire items as reported in the publications.
V. RECOMMENDATIONS FOR GESTURE EVALUATION
The reviewed literature on gesture generation, together with lessons from common practice in other fields, suggests clear recommendations for how to evaluate gestures, under the assumption that a system generates gestures for an ECA. This section provides recommendations for the use of gesture generation and evaluation methods based on the main findings of the review.

First, performance measures of machine learning in data-driven approaches can be informative for reproducing research outcomes, but do not inform about the quality of the generated gestures. For that, the involvement of participants is required. This includes diverse population samples across different age ranges, ethnicities, and geographical backgrounds. We recommend that participant evaluation is conducted, when feasible, at each major evaluation point, to ensure better validity and relevance when the system is deployed for human social interaction.

Second, the evaluation of AI-generated gestures shows significant overlap with the evaluation of other artificially generated human-like signals, such as facial expressions or speech. Adopting similar evaluation methods from these fields can inform co-speech gesture evaluation. To demonstrate, text-to-speech has a tradition of evaluating the quality of speech synthesis by playing synthesised speech to human raters. The recommended standard maintained by the International Telecommunication Union is the mean opinion score (MOS) [64]. Raters assess speech quality on a 5-point scale covering different areas, including comprehension (how easy was it to understand the spoken sentence), naturalness, intelligibility (how well does a transcription by the raters match the original written sentence) and likeability [65]. Both objective and subjective measures are used and quantified, allowing for statistical comparison. This includes the design of a measure that is robust to increasing complexity and that can carry across different contexts [66]. Direct measures of the quality of gestures can include naturalness, human-likeness, fluency, appropriateness, or intelligibility (in the case of iconic or deictic gestures). However, it should be noted that there is no absolute result, given that different variants of co-verbal speech gesturing can be considered appropriate enough for use in human-agent interaction. Several similar variants can be rated quite high or low, but a standardised instrument will assist with identifying this. Therefore, our recommendation is to adopt a similar standard instrument for gesture generation and evaluation with direct measurement. We also recommend the use of direct measures of quality, rather than indirect measures, not only because they will have less variance, but also because they are less sensitive to context and, with that, confounds.

Third, objective evaluation can be made more rigorous using the contrastive approach, commonly implemented as A/B testing or side-by-side testing. Some participant measures are better than others: preference for a stimulus in A/B testing will be more reliable than an indirect questionnaire on attitudes towards the ECA. For preference testing, two variants of a stimulus are presented, and participants indicate a preference between them, such as selecting the most natural or interesting gesture. Such preference counts return quantitative results that can be tested for statistical significance using two-sample hypothesis testing, such as Student's t-test or a Chi-squared test.
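A minimal sketch of such a significance test on A/B preference counts (the counts below are hypothetical):

from scipy.stats import chisquare

# Hypothetical outcome: of 120 raters, 74 preferred variant A
# (e.g. the proposed model) and 46 preferred variant B (a baseline).
observed = [74, 46]
expected = [sum(observed) / 2] * 2  # null hypothesis: no preference
stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")  # small p -> preference unlikely to be chance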
A contrastive approach is a more accurate and suitable method for gesture evaluation than single stimulus presentations or a simple between-group design, in which participants have no baseline or other stimulus to compare against. Between-group evaluation can often lead to increased response variation and fewer significant outcomes, particularly if the individuals evaluating the gestures are not aware of the agent's capacity for more sophisticated fine motor movements, or do not understand the co-speech context, leading to less intuitive gestures being rated as acceptable. In all cases, when developing a gesture generation system, comparisons to a baseline (for example, an earlier version of the algorithm) or a gold standard (gestures recorded from human conversations and played back on the ECA) are to be preferred over evaluations that are not contrastive. Therefore, our recommendation is for researchers to employ more contrastive approaches in user testing evaluation to improve the capacity for more relevant responses being recorded. This recommendation includes providing a clear context and rationale to people when they are providing an evaluation of the gesture, such as asking them to evaluate the gesture as if it were used in a social interaction, or as part of a gesture set for providing instructions to a person. Not all of the presented studies had the sole aim of improving co-speech gesturing, but when possible, researchers should include ground truth comparisons for a proper baseline. We do not recommend evaluating the ECA in isolation; without a baseline or a second condition to evaluate against, the results will have high variation across participants and hence low validity.

Fourth, given that gestures are part of multi-modal communication, gestures should not be evaluated on appearance alone, but also on secondary effects, such as task completion, level of acceptability or perceived sociability. This could also include dimensions such as the coherence of the message communicated by the ECA, the recall of information, or trustworthiness. Other metrics could measure the level of engagement through the amount and duration of eye contact, or through the response duration in question-answer sessions. We recommend both a standard evaluation corpus and, when possible, evaluation of the ECA in relation to other metrics, such as social, task or perception-based items.

Lastly, the evaluation of generated gestures would benefit from a standard corpus of sentences or contexts, which would allow for the comparison of different systems. If each gesture generation system can be demonstrated on a selection of standard phrases or scenarios, then side-by-side comparisons of generative systems would be made possible. However, standard evaluation corpora do often suffer from the fact that development becomes increasingly focused on improving and overspecialising on those evaluation sets. A set of recommended sentences and contexts is provided below for researchers to increase the standardization of gesture generation and evaluation methods.

To summarise, we propose these recommendations:
1. Conduct user-testing and evaluation studies for both gesture generation and the resulting gesture use.
2. Create a standard instrument to measure generated gestures, similar to instruments in neighbouring fields.
3. Employ more contrastive approaches in user testing evaluation.
4. Make use of direct measures such as naturalness, human-likeness, fluency, appropriateness, or intelligibility.
5. Measure secondary gesture effects as well, such as task performance.
6. Adopt a standard corpus for measuring gesture generation metrics such as quality.

VI. STANDARDIZED TEST PHRASES TO EVALUATE GESTURE GENERATION
Here, we provide a selection of sentences and scenarios related to the types of gestures that might be used for iconic, deictic, metaphoric, and beat gestures. Our proposal is that each gesture generation method should include at least two of the relevant sentences below to evaluate its generated gesture type, while assuring that the test phrases were not part of the training set in data-driven methods. For gesture generation models that create multiple types of gestures (e.g. iconic and metaphoric), researchers should include at least two of the relevant phrases per gesture type; in this example, that would amount to four sentences. We also provide some suggested scenarios against which gesture generation methods can be evaluated for future systematic assessment. Detailed instructions and further information on using the proposed standardized phrases and scenarios can be found at [redacted].

Sentences:
1. "I caught a large fish in the pond on the right" (iconic)
2. "Don't you know what a ball is?" (iconic)
3. "That takes about a whole day." (metaphoric)
4. "That's a very smart person" (metaphoric)
5. "I am really enthusiastic to get this started!" (beat)
6. "I have to stress that this is really important!" (beat)
7. "My car keys are on the table." (deictic)
8. "The vase holding the flowers is right over there." (deictic)

Scenarios:
1. A scenario in which visitors are greeted and welcomed at a reception desk. The ECA is providing detailed information to the person in a one-way interaction.
2. A scenario in which an ECA is talking to a person to provide assistance in a kitchen setting. The ECA is providing information and a set of instructions in a one-way interaction.
3. A scenario in which pupils/students need convincing to do their homework. The ECA is providing affirmation and statements to the individual in a one-way interaction.
4. A scenario in which an ECA is teaching a subject as the teacher in the room. The ECA is leading the conversation by asking questions and providing answers in a two-way interaction.
5. A scenario in which visitors can ask for information. The ECA is leading the interaction through options and responding to questions from the individual in a two-way interaction.

VII. A CHECKLIST FOR SYSTEMATIC REPORTING OF GESTURE GENERATION AND EVALUATION
We provide a non-exhaustive checklist for researchers who are working on evaluating the performance of their AI-generated gestures for both physical and non-physical agents (see Table IV). Our recommendation is that researchers publishing work in this space include this table in their publication, so that more systematic evaluation and benchmarking can be conducted in the future. An example of a completed checklist, including the generated table and steps for inclusion in the final publication, can be seen at https://github.com/pieterwolfert. When possible, demographic data must report the mean and standard deviation scores.

VIII. CONCLUSION
We reviewed studies on the generation and evaluation of co-speech gestures for ECAs, with a specific focus on evaluation methods. Findings revealed that there were large differences in how evaluation methods were approached, and there did not seem to be a gold standard for evaluation. Our main analysis found that many studies did not mention basic statistics on participant characteristics, few studies reported detailed evaluation methods, and there were no systematic reporting methods used for gesture generation and evaluation steps. Findings indicate that the field of gesture generation and evaluation requires higher experimental rigour and methodology in conducting systematic evaluation of the designed systems, e.g. [67]. We argue that the field would benefit from rigorous reporting practices and call for standardized evaluation methods to produce results that can be compared across studies. To do so, we propose a set of standardized testing phrases, scenarios and a research checklist to allow for systematic evaluation and reporting of gesture generation systems.

REFERENCES
[1] M. L. Knapp, J. A. Hall, and T. G. Horgan, Nonverbal Communication in Human Interaction. Cengage Learning, 2013.
TABLE IV: Checklist for co-speech gesture evaluation

Embodied Conversational Agent:
[ ] ECA: Avatar/robot
[ ] DOF (shoulder, elbow, wrist, hand, neck)
[ ] Level of articulation of hands

Demographics:
[ ] Recruitment method
[ ] Sample size
[ ] Age
[ ] Gender distribution
[ ] Geographical distribution
[ ] Prior exposure with ECAs
[ ] Language(s) spoken

Gesture Generation Model:
[ ] Included generated gestures: [iconic, metaphorical, beat, deictic]
[ ] Gesture generation model: [rule based, data driven, both, other]
[ ] Gesture generation model link/repository (if not included, why not?)

Gesture Generation Evaluation:
[ ] Context / application
[ ] Evaluation method/questionnaire set
[ ] Gestures annotated by human raters? [Yes/No]
[ ] How many human raters were used?
[ ] Inter-rater agreement

Metrics:
[ ] Objective metrics [average jerk, distance between velocity histograms]
[ ] Subjective metrics [humanlikeness, gesture appropriateness, quality, other]

Training dataset:
[ ] Domain of dataset
[ ] Length/size of dataset
[ ] Gesture types annotated in the dataset
[ ] Details on the actors in the dataset (N, language, conversation topic)

Statistical analysis scripts:
[ ] Link to scripts

[2] P. Hömke, J. Holler, and S. C. Levinson, "Eye blinks are perceived as communicative signals in human face-to-face interaction," PLoS ONE, 2018.
[3] D. McNeill, Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, 1992.
[4] A. Kendon, "Gesticulation and speech: Two aspects of the process of utterance," The Relationship of Verbal and Nonverbal Communication, no. 25, p. 207, 1980.
[5] B. Straube, A. Green, B. Bromberger, and T. Kircher, "The differentiation of iconic and metaphoric gestures: Common and unique integration processes," Human Brain Mapping, vol. 32, no. 4, pp. 520–533, 2011.
[6] C. Lucero, H. Zaharchuk, and D. Casasanto, "Beat gestures facilitate speech production," in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 36, no. 36, 2014.
[7] A. Igualada, N. Esteve-Gibert, and P. Prieto, "Beat gestures improve word recall in 3- to 5-year-old children," Journal of Experimental Child Psychology, vol. 156, pp. 99–112, 2017.
[8] K. Lucca and M. P. Wilbourn, "Communicating to learn: Infants' pointing gestures result in optimal learning," Child Development, vol. 89, no. 3, pp. 941–960, 2018.
[9] A. K. Pandey and R. Gelin, "A mass-produced sociable humanoid robot: Pepper: The first machine of its kind," IEEE Robotics & Automation Magazine, vol. 25, no. 3, pp. 40–48, 2018.
[10] S. Alexanderson, G. E. Henter, T. Kucherenko, and J. Beskow, "Style-controllable speech-driven gesture synthesis using normalising flows," in Computer Graphics Forum, vol. 39, no. 2. Wiley Online Library, 2020, pp. 487–496.
[11] C. Bartneck, T. Belpaeme, F. Eyssel, T. Kanda, M. Keijsers, and S. Šabanović, Human-Robot Interaction: An Introduction. Cambridge University Press, 2020.
[12] C. Breazeal, C. D. Kidd, A. L. Thomaz, G. Hoffman, and M. Berlin, "Effects of nonverbal communication on efficiency and robustness in human-robot teamwork." IEEE, 2005, pp. 708–713.
[13] S. Saunderson and G. Nejat, "How robots influence humans: A survey of nonverbal communication in social human–robot interaction," International Journal of Social Robotics, vol. 11, pp. 575–608, 2019.
[14] K. Allmendinger, "Social presence in synchronous virtual learning situations: The role of nonverbal signals displayed by avatars," Educational Psychology Review, vol. 22, no. 1, pp. 41–56, 2010.
[15] C.-M. Huang and B. Mutlu, "Modeling and evaluating narrative gestures for humanlike robots," in Robotics: Science and Systems, 2013, pp. 57–64.
[16] ——, "Learning-based modeling of multimodal behaviors for humanlike robots." IEEE, 2014, pp. 57–64.
[17] M. Salem, F. Eyssel, K. Rohlfing, S. Kopp, and F. Joublin, "To err is human(-like): Effects of robot gesture on perceived anthropomorphism and likability," International Journal of Social Robotics, vol. 5, no. 3, pp. 313–323, 2013.
[18] J. Ham, R. H. Cuijpers, and J.-J. Cabibihan, "Combining robotic persuasive strategies: The persuasive power of a storytelling robot that uses gazing and gestures," International Journal of Social Robotics, vol. 7, no. 4, pp. 479–487, 2015.
[19] V. Chidambaram, Y.-H. Chiang, and B. Mutlu, "Designing persuasive robots: How robots might persuade people using vocal and nonverbal cues," in Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, 2012, pp. 293–300.
[20] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone, "Animated conversation: Rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents," in Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH '94. New York, NY, USA: Association for Computing Machinery, 1994, pp. 413–420. [Online]. Available: https://doi.org/10.1145/192161.192272
[21] J. Cassell, H. H. Vilhjálmsson, and T. Bickmore, "BEAT: The behavior expression animation toolkit," in Life-Like Characters. Springer, 2004, pp. 163–185.
[22] S. Kopp, B. Krenn, S. Marsella, A. N. Marshall, C. Pelachaud, H. Pirker, K. R. Thórisson, and H. Vilhjálmsson, "Towards a common framework for multimodal generation: The behavior markup language," in International Workshop on Intelligent Virtual Agents. Springer, 2006, pp. 205–217.
[23] S. Levine, C. Theobalt, and V. Koltun, "Real-time prosody-driven synthesis of body language," in ACM SIGGRAPH Asia 2009 Papers, 2009, pp. 1–10.
[24] K. Bergmann and S. Kopp, "GNetIc – using Bayesian decision networks for iconic gesture generation," in International Workshop on Intelligent Virtual Agents. Springer, 2009, pp. 76–89.
[25] T. Kucherenko, D. Hasegawa, G. E. Henter, N. Kaneko, and H. Kjellström, "Analyzing input and output representations for speech-driven gesture generation," in Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, 2019, pp. 97–104.
[26] D. Hasegawa, N. Kaneko, S. Shirakawa, H. Sakuta, and K. Sumi, "Evaluation of speech-to-gesture generation using bi-directional LSTM network," in Proceedings of the 18th International Conference on Intelligent Virtual Agents, 2018, pp. 79–86.
[27] T. Kucherenko, P. Jonell, S. van Waveren, G. E. Henter, S. Alexanderson, I. Leite, and H. Kjellström, "Gesticulator: A framework for semantically-aware speech-driven gesture generation," 2020.
[28] Y. Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee, "Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots." IEEE, 2019, pp. 4303–4309.
[29] Y. Ferstl, M. Neff, and R. McDonnell, "Multi-objective adversarial gesture generation," in Motion, Interaction and Games, 2019, pp. 1–10.
[30] M. Marmpena, A. Lim, T. S. Dahl, and N. Hemion, "Generating robotic emotional body language with variational autoencoders." IEEE, 2019, pp. 545–551.
[31] C. T. Ishi, D. Machiyashiki, R. Mikata, and H. Ishiguro, "A speech-driven hand gesture generation method and evaluation in android robots," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3757–3764, 2018.
[32] C. Bartneck, D. Kulic, and E. Croft, "Measuring the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots," in Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, 2008. [Online]. Available: https://dl.acm.org/doi/proceedings/10.1145/1349822
[33] D. Moher, A. Liberati, J. Tetzlaff, D. G. Altman et al., "Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement," Int J Surg, vol. 8, no. 5, pp. 336–341, 2010.
[34] R. Ishii, T. Katayama, R. Higashinaka, and J. Tomita, "Generating body motions using spoken language in dialogue," in Proceedings of the 18th International Conference on Intelligent Virtual Agents, 2018, pp. 87–92.
[35] M. Neff, M. Kipp, I. Albrecht, and H.-P. Seidel, "Gesture modeling and animation based on a probabilistic re-creation of speaker style," ACM Transactions on Graphics (TOG), vol. 27, no. 1, pp. 1–24, 2008.
[36] Y. Xu, C. Pelachaud, and S. Marsella, "Compound gesture generation: A model based on ideational units," in International Conference on Intelligent Virtual Agents. Springer, 2014, pp. 477–491.
[37] I. Mlakar, Z. Kačič, and M. Rojc, "TTS-driven synthetic behaviour-generation model for artificial bodies," International Journal of Advanced Robotic Systems, vol. 10, no. 10, p. 344, 2013.
[38] S. Levine, P. Krähenbühl, S. Thrun, and V. Koltun, "Gesture controllers," in ACM SIGGRAPH 2010 Papers, 2010, pp. 1–11.
[39] M. Rojc, I. Mlakar, and Z. Kačič, "The TTS-driven affective embodied conversational agent EVA, based on a novel conversational-behavior generation algorithm," Engineering Applications of Artificial Intelligence, vol. 57, pp. 80–104, 2017.
[40] M. Salem, S. Kopp, I. Wachsmuth, K. Rohlfing, and F. Joublin, "Generation and evaluation of communicative robot gesture," International Journal of Social Robotics, vol. 4, no. 2, pp. 201–217, 2012.
[41] M. Salem, S. Kopp, and F. Joublin, "Closing the loop: Towards tightly synchronized robot gesture and speech," in International Conference on Social Robotics. Springer, 2013, pp. 381–391.
[42] A. Aly and A. Tapus, "A model for synthesizing a combined verbal and nonverbal behavior based on personality traits in human-robot interaction." IEEE, 2013, pp. 325–332.
[43] A. Shimazu, C. Hieida, T. Nagai, T. Nakamura, Y. Takeda, T. Hara, O. Nakagawa, and T. Maeda, "Generation of gestures during presentation for humanoid robots." IEEE, 2018, pp. 961–968.
[44] L. Pérez-Mayos, M. Farrús, and J. Adell, "Part-of-speech and prosody-based approaches for robot speech and gesture synchronization," Journal of Intelligent & Robotic Systems, pp. 1–11, 2019.
[45] Q. A. Le and C. Pelachaud, "Evaluating an expressive gesture model for a humanoid robot: Experimental results," in Submitted to 8th ACM/IEEE International Conference on Human-Robot Interaction, 2012.
[46] V. Ng-Thow-Hing, P. Luo, and S. Okita, "Synchronized gesture and speech production for humanoid robots." IEEE, 2010, pp. 4617–4624.
[47] Q. Le, J. Huang, and C. Pelachaud, "A common gesture and speech production framework for virtual and physical agents," in ACM International Conference on Multimodal Interaction, 2012.
[48] M. Salem, K. Rohlfing, S. Kopp, and F. Joublin, "A friendly gesture: Investigating the effect of multimodal robot behavior in human-robot interaction." IEEE, 2011, pp. 247–252.
[49] M. Bennewitz, F. Faber, D. Joho, and S. Behnke, "Fritz – A humanoid communication robot," in RO-MAN 2007 – The 16th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 2007, pp. 1072–1077.
[50] H.-H. Kim, Y.-S. Ha, Z. Bien, and K.-H. Park, "Gesture encoding and reproduction for human-robot interaction in text-to-gesture systems," Industrial Robot: An International Journal, 2012.
[51] J. Ondras, O. Celiktutan, P. Bremner, and H. Gunes, "Audio-driven robot upper-body motion synthesis," IEEE Transactions on Cybernetics, 2020.
[52] J. Kim, W. H. Kim, W. H. Lee, J.-H. Seo, M. J. Chung, and D.-S. Kwon, "Automated robot speech gesture generation system based on dialog sentence punctuation mark extraction." IEEE, 2012, pp. 645–647.
[53] P. Bremner, A. G. Pipe, M. Fraser, S. Subramanian, and C. Melhuish, "Beat gesture generation rules for human-robot interaction," in RO-MAN 2009 – The 18th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 2009, pp. 1029–1034.
[54] C.-C. Chiu and S. Marsella, "Gesture generation with low-dimensional embeddings," in Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, 2014, pp. 781–788.
[55] A. Fernández-Baena, R. Montaño, M. Antonijoan, A. Roversi, D. Miralles, and F. Alías, "Gesture synthesis adapted to speech emphasis," Speech Communication, vol. 57, pp. 331–350, 2014.
[56] P. Wolfert, T. Kucherenko, H. Kjelström, and T. Belpaeme, "Should beat gestures be learned or designed? A benchmarking user study," in ICDL-EPIROB 2019 Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, 2019, pp. 1–4.
[57] K. Takeuchi, D. Hasegawa, S. Shirakawa, N. Kaneko, H. Sakuta, and K. Sumi, "Speech-to-gesture generation: A challenge in deep learning approach with bi-directional LSTM," in Proceedings of the 5th International Conference on Human Agent Interaction, 2017, pp. 365–369.
[58] M. Kipp, M. Neff, K. H. Kipp, and I. Albrecht, "Towards natural gesture synthesis: Evaluating gesture units in a data-driven approach to gesture synthesis," in International Workshop on Intelligent Virtual Agents. Springer, 2007, pp. 15–28.
[59] R. A. Peterson and D. R. Merunka, "Convenience samples of college students and research reproducibility," Journal of Business Research, vol. 67, no. 5, pp. 1035–1041, 2014.
[60] M. LaFrance and C. Mayo, "Cultural aspects of nonverbal communication," International Journal of Intercultural Relations, vol. 2, no. 1, pp. 71–89, 1978.
[61] M. Buhrmester, T. Kwang, and S. D. Gosling, "Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?" Perspectives on Psychological Science, vol. 6, no. 1, pp. 3–5, 2011.
[62] T. Belpaeme, "Advice to new human-robot interaction researchers," Cham, Switzerland, pp. 355–369, 2020.
[63] M. L. Schrum, M. Johnson, M. Ghuy, and M. C. Gombolay, "Four years in review: Statistical practices of Likert scales in human-robot interaction studies," in Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, 2020, pp. 43–52.
[64] ITU-T Rec. P.800.1, "Mean opinion score (MOS) terminology," International Telecommunication Union, Geneva, 2006.
[65] N. Campbell, "Evaluation of speech synthesis," in Evaluation of Text and Speech Systems. Springer, 2007, pp. 29–64.
[66] R. Clark, H. Silen, T. Kenter, and R. Leith, "Evaluating long-form text-to-speech: Comparing the ratings of sentences and paragraphs," arXiv preprint arXiv:1909.03965, 2019.
[67] G. Hoffman and X. Zhao, "A primer for conducting experiments in human–robot interaction,"