A Large, Crowdsourced Evaluation of Gesture Generation Systems on Common Data: The GENEA Challenge 2020

Taras Kucherenko* (Division of Robotics, Perception and Learning, KTH Royal Institute of Technology, Stockholm, Sweden)
Patrik Jonell* (Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden)
Youngwoo Yoon* (ETRI & KAIST, Daejeon, Republic of Korea)
Pieter Wolfert (IDLab, Ghent University – imec, Ghent, Belgium)
Gustav Eje Henter (Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden)
ABSTRACT
Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems are then solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another and gives a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge.
CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI).

KEYWORDS
gesture generation, conversational agents, evaluation paradigms

*Equal contribution and joint first authors.
ACM Reference Format:
Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, and Gustav Eje Henter. 2021. A Large, Crowdsourced Evaluation of Gesture Generation Systems on Common Data: The GENEA Challenge 2020. In 26th International Conference on Intelligent User Interfaces (IUI '21), April 14–17, 2021, College Station, TX, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3397481.3450692
1 INTRODUCTION
This paper is concerned with systems for automatic generation of nonverbal behaviour, and with how these can be compared in a fair and systematic way in order to advance the state of the art. This is important because nonverbal behaviour plays a key role in conveying a message in human communication [36]. A large part of nonverbal behaviour consists of so-called co-speech gestures: spontaneous hand gestures that relate closely to the content of the speech and that have been shown to improve understanding [17]. Embodied conversational agents (ECAs) benefit from gesticulation, which, e.g., improves interaction with social robots [48] and willingness to cooperate with an ECA [46]. Knowledge of how and when to gesture is also needed; this can, for example, be learned from interaction data (see, e.g., [23] and references therein).

Synthetic gestures used to be based on rule-based systems, e.g., [8, 49]; see [56] for a review. These are gradually being supplanted by data-driven approaches, e.g., [3, 10, 28, 34], with recent work [2, 31, 61] showing improvements in gesticulation production for ECAs. However, the results in prior studies on gesture generation are not directly comparable. First, prior studies make use of a variety of different evaluation metrics. Second, they rely on different data sources and train their models on these different sources. Lastly, visualisations of the generated gestures use different avatars and production values, which can obscure the quality of the underlying gesture-generation approach. All these differences are, however, external to the actual methods that drive the gesture generation.
In this paper, we present the GENEA Challenge 2020, the first joint gesture-generation challenge that controls for all of these sources of between-paper variation by providing a common dataset for building gesture-generation systems, along with common evaluation standards and a shared visualisation procedure. (GENEA stands for "Generation and Evaluation of Non-verbal Behaviour for Embodied Agents". This paper extends a preliminary report [32], not peer reviewed, presented at the GENEA Workshop associated with the challenge.) The aim of the challenge is not to select the best team – it is not a contest, nor a competition – but to be able to compare different approaches and outcomes. This makes it possible to assess and advance the state of the art in gesture generation, and to measure the gap between it and natural co-speech gestures. Comparing the different methods and their performance also helps identify what matters most in gesture generation, and where the bottlenecks are. Challenge participants benefit by working on the same problem together with researchers interested in the same topic, strengthening the research community, and get an opportunity to compare their systems to other competitive systems in a large and carefully-executed joint evaluation.

Our concrete contributions are:
(1) Jointly evaluating several state-of-the-art gesture-generation models on a common dataset using a common 3D model and rendering method.
(2) Two large-scale user studies assessing human-likeness and appropriateness of submitted motion.
(3) Providing open code and high-quality data – comprising the pre-processed, multimodal training and test datasets, the standardised visualisation, a large number of subjective responses, and evaluation and analysis using open standards and code – in the spirit of reproducible research.
(4) Bringing researchers together in order to advance the state of the art in gesture generation, and enabling future research to compare and benchmark against systems from the challenge.

The remainder of this paper first presents prior work on gesture-evaluation practices (and their shortcomings) and discusses how challenges have helped in other fields. We then describe the challenge setup and its results, and finally turn to the implications for future challenges and for gesture generation as a whole.

2 PRIOR WORK
Most previous work proposing new gesture-generation methods incorporates an evaluation to support the merits of the method. Human gesture perception is highly subjective, and there are currently no widely accepted objective measures of gesture perception, so most publications have conducted human assessments instead. However, previous subjective evaluations, as reviewed in [59], have several drawbacks, major ones being the coverage of systems compared and the scale of the studies. As in [2, 30, 31, 45], proposed models are at most compared to one or two prior approaches (often a highly similar baseline) or possibly only to ablated versions of the same model. A large number of studies do not compare their outcomes with other methods at all. This creates an insular landscape where particular model families are only applied to particular datasets, and never contrasted against one another. As for scale, large evaluations are expensive, and studies may not be able to recruit enough participants, thus leaving the differences between many pairs of studied systems unresolved and not statistically significant (cf. [60, 61]). Questionnaires, which are one popular evaluation methodology (cf.
[4, 21, 47]), demand a lot of time and cognitive effort even before scaling up. In addition, the items used in questionnaires differ across studies, and the set of questions used is often not standardised. Sometimes, evaluations fail to anchor system performance against natural ("ground truth") motion from their database, e.g., [22, 33, 47]. Another significant difference between studies is how generated motion is visualised: some prior work (e.g., [29, 58]) displays motion through stick figures, or applies it to a physical agent (e.g., [21, 47]). Neither of these may allow the same expressiveness or range of motion as the 3D-rendered avatars in, e.g., [2, 31].

Although there is no directly related work on challenges that benchmark co-speech gestures in ECAs, other fields have done well using challenges to standardise evaluation techniques, establish benchmarks, and track and evolve the state of the art. For example, the Blizzard Challenges have since their inception in 2005 (see [5]) helped advance text-to-speech (TTS) technology and identified subtle but robust trends in the specific strengths and weaknesses of different speech-synthesis paradigms [26]. These challenges are open to both academia and industry. Participants are provided a common dataset of speech audio and associated text transcriptions, and use these to build a synthetic voice. The resulting voices are then evaluated in a large, joint evaluation. Challenge data, evaluation stimuli, and subjective ratings remain available after the challenge, and have been widely used both for benchmarking subsequent TTS systems, e.g., [9, 52], and for research on the perception of natural and artificial speech, e.g., [13, 37, 38, 50, 62].

Challenges are also actively used in the computer-vision community, for instance for benchmarking purposes. The recent CLIC [54] and NTIRE [41] challenges, for example, compared systems for image compression and super-resolution, respectively, also incorporating subjective human assessments similar to the challenge described in this paper (although they used a MOS-like setup, which has been found to be less efficient than the side-by-side evaluation methodology we employ [44]). This addresses the over-reliance on objective metrics in computer-vision evaluation, which, just as in speech quality and gesture generation, do not always align with human perception. Inspired by the successes of challenges in other fields of study, we conducted the first challenge in the field of gesture generation.
3 TASK DESCRIPTION
Our challenge focussed on data-driven gesture generation. We pose the problem of speech-driven gesture generation as follows: given input speech features s – which could be an audio waveform (a sequence of pressure samples), text (a word sequence), or the combination of the two – the task is to generate a corresponding pose sequence ĝ describing gesture motion that an agent might perform while uttering this speech. To enable direct comparison of different data-driven gesture-generation methods, all methods evaluated in the challenge were trained on the same gesture-speech dataset and their motion visualised using the same virtual avatar and rendering pipeline.
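As an illustration of this input/output contract, the sketch below shows the shape of the mapping in code. The function name, audio sample rate, and array layout are our assumptions for illustration, not part of the challenge specification; only the 20 fps output rate and the 15-joint upper-body pose come from the challenge setup described below.

```python
# Illustrative sketch of the speech-to-gesture mapping s -> g-hat.
# Names, sample rate, and shapes are assumptions, not challenge code.
import numpy as np

def generate_gestures(audio: np.ndarray, words: list[str]) -> np.ndarray:
    """Map input speech features (a waveform and/or a word sequence)
    to a pose sequence describing co-speech gesture motion.

    Returns an array of shape (T, 15, 3): T frames at 20 fps, with
    15 upper-body joints and 3 joint-angle parameters per joint
    (e.g., an exponential-map rotation representation).
    """
    num_frames = int(round(len(audio) / 16000 * 20))  # assume 16 kHz audio
    return np.zeros((num_frames, 15, 3))  # placeholder: motionless output
```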
4 CHALLENGE SETUP
4.1 Dataset
We based the challenge on the Trinity Gesture Dataset [11], comprising 244 min of audio and motion-capture recordings of a male actor speaking freely on a variety of topics. This is one of the largest publicly available datasets of parallel speech and 3D motion (in joint-angle space) in the English language. We removed lower-body data, retaining 15 upper-body joints out of the original 69. Finger motion was also removed due to poor capture quality.

To obtain verbal information from the speech, we first transcribed the audio recordings using Google Cloud automatic speech recognition (ASR), followed by a thorough manual review to correct recognition errors and add punctuation for both the training and test parts of the dataset. All names of non-fictive persons were removed and replaced by unique tokens in the transcriptions.

Before releasing the data to challenge participants, it was split into training data (3 h 40 min) and test data (20 min), with only the training data initially being shared with the participants. Both these subsets have since been made publicly available in the original dataset repository at trinityspeechgesture.scss.tcd.ie.
4.2 Rules and timeline
Each participating team could only submit one system for evaluation. As for the timeline, the speech-motion training data was released to participants on July 1, 2020. Test speech inputs (but not the corresponding motion) were released on August 7, and participants were asked to submit their generated gesture motion for the test inputs on or before August 15. The joint evaluation took place after the generated gestures were submitted.

Synthetic gesture motion had to be submitted at 20 frames per second (fps) in a format otherwise identical to that of the challenge training data. To prevent optimising for the specific evaluation used in the challenge and to encourage motion-generation approaches with long-term stability, participants were asked to synthesise motion for the 20 min of test speech in long, contiguous segments, from which a subset of clips was extracted for the user studies, similar to many Blizzard Challenges. Manual tweaking of the output motion was not allowed, since the idea was to evaluate how systems would perform in an unattended setting.
4.3 Participants and evaluated conditions
We recruited challenge participants through a public call for participation. Sixteen teams signed up for the challenge, and we distributed the dataset and baseline implementations to all of them. Five teams completed the challenge; the other teams were not able to submit results for evaluation. Two of the withdrawing teams explained that this was due to (in one case) reduced manpower and (in the other) unsatisfactory results. There were no reported withdrawals due to the challenge data or task.

The challenge evaluation contained 9 different conditions or systems: 2 toplines that represent human-quality gesture motion, 2 previously published baselines, and 5 challenge entries/submissions. Table 1 lists all conditions, together with participating team names and (abbreviated) affiliations. Following the practice established by the Blizzard Challenge, we anonymised the teams in the present paper by not revealing which team was assigned which ID, but individual teams are free to disclose their ID if they wish. Papers from each team describing their submitted systems in detail are collected in the proceedings of the GENEA Workshop 2020, available at zenodo.org/communities/genea2020.

The two toplines were:

N Natural motion capture from the actor for the input speech segment in question. Surpassing this system would essentially entail superhuman performance.

M Mismatched natural motion capture from the actor, corresponding to another speech segment than the one played together with the video. This was accomplished by permuting the motion segments from condition N in such a way that no segment remained in its original position. This represents the performance attainable by a system that produces very human-like motion (the same as N, so a topline), but whose behaviour is completely unrelated to the speech (and can thus be considered a bottom line in terms of motion appropriateness for the speech).

Since no previous general study compares systems to each other to establish the state of the art, it is hard to identify the "best" baseline systems to use. The choice was therefore more subjective and based on code availability, with the two baseline systems chosen from recent data-driven gesture-generation papers that had code available and were easy to reproduce. These were:

BA The system from [29], which only takes speech audio into account when generating system output. This model uses a chain of two neural networks: one maps from speech to a pose representation and another decodes the representation to a pose, generating motion frame by frame by sliding a window over the speech input.

BT The system from [61], which only takes text-transcript information (including word timing) into account when generating system output. This model consists of an encoder for text understanding and a decoder for frame-by-frame pose generation.

The original authors of the baseline systems updated their methods and code to perform well on the challenge material. In BA, the representation of upper-body poses in the challenge dataset differed from the data used in the original publication, and hence a new hyperparameter search was conducted to find optimal hyperparameters. In addition, the resulting motion was represented using the exponential map [14] and smoothed using a Savitzky–Golay filter [51] with window length 9 and polynomial order 3, as sketched in the code example below.

In BT, the representation of upper-body poses in the challenge dataset differed from that of the TED dataset used in the original publication.
Accordingly, the pose representation was changed from 2D Cartesian coordinates of 8 upper-body joints to 3 × 3 rotation matrices for the upper-body joints in the challenge data.
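The Savitzky–Golay smoothing applied to BA's output maps directly onto a standard library call. Below is a minimal sketch using the filter parameters stated above (window length 9, polynomial order 3); the array layout for the joint-angle data is our assumption.

```python
# Minimal sketch of the output smoothing described for baseline BA.
# The (frames, channels) layout of the motion array is an assumption.
import numpy as np
from scipy.signal import savgol_filter

motion = np.random.randn(200, 45)  # e.g., 15 joints x 3 exp-map values

# Smooth each joint-angle channel along the time axis.
smoothed = savgol_filter(motion, window_length=9, polyorder=3, axis=0)
```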
Table 1: Conditions participating in the evaluation. Teams are sorted alphabetically by name. The anonymised IDs of submitted entries begin with the letter 'S' followed by a second, randomly-assigned letter in the range A through E, but which letter is associated with each team is not revealed, in order to preserve anonymity. † indicates use of word vectors pretrained on external data.

Name or description | Origin | ID | Aud. | Text | Input-speech features | Motion output | Stochastic?
Natural motion | – | N | ✓ | ✓ | – | – | ✓
Mismatched motion | – | M | ✗ | ✗ | – | – | ✓
Audio-only baseline | Kucherenko et al. [29] | BA | ✓ | ✗ | MFCCs | Exp. map | ✗
Text-only baseline | Yoon et al. [61] | BT | ✗ | ✓ | FastText† | Rot. matrix | ✗
AlltheSmooth [35] | CSTR lab, UEDIN, Scotland | S… | ✓ | ✗ | MFCCs | Joint pos. | ✗
Edinburgh CVGU [42] | CVGU lab, UEDIN, Scotland | S… | ✓ | ✓ | BERT† & mel-spectr. | Rot. matrix | ✓
FineMotion [27] | ABBYY lab, MIPT, Russia | S… | ✓ | ✓ | GloVe† & mel-spectr. | Exp. map | ✗
Nectec [53] | HCCR unit, NECTEC, Thailand | S… | ✓ | ✓ | Phonemes, Spacy word vecs.†, MFCCs, & prosody | Exp. map | ✗
StyleGestures [1] | TMH division, KTH, Sweden | S… | ✓ | ✗ | Mel-spectr. | Exp. map | ✓

Source code and hyperparameters for both baseline systems are available on GitHub (BA: github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencoder/tree/GENEA_2020; BT: github.com/youngwoo-yoon/Co-Speech_Gesture_Generation). These implementations and hyperparameters were also made available to participating teams during the challenge.

We also considered including a re-implementation of the system from Ginosar et al. [12] as a third baseline, but this was dropped due to unsatisfactory results. This might be due to the challenge dataset being smaller than needed for that method, or due to difficulties with tuning the particular implementation we used.
5 EVALUATION
We conducted a large-scale, crowdsourced, joint evaluation of gesture motion from the nine conditions in Table 1 in parallel, using a within-subject design (i.e., every rater was exposed to and evaluated all conditions). The systems were evaluated in terms of the human-likeness of the gesture motion itself, as well as the appropriateness of the gestures for the given input speech. Jonell & Kucherenko et al. [24] recently found that crowdsourced evaluations did not differ significantly from in-lab evaluations in terms of results and consistency. We therefore adopted an entirely crowdsourced approach, as opposed to, for example, the Blizzard Challenge, which has used a mixed approach. Attention checks were used to exclude participants who were not paying attention, as detailed in Section 5.3.
Prior to motion being submitted, the organisers selected 40 non-overlapping speech segments from the test inputs (average segment duration 10 s) to use in the user-study evaluation. These speech segments, which were not revealed to participants, were selected across the test inputs to be full and/or coherent phrases. The motion from the corresponding intervals in the BVH files submitted by participating teams was extracted and converted to motion video clips using the visualisation server provided to participants (see Section 5.1), albeit at a higher resolution of 960 × 540.
Figure 1: Screenshot of the rating interface from the evaluation. The question asked in the image ("How well do the character's movements reflect what the character says?") originates from [25], and was changed for each of the two evaluations in this paper.
5.1 Visualisation
We used the same virtual avatar for all renderings during the challenge and the evaluation. The avatar can be seen in Figure 1. The avatar originally had 69 joints (full body including fingers), but only 15 joints, corresponding to the upper body without fingers, were used for the challenge. Since hand and finger data had been omitted, these body parts were assigned a static pose, with the hands lightly cupped (again, see Figure 1).

We also developed a visualisation server that enabled all participating teams to produce gesture-motion visualisations identical (except in resolution) to the video stimuli evaluated in the challenge. This was implemented using a Python-based web server which interfaced with Blender 2.83. Participants would send a 20 fps BVH file to the visualisation server, and these files were then processed as quickly as possible into videos visualising the motion on the avatar, in the order they came in. The same server was also used to render the final stimuli, but with the resolution increased to 960 × 540 instead of 480 × 270.
5.2 Evaluation methodology
In order to efficiently evaluate a large number of relatively similarly-performing systems in parallel, we used a methodology inspired by the MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) standard for audio-quality evaluation [19] from the International Telecommunication Union (ITU). There are, however, a number of differences between the MUSHRA standard and our evaluation, e.g., our use of video rather than audio and the omission of a designated reference and a low-end anchor, which correspond to the letters R and A in the acronym.

Figure 1 shows an example of the user interface used for the evaluation. Participants were first met by a screen with instructions on how to use the evaluation interface. They were then presented with 10 pages, on each of which they would compare and evaluate motion stimuli from all toplines, baselines, and most submitted systems, all for the same speech. Participants could return to previous conditions and change their ratings after seeing other examples. Lastly, they were presented with a page asking for demographics and their experience of the test. As can be seen in the figure, the 100-point rating scale was anchored by dividing it into successive 20-point intervals labelled (from best to worst) "Excellent", "Good", "Fair", "Poor", and "Bad". These labels were based on those associated with the 5-point scale used for Mean Opinion Score (MOS) [20] tests, another evaluation standard developed by the ITU.

For a detailed explanation of the evaluation interface we refer the reader to [25], which introduced and validated the evaluation paradigm for gesture-motion stimuli.
5.3 Experiment design
Each study was balanced such that each segment appeared on pages 1 through 10 with approximately equal frequency across all raters (segment order), and each condition was associated with each slider with approximately equal frequency across all pages (condition order). For any given participant and study, each page would use different speech segments. Every page contained condition N and (where relevant) condition M, but one other condition was randomly omitted from each page to limit the maximum number of sliders on a page to 8 or 7, depending on the study.

Three attention checks were incorporated into the pages for each study participant. These either displayed a brief text message over the gesticulating avatar reading "Attention! Please rate this video XX.", or temporarily replaced the audio with a synthetic voice speaking the same message. XX would be a number from 5 to 95, and the participant had to set the corresponding slider to the requested value, plus or minus 3, to pass the attention check. The numbers 13 through 19, as well as the multiples of 10 from 30 to 90, were not used for attention checks due to their acoustic ambiguity. Which sliders on which pages were used for attention checks was uniformly random, except that no page had more than one attention check, and conditions N and M were never replaced by attention checks.

We evaluated two aspects of the gesture motion, each in a separate study (a code sketch of the page-construction logic described here follows the two study descriptions below):
Human-likeness: This study asked participants to rate "How human-like does the gesture motion appear?", with the intention of measuring the quality of the generated motion while ignoring its link to the input speech. This study did not include speech in the stimulus videos and only used text-based attention checks (all videos were silent).

Appropriateness: This study asked participants to rate "How appropriate are the gestures for the speech?" This was intended to investigate the perceived link between motion and speech (both in terms of rhythm/timing and semantics), while ignoring motion quality as much as possible. This study included speech audio in the stimuli, and each participant had to pass one text-based and two audio-based attention checks.
5.4 Study participants
Study participants were recruited through the crowdsourcing platform Prolific (formerly Prolific Academic), restricted to a set of English-speaking countries (UK, Ireland, USA, Canada, Australia, and New Zealand). There was no requirement to be a native speaker of English, since Prolific does not support screening participants on that criterion. A participant could take either study or both studies, but not more than once each. Participants were remunerated 5.75 GBP for completing the human-likeness study (median completion time 33 min) and 6.50 GBP for the appropriateness study (median time 34 min).
5.5 Objective measures
Since subjective evaluation is costly and time-consuming, it would be beneficial for the field to agree on meaningful objective evaluations. As a step in this direction, we consider two numerical measures previously used to evaluate co-speech gestures: average jerk, and the distance between gesture-speed (i.e., absolute velocity) histograms.
The third time derivative of the joint positions is called jerk. Average jerk is commonly used to quantify motion smoothness [29, 39, 55]. We report average values of absolute jerk (defined using finite differences) across different motion segments. A perfectly natural system should have average jerk very similar to natural motion.
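Both objective measures amount to a few lines of array arithmetic. The sketch below computes average absolute jerk and the Hellinger distance between speed histograms (the formula is given in the next paragraph). The 20 fps frame rate matches the challenge data, but the array shapes, histogram bin count, and speed range are our assumptions; the challenge's own implementation is linked at the end of this subsection.

```python
# Sketch of the two objective measures, assuming joint positions are
# given as a (frames, joints, 3) array of 3D coordinates at 20 fps.
import numpy as np

FPS = 20  # challenge motion frame rate

def average_jerk(positions):
    """Mean magnitude of the third finite-difference time derivative."""
    jerk = np.diff(positions, n=3, axis=0) * FPS**3
    return np.linalg.norm(jerk, axis=-1).mean()

def hellinger_distance(positions_a, positions_b, joint, bins=50, top=15.0):
    """Hellinger distance between the speed histograms of one joint.
    The bin count and speed range (0 to `top`) are arbitrary choices."""
    def speed_hist(p):
        speed = np.linalg.norm(np.diff(p[:, joint], axis=0), axis=-1) * FPS
        hist, _ = np.histogram(speed, bins=bins, range=(0.0, top))
        return hist / hist.sum()  # normalise to a discrete distribution
    h1, h2 = speed_hist(positions_a), speed_hist(positions_b)
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(h1 * h2))))
```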
The distance between speed histograms has also been used to evaluate gesture quality [29, 31], since well-trained models should produce motion with properties similar to those of the actor they were trained on. In particular, they should have a similar motion-speed profile for any given joint. To evaluate this similarity, we calculate speed-distribution histograms for all systems and compare them to the speed distribution of natural motion (condition N) by computing the Hellinger distance [40],

H(h⁽¹⁾, h⁽²⁾) = √(1 − Σᵢ √(h⁽¹⁾ᵢ · h⁽²⁾ᵢ)),

between the histograms h⁽¹⁾ and h⁽²⁾. Lower distance is better. For both of the objective evaluations above, the motion was first converted from joint angles to 3D coordinates. The code for the numerical evaluations has been made publicly available to enhance reproducibility; see github.com/Svito-zar/genea_numerical_evaluations.

6 RESULTS
This section describes and discusses the results of the subjective and objective evaluations. First, Section 6.1 introduces demographic and other information gathered from the recruited participants. Section 6.2 then reports the results of the subjective evaluation of the challenge conditions, which are also visualised in a number of figures. Section 6.3 complements the subjective findings with results on the objective measures introduced in Section 5.5. Section 6.4 provides a discussion of the results obtained in the challenge evaluation.
6.1 Evaluation participants
Each user study recruited 125 participants who passed all attention checks they encountered. In the human-likeness study, the average reported participant age was 31.5 years (standard deviation 10.7), with 66 men, 57 women, and 2 others. We asked participants on which continent they lived: 69 participants were from Europe, 1 from Africa, 48 from North America, 2 from South America, and 5 from Asia. In the appropriateness study, the average age was 31.1 years (standard deviation 11.7), with 60 men, 64 women, and 1 other. 78 participants reported residing in Europe, 1 in Africa, 39 in North America, 3 in Asia, and 4 in Oceania. Each study had 116 native and 9 non-native speakers of English.

23 test-takers in the human-likeness study and 40 test-takers in the appropriateness study did not pass all attention checks. These test-takers were not part of the 125 participants analysed. Scores from sliders used for attention checks were also omitted, leaving in total 8,375 and 9,625 ratings for analysis in the two respective studies. The median successful completion time for the main part of the study was 24 min for the human-likeness study and 27 min for the appropriateness study, with the shortest successful completion time being 12 min in both studies. These figures exclude reading instructions and answering the post-test questionnaire, unlike the timings in Section 5.4.
6.2 Subjective results
Summary statistics (sample median and sample mean) for all conditions in each of the two studies are shown in Table 2, together with 99% confidence intervals for the true median/mean. The confidence intervals were computed either using a Gaussian assumption for the means (i.e., with Student's t-distribution cdf, rounded outward to ensure sufficient coverage), or using order statistics for the median (which leverages the binomial distribution cdf, cf. [16]).

The rating distributions in the two studies are further visualised through box plots in Figure 2. The distributions are seen to be quite broad. This is common in MUSHRA-like evaluations, since the range of numbers not only reflects differences between systems, but also extraneous variation, e.g., between stimuli, in individual preferences, and in how critical different raters are in their judgements. In contrast, the plotted confidence intervals are quite narrow, due to the large number of ratings collected for each condition.

Despite the wide range of the distributions, the fact that the conditions were rated in parallel on each page enables pairwise statistical tests that factor out many of the above sources of variation. To analyse the significance of differences in sample median between conditions, we applied two-sided pairwise Wilcoxon signed-rank tests to all pairs of distinct conditions in each study. This closely follows the analysis methodology used throughout recent Blizzard Challenges. (Unlike Student's t-test, this test does not assume that rating differences follow a Gaussian distribution, which would likely be inappropriate: the box plots in Figure 2 show that the rating distributions are skewed and thus non-Gaussian.) For each condition pair, only pages on which both conditions were assigned valid scores were included in the analysis. (Recall that not all systems were scored on all pages, due to the limited number of sliders and the presence of attention checks.) This meant that every statistical significance test was based on at least 796 pairs of valid ratings in each of the studies. The p-values computed in the significance tests were adjusted for multiple comparisons using the Holm-Bonferroni method [18] (which is uniformly more powerful than regular Bonferroni correction) in each of the two studies. This statistical analysis found all but 4 out of 28 condition pairs to be significantly different in the human-likeness study, with the corresponding number being 7 out of 36 condition pairs in the appropriateness study, all at the level α = 0.01. Which conditions were rated significantly above or below which other conditions in the two studies is visualised in Figure 3.

Finally, we present two diagrams that bring the results of the two studies together. Figure 4, in particular, visualises the relative (partial) ordering between conditions implied by the results in Figure 3. Although there are similarities, the two orderings are meaningfully different. This, together with the results in [25], reinforces the conclusion that the two studies managed to disentangle aspects of perceived motion quality (human-likeness) from the perceived link between gesture and speech (appropriateness). Figure 5, meanwhile, visualises confidence regions for the median rating as boxes whose horizontal and vertical extents are given by the corresponding confidence intervals in Table 2.
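To illustrate the analysis described above, here is a minimal sketch of the pairwise significance testing with Holm-Bonferroni correction, plus an approximate order-statistic confidence interval for the median. The input format is our assumption, and the challenge's published analysis scripts remain the authoritative reference.

```python
# Sketch of the subjective-score analysis (not the challenge scripts).
from itertools import combinations
import numpy as np
from scipy.stats import binom, wilcoxon

def pairwise_tests(ratings, alpha=0.01):
    """ratings: dict mapping condition ID to an array of ratings, paired
    by page (equal length, aligned element-wise across conditions)."""
    pairs = list(combinations(sorted(ratings), 2))
    pvals = [wilcoxon(ratings[a], ratings[b]).pvalue for a, b in pairs]
    # Holm-Bonferroni: step down through sorted p-values, comparing the
    # k-th smallest (0-indexed) against alpha / (m - k).
    m, significant = len(pvals), {}
    for k, idx in enumerate(np.argsort(pvals)):
        if pvals[idx] > alpha / (m - k):
            break  # all remaining (larger) p-values fail as well
        significant[pairs[idx]] = pvals[idx]
    return significant

def median_ci(x, level=0.99):
    """Approximate distribution-free CI for the median via the binomial cdf."""
    x = np.sort(np.asarray(x))
    n, tail = len(x), (1.0 - level) / 2.0
    lo = int(binom.ppf(tail, n, 0.5))
    hi = min(n - 1, int(binom.ppf(1.0 - tail, n, 0.5)))
    return x[lo], x[hi]
```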
[Figure 2 panels: (a) human-likeness ratings; (b) appropriateness ratings.]
Figure 2: Box plots visualising the rating distributions in the two studies. Red bars are the median ratings (each with a 0.01-level confidence interval); yellow diamonds are mean ratings (also with a 0.01-level confidence interval). Box edges are at the 25th and 75th percentiles, while whiskers cover 95% of all ratings for each system. Conditions are ordered by descending sample median, which leads to a different order in each of the two plots.

[Figure 3 panels: (a) human-likeness study; (b) appropriateness study. Axes: significant preference for system y over system x.]
Figure 3: Significance of pairwise differences between conditions. White means that the condition listed on the y-axis was rated significantly above the condition on the x-axis, black means the opposite (y rated below x), and grey means no statistically significant difference at the 0.01 level after Holm-Bonferroni correction. Conditions are listed in the same order as in Figure 2, which is different for each of the two studies.

Once again, different systems are found to be good at different things. The numerical gap between natural and synthetic gesture motion is seen to be more pronounced in the case of appropriateness than for human-likeness.

6.3 Objective results
Results of the objective evaluations from Section 5.5 are given in Table 3. The first column contains the average jerk across all joints; we report the mean and standard deviation over the full 20 min of test motion. The second and third columns contain the Hellinger distance between speed histograms for the left and right wrists.

Different systems performed best (coming closest to the natural motion N) on different objective measures. For example, systems SA and SB were the closest to the ground truth in terms of jerk, but SE and SD were among the closest to the ground truth as measured by the Hellinger distance between speed histograms.

We also found that the objective metrics deviate from the subjective results. While SA showed the jerk most similar to natural motion, it was less preferred in the subjective evaluation. Similarly, SE showed the Hellinger distances most similar to N, but was not close to being the most preferred synthetic system in the subjective evaluation. Considering this disparity, we stress that objective evaluation of gesture motion is a complementary measure, and that subjective evaluation is much more important.
Figure 4: Partial ordering between conditions in the two studies. Each condition is an ellipse; overlapping or (in one case) coinciding ellipses signify that the corresponding conditions were not statistically significantly different in the evaluation. The diagram was inspired by [57], with colours adapted from [7]. There is no scale on the axis since the figure visualises ordinal information only.

Table 2: Summary statistics of user-study ratings for all conditions in the two studies, with 0.01-level confidence intervals. The human-likeness of M was not evaluated explicitly, since it uses the same motion clips as N.
[Table 2 body: median and mean ratings with confidence intervals for conditions N, M, BA, BT, and SA through SE; the numeric values were lost in extraction.]

6.4 Discussion
It is obvious that gesture generation is a difficult problem that is far from solved, seeing that no system came remotely close to the natural motion N. However, the fact that many submissions scored significantly better than the previously published baselines suggests that progress is being made. The numerical gap between natural motion and that synthesised by machine-learning models is greater in terms of appropriateness than human-likeness. This (along with the fact that no artificial system surpassed the speech-independent condition M) could indicate that appropriateness is a harder problem to solve. As one part of this, the available data may not be sufficiently rich to allow learning to generate appropriate gestures, especially semantically-meaningful gesticulation.

Previous studies suggest that motion quality (human-likeness) may influence gesture-appropriateness ratings in subjective evaluations [31, 61]. Our experiments only partly managed to separate these two aspects of gesture perception. On the one hand, we can observe in Figure 4 that different systems were good at different things: some scored better than others on human-likeness, but worse on appropriateness.
Table 3: Results from the objective evaluations. The Hellinger distance between natural and synthetic speed profiles was computed for the two wrist joints, since hand motion is of central importance for co-speech gestures.
[Table 3 body: average jerk (mean ± standard deviation) and left/right-wrist Hellinger distances per condition; apart from an average jerk of 151.52 for condition N, the numeric values were lost in extraction.]

7 IMPLICATIONS
In this section we discuss the implications of the challenge: what the challenge brings to the scientific community, the limitations of the challenge, and lessons learned from conducting it.
7.1 Benefits for the field
We have taken the first step in jointly benchmarking different gesture-generation systems on a common dataset and virtual avatar.
Figure 5: Confidence regions for the true median rating across both studies. The dotted black line is the identity, x = y. While the human-likeness (x-coordinate) of M was not evaluated directly, it is expected to be very close to that of N since it uses the same motion clips, and the horizontal extent of the confidence region for M was therefore copied from N.

The points below summarise some of the added value we see for the gesture-generation field:

(1) We have defined the first benchmark for evaluating gesture-generation models, consisting of a dataset of speech audio, aligned text transcriptions, and 3D motion, as well as train-test splits and an evaluation procedure. Future research can use these components to compare new models with previous ones in a consistent way.
(2) All the motion clips generated by the systems evaluated in the challenge are publicly available (zenodo.org/record/4080919), together with the rendering pipeline used (github.com/jonepatr/genea_visualizer). This enables easy comparisons with these systems in the future, since their motions can be used directly, without the need to reproduce the systems.
(3) All the subjective and objective scores for the challenge submissions, and the analysis scripts we used, are also available online (zenodo.org/record/4088250). This material could be used, e.g., to investigate human perception and to analyse the correlation between subjective perception and different objective measures (not only those in Section 5.5), to aid progress toward reliable and useful objective metrics for the field.
7.2 Limitations
Our crowdsourced evaluation had a few limitations. First, in measuring the appropriateness of gestures (i.e., the link between gestures and speech), semantic and rhythmic appropriateness were considered together, and there is no way to determine which aspect of appropriateness the participants rated. In addition, our appropriateness ratings were likely affected by motion quality to some extent, as discussed in Section 6.4, despite the fact that participants were instructed to disregard motion quality.

Second, the dataset used in the challenge was limited to a single English speaker in a monologue scenario. The role of gesticulation may be expected to differ between persons, languages, and speaking environments (e.g., dyadic conversation versus monologue), which this challenge did not explore. We believe the models and the challenge can be extended to other languages if proper datasets are available, as audio processing is essentially language-agnostic and pretrained word vectors are available for a multitude of languages [15].

A third limitation is that we considered only upper-body gestures, even though whole-body gestures (including posture, stepping motion and stance, facial expression, and hand motion) are also important in social interactions. Three teams stated that the most desirable extension of the challenge would be to include whole-body and/or facial gestures. Some evaluation participants also found the absence of facial and finger motion to be a limitation of the challenge.
7.3 Lessons learned
Conducting the gesture-generation challenge has highlighted several take-away messages and lessons learned:

• Being human-like does not mean being appropriate for the gestures of a virtual avatar. The challenge evaluation found that some systems performed better than others in terms of human-likeness but worse in terms of appropriateness, highlighting that one does not imply the other. Any evaluation or comparison of synthetic gestures should keep this distinction in mind.
• Providing carefully pre-processed data and good infrastructure (code for feature extraction, motion visualisation, baseline systems, etc.) enables challenge participants to focus on developing their systems, instead of solving unrelated issues.
• A MUSHRA-like evaluation scheme can successfully benchmark numerous gesture-generation models in parallel.
• There is a need for future challenges, since a large gap remains between natural and synthesised motion, and variation across speakers, languages, and scenarios has yet to be explored in a challenge format.

We additionally think the following points are worth considering for anyone running a similar challenge in the future:

• Include some of the best systems from the current challenge to provide continuity and to assess whether the field keeps moving forward. This is facilitated by the fact that the baselines and several challenge entries have made their code publicly available.
• Evaluate gesture appropriateness in a more granular and precise way, for example with separate questions and studies for semantic and rhythmic appropriateness, and by also evaluating contrasts between matched and mismatched motion for all challenge entries. Since the link between speech and motion is important yet difficult to evaluate, challenges and their data may be used to explore how to better measure gesture appropriateness.
• Use a different speech-gesture dataset. As previously discussed, the dataset used in this challenge has limitations: it has already been used extensively, and it contains just a single actor speaking in isolation, while gesture-generation systems are usually intended to be used in an interaction. More data may be necessary to better learn semantically meaningful gestures.
8 CONCLUSION
We have hosted the GENEA Challenge 2020 to assess the state of the art in data-driven co-speech gesture generation. The central design goal of the challenge was to enable direct comparison between many different gesture-generation methods while controlling for factors of variation external to the models, namely data, embodiment, and evaluation methodology. Our results suggest that the field is advancing measurably, since most submissions performed significantly better than baselines published the year before. Different systems were also found to be good at different things on the two scales (human-likeness and appropriateness) that we assessed. However, a substantial gap remains between synthetic and natural gesture motion, indicating that gesture generation is far from a solved problem.

We believe that the standardised challenge training and test sets (of time-aligned audio, text, and gestures), the visualisation code, and the associated library of rated motion clips from the challenge will be useful for future benchmarking and research in gesture generation. Furthermore, we think challenges like the one described here are poised to play an important role in identifying key factors for convincing gesture generation in practice, and in driving and validating future progress toward the goal of endowing embodied agents with natural gesture motion.
ACKNOWLEDGMENTS
The GENEA Challenge 2020 used the Trinity Speech-Gesture Dataset collected by Ylva Ferstl and Rachel McDonnell. The challenge dataset was further processed by Taras Kucherenko, Simon Alexanderson, Jonatan Lindgren, and Jonas Beskow at KTH Royal Institute of Technology and by Pieter Wolfert at Ghent University.

The authors wish to thank Simon King for sharing his insights and experiences from running the Blizzard Challenge in speech synthesis. We are also grateful to Ulysses Bernardet for input and to André Tiago Abelho Pereira, Bram Willemsen, Dmytro Kalpakchi, Jonas Beskow, Kevin El Haddad, and Ulme Wennberg for feedback on the paper preprint.

This research was partially supported by the Swedish Foundation for Strategic Research contract no. RIT15-0107 (EACare), by IITP grant no. 2017-0-00162 (Development of Human-care Robot Technology for Aging Society) funded by the Korean government (MSIT), by Flemish Research Foundation grant no. 1S95020N, and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
REFERENCES
[1] Simon Alexanderson. 2020. The StyleGestures entry to the GENEA Challenge 2020. In Proc. GENEA Workshop. https://doi.org/10.5281/zenodo.4088600
[2] Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-controllable speech-driven gesture synthesis using normalising flows. Comput. Graph. Forum 39, 2 (2020), 487–496.
[3] Kirsten Bergmann and Stefan Kopp. 2009. GNetIc – Using Bayesian decision networks for iconic gesture generation. In Proc. IVA. 76–89.
[4] Kirsten Bergmann, Stefan Kopp, and Friederike Eyssel. 2010. Individualized gesturing outperforms average gesturing – evaluating gesture production in virtual humans. In Proc. IVA. 104–117.
[5] Alan W. Black and Keiichi Tokuda. 2005. The Blizzard Challenge – 2005: Evaluating corpus-based speech synthesis on common datasets. In Proc. Interspeech. 77–80.
[6] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5 (2017), 135–146.
[7] [Authors and title lost in extraction.] In Proc. SPIE, Vol. 1077. 322–332.
[8] Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. 2001. BEAT: The behavior expression animation toolkit. In Proc. SIGGRAPH. 477–486.
[9] Marcela Charfuelan and Ingmar Steiner. 2013. Expressive speech synthesis in MARY TTS using audiobook data and EmotionML. In Proc. Interspeech. 1564–1568.
[10] Chung-Cheng Chiu, Louis-Philippe Morency, and Stacy Marsella. 2015. Predicting co-verbal gestures: A deep and temporal modeling approach. In Proc. IVA. 152–166.
[11] Ylva Ferstl and Rachel McDonnell. 2018. Investigating the use of recurrent motion modelling for speech gesture generation. In Proc. IVA. 93–98.
[12] Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning individual styles of conversational gesture. In Proc. CVPR. 3497–3506.
[13] Avashna Govender, Anita E. Wagner, and Simon King. 2019. Using pupil dilation to measure cognitive load when listening to text-to-speech in quiet and in noise. In Proc. Interspeech, Vol. 20. 1551–1555.
[14] F. Sebastian Grassia. 1998. Practical parameterization of rotations using the exponential map. J. Graph. Tools 3, 3 (1998), 29–48.
[15] Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proc. LREC. 3483–3487.
[16] Gerald J. Hahn and William Q. Meeker. 1991. Statistical Intervals: A Guide for Practitioners. Vol. 92. John Wiley & Sons.
[17] Judith Holler, Kobin H. Kendrick, and Stephen C. Levinson. 2018. Processing language in face-to-face conversation: Questions with gestures get faster responses. Psychon. B. Rev. 25, 5 (2018), 1900–1908.
[18] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, 2 (1979), 65–70.
[19] International Telecommunication Union, Radiocommunication Sector. 2015. Method for the subjective assessment of intermediate quality levels of audio systems. Recommendation ITU-R BS.1534-3.
[20] International Telecommunication Union, Telecommunication Standardization Sector. 1996. Methods for subjective determination of transmission quality. Recommendation ITU-T P.800.
[21] Carlos T. Ishi, Daichi Machiyashiki, Ryusuke Mikata, and Hiroshi Ishiguro. 2018. A speech-driven hand gesture generation method and evaluation in android robots. IEEE Robot. Autom. Lett. 3, 4 (2018), 3757–3764.
[22] Ryo Ishii, Taichi Katayama, Ryuichiro Higashinaka, and Junji Tomita. 2018. Generating body motions using spoken language in dialogue. In Proc. IVA. 87–92.
[23] Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, and Jonas Beskow. 2020. Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings. In Proc. IVA. Article 31, 8 pages.
[24] Patrik Jonell, Taras Kucherenko, Ilaria Torre, and Jonas Beskow. 2020. Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents. In Proc. IVA. Article 30, 8 pages.
[25] Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, and Gustav Eje Henter. 2021. HEMVIP: Human evaluation of multiple videos in parallel. arXiv:2101.11898
[26] Simon King. 2014. Measuring a decade of progress in text-to-speech. Loquens 1, 1 (2014).
[27] [Authors lost in extraction; the FineMotion team's entry paper.] 2020. In Proc. GENEA Workshop. https://doi.org/10.5281/zenodo.4088609
[28] Taras Kucherenko. 2018. Data driven non-verbal behavior generation for humanoid robots. In Proc. ICMI Doctoral Consortium. 520–523.
[29] Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellström. 2019. Analyzing input and output representations for speech-driven gesture generation. In Proc. IVA. 97–104.
[30] Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gustav Eje Henter, and Hedvig Kjellström. 2021. Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation. Int. J. Hum. Comput. Interact. (2021). https://doi.org/10.1080/10447318.2021.1883883
[31] Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, and Hedvig Kjellström. 2020. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proc. ICMI.
[32] Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, and Gustav Eje Henter. 2020. The GENEA Challenge 2020: Benchmarking gesture-generation systems on common data. Preliminary report presented at the GENEA Workshop.
[33] [Entry lost in extraction.]
[34] Sergey Levine, Philipp Krähenbühl, Sebastian Thrun, and Vladlen Koltun. 2010. Gesture controllers. ACM Trans. Graph. 29, 4, Article 124 (2010), 11 pages.
[35] JinHong Lu, TianHang Liu, ShuZhuang Xu, and Hiroshi Shimodaira. 2020. Double-DCCCAE: Estimation of sequential body motion using waveform – AlltheSmooth. In Proc. GENEA Workshop. https://doi.org/10.5281/zenodo.4088376
[36] David McNeill. 1992. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press.
[37] Gabriel Mittag and Sebastian Möller. 2020. Deep learning based assessment of synthetic speech naturalness. In Proc. Interspeech. 1748–1752.
[38] Sebastian Möller, Florian Hinterleitner, Tiago H. Falk, and Tim Polzehl. 2010. Comparison of approaches for instrumentally predicting the quality of text-to-speech systems. In Proc. Interspeech. 1325–1328.
[39] Pietro Morasso. 1981. Spatial control of arm movements. Exp. Brain Res. 42, 2 (1981), 223–227.
[40] Mikhail S. Nikulin. 2001. Hellinger distance. In Encyclopedia of Mathematics. Springer. http://encyclopediaofmath.org/index.php?title=Hellinger_distance Accessed: 2021-01-31.
[41] NTIRE Challenge organisers. 2020. NTIRE 2020: Perceptual extreme super-resolution challenge. https://competitions.codalab.org/competitions/22217 Accessed: 2021-01-18.
[42] Kunkun Pang, Taku Komura, Hanbyul Joo, and Takaaki Shiratori. 2020. CGVU: Semantics-guided 3D body gesture synthesis. In Proc. GENEA Workshop. https://doi.org/10.5281/zenodo.4090879
[43] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proc. EMNLP. 1532–1543.
[44] Manuel Sam Ribeiro, Junichi Yamagishi, and Robert A. J. Clark. 2015. A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis. In Proc. Interspeech. 1586–1590.
[45] Najmeh Sadoughi and Carlos Busso. 2019. Speech-driven animation with meaningful behaviors. Speech Commun. 110 (2019), 90–100.
[46] Maha Salem, Friederike Eyssel, Katharina Rohlfing, Stefan Kopp, and Frank Joublin. 2013. To err is human(-like): Effects of robot gesture on perceived anthropomorphism and likability. Int. J. Soc. Robot. 5, 3 (2013), 313–323.
[47] Maha Salem, Stefan Kopp, Ipke Wachsmuth, Katharina Rohlfing, and Frank Joublin. 2012. Generation and evaluation of communicative robot gesture. Int. J. Soc. Robot. 4, 2 (2012), 201–217.
[48] Maha Salem, Katharina Rohlfing, Stefan Kopp, and Frank Joublin. 2011. A friendly gesture: Investigating the effect of multimodal robot behavior in human-robot interaction. In Proc. Ro-MAN. 247–252.
[49] Giampiero Salvi, Jonas Beskow, Samer Al Moubayed, and Björn Granström. 2009. SynFace – Speech-driven facial animation for virtual speech-reading support. EURASIP J. Audio Spee. (2009), Article 191940, 10 pages.
[50] Ibon Saratxaga, Jon Sanchez, Zhizheng Wu, Inma Hernaez, and Eva Navas. 2016. Synthetic speech detection using phase information. Speech Commun. 81 (2016), 30–41.
[51] Abraham Savitzky and Marcel J. E. Golay. 1964. Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 8 (1964), 1627–1639.
[52] Éva Székely, João P. Cabral, Mohamed Abou-Zleikha, Peter Cahill, and Julie Carson-Berndsen. 2012. Evaluating expressive speech synthesis from audiobooks in conversational phrases. In Proc. LREC. 3335–3339.
[53] Ausdang Thangthai, Kwanchiva Thangthai, Arnon Namsanit, Sumonmas Thatphithakkul, and Sittipong Saychum. 2020. The Nectec gesture generation system entry to the GENEA Challenge 2020. In Proc. GENEA Workshop.
[54] [Entry lost in extraction; the CLIC challenge on learned image compression.]
[55] Yoji Uno, Mitsuo Kawato, and Ryoji Suzuki. 1989. Formation and control of optimal trajectory in human multijoint arm movement. Biol. Cybern. 61, 2 (1989), 89–101.
[56] Petra Wagner, Zofia Malisz, and Stefan Kopp. 2014. Gesture and speech in interaction: An overview. Speech Commun. 57 (2014), 209–232.
[57] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi. 2016. Analysis of the Voice Conversion Challenge 2016 evaluation results. In Proc. Interspeech. 1637–1641.
[58] Pieter Wolfert, Taras Kucherenko, Hedvig Kjellström, and Tony Belpaeme. 2019. Should beat gestures be learned or designed? A benchmarking user study. In Proc. ICDL-EPIROB Workshop. 4 pages. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-255998
[59] Pieter Wolfert, Nicole Robinson, and Tony Belpaeme. 2021. A review of evaluation practices of gesture generation in embodied conversational agents. arXiv:2101.03769
[60] Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2020. Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Trans. Graph. 39, 6, Article 222 (2020), 16 pages.
[61] Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2019. Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In Proc. ICRA. 4303–4309.
[62] Takenori Yoshimura, Gustav Eje Henter, Oliver Watts, Mirjam Wester, Junichi Yamagishi, and Keiichi Tokuda. 2016. A hierarchical predictor of synthetic speech naturalness using neural networks. In Proc. Interspeech.