Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity

YOUNGWOO YOON, ETRI, KAIST
BOK CHA, University of Science and Technology, ETRI
JOO-HAENG LEE, ETRI, University of Science and Technology
MINSU JANG, ETRI
JAEYEON LEE, ETRI
JAEHONG KIM, ETRI
GEEHYUK LEE, KAIST
Fig. 1. Overview of the proposed gesture generation model that considers the trimodality of speech text, audio, and speaker identity. The model is trained on online speech videos demonstrating co-speech gestures. At the synthesis phase, we can manipulate gesture styles by sampling a style vector from the learned style embedding space.
For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human–agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match with speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models. We further confirm that our model is able to work with synthesized audio in a scenario where contexts are constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space that is learned from videos of various speakers. All the code and data is available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.

CCS Concepts: • Computing methodologies → Animation; • Supervised learning by regression;
Additional Key Words and Phrases: nonverbal behavior, co-speech gesture, neural generative model, multimodality, evaluation of a generative model
ACM Reference format:
Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2020. Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity. ACM Trans. Graph. 39, 6, Article 222 (December 2020), 16 pages. DOI: 10.1145/3414685.3417838
The continued development of graphics and robotics technology has prompted the development of artificial embodied agents, such as virtual avatars and social robots, as a popular interaction medium. One of the merits of the embodied agent is its nonverbal behavior, including facial expressions, hand gestures, and body gestures. In the present paper, we focus on upper-body gestures that occur with speech. Such co-speech gestures are a representative example of nonverbal communication between people. Appropriate use of gestures is helpful for understanding speech (McNeill 1992) and increases persuasion and credibility (Burgoon et al. 1990). Gestures are important not only in human–human interaction, but also in human–machine interaction. Gestures performed by artificial agents help a listener to concentrate and understand utterances (Bremner et al. 2011) and improve the intimacy between humans and agents (Wilson et al. 2017).

Interactive artificial agents, such as game characters, virtual avatars, and social robots, need to generate gestures in real time in accord with their speech. Automatically generating co-speech gestures is a difficult problem because machines must be able to understand speech, gestures, and the relationship between them. Two representative gesture generation methods are rule-based and data-driven approaches (Kipp 2005; Kopp et al. 2006). The rule-based approach, as the name suggests, defines various rules mapping speech to gestures; it requires considerable human effort to define the rules, but it is widely used in commercial robots because these models are relatively simple and intuitive. The data-driven approach learns gesticulation skills from human demonstrations. This approach requires more complex models and large amounts of data, but it does not require human effort in designing rules. As large gesture datasets are becoming more available, research on data-driven approaches is increasing, e.g., (Chiu et al. 2015; Ginosar et al. 2019; Huang and Mutlu 2014; Kipp 2005; Yoon et al. 2019).

One data-driven approach, called the end-to-end method (Ginosar et al. 2019; Yoon et al. 2019), is unlike others in that it uses raw gesture data without an intermediate representation such as predefined unit gestures. Such a less restrictive representation increases the method's expressive capacity, enabling it to generate more natural gestures. Previous studies have successfully demonstrated end-to-end gesture generation methods. However, they were limited by their consideration of only a single modality, either speech audio or text. Since human gestures are associated with various factors, such as speech content, speech audio, interlocutor interaction, individual personality, and the surrounding environment, generating gestures from a single speech modality can produce a very limited model. In the study of human gestures (McNeill 1992), researchers have defined four categories, called iconic, metaphoric, deictic, and beat gestures, which are related to different contexts. Iconic gestures illustrate physical actions or properties (e.g., raising one's hands while saying "tall") and metaphoric gestures describe abstract concepts (e.g., moving one's hands up and down to depict a wall while saying "constraint"). Both iconic and metaphoric gestures are highly related to the speech lexicon.
Deictic gestures are indicative motions that point to a specific target or space, and are related to both the speech lexicon and the spatial context in which the gesture is made. Beat gestures are rhythmic movements that are closely related to the speech audio. In addition, even with the same speech and in the same surrounding environment, each person makes different gestures every time due to the inter-person and intra-person variability of human gestures, and the inter-person variability may be attributed to individual personality. Various modalities related to speech should therefore be considered in order to generate more meaningful and human-like gestures.

In the present study, we propose an end-to-end gesture generation model that uses the multimodal context of text for speech content, audio for speech rhythm, and speaker identity (ID) for style variations. To integrate these multiple modalities, a temporally synchronized encoder–decoder architecture is devised based on the property of temporal synchrony found between speech and gestures in human gesture studies (Chu and Hagoort 2014; McNeill 2008). We experimentally confirm that each modality is effective. In particular, a style embedding space is learned from speaker IDs to reflect inter-person variability, so we can create different styles of gestures for the same speech by sampling different points in the style embedding space. Figure 1 provides an overview of the proposed gesture generation model and its training. The model is trained on a dataset derived from online videos exhibiting speech gestures, with a training objective to generate human-like and diverse gestures. Our task is to develop a general gesture generator, a model that is supposed to generate convincing gestures for previously unseen speech.

A major hurdle in gesture generation studies is determining how to evaluate results. There is no single ground truth in gesture generation and well-defined evaluation methods are not yet available. Subjective human evaluation is the most reasonable method, but it is not cost effective and it is difficult to reproduce results. Some studies have used the mean absolute error (MAE) of the positions of body joints between human gesture examples and generated gestures for the same speech (Ginosar et al. 2019; Joo et al. 2019). The MAE evaluation method is objective and reproducible, though it is hard to ascertain to what extent the MAE between joints correlates with perceived gesture quality. In the present paper, we apply the Fréchet inception distance (FID) concept proposed in image generation research (Heusel et al. 2017) to our problem of gesture generation. FID compares fitted distributions on a latent image feature space between the sets of real and generated images. We introduce the Fréchet gesture distance (FGD), which compares samples on a latent gesture feature space. With synthetic noisy data and comparisons to human judgements, we validate that the proposed metric is more perceptually plausible than computing the MAE between gestures.

Our contributions can be summarized as follows:
• A new gesture generation model using a trimodal context of speech text, audio, and speaker identity. To the best of our knowledge, this is the first end-to-end approach using trimodality to generate co-speech gestures.
• The proposal and validation of a new objective evaluation metric for gesture generation models.
• Extensive experiments to verify the usability of the proposed model. We show style manipulations with the trained style embedding space, the model's response to altered speech text, and the gestures' incorporation with synthesized audio.
The remainder of this paper is organized as follows. We first introduce related research (Section 2), then describe the proposed model (Section 3) and its training in detail (Section 4). Section 5 introduces a metric for evaluating gesture generative models and Section 6 describes human evaluation to validate the proposed metric. Section 7 presents qualitative and quantitative results. Finally, Section 8 concludes the paper with a discussion of the limitations and future direction of the present research.
We first review automatic co-speech gesture generation methods for artificial agents. Next, we introduce previous data-driven gesture generation approaches. Related work discussing gesture styles, multimodality, and evaluation methods is also introduced.
Co-speech Gesture Generation for Artificial Agents.
Motion capture and retargeting human motions to artificial agents are widely used to generate motions, especially in commercial systems, because of the high-quality motion obtained from human actors (Menache 2000). Nonverbal behavior can also be generated by retargeting human motion (Kim and Lee 2020). However, the motion capture method has a critical limitation: the motion should be recorded beforehand. Therefore, the motion capture method can only be used in movies or games that have specified scripts. Interactive applications, in which the agents interact with humans with various speech utterances in real time, mostly use automatic gesture generation methods. The typical automatic generation method is rule-based generation (Cassell et al. 2004; Kopp et al. 2006; Marsella et al. 2013). For example, the robots NAO and Pepper (Softbank 2018) have a predefined set of unit gestures and rules that connect speech words and unit gestures. This rule-based method requires human effort to design the unit gestures and hundreds of mapping rules. Research into data-driven methods has aimed to reduce the human effort required for rule generation; these methods find gesture generation rules in data using machine learning techniques. Probabilistic modeling for speech–gesture mapping has also been studied (Huang and Mutlu 2014; Kipp 2005; Levine et al. 2010), and a neural classification model selecting a proper gesture for a given speech context (Chiu et al. 2015) was also proposed. The review paper (Wagner et al. 2014) provides a comprehensive summary of gesture generation research and rule-based approaches.
End-to-end Gesture Generation Methods.
Gesture generation is a complex problem that requires understanding speech, gestures, and their relationships. To reduce the complexity of this task, previous data-driven models have divided speech into discrete topics (Sadoughi and Busso 2019) or represented gestures as predefined unit gestures (Huang and Mutlu 2014; Kipp 2005; Levine et al. 2010). However, with recent advancements in deep learning, an end-to-end approach using raw gesture data is possible. There are studies using the end-to-end approach (Ferstl et al. 2019; Ginosar et al. 2019; Kucherenko et al. 2019, 2020; Yoon et al. 2019) that have formulated gesture generation as a regression problem rather than a classification problem. This continuous gesture generation does not require crafting unit gestures and their rules, and it also removes the restriction that gesture expressions must be selected from predetermined unit gestures. One study used an attentional Seq2Seq network that generates a sequence of upper body poses from speech text (Yoon et al. 2019). The network consists of a text encoder that processes speech text and a gesture decoder that generates a pose sequence. Other studies generated gestures from speech audio (Ferstl et al. 2019; Ginosar et al. 2019; Kucherenko et al. 2019). These audio-based generators are also based on neural architectures generating a sequence of poses, and some studies used an adversarial loss to guide generated gestures to become similar to actual human gestures. The main difference between the previous models is the use of different speech modalities. Both semantics and acoustics are important for generating co-speech gestures (McNeill 1992), so, in this paper, we propose a model that uses multimodal speech information, audio and text together. Note that there is concurrent work considering both audio and text information, but it trained and validated the generative model on a limited dataset of a single actor (Kucherenko et al. 2020).
Learning Styles of Gestures.
People make different gestures even when they say the same words (Hostetter and Potthoff 2012). Similarly, artificial agents must also learn different styles of gestures. The agents should be able to make extrovert- or introvert-style gestures according to their emotional states, interaction history, user preferences, and other factors. Stylized gestures also give the agents a unique identity, similar to appearances and voices. Previous studies have attempted to generate such stylized gestures (Ginosar et al. 2019; Levine et al. 2010; Neff et al. 2008). In these studies, generative models were trained separately for each speaker or style. This approach is an obvious way of learning individual styles, but it requires a substantial amount of training data for each individual style. Because of this limitation, only three and ten individual styles were trained in (Levine et al. 2010) and (Ginosar et al. 2019), respectively. In the present study, we aim to build a style embedding space, so that we can manipulate styles by sampling the space into which different styles are embedded, rather than replicating a particular style as the previous papers did. Another study proposed more detailed style manipulation by using control signals of hand position, motion speed, or moving space (Alexanderson et al. 2020).
Processing Multimodal Data.
The present study considers four modalities: text, audio, gesture motion, and speaker identity. Generally, multimodal data processing includes the representation of each modality, alignment between modalities, and translation between modalities (Baltrušaitis et al. 2018). There are two approaches to representation: one is that all modalities share the same representation, and the other is that modalities are represented separately and later alignment or translation stages integrate them. We can find both representation approaches related to gesture generation. A study by (Ahuja and Morency 2019) represented both human motion and descriptive text as vectors in the same embedding space. In other studies, different representations are used for different modalities (Roddy et al. 2018; Sadoughi and Busso 2019). We use separate representations, owing to the difficulty of learning a cross-modal representation for co-speech gestures arising from the weak and ambiguous relationship between speech and gestures.

Alignment between modalities is also an important factor for time-series data. In (Ginosar et al. 2019), a feature vector encoding input speech was passed to a decoder to generate gestures, and the
alignment between the modalities is not explicitly handled. A neural encoder and decoder implicitly processed the alignment as well as the translation from speech to gesture. In (Yoon et al. 2019), a similar encoder–decoder architecture was used, but they guided the model to learn sequential alignment more explicitly by incorporating an attention mechanism (Bahdanau et al. 2015). In (Kucherenko et al. 2020), speech audio and text were aligned, but not with gestures. Our model uses explicitly aligned speech and gesture because speech and gesture are synchronized temporally (Chu and Hagoort 2014), allowing the network to concentrate on the translation from input speech to gestures.
Evaluating Generative Models.
Recently, as research into generative models has expanded, interest in evaluating generative models has increased. In generation problems such as speech synthesis, image generation, and conversational text generation, human evaluation is the most plausible evaluation method because there is no clear ground truth to compare with. However, the results of human evaluation cannot easily be reproduced. A reliable computational evaluation metric is necessary for reproducible comparisons with state-of-the-art models and would accelerate research. Previous studies have measured gesture differences between generated and human gestures (Ginosar et al. 2019; Joo et al. 2019), though this method is limited because pose-level differences do not measure the perceptual quality of the generated gestures. Some studies have used other metrics to evaluate human motion, for example, the motion statistics of jerk and acceleration (Kucherenko et al. 2019) and Laban parameters from a study of choreography (Aristidou et al. 2015). However, the aforementioned metrics compute distances for each sample, so they cannot measure how diversified the generated results are, which is crucial in generation problems. In the image generation problem, the inception score (Salimans et al. 2016) and FID (Heusel et al. 2017) have recently become de facto evaluation metrics because they can measure the diversity of generated samples as well as their quality, and this concept was successfully applied to other generation problems (Kilgour et al. 2018; Unterthiner et al. 2019). In this study, we have applied the concept of FID to the gesture generation problem to measure both perceptual quality and diversity.
Gesture generation in this paper is a translation problem that generates co-speech gestures from a given speech context. Our goal is to generate gestures that are human-like and match well with any given speech. We propose a neural network architecture consisting of three encoders for the input speech modalities and a decoder for gesture generation. Figure 2 shows the overall architecture. Three modalities—text, audio, and speaker identity (ID)—are encoded with different encoder networks and transferred to the gesture generator. A gesture is represented as a sequence of human poses, and the generator, which is a recurrent neural network, generates poses frame by frame from an input sequence of feature vectors containing the encoded speech context. Speech and gestures are temporally synchronized (Chu and Hagoort 2014; McNeill 2008), so we configured the generator to use the part of the speech text and audio near
the current time step instead of the whole speech context. Gesture style does not change in the short term, so the same speaker ID is used throughout a synthesis. In addition, we used seed poses for the first few frames for better continuity between consecutive syntheses. See Appendix A for figures of the detailed architecture.

Fig. 2. The architecture of the proposed gesture generation model. The generator generates a sequence of human poses from a sequence of context feature vectors that contain the encoded features of speech text, speech audio, and speaker identity (ID). The features of text, audio, and speaker ID are depicted as red, blue, and green arrows, respectively. The seed poses are also used to ensure continuity between consecutive syntheses. The discriminator is a binary classifier that distinguishes between real human gestures and generated gestures. The number in parentheses indicates the data dimension. The poses are in 27 dimensions since there are nine directional vectors in 3D coordinates.
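To make the data flow of Figure 2 concrete, the following is a minimal PyTorch-style sketch of the generator interface. The module names (TrimodalGestureGenerator and the three encoder arguments), the hidden size, and the four-layer choice are our own illustrative assumptions; only the feature dimensions (32-D text and audio features, 8-D style feature, 27-D poses, 34 frames) follow the figure.

```python
import torch
import torch.nn as nn

class TrimodalGestureGenerator(nn.Module):
    """Sketch of the generator in Figure 2: three encoders feed a recurrent pose decoder."""
    def __init__(self, text_enc, audio_enc, style_enc, pose_dim=27, hidden=256):
        super().__init__()
        self.text_enc = text_enc      # padded words (t steps) -> (batch, t, 32)
        self.audio_enc = audio_enc    # raw waveform           -> (batch, t, 32)
        self.style_enc = style_enc    # speaker one-hot        -> (batch, 8)
        # per-frame input: text(32) + audio(32) + style(8) + seed pose(27) + seed-present bit(1)
        self.gru = nn.GRU(32 + 32 + 8 + pose_dim + 1, hidden,
                          num_layers=4, batch_first=True, bidirectional=True)
        self.out = nn.Linear(hidden * 2, pose_dim)

    def forward(self, words, audio, speaker_id, seed_poses):
        b, t = audio.size(0), 34                      # 34 frames per synthesis (4 seed + 30 new)
        f_text = self.text_enc(words)                 # (b, t, 32)
        f_audio = self.audio_enc(audio)               # (b, t, 32)
        f_style = self.style_enc(speaker_id)          # (b, 8), shared across all time steps
        f_style = f_style.unsqueeze(1).expand(-1, t, -1)
        # seed poses fill the first frames; a flag bit marks where a seed pose is present
        pose_in = torch.zeros(b, t, seed_poses.size(-1), device=audio.device)
        flag = torch.zeros(b, t, 1, device=audio.device)
        n_seed = seed_poses.size(1)
        pose_in[:, :n_seed] = seed_poses
        flag[:, :n_seed] = 1.0
        x = torch.cat([f_text, f_audio, f_style, pose_in, flag], dim=-1)
        h, _ = self.gru(x)
        return self.out(h)                            # (b, t, 27) directional-vector poses
```

A single call produces one 34-frame chunk; longer speech would be handled by sliding this window and feeding the last four generated poses back as the next seed poses, as described in Section 4.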
This section describes how the speech modalities of text, audio, and speaker ID are represented, along with the details of the encoder networks. We have four modalities, including the output gesture, in different time resolutions. We first ensure that all input data have the same time resolution as the output gestures, so all modalities share the same time steps and the proposed sequential model (Figure 2) can process speech input and generate poses frame by frame.

The speech text is a word sequence, with the number of words varying according to speech speed. We insert padding tokens (⋄) into the word sequence to make a padded word sequence (word_1, word_2, …, word_t) that is the same length as the gestures. Here, t is the number of poses in a synthesis (fixed as 34 throughout the paper; see Section 4). We assume the exact utterance time of each word is known, so the padding tokens are inserted to make the words temporally match the gestures. For instance, for the speech text "I love you", if there were a short pause between "I" and "love", then the padded word sequence would be "I ⋄ ⋄ love you" when t is 5. All words in the padded word sequence are then transformed into word vectors in 300 dimensions via a word embedding layer. Next, these word vectors are encoded by a temporal convolutional network (TCN) (Bai et al. 2018) to make 32-D feature vectors for the speech text modality (f^text_1, f^text_2, …, f^text_t). A TCN processes sequential data through convolutional operations and has shown competitive results over recurrent neural networks in diverse problems (Bai et al. 2018). In this paper, we used a four-layered TCN, where each f^text_i has a receptive field of 16. Thus, f^text_i encodes the 16 padded words around time step i. For our training dataset, the average and the largest number of non-padding words in this receptive field were 3.9 and 16, respectively.

We used FastText (Bojanowski et al. 2017), a pretrained word embedding, and updated these embeddings during training. There was a concern that word embeddings pretrained by filling a missing word in a sentence (Mikolov et al. 2013) may not be suitable for gesture generation. For instance, if we query words that are close to "large", then "small" appears in the top-3 list in both GloVe (Pennington et al. 2014) and FastText (Bojanowski et al. 2017) even though they have opposite meanings. This problem with pretrained word embeddings has also been raised in text-based sentiment analysis, where the sentiment of words is important (Fu et al. 2018). We tested three different settings: 1) pretrained embeddings without weight updating, 2) pretrained embeddings with fine-tuned weights, and 3) learning word embeddings from scratch. In our problem, using pretrained embeddings with fine-tuning was the most successful. FastText (Bojanowski et al. 2017) was favored over GloVe (Pennington et al. 2014) since FastText uses subword information, so it gives accurate representations for unseen words.

For the speech audio modality, a raw audio waveform goes through cascaded one-dimensional (1D) convolutional layers to generate a sequence of 32-D feature vectors (f^audio_1, f^audio_2, …, f^audio_t). The audio sampling frequency is fixed, so we adjusted the kernel sizes, strides, and padding in the convolutional layers to obtain equally many audio feature vectors as there were output motion frames.
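As an illustration of how the two encoders above keep text and audio aligned with the output frames, here is a rough sketch. The helper pad_words_to_frames, the AudioEncoderConv layer hyperparameters, and the PAD token string are our assumptions for illustration; the paper only specifies that the convolution strides and padding were chosen so the number of audio features matches the number of motion frames.

```python
import torch
import torch.nn as nn

PAD = '<PAD>'  # stands in for the ⋄ padding token

def pad_words_to_frames(words, word_onsets, clip_start, clip_end, n_frames=34):
    """Place each word at the frame closest to its onset time; all other frames get PAD."""
    seq = [PAD] * n_frames
    frame_len = (clip_end - clip_start) / n_frames
    for w, onset in zip(words, word_onsets):
        idx = int((onset - clip_start) / frame_len)
        if 0 <= idx < n_frames:
            seq[idx] = w
    return seq

class AudioEncoderConv(nn.Module):
    """Cascaded 1D convolutions turning a raw waveform into roughly one 32-D feature per
    output motion frame. Kernel sizes and strides here are illustrative; in practice they
    (or a final interpolation) must be tuned so the feature count matches the frame count."""
    def __init__(self, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=5), nn.LeakyReLU(0.3),
            nn.Conv1d(16, 32, kernel_size=15, stride=6), nn.LeakyReLU(0.3),
            nn.Conv1d(32, 64, kernel_size=15, stride=6), nn.LeakyReLU(0.3),
            nn.Conv1d(64, out_dim, kernel_size=15, stride=5),
        )

    def forward(self, wav):                 # wav: (batch, samples)
        feat = self.net(wav.unsqueeze(1))   # (batch, 32, ~t)
        return feat.transpose(1, 2)         # (batch, ~t, 32)
```

The padded word sequence would then be mapped through the 300-D FastText embedding layer before entering the TCN.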
In our experiments, each feature vector had a receptive field of about a quarter of a second. This quarter-second receptive field may not be large enough to cover occasional asynchrony between speech and gesture (the standard deviation of the temporal differences is about half a second according to (Bergmann et al. 2011)), but our use of a bidirectional GRU in the gesture generator, which sends information forwards and backwards, can compensate for the asynchrony.

The model also uses speaker IDs to learn a style embedding space. Human gestures are not the same even for the same speech. We utilize the speaker IDs to reflect the characteristics of each speaker in the dataset, and we call this individuality 'style' in the present paper. Note that our purpose is to build an embedding space capturing different styles, not to replicate the gestures of each speaker. The speaker IDs are represented as one-hot vectors where only the element of the selected speaker is nonzero. A set of fully connected layers maps a speaker ID to a style embedding space of much smaller dimension (8 in the present study). To make the style embedding space more interpretable, variational inference (Kingma and Welling 2014; Rezende et al. 2014), which uses a probabilistic sampling process, is used. The same feature vector f^style on the style embedding space is used for all time steps in a synthesis.

The generator G(·) takes the encoded features as input and generates gestures. The gesture is a sequence of human poses p_i consisting of 10 upper body joints (spine, head, nose, neck, L/R shoulders, L/R elbows, and L/R wrists). All poses were spine-centered. When we train the model, we represent each pose as directional vectors that represent the relative positions of the child joints from the parent joints. There are nine directional vectors for spine–neck, neck–nose, nose–head, neck–R/L shoulders, R/L shoulders–R/L elbows, and R/L elbows–R/L wrists. The directional vectors are favored for training the proposed model because this representation is less affected by bone lengths and root motion. In the representation of joint coordinates, a small translation of the neck, which is the parent joint of both arms, can have an excessive effect on all coordinates of the arms. We denote human poses represented as directional vectors by d_i, and all directional vectors were normalized to unit length. We note that forearm twists were not considered in this paper.

For gesture generation, we use a multilayered bidirectional gated recurrent unit (GRU) network (Cho et al. 2014). Encoded features of speech text, audio, and speaker ID are concatenated to form a feature vector f_i = (f^text_i, f^audio_i, f^style) for each time instant i. The generator takes the feature vector f_i as input and generates the next pose d̂_{i+1} iteratively.

For a long speech, the speech is divided into 2-second chunks and the generator synthesizes gestures for each chunk. The use of seed poses helps to make transitions between consecutive syntheses smooth. The seed poses, the last four frames of the previous synthesis, are concatenated with the feature vectors for the early four frames of the next synthesis as (f_i, d_i), and an additional bit is used to indicate the presence of a seed pose.

An adversarial scheme (Goodfellow et al. 2014) is applied in training the model to generate more realistic gestures. The adversarial scheme uses a discriminator, which is a binary classifier distinguishing between real and generated gestures.
By alternate optimization of the generator and discriminator, the generator improves its ability to fool the discriminator. For the discriminator, we use a multilayered bidirectional GRU that produces a binary output for each time step. A fully connected layer aggregates the t binary outputs and gives a final binary (real or generated gesture) decision.

The gesture generation model is trained on the TED gesture dataset (Yoon et al. 2019), which is a large-scale, English-language dataset for data-driven gesture generation research. The dataset includes speech from various speakers, so it is suitable for learning individual gesture styles. We added 471 additional TED videos to the data of (Yoon et al. 2019), for a total of 1,766 videos. Extracted human poses from TED videos, speech audio, and transcribed English speech text are available. We further converted all human poses to 3D by using the
3D pose estimator (Pavllo et al. 2019), which converts a sequence of 2D poses into 3D poses. The pose estimator uses temporal convolutions that lead to temporally coherent results despite a few inaccurate 2D poses. We used the manual speech transcriptions available for each TED talk, with onset timestamps of each word extracted using the Gentle forced aligner (Ochshorn and Hawkins 2016) to insert padding tokens. The forced aligner reported successful alignment for 97% of the total words.

From the videos, only the sections in which upper body gestures were clearly visible were extracted; the total duration of the valid data was 97 h. The gesture poses were resampled at 15 frames per second, and each training sample of 34 frames was sampled with a stride of 10 from the valid video sections. The initial four frames were used as seed poses and the model was trained to generate the remaining 30 poses (2 seconds). We excluded non-informative samples having little motion (i.e., low variance of a sequence of poses) and erratic samples having lying poses (i.e., a low angle of the spine–neck vector).

The dataset was divided into training, validation, and test sets. The division was done at the video level. Because all presentations in the TED dataset were given by different speakers, the number of unique speaker IDs is the same as the number of videos and there is no overlap of speaker IDs between the split sets. We used the training set for training the model, the validation set for tuning the systems, and the test set for qualitative results and human evaluation. The final numbers of 34-frame sequences in each data partition were 199,384; 26,795; and 25,930.
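A rough sketch of this windowing and of the directional-vector conversion described in Section 3 follows. The joint indices in the parent–child bone table and the variance threshold are placeholders we chose for illustration; only the nine-bone structure, the 34/10 window/stride, and the low-variance filtering come from the text.

```python
import numpy as np

# nine parent->child bones of the 10-joint upper body (indices are illustrative placeholders)
BONES = [(0, 3), (3, 1), (1, 2), (3, 4), (3, 5), (4, 6), (5, 7), (6, 8), (7, 9)]

def poses_to_direction_vectors(poses):
    """poses: (frames, 10, 3) spine-centered joint coordinates -> (frames, 27) unit direction vectors."""
    vecs = []
    for parent, child in BONES:
        v = poses[:, child] - poses[:, parent]
        v = v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)
        vecs.append(v)
    return np.concatenate(vecs, axis=-1)

def make_training_windows(poses_15fps, window=34, stride=10, min_var=1e-3):
    """Cut 34-frame windows (4 seed + 30 target) with stride 10, skipping near-static clips."""
    samples = []
    for start in range(0, len(poses_15fps) - window + 1, stride):
        clip = poses_15fps[start:start + window]
        if clip.reshape(window, -1).var(axis=0).mean() < min_var:  # filter non-informative samples
            continue
        samples.append(poses_to_direction_vectors(clip))
    return samples
```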
The model is trained using the losses below. We use L_G to train the encoders and gesture generator and L_D to train the discriminator.

L_G = α · L_G^Huber + β · L_G^NSGAN + γ · L_G^style + λ · L_G^KLD    (1)

L_G^Huber = E[ (1/t) Σ_{i=1}^{t} HuberLoss(d_i, d̂_i) ]    (2)

L_G^NSGAN = −E[ log(D(d̂)) ]    (3)

L_G^style = −E[ min( HuberLoss(G(f^text, f^audio, f^style_1) − G(f^text, f^audio, f^style_2)) / ||f^style_1 − f^style_2|| , τ ) ]    (4)

L_D = −E[ log(D(d)) ] − E[ log(1 − D(d̂)) ]    (5)

where t is the length of the gesture sequence and d_i represents the i-th pose, represented as directional vectors, in a training sample. When training the encoders and gesture generator, we minimized the difference between human poses d in the training examples and the corresponding generated poses d̂ using the Huber loss (Huber 1964). This loss L_G^Huber can be interpreted as a once-differentiable combination of the L1 and L2 losses, and is therefore sometimes called the smooth L1 loss. The adversarial losses L_G^NSGAN and L_D are from the non-saturating generative adversarial network (NS-GAN) (Goodfellow et al. 2014). We use sample means to approximate the expectation terms.

A generative model conditioned on multiple input contexts often suffers from posterior collapse, where a weak context is ignored. In the proposed model, various gestures can be generated from text and audio only, so the style features from speaker IDs might be ignored during training. Thus, we use diversity regularization (Yang et al. 2019) to avoid ignoring style features. L_G^style is the Huber loss between the gestures generated from different style features, normalized by the difference of the two style features, so it guides style features in the embedding space to generate different gestures. τ is for value clamping for numerical stability. In Equation 4, f^style_1 is the style feature corresponding to the speaker ID of a training sample, and f^style_2 is the style feature for a randomly selected speaker ID. L_G^KLD, the Kullback–Leibler (KL) divergence between N(0, I) and the style embedding space assumed Gaussian, prevents the style embedding space from being too sparse (Kingma and Welling 2014).

L_D is used to train the discriminator D; the generator and discriminator are alternately updated with L_G and L_D as in conventional GAN training (Goodfellow et al. 2014). D(·) is trained to output 1 for human gestures and 0 for generated gestures.

The model was trained for 100 epochs. An Adam optimizer with β_1 = … and β_2 = 0.999 was used, and the learning rate was 0.0005. Weights for the loss terms were determined experimentally (α = …, β = …, γ = 0.05, and λ = …). τ was 1000.

The trained encoders and generator are used at the synthesis stage. As the model is lightweight, synthesis can be done in real time. A single synthesis generating 30 poses takes 10 ms on a GPU (NVIDIA RTX 2080 Ti) and 80 ms on a CPU (Intel i7-5930K).

It is difficult to evaluate gesture generation models objectively because no perceptual quality metric is available for human gestures. Although a human evaluation method in which participants rate generated gestures subjectively is possible, objective evaluation metrics are still required for fair and reproducible comparisons between state-of-the-art models. No proper and widely used evaluation metric is yet available for the gesture generation problem.

Image generation studies have proposed the FID metric (Heusel et al. 2017). Latent image features are extracted from the generated images using a pretrained feature extractor, and FID calculates the Fréchet distance between the distributions of the features of real and generated images. Because FID uses feature vectors that describe visual characteristics well, FID is more perceptually appropriate than measurements over raw pixel spaces. FID can also measure the diversity of the generated samples by using the samples' distribution rather than simply averaging the differences between the real and generated samples. The diversity of generation has been thought to be one of the major factors in evaluating generative models (Borji 2019). Diversity is also crucial for the gesture generation problem because the use of repetitive gestures makes artificial agents look dull.
In applying the concept of FID to the gesture generation problem, there is a hurdle: no general feature extractor is available for gesture data. The paper proposing FID used an inception network trained on the ImageNet database for image classification, but there is no analog of the pretrained inception network for gesture motion data, to the best of our knowledge. Accordingly, we trained a feature extractor based on autoencoding (Rumelhart et al. 1985), which can be trained in an unsupervised manner. The feature extractor consists of a convolutional encoder and decoder; the encoder encodes a sequence of direction vectors d to a latent feature z^gesture and the decoder then attempts to restore the original pose sequence from the latent z^gesture (see Appendix A for the detailed architecture). This unsupervised learning is unlike the supervised learning of the inception network used in FID. However, both supervised and unsupervised learning have proven to be effective for learning perceptual quality metrics (Zhang et al. 2018).

The encoder part of the trained autoencoder was used as a feature extractor. We defined FGD(X, X̂) as the Fréchet distance between the Gaussian mean and covariance of the latent features of human gestures X and the Gaussian mean and covariance of the latent features of the generated gestures X̂ as follows:

FGD(X, X̂) = ||μ_r − μ_g||² + Tr( Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2} )    (6)

where μ_r and Σ_r are the first and second moments of the latent feature distribution Z_r of real human gestures X, and μ_g and Σ_g are the first and second moments of the latent feature distribution Z_g of generated gestures X̂.

For training the feature extractor, we used the Human3.6M dataset (Ionescu et al. 2013) containing motion capture data of 7 different actors and 17 different scenarios, including discussion and making purchases, showing co-speech gestures. The total duration of the training data was about 175 minutes. All poses were frontalized based on the two hip joints.

We explored the properties of the proposed FGD metric using synthetic noisy data. Five types of noisy data were considered. Gaussian noise and Salt&Pepper (S&P) noise were added to the joint coordinates of poses; the same noise data were added to all poses in a sequence, so that there is no artificial temporal discontinuity. Temporal noise was simulated by adding Gaussian noise to only a few time frames. Multiplicative transformation in "eigenposes" p^eigen_i (Yoon et al. 2019), converted from p_i using principal component analysis (PCA), was used to generate monotonous or exaggerated gestures. Mismatched gestures were also generated to examine how the metric responds to discrepancies between speech and gestures. The following shows how the noisy data were synthesized. The parameter ζ controls the overall disturbance level. The dimension of a pose, K, is 30 (10 joints in 3D coordinates).

• Gaussian noise: p̃_i = p_i + x; x ∼ N_K(0, ζI)
• Salt&Pepper noise: p̃_i = p_i + x, where each element x_{k=1,...,K} is set to … if u ≤ ζ/2, to −… if ζ/2 < u ≤ ζ, and to 0 otherwise, with u ∼ U(0, 1)
• Temporal noise: p̃_i = p_i + x, x ∼ N_K(0, … I) if r ≤ i < r + ζ, where r is a random time step; p̃_i = p_i otherwise
• Multiplicative transformation: p̃^eigen_i = ζ · p^eigen_i
• Mismatched samples: select a fraction ζ of all samples and associate the input speech with random gestures from the TED test dataset to make mismatched samples.

Fig. 3. Samples of noisy gesture data to validate evaluation metrics. (a) None (original data), (b) Gaussian noise (ζ = …), (c) Salt&Pepper noise (ζ = …), (d) Temporal noise (ζ = …), (e–f) Multiplicative transformation in eigenposes (ζ = …, …), (g) Mismatched sample.

Figure 3 shows samples of the synthetically noisy data. The Gaussian noise introduced changes across all joints, whereas the S&P noise produces impulsive noise in a few joints. The temporal noise introduced discontinuities in motion. Multiplicative transformation was applied to eigenposes, so it controls the overall motion range. The mismatch noise shows a sample of nonmatching content and speech rhythms.

We measured FGD and the mean absolute error of joint coordinates (MAEJ), which is calculated as MAE(p̃, p). Figure 4 shows the experimental results. For the Gaussian and S&P noise, both FGD and MAEJ showed increasing distances as the disturbance level increases, but FGD showed larger distances for S&P noise than for Gaussian noise on average, unlike MAEJ. As shown in Figure 3 (b) and (c), the sample with Gaussian noise still looks like a human pose, though with some distortions, whereas the sample with S&P noise shows unrealistic poses where the neck is out of the upper body. The samples with Gaussian noise are more perceptually plausible gestures than those
with S&P noise, so, in our view, having larger distances for S&P noise is acceptable.

Fig. 4. Results of the metric validation experiment on the synthetic noisy dataset showing the types of noise. The disturbance level increases as ζ increases, except for the multiplicative transformation, for which the disturbance level is lowest when ζ = 1.0.

Both FGD and MAEJ showed relatively low values for the temporal noise even though discontinuous motion is perceptually unnatural. MAEJ calculates errors in each time frame independently, so it is obvious that MAEJ is not able to capture motion discontinuity. However, FGD, which encodes a whole sequence, also showed unexpectedly low distances. The primary reason is that the feature extractor used in FGD was not able to discriminate sufficiently between the sequences with and without temporal noise. When we examined the reconstructed motion from the autoencoder, we found that the autoencoder tended to remove temporal noise.

For the multiplicative transformation, both metrics showed increasing distances as the disturbance level increased (larger or smaller than ζ = 1.0). …

In this section, we validated FGD by comparing it with subjective ratings from humans. We followed the overall experimental setting in the paper introducing the Fréchet video distance (Unterthiner et al. 2019), but we had two separate user study sessions with 14 noise models and 10 trained gesture generation models. In the first session, 14 noise models (excluding Mismatched with ζ = …) × … were compared.
Table 1. Agreement of the evaluation metric to human judgements in the user study on (a) noise models and (b) trained gesture generation models. We also report the agreements between human subjects as a top line. Higher numbers are better.
Agreement (%)
Metric                              Preference   Human-likeness of motion   Speech–gesture match
(a) Noise models
MAE of joint coordinates (MAEJ)     50.9         55.9                       60.5
MAE of acceleration                 46.5         47.7                       46.3
FGD                                 64.8         63.6                       66.3
Between human subjects              83.3         72.2                       85.7
(b) Trained gesture generation models
MAE of joint coordinates (MAEJ)     37.7         48.2                       32.8
MAE of acceleration                 34.9         40.0                       38.9
FGD                                 70.5         59.6                       70.2
Between human subjects              73.1         78.8                       94.4

Participants took 15–30 min to complete the task, and 2.5 USD was given as a reward. We also included an attention check presenting two copies of the same video side by side. Participants who did not answer "undecidable" in this case were excluded. In the first session with the noise models, a total of 28 subjects participated, but we analyzed the results from 22 subjects after excluding six subjects who failed the attention check. There were 13 male and 9 female subjects, and they were 36.9 ± 11.5 years old. In the second session with the trained generation models, a total of 51 subjects participated and 21 subjects were excluded. There were 15 male and 15 female subjects, and they were 42.8 ± 13.2 years old. The total numbers of answers were 660 and 900 for the first and second sessions, respectively.

We evaluated the objective evaluation metrics by comparing them with human judgements, and the results are shown in Table 1. MAE of acceleration has been used to assess dance motion (Aristidou et al. 2015) and gestures (Kucherenko et al. 2019), and it focuses on motion rather than poses. The agreement values were calculated as the number of comparisons in which each metric agreed with the human judgement divided by the total number of comparisons. "Undecidable" responses were not included in the analysis. In both sessions, FGD showed greater agreement with human judgements than did the MAE of joint coordinates and the MAE of acceleration on all questions. However, FGD performed worse than the agreement between humans; in particular, FGD showed its lowest agreement of 53.5% for temporal noise, as discussed in Section 5.2.

Considering both the experimental results on synthetic noisy data and the human judgements, FGD is a plausible objective metric.
Fig. 5. Validation learning curves measured by the mean absolute error of joint coordinates (MAEJ) and the Fréchet gesture distance (FGD).
In addition, when we examine the learning curves shown in Figure 5, FGD shows a decreasing trend as the distribution of generated gestures becomes more similar to the reference distribution with continued training. In contrast, MAEJ shows a flat learning curve. The lowest MAEJ is at epoch 6, at which only static mean poses appear for all speech contexts. In the following experiments, we use FGD to compare models.

All subjects were asked to write the reasons for their selections. Most of them said they preferred gestures that fit the speech words and audio, as we had assumed in the present paper. Opinions on gesture dynamics were mixed. Some participants liked dynamic or even exaggerated gestures, whereas other participants preferred moderate gestures with a few large movements for emphasis. This implies that gesture styles must be adapted to the users' preferences.
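Since FGD is the metric used for all following comparisons, here is a minimal sketch of how Equation (6) can be computed from two sets of latent gesture features. The feature extractor itself (the autoencoder's encoder from Section 5.1) is assumed to be available as encode_gestures, a name we use only for illustration.

```python
import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats, gen_feats):
    """Equation (6): Frechet distance between Gaussians fitted to latent gesture features.
    real_feats, gen_feats: (num_samples, feature_dim) arrays from the pretrained encoder."""
    mu_r, sigma_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mu_g, sigma_g = gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False)

    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r.dot(sigma_g), disp=False)  # matrix square root
    if np.iscomplexobj(covmean):          # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return diff.dot(diff) + np.trace(sigma_r + sigma_g - 2.0 * covmean)

# usage sketch:
# fgd = frechet_gesture_distance(encode_gestures(real_clips), encode_gestures(generated_clips))
```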
Figure 6 shows the gesture generation results for speech in the test set of the TED gesture dataset. The gestures are depicted using a 3D dummy character. The poses represented as directional vectors were retargeted to the character with fixed bone lengths, and the gesture sequences were upsampled to 30 FPS using cubic spline interpolation. We used the same retargeting procedure for all animations. The character makes metaphoric gestures when saying "civil rights," "30 million," or "great leadership." An iconic gesture is also found for the words "to the point." Gesture generation depends on speech rhythm and the presence or absence of speech, as shown in samples (a) and (e). A deictic gesture also appears in (c) when the character says "I." Please see the supplementary video for the animated results.
We compared the proposed model with three models from previous studies. The first model compared is the attentional Seq2Seq model, which generates gestures from speech text (Yoon et al. 2019). We followed the original implementation provided by the authors, but the gesture representation was modified to be identical to that of the proposed model. The second comparison model is Speech2Gesture (Ginosar et al. 2019), which generates gestures from speech audio using an encoder–decoder neural architecture and learns to generate human-like gestures by using an adversarial loss during training. Spectrograms were used to represent audio in this model.
Fig. 6. Sample results of co-speech gesture generation from the trimodal speech context of text, audio, and speaker identity. Motion history images for some parts are depicted along with the speech text and audio signals. In (a), the character makes metaphoric gestures when saying "civil rights" and beat gestures for "cities and states." In (b) and (d), there are metaphoric gestures for the words "30 million," "great leadership," and "giving up." In (c), a deictic gesture appears when the character says "I." In (e), we can find that the character does not gesture in the middle of the silence. An iconic gesture is also found in (f).
Table 2. Results of comparisons with state-of-the-art models. Lower numbers indicate better performance (Bold: best, Underline: second).
Method                                            FGD
Attentional Seq2Seq (Yoon et al. 2019)            18.154
Speech2Gesture (Ginosar et al. 2019)              19.254
Joint embedding model (Ahuja and Morency 2019)    22.083
Proposed                                          3.729

The third comparison model is the joint embedding model (Ahuja and Morency 2019), which creates human motion from motion description text. This model maps text and motion to the same embedding space. We embedded the input speech text and audio together into the same space as the motion. The same encoders as in our model were used to process the audio and text, and 4-layered GRUs were used for gesture generation. All models were trained on the same TED dataset for the same number of epochs. We modified the original architectures of the baselines to generate the same number of poses (i.e., 30) and to use four seed poses for consecutive syntheses. The learning rate and the weights of the loss terms in the baselines were optimized via grid search for the best FGD.

Figure 7 shows sample results from each model for the same speech. The joint embedding model generated very static poses, failing to learn gesticulation skills. The relationship between speech and gestures is weak and subtle, making it difficult to map speech and gestures to a joint embedding space. All other models generated plausible motions, but there were differences depending on the modality and training loss considered. Attentional Seq2Seq generated different gestures for different input speech sentences, but the motion tended to be slow and we found a few discontinuities between the seed poses and the generated poses. The Speech2Gesture model used an RNN decoder similar to attentional Seq2Seq, but it showed better motion with the help of its adversarial loss component. However, because it uses only a single speech modality, audio, Speech2Gesture generated monotonous beat gestures. The proposed model successfully generated large and dynamic gestures, as shown in the supplementary video.

The proposed model performed the best in terms of FGD (Table 2). We also analysed the human evaluation results by computing ranks from pairwise comparisons using the Bradley–Terry model (Chu and Ghahramani 2005). Pairwise comparisons were collected from another 14 MTurk subjects who passed the same attention check as before. The same settings described in Section 6 were used, but only the four models in Table 2 and human gestures were compared. Figure 8 shows the results. For all the questions, the proposed method achieved better results than the attentional Seq2Seq, Speech2Gesture, and joint embedding methods, but the differences between the proposed method and Speech2Gesture were not distinct for the human-likeness of motion and speech–gesture match questions. We also tested the statistical significance of the differences between the proposed method and the others by using the Chi-square goodness-of-fit test over the null hypothesis that the probabilities of the pairwise choices are equal to 50% (the choice of "undecidable" was not counted). In the preference question, the difference between the proposed and joint embedding methods was significant (p < 0.01). In the speech–gesture match, Seq2Seq …
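The ranking in Figure 8 is obtained from such pairwise choices; a minimal sketch of fitting Bradley–Terry strengths from a win-count matrix follows. The iterative update below is the standard MM algorithm for the Bradley–Terry model, not the exact Bayesian inference used for the error bars in the figure, and the usage comment uses hypothetical counts.

```python
import numpy as np

def bradley_terry(wins, n_iter=100):
    """wins[i, j] = number of times method i was preferred over method j (wins[i, i] = 0).
    Returns a strength for each method; higher means more often preferred."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            num = wins[i].sum()                                   # total wins of method i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])   # comparisons weighted by strengths
                      for j in range(n) if j != i)
            p[i] = num / den
        p /= p.sum()                                              # fix the overall scale
    return p

# usage sketch with a hypothetical count matrix for [Proposed, Seq2Seq, S2G, Joint embedding, Human]:
# strengths = bradley_terry(np.array(pairwise_win_counts))
```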
Input speech text for Figure 7: "I started walking towards the public. I was a mess. I was half naked. I was full of blood and tears were running down my face."
Fig. 7. Sample results of (a) attentional Seq2Seq, (b) Speech2Gesture, (c) joint embedding, and (d) the proposed model for the same input speech. Seven evenly sampled frames are shown for the resulting pose sequences. The last column shows motion history images in which all frames are superimposed. Please see the supplementary video for animated results.
Fig. 8. The results of the human evaluation for the three questions about (a) preference, (b) human-likeness of motion, and (c) speech–gesture match. The ranking is calculated using the Bradley–Terry model, and the horizontal axis represents the winning probability against the other methods. Means and standard deviations are obtained through Bayesian inference for the Bradley–Terry model (Chu and Ghahramani 2005). S2G denotes the Speech2Gesture method.
Table 3. Results of the ablation study for the proposed model. Lower numbers are better. Ablations are not accumulated.
Configuration                                             FGD
Proposed (no ablation)                                    3.729
Without speech text modality                              4.701
Without speech audio modality                             4.874
Without speaker ID                                        6.275
Without adversarial scheme                                9.712
Without regularization terms L_G^style and L_G^KLD       …

An ablation study was conducted to understand the proposed model in detail. We eliminated components from the proposed model that was used in the comparison with the state-of-the-art models. Table 3 summarizes the results of the ablation study. Removing each modality of text, audio, and speaker ID reduced the model's performance; this shows that all three modalities used in the proposed model had positive effects on gesture generation. Among the loss terms, removing the adversarial term and the regularization terms also worsened FGD. In particular, when we trained the model without the adversarial scheme, the model tended to generate static poses close to the mean pose.

Although excluding the speaker ID degraded the FGD the most among the modality ablations, we could not find a noticeable degradation in our subjective impression of motion quality compared with ablating the text or audio modalities. In our view, this is because overall diversity was reduced without the divergence regularization L_G^style, and because FGD measures not only motion quality but also diversity. There is no concrete way to disentangle the factors of quality and diversity in FGD, as in FID. However, we hypothesise that the covariance matrix of the fitted Gaussian is more related to diversity than to quality. The trace of the covariance matrix was 244, which is less than that of the human gestures and of the models without the text or audio modalities (299, 258, and 250, respectively). This indirectly suggests that the generated gestures were less diverse without the speaker IDs and L_G^style.

The text modality had the least effect on FGD. In the proposed model, speech text and audio are treated as independent modalities; however, strictly speaking, audio contains text information because we can transcribe text from audio. Although the above ablation study showed that the FGD worsened without the text modality, the effect was less significant than excluding audio or speaker IDs.
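The diversity proxy mentioned above (the trace of the fitted covariance) is straightforward to compute from the same latent features used for FGD; a small sketch, again assuming the illustrative encode_gestures helper:

```python
import numpy as np

def latent_diversity(latent_feats):
    """Trace of the covariance of latent gesture features: a rough proxy for how
    spread out (diverse) a set of generated gestures is in the feature space."""
    return float(np.trace(np.cov(latent_feats, rowvar=False)))

# e.g., compare latent_diversity(encode_gestures(gen_without_speaker_id))
#       against latent_diversity(encode_gestures(human_gestures))
```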
Fig. 9. Visualization of how a gesture changes when a word is changed in a sentence. We compare the results of (a) the ablated model without the text modality and (b) the proposed model considering both text and audio. The generated gestures for the original and altered sentences are overlaid for five evenly sampled frames. When we consider both text and audio, the model generates more different gestures for the changes in speech content.

We further verified the effect of the text with an additional experiment. Figure 9 shows how the generated gestures differed when a word was altered in the input speech, for the model considering both text and audio and for the model considering only audio. Although the model considering both text and audio generated different gestures (widening arms) when the word "hundreds" replaced the word "few," there was only a slight change in motion when we used the audio-only model. We synthesized speech audio using Google Cloud TTS (Google 2018) for both the original and altered text.

We also conducted the above text-altering experiment quantitatively. For 1,000 samples randomly selected from the validation set, a word in a speech sentence was changed to a synonym or antonym taken from WordNet (Miller 1995). If there were several synonyms or antonyms, the one closest in duration to the original word was selected to minimize the change in the length of the speech audio. Synthesized audio was used, and the experiment was repeated 10 times due to the randomness in selecting samples and words. We report the FGD between the generated samples before and after text alteration; this measure is unlike all other FGD measures in the paper, which compare human motion and generated motion. The model considering text and audio (2.433 ± 0.483) showed a significantly higher FGD than the model considering only audio (1.604 ± 0.275) (paired t-test, p < 0.001), indicating that using text and audio modalities together helps to generate diverse gestures according to the changes in the speech text. This argument is also backed by the result that the FGD when a word was replaced by an antonym (2.567 ± 0.484) was significantly higher than when it was replaced by a synonym (2.299 ± 0.467) (paired t-test, p < 0.05) in the model using both text and audio.
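A rough sketch of the word-substitution step of this experiment, using NLTK's WordNet interface; the duration-matching heuristic here (comparing word lengths in characters) is our simplification of the paper's audio-duration criterion.

```python
from nltk.corpus import wordnet as wn

def substitute_word(word, use_antonym=False):
    """Return a WordNet synonym (or antonym) of `word` closest in length to the original,
    or None if nothing is found."""
    candidates = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if use_antonym:
                candidates.update(a.name() for a in lemma.antonyms())
            elif lemma.name().lower() != word.lower():
                candidates.add(lemma.name())
    candidates = {c.replace('_', ' ') for c in candidates}
    if not candidates:
        return None
    # proxy for "closest in duration": pick the candidate closest in character length
    return min(candidates, key=lambda c: abs(len(c) - len(word)))

# e.g., substitute_word("hundreds") -> a synonym; substitute_word("large", use_antonym=True) -> "small"
```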
Many artificial agents use synthesized audio, since recording a human speaking every possible utterance is infeasible. We verified that the proposed model, trained with human speech audio, also works with synthesized audio.
Speech text used in Figure 10: “... once handed me a very thick book, it was his family's legacy”
Fig. 10. Co-speech gesture generation results with (a) original human speech audio, (b) synthesized audio of a male voice, (c) synthesized audio of a female voice, and (d) synthesized audio of a female voice with pauses. The proposed model can generate gestures from synthesized audio of different voices and rhythms.

Figure 10 and the supplementary video show some results using synthesized audio with different voices. Google Cloud TTS (Google 2018) was used in this experiment. The proposed model worked well with synthesized audio of different voices, prosody, speeds, and pauses. When the speech is fast, the model generates rapid motion. The model also reacts to inserted speech pauses by generating static poses during the silent periods.
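For reference, synthesized input with a different voice, speaking rate, and inserted pauses can be produced with the Google Cloud TTS client library roughly as follows; the voice settings, pause length, and file handling are arbitrary example choices, assuming the current google-cloud-texttospeech Python package rather than the exact script used for the paper.

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# SSML lets us insert an explicit pause; the model reacts to the resulting
# silence by holding a static pose.
ssml = ('<speak>once handed me a very thick book, '
        '<break time="800ms"/> it was his family\'s legacy</speak>')

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
        speaking_rate=1.2))  # faster speech leads to more rapid motion

with open("synthesized.wav", "wb") as f:
    f.write(response.audio_content)
```

The `<break>` tag reproduces the pause condition of Fig. 10(d) at the audio level, since the model only observes the resulting waveform and transcript.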
The proposed model can generate different gesture styles for the same speech. Figure 11 visualizes the trained style embedding space and the gestures generated with different style vectors for the same input speech. To examine the style embedding space more closely, we depict the motion statistics of the generated gestures for each style vector corresponding to a speaker ID using the marker color and shape in the figure. Colors from red to blue correspond to higher and lower temporal motion variance; a larger motion variance can be read as an extroverted style and a smaller one as an introverted style. We also calculated the temporal motion variance for the right and left arms separately and used different marker shapes to indicate handedness: styles using the right or left arm more are depicted as right- and left-pointing triangles, respectively, and the rest are depicted as dots (a code sketch of these statistics is given below). As shown in Figure 11, similar styles are clustered together, and users can easily choose a desired style from the embedding space after traversing it.

Fig. 11. Visualization of the style embedding space and sample generation results for different style vectors. All speaker identities are mapped to style feature vectors f_style, and the feature vectors are visualized in two dimensions using UMAP (McInnes et al. 2018). Each point encodes the degree of motion variance by its color and the degree of handedness by its marker type. The sampled style vectors are labeled according to the overall motion variance and handedness, e.g., 'IB' for an introverted style moving both hands similarly and 'ER' for an extroverted style moving the right hand more. All gesture results are generated from the same speech.

In this paper, we presented a co-speech gesture generation model that generates upper-body gestures from input speech. We proposed a temporally synchronized architecture using the three input modalities of speech text, audio, and speaker ID. The trained model successfully generated various gestures matching the speech text and audio, and different styles of gestures could be generated by sampling style vectors from a style embedding space. A new metric, FGD, was introduced to evaluate the generation results; it was validated using synthetic noisy data and by measuring agreement with human judgments. The proposed generation method outperformed previous methods both objectively and subjectively, as determined by the FGD metric and human evaluation. We also highlighted further properties of the proposed model through various experiments: the model can generate gestures from synthesized audio with various prosody settings, and the style embedding space was trained to be a continuous space in which similar styles lie close together.

There is room for improvement in the present research. First, it is difficult to control the gesture generation process. Although style manipulation is possible, users cannot set constraints on gestures; for example, we might want an avatar to make a deictic gesture when it says a specific word. Most end-to-end neural models have this controllability issue (Jahanian et al. 2020). It would be interesting to extend the current model toward finer controllability, for example by adding constraining poses in the middle of generation. Second, FGD needs to be improved. In nonverbal behavior, subtle motion is as important as large motion, but a feature extractor trained by motion reconstruction might fail to capture subtle motion. It is also necessary to evaluate motion quality and diversity separately for in-depth comparisons between generation models. Third, we only considered upper-body motion, whereas whole-body motion, including facial expressions and finger movements, should be integrated. Taking a long-term view of creating an artificial conversational agent, we would pursue integrating our model with other nonverbal behaviors and with a conversational model. Gestures are deeply related to verbalization according to the information packaging hypothesis (Kita 2000), so an integrated model generating speech and gestures together could deliver information more efficiently.
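Relating to the style-space analysis above, the statistics behind Fig. 11 (temporal motion variance, handedness, and the 2-D projection of style vectors) can be computed roughly as follows; the joint indexing, the handedness margin, and the helper names are illustrative assumptions.

```python
import numpy as np
import umap  # pip install umap-learn

def motion_variance(poses):
    """Mean temporal variance over all pose dimensions.
    poses: (T, D) array of joint coordinates for one generated clip."""
    return poses.var(axis=0).mean()

def handedness(poses, right_idx, left_idx, margin=1.2):
    """Compare temporal variance of right-arm vs. left-arm joints.
    right_idx / left_idx are the column indices of the respective joints."""
    right = poses[:, right_idx].var(axis=0).mean()
    left = poses[:, left_idx].var(axis=0).mean()
    if right > margin * left:
        return 'right'
    if left > margin * right:
        return 'left'
    return 'both'

def project_styles(style_vectors):
    """Project the learned style vectors (one per speaker ID) to 2-D for plotting.
    style_vectors: (num_speakers, style_dim) array from the embedding space."""
    return umap.UMAP(n_components=2).fit_transform(style_vectors)
```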
ACKNOWLEDGMENTS
The authors thank the anonymous reviewers for their thorough and valuable comments. This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00162, Development of Human-care Robot Technology for Aging Society). Resources supporting this work were provided by the ‘Ministry of Science and ICT’ and NIPA (“HPC Support” Project).
REFERENCES
Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2Pose: Natural Language Grounded Pose Forecasting. In International Conference on 3D Vision. IEEE, 719–728.
Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. In Computer Graphics Forum, Vol. 39. Wiley Online Library, 487–496.
Andreas Aristidou, Efstathios Stavrakis, Panayiotis Charalambous, Yiorgos Chrysanthou, and Stephania Loizidou Himona. 2015. Folk Dance Evaluation Using Laban Movement Analysis. Journal on Computing and Cultural Heritage 8, 4 (2015), 20.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. International Conference on Learning Representations.
Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271 (2018).
Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2018), 423–443.
Kirsten Bergmann, Volkan Aksu, and Stefan Kopp. 2011. The Relation of Speech and Gestures: Temporal Synchrony Follows Semantic Synchrony. In Proceedings of the 2nd Workshop on Gesture and Speech in Interaction.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics.
Computer Vision and Image Understanding 179 (2019), 41–65.
Paul Bremner, Anthony G Pipe, Chris Melhuish, Mike Fraser, and Sriram Subramanian. 2011. The Effects of Robot-Performed Co-Verbal Gesture on Listener Behaviour. In IEEE-RAS International Conference on Humanoid Robots. IEEE, 458–465.
Judee K Burgoon, Thomas Birk, and Michael Pfau. 1990. Nonverbal Behaviors, Persuasion, and Credibility. Human Communication Research 17, 1 (1990), 140–169.
Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. 2004. BEAT: the Behavior Expression Animation Toolkit. In Life-Like Characters. Springer, 163–185.
Chung-Cheng Chiu, Louis-Philippe Morency, and Stacy Marsella. 2015. Predicting Co-verbal Gestures: A Deep and Temporal Modeling Approach. In ACM International Conference on Intelligent Virtual Agents. Springer, 152–166.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Empirical Methods in Natural Language Processing. 1724–1734.
Mingyuan Chu and Peter Hagoort. 2014. Synchronization of Speech and Gesture: Evidence for Interaction in Action. Journal of Experimental Psychology: General.
Learning to Rank (2005), 29.
Andrew P Clark, Kate L Howard, Andy T Woods, Ian S Penton-Voak, and Christof Neumann. 2018. Why Rate When You Could Compare? Using the “EloChoice” Package to Assess Pairwise Comparisons of Perceived Physical Strength. PloS one 13, 1 (2018).
Ylva Ferstl, Michael Neff, and Rachel McDonnell. 2019. Multi-Objective Adversarial Gesture Generation. In Motion, Interaction and Games. 1–10.
Peng Fu, Zheng Lin, Fengcheng Yuan, Weiping Wang, and Dan Meng. 2018. Learning Sentiment-Specific Word Embedding via Global Sentiment Representation. In AAAI Conference on Artificial Intelligence.
Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning Individual Styles of Conversational Gesture. In IEEE Conference on Computer Vision and Pattern Recognition. 3497–3506.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems. 2672–2680.
Google. 2018. Google Cloud Text-to-Speech. https://cloud.google.com/text-to-speech Accessed: 2020-03-01.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems. 6626–6637.
Autumn B Hostetter and Andrea L Potthoff. 2012. Effects of Personality and Social Situation on Representational Gesture Production. Gesture 12, 1 (2012), 62–83.
Chien-Ming Huang and Bilge Mutlu. 2014. Learning-Based Modeling of Multimodal Behaviors for Humanlike Robots. In ACM/IEEE International Conference on Human-Robot Interaction. ACM, 57–64.
Peter J Huber. 1964. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics 35, 1 (1964), 73–101.
Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2013. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1325–1339.
Ali Jahanian, Lucy Chai, and Phillip Isola. 2020. On the “Steerability” of Generative Adversarial Networks. In International Conference on Learning Representations.
Hanbyul Joo, Tomas Simon, Mina Cikara, and Yaser Sheikh. 2019. Towards Social Artificial Intelligence: Nonverbal Social Signal Prediction in A Triadic Interaction. In IEEE Conference on Computer Vision and Pattern Recognition. 10873–10883.
Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. 2018. Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms. arXiv preprint arXiv:1812.08466 (2018).
Taewoo Kim and Joo-Haeng Lee. 2020. C-3PO: Cyclic-Three-Phase Optimization for Human-Robot Motion Retargeting based on Reinforcement Learning. In International Conference on Robotics and Automation.
Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations.
Michael Kipp. 2005. Gesture Generation by Imitation: From Human Behavior to Computer Character Animation. Universal-Publishers.
Sotaro Kita. 2000. How Representational Gestures Help Speaking. Language and gesture.
Stefan Kopp, Brigitte Krenn, Stacy Marsella, Andrew N. Marshall, Catherine Pelachaud, Hannes Pirker, Kristinn R. Thórisson, and Hannes Vilhjálmsson. 2006. Towards a Common Framework for Multimodal Generation: The Behavior Markup Language. In ACM International Conference on Intelligent Virtual Agents. Springer, 205–217.
Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, and Hedvig Kjellström. 2019. Analyzing Input and Output Representations for Speech-Driven Gesture Generation. In ACM International Conference on Intelligent Virtual Agents. 97–104.
Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, and Hedvig Kjellström. 2020. Gesticulator: A Framework for Semantically-Aware Speech-Driven Gesture Generation. In ACM International Conference on Multimodal Interaction.
Sergey Levine, Philipp Krähenbühl, Sebastian Thrun, and Vladlen Koltun. 2010. Gesture Controllers. ACM Transactions on Graphics 29, 4 (2010), 1–11.
Stacy Marsella, Yuyu Xu, Margaux Lhommet, Andrew Feng, Stefan Scherer, and Ari Shapiro. 2013. Virtual Character Performance From Speech. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 25–35.
Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. 2018. UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software.
David McNeill. 1992. Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press.
David McNeill. 2008. Gesture and Thought. University of Chicago Press.
Alberto Menache. 2000. Understanding Motion Capture for Computer Animation and Video Games. Morgan Kaufmann.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
George A Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM.
Michael Neff, Michael Kipp, Irene Albrecht, and Hans-Peter Seidel. 2008. Gesture Modeling and Animation Based on a Probabilistic Re-creation of Speaker Style. ACM Transactions on Graphics 27, 1 (2008), 5.
Robert Ochshorn and Max Hawkins. 2016. Gentle: A Forced Aligner. https://lowerquality.com/gentle/ Accessed: 2020-01-06.
Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 2019. 3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training. In IEEE Conference on Computer Vision and Pattern Recognition. 7753–7762.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing. 1532–1543.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In Proceedings of the International Conference on Machine Learning, Vol. 32. 1278–1286.
Matthew Roddy, Gabriel Skantze, and Naomi Harte. 2018. Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs. In ACM International Conference on Multimodal Interaction. ACM, 186–190.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1985. Learning Internal Representations by Error Propagation. Technical Report. California Univ San Diego La Jolla Inst for Cognitive Science.
Najmeh Sadoughi and Carlos Busso. 2019. Speech-Driven Animation with Meaningful Behaviors. Speech Communication 110 (2019), 90–100.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems. 2234–2242.
Robotics Softbank. 2018. NAOqi API Documentation. http://doc.aldebaran.com/2-5/index_dev_guide.html Accessed: 2020-01-06.
Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. 2019. FVD: A New Metric for Video Generation. In International Conference on Learning Representations Workshop.
Petra Wagner, Zofia Malisz, and Stefan Kopp. 2014. Gesture and Speech in Interaction: An Overview. Speech Communication 57, Special Iss. (2014).
Jason R Wilson, Nah Young Lee, Annie Saechao, Sharon Hershenson, Matthias Scheutz, and Linda Tickle-Degnen. 2017. Hand Gestures and Verbal Acknowledgments Improve Human-Robot Rapport. In International Conference on Social Robotics. Springer, 334–344.
Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, and Honglak Lee. 2019. Diversity-Sensitive Conditional Generative Adversarial Networks. In International Conference on Learning Representations.
Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2019. Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. In International Conference on Robotics and Automation. IEEE, 4303–4309.
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In IEEE Conference on Computer Vision and Pattern Recognition. 586–595.
Table 4. The list of the gesture generation models used in the human evaluation. ∗ denotes the epoch having the best FGD.

Model               Training stage (epochs)    FGD
Proposed model      89∗

A DETAILED ARCHITECTURES
Figure 12 shows the detailed architectures of the encoders, gesture generator, and discriminator. Figure 13 shows the architecture of the feature extractor used in the Fréchet gesture distance.
B MODELS IN HUMAN EVALUATION
Table 4 lists the models used in the human evaluation.
Fig. 12. Detailed architectures of the (a) audio encoder, (b) text encoder, (c) speaker embedding, (d) gesture generator (assuming two seed poses), and (e) discriminator. BN stands for batch normalization, FC for fully connected layer, and TCN for temporal convolutional network.
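As a rough PyTorch sketch of the speaker-embedding path in Fig. 12(c): a speaker ID is embedded, mapped by fully connected layers to a mean and a standard deviation, and an 8-dimensional style vector is sampled via the reparameterization trick. The number of speaker IDs (1766) and the style dimensionality (8) follow the figure; the hidden size and the use of a log-variance head are assumptions.

```python
import torch
import torch.nn as nn

class StyleEmbedding(nn.Module):
    def __init__(self, n_speakers=1766, style_dim=8, hidden=16):
        super().__init__()
        self.id_embedding = nn.Embedding(n_speakers, hidden)
        self.fc_mu = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, style_dim))
        self.fc_logvar = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, style_dim))

    def forward(self, speaker_id):
        h = self.id_embedding(speaker_id)          # (B, hidden)
        mu = self.fc_mu(h)                         # (B, style_dim)
        logvar = self.fc_logvar(h)
        std = torch.exp(0.5 * logvar)
        style = mu + std * torch.randn_like(std)   # reparameterization trick
        return style, mu, logvar

def kld_loss(mu, logvar):
    """KL-divergence regularization toward a standard normal (the L^KLD_G term)."""
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
```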
Fig. 13. Architecture of the feature extractor used for the Fréchet gesture distance: Conv1D/BN layers and fully connected layers encode a (t x 27) pose sequence into a 32-dimensional feature vector, and fully connected and ConvTr1D/BN layers decode it back to a (t x 27) sequence.
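A minimal sketch of the kind of convolutional autoencoder that Fig. 13 depicts for the FGD feature extractor: 1-D convolutions encode a (t x 27) pose sequence into a 32-dimensional code used as the FGD feature, and transposed convolutions reconstruct the sequence. Kernel sizes, channel counts, and the clip length are assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class PoseFeatureExtractor(nn.Module):
    """Autoencoder whose 32-dim bottleneck features are used to compute FGD."""
    def __init__(self, pose_dim=27, feat_dim=32, seq_len=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(pose_dim, 64, 3, padding=1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, 3, stride=2, padding=1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 32, 3, stride=2, padding=1), nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 32, 3, padding=1),
        )
        enc_len = seq_len // 4  # two stride-2 convolutions; seq_len divisible by 4
        self.to_code = nn.Linear(32 * enc_len, feat_dim)
        self.from_code = nn.Linear(feat_dim, 32 * enc_len)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 64, 4, stride=2, padding=1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.ConvTranspose1d(64, pose_dim, 4, stride=2, padding=1),
        )

    def forward(self, poses):                      # poses: (B, T, pose_dim)
        x = poses.transpose(1, 2)                  # (B, pose_dim, T)
        h = self.encoder(x)                        # (B, 32, T/4)
        code = self.to_code(h.flatten(1))          # (B, feat_dim) -> used for FGD
        h = self.from_code(code).view(h.shape)
        recon = self.decoder(h).transpose(1, 2)    # (B, T, pose_dim)
        return code, recon
```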