A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents
Demonstration Track
Rajmund Nagy∗, KTH, Stockholm, Sweden
Taras Kucherenko∗, KTH, Stockholm, Sweden
Birger Moell, KTH, Stockholm, Sweden
André Pereira, KTH, Stockholm, Sweden
Hedvig Kjellström, KTH, Stockholm, Sweden
Ulysses Bernardet, Aston University, Birmingham, UK

∗Both authors contributed equally to the paper.
Figure 1: Architecture of the framework for integrating gesture generation models.
ABSTRACT
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users. Gesticulation – hand and arm movements accompanying speech – is an essential part of non-verbal behavior. Gesture generation models have been developed for several decades: starting with rule-based and ending with mainly data-driven methods. To date, recent end-to-end gesture generation methods have not been evaluated in a real-time interaction with users. We present a proof-of-concept framework, which is intended to facilitate evaluation of modern gesture generation models in interaction.

We demonstrate an extensible open-source framework that contains three components: 1) a 3D interactive agent; 2) a chatbot backend; 3) a gesticulating system. Each component can be replaced, making the proposed framework applicable for investigating the effect of different gesturing models in real-time interactions with different communication modalities, chatbot backends, or different agent appearances. The code and video are available at the project page https://nagyrajmund.github.io/project/gesturebot.
KEYWORDS
conversational embodied agents; non-verbal behavior synthesis
Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.), May 3–7, 2021, Online
ACM Reference Format:
Rajmund Nagy, Taras Kucherenko, Birger Moell, André Pereira, Hedvig Kjellström, and Ulysses Bernardet. 2021. A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents: Demonstration Track. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 3 pages.
INTRODUCTION
Humans use non-verbal behavior to signal their intent, emotions and attitudes [9, 16]. Similarly, embodied conversational agents (ECAs) can be more engaging when having appropriate non-verbal behavior [21]. It is therefore desirable to enable conversational agents to communicate non-verbally.

Existing implementations of ECAs rely primarily on pre-recorded animations or require handcrafted specification of the motion [11, 15, 17], e.g., in XML-based formats [10]. However, recent developments in the field of gesture generation make it possible to produce realistic gestures in a purely data-driven fashion [1, 12, 23]. To date, many of these recent gesture generation methods have not been evaluated in a real-time interaction with users. A potential reason for this lack of evaluation in interaction is the difficulty of setting up an interactive conversational agent.

In this work, we outline a framework for embedding data-driven gesture generation models into conversational agents. We envision that this framework will accelerate the development of interactive embodied agents with end-to-end gesticulation models.

Our framework is modular, which enables it to be used for a wide range of scientific investigations about intelligent virtual agents, such as experimenting with their voices, gestures, breathing, conversational complexity or gender. For our demonstration, we integrate the speech- and text-driven model developed by Kucherenko et al. [13] into an ECA built with Unity, and we show the flexibility of our approach by demonstrating our framework with two different chatbot backends.

SYSTEM OVERVIEW
Our open-source system is composed of a 3D virtual agent in Unity, a chatbot backend with text-to-speech capabilities, and a neural network that generates gesturing motion from speech (Figure 1). The communication between the components consists of sending audio, text or motion files as messages; in our implementation, this is facilitated by the open-source Apache ActiveMQ message broker (https://activemq.apache.org/) and the STOMP protocol (https://stomp.github.io/). Each component is replaceable and is described in its own subsection; a minimal code sketch of the messaging pattern follows the virtual agent description.

Virtual agent
We provide the virtual environment and the user interface as a Unity scene. The end user interacts with the conversational agent through voice input or a text field. By using Unity, we ensure that the system can be easily extended with new modules by other researchers in the future. Furthermore, it makes it possible to tailor the environment and the character model according to the requirements of the application (e.g., to explore virtual/mixed reality applications).
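To make the message-passing concrete, the following is a minimal sketch, not taken from the project's codebase, of how a backend component could publish and subscribe via ActiveMQ over STOMP using the stomp.py library; the queue names, credentials, and JSON payload layout are assumptions made for illustration.

```python
# Minimal sketch of the framework's message-passing pattern (assumed
# queue names and payload format; not the project's actual code).
import json
import time

import stomp  # pip install stomp.py


class MotionListener(stomp.ConnectionListener):
    """Handles motion messages published by the gesture generation model."""

    def on_message(self, frame):
        # In stomp.py >= 6, listeners receive a single Frame object;
        # older versions pass (headers, body) instead.
        motion = json.loads(frame.body)
        print(f"Received {len(motion['frames'])} motion frames")


# Connect to a local ActiveMQ broker (61613 is its default STOMP port).
conn = stomp.Connection([("localhost", 61613)])
conn.set_listener("motion", MotionListener())
conn.connect("admin", "admin", wait=True)

# Listen on the (hypothetical) queue where generated motion arrives ...
conn.subscribe(destination="/queue/motion", id="1", ack="auto")

# ... and publish the chatbot's response so that the gesture
# generation component can synthesize the matching motion.
conn.send(
    destination="/queue/agent_response",
    body=json.dumps({"text": "Hello there!", "audio_file": "response.wav"}),
)

time.sleep(5)  # give the broker a moment to deliver a reply
conn.disconnect()
```

Because each component only talks to the broker, swapping one implementation for another (for instance, one chatbot backend for another) requires no changes to the other components.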
Chatbot backend
The user's input message is sent to the chatbot backend, which produces the agent's response as text and audio. The chatbot backend comprises a speech recognition system, a neural conversational model and a text-to-speech synthesizer. For our demonstration, we present two implementations of this component.

In the first configuration, Google's popular DialogFlow platform [19] is used with its automatic speech recognition module to enable voice-based interaction with the agent. The interfacing to DialogFlow is implemented in a C# script in Unity. In the second configuration, the text-based Blenderbot [18] conversational model is combined with the Glow-TTS [8] speech synthesizer; we use their open-source implementations from the HuggingFace Transformers library [22] and the Mozilla TTS repository (https://github.com/mozilla/TTS) to seamlessly integrate the two models into the Python backend.

Gesture generation model
Based on the output of the chatbot backend, the gesture generation model synthesizes the corresponding motion sequence. We adapt a recent gesture generation model called Gesticulator [13], which is an autoregressive neural network that takes acoustic features combined with semantic information as its input, and generates the corresponding gesticulation as a sequence of upper-body joint angles, a widely used representation in computer animation and robotics.

In the original paper [13], the network was trained on the Trinity Speech-Gesture dataset [6], consisting of 244 minutes of speech and motion capture recordings of spontaneous monologues acted out by a male actor. The input features – log-power spectrograms for audio and BERT [4] word embeddings for text – are extracted and concatenated at the frame level, and a 1.5 s (30 frames at 20 FPS) sliding window of input features is used for predicting every motion frame, motivated by gesture-speech alignment research.

We tailor the base Gesticulator model to our interactive agent with the following adjustments:
(1) We replace the spectrogram audio features with the extended Geneva Minimalistic Acoustic Parameter Set [5], normalized to zero mean and unit variance. We qualitatively found that this results in better motion for synthesized voice.
(2) The text transcriptions that are used for training the model contain precise word timing information, which is usually not available in real-time settings. Therefore, when interacting with a user, we approximate it with speech utterance lengths that are proportional to the syllable count (a sketch of this approximation is given after the next section).
(3) Finally, we replace the BERT word embeddings with FastText [3] (which has significantly lower dimensionality) in order to reduce the feature extraction time.

LIMITATIONS
At the current stage of development, each of the components has some important limitations:
(1) Both available chatbot backends introduce several seconds of processing time before the agent responds to the user, which might currently affect immersion in the interaction.
(2) The synthesized voice of the agent yields out-of-distribution audio samples, which significantly degrade the quality of the generated motion. This could be improved by replacing the audio in the dataset with synthetic audio [20] or by training a TTS model on the audio from a speech-gesture dataset [2].
(3) Finger motion is not modelled by the gesture generation model due to poor data quality in the dataset.

However, the system's modular design of replaceable components allows addressing these limitations in the future. For instance, it is straightforward to replace Gesticulator with a model that generates full-body motion, or to change the chatbot backend, as shown in our two distinct examples with DialogFlow and Blenderbot.
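Returning to adjustment (2) above: the following is a small illustrative sketch, under the assumption that each word's duration is simply proportional to its syllable count, of how approximate word timings could be derived when precise alignments are unavailable. The vowel-group syllable counter is a crude heuristic introduced for this example, not the project's actual implementation.

```python
# Illustrative sketch: approximate per-word timings from syllable counts
# (hypothetical helper functions; the real system may differ).
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def approximate_word_timings(text: str, total_duration: float):
    """Split an utterance's duration across its words, proportionally
    to their syllable counts; returns (word, start, end) tuples."""
    words = text.split()
    syllables = [count_syllables(w) for w in words]
    total = sum(syllables)
    timings, start = [], 0.0
    for word, syl in zip(words, syllables):
        duration = total_duration * syl / total
        timings.append((word, start, start + duration))
        start += duration
    return timings


# Example: distribute a 2-second synthesized utterance over its words.
for word, t0, t1 in approximate_word_timings("hello there my friend", 2.0):
    print(f"{word:>8}: {t0:.2f}s - {t1:.2f}s")
```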
CONCLUSION AND FUTURE WORK
We have presented a framework for integrating state-of-the-art data-driven gesture generation models with embodied conversational agents. As highlighted by the GENEA gesture generation challenge [14], the gesture generation field needs a reliable benchmark. The proposed framework provides such a possibility: it can be used to compare gesture generation models in real-time interactions.

There are many directions in which this work can be extended in the future. More diverse gesticulation can be achieved by choosing a probabilistic gesture generation model instead of a deterministic one like Gesticulator. Incorporating stylistic control [1, 7] in ECAs to allow the expression of different emotions is a promising direction. Moreover, the proposed framework can be used in user studies to investigate the human perception of different gesticulation.
REFERENCES
[1] Simon Alexanderson, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Style-controllable speech-driven gesture synthesis using normalising flows. Computer Graphics Forum 39, 2 (2020), 487–496.
[2] Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, and Jonas Beskow. 2020. Generating coherent spontaneous speech and gesture from text. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 1–3.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019).
[5] Florian Eyben, Klaus R Scherer, Björn W Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Y Devillers, Julien Epps, Petri Laukka, Shrikanth S Narayanan, et al. 2015. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing 7, 2 (2015), 190–202.
[6] Ylva Ferstl and Rachel McDonnell. 2018. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the 18th International Conference on Intelligent Virtual Agents (IVA '18). Association for Computing Machinery, New York, NY, USA.
[7] Ylva Ferstl, Michael Neff, and Rachel McDonnell. 2020. Understanding the predictability of gesture parameters from speech and their perceptual importance. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 1–8.
[8] Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. 2020. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
[9] Mark L Knapp, Judith A Hall, and Terrence G Horgan. 2013. Nonverbal Communication in Human Interaction. Wadsworth, Cengage Learning.
[10] Stefan Kopp, Brigitte Krenn, Stacy Marsella, Andrew N Marshall, Catherine Pelachaud, Hannes Pirker, Kristinn R Thórisson, and Hannes Vilhjálmsson. 2006. Towards a common framework for multimodal generation: The behavior markup language. In International Workshop on Intelligent Virtual Agents. Springer.
[11] Stefan Kopp and Ipke Wachsmuth. 2004. Synthesizing multimodal utterances for conversational agents. Computer Animation and Virtual Worlds 15, 1 (2004).
[12] Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gustav Eje Henter, and Hedvig Kjellström. 2021. Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation. International Journal of Human–Computer Interaction (2021). https://doi.org/10.1080/10447318.2021.1883883
[13] Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, and Hedvig Kjellström. 2020. Gesticulator: A framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction.
[14] Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, and Gustav Eje Henter. 2021. A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020. In Proceedings of the International Conference on Intelligent User Interfaces.
[15] Margot Lhommet, Yuyu Xu, and Stacy Marsella. 2015. Cerebella: Automatic generation of nonverbal behavior for virtual humans. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[16] Michael Nixon, Steve DiPaola, and Ulysses Bernardet. 2018. An eye gaze model for controlling the display of social status in believable virtual humans. In 2018 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, 1–8. https://doi.org/10.1109/CIG.2018.8490373
[17] Igor Rodriguez, Aitzol Astigarraga, Txelo Ruiz, and Elena Lazkano. 2016. Singing minstrel robots, a means for improving social behaviors. In 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2902–2907.
[18] Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, and Jason Weston. 2020. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637.
[19] Navin Sabharwal and Amit Agrawal. 2020. Introduction to Google Dialogflow. In Cognitive Virtual Assistants Using Google Dialogflow. Springer, 13–54.
[20] Najmeh Sadoughi and Carlos Busso. 2016. Head motion generation with synthetic speech: A data driven approach. In INTERSPEECH. 52–56.
[21] Maha Salem, Katharina Rohlfing, Stefan Kopp, and Frank Joublin. 2011. A friendly gesture: Investigating the effect of multimodal robot behavior in human-robot interaction. In Proceedings of the International Symposium on Robot and Human Interactive Communication. IEEE.
[22] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.
[23] Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2020. Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39, 6 (2020).