MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces
Mary Ellen Foster, Bart Craenen, Amol Deshmukh, Oliver Lemon, Emanuele Bastianelli, Christian Dondrup, Ioannis Papaioannou, Andrea Vanzo, Jean-Marc Odobez, Olivier Canévet, Yuanzhouhan Cao, Weipeng He, Angel Martínez-González, Petr Motlicek, Rémy Siegfried, Rachid Alami, Kathleen Belhassein, Guilhem Buisan, Aurélie Clodic, Amandine Mayima, Yoan Sallami, Guillaume Sarthou, Phani-Teja Singamaneni, Jules Waldhart, Alexandre Mazel, Maxime Caniot, Marketta Niemelä, Päivi Heikkilä, Hanna Lammi, Antti Tammela
University of Glasgow, Glasgow, UK
Heriot-Watt University, Edinburgh, UK
Idiap Research Institute, Martigny, Switzerland
LAAS-CNRS, Toulouse, France
SoftBank Robotics Europe, Paris, France
VTT Technical Research Centre of Finland, Tampere, Finland
Abstract
In the EU-funded MuMMER project, we have developed a social robot designed to interact naturally and flexibly with users in public spaces such as a shopping mall. We present the latest version of the robot system developed during the project. This system encompasses audio-visual sensing, social signal processing, conversational interaction, perspective taking, geometric reasoning, and motion planning. It successfully combines all these components in an overarching framework using the Robot Operating System (ROS) and has been deployed to a shopping mall in Finland interacting with customers. In this paper, we describe the system components, their interplay, and the resulting robot behaviours and scenarios provided at the shopping mall.
Introduction
In the EU-funded MuMMER project (http://mummer-project.eu/), we have developed a socially intelligent interactive robot designed to interact with the general public in open spaces, using SoftBank Robotics' Pepper humanoid robot as the primary platform (Foster et al. 2016). The MuMMER system provides an entertaining and engaging experience to enrich human-robot interaction. Crucially, our robot exhibits behaviour that is socially appropriate and engaging by combining speech-based conversational interaction with non-verbal communication and motion planning. To support this behaviour, we have developed and integrated new methods from audiovisual scene processing, social-signal processing, conversational AI, perspective taking, and geometric reasoning.

The primary MuMMER deployment location is Ideapark, a large shopping mall in Lempäälä, Finland. The MuMMER robot system has been taken to the shopping mall several times for short-term co-design activities with the mall customers and retailers (Heikkilä, Lammi, and Belhassein 2018; Heikkilä et al. 2019); the full robot system has been deployed for short periods in the mall in September 2018 (Figure 1), May 2019, and June 2019, and has been installed for a long-term, three-month deployment as of September 2019.

Figure 1: The MuMMER robot system interacting with a customer in the Ideapark shopping mall, September 2018.

The demo system supports a range of behaviours covering a variety of functional and entertainment tasks that are appropriate for a shopping-mall setting, including guidance to various locations within the mall, small talk, and playing quiz games with customers. The activities during the deployment have included a number of data collection studies with real users: recording of customer interaction with the robot in guidance situations, sound localisation and automatic speech recognition in the noisy mall environment, and tests for AI-based conversation and localisation and navigation based on a partial 3D model of the mall and a complete semantic model.

In the remainder of this paper, we outline the technical contributions in each of the main MuMMER component areas: audiovisual sensing, social signal processing, conversational interaction, human-aware robot motion planning, and knowledge representation and decision-making. At the end, we describe the details of the deployed robot system.

Figure 2: The perception system tracks and re-identifies people leaving the field, and extracts other features: speaker turns, head pose, visual focus of attention, and nods.

Audio-visual sensing
For MuMMER, the main task of audio-visual perception is sensing people in general – that is, maintaining a representation of the persons around the robot, with dedicated attention to people likely to interact with it, or those who are (or have been) interacting with it. This requires several audio-visual algorithms to detect, track, and re-identify people, to detect their non-verbal behaviours and activities, and also to predict their position and behaviours even when they are not seen. At the same time, the representation of people needs to be defined and shared with other modules which are responsible for inferring other knowledge about people (for instance, to define a person's goal in the interaction).

For visual tracking, we first detect the person with the convolutional pose machines (CPM) (Cao et al. 2017), which provide accurate locations of the body joints (nose, eyes, shoulders, etc.). The output of the algorithm is almost perfect when people are in the foreground of the image and up to 3 or 5 meters away, depending on the resolution; this matches our use case of an entertainment robot in a shopping mall. On top of the CPM, we use OpenHeadPose (Cao, Canévet, and Odobez 2018; https://gitlab.idiap.ch/software/openheadpose), which makes use of the heat maps of the CPM to estimate the head pose of the person. Then, we perform head pose tracking (Khalidov and Odobez 2017) to maintain a consistent identity across adjacent frames (Figure 2). The head pose tracker is a particle filter mainly based on colour and face cues. As faces are tracked, we store OpenFace features (Amos, Ludwiczuk, and Satyanarayanan 2016), which are computed on an aligned face. When a new tracklet is created, its OpenFace features are compared with the features previously accumulated, and the new tracklet is re-assigned the identity of the one with the most votes.

For sound localisation, we use a multi-task neural network (NN) which jointly performs speech/non-speech detection and sound source localisation (He, Motlicek, and Odobez 2018a; He, Motlicek, and Odobez 2018b), applied on top of the 4-channel microphone array embedded on the robot. The NN takes as input the 4-channel audio transformed into the frequency domain, and it outputs the likelihood values for the two tasks. Thanks to a semi-automated and synthetic data collection procedure taking advantage of the robotic platform, as well as the use of a weakly supervised learning approach, it is possible to quickly collect data to learn the models for a new sensor (He, Motlicek, and Odobez 2019). The fusion between the visual and audio parts is done by assigning the detected speech to the person who is standing in the given direction.

Finally, although a close-range (up to 1.2 m) gaze sensing module is available and can be applied to one selected person using a self-calibrated approach (Siegfried, Yu, and Odobez 2017), as a compromise between computation and robustness, we instead compute the visual focus of attention of each person based on the head pose (Sheikhi and Odobez 2015). The algorithm can reliably estimate the object the person is looking at (the robot, the other persons, the targets, the shops, or the tablet embedded on the robot), which is a preliminary step to identify the addressee, and is also used in the context of perspective taking to determine whether the human has looked in the direction where the robot pointed (Sallami et al. 2019).
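To make the audio-visual fusion step concrete, the following is a minimal sketch of assigning detected speech to the tracked person standing closest to the estimated direction of arrival. It assumes the localisation network reports a speech likelihood and an azimuth, and that the tracker reports per-person positions in the robot frame; the data structures, thresholds, and function names are illustrative rather than the actual MuMMER interfaces.

```python
import math

def assign_speech_to_person(doa_azimuth, speech_likelihood, tracked_people,
                            speech_threshold=0.5, max_angle_diff=math.radians(15)):
    """Assign a detected speech source to the tracked person whose position
    is closest to the estimated direction of arrival (DOA)."""
    if speech_likelihood < speech_threshold:
        return None  # no speech detected in this frame
    best_person, best_diff = None, max_angle_diff
    for person in tracked_people:
        # Azimuth of the person's head in the robot (microphone array) frame.
        person_azimuth = math.atan2(person["y"], person["x"])
        # Smallest angular difference between the person and the DOA.
        diff = abs(math.atan2(math.sin(person_azimuth - doa_azimuth),
                              math.cos(person_azimuth - doa_azimuth)))
        if diff < best_diff:
            best_person, best_diff = person, diff
    return best_person  # None if nobody stands close enough to the DOA

# Example: two tracked people, speech localised roughly towards the first one.
people = [{"id": 1, "x": 1.0, "y": 0.2}, {"id": 2, "x": 1.5, "y": -1.0}]
speaker = assign_speech_to_person(math.radians(10), 0.9, people)
print(speaker["id"] if speaker else "no speaker")  # -> 1
```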
Social signal processing

For social signal processing, we focus on two primary tasks: fusing the provided audio-visual sensing data for social state estimation, and synthesising appropriate social signals for the robot to use when communicating with users. While detecting, tracking, and (re)identifying users, as well as detecting their primary non-verbal behaviours and activities, provide the basic signals, the multi-modal fusion of these signals allows for a more accurate and deeper understanding of the underlying social state, including gaining personality impressions from the user. The estimated social state is then made available to inform planning of the robot's subsequent actions: whom among the users to converse with and how, and how the robot is to move and behave (gestures) in the presence of the user(s).

Figure 3: Social state estimator visualiser output, displaying all relevant information for fine-tuning.

Figure 4: Gesture variations. Each row shows the same gesture using different parameters.

Social state estimation
On the fusion side, the main function of the social state estimator is to determine which user the robot should initiate interaction with. We use the underlying assumption that the robot should initiate interaction with the user perceived to be the most willing to interact, which we take to be the user paying the most attention to the robot. We assume that the user paying the most attention to the robot is the user that is looking most directly at the robot and is situated closest to the robot.

To this end, the social state estimator aggregates audio-visual sensing data about the head pose of users, whether the users are looking at the robot and/or the screen on the robot, and the distance between each user's head and the robot. The head pose data are used to calculate the (Euclidean) distance between the head pose of each user and three centroids derived from clustering/classifying previously recorded lab and deployment data. This distance is then normalised to a value between zero and one and used as a probability. The distances between the users and the robot are used as a penalty, normalised between zero and one as a probability in such a way that users further away from the robot are penalised more than users closer to the robot. This results in four probabilities, two taken directly from the audio-visual sensing data and two derived from it. These four probabilities are then fused into one attention probability by calculating their (weighted) average. Choosing which user the robot should interact with is done by comparing the attention probability against a configurable minimum attention probability threshold, and then selecting the user with the highest attention probability.

To prevent immediately re-initiating an interaction with a user that the robot has just interacted with, the social state estimator also monitors the actions of the planning and dialogue components. The social state estimator maintains a list of the users that the robot is, and has been, interacting with, and applies a penalty to their attention probability while they are interacting and for a short time afterwards.

The social state estimator is fully configurable by a set of parameters, with the initial parameter settings determined from extensive recorded lab data. The parameters were further fine-tuned during deployment to provide the most accurate and applicable social state estimates. To facilitate fine-tuning the parameters of the social state estimator, a visualiser is provided to display all relevant features (Figure 3).
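The following sketch illustrates the attention-probability fusion described above, with illustrative weights, normalisation, and threshold; the deployed estimator is configurable, and its actual parameters were tuned on recorded lab and deployment data.

```python
def attention_probability(looking_at_robot, looking_at_screen,
                          head_pose_score, distance_m,
                          max_distance=4.0, weights=(0.3, 0.2, 0.3, 0.2)):
    """Fuse four per-person cues into a single attention probability.
    The first three cues are already probabilities in [0, 1]; the distance
    (in metres) becomes a penalty so that people further away score lower."""
    distance_score = max(0.0, 1.0 - min(distance_m, max_distance) / max_distance)
    cues = (looking_at_robot, looking_at_screen, head_pose_score, distance_score)
    return sum(w * c for w, c in zip(weights, cues))

def select_user(people, threshold=0.5):
    """Pick the person with the highest attention probability above the threshold."""
    scored = [(attention_probability(**p["cues"]), p) for p in people]
    scored = [(s, p) for s, p in scored if s >= threshold]
    if not scored:
        return None
    return max(scored, key=lambda sp: sp[0])[1]

# Example with two hypothetical users: "A" is close and looking at the robot.
users = [
    {"id": "A", "cues": dict(looking_at_robot=0.9, looking_at_screen=0.1,
                             head_pose_score=0.8, distance_m=1.2)},
    {"id": "B", "cues": dict(looking_at_robot=0.2, looking_at_screen=0.0,
                             head_pose_score=0.3, distance_m=3.5)},
]
print(select_user(users)["id"])  # -> "A"
```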
Social signal generation
For the synthesis side, a repertoire of non-verbal social signals, including gestures and sounds, has been developed for the robot, available to be used in conjunction with moving and interacting with the users. The non-verbal behaviour of an embodied agent is at least as communicative as its verbal behaviour (Vinciarelli, Pantic, and Bourlard 2009), and in a noisy environment such as a mall it may be even more important, so understanding and controlling the robot's non-verbal signals is crucial. Examples of some robot gesture variants are shown in Figure 4.

In a series of perception experiments, we have examined how manipulating gesture parameters affects users' subjective responses to the robot as well as their perception of the robot's personality. These studies have found several clear relationships: for example, manipulating the amplitude and speed significantly affected users' perception of the Extraversion and Neuroticism of the robot, while the attributed personality also affected users' subjective reactions to the robot (Craenen et al. 2018b). In addition, it was found that while the majority of users preferred a robot that they perceived to have a similar personality to their own, a significant minority preferred a robot whose personality was perceived to be different from their own (Craenen et al. 2018a).

We are currently integrating a finer-grained method of gesture control based on sentiment (Deshmukh, Foster, and Mazel 2019), as well as a set of affectively generated artificial sounds (Hastie et al. 2016), with the goal of further enhancing the robot's expressiveness.

Conversational interaction
The MuMMER system focuses on enabling an agent to combine a task-based dialogue system with chat-style open-domain social interaction, to fulfil the required tasks while at the same time being natural, entertaining, and engaging to interact with. The presented work is based on the "Alana" conversational framework, a finalist of the Amazon Alexa Challenge in both 2017 (Papaioannou et al. 2017) and 2018 (Curry et al. 2018). Alana was initially developed for the Amazon Echo as an open-domain social chatbot. For the needs of this project, Alana acts as the core module for every dialogue interaction with the user from every other module. This means that whenever a module needs to either verbally notify the user or get the user's feedback, Alana handles this task. In this way the conversation throughout the interaction is more contextually relevant and easier to maintain. Since the robot needs to engage in social dialogue as well as to complete tasks, Alana was enriched with so-called task bots to conversationally execute and monitor behaviours on a physical agent (Figure 5) (Papaioannou, Dondrup, and Lemon 2018).

Figure 5: Architecture of the dialogue system. The blue parts on the left represent the task management and execution system; the green parts on the right represent Alana as the dialogue system. The Bot Ensemble contains social chat bots and the task bot, which is able to trigger tasks and handle communication between the task and the user. The yellow middle part is the task-specific dialogue management system (Arbiter), Text-To-Speech (TTS), and Automatic Speech Recognition (ASR).

In order to enable the functionality described above, a new Natural Language Understanding (NLU) module, HERMIT NLU, has been implemented and integrated into the Alana system, which is able to deal with social chit-chat but also extract the necessary information from commands to start tasks. HERMIT NLU (Vanzo, Bastianelli, and Lemon 2019) is thus used to decide if the task bot is triggered and to extract the required parameters for tasks, such as the name of the shop someone is looking for. While standard chatbots mostly rely on NLU that works on shallow semantic representations (e.g., intents + slots), task-based applications require richer characterisations. In line with (Dinarelli et al. 2009), we promote the idea that the user's intent can be represented through the combination of existing theories, capturing different dimensions of the overall problem, namely Dialogue Acts and Frame Semantics. Existing approaches to NLU for dialogue systems are based on formal languages designed around the targeted domain. However, it has been widely demonstrated that the generalisation capability of statistical approaches is more robust towards lexical and domain variability (Bastianelli et al. 2016). We thus use a deep learning architecture based on a hierarchy of self-attention mechanisms and BiLSTM encoders followed by CRF tagging layers to perform multi-task learning over the aforementioned semantic dimensions (Rastogi, Gupta, and Hakkani-Tur 2018). The system effectively learns how to predict Dialogue Acts, Frames, and Frame Elements in a sequence labelling scheme, starting from a corpus of annotated sentences which we are currently developing.

After a task has been identified, executing it on a robot usually includes physical actions that require a finite amount of time to complete and are not instantaneous, unlike dialogue actions. While the robot is executing such an action, the user might want to continue the conversation, or give new instructions. In order to support such multi-threaded dialogue management, interleaving tasks with general chit-chat and other tasks, we build on the ideas presented in (Lemon et al. 2002). To this end, the execution system introduced in (Dondrup et al. 2017) has been extended to use so-called recipes that define the dialogue and physical actions to execute in order to achieve the given goal (Papaioannou, Dondrup, and Lemon 2018). The execution framework described therein has been redesigned to support multi-threaded execution, and an arbitration process has been put in place to manage the currently running tasks on the execution side and in the Alana system. This lets tasks be started, stopped, and paused at any time, with appropriate feedback to the user. If a task has been suspended by another action, it will be resumed after the new action finishes, and any open questions will be re-raised to prompt the user.
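The following is a simplified sketch of the arbitration behaviour described above, in which a new task suspends the currently running one and the suspended task is resumed (re-raising any open question) once the new task finishes; the class and method names are illustrative and do not reflect the actual MuMMER/Alana interfaces.

```python
class Task:
    """A long-running activity (e.g. quiz, route guidance) that can be paused."""
    def __init__(self, name, open_question=None):
        self.name = name
        self.open_question = open_question  # question to re-raise when resumed

    def pause(self):
        print(f"[{self.name}] paused")

    def resume(self):
        print(f"[{self.name}] resumed")
        if self.open_question:
            print(f"[{self.name}] re-raising: {self.open_question}")

class Arbiter:
    """Keeps a stack of suspended tasks; only the task on top is active."""
    def __init__(self):
        self.stack = []

    def start(self, task):
        if self.stack:
            self.stack[-1].pause()   # suspend whatever is currently running
        self.stack.append(task)

    def finish(self):
        self.stack.pop()             # active task completed or was stopped
        if self.stack:
            self.stack[-1].resume()  # return to the suspended task

# Example: a quiz is interrupted by a route-guidance request, then resumed.
arbiter = Arbiter()
arbiter.start(Task("quiz", open_question="What is the capital of Finland?"))
arbiter.start(Task("route_guidance"))
arbiter.finish()  # guidance done -> the quiz resumes and re-raises its question
```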
Route guidance supervision
One of the core tasks for the MuMMER robot in the mall is guiding users to specific locations in the mall, by pointing at places and explaining the route to the desired location. The task is triggered when a human asks for a location. A supervision system based on Jason (Bordini, Hübner, and Wooldridge 2007), a BDI agent-oriented framework, handles the execution through Jason reactive plans. Throughout the task, the robot supervises the execution and, depending on what goes wrong, the robot has multiple possible responses. For example, it is able to handle nominal scenarios of route guidance while being able to take into account contingencies such as the human's lack of visibility of the direction, his/her ability or not to take the stairs, his/her understanding of the message, etc. Finally, if at some point the human is not perceived for a certain time, the robot ends the task, assuming that the human has left.
Route computing and route verbalization
The entire description of the route, from the search for the best route to the final destination to the verbalization of this route, is based on the SSR (Semantic Spatial Representation) (Sarthou, Alami, and Clodic 2019b). This representation is used to describe the topology of an indoor environment as well as semantic information (such as the type of stores or the items sold by stores) in a single ontology. This ontology is managed by Ontologenius (Sarthou, Alami, and Clodic 2019a), a lightweight open-source ROS-compatible package which stores semantic knowledge, reasons with it, and shares that information with all the other system components.
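As an illustration of route computation and verbalization over such a representation, the following sketch performs a breadth-first search over a toy topological graph and produces a naive spoken description; the place names and graph structure are invented for the example, and the real system queries the SSR ontology through Ontologenius rather than a Python dictionary.

```python
from collections import deque

# Simplified topology: each place maps to the places directly reachable from it.
TOPOLOGY = {
    "central_square": ["corridor_a", "stairs_1"],
    "corridor_a": ["central_square", "pharmacy"],
    "stairs_1": ["central_square", "upper_floor"],
    "upper_floor": ["stairs_1", "toy_shop"],
    "pharmacy": ["corridor_a"],
    "toy_shop": ["upper_floor"],
}

def shortest_route(start, goal):
    """Breadth-first search over the topological graph."""
    queue, visited = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in TOPOLOGY.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

def verbalize(route):
    """Turn a route into a simple spoken description."""
    steps = [f"go through {p.replace('_', ' ')}" for p in route[1:-1]]
    return "To reach {}, {}.".format(route[-1].replace("_", " "),
                                     ", then ".join(steps) or "it is right here")

print(verbalize(shortest_route("central_square", "toy_shop")))
# -> "To reach toy shop, go through stairs 1, then go through upper floor."
```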
Geometric reasoning
Geometric reasoning uses Underworlds (Lemaignan et al. 2018; Sallami et al. 2019), a lightweight framework for cascading spatio-temporal situation assessment in robotics. It represents the environment as real-time distributed data structures containing scene graphs (for the representation of 3D geometries). Underworlds supports cascading representations: the environment is viewed as a set of worlds that can each have different spatial granularities and may inherit from each other. It also provides a set of high-level client libraries and tools to introspect and manipulate the environment models. Based on a 3D model of the mall (Figure 6), it maintains what the robot knows about the scene as well as alternative world states. These states represent the estimation of the human's beliefs about the scene. It also provides the symbolic relations among entities with stamped predicates, e.g. [isInsideArea(person, area)], or [isSpeakingTo(X, Y)] when X speaks and looks at Y (given by perception).
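The following sketch shows how such stamped predicates could be derived from simple 2D geometry; the predicate names follow the text, but the data structures and the point-in-polygon test are illustrative rather than the Underworlds API.

```python
import time

def is_inside_area(person_xy, area_polygon):
    """Point-in-polygon test (ray casting) for isInsideArea(person, area)."""
    x, y = person_xy
    inside = False
    n = len(area_polygon)
    for i in range(n):
        x1, y1 = area_polygon[i]
        x2, y2 = area_polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def is_speaking_to(speaking, gaze_target, other):
    """isSpeakingTo(X, Y) holds when X speaks and looks at Y (from perception)."""
    return speaking and gaze_target == other

def stamp(predicate, *args, holds=True):
    """Attach a timestamp so downstream components can reason about recency."""
    return {"predicate": predicate, "args": args, "holds": holds,
            "stamp": time.time()}

# Example: a person standing inside the interaction area, speaking to the robot.
area = [(0, 0), (4, 0), (4, 4), (0, 4)]
facts = [
    stamp("isInsideArea", "person_1", "interaction_area",
          holds=is_inside_area((1.5, 2.0), area)),
    stamp("isSpeakingTo", "person_1", "robot",
          holds=is_speaking_to(True, "robot", "robot")),
]
print(facts)
```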
Motion planning

The navigation of the robot is implemented using the ROS navigation stack, with navfn as the global planner and a Timed Elastic Band (TEB) planner (Rösmann, Hoffmann, and Bertram 2017) as the local planner. For MuMMER, the local planner was modified to accommodate humans in planning, inspired by (Khambhaita and Alami 2017), resulting in a new planner called Social TEB (S-TEB). This algorithm is able to plan and execute trajectories while ensuring satisfaction of the robot's kinematic constraints, avoiding static and moving non-human obstacles, and planning navigation solutions that respect social constraints with the perceived humans. The planner ensures the safety of humans by re-planning a local plan at each control loop.
SVP planner
Figure 6: Visualization of the visibility grids of a landmark on the 3D model of the central square of the Ideapark shopping center.

Although the target robot location in the mall is in a large square, several elements of the environment can block the visibility of important landmarks needed for a proper understanding of the route to take. The purpose of the SVP (Shared Visual Perspective) planner (Waldhart, Clodic, and Alami 2019) is therefore to find a position to which the human should go in order to observe an element of the environment such as a passage, a staircase, or a store. To do this, a visibility grid is computed for each possible landmark, as shown in Figure 6. Having determined a good position for the human, the planner also determines a good position for the robot, so as to obtain a human-robot-landmark configuration that allows the robot both to point at the landmark and to look at the human.
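The following sketch illustrates the kind of selection the SVP planner performs: pick a position for the human with good landmark visibility, then a robot position from which both the landmark and the human are visible; the grid contents and the visibility predicate are toy placeholders, not the actual planner.

```python
def choose_positions(visibility_grid, human_candidates, robot_candidates, can_see):
    """Pick a position for the human with the best landmark visibility, then a
    robot position from which both the landmark and the human are visible."""
    human = max(human_candidates, key=lambda c: visibility_grid.get(c, 0.0))
    robot = next((r for r in robot_candidates
                  if can_see(r, "landmark") and can_see(r, human)), None)
    return human, robot

# Toy example: visibility scores per cell and a trivial line-of-sight predicate.
grid = {(2, 2): 0.1, (3, 3): 0.6, (4, 4): 0.9}
human, robot = choose_positions(
    grid,
    human_candidates=[(2, 2), (3, 3), (4, 4)],
    robot_candidates=[(4, 3), (1, 1)],
    can_see=lambda frm, target: frm != (1, 1),  # pretend (1, 1) is occluded
)
print(human, robot)  # -> (4, 4) (4, 3)
```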
Deployment
A long-term deployment (three months from September 2019) will allow the study of customer behaviour around a helpful and entertaining robot over an extended period of time. This section gives details of the hardware and software being employed in the final deployment, as well as the scenarios that are supported.
Setup
The fully integrated MuMMER system consists of several hardware components to allow the computation to be performed on the appropriate platforms.

The robot we are using is an updated, custom version of the Pepper platform, which is equipped with an Intel D435 camera and an NVIDIA Jetson TX2 in addition to the traditional sensors found on previous versions of the robot. We use the Robot Operating System (ROS) to enable the communication between the processing nodes. All the streams (audio, video, robot states) are sent to a remote laptop which performs all the computation. The laptop has an NVIDIA RTX 2080 graphics card (for the deep learning part) and 12 CPU cores. The perception algorithms process the Intel images at a resolution of × for the detection and tracking parts, and at a resolution of × for the re-identification part, which enables fast tracking and a good re-identification quality with OpenFace. The 4 microphone streams are processed at a frequency of Hz, and the full perception system delivers its output at 10 fps.

To transcribe the user's speech signal we use the Google Automatic Speech Recognition (ASR) API (https://cloud.google.com/speech-to-text/), which receives an enhanced audio signal from a delay-and-sum beamformer based on the location of the speaking person determined by the audio-visual sensing. A dedicated ROS node streams the audio to the ASR, which in real time returns an incrementally updated string transcribing the utterance. Using silence to mark the end of speech, this transcription is enhanced using the context of the sentence to provide a more coherent result. Finally, the text output of the Google ASR is sent to the Alana framework to perform the dialogue task, through the arbitration module as explained above.

The system is deployed in two languages, English and Finnish, though due to the vast linguistic differences between the two languages, the two versions have been kept separate, and the whole interaction can be either in one or the other. Due to the complexity of the NLU module, in the Finnish version of the system the user's utterance is translated into English using the Google Translate API (https://cloud.google.com/translate/). The result of this translation is sent to the Alana conversational framework and goes through the NLU pipeline described in detail in (Curry et al. 2018). In the English version of the system, Alana then returns the reply to be verbalised. Due to the relatively poor performance of Google Translate when it comes to translating English into Finnish (as remarked upon by our Finnish partners), the Finnish version of Alana has a much reduced set of bots in its ensemble (see Figure 5). These bots mainly return answers based on templates that have been translated into Finnish beforehand.
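The following sketch shows how a node can stream audio to the Google Cloud Speech-to-Text API and receive incrementally updated transcriptions, as described above; it omits the ROS plumbing, assumes a hypothetical audio_chunks() generator fed from the beamformed audio stream, and the exact client calls may differ between versions of the google-cloud-speech library.

```python
from google.cloud import speech

def transcribe_stream(audio_chunks, language_code="en-GB", sample_rate=16000):
    """Stream raw audio chunks to the Google ASR and yield interim transcripts."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=sample_rate,
        language_code=language_code,
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=True)
    requests = (speech.StreamingRecognizeRequest(audio_content=chunk)
                for chunk in audio_chunks)
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            # Interim hypotheses are refined until is_final marks end of speech.
            yield result.alternatives[0].transcript, result.is_final

# Usage (audio_chunks() would be fed from the beamformed ROS audio stream):
# for text, final in transcribe_stream(audio_chunks()):
#     if final:
#         send_to_dialogue(text)  # hypothetical hand-over to the Alana pipeline
```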
Scenarios deployed

As a proof of concept, a real-time autonomous system has been built to integrate all the components described in the sections above. The following types of interactions can be triggered by the user:
Chat
The staple of the interaction is social dialogue (Curry et al. 2018). During all other modes of interaction, the user can always default to simply chatting with the robot, irrespective of whether it is currently executing a goal/task (e.g. the user requires guidance to a specific shop) or not. For example, the user might approach the robot and start discussing various topics. At specific points throughout this conversation, the system might explain its capabilities to the user in order to recover from a conversational stalemate, or simply to make them aware of the fact that it can also be helpful in finding their way around the mall (see below).
Quiz
In this scenario, multiple-choice questions are asked by the robot, and the human replies by stating the number of the answer they think is correct.
Route Description (Dialogue only)
When the human asks how to get to a specific shop, the robot gives him/her the route description. In this most "simple" form, the system uses only verbal interaction. This means that the route description is merely presented as a string of synthesised text.
Route Guidance (Dialogue + Pointing)

In this version, the robot guides the human to specific locations in the mall by pointing at places and explaining the route to the desired location. To do so, the robot first computes positions so that the human will be able to see what the robot is pointing at for him/her. Then, the robot navigates to its position (this part is optional), expecting that the human will join it once it has stopped, and checks the human's visibility. Then, the robot explains to the human how to reach the destination. Following a human-human guidance study (Belhassein et al. 2017), the robot points first in the direction of the location and then points at the access point (a corridor, stairs, or an escalator) to go through to reach the location. While pointing, the robot verbalizes the route. Finally, the robot checks that the human knows how to reach the goal and leaves open the possibility to repeat if needed. Throughout the task, the robot supervises it and adapts accordingly.

All these modes of interaction can be interleaved at the user's discretion. This means, for example, that during the quiz the user could revert to social dialogue. If they do so, the system might occasionally try to bring them back to the quiz by re-raising the last question. The same holds true if the person chooses to abandon a route guidance task before it is finished.
Conclusions
The MuMMER project has built a fully autonomous entertainment robot to perform HRI scenarios in a shopping mall, in which the main goal is entertaining interaction (quiz, chat) as well as route guidance. The system runs in real time by delegating the heavy deep learning computation to a remote laptop, the ASR to the Google platform, and the Alana conversational AI system to a remote server. This system enables natural interaction with the participants; it has been tested in real conditions over several short sessions, and as of September 2019 it is fully deployed for a three-month long-term user study.

Further work towards large-scale deployment could include software optimizations to run more components on the robot itself, and to reduce the lag which sometimes exists between the human's speech and the robot's reply.
Acknowledgements
This research has been partially funded by the European Union's Horizon 2020 research and innovation programme under grant agreement no. 688147 (MuMMER, http://mummer-project.eu/).

References

[Amos, Ludwiczuk, and Satyanarayanan 2016] Amos, B.; Ludwiczuk, B.; and Satyanarayanan, M. 2016. OpenFace: A general-purpose face recognition library with mobile applications. Technical Report CMU-CS-16-118, CMU School of Computer Science.

[Bastianelli et al. 2016] Bastianelli, E.; Croce, D.; Vanzo, A.; Basili, R.; and Nardi, D. 2016. A discriminative approach to grounded spoken language understanding in interactive robotics. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, 2747–2753.

[Belhassein et al. 2017] Belhassein, K.; Clodic, A.; Cochet, H.; Niemelä, M.; Heikkilä, P.; Lammi, H.; and Tammela, A. 2017. Human-Human Guidance Study. Technical Report 17596, LAAS.

[Bordini, Hübner, and Wooldridge 2007] Bordini, R. H.; Hübner, J. F.; and Wooldridge, M. 2007. Programming Multi-Agent Systems in AgentSpeak Using Jason (Wiley Series in Agent Technology). USA: John Wiley & Sons, Inc.

[Cao et al. 2017] Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7291–7299.

[Cao, Canévet, and Odobez 2018] Cao, Y.; Canévet, O.; and Odobez, J.-M. 2018. Leveraging convolutional pose machines for fast and accurate head pose estimation. In IEEE/RSJ International Conference on Intelligent Robots and Systems.

[Craenen et al. 2018a] Craenen, B.; Deshmukh, A.; Foster, M. E.; and Vinciarelli, A. 2018a. Do we really like robots that match our personality? The case of Big-Five traits, Godspeed scores and robotic gestures. In Proceedings of the 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN).

[Craenen et al. 2018b] Craenen, B. G.; Deshmukh, A.; Foster, M. E.; and Vinciarelli, A. 2018b. Shaping gestures to shape personalities: The relationship between gesture parameters, attributed personality traits, and Godspeed scores. In Proceedings of the 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 699–704.

[Curry et al. 2018] Curry, A. C.; Papaioannou, I.; Suglia, A.; Agarwal, S.; Shalyminov, I.; Xu, X.; Dušek, O.; Eshghi, A.; Konstas, I.; Rieser, V.; et al. 2018. Alana v2: Entertaining and informative open-domain social dialogue using ontologies and entity linking. Alexa Prize Proceedings.

[Deshmukh, Foster, and Mazel 2019] Deshmukh, A.; Foster, M. E.; and Mazel, A. 2019. Contextual non-verbal behaviour generation for humanoid robot using text sentiment. In Proceedings of the 28th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN).

[Dinarelli et al. 2009] Dinarelli, M.; Quarteroni, S.; Tonelli, S.; Moschitti, A.; and Riccardi, G. 2009. Annotating spoken dialogs: From speech segments to dialog acts and frame semantics. In Proceedings of SRSL 2009, the 2nd Workshop on Semantic Representation of Spoken Language, 34–41.

[Dondrup et al. 2017] Dondrup, C.; Papaioannou, I.; Novikova, J.; and Lemon, O. 2017. Introducing a ROS based planning and execution framework for human-robot interaction. In Proceedings of the 1st ACM SIGCHI International Workshop on Investigating Social Interactions with Artificial Agents, ISIAA 2017, 27–28.

[Foster et al. 2016] Foster, M. E.; Alami, R.; Gestranius, O.; Lemon, O.; Niemelä, M.; Odobez, J.-M.; and Pandey, A. K. 2016. The MuMMER project: Engaging human-robot interaction in real-world public spaces. In Social Robotics, 753–763. Cham: Springer International Publishing.

[Hastie et al. 2016] Hastie, H.; Dente, P.; Küster, D.; and Kappas, A. 2016. Sound emblems for affective multimodal output of a robotic tutor: A perception study. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, 256–260.

[He, Motlicek, and Odobez 2018a] He, W.; Motlicek, P.; and Odobez, J.-M. 2018a. Deep neural networks for multiple speaker detection and localization. In IEEE International Conference on Robotics and Automation (ICRA), 74–79.

[He, Motlicek, and Odobez 2018b] He, W.; Motlicek, P.; and Odobez, J.-M. 2018b. Joint localization and classification of multiple sound sources using a multi-task neural network. In Proceedings of Interspeech 2018, 312–316.

[He, Motlicek, and Odobez 2019] He, W.; Motlicek, P.; and Odobez, J.-M. 2019. Adaptation of multiple sound source localization neural networks with weak supervision and domain-adversarial training. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 770–774.

[Heikkilä et al. 2019] Heikkilä, P.; Niemelä, M.; Belhassein, K.; Sarthou, G.; Tammela, A.; Clodic, A.; and Alami, R. 2019. Should a robot guide like a human? A qualitative four-phase study of a shopping mall robot. In International Conference on Social Robotics (ICSR).

[Heikkilä, Lammi, and Belhassein 2018] Heikkilä, P.; Lammi, H.; and Belhassein, K. 2018. Where can I find a pharmacy? – Human-driven design of a service robot's guidance behaviour. In Proceedings of PubRob 2018.

[Khalidov and Odobez 2017] Khalidov, V., and Odobez, J.-M. 2017. Real-time multiple head tracking using texture and colour cues. Idiap Research Report Idiap-RR-02-2017, Idiap.

[Khambhaita and Alami 2017] Khambhaita, H., and Alami, R. 2017. Viewing robot navigation in human environment as a cooperative activity. In International Symposium on Robotics Research (ISRR 2017), 18p.

[Lemaignan et al. 2018] Lemaignan, S.; Sallami, Y.; Wallbridge, C.; Clodic, A.; Belpaeme, T.; and Alami, R. 2018. UNDERWORLDS: Cascading situation assessment for robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[Lemon et al. 2002] Lemon, O.; Gruenstein, A.; Battle, A.; and Peters, S. 2002. Multi-tasking and collaborative activities in dialogue systems. In Proceedings of the 3rd SIGdial Workshop on Discourse and Dialogue - Volume 2, 113–124.

[Papaioannou et al. 2017] Papaioannou, I.; Cercas Curry, A.; Part, J.; Shalyminov, I.; Xinnuo, X.; Yu, Y.; Dusek, O.; Rieser, V.; and Lemon, O. 2017. Alana: Social dialogue using an ensemble model and a ranker trained on user feedback. Alexa Prize Proceedings.

[Papaioannou, Dondrup, and Lemon 2018] Papaioannou, I.; Dondrup, C.; and Lemon, O. 2018. Human-robot interaction requires more than slot filling: multi-threaded dialogue for collaborative tasks and social conversation. In FAIM/ISCA Workshop on Artificial Intelligence for Multimodal Human Robot Interaction, 61–64.

[Rastogi, Gupta, and Hakkani-Tur 2018] Rastogi, A.; Gupta, R.; and Hakkani-Tur, D. 2018. Multi-task learning for joint language understanding and dialogue state tracking. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, 376–384.

[Rösmann, Hoffmann, and Bertram 2017] Rösmann, C.; Hoffmann, F.; and Bertram, T. 2017. Integrated online trajectory planning and optimization in distinctive topologies. Robotics and Autonomous Systems.

[Sallami et al. 2019] Sallami, Y.; et al. 2019. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019). To appear.

[Sarthou, Alami, and Clodic 2019a] Sarthou, G.; Alami, R.; and Clodic, A. 2019a. Ontologenius: A long-term semantic memory for robotic agents. In 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN).

[Sarthou, Alami, and Clodic 2019b] Sarthou, G.; Alami, R.; and Clodic, A. 2019b. Semantic Spatial Representation: a unique representation of an environment based on an ontology for robotic applications. In SpLU-RoboNLP 2019, 50–60.

[Sheikhi and Odobez 2015] Sheikhi, S., and Odobez, J.-M. 2015. Combining dynamic head pose and gaze mapping with the robot conversational state for attention recognition in human-robot interactions. Pattern Recognition Letters.

[Vanzo, Bastianelli, and Lemon 2019] Vanzo, A.; Bastianelli, E.; and Lemon, O. 2019. Hierarchical multi-task natural language understanding for cross-domain conversational AI: HERMIT NLU. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, to appear. Stockholm, Sweden: Association for Computational Linguistics.

[Vinciarelli, Pantic, and Bourlard 2009] Vinciarelli, A.; Pantic, M.; and Bourlard, H. 2009. Social signal processing: Survey of an emerging domain. Image and Vision Computing.