Are you still with me? Continuous Engagement Assessment from a Robot's Point of View
Francesco Del Duchetto
L-CAS, University of Lincoln
Lincoln, [email protected]
Paul Baxter
L-CAS, University of Lincoln
Lincoln, [email protected]
Marc Hanheide
L-CAS, University of Lincoln
Lincoln, [email protected]
ABSTRACT
Continuously measuring the engagement of users with a robot in a Human-Robot Interaction (HRI) setting paves the way towards in-situ reinforcement learning, improved metrics of interaction quality, and can guide interaction design and behaviour optimisation. However, engagement is often considered very multi-faceted and difficult to capture in a workable and generic computational model that can serve as an overall measure of engagement. Building upon the intuitive ways humans can successfully assess a situation for its degree of engagement when they see it, we propose a novel regression model (utilising CNN and LSTM networks) enabling robots to compute a single scalar engagement value during interactions with humans from standard video streams, obtained from the point of view of an interacting robot. The model is based on a long-term dataset from an autonomous tour guide robot deployed in a public museum, with continuous annotation of a numeric engagement assessment by three independent coders. We show that this model not only predicts engagement very well in our own application domain, but also transfers successfully to an entirely different dataset (with different tasks, environment, camera, robot and people). The trained model and the software are available to the HRI community as a tool to measure engagement in a variety of settings.
KEYWORDS
engagement, machine learning, tools for HRI, long-term autonomy
One of the key challenges for long-term interaction in human-robot interaction (HRI) is to maintain user engagement and, in particular, to make a robot aware of the level of engagement humans display as part of an interactive act. With engagement being an inherently internal mental state of the human(s) interacting with the robot, robots (and observing humans, for that matter) have to resort to the analysis of external cues (vision, speech, audio). In the research program that informed the aims of this paper, we are working to close the loop between the users' perception of the robot and their engagement with it on one side, and our robot's behavior during real-world interactions on the other, i.e., to improve the robot's planning and action over time using the responses of the interacting humans. The estimation of users' engagement is hence considered an important step in the direction of automatic assessment of the robot's own behaviours in terms of its social and communicative abilities, in order to facilitate in-situ adaptation and learning. In the context of reinforcement learning, a scalar measure of engagement can directly be interpreted as a reinforcement signal that can eventually be used to govern the learning of suitable actions in the robot's operational situation and environment. As a guiding principle (and indeed a working hypothesis), we anticipate that higher and sustained engagement with a robot can be interpreted as a positive reinforcement of the robot's actions, allowing it to improve its behavior in the long term.
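As a purely illustrative sketch of how such a scalar score could act as a reinforcement signal, the snippet below uses it as the reward in a simple bandit-style action-value update; the behaviours, the update rule and the `estimate_engagement` call are hypothetical and are not part of the system described in this paper.

```python
# Hypothetical sketch: a scalar engagement score in [0, 1] used as a reward signal.
import random
from collections import defaultdict

ACTIONS = ["tell_anecdote", "move_to_next_exhibit", "ask_question"]  # illustrative behaviours
q_values = defaultdict(float)   # running estimate of expected engagement per behaviour
counts = defaultdict(int)
EPSILON = 0.1                   # exploration rate

def choose_action():
    """Epsilon-greedy selection over the behaviour set."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values[a])

def update(action, engagement):
    """Incremental mean update: the engagement score acts directly as the reward."""
    counts[action] += 1
    q_values[action] += (engagement - q_values[action]) / counts[action]

# During an interaction (pseudo-usage):
#   action = choose_action()
#   engagement = estimate_engagement(video_window)  # scalar in [0, 1] from the model
#   update(action, engagement)
```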
Figure 1: Engagement annotated values and our model's predictions over a guided tour interaction sequence recorded from our robot's head camera.

Previous work on robot deployment in museum contexts [7] provides evidence of how user engagement during robot-guided tours easily degrades over time when employing an open-loop interactive behavior which does not take into account the engagement state of the other (human) parties.

However, we argue that the usefulness of a scalar measure of engagement as presented in this paper stretches far beyond our primary aim to use it to guide learning. Work in many application domains of HRI [3, 4, 23] has focused on a measure of engagement to inform the assessment of the implementation for a specific use-case, or to guide a robot's behavior. However, how engagement is measured and represented varies greatly (see Sec. 2) and there is yet to be found a generally applicable measure of engagement that readily lends itself to guide the online selection of appropriate behavior, learning, adaptation, and analysis. Based on the observation that engagement as a concept is often quite intuitive for humans to assess implicitly, but inherently difficult to formalize into a simple and universal computational model, we propose to employ a data-driven machine learning approach which exploits this implicit awareness of humans in assessing an interaction situation. Consequently, instead of aiming to comprehensively model and describe engagement as a multi-factored analysis, we use end-to-end machine learning to directly learn a regression model from video frames onto a scalar in the range of 0% to 100%, and use a richly annotated dataset obtained from a long-term deployment of a robot tour guide in a museum to train said model.

For a scalar engagement measure to be useful in actual HRI scenarios, we postulate that a few requirements have to be fulfilled. In particular, the proposed solution should

• demonstrably generalize to new unseen people, environments, and situations;
• operate from a robot's point of view, forgoing any additional sensors in the environment;
• employ a sensing modality that is readily available on a variety of robot platforms;
• have few additional software dependencies to maximize community uptake; and
• operate with modest computational resources at soft real-time.

Consequently, we present our novel engagement model, solely operating on first-person (robot-centric) point of view video of a robot, and prove its applicability not only in our own scenario but also on a publicly available dataset (UE-HRI) without any transfer learning or adaptation necessary. We demonstrate that the model can operate at typical video frame rates on average GPU hardware typically found on robots.
Hence, the core contributions of this paper can be summarised as:

(i) the appraisal of a scalar engagement score for the purpose of in-situ learning, adaptation, and behavior generation in HRI;
(ii) a proposed end-to-end deep learning architecture for the regression of a first-person view video stream onto scalar engagement factors in real-time;
(iii) the comprehensive assessment of the proposed model on our own long-term dataset, and on a publicly available HRI dataset, proving the generalizing capabilities of the learned model; and
(iv) the availability of an implementation and trained model to provide the community with an easy to use, out of the box methodology to quantify engagement from first-person view video of an interactive robot.

Recognizing the level of engagement of the humans during interactions is an important capability for social robots. In the first place, we want to recognize the level of engagement as a way to assess the robot's behavior. Feeding this information to a learning system, we can improve the robot's behavior to maximize the level of engagement. In an education scenario, such as a museum, being able to engage the users is a crucial factor. It is known that a higher level of engagement generates better learning outcomes [21], while engagement with a robot during a learning activity has also been shown to have a similar effect [13]. While there is evidence that the presence of a robot, particularly when novel, is sufficient in itself for higher engagement in educational STEM activities, e.g. [2], the focus in the present work is on engagement between individuals and the robot within a direct (social) interaction, for which there is no universally agreed definition [12].

Within interactions, engagement has been characterized as a process that can be separated into four stages: point of engagement, period of sustained engagement, disengagement, and re-engagement [20]. Context has also been identified as being of importance, in terms of the task and environment, as well as the social context [5]. For example, [19] proposes a simple model to infer engagement for a robot receptionist based on the person's spatial position within some predefined areas around the robot, and [25] studies to what extent it is possible to predict the engagement of an entity relying solely on the features of the other parties of the interaction, showing that engagement, and the features needed to detect it, change with the context of the interaction [26]. These examples furthermore suggest that there are multiple, overlapping, and likely interacting timescales involved in the characterization of engagement, from the longer-term context to short interaction-orientated behaviours that nevertheless impact social dynamics, and which humans are particularly receptive to [9].

In the context of the characterization of engagement above, a number of approaches to the automatic assessment of engagement may be distinguished. On the one hand, there is a focus on individual behavioral cues, which may be integrated to form a characterization of engagement. On the other hand, a more holistic perspective of engagement is taken, where proxy metrics may be used or direct measures of engagement estimated. Combinations of these perspectives are summarised briefly below.

Work on characterizing engagement in both human-human and human-robot interactions has identified human gaze as being of particular significance when determining engagement levels in an interaction, e.g. [16, 22].
Gaze thus forms an important behavioral cue when assessing engagement, e.g. [3, 27]. For example, Lemaignan et al. [18] do not try to directly define and detect engagement, recognizing that it is a complex and broad concept. Instead, the concept of "with-me-ness" is introduced, which is the extent to which the human is "with" the robot during the interactions, and which is based on the human's gaze behavior.

Beyond only human gaze behavior, Foster et al. [11], for example, address the task of estimating the engagement state of customers for a robot bartender based on data from audiovisual sensors. They test different approaches, reporting that the rule-based classifier shows competitive performance with the trained ones and could actually be preferred for its stability (and to overcome data-scarcity problems).

In addition to these explicitly cue-centred approaches, more recently attempts have been made to leverage the power of machine learning to discover the important overtly visible features with minimal (or at least sparse) explicit guidance from humans (through cue identification, for example). For example, [29] use an active learning approach with Deep RL to automatically (and interactively) learn the engagement level of children interacting with a robot from raw video sequences. The learning is incremental and allows for real-time updates of the estimates, so that the results can be adapted to different users or situations. The DQN is initially trained with videos labeled with engagement values. In other work, [24] investigate the performance of deep learning models in the task of automated engagement estimation from face images of children with autism using a novel deep learning model, named CultureNet, which efficiently leverages multi-cultural data when performing the adaptation of the proposed deep architecture to the target culture and child, although this is based on a dataset of static images rather than real-time data.

These deep learning methods have the advantage that the constituent features of interest do not have to be explicitly defined a priori by the system designer; rather, only the (hidden) phenomenon needs to be annotated: engagement in this case. Since social engagement within interactions is readily recognized by humans based on visible information (see discussion above), human coding of engagement provides a promising source of ground-truth information. Indeed, in this context, [28] employed human coders to assess the 'quality' of observed interactions, demonstrating good agreement between coders on what was a subjective metric.

Taken together, the literature indicates that while a precise operational definition of engagement may not be universally agreed, more holistic perspectives may be more insightful. It is likely that while gaze is an important cue involved in making this assessment, there are other contextual factors that influence the interpretation of engagement. Given that humans are naturally able to accurately assess engagement in interactions, one promising possibility is to leverage this ability to directly inform automated systems.
This work is embedded in a research program that seeks to employ online learning and adaptation of an autonomous mobile robot to deliver tours in a museum context. The robotic platform, described below, has been operating autonomously in this environment for an extended period of time, as evidenced by the long-term autonomy metrics (Table 1). The goal is to facilitate the visitors' engagement with the museum's display of art and archaeology. This project provides an opportunity to study methodologies to equip the robot with the ability to interact socially with the visitors. In particular, the research aims to find a good model to allow the robot to do the correct thing at the right moment, in terms of social interaction. The first step in doing so is endowing the robot with a means of assessing its own performance at any given moment to allow adaptation and learning, and to avoid repeating the same errors.
The robot is a Scitos G5 robot manufactured by MetraLabs GmbH. It is equipped with a laser scanner with a 270° scan angle on its base and two depth cameras. An Asus Xtion depth camera is mounted on a pan-tilt unit above its head and a Realsense D415 is mounted above the touchscreen at an angle of 50° w.r.t. the horizontal plane in order to face the people standing in front of the robot. The interactions with the visitors are mediated through a touchscreen, two speakers, a microphone array and a head with two eyes that can move with five degrees of freedom to provide human-like expressions. To ensure safe operation in public environments, the robot is equipped with an array of bumpers around the circular base with sensors to detect collisions and two easily reachable emergency buttons that, when activated, cut the power to the motors. The software framework is based on ROS and uses the STRANDS project [14] core modules for topological navigation, people tracking, task scheduling and data collection.

Table 1: Long-Term Autonomy metrics: total system lifetime (TSL - how long the system is available for autonomous operation), and autonomy percentage (A% - duration the system was actively performing tasks as a proportion of the time it was allowed to operate autonomously), following [14].

Days of operation: 103 days
Total distance travelled: 299 km
Total tasks completed: 8423
TSL: 26 days, 11 hours
A%: 74%
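For illustration only, the two long-term autonomy metrics in Table 1 can be expressed over lists of time intervals taken from the robot's operation logs; the interval-based log format assumed here is hypothetical and not the robot's actual logging format.

```python
# Sketch of the long-term autonomy metrics following the definitions in Table 1 / [14].
def total_duration(intervals):
    """Sum of (start, end) interval lengths, in seconds."""
    return sum(end - start for start, end in intervals)

def autonomy_metrics(available_intervals, active_task_intervals):
    """TSL: time available for autonomous operation; A%: active task time as a share of TSL."""
    tsl = total_duration(available_intervals)
    active = total_duration(active_task_intervals)
    a_percent = 100.0 * active / tsl if tsl > 0 else 0.0
    return tsl, a_percent
```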
The data gathered so far spans the date range between the 24th January 2019 (the day on which we started recording data of the robot operations) and the 9th May 2019, with data collection remaining ongoing. The work and the data recording exercise have been approved by the University of Lincoln's Ethics Board, under approval ID "COSREC509". The ethical approval does not allow the public release of any data that can feature identifiable persons, in particular video data. During the current deployment the robot performs mainly two types of interactive task: guided tour and go to exhibit and describe. In the first task the robot guides the users to 5 or 6 exhibits sequentially around the museum, describing what they contain when stopping in front of each. During the second interactive task the robot guides the users to one of the exhibits and, when arrived at the destination, describes the content it is showing.
The TOur GUide RObot (TOGURO) dataset was collected from the two cameras mounted on the robot's body and head, each providing a stream of RGB and depth frames. These video streams were collected from the start until the end of each guided tour and go to exhibit and describe task. Considering the large number of videos to be stored, we saved the frame streams as compressed MPEG video files directly during the interaction. Moreover, we store, in an additional file, the ROS timestamp at the time each frame is received by the video recorder node. This allows us to reconstruct afterwards the alignment between the different video streams frame by frame. The participants were aware that the robot was recording data during the interactions (by means of visible signs and leaflets), although they were not informed that the purpose of these data was engagement analysis, thus not biasing their behaviours. In total we collected 703 distinct interactions with a total duration of 40 hours and 17 minutes. As described below (Section 3.4), only a subset of this total data was coded. Given the unconstrained setting, the interactions varied significantly in duration, with the shortest at 1.2 seconds and the longest at 2 hours 40 minutes.
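A minimal sketch of the frame-by-frame re-alignment enabled by the stored timestamps is given below; the timestamp-file format (one ROS time, in seconds, per line) is an assumption made for this example.

```python
# Re-align two recorded streams using the per-frame timestamps saved next to each MPEG file.
import numpy as np

def load_timestamps(path):
    """One timestamp (float seconds) per line, in recording order."""
    with open(path) as f:
        return np.array([float(line) for line in f if line.strip()])

def align_streams(ts_head, ts_torso):
    """For each head-camera frame, return the index of the temporally closest torso-camera frame."""
    idx = np.clip(np.searchsorted(ts_torso, ts_head), 1, len(ts_torso) - 1)
    left, right = ts_torso[idx - 1], ts_torso[idx]
    return np.where(ts_head - left <= right - ts_head, idx - 1, idx)
```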
Figure 2: One frame from a video in the TOGURO dataset recorded from the robot's head camera during a guided tour. The red, green and blue plots at the bottom of the frame each represent a distinct annotation sequence.
Given that the museum in which the robot is deployed is a public space openly accessible to anyone, the interactions between the robot and the museum's visitors are completely unstructured. People walking in the gallery are allowed to roam around the collection or to interact with the robot. When they choose to do so, they do not receive any explicit instruction about how to interact with it, and are not observed by experimenters when doing so.
In order to address the primary research goal, the assessment of robot-centric group engagement, the dataset was manually coded in order to establish a ground truth. As noted previously, given that there is not a universally accepted operationalized definition of engagement, a human observer response method is employed in the present work, following the prior application of a continuous audience response method [28]. The annotations were performed over only the RGB stream of the robot's head camera, not taking into account all four video streams available from the collected data. Similarly to [28], the annotators were asked to indicate in real time how engaged the people interacting with a robot appeared to be in a video captured by the robot (e.g. Figure 2). They operated a dial using a game-pad joystick while watching the interaction videos using the NOVA annotation tool [1] (https://github.com/hcmlab/nova). This procedure allowed the generation of per-frame annotations of the provided videos, with very little time spent on software training (around 20 minutes per annotator) and on the annotation process itself (not more than the duration of the videos). The annotators were instructed by providing them with a demonstration, and a set of annotation rules based on a set of typical examples (available at https://justpaste.it/6p1tb/pdf). Three annotators took part in the coding process; each was familiar with the robot being used and the interaction context. Three subsets of the overall dataset collected were randomly drawn and assigned to the annotators. The subsets were partially overlapping. This was to enable an analysis of inter-rater agreement to assess the reliability of the essentially subjective metric, but also to maximize annotation coverage of the dataset.
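The continuous dial readings can be turned into one value per video frame with a simple zero-order hold, as in the sketch below; the dial-log format of (time, value) samples is an assumption made for this example and is not necessarily the format produced by NOVA.

```python
# Resample dial annotations to per-frame engagement values at the video frame rate (10 fps here).
import numpy as np

def per_frame_annotation(dial_times, dial_values, n_frames, fps=10.0):
    """dial_times: sorted sample times (s); dial_values: dial readings in [0, 1]."""
    frame_times = np.arange(n_frames) / fps
    idx = np.searchsorted(dial_times, frame_times, side="right") - 1  # last reading before each frame
    idx = np.clip(idx, 0, len(dial_values) - 1)
    return np.asarray(dial_values)[idx]
```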
Table 2: Video annotations by annotator (coder); "unique" indicates the length of video coded by a single coder.

Coder1: 66 videos, 3h 59m
Coder2: 40 videos, 2h 55m
Coder3: 40 videos, 2h 23m
Unique: 94 videos, 5h 50m
Total: 146 videos, 9h 17m

The amount of annotated data is summarised in Table 2. The total length of the annotated data was over nine hours, with 3 hours 27 minutes of overlap between the annotators (resulting in 5 hours 50 minutes of unique videos annotated). 94 unique videos were coded by the three annotators, with a total of 146 videos (including repeated annotations) for a total duration of 9 hours and 17 minutes. In total the annotated video set features 227 people (53.74% (122) females and 46.26% (105) males; 60.79% (138) adults and 39.21% (89) minors). The composition of each group of people interacting with the robot is very diverse: on average each video features 2.41 people (0.96 of whom are minors), with the per-video numbers of females, males and adults varying widely between groups.

Spearman's rank correlation (ρ) is employed to assess inter-rater agreement. Table 3 shows the correlation values for each pair of annotators. Since every frame is annotated (with a frame rate of 10 frames per second), the continuous values were smoothed over time using different smoothing constant values (Figure 3). Table 3 provides a summary of these, with overall mean agreement rates at selected representative values of the smoothing constant. While there is some variability in the between-coder agreement, mean values of ρ vary in strength from moderate to strong (0.56 to 0.72). In this regard, there is a trade-off to be made between the smoothing constant size and the apparent agreement between the coders: a larger time window reduces the real-time relevance of the engagement assessment, even though the agreement over extended periods of time is greater than in comparatively shorter windows. Overall, these results indicate that the use of the independently coded data can be considered reliable in terms of the highly variable and subjective metric of engagement.
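A sketch of this agreement analysis is shown below: per-frame annotations from two coders are smoothed over a window of S seconds before computing Spearman's ρ. A moving average is assumed here for the smoothing operator, and the arrays are illustrative.

```python
# Inter-rater agreement between two coders' per-frame annotations (10 fps) at smoothing constant S.
import numpy as np
from scipy.stats import spearmanr

def smoothed_agreement(coder_a, coder_b, s_seconds, fps=10.0):
    window = max(1, int(round(s_seconds * fps)))
    kernel = np.ones(window) / window
    a = np.convolve(coder_a, kernel, mode="valid")  # moving-average smoothing
    b = np.convolve(coder_b, kernel, mode="valid")
    rho, p_value = spearmanr(a, b)
    return rho, p_value

# e.g. agreement over the same video at a few smoothing constants:
# for s in (1.0, 5.0, 20.0):
#     print(s, smoothed_agreement(annotation_coder1, annotation_coder2, s))
```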
Table 3: Spearman's correlation ρ between each pair of coders (Coder1 ↔ Coder2, Coder1 ↔ Coder3, Coder2 ↔ Coder3) at different smoothing constant values S; the correlations are significant for all coder pairs and smoothing constants.

Figure 3: Spearman correlation averaged over coder pairs and weighted by the overlap rate, reported over different smoothing constants S.

Given the ground truth provided by the human-coded engagement levels within interactions with the robot, we propose a deep learning approach for the estimation of human engagement from video sequences. The model is trained end-to-end from the raw images coming from the robot's head camera to predict a high-level engagement score of the people interacting with the robot. It should be noted that this model does not model individual humans in the view of the robot but provides an overall, holistic engagement score.

The network architecture, depicted in Figure 4, is composed of two main modules: a convolutional module which extracts frame-wise image features, and a recurrent module that aggregates the frame features over a time window to produce a temporal feature vector of the scene. The convolutional module is a ResNeXt-50 Convolutional Neural Network (CNN) [30] pre-trained on the ImageNet dataset [17]. We obtain the frame features from the activation of the last fully connected layer of the CNN, with dimension 2048, before the softmax layer. The recurrent module is a single-layer Long Short-Term Memory (LSTM) [15] with 2048 units followed by a Fully Connected (FC) layer of size 2048 × 1.
The LSTM takes as input a sequence of w frame features coming from the convolutional module and produces in turn a feature vector that represents the entire frame sequence, capturing the temporal behavior of humans within the time window w. The temporal features are passed through the FC layer with a sigmoid activation function at the end to produce values y′ ∈ [0, 1]. The recurrent module is trained in our experiments to predict engagement values from the provided annotation values, while the CNN layer is kept fixed. The proposed framework is implemented in Python using the Keras library [6] and will be freely released as a ready-to-use tool to the HRI community.
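A minimal Keras sketch of the recurrent part of the architecture is given below. It assumes the 2048-dimensional per-frame features have already been extracted with the frozen, pre-trained CNN (ResNeXt-50 is not bundled with Keras, so the feature extractor itself is not reproduced here), uses the tf.keras API, and sets the window size to w = 10 as in the experiments.

```python
# Recurrent module of Figure 4: LSTM over per-frame CNN features, then a sigmoid FC layer.
from tensorflow import keras
from tensorflow.keras import layers

W = 10            # frames per input window
FEAT_DIM = 2048   # dimensionality of the pre-computed CNN frame features

def build_engagement_model():
    frame_features = keras.Input(shape=(W, FEAT_DIM))              # sequence of frame features
    temporal = layers.LSTM(2048)(frame_features)                   # temporal feature vector
    engagement = layers.Dense(1, activation="sigmoid")(temporal)   # y' in [0, 1]
    return keras.Model(frame_features, engagement)

model = build_engagement_model()
model.summary()
```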
We train and test the model presented in Section 4 on our own TOGURO dataset, and assess generalization of this model (without modification) on the public UE-HRI dataset in the following subsections.
We used the entire annotated dataset presented in Section 3.4, composed of 94 videos, for a total duration of 5 hours and 50 minutes of interactions. For each video we randomly choose one annotation, if multiple are available from the different coders (see Table 2), in order to avoid repetitions in the data and biasing the model toward those videos that have been annotated multiple times. Each video is then randomly assigned to either the training, test or validation set with a corresponding probability of 50%, 30% and 20%, respectively, to prevent our model from training and testing on data that are closely correlated at the video-frame level. Sampling for the dataset split hence operates at the full-video level, rather than at the frame level. Each video V_k is composed of I_{V_k} frames x_i ∈ V_k for i ∈ 1, . . . , I_{V_k} and has an associated array of annotations A_k = [y_1, . . . , y_{I_{V_k}}], also of dimension I_{V_k}. From all the videos in each set (training/test/validation) we extract all the possible sequences of w consecutive frames X_i = [x_i, . . . , x_{i+w−1}] to be the input samples for our model. Therefore, each sample X_i has an overlap of w − 1 frames with the next sample X_{i+1} from the same video. To each sample X_i we assign the ground truth value y_{i+w−1} ∈ A_k, in order to relate each sequence of frames with the engagement value set at the end of the sequence. After the pre-processing phase over our dataset we obtain 93,271 training samples, 72,146 test samples and 44,581 validation samples. Each frame is reshaped to 224 × 224 pixels and normalized before being fed to the network.
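The window extraction and the video-level split can be sketched as follows; the arrays are illustrative and may hold either raw (resized) frames or pre-computed per-frame features.

```python
# Build (X_i, y_{i+w-1}) samples from one annotated video and split whole videos into sets.
import numpy as np

def make_samples(frames, annotations, w=10):
    """frames: array of shape (I, ...); annotations: array of length I."""
    samples, targets = [], []
    for i in range(len(frames) - w + 1):
        samples.append(frames[i:i + w])          # X_i = [x_i, ..., x_{i+w-1}]
        targets.append(annotations[i + w - 1])   # engagement value at the end of the window
    return np.stack(samples), np.array(targets)

def split_videos(video_ids, seed=0):
    """Assign whole videos to train/test/validation with probabilities 0.5/0.3/0.2."""
    rng = np.random.default_rng(seed)
    assignment = rng.choice(["train", "test", "val"], size=len(video_ids), p=[0.5, 0.3, 0.2])
    return dict(zip(video_ids, assignment))
```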
For training and evaluation we set the window size w equal to 10 frames, in order to have a model that produces an engagement estimate after a relatively short time (i.e. after 1 second). Even though more temporally extended time windows would provide more coherent ground truth values among the different annotators, as discussed in Section 3.4, we decided to sacrifice some accuracy in favour of more real-time model predictions.

Figure 4: Overview of the proposed model. The input is a video stream of interactions between the robot and humans collected in intervals of size w. The frames x_i are passed through the pre-trained CNN (ResNet) producing a per-frame feature vector, which is then passed sequentially to the LSTM network. After w steps the LSTM produces a temporal feature vector which is passed to a FC layer with sigmoid activation to produce an engagement value y for the temporal window.

During training, the weights of the convolutional module, which is already pre-trained, are kept frozen, while the recurrent module is fully trained from scratch. The model is trained to optimize the Mean Squared Error (MSE) regression loss between the prediction values y′_i and the corresponding ground truth values y_i, using the Adagrad optimization algorithm [8] with an initial learning rate lr = 1e−4. At each training epoch we uniformly sample 20% of the training set samples to be used for training and we collect them in batches of size bs = 16. The uniform sampling of the training data is performed in order to reduce training time and limit overfitting [10]. The model was trained for a total of 22 epochs, using early stopping after no improvement on the validation loss.
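A training sketch matching this description is given below. Random arrays stand in for the real feature windows and targets, `build_engagement_model` is the function from the architecture sketch above, and the epoch cap and patience values are illustrative.

```python
# MSE regression with Adagrad (lr = 1e-4), batches of 16, 20% of the training windows sampled
# per epoch, and early stopping on the validation loss.
import numpy as np
from tensorflow import keras

x_train, y_train = np.random.rand(256, 10, 2048).astype("float32"), np.random.rand(256)
x_val, y_val = np.random.rand(64, 10, 2048).astype("float32"), np.random.rand(64)

model = build_engagement_model()
model.compile(optimizer=keras.optimizers.Adagrad(learning_rate=1e-4), loss="mse")

best_val, patience, wait = np.inf, 3, 0
rng = np.random.default_rng(0)
for epoch in range(100):                       # epoch cap for this sketch
    idx = rng.choice(len(x_train), size=int(0.2 * len(x_train)), replace=False)
    model.fit(x_train[idx], y_train[idx], batch_size=16, epochs=1, verbose=2)
    val_loss = model.evaluate(x_val, y_val, verbose=0)
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:                   # early stopping on the validation loss
            break
```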
In order to assess the generalization capabilities of our trained model over different scenarios featuring people interacting with robots, we propose to test the performance of our trained model as a detector of the start and end of interactions over the UE-HRI dataset [4]. Similarly to our dataset, it provides video recordings from the robot's own cameras, allowing for engagement estimation from the robot's point of view. The dataset provides videos of spontaneous interactions between humans and a Pepper (Softbank Robotics) robot alongside annotations of the start/end of interactions and various signs of engagement decrease (Sign of Engagement Decrease (SED), Early sign of future engagement BreakDown (EBD), engagement BreakDown (BD) and Temporary Disengagement (TD)). The UE-HRI dataset features 54 interactions with 36 males and 18 females, where 32 are mono-user and 22 are multi-party.

For a fair comparison with our proposed method, we evaluate the ability of our model to distinguish between the moments during which an interaction is taking place and those in which there is a breakdown (TD or BD), the interaction has not yet started, or it has already ended, in line with the UE-HRI coding scheme. Consequently, we predict engagement values over the RGB image streams from the Pepper robot's front camera. By setting a threshold value thr we convert the predictions y′ into a binary classification C = {⊤, ⊥} (⊤ whenever y′ exceeds thr), which indicates whether there is engagement or not. The categorical predictions are then compared with values from the annotations in the dataset. We consider the ground truth value to be y^t_int = ⊤ if at time t there is an annotation of a Mono or Multi interaction and there are no annotations of BD or TD in the UE-HRI coding. The ground truth value is y^t_int = ⊥ otherwise.
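The resulting per-frame predictions can then be scored against the binary interaction ground truth with a standard ROC analysis, as sketched below; the arrays are illustrative placeholders and scikit-learn is used only for convenience.

```python
# ROC / AUC over per-frame engagement predictions vs. binary interaction ground truth.
import numpy as np
from sklearn.metrics import roc_curve, auc

# Illustrative placeholder arrays (in practice: per-frame values over the UE-HRI videos).
y_true_int = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 0])                    # 1 = interaction, no BD/TD
y_pred = np.array([0.1, 0.3, 0.8, 0.7, 0.9, 0.2, 0.6, 0.75, 0.4, 0.15])  # engagement y'

fpr, tpr, thresholds = roc_curve(y_true_int, y_pred)
print("AUC:", auc(fpr, tpr))

# a single operating point at a chosen threshold thr
thr = 0.5
is_engaged = y_pred >= thr
```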
With our evaluation we set out to provide evidence that our model is able to predict engagement through regression on our own TOGURO dataset, by assessing its accuracy in comparison to the ground-truth annotation, and to assess the generalization ability of the model in newly encountered situations through the analysis of the UE-HRI data.

To show the ability of our framework to map short-term human behavioral features from image sequences into engagement scores, we compute the Mean Squared Error (MSE) prediction loss on our test set as 0.126 (in the context of the [0, 1] interval of expected output), also reported in Table 4. Looking back at Section 1, soft real-time operation is seen as a requirement for the applicability of our model. Hence, we measured the duration of a forward pass of 10 consecutive frames (1 sample) through the convolutional module and the recurrent module on our GPU hardware, taking at most 50 ms (worst case), allowing real-time estimation of engagement at 20 frames per second.

Table 4: Model performance on our TOGURO dataset.

GPU: GeForce GTX 1060
Test loss: 0.126 (MSE)
Prediction time: t ≤ 0.05 sec

Evaluating the power of our approach for binary classification on UE-HRI, as detailed above in Section 5.3, allows us to capture its generalization capabilities. In Figure 5 we report the Receiver Operating Characteristic (ROC) curve obtained by varying the threshold of the binary classification task on the UE-HRI data with values in the range thr ∈ [0, 1]. The Area Under the Curve (AUC = 0.88 in our experiment) reports the probability that our classifier ranks a randomly chosen positive instance y^t_int = ⊤ higher than a randomly chosen negative one y^t_int = ⊥, i.e., it provides a good assessment of the performance of the model on this completely different dataset.

Figure 5: ROC curve generated using our trained model as a classifier of the interaction sessions for the UE-HRI dataset.

This paper has motivated, developed and validated a novel, easy-to-use computational model to assess engagement from a robot's perspective. The results presented in the previous sections lead us to the conclusion that

(i) a moderate to strong inter-rater agreement (see Table 3) in measuring engagement on a [0, 1] interval indicates that humans can reasonably and reliably assess holistic engagement from a robot's point of view solely from video;
(ii) a two-stage deep-learning architecture as presented in Figure 4, trained from our TOGURO dataset, is a suitable computational regression model to capture the inherent human interpretation of engagement provided by the annotators; and that
(iii) the trained model is generic enough to be successfully applied in a completely different scenario, here the UE-HRI dataset, showing applicability of the model also in different environments, on a different robot with a different camera, and with different tasks and people. The area under the ROC curve of 0.88 in Figure 5 evidences that indeed the proposed regression model can serve as a strong discriminator to identify situations of loss of engagement (TD or BD in the UE-HRI coding scheme).

Given these encouraging quantitative results, some qualitative assessments of exemplary frames with the corresponding computed engagement scores are presented in Figures 6, 7 and 8. All figures show examples from the UE-HRI dataset, which was completely absent from the training dataset (Section 5.1). Figure 6 presents two short sequences (roughly 2 seconds apart between frames), showcasing a short-term diversion of the subjects' attention resulting in a temporarily lower engagement score, but not leading to a very low engagement. Figure 7 exemplifies that our model can cope well with perception challenges which would preclude a correct assessment based only on gaze or facial feature analysis. While one could in this context argue that our model has simply learned to detect people, Figure 8 provides three examples from different videos of the UE-HRI dataset with people present in the vicinity of the robot, but not engaging with it. The engagement scores in these examples are significantly lower across all frames. We hypothesize that the learned model does not solely discriminate person and/or face presence, but that the temporal aspects of the humans' behavior observable in the video are captured by the LSTM layer in our architecture well enough to successfully deal with these situations. These qualitative reflections are evidently supported by the quantitative analysis on both datasets, providing us with confidence that the trained model is broadly applicable and can serve as a very useful tool to the HRI community, with its modest computational requirements and high response speed in assessing videos from a robot's point of view.

Figure 6: Example frames from two short UE-HRI sequences (at t = 14m 32s, 14m 35s, 14m 37s and t = 5m 56s, 5m 58s, 5m 59s) with the predicted engagement values y′; the prediction at the frame shown is at the centre of each plot, with past predictions on the left and future predictions on the right.

Figure 7: High engagement predictions in situations difficult to understand using standard face description features. The red plot shows the predicted engagement values over the frame sequence, with the prediction y′ at the frame shown in the centre, past predictions on the left and future predictions on the right.

Figure 8: Low engagement predictions in cases in which the people were not actually engaging with the robot. The red plot shows the predicted engagement values over the frame sequence, with the prediction y′ at the frame shown in the centre, past predictions on the left and future predictions on the right.
ACKNOWLEDGMENTS
We thank the annotators, the Lincolnshire County Council and the museum's staff for supporting this research.
REFERENCES
[1] Tobias Baur, Gregor Mehlmann, Ionut Damian, Florian Lingenfelser, Johannes Wagner, Birgit Lugrin, Elisabeth André, and Patrick Gebhard. 2015. Context-Aware Automated Analysis and Annotation of Social Human-Agent Interactions. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 2 (2015), 11.
[2] Paul Baxter, Francesco Del Duchetto, and Marc Hanheide. 2018. Engaging Learners in Dialogue Interactivity Development for Mobile Robots. (2018).
[3] Paul Baxter, James Kennedy, Anna-Lisa Vollmer, Joachim de Greeff, and Tony Belpaeme. 2014. Tracking gaze over time in HRI as a proxy for engagement and attribution of social agency. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction - HRI '14. ACM Press, New York, New York, USA, 126–127. https://doi.org/10.1145/2559636.2559829
[4] Atef Ben-Youssef, Chloé Clavel, Slim Essid, Miriam Bilac, Marine Chamoux, and Angelica Lim. 2017. UE-HRI: a new dataset for the study of user engagement in spontaneous human-robot interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction - ICMI 2017. ACM Press, New York, New York, USA, 464–472. https://doi.org/10.1145/3136755.3136814
[5] G. Castellano, I. Leite, A. Pereira, C. Martinho, A. Paiva, and P. W. McOwan. 2012. Detecting Engagement in HRI: An Exploration of Social and Task-Based Context. 421–428. https://doi.org/10.1109/SocialCom-PASSAT.2012.51
[6] François Chollet et al. 2015. Keras. https://keras.io.
[7] Francesco Del Duchetto, Paul Baxter, and Marc Hanheide. 2019. Lindsey the Tour Guide Robot - Usage Patterns in a Museum Long-Term Deployment. In International Conference on Robot & Human Interactive Communication (RO-MAN). IEEE, New Delhi.
[8] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, Jul (2011), 2121–2159.
[9] Gautier Durantin, Scott Heath, and Janet Wiles. 2017. Social moments: a perspective on interaction for social robotics. Frontiers in Robotics and AI.
[10] Proceedings of the New Challenges in Data Sciences: Acts of the Second Conference of the Moroccan Classification Society. ACM, 26.
[11] Mary Ellen Foster, Andre Gaschler, and Manuel Giuliani. 2017. Automatically classifying user engagement for dynamic multi-party human-robot interaction. International Journal of Social Robotics 9, 5 (2017), 659–674.
[12] Nadine Glas and Catherine Pelachaud. 2015. Definitions of Engagement in Human-Agent Interaction. In International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE Press, Xi'an, China, 944–949.
[13] Benjamin Gleason and Christine Greenhow. 2017. Hybrid education: The potential of teaching and learning with robot-mediated communication. Online Learning Journal 21, 4 (2017).
[14] Nick Hawes, Christopher Burbridge, Ferdian Jovan, Lars Kunze, Bruno Lacerda, Lenka Mudrova, Jay Young, Jeremy Wyatt, Denise Hebesberger, Tobias Kortner, et al. 2017. The STRANDS project: Long-term autonomy in everyday environments. IEEE Robotics & Automation Magazine 24, 3 (2017), 146–156.
[15] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[16] Aaron Holroyd. 2011. Generating engagement behaviors in human-robot interaction. (2011).
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[18] Séverin Lemaignan, Fernando Garcia, Alexis Jacq, and Pierre Dillenbourg. 2016. From real-time attention assessment to with-me-ness in human-robot interaction. In The Eleventh ACM/IEEE International Conference on Human Robot Interaction. IEEE Press, 157–164.
[19] Marek P. Michalowski, Selma Sabanovic, and Reid Simmons. 2006. A spatial model of engagement for a social robot. IEEE, 762–767.
[20] Heather L. O'Brien and Elaine G. Toms. 2008. What is user engagement? A conceptual framework for defining user engagement with technology. Journal of the American Society for Information Science and Technology 59, 6 (2008), 938–955.
[21] Claire Cameron Ponitz, Sara E. Rimm-Kaufman, Kevin J. Grimm, and Timothy W. Curby. 2009. Kindergarten classroom quality, behavioral engagement, and reading achievement. School Psychology Review 38, 1 (2009), 102–121.
[22] Charles Rich, Brett Ponsler, Aaron Holroyd, and Candace L. Sidner. 2010. Recognizing engagement in human-robot interaction. IEEE, 375–382.
[23] Ognjen Rudovic, Jaeryoung Lee, Lea Mascarell-Maricic, Björn W. Schuller, and Rosalind W. Picard. 2017. Measuring Engagement in Robot-Assisted Autism Therapy: A Cross-Cultural Study. Frontiers in Robotics and AI.
[24] IEEE, 339–346.
[25] Hanan Salam and Mohamed Chetouani. 2015. Engagement detection based on mutli-party cues for human robot interaction. IEEE, 341–347.
[26] Hanan Salam and Mohamed Chetouani. 2015. A multi-level context-based modeling of engagement in human-robot interaction. Vol. 3. IEEE, 1–6.
[27] Candace L. Sidner, Cory D. Kidd, Christopher Lee, and Neal Lesh. 2004. Where to look: a study of human-robot engagement. In Proceedings of the 9th International Conference on Intelligent User Interfaces. ACM, 78–84.
[28] Fumihide Tanaka, Aaron Cicourel, and Javier R. Movellan. 2007. Socialization between toddlers and robots at an early childhood education center. Proceedings of the National Academy of Sciences.
[29] In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 0–0.
[30] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks.