Quantitative analysis of robot gesticulation behavior
Unai Zabala, Igor Rodriguez, José María Martínez-Otzeta, Itziar Irigoien, Elena Lazkano
Department of Computer Science and Artificial Intelligence, Faculty of Informatics, University of the Basque Country (UPV/EHU), 20018 Donostia ([email protected])
October 23, 2020
Abstract
Social robot capabilities, such as talking gestures, are best produced using data driven approaches to avoid being repetitive and to show trustworthiness. However, there is a lack of robust quantitative methods that allow such approaches to be compared beyond visual evaluation. In this paper a quantitative analysis is performed that compares two Generative Adversarial Network based gesture generation approaches. The aim is to measure characteristics such as fidelity to the original training data, while at the same time keeping track of the degree of originality of the produced gestures. Principal Coordinate Analysis and procrustes statistics are performed, and a new Fréchet Gesture Distance is proposed by adapting the Fréchet Inception Distance to gestures. These three techniques are taken together to assess the fidelity/originality of the generated gestures.
Keywords:
Social robots, Motion capturing and imitation, Generative Adversarial Networks, Gesture Generation, Principal Coordinate Analysis, procrustes statistics, FID
1 Introduction

Advances in social robots are widespread in robotic conferences and newspapers. Robots for entertainment and care need to show socially acceptable behavior and, at the same time, must act in a non repetitive/boring manner and show trustworthiness. An effective social interaction between humans and robots requires these robots to follow the social rules and expectations of human users.

In this paper a quantitative comparison of two GAN based gesture generation approaches is performed, where two variables have to be taken into account: the capture method (MoCap) and the length of the unit of movements (UM, a parameter intrinsic to our system that will be defined later). The goal is to test whether the generated gestures are similar to the original ones, but at the same time possess some degree of originality. As can be inferred, these two goals are contradictory, so a trade-off is needed. To measure the fidelity of the generated samples to the original ones we performed a Principal Coordinate Analysis (PCoA) over the original and generated samples for the two types of MoCap and different lengths of units of movements. To measure the originality, we calculated procrustes statistics. Finally, we have defined a Fréchet Gesture Distance (FGD), inspired by the Fréchet Inception Distance (FID). Assuming that the balance between fidelity and originality comes with the smaller FGD measure, this allows us to select the most appropriate value for the parameter being analysed. Thus, the contribution of the paper is as follows:

• Principal Coordinate Analysis (PCoA): a statistical tool for exploring the structure of high dimensional data. We propose this analysis to measure the degree of fidelity with respect to the training data.

• Procrustes statistics are applied to ensure that the model is able to offer some degree of originality in the generated gestures.

• A new Fréchet Gesture Distance (FGD) is defined by adapting the Fréchet Inception Distance (FID) to the problem of GAN generated gestures.

The rest of the paper is structured as follows: Section 2 introduces the need for robot gesticulation and summarizes the different social skills evaluation alternatives found in the literature. Section 3 describes the experimental baseline, the two GAN based gesture generation approaches that will be quantitatively analysed later on. The fidelity analysis is performed in Section 4, while the originality analysis is described in Section 5. The definition of the FGD measure is introduced in Section 6; this section also shows how the trade-off has been conducted by calculating the distance between the generated gestures and the Gaussian Mixture Model (GMM) generated from a set of synthetic gestures created using Choregraphe, a software tool that allows robot animations to be created. A qualitative visual evaluation is provided in Section 7. Finally, Section 8 is dedicated to the conclusions and to outlining further work.

∗ This work has been partially supported by the Basque Government (IT900-16 and Elkartek 2018/00114) and the Spanish Ministry of Economy and Competitiveness (RTI 2018-093337-B-100, MINECO/FEDER, EU).
2 Related work

Talking involves spontaneous gesticulation; postures and movements are relevant for social interactions even if they are subjective and culture dependent. While co-thought gestures (movements related to thinking activity) support complex problem solving, co-speech gestures imply communication [11]. Lhommet and Marsella [25] discuss body expression in terms of postures, movements and gestures. Gestures, defined as movements that convey information intentionally or not, are categorised as emblems, illustrators and adaptors. Emblems are gestures deliberately performed by the speaker that convey meaning by themselves and are again culture dependent. Illustrators are gestures accompanying speech that may (emblems, deictic, iconic and metaphoric) or may not (beats) be related to the semantics of the speech [28]. Lastly, adaptors or manipulators belong to the gesture class that does not aid in understanding what is being said, such as tics or restless movements. Aiming at building trust and making people feel confident when interacting with them, socially acting humanoid robots should show human-like talking gesticulation.

Problems arise when it comes to evaluating the behavior, or a particular skill (e.g., the gesticulation ability), of a social robot. Usually robot behaviour is qualitatively evaluated. Often questionnaires are defined so that participants can rank several aspects of the robot's performance. There seems to be a consensus in presenting the questions using a Likert scale and analyzing the obtained responses using some statistical test such as analysis of variance, chi-square and so on. For instance, in [42] social engagement with a robot is evaluated by observing expressed emotions during the conversation. Humans participate in conversations with a NAO robot in different intonation conditions. As objective measures the authors use the number of turns between actors, the number of re-prompts, the number of interruptions and the average silence length between turns. These measures are complemented with other subjective data, such as the conversational naturalness, measured using questionnaires. In [38] the authors propose a method for modifying affective robot movements using neural networks. Again, the approach is evaluated using an online survey and two one-sided tests (TOST). Kucherenko et al. [22] replicate the evaluation approach in [17] and assess the naturalness, semantic consistency and time consistency of the gestures generated by a speech driven encoder-decoder DNN by performing a user study. Once more, Becker-Asano and Ishiguro [4] use questionnaires to investigate whether facial displays of emotions with Geminoid F can be recognized and to find intercultural differences in the perception of those facial displays. Confusion matrices of the recognition rates are shown as a measure.

Carpinella et al. [9] go one step further by developing an 18-item scale (based on the psychological literature on social perception) to measure people's judgment of the social attributes of robots. This scale is also used in [31] to examine how human collaborators perceive their robotic counterparts from a social perspective during object handovers.

When it comes to comparing different approaches, data driven approaches are confronted with the original data that was used to learn the model, and ranked results are then compared using some statistical tests.
For instance, in [43] generated beat gestures are compared with designed beat gestures, timed beat gestures and noisy gestures using such an approach.

Qualitative methods are essential but difficult to perform, because a large number of evaluators is required and their subjective perceptions might differ. Moreover, when a large number of gestures must be evaluated, the human eye becomes used to what it is observing and it gets hard to notice the differences. Thus, such methods are prone to result in subjective evaluation. Besides, the evaluation is culture dependent.

On the contrary, quantitative methods can handle a huge amount of data as input, which makes them more appropriate to evaluate the robustness of a feature. However, subtle and subjective properties might not be easily measured with numerical methods. They cannot answer questions like "which one do you like more?", nor can they take into account the impact or effect a gesture system might have on a specific target audience. Both evaluation methods have strengths and weaknesses and are complementary.

Rare are the references that use quantitative evaluation methods. In [36] gestures generated by a GAN network are compared with gestures obtained by a GMM, a Hidden Markov Model (HMM) and gestures obtained by randomly ordering the training data. Principal coordinate analysis was used to extract the similarities between the generated gestures and the original ones. Other features such as 3D space coverage, path length and motion jerk were also used for evaluation purposes. Similar motion statistics were used in [23]. More specifically, the average values of the root-mean-square error and speed histograms of the produced motion are shown as new measures.

Social behavior must be socially acceptable above all, and questionnaires are very valuable tools that need to be considered. But when it comes to comparing several approaches, objective tools are needed. We have focused on three characteristics (from the seven stated in [5]) as desirable when using a data-driven gesture generation approach:

• Ability to generate high fidelity samples

• Ability to generate diverse samples

• Agreement with human perceptual judgments and human rankings of models

These characteristics could be in contradiction among themselves, particularly the fidelity and diversity constraints. We have tried our best to reconcile them, and after the quantitative analysis we have returned the human to the loop for the test of the third condition: human judgement. Therefore, in the research described in this paper we perform an analysis based on several methods for quantitatively measuring the degree of fidelity, as well as the originality, a gesture generation method offers with respect to the properties of the data used for training the system.
3 Experimental baseline

In this section the two GAN based gesture generation methods that will be quantitatively analysed later on are explained. Both methods can generate human-like motion in a humanoid robot that includes arms, head and hands motion mapping (upper-body part), since legs are not involved in talking beats. They only differ in the 3D MoCap system (OpenNI vs OpenPose [8]) being used to capture human motion and to create the databases for acquiring the generative models.

Capturing human motion, rather than relying on existing robot animation software, clearly allows the nature of the talking movements we make to be better captured, but it requires the capability to (1) capture good features of the motion and (2) map those captured features into the robot joints [34]. This mapping process can be done by inverse kinematics [1], calculating the necessary joint positions given a desired end effector's pose as in [29]. Alternatively, we adopt the direct kinematics option that straightforwardly adapts the captured arm and head angles to the robot joints [48][35]. The mapping process leads to the capability of human motion imitation as depicted in Figure 1.
Figure 1: Human motion imitation process

As mentioned before, arms, head and hands are involved in the gesture generation process. The MoCap systems being used show different features and limitations and thus, the mapping process differs from one to the other in some joints. More specifically, head and hands need to be mapped differently. Figure 2 reflects the main differences between these two systems: OpenNI can only detect 15 keypoints, while OpenPose detects 25 for the body plus 42 hand keypoints (hands are not displayed in the aforementioned figure). The following subsections detail how those elements are translated from human captured 3D cartesian coordinates to the joints of Softbank's robot Pepper.
Arms mapping
The literature reveals different approaches to calculate the robot arm joint positions [48][21]. This mapping process depends upon the robot's degrees of freedom and joint ranges. For the Pepper robot arms we are dealing with, some upper-body link vectors are built through the skeleton points in the human skeleton model, and joint angles are afterwards extracted from the calculation of the angles between those vectors. For the sake of simplicity, since the calculation of the angles is similar for both approaches, the complete formulas involved in that process will not be described here (see [46] for more detailed information).
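To make the direct kinematics idea concrete, the following Python sketch extracts one elbow angle from three captured keypoints, in the spirit of the vector-based procedure just described. The keypoint values and the joint range used for clipping are illustrative assumptions, not the paper's actual parameters:

```python
import numpy as np

def angle_between(u, v):
    """Angle (radians) between two 3D link vectors."""
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cosine, -1.0, 1.0))

def elbow_angle(shoulder, elbow, wrist, joint_range=(0.01, 1.56)):
    """Elbow joint value from the upper-arm and forearm link vectors,
    clipped to an (illustrative) robot joint range."""
    upper_arm = elbow - shoulder   # link vectors built from skeleton points
    forearm = wrist - elbow
    return float(np.clip(angle_between(upper_arm, forearm), *joint_range))

# hypothetical captured keypoints (metres)
shoulder = np.array([0.00, 0.20, 1.40])
elbow = np.array([0.05, 0.18, 1.15])
wrist = np.array([0.25, 0.22, 1.10])
print(elbow_angle(shoulder, elbow, wrist))
```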
Head mapping
The OpenNI skeleton tracking program employed for head mapping gives us the neck and head 3D poses. The approach taken for mapping the yaw angle to the robot's head consists of applying a gain $K$ to the human's yaw value, once transformed into the robot space by a $-\pi$ rotation (Eq. 1):

$$H^{robot}_{\gamma} = K \times H_{\gamma} \quad (1)$$

Figure 2: OpenNI and OpenPose skeleton models ((a) OpenNI, (b) OpenPose)

Figure 3: Left arm joints and angle limits

In order to approximate the head's pitch angle, the head to neck vector ($\vec{HN}$) is calculated and rotated $-\pi$, and then its angle is obtained (Eq. 2). Note that the robot's head is an ellipsoid instead of a sphere; to avoid unwanted head movements a linear gain is applied to the final value:

$$H^{robot}_{\beta} = \arctan\left(\mathrm{rotate}(\vec{HN}, -\pi)\right) \cdot |K \ast H_{\gamma}| \quad (2)$$

On the contrary, OpenPose detects basic face features such as the nose, the eyes and the ears (see Figure 2(b)) and thus allows for a more realistic tracking of the robot head. To map the human's head position into the robot, we use the nose position as reference. The head's pitch ($H^{robot}_{\phi}$) is proportional to the distance between the nose and the neck joint (Eq. 3), while the yaw orientation of the head itself ($H^{robot}_{\psi}$) can be calculated by measuring the angle between the vector joining the nose and the neck, and the vertical axis (Eq. 4):

$$NN = \mathrm{dist}(Nose, Neck)$$
$$H^{robot}_{\phi} = \mathrm{rangeConv}(NN, robotRange) \quad (3)$$
$$H^{robot}_{\psi} = \mathrm{rangeConv}(-\arcsin(NN_x), robotRange) \quad (4)$$
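The rangeConv operation used in Eqs. 3 and 4 is, as far as can be inferred from the text, a linear mapping of a measured magnitude onto a robot joint interval. A minimal sketch, assuming such a linear interpolation and hypothetical interval bounds:

```python
import numpy as np

def range_conv(value, in_range, robot_range):
    """Linear mapping of a measured magnitude onto a robot joint interval
    (our reading of the paper's rangeConv; both intervals are assumptions)."""
    in_lo, in_hi = in_range
    out_lo, out_hi = robot_range
    t = (np.clip(value, in_lo, in_hi) - in_lo) / (in_hi - in_lo)
    return out_lo + t * (out_hi - out_lo)

# e.g. nose-to-neck distance NN (pixels) -> head pitch (radians), as in Eq. 3
print(range_conv(42.0, in_range=(20.0, 80.0), robot_range=(-0.40, 0.30)))
```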
Hands mapping

The OpenNI skeleton tracking program used for hands mapping cannot detect the operator's hand yaw motion and thus, the $LW_{\gamma}$ joints cannot be reproduced using the skeleton information. The developed solution forces the user to wear coloured gloves to detect the hand orientation. In our implementation, the gloves are green in the palm of the hand and red in the back. While the human talks, hand coordinates are tracked, those positions are mapped into the image space and a subimage is obtained for each hand. Angular information is afterwards calculated by measuring the number of pixels ($max$) of the outstanding color in a subimage. Eq. 5 shows the procedure for the left hand, where $N$ is a normalizing constant and $maxW_{\gamma}$ stands for the maximum wrist yaw angle of the robot:

$$LW^{robot}_{\gamma} = \begin{cases} \frac{max}{N} \times maxW_{\gamma} & \text{if } max \text{ corresponds to the palm} \\ \frac{max - N}{N} \times maxW_{\gamma} & \text{otherwise} \end{cases} \quad (5)$$

In addition, $LE_{\gamma}$ is modified when the human's palms are up, to ease the movement of the robot. Regarding the fingers, as they cannot be tracked, their position is randomly set at each skeleton frame to make the movement more realistic.

Alternatively, OpenPose differentiates left and right sides without any calibration and gives 21 keypoints per hand, four per finger plus the wrist (see Figure 4).

Figure 4: OpenPose hand model (21 keypoints)

To determine whether the hand shows the palm or the back, the angle between the horizontal line and the line joining the thumb and the pinky fingertips (keypoints 4 and 20) is required. This calculation is expressed in Eq. 6, where $FT$ stands for fingertip and $OFT$ represents a fingertip with respect to the new origin. Afterwards, the fingers' points are rotated in such a way that the pinky lies at the right of the thumb, and the number of fingers over the $Y = 0$ line is calculated (Eq. 7). For the right hand, at least two fingers should lie over that line to consider that the palm is being shown, as expressed in Eq. 8 (the opposite condition holds for the left hand):

$$\forall i:\; OFT^{i}_{x,y} = FT^{i}_{x,y} - Thumb_{x,y}, \qquad \alpha = \arctan\!\left(OFT^{pinky}_{y}, OFT^{pinky}_{x}\right) \quad (6)$$

$$\forall i:\; FT'^{i}_{x} = OFT^{i}_{x}\cos(-\alpha) - OFT^{i}_{y}\sin(-\alpha), \qquad FT'^{i}_{y} = OFT^{i}_{x}\sin(-\alpha) + OFT^{i}_{y}\cos(-\alpha) \quad (7)$$

$$HandSide = \begin{cases} Palm & \text{if } \#\{i : FT'^{i}_{y} > 0\} \geq 2 \\ Back & \text{otherwise} \end{cases} \quad (8)$$

In addition, each hand's yaw angle ($H_{\psi}$) must be calculated by measuring the distance between the thumb and the pinky fingertips (Eq. 9). The minimum and maximum values are adjusted according to the wrist's height so as to avoid collisions with the touch screen on the chest of the robot:

$$TP = \mathrm{dist}(FT'_{Thumb}, FT'_{Pinky}), \qquad H_{\psi} = \mathrm{rangeConv}(TP, robotRange) \quad (9)$$

Finally, the hand's opening/closing is defined as a function of the distance between the wrist (keypoint 0) and the middle fingertip (keypoint 12), as in Eq. 10:

$$MW = \mathrm{dist}(FT_{Middle}, Wrist), \qquad ClosedOpen = \mathrm{rangeConv}(MW, [0.\ldots, \ldots]) \quad (10)$$
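A compact sketch of the palm/back decision of Eqs. 6-8: the fingertips are translated so that the thumb becomes the origin, rotated so that the pinky falls on the positive x axis, and fingertips above the y = 0 line are counted. The two-finger threshold comes from the text; the example pixel coordinates and function names are assumptions:

```python
import numpy as np

def hand_side(fingertips, thumb_idx=0, pinky_idx=4, min_above=2):
    """Palm/back decision following Eqs. 6-8 (right hand): move the origin
    to the thumb, rotate so the pinky lies on the +x axis, then count
    fingertips above the y = 0 line."""
    FT = np.asarray(fingertips, dtype=float)      # (5, 2) image coordinates
    OFT = FT - FT[thumb_idx]                      # Eq. 6: new origin
    alpha = np.arctan2(OFT[pinky_idx, 1], OFT[pinky_idx, 0])
    c, s = np.cos(-alpha), np.sin(-alpha)         # Eq. 7: rotate by -alpha
    R = np.array([[c, -s], [s, c]])
    FT_rot = OFT @ R.T
    above = int((FT_rot[:, 1] > 0).sum())         # fingers over y = 0
    return "palm" if above >= min_above else "back"   # Eq. 8

# hypothetical fingertip pixels: thumb, index, middle, ring, pinky
tips = [(100, 200), (130, 170), (150, 165), (165, 172), (180, 190)]
print(hand_side(tips))
```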
GAN networks are composed of two different interconnected networks. The Generator (G) network generates candidates that are as similar as possible to the training set. The second network, known as the Discriminator (D), judges the output of the first network to discriminate whether its input data are "real", namely coming from the input data set, or "fake", that is, generated to trick it with false data.

As we are interested in generating movements, i.e., sequences of poses, the input to the learning process of any generative model has to take into account the temporal sequence of poses. The training dataset given to the D network contains K units of movement (UM), each UM being a sequence of $\mu$ consecutive poses, and each pose 14 float numbers corresponding to the joint values ($J_i$) of head, arms, wrists (yaw angle) and hands (finger opening value). Table 1 describes in more detail a single entry of the database for the case of $\mu = 4$.

Table 1: Characterization of a unit of movement for $\mu = 4$ consecutive poses. $\Delta t$ depends on the data sampling frequency.

$$J_1(t) \cdots J_{14}(t),\; J_1(t+\Delta t) \cdots J_{14}(t+\Delta t),\; J_1(t+2\Delta t) \cdots J_{14}(t+2\Delta t),\; J_1(t+3\Delta t) \cdots J_{14}(t+3\Delta t)$$

These samples were recorded by using two different MoCap systems and by registering 10 different people talking, about 18 minutes overall. Therefore, two datasets have been obtained: OpenNI DB is built from a recording about nine minutes long and contains five people's skeleton data captured using OpenNI as MoCap system, while OpenPose DB is built from another recording, also about nine minutes long, that contains a different five people's skeleton data captured using OpenPose. After sampling with a frequency of 4 Hz, two datasets with slightly different numbers of poses are created. The shorter one has 2018 poses, and the last poses of the longer one have been deleted to make their lengths match.

The D network is thus trained using these data to learn their distribution space; its input dimension is $\mu \times 14$. On the other hand, the G network is seeded through a random input with a uniform distribution in the range $[-1, 1]$ and with a dimension of 100. The G network intends to produce as output gestures that belong to the real data distribution and that the D network would not be able to correctly pick out as generated.
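A minimal sketch of the setup just described, assuming fully connected networks; the paper does not state layer sizes, so the hidden widths and activations are assumptions, while the input dimension $\mu \times 14$, the noise dimension of 100 and the uniform $[-1, 1]$ seeding come from the text:

```python
import torch
import torch.nn as nn

MU, N_JOINTS, Z_DIM = 4, 14, 100       # UM length, joints per pose, noise size
UM_DIM = MU * N_JOINTS                 # a UM is mu consecutive 14-joint poses

G = nn.Sequential(
    nn.Linear(Z_DIM, 256), nn.ReLU(),
    nn.Linear(256, UM_DIM), nn.Tanh(),       # one generated unit of movement
)
D = nn.Sequential(
    nn.Linear(UM_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),         # probability that the UM is real
)

z = torch.empty(32, Z_DIM).uniform_(-1.0, 1.0)   # seed: uniform in [-1, 1]
fake_um = G(z)                                   # batch of 32 candidate UMs
print(fake_um.shape, D(fake_um).shape)           # (32, 56) and (32, 1)
```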
4 Fidelity analysis
Dimension reduction techniques are widely used in very different areas, such as genomics, image classification or natural language processing tasks. The best known is Principal Component Analysis (PCA) [19], which can help to explore the structure of high dimensional data. It is a technique that displays the structure of complex data in a high dimensional space in a lower dimensional space without too much loss of information. In robotics, particularly when studying motions or movements, PCA has also been extensively applied: [32] used PCA to build motions within an imitation learning framework; [44] used PCA to increase the interpretability of upper limb movements registered by a robotic technology for different tasks; and in [20] data acquired with a dataglove was summarized with PCA to extract the coordination patterns available for handgrasps.

Principal Coordinate Analysis [15] (PCoA), also known as Classical Multidimensional Scaling, is an extension of PCA and therefore allows similarities or dissimilarities of data to be explored and visualized. Given $n$ units and distances $d_{ij}$ between each pair of units $i$ and $j$, all the distances are gathered in an $n \times n$ distance matrix $D$. PCoA builds a new matrix $Y$ containing the coordinates of the $n$ units in $l$ dimensions such that the Euclidean distance between the $i$-th and $j$-th units is equal to $d_{ij}$ for all $i$ and $j$. The columns of matrix $Y$ are given basically by the eigenvectors of the inner product matrix $(I - \mathbf{1}\mathbf{1}'/n)\,\tilde{D}\,(I - \mathbf{1}\mathbf{1}'/n)$, where $\tilde{D}$ is the matrix with value $-d_{ij}^2/2$ in position $(i,j)$, $\mathbf{1} = (1, \ldots, 1)'$ and $I$ is the identity matrix. The related eigenvalues show the variability decomposition in the original data. When the distance matrix $D$ is the Euclidean distance built on the original features, PCoA and PCA give the same results. In summary, the columns of matrix $Y$ along with the eigenvalues allow the internal structure of the original high dimensional data to be analysed.

Let OpenNI DB and OpenNI+GAN be the databases captured and generated respectively with the OpenNI capture method; the same holds for OpenPose DB and OpenPose+GAN. The databases OpenNI DB, OpenNI+GAN, OpenPose DB and OpenPose+GAN were calculated for different lengths of UM ($\mu = 4, 6, 8$), yielding an $N \times (14 \times \mu)$ data matrix for each method, where columns represent the positions of the joints along the sequence of $\mu$ consecutive poses ($J_i(t + k\Delta t)$, $i = 1, \ldots, 14$, $k = 0, \ldots, \mu - 1$).

The principal coordinates $Y^O$ of the originals and the principal coordinates $Y^G$ of the GAN generated samples were compared. Particularly, we measured the ability to recover each of the first 10 principal coordinates $Y^O = [y_1^O, \ldots, y_{10}^O]$ from the first 10 principal coordinates $Y^G$ based on linear regression models. We considered the linear regression model for the $j$-th principal coordinate of the originals based on the 10 principal coordinates of the GAN generated samples ($y_j^O = \beta_0 + \sum_{i=1}^{10} \beta_i y_i^G$, $j = 1, \ldots, 10$) and calculated the explained variance by the coefficient of determination $R^2$ (see Figure 6). Broadly, very high values of $R^2$ are obtained, assessing the fidelity of the GAN models in concordance with what the eigenvalue decompositions showed. Nevertheless, we gained some insight: it can be observed that the recovery of the originals is greater with OpenPose as MoCap, and that the recovery for UM of length 8 is the poorest. It can be seen that the first 6 and 7 principal coordinates of OpenPose DB can be recovered by the GAN principal coordinates ($R^2 \geq 0.85$) for $\mu = 4$ and $\mu = 6$, respectively.

Figure 5: Decomposition of the variance for different lengths of units and different systems of movement ($\lambda_l^O$, $\lambda_l^G$, $l = 1, \ldots$; $\mu = 4, 6, 8$)

Figure 6: Explained variance ($R^2$) for linear models $y_j^O = \beta_0 + \sum_{i=1}^{10} \beta_i y_i^G$, $j = 1, \ldots, 10$. The columns are ordered by length of UM ($\mu = 4, 6, 8$)
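For reference, PCoA as described above can be sketched in a few lines of Python; with Euclidean distances it coincides with PCA. The data matrix below is a random stand-in for an $N \times (14 \times \mu)$ unit-of-movement matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pcoa(X, n_coords=10):
    """Principal Coordinate Analysis: double-center -d_ij^2/2 and
    eigendecompose, keeping the leading coordinates."""
    D = squareform(pdist(X))                   # n x n Euclidean distances
    n = D.shape[0]
    D_tilde = -0.5 * D ** 2
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix I - 11'/n
    B = J @ D_tilde @ J                        # inner product matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]             # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-10                        # drop null/negative axes
    Y = vecs[:, keep] * np.sqrt(vals[keep])
    return Y[:, :n_coords], vals

# random stand-in for an N x (14 * mu) unit-of-movement matrix (mu = 4)
rng = np.random.default_rng(0)
Y, eigvals = pcoa(rng.normal(size=(200, 56)))
print(Y.shape)                                 # (200, 10)
```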
5 Originality analysis

Nothing has been said yet about the degree of originality of the generated motion. As mentioned in the introduction, robot gesticulation should not turn out repetitive/boring. In order to analyze this, we considered Procrustes Analysis. Procrustes methods analyze the matching between two or more configurations. That is, given some units measured in different contexts or by different features, the main aim of procrustes methods [16] is to measure the degree of similarity among the configurations. Procrustes methods are widely applied. For instance, in [26] the authors extend the procrustes statistic to obtain transfer learning techniques to learn robot kinematic and dynamic models; [12] applied procrustes techniques as an effective robot base frame calibration; more recently, [27] proposed a method to increase efficiency and to identify potential issues of the assembly process in robotized assembly as a variation of the classical procrustes analysis.
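Before turning to our data, a sketch of the normalized orthogonal procrustes statistic used in the next paragraphs; scipy's orthogonal_procrustes provides the optimal rotation, while the scaling and row normalization follow the definitions below (the stand-in configurations are assumptions):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_ss(Y_O, Y_G):
    """Normalized residual sum of squares ss = ||Y_O - s Q Y_G||^2 / rows,
    after the best scaling s and rotation Q (orthogonal Procrustes)."""
    Q, scale = orthogonal_procrustes(Y_G, Y_O)   # Q minimizes ||Y_G Q - Y_O||
    s = scale / np.linalg.norm(Y_G, 'fro') ** 2  # optimal scaling factor
    resid = Y_O - s * (Y_G @ Q)
    return (resid ** 2).sum() / Y_O.shape[0]     # normalized as ss/(14 mu)

# stand-ins for two 10-coordinate configurations (rows = 14 * mu joints)
rng = np.random.default_rng(3)
Y_O = rng.normal(size=(56, 10))
print(procrustes_ss(Y_O, Y_O + 0.1 * rng.normal(size=(56, 10))))
```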
In our particular context, we considered pairs of configurations given by the first 10 principal coordinates $Y^O$ of the originals and the 10 principal coordinates $Y^G$ of the GAN generated samples for each combination of MoCap and UM. The rows of those matrices represent the joints along the unit of movement, and the matrices can be considered as configurations for the joints. Based on the percentage of explained variance (see Table 2), the aforementioned configurations capture the essence of the joints along the units of movement. The classical orthogonal procrustes statistic ($ss$) between configurations $Y^O$ (DB) and $Y^G$ (GAN) is the residual sum of squares between both configurations, once a scaling factor and a rotation are allowed. That is, $ss = \|Y^O - sQY^G\|^2$, where $s$ is a scaling factor and $Q$ is a rotation matrix that minimize the sum of squares. The underlying idea is to consider $Y^O$ as the target configuration and to scale and rotate the second configuration $Y^G$ so that it is as similar as possible to the target. The remaining residuals build the procrustes statistic $ss$. The bigger $ss$ is, the more different are the joints along the units of movement, or, in our context, the bigger the originality of the movements. Since $ss$ depends on the number of rows, we normalize it so that we obtain a commeasurable statistic for different lengths of UM. Taking into account Table 2, in terms of originality it seems that the OpenPose MoCap obtains slightly bigger values.

Table 2: Explained variance (%) along the first 10 dimensions for different lengths of UM and different systems of movement, and differences between joints along the UM measured by the commeasurable procrustes statistic $ss/(14\mu)$.

µ   Original   OpenNI+GAN    ss/(14 × µ)
4   81.3       85.6          0.0857
6   75.0       83.7          0.1723
8   70.1       82.9          0.1932

µ   Original   OpenPose+GAN  ss/(14 × µ)
4   83.2       86.2          0.1054
6   77.6       83.8          0.1307
8   74.4       83.4          0.2369

The originality should not come at the cost of rough or uneven movements. Tables 3 to 5 show the mean values of the norm of the jerk [7] (Eq. 11) and the length of the path (described as the increment in the positions over time in Eq. 12) for 1000 generated movements. The head position does not shift in space, and thus only jerk values are calculated for it. Overall, the motion analysis shows that OpenPose based gesture generation is smoother than the OpenNI based one, independently of the length of the unit of movement selected.

$$jerk = \frac{1}{T} \sum_{t=1}^{T} \left\| \dot{accel}_t \right\| \quad (11)$$

$$lpath = \sum_{t=2}^{T} \left\| x_t - x_{t-1} \right\| \quad (12)$$

Table 3: Mean values for each measure ($\phi$: pitch, $\psi$: yaw), $\mu = 4$, for the OpenPose based and OpenNI based systems ($E_{jerk}$ and $E_{lpath}$ per limb; $E_{\psi jerk}$ and $E_{\phi jerk}$ for the head)

Table 4: Mean values for each measure ($\phi$: pitch, $\psi$: yaw), $\mu = 6$

Table 5: Mean values for each measure ($\phi$: pitch, $\psi$: yaw), $\mu = 8$
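Eqs. 11 and 12 can be approximated from a sampled trajectory by finite differences, as in this sketch; the 4 Hz sampling period and the random trajectory are stand-ins:

```python
import numpy as np

def jerk_norm(positions, dt=0.25):
    """Mean norm of the jerk (Eq. 11), via finite differences at 4 Hz."""
    vel = np.diff(positions, axis=0) / dt
    accel = np.diff(vel, axis=0) / dt
    jerk = np.diff(accel, axis=0) / dt
    return np.linalg.norm(jerk, axis=1).mean()

def path_length(positions):
    """Length of the path (Eq. 12): summed position increments over time."""
    return np.linalg.norm(np.diff(positions, axis=0), axis=1).sum()

# hypothetical 3D hand trajectory of one generated movement
rng = np.random.default_rng(1)
traj = np.cumsum(rng.normal(scale=0.01, size=(40, 3)), axis=0)
print(jerk_norm(traj), path_length(traj))
```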
6 Fréchet Gesture Distance

As mentioned in the introduction, the fidelity and the originality features are contradictory and a trade-off is desirable. Looking for that balance, we have defined a Fréchet Gesture Distance (FGD), inspired by the Fréchet Inception Distance (FID).

6.1 Evaluating GAN performance

Evaluating the performance of GAN networks is not a straightforward process. Several approaches have been proposed, among them average log-likelihood [40], Parzen window estimates [6] or visual fidelity of samples [13] when suitable. In [41] the authors show that these three criteria are largely independent of each other when the data is high-dimensional. In particular, they state that average likelihood is not a good measure.

In the field of image generating GANs, some more recently defined measures are the Inception Score (IS) [37] and the Fréchet Inception Distance (FID) [18]. Both approaches measure the distance between the original and the generated images. The Inception Score is computed as $\exp\!\left(\mathbb{E}_x\, KL\!\left(p(y|x)\,\|\,p(y)\right)\right)$, where $p(y|x)$ is the conditional label distribution returned by the Inception model [39]. Images that contain meaningful objects are expected to have a conditional label distribution $p(y|x)$ with low entropy. On the other hand, it is expected that the images generated by the model have a degree of variation among them, so the marginal $\int p(y|x = G(z))\,dz$ should have high entropy. The Inception Score is obtained from the combination of these two requirements, where the results are exponentiated so the values are easier to compare. KL stands for the Kullback-Leibler divergence [24]. The Fréchet Inception Distance is computed as $d^2\!\left((M_r, \Sigma_r), (M_g, \Sigma_g)\right) = \|M_r - M_g\|_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$, where $(M_r, \Sigma_r)$ and $(M_g, \Sigma_g)$ are the mean vectors and covariance matrices of the feature vectors for real and generated images, respectively. The feature vectors are computed as the values of the activation layer of the Inception model. In layperson's terms, given two sets of images $I_A$ and $I_B$, the FID measures the similarity of the predictions of the Inception model over $I_A$ and $I_B$. FID is widely used as a performance measure in the image generation community, as in [33][45][47][30].

In [5] the author analyzes the pros and cons of several GAN performance measures. His work is focused on GANs applied to images, and arrives at the conclusion that the FID score looks more plausible than others, although it has its drawbacks, such as relying on pre-trained networks, which could pose problems when translated to other domains. However, as is pointed out in a recent article [2], "there are no universally agreed-upon performance metrics for unsupervised learning, and people have already pointed out many shortcomings of these Inception-based methods. Until something better comes along though, they're going to show up in every paper so it's worth knowing what they are." Taking into account the relative quality of the FID score when applied to the image domain, one of the goals of this research is to find a way of adapting that score, based on a pre-trained model over a set of images, to sets of gestures.

6.2 Applying Fréchet Gesture Distance to the baseline

When trying to adapt the FID to gestures, the first problem is that, to the best of our knowledge, there is no model that could play the role of the Inception model.
Let us remember that the Inception model has been created by a supervised deep learning algorithm and, when presented with an input image, it outputs a set of probabilities of that image belonging to any of a thousand possible classes. For gestures, there is no such model. A possible approach could be to manually label a set of gestures, apply a supervised model over it, and then use it in the same role as the Inception model. As in our domain there is no clear-cut classification of the gestures generated by the robot, we have chosen another alternative: to build a Gaussian Mixture Model (GMM) in an unsupervised fashion from a set of synthetic gestures created with Choregraphe¹, a software tool designed to create robot animations. It includes different types of predefined animations, such as body talk gestures, reactions and emotions, that are used to bring the robot to life. In a previous work [36], we chose a set of animations from the original Choregraphe animation library that could be used as beat gestures, and we created a database with those animations. After sampling the selected animations with a frequency of 4 Hz we obtained a database built up from 1502 poses. From now on we will refer to this database as the Choregraphe gestures database (ChDB). In this approach, Choregraphe gestures play a similar role to the data from which the Inception model was created: as in the image domain a model independent from the analyzed data was created (Inception), in the gesture domain we create a model independent from the analyzed data (the Choregraphe-based GMM). The data used by the GAN for training is different from the Choregraphe data, as it is captured by a MoCap system.

The GMM election is supported by previous motion work by the authors [36], where they show that this model ranks second after GANs in the quality of generated gestures, when used as a generative model. When evaluating the quality of the gestures created by the GAN, the computed GMM model can be used to classify new gestures, and thus return the set of probabilities needed to compute the FID.

The process to define the Fréchet Gesture Distance (FGD) between two sets of gestures $G_A$ and $G_B$ is the following (see the sketch after this list):

• Create a database $G_M$ from Choregraphe gestures.

• Build a GMM from $G_M$.

• Compute the probabilities $P(G_A)$ and $P(G_B)$ returned by that GMM over $G_A$ and $G_B$, respectively.

• Compute the Fréchet distance over $P(G_A)$ and $P(G_B)$.

The GMM has been built with 24 components sharing the same covariance matrix, after an initial Choregraphe gestures database (ChDB) of K = 1502 poses was created. When $\mu = 1$, the length of the pose and the unit of movement are the same, and the Choregraphe gestures database can be used without further processing. Therefore, if we denote as ChDB-n the Choregraphe database used when $\mu = n$, we find that ChDB-1 and ChDB are the same. But, in the general case, with an arbitrary value of $\mu = n$, the dimensionality of ChDB-n is $14 \times n$. To achieve this, $n$ consecutive poses are joined together in ChDB-n, thus bringing the number of units of movement in ChDB-n to $K/n$. Therefore, the GMM is trained with the ChDB-$\mu$ associated to each value of the $\mu$ parameter.

¹http://doc.aldebaran.com/2-5/software/choregraphe/index.html
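The FGD procedure above can be sketched as follows, with a random stand-in for ChDB-4 (1502/4 ≈ 375 UMs of dimension 14 × 4) and a GMM with 24 components and a shared ("tied") covariance matrix, as described in the text; everything else (names, sizes of the gesture sets) is assumed:

```python
import numpy as np
from scipy.linalg import sqrtm
from sklearn.mixture import GaussianMixture

def frechet_distance(P_A, P_B):
    """Frechet distance between Gaussians fitted to the probability
    vectors P_A and P_B (same formula as the FID above)."""
    mu_a, mu_b = P_A.mean(axis=0), P_B.mean(axis=0)
    cov_a = np.cov(P_A, rowvar=False)
    cov_b = np.cov(P_B, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b).real
    return np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a + cov_b - 2 * covmean)

def fgd(gmm, G_A, G_B):
    """FGD between two gesture sets: the Choregraphe-based GMM plays the
    role of the Inception model, yielding per-component probabilities."""
    return frechet_distance(gmm.predict_proba(G_A), gmm.predict_proba(G_B))

# random stand-ins for ChDB-4 and two gesture sets of UM dimension 56
rng = np.random.default_rng(2)
ch_db = rng.normal(size=(375, 56))
gmm = GaussianMixture(n_components=24, covariance_type='tied',
                      random_state=0).fit(ch_db)
print(fgd(gmm, rng.normal(size=(500, 56)), rng.normal(size=(500, 56))))
```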
Table 6 shows the FGD distance values for the different $\mu$ values. According to these values, $\mu = 4$ shows the shortest distance and thus seems the most adequate value.

Table 6: FGD values for the different comparisons (mean E and standard deviation σ for the OpenNI GAN and OpenPose GAN, $\mu = 4, 6, 8$)

7 Qualitative visual evaluation

Visual inspection of the robot behavior can be somewhat subjective, specially when variations are subtle. However, the robot behavior must be perceived as acceptable by humans in any circumstance. The two approaches compared in this work are very similar in nature, the only difference being the MoCap system used to generate the learning data. The main differences between them were the difficulties in accurately tracking the head and hands positions with OpenNI. Figure 7 shows those differences; the reader is invited to pay attention to how the head and hand positions differ.

Figure 7: Reproduction of poses in the simulated robot ((a) OpenNI, (b) OpenPose)

These difficulties are therefore reflected in the generated gestures, as can be appreciated in this video. The executions of both systems correspond to the models trained to generate movements using $\mu = 4$ as unit of movement. Notice that the temporal length of the audio intended to be pronounced by the robot determines the number of UMs required from the generative model. Thus, the execution of those UMs, one after the other, defines the whole movement displayed by the robot. On the one hand, the head information provided by the OpenNI skeleton tracking package was not enough for preserving head movements and thus, the resulting motion was poor. On the other hand, the tracker only offered wrist positions and, as a consequence, a vision based alternative was developed by segmenting the red/green colors of the gloves worn by the speaker for tracking palms and backs of both hands. The opening/closing of the fingers was made at random for each generated movement. Lastly, the robot elbows tended to be too separated from the body and raised up. At a glance, it can be seen that the OpenPose based approach overcomes these three main drawbacks.
8 Conclusions and further work

In this paper an approach to quantitatively measure the degree of fidelity/originality of a gesture generating method is presented. Two beat gesture generation approaches are compared: an OpenNI based GAN model and an OpenPose based GAN model. These two systems basically differ in the MoCap system being used for acquiring the database used for learning the generative model.

To measure the fidelity of the generated samples to the original ones we performed a PCoA over the original and generated samples for the two types of MoCap and different lengths of units of movements. The visual analysis, as well as the decomposition of the variances in this step, support the hypothesis that the generated gestures are indeed similar to the original ones. We also discovered that the variances explained by the regression models used to recover the original units are bigger in OpenPose, which could point to a bigger fidelity to the original. In the same vein, UM lengths 4 and 6 appear to have a higher degree of fidelity. To measure the originality, we calculated procrustes statistics and observed that, in general terms, the originality seems bigger in OpenPose, and at the same time this approach generates smoother movements according to two motion measures: jerk and path length.

Finally, we have defined a Fréchet Gesture Distance (FGD), inspired by the Fréchet Inception Distance (FID), to be able to see how far the generated gestures are from the original ones. The Fréchet distance is a measure of the similarity between two distributions; in our context those two distributions are the probabilities assigned by a classifier over all the possible classes when presented with a new instance. Therefore, FGD is generator-agnostic, in the sense that it is irrelevant how the objects have been created; only their predicted probabilities when applying some model (as with Inception in the case of FID) are taken into account.

Let us remember that we want the generated gestures to be similar, but not too much. We could wonder whether, given the data collected so far (PCoA, jerk), the similarity constraint has already been fulfilled (let us remember that the difference in variance composition tips the balance in the other direction), and the FGD will be bigger (more different). To pursue this analysis, we have computed the FGD for the two MoCaps (OpenNI and OpenPose) and three lengths of units of movement (4, 6 and 8). We see that the FGD for OpenPose is smaller than for OpenNI, so it seems reasonable to suppose that in the balance between similarity and originality, the smaller the FGD measure the better. This leads us also to the conclusion that units of movement of length 4 are better than 6 or 8.

Visual inspection reflected that, although subtle, the difficulties in tracking hands and head positions were translated to the GAN generated gestures. Subtle are also the differences among the measured values, probably because the two systems being compared are equal in nature. Thus, as further work we plan to repeat the analysis to observe whether the results of these different quantitative techniques are translatable when comparing, for instance, GAN based approaches to other motion generation approaches such as variational autoencoders.
References

[1] Alibeigi, M., Rabiee, S., Ahmadabadi, M.N.: Inverse kinematics based human mimicking system using skeletal tracking technology. Journal of Intelligent & Robotic Systems (1), 27–45 (2017)
[2] Barratt, S., Sharma, R.: A note on the inception score. arXiv preprint arXiv:1801.01973 (2018)
[3] Beck, A., Yumak, Z., Magnenat-Thalmann, N.: Body movements generation for virtual characters and social robots. In: Social signal processing, chap. 20, pp. 273–286. Cambridge University Press (2017)
[4] Becker-Asano, C., Ishiguro, H.: Evaluating facial displays of emotion for the android robot Geminoid F. In: 2011 IEEE Workshop on Affective Computational Intelligence (WACI), pp. 1–8 (2011). DOI 10.1109/WACI.2011.5953147
[5] Borji, A.: Pros and cons of GAN evaluation measures. Computer Vision and Image Understanding, 41–65 (2019)
[6] Breuleux, O., Bengio, Y., Vincent, P.: Unlearning for better mixing. Universite de Montreal/DIRO (2010)
[7] Calinon, S., D'halluin, F., Sauser, E.L., Caldwell, D.G., Billard, A.G.: Learning and reproduction of gestures by imitation. In: International Conference on Intelligent Robots and Systems, pp. 2769–2774 (2004)
[8] Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008 (2018)
[9] Carpinella, C., Wyman, A., Perez, M., Stroessner, S.: The robotic social attributes scale (RoSAS): Development and validation. In: 17th Human Robot Interaction, pp. 254–262 (2017). DOI 10.1145/2909824.3020208
[10] Cerrato, L., Campbell, N.: Engagement in dialogue with social robots. In: Dialogues with Social Robots, pp. 313–319. Springer (2017)
[11] Eielts, C., Pouw, W., Ouwehand, K., van Gog, T., Zwaan, R.A., Paas, F.: Co-thought gesturing supports more complex problem solving in subjects with lower visual working-memory capacity. Psychological Research (2), 502–513 (2020). DOI 10.1007/s00426-018-1065-9
[12] Gao, X., Yun, C., Jin, H., Gao, Y.: Calibration method of robot base frame using procrustes analysis. In: 2016 Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), pp. 16–20. IEEE (2016)
[13] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
[14] Gower, J.: Encyclopedia of Statistical Sciences, vol. 5, chap. Measures of similarity, dissimilarity and distance, pp. 397–405. John Wiley & Sons, New York (1985)
[15] Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika (3–4), 325–338 (1966)
[16] Gower, J.C., Dijksterhuis, G.B., et al.: Procrustes problems, vol. 30. Oxford University Press on Demand (2004)
[17] Hasegawa, D., Kaneko, N., Shirakawa, S., Sakuta, H., Sumi, K.: Evaluation of speech-to-gesture generation using bi-directional LSTM network. In: 18th International Conference on Intelligent Virtual Agents, pp. 79–86 (2018). DOI 10.1145/3267851.3267878
[18] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)
[19] Hotelling, H.: Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology (6), 417–441 (1933). DOI 10.1037/h0071325
[20] Jarque-Bou, N.J., Scano, A., Atzori, M., Müller, H.: Kinematic synergies of hand grasps: a comprehensive study on a large publicly available dataset. Journal of NeuroEngineering and Rehabilitation (1), 63 (2019)
[21] Kofinas, N., Orfanoudakis, E., Lagoudakis, M.G.: Complete analytical forward and inverse kinematics for the NAO humanoid robot. Journal of Intelligent & Robotic Systems (2), 251–264 (2015). DOI 10.1007/s10846-013-0015-4
[22] Kucherenko, T., Hasegawa, D., Kaneko, N., Henter, G., Kjellström, H.: On the importance of representations for speech-driven gesture generation. In: 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), pp. 2072–2074 (2019)
[23] Kucherenko, T., Jonell, P., van Waveren, S., Eje Henter, G., Alexanderson, S., Leite, I., Kjellström, H.: Gesticulator: A framework for semantically-aware speech-driven gesture generation. arXiv e-prints arXiv:2001.09326 (2020)
[24] Kullback, S.: Information theory and statistics. Courier Corporation (1997)
[25] Lhommet, M., Marsella, S.: The Oxford Handbook of Affective Computing, chap. Expressing Emotion Through Posture and Gesture, pp. 273–285. Oxford University Press (2015)
[26] Makondo, N., Rosman, B., Hasegawa, O.: Knowledge transfer for learning robot models via local procrustes analysis. In: 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pp. 1075–1082. IEEE (2015)
[27] Maset, E., Scalera, L., Zonta, D., Alba, I., Crosilla, F., Fusiello, A.: Procrustes analysis for the virtual trial assembly of large-size elements. Robotics and Computer-Integrated Manufacturing, 101885 (2020)
[28] McNeill, D.: Hand and mind: What gestures reveal about thought. University of Chicago Press (1992)
[29] Mukherjee, S., Paramkusam, D., Dwivedy, S.K.: Inverse kinematics of a NAO humanoid robot using Kinect to track and imitate human motion. In: International Conference on Robotics, Automation, Control and Embedded Systems (RACE). IEEE (2015)
[30] Nazeri, K., Ng, E., Joseph, T., Qureshi, F.Z., Ebrahimi, M.: EdgeConnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212 (2019)
[31] Pan, M., Croft, E., Niemeyer, G.: Evaluating social perception of human-to-robot handovers using the robot social attributes scale (RoSAS). In: ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 443–451 (2018). DOI 10.1145/3171221.3171257
[32] Park, G., Konno, A.: Imitation learning framework based on principal component analysis. Advanced Robotics (9), 639–656 (2015). DOI 10.1080/01691864.2015.1007084
[33] Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346 (2019)
[34] Poubel, L.P.: Whole-body online human motion imitation by a humanoid robot using task specification. Master's thesis, Ecole Centrale de Nantes–Warsaw University of Technology (2013)
[35] Rodriguez, I., Astigarraga, A., Jauregi, E., Ruiz, T., Lazkano, E.: Humanizing NAO robot teleoperation using ROS. In: International Conference on Humanoid Robots (Humanoids) (2014)
[36] Rodriguez, I., Martínez-Otzeta, J.M., Irigoien, I., Lazkano, E.: Spontaneous talking gestures using generative adversarial networks. Robotics and Autonomous Systems, 57–65 (2019)
[37] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)
[38] Suguitan, M., Gomez, R., Hoffman, G.: MoveAE: Modifying affective robot movements using classifying variational autoencoders. In: ACM/IEEE International Conference on Human Robot Interaction (HRI), pp. 481–489 (2020). DOI 10.1145/3267851.3267878
[39] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
[40] Theis, L., Bethge, M.: Generative image modeling using spatial LSTMs. In: Advances in Neural Information Processing Systems, pp. 1927–1935 (2015)
[41] Theis, L., van den Oord, A., Bethge, M.: A note on the evaluation of generative models. CoRR abs/1511.01844 (2015)
[42] Velner, E., Boersma, P.P., de Graaf, M.M.: Intonation in robot speech: Does it work the same as with people? In: ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 569–578 (2020)
[43] Wolfert, P., Kucherenko, T., Kjellström, H., Belpaeme, T.: Should beat gestures be learned or designed? A benchmarking user study. In: ICDL-EPIROB 2019 Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, p. 4 (2019)
[44] Wood, M., Simmatis, L., Boyd, J.G., Scott, S., Jacobson, J.: Using principal component analysis to reduce complex datasets produced by robotic technology in healthy participants. Journal of NeuroEngineering and Rehabilitation (2018). DOI 10.1186/s12984-018-0416-5
[45] Wu, Y., Donahue, J., Balduzzi, D., Simonyan, K., Lillicrap, T.: LOGAN: Latent optimisation for generative adversarial networks. arXiv preprint arXiv:1912.00953 (2019)
[46] Zabala, U., Rodriguez, I., Martínez-Otzeta, J.M., Lazkano, E.: Learning to gesticulate by observation using a deep generative approach. In: 11th International Conference on Social Robotics (ICSR) (2019). URL http://arxiv.org/abs/1909.01768
[47] … (8), 1947–1962 (2018)
[48] Zhang, Z., Niu, Y., Kong, L.D., Lin, S., Wang, H.: A real-time upper-body robot imitation system. International Journal of Robotics and Control