Generating Emotive Gaits for Virtual Agents Using Affect-Based Autoregression
Uttaran Bhattacharya, Nicholas Rewkowski, Pooja Guhan, Niall L. Williams, Trisha Mittal, Aniket Bera, Dinesh Manocha
Uttaran Bhattacharya * UMD College Park, USA
Nicholas Rewkowski † UMD College Park, USA, UNC Chapel Hill, USA
Pooja Guhan ‡ UMD College Park, USA
Niall L. Williams § UMD College Park, USA
Trisha Mittal ¶ UMD College Park, USA
Aniket Bera || UMD College Park, USA
Dinesh Manocha ** UMD College Park, USA

* e-mail: [email protected]  † e-mail: [email protected]  ‡ e-mail: [email protected]  § e-mail: [email protected]  ¶ e-mail: [email protected]  || e-mail: [email protected]  ** e-mail: [email protected]

Figure 1: Emotive Gaits for Virtual Agents (VAs) Generated Using Affect-Based Autoregression: Samples of the emotive gaits for VAs generated by our learning-based algorithm in an AR environment. The top row shows stick figures and the bottom row shows human models corresponding to the VAs. The VAs can exhibit different emotions based on their gaits, as indicated at the top of each column, while walking along user-defined trajectories at interactive rates.

ABSTRACT
We present a novel autoregression network to generate virtual agents that convey various emotions through their walking styles or gaits. Given the 3D pose sequences of a gait, our network extracts pertinent movement features and affective features from the gait. We use these features to synthesize subsequent gaits such that the virtual agents can express and transition between emotions represented as combinations of happy, sad, angry, and neutral. We incorporate multiple regularizations in the training of our network to simultaneously enforce plausible movements and noticeable emotions on the virtual agents. We also integrate our approach with an AR environment using a Microsoft HoloLens and can generate emotive gaits at interactive rates to increase the social presence. We evaluate how human observers perceive both the naturalness and the emotions from the generated gaits of the virtual agents in a web-based study. Our results indicate around 89% of the users found the naturalness of the gaits satisfactory on a five-point Likert scale, and the emotions they perceived from the virtual agents are statistically similar to the intended emotions of the virtual agents. We also use our network to augment existing gait datasets with emotive gaits and will release this augmented dataset for future research in emotion prediction and emotive gait synthesis. Our project website is available at https://gamma.umd.edu/gen_emotive_gaits/.

Index Terms: Human-centered computing—Human computer interaction (HCI)—Interaction paradigms—Mixed / augmented reality; Computing methodologies—Machine learning—Machine learning approaches—Neural networks
1 INTRODUCTION
The generation of intelligent virtual agents (IVAs) is important for many virtual and augmented reality systems. The virtual agents correspond to embodied digital characters that are often used as avatars to represent the users and may look like real-world characters. Recent work in photorealistic rendering and capturing technologies has resulted in generating agents or avatars that closely resemble humans and are widely used in VR and AR systems [2, 21].

Many applications, including virtual assistance, training, and AI chatbots, need a computationally created virtual agent or an avatar that not only looks like a real human but also behaves like one and conveys emotions [35, 60]. Perception of such emotional expressiveness is commonly described as the ability of an observer to make decisions on a subject's emotional state by observing certain patterns or cues physically expressed by the subject [22, 58, 66]. These physical cues are expressed through various "modalities," including, but not limited to, facial expressions [17], tones of voice [19], gestures, body expressions [15], walking styles or "gaits" [47], etc. Emotions coming from such different modalities [45], in conjunction with the different underlying situational and social contexts [32, 46], have a significant impact on our everyday lives. They influence our social interactions and relationships and provide key insights into developing healthy social environments [29]. Similarly, emotions can greatly impact the perception of these virtual agents in terms of social presence and how the users behave when interacting with them in AR and VR environments [34, 48, 60].

In this paper, we mainly focus on designing virtual agents that are capable of expressing different emotions through their gaits, i.e., virtual agents with emotive gaits. When perceiving emotions from gaits, humans generally look at physical expressions such as arm swing, stride length, upper-body posture, head jerk, etc. [13], collectively referred to as affective features. In fact, studies have shown that observers often rely on such cues from gaits and other body expressions, especially when there are mismatches with cues from more common modalities such as facial expressions [4]. As a result, it is useful to generate virtual agents with emotive gaits for gaming and social VR [52, 71], crowd simulation and path planning [6, 50], anomaly detection [63], therapy and rehabilitation [65], and psychology and neurobiology [5, 51, 68]. Studies indicate that virtual agents expressing various moods, emotions, and behaviors elicit more empathy and engagement from humans interacting with them [64]. However, automated methods to generate gaits that express certain emotions are challenging to design and implement. This is hard not only because of the complexity of modeling the periodic and aperiodic motions constituting different gaits, but also because of the individual, social, and cultural diversities in terms of both expressing and perceiving emotions [3, 20]. These challenges are further exacerbated by the difficulty of collecting and annotating large benchmark datasets of gaits with appropriate emotion labels. Datasets collected in controlled experimental settings are often exaggerated and not always generalizable.
On the other hand, datasets collected in the wild often suffer from the long-tail nature of the distribution of emotions, with most emotions being close to neutral. Therefore, there is a need to develop automated methods to synthesize emotive gaits for virtual agents that can augment the existing datasets.

There is extensive work in AR/VR, computer graphics, vision, and related areas such as biomechanics on automated techniques for generating human characters capable of performing locomotion activities, including walking, running, leaping, and more [24, 25, 57, 60–62, 75]. These methods make use of movement features such as joint rotations, joint velocities, frequency of foot contact with the ground, and walking phases, and combine them with structural and kinematic constraints of the human body and learning techniques. While these methods can generate plausible locomotion, walking patterns, and actions, it is non-trivial to add emotional components to these techniques or use them to generate emotive gaits.

Main Results:
We present a novel autoregression method that takes as input 3D pose sequences of the gaits of virtual agents (VAs) and efficiently combines pose affective features, such as arm swings, head jerks, and body posture, and movement features, such as stepping speed and root height, to generate future pose sequences of emotive gaits. We present a network architecture that incorporates both the spatial information and the spectral information available from the input pose sequences and enables the VAs to both express and transition smoothly between different emotions while walking. We construct VAs as both stick figures and human models using our generated emotive gaits and integrate these VAs in an AR environment using the Microsoft HoloLens (Figure 1). Our learning-based algorithm takes a few milliseconds to generate an emotive gait for each agent on a pair of NVIDIA GeForce GTX 1080Ti GPUs. Our VAs, overlaid onto a real-world room, are rendered at interactive rates to increase their sense of social presence. The novel components of our work include:

• An autoregression network that takes in 3D pose sequences of a VA's gait, the desired future trajectory, and the desired emotions. It outputs the VA's gait expressing the given emotion while following the given trajectory.

• A novel training method combining movement features and psychologically-motivated affective features into a unified network to generate plausible, emotionally-expressive gaits.

• A transition scheme for the characters to smoothly transition between gaits expressing different emotions.

• An elaborate web-based user study to evaluate the benefits of the emotive gaits generated by our algorithm. We asked the observers to report the emotions they perceived from the generated gaits, as well as the Likert scale (LS) values of pose affective features that contributed to their perception. Based on the study, we conclude that
  – there is strong statistical evidence to suggest that the observers' perceived emotions are statistically similar to the corresponding intended emotions of our VAs, thereby showing that the generated gaits are emotive, and
  – the observers consistently reported different LS values of the pose affective features for different emotions, making our choice of pose affective features statistically significant for perceiving emotional expressiveness.

• An augmented dataset, "Synthesized Emotive Gait," which provides emotive gaits generated by our method to facilitate more research in this area.
2 RELATED WORK
In this section, we briefly survey prior work on representing emotions, perceiving emotions from gaits, generating and styling gaits for virtual agents, and making virtual agents emotionally expressive.
Various models for representing emotions have been studied in psychology, and the Valence-Arousal-Dominance (VAD) model [44] is one of the most popular. The VAD model considers a continuous 3D space spanned by the valence, arousal, and dominance axes. Valence is a measure of the pleasantness of the emotion, arousal is a measure of the intensity of expression, and dominance is a measure of how much the emotion makes one feel in control. Many methods use a simpler model that is a linear combination of discrete emotions and represents a subset of VAD.

Humans perceive these emotions by observing physical features or cues expressed via different modalities. Studies conducted by Montepare et al. [47] concluded that observers were able to perceive emotions by only looking at the subjects' gaits. Subsequently, Roether et al. [66] and Gross et al. [22] identified that observers were most consistent when looking at gaits expressing emotions that varied on the arousal axis. Follow-up studies looked more closely at the gait-based expressions observers focused on for distinguishing between different perceived emotions and identified features including arm swing, gait velocity, upper body posture, and head jerk [9, 10, 13, 28]. In contrast to prior approaches, our goal is to use affective features and movement features to synthesize gaits with emotions varying on the arousal axis.
Prior works have commonly explored the generation of emotionally expressive virtual agents via modalities such as verbal communication [11, 70], face movements [18], body gestures [27], and gaits [60–62]. These generation techniques have had significant performance benefits when combined with concepts from affective computing. For example, Pelczer et al. [55] designed a strategy to evaluate the accuracy of identifying the modalities of emotional expressiveness in a virtual agent. McHugh et al. [43] explored how different body postures influenced the emotion perception of individual agents in crowds. Clavel et al. [12] studied the combined effect of faces and postures of virtual agents on emotion perception, and Liebold et al. [39] generalized this to include combinations of other modalities such as verbal cues and faces. More recently, Randhavane et al. [61] developed an empirical mapping between gait and gaze features and different emotions to generate emotionally expressive virtual agents. Our approach to generating emotive gaits is complementary to these methods and can be combined with them.
There has been extensive prior work in computer graphics and AR/VR for generating and styling gaits for virtual agents.

Figure 2: Our affective features: The top left figure shows our pose graph as a directed tree, with the joints numbered. We use 18 affective features, counting 11 joint angles, 4 distance ratios, and 3 area ratios. The joint angles are marked with red arcs on the last three figures in the top row. The leftmost figure on the bottom row shows the distances we use to compute the 4 distance ratios. The last three figures on the bottom row show the triangles we use to compute the 3 area ratios. These features are used by our network to generate emotive gaits of the virtual agents.

Early approaches used patch-based building blocks [36], kernel-based approaches [73], or modeled the motion paths as directed graphs [33] to generate natural-looking movement styles. Recent approaches have leveraged large-scale datasets using deep learning-based approaches to generate diverse movement styles. These approaches include training a network on specific joint trajectories [25, 67], using periodic phase functions, which are either modeled geometrically [24] or learned with a neural network [72], to represent walking cycles, and exploiting transfer to reduce over-dependency on data [42]. Other approaches use deep reinforcement learning to learn control policies for virtual characters exhibiting different movement styles and actions [37, 53, 56, 57]. Yet other approaches model motion prediction as an autoregression problem and have utilized recurrent networks [54] and convolutional networks [38] on motion-captured pose sequences, and generative adversarial learning on dynamic pose graphs [14], to predict future motions. While these methods are not built for motion styling, their key concepts have been useful in developing many motion-based style transfer methods [16, 30, 74].

In contrast with these methods, we combine walking phases and gait-based affective features in an autoregression network to estimate future joint rotations and movement features for different emotive gaits. Instead of a DRL-based control policy, our network learns a feature-based latent representation space and maps from that space to emotion-styled predicted poses. Furthermore, our emotion styles are sampled from a continuous space of emotions. Therefore, they need to be modeled differently from conventional styles, which can be viewed as one-hot labels in a discrete space. We also demonstrate that our approach can generate gaits expressing a continuous range of emotions for AR applications.

3 GENERATING EMOTIVE GAITS
In this section, we present our approach for generating emotive gaits of VAs. The inputs to our algorithm are the sample gaits of a VA provided as motion capture data, the desired trajectory, and the desired emotion. Our goal is to generate subsequent predictions of the gait that follow the desired trajectory and express the desired emotion. Since our approach depends only on the input motion-captured gait samples and the desired trajectories, we can adapt to VAs with different skeletal dimensions and different natural walking styles, as well as to different AR environments.
We choose to use an emotion model consisting of linear combinations of categorical emotions varying primarily on the arousal axis (happy, angry, sad, etc.). Although this model admittedly spans a smaller set of emotions than the VAD model, prior works have reported that categorical emotion terms are more easily understood by non-experts, leading to the availability of more labeled data and the generation of a diverse range of emotions [9, 59].
We present the various components used by our network to perform generation: input gaits, emotions, pose affective features, and trajectory features.
We denote a gait $G$ as $G = \left\{ X_j^t = [x_j^t, y_j^t, z_j^t]^\top \in \mathbb{R}^3 \right\}_{j=0,\,t=0}^{J-1,\,T-1}$, where $[x_j^t, y_j^t, z_j^t]^\top$ denotes the 3D position of joint $j$ at time step $t$ in the world frame, $J$ denotes the total number of joints, and $T$ denotes the total number of input time steps. The inputs to our network are joint rotations extracted from the gait $G$, and control signals obtained from the gait, its trajectory, and the associated emotions.

At each time step $t$, we model the pose graph of the input gait as a directed tree, as shown in Figure 2. The root joint in the pose is the root node of the tree, and the two toes, the two hand indices, and the head are the leaf nodes of the tree. All edges in the tree are directed from the root node to the leaf nodes. We denote the parent of a joint $j$ as $P(j)$. In our construction, each joint has a unique parent, except the root joint, which has no parent. We therefore assign $P(j) = -1$ for the root joint. For each joint $j$ at each time step $t$, we consider the rotation $R_j^t \in SO(3)$ that transforms the joint from its offset $o_j \in \mathbb{R}^3$ — a pre-defined initial position relative to its parent $P(j)$ — to its position at time step $t$ relative to its parent. That is, we consider the rotation $R_j^t$ such that the global position $X_j^t$ of the joint $j$ at time step $t$ is given by $X_j^t = R_j^t o_j + X_{P(j)}^t$. For the root joint, we consider $o_0 = 0$ and obtain its position directly from the gait $G$.

Following the approach taken by Pavllo et al. [54], we represent these rotations as unit quaternions or versors $q_j^t \in \mathbb{H} \subset \mathbb{R}^4$, where $\mathbb{H}$ denotes the space of versors. We have chosen to represent rotations as versors, as they are free of the gimbal-lock problem. We enforce the additional unit norm constraints for these versors when training our network. Thus, for each gait, our input rotations are given as $\left\{ q^t = [q_1^t; \dots; q_{J-1}^t] \in \mathbb{H}^{J-1} \right\}_{t=0}^{T-1}$. We do not consider $q_0^t$ for the root joint, denoted by 0, as part of the input data, since $o_0 = 0$.

We note that the modeling of emotions is fundamentally different from modeling conventional motion styles such as strutting, zombie-like, etc. These conventional styles can be considered "discrete", such that clips or images can be categorized as belonging to a particular style. Emotions, on the other hand, span a continuous space [44], such that different motion clips can have different intensities of the same emotion. For example, a somewhat happy gait is expressed differently from one that is extremely happy. For conventional styles, it is generally not needed to account for such intensities, e.g., slightly strutting vs. heavily strutting. To account for this continuous nature, we assume each gait in the input dataset is associated with an emotion vector whose components are the $C$ categorical emotion terms, i.e., each emotion $m$ is a vector in $\mathbb{R}^C$. In practice, the value of each element $l$ in an emotion vector $m$ is the relative count of the number of annotators who labeled the corresponding gait with the categorical emotion term $l$. Also, for training, and due to the practical limitations of separately annotating the emotion at each time step, we repeat the same annotated emotion $m$ in all the time steps. In other words, we assume the emotion vector remains unchanged throughout the corresponding input gait.
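To make the forward-kinematics relation $X_j^t = R_j^t o_j + X_{P(j)}^t$ concrete, the following is a minimal sketch of how global joint positions can be recovered from per-joint versors and offsets over the directed pose tree. It is an illustration under our own assumptions (NumPy, a parent-index array with parents listed before children), not the paper's implementation.

```python
import numpy as np

def rotate(q, v):
    """Rotate a 3D vector v by a unit quaternion q = [w, x, y, z]."""
    w, x, y, z = q
    u = np.array([x, y, z])
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

def forward_kinematics(rotations, offsets, parents, root_position):
    """
    rotations:     (J, 4) unit quaternions (versors) per joint; the root rotation is unused here.
    offsets:       (J, 3) rest-pose offsets o_j relative to the parent joint.
    parents:       (J,) parent index P(j), with -1 for the root joint.
    root_position: (3,) global root position taken directly from the gait G.
    Returns the (J, 3) global joint positions X_j = R_j o_j + X_{P(j)}.
    """
    positions = np.zeros((len(parents), 3))
    for j, p in enumerate(parents):          # assumes parents appear before their children
        if p == -1:
            positions[j] = root_position     # o_0 = 0 for the root joint
        else:
            positions[j] = rotate(rotations[j], offsets[j]) + positions[p]
    return positions
```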
Figure 3: Movement features: We show the root height from the ground $h^t$, the root speed $s^t$, and the stepping phase $\theta^t$. The root speed is the distance travelled between time steps $t$ and $t-1$. The stepping phase $\theta^t = 0$ when the left foot touches the ground at time step $t$, $\theta^{t+\Delta t} = \pi$ when the right foot touches the ground at time step $t+\Delta t$, and $\theta^{t+\Delta t+\tau} = 2\pi$ when the left foot touches the ground again. We fill in the values for $\theta^t$ between these time steps using linear interpolation. We use these features in our autoregression network.

Prior studies in psychology have shown that various physically-based pose features observed per-frame during a gait, better known as affective features, aid the identification of perceived emotions from gaits [13, 28]. Roether et al. [66] identified such a set of necessary pose affective features for human perception. To make these features suitable for machine perception, prior works such as [9, 10, 59] have come up with necessary sets of scale-independent pose affective features that can be computed geometrically. Scale independence is an important factor in such intra-frame affective features, as observers can identify emotions irrespective of the distance from or the physical stature of the subject. In our work, we use the following three types of scale-independent pose affective features to encode the relevant emotion information:
Angles.
We use the angles subtended by a pair of joints at a third joint. For example, the angle between the two shoulder joints at the neck measures slouching, an indicator of valence and arousal.
Distance ratios.
We use the ratios of the distances between two pairs of joints. For example, the ratio of the distance between the two feet joints to the distance between the neck and the root joints measures the stride, which can indicate arousal and dominance.
Area ratios.
We use the ratios of areas formed by two triplets of joints. These can be considered as amalgamations of the angle-based and the distance ratio-based features, and they can be used to supplement observations from both these types of features. For example, the ratio of the area of the triangle formed by the hand indices and the neck to the area of the triangle formed by the toes at the root can be used to simultaneously measure arm swings and strides, which can collectively indicate the valence, arousal, and dominance.

We use 11 angles, four distance ratios, and three area ratios for a total of 18 pose affective features, which we collectively denote as $a^t \in \mathbb{R}^{18}$ at each time step $t$. We list all these pose affective features in Figure 2, and direct the interested reader to [10] for a detailed analysis on choosing these features. A sketch of how such features can be computed is shown below.
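As a concrete illustration, the sketch below computes one instance of each feature type (an angle, a distance ratio, and an area ratio) from a single-frame pose. The joint indices are hypothetical placeholders; the paper's exact 18 features and joint numbering are given in Figure 2 and [10].

```python
import numpy as np

def angle_at(pose, apex, a, b):
    """Angle subtended at joint `apex` by joints `a` and `b` (e.g., the shoulders at the neck)."""
    u, v = pose[a] - pose[apex], pose[b] - pose[apex]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def distance_ratio(pose, pair_num, pair_den):
    """Ratio of distances between two joint pairs (e.g., feet spread over neck-to-root length)."""
    d_num = np.linalg.norm(pose[pair_num[0]] - pose[pair_num[1]])
    d_den = np.linalg.norm(pose[pair_den[0]] - pose[pair_den[1]])
    return d_num / (d_den + 1e-8)

def area_ratio(pose, tri_num, tri_den):
    """Ratio of the areas of two joint triangles (e.g., hands-neck over toes-root)."""
    area = lambda a, b, c: 0.5 * np.linalg.norm(np.cross(pose[b] - pose[a], pose[c] - pose[a]))
    return area(*tri_num) / (area(*tri_den) + 1e-8)

# Hypothetical joint indices for illustration only; the actual skeleton indexing differs.
ROOT, NECK, L_SHOULDER, R_SHOULDER, L_FOOT, R_FOOT = 0, 2, 5, 8, 14, 18
pose = np.random.rand(21, 3)                           # one frame of a 21-joint pose
slouch = angle_at(pose, NECK, L_SHOULDER, R_SHOULDER)  # indicator of valence and arousal
stride = distance_ratio(pose, (L_FOOT, R_FOOT), (NECK, ROOT))
```

All of these quantities are angles or ratios, so they are scale-independent by construction, as the section requires.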
We use the trajectory followed throughout a gait to extract pertinent movement information for our network to generate gaits following given trajectories. We use two kinds of movement features: root joint features and stepping features. The former consists of the root height deviation, the root speed (along with a low-pass filtered component), the root orientation difference, and the trajectory curvature. The latter consists of the stepping phase and the foot-step frequency, which are obtained from the trajectory. Figure 3 illustrates some of these features. Apart from movement information, the root joint features and the foot-step frequency also provide inter-frame or dynamic affective information for emotional expressions. We define these features below.

Root joint features
The root height deviation ($h^t$) is the signed difference of the height of the root joint from its mean height from the ground plane across the time steps. Subjects expressing emotions with higher arousal tend to have their upper bodies more upright, thus keeping the root height above the mean more often than subjects expressing emotions with lower arousal. In our case, the XZ plane is the ground plane, making the root height $h^t = y^t$.

The root speed ($s^t$) is the magnitude of the difference of the 2D position of the root joint, as projected on the ground plane, between the current step $t$ and the previous step $t-1$. Root speed helps indicate the arousal as well, with higher arousal tending to result in faster speeds more often. We represent the root speed as $s^t = \left\| [x^t, z^t]^\top - [x^{t-1}, z^{t-1}]^\top \right\|$. With the root speed, we also use its low-pass filtered component $\bar{s}^t$. This reduces the high-frequency noise in the root speed, which is especially useful when the network learns on trajectories with high curvatures.

The root orientation difference ($\delta^t$) is the angular difference between the root orientation $\alpha^t$ w.r.t. the world coordinates and the tangent $\tau^t$ to the 2D root joint positions $[x^t, z^t]^\top$ on the ground plane, w.r.t. the world coordinates at each time step $t$. We express the tangent using the forward difference, i.e., $\tau^t = [x^t, z^t]^\top - [x^{t-1}, z^{t-1}]^\top$. Then we have $\delta^t = d_{ang}\left( [\sin\alpha^t, \cos\alpha^t]^\top,\; \tau^t / \|\tau^t\| \right)$, where $d_{ang}$ denotes the unsigned smaller angle between the unit vectors.

Stepping features
The trajectory curvature ($\kappa^t$) is the norm of the second-order derivative of the 2D positions of the root joint on the ground plane, or equivalently, the derivative of the root joint tangents $\tau^t$ on the ground plane. We compute this using the forward difference as well, i.e., $\kappa^t = \left\| \tau^t - \tau^{t-1} \right\|$.

The stepping phase ($\theta^t$) represents the phases of the feet between the time steps where they touch the ground. We consider a half-period to be the time from the instant of one foot touching the ground to the subsequent instant when the other foot touches the ground. Given the half-periods, we define the stepping phase $\theta^t$ at each time step $t$ as follows. We assign a phase $\theta^t = 0$ when the left foot touches the ground and $\theta^t = \pi$ when the right foot touches the ground, filling in the intermediate phases through linear interpolation.

The foot-step frequency ($\omega^t$) is the angular velocity of the foot joints. Apart from generating realistic walk cycles (the motion between ipsilateral footsteps), this feature also supplements the root speed information to indicate the arousal in the emotions expressed by the gait. We compute the foot-step frequency at each time step $t$ as the difference between the phase at that time step and the previous time step $t-1$, i.e., $\omega^t = \theta^t - \theta^{t-1}$.
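The sketch below shows one plausible way to compute these root joint and stepping features from a root trajectory and known foot-contact time steps. The moving-average low-pass filter and the assumption of alternating foot contacts are our simplifications, not the paper's exact procedure.

```python
import numpy as np

def movement_features(root_xyz, left_contacts, right_contacts):
    """
    root_xyz:       (T, 3) root joint positions; the XZ plane is the ground plane.
    left_contacts:  sorted time steps at which the left foot touches the ground.
    right_contacts: sorted time steps at which the right foot touches the ground.
    Returns per-frame root height deviation h, root speed s (plus a crude low-pass
    version s_bar), trajectory curvature kappa, stepping phase theta, and
    foot-step frequency omega.
    """
    T = root_xyz.shape[0]
    h = root_xyz[:, 1] - root_xyz[:, 1].mean()                   # signed deviation from mean height

    xz = root_xyz[:, [0, 2]]
    tau = np.diff(xz, axis=0, prepend=xz[:1])                    # forward-difference tangents
    s = np.linalg.norm(tau, axis=1)                              # root speed
    s_bar = np.convolve(s, np.ones(5) / 5.0, mode="same")        # simple moving-average low-pass filter
    kappa = np.linalg.norm(np.diff(tau, axis=0, prepend=tau[:1]), axis=1)  # trajectory curvature

    # Stepping phase: 0 at a left-foot contact, pi at the next right-foot contact, 2*pi at the
    # next left contact, and so on; assumes alternating contacts, with linear interpolation between.
    events = sorted([(t, "L") for t in left_contacts] + [(t, "R") for t in right_contacts])
    times = np.array([t for t, _ in events], dtype=float)
    start = 0.0 if events[0][1] == "L" else np.pi
    phases = start + np.pi * np.arange(len(events))
    theta = np.interp(np.arange(T), times, phases)
    omega = np.diff(theta, prepend=theta[:1])                    # foot-step frequency (phase difference)
    return h, s, s_bar, kappa, theta, omega
```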
4 AUTOREGRESSION

Given a sequence of values with some information content, the overall goal of autoregression is to predict subsequent values in the sequence that maintain a similar information content [40]. In our work, we use an autoregression network to encode gaits with given emotions and predict subsequent gaits while maintaining the same emotions. Our network consists of an encoder followed by a predictor. The encoder takes in the joint positions and rotations of the input gait for a number of time steps and extracts pose affective features and root joint features from the input gaits. Next, it learns the encoding functions to jointly map these extracted features, the corresponding input emotions, and the stepping features (such as curvature and foot-step frequency) to latent representations.

The predictor learns to compute the inverse mapping from the latent representations to the necessary affective and trajectory features and, by extension, the joint positions and rotations, in the subsequent time steps.
Figure 4: Our autoregression network for emotive gaits: Our network takes in the joint rotations, the input emotions as vectors consisting of probabilities for happy, sad, angry, and neutral, the pose affective features, and the movement features, and jointly maps them to a latent representation space through the encoder. The predictor then takes in the latent representations and predicts gaits for subsequent time steps that follow the input trajectory while expressing the input emotions. The green boxes denote concatenation, and the cyan box at the end of the predictor denotes the normalization of the variables to versors.
We train the encoder and the predictor in tandem by adding the predictor's output back to the encoder's input and advancing the temporal window of the encoder. We are able to achieve emotional expressiveness by training our network with input gaits for different emotions and forcing the network to learn to predict the corresponding pose affective features in the prediction time steps from its latent representations. We simultaneously enforce a robust constraint on the network to adapt its trajectory features such that its predicted movements are close to the corresponding movements in the ground truth. This enables the network to take sharp turns and follow bends in the trajectory without smoothing out feet movements. Our overall approach is shown in Figure 4. We now elaborate on the operations of the encoder and the predictor.
In the encoder, we separately combine our emotion vectors $m$ with the pose affective features $a^t$, the stepping features consisting of $[\sin\theta^t, \cos\theta^t]$ and $\omega^t$, and the root joint features $\bar{s}^t$ and $\kappa^t$, giving us the input vectors $i_1^t = [a^t; m]^\top \in \mathbb{R}^{18+C}$ and $i_2^t = [\sin\theta^t, \cos\theta^t, \omega^t, \bar{s}^t, \kappa^t, m]^\top \in \mathbb{R}^{5+C}$.

We pass each of these inputs through a set of 3 fully connected layers, collectively denoted as the functions $\mathrm{FC}_{enc1}(\cdot, \phi_{FC_{enc1}}): \mathbb{R}^{T \times (18+C)} \to \mathbb{R}^{T \times H_1}$ and $\mathrm{FC}_{enc2}(\cdot, \phi_{FC_{enc2}}): \mathbb{R}^{T \times (5+C)} \to \mathbb{R}^{T \times H_2}$, respectively. Here, $H_1$ and $H_2$ denote the number of hidden units in the last fully connected layer in $\mathrm{FC}_{enc1}$ and $\mathrm{FC}_{enc2}$, respectively, and $\phi_{FC_{enc1}}$ and $\phi_{FC_{enc2}}$ denote the sets of trainable parameters in the two sets of fully connected layers.

We combine the outputs of these fully connected networks to obtain the intermediate representations $\gamma^t$, i.e., we have

$\gamma^t = \left[ \mathrm{FC}_{enc1}(i_1, \phi_{FC_{enc1}});\; \mathrm{FC}_{enc2}(i_2, \phi_{FC_{enc2}}) \right]$   (1)

where $i_1 = [i_1^0, \dots, i_1^{T-1}]^\top$ and $i_2 = [i_2^0, \dots, i_2^{T-1}]^\top$.

We then append $\gamma^t$ separately to our input joint rotations $q^t$ and to the remaining root joint trajectory features $h^t$, $s^t$, and $\delta^t$. We pass the appended rotation data through a GRU to obtain the latent representations $\tilde{q}$, i.e., we have

$\tilde{q} = \mathrm{GRU}_{versors}\left( [q; \gamma], \phi_{GRU_{versors}} \right)$   (2)

where
– $q = [q^0, \dots, q^{T-1}]^\top$,
– $\gamma = [\gamma^0, \dots, \gamma^{T-1}]^\top$,
– $\mathrm{GRU}_{versors}: \mathbb{R}^{T \times (4(J-1)+H_1+H_2)} \to \mathbb{R}^{T \times H_3}$,
– $H_3$ is the number of hidden units in the final layer of the GRU, and
– $\phi_{GRU_{versors}}$ denotes the trainable parameters in the GRU.

We also pass the appended root joint trajectory features through a fully connected layer $\mathrm{FC}_{root}: \mathbb{R}^{T \times (3+H_1+H_2)} \to \mathbb{R}^{T \times H_4}$, $H_4$ being the number of hidden units in the layer, to obtain the latent representations $\tilde{h}$, $\tilde{s}$, and $\tilde{\delta}$. That is, we have

$[\tilde{h}; \tilde{s}; \tilde{\delta}]^\top = \mathrm{FC}_{root}\left( [h; s; \delta; \gamma], \phi_{FC_{root}} \right)$   (3)

where
– $h = [h^0, \dots, h^{T-1}]^\top$, $s = [s^0, \dots, s^{T-1}]^\top$, $\delta = [\delta^0, \dots, \delta^{T-1}]^\top$, and
– $\phi_{FC_{root}}$ denotes the trainable parameters in the fully connected layer.

Our predictor takes in the latent representations of the joint rotations and the root joint trajectory features from the encoder and learns to predict the same for $T_{pred}$ subsequent time steps such that the corresponding generated gaits follow the input trajectory while expressing the input emotion vectors. The predictor consists of a set of 2 fully connected layers, denoted as $\mathrm{FC}_{versors}: \mathbb{R}^{T \times H_3} \to \mathbb{R}^{T_{pred} \times 4(J-1)}$, to predict the joint rotations, and three separate fully connected layers, $\mathrm{FC}_{h}: \mathbb{R}^{T \times H_4} \to \mathbb{R}^{T_{pred}}$, $\mathrm{FC}_{s}: \mathbb{R}^{T \times H_4} \to \mathbb{R}^{T_{pred}}$, and $\mathrm{FC}_{\delta}: \mathbb{R}^{T \times H_4} \to \mathbb{R}^{T_{pred}}$, to compute the respective root joint features. Thus, we have

$\hat{q} = \mathrm{FC}_{versors}(\tilde{q}, \phi_{FC_{versors}})$,   (4)
$\hat{h} = \mathrm{FC}_{h}(\tilde{h}, \phi_{FC_{h}})$,   (5)
$\hat{s} = \mathrm{FC}_{s}(\tilde{s}, \phi_{FC_{s}})$,   (6)
$\hat{\delta} = \mathrm{FC}_{\delta}(\tilde{\delta}, \phi_{FC_{\delta}})$   (7)

where, as usual, $\phi_{FC_{versors}}$, $\phi_{FC_{h}}$, $\phi_{FC_{s}}$, and $\phi_{FC_{\delta}}$ denote the respective trainable parameters.

From the predicted joint rotations and root joint features at each time step $t$, we can also compute the predicted pose $\hat{X}^t$ and the corresponding pose affective features $\hat{a}^t$.
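The following PyTorch-style sketch condenses the encoder-predictor structure described above. The hidden sizes H1–H4 and the per-time-step output are placeholders of our own; the paper's network maps T encoder steps to T_pred subsequent steps and may differ in layer details.

```python
import torch
import torch.nn as nn

class EmotiveGaitNet(nn.Module):
    """Condensed sketch: affective and stepping inputs (each concatenated with the emotion
    vector m) are embedded, joined with rotations and root features, and decoded per branch."""
    def __init__(self, J, C, H1=32, H2=32, H3=256, H4=64):
        super().__init__()
        self.fc_enc1 = nn.Sequential(nn.Linear(18 + C, H1), nn.ELU(),
                                     nn.Linear(H1, H1), nn.ELU(), nn.Linear(H1, H1))
        self.fc_enc2 = nn.Sequential(nn.Linear(5 + C, H2), nn.ELU(),
                                     nn.Linear(H2, H2), nn.ELU(), nn.Linear(H2, H2))
        self.gru_versors = nn.GRU(4 * (J - 1) + H1 + H2, H3, batch_first=True)
        self.fc_root = nn.Linear(3 + H1 + H2, H4)
        self.fc_versors = nn.Linear(H3, 4 * (J - 1))   # predicted joint rotations
        self.fc_h = nn.Linear(H4, 1)                    # predicted root height deviation
        self.fc_s = nn.Linear(H4, 1)                    # predicted root speed
        self.fc_d = nn.Linear(H4, 1)                    # predicted root orientation difference

    def forward(self, q, affective, stepping, root):
        # q: (B, T, 4(J-1)); affective: (B, T, 18+C); stepping: (B, T, 5+C); root: (B, T, 3)
        gamma = torch.cat([self.fc_enc1(affective), self.fc_enc2(stepping)], dim=-1)
        q_latent, _ = self.gru_versors(torch.cat([q, gamma], dim=-1))
        root_latent = self.fc_root(torch.cat([root, gamma], dim=-1))
        q_hat = self.fc_versors(q_latent).view(q.shape[0], q.shape[1], -1, 4)
        q_hat = q_hat / q_hat.norm(dim=-1, keepdim=True)   # renormalize predictions to versors
        return q_hat, self.fc_h(root_latent), self.fc_s(root_latent), self.fc_d(root_latent)
```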
We use these predicted variables, together with the input data, to train our network according to a curriculum schedule described in Section 5.3.
Table 1: Joint Position and Rotation Errors. We compute the position error relative to the longest diagonal of the bounding box of the characters we test, and we compute the rotation errors in degrees. The performance of our method is on par with the current state-of-the-art in motion generation.
Method | Pose Error | Rotation Error
PFNN [24] | 0.19 | 0.06
QuaterNet [54] | 0.16 | 0.05
Emotive Gait Styling | 0.12 | 0.04

We now describe the formulation of the loss function for training and validating our network. The loss function should accurately constrain the network both to learn the emotional expressions in the input gaits as well as to follow the gaits' trajectories in the subsequent time steps with plausible joint motions. We capture all these requirements in the loss function using four loss terms: the motion loss, the pose loss, the pose affective features loss, and the root joint features loss.

Motion loss ($L_{motion}$): This loss ensures that the predicted joint motions remain plausible, i.e., close to the ground truth joint motions. To compute this loss, we measure the angle difference between the ground truth rotations $q_j^t$ and the predicted rotations $\hat{q}_j^t$ on each joint at each prediction time step $t$. We also add the unit norm constraint on the predicted versors as a regularization. Thus, we write the motion loss as

$L_{motion} := \sum_{j,t} \left\| \mathrm{q2e}(q_j^t) - \mathrm{q2e}(\hat{q}_j^t) \right\|^2 + \lambda_{versor} \left( \|\hat{q}_j^t\| - 1 \right)^2$   (8)

where $\mathrm{q2e}$ maps the versors to the corresponding Euler angles, and the summation is over all the joints across all the prediction time steps.

Pose loss ($L_{pose}$): The pose loss supplements the motion loss by adding an extra regularization to maintain plausible predicted joint motions. We require the predicted character poses $\hat{X}^t$ at each prediction time step, obtained using the predicted versors $\hat{q}_j^t$, to be as close as possible to the corresponding ground truth poses $X^t$. However, we do not require our predicted poses to follow the same trajectory as the ground truth poses, since the desired trajectory will be provided to us at test time. We, therefore, subtract the root joint position from all the other joints at every time step and write our pose reconstruction loss $L_{pose}$ as

$L_{pose} := \sum_t \sum_{j=1}^{J-1} \left\| \left( X_j^t - X_0^t \right) - \left( \hat{X}_j^t - \hat{X}_0^t \right) \right\|^2.$   (9)

Pose affective features loss ($L_{aff}$): This loss constrains the network to predict pose affective features similar to the ones computed from the input gaits. Therefore, it forces the network to maintain the emotional expressions in the gaits. We compute this loss by measuring the norm difference between the ground truth affective features $a^t$ and the predicted features $\hat{a}^t$. We write it as

$L_{aff} = \sum_t \left\| a^t - \hat{a}^t \right\|^2.$   (10)

Root joint features loss ($L_{root}$): This is a robust loss that we use to constrain the network to follow the ground truth gait trajectory at the prediction time steps. The robustness ensures that the prediction follows sharp turns and bends in the trajectory without smoothing out the foot joint movements. We compute this loss by measuring the $L_1$ norm difference between the ground truth root joint features $h$, $s$, and $\delta$, and the predicted features $\hat{h}$, $\hat{s}$, and $\hat{\delta}$ given by our network. We write it as

$L_{root} = \sum_t \left\| [h^t; s^t; \delta^t] - [\hat{h}^t; \hat{s}^t; \hat{\delta}^t] \right\|_1.$   (11)
Emotional expressions and transitions.
Each row showsfour snapshots of synthesized gaits in temporal sequence from left toright. The top two rows show gaits with single emotions. The bottomrow shows gaits transitioning from one emotion to another. ( L ft ct ) We also require the generated characters to walk naturally withoutany foot sliding. Therefore, we add a robust L norm loss to constrainthe heel and toe positions of the generated character to match theground truth heel and toe positions. The robustness ensures that theprediction follows sharp turns and bends in the trajectory withoutsmoothing out the foot joint movements. We write this loss as L ft ct = ∑ t (cid:13)(cid:13)(cid:13)(cid:2) lh t ; lt t ; rh t ; rt t (cid:3) − (cid:104) ˆ lh t ; ˆ lt t ; ˆ rh t ; ˆ rt t (cid:105)(cid:13)(cid:13)(cid:13) . (12)Finally, we linearly combine all these loss terms to formulate ouroverall loss function L , which we write as L = λ motion L motion + λ pose L pose + λ aff L aff + λ root L root + λ ft ct L ft ct (13)where λ motion , λ pose , λ aff , λ root and λ ft ct are the correspondingscaling terms and assign relative importance to the different lossterms. ESULTS
5 RESULTS

We show the performance of our autoregression network on the training dataset described below. We briefly describe the training dataset in Section 5.1 and discuss our augmented dataset in Section 5.2. We report our training routine in Section 5.3, elaborate on our performance benchmarks in Sections 5.4 and 5.5, and discuss the contributions of our novel components through ablation studies in Section 5.6. We summarize the details of integrating our setup with the AR environment in Section 5.7. For a video demonstration of the results, please refer to our supplementary material.
The datasets we use consist of temporal 3D pose sequences of human gaits for different types of walking, running, and other locomotion activities. This dataset was collected from various 3D pose sequence datasets, including BML [41], Human3.6M [26], ICT [49], CMU-MoCap [1], ELMD [23], and the Emotion-Gait dataset [9]. All gaits in the dataset are 240 frames long and playable at 10 fps. Due to memory constraints, we sampled every 4th frame and used the resultant 60 frames as input data to our network, i.e., we had T = 60. We had 835 gaits with corresponding emotion vectors available.

We used 80% of our gait dataset for training our network, 10% for validation, and kept the remaining 10% of the dataset for testing the emotional-expressiveness and trajectory-following performances. We performed this split randomly, and the network never sees the validation and the test sets during training.
Comparison and Ablation Studies. ( a ) and ( c ) showsemotive four snapshots in temporal sequence from left to right gaitsgenerated by our network following user-driven trajectories. ( b ) showsthe results at the same four time instances for QuaterNet, which hasno emotive component. ( d ) shows the results at the same four timeinstances for our network without the affective feature component. Inthis case, the gait is able to follow the trajectory, but not express theemotions ( e.g. , no shoulder slouching to indicate sadness). ( e ) showsthe results at the same four time instances for our network without themovement feature component. In this case, the gait is able to expressemotions, but not follow the desired trajectory. At test time, our network is able to generate predicted gaits ontrajectories it did not encounter during training. Our network is alsoable to transition between different emotions on the test gaits as aresult of the learned inverse mapping from the latent representationspace of the encoder. We, therefore, use our network to augmentsynthesized gaits to the Emotion-Gait benchmark datasets. Wegenerate gaits on 20 trajectories not present in the dataset, with 100emotions, also not present in the dataset, on each trajectory, for atotal of 2 ,
000 new gaits. We also perform transitions between 50pairs of emotions on each trajectory, picking a pair of emotions fromthe 100 novel ones without replacement, thus adding another 1 , , We use the Adam optimizer [31] with a learning rate of 0 . .
We use the Adam optimizer [31], decaying the learning rate by a factor of 0.999 at every epoch. We use the ELU activation [8] on all the fully connected layers in the network.

Similar to Pavllo et al. [54], we use a curriculum scheduling technique [7] to train our network. We begin training by presenting various sequences of the input data and control features, each of length T, to our network, and the network predicts the rotations and translations for a single subsequent time step. This is equivalent to having a teacher forcing ratio of 1. At every subsequent epoch E, we decay the teacher forcing ratio by a factor β, i.e., with probability βE, we supplement the data and controls at each input time step with the network's predicted data and controls at that time step. In other words, we progressively expose the network to more and more of its own predictions to make further predictions. Curriculum scheduling thus helps the network gently transition from a teacher-guided prediction routine to a self-guided prediction routine, which significantly speeds up the training process. A minimal sketch of this scheduling is shown below.

We train our network for 500 epochs, which takes around 18 hours on an Nvidia GeForce GTX 1080Ti GPU with 12 GB memory. We use 90% of the available data for training our network and validate its performance on the remaining 10% of the data. We also observed that our network performs well for a range of values of the scaling terms in Equation 13; we used a value of 1 for all the scaling terms.
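For reference, the snippet below shows one plausible way to implement the scheduled-sampling decision. The growth rate and the exact form of the schedule are assumptions, since the text does not fully specify them, and `model.predict_step` is a hypothetical one-step API.

```python
import random

def use_own_prediction(epoch, beta=0.002):
    """Decide whether to feed the network its own previous prediction (self-guided)
    instead of the ground truth (teacher forcing) at one input time step.
    Assumed schedule: the probability grows with the epoch and is capped at 1."""
    return random.random() < min(1.0, beta * epoch)

# Inside the training loop (model.predict_step is a hypothetical one-step API):
# prev_pred = None
# for gt_frame, target_frame in sequence:
#     inp = prev_pred if (prev_pred is not None and use_own_prediction(epoch)) else gt_frame
#     prev_pred, loss = model.predict_step(inp, target_frame)
```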
Figure 7: Time to generate emotive gaits: We highlight the generation and rendering time for emotive gaits on a pair of GPUs. The average time per agent per frame decreases as we increase the number of virtual agents.
We note here that methods generating motions with discrete styles test generalizability by providing their network with novel discrete-style labels not seen during training [30, 42], as opposed to ranges in the styles. Our network, on the other hand, learns the mapping between gaits and the underlying continuous space of emotions, rather than the mapping between the gaits and the annotated emotions in the training samples. As a result, we test the generalizability of our network by generating gaits corresponding to continuous ranges of emotion vectors not seen during training. In order to benchmark our network on its ability to generalize to these continuous ranges of emotions, we perform experiments on emotion expressiveness and emotion transitions.
We randomly pick gaits from the test set and extract the first 18 frames (1/3rd of the total data length) to provide as inputs to our network. We set the associated emotion vector of the input gait as the desired emotion, initially set a straight line as the desired trajectory, and predict for 200 time steps. Figure 5 (top row) shows some snapshots of the results of this evaluation. Next, we evaluate our method on trajectories with bends and sharp turns for 200 steps (Figure 5, middle row). We note that the generated gait maintains the emotion of the input and follows all the trajectories. For example, the gait slows down while taking sharp turns and adjusts its stride and other joint movements such that the affective features remain similar when walking on a path with bends and sharp turns.
In this set of evaluations, we modify the desired emotion at each prediction time step to be different from those in the previous time steps, as well as from the emotion vector associated with the input gait. To track the performance of our network, we first choose a particular emotion vector for the final prediction time step. Next, we linearly interpolate the value of each element of the emotion vector at each prediction time step separately, including everything from the input vector to the vector at the final time step. We normalize the vector at each time step to convert the values to a probability distribution that can be passed to our network. We test the results of emotion transition on trajectories with bends and turns for 200 time steps (Figure 5, bottom row). We observe that our network is able to smoothly transition between the different emotions, with no sharp limb movement or jarring action at any time step.
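A minimal sketch of this emotion-transition schedule is given below: the emotion vector is linearly interpolated over the prediction horizon and renormalized at every step into a probability distribution. The endpoint vectors are made-up examples.

```python
import numpy as np

def emotion_schedule(m_start, m_end, steps):
    """Linearly interpolate between two emotion vectors (happy, sad, angry, neutral) over
    `steps` prediction time steps, renormalizing each step to a probability distribution."""
    m_start, m_end = np.asarray(m_start, float), np.asarray(m_end, float)
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    schedule = (1.0 - alphas) * m_start + alphas * m_end
    return schedule / schedule.sum(axis=1, keepdims=True)

# e.g., transition from predominantly angry to predominantly happy over 200 prediction steps
emotions = emotion_schedule([0.1, 0.0, 0.9, 0.0], [0.8, 0.0, 0.0, 0.2], 200)
```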
We visually compare the performance of our autoregression network with QuaterNet, developed by Pavllo et al. [54], in Figure 6 (rows (a) and (b)). QuaterNet is a state-of-the-art motion prediction network, and our network builds on the core prediction framework of QuaterNet. Since QuaterNet performs motion prediction but not emotional expressiveness, its gaits are not able to express the different emotions.

Table 2: Likert scales for observed pose affective features. The Likert scale response categories we provided users for the four broad observed pose affective features. Our goal is to evaluate if different users find the Likert scale values of these features similar for the same emotion, which would indicate these features are relevant for perceiving the emotions.

Feature | Value = 0 | Value = 1 | Value = 2 | Value = 3 | Value = 4
Torso | Contracted, bowed | Somewhat contracted | Neither contracted nor expanded | Somewhat expanded | Expanded, stretched
Arms | Contracted, close to the body | Somewhat contracted | Neither contracted nor expanded | Somewhat expanded | Expanded, away from the body
Gait Pace | Sustained, leisurely, slow | Somewhat sustained | Neither sustained nor hurried | Somewhat hurried | Hurried, sudden, fast
Gait Flow | Free, relaxed, uncontrolled | Somewhat free | Neither free nor bound | Somewhat bound | Bound, tense, controlled
We also summarize, in Table 1, the mean pose errors relative to the scale of the input data and the joint rotation errors in degrees as produced by our method, QuaterNet, and prediction networks based on alternative approaches such as the phase-functioned neural network (PFNN) [24]. We keep the desired emotion for our network the same as the input emotion vector to perform a fair comparison. Computing these errors also requires ground truth gaits to be available for the prediction time steps. Therefore, we present the first 18 frames of each data point in the test set as inputs to the methods and compute their predicted motions on the trajectory of the ground truth data for the remaining 42 frames. We notice negligible differences between the performances of the two networks, showing that our approach is comparable to the state-of-the-art in motion prediction. However, for predicting motion beyond the 60 frames in the dataset, we noticed that the motions predicted by QuaterNet eventually reduce to no movement and the character comes to a stop, whereas both PFNN and our method can predict plausible motions for up to 200 prediction steps.

Thus, while current motion generation methods can produce highly realistic gaits for VAs, our method can additionally produce emotional expressiveness for those gaits. Therefore, our method helps improve the social presence of the VAs in an AR environment, as we observe through user evaluations in Section 6.
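As a reference for how the Table 1 numbers can be interpreted, the sketch below computes a position error normalized by the bounding-box diagonal and a rotation error in degrees; this is our reading of the metrics, not the paper's evaluation code.

```python
import numpy as np

def pose_error(X_gt, X_pred):
    """Mean joint position error normalized by the longest diagonal of the
    ground-truth bounding box of the character."""
    pts = X_gt.reshape(-1, 3)
    bbox_diag = np.linalg.norm(pts.max(axis=0) - pts.min(axis=0))
    return np.linalg.norm(X_gt - X_pred, axis=-1).mean() / bbox_diag

def rotation_error_deg(q_gt, q_pred):
    """Mean angular difference in degrees between ground-truth and predicted versors."""
    dots = np.clip(np.abs(np.sum(q_gt * q_pred, axis=-1)), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(dots)).mean()
```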
We have two main contributions to the design of our autoregression network. First, we provide the pose affective features as part of the input and constrain the predicted pose affective features to remain close to the corresponding ground truth during training through the affective loss $L_{aff}$ (Eq. 10). This enables our network to achieve emotional expressiveness and emotion transition on the input data. We, therefore, remove this input component and the corresponding loss function from our network and the training process and compare the results on the experiments described in Sections 5.4.1 and 5.4.2. We observe (Figure 6, rows (c) and (d)) that the ablated network does not maintain consistent pose affective features across the prediction time steps for the desired emotion.

Second, our network predicts the root joint features, consisting of the root height, the root speed, and the root orientation difference (as detailed in Section 3.2.4), alongside the predictions for the joint rotations. To ensure that the predicted motion follows the desired trajectory, we constrain these predicted root joint features to remain close to the corresponding ground truth during training through the root joint loss function $L_{root}$ (Eq. 11). To underscore the importance of this loss function, we remove it from our training and perform the experiments described in Section 5.4.2 on the ablated network. We notice (Figure 6, rows (c) and (e)) that the ablated network is not able to follow the desired trajectory. For linear trajectories, the predicted gaits often end up being oriented in arbitrary directions and not facing the direction of motion. For trajectories containing bends and turns, once the predicted gaits deviate from the desired trajectories, the ablated network is not able to reduce the deviations in subsequent prediction time steps.

Our generative method can generate animation frames at the interactive rate of 40 ms per frame for 10 agents in an AR environment (Figure 7), i.e., at 4 ms per agent per frame on average, when utilizing two Nvidia GeForce GTX 1080Ti GPUs. We built the realtime AR demo by rigging the generated skeletons to humanoid meshes, modifying the posed meshes to handle minor visual distortions caused by body shape mismatch, and streaming a virtual environment containing the animated characters to the Microsoft HoloLens.
Rigging
Rigging the humanoid meshes to the generated skeletons requires that the rest pose of the generated skeleton is not modified to accommodate the desired mesh, as this would invalidate the rest of the animation. Thus, the rigging process must be done in reverse; the desired meshes must already contain a skeleton that allows us to repose it to match the rest pose of the generated skeletons. We found free meshes with suitable skeletons in online 3D mesh databases. For the demo, we chose a humanoid stick figure and some humans from the Microsoft Rocketbox collection [21]. We performed the rigging in Blender 2.7.
Modifications To Rig
Because the body shapes of our desired meshes do not exactly match those of the people used to generate the original dataset, we use Blender's sculpt tools to iron out any distortions. In order to make the human meshes seem less synthetic, we used the original face bones in the meshes to create blendshapes such as blinking, breathing (mouth), and breathing (chest), which we activated at regular intervals. These blendshapes represent typical human behaviors independent of bodily animations. Our generated skeletons do not contain facial bones; thus, blendshapes are a good option for animating the face without requiring bones.
AR Implementation
We made the realtime AR demo in Unreal 4.24 due to its strong animation system, which allows trivial sharing of animation files between meshes. We created an environment in which we show pairs of animations of specific emotions, as a human or a stick figure, with the meshes approximately walking along the real ground. We used the Unreal HoloLens plugin to stream the rendered images directly to the HoloLens through the HoloLens' Holographic Remoting player, which receives images by listening on a specific IP address. Because the start position of the user is non-deterministic, we also provide key inputs to reposition the animated characters in front of wherever the user is when the key is pressed. The animations loop in order to make it easier for the user to determine differences in gaits, and the stick figure characters are given colored materials matching their emotion.
Animation artifacts
We observe some jerkiness in the animation of the human characters in AR. The major sources of this jerkiness are (i) jerky motion of the user wearing the HoloLens, (ii) issues in the HoloLens software, e.g., frame rate clipping, and (iii) using textures instead of deformable cloth materials for the low-poly human models, which makes the jerkiness more apparent due to aliasing. We observe much-reduced jerkiness for the textureless stick figures in AR, and almost none when rendering the stick figures in a purely virtual environment with a known ground plane.
Table 3: Based on the statistics, we are unable to reject the null hypothesis that the intended and perceived emotions of the gaits are samples from the same probability distribution, except for the case of Gait 2.

Gait | Test Statistic | p-value | Reject Null Hypothesis?
1 | – | > 0.25 | Not able to
2 | 2.111 | 0.04 | With 96% confidence
3 | – | > 0.25 | Not able to
4 | – | > 0.25 | Not able to
5 | – | > 0.25 | Not able to
6 | – | > 0.25 | Not able to
7 | – | > 0.25 | Not able to
8 | – | > 0.25 | Not able to
6 USER EVALUATION
We conducted a web-based user study with our generated emotive gaits to test the following null hypothesis: The emotion vector used as input to generate each gait, and the emotion vector obtained by taking the arithmetic mean of the emotion vectors perceived by all the users from that generated gait, are two samples of the same statistical distribution.
In other words, the distribution of emotions we intend for a generated gait is statistically similar to the distribution of emotions perceived by the observing users. We also obtain the values of the pose affective features observed by the users from the generated gaits on a five-point Likert scale (LS) to validate our choice of pose affective features in Section 3.2.3 and the emotional expressiveness of the VAs.
The study was divided into three sections. Each section took three to four minutes to complete on average, and the entire study lasted around ten minutes on average.

In the first section, we showed the users ten-second clips of eight randomly chosen generated gaits, one at a time, and asked them to report the emotion they perceived from each of those gaits. Users could report multiple emotions. For example, if one gait looked less happy to the user than another (but not necessarily sad), then the user could potentially mark that gait as both happy and neutral.

In the second section, we again showed the users ten-second clips of six randomly chosen generated gaits, one at a time. However, in this section, we performed emotion transitions on the generated gaits, so the final emotions were different from the initial ones. We asked the users to report the initial and the final emotions they perceived from these gaits, with the option to report multiple emotions.

In the third section, we showed the users ten-second clips of the same eight generated gaits from the first section, one at a time, and asked them to report the observed values or intensities of four broad pose affective features on a five-point LS. The four pose affective features we chose to ask about are inspired by the critical features identified in the study by Roether et al. [66]. We summarize the scales for each of the four features in Table 2.
Since emotion perception is influenced by numerous social and cultural factors, we invited participants from diverse demographics to draw useful conclusions. We had 102 participants in total, of which 58 were male and 44 were female. 31 male and 26 female participants were in the age group of 18-24. 25 male and 14 female participants were in the age group of 25-34. 2 male and 4 female participants were above 35. Based on the overall test statistics, we did not find any noticeable difference in the emotions perceived from the generated gaits across the different sexes and age groups.
We analyze the results on single emotions (section 1 of the study), emotion transitions (section 2), and pose affective features (section 3). Finally, we report the perceived naturalness of the gaits as rated by the users and miscellaneous analyses in "other feedback".

Figure 8: Distribution of user votes for the broad pose affective features in gaits from different emotions. We can observe different distinct modes for the different emotions, indicating that the pose affective features vary between different emotions and are consistent across users for a given emotion. The values on the horizontal axis correspond to the Likert scale values in Table 2.
Given the perceived emotions from the first section of the user study, we plot the normalized perceived emotions for eight randomly chosen gaits, as well as the corresponding normalized intended emotions of the generated gaits, in Figure 9. The emotions are denoted as four-component vectors as described in Section 3.2.2. We perform $\ell_1$ normalization so that each component of the emotion vector represents the intensity of the corresponding emotion.

For each gait, we perform the 2-sample Anderson-Darling test [69] on the null hypothesis that the set of probability values of the perceived emotions and the set of probability values of the intended emotions are samples of the same underlying distribution. Table 3 summarizes the test statistic and the corresponding p-values for each of the eight gaits in Figure 9.

As we can observe from Table 3, we cannot reject our null hypothesis for seven of the eight gaits. This suggests strong statistical evidence that the intended and perceived emotions are statistically similar for those seven gaits. In Gait 2, where we reject the null hypothesis, the intended emotion was fully happy, but the observers mainly perceived it as either happy or neutral, indicating that the intensity of happiness did not come across to some of the observers.

We performed a similar 2-sample Anderson-Darling test [69] for each of the initial and the final intended and perceived emotions, and were unable to reject the null hypothesis in 10 out of the 12 gaits we tested with. This again provides strong statistical evidence that the intended emotions for the gaits and the corresponding perceived emotions are statistically similar.

In one rejected case, the initial emotion was predominantly angry while the final was predominantly happy, but many observers indicated that both the initial and the final emotions were neutral. In the other case, the transition was from predominantly sad to predominantly happy, but many observers reported the gait to be going from sad to neutral. We hypothesize two possibilities for the mismatches:

• the intensities of the initial and final emotions did not come across in the generated gaits,
Sets of normalized intended and perceived emotion vectors.
As we can observe from the plots and the statistics in Table 3, exceptfor Gait 2, the intended and perceived emotions of the gaits cannot be determined to belong to separate statistical distributions. • the observers did not expect the gait to transition betweenextreme emotions such as angry to happy or sad to happy inthe ten-second span of the clip, hence opted for choices theyfound more reasonable.
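As a concrete illustration of the per-gait comparison described above, the following Python sketch runs a 2-sample Anderson-Darling test (via scipy.stats.anderson_ksamp) on an ℓ1-normalized intended emotion vector and an ℓ1-normalized aggregate of user reports. The emotion vectors and user votes below are hypothetical placeholders, and the aggregation of multi-emotion reports by averaging is an assumption for illustration, not necessarily the paper's exact pipeline.

```python
# Minimal sketch of the per-gait intended-vs-perceived comparison, assuming
# emotion vectors ordered as (happy, sad, angry, neutral); all numbers below
# are hypothetical placeholders.
import numpy as np
from scipy.stats import anderson_ksamp

def l1_normalize(v):
    """Scale a non-negative emotion vector so its components sum to 1."""
    v = np.asarray(v, dtype=float)
    return v / v.sum()

intended = l1_normalize([0.7, 0.0, 0.0, 0.3])    # intended emotion of one gait
perceived = l1_normalize(np.mean([               # simple average over user reports
    [1, 0, 0, 0],
    [1, 0, 0, 1],   # a user who marked the gait as both happy and neutral
    [0, 0, 0, 1],
], axis=0))

# 2-sample Anderson-Darling test on the two sets of probability values;
# failing to reject the null suggests the distributions are statistically similar.
result = anderson_ksamp([intended, perceived])
print(f"A-D statistic = {result.statistic:.3f}, "
      f"approx. significance level = {result.significance_level:.3f}")
```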
Our goal here is to validate the usefulness of the pose affective features we use to train our network, as well as the emotional expressiveness of the VAs through these pose affective features. However, evaluating the values of angles and ratios is out of scope for a user study. We therefore opted to measure the user-observed LS values or intensities of the broad pose affective features that we used to formulate our geometric features. A good test is to verify whether the observed intensities of these broad pose affective features are statistically consistent across different users. If this is verified, it justifies
• basing the geometric features on these broad features, and
• concluding that the VAs are able to clearly express the different emotions through different LS values of the pose affective features.
We show the distribution of the fraction of users that marked each particular intensity of the four broad pose affective features for each of the single intended emotions in our study in Figure 8. The values on the horizontal axis correspond to the LS values in Table 2. From this figure, we can observe distinct modes in the distribution for the different intended emotions. For example, the mode for the torso is at "contracted, bowed" (0) for sad, while it is concentrated more around "somewhat expanded" (3) and "expanded, stretched" (4) for happy. For angry, the users observed the torso to be less expanded than for happy overall, but fewer than 10% found it to be contracted. For neutral, there is a clear mode at "neither contracted nor expanded" (2). These statistics show that the users perceived the VAs to have clear preferences for the different intensities of the pose affective features when expressing different emotions.

We also perform a k-sample Anderson-Darling test [69] for each gait and each of the four broad affective features (with k being the number of users) on the null hypothesis that all the user-provided values are from the same underlying distribution. We fail to reject the null hypothesis for all four broad features in all the gaits, thus indicating strong statistical evidence that the observed intensities are consistent across different users.

We asked the users to mark out of five how natural and smooth they felt the animations to be, with one indicating "not natural at all", three indicating "satisfactory", and five indicating "very natural". To establish a baseline, we asked the users to similarly mark the corresponding source motions as well. For our generated animations, 22% of the users marked five, 43% marked four, 24% marked three, 7% marked two, and 4% marked one. Thus 89% of the users marked at least three, i.e., found the naturalness of the generated gaits to be satisfactory. By contrast, for the source motions, 58% of the users marked five, 35% marked four, and 7% marked three.

In the videos we sent out to the users, we used a moving camera so that the user was always looking straight at the virtual agent as it walked along different trajectories. 30% of the users reported being distracted by this moving camera during the study. Therefore, we plan to use a fixed camera in our subsequent studies.
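To make the per-feature consistency check described above more concrete, the sketch below runs a k-sample Anderson-Darling test over per-user Likert ratings of one broad pose affective feature. The paper does not fully specify how user responses are grouped into samples, so treating each user's handful of ratings as one sample is an assumption for illustration, and the ratings themselves are hypothetical placeholders.

```python
# Minimal sketch of the across-user consistency check, assuming each user
# contributes a small sample of 0-4 Likert ratings of the "torso" feature
# (e.g., over the gaits they viewed); all ratings below are hypothetical.
import numpy as np
from scipy.stats import anderson_ksamp

user_ratings = [
    [3, 4], [4, 3], [3, 3], [4, 4], [2, 3],   # k users, one small sample each
]

# k-sample Anderson-Darling test: the null hypothesis is that all users'
# ratings come from the same underlying distribution. Failing to reject it
# indicates the observed intensities are consistent across users.
result = anderson_ksamp([np.asarray(r, dtype=float) for r in user_ratings])
print(f"A-D statistic = {result.statistic:.3f}, "
      f"approx. significance level = {result.significance_level:.3f}")
```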
CONCLUSIONS, LIMITATIONS, AND FUTURE WORK
We present a novel, learning-based method to synthesize and transition between emotive gaits. Our emotion model is based on a linear combination of four widely used categorical emotions, and we present a network architecture that uses affective features and movement features. Our algorithm can generate emotive gaits that follow a given trajectory at interactive rates and uses a transition scheme to switch between gaits with different emotions. We have shown results on gaits collected from open-source datasets and discussed our procedure for developing VAs with these gaits in an AR environment. We have also reported our observations from a web-based user study, concluding that our generated gaits looked natural and had the desired emotional expressiveness. Lastly, we release an augmented dataset of emotive gaits.

Our work has some limitations. Our approach can generate gaits of various emotions for one person at a time; it would be useful to generate gaits for a group of pedestrians in a crowd. Our formulation only considers a linear space of four emotions, and we would like to extend our emotion representation to encompass more emotions in the arousal space and in the broader VAD space [44]. The fidelity of our synthesized gaits is limited by the number of gaits and emotion labels available in the original database used by the network for training. To improve the performance and generate more natural-looking emotive gaits, we need larger datasets that account for individual, social, and cultural diversities. Moreover, our approach is only based on low-level affective and movement features, and it would be useful to model higher-level information corresponding to the environment and context. Furthermore, we would like to combine emotive gaits with other cues corresponding to facial expressions or gestures and use multiple modalities.

ACKNOWLEDGMENTS
This work has been supported by ARO grant W911NF-19-1-0069.
REFERENCES

[1] CMU graphics lab motion capture database. http://mocap.cs.cmu.edu/, 2018.
[2] , 2020.
[3] J. Altarriba, D. M. Basnight, and T. M. Canary. Emotion representation and perception across cultures. Online Readings in Psychology and Culture, 4(1):1–17, 2003.
[4] H. Aviezer, Y. Trope, and A. Todorov. Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science, 338(6111):1225–1229, 2012.
[5] L. F. Barrett, B. Mesquita, and M. Gendron. Context in emotion perception. Current Directions in Psychological Science, 20(5):286–290, 2011. doi: 10.1177/0963721411422522
[6] A. Bauer, K. Klasing, G. Lidoris, Q. Mühlbauer, F. Rohrmüller, S. Sosnowski, T. Xu, K. Kühnlenz, D. Wollherr, and M. Buss. The autonomous city explorer: Towards natural human-robot interaction in urban environments. International Journal of Social Robotics, 1(2):127–140, Apr 2009. doi: 10.1007/s12369-009-0011-9
[7] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds., Advances in Neural Information Processing Systems 28, pp. 1171–1179. Curran Associates, Inc., 2015.
[8] Y. Bengio and Y. LeCun, eds., 2016.
[9] U. Bhattacharya, T. Mittal, R. Chandra, T. Randhavane, A. Bera, and D. Manocha. STEP: Spatial temporal graph convolutional networks for emotion perception from gaits. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI'20, pp. 1342–1350. AAAI Press, 2020.
[10] U. Bhattacharya, C. Roncal, T. Mittal, R. Chandra, A. Bera, and D. Manocha. Take an emotion walk: Perceiving emotions from gaits using hierarchical attention pooling and affective mapping. In Proceedings of the European Conference on Computer Vision (ECCV), August 2020.
[11] A. Chowanda, P. Blanchfield, M. Flintham, and M. Valstar. Computational models of emotion, personality, and social relationships for interactions in games (extended abstract). In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, AAMAS '16, pp. 1343–1344. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2016.
[12] C. Clavel, J. Plessier, J.-C. Martin, L. Ach, and B. Morel. Combining facial and postural expressions of emotions in a virtual character. In Z. Ruttkay, M. Kipp, A. Nijholt, and H. H. Vilhjálmsson, eds., Intelligent Virtual Agents, pp. 287–300. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[13] A. Crenn, R. A. Khan, A. Meyer, and S. Bouakaz. Body expression recognition from animated 3D skeleton. In IC3D, pp. 1–7. IEEE, 2016.
[14] Q. Cui, H. Sun, and F. Yang. Learning dynamic relationships for 3D human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[15] N. Dael, M. Mortillaro, and K. R. Scherer. Emotion expression in body action and posture. Emotion, 12(5):1085, 2012. doi: 10.1037/a0025737
[16] H. Du, E. Herrmann, J. Sprenger, N. Cheema, S. Hosseini, K. Fischer, and P. Slusallek. Stylistic locomotion modeling with conditional variational autoencoder. In Eurographics (Short Papers), pp. 9–12, 2019.
[17] P. Ekman and W. V. Friesen. Facial Action Coding System: Investigator's Guide. Consulting Psychologists Press, 1978.
[18] Y. Ferstl and R. McDonnell. A perceptual study on the manipulation of facial features for trait portrayal in virtual agents. In Proceedings of the 18th International Conference on Intelligent Virtual Agents, IVA '18, pp. 281–288. Association for Computing Machinery, New York, NY, USA, 2018. doi: 10.1145/3267851.3267891
[19] R. W. Frick. Communicating emotion: The role of prosodic features. Psychological Bulletin, 97(3):412, 1985. doi: 10.1037/0033-2909.97.3.412
[20] M. Gendron, D. Roberson, J. M. van der Vyver, and L. F. Barrett. Perceptions of emotion from facial expressions are not culturally universal: Evidence from a remote culture. Emotion, 14(2):251, 2014. doi: 10.1037/a0036052
[21] M. Gonzalez-Franco, M. Wojcik, E. Ofek, A. Steed, and D. Garagan. Microsoft Rocketbox: https://github.com/microsoft/Microsoft-Rocketbox, 2020.
[22] M. M. Gross, E. A. Crane, and B. L. Fredrickson. Effort-shape and kinematic assessment of bodily expression of emotion during gait. Human Movement Science, 31(1):202–221, 2012.
[23] I. Habibie, D. Holden, J. Schwarz, J. Yearsley, and T. Komura. A recurrent variational autoencoder for human motion synthesis. In British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4–7, 2017, 2017.
[24] D. Holden, T. Komura, and J. Saito. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG), 36(4):42, 2017.
[25] D. Holden, J. Saito, and T. Komura. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG), 35(4):138, 2016.
[26] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2013.
[27] N. Jaques, D. J. McDuff, Y. L. Kim, and R. W. Picard. Understanding and predicting bonding in conversations using thin slices of facial expressions and body language. In D. R. Traum, W. R. Swartout, P. Khooshabeh, S. Kopp, S. Scherer, and A. Leuski, eds., Intelligent Virtual Agents - 16th International Conference, IVA 2016, Los Angeles, CA, USA, September 20–23, 2016, Proceedings, vol. 10011 of Lecture Notes in Computer Science, pp. 64–74, 2016. doi: 10.1007/978-3-319-47665-0
[28] M. Karg, A.-A. Samadani, R. Gorbet, K. Kühnlenz, J. Hoey, and D. Kulić. Body movements for affective expression: A survey of automatic recognition and generation. IEEE Transactions on Affective Computing, 4(4):341–359, 2013.
[29] D. Keltner and J. Haidt. Social functions of emotions. 2001.
[30] A. Kfir, Y. Weng, D. Lischinski, D. Cohen-Or, and B. Chen. Unpaired motion style transfer from video to animation. ACM Trans. Graph., 39(4), July 2020. doi: 10.1145/3386569.3392469
[31] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[32] R. Kosti, J. Alvarez, A. Recasens, and A. Lapedriza. Context based emotion recognition using EMOTIC dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2019. doi: 10.1109/TPAMI.2019.2916866
[33] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. In ACM SIGGRAPH 2008 Classes, SIGGRAPH '08. Association for Computing Machinery, New York, NY, USA, 2008. doi: 10.1145/1401132.1401202
[34] M. E. Latoschik, F. Kern, J. Stauffert, A. Bartl, M. Botsch, and J. Lugrin. Not alone here?! Scalability and user experience of embodied ambient crowds in distributed social virtual reality. IEEE Transactions on Visualization and Computer Graphics, 25(5):2134–2144, 2019.
[35] M. E. Latoschik, D. Roth, D. Gall, J. Achenbach, T. Waltemate, and M. Botsch. The effect of avatar realism in immersive social virtual realities. In Proceedings of the 23rd ACM Symposium on Virtual Reality Software and Technology, VRST '17. Association for Computing Machinery, New York, NY, USA, 2017. doi: 10.1145/3139131.3139156
[36] K. H. Lee, M. G. Choi, and J. Lee. Motion patches: Building blocks for virtual environments annotated with motion data. ACM Trans. Graph., 25(3):898–906, 2006.
[37] S. Lee, M. Park, K. Lee, and J. Lee. Scalable muscle-actuated human simulation and control. ACM Transactions on Graphics (TOG), 38(4):73, 2019.
[38] C. Li, Z. Zhang, W. S. Lee, and G. H. Lee. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[39] B. Liebold and P. Ohler. Multimodal emotion expressions of virtual agents, mimic and vocal emotion expressions and their effects on emotion recognition. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII '13, pp. 405–410. IEEE Computer Society, USA, 2013. doi: 10.1109/ACII.2013.73
[40] Z. C. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
[41] Y. Ma, H. M. Paterson, and F. E. Pollick. A motion capture library for the study of identity, gender, and emotion perception from biological motion. Behavior Research Methods, 38(1):134–141, 2006.
[42] I. Mason, S. Starke, H. Zhang, H. Bilen, and T. Komura. Few-shot learning of homogeneous human locomotion styles. Computer Graphics Forum, 37(7):143–153, 2018. doi: 10.1111/cgf.13555
[43] J. E. McHugh, R. McDonnell, C. O'Sullivan, and F. N. Newell. Perceiving emotion in crowds: The role of dynamic body postures on the perception of emotion in crowded scenes. Experimental Brain Research, 204(3):361–372, 2010.
[44] A. Mehrabian and J. A. Russell. An Approach to Environmental Psychology. The MIT Press, 1974.
[45] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha. M3ER: Multiplicative multimodal emotion recognition using facial, textual, and speech cues. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI'20, pp. 1359–1367. AAAI Press, 2020.
[46] T. Mittal, P. Guhan, U. Bhattacharya, R. Chandra, A. Bera, and D. Manocha. EmotiCon: Context-aware multimodal emotion recognition using Frege's principle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14234–14243, 2020.
[47] J. M. Montepare, S. B. Goldstein, and A. Clausen. The identification of emotions from gait information. Journal of Nonverbal Behavior, 11(1):33–42, 1987.
[48] F. Moustafa and A. Steed. A longitudinal study of small group interaction in social virtual reality. In Proceedings of the 24th ACM Symposium on Virtual Reality Software and Technology, VRST '18. Association for Computing Machinery, New York, NY, USA, 2018. doi: 10.1145/3281505.3281527
[49] S. Narang, A. Best, A. Feng, S.-h. Kang, D. Manocha, and A. Shapiro. Motion recognition of self and others on realistic 3D avatars. Computer Animation and Virtual Worlds, 28(3-4):e1762, 2017.
[50] V. Narayanan, B. M. Manoghar, V. S. Dorbala, D. Manocha, and A. Bera. ProxEmo: Gait-based emotion learning and multi-view proxemic fusion for socially-aware robot navigation. In . IEEE, 2020.
[51] E. J. Nestler, M. Barrot, R. J. DiLeone, A. J. Eisch, S. J. Gold, and L. M. Monteggia. Neurobiology of depression. Neuron, 34(1):13–25, 2002. doi: 10.1016/S0896-6273(02)00653-0
[52] H. Osking and J. A. Doucette. Enhancing emotional effectiveness of virtual-reality experiences with voice control interfaces. In B. et al., ed., Immersive Learning Research Network, pp. 199–209. Springer International Publishing, Cham, 2019.
[53] S. Park, H. Ryu, S. Lee, S. Lee, and J. Lee. Learning predict-and-simulate policies from unorganized human motion data. ACM Trans. Graph., 38(6), 2019.
[54] D. Pavllo, D. Grangier, and M. Auli. QuaterNet: A quaternion-based recurrent model for human motion. In British Machine Vision Conference 2018, BMVC 2018, p. 299, 2018.
[55] I. Pelczer, F. C. Contreras, and F. G. Rodríguez. Expressions of emotions in virtual agents: Empirical evaluation. , pp. 31–35, 2007.
[56] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4), July 2018. doi: 10.1145/3197517.3201311
[57] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph., 36(4), July 2017. doi: 10.1145/3072959.3073602
[58] M. L. Phillips, W. C. Drevets, S. L. Rauch, and R. Lane. Neurobiology of emotion perception I: The neural basis of normal emotion perception. Biological Psychiatry, 54(5):504–514, 2003.
[59] T. Randhavane, A. Bera, K. Kapsaskis, U. Bhattacharya, K. Gray, and D. Manocha. Identifying emotions from walking using affective and deep features. arXiv preprint arXiv:1906.11884, 2019.
[60] T. Randhavane, A. Bera, K. Kapsaskis, K. Gray, and D. Manocha. FVA: Modeling perceived friendliness of virtual agents using movement characteristics. IEEE Transactions on Visualization and Computer Graphics, 25(11):3135–3145, 2019.
[61] T. Randhavane, A. Bera, K. Kapsaskis, R. Sheth, K. Gray, and D. Manocha. EVA: Generating emotional behavior of virtual agents using expressive features of gait and gaze. In ACM Symposium on Applied Perception 2019, p. 6. ACM, 2019.
[62] T. Randhavane, A. Bera, E. Kubin, K. Gray, and D. Manocha. Modeling data-driven dominance traits for virtual characters using gait analysis. CoRR, abs/1901.02037, 2019.
[63] T. Randhavane, U. Bhattacharya, K. Kapsaskis, K. Gray, A. Bera, and D. Manocha. The liar's walk: Detecting deception with gait and gesture. arXiv preprint arXiv:1912.06874, 2019.
[64] L. D. Riek, T.-C. Rabinowitch, B. Chakrabarti, and P. Robinson. How anthropomorphism affects empathy toward robots. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction, HRI '09, pp. 245–246. Association for Computing Machinery, New York, NY, USA, 2009. doi: 10.1145/1514095.1514158
[65] J. J. Rivas, F. Orihuela-Espina, L. E. Sucar, L. Palafox, J. Hernández-Franco, and N. Bianchi-Berthouze. Detecting affective states in virtual rehabilitation. In Proceedings of the 9th International Conference on Pervasive Computing Technologies for Healthcare, PervasiveHealth '15, pp. 287–292. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, BEL, 2015.
[66] C. L. Roether, L. Omlor, A. Christensen, and M. A. Giese. Critical features for the perception of emotion from gait. Journal of Vision, 9(6):15–15, 2009.
[67] N. Rokbani, B. A. Cherif, and A. M. Alimi. Toward intelligent biped-humanoids gaits generation. Humanoid Robots, pp. 259–271, 2009.
[68] H. Rosenberg, S. McDonald, J. Rosenberg, and R. F. Westbrook. Measuring emotion perception following traumatic brain injury: The complex audio visual emotion assessment task (CAVEAT). Neuropsychological Rehabilitation, 29(2):232–250, 2019. PMID: 28030989. doi: 10.1080/09602011.2016.1273118
[69] F. W. Scholz and M. A. Stephens. K-sample Anderson–Darling tests. Journal of the American Statistical Association, 82(399):918–924, 1987. doi: 10.1080/01621459.1987.10478517
[70] S. S. Sohn, X. Zhang, F. Geraci, and M. Kapadia. An emotionally aware embodied conversational agent. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '18, pp. 2250–2252. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 2018.
[71] B. Stangl, D. C. Ukpabi, and S. Park. Augmented reality applications: The impact of usability and emotional perceptions on tourists' app experiences. In J. Neidhardt and W. Wörndl, eds., Information and Communication Technologies in Tourism 2020, pp. 181–191. Springer International Publishing, Cham, 2020.
[72] S. Starke, H. Zhang, T. Komura, and J. Saito. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG), 38(6):209, 2019.
[73] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2007.
[74] S. Xia, C. Wang, J. Chai, and J. Hodgins. Realtime style transfer for unlabeled heterogeneous human motion. ACM Trans. Graph., 34(4), July 2015. doi: 10.1145/2766999
[75] M. E. Yumer and N. J. Mitra. Spectral style transfer for human motion between independent actions.