ARBEE: Towards Automated Recognition of Bodily Expression of Emotion In the Wild
Yu Luo, Jianbo Ye, Reginald B. Adams Jr., Jia Li, Michelle G. Newman, James Z. Wang
Int J Comput Vis manuscript No. (will be inserted by the editor)
Abstract
Humans are arguably innately prepared to comprehend others' emotional expressions from subtle body movements. If robots or computers can be empowered with this capability, a number of robotic applications become possible. Automatically recognizing human bodily expression in unconstrained situations, however, is daunting given the incomplete understanding of the relationship between emotional expressions and body movements. The current research, as a multidisciplinary effort among computer and information sciences, psychology, and statistics, proposes a scalable and reliable crowdsourcing approach for collecting in-the-wild perceived emotion data for computers to learn to recognize body languages of humans. To accomplish this task, a large and growing annotated dataset with 9,876 video clips of body movements and 13,239 human characters, named BoLD (Body Language Dataset), has been created. Comprehensive statistical analysis of the dataset revealed many interesting insights. A system to model the emotional expressions based on bodily movements, named ARBEE (Automated Recognition of Bodily Expression of Emotion), has also been developed and evaluated. Our analysis shows the effectiveness of Laban Movement Analysis (LMA) features in characterizing arousal, and our experiments using LMA features further demonstrate computability of bodily expression. We report and compare results of several other baseline methods which were developed for action recognition based on two different modalities, body skeleton and raw image. The dataset and findings presented in this work will likely serve as a launchpad for future discoveries in body language understanding that will enable future robots to interact and collaborate more effectively with humans.

Y. Luo · J. Ye · J.Z. Wang
College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA.
E-mail: [email protected]
J. Ye is currently with Amazon Lab126, Sunnyvale, CA, USA.

R.B. Adams Jr. · M.G. Newman
Department of Psychology, The Pennsylvania State University, University Park, PA, USA.
E-mail: [email protected]

J. Li
Department of Statistics, The Pennsylvania State University, University Park, PA, USA.
E-mail: [email protected]
Keywords
Body language · emotional expression · computer vision · crowdsourcing · video analysis · perception · statistical modeling

1 Introduction

Many future robotic applications, including personal assistant robots, social robots, and police robots, demand close collaboration with and comprehensive understanding of the humans around them. Current robotic technologies for understanding human behaviors beyond their basic activities, however, are limited. Body movements and postures encode rich information about a person's status, including their awareness, intention, and emotional state (Shiffrar et al., 2011). Even at
Fig. 1 Examples of possible scenarios where computerized bodily expression recognition can be useful. From left to right: psychological clinic assistance, public safety and law enforcement, and social robot or social media.

a young age, humans can "read" another's body language, decoding movements and facial expressions as emotional keys.
How can a computer program be trained to recognize human emotional expressions from body movements? This question drives our current research effort.

Previous research on computerized body movement analysis has largely focused on recognizing human activities (e.g., the person is running). Yet, a person's emotional state is another important characteristic that is often conveyed through body movements. Recent studies in psychology have suggested that movement and postural behavior are useful features for identifying human emotions (Wallbott, 1998; Meeren et al., 2005; De Gelder, 2006; Aviezer et al., 2012). For instance, researchers found that human participants of a study could not correctly identify facial expressions associated with winning or losing a point in a professional tennis game when facial images were presented alone, whereas they were able to correctly identify this distinction with images of just the body or images that included both the body and the face (Aviezer et al., 2012). More interestingly, when the face part of an image was paired with the body and edited to an opposite situation face (e.g., winning face paired with losing body), people still used the body to identify the outcome. A valuable insight from this psychology study is that the human body may be more diagnostic than the face in terms of emotion recognition. In our work, bodily expression is defined as human affect expressed by body movements and/or postures.

Our earlier work studied the computability of evoked emotions (Lu et al., 2012, 2017; Ye et al., 2019) from visual stimuli using computer vision and machine learning. In this work, we investigate whether bodily expressions are computable. In particular, we explore whether modern computer vision techniques can match the cognitive ability of typical humans in recognizing bodily expressions in the wild, i.e.
, from real-world unconstrained situations.

Computerized bodily expression recognition capabilities have the potential to enable a large number of innovative applications, including information management and retrieval, public safety, patient care, and social media (Krakovsky, 2018). For instance, such systems can be deployed in public areas such as airports, metro or bus stations, or stadiums to help police identify potential threats. Better results might be obtained in a population with a high rate of emotional instability. A psychology clinic, for example, may install such systems to help assess and evaluate disorders, including anxiety and depression, either to predict danger to self and others from patients, or to track the progress of patients over time. Similarly, police may use such technology to help assess the identity of suspected criminals in naturalistic settings and/or their emotions and deceptive motives during an interrogation. Well-trained and experienced detectives and interrogators rely on a combination of body language, facial expressions, eye contact, speech patterns, and voices to differentiate a liar from a truthful person. An effective assistive technology based on emotional understanding could substantially reduce the stress of police officers as they carry out their work. Improving the bodily expression recognition of assistive robots will enrich human-computer interactions. Future assistive robots can better assist those who may suffer emotional stress or mental illness; e.g., assistive robots may detect early warning signals of manic episodes. In social media, recent popular social applications such as Snapchat and Instagram allow users to upload short clips of self-recorded and edited videos. A crucial analysis from an advertising perspective is to better identify the intention of a specific uploading act by understanding the emotional status of a person in the video. For example, a user who wants to share the memory of traveling with his family would more likely upload a video capturing the best interaction moment filled with joy and happiness. Such analysis helps companies to better personalize their services or to provide advertisements more effectively for their users, e.g., through showing travel-related products or services as opposed to business-related ones.

Automatic bodily expression recognition as a research problem is highly challenging for three primary reasons. First, it is difficult to collect a bodily expression dataset with high-quality annotations. The understanding and perception of emotions from concrete observations is often subject to context, interpretation, ethnicity, and culture. There is often no gold-standard label for emotions, especially for bodily expressions. In facial analysis, the expression could be encoded with movements of individual muscles, a.k.a. Action Units (AU) in the facial action coding system (FACS) (Ekman and Friesen, 1977). However, psychologists have not developed an analogous notation system that directly
encodes correspondence between bodily expression and body movements. The lack of such empirical guidance leaves even professionals without complete agreement about annotating bodily expressions. To date, research on bodily expression is limited to acted and constrained lab-setting video data (Gunes and Piccardi, 2007; Kleinsmith et al., 2006; Schindler et al., 2008; Dael et al., 2012), which are usually of small size due to lengthy human subject study regulations. Second, bodily expression is subtle and composite. According to Karg et al. (2013), body movements have three categories: functional movements (e.g., walking), artistic movements (e.g., dancing), and communicative movements (e.g., gesturing while talking). In a real-world setting, bodily expression can be strongly coupled with functional movements. For example, people may represent different emotional states in the same functional movement, e.g., walking. Third, an articulated pose has many degrees of freedom. Working with real-world video data poses additional technical challenges, such as the high level of heterogeneity in people's behaviors, the highly cluttered background, and the often substantial differences in scale, camera perspective, and pose of the person in the frame.

In this work, we investigate the feasibility of crowdsourcing bodily expression data collection and study the computability of bodily expression using the collected data. We summarize the primary contributions as follows.

– We propose a scalable and reliable crowdsourcing pipeline for collecting in-the-wild perceived emotion data. With this pipeline, we collected a large dataset with 9,876 clips that have body movements and over 13,239 human characters. We named the dataset BoLD (Body Language Dataset). Each short video clip in BoLD has been annotated for emotional expressions as perceived by the viewers. To our knowledge, BoLD is the first large-scale video dataset for bodily emotion in the wild.
– We conducted comprehensive agreement analysis on the crowdsourced annotations. The results demonstrate the validity of the proposed data collection pipeline. We also evaluated human performance on emotion recognition on a large and highly diverse population. Interesting insights have been found in these analyses.
– We investigated Laban Movement Analysis (LMA) features and action recognition-based methods using the BoLD dataset. From our experiments, hand acceleration shows strong correlation with one particular dimension of emotion, arousal, a result that is intuitive. We further show that existing action recognition-based models can yield promising results. Specifically, deep models achieve remarkable performance on emotion recognition tasks.

In our work, we approach the bodily expression recognition problem with the focus of addressing the first challenge mentioned earlier. Using our proposed data collection pipeline, we have collected high-quality affect annotation. With state-of-the-art computer vision techniques, we are able to address the third challenge to a certain extent. Properly addressing the second challenge, regarding the subtle and composite nature of bodily expression, requires breakthroughs in computational psychology. Below, we detail some of the remaining technical difficulties on the bodily expression recognition problem that the computer vision community can potentially address.

Despite significant progress recently in 2D/3D pose estimation (Cao et al., 2017; Martinez et al., 2017), these techniques are limited compared with Motion Capture (MoCap) systems, which rely on placing active or passive optical markers on the subject's body to detect motion, because of two issues.
First, these vision-based estimation methods are noisy in terms of jitter errors (Ruggero Ronchi and Perona, 2017). While high accuracy has been reported on pose estimation benchmarks, the criteria used in the benchmarks are not designed for our application, which demands substantially higher precision of landmark locations. Consequently, the errors in the results generated through those methods propagate in our pipeline, as pose estimation is a first step in analyzing the relationship between motion and emotion.

Second, vision-based methods (e.g., Martinez et al. (2017)) usually address whole-body poses, which have no missing landmarks, and only produce relative coordinates of the landmarks from the pose (e.g., with respect to the barycenter of the human skeleton) instead of the actual coordinates in the physical environment. In-the-wild videos, however, often contain upper-body or partially-occluded poses. Further, the interaction between the human and the environment, such as a lift of the person's barycenter or when the person is pacing between two positions, is often critical for bodily expression recognition. Additional modeling of the environment together with that of the human would be useful in understanding body movement.

In addition to these difficulties faced by the computer vision community broadly, the computational psychology community also needs some breakthroughs. For instance, state-of-the-art end-to-end action recognition methods developed in the computer vision community offer insufficient interpretability of bodily expression. While the LMA features that we have developed in this work have better interpretability than the action recognition-based methods, to completely address the problem of body language interpretation, we believe it will be important to have comprehensive motion protocols defined or learned, as a counterpart of FACS for bodily expression.

The rest of this paper is structured as follows. Section 2 reviews related work in the literature. The data collection pipeline and statistics of the BoLD dataset are introduced in Section 3. We describe our modeling processes on BoLD and demonstrate findings in Section 4, and conclude in Section 5.
2 Related Work

After first reviewing basic concepts on bodily expression and related datasets, we then discuss related work on crowdsourcing subjective affect annotation and automatic bodily expression modeling.

2.1 Bodily Expression Recognition

Existing automated bodily expression recognition studies mostly build on two theoretical models for representing affective states: the categorical and the dimensional models. The categorical model represents affective states as several emotion categories. In (Ekman and Friesen, 1986; Ekman, 1992), Ekman et al. proposed six basic emotions, i.e., anger, happiness, sadness, surprise, disgust, and fear. However, as suggested by Carmichael et al. (1937) and Karg et al. (2013), bodily expression is not limited to basic emotions. When we restricted interpretations to only basic emotions in a preliminary data collection pilot study, the participants provided feedback that they often found none of the basic emotions suitable for the given video sample. A dimensional model of affective states is the PAD model by Mehrabian (1996), which describes an emotion in three dimensions: pleasure (valence), arousal, and dominance. In the PAD model, valence characterizes the positivity versus negativity of an emotion, arousal characterizes the level of activation and energy of an emotion, and dominance characterizes the extent of controlling others or surroundings. As summarized in (Karg et al., 2013; Kleinsmith and Bianchi-Berthouze, 2013), most bodily expression-related studies focus on either a small set of categorical emotions or the two dimensions of valence and arousal in the PAD model. In our work, we adopt both measurements in order to acquire complementary emotion annotations.

Based on how emotion is generated, emotions can be categorized into acted or elicited emotions, and spontaneous emotions. Acted emotion refers to actors' performing a certain emotion under given contexts or scenarios.
Early work was mostly built on acted emotions (Wallbott, 1998; Dael et al., 2012; Gunes and Piccardi, 2007; Schindler et al., 2008). Wallbott (1998) analyzed videos recorded of recruited actors and established bodily emotions as an important modality of emotion recognition. In (Douglas-Cowie et al., 2007), a human subject's emotion is elicited via interaction with a computer avatar of its operator. Lu et al. (2017) crowdsourced emotion responses with image stimuli. Recently, natural or authentic emotions have generated more interest in the research community. In (Kleinsmith et al., 2011), body movements are recorded while human subjects play body movement-based video games.

Related work can also be categorized based on raw data types, namely MoCap data or image/video data. For lab-setting studies such as (Kleinsmith et al., 2006, 2011; Aristidou et al., 2015), collecting motion capture data is usually feasible. Gunes and Piccardi (2007) collected a dataset with upper-body movement videos recorded in a studio. Other work (Gunes and Piccardi, 2007; Schindler et al., 2008; Douglas-Cowie et al., 2007) used image/video data capturing the frontal view of the poses.

Humans perceive and understand emotions from multiple modalities, such as face, body language, touch, eye contact, and vocal cues. We review the most related vision-based facial expression analysis here. Facial expression is an important modality in emotion recognition, and automated facial expression recognition is more successful compared with other modalities. The main reasons for this success are two-fold. First, the discovery of FACS made facial expression analysis less subjective. Many recent works on facial expression recognition focus on Action Unit detection, e.g., (Eleftheriadis et al., 2015; Fabian Benitez-Quiroz et al., 2016). Second, the face has fewer degrees of freedom compared with the whole body (Schindler et al., 2008). To address the comparatively broader freedom of bodily movement, Karg et al. (2013) suggest that the use of a movement notation system may help identify bodily expression. Other research has considered microexpressions, e.g., (Xu et al., 2017), suggesting additional nuances in facial expressions. To our knowledge, no vision-based study or dataset on complete measurement of natural bodily emotions exists.

2.2 Crowdsourced Affect Annotation

Crowdsourcing from the Internet as a data collection process was originally proposed to collect objective, non-affective data and has received popularity in the machine learning community for acquiring large-scale
ground truth datasets. A school of data quality control methods has been proposed for crowdsourcing. Yet, crowdsourcing affect annotations is highly challenging due to the intertwined subjectivity of affect and uninformative participants. Very few studies report on the limitations and complexity of crowdsourcing affect annotations. As suggested by Ye et al. (2019), inconsistency of crowdsourced affective data exists due to two factors. The first is the possible untrustworthiness of recruited participants, due to the discrepancy between the purpose of the study (collecting high-quality data) and the incentive for participants (earning cash rewards). The second is the natural variability of humans perceiving others' affective expressions, as was discussed earlier. Biel and Gatica-Perez (2013) crowdsourced personality attributes. Although they analyzed agreements among different participants, they did not conduct quality control catering to the two stated factors in the crowdsourcing. Kosti et al. (2017), however, used an ad hoc gold standard to control annotation quality, and each sample in the training set was only annotated once. Lu et al. (2017) crowdsourced evoked emotions of stimuli images. Building on Lu et al. (2017), Ye et al. (2019) proposed a probabilistic model, named the GLBA, to jointly model each worker's reliability and regularity, the two factors contributing to the inconsistent annotations, in order to improve the quality of affective data collected. Because the GLBA methodology is applicable to virtually any crowdsourced affective data, we use it for our data quality control pipeline as well.

2.3 Automatic Modeling of Bodily Expression

Automatic modeling of bodily expression (AMBE) typically requires three steps: human detection, pose estimation and tracking, and representation learning.
In such a pipeline, human(s) are detected frame-by-frame in a video, and their body landmarks are extracted by a pose estimator. Subsequently, if multiple people appear in the scene, the poses of the same person are associated along all frames (Iqbal et al., 2017). With each person's pose identified and associated across frames, an appropriate feature representation of each person is extracted.

Based on the way data is collected, we divide AMBE methods into video-based and non-video-based. For video-based methods, data are collected from a camera in the form of color videos. In (Gunes and Piccardi, 2005; Nicolaou et al., 2011), videos are collected in a lab setting with a pure-colored background and a fixed-perspective camera. They could detect and track hands and other landmarks with simple thresholding and grouping of pixels. Gunes and Piccardi (2005) additionally defined motion protocols, such as whether the hand is facing up, and combined them with landmark displacement as features. Nicolaou et al. (2011) used the positions of shoulders in the image frame, facial expression, and audio features as the input of a neural network. Our data, however, is not collected under such controlled settings, and thus has variations in viewpoint, lighting condition, and scale.

For non-video-based methods, locations of body markers are inferred by the MoCap system (Kleinsmith et al., 2011, 2006; Aristidou et al., 2015; Schindler et al., 2008). The first two steps, i.e., human detection, and pose estimation and tracking, are solved directly by the MoCap system. Geometric features, such as velocity, acceleration, and orientation of body landmarks, as well as motion protocols, can then be conveniently developed and used to build predictive models (Kleinsmith et al., 2011, 2006; Aristidou et al., 2015). For a more comprehensive survey of automatic modeling of bodily expression, readers are referred to the three surveys (Karg et al., 2013; Kleinsmith and Bianchi-Berthouze, 2013; Corneanu et al., 2018).

Related to AMBE, human behavior understanding (a.k.a. action recognition) has attracted a lot of attention. The emergence of large-scale annotated video datasets (Soomro et al., 2012; Caba Heilbron et al., 2015; Kay et al., 2017) and advances in deep learning (Krizhevsky et al., 2012) have accelerated the development of action recognition. To our knowledge, two-stream ConvNets-based models have been leading on this task (Simonyan and Zisserman, 2014; Wang et al., 2016; Carreira and Zisserman, 2017). The approach uses two networks, with an image input stream and an optical flow input stream, to characterize appearance and motion, respectively. Each stream of ConvNet learns human-action-related features in an end-to-end fashion. Recently, some researchers have attempted to utilize human pose information. Yan et al. (2018), for example, modeled human skeleton sequences using a spatiotemporal graph convolutional network. Luvizon et al. (2018) leveraged pose information using a multitask-learning approach. In our work, we extract LMA features based on skeletons and use them to build predictive models.
3 The BoLD Dataset

In this section, we describe how we created the BoLD dataset and provide results of our statistical analysis of the data.
Fig. 2 Overview of our data collection pipeline. The process involves crawling movies, segmenting them into clips, estimating the poses, and emotion annotation.
The Internet has vast natural human-to-human interaction videos, which serve as a rich source for our data. A large collection of video clips from daily lives is an ideal dataset for developing affective recognition capabilities because they match closely with common real-world situations. However, a majority of those user-uploaded, in-the-wild videos suffer from poor camera perspectives and may not cover a variety of emotions. We consider it beneficial to use movies and TV shows, e.g., reality shows or uploaded videos in social media, that are unconstrained but offer highly interactive and emotional content. Movies and TV shows are typically of high quality in terms of filming techniques and the richness of plots. Such shows are thus more representative in reflecting characters' emotional states than some other categories of videos, such as DIY instructional videos and news event videos, some of which were collected recently (Abu-El-Haija et al., 2016; Thomee et al., 2016). In this work, we have crawled 150 movies (220 hours in total) from YouTube by the video IDs curated in the AVA dataset (Gu et al., 2018).

Movies are typically filmed so that shots in one scene demonstrate characters' specific activities, verbal communication, and/or emotions. To make these videos manageable for further human annotation, we partition each video into short video clips using the kernel temporal segmentation (KTS) method (Potapov et al., 2014). KTS detects shot boundaries by keeping the variance of visual descriptors within a temporal segment small. A shot boundary can be either a change of scene or a change of camera perspective within the same scene. To avoid
confusion, we will use the term scene to indicate both cases.

Fig. 3 A frame in a video clip, with different characters numbered with an ID (e.g., 0 and 1 at the bottom left corner of red bounding boxes) and the body and/or facial landmarks detected (indicated with the stick figure).
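The KTS criterion just described, choosing boundaries so that within-segment descriptor variance stays small, can be sketched with a simple dynamic program. The sketch below is our simplification: it assumes a fixed number of segments, whereas the cited method selects the segment count automatically, and the function names are ours.

```python
import numpy as np

def segment_cost(x, i, j):
    """Within-segment variance cost of frames x[i:j] (j exclusive)."""
    seg = x[i:j]
    return float(((seg - seg.mean(axis=0)) ** 2).sum())

def kts_like(x, n_segments):
    """Split a (T,) or (T, d) descriptor sequence into n_segments
    contiguous segments minimizing total within-segment variance.
    Returns the list of segment start frames."""
    x = np.atleast_2d(np.asarray(x, dtype=float).T).T  # shape (T, d)
    T = len(x)
    INF = float("inf")
    # dp[k][t]: best cost of splitting x[:t] into k segments.
    dp = [[INF] * (T + 1) for _ in range(n_segments + 1)]
    cut = [[0] * (T + 1) for _ in range(n_segments + 1)]
    dp[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):
                c = dp[k - 1][s] + segment_cost(x, s, t)
                if c < dp[k][t]:
                    dp[k][t], cut[k][t] = c, s
    # Backtrack the segment start frames.
    bounds, t = [], T
    for k in range(n_segments, 0, -1):
        t = cut[k][t]
        bounds.append(t)
    return bounds[::-1]

# Two visually distinct "shots": descriptors near 0, then near 10.
x = np.concatenate([np.zeros(5), 10 * np.ones(5)])
print(kts_like(x, 2))  # [0, 5]
```

The O(K T^2) dynamic program is exact for a fixed segment count; real shot descriptors would be frame-level visual features rather than scalars.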
We adopted an approach to detect human body landmarks and track each character at the same time (Fig. 3). Because not all short clips contain human characters, we removed those clips without humans via pose estimation (Cao et al., 2017). Each clip was processed by a pose estimator frame-by-frame to acquire human body landmarks. Each character in a clip is treated as a different sample. To make the correspondence clear, we track each character and designate them with a unique ID number. Specifically, tracking was conducted on the upper-body bounding box, with the Kalman filter and Hungarian algorithm as the key components (Bewley et al., 2016). In our implementation, the upper-body bounding box was acquired from the landmarks on the face and shoulders. Empirically, to ensure reliable tracking results when presenting to the annotators, we removed short trajectories that covered fewer than 80% of the total frames.

Following the above steps, we generated 122,129 short clips from these movies. We removed facial close-up clips using results from pose estimation. Concretely, we included a clip in our annotation list if the character in it has at least three visible landmarks out of the six upper-body landmarks, i.e., wrists, elbows, and shoulders on both body sides (left and right). We further selected those clips with between 100 and 300 frames for manual annotation by the participants. An identified character with landmark tracking in a single clip is called an instance. We have curated a total of 48,037 instances for annotation from a total of 26,164 video clips.

(Pose estimation: https://github.com/CMU-Perceptual-Computing-Lab/caffe_rtpose; tracking: https://github.com/abewley/sort)

We used Amazon Mechanical Turk (AMT) for crowdsourcing emotion annotations of the 48,037 instances. For each Human Intelligence Task (HIT), a human participant completes emotion annotation assignments for 20 different instances, each of which was drawn randomly from the instance pool. Each instance is expected to be annotated by five different participants.

We asked human annotators to finish four annotation tasks per instance. Fig. 4 shows screenshots of our crowdsourcing website design. As a first step, participants must check if the instance is corrupted. An instance is considered corrupted if landmark tracking of the character is not consistent or the scene is not realistic in daily life, such as science fiction scenes. If an instance is not corrupted, participants are asked to annotate the character's emotional expressions according to both categorical emotions and dimensional emotions (i.e., valence, arousal, and dominance (VAD) in the dimensional emotion state model (Mehrabian, 1980)). For categorical emotions, we used the list in (Kosti et al., 2017), which contains 26 categories and is a superset of the six basic emotions (Ekman, 1993). Participants are asked to annotate these categories in the way of multi-label binary classifications.
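The frame-to-frame association step of the tracking described above, matching upper-body boxes across frames as in the cited SORT tracker, reduces to a linear assignment on box overlap. The sketch below is an illustrative simplification with our own helper names; it omits the Kalman prediction step.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, min_iou=0.3):
    """Match existing track boxes to new detections by maximizing total
    IoU (Hungarian algorithm). Returns a list of (track_idx, det_idx)."""
    if not tracks or not detections:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    # Drop matches whose overlap is too small to be the same person.
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]

tracks = [(0, 0, 10, 10), (50, 50, 60, 60)]
dets = [(51, 49, 61, 59), (1, 0, 11, 10)]
print(associate(tracks, dets))  # [(0, 1), (1, 0)]
```

Matched pairs keep their track IDs across frames; unmatched detections would start new tracks, and tracks unmatched for too long would be terminated.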
For each dimensional emotion, we used an integer scale from 1 to 10. These annotation tasks are meant to reflect the truth revealed in the visual and audio data, namely the movie characters' emotional expressions, and do not involve the participants' own emotional feelings. In addition to these tasks, participants are asked to specify a time interval (i.e., the start and end frames) over the clip that best represents the selected emotion(s) or has led to their annotation. Characters' and participants' demographic information (gender, age, and ethnicity) is also annotated/collected for complementary analysis. Gender categories are male and female. Age categories are defined as kid (aged up to 12 years), teenager (aged 13-20), and adult (aged over 20). Ethnicity categories are American Indian or Alaska Native, Asian, African American, Hispanic or Latino, Native Hawaiian or Other Pacific Islander, White, and Other.

Fig. 4 The web-based crowdsourcing data collection process. Screenshots of the four steps are shown: (1) video data quality check; (2) categorical emotion labeling; (3) dimensional emotion and demographic labeling; and (4) frame range identification. For each video clip, participants are directed to go through a sequence of screens with questions step-by-step.
The participants are permitted to hear the audio of the clip, which can include a conversation in English or some other language. While the goal of this research is to study the computability of body language, we allowed the participants to use all sources of information (facial expression, body movements, sound, and limited context) in their annotation in order to obtain as high accuracy as possible in the data collected. Additionally, the participants can play the clip back-and-forth during the entire annotation process for that clip.

To sum up, we crowdsourced the annotation of categorical and dimensional emotions, the time interval of interest, and character demographic information.
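Putting the collected fields together, one plausible per-instance annotation record looks like the following; the field names and example values are our illustration, not the released BoLD schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class InstanceAnnotation:
    """One participant's annotation of one instance (hypothetical schema)."""
    instance_id: str
    corrupted: bool = False
    # Multi-label binary choices over the 26 emotion categories.
    categories: List[str] = field(default_factory=list)
    # Dimensional emotions (VAD) on an integer 1-10 scale.
    valence: int = 5
    arousal: int = 5
    dominance: int = 5
    # Frame interval that best represents the selected emotion(s).
    start_frame: int = 0
    end_frame: int = 0
    # Character demographics as perceived by the annotator.
    gender: str = ""
    age_group: str = ""   # "kid", "teenager", or "adult"
    ethnicity: str = ""

a = InstanceAnnotation("clip0001_char0", categories=["happiness"],
                       valence=8, arousal=6, dominance=5,
                       start_frame=30, end_frame=120,
                       gender="female", age_group="adult", ethnicity="Asian")
print(a.valence)  # 8
```

Five such records per instance, one per participant, would then be aggregated into ground truth during the quality-control stage described next.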
Quality control has always been a necessary component of crowdsourcing to identify dishonest participants, but it is much more difficult for affect data. Different people may not perceive affect in the same way, and their understanding may be influenced by their cultural background, current mood, gender, and personal experiences. An honest participant could also be uninformative in affect annotation, and consequently, their annotations can be poor in quality. In our study, the variance in acquiring affects usually comes from two kinds of participants, i.e., dishonest ones, who give useless annotations for economic motivation, and exotic ones, who give inconsistent annotations compared with others. Note that exotic participants come with the nature of emotion, and annotations from exotic participants could still be useful when aggregating final ground truth or investigating cultural or gender effects of affect. In our crowdsourcing task, we want to reduce the variance caused by dishonest participants. In the meantime, we do not expect too many exotic participants, because that would lead to low consensus.

Using gold standard examples is a common practice in crowdsourcing to identify uninformative participants. This approach involves curating a set of instances with known ground truth and removing those participants who answer incorrectly. For our task, however, this approach is not as feasible as in conventional crowdsourcing tasks such as image object classification. To accommodate the subjectivity of affect, the gold standard has to be relaxed to a large extent. Consequently, the recall of dishonest participants is lower.

To alleviate the aforementioned dilemma, we used four complementary mechanisms for quality control, including three online approaches (i.e., analyzing while collecting the data) and an offline one (i.e., post-collection analysis).
The online approaches are participant screening, annotation sanity check, and relaxed gold standard test, while the offline one is reliability analysis.
– Participant screening.
First-time participants in our HIT must take a short empathy quotient (EQ) test (Wakabayashi et al., 2006). Only those who have an above-average EQ are qualified. This approach aims to reduce the number of exotic participants from the beginning.
– Annotation sanity check.
During the annotation process, the system checks consistency between the categorical emotion and dimensional emotion annotations as they are entered. Specifically, we expect an "affection", "esteem", "happiness", or "pleasure" instance to have an above-midpoint valence score; a "disapproval", "aversion", "annoyance", "anger", "sensitivity", "sadness", "disquietment", "fear", "pain", or "suffering" instance to have a below-midpoint valence score; a "peace" instance to have a below-midpoint arousal score; and an "excitement" instance to have an above-midpoint arousal score. As an example, if a participant chooses "happiness" and a valence rating between 1 and 5 (out of 10) for an instance, we treat the annotation as inconsistent. In each HIT, a participant fails this annotation sanity check if there are two inconsistencies among twenty instances.
– Relaxed gold standard test.
One control instance (relaxed gold standard) is randomly inserted in each HIT to monitor the participant's performance. We collected control instances in our trial run within a small trusted group and chose instances with very high consensus. We manually relaxed the acceptable range of each control instance to avoid false alarms. For example, for an indisputable sad emotion instance, we accept an annotation if valence is not higher than 6. An annotation that goes beyond the acceptable range is treated as failing the gold standard test. We selected nine control clips and their relaxed annotations as the gold standard. We did not use more control clips because the average number of completed HITs per participant is much less than nine, and the gold standard is rather relaxed and inefficient in terms of recall.
– Reliability analysis.
To further reduce the noise introduced by dishonest participants, we conduct reliability analysis over all participants. We adopted the method by Ye et al. (2019) to properly handle the intrinsic subjectivity in affective data. Reliability and regularity of participants are jointly modeled: a low-reliability-score participant corresponds to a dishonest participant, and a low-regularity participant corresponds to an exotic participant. This method was originally developed for improving the quality of dimensional annotations based on modeling the agreement multi-graph built from all participants and their annotated instances. For each dimension of VAD, this method estimates participant i's reliability score, i.e., r_i^v, r_i^a, and r_i^d. According to Ye et al. (2019), the valence and arousal dimensions are empirically meaningful for ranking participants' reliability scores. Therefore, we ensemble the reliability score as r_i = (2 r_i^v + r_i^a) / 3. We mark participant i as failing in reliability analysis if r_i falls below a preset threshold, given a large enough effective sample size.

Based on these mechanisms, we restrain those participants deemed 'dishonest.' After each HIT, participants with low performance are blocked for one hour. A low-performance participant is defined as one who fails either the annotation sanity check or the relaxed gold standard test. We reject the work if it shows low performance and fails the reliability analysis. In addition to these constraints, we also permanently exclude participants with a low reliability score from participating in our HITs again.

Whenever a single set of annotations is needed for a clip, proper aggregation is necessary to obtain a consensus annotation from multiple participants. The Dawid-Skene method (Dawid and Skene, 1979), which is typically used to combine noisy categorical observations, computes an estimated score (scaled between 0 and 1) for each instance. We used this method to aggregate the annotations of each categorical emotion and each categorical demographic attribute. In particular, we use the notation s_i^c to represent the estimated score of the binary categorical variable c for instance i, and threshold this score to obtain the final binary label.

For dimensional emotions, suppose an instance receives n annotations, where s_i^d denotes the score annotated by participant i, who has reliability score r_i, for dimensional emotion d, with i ∈ {1, 2, ..., n} and d ∈ {V, A, D} in the VAD model. The final annotation is then aggregated as

s^d = ( Σ_{i=1}^{n} r_i s_i^d ) / ( Σ_{i=1}^{n} r_i ) .  (1)

In the meantime, the instance confidence, following the method by Ye et al. (2019), is defined as

c = 1 − Π_{i=1}^{n} (1 − r_i) .  (2)

Note that we divided the final VAD scores by 10 so that the data range between 0 and 1. Our final dataset to be used for further analysis retained only those instances with confidence higher than 0.95.

Fig. 5
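Under these definitions, the reliability ensemble and the aggregation in Eqs. (1) and (2) amount to a few lines of code. The sketch below is illustrative only; the function names and toy inputs are ours, not part of the released pipeline:

```python
import numpy as np

def ensemble_reliability(r_v, r_a):
    """Combine per-dimension reliability scores as r = (2*r_v + r_a) / 3."""
    return (2.0 * r_v + r_a) / 3.0

def aggregate_dimensional(scores, reliabilities):
    """Reliability-weighted consensus (Eq. 1) and instance confidence (Eq. 2)
    for one dimensional emotion; raw scores are on the 0-10 scale."""
    s = np.asarray(scores, dtype=float) / 10.0     # rescale to [0, 1]
    r = np.asarray(reliabilities, dtype=float)
    consensus = float(np.sum(r * s) / np.sum(r))   # Eq. (1)
    confidence = float(1.0 - np.prod(1.0 - r))     # Eq. (2)
    return consensus, confidence
```

An instance whose `confidence` falls below the retention threshold would then be dropped from the final dataset.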
Examples of high-confidence instances in BoLD for the 26 categorical emotions and two instances (27 and 28) that were used for quality control. For each subfigure, the left side is a frame from the video, along with another copy that has the character entity IDs marked in a bounding box. The right side shows the corresponding aggregated annotation, annotation confidence c, demographics of the character, and the aggregated categorical and dimensional emotions.
Fig. 6
Distributions of the 26 different categorical emotions.
Fig. 7
Distributions of the three dimensional emotion ratings: valence, arousal, and dominance.
Fig. 8
Demographics of characters in our dataset.

As the perception of affect varies across participants, we do not expect absolute consensus for the collected labels. In fact, it is nontrivial to quantitatively understand and measure the quality of such affective data.
We have collected annotations for 13,239 instances. The dataset continues to grow as more instances and annotations are added. Fig. 5 shows some high-confidence instances in our dataset. Figs. 6, 7, and 8 show the distributions of categorical emotion, dimensional emotion, and demographic information, respectively. For each categorical emotion, the distribution is highly unbalanced. For dimensional emotions, the distributions of the three dimensions are Gaussian-like, while valence is right-skewed and dominance is left-skewed. Character demographics are also unbalanced: most characters in our movie-based dataset are male, white, and adult. We partition all instances into three sets: the training set, the validation set, and
Fig. 9
Correlations between pairs of categorical or dimensional emotions, calculated based on the BoLD dataset.

the testing set (20%, 2864). Our split protocol ensured that clips from the same raw movie video belong to the same set so that subsequent evaluations can be conducted faithfully.

We observed interesting correlations between pairs of categorical emotions and pairs of dimensional emotions. Fig. 9 shows the correlations between each pair of emotion categories. Categorical emotion pairs such as pleasure and happiness show high correlations, matching our intuition. Correlations between the dimensional emotions (e.g., valence and arousal) are weak. Valence is highly correlated with happiness (0.61) and pleasure, and negatively correlated with categories such as anger.

Table 1
Agreement among participants on categorical emotions and characters’ demographic information.
Category          κ      filtered κ
Peace             0.132  0.148
Affection         0.262  0.296
Esteem            0.077  –
Anticipation      0.071  0.078
Engagement        0.110  0.126
Confidence        0.166  –
Happiness         0.385  0.414
Pleasure          0.171  0.200
Excitement        0.178  –
Surprise          0.137  0.155
Sympathy          0.114  0.127
Doubt/Confusion   0.127  –
Disconnection     0.125  0.140
Fatigue           0.113  0.131
Embarrassment     0.066  –
Yearning          0.030  0.036
Disapproval       0.140  0.153
Aversion          0.075  –
Annoyance         0.176  0.197
Anger             0.287  0.307
Sensitivity       0.082  –
Sadness           0.233  0.267
Disquietment      0.110  0.125
Fear              0.193  –
Pain              0.273  0.312
Suffering         0.161  –
Average           0.154  –
Gender            0.863  0.884
Age               0.462  0.500
Ethnicity         0.410  –

We computed the Fleiss' Kappa score (κ) for each categorical emotion and each piece of categorical demographic information to understand the extent and reliability of agreement among participants. Perfect agreement leads to a score of one, while no agreement leads to a score less than or equal to zero. Table 1 shows Fleiss' Kappa (Gwet, 2014) among participants on each categorical emotion and categorical demographic attribute. κ is computed on all collected annotations for each category. For each category, we treated it as a two-category classification and constructed a subject-category table to compute Fleiss' Kappa. By filtering out those with low reliability scores, we also computed filtered κ. Note that some instances may have fewer than five annotations after removing annotations from low-reliability participants. We therefore modified the way to compute p_j, defined as the proportion of all assignments which were to the j-th category. Originally, it is

p_j = (1/N) Σ_{i=1}^{N} n_ij / n ,  (3)

where N is the number of instances, n_ij is the number of ratings annotators have assigned to the j-th category on the i-th instance, and n is the number of annotators per instance. In our filtered κ computation, n varies across instances, and we denote the number of annotators for instance i as n_i. Then Eq. (3) is revised as

p_j = (1/N) Σ_{i=1}^{N} n_ij / n_i .  (4)

Filtered κ is improved for each category, even for an objective category like gender, which also suggests the validity of our offline quality control mechanism. Note that our reliability score is computed over dimensional emotions, and thus the offline quality control approach is complementary. As shown in the table, affection, anger, sadness, fear, and pain have fair levels of agreement (0.2 < κ < 0.4), while the remaining emotion categories show only slight agreement. The κ scores of the demographic annotations are close to those of previous studies reported in (Biel and Gatica-Perez, 2013).
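The filtered-κ computation, with a per-instance annotator count n_i as in Eq. (4), can be sketched as follows. This is an illustrative implementation; `counts` is a hypothetical instance-by-category count matrix:

```python
import numpy as np

def fleiss_kappa_varying_n(counts):
    """Fleiss' kappa where the number of annotators n_i may differ per
    instance (Eq. 4). counts[i, j] is the number of annotators assigning
    instance i to category j; every instance needs at least two annotators."""
    counts = np.asarray(counts, dtype=float)
    n_i = counts.sum(axis=1)                       # annotators per instance
    p_j = np.mean(counts / n_i[:, None], axis=0)   # category proportions, Eq. (4)
    # Per-instance observed agreement, then its mean and the chance agreement.
    P_i = np.sum(counts * (counts - 1), axis=1) / (n_i * (n_i - 1))
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1 - P_e)
```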
Because the annotations are collected from the same participant population, κ also represents how difficult or subjective the task is. Evidently, gender is the most consistent (hence the easiest) task among all categories. The data confirm that emotion recognition is both challenging and subjective even for human beings with a sufficient level of EQ. Participants in our study passed an EQ test designed to measure one's ability to sense and respond to others' feelings, and we suspect that individuals we excluded due to a failed EQ test would likely experience greater difficulty in recognizing emotions.

For dimensional emotions, we computed both across-annotation variances and within-instance annotation variances. The variances across all annotations are 5.83, 6.66, and 6.40 for valence, arousal, and dominance, respectively. The within-instance variance (over different annotators) is computed for each instance, and the means of these variances are 3.79, 5.24, and 4.96, respectively.
Fig. 10
Reliability score distribution among low-performance participants (failure) and non-low-performance participants (pass).
Notice that for the three dimensions, the within-instance variances are lower than the across-annotation variances by 35%, 21%, and 23%, respectively, which illustrates how humans reduce variance when given concrete examples. Interestingly, participants are better at recognizing positive and negative emotions (i.e., valence) than the other dimensions.
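The two variance statistics reported above can be computed as follows (a minimal sketch; the rating arrays are hypothetical):

```python
import numpy as np

def annotation_variances(ratings_per_instance):
    """Across-annotation variance and mean within-instance variance for one
    dimension (e.g., valence). `ratings_per_instance` is a list of 1-D arrays,
    one array of annotator ratings per instance."""
    all_ratings = np.concatenate(ratings_per_instance)
    across = np.var(all_ratings)                                  # over all annotations
    within = np.mean([np.var(r) for r in ratings_per_instance])   # per instance, averaged
    return across, within
```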
We explored the difference between low-performance participants and low-reliability-score participants. As shown in Fig. 10, low-performance participants show a lower reliability score on average. While a significantly large number of low-performance participants have rather high reliability scores, most non-low-performance participants have high reliability scores. We used the F1 score for categorical emotions, and the R² score and mean squared error (MSE) for dimensional emotions, to evaluate each participant's performance. Similar to our standard annotation aggregation procedure, we ignored instances with a confidence score less than 0.95 when dealing with dimensional emotions. Fig. 11 shows the cumulative distributions of participants' F1 scores, and of the R² and MSE scores for dimensional emotions. We calculated both the vanilla R² score and a rank-percentile-based R² score; for the latter, we used rank percentiles for both the prediction and the ground truth. The areas under the curves (excluding Fig. 11(5)) can be interpreted as how difficult it is for humans to recognize each emotion. For example, humans are effective at recognizing happiness while ineffective at recognizing yearning. Similarly, humans are better at recognizing the level of valence than that of arousal or dominance. These results reflect the challenge of achieving high classification and regression performance in emotion recognition even for human beings.

Culture, gender, and age could be important factors in emotion understanding. As mentioned in Section 3.1.4, we have nine quality control videos in our crowdsourcing process that have been annotated for emotion more than 300 times. We used these quality control videos to test whether the annotations are independent of annotators' culture, gender, and age. For categorical annotations (including both categorical emotions and categorical character demographics), we conducted a χ² test on each video. For each control instance, we calculated the p-value of the χ² test over the annotations (26 categorical emotions and 3 character demographic factors) from the different groups defined by the annotators' three demographic factors. This process results in 29 × 3 p-values for each control instance.
Fig. 11
Human regression performance on dimensional emotions. X-axis: participant population percentile. Y-axis: F1, R², and MSE scores. Tables inside each plot in the second row summarize the top 30%, 20%, 10%, and 5% participant regression scores.

For dimensional annotations, we similarly calculated the p-value of a one-way ANOVA test over the three VAD annotations from the groups defined by the annotators' three demographic factors, resulting in 54 p-values in total. Several of the tests yield significant group differences. For example, for Fig. 5(27), the valence average is 4.12 among African American annotators and 4.41 among White annotators, while the arousal average among Asian annotators is 7.20, compared with 8.21 among White annotators. For Fig. 5(28), the valence average of person 1 is 6.30 among Asians yet 5.09 among African Americans, and the arousal average reaches 8.27 among African Americans.

Our data collection efforts offer important lessons. The efforts confirmed that reliability analysis is useful for collecting subjective annotations such as emotion labels when no gold-standard ground truth is available. As shown in Table 1, the consensus (filtered κ) over highly reliable participants is higher than that over all participants (κ). This finding holds for both subjective questions (categorical emotions) and objective questions (character demographics), even though the reliability score is calculated from the separate VAD annotations, which is evidence that the score does not overfit. As an offline quality-control component, the method we developed and used to generate reliability scores (Ye et al., 2019) is suitable for analyzing such affective data. For example, one can also apply our proposed data collection pipeline to collect data for the task of image aesthetics modeling (Datta et al., 2006). In addition to their effectiveness in quality control, reliability scores are very useful for resource allocation. With a limited annotation budget, it is more reasonable to reward highly reliable participants rather than less reliable ones.
Table 2
Laban Movement Analysis (LMA) features (f_i: feature; m: number of measurements; dist.: distance; accel.: acceleration). Angular features are computed over all C limb pairs.

f1  Feet-hip dist. (m = 4)        f17 Hip accel. (m = 4)
f2  Hands-shoulder dist. (m = 4)  f18 Knee accel. (m = 4)
f3  Hands dist. (m = 4)           f19 Feet accel. (m = 4)
f4  Hands-head dist. (m = 4)      f20 Angular accel. (m = 4C)
f5  Centroid-pelvis dist. (m = 4) f21 Shoulders jerk (m = 4)
f6  Gait size (foot dist.) (m = 4) f22 Elbow jerk (m = 4)
f7  Shoulders velocity (m = 4)    f23 Hands jerk (m = 4)
f8  Elbow velocity (m = 4)        f24 Hip jerk (m = 4)
f9  Hands velocity (m = 4)        f25 Knee jerk (m = 4)
f10 Hip velocity (m = 4)          f26 Feet jerk (m = 4)
f11 Knee velocity (m = 4)         f27 Volume (m = 4)
f12 Feet velocity (m = 4)         f28 Volume (upper body) (m = 4)
f13 Angular velocity (m = 4C)     f29 Volume (lower body) (m = 4)
f14 Shoulders accel. (m = 4)      f30 Volume (left side) (m = 4)
f15 Elbow accel. (m = 4)          f31 Volume (right side) (m = 4)
f16 Hands accel. (m = 4)          f32 Torso height (m = 4)
In this section, we investigate two pipelines for automated recognition of bodily expression and present quantitative results for several baseline methods. Unlike AMT participants, who were provided with all the information regardless of whether they used all of it in their annotation process, the first computerized pipeline relied solely on body movements, not on facial expressions, audio, or context. The second pipeline took a sequence of cropped images of the human body as input, without explicitly modeling facial expressions.

4.1 Learning from Skeleton
Laban notation, originally proposed by Rudolf Laban (1971), is used for documenting body movements in dancing such as ballet. Laban movement analysis (LMA) uses four components to record human body movements: body, effort, shape, and space. The body category represents structural and physical characteristics of human body movements. It describes which body parts are moving, which parts are connected, which parts are influenced by others, and general statements about body organization. The effort category describes the inherent intention of a movement. Shape describes static body
Fig. 12
Illustration of the human skeleton. Both red lines and black lines are considered limbs in our context.

shapes, the way the body interacts with something, the way the body changes toward some point in space, and the way the torso changes in shape to support movements in the rest of the body. LMA or its equivalent notation systems are widely used in psychology for emotion analysis (Wallbott, 1998; Kleinsmith et al., 2006) and in human-computer interaction for emotion generation and classification (Aristidou et al., 2017, 2015). In our experiments, we use the features listed in Table 2.

LMA is conventionally conducted on 3D motion capture data that provide 3D coordinates of body landmarks. In our case, we estimated 2D poses on images using (Cao et al., 2017). In particular, we denote p_i^t ∈ R² as the coordinate of the i-th joint at the t-th frame. By the nature of the data, our 2D pose estimation usually has missing joint locations and varies in scale. In our implementation, we ignored an instance if the dependencies needed to compute a feature were missing. To address the scaling issue, we normalized each pose by the average length of all visible limbs, such as shoulder-elbow and elbow-wrist. Let ν = {(i, j) | joint i and joint j are visible} be the visible set of the instance. We computed the normalized pose p̂_i^t by

s = (1 / (T |ν|)) Σ_{(i,j)∈ν} Σ_{t=1}^{T} ||p_i^t − p_j^t|| ,   p̂_i^t = p_i^t / s .  (5)

The first part of the LMA features, the body component, captures the pose configuration. For f1, f2, f3, f4, and f6, we computed the distance between the specified joints frame by frame. For symmetric joints, as in the feet-hip distance, we used the mean of the left-foot-hip and right-foot-hip distances in each frame. The same protocol was applied to other features that contain symmetric joints, such as hands velocity. For f5, the centroid was averaged over all visible joints, and the pelvis is the midpoint between the left hip and the right hip.
This feature is designed to represent the barycenter deviation of the body.

The second part of the LMA features, the effort component, captures body motion characteristics. Based on the normalized pose, the joint velocity v̂_i^t, acceleration â_i^t, and jerk ĵ_i^t were computed as

v_i^t = (p̂_i^{t+τ} − p̂_i^t) / τ ,  a_i^t = (v_i^{t+τ} − v_i^t) / τ ,  j_i^t = (a_i^{t+τ} − a_i^t) / τ ,
v̂_i^t = ||v_i^t|| ,  â_i^t = ||a_i^t|| ,  ĵ_i^t = ||j_i^t|| .  (6)

Angles, angular velocities, and angular accelerations between each pair of limbs (Fig. 12) were calculated for each pose:

θ^t(i, j, m, n) = arccos( ((p̂_i^t − p̂_j^t) · (p̂_m^t − p̂_n^t)) / (||p̂_i^t − p̂_j^t|| ||p̂_m^t − p̂_n^t||) ) ,
ω^t(i, j, m, n) = (θ^{t+τ}(i, j, m, n) − θ^t(i, j, m, n)) / τ ,
α^t(i, j, m, n) = (ω^{t+τ}(i, j, m, n) − ω^t(i, j, m, n)) / τ .  (7)

We computed the velocity, acceleration, jerk, angular velocity, and angular acceleration of joints with τ = 15. Empirically, the features become less effective when τ is too small (1 ∼
2) or too large. The third part of the LMA features, the shape component, captures body shape. For f27, f28, f29, f30, and f31, the area of the bounding box that contains the corresponding joints is used to approximate volume. Finally, all features are summarized by their basic statistics (maximum, minimum, mean, and standard deviation, denoted as f_i^max, f_i^min, f_i^mean, and f_i^std, respectively) over time. With all LMA features combined, each skeleton sequence can be represented by a fixed-length feature vector. We then searched model parameters with cross-validation on the combined set of training and validation data. Finally, we used the selected best parameters to retrain the model on the combined set.
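Assuming a complete (T, J, 2) array of estimated 2D joint coordinates (in practice, missing joints must be handled as described above), the scale normalization of Eq. (5) and the kinematic effort features of Eq. (6) can be sketched as:

```python
import numpy as np

def normalize_poses(poses, limbs):
    """Scale-normalize a (T, J, 2) pose sequence by the average visible
    limb length (Eq. 5). `limbs` is a list of (i, j) joint index pairs."""
    lengths = [np.linalg.norm(poses[:, i] - poses[:, j], axis=1) for i, j in limbs]
    s = np.mean(lengths)
    return poses / s

def kinematics(norm_poses, tau=15):
    """Per-joint speed, acceleration, and jerk magnitudes (Eq. 6),
    computed with finite differences at stride tau."""
    v = (norm_poses[tau:] - norm_poses[:-tau]) / tau
    a = (v[tau:] - v[:-tau]) / tau
    j = (a[tau:] - a[:-tau]) / tau
    mag = lambda x: np.linalg.norm(x, axis=-1)
    return mag(v), mag(a), mag(j)
```

The summary statistics (max, min, mean, std over time) of these magnitudes then give the corresponding entries of the LMA feature vector.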
Besides handcrafted LMA features, we experimented with an end-to-end feature learning method. Following (Yan et al., 2018), human body landmarks can be constructed as a graph using their natural connectivity. Considering the time dimension, a skeleton sequence can then be represented with a spatiotemporal graph. Graph convolution (Kipf and Welling, 2016) is used as the building block in ST-GCN, which was originally proposed for skeleton-based action recognition. In our task, each skeleton sequence is first normalized between 0 and 1 using the largest bounding box of the skeleton sequence, and missing joints are filled with zeros. We used the same architecture as in (Yan et al., 2018) and trained it on our task with a binary cross-entropy loss and a mean-squared-error loss. Our learning objective L can be written as

L_cat = − Σ_i [ y_i^cat log x_i^cat + (1 − y_i^cat) log(1 − x_i^cat) ] ,
L_cont = Σ_i ( y_i^cont − x_i^cont )² ,
L = L_cat + L_cont ,  (8)

where x_i^cat and y_i^cat are the predicted probability and the ground truth, respectively, for the i-th categorical emotion, and x_i^cont and y_i^cont are the model prediction and the ground truth, respectively, for the i-th dimensional emotion.

4.2 Learning from Pixels

Essentially, bodily expression is conveyed through body activities. Activity recognition is a popular task in computer vision; the goal is to classify human activities, such as sports and housework, from videos. In this subsection, we use four classical human activity recognition methods to extract features (Kantorov and Laptev, 2014; Simonyan and Zisserman, 2014; Wang et al., 2016; Carreira and Zisserman, 2017). Current state-of-the-art results in activity recognition are achieved by two-stream network-based deep-learning methods (Simonyan and Zisserman, 2014). Prior to that, trajectory-based handcrafted features were shown to be efficient and robust (Wang et al., 2011; Wang and Schmid, 2013).
The main idea of trajectory-based feature extraction is selecting extended image features along point trajectories. Motion-based descriptors, such as histograms of flow (HOF) and motion boundary histograms (MBH) (Dalal et al., 2006), are widely used in activity recognition for their good performance (Wang et al., 2011; Wang and Schmid, 2013). Common trajectory-based activity recognition has the following steps: 1) computing dense trajectories based on optical flow; 2) extracting descriptors along those dense trajectories; 3) encoding the dense descriptors with Fisher vectors (Perronnin and Dance, 2007); and 4) training a classifier on the encoded histogram-based features.
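Step 4 can be illustrated with a minimal hinge-loss linear SVM trained by subgradient descent. This stands in for the off-the-shelf SVM implementations used in practice, and the "encoded" features below are random stand-ins for real step-3 output:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, C=1.0, epochs=200):
    """Minimal hinge-loss linear SVM via subgradient descent; y in {-1, +1}.
    Objective: 0.5*||w||^2 + (C/n) * sum(max(0, 1 - y*(Xw + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                                   # margin violators
        grad_w = w - C * (y[mask][:, None] * X[mask]).sum(axis=0) / n
        grad_b = -C * y[mask].sum() / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# Hypothetical encoded features: one fixed-length vector per video clip,
# with a binary label for one categorical emotion.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 0.5, (50, 8)), rng.normal(-1.0, 0.5, (50, 8))])
y = np.concatenate([np.ones(50), -np.ones(50)])
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

One such binary classifier would be trained per categorical emotion, with regressors handling the dimensional ratings.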
In this work, we cropped each instance from the raw clips with a fixed bounding box that bounds the character over time. We used the implementation of Kantorov and Laptev (2014) to extract trajectory-based activity features. We trained 26 SVM classifiers for binary categorical emotion classification and three SVM regressors for dimensional emotion regression. We selected the penalty parameter based on the validation set and report results on the test set.

Two-stream network-based deep-learning methods learn to extract features in an end-to-end fashion (Simonyan and Zisserman, 2014). A typical model of this type contains two convolutional neural networks (CNNs): one takes static images as input, and the other takes stacked optical flow as input. The final prediction is an average ensemble of the two networks. In our task, we used the same learning objective L as defined in Eq. (8). We implemented the two-stream networks in PyTorch, using a 101-layer ResNet (He et al., 2016) as our network architecture. Optical flow was computed via the TVL1 optical flow algorithm (Zach et al., 2007). Both the image and the optical flow were cropped with the instance body centered. Since emotion understanding could potentially be related to color, angle, and position, we did not apply any data augmentation strategies. The training procedure is otherwise identical to that of Simonyan and Zisserman (2014). TSN divides each video into K segments; one frame is randomly sampled from each segment during the training stage, and the video-level prediction is averaged over all sampled frames. In our task, the learning rate is set to 0.001 and the batch size to 128. For the two-stream inflated 3D ConvNet (I3D) (Carreira and Zisserman, 2017), 3D convolution replaces 2D convolution in the original two-stream network. With

https://github.com/vadimkantorov/fastvideofeat
http://pytorch.org/
3D convolution, the architecture can learn spatiotemporal features in an end-to-end fashion. This architecture also leverages recent advances in image classification by duplicating the weights of a pretrained image-classification model over the temporal dimension and using them as initialization. In our task, the learning rate is set to 0.01 and the batch size to 12. Both experiments were conducted on a server with two NVIDIA Tesla K40 cards. Other training details are the same as in the original works (Wang et al., 2016; Carreira and Zisserman, 2017).

4.3 Results
We evaluated all methods on the test set. For categorical emotions, we used average precision (AP, the area under the precision-recall curve) and the area under the receiver operating characteristic curve (ROC AUC) to evaluate classification performance. For dimensional emotions, we used the R² score to evaluate regression performance. Specifically, the random baseline of AP is the proportion of positive samples (P.P.). ROC AUC can be interpreted as the probability of ranking a randomly chosen positive sample above a randomly chosen negative sample; its random baseline is therefore 0.5. To compare the performance of different models, we also report the mean R² score (mR²) over the three dimensional emotions, and the mean average precision (mAP) and mean ROC AUC (mRA) over the 26 categories of emotion. For ease of comparison, we define the emotion recognition score (ERS) as

ERS = (1/2) [ mR² + (1/2)(mAP + mRA) ] ,  (9)

and use it to compare the performance of different methods.

For each categorical emotion and each dimension of VAD, we conducted linear regression tests on each dimension of the features listed in Table 2. All tests were conducted using the BoLD training set. We did not find strong correlations (R² < 0.02) between LMA features and the emotion dimensions other than arousal, i.e., the categorical emotions, valence, and dominance. Arousal, however, seems to be significantly correlated with LMA features. Fig. 13 shows the kernel density estimation plots of the features with the top R² on arousal. Hands-related features are good indicators for arousal; with hand acceleration, f16^mean, alone, a nontrivial R² can already be achieved.

Fig. 13
Kernel density estimation plots on selected LMA features that have high correlation with arousal.
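With mAP and mRA expressed in [0, 1], the ERS of Eq. (9) is a one-liner; the chance-baseline inputs below are taken from Table 3:

```python
def emotion_recognition_score(m_r2, m_ap, m_ra):
    """ERS = (1/2) * (mR^2 + (1/2) * (mAP + mRA)), Eq. (9)."""
    return 0.5 * (m_r2 + 0.5 * (m_ap + m_ra))

# Chance baseline: mR^2 = 0, mAP = 10.55% = 0.1055, mRA = 50% = 0.5.
ers_chance = emotion_recognition_score(0.0, 0.1055, 0.5)
```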
Table 3 shows the results on the emotion classification and regression tasks. TSN achieves the best performance, with a mean R² of 0.095. Similar to the results from the skeleton-based methods, the two-stream network-based methods show better regression performance on arousal than on valence and dominance. However, as shown in Fig. 11, the workers with top-10% performance achieve R² scores of −0.01 for arousal and 0.16 for dominance, and a still higher score for valence. Apparently, humans are best at recognizing valence and worst at recognizing arousal, and the distinction between human performance and model performance suggests that there could be other useful features that the models have not explored.
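The regression metric used in these comparisons is the coefficient of determination, which can be negative on held-out data when a model fits worse than the ground-truth mean (as with TF's mR² in Table 3). A minimal reference implementation (ours, for illustration only):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination.

    Returns 1.0 for a perfect fit, 0.0 for predicting the mean of
    y_true everywhere, and a negative value for anything worse.
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

For example, `r_squared([1, 2, 3], [2, 2, 2])` is 0, while reversing the predictions, `r_squared([1, 2, 3], [3, 2, 1])`, gives −3.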
Fig. 14 Classification performance (AP: average precision, top left; RA: ROC AUC, top right) and regression performance (R², bottom) of different methods on each categorical and dimensional emotion.

Table 3 Dimensional emotion regression and categorical emotion classification performance on the test set. mR² = mean of R² over the dimensional emotions, mAP(%) = mean average precision / area under the precision-recall curve (PR AUC) over the categorical emotions, mRA(%) = mean of the area under the ROC curve (ROC AUC) over the categorical emotions, and ERS = emotion recognition score. Baseline methods: ST-GCN (Yan et al., 2018), TF (Kantorov and Laptev, 2014), TS-ResNet101 (Simonyan and Zisserman, 2014), I3D (Carreira and Zisserman, 2017), and TSN (Wang et al., 2016).

Model                mR²      mAP(%)   mRA(%)   ERS
Random Method based on Priors:
Chance               0        10.55    50.00    0.151
Learning from Skeleton:
LMA                  0.044    12.63    55.96    0.193
ST-GCN               0.075    13.59    57.71    0.216
Learning from Pixels:
TF                   −0.008   10.93    50.25    0.149
TS-ResNet101         –        –        –        –
I3D                  –        –        61.24    –
TSN                  0.095    17.02    62.70    0.247
TSN-Spatial          0.048    15.34    60.03    0.212
TSN-Temporal         0.098    15.78    61.28    0.242

For the first set of experiments, we used two pretrained models, i.e., an image-classification model pretrained on ImageNet (Deng et al., 2009) and an action-recognition model pretrained on Kinetics (Kay et al., 2017), to initialize TSN. Table 4 shows the results for each case. The results demonstrate that initializing with the pretrained ImageNet model leads to slightly better emotion-recognition performance.

For the second set of experiments, we trained TSN with two other input types, i.e., face only and faceless body. Our experiment in the last section crops the whole human body as the input. For face only, we crop the face for both the spatial branch (RGB image) and the temporal branch (optical flow) during both the training and testing stages. Note that in the face-only setting, the orientation of faces in our dataset may be inconsistent, i.e., facing forward, facing backward, or facing to the side. For the faceless body, we still crop the whole body, but we mask the face region by imputing its pixel values with a constant of 128. Table 5 shows the results for each setting. The performance of using either the face or the faceless body as input is comparable to that of using the whole body as input. This result suggests that both the face and the rest of the body contribute significantly to the final prediction. Although the "whole body" setting of TSN performs better than either single-region model does, it does so by leveraging both facial expression and bodily expression.

Table 4 Ablation study on the effect of pretrained models.

Pretrained Model   mR²     mAP(%)   mRA(%)   ERS
ImageNet           0.095   17.02    62.70    0.247
Kinetics           0.093   16.77    62.53    0.245

Table 5 Ablation study on the effect of the face.

Input Type      mR²     mAP(%)   mRA(%)   ERS
whole body      0.095   17.02    62.70    0.247
face only       0.092   16.21    62.18    0.242
faceless body   0.088   16.61    62.30    0.241

Table 6 Ensembled results.

Model      mR²     mAP(%)   mRA(%)   ERS
TSN-body   0.095   17.02    62.70    0.247
–          0.101   16.70    62.75    0.249
–          0.101   17.31    63.46    0.252
ARBEE      0.103   17.14    63.52    0.253

ARBEE, the average ensemble of all three models, i.e., body, face, and skeleton, achieves the best performance.

We further investigated how well ARBEE retrieves instances in the test set given a specific categorical emotion as the query. Concretely, we calculated precision at 10, precision at 100, and R-Precision, as summarized in Table 7. R-Precision is computed as precision at R, where R is the number of positive samples. Similar to the classification results, happiness and pleasure can be retrieved with a rather high level of precision.
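The average ensemble behind ARBEE can be sketched as follows. This is our illustrative reconstruction rather than the released implementation; it assumes each of the three models outputs one score per emotion category and that the scores are simply averaged element-wise:

```python
def average_ensemble(predictions):
    """Average per-category scores from multiple models.

    predictions: a list of equal-length score lists, one per model
    (e.g., from the body, face, and skeleton models).
    Returns the element-wise mean across models.
    """
    n = len(predictions)
    return [sum(scores) / n for scores in zip(*predictions)]

# Toy example with three models scoring two categories:
fused = average_ensemble([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]])
# fused is approximately [0.3, 0.7]
```

An unweighted average is the simplest late-fusion rule; it requires no extra training and degrades gracefully when the member models have comparable accuracy.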
Table 7
Retrieval results of our deep model. P@K(%) = precision at K, R-P(%) = R-Precision.
Category P@10 P@100 R-P
Peace 40 33 28
Affection 50 32 26
Esteem 30 14 12
Anticipation 30 24 20
Engagement 50 46 42
Confidence 40 33 31
Happiness 30 36 31
Pleasure 40 25 23
Excitement 50 41 31
Surprise 20 6 8
Sympathy 10 14 12
Doubt/Confusion 20 33 25
Disconnection 20 20 18
Fatigue 40 20 17
Embarrassment 0 5 5
Yearning 0 2 4
Disapproval 30 28 22
Aversion 10 10 11
Annoyance 30 28 23
Anger 40 24 20
Sensitivity 30 19 19
Sadness 50 34 25
Disquietment 10 26 25
Fear 10 8 8
Pain 20 9 12
Suffering 10 17 18
Average 27 23 20
We proposed a scalable and reliable video-data collection pipeline and collected a large-scale bodily expression dataset, BoLD. We have validated our data collection via statistical analysis. To our knowledge, our effort is the first quantitative investigation of human performance on emotional expression recognition involving thousands of people, tens of thousands of clips, and thousands of characters. Importantly, we found significant predictive features regarding the computability of bodily emotion, i.e., hand acceleration for emotional expressions along the dimension of arousal. Moreover, for the first time, our deep model demonstrates decent generalizability for bodily expression recognition in the wild.

Possible directions for future work are numerous. First, our model's regression performance on arousal is clearly better than on valence, yet our analysis shows that humans are better at recognizing valence. The inadequacy in feature extraction and modeling, especially for valence, suggests the need for additional investigation. Second, our analysis has identified demographic factors in emotion perception between different ethnic groups. Our current model has largely ignored these potentially useful factors; considering characters' demographics in the inference of bodily expression can be a fascinating research direction. Finally, although this work has focused on bodily expression, the BoLD dataset we have collected has several other modalities useful for emotion recognition, including audio and visual context. An integrated approach to studying these will likely lead to exciting real-world applications.
Acknowledgements
This material is based upon work supported in part by The Pennsylvania State University. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant No. ACI-1548562 (Towns et al., 2014). The work was also supported through a GPU gift from the NVIDIA Corporation. The authors are grateful to the thousands of Amazon Mechanical Turk independent contractors for their time and dedication in providing invaluable emotion ground-truth labels for the video collection. Hanjoo Kim contributed in some of the discussions. Jeremy Yuya Ong supported the data collection and visualization effort. We thank Amazon.com, Inc. for supporting the expansion of this line of research.
References
Abu-El-Haija S, Kothari N, Lee J, Natsev P, Toderici G, Varadarajan B, Vijayanarasimhan S (2016) YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675

Aristidou A, Charalambous P, Chrysanthou Y (2015) Emotion analysis and classification: understanding the performers' emotions using the LMA entities. Computer Graphics Forum 34(6):262–276

Aristidou A, Zeng Q, Stavrakis E, Yin K, Cohen-Or D, Chrysanthou Y, Chen B (2017) Emotion control of unstructured dance movements. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, article 9

Aviezer H, Trope Y, Todorov A (2012) Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science 338(6111):1225–1229

Bewley A, Ge Z, Ott L, Ramos F, Upcroft B (2016) Simple online and realtime tracking. In: Proceedings of the IEEE International Conference on Image Processing, pp 3464–3468, DOI 10.1109/ICIP.2016.7533003

Biel JI, Gatica-Perez D (2013) The YouTube lens: Crowdsourced personality impressions and audiovisual analysis of vlogs. IEEE Transactions on Multimedia 15(1):41–55

Caba Heilbron F, Escorcia V, Ghanem B, Carlos Niebles J (2015) ActivityNet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 961–970

Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7291–7299

Carmichael L, Roberts S, Wessell N (1937) A study of the judgment of manual expression as presented in still and motion pictures. The Journal of Social Psychology 8(1):115–142

Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4724–4733

Corneanu C, Noroozi F, Kaminska D, Sapinski T, Escalera S, Anbarjafari G (2018) Survey on emotional body gesture recognition. IEEE Transactions on Affective Computing

Dael N, Mortillaro M, Scherer KR (2012) Emotion expression in body action and posture. Emotion 12(5):1085

Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: Proceedings of the European Conference on Computer Vision, Springer, pp 428–441

Datta R, Joshi D, Li J, Wang JZ (2006) Studying aesthetics in photographic images using a computational approach. In: Proceedings of the European Conference on Computer Vision, Springer, pp 288–301

Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics pp 20–28

De Gelder B (2006) Towards the neurobiology of emotional body language. Nature Reviews Neuroscience 7(3):242–249

Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 248–255

Douglas-Cowie E, Cowie R, Sneddon I, Cox C, Lowry O, McRorie M, Martin JC, Devillers L, Abrilian S, Batliner A, et al. (2007) The HUMAINE database: addressing the needs of the affective computing community. In: Proceedings of the International Conference on Affective Computing and Intelligent Interaction, pp 488–500

Ekman P (1992) Are there basic emotions? Psychological Review 99(3):550–553

Ekman P (1993) Facial expression and emotion. American Psychologist 48(4):384

Ekman P, Friesen WV (1977) Facial Action Coding System: A technique for the measurement of facial movement. Consulting Psychologists Press, Stanford University, Palo Alto

Ekman P, Friesen WV (1986) A new pan-cultural facial expression of emotion. Motivation and Emotion 10(2):159–168

Eleftheriadis S, Rudovic O, Pantic M (2015) Discriminative shared Gaussian processes for multiview and view-invariant facial expression recognition. IEEE Transactions on Image Processing 24(1):189–204

Fabian Benitez-Quiroz C, Srinivasan R, Martinez AM (2016) EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5562–5570

Gu C, Sun C, Ross DA, Vondrick C, Pantofaru C, Li Y, Vijayanarasimhan S, Toderici G, Ricco S, Sukthankar R, et al. (2018) AVA: A video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6047–6056

Gunes H, Piccardi M (2005) Affect recognition from face and body: early fusion vs. late fusion. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol 4, pp 3437–3443

Gunes H, Piccardi M (2007) Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer Applications 30(4):1334–1345

Gwet KL (2014) Handbook of Inter-rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Advanced Analytics, LLC

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778

Iqbal U, Milan A, Gall J (2017) PoseTrack: Joint multi-person pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2011–2020

Kantorov V, Laptev I (2014) Efficient feature extraction, encoding and classification for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2593–