Human Social Interaction Modeling Using Temporal Deep Networks
Mohamed R. Amer, Behjat Siddiquie, Amir Tamrakar, David A. Salter, Brian Lande, Darius Mehri, Ajay Divakaran
Mohamed R. Amer, SRI International
Behjat Siddiquie, SRI International
Amir Tamrakar, SRI International
David A. Salter, SRI International
Brian J. Lande, UC Santa Cruz
Darius Mehri, UC Berkeley
Ajay Divakaran, SRI International
ABSTRACT
We present a novel approach to computational modeling of social interactions based on modeling of essential social interaction predicates (ESIPs) such as joint attention and entrainment. Based on sound social psychological theory and methodology, we collect a new “Tower Game” dataset consisting of audio-visual capture of dyadic interactions labeled with the ESIPs. We expect this dataset to provide a new avenue for research in computational social interaction modeling. We propose a novel joint Discriminative Conditional Restricted Boltzmann Machine (DCRBM) model that combines a discriminative component with the generative power of CRBMs. Such a combination enables us to uncover actionable constituents of the ESIPs in two steps. First, we train the DCRBM model on the labeled data and obtain accurate (76%-49% across various ESIPs) detection of the predicates. Second, we exploit the generative capability of DCRBMs to activate the trained model so as to generate the lower-level data corresponding to a specific ESIP that closely matches the actual training data (with mean square error 0.01-0.1 for generating 100 frames). We are thus able to decompose the ESIPs into their constituent actionable behaviors. Such a purely computational determination of how to establish an ESIP such as engagement is unprecedented.
Categories and Subject Descriptors
H.1.2 [Models and Principles]: User/Machine Systems—Human information processing; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding
General Terms
Algorithms, Theory, Human Factors
Keywords
Hybrid Models; Deep Learning; DCRBMs; Social Interaction; Computational Social Psychology; Tower Game Dataset
1. INTRODUCTION
This research brings together multiple disciplines to explore the problem of social interaction modeling. The goal of this work is to leverage research in social psychology, computer vision, signal processing, and machine learning to better understand human social interactions.

As an example application, consider aid workers or medical personnel deployed in a foreign country. During the course of their deployment, these workers often have to interact with people with whom they share little in common in terms of language, customs, and culture. Reducing friction as well as increasing engagement between the workers and the populations they encounter can have an important bearing on the success of their mission. Therefore, the ability to impart to such professionals a general cross-cultural competency that would enable them to smoothly interact with the foreign populations they encounter would be extremely useful. With such an application in mind, we focus on identifying and automatically detecting predicates that facilitate social interactions irrespective of the cultural context. Since our interests lie in aspects of social interactions that reduce conflict and build trust, we focus on social predicates that support rapport: joint attention, temporal synchrony, mimicry, and coordination.

Our orientation to social sensing departs significantly from existing methods [4, 39] that focus on inferring internal or hidden mental states. Instead, inspired by a growing body of research [26, 30, 55], we focus on the process of social interaction. This research argues that social interaction is more than the meeting of two minds, with an additional emphasis on the cognitive, perceptual, and motor explanations of the joint and coordinated actions that occur as part of these interactions [37].
Our approach is guided by two key insights. The first is that, apart from inferring the mental state of the other, social interactions require individuals to attend to each other’s movements, utterances, and context in order to coordinate actions jointly with each other [46]. The second insight is that social interactions involve reciprocal acts and joint behaviors along with nested events (e.g., speech, eye gaze, gestures) at various timescales, and therefore demand adaptive and cooperative behaviors of their participants [14].

Figure 1: (a) Our capture setup, which includes a GoPro camera mounted on each participant’s chest and a Kinect mounted on a tripod. (b) An overhead view of our capture setup involving the two participants. (c) Sample data collected: The image outlined in solid red shows the image captured from the GoPro camera mounted on player A (green shirt), while the image outlined in dashed red shows the image captured from the Kinect behind player A, which is used to track the upper body of player B (red shirt). Similarly, the image outlined in solid green is the image captured from the GoPro mounted on player B, and the image outlined in dashed green is the image captured from the Kinect behind player B. (d) A view of our collected data projected into a unified coordinate frame.

Using the work of [10] as a starting point, which emphasizes the interactive and cooperative aspects of social interactions, we focus on detecting rhythmic coupling (also known as entrainment and attunement), mimicry (behavioral matching), movement simultaneity, kinematic turn-taking patterns, and other measurable features of engaged social interaction. We established that behaviors such as joint attention and entrainment are the essential predicates of social interaction (ESIPs).
With this in mind, we focus on developing computational models of social interaction that utilize multimodal sensing and temporal deep learning models to detect and recognize these ESIPs as well as discover their actionable constituents.

Over the past decade, the fields of computer vision and machine learning have made significant advances. Furthermore, with the availability of complex sensors like the Kinect, researchers are able to accurately track full human body poses [47]. This has enabled many applications, such as activity recognition [42], facial feature tracking [13], and multimodal event detection [22].

The sophistication of our problem requires a machine learning algorithm capable of jointly recognizing, correlating features, and generating multimodal data of dyadic social interactions. Discriminative models focus on maximizing the separation between classes; however, they are often uninterpretable. On the other hand, generative models focus solely on modeling distributions and are often unable to incorporate higher-level knowledge. Hybrid models address these problems by combining the advantages of discriminative and generative models: they encode higher-level knowledge as well as model the distribution from a discriminative perspective. We propose a novel hybrid model that allows us to recognize classes, correlate features, and generate social interaction data.

This paper proposes a new approach to machine learning that answers questions posed by social psychology. Our approach to social sensing is multimodal and attempts to detect the existence of features of social interaction, social interaction itself, and the qualitative and dynamic features of social interaction. We took a multimodal approach because humans must solve a variety of binding problems to effectively coordinate action. Coordination must span everything from postural sways, eye gaze, head pose, and gestures to lexical choice and verbal pitch and intonation.
Our contributions are three-fold:
• A new problem of computational modeling of essential social interaction predicates (ESIPs). Starting from a socio-psychological framework, we demonstrate the use of multimodal sensors and temporal deep learning models to uncover actionable constituents of ESIPs.
• A new dataset, the Tower Game Dataset, for analyzing social interaction predicates. The dataset consists of multimodal recordings of two players participating in a tower building game, in the process communicating and collaborating with each other. The dataset has been annotated with ESIPs and will be made publicly available. We believe that it will foster new research in the area of computational social interaction modeling.
• A novel model, the Discriminative Conditional Restricted Boltzmann Machine (DCRBM), which introduces a discriminative component to Conditional Restricted Boltzmann Machines (CRBMs). The discriminative component enables DCRBMs to directly learn classification models while retaining all the advantages of CRBMs, including their ability to generate missing data. Results on the Tower Game Dataset demonstrate that DCRBMs can effectively detect ESIPs as well as decompose ESIPs into their constituent actionable behaviors.
Paper organization: In Sec. 2 we discuss prior work. In Sec. 3 we specify our model, then explain inference and learning. In Sec. 4 we describe our dataset and demonstrate the quantitative results of our approach. In Sec. 5 we conclude.
2. RELATED WORK
Social Psychology:
The study of social interactions and their associated sociological and psychological implications has received a lot of attention from social science researchers [6, 26, 37]. Early research focused on the “Theory of Mind,” according to which individuals ascribe mental states to themselves and others [4], a line of thinking that largely inspired much of the initial work on affective computing. However, more recent work has shown that apart from inferring each other’s mental states, an important challenge for participants of a social interaction is to pragmatically sustain sequences of action that are tightly coupled to one another via multiple channels of observable information (e.g., visible kinematic information, audible speech). In other words, social interactions require dynamically coupled interpersonal motor coordination from their participants [46]. Moreover, detecting coupled behaviors such as kinematic turn taking or simultaneity in movements can help in recognizing engaged social interactions [10].
Affective Computing: refers to the study and development of systems that can automatically detect human affect [8, 39]. Affective computing has long been an active research area due to its utility in a variety of applications that require realistic human-computer interaction, such as online tutoring [11] and health screenings [17]. The goal here is to detect the overall mental or emotional state of the person based on external cues. This is typically done based on speech [3], facial expressions [29], gesture/posture [35], and multimodal cues [2, 40, 48]. There has also been work on modeling activities and interactions involving multiple people [23, 43, 44]. However, most of this work deals with short-duration task-oriented activities [23, 44] with a focus on their physical aspects. There has been recent interest in modeling interactions with a focus on the rich and complex social behaviors that they elicit, along with their affective impact on the participants [43].
Hybrid Models: consist of a generative model, which usually learns a feature representation of the low-level input, and a discriminative model for higher-level reasoning. Recent work has empirically shown that generative models which learn a rich feature representation tend to outperform discriminative models that rely solely on hand-crafted features [38]. Hybrid models can be divided into three groups: joint methods [12, 24, 25, 32], iterative methods [15, 49], and staged methods [7, 21, 28, 38, 41]. Joint methods optimize a single objective function which consists of both the generative and discriminative energies. Iterative methods consist of a generative and a discriminative model that are trained in an iterative manner, influencing each other. In staged methods, both models are trained separately, with the discriminative model being trained on representations learned by the generative model; classification is performed after projecting the samples into a fixed-dimensional space induced by the generative model.
Deep Networks: are able to learn rich features in an unsupervised manner, which is what makes deep learning very powerful. They have been successfully applied to many problems [5]. Restricted Boltzmann Machines (RBMs) form the building blocks of deep network models [20, 45]. In [20, 45], the networks are trained using the Contrastive Divergence (CD) algorithm [19], which demonstrated the ability of deep networks to capture the distributions over the features efficiently and to learn complex representations. RBMs can be stacked together to form deeper networks known as Deep Boltzmann Machines (DBMs), which capture more complex representations. Recently, temporal models based on deep networks, capable of modeling a temporally richer set of problems, have been proposed. These include Conditional RBMs (CRBMs) [54] and Temporal RBMs (TRBMs) [18, 51, 52]. CRBMs have been successfully used in both the visual and audio domains: they have been used for modeling human motion [54], tracking 3D human pose [53], and phone recognition [34]. TRBMs have been applied to transferring 2D and 3D point clouds [31], transition-based dependency parsing [16], and polyphonic music generation [27].
3. APPROACH
In this section we describe our approach. We first review similar prior work, next we define our model and formulate its inference, and finally we show how the model parameters are learned.
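The building block shared by all of the models below is conditional sampling between the visible and hidden layers. As a concrete reference point, here is a minimal numpy sketch of one Gibbs step in an RBM with real-valued visible units. The dimensions (66 visible, 50 hidden) are taken from Section 4; the random initialization and everything else are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes matching Section 4: 66 visible units (raw joint
# coordinates), 50 binary hidden units.
n_vis, n_hid = 66, 50
W = rng.normal(0, 0.01, size=(n_vis, n_hid))  # weights w_ij
a = np.zeros(n_vis)                           # visible biases
b = np.zeros(n_hid)                           # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    """p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij); returns probs and a binary sample."""
    p = sigmoid(b + v @ W)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h):
    """Real-valued visible units: v | h ~ N(a + W h, I); returns mean and a sample."""
    mean = a + h @ W.T
    return mean, mean + rng.normal(size=mean.shape)

v0 = rng.normal(size=n_vis)        # e.g. one frame of normalized joint positions
ph, h0 = sample_h_given_v(v0)
vmean, v1 = sample_v_given_h(h0)   # one reconstruction step
```

Alternating these two sampling steps yields the reconstructions used by Contrastive Divergence learning later in this section.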
Restricted Boltzmann Machines [45]: An RBM (Fig. 2a) defines a probability distribution p_R as a Gibbs distribution (1), where v is a vector of visible nodes and h is a vector of hidden nodes. E_R is the energy function and Z is the partition function, which ensures that the distribution is valid. The parameters θ_R to be learned are the biases a and b for v and h respectively, and the weights W. The RBM architecture is fully connected between layers, with no lateral connections. This architecture implies that v and h are factorial given one of the two vectors, which allows for the exact computation of p(v|h) and p(h|v).

p_R(v, h; θ_R) = exp[−E_R(v, h)] / Z(θ_R),
Z(θ_R) = Σ_{v,h} exp[−E_R(v, h)],
θ_R = {a, b, W}.   (1)

In the case of binary-valued data, v_i is defined by a logistic function; in the case of real-valued data, v_i is defined by a multivariate Gaussian distribution with unit covariance. A binary-valued hidden layer h_j is defined by a logistic function, because we want the hidden layer to be a sparse binary code (empirically shown to be better [52, 54]). (2) shows the probability distributions for v and h:

p(v_i = 1 | h) = σ(a_i + Σ_j h_j w_{ij}),   binary,
p(v_i | h) = N(a_i + Σ_j h_j w_{ij}, 1),   real,
p(h_j = 1 | v) = σ(b_j + Σ_i v_i w_{ij}),   binary.   (2)

The energy function E_R for binary v_i is defined as in (3),

E_R(v, h) = −Σ_i a_i v_i − Σ_j b_j h_j − Σ_{i,j} v_i w_{i,j} h_j,   (3)

while E_R is slightly modified to allow for real-valued v, as shown in (4):

E_R(v, h) = (1/2) Σ_i (v_i − a_i)² − Σ_j b_j h_j − Σ_{i,j} v_i w_{i,j} h_j.   (4)

Discriminative Restricted Boltzmann Machines [24]: DRBMs are a natural extension of RBMs which add a discriminative term for classification. They are based on the model in [24]. A DRBM (Fig.
2b) defines a probability distribution p_DR as a Gibbs distribution (5).

Figure 2: Deep learning models described in Section 3. (a) RBM and (c) CRBM are generative models. (b) DRBM and (d) DCRBM are discriminatively trained hybrid models.

The logistic function σ(·) for a variable x is defined as σ(x) = (1 + exp(−x))^{−1}.

p_DR(y, v, h; θ_DR) = exp[−E_DR(y, v, h)] / Z(θ_DR),
Z(θ_DR) = Σ_{y,v,h} exp[−E_DR(y, v, h)],
θ_DR = {a, b, s, W, U}.   (5)

The probability distribution over the visible layer follows the same distributions as in (2). The hidden layer h is defined as a function of the labels y and the visible nodes v. Also, a new probability distribution for the classifier is defined to relate the label y to the hidden nodes h, as in (6):

p(v_i | h) = N(a_i + Σ_j h_j w_{ij}, 1),
p(h_j = 1 | y_k, v) = σ(b_j + u_{j,k} + Σ_i v_i w_{ij}),
p(y_k | h) = exp(s_k + Σ_j u_{j,k} h_j) / Σ_{k*} exp(s_{k*} + Σ_j u_{j,k*} h_j).   (6)

The new energy function E_DR is defined as in (7):

E_DR(y, v, h) = (1/2) Σ_i (v_i − a_i)² − Σ_j b_j h_j − Σ_k s_k y_k − Σ_{i,j} v_i w_{i,j} h_j − Σ_{j,k} h_j u_{j,k} y_k.   (7)

Conditional Restricted Boltzmann Machines [54]: CRBMs are a natural extension of RBMs for modeling short-term temporal dependencies. A CRBM (Fig. 2c) is an RBM which takes into account history from the previous time instances [(t − N), ..., (t − 1)] by treating the previous time instances as additional inputs; doing so does not complicate inference. A CRBM defines a probability distribution p_C as a Gibbs distribution (8):

p_C(v_t, h_t | v_{t−N:t−1}; θ_C) = exp[−E_C(v_t, h_t | v_{t−N:t−1})] / Z(θ_C).   (8)

Discriminative Conditional Restricted Boltzmann Machines: Our DCRBM model (Fig. 2d) combines the discriminative component of the DRBM with the temporal modeling of the CRBM. For classification we use a bottom-up approach, where we maximize the posterior distribution p_DC(y_{t,k} | v_t, v_{t−N:t−1}). Learning our model is done using Contrastive Divergence (CD) [19].
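The CD learning just mentioned can be sketched for the discriminative (label-augmented) case. This is a simplified single-sample CD-1 step, omitting the temporal history weights A and B; the dimensions, learning rate, and initialization are assumptions for illustration, not the authors' exact training procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid, n_cls = 66, 50, 3   # hypothetical sizes; 3 classes = low/medium/high
W = rng.normal(0, 0.01, (n_vis, n_hid))   # visible-hidden weights w_ij
U = rng.normal(0, 0.01, (n_cls, n_hid))   # label-hidden weights u_jk
a, b, s = np.zeros(n_vis), np.zeros(n_hid), np.zeros(n_cls)
lr = 0.01

v = rng.normal(size=n_vis)   # one (synthetic) frame of skeleton features
y = np.eye(n_cls)[2]         # one-hot label, e.g. "high"

# Positive phase: hidden probabilities given data and label.
ph0 = sigmoid(b + v @ W + y @ U)
h0 = (rng.random(n_hid) < ph0).astype(float)

# Negative phase (one Gibbs step): reconstruct visibles and label, then hiddens.
v1 = a + h0 @ W.T                            # mean of N(a + W h, I)
y1 = np.exp(s + h0 @ U.T)
y1 /= y1.sum()                               # p(y | h): softmax over labels
ph1 = sigmoid(b + v1 @ W + y1 @ U)

# Parameter updates follow <.>_data - <.>_recon.
W += lr * (np.outer(v, ph0) - np.outer(v1, ph1))
U += lr * (np.outer(y, ph0) - np.outer(y1, ph1))
a += lr * (v - v1)
b += lr * (ph0 - ph1)
s += lr * (y - y1)
```

In practice this update is averaged over mini-batches and repeated for many epochs; the history terms ∆A and ∆B would add conditioning on the previous N frames.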
The update equations for the dynamically changing biases ∆c and ∆d are obtained by first updating ∆A and ∆B as in the case of the real-valued CRBM (8) and then combining them with ∆a and ∆b:

∆w_{i,j} ∝ ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_recon,
∆u_{j,k} ∝ ⟨y_k h_j⟩_data − ⟨y_k h_j⟩_recon,
∆a_i ∝ ⟨v_i⟩_data − ⟨v_i⟩_recon,
∆b_j ∝ ⟨h_j⟩_data − ⟨h_j⟩_recon,
∆s_k ∝ ⟨y_k⟩_data − ⟨y_k⟩_recon,
∆A_{k,i,t−n} ∝ v_{k,t−n} (⟨v_{i,t}⟩_data − ⟨v_{i,t}⟩_recon),
∆B_{i,j,t−m} ∝ v_{i,t−m} (⟨h_{j,t}⟩_data − ⟨h_{j,t}⟩_recon),   (15)

where ⟨·⟩_data is the expectation with respect to the data distribution and ⟨·⟩_recon is the expectation with respect to the reconstructed data. The reconstruction is generated by first sampling p(h_j = 1 | v, y) for all the hidden nodes in parallel. The visible nodes are then generated by sampling p(v_i | h) for all the visible nodes in parallel. Finally, the label nodes are generated using p(y | h) as in (12).

4. EXPERIMENTS

In this section, we first discuss existing activity recognition and affective computing datasets. Next we describe the collection and annotation of our Tower Game Dataset, which contains recordings of two players building a tower and in the process engaging in a variety of interactive behaviors. Finally, we describe our experimental results on this dataset, demonstrating the effectiveness of our DCRBM model.
Most existing activity recognition benchmarks – e.g., the Weizmann, TRECVID, PETS04, CAVIAR, IXMAS, Hollywood, Olympic Sports, and UCF-100 datasets – contain relatively simple and repetitive actions involving a single person [9]. On the other hand, group activity recognition datasets, such as the UCLA Courtyard, UT-Interactions, Collective Activity, and Volleyball datasets, lack rich social dynamics.

Other relevant datasets include the Multimodal Dyadic Behavior (MMDB) dataset [43], which focuses on analyzing dyadic social interactions between adults and children in a developmental context. This dataset was collected in a semi-structured format, where children interact with an adult examiner in a series of pre-planned games. However, due to its narrow focus on the analysis of social behaviors to diagnose developmental disorders in children, we believe it is not general enough. Another dataset is the Mimicry database [50], which focuses on studying social interactions between humans with the aim of analyzing mimicry in human-human interactions. This dataset was collected in an unstructured format where the two humans talk to each other about different subjects.

There are a number of issues with the aforementioned datasets, including: (a) unnatural, acted activities in constrained scenes; (b) limited spatial and temporal coverage; (c) poor diversity of activity classes; (d) lack of rich social interactions; (e) narrow focus on a single behavior (e.g., mimicry); and (f) unstructured or uncontrolled collection setup. Hence, we propose our new Tower Game Dataset to address the above issues.

Tower Game Dataset: The tower game is a simple game of tower building often used in social psychology to elicit different kinds of interactive behaviors from the participants. It is typically played between two people working with a small fixed number of simple toy blocks that can be stacked to form various kinds of towers.
We chose these tower games because they force the players to engage and communicate with each other in order to achieve the objectives of the game, thereby evoking behaviors such as joint attention and entrainment from the participants. The game, due to its simplicity, allows for total control over the variables of an interaction. Due to the small number of blocks involved, the number of potential moves (actions) is limited. Also, since the game involves interacting with physical objects, joint attention is mediated through concrete objects. Furthermore, only two players are involved, ensuring that we stay in the realm of dyadic interactions.

There are many different variants of the game. We settled on two variants designed to elicit maximum communication between the players, namely, (i) the architect-builder variant and (ii) the distinct-objective variant. Furthermore, in order to maximize the amount of non-verbal communication, we prohibited the participants from verbally communicating with each other.

The architect-builder variant involves one participant playing the role of the architect, who decides the kind of tower to build and how to build it. The second participant is the builder, who has control of all the building blocks and is the only one actually manipulating the blocks. The goal here is for the architect to communicate to the builder how to build the tower so that the builder can build the desired tower.

The distinct-objective variant is slightly more complicated and is designed to elicit more interaction between the players. In this variant, each player is given half of the building blocks required to build the tower. Each player is also given a particular rule, restricting the configuration of the tower being built, that they are required to enforce. An example rule could be that no two blocks of the same color may be placed such that they are touching each other.
To make the play interesting, each player knows only their own rule and is not aware of the rule given to the other player. The rules are selected at random from a small rule book. While certain combinations of rules may result in some conflict between the objectives of the two players, this is typically not the case. However, since each player needs to adhere to their rule, they will need to correct an action taken by the other if it conflicts with their rule. In the process, each player also tries to figure out the rule assigned to the other player so that the process of building the tower is more efficient. Also, when the subjects played multiple sessions of this game, the pieces used were changed and the area of the table upon which they could place blocks was reduced in size.

Capture Setup: Our sensors include a pair of Kinect cameras, which record color video and depth video and track the skeletons of the players, and a pair of GoPro cameras mounted on the chest of each player (Fig. 1(a)). External lapel microphones were attached to the GoPro cameras; however, the audio captured from them was used only for data synchronization purposes. Since the players were not allowed to verbally communicate with each other, very little speech (or paralinguistic) data exists.

In order to ensure optimal data capture from the Kinect cameras (i.e., minimal occlusions and optimal skeleton tracking), they were mounted on tripods facing one another, slightly to the right and back of each of the participants and slightly elevated, ensuring that each camera got an unobstructed view of the other participant. The overhead layout is shown in Fig. 1(b). These videos are of VGA resolution (640x480) and were captured at 30 Hz. The GoPro cameras were set to capture at full HD (1920x1080) resolution and at the widest angle available. They were placed on the harnesses rotated 90 degrees so as to capture the face of the other player as well as the blocks on the table (Fig.
1(c)).

In each session, the subjects play the game standing at either end of a small rectangular table, as shown in Fig. 1(c). The person supervising the data collection enters player information and other meta-data about the game session into a form and then starts recording. He/she then instructs the players to begin their game session. The players first manually activate the GoPro cameras to start recording and then clap their hands before starting their session. These claps were used to automatically synchronize the GoPro videos with the Kinect videos. The final dataset consists of the following data types for each game session:

1. Two Kinect videos (RGB)
2. Two depth videos (depth encoded in RGB)
3. Two GoPro videos (distortion corrected)
4. Intrinsic and extrinsic calibrations for the two Kinect cameras
5. Intrinsic calibrations and video-frame-aligned sequences of camera poses for the GoPro cameras
6. Kinect-tracked skeletons for the two participants
7. Face and head pose tracking for the two individuals from the GoPro cameras, when visible
8. Eye gaze information (3D vectors) for the two participants, whenever available
9. Object positions (2D bounding boxes, not 3D positions) and tracks for all the blocks within each gaming session

Data Annotation: Since our focus is on joint attention and entrainment, we annotated 112 videos, which were divided into 1213 10-second segments, indicating the presence or absence of these two behaviors in each segment. To annotate the videos, we developed an innovative annotation schema drawn from concepts in the social psychology literature [1, 6]. The annotation schema is a series of questions that can be used as a guideline to assist the annotators. It associates high-level social interaction predicates with more objectively perceptible measures. For example, joint attention is the shared focus of two individuals on a common subject, and it involves eye gaze (on a person and on an object) and body language.
Similarly, entrainment is the alignment in the behavior of two individuals, and it involves simultaneous movement, tempo similarity, and coordination. Each measure was rated as low, medium, or high for the entire 10-second segment. We hired six undergraduate sociology and psychology students to annotate the videos. The students were given a general introduction to the survey instrument and were then asked to code representative samples of the videos. The videos were annotated after ensuring that all the students as a group were annotating the sample videos accurately and reliably. The dataset will be released with the acceptance of this paper. We will also publish a fully detailed description of the collection, capture, and annotation.

In this section we describe the set of experiments we conducted to evaluate our proposed model.

Implementation Details: For our experiments, we relied only on the skeleton features. We use the 11 joints from the upper body of each of the two players, since the tower game almost entirely involves upper-body actions. Using the 11 joints, we extracted a set of first-order static and dynamic handcrafted skeleton features. The static features are computed per frame and consist of the relationships between all pairs of joints of a single actor, as well as the relationships between all pairs of joints across the two actors. The dynamic features are extracted per window (a set of 300 frames). In each window, we compute first- and second-order dynamics (velocities and accelerations) of each joint, as well as relative velocities and accelerations of pairs of joints per actor and across actors. The dimensionality of the static and dynamic features is 257400 D.
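As an illustration of the handcrafted features described above, here is a minimal sketch of per-frame pairwise-joint (static) features and per-window velocity/acceleration (dynamic) features. The joint counts (11 per player, both players stacked) and the 30 Hz frame rate follow the text; the specific feature definitions are a simplification of the much larger full feature set.

```python
import numpy as np

def static_features(frame):
    """Pairwise distances between all joints in one frame.
    frame: (n_joints, 3) array of 3D joint positions (both players stacked)."""
    diff = frame[:, None, :] - frame[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(frame), k=1)
    return dist[iu]                       # one distance per unordered joint pair

def dynamic_features(window, dt=1 / 30.0):
    """First- and second-order dynamics over a window.
    window: (n_frames, n_joints, 3); returns per-joint mean speed and acceleration."""
    vel = np.diff(window, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    speed = np.linalg.norm(vel, axis=-1).mean(axis=0)
    accel = np.linalg.norm(acc, axis=-1).mean(axis=0)
    return np.concatenate([speed, accel])

# 22 joints = 11 upper-body joints per player; 300 frames = one window.
window = np.random.default_rng(2).normal(size=(300, 22, 3))  # synthetic stand-in
f_static = static_features(window[0])   # 22*21/2 = 231 pair distances
f_dynamic = dynamic_features(window)    # 22 mean speeds + 22 mean accelerations
```

Accumulating such features per frame and per window, over within-actor and cross-actor joint pairs, is what drives the dimensionality into the hundreds of thousands before reduction.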
To reduce their dimensionality, we use Principal Component Analysis (PCA) (100 D) and Bag-of-Words (BoW) (100 and 300 D) [36]. We also extracted deep learning features using RBMs and CRBMs (50 dimensions). For the DRBM and DCRBM we used the raw joint locations normalized with respect to a selected origin point. We used the same dimensionality for both models: D(v) = 66, D(h) = 50. For the DCRBM we empirically evaluated history windows of different sizes, and found that a window of size n = 15 works best.

Results: For the purpose of this paper we focused on the three measures of entrainment: Coordination, Simultaneous Movement, and Tempo Similarity. As a baseline we used a multi-class Support Vector Machine with the different types of features defined above to classify each measure.

We divided our evaluation into two tasks. The first task is the Classification Task. We use the raw features of the two players, and our goal is to predict the level (strength) of each of the three measures. Each measure can be low, medium, or high; hence random classification accuracy is 33%. The data is split into two sets, a training set consisting of 70% of the instances and a test set consisting of the remaining 30%. We performed 5-fold cross validation to guarantee unbiased results. Figure 3 shows our average classification accuracy on the Tower Game Dataset using different feature and baseline combinations, as well as the results from our DCRBM model. The evaluation is done with respect to the six annotators {A_1, A_2, ..., A_6} as well as the mean annotation.

Figure 3: Average classification results on ESIPs ((a) Simultaneous Movement, (b) Coordination, (c) Tempo Similarity). It is clear that the DCRBM outperforms all other baselines on the three ESIPs.

Figure 4: Average generation error for the partial visible layer, varying the generated window size, for the three different ESIPs ((a) generating partial visible layer data, (b) average generation error).

Figure 5: Average generation error for the full visible layer, varying the generated window size, for the three different ESIPs ((a) generating full visible layer data, (b) average generation error).

We can see that the DCRBM model outperforms all the other models for each of the three measures across all annotators, thereby demonstrating its effectiveness at detecting these entrainment measures. Furthermore, the DCRBM model outperforms the PCA- and BoW-based features, which are derived from the high-dimensional handcrafted features, demonstrating its ability to learn a rich representation starting from the raw skeleton features. Finally, the performance of the DCRBM model indicates that the joint learning and inference of DCRBMs is superior to the staged approach of the SVM + CRBM model.

The second task is the Generation Task, where we are given the class label and our goal is to generate the data (i.e., the raw features) for that label. This task allows us to visualize what the classifier has learned. For generation, we initialize the model using 15 frames for each person, and then generate sequences of lengths varying from 16 to 300 frames. We measure the mean error between the ground-truth data and the generated data for each class label over 50 video instances. For this experiment, we evaluated generated sequences of varying length using the normalized mean squared error metric defined in (16):

Generation Error = ( ‖v_Generated − v_Groundtruth‖ / ‖v_Groundtruth‖ )²   (16)

Generation is done in two different settings. In the first setting, given partial visible player data (one player’s features) as well as the class label, the goal is to generate the other player’s data.
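The normalized error metric in (16) is straightforward to compute. In this small sketch the sequences are synthetic stand-ins for the raw skeleton features of a generated and a ground-truth window:

```python
import numpy as np

def generation_error(v_gen, v_gt):
    """Normalized squared error between generated and ground-truth sequences,
    as in (16). Both inputs: (n_frames, n_features) arrays."""
    return (np.linalg.norm(v_gen - v_gt) / np.linalg.norm(v_gt)) ** 2

rng = np.random.default_rng(3)
v_gt = rng.normal(size=(100, 66))                    # synthetic ground truth
v_gen = v_gt + 0.05 * rng.normal(size=v_gt.shape)    # synthetic generated sequence
err = generation_error(v_gen, v_gt)                  # small perturbation -> small error
```

Because the error is normalized by the ground-truth norm, it is comparable across sequences of different lengths and feature scales.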
Figure 4 shows our average generation error using our DCRBM model for generating the partial visible layer. In the second setting, given only the class label, the goal is to generate the entire visible layer data (i.e., the raw features for both players). Figure 5 shows our average generation error using our DCRBM model for generating the full visible layer. We can see that the generation error is relatively low (< 0.1) in all cases, except for Tempo Similarity when generating the entire visible layer data, demonstrating the effectiveness of the DCRBM model for generating data. Tempo Similarity measures the similarity in the rate of the motion of the two players, and when data from both players is missing, generating their raw features based on whether their rate of motion is similar is extremely under-constrained. Also, the error is similar across different levels (strengths) for each measure, indicating that the model is relatively stable. Finally, the error increases with the length of the generated sequence, which is expected, as the possibility of variation in the ground-truth sequences increases with length.

Therefore, the classification task shows that DCRBMs can effectively detect the constituents of entrainment (an ESIP). Similarly, the generation task shows that DCRBMs can effectively generate raw skeleton data of the actors while modeling the different strengths of each constituent (measure).

5. CONCLUSIONS AND FUTURE WORK

We presented a novel approach to computational modeling of social interactions based on modeling of essential social interaction predicates (ESIPs) such as joint attention and entrainment. Our data collection was guided by social psychological theory and methodology. We introduced a new "Tower Game" dataset consisting of audio-visual capture of dyadic interactions labeled with the ESIPs, which should spark new research in computational social interaction modeling. We proposed a novel joint Discriminative Conditional Restricted Boltzmann Machine (DCRBM) model that enabled us to uncover actionable constituents of the ESIPs in two steps. First, we trained the DCRBM model, and second, we used it to generate lower-level data corresponding to ESIPs with high accuracy.

Such a purely computational decomposition of ESIPs into actionable behavioral constituents is unprecedented and powerful, and offers rich possibilities for further research. First, we can substantially advance the understanding of ESIPs by uncovering mid-level predicates using the hidden layers of the DCRBM, thus going beyond the current low-level feature generation to a multi-level understanding of the semantics of ESIPs. Second, we would like to extend our framework to multimodal streams that also include gaze, facial behaviors, head pose, and audio, so as to get a full understanding of the actionable behaviors that make up the ESIPs. For instance, we may find out that coordinating gaze and gestural behavior is the most effective way of establishing rapport, or perhaps not. Third, such a comprehensive multimodal and semantic model would capture the overall "rules of engagement" in a social interaction. Such a model would therefore lend itself to monitoring and training applications, such as automatic assessment of the efficacy of an interaction in terms of establishment of rapport and engagement, and generation of "interaction-realistic" avatar behaviors in a virtual reality environment that convey realism in terms of interaction dynamics rather than through photo or audio realism, and thus achieve immersion and engagement, as well as more efficacious human-robot interaction.
We have thus laid the foundation of a computational approach that enables us to move from "folklore" based methods of establishing ESIPs to methods that are systematically arrived at through computational analysis of data from scientific observations.

Acknowledgments

This work is supported by DARPA W911NF-12-C-0001. The views, opinions, and/or conclusions contained in this paper are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of DARPA or the DoD.

6. REFERENCES

[1] L. Adamson et al. Rating parent-child interactions: Joint engagement, communication dynamics and shared topics in autism, Down syndrome, and typical development. JADD, 2012.
[2] M. Amer, B. Siddiquie, S. Khan, A. Divakaran, and H. Sawhney. Multimodal fusion using dynamic hybrid models. In WACV, 2014.
[3] M. Amer, B. Siddiquie, C. Richey, and A. Divakaran. Emotion detection in speech using deep networks. In ICASSP, 2014.
[4] S. Baron-Cohen. Mindblindness: An Essay on Autism and Theory of Mind. MIT Press, 1997.
[5] Y. Bengio. Learning deep architectures for AI. FTML, 2009.
[6] F. J. Bernieri. Coordinated movement and rapport in teacher-student interactions. Journal of Non-Verbal Behavior, 1988.
[7] A. Bosch, A. Zisserman, and M. Xavier. Scene classification using a hybrid generative/discriminative approach. TPAMI, 2008.
[8] R. Calvo and S. D'Mello. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 2010.
[9] J. M. Chaquet, E. J. Carmona, and A. Fernández-Caballero. A survey of video datasets for human action and activity recognition. CVIU, 117(6):633–659, 2013.
[10] H. De Jaegher, E. Di Paolo, and S. Gallagher. Can social interaction constitute social cognition? Trends in Cognitive Science, 2010.
[11] S. D'Mello, R. W. Picard, and A. Graesser. Toward an affect-sensitive AutoTutor. IEEE Intelligent Systems, 2007.
[12] G. Druck and A. McCallum. High-performance semi-supervised learning using discriminatively constrained generative models. In ICML, 2010.
[13] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. V. Gool. Random forests for real time 3D face analysis. IJCV, 2012.
[14] V. Fantasia, H. D. Jaegher, and A. Fasulo. We can work it out: An enactive look at cooperation. Frontiers in Psychology, 2014.
[15] A. Fujino, N. Ueda, and K. Saito. Semi-supervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle. TPAMI, 2008.
[16] N. Garg and J. Henderson. Temporal restricted Boltzmann machines for dependency parsing. In ACL, 2011.
[17] S. Ghosh, M. Chatterjee, and L.-P. Morency. A multimodal context-based approach for distress assessment. In ICMI, 2014.
[18] C. Hausler and A. Susemihl. Temporal autoencoding restricted Boltzmann machine. CoRR, 2012.
[19] G. E. Hinton. Training products of experts by minimizing contrastive divergence. NC, 2002.
[20] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. NC, 2006.
[21] T. Jebara et al. Probability product kernels. MLR, 2004.
[22] H. S. Koppula and A. Saxena. Learning spatio-temporal structure from RGB-D videos for human activity detection and anticipation. In ICML, 2013.
[23] T. Lan, Y. Wang, W. Yang, S. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. PAMI, 2012.
[24] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In ICML, 2008.
[25] J. Lasserre, C. Bishop, and T. Minka. Principled hybrids of generative and discriminative models. In CVPR, 2006.
[26] S. C. Levinson. On the human "interaction engine". In Roots of Human Sociality: Culture, Cognition and Interaction. Berg, 2006.
[27] N. B. Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.
[28] X. Li, T. Lee, and Y. Liu. Hybrid generative-discriminative classification using posterior divergence. In CVPR, 2011.
[29] Y.-L. Tian, T. Kanade, and J. F. Cohn. Recognizing action units for facial expression analysis. IEEE PAMI, 2001.
[30] M. M. Louwerse, R. Dale, E. G. Bard, and P. Jeuniaux. Behavior matching in multimodal communication is synchronized. Cognitive Science, 2012.
[31] M. D. Zeiler, G. W. Taylor, L. Sigal, I. Matthews, and R. Fergus. Facial expression transfer with input-output temporal restricted Boltzmann machines. In NIPS, 2011.
[32] A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: Generative/discriminative training for clustering and classification. In AAAI, 2006.
[33] R. Memisevic and G. E. Hinton. Unsupervised learning of image transformations. In CVPR, 2007.
[34] A. R. Mohamed and G. E. Hinton. Phone recognition using restricted Boltzmann machines. In ICASSP, 2009.
[35] S. Mota and R. W. Picard. Automated posture analysis for detecting learner's interest level. In CVPRW, 2003.
[36] J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3):299–318, 2008.
[37] E. D. Paolo and H. D. Jaegher. The interactive brain hypothesis. Frontiers in Human Neuroscience, 2012.
[38] A. Perina et al. Free energy score spaces: Using generative information in discriminative classifiers. TPAMI, 2012.
[39] R. W. Picard. Affective Computing. MIT Press, 1995.
[40] G. Ramirez, T. Baltrusaitis, and L. P. Morency. Modeling latent discriminative dynamic of multi-dimensional affective signals. In ACII, 2011.
[41] M. A. Ranzato et al. On deep generative models with applications to recognition. In CVPR, 2011.
[42] M. Raptis, D. Kirovski, and H. Hoppe. Real-time classification of dance gestures from skeleton animation. In SCA, 2011.
[43] J. M. Rehg et al. Decoding children's social behavior. In CVPR, 2013.
[44] M. Ryoo and J. Aggarwal. Semantic representation and recognition of continued and recursive human activities. IJCV, 2009.
[45] R. Salakhutdinov and G. E. Hinton. Reducing the dimensionality of data with neural networks. Science, 2006.
[46] N. Sebanz and G. Knoblich. Prediction in joint action: What, when, and where. Topics in Cognitive Science, 2009.
[47] J. Shotton et al. Real-time human pose recognition in parts from a single depth image. In CVPR, 2011.
[48] B. Siddiquie, S. Khan, A. Divakaran, and H. Sawhney. Affect analysis in natural human interactions using joint hidden conditional random fields. In ICME, 2013.
[49] C. Sminchisescu, A. Kanaujia, and D. Metaxas. Learning joint top-down and bottom-up processes for 3D visual inference. In CVPR, 2006.
[50] X. Sun, J. Lichtenauer, M. F. Valstar, A. Nijholt, and M. Pantic. A multimodal database for mimicry analysis. In ACII, 2011.
[51] I. Sutskever, G. Hinton, and G. Taylor. The recurrent temporal restricted Boltzmann machine. In NIPS, 2008.
[52] I. Sutskever and G. E. Hinton. Learning multilevel distributed representations for high-dimensional sequences. In AISTATS, 2007.
[53] G. W. Taylor et al. Dynamical binary latent variable models for 3D human pose tracking. In CVPR, 2010.
[54] G. W. Taylor, G. E. Hinton, and S. T. Roweis. Two distributed-state models for generating high-dimensional time series. Journal of Machine Learning Research, 2011.
[55] M. Tomasello.