Synthesizing Skeletal Motion and Physiological Signals as a Function of a Virtual Human's Actions and Emotions
Bonny Banerjee, Masoumeh Heidari Kapourchali, Murchana Baruah, Mousumi Deb, Kenneth Sakauye, Mette Olufsen
Abstract
Round-the-clock monitoring of human behavior and emotions is required in many healthcare applications; it is very expensive but can be automated using machine learning (ML) and sensor technologies. Unfortunately, the lack of infrastructure for collection and sharing of such data is a bottleneck for ML research applied to healthcare. Our goal is to circumvent this bottleneck by simulating a human body in a virtual environment. This will allow generation of potentially infinite amounts of shareable data from an individual as a function of his actions, interactions and emotions in a care facility or at home, with no risk of confidentiality breach or privacy invasion. In this paper, we develop for the first time a system consisting of computational models for synchronously synthesizing skeletal motion, electrocardiogram, blood pressure, respiration, and skin conductance signals as a function of an open-ended set of actions and emotions. Our experimental evaluations, involving user studies, benchmark datasets and comparison to findings in the literature, show that our models can generate skeletal motion and physiological signals with high fidelity. The proposed framework is modular and allows the flexibility to experiment with different models. In addition to facilitating ML research for round-the-clock monitoring at a reduced cost, the proposed framework will allow reusability of code and data, and may be used as a training tool for ML practitioners and healthcare professionals.
Keywords:
Signal synthesis, skeletal motion, electrocardiogram, skin conductance, actions, emotions.

∗ A version of this article will appear as [1].
† Corresponding author.
‡ Institute for Intelligent Systems, and Department of Electrical & Computer Engineering, University of Memphis, Memphis, TN 38152, USA. Email: {bbnerjee, mbaruah}@memphis.edu.
§ Department of Electrical & Computer Engineering, Johns Hopkins University, Baltimore, MD 21218, USA. Email: [email protected].
¶ Texas A&M AgriLife Research & Extension Center, Temple, TX 76502, USA. Email: [email protected].
‖ Department of Psychiatry, University of Tennessee Health Science Center, Memphis, TN 38163, USA. Email: [email protected].
∗∗ Department of Mathematics, North Carolina State University, Raleigh, NC 27695, USA. Email: [email protected].
Motivation:
Individuals with certain mental health conditions, such as Alzheimer's, schizophrenia, substance abuse, brain injury, trauma, and depression, manifest abnormal or agitated behaviors from time to time. They may require long-term round-the-clock monitoring, which is very expensive and strenuous, but can be automated using sensors and machine learning (ML) algorithms. However, collecting real-world data from such patients to train ML algorithms poses five challenges: (1) risk of confidentiality breach and privacy invasion (the patient and his environment have to be monitored simultaneously, as the causes behind his behavior may be hidden in his environment), (2) low-level data collection and processing issues (e.g., speech/speaker recognition requires solving the blind signal separation and cocktail party problems), (3) limited resolution and coverage of sensors (e.g., cameras may not be installed in private spaces such as bedrooms and bathrooms), (4) discomfort and lack of motivation to wear sensors round-the-clock, and (5) considerable time and expertise required for data annotation. Unsurprisingly, round-the-clock monitoring data from mentally-ill individuals is scarce. This is a significant bottleneck towards developing ML algorithms for providing quality care at a reduced cost.
Background:
One way to circumvent this bottleneck is by simulating humans in appropriate virtual environments. The humans would generate physiological signals while the environment would be equipped with ambient sensors, as in a smart home or care facility. This will allow generation of potentially infinite amounts of shareable data. Also, simulated humans will allow evaluation of artificial intelligence (AI) agents, implemented in software, in terms of fidelity to human physiological performance. Virtual humans are ubiquitous. There also exist virtual patient simulators. However, a framework to derive physical and physiological signals from a virtual human round-the-clock, as a function of his day-to-day actions and emotions, is currently missing.

An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators [2]. Such agents, implemented in software, have been reported in our prior work [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] as well as in others'.

We aim to simulate a virtual human with wearable sensors that will synchronously generate five signals: skeletal joint motions, electrocardiogram (ECG), blood pressure (BP), respiration, and skin conductance response (SCR), as a function of an open-ended set of actions and emotions (e.g., walking sadly). That is, $P(S_t) = f_\theta(P(E_t), P(A_t))$, where $E_t$ and $A_t$ are the sets of emotions and actions respectively and $S_t$ is the set of sensor signals (skeletal motion, physiological signals) at time $t$; $f$ is a mapping from the probability distributions of $\{E_t, A_t\}$ to that of $S_t$; and $\theta$ are the parameters of $f$ that can be manipulated to generate signals for different emotions and actions with inter- and intra-individual variability.

Contributions:
In this paper, we develop for the first time:

(1) Computational models for synthesizing the five signals as a function of an open-ended set of actions and emotions. Extensive evaluations, involving user studies (MTurk), benchmark datasets (DEAP and HCI-Tagging for emotions, AmI and TROIKA for actions) and findings in the literature [20], show that the models can synthesize the signals with high fidelity to real-world data. No model for synthesizing any of these signals takes into account actions and emotions jointly.

(2) A modular system that combines the above models to synchronously synthesize the physiological signals from emotionally-expressive skeletal motion. Currently, there is no system even remotely related to this.
Our system is shown in Fig. 1(a). The synthesized skeletal motions generate a rate of energy expenditure in metabolic equivalent (MET) [21], which is an input to the physiological signal models.

Figure 1: Block diagrams of the proposed system and its components: (a) overall block diagram of our system; (b) emotionally-expressive skeletal action synthesis (see Fig. 3); (c) ECG, BP signal synthesis (6 sec duration shown); (d) SCR signal synthesis (60 sec duration shown); (e) respiration signal synthesis (6 sec duration shown). "Inst." refers to instantaneous. An input to all components is time, which is not shown. The system can be tailored to each individual.
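As an illustration of this input, a minimal sketch mapping an action label to a MET rate; the numeric values below are rough approximations for illustration only, and the actual activity codes and intensities come from the compendium [21]:

```python
# Approximate MET values for a few actions; illustrative only -- the
# compendium of physical activities [21] provides the actual intensities.
MET_TABLE = {
    "sitting": 1.3,
    "standing": 1.5,
    "walking": 3.5,
    "running": 8.0,
    "jumping": 8.0,
}

def met_for_action(action: str, default: float = 1.0) -> float:
    """Return the MET rate used as input to the physiological signal models."""
    return MET_TABLE.get(action, default)
```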
The goal is to synthesize an extensive range of realistic skeletal actions, with individual variability, over a large range of emotions. Our idea is to consider an emotion as a style, and use a style transfer method to transfer an emotion from one action to another. This allows us to synthesize any action over any emotion as long as an instance of the action and the emotion are observed. See Fig. 1(b) for an overview of our model.

We use the spectral style transfer method in [22], which produces state-of-the-art results using a small and efficient model. Let $f_s[t]$ be a time-domain signal of the source style and a different action than the target motion $f[t]$. Let $f_r[t]$ be a reference signal that represents the same action as $f_s[t]$ and the same style as $f[t]$. The idea is to apply the difference between $f_s[t]$ and $f_r[t]$ to $f[t]$. Since the length, synchronization and spatial correspondences of the three signals may not match in the time domain, we formulate style transfer in the spectral domain by computing a new magnitude $R'[\omega]$ for the entire signal as $R'[\omega] = R[\omega] + s[\omega](R_s[\omega] - R_r[\omega])$, with $A'[\omega] = A[\omega]$ and $\mathrm{Im}\{f'[t]\} = 0$, over all $\omega$. Here, $s[\omega] = R[\omega] / \max_\omega(R[\omega])$, $R(\omega) = |F(\omega)|$, $A(\omega) = \angle F(\omega)$, $F(\omega) = \mathcal{F}(f(t))$, and $\mathcal{F}$ denotes the Fourier transform. The stylized magnitude $R'[\omega]$ and the unchanged phase $A'[\omega]$ constitute the final signal, whose inverse Fourier transform is the stylized time-domain output. See [22] for details.
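As a minimal sketch of this step, assuming each joint-angle channel has been resampled to a common length and is processed independently as a 1-D signal (function and variable names are ours, not from [22]):

```python
import numpy as np

def spectral_style_transfer(f, f_s, f_r):
    """Transfer the style of f_s onto f, using f_r as the reference
    (same action as f_s, same style as f). Inputs are 1-D arrays of equal length."""
    F   = np.fft.fft(f)
    F_s = np.fft.fft(f_s)
    F_r = np.fft.fft(f_r)

    R, A = np.abs(F), np.angle(F)         # magnitude and phase of the target motion
    R_s, R_r = np.abs(F_s), np.abs(F_r)   # magnitudes of source-style and reference

    s = R / np.max(R)                     # frequency-dependent weight s[w]
    R_new = R + s * (R_s - R_r)           # stylized magnitude R'[w]

    # Keep the original phase A'[w] = A[w]; take the real part (Im{f'} = 0).
    return np.real(np.fft.ifft(R_new * np.exp(1j * A)))
```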
The ECG, BP and respiration signals vary in different ways with change in emotion and action. The cardiovascular and respiratory systems in the human body are intimately linked in their operations, hence they are often modeled together (e.g., [23]). Integrated models exploit the interdependencies between the two systems to synthesize more realistic signals. Closed-loop models receive internal feedback for stabilization of the systems, which also leads to more realistic signals. The proposed ECG, BP and respiration models are closed-loop and integrated.

The dynamical model for synthesizing ECG in [24] was extended to synthesize BP and respiration signals [23]. It accurately reproduces many of the important clinical properties of these signals, such as QT dispersion, realistic beat-to-beat variability in timing and morphology, as well as pulse transit time. It considers the balance between the effects of the sympathetic and parasympathetic systems. We extend these models [24, 23] to incorporate the influence of actions and emotions.

The model [24] generates a trajectory in 3D state space with coordinates $(x, y, z)$. Quasi-periodicity of the ECG is reflected by the movement of the trajectory around an attracting limit cycle of unit radius in the $(x, y)$-plane. Each revolution on this circle corresponds to one RRI or heart beat. Inter-beat variation in the ECG is reproduced using the motion of the trajectory in the $z$-direction. P, Q, R, S and T are described by events corresponding to negative and positive extrema in the $z$-direction. Motion is described by three ODEs: $\dot{x} = \alpha x - \omega y$, $\dot{y} = \alpha y + \omega x$, $\dot{z} = -\sum_{i \in \{P,Q,R,S,T\}} a_i \Delta\theta_i \exp(-\Delta\theta_i^2 / 2b_i^2) - (z - z_0)$, where $\alpha = 1 - \sqrt{x^2 + y^2}$, $\Delta\theta_i = (\theta - \theta_i) \bmod 2\pi$, $\theta = \mathrm{atan2}(y, x)$, and $\omega = 2\pi f$ is the angular velocity of the trajectory as it moves around the limit cycle, which is related to the beat-to-beat HR. The baseline wander of the ECG is $z_0(t) = A \sin(2\pi f_2 t)$, where $A = 0.15$ mV and $f_2$ is the respiratory frequency. The signals are scaled and shifted to the appropriate range. Table 2 lists the parameter values used in our ECG model (Fig. 1(c)). Here, {P, Q, R, S, T} are the extrema points in one heartbeat of the ECG signal; the RR-interval (RRI) is the time between successive R peaks; HR stands for heart rate and HRV for heart rate variability.

Table 2: Parameter values used in our ECG (upper block) and BP (lower block) synthesis models. Time is in seconds, $\theta_i$ in radians; one column each for P, Q, R, S and T, listing time, $\theta_i$, $a_i$ and $b_i$ for ECG and $\theta_i$, $a_i$ and $b_i$ for BP.
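As a minimal sketch (not the exact implementation), the ODEs above can be integrated with simple Euler steps; the default morphology parameters below are the commonly cited values from [24], which the values in our Table 2 would replace, and the RRI series comes from the HR kinetics model described next:

```python
import numpy as np

def synthesize_ecg(rri, fs=256, A=0.15, f2=0.25,
                   theta=(-np.pi/3, -np.pi/12, 0.0, np.pi/12, np.pi/2),
                   a=(1.2, -5.0, 30.0, -7.5, 0.75),
                   b=(0.25, 0.1, 0.1, 0.1, 0.4)):
    """Euler integration of the dynamical model in [24].
    rri: beat-to-beat RR intervals in seconds; fs: sampling rate in Hz."""
    theta, a, b = map(np.asarray, (theta, a, b))
    dt = 1.0 / fs
    x, y, z = 1.0, 0.0, 0.0
    ecg, t = [], 0.0
    for rr in rri:
        omega = 2 * np.pi / rr                               # angular velocity for this beat
        for _ in range(int(round(rr * fs))):
            alpha = 1.0 - np.sqrt(x**2 + y**2)
            th = np.arctan2(y, x)
            dth = np.mod(th - theta + np.pi, 2*np.pi) - np.pi  # (theta - theta_i) wrapped to [-pi, pi)
            z0 = A * np.sin(2 * np.pi * f2 * t)                # respiratory baseline wander
            dx = alpha * x - omega * y
            dy = alpha * y + omega * x
            dz = -np.sum(a * dth * np.exp(-dth**2 / (2 * b**2))) - (z - z0)
            x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
            t += dt
            ecg.append(z)
    return np.array(ecg)   # scaled/shifted to the appropriate mV range downstream
```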
HR tends to jump between quantized states, relating to different physical and mental activities [25]. Since the model considers beat-to-beat HR dynamics, it is possible to add the effect of emotion and action to HR kinetics. Instead of using predefined Gaussians for RRI simulation as in [24], we use a mathematical model of HR kinetics to generate the RRI time-series in response to actions and emotions.

The model of HR kinetics in response to movement [26] is expressed as a system of coupled ODEs: $\dot{\mathrm{HR}} = -f_{\min} f_{\max} f_D$, $\dot{v} = I(t)$, where HR is the subject's current heart rate, $t$ denotes time, $v$ is the movement intensity (velocity), and $\lambda \in [0, 1]$ is a global parameter that defines the overall cardiovascular condition of the subject and is a function of $\mathrm{HR}_{\min}$ and gender (ref. Eqs. 6, 7 in [26]). $f_{\min}$ controls the dynamics near the minimum HR ($\mathrm{HR}_{\min}$); $\mathrm{HR}_{\min}$ is a function of $\lambda$ and gender. $f_{\max}$ controls the dynamics near the maximum HR; $\mathrm{HR}_{\max}$ can be calculated as a function of age [27]. $f_D$ is a function of the initial HR, velocity, $\lambda$, blood lactate, and $t$.

The intensity of blood lactate is modeled as [28]: $I_l = I_p - \arctan(I_p)$, where $I_p = v / v'_{\max}$ is the exercise intensity and $v'_{\max} = 40\sqrt{\lambda}$ is the maximum achievable velocity for a subject. The velocity of activities can be measured by a 3D accelerometer along three dimensions as in [28], or obtained as the mean velocity of points in a Kinect-type skeleton as in our work.

There have been efforts (e.g., [28]) to quantify mental arousal levels in daily living. HR in response to activity level is predicted using the HR kinetics model, and the additional heart rate (prediction error) is assumed to be a measure of mental arousal. However, only the effect of arousal is taken into account. Furthermore, while a power function can be fitted to the change in HR (ΔHR) [29], it is not consistent with the Kreibig table (Table 2 in [20]). So, to incorporate emotions, we use a multilayer perceptron (MLP) to learn a mapping from {valence, arousal} to {ΔHR, ΔHRV} from the data of 62 subjects obtained from the DEAP [30] and HCI [31] datasets. The outputs of the MLP are added to the HR demand calculated for the action using Karvonen's formula [32]. The final output is an instantaneous HR time-series, which can be converted to an RRI time-series (RRI = 60/HR). The RRI is applied to the original model for generating ECG. Ten hidden units are found to be optimal.
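A sketch of how the RRI series driving the ECG model could be assembled from the action and emotion inputs; `emotion_mlp`, the resting HR, and the MET-to-intensity mapping below are placeholders/assumptions rather than values specified in this paper, while Karvonen's formula and the age-predicted HR_max [27] are as cited:

```python
import numpy as np

def hr_demand_karvonen(intensity, hr_rest, hr_max):
    """Karvonen formula: target HR from a fractional exercise intensity in [0, 1]."""
    return hr_rest + intensity * (hr_max - hr_rest)

def synthesize_rri(met, valence, arousal, age, hr_rest=60.0,
                   emotion_mlp=None, met_max=12.0):
    """Per-second HR and RRI series for an action (MET series) and an emotion.
    `emotion_mlp` stands in for the trained {valence, arousal} -> {dHR, dHRV} MLP."""
    hr_max = 208.0 - 0.7 * age                      # age-predicted maximal HR [27]
    # Illustrative MET -> fractional-intensity mapping (our assumption).
    intensity = np.clip((np.asarray(met) - 1.0) / (met_max - 1.0), 0.0, 1.0)
    hr = hr_demand_karvonen(intensity, hr_rest, hr_max)
    if emotion_mlp is not None:
        d_hr, d_hrv = emotion_mlp.predict([[valence, arousal]])[0]
        hr = hr + d_hr            # d_hrv would modulate beat-to-beat variability; omitted here
    rri = 60.0 / hr               # beat-to-beat interval in seconds
    return hr, rri
```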
The dynamical model for generating ECG [24] has been extended in [23] to include blood pressure, which can be generated in isolation from the ECG. The BP waveform is generated with the same non-linear model but different parameters and angles (ref. Table 2, Fig. 1(c)). Similar to ECG, changes in angular frequency in the $(x, y)$-plane (due to the RRI time-series) drive the beat-to-beat variations in timing and shape of the BP waveform. Since the effect of emotion and action is incorporated in the RRI, there is no need to train a new model for BP. Many of the morphological changes in the ECG and BP waveforms are due to the geometrical structure of the model. Since the amplitude of the BP waveform carries information about systolic BP (SBP) and diastolic BP (DBP), the fact that BP is linearly coupled to the mean HR, and hence inversely to the mean of the RRIs, is used in scaling. The scaling values lie in a range whose endpoints are linear functions of $\overline{\mathrm{RRI}}$, the mean of the RRIs over a predefined duration of time.

The block diagram of our respiration model is shown in Fig. 1(e). The model for generating ECG [24] has been extended in [23] to generate the respiration signal. The power spectrum of the RRIs is selected a priori and used to generate the respiratory signal. The effect of action is incorporated in the respiratory rate (RR) using a function borrowed from [33]. The criterion of minimum average respiratory work rate is used to determine the optimal value of the respiratory frequency. RR is then calculated as
$$ f = \frac{-K' V_D + \sqrt{K'^2 V_D^2 + 32\,K' K'' V_D \dot{V}_A}}{K'' V_D}, $$
where $K'$ and $K''$ are constants and the dead space volume $V_D$ is a linear function of $V_A$ [33]. The alveolar ventilation $\dot{V}_A$ may be expressed as a function of different activities (exercise intensity or metabolic rate) as [34]: $\dot{V}_A = 0.868\, \dot{V}_{CO_2} / P_{A_{CO_2}}$. Here, $P_{A_{CO_2}}$ is the partial pressure of CO$_2$, which is constant during mild to moderate exercise [34], $\dot{V}_{CO_2}$ is the CO$_2$ output, and $\dot{V}_{O_2\max}$ is the maximum rate of oxygen consumption. $\dot{V}_{CO_2}$ is close to $\dot{V}_{O_2}$ [35]. Since we know how HR changes with exercise intensity (ref. the model of HR kinetics in Sec. 2.2), $\dot{V}_{O_2}$ is computed from HR and $\mathrm{HR}_{\max}$ as a fraction of $\dot{V}_{O_2\max}$ as in [36], and $\dot{V}_{O_2\max}$ is computed as a gender-specific linear function of age as in [34].

As with HR, emotion is incorporated by training an MLP that maps from {valence, arousal} to ΔRR. The optimal number of hidden units is experimentally found to be six. Body movement and other factors introduce noise and trends in respiration signals collected using a respiration belt, as in certain benchmark datasets (e.g., DEAP, HCI-Tagging). RR is calculated as the dominant frequency of a 6-sec sliding window over the de-trended and bandpass (0.1-0.9 Hz) filtered respiration signal.
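A sketch of the RR extraction used when comparing against respiration-belt recordings; only the 0.1-0.9 Hz band and the 6-s sliding window are specified above, so the filter order, window function and 1-s step are our assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt, detrend

def respiration_rate(resp, fs, win_s=6.0, lo=0.1, hi=0.9):
    """Dominant-frequency RR (breaths/min) over sliding 6-s windows."""
    b, a = butter(2, [lo, hi], btype="band", fs=fs)      # 0.1-0.9 Hz bandpass
    x = filtfilt(b, a, detrend(resp))                    # de-trend, then filter
    win = int(win_s * fs)
    rates = []
    for start in range(0, len(x) - win + 1, int(fs)):    # slide by 1 s (our choice)
        seg = x[start:start + win]
        spec = np.abs(np.fft.rfft(seg * np.hanning(win)))
        freqs = np.fft.rfftfreq(win, d=1.0 / fs)
        band = (freqs >= lo) & (freqs <= hi)
        f_dom = freqs[band][np.argmax(spec[band])]       # dominant frequency in band
        rates.append(60.0 * f_dom)                       # Hz -> breaths/min
    return np.array(rates)
```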
SCR generation models belong to three classes: curve fitting (e.g., [37]), peripheral (e.g., [38, 39]), and neural (e.g., [40, 41]). We consider a peripheral model in which the SCR is generated by convolving a response function, $h(t)$, with the sudomotor drive, $u(t, \Theta)$: $\mathrm{SCR}(t) = h(t) \otimes u(t, \Theta)$. The skin conductance response function is approximated by a Gaussian-smoothed bi-exponential function [39]: $h(t) = N(t) \otimes E(t)$, where $N(t) = \frac{1}{\sqrt{2\pi}\sigma} \exp(-(t - t_0)^2 / 2\sigma^2)$, $E(t) = E_1(t) + E_2(t)$, and $E_i(t) = \exp(-\lambda_i t)$, $i = 1, 2$. The parameters $t_0$, $\sigma$, $\lambda_1$ and $\lambda_2$ of $h(t)$ are set to fixed values following [39]. The sudomotor drive is modeled as $u(t, \Theta) = \sum_{i=1}^{n} a_i \exp(-(t - \tau_i)^2 / 2\sigma^2)$, $\Theta = \{\tau_i, a_i : i = 1, 2, \ldots, n\}$, where $n$ is the number of sudomotor nerve activity (SNA) bursts per minute, $\sigma$ is the standard deviation of the Gaussian, $a_i$ is the amplitude, and $\tau_i$ is the time of maximum firing of SNA burst $i$. While the duration and shape of SNA firing bursts are not well defined, $n$ cannot exceed 30 bursts/min [40]. We assume $n = 10$ bursts/min and $\sigma = 0.3$, as in [40].

The parameters $\Theta$ are generated as a function of emotion (valence, arousal), action (MET), age, gender and time using an MLP. The MLP weights are learned by minimizing $\sum_{m=1}^{M} \|\mathrm{SCR}_m - \mathrm{SCR}'_m\|^2$, where $\mathrm{SCR}_m$ and $\mathrm{SCR}'_m$ are the true and predicted SCR signals for the $m$th window. A window is of 60 seconds duration and slides by one second to generate a total of $M$ windows. We use the tanh activation function in the hidden layer (128 units) and sigmoid in the output layer. The model is trained end-to-end using backpropagation and the Adam optimizer with default hyperparameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$.
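A sketch of the forward SCR synthesis for one 60-s window; the response-function constants below are placeholders for the fixed values adopted from [39], and `taus`/`amps` stand for the Θ = {τ_i, a_i} produced by the MLP:

```python
import numpy as np

def scr_response_function(t, t0=3.0, sigma0=0.5, lam1=0.05, lam2=0.5):
    """Gaussian-smoothed bi-exponential response h(t) = N(t) conv E(t) [39].
    The four constants here are placeholders for the values adopted from [39]."""
    dt = t[1] - t[0]
    N = np.exp(-(t - t0)**2 / (2 * sigma0**2)) / (np.sqrt(2 * np.pi) * sigma0)
    E = np.exp(-lam1 * t) + np.exp(-lam2 * t)
    return np.convolve(N, E)[:len(t)] * dt

def sudomotor_drive(t, taus, amps, sigma=0.3):
    """u(t, Theta): one Gaussian burst per (tau_i, a_i) pair; sigma = 0.3 as in [40]."""
    u = np.zeros_like(t)
    for tau_i, a_i in zip(taus, amps):
        u += a_i * np.exp(-(t - tau_i)**2 / (2 * sigma**2))
    return u

def synthesize_scr(taus, amps, duration=60.0, fs=16.0):
    """SCR(t) = h(t) conv u(t, Theta) over one 60-s window."""
    t = np.arange(0.0, duration, 1.0 / fs)
    h = scr_response_function(t)
    u = sudomotor_drive(t, taus, amps)
    return np.convolve(u, h)[:len(t)] / fs
```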
Currently, there is no benchmark dataset containing physiological signals as a joint function of actions and emotions. Hence, our physiological signal synthesis models are evaluated for emotions assuming the sitting action, and for actions assuming the neutral emotion. In all models, the hyperparameters are estimated from the training set via cross-validation. A simple model (MLP with one hidden layer) with regularization and dropout is used to prevent overfitting. Unless otherwise stated, in all experiments, a randomly-chosen 60% of the data for each emotion or action in the dataset is used for training and the rest for testing.
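A sketch of the per-class 60/40 split described above (the seed handling is ours):

```python
import numpy as np

def split_per_class(X, y, train_frac=0.6, seed=0):
    """Randomly assign 60% of the datapoints of each emotion/action class to
    training and the rest to testing, as in our experiments."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```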
Data:
We experimented with two benchmark datasets: the CMU motion capture database [42] in BVH format and the Emotional Body Motion database [43]. The data has 38 skeleton joints with three degrees of freedom for each joint, and joint movements are recorded at 120 frames per second. We used 18 actions and five emotions, which are listed in Fig. 2. Our model was tested on the CMU mocap database.
Experimental setup:
We conducted a user study on Amazon Mechanical Turk (MTurk) to assess how well our model generates emotionally expressive skeletal actions. Our user group consisted of MTurk Master workers; they have achieved distinction by consistently demonstrating a high degree of success in performing a large number and wide range of Human Intelligence Tasks. We administered two question types to simultaneously assess action and emotion in videos of skeletal motions generated by our model. In one question type, we asked the users to choose one action out of 18, and in the second, to choose one emotion out of five, both from the same video. We created five survey forms, each containing 30 videos obtained by combining 18 actions and five emotions, and a total of 60 questions. We rejected a response if it was recorded by a user before the video ended. If such an event occurred twice for a user, we rejected all his responses. After all rejections, we collected responses from 50 users, ten for each survey form. We conducted the same user study twice: once by presenting videos of emotionally expressive skeletal actions from the benchmark dataset, and again by presenting videos of the same generated by our model.
Evaluation results:
The recognition rates from our user study for the videos generated by our model for the 18 actions and five emotions are shown in the confusion matrices in Fig. 2. The recognition rates for each action and emotion from our two user studies are very close. As expected, actions with similar skeletal motions, such as hitting vs. throwing vs. pushing, eating vs. drinking, and falling vs. jumping, are confused by users. To the best of our knowledge, no user study has simultaneously evaluated the recognition of actions and emotions from the movements of human figures such as skeletons.

In Fig. 3, skeletal motions for two sample actions, walk and run, are shown for neutral and happy/sad emotions. Consistent with the literature [44], for the sad emotion, the skeleton's head joints have slow, long movements over time, with low-amplitude elbow motion and a hiding posture. For the happy emotion, the arms are stretched out to the front.
Datasets:
The physiological signals generated by our models are evaluated as a function of emotion using the Kreibig table and two benchmark datasets.

Kreibig [20] reviewed 134 publications reporting experimental investigations of emotional effects on peripheral physiological responding. Table 2 in [20] summarizes the findings. It shows the response direction reported by the majority of studies, with at least three studies indicating the same response direction. It does not include the response magnitude, since magnitude quantification depends on the type of baseline or comparison condition, which varies greatly across studies.

DEAP [30] is a dataset for emotion analysis using physiological signals. It contains ECG, BP, respiration and SCR, recorded from 32 participants as each watched 40 excerpts of music videos of one minute duration. Participants rated each video in terms of the levels of arousal, valence, like/dislike, dominance and familiarity.

HCI Tagging [31] is a database of user responses, in terms of peripheral physiological signals and self-annotations, to multimedia content. Twenty-seven participants (11 male, 16 female, aged 19-40 years) were shown 20 fragments of movies while their physiological signals, such as ECG, respiration amplitude, SCR, and others, were recorded. For each video shown, the participants self-reported their emotive state on a discrete scale of 1-9 for valence and arousal, along with an emotional keyword.
Evaluation results:
Figure 2: Confusion matrices for 18 skeletal actions (left) and five emotions (right), as obtained from a recognition task carried out by MTurk users on synthesized (top row) and benchmark (bottom row) data. The recognition rate is stated in each cell. The true and predicted classes are along the rows and columns respectively.

The training set is chosen such that the signal direction for an emotion in that set matches the direction in the Kreibig table. Our models synthesized the four signals for the emotions in the Kreibig table, and the directions of change from the neutral emotion were compared using the following features: HR and HRV for ECG; SBP, DBP and LVET (left ventricular ejection time) for BP; RR for respiration; and SCR. See Table 1. The valence and arousal values of these emotions are obtained from [45]. A threshold of 10% of the maximum and minimum of the changes is used to obtain the directions. Four mismatches occurred out of 64 entries in Table 1, leading to an error rate of 6.25%.

To evaluate the performance of our models in predicting the signal direction (increase, decrease, no change) due to change in emotions using the benchmark datasets, the direction of the synthesized signal is compared with that of the true signal. As above, a 10% threshold is used to obtain the directions. See Table 3. The direction of signals due to emotions in the benchmark datasets often does not match that in the Kreibig table, due to significant variability between individuals. Thus, if a model is trained on the Kreibig table, its performance may be lower on the benchmark datasets, and vice versa.
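For clarity, a small sketch of the direction rule as we read it, using 10% of the range of observed changes as the threshold (this interpretation of the threshold is our assumption):

```python
import numpy as np

def change_direction(feature_value, baseline_value, observed_changes):
    """Classify a change from baseline as increase, decrease, or no change.
    `observed_changes` is the pool of changes used to set the 10% threshold."""
    delta = feature_value - baseline_value
    thr = 0.10 * (np.max(observed_changes) - np.min(observed_changes))
    if delta > thr:
        return "increase"
    if delta < -thr:
        return "decrease"
    return "no change"
```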
Datasets:
The physiological signals generated by our models are evaluated as a function of action using two benchmark datasets.

AmI [46] is a dataset for action analysis using physiological signals. It contains respiration and SCR, recorded from 10 participants. Their age and gender are not specified, so our models are simulated with their average age (27.2 years) and male gender. Participants did activities such as lying, walking, sitting, standing, kneeling, all fours (crawling), transition, leaning, cycling and running.

TROIKA [47] is a dataset for heart rate monitoring when motion artifact is strong. It includes ECG, recorded from the chest using wet ECG sensors, from 12 male participants as each walked or ran on a treadmill at six different speeds, each for 0.5-1 minute. The participants are aged 18-35 years. Each individual's age is unknown, so we used their mean age.
Figure 3: Visual comparison of key frames of walking (left) and running (right) for different emotions: (a) neutral walk, (b) neutral run, (c) sad walk, (d) happy run. The top row (a or b) and an emotion label (e.g., sad, happy) are input to our model, while the corresponding output is the bottom row (c or d).
Evaluation results:
To evaluate our ECG model using TROIKA, a signal corresponding to each subject is generated (different initial heart rates and λ values lead to different signals for each individual). The Pearson correlation between the synthesized and actual HR for each subject is significant ($p < 0.01$), with mean 0.72 (std. dev. 0.08). Our respiration and SCR models are evaluated using AmI following the same procedure as above. The Pearson correlation between the synthesized and actual RR for each of the 10 subjects is significant, with mean 0.82 (std. dev. 0.18). The same for SCR is also significant, with mean 0.57 (std. dev. 0.23). We did not find any action dataset containing BP signals.
We provide the baseline for multimodal action and emotion classification using the synthesized signals. We classify the true and synthesized signals separately using an MLP and early fusion. For DEAP, valence and arousal are classified separately into two classes (high, low) using six features extracted from the SCR, respiration and BP signals [49]: the mean and standard deviation of the raw signals, the mean of the absolute first and second differences of the raw signals, and the same for the normalized signals, from a 6-second window with 50% overlap. Each window constitutes a datapoint. For HCI, three-class (high, medium, low) classification is done using 13 features extracted from the ECG, SCR and respiration signals [49]: the six features as in DEAP, and the mean, standard deviation, median, minimum, maximum, range, variance, skewness and kurtosis, from a 6-second window with 50% overlap. We classify the AmI and synthesized signals into 10 actions using 13 features (same as HCI) extracted from the SCR, HR and RR signals, from a 6-minute window with 50% overlap. Results (ref. Table 4) show that it is slightly easier to classify the true signals than the synthesized ones for emotions, and vice versa for actions.
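A sketch of the six-feature extraction and windowing used for the DEAP baseline; the 6-s window and 50% overlap are as stated, while the normalization and early-fusion stacking details are our choices:

```python
import numpy as np

def six_features(x):
    """Mean/std of the raw signal, mean absolute first and second differences of
    the raw signal, and the same two differences of the normalized signal [49]."""
    xn = (x - np.mean(x)) / (np.std(x) + 1e-8)
    return np.array([
        np.mean(x), np.std(x),
        np.mean(np.abs(np.diff(x))), np.mean(np.abs(np.diff(x, n=2))),
        np.mean(np.abs(np.diff(xn))), np.mean(np.abs(np.diff(xn, n=2))),
    ])

def windowed_features(signals, fs, win_s=6.0, overlap=0.5):
    """Early fusion: stack the six features of each modality over sliding windows."""
    win = int(win_s * fs)
    step = max(1, int(win * (1 - overlap)))
    n = min(len(s) for s in signals)
    rows = []
    for start in range(0, n - win + 1, step):
        feats = [six_features(s[start:start + win]) for s in signals]
        rows.append(np.concatenate(feats))      # one datapoint per window
    return np.array(rows)
```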
We developed a novel system for synchronously synthesizing skeletal motion and physiological signals (ECG, BP, respiration, SCR) as a joint function of a set of actions and emotions. This set is open-ended, as the constituent models operate on {valence, arousal} values for emotion and the MET value (derived from skeletal joint movements) for action. Extensive evaluation involving user studies, benchmark datasets and comparison to findings reported in the literature shows that the proposed system can synthesize the signals with high levels of accuracy, for both actions and emotions.

Significance:
A modular system for synthesizing realistic motion and physiological signals from an individual's actions and emotions will: (1) overcome the five challenges stated at the beginning of this paper, thereby facilitating ML research for round-the-clock monitoring at a reduced cost; (2) allow reusability of code and data; (3) be useful for training ML students, data scientists, modelers, and healthcare professionals by facilitating simulation of critical situations, education, research, and in silico clinical trials; and (4) allow evaluation of AI agents in terms of fidelity to human physiological performance.
Impact:
Using this system, it will be possible for the first time to generate infinite amounts of shareable data as a function of any action and emotion in any environment, without any risk of privacy invasion. This is the first dataset for joint action and emotion recognition from motion and physiological signals. Our system will open up a new line of research in signal synthesis as a joint function of actions and emotions, as there is currently no system even remotely related to ours.

Table 1: Comparison of feature directions extracted from our synthesized ECG (HR, HRV), BP (SBP, DBP, LVET), RR and SCR signals with those in the Kreibig table. Arrows indicate increase (↑), decrease (↓), no change from baseline (-), or both increase and decrease (↑↓) between studies. Arrows in parentheses indicate a tentative response direction, based on fewer than three studies. Mismatched directions are marked red. Columns (emotions): Anger, Anxiety, Disgust, Embarrassment, Fear, Sadness, Amusement, Contentment, Happiness, Joy, Suspense, Pleasure, Relief.
HR: Kreibig table ↑ ↑ ↑ - ↑ ↑ ↓ ↑↓ ↓ ↑ ↑ (↓); Synthesized ↑ ↑ ↑ - ↑ ↑ ↓ ↑ ↓ ↑ ↑ ↓
HRV: Kreibig table ↓ ↓ ↑ (↓) ↓ ↓ ↑ ↑↓ ↓ (↑); Synthesized ↓ ↑ ↑ ↑ ↓ ↓ ↑ ↓ ↓ ↓ ↓
SBP: Kreibig table ↑ ↑ ↑ (↑) ↑ ↑ - (↓) ↑ ↑; Synthesized ↑ ↑ ↑ ↑ ↑ ↓ ↑ ↓ ↑ ↑ ↓
DBP: Kreibig table ↑ ↑ ↑ (↑) ↑ ↑ - (↓) ↑ -; Synthesized ↑ ↑ ↑ ↑ ↑ ↓ ↑ ↓ ↑ ↑ ↓
LVET: Kreibig table ↓ (↓) ↓ (↑) (-) (↓); Synthesized ↓ ↓ ↓ ↓ ↓ ↑ ↓ ↑ ↓ ↓ ↑
RR: Kreibig table ↑ ↑ ↑ ↑ ↑ ↑ ↑↓ ↑ ↑↓ (↑); Synthesized ↑ ↑ ↑ ↑ ↑ ↑ ↑ ↓ ↑ ↑ ↑
SCR: Kreibig table ↑ ↑ ↑ ↑ ↓ ↑ (...) ↑ ↓; Synthesized ↑ ↑ ↑ ↑ ↓ ↑ (...) ↑ ↓
References

[1] B. Banerjee, M. H. Kapourchali, M. Baruah, M. Deb, K. Sakauye, and M. Olufsen. Synthesizing skeletal motion and physiological signals as a function of a virtual human's actions and emotions. In SIAM International Conference on Data Mining, April-May 2021.
[2] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 4th edition, 2020.
[3] B. Chandrasekaran, U. Kurup, B. Banerjee, J. R. Josephson, and R. Winkler. An architecture for problem solving with diagrams. In Lecture Notes in AI, volume 2980, pages 151-165. Springer, 2004.
[4] B. Banerjee. Spatial problem solving for diagrammatic reasoning. PhD thesis, Dept. of Computer Science & Engineering, The Ohio State University, Columbus, 2007.
[5] B. Chandrasekaran, B. Banerjee, U. Kurup, J. R. Josephson, and R. Winkler. Diagrammatic reasoning in army situation understanding and planning: Architecture for decision support and cognitive modeling. In P. McDermott and L. Allender, editors, Advanced Decision Architectures for the Warfighter: Foundations and Technology, chapter 21, pages 379-394. 2009.
[6] B. Banerjee and B. Chandrasekaran. A constraint satisfaction framework for executing perceptions and actions in diagrammatic reasoning. Journal of Artificial Intelligence Research, 39:373-427, 2010.
[7] B. Banerjee and B. Chandrasekaran. A spatial search framework for executing perceptions and actions in diagrammatic reasoning. In Lecture Notes in AI, volume 6170, pages 144-159. Springer, Heidelberg, 2010.
[8] B. Chandrasekaran, B. Banerjee, U. Kurup, and O. Lele. Augmenting cognitive architectures to support diagrammatic imagination. Topics in Cognitive Science, 3(4):760-777, 2011.
[9] B. Banerjee and J. K. Dutta. Efficient learning from explanation of prediction errors in streaming data. In IEEE BigData Conference, pages 14-20, Santa Clara, CA, Oct. 2013.
[10] B. Banerjee and J. K. Dutta. A predictive coding framework for learning to predict changes in streaming data. In ICDM Workshops, pages 497-504, Dallas, TX, Dec. 2013.
[11] B. Banerjee and J. K. Dutta. SELP: A general-purpose framework for learning the norms from … Neurocomputing, 138:41-60, 2014.
[12] S. Najnin and B. Banerjee. Emergence of vocal developmental sequences in a predictive coding model of speech acquisition. In INTERSPEECH, pages 1113-1117, 2016.
[13] S. Najnin and B. Banerjee. A predictive coding framework for a developmental agent: Speech motor skill acquisition and speech production. Speech Communication, 92:24-41, 2017.
[14] S. Najnin and B. Banerjee. Pragmatically framed cross-situational noun learning using computational reinforcement models. Frontiers in Psychology, 9:5, 2018.
[15] M. H. Kapourchali and B. Banerjee. Multiple heads outsmart one: A computational model for distributed decision making. In CogSci, pages 1776-1781, 2018.
[16] M. H. Kapourchali and B. Banerjee. EPOC: Efficient perception via optimal communication. In AAAI, pages 4107-4114, 2020.
[17] M. H. Kapourchali and B. Banerjee. State estimation via communication for monitoring. IEEE Transactions on Emerging Topics in Computational Intelligence, 4(6):786-793, 2020.
[18] M. Baruah and B. Banerjee. A multimodal predictive agent model for human interaction generation. In CVPR Workshops, pages 1022-1023, 2020.
[19] M. Baruah and B. Banerjee. The perception-action loop in a predictive agent. In CogSci, pages 1171-1177, 2020.
[20] S. D. Kreibig. Autonomic nervous system activity in emotion: A review. Biol. Psychol., 84(3):394-421, 2010.
[21] B. E. Ainsworth et al. Compendium of physical activities: An update of activity codes and MET intensities. Med. Sci. Sports Exerc., 32(9):S498-S504, 2000.
[22] M. E. Yumer and N. J. Mitra. Spectral style transfer for human motion between independent actions. ACM T. Graphic., 35(4):137, 2016.
[23] G. D. Clifford and P. E. McSharry. A realistic coupled nonlinear artificial ECG, BP, and respiratory signal generator for assessing noise performance of biomedical signal processing algorithms. In ICNF, pages 290-301. SPIE, 2004.
[24] P. E. McSharry et al. A dynamical model for generating synthetic electrocardiogram signals. IEEE Trans. Biomed. Eng., 50(3):289-294, 2003.
[25] G. D. Clifford and P. E. McSharry. Generating 24-hour ECG, BP and respiratory signals with realistic linear and nonlinear clinical characteristics using a nonlinear model. In CinC, pages 709-712. IEEE, 2004.
[26] M. S. Zakynthinaki. Modelling heart rate kinetics. PLoS One, 10(4):e0118263, 2015.
[27] H. Tanaka et al. Age-predicted maximal heart rate revisited. J. Am. Coll. Cardiol., 37(1):153-156, 2001.
[28] Z. Yang et al. Quantifying mental arousal levels in daily living using additional heart rate. Biomed. Signal Proces., 33:368-378, 2017.
[29] A. Azarbarzin et al. Relationship between arousal intensity and heart rate response to arousal. Sleep, 37(4):645, 2014.
[30] S. Koelstra et al. DEAP: A database for emotion analysis; using physiological signals. IEEE Trans. Affective Comput., 3(1):18-31, 2012.
[31] M. Soleymani et al. A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affective Comput., 3(1):42-55, 2012.
[32] J. Karvonen and T. Vuorimaa. Heart rate and exercise intensity during sports activities. Sports Med., 5(5):303-311, 1988.
[33] F. T. Tehrani. Optimal control of respiration in exercise. In EMBS, volume 6, pages 3211-3214. IEEE, 1998.
[34] L. Martin. Pulmonary Physiology in Clinical Practice: The Essentials for Patient Care and Evaluation. Mosby, 1987.
[35] H. Noguchi et al. Breath-by-breath VCO2 and VO2 required compensation for transport delay and dynamic response. J. Appl. Physiol., 52(1):79-84, 1982.
[36] D. P. Swain et al. Target heart rates for the development of cardiorespiratory fitness. Med. Sci. Sport. Exer., 26(1):112-116, 1994.
[37] C. L. Lim et al. Decomposing skin conductance into tonic and phasic components. Int. J. Psychophysiol., 25(2):97-109, 1997.
[38] D. M. Alexander et al. Separating individual skin conductance responses in a short interstimulus-interval paradigm. J. Neurosci. Methods, 146(1):116-123, 2005.
[39] D. R. Bach et al. Modelling event-related skin conductance responses. Int. J. Psychophysiol., 75(3):349-356, 2010.
[40] D. R. Bach et al. Dynamic causal modeling of spontaneous fluctuations in skin conductance. Psychophysiol., 48(2):252-257, 2011.
[41] D. R. Bach et al. Dynamic causal modelling of anticipatory skin conductance responses. Biol. Psychol., 85(1):163-170, 2010.
[42] B. Hahne. The MotionBuilder-friendly BVH conversion release of CMU's motion capture database. https://sites.google.com/a/cgspeed.com/cgspeed/motion-capture/cmu-bvh-conversion, 2010.
[43] Max-Planck Institute for Biological Cybernetics, Tuebingen, Germany. Emotional Body Motion Database. http://ebmdb.tuebingen.mpg.de/, 2014.
[44] A. Melzer, T. Shafir, and R. P. Tsachor. How do we recognize emotion from movement? Specific motor components contribute to the recognition of each emotion. Front. Psychol., 10:1389, 2019.
[45] K. R. Scherer. What are emotions? And how can they be measured? SSI, 44(4):695-729, 2005.
[46] Department of Intelligent Systems, Jozef Stefan Institute, Slovenia. AmI repository for activity, plan and behaviour recognition from sensor data. http://dis.ijs.si/ami-repository/index.phpd=16, 2013.
[47] Z. Zhang et al. TROIKA: A general framework for heart rate monitoring using wrist-type photoplethysmographic signals during intensive physical exercise. IEEE Trans. Biomed. Eng., 62(2):522-531, 2015.
[48] M. B. H. Wiem and Z. Lachiri. Emotion classification in arousal valence model using MAHNOB-HCI database. Int. J. Adv. Comput. Sci. Appl., 8(3), 2017.
[49] S. Tripathi et al. Using deep and convolutional neural networks for accurate emotion classification on DEAP dataset. In