Having a Bad Day? Detecting the Impact of Atypical Life Events Using Wearable Sensors

Abstract

Life events can dramatically affect our psychological state and work performance. Stress, for example, has been linked to professional dissatisfaction, increased anxiety, and workplace burnout. We explore the impact of positive and negative life events on a number of psychological constructs through a multi-month longitudinal study of hospital and aerospace workers. Through causal inference, we demonstrate that positive life events increase positive affect, while negative events increase stress, anxiety and negative affect. While most events have a transient effect on psychological states, major negative events, like illness or attending a funeral, can reduce positive affect for multiple days. Next, we assess whether these events can be detected through wearable sensors, which can cheaply and unobtrusively monitor health-related factors. We show that these sensors paired with embedding-based learning models can be used “in the wild” to capture atypical life events in hundreds of workers across both datasets. Overall, our results suggest that automated interventions based on physiological sensing may be feasible to help workers regulate the negative effects of life events.

Keith Burghardt∗, Nazgol Tavabi∗, Emilio Ferrara†, Shrikanth Narayanan‡, Kristina Lerman§
USC Information Sciences Institute

1. Introduction

As organizations prepare their workforce for changing job demands, worker wellness has emerged as an important focus. Organizations see worker wellness as being central to their mission to develop a healthy and productive workforce while also maintaining optimal job performance. These goals are especially important in high-stakes jobs, such as healthcare providers working at hospitals, where job-related stress often leads to burnout and poor performance [1], [2], [3], and is one of the most costly modifiable health issues at the workplace [4]. An additional challenge faced by workers is balancing demanding jobs with equally stressful events in their personal life. Adverse events (such as attending a funeral, the death of a pet, or illness of a family member) may amplify worker stress, and potentially harm job performance. On the other hand, positive life events (such as getting a raise, getting engaged, or taking a vacation) may decrease stress and improve well-being. The ability to detect such atypical life events in a workforce can help organizations better balance tasks to reduce stress, burnout, and absenteeism and improve job performance.

Until recently, detecting such life events automatically, in real time and at scale, would have been unthinkable. However, recent advances in sensing technologies have made wearable sensors more accurate and widely available, offering opportunities for unobtrusive and continuous acquisition of diverse physiological states.

Sensor-generated data, such as heart rate and physical activity, allows for real-time, quantitative assessment of an individual’s health [5] and psychological well-being [6], [7], [8], [9]. Sensor data could also provide insights into atypical life events that individual workers experience and that could affect their psychological well-being and job performance.
However, the connection between atypical life events, individual well-being, and quantitative measurements from sensor data has not been demonstrated for such dynamic environments, especially in real-world scenarios.

In this paper, we report results of large longitudinal studies of hospital and aerospace industry workers who wore sensors and reported ecological momentary assessments (EMAs) over the course of several months. Workers also reported whether they had experienced an atypical event. The data allows us to use difference-in-difference analysis, a type of causal inference method [10], to measure the effect of atypical events (either positive or negative life events) on individual psychological states and well-being. We find that negative life events increase self-reported stress, anxiety, and negative affect by 10-20% or more, while decreasing positive affect over multiple days. Positive life events, meanwhile, have little effect on stress, anxiety, and negative affect, but boost positive affect on the day of the event. Negative atypical events have a greater impact on workers’ psychological states than positive events, in line with previous findings [11].

In addition to measuring the effects of atypical events, we show that it is possible to detect these events from a non-invasive wristband sensor. We discover that, although changes in individual psychological constructs are difficult to detect, atypical events are amenable to detection because they jointly affect several constructs. We propose a method that learns a representation of multi-modal physiological signals from sensors by embedding them in a lower-dimensional space. The embedding provides features for classifying when atypical events occur.
Detection results are improved over baseline F1 scores by up to nine times, and achieve ROC-AUC of between 0.60 and 0.66.

Physiological data from wearable sensors allows for studying individual response to atypical life events in the wild, creating opportunities for testing psychological theory about affect and experience. In addition, sensor data opens the possibility of passive monitoring to detect when individuals have stressful or negative experiences. While our initial results show that models can be further improved in the future, the ability to detect such experiences can help organizations improve the health and well-being of their workforce and reduce their detrimental effects on vulnerable populations.

2. Related Work

In this paper, we explore the effect of acute positive and negative events on human behavior, and how to detect these events with wearable sensors.

We find that negative events increase stress, anxiety, and negative affect over the course of one to two days. Acute stress, in which stress increases over short periods [12], can increase cardiovascular risk and depression [13], and can negatively impact job performance [1], [2], [3]. Increased anxiety is associated with reduction in fertility [14], while negative affect is associated with higher sensitivity to pain [15]. In this paper, we find that positive events increase positive affect. Higher positive affect is associated with broadened attention and improved creative problem solving [16], [17], and preferring future utility over present [18], although high levels may be associated with aversion to change [17].

There exists extensive research on how sensors can be used to detect patterns and changes in human behavior [8], [19], including psychological constructs such as stress, anxiety, and affect (cf. the literature review of wearable sensors [20]). For example, they can detect if workers [21] or students [22] are stressed, even at a minute-by-minute level (cf. cited literature in [23]). Recent research has also explored detecting the degree to which a subject is stressed at shorter [6], [9], [23], [24], [25], [26], and longer [27], [28] timescales. Papers on stress typically induce stress externally [6], [21], [29], [30], but there are also papers on detecting natural stresses [9], [23], [27], [28]. While most related works have explored stress detection, there is some literature on detecting bio-markers associated with other psychological constructs. This includes anxiety [31], positive and negative affect [32], [33], and depression [34]. In addition, recent literature has explored predicting (instead of detecting) multiple constructs using multi-task learning [35].
Notably, however, researchers needed access to data on social interactions, exercise, drug use, and sensors of several modalities, which may be unavailable in many situations. Finally, detecting acute positive and negative events is similar to research on using sensors for anomaly detection [20]. In contrast to previous literature, however, we detect events that affect psychological constructs rather than physiological constructs such as heart rate or sleep. In order to detect bio-marker patterns, sensors used in previous research measure a number of modalities including phone usage [22], skin conductance [6], [9], [21], [36], heart rate [6], [7], [9], [21], [24], [30], or breathing rate [6] features.

Past work has suffered from two significant limitations. First, research has either focused on short time intervals (up to two weeks) and very small sample sizes (on the order of tens of subjects) [6], [9], [23], [24], [25], [26], or collected data sporadically (once every several months) [27], [28]. Second, previous literature has typically detected very short-term stresses (e.g., stresses that affect people at the minute level [6], [9], [23], [24], [25], [26]) rather than individual stressful events that impact someone over the longer term, such as funerals. Our work differs from these previous studies through continuous evaluation of hundreds of subjects over several weeks, allowing us to robustly uncover effects in diverse populations. Moreover, we uncover patterns associated with unusually good or bad events that can affect multiple psychological constructs over multiple days.

3. Data

The data used in this paper comes from two studies aimed at understanding the relationship between individual variables, job performance, and wellness [37], which was part of the IARPA MOSAIC program. The study protocol was reviewed by the USC Institutional Review Board (HS-17-00876 - TILES). Although the studies were conducted at different locations and recruited different populations, they had similar longitudinal design and collected similar data. The hospital workforce data was collected during a 10-week-long study that recruited 212 hospital workers. Participants were enrolled over the course of three “study waves,” each with different start dates (03/05/18, 04/09/18 and 05/05/18 for waves 1, 2 and 3 respectively). The aerospace workforce data was collected from 264 subjects from 01/08/18 to 04/06/18.

In both datasets, subjects’ bio-behavioral data was captured via wearable devices. The studies also administered daily surveys to collect self-assessments of individual participant stress, sleep, job performance, organizational behavior, and other personality constructs. The same survey questions were asked in both studies. We focus on positive affect, negative affect, anxiety and stress, which we discuss in greater detail in the psychological construct section.

In this paper, we use data collected from Fitbit wristbands. Although other sensor data was collected during each study, including location data and audio or environmental features, we focus on this modality since it was common to both studies, and is the only sensor we have access to in the aerospace dataset. The Fitbit wristband captures dynamic heart rate and step count. It also offers a summary report of duration and quality of sleep for each day. Data was collected voluntarily by each subject and recorded at sub-minute levels. It was then uploaded to servers, where we aggregate the data. Table 1 summarizes the modalities captured by the Fitbit Charge 2 sensor. For the embedding approach, we only used the signals extracted from Fitbit (heart rate and steps) but for the aggregated method we also included the static summary features.

| Fitbit modality | Features |
| --- | --- |
| Signals (time series) | Heart rate (PPG); number of steps |
| Summary features (static) | Time in personalized heart rate ranges (“fat burn,” “cardio,” or “out of range”); daily minutes in bed; daily minutes asleep; daily sleep efficiency; sleep start & end time |

Table 1. Extracted features from sensors.

Figure 1. Statistics of compliance and frequency of atypical events. The left figure shows the number of days of data we have for each participant. The right figure shows the ratio of days in which there is an atypical event as a function of subject participation. Error bars are 95% confidence intervals of the mean.

Study participants exhibited varying compliance rates. As a result, collected data varied in the amount (hours per day) and length (number of days) across different participants. Figure 1 shows the distribution of the data collected in these two datasets and the average ratio of atypical events for participants as a function of their compliance rate. We find in the left panel of Fig. 1 that most participants had several days of data, but a minority had only a few days of data over the entire study period. Pre-processing was therefore as follows. We only used data from participants who had at least two days’ worth of data and one day marked as an atypical day. This brings the hospital data down to 8,155 days for 150 participants and the aerospace data to 10,057 days for 207 participants. We find in the right panel of Fig. 1 that removal of these low-compliance subjects does not appear to significantly bias the data. Instead, the frequency of atypical events is relatively independent of the compliance rate.

The amount of data available from each day also varies and depends on the amount of time the participant wore the wristband. Although most participants (90% in the hospital dataset and 89% in the aerospace dataset) had the wristband on for the full day (24 hours), there are instances where only five hours of data could be collected in a day.

The data used for this study includes daily self-assessments of psychological states provided by subjects over the course of the study. These constructs include self-assessments of job performance (Individual Task Proficiency (ITP) [38], In-Role Behaviors [39]), Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism [40]), alcohol [41] and tobacco use [42], sleep quality [43], stress, anxiety, positive and negative affect. Stress and anxiety were measured by responses to questions that read, “Overall, how would you rate your current level of stress?” and “Please select the response that shows how anxious you feel at the moment” respectively and have a range of 1–5. Positive and negative affect were measured based on 10 questions from [44] (five questions for measuring positive affect and five for measuring negative affect) and have a range of 5–25. We focus on positive and negative affect, stress, and anxiety in this study because these were found to consistently change during an atypical event.

In addition to these constructs, subjects were also asked if they had experienced, or anticipated experiencing, an atypical event: “Have any atypical events happened today or are expected to happen?”. If subjects replied yes, they had the option to add free-form text describing the atypical event. In the hospital data, there are 8,155 days of data, of which 958 days had atypical events (11.7%). The aerospace data has 10,057 days of data, of which 1,503 were considered atypical (14.9%).

We have access to the free-form text in the hospital data, which was filled out by participants in 87% of all atypical events. Surprisingly, the severity of the event could not be easily gleaned from sentiment analysis tools, such as VADER [45] or LIWC [46], as these tools gave neutral sentiment to text samples that were clearly negative. For example, text similar to “at a funeral” is given zero sentiment in VADER. We therefore applied a protocol, using human annotators, to categorize text as major negative events (such as deaths or injuries of loved ones), minor negative events (such as being stuck in traffic), or positive events (such as promotions). Major negative events were classified as negative life events such as major medical issues and funerals while minor negative events were daily hassles, sickness, or negative work events. Positive events were awards, promotions, weddings, and other events that were beneficial. Of all categorized atypical events, 210 (24%) were positive, 626 (71%) were minor negative events, and 39 (4.5%) were major negative events.

Figure 2. Overview of the modeling framework. Sensor data collected from participants A and B (left two panels) is fed into a non-parametric HMM, which outputs a state sequence per participant (middle panel), where states are shared among participants. Output from the HMM is used to learn embeddings for each day of each participant (right panel). The daily embedding (colored circles) and the average embedding for each participant (hashed circles) are used as features to detect an atypical day.

4. Methods

The text descriptions of many atypical events in the hospital data mention sudden and unexpected events, such as an injured family member or unusually heavy traffic. We can therefore conjecture that atypical events create an as-if random assignment of any given subject over time. This is not always true, as in the case of subjects who report being on vacation multiple days, or are at different stages of burying a loved one. These are, however, relatively rare instances, with sequential events occurring in less than 15% of atypical events in either dataset, and exclusion of this data does not significantly affect results. To determine the effect of atypical events on subjects, we use a difference-in-difference approach to causal inference. Specifically, we look at all subjects who report an atypical event and then look at a subset who report stress, anxiety, negative affect, or positive affect the prior day. This is usually the majority of all events (83%). We finally take the difference in their self-reported constructs from the day before the event. If subjects report construct values after the event (which is usually the case) we report the difference between these values and the day prior to an atypical event. We contrast these measurements with a null model, in which we find subjects who did not report an atypical event on the same days that other subjects reported an atypical event, and find the change in their construct values from the prior day. This null model shows very little change in constructs over consecutive days, in agreement with expectation. The difference between construct values associated with the event and the null model is the average treatment effect (ATE).
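The difference-in-difference comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the record layout (a dictionary keyed by subject and day), the function names, and the toy stress scores are all hypothetical.

```python
from statistics import mean

def day_over_day_changes(records, pairs):
    """Change in a construct from the prior day for each (subject, day) pair.

    records: dict mapping (subject, day) -> construct value (hypothetical layout).
    Pairs without a report on the day itself or on the prior day are skipped.
    """
    changes = []
    for subject, day in pairs:
        today = records.get((subject, day))
        prior = records.get((subject, day - 1))
        if today is not None and prior is not None:
            changes.append(today - prior)
    return changes

def average_treatment_effect(records, atypical):
    """Mean change on atypical days minus mean change under the null model."""
    treated = day_over_day_changes(records, atypical)
    # Null model: subjects without an atypical event on days when others had one.
    event_dates = {day for _, day in atypical}
    subjects = {subject for subject, _ in records}
    atypical_set = set(atypical)
    null_pairs = [(s, d) for s in subjects for d in event_dates
                  if (s, d) not in atypical_set]
    null = day_over_day_changes(records, null_pairs)
    return mean(treated) - mean(null)

# Toy stress scores (invented): subject "a" reports an atypical event on day 2.
records = {
    ("a", 1): 2, ("a", 2): 4,
    ("b", 1): 3, ("b", 2): 3,
    ("c", 1): 2, ("c", 2): 3,
}
ate = average_treatment_effect(records, atypical=[("a", 2)])
print(ate)
```

In this toy case the treated change is 2, the null changes are 0 and 1, and the estimate is their difference in means.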

We detect atypical events by embedding individuals’ physiological time series data into a vector space, using the framework proposed in [47]. We then train models to identify where in this space atypical events happen unexpectedly often. Namely, the time series is modeled as a hidden Markov model, where each state corresponds to an automatically inferred activity (e.g., exercising, working, or resting). The model effectively distinguishes activities people do during atypical days from activities during “normal” days.

In more detail, each subject’s day of physiological data is interpreted as a multivariate time series, as described in Fig. 2, left two panels. The time series are transformed into sequences of hidden Markov states using a Beta Process Auto Regressive HMM (BP-AR-HMM) [48] (Fig. 2, center panel). Unlike classical hidden Markov models, BP-AR-HMM is flexible in that it allows the number of hidden states to be inferred from the data. Based on these datasets, the model found 73 states in the hospital data, and 130 states in the aerospace dataset, i.e., we find 73–130 “activities” that subjects perform, although they may only do a small fraction of these activities in a day. In addition, these states are shared among all subjects, rather than specific to one subject. This makes it feasible to embed data across different subjects and across different days. After the states are learned, we calculate the stationary distribution of time spent in each state to embed the daily data into the activity space (Fig. 2, right panel). This can be easily calculated from the HMM transition matrix by finding the eigenvector corresponding to the largest eigenvalue of the matrix, which equals one for a stochastic matrix.
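To make the last step concrete, the sketch below computes a stationary distribution by power iteration, which converges to the left eigenvector with eigenvalue one. It assumes a row-stochastic transition matrix for an ergodic chain; the three-state matrix and the function name are invented for illustration, and the BP-AR-HMM inference itself is not shown.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Return pi with pi = pi P for a row-stochastic matrix P (list of rows).

    Power iteration from the uniform distribution; assumes the chain is
    irreducible and aperiodic so that the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        nxt = [x / total for x in nxt]  # renormalize against rounding drift
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical transition matrix over three inferred "activities".
P = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]
pi = stationary_distribution(P)
print(pi)  # fraction of time spent in each state; used here as the embedding
```

With 73 or 130 shared states, the resulting vector would have one entry per state, most of them near zero on any given day.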

5. Results

How do atypical events affect individuals’ psychological states? We apply a difference-in-difference approach to measure the impact of atypical events on self-reported psychological constructs. We first look at the effect of atypical events across all our datasets, as shown in Fig. 3. Atypical events, on average, have a relatively small effect on positive affect the day of the event (difference from null = 0 . , . ; p-value = 0 . , . , for hospital and aerospace data, respectively). We notice a decrease in positive affect from the day of the event to the day after the event (difference = 0 . , . ; p-values = 0 . , . for aerospace and hospital data, respectively). On the other hand, there is a substantial increase in negative affect, stress, and anxiety (p-values < . ), although changes are smaller in the aerospace dataset.

[Figure 3 plot: four panels (a)–(d) with y-axes ΔPosAffect, ΔNegAffect, ΔStress, and ΔAnxiety; legend: Aerospace, Hospital, Null model]

Figure 3. Effect of atypical events among the datasets studied. (a) Positive affect, (b) negative affect, (c) stress, and (d) anxiety. Green squares show the aerospace dataset, red diamonds show the hospital dataset, and gray circles are the null models, in which we collect sequential data from subjects who do not experience an atypical event at day zero.

[Figure 4 plot: four panels (a)–(d) with x-axis Days Between Event and y-axes ΔPosAffect, ΔNegAffect, ΔStress, and ΔAnxiety; legend: Positive events, Minor negative events, Major negative events, Null model]

Figure 4. Effect of atypical events versus severity of event. (a) Positive affect, (b) negative affect, (c) stress, and (d) anxiety. Green squares are positive events, white triangles are minor negative events, red diamonds are major negative events, and gray circles are the null models. In the null models we collect sequential data from subjects who do not experience an atypical event at day zero.

The free-text descriptions that subjects provided about atypical events they experienced (only available in the hospital data) confirm these results. Most atypical events are negative, such as a fight with the spouse, traffic, or deaths. In a minority of cases, however, subjects report positive events, such as passing a test or a promotion. For the hospital data, we categorized atypical events as positive, minor negative, or major negative events, and determined the relative effect each has on subjects, as shown in Fig. 4. We find that, as expected, positive events increase positive affect (p-value = 0 . ), but have no statistically significant effect on negative affect, stress, or anxiety (p-value ≥ . ). Minor negative events do not substantially change positive affect on the day of the event (difference from null = − . , p-value = 0 . ), and have a small effect on positive affect the day after the event (difference from null = − . , p-value = 0 . ). On the other hand, they significantly increase negative affect, anxiety, and stress (p-value < . ). Finally, major negative events decrease positive affect both on the day of the event and on the day after the event (p-value = 0 . , . respectively). These results point to the strong diversity in atypical events, and support the idea that “bad is stronger than good” [11]: adverse, or negative, events have a stronger effect on people than positive events, and are reported as atypical events more often.

We evaluate performance of three classification tasks using sensor data: (1) detecting whether an atypical event occurred on that day; (2) detecting whether subjects experienced a good day; or (3) detecting whether subjects experienced a bad day. For (2) and (3) the classification task was “1” if subjects experienced a good or bad day, respectively, and “0” otherwise. Hence we simplify all tasks into a binary detection task. We emphasize that these last two tasks are only available for the hospital data. We use ten-fold cross validation.
We choose to split datapoints at random, but in the Limitations section, we alternatively split users into training and testing sets to approximate a cold-start scenario where, in many cases, researchers train models on one cohort of subjects and classify data on another cohort [49]. The challenge of the latter detection task is that we need to classify if a subject has a good or bad day despite not training on any previous data from that subject. Performance metrics are averaged across all held-out folds.

We use three performance metrics for evaluation. First, we use the area under the receiver operating characteristic curve (ROC-AUC), which quantifies how well a model trades off true positives against false positives across decision thresholds. Random detection has an ROC-AUC of 0.5. Next, we use the F1 score, which is the harmonic mean of precision and recall. Higher F1 scores correspond to higher recall and precision of our estimates. Finally, we use precision itself as a performance metric because we want to determine the fraction of days labeled atypical (i.e., a “good” or “bad” day) that are truly atypical. Low precision would indicate many false positives.
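As a quick illustration of the three metrics, the following sketch computes them from scratch on a small invented example. The labels, scores, 0.5 threshold, and function names are assumptions for illustration only; ROC-AUC is computed through its pairwise-ranking interpretation rather than by tracing the curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = atypical day)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half; this equals the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented labels and classifier scores for eight days.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.35]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(precision, recall, f1, auc)
```

Note that precision and F1 depend on the chosen threshold, while ROC-AUC depends only on how the scores rank the days.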

We compare detection quality for two types of models: models using features from statistics of aggregated data, and models using features based on time series embeddings.

Aggregated

We create several features based on aggregated statistics of signals and static modalities, listed in Table 1. These statistics included the sum, mean, median, variance, kurtosis, and skewness of signals the day before, the day of, and the day after each day. Missing data is substituted with the mean value of that statistic in the training or testing set. Statistics before and after each day were created because some physiological features, such as mean heart rate, might change before an atypical event, and some may change after, such as sleep duration. We use Minimum Redundancy Maximum Relevance on each dataset to select the best features (23 and 26 for the aerospace and hospital data respectively) [50]. Alternative feature selection approaches using random forest feature importance produced poorer results. Typical features in the hospital data relate to sleep (for example, the top feature was tomorrow’s minutes in bed). Typical features in the aerospace dataset tend to relate to heart rate (the top feature was the number of minutes in the “fat burn” heart rate zone in the past day).
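A rough sketch of how such aggregated features might be assembled is shown below. It is an assumption-laden illustration: the feature names, the use of population variance and excess kurtosis, and the toy heart-rate values are choices made here, since the text does not specify these conventions. Imputation and feature selection are not shown.

```python
from statistics import mean, median, pvariance

def summary_stats(signal):
    """Sum, mean, median, variance, skewness, and excess kurtosis of a signal."""
    m = mean(signal)
    var = pvariance(signal, m)  # population variance (an assumption)
    sd = var ** 0.5
    n = len(signal)
    # Standardized central moments; set to 0.0 for a constant signal.
    skew = sum((x - m) ** 3 for x in signal) / (n * sd ** 3) if sd else 0.0
    kurt = sum((x - m) ** 4 for x in signal) / (n * sd ** 4) - 3.0 if sd else 0.0
    return {"sum": sum(signal), "mean": m, "median": median(signal),
            "variance": var, "skewness": skew, "kurtosis": kurt}

def day_features(days, t):
    """Statistics for the day before, the day of, and the day after day t.

    days: list of per-day signals; assumes 1 <= t <= len(days) - 2.
    """
    features = {}
    for offset, label in ((-1, "prev"), (0, "day"), (1, "next")):
        for name, value in summary_stats(days[t + offset]).items():
            features[f"{label}_{name}"] = value
    return features

# Invented heart-rate samples for three consecutive days.
heart_rate = [[60, 62, 64], [70, 90, 80], [65, 65, 65]]
feats = day_features(heart_rate, 1)
print(len(feats))  # six statistics for each of three days
```

Each signal contributes eighteen candidate features per day under this layout, which is why a selection step such as the one described above is useful.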

Embedding

When creating features from the HMM embedding, we used only the signal modalities from Table 1; the summary features were not used. Representations from HMMs were learned for the day of, and the day after, each day. We also include the centroid of embeddings for each person in the training data as features, to control for subject-specific differences in behavior. We did not use any additional feature selection because embedding naturally reduces the feature dimensions. Imputation is also not needed because the HMM learns states based on the amount of data available for that day.

We use several candidate classification methods to detect whether a subject experiences an atypical event. For aggregate features, we compared logistic regression [51], random forest [52], support vector machines (SVMs) [53], extra trees [54], AdaBoost [55], and multi-layered perceptrons (MLPs) [56]. When training aggregate feature models, we make sure to downsample the majority class (no atypical event) such that the number of datapoints in each class is equal. Using the raw data, or upsampling the minority class, was found to produce worse results. Using all three performance metrics and ten-fold cross validation, we find atypical events in the hospital dataset are best modeled with random forests, while the aerospace workforce dataset is best modeled with logistic regression. In comparison, positive events are best modeled with random forests but negative events are best modeled with extra trees.

Model hyperparameters for these models are chosen as follows. For random forest and extra trees, we used 100 trees and a max depth of 10. For AdaBoost, we let the number of estimators be 100. For all other hyperparameters, we use default parameters in the Python library sklearn 0.21.3 for Python 3.7. For MLPs, we use three dense layers where the number of nodes in each layer equals the number of features in the model. For the model with embedding features, we used SVM, the same classifier used in the original work [47]. In all cases, hyperparameters were chosen as reasonable baselines; therefore, additional improvements in model quality could be obtained with further tuning.
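The class-balancing step mentioned above, downsampling the majority class until both classes have the same number of datapoints, could look like the following sketch. The function name, the fixed seed, and the toy data are hypothetical, and the text does not say at which point in the cross-validation loop the authors applied it.

```python
import random

def downsample_majority(X, y, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row plus an equally sized random majority sample.
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]

# Invented data: three atypical days (1) among ten.
X = [[i] for i in range(10)]
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
X_bal, y_bal = downsample_majority(X, y)
print(y_bal.count(0), y_bal.count(1))
```

Balancing changes the class prior seen by the classifier, so scores from a model trained this way should be evaluated on data with the natural class ratio.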

We demonstrate our model results in Table 2. First, we find that the HMM embedding-based model outperforms alternative models. The ROC-AUC for the HMM-based model is 0.60 for the aerospace workforce and 0.66 for the hospital workforce. Positive and negative events similarly have an ROC-AUC of 0.61-0.63. F1 and precision exceed random baselines by factors of two to nine. The seemingly low F1 and precision are due to the rarity of atypical events, especially for positive events, which only happen on 3% of days, and negative events, which only happen on 8% of all days. A detection therefore represents a “warning sign” that a worker may have had a negative event that day. Overall, detecting atypical events shows promise.
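One way to see why rarity caps the random baseline is the short derivation below. It is an interpretation offered under assumptions, not the authors' stated procedure: suppose a random detector flags each day independently with probability q, and a fraction π of days are truly atypical (both symbols are introduced here for illustration).

```latex
\begin{align}
\text{precision} &\approx \pi, \qquad \text{recall} \approx q, \\
F_1 &= \frac{2\,\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
      \approx \frac{2\pi q}{\pi + q}, \\
q = \pi \;&\Rightarrow\; F_1 \approx \pi .
\end{align}
```

Under this reading, a detector that flags days at the base rate has F1 and precision both roughly equal to the prevalence, which is consistent with the event frequencies quoted above (about 3% for positive events and 8% for negative events) and explains why even a several-fold improvement leaves the absolute scores low.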

6. Discussion

Our results highlight how unusual but impactful events strongly affect workers. Interestingly, however, atypical events are more often negative than positive. For example, 8% of all days among hospital workers contained negative events, while only 3% contained positive events. The relative adversity and frequency of negative events over positive events in our data agrees with previous findings that negative events are often more impactful [11]. Moreover, we find that significant events cannot be viewed as affecting a single psychological construct; they can affect multiple constructs at once. In the same way that multi-task learning can improve predictions in AI [57], we expect that atypical event detection could be useful for detecting anxiety, stress, and other psychological constructs simultaneously.

Our results also point to important future work. First, while the performance of our method does not allow it to be used in practice, it can be considered a significant starting point. Other sensor modalities can be added to better infer when or if an atypical event occurs. These include breathing, skin conductance, or phone usage sensors. Next, personalizing our methods to individuals has the potential to substantially improve detection performance [35]. We find, for example, some subjects experience very few atypical events while others experience atypical events at triple the average rate. Next, we can extend our results by analyzing how similar good or bad events affect people differently. Some subjects may be able to cope with negative events better than others.

| Dataset | Construct | Model | ROC-AUC | F1 | Precision |
| --- | --- | --- | --- | --- | --- |
| Hospital workforce | Atypical Event | Random | 0.50 | 0.12 | 0.12 |
| | | Aggregated | 0.57 | 0.24 | … |
| | | Embedding | 0.66 | 0.37 | … |
| | Good Event | Random | 0.50 | 0.03 | 0.03 |
| | | Aggregated | … | 0.08 | … |
| | | Embedding | … | 0.27 | … |
| | Bad Event | Random | 0.50 | 0.08 | 0.08 |
| | | Aggregated | 0.57 | 0.17 | … |
| | | Embedding | 0.61 | 0.27 | … |
| Aerospace workforce | Atypical Event | Random | 0.50 | 0.15 | 0.15 |
| | | Aggregated | … | 0.31 | … |
| | | Embedding | 0.60 | 0.32 | … |

Table 2. Performance of atypical event detection from sensors in the hospital and aerospace workforce datasets with randomly sampled cross-validation. For all datasets, we can classify whether an event is atypical. For hospital workers, we can also classify whether an event is “good” (instead of any other type of event), or “bad.” Percentages are above baseline (e.g., if classification is no better than random, the percentage would be …

7. Limitations

There are, however, a number of limitations we should discuss. Some stem from the data, while others are broader model limitations that have implications for model design.

First, data was only collected once a day, and we were unable to determine when atypical events occurred during the day. This made the detection problem much harder because heart rates or step counts can change for a number of separate reasons, and the specific signal that would indicate an atypical event cannot be isolated in our data. Next, we are limited in the modalities we had access to, and therefore in the physiological behavior we could measure. For example, stress might be more accurately measured with the help of skin monitors [6], [9], [21], [36].

Finally, our results are based on cross validation, a standard method in which datapoints are divided into training and testing splits. This is similar to previous work on detecting stress, in which training and testing were performed on the same users [6], [23], [25], [26]. It is feasible, however, that a model may be trained on one dataset and tested on another. To approximate this scenario, we instead split users, rather than days of data, into training and testing folds. We show our model performance results in Table 3. Atypical events can be detected 91–220% above baselines based on F1 score, but results are more modest than in the Results section, with a reduction in ROC-AUC from 0.66 to 0.58 for hospital atypical events. These results are similar to those of other recent papers, which split subjects into training and testing sets and found relatively poor model performance [9], [24]. On one hand, this means that these models will not necessarily work out of the box; they need to be personalized to users. On the other hand, once they are tuned to the cohort, the performance is respectable. Human heterogeneity therefore makes physiologically-based psychological modeling especially difficult.
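The difference between the two evaluation schemes can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the fold count, and the seeding are all assumptions. The first generator shuffles individual user-days into folds, so the same user can appear in both training and testing; the second keeps every day of a given user inside a single fold.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple

Fold = Tuple[List[int], List[int]]  # (train row indices, test row indices)


def day_level_folds(n_rows: int, k: int, seed: int = 0) -> Iterator[Fold]:
    """Randomly sampled cross-validation: rows (user-days) are shuffled into k folds."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    for f in range(k):
        test = rows[f::k]
        held_out = set(test)
        train = [r for r in rows if r not in held_out]
        yield train, test


def user_held_out_folds(user_ids: Sequence[Hashable], k: int, seed: int = 0) -> Iterator[Fold]:
    """User-held-out cross-validation: all days of a user land in the same fold."""
    rows_by_user: Dict[Hashable, List[int]] = defaultdict(list)
    for row, user in enumerate(user_ids):
        rows_by_user[user].append(row)
    users = list(rows_by_user)
    random.Random(seed).shuffle(users)
    for f in range(k):
        test_users = set(users[f::k])
        test = [r for u in users if u in test_users for r in rows_by_user[u]]
        train = [r for u in users if u not in test_users for r in rows_by_user[u]]
        yield train, test


if __name__ == "__main__":
    # Hypothetical toy data: one entry per user-day.
    user_ids = ["a", "a", "a", "b", "b", "c", "c", "c", "d", "d"]
    for train, test in user_held_out_folds(user_ids, k=2):
        train_users = {user_ids[i] for i in train}
        test_users = {user_ids[i] for i in test}
        print(sorted(train_users), sorted(test_users))  # the two sets never overlap
```

Because the user-held-out scheme never lets a model see any data from a test user, it measures how well detection transfers to new people, which is the harder setting.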

8. Conclusion

We discover that atypical events and negative events substantially increase stress, anxiety, and negative affect. Major negative events are found to reduce positive affect over multiple days, while positive events improve positive affect that day. We also demonstrate that wearable sensors can provide important clues about whether someone is experiencing a positive or negative event. We find atypical events can be predicted with ROC-AUC of 0.66 with relatively little model hyperparameter tuning. This suggests more improvements are possible to predict atypical events. Overall, these results point to the importance and relative detectability of negative events, which offer hope for remote sensing and automated interventions in the future.

Acknowledgments

The authors are grateful to the TILES team for the efforts in study design, data collection and sharing that enable this work. This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No 2017-17042800005.

Dataset              Construct       Model       ROC-AUC  F1     Precision
Hospital workforce   Atypical Event  Random      0.50     0.12   0.12
Hospital workforce   Atypical Event  Aggregated  0.55     0.22   […]
Hospital workforce   Atypical Event  Embedding   0.56     0.23   […]
Hospital workforce   Good Event      Random      0.50     0.03   0.03
Hospital workforce   Good Event      Aggregated  0.57     0.065  […]
Hospital workforce   Good Event      Embedding   0.58     0.08   […]
Hospital workforce   Bad Event       Random      0.50     0.08   0.08
Hospital workforce   Bad Event       Aggregated  […]      0.15   […]
Hospital workforce   Bad Event       Embedding   […]      0.16   […]
Aerospace workforce  Atypical Event  Random      0.50     0.15   0.15
Aerospace workforce  Atypical Event  Aggregated  0.58     0.30   […]
Aerospace workforce  Atypical Event  Embedding   0.54     0.25   […]

Table 3. Performance of atypical event detection from sensors in the hospital and aerospace workforce datasets with user held-out detection. For all datasets, we can classify whether an event is atypical. For hospital workers, we can also classify whether an event is “good” (instead of any other type of event), or “bad.”

References

[1] P. Gray-Toft and J. G. Anderson, “Stress among hospital nursing staff: its causes and effects,” Social Science & Medicine. Part A: Medical Psychology & Medical Sociology, vol. 15, no. 5, pp. 639–647, 1981.
[2] U. Bashir and M. I. Ramay, “Impact of stress on employees job performance a study on banking sector of Pakistan,” International Journal of Marketing Studies, vol. 2, no. 1, pp. 122–126, 2010.
[3] M. Jamal, “Job stress, job performance and organizational commitment in a multinational company: An empirical study in two countries,” International Journal of Business and Social Science, vol. 2, no. 20, 2011.
[4] R. Z. Goetzel, X. Pei, M. J. Tabrizi, R. M. Henke, N. Kowlessar, C. F. Nelson, and R. D. Metz, “Ten modifiable health risk factors are linked to more than one-fifth of employer-employee health care spending,” Health Affairs, vol. 31, no. 11, pp. 2474–2484, 2012.
[5] S. Aral and C. Nicolaides, “Exercise contagion in a global social network,” Nature Communications, vol. 8, p. 14753, 2017.
[6] J. A. Healey and R. W. Picard, “Detecting stress during real-world driving tasks using physiological sensors,” IEEE Transactions on Intelligent Transportation Systems, vol. 6, no. 2, pp. 156–166, June 2005.
[7] K. Hovsepian, M. al’Absi, E. Ertin, T. Kamarck, M. Nakajima, and S. Kumar, “cstress: Towards a gold standard for continuous stress assessment in the mobile environment,” in Proc ACM Int Conf Ubiquitous Comput (UbiComp), 2015, pp. 493–504.
[8] R. Wang, F. Chen, Z. Chen, T. Li, G. Harari, S. Tignor, X. Zhou, D. Ben-Zeev, and A. T. Campbell, “Studentlife: Assessing mental health, academic performance and behavioral trends of college students using smartphones,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ser. UbiComp ’14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 3–14. [Online]. Available: https://doi.org/10.1145/2632048.2632054
[9] E. Smets, E. Rios Velazquez, G. Schiavone, I. Chakroun, E. D’Hondt, W. De Raedt, J. Cornelis, O. Janssens, S. Van Hoecke, S. Claes, I. Van Diest, and C. Van Hoof, “Large-scale wearable data reveal digital phenotypes for daily-life stress detection,” npj Digital Medicine, vol. 1, no. 67, 2018.
[10] H. R. Varian, “Causal inference in economics and marketing,” Proceedings of the National Academy of Sciences […]
[11] […] Review of General Psychology, vol. 5, pp. 323–370, 2001.
[12] J. E. Dimsdale, “Psychological stress and cardiovascular disease,” Journal of the American College of Cardiology […]
[13] […] JAMA, vol. 298, no. 14, pp. 1685–1687, 10 2007. [Online]. Available: https://doi.org/10.1001/jama.298.14.1685
[14] J. Smeenk, C. Verhaak, A. Eugster, A. van Minnen, G. Zielhuis, and D. Braat, “The effect of anxiety and depression on the outcome of in-vitro fertilization,” Human Reproduction, vol. 16, no. 7, pp. 1420–1423, 07 2001. [Online]. Available: https://doi.org/10.1093/humrep/16.7.1420
[15] D. Ruiz-Aranda, J. M. Salguero, and P. Fernández-Berrocal, “Emotional intelligence and acute pain: The mediating effect of negative affect,” The Journal of Pain […]
[16] […] Proceedings of the National Academy of Sciences […]
[17] […] Journal of Organizational Behavior, vol. 35, no. 4, pp. 530–546, 2014. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/job.1906
[18] J. Ifcher and H. Zarghamee, “Happiness and time preference: The effect of positive affect in a random-assignment experiment,” American Economic Review […]
[19] […] Personal Ubiquitous Comput., vol. 10, no. 4, pp. 255–268, Mar. 2006. [Online]. Available: https://doi.org/10.1007/s00779-005-0046-3
[20] H. Banaee, M. U. Ahmed, and A. Loutfi, “Data mining for wearable sensors in health monitoring systems: a review of recent trends and challenges,” Sensors, vol. 13, no. 12, pp. 17 472–17 500, 2013.
[21] S. Sriramprakash, V. D. Prasanna, and O. R. Murthy, “Stress detection in working people,” Procedia Computer Science […]
[22] […] Body Sensor Networks, Cambridge, USA, 2015.
[23] Y. S. Can, N. Chalabianloo, D. Ekiz, and C. Ersoy, “Continuous stress detection using wearable sensors in real life: Algorithmic programming contest case study,” Sensors, vol. 19, no. 8, p. 1849, 2019.
[24] M. Gjoreski, M. Luštrek, M. Gams, and H. Gjoreski, “Monitoring stress with a wrist device using context,” Journal of Biomedical Informatics […]
[25] […] Artificial Computation in Biology and Medicine, J. M. Ferrández Vicente, J. R. Álvarez-Sánchez, F. de la Paz López, F. J. Toledo-Moreo, and H. Adeli, Eds. Cham: Springer International Publishing, 2015, pp. 526–532.
[26] O. M. Mozos, V. Sandulescu, S. Andrews, D. Ellis, N. Bellotto, R. Dobrescu, and J. M. Ferrandez, “Stress detection using wearable physiological and sociometric sensors,” International Journal of Neural Systems, vol. 27, no. 02, p. 1650041, 2017.
[27] E. Guthrie, D. Black, H. Bagalkote, C. Shaw, M. Campbell, and F. Creed, “Psychological stress and burnout in medical students: a five-year prospective longitudinal study,” Journal of the Royal Society of Medicine, vol. 91, no. 5, pp. 237–243, 1998.
[28] D. Edwards, P. Burnard, K. Bennett, and U. Hebden, “A longitudinal study of stress and self-esteem in student nurses,” Nurse Education Today […]
[29] […] Nov 2015, pp. 93–98.
[30] V. Camomilla, M. Salai, I. Vassányi, and I. Kósa, “Stress detection using low cost heart rate sensors,” Journal of Healthcare Engineering, p. 136705, 2016.
[31] Y. Huang, J. Gong, M. Rucker, P. Chow, K. C. Fua, M. S. Gerber, B. A. Teachman, and L. E. Barnes, “Discovery of behavioral markers of social anxiety from smartphone sensor data,” in DigitalBiomarkers ’17: Proceedings of the 1st Workshop on Digital Biomarkers, 2017, pp. 9–14.
[32] S. Yan, H. Hosseinmardi, H. Kao, S. Narayanan, K. Lerman, and E. Ferrara, “Estimating individualized daily self-reported affect with wearable sensors,” in […], June 2019, pp. 1–9.
[33] A. Mottelson and K. Hornbæk, “An affect detection technique using mobile commodity sensors in the wild,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ser. UbiComp ’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 781–792. [Online]. Available: https://doi.org/10.1145/2971648.2971654
[34] L. Canzian and M. Musolesi, “Trajectories of depression: Unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis,” in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ser. UbiComp ’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 1293–1304. [Online]. Available: https://doi.org/10.1145/2750858.2805845
[35] N. Jaques, O. O. Rudovic, S. Taylor, A. Sano, and R. Picard, “Predicting tomorrow’s mood, health, and stress level using personalized multitask learning and domain adaptation,” in Proceedings of IJCAI 2017 Workshop on Artificial Intelligence in Affective Computing, ser. Proceedings of Machine Learning Research, N. Lawrence and M. Reid, Eds., vol. 66. PMLR, 20 Aug 2017, pp. 17–33. [Online]. Available: http://proceedings.mlr.press/v66/jaques17a.html
[36] M. V. Villarejo, B. G. Zapirain, and A. M. Zorrilla, “A stress sensor based on galvanic skin response (gsr) controlled by zigbee,” Sensors, vol. 12, no. 5, pp. 6075–6101, 2012.
[37] K. Mundnich, B. M. Booth, M. l’Hommedieu, T. Feng, B. Girault, J. L’Hommedieu, M. Wildman, S. Skaaden, A. Nadarajan, J. L. Villatte, T. H. Falk, K. Lerman, E. Ferrara, and S. Narayanan, “Tiles-2018: A longitudinal physiologic and behavioral data set of hospital workers,” arXiv preprint arXiv:2003.08474, 2020.
[38] M. Griffin, A. Neal, and S. Parker, “A new model of work role performance: positive behavior in uncertain and interdependent contexts,” Academy of Management Journal, vol. 50, no. 2, pp. 327–347, 2007.
[39] L. J. Williams and S. E. Anderson, “Job satisfaction and organizational commitment as predictors of organizational citizenship and in-role behaviors,” J. of Management, vol. 17, no. 3, pp. 601–617, 1991.
[40] S. D. Gosling, P. J. Rentfrow, and W. B. Swann Jr, “A very brief measure of the big-five personality domains,” Journal of Research in Personality, vol. 37, no. 6, pp. 504–528, 2003.
[41] J. B. Saunders, O. G. Asaland, T. F. Babor, J. R. D. la Fuente, and M. Grant, “Development of the alcohol use disorders identification test (audit): Who collaborative project on early detection of persons with harmful alcohol consumption-ii,” Addiction, vol. 89, no. 6, 1993.
[42] G. T. S. S. (GTSS), “Global adult tobacco survey (gats),” Indicator Guidelines: Definition and Syntax, 2009.
[43] D. J. Buysse, C. F. Reynolds III, T. H. Monk, S. R. Berman, and D. J. Kupfer, “The pittsburgh sleep quality index: a new instrument for psychiatric practice and research,” Psychiatry Research, vol. 28, no. 2, pp. 193–213, 1989.
[44] A. Mackinnon, A. F. Jorm, H. Christensen, A. E. Korten, P. A. Jacomb, and B. Rodgers, “A short form of the positive and negative affect schedule: Evaluation of factorial validity and invariance across demographic variables in a community sample,” Personality and Individual Differences, vol. 27, no. 3, pp. 405–416, 1999.
[45] C. J. Hutto and E. Gilbert, “Vader: A parsimonious rule-based model for sentiment analysis of social media text,” in Eighth international AAAI conference on weblogs and social media, 2014.
[46] J. Pennebaker, R. Boyd, K. Jordan, and K. Blackburn, “The development and psychometric properties of liwc2015,” 2015.
[47] N. Tavabi, H. Hosseinmardi, J. L. Villatte, A. Abeliuk, S. Narayanan, E. Ferrara, and K. Lerman, “Learning behavioral representations from wearable sensors,” arXiv preprint arXiv:1911.06959, 2019.
[48] E. B. Fox, M. C. Hughes, E. B. Sudderth, M. I. Jordan et al., “Joint modeling of multiple time series via the beta process with application to motion capture segmentation,” The Annals of Applied Statistics, vol. 8, no. 3, pp. 1281–1313, 2014.
[49] A. Bogomolov, B. Lepri, M. Ferron, F. Pianesi, and A. S. Pentland, “Daily stress recognition from mobile phone data, weather conditions and individual traits,” in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM ’14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 477–486. [Online]. Available: https://doi.org/10.1145/2647868.2654933
[50] A. Torralba and A. Oliva, “Depth estimation from image structure,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 27, no. 09, pp. 1226–1238, Sep 2002.
[51] D. R. Cox, “The regression analysis of binary sequences,” Journal of the Royal Statistical Society. Series B (Methodological) […]
[52] […] Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 1, Aug 1995, pp. 278–282 vol.1.
[53] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, Sep 1995. [Online]. Available: https://doi.org/10.1023/A:1022627411411
[54] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, vol. 63, no. 1, pp. 3–42, Apr 2006. [Online]. Available: https://doi.org/10.1007/s10994-006-6226-1
[55] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences […]
[…] Artificial Intelligence […] arXiv:1706.05098