On Attention Models for Human Activity Recognition
Vishvak S Murahari
Georgia Institute of Technology, Atlanta, GA
[email protected]
Thomas Plötz
Georgia Institute of Technology, Atlanta, GA
[email protected]
ABSTRACT
Most approaches that model time-series data in human activity recognition based on body-worn sensing (HAR) use a fixed size temporal context to represent different activities. This might, however, not be apt for sets of activities with individually varying durations. We introduce attention models into HAR research as a data driven approach for exploring relevant temporal context. Attention models learn a set of weights over input data, which we leverage to weight the temporal context being considered to model each sensor reading. We construct attention models for HAR by adding attention layers to a state-of-the-art deep learning HAR model (DeepConvLSTM) and evaluate our approach on benchmark datasets, achieving significant increases in performance. Finally, we visualize the learned weights to better understand what constitutes relevant temporal context.
ACM Classification Keywords
H.1.2. User/Machine System; I.5. Pattern Recognition
Author Keywords
Activity Recognition; Attention; Deep Learning
INTRODUCTION
In Human Activity Recognition (HAR) we analyze and model sequential, that is, time-series data. In order to do so we need to look into the temporal context of every single sensor reading, which forms the basis for modeling and eventually recognition. This has traditionally been done through sliding window approaches [2], which use a fixed size window to model the temporal context of every single sensor reading. Sliding window procedures also (and still) play a crucial role for many recent Deep Learning based HAR methods. For example, Convolutional Neural Networks (CNNs) in HAR employ a sliding window procedure to map the time-series data to a fixed 2D representation that is fed into the convolution layers [9]. Arguably, the window length is a crucial parameter for sliding-window based approaches, one that is often established based on prior, i.e., domain knowledge. Decisions regarding any sliding window procedure are hard and often final, and they impact the recognition procedure as a whole. As such, mistakes made
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CHI'16, May 07–12, 2016, San Jose, CA, USA
© 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 123-4567-24-567/08/06 ... $15.00
DOI: http://dx.doi.org/10.475/123_4

here are critical and errors made are difficult to recover from. Also, most sliding-window based approaches are constrained to use a single fixed size context, which may not be ideal when modeling activities with varying durations.

Alternative approaches use sequential models that could overcome the aforementioned issues through explicit segmentation of the activities of interest. The recent adoption of recurrent (deep) neural networks for HAR applications has also led to impressive recognition results, but these methods come with their own set of problems [6]. For example, Long Short Term Memory (LSTM) [7] models can learn infinite temporal contexts. However, it is not reasonable to assume that an event in the distant past would actually influence current events. Vanilla modeling is not able to capture such aspects. This is even more pertinent for HAR problems, as there is typically only little, if any, relation between current and distant past activities [2].

Such observations lead us to the question of what temporal context is actually relevant for a model to consider in order to successfully represent activities of interest, and whether a model could make such a decision automatically. If that were the case, then an externalization of such a data-driven decision regarding the relevant temporal context would lead to insights about the analyzed data, possibly up to improved segmentation procedures. Ultimately, we aim for a model to automatically learn its relevant temporal context.

In this paper we employ attention models for human activity recognition problems. Essentially, attention models help a model learn a set of weights over a set of representations–data input–which signify the relative importance of each of the representations.
For the case of activity analysis, these models would learn the contributions of all previous sensor readings that are considered for the analysis of a sample. We use attention models for supervised learning tasks in HAR, giving the model the ability to generate weight distributions over the history of a sample. In doing so the model is incentivized to generate weights which place the weight on the context that is relevant for a classification decision.

We explore the potential of attention models by adding an attention layer to a state-of-the-art, deep learning based HAR model [9]. We evaluate our approach on standard benchmarks (Opportunity, PAMAP2, Skoda), and results demonstrate significantly increased performance over current approaches, which emphasizes the relevance of the proposed approach. We further explore what the models have learned by visualizing the attention model's weights, which provides additional insights into the model behavior.

BACKGROUND
Recent work in sequence modeling in HAR has mainly focused on convolutional networks (CNNs), and on recurrent models such as LSTMs. The attraction of CNNs lies in the fact that end-to-end learning is possible due to their stacked filtering layers that automatically capture hierarchical feature representations of the data. Combined with clever combinations of pooling, that is, subsampling, and linear layers, very powerful recognition systems have been realized [1, 16, 15]. CNNs are applicable for analyzing sequential data only because of the sliding window trick, which slices out–typically fixed size–analysis frames from the input sequence of sensor data and promotes these through the network independently. Some recent work has randomized this sliding window procedure in order to generate data variability that is exploited in ensemble based approaches [4]. However, the general sliding window principle remains unchanged.

Recurrent models have also been applied very successfully in challenging HAR scenarios. The vast majority of these approaches is based on–variants of–LSTMs [7]. Such models incorporate specific gates into individual cells that allow for keeping an internal memory by feeding back a cell's output and by keeping track of the internal state. [5] extensively analyzed the behavior of deep learning models in the wider HAR context, and one of the most promising current models represents a combination of CNNs–for representation learning–and LSTMs–for sequence learning: DeepConvLSTM [9].
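The sliding-window framing that both families of models rely on can be sketched as follows. This is a minimal illustration; the frame length, sampling rate, and function name are placeholders rather than the exact preprocessing of any cited work:

```python
def sliding_windows(series, frame_len, overlap):
    """Slice a sensor time series (a list of samples) into fixed-size
    analysis frames with the given fractional overlap between frames."""
    step = max(1, int(frame_len * (1.0 - overlap)))
    frames = []
    for start in range(0, len(series) - frame_len + 1, step):
        frames.append(series[start:start + frame_len])
    return frames

# E.g. frames of 24 samples (one second at the Opportunity sampling rate
# used later in this paper) with 50% overlap:
frames = sliding_windows(list(range(100)), frame_len=24, overlap=0.5)
```

Each frame is then promoted through the network independently, which is what makes the window length such a consequential choice.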
ATTENTION FOR HAR
Previous deep learning approaches have focused on representing and modeling a fixed size temporal context for all sensor readings. Arguably, this approach works very well, as such models currently dominate the most challenging HAR benchmarks (such as Opportunity [3]). However, especially such challenging tasks exhibit both a substantial intra- as well as inter-class variance with regard to the durations of the activities (cf. [4] for a detailed analysis of current benchmark datasets). As such, using fixed window lengths will not naturally lead to ideal modeling and hence classification performance.

Instead, we explore how attention models can be employed for automatically determining the temporal context that is relevant for modeling activities. In essence, such an approach would adapt the analysis windows in a data-driven manner. Attention models have been introduced for natural language processing tasks such as part of speech (POS) tagging [8]. The formal idea is that a set of linear layers and non-linearities are used to learn weights over k vectors, each of dimension d. Most architectures have a set of linear layers which map these d-dimensional vectors to a one-dimensional score, and these scores are then passed through the softmax function to give the set of k weights. The way each of these k vectors is mapped to a one-dimensional score is specific to the architecture. For instance, a linear layer could directly map the d-dimensional vector to a single dimension, or one could add an intermediate linear layer to initially map the d-dimensional vector to, for example, a dimension d/2.

Figure 1. Illustration of adding attention (see text for description).
We construct our HAR models by adding an attention layer to the state-of-the-art architecture from [9] – DeepConvLSTM (see below for model details). The general idea is to start with a large enough temporal context (sliding window) that was used in previous work and led to reasonable recognition performance. We then add an attention layer to automatically rescale the weights of all samples in the analysis frame according to their relevance for modeling, in line with our hypothesis that not all historic samples in an analysis frame are of (the same) relevance for modeling. This relevance weighting of a sensor reading's history is a direct outcome of the attention layers, which we exploit for improving HAR models. We do not change any other (hyper-) parameters, to focus our exploration on the effect that introducing attention has on state-of-the-art activity recognition. Fig. 2 illustrates this architecture. Details are given in what follows.
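To make the score-then-softmax recipe concrete, it can be sketched in plain Python. This is a toy illustration with tiny dimensions: the two scoring layers of the full model are collapsed into a single weight vector `w_score` here for brevity, whereas the actual model applies learned linear layers to 128-dimensional LSTM states:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(hidden_states, w_score, b_score):
    """Weight all but the last hidden state by learned relevance, then add
    the weighted context to the last state (skip connection). w_score and
    b_score stand in for the learned scoring layers of the real model."""
    past, current = hidden_states[:-1], hidden_states[-1]
    # One scalar score per past state: s_i = w . tanh(h_i) + b
    scores = [sum(w * math.tanh(x) for w, x in zip(w_score, h)) + b_score
              for h in past]
    weights = softmax(scores)          # relevance weights, sum to 1
    # Weighted sum of the past states, added to the current state
    context = [sum(a * h[i] for a, h in zip(weights, past))
               for i in range(len(current))]
    return [c + x for c, x in zip(context, current)]

# Three 2-dimensional "hidden states"; the last one is the current state.
emb = attend([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
             w_score=[1.0, -1.0], b_score=0.0)
```

The returned embedding plays the role of the final frame representation that is fed to the classifier in the architecture described next.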
DeepConvLSTM and Attention
DeepConvLSTM models [9] represent the state of the art for deep learning based HAR applications, which motivates us to explore the addition of attention layers to this model architecture. In this architecture the input, which is a window containing one second of data (24 samples for Opportunity [3]), is fed into four consecutive convolution layers with ReLU non-linearity. Through employing the windowing procedure, the sensor data input (time-series) is converted into a two-dimensional representation where the first dimension represents time and the other captures features – 24 × d, where d denotes the number of features in each sample. Convolution filters are one-dimensional along the time axis, and after the four convolution layers and successive downsampling through pooling, the data representation is expanded to an array of size 8 × d × f, where d is the number of features and f denotes the number of filters. The latter was set to 64 for each layer [9]. This sequence of 8 resulting feature vectors is then modeled by a two-layer LSTM with 128 hidden units. The final hidden state of the LSTM represents the embedding of the input frame, which is fed into a linear layer and a softmax to produce the prediction for an input frame.

To incorporate the attention mechanism, we analyze the 8 hidden states from the LSTM that represent the embeddings for the different parts of an input frame. We then consider the first 7 hidden states as the historical temporal context and learn 7 weights corresponding to these hidden states:

past context = [h1, h2, h3, ..., h7]   (1)
current = h8   (2)
transformed context = tanh(W1 × past context + b1)   (3)
weights = softmax(W2 × transformed context + b2)   (4)
final embedding = past context × weights + current   (5)

Figure 2. Adding attention to human activity recognition based on the DeepConvLSTM architecture.
The final embedding, highlighted in green, is used for prediction as opposed to the final hidden state in the original model (see text for description).

b1 and b2 denote the biases in the two linear layers, and W1 and W2 represent the 2D matrices in the linear layers. We initially apply a linear transformation accompanied by a tanh non-linearity, transforming each of these seven vectors of size 128 into seven new vectors of size 128 (Eq. 3). Another linear transformation converts these 7 vectors of size 128 into 7 vectors of size 1, essentially giving us scores for each of the hidden states. These scores are then passed through a softmax to give the final set of weights (Eq. 4). These weights are used to calculate a weighted sum of all the 7 hidden states, giving the final embedding for the past context. This past context is added to the last hidden state to give the final embedding for the input frame (Eq. 5). This final embedding is used for classification, as opposed to the last hidden state used by DeepConvLSTM.

Note that the addition of the last hidden state to the embedding of the past context can be interpreted as a skip connection from the recurrent layers to the attention layer. Considering the computational graph that corresponds to this model, we observe that the model may decide to propagate the gradient only to the recurrent layers and could avoid the attention layers altogether. This is actually beneficial for HAR: as datasets are often relatively small, overfitting needs to be avoided, which can be realized explicitly through aggressive regularization, through the dropout layers [12] shown in Fig. 2, or implicitly through these skip connections.

EXPERIMENTS
Our explorations of the benefits that attention models may bring to human activity recognition are based on experimental evaluations on standard datasets from the field: Opportunity [3], PAMAP2 [11], and Skoda [13]. These datasets are very diverse in terms of the nature of activities and the relative distribution of activities. Therefore, the datasets present robust benchmarks for evaluating HAR systems. We employ standard training and evaluation protocols based on hold-out datasets as they have been defined in the original publications (and summarized, e.g., in [4]).

All models were trained using the PyTorch deep learning framework [10]. For all experiments a sliding window procedure was used to extract the processing frames our analysis is based on. Initial frame lengths were set to one second of data
Table 1. Sample-wise recognition results. Significant improvements over non-attention baselines are highlighted in bold (Wilson score interval).
Modeling Variant    Opportunity    PAMAP2         Skoda
DeepConvLSTM [9]    67.2           74.8           91.2
b-LSTM-S [5]        68.4           83.8           92.1
Att. Model          70.7 ± .003    87.5 ± .002    ± .004

each, with an overlap of 50% between consecutive frames. Extracted frames were randomly shuffled during training to avoid bias. All studied models produce sample-wise predictions, which is–in contrast to frame-wise prediction–more realistic for practical applications [5, 4].

Models were trained using cross-entropy loss. Learning rate was fixed at 0.001 and decayed after every epoch. Learning decay rate and the dropout values were optimized for all models; both seemingly have substantial impact on recognition performance. RMSProp was employed for optimization [14]. Batch size was set to 100 for all experiments, and dropout layers were used for regularization. All code, along with the best model weights for each of the datasets and the best hyper-parameters, is available on our github page for reference.

RESULTS
Given the imbalance in class distributions for all three datasets, we report results as mean F1 scores. Statistical significance tests are based on the Wilson score interval with 95% confidence. Recognition results for all benchmark datasets are given in Table 1. It can be seen that the incorporation of attention models leads to a significant increase in performance over the state of the art for both Opportunity and PAMAP2. For Skoda we only see marginal improvements when introducing attention, which is similar to what has been reported in the literature for (other) model evaluations on this dataset. As such, it may be concluded that by now a performance level has been reached for Skoda that bears no potential for further improvements.
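The significance test above rests on the Wilson score interval for a binomial proportion. A minimal sketch of the standard formula, with z = 1.96 for 95% confidence (the counts below are illustrative, not the paper's actual evaluation data):

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion
    (z = 1.96 corresponds to 95% confidence)."""
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total
                                   + z * z / (4 * total * total))
    return center - half, center + half

# Intervals for two classifiers on the same test set; non-overlapping
# intervals indicate a significant difference at the chosen level.
lo_a, hi_a = wilson_interval(707, 1000)   # hypothetical counts
lo_b, hi_b = wilson_interval(672, 1000)
```

Unlike the simpler normal-approximation interval, the Wilson interval remains well behaved for proportions near 0 or 1, which matters for the rarer activity classes.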
DISCUSSION
For reviewing purposes an anonymized archive of our github can be found at: tinyurl.com/ybs4ndlv

Figure 3. Visualizing weights learned by the best attention model on the Opportunity test dataset (best viewed in color).

Figure 3 visualizes the weights of the best model. During evaluation, each sample has a set of 7 weights associated with the relative importance of the first 7 hidden states of the LSTM. We take the median of these weights across all samples belonging to a certain activity. The visualization offers interesting insights into the model's behavior. We notice that most of the weight is concentrated on the last few hidden states. This is reasonable, as these states capture the summary of the input frame: through the LSTM recurrence the last hidden states capture more information than previous ones, and hence have a more dominant contribution to the final context embedding. However, only using the final hidden state–as LSTMs do–is detrimental, as there may be important information at the start of the input frame. Therefore, through using attention models we see a significant amount of the weight being placed on the past hidden states as well, and this allows the model to capture the context more effectively than relying only on the last hidden state of the LSTM. The improvements in recognition performance confirm the benefit of adding attention to deep, recurrent HAR models.

We also observe that for all activities analyzed, the weight on the first two hidden states is close to zero. This is likely due to the first few hidden states not yet being able to capture anything valuable, because the history at this stage is too short and thus rather uninformative. The attention model explicitly downweights those initial states, as they do not contribute much to the model and would thus rather "waste" model parameters if included.
In summary, the attention mechanism effectively shrinks the history of a sensor reading that a HAR model needs to focus on.

We also observe that among all the (Opportunity) activities, the activity "Open Drawer 3" has the most spread out weights over the hidden states. This is interesting because it suggests that this activity might involve multiple distinct segments, as the model distributes the weight evenly over hidden states. On further inspection, we realize that this activity is about opening the lowest drawer in a cupboard containing three drawers, and hence one might need to perform multiple smaller activities, such as bending down, opening the drawer, and rising up. Therefore, the model is incentivized to distribute the weight more evenly to capture these sub-activities. This last aspect is the basis for future developments and applications, as–essentially–it is the starting point for novel segmentation schemes.

REFERENCES
1. S. Bhattacharya and N. D. Lane. 2016. Sparsification and separation of deep learning layers for constrained resource inference on wearables. In Proc. Int. Conf. Embedded Network Sensor Systems.
2. A. Bulling, U. Blanke, and B. Schiele. 2014. A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors. ACM Comput. Surv. 46, 3, Article 33 (Jan. 2014), 33 pages.
3. R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. R. Millán, and D. Roggen. 2013. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters 34, 15 (2013), 2033–2042.
4. Y. Guan and T. Plötz. 2017. Ensembles of Deep LSTM Learners for Activity Recognition Using Wearables. PACM IMWUT 1, 2 (June 2017), 11:1–11:28.
5. N. Y. Hammerla, S. Halloran, and T. Plötz. 2016. Deep, convolutional, and recurrent models for human activity recognition using wearables. In Proc. IJCAI.
6. S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, and others. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. (2001).
7. S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
8. A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In Proc. ICML.
9. F. J. Ordóñez and D. Roggen. 2016. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 16, 1 (2016), 115.
10. A. Paszke, S. Gross, S. Chintala, and G. Chanan. 2017. PyTorch. pytorch.org. (2017). Accessed: 2018-05-18.
11. A. Reiss and D. Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In Proc. ISWC.
12. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR 15, 1 (2014), 1929–1958.
13. T. Stiefmeier, D. Roggen, G. Ogris, P. Lukowicz, and G. Tröster. 2008. Wearable activity tracking in car manufacturing. IEEE Pervasive Computing 7, 2 (2008).
14. T. Tieleman and G. Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4, 2 (2012), 26–31.
15. J. Yang, M. N. Nguyen, P. P. San, X. Li, and S. Krishnaswamy. 2015. Deep Convolutional Neural Networks on Multichannel Time Series for Human Activity Recognition. In Proc. IJCAI.
16. M. Zeng, L. T. Nguyen, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, and J. Zhang. 2014. Convolutional neural networks for human activity recognition using mobile sensors. In