Multi-agent Attentional Activity Recognition
Kaixuan Chen, Lina Yao, Dalin Zhang, Bin Guo and Zhiwen Yu
University of New South Wales; Northwestern Polytechnical University
{kaixuan.chen@student., lina.yao@}unsw.edu.au

Abstract
Multi-modality is an important feature of sensor-based activity recognition. In this work, we consider two inherent characteristics of human activities: the spatially-temporally varying salience of features, and the relations between activities and the corresponding body part motions. Based on these, we propose a multi-agent spatial-temporal attention model. The spatial-temporal attention mechanism helps intelligently select informative modalities and their active periods, and the multiple agents in the proposed model represent activities with collective motions across body parts by independently selecting the modalities associated with single motions. With a joint recognition goal, the agents share gained information and coordinate their selection policies to learn the optimal recognition model. The experimental results on four real-world datasets demonstrate that the proposed model outperforms the state-of-the-art methods.
Introduction

The ability to identify human activities via on-body sensors has been of interest to the healthcare [Anguita et al., 2013], entertainment [Freedman and Zilberstein, 2018] and fitness [Guo et al., 2017] communities. Some works on Human Activity Recognition (HAR) are based on hand-crafted features fed to statistical machine learning models [Lara and Labrador, 2013]. More recently, deep learning has experienced massive success in modeling high-level abstractions from complex data [Pouyanfar et al., 2018], and there is a growing interest in developing deep learning for HAR [Hammerla et al., 2016]. Despite this, these methods still lack sufficient justification when applied to HAR. In this work, we consider two inherent characteristics of human activities and exploit them to improve recognition performance.

The first characteristic of human activities is the spatially-temporally varying salience of features. Human activities can be represented as a sequence of multi-modal sensory data. The modalities include acceleration, angular velocity and magnetism collected from different positions on testers' bodies, such as the chest, arms and ankles. However, only a subset of modalities from specific positions is informative for recognizing a certain activity [Wang and Wang, 2017]; irrelevant modalities often interfere with recognition and undermine performance. For instance, identifying lying mainly relies on people's orientations (magnetism), and going upstairs can be easily distinguished by upward acceleration from the arms and ankles. In addition, the significance of modalities changes over time: intuitively, a modality is only important while the corresponding body part is actively participating in the activity. Therefore, we propose a spatial-temporal attention method to select the salient modalities and their active periods that are indicative of the true activity. Attention has been formulated as a sequential decision task in earlier works [Denil et al., 2012; Mnih et al., 2014], and this mechanism has been applied to sensor-based HAR in recent years. [Chen et al., 2018] and [Zhang et al., 2018] transform the sensory sequences into 3-D activity data by replicating and permuting the input data, and they propose to attentionally keep a focal zone for classification. However, these methods heavily rely on data pre-processing, the replication increases the computational complexity unnecessarily, and they do not take the temporally-varying salience of modalities into account. In contrast, the proposed spatial-temporal attention approach directly selects from raw data the informative modalities and the active time that are relevant to classification. The experimental results show that our model makes HAR more explainable.

The second characteristic of human activities considered in this paper is that activities are portrayed by motions of several body parts collectively. For instance, running can be seen as a combination of arm and ankle motions. Some works like [Radu et al., 2018; Yang et al., 2015] are committed to fusing multi-modal sensory data for time-series HAR, but they only fuse the information of local modalities from the same positions. These methods, as well as the existing attention-based methods [Chen et al., 2018; Zhang et al., 2018], are limited in capturing the global interconnections across different body parts. To fill this gap, we propose a multi-agent reinforcement learning approach.
We simplify activity recognition by dividing the activities into sub-motions, associating an independent intelligent agent with each, and coordinating the agents' actions. These agents select informative modalities independently, based on both their local observations and the information shared by the others. Each agent can individually learn an efficient selection policy by trial-and-error. After a sequence of selections and information exchanges, a joint decision on recognition is made. The selection policies are incrementally coordinated during training since the agents share a common goal, which is to minimize the loss caused by false recognition.

Figure 1: The overview of the proposed model. At each step s, three agents a_1, a_2, a_3 individually select modalities and obtain observations o^s_1, o^s_2, o^s_3 from the input x at (t^s, l^s_1), (t^s, l^s_2) and (t^s, l^s_3). The agents then exchange and process the gained information to get the representation r^s_g of the shared observation, and they decide the next locations again. Based on a sequence of observations after an episode, the agents jointly make the classification. Red, green and blue denote the workflows associated with a_1, a_2, a_3, respectively. Other colors denote the shared information and its representations.

The key contributions of this research are as follows:
• We propose a spatial-temporal attention method for temporal sensory data, which considers the spatially-temporally varying salience of features and allows the model to focus on the informative modalities collected only in their active periods.
• We propose a multi-agent collaboration method. The agents represent activities with collective motions by independently selecting the modalities associated with single motions and by sharing observations. The whole model can be optimized by coordinating the agents' selection policies with the joint recognition goal.
• We evaluate the proposed model on four datasets. The comprehensive experimental results demonstrate the superiority of our model over the state-of-the-art approaches.

We now detail the human activity recognition problem on multi-modal sensory data. Each input sample (x, y) consists of a 2-d vector x and an activity label y. Let x = [x^1, x^2, ..., x^K], where K denotes the time window length and x^i denotes the sensory vector collected at time point i. x^i is the combination of multi-modal sensory data collected from different positions on testers' bodies, such as the chest, arms and ankles. Suppose that x^i = (m^i_1, m^i_2, ..., m^i_N) = (x^i_1, x^i_2, ..., x^i_P), where m denotes the data collected from each position, N denotes the number of positions (in our datasets, N = 3), and P denotes the number of values per vector. Therefore, x ∈ R^{K×P} and y ∈ [1, ..., C], where C represents the number of activity classes. The goal of the proposed model is to predict the activity y.

The overview of the model structure is shown in Figure 1. At each step s, the agents jointly select an active period and individually select informative modalities from the input x. These agents share their information and independently decide where to "look at" at the next step. The locations are determined spatially and temporally in terms of modalities and time. After several steps, the final classification is jointly conducted by the agents based on the sequence of observations. Each agent can incrementally learn an efficient decision policy over episodes.
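For concreteness, the sketch below assembles one such input sample. The window length, the per-position channel counts and the position names are illustrative assumptions rather than values taken from the paper; only the layout x ∈ R^{K×P} with N = 3 body positions follows the definition above.

import numpy as np

# Illustrative sizes, not values from the paper.
K = 128                                          # time window length
channels = {"chest": 5, "arm": 9, "ankle": 9}    # hypothetical per-position modality counts
P = sum(channels.values())                       # values per sensory vector x^i

# Simulated raw streams: one (K, channels) array per body position.
streams = {pos: np.random.randn(K, c) for pos, c in channels.items()}

# x in R^{K x P}: row i is x^i = (m^i_1, m^i_2, m^i_3), the concatenation
# of the N = 3 positions' modalities at time step i.
x = np.concatenate([streams[p] for p in ("chest", "arm", "ankle")], axis=1)
y = 2                                            # activity label, y in {1, ..., C}

assert x.shape == (K, P)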
At the same time, since the agents share the same goal, which is to jointly minimize the recognition loss, they collaborate with each other and learn to align their behaviors so as to achieve this common goal.
Multi-agent Collaboration

In this work, we simplify activity recognition by dividing the activities into sub-motions and requiring each agent to select the informative modalities associated with one motion. Suppose that we employ H agents a_1, a_2, ..., a_H (we assume H = 3 in this paper for simplicity). The workflows of a_1, a_2, a_3 are shown in red, green and blue in Figure 1. At each step s, each agent locally observes a small patch of x, which includes information of a specific modality from a motion in its active period. Let the observations be e^s_1, e^s_2, e^s_3, as Figure 1 shows. They are extracted from x at the locations (t^s, l^s_1), (t^s, l^s_2) and (t^s, l^s_3), respectively, where t denotes the selected active period and l denotes the location of a modality in the input x. The model encodes the region around (t^s, l^s_i) (i ∈ {1, 2, 3}) with high resolution but uses a progressively lower resolution for points further from (t^s, l^s_i), in order to remove noise while avoiding information loss, as in [Zontak et al., 2013]. We then further encode the observations into higher-level representations. For each agent a_i (i ∈ {1, 2, 3}), the observation e^s_i and the location (t^s, l^s_i) are linearly transformed independently, parameterized by θ_e and θ_tl, respectively. Next, the summation of these two parts is further transformed with another linear layer parameterized by θ_o. The whole process can be summarized as the following equation:

o^s_i = f_o(e^s_i, t^s, l^s_i; θ_e, θ_tl, θ_o) = L(L(e^s_i) + L(concat(t^s, l^s_i))), i ∈ {1, 2, 3},   (1)

where L(·) denotes a linear transformation and concat(t^s, l^s_i) represents the concatenation of t^s and l^s_i. Each linear layer is followed by a rectified linear unit (ReLU) activation. Therefore, o^s_i contains information about "what" (e^s_i), "where" (l^s_i) and "when" (t^s).

Making multiple observations not only spares the system from processing the whole data at a time but also maximally prevents the information loss of selecting only one region of data. Furthermore, the multiple agents make observations individually so that they can represent activities with the collective modalities from different motions, and the model can explore various combinations of modalities during learning.

We are then interested in the collaborative setting where the agents communicate with each other and share the observations they make. We obtain the shared observation o^s_g by concatenating o^s_1, o^s_2, o^s_3:

o^s_g = concat(o^s_1, o^s_2, o^s_3),   (2)

so that o^s_g contains all the information observed by the three agents. A convolutional network is further applied to process o^s_g and extract the informative spatial relations, and the output is reshaped to be the representation r^s_g:

r^s_g = f_c(o^s_g; θ_c) = reshape(Conv(o^s_g)).   (3)

r^s_g represents the activity to be identified, with multiple modalities selected from motions at different body positions.
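The following PyTorch sketch mirrors the structure of Eqs. (1)-(3). The layer widths, the patch dimension and the exact convolution shape are illustrative assumptions; the paper specifies only the composition: two linear branches (observation and location) summed, a further linear layer, ReLU after each linear layer, then concatenation of the three agents' encodings and a convolutional merge.

import torch
import torch.nn as nn

class AgentEncoder(nn.Module):
    # f_o of Eq. (1): encode a patch e^s_i together with its location (t^s, l^s_i).
    def __init__(self, patch_dim=24, hid=64):            # illustrative sizes
        super().__init__()
        self.l_e = nn.Linear(patch_dim, hid)             # L(e^s_i), parameters theta_e
        self.l_tl = nn.Linear(2, hid)                    # L(concat(t^s, l^s_i)), theta_tl
        self.l_o = nn.Linear(hid, hid)                   # outer linear layer, theta_o
        self.relu = nn.ReLU()

    def forward(self, e, t, l):
        branch_e = self.relu(self.l_e(e))                        # "what"
        branch_tl = self.relu(self.l_tl(torch.cat([t, l], -1)))  # "where" and "when"
        return self.relu(self.l_o(branch_e + branch_tl))         # o^s_i

def shared_representation(o1, o2, o3, conv):
    o_g = torch.stack([o1, o2, o3], dim=1)     # Eq. (2): combine the agents' encodings
    r_g = conv(o_g.unsqueeze(1))               # Eq. (3): extract cross-agent relations
    return r_g.flatten(start_dim=1)            # reshape to the representation r^s_g

# Usage with batch size 4; the 3 x 1 filter spanning all three agents is an assumption.
enc = AgentEncoder()
conv = nn.Conv2d(1, 8, kernel_size=(3, 1))
e, t, l = torch.randn(4, 24), torch.rand(4, 1), torch.rand(4, 1)
r_g = shared_representation(enc(e, t, l), enc(e, t, l), enc(e, t, l), conv)
print(r_g.shape)   # torch.Size([4, 512])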
Attentive Selection

In this section, we detail how the modalities and the active period are selected attentively. We first introduce the episodes in this work. The agents incrementally learn the attentive selection policies over episodes. In each episode, following bottom-up processes, the model attentively selects data regions and integrates the observations over time to generate dynamic representations, in order to determine effective selections and maximize the reward, i.e., minimize the loss. Based on this, an LSTM is appropriate for building an episode, as it incrementally combines information across time steps to obtain the final result. As can be seen in Figure 1, at each step s, the LSTM module receives the representation r^s_g and the previous hidden state h^{s-1} as inputs. Parameterized by θ_h, it outputs the current hidden state h^s:

h^s = f_h(r^s_g, h^{s-1}; θ_h).   (4)

Now we introduce the selection module. The agents are supposed to select the salient modalities and an active period at each step; specifically, they need to select the locations of their next observations. The three agents control l^{s+1}_1, l^{s+1}_2, l^{s+1}_3 independently, based on both the hidden state h^s and their individual observations o^s_1, o^s_2, o^s_3, so that the individual decisions also draw on the overall observation. t^{s+1} is jointly decided based on h^s only, since it is a common selection. The decisions are made by the agents' selection policies, which are defined by a Gaussian stochastic process:

l^{s+1}_i ∼ P(· | f_l(h^s, o^s_i; θ_{l_i})), i ∈ {1, 2, 3},   (5)

and

t^{s+1} ∼ P(· | f_t(h^s; θ_t)).   (6)

The purpose of the stochastic selection is to explore more combinations of selections such that the model can learn the best selections during training.

To align the agents' selection policies, we assign the agents a common goal: correctly recognizing activities after a sequence of observations and selections. They receive a positive reward together if the recognition is correct. At each step s, a prediction ŷ^s is made by:

ŷ^s = f_y(h^s; θ_y) = softmax(L(h^s)).   (7)

Usually, agents receive a reward r after each step, but in our case, since only the classification at the last step S is representative, the agents receive a delayed reward R after each episode:

R = 1 if ŷ^S = y, and R = 0 if ŷ^S ≠ y.   (8)

The target of optimization is to coordinate all the selection policies by maximizing the expected value of the reward R̄ over episodes.
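A minimal sketch of the episode core and the stochastic selection of Eqs. (4)-(8), again in PyTorch. The tanh squashing of the policy means, the single location head shown (the paper uses three heads θ_{l_1}, θ_{l_2}, θ_{l_3}), and all sizes are assumptions; the fixed standard deviation sigma plays the role of the Gaussian policy variance mentioned in the experimental settings.

import torch
import torch.nn as nn
from torch.distributions import Normal

hid, num_classes, sigma = 64, 12, 0.1                # illustrative sizes; sigma: policy std

f_h = nn.LSTMCell(input_size=512, hidden_size=hid)   # Eq. (4), parameters theta_h
f_l = nn.Linear(hid + 64, 1)                         # one of three location heads theta_{l_i}
f_t = nn.Linear(hid, 1)                              # shared time head, parameters theta_t
f_y = nn.Linear(hid, num_classes)                    # classifier, parameters theta_y

def step(r_g, o_i, h, c):
    h, c = f_h(r_g, (h, c))                          # h^s = f_h(r^s_g, h^{s-1})
    # Eq. (5): l^{s+1}_i ~ N(f_l(h^s, o^s_i), sigma^2); tanh keeps means in range
    l_dist = Normal(torch.tanh(f_l(torch.cat([h, o_i], -1))), sigma)
    # Eq. (6): t^{s+1} ~ N(f_t(h^s), sigma^2), decided from h^s alone
    t_dist = Normal(torch.tanh(f_t(h)), sigma)
    l_next, t_next = l_dist.sample(), t_dist.sample()
    logp = l_dist.log_prob(l_next) + t_dist.log_prob(t_next)   # kept for Eq. (13)
    y_hat = torch.softmax(f_y(h), dim=-1)            # Eq. (7)
    return l_next, t_next, logp, y_hat, (h, c)

def delayed_reward(y_hat_final, y):
    # Eq. (8): R = 1 if the last-step prediction is correct, else 0
    return (y_hat_final.argmax(dim=-1) == y).float()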
This model involves parameters that define the multi-agent collaboration and the attentive selection: Θ = {θ_e, θ_tl, θ_o, θ_c, θ_h, θ_{l_i}, θ_t, θ_y} (i ∈ {1, 2, 3}). The parameters for classification can be optimized by minimizing the cross-entropy loss:

L_c = -(1/N) Σ_{n=1}^{N} Σ_{c=1}^{C} y_n(c) log F_y(x),   (9)

where F_y is the overall function that outputs ŷ given x, C is the number of activity classes, and y_n(c) = 1 if the n-th sample belongs to the c-th class and 0 otherwise.

However, the selection policies, which are mainly defined by θ_{l_i} (i ∈ {1, 2, 3}) and θ_t, are expected to select a sequence of locations, and these parameters are thus non-differentiable. In this view, we deploy a Partially Observable Markov Decision Process (POMDP) [Cai et al., 2009] to solve the optimization problem. Suppose e^s = (e^s_1, e^s_2, e^s_3) and lt^s = (l^s_1, l^s_2, l^s_3, t^s). We consider each episode as a trajectory τ = {e^1, lt^1, ŷ^1; e^2, lt^2, ŷ^2; ...; e^S, lt^S, ŷ^S}. Each trajectory represents one order of the observations, the locations and the predictions our agents make. After the agents repeat N episodes, we obtain {τ_1, τ_2, ..., τ_N}, and each τ has a probability p(τ; Θ) of being obtained. The probability depends on the selection policy Π = (π_1, π_2, π_3) of the agents.

Our goal is to learn the best selection policy Π that maximizes R̄. Specifically, Π is decided by Θ, so we need to find the optimal Θ* = argmax_Θ E[R̄]. One common way is gradient ascent. Generally, given a sample x with reward f(x) and probability p(x), the gradient can be calculated as follows:

∇_θ E_x[f(x)] = ∇_θ Σ_x p(x) f(x) = Σ_x p(x) (∇_θ p(x) / p(x)) f(x) = Σ_x p(x) ∇_θ log p(x) f(x) = E_x[f(x) ∇_θ log p(x)].   (10)

In our case, a trajectory τ can be seen as a sample, the probability of each sample is p(τ; Θ), and the reward function is R̄ = E_{p(τ;Θ)}[R]. We have the gradient:

∇_Θ R̄ = E_{p(τ;Θ)}[R ∇_Θ log p(τ; Θ)].   (11)

By considering the training problem as a POMDP and following the REINFORCE rule [Williams, 1992]:

∇_Θ R̄ = E_{p(τ;Θ)}[R Σ_{s=1}^{S} ∇_Θ log Π(y | τ^{1:s}; Θ)].   (12)

Since we need several samples τ for one input x to learn the best policy combination, we adopt Monte Carlo sampling, which utilizes randomness to approximate results that are theoretically deterministic. Supposing M is the number of Monte Carlo copies, we duplicate the same input M times and average the prediction results. The M copies generate M subtly different results owing to the stochasticity, so we have:

∇_Θ R̄ ≈ (1/M) Σ_{i=1}^{M} R^{(i)} Σ_{s=1}^{S} ∇_Θ log Π(y^{(i)} | τ^{(i)}_{1:s}; Θ),   (13)

where M denotes the number of Monte Carlo samples, i denotes the i-th duplicated sample, and y^{(i)} is the correct label for the i-th sample. Therefore, the overall optimization can be summarized as maximizing R̄ and minimizing Eq. 9. The detailed procedure is shown in Algorithm 1.
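In an autodiff framework, Eq. (13) is typically realized by treating the negated reward-weighted log-probabilities as a surrogate loss, so that one backward pass yields both ∇_Θ L_c and -∇_Θ R̄. A sketch under that assumption follows; the mean-reward baseline is a standard variance-reduction device and is not stated in the paper, and averaging logits here only approximates averaging the prediction results.

import torch
import torch.nn.functional as F

def hybrid_loss(final_logits, logp, rewards, y):
    # final_logits: (M, C) last-step class scores of the M Monte Carlo copies
    # logp:         (M,) summed log-probabilities of the sampled locations
    #               over the S steps of each trajectory tau^(i)
    # rewards:      (M,) delayed rewards R^(i) from Eq. (8)
    # Eq. (9): cross-entropy on the copies' averaged prediction
    ce = F.cross_entropy(final_logits.mean(dim=0, keepdim=True), y.view(1))
    # Eq. (13): minimizing this surrogate term ascends R-bar; the mean-reward
    # baseline is an assumed addition beyond the paper, used to reduce variance
    baseline = rewards.mean()
    reinforce = -((rewards - baseline) * logp).mean()
    return ce + reinforce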
Algorithm 1: Training and Optimization

Require: sensory matrix x, label y, the length of episodes S, the number of Monte Carlo samples M.
Ensure: parameters Θ.
  Θ = RandomInitialize()
  while training do
    duplicate x for M times
    for i from 1 to M do
      l^{(i)}_1, l^{(i)}_2, l^{(i)}_3, t^{(i)} = RandomInitialize()
      for s from 1 to S do
        extract e^{s(i)}_1, e^{s(i)}_2, e^{s(i)}_3
        o^{s(i)}_1, o^{s(i)}_2, o^{s(i)}_3 ← Eq. 1
        o^{s(i)}_g, r^{s(i)}_g, h^{s(i)} ← Eq. 2, Eq. 3, Eq. 4
        l^{s(i)}_1, l^{s(i)}_2, l^{s(i)}_3, t^{s(i)} ← Eq. 5, Eq. 6
        ŷ^{s(i)} ← Eq. 7
        record τ^{(i)}_{1:s}
      end for
      R^{(i)} ← Eq. 8
    end for
    ŷ = (1/M) Σ_{i=1}^{M} ŷ^{S(i)}
    L_c, ∇_Θ R̄ ← Eq. 9, Eq. 13
    Θ ← Θ − ∇_Θ L_c + ∇_Θ R̄
  end while
  return Θ

We now introduce the settings in our experiments. The inputs are segmented with a fixed-length sliding time window with overlap. The size of each observation patch is a fixed fraction of K × P, where K × P is the size of the inputs. In the partial observation part, the linear layers parameterized by θ_e, θ_tl and θ_o have fixed output sizes. The convolutional layer in the shared observation module uses filters of width M, where M denotes the width of o^s_g. The LSTM cell size and the episode length are fixed hyperparameters, and the Gaussian distribution that defines the selection policies has a fixed variance.

To ensure rigor, the experiments are performed with Leave-One-Subject-Out (LOSO) evaluation on four datasets: MHEALTH [Banos et al., 2014], PAMAP2 [Reiss and Stricker, 2012], UCI HAR [Anguita et al., 2013] and MARS, each contributed by multiple subjects.
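A minimal sketch of the Leave-One-Subject-Out protocol, assuming a per-sample subject-id array; the helper name and the evaluation call in the comment are hypothetical.

import numpy as np

def loso_splits(subject_ids):
    # Yield (train_idx, test_idx) pairs, holding out one subject at a time.
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        yield np.where(subject_ids != s)[0], np.where(subject_ids == s)[0]

# Hypothetical usage: average a metric over all held-out subjects.
# scores = [evaluate(fit(X[tr], y[tr]), X[te], y[te])
#           for tr, te in loso_splits(ids)]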
To verify the overall performance of the proposed model, we first compare it with other state-of-the-art methods: a convolutional model on multichannel time series for HAR (MC-CNN) [Yang et al., 2015], a CNN-based multi-modal fusion model (C-Fusion) [Radu et al., 2018], a deep multimodal HAR model with classifier ensemble (MARCEL) [Guo et al., 2016], an ensemble of deep LSTM learners for activity recognition (E-LSTM) [Guan and Plötz, 2017], a parallel recurrent model with convolutional attentions (PRCA) [Chen et al., 2018] and a weighted-average spatial LSTM with selective attention (WAS-LSTM) [Zhang et al., 2018].

Table 1: The prediction performance of the proposed approach and other state-of-the-art methods (mean ± std). * denotes attention-based state-of-the-art. The best performance is indicated in bold. (Only the MC-CNN means survive in the extracted source; lost values are marked "…".)

MHEALTH
Method     MC-CNN   C-Fusion  MARCEL  E-LSTM  PRCA*  WAS-LSTM*  Ours*
Accuracy   87.19±…  …         …       …       …      …          …
Precision  86.50±…  …         …       …       …      …          …
Recall     87.29±…  …         …       …       …      …          …
F1         86.89±…  …         …       …       …      …          …

PAMAP2
Method     MC-CNN   C-Fusion  MARCEL  E-LSTM  PRCA*  WAS-LSTM*  Ours*
Accuracy   81.16±…  …         …       …       …      …          …
Precision  81.57±…  …         …       …       …      …          …
Recall     81.43±…  …         …       …       …      …          …
F1         81.50±…  …         …       …       …      …          …

UCI HAR
Method     MC-CNN   C-Fusion  MARCEL  E-LSTM  PRCA*  WAS-LSTM*  Ours*
Accuracy   75.86±…  …         …       …       …      …          …
Precision  76.93±…  …         …       …       …      …          …
Recall     75.81±…  …         …       …       …      …          …
F1         76.36±…  …         …       …       …      …          …

MARS
Method     MC-CNN   C-Fusion  MARCEL  E-LSTM  PRCA*  WAS-LSTM*  Ours*
Accuracy   81.34±…  …         …       …       …      …          …
Precision  81.68±…  …         …       …       …      …          …
Recall     81.06±…  …         …       …       …      …          …
F1         81.32±…  …         …       …       …      …          …

As can be observed in Table 1, MARCEL, E-LSTM, PRCA, WAS-LSTM and the proposed model perform better than MC-CNN and C-Fusion on MHEALTH and PAMAP2, as these models enjoy higher variance; they fit well when the data contain numerous features and complex patterns. On the other hand, the data in UCI HAR and MARS have fewer features, but MARCEL, E-LSTM, PRCA and our model still perform well while the performance of WAS-LSTM deteriorates. The reason is that WAS-LSTM is based on a complex structure and requires more features as input. In contrast, MARCEL and E-LSTM adopt rather simple models like DNNs and LSTMs; despite the ensembles, they remain suitable for fewer features. PRCA and the proposed model select the salient features directly with intuitive rewards, so they do not necessarily need a large number of features either. In addition, the attention-based methods, PRCA and WAS-LSTM, are more unstable than the other methods, since their selection is stochastic and they cannot guarantee the effectiveness of all the selected features. Overall, our model outperforms the compared state-of-the-art methods and eliminates the instability of regular selective attention.

We perform a detailed ablation study to examine the contributions of the proposed model components to the prediction performance; the results are reported in Table 2. There are five removable components in this model: (a) the modality selection module, (b) the transformation from e^s_i to o^s_i (i ∈ {1, 2, 3}), (c) the convolutional network for higher-level representations, (d) the temporal attentive selection, and (e) the multi-agent setting.

Table 2: Ablation study. S1-S6 are six structures obtained by systematically removing five components from the proposed model. The considered components are: (a) the selection module, (b) the partial observation processing from e^s_i to o^s_i (i ∈ {1, 2, 3}), (c) the convolutional merge of shared observations, (d) the temporal attentive selection, and (e) the multi-agent selection. (Only two surviving means remain per block in the extracted source; lost values are marked "…".)

Ablation   MHEALTH                               PAMAP2
           Accuracy  Precision  Recall  F1       Accuracy  Precision  Recall  F1
S1         85.75±…   …          …       …        …         …          …       …
S2         …         …          …       …        …         …          …       …
S3         …         …          …       …        …         …          …       …
S4         …         …          …       …        …         …          …       …
S5         …         …          …       …        …         …          …       …
S6         96.12±…   …          …       …        …         …          …       …

Ablation   UCI HAR                               MARS
           Accuracy  Precision  Recall  F1       Accuracy  Precision  Recall  F1
S1         73.68±…   …          …       …        …         …          …       …
S2         …         …          …       …        …         …          …       …
S3         …         …          …       …        …         …          …       …
S4         …         …          …       …        …         …          …       …
S5         …         …          …       …        …         …          …       …
S6         85.72±…   …          …       …        …         …          …       …

We consider six structures. S1: We first remove the selection module, including the observations, episodes, selections and rewards; for comparison, we set S1 to be a regular CNN as a baseline. S2: We employ one agent but remove (b), (c), (d) and (e), so the workflow is: inputs → e^s_i → LSTM → selections and rewards. The performance decreases and is more unstable than the other structures: although S2 includes attention, the agents do not include their previous selections in their observations, which influences their next decisions significantly. S3: Based on S2, we add (b) to (a).
Component (b) contributes considerably, noticeably improving the prediction results, because it feeds the history of selections back to the agents for learning. S4: We further consider (c) in the model. This setting achieves better performance than S3 since it convolutionally merges the partial observations. S5: (d) is added. The workflow is the same as S3, but the agents take an additional action, selecting t^s, which introduces another attention mechanism at the time level and further improves the performance. S6: The proposed model. When combining all these benefits, our model achieves the best performance, higher than S5 on all four datasets.

The proposed method decomposes the activities into participating motions, for each of which the agents decide the most salient modalities individually, which makes the model explainable. We present the visualized process of recognizing standing, going upstairs, running and lying on MHEALTH. The available features include three 3-axis accelerations from the chest, arms and ankles, two ECG signals, and two 3-axis angular velocity and two 3-axis magnetism vectors from the arms and ankles.

Figure 2: Visualization of the selected modalities and time on MHEALTH. Panels (a)-(l) show the selection heatmaps of locations l_1, l_2, l_3 for standing, going upstairs, running and lying. The input matrices' size is K × P, where K is the length of the time window and P is the number of modalities. Each grid cell denotes an input feature, and the values in the cells represent the frequency with which that feature is selected; lighter colors denote higher frequency. A detailed illustration is provided in Table 3.

Figure 2 shows the modality heatmaps of all agents. We observe that each agent does focus on only a part of the modalities within a time period during recognition. Table 3 lists the most frequently selected modalities. We can observe that magnetism (orientation) for standing and lying is selected as one of the most active features, owing to the fact that it is easy to distinguish between standing and lying with people's orientations.
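Such heatmaps can be reproduced, under assumptions, by counting how often each (time, modality) cell is covered by a selected observation patch over many episodes. The patch half-sizes below are illustrative, as the paper does not specify them.

import numpy as np

def selection_heatmap(selections, K, P, dk=8, dp=1):
    # selections: iterable of (t, l) patch centers, in index units, gathered
    # over many recognition episodes; dk, dp are assumed patch half-sizes.
    heat = np.zeros((K, P))
    for t, l in selections:
        heat[max(t - dk, 0):min(t + dk + 1, K),
             max(l - dp, 0):min(l + dp + 1, P)] += 1.0
    return heat / max(heat.max(), 1e-9)     # normalized frequency for display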
Table 3: The active modalities selected by the agents for each activity. X, Y, Z denote the data axes; Acc, Ang and Magn denote acceleration, angular velocity and magnetism, respectively. The most frequently selected locations are indicated in bold.
Activity        Agent  Location
Standing        1      Y,Z-Magn-Ankle, X,Y-Acc-Arm
                2      Y-Acc-Ankle, ECG1,2-Chest
                3      X,Y-Ang-Ankle
Going Upstairs  1      Y,Z-Acc-Ankle, X,Y-Ang-Ankle
                2      X,Y,Z-Acc-Ankle, X,Y-Ang-Ankle
                3      Y,Z-Acc-Arm, X,Y,Z-Ang-Arm
Running         1      Y,Z-Acc-Chest, ECG1,2-Chest, X,Y,Z-Acc-Ankle
                2      X,Y,Z-Acc-Arm, X,Y-Ang-Arm
                3      Y,Z-Acc-Arm, X,Y,Z-Ang-Arm
Lying           1      Y,Z-Acc-Chest, ECG1,2-Chest
                2      Y,Z-Magn-Ankle
                3      X,Y,Z-Magn-Arm

Another example is that the most distinguishing characteristic of going upstairs is "up"; therefore, Z-axis acceleration is specifically selected by the agents for going upstairs. Also, identifying running involves acceleration, ECG and arm swing, which conforms to the experimental evidence as well. The agents also select several other features with lower frequencies, which avoids losing effective information.

Conclusion

In this work, we first propose a selective attention method for the spatially-temporally varying salience of features. Then, a multi-agent approach is proposed to represent activities with collective motions; the agents cooperate by aligning their actions to achieve their common recognition target. We experimentally evaluate our model on four real-world datasets, and the results validate the contributions of the proposed model.

References

[Anguita et al., 2013] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In ESANN, 2013.
[Banos et al., 2014] Oresti Banos, Rafael Garcia, Juan A. Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, and Claudia Villalonga. mHealthDroid: a novel framework for agile development of mobile health applications. In International Workshop on Ambient Assisted Living, pages 91-98. Springer, 2014.

[Cai et al., 2009] Chenghui Cai, Xuejun Liao, and Lawrence Carin. Learning to explore and exploit in POMDPs. In Advances in Neural Information Processing Systems (NIPS), pages 198-206, 2009.

[Chen et al., 2018] Kaixuan Chen, Lina Yao, Xianzhi Wang, Dalin Zhang, Tao Gu, Zhiwen Yu, and Zheng Yang. Interpretable parallel recurrent neural networks with convolutional attentions for multi-modality activity modeling. In Neural Networks (IJCNN), 2018 International Joint Conference on, pages 3016-3021. IEEE, 2018.

[Denil et al., 2012] Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151-2184, 2012.

[Freedman and Zilberstein, 2018] Richard Gabriel Freedman and Shlomo Zilberstein. Roles that plan, activity, and intent recognition with planning can play in games. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[Guan and Plötz, 2017] Yu Guan and Thomas Plötz. Ensembles of deep LSTM learners for activity recognition using wearables. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 1(2):11, 2017.

[Guo et al., 2016] Haodong Guo, Ling Chen, Liangying Peng, and Gencai Chen. Wearable sensor based multimodal human activity recognition exploiting the diversity of classifier ensemble. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2016), pages 1112-1123, 2016.

[Guo et al., 2017] Xiaonan Guo, Jian Liu, and Yingying Chen. FitCoach: Virtual fitness coach empowered by wearable mobile devices. In IEEE INFOCOM, pages 1-9, 2017.

[Hammerla et al., 2016] Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. Deep, convolutional, and recurrent models for human activity recognition using wearables. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), pages 1533-1540, 2016.

[Lara and Labrador, 2013] Oscar D. Lara and Miguel A. Labrador. A survey on human activity recognition using wearable sensors. IEEE Communications Surveys & Tutorials, 15(3):1192-1209, 2013.

[Mnih et al., 2014] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204-2212, 2014.

[Pouyanfar et al., 2018] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria Presa Reyes, Mei-Ling Shyu, Shu-Ching Chen, and S. S. Iyengar. A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys (CSUR), 51(5):92, 2018.

[Radu et al., 2018] Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D. Lane, Cecilia Mascolo, Mahesh K. Marina, and Fahim Kawsar. Multimodal deep learning for activity and context recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 1(4):157, 2018.

[Reiss and Stricker, 2012] Attila Reiss and Didier Stricker. Introducing a new benchmarked dataset for activity monitoring. In Wearable Computers (ISWC), 2012 16th International Symposium on, pages 108-109. IEEE, 2012.

[Wang and Wang, 2017] Hongsong Wang and Liang Wang. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In The Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

[Yang et al., 2015] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiaoli Li, and Shonali Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In IJCAI, pages 3995-4001, 2015.

[Zhang et al., 2018] Xiang Zhang, Lina Yao, Chaoran Huang, Sen Wang, Mingkui Tan, Guodong Long, and Can Wang. Multi-modality sensor data classification with selective attention. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018), pages 3111-3117, 2018.

[Zontak et al., 2013] Maria Zontak, Inbar Mosseri, and Michal Irani. Separating signal from noise using patch recurrence across scales. In The Conference on Computer Vision and Pattern Recognition (CVPR), 2013.