Mining Mid-level Features for Action Recognition Based on Effective Skeleton Representation
Pichao Wang, Wanqing Li, Philip Ogunbona, Zhimin Gao, Hanling Zhang
Pichao Wang, Wanqing Li, Philip Ogunbona and Zhimin Gao
University of Wollongong, Wollongong, NSW, Australia
[email protected], {wanqing, philipo}@uow.edu.au, [email protected]

Hanling Zhang
Hunan University, P. R. China
jt [email protected]
Abstract—Recently, mid-level features have shown promising performance in computer vision. Mid-level features learned by incorporating class-level information are potentially more discriminative than traditional low-level local features. In this paper, an effective method is proposed to extract mid-level features from Kinect skeletons for 3D human action recognition. Firstly, the orientations of limbs connected by two skeleton joints are computed and each orientation is encoded into one of the 27 states indicating the spatial relationship of the joints. Secondly, limbs are combined into parts and the limb states are mapped into part states. Finally, frequent pattern mining is employed to mine the most frequent and relevant (discriminative, representative and non-redundant) states of parts in several continuous frames. These parts are referred to as Frequent Local Parts or FLPs. The FLPs allow us to build a powerful bag-of-FLP-based action representation. This new representation yields state-of-the-art results on MSR DailyActivity3D and MSR ActionPairs3D.
I. INTRODUCTION
Human action recognition has been an active research topic in computer vision due to its wide range of applications, such as smart surveillance and human-computer interaction. Despite remarkable research efforts and encouraging advances in the past decade, accurate recognition of human actions is still an open problem.

A common and intuitive way to represent human motion is to use a sequence of skeletons. With the development of cost-effective depth cameras and algorithms for real-time pose estimation [1], skeleton extraction has become more and more robust, and skeleton-based action representation is becoming one of the most practical and promising approaches. To date, skeleton-based approaches have primarily focused on low-level features and modelled the dynamics of the skeletons holistically, such as the moving pose [2] and trajectories of human joints [3]. A full skeletal description is highly subject to the noise introduced during the extraction of the skeleton, and it is less effective when some actions involve motion of the whole body while others are performed using only a small number of body parts. A key fact we observed is that, along the temporal axis of an action, only a few body parts in several continuous frames are activated during the performance of the action. These parts are more robust and discriminative for representing an action. Our method takes advantage of this observation to capture mid-level features for action recognition.

Inspired by the mid-level feature mining techniques [4] for image classification, we propose a new scheme that applies pattern mining to obtain the most relevant combinations of parts in several continuous frames for action recognition, rather than utilizing all the joints as most previous works did. In particular, a new descriptor called bag-of-FLPs is proposed to describe an action, as illustrated in Fig. 1.

Fig. 1. The general framework of the proposed method.

The overall process of our method can be divided into four steps: feature extraction, building transactions, mining & selecting relevant patterns, and building bag-of-FLPs & classification. We first compute the orientations of limbs, i.e. connected joints, and then encode each orientation into one of the 27 states indicating the spatial relationship of the joints. Limbs are combined into parts and limb states are mapped to part states. Local temporal information is included by combining the part states of several (say, 5) continuous frames into one transaction for mining, with each state as one item. In order to keep motion information after frequent pattern mining, the unique states of parts over the continuous frames are retained, removing the repeated ones, ensuring that both pose information and motion information are included in each transaction. The most relevant patterns, which we refer to as FLPs, are mined and selected to represent frames, and a bag-of-FLPs is built as a new representation for a whole action. The new representation is robust to errors in the features, because the errors are usually not frequent patterns.

Our main contributions include the following four aspects. First, an effective and efficient method is proposed to extract skeleton features. Second, a novel method is developed to explore spatial and temporal information in skeleton data simultaneously. Third, an effective scheme is proposed for applying pattern mining to action recognition by adapting generic pattern mining tools to the skeleton features. Our scheme is robust to noise, as most noisy data does not form frequent patterns. In addition, our scheme has achieved state-of-the-art results on several benchmark datasets.

The rest of the paper is organized as follows. Section II reviews the related work. Section III presents our scheme in detail. Section IV shows experimental results. The conclusion is made in Section V.

II. RELATED WORK
The process of action recognition can generally be divided into two main steps: action representation and action classification. Action representation consists of feature extraction and feature selection. Features can be extracted from input sources such as depth maps, skeletons and/or RGB images. Regardless of the input source, there are two main approaches to the representation of actions: the space-time approach and the sequential approach [5], [6], [7]. The space-time approach usually extracts local or holistic features from a space-time volume, without explicit modelling of temporal dynamics. By contrast, the sequential approach normally extracts local features from each frame of the input source and models the dynamics explicitly. Action classification is the step of learning a classifier based on the action representation and classifying any new observations using the classifier. For space-time approaches, a discriminative classifier, such as a Support Vector Machine (SVM), is often used for classification. For sequential approaches, generative statistical models, such as the Hidden Markov Model (HMM), are commonly used. Our method belongs to the skeleton-based space-time volume approach. In this section, we mainly review existing work on skeleton-based action representation for action recognition.

For the skeleton-based sequential approach, Xia et al. [8] proposed a feature called Histograms of 3D Joint Locations (HOJ3D) as a representation of postures. The HOJ3D essentially encodes spatial occupancy information relative to the root joint, e.g. the hip centre. A modified spherical coordinate system is defined on the root joint and the 3D space is divided into N bins. The HOJ3D is reprojected using Linear Discriminant Analysis (LDA) to reduce dimensionality and then clustered into K posture visual words which represent the prototypical poses of actions. HMMs are adopted to model the visual words and recognize actions. Radial distance is adopted in this spherical coordinate system, which makes the method to some extent view-invariant.

Koppula et al. [9] explicitly modelled the motion hierarchy to enable their method to handle simple human-object interactions. The human activities and object affordances are jointly modelled as a Markov Random Field (MRF), where the nodes represent objects and sub-activities, and the edges represent the relationships between object affordances, their relations with sub-activities, and their evolution over time. Feature vectors that represent the object's location and the changing information in the scene are defined by training a Structural Support Vector Machine (SSVM). Similar to this approach, Sung et al. [10] proposed a hierarchical two-layer Maximum Entropy Markov Model (MEMM) to represent an activity. The lower-layer nodes represent sub-activities while higher-level nodes describe more complex activities; for example, "lifting left hand" and "pouring water" can be described as a sub-activity and a complex activity, respectively. Wang et al. [11] proposed a Local Occupancy Patterns (LOP) feature calculated from the 3D point cloud around a particular joint to discriminate different types of interactions, and Fourier Temporal Pyramid (FTP) to represent the temporal structure. Based on these two types of features, a model called the Actionlet Ensemble Model (AEM) is proposed, which is a combination of the features for a subset of the joints.
Due to the numerous actionlets, a data mining technique is used to discover discriminative actionlets. Both skeleton and point cloud information are utilized to recognize human-object interactions.

For the skeleton-based space-time volume approach, Yang et al. [12] proposed a new feature descriptor called EigenJoints, which contains posture features, motion features and offset features. The pair-wise joint differences in the current frame and its consecutive frames are used to encode the spatial and temporal information, and these are called posture features and motion features, respectively. The difference of a pose with respect to the initial pose is called the offset feature. The initial pose is generally assumed to be a neutral pose. The three channels are normalized and Principal Component Analysis (PCA) is applied to reduce redundancy and noise to obtain the EigenJoints descriptor. A Naive-Bayes-Nearest-Neighbor (NBNN) classifier is adopted to recognize actions. Gowayyed et al. [3] proposed a new descriptor called Histograms of Oriented Displacements (HOD) to recognize actions. The displacement of each joint votes with its length in a histogram of oriented angles. Each 3D trajectory is represented by the HOD of its three 2D projections. In order to preserve temporal information, a temporal pyramid is proposed, where trajectories are considered as a whole, halves and quarters, and then all the descriptors in these three levels are concatenated to form the final descriptor. A linear SVM is used to classify actions based on the histograms. Similar to this work, Hussein et al. [13] proposed a descriptor called Covariance of 3D Joints (Cov3DJ) for human action recognition. This descriptor uses a covariance matrix to capture the dependence of the locations of different joints on one another during an action. In order to capture the order of motion in time, a hierarchy of Cov3DJs is used, similarly to the work in [3].

Zanfir et al. [2] proposed a descriptor called the moving pose, which is formed by the position, velocity and acceleration of skeleton joints within a short time window around the current frame. To learn discriminative poses, a modified k-Nearest Neighbours (kNN) classifier is used that considers both the temporal location of a particular frame within the action sequence and the discrimination power of its moving pose descriptor compared to other frames in the training set. Thanh et al. [14] extracted key frames, which are the central frames in short temporal segments of videos, and labelled each key frame as a pattern for a unit action. An improved Term Frequency-Inverse Document Frequency (TF-IDF) method is used to learn the discriminative patterns, and the learned patterns are defined as local features for action recognition. Wang et al. [15] first estimated human joint positions from videos and then grouped the estimated joints into five parts. Each action is represented by computing sets of co-occurring spatial and temporal configurations of body parts. The authors use a bag-of-words method with the extracted features for classification. Ohn-Bar and Trivedi [16] tracked the joint angles and built a descriptor based on similarities between angle trajectories. This feature is further combined with a double-HOG descriptor that accounts for the spatio-temporal distribution of depth values around the joints. Theodorakopoulos et al. [17] initially transformed the skeleton data from sensor coordinates to the torso PCA frame in order to gain a robust and invariant pose representation.
Sparse coding in dissimilarity space is utilized to sparsely represent the actions. Chaaraoui et al. [18] proposed to use an evolutionary algorithm to determine the optimal subset of skeleton joints, taking into account the topological structure of the skeleton.

To fuse depth-based features with skeleton-based features, Althloothi et al. [19] presented two sets of features: features for shape representation, extracted from depth data by using a spherical harmonics representation, and features for kinematic structure, extracted from skeleton data by estimating 3D joint positions. The shape features are used to describe the 3D silhouette structure while the kinematic features are used to describe the movement of the human body. Both sets of features are fused at the kernel level for action recognition by using the Multiple Kernel Learning (MKL) technique. In a similar direction, Chaaraoui et al. [20] proposed a fusion method to combine skeleton and silhouette-based features. The skeletal features are obtained by normalising the 3D positions of the original skeleton data while the silhouette-based features are generated by extracting contour points of the silhouette. After feature fusion, a model called bag of key poses is employed for action recognition. The key poses are obtained by the K-means clustering algorithm and the words are made up of key poses. In the recognition stage, unknown video sequences are classified based on sequence matching. Rahmani et al. [21] proposed an algorithm combining the discriminative information from depth maps as well as from 3D joint positions for action recognition. To avoid the suppression of subtle discriminative information, local information integration and normalization are performed. Joint importance is encoded by using joint motion volume. A Random Decision Forest (RDF) is trained to select the discriminant features. Because of the low dimensionality of their features, their method turns out to be efficient.

Most of the above methods are based on low-level features and need the whole skeletal description, which leads to their weak adaptation to noise. In addition, most of them need to explore the spatial and temporal information separately and then combine them together. Besides, most of the methods used to explore temporal information are subject to the neutral poses, which are shared by all actions. In our method, by contrast, we use a parts-based mid-level feature to represent actions and explore the spatial and temporal information simultaneously. This makes our method more robust.

III. PROPOSED METHOD
The overall process of the proposed method is illustrated in Fig. 1. It can be divided into four steps: feature extraction, building transactions, mining & selecting relevant patterns, and building bag-of-FLPs & classification.
A. Feature Extraction
In our method, the orientations of human body limbs are considered as low-level features and they can be calculated from the two joints of each limb. For Kinect skeleton data, joint positions, as shown in Fig. 2, are tracked [1]. The skeleton data is first normalized using Algorithm 1 in [2] to suppress noise in the original skeleton data and to compensate for length variations across different subjects and different body parts. Each joint i has 3 coordinates, denoted as (x_i, y_i, z_i) after normalization.

Fig. 2. The human joints tracked with the skeleton tracker [1].
For the Kinect skeleton, it is found that the Hand Left, Hand Right, Foot Left, Foot Right and Spine joints are often not reliable, and hence they are not used in our method. Thus, there are 15 joints and 14 limbs. The joint Head is considered as the origin of the 15 points. For each limb, we compute a unit difference vector between its two joints:

(∆x_ij, ∆y_ij, ∆z_ij) = ((x_i, y_i, z_i) − (x_j, y_j, z_j)) / d_ij    (1)

where i and j represent the current joint and the reference joint, respectively, and d_ij is the Euclidean distance between the two joints. For example, as illustrated in Fig. 1, to compute the orientation of the limb between the Hand Right and Wrist Right joints (highlighted in red), the Wrist Right joint is regarded as the sphere centre and Eq. (1) is used to compute the unit difference vector.

Each element of the unit difference vector is quantized into three states: −1, 0 and 1. If |∆x_ij| ≤ threshold then q(∆x_ij) = 0; if ∆x_ij > threshold then q(∆x_ij) = 1; otherwise q(∆x_ij) = −1. Thus, there are 27 possible states for each unit difference vector, and each state is encoded as one element of a feature vector, so the dimension of the feature vector for each pose is 14 × 27 = 378 after concatenating the feature vectors of the 14 limbs.
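As a concrete illustration, the following minimal Python sketch implements Eq. (1) and the 3-state quantization; the function name limb_state, the threshold value and the toy coordinates are our own illustrative choices, not from the paper.

```python
import numpy as np

THRESHOLD = 0.1  # empirical; the paper tunes it per dataset


def limb_state(joint_i, joint_j, threshold=THRESHOLD):
    """Encode the orientation of the limb (i, j) as one of 27 states."""
    diff = np.asarray(joint_i, dtype=float) - np.asarray(joint_j, dtype=float)
    unit = diff / np.linalg.norm(diff)          # Eq. (1): unit difference vector

    # Quantize each component into {-1, 0, 1} around the threshold.
    q = np.where(np.abs(unit) <= threshold, 0, np.sign(unit)).astype(int)

    # Map (q_x, q_y, q_z) in {-1, 0, 1}^3 to a state index in [0, 26];
    # collecting one such index per limb gives the 14-dimensional
    # per-frame vector (equivalently, the sparse 14 x 27 = 378 encoding).
    return int((q[0] + 1) * 9 + (q[1] + 1) * 3 + (q[2] + 1))


# Toy example: limb between a hand joint and a wrist joint.
print(limb_state([0.42, 0.88, 2.01], [0.40, 0.71, 2.03]))
```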
For each element of the feature vector, if the corresponding orientation between the two joints is mapped to that state, the position is set to 1; otherwise it is 0. Therefore, the feature vectors are very sparse: only 14 positions in each feature vector are 1 (non-zero). The threshold is an empirical value which depends on the noise characteristics of the skeleton data.

For each frame of skeleton data, a quantized 378-dimensional feature vector is calculated as described above. This feature vector is reduced to a 14-dimensional feature vector, with each element being the index of a non-zero element of the 378-dimensional feature vector.

To extract mid-level features for action representation, the 14 limbs are combined into 7 body parts. As illustrated in Fig. 1, the dotted line contains the joints Hand Right, Wrist Right and Elbow Right, and these three limbs form one part. In this way, seven body parts are formed, namely, Head-Shoulder Center, Shoulder Center-Shoulder Left-Elbow Left-Wrist Left, Shoulder Center-Shoulder Right-Elbow Right-Wrist Right, Shoulder Center-Hip Center-Hip Left, Hip Left-Knee Left-Ankle Left, Shoulder Center-Hip Center-Hip Right and Hip Right-Knee Right-Ankle Right. According to the Degree of Freedom (DoF) of the joints [22], each body part is encoded with a different number of states, and the total number of states is denoted as N_DF, which is currently an empirical parameter. It should be adjusted according to the complexity of the actions to be recognized and the noise level of the dataset.

To explore temporal information and at the same time keep motion information after frequent pattern mining (in general, frequent pattern mining can only mine the most frequent patterns, which cannot be guaranteed to be discriminative patterns), a novel approach is proposed. Seven states for each frame are obtained after combination, and the unique states of C continuous frames, as illustrated in Fig. 1, where C = 3, are counted and form a new mid-level feature vector, denoted as {f_i | i = 1, ..., n_A}. This new feature vector contains both the pose information of the current frame and the motion information in the C continuous frames, because the repeated states in the continuous frames can be regarded as static pose information while the states that differ across frames capture the motion information. This feature vector is used to build the transactions described in the next section. The patterns after mining can be combinations of several body parts in different frames, so the temporal order information can easily be maintained.

B. Building Transactions
Each instance of action A is represented by a set of the above mid-level features {f_i | i = 1, ..., n_A} and a class label c, c ∈ {1, ..., C}. The set of features for all the action samples is denoted by Ω. The dimensionality of the feature vector is denoted as W, and in our case |W| ≥ 7.
1) Items, Transactions and Frequencies:
Each element in a feature vector for C continuous poses is defined as an item, denoted as ω, where ω ∈ (0, N_DF] and ω ∈ ℕ. The set of transactions X is created from the set Ω as follows. For each x ∈ Ω there is one transaction x (i.e. a set of items). This transaction x contains all the items ω_j. A local pattern is an itemset t ⊆ Γ, where Γ represents the set of all possible items. For a local pattern t, the set of transactions that include the pattern t is defined as X(t) = {x ∈ X | t ⊆ x}. The frequency of t is |X(t)|, also known as the support of the pattern t, or supp(t).
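To make the transaction construction concrete, here is a small Python sketch that forms one transaction from the unique part states of C consecutive frames and counts the support of a pattern; the helper names are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def build_transaction(part_states: Sequence[Sequence[int]]) -> FrozenSet[int]:
    """One transaction: the unique part-state items of C consecutive frames.

    part_states[f] holds the 7 encoded part states of frame f. Repeated
    states across frames carry static pose information; states that differ
    between frames carry the motion information."""
    return frozenset(item for frame in part_states for item in frame)


def support(pattern: FrozenSet[int], transactions: List[FrozenSet[int]]) -> int:
    """supp(t) = |X(t)|, the number of transactions containing pattern t."""
    return sum(1 for x in transactions if pattern <= x)


# Toy example with C = 3 frames, 7 part states per frame.
frames = [[3, 17, 42, 55, 60, 71, 88],
          [3, 17, 42, 55, 61, 71, 88],   # one part changed state: item 61
          [3, 18, 42, 55, 61, 71, 88]]   # another part changed: item 18
x = build_transaction(frames)
print(sorted(x))                         # 9 unique items (pose + motion)
print(support(frozenset({3, 42}), [x]))  # -> 1
```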
2) Frequent Local Part:
For a given constant T, also known as the minimum support threshold, a local pattern t is frequent if supp(t) ≥ T. A pattern t is said to be closed if there exists no pattern t′ such that t ⊂ t′ and supp(t) = supp(t′). The set of frequent closed patterns is a compact representation of the frequent patterns, and such a frequent and closed local part pattern is referred to as a Frequent Local Part or FLP.

C. Mining & Selecting Relevant FLPs

1) FLPs Mining:
Given the set of transactions X, any existing frequent pattern mining algorithm can be used to find the set of FLPs Υ. In our work, the optimised LCM algorithm [23] is used, as in [4].
LCM uses a prefix preserving closureextension to completely enumerate closed itemsets.
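LCM itself is a specialised closed-itemset miner; as a rough stand-in for illustration, the sketch below enumerates frequent itemsets by brute force and then keeps only the closed ones, directly following the definitions of "frequent" and "closed" above. It does not reproduce LCM's prefix-preserving closure extension or its efficiency.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def mine_closed_frequent(transactions: List[FrozenSet[int]],
                         min_support: int,
                         max_size: int = 3) -> Dict[FrozenSet[int], int]:
    """Brute-force stand-in for a closed frequent itemset miner such as LCM.

    Returns every itemset of up to max_size items that is frequent
    (supp(t) >= min_support) and closed (no proper superset in the result
    has the same support)."""
    def supp(t):
        return sum(1 for x in transactions if t <= x)

    items = sorted({i for x in transactions for i in x})
    frequent = {}
    for size in range(1, max_size + 1):          # naive enumeration; real
        for combo in combinations(items, size):  # miners prune aggressively
            t = frozenset(combo)
            s = supp(t)
            if s >= min_support:
                frequent[t] = s

    # Keep only closed patterns: drop t if a proper superset has equal support.
    return {t: s for t, s in frequent.items()
            if not any(t < u and su == s for u, su in frequent.items())}
```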
2) Encoding a New Action with FLPs:
Given a new action, the features can be extracted according to Section III-A and each feature vector can be converted into a transaction x. For each FLP pattern t ∈ Υ it can be checked whether t ⊆ x; if t ⊆ x is true, then x is an instance of the FLP pattern t. The frequency of a pattern t in a given action A_j (i.e. the number of instances of t in A_j) is denoted as F(t|A_j).
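Computing F(t|A_j) then amounts to counting the transactions of A_j that contain t; a minimal sketch with hypothetical names:

```python
from typing import Dict, FrozenSet, List


def flp_frequencies(flps: List[FrozenSet[int]],
                    action_transactions: List[FrozenSet[int]]
                    ) -> Dict[FrozenSet[int], int]:
    """F(t | A_j) for every mined FLP t: the number of transactions x of
    the action with t <= x, i.e. the instances of t in the action."""
    return {t: sum(1 for x in action_transactions if t <= x) for t in flps}
```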
3) Selecting the Best FLPs for Action Recognition:
The FLPs set Υ is considered as a candidate set of mid-level features to represent an action. The most useful FLP patterns need to be selected from Υ because i) the number of generated FLP patterns is huge and ii) not all discovered FLP patterns are equally important to the action recognition task. Usually, relevant patterns are those that are discriminative and non-redundant. On top of that, a new criterion, representativity, is also used: some patterns may be frequent and appear to be discriminative, but they may occur in very few actions (e.g. a noise pose). Such features are not representative and therefore not the best choice for action recognition. A good FLP pattern should be at the same time discriminative, representative and non-redundant. In this section, how to select such patterns is discussed.

The methods used in [4] are followed to find the most suitable pattern subset χ, where χ ⊂ Υ, for action recognition. To do this, the gain of a pattern t (s.t. t ∉ χ and t ∈ Υ) is denoted by G(t) and defined as follows:

G(t) = S(t) − max_{s∈χ} { R(s,t) · min(S(t), S(s)) }    (2)

where S(t) is the overall relevance of a pattern t and R(s,t) is the redundancy between two patterns s and t. In Eq. (2), a pattern t has a higher gain G(t) if it has a higher relevance S(t) (i.e. it is discriminative and representative) and if it is non-redundant with every pattern s in the set χ (i.e. R(s,t) is small).

S(t) is defined as:

S(t) = D(t) × O(t),    (3)

and R(s,t) is defined as:

R(s,t) = exp{ −[ p(t) · D_KL( p(A|t) ‖ p(A|{t,s}) ) + p(s) · D_KL( p(A|s) ‖ p(A|{t,s}) ) ] }.    (4)

Following a similar approach to [24] for finding the affinity between patterns, two patterns t and s ∈ Υ are redundant if they follow similar document distributions, i.e. if p(A|t) ≈ p(A|s) ≈ p(A|{t,s}), where p(A|{t,s}) is the document distribution given both patterns {t,s}.

In Eq. (3), D(t) is the discriminability score. Following the entropy-based approach in [25], a high value of D(t) implies that the pattern t occurs only in very few classes. O(t) is the representativity score for a pattern t; it considers the divergence between the optimal distribution for class c, p(A|t*_c), and the distribution for pattern t, p(A|t), and then takes the best match over all classes. The optimal distribution is such that i) the pattern occurs only in actions of class c, i.e. p(c|t*_c) = 1 (giving also a discriminability score of 1), and ii) the pattern instances are equally distributed among all the actions of class c, i.e. ∀ A_j, A_k in class c, p(A_j|t*_c) = p(A_k|t*_c) = 1/N_c, where N_c is the number of samples of class c. An optimal pattern for class c, denoted by t*_c, is a pattern which has the above two properties.

The discriminability score and representativity score are defined as:

D(t) = 1 + ( Σ_c p(c|t) · log p(c|t) ) / log C,    (5)

O(t) = max_c ( exp{ −D_KL( p(A|t*_c) ‖ p(A|t) ) } )    (6)

where p(c|t) is the probability of class c given the pattern t, computed as follows:

p(c|t) = ( Σ_{j=1}^N F(t|A_j) · p(c|A_j) ) / ( Σ_{j=1}^N F(t|A_j) );    (7)

D_KL(·‖·) is the Kullback-Leibler divergence between two distributions; p(A|t) is computed empirically from the frequencies F(t|A_j) of the pattern t:

p(A|t) = F(t|A) / Σ_j F(t|A_j).    (8)

Here, A_j is the j-th action and N is the total number of actions in the dataset; p(c|A_j) = 1 if the class label of A_j is c and 0 otherwise; p(A|t*_c) is the optimal distribution with respect to a class c.

In Eq. (4), p(t) is the probability of pattern t, defined as:

p(t) = ( Σ_{A_j} F(t|A_j) ) / ( Σ_{t_j∈Υ} Σ_{A_j} F(t_j|A_j) ),    (9)

while p(A|{t,s}) is the document distribution given both patterns {t,s}, defined as:

p(A|{t,s}) = ( F(t|A) + F(s|A) ) / ( Σ_j ( F(t|A_j) + F(s|A_j) ) ).    (10)

To find the best K patterns the following greedy process is used. First, the most relevant pattern is added to the relevant pattern set χ. Then the pattern with the highest gain (non-redundant but relevant) is found and added to the set χ, until K patterns are added (or until no more relevant patterns can be found). For a more detailed discussion, the reader is referred to [4].
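The scoring and greedy selection can be prototyped directly from the frequency matrix F(t|A_j). The NumPy sketch below follows Eqs. (2)-(10); the epsilon smoothing inside the KL divergence and the stopping rule when no positive gain remains are our implementation choices, not specified in the paper.

```python
import numpy as np


def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) with epsilon smoothing."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))


def select_flps(F, labels, K):
    """Greedy FLP selection following Eqs. (2)-(10).

    F: (P, N) array with F[t, j] = F(t | A_j); labels: length-N class ids."""
    P, N = F.shape
    classes = np.unique(labels)
    C = len(classes)

    pAt = F / F.sum(axis=1, keepdims=True)                 # Eq. (8): p(A|t)
    pct = np.stack([pAt[:, labels == c].sum(axis=1)        # Eq. (7): p(c|t)
                    for c in classes], axis=1)
    # Eq. (5): discriminability from the class entropy of each pattern.
    D = 1.0 + np.sum(pct * np.log(pct + 1e-12), axis=1) / np.log(C)
    # Eq. (6): representativity against the optimal distribution p(A|t*_c).
    O = np.zeros(P)
    for c in classes:
        star = (labels == c) / np.count_nonzero(labels == c)
        O = np.maximum(O, [np.exp(-kl(star, pAt[t])) for t in range(P)])
    S = D * O                                              # Eq. (3)
    pt = F.sum(axis=1) / F.sum()                           # Eq. (9)

    def R(s, t):                                           # Eqs. (4) and (10)
        both = (F[s] + F[t]) / (F[s] + F[t]).sum()
        return np.exp(-(pt[t] * kl(pAt[t], both) + pt[s] * kl(pAt[s], both)))

    chosen = [int(np.argmax(S))]           # most relevant pattern first
    while len(chosen) < K:
        gains = [S[t] - max(R(s, t) * min(S[t], S[s]) for s in chosen)
                 if t not in chosen else -np.inf for t in range(P)]  # Eq. (2)
        best = int(np.argmax(gains))
        if gains[best] <= 0:               # no more relevant patterns
            break
        chosen.append(best)
    return chosen
```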
D. Building Bag-of-FLPs & Classification

After computing the K most relevant and non-redundant FLPs, each action can be represented by a new representation called a bag-of-FLPs, obtained by counting the occurrences of such FLPs in the action. Let L be such a bag-of-FLPs for action A_L and M be the bag-of-FLPs for action A_M.

An SVM [26] is trained to classify the actions. The SVM uses the following kernel to calculate the similarity between the bag-of-FLPs L and M:

K(L, M) = Σ_i min( √L(i), √M(i) )    (11)

Here L(i) is the frequency of the i-th selected pattern in histogram L. It is a standard histogram intersection kernel with non-linear weighting. This reduces the importance of highly frequent patterns and is necessary since there is a large variability in pattern frequencies.
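The kernel of Eq. (11) can be supplied to any SVM implementation that accepts precomputed Gram matrices. The paper uses LIBSVM [26]; the sketch below assumes scikit-learn's SVC instead, purely for illustration, with random toy histograms standing in for mined pattern counts.

```python
import numpy as np
from sklearn.svm import SVC


def flp_kernel(H1, H2):
    """Eq. (11): histogram intersection of square-rooted bag-of-FLPs.

    H1: (n1, K) and H2: (n2, K) pattern-frequency histograms; returns the
    (n1, n2) Gram matrix K(L, M) = sum_i min(sqrt(L(i)), sqrt(M(i)))."""
    return np.minimum(np.sqrt(H1)[:, None, :],
                      np.sqrt(H2)[None, :, :]).sum(axis=2)


# Toy bag-of-FLPs histograms (random stand-ins for mined pattern counts).
rng = np.random.default_rng(0)
train_hist = rng.integers(0, 10, size=(6, 30)).astype(float)
train_y = np.array([0, 0, 1, 1, 2, 2])
test_hist = rng.integers(0, 10, size=(2, 30)).astype(float)

svm = SVC(kernel="precomputed")
svm.fit(flp_kernel(train_hist, train_hist), train_y)
print(svm.predict(flp_kernel(test_hist, train_hist)))
```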
IV. EXPERIMENTAL RESULTS

Two benchmark datasets, MSR DailyActivity3D [27] and MSR ActionPairs3D [28], were used to evaluate the proposed method, and the results are compared with those reported in other papers on the same datasets under the same training and testing configuration.
A. Experimental Setup
In our method, there are several parameters that need to be tuned: the threshold T, the number of states N_DF, the number of relevant patterns K, the number of continuous frames C, the minimum support S and the maximum support U. For different datasets, different sets of parameters were learned through cross-validation to optimise the performance. Specifically, two-thirds of the entire training dataset was used for training and the remaining one-third for validation to tune the parameters. The ranges of the parameters are empirical. In general, the threshold T depends on the noise level of the dataset: the higher the noise, the larger its value. This is an important parameter because it affects the states of the limbs computed from the skeleton data. However, this sensitivity can be reduced by setting a large number of states N_DF (i.e. over 600). The number of relevant patterns K depends on the complexity of the actions to be recognized: the more actions in the dataset, the larger it should be. The number of continuous frames C is affected by the complexity of the temporal information required to encode the actions. If the dataset has paired actions, that is, the two actions of each pair are similar in motion (have similar trajectories) and shape (involve similar objects), the value of C should be large. However, a large C leads to high memory and post-processing requirements. The values of the minimum support S and maximum support U affect the number of patterns generated before pattern selection. We observed that if S is large, U should also be large; if S is small, U should also be small. Generally, S and U are set to reduce the computational time for post-processing. In fact, many combinations of these two parameters yield the best results; in other words, the performance of the proposed method is not very sensitive to the choice of S and U.
The MSR DailyActivity3D dataset consists of 10 subjects and 16 activities: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, sit down. Fig. 3 shows some sample frames for the activities.

Fig. 3. Sample frames of the MSR DailyActivity3D dataset.

Each subject performed each activity twice, once in standing position and once in sitting position. In total, there are 320 samples. This dataset has large intra-class variations and involves human-object interactions, which is challenging for recognition based only on 3D joints. Experiments were performed using the cross-subject test setting described in [2], i.e. five subjects (1, 2, 3, 4, 5) were used for training and the remaining 5 subjects were used for testing. Table I shows the results of our method compared with other published results.
TABLE I. COMPARISON ON MSR-DAILYACTIVITY DATASET

Methods                                      Accuracy (%)
Dynamic Temporal Warping [29]                54.0
Moving Pose [2]                              73.8
Actionlet Ensemble on Joint Features [11]    74.0
Proposed Method                              78.8

For this dataset, T = 0., N_DF = 600, K = 30000, C = 3, S = 15, U = 180. As seen, although this dataset is quite challenging, our method obtained promising results based only on skeleton data. The confusion matrix is illustrated in Fig. 4. From the confusion matrix, it can be seen that activities such as "Drink", "Cheer Up", "Sit Still" and "Toss Paper" are relatively easy to recognise, while "Eat" and "Use laptop" are relatively difficult to recognise. The reason for the difficulty is that for these human-object interactions, object information is not available from the skeleton data, which makes these interactions almost the same in terms of the motion reflected in the skeleton data.

Fig. 4. The confusion matrix of our proposed method for MSR-DailyActivity3D.
C. MSR ActionPairs3D
The MSR ActionPairs3D dataset [28] is a paired-activity dataset captured by a Kinect camera. This dataset contains 12 activities (i.e. six pairs) of 10 subjects, with each subject performing each activity 3 times. The paired actions are: Pick up a box/Put down a box, Lift a box/Place a box, Push a chair/Pull a chair, Wear a hat/Take off a hat, Put on a backpack/Take off a backpack, and Stick a poster/Remove a poster. Some sample frames from this dataset are shown in Fig. 5.

Fig. 5. Sample frames of the MSR ActionPairs3D dataset.

This dataset was collected to investigate how temporal order affects activity recognition. Experiments were set to the
same configuration as [28], namely, the first five actors were used for testing and the rest for training. For this dataset, T = 0., N_DF = 1000, K = 10000, C = 4, S = 3, U = 100. We compare our performance on this dataset with two methods whose results were reported in [28]. Table II shows the comparisons with other methods tested on this dataset.

TABLE II. COMPARISON ON MSR-ACTIONPAIRS DATASET

Methods                     Accuracy (%)
Skeleton + LOP [27]         63.33
Depth Motion Maps [30]      66.11
Proposed Method             75.56

The confusion matrix is shown in Fig. 6.
Fig. 6. The confusion matrix of our proposed method for MSR-ActionPairs3D.

From the confusion matrix, it can be seen that activities such as "Lift a Box", "Place a Box", "Push a Chair" and "Stick a Poster" are easy for our method to recognise, while "Pick up a Box" and "Take off Hat" are relatively difficult to recognise. The results verify that our method can distinguish temporal order in actions; however, some actions can still be confused with other, unpaired actions. One possible reason for the confusion between some actions, for instance "Pick up a Box" and "Push a Chair", is the 3-state quantization of the unit difference vectors. This issue could be addressed by quantizing the vectors into more states.

V. CONCLUSION
In this paper, a new representation is proposed and an effective data mining method is adopted to mine the mid-level patterns (different compositions of body parts) for action recognition. A novel method is proposed to explore temporal information and mine the different combinations of body parts in different frames. The strength of the proposed method has been demonstrated through the state-of-the-art results obtained on recent and challenging benchmark datasets for activity and action recognition. The proposed method could be further improved by incorporating depth or RGB data to explore human-object interactions. With the increasing popularity of Kinect-based action recognition and of data mining methods in computer vision, the proposed method has promising potential in practical applications.

REFERENCES
[1] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR '11. IEEE Computer Society, 2011, pp. 1297-1304.
[2] M. Zanfir, M. Leordeanu, and C. Sminchisescu, "The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection," in Computer Vision (ICCV), 2013 IEEE International Conference on, Dec 2013, pp. 2752-2759.
[3] M. A. Gowayyed, M. Torki, M. E. Hussein, and M. El-Saban, "Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition," in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, ser. IJCAI '13. AAAI Press, 2013, pp. 1351-1357.
[4] B. Fernando, E. Fromont, and T. Tuytelaars, "Mining mid-level features for image classification," International Journal of Computer Vision, vol. 108, no. 3, pp. 186-203, 2014.
[5] W. Li, Z. Zhang, and Z. Liu, "Expandable data-driven graphical modeling of human actions based on salient postures," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1499-1510, 2008.
[6] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in IEEE International Workshop for Human Communicative Behavior Analysis (in conjunction with CVPR2010), 2010.
[7] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall, "A survey on human motion analysis from depth data," in Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications. Springer, 2013, pp. 149-187.
[8] L. Xia, C.-C. Chen, and J. Aggarwal, "View invariant human action recognition using histograms of 3D joints," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on. IEEE, 2012, pp. 20-27.
[9] H. S. Koppula, R. Gupta, and A. Saxena, "Learning human activities and object affordances from RGB-D videos," International Journal of Robotics Research (IJRR), vol. 32, no. 8, pp. 951-970, July 2013.
[10] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Unstructured human activity detection from RGBD images," in Robotics and Automation (ICRA), 2012 IEEE International Conference on. IEEE, 2012, pp. 842-849.
[11] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Learning actionlet ensemble for 3D human action recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 36, no. 5, pp. 914-927, May 2014.
[12] X. Yang and Y. Tian, "EigenJoints-based action recognition using Naive-Bayes-Nearest-Neighbor," in International Workshop on Human Activity Understanding from 3D Data (HAU3D) in conjunction with CVPR, June 2012, pp. 14-19.
[13] M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, "Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations," in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, ser. IJCAI '13. AAAI Press, 2013, pp. 2466-2472.
[14] T. T. Thanh, F. Chen, and K. Kotani, "Extraction of discriminative patterns from skeleton sequences for accurate action recognition," Fundamenta Informaticae, pp. 1-15, 2014, in press.
[15] C. Wang, Y. Wang, and A. L. Yuille, "An approach to pose-based action recognition," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 915-922.
[16] E. Ohn-Bar and M. Trivedi, "Joint angles similarities and HOG2 for action recognition," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, June 2013, pp. 465-470.
[17] I. Theodorakopoulos, D. Kastaniotis, G. Economou, and S. Fotopoulos, "Pose-based human action recognition via sparse representation in dissimilarity space," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 12-23, 2014.
[18] A. A. Chaaraoui, J. R. Padilla-López, P. Climent-Pérez, and F. Flórez-Revuelta, "Evolutionary joint selection to improve human action recognition with RGB-D devices," Expert Systems with Applications, vol. 41, no. 3, pp. 786-794, 2014.
[19] S. Althloothi, M. H. Mahoor, X. Zhang, and R. M. Voyles, "Human activity recognition using multi-features and multiple kernel learning," Pattern Recognition, vol. 47, no. 5, pp. 1800-1812, May 2014.
[20] A. Chaaraoui, J. Padilla-Lopez, and F. Florez-Revuelta, "Fusion of skeletal and silhouette-based features for human action recognition with RGB-D devices," in Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, Dec 2013, pp. 91-97.
[21] H. Rahmani, A. Mahmood, A. Mian, and D. Huynh, "Real time action recognition using histograms of depth gradients and random decision forests," in IEEE Winter Applications of Computer Vision Conference (WACV), March 2014, pp. 14-19.
[22] V. M. Zatsiorsky, Kinematics of Human Motion. Human Kinetics, 1998.
[23] T. Uno, T. Asai, Y. Uchida, and H. Arimura, "LCM: An efficient algorithm for enumerating frequent closed item sets," in Proceedings of the Workshop on Frequent Itemset Mining Implementations (FIMI '03), 2003.
[24] X. Yan, H. Cheng, J. Han, and D. Xin, "Summarizing itemset patterns: a profile-based approach," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 2005, pp. 314-323.
[25] H. Cheng, X. Yan, J. Han, and C.-W. Hsu, "Discriminative frequent pattern analysis for effective classification," in Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on. IEEE, 2007, pp. 716-725.
[26] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[27] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[28] O. Oreifej and Z. Liu, "HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, June 2013, pp. 716-723.
[29] M. Müller and T. Röder, "Motion templates for automatic classification and retrieval of motion capture data," in Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 2006, pp. 137-146.
[30] X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-based histograms of oriented gradients," in Proceedings of the 20th ACM International Conference on Multimedia. ACM, 2012, pp. 1057-1060.