Hierarchically Learned View-Invariant Representations for Cross-View Action Recognition
Yang Liu, Zhaoyang Lu, Senior Member, IEEE, Jing Li, Member, IEEE, and Tao Yang, Member, IEEE
Abstract — Recognizing human actions from varied views is challenging due to the huge appearance variations across views. The key to this problem is to learn discriminative view-invariant representations that generalize well across views. In this paper, we address this problem by learning view-invariant representations hierarchically using a novel method, referred to as joint sparse representation and distribution adaptation. To obtain robust and informative feature representations, we first incorporate a sample-affinity matrix into the marginalized stacked denoising autoencoder to obtain shared features, which are then combined with the private features. To make the feature representations of videos transferable across views, we then learn a transferable dictionary pair simultaneously from pairs of videos taken at different views, encouraging each action video to have the same sparse representation across views. However, a distribution difference across views still exists, because a unified subspace, in which the sparse representations of one action are the same across views, may not exist when the view difference is large. Therefore, we propose a novel unsupervised distribution adaptation method that learns a set of projections mapping the source and target view data into respective low-dimensional subspaces in which the marginal and conditional distribution differences are reduced simultaneously. The finally learned feature representation is therefore view-invariant and robust to substantial distribution differences across views, even when the view difference is large. Experimental results on four multi-view datasets show that our approach outperforms the state-of-the-art approaches.
Index Terms — Action recognition, cross-view, dictionary learning, distribution adaptation.
Manuscript received November 23, 2017; revised March 7, 2018, April 10, 2018, and June 4, 2018; accepted August 22, 2018. Date of publication August 31, 2018; date of current version August 2, 2019. This work was supported by the National Natural Science Foundation of China under Grant 61502364 and Grant 61672429. This paper was recommended by Associate Editor Y. Wu. (Corresponding author: Jing Li.)

Y. Liu, Z. Lu, and J. Li are with the School of Telecommunications Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]; [email protected]). T. Yang is with the School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSVT.2018.2868123

I. INTRODUCTION

Human action recognition aims to automatically recognize an ongoing action from a video clip. It has received great attention in recent years due to its wide applications, including video surveillance [1], video labeling [2], video content retrieval [3], human-computer interaction [4], and sports video analysis [5]. However, recent works in [6]–[11] have demonstrated that recognizing actions in the cross-view scenario is challenging due to large appearance variations in action videos captured by cameras at different locations. For example, the same action performed by the same actor may look visually different from one view to another (Figure 1). In addition, different camera viewpoints may result in different backgrounds, camera motions, lighting conditions and occlusions. Therefore, developing cross-view action recognition methods that can recognize an unknown action in the target view using features extracted from other source views remains a challenge.

Fig. 1. Examples of multi-camera-view on the IXMAS dataset.

In order to accurately recognize human actions from varied views, a family of view-shared sparse representation based approaches has been proposed recently and demonstrated to achieve good results [12]–[14]. These approaches assume that samples from different views contribute equally to the shared features and ignore view-private features. However, this assumption is not always valid; for example, the top view should contribute less to the shared features than the side views (e.g., the top view and side views in Figure 1). Actually, shared features of one action across views mainly encode the body and body outline, while private features mainly encode the different limb poses that carry the class information across views [14]. Therefore, the view-private features that capture motion information particularly owned by one view should be incorporated into the view-shared features to learn more discriminative and informative features. In addition, the distribution difference across views still exists, because a unified subspace, in which the feature representations of one action across views are the same, may not exist when the view difference is large (e.g., the top view and the side views in Figure 1). This degrades the overall performance of cross-view action recognition algorithms. Thus, we should learn a set of projections that map different views into respective subspaces to obtain new representations of the respective views, while concurrently encouraging the subspace divergence to be small. In this way, the learned representations can generalize well across views even when the view difference is large.

Fig. 2. Framework of our proposed JSRDA. The framework is hierarchical as the view-invariant representation is learned in a coarse-to-fine fashion.

In this paper, we focus on the unsupervised cross-view action recognition problem, including the case where the view difference is large, and learn view-invariant representations hierarchically by incorporating shared features learning, transferable dictionary learning and distribution adaptation into a unified framework named Joint Sparse Representation and Distribution Adaptation (JSRDA). An overview of JSRDA is presented in Figure 2.
The proposed JSRDA method mainly consists of three stages.

In the first stage, a Sample-Affinity Matrix (SAM) introduced in [15] is employed to measure the similarities between video samples in different views, which facilitates accurately balancing the information transfer across views. The SAM is then incorporated into the marginalized stacked denoising autoencoder [16] (mSDA) to learn more robust shared features. In addition to the shared features, private features originating from the raw input features are combined with the obtained shared features to yield more informative feature representations.

In the second stage, we learn a set of dictionaries corresponding to the training and testing views, respectively. These dictionaries are learned simultaneously from the sets of videos taken at different views by encouraging each video in the set to have the same sparse representation. After the dictionaries are learned, we obtain the sparse representations of the training and testing videos using the corresponding dictionary. This procedure enables the transfer of sparse feature representations of videos in the source view(s) to the corresponding videos in the target view.

However, the distribution difference across views still exists after the second stage, because a unified subspace, in which the sparse representations of one action across views are the same, may not exist when the view difference is large (e.g., the top view and the side views). Therefore, in the third stage, we propose a novel unsupervised distribution adaptation method that learns a set of projections mapping the source and target views into respective subspaces where both the marginal and conditional distribution differences between the source and target views are reduced simultaneously. After the projections, 1) the variance of the target view data is maximized to preserve the embedded data properties, 2) the discriminative information of the source view data is preserved to effectively transfer the class information, 3) both the marginal and conditional distribution differences between the source and target views are minimized, and 4) the divergence of these projections is encouraged to be small to reduce the domain divergence between the source and target views.

Finally, the view-invariant representations of action videos from different views are obtained in their respective subspaces. Then, we train a classifier in the source view(s) and test it in the target view. Extensive experiments on four multi-view datasets show that our approach significantly outperforms state-of-the-art approaches.

The main contributions of this paper are as follows:

• To obtain more robust and informative feature representations for cross-view action recognition, a sample-affinity matrix is incorporated into the marginalized stacked denoising autoencoder (mSDA) to learn shared features, which are then combined with private features.

• To address the performance degradation problem when the view difference is large, we propose a novel unsupervised distribution adaptation method that learns a set of projections mapping the source and target views into respective subspaces where both the marginal and conditional distribution differences between the source and target views are reduced simultaneously.
• To obtain view-invariant feature representations that generalize well across views, we learn view-invariant representations hierarchically by incorporating shared features learning, transferable dictionary learning and distribution adaptation into a unified framework, which is effective and can learn robust and discriminative view-invariant representations even across different datasets.

This paper is organized as follows: Section II briefly reviews related state-of-the-art works. Section III introduces the proposed JSRDA approach for cross-view action recognition. Experimental results and related discussions are presented in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK
A. Cross-View Action Recognition
Recently, many approaches have been proposed to address the problem of cross-view action recognition.
Farhadi and Tabrizi [17] employed maximum margin clustering (MMC) to generate split-based features of the source view and then transferred the split values to the target view. Zhang et al. [18] added a temporal regularization to the traditional MMC. These works require feature-to-feature correspondence at the frame level. Liu et al. [19] presented a bipartite-graph-based method to bridge the domain shift across view-dependent vocabularies. Zheng et al. [14] exploited the video-to-video correspondence and proposed a dictionary learning based method to jointly learn a set of view-specific dictionaries for specific views and a common dictionary shared across different views. Li and Zickler [7] proposed "virtual views" that connect the source and target views by a virtual path, which is associated with a linear transformation of the action descriptors. Similarly, Zhang et al. [8] bridged the source view and the target view by a continuous virtual path that keeps all the visual information. Wang et al. [20] proposed a statistical translation framework that estimates the visual word transfer probabilities across views for cross-view action recognition. Kong et al. [15] addressed the cross-view action recognition problem by learning view-specific and view-shared features using marginalized autoencoder based deep models. Yan et al. [21] proposed a Multi-Task Information Bottleneck (MTIB) clustering method that explores the shared information among multiple action clustering tasks to improve the performance of each individual task. Ulhaq et al. [22] proposed an advanced space-time filtering framework for recognizing human actions despite large viewpoint variations. Rahmani et al. [23] proposed a 3D human model based cross-view action recognition method that learns a single model to transform any action from any viewpoint to its respective high-level representation, without requiring action labels or knowledge of the viewing angles.

Different from the above-mentioned cross-view action recognition approaches [7], [8], [14], [15], [17]–[23], our proposed approach exploits both the view-private and view-shared features to learn view-invariant representations hierarchically by incorporating shared feature learning, transferable dictionary learning and distribution adaptation into a unified cross-view action recognition framework. To address the performance degradation problem when the view difference is large, we learn a set of projections that map the source and target views into respective subspaces so that the learned representations generalize well across views. In this way, our approach can learn robust and discriminative view-invariant representations for cross-view action recognition even with a large view difference or across different datasets.
B. Transfer Learning on Heterogeneous Features
From the perspective of transfer learning, our work is related to the subspace based methods [24]–[27]. Wang and Mahadevan [24] proposed a manifold alignment based method that learns a common feature subspace for all heterogeneous domains by preserving the topology of each domain, matching instances with the same labels and separating instances with different labels. However, this method needs class labels of both the source and target domains and requires that the data have a manifold structure, while our method is unsupervised and does not require the manifold assumption. Long et al. [25] proposed a distribution adaptation method that finds a common subspace in which the marginal and conditional distribution shifts between domains are reduced. Zhang et al. [26] relaxed the assumption that there exists a unified projection mapping the source and target domains into a unified subspace; instead, they learned two projections that map the source and target domains into their respective subspaces in which both the geometrical and statistical distribution differences are minimized. Long et al. [27] introduced an unsupervised domain adaptation method that reduces the domain shift by jointly finding a common subspace and reweighting the instances across domains.

Different from previous distribution adaptation methods [24]–[27], we do not assume that there exists a unified projection, since this assumption is invalid when the distribution difference across views is large. Instead, we learn a set of projections that map the source and target views into respective subspaces to obtain new representations of the respective views, while concurrently encouraging the subspace divergence across views to be small.

III. JOINT SPARSE REPRESENTATION AND DISTRIBUTION ADAPTATION
The purpose of this work is to learn view-invariant representations that allow us to train a classifier on one (or multiple) view(s) and test it on the other view.
A. Shared Features Learning
This subsection aims to obtain new informative feature representations of action videos for further transferable dictionary learning and distribution adaptation.
1) Sample-Affinity Matrix (SAM):
To fix notation, we consider training videos of $V$ views: $\{X^v, \mathbf{y}^v\}_{v=1}^{V}$. The data instances of the $v$-th view $X^v$ consist of $N$ action videos, $X^v = [\mathbf{x}^v_1, \cdots, \mathbf{x}^v_N] \in \mathbb{R}^{d \times N}$, with corresponding labels $\mathbf{y}^v = [y^v_1, \cdots, y^v_N]$, where $\mathbf{x}^v_i$ ($i = 1, \cdots, N$) denotes the feature of video $i$ of the $v$-th view and $d$ denotes the dimensionality of the video feature. We employ the Sample-Affinity Matrix (SAM) introduced in [15] to measure the similarity between pairs of video samples in multiple views. The SAM $S \in \mathbb{R}^{VN \times VN}$ is defined as a block diagonal matrix:
$$S = \mathrm{diag}(S_1, \cdots, S_N), \qquad S_i = \begin{pmatrix} 0 & s^{12}_i & \cdots & s^{1V}_i \\ s^{21}_i & 0 & \cdots & s^{2V}_i \\ \vdots & \vdots & \ddots & \vdots \\ s^{V1}_i & s^{V2}_i & \cdots & 0 \end{pmatrix},$$
where $\mathrm{diag}(\cdot)$ creates a diagonal matrix, and $s^{uv}_i = \exp(-\|\mathbf{x}^v_i - \mathbf{x}^u_i\| / c)$, parameterized by $c$, measures the distance of the $i$-th video sample between two views. In this paper, we use a fixed value of $c$. Each block $S_i$ in $S$ tells us how an action varies as the view changes, because it characterizes appearance variations in different views within one class, which allows us to transfer information between views and learn robust cross-view features. In addition, the off-diagonal blocks in the SAM $S$ are set to zeros to limit information sharing between classes in the same view. As a result, the features from different classes but in the same view are encouraged to be distinct, which enables us to differentiate action classes that appear similar in some views.
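To make the construction concrete, the following minimal sketch builds the block-diagonal SAM from per-view feature matrices. It is an illustrative sketch under our own naming (build_sam, the bandwidth c) and assumes the zero within-view diagonal and the Gaussian-style affinity described above; it is not the authors' released code.

```python
import numpy as np

def build_sam(X_views, c=1.0):
    """Build the block-diagonal Sample-Affinity Matrix (SAM).

    X_views: list of V arrays, each of shape (d, N); column i of every
             array is the i-th training video seen from that view.
    Returns S of shape (V*N, V*N), with one V x V block per sample.
    The sample-major block layout assumes the global training matrix X
    also orders its columns as (video 1, views 1..V), (video 2, views 1..V), ...
    """
    V = len(X_views)
    N = X_views[0].shape[1]
    S = np.zeros((V * N, V * N))
    for i in range(N):
        S_i = np.zeros((V, V))
        for u in range(V):
            for v in range(V):
                if u != v:  # same-view (diagonal) entries stay zero
                    dist = np.linalg.norm(X_views[v][:, i] - X_views[u][:, i])
                    S_i[u, v] = np.exp(-dist / c)
        S[i * V:(i + 1) * V, i * V:(i + 1) * V] = S_i
    return S
```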
2) Autoencoders:
Our shared features learning approach builds upon the popular Autoencoder (AE) [29] and an AE based domain adaptation method, the marginalized stacked denoising autoencoder [16] (mSDA). The objective of an AE is to encourage similar or identical input-output pairs by minimizing the reconstruction loss. In this way, the hidden units form a good representation of the inputs, since the reconstruction process captures the intrinsic structure of the input data. Different from the two-level encoding and decoding in an AE, the marginalized stacked denoising autoencoder (mSDA) learns a robust data representation using a single mapping
$$W = \arg\min_{W} \sum_{i=1}^{N} \|\mathbf{x}_i - W\tilde{\mathbf{x}}_i\|^2$$
by recovering the original features from data that are artificially corrupted with noise, where $\tilde{\mathbf{x}}_i$ is the corrupted version of $\mathbf{x}_i$ obtained by randomly setting each feature to 0 with probability $p$, and $N$ is the number of training samples. mSDA performs $m$ passes over the training set with different corruptions each time, which essentially applies a dropout regularization [30]. By letting $m \to \infty$, mSDA effectively uses infinitely many copies of noisy data to compute the mapping matrix $W$, which is therefore robust to noise. mSDA is stackable and can be computed in closed form.
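To make the marginalization step concrete, the sketch below computes a single-layer mSDA mapping in closed form by replacing the corrupted second-order statistics with their expectations under feature dropout with probability p. It is a minimal sketch with our own variable names, not the authors' implementation.

```python
import numpy as np

def msda_mapping(X, p=0.5):
    """Closed-form mSDA mapping W = E[X X~^T] (E[X~ X~^T])^{-1}.

    X: array of shape (d, n), one column per sample.
    p: probability of zeroing out each feature (corruption).
    """
    d = X.shape[0]
    q = 1.0 - p                      # probability a feature survives
    scatter = X @ X.T                # (d, d) second-order statistics
    # E[X X~^T]: every corrupted column equals q * x in expectation.
    P = q * scatter
    # E[X~ X~^T]: off-diagonal entries survive with prob q^2,
    # diagonal entries with prob q.
    Q = (q ** 2) * scatter
    np.fill_diagonal(Q, q * np.diag(scatter))
    # Small ridge term for numerical stability of the inverse.
    W = P @ np.linalg.inv(Q + 1e-5 * np.eye(d))
    return W
```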
3) Single-Layer Shared Features Learning:
Actually, an action observed from one view shares some appearance information with the same action observed from other views. This motivates us to reconstruct the action data of one view (the target view) using the action data from the other view(s) (the source view(s)). In this way, shared information between views can be refined and transferred to the target view. Inspired by mSDA, we incorporate the SAM $S$ into mSDA to balance information transfer between views and learn discriminative shared features across multiple views. We learn shared features using the following objective, which defines the discrepancy between the data of the $v$-th target view and the data of all $V$ source views:
$$\arg\min_{W} \psi = \sum_{i=1}^{N} \sum_{v=1}^{V} \Big\| W\tilde{\mathbf{x}}^v_i - \sum_{u} \mathbf{x}^u_i s^{uv}_i \Big\|^2 = \| W\tilde{X} - XS \|_F^2, \tag{1}$$
where $s^{uv}_i$ is a weight measuring the contribution of the $u$-th view in the reconstruction of the action sample $\mathbf{x}^v_i$ of the $v$-th view, $W \in \mathbb{R}^{d \times d}$ is the mapping matrix for the corrupted inputs $\tilde{\mathbf{x}}^v_i$ of all the views, $S \in \mathbb{R}^{VN \times VN}$ is the sample-affinity matrix encoding all the weights $\{s^{uv}_i\}$, and the matrices $X, \tilde{X} \in \mathbb{R}^{d \times VN}$ denote the input training matrix and its corrupted version, respectively [16].

The solution to the optimization problem in Eq. (1) can be expressed as the well-known closed-form solution for ordinary least squares [16], [31]:
$$W = (XS\tilde{X}^T)(\tilde{X}\tilde{X}^T)^{-1}. \tag{2}$$
It should be noted that $XS\tilde{X}^T$ and $\tilde{X}\tilde{X}^T$ are computed by repeating the corruption $m \to \infty$ times. By the weak law of large numbers [16], $XS\tilde{X}^T$ and $\tilde{X}\tilde{X}^T$ can be computed by their expectations $\mathbb{E}_p(XS\tilde{X}^T)$ and $\mathbb{E}_p(\tilde{X}\tilde{X}^T)$ under the corruption probability $p$, respectively.

Although mSDA can be given a deep architecture by layer-wise stacking, we use only one layer in this paper, considering the extra training time of multiple layers. To obtain the shared features, a nonlinear squashing function $\sigma(\cdot)$ is applied to the output of this layer: $H_s = \sigma(WX)$, where $X$ denotes the raw features of the training data and $H_s$ denotes the shared features. Throughout this paper, we use $\tanh(\cdot)$ as the squashing function. Besides the information shared across views, private features that capture discriminative information existing exclusively in each view should also be taken into consideration. Therefore, the original features $X$ are taken as the private features $H_p$ and concatenated with the obtained shared features $H_s$ to form the new informative representation $H_{sp} = [H_s; H_p] \in \mathbb{R}^{2d \times VN}$.
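Combining the SAM with the marginalized closed form above, the sketch below computes the single-layer shared features of Eqs. (1)–(2) and concatenates them with the private features to form H_sp. The sample-major column ordering of X and the helper name are our own assumptions for illustration, not the paper's code.

```python
import numpy as np

def shared_private_features(X, S, p=0.5):
    """Single-layer shared feature learning with a sample-affinity matrix.

    X: (d, V*N) training matrix; columns are ordered sample-major so that
       the V views of video i occupy columns i*V ... i*V+V-1, matching the
       block-diagonal layout of S (see build_sam above).
    S: (V*N, V*N) sample-affinity matrix.
    p: feature corruption probability.
    """
    d = X.shape[0]
    q = 1.0 - p
    scatter = X @ X.T
    # E_p[X S X~^T] = q * X S X^T (each corrupted column scales by q).
    P = q * (X @ S @ X.T)
    # E_p[X~ X~^T]: q^2 off-diagonal, q on the diagonal.
    Q = (q ** 2) * scatter
    np.fill_diagonal(Q, q * np.diag(scatter))
    W = P @ np.linalg.inv(Q + 1e-5 * np.eye(d))   # Eq. (2)
    H_s = np.tanh(W @ X)                          # shared features
    H_p = X                                       # private features
    return np.vstack([H_s, H_p])                  # H_sp, shape (2d, V*N)
```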
B. Transferable Dictionary Learning

Although the obtained new representation $H_{sp}$ contains both shared and private features, it cannot capture view-invariant information due to the variations in the feature representations of the same action seen from different views. Therefore, we employ the transferable dictionary learning method introduced in [14] to learn a sparse representation for each action video based on the new representation $H_{sp}$. Specifically, we learn a set of view-specific dictionaries, where each dictionary corresponds to one camera view. These dictionaries are learned simultaneously from the sets of corresponding videos taken at different views, with the aim of encouraging each video in the set to have the same sparse representation. In this way, videos of the same action class from the source and target views tend to have the same sparse codes when reconstructed from the corresponding view-specific dictionary.

In this paper, we consider unsupervised transferable dictionary learning, where labels of the corresponding videos are not available. In addition, we require that the number of training action videos be the same in each view. Suppose there are $p$ source views and one target view; note that the cross-view problem is the special case of the multi-view problem with $p = 1$. Let $D_{s,i} \in \mathbb{R}^{d \times K}$ and $D_t \in \mathbb{R}^{d \times K}$ denote the view-specific dictionaries of the $i$-th source view and the target view, respectively, where $K$ is the number of dictionary atoms and each view-specific dictionary has the same size. $Y_{s,i} \in \mathbb{R}^{d \times N}$ and $Y_t \in \mathbb{R}^{d \times N}$ denote the feature representations of the $i$-th source view and the target view, respectively. The sparse representations $X \in \mathbb{R}^{K \times N}$ are obtained by solving the following objective function:
$$\arg\min_{\{D_{s,i}\}_{i=1}^{p},\, D_t,\, X} \; \sum_{i=1}^{p} \|Y_{s,i} - D_{s,i}X\|^2 + \|Y_t - D_t X\|^2 \quad \text{s.t.} \;\; \forall i,\; \|\mathbf{x}_i\|_0 \le \gamma. \tag{3}$$
Since we have the same number of action videos in each view, Eq. (3) can be rewritten as
$$\arg\min_{D, X} \|Y - DX\|^2 \quad \text{s.t.} \;\; \forall i,\; \|\mathbf{x}_i\|_0 \le \gamma, \tag{4}$$
where
$$Y = \begin{bmatrix} Y_{s,1} \\ \vdots \\ Y_{s,p} \\ Y_t \end{bmatrix}, \qquad D = \begin{bmatrix} D_{s,1} \\ \vdots \\ D_{s,p} \\ D_t \end{bmatrix},$$
and $\|\mathbf{x}_i\|_0 \le \gamma$ is the sparsity constraint. The view-specific dictionaries in $D$ can be learned by the K-SVD [32] algorithm. After obtaining these dictionaries, the OMP [33] algorithm is employed to compute the sparse feature representations. Consequently, all videos in all views are projected into a unified view-invariant sparse feature space. This procedure enables the transfer of sparse feature representations of videos in the source view(s) to the corresponding videos in the target view.
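The sketch below illustrates this stage with scikit-learn's DictionaryLearning as a stand-in for K-SVD (the paper uses K-SVD [32] for dictionary learning and OMP [33] for coding); the stacked-dictionary layout follows Eq. (4), while the function and variable names are ours.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def transferable_dictionaries(Y_views, K=600, gamma=50):
    """Jointly learn view-specific dictionaries that share sparse codes.

    Y_views: list of (p + 1) arrays of shape (d, N) -- the p source views
             followed by the target view, with column-wise video
             correspondence across views.
    Returns (dicts, codes): per-view dictionaries of shape (d, K) and the
    shared sparse codes of shape (K, N).
    """
    d, N = Y_views[0].shape
    Y = np.vstack(Y_views)                     # ((p+1)*d, N), as in Eq. (4)
    # DictionaryLearning is used here as a stand-in for K-SVD; OMP enforces
    # the ||x_i||_0 <= gamma sparsity constraint when coding.
    learner = DictionaryLearning(n_components=K,
                                 transform_algorithm='omp',
                                 transform_n_nonzero_coefs=gamma,
                                 max_iter=50)
    # scikit-learn expects samples as rows, so transpose.
    codes = learner.fit_transform(Y.T).T       # (K, N) shared sparse codes
    D = learner.components_.T                  # ((p+1)*d, K) stacked dictionary
    dicts = [D[i * d:(i + 1) * d, :] for i in range(len(Y_views))]
    return dicts, codes
```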
C. Distribution Adaptation

Although the obtained sparse representations of one action are encouraged to be the same in all views, the distribution difference across views still exists, because a unified subspace in which the sparse representations of one action across views are the same may not exist when the view difference is large (e.g., the top view and the side views). This degrades the overall performance of the cross-view action recognition algorithm. Thus, we relax the strong assumption that there exists a unified subspace in which the feature representations of one action in all views are strictly equal. Instead, we learn a set of projections that map different views into respective subspaces to obtain new representations of the respective views, while concurrently encouraging the subspace divergence to be small.
1) Problem Definition:
Suppose there are $p$ source views and one target view, with $C$ classes in total. To fix terminology, the data of the $i$-th source view, denoted $X_{s,i} \in \mathbb{R}^{K \times N_{s,i}}$, are drawn from the marginal distribution $P_{s,i}(X_{s,i})$, and the target view data $X_t \in \mathbb{R}^{K \times N_t}$ are drawn from the marginal distribution $P_t(X_t)$, where $K$ is the dimension of a data instance, and $N_{s,i}$ and $N_t$ are the numbers of samples in the $i$-th source view and the target view, respectively. In unsupervised distribution adaptation, there are sufficient labeled source view data and unlabeled target view data in the training stage. We assume that the feature and label spaces of the source and target views are the same. Due to the domain divergence between views, for any $i \in \{1, \cdots, p\}$, the marginal distributions differ, $P_{s,i}(X_{s,i}) \neq P_t(X_t)$, and so do the conditional distributions, $P_{s,i}(Y_{s,i} \mid X_{s,i}) \neq P_t(Y_t \mid X_t)$, where $Y_{s,i} \in \mathbb{R}^{1 \times N_{s,i}}$ and $Y_t \in \mathbb{R}^{1 \times N_t}$ are the class labels of the $i$-th source view data and the target view data, respectively. Different from previous distribution adaptation methods, we do not assume that there exists a unified transformation $T$ such that $P_{s,i}(T(X_{s,i})) = P_t(T(X_t))$ and $P_{s,i}(Y_{s,i} \mid T(X_{s,i})) = P_t(Y_t \mid T(X_t))$, since this assumption becomes invalid when the distribution shift across views is large. Instead, we propose a novel distribution adaptation method that learns a set of projections mapping the source and target views into respective subspaces to obtain new representations of the respective views, while encouraging the subspace divergence to be small at the same time.
2) Formulation:
Our proposed distribution adaptation approach finds a set of projections ($F_{s,i}$ for the $i$-th source view and $F_t$ for the target view) that yield new representations of the respective views such that 1) the difference in marginal and conditional distributions across views is small, 2) the divergence between the source and target subspaces is small, 3) the variance of the target view data is maximized, and 4) the discriminative information of the source view data is preserved.

To reduce the difference between the marginal distributions $P_{s,i}(X_{s,i})$ and $P_t(X_t)$, we follow [25] and [34]–[36] and employ the empirical Maximum Mean Discrepancy (MMD) to compute the distance between the sample means of the source and target data in the $k$-dimensional embeddings:
$$\min_{\{F_{s,i}\}_{i=1}^{p}, F_t} \; \sum_{i=1}^{p} \Big\| \frac{1}{N_{s,i}} \sum_{\mathbf{x}_k \in X_{s,i}} F_{s,i}^{T}\mathbf{x}_k - \frac{1}{N_t} \sum_{\mathbf{x}_j \in X_t} F_t^{T}\mathbf{x}_j \Big\|_F^{2}. \tag{5}$$

In order to reduce the difference between the conditional distributions $P_{s,i}(Y_{s,i} \mid X_{s,i})$ and $P_t(Y_t \mid X_t)$, sufficient labeled data in the target view would be needed. However, no labeled target view data are available in the unsupervised scenario. To address this issue, Long et al. [25] utilized target view pseudo labels predicted by a source view classifier to represent the conditional distribution of the target view domain; the pseudo labels of the target view domain are iteratively refined to reduce the difference in conditional distributions with the source view domains. We follow this idea and minimize the conditional distribution difference between domains:
$$\min_{\{F_{s,i}\}_{i=1}^{p}, F_t} \; \sum_{i=1}^{p} \sum_{c=1}^{C} \Big\| \frac{1}{N^{(c)}_{s,i}} \sum_{\mathbf{x}_k \in X^{(c)}_{s,i}} F_{s,i}^{T}\mathbf{x}_k - \frac{1}{N^{(c)}_t} \sum_{\mathbf{x}_j \in X^{(c)}_t} F_t^{T}\mathbf{x}_j \Big\|_F^{2}, \tag{6}$$
where $X^{(c)}_{s,i}$ is the set of data instances from class $c$ in the $i$-th source view and $N^{(c)}_{s,i}$ is the number of data instances in $X^{(c)}_{s,i}$. Correspondingly, $X^{(c)}_t$ is the set of data instances from class $c$ in the target view and $N^{(c)}_t$ is the number of data instances in $X^{(c)}_t$. Since we have the same number of action videos in each view, the marginal distribution difference term in Eq. (5) and the conditional distribution difference term in Eq. (6) can be combined into the final distribution divergence minimization term:
$$\min_{F_s, F} \; \mathrm{Tr}\!\left( \begin{bmatrix} F_s^{T} & F^{T} \end{bmatrix} \begin{bmatrix} M_s & M_{st} \\ M_{ts} & M \end{bmatrix} \begin{bmatrix} F_s \\ F \end{bmatrix} \right), \tag{7}$$
where the formulations of $F_s$, $F$, $M_s$, $M_{st}$, $M_{ts}$ and $M$ are given in Appendix A.

To reduce the divergence between the source and target subspaces, we use the following term to encourage the source and target subspaces to be close:
$$\min_{\{F_{s,i}\}_{i=1}^{p}, F_t} \; \sum_{i=1}^{p} \|F_{s,i} - F_t\|_F^{2}. \tag{8}$$
We rewrite Eq. (8) as follows:
$$\min_{F_s, F} \; \|F_s - F\|_F^{2}, \tag{9}$$
where
$$F_s = \begin{bmatrix} F_{s,1} \\ \vdots \\ F_{s,p} \end{bmatrix}, \qquad F = \begin{bmatrix} F_t \\ \vdots \\ F_t \end{bmatrix}$$
is obtained by replicating $F_t$ $p$ times.

To maximize the variance of the target view data and preserve its embedded data properties, we use the following term:
$$\max_{F} \; \mathrm{Tr}(F^{T} S F), \tag{10}$$
where $S = [S_t, \cdots, S_t]$ is obtained by replicating $S_t$ $p$ times, $S_t = X_t H_t X_t^{T}$ is essentially a covariance matrix, and $H_t = I_t - \frac{1}{N_t}\mathbf{1}_t\mathbf{1}_t^{T}$ is the centering matrix, where $\mathbf{1}_t \in \mathbb{R}^{N_t \times 1}$ is the column vector of all ones and $I_t \in \mathbb{R}^{N_t \times N_t}$ is the identity matrix.

Since the label information of the source views is available, we can utilize it to preserve the discriminative information in the source views. Therefore, we use the following terms:
$$\max_{\{F_{s,i}\}_{i=1}^{p}} \; \sum_{i=1}^{p} \mathrm{Tr}(F_{s,i}^{T} S_{b,i} F_{s,i}), \tag{11}$$
$$\min_{\{F_{s,i}\}_{i=1}^{p}} \; \sum_{i=1}^{p} \mathrm{Tr}(F_{s,i}^{T} S_{\omega,i} F_{s,i}), \tag{12}$$
where $S_{b,i}$ is the inter-class variance matrix of the data from the $i$-th source view domain and $S_{\omega,i}$ is the intra-class variance matrix, defined as
$$S_{b,i} = \sum_{c=1}^{C} N^{(c)}_{s,i} (\mathbf{m}^{(c)}_{s,i} - \bar{\mathbf{m}}_{s,i})(\mathbf{m}^{(c)}_{s,i} - \bar{\mathbf{m}}_{s,i})^{T}, \tag{13}$$
$$S_{\omega,i} = \sum_{c=1}^{C} X^{(c)}_{s,i} H^{(c)}_{s,i} (X^{(c)}_{s,i})^{T}, \tag{14}$$
where $X^{(c)}_{s,i} \in \mathbb{R}^{K \times N^{(c)}_{s,i}}$ is the set of data instances from class $c$ in the $i$-th source view, $\mathbf{m}^{(c)}_{s,i} = \frac{1}{N^{(c)}_{s,i}}\sum_{k=1}^{N^{(c)}_{s,i}} \mathbf{x}^{(c)}_k$, $\bar{\mathbf{m}}_{s,i} = \frac{1}{N_{s,i}}\sum_{k=1}^{N_{s,i}} \mathbf{x}_k$, $H^{(c)}_{s,i} = I^{(c)}_{s,i} - \frac{1}{N^{(c)}_{s,i}}\mathbf{1}^{(c)}_{s,i}(\mathbf{1}^{(c)}_{s,i})^{T}$ is the centering matrix of the data from class $c$, $I^{(c)}_{s,i} \in \mathbb{R}^{N^{(c)}_{s,i} \times N^{(c)}_{s,i}}$ is the identity matrix, $\mathbf{1}^{(c)}_{s,i} \in \mathbb{R}^{N^{(c)}_{s,i} \times 1}$ is the column vector of all ones, and $N^{(c)}_{s,i}$ is the number of data instances from class $c$ in the $i$-th source view. Similarly, Eq. (11) and Eq. (12) can be rewritten as
$$\max_{F_s} \; \mathrm{Tr}(F_s^{T} S_b F_s), \tag{15}$$
$$\min_{F_s} \; \mathrm{Tr}(F_s^{T} S_\omega F_s), \tag{16}$$
where $F_s = [F_{s,1}; \cdots; F_{s,p}]$, $S_b = [S_{b,1}, \cdots, S_{b,p}]$ and $S_\omega = [S_{\omega,1}, \cdots, S_{\omega,p}]$.

We formulate our distribution adaptation method by incorporating the five terms in Eqs. (7), (9), (10), (15) and (16) into a unified objective function:
$$\max \; \frac{\mu\,\{\text{T-Var}\} + \beta\,\{\text{S-Inter-Var}\}}{\{\text{Dis-Dif}\} + \lambda\,\{\text{Sub-Div}\} + \beta\,\{\text{S-Intra-Var}\} + \mu\,\{F\text{-C}\}},$$
where the T-Var, S-Inter-Var, Dis-Dif, Sub-Div, S-Intra-Var and $F$-C terms denote the target view data variance, source view inter-class variance, distribution difference, subspace divergence, source view intra-class variance and the scale constraint of $F$, respectively, and $\lambda$, $\mu$, $\beta$ are parameters balancing the importance of the terms. We follow [37] and impose the constraint that $\mathrm{Tr}(F^{T}F)$ is small to control the scale of $F$. Specifically, we aim at finding a set of projections $F_s$ and $F$ by solving the following optimization problem:
$$\max_{F_s, F} \; \frac{\mathrm{Tr}\!\left( \begin{bmatrix} F_s^{T} & F^{T} \end{bmatrix} \begin{bmatrix} \beta S_b & 0 \\ 0 & \mu S \end{bmatrix} \begin{bmatrix} F_s \\ F \end{bmatrix} \right)}{\mathrm{Tr}\!\left( \begin{bmatrix} F_s^{T} & F^{T} \end{bmatrix} \begin{bmatrix} M_s + \lambda I + \beta S_\omega & M_{st} - \lambda I \\ M_{ts} - \lambda I & M + (\lambda + \mu) I \end{bmatrix} \begin{bmatrix} F_s \\ F \end{bmatrix} \right)}, \tag{17}$$
where $I$ is the identity matrix.

Minimizing the denominator of Eq. (17) encourages a small difference in marginal and conditional distributions, a small subspace divergence between the source and target views, and a small intra-class variance of the source views. Maximizing the numerator of Eq. (17) encourages a large variance of the target view and a large inter-class variance of the source views.
In addition, we iteratively update the pseudo labels of the target view data using the learned transformations to improve the labeling quality until convergence.
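For intuition, the following sketch assembles the marginal MMD coefficient blocks of Eq. (7) for the simplest case of one source view and one target view, following the closed forms given in Appendix A; the class-conditional blocks of Eq. (6) and the multi-source stacking are built analogously. Names are illustrative, not taken from the paper's code.

```python
import numpy as np

def marginal_mmd_blocks(Xs, Xt):
    """Marginal-MMD coefficient blocks for one source and one target view.

    Xs: (K, Ns) source view data, Xt: (K, Nt) target view data.
    Returns M_s, M_st, M_ts, M such that
    Tr([Fs^T Ft^T] [[M_s, M_st], [M_ts, M]] [Fs; Ft]) equals the squared
    distance between the projected sample means (Eq. (5)).
    """
    Ns, Nt = Xs.shape[1], Xt.shape[1]
    ones_s = np.ones((Ns, 1))
    ones_t = np.ones((Nt, 1))
    L_s = (ones_s @ ones_s.T) / Ns ** 2            # (Ns, Ns)
    L_t = (ones_t @ ones_t.T) / Nt ** 2            # (Nt, Nt)
    L_st = -(ones_s @ ones_t.T) / (Ns * Nt)        # (Ns, Nt), negative cross term
    M_s = Xs @ L_s @ Xs.T
    M_st = Xs @ L_st @ Xt.T
    M_ts = Xt @ L_st.T @ Xs.T
    M = Xt @ L_t @ Xt.T
    return M_s, M_st, M_ts, M
```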
3) Optimization:
To optimize Eq. (17), we rewrite $[F_s^{T} \; F^{T}]$ as $W^{T}$. The objective function then becomes
$$\max_{W} \; \frac{\mathrm{Tr}\!\left(W^{T} \begin{bmatrix} \beta S_b & 0 \\ 0 & \mu S \end{bmatrix} W\right)}{\mathrm{Tr}\!\left(W^{T} \begin{bmatrix} M_s + \lambda I + \beta S_\omega & M_{st} - \lambda I \\ M_{ts} - \lambda I & M + (\lambda + \mu) I \end{bmatrix} W\right)}. \tag{18}$$
Note that the objective function in Eq. (18) is invariant to rescaling of $W$. Therefore, it can be rewritten as
$$\max_{W} \; \mathrm{Tr}\!\left(W^{T} \begin{bmatrix} \beta S_b & 0 \\ 0 & \mu S \end{bmatrix} W\right) \quad \text{s.t.} \;\; W^{T} \begin{bmatrix} M_s + \lambda I + \beta S_\omega & M_{st} - \lambda I \\ M_{ts} - \lambda I & M + (\lambda + \mu) I \end{bmatrix} W = I. \tag{19}$$
Denoting $\Phi = \mathrm{diag}(\phi_1, \cdots, \phi_k) \in \mathbb{R}^{k \times k}$ as the Lagrange multiplier, we derive the Lagrange function for Eq. (19) as
$$L = \mathrm{Tr}\!\left(W^{T} \begin{bmatrix} \beta S_b & 0 \\ 0 & \mu S \end{bmatrix} W\right) + \mathrm{Tr}\!\left(\left(W^{T} \begin{bmatrix} M_s + \lambda I + \beta S_\omega & M_{st} - \lambda I \\ M_{ts} - \lambda I & M + (\lambda + \mu) I \end{bmatrix} W - I\right)\Phi\right). \tag{20}$$
Setting $\frac{\partial L}{\partial W} = 0$, we obtain the following generalized eigendecomposition problem:
$$\begin{bmatrix} \beta S_b & 0 \\ 0 & \mu S \end{bmatrix} W = \begin{bmatrix} M_s + \lambda I + \beta S_\omega & M_{st} - \lambda I \\ M_{ts} - \lambda I & M + (\lambda + \mu) I \end{bmatrix} W \Phi, \tag{21}$$
where $\Phi = \mathrm{diag}(\phi_1, \cdots, \phi_k)$ contains the $k$ smallest eigenvalues. Finally, finding the optimal transformation matrix $W$ reduces to solving Eq. (21) for the corresponding $k$ eigenvectors $W = [W_1, \cdots, W_k]$. Once the transformation matrix $W$ is obtained, the corresponding subspace projections $F_s = [F_{s,1}; \cdots; F_{s,p}]$ and $F_t$ are easily recovered.

The distribution adaptation method can be extended to nonlinear problems in a Reproducing Kernel Hilbert Space (RKHS) using a kernel mapping $\psi: \mathbf{x} \mapsto \psi(\mathbf{x})$, or $\psi(X) = [\psi(\mathbf{x}_1), \cdots, \psi(\mathbf{x}_N)]$, and the kernel matrix $K = \psi(X)^{T}\psi(X) \in \mathbb{R}^{N \times N}$, where $N$ is the total number of samples in the source and target views. Using the Representer theorem, the objective function in Eq. (17) is kernelized as
$$\max_{F_s, F} \; \frac{\mathrm{Tr}\!\left( \begin{bmatrix} F_s^{T} & F^{T} \end{bmatrix} \begin{bmatrix} \beta S_b & 0 \\ 0 & \mu S \end{bmatrix} \begin{bmatrix} F_s \\ F \end{bmatrix} \right)}{\mathrm{Tr}\!\left( \begin{bmatrix} F_s^{T} & F^{T} \end{bmatrix} \begin{bmatrix} M_s + \lambda K + \beta S_\omega & M_{st} - \lambda K \\ M_{ts} - \lambda K & M + (\lambda + \mu) K \end{bmatrix} \begin{bmatrix} F_s \\ F \end{bmatrix} \right)}, \tag{22}$$
where $K = \psi(X)^{T}\psi(X)$ with $X = [X_s, X_t]$, and all occurrences of $X_t$ and $X_s$ in $S_b$, $S$, $S_\omega$, $M_s$, $M$, $M_{st}$ and $M_{ts}$ are replaced by $\psi(X_t)$ and $\psi(X_s)$, respectively, in the kernelized version. Once the kernelized objective function in Eq. (22) is obtained, it can be solved in the same way as the original objective function to compute $F_s$ and $F_t$. Due to the limited space of the paper, the algorithmic summary of the proposed JSRDA is provided on the website https://xdyangliu.github.io/JSRDA/.
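As a concrete illustration of this step, the sketch below solves a generalized eigenproblem of the form of Eq. (21) with SciPy, keeps the k eigenvectors ordered by the smallest eigenvalues as described above, and splits the solution into the per-view projections. The block matrices A and B are assumed to have been assembled from the terms defined earlier; all names are our own.

```python
import numpy as np
from scipy.linalg import eig

def solve_projections(A, B, k, p, dim):
    """Solve A W = B W Phi (Eq. (21)) and split W into F_s and F_t.

    A: numerator block matrix built from beta*S_b and mu*S.
    B: denominator block matrix built from the regularized MMD/scatter terms.
    k: number of eigenvectors (dimension of the learned subspaces).
    p: number of source views; dim: per-view feature dimension, so that
       W has 2*p*dim rows ([F_s; F] with F stacking p copies of F_t).
    """
    vals, vecs = eig(A, B)                      # generalized eigenproblem
    vals = np.real(vals)
    order = np.argsort(vals)                    # ascending: smallest first
    W = np.real(vecs[:, order[:k]])             # keep k eigenvectors
    F_s_stack = W[:p * dim, :]                  # stacked [F_{s,1}; ...; F_{s,p}]
    F_block = W[p * dim:, :]                    # p stacked copies of F_t
    F_s = [F_s_stack[i * dim:(i + 1) * dim, :] for i in range(p)]
    F_t = F_block[:dim, :]                      # read F_t from the first copy
    return F_s, F_t
```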
IV. EXPERIMENTS

In this section, we evaluate our proposed approach on four public multi-view action datasets: the IXMAS action dataset [38], the Northwestern UCLA Multiview Action 3D (NUMA) dataset [39], the WVU action dataset [40] and the MuHAVi dataset [41].

We consider both cross-view and multi-view action recognition scenarios in this paper. The former trains a classifier on one view (source view) and tests it on another view (target view), while the latter trains a classifier on all but one view (source views) and tests it on the remaining view (target view). When constructing the codebook, only 200 local iDTs are randomly selected from each video sequence, following the default setting in [28].

For shared features learning, we fix the noise probability $N_p$ and use a single layer ($L = 1$). For transferable dictionary learning, we fix the dictionary size $K$ and the sparsity factor $\gamma = 50$. For distribution adaptation, we fix $\lambda$ and $\mu$, while $\beta$ and $k$ are chosen differently for the various datasets.

For the action recognition task, it is hard to perform the eigendecomposition in the original data space since the dimensionality of the data is high. Therefore, the experimental results are obtained using the RBF kernel in distribution adaptation, which has been shown to be a good kernel for addressing the nonlinear problem in previous works [25], [44]. For fair comparison, we adopt the leave-one-action-class-out training strategy of [12] and [19]. Each time, only one action class is used for testing in the target view. In order to evaluate the effectiveness of our proposed approach, all the videos in this action class are excluded from the feature learning procedure, including the LLC encoding and the proposed JSRDA, while these action videos can be seen when training the classifiers. We report the classification accuracy averaged over all possible choices of the testing action class.

On an Intel(R) Core(TM) i7 system with 32 GB RAM and un-optimized Matlab code, excluding the process of iDT extraction and LLC encoding, the average run-time for cross-view action recognition on the IXMAS dataset is 1.53 seconds for training videos and 0.17 seconds for testing videos. For multi-view action recognition, the average run-time is 4.99 seconds for training videos and 0.99 seconds for testing videos. Each video in the IXMAS dataset lasts 4 seconds on average. Due to the limited space of the paper, some extra experimental results and codes have been released on the website https://xdyangliu.github.io/JSRDA/.
A. IXMAS Dataset
The IXMAS dataset has 1,650 action samples of 11 action classes recorded by 4 side-view cameras and 1 top-view camera. These actions are check watch, cross arms, get up, kick, pick up, punch, scratch head, sit down, turn around, walk and wave. Figure 3 shows some example frames.

Fig. 3. Exemplar frames from the IXMAS dataset. Each row shows one action captured by five cameras.
1) Cross-View Action Recognition:
In this experiment, we evaluate our proposed JSRDA approach for cross-view action recognition on the IXMAS dataset. We compare our approach with [12], [14], [15], [19], [22], [23], [28], and report the recognition results in Table I. Our proposed approach achieves the best performance in 15 out of 20 combinations, and the overall performance is significantly better than all the comparison approaches, especially when the view difference is large (C4). In addition, our approach achieves nearly perfect performance on the IXMAS dataset with eight 100% accuracies, which verifies that our proposed approach is robust to viewpoint variations and can achieve good performance in cross-view action recognition with the learned view-invariant representations.

TABLE I. Cross-view action recognition results of various approaches under 20 combinations of source (training) and target (testing) views on the IXMAS dataset under unsupervised mode. C0 to C4 denote Camera 0 to Camera 4, respectively.

TABLE II. Multi-view action recognition results on the IXMAS dataset. Each column corresponds to a test view.
2) Multi-View Action Recognition:
In this experiment, we evaluate the performance of our proposed JSRDA approach for multi-view action recognition on the IXMAS dataset and make comparisons with existing approaches [6], [12], [14], [15], [19], [22], [23], [28], [45], [46]. The importance of shared feature learning, transferable dictionary learning and distribution adaptation is also evaluated. No-JSRDA denotes the results of a 1-NN classifier without using our approach, while No-SFL, No-TDL and No-DA denote the results of the JSRDA method without shared features learning, transferable dictionary learning and distribution adaptation, respectively.

Table II shows that our proposed JSRDA approach achieves competitive recognition performance compared with the other approaches. Although Kong et al. [15] achieve nearly perfect performance, our approach achieves comparable performance and slightly better accuracies in C4, where C4 is the top-view camera and there exists a larger domain divergence between the top view and the other side views. The overall accuracies of our method and of Kong et al. [15] are both above 99%. Zheng et al. [14] achieve satisfactory performance in C0, C1, C2 and C3, but their performance drops considerably when the target view is C4. On the contrary, our approach still achieves satisfactory performance even when the target view is C4. These results validate that our learned action representations are view-invariant and generalize well across views even when the view difference is large.

No-JSRDA performs very poorly, and its recognition accuracy for most tasks is less than 20% due to the domain divergence across views. Our proposed approach outperforms No-SFL, which verifies the effectiveness of the shared features. Without shared features, No-SFL only utilizes the private features, which are not discriminative enough since some motion information exists only in one view and cannot be shared across views. There is a large margin between the accuracies of Ours and No-TDL, which demonstrates that transferable dictionary learning encourages samples from different views to have the same sparse representation and thus reduces the domain divergence effectively. In addition, the accuracy gap between Ours and No-DA suggests the benefit of distribution adaptation for learning more robust view-invariant representations that generalize well across views, especially when the view difference is large. More importantly, the shared features learning (SFL), transferable dictionary learning (TDL) and distribution adaptation (DA) are complementary and indeed enable us to learn view-invariant features hierarchically.
3) Parameter Analysis:
We analyse the sensitivity of the parameters $N_p$, $L$, $K$, $\beta$, $k$ and $T$ in this experiment while fixing $\lambda$, $\mu$ and $\gamma = 50$. We conduct experiments on the multi-view action recognition task C0. The results of the other multi-view and cross-view action recognition tasks are not given here, as they show trends similar to C0.

Fig. 4. Performance analysis of the JSRDA on the IXMAS dataset with various values of the parameters $N_p$, $L$, $K$, $\beta$, $k$ and $T$. (a) Value of parameter $N_p$. (b) Number of layers $L$. (c) Value of parameter $K$. (d) Value of parameter $\beta$. (e) Value of parameter $k$. (f) Number of iterations $T$.

$N_p$ is the corruption probability in the shared features learning stage; we evaluate its effect for values $N_p \in \{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$. The results in Figure 4(a) indicate that the performance increases as the noise probability $N_p$ increases. When $N_p$ is lower than 0.3, the performance is poor: since the random corruption essentially adds a dropout regularization, too low a corruption probability cannot guarantee that the corrupted data are informative, i.e., sufficiently different from the raw data. When $N_p$ exceeds 0.3, the accuracy increases considerably and reaches its optimum at an intermediate value; as $N_p$ keeps increasing beyond that point, the performance decreases. The underlying reason is that adding too much noise to the raw data ($N_p > 0.6$) reduces the amount of shared information between views, so the discriminative power of the shared features decreases and leads to a relatively lower recognition accuracy. Considering these issues, we fix $N_p$ to an intermediate value in this range.

$L$ is the number of layers in mSDA; we evaluate its effect for $L \in \{1, 2, 3, 4\}$. From Figure 4(b), the performance with a single layer is already competitive; considering the extra training time of stacking more layers, we use $L = 1$.

$K$ is the dictionary size in the transferable dictionary learning stage. We fix the sparsity factor $\gamma = 50$ and vary the dictionary size from 300 to 1200 in multiples of 100, because the dictionary size must be larger than the number of samples of each view in the IXMAS dataset to guarantee sparsity. From Figure 4(c), we observe that the performance increases with the dictionary size and remains stable from $K = 600$ onwards; since a very large $K$ mainly adds computational cost, we choose $K$ in this stable range.

$\beta$ is the trade-off parameter of the intra-class and inter-class variance of the source views. A large range of $\beta$ values is evaluated to study its effect on the overall performance. If $\beta$ is too small, the class information of the source views is not considered; if $\beta$ is too large, the classifier may overfit to the source views. As can be seen from Figure 4(d), the performance is stable and good when $\beta$ is neither too small nor too large. To balance the class information against the overfitting problem, we use a moderately small $\beta$ in this work.

$k$ is the dimension of the learned view-invariant features, which is essentially the number of eigenvectors chosen in the eigendecomposition at the distribution adaptation stage. We illustrate the relationship between $k$ and the overall accuracy for a range of values of $k$. From Figure 4(e), we observe that the performance becomes stable once $k$ exceeds a certain value (800). This is because much information of the feature representation is lost in the eigendecomposition when $k$ is too small, so the discriminability of the features is insufficient and the overall performance is unsatisfactory. Once $k$ exceeds this value, the information is preserved well and $k$ has little impact on the overall performance except for the computational complexity. We therefore choose $k$ above this threshold, making a compromise between time efficiency and performance.

$T$ is the number of iterations. As can be seen from Figure 4(f), the accuracy converges to its optimum after only 2 iterations and then remains stable. Therefore, we use $T = 2$.

B. Northwestern UCLA Multiview Action 3D Dataset

Fig. 5. Exemplar frames from the Northwestern UCLA dataset. Action classes: Sit down and Throw.
The NUMA dataset has 1,509 action samples of 10 action classes captured by 3 Kinect cameras in 5 environments. These actions include pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw and carry. Figure 5 shows exemplar frames taken by the three cameras.
1) Cross-View Action Recognition:
In this experiment, we evaluate our proposed JSRDA approach for cross-view action recognition on the NUMA dataset. Our method is compared with [24], [28], and [47]–[50]. The results in Table III show that our proposed JSRDA approach achieves the best performance in all combinations and outperforms all the comparison approaches by a significantly large accuracy margin. In addition, our approach achieves nearly perfect performance on the NUMA dataset, with two 100% accuracies and four further accuracies above 99%. Approaches [24] and [50] rely on the manifold structure of the data to cope with the data-distribution mismatch; the explanation for the better performance of ours compared with [24] and [50] may be the lack of a strong manifold structure in these cross-view datasets. Approach [47] assumes that there exists a common subspace in which the modality gap between datasets can be reduced effectively, but this assumption is invalid when the view difference is large. Such remarkable improvements demonstrate the benefit of using both shared and private features for modeling the cross-view feature representation, and of learning a set of projections that map the source and target domains into respective subspaces without the manifold structure assumption.

TABLE III. Cross-view action recognition results of various approaches on the NUMA dataset under unsupervised mode. Each row corresponds to a training view and each column to a testing view. V0 to V2 denote View 0 to View 2, respectively.

TABLE IV. Multi-view action recognition results on the NUMA dataset. Each column corresponds to a test view. The symbol 'N/A' denotes that the result is not reported in the published paper.
2) Multi-View Action Recognition:
In this experiment, we evaluate the performance of our proposed JSRDA for multi-view action recognition on the NUMA dataset. Our approach is compared with [15], [23], [28], [39], [47], [51], and [52]. As most of the published papers only report average accuracies in the multi-view scenario, we only quote the reported accuracies in Table IV. The importance of shared features learning (SFL), transferable dictionary learning (TDL) and distribution adaptation (DA) is also evaluated.

Table IV shows that our proposed JSRDA achieves nearly perfect performance on the NUMA dataset with two 100% accuracies. Compared with the other approaches, ours performs significantly better, with a large accuracy margin. This demonstrates that our proposed JSRDA can learn robust and discriminative view-invariant representations for multi-view action recognition. Without the JSRDA, No-JSRDA performs very poorly due to the domain divergence across views; directly applying a classifier trained on one view to another view thus degrades the performance. That No-SFL performs worse than JSRDA demonstrates the effectiveness of the shared features, which are complementary to the private features. A large accuracy margin exists between ours and No-TDL, which shows that transferable dictionary learning is a key stage for reducing the domain divergence across views by encouraging the samples from different views to have the same sparse representations. In addition, the accuracy gap between Ours and No-DA suggests the benefit of distribution adaptation for addressing the performance degradation caused by a large view difference. More importantly, the shared features learning (SFL), transferable dictionary learning (TDL) and distribution adaptation (DA) are complementary and indeed enable us to learn view-invariant representations hierarchically.

Fig. 6. Performance analysis of the JSRDA on the NUMA dataset with various values of the parameters $\beta$, $k$ and $T$. (a) Value of parameter $\beta$. (b) Value of parameter $k$. (c) Number of iterations $T$.
3) Parameter Analysis:
The sensitivity of the parameters $\beta$, $k$ and $T$ is evaluated while fixing the other parameters $\lambda$, $\mu$, $\gamma$, $N_p$, $L$ and $K$ as in the previous experiments.

$\beta$ is the trade-off parameter of the intra-class and inter-class variance of the source views. A large range of $\beta$ values is evaluated to study its effect on the overall performance. As can be seen from Figure 6(a), our approach is insensitive to the parameter $\beta$, with only a small accuracy variation, when $\beta$ is small. When $\beta$ is too large, the classifier may overfit to the source views and the performance degrades. To balance the class information against the overfitting problem, we use a moderately small $\beta$ in this work.

$k$ is the dimension of the learned view-invariant features. We illustrate the relationship between $k$ and the overall accuracy for a range of values of $k$. From Figure 6(b), we observe that several values of $k$ give similarly good performance; we make a compromise between time efficiency and performance when choosing $k$.

$T$ is the number of iterations. As can be seen from Figure 6(c), the accuracy converges to its optimum within only 1 iteration and then remains stable. Therefore, we use $T = 1$.
C. WVU Action Dataset

The WVU dataset is collected from a network of 8 embedded cameras, with 12 action classes, and each action has 65 video samples. There are a total of 6,240 video samples in this dataset, making it a relatively large multi-view action dataset compared with the IXMAS and NUMA datasets. These actions include standing still, nodding head, clapping, waving one hand, waving two hands, punching, jogging, jumping jack, kicking, picking, throwing and bowling. Figure 7 shows exemplar frames of two action classes taken by the eight cameras.

Fig. 7. Exemplar frames from the WVU dataset. Each row shows one action viewed across eight camera views.

TABLE V. Cross-view action recognition results of various approaches on the WVU dataset under unsupervised mode. Each row corresponds to a training view and each column to a testing view. C0 to C7 denote View 0 to View 7, respectively. The three accuracy numbers in each bracket are the average recognition accuracies of [14], [20], and our proposed approach, respectively.
1) Cross-View Action Recognition:
In this experiment, we evaluate our proposed JSRDA approach for cross-view action recognition on the WVU dataset. Our method is compared with the two approaches in [14] and [20]. The results in Table V show that our approach achieves similar performance to [14]. That some accuracies of [14] are better demonstrates the effectiveness of the transferable dictionary learning proposed in [14]; however, when the distribution divergence across views is large, transferable dictionary learning alone cannot address the cross-view problem well without considering distribution adaptation. For example, when the source view is C2, C5 or C6, our proposed approach outperforms [14] by a large margin on all the pairwise views. In addition, we achieve the best average accuracies in 6 out of 8 tasks, namely C0, C1, C2, C3, C5 and C7. These results verify that our approach can effectively address the cross-view action recognition problem by learning view-invariant representations with the novel JSRDA framework.
2) Parameter Analysis:
We also evaluate the sensitivity of our approach to the parameters $\beta$, $k$ and $T$ while fixing the other parameters $\lambda$, $\mu$, $\gamma$, $N_p$, $L$ and $K$ as before. We conduct experiments on the cross-view action recognition task with C1 as the target view.

Fig. 8. Performance analysis of the JSRDA on the WVU dataset with various values of the parameters $\beta$, $k$ and $T$. (a) Value of parameter $\beta$. (b) Value of parameter $k$. (c) Number of iterations $T$.

$\beta$ is the trade-off parameter of the intra-class and inter-class variance of the source views. A large range of $\beta$ values is evaluated to study its effect on the overall performance. As can be seen from Figure 8(a), our approach is insensitive to the parameter $\beta$ when it is small, and achieves the best performance when $\beta = 1$. Therefore, we use $\beta = 1$ on this dataset.

$k$ is the dimension of the learned view-invariant features. We illustrate the relationship between $k$ and the overall accuracy for a range of values of $k$. From Figure 8(b), we observe that several values of $k$ give similarly good performance; we make a compromise between time efficiency and performance when choosing $k$.

$T$ is the number of iterations. As can be seen from Figure 8(c), the accuracy converges to its optimum within 5 iterations. Therefore, we use $T = 5$.
D. MuHAVi Dataset

The MuHAVi dataset [41] contains 17 human action classes: WalkTurnBack, RunStop, Punch, Kick, ShotGunCollapse, PullHeavyObject, PickupThrowObject, WalkFall, LookInCar, CrawlOnKnees, WaveArms, DrawGraffiti, JumpOverFence, DrunkWalk, ClimbLadder, SmashObject and JumpOverGap. Each action video is performed by 7 actors and recorded by 8 CCTV Schwan cameras located at the 4 sides and 4 corners of a rectangular platform. To reduce the computational burden and allow a fair comparison with other works, we follow [14] and [53] and choose the action videos captured by four cameras (i.e., two side cameras and two corner cameras) in our experiments. Figure 9 shows exemplar frames of two action classes taken by these four cameras.

Fig. 9. Exemplar frames from the MuHAVi dataset. Action classes: DrawGraffiti and WaveArms.

Table VI shows the recognition accuracies of our proposed JSRDA for cross-view action recognition. Although both WSCDD [28] and un-RLTDL [14] are also based on transferable dictionary learning, JSRDA achieves better performance than WSCDD and un-RLTDL due to the benefits of shared features and distribution adaptation. This demonstrates that JSRDA is robust to viewpoint variations and can achieve satisfactory performance in cross-view action recognition.

TABLE VI. Cross-view action recognition results of various approaches on the MuHAVi dataset under unsupervised mode. Each row corresponds to a training view and each column to a testing view. The three accuracy numbers in each bracket are the average recognition accuracies of WSCDD [28], un-RLTDL [14] and our proposed approach, respectively.

We also evaluate our approach for multi-view action recognition on the MuHAVi dataset; the results are given in Table VII. Although Zheng et al. [14] achieve good performance in C4 and C6, their performance degrades when the target view is C1 or C3 due to the large view difference. In contrast, JSRDA still achieves good performance even when the view difference is large, and its overall performance is better than that of the other approaches [14], [28], [53]–[55]. This illustrates that JSRDA can learn robust and discriminative view-invariant representations for multi-view action recognition even with a large view difference.

TABLE VII. Multi-view action recognition results of various approaches on the MuHAVi dataset under unsupervised mode. Each column corresponds to a test view.

To evaluate whether our method can generalize across both the view and the action class, we use the leave-one-actor-out strategy for training and testing, which means that each time one actor is excluded from both the training and testing procedures. We report the classification accuracy averaged over all possible choices of the excluded actor. We conduct these experiments on the multi-view action recognition task on the MuHAVi dataset.
From Table VII, we can see that the performance of our method using the leave-one-actor-out strategy (JSRDA_actor) is comparable to that of our method using the leave-one-action-class-out strategy (JSRDA). This verifies that our method can generalize to both the view and the action class.
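A minimal sketch of this leave-one-actor-out protocol is given below; run_multiview_experiment is a hypothetical stand-in for the full JSRDA training-and-testing pipeline, not part of any released implementation.

```python
# A minimal sketch of the leave-one-actor-out protocol described above:
# one actor's videos are removed from the data entirely, the usual
# multi-view train/test split (by camera view) is run on the remaining
# videos, and the accuracies are averaged over all excluded actors.
import numpy as np

def leave_one_actor_out(videos, run_multiview_experiment):
    """videos: list of dicts, each with at least an 'actor' field."""
    actors = sorted({v["actor"] for v in videos})
    accuracies = []
    for held_out in actors:
        # Exclude every video of the held-out actor from this run.
        kept = [v for v in videos if v["actor"] != held_out]
        accuracies.append(run_multiview_experiment(kept))
    # Average over all possible choices of the excluded actor.
    return float(np.mean(accuracies))
```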
E. Impact of Feature Extraction Parameters
To study in more detail how our method behaves for different choices of features, we use several combinations of the descriptors (trajectory shape, HOG, HOF, MBHx and MBHy) that form the iDT features to evaluate our method. To study how it behaves for different choices of extraction parameters, we also conduct experiments on the multi-view action recognition task C4 on the IXMAS dataset with different codebook sizes while keeping the other parameters unchanged. The results can be seen in Table VIII.

TABLE VIII: PERFORMANCE COMPARISON OF MULTI-VIEW ACTION RECOGNITION TASK C4 ON THE IXMAS DATASET FOR DIFFERENT COMBINATIONS OF FEATURES AND CODEBOOK SIZES. D DENOTES THE CODEBOOK SIZE.

We can see that the average accuracy is the best when we combine trajectory shape (Tra), HOG, HOF, MBHx and MBHy as the video feature.
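For illustration, the sketch below shows one way the descriptor combinations and the codebook size D could be varied to build per-video features. It assumes the iDT descriptors are already extracted per video and uses a k-means codebook with simple hard-assignment histograms as a stand-in for the actual coding scheme; all names and shapes are illustrative assumptions.

```python
# A minimal sketch, assuming iDT descriptors (trajectory shape, HOG, HOF,
# MBHx, MBHy) have already been extracted for each video. Hard-assignment
# bag-of-features encoding is used here only as a stand-in.
import numpy as np
from sklearn.cluster import KMeans

def encode_videos(videos, descriptor_names, codebook_size):
    """videos: list of dicts mapping descriptor name -> (n_i, d) array."""
    # Concatenate the chosen descriptor channels for every local feature.
    per_video = [np.hstack([v[name] for name in descriptor_names])
                 for v in videos]
    all_desc = np.vstack(per_video)

    # Learn a codebook of size D on the pooled descriptors.
    codebook = KMeans(n_clusters=codebook_size, n_init=4).fit(all_desc)

    # Represent each video as a normalized histogram over codewords.
    feats = []
    for desc in per_video:
        words = codebook.predict(desc)
        hist = np.bincount(words, minlength=codebook_size).astype(float)
        feats.append(hist / max(hist.sum(), 1.0))
    return np.vstack(feats)

# Example: compare two descriptor combinations at one codebook size.
# X1 = encode_videos(videos, ["Tra", "HOG", "HOF"], codebook_size=1000)
# X2 = encode_videos(videos, ["Tra", "HOG", "HOF", "MBHx", "MBHy"], 1000)
```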
CONCLUSION

In this paper, we propose a novel view-invariant representation learning approach for cross-view action recognition. Our approach incorporates shared feature learning, transferable dictionary learning and distribution adaptation into a unified framework and learns view-invariant representations hierarchically. A sample-affinity matrix is incorporated into the marginalized stacked denoising autoencoder (mSDA) to learn shared features, and the shared features are combined with the private features to obtain a new, informative feature representation. Then, a transferable dictionary pair is learned simultaneously from pairs of videos taken at different views to encourage each action video across views to have the same sparse representation. To address the problem of large view difference, a novel unsupervised distribution adaptation method is proposed to reduce the difference in both the marginal and the conditional distributions across views by learning a set of projections that project the source and target view data into respective low-dimensional subspaces. Finally, the view-invariant representations of action videos from different views are obtained in their respective subspaces.
Extensive experiments on the IXMAS, NUMA, WVU and MuHAVi datasets show that our approach outperforms state-of-the-art approaches for both cross-view and multi-view action recognition.

APPENDIX A
THE FORMULATION OF EACH TERM IN (7)

$F_s$ and $F$ are defined in Eq. (23):
$F_s = [F_{s,1}, \cdots, F_{s,p}]$, $X_s = [X_{s,1}, \cdots, X_{s,p}]$,
$F = [F_t, \cdots, F_t]$ is obtained by replicating $F_t$ $p$ times,
$X = [X_t, \cdots, X_t]$ is obtained by replicating $X_t$ $p$ times.   (23)

$M_s$ is defined in Eq. (24):
$M_s = X_s \big( L_s + \sum_{c=1}^{C} L_s^{(c)} \big) X_s^T$,
$L_s = [L_{s,1}, \cdots, L_{s,p}]$, $L_{s,i} = \frac{1}{N_{s,i}^2}\mathbf{1}_{s,i}\mathbf{1}_{s,i}^T$, $\mathbf{1}_{s,i} \in \mathbb{R}^{N_{s,i} \times 1}$,
$L_s^{(c)} = [L_{s,1}^{(c)}, \cdots, L_{s,p}^{(c)}]$, $i \in \{1, \cdots, p\}$,
$\big(L_{s,i}^{(c)}\big)_{m,n} = \begin{cases} \frac{1}{(N_{s,i}^{(c)})^2}, & x_m, x_n \in X_{s,i}^{(c)} \\ 0, & \text{otherwise.} \end{cases}$   (24)

$M_{st}$ is defined in Eq. (25):
$M_{st} = X_s \big( L_{st} + \sum_{c=1}^{C} L_{st}^{(c)} \big) X^T$,
$L_{st} = [L_{st,1}, \cdots, L_{st,p}]$, $L_{st,i} = -\frac{1}{N_{s,i} N_t}\mathbf{1}_{s,i}\mathbf{1}_t^T$, $\mathbf{1}_{s,i} \in \mathbb{R}^{N_{s,i} \times 1}$, $\mathbf{1}_t \in \mathbb{R}^{N_t \times 1}$,
$L_{st}^{(c)} = [L_{st,1}^{(c)}, \cdots, L_{st,p}^{(c)}]$, $i \in \{1, \cdots, p\}$,
$\big(L_{st,i}^{(c)}\big)_{m,n} = \begin{cases} -\frac{1}{N_{s,i}^{(c)} N_t^{(c)}}, & x_m \in X_{s,i}^{(c)},\ x_n \in X_t^{(c)} \\ 0, & \text{otherwise.} \end{cases}$   (25)

$M_{ts}$ is defined in Eq. (26):
$M_{ts} = X \big( L_{ts} + \sum_{c=1}^{C} L_{ts}^{(c)} \big) X_s^T$,
$L_{ts} = [L_{ts,1}, \cdots, L_{ts,p}]$, $L_{ts,i} = -\frac{1}{N_t N_{s,i}}\mathbf{1}_t\mathbf{1}_{s,i}^T$, $\mathbf{1}_t \in \mathbb{R}^{N_t \times 1}$, $\mathbf{1}_{s,i} \in \mathbb{R}^{N_{s,i} \times 1}$,
$L_{ts}^{(c)} = [L_{ts,1}^{(c)}, \cdots, L_{ts,p}^{(c)}]$, $i \in \{1, \cdots, p\}$,
$\big(L_{ts,i}^{(c)}\big)_{m,n} = \begin{cases} -\frac{1}{N_t^{(c)} N_{s,i}^{(c)}}, & x_m \in X_t^{(c)},\ x_n \in X_{s,i}^{(c)} \\ 0, & \text{otherwise.} \end{cases}$   (26)

$M$ is defined in Eq. (27):
$M = X \big( L + \sum_{c=1}^{C} L^{(c)} \big) X^T$,
$L = [L_t, \cdots, L_t]$ is obtained by replicating $L_t$ $p$ times, $L_t = \frac{1}{N_t^2}\mathbf{1}_t\mathbf{1}_t^T$, $\mathbf{1}_t \in \mathbb{R}^{N_t \times 1}$,
$L^{(c)} = [L_t^{(c)}, \cdots, L_t^{(c)}]$ is obtained by replicating $L_t^{(c)}$ $p$ times,
$\big(L_t^{(c)}\big)_{m,n} = \begin{cases} \frac{1}{(N_t^{(c)})^2}, & x_m, x_n \in X_t^{(c)} \\ 0, & \text{otherwise,} \end{cases}$   (27)

where $\mathbf{1}_{s,i}$ and $\mathbf{1}_t$ are column vectors with all ones.
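For reference, the sketch below builds the marginal blocks of Eqs. (24)–(27) in NumPy for a single source sub-view and the target. The assembly of the full matrices over all sub-views and the conditional (per-class) terms follow the same pattern; the data layout (samples as columns) is an assumption.

```python
# A minimal NumPy sketch of the marginal MMD blocks defined in Eqs. (24)-(27)
# for one source sub-view X_{s,i} and the target X_t (samples as columns).
import numpy as np

def marginal_mmd_blocks(Xs_i, Xt):
    """Xs_i: (d, N_si) source sub-view data; Xt: (d, N_t) target data."""
    N_si, N_t = Xs_i.shape[1], Xt.shape[1]
    one_s = np.ones((N_si, 1))                  # column vector 1_{s,i}
    one_t = np.ones((N_t, 1))                   # column vector 1_t

    L_s_i = (one_s @ one_s.T) / N_si**2         # source-source coefficients
    L_t = (one_t @ one_t.T) / N_t**2            # target-target coefficients
    L_st_i = -(one_s @ one_t.T) / (N_si * N_t)  # source-target cross term
    L_ts_i = L_st_i.T                           # target-source cross term

    M_s_i = Xs_i @ L_s_i @ Xs_i.T   # contributes to M_s
    M_t = Xt @ L_t @ Xt.T           # contributes to M
    M_st_i = Xs_i @ L_st_i @ Xt.T   # contributes to M_st
    M_ts_i = Xt @ L_ts_i @ Xs_i.T   # contributes to M_ts
    return M_s_i, M_t, M_st_i, M_ts_i
```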
REFERENCES

[1] X. Wang, "Intelligent multi-camera video surveillance: A review," Pattern Recognit. Lett., vol. 34, no. 1, pp. 3–19, 2013.
[2] M. Gori, M. Lippi, M. Maggini, and S. Melacci, "Semantic video labeling by developmental visual agents," Comput. Vis. Image Understand., vol. 146, pp. 9–26, May 2016.
[3] W. Hu, D. Xie, Z. Fu, W. Zeng, and S. Maybank, "Semantic-based surveillance video retrieval," IEEE Trans. Image Process., vol. 16, no. 4, pp. 1168–1181, Apr. 2007.
[4] W. Zhang, M. L. Smith, L. N. Smith, and A. R. Farooq, "Gender and gaze gesture recognition for human-computer interaction," Comput. Vis. Image Understand., vol. 149, pp. 32–50, Aug. 2016.
[5] W. Chen et al., "GameFlow: Narrative visualization of NBA basketball games," IEEE Trans. Multimedia, vol. 18, no. 11, pp. 2247–2256, Nov. 2016.
[6] I. N. Junejo, E. Dexter, I. Laptev, and P. Perez, "Cross-view action recognition from temporal self-similarities," in Proc. Eur. Conf. Comput. Vis., 2008, pp. 293–306.
[7] R. Li and T. E. Zickler, "Discriminative virtual views for cross-view action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2855–2862.
[8] Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, and C. Shi, "Cross-view action recognition via a continuous virtual path," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2690–2697.
[9] D. K. Vishwakarma and K. Singh, "Human activity recognition based on spatial distribution of gradients at sublevels of average energy silhouette images," IEEE Trans. Cogn. Develop. Syst., vol. 9, no. 4, pp. 316–327, Dec. 2017.
[10] Y. Liu, Z. Lu, J. Li, C. Yao, and Y. Deng, "Transferable feature representation for visible-to-infrared cross-dataset human action recognition," Complexity, vol. 2018, pp. 1–20, 2018.
[11] Y. Liu, Z. Lu, J. Li, T. Yang, and C. Yao, "Global temporal representation based CNNs for infrared action recognition," IEEE Signal Process. Lett., vol. 25, no. 6, pp. 848–852, Jun. 2018.
[12] J. Zheng, Z. Jiang, P. J. Phillips, and R. Chellappa, "Cross-view action recognition via a transferable dictionary pair," in Proc. Brit. Mach. Vis. Conf., 2012, pp. 1–11.
[13] J. Zheng and Z. Jiang, "Learning view-invariant sparse representations for cross-view action recognition," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 3176–3183.
[14] J. Zheng, Z. Jiang, and R. Chellappa, "Cross-view action recognition via transferable dictionary learning," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2542–2556, Jun. 2016.
[15] Y. Kong, Z. Ding, J. Li, and Y. Fu, "Deeply learned view-invariant features for cross-view action recognition," IEEE Trans. Image Process., vol. 26, no. 6, pp. 3028–3037, Jun. 2017.
[16] M. Chen, Z. E. Xu, K. Q. Weinberger, and F. Sha, "Marginalized denoising autoencoders for domain adaptation," in Proc. Int. Conf. Mach. Learn., 2012, pp. 1–8.
[17] A. Farhadi and M. K. Tabrizi, "Learning to recognize activities from the wrong view point," in Proc. Eur. Conf. Comput. Vis., 2008, pp. 154–166.
[18] Z. Zhang, C. Wang, B. Xiao, W. Zhou, and S. Liu, "Cross-view action recognition using contextual maximum margin clustering," IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 10, pp. 1663–1668, Oct. 2014.
[19] J. Liu, M. Shah, B. Kuipers, and S. Savarese, "Cross-view action recognition via view knowledge transfer," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 3209–3216.
[20] J. Wang, H. Zheng, J. Gao, and J. Cen, "Cross-view action recognition based on a statistical translation framework," IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 8, pp. 1461–1475, Aug. 2016.
[21] X. Yan, S. Hu, and Y. Ye, "Multi-task clustering of human actions by sharing information," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 4049–4057.
[22] A. Ulhaq, X. S. Yin, J. He, and Y. Zhang, "On space-time filtering framework for matching human actions across different viewpoints," IEEE Trans. Image Process., vol. 27, no. 3, pp. 1230–1242, Mar. 2018.
[23] H. Rahmani, A. Mian, and M. Shah, "Learning a deep model for human action recognition from novel viewpoints," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 667–681, Mar. 2018.
[24] C. Wang and S. Mahadevan, "Heterogeneous domain adaptation using manifold alignment," in Proc. Int. Joint Conf. Artif. Intell., 2011, pp. 1541–1546.
[25] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, "Transfer feature learning with joint distribution adaptation," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2200–2207.
[26] J. Zhang, W. Li, and P. Ogunbona, "Joint geometrical and statistical alignment for visual domain adaptation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 5150–5158.
[27] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, "Transfer joint matching for unsupervised domain adaptation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1410–1417.
[28] F. Zhu and L. Shao, "Weakly-supervised cross-domain dictionary learning for visual recognition," Int. J. Comput. Vis., vol. 109, nos. 1–2, pp. 42–59, Aug. 2014.
[29] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[30] M. Chen, K. Q. Weinberger, F. Sha, and Y. Bengio, "Marginalized denoising auto-encoders for nonlinear representations," in Proc. Int. Conf. Mach. Learn., 2014, pp. 1476–1484.
[31] C. M. Bishop, Pattern Recognition and Machine Learning, 5th ed. New York, NY, USA: Springer, 2007.
[32] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[33] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inf. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[34] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola, "A kernel method for the two-sample-problem," in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 513–520.
[35] J. Li, T. Zhang, W. Luo, J. Yang, X.-T. Yuan, and J. Zhang, "Sparseness analysis in the pretraining of deep neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 6, pp. 1425–1438, Jun. 2017.
[36] Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye, "A two-stage weighting framework for multi-source domain adaptation," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 505–513.
[37] M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang, "Scatter component analysis: A unified framework for domain adaptation and domain generalization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1414–1430, Jul. 2017.
[38] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Comput. Vis. Image Understand., vol. 104, nos. 2–3, pp. 249–257, 2006.
[39] J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, "Cross-view action modeling, learning and recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2649–2656.
[40] S. Ramagiri, R. Kavi, and V. Kulathumani, "Real-time multi-view human action recognition using a wireless camera network," in Proc. 5th ACM/IEEE Int. Conf. Distrib. Smart Cameras, Aug. 2011, pp. 1–6.
[41] S. Singh, S. A. Velastin, and H. Ragheb, "MuHAVi: A multicamera human action video dataset for the evaluation of action recognition methods," in Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill., Aug. 2010, pp. 48–55.
[42] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 3551–3558.
[43] J. Wang, J. Yang, K. Yu, F. Lv, T. S. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3360–3367.
[44] D. Tuia and G. Camps-Valls, "Kernel manifold alignment for domain adaptation," PLoS ONE, vol. 11, p. e0148655, Feb. 2016.
[45] Y. Yan, E. Ricci, S. Subramanian, G. Liu, and N. Sebe, "Multitask linear discriminant analysis for view invariant action recognition," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5599–5611, Dec. 2014.
[46] D. Weinland, M. Özuysal, and P. Fua, "Making action recognition robust to occlusions and viewpoint changes," in Proc. Eur. Conf. Comput. Vis., 2010, pp. 635–648.
[47] W. Sui, X. Wu, Y. Feng, and Y. Jia, "Heterogeneous discriminant analysis for cross-view action recognition," Neurocomputing, vol. 191, pp. 286–295, May 2016.
[48] B. Kulis, K. Saenko, and T. Darrell, "What you saw is not what you get: Domain adaptation using asymmetric kernel transforms," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 1785–1792.
[49] J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell. (2013). "Efficient learning of domain-invariant image representations." [Online]. Available: https://arxiv.org/abs/1301.3224
[50] W. Li, L. Duan, D. Xu, and I. W. Tsang, "Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 6, pp. 1134–1148, Jun. 2013.
[51] H. Rahmani and A. S. Mian, "3D action recognition from novel viewpoints," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 1506–1515.
[52] M. Liu, H. Liu, and C. Chen, "Enhanced skeleton visualization for view invariant human action recognition," Pattern Recognit., vol. 68, pp. 346–362, Aug. 2017.
[53] X. Wu and Y. Jia, "View-invariant action recognition using latent kernelized structural SVM," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 411–424.
[54] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, 2011.
[55] C.-N. J. Yu and T. Joachims, "Learning structural SVMs with latent variables," in Proc. Int. Conf. Mach. Learn., 2009, pp. 1169–1176.
Yang Liu received the B.S. degree from the School of Information Engineering, Chang'an University, Xi'an, China, in 2014. He is currently pursuing the Ph.D. degree with the School of Telecommunications Engineering, Xidian University, Xi'an. His research interests include cross-domain action recognition and transfer learning.
Zhaoyang Lu (SM'14) received the bachelor's, master's, and Ph.D. degrees in communication and information systems from Xidian University, Xi'an, China, in 1982, 1985, and 1990, respectively. He is currently a Full Professor with the School of Telecommunications Engineering, Xidian University. His current research interests include image matching and recognition and video content analysis and understanding. He has authored or co-authored over 100 papers. He holds over 10 patents in the field of pattern recognition and image processing.
Jing Li (M'14) received the Ph.D. degree in control theory and engineering from Northwestern Polytechnical University, Xi'an, China, in 2008. From 2004 to 2005, she was a Visiting Scholar with the National Laboratory of Pattern Recognition, Beijing, China. In 2008, she was a Research Assistant with the Department of Computing, The Hong Kong Polytechnic University. She was a Visiting Scholar with the University of Delaware, USA, from 2013 to 2014. She is currently an Associate Professor with the School of Telecommunications Engineering, Xidian University, Xi'an, where she is also the Leader of the Intelligent Signal Processing and Pattern Recognition Laboratory. She has published over 50 research papers in international journals and conference proceedings in the areas of computer vision and pattern recognition. Her research interests include image registration, matching and retrieval, and video content analysis and understanding.