GaitSet: Cross-view Gait Recognition through Utilizing Gait as a Deep Set
Hanqing Chao, Kun Wang, Yiwei He, Junping Zhang, Jianfeng Feng
Abstract—Gait is a unique biometric feature that can be recognized at a distance; thus, it has broad applications in crime prevention, forensic identification, and social security. To portray a gait, existing gait recognition methods utilize either a gait template, which makes it difficult to preserve temporal information, or a gait sequence, which maintains unnecessary sequential constraints and thus loses the flexibility of gait recognition. In this paper, we present a novel perspective that utilizes gait as a deep set, which means that a set of gait frames is integrated by a global-local fused deep network, inspired by the way our left and right hemispheres process information, to learn information that can be used in identification. Based on this deep set perspective, our method is immune to frame permutations and can naturally integrate frames from different videos that have been acquired under different scenarios, such as diverse viewing angles, different clothes, or different item-carrying conditions. Experiments show that under normal walking conditions, our single-model method achieves an average rank-1 accuracy of 96.1% on the CASIA-B gait dataset and an accuracy of 87.9% on the OU-MVLP gait dataset. Under various complex scenarios, our model also exhibits a high level of robustness: it achieves accuracies of 90.8% and 70.3% on CASIA-B under bag-carrying and coat-wearing walking conditions, respectively, significantly outperforming the best existing methods. Moreover, the proposed method maintains a satisfactory accuracy even when only small numbers of frames are available in the test samples; for example, it achieves 85.0% on CASIA-B even when using only 7 frames. The source code has been released at https://github.com/AbnerHqC/GaitSet.
Index Terms—Gait Recognition, Biometric Authentication, GaitSet, Deep Learning
1 INTRODUCTION

UNLIKE other biometric identification sources such as a face, fingerprint, or iris, gait is a unique biometric feature that can be recognized from a distance without any intrusive interactions with subjects. This characteristic gives gait recognition high potential for use in applications such as crime prevention, forensic identification, and social security. However, a person's variational poses in walking, which form the basic information for gait recognition, are easily affected by exterior factors such as the subject's walking speed, clothing, and item-carrying condition, as well as the camera's viewpoint and frame rate. These factors make gait recognition very challenging, especially cross-view gait recognition, which seeks to identify gait that might be captured from different angles. It is thus crucial to develop a practical gait recognition system.

The existing works have tried to tackle the problem from two aspects.

• This work was supported in part by Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01) and ZJLab, the National Key R & D Program of China (No. 2018YFB1305104), and the National Natural Science Foundation of China (Grant No. 61673118).
• Manuscript received xxx xx, 2020; revised xxx xx, 2020.
• Hanqing Chao, Kun Wang, Yiwei He, and Junping Zhang are with the Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, 200438, P.R. China; Tel.: +86-21-55664503, Fax: +86-21-65654253, Email: {hqchao16, KunWang17, heyw15, jpzhang}@fudan.edu.cn
• Jianfeng Feng is with the Institute of Science and Technology for Brain-inspired Intelligence, Fudan University, Email: [email protected]
• Corresponding author: Junping Zhang
Fig. 1. From top-left to bottom-right: silhouettes of a complete period of a subject in the CASIA-B gait dataset.
They either regard gait as a single image or as a video sequence. Methods in the first category compress all gait silhouettes into one image, i.e., a gait template, for gait recognition [1], [2], [3], [4], [5], [6], [7]. Although various existing gait templates [5], [6], [7] encode information as abundantly as possible, the compression process omits significant features such as temporal information and fine-grained spatial information. To address this issue, the methods in the second category extract features directly from the original gait silhouette sequences [8], [9], [10]. These methods preserve more temporal information but suffer a significant degradation when an input contains discontinuous frames or has a frame rate different from that of the training dataset.

To solve these problems, we present a novel perspective that regards gait as a set of gait silhouettes. Because gait is a periodic motion, it can be represented by a single period. Meanwhile, in a silhouette sequence containing one gait period, it can be observed that the silhouette in each position has a unique pose, as shown in Fig. 1. Given anyone's gait silhouettes, we can easily rearrange them into the correct order solely by observing their appearance. This suggests that the order of the poses in a gait period is not key information for differentiating one person from others, since the pattern of the order is universal. Based on such an assumption, we can directly regard gait as a set of images and extract temporal information without ranking each frame as in a video.

From this perspective, we propose an end-to-end deep learning model called GaitSet that extracts features from a gait frame set to identify gaits. Fig. 2 shows the overall scheme of GaitSet. The input to our model is a set of gait silhouettes. First, a CNN is used to extract frame-level features from each silhouette independently (local information). Second, an operation called Set Pooling is used to aggregate frame-level features into a single set-level feature (global information). Because this operation is conducted using high-level feature maps instead of the original silhouettes, it preserves spatial and temporal information better than a gait template; this aspect is experimentally validated in Sec. 4.5. The global-local fused deep network resembles the way our brain processes information [11]. Third, a structure called horizontal pyramid mapping (HPM) is applied to project the set-level feature into a more discriminative space to obtain a final deep set representation. The superiority of the proposed method can be summarized in the following three aspects:
• Flexible: Our model is highly flexible because it imposes no constraints on the input except the size of the silhouette. This means that the input set can consist of any number of nonconsecutive silhouettes filmed from different viewpoints under different walking conditions. Related experiments are presented in Sec. 4.7.
• Fast: Our model directly learns the deep set representation of gait instead of measuring the similarity between a pair of gait templates or sequences. Thus, the representation of each sample only needs to be computed once, and recognition can be done by comparing the Euclidean distance between the representations of different samples.
• Effective: Our model substantially improves the state-of-the-art performance on the CASIA-B [12] and OU-MVLP [13] datasets, exhibiting strong robustness to view and walking-condition variations and high generalization ability to large datasets.

Compared with our previous AAAI-19 conference paper on this topic, we have extended our work in four ways: 1) we surveyed and compared more state-of-the-art gait recognition algorithms; 2) we conducted more comprehensive experiments to evaluate the performance of the proposed GaitSet model; 3) we achieved better performance by improving the loss function used in GaitSet; and 4) a post feature dimension reduction module was included to enhance the practicality.
2 RELATED WORKS
In this section, we briefly survey developments in gait recognition and set-based deep learning methods.
Gait recognition methods can be broadly categorized into template-based and sequence-based approaches. In the former category, previous works have generally divided the pipeline into two parts, i.e., template generation and matching. The goal of template generation is to compress gait information into a single image, e.g., a Gait Energy Image (GEI) [14] or a Chrono-Gait Image (CGI) [15]. To generate a template, these approaches first estimate the human silhouettes in each frame through background removal. Then, they generate a gait template by applying pixel-level operators to the aligned silhouettes [15]. In the template matching procedure, they first extract the gait representation from a template image using machine learning approaches such as canonical correlation analysis (CCA) [16], linear discriminant analysis (LDA) [1], [17], and deep learning [18]. Then, they measure the similarity between pairs of representations using Euclidean distance or other metric learning approaches [1], [3], [4], [7]. For example, the view transformation model (VTM) learns a projection between different views [19]; [2] proposed view-invariant discriminative projection (ViDP) to project the templates into a latent space to learn a view-invariant representation. Finally, they assign a label to the template based on the measured distance using a classifier, e.g., an SVM or a nearest neighbor classifier [4], [6], [7], [18], [20], [21].

In the second category, the video-based approaches directly take a sequence of silhouettes as the input. For instance, the 3D CNN-based approaches [7], [9] extract temporal-spatial information using 3D convolutions; Liao et al. [8] and An et al. [22] utilized human skeletons to learn gait features robust to the change of clothing; Zhang et al. [23] fused sequential frames with LSTM attention units; and Wu et al. [10] proposed the spatial-temporal graph attention network (STGAN) to uncover the graph relationships between gait frames, followed by obtaining attention for the gait video. Recently, a part-based model [24] was proposed to capture the spatial-temporal features of each part and achieved promising results. To prevent redundant features in the part-based model, a two-stage training strategy was used in [25] to learn compact features effectively.
Most deep learning works have focused on regular input representations such as video sequences or images. The initial goal of using unordered sets was to address point cloud tasks in the computer vision domain [28] based on PointNet. Using an unordered set, PointNet can avoid the noise introduced by quantization and the extension of data, leading to high prediction performance. Since then, set-based methods have been widely used in the point cloud domain [29], [30], [31] and in content recommendation [32]
and image captioning [33] by integrating features into the form of a set. [34] further formalized the deep learning tasks defined on sets and characterized the permutation-invariant functions. To the best of our knowledge, this topic has not been studied in depth in the gait recognition domain except in our previous AAAI-19 conference version of this paper [26].

Fig. 2. The framework of GaitSet [26]. 'SP' represents set pooling. Trapezoids represent convolution and pooling blocks, and those in the same column have the same configurations, as shown by the rectangles with capital letters. Note that although the blocks in MGP have the same configurations as those in the main pipeline, the parameters are shared only across blocks in the main pipeline, not with those in MGP. HPP represents horizontal pyramid pooling [27].
3 GAITSET

In this section, we introduce the details of our GaitSet method, which learns deep discriminative information from a set of gait silhouettes. To aid understanding, the overall pipeline is illustrated in Fig. 2.
The concept of regarding gait as a deep set is formulated first. Given a dataset of $N$ people with identities $y_i$, $i \in 1, 2, \ldots, N$, we assume that the gait silhouettes of a certain person are subject to a distribution $\mathcal{P}_i$ that is uniquely related to that individual. Therefore, all silhouettes in one or more sequences of a given person can be regarded as a set of $n$ silhouettes $\mathcal{X}_i = \{x_i^j \mid j = 1, 2, \ldots, n\}$, where $x_i^j \sim \mathcal{P}_i$. Under this assumption, we tackle the gait recognition task via three steps, formulated as

$$f_i = H\big(G\big(F(x_i^1), F(x_i^2), \ldots, F(x_i^n)\big)\big), \qquad (1)$$

where $F$ is a convolutional network that extracts frame-level features from each gait silhouette. The function $G$ is a permutation-invariant function used to map a set of frame-level features to a set-level feature [34] based on set pooling (SP), which will be introduced in Sec. 3.2. The function $H$ learns the deep set discriminative representation of $\mathcal{P}_i$ from the set-level feature through a structure called horizontal pyramid mapping (HPM), which will be discussed in Sec. 3.3. The input $\mathcal{X}_i$ is a tensor with four dimensions: the set dimension, image channel dimension, image height dimension, and image width dimension.

The goal of Set Pooling (SP) is to condense a set of gait information, formulated as $z = G(V)$, where $z$ denotes the set-level feature and $V = \{v^j \mid j = 1, 2, \ldots, n\}$ denotes the frame-level features, where $v^j$ is the $j$-th frame-level feature map and $n$ denotes the number of gait frames in a set. Note that there are two constraints when performing an SP operation. First, to take a set as an input, the function should be a permutation-invariant function satisfying

$$G(\{v^j \mid j = 1, 2, \ldots, n\}) = G(\{v^{\pi(j)} \mid j = 1, 2, \ldots, n\}), \qquad (2)$$

where $\pi$ is any permutation [34]. Second, the function $G$ should be able to take a set with arbitrary cardinality, because the number of a person's gait silhouettes can be arbitrary in a real-world scenario. Next, we describe several instantiations of $G$. The experiments will show that although different instantiations of SP do influence the performance, they do not produce significant differences, and all SP instantiations outperform the GEI-based methods by a large margin.
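To make Eq. 1 concrete, the following is a minimal PyTorch sketch, assuming a toy one-layer CNN for $F$, max-based set pooling for $G$, and a single fully connected layer for $H$; the layer sizes and the input resolution are illustrative placeholders rather than the configuration used in the paper.

```python
# Minimal sketch of Eq. (1): f_i = H(G(F(x_i^1), ..., F(x_i^n))).
import torch
import torch.nn as nn

class ToyGaitSet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # F: frame-level CNN applied to every silhouette independently.
        self.F = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # H: maps the set-level feature to a discriminative embedding
        # (a single FC here; the paper uses HPM, see Sec. 3.3).
        self.H = nn.Linear(32 * 4 * 4, feat_dim)

    def forward(self, x):                  # x: (n, 1, h, w), an unordered set of n silhouettes
        v = self.F(x)                      # (n, 32, 4, 4) frame-level features
        z, _ = v.max(dim=0)                # G: permutation-invariant set pooling (max over the set dim)
        return self.H(z.flatten())         # f_i: deep set representation

frames = torch.rand(30, 1, 64, 44)         # a toy set of 30 silhouettes
model = ToyGaitSet()
f1 = model(frames)
f2 = model(frames[torch.randperm(30)])     # the same frames, shuffled
print(torch.allclose(f1, f2))              # True: the output is invariant to frame order (Eq. 2)
```

Because $G$ operates along the set dimension, the same model accepts an arbitrary number of frames, which is the property exploited in Sec. 4.7.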
Basic Statistical Functions. To meet the invariance constraint in Eq. 2, one rational strategy for SP is to apply statistical functions on the set dimension. To balance representativeness and computational cost, we considered three statistical functions: max(·), mean(·), and median(·). A comprehensive comparison will be given in Sec. 4.
Joint Functions. Further, two feasible ways of joining the aforementioned basic statistical functions are analyzed as follows:

$$G(\cdot) = \max(\cdot) + \operatorname{mean}(\cdot) + \operatorname{median}(\cdot), \qquad (3)$$

$$G(\cdot) = 1\_1\mathrm{C}\big(\operatorname{cat}\big(\max(\cdot), \operatorname{mean}(\cdot), \operatorname{median}(\cdot)\big)\big), \qquad (4)$$

where cat means concatenation on the channel dimension, $1\_1\mathrm{C}$ denotes a $1 \times 1$ convolutional layer, and max, mean, and median are applied on the set dimension. Eq. 4 is an extended version of Eq. 3 that allows the $1 \times 1$ convolutional layer to learn a proper weight for combining the information extracted by the different statistical functions.
Fig. 3. Seven different instantiations of Set Pooling (SP). $1\_1\mathrm{C}$ and cat represent the $1 \times 1$ convolutional layer and the concatenation operation, respectively. Here, $n$ represents the number of feature maps in a set, and $c$, $h$, and $w$ denote the number of channels, the height, and the width of a feature map, respectively. a. Three basic statistical SPs and two joint SPs. b. Pixel-wise attention SP. c. Frame-wise attention SP.
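As an illustration of the basic statistical SP and the joint SP of Eqs. 3 and 4, the snippet below is a minimal sketch assuming frame-level features $V$ of shape $(n, c, h, w)$; tensor sizes are illustrative.

```python
# Sketch of the statistical and joint SP instantiations (Eqs. 3-4).
import torch
import torch.nn as nn

def sp_basic(V, mode="max"):
    # max / mean / median applied along the set dimension (dim 0)
    if mode == "max":
        return V.max(dim=0).values
    if mode == "mean":
        return V.mean(dim=0)
    return V.median(dim=0).values            # "median"

def sp_joint_sum(V):
    # Eq. (3): element-wise sum of the three statistics
    return sp_basic(V, "max") + sp_basic(V, "mean") + sp_basic(V, "median")

class SPJointConv(nn.Module):
    # Eq. (4): concatenate the statistics on the channel dim and fuse with a 1x1 conv (1_1C)
    def __init__(self, c):
        super().__init__()
        self.fuse = nn.Conv2d(3 * c, c, kernel_size=1)

    def forward(self, V):
        stats = torch.cat([sp_basic(V, m) for m in ("max", "mean", "median")], dim=0)
        return self.fuse(stats.unsqueeze(0)).squeeze(0)       # (c, h, w)

V = torch.rand(30, 128, 16, 11)                               # n = 30 frame-level feature maps
print(sp_joint_sum(V).shape, SPJointConv(128)(V).shape)       # both (128, 16, 11)
```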
Attention
Visual attention has been successfully applied in many computer vision tasks [35], [36], [37], and we also capitalize on attention to implement SP. We include two attention strategies in our work. The first is pixel-wise attention. Specifically, we refine the output of SP by utilizing the global information to learn an element-wise attention map for each frame-level feature map, as shown in Fig. 3b. First, global information is collected by the statistical functions. Then, it is fed into a $1 \times 1$ convolutional layer along with the original feature map to calculate an attention map for the refinement. The final set-level feature $z$ is extracted by employing a max operation on the set of refined frame-level feature maps. We use a residual structure to accelerate and stabilize convergence. The second is frame-wise attention, in which global max pooling is first applied to each $v^j$ to obtain a compressed frame-wise feature. Then, based on the frame-wise feature, a fully connected layer is applied to calculate a frame-wise weight $a_j$. Finally, $z$ is calculated as $\sum_{j=1}^{n} \tilde{a}_j v^j$, where $\tilde{a}_j$ is the softmax-normalized frame-wise weight. Fig. 3c illustrates the architecture of the frame-wise attention.

In the literature, splitting feature maps into strips is a commonly used tactic in person re-identification tasks [27], [38]. For instance, [27] proposed horizontal pyramid pooling (HPP) by cropping and resizing the images into a uniform size based on pedestrian size, while the discriminative parts vary from image to image. With 4 scales, HPP can thus help the deep network gather both local and global information by focusing on features of different sizes. Here, we improve HPP to adapt it to the gait recognition task: instead of applying a $1 \times 1$ convolutional layer after the pooling, we use independent fully connected layers (FCs) for each pooled feature to map it into the discriminative space, as shown in Fig. 4. We call this approach horizontal pyramid mapping (HPM).

Concretely, HPM has $S$ scales. On scale $s \in 1, 2, \ldots, S$, the feature map extracted by SP is split into $2^{s-1}$ strips on the height dimension, i.e., $\sum_{s=1}^{S} 2^{s-1}$ strips in total. Then, global pooling is applied to the 3-D strips to obtain 1-D features. For a strip $z_{s,t}$, where $t \in 1, 2, \ldots, 2^{s-1}$ represents
the index of the strip in the scale, the global pooling is formulated as $f'_{s,t} = \operatorname{maxpool}(z_{s,t}) + \operatorname{avgpool}(z_{s,t})$, where maxpool and avgpool denote global max pooling (GMP) and global average pooling (GAP), respectively. The functions maxpool and avgpool are used at the same time because the combined result outperforms applying either operation alone. The final step is to employ FCs to map the features $f'$ into a deep discriminative space. Because the strips at different scales depict features of different receptive fields, and different strips at each scale depict features of different spatial positions, using independent FCs is a natural choice, as shown in Fig. 4.

Fig. 4. The structure of horizontal pyramid mapping [26].

Generally, different layers of a convolutional network have different receptive fields. The deeper a layer is, the larger its receptive field will be. Thus, pixels in the feature maps of a shallow layer pay more attention to local and fine-grained information, while those in deeper layers focus more on global and coarse-grained information. The set-level features extracted by applying SP to different layers are analogous. As shown in the main pipeline of Fig. 2, only one SP is applied to the last layer of the convolutional network. To collect set information at different levels, we propose a multilayer global pipeline (MGP), which has a structure similar to that of the convolutional network in the main pipeline; however, we add the set-level features extracted from different layers to MGP. The final feature map generated by MGP is also mapped into $\sum_{s=1}^{S} 2^{s-1}$ features by HPM. Note that 1) the HPM that executes after MGP does not share parameters with the HPM that executes after the main pipeline, and 2) the main pipeline is similar to human cognition, which focuses intuitively on a person's profile, whereas MGP can preserve more details of a person's walking movements.
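The HPM mapping described in Sec. 3.3 can be sketched as follows, assuming a set-level feature map $z$ of shape $(c, h, w)$; the batched weight tensor stands in for the independent per-strip fully connected layers, and all dimensions are illustrative.

```python
# Minimal sketch of HPM: split into 2^(s-1) strips per scale, pool with GMP + GAP,
# and map every strip with its own fully connected layer.
import torch
import torch.nn as nn

class HPM(nn.Module):
    def __init__(self, in_channels, out_dim, scales=5):
        super().__init__()
        self.scales = scales
        n_strips = sum(2 ** s for s in range(scales))          # sum_{s=1}^{S} 2^(s-1)
        # independent FC per strip, implemented as one batched weight tensor
        self.fc = nn.Parameter(torch.randn(n_strips, in_channels, out_dim) * 0.01)

    def forward(self, z):                                      # z: (c, h, w) set-level feature
        feats = []
        for s in range(self.scales):
            for strip in z.chunk(2 ** s, dim=1):               # split on the height dimension
                # f' = GMP + GAP over each 3-D strip -> 1-D feature of length c
                feats.append(strip.amax(dim=(1, 2)) + strip.mean(dim=(1, 2)))
        f = torch.stack(feats)                                 # (n_strips, c)
        return torch.einsum("nc,ncd->nd", f, self.fc)          # independent FC per strip

z = torch.rand(128, 16, 11)
print(HPM(in_channels=128, out_dim=256)(z).shape)              # (31, 256) for S = 5 scales
```

For $S = 5$ this yields $\sum_{s=1}^{5} 2^{s-1} = 31$ strip features per pipeline; since MGP has its own HPM, the network outputs $2 \times 31$ such features in total.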
In the field of identification [39], [40], [41], two loss functions are widely used, i.e., the cross-entropy loss and the triplet loss [40]. To obtain the best performance, we conducted comprehensive experiments on these two loss functions.
Cross-entropy loss is common in classification tasks. It measures the gap between a predictive distribution and the corresponding true distribution. In the recognition task, the output classes reflect all labels (identities) in the training set. As mentioned above, the output of the network is $2 \times \sum_{s=1}^{S} 2^{s-1}$ features, each with a dimension of $d$. During training, a cross-entropy loss is calculated for each feature, and all losses are summed as the total loss. During testing, the feature before the softmax layer is used for recognition.

Triplet loss was initially proposed for face recognition [40] but has become a popular loss function for metric embedding learning and has achieved high performance on various tasks [26], [40], [42], [43], [44]. It aims to pull semantically similar points close to each other while pushing semantically different points away from each other [42]. The specific version of triplet loss adopted in this paper is the Batch All ($BA_+$) triplet loss [42]. Specifically, we denote a sample triplet as $r = (\alpha, \beta, \gamma)$, where $\alpha$ represents the anchor, $\beta$ is a sample with the same label as the anchor $\alpha$, and $\gamma$ denotes a sample with a label different from that of the anchor $\alpha$. The $BA_+$ triplet loss of this triplet is then defined as

$$L(r) = \operatorname{ReLU}(\xi + D_{\alpha,\beta} - D_{\alpha,\gamma}), \qquad (5)$$

where $\xi$ is the margin between the intraclass distance $D_{\alpha,\beta}$ and the interclass distance $D_{\alpha,\gamma}$. For a triplet $r$, each sample has $2 \times \sum_{s=1}^{S} 2^{s-1}$ features. We calculate the triplet loss for each corresponding feature triplet, i.e., $2 \times \sum_{s=1}^{S} 2^{s-1}$ triplet losses are calculated.
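A minimal sketch of the per-strip Batch All triplet loss of Eq. 5 is given below, assuming features of shape (batch, n_strips, d) and integer identity labels; the margin value and the batch composition are illustrative, and averaging only the non-zero terms follows the common $BA_+$ convention [42].

```python
import torch
import torch.nn.functional as F

def ba_triplet_loss(features, labels, margin=0.2):
    # features: (batch, n_strips, d); the loss is computed per strip, then averaged
    b = features.size(0)
    dist = torch.cdist(features.transpose(0, 1), features.transpose(0, 1))  # (n_strips, b, b)
    same = labels.view(1, b) == labels.view(b, 1)                           # (b, b) same-identity mask
    pos_mask = (same & ~torch.eye(b, dtype=torch.bool)).unsqueeze(0)        # valid (anchor, positive)
    neg_mask = (~same).unsqueeze(0)                                         # valid (anchor, negative)
    # Eq. (5) for every anchor/positive/negative combination: ReLU(xi + D_ap - D_an)
    losses = F.relu(margin + dist.unsqueeze(3) - dist.unsqueeze(2))         # (n_strips, b, b, b)
    valid = (pos_mask.unsqueeze(3) & neg_mask.unsqueeze(2)).expand_as(losses)
    active = losses[valid & (losses > 0)]                                   # BA+: keep only non-zero terms
    return active.mean() if active.numel() > 0 else losses.sum() * 0.0

feats = torch.rand(32, 62, 256)                    # toy batch: 8 subjects x 4 samples, 62 strip features
labels = torch.arange(8).repeat_interleave(4)
print(ba_triplet_loss(feats, labels))
```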
Using a combination of two loss functions. Our previous work [26] only used the Batch All ($BA_+$) triplet loss and achieved state-of-the-art performance. In this study, to improve the learning ability, we combine the cross-entropy loss with the triplet loss. First, the cross-entropy loss is used to train the network to convergence. Then, a smaller learning rate with the Batch All ($BA_+$) triplet loss is used to let the model find a more discriminant metric space. Experiments comparing these two losses are shown in Sec. 4.5.2.
Training. In the training phase, a batch of size $p \times k$ is sampled from the training set, where $p$ denotes the number of persons and $k$ is the number of training samples each person has in the batch. Note that although the experiments show that our model performs well when its input is a set composed of silhouettes gathered from arbitrary sequences, during the training phase a sample is composed only of silhouettes sampled from one sequence.
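The batch composition described above can be sketched as follows; the dataset layout, the subject and sample counts, and the 30-frame set size are illustrative assumptions rather than fixed parts of the method.

```python
# Sketch of composing a p x k training batch: p subjects, k samples per subject,
# each sample being a random fixed-size subset of frames from a single sequence.
import random

def sample_batch(dataset, p=8, k=16, frames_per_set=30):
    """dataset: dict mapping subject id -> list of sequences, each a list of frame tensors."""
    batch, labels = [], []
    for subject in random.sample(list(dataset.keys()), p):
        for _ in range(k):
            seq = random.choice(dataset[subject])              # frames come from one sequence only
            # sample with replacement if the sequence is shorter than frames_per_set
            frames = (random.sample(seq, frames_per_set) if len(seq) >= frames_per_set
                      else random.choices(seq, k=frames_per_set))
            batch.append(frames)
            labels.append(subject)
    return batch, labels
```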
Testing. Given a query $Q$, the goal is to retrieve all sets with the same identity in the gallery set $\mathbb{G}$. We denote a sample in $\mathbb{G}$ as $G$. First, $Q$ is fed into the GaitSet model to generate $2 \times \sum_{s=1}^{S} 2^{s-1}$ multiscale features, and all these features are concatenated into a final representation $F_Q$, as shown in Fig. 2. The same process is applied to each $G$ to obtain $F_G$. Finally, $F_Q$ is compared with every $F_G$ based on the nearest Euclidean distance to calculate the rank-1 recognition accuracy, i.e., the percentage of cases in which the correct subject is ranked first.

As introduced in Sec. 3.6, identification is achieved by comparing $F_Q$ with every $F_G$ to find the nearest neighbor. Let $d_f$ denote the dimension of the final representation $F$. The computational complexity of this identification process is $O(d_f|\mathbb{G}|)$, where $|\cdot|$ denotes the cardinality of a set. In practical applications, $|\mathbb{G}|$ can be extremely large, so a small $d_f$ is the key to keeping the process efficient. Thus, we propose a post feature dimension reduction module, which is a post-trained linear projection that reduces the dimension of the output feature while maintaining a competitive recognition accuracy. Specifically, based on a trained GaitSet model with fixed parameters, we feed the learned $d_f$-dimensional feature into a fully connected layer with an output dimension of $d'_f$. The new $d'_f$-dimensional features are used to calculate a triplet loss to train the fully connected layer. In Sec. 4.6 we show that, compared with directly shrinking the output dimension of the HPM, our post feature dimension reduction module can effectively reduce the feature dimension to a significantly smaller size while preserving a high recognition performance.
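Below is a minimal sketch of the testing procedure and of the post feature dimension reduction: gallery representations are computed once, identification is a Euclidean nearest-neighbour search of cost $O(d_f|\mathbb{G}|)$, and a single post-trained linear layer shrinks $d_f$. The 15,872-dimensional input matches the concatenated HPM outputs discussed in Sec. 4.6, while the gallery size and the 256-dimensional output are illustrative.

```python
import torch
import torch.nn as nn

def rank1_identify(f_query, gallery_feats, gallery_ids):
    # f_query: (d_f,), gallery_feats: (|G|, d_f) -> identity of the nearest gallery sample
    dists = torch.norm(gallery_feats - f_query, dim=1)        # Euclidean distance to every gallery sample
    return gallery_ids[dists.argmin()]

# Post feature dimension reduction: a linear projection trained on top of the frozen
# GaitSet features with the same triplet loss (Sec. 3.7); here it is only applied.
reduce_fc = nn.Linear(15872, 256, bias=False)

gallery_feats = torch.rand(1000, 15872)                       # pre-computed gallery representations F_G
gallery_ids = torch.arange(1000)
query = torch.rand(15872)                                     # F_Q of one probe sample

with torch.no_grad():
    small_gallery = reduce_fc(gallery_feats)                  # (|G|, 256): cheaper to store and compare
    small_query = reduce_fc(query)
print(rank1_identify(small_query, small_gallery, gallery_ids))
```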
4 EXPERIMENTS

In this section, we report the results of comprehensive experiments conducted to evaluate the performance of the proposed GaitSet. First, we compare GaitSet with other state-of-the-art methods on two public gait datasets: CASIA-B [12] and OU-MVLP [13]. Then, we conduct a set of ablation studies on CASIA-B. Third, we study the effectiveness of feature dimension reduction. Finally, we analyze the practicality of GaitSet from three aspects: limited silhouettes, multiple views, and multiple walking conditions.
The
CASIA-B dataset [12] is a popular gait dataset that contains 124 subjects labeled from 001 to 124. Each subject has 3 walking conditions, i.e., normal (NM) (6 video sequences per subject), walking with a bag (BG) (2 video sequences per subject), and wearing a coat or jacket (CL) (2 video sequences per subject). Each sequence is simultaneously captured from 11 views (0°, 18°, ..., 180°). Thus, this dataset contains 124 × (6 + 2 + 2) ×
11 = 13,640 videos in total. Since this dataset does not include official training and test subset partitions, we conducted our experiments using three kinds of division popular in the current literature. Based on the sizes of the training sets, we name these three kinds of division small-sample training (ST), medium-sample training (MT), and large-sample training (LT). In ST, the first 24 subjects (001-024) were used for training and the remaining 100 subjects were used for testing with no overlap. In MT, the first 62 subjects (001-062) were used for training, and the remaining 62 subjects were used for testing. In LT, the first 74 subjects (001-074) were used for training and the remaining 50 subjects were used for testing. For the test sets in all three settings, the first 4 sequences of the NM condition (NM #1-4) were kept in the gallery, and the remaining 6 sequences (NM #5-6, BG #1-2, and CL #1-2) were used as the probe.
The OU-MVLP dataset [13] is currently the largest public gait dataset. It contains 10,307 subjects with 14 views (0°, 15°, ..., 90°; 180°, 195°, ..., 270°) per subject and 2 sequences per view.

In all the experiments, the input was a set of aligned silhouettes of size 64 × 44. The silhouettes were directly provided by the datasets and were aligned based on the method described in [13]. We adopted the Adam optimizer [45] for training our GaitSet network. The code for all the experiments was written in Python with PyTorch 0.4.0. The models were trained on a computer equipped with 4 NVIDIA 1080Ti GPUs. Unless otherwise stated, the set cardinality during the training phase was set to 30. The margin ξ in the triplet loss function (Eq. 5) was set to 0.2. The number of HPM scales S was set to 5. For the CASIA-B dataset, we set the number of channels in C1 and C2 to 32, in C3 and C4 to 64, and in C5 and C6 to 128. Under these settings, the average computational complexity of our model is 8.6 GFLOPs. On the
OU-MVLP dataset, which contains 20 times more sequences than CASIA-B, we used convolutional layers with more channels, i.e., C1 = C2 = 64, C3 = C4 = 128, and C5 = C6 = 256. For convenience, the batch size, learning rate, and training iterations under the different experimental settings are listed in Tab. 1. Furthermore, rank-1 accuracy is adopted as the criterion in the subsequent evaluations.

Among the compared methods, View-invariant Discriminative Projection (ViDP) [2] uses a unitary linear projection to project the templates into a latent space to learn a view-invariant representation. Correlated Motion Co-Clustering (CMCC) [46] first uses motion co-clustering to partition the most related parts of gaits from different views into the same group, and then applies canonical correlation analysis (CCA) on each group to maximize the correlation between gait information across views. Wu et al. proposed several CNN-based models in [7]: CNN-LB feeds the GEIs of two gait sequences into a 3-layer CNN as two channels and judges whether the two GEIs belong to the same person; CNN-3D runs a 3-layer 3D-CNN on 9 adjacent frames and averages the predictions of 16 9-frame samples to obtain the final output; CNN-Ensemble aggregates the outputs of 8 different networks and achieves the best performance in that work. Yu et al. [21] applied an AutoEncoder (AE) to extract view-invariant features. He et al. [6] proposed a multi-task GAN (MGAN) to project gait features from one angle to another for multi-view gait recognition. Angle Center Loss (ACL) [23], which is robust to different local parts and temporal window sizes, was proposed to learn discriminative gait features. GEINet [18] classifies the GEIs of different persons with a 2-layer CNN followed by 2 fully connected layers. Takemura et al. [4] improved the structures proposed in [7] to leverage the triplet loss.
Tab. 2 shows a comparison between the state-of-the-art methods and the proposed GaitSet. Except for GaitSet, the results were taken directly from the original papers. All the results are averaged over the 11 gallery views, and the identical views are excluded. For example, the accuracy of probe view 36° is averaged over 10 gallery views, excluding gallery view 36°. From Tab. 2, an interesting relationship between views and accuracies can be observed. In addition to 0° and 180°, where low accuracies are expected, the accuracy for 90° is a local minimum that is always worse than the accuracy for 72° or 108°. A possible reason is that gait contains not only feature information parallel to the walking direction, such as stride, which can be observed most clearly at 90°, but also feature information vertical to the walking direction, such as the left-right swinging of the body or arms, which can be observed most clearly at 0° or 180°. Therefore, both the parallel and the vertical perspectives lose some portion of the gait information, while views such as 36° and 144° achieve a better balance between these two extremes.

Small-Sample Training (ST).
Our method achieves high performance even with only 24 subjects in the training set and exceeds the best previously reported performance [7] by a large margin. There are two main reasons for these results. 1) Because our model regards the input as a set, the number of samples (frames) available for training the convolutional network in the main pipeline is dozens of times larger than the number of samples used to train template- or video-based models. Taking a mini-batch as an example, our model input consists of 30 ×
128 = 3,840 silhouettes, while under the same batch size, other template-based models obtain only 128 templates. 2) Because the sample sets used in the training phase are composed of frames selected randomly from the sequences in the training set, each sequence can generate multiple different sample sets; thus, any units related to set feature learning (such as MGP and HPM) can also be trained well.

In addition, it is noteworthy that in ST, all the other compared models were trained and tested only on the NM subset, whereas our model was trained and tested on all of the NM, BG, and CL subsets. If we instead focus our model on only one subset, it performs even better, because then the training and testing environments are the same and exhibit more consistency.
1. Since Wu et al. proposed more than one model [7], we cite the most competitive results under the different experimental settings.
TABLE 1
Batch size (BS), learning rate (LR), and training iterations (Iter) on OU-MVLP and the three settings of CASIA-B (CASIA-ST, CASIA-MT, and CASIA-LT). When trained with cross-entropy loss, a mini-batch is randomly selected from the training set. When trained with triplet loss, a mini-batch is composed as described in Sec. 3.6.
                    CASIA-ST        CASIA-MT        CASIA-LT        OU-MVLP
Cross-entropy   BS  128             128             128             512
Triplet loss    BS  p = 8, k = 16   p = 8, k = 16   p = 8, k = 16   p = 32, k = 16

TABLE 2
Averaged rank-1 accuracies on
CASIA-B under three different experimental settings, excluding identical-view cases.
(Rows give the per-view rank-1 accuracies of CMCC [46], CNN-LB [7], and GaitSet (ours) on the NM, BG, and CL probe subsets under the ST, MT, and LT settings.)
TABLE 3
Averaged rank-1 accuracies on
OU-MVLP, excluding identical-view cases. GEINet: [18]. 3in+2diff: [4]. (Rows give the per-probe-view rank-1 accuracies of GEINet, 3in+2diff, and ours, evaluated with all 14 views as the gallery and with 4 typical views as the gallery.)

Medium-Sample Training (MT) & Large-Sample Training (LT).
Generally, the performance of deep learning models depends heavily on the scale of the training set. Thus, we evaluated GaitSet using two further divisions of the training and test sets, i.e., MT and LT, as recommended in the prior literature. Tab. 2 shows that our model attains fairly good results on the NM subset, especially on LT, where the accuracies of nearly all views are high. This shows that the accuracy improves markedly when more training data are available for this subset.

Our model also achieves satisfactory performance on the BG subset. On the CL subset, the recognition performance is somewhat less satisfactory, although our model still exceeds the best performance reported so far [7] by a clear margin. This reduced performance may be explained by three reasons: 1) a coat can entirely change a person's appearance, e.g., a subject looks larger in a coat than in a T-shirt; 2) a coat can hide the motions of both the limbs and the body; and 3) in the training set, the ratio of the CL subset is substantially lower than that of the NM subset, demanding a stronger discriminative ability on the part of the model.
TABLE 4
Ablation experiments conducted on
CASIA-B using setting LT. The results are rank-1 accuracies averaged on all 11 views, excluding identical-view cases. The numbers in brackets indicate the second-highest results in each column. Here 'att' is the abbreviation of attention.
(Rows compare GEI input vs. set input and the different Set Pooling instantiations, i.e., Max, Mean, Median, Joint sum, Joint 1_1C, pixel-wise attention, and frame-wise attention, with and without MGP, on the NM, BG, and CL subsets.)

Tab. 3 compares GaitSet with the two other methods on the OU-MVLP dataset. As some of the previous works did not conduct experiments on all 14 views, we list our results based on two types of gallery sets, i.e., all 14 views and 4 typical views (0°, 30°, 60°, 90°). All the results are averaged over the gallery views, and the identical views are excluded. The results show that our method generalizes well to a dataset with a large scale and wide view variation. Moreover, because the representation of each sample only needs to be calculated once, our model can complete the test involving all 133,780 sequences in only 14 minutes with 4 NVIDIA 1080Ti GPUs.

It is noteworthy that because some subjects missing several gait sequences have not been removed from the probe set, the maximum rank-1 accuracy cannot reach 100%. If we do not count the cases that have no corresponding samples in the gallery, the average rank-1 accuracy over all probe views becomes noticeably higher than the reported value.

Fig. 5 shows the relationship between training iterations and test accuracy. We can see that after the cross-entropy loss reaches its best performance, further tuning with the triplet loss can still bring improvement.
Fig. 5. The accuracy change process on OU-MVLP.
In this section, we report ablation experiments and model studies on CASIA-B to examine the effectiveness of regarding gait as a set with set pooling, MGP, HPM, and different training strategies with different loss combinations.
TABLE 5
The impact of different HPM scales and HPM weight independence, evaluated on
CASIA-B using the LT setting. The results are rank-1 accuracies averaged on all 11 views, excluding identical-view cases.
(Rows compare HPM scales from 1 (no HPM) to 5, with shared vs. independent FC weights, on the NM, BG, and CL subsets.)

All the experiments were based on the settings under CASIA-B LT, as shown in Tab. 2.
Set vs. GEI.
The first two rows of Tab. 4 show the effectiveness of regarding gait as a set. With totally identical networks, the result of using the set exceeds that of using the GEI by a clear margin on both the NM and the CL subsets. The only difference is that in the GEI experiment, the gait silhouettes are averaged into a single GEI before being fed into the network. There might be two main reasons for this improvement: 1) our SP extracts the set-level feature from a high-level feature map where the temporal information is well preserved and the spatial information has been sufficiently processed; and 2) as mentioned in Sec. 4.4, regarding gait as a set enlarges the volume of training data.

The impact of SP.
In Tab. 4, the results from the second row to the eighth row show the impact of different SP strategies. SP with pixel-wise attention achieves the highest accuracy on the NM and BG subsets, and SP with max(·) obtains the highest accuracy on the CL subset. Considering that SP with max(·) also achieves the second-best performance on the NM and BG subsets and has the most concise structure, we choose it as the SP strategy in the final version of GaitSet.

The impact of MGP.
The second and the last rows of Tab. 4 show that MGP improves the accuracy on all three test subsets. This result is consistent with the observation in Sec. 3.4, i.e., that set-level features extracted from different layers of the main pipeline contain different discriminative information.
The impact of HPM scales and HPM weight independence.
As shown in Tab. 5, HPM obtains better performance with more scales. Furthermore, the last two lines of Tab. 5
compare the impact of the weight independence of the fully connected layers in HPM. Using independent weights increases the accuracy by a clear margin on each subset. During the experiments, we also noticed that introducing independent weights makes the network converge faster.

Our previous AAAI-19 paper utilized the triplet loss to achieve good performance. To further improve the gait recognition accuracy, we combined the triplet loss and the cross-entropy loss to train the GaitSet model.
TABLE 6
Different loss functions evaluated on
CASIA-B using setting LT. The results are rank-1 accuracies averaged on all 11 views, excluding identical-view cases.
(Rows compare the cross-entropy loss, the triplet loss, and the combined strategy, with and without batch normalization (BN) and dropout, on the NM, BG, and CL subsets.)

Tab. 6 shows the results of the three training strategies and the impact of batch normalization and dropout. All three training strategies achieve high rank-1 accuracy on the NM subset when using batch normalization and dropout. However, only the pretraining model that combines the two losses reaches the highest rank-1 accuracy. The first and third lines of Tab. 6 reveal that the dropout layers are essential for robust training with the cross-entropy loss in this case. We also see that batch normalization improves all the training strategies.

As mentioned in Sec. 3.6, the testing feature obtained after concatenating all the HPM outputs has 2 × 31 × 256 = 15,872 dimensions in the standard framework, which impairs the testing efficiency. Therefore, we conducted dimension reduction using two methods. One is to set the feature dimension to a lower level by shrinking the output dimension of the HPM fully connected layers. The other is to perform the testing task after introducing a new fully connected layer, which achieves a large compression of the original 15,872 dimensions.

HPM Output Dimensions.
The HPM output dimensions were set to 32, 64, 128, 256, 512, and 1,024. Using these different dimensions, we studied the recognition accuracy after approximately the same number of training iterations. As Fig. 6 shows, even when the output dimension is low, the performance on the NM subset remains high with all three loss-function strategies. However, there is still a negative impact on the performance if the HPM output dimension is too low (down to 32) or too high (up to 1,024). The reasons for this degeneration are that 1) fully connected layers whose output dimensions are too high can easily overfit because they contain too many parameters, and 2) an output dimension that is too small significantly constrains the learning capacity of the fully connected layers. In particular, the model trained with CE loss is less robust on the CL subset with high-dimensional HPM outputs, while the model with the pretraining strategy has a stable performance on the CL subset. By decreasing the HPM output dimension, we can compress the final feature dimension from 15,872 to one quarter of that. While this compression is associated with only a subtle performance impairment on the NM subset, the degeneration of the recognition performance on the BG and CL subsets cannot be ignored.

Dimension reduction with the new FC.
Undoubtedly, we can also directly reduce the final feature dimension. After the model has been well trained, a new fully connected layer is applied to the 15,872-dimensional feature to reduce it to a lower dimension. This new layer is tuned with the triplet loss and a small learning rate. We investigated output dimensions of 128, 256, 512, 1,024, 2,048, and 4,096.

TABLE 7
The recognition accuracy after dimension reduction with the new FC.

Feature dimension   NM     BG     CL
128                 91.7   83.5   62.5
256                 94.1   87.3   66.6
512                 94.4   88.4   68.9
1024                95.0   89.3   69.1
2048                95.0   90.2   70.0
4096                94.9   90.2   70.3
As Tab. 7 shows, the final feature dimension can be compressed substantially, e.g., to 1,024 (about 6.5% of the original 15,872 dimensions), while keeping the recognition accuracy on the NM subset at 95.0%. Similar to changing the output dimension of HPM, a feature dimension that is too small leads to a performance decrease. Although it runs counter to the idea of an end-to-end design, introducing this postprocessing step effectively compresses the learned feature representation, making the method more practical for real applications.

Because of the flexibility of the gait set approach, GaitSet may be useful in more complicated practical conditions. In this section, we investigate the practicality of GaitSet through three novel scenarios: 1) How does GaitSet perform when an input set contains only a few silhouettes? 2) Can silhouettes with different views enhance the identification accuracy? 3) Can the model effectively extract a deep discriminative representation from a set containing silhouettes shot under different walking conditions? It is worth noting that we did not retrain or tune our model in these experiments; the exact model used in Sec. 4.4 with the LT setting is used here. The reported accuracies are averaged over 10 runs with different random seeds.
Limited Silhouettes.
In real forensic identification scenarios, cases occur in which no continuous sequence of a subject's gait is available, only some fitful and sporadic silhouettes. We simulate such a circumstance by randomly selecting a certain number of frames from sequences to compose each sample in both the gallery and the probe.
Fig. 6. Relationships between recognition accuracy and the HPM output dimension. From left to right are the individual results on the CASIA-B NM, BG, and CL subsets. The relationships vary with different training strategies, as shown by the different lines in each graph.
Fig. 7. Average rank-1 accuracies with constraints on the silhouette volume on the
CASIA-B dataset using the LT setting. The accuracy values are averaged on all 11 views excluding identical-view cases, and the final reported results are averaged across 10 experimental repetitions.

Fig. 7 shows the relationship between the number of silhouettes in each input set and the rank-1 accuracy averaged over all 11 probe views. Our method attains an accuracy of 85.0% using only 7 silhouettes. This result indicates that our model makes full use of the temporal information contained in a gait set. It can also be observed that the accuracy rises monotonically as the number of silhouettes increases, and the accuracy is close to the best performance when the samples contain more than 25 silhouettes. This number is consistent with the number of frames contained in one gait period.

Multiple Views.
Here we study a scenario in which one person's gait is collected from different views. We simulate this scenario by constructing each sample from silhouettes selected from two sequences that have the same walking condition but different views. To alleviate the effect of the number of silhouettes, the experiment is performed with a maximum of 10 silhouettes per set. Concretely, whereas in the single-view contrast experiment an input set consists of 10 silhouettes from one sequence, in the two-view experiment an input set is made up of 5 silhouettes from each of the two sequences. Note that in this experiment, only the probe samples are composed by the aforementioned method, whereas each sample in the gallery is composed of all silhouettes from one sequence.

Because there are too many view pairs to display them all, we summarize the results by averaging the accuracies
TABLE 8
Multiview experiments conducted on
CASIA-B using the LT setting. Cases in which the probe contains views present in the gallery are excluded.
View difference      18°/162°   36°/144°   54°/126°   72°/108°   90°     Single view
All silhouettes      98.9       99.6       97.6       96.2       99.3    96.1
10 silhouettes       94.8       97.2       95.9       91.7       97.25   89.5

of each possible view difference, as indicated in Tab. 8. For example, the result for a 90° difference is averaged over the accuracies of the view pairs (0° & 90°, 18° & 108°, ..., 90° & 180°). Furthermore, the 9 view differences were folded at 90°, and those larger than 90° were averaged with the corresponding view differences of less than 90°. For example, the results of the 108° view difference were averaged with those of the 72° view difference. Our model effectively aggregates information from different views and boosts the final performance. This result can be explained by the pattern between views and accuracies discussed in Sec. 4.4: including multiple views in the input set allows the model to gather both parallel and vertical information, resulting in performance improvements.

Multiple Walking Conditions.
In real life, it is highly likely that gait sequences of the same person occur under different walking conditions. We simulate such conditions by forming an input set using silhouettes from two sequences with the same view but different walking conditions, and we conduct experiments under constraints on the number of silhouettes. Note that in this experiment, only the probe samples are composed by the method discussed above; all the samples in the gallery are constructed using all the silhouettes from one sequence. Moreover, the probe-gallery division in this experiment is different: for each subject, the gallery and the probe are built from different sequences of each walking condition.
TABLE 9
Multiple walking condition experiments conducted on
CASIA-B using the LT setting. The results are rank-1 accuracies averaged on all 11 views, excluding identical-view cases. The numbers in brackets indicate the constraints on the silhouette number in each input set.
NM(10)   91.1      NM(10)+BG(10)   93.0      NM(20)   94.2
BG(10)   85.8      NM(10)+CL(10)   91.6      BG(20)   89.4
CL(10)   85.1      BG(10)+CL(10)   89.7      CL(20)   88.9
From Tab. 9 we can see that 1) the accuracies are still boosted as the number of silhouettes increases, and 2) when the number of silhouettes is fixed, the results reveal the relationships between different walking conditions. Containing large yet complementary noise and information, the combination of silhouettes from BG and CL helps the model improve the accuracy. In contrast, silhouettes from NM contain little noise. Consequently, substituting silhouettes of the other two conditions for some of them does not provide extra useful information but only introduces noise, leading to degraded performance.
5 CONCLUSION
In this paper, we presented a novel perspective that regards gait as a deep set, called GaitSet. The proposed GaitSet approach extracts both spatial and temporal information more effectively and efficiently than existing methods, which regard gait as either a template or a sequence. Unlike other existing gait recognition approaches, the GaitSet approach also provides an innovative way to aggregate valuable spatiotemporal information from different sequences to enhance the accuracy of cross-view gait recognition. Experiments on two benchmark gait datasets indicate that GaitSet achieves the highest recognition accuracy compared with other state-of-the-art algorithms, and the results reveal that GaitSet exhibits a wide range of flexibility and robustness when applied to various complex environments, showing great potential for practical applications. In addition, since the set assumption fits various other biometric identification tasks, including person re-identification and video-based face recognition, the structure of GaitSet can be applied to these tasks with few minor changes in the future.

ACKNOWLEDGEMENT
The authors would like to thank the associate editor and the anonymous reviewers for their valuable comments, which greatly improved the quality of this paper.

REFERENCES

[1] Z. Liu and S. Sarkar, "Improved gait recognition by gait dynamics normalization,"
IEEE Transactions on Pattern Analysis and Machine
Intelligence , vol. 28, no. 6, pp. 863–876, 2006.[2] M. Hu, Y. Wang, Z. Zhang, J. J. Little, and D. Huang, “View-invariant discriminative projection for multi-view gait-basedhuman identification,”
IEEE Transactions on Information Forensicsand Security , vol. 8, no. 12, pp. 2034–2045, 2013.[3] Y. Guan, C.-T. Li, and F. Roli, “On reducing the effect of covariatefactors in gait recognition: a classifier ensemble method,”
IEEETransactions on Pattern Analysis and Machine Intelligence , vol. 37,no. 7, pp. 1521–1528, 2014.[4] N. Takemura, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi,“On input/output architectures for convolutional neural network-based cross-view gait recognition,”
IEEE Transactions on Circuitsand Systems for Video Technology , vol. 28, no. 1, pp. 1–13, 2018.[5] X. Chen, J. Weng, W. Lu, and J. Xu, “Multi-gait recognition basedon attribute discovery,”
IEEE Transactions on Pattern Analysis andMachine Intelligence , vol. 40, no. 7, pp. 1697–1710, 2017.[6] Y. He, J. Zhang, H. Shan, and L. Wang, “Multi-task GANs for view-specific feature learning in gait recognition,”
IEEE Transactions onInformation Forensics and Security , vol. 14, no. 1, pp. 102–113, 2019.[7] Z. Wu, Y. Huang, L. Wang, X. Wang, and T. Tan, “A comprehen-sive study on cross-view gait based human identification withdeep CNNs,”
IEEE Transactions on Pattern Analysis and MachineIntelligence , vol. 39, no. 2, pp. 209–226, 2017. [8] R. Liao, C. Cao, E. B. Garcia, S. Yu, and Y. Huang, “Pose-based temporal-spatial network (PTSN) for gait recognition withcarrying and clothing variations,” in
Chinese Conference on BiometricRecognition . Springer, 2017, pp. 474–483.[9] T. Wolf, M. Babaee, and G. Rigoll, “Multi-view gait recognitionusing 3D convolutional neural networks,” in
IEEE InternationalConference on Image Processing , 2016, pp. 4165–4169.[10] X. Wu, W. An, S. Yu, W. Guo, and E. B. Garc´ıa, “Spatial-temporalgraph attention network for video-based gait recognition,” in
AsianConference on Pattern Recognition . Springer, 2019, pp. 274–286.[11] C. Peyrin, M. Baciu, C. Segebarth, and C. Marendaz, “Cerebralregions and hemispheric specialization for processing spatialfrequencies during natural scene recognition. An event-relatedfMRI study,”
Neuroimage , vol. 23, no. 2, pp. 698–707, 2004.[12] S. Yu, D. Tan, and T. Tan, “A framework for evaluating the effect ofview angle, clothing and carrying condition on gait recognition,”in
International Conference on Pattern Recognition , vol. 4, 2006, pp.441–444.[13] N. Takemura, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi,“Multi-view large population gait dataset and its performanceevaluation for cross-view gait recognition,”
IPSJ Transactions onComputer Vision and Applications , vol. 10, no. 4, pp. 1–14, 2018.[14] J. Han and B. Bhanu, “Individual recognition using gait energyimage,”
IEEE Transactions on Pattern Analysis and Machine Intelli-gence , vol. 28, no. 2, pp. 316–322, 2006.[15] C. Wang, J. Zhang, L. Wang, J. Pu, and X. Yuan, “Human identifica-tion using temporal information preserving gait template,”
IEEETransactions on Pattern Analysis and Machine Intelligence , vol. 34,no. 11, pp. 2164–2176, 2012.[16] X. Xing, K. Wang, T. Yan, and Z. Lv, “Complete canonical correlationanalysis with application to multi-view gait recognition,”
PatternRecognition , vol. 50, pp. 107–117, 2016.[17] K. Bashir, T. Xiang, and S. Gong, “Gait recognition without subjectcooperation,”
Pattern Recognition Letters , vol. 31, no. 13, pp. 2052–2060, 2010.[18] K. Shiraga, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi,“GEINet: View-invariant gait recognition using a convolutionalneural network,” in
IAPR International Conference on Biometrics ,2016, pp. 1–8.[19] Y. Makihara, R. Sagawa, Y. Mukaigawa, T. Echigo, and Y. Yagi, “Gaitrecognition using a view transformation model in the frequencydomain,” in
European Conference on Computer Vision . Springer,2006, pp. 151–163.[20] S. Yu, H. Chen, E. B. G. Reyes, and N. Poh, “GaitGAN: Invariant gaitfeature extraction using generative adversarial networks,” in
IEEEConference on Computer Vision and Pattern Recognition Workshops ,2017, pp. 532–539.[21] S. Yu, H. Chen, Q. Wang, L. Shen, and Y. Huang, “Invariant featureextraction for gait recognition using only one uniform model,”
Neurocomputing , vol. 239, pp. 81–93, 2017.[22] W. An, S. Yu, Y. Makihara, X. Wu, C. Xu, Y. Yu, R. Liao, and Y. Yagi,“Performance evaluation of model-based gait on multi-view verylarge population database with pose sequences,”
IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 2, no. 4, pp. 421–430, 2020. [23] Y. Zhang, Y. Huang, S. Yu, and L. Wang, "Cross-view gait recognition by discriminative feature learning,"
IEEE Transactions on Image Processing, vol. 29, pp. 1001–1015, 2019. [24] C. Fan, Y. Peng, C. Cao, X. Liu, S. Hou, J. Chi, Y. Huang, Q. Li, and Z. He, "GaitPart: Temporal part-based model for gait recognition," in
IEEE Conference on Computer Vision and Pattern Recognition , 2020,pp. 14 225–14 233.[25] S. Hou, C. Cao, X. Liu, and Y. Huang, “Gait lateral network:Learning discriminative and compact representations for gaitrecognition,” in
European Conference on Computer Vision , 2020, pp.382–398.[26] H. Chao, Y. He, J. Zhang, and J. Feng, “Gaitset: Regarding gait as aset for cross-view gait recognition,” in
AAAI Conference on ArtificialIntelligence , vol. 33, 2019, pp. 8126–8133.[27] Y. Fu, Y. Wei, Y. Zhou, H. Shi, G. Huang, X. Wang, Z. Yao,and T. Huang, “Horizontal pyramid matching for person re-identification,” in
AAAI Conference on Artificial Intelligence , vol. 33,2019, pp. 8295–8302.[28] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “PointNet: Deeplearning on point sets for 3D classification and segmentation,” in
IEEE Conference on Computer Vision and Pattern Recognition , 2017, pp.77–85.
EEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, XXX 2020 12 [29] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M.Solomon, “Dynamic graph CNN for learning on point clouds,”
ACM Transactions On Graphics , vol. 38, no. 5, pp. 1–12, 2019.[30] Y. Zhou and O. Tuzel, “VoxelNet: End-to-end learning for pointcloud based 3D object detection,” in
IEEE Conference on ComputerVision and Pattern Recognition , 2018, pp. 4490–4499.[31] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deephierarchical feature learning on point sets in a metric space,” in
Advances in Neural Information Processing Systems , 2017, pp. 5099–5108.[32] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representationlearning on large graphs,” in
Advances in Neural InformationProcessing Systems , 2017, pp. 1024–1034.[33] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei, “A hierarchicalapproach for generating descriptive image paragraphs,” in
IEEEConference on Computer Vision and Pattern Recognition , 2017, pp.3337–3345.[34] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhut-dinov, and A. J. Smola, “Deep sets,” in
Advances in Neural Informa-tion Processing Systems , 2017, pp. 3391–3401.[35] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neuralnetworks,” in
IEEE Conference on Computer Vision and PatternRecognition , 2018, pp. 7794–7803.[36] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov,R. Zemel, and Y. Bengio, “Show, attend and tell: Neural imagecaption generation with visual attention,” in
International Conferenceon Machine Learning , 2015, pp. 2048–2057.[37] W. Li, X. Zhu, and S. Gong, “Harmonious attention network forperson re-identification,” in
IEEE Conference on Computer Vision andPattern Recognition , 2018, pp. 2285–2294.[38] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learningdiscriminative features with multiple granularities for person re-identification,” in
ACM International Conference on Multimedia , 2018,pp. 274–282.[39] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closingthe gap to human-level performance in face verification,” in
IEEEConference on Computer Vision and Pattern Recognition , 2014, pp.1701–1708.[40] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unifiedembedding for face recognition and clustering,” in
IEEE Conferenceon Computer Vision and Pattern Recognition , 2015, pp. 815–823.[41] L. Zheng, Y. Yang, and A. G. Hauptmann, “Person re-identification:Past, present and future,” arXiv:1610.02984 , 2016.[42] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet lossfor person re-identification,” arXiv:1703.07737 , 2017.[43] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deepspeaker recognition,” arXiv:1806.05622 , 2018.[44] X. Dong and J. Shen, “Triplet loss in Siamese network for objecttracking,” in
European Conference on Computer Vision , 2018, pp. 459–474.[45] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimiza-tion,” in
International Conference on Learning Representation , 2015.[46] W. Kusakunniran, Q. Wu, J. Zhang, H. Li, and L. Wang, “Recog-nizing gaits across views through correlated motion co-clustering,”
IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 696–709, 2014.
Hanqing Chao received the B.S. degree from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2016. He received the Master's degree in Computer Science from Fudan University, Shanghai, China, in 2019. His research interests include machine learning, computer vision, and gait recognition.

Kun Wang received the B.S. degree from Xi'an Jiaotong University, Xi'an, China, in 2017. He received the Master's degree in Computer Science from Fudan University, Shanghai, China, in 2020. His research interests include machine learning, computer vision, and gait recognition.

Yiwei He received the B.S. degree from Nanjing Audit University, Nanjing, China, in 2015. He received the Master's degree in Computer Science from Fudan University, Shanghai, China, in 2018. His research interests include machine learning, computer vision, and gait recognition.

Junping Zhang (M'05) received the B.S. degree in automation from Xiangtan University, Xiangtan, China, in 1992, the M.S. degree in control theory and control engineering from Hunan University, Changsha, China, in 2000, and the Ph.D. degree in intelligent systems and pattern recognition from the Institute of Automation, Chinese Academy of Sciences, in 2003. He has been a professor at the School of Computer Science, Fudan University, since 2011. His research interests include machine learning, image processing, biometric authentication, and intelligent transportation systems. He has published widely in highly ranked international journals such as IEEE TPAMI and IEEE TNNLS, and in leading international conferences such as ICML and ECCV. He has been an Associate Editor of IEEE Intelligent Systems Magazine since 2009.

Jianfeng Feng received all his academic degrees from Peking University in mathematics.