Dynamic gesture retrieval: searching videos by human pose sequence
Cheng Zhang
Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
Abstract
The number of distinct static human poses is limited, so it is hard to retrieve a specific video using a single pose as the clue. With a pose sequence, i.e. a dynamic gesture, as the keyword, retrieving specific videos becomes more feasible. We propose a novel method for querying videos containing a designated sequence of human poses, whereas previous works only designate a single static pose. The proposed method takes continuous 3d human poses from the keyword gesture video and from the video candidates, then converts the pose in each frame into bone direction descriptors, which describe the direction of each natural connection in the articulated pose. A temporal pyramid sliding window is then applied to find matches between the designated gesture and the video candidates, which ensures that the same gesture performed with different durations can still be matched.
Keywords:
Dynamic Gesture Retrieval, Pose Sequence, Gesture Recognition, Gesture Search, Articulated Pose
1. Introduction
The proposed method focuses on matching video clips with a human pose sequence. Human pose estimation methods have advanced considerably during the last few years, leveraging the development of deep learning, which makes estimating human poses from an unconstrained video possible. Building on these works, our method accepts a sequence of poses as the keyword and retrieves video clips with similar pose sequences.

Being able to search through human videos by pose sequence makes it possible to locate a specific dynamic gesture in a large amount of video, without knowing the name of the gesture and without adding descriptions to every candidate video. Compared with existing methods, e.g. retrieving by image [1] or retrieving by pose [2, 3, 4], the proposed method takes a temporally dynamic gesture as the keyword and searches through a video of arbitrary duration.

To compare the similarity of two dynamic gestures, the distance between two pose sequences (the PS distance) must be measured. This raises two main challenges. Firstly, raw 3d pose features are not suitable for measuring the PS distance, because they consist of 3d coordinates representing absolute limb positions, which are sensitive to camera movements; a desirable pose descriptor should reflect only the human gesture and be invariant to camera translations.
Figure 1: Gesture retrieval result for walking.

Secondly, dynamic gestures are temporally scalable. To make a gesture, different people tend to move at different speeds, finishing the same gesture within variable durations. The retrieval process should focus on the gesture rather than the speed, so the method has to be temporally flexible enough to match gestures made at different speeds. To address the first challenge, we propose a descriptor called the bone direction descriptor (BD descriptor), which describes the 3d direction of each natural connection in the 3d articulated pose model. To address the second challenge, we apply a temporal pyramid sliding window to the video candidates to match gestures with different durations.
2. Pose sequence distance
This section defines a numerical distance between two pose sequences, so as to measure their similarity. To this end, the BD descriptor is introduced to remove camera translation information from poses. The 3d pose features are denoted as $v_{p,t}$, $p \in \{1, \dots, P\}$, $t \in \{1, \dots, T\}$, where $P$ is the number of parts predicted by the 3d pose estimation method and $T$ is the total number of frames of interest. Each $v_{p,t} \in \mathbb{R}^3$ represents an $(x, y, z)$ location in 3d space. Although different pose estimation methods give different numbers of parts $P$, in most methods these parts can be connected following the natural structure of the human skeleton [5]. Let $H$ denote the set of connected pairs, where each connection is formed by two part indices $(i, j)$; the connection set $C_t$ for frame $t$ is then described by equation 1:

$$C_t = \{\, \overline{v_{i,t}\, v_{j,t}} \mid (i, j) \in H \,\} \tag{1}$$

The BD descriptor is inspired by the spatial features mentioned in [6]; this paper upgrades those angle spatial features into 3d space. Essentially, all the connections in $C_t$ are used as vectors (bone vectors) oriented top-down, supposing the person is standing on the ground with both arms hanging naturally at the sides. This vector set $S_t$ is denoted by equation 2:

$$s_{b,t} = v_{j,t} - v_{i,t}, \quad (i, j) \in H, \qquad S_t = \{\, s_{b,t} \mid b \in \{1, \dots, B\} \,\} \tag{2}$$

In equation 2, $B$ is the total number of connections, which differs from the total number of parts $P$ because one part can have multiple connections. Then the gravity unit vector $g$ is introduced, a unit vector pointing in the direction of gravity. For a conventional 3d coordinate system where the $(x, y, z)$ values represent (width, height, depth), $g$ has a value of $(0, -1, 0)$. Using $g$ as a reference, the angle between a bone vector $s$ and the gravity vector $g$ can be measured in degrees, and this value is not affected by the absolute position of the person. However, choosing the range of the angle is tricky. Intuitively, some parts like the arms can rotate a full 360 degrees, so setting the range from 0 to 360 seems feasible, but this setting produces large value jumps for small movements around the boundary, and is therefore not suitable for measuring the similarity of two poses. Meanwhile, setting the range from 0 to 180 causes two different limb positions to share the same feature value. To address this, the sine and cosine values are used to describe each angle. The final BD descriptor $d_{b,t}$ for bone $b$ and frame $t$ is given by equations 3 to 5:

$$d_{b,t} = \{\varphi^{\sin}_{b,t}, \varphi^{\cos}_{b,t}\} \tag{3}$$

$$\varphi^{\sin}_{b,t} = \frac{\left\| s_{b,t} \times g \right\|}{\left\| s_{b,t} \right\| \left\| g \right\|} \tag{4}$$

$$\varphi^{\cos}_{b,t} = \frac{s_{b,t} \cdot g}{\left\| s_{b,t} \right\| \left\| g \right\|} \tag{5}$$

Equation 3 shows that a BD descriptor is a vector of two values: the sine and cosine of the angle between the gravity unit vector $g$ and the bone vector $s$. Equations 4 and 5 give the details of the sine and cosine computations. The PS distance $k$ is then defined using $d_{b,t}$ as in equation 6, where $d^1$ and $d^2$ represent the BD descriptors of pose sequences 1 and 2 respectively:

$$k = \frac{1}{TB} \sum_{t=1}^{T} \sum_{b=1}^{B} \left| d^1_{b,t} - d^2_{b,t} \right| \tag{6}$$

In equation 6, $B$ denotes the total number of connections and $T$ the total number of keyword frames. The equation accumulates the per-frame descriptor differences between the two sequences.
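To make the computation concrete, the following is a minimal NumPy sketch of the BD descriptor and PS distance of equations 2 to 6. The connection set `H`, the part count, and the function names are illustrative assumptions rather than part of the original method; the actual connection pairs depend on which 3d pose estimator is used.

```python
import numpy as np

# Hypothetical connection set H: pairs of part indices (i, j) forming the
# natural skeleton. The real pairs depend on the pose estimator's part layout.
H = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6), (6, 7)]

# Gravity unit vector g, assuming y is the height axis in (width, height, depth).
g = np.array([0.0, -1.0, 0.0])

def bd_descriptors(poses):
    """Equations 2-5: poses is a (T, P, 3) array of (x, y, z) part locations;
    returns a (T, B, 2) array of (sin, cos) BD descriptors."""
    T = poses.shape[0]
    d = np.empty((T, len(H), 2))
    for b, (i, j) in enumerate(H):
        s = poses[:, j] - poses[:, i]                     # bone vectors s_{b,t}
        denom = np.linalg.norm(s, axis=1) * np.linalg.norm(g)
        d[:, b, 0] = np.linalg.norm(np.cross(s, g), axis=1) / denom  # eq. 4
        d[:, b, 1] = (s @ g) / denom                                  # eq. 5
    return d

def ps_distance(d1, d2):
    """Equation 6: mean absolute descriptor difference over T frames and B
    bones; |.| is read here as the summed absolute difference of the two
    descriptor components."""
    return np.abs(d1 - d2).sum(axis=-1).mean()
```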
3. Temporal pyramid sliding window
The PS distance can measure the similarity between two sequences of the same length. However, different people can make an identical dynamic gesture at different speeds, so the duration of the same gesture varies between videos. The method should therefore match two gestures with the same limb movements but different durations $T_1$ and $T_2$. Our proposed method provides a temporal pyramid sliding window to address this problem. A normal sliding window slides through the entire candidate video with a fixed window length, producing a sequence of PS distances. To match the starting point of a pose sequence with the exact gesture and duration, the time point with the smallest PS distance value is searched for, as shown in equation 7:
Figure 2: Pose sequence distances at each sliding window location in Episode 2 (PS distance plotted against frame index).

$$T_{\text{match}} = \arg\min_{t} f_{\text{slide}}(d^{\text{key}}, d^{\text{cand}}) \tag{7}$$

where $T_{\text{match}}$ indicates the starting point of the matched result. The sliding window $f_{\text{slide}}$ computes the PS distance at each sliding position:

$$f_{\text{slide}}(d^{\text{key}}, d^{\text{cand}}) = \{\, k(d^{\text{key}}, f_{\text{subseq}}(d^{\text{cand}}, t, T_{\text{key}})) \mid t \in \{1, \dots, T_{\text{cand}} - T_{\text{key}}\} \,\} \tag{8}$$

In equation 8, $d^{\text{key}}$ denotes the set of BD descriptors used as the search keyword, $d^{\text{key}} = \{ d_{b,t} \mid b \in \{1, \dots, B\},\ t \in \{1, \dots, T_{\text{key}}\} \}$, where $T_{\text{key}}$ denotes the length of the keyword pose sequence, and $d^{\text{cand}}$ denotes the set of BD descriptors of the candidate. The function $f_{\text{subseq}}$ cuts a sequence and takes three arguments: the sequence, the starting point, and the duration. $k(\cdot)$ computes the PS distance given two sequences of the same length. The sliding window $f_{\text{slide}}$ produces one PS distance value at each sliding location, forming a collection of PS distances over the time points. Finally, as described in equation 7, the time point with the smallest PS distance value is considered the best match. The argmin can be replaced by a threshold to return multiple matches.

As mentioned, the normal sliding window matches videos of the exact gesture and duration, where the time durations of the keyword window and the candidate window are the same. The temporal pyramid sliding window is proposed to extend or reduce the temporal receptive field of the candidate window and resample the BD descriptor sequence to the same length as the keyword BD descriptor sequence, so that the PS distance can be computed. After the resampling, an extended window plays the video clip faster with a larger temporal receptive field, while a reduced window plays the clip slower with a smaller temporal receptive field. Formally, given a set of pyramid parameters $L$, the frame rate $r_\lambda$ of the resampled window is given by equation 9, where $r_{\text{original}}$ denotes the frame rate when $\lambda = 1$:

$$r_\lambda = \frac{r_{\text{original}}}{\lambda}, \quad \lambda \in L \tag{9}$$

Figure 3: Retrieval result for waving.
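Continuing the earlier sketch (reusing its `ps_distance`), a plain sliding window per equations 7 and 8 and the temporal pyramid extension per equation 9 could look as follows. The nearest-neighbour resampling and the pyramid levels are illustrative choices; the paper does not specify the resampling scheme.

```python
import numpy as np

def f_subseq(d, t, length):
    """Cut a subsequence of `length` frames from descriptor sequence d, starting at t."""
    return d[t:t + length]

def f_slide(d_key, d_cand):
    """Equation 8: PS distance of the keyword against every window position."""
    T_key = len(d_key)
    return np.array([ps_distance(d_key, f_subseq(d_cand, t, T_key))
                     for t in range(len(d_cand) - T_key)])

def resample(d, new_len):
    """Resample a descriptor sequence to new_len frames (nearest neighbour)."""
    idx = np.round(np.linspace(0, len(d) - 1, new_len)).astype(int)
    return d[idx]

def pyramid_match(d_key, d_cand, levels=(0.5, 1.0, 2.0)):
    """Temporal pyramid sliding window: each level lambda widens or narrows
    the candidate window to lambda * T_key frames, resamples it back to
    T_key frames (equation 9), and keeps the best position (equation 7)."""
    T_key = len(d_key)
    best = (np.inf, 0, 1.0)                    # (PS distance, T_match, lambda)
    for lam in levels:
        win = max(2, int(round(lam * T_key)))  # temporal receptive field
        for t in range(len(d_cand) - win):
            k = ps_distance(d_key, resample(f_subseq(d_cand, t, win), T_key))
            if k < best[0]:
                best = (k, t, lam)
    return best
```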
4. Experiments
We conduct experiments on the UTKinect-Action dataset [7] to evaluate our method. This dataset provides videos of single-person actions, with human part positions computed from RGB images and depth maps. Different episodes in the dataset record different people. We use video clips from episode 1, each containing one gesture, as the search keyword, and search for similar gestures in the other episodes.

Figure 1 uses 20 frames of a walking gesture as the keyword and retrieves similar gestures from other episodes. In the figure, the left images show the pose sequence used as the search keyword, and the right images show the retrieved best matches from another video. It can be seen that the desired gesture is correctly retrieved. Figure 2 shows the PS distance at each sliding window location in episode 2, where a smaller value means a better match. Figure 3 uses 30 frames of a waving gesture as the keyword; the proposed method again correctly retrieves the similar gesture.
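As a rough end-to-end illustration of this setup, reusing `bd_descriptors` and `pyramid_match` from the earlier sketches (random arrays stand in for real UTKinect pose data, since dataset loading is outside the scope of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
keyword_poses = rng.standard_normal((20, 8, 3))     # ~20 keyword frames, P = 8 parts
candidate_poses = rng.standard_normal((600, 8, 3))  # one full candidate episode

d_key = bd_descriptors(keyword_poses)
d_cand = bd_descriptors(candidate_poses)

k, t_match, lam = pyramid_match(d_key, d_cand)
print(f"best match: frame {t_match}, speed factor {lam}, PS distance {k:.3f}")
```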
5. Conclusion
We have developed a novel search method that takes a dynamic gesture (a sequence of human poses) as input and matches videos containing similar gestures of various durations. As far as we know, previous methods only retrieve video frames by a single pose. The proposed method makes it possible to search a video for sports moves, behaviors, command gestures, etc. without knowing their names, by imitating the representative gesture as the search keyword. The experimental results show that the proposed method is capable of finding a dynamic gesture in a video using a pose sequence as input.

In future work, this method needs further improvements to run in practical situations, e.g. dealing with partially occluded people, handling multiple people in the same frame (which relates to pedestrian tracking), and dealing with variable speeds in different stages of the same gesture.
References
[1] M. Rodríguez, G. Facciolo, R. G. v. Gioi, P. Musé, J. Delon, Robust estimation of local affine maps and its applications to image matching, in: The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1342–1351.
[2] V. Ferrari, M. Marin-Jimenez, A. Zisserman, Pose search: retrieving people using their pose, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 1–8.
[3] M. Eichner, M. Marin-Jimenez, A. Zisserman, V. Ferrari, 2d articulated human pose estimation and retrieval in (almost) unconstrained still images, International Journal of Computer Vision 99 (2012) 190–214.
[4] R. Ren, J. Collomosse, Visual sentences for pose retrieval over low-resolution cross-media dance collections, IEEE Transactions on Multimedia 14 (2012) 1652–1661.
[5] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[6] J. He, C. Zhang, X. He, R. Dong, Visual recognition of traffic police gestures with convolutional pose machine and handcrafted features, Neurocomputing (2019).
[7] L. Xia, C. Chen, J. Aggarwal, View invariant human action recognition using histograms of 3d joints, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, IEEE, 2012, pp. 20–27.