A Global to Local Double Embedding Method for Multi-person Pose Estimation
Yiming Xu, Jiaxin Li, Yiheng Peng, Yan Ding*, and Hua-Liang Wei
Yingcai Honors College, University of Electronic Science and Technology of China, Chengdu, China
Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education, School of Aerospace Engineering, Beijing Institute of Technology, Beijing 100081, China
School of Computer Science and Technology, Donghua University, Shanghai, China
Department of Automatic Control and Systems Engineering, University of Sheffield, Sheffield S1 3JD, UK
Abstract.
Multi-person pose estimation is a fundamental and challenging problem in many computer vision tasks. Most existing methods can be broadly categorized into two classes: top-down and bottom-up methods. Both types involve two stages, namely person detection and joints detection. Conventionally, the two stages are implemented separately without considering the interactions between them, which inevitably causes intrinsic problems. In this paper, we present a novel method that simplifies the pipeline by implementing person detection and joints detection simultaneously. We propose a Double Embedding (DE) method to complete the multi-person pose estimation task in a global-to-local way. DE consists of Global Embedding (GE) and Local Embedding (LE). GE encodes different person instances and processes information covering the whole image, while LE encodes the local limb information. GE serves person detection in a top-down fashion, while LE connects the remaining joints sequentially, performing joint grouping and information processing in a bottom-up fashion. Based on LE, we design the Mutual Refine Machine (MRM) to reduce the prediction difficulty in complex scenarios. MRM effectively realizes information communication between keypoints and further improves accuracy. We achieve competitive results on the MSCOCO, MPII and CrowdPose benchmarks, demonstrating the effectiveness and generalization ability of our method.
Human pose estimation aims to localize the human facial and body keypoints (e.g., nose, shoulder, knee, etc.) in an image. It is a fundamental technique for many computer vision tasks such as action recognition [1], human-computer interaction [2,3], person Re-ID [4] and so on.

*The first two authors contributed equally to this work. Yan Ding is the corresponding author: [email protected]
Most existing methods can be broadly categorized into two classes: top-down methods [5,6,7,8,9,10,11,12,13] and bottom-up methods [14,15,16,17,18]. As shown in Figure 1 (a), the top-down strategy first employs a human detector to generate person bounding boxes, and then performs single person pose estimation on each individual person. On the contrary, the bottom-up strategy locates all body joints in the image and then groups the joints into the corresponding persons. The top-down strategy is less efficient because of the need to perform single person pose estimation on each detected instance sequentially. Also, the performance of the top-down strategy is highly dependent on the quality of the person detections. Compared to the top-down strategy, the complexity of the bottom-up strategy is independent of the number of people in the image, which makes it more efficient. Although faster and closer to a real-time technique, bottom-up methods may suffer from solving an NP-hard graph partition problem [15,14] to group joints into the corresponding persons on densely connected graphs covering the whole image.
Fig. 1.
Comparison between (a) the existing two-stage strategy and (b) our Double Embedding method for multi-person pose estimation. The proposed DE model implements person detection and joints detection in parallel, overcoming the intrinsic problems of existing two-stage top-down and bottom-up strategies.
We analyze and try to bypass the disadvantages of these two conventional strategies. The low efficiency of the top-down strategy comes from the independent single person pose estimation on each person bounding box. For the bottom-up strategy, treating all the detected unidentified joints equally makes the joints grouping process highly difficult. Both top-down and bottom-up strategies are two-stage structures with little interaction between the two stages, and both suffer from the separation of person instance detection and joints detection. To overcome this intrinsic limitation, we propose to implement person detection and joints detection simultaneously, and to realize information communication between the two procedures so as to better utilize the structure information of the human body. The proposed approach is illustrated in Figure 1 (b). We observe that the torso joints (shoulders and hips) are more stable than other limb joints. With a much lower degree of freedom than limb joints, torso joints can represent identity information well and distinguish different human instances. We also introduce the center joint of the human body, calculated as the average location of the annotated joints. The torso joints and the center joint together compose the Root Joints Group (RJG). Based on this, we categorize the remaining joints on the limbs into the Adjacency Joints Group (AJG). In this paper, we propose the Double Embedding method to simplify the pipeline and improve the joint localization accuracy. The framework of the DE approach is shown in Figure 2. Double Embedding consists of Global Embedding (GE) and Local Embedding (LE). GE handles the person instance separation process by encoding the clustering information of RJG. We follow the associative embedding method [18] to allocate a 1D tag to each pixel related to the joints in the root joints group.
Joints belonging to the same person have similar tags while joints belonging to different instances have dissimilar tags. GE encodes global information over the whole image, while LE focuses on the local information of each instance based on the global clues from GE. AJG is connected to the identified RJG by the corresponding displacement vector field encoded in LE. Basically, we take the center joint as the reference point. However, the displacements from extremity joints (wrists and ankles) to the center joint are long-range displacements which are vulnerable to background noise. To optimize long-range displacement prediction [19], we further divide AJG into two hierarchies: the first level consists of elbows and knees, and the second level consists of wrists and ankles. AJG is connected sequentially from the second level to the first level and finally to the torso joints in RJG, which are already identified. Long-range displacements are factorized into accumulated short-range displacements targeting the torso joints (hip joints and shoulder joints). Take the left ankle for example: the displacement from the ankle to the center joint is a long-range displacement which is difficult to predict. To tackle this problem while better utilizing the body structure information, we change the reference joint to the left hip, and divide the displacement from the left ankle to the left hip into shorter displacements: the displacement from the left ankle to the left knee, and the displacement from the left knee to the left hip. Thus, AJG (limb joints) are connected to RJG and identified in sequence. As for facial joints (e.g., eyes, nose, etc.), we localize them from predicted heatmaps and connect them with long-range displacements targeting the center joint.
In addition, we design the Mutual Refine Machine (MRM) to further improve joint localization accuracy and reduce the prediction difficulty in complex
scenarios such as pose deformation, cluttered background, occlusion, person overlapping and scale variation. Based on the hierarchical displacements and connection information encoded in LE, MRM refines poor-quality predicted joints using high-quality predicted neighboring joints.
We reduce the person detection task to identifying RJG with associative embedding. This is essential to implementing person detection and the subsequent joints detection and grouping at the same time. Avoiding independent single person pose estimation on each detected person bounding box makes the method more efficient. Compared to directly processing all the unidentified joints over the whole image, LE performs local inference with robust global affinity cues encoded in GE, reducing the complexity of joint identification. Unlike the independent two stages in previous two-stage strategies, GE and LE work together to complete person detection and joints detection in parallel.
We implement DE with Convolutional Neural Networks (CNNs) based on the state-of-the-art HigherHRNet [20] architecture. Experiments on the MSCOCO [21], MPII [22] and CrowdPose [23] benchmarks demonstrate the effectiveness of our method.
The main contributions of the paper are summarized as follows:
– We attempt to simplify the pipeline for multi-person pose estimation, solving the task in a global-to-local way.
– We propose the Double Embedding method to implement person detection and joints detection in parallel, overcoming the intrinsic disadvantages of two-stage structures.
– Our model achieves competitive performance on multiple benchmarks.
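The hierarchical factorization described above (e.g., ankle → knee → hip instead of one ankle → hip vector) can be sketched in a few lines of plain Python. This is an illustrative helper, not the paper's implementation; the coordinates and displacement values are hypothetical:

```python
def locate_via_chain(start, short_disps):
    """Accumulate short-range displacements along a kinematic chain
    (e.g. left ankle -> left knee -> left hip) instead of predicting
    one long-range ankle -> hip displacement."""
    x, y = start
    for dx, dy in short_disps:
        x, y = x + dx, y + dy
    return (x, y)

# Hypothetical example: ankle detected at (100, 200); predicted
# short-range displacements to the knee and then to the hip.
hip = locate_via_chain((100.0, 200.0), [(-5.0, -40.0), (3.0, -45.0)])
```

Summing the two short vectors recovers the same endpoint as the long-range vector would, but each short vector is easier to regress.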
Top-Down Multi-Person Pose Estimation
Top-down methods [5,6,7,8,9,10,11,12,13] first employ object detection [24,25,26,27] to generate person instances within person bounding boxes, and then detect the keypoints of each person by single person pose estimation independently.
Mask R-CNN [9] adopts a branch for keypoint detection based on Faster R-CNN [24]. G-RMI [8] directly divides top-down methods into two stages and employs independent models for person detection and pose estimation. In [28], Gkioxari et al. adopted the Generalized Hough Transform framework for person instance detection, and then classified joint candidates based on poselets. In [29], Sun et al. proposed a part-based model to jointly detect person instances and generate pose estimates. Recently, both person detection and single person pose estimation have benefited greatly from the thriving of deep learning techniques. Iqbal and Gall [16] adopted Faster R-CNN [24] for person detection and convolutional pose machines [30] for joints detection. In [31], Fang et al. used a spatial transformer network [32] and the Hourglass network [13] for joints detection.
Though these methods have achieved excellent performance, they suffer from high time complexity due to sequential single person pose estimation on each person proposal. In contrast, DE performs person detection and joints detection in parallel, which simplifies the pipeline.
Bottom-Up Multi-Person Pose Estimation
Bottom-up methods [14,15,16,17,18] detect all the unidentified joints in an image and then group them into the corresponding person instances.
OpenPose [33] proposes part affinity fields to represent the limbs; the method calculates a line integral along each limb and connects the joints with the largest integral. In [18], Newell et al. proposed associative embedding to assign each joint a 1D tag and then group joints that have similar tags. PersonLab [19] groups joints by a 2D vector field over the whole image. In [34], Kreiss et al. proposed a Part Intensity Field (PIF) to localize body parts and a Part Association Field (PAF) to connect body parts into full human poses.
Nevertheless, the joint grouping cues of all these methods cover the whole image, which makes joint grouping highly complex. Unlike these prior methods, the global clues from GE reduce the search space of the graph partition problem, avoiding the high complexity of joint partition in the bottom-up strategy.
Fig. 2.
Overview of the Double Embedding (DE) model. For an image, we generate two kinds of feature maps, for Global Embedding (GE) and Local Embedding (LE). GE and LE work in parallel, communicating information to support each other. Based on GE and LE, we design the Mutual Refine Machine (MRM) to refine the low-quality predicted joints, which further improves accuracy.
In this section, we present our proposed Double Embedding method. Figure 2illustrates the overall architecture of the proposed approach.
Joints Feature Map
For an image I, we generate two kinds of feature maps from the backbone network, one for Global Embedding (GE) and one for Local Embedding (LE). We use J^R = {J_1^R, J_2^R, ..., J_U^R} to denote the Root Joints Group (RJG), where J_i^R is the i-th kind of joint in RJG for all N persons in image I, and U is the number of joint categories in RJG. Similarly, we use J^A = {J_1^A, J_2^A, ..., J_V^A} to denote the Adjacency Joints Group (AJG), where J_i^A is the i-th kind of joint in AJG for all persons in image I, and V is the number of joint categories in AJG.
For Global Embedding, let h_k^G ∈ R^{W×H} (k = 1, 2, ..., U) denote the feature map for the k-th kind of joint in RJG. For Local Embedding, h_k^L ∈ R^{W×H} (k = 1, 2, ..., V) denotes the feature map for the k-th kind of joint in AJG. The form and generation method of h_k^G and h_k^L are the same; to simplify the description, we use h_k^f to represent both.
It has been pointed out that directly regressing absolute joint coordinates in an image is difficult [35,14]. We therefore use heatmaps, i.e., confidence maps that model the joint positions as Gaussian peaks. For a position (x, y) in image I, h_k^f(x, y) is calculated by:

h_k^f(x, y) = \begin{cases} \exp(-\|(x, y) - (x_k^i, y_k^i)\|^2 / \sigma^2), & (x, y) \in \aleph_k^i \\ 0, & \text{otherwise} \end{cases}    (1)

where σ is an empirical constant controlling the variance of the Gaussian distribution, set to 7 in our experiments, and (x_k^i, y_k^i) denotes the i-th ground-truth joint position in h_k^f. \aleph_k^i = {(x, y) | \|(x, y) - (x_k^i, y_k^i)\| ≤ τ} is the regression area for each joint, which truncates the Gaussian distribution. Thus, we generate the two kinds of feature maps for GE and LE. Joint locations of RJG and AJG are derived through an NMS process.
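The truncated-Gaussian heatmap of Eq. (1) can be sketched in pure Python as follows; the grid size, joint list, and default σ = τ = 7 follow the text, while the function name and list-of-lists layout are illustrative choices:

```python
import math

def joint_heatmap(width, height, joints, sigma=7.0, tau=7.0):
    """Truncated-Gaussian confidence map as in Eq. (1): a Gaussian peak
    at each ground-truth joint location, zero outside the regression
    area of radius tau. Returns a height x width nested list."""
    h = [[0.0] * width for _ in range(height)]
    for (jx, jy) in joints:
        for y in range(height):
            for x in range(width):
                d2 = (x - jx) ** 2 + (y - jy) ** 2
                if d2 <= tau * tau:  # inside the regression area
                    # keep the max when regression areas overlap
                    h[y][x] = max(h[y][x], math.exp(-d2 / sigma ** 2))
    return h
```

At the ground-truth pixel the confidence is exactly 1, and it falls to 0 beyond the truncation radius τ.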
Global Embedding
Global Embedding functions as a simpler person detection process, reducing the person detection problem to identifying the RJG. We use associative embedding for this process. The identification information in h_k^G is encoded in a 1D tag space T_k for the k-th joint in RJG. Based on the pixel locations derived from peak detections in feature map h_k^G, the corresponding tags are retrieved at the same pixel locations in T_k. Joints belonging to one person have similar tags, while tags of joints of different persons differ markedly. Let p = {p_1, p_2, ..., p_N} denote the N persons contained in image I. GE can be represented as a function f_{GE}: h_k^G → T_k, which learns to densely transform every pixel in h_k^G to the embedding space T_k. We use loc(p_n, J_k^R) (n = 1, ..., N, k = 1, ..., U) to denote the ground-truth pixel location of the k-th kind of joint in RJG of the n-th person.
If U' joints are labeled, the reference embedding for the n-th person is the average of the retrieved tags of RJG in this person:

\bar{Tag}_n = \frac{1}{U'} \sum_k T_k(loc(p_n, J_k^R))    (2)

To pull the tags of joints within an individual together, the pull loss computes the squared distance between the reference embedding and the predicted embedding for each joint:

L_{pull} = \frac{1}{N U'} \sum_n \sum_k (\bar{Tag}_n - T_k(loc(p_n, J_k^R)))^2    (3)

To push apart the tags of joints of different persons, the push loss penalizes reference embeddings that are close to each other. As the distance between two tags increases, the push loss drops exponentially to zero, resembling the probability density function of a Gaussian distribution:

L_{push} = \frac{1}{N^2} \sum_n \sum_{n'} \exp\{-\frac{1}{2\sigma^2} (\bar{Tag}_n - \bar{Tag}_{n'})^2\}    (4)

The loss used to train the model f_{GE} is the sum of L_{pull} and L_{push}:

L_G = \frac{1}{N U'} \sum_n \sum_k (\bar{Tag}_n - T_k(loc(p_n, J_k^R)))^2 + \frac{1}{N^2} \sum_n \sum_{n'} \exp\{-\frac{1}{2\sigma^2} (\bar{Tag}_n - \bar{Tag}_{n'})^2\}    (5)
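A minimal sketch of the pull/push grouping losses of Eqs. (2)-(5), in plain Python over already-retrieved tags; the Gaussian bandwidth `sigma` of the push term is an assumed value, not specified in the text:

```python
import math

def grouping_losses(person_tags, sigma=1.0):
    """Associative-embedding losses, sketched.
    person_tags[n] holds the 1D tags retrieved at the RJG joint
    locations of person n; sigma is an assumed push-loss bandwidth."""
    refs = [sum(tags) / len(tags) for tags in person_tags]      # Eq. (2)
    n = len(person_tags)
    n_tags = sum(len(tags) for tags in person_tags)
    # Pull: squared distance of each tag to its person's reference.
    pull = sum((r - t) ** 2
               for r, tags in zip(refs, person_tags) for t in tags) / n_tags
    # Push: penalize reference tags of different persons being close.
    push = sum(math.exp(-(refs[a] - refs[b]) ** 2 / (2 * sigma ** 2))
               for a in range(n) for b in range(n) if a != b) / (n * n)
    return pull, push
```

With well-separated per-person tags the pull term is near zero and the push term decays exponentially, matching the intuition behind Eq. (4).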
Local Embedding
Local Embedding performs local inference and builds the connection clues between AJG and the identified RJG. The relative position information of h_k^L is encoded in a displacement space D_k for the k-th joint in AJG. For each joint in AJG, we obtain its corresponding normalized displacement to RJG from D_k at the same position as in h_k^L. We first build the basic displacement connecting to the center joint. The basic displacement for the k-th joint in AJG of the n-th person is represented as the 2D vector:

Dis_n^k = (x_n^r, y_n^r) - (x_n^k, y_n^k)    (6)

where (x_n^r, y_n^r) is the location of the center joint of the n-th person. Besides, we design hierarchical displacements to connect the limb joints to the corresponding torso joints. Compared to basic displacements directly targeting the center joint, hierarchical displacements are shorter, more robust, and easier to predict. Normally, we use hierarchical displacements in the inference process, but if some intermediate joints are absent, we directly use the long-range prediction to complete the inference.
The general displacement from joint A to joint B of the n-th person is:

Dis_n^{A→B} = (x_n^B, y_n^B) - (x_n^A, y_n^A)    (7)

In some cases, we may use the property Dis_n^{B→A} = -Dis_n^{A→B} to get the reverse displacements of paired joints.
Local Embedding f_{LE}: h_k^L → D_k maps each pixel in feature map h_k^L to the embedding space D_k. For learning f_{LE}, we build the target regression map for the displacement vector from joint A (the k-th kind of joint in AJG) to joint B of the n-th person as follows:

D_k^A(x, y) = \begin{cases} Dis_n^{(x,y)→(x_n^B, y_n^B)} / Z, & (x, y) \in \aleph_k^A \\ 0, & \text{otherwise} \end{cases}    (8)

where \aleph_k^A = {(x, y) | \|(x, y) - (x_n^A, y_n^A)\| ≤ τ}. The displacements are created in \aleph_k^A, which is the same as the regression area in h_k^L. Z = \sqrt{H^2 + W^2} is the normalization factor, with H and W denoting the height and width of the image.
The starting point A (x_n^A, y_n^A) is generated from peak detections in feature map h_k^L, and we read its corresponding displacement to joint B from D_k as Dis_n^{A→B} = D_k((x_n^A, y_n^A)). The ending point is obtained as (x_n^{End}, y_n^{End}) = (x_n^A, y_n^A) + Z · Dis_n^{A→B}. By comparing it with the peak detections in h_B^L (which contains the same category of joints as joint B for all persons in the image, including (x_n^B, y_n^B)), it is confirmed whether joint B is the ending joint of joint A. If so, joint A is connected to joint B, meaning they share the same identification. In this way, joints in AJG are connected to RJG and identified.
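The decoding step just described can be sketched as follows: read the normalized displacement at a detected start joint, follow it (scaled by Z) to a predicted endpoint, and match it against the parent-joint peaks. The `match_radius` tolerance is an assumption for illustration, not a value from the paper:

```python
import math

def connect_to_parent(start, disp_field, parent_peaks, img_w, img_h,
                      match_radius=7.0):
    """Decode one LE connection. disp_field[y][x] holds the normalized
    (dx, dy) displacement; parent_peaks are the detected candidates of
    the parent joint category. Returns the matched peak or None."""
    z = math.sqrt(img_h ** 2 + img_w ** 2)        # normalization factor Z
    sx, sy = start
    dx, dy = disp_field[sy][sx]                   # Dis read at the start pixel
    ex, ey = sx + z * dx, sy + z * dy             # predicted parent location
    best, best_d = None, float("inf")
    for px, py in parent_peaks:                   # peaks from the parent heatmap
        d = math.hypot(ex - px, ey - py)
        if d < best_d:
            best, best_d = (px, py), d
    return best if best_d <= match_radius else None
```

If no parent peak falls within the tolerance, the connection is left unresolved, which is where the fallback to the long-range (basic) displacement described above would apply.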
Mutual Refine Machine
We design the Mutual Refine Machine (MRM) to reduce the prediction difficulty in complex scenes. A low-quality predicted joint can be refined by its neighboring high-quality joints. Based on the displacements and connection information in LE, MRM realizes information communication between paired joints. For the n-th person, if the prediction confidence of the i-th joint in the k-th kind of joint confidence map, h_k^f((x_n^i, y_n^i)), is lower than that of its high-confidence neighboring paired joints {h_{k'}^f((x_n^{i'}, y_n^{i'}))}, its location is refined as the confidence-weighted combination of its own prediction and the locations implied by its neighbors through the displacements Dis_n^{i'→i}:

(x_n^i, y_n^i)_{refined} = \frac{h_k^f((x_n^i, y_n^i))}{Q} \cdot (x_n^i, y_n^i) + \sum_{i'} \frac{h_{k'}^f((x_n^{i'}, y_n^{i'}))}{Q} \cdot ((x_n^{i'}, y_n^{i'}) + Dis_n^{i'→i})    (9)

Q = h_k^f((x_n^i, y_n^i)) + \sum_{i'} h_{k'}^f((x_n^{i'}, y_n^{i'}))    (10)
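The refinement of Eqs. (9)-(10) can be sketched as a confidence-weighted average; the function name and argument layout are illustrative:

```python
def refine_location(p_self, loc_self, neighbors):
    """MRM-style refinement sketch. Blend a joint's own predicted
    location with the locations implied by its paired neighbors
    (neighbor position plus the displacement toward this joint).
    neighbors: list of (confidence, implied_location) pairs."""
    q = p_self + sum(p for p, _ in neighbors)               # Eq. (10)
    x = p_self * loc_self[0] + sum(p * loc[0] for p, loc in neighbors)
    y = p_self * loc_self[1] + sum(p * loc[1] for p, loc in neighbors)
    return (x / q, y / q)                                   # Eq. (9)
```

A low-confidence joint is thus pulled toward the positions its confident neighbors predict for it, while a high-confidence joint is barely moved.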
Training and inference
To train our model, we adopt an L2 loss L_H for joint confidence regression, a smooth L1 loss [36] L_D for displacement regression, and L_G for GE. The total loss L for each image is the weighted sum of L_H, L_D and L_G:

L = \sum_{x=1}^{U+V} L_H(h_x^f, \hat{h}_x^f) + \alpha \sum_{y=1}^{V} L_D(D_y, \hat{D}_y) + \beta L_G    (11)

where \hat{h}_x^f and \hat{D}_y denote the predicted joint confidence maps and displacement regression maps. α and β are constant weight factors balancing the three kinds of losses, both set to 0.01. The overall framework of DE is end-to-end trainable via gradient backpropagation.
The overall architecture of DE is illustrated in Figure 2. For an image, DE generates two kinds of feature maps, \hat{h}_k^G and \hat{h}_k^L. By performing NMS on them, we get the predicted joint locations of RJG and AJG. GE gives identification tags for RJG, and LE provides the connection relations to connect AJG to RJG. To better present the collaborative work of GE and LE, we add an intermediate illustration: connected pairs get identification information from GE, and the joints identified by GE expand to all joints through the connectivity in LE. Based on the displacements and connectivity in LE, MRM refines the low-quality predicted joints. The final result is generated by combining the refined results from GE and LE.

Datasets
We evaluate the proposed Double Embedding model on three widely used benchmarks for multi-person pose estimation: the MSCOCO [21], MPII [22] and CrowdPose [23] datasets.
The MSCOCO [21] dataset contains over 200,000 images and 250,000 person instances labeled with 17 keypoints. COCO is divided into train/val/test-dev sets with 57k, 5k and 20k images respectively. The MPII [22] dataset contains 5,602 images of multiple persons, each person annotated with 16 body joints. The images are divided into 3,844 for training and 1,758 for testing. MPII also provides over 28,000 annotated single-person pose samples.
The CrowdPose [23] dataset consists of 20,000 images, containing about 80,000 person instances. The training, validation and testing subsets are split in the proportion 5:1:4. CrowdPose has more crowded scenes than COCO and MPII, and is therefore more challenging for multi-person pose estimation.
Data augmentation
We follow conventional data augmentation strategies in our experiments. For the MSCOCO and CrowdPose datasets, we augment training samples with random rotation, random scaling and random horizontal flipping, and crop input images to 640×640 with padding. For the MPII dataset, the random scale range is set to [0.7, 1.3], while the other augmentation parameters are the same as for MSCOCO and CrowdPose.
Evaluation metric
For COCO and CrowdPose datasets, the standard evalua-tion metric is based on Object Keypoint Similarity (OKS):
OKS = \frac{\sum_i \exp(-d_i^2 / 2 s^2 k_i^2) \, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}    (12)

where d_i is the Euclidean distance between the predicted joint and the ground truth, v_i is the visibility flag of the ground truth, s is the object scale, and k_i is a per-keypoint constant that controls falloff. We report the standard average precision and recall scores: AP (the mean of AP scores at 10 OKS thresholds, OKS = 0.50, 0.55, ..., 0.90, 0.95), AP^{50} (AP at OKS = 0.50), AP^{75} (AP at OKS = 0.75), AP^M for medium objects, AP^L for large objects, and AR (at OKS = 0.50, 0.55, ..., 0.90, 0.95).
For the MPII dataset, the standard evaluation metric is the PCKh (head-normalized probability of correct keypoint) score. A joint is correct if it falls within αl pixels of the ground-truth position, where α is a constant and l is the head size, corresponding to 60% of the diagonal length of the ground-truth head bounding box. We report the [email protected] (α = 0.5) score.
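The OKS of Eq. (12) can be computed directly as below; the function name and argument layout are illustrative:

```python
import math

def oks(pred, gt, vis, scale, kappa):
    """Object Keypoint Similarity of Eq. (12), restricted to labeled
    joints (visibility flag > 0). pred/gt: (x, y) per joint; scale: the
    object scale s; kappa: the per-keypoint falloff constants k_i."""
    num = den = 0.0
    for (px, py), (gx, gy), v, k in zip(pred, gt, vis, kappa):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            num += math.exp(-d2 / (2 * scale ** 2 * k ** 2))
            den += 1
    return num / den if den else 0.0
```

A perfect prediction yields OKS = 1, and joints without visibility labels contribute nothing to either sum.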
Implementation
For the COCO dataset, we use the standard validation set for ablation studies and the test-dev set to compare with other state-of-the-art methods. For the CrowdPose dataset, we use the CrowdPose train and val sets to train our model, and the test set for validation. For the MPII dataset, following [37], we randomly select 350 groups of multi-person training samples as the validation set and use the remaining training samples and all single-person images as the training set. We use the Adam optimizer [38]. For the COCO and CrowdPose datasets, the base learning rate is set to 1e-3, and dropped to 1e-4 and 1e-5 at the 200th and 260th epochs respectively; we train the model for a total of 300 epochs. For the MPII dataset, we initialize the learning rate to 1e-3, train for 260 epochs, and decrease the learning rate by a factor of 2 at the 160th, 180th, 210th and 240th epochs. Following HigherHRNet [39], we adopt flip testing in all experiments.
Table 1.
Comparison with state-of-the-art methods on the COCO2017 test-dev dataset. Top: w/o multi-scale test. Bottom: w/ multi-scale test. († indicates top-down methods.)

In Table 1, we compare our proposed model with other state-of-the-art methods on the COCO2017 test-dev dataset. We also test the run time of single-scale inference; the proposed method achieves a balance between speed and accuracy. We achieve competitive accuracy, outperforming most existing bottom-up methods. Compared to the typical top-down method CPN [12], we narrow the accuracy gap to top-down methods with less complexity. This demonstrates the effectiveness of DE for multi-person pose estimation.
Ablation analysis
We conduct an ablation analysis on the COCO2017 [21] validation dataset without multi-scale test. We evaluate the impact of the introduced hierarchical short-range displacements that factorize the basic displacements, as well as the effect of MRM. MRM is implemented on top of the hierarchical displacements, so it cannot be evaluated without them. Results are shown in Table 2. With basic displacements only, DE achieves 69.3% mAP. By introducing hierarchical displacements, performance improves to 70.7% mAP. Based on the hierarchical displacements, MRM further improves performance by 0.9% mAP. These results show the effectiveness of the hierarchical displacements and MRM.
Table 2.
Ablation experiments on the COCO validation dataset (model settings: Basic Dis. / Hierar. Dis. / MRM; metrics: AP, AP^{50}, AP^{75}, AP^M, AP^L).

In addition, we analyze the impact of the hyper-parameter τ, which determines the regression area for the joint confidence maps and displacements in Section 3. We observe the performance of the proposed model as τ varies from 1 to 20. As shown in Figure 3, the performance monotonically improves as τ increases from 1 to 7. When 7 < τ < 10, performance remains unchanged as τ increases. When τ > 10, performance degrades as τ increases. This can be explained by the distribution of positive samples in the dataset. When τ increases in the range between 1 and 7, positive samples increase and a larger effective area of each joint is counted in joint confidence and displacement regression during training. When τ increases between 7 and 10, effective information and background noise increase with equivalent effect. When τ > 10, more background noise is counted as positive samples, and the regression areas of joints increasingly overlap with each other. A smaller τ also means less complexity; we thus set τ = 7 to balance accuracy and efficiency.
Qualitative results
Qualitative results on the COCO dataset are shown in the top row of Figure 4. The proposed model performs well in challenging scenarios, e.g., pose deformation (1st and 2nd examples), person overlapping and self-occlusion (3rd example), crowded scenes (4th example), and scale variation and small-scale prediction (5th example). This demonstrates the effectiveness of our method.
Fig. 3.
Study of the hyper-parameter τ, which determines the regression area for the joint confidence maps and displacements.

Table 3.
Comparison with state-of-the-art methods on the full MPII testing set.

Method | Head | Sho | Elb | Wri | Hip | Knee | Ank | Total | Time[s]
Iqbal and Gall [16] | 58.4 | 53.9 | 44.5 | 35.0 | 42.2 | 36.7 | 31.1 | 43.1 | 10
Insafutdinov et al. [15] | 78.4 | 72.5 | 60.2 | 51.0 | 57.2 | 52.0 | 45.4 | 59.5 | 485
Levinkov et al. [15] | 89.8 | 85.2 | 71.8 | 59.6 | 71.1 | 63.0 | 53.5 | 70.6 | -
Insafutdinov et al. [40] | 88.8 | 87.0 | 75.9 | 64.9 | 74.2 | 68.8 | 60.5 | 74.3 | -
Cao et al. [33] | 91.2 | 87.6 | 77.7 | 66.8 | 75.4 | 68.9 | 61.7 | 75.6 | 0.6
Fang et al. [11] | 88.4 | 86.5 | 78.6 | 70.4 | 74.4 | 73.0 | 65.8 | 76.7 | 0.4
Newell and Deng [13] | 92.1 | 89.3 | 78.9 | 69.8 | 76.2 | 71.6 | 64.7 | 77.5 | 0.25
Fieraru et al. [41] | 91.8 | 89.5 | 80.4 | 69.6 | 77.3 | 71.7 | 65.5 | 78.0 | -
SPM [37] | 89.7 | 87.4 | 80.4 | 72.4 | 76.7 | 74.9 | 68.3 | 78.5 | 0.058
DoubleEmbedding (Ours) | 91.9 | 89.7 | 81.6 | 74.9 | 79.8 | 75.8 | 71.5 | 80.7 | 0.21
Table 4.
Ablation experiments on the MPII validation dataset (model settings: Basic Dis. / Hierar. Dis. / MRM; metrics: Head, Sho, Elb, Wri, Hip, Knee, Ank, Total).
Table 3 compares the proposed method with state-of-the-art methods on the MPII dataset. The proposed model obtains 80.7% mAP, a competitive result among bottom-up methods. In addition, we conduct an ablation study on the MPII validation dataset to verify MRM and the hierarchical displacements against the basic displacements. As shown in Table 4, DE improves from 77.5% mAP to 78.8% mAP by introducing hierarchical displacements. Moreover, the improvements on wrists and ankles are significant, from 68.9% to 71.3% mAP and from 73.6% to 74.8% mAP respectively, indicating the effectiveness of the hierarchical displacements in factorizing long-range displacements. MRM further improves performance by 1.1% mAP on top of the hierarchical displacements.
Qualitative results on MPII are shown in the middle row of Figure 4, demonstrating the good performance and robustness of our model in complex scenes such as person scale variation (1st example), large pose deformation (2nd and 3rd examples) and small-scale prediction (3rd example).
Table 5.
Comparison with state-of-the-art methods on the CrowdPose test dataset.

Method | AP | AP^{50} | AP^{75} | AP^E | AP^M | AP^H
Openpose [17] | - | - | - | 62.7 | 48.7 | 32.3
Mask-RCNN† [9] | 57.2 | 83.5 | 60.3 | 69.4 | 57.9 | 45.8
AlphaPose† [11] | 61.0 | 81.3 | 66.0 | 71.2 | 61.4 | 51.1
SPPE† [23] | 66.0 | 84.2 | 71.5 | 75.5 | 66.3 | 57.4
HigherHRNet [39] | 67.6 | 87.4 | 72.6 | 75.8 | 68.1 | 58.9
HRNet† [42] | 71.7 | 89.8 | 76.9 | 79.6 | 72.7 | 61.5
DoubleEmbedding (Ours) | 68.8 | 89.7 | 73.4 | 76.1 | 69.5 | 60.3
† indicates top-down methods

Table 5 shows the experimental results on CrowdPose. The proposed model achieves 68.8% AP, outperforming the existing bottom-up methods. The performance is still lower than that of the state-of-the-art top-down method, HRNet, which has an intrinsic accuracy advantage due to its processing flow; however, our method narrows the accuracy gap between bottom-up and top-down methods with less complexity. The performance on the CrowdPose dataset indicates the robustness of our method in crowded scenes.
Qualitative results on the CrowdPose dataset are shown in the bottom row of Figure 4. The results verify the effectiveness of our model in complex scenes, e.g., ambiguous pose and small-scale prediction (1st example), self-occlusion (2nd example), cluttered background (3rd example), and person overlapping and crowded scene (4th example).
Fig. 4.
Qualitative results on MSCOCO dataset (top), MPII dataset (middle) andCrowdPose dataset (bottom).
In this paper, we propose the Double Embedding (DE) method for multi-person pose estimation. Through Global Embedding (GE) and Local Embedding (LE), we achieve parallel implementation of person detection and joints detection, overcoming the intrinsic disadvantages of the conventional two-stage strategy for multi-person pose estimation. GE reduces the person instance detection problem to identifying a group of joints, and LE connects and identifies the remaining joints hierarchically. Based on LE, we design the Mutual Refine Machine (MRM) to further enhance performance in complex scenarios. We implement DE based on CNNs with end-to-end learning and inference. Experiments on three main benchmarks demonstrate the effectiveness of our model. DE achieves competitive results among existing bottom-up methods and narrows the gap to the state-of-the-art top-down methods with less complexity.
References
1. Chéron, G., Laptev, I., Schmid, C.: P-CNN: Pose-based CNN features for action recognition. (2015)
2. Li, Y., Zhou, S., Huang, X., Xu, L., Ma, Z., Fang, H., Wang, Y., Lu, C.: Transferable interactiveness prior for human-object interaction detection. CoRR abs/1811.08264 (2018)
3. Fang, H.S., Cao, J., Tai, Y.W., Lu, C.: Pairwise body-part attention for recognizing human-object interactions. (2018)
4. Qian, X., Fu, Y., Xiang, T., Wang, W., Qiu, J., Wu, Y., Jiang, Y.G., Xue, X.: Pose-normalized image generation for person re-identification. (2017)
5. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. (2018)
6. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. (2019)
7. Wang, J., Sun, K., Cheng, T., Jiang, B., Xiao, B.: Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (2020) 1-1
8. Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.P.: Towards accurate multi-person pose estimation in the wild. CoRR abs/1701.01779 (2017)
9. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. IEEE Transactions on Pattern Analysis & Machine Intelligence (2018) 1-1
10. Huang, S., Gong, M., Tao, D.: A coarse-fine network for keypoint localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). (2017)
11. Fang, H., Xie, S., Tai, Y., Lu, C.: RMPE: Regional multi-person pose estimation. (2016)
12. Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. (2017)
13. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. (2016)
14. Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P., Schiele, B.: DeepCut: Joint subset partition and labeling for multi person pose estimation. (2016)
15. Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. (2016)
16. Iqbal, U., Gall, J.: Multi-person pose estimation with local joint-to-person associations. (2016)
17. Osokin, D.: Real-time 2D multi-person pose estimation on CPU: Lightweight OpenPose. arXiv preprint arXiv:1811.12004 (2018)
18. Newell, A., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. CoRR abs/1611.05424 (2016)
19. Papandreou, G., Zhu, T., Chen, L., Gidaris, S., Tompson, J., Murphy, K.: PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. CoRR abs/1803.08225 (2018)
20. Cheng, B., Xiao, B., Wang, J., Shi, H., Zhang, L.: HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). (2020)
21. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. (2014)
22. Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: Human pose estimation: New benchmark and state of the art analysis. In: Computer Vision and Pattern Recognition (CVPR). (2014)
23. Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: CrowdPose: Efficient crowded scenes pose estimation and a new benchmark. (2018)
24. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015)
25. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. (2016)
26. Cheng, B., Wei, Y., Shi, H., Feris, R.S., Xiong, J., Huang, T.S.: Revisiting RCNN: On awakening the classification power of Faster RCNN. CoRR abs/1803.06799 (2018)
27. Cheng, B., Wei, Y., Shi, H., Feris, R., Xiong, J., Huang, T.: Decoupled classification refinement: Hard false positive suppression for object detection. (2018)
28. Gkioxari, G., Hariharan, B., Girshick, R., Malik, J.: Using k-poselets for detecting people and localizing their keypoints. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014)
29. Min, S., Savarese, S.: Articulated part-based model for joint object detection and pose estimation. In: IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011. (2011)
30. Wei, S., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 4724-4732
31. Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: RMPE: Regional multi-person pose estimation. In: ICCV. (2017)
32. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems. (2015) 2017-2025
33. Cao, Z., Simon, T., Wei, S., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. CoRR abs/1611.08050 (2016)
34. Kreiss, S., Bertoni, L., Alahi, A.: PifPaf: Composite fields for human pose estimation. CoRR abs/1903.06593 (2019)
35. Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. CoRR abs/1507.06550 (2015)
36. Girshick, R.: Fast R-CNN. Computer Science (2015)
37. Nie, X., Zhang, J., Yan, S., Feng, J.: Single-stage multi-person pose machines. (2019)
38. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. Computer Science (2014)
39. Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. (2019)
40. Insafutdinov, E., Andriluka, M., Pishchulin, L., Tang, S., Levinkov, E., Andres, B., Schiele, B.: Articulated multi-person tracking in the wild. CoRR abs/1612.01465 (2016)
41.
Fieraru, M., Khoreva, A., Pishchulin, L., Schiele, B.: Learning to refine humanpose estimation. CoRR abs/1804.07909 (2018)42. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learningfor human pose estimation. CoRR abs/1902.09212abs/1902.09212