FastPose: Towards Real-time Pose Estimation and Tracking via Scale-normalized Multi-task Networks
Jiabin Zhang, Zheng Zhu, Wei Zou, Peng Li, Yanwei Li, Hu Su, Guan Huang
Jiabin Zhang∗  Zheng Zhu∗  Wei Zou  Peng Li  Yanwei Li  Hu Su  Guan Huang
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Beijing University of Posts and Telecommunications, Beijing, China
Horizon Robotics, Beijing, China
{zhangjiabin2016, wei.zou, liyanwei2017, hu.su}@ia.ac.cn, [email protected], [email protected], [email protected]

Abstract
Both accuracy and efficiency are significant for pose estimation and tracking in videos. State-of-the-art performance is dominated by two-stage top-down methods. Despite their leading results, these methods are impractical for real-world applications due to their separated architectures and complicated computation. This paper addresses the task of articulated multi-person pose estimation and tracking towards real-time speed. An end-to-end multi-task network (MTN) is designed to perform human detection, pose estimation, and person re-identification (Re-ID) simultaneously. To alleviate the performance bottleneck caused by the scale variation problem, a paradigm that exploits scale-normalized image and feature pyramids (SIFP) is proposed to boost both performance and speed. Given the results of MTN, we adopt an occlusion-aware Re-ID feature strategy in the pose tracking module, where pose information is utilized to infer the occlusion state and make better use of the Re-ID feature. In experiments, we demonstrate that pose estimation and tracking performance improves steadily with SIFP across different backbones. Using ResNet-18 and ResNet-50 as backbones, the overall pose tracking framework achieves competitive performance at 29.4 FPS and 12.2 FPS, respectively. Additionally, the occlusion-aware Re-ID feature decreases identity switches by 37% during pose tracking.
1. Introduction
Human pose estimation in images and articulated pose tracking in videos are of importance for visual understanding tasks [63, 24]. The research community has witnessed a significant advance from single-person [4, 18, 47, 46, 48, 36, 56] to multi-person pose estimation [38, 27, 8, 37, 12], and from pose estimation in static images [39, 24] to articulated

∗ The first two authors contributed equally to this work.
Figure 1. Top: inference speed and MOTA performance on the PoseTrack [2] val set. Bottom: inference speed and mAP performance on the PoseTrack val set. Methods involved are PoseTrack [28], JointFlow [16], PoseFlow [53], Detect-and-Track [21], FlowTrack [50], and our FastPose framework with various backbones. A base-10 logarithmic scale is adopted for the x-axis.

tracking in videos [28, 26, 16, 21, 53, 50, 59, 29, 65]. However, pose estimation in complex environments still faces challenging problems such as occlusion, intense light, and rare poses [3, 34, 41]. Furthermore, articulated tracking encounters new challenges in unconstrained videos such as camera motion, blur, and viewpoint variation [2]. Previous pose estimation systems address a single pre-located person, exploiting pictorial structure models [4, 18] and, subsequently, deep convolutional neural network (DCNN) approaches [47, 46, 48, 36, 56]. Motivated by practical applications in video surveillance, human-computer interaction, and action recognition, researchers now focus on multi-person pose estimation in unconstrained environments. Multi-person pose estimation can be categorized into bottom-up [38, 27, 8] and top-down approaches [37, 12, 24, 50], where the latter dominate recent benchmarks [34, 3]. Top-down approaches can be divided into two-stage methods and unified frameworks. Two-stage methods [37, 12, 50] first detect and crop persons from the image, then perform single-person pose estimation on the cropped patches. A representative unified-framework method is Mask R-CNN [24], which extracts human bounding boxes and predicts keypoints from the corresponding feature maps simultaneously. Generally, two-stage methods achieve state-of-the-art results on both pose estimation and pose tracking tasks, beyond the performance of unified approaches.
We argue that the performance bottleneck of unified methods is caused by the scale variation of humans. Specifically, two-stage pose estimation methods are scale invariant: based on the detection result of the first stage, the second stage only focuses on keypoint detection at a fixed scale. Despite their leading performance, these methods cannot perform real-time inference because of their complex procedures, including human detection, cropping and scaling images, and pose estimation. In contrast, unified frameworks can simply obtain the final multi-person pose estimation result from the original image in an end-to-end network. Unfortunately, the unified architecture destroys the scale invariance property. Although many methods [33, 23] have been proposed to alleviate the scale variation problem in face detection or object detection, few works focus on dealing with scale variation in unified multi-person pose estimation. Recent works [44, 45] give insight into the scale variation problem, but their inference speed suffers from multi-scale operations.

Different from multiple object tracking, which focuses on instance identity assignment, pose tracking aims to address the more complex problem of articulated multi-person pose tracking in videos. Based on bottom-up pose estimation methods, [28, 26] construct spatial-temporal graphs between detected joints and solve a matching or energy optimization problem. However, the high computational complexity of these methods makes them impractical for real-world applications. Based on top-down pose estimation methods, [50] exploits flow-based pose similarity as the metric and solves the matching problem in a greedy fashion. [21] proposes a 3D extension of Mask R-CNN, which predicts the locations of person tubes and the corresponding poses simultaneously. To link these poses over time, they solve a bipartite graph matching problem based on the intersection-over-union (IoU) metric.
These simple tracking modules may fail in challenging scenarios such as occlusions and crowds. Recent multiple object trackers [49, 5, 57, 64, 19, 31] prefer to use Re-ID features to maintain more robust tracks in these situations. However, the Re-ID feature often becomes unreliable when the target is occluded.

Based on the above analyses, this paper develops FastPose, a pose tracking framework that performs pose estimation and tracking towards real-time speed. Specifically, we first build a multi-task network (MTN) that jointly optimizes three tasks simultaneously: human detection, pose estimation, and person Re-ID. The three groups of outputs are utilized to perform pose tracking. Then a scale-normalized paradigm is proposed to alleviate the scale variation problem for the multi-task network. Finally, an occlusion-aware Re-ID strategy is designed for articulated multi-person pose tracking in videos. To make better use of Re-ID features, we utilize the pose information to infer the occlusion state.

The main contributions of this paper are as follows:

(1) Taking person Re-ID features into account, we design an end-to-end multi-task network that performs human detection, pose estimation, and person Re-ID simultaneously. The network's outputs provide the necessary information for the subsequent pose tracking strategy.

(2) We propose a paradigm named scale-normalized image and feature pyramids (SIFP) for alleviating the scale variation problem, which is the performance bottleneck of unified top-down pose estimation methods. Based on an image pyramid, we ignore extremely small and large objects so that object sizes are uniformly distributed within a fixed range. Combining feature pyramid networks (FPN) with this scale distribution helps the network avoid multi-scale testing.

(3) Utilizing the outputs of our multi-task network, an occlusion-aware strategy is exploited to perform articulated multi-person pose tracking in videos.
Specifically, the pose information is utilized to infer the occlusion state and realize the occlusion-aware Re-ID strategy, which dramatically reduces identity (ID) switches during tracking.

(4) In experiments, our FastPose-18 (which takes ResNet-18 as the backbone) achieves real-time inference at 29.4 frames per second (FPS) while obtaining a mAP score of 63.1 and a MOTA score of 56.8. It is faster than other pose tracking approaches. Taking ResNet-50 as the backbone, FastPose-50 achieves a fairly competitive performance of 69.7 mAP and 62.8 MOTA with an inference speed of 12.2 FPS. The detailed relationship between accuracy and inference speed of FastPose and other approaches is illustrated in Fig. 1. Based on the occlusion-aware Re-ID feature, our proposed tracking strategy achieves a 37% decrease in ID switches over the tracking strategy without the Re-ID feature.

2. Related Works
Pose estimation has come a long way as a basic research topic of computer vision. In recent years, motivated by practical applications, researchers have switched focus from single-person [4, 18, 47, 46, 48, 36, 56] to multi-person pose estimation. Different from the single pre-located person setting, multi-person pose estimation can be categorized into bottom-up [38, 27, 8] and top-down approaches [37, 12, 24, 50]. CPN [12] is the leading method on the COCO 2017 keypoint challenge. It involves skip-layer feature concatenation and an online hard keypoint mining step. [50] adopts FPN-DCN as the human detector and adds a few deconvolutional layers to a single-person pose estimation network to improve performance. These top-down methods achieve multi-person pose estimation via a two-stage process: obtaining person bounding boxes with a person detector and predicting keypoint locations within these boxes. Besides, Mask R-CNN [24] builds an end-to-end framework and yields impressive performance, but it still lags behind these two-stage methods. We argue that the performance bottleneck of the unified approaches is caused by the scale variation problem, which does not exist in the above two-stage framework.
Based on the multi-person pose estimation approaches described above, it is natural to extend them to multi-person pose tracking in videos. Hence, the works on pose tracking can also be divided into bottom-up and top-down methods. In [28, 26], the authors first estimate human poses with a bottom-up method and then transform the problem into minimizing an energy function over a spatio-temporal graph constructed on the detected joints. [16] proposes a model to predict Temporal Flow Fields (TFF) as a similarity measure between detected joints. These similarities are used as binary potentials in a bipartite graph optimization problem in order to track multiple poses. Based on top-down pose estimation methods, [21] proposes an extended Mask R-CNN and solves the bipartite graph matching problem based on IoU. [50] exploits flow-based pose similarity as the metric and solves the matching problem in a greedy fashion. Based on the obtained single-person poses, [53] proposes to construct pose flows and perform pose flow non-maximum suppression (NMS) to eliminate issues like ID switches.
Multi-task learning [9, 60, 20, 32] has been used successfully in natural language processing [13, 42], speech recognition [15], and computer vision [22, 61, 52]. Especially in many computer vision tasks, the effectiveness of multi-task learning has been proved. Fast R-CNN [22] and Faster R-CNN [40] jointly predict the class and the coordinates of objects in an image. Mask R-CNN [24] can efficiently detect objects in an image while simultaneously generating a high-quality segmentation mask for each instance. Similar to these methods, our approach shares the backbone network among all tasks while keeping several task-specific output layers. This form has several advantages; for example, one end-to-end neural network requires much less running time than several separate networks.
Large scale variation is one of the major factors influencing the performance of many computer vision tasks such as face detection, object detection, and pose estimation. Many face detection approaches [58, 30, 55] have been proposed to learn representations that are invariant to scale. With the help of image pyramids [1], methods like DPM [18] become more scale-robust. To address the problem that the large strides of deep CNNs make small-object detection very difficult, object detectors [10, 14] use dilated/atrous convolutions to increase the resolution of the feature map. Feature maps of higher layers have more semantic information but lower resolution, while those of lower layers have high resolution. SDP [54], SSH [35], and MS-CNN [7] therefore predict small objects on the lower layers and big objects on the higher layers. Furthermore, methods like FPN [33] and Mask R-CNN [24] propose a pyramidal representation and fuse adjacent-scale feature maps to combine semantic and detail information. Besides, methods like SNIP [44] and SNIPER [45] propose advanced and efficient data augmentation to alleviate the scale variation problem, but they need multi-scale testing to achieve high performance, which harms the inference efficiency of the network.
3. Our Approach
In this section, we discuss the proposed FastPose framework in detail. The pipeline of the whole framework is illustrated in Fig. 2. Given an original image as input, the multi-task network (MTN) predicts the bounding boxes, keypoint coordinates, and Re-ID features in the scene. A scale-normalized image and feature pyramid (SIFP) paradigm is exploited to alleviate the scale variation problem of MTN. Following MTN, we propose an occlusion-aware pose tracking strategy for articulated multi-person pose tracking in videos.
3.1. Multi-task Network

The MTN adopts a unified procedure similar to Mask R-CNN. We first use a deep convolutional neural network (CNN) to transform the original image into feature maps. A fully convolutional network, called a Region Proposal Network
Figure 2. The pipeline of the FastPose framework. In the training process of the multi-task network (MTN), a scale-normalized paradigm exploiting both image and feature pyramids is utilized to improve the distribution of object sizes. Utilizing the outputs of MTN, an occlusion-aware strategy performs pose tracking.

(RPN), is built upon these feature maps to propose candidate human bounding boxes. Based on the candidate boxes and their corresponding features extracted from the shared feature maps, Mask R-CNN has two branches: one performs classification and bounding-box regression; the other outputs a binary mask for each human proposal, which can easily be extended to perform human pose estimation. To extract a 128-d Re-ID feature for each person in the image, we add a third branch that outputs the classification of each person's identity.
Network Architecture:
Similar to Mask R-CNN, our proposed network can be instantiated with multiple architectures: (i) the backbone network used for feature extraction over the entire image, and (ii) the head networks for human detection (bounding-box classification and regression), pose estimation, and person Re-ID that are applied separately to each RoI.

For the backbone network, a deeper architecture improves the quality of the extracted features but brings longer training and inference time. To provide a trade-off between accuracy and speed when MTN is adopted in practical applications, we evaluate MobileNet-v2 [43] and ResNet [25] with FPN [33] at depths of 18, 50, and 101 layers.

For the pose estimation head network, Mask R-CNN adopts a straightforward structure, which limits the precision of keypoint localization; MTN extends it to a more efficient structure. In Mask R-CNN, feature maps of 512 channels are extracted by RoIAlign for each proposal. In MTN, we additionally utilize a padding operation to maintain the aspect ratio of the person in the feature maps extracted by RoIAlign, and the features then pass through a stack of convolutional layers to predict keypoint heatmaps.

For the person Re-ID head network, a straightforward structure is adopted to classify each person's identity. Using the RoIAlign operation, a small feature map is extracted for each person proposal. It then passes through a stack of layers to produce an N-d output, where N depends on the number of identities in the training dataset. For each person, the training target is a one-hot N-d vector, and we minimize the cross-entropy loss over an N-way softmax output. To reduce computational complexity and bandwidth consumption, we only take the top-128 person proposals into the training process. As this head network is built upon the backbone and RPN, it needs training data composed of multi-person images with corresponding ID annotations, such as person search datasets [51, 62].

The MTN provides the necessary information to the occlusion-aware strategy introduced in Sec. 3.3 to perform pose tracking.
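The identity classification above reduces to a standard N-way softmax cross-entropy per person proposal. The following is a minimal sketch of that loss; the function name and toy dimensions are ours, not taken from the authors' code:

```python
import math

def reid_cross_entropy(logits, person_id):
    """Cross-entropy over an N-way softmax for one person proposal.

    `logits` is the N-d Re-ID head output; `person_id` is the annotated
    identity index, i.e. the position of the 1 in the one-hot target.
    """
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return -math.log(exps[person_id] / total)     # one-hot target -> -log p(correct id)
```

In training, this loss would be averaged over the top-128 proposals kept per image.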
3.2. Scale-normalized Image and Feature Pyramids

As described above, MTN performs human detection, pose estimation, and person Re-ID simultaneously. Different from two-stage methods, which perform these tasks with separate networks, large scale variation across human instances is one of the main factors influencing the performance of our network, especially for pose estimation. Specifically, in the training and inference processes of two-stage methods, the scale of the input image for the single-person pose estimation network is fixed. However, MTN, as a unified network, builds all the head networks upon the RoIs generated by RPN. This mechanism breaks the scale invariance of MTN. Inspired by [44], we therefore develop a scale-normalized paradigm exploiting both image and feature pyramids (SIFP) to enhance the scale invariance capability of MTN.

In SIFP, we denote the scale s of each object as s = √(wh). Obviously, constraining s of all the training objects to an intermediate scale range helps to reduce scale variation. By using an image pyramid where the original image is resized with a set of scaling factors Ω = {ω_i}_{i=1}^{n}, each object instance appears at several different scales and some of those appearances fall in the desired scale range. However, with ω > 1 large objects become larger, and with ω < 1 small objects become smaller, which increases scale variation. Similar to [44], SIFP only uses objects that fall in a certain scale range [s_l, s_u] as training samples at each pyramid level. Additionally, images at a high-resolution pyramid level are cropped to the size of the original image without discarding any valid objects, and images at a low-resolution pyramid level are padded to the size of the original image. In this way, all object instances participate in training, which preserves diversity other than scale while reducing scale variation when training the network.
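The per-level selection rule can be sketched concretely as follows. The function name is ours; the default thresholds use the [16, 560] range that the Training section reports for SIFP:

```python
import math

def valid_training_objects(boxes, omega, s_l=16, s_u=560):
    """Return indices of objects used for training at one pyramid level.

    `boxes` holds (w, h) object sizes in the original image and `omega`
    is the scaling factor of this image-pyramid level. An object is
    kept only if its rescaled scale s = sqrt(w * h) falls in [s_l, s_u].
    """
    kept = []
    for i, (w, h) in enumerate(boxes):
        s = math.sqrt((w * omega) * (h * omega))   # scale after resizing by omega
        if s_l <= s <= s_u:
            kept.append(i)
    return kept
```

Running this once per scaling factor in Ω (e.g. 2.0, 1.5, 1.0, 0.75) lets every instance participate at the levels where its size falls in the desired range.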
A fixed size at each pyramid level also helps to utilize computing resources better.

If only the above extended image pyramid were used in the training process, testing images would also need to be resized to the different scales in Ω for consistency, because single-scale testing would cause a large domain shift due to the scale difference between training and testing objects. However, multi-scale testing would reduce inference efficiency. To maximize inference speed without reducing performance, SIFP exploits FPN to tackle this dilemma. With FPN, anchors are defined to have areas S = {s_{a_i}} on the corresponding feature maps P = {P_i}, respectively; for more details, please refer to [33]. Training objects in [s_l, s_u] are thus automatically distributed to different feature maps when assigning labels to anchors, and each feature map only needs to focus on objects in a smaller scale range. At inference, test objects are likewise distributed to different feature maps to be predicted. Due to FPN, MTN gains enhanced scale invariance, which alleviates the domain shift brought by single-scale testing.

In conclusion, SIFP is a modified version of SNIP: combining it with FPN avoids the slower inference speed brought by multi-scale testing. Our experiments in Sec. 4.1 validate that our paradigm is very effective even when testing on the original image in order to reach towards real-time performance.

3.3. Occlusion-aware Pose Tracking Strategy

Based on the detection boxes, keypoints, and Re-ID features provided by MTN, pose tracking is performed by an occlusion-aware strategy. Strategies like [21, 6] usually adopt IoU for linking tracks and ignore appearance information, which fails to achieve competitive tracking results when the tracklets are occluded or in rapid movement. Recently, Re-ID features have been widely adopted in the multi-object tracking community as a stable appearance cue. However, the Re-ID feature of a highly occluded target often contains invalid information and may cause drift in the tracking procedure.
Therefore, inferring the occlusion state is significant when adopting the Re-ID feature in complex scenarios. In this work, the occlusion-aware Re-ID feature is utilized as the similarity metric, replacing the traditional IoU metric.
Human keypoints can be utilized to infer the occlusion state via the number of visible keypoints N_valid, computed as:

N_valid = Σ_{i=1}^{N_k} 1[c_i > γ_valid]    (1)

where γ_valid is the confidence threshold used to judge whether keypoint k_i is visible, and 1[·] equals 1 if the condition is true and 0 otherwise. The Re-ID feature is regarded as valid when N_valid is greater than the number threshold θ_valid, which means that most keypoints are visible and the target is not occluded; otherwise the Re-ID feature is regarded as invalid.

A tracklet consists of historically matched detections. The appearance feature f_track of a tracklet should be maintained carefully to keep the tracking procedure stable: in some scenarios a target may move fast, so its scale and pose change rapidly. The appearance feature is updated only if the Re-ID feature of the matched detection is valid.

Given the Re-ID feature f_d of detection d and the appearance feature f_track of tracklet track, we adopt an integrated similarity metric S containing both position and appearance information:

S = θ_pos · IoU + (1 − θ_pos) · min(dist(f_d, f_track), σ_max) / σ_max    (2)

where θ_pos controls the weight of IoU in S, dist(f_d, f_track) is the Euclidean distance between feature f_d and feature f_track, and the term min(dist(f_d, f_track), σ_max) / σ_max normalizes the distance with σ_max as its upper limit.

There are some differences in network structure details when various backbone networks are adopted for comprehensive experiments. As described in Sec. 3.1, with a deeper backbone (ResNet-50 or ResNet-101) the numbers of convolutional layers in the pose estimation head and the Re-ID head are 8 and 4, respectively; with a smaller backbone (ResNet-18 or MobileNet-v2), they are 4 and 2.
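Eqs. (1) and (2) translate directly into code. In this sketch the function names are ours, and σ_max is not reported in the paper, so its default value here is only an illustrative placeholder:

```python
import math

def reid_is_valid(keypoint_conf, gamma_valid=0.2, theta_valid=10):
    """Eq. (1): count keypoints whose confidence exceeds gamma_valid
    and trust the Re-ID feature only when enough joints are visible.
    Thresholds default to the values given in the Inference section."""
    n_valid = sum(1 for c in keypoint_conf if c > gamma_valid)
    return n_valid > theta_valid

def similarity(iou, f_d, f_track, theta_pos=0.5, sigma_max=2.0):
    """Eq. (2) as written in the paper: a weighted combination of IoU
    and the clipped, normalized Euclidean Re-ID distance.
    sigma_max=2.0 is an assumed placeholder, not the authors' value."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_d, f_track)))
    return theta_pos * iou + (1 - theta_pos) * min(dist, sigma_max) / sigma_max
```

A tracklet's appearance feature f_track would only be updated with f_d when `reid_is_valid` returns True for the matched detection.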
Training:
The MTN needs three types of annotations corresponding to its three head networks: bounding box annotation, keypoint annotation, and human ID annotation. The training process for the pose tracking task is therefore conducted on five datasets. The COCO [34] dataset provides bounding box and keypoint annotations. The MPII [3] and PoseTrack [2] datasets are utilized for training the pose estimation task. Person search datasets, namely SSM [51] and PRW [62], are used for training the person Re-ID task. Image-centric training is adopted; for each image, the losses of unrelated tasks are not propagated back.

[s_l, s_u] is set to [16, 560] when SIFP is implemented. Only objects whose √(wh) falls in [16, 560] are used for training in the image pyramid, whose scaling factors are 2.0, 1.5, 1.0, and 0.75.

Inference:
At test time, for each frame of a video, the number of proposals provided by RPN is 1000, as in [24]. The human detection branch runs on these proposals. After non-maximum suppression, the highest-scoring 100 detection boxes are fed into the pose estimation and person Re-ID branches to obtain the heatmaps of K keypoints and a 128-d Re-ID feature for each human box. After the inference of MTN, all human boxes with their corresponding poses and Re-ID features are fed into our occlusion-aware tracking framework for articulated multi-person pose tracking in videos. We adopt a pose tracking strategy similar to [21]. In the pose tracking strategy, γ_valid is set to 0.2, θ_valid is set to 10, and θ_pos is set to 0.5.
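For illustration, one simple way to turn the similarity metric S into detection-to-tracklet assignments is greedy matching, as in the greedy linking used by prior trackers the paper cites; the authors' exact matching procedure may differ:

```python
def greedy_match(sim, min_sim=0.1):
    """Greedily assign detections to tracklets from a similarity matrix.

    `sim[t][d]` is the similarity S between tracklet t and detection d.
    Pairs are taken highest-similarity first; `min_sim` is an assumed
    cutoff below which no assignment is made.
    """
    # enumerate all (similarity, tracklet, detection) candidates, best first
    cand = sorted(
        ((s, t, d) for t, row in enumerate(sim) for d, s in enumerate(row)),
        reverse=True,
    )
    pairs, used_t, used_d = [], set(), set()
    for s, t, d in cand:
        if s < min_sim:
            break                       # remaining candidates are even weaker
        if t not in used_t and d not in used_d:
            pairs.append((t, d))
            used_t.add(t)
            used_d.add(d)
    return pairs
```

Unmatched detections would then start new tracklets, and unmatched tracklets age out after a number of frames.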
4. Experiments
In this section, we perform thorough ablation experiments for both pose estimation and pose tracking tasks and compare our FastPose framework with state-of-the-art methods on the PoseTrack [2] dataset. In all experiments, the pose tracking task is evaluated on the PoseTrack dataset, while the pose estimation task is evaluated on the 5k validation images (minival) of the COCO [34] dataset and on the PoseTrack dataset.
4.1. Ablation Experiments

Extensive ablations are performed to analyze our approach, covering different backbone architectures, different pose tracking strategies, and the SIFP paradigm.
Backbone Architecture and SIFP for Mask R-CNN:
Table 1(a) shows our scale-normalized SIFP paradigm applied to pose estimation with various backbones. AP^kp is the main metric of pose estimation on the COCO dataset. A deeper backbone gives better performance: the AP^kp increase is 7.4 from ResNet-18 to ResNet-50, while from ResNet-50 to ResNet-101 we obtain a small 0.8 improvement as FLOPs increase by almost 30%. We therefore adopt ResNet-50-FPN as the backbone for the ablation studies in Table 1(c)-(d). Utilizing SIFP, AP^kp improves from 55.6 to 57.9, from 57.7 to 60.1, from 65.1 to 67.5, and from 66.0 to 68.3 with the MobileNet-V2, ResNet-18, ResNet-50, and ResNet-101 backbones, respectively. All architectures improve their pose estimation performance by using SIFP.

Backbone Architecture and SIFP for FastPose:
As shown in Table 1(b), our proposed FastPose also shows steady improvement with deeper backbone models. mAP and MOTA are the two main metrics on the PoseTrack dataset. Using MobileNet-v2 or ResNet-18 as the backbone, FastPose achieves real-time pose tracking. Note that our inference speed does not grow with the number of detected people, making the method much more scalable to various scenes. Although FastPose-MobileNet-v2 has lower metrics (62.1 mAP and 55.6 MOTA) than FastPose-18 (63.1 and 56.8), its properties make it particularly suitable for mobile applications. With SIFP, the mAP increases are 1.2, 1.1, 0.7, and 0.8, and the MOTA increases are 3.5, 2.9, 2.7, and 2.6 on the listed backbones, respectively. This shows that SIFP stably improves pose estimation and tracking performance on the PoseTrack dataset. The pose estimation performance of FastPose on the COCO dataset is reported in the supplementary material due to the page limit.
Occlusion-aware Re-ID feature:
Table 1(c) shows the effectiveness of the Re-ID feature. Replacing IoU with the Re-ID feature reduces ID switches by 41.6 (from 243.1 to 201.5). Our proposed occlusion-aware strategy brings a more remarkable improvement, reducing ID switches from 243.1 to 153.9 (37%). Besides, we evaluate MTN on the person Re-ID dataset SSM, where it obtains 89.38 mAP on the SSM test set, suggesting that extracting Re-ID features in MTN is feasible. This branch actually has a straightforward structure; more complex designs have the potential to improve performance but are not the focus of this work.
SIFP without/with FPN:
Table 1(d) illustrates the results of combining SIFP with FPN versus utilizing SIFP without FPN. The first row is the baseline, which adopts ResNet-50-FPN as the backbone without SIFP. In the second row, utilizing SIFP without FPN amounts to using SNIP's training strategy; this method introduces improvements of 0.9 AP^kp for pose estimation, and 0.4 mAP and 1.0 MOTA for pose tracking. In the third row, SIFP obtains improvements of 2.4, 0.7, and 2.8, all beyond the second row. Hence, our SIFP exploits FPN together with SNIP's training strategy to improve pose estimation and tracking performance with single-scale testing.

backbone | AP^bb_person | AP^kp | AP^kp_50 | AP^kp_75 | AP^kp_M | AP^kp_L | Param(MB) | FLOPs(GB) | speed
MobileNet-V2-FPN | 41.7 | 55.6 | 79.1 | 59.5 | 47.6 | 66.9 | 22.73 | 33.6 | 32.5
+SIFP | 43.9 | 57.9 | 81.1 | 62.1 | 51.4 | 67.8 | | |
ResNet-18-FPN | 43.1 | 57.7 | 80.4 | 62.0 | 49.1 | 70.1 | 32.38 | 63.4 | 32.7
+SIFP | 45.3 | 60.1 | 82.1 | 64.3 | 53.4 | 70.9 | | |
ResNet-50-FPN | 49.3 | 65.1 | 85.0 | 71.1 | 58.2 | 75.1 | 51.78 | 109.8 | 13.1
+SIFP | 52.9 | 67.5 | 85.8 | 73.6 | 62.4 | 75.8 | | |
ResNet-101-FPN | 50.8 | 66.0 | 85.6 | 72.0 | 59.5 | 75.2 | 67.48 | 147.8 | 9.1
+SIFP | 53.9 | 68.3 | 86.5 | 74.4 | 63.2 | 76.4 | | |
(a)
Backbone Architecture and SIFP for Mask R-CNN: pose estimation results of Mask R-CNN without/with SIFP on different backbones. Among all reported metrics, AP^kp is the main metric of pose estimation on the COCO dataset. We also report the inference speed of Mask R-CNN with different backbones on the COCO dataset.

backbone | mAP | MOTA | Param(MB) | FLOPs(GB) | speed
MobileNet-V2-FPN | 60.9 | 52.1 | 32.73 | 38.2 | 28.6
+SIFP | 62.1 | 55.6 | | |
ResNet-18-FPN | 62.0 | 53.9 | 42.38 | 68.0 | 29.4
+SIFP | 63.1 | 56.8 | | |
ResNet-50-FPN | 69.0 | 60.1 | 62.98 | 116.8 | 12.2
+SIFP | 69.7 | 62.8 | | |
ResNet-101-FPN | 69.5 | 60.6 | 78.68 | 154.8 | 8.7
+SIFP | 70.3 | 63.2 | | |
(b)

Backbone Architecture and SIFP for FastPose: pose tracking results of FastPose without/with SIFP on different backbones. Among all reported metrics, mAP and MOTA are the two main metrics on the PoseTrack dataset. We also report the inference speed of FastPose with different backbones on the PoseTrack dataset.

Strategy | mAP | MOTA | FP | FN | IDS
IoU-only | | | | |
Re-ID features | +0.3 | - | - | - | -41.6
occlusion-aware | +0.6 | - | - | - | -89.2
(c)
Pose tracking strategy: pose tracking results of three pose tracking strategies, all based on MTN-ResNet-50. The first strategy only utilizes IoU as the similarity metric between persons; the second uses the Re-ID feature; the third is our proposed strategy, which uses the occlusion-aware Re-ID feature. FP and FN are the numbers of false positive and false negative detected persons; all rows have the same FP and FN because their input comes from one MTN. IDS denotes ID switches.

Method | Mask R-CNN: AP^bb_person | AP^kp | AP^kp_50 | AP^kp_75 | AP^kp_M | AP^kp_L | FastPose: mAP | MOTA
ResNet-50-FPN | 49.3 | 65.1 | 85.0 | 71.1 | 58.2 | 75.1 | 69.0 | 60.1
ResNet-50 + SNIP [44] training | 51.1 | 66.2 | 85.5 | 72.1 | 60.8 | 75.4 | 69.4 | 61.1
ResNet-50-FPN + SIFP | 52.9 | 67.5 | 85.8 | 73.6 | 62.4 | 75.8 | 69.7 | 62.8
(d)

SIFP without/with FPN: results of utilizing SIFP with the ResNet-50/ResNet-50-FPN backbone. The baseline is ResNet-50-FPN without SIFP. The second method utilizes the SNIP training strategy but tests on a single scale, i.e., SIFP without FPN. The third is the full SIFP. Metrics of pose estimation and pose tracking are reported simultaneously.

Method | Mask R-CNN: AP^bb_person | AP^kp | AP^kp_50 | AP^kp_75 | AP^kp_M | AP^kp_L | FastPose: mAP | MOTA
ResNet-50-FPN | 49.3 | 65.1 | 85.0 | 71.1 | 58.2 | 75.1 | 69.0 | 60.1
ResNet-50-FPN + MS training & testing | 50.2 | 65.8 | 85.4 | 72.0 | 59.2 | 75.4 | 69.3 | 60.7
ResNet-50-FPN + SIFP | 52.9 | 67.5 | 85.8 | 73.6 | 62.4 | 75.8 | 69.7 | 62.8
(e)

SIFP v.s. MST: results of comparing SIFP with multi-scale training/testing. The baseline is ResNet-50-FPN without SIFP. Metrics of pose estimation and pose tracking are reported simultaneously.
Table 1. Ablations. Pose estimation is achieved by Mask R-CNN; pose estimation and tracking in videos is achieved by MTN (FastPose). Mask R-CNN is trained on COCO train and tested on COCO minival. MTN is trained on COCO train, MPII train, PoseTrack train, SSM train, and PRW train, and tested on PoseTrack val.

SIFP v.s. MST:
Multi-scale training and testing (MST) is another way to tackle the scale variation problem. In Table 1(e), we compare SIFP with MST. In the multi-scale training process, the size of each training image is randomly scaled to one of 7 scales ((608, 1333), (640, 1333), (672, 1333), (704, 1333), (736, 1333), (768, 1333), (800, 1333)). The multi-scale testing result is a combination of the testing results at the same 7 scales. MST makes inference 7 times slower, which is the price of the improved metrics. It is worth noting that SIFP has the same inference time as the baseline, yet increases AP_bb(person), AP_kp, mAP, and MOTA by 3.6, 2.4, 0.7, and 2.7 respectively, all larger gains than those from MST.

We compare our FastPose framework with the state-of-the-art methods on the PoseTrack dataset [2], including PoseTrack [2], JointFlow [16], PoseFlow [53], Detect-and-Track [21], and FlowTrack [50].

Table 2 reports the results of pose estimation on the PoseTrack dataset. Our FastPose-101 obtains an mAP of 70.3 on val, which outperforms most methods except FlowTrack. However, FlowTrack is a two-stage top-down method and has a significantly slower inference speed. Also using ResNet-101 as backbone, Detect-and-Track is almost 10 points behind FastPose-101 on mAP while its inference speed is only about a tenth of ours. Other methods, whether top-down or bottom-up, all have lower mAP and slower speed than FastPose-50 or FastPose-101.

Method              Type                Detector  Set    mAP (Total)  Speed (FPS)
JointFlow [16]      Bottom-up           -         val    66.7         0.2
PoseFlow [53]       Top-down (2-stage)  SSD-512   val    69.3         3.0
FlowTrack-50 [50]   Top-down (2-stage)  FPN-DCN   val    72.9         3.2
FlowTrack-152 [50]  Top-down (2-stage)  FPN-DCN   test   63.3         0.2
PoseTrack [2]       Bottom-up           -         test   59.4         -
PoseFlow [53]       Top-down (2-stage)  SSD-512   test   59.6         0.8
FastPose-18 (ours)  Top-down (end-end)  -         test   -            -

Table 2. Multi-person Pose Estimation Performance on PoseTrack dataset. The per-joint mAP columns (Head, Shoulder, Elbow, Wrist, Hip, Knee, Ankle) and several rows were not recovered; only the surviving entries are shown above.
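For concreteness, the 7-scale testing used by the MST baseline above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_model` is a hypothetical single-scale inference call, and averaging is one simple way to combine per-scale score maps.

```python
import numpy as np

# The 7 (shorter-side, longer-side cap) test scales used by the MST baseline.
SCALES = [(608, 1333), (640, 1333), (672, 1333), (704, 1333),
          (736, 1333), (768, 1333), (800, 1333)]

def resize_factor(h, w, short, long_cap):
    # Scale so the short side reaches `short`, unless the long side
    # would then exceed `long_cap`.
    return min(short / min(h, w), long_cap / max(h, w))

def multi_scale_test(image, run_model, out_shape):
    # Run inference at every scale and average the resulting score maps.
    # `run_model(image, factor)` is assumed to return a score map already
    # resampled to `out_shape`.
    acc = np.zeros(out_shape, dtype=np.float64)
    for short, cap in SCALES:
        f = resize_factor(image.shape[0], image.shape[1], short, cap)
        acc += run_model(image, f)
    return acc / len(SCALES)  # 7 forward passes, hence roughly 7x the latency
```

The loop makes the 7x inference cost explicit: each test image triggers seven forward passes, whereas SIFP keeps a single pass.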
Table 3. Multi-person Pose Tracking Performance on PoseTrack dataset. (Columns: Method, Type, Detector, test set, MOTA/MOTP/Prec/Rec totals, and speed in FPS; the table body, including the JointFlow [16] rows on val and test, was not recovered.)

Table 3 reports the results of pose tracking on the PoseTrack dataset. Our FastPose also achieves competitive MOTA. On PoseTrack val, only FlowTrack-152 with Flow reaches a MOTA of 65.4, higher than the 63.2 of our FastPose-101. But its slower detector FPN-DCN and the optical flow estimation take much of the inference time, so the speed of FlowTrack-152 is only 0.2 FPS. Although it uses Flow and adopts FPN-DCN as the human detector, FlowTrack-50 achieves a MOTA of 62.9, which is nearly matched by our FastPose-50 with a MOTA of 62.8. On PoseTrack test, FastPose-50 and FastPose-101 achieve MOTA of 56.6 and 57.4, which are close to the state-of-the-art performance.
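The three association strategies compared in the ablations (IoU-only, Re-ID features, and occlusion-aware Re-ID) can be sketched as a single similarity function. This is an illustrative reconstruction rather than the exact implementation: the dictionary fields and the way the inferred occlusion flag gates the appearance term are assumptions.

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2); intersection-over-union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def similarity(track, det, use_reid=True, occlusion_aware=True):
    # IoU-only baseline; optionally add an appearance term.
    s = iou(track["box"], det["box"])
    if use_reid:
        occluded = occlusion_aware and det["occluded"]
        if not occluded:  # trust the Re-ID feature only when the person is visible
            f1, f2 = track["feat"], det["feat"]
            s += float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))
    return s
```

The design point the ablation makes is the gating itself: an occluded person yields a corrupted appearance embedding, so falling back to pure IoU for such detections is what reduces ID switches relative to always-on Re-ID matching.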
Timing
The last column of Tables 2 and 3 shows the inference speed of the compared methods. The speed is measured with FastPose implemented in MXNet [11] on an Intel Xeon E5-2620 @2.4GHz and an NVIDIA TITAN X GPU. The inference time of FastPose comes from two parts: the MTN and the tracking strategy. The inference speed of the tracking strategy is 66.7 FPS. The speed of the MTN mainly depends on two metrics of its architecture, the number of parameters (Param) and the FLOPs, calculated with the resolution of the testing image set to 600 ×
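As a sanity check on these figures: when the two parts run strictly sequentially per frame, the per-frame times add, so stage and end-to-end speeds are related as sketched below (an assumption about the pipeline, not a statement from the paper).

```python
def pipeline_fps(*stage_fps):
    # Stages run sequentially per frame, so per-frame times add up.
    return 1.0 / sum(1.0 / f for f in stage_fps)

def stage_fps(total_fps, other_fps):
    # Recover one stage's speed from the end-to-end and remaining-stage speeds.
    return 1.0 / (1.0 / total_fps - 1.0 / other_fps)
```

With the tracking strategy at 66.7 FPS and an overall 29.4 FPS for FastPose with ResNet-18, this would put the MTN alone at roughly 52-53 FPS under the sequential assumption.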
5. Conclusion
In this paper, we present FastPose, a fast and unified pose estimation and tracking framework, which utilizes a multi-task network (MTN) to integrate three tasks. An occlusion-aware strategy following the MTN performs pose tracking. Besides, a paradigm named Scale-normalized Image and Feature Pyramid (SIFP) is designed to deal with the severe scale variation that widely exists in unified pose approaches. In ablation studies, we demonstrate the stable improvements brought by the MTN, SIFP, and the occlusion-aware strategy. Moreover, with different configurations, FastPose can achieve real-time inference or competitive performance, which helps bring pose tracking to practical scenarios.
References

[1] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing. RCA Engineer, 29(6):33–41, 1984.
[2] M. Andriluka, U. Iqbal, A. Milan, E. Insafutdinov, L. Pishchulin, J. Gall, and B. Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.
[3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693, 2014.
[4] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1014–1021, 2009.
[5] S.-H. Bae and K.-J. Yoon. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):595–610, 2018.
[6] E. Bochinski, V. Eiselein, and T. Sikora. High-speed tracking-by-detection without using image information. Pages 1–6, 2017.
[7] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354–370, 2016.
[8] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[9] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
[11] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[12] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun. Cascaded pyramid network for multi-person pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[13] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, pages 160–167, 2008.
[14] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[15] L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603, 2013.
[16] A. Doering, U. Iqbal, and J. Gall. Joint flow: Temporal flow fields for multi person tracking. In British Machine Vision Conference, 2018.
[17] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu. RMPE: Regional multi-person pose estimation. In IEEE International Conference on Computer Vision, pages 2334–2343, 2017.
[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[19] W. Feng, Z. Hu, W. Wu, J. Yan, and W. Ouyang. Multi-object tracking with multiple cues and switcher-aware classification. arXiv preprint arXiv:1901.06129, 2019.
[20] J. Ghosn and Y. Bengio. Multi-task learning for stock selection. In Advances in Neural Information Processing Systems, pages 946–952, 1997.
[21] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 350–359, 2018.
[22] R. Girshick. Fast R-CNN. In IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[23] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu. Scale-aware face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6186–6195, 2017.
[24] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, 2017.
[25] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[26] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. ArtTrack: Articulated multi-person tracking in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[27] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50, 2016.
[28] U. Iqbal, A. Milan, and J. Gall. PoseTrack: Joint multi-person pose estimation and tracking. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[29] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.
[30] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015.
[31] P. Li, J. Zhang, Z. Zhu, Y. Li, L. Jiang, and G. Huang. State-aware re-identification feature for multi-target multi-camera tracking. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[32] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang. Attention-guided unified network for panoptic segmentation. arXiv preprint arXiv:1812.03904, 2018.
[33] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
[35] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis. SSH: Single stage headless face detector. In IEEE International Conference on Computer Vision, pages 4875–4884, 2017.
[36] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499, 2016.
[37] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards accurate multi-person pose estimation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[38] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4929–4937, 2016.
[39] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3178–3185, 2012.
[40] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[41] M. R. Ronchi and P. Perona. Benchmarking and error diagnosis in multi-instance pose estimation. In IEEE International Conference on Computer Vision, pages 369–378, 2017.
[42] S. Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.
[43] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[44] B. Singh and L. S. Davis. An analysis of scale invariance in object detection - SNIP. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018.
[45] B. Singh, M. Najibi, and L. S. Davis. SNIPER: Efficient multi-scale training. In Advances in Neural Information Processing Systems, pages 9333–9343, 2018.
[46] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.
[47] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[48] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, 2016.
[49] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing, pages 3645–3649, 2017.
[50] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, 2018.
[51] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. End-to-end deep learning for person search. arXiv preprint arXiv:1604.01850, 2016.
[52] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3415–3424, 2017.
[53] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu. Pose flow: Efficient online pose tracking. In British Machine Vision Conference, 2018.
[54] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2137, 2016.
[55] S. Yang, Y. Xiong, C. C. Loy, and X. Tang. Face detection through scale-friendly deep convolutional networks. arXiv preprint arXiv:1706.02863, 2017.
[56] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In IEEE International Conference on Computer Vision, 2017.
[57] F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan. POI: Multiple object tracking with high performance detection and appearance feature. In European Conference on Computer Vision, pages 36–42, 2016.
[58] C. Zhang and Z. Zhang. A survey of recent advances in face detection. 2010.
[59] R. Zhang, Z. Zhu, P. Li, R. Wu, C. Guo, G. Huang, and H. Xia. Exploiting offset-guided network for pose estimation and tracking. arXiv preprint arXiv:1906.01344, 2019.
[60] Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
[61] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108, 2014.
[62] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian. Person re-identification in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1367–1376, 2017.
[63] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
[64] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M.-H. Yang. Online multi-object tracking with dual matching attention networks. In European Conference on Computer Vision, pages 366–382, 2018.
[65] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In