MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network
Muhammed Kocabas, M. Salih Karagoz, Emre Akbas
Department of Computer Engineering, Middle East Technical University
{muhammed.kocabas,e234299,eakbas}@metu.edu.tr

Abstract.
In this paper, we present MultiPoseNet, a novel bottom-up multi-person pose estimation architecture that combines a multi-task model with a novel assignment method. MultiPoseNet can jointly handle person detection, keypoint detection, person segmentation and pose estimation problems. The novel assignment method is implemented by the Pose Residual Network (PRN), which receives keypoint and person detections and produces accurate poses by assigning keypoints to person instances. On the COCO keypoints dataset, our pose estimation method outperforms all previous bottom-up methods both in accuracy (+4-point mAP over the previous best result) and speed; it also performs on par with the best top-down methods while being at least 4x faster. Our method is the fastest real-time system, with ∼23 frames/sec. Source code is available at: https://github.com/mkocabas/pose-residual-network
Keywords: Multi-Task Learning, Multi-Person Pose Estimation, Semantic Segmentation, MultiPoseNet, Pose Residual Network
This work is aimed at estimating the two-dimensional (2D) poses of multiple people in a given image. Any solution to this problem has to tackle a few sub-problems: detecting body joints (or keypoints, as they are called in the influential COCO [1] dataset) such as wrists, ankles, etc., grouping these joints into person instances, or detecting people and assigning joints to person instances. (We use “body joint” and “keypoint” interchangeably throughout the paper.) Depending on which sub-problem is tackled first, there have been two major approaches in multi-person 2D estimation: bottom-up and top-down. Bottom-up methods [2–8] first detect body joints without having any knowledge as to the number of people or their locations. Next, detected joints are grouped to form individual poses for person instances. On the other hand, top-down methods [9–12] start by detecting people first and then, for each person detection, a single-person pose estimation method (e.g. [13–16]) is executed. Single-person pose estimation, i.e. detecting body joints conditioned on the information that there is a single person in the given input (the top-down approach), is typically a more costly process than grouping the detected joints (the bottom-up approach). Consequently, the top-down methods tend to be slower than the bottom-up methods, since they need to repeat the single-person pose estimation for each person detection; however, they usually yield better accuracy than bottom-up methods.
Fig. 1. MultiPoseNet is a multi-task learning architecture capable of performing human keypoint estimation, detection and semantic segmentation tasks altogether efficiently. [Figure labels: Backbone with two FPNs; Keypoint Subnet; Person Detection Subnet (anchors, cls, reg); Pose Residual Net.]
In this paper, we present a new bottom-up method for multi-person 2D pose estimation. Our method is based on a multi-task learning model which can jointly handle the person detection, keypoint detection, person segmentation and pose estimation problems. To emphasize the multi-person and multi-task aspects of our model, we named it “MultiPoseNet.” Our model (Fig. 1) consists of a shared backbone for feature extraction, detection subnets for keypoint and person detection/segmentation, and a final network which carries out the pose estimation, i.e. assigning detected keypoints to person instances.

Our major contribution lies in the pose estimation step, where the network implements a novel assignment method. This network receives keypoint and person detections, and produces a pose for each detected person by assigning keypoints to person boxes using a learned function. In order to put our contribution into context, here we briefly describe the relevant aspects of the state-of-the-art (SOTA) bottom-up methods [2, 8]. These methods attempt to group detected keypoints by exploiting lower order relations either between the group and keypoints, or among the keypoints themselves. Specifically, Cao et al. [2] model pairwise relations (called part affinity fields) between two nearby joints, and the grouping is achieved by propagating these pairwise affinities. In the other SOTA method, Newell et al. [8] predict a real number called a tag per detected keypoint, in order to identify the group the detection belongs to. Hence, this model makes use of the unary relations between a certain keypoint and the group it belongs to. Our method generalizes these two approaches in the sense that we achieve the grouping in a single shot by considering all joints together at the same time. We name the part of our model which achieves the grouping the Pose Residual Network (PRN) (Fig. 2). PRN takes region-of-interest (RoI) pooled keypoint detections and feeds them into a residual multilayer perceptron (MLP). PRN considers all joints simultaneously and learns configurations of joints. We illustrate this capability of PRN by plotting a sample set of learned configurations (Fig. 2, right).

Fig. 2. Left: Pose Residual Network (PRN). The PRN is able to disambiguate which keypoint should be assigned to the current person box. Right: Six sample poses obtained via clustering the structures learned by PRN.

Our experiments (on the COCO dataset, using no external data) show that our method outperforms all previous bottom-up methods: we achieve a 4-point mAP increase over the previous best result. Our method performs on par with the best performing top-down methods while being an order of magnitude faster than them. To the best of our knowledge, there are only two top-down methods that we could not outperform. Given the fact that bottom-up methods have always performed less accurately than the top-down methods, our results are remarkable.

In terms of running time, our method appears to be the fastest of all multi-person 2D pose estimation methods. Depending on the number of people in the input image, our method runs at between 27 frames/sec (FPS) (for one person detection) and 15 FPS (for 20 person detections). For a typical COCO image, which contains ∼3 people on average, we achieve ∼23 FPS (Fig. 8).

Our contributions in this work are fourfold. (1) We propose the Pose Residual Network (PRN), a simple yet very effective method for the problem of assigning/grouping body joints. (2) We outperform all previous bottom-up methods and achieve comparable performance with top-down methods. (3) Our method works faster than all previous methods, in real time at ∼23 frames/sec. (4) Our network architecture is extendible; we show that using the same backbone, one can solve other related problems, too, e.g. person segmentation.
Single person pose estimation is to predict individual body parts given a cropped person image (or, equivalently, given its exact location and scale within an image). Early methods (prior to deep learning) used hand-crafted HOG features [17] to detect body parts and probabilistic graphical models to represent the pose structure (tree-based [18–21]; non-tree based [22, 23]).

Deep neural network based models [13, 14, 16, 19, 24–29] have quickly dominated the pose estimation problem after the initial work by Toshev et al. [24], who used the AlexNet architecture to directly regress spatial joint coordinates. Tompson et al. [25] learned pose structure by combining deep features along with graphical models. Carreira et al. [26] proposed the Iterative Error Feedback method to train Convolutional Neural Networks (CNNs), where the input is repeatedly fed to the network along with current predictions in order to refine the predictions. Wei et al. [13] were inspired by the pose machines [30] and used CNNs as feature extractors in pose machines. Hourglass (HG) blocks, developed by Newell et al. [14], are basically convolution-deconvolution structures with residual connections. Newell et al. stacked HG blocks to obtain an iterative refinement process and showed its effectiveness on single person pose estimation. Stacked Hourglass (SHG) based methods made a remarkable performance increase over previous results. Chu et al. [27] proposed adding visual attention units to focus on keypoint regions of interest. Pyramid residual modules by Yang et al. [19] improved the SHG architecture to handle scale variations. Lifshitz et al. [28] used a probabilistic keypoint voting scheme from image locations to obtain agreement maps for each body part. Belagiannis et al. [29] introduced a simple recurrent neural network based prediction refinement architecture. Huang et al. [16] developed a coarse-to-fine model with the Inception-v2 [31] network as the backbone. The authors calculated the loss at each level of the network to learn coarser to finer representations of parts.
Multi person pose estimation solutions branched out as bottom-up and top-down methods. Bottom-up approaches detect body joints and assign them to people instances; therefore they are faster in test time and smaller in size compared to top-down approaches. However, they miss the opportunity to zoom into the details of each person instance. This creates an accuracy gap between top-down and bottom-up approaches.

In an earlier work, Ladicky et al. [32] proposed an algorithm to jointly predict human part segmentations and part locations using HOG-based features and a probabilistic approach. Gkioxari et al. [33] proposed k-poselets to jointly detect people and keypoints.

Most of the recent approaches use Convolutional Neural Networks (CNNs) to detect body parts and relationships between them in an end-to-end manner [2–4, 8, 18, 34], then use assignment algorithms [2–4, 34] to form individual skeletons.

Pishchulin et al. [3] used deep features for joint prediction of part locations and relations between them, then performed correlation clustering. Even though [3] doesn’t use person detections, it is very slow due to the proposed clustering algorithm; processing time is in the order of hours. In a following work, Insafutdinov et al. [4] benefited from deeper ResNet architectures as part detectors and improved the parsing efficiency of the previous approach with an incremental optimization strategy. Different from Pishchulin and Insafutdinov, Iqbal et al. [35] proposed to solve the densely connected graphical model locally, thus improving time efficiency significantly.

Cao et al. [2] built a model that contains two entangled CPM [13] branches to predict keypoint heatmaps and pairwise relationships (part affinity fields) between them. Keypoints are grouped together with a fast Hungarian bipartite matching algorithm according to the conformity of part affinity fields between them. This model runs in real time. Newell et al. [8] extended their SHG idea by outputting associative vector embeddings, which can be thought of as tags representing each keypoint’s group. They group keypoints with similar tags into individual people.
Top-down
Top-down methods first detect people (typically using a top performing, off-the-shelf object detector) and then run a single person pose estimation (SPPN) method per person to get the final pose predictions. Since an SPPN model is run for each person instance, top-down methods are extremely slow; however, each pose estimator can focus on an instance and perform fine localization. Papandreou et al. [10] used ResNet with dilated convolutions [36], which has been very successful in semantic segmentation [37], to compute keypoint heatmap and offset outputs. In contrast to Gaussian heatmaps, the authors estimated disk-shaped keypoint masks and 2-D offset vector fields to accurately localize keypoints. Xia et al. [38] proposed joint part segmentation and keypoint detection given human detections. The authors used separate PoseFCN and PartFCN networks to obtain both part masks and locations and fused them with fully-connected CRFs. This provides more consistent predictions by eliminating irrelevant detections. Fang et al. [12] proposed to use spatial transformer networks to handle inaccurate bounding boxes and used stacked hourglass blocks [14]. He et al. [11] combined instance segmentation and keypoint prediction in their Mask-RCNN model. They append keypoint heads on top of RoI aligned feature maps to get a one-hot mask for each keypoint. Chen et al. [9] developed globalnet on top of Feature Pyramid Networks [39] for multiscale inference and refined the predictions by using hyper-features [40].
The architecture of our proposed model, MultiPoseNet, can be found in Fig. 1. In the following, we describe each component in detail.
The backbone of MultiPoseNet serves as a feature extractor for the keypoint and person detection subnets. It is a ResNet [36] with two Feature Pyramid Networks (FPN) [39] (one for the keypoint subnet, the other for the person detection subnet) connected to it. FPN creates pyramidal feature maps with top-down connections from all levels of the CNN’s feature hierarchy to make use of the inherent multi-scale representations of a CNN feature extractor. By doing so, FPN combines high resolution, weak representations with low resolution, strong representations. The powerful localization and classification properties of FPN have recently proved to be very successful in detection, segmentation and keypoint tasks [9, 11, 39, 41]. In our model, we extracted features from the last residual blocks $C_2, C_3, C_4, C_5$ with strides of (4, 8, 16, 32) pixels and computed the corresponding FPN features per subnet.

The keypoint estimation subnet (Fig. 3) takes hierarchical CNN features (outputted by the corresponding FPN) and outputs keypoint and segmentation heatmaps. Heatmaps represent keypoint locations as Gaussian peaks. Each heatmap layer belongs to a specific keypoint class (nose, wrists, ankles etc.) and contains an arbitrary number of peaks that pertain to person instances. The person segmentation mask at the last layer of heatmaps encodes the pixelwise spatial layout of people in the image.

Fig. 3. The architecture of the keypoint subnet. It takes hierarchical CNN features as input and outputs keypoint and segmentation heatmaps. [Figure labels: C and K feature levels; feature depths d = 256, d = 128, d = 512; prediction; FPN; loss.]
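As a rough illustration of the Gaussian-peak encoding described above, the following sketch renders one heatmap channel from a list of keypoint coordinates. The function name `gaussian_heatmap`, the value of `sigma`, and merging overlapping peaks with a per-pixel maximum are assumptions made for this example; the text does not specify them.

```python
import math

def gaussian_heatmap(width, height, peaks, sigma=2.0):
    """Render one keypoint channel with a 2D Gaussian peak at every (x, y) in `peaks`.

    Overlapping peaks are merged with a per-pixel maximum (an assumption).
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for px, py in peaks:
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                if value > heatmap[y][x]:
                    heatmap[y][x] = value
    return heatmap

# Two people in the image produce two peaks in the same channel (e.g. "nose").
nose_channel = gaussian_heatmap(16, 12, peaks=[(3, 4), (11, 7)])
```

A single channel can therefore hold several peaks, one per person, which is why a separate grouping step is needed afterwards.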
A set of features specific to the keypoint detection task is computed similarly to [39], with top-down and lateral connections from the bottom-up pathway. The $K_2$–$K_5$ features have the same spatial size as the corresponding $C_2$–$C_5$ blocks, but the depth is reduced to 256. The $K$ features are identical to the $P$ features in the original FPN paper, but we denote them with $K$ to distinguish them from the person detection subnet layers. The depth of these features is downsized to 128 with 2 subsequent $3 \times 3$ convolutions to obtain the $D_2, D_3, D_4, D_5$ layers. Since the $D$ features still have different strides, we upsampled $D_3, D_4, D_5$ accordingly to match the 4-pixel stride of the $D_2$ features and concatenated them into a single depth-512 feature map. The concatenated features are smoothed by a $3 \times 3$ convolution. The final heatmap, which has $(K + 1)$ layers, is obtained via $1 \times 1$ convolutions and is masked with a mask $W$, which has $W(p) = 0$ in the area of the persons without annotation. $K$ is the number of human keypoints annotated in a dataset and $+1$ is the person segmentation mask. In addition to the loss applied at the last layer, we append a loss at each level of the $K$ features to benefit from intermediate supervision. Semantic person segmentation masks are predicted in the same way as keypoints.

Modern object detectors are classified as one-stage (SSD [42], YOLO [43], RetinaNet [41]) or two-stage (Fast R-CNN [44], Faster R-CNN [45]) detectors. One-stage detectors enable faster inference but have lower accuracy in comparison to two-stage detectors due to foreground-background class imbalance. The recently proposed RetinaNet [41] model improved one-stage detectors’ performance with focal loss, which can handle the class imbalance problem during training. In order to design a faster and simpler person detection model which is compatible with the FPN backbone, we have adopted RetinaNet. The same strategies to compute anchors, losses and pyramidal image features are followed. Classification and regression heads are modified to handle only person annotations.
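For reference, the focal loss adopted from RetinaNet [41] down-weights well-classified examples so that the many easy background anchors do not dominate training. Below is a minimal per-anchor sketch for the binary person/background case. The function name and argument layout are illustrative, and the formula is the standard one from [41], not code taken from this work.

```python
import math

def focal_loss(p, is_person, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one anchor.

    p         -- predicted probability that the anchor contains a person
    is_person -- ground-truth label of the anchor
    """
    p = min(max(p, eps), 1.0 - eps)                 # avoid log(0)
    p_t = p if is_person else 1.0 - p               # probability of the true class
    alpha_t = alpha if is_person else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy background anchor contributes far less than a badly missed person.
easy_background = focal_loss(0.05, is_person=False)
hard_person = focal_loss(0.05, is_person=True)
```

With `gamma = 0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is a convenient sanity check.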
Assigning keypoint detections to person instances (bounding boxes, in our case) is straightforward if there is only one person in the bounding box, as in Fig. 4 a-b. However, it becomes non-trivial if there are overlapping people in a single box, as in Fig. 4 c-d. In the case of an overlap, a bounding box can contain multiple keypoints not related to the person in question, which creates ambiguity in constructing the final pose predictions. We solve these ambiguities by learning pose structures from data.

Fig. 4. Bounding box overlapping scenarios (panels a–d).
The input to PRN is prepared as follows. For each person box that the person detection subnet detected, the region from the keypoint detection subnet’s output corresponding to the box is cropped and resized to a fixed size, which ensures that PRN can handle person detections of arbitrary sizes and shapes. Specifically, let $X$ denote the input to the PRN, where $X = \{x_1, x_2, \ldots, x_K\}$, in which $x_k \in \mathbb{R}^{W \times H}$ and $K$ is the number of different keypoint types. The final goal of PRN is to output $Y = \{y_1, y_2, \ldots, y_K\}$, in which $y_k \in \mathbb{R}^{W \times H}$ is of the same size as $x_k$, containing the correct position for each keypoint indicated by a peak in that keypoint’s channel. PRN models the mapping from $X$ to $Y$ as

$$y_k = \phi_k(X) + x_k \qquad (1)$$

where the functions $\phi_1(\cdot), \ldots, \phi_K(\cdot)$ apply a residual correction to the pose in $X$, hence the name pose residual network. We implement Eq. 1 using a residual multilayer perceptron (Fig. 2). The activation of the output layer uses softmax to obtain a proper probability distribution, and binary cross-entropy loss is used during training.

Before we came up with this residual model, we experimented with two naive baselines and a non-residual model. In the first baseline method, which we call Max, for each keypoint channel $k$ we find the location with the highest value and place a Gaussian in the corresponding location of the $k$th channel in $Y$. In the second baseline method, we compute $Y$ as

$$y_k = x_k * P_k \qquad (2)$$

where $P_k$ is a prior map for the location of the $k$th joint, learned from ground-truth data, and $*$ is element-wise multiplication. We named this method Unary Conditional Relationship (UCR). Finally, in our non-residual model, we implemented

$$y_k = \phi_k(X). \qquad (3)$$

Performances of all these models can be found in Table 3.

In the context of the models described above, both SOTA bottom-up methods learn lower order grouping models than the PRN. Cao et al. [2] model pairwise channels in $X$, while Newell et al. [8] model only unary channels in $X$. Hence, our model can be considered as a generalization of these lower order grouping models.

We hypothesize that each node in PRN’s hidden layer encodes a certain body configuration. To show this, we visualized some of the representative outputs of PRN in Fig. 2. These poses are obtained by reshaping PRN outputs and selecting the maximum activated keypoints to form skeletons. All obtained configurations are clustered using $k$-means with OKS (object keypoint similarity) [1] as the distance metric, and the cluster means are visualized in Fig. 2.

Due to different convergence times and loss imbalance, we have trained the keypoint and person detection tasks separately. To use the same backbone in both tasks, we first trained the model with only the keypoint subnet (Fig. 3). Thereafter, we froze the backbone parameters and trained the person detection subnet. Since the two tasks are semantically similar, person detection results were not adversely affected by the frozen backbone.

We have utilized the Tensorflow [46] and Keras [47] deep learning libraries to implement training and testing procedures. For person detection, we made use of the open-source Keras RetinaNet [48] implementation.
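The residual mapping of Eq. 1 can be illustrated with a small, dependency-free forward pass. This is only a sketch under stated assumptions: the helper names, the single ReLU hidden layer, the tiny layer sizes, the random untrained weights, and applying the softmax separately to each keypoint channel are choices made for the example, not details confirmed by the text.

```python
import math
import random

def dense(vec, weights, bias):
    """Fully connected layer; `weights` has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softmax(vec):
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]

def prn_forward(x_channels, w1, b1, w2, b2):
    """Compute y_k = phi_k(X) + x_k, then a softmax over each channel (assumed)."""
    flat = [v for channel in x_channels for v in channel]  # X as one vector
    hidden = [max(0.0, h) for h in dense(flat, w1, b1)]    # hidden layer with ReLU (assumed)
    phi = dense(hidden, w2, b2)                            # phi(X), same size as X
    summed = [p + x for p, x in zip(phi, flat)]            # residual connection
    size = len(x_channels[0])
    return [softmax(summed[k * size:(k + 1) * size]) for k in range(len(x_channels))]

# Toy sizes: 3 keypoint channels, 6 pixels per RoI channel, 8 hidden nodes.
random.seed(0)
K, ROI_PIXELS, HIDDEN = 3, 6, 8
n = K * ROI_PIXELS
w1 = [[random.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(n)]
b2 = [0.0] * n
X = [[random.random() for _ in range(ROI_PIXELS)] for _ in range(K)]
Y = prn_forward(X, w1, b1, w2, b2)
```

Because every hidden unit sees all channels of `X` at once, the correction for one keypoint can depend on where the other keypoints are. With a zero correction, the output simply keeps the strongest input peak of each channel.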
Keypoint Estimation Subnet:
For keypoint training, we used 480x480 image patches that are centered around the crowd or the main person in the scene. Random rotations between ±40 degrees, random scaling between 0.[…] […] $L_2$ loss, and we masked (ignored) people that are not annotated. We appended the segmentation masks to the ground truth as an extra layer and trained them along with the keypoint heatmaps. The cost function that we minimize is

$$L_{kp} = W \cdot \lVert H_t - H_p \rVert_2^2, \qquad (4)$$

where $H_t$ and $H_p$ are the ground-truth and predicted heatmaps respectively, and $W$ is the mask used to ignore non-annotated person instances.

Person Detection Subnet:
We followed a person detection training strategy similar to [41]. Images containing persons are used; they are resized such that the shorter edge is 800 pixels. We froze the backbone weights after keypoint training and did not update them during person detection training. We optimized the subnet with Adam [50], starting from a learning rate of 1e-5, which is decreased by a factor of 0.1 on plateaux. We used focal loss with ($\gamma = 2$, $\alpha = 0.25$) and smooth $L_1$ loss for classification and bbox regression, respectively. We obtained final proposals using NMS with a threshold of 0.3.

Pose Residual Network:
During training, we cropped input and output pairs and resized heatmaps according to bounding-box proposals. All crops are resized to a fixed size of 36 × 56 (height/width = 1.56). We trained the PRN network separately, using the Adam optimizer [50] with a learning rate of 1e-4. Since the model is shallow, convergence takes approximately 1.5 hours.

We trained the model with the person instances which have more than 2 keypoints. We utilized a sort of curriculum learning [51] by sorting annotations based on the number of keypoints and bounding box areas. In each epoch, the model starts by learning easy-to-predict instances; hard examples are given in later stages.
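The curriculum ordering described above could be sketched as follows. The text does not state the sort direction, so treating instances with more annotated keypoints and larger boxes as "easy" is an assumption. The dictionary fields `num_keypoints` and `bbox_area` and the function name are likewise hypothetical.

```python
def curriculum_order(annotations, min_keypoints=3):
    """Drop instances with too few keypoints, then order them from easy to hard.

    "Easy" is assumed to mean many annotated keypoints and a large box.
    """
    usable = [a for a in annotations if a["num_keypoints"] >= min_keypoints]
    return sorted(usable, key=lambda a: (-a["num_keypoints"], -a["bbox_area"]))

annotations = [
    {"id": 1, "num_keypoints": 5, "bbox_area": 900.0},
    {"id": 2, "num_keypoints": 17, "bbox_area": 4000.0},
    {"id": 3, "num_keypoints": 2, "bbox_area": 12000.0},   # filtered out
    {"id": 4, "num_keypoints": 17, "bbox_area": 9000.0},
]
epoch_order = [a["id"] for a in curriculum_order(annotations)]
```

The same ordering would be reused at the start of every epoch, so each epoch begins with the well-annotated, large instances.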
Inference
The whole architecture (see Fig. 1) behaves as a monolithic, end-to-end model during test time. First, an image ($W \times H \times 3$) is processed through the backbone model to extract features at multiple scales. The person and keypoint detection subnets compute outputs simultaneously from the extracted features. Keypoints are outputted as $W \times H \times (K + 1)$ sized heatmaps, where $K$ is the number of keypoint channels and $+1$ is for the segmentation channel. Person detections are in the form of $N \times 5$, where $N$ is the number of people and the 5 channels correspond to 4 bounding box coordinates along with a confidence score. Keypoint heatmaps are cropped and resized to form RoIs according to person detections. The optimal RoI size was determined as $36 \times 56 \times (K + 1)$ in our experiments. PRN takes each RoI as a separate input, then outputs a same-size RoI with only one keypoint selected in each layer of the heatmap. All selected keypoints are grouped as a person instance.

We trained our keypoint and person detection models on the COCO keypoints dataset [1] (without using any external/extra data) in our experiments. We used COCO for evaluating keypoint and person detection; however, we used PASCAL VOC 2012 [52] for evaluating person segmentation due to the lack of semantic segmentation annotations in COCO. Backbone models (ResNet-50 and ResNet-101) were pretrained on ImageNet and we finetuned them with COCO-keypoints.

The COCO train2017 split contains 64K images including 260K person instances, 150K of which have keypoint annotations. Keypoints of persons with small area are not annotated in COCO. We did ablation experiments on the COCO val2017 split, which contains 2693 images with person instances. We made comparisons to previous methods on the test-dev2017 split, which has 20K test images. We evaluated test-dev2017 results on the online COCO evaluation server. We use the official COCO evaluation metrics, average precision (AP) and average recall (AR). OKS and IoU based scores were used for keypoint and person detection tasks, respectively.

We performed person segmentation evaluation on the PASCAL VOC 2012 test split with the PASCAL IoU metric. The PASCAL VOC 2012 person segmentation test split contains 1456 images. We obtained “Test results” using the online evaluation server.
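Since OKS plays the role for keypoints that IoU plays for boxes, a short sketch of the standard COCO definition may help. The function below follows the publicly documented formula (a Gaussian falloff of the squared distance, scaled by the object area and a per-keypoint constant, averaged over labeled keypoints). The function name, argument layout and the example `sigmas` are illustrative assumptions; COCO publishes its own per-keypoint constants.

```python
import math

def oks(pred, gt, visible, area, sigmas, eps=2.220446049250313e-16):
    """Object keypoint similarity between one predicted and one ground-truth pose.

    pred, gt -- lists of (x, y) coordinates, one per keypoint type
    visible  -- ground-truth visibility flags; keypoints with flag <= 0 are ignored
    area     -- ground-truth object area in pixels (the scale term)
    sigmas   -- per-keypoint falloff constants
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2.0 * sigma) ** 2
        total += math.exp(-d2 / (2.0 * (area + eps) * k2))
        count += 1
    return total / count if count else 0.0

gt_pose = [(10.0, 10.0), (20.0, 20.0), (0.0, 0.0)]
pred_pose = [(13.0, 14.0), (20.0, 20.0), (50.0, 50.0)]
score = oks(pred_pose, gt_pose, visible=[2, 2, 0], area=100.0, sigmas=[0.1, 0.1, 0.1])
```

A pose counts as a true positive at a given OKS threshold (0.50 to 0.95) in the same way a box does at an IoU threshold, and AP and AR are then computed from those matches.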
Fig. 5. Precision-recall curves on COCO validation set across scales all, large and medium. Analysis tool is provided by [53]. [Figure: three panels plotting precision against recall (maxDets: 20), one curve per OKS threshold. Legend values per panel:
areaRng large: OKS 0.50: .873, 0.55: .863, 0.60: .863, 0.65: .861, 0.70: .850, 0.75: .837, 0.80: .802, 0.85: .752, 0.90: .639, 0.95: .393
areaRng medium: OKS 0.50: .790, 0.55: .790, 0.60: .779, 0.65: .768, 0.70: .747, 0.75: .715, 0.80: .671, 0.85: .593, 0.90: .472, 0.95: .196
areaRng all: OKS 0.50: .821, 0.55: .821, 0.60: .811, 0.65: .800, 0.70: .789, 0.75: .765, 0.80: .721, 0.85: .654, 0.90: .537, 0.95: .281]
We present the recall-precision curves of our method for different scales (all, large, medium) in Fig. 5. The overall AP results of our method along with top-performing bottom-up (BU) and top-down (TD) methods are given in Table 1. MultiPoseNet outperforms all bottom-up methods and most of the top-down methods. We outperform the previously best bottom-up method [8] by a 4-point increase in mAP. In addition, the runtime speed (see the FPS column of Table 1 and Fig. 8) of our system is far better than previous methods, with 23 FPS on average (see footnote 1). This demonstrates the effectiveness of PRN for assignment and of our multitask detection approach, while providing a reasonable speed-accuracy tradeoff. To get these results (Table 1) on test-dev, we have utilized test time augmentation and ensembling (as also done in all previous studies). Multi-scale and multi-crop testing was performed during test time data augmentation. Two different backbones and a single person pose refinement network similar to our keypoint detection model were used for ensembling. Results from different models are gathered and redundant detections are removed via OKS based NMS [10].

During ablation experiments we have inspected the effect of different backbones, keypoint detection architectures, and PRN designs. Tables 2 and 3 show the ablation analysis results on the COCO validation set.

Different Backbones
We used ResNet models [36] as the shared backbone to extract features. Table 2 shows the impact of deeper features and dilated features. R101 improves the result by 1.6 mAP over R50. Dilated convolutions [37], which are very successful in dense detection tasks, increase accuracy by 2 mAP over the R50 architecture. However, dilated convolutional filters add more computational complexity and consequently hinder realtime performance. We showed that concatenation of $K$ features and intermediate supervision (see Section 3.2 for explanations) are crucial for good performance. The results demonstrated that the performance of our system can be further enhanced with stronger feature extractors like the recent ResNext [54] architectures.

Footnote 1: We obtained the FPS results by averaging the inference time using images containing 3 people (avg. number of person annotations per image in COCO dataset) on a GTX1080Ti GPU. Except for CFN and Mask RCNN, we obtained the FPS numbers by running the models ourselves under equal conditions. CFN’s code is not available, and Mask RCNN’s code was only made available recently and we did not have time to test it ourselves. We took CFN’s and Mask RCNN’s FPS from their respective papers.

Footnote 2: COCO-only results of this entry were obtained from this talk at the Joint Workshop of the COCO and Places Challenges at ICCV 2017.

Table 1. Results on COCO test-dev, excluding systems trained with external data. Top-down methods are shown separately to make a clear comparison between bottom-up methods.

     Method              FPS  AP    AP50  AP75  APM   APL   AR    AR50  AR75  ARM   ARL
BU   Ours                23   …     …     …     …     …     …     …     …     …     …
BU   Newell et al. [8]   6    65.5  …     …     …     …     …     …     …     …     …
     …
TD   … [10]              -    66.9  86.4  73.6  64.0  72.0  71.6  89.2  77.6  66.1  79.1
TD   G-RMI-2016 [10]     -    60.5  82.2  66.2  57.6  66.6  66.2  86.6  71.4  61.9  72.2

Table 2. Left: Comparison of different keypoint models. Right: Performance of different backbone architectures. (no concat: no concatenation, no int: no intermediate supervision, dil: dilated, concat: concatenation)

Models                        AP    AP50  AP75  APM   APL
R101 no int. no concat dil    …

Backbones   AP    AP50  AP75  APM   APL
R50         62.3  86.2  71.9  57.7  70.4
R101        63.9  87.1  73.2  58.1  72.2
R101 dil    …     …     …     …     …
Different Keypoint Architectures
Keypoint estimation requires dense prediction over spatial locations, so its performance is dependent on input and output resolution. In our experiments, we used 480 × 480 images as inputs and outputted 120 × 120 × (K + 1) heatmaps per input. $K$ is equal to 17 for the COCO dataset. Lower resolutions harmed the mAP results, while higher resolutions yielded longer training and inference times. We have listed the results of different keypoint models in Table 2.

The intermediate loss, which is appended to the outputs of the $K$ blocks, enhanced the precision significantly. Intermediate supervision acts as a refinement process among the hierarchies of features. As previously shown in [2, 13, 14], it is an essential strategy in most of the dense detection tasks.

We have applied a final loss to the concatenated $D$ features, which are downsized from the $K$ features. This additional stage allowed us to combine multi-level features and compress them into a uniform space while extracting more semantic features. This strategy brought a 2 mAP gain in our experiments.

Pose Residual Network Design
PRN is a simple yet effective assignment strategy, and is designed for faster inference while giving reasonable accuracy. To design an accurate model we have tried different configurations. Different PRN models and corresponding results can be seen in Table 3. These results indicate the scores obtained from the assignment of ground truth person bounding boxes and keypoints.

Table 3. Left: Performance of different PRN models on COCO validation set. N: nodes, D: dropout and R: residual connection. Right: Ablation experiments of PRN with COCO validation data.

PRN Models                 AP    AP50  AP75  APM   APL
…

PRN Ablations              AP    AP50  AP75  APM   APL
Both GT                    89.4  97.1  91.2  87.9  91.8
GT keypoints + Our bbox    75.3  82.1  78    70.1  84.5
Our keypoints + GT bbox    65.1  89.2  76.2  60.3  74.7
PRN                        64.3  88.2  75    59.6  73.9
UCR                        49.7  59.5  52.4  44.1  51.6
Max                        45.3  55.1  48.8  40.6  46.9
We started with a primitive model, a single hidden-layer MLP with 50 nodes, and added more nodes, regularization and different connection types to balance speed and accuracy. We found that an MLP with 1024 nodes, dropout with 0.5 probability and a residual connection between input and output boosts the PRN performance up to 89.4 mAP (Table 3).

Table 4. PRN assignment results with non-grouped keypoints obtained from two bottom-up methods.

Models               AP    AP50  AP75  APM   APL
Cao et al. [2]       58.4  81.5  62.6  …     …
Newell et al. [8]    56.9  80.8  61.3  49.9  …
PRN + [8]            …     …     …     …     …
In the ablation analysis of PRN (see Table 3), we compared the Max, UCR and PRN implementations (see Section 3.4 for descriptions), along with the performance of PRN with ground truth detections. We found that lower order grouping methods could not handle overlapping detections; both of them performed poorly. As we hypothesized, PRN could overcome ambiguities by learning meaningful pose structures (Fig. 2 (right)) and improved the results by ∼20 mAP over naive assignment techniques. We evaluated the impact of the keypoint and person subnets on the final results by replacing the inputs of PRN with ground truth detections. With ground truth keypoints and our person detections, we got 75.3 mAP, which shows that there is large room for improvement in the keypoint localization part. With our keypoints and ground truth person detections, we got 65.1 mAP. This can be interpreted as our person detection subnet performing quite well. With both ground truth detections, we got 89.4 mAP, which is a good indicator of PRN performance. In addition to these experiments, we tested PRN on the keypoints detected by previous SOTA bottom-up models [2, 8]. PRN performed better grouping (see Table 4) than their methods, Part Affinity Fields [2] and Associative Embedding [8], improving both detection results by ∼[…].

We trained the person detection subnet only on COCO person instances by freezing the backbone with keypoint detection parameters. The person category results of our network with different backbones can be seen in Table 5. We compared our results with the original methods that we adopt in our architecture. Our model with both ResNet-50 and ResNet-101 backends outperformed the original implementations. This is not a surprising result, since our network is only dealing with a single class whereas the original implementations handle 80 object classes.
Table 5. Left: Person detection results on COCO dataset. Right: Person segmentation results on PASCAL VOC 2012 test split.
Person Detectors AP AP50 AP75 APS APM APL
Ours - R101 59 71.5
FPN [39] 47.5 78 50.7 28.6 55 67.4
Segmentation IoU
DeepLab v3 [55]
DeepLab v2 [37] 87.4
SegNet [56] 74.9
Ours 87.8
Person segmentation output is an additional layer appended to the keypoint outputs. We obtained the ground truth labels by combining person masks into a single binary mask layer, and we trained segmentation jointly with the keypoint task. Therefore, it adds very little complexity to the model. Evaluation was performed on the PASCAL VOC 2012 test set with the PASCAL IoU metric. We obtained the final segmentation results via multi-scale testing and thresholding. We did not apply any additional test-time augmentation or ensembling. Table 5 shows the test results of our system in comparison with previous successful semantic segmentation algorithms. Our model outperformed most of the successful baseline models such as SegNet [56] and DeepLab v2 [37], and achieved performance comparable to the state-of-the-art DeepLab v3 [55] model. This demonstrates the capacity of our model to handle different tasks together with competitive performance. Some qualitative segmentation results are given in Fig. 6.
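The label construction and the overlap metric described above can be illustrated with a small, dependency-free sketch. It is not the authors' code: the function names `merge_person_masks` and `binary_iou` and the toy masks are hypothetical, and the IoU here is computed on a single image for illustration, whereas the PASCAL VOC benchmark aggregates over the whole test set.

```python
from typing import List

Mask = List[List[int]]  # 2D grid of 0/1 values, one inner list per image row


def merge_person_masks(instance_masks: List[Mask]) -> Mask:
    """Union of per-person masks: a pixel is 1 if any person covers it."""
    if not instance_masks:
        raise ValueError("need at least one instance mask")
    height, width = len(instance_masks[0]), len(instance_masks[0][0])
    merged = [[0] * width for _ in range(height)]
    for mask in instance_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    merged[y][x] = 1
    return merged


def binary_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal size."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (an assumption).
    return inter / union if union else 1.0


# Toy example (hypothetical data): two overlapping person instances.
person_a = [[1, 1, 0, 0],
            [1, 1, 0, 0]]
person_b = [[0, 1, 1, 0],
            [0, 0, 0, 0]]
target = merge_person_masks([person_a, person_b])

# A thresholded prediction that misses one pixel of the second person.
prediction = [[1, 1, 0, 0],
              [1, 1, 0, 0]]
iou = binary_iou(prediction, target)
print(target, iou)
```

In this toy case the merged target covers five pixels and the prediction covers four of them, so the intersection is 4 and the union is 5.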
Our system consists of a backbone, keypoint and person detection subnets, and the Pose Residual Network. The parameter sizes of each block are given in Fig. 7.

Fig. 6. Some qualitative results for COCO test-dev dataset.

Fig. 7. Number of parameters for each block of MultiPoseNet (Backbone, Keypoint Subnet, Person Subnet, PRN). Labels: 42M - R101, 3M.

Fig. 8. Runtime analysis of MultiPoseNet with respect to number of people. Axis: Runtime (ms). Legend: Our, Our Backbone, Top-down, CMU-pose.
Most of the parameters are required to extract features in the backbone network; the subnets and PRN are relatively lightweight networks. Consequently, most of the computation time is spent on the feature extraction stage. By using a shallow feature extractor such as ResNet-50, we can achieve real-time performance. To measure the performance, we built a model using ResNet-50 with 384 × 576 sized inputs which contain 1 to 20 people. We measured the time spent during the inference of 1000 images and averaged the inference times to get a consistent result (see Fig. 8). Keypoint and person detections take 35 ms, while PRN takes 2 ms per instance. So, our model can perform between 27 (1 person) and 15 (20 persons) FPS depending on the number of people.
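The timing figures above can be turned into a rough cost model. The sketch below assumes a strictly additive model (a fixed 35 ms for keypoint and person detection plus 2 ms of PRN time per person); this additive form and the function names are our own simplification, not something the measurements guarantee.

```python
def frame_time_ms(num_people: int,
                  shared_ms: float = 35.0,
                  prn_ms_per_person: float = 2.0) -> float:
    """Estimated time per frame under an additive cost model (assumption)."""
    if num_people < 0:
        raise ValueError("num_people must be non-negative")
    return shared_ms + prn_ms_per_person * num_people


def fps(num_people: int) -> float:
    """Frames per second implied by the estimated per-frame time."""
    return 1000.0 / frame_time_ms(num_people)


for n in (1, 5, 10, 20):
    print(f"{n:2d} people: {frame_time_ms(n):5.1f} ms/frame, {fps(n):4.1f} FPS")
```

For one person this model gives 37 ms per frame, or about 27 FPS, which agrees with the reported figure. For 20 people it gives 75 ms, or roughly 13 FPS, slightly below the reported 15 FPS, so the reported numbers presumably reflect measured timings rather than this simplified formula.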
In this work, we introduced the Pose Residual Network, which accurately assigns keypoints to person detections output by a multi-task learning architecture (MultiPoseNet). Our pose estimation method achieved state-of-the-art performance among bottom-up methods and comparable results with top-down methods. Our method has the fastest inference time compared to previous methods. We showed the assignment performance of the Pose Residual Network through ablation analysis. We demonstrated the representational capacity of our multi-task learning model by jointly producing keypoints, person bounding boxes and person segmentation results.

References
1. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. (2014)
2. Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
3. Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P., Schiele, B.: DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
4. Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schiele, B.: DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In: European Conference on Computer Vision. (2016)
5. Bulat, A., Tzimiropoulos, G.: Human pose estimation via convolutional part heatmap regression. In: European Conference on Computer Vision. (2016)
6. Iqbal, U., Gall, J.: Multi-person pose estimation with local joint-to-person associations. In: European Conference on Computer Vision Workshops. (2016)
7. Ning, G., Zhang, Z., He, Z.: Knowledge-Guided Deep Fractal Neural Networks for Human Pose Estimation. In: IEEE Transactions on Multimedia. (2017)
8. Newell, A., Huang, Z., Deng, J.: Associative Embedding: End-to-End Learning for Joint Detection and Grouping. In: Advances in Neural Information Processing. (2017)
9. Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded Pyramid Network for Multi-Person Pose Estimation. In: arXiv preprint arXiv:1711.07319. (2017)
10. Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards Accurate Multi-person Pose Estimation in the Wild. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
11. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: International Conference on Computer Vision. (2017)
12. Fang, H., Xie, S., Tai, Y., Lu, C.: RMPE: Regional Multi-Person Pose Estimation. In: International Conference on Computer Vision. (2017)
13. Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional Pose Machines. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
14. Newell, A., Yang, K., Deng, J.: Stacked Hourglass Networks for Human Pose Estimation. In: European Conference on Computer Vision. (2016)
15. Chou, C.J., Chien, J.T., Chen, H.T.: Self Adversarial Training for Human Pose Estimation. In: arXiv preprint arXiv:1707.02439. (2017)
16. Huang, S., Gong, M., Tao, D.: A Coarse-Fine Network for Keypoint Localization. In: International Conference on Computer Vision. (2017)
17. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: IEEE Conference on Computer Vision and Pattern Recognition. (2005)
18. Pishchulin, L., Andriluka, M., Gehler, P., Schiele, B.: Poselet conditioned pictorial structures. In: IEEE Conference on Computer Vision and Pattern Recognition. (2013)
19. Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: IEEE Transactions on Pattern Analysis and Machine Intelligence. (2013)
20. Johnson, S., Everingham, M.: Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation. In: British Machine Vision Conference. (2010)
21. Andriluka, M., Roth, S., Schiele, B.: Pictorial Structures Revisited: People Detection and Articulated Pose Estimation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2009)
22. Dantone, M., Gall, J., Leistner, C., Van Gool, L.: Human Pose Estimation Using Body Parts Dependent Joint Regressors. In: IEEE Conference on Computer Vision and Pattern Recognition. (2013)
23. Gkioxari, G., Hariharan, B., Girshick, R., Malik, J.: Using k-poselets for detecting people and localizing their keypoints. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014)
24. Toshev, A., Szegedy, C.: DeepPose: Human Pose Estimation via Deep Neural Networks. In: IEEE Conference on Computer Vision and Pattern Recognition. (2014)
25. Tompson, J., Jain, A., LeCun, Y., Bregler, C.: Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation. In: Advances in Neural Information Processing. (2014)
26. Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human Pose Estimation with Iterative Error Feedback. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
27. Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A.L., Wang, X.: Multi-Context Attention for Human Pose Estimation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
28. Lifshitz, I., Fetaya, E., Ullman, S.: Human Pose Estimation using Deep Consensus Voting. In: European Conference on Computer Vision. (2016)
29. Belagiannis, V., Zisserman, A.: Recurrent Human Pose Estimation. In: International Conference on Automatic Face and Gesture Recognition. (2017)
30. Ramakrishna, V., Munoz, D., Hebert, M., Bagnell, A.J., Sheikh, Y.: Pose machines: Articulated pose estimation via inference machines. In: European Conference on Computer Vision. (2014)
31. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
32. Ladicky, L., Torr, P.H., Zisserman, A.: Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation. In: IEEE Conference on Computer Vision and Pattern Recognition. (2013)
33. Gkioxari, G., Arbelaez, P., Bourdev, L., Malik, J.: Articulated pose estimation using discriminative armlet classifiers. In: IEEE Conference on Computer Vision and Pattern Recognition. (2013)
34. Varadarajan, S., Datta, P., Tickoo, O.: A Greedy Part Assignment Algorithm for Realtime Multi-Person 2D Pose Estimation. In: arXiv preprint arXiv:1708.09182. (2017)
35. Iqbal, U., Milan, A., Gall, J.: PoseTrack: Joint Multi-Person Pose Estimation and Tracking. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
36. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
37. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. In: IEEE Transactions on Pattern Analysis and Machine Intelligence. (2017)
38. Xia, F., Wang, P., Yuille, A., Angeles, L.: Joint Multi-Person Pose Estimation and Semantic Part Segmentation in a Single Image. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
39. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature Pyramid Networks for Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition. (2017)
40. Kong, T., Yao, A., Chen, Y., Sun, F.: Hypernet: Towards accurate region proposal generation and joint object detection. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
41. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: International Conference on Computer Vision. (2017)
42. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European Conference on Computer Vision. (2016)
43. Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection. In: IEEE Conference on Computer Vision and Pattern Recognition. (2016)
44. Girshick, R.: Fast R-CNN. In: International Conference on Computer Vision. (2015)
45. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing. (2015)
46. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015) Software available from tensorflow.org.
47. Chollet, F., et al.: Keras. https://github.com/keras-team/keras (2015)
48. Gaiser, H., de Vries, M., Williamson, A., Henon, Y., Morariu, M., Lacatusu, V., Liscio, E., Fang, W., Clark, M., Sande, M.V., Kocabas, M.: fizyr/keras-retinanet 0.2. https://github.com/fizyr/keras-retinanet