AnimePose: Multi-person 3D pose estimation and animation
Laxman Kumarapu†, Prerana Mukherjee†
†Dept. of Computer Science, Indian Institute of Information Technology, Sri City, India
{laxman.k17, prerana.m}@iiits.in

ABSTRACT
3D animation of humans in action is quite challenging, as it traditionally involves a large setup with several motion trackers placed all over the person's body to track the movements of every limb. This is time-consuming and may cause the person discomfort from wearing exoskeleton body suits with motion sensors. In this work, we present a simple yet effective solution for generating 3D animation of multiple persons from a 2D video using deep learning. Although significant improvement has been achieved recently in 3D human pose estimation, most prior works perform well only for single-person pose estimation, while multi-person pose estimation remains a challenging problem. We propose a supervised multi-person 3D pose estimation and animation framework, named AnimePose, for a given input RGB video sequence. The pipeline of the proposed system consists of the following modules: i) person detection and segmentation, ii) depth map estimation, iii) lifting of 2D to 3D information for person localization, and iv) person trajectory prediction and human pose tracking. Our proposed system produces results comparable to previous state-of-the-art 3D multi-person pose estimation methods on the publicly available MuCo-3DHP and MuPoTS-3D datasets, and it outperforms previous state-of-the-art human pose tracking methods by a significant margin of 11.7% gain in MOTA score on the PoseTrack 2018 dataset.
Index Terms — Multi-person pose estimation, 3D pose tracking, animation, landmark keypoints, 3D IOU.
1. INTRODUCTION
The goal of 3D human pose estimation and animation is to localize semantic keypoints of one or more human bodies in 3D space with texture and surface renditions. It is an essential technique for simulating human interaction with the environment, and 3D pose estimation and tracking is an essential step in action recognition tasks on video sequences. Recently, many methods utilizing deep convolutional neural networks (CNNs) have achieved noticeable performance improvements on large-scale publicly available 3D pose estimation datasets such as the Human3.6M dataset [24].

Prior works in the context of human pose estimation can be categorized as i) 2D multi-person pose estimation, ii) 3D single-person pose estimation, and iii) 3D multi-person pose estimation methods. There are two main approaches to 2D multi-person pose estimation. The first is the top-down approach [9], which employs a human detector that estimates the bounding boxes of humans; each detected human area is cropped and fed into the pose estimation network. The second is the bottom-up approach, which first localizes all human body keypoints in an input image and then groups them into individual persons using clustering techniques [10]. Current 3D single-person pose estimation methods can be categorized into single-stage and two-stage approaches. The single-stage approach directly localizes the 3D body keypoints from the input image [12]. Two-stage methods exploit the high accuracy of 2D human pose estimation: they first localize body keypoints in 2D space and then lift them to 3D space. The two-stage approach has achieved better results, as it involves less complexity than the direct approach. Very few works address the problem of multi-person 3D pose estimation, which can be attributed to the scarcity of large multi-person 3D human pose estimation datasets. However, there are significant contributions: Moon et al. [13] proposed a camera-distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. Their architecture consists of DetectNet, RootNet, and PoseNet. DetectNet detects a human bounding box for each person in the input image. RootNet takes the cropped human image from DetectNet and localizes the root of the human R = (x_R, y_R, Z_R), in which x_R and y_R are pixel coordinates and Z_R is an absolute depth value. The same cropped human image is fed to PoseNet, which estimates the root-relative 3D pose.

In this work, we propose a 3D multi-person pose estimation and animation framework, named AnimePose, for videos. First, we estimate the 2D keypoints of each person in an input frame along with the depth map of that frame, and then perform 3D pose estimation by localizing each person in a 3D environment relative to the other persons with the help of the estimated depth map. Our approach not only works well for 3D multi-person pose estimation but also outperforms previous human pose tracking methods, as we use human localization in 3D space from the additional depth-map information together with human trajectory estimation to track the humans in a video.

In view of the above discussion, the key contributions of our work can be summarized as follows:
• To the best of the authors' knowledge, this is the first work to utilize a depth map for localizing the 3D poses of multiple persons in a 3D environment, ensuring accurate relative 3D pose localization with respect to other objects in the scene.
• We utilize a person trajectory prediction mechanism and introduce a novel metric, 3D IOU (computed with the help of depth maps), that helps us overcome the occlusion problem while predicting 3D poses in a video sequence and enables better human tracking.
• We conduct exhaustive experiments on current 3D multi-person pose estimation methods; our proposed methodology outperforms the current state of the art with an 82.1 3DPCKrel score on the MuPoTS-3D dataset. Our pose tracking method, using 3D human localization via depth maps and person trajectory estimation, outperforms the current state of the art by achieving a total multi-object tracking accuracy (MOTA) score of 60.1% on the PoseTrack 2018 dataset.
2. PROPOSED METHODOLOGY
The goal of our proposed system is to predict the 3D coordinates of the key joints of multiple persons in a given video sequence. To solve this problem, we utilize a holistic top-down approach that consists of i) a person detection network together with multi-person 2D pose estimation and depth map estimation networks, ii) a 2D-to-3D pose uplifting network with relative localization of the estimated 3D key joints using the depth map, and iii) a pose tracking module. Fig. 1 provides an overview of the proposed methodology.
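The modular, frame-wise flow above can be sketched end to end. Every function below is a toy stand-in for the corresponding network, none of this is the authors' code or a real model API; only the data flow between the three stages is illustrated.

```python
import numpy as np

def detect_persons(frame):
    # Stage i: person detection + segmentation (one hard-coded box and mask).
    h, w = frame.shape[:2]
    box = (10, 10, 50, 90)                       # (x_min, y_min, x_max, y_max)
    mask = np.zeros((h, w), dtype=bool)
    mask[10:90, 10:50] = True
    return [box], [mask]

def estimate_depth(frame):
    # Monocular depth estimation (here: a synthetic left-to-right depth ramp).
    h, w = frame.shape[:2]
    return np.tile(np.linspace(1.0, 5.0, w), (h, 1))

def lift_2d_to_3d(pose_2d):
    # Stage ii: 2D -> 3D uplifting (here: appends a dummy depth per joint).
    return np.hstack([pose_2d, np.zeros((pose_2d.shape[0], 1))])

def localize_3d(box, mask, depth):
    # Depth-aware localization: z-extent of the depth map inside the mask.
    x0, y0, x1, y1 = box
    z = depth[mask]
    return (x0, y0, float(z.min()), x1, y1, float(z.max()))

def process_frame(frame, poses_2d):
    # One frame through the pipeline; stage iii (tracking) would consume
    # the returned 3D boxes across frames.
    boxes, masks = detect_persons(frame)
    depth = estimate_depth(frame)
    poses_3d = [lift_2d_to_3d(p) for p in poses_2d]
    boxes_3d = [localize_3d(b, m, depth) for b, m in zip(boxes, masks)]
    return poses_3d, boxes_3d
```

The per-frame output (root-relative 3D poses plus depth-aware 3D boxes) is exactly what the tracking module needs to associate identities across frames.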
We utilize a hybrid task cascade neural network with HRNet (High-Resolution Network) [14] as the backbone for person detection, as it has provided state-of-the-art results on several object detection benchmarks. Next, we utilize the AlphaPose detection network [11], a bottom-up multi-person pose estimation approach. Recently, deep CNNs have shown great progress in depth estimation from a single monocular image; work such as Multi-Scale Local Planar Guidance for Monocular Depth Estimation [15] has shown promising results on this problem. That work follows an encoder-decoder scheme that reduces the feature map resolution to H/8 and then recovers it back to the original resolution H. Unlike existing methods, which recover the original resolution using simple nearest-neighbor upsampling layers and skip connections from the encoding stages, the authors utilize novel local planar guidance (LPG) layers that guide features to full resolution under a local planar assumption, and employ them together to obtain the final depth estimate. We use this method for depth map generation from the input frame. Fig. 2 shows a result obtained by the depth map generation [15].
Significant progress has been achieved in 3D human pose estimation from monocular images thanks to the availability of large-scale datasets [1] and powerful deep learning frameworks. Two-stage approaches first estimate 2D poses and then lift them to 3D poses [1]. These approaches usually generalize better to images in the wild, since the first stage can benefit from state-of-the-art 2D pose estimators, which can be trained on in-the-wild images, while the second stage regresses the 3D locations from the 2D predictions. For example, Martinez et al. [5] proposed a simple fully connected residual network to directly regress 3D coordinates from 2D coordinates. The problem with this two-stage approach is that, while it works well for single-person pose estimation, the network takes one pose at a time and converts it into a 3D pose; in the process, the relative position of one person with respect to the others is lost, so the approach fails in multi-person 3D pose estimation. To address this, we preserve the relative positions by localizing each pose in the 3D environment using depth information. First, we take each pose with its corresponding 2D bounding box coordinates and uplift the 2D pose to a 3D pose individually. Then, using the depth map, the bounding box coordinates, and the segmentation mask, we localize the pose in the 3D environment by inferring the 3D bounding box of the person. This combination of depth map and bounding boxes preserves the relative positions between the poses when we project them into the 3D environment with appropriate scaling. To obtain the 3D bounding box, we take the 2D bounding box and the corresponding 2D segmentation mask of the person.
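The fully connected residual lifting of Martinez et al. [5] referenced above can be sketched in a few lines of numpy. The weights here are random, so this only illustrates the architecture's shape, not a trained model; the joint count J = 17 is an assumed skeleton size.

```python
import numpy as np

J = 17                          # assumed number of body joints
rng = np.random.default_rng(0)  # random (untrained) weights

def relu(x):
    return np.maximum(x, 0.0)

def lift_pose(pose_2d, width=1024):
    # Flatten (J, 2) pixel coordinates into a single input vector.
    x = pose_2d.reshape(-1)
    w_in = rng.normal(0.0, 0.01, (x.size, width))
    h = relu(x @ w_in)                          # input linear layer
    w1 = rng.normal(0.0, 0.01, (width, width))
    w2 = rng.normal(0.0, 0.01, (width, width))
    h = h + relu(relu(h @ w1) @ w2)             # fully connected residual block
    w_out = rng.normal(0.0, 0.01, (width, 3 * J))
    return (h @ w_out).reshape(J, 3)            # root-relative 3D pose (J, 3)
```

Because the regressor consumes one flattened pose at a time, nothing in its input encodes where that person sits relative to others; this is precisely the information the depth-based localization below restores.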
We then look for the depth values Z_min (closest to the camera) and Z_max (farthest from the camera) within the segmentation mask region of the person. If the 2D bounding box coordinates are {(X_min, Y_min), (X_max, Y_max)}, then the corresponding 3D bounding box coordinates are estimated as {(X_min, Y_min, Z_min), (X_max, Y_max, Z_min), (X_min, Y_min, Z_max), (X_max, Y_max, Z_max)}. Fig. 3 shows the 3D localization of multiple persons in a given frame.

Fig. 1. Overall pipeline of the proposed framework for 3D multi-person pose estimation and animation in a frame-wise manner. The animation is generated in the Unity3D environment.

Fig. 2. Input image (left) and the corresponding image depth map (right).
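The depth-aware 3D box construction just described can be sketched directly: the nearest (Z_min) and farthest (Z_max) depth values inside the person's segmentation mask extrude the 2D box into 3D. This is an illustrative implementation, not the authors' code.

```python
import numpy as np

def box_3d(box_2d, mask, depth_map):
    # box_2d: (X_min, Y_min, X_max, Y_max); mask: boolean array over the frame.
    x_min, y_min, x_max, y_max = box_2d
    person_depth = depth_map[mask]            # depth values on the person only
    z_min = float(person_depth.min())         # closest point to the camera
    z_max = float(person_depth.max())         # farthest point from the camera
    # The four corners listed in the text, defining the extruded box.
    return [(x_min, y_min, z_min), (x_max, y_max, z_min),
            (x_min, y_min, z_max), (x_max, y_max, z_max)]
```

Restricting the depth lookup to the segmentation mask (rather than the whole 2D box) keeps background pixels inside the box from corrupting the person's depth range.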
In order to extend the approach to an entire video sequence, we need to keep track of individual poses. Recent work such as efficient multi-person 2D pose tracking with recurrent spatio-temporal affinity fields [21] has achieved promising results on multi-person pose tracking, but these methods are not robust to occlusion and overlapping conditions. To circumvent these problems, along with the procedure in [21], we introduce a solution that utilizes depth map information and temporal person trajectory estimation.

To track a pose, on top of the 2D pose prediction we propose a 3D IOU (intersection over union) metric instead of the traditional 2D IOU; it can be inferred with the help of the bounding box coordinates and the depth map, as shown in Fig. 3. To overcome the occlusion problem when the entire pose of a person goes missing despite being present in previous frames, we perform temporal person trajectory estimation using Scene-LSTM, a model for human trajectory prediction [25], to fill the missing tracks in the occluded frames of a video. We predict the missing poses by taking a set of previous frames, plotting the trajectory of each keypoint, and estimating where each point will be in future frames; we additionally use depth so that we can resolve the overlap of two poses in a 2D frame. We trained our Scene-LSTM on 50% of the PoseTrack dataset. Given a 3D bounding box B in the previous frame and its tracked 3D bounding box B' in the next frame, the 3D IOU (IOU_3D) is formulated as

    IOU_3D = V(B ∩ B') / V(B ∪ B')    (1)

where V(B ∩ B') and V(B ∪ B') denote the volumes of the intersection and union of the two 3D bounding boxes in consecutive frames.

Fig. 3. Input frame (left) and the corresponding 3D localization of the persons obtained with the help of the estimated depth map (right).

Table 1. Sequence-wise 3DPCKrel [17] comparison with state-of-the-art methods on the MuPoTS-3D dataset.

Methods       S1   S2   S3   S4   S5   S6   S7   S8   S9   S10  S11  S12  S13  S14  S15  S16  S17  S18  S19  S20  Avg
Rogez [16]    67.7 49.8 53.4 59.1 67.5 22.8 43.7 49.9 31.1 78.1 50.2 51.0 51.6 49.3 56.2 66.5 65.2 62.9 66.1 59.1 53.8
Mehta [17]    81.0 60.9 64.9 63.0 69.1 30.3 65.0 59.6 64.1 83.9 68.0 68.6 62.3 59.2 70.1 80.0 79.6 67.3 66.6 67.2 66.0
3D MPPE [13]  -
Ours          -

Table 2. Joint-wise 3DPCKrel [17] comparison with state-of-the-art methods on the MuPoTS-3D dataset.

Methods       Hd.  Nck. Sho. Elb. Wri. Hip. Kn.  Ank. Avg
Rogez [16]    49.4 67.4 57.1 51.4 41.3 84.6 56.3 36.3 53.8
Mehta [17]    62.1 81.2 77.9 57.7 47.2 97.3 66.3 47.6 66.0
3D MPPE [13]  79.1 92.6
Ours          79.3 92.8

Table 3. PoseTrack 2018 validation results.

Method                         Wrist-AP Ankles-AP mAP  MOTA
PoseTrack [18]                 54.3     49.2      59.4 48.4
BUTD [19]                      52.9     42.6      59.1 50.6
PoseFlow [20]                  59.0     57.9      63.0 51.0
JointFlow [21]                 53.1     50.4      63.3 53.1
Temporal affinity fields [22]  65.0
Ours                           65.2
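For axis-aligned boxes, the 3D IOU of Eq. (1) reduces to a product of per-axis overlaps divided by the union volume. The sketch below assumes each box is given as (x_min, y_min, z_min, x_max, y_max, z_max); it illustrates the metric, not the authors' implementation.

```python
def iou_3d(a, b):
    """3D IOU of two axis-aligned boxes (x_min, y_min, z_min, x_max, y_max, z_max)."""
    # Overlap extent along each of the three axes (zero when disjoint).
    inter_dims = [max(0.0, min(a[i + 3], b[i + 3]) - max(a[i], b[i]))
                  for i in range(3)]
    inter = inter_dims[0] * inter_dims[1] * inter_dims[2]

    def vol(c):
        return (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])

    # V(B ∪ B') = V(B) + V(B') - V(B ∩ B')
    union = vol(a) + vol(b) - inter
    return inter / union if union > 0 else 0.0
```

Two identical boxes score 1.0, disjoint boxes score 0.0, and partially overlapping boxes fall in between; the extra depth axis is what lets the tracker separate two people whose 2D boxes overlap heavily.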
3. EXPERIMENTAL RESULTS

MuCo-3DHP and MuPoTS-3D datasets: These are the 3D multi-person pose estimation datasets proposed by Mehta et al. [17]. The training set, MuCo-3DHP, is generated by compositing the existing MPI-INF-3DHP 3D single-person pose estimation dataset [23]. The test set, MuPoTS-3D, was captured outdoors and includes 20 real-world scenes with ground-truth 3D poses for up to three subjects. The ground truth is obtained with a multi-view markerless motion capture system. For evaluation, the 3D percentage of correct keypoints (3DPCKrel) and the area under the 3DPCK curve over various thresholds (AUCrel) are used after root alignment with the ground truth.
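The root-aligned 3DPCK evaluation just described can be sketched as follows: a predicted joint counts as correct if, after aligning both poses at the root joint, it lies within a distance threshold of the ground truth (150 mm is the threshold commonly used for MuPoTS-3D). This is an illustration of the metric, not the official evaluation code.

```python
import numpy as np

def pck_3d(pred, gt, root_idx=0, thresh=150.0):
    # pred, gt: (J, 3) joint positions in millimeters.
    pred = pred - pred[root_idx]          # root-align the prediction
    gt = gt - gt[root_idx]                # root-align the ground truth
    dists = np.linalg.norm(pred - gt, axis=1)
    return 100.0 * float(np.mean(dists <= thresh))  # percentage of correct joints
```

Averaging this score over all annotated persons and frames yields the per-sequence and per-joint numbers reported in Tables 1 and 2.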
PoseTrack 2018 dataset:
PoseTrack is a large-scale benchmark for human pose estimation and articulated tracking in video. It provides publicly available training and validation sets as well as an evaluation server for benchmarking on a held-out test set. For evaluation, MOTA (multi-object tracking accuracy) is computed on the poses in a video.
Tables 1 and 2 give the sequence-wise and joint-wise 3DPCKrel score comparisons on the MuPoTS-3D dataset, respectively. Our proposed approach produces results comparable to state-of-the-art techniques on the MuPoTS-3D dataset, achieving an average 3DPCKrel of 82.1. The results on the PoseTrack dataset are provided in Table 3: our approach performs better than current state-of-the-art pose tracking techniques and achieves a MOTA of 60.1% on the PoseTrack 2018 validation set. We also provide results of our 3D multi-person pose estimation and multi-person animation on an input frame using Unity3D in Fig. 4; we are able to effectively localize the persons in 3D and render the animation.
Fig. 4. A multi-person 2D RGB frame and its corresponding 3D animation.
4. CONCLUSION
We proposed a novel and holistic framework for frame-wise 3D multi-person pose estimation and animation in videos. We are able to effectively localize the relative 3D poses in a 3D environment owing to the use of depth information. Further, we proposed a novel framework for pose tracking in which, along with recurrent spatio-temporal affinity fields, we use depth information to compute a novel 3D IOU metric instead of the traditional 2D IOU, and utilize multi-person trajectory estimation to continue tracking a person through occluded frames. The proposed method produces results comparable to previous 3D multi-person pose estimation methods without any ground-truth information, while other methods utilize it during inference.