Coherent Reconstruction of Multiple Humans from a Single Image
Wen Jiang*, Nikos Kolotouros*, Georgios Pavlakos, Xiaowei Zhou†, Kostas Daniilidis
University of Pennsylvania    Zhejiang University
Figure 1: Coherent reconstruction of pose and shape for multiple people. (Panels, left to right: input image, baseline, ours.) Typical top-down regression baselines (center) suffer from predicting people in overlapping positions, or in inconsistent depth orderings. Our approach (right) is trained to respect all these constraints and recover a coherent reconstruction of all the people in the scene in a feedforward manner.
Abstract
In this work, we address the problem of multi-person 3D pose estimation from a single image. A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently. However, this type of prediction suffers from incoherent results, e.g., interpenetration and inconsistent depth ordering between the people in the scene. Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene. To this end, a key design choice is the incorporation of the SMPL parametric body model in our top-down framework, which enables the use of two novel losses. First, a distance field-based collision loss penalizes interpenetration among the reconstructed people. Second, a depth ordering-aware loss reasons about occlusions and promotes a depth ordering of people that leads to a rendering which is consistent with the annotated instance segmentation. This provides depth supervision signals to the network, even if the image has no explicit 3D annotations. The experiments show that our approach outperforms previous methods on standard 3D pose benchmarks, while our proposed losses enable more coherent reconstruction in natural images. The project website with videos, results, and code can be found at: https://jiangwenpl.github.io/multiperson
1. Introduction
Recent work has achieved tremendous progress on the frontier of 3D human analysis tasks. Current approaches have established impressive performance for 3D keypoint estimation [35, 57], 3D shape reconstruction [11, 62], full-body 3D pose and shape recovery [15, 26, 28, 43], or even going beyond that and estimating more detailed and expressive reconstructions [42, 63]. However, as we progress towards more holistic understanding of scenes and the people interacting in them, a crucial step is the coherent 3D reconstruction of multiple people from single images.

Regarding multi-person pose estimation, on one end of the spectrum, we have bottom-up approaches. The works following this paradigm first detect all body joints in the scene and then group them, i.e., assign them to the appropriate person. However, it is not straightforward how bottom-up processing can be extended beyond joints (e.g., used for shape estimation, or mesh recovery). Different from bottom-up, top-down approaches first detect all people in the scene, and then estimate the pose for each one of them. Although they take a hard decision early on (person detection), they typically rely on state-of-the-art methods for person detection and pose estimation, which allows them to achieve very compelling results, particularly in the 2D pose case, e.g., [9, 56, 64]. However, when reasoning about the pose of multiple people in 3D, the problems can be more complicated than in 2D. For example, the reconstructed people can overlap each other in 3D space, or be estimated at depths that are inconsistent with the actual depth ordering, as is demonstrated in Figure 1.

* Equal contribution. † X. Zhou and W. Jiang are affiliated with the State Key Lab of CAD&CG, Zhejiang University. Email: [email protected].
This means that it is crucial to go beyond just predicting a reasonable 3D pose for each person individually, and instead estimate a coherent reconstruction of all the people in the scene. This coherency of the holistic scene is the primary goal of this work. We adopt the typical top-down paradigm, and our aim is to train a deep network that learns to estimate a coherent reconstruction of all the people in the scene.

Figure 2: Overview of the proposed approach. We design an end-to-end framework for 3D pose and shape estimation of multiple people from a single image. An R-CNN-based architecture [19] detects all people in the image and estimates their SMPL parameters [34]. During training we incorporate constraints to promote a coherent reconstruction of all the people in the scene. First, we use an interpenetration loss to avoid people overlapping each other. Second, we apply a depth ordering-aware loss by rendering the meshes of all the people to the image and encouraging the rendered instance segmentation to match with the annotated instance masks.

Starting with a framework that follows the R-CNN pipeline [48], a key decision we make is to use the SMPL parametric model [34] as our representation, and add a SMPL estimation branch to the R-CNN. The mesh representation provided by SMPL allows us to reason about occlusions and interpenetrations, enabling the incorporation of two novel losses towards coherent 3D reconstruction. First, a common problem of predictions from regression networks is that the reconstructed people often overlap each other, since the feedforward nature does not allow for holistic feedback on the potential intersections. To train a network that learns to avoid this type of collision, we introduce an interpenetration loss that penalizes intersections among the reconstructed people. This term requires no annotations and relies on a simple property of natural scenes, i.e., that people cannot intersect each other. Besides collisions, another source of incoherency in the results is that the estimated depths of the meshes do not respect the actual depth ordering of the humans in the scene. Equipped with a mesh representation, we render our holistic scene prediction on the 2D image plane and penalize discrepancies of this rendering from the annotated instance segmentation.
This loss enables reasoning about occlusion, encouraging the depth ordering of the people in the scene to be consistent with the annotated instance masks. Our complete framework (Figure 2) is evaluated on various benchmarks and outperforms previous multi-person 3D pose and shape approaches, while the proposed losses improve the coherency of the holistic result both qualitatively and quantitatively.

To summarize, our main contributions are:
• We present a complete framework for coherent regression of 3D pose and shape for multiple people.
• We train with an interpenetration loss to avoid regressing meshes that intersect each other.
• We train with a depth ordering-aware loss to promote reconstructions that respect the depth ordering of the people in the scene.
• We outperform previous approaches for multi-person 3D pose and shape, while recovering significantly more coherent results.
2. Related work
In this Section, we provide a short description of the prior works that are most relevant to ours.
Single-person 3D pose and shape: Many recent works estimate 3D pose in the form of a skeleton, e.g., [35, 39, 44, 47, 57, 59, 60, 67], or 3D shape in a non-parametric way, e.g., [11, 53, 62]. However, here we focus on full-body pose and shape reconstruction in the form of a mesh, typically using a parametric model, like SMPL [34]. After the early works on the problem [14, 52], the first fully automatic approach, SMPLify, was proposed by Bogo et al. [4]. SMPLify iteratively fits SMPL on the 2D joints detected by a 2D pose estimation network [46]. This optimization approach was later extended in multiple ways; Lassner et al. [31] use silhouettes for the fitting, Varol et al. [62] use voxel occupancy grids, while Pavlakos et al. [42] fit a more expressive parametric model, SMPL-X.

Despite the success of the aforementioned fitting approaches, recently we have observed an increased interest in approaches that regress the pose and shape parameters directly from images, using a deep network for this task. Many works focus on first estimating some form of intermediate representation before regressing SMPL parameters. Pavlakos et al. [45] use keypoints and silhouettes, Omran et al. [41] use semantic part segmentation, Tung et al. [61] append heatmaps for 2D joints to the RGB input, while Kolotouros et al. [29] regress the mesh vertices with a Graph CNN. Regressing SMPL parameters directly from RGB input is more challenging, but it avoids any hand-designed bottleneck. Kanazawa et al. [26] use an adversarial prior to penalize improbable 3D shapes during training. Arnab et al. [3] use temporal context to improve the regression network. Güler et al. [15] incorporate a test-time post-processing based on 2D/3D keypoints and DensePose [16].
Multi-person 3D pose: For the multi-person case, the top-down paradigm is quite popular for 3D pose estimation, since it capitalizes on the success of the R-CNN works [13, 48, 19]. The LCR-Net approaches [50, 51] first detect each person, then classify its pose into a pose cluster, and finally regress an offset for each joint. Dabral et al. [10] first estimate 2D joints inside the bounding box and then regress the 3D pose. Moon et al. [40] contribute a root network to give an estimate of the depth of the root joint. Zanfir et al. [65] rely on scene constraints to iteratively optimize the 3D pose and shape of the people in the scene. Alternatively, there are also approaches that follow the bottom-up paradigm. Mehta et al. [38] propose a formulation based on occlusion-robust pose-maps, where Part Affinity Fields [6] are used for the association problem. Follow-up work [37] improves, among others, the robustness of the system. Finally, Zanfir et al. [66] solve a binary integer linear program to perform skeleton grouping.

In the context of pose and shape estimation in particular, there is a limited number of works that estimate full-body 3D pose and shape for multiple people in the scene. Zanfir et al. [65] optimize the 3D shape of all the people in the image using multiple scene constraints. Our approach draws inspiration from this work and shares the same goal, in the sense of recovering a coherent 3D reconstruction. In contrast to them, instead of optimizing for this coherency at test-time, we train a feedforward regressor and use the scene constraints at training time to encourage it to produce coherent estimates at test-time. Using a feedforward network to estimate pose and shape for multiple people has been proposed by the work of Zanfir et al. [66]. However, in that case, 3D shape is regressed based on 3D joints, which are the output of a bottom-up system.
In contrast, our approach is top-down, and SMPL parameters are regressed directly from pixels, instead of using an intermediate representation, like 3D joints. In fact, it is non-trivial to design a framework for SMPL parameter regression in a bottom-up manner.
Coherency constraints: An important aspect of our work is the incorporation of loss terms that promote coherent 3D reconstruction of the multiple humans. Regarding our interpenetration loss, Bogo et al. [4] and Pavlakos et al. [42] use a relevant objective to avoid self-interpenetrations of the human under consideration. In a spirit more similar to ours, Zanfir et al. [65] use a volume occupancy loss to avoid humans intersecting each other. In different applications, Hasson et al. [18] penalize interpenetrations between the object and the hand that interacts with it, while Hassan et al. [17] penalize interpenetrations between humans and their environment. The majority of the above works use the interpenetration penalty to iteratively refine estimates at test-time. With the exception of [18], our work is the only one that uses an interpenetration term to guide the training of a feedforward regressor and promote collision-free reconstructions at test time.

Regarding our depth ordering-aware loss, we follow the formulation of Chen et al. [8], which was also used in the context of 3D human pose by Pavlakos et al. [44]. In contrast to them, we do not use explicit depth annotations; instead, we leverage the instance segmentation masks to reason about occlusion and, thus, depth ordering. The work of Rhodin et al. [49] is also relevant, where inferring depth ordering is used as an intermediate abstraction for scene decomposition from multiple views. Our work also aims to estimate a coherent depth ordering, but we do so from a single image with the guidance of instance segmentation, while we retain a more explicit human representation in terms of meshes. Finally, using instance segmentation via render-and-compare has also been proposed by Kundu et al. [30]. However, their multi-instance evaluation includes only rigid objects, specifically cars, whereas we investigate the, significantly more complex, non-rigid case.
3. Technical approach
In this Section, we describe the technical approach followed in this work. We start by providing some information about the SMPL model (Subsection 3.1) and the baseline regional architecture we use (Subsection 3.2). Then we describe in detail our proposed losses promoting interpenetration-free reconstruction (Subsection 3.3) and consistent depth ordering (Subsection 3.4). Finally, we provide more implementation details (Subsection 3.5).
3.1. SMPL body model

For the human body representation, we use the SMPL parametric model of the human body [34]. What makes SMPL very appropriate for our work, in comparison with other representations, is that it allows us to reason about occlusion and interpenetration, enabling the use of the novel losses we incorporate in the training of our network. The SMPL model defines a function M(θ, β) that takes as input the pose parameters θ and the shape parameters β, and outputs a mesh M ∈ R^{N_v × 3}, consisting of N_v = 6890 vertices. The model also offers a convenient mapping from mesh vertices to k body joints J, through a linear regressor W, such that joints can be expressed as a linear combination of mesh vertices, J = WM.

Figure 3: Illustration of interpenetration loss. Left: Collision between person i (red) and j (beige). Center: Distance field φ_i for person i. Right: Mesh M_j of person j. The vertices of M_j that collide with person i, i.e., located in non-zero areas of φ_i and visualized with soft red, are penalized by the interpenetration loss.

3.2. Baseline architecture

In terms of the architecture for our approach, we follow the familiar R-CNN framework [48], and use a structure that is most similar to the Mask R-CNN iteration [19]. Our network consists of a backbone (here ResNet-50 [20]), a Region Proposal Network, as well as heads for detection and SMPL parameter regression (SMPL branch). Regarding the SMPL branch, its architecture is similar to the iterative regressor proposed by Kanazawa et al. [26], regressing pose and shape parameters, θ and β respectively, as well as camera parameters π = {s, t_x, t_y}. The camera parameters are predicted per bounding box, but we later update them based on the position of the bounding box in the full image (details in the Sup.Mat.). Although there is no explicit feedback among bounding box predictions, the receptive field of each proposal includes the majority of the scene. Since each bounding box is aware of neighboring people and their poses, it can make an informed pose prediction that is consistent with them.

For our baseline network, the various components are trained jointly in an end-to-end manner. The detection task is trained according to the training procedure of [19], while for the SMPL branch, the training details are similar to the ones proposed by Kanazawa et al. [26]. In the rare cases that 3D ground truth is available, we apply a loss, L_3D, on the SMPL parameters and the 3D keypoints. In the most typical case that only 2D joints are available, we use a 2D reprojection loss, L_2D, to minimize the distance between the ground truth 2D keypoints and the projection of the 3D joints, J, to the image.
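To make the joint regressor J = WM and the weak-perspective camera π = {s, t_x, t_y} concrete, the following NumPy sketch uses a random stand-in for SMPL's learned regressor W and an illustrative joint count; it is not the paper's PyTorch implementation:

```python
import numpy as np

N_V, N_J = 6890, 24  # SMPL vertex count; 24 joints is an illustrative choice

def regress_joints(vertices, W):
    """J = W M: joints as a fixed linear combination of mesh vertices."""
    return W @ vertices  # (N_J, 3)

def weak_perspective_project(points3d, s, tx, ty):
    """Project 3D points to the image with the per-box camera pi = {s, t_x, t_y}."""
    return s * points3d[:, :2] + np.array([tx, ty])

rng = np.random.default_rng(0)
M = rng.standard_normal((N_V, 3))                 # stand-in mesh
W = rng.random((N_J, N_V))
W /= W.sum(axis=1, keepdims=True)                 # rows are convex weights
J = regress_joints(M, W)                          # (24, 3) body joints
j2d = weak_perspective_project(J, s=2.0, tx=0.1, ty=-0.2)  # (24, 2) pixels
```

The projected joints j2d are what the 2D reprojection loss compares against the annotated 2D keypoints.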
Additionally, we also use a discriminator and apply an adversarial prior L_adv on the regressed pose and shape parameters, to encourage the output bodies to lie on the manifold of human bodies. Each of the above losses is applied independently to each proposal, after assigning it to the corresponding ground truth bounding box. More details about the above loss terms and the training of the baseline model are included in the Sup.Mat.

3.3. Interpenetration loss

A critical barrier towards coherent reconstruction of multiple people from a single image is that the regression network can often predict the people to be in overlapping locations. To promote prediction of non-colliding people, we introduce a loss that penalizes interpenetrations among the reconstructed people. Our formulation draws inspiration from [17]. An important difference is that instead of a static scene and a single person, our scene includes multiple people and it is generated in a dynamic way during training.

Let φ be a modified Signed Distance Field (SDF) for the scene that is defined as follows:

φ(x, y, z) = −min(SDF(x, y, z), 0).    (1)

According to the above definition, inside each human, φ takes positive values, proportional to the distance from the surface, while it is simply 0 outside of the human. Typically, φ is defined on a voxel grid of dimensions N_p × N_p × N_p. The naïve generation of a single voxelized representation for the whole scene is definitely possible. However, we often require a very fine voxel grid, which, depending on the extent of the scene, might make processing intractable in terms of memory and computation. One critical observation here is that we can compute a separate φ_i function for each person in the scene, by calculating a tight box around the person and voxelizing it. This allows us to ignore empty scene space that is not covered by any person, and we can instead use a fine spatial resolution to get a detailed voxelization of the body.
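Equation (1) is simple to state in code. A minimal NumPy sketch on a toy grid (the paper uses a custom CUDA implementation for the actual voxelized fields):

```python
import numpy as np

def penetration_field(sdf):
    """Modified SDF of Eq. (1): phi = -min(SDF, 0).

    Since the SDF is negative inside the body, phi is positive inside
    (proportional to the distance from the surface) and exactly 0 outside.
    """
    return -np.minimum(sdf, 0.0)

# Toy 1D "grid": negative SDF values are inside the person.
sdf = np.array([0.5, 0.1, -0.2, -0.7, -0.1, 0.3])
phi = penetration_field(sdf)  # positive only where sdf < 0
```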
Using this formulation, the collision penalty of person j for colliding with person i is defined as:

P_ij = Σ_{v ∈ M_j} φ̃_i(v),    (2)

where φ̃_i(v) samples the φ_i value for each 3D vertex v in a differentiable way from the 3D grid using trilinear interpolation (Figure 3). The φ_i computation for person i is performed by a custom GPU implementation. This computation does not have to be differentiable; φ_i only defines a distance field from which we sample values in a differentiable way. By definition, P_ij is non-negative. It takes the value 0 if there is no collision between person i and j, and increases as the surface vertices of person j penetrate deeper inside person i. In theory, P_ij can be used by itself as an optimization objective for interpenetration avoidance. However, in practice, we observed that it results in very large gradients for the person translation, leading to training instabilities when there are heavy collisions. Instead of the typical term, we use a robust version of this objective. More specifically, our final interpenetration loss for a scene with N people is defined as follows:

L_P = Σ_{j=1}^{N} ρ( Σ_{i=1, i≠j}^{N} P_ij ),    (3)

where ρ is the Geman-McClure robust error function [12]. To avoid penalizing intersections between boxes corresponding to the same person, we use only the most confident box proposal assigned to a ground truth box.

Figure 4: Illustration of depth ordering-aware loss. For an RGB image (first image), we consider the annotated instance segmentation (second image), and the instances based on the rendering of the estimated meshes on the image plane (third image). In case there is a disagreement between the person indices, e.g., for pixel p, where y(p) ≠ ŷ(p), we penalize the corresponding depth estimates at this pixel with an ordinal depth loss. The pixel depths D_{y(p)}(p) and D_{ŷ(p)}(p) are estimated by rendering the depth map independently for each person mesh (fourth and fifth image). This allows gradients to be backpropagated even to the non-visible vertices.

3.4. Depth ordering-aware loss

Besides interpenetration, another common problem in multi-person 3D reconstruction is that people are often estimated in an incorrect depth order. This problem is more evident in cases where people overlap on the 2D image plane. Although it is obvious to the human eye which person is closer (due to the occlusion), the network predictions can still be incoherent. Fixing this depth ordering problem would be easy if we had access to pixel-level depth annotations. However, this type of annotation is rarely available. Our key idea here is that we can leverage the instance segmentation annotations that are often available, e.g., in the large scale COCO dataset [32]. Rendering the meshes of all the reconstructed people on the image plane can indicate the person corresponding to each pixel, and we can optimize based on the agreement with the annotated instance segmentation.

Although this idea sounds straightforward, its realization is more complicated. An obvious implementation would be to use a differentiable renderer, e.g., the Neural Mesh Renderer (NMR) [27], and penalize inconsistencies between the actual instance segmentation and the one produced by rendering the meshes to the image.
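The collision penalty of Eqs. (2) and (3) above can be sketched in NumPy as follows; the function names and the toy grid are ours, and the paper's differentiable GPU version is not reproduced:

```python
import numpy as np

def trilinear_sample(phi, pts):
    """Sample phi (a D x H x W grid) at continuous voxel coordinates
    pts (N, 3) with trilinear interpolation; points outside the grid read 0."""
    D, H, W = phi.shape
    out = np.zeros(len(pts))
    for n, (x, y, z) in enumerate(pts):
        x0, y0, z0 = int(np.floor(x)), int(np.floor(y)), int(np.floor(z))
        dx, dy, dz = x - x0, y - y0, z - z0
        for i in (0, 1):
            for j in (0, 1):
                for k in (0, 1):
                    xi, yj, zk = x0 + i, y0 + j, z0 + k
                    if 0 <= xi < D and 0 <= yj < H and 0 <= zk < W:
                        w = ((dx if i else 1 - dx)
                             * (dy if j else 1 - dy)
                             * (dz if k else 1 - dz))
                        out[n] += w * phi[xi, yj, zk]
    return out

def geman_mcclure(x, sigma=1.0):
    """Geman-McClure robust error rho(x) = x^2 / (x^2 + sigma^2), as in Eq. (3)."""
    return x ** 2 / (x ** 2 + sigma ** 2)

def collision_penalty(phi_i, verts_j):
    """P_ij of Eq. (2): summed field values of person i at the
    (voxel-space) vertex locations of person j's mesh."""
    return trilinear_sample(phi_i, verts_j).sum()

phi = np.zeros((4, 4, 4))
phi[2, 2, 2] = 1.0  # one "inside" voxel of person i
# One vertex of person j collides, one does not.
P = collision_penalty(phi, np.array([[2.0, 2.0, 2.0], [0.0, 0.0, 0.0]]))
```

The robust wrapper geman_mcclure saturates for large penalties, which is what keeps the translation gradients bounded under heavy collisions.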
The practical problem with [27] is that it backpropagates errors only to visible mesh vertices; if there is a depth ordering error, it will not promote the invisible vertices to move closer to the camera. In practice, we observed that this tends to move most people farther away, collapsing our training. Liu et al. [33] attempt to address this problem, but we observed that their softmax operation across the depths can result in vanishing gradients, while we also faced numerical instabilities.

Instead of rendering only the semantic segmentation of the scene, we also render the depth image D_i for each person independently using NMR [27]. Assuming the scene has N people, we assign a unique index i ∈ {1, 2, ..., N} to each one of them. Let y(p) be the person index at pixel location p in the ground truth segmentation, and ŷ(p) be the predicted person index based on the rendering of the 3D meshes. We use 0 to indicate background pixels. If for a pixel p the two estimates indicate a person (no background) and disagree, i.e., y(p) ≠ ŷ(p), then we apply a loss to the depth values of both people for this pixel, y(p) and ŷ(p), to promote the correct depth ordering. The loss we apply is an ordinal depth loss, similar in spirit to [8]. More specifically, the complete loss expression is:

L_D = Σ_{p ∈ S} log(1 + exp(D_{y(p)}(p) − D_{ŷ(p)}(p))),    (4)

where S = {p ∈ I : y(p) > 0, ŷ(p) > 0, y(p) ≠ ŷ(p)} represents the set of pixels of image I where we have depth ordering mistakes (Figure 4). The key detail here is that the loss is backpropagated to the meshes (and eventually the model parameters) of both people, instead of backpropagating gradients only to the visible person, as a conventional differentiable renderer would do. This promotes a more symmetric nature to the loss (and the updates), and eventually makes this loss practical.
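A minimal NumPy sketch of Eq. (4), assuming the ordinal term is the log(1 + exp(·)) ranking loss of [8] and that the per-person depth maps have already been rendered (the paper computes this inside a differentiable renderer, which is not reproduced here):

```python
import numpy as np

def depth_ordering_loss(y, y_hat, depths):
    """Eq. (4): ordinal depth loss over disagreement pixels.

    y, y_hat: (H, W) integer person-index maps (0 = background) for the
              annotation and the rendered prediction, respectively.
    depths:   dict person index -> (H, W) depth map rendered for that
              person alone, so occluded vertices also receive gradients.
    """
    S = (y > 0) & (y_hat > 0) & (y != y_hat)  # depth ordering mistakes
    loss = 0.0
    for p in zip(*np.nonzero(S)):
        d_gt = depths[y[p]][p]        # person that should be visible at p
        d_pred = depths[y_hat[p]][p]  # person wrongly rendered in front
        loss += np.log1p(np.exp(d_gt - d_pred))  # penalize d_gt > d_pred
    return loss

# Toy scene: at pixel (0, 0), person 2 is wrongly rendered in front of person 1.
y = np.array([[1, 0], [0, 2]])
y_hat = np.array([[2, 0], [0, 2]])
depths = {1: np.full((2, 2), 2.0), 2: np.full((2, 2), 1.0)}
L = depth_ordering_loss(y, y_hat, depths)
```

When the maps agree everywhere, S is empty and the loss is zero; otherwise the loss pushes the annotated person's depth below the wrongly-rendered person's depth.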
3.5. Implementation details

Our implementation is done using PyTorch and the publicly available mmdetection library [7]. We resize all input images to 512×832, keeping the same aspect ratio as in the original COCO training. For the baseline model we train only with the losses specified in Subsection 3.2, while for our full model we include in our training the losses proposed in Subsections 3.3 and 3.4. Our training uses 2 1080Ti GPUs and a batch size of 4 images per GPU. For the SDF computation, we reimplemented [54, 55] in CUDA. Voxelizing a single mesh in an N_p × N_p × N_p voxel grid requires about 45ms on a 1080Ti GPU. For efficiency, we perform 3D bounding box checks to detect overlapping 3D bounding boxes, and voxelize only the relevant meshes. Additionally, we reimplemented parts of NMR [27] to make rendering large images more efficient. This allowed us to achieve more than an order of magnitude of speedup, since the forward pass complexity dropped from O(Fwh) to O(F + wh) on average, where F is the number of faces and w and h the image width and height respectively.
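The 3D bounding box pre-check mentioned above is a standard axis-aligned box overlap test; a self-contained sketch (names are ours):

```python
import numpy as np

def aabb_overlap(verts_a, verts_b):
    """True iff the tight axis-aligned 3D boxes of two vertex sets overlap.

    Only overlapping pairs need to be voxelized and scored by the
    collision loss, which skips most person pairs in a typical scene.
    """
    min_a, max_a = verts_a.min(axis=0), verts_a.max(axis=0)
    min_b, max_b = verts_b.min(axis=0), verts_b.max(axis=0)
    # Boxes overlap iff they overlap on every axis.
    return bool(np.all(max_a >= min_b) and np.all(max_b >= min_a))

a = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
b = np.array([[0.5, 0.5, 0.5], [2.0, 2.0, 2.0]])  # overlaps a
c = np.array([[5.0, 5.0, 5.0], [6.0, 6.0, 6.0]])  # far from a
```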
4. Experiments
In this Section, we present the empirical evaluation of our approach. First, we describe the datasets used for training and evaluation (Subsection 4.1). Then, we focus on the quantitative evaluation (Subsections 4.2 and 4.3), and finally we present more qualitative results (Subsection 4.4).

4.1. Datasets
Human3.6M [21]: It is an indoor dataset where a single person is visible in each frame. It provides 3D ground truth for training and evaluation. We use Protocol 2 of [26], where Subjects S1, S5, S6, S7 and S8 are used for training, while Subjects S9 and S11 are used for evaluation.
MuPoTS-3D [38]: It is a multi-person dataset providing 3D ground truth for all the people in the scene. We use this dataset for evaluation, using the same protocol as [38].
Panoptic [24]: It is a dataset with multiple people captured in the Panoptic studio. We use this dataset for evaluation, following the protocol of [65].
MPI-INF-3DHP [36]: It is a single-person dataset with 3D pose ground truth. We use subjects S1 to S8 for training.
PoseTrack [1]: An in-the-wild dataset with 2D pose annotations. It includes multiple frames for each sequence. We use this dataset for training and evaluation.
LSP [22], LSP Extended [23], MPII [2]: In-the-wild datasets with annotations for 2D joints. We use the training sets of these datasets for training.
COCO [32]: An in-the-wild dataset with 2D pose and instance segmentation annotations. We use the 2D joints for training as we do with the other in-the-wild datasets, while the instance segmentation masks are employed for the computation of the depth ordering-aware loss.
4.2. Comparison with the state-of-the-art

For the comparison with the state-of-the-art, as a sanity check, we first evaluate the performance of our approach on a typical single-person baseline. Our goal is always multi-person 3D pose and shape, but we expect our approach to achieve competitive results even in easier settings, i.e., when only one person is in the image. More specifically, we evaluate the performance of our network on the popular Human3.6M dataset [21]. The most relevant approach here is HMR by Kanazawa et al. [26], since we share similar architectural choices (iterative regressor, regression target), training practices (adversarial prior) and training data. The results are presented in Table 1. Our approach outperforms HMR, as well as the approach of Arnab et al. [3], which uses the same network as HMR, but is trained with more data.

Having established that our approach is competitive in the single-person setting, we continue the evaluation with

Method          HMR [26]   Arnab et al. [3]   Ours
Reconst. Error  56.8       54.3               —
Table 1: Results on Human3.6M. The numbers are mean 3D joint errors in mm after Procrustes alignment (Protocol 2). The results of all approaches are obtained from the original papers.
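The Protocol 2 metric of Table 1 (mean per-joint error after Procrustes alignment) can be sketched as follows. This NumPy version follows the standard Umeyama-style similarity alignment and is an illustration, not the authors' exact evaluation code:

```python
import numpy as np

def procrustes_aligned_error(pred, gt):
    """Mean per-joint error (same units as the input) after optimally
    aligning pred to gt with a similarity transform (scale, rotation,
    translation), as in Human3.6M Protocol 2."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g                 # center both point sets
    U, S, Vt = np.linalg.svd(P.T @ G)             # cross-covariance SVD
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflection
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                            # optimal rotation
    s = (S * np.diag(D)).sum() / (P ** 2).sum()   # optimal scale
    aligned = s * P @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()

# Sanity check: a rotated, scaled, translated copy aligns back perfectly.
rng = np.random.default_rng(1)
gt = rng.standard_normal((14, 3))
t = 0.7
Rz = np.array([[np.cos(t), -np.sin(t), 0.0],
               [np.sin(t),  np.cos(t), 0.0],
               [0.0,        0.0,       1.0]])
pred = 1.5 * gt @ Rz.T + np.array([0.2, -0.3, 1.0])
err = procrustes_aligned_error(pred, gt)
```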
Method              Haggling   Mafia   Ultim.   Pizza   Mean
Zanfir et al. [65]  140.0      165.9   150.7    —       —
Zanfir et al. [66]  141.4      152.3   —        —       —
Table 2:
Results on the Panoptic dataset. The numbers are mean per joint position errors after centering the root joint. The results of all approaches are obtained from the original papers.

multi-person baselines. In this case, we consider approaches that also estimate pose and shape for multiple people. The most relevant baselines are the works of Zanfir et al. [65, 66]. We compare with these approaches on the Panoptic dataset [24, 25], using their evaluation protocol (assuming no data from the Panoptic studio are used for training). The full results are reported in Table 2. Our initial network (baseline), trained without our proposed losses, achieves performance comparable with the results reported by the previous works of Zanfir et al. More importantly though, adding the two proposed losses (full) improves performance across all subsequences and overall, while we also outperform the previous baselines. These results demonstrate both the strong performance of our approach in the multi-person setting, as well as the benefit we get from the losses we propose in this work.

Another popular benchmark for multi-person 3D pose estimation is the MuPoTS-3D dataset [38]. Since no multi-person 3D pose and shape approach reports results on this benchmark, we implement two strong top-down baselines, based on state-of-the-art approaches for single-person 3D pose and shape. Specifically, we select a regression approach, HMR [26], and an optimization approach, SMPLify-X [42], and we apply them on detections provided by OpenPose [5] (as is suggested by their public repositories), or by Mask-RCNN [19] (for the case of HMR). The full results are reported in Table 3. As we can see, our baseline model performs comparably to the other approaches, while our full model trained with the proposed losses improves significantly over the baseline. Similarly with the previous results, this experiment further justifies the use of our coherency losses.
Besides this, we also demonstrate that naïve baselines trained with a single person in mind are suboptimal for the multi-person setting of 3D pose. This is different from the 2D case, where a single-person network can perform particularly well in multi-person top-down pipelines as well, e.g., [9, 56, 64]. For the 3D case though, when multiple people are involved, making the network aware of occlusions and interpenetrations during training can actually be beneficial at test-time too.

Method                      All     Matched
OpenPose + SMPLify-X [42]   62.84   68.04
OpenPose + HMR [26]         66.09   70.90
Mask-RCNN + HMR [26]        65.57   68.57
Ours (baseline)             66.95   68.96
Ours (full)                 —       —

Table 3: Results on MuPoTS-3D. The numbers are 3DPCK. We report the overall accuracy (All), and the accuracy only for person annotations matched to a prediction (Matched).

Method               MuPoTS-3D   PoseTrack
Our baseline         114         653
Our baseline + L_P   34          202

Table 4: Ablative for interpenetration loss. The results indicate the number of collisions on MuPoTS-3D and PoseTrack.
4.3. Ablative studies

For this work, our interest in multi-person 3D pose estimation extends beyond just estimating poses that are accurate under the typical 3D pose metrics. Our goal is also to recover a coherent reconstruction of the scene. This is important, because in many cases we can improve the 3D pose metrics, e.g., get a better 3D pose for each detected person, but return incoherent results holistically. For example, the depth ordering of the people might be incorrect, or the reconstructed meshes might be positioned such that they overlap each other. To demonstrate how our proposed losses improve the network predictions under these coherency metrics, even if they are only applied during training, we perform two ablative studies for more detailed evaluation.

First, we expect our interpenetration loss to naturally eliminate most of the overlapping people in our predictions. We evaluate this on MuPoTS-3D and PoseTrack, reporting the number of collisions with and without the interpenetration loss. The results are reported in Table 4. As we expected, we observe a significant decrease in the number of collisions when we train the network with the L_P loss.

Moreover, our depth ordering-aware loss should improve the translation estimates for the people in the scene. Since for monocular methods it is not meaningful to evaluate metric translation estimates, we propose to evaluate only the returned depth ordering. More specifically, we consider all pairs of people in the scene, and we evaluate whether our method predicted the ordinal depth relation for this pair correctly. In the end, we report the percentage of correctly

Method     Moon et al. [40]   Our baseline   Our baseline + L_D
Accuracy   90.85%             92.17%         93.68%
Table 5:
Ablative for depth-ordering-aware loss.
Depth order-ing results on MuPoTS-3D. We consider all pairs of people in theimage, and we evaluate whether the approaches recovered the ordi-nal depth relation between the two people correctly. The numbersare percentages of correctly estimated ordinal depth relations. estimated ordinal relations in Table 5. As expected, thedepth ordering-aware loss improves upon our baseline. Inthe same Table, we also report the results of the approach ofMoon et al . [40] which is the state-of-the-art for 3D skeletonregression. Although [40] is skeleton-based and thus, notdirectly comparable to us, we want to highlight that even astate-of-the-art approach (under 3D pose metric evaluation)can still suffer from incoherency in the results. This pro-vides evidence that we often might overlook the coherencyof the holistic reconstruction, and we should also considerthis aspect when we evaluate the quality of our results.Finally, we underline that we do not apply these co-herency losses at test time. Instead, during training, ourlosses act as constraints to the reconstruction and ultimatelyprovide better supervision to the network, for images thatno explicit 3D annotations are available. The improved su-pervision leads to more coherent results at test time too . In this Subsection, we present more qualitative results ofour approach. In Figure 5 we compare our baseline withour full model trained with the proposed losses. As ex-pected, our full model generates more coherent reconstruc-tions, improving over the baseline as far as interpenetrationand depth ordering mistakes are concerned. Errors can hap-pen when there is significant scale difference among thepeople and there is no overlap on the image plane (last rowof Figure 6). More results can be found in the Sup.Mat.
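The pairwise ordinal-depth metric described above reduces to a simple comparison of per-person root depths. Below is a minimal sketch, under the assumption that the predicted and ground-truth depths of each person are available as scalars; the function name and the `tol` parameter are ours, not from the paper:

```python
import itertools

def depth_ordering_accuracy(pred_depths, gt_depths, tol=0.0):
    """Fraction of person pairs whose ordinal depth relation
    (i.e., who is closer to the camera) is predicted correctly."""
    pairs = list(itertools.combinations(range(len(gt_depths)), 2))
    if not pairs:
        return 1.0

    def rel(z_i, z_j):
        # -1: person i closer, +1: person j closer, 0: roughly equal depth
        if abs(z_i - z_j) <= tol:
            return 0
        return -1 if z_i < z_j else 1

    correct = sum(
        rel(pred_depths[i], pred_depths[j]) == rel(gt_depths[i], gt_depths[j])
        for i, j in pairs
    )
    return correct / len(pairs)
```

For example, with ground-truth depths [1, 2, 3] and predictions [2, 1, 3], only the pair (0, 1) is ordered incorrectly, so the accuracy is 2/3.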
5. Summary
In this work, we present an end-to-end approach for multi-person 3D pose and shape estimation from a single image. Using the R-CNN framework, we design a top-down approach that regresses the SMPL model parameters for each detected person in the image. Our main contribution lies in assessing the problem from a more holistic view, aiming at estimating a coherent reconstruction of the scene instead of focusing only on independent pose estimation for each person. To this end, we incorporate two novel losses in our framework that train the network such that a) it avoids generating overlapping humans and b) it is encouraged to position the people in a consistent depth ordering. We evaluate our approach on various benchmarks, demonstrating very competitive performance under the traditional 3D pose metrics, while also performing significantly better, both qualitatively and quantitatively, in terms of the coherency of the reconstructed scene. In future work, we aim to more explicitly model interactions between people (besides overlap avoidance), so that we can achieve a more accurate and detailed reconstruction of the scene at a finer level as well. In a similar vein, we can incorporate further information towards a holistic reconstruction of scenes. This can include constraints from the ground plane [65], the background [17], or the objects that humans interact with [18, 58].

Figure 5: Qualitative effect of proposed losses. Results of our baseline model (center) and our full model trained with our proposed losses (right). As expected, we improve over our baseline in terms of coherency in the results (i.e., fewer interpenetrations, more consistent depth ordering for the reconstructed meshes).

Figure 6: Qualitative evaluation. We visualize the reconstructions of our approach from different viewpoints: front (green background), top (blue background), and side (red background). More qualitative results can be found in the Sup. Mat.
Acknowledgements:
NK, GP, and KD gratefully acknowledge support through the following grants: NSF-IIP-1439681 (I/UCRC), NSF-IIS-1703319, NSF MRI 1626008, ARL RCTA W911NF-10-2-0016, ONR N00014-17-1-2093, ARL DCIST CRA W911NF-17-2-0181, the DARPA-SRC C-BRIC, the Honda Research Institute, and a Google Daydream Research Award. XZ and WJ would like to acknowledge support from NSFC (No. 61806176) and the Fundamental Research Funds for the Central Universities (2019QNA5022).

References

[1] Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In CVPR, 2018.
[2] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[3] Anurag Arnab, Carl Doersch, and Andrew Zisserman. Exploiting temporal context for 3D human pose estimation in the wild. In CVPR, 2019.
[4] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.
[5] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. PAMI, 2019.
[6] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[7] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[8] Weifeng Chen, Zhao Fu, Dawei Yang, and Jia Deng. Single-image depth perception in the wild. In NIPS, 2016.
[9] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
[10] Rishabh Dabral, Nitesh B Gundavarapu, Rahul Mitra, Abhishek Sharma, Ganesh Ramakrishnan, and Arjun Jain. Multi-person 3D human pose estimation from monocular images. In , 2019.
[11] Valentin Gabeur, Jean-Sébastien Franco, Xavier Martin, Cordelia Schmid, and Gregory Rogez. Moulding humans: Non-parametric 3D human shape estimation from single images. In ICCV, 2019.
[12] Stuart Geman and Donald E McClure. Statistical methods for tomographic image reconstruction. Bulletin of the International Statistical Institute, 4:5-21, 1987.
[13] Ross Girshick. Fast R-CNN. In ICCV, 2015.
[14] Peng Guan, Alexander Weiss, Alexandru O Balan, and Michael J Black. Estimating human shape and pose from a single image. In ICCV, 2009.
[15] Rıza Alp Güler and Iasonas Kokkinos. HoloPose: Holistic 3D human reconstruction in-the-wild. In CVPR, 2019.
[16] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018.
[17] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J Black. Resolving 3D human pose ambiguities with 3D scene constraints. In ICCV, 2019.
[18] Yana Hasson, Gul Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In CVPR, 2019.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI, 36(7):1325-1339, 2013.
[22] Sam Johnson and Mark Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, 2010.
[23] Sam Johnson and Mark Everingham. Learning effective human pose estimation from inaccurate annotation. In CVPR, 2011.
[24] Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. In ICCV, 2015.
[25] Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social interaction capture. PAMI, 41(1):190-204, 2017.
[26] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
[27] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In CVPR, 2018.
[28] Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In ICCV, 2019.
[29] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019.
[30] Abhijit Kundu, Yin Li, and James M Rehg. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In CVPR, 2018.
[31] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In CVPR, 2017.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[33] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft rasterizer: A differentiable renderer for image-based 3D reasoning. In ICCV, 2019.
[34] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
[35] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, 2017.
[36] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. In , 2017.
[37] Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Mohamed Elgharib, Pascal Fua, Hans-Peter Seidel, Helge Rhodin, Gerard Pons-Moll, and Christian Theobalt. XNect: Real-time multi-person 3D human pose estimation with a single RGB camera. arXiv preprint arXiv:1907.00837, 2019.
[38] Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Srinath Sridhar, Gerard Pons-Moll, and Christian Theobalt. Single-shot multi-person 3D pose estimation from monocular RGB. In , 2018.
[39] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG), 36(4):44, 2017.
[40] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In ICCV, 2019.
[41] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In , 2018.
[42] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
[43] Georgios Pavlakos, Nikos Kolotouros, and Kostas Daniilidis. TexturePose: Supervising human mesh estimation with texture consistency. In ICCV, 2019.
[44] Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Ordinal depth supervision for 3D human pose estimation. In CVPR, 2018.
[45] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In CVPR, 2018.
[46] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
[47] Alin-Ionut Popa, Mihai Zanfir, and Cristian Sminchisescu. Deep multitask architecture for integrated 2D and 3D human sensing. In CVPR, 2017.
[48] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[49] Helge Rhodin, Victor Constantin, Isinsu Katircioglu, Mathieu Salzmann, and Pascal Fua. Neural scene decomposition for multi-person motion capture. In CVPR, 2019.
[50] Gregory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. LCR-Net: Localization-classification-regression for human pose. In CVPR, 2017.
[51] Gregory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. LCR-Net++: Multi-person 2D and 3D pose detection in natural images. PAMI, 2019.
[52] Leonid Sigal, Alexandru Balan, and Michael J Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, 2008.
[53] David Smith, Matthew Loper, Xiaochen Hu, Paris Mavroidis, and Javier Romero. FACSIMILE: Fast and accurate scans from an image in less than a second. In ICCV, 2019.
[54] David Stutz. Learning shape completion from bounding boxes with CAD shape priors. Master's thesis, RWTH Aachen University, 2017.
[55] David Stutz and Andreas Geiger. Learning 3D shape completion from laser scan data with weak supervision. In CVPR, 2018.
[56] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
[57] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In ECCV, 2018.
[58] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In CVPR, 2019.
[59] Bugra Tekin, Pablo Márquez-Neila, Mathieu Salzmann, and Pascal Fua. Learning to fuse 2D and 3D image cues for monocular body pose estimation. In ICCV, 2017.
[60] Denis Tome, Chris Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In CVPR, 2017.
[61] Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. Self-supervised learning of motion capture. In NIPS, 2017.
[62] Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. BodyNet: Volumetric inference of 3D human body shapes. In ECCV, 2018.
[63] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular total capture: Posing face, body, and hands in the wild. In CVPR, 2019.
[64] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
[65] Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. Monocular 3D pose and shape estimation of multiple people in natural scenes - the importance of multiple scene constraints. In CVPR, 2018.
[66] Andrei Zanfir, Elisabeta Marinoiu, Mihai Zanfir, Alin-Ionut Popa, and Cristian Sminchisescu. Deep network for the integrated 3D sensing of multiple people in natural images. In NeurIPS, 2018.
[67] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3D human pose estimation in the wild: a weakly-supervised approach. In