Holistic 3D Human and Scene Mesh Estimation from Single View Images
Zhenzhen Weng, Stanford University, [email protected]
Serena Yeung, Stanford University, [email protected]
Abstract
The 3D world limits the human body pose, and the human body pose conveys information about the surrounding objects. Indeed, from a single image of a person placed in an indoor scene, we as humans are adept at resolving ambiguities of the human pose and room layout through our knowledge of the physical laws and our prior perception of plausible object and human poses. However, few computer vision models fully leverage this fact. In this work, we propose an end-to-end trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and the room layout, and reconstructs both human body and object meshes. By imposing a set of comprehensive and sophisticated losses on all aspects of the estimations, we show that our model outperforms existing human body mesh methods and indoor scene reconstruction methods. To the best of our knowledge, this is the first model that outputs both object and human predictions at the mesh level, and performs joint optimization on the scene and human poses.
1. Introduction
Holistic scene perception is key to our human ability to accurately interpret and interact with the 3D world. The human visual system naturally integrates context from actors, objects, and scene layout to infer realistic, robust estimations of the world. Suppose a human is partially included in an image because they are positioned behind a desk. We can still effortlessly extract rich information from the static scene to resolve ambiguities due to the occlusion. Likewise, the appearance of humans also provides useful information about scenes, such as the ground plane and the depth of surrounding objects. Humans and objects in scenes jointly manifest spatial occupancies that constrain their relative positions. For computer vision systems to achieve high accuracy in recognizing and interpreting complex scenes, it is therefore important to develop approaches for holistic scene perception and reasoning.

Figure 1. Given a single view RGB image of an indoor scene, our model is able to (i) predict all aspects of the scene (3D object bounding boxes, object and human meshes, 3D room layout, camera pose) and the human body mesh, and (ii) jointly optimize over a comprehensive set of global consistency losses. The final result is more physically plausible and accurate.

In recent years, holistic scene understanding from single view images has gained increasing interest from computer vision researchers. [33] [14] proposed methods for joint reasoning over inanimate scenes, and recovered room layout and 3D object bounding boxes using consistency losses such as a constraint for objects to be enclosed within the room bounding box. [4] additionally discouraged intersection between object bounding box estimations, and was the first model to bring 3D human pose estimation into the holistic scene understanding problem. It incorporated human-object interaction priors to reason about approximate relations between humans and objects. However, all of these works still operate at the relatively coarse level of bounding boxes and joint keypoints, and are therefore limited in their ability to use precise shapes, surfaces, and physical occupancies to design holistic scene constraints and improve estimation accuracy.

In this work we propose the first single-view, holistic scene understanding method that jointly optimizes over all aspects of 3D human pose, objects, and room layout at the mesh level, to produce state-of-the-art mesh estimations of the scene. Our approach builds on recent advances in mesh prediction. [5] [7] [28] proposed methods for reconstructing individual object meshes with varying topological structures. [25] builds on [28] and proposed the first holistic 3D scene understanding method with mesh reconstruction at the instance level; however, they did not consider humans. Recently, [11] introduced a method for 3D mesh-based human pose estimation that utilizes physical occupancy information of the static scene to discourage body penetration into the scene. However, [11] requires ground truth 3D scans of the scene, and does not perform joint human and scene estimation.

Given a single RGB image, our method simultaneously reconstructs the human body mesh and multiple aspects of the scene (3D object meshes and bounding boxes, room layout, and camera pose), all in 3D (Figure 1). Our approach outputs the SMPL-X (SMPL eXpressive) [29] human mesh model, which fully parameterizes the 3D surface of the human body. It also leverages a variant of the Topology Modification Network (TMN) [5], proposed in [25], as the base model for static object mesh and scene reconstruction. Importantly, we introduce a novel joint optimization process that incorporates a comprehensive set of physical rules and priors, including 2D/3D reprojection constraints, object-object mesh constraints, object-human mesh constraints, and object/human-room layout constraints, to obtain robust, physically plausible predictions. We perform experimental evaluation on the PiGraphs [32] and PROX [11] datasets and demonstrate that our model outperforms existing state-of-the-art methods on either 3D scene understanding or 3D human pose estimation.

In summary, our contributions are the following:

• We propose an end-to-end, holistic model for jointly reconstructing 3D human body meshes and static scene elements (3D object meshes and bounding boxes, room layout, and camera pose) from monocular RGB images.
To the best of our knowledge, we are the first to jointly estimate this rich scene understanding at the mesh level.

• Our model does not require any ground truth annotations of the 3D scene or the human poses, and can be directly used on any indoor dataset to produce high quality mesh reconstructions.

• Through our novel joint optimization process that incorporates a comprehensive set of physical rules and priors, we show that our model outperforms prior state-of-the-art methods on either 3D scene understanding or 3D human pose estimation, on the PiGraphs and PROX Quantitative datasets.
2. Related Work
Single View 3D Human Pose Estimation.
Previous 3D pose estimation methods from single view RGB images can be divided into two types: (i) directly learning 3D human keypoints from 2D image features [36], and (ii) 2D pose estimation with subsequent separate lifting of the 2D coordinates to 3D via deep neural networks [30] [23]. Although these works have shown impressive results on in-the-wild images with relatively clean backgrounds, estimating 3D poses with cluttered backgrounds and partial occlusions is still very challenging. Recent works on human body models [21] [29] and single view body mesh reconstruction methods [2] [18] have pushed the richness of body details available for reasoning, and provide opportunities for bringing novel constraints to the training stage. Recently, [11] proposed the first 3D human body mesh reconstruction method that takes the static scene into consideration; however, they rely on ground truth 3D scene scans. Our work builds on these directions and is the first to leverage mesh representations of both human and scene in performing holistic estimation of 3D human body and scene meshes jointly.
Holistic Scene Understanding.
The 3D holistic scene understanding problem, in particular 3D scene reconstruction from single view images, has received increasing attention over the past few years. While most of these works have focused on coarser bounding boxes and keypoints as opposed to meshes, methods have differed in model outputs and constraint formulations [14] [25] [4]. Works such as [14] have focused on the static scene; [14] proposed an end-to-end model that learns the 3D room layout, camera pose and 3D object bounding boxes. Drawing insight from the camera projection process and physical commonsense, [14] encourages projected 3D bounding boxes to be close to their 2D locations on the image plane, and forces object bounding boxes to be within the room layout bounding box.

Some works have attempted to incorporate scene/object information in human pose estimation [40] [11] and/or vice versa [8]. [4] jointly tackles two tasks from a single-view image: (i) 3D estimation of object bounding boxes, camera pose, and room layout; and (ii) 3D human keypoint estimation. They used an energy-based inference optimization process that refines direct 3D outputs by jointly reasoning across aspects of the objects and human keypoints. However, their constraint formulations, based on 3D bounding boxes and human keypoints, are still lacking in precision. Additionally, energy-based models have the disadvantage of an expensive inference step compared to feed-forward models, and [4]'s MAP estimation method searches over a discrete set of object locations, which may give sub-optimal results. In contrast, we impose precise physical constraints at the mesh level in our joint optimization procedure and directly back-propagate through the underlying neural networks.
Holistic Scene Mesh Reconstruction.
An emerging line of work attempts to reconstruct richer information about objects in scenes, such as depth [34], voxel [20] [37], or mesh representations [7] [25]. Meshes contain much richer 3D shape information about the objects, but are generally harder to reconstruct due to the diverse topology of the shapes. Mesh-retrieval methods [16] [15] [17] retrieve 3D models from a large 3D model repository; however, the size of these repositories remains a bottleneck. Object-wise mesh reconstruction methods [5] [39] [7] [28] take a different approach, using end-to-end prediction and refinement of the target mesh of individual objects. Recently, [25] incorporated an object-wise mesh reconstruction module in their holistic 3D understanding model for static scenes. However, they did not take advantage of the rich information about object shapes that comes with the meshes, and their reconstructed scene meshes are often physically implausible. Although a recent 3D human mesh estimation method [11] takes advantage of precise object shapes in its constraint formulation, it uses ground truth 3D scene scans. In contrast, we estimate both humans and the static scene jointly from single view images.
3. Model
We introduce a two-stage approach for joint 3D human and scene mesh estimation. In Stage I, we separately parse and reconstruct the human meshes and the 3D scene (3D object bounding boxes and meshes, camera pose, and 3D room layout) to obtain initial estimates. In this stage, holistic reasoning is limited to encouraging physical plausibility within the human only and within the static scene only. Then in Stage II, we jointly minimize global consistency losses across the humans and the static scene together, which extends the holistic reasoning to simultaneously improve the performance of all sub-tasks.

An overview of our method is illustrated in Figure 2.

Figure 2. Overview of our model. Given a single RGB image, we first use off-the-shelf 2D detectors to predict the 2D human keypoints and the 2D bounding boxes of the objects in the scene. Then, the body mesh network reconstructs a SMPL-X body mesh model through the human keypoints re-projection loss and the human body prior losses. The Mesh Generation Network (MGN) reconstructs the object-wise meshes. The 3D Object Detection Network (ODN) predicts the 3D bounding boxes of the objects. The Layout Estimation Network (LEN) predicts the camera pose and the 3D room bounding box. In Stage I, the individual modules are optimized with within-body and within-scene losses. In Stage II, the modules are fine-tuned with the additional human-scene joint losses to achieve consistency and physical plausibility across all aspects of the output.

In Section 3.1, we first define our notation and representation of the 3D scene and our human body mesh model. In Section 3.2, we describe the model architectures we use for producing each part of the body and scene estimations. Based on these, in Section 3.3, we present our novel joint optimization process that incorporates a comprehensive set of physical rules and priors (including 2D/3D reprojection constraints, object-object mesh constraints, object-human mesh constraints, and object/human-room layout constraints) to perform holistic estimation of both human and scene meshes.
3D Scene.
The input to our model is a 2D image I ∈ R^{h×w}. We use a pre-trained Faster R-CNN [31] to obtain initial 2D bounding box estimates b ∈ R^4 for each object in the scene. The 2D bounding box centers are represented as c ∈ R^2. Our representation for the camera pose, room layout, and 3D object bounding boxes and meshes in a scene follows the notation used in [14] [25]. The camera pose is a 3×3 rotation matrix defined by the pitch and roll angles of the camera system relative to the world system. In the world system, an object bounding box is represented by a 3D bounding box X ∈ R^{8×3}, which can be determined from its 3D center C ∈ R^3, spatial size s ∈ R^3, and orientation angle θ ∈ [−π, π). The cuboid room layout is also represented by a 3D box X^L ∈ R^{8×3}, and is parameterized in the same manner as an object bounding box. The triangular mesh for object i in the image is represented by its vertices and faces, M_i = (V_i, F_i), where V_i ∈ R^{N_i×3}, N_i is the number of vertices, and F_i defines the triangular faces of the mesh. M_i is normalized to fit in a unit cube, and the vertices of the mesh can be converted to the 3D camera coordinate system by translation and rotation as specified by the 3D bounding box parameters.
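For concreteness, the following minimal NumPy sketch recovers the 8 corners X from (C, s, θ); the function name and the y-up rotation convention (matching the camera system described later) are our assumptions, not the authors' code:

```python
import numpy as np

def box_corners(center, size, theta):
    """Recover the 8 corners X (8x3) of a 3D box from its center C, spatial
    size s, and orientation angle theta (assumed to be a rotation about the
    vertical y axis)."""
    half = np.asarray(size, dtype=float) / 2.0
    # Corner offsets of an axis-aligned box centered at the origin.
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    corners = signs * half
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])  # rotation about the y axis
    return corners @ R.T + np.asarray(center, dtype=float)

X = box_corners(center=[0.5, 0.0, 3.0], size=[1.0, 0.8, 0.6], theta=np.pi / 6)
print(X.shape)  # (8, 3)
```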
Human Body Model.
We represent the human body using SMPL-X (SMPL eXpressive) [29], a generative model that captures how the human body shape varies across a human population, learned from a corpus of registered 3D body, face and hand scans of people of different sizes, genders and nationalities in various poses. SMPL-X extends the SMPL model [21] with fully articulated hands and an expressive face. It is essentially a differentiable function parameterized by the shape β_b, pose θ_b, facial expressions ψ, and translation γ of the body. The output of SMPL-X is a 3D triangular mesh M_b = (V_b, F_b) that contains N_b = 10475 vertices V_b ∈ R^{N_b×3} and triangular faces F_b.
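As an illustration, the SMPL-X model can be evaluated with the authors' smplx Python package; a minimal sketch follows (the model path and the all-zero parameters are assumptions of this sketch, not values used in our experiments):

```python
import torch
import smplx  # pip install smplx; model files from the SMPL-X project page

# 'models' is an assumed path to the downloaded SMPL-X model files.
model = smplx.create('models', model_type='smplx', gender='neutral')

betas = torch.zeros(1, 10)        # body shape beta_b
body_pose = torch.zeros(1, 63)    # body pose theta_b (21 joints, axis-angle)
transl = torch.zeros(1, 3)        # global translation gamma

output = model(betas=betas, body_pose=body_pose, transl=transl,
               return_verts=True)
V_b = output.vertices             # (1, 10475, 3) mesh vertices
F_b = model.faces                 # triangular faces
joints = output.joints            # 3D joints used in the re-projection loss
```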
Body Model.
Since the SMPL-X [29] body model is a fully differentiable function, we simply compute the body loss terms (Section 3.3) that are formulated in terms of the vertices and faces of the output human body mesh, and back-propagate through the SMPL-X model to find the optimal set of parameters, such as the shape and pose of the human body. As in [29], the parameters of the SMPL-X model are regularized with a set of body priors, including a VAE-based body pose prior, and L2 priors on hand pose, facial pose, body shape and facial expressions, penalizing deviation from the neutral state.

Scene Models.
We use three sub-modules to predict the 3D object boxes, the camera pose and 3D room layout, and the 3D object meshes in the scene, respectively. Specifically, we adopt the Object Detection Network (ODN), Layout Estimation Network (LEN), and Mesh Generation Network (MGN) from [25]. For 3D object box prediction, the ODN first takes 2D detections from a Faster R-CNN model trained on LVIS [10], extracts appearance features in an object-wise fashion using ResNet-34 [12], and encodes the relative position and size between 2D object boxes into geometry features using the method in [13]. For each target object, an "attention sum" is then computed using relational features to the other objects [13]. Finally, each set of box parameters is regressed using a two-layer MLP. The LEN consists of a ResNet-34 feature extractor and two separate branches of fully-connected layers, one for predicting the camera pose and the other for predicting the 3D room bounding box attributes. Finally, for 3D object mesh prediction, the MGN takes a 2D detection of an object as input and uses ResNet-18 to extract 2D appearance features. Then, the image features concatenated with the one-hot LVIS [10] object category encoding are fed into the decoder of AtlasNet [9], which performs mesh deformation from a template sphere mesh. An edge classifier is trained to remove redundant edges from the deformed mesh, and a boundary refinement module [28] is used to refine the smoothness of boundary edges and output the final mesh. We pre-train on SUN RGB-D [35] to initialize the scene models. However, no ground truth annotations are required when training our model on a new dataset.
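To make the LEN's two-branch design concrete, the following is a minimal PyTorch sketch; the layer widths, feature dimension, and the 7-parameter room-box output (center, size, orientation) are illustrative assumptions rather than the exact architecture of [25]:

```python
import torch
import torch.nn as nn

class LayoutHead(nn.Module):
    """Schematic two-branch head in the spirit of the LEN: a shared image
    feature feeds one branch for the camera pose (pitch, roll) and one for
    the 3D room bounding box parameters."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.camera = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                    nn.Linear(256, 2))   # pitch, roll
        self.layout = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                    nn.Linear(256, 7))   # C (3) + s (3) + theta

    def forward(self, feat):
        return self.camera(feat), self.layout(feat)

cam, room = LayoutHead()(torch.randn(1, 512))  # toy usage
```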
We optimize a comprehensive set of losses based on physically plausible constraints and priors, across two stages of training, to perform holistic estimation of 3D human and scene meshes. These losses can be organized as within-body losses (Stage I), within-scene losses (Stage I), and global human-scene losses (Stage II).
Within-body losses
As part of Stage I of our approach, we first utilize within-body constraints to generate an initial human mesh estimation. Following [11] [2] [29], we formulate fitting SMPL-X to monocular images as an optimization problem, and seek to minimize the loss function

L_body = E(β, θ, ψ, γ) = E_J + λ_{θb} E_{θb} + λ_{θf} E_{θf} + λ_{θh} E_{θh} + λ_E E_E + λ_β E_β + λ_α E_α + λ_{Pself} E_{Pself}    (1)

Here E_J is the re-projection loss that we use to minimize the weighted robust distance between 2D joints estimated from the RGB image I and the 2D projection of the corresponding 3D joints of SMPL-X. θ_b, θ_f, θ_h are the pose vectors for the body, face (neck, jaw) and the two hands, respectively. The terms E_{θf}, E_{θh}, E_E and E_β are L2 priors for the facial pose, hand pose, facial expressions and body shape, penalizing deviations from the neutral state. E_{θb} is a VAE-based body pose prior called VPoser, introduced in [29]. E_α is a prior penalizing extreme bending, applied only to elbows and knees. The terms E_J, E_{θb}, E_{θh}, E_α, E_β are as described in [29]. E_{Pself} is a penetration penalty for self-penetrations (e.g. a hand intersecting a knee). The λ's are the weights for the terms.

Our formulation is closest to that of [11], which performs human mesh estimation and was built upon [29] with the addition of scene contact (E_C) and penetration (E_P) terms, assuming access to ground truth scene scans. There are several differences between their full loss function and our formulation in Eq. 1. First, we do not include any depth-related terms, because we wish to perform estimation using solely RGB images, whereas [11] proposes model variants leveraging RGB-D information. Second, since we are performing joint estimation of the 3D scene from a monocular RGB image, we are not yet able to reason about scene contact or penetration after only human mesh estimation. So we include only a body self-penetration term in Eq. 1, which is computed following the approach in [1] [29] [38], and we instead consider human-scene constraints during our global optimization stage.

Within-scene losses
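As an illustration of the re-projection term E_J, here is a minimal PyTorch sketch; the Geman-McClure robustifier follows the "weighted robust distance" of [29], while the pinhole projection, shapes, and function names are our assumptions:

```python
import torch

def gmof(x, rho=100.0):
    # Geman-McClure robustifier: quadratic near zero, saturating for outliers.
    sq = x ** 2
    return (rho ** 2) * sq / (sq + rho ** 2)

def reprojection_loss(joints_3d, keypoints_2d, conf, K):
    """E_J sketch. joints_3d: (J, 3) SMPL-X joints in camera coordinates;
    keypoints_2d: (J, 2) detected 2D joints; conf: (J,) detector confidences
    used as per-joint weights; K: (3, 3) camera intrinsics."""
    proj = joints_3d @ K.T                 # pinhole projection
    uv = proj[:, :2] / proj[:, 2:3]        # perspective divide
    per_joint = gmof(uv - keypoints_2d).sum(dim=-1)
    return (conf * per_joint).sum()

# Toy usage with random inputs.
J = 25
loss = reprojection_loss(
    torch.rand(J, 3) + torch.tensor([0.0, 0.0, 2.0]),
    torch.rand(J, 2) * 200, torch.ones(J),
    torch.tensor([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1.0]]))
```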
In Stage I of our approach, we also utilize within-scene constraints to generate an initial static scene estimation. Specifically, we design two within-scene constraints: one encourages 2D/3D consistency of the predicted object bounding boxes, and the other penalizes collisions between the object meshes.

For the first constraint, we utilize the fact that, based on the camera projection model, if we project the predicted 3D bounding boxes onto the 2D image plane, the projected corners should be close to the 2D bounding box corners. This constraint therefore optimizes both the camera pose and the 3D bounding boxes. [25] imposes a similar loss, penalizing the deviation of the 2D projections of predicted 3D bounding box corners from those of the ground truth 3D bounding boxes, for both object bounding boxes and the room bounding box. However, since our model does not rely on any ground truth annotations in the described optimization process, we propose to use our detected 2D bounding boxes as pseudo ground truth. We show the effectiveness of this loss term in Section 4. Formally,

L_{J_scene} = (1/N) Σ_{i=1}^{N} SmoothL1( f(X_i(s_i, C_i, θ_i)), b_i )    (2)

where s_i, C_i, θ_i are the size, centroid and orientation of object i, b_i is the 2D bounding box estimate for object i, and f is a differentiable projection function that projects the corners of a 3D bounding box onto the 2D image plane. Like [25], we use a smooth L1 loss, comprised of a squared term if the absolute element-wise error falls below a threshold and an L1 term otherwise.

Our second constraint is a new loss term that penalizes collisions between reconstructed object meshes. Previous works have not explored this loss, because they either did not have the object shape information necessary to calculate precise collisions [14] [4], or did not take advantage of the object shape information that comes with the meshes [25]. We notice that inter-object collision is common in the output of these works. We detect collisions using the signed distance field (SDF) of each object. For each object mesh, we voxelize its 3D bounding box into a grid; for each grid cell center, we calculate its signed distance to the nearest point of the other object meshes in the scene. A negative distance means that the cell center is inside the nearest scene object, and denotes penetration. We use a squared sum of the signed distances of the penetrating grid cells. Formally,

L_{P_scene} = Σ_{i=1}^{n_obj} Σ_{c_j ∈ V_i} ‖ d(c_j, M_{−i}) · 𝟙( d(c_j, M_{−i}) < 0 ) ‖²    (3)

where c_j is the center of the j-th cell in the voxel grid V_i for object i, d(c_j, M_{−i}) is the signed distance between the cell center c_j and the scene mesh composed of all object meshes except object i, and 𝟙(·) is an indicator function.

Global human-scene losses
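A minimal PyTorch sketch of the two within-scene terms follows; the shapes, the enclosing-rectangle reading of Eq. 2, and the assumption that the signed distances for Eq. 3 are precomputed by an external mesh-distance routine are ours:

```python
import torch
import torch.nn.functional as F

def box_reprojection_loss(corners_3d, boxes_2d, K):
    """Eq. 2 sketch: project each predicted 3D box's corners with camera
    intrinsics K, take the enclosing 2D rectangle, and apply a smooth-L1
    penalty against the detected 2D box used as pseudo ground truth.
    corners_3d: (N, 8, 3); boxes_2d: (N, 4) as (x1, y1, x2, y2)."""
    proj = torch.einsum('ij,nkj->nki', K, corners_3d)
    uv = proj[..., :2] / proj[..., 2:3]
    pred = torch.cat([uv.min(dim=1).values, uv.max(dim=1).values], dim=-1)
    return F.smooth_l1_loss(pred, boxes_2d)

def voxel_grid_centers(box_min, box_max, res=16):
    """Voxelize an object's 3D bounding box into a res^3 grid of cell centers."""
    axes = [torch.linspace(lo, hi, res + 1)[:-1] + (hi - lo) / (2 * res)
            for lo, hi in zip(box_min, box_max)]
    gx, gy, gz = torch.meshgrid(*axes, indexing='ij')
    return torch.stack([gx, gy, gz], dim=-1).reshape(-1, 3)

def collision_penalty(sdf_at_centers):
    """Eq. 3 sketch: squared penalty on cell centers whose signed distance
    to the other object meshes M_{-i} is negative (penetrating)."""
    d = torch.clamp(sdf_at_centers, max=0.0)  # keep only d < 0
    return (d ** 2).sum()
```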
In Stage II of our approach, we jointly fine-tune the human and scene estimation components by imposing additional human-scene losses across the reconstructed human mesh and scene mesh. We consider four types of human-scene losses.

First, observing that indoor furniture is very likely to rest on the floor, we penalize the absolute distance between the object bounding boxes and the ground plane estimated by the Layout Estimation Network. In the camera coordinate system that we use, the +y axis is perpendicular to the ground plane and points upward. Hence, we can write this term formally as

L_{joint}^{obj−ground} = Σ_{i=1}^{n_obj} d( y_min(X^L), y_min(X_i) )    (4)

where y_min(X) returns the minimum y coordinate of the 3D bounding box X ∈ R^{8×3}.

Second, like objects in the room, humans need a supporting plane to counteract gravity. Therefore, we penalize the distance between the lowest point of the human body mesh and the room ground plane. We denote this term L_{joint}^{body−ground}.

Third, we include the contact term E_C from [11], although [11] utilized ground truth scene scans. The intuition is that when humans interact with the scene, they come into contact with it. Thus, [11] annotates a set of candidate contact vertices V_C ⊂ V_b across the whole body that come frequently into contact with the world, focusing on the actions of sitting and touching with hands. Formally,

L_{joint}^{C} = Σ_{v_C ∈ V_C} ρ_C( min_{v_s ∈ V_s} ‖ v_C − v_s ‖ )    (5)

where ρ_C denotes the robust Geman-McClure error function [6], which down-weights vertices in V_C that are far from the nearest vertices V_s of the 3D scene mesh M_s. Note that since we do not have access to (or reconstruct) a floor mesh as in [11], we leave out [11]'s body-floor contact terms; instead, our loss term L_{joint}^{body−ground} encourages contact between the feet and the floor.

Finally, we penalize any collisions between the body mesh and the object meshes in the scene. The formulation is similar to Eq. 3. We call this term L_{joint}^{P}.

To summarize, our model's total loss is

L_total = L_body + L_scene + L_joint    (6)

where

L_scene = λ_1 L_{J_scene} + λ_2 L_{P_scene}    (7)

L_joint = λ_3 L_{joint}^{obj−ground} + λ_4 L_{joint}^{body−ground} + λ_5 L_{joint}^{C} + λ_6 L_{joint}^{P}    (8)

In Stage I, only the within-body (L_body) and within-scene (L_scene) constraints are used. In Stage II, we add the global consistency losses (L_joint) across the human and the static scene together, and continue fine-tuning the modules to simultaneously improve the performance of all sub-tasks.
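A minimal PyTorch sketch of the ground and contact terms (Eqs. 4 and 5); the Geman-McClure scale rho and all shapes are assumptions:

```python
import torch

def obj_ground_loss(room_corners, object_corners):
    """Eq. 4 sketch: with +y pointing up, penalize the distance between each
    object box's lowest y and the room layout's lowest y (the floor).
    room_corners: (8, 3); object_corners: list of (8, 3) tensors."""
    floor_y = room_corners[:, 1].min()
    return sum((boxes[:, 1].min() - floor_y).abs() for boxes in object_corners)

def contact_loss(contact_verts, scene_verts, rho=0.05):
    """Eq. 5 sketch: for each candidate contact vertex v_C on the body, take
    the distance to its nearest scene vertex v_s and apply the Geman-McClure
    robustifier (rho is an assumed scale). contact_verts: (Nc, 3);
    scene_verts: (Ns, 3)."""
    nearest = torch.cdist(contact_verts, scene_verts).min(dim=1).values
    sq = nearest ** 2
    return ((rho ** 2) * sq / (sq + rho ** 2)).sum()
```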
4. Experiments
In this section, we evaluate the performance of our method. Since we are the first to jointly predict and reconstruct both 3D human poses and objects at the mesh level, we compare our model with the state-of-the-art methods for each task. Specifically, we compare with [11] on human body mesh prediction, [24] [4] on 3D human keypoint estimation, and [14] [4] on 3D bounding box estimation.
PiGraphs [32].
PiGraphs contains 3D scene scans and video recordings of five human subjects with skeletal tracking provided by Kinect v2 devices. The dataset contains annotations for 3D human keypoints and 3D object bounding boxes in the scenes. We perform quantitative evaluation on both of these prediction tasks.

PROX Quantitative [11].
PROX Quantitative has static RGB-D frames and was captured using Vicon and MoSH markers. [11] placed everyday furniture and objects into the scene to mimic a living room, and performed 3D reconstruction of the scene. The ground truth human body mesh annotations were obtained by placing markers on the body and the fingers, and then using MoSh++ [22] to convert the MoCap data into realistic 3D human meshes represented by a rigged body model. To the best of our knowledge, this is the only available dataset that has both real furniture in a cuboid room and a human subject actively interacting with the scene, which makes it ideal for holistic scene understanding. Since PROX Quantitative does not provide ground truth object-level meshes and therefore does not support the scene estimation task, we quantitatively evaluate our model only on the human mesh estimation task.
PROX Qualitative.
PROX Qualitative [11] provides K synchronized and spatially calibrated RGB-D recordings of humans in indoor scenes. While it was released together with PROX Quantitative, it does not have ground truth human mesh annotations. We perform additional qualitative evaluation on this dataset.

Table 1. Left: Quantitative results for 3D scene reconstruction on PiGraphs; higher IoU values indicate better performance. Right: Quantitative results for human keypoint estimation on PiGraphs; for both the 2D (pix) and 3D (m) metrics, lower values are better.

Object Detection                         Pose Estimation
Method   2D IoU (%)   3D IoU (%)         Method   2D (pix)   3D (m)
[14]     68.6         21.4               [24]     63.9       0.732
[4]      75.1         24.9               [4]      15.9       0.472
Ours     –            –                  Ours     –          –
Given an RGB image of an indoor scene as the input to the model, we first use off-the-shelf 2D detectors to estimate 2D object bounding boxes and 2D human keypoints. For 2D object detection, we use Faster R-CNN [31] trained on the LVIS [10] dataset; for 2D keypoint detection we use OpenPose [3]. ODN, LEN, and MGN are pretrained on the SUN RGB-D dataset [35], following prior work for our task.

In Stage I, we optimize the SMPL-X body model using only the within-body (L_body) losses, with the L-BFGS optimizer [27]. For the scene model, we freeze the MGN and the feature extractor components of ODN and LEN, and use the Adam optimizer [19] to back-propagate through the linear layers that predict the object bounding box attributes (e.g. centroid, orientation), the camera pose, and the 3D room layout. For this part, only the within-scene (L_scene) losses are used.

In Stage II, we add the global consistency losses (L_joint) and continue fine-tuning all modules. In this stage, we additionally fix the orientations of the 3D object and room bounding boxes and the camera pose. We train the linear layers that predict the centroid and the size of the object and room boxes to further refine the 3D locations of the objects and the ground plane of the scene. We use the same optimizers as in Stage I but with reduced learning rates. The hyperparameters used in Eq. 7 and Eq. 8 are λ_1 = 1, λ_2 = 0., λ_3 = 10, λ_4 = 20, λ_5 = 1e, λ_6 = 1e.
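The two-stage schedule can be summarized with the following runnable skeleton; all modules and losses here are dummy stand-ins for the components above (the real objectives are the terms of Section 3.3), and the iteration counts are placeholders:

```python
import torch
import torch.nn as nn

body = nn.Linear(10, 3)        # stand-in for the SMPL-X parameters
scene_head = nn.Linear(10, 7)  # stand-in for the ODN/LEN linear layers
x = torch.randn(1, 10)

L_body = lambda: body(x).pow(2).mean()
L_scene = lambda: scene_head(x).pow(2).mean()
L_joint = lambda: (body(x).sum() - scene_head(x).sum()).pow(2)

body_opt = torch.optim.LBFGS(body.parameters())        # within-body fitting
scene_opt = torch.optim.Adam(scene_head.parameters())  # scene linear layers

def body_closure(extra=None):
    # L-BFGS re-evaluates the objective, so gradients are rebuilt each call.
    body_opt.zero_grad()
    loss = L_body() if extra is None else L_body() + extra()
    loss.backward()
    return loss

# Stage I: optimize body and scene independently.
for _ in range(5):
    body_opt.step(body_closure)
for _ in range(5):
    scene_opt.zero_grad()
    L_scene().backward()
    scene_opt.step()

# Stage II: alternate body/scene steps with the added joint losses.
for _ in range(5):
    body_opt.step(lambda: body_closure(L_joint))
    scene_opt.zero_grad()
    (L_scene() + L_joint()).backward()
    scene_opt.step()
```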
3D Object and Human Pose Estimation.
To show the efficacy of our method for holistic scene understanding, we quantitatively evaluate both 3D object detection and 3D human pose estimation on PiGraphs [32]. No previous works for holistic scene understanding have attempted mesh-level reconstruction of the scene and human body; both [14] and [4] output 3D bounding boxes of objects, and [4] additionally outputs 3D human keypoints. Therefore, we evaluate on the same tasks as these baselines. We note that since our approach is fully based on physical constraints from externally available mesh models, we do not use any of the 3D annotations in PiGraphs for training, as these baselines do. However, we still outperform them (Table 1), showing the power of leveraging the rich shape information available through meshes.

Following [14], for object detection evaluation, we report mean 3D bounding box IoU, as well as 2D IoU between the 2D projections of the 3D object bounding boxes and the ground truth 2D boxes. For 3D human keypoint evaluation, we extract the body joints from the fitted SMPL-X body model and keep only the ones used in [24] [4], which are a subset of the SMPL-X body joints. As in [4], we compute the Euclidean distance between the estimated 3D joints and the 3D ground truth, and average over all joints. For 2D evaluation, we project the estimated 3D keypoints back onto the 2D image plane and compute the pixel distance to the ground truth.

The quantitative results for both tasks in Table 1 show that our model outperforms both [14] and [4] on the 3D object detection task, and [24] [4] on the 3D pose estimation task, which illustrates the effectiveness of our method. The boost in 3D performance is significant, because a large source of error in the baseline models comes from inaccurate depth estimation of the objects or the humans. Depth estimation from single view images is generally a difficult problem because 2D visual features are limited in suggesting depth information. We show that the constraints in our joint optimization are effective in helping to disambiguate the depth of the objects. The improvement in the object bounding box IoUs suggests that applying fine-grained constraints at the mesh level helps refine coarser details of the objects as well.
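For reference, the two pose metrics reduce to the following sketch (the pinhole projection with intrinsics K and the function names are assumptions):

```python
import numpy as np

def mpje_3d(pred, gt):
    """3D (m) metric: mean per-joint Euclidean error in meters.
    pred, gt: (J, 3) arrays of corresponding 3D keypoints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpje_2d(pred_3d, gt_2d, K):
    """2D (pix) metric: project estimated 3D joints with intrinsics K and
    average the pixel distance to the 2D ground truth. pred_3d: (J, 3);
    gt_2d: (J, 2); K: (3, 3)."""
    proj = pred_3d @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(uv - gt_2d, axis=-1).mean()
```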
Human Mesh Estimation.
We quantitatively evaluate our human body mesh estimation results on PROX Quantitative [11] (Table 2). We follow the evaluation protocol of [11], and report the mean per-joint error without/with Procrustes alignment (denoted "PJE" / "p.PJE"), and the mean vertex-to-vertex error (denoted "V2V" / "p.V2V"). Procrustes alignment is a common trick to adjust the predicted 3D vertices for errors in translation, rotation, and scaling. We include the Procrustes-aligned numbers for completeness, but note that since our method optimizes all aspects of the human body, including translation, rotation and scaling, V2V and PJE are the more meaningful quantitative metrics for evaluating the overall quality of the predicted 3D mesh vertices.

We compare our body mesh reconstruction method with [11], the state-of-the-art human body mesh reconstruction method on PROX Quantitative. [11] shares the same body loss (L_body) as us; however, it imposes contact (E_C) and collision (E_P) constraints between the human mesh and ground truth 3D scene scans. In our method, we consider an estimated scene mesh in formulating our losses instead. Therefore, in Table 2, we include the quantitative performance of [11]'s models using ground truth 3D scene scans for reference, and additionally include the following three baseline models for a fair comparison with our model:

• [11] (body terms only): [11], without using the scene terms (since these utilize a ground truth scene).

• [11] + estimated scene: [11] with their contact (E_C) and collision (E_P) terms calculated using the 3D scene mesh predicted by [25] (our base scene model).

• [11] + w/in-scene losses: [11] with their contact (E_C) and collision (E_P) terms calculated using an optimized scene mesh (the base scene model [25], plus our within-scene losses L_scene).

Table 2. Human body mesh estimation on PROX Quantitative (errors in mm; lower is better).

with ground truth 3D scene scans
Method                      V2V      PJE      p.V2V   p.PJE
[11] (including E_C)        208.03   208.57   72.76   60.95
[11] (including E_P)        190.07   190.38   73.73   62.38
Full [11] (E_C + E_P)       167.08   166.51   71.97   61.14

without ground truth 3D scene scans
[11] (body terms only)      220.27   218.06   73.24   –
[11] + estimated scene      224.53   220.47   73.49   61.32
[11] + w/in-scene losses    212.48   209.67   73.13   62.06
Ours                        –        –        –       –

Our model outperforms all three baselines that do not use ground truth scene scans (bottom half of Table 2), and is competitive with [11]'s models that use ground truth scene scans (top half). This shows the effectiveness of our scene mesh estimation in refining the human meshes, and that simply adding estimated scenes to [11] is not sufficient. The gap between [11] + w/in-scene losses and Ours highlights the utility of our joint optimization process.
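Since p.V2V and p.PJE are computed after a similarity alignment, we sketch the standard Umeyama-style closed form below for completeness; this is not the authors' exact implementation:

```python
import numpy as np

def procrustes_align(pred, gt):
    """Align predicted vertices to ground truth up to scale, rotation and
    translation before computing p.V2V / p.PJE. pred, gt: (N, 3)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    Xp, Xg = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(Xp.T @ Xg)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:   # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = (U @ Vt).T
    scale = S.sum() / (Xp ** 2).sum()
    return scale * Xp @ R.T + mu_g   # aligned prediction
```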
To analyze the contributions of the different losses, we compare variants of our proposed full model. In Tables 3 and 4, we compare quantitative results on the human body mesh prediction and 3D object detection tasks as we remove each of the losses in Eqs. 7 and 8, except for the essential body loss (L_body) and box re-projection loss (L_{J_scene}). We observe that all of the losses are essential for improving both the scene estimation and body estimation tasks. The joint losses L_{joint}^{obj−ground}, L_{joint}^{C}, and L_{joint}^{P} play an essential role in jointly improving global consistency, which boosts the performance of the human body mesh reconstruction task. In particular, L_{joint}^{P} appears to be the most important term in refining the body meshes. The L_{joint}^{obj−ground} and L_{joint}^{body−ground} terms improve the ground plane estimation, which helps the 3D object detection task significantly.

Table 3. Ablation results for human mesh estimation on PROX Quantitative.

Metrics                       V2V   PJE   p.V2V   p.PJE
w/o L_{P_scene}               –     –     –       –
w/o L_{joint}^{body−ground}   –     –     –       –
w/o L_{joint}^{obj−ground}    –     –     –       –
w/o L_{joint}^{C}             –     –     –       –
w/o L_{joint}^{P}             –     –     –       –

Table 4. Ablation results on PiGraphs. For the IoU metrics, higher values indicate better performance. For the pose estimation metrics (2D (pix) and 3D (m)), lower values are better.

                              Object Detection         Pose Estimation
Metrics                       2D IoU (%)  3D IoU (%)   2D (pix)  3D (m)
w/o L_{P_scene}               –           –            –         –
w/o L_{joint}^{body−ground}   –           –            –         –
w/o L_{joint}^{obj−ground}    –           –            –         –
w/o L_{joint}^{C}             –           –            –         –
w/o L_{joint}^{P}             –           –            –         –
Full model                    –           –            –         –

Figure 3 shows qualitative results of our model on the PROX Quantitative, PROX Qualitative, and PiGraphs datasets. We observe that the direct output of the scene model (pretrained on SUN RGB-D) without our proposed holistic optimization contains inaccurate object attributes. Our proposed joint optimization method improves the overall accuracy of the predictions by constraining the orientations, positions and sizes of the objects to be realistic with respect to each other. The human pose estimation task also helps the optimization of the scene: the chair that the human sits on tends to have a more accurate orientation than the other two chairs (column 2). Moreover, the initially estimated ground plane can be very inaccurate (column 3), and our joint optimization process adjusts the ground plane and improves the locations of all objects at the same time. Although not obvious from the qualitative results in Figure 3, the estimated scene mesh helps refine the 3D locations of the human body mesh vertices through the joint losses, which is supported by our quantitative results in Tables 2 and 3.

Figure 3. Left half: Qualitative results on the PROX Quantitative and PROX Qualitative datasets; the left frame is from PROX Quantitative, the right frame from PROX Qualitative. Right half: Qualitative results on the PiGraphs dataset; from top to bottom are the RGB input, the direct output of the scene and body meshes without any optimization, and the final meshes with the joint optimization.
5. Conclusion
In this work, we focus on the challenging problem of single view holistic reconstruction and joint optimization of human pose together with the static scene. We propose the first end-to-end model for reconstructing and jointly estimating both 3D human pose and 3D scene at the mesh level. Through a novel joint optimization process that incorporates a comprehensive set of physical rules and priors, we show that our model outperforms state-of-the-art methods on either 3D scene understanding or 3D human pose estimation, on the PiGraphs and PROX Quantitative datasets.

References

[1] Luca Ballan, Aparna Taneja, Jürgen Gall, Luc Van Gool, and Marc Pollefeys. Motion capture of hands in action using discriminative salient points. In European Conference on Computer Vision, pages 640–653. Springer, 2012.
[2] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
[3] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008, 2018.
[4] Yixin Chen, Siyuan Huang, Tao Yuan, Siyuan Qi, Yixin Zhu, and Song-Chun Zhu. Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose estimation with human-object interaction and physical commonsense. In Proceedings of the IEEE International Conference on Computer Vision, pages 8648–8657, 2019.
[5] Lin Gao, Tong Wu, Yu-Jie Yuan, Ming-Xian Lin, Yu-Kun Lai, and Hao Zhang. TM-NET: Deep generative networks for textured meshes. arXiv preprint arXiv:2010.06217, 2020.
[6] Stuart Geman. Statistical methods for tomographic image reconstruction. Bull. Int. Stat. Inst, 4:5–21, 1987.
[7] Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 9785–9795, 2019.
[8] Helmut Grabner, Juergen Gall, and Luc Van Gool. What makes a chair a chair? In CVPR 2011, pages 1529–1536. IEEE, 2011.
[9] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3D surface generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 216–224, 2018.
[10] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019.
[11] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J Black. Resolving 3D human pose ambiguities with 3D scene constraints. In Proceedings of the IEEE International Conference on Computer Vision, pages 2282–2292, 2019.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[13] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
[14] Siyuan Huang, Siyuan Qi, Yinxue Xiao, Yixin Zhu, Ying Nian Wu, and Song-Chun Zhu. Cooperative holistic scene understanding: Unifying 3D object, layout, and camera pose estimation. In Advances in Neural Information Processing Systems, pages 207–218, 2018.
[15] Siyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, and Song-Chun Zhu. Holistic 3D scene parsing and reconstruction from a single RGB image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 187–203, 2018.
[16] Moos Hueting, Pradyumna Reddy, Vladimir Kim, Ersin Yumer, Nathan Carr, and Niloy Mitra. SeeThrough: Finding chairs in heavily occluded indoor scene images. arXiv preprint arXiv:1710.10473, 2017.
[17] Hamid Izadinia, Qi Shan, and Steven M Seitz. IM2CAD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5134–5143, 2017.
[18] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7122–7131, 2018.
[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] Lin Li, Salman Khan, and Nick Barnes. Silhouette-assisted 3D object instance reconstruction from a cluttered scene. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[21] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):1–16, 2015.
[22] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE International Conference on Computer Vision, pages 5442–5451, 2019.
[23] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649, 2017.
[24] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG), 36(4):1–14, 2017.
[25] Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, and Jian Jun Zhang. Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 55–64, 2020.
[26] Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, and Jian Jun Zhang. Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 55–64, 2020.
[27] Jorge Nocedal and Stephen J Wright. Nonlinear equations. Numerical Optimization, pages 270–302, 2006.
[28] Junyi Pan, Xiaoguang Han, Weikai Chen, Jiapeng Tang, and Kui Jia. Deep mesh reconstruction from single RGB images via topology modification networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 9964–9973, 2019.
[29] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.
[30] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[32] Manolis Savva, Angel X Chang, Pat Hanrahan, Matthew Fisher, and Matthias Nießner. PiGraphs: Learning interaction snapshots from observations. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016.
[33] Alexander G Schwing, Sanja Fidler, Marc Pollefeys, and Raquel Urtasun. Box in the box: Joint 3D layout and object reasoning from single images. In Proceedings of the IEEE International Conference on Computer Vision, pages 353–360, 2013.
[34] Daeyun Shin, Zhile Ren, Erik B Sudderth, and Charless C Fowlkes. 3D scene reconstruction with multi-layer depth and epipolar transformers. In Proceedings of the IEEE International Conference on Computer Vision, pages 2172–2182, 2019.
[35] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 567–576, 2015.
[36] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545, 2018.
[37] Shubham Tulsiani, Saurabh Gupta, David F Fouhey, Alexei A Efros, and Jitendra Malik. Factoring shape, pose, and layout from the 2D image of a 3D scene. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 302–310, 2018.
[38] Dimitrios Tzionas, Luca Ballan, Abhilash Srikantha, Pablo Aponte, Marc Pollefeys, and Juergen Gall. Capturing hands in action using discriminative salient points and physics simulation. International Journal of Computer Vision, 118(2):172–193, 2016.
[39] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.
[40] Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. Monocular 3D pose and shape estimation of multiple people in natural scenes: The importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2148–2157, 2018.

Appendix A. Additional Training Details
We direct the readers to [26] for the camera/world system setting and details on the network architectures of the ODN, LEN and MGN. Here we elaborate on the training details.
Stage I
In Stage I, we optimize the SMPL-X body model using only the within-body (L_body) losses. We instantiate a body model for each human in the frame, and use the L-BFGS optimizer [27] to learn the optimal body parameters (e.g. body shape, pose, translation). First, the translation vector of the body model is optimized for a fixed number of iterations with only the human keypoints re-projection loss. This step roughly positions the body model in the camera coordinate system. Then, all the within-body loss terms are considered and the entire body model is optimized. For the scene model, we freeze the MGN and the feature extractor components of ODN and LEN, and use the Adam optimizer [19] with weight decay to back-propagate through the linear layers that predict the object bounding box attributes (e.g. centroid, orientation), the camera pose, and the 3D room layout. For this part, only the within-scene (L_scene) losses are used; the scene model is optimized for a fixed number of iterations per frame.

Stage II
In Stage II, we add the global consistency losses (L_joint) and continue fine-tuning all modules. In this stage, we additionally fix the orientations of the 3D object and room bounding boxes and the camera pose. We train the linear layers that predict the centroid and the size of the object and room boxes to further refine the 3D locations of the objects and the ground plane of the scene. We use the same optimizers as in Stage I but with reduced learning rates. The body model and scene model are optimized alternately. The hyperparameters used are λ_1 = 1, λ_2 = 0., λ_3 = 10, λ_4 = 20, λ_5 = 1e, λ_6 = 1e.

B. Additional Qualitative Results
In Figure 4 we show qualitative examples from PROX Quantitative [11] where the scene estimation task significantly helps the body estimation task. These are complementary examples to those in the main paper, which showed that the human body estimation task helps the scene estimation.

From Figure 4 we can see that the initial body meshes are either not physically plausible (column ), or are intersecting with the scene (column , ). Using the human-scene joint optimization method proposed in our paper, the final body meshes are much more realistic. Note that since we are overlaying the meshes on the 2D images, we can still see the legs behind the furniture after the joint optimization. However, there is no mesh intersection in the 3D coordinate system.

In Figure 5 we show similar results on PiGraphs and PROX Qualitative. The final body meshes in both examples improve through the human-scene optimization stage. In the PiGraphs example, the human body is lifted to reduce the intersection with the sofa. In the PROX Qualitative example, the right hand of the human is occluded, so the 2D keypoints predicted by OpenPose [3] do not include the keypoints on the right hand. As a result, the initial hand pose is far from the ground truth. However, through the human-scene optimization that encourages contact between the scene and the hands, the hand pose ends up closer to the ground truth.

C. Code
We will publicly release our code on GitHub.
Figures 4 and 5 (row labels): RGB input; initial body mesh (shown together with the final scene mesh in the lower row); final body mesh (shown together with the final scene mesh in the lower row).