Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose
Hongsuk Choi*, Gyeongsik Moon*, and Kyoung Mu Lee
ECE & ASRI, Seoul National University, Korea
{redarknight,mks0601,kyoungmu}@snu.ac.kr

Abstract.
Most of the recent deep learning-based 3D human pose and mesh estimation methods regress the pose and shape parameters of human mesh models, such as SMPL and MANO, from an input image. The first weakness of these methods is an appearance domain gap problem, due to different image appearance between train data from controlled environments, such as a laboratory, and test data from in-the-wild environments. The second weakness is that the estimation of the pose parameters is quite challenging owing to the representation issues of 3D rotations. To overcome the above weaknesses, we propose Pose2Mesh, a novel graph convolutional neural network (GraphCNN)-based system that estimates the 3D coordinates of human mesh vertices directly from the 2D human pose. The 2D human pose as input provides essential human body articulation information, while having a relatively homogeneous geometric property between the two domains. Also, the proposed system avoids the representation issues, while fully exploiting the mesh topology using a GraphCNN in a coarse-to-fine manner. We show that our Pose2Mesh outperforms the previous 3D human pose and mesh estimation methods on various benchmark datasets. The codes are publicly available.
Introduction

3D human pose and mesh estimation aims to recover 3D human joint and mesh vertex locations simultaneously. It is a challenging task due to the depth and scale ambiguity, and the complex human body and hand articulation. There have been diverse approaches to address this problem, and recently, deep learning-based methods have shown noticeable performance improvement.

Most of the deep learning-based methods rely on human mesh models, such as SMPL [36] and MANO [52]. They can be generally categorized into a model-based approach and a model-free approach. The model-based approach trains a network to predict the model parameters and generates a human mesh by decoding them [5-7, 26, 30, 32, 45, 46, 50]. On the contrary, the model-free approach regresses the coordinates of a 3D human mesh directly [15, 31].

(* equal contribution; code available at https://github.com/hongsukchoi/Pose2Mesh_RELEASE)
Fig. 1: The overall pipeline of Pose2Mesh.

Both approaches compute the 3D human pose by multiplying the output mesh with a joint regression matrix, which is defined in the human mesh models [36, 52]. Although the recent deep learning-based methods have shown significant improvement, they have two major drawbacks. First, when tested on in-the-wild data, the methods inherently suffer from the appearance domain gap between controlled and in-the-wild environment data. The data captured from controlled environments [21, 25] is valuable train data in 3D human pose and mesh estimation, because it has accurate 3D annotations. However, due to the significant difference of image appearance between the two domains, such as backgrounds and clothes, an image-based approach cannot fully benefit from the data. The second drawback is that the pose parameters of the human mesh models might not be an appropriate regression target, as addressed in Kolotouros et al. [31]. The SMPL pose parameters, for example, represent 3D rotations in an axis-angle form, which can suffer from the non-unique problem (i.e., periodicity). While many works [26, 32, 45] tried to avoid the periodicity by using a rotation matrix as the prediction target, it still has a non-minimal representation issue.

To resolve the above issues, we propose Pose2Mesh, a graph convolutional system that recovers 3D human pose and mesh from the 2D human pose, in a model-free fashion. It has two advantages over existing methods. First, the proposed system benefits from a relatively homogeneous geometric property of the input 2D poses from controlled and in-the-wild environments. They not only alleviate the appearance domain gap issue, but also provide essential geometric information on the human articulation. Also, the 2D poses can be estimated accurately from in-the-wild images, since many well-performing methods [11, 43, 56, 64] are trained on large-scale in-the-wild 2D human pose datasets [2, 34]. The second advantage is that Pose2Mesh avoids the representation issues of the pose parameters, while exploiting the human mesh topology (i.e., face and edge information). It directly regresses the 3D coordinates of mesh vertices using a graph convolutional neural network (GraphCNN) with graphs constructed from the mesh topology.

We designed Pose2Mesh in a cascaded architecture, which consists of PoseNet and MeshNet. PoseNet lifts the 2D human pose to the 3D human pose. MeshNet takes both 2D and 3D human poses to estimate the 3D human mesh in a coarse-to-fine manner. During the forward propagation, the mesh features are initially processed in a coarse resolution and gradually upsampled to a fine resolution. Figure 1 depicts the overall pipeline of the system.
The experimental results show that the proposed Pose2Mesh outperforms the previous state-of-the-art 3D human pose and mesh estimation methods [26, 30, 31] on various publicly available 3D human body and hand datasets [21, 38, 68]. Particularly, our Pose2Mesh provides the state-of-the-art result on the in-the-wild dataset [38], even when it is trained only on the controlled setting dataset [21]. We summarize our contributions as follows.

• We propose a novel system, Pose2Mesh, that recovers 3D human pose and mesh from the 2D human pose. The input 2D human pose makes Pose2Mesh robust to the appearance domain gap between controlled and in-the-wild environment data.

• Our Pose2Mesh directly regresses the 3D coordinates of a human mesh using a GraphCNN. It avoids the representation issues of the model parameters and leverages the pre-defined mesh topology.

• We show that Pose2Mesh outperforms previous 3D human pose and mesh estimation methods on various publicly available datasets.
Related works

3D human body pose estimation.
Current 3D human body pose estimation methods can be categorized into two approaches according to the input type: an image-based approach and a 2D pose-based approach. The image-based approach takes an RGB image as an input for 3D body pose estimation. Sun et al. [57] proposed to use a compositional loss, which exploits the joint connection structure. Sun et al. [58] employed a soft-argmax operation to regress the 3D coordinates of body joints in a differentiable way. Sharma et al. [54] incorporated a generative model and depth ordering of joints to predict the most reliable 3D pose that corresponds to the estimated 2D pose.

The 2D pose-based approach lifts the 2D human pose to the 3D space. Martinez et al. [39] introduced a simple network that consists of consecutive fully-connected layers, which lifts the 2D human pose to the 3D space. Zhao et al. [66] developed a semantic GraphCNN to use spatial relationships between joint coordinates. Our work follows the 2D pose-based approach, to make Pose2Mesh more robust to the domain difference between the controlled environment of the training set and the in-the-wild environment of the testing set.
3D human body and hand pose and mesh estimation.
A model-based approach trains a neural network to estimate the human mesh model parameters [36, 52]. It has been widely used for 3D human mesh estimation, since it does not necessarily require 3D annotation for mesh supervision. Pavlakos et al. [50] proposed a system that could be supervised only by 2D joint coordinates and silhouettes. Omran et al. [45] trained a network with 2D joint coordinates, which takes human part segmentation as input. Kanazawa et al. [26] utilized an adversarial loss to regress plausible SMPL parameters. Baek et al. [5] trained a CNN to estimate parameters of the MANO model using a neural renderer [28]. Kolotouros et al. [30] introduced a self-improving system that consists of an SMPL parameter regressor and an iterative fitting framework [6].
Recently, the advance of fitting frameworks [6, 48] has motivated a model-free approach, which estimates human mesh coordinates directly. It enabled researchers to obtain 3D mesh annotation, which is essential for the model-free methods, from in-the-wild data. Kolotouros et al. [31] proposed a GraphCNN, which learns the deformation of the template body mesh to the target body mesh. Ge et al. [15] adopted a GraphCNN to estimate vertices of a hand mesh. Moon et al. [44] proposed a new heatmap representation, called lixel, to recover 3D human meshes.

Our Pose2Mesh differs from the above methods, which are image-based, in that it uses the 2D human pose as an input. The proposed system can benefit from the data with 3D annotations, which are captured from controlled environments [21, 25], without the appearance domain gap issue.
GraphCNN for mesh processing.
Recently, many methods consider a mesh as a graph structure and process it using a GraphCNN, since it can fully exploit the mesh topology compared with simple stacked fully-connected layers. Wang et al. [63] adopted a GraphCNN to learn a deformation from an initial ellipsoid mesh to the target object mesh in a coarse-to-fine manner. Verma et al. [61] proposed a novel graph convolution operator and evaluated it on the shape correspondence problem. Ranjan et al. [51] also proposed a GraphCNN-based VAE, which learns a latent space of human face meshes in a hierarchical manner.
PoseNet

PoseNet estimates the root joint-relative 3D pose P^{3D} ∈ R^{J×3} from the 2D pose, where J denotes the number of human joints. We define the root joint of the human body and hand as pelvis and wrist, respectively. The estimated 2D pose often contains errors [53], especially under severe occlusions or challenging poses. To make PoseNet robust to the errors, we synthesize 2D input poses by adding realistic errors on the groundtruth 2D pose, following [10, 43], during the training stage. We represent the estimated 2D pose or the synthesized 2D pose as P^{2D} ∈ R^{J×2}.

We apply standard normalization to P^{2D}, following [10, 62]. For this, we subtract the mean from P^{2D} and divide it by the standard deviation, which becomes P̄^{2D}. The mean and the standard deviation of P^{2D} represent the 2D location and scale of the subject, respectively. This normalization is necessary because P^{3D} is independent of the scale and location of the 2D input pose P^{2D}.

The architecture of PoseNet is based on that of [10, 39]. The normalized 2D input pose P̄^{2D} is converted to a 4096-dimensional feature vector through a fully-connected layer. Then, it is fed to two residual blocks [20]. Finally, the output feature vector of the residual blocks is converted to a (3J)-dimensional vector, which represents P^{3D}, by a fully-connected layer.

Fig. 2: The coarsening process initially generates multiple coarse graphs from G_M, and adds fake nodes without edges to each graph, following [13]. The numbers of vertices range from 96 to 12288 and from 68 to 1088, for body and hand meshes, respectively.
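For concreteness, the standard normalization of the 2D input pose described above can be sketched in a few lines of NumPy; the function name and the epsilon guard are our own additions, not part of the released implementation.

```python
import numpy as np

def normalize_2d_pose(pose_2d: np.ndarray) -> np.ndarray:
    """Standard normalization of a 2D pose (J x 2): subtract the mean and
    divide by the standard deviation, removing subject location and scale."""
    mean = pose_2d.mean(axis=0, keepdims=True)   # 2D location of the subject
    std = pose_2d.std()                          # scale of the subject
    return (pose_2d - mean) / (std + 1e-8)       # epsilon guards degenerate poses
```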
Fig. 3: The network architecture of MeshNet.
We train the PoseNet by minimizing the L1 distance between the predicted 3D pose and the groundtruth. The loss function L_pose is defined as follows:

L_pose = ‖P^{3D} − P^{3D*}‖_1,  (1)

where the asterisk indicates the groundtruth.

MeshNet

MeshNet concatenates P̄^{2D} and P^{3D} into P ∈ R^{J×5}. Then, it estimates the root joint-relative 3D mesh M ∈ R^{V×3} from P, where V denotes the number of human mesh vertices. To this end, MeshNet uses the spectral graph convolution [8, 55], which can be defined as the multiplication of a signal x ∈ R^N with a filter g_θ = diag(θ) in the Fourier domain as follows:

g_θ ∗ x = U g_θ U^T x,  (2)

where the graph Fourier basis U is the matrix of the eigenvectors of the normalized graph Laplacian L [12], and U^T x denotes the graph Fourier transform of x. Specifically, to reduce the computational complexity, we design MeshNet based on the Chebyshev spectral graph convolution [13].
Graph construction. We construct a graph of P, G_P = (V_P, A_P), where V_P = P = {p_i}_{i=1}^J is a set of J human joints, and A_P ∈ {0, 1}^{J×J} is an adjacency matrix. A_P defines the edge connections between the joints based on the human skeleton and symmetrical relationships [9], where (A_P)_{ij} = 1 if joints i and j are the same or connected, and (A_P)_{ij} = 0 otherwise. The normalized Laplacian is computed as L_P = I_J − D_P^{−1/2} A_P D_P^{−1/2}, where I_J is the identity matrix, and D_P is the diagonal matrix which represents the degree of each joint in V_P as (D_P)_{ii} = Σ_j (A_P)_{ij}. The scaled Laplacian is computed as L̃_P = 2 L_P / λ_max − I_J.
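The Laplacian construction above amounts to a few lines of NumPy. The following is an illustrative sketch; the helper name and the zero-degree guard for edge-less fake nodes are assumptions of this sketch.

```python
import numpy as np

def scaled_laplacian(A: np.ndarray) -> np.ndarray:
    """Scaled Laplacian L~ = 2 L / lambda_max - I from a binary adjacency
    matrix A (diagonal set to 1, as joints are 'connected' to themselves)."""
    deg = A.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)  # isolated nodes
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(L).max()        # largest eigenvalue of L
    return 2.0 * L / lam_max - np.eye(len(A))
```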
Spectral convolution on graph. Then, MeshNet performs the spectral graph convolution on G_P, which is defined as follows:

F_out = Σ_{k=0}^{K−1} T_k(L̃_P) F_in Θ_k,  (3)

where F_in ∈ R^{J×f_in} and F_out ∈ R^{J×f_out} are the input and output feature maps respectively, T_k(x) = 2x T_{k−1}(x) − T_{k−2}(x) is the Chebyshev polynomial [17] of order k, and Θ_k ∈ R^{f_in×f_out} is the k-th Chebyshev coefficient matrix, whose elements are the trainable parameters of the graph convolutional layer. f_in and f_out are the input and output feature dimensions respectively. The initial input feature map F_in is P in practice, where f_in = 5. This graph convolution is K-localized, which means at most K-hop neighbor nodes from each node are affected [13, 29], since it is a K-order polynomial in the Laplacian. Our MeshNet sets K = 3 for all graph convolutional layers, following [15].

Coarse-to-fine mesh upsampling. We gradually upsample G_P to the graph of M, G_M = (V_M, A_M), where V_M = M = {m_i}_{i=1}^V is a set of V human mesh vertices, and A_M ∈ {0, 1}^{V×V} is an adjacency matrix defining edges of the human mesh. To this end, we apply the graph coarsening [14] technique to G_M, which creates various resolutions of graphs, {G^c_M = (V^c_M, A^c_M)}_{c=0}^C, where C denotes the number of coarsening steps, following Defferrard et al. [13]. Figure 2 shows the coarsening process and the balanced binary tree structure of mesh graphs, where the i-th vertex in G^{c+1}_M is a parent node of the (2i−1)-th and 2i-th vertices in G^c_M, and 2|V^{c+1}_M| = |V^c_M|. i starts from 1. The final output of MeshNet is V^0_M, which is converted to the mesh M by a pre-defined indices mapping.

During the forward propagation, MeshNet first upsamples G_P to the coarsest mesh graph G^C_M by reshaping and a fully-connected layer. Then, it performs the spectral graph convolution on each resolution of mesh graphs as follows:

F_out = Σ_{k=0}^{K−1} T_k(L̃^c_M) F_in Θ_k,  (4)

where L̃^c_M denotes the scaled Laplacian of G^c_M, and the other notations are defined in the same manner as Equation 3. Following [15], MeshNet performs mesh upsampling by copying features of each parent vertex in G^{c+1}_M to the corresponding children vertices in G^c_M. The upsampling process is defined as follows:

F_c = ψ(F^T_{c+1})^T,  (5)

where F_c ∈ R^{|V^c_M|×f_c} is the first feature map of G^c_M, F_{c+1} ∈ R^{|V^{c+1}_M|×f_{c+1}} is the last feature map of G^{c+1}_M, ψ: R^{f_{c+1}×|V^{c+1}_M|} → R^{f_{c+1}×|V^c_M|} denotes a nearest-neighbor upsampling function, and f_c and f_{c+1} are the feature dimensions of vertices in F_c and F_{c+1} respectively. The nearest-neighbor upsampling function copies the feature of the i-th vertex in G^{c+1}_M to the (2i−1)-th and 2i-th vertices in G^c_M. To facilitate the learning process, we additionally incorporate a residual connection between each resolution. Figure 3 shows the overall architecture of MeshNet.
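The Chebyshev convolution of Equations 3-4 and the parent-to-children upsampling of Equation 5 can be sketched in PyTorch as follows. This is a simplified illustration rather than the released implementation; the module names, weight initialization, and tensor shapes are our own choices.

```python
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    """Chebyshev spectral graph convolution of order K (Eq. 3)."""
    def __init__(self, f_in: int, f_out: int, K: int = 3):
        super().__init__()
        # One coefficient matrix Theta_k per Chebyshev order.
        self.theta = nn.Parameter(torch.randn(K, f_in, f_out) * 0.01)

    def forward(self, x: torch.Tensor, L: torch.Tensor) -> torch.Tensor:
        # x: (N, f_in) node features, L: (N, N) scaled Laplacian.
        Tx = [x, L @ x]                         # T_0(L)x = x, T_1(L)x = Lx
        for _ in range(2, self.theta.shape[0]):
            Tx.append(2 * L @ Tx[-1] - Tx[-2])  # Chebyshev recurrence
        return sum(t @ th for t, th in zip(Tx, self.theta))

def upsample(f_coarse: torch.Tensor) -> torch.Tensor:
    """Eq. 5: copy each parent vertex feature (row i, 0-indexed) to its two
    children (rows 2i and 2i+1) in the finer graph."""
    return f_coarse.repeat_interleave(2, dim=0)
```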
To train our MeshNet, we use four loss functions.

Vertex coordinate loss.
We minimize the L1 distance between the predicted mesh coordinates M and the groundtruth, which is defined as follows:

L_vertex = ‖M − M*‖_1,  (6)

where the asterisk indicates the groundtruth.

Joint coordinate loss.
We use the L1 distance between the groundtruth 3D pose and the 3D pose regressed from the predicted mesh M, to train our MeshNet to estimate mesh vertices aligned with joint locations. The 3D pose is calculated as JM, where J ∈ R^{J×V} is a joint regression matrix defined in the SMPL or MANO model. The loss function is defined as follows:

L_joint = ‖JM − P^{3D*}‖_1,  (7)

where the asterisk indicates the groundtruth.

Surface normal loss.
We supervise normal vectors of an output mesh surface to be consistent with the groundtruth. This consistency loss improves surface smoothness and local details [63]. Thus, we define the loss function L_normal as follows:

L_normal = Σ_f Σ_{{i,j}⊂f} | ⟨ (m_i − m_j) / ‖m_i − m_j‖ , n*_f ⟩ |,  (8)

where f and n*_f denote a triangle face in the human mesh and a groundtruth unit normal vector of f, respectively. m_i and m_j denote the i-th and j-th vertices in f.
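A hedged PyTorch sketch of the normal consistency loss is given below; the face-index layout and function name are illustrative, and the edge length loss defined next follows the same per-face pattern.

```python
import torch

def normal_loss(verts: torch.Tensor, faces: torch.Tensor,
                gt_normals: torch.Tensor) -> torch.Tensor:
    """Eq. 8: edges of each predicted face should be orthogonal to the
    groundtruth unit face normal. verts: (V, 3), faces: (F, 3) long tensor,
    gt_normals: (F, 3) unit vectors."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    loss = 0.0
    for a, b in ((v0, v1), (v0, v2), (v1, v2)):   # the three edges {i, j} of f
        e = (a - b) / (a - b).norm(dim=1, keepdim=True).clamp(min=1e-8)
        loss = loss + (e * gt_normals).sum(dim=1).abs().sum()
    return loss
```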
Surface edge loss. We define an edge length consistency loss between predicted and groundtruth edges, following [63]. The edge loss is effective in recovering the smoothness of the hands, feet, and mouth, which have dense vertices. The loss function L_edge is defined as follows:

L_edge = Σ_f Σ_{{i,j}⊂f} | ‖m_i − m_j‖ − ‖m*_i − m*_j‖ |,  (9)

where f and the asterisk denote a triangle face in the human mesh and the groundtruth, respectively. m_i and m_j denote the i-th and j-th vertices in f.

We define the total loss of our MeshNet, L_mesh, as a weighted sum of all four loss functions:

L_mesh = λ_v L_vertex + λ_j L_joint + λ_n L_normal + λ_e L_edge,  (10)

where λ_v = 1, λ_j = 1, λ_n = 0.1, and λ_e = 20.

Implementation details

PyTorch [47] is used for implementation. We first pre-train our PoseNet, and then train the whole network, Pose2Mesh, in an end-to-end manner. Empirically, our two-step training strategy gives better performance than one-step training. The weights are updated by the RMSprop optimization [59] with a mini-batch size of 64. We pre-train PoseNet for 60 epochs with a learning rate of 10^-3. The learning rate is reduced by a factor of 10 after the 30th epoch. After integrating the pre-trained PoseNet into Pose2Mesh, we train the whole network for 15 epochs with a learning rate of 10^-4. The learning rate is reduced by a factor of 10 after the 12th epoch. In addition, we set λ_e to 0 until the 7th epoch of the second training stage, since it tends to cause local optima at the early training phase. We used four NVIDIA RTX 2080 Ti GPUs for Pose2Mesh training, which took at least half a day and at most two and a half days, depending on the training datasets. At inference time, we use 2D pose outputs from Sun et al. [56] and Xiao et al. [64]. They run at 5 fps and 67 fps respectively, and our Pose2Mesh runs at 37 fps. Thus, the proposed system can process from 4 fps to 22 fps in practice, which shows its applicability to real-time applications.

Experiment

Human3.6M. Human3.6M [21] is a large-scale indoor 3D body pose benchmark, which consists of 3.6M video frames. The groundtruth 3D poses are obtained using a motion capture system, but there are no groundtruth 3D meshes. As a result, for 3D mesh supervision, most of the previous 3D pose and mesh estimation works [26, 30, 31] used pseudo-groundtruth obtained from Mosh [35]. However, because of the license issue, the pseudo-groundtruth from Mosh is not currently publicly accessible. Thus, we generate new pseudo-groundtruth 3D meshes by fitting SMPL parameters to the 3D groundtruth poses using SMPLify-X [48]. For a fair comparison, we trained and tested previous state-of-the-art methods on the obtained groundtruth using their officially released code. Following [26, 49], all methods are trained on 5 subjects (S1, S5, S6, S7, S8) and tested on 2 subjects (S9, S11).
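For reference, the two-stage training schedule from the implementation details above can be summarized as follows. This is a sketch, not the released code: the model, data loader, and loss objects are placeholders, and the learning-rate values follow the text.

```python
import torch

def train_pose2mesh(posenet, pose2mesh, pose_loader, mesh_loader,
                    loss_pose, loss_mesh):
    """Two-stage schedule: pre-train PoseNet, then train the whole network."""
    # Stage 1: PoseNet, 60 epochs, lr reduced by 10x after the 30th epoch.
    opt = torch.optim.RMSprop(posenet.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[30], gamma=0.1)
    for epoch in range(60):
        for pose2d, gt_pose3d in pose_loader:     # mini-batch size 64
            opt.zero_grad()
            loss_pose(posenet(pose2d), gt_pose3d).backward()
            opt.step()
        sched.step()
    # Stage 2: whole network, 15 epochs, lr reduced by 10x after the 12th
    # epoch; the edge loss weight lambda_e stays 0 until the 7th epoch.
    opt = torch.optim.RMSprop(pose2mesh.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[12], gamma=0.1)
    for epoch in range(15):
        lam_e = 0.0 if epoch < 7 else 20.0
        for pose2d, gt_mesh, gt_pose3d in mesh_loader:
            opt.zero_grad()
            loss_mesh(pose2mesh(pose2d), gt_mesh, gt_pose3d, lam_e).backward()
            opt.step()
        sched.step()
```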
Table 1: The performance comparison between four combinations of regression target and network design tested on Human3.6M. 'no. param.' denotes the number of parameters of a network, which estimates SMPL parameters or vertex coordinates from the output of PoseNet.

                  FC                               GraphCNN
target            MPJPE   PA-MPJPE  no. param.     MPJPE   PA-MPJPE  no. param.
SMPL param.       72.8    55.5      17.3M          79.1    59.1      13.5M
vertex coord.     119.6   95.1      37.5M          64.9    48.0      8.8M

Table 2: The performance comparison on Human3.6M between two upsampling schemes. GPU mem. and fps denote the required memory during training and fps at inference time, respectively.

method           GPU mem.  fps  MPJPE
direct           10G       24   65.3
coarse-to-fine   6G        37   64.9
Table 3: The MPJPE comparison between four architectures tested on 3DPW.

architecture             MPJPE
2D → mesh                101.1
2D → 3D → mesh           103.2
2D → (2D + 3D) → mesh    100.5

We report our performance for the 3D pose using two evaluation metrics. One is mean per joint position error (MPJPE) [21], which measures the Euclidean distance in millimeters between the estimated and groundtruth joint coordinates after aligning the root joint. The other one is PA-MPJPE, which calculates MPJPE after further alignment (i.e., Procrustes analysis (PA) [16]). JM is used for the estimated joint coordinates. We only evaluate 14 joints out of 17 estimated joints, following [26, 30, 31, 50].

3DPW. 3DPW [38] is an in-the-wild dataset, which we use only for evaluation. JM, whose joint set follows that of Human3.6M, is evaluated for MPJPE as above. MPVPE measures the Euclidean distance in millimeters between the estimated and groundtruth vertex coordinates, after aligning the root joint.
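The two evaluation metrics can be sketched as follows, assuming (J, 3) NumPy arrays. The Procrustes step uses the standard SVD solution; the function names are our own.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray, root: int = 0) -> float:
    """MPJPE (mm): mean Euclidean joint distance after root-joint alignment."""
    pred = pred - pred[root:root + 1]
    gt = gt - gt[root:root + 1]
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """PA-MPJPE: MPJPE after Procrustes (similarity transform) alignment."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(p.T @ g)             # optimal rotation via SVD
    d = np.sign(np.linalg.det(U @ Vt))            # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    scale = (s[0] + s[1] + d * s[2]) / (p ** 2).sum()
    aligned = scale * (p @ R) + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```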
COCO. COCO [34] is an in-the-wild dataset with various 2D annotations such as detection and human joints. To exploit this dataset for 3D mesh learning, Kolotouros et al. [30] fitted SMPL parameters to 2D joints using SMPLify [6]. Following them, we use the processed data for training.
MuCo-3DHP.
MuCo-3DHP [41] is synthesized from the existing MPI-INF-3DHP 3D single-person pose estimation dataset [40]. It consists of 200K frames, and half of them have augmented backgrounds. For the background augmentation, we use images of COCO that do not include humans, to follow Moon et al. [42]. Following them, we use this dataset only for training.
FreiHAND.
FreiHAND [68] is a large-scale 3D hand pose and mesh dataset. It consists of a total of 134K frames for training and testing. Following Zimmermann et al. [68], we report PA-MPVPE, F-scores, and additionally PA-MPJPE of Pose2Mesh. JM is evaluated for the joint errors.

Table 4: The upper bounds of the two different graph convolutional networks that take a 2D pose and a 3D pose. Tested on Human3.6M.

test input          architecture   MPJPE  PA-MPJPE
2D pose GT          2D → mesh      55.5   38.4
3D pose from [42]   3D → mesh      56.3   43.2
3D pose GT          3D → mesh      29.0   23.0

Ablation study

To analyze each component of the proposed system, we trained different networks on Human3.6M, and evaluated them on Human3.6M and 3DPW. The test 2D input poses used in the Human3.6M and 3DPW evaluation are outputs from Integral Regression [58] and HRNet [56] respectively, using groundtruth bounding boxes.
Regression target and network design.
To demonstrate the effectiveness of regressing the 3D mesh vertex coordinates using a GraphCNN, we compare MPJPE and PA-MPJPE of four different combinations of the regression target and the network design in Table 1. First, vertex-GraphCNN, our Pose2Mesh, substantially improves the joint errors compared to vertex-FC, which regresses vertex coordinates with a network of fully-connected layers. This proves the importance of exploiting the human mesh topology with a GraphCNN when estimating the 3D vertex coordinates. Second, vertex-GraphCNN provides better performance than both networks estimating SMPL parameters, while maintaining a considerably smaller number of network parameters. Taken together, the effectiveness of our mesh coordinate regression scheme using a GraphCNN is clearly justified.

In this comparison, the same PoseNet and cascaded architecture are employed for all networks. On top of the PoseNet, vertex-FC and param-FC used a series of fully-connected layers, whereas param-GraphCNN added fully-connected layers on top of Pose2Mesh. For a fair comparison, when training param-FC and param-GraphCNN, we also supervised the reconstructed mesh from the predicted SMPL parameters with L_vertex and L_joint. The networks estimating SMPL parameters incorporated Zhou et al.'s method [67] for continuous rotations, following [30].

Coarse-to-fine mesh upsampling.
We compare a coarse-to-fine mesh upsampling scheme and a direct mesh upsampling scheme. The direct upsampling method performs graph convolution on the lowest resolution mesh until the middle layer of MeshNet, and then directly upsamples it to the highest one (e.g., 96 to 12288 for the human body mesh). While it has the same number of graph convolution layers and almost the same number of parameters, our coarse-to-fine model consumes half as much GPU memory and runs 1.5 times faster than the direct upsampling method. This is because graph convolution on the highest resolution takes much more time and memory than graph convolution on lower resolutions. In addition, the coarse-to-fine upsampling method provides a slightly lower joint error, as shown in Table 2. These results confirm the effectiveness of our coarse-to-fine upsampling strategy.
Table 5: The accuracy comparison between state-of-the-art methods and Pose2Mesh on Human3.6M. The dataset names on top are training sets.

                    Human3.6M           Human3.6M + COCO
method              MPJPE  PA-MPJPE     MPJPE  PA-MPJPE
HMR [26]            184.7  88.4         153.2  85.5
GraphCMR [31]       148.0  104.6        78.3   59.5
SPIN [30]           85.6   55.6         72.9   51.9
Pose2Mesh (Ours)    64.9   48.0         67.9   49.9
Table 6: The accuracy comparison between state-of-the-art methods and Pose2Mesh on 3DPW. The dataset names on top are training sets.

                          Human3.6M                 Human3.6M + COCO
method                    MPJPE  PA-MPJPE  MPVPE    MPJPE  PA-MPJPE  MPVPE
HMR [26]                  377.3  165.7     481.0    300.4  137.2     406.8
GraphCMR [31]             332.5  177.4     380.8    126.5  80.1      144.8
SPIN [30]                 313.8  156.0     344.3    113.1  71.7      122.8
Pose2Mesh (Simple [64])
Pose2Mesh (HR [56])                                 91.4   60.1      109.3
Cascaded architecture analysis.
We analyze the cascaded architecture of Pose2Mesh to demonstrate its validity in Table 3. To be specific, we construct (a) a GraphCNN that directly takes a 2D pose, (b) a cascaded network that predicts mesh coordinates from a 3D pose from a pretrained PoseNet, and (c) our Pose2Mesh. All methods are trained with synthesized 2D poses. First, (a) outperforms (b), which implies that a 3D pose output from PoseNet may lack the geometric information of the 2D input pose. If we concatenate the 3D pose output with the 2D input pose as in (c), it provides the lowest errors. This suggests that depth information in 3D poses could positively affect 3D mesh estimation.

To further verify the superiority of the cascaded architecture, we explore the upper bounds of (a) and of (d) a GraphCNN that takes a 3D pose, in Table 4. To this end, we fed the groundtruth 2D pose and 3D pose to (a) and (d) as test inputs, respectively. Apparently, since the input 3D pose contains additional depth information, the upper bound of (d) is considerably higher than that of (a). We also fed state-of-the-art 3D pose outputs from [42] to (d), to validate the practical potential for performance improvement. Surprisingly, the performance is comparable to the upper bound of (a). Thus, our Pose2Mesh will substantially outperform (a), a graph convolutional network that directly takes a 2D pose, if we can improve the performance of PoseNet.

In summary, the above results prove the validity of the cascaded architecture of Pose2Mesh.
Table 7: The accuracy comparison between state-of-the-art methods and Pose2Mesh on FreiHAND.

method                 PA-MPVPE  PA-MPJPE  F@5 mm  F@15 mm
Hasson et al. [18]     13.2      -         0.436   0.908
Boukhayma et al. [7]   13.0      -         0.435   0.898
FreiHAND [68]          10.7      -         0.529   0.935
Pose2Mesh (Ours)       7.6       7.4       0.683   0.973
Comparison with state-of-the-art methods

Human3.6M. We compare our Pose2Mesh with the previous state-of-the-art 3D body pose and mesh estimation methods on Human3.6M in Table 5. First, when we train all methods only on Human3.6M, our Pose2Mesh significantly outperforms the other methods. However, when we additionally train the methods on COCO, the performance of the previous baselines increases, but that of Pose2Mesh slightly decreases. The performance gain of the other methods is a well-known phenomenon [58] among image-based methods, which tend to generalize better when trained with diverse images from in-the-wild environments. Our Pose2Mesh does not benefit from more images in the same manner, since it only takes the 2D pose. We attribute the performance drop to the fact that the test set and train set of Human3.6M have similar poses, which are from the same action categories; thus, overfitting the network to the poses of Human3.6M can lead to better accuracy. Nevertheless, in both cases, our Pose2Mesh outperforms the previous methods in both MPJPE and PA-MPJPE. The test 2D input poses for Pose2Mesh are estimated by the method of Sun et al. [58] trained on the MPII dataset [3], using groundtruth bounding boxes.
3DPW. We compare MPJPE, PA-MPJPE, and MPVPE of our Pose2Mesh with the previous state-of-the-art 3D body pose and mesh estimation works on 3DPW, which is an in-the-wild dataset, in Table 6. First, when the image-based methods are trained only on Human3.6M, they give extremely high errors. This verifies that the image-based methods suffer from the appearance domain gap between train and test data from controlled and in-the-wild environments, respectively. In fact, since Human3.6M is an indoor dataset from a controlled environment, the image appearance from it is very different from in-the-wild image appearance. On the contrary, the 2D pose-based approach of Pose2Mesh can benefit from the accurate 3D annotations of the lab-recorded 3D datasets [21] without the appearance domain gap issue, utilizing the homogeneous geometric property of 2D poses from different domains. Indeed, Pose2Mesh gives far lower errors on in-the-wild images from 3DPW, even when it is only trained on Human3.6M while other methods are additionally trained on COCO. The experimental results suggest that a 3D pose and mesh estimation approach may not necessarily require 3D data captured from in-the-wild environments, which is extremely challenging to acquire, to give accurate predictions. The test 2D input poses for Pose2Mesh are estimated by HRNet [56] and Simple [64] trained on COCO, using groundtruth bounding boxes.
Table 8: The accuracy comparison between state-of-the-art methods and Pose2Mesh on Human3.6M and 3DPW. Different train sets are used.

                        Human3.6M           3DPW
method                  MPJPE  PA-MPJPE     MPJPE  PA-MPJPE
SMPLify [6]             -      82.3         -      -
Lassner et al. [33]     -      93.9         -      -
HMR [26]                88.0   56.8         -      81.3
NBF [45]                -      59.9         -      -
Pavlakos et al. [50]    -      75.9         -      -
Kanazawa et al. [27]    -      56.9         -      72.6
GraphCMR [31]           -      50.1         -      70.2
Arnab et al. [4]        77.8   54.3         -      72.2
SPIN [30]               -      41.1         -      59.2
Pose2Mesh (Ours)        64.9
The average precision (AP) of [56] and [64] are 85.1 and 82.8 on the 3DPW test set, and 72.1 and 70.4 on the COCO validation set, respectively.
FreiHAND.
We present the comparison between our Pose2Mesh and other state-of-the-art 3D hand pose and mesh estimation works in Table 7. The proposed system outperforms other methods in various metrics, including PA-MPVPE and F-scores. The test 2D input poses for Pose2Mesh are estimated by HRNet [56] trained on FreiHAND [68], using bounding boxes from Mask R-CNN [19] with a ResNet-50 backbone [20].
Comparison with different train sets.
We report MPJPE and PA-MPJPE of Pose2Mesh trained on Human3.6M, COCO, and MuCo-3DHP, and other methods trained on different train sets, in Table 8. The train sets include Human3.6M, COCO, MPII [3], LSP [23], LSP-Extended [24], UP [33], and MPI-INF-3DHP [40]. Each method is trained on a different subset of them. In the table, the errors of [26, 30, 31] decrease by a large margin compared to the errors in Tables 5 and 6. Although this shows that the image-based methods can improve their generalizability with weak supervision on in-the-wild 2D pose datasets, Pose2Mesh still provides the lowest errors on 3DPW, which is the in-the-wild benchmark. This suggests that tackling the domain gap issue to fully benefit from the 3D data of controlled environments is an important task for recovering accurate 3D pose and mesh from in-the-wild images. We measured the PA-MPJPE of Pose2Mesh on Human3.6M by testing only on the frontal camera set, following the previous works [26, 30, 31]. In addition, we used 2D human poses estimated by DarkPose [65], which recently improved upon HRNet [56], as input for the 3DPW evaluation.

Figure 4 shows the qualitative results on the COCO validation set and the FreiHAND test set. Our Pose2Mesh outputs visually decent human meshes without post-processing, such as model fitting [31]. More qualitative results can be found in the supplementary material.
Fig. 4: Qualitative results of the proposed Pose2Mesh. First to third rows:COCO, fourth row: FreiHAND.
Discussion

Although the proposed system benefits from the homogeneous geometric property of input 2D poses from different domains, it could be challenging to recover various 3D shapes solely from the pose. While this may be true, we found that the 2D pose still carries the information necessary to reason about the corresponding 3D shape to some degree. In the literature, SMPLify [6] has experimentally verified that under the canonical body pose, utilizing the 2D pose significantly drops the body shape fitting error compared to using the mean body shape. We show that Pose2Mesh can recover various body shapes from the 2D pose in the supplementary material.
Conclusion

We propose a novel and general system, Pose2Mesh, for 3D human mesh and pose estimation from a 2D human pose. The input 2D pose enables the system to benefit from the 3D data captured in controlled settings without the appearance domain gap issue. The model-free approach using a GraphCNN allows it to fully exploit the mesh topology, while avoiding the representation issues of the 3D rotation parameters. We plan to enhance the shape recovery capability of Pose2Mesh using denser keypoints or part segmentation, while maintaining the above advantages.
Acknowledgements.
This work was supported by the IITP grant funded by the Ministry of Science and ICT of Korea (No. 2017-0-01780), and by Hyundai Motor Group through the HMG-SNU AI Consortium fund (No. 5264-20190101).
Supplementary Material of "Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose"
In this supplementary material, we present more experimental results that could not be included in the main manuscript due to the lack of space.
8 Shape recovery capability

We trained and tested Pose2Mesh on the SURREAL dataset [60], which has diverse samples in terms of body shape, to verify the capability of shape recovery. As shown in Figure 5, Pose2Mesh can recover a 3D body shape corresponding to an input image, though not perfectly. The shape features of individuals, such as the bone length ratio and fatness, are expressed in the outputs of Pose2Mesh. This implies that the information embedded in joint locations (e.g., the distance between hip joints) carries a certain amount of shape cues.
Fig. 5: The Pose2Mesh predictions compared with the groundtruth mesh, and the mesh decoded from the groundtruth pose parameters and the mean shape parameters.
9 More qualitative results

Here, we present more qualitative results on the COCO [34] validation set and the FreiHAND [68] test set in Figure 6. The images in the fourth row show some of the failure cases. Although the people in the first and second images appear to be overweight, the predicted meshes seem to be closer to the average shape. The right arm pose of the mesh in the third column is bent, though it appears straight.

Fig. 6: Additional qualitative results on COCO and FreiHAND.
We present the qualitative comparison between our Pose2Mesh and GraphCMR [31] in Figure 7. We regard GraphCMR as a suitable comparison target, since it is also a model-free method that regresses the coordinates of a human mesh defined by SMPL [36] using a GraphCNN, like ours. As the figure shows, our Pose2Mesh provides much more visually pleasant mesh results than GraphCMR. Based on the loss function analysis in Section 7 and the visual results of GraphCMR, we conjecture that the surface losses, such as the normal loss and the edge loss, are the reason for the difference.
Fig. 7: The mesh quality comparison between our Pose2Mesh and GraphCMR [31].
10 Details of PoseNet
Figure 8 shows the detailed network architecture of PoseNet. First, the normalized input 2D pose vector is converted to a 4096-dimensional feature vector by a fully-connected layer. Then, it is fed to the two residual blocks, where each block consists of a fully-connected layer, 1D batch normalization, ReLU activation, and dropout. The dimension of the feature map in the residual block is 4096, and the dropout probability is set to 0.5. Finally, the output from the residual blocks is converted to a (3J)-dimensional vector, the 3D pose vector, by a fully-connected layer. The 3D pose vector represents the root-relative 3D pose coordinates.

We present MPJPE and PA-MPJPE of PoseNet on the benchmarks in Table 9. For the Human3.6M benchmark [21], 14 common joints out of the 17 Human3.6M-defined joints are evaluated following [26, 30, 31, 50]. For the 3DPW benchmark [38], the COCO-defined 17 joints are evaluated, and JM from the groundtruth SMPL meshes is used as groundtruth. The 2D pose outputs from [58] and [56] are taken as test inputs on Human3.6M and 3DPW respectively. For the FreiHAND benchmark, only the FreiHAND train set is used during training, and 21 MANO [52] hand joints are evaluated by the official evaluation website. The 2D pose outputs from [56] are taken as test inputs.
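A minimal PyTorch sketch of this architecture is given below. It mirrors the description above, though layer names and weight details are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """FC -> BatchNorm1d -> ReLU -> Dropout, wrapped with a skip connection."""
    def __init__(self, dim: int = 4096, p: float = 0.5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(), nn.Dropout(p))

    def forward(self, x):
        return x + self.layers(x)

class PoseNet(nn.Module):
    """Lifts a normalized 2D pose (J x 2) to a root-relative 3D pose (J x 3)."""
    def __init__(self, num_joints: int, dim: int = 4096):
        super().__init__()
        self.inp = nn.Linear(2 * num_joints, dim)
        self.blocks = nn.Sequential(ResBlock(dim), ResBlock(dim))
        self.out = nn.Linear(dim, 3 * num_joints)

    def forward(self, pose_2d):  # pose_2d: (batch, J, 2)
        x = self.inp(pose_2d.flatten(1))
        return self.out(self.blocks(x)).view(-1, pose_2d.shape[1], 3)
```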
Fig. 8: The detailed network architecture of PoseNet.

Table 9: The MPJPE and PA-MPJPE of PoseNet on each benchmark.

            Human3.6M (train)   Human3.6M + COCO (train)
benchmark   MPJPE  PA-MPJPE     MPJPE  PA-MPJPE
Human3.6M   65.1   48.4         66.7   48.9
3DPW        105.0  62.9         99.2   61.0

benchmark   PA-MPJPE
FreiHAND    8.56
11 Pre-defined joint sets and graph structures
We use different pre-defined joint sets and graph structures for the Human3.6M, 3DPW, SURREAL, and FreiHAND benchmarks, as shown in Figure 9. To be specific, we employ Human3.6M body joints, COCO body joints, SMPL body joints, and MANO hand joints for the Human3.6M, 3DPW, SURREAL, and FreiHAND benchmarks, respectively, in both training and testing stages. For the COCO joint set, we additionally define pelvis and neck joints that connect the upper body and lower body. The pelvis and neck coordinates are calculated as the middle point of the left and right hips and of the left and right shoulders, respectively.
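As an illustration, the two extra joints can be computed as below, assuming the standard COCO keypoint ordering; the indices and function name are assumptions of this sketch.

```python
import numpy as np

# Standard COCO keypoint indices (an assumption of this sketch).
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12

def add_pelvis_neck(coco_pose: np.ndarray) -> np.ndarray:
    """Append pelvis and neck joints to a COCO pose (17 x 2 or 17 x 3)."""
    pelvis = 0.5 * (coco_pose[L_HIP] + coco_pose[R_HIP])
    neck = 0.5 * (coco_pose[L_SHOULDER] + coco_pose[R_SHOULDER])
    return np.vstack([coco_pose, pelvis, neck])
```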
Fig. 9: The joint sets and graph structures of each dataset that are used in Pose2Mesh: Human3.6M body joints (Human3.6M dataset), COCO body joints (3DPW dataset), SMPL body joints (SURREAL dataset), and MANO hand joints (FreiHAND dataset).
12 Pseudo-groundtruth SMPL parameters of Human3.6M dataset
The Mosh [35] method can compute SMPL parameters from the marker data in the Human3.6M dataset. Since the Human3.6M dataset does not provide 3D mesh annotations, most of the previous 3D pose and mesh estimation papers [26, 30, 31, 50] used the SMPL parameters obtained by the Mosh method as the groundtruth for supervision. However, due to the license issue, the SMPL parameters are not currently available. Furthermore, the source code of Mosh is not publicly released.

For the 3D mesh supervision, we alternatively obtain groundtruth SMPL parameters by applying SMPLify-X [48] on the groundtruth 3D joint coordinates of the Human3.6M dataset. Although the obtained SMPL parameters are not perfectly aligned to the groundtruth 3D joint coordinates, we confirmed that the error of SMPLify-X is much less than those of current state-of-the-art 3D human pose estimation methods, as shown in Table 10. Thus, we believe using the SMPL parameters obtained by SMPLify-X as groundtruth is reasonable. For a fair comparison, all the previous works and our system are trained on our SMPL parameters from SMPLify-X.

During the fitting process of SMPLify-X, we adopted a neutral-gender SMPL body model. However, we empirically found that the fitting process produces gender-specific body shapes, which correspond to each subject. As a result, since most of the subjects in the training set of the Human3.6M dataset are female, our Pose2Mesh trained on the Human3.6M dataset tends to produce female body shape meshes. We tried to fix the identity code of the SMPL body model obtained from the T-pose; however, it produces higher errors. Thus, we did not fix the identity code for each subject.

Table 10: The MPJPE comparison between SMPLify-X fitting results and state-of-the-art 3D human pose estimation methods. "*" takes multi-view RGB images as inputs.

methods                      MPJPE
Moon et al. [42]             53.3
Sun et al. [58]              49.6
Iskakov et al. [22]*         20.8
SMPLify-X from GT 3D pose
13 Synthetic data from AMASS
We leverage additional synthetic data from AMASS [37] to boost the performance of Pose2Mesh. AMASS is a new database that unifies 15 different optical marker-based mocap datasets within a common framework. It created SMPL parameters from mocap data by a method named Mosh++. We used the CMU dataset [1] from the database in training.

To be specific, we generated paired 2D pose-3D mesh data by projecting a 3D pose obtained from a mesh to the image plane, using camera parameters from Human3.6M. As shown in Table 11, when AMASS is added, both the joint error and the surface error decrease. Exploiting AMASS data in this fashion is not possible for [26], [31], and [30], since they need pairs of images and 2D/3D annotations.

Table 11: The MPJPE and MPVPE of our Pose2Mesh on 3DPW with accumulative training datasets. The 2D pose outputs from [56] are used as input to Pose2Mesh.

train sets                MPJPE  MPVPE
Human3.6M+COCO            91.4   109.3
Human3.6M+COCO+AMASS
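The projection used to generate the paired data can be sketched as follows, assuming camera-space 3D joints and pinhole intrinsics; the function name and argument format are illustrative, not taken from the released code.

```python
import numpy as np

def project_to_2d(joints_3d: np.ndarray, f: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Perspective projection of camera-space 3D joints (J x 3) to the image
    plane, given focal lengths f = (fx, fy) and principal point c = (cx, cy)."""
    xy = joints_3d[:, :2] / joints_3d[:, 2:3]   # divide by depth
    return xy * f + c
```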
14 Synthesizing the input 2D poses in the training stage
As described in Section 4.1 of the main manuscript, we synthesize the input 2D poses by adding randomly generated errors on the groundtruth 2D poses in the training stage. For this, we generate errors following Chang et al. [10] and Moon et al. [43] for the Human3.6M and COCO body joint sets, respectively. On the other hand, for the FreiHAND benchmark, we used detection outputs from [56] on the training set as the input poses in the training stage, since there are no verified synthetic errors for the hand joints.
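Since the exact error statistics come from [10, 43], the following is only a simplified illustration of the idea, with Gaussian jitter plus occasional large "miss" offsets; all magnitudes are placeholder assumptions.

```python
import numpy as np

def synthesize_2d_pose(gt_pose: np.ndarray, jitter_std: float = 3.0,
                       miss_prob: float = 0.05) -> np.ndarray:
    """Corrupt a groundtruth 2D pose (J x 2, pixels) with detector-like errors:
    small Gaussian jitter on every joint and a large displacement for 'missed'
    joints. The real error model of [10, 43] is structured per joint type."""
    noisy = gt_pose + np.random.randn(*gt_pose.shape) * jitter_std
    miss = np.random.rand(len(gt_pose)) < miss_prob
    noisy[miss] += np.random.randn(miss.sum(), 2) * 10.0 * jitter_std
    return noisy
```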
To demonstrate the validity of the synthesizing process, we compare MPJPE and PA-MPJPE of Pose2Mesh trained with the groundtruth 2D poses and with the synthesized input 2D poses in Table 12. For Human3.6M, only the Human3.6M train set is used for training, and for the 3DPW benchmark, Human3.6M and COCO are used for training. The test 2D input poses used in the Human3.6M and 3DPW evaluation are outputs from Integral Regression [58] and HRNet [56] respectively, using groundtruth bounding boxes. Apparently, when our Pose2Mesh is trained with the synthesized input 2D poses, it performs far better on both benchmarks. This proves that the synthesizing process makes Pose2Mesh more robust to the errors in the input 2D poses and increases the estimation accuracy.
Table 12: The MPJPE and PA-MPJPE comparison according to input type in the training stage.

                              Human3.6M           3DPW
input pose when training      MPJPE  PA-MPJPE     MPJPE  PA-MPJPE
2D pose GT                    70.4   50.6         153.7  94.4
2D pose synthesized (Ours)    64.9   48.7         91.4   60.1
15 Train/test with groundtruth input poses
We present the upper bounds of Pose2Mesh, PoseNet, and MeshNet on the Human3.6M and 3DPW benchmarks by training and testing with groundtruth input poses in Table 13. Pose2Mesh and PoseNet take the groundtruth 2D pose as an input, while MeshNet takes the groundtruth 3D pose as an input. As the table shows, the upper bound of Pose2Mesh is similar to that of PoseNet, which implies that the 3D pose errors of Pose2Mesh follow those of PoseNet, as analyzed in Section 7.2 of the main manuscript. In addition, the upper bound of MeshNet indicates that we can recover highly accurate 3D human meshes if we can estimate nearly perfect 3D poses.

The MPJPE and PA-MPJPE of Pose2Mesh and MeshNet are measured on the 3D pose regressed from the mesh output, while the accuracy of PoseNet is measured on the lifted 3D pose. For the Human3.6M benchmark, only the Human3.6M train set is used to train the network. For the 3DPW benchmark, the Human3.6M, COCO, and AMASS train sets are used to train the network.

Table 13: The upper bounds of Pose2Mesh, PoseNet, and MeshNet on the Human3.6M and 3DPW benchmarks.

                              Human3.6M           3DPW
networks                      MPJPE  PA-MPJPE     MPJPE  PA-MPJPE
Pose2Mesh with 2D pose GT     51.1   35.3         65.1   34.6
PoseNet with 2D pose GT       50.6   41.3         66.1   43.8
MeshNet with 3D pose GT       13.9   9.9          10.8   8.1
16 Effect of each loss function
We analyze the effect of the joint coordinate loss L_joint, the surface normal loss L_normal, and the surface edge loss L_edge on reconstructing a 3D human mesh in Table 14 and Figure 10. The Human3.6M dataset is used for training and testing. As the table shows, training without L_joint has a relatively distinctive effect on MPJPE and PA-MPJPE, while the other settings show numerically negligible differences. On the other hand, as the figure shows, training without L_normal or L_edge clearly decreases the visual quality of the mesh output, while training without L_joint has nearly no effect on the visual quality of the meshes. To be specific, training without L_normal impairs the overall smoothness of the mesh and the local details of the mouth, hands, and feet. Similarly, training without L_edge ruins the details of body parts that have dense vertices, especially the mouth, hands, and feet, by making serious artifacts caused by flying vertices.

Table 14: The MPJPE and PA-MPJPE comparison between the networks trained with various combinations of loss functions.

settings                   MPJPE  PA-MPJPE
full supervision (Ours)
without L_joint
without L_normal
without L_edge

Fig. 10: Qualitative results for the ablation study on the effectiveness of each loss function.
References
1. The Carnegie Mellon University (CMU) Graphics Laboratory Motion Capture Database. http://mocap.cs.cmu.edu/