Reconstructing NBA Players

Luyang Zhu, Konstantinos Rematas, Brian Curless, Steven M. Seitz, and Ira Kemelmacher-Shlizerman

University of Washington
Abstract.
Great progress has been made in 3D body pose and shape estimation from a single photo. Yet, state-of-the-art results still suffer from errors due to challenging body poses, modeling clothing, and self occlusions. The domain of basketball games is particularly challenging, as it exhibits all of these difficulties. In this paper, we introduce a new approach for reconstruction of basketball players that outperforms the state of the art. Key to our approach is a new method for creating poseable, skinned models of NBA players, and a large database of meshes (derived from the NBA2K19 video game) that we are releasing to the research community. Based on these models, we introduce a new method that takes as input a single photo of a clothed player in any basketball pose and outputs a high resolution mesh and 3D pose for that player. We demonstrate substantial improvement over state-of-the-art, single-image methods for body shape reconstruction. Code and dataset are available at http://grail.cs.washington.edu/projects/nba_players/.

Keywords: 3D Human Reconstruction
1 Introduction

Given regular, broadcast video of an NBA basketball game, we seek a complete 3D reconstruction of the players, viewable from any camera viewpoint. This reconstruction problem is challenging for many reasons, including the need to infer hidden and back-facing surfaces, and the complexity of basketball poses, e.g., reconstructing jumps, dunks, and dribbles.

Human body modeling from images has advanced dramatically in recent years, due in large part to the availability of 3D human scan datasets, e.g., CAESAR [62]. Based on this data, researchers have developed powerful tools that enable recreating realistic humans in a wide variety of poses and body shapes [47], and estimating 3D body shape from single images [64,70]. These models, however, are largely limited to the domains of the source data: people in underwear [62], or clothed models of people in static, staged poses [4]. Adapting this data to a domain such as basketball is extremely challenging, as we must match not only the physique of an NBA player but also their unique basketball poses.

Sports video games, on the other hand, have become extremely realistic, with renderings that are increasingly difficult to distinguish from reality. The player models in games like NBA2K [6] are meticulously crafted to capture each player's physique and appearance (Fig. 3). Such models are ideally suited as a training set for 3D reconstruction and visualization of real basketball games.
Fig. 1.
Single input photo (left), estimated 3D posed model viewed from a new camera position (middle), and the same model with video game texture for visualization purposes (right). The insets show the estimated shape from the input camera viewpoint. (Court and basketball meshes are extracted from the video game.) Photo Credit: [5]
In this paper, we present a novel dataset and neural networks that reconstruct high quality meshes of basketball players and retarget these meshes to fit frames of real NBA games. Given an image of a player, we are able to reconstruct the action in 3D and apply new camera effects such as close-ups, replays, and bullet-time effects (Fig. 1).

Our new dataset is derived from the video game NBA2K (with approval from the creator, Visual Concepts), by playing the game for hours and intercepting rendering instructions to capture thousands of meshes in diverse poses. Each mesh provides detailed shape and texture, down to the level of wrinkles in clothing, and captures all sides of the player, not just those visible to the camera. Since the intercepted meshes are not rigged, we learn a mapping from pose parameters to mesh geometry with a novel deep skinning approach. The result of our skinning method is a detailed deep net basketball body model that can be retargeted to any desired player and basketball pose.

We also introduce a system to fit our retargetable player models to real NBA game footage by solving for 3D player pose and camera parameters for each frame. We demonstrate the effectiveness of this approach on synthetic and real NBA input images, and compare with the state of the art in 3D pose and human body model fitting. Our method outperforms the state-of-the-art methods when reconstructing basketball poses and players, even when these methods are, to the extent possible, retrained on our new dataset. This paper focuses on basketball shape estimation and leaves texture estimation as future work.

Our biggest contributions are, first, a deep skinning approach that produces high quality, pose-dependent models of NBA players. A key differentiator is that we leverage thousands of poses and capture detailed geometric variations as a function of pose (e.g., folds in clothing), rather than a small number of poses, which is the norm for datasets like CAESAR (1-3 poses/person) and modeling methods like SMPL (trained on CAESAR and ∼45 poses/person).
While our approach is applicable to any source of registered 3D scan data, we apply it to reconstruct models of NBA players from NBA2K19 game play screen captures. As such, a second key contribution is pose-dependent models of different basketball players, and raw capture data for the research community. Finally, we present a system that fits these player models to images, enabling 3D reconstructions from photos of NBA players in real games. Both our skinning and pose networks are evaluated quantitatively and qualitatively, and outperform the current state of the art.

One might ask: why spend so much effort reconstructing mesh models that already exist (within the game)? NBA2K's rigged models and in-house animation tools are proprietary IP. By reconstructing a posable model from intercepted meshes (eliminating the requirement for proprietary animation and simulation tools), we can provide these best-in-the-world models of basketball players to researchers for the first time (with the company's support). These models provide a number of advantages beyond existing body models such as SMPL. In particular, they capture not just static poses, but human body dynamics for running, walking, and many other challenging activities. Furthermore, the plentiful pose-dependent data enables robust reconstruction even in the presence of heavy occlusions. In addition to producing the first high quality reconstructions of basketball from regular photos, our models can facilitate synthetic data collection for ML algorithms. Just as simulation provides a critical source of data for many ML tasks in robotics, self-driving cars, depth estimation, etc., our derived models can generate much more simulated content under any desired conditions (we can render any pose, viewpoint, and combination of players, against any background).
Fig. 2.
Overview: Given a single basketball image (top left), we begin by detecting the target player using [16,65], and create a person-centered crop (bottom left). From this crop, our PoseNet predicts 2D pose, 3D pose, and jump information. The estimated 3D pose and the cropped image are then passed to mesh generation networks to predict the full, clothed 3D mesh of the target player. Finally, to globally position the player on the 3D court (right), we estimate camera parameters by solving the PnP problem on known court lines and predict global player position by combining camera, 2D pose, and jump information. Blue boxes represent novel components of our method.
2 Related Work

Video Game Training Data.
Recent works [61,60,43,59] have shown that, for some domains, data derived from video games can significantly reduce manual labor and labeling, since ground-truth labels can be extracted automatically while playing the game. E.g., [15,59] collected depth maps of soccer players by playing the FIFA soccer video game, showing generalization to images of real games. Those works, however, focused on low level vision data, e.g., optical flow and depth maps, rather than full high quality meshes. In contrast, we collect data that includes 3D triangle meshes, texture maps, and detailed 3D body pose, which requires more sophisticated modeling of human body pose and shape.
Sports 3D reconstruction.
Reconstructing 3D models of athletes playing various sports from images has been explored in both academic research and industrial products. Most previous methods use multiple camera inputs rather than a single view. Grau et al. [24,23] and Guillemaut et al. [28,27] used multiview stereo methods for free viewpoint navigation. Germann et al. [21] proposed an articulated billboard representation for novel view interpolation. Intel demonstrated 360 degree viewing experiences with their True View [2] technology, installing 38 synchronized 5K cameras around the venue and using this multi-view input to build a volumetric reconstruction of each player. This paper aims to achieve similar reconstruction quality but from a single image.

Rematas et al. [59] reconstructed soccer games from monocular YouTube videos. However, they predicted only depth maps, and thus cannot handle occluded body parts or player visualization from all angles. Additionally, they estimated players' global positions by assuming all players are standing on the ground, which is not a suitable assumption for basketball, where players are often airborne. The detail of the depth maps is also low. We address all of these challenges by building a basketball-specific player reconstruction algorithm that is trained on meshes and accounts for complex airborne basketball poses. Our result is a detailed mesh of the player from a single view, comparable to multi-view reconstructions, and can be viewed from any camera position.
3D human pose estimation.
Large scale body pose estimation datasets [34,50,48] enabled great progress in 3D human pose estimation from single images [51,49,68,31,52]. We build on [51] but train on our new basketball pose data, use a more detailed skeleton (35 joints including fingers and face keypoints), and add an explicit model of jumping and camera to predict global position. Accounting for jumping is an important step that allows our method to outperform the state of the art in pose estimation.
3D human body shape reconstruction.
Parametric human body models [10,47,57,63,37,54] are commonly fit to images to derive a body skeleton, and provide a framework to optimize for shape parameters [13,37,54,71,44,33,75]. [70] further 2D-warped the optimized parametric model to approximately account for clothing and create a rigged animated mesh from a single photo. [38,56,39,42,55,29,76,41] trained a neural network to directly regress body shape parameters from images. Most parametric model based methods reconstruct undressed humans, since clothing is not part of the parametric model.

Clothing can be modeled to some extent by warping SMPL [47] models, e.g., to silhouettes: Weng et al. [70] demonstrated 2D warping of depth and normal maps from a single photo silhouette, and Alldieck et al. [8,7,9] addressed multi-image fitting. Alternatively, given predefined garment models, [12] estimated a clothing mesh layer on top of SMPL.

Non-parametric methods [69,53,64,58] proposed voxel [69] or implicit function [64] representations to model clothed humans by training on representative synthetic data. Xu et al. [73,74] and Habermann et al. [30] assumed a pre-captured multi-view model of the clothed human, retargeted based on new poses.

We focus on single-view reconstruction of players in NBA basketball games, producing a complete 3D model of the player pose and shape, viewable from any camera viewpoint. This reconstruction problem is challenging for many reasons, including the need to infer hidden and back-facing surfaces, and the complexity of basketball poses, e.g., reconstructing jumps, dunks, and dribbles. Unlike prior methods that model undressed people in various poses or dressed people in a frontal pose, we focus on modeling clothed people in challenging basketball poses and provide a rigorous comparison with the state of the art.
Fig. 3. Our novel NBA2K dataset captures 27,144 basketball poses spanning 27 subjects, extracted from the NBA2K19 video game.
3 The NBA2K Dataset

Imagine having thousands of 3D body scans of NBA players, in every conceivable pose during a basketball game. Suppose that these models were extremely detailed and realistic, down to the level of wrinkles in clothing. Such a dataset would be instrumental for sports reconstruction, visualization, and analysis. This section describes such a dataset, which we call NBA2K, after the video game from which these models derive. These models of course are not literally player scans, but are produced by professional modelers for use in the NBA2K19 video game, based on a variety of data including high resolution player photos, scanned models, and mocap data of some players. While they do not exactly match each player, they are among the most accurate 3D renditions in existence (Fig. 3).

Our NBA2K dataset consists of body mesh and texture data for several NBA players, each in around 1000 widely varying poses. For each mesh (vertices, faces, and texture) we also provide its 3D pose (35 keypoints including face and hand finger points) and the corresponding RGB image with its camera parameters. While we used meshes of 27 real, famous players to create many of the figures in this paper, we do not have permission to release models of current NBA players. Instead, we additionally collected the same kind of data for 28 synthetic players and retrained our pipeline on this data. The synthetic players have the same geometric and visual quality as the NBA models, and their data along with trained models will be shared with the research community upon publication of this paper. Our released meshes, textures, and models have the same quality as those in the paper and span a similar variety of player types, but are not named individuals. Visual Concepts [6] has approved our collection and sharing of the data.

The data was collected by playing the NBA2K19 game and intercepting calls between the game engine and the graphics card using RenderDoc [3]. The program captures all drawing events per frame, where we locate player rendering events by analyzing the hash code of both vertex and pixel shaders. Next, triangle meshes and textures are extracted by reverse-engineering the compiled code of the vertex shader. The game engine renders players by body parts, so we perform a nearest neighbor clustering to decide which body part belongs to which player. Since the game engine optimizes the mesh for real-time rendering, the extracted meshes have different mesh topologies, making them harder to use in a learning framework. We register the meshes by resampling vertices in texture space based on a template mesh. After registration, the processed mesh has 6036 vertices and 11576 faces with fixed topology across poses and players (point-to-point correspondence), has multiple connected components (it is not a watertight manifold), and comes with no skinning information. We also extract the rest-pose skeleton and per-bone transformation matrices, from which we compute forward kinematics to get the full 3D pose.
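As an illustration of this last step, here is a minimal sketch of posing rest-frame joints with per-bone transforms via forward kinematics; the function name and the (J, 4, 4) array layout are illustrative assumptions, not the game's actual internal format.

```python
import numpy as np

def forward_kinematics(rest_joints, bone_transforms):
    """Pose rest-frame joints with per-bone world-space transforms.

    rest_joints:     (J, 3) joint positions in the rest pose.
    bone_transforms: (J, 4, 4) homogeneous world transform for the bone
                     owning each joint (assumed layout).
    Returns the (J, 3) posed joint positions.
    """
    J = rest_joints.shape[0]
    homog = np.concatenate([rest_joints, np.ones((J, 1))], axis=1)  # (J, 4)
    posed = np.einsum('jrc,jc->jr', bone_transforms, homog)         # per-joint matmul
    return posed[:, :3] / posed[:, 3:]
```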
4 Method

Figure 2 shows our full reconstruction system, starting from a single image of a basketball game and ending with the output of a complete, high quality mesh of the target player, with pose and shape matching the image. Next, we describe the individual steps used to achieve the final results.
4.1 Pose Estimation

Since our input meshes are not rigged (no skeletal information or blending weights), we propose a neural network called PoseNet to estimate the 3D pose and other attributes of a player from a single image. This 3D pose information will later be used to facilitate shape reconstruction. PoseNet takes a single image as input and is trained to output 2D body pose, 3D body pose, a binary jump classification (is the person airborne or not), and the jump height (vertical height of the feet above the ground). The two jump-related outputs are key for global position estimation and are our novel addition to existing generic body pose estimation.

From the input image, we first extract ResNet [72] features (from layer 4) and supply them to four separate network branches. The output of the 2D pose branch is a set of 2D heatmaps (one for each 2D keypoint) indicating where the particular keypoint is located. The output of the 3D pose branch is a set of
XYZ location maps (one for each keypoint) [51]; a location map indicates the possible 3D location for every pixel. The 2D and 3D pose branches use the same architecture as [72]. The jump branch estimates a class label, and the jump height branch regresses the height of the jump. Both use a fully connected layer followed by two linear residual blocks [49] to produce the final output.

The PoseNet model is trained using the following loss:

$$\mathcal{L}_{pose} = \omega_{2d}\,\mathcal{L}_{2d} + \omega_{3d}\,\mathcal{L}_{3d} + \omega_{bl}\,\mathcal{L}_{bl} + \omega_{jht}\,\mathcal{L}_{jht} + \omega_{jcls}\,\mathcal{L}_{jcls} \tag{1}$$

where $\mathcal{L}_{2d} = \|H - \hat{H}\|$ is the loss between predicted ($H$) and ground truth ($\hat{H}$) heatmaps, $\mathcal{L}_{3d} = \|L - \hat{L}\|$ is the loss between predicted ($L$) and ground truth ($\hat{L}$) 3D location maps, $\mathcal{L}_{bl} = \|B - \hat{B}\|$ is the loss between predicted ($B$) and ground truth ($\hat{B}$) bone lengths, penalizing unnatural 3D poses (we pre-compute the ground truth bone lengths over the training data), $\mathcal{L}_{jht} = \|h - \hat{h}\|$ is the loss between predicted ($h$) and ground truth ($\hat{h}$) jump height, and $\mathcal{L}_{jcls}$ is the cross-entropy loss for the jump class. For all experiments, we set $\omega_{2d} = 10$, $\omega_{3d} = 10$, and $\omega_{jht} = 0.4$, with weights below 1 on $\mathcal{L}_{bl}$ and $\mathcal{L}_{jcls}$.
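To make Eqn. 1 concrete, here is a minimal PyTorch sketch of the combined loss. The use of MSE for the map terms, L1 for the scalar terms, and the default weights for the bone-length and jump-class terms are assumptions, since those choices are not pinned down above.

```python
import torch.nn.functional as F

def pose_loss(H, H_gt, L, L_gt, bl, bl_gt, h, h_gt, jump_logits, jump_cls,
              w2d=10.0, w3d=10.0, wbl=0.5, wjht=0.4, wjcls=0.5):
    """Weighted PoseNet loss of Eqn. 1 (wbl and wjcls defaults are assumed)."""
    l2d = F.mse_loss(H, H_gt)                       # 2D heatmap loss
    l3d = F.mse_loss(L, L_gt)                       # XYZ location-map loss
    lbl = F.l1_loss(bl, bl_gt)                      # bone-length prior
    ljht = F.l1_loss(h, h_gt)                       # jump height regression
    ljcls = F.cross_entropy(jump_logits, jump_cls)  # airborne vs. grounded
    return w2d * l2d + w3d * l3d + wbl * lbl + wjht * ljht + wjcls * ljcls
```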
Global Position

To estimate the global position of the player we need the camera parameters of the input image. Since NBA courts have known dimensions, we generate a synthetic 3D court and align it with the input frame. Similar to [59,17], we use a two-step approach. First, we provide four manual correspondences between the input image and the 3D basketball court to initialize the camera parameters by solving PnP [45]. Then, we perform a line-based camera optimization similar to [59], where the projected lines from the synthetic 3D court should match the lines in the image. Given the camera parameters, we can estimate a player's global position on (or above) the 3D court from the lowest keypoint and the jump height. We cast a ray from the camera center through the image keypoint; the 3D location of that keypoint is where the ray's height above the ground equals the estimated jump height.
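As a sketch of the camera initialization step, the snippet below solves PnP from four image-to-court correspondences with OpenCV; the landmark coordinates, pixel locations, and intrinsics are placeholder values, and the subsequent line-based refinement is omitted.

```python
import cv2
import numpy as np

# Known 3D court landmarks in meters on the floor plane y = 0
# (an NBA court is 28.65 m x 15.24 m); corner choice is illustrative.
court_pts = np.array([[0.0, 0.0, 0.0],
                      [28.65, 0.0, 0.0],
                      [28.65, 0.0, 15.24],
                      [0.0, 0.0, 15.24]], dtype=np.float32)

# Corresponding pixel locations clicked in the input frame (placeholders).
image_pts = np.array([[212.0, 480.0],
                      [1700.0, 505.0],
                      [1460.0, 790.0],
                      [430.0, 770.0]], dtype=np.float32)

# Rough pinhole intrinsics for a 1920x1080 broadcast frame (assumed focal length).
K = np.array([[1500.0, 0.0, 960.0],
              [0.0, 1500.0, 540.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)

ok, rvec, tvec = cv2.solvePnP(court_pts, image_pts, K, distCoeffs=None)
R, _ = cv2.Rodrigues(rvec)  # world-to-camera rotation; (R, tvec) seed the
                            # line-based refinement described above
```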
Fig. 4.
Mesh generation contains two sub-networks: IdentityNet and SkinningNet. IdentityNet deforms a rest pose template mesh (the average rest pose over all players in the database) into a rest pose personalized mesh given the image. SkinningNet takes the rest pose personalized mesh and 3D pose as input and outputs the posed mesh. There is a separate SkinningNet per body part; here we illustrate the arms.
4.2 Mesh Generation

Reconstruction of a complete detailed 3D mesh (including deformation due to pose, clothing, fingers, and face) from a single image is a key technical contribution of our method. To achieve this we introduce two sub-networks (Fig. 4): IdentityNet and SkinningNet. IdentityNet takes as input an image of a player whose rest mesh we wish to infer, and outputs the person's rest mesh by deforming a template mesh. The template mesh is the average of all training meshes and is the same starting point for any input. The main benefit of this network is that it allows us to estimate the body size and arm span of the player from the input image. SkinningNet takes the rest pose personalized mesh and the 3D pose as input, and outputs the posed mesh. To reduce the learning complexity, we pre-segment the mesh into six parts: head, arms, shirt, pants, legs, and shoes. We then train a separate SkinningNet on each part. Finally, we combine the six reconstructed parts into one, while removing interpenetration of garments with body parts. Details are described below.
IdentityNet.
We propose a variant of 3D-CODED [25] to deform the template mesh. We first use ResNet [32] to extract features from the input image. Then we concatenate template mesh vertices with image features and feed them into an AtlasNet decoder [26] to predict per-vertex offsets. Finally, we add these offsets to the template mesh to get the predicted personalized mesh. We use the L1 loss between the prediction and the ground truth to train IdentityNet.
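A minimal sketch of this offset-prediction scheme follows; a plain per-vertex MLP stands in for the AtlasNet decoder [26], so layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IdentityNetSketch(nn.Module):
    """Predict per-vertex offsets that deform a template into a personalized rest mesh."""
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(            # stand-in for the AtlasNet decoder
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, template_verts, img_feat):
        # template_verts: (N, 3); img_feat: (feat_dim,) global ResNet feature.
        N = template_verts.shape[0]
        feat = img_feat.unsqueeze(0).expand(N, -1)    # tile the image feature
        offsets = self.mlp(torch.cat([template_verts, feat], dim=1))
        return template_verts + offsets               # personalized rest mesh

# Training then uses an L1 loss between predicted and ground-truth rest meshes:
# loss = nn.functional.l1_loss(pred_verts, gt_verts)
```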
SkinningNet.
We propose a TL-embedding network [22] to learn an embedding space with generative capability. Specifically, the 3D keypoints $K_{pose} \in \mathbb{R}^{35 \times 3}$ are processed by the pose encoder to produce a latent code $Z_{pose} \in \mathbb{R}^{32}$. The rest pose personalized mesh vertices $V_{rest} \in \mathbb{R}^{N \times 3}$ (where $N$ is the number of vertices in a mesh part) are processed by the mesh encoder to produce a latent code $Z_{rest} \in \mathbb{R}^{32}$. Then $Z_{pose}$ and $Z_{rest}$ are concatenated and fed into a fully connected layer to get $Z_{pred} \in \mathbb{R}^{32}$. Similarly, the ground truth posed mesh vertices $V_{posed} \in \mathbb{R}^{N \times 3}$ are processed by another mesh encoder to produce a latent code $Z_{gt} \in \mathbb{R}^{32}$. $Z_{gt}$ is sent into the mesh decoder during training, while $Z_{pred}$ is sent into the mesh decoder during testing.

The pose encoder is comprised of two linear residual blocks [49] followed by a fully connected layer. The mesh encoders and the shared decoder are built with spiral convolutions [14]. See supplementary material for the detailed network architecture. SkinningNet is trained with the following loss:

$$\mathcal{L}_{skin} = \omega_Z\,\mathcal{L}_Z + \omega_{mesh}\,\mathcal{L}_{mesh} \tag{2}$$

where $\mathcal{L}_Z = \|Z_{pred} - Z_{gt}\|$ forces $Z_{pred}$ and $Z_{gt}$ to be similar, and $\mathcal{L}_{mesh} = \|V_{pred} - V_{posed}\|$ is the loss between decoded mesh vertices $V_{pred}$ and ground truth vertices $V_{posed}$. The loss weights are set to $\omega_Z = 5$, $\omega_{mesh} = 50$. See supplementary for detailed training parameters.
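The train/test asymmetry of the TL-embedding is easy to miss, so here is a schematic PyTorch forward pass; the encoders and decoder are collapsed into placeholder MLPs (the real network uses spiral convolutions [14]), and the module names are illustrative.

```python
import torch
import torch.nn as nn

class SkinningNetSketch(nn.Module):
    def __init__(self, n_verts, n_joints=35, z_dim=32):
        super().__init__()
        mlp = lambda i, o: nn.Sequential(nn.Linear(i, 256), nn.ReLU(),
                                         nn.Linear(256, o))
        self.pose_enc = mlp(n_joints * 3, z_dim)      # stand-in for residual blocks [49]
        self.mesh_enc_rest = mlp(n_verts * 3, z_dim)  # stand-ins for spiral convs [14]
        self.mesh_enc_posed = mlp(n_verts * 3, z_dim)
        self.fuse = nn.Linear(2 * z_dim, z_dim)
        self.mesh_dec = mlp(z_dim, n_verts * 3)

    def forward(self, pose, v_rest, v_posed_gt=None):
        z_pred = self.fuse(torch.cat([self.pose_enc(pose.flatten(1)),
                                      self.mesh_enc_rest(v_rest.flatten(1))], dim=1))
        if self.training:                   # TL-embedding: decode Z_gt during training
            z_gt = self.mesh_enc_posed(v_posed_gt.flatten(1))
            return self.mesh_dec(z_gt), z_pred, z_gt
        return self.mesh_dec(z_pred)        # decode Z_pred at test time
```

The loss of Eqn. 2 then ties the two codes together ($\mathcal{L}_Z$) while supervising the decoded vertices ($\mathcal{L}_{mesh}$).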
Combining body part meshes.

Direct concatenation of body parts results in interpenetration between the garment and the body. Thus, we first detect all body part vertices in collision with clothing as in [54], and then follow [66,67] to deform the mesh by moving collision vertices inside the garment while preserving local rigidity of the mesh. This detection-deformation process is repeated until there is no collision or the number of iterations exceeds a threshold (10 in our experiments). See supplementary material for details of the optimization.
Table 1. 3D pose comparison on the NBA2K dataset.

Method            MPJPE    MPJPE-PA
HMR [38]          115.77   78.17
CMR [42]          82.28    61.22
SPIN [41]         88.72    59.85
Ours (Reg+BL)     81.66    63.70
Ours (Loc)        66.12    52.73
Ours (Loc+BL)     -        -

The metric is mean per joint position error with (MPJPE-PA) and without (MPJPE) Procrustes alignment. Baseline methods are fine-tuned on our NBA2K dataset.

Table 2. Mesh reconstruction comparison on the NBA2K dataset.

Method            CD       EMD
HMR [38]          22.411   0.137
SPIN [41]         14.793   0.125
SMPLify-X [54]    47.720   0.187
PIFu [64]         23.136   0.207
Ours              -        -

We use Chamfer distance, denoted CD (scaled by 1000, lower is better), and Earth-mover's distance, denoted EMD (lower is better), for comparison. Both distance metrics show that our method significantly outperforms the state of the art for mesh estimation. All related works are retrained or fine-tuned on our data; see text.
5 Results

Dataset Preparation.

We evaluate our method against the state of the art on our NBA2K dataset. We collected 27,144 meshes spanning 27 subjects performing various basketball poses (about 1000 poses per player). PoseNet training requires generalization to real images, so we augment the data to 265,765 training examples, 37,966 validation examples, and 66,442 testing examples. Augmentation is done by rendering and blending meshes into various random basketball courts. For IdentityNet and SkinningNet, we select 19,667 examples from 20 subjects as training data and test on 7,477 examples from 7 unseen players. To further evaluate generalization of our method, we also provide qualitative results on real images. Note that textures are extracted from the game and not estimated by our algorithm.
Pose Evaluation.

We evaluate pose estimation by comparing to state-of-the-art SMPL-based methods that have released training code; specifically, we compare with HMR [38], CMR [42], and SPIN [41]. For fair comparison, we fine-tuned their models with 3D and 2D ground-truth NBA2K poses. Since NBA2K and SMPL meshes have different topology, we do not use mesh vertices or SMPL parameters as part of the supervision. Table 1 shows comparison results for 3D pose. The metric is mean per joint position error (MPJPE), with and without Procrustes alignment, computed on the 14 joints defined by the LSP dataset [36]. Our method outperforms all other methods (lower is better), even when they are fine-tuned on our NBA2K dataset.

To further evaluate our design choices, we compare the location-map-based representation (used in our network) with direct regression of 3D joints, and also evaluate the effect of the bone length (BL) loss on pose prediction. A direct regression baseline is created by replacing our deconvolution network with fully connected layers [49]. The effectiveness of the BL loss is evaluated by running the network with and without it. As shown in Table 1, both location maps and the BL loss boost performance. In supplementary material, we show our results on global position estimation: our method accurately places players (both airborne and on the ground) on the court, thanks to accurate jump estimation.
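For reference, the metric can be computed as below; this is the standard formulation of MPJPE and its Procrustes-aligned variant, not code from the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error; pred, gt: (J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def mpjpe_pa(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    Xp, Xg = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(Xp.T @ Xg)
    d = np.sign(np.linalg.det(U @ Vt))            # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                                # optimal rotation
    s = (S * np.diag(D)).sum() / (Xp ** 2).sum()  # optimal isotropic scale
    return mpjpe(s * Xp @ R + mu_g, gt)
```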
Mesh Evaluation.

Table 2 shows results comparing our mesh reconstruction method to the state of the art on NBA2K data. We compare to both undressed (HMR [38], SMPLify-X [54], SPIN [41]) and clothed (PIFu [64]) human reconstruction methods. For fair comparison, we retrain PIFu on our NBA2K meshes. SPIN and HMR are based on the SMPL model, for which we do not have ground-truth meshes, so we fine-tuned them with NBA2K 2D and 3D pose. SMPLify-X is an optimization method, so we directly apply it to our testing examples. The meshes generated by the baseline methods and the NBA2K meshes do not have one-to-one vertex correspondence, so we use Chamfer distance (CD) and Earth-mover's distance (EMD) as metrics. Prior to distance computation, all predictions are aligned to the ground truth using ICP. Our method outperforms both undressed and clothed human reconstruction methods even when they are trained on our data.
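As a reference for the first metric, a brute-force symmetric Chamfer distance is sketched below (EMD needs an optimal-transport solver and is omitted); averaging squared nearest-neighbor distances in both directions is one common convention, assumed here.

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between point sets A (N, 3) and B (M, 3).

    Brute force, O(N*M) memory; uses the mean of squared nearest-neighbor
    distances in both directions (one common convention).
    """
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise sq. dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```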
Fig. 5. Comparison with SMPL-based methods.
Column 1 is the input, columns 2-5 are reconstructions in the image view, and columns 6-9 are visualizations from a novel view. Note the significant difference in body pose between ours and the SMPL-based methods.
Fig. 6. Comparison with PIFu [64].
Column 1 is the input, columns 2-5 are reconstructions in the image viewpoint, and columns 6-9 are visualizations from a novel view. PIFu significantly over-smooths shape details and produces lower quality reconstructions even when trained on our dataset (PIFu+NBA).
Fig. 7. Garment details at various poses.
For each input image, we show the predicted shape and close-ups from two viewpoints.
Fig. 8. Results on real images.
For each example, column 1 is the input image, columns 2-3 are reconstructions rendered in different views, and columns 4-5 are the corresponding renderings using texture from the video game, shown for visualization only. Our method is focused only on shape recovery.
Photo Credit: [1]
Qualitative Results.
Fig. 5 qualitatively compares our results with the best performing SMPL-based methods, SPIN [41] and SMPLify-X [54]. These two methods do not reconstruct clothes, so we focus on the pose accuracy of the body shape. Our method generates more accurate body shapes for basketball poses, especially for hands and fingers. Fig. 6 qualitatively compares with PIFu [64], a state-of-the-art clothed human reconstruction method. Our method generates detailed geometry, such as shirt wrinkles, under different poses, while PIFu tends to over-smooth faces, hands, and garments. Fig. 7 further visualizes garment details in our reconstructions. Fig. 8 shows results of our method on real images, demonstrating robust generalization. Please also refer to the supplementary PDF and video for high quality reconstructions of real NBA players.
Comparison with SMPL-NBA.

We follow the idea of SMPL [47] to train a skinning model from NBA2K registered mesh sequences; we call the resulting body model SMPL-NBA. Since we do not have rest pose meshes for thousands of different subjects, we cannot learn a meaningful PCA shape basis as SMPL did. Thus, we focus on the pose-dependent part and fit the SMPL-NBA model to 2000 meshes of a single player. We use the same skeleton rig as SMPL to drive the mesh. Since our mesh is comprised of multiple connected parts, we initialize the skinning weights using a voxel-based heat diffusion method [19]. The whole training process of SMPL-NBA is otherwise the same as the pose parameter training of SMPL. We fit the learned model to the 2D and 3D keypoints predicted by PoseNet, following SMPLify [13]. Fig. 9 compares SkinningNet with SMPL-NBA, showing that SMPL-NBA has severe artifacts in garment deformation, an inherent difficulty for traditional skinning methods. It also suffers from twisted joints, a common problem when fitting per-bone transformations to 3D and 2D keypoints.
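For context, the traditional skinning machinery that SMPL-NBA inherits is linear blend skinning (LBS); a minimal sketch follows, with the weight-matrix layout as an assumption.

```python
import numpy as np

def linear_blend_skinning(verts, weights, bone_transforms):
    """Classic LBS: blend per-bone transforms by skinning weights.

    verts:           (N, 3) rest-pose vertices.
    weights:         (N, B) skinning weights, rows summing to 1
                     (initialized by voxel heat diffusion [19] in our case).
    bone_transforms: (B, 4, 4) world-space transform per bone.
    """
    N = verts.shape[0]
    homog = np.concatenate([verts, np.ones((N, 1))], axis=1)       # (N, 4)
    blended = np.einsum('nb,brc->nrc', weights, bone_transforms)   # (N, 4, 4)
    posed = np.einsum('nrc,nc->nr', blended, homog)
    return posed[:, :3]
```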
Fig. 9. Comparison with SMPL-NBA.
Column 1 is the input, columns 2-4 are reconstructions in the image view, and columns 5-7 are visualizations from a novel viewpoint. SMPL-NBA fails to model clothing, and the fitting process is unstable.
Table 3. Comparison with other geometry learning methods.

Method           MPVPE    MPVPE-PA
CMR [42]         85.26    64.32
3D-CODED [25]    84.22    63.13
Ours             -        -

The metric is mean per vertex position error in mm with (MPVPE-PA) and without (MPVPE) Procrustes alignment. All baseline methods are trained on the NBA2K data.
Comparison with Other Geometry Learning Methods.
Fig. 10 compares SkinningNet with two state-of-the-art mesh-based shape deformation networks: 3D-CODED [25] and CMR [42]. The baseline methods are retrained on the same data as SkinningNet for fair comparison. For 3D-CODED, we take 3D pose as input instead of a point cloud to deform the template mesh. For CMR, we only use their mesh regression network (no SMPL regression network) and replace images with 3D pose as input. Both methods use the same 3D pose encoder as SkinningNet. The input template mesh is set to the prediction of IdentityNet. Unlike the baseline methods, SkinningNet does not suffer from substantial deformation errors when the target pose is far from the rest pose. Table 3 provides further quantitative results based on mean per vertex position error (MPVPE), with and without Procrustes alignment.
Fig. 10. Comparison with 3D-CODED [25] and CMR [42].
Column 1 is the input, columns 2-5 are reconstructions in the image view, and columns 6-9 are zoomed-in versions of the red boxes. The baseline methods exhibit poor deformations for large deviations from the rest pose.
6 Conclusion

We have presented a novel system for state-of-the-art, detailed 3D reconstruction of complete basketball player models from single photos. Our method includes 3D pose estimation, jump estimation, an identity network that deforms a template mesh to the person in the photo (to estimate rest pose shape), and finally a skinning network that retargets the shape from the rest pose to the pose in the photo. We thoroughly evaluated our method against prior art; both quantitative and qualitative results demonstrate substantial improvements over the state of the art in pose and shape reconstruction from single images. For fairness, we retrained competing methods, to the extent possible, on our new data. Our data, models, and code will be released to the research community.
Limitations and future work
This paper focuses solely on high quality shape estimation of basketball players, and does not estimate texture, a topic for future work. Additionally, IdentityNet cannot model hair and facial identity due to the lack of detail in low resolution input images. Finally, the current system operates on single-image input only; a future direction is to generalize to video with temporal dynamics.
Acknowledgments
This work was supported by an NSF/Intel Visual and Experiential Computing Award.
References
1. Getty Images
2. Intel True View
3. RenderDoc, https://renderdoc.org
4. RenderPeople, https://renderpeople.com
5. USA TODAY Network
6. Visual Concepts, https://vcentertainment.com
7. Alldieck, T., Magnor, M., Bhatnagar, B.L., Theobalt, C., Pons-Moll, G.: Learning to reconstruct people in clothing from a single RGB camera. In: CVPR (2019)
8. Alldieck, T., Magnor, M., Xu, W., Theobalt, C., Pons-Moll, G.: Video based reconstruction of 3D people models. In: CVPR (2018)
9. Alldieck, T., Pons-Moll, G., Theobalt, C., Magnor, M.: Tex2Shape: Detailed full human body geometry from a single image. In: ICCV (2019)
10. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: Shape completion and animation of people. ACM Transactions on Graphics (TOG) 24, 408–416 (2005)
11. Ranjan, A., Bolkart, T., Sanyal, S., Black, M.J.: Generating 3D faces using convolutional mesh autoencoders. In: ECCV, pp. 725–741 (2018), http://coma.is.tue.mpg.de/
12. Bhatnagar, B.L., Tiwari, G., Theobalt, C., Pons-Moll, G.: Multi-Garment Net: Learning to dress 3D people from images. In: ICCV (2019)
13. Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In: ECCV (2016)
14. Bouritsas, G., Bokhnyak, S., Ploumpis, S., Bronstein, M., Zafeiriou, S.: Neural 3D morphable models: Spiral convolutional networks for 3D shape representation learning and generation. In: ICCV (2019)
15. Calagari, K., Elgharib, M., Didyk, P., Kaspar, A., Matusik, W., Hefeeda, M.: Gradient-based 2D-to-3D conversion for soccer videos. In: ACM Multimedia, pp. 605–619 (2015)
16. Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008 (2018)
17. Carr, P., Sheikh, Y., Matthews, I.: Pointless calibration: Camera parameters from gradient-based alignment to edge images. In: WACV (2012)
18. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289 (2015)
19. Dionne, O., de Lasa, M.: Geodesic voxel binding for production character meshes. In: SCA, pp. 173–180 (2013)
20. Garland, M., Heckbert, P.S.: Surface simplification using quadric error metrics. In: SIGGRAPH, pp. 209–216 (1997)
21. Germann, M., Hornung, A., Keiser, R., Ziegler, R., Würmlin, S., Gross, M.: Articulated billboards for video-based rendering. Computer Graphics Forum 29, 585–594 (2010)
22. Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A.: Learning a predictable and generative vector representation for objects. In: ECCV, pp. 484–499 (2016)
23. Grau, O., Hilton, A., Kilner, J., Miller, G., Sargeant, T., Starck, J.: A free-viewpoint video system for visualization of sport scenes. SMPTE Motion Imaging Journal (5-6), 213–219 (2007)
24. Grau, O., Thomas, G.A., Hilton, A., Kilner, J., Starck, J.: A robust free-viewpoint video system for sport scenes. In: 3DTV Conference, pp. 1–4 (2007)
25. Groueix, T., Fisher, M., Kim, V.G., Russell, B.C., Aubry, M.: 3D-CODED: 3D correspondences by deep deformation. In: ECCV, pp. 230–246 (2018)
26. Groueix, T., Fisher, M., Kim, V.G., Russell, B.C., Aubry, M.: A papier-mâché approach to learning 3D surface generation. In: CVPR, pp. 216–224 (2018)
27. Guillemaut, J.Y., Hilton, A.: Joint multi-layer segmentation and reconstruction for free-viewpoint video applications. IJCV (2011)
28. Guillemaut, J.Y., Kilner, J., Hilton, A.: Robust graph-cut scene segmentation and reconstruction for free-viewpoint video of complex dynamic scenes. In: ICCV (2009)
29. Guler, R.A., Kokkinos, I.: HoloPose: Holistic 3D human reconstruction in-the-wild. In: CVPR (2019)
30. Habermann, M., Xu, W., Zollhöfer, M., Pons-Moll, G., Theobalt, C.: LiveCap: Real-time human performance capture from monocular video. ACM Transactions on Graphics (Proc. SIGGRAPH) (2019)
31. Habibie, I., Xu, W., Mehta, D., Pons-Moll, G., Theobalt, C.: In the wild human pose estimation using explicit 2D features and intermediate 3D representations. In: CVPR (2019)
32. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
33. Huang, Y., Bogo, F., Lassner, C., Kanazawa, A., Gehler, P.V., Romero, J., Akhter, I., Black, M.J.: Towards accurate marker-less human shape and pose estimation over time. In: 3DV, pp. 421–430 (2017)
34. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (7), 1325–1339 (2013)
35. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR, pp. 1125–1134 (2017)
36. Johnson, S., Everingham, M.: Clustered pose and nonlinear appearance models for human pose estimation. In: BMVC (2010), doi:10.5244/C.24.12
37. Joo, H., Simon, T., Sheikh, Y.: Total Capture: A 3D deformation model for tracking faces, hands, and bodies. In: CVPR, pp. 8320–8329 (2018)
38. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
39. Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3D human dynamics from video. In: CVPR (2019)
40. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
41. Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: ICCV (2019)
42. Kolotouros, N., Pavlakos, G., Daniilidis, K.: Convolutional mesh regression for single-image human shape reconstruction. In: CVPR (2019)
43. Krähenbühl, P.: Free supervision from video games. In: CVPR (2018)
44. Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M.J., Gehler, P.V.: Unite the People: Closing the loop between 3D and 2D human representations. In: CVPR, pp. 6050–6059 (2017)
45. Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision (2), 155 (2009)
46. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Mathematical Programming (1-3), 503–528 (1989)
47. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG) (6), 248 (2015)
48. von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: ECCV, pp. 601–617 (2018)
49. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: ICCV, pp. 2640–2649 (2017)
50. Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 3DV, pp. 506–516 (2017)
51. Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H.P., Xu, W., Casas, D., Theobalt, C.: VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics (TOG) (4), 44 (2017)
52. Moon, G., Chang, J., Lee, K.M.: Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In: ICCV (2019)
53. Natsume, R., Saito, S., Huang, Z., Chen, W., Ma, C., Li, H., Morishima, S.: SiCloPe: Silhouette-based clothed people. In: CVPR, pp. 4480–4490 (2019)
54. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: CVPR (2019)
55. Pavlakos, G., Kolotouros, N., Daniilidis, K.: TexturePose: Supervising human mesh estimation with texture consistency. In: ICCV (2019)
56. Pavlakos, G., Zhu, L., Zhou, X., Daniilidis, K.: Learning to estimate 3D human pose and shape from a single color image. In: CVPR, pp. 459–468 (2018)
57. Pons-Moll, G., Romero, J., Mahmood, N., Black, M.J.: Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (Proc. SIGGRAPH) (4), 120:1–120:14 (2015)
58. Pumarola, A., Sanchez, J., Choi, G., Sanfeliu, A., Moreno-Noguer, F.: 3DPeople: Modeling the geometry of dressed humans. In: ICCV (2019)
59. Rematas, K., Kemelmacher-Shlizerman, I., Curless, B., Seitz, S.: Soccer on your tabletop. In: CVPR, pp. 4738–4747 (2018)
60. Richter, S.R., Hayder, Z., Koltun, V.: Playing for benchmarks. In: ICCV (2017)
61. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: ECCV (2016)
62. Robinette, K.M., Blackwell, S., Daanen, H., Boehmer, M., Fleming, S.: Civilian American and European Surface Anthropometry Resource (CAESAR), final report, volume 1: summary. Tech. rep., Sytronics Inc., Dayton, OH (2002)
63. Romero, J., Tzionas, D., Black, M.J.: Embodied Hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) (6) (2017)
64. Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172 (2019)
65. Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In: CVPR (2017)
66. Sorkine, O., Alexa, M.: As-rigid-as-possible surface modeling. In: Symposium on Geometry Processing, pp. 109–116 (2007)
67. Sorkine, O., Cohen-Or, D., Lipman, Y., Alexa, M., Rössl, C., Seidel, H.P.: Laplacian surface editing. In: Symposium on Geometry Processing, pp. 175–184 (2004)
68. Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: ECCV, pp. 529–545 (2018)
69. Varol, G., Ceylan, D., Russell, B., Yang, J., Yumer, E., Laptev, I., Schmid, C.: BodyNet: Volumetric inference of 3D human body shapes. In: ECCV (2018)
70. Weng, C.Y., Curless, B., Kemelmacher-Shlizerman, I.: Photo wake-up: 3D character animation from a single photo. In: CVPR, pp. 5908–5917 (2019)
71. Xiang, D., Joo, H., Sheikh, Y.: Monocular total capture: Posing face, body, and hands in the wild. In: CVPR (2019)
72. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: ECCV (2018)
73. Xu, F., Liu, Y., Stoll, C., Tompkin, J., Bharaj, G., Dai, Q., Seidel, H.P., Kautz, J., Theobalt, C.: Video-based characters: Creating new human performances from a multi-view video database. ACM Transactions on Graphics (4), 32:1–32:10 (2011), https://doi.org/10.1145/2010324.1964927
74. Xu, W., Chatterjee, A., Zollhöfer, M., Rhodin, H., Mehta, D., Seidel, H.P., Theobalt, C.: MonoPerfCap: Human performance capture from monocular video. ACM Transactions on Graphics (2018)
75. Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3D pose and shape estimation of multiple people in natural scenes: the importance of multiple scene constraints. In: CVPR, pp. 2148–2157 (2018)
76. Zhu, H., Zuo, X., Wang, S., Cao, X., Yang, R.: Detailed human shape estimation from a single image by hierarchical mesh deformation. In: CVPR (2019)
Reconstructing NBA Players
Supplementary Material

1 NBA2K Dataset Capture
In this section we provide more details on how we selected captures for the NBA2K dataset. One way to decide which frames to capture is to let the game's AI play two teams against each other; however, we found that the variety of poses captured in this manner is rather limited, consisting mostly of walking and running, whereas we target more complex basketball moves. Instead, we had people play the game and proactively captured frames where dunks, dribbles, shooting, and other complex basketball moves occur.
2 PoseNet Details

In this section we provide more details on the PoseNet architecture and setup. The input is a single, person-centered image of size 256 × 256. The 2D pose branch outputs a set of 64 × 64 heatmaps, one for every keypoint, indicating where a particular keypoint is located. Similarly, the 3D pose branch outputs a set of 64 × 64 location maps [51], where each location map indicates the possible 3D location for every pixel. Each location map has 3 channels that encode the XYZ position of a keypoint with respect to the pelvis. To generate the ground truth heatmaps, we first transform the 2D pose from its original image resolution (256 × 256) to the 64 × 64 resolution, and then generate a 2D Gaussian map centered at each joint location. For the ground truth XYZ location maps, we put the 3D joint location at the positions where the heatmap has non-zero value. To obtain the final output, we take the location of the maximum value in every keypoint heatmap to get the 2D pose at 64 × 64 resolution, and use it to sample the 3D pose from the XYZ location maps. After that, the 2D pose is transformed back to the original 256 × 256 resolution. The ground truth jump height is directly extracted from the game, and the jump class is set to 1 if the jump height is greater than 0.1 m.
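The sketch below mirrors this target generation and test-time read-out; the Gaussian width sigma is an assumed value.

```python
import numpy as np

def make_targets(kp2d, kp3d, res=64, img_res=256, sigma=1.5):
    """Build ground-truth heatmaps and XYZ location maps for one person.

    kp2d: (J, 2) pixel keypoints at img_res; kp3d: (J, 3) pelvis-relative XYZ.
    sigma is an assumed Gaussian width. Returns (J, res, res) and (J, 3, res, res).
    """
    J = kp2d.shape[0]
    heat = np.zeros((J, res, res), np.float32)
    loc = np.zeros((J, 3, res, res), np.float32)
    ys, xs = np.mgrid[0:res, 0:res]
    kp = kp2d * res / img_res                          # rescale 256 -> 64
    for j in range(J):
        g = np.exp(-((xs - kp[j, 0])**2 + (ys - kp[j, 1])**2) / (2 * sigma**2))
        heat[j] = g
        loc[j][:, g > 1e-4] = kp3d[j][:, None]         # XYZ where heatmap is non-zero
    return heat, loc

def read_out(heat, loc, img_res=256):
    """Argmax each heatmap, sample XYZ there, and rescale the 2D pose."""
    J, res, _ = heat.shape
    idx = heat.reshape(J, -1).argmax(1)
    py, px = np.unravel_index(idx, (res, res))
    pose2d = np.stack([px, py], 1) * img_res / res     # back to 256 x 256
    pose3d = loc[np.arange(J), :, py, px]
    return pose2d, pose3d
```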
Fig. 11. Court line generation on synthetic data.
For every example, from left to right: input image, predicted court lines overlaid on the input image, and ground truth court lines overlaid on the input image.
Fig. 12. Court line generation on real data.
For every example, left is the input image and right is the predicted court lines overlaid on the input image.
Fig. 13. Global position estimation. Please zoom in to see details.
From left to right: input image, then two views of the estimated location (middle and right). Note the location of players with respect to court lines (marked with red boxes).
3 Global Position Estimation

In this section we describe the process of placing a 3D player at its corresponding position on (or above) the basketball court.

Since a basketball court with players typically has more occlusions (and curved lines) than a soccer field, we found that the traditional line detection method used in [59] fails. To get robust line features, we train a pix2pix [35] network to translate basketball images into court line masks. For the training data, we use synthetic data from NBA2K, where the predefined 3D court lines are projected to image space using the extracted camera parameters. To demonstrate the robustness of our line feature extraction, we show results on synthetic data in Figure 11 and on real data in Figure 12.

After estimating the camera parameters, we place the player mesh in 3D by considering its 2D pose in the image and the jump height (Sec 4.1):

$$V_c = \begin{bmatrix} (x_p - p_x)\,z_c/f \\ (y_p - p_y)\,z_c/f \\ z_c \end{bmatrix} \tag{3}$$

$$y_w = R \cdot (V_c - T) \tag{4}$$

where $R$ is the second column of the extrinsic rotation matrix; $T$ is the extrinsic translation; $f$ is the focal length; $(p_x, p_y)$ is the principal point; $V_c$ holds the camera coordinates of the lowest joint (e.g., a foot); $y_w$ is the world-coordinate $y$-component of the lowest joint, which must equal the predicted jump height $h$; and $(x_p, y_p)$ are the pixel coordinates of the lowest joint. Substituting Eqn. 3 into Eqn. 4 and setting $y_w = h$ gives a linear equation in $z_c$ (the camera-space $z$-coordinate of the lowest joint):

$$z_c = \frac{h + R \cdot T}{R_1 (x_p - p_x)/f + R_2 (y_p - p_y)/f + R_3},$$

where $R_1, R_2, R_3$ are the components of $R$. From $z_c$ we can further compute the global position of the player. In Figure 13, we show our results of global position estimation: our method accurately places players (both airborne and on the ground) on the court, thanks to accurate jump estimation.

4 SkinningNet Details

In this section we provide more details on the SkinningNet architecture. As noted in the main paper, the pose encoder is comprised of linear residual blocks [49] followed by a fully connected layer. The linear residual block consists of four FC-BatchNorm-ReLU-Dropout blocks with a skip connection from the input to the output. For the mesh part, we denote Spiral Convolution [14] as SC, and the mesh downsampling and upsampling operators [11] as DS and US. The mesh encoder consists of four SC-ELU [18]-DS blocks followed by an FC layer. The mesh decoder consists of an FC layer, four US-SC-ELU blocks, and a final SC layer. We follow COMA [11] for the mesh sampling operations, where vertices are removed by minimizing quadric errors [20] during downsampling and added back using barycentric interpolation during upsampling. In Table 4, we provide detailed settings for the mesh encoders and decoders of the different body parts.
Training details.
For training IdentityNet and SkinningNet, we use a batch size of 16 for 200 epochs and optimize with the Adam solver [40] with weight decay. The learning rate for IdentityNet is 0.0002, while the learning rate for SkinningNet is 0.001 with a decay of 0.99 after every epoch. The weights of the different losses are set to $\omega_Z = 5$, $\omega_{mesh} = 50$.

Table 4. Network architecture for mesh encoders and decoders of different body parts.

              head       arm        shoes      shirt      pant       leg
NV            348        842        937        2098       1439       372
DS Factor     (2,2,1,1)  (2,2,2,1)  (2,2,2,1)  (4,2,2,2)  (2,2,2,2)  (2,2,1,1)
NZ            32 for all body parts
Filter Size   (16,32,64,64) for encoders, (64,32,16,16,3) for decoders
Dilation      (2,2,1,1) for encoders, (1,1,2,2,2) for decoders
Step Size     (2,2,1,1) for encoders, (1,1,2,2,2) for decoders
NV is the number of vertices; DS Factor gives the downsampling factors; NZ is the hidden size of the latent vector; Filter Size gives the output channels of each SC layer; Dilation is the dilation ratio for SC; Step Size is the number of hops for SC.
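A minimal optimizer setup matching these settings might look as follows; the weight decay value shown is an assumed placeholder.

```python
import torch

def make_optimizers(identity_net, skinning_net):
    """Adam setups per the training details above; weight decay assumed 5e-4."""
    opt_id = torch.optim.Adam(identity_net.parameters(),
                              lr=2e-4, weight_decay=5e-4)
    opt_skin = torch.optim.Adam(skinning_net.parameters(),
                                lr=1e-3, weight_decay=5e-4)
    # SkinningNet's learning rate decays by 0.99 after every epoch.
    sched = torch.optim.lr_scheduler.ExponentialLR(opt_skin, gamma=0.99)
    return opt_id, opt_skin, sched
```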
5 Interpenetration Optimization

In this section, we provide details of the interpenetration optimization.
As noted in the main paper, we first detect all body part vertices in collision with clothing as in [54], and then follow [66,67] to deform the mesh by moving collision vertices inside the garment while preserving local rigidity of the mesh. This detection-deformation process is repeated until there is no collision or the number of iterations exceeds a threshold (10 in our experiments). Before each mesh deformation step, collision vertices are first moved in the direction opposite their vertex normals by 10 mm. Then we optimize the remaining vertex positions of the body parts by minimizing the following loss:

$$\mathcal{L}_{pen} = \omega_{data}\,\mathcal{L}_{data} + \omega_{lap}\,\mathcal{L}_{lap} + \omega_{el}\,\mathcal{L}_{el} \tag{5}$$

where $\mathcal{L}_{data} = \|V - V^{*}\|$ forces the optimized vertices $V$ to stay close to the SkinningNet-inferred vertices $V^{*} = V(Z_{pred})$, $\mathcal{L}_{lap} = \|\Delta V - \Delta V^{*}\|_F$ is the Frobenius norm of the Laplacian difference between the optimized and inferred meshes, and $\mathcal{L}_{el} = \|E/E^{*} - 1\|$ encourages the optimized edge lengths $E$ to match the inferred edge lengths $E^{*}$. Each of these losses is summed over all vertices or edges. We set $\omega_{data} = 1$, with weights below 1 on $\omega_{lap}$ and $\omega_{el}$.
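A schematic version of the terms in Eqn. 5 is given below, using a uniform graph Laplacian as a stand-in; the weights for the Laplacian and edge-length terms are assumptions, and the collision detection step is not shown.

```python
import torch

def penetration_loss(V, V_star, edges, w_data=1.0, w_lap=0.1, w_el=0.1):
    """Terms of Eqn. 5. V, V_star: (N, 3); edges: (E, 2) vertex index pairs.

    Uses a uniform (unweighted) graph Laplacian as a stand-in; w_lap and
    w_el are assumed values.
    """
    l_data = (V - V_star).norm(dim=1).mean()

    i, j = edges[:, 0], edges[:, 1]
    def laplacian(X):
        # Difference between each vertex and the mean of its neighbors.
        lap = torch.zeros_like(X)
        deg = torch.zeros(X.shape[0], 1, dtype=X.dtype)
        lap.index_add_(0, i, X[j]); lap.index_add_(0, j, X[i])
        ones = torch.ones(edges.shape[0], 1, dtype=X.dtype)
        deg.index_add_(0, i, ones); deg.index_add_(0, j, ones)
        return X - lap / deg.clamp(min=1)
    l_lap = (laplacian(V) - laplacian(V_star)).norm()        # Frobenius norm

    e = (V[i] - V[j]).norm(dim=1)                            # edge lengths
    e_star = (V_star[i] - V_star[j]).norm(dim=1)
    l_el = (e / e_star.clamp(min=1e-8) - 1).abs().mean()

    return w_data * l_data + w_lap * l_lap + w_el * l_el
```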
Fig. 14. Before and after interpenetration optimization.
Note the garment in the red square. Ground truth textures are used to better visualize the intersection.
Fig. 15. Comparison with Tex2Shape [9]. Note that Tex2Shape only predicts rough body shape compared to our reconstructions. We follow their advice and select images where the person is large and fully visible.
6 Additional Qualitative Comparisons

In this section, we provide additional qualitative comparisons that further demonstrate the effectiveness of our system. Fig. 15 shows a qualitative comparison with Tex2Shape [9]. Note that Tex2Shape is only trained on their A-pose data and directly tested on NBA images. Our method generates better shirt wrinkles and body details under different poses.
Fig. 16. Comparison with SMPL-based methods on real images.
Column 1 is the input, columns 2-4 are reconstructions in the image view, and columns 5-7 are visualizations from a novel viewpoint. Note the significant difference in body pose between ours and the SMPL-based methods; our results are qualitatively much more similar to what is seen in the input images. In addition, SMPL-based methods do not handle clothing.
In the main paper, we only provided qualitative comparisons with state-of-the-art methods on synthetic data. In Figure 16, we compare our method against the best-performing SMPL-based methods [54,41] on real images. In Figure 17, we additionally compare with PIFu [64], the state-of-the-art method for clothed subjects, on real images. Our system generates more stable poses and more realistic, fine details for real images.
Fig. 17. Comparison with PIFu [64] on real images.
Column 1 is the input (the red box shows the target player), columns 2-4 are reconstructions in the image view, and columns 5-7 are reconstructions in a novel view. PIFu fails to reconstruct high quality human shapes from real images, even when the players are in nearly standing poses.
Fig. 18. Qualitative Results on real images. Please zoom in to see details.
For every example, left is the input (the red box shows the target player), middle is the reconstruction in the image view, and right is the reconstruction in a novel view. Our method generalizes well on real images under a variety of poses.
In Figure 18, we provide additional qualitative results of our method on real images. Our method can reconstruct the 3D shape of different people under various poses in real images. In Figure 19, we provide examples where our approach fails to reconstruct a correct 3D shape from single-view images.
Fig. 19. Failure cases. For each example: input, our reconstruction (Ours), and ground truth (GT).