A-NeRF: Surface-free Human 3D Pose Refinement via Neural Rendering
SHIH-YANG SU,
University of British Columbia
FRANK YU,
University of British Columbia
MICHAEL ZOLLHOEFER,
Facebook Reality Labs Research
HELGE RHODIN,
University of British Columbia (a) Our human representation in unseen poses and novel views Initial poseRefined poseGround truth(b) Pose refinement for real-image data
Fig. 1. Our A-NeRF test-time optimization for monocular 3D human pose estimation jointly learns a volumetric body model of the user that can be animated and works with diverse body shapes (left), while also refining the initial 3D skeleton pose estimate from a single or, if available, multiple views without tedious camera calibration (right). Underlying is a surface-free neural representation and skeleton-based embedding coupled with volumetric rendering. Faces in H3.6M images blurred for anonymity.
While deep learning has reshaped the classical motion capture pipeline, generative, analysis-by-synthesis elements are still in use to recover fine details if a high-quality 3D model of the user is available. Unfortunately, obtaining such a model for every user a priori is challenging, time-consuming, and limits the application scenarios. We propose a novel test-time optimization approach for monocular motion capture that learns a volumetric body model of the user in a self-supervised manner. To this end, our approach combines the advantages of neural radiance fields with an articulated skeleton representation. Our proposed skeleton embedding serves as a common reference that links constraints across time, thereby reducing the number of required camera views from traditionally dozens of calibrated cameras down to a single uncalibrated one. As a starting point, we employ the output of an off-the-shelf model that predicts the 3D skeleton pose. The volumetric body shape and appearance are then learned from scratch, while jointly refining the initial pose estimate. Our approach is self-supervised and does not require any additional ground truth labels for appearance, pose, or 3D shape. We demonstrate that our novel combination of a discriminative pose estimation technique with surface-free analysis-by-synthesis outperforms purely discriminative monocular pose estimation approaches and generalizes well to multiple views.
Human motion capture is an important research problem in computer graphics and computer vision, with many applications ranging from character animation for computer games or movies to motion analysis for sports or medicine. Capturing the complex and highly articulated motion of human performances is an extremely challenging research problem, especially if only a monocular camera setup is available. One reason for this is the high level of depth ambiguity, i.e., there are multiple possible 3D scenes that project to the same 2D image.

Modern motion capture techniques combine the advantages of discriminative and generative techniques to obtain the highest quality results. First, a neural network-based 3D human pose estimation approach is employed to provide a coarse initial estimate of the human pose. These algorithms internalize a sophisticated pose prior and are thus able to counteract the depth ambiguity, but the obtained results might be noisy and do not perfectly re-project on top of the input image. Afterwards, generative approaches based on either a high-quality 3D scan of the person [Habermann et al. 2019] or a parametric human body model [Alldieck et al. 2019a] are employed to refine the pose estimate. In this step, cues in the input image can be directly leveraged, e.g., additionally available silhouette masks or directly the color information. This allows the approach to better fit the model to the image than the purely discriminative approach. Although such combined techniques achieve unprecedented accuracy, their downside is that they require a shape body model or a personalized 3D scan of the person to be available a priori. Parametric 3D body models are not able to accurately model a particular user, especially in terms of clothing, since garments are not part of the model and human appearances are diverse. A textured high-quality 3D scan provides the most accurate additional constraints, but it is challenging to acquire and limits the application scenarios.

We propose a surface-free neural rendering method for estimating a 3D body model and refining the skeleton pose jointly, thereby alleviating the constraints and cost of template models while maintaining the advantages of generative body models.

Recently, volumetric neural rendering techniques have shown very promising results for novel-view synthesis of static scenes. These surface-free approaches are able to represent arbitrary scenes by modeling the properties of every location in space based on its local radiance and opacity. Image formation is based on classical volumetric rendering techniques, such as ray marching and importance sampling for improved efficiency. One of the approaches that can handle small dynamic real-world scenes observed by a multi-camera capture setup is Neural Volumes (NVs) [Lombardi et al. 2019]. NVs parameterize the scene based on a dense voxel grid of opacity and view-conditioned color that is regressed by an encoder-decoder network. NVs require a dense multi-view camera setup for training and can only model small scenes at high resolution due to the inherent cubic memory complexity of the underlying dense grid. Another approach that has obtained a lot of attention is Neural Radiance Fields (NeRF) [Mildenhall et al. 2020], due to its stunning high-quality results and the compactness of the learned scene representation.
NeRF parameterizes the scene compactly using a Multi-layer Perceptron (MLP) as its scene representation that maps a 5D coordinate (position and direction) to the radiance and opacity at that position in space. However, NeRF only works for static scenes captured from dozens of calibrated camera positions. It is unclear how to extend this approach to dynamic scenes captured by a single uncalibrated camera, especially highly articulated human performances without a clear reference frame.

The question we answer in this paper is: Can we combine the advantages of volumetric rendering with analysis-by-synthesis (also called render-and-compare) to further improve monocular human pose estimation? We propose a novel test-time optimization approach for 3D human pose estimation that jointly learns a volumetric body model of the user to improve pose estimation accuracy, which we term Articulated NeRF (A-NeRF). By initializing the pose estimate with an existing CNN-based pose estimation method, A-NeRF is applicable to as few as a single static camera and, if available, can integrate multi-view information from additional uncalibrated views to boost performance while keeping the capture setup simple.

Fig. 2. Robustness and invariance. The original NeRF (top row) breaks when training on a diverse set of poses and further degrades when the poses are rotated or shifted. With our skeleton-relative encoding (bottom row), the geometry for the subject is consistent under rotation and translation.

The core technical novelty lies in our skeleton-relative embeddings. These encode positions and view directions relative to the bones of an articulated skeleton. Our relative pose embeddings have the advantage of being invariant to global translation and rotation of the person, thus allowing the network to better combine body shape and appearance constraints across the entire captured sequence (see Figure 2). In particular, a body part at frame $k$ should look similar to the same part at frame $k'$ if the pictured pose is similar, independent of the global position and rotation of the person in 3D space. More precisely, the introduced hierarchical skeleton structure links information from different frames even if a pose is only locally similar on a subset of body parts. Note that this does not include illumination changes due to the global motion of the person; we therefore explicitly factor out this variation into a set of jointly learned low-dimensional appearance codes. Our strategy makes learning more feasible since it induces an inductive bias to the network, much like the translation invariance of 2D or 3D convolutions. Learning a NeRF model as a function of this embedding forces the underlying neural network to learn a representation that is consistent with the underlying body pose. The skeleton serves as a common reference frame linking all monocular observations across time, thereby alleviating the need for complicated multi-view capture setups.

In summary, our core technical contributions are:

• A novel volumetric analysis-by-synthesis approach for monocular 3D human pose refinement that jointly learns a volumetric body model and pose refinement of an initial 3D skeleton pose estimate in a self-supervised manner.
• Skeleton-relative embeddings that induce favorable inductive biases to scene representation MLPs.
• A multi-view extension that lets us integrate multiple uncalibrated camera views for an additional boost in performance.

We demonstrate that the test-time pose refinement improves on state-of-the-art approaches for monocular skeleton-based 3D pose reconstruction. Additional gains are possible by exploiting uncalibrated multi-view sequences. Moreover, we provide detailed ablation studies that reveal the importance of the proposed skeleton embedding and the robustness of its hyperparameters. The learned body model is more accurate when using the proposed refinement and can readily be used for animation, appearance and motion transfer.
Our approach builds on prior work on motion capture and neural scene representations, which we discuss here with a focus on 3D representations for human modeling. For a detailed discussion of more general and image-based neural rendering we refer to [Tewari et al. 2020].
Human modeling from images is an important research direction within computer vision and computer graphics. In the following, we discuss the pros and cons of existing approaches with a focus on the employed body representation, such as 3D point clouds, 3D skeletons, surface meshes, parametric shape models, voxel-based representations, and implicit surface models.
3D Joint Position Estimation.
For many computer vision applications, such as action recognition and performance analysis in sports, it suffices to reconstruct motion as the 3D trajectories of the major human joints. Lifting-based approaches first estimate the 2D pose and then use a separate neural network to lift predictions to 3D based on a regressed depth value [Martinez et al. 2017]. Among these, our pose-relative encoding takes inspiration from Moreno-Noguer [2017], who showed that encoding 2D and 3D pose as the over-complete space of distances between all joint positions yields favorable invariances and relations. While lifting is light-weight and generalizes well, end-to-end reconstruction with CNNs still attains the highest accuracy [Li et al. 2020a; Xu et al. 2020]. However, pure positional information lacks information on the bone rotation and can lead to anatomically impossible poses.
Skeleton Motion Capture.
Most computer graphics applications, such as animation and motion retargeting, require an articulated skeleton representation with fixed bone lengths that is parametrized by joint angles via forward kinematics. This representation can be obtained in a post-process via inverse kinematics on the 3D joint locations predicted by the methods explained in the previous section, even in real-time [Mehta et al. 2017] and for groups of persons [Mehta et al. 2020]. However, strictly enforcing anatomical constraints comes at the expense of slightly less accurate reconstructions that do not perfectly reproject onto the input images. An alternative is to directly regress the joint angles of the skeleton [Shi et al. 2020; Zhou et al. 2016] and body proportions [Kocabas et al. 2020; Kolotouros et al. 2019]. While more accurate, single-shot, discriminative prediction is still prone to misalignment when overlaid onto the input image.
Surface Performance Capture.
A common strategy for recovering surface detail is to roughly align a template mesh using one of the previous approaches and to refine its silhouette to match an image segmentation [Habermann et al. 2019; Xu et al. 2018]. The interior can be refined with shape-from-shading [Wu et al. 2011] or photometric terms [Robertini et al. 2016]. However, these approaches require multiple views [Orts-Escolano et al. 2016], a laser scan, or other actor calibration steps with scripted motion [Alldieck et al. 2019a], and often depend on manual rigging [Xu et al. 2018] (association of vertices to skeleton bones via skinning weights). Our goal is to learn an actor model without a separate calibration step and for unconstrained actor motion.
Fitting Parametric Body Models.
Parametric human body models [Balan et al. 2007; Choutas et al. 2020; Loper et al. 2015] enabled a large number of computer graphics applications, such as movie editing and reshaping [Jain et al. 2010]. They are learned from large collections of scans, fitted with a skeleton rig, and constrain the space of plausible human shapes and motions in a low-dimensional space. This enables real-time reconstructions from single images [Bogo et al. 2016; Guan et al. 2009], alleviates manual rigging, and enables test-time optimization [Dong et al. 2020; Guler and Kokkinos 2019; Lassner et al. 2017] and weak supervision when integrated in differentiable form [Liu et al. 2019] into the neural training process [Kanazawa et al. 2018; Kolotouros et al. 2019; Omran et al. 2018; Pavlakos et al. 2018; Tung et al. 2017]. The results are virtual, often animatable, characters for games or VR applications. Closest to our approach in this category are the model fitting and test-time refinement methods by [Zuffi et al. 2019], which textures and geometrically refines an untextured parametric quadruped model to zebra images, and by [Xiang et al. 2019], which uses optical flow to refine human pose. Although similar in spirit, our surface-free neural body model and volumetric rendering is fundamentally different from their textured triangle mesh in a differentiable rasterization pipeline.
Hybrid approaches.
Parametric models can be refined locally, e.g., in texture space by predicting displacement maps [Alldieck et al. 2018a, 2019b]. These neural models are differentiable, enabling test-time optimization on silhouettes [Alldieck et al. 2019a, 2018b]. The learned offsets mitigate the constraint of classical parametric models that limit shapes to those in the training set. However, they require garment-specific handling of loose clothing such as skirts [Bhatnagar et al. 2019]; things that cannot be modeled as an offset to pre-defined geometry.

From the rendering standpoint, A-NeRF bears close similarities with the volumetric body models and renderers by [Rhodin et al. 2016b, 2015], which refine human pose, shape, and appearance via differentiable ray-tracing in a multi-view camera setup. By defining the volumetric density as a sum of Gaussians, real-time reconstruction in egocentric perspective enabled immersive, real-time VR applications [Rhodin et al. 2016a]. However, the reconstructions have blurry appearance, two or more views are required, and an underlying parametric model limits generalization to specific subjects. Huang et al. [2020] use a similar differentiable rendering on spherical primitives attached to a skeleton, but without optimizing the underlying pose. Our key differentiating factor is that our body model is surface-free and independent of parametric body models, alleviating their limitations.
Implicit Body Models.
Implicit models describe the surface of a human as the level-set of a function, by a sum of simple functions, such as Gaussians [Stoll et al. 2011], or more recently by general purpose neural networks [He et al. 2020; Huang et al. 2020; Saito et al. 2019].
A large number of recent works combine classical computer graphics techniques with deep neural scene representations. We draw upon these general purpose representations for our human model. Many classical computer graphics representations have been used as the basis for neural rendering approaches, such as meshes [Lombardi et al. 2018; Thies et al. 2019], point clouds [Aliev et al. 2019; Meshry et al. 2019; Wiles et al. 2020], a set of spheres [Lassner and Zollhöfer 2020], and dense volumetric grids [Lombardi et al. 2019; Sitzmann et al. 2019a]. Recently, volumetric neural representations have been widely applied, due to their generality and the fact that they have shown very promising results. These representations are based either on a dense volumetric grid or an MLP.
Dense Volumetric Grids.
Deep Voxels [Sitzmann et al. 2019a] employs a coarse volumetric grid of learned features as the 3D scene representation. The features can be reprojected to novel views to condition a U-Net based neural rendering network that regresses the final color output. Deep Voxels is limited to novel view synthesis for static scenes. Neural Volumes (NVs) [Lombardi et al. 2019] learn to parameterize a dynamic object based on an encoder-decoder network that regresses a volumetric grid of opacity and view-conditioned color. Image formation is based on a differentiable raymarching approach that is inspired by additive alpha blending. NVs require a dense multi-view camera setup for training. One limitation of all approaches that are based on a dense voxel grid is that they can only handle scenes with a small spatial extent due to their cubic memory requirements.
MLP-based Volumetric Approaches.
Neural Radiance Fields (NeRF) parameterize a static scene compactly using an MLP-based scene representation [Mildenhall et al. 2020]. One important component is the positional encoding of the query point coordinates [Mildenhall et al. 2020; Tancik et al. 2020] that enables the approach to represent high-frequency detail. Sitzmann et al. [2019b] proposed Scene Representation Networks (SRNs) that assign a feature to every point in 3D space based on an MLP and jointly learn a differentiable sphere marcher for image generation. Liu et al. [2020] proposed Neural Sparse Voxel Fields (NSVF), which combines an Octree acceleration structure with a latent-conditioned MLP. The Octree enables the approach to prune empty space and thus can speed up rendering of the scene representation. Currently, all these approaches are restricted to static scenes.
Non-peer Reviewed Works.
There is a large number of recent extensions to MLP-based neural scene representations: These approaches focus on general improvements [Tancik et al. 2020; Zhang et al. 2020], in-the-wild data [Martin-Brualla et al. 2020], generalization aspects [Gao et al. 2020; Schwarz et al. 2020; Trevithick and Yang 2020; Yu et al. 2020], and extending the approach to the dynamic setting [Du et al. 2020; Gafni et al. 2020; Li et al. 2020b; Park et al. 2020; Peng et al. 2020; Pumarola et al. 2020; Tretschk et al. 2020; Xian et al. 2020]. These works are not yet peer-reviewed and only available as technical reports via arXiv. Thus, they are not considered prior art.
We follow the classical analysis-by-synthesis approach of refining an initial human pose estimate $\theta_k$ by rendering the current model configuration and iteratively optimizing the pose to minimize the difference between a real image $I_k$ and the rendering (also called render-and-compare). This optimization is done at test time over the video or collection of images, $\{I_k\}_{k=1}^{N}$, that should be reconstructed. This is a practical setting since we do not require any 3D annotation or camera calibration. Figure 3 provides an overview of our method. Unique to our approach is that we learn the human body shape and appearance during optimization without relying on restrictive template scans or parametric shape models. We dub this model Articulated Neural Radiance Field (A-NeRF). We explain in the following how the refinement is initialized with an existing discriminative model (Section 3.1), how we realize the joint estimation of shape, appearance, and pose (Section 3.2) via a combination of classical volumetric rendering, neural scene representation (Section 3.3), and a novel embedding into the kinematic chain of an articulated skeleton (Section 3.4). The training of the underlying neural network is explained in Section 3.5 and further extensions in Section 3.6.

Given a set of $N$ images $\{I_k\}_{k=1}^{N}$ of a single human subject, we utilize SPIN [Kolotouros et al. 2019], a state-of-the-art 3D pose and shape estimation method. Importantly for us, it yields not only human joint positions, but also approximate camera parameters, joint angles, and thereby skeleton bone orientations, which we use to initialize the parameters $\theta_k$ of an articulated skeleton and camera intrinsics $K_k$ for every input frame $k$. SPIN is a feed-forward neural network that has been trained on a combination of studio recordings and in-the-wild images.

Our skeleton representation follows that of SMPL [Loper et al. 2015], but without the associated parametric surface model. The skeleton structure is defined in a rest pose of
$N_J = 24$ 3D joint locations $\{a_j\}_{j=1}^{N_J}$, with the root at joint 0. In the subsequent paragraphs, we use the subscript $(k, j)$ to indicate that a variable is related to the $j$-th joint of frame $k$. The skeleton pose that we deem to reconstruct, $\theta_k = [\omega_{k,0}, \omega_{k,1}, \cdots, \omega_{k,N_J}]$, consists of the relative rotations of the 24 joints, $\omega_{k,j}$, in either axis-angle form ($\omega_{k,j} \in \mathbb{R}^3$) or the recently proposed overparametrized representation of [Zhou et al. 2019] ($\omega_{k,j} \in \mathbb{R}^6$), and $\omega_{k,0} \in \mathbb{R}^3$, the global root position. Every joint determines the relative orientation of a bone to its parent. We refer to the coordinate system spanned by the joint rotation axes of $\omega_{k,j}$ and the vector between $j$ and its parent as local bone coordinates. The 3D world coordinates $q \in \mathbb{R}^3$ of a point $p_{k,j} \in \mathbb{R}^3$ in the $j$-th local bone coordinates are then computed with forward kinematics using homogeneous coordinates,

$$\begin{bmatrix} q \\ 1 \end{bmatrix} = T(\theta_k, j) \begin{bmatrix} p \\ 1 \end{bmatrix}, \quad (1)$$

with the $4 \times 4$ transformation

$$T(\theta_k, j) = \prod_{i \in A(j)} \begin{bmatrix} R(\omega_{k,i}) & a_i \\ \mathbf{0} & 1 \end{bmatrix}, \quad (2)$$

where $\mathbf{0} = [0, 0, 0]$, $A(j)$ is the ordered set of the joint ancestors of $j$, and $a_i \in \mathbb{R}^3$ the joint location of $i$ at the rest pose. Since SPIN predicts $a_j \in \mathbb{R}^3$ for every image, we average the rest pose over all $N$ frames to obtain consistent bone length estimates. The rotation matrix $R(\omega_{k,j})$ is inferred from the axis-angle representation via the Rodrigues formula or by [Zhou et al. 2019] in the 6D case.

Fig. 3. Overview. Our human body model (bottom left) is learned using a photometric reconstruction loss $\mathcal{L}_{SV}$. First, the skeleton pose is initialized with an off-the-shelf pose estimator (gray arrows). Second, this pose is refined via analysis-by-synthesis using volumetric rendering of a neural radiance field (green). Key is a skeleton-relative embedding that links the neural encoding with skeleton pose and enables their joint learning (blue). Faces in H3.6M images blurred for anonymity.

Our goal is to learn a volumetric body model parametrized by $\phi$ and to optimize the agreement of its rendering $C_\phi(\theta_k)$ on the set of images $\{I_k\}_{k=1}^{N}$ with respect to the underlying skeleton pose $\theta_k$. Formally, we write the joint modeling and reconstruction objective

$$\mathcal{L}_{SV}(\phi, \theta) = \sum_k d(C_\phi(\theta_k), I_k) + \lambda d(\theta_k, \hat{\theta}_k), \quad (3)$$

where $\phi$ are the parameters of a neural network defining density and radiance of a volume, $C_\phi$ is a classical volumetric renderer that ray casts the neural volume, $\hat{\theta}_k$ is the initial pose estimate from SPIN, $\lambda$ is a hyperparameter that controls the regularization strength, and $d$ is a distance function, such as the $\ell_2$ distance. This objective is optimized using stochastic gradient descent. In other words, we jointly train the neural network parameters $\phi$ that define how the body model is rendered given pose $\theta_k$, and refine the pose $\theta_k$ such that the same model $\phi$ can explain each image $I_k$. The last term $\lambda d(\theta_k, \hat{\theta}_k)$ controls the amount of change applied to the initial estimate $\hat{\theta}_k$. How pose and model are linked is explained in the following sections.
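To make Eqs. (1)-(2) concrete, the following is a minimal NumPy sketch of the forward kinematics under our notation. It is an illustration under stated assumptions, not the authors' released code; the helper names (rodrigues, forward_kinematics, parents) are ours.

```python
import numpy as np

def rodrigues(omega):
    """Axis-angle vector (3,) -> rotation matrix (3, 3) via the Rodrigues formula."""
    angle = np.linalg.norm(omega)
    if angle < 1e-8:
        return np.eye(3)
    k = omega / angle
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def forward_kinematics(omegas, rest_joints, parents):
    """Per-joint bone-to-world transforms T(theta, j) of Eq. (2).

    omegas:      (J, 3) axis-angle joint rotations omega_{k,j} for one frame.
    rest_joints: (J, 3) rest-pose joint locations a_j, acting as the
                 translation along the kinematic chain.
    parents:     (J,) parent index per joint, -1 for the root; assumed
                 topologically sorted so that parents[j] < j.
    Returns:     (J, 4, 4); multiplying [p, 1] maps from joint-j local bone
                 coordinates to world coordinates, Eq. (1).
    """
    J = len(parents)
    T = np.zeros((J, 4, 4))
    for j in range(J):
        local = np.eye(4)
        local[:3, :3] = rodrigues(omegas[j])
        local[:3, 3] = rest_joints[j]
        T[j] = local if parents[j] < 0 else T[parents[j]] @ local
    return T
```

The inverses of the returned transforms are exactly the world-to-bone mappings that the skeleton-relative embeddings of Section 3.4 build on (Eq. 6).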
Assumptions.
We assume that the images $I_k$ stem from one or multiple videos of the same person captured with static cameras. In the most general case, the input is a monocular video, but we also provide an extension that imposes multi-view constraints if multi-view footage is available. We distinguish the single-view and multi-view case with $\mathcal{L}_{SV}$ and $\mathcal{L}_{MV}$, respectively. Notably, we do not require intrinsic or extrinsic camera calibration in either of these settings. This is different from many structure-from-motion approaches, which explicitly require camera motion with a strong translation component to calibrate; impractical for casual recordings. This uncalibrated setup drastically eases setup time compared to existing solutions that operate in the same accuracy range. Most importantly, the neural body model $\phi$ is learned from scratch; no parametric actor model nor separate model calibration step is required, which additionally eases its application. The steps required to attain this level of automation are explained in the following.

Our rendering model follows that of NeRF [Mildenhall et al. 2020], with only minor additions to the importance sampling that improve the effectiveness by exploiting the estimated initial skeleton pose $\theta$. NeRF stores the density $\sigma$ and radiance $c$ of a 3D point $x$ in view direction $d$ using a 10-layer fully-connected neural network $(\sigma, c) = F_\phi(x, d)$ with 946,500-dimensional parameters $\phi$. Similar to classical frequency-space embeddings of volumetric data, this yields a compact parametric representation of the scene that can be ray-cast by sampling $x$ along the view direction $d$. NeRF attained a breakthrough by using positional encoding, the projection of the query 3D position onto a high number of spatial sine and cosine waves of varying wavelengths. Conditioning on this high-dimensional space encourages the learning of low and high frequency features.

This neural density and radiance field is then formed into an image by ray casting. Each ray emitted from the camera is first sampled at 64 points and $F_\phi$ is evaluated at each. The samples together form a piece-wise constant probability density function (PDF) that is integrated using the Beer-Lambert law. A pixel at coordinate $(u, v)$ is rendered as

$$C_\phi(u, v) = \sum_i T(x_i)\left(1 - \exp(-\sigma(x_i)\,\delta(x_i))\right) c(x_i), \quad (4)$$

with

$$T(x_i) = \exp\left(-\sum_{m=1}^{i-1} \sigma(x_m)\,\delta(x_m)\right), \quad (5)$$

and $x_i$ the sample points, $\delta(x_i)$ the distance between $x_i$ and $x_{i+1}$, and $T(x_i)$ the accumulated transmittance for the ray traveling from the near plane to $x_i$.

At training time, the neural network parameters $\phi$ are optimized by minimizing the distance between the rendered pixel color $C_\phi(u, v)$ and the true image color $I(u, v)$ at pixel coordinate $(u, v)$. In essence, the image formation function, which is differentiable, takes the role of a fixed-pipeline neural network layer that connects to the underlying NeRF network. It facilitates end-to-end training of the network parameters $\phi$, which subsequently encode the radiance and density of the entire scene and can be rendered from novel views by changing the virtual camera (view direction $d$).

We follow the original work in using a coarse-to-fine training strategy. A second neural network $F'_\phi$ is learned by sampling 16 additional samples from the piecewise-constant PDF inferred from the 64 initial samples and repeating the above rendering with the 64+16 locations.
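As a hedged illustration of Eqs. (4)-(5), the per-ray integration can be sketched as below; sample placement and the evaluation of $F_\phi$ are assumed to happen outside the function, and the names are ours.

```python
import numpy as np

def render_ray(sigma, c, z_vals):
    """Integrate one ray. sigma: (S,) densities, c: (S, 3) radiance values,
    z_vals: (S,) sample depths along the ray, all from the evaluated network."""
    delta = np.diff(z_vals, append=z_vals[-1] + 1e10)  # spacing delta(x_i)
    alpha = 1.0 - np.exp(-sigma * delta)               # per-sample opacity
    # T(x_i): transmittance accumulated from the near plane up to sample i
    T = np.exp(-np.cumsum(np.concatenate([[0.0], sigma[:-1] * delta[:-1]])))
    weights = T * alpha                                # per-sample contribution
    color = (weights[:, None] * c).sum(axis=0)         # pixel color C(u, v)
    return color, weights
```

The returned weights define the piecewise-constant PDF from which the 16 additional fine samples are drawn.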
Since both neural networks have the same structure and function and are trained and evaluated in sequence, we make no distinction between them in the following.

However, the original NeRF formulation was designed to reconstruct a single static scene from a massive number of calibrated views, which is not applicable to our setting of a moving person unless captured in a camera dome. Straightforward extensions that we tried, such as conditioning $F_\phi$ on the frame or initial pose estimate $\theta$, fail in that the neural network memorizes all the possible mappings $(\sigma, c) \leftarrow (\theta, x, d)$ from a small training set and degrades into fog-like artifacts when the number of training poses increases. It generalizes poorly not just to unseen human poses, but also to shifted or rotated training poses, as shown in Figure 2 and Section 4.2.

Motivated by the observation that the positional encoding in terms of frequency components in NeRF was a game changer, we derived a novel embedding that is tailored to articulated human motion. The idea is to define the model density and radiance relative to the rigid bones of the skeleton instead of explicitly conditioning on skeleton pose. Although not providing additional information to the neural radiance field, this creates an implicit bias that encourages the network to store only that information in relation to bone $j$ that is influenced by $j$. Embedding information of other joints would not be consistent under relative pose changes, thereby discouraging dependencies due to the increased complexity of relations.

Fig. 4. Encoding ablation study on a novel view. Panels: (a) query point + joint positions, (b) query point + joint positions + Rel. Dir. + Rel. Ray, (c) Rel. Dist. + $\theta$ + Rel. Ray, (d) Rel. Dist. + Rel. Dir. + Rel. Ray. (a) Directly applying positional encoding on the query point $x$ as in NeRF is unsuitable for learning pose-dependent models, even when conditioning on the skeleton joint locations. Also when encoding only the directional information (b) or distances and view-ray direction relatively (c), artifacts remain. (d) Our pose-relative representation drastically improves the quality of the articulated human representation.

In the following, we introduce five skeleton-relative embedding variants that replace the positional encoding in Euclidean space explained in the previous section. Each of them first maps the 3D query point $x$ via the inverse of the bone-to-world transformation $T(\theta_k, j)$ explained in Section 3.1. Formally, we write

$$\begin{bmatrix} \tilde{x}_{k,j} \\ 1 \end{bmatrix} = T^{-1}(\theta_k, j) \begin{bmatrix} x \\ 1 \end{bmatrix}. \quad (6)$$

An overview of all embeddings is given in Figure 4.

Relative Positional (Rel. Pos.) encoding.
The query point $x$ is mapped to all bones of the skeleton with Equation 6. Because we perform positional encoding afterwards as in the original NeRF, this blows up the already massive positional encoding space by a factor of the number of joints (24 for us), drastically increasing the memory footprint. To this end, we propose the following alternatives: two for encoding the query position and two for the view direction.

Relative Distance (Rel. Dist.) encoding.
Given $\theta_k$, we calculate the $\ell_2$ distance from the 3D query point $x$ to joint $j$ by

$$\tilde{v}_{k,j} = \lVert \tilde{x}_{k,j} \rVert. \quad (7)$$

Recall, $\tilde{x}_{k,j}$ is $x$ in the local coordinates of joint $j$. Storing only the distance reduces the input parameters by a factor of three.

Relative Direction (Rel. Dir.) encoding.
Since the distance maps many 3D points to the same scalar, we additionally obtain the direction vector to capture the orientation information of $x$ by

$$\tilde{r}_{k,j} = \frac{\tilde{x}_{k,j}}{\lVert \tilde{x}_{k,j} \rVert}. \quad (8)$$

We do not apply positional encoding on this direction encoding, which dramatically saves network capacity.
Relative Ray Direction (Rel. Ray.) encoding.
NeRF builds upon a radiance field that is a function of the position and view direction. To encode the view direction, we transform $d$ to obtain the relative view-ray direction $\tilde{d}$ by applying the rotational part of the bone-to-world transformation $T(\theta_k, j)$,

$$\tilde{d}_{k,j} = [T^{-1}(\theta_k, j)]_{3 \times 3}\, d, \quad (9)$$

with $[T^{-1}(\theta_k, j)]_{3 \times 3}$ the top-left $3 \times 3$ rotation matrix.

Relative Ray Angle (Ray Angl.) encoding.
The other view-direction encoding we explore is the angle between the ray $d$ and the vector $u$ from the query point to the joint origin,

$$d' = \arccos(d \cdot u), \quad (10)$$

with $\cdot$ the dot product. Like the distance encoding for the position, this encoding has the advantage of being one-dimensional.

These relative embeddings have the advantage of being invariant to global shift and rotation of the person and preserve the piece-wise rigidity of articulated motion while still allowing for pose-dependent deformation. The handling of view-dependent illumination effects is explained in the extensions section (Section 3.6).

With our relative skeleton embeddings introduced, we now turn to defining our Articulated NeRF rendering model, which is tailored to modeling human shape and appearance and lends itself to pose refinement of the underlying skeleton. In a first step, we leave the NeRF model as is and only change the input to the neural network that encodes the radiance field used for volumetric rendering. We will expand on further extensions in the subsequent section. Samples along each view ray are taken as before (see Section 3.3); however, now each query point is mapped to each of the $N_J = 24$ bone coordinate systems and projected to the previously introduced encodings. This yields vectors $\tilde{\mathbf{v}} = [\tilde{v}_{k,1}, \cdots, \tilde{v}_{k,24}]$, $\tilde{\mathbf{r}} = [\tilde{r}_{k,1}, \cdots, \tilde{r}_{k,24}]$, and $\tilde{\mathbf{d}} = [\tilde{d}_{k,1}, \cdots, \tilde{d}_{k,24}]$ that are fed into the neural radiance field function

$$(\sigma, c) = F_\phi(\tilde{\mathbf{v}}, \tilde{\mathbf{r}}, \tilde{\mathbf{d}}). \quad (11)$$

Image formation (Eq. 5) takes in the output of this re-parametrized NeRF but remains unchanged otherwise.

The effect is that density and radiance values are now stored relative to the skeleton parametrized by $\theta$. Changing the skeleton parameters changes the local bone coordinate systems and therefore the density and radiance with them, like a manually rigged character would behave. This conditioning on pose allows us to optimize $\mathcal{L}_{SV}$, our objective (Eq. 3), jointly with respect to pose $\theta$ and neural network weights $\phi$ that define the radiance field.

It is a priori unclear whether this joint optimization can succeed; in fact, our initial experiments on conditioning NeRF on pose explicitly did not. The employed neural network is a general function approximator that has the degrees of freedom to compensate to give the same output for two different configurations of $\theta$. Nevertheless, our experiments show that our complete method can jointly refine pose from an off-the-shelf pose estimate. The learned model is not only consistent and visually accurate but also better matches the ground truth 3D pose.

We believe that our skeleton-relative encodings make learning feasible because they induce an implicit bias to the learning of the MLP, much like the invariance of convolutions, which helps CNNs to learn from images.

Relation to existing pose embeddings.
An alternative would be to define the NeRF field in a reference pose and to map query points to this template. While this works well for points on or close to a surface [Huang et al. 2020; Taylor et al. 2012] and for small deformation fields [Park et al. 2020], it is ill-posed for articulated motion. For instance, when the left and right hand are close, a nearby query point cannot be uniquely attributed to the left or right side.
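To make the embedding definitions concrete, here is a minimal NumPy sketch of the relative distance, direction, and ray encodings (Eqs. (6)-(9)), reusing the bone-to-world transforms from the forward-kinematics sketch above; the function name and the numerical epsilon are ours.

```python
import numpy as np

def relative_embeddings(x, d, T_world):
    """x: (3,) world query point, d: (3,) world view-ray direction,
    T_world: (J, 4, 4) bone-to-world transforms T(theta_k, j)."""
    T_inv = np.linalg.inv(T_world)        # world -> local bone coordinates
    x_h = np.append(x, 1.0)               # homogeneous coordinates
    x_local = (T_inv @ x_h)[:, :3]        # (J, 3), Eq. (6)
    v = np.linalg.norm(x_local, axis=1)   # Rel. Dist., Eq. (7)
    r = x_local / (v[:, None] + 1e-8)     # Rel. Dir., Eq. (8)
    d_local = T_inv[:, :3, :3] @ d        # Rel. Ray, Eq. (9)
    return v, r, d_local
```

The 24 per-joint codes are concatenated and, after the cutoff positional encoding of Section 3.6, form the inputs of Eq. (11).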
We propose several extensions that make Articulated NeRF computationally tractable and more accurate.
Sparse sampling and gradient accumulation.
At training time, we form a batch of rays by randomly sampling 2048 rays from all available images. Therefore, not every frame will be optimized in every iteration. Moreover, every frame that is included may only be sampled with a few rays. As a result, optimizing the pose $\theta$ at every iteration is prone to noise in the stochastic gradient estimate. To counteract this, we accumulate the gradient update $\Delta\theta$ for 50 iterations before updating. Without this, pose refinement diverged in our experiments.
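A hedged PyTorch sketch of this accumulation scheme follows; the toy parameters and loss merely stand in for the NeRF weights and the photometric objective, and the step sizes are assumptions, not values from the paper.

```python
import torch

# Stand-ins for the NeRF weights phi and the per-frame pose parameters theta.
nerf_params = [torch.randn(10, requires_grad=True)]
pose_params = [torch.randn(24, 3, requires_grad=True)]
nerf_opt = torch.optim.Adam(nerf_params, lr=5e-4)  # illustrative step sizes
pose_opt = torch.optim.Adam(pose_params, lr=1e-3)
ACCUM_STEPS = 50  # accumulation length from the paper

for it in range(200):
    # Stand-in for the photometric loss on a sparse batch of 2048 rays.
    loss = (nerf_params[0].sum() + pose_params[0].sum()) ** 2
    loss.backward()                   # gradients accumulate in .grad buffers
    nerf_opt.step(); nerf_opt.zero_grad()
    if (it + 1) % ACCUM_STEPS == 0:   # pose updated only every 50 iterations
        pose_opt.step(); pose_opt.zero_grad()
```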
Background handling.
In addition, to make our network focus on learning the subject's representation, we provide the renderer with the background image $B$, obtained as the median pixel value over all images taken from the same camera, exploiting the constant background assumption. Let $\alpha(x_i) = T(x_i)(1 - \exp(-\sigma(x_i)\,\delta(x_i)))$; the rendered output at $(u, v)$ becomes

$$C_\phi(u, v) = \left(1 - \sum_i \alpha(x_i)\right) B(u, v) + \sum_i \alpha(x_i)\, c(x_i), \quad (12)$$

with the background filling that fraction of light that passes through the entire volume.
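Reusing the per-sample weights $T(x_i)\alpha(x_i)$ from the ray-integration sketch above, Eq. (12) amounts to a single compositing step; the function name is ours.

```python
def composite_with_background(weights, c, background_rgb):
    """weights: (S,) values T(x_i) * alpha(x_i) along one ray, c: (S, 3)
    sample radiance, background_rgb: (3,) median background pixel B(u, v)."""
    residual = 1.0 - weights.sum()  # light passing through the whole volume
    return residual * background_rgb + (weights[:, None] * c).sum(axis=0)
```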
Sampling optimization.
To further increase the sampling efficiency, we define a cylinder surrounding the initial skeleton pose estimate and sample points along the ray segment that lies inside. This is sketched in Figure 3. Moreover, we use the background estimate to segment the person and only sample rays inside the segmentation. Because this estimate is uncertain, we dilate the initial segmentation by 3 pixels.
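A possible implementation of the median background and the dilated person mask that restricts ray sampling is sketched below; the difference threshold is an assumption, only the 3-pixel dilation is from the paper.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def foreground_masks(frames, thresh=25.0, dilate_px=3):
    """frames: (N, H, W, 3) image stack from one static camera."""
    background = np.median(frames, axis=0)  # constant-background assumption
    diff = np.abs(frames.astype(np.float32) - background).max(axis=-1)
    masks = diff > thresh                   # rough person segmentation
    # Dilate to be conservative, since the background estimate is uncertain.
    dilated = [binary_dilation(m, iterations=dilate_px) for m in masks]
    return np.stack(dilated), background
```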
Positional Encoding with Cutoff.
To capture the high frequency details of the training images $\{I_k\}_{k=1}^{N}$, prior works incorporate periodic activation functions [Sitzmann et al. 2020] in their neural networks, or add high frequency input components [Mildenhall et al. 2020; Tancik et al. 2020]. We follow the latter approach in our work. Specifically, we adopt a weighted version of positional encoding [Mildenhall et al. 2020; Vaswani et al. 2017],

$$\gamma(p, w) = w \left[ p, \sin(2^0 \pi p), \cos(2^0 \pi p), \cdots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p) \right], \quad (13)$$

where $L$ is the number of frequency components and $w$ is the weighting factor. We calculate $w$ as a per-joint weighting factor that depends on the distance $\tilde{v}_{k,j}$,

$$w_{k,j} = \begin{cases} 1, & \text{if } \tilde{v}_{k,j} \le t \\ \exp\left(-\beta\left(\tilde{v}_{k,j} - t\right)^2\right), & \text{otherwise}. \end{cases} \quad (14)$$

Here, $t$ denotes a cutoff threshold, and $\beta$ is a hyperparameter controlling how fast the Gaussian weight is reduced to 0 as $\tilde{v}_{k,j}$ increases. We apply $\gamma$ on the relative distance $\tilde{v}_{k,j}$ and view direction $\tilde{d}_{k,j}$ of all joints $j$ such that

$$\tilde{\mathbf{v}} = \left[ \gamma(\tilde{v}_{k,1}, w_{k,1}), \cdots, \gamma(\tilde{v}_{k,24}, w_{k,24}) \right], \quad (15)$$

$$\tilde{\mathbf{d}} = \left[ \gamma(\tilde{d}_{k,1}, w_{k,1}), \cdots, \gamma(\tilde{d}_{k,24}, w_{k,24}) \right]. \quad (16)$$

Fig. 5. Cutoff positional encoding relative to the right hand. We visualize two slices of the world coordinates and the respective encoded values for the right hand joint in two different 3D poses. The encoded values are reduced to 0 for points that are too far away, so that they do not affect the final representation for the right hand.

The intuition behind this weighted positional encoding is that if joint $j$ is far away from $x$ (i.e., has a large $\tilde{v}_{k,j}$), its input encoding should have less influence on the output. Our empirical results show that the proposed encoding helps reduce noise in the background (Section 4.2).
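A sketch of the windowed encoding of Eqs. (13)-(14): the threshold t = 0.5m follows the cutoff ablation (Section 4.2), while the number of frequencies L is an assumption since the paper does not restate it here.

```python
import numpy as np

def cutoff_weight(v, t=0.5, beta=2.0):
    """Eq. (14): per-joint weight from the relative distance v (scalar)."""
    return 1.0 if v <= t else float(np.exp(-beta * (v - t) ** 2))

def gamma(p, w, L=10):
    """Eq. (13): weighted positional encoding of a scalar input p."""
    feats = [p]
    for l in range(L):
        feats += [np.sin((2 ** l) * np.pi * p), np.cos((2 ** l) * np.pi * p)]
    return w * np.array(feats)  # zeroed out for joints far from the query
```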
Appearance codes.
Illumination effects such as shadows and shading depend on the view direction in relation to the light position in world coordinates. We encode the former relative to the skeleton, which, however, is invariant to global position. Following concurrent work on handling illumination changes [Martin-Brualla et al. 2020], we add a 16-dimensional appearance code to the second-to-last layer of the NeRF network $F_\phi$. It is individually stored and optimized for every frame. Due to its position at the end of the network and its low dimensionality, it helps learning these global effects while not deteriorating the benefits of the relative encoding.
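Per-frame appearance codes can be realized as a learned embedding table; a hedged PyTorch sketch follows (all names ours, and the insertion into the network is shown only as a comment).

```python
import torch
import torch.nn as nn

num_frames, code_dim = 1000, 16                # one 16-D code per frame
appearance_codes = nn.Embedding(num_frames, code_dim)

code = appearance_codes(torch.tensor([42]))    # (1, 16), code for frame 42
# Inside the radiance MLP, the code would be concatenated to the features of
# the second-to-last layer and optimized jointly with the network weights.
```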
Fig. 6. Multi-view setup. Our A-NeRF learning and test-time pose refinement method naturally generalizes to multiple views. For every frame, a single view-consistent neural radiance field (center) as well as the underlying 3D skeleton pose is reconstructed. The camera pose is estimated and refined for every frame and view automatically. Faces in H3.6M images blurred for anonymity.
Multi-view constraints.
We can further incorporate a multi-view constraint to improve pose refinement when the motion is captured from multiple cameras. Figure 6 shows the two-camera case that is most practical. For initializing $\theta_k$, we average the individual joint rotation estimates from all $V$ views $v \in [1, \dots, V]$. Since rotations are relative to the parent and root, this works without calibrating the cameras. Only the global position and orientation remain specific to view $v$. To this end, we extend our single-view notation with superscripts: position $\tau^{(v)}_{k}$ and orientation $\omega^{(v)}_{k,0}$ are estimated relative to camera $v$. For refinement, we extend the single-view objective Eq. 3 to make $\theta_k$ view-consistent. With slight abuse of notation we write

$$\mathcal{L}_{MV}(\phi, \theta) = \sum_k \sum_v d\left(C_\phi\left(\theta_k, \tau^{(v)}_{k}, \omega^{(v)}_{k,0}, K^{(v)}\right), I^{(v)}_k\right) + \lambda d(\theta_k, \hat{\theta}_k), \quad (17)$$

with $\theta_k$ shared across all views $v$, except for the global skeleton position, $\tau^{(v)}_{k}$, and global orientation, $\omega^{(v)}_{k,0}$, which are estimated independently per camera view $v$ since the relative camera position and orientation are unknown in our setting.

We train our A-NeRF for a total of 500k iterations. During the first 100k, we jointly update the A-NeRF weights $\phi$ and refine the pose estimates $\theta$ using the $\ell_1$ distance. We then stop optimizing $\theta$ and continue to train the A-NeRF model for an additional 400k iterations using the $\ell_2$ distance for improved visual fidelity. We use the Adam optimizer [Kingma and Ba 2015] and keep separate step sizes for A-NeRF and pose refinement, with $\beta = 2$ and the cutoff threshold $t = 0.5$m.

We perform a large number of experiments on real as well as synthetic data. The synthetic experiments allow us to vary independent factors of variation and to have perfect ground truth. The evaluation on established benchmarks with real persons and images shows the generality of the approach and quantifies the improvements brought about by our contributions compared to the state-of-the-art approaches. Computer graphics applications of novel-view synthesis and retargeting are shown in the subsequent applications section.
Human 3.6M Benchmark Dataset.
This dataset [Catalin Ionescu 2011; Ionescu et al. 2014] is the most widely used benchmark for single and multi-view human motion capture. Human3.6M features 3.6 million images captured from four cameras with varying position, five training subjects, two validation subjects, and accurate, marker-based 3D pose ground truth. We follow the widely used train/test split from [Nibali et al. 2019], subsampling the test videos at every 64th frame to reduce the dataset size from millions to thousands without compromising expressiveness. The test set features two subjects (S9 and S11) performing 14 different actions with two repetitions each; their respective numbers of images to optimize are 5012 and 3712.

Metric.
We report the MPJPE metric, the Euclidean distance between predicted and ground truth joint positions averaged over all frames and joints of the test set. We utilize the PA-MPJPE variant that performs Procrustes alignment between prediction and ground truth. This alignment in scale and orientation is essential for comparing approaches that do not assume knowledge of the ground truth calibration and are, hence, ill-posed with respect to the factors that the alignment removes.
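For reference, a self-contained sketch of PA-MPJPE: a similarity (Procrustes) alignment via the Kabsch algorithm with scale, followed by the mean per-joint error. This illustrates the standard metric and is not the benchmark's official evaluation script.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """pred, gt: (J, 3) joint positions for one frame; error in gt units."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g                 # center both point sets
    U, S, Vt = np.linalg.svd(P.T @ G)             # optimal rotation (Kabsch)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ D @ U.T
    scale = (S * D.diagonal()).sum() / (P ** 2).sum()            # optimal scale
    aligned = scale * P @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```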
Test-time Optimization.
As explained in Section 3.4, pose is initialized by applying SPIN on all test images individually. SPIN was trained on the Human3.6M training set, among others. This initial estimate is then jointly refined with the learning of the A-NeRF model, for 100k iterations on all test images using Equation 3, without assuming any additional knowledge of the camera or 3D pose. In the machine learning literature, this is sometimes called transductive learning.
Comparison to Single-View Approaches.
Table 1 categorizes 3D pose estimation approaches into four categories and compares within each. The single-view approaches that predict 3D joint positions but not skeleton pose play in an individual class because not enforcing skeleton constraints allows for poses that have a low MPJPE but are unrealistic, for instance, the center between two possible modes with shrunk bone length, violating the constant bone-length assumption of those using skeletons. The most accurate 3D joint position estimation methods attain a PA-MPJPE below 40 mm. Our test-time optimization improves on the state-of-the-art methods in skeleton pose prediction. Note that improvements in mm may appear marginal, but the dataset is nearly saturated, with the most recent methods fighting for the last mm of improvement.
Table 1. Quantitative evaluation on Human3.6M [Ionescu et al. 2014]. Single-view 3D joint position methods obtain the highest accuracies. Our test-time optimization yields skeleton pose, which is slightly harder due to the kinematic constraints, and improves upon the SPIN and VIBE baselines. Additional gains are possible in the uncalibrated multi-view setup, with accuracies approaching that of the calibrated related work using all four cameras.

Method | PA-MPJPE ↓
Single view, 3D joint positions:
  Pavlakos et al. [2017] | 51.9
  XNect [Mehta et al. 2020] (before skeleton fit) | 48.5
  Martinez et al. [2017] | 47.7
  Nibali et al. [2019] |
Single view, skeleton pose:
  MotioNet [Shi et al. 2020] | 54.6
  HoloPose [Guler and Kokkinos 2019] | 46.5
  SPIN [Kolotouros et al. 2019] | 41.9
  VIBE [Kocabas et al. 2020] | 41.4
  Ours (single-view) |
Multiple uncalibrated views, skeleton pose:
  SPIN-multiview (2 views) | 38.2
  Ours (2 views) | 34.1
  SPIN-multiview (4 views) | 34.0
  Ours (4 views) |
Multiple calibrated views:
  Tome et al. [2018] (4 views) | 44.6
  Iskakov et al. [2019] (4 views) |
Comparison to Multi-view Approaches.
Our approach naturally extends to multi-view inference, as introduced in Section 3.6. The lower half of Table 1 shows that our multi-view refinement improves over all single-view approaches and outperforms simple baselines such as averaging pose estimates from the off-the-shelf SPIN pose estimator across views (SPIN-multiview). These improvements are consistent across two and four views, with four views being more accurate, as expected. We cannot compete with approaches that utilize exact knowledge of the camera location, orientation, and intrinsic calibration. These boil down to detecting 2D body parts in each view and triangulating the 3D pose. This strategy, however, does not translate to our more general case where cameras are not manually calibrated. The methods listed in this category therefore only serve as a lower bound to the accuracy uncalibrated approaches may attain.
Synthetic Test Bench (SURREAL).
To generate synthetic data of animated human performances, we employ the models of the SURREAL dataset [Varol et al. 2017] animated with motions from the CMU Graphics Lab Motion Capture Database (http://mocap.cs.cmu.edu). We build a training set of 10,800 frames by selecting 20 motion clips from the CMU MoCap dataset, 60 frames long each, captured by nine virtual cameras. We further create an evaluation set of 5 motion clips, including extreme motions like a back-flip, cart-wheel, and capoeira, with nine virtual cameras at new positions and angles.

Table 2. Monocular vs. multi-view reconstruction. The PSNR and SSIM scores show that our model can learn as well from a single view as from multiple ones. Adding more and diverse poses to the training is more important than additional views.

Table 3. Positional encoding trade-off. Encoding world coordinates does not succeed on motions and has a large memory consumption. Encoding positions relative to the skeleton works yet also has high dimensionality. Our full model that combines distance and direction performs well in both aspects.
Metrics.
We measure image quality on a novel view of a pose unseen during optimization by comparing the dataset image with the rendering of A-NeRF, using the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [Wang et al. 2004]. The latter computes a contrast-normalized similarity score between rendering and reference image.
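Both metrics are available off the shelf, e.g., in scikit-image (version 0.19 or newer for the channel_axis argument):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_quality(rendered, reference):
    """rendered, reference: (H, W, 3) float images in [0, 1]."""
    psnr = peak_signal_noise_ratio(reference, rendered, data_range=1.0)
    ssim = structural_similarity(reference, rendered, channel_axis=-1,
                                 data_range=1.0)
    return psnr, ssim
```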
Monocular vs. Multi-view quality.
Surprisingly, the visual fidelity of our model trained on a single view is nearly as good as when trained on three or more views. Table 2 lists the PSNR and SSIM for 1, 3, and 9 views and different numbers of poses for each. The difference between using a single or multiple views is only 1% PSNR for 1200 poses. It is more important to have diverse poses than multiple synchronized views, as evidenced by the comparison of using the same number of images, where the monocular setting scores higher by using only one camera but longer sequences. Multiple views are only beneficial for disambiguating depth ambiguities for 3D reconstruction, as explained in the previous section, but these have little influence on the learned appearance. All following ablation studies are performed with nine views.
Impact of Query Position Encoding.
The neural radiance field is position- and direction-dependent. We therefore analyze the effect of encoding each quantity individually. Table 3 shows that straightforwardly extending the original NeRF [Mildenhall et al. 2020] to learning a neural radiance field in world coordinates and conditioning on human pose (3D joint positions) yields low quality. Our full model requires in addition the direction from query point to joint in relative bone coordinates (Rel. Dist. + Rel. Dir. + Ray Dir.), which attains a 20% higher PSNR while only marginally increasing the memory footprint, as Rel. Dir. succeeds without positional encoding.

Table 4. Distance-based positional encoding is compact (360 dim) but insufficient (lower PSNR and SSIM) to encode skeleton-relative query locations unless paired with direction information in our full model (72 dim, w/o positional encoding).

Position Rep. | Direction Rep. | Dim. | PSNR ↑ | SSIM ↑
Rel. Dist. | Rel. Ray | 360 + 0 + 648 | |
Rel. Dist. + Rotation angles (θ as input) | Rel. Ray | | 19.25 | 0.8152
Rel. Dist. + Rel. Dir. (our full model) | Rel. Ray | 360 + 72 + 648 | |
Table 5. Directional encoding impact. The influence of the directional encoding is small but noticeable. It works best to transfer the ray direction relative to the bone coordinates.

Position Rep. | Direction Rep. | Dim. | PSNR ↑ | SSIM ↑
Rel. Dist. + Rel. Dir. | Ray Ang. | 360 + 72 + 27 | 23.92 | 0.9318
Rel. Dist. + Rel. Dir. | Rel. Ray (our full model) | 360 + 72 + 648 | |
Table 6. Cutoff influence. Our cutoff limits the influence radius of each bone, thereby increasing accuracy, particularly in the foreground (PSNR depicted in braces).

Cutoff Type | Position Rep. | Direction Rep. | PSNR ↑ | SSIM ↑
None | Rel. Dist. + Rel. Dir. | Rel. Ray | 24.12 (19.50) | 0.9228 (0.7803)
Hard | Rel. Dist. + Rel. Dir. | Rel. Ray | (19.58) | (0.7822)
Soft | Rel. Dist. + Rel. Dir. | Rel. Ray | 24.18 | 

Impact of View Direction Encoding.
The effect of the direction encoding on the radiance field has a smaller influence on the final result since it predominantly models low-frequency shading information. Table 5 reveals that positional encoding of the view-ray direction relative to the bone coordinates works best. It exhibits fewer artifacts compared to only storing the ray angle.
Fig. 7.
Bullet time effect using A-NeRF.
The camera can be freely rotated and the focal length and position can be changed at test time to synthesize novel views of performances captured with A-NeRF. This example is captured from nine virtual cameras.
Fig. 8.
Joint optimization and pose refinement corrects errors in the pose estimation and leads to more accurate A-NeRF models. Texture details improve and noise is drastically reduced as the model better aligns with the input image after pose refinement. Artifacts remain at the intersection of the human body model with objects, such as the floor and a chair (bottom right), since disentanglement without prior shape models is difficult. All results are single-view. Faces in H3.6M images blurred for anonymity.
Impact of Cutoff.
Adding the distance-dependent cutoff further boosts the quality, primarily removing ghosting artifacts around the person. Table 6 compares the different variants; the soft cutoff (50cm as cutoff threshold) strikes the best compromise between foreground (reported in braces in Table 6) and background metrics. The foreground metrics are computed only over those pixels contained within the ground truth mask, which is known exactly on the synthetic test sequences.
We further evaluate our approach in two application scenarios, in addition to the preceding quantitative evaluation.
Novel-View Synthesis.
Even though our human body models are learned from single camera views, we can synthesize the same pose in a novel view by changing the skeleton position and orientation relative to the camera. This can be used for bullet time effects in movies, for stereoscopic rendering and other entertainment forms, and as a tool for visualizing motions in 3D for sports and medical analysis by experts. Figure 7 demonstrates a bullet time effect, and the teaser, Figure 1, shows two of our characters learned from synthetic data, the female learned from just a single view and the male from multi-view. Irrespective of body shape and number of views, both characters are crisp and nearly indistinguishable from the mesh models they are trained on. Reconstructing real images is considerably more difficult due to motion blur, varying illumination, and limited image resolution. Figure 8 shows our reconstruction from the Human 3.6M test set, rendered from a novel view. Artifacts remain due to 1) shadowing on the floor and objects (chair) that are present in some frames and are partially modelled by A-NeRF, 2) color ambiguities between legs and reddish background, and 3) the surface-free volumetric model not perfectly learning all dependencies. Still, the skeleton pose refinement significantly improves results and makes learning from real images possible.
Fig. 9.
Motion retargeting from H3.6M reconstructions to SURREAL characters, both reconstructed with A-NeRF. The underlying kinematic skeleton can be posed and animated as a regular rig, here shown with a simple transfer of joint angles over time and rendered from a side view. H3.6M characters are learned from a single view. The male SURREAL character is captured from 9 views and the female from a single one, yet both are of similar quality. Faces in H3.6M images blurred for anonymity.
Motion Retargeting.
A classic animation application is the retargeting of a source motion onto a target character. Since we learn a body model and pose simultaneously, A-NeRF can act as a motion source and character source. To demonstrate this, we used continuous videos from the Human3.6M dataset, refined the pose with the A-NeRF model optimized on the entire set, and retargeted the motions onto the SURREAL characters, see Figure 9. Conversely, Figure 10 drives the Human3.6M characters with CMU skeleton motion. This works as long as both characters have the same underlying skeleton, by combining the fixed neural network parameters of the target character with the source skeleton pose.
The advantage of the neural model is that no explicit volumetric modeling, surface scan, skinning function, or association of points to rigid bones is necessary. These steps are learned end-to-end from image data and apply to diverse motions, shapes, and appearances. Still, the resulting neural body model can be posed by key-framing or motion retargeting. Our learned models are actor specific. It is an interesting future direction to learn a parametric A-NeRF body shape space on a diverse set of actors and apparel, similarly to how surface-based models are currently trained on laser scans.
Fig. 10. Motion retargeting from CMU MoCap sequences to A-NeRF characters learned using the H3.6M dataset. The transfer also works from existing motion capture data when the underlying skeletons either match or can be matched with existing animation pipelines. All depicted characters are learned from single-view video footage.

This would allow artists to edit shape and appearance in low-dimensional spaces, as for a classical rig.

Our required computation time is enormous and the biggest bottleneck in extending to long sequences and multiple actors. Some of the remaining artifacts can also be attributed to the low number of ray samples we use to make reconstruction and rendering tractable. We believe that the underlying ray-tracing model will benefit from emerging ray-tracing hardware and can be further improved by optimizations known in the rendering literature.

In contrast to classical analysis-by-synthesis approaches that optimize over a short video and exploit temporal constraints, we found training on a diverse set of poses sampled uniformly from the available input videos the most effective use of the available resources. Although a single static camera suffices, A-NeRF requires seeing the person from all sides in varying poses to learn pose dependencies and independence. This is in contrast to other works that require as little motion as possible [Alldieck et al. 2018a]. Nevertheless, once a body representation is learned, continuous motion can be reconstructed on consecutive video frames with the volumetric model being fixed.
CONCLUSION
We proposed a fully automatic approach for estimating a volumetric actor model while jointly refining the skeleton pose from monocular or multi-view video. A-NeRF is the first approach to define NeRF models for extreme and articulated motion, and it scores high on the Human 3.6M benchmark. Based on a new and compact skeleton-relative embedding, our approach reconstructs a personalized volumetric density field with texture detail together with the time-varying poses of an actor. Importantly, it works from a single video and naturally extends to multiple views, without requiring camera calibration in either case. This is an important step towards making motion capture more accurate and practical.
ACKNOWLEDGMENTS
Shih-Yang Su, Frank Yu, and Helge Rhodin were supported by Compute Canada and Advanced Research Computing at the University of British Columbia [Computing 2019].
REFERENCES
Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, and Victor Lempitsky. 2019. Neural Point-Based Graphics. arXiv preprint arXiv:1906.08240 (2019).
Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-Moll. 2019a. Learning to Reconstruct People in Clothing from a Single RGB Camera. In CVPR. 1175-1186.
Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. 2018a. Detailed Human Avatars from Monocular Video. In 3DV. IEEE, 98-109.
Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. 2018b. Video Based Reconstruction of 3D People Models. In CVPR.
Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. 2019b. Tex2Shape: Detailed Full Human Body Geometry From a Single Image. In ICCV.
Alexandru O. Balan, Leonid Sigal, Michael J. Black, James E. Davis, and Horst W. Haussecker. 2007. Detailed Human Shape and Pose from Images. In CVPR. 1-8.
Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. 2019. Multi-Garment Net: Learning to Dress 3D People from Images. In ICCV. IEEE.
Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. 2016. Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. In ECCV. Springer, 561-578.
Catalin Ionescu, Fuxin Li, and Cristian Sminchisescu. 2011. Latent Structured Models for Human Pose Estimation. In ICCV.
Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. 2020. Monocular Expressive Body Regression through Body-Driven Attention. In ECCV. https://expose.is.tue.mpg.de
UBC Advanced Research Computing. 2019. UBC ARC Sockeye. https://doi.org/10.14288/SOCKEYE
Boyang Deng, JP Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, and Andrea Tagliasacchi. 2019. NASA: Neural Articulated Shape Approximation. arXiv preprint arXiv:1912.03207 (2019).
Junting Dong, Qing Shuai, Yuanqing Zhang, Xian Liu, Xiaowei Zhou, and Hujun Bao. 2020. Motion Capture from Internet Videos. In ECCV. Springer, 210-227.
Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B. Tenenbaum, and Jiajun Wu. 2020. Neural Radiance Flow for 4D View Synthesis and Video Processing. arXiv e-prints (2020).
Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2020. Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction. arXiv preprint arXiv:2012.03065 (2020).
Chen Gao, Yichang Shih, Wei-Sheng Lai, Chia-Kai Liang, and Jia-Bin Huang. 2020. Portrait Neural Radiance Fields from a Single Image. arXiv preprint arXiv:2012.05903 (2020).
Peng Guan, Alexander Weiss, Alexandru O. Balan, and Michael J. Black. 2009. Estimating Human Shape and Pose from a Single Image. In ICCV. 1381-1388. https://doi.org/10.1109/ICCV.2009.5459300
Riza Alp Guler and Iasonas Kokkinos. 2019. HoloPose: Holistic 3D Human Reconstruction In-The-Wild. In CVPR. 10884-10894.
Marc Habermann, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. 2019. LiveCap: Real-time Human Performance Capture from Monocular Video. ACM Transactions on Graphics (Proc. SIGGRAPH) (July 2019).
Tong He, John Collomosse, Hailin Jin, and Stefano Soatto. 2020. Geo-PIFu: Geometry and Pixel Aligned Implicit Functions for Single-view Human Reconstruction. arXiv preprint arXiv:2006.08072 (2020).
Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, and Tony Tung. 2020. ARCH: Animatable Reconstruction of Clothed Humans. In CVPR. IEEE, 3090-3099. https://doi.org/10.1109/CVPR42600.2020.00316
Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2014. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (July 2014), 1325-1339.
Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. 2019. Learnable Triangulation of Human Pose. In ICCV.
Arjun Jain, Thorsten Thormählen, Hans-Peter Seidel, and Christian Theobalt. 2010. MovieReshape: Tracking and Reshaping of Humans in Videos. TOG 29, 6, Article 148 (Dec. 2010), 10 pages. https://doi.org/10.1145/1882261.1866174
Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. 2018. End-to-end Recovery of Human Shape and Pose. In CVPR.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. 2020. VIBE: Video Inference for Human Body Pose and Shape Estimation. In CVPR. 5253-5263.
Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. 2019. Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop. In ICCV. 2252-2261.
Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V. Gehler. 2017. Unite the People: Closing the Loop Between 3D and 2D Human Representations. In CVPR.
Christoph Lassner and Michael Zollhöfer. 2020. Pulsar: Efficient Sphere-based Neural Rendering. arXiv:2004.07484 [cs.GR]
Shichao Li, Lei Ke, Kevin Pratama, Yu-Wing Tai, Chi-Keung Tang, and Kwang-Ting Cheng. 2020a. Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data. In CVPR.
Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. 2020b. Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes. arXiv preprint arXiv:2011.13084 (2020).
Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. 2020. Neural Sparse Voxel Fields. NeurIPS 33 (2020).
Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. 2019. Soft Rasterizer: A Differentiable Renderer for Image-based 3D Reasoning. In ICCV. 7708-7717.
Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep Appearance Models for Face Rendering. ACM Trans. Graph. 37, 4, Article 68 (July 2018), 13 pages. https://doi.org/10.1145/3197517.3201401
Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. 2019. Neural Volumes: Learning Dynamic Renderable Volumes from Images. ACM Trans. Graph. 38, 4, Article 65 (July 2019), 14 pages. https://doi.org/10.1145/3306346.3323020
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM TOG (Proc. SIGGRAPH Asia) 34, 6 (2015), 1-16.
Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. 2020. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. arXiv preprint arXiv:2008.02268 (2020).
Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. 2017. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In ICCV.
Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Mohamed Elgharib, Pascal Fua, Hans-Peter Seidel, Helge Rhodin, Gerard Pons-Moll, and Christian Theobalt. 2020. XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera. ACM Transactions on Graphics 39, 4 (2020). https://doi.org/10.1145/3386569.3392410
Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. 2017. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. TOG 36, 4 (2017). https://doi.org/10.1145/3072959.3073596
Moustafa Meshry, Dan B. Goldman, Sameh Khamis, Hugues Hoppe, Rohit Pandey, Noah Snavely, and Ricardo Martin-Brualla. 2019. Neural Rerendering in the Wild. In CVPR. 6878-6887.
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
Francesc Moreno-Noguer. 2017. 3D Human Pose Estimation from a Single Image via Distance Matrix Regression. In CVPR.
Aiden Nibali, Zhen He, Stuart Morgan, and Luke Prendergast. 2019. 3D Human Pose Estimation with 2D Marginal Heatmaps. In WACV.
Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter Gehler, and Bernt Schiele. 2018. Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation. In 3DV.
Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L. Davidson, Sameh Khamis, Mingsong Dou, Vladimir Tankovich, Charles Loop, Qin Cai, Philip A. Chou, Sarah Mennicken, Julien Valentin, Vivek Pradeep, Shenlong Wang, Sing Bing Kang, Pushmeet Kohli, Yuliya Lutchyn, Cem Keskin, and Shahram Izadi. 2016. Holoportation: Virtual 3D Teleportation in Real-Time. In UIST '16. ACM, New York, NY, USA, 741-754. https://doi.org/10.1145/2984511.2984517
Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. 2020. Deformable Neural Radiance Fields. arXiv preprint arXiv:2011.12948 (2020).
Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. 2017. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose. In CVPR.
Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. 2018. Learning to Estimate 3D Human Pose and Shape from a Single Color Image. In CVPR.
Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. 2020. Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans. arXiv preprint arXiv:2012.15838 (2020).
Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. 2020. D-NeRF: Neural Radiance Fields for Dynamic Scenes. arXiv preprint arXiv:2011.13961 (2020).
Helge Rhodin, Christian Richardt, Dan Casas, Eldar Insafutdinov, Mohammad Shafiei, Hans-Peter Seidel, Bernt Schiele, and Christian Theobalt. 2016a. EgoCap: Egocentric Marker-less Motion Capture with Two Fisheye Cameras. ACM Transactions on Graphics (TOG) 35, 6 (2016), 1-11.
Helge Rhodin, Nadia Robertini, Dan Casas, Christian Richardt, Hans-Peter Seidel, and Christian Theobalt. 2016b. General Automatic Human Shape and Motion Capture Using Volumetric Contour Cues. In ECCV. 509-526.
Helge Rhodin, Nadia Robertini, Christian Richardt, Hans-Peter Seidel, and Christian Theobalt. 2015. A Versatile Scene Model with Differentiable Visibility Applied to Generative Pose Estimation. In ICCV. 765-773.
Nadia Robertini, Dan Casas, Helge Rhodin, Hans-Peter Seidel, and Christian Theobalt. 2016. Model-Based Outdoor Performance Capture. In 3DV. IEEE, 166-175.
Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. 2019. PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization. In ICCV. 2304-2314.
Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. 2020. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization. In CVPR. 84-93.
Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. 2020. GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. NeurIPS 33 (2020).
Mingyi Shi, Kfir Aberman, Andreas Aristidou, Taku Komura, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. 2020. MotioNet: 3D Human Motion Reconstruction from Monocular Video with Skeleton Consistency. ACM Transactions on Graphics (TOG) 40, 1 (2020), 1-15.
Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. 2020. Implicit Neural Representations with Periodic Activation Functions. NeurIPS 33 (2020).
Vincent Sitzmann, Justus Thies, Felix Heide, Matthias Nießner, Gordon Wetzstein, and Michael Zollhöfer. 2019a. DeepVoxels: Learning Persistent 3D Feature Embeddings. In CVPR.
Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. 2019b. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In NeurIPS. 1121-1132.
Carsten Stoll, Nils Hasler, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. 2011. Fast Articulated Motion Tracking Using a Sums of Gaussians Body Model. In ICCV. 951-958.
Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. 2020. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS (2020).
Jonathan Taylor, Jamie Shotton, Toby Sharp, and Andrew Fitzgibbon. 2012. The Vitruvian Manifold: Inferring Dense Correspondences for One-Shot Human Pose Estimation. In CVPR. IEEE, 103-110.
A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner, R. Pandey, S. Fanello, G. Wetzstein, J.-Y. Zhu, C. Theobalt, M. Agrawala, E. Shechtman, D. B. Goldman, and M. Zollhöfer. 2020. State of the Art on Neural Rendering. Computer Graphics Forum (EG STAR 2020) (2020).
Justus Thies, Michael Zollhöfer, and Matthias Nießner. 2019. Deferred Neural Rendering: Image Synthesis Using Neural Textures. ACM Transactions on Graphics (TOG) (2019).
Denis Tome, Matteo Toso, Lourdes Agapito, and Chris Russell. 2018. Rethinking Pose in 3D: Multi-Stage Refinement and Recovery for Markerless Motion Capture. In 3DV. IEEE.
Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. 2020. Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Deforming Scene from Monocular Video. arXiv preprint arXiv:2012.12247 (2020).
Alex Trevithick and Bo Yang. 2020. GRF: Learning a General Radiance Field for 3D Scene Representation and Rendering. arXiv preprint arXiv:2010.04595 (2020).
Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. 2017. Self-Supervised Learning of Motion Capture. In NeurIPS. 5242-5252.
Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. 2017. Learning from Synthetic Humans. In CVPR.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NeurIPS.
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600-612.
Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson. 2020. SynSin: End-to-End View Synthesis from a Single Image. In CVPR. 7467-7477.
Chenglei Wu, Kiran Varanasi, Yebin Liu, Hans-Peter Seidel, and Christian Theobalt. 2011. Shading-Based Dynamic Shape Refinement from Multi-View Video under General Illumination. In ICCV. IEEE, 1108-1115. https://doi.org/10.1109/ICCV.2011.6126358
Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. 2020. Space-Time Neural Irradiance Fields for Free-Viewpoint Video. arXiv preprint arXiv:2011.12950 (2020).
Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. 2019. Monocular Total Capture: Posing Face, Body, and Hands in the Wild. In CVPR. 10965-10974.
Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xiaokang Yang, and Wenjun Zhang. 2020. Deep Kinematics Analysis for Monocular 3D Human Pose Estimation. In CVPR.
Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. 2018. MonoPerfCap: Human Performance Capture from Monocular Video. TOG 37, 2 (2018), 27.
Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. 2020. pixelNeRF: Neural Radiance Fields from One or Few Images. arXiv preprint arXiv:2012.02190 (2020).
Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. 2020. NeRF++: Analyzing and Improving Neural Radiance Fields. arXiv preprint arXiv:2010.07492 (2020).
Xingyi Zhou, Xiao Sun, Wei Zhang, Shuang Liang, and Yichen Wei. 2016. Deep Kinematic Pose Regression. In ECCV Workshops.
Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. 2019. On the Continuity of Rotation Representations in Neural Networks. In CVPR. 5745-5753.
Silvia Zuffi, Angjoo Kanazawa, Tanya Berger-Wolf, and Michael J. Black. 2019. Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images "In the Wild". In