PhysCap: Physically Plausible Monocular 3D Motion Capture in Real Time
Soshi Shimada, Vladislav Golyanik, Weipeng Xu, Christian Theobalt
SOSHI SHIMADA, Max Planck Institute for Informatics, Saarland Informatics Campus
VLADISLAV GOLYANIK, Max Planck Institute for Informatics, Saarland Informatics Campus
WEIPENG XU, Facebook Reality Labs
CHRISTIAN THEOBALT, Max Planck Institute for Informatics, Saarland Informatics Campus
Fig. 1. PhysCap captures global 3D human motion in a physically plausible way from monocular videos in real time, automatically and without the use of markers. (Left:) Video of a standing long jump (Peng et al. 2018) and our 3D reconstructions. Thanks to its formulation on the basis of physics-based dynamics, our algorithm recovers challenging 3D human motion observed in 2D while significantly mitigating artefacts such as foot sliding, foot-floor penetration, unnatural body leaning and jitter along the depth channel that troubled earlier monocular pose estimation methods. (Right:) Since the output of PhysCap is environment-aware and the returned root position is global, it is directly suitable for virtual character animation, without any further post-processing. The 3D characters are taken from (Adobe 2020). See our supplementary video for further results and visualisations.
Marker-less 3D human motion capture from a single colour camera has seen significant progress. However, it is a very challenging and severely ill-posed problem. In consequence, even the most accurate state-of-the-art approaches have significant limitations. Purely kinematic formulations on the basis of individual joints or skeletons, and the frequent frame-wise reconstruction in state-of-the-art methods greatly limit 3D accuracy and temporal stability compared to multi-view or marker-based motion capture. Further, captured 3D poses are often physically incorrect and biomechanically implausible, or exhibit implausible environment interactions (floor penetration, foot skating, unnatural body leaning and strong shifting in depth), which is problematic for any use case in computer graphics.

We, therefore, present PhysCap, the first algorithm for physically plausible, real-time and marker-less human 3D motion capture with a single colour camera at 25 fps. Our algorithm first captures 3D human poses purely kinematically. To this end, a CNN infers 2D and 3D joint positions, and subsequently, an inverse kinematics step finds space-time coherent joint angles and global 3D pose. Next, these kinematic reconstructions are used as constraints in a real-time physics-based pose optimiser that accounts for environment constraints (e.g., collision handling and floor placement), gravity, and biophysical plausibility of human postures. Our approach employs a combination of ground reaction force and residual force for plausible root control, and uses a trained neural network to detect foot contact events in images. Our method captures physically plausible and temporally stable global 3D human motion, without physically implausible postures, floor penetrations or foot skating, from video in real time and in general scenes. PhysCap achieves state-of-the-art accuracy on established pose benchmarks, and we propose new metrics to demonstrate the improved physical plausibility and temporal stability. The video is available at http://gvv.mpi-inf.mpg.de/projects/PhysCap

CCS Concepts: • Computing methodologies → Computer graphics; Motion capture;

Additional Key Words and Phrases: Monocular Motion Capture, Physics-Based Constraints, Real Time, Human Body, Global 3D
3D human pose estimation from monocular RGB images is a very active area of research. Progress is fueled by many applications with an increasing need for reliable, real-time and simple-to-use pose estimation. Here, applications in character animation, VR and AR, telepresence, or human-computer interaction, are only a few examples of high importance for graphics.

Monocular and markerless 3D capture of the human skeleton is a highly challenging and severely underconstrained problem (Kovalenko et al. 2019; Martinez et al. 2017; Mehta et al. 2017b; Pavlakos et al. 2018; Wandt and Rosenhahn 2019). Even the best state-of-the-art algorithms, therefore, exhibit notable limitations. Most methods capture pose kinematically using individually predicted joints but do not produce smooth joint angles of a coherent kinematic skeleton.
ACM Transactions on Graphics, Vol. 1, No. 1, Article 1. Publication date: November 2020.

Many approaches perform per-frame pose estimates with notable temporal jitter, and reconstructions are often in root-relative but not global 3D space. Even if a global pose is predicted, depth prediction from the camera is often unstable. Also, interaction with the environment is usually entirely ignored, which leads to poses with severe collision violations, e.g., floor penetration or implausible foot sliding and incorrect foot placement. Established kinematic formulations also do not explicitly consider biomechanical plausibility of reconstructed poses, yielding reconstructed poses with improper balance, inaccurate body leaning, or temporal instability.

We note that all these artefacts are particularly problematic in the aforementioned computer graphics applications, in which temporally stable and visually plausible motion control of characters from all virtual viewpoints, in global 3D, and with respect to the physical environment, are critical. Further on, we note that established metrics in widely-used 3D pose estimation benchmarks (Ionescu et al. 2013; Mehta et al. 2017a), such as mean per joint position error (MPJPE) or 3D percentage of correct keypoints (3D-PCK), which are often even evaluated after a 3D rescaling or Procrustes alignment, do not adequately measure these artefacts. In fact, we show (see Sec. 4, and supplemental video) that even some top-performing methods on these benchmarks produce results with substantial temporal noise and unstable depth prediction, with frequent violation of environment constraints, and with frequent disregard of physical and anatomical pose plausibility.
In consequence, there is still a notable gap between monocular 3D human pose estimation approaches and the gold-standard accuracy and motion quality of suit-based or marker-based motion capture systems, which are unfortunately expensive, complex to use and not suited for many of the aforementioned applications requiring in-the-wild capture. We, therefore, present PhysCap, a new approach for easy-to-use monocular global 3D human motion capture that significantly narrows this gap and substantially reduces the aforementioned artefacts; see Fig. 1 for an overview.

PhysCap is, to our knowledge, the first method that jointly possesses all the following properties: it is fully automatic, markerless, works in general scenes, runs in real time, and captures a space-time coherent skeleton pose and global 3D pose sequence of state-of-the-art temporal stability and smoothness. It exhibits state-of-the-art posture and position accuracy, and captures physically and anatomically plausible poses that correctly adhere to physics and environment constraints. To this end, we rethink and bring together in a new way ideas from kinematics-based monocular pose estimation and physics-based human character animation.

The first stage of our algorithm is similar to (Mehta et al. 2017b) and estimates 3D body poses in a purely kinematic, physics-agnostic way. A convolutional neural network (CNN) infers combined 2D and 3D joint positions from an input video, which are then refined in a space-time inverse kinematics step to yield the first estimate of skeletal joint angles and global 3D poses. In the second stage, the foot contact and motion states are predicted for every frame. To this end, we employ a new CNN that detects heel and forefoot placement on the ground from estimated 2D keypoints in images, and classifies the observed poses into stationary or non-stationary. In the third stage, the final physically plausible 3D skeletal joint angle and pose sequence is computed in real time. This stage regularises human motion with a torque-controlled physics-based character represented by a kinematic chain with a floating base. To this end, the optimal control forces for each degree of freedom (DoF) of the kinematic chain are computed, such that the kinematic pose estimates from the first stage, in both 2D and 3D, are reproduced as closely as possible. The optimisation ensures that physics constraints like gravity, collisions, foot placement, as well as physical pose plausibility (e.g., balancing), are fulfilled.
To summarise, our contributions in this article are:

• The first, to the best of our knowledge, marker-less monocular 3D human motion capture approach on the basis of an explicit physics-based dynamics model which runs in real time and captures global, physically plausible skeletal motion (Sec. 4).
• A CNN to detect foot contact and motion states from images (Sec. 4.2).
• A new pose optimisation framework with a human parametrised by a torque-controlled simulated character with a floating base and PD joint controllers; it reproduces kinematically captured 2D/3D poses and simultaneously accounts for physics constraints like ground reaction forces, foot contact states and collision response (Sec. 4.3).
• Quantitative metrics to assess frame-to-frame jitter and floor penetration in captured motions (Sec. 5.3.1).
• Physically-justified results with significantly fewer artefacts, such as frame-to-frame jitter, incorrect leaning, foot sliding and floor penetration, than related methods (confirmed by a user study and metrics), as well as state-of-the-art 2D and 3D accuracy and temporal stability (Sec. 5).

We demonstrate the benefits of our approach through experimental evaluation on several datasets (including newly recorded videos) against multiple state-of-the-art methods for monocular 3D human motion capture and pose estimation.
Our method mainly relates to two different categories of approaches: (markerless) 3D human motion capture from colour imagery, and physics-based character animation. In the following, we review related types of methods, focusing on the most closely related works.
Multi-View Methods for 3D Human Motion Capture from RGB.
Reconstructing humans from multi-view images is well studied. Multi-view motion capture methods track the articulated skeletal motion, usually by fitting an articulated template to imagery (Bo and Sminchisescu 2010; Brox et al. 2010; Elhayek et al. 2016, 2014; Gall et al. 2010; Stoll et al. 2011; Wang et al. 2018; Zhang et al. 2020). Other methods, sometimes termed performance capture methods, additionally capture the non-rigid surface deformation, e.g., of clothing (Cagniart et al. 2010; Starck and Hilton 2007; Vlasic et al. 2009; Waschbüsch et al. 2005). They usually fit some form of a template model to multi-view imagery (Bradley et al. 2008; De Aguiar et al. 2008; Martin-Brualla et al. 2018) that often also has an underlying kinematic skeleton (Gall et al. 2009; Liu et al. 2011; Vlasic et al. 2008; Wu et al. 2012). Multi-view methods have demonstrated compelling results and some enable free-viewpoint video. However, they require expensive multi-camera setups and often controlled studio environments.

Monocular 3D Human Motion Capture and Pose Estimation from RGB.
Marker-less 3D human pose estimation (reconstruction of 3D joint positions only) and motion capture (reconstruction of global 3D body motion and joint angles of a coherent skeleton) from a single colour or greyscale image are highly ill-posed problems. The state of the art on monocular 3D human pose estimation has greatly progressed in recent years, mostly fueled by the power of trained CNNs (Habibie et al. 2019; Mehta et al. 2017a). Some methods estimate 3D pose by combining 2D keypoint prediction with body depth regression (Dabral et al. 2018; Newell et al. 2016; Yang et al. 2018; Zhou et al. 2017) or with regression of 3D joint location probabilities (Mehta et al. 2017b; Pavlakos et al. 2017) in a trained CNN. Lifting methods predict joint depths from detected 2D keypoints (Chen and Ramanan 2017; Martinez et al. 2017; Pavlakos et al. 2018; Tomè et al. 2017). Other CNNs regress 3D joint locations directly (Mehta et al. 2017a; Rhodin et al. 2018; Tekin et al. 2016). Another category of methods combines CNN-based keypoint detection with constraints from a parametric body model, e.g., by using reprojection losses during training (Bogo et al. 2016; Brau and Jiang 2016; Habibie et al. 2019). Some works approach monocular multi-person 3D pose estimation (Rogez et al. 2019) and motion capture (Mehta et al. 2020), or estimate non-rigidly deforming human surface geometry from monocular video on top of skeletal motion (Habermann et al. 2020, 2019; Xu et al. 2020). In addition to greyscale images, (Xu et al. 2020) use an asynchronous event stream from an event camera as input. Both these latter directions are complementary but orthogonal to our work.

The majority of methods in this domain estimates 3D pose as a root-relative 3D position of the body joints (Kovalenko et al. 2019; Martinez et al. 2017; Moreno-Noguer 2017; Pavlakos et al. 2018; Wandt and Rosenhahn 2019).
This is problematic for applications in graphics, as temporal jitter, varying bone lengths and the often unrecovered global 3D pose make animating virtual characters hard. Other monocular methods are trained to estimate parameters or joint angles of a skeleton (Zhou et al. 2016) or a parametric model (Kanazawa et al. 2018). (Mehta et al. 2020, 2017b) employ inverse kinematics on top of CNN-based 2D/3D inference to obtain joint angles of a coherent skeleton in global 3D and in real time.

Results of all aforementioned methods frequently violate laws of physics, and exhibit foot-floor penetrations, foot sliding, and unbalanced or implausible poses floating in the air, as well as notable jitter. Some methods try to reduce jitter by exploiting temporal information (Kanazawa et al. 2019; Kocabas et al. 2020), e.g., by estimating smooth multi-frame scene trajectories (Peng et al. 2018). (Zou et al. 2020) try to reduce foot sliding by ground contact constraints. (Zanfir et al. 2018) jointly reason about ground planes and volumetric occupancy for multi-person pose estimation. (Monszpart et al. 2019) jointly infer coarse scene layout and human pose from monocular interaction video, and (Hassan et al. 2019) use a pre-scanned 3D model of scene geometry to constrain kinematic pose optimisation. However, no prior work formulates monocular motion capture on the basis of an explicit physics-based dynamics model and in real time, as we do.
Physics-Based Character Animation.
Character animation on the basis of physics-based controllers has been investigated for many years (Barzel et al. 1996; Sharon and van de Panne 2005; Wrotek et al. 2006), and remains an active area of research (Andrews et al. 2016; Bergamin et al. 2019; Levine and Popović 2012; Zheng and Yamane 2013). (Levine and Popović 2012) employ a quasi-physical simulation that approximates a reference motion trajectory in real time. They can follow non-physical reference motion by applying a direct actuation at the root. By using proportional-derivative (PD) controllers and computing optimal torques and contact forces, (Zheng and Yamane 2013) make a character follow a captured reference motion while keeping balance. (Liu et al. 2010) proposed a probabilistic algorithm for physics-based character animation. Due to the stochastic property and inherent randomness, their results evince variations, but the method requires multiple minutes of runtime per sequence. Andrews et al. (2016) employ rigid body dynamics to drive a virtual character from a combination of marker-based motion capture and body-mounted sensors. This animation setting is related to motion transfer onto robots. (Nakaoka et al. 2007) transferred human motion captured by a multi-camera marker-based system onto a robot, with an emphasis on leg motion. (Zhang et al. 2014) leverage depth cameras and wearable pressure sensors and apply physics-based motion optimisation. We take inspiration from these works for our setting, where we have to capture global 3D human motion from images in a physically correct way and in real time, using intermediate pose reconstruction results that exhibit notable artefacts and violations of physics laws. PhysCap, therefore, combines an initial kinematics-based pose reconstruction with PD-controller-based physical pose optimisation.

Several recent methods apply deep reinforcement learning to virtual character animation control (Bergamin et al. 2019; Lee et al. 2019; Peng et al. 2018). Peng et al. (2018) propose a reinforcement learning approach for transferring dynamic human performances observed in monocular videos. They first estimate smooth motion trajectories with recent monocular human pose estimation techniques, and then train an imitating control policy for a virtual character. (Bergamin et al. 2019) train a controller for a virtual character from several minutes of motion capture data which covers the expected variety of motions and poses. Once trained, the virtual character can follow directional commands of the user in real time, while being robust to collisions with obstacles. Other work (Lee et al. 2019) combines a muscle actuation model with deep reinforcement learning. (Jiang et al. 2019) express an animation objective in muscle actuation space. The work on learning animation controllers for specific motion classes is inspirational but different from real-time physics-based motion capture of general motion.
Physically Plausible Monocular 3D Human Motion Capture.
Only a few works on monocular 3D human motion capture using explicit physics-based constraints exist (Li et al. 2019; Vondrak et al. 2012; Wei and Chai 2010; Zell et al. 2017). (Wei and Chai 2010) capture 3D human poses from uncalibrated monocular video using physics constraints. Their approach requires manual user input for each frame of a video. In contrast, our approach is automatic, runs in real time, and uses a different formulation for physics-based pose optimisation geared to our setting. (Vondrak et al. 2012) capture bipedal controllers from a video. Their controllers are robust to perturbations and generalise well for a variety of motions. However, unlike our PhysCap, the generated motion often looks unnatural and their method does not run in real time. (Zell et al. 2017) capture poses and internal body forces from images only for certain classes of motion (e.g., lifting and walking) by using a data-driven approach, but not an explicit forward dynamics approach handling a wide range of motions, like ours.

Fig. 2. Our virtual character used in stage III. The forefoot and heel links are involved in the mesh collision checks with the floor plane in the physics engine (Coumans and Bai 2016).

Our PhysCap bears most similarities with the rigid-body-dynamics-based monocular human pose estimation by Li et al. (2019). Li et al. estimate 3D poses, contact states and forces from input videos with physics-based constraints. However, their method and our approach are substantially different. While Li et al. focus on object-person interactions, we target a variety of general motions, including complex acrobatic motions such as backflipping, without objects. Their method does not run in real time and requires manual annotations on images to train the contact state estimation networks. In contrast, we leverage PD-controller-based inverse dynamics tracking, which results in physically plausible, smooth and natural skeletal pose and root motion capture in real time. Moreover, our contact state estimation network relies on annotations generated in a semi-automatic way. This enables our architecture to be trained on large datasets, which results in improved generalisability. No previous method of the reviewed category "physically plausible monocular 3D human motion capture" combines the abilities of our algorithm to capture global 3D human pose of similar quality and physical plausibility in real time.

The input to
PhysCap is a 2D image sequence I^t, t ∈ {1, . . . , T}, where T is the total number of frames and t is the frame index. We assume a perspective camera model and calibrate the camera and floor location before tracking starts. Our approach outputs a physically plausible real-time 3D motion capture result q_phys^t ∈ R^m (where m is the number of degrees of freedom) that adheres to the image observation, as well as physics-based posture and environment constraints. For our human model, m = 43. Joint angles are parametrised by Euler angles. The mass distribution of our character is computed following (Liu et al. 2010). Our character model has a skeleton composed of 37 joints and links. A link defines the volumetric extent of a body part via a collision proxy. The forefoot and heel links, centred at the respective joints of our character (see Fig. 2), are used to detect foot-floor collisions during physics-based pose optimisation.

Throughout our algorithm, we represent the pose of our character by a combined vector q ∈ R^m (Featherstone 2014). The first three entries of q contain the global 3D root position in Cartesian coordinates, the next three entries encode the orientation of the root, and the remaining entries are the joint angles. When solving for the physics-based motion capture result, the motion of the physics-based character will be controlled by the vector of forces denoted by τ ∈ R^m, interacting with gravity, Coriolis and centripetal forces c ∈ R^m. The root of our character is not fixed and can globally move in the environment, which is commonly called a floating-base system. Let the velocity and acceleration of q be q̇ ∈ R^m and q̈ ∈ R^m, respectively. Using the finite-difference method, the relationship between q, q̇ and q̈ can be written as

    q̇_{i+1} = q̇_i + φ q̈_i,   q_{i+1} = q_i + φ q̇_{i+1},   (1)

where i represents the simulation step index and φ = 0.01 is the simulation step size.

For the motion to be physically plausible, q̈ and the vector of forces τ must satisfy the equation of motion (Featherstone 2014):

    M(q) q̈ − τ = J^T G λ − c(q, q̇),   (2)

where M ∈ R^{m×m} is the joint space inertia matrix, which is composed of the moments of inertia of the system. It is computed using the Composite Rigid Body algorithm (Featherstone 2014). J ∈ R^{N_c×m} is a contact Jacobian matrix which relates the external forces to joint coordinates, with N_c denoting the number of links where the contact force is applied. G ∈ R^{N_c×N_c} transforms contact forces λ ∈ R^{N_c} into linear force and torque (Zheng and Yamane 2013).

Usually, in a floating-base system, the first six entries of τ, which correspond to the root motion, are set to 0 for humanoid character control. This reflects the fact that humans do not directly control root translation and orientation by muscles acting on the root, but indirectly by the other joints and muscles in the body. In our case, however, the kinematic pose q_kin^t which our final physically plausible result shall reproduce as closely as possible (see Sec. 4) is estimated from a monocular image sequence (see stage I in Fig. 3), which contains physically implausible artefacts. Solving for joint torque controls that blindly make the character follow it would make the character quickly fall down. Hence, we keep the first six entries of τ in our formulation and can thus directly control the root position and orientation with an additional external force. This enables the final character motion to keep up with the global root trajectory estimated in the first stage of PhysCap, without falling down.
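The update in Eq. (1) amounts to a semi-implicit Euler step over the generalised coordinates: the velocity is integrated first, and the already-updated velocity is then used to advance the position. A minimal sketch (our illustration, not the authors' code):

```python
# Illustrative semi-implicit Euler step per Eq. (1). Velocities are
# integrated before positions; phi is the simulation step size (0.01
# in the text). q, q_dot and q_ddot are plain per-DoF lists.

PHI = 0.01  # simulation step size from the text

def integrate_step(q, q_dot, q_ddot, phi=PHI):
    """One simulation step: returns (q_next, q_dot_next) per Eq. (1)."""
    q_dot_next = [v + phi * a for v, a in zip(q_dot, q_ddot)]
    q_next = [x + phi * v for x, v in zip(q, q_dot_next)]
    return q_next, q_dot_next
```

Using the updated velocity in the position update (rather than the old one) is what keeps repeated stepping stable at the small step size used here.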
Our PhysCap approach includes three stages; see Fig. 3 for an overview. The first stage performs kinematic pose estimation. This encompasses 2D heatmap and 3D location map regression for each
Fig. 3. Overview of our pipeline. In stage I, the 3D pose estimation network accepts RGB image I^t as input and returns 2D joint keypoints K^t along with the global 3D pose q_kin^t, i.e., root translation, orientation and joint angles of a kinematic skeleton. In stage II, K^t is fed to the contact and motion state detection network. Stage II returns the contact states of heels and forefeet as well as a label b^t that represents whether the subject in I^t is stationary or not. In stage III, q_kin^t and b^t are used to iteratively update the character pose respecting physics laws. After the n pose update iterations, we obtain the final 3D pose q_phys^t. Note that the orange arrows in stage III represent the steps that are repeated in the loop in every iteration. Kinematic pose correction is performed only once at the beginning of stage III.

body joint with a CNN, followed by a model-based space-time pose optimisation step (Sec. 4.1). This stage returns the 3D skeleton pose in joint angles q_kin^t ∈ R^m along with the 2D joint keypoints K^t ∈ R^{s×2} for every image; s denotes the number of 2D joint keypoints. As explained earlier, this initial kinematic reconstruction q_kin^t is prone to physically implausible effects such as foot-floor penetration, foot skating, anatomically implausible body leaning and temporal jitter, especially notable along the depth dimension.

The second stage performs foot contact and motion state detection, which uses the 2D joint detections K^t to classify the poses reconstructed so far into stationary and non-stationary; this is stored in one binary flag. It also estimates binary foot-floor contact flags, i.e., for the toes and heels of both feet, resulting in four binary flags (Sec. 4.2). This stage outputs the combined state vector b^t ∈ R^5.

The third and final stage of PhysCap is the physically plausible global 3D pose estimation (Sec. 4.3).
It combines the estimates from the first two stages with physics-based constraints to yield a physically plausible real-time 3D motion capture result q_phys^t ∈ R^m that adheres to physics-based posture and environment constraints. In the following, we describe each of the stages in detail.

Our kinematic pose estimation stage follows the real-time VNect algorithm (Mehta et al. 2017b); see Fig. 3, stage I. We first predict heatmaps of 2D joints and root-relative location maps of joint positions in 3D with a specially tailored fully convolutional neural network using a ResNet (He et al. 2016) core. The ground truth joint locations for training are taken from the MPII (Andriluka et al. 2014) and LSP (Johnson and Everingham 2011) datasets in the 2D case, and the MPI-INF-3DHP (Mehta et al. 2017a) and Human3.6M (Ionescu et al. 2013) datasets in the 3D case.

Next, the estimated 2D and 3D joint locations are temporally filtered and used as constraints in a kinematic skeleton fitting step that optimises the following energy function:

    E_kin(q_kin^t) = E_IK(q_kin^t) + E_proj.(q_kin^t) + E_smooth(q_kin^t) + E_depth(q_kin^t).   (3)

The energy function (3) contains four terms (see (Mehta et al. 2017b)), i.e., the 3D inverse kinematics term E_IK, the projection term E_proj., the temporal stability term E_smooth and the depth uncertainty correction term E_depth. E_IK is the data term which constrains the 3D pose to be close to the 3D joint predictions from the CNN. E_proj. enforces the pose q_kin^t to reproject to the 2D keypoints (joints) detected by the CNN. Note that this reprojection constraint, together with the calibrated camera and calibrated bone lengths, enables computation of the global 3D root (pelvis) position in camera space. Temporal stability is further imposed by penalising the root's acceleration and variations along the depth channel by E_smooth and E_depth, respectively. The energy (3) is optimised by non-linear least squares (Levenberg-Marquardt algorithm (Levenberg 1944; Marquardt 1963)), and the obtained vector of joint angles and root rotation and position q_kin^t of a skeleton with fixed bone lengths is smoothed by an adaptive first-order low-pass filter (Casiez et al. 2012). Skeleton bone lengths of a human can be computed, up to a global scale, from averaged 3D joint detections of a few initial frames. Knowing the metric height of the human determines the scale factor to compute metrically correct global 3D poses.

The result of stage I is a temporally consistent joint angle sequence but, as noted earlier, captured poses can exhibit artefacts and contradict physical plausibility (e.g., evince floor penetration, incorrect body leaning, temporal jitter, etc.).

Fig. 4. (a) Balanced posture: the CoG of the body projects inside the base of support. (b) Unbalanced posture: the CoG does not project inside the base of support, which causes the human to start losing balance.

The ground reaction force (GRF), applied when the feet touch the ground, enables humans to walk and control their posture. The interplay of internal body forces and the ground reaction force controls human pose, which enables locomotion and body balancing by controlling the centre of gravity (CoG). To compute physically plausible poses accounting for the GRF in stage III, we thus need to know foot-floor contact states. Another important aspect of the physical plausibility of biped poses, in general, is balance.
When a human is standing or in a stationary upright state, the CoG of her body projects inside a base of support (BoS). The BoS is an area on the ground bounded by the foot contact points; see Fig. 4 for a visualisation. When the CoG projects outside the BoS in a stationary pose, a human starts losing balance and will fall if no correcting motion or step is applied. Therefore, maintaining a static pose with extensive leaning, as often observed in the results of monocular pose estimation, is not physically plausible (Fig. 4-(b)).
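The static-balance criterion above can be tested directly: drop the height coordinate of the CoG and check whether the resulting ground-plane point lies inside the BoS polygon. A sketch of this test under our own assumptions (a y-up coordinate frame and an even-odd ray-casting point-in-polygon test; the paper does not prescribe an implementation):

```python
# Illustrative CoG-in-BoS test. The BoS is a polygon of (x, z) ground-plane
# vertices bounded by the foot contact points; y is assumed to be the
# height axis. Not the authors' implementation.

def point_in_polygon(pt, poly):
    """Even-odd ray-casting test; poly is a list of (x, z) vertices."""
    x, z = pt
    inside = False
    for (x1, z1), (x2, z2) in zip(poly, poly[1:] + poly[:1]):
        if (z1 > z) != (z2 > z):  # edge straddles the horizontal ray
            x_cross = x1 + (z - z1) * (x2 - x1) / (z2 - z1)
            if x < x_cross:
                inside = not inside
    return inside

def is_balanced(cog, base_of_support):
    """cog is (x, y, z) with y up; drop the height, test against the BoS."""
    return point_in_polygon((cog[0], cog[2]), base_of_support)
```

A pose flagged as stationary but failing this test is exactly the "extensive leaning" case that the correction in stage III targets.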
Fig. 5. (a) An exemplary frame from the Human3.6M dataset with the ground truth reprojections of the 3D joint keypoints. The magnified view in the red rectangle shows the reprojected keypoint that deviates from the rotation centre (the middle of the knee). (b) Schematic visualisation of the reference motion correction. Readers are referred to Sec. 4.3.1 for details. (c) Example of a visually unnatural standing (stationary) pose caused by physically implausible knee bending.
The aforementioned CoG projection criterion can be used to correct imbalanced stationary poses (Coros et al. 2010; Faloutsos et al. 2001; Macchietto et al. 2009). To perform such correction in stage III, we need to know if a pose is stationary or non-stationary (whether it is a part of a locomotion/walking phase).

Stage II, therefore, estimates foot-floor contact states of the feet in each frame and determines whether the pose of the subject in I^t is stationary or not. To predict both, i.e., foot contact and motion states, we use a neural network whose architecture extends Zou et al. (2020), who only predict foot contacts. It is composed of temporal convolutional layers with one fully connected layer at the end. The network takes as input all 2D keypoints K^t from the last seven time steps (the temporal window size is set to seven), and returns for each image frame binary labels indicating whether the subject is in a stationary or non-stationary pose, as well as the contact state flags for the forefeet and heels of both feet, encompassed in b^t.

The supervisory labels for training this network are automatically computed on a subset of the 3D motion sequences of the Human3.6M (Ionescu et al. 2013) and DeepCap (Habermann et al. 2020) datasets using the following criteria: the forefoot and heel joint contact labels are computed based on the assumption that a joint in contact is not sliding, i.e., its velocity is lower than 5 cm/sec. In addition, we use a height criterion, i.e., the forefoot/heel, when in contact with the floor, has to be at a 3D height that is lower than a threshold h_thres.. To determine this threshold for each sequence, we calculate the average heel height h_avg^heel and forefoot height h_avg^ffoot for each subject using the first ten frames (when both feet touch the ground). Thresholds h_thres.^heel and h_thres.^ffoot are then computed by adding a small offset to h_avg^heel and h_avg^ffoot, respectively. If the joint velocities are lower than a threshold φ_v, we classify the pose as stationary, and non-stationary otherwise.
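The semi-automatic contact labelling described above can be sketched as follows. The 5 cm/s sliding threshold comes from the text; the per-sequence height threshold is assumed to be precomputed as described, and the single-joint interface is our simplification:

```python
# Sketch of per-joint contact label generation (our illustration).
# positions: per-frame 3D positions (x, y, z) of one forefoot or heel
# joint, y up. A joint is labelled "in contact" when it is both slow
# (not sliding) and low enough relative to the sequence's threshold.
import math

V_MAX = 0.05  # maximum sliding velocity (m/s) for a joint in contact

def contact_labels(positions, fps, h_thres):
    """Returns binary contact labels from the second frame onwards."""
    labels = []
    for prev, cur in zip(positions, positions[1:]):
        vel = math.dist(prev, cur) * fps  # per-frame displacement -> m/s
        labels.append(vel < V_MAX and cur[1] < h_thres)
    return labels
```

Running this over the forefoot and heel joints of both feet yields the four contact flags of b^t for each frame.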
In total, around 600k sets of contact and motion state labels for the human images are generated.

Stage III uses the results of stages I and II as inputs, i.e., q_kin^t and b_t. It transforms the kinematic motion estimate into a physically plausible global 3D pose sequence that corresponds to the images and adheres to anatomical and environmental constraints imposed by the laws of physics. To this end, we represent the human as a torque-controlled simulated character with a floating base and PD joint controllers (A. Salem and Aly 2015). The core is to solve an energy-based optimisation problem for the vector of forces τ and accelerations q̈ of the character such that the equations of motion with constraints are fulfilled (Sec. 4.3.5). This optimisation is preceded by several preprocessing steps applied to each frame. First i), we correct q_kin^t if it is strongly implausible, based on several easy-to-test criteria (Sec. 4.3.1). Second ii), we estimate the desired acceleration q̈_des ∈ R^m necessary to reproduce q_kin^t based on the PD control rule (Sec. 4.3.2). Third iii), in input frames in which a foot is in contact with the floor (Sec. 4.3.3), we estimate the ground reaction force (GRF) λ (Sec. 4.3.4). Fourth iv), we solve the optimisation problem (10) to estimate τ and accelerations q̈, where the equation of motion with the estimated GRF λ and the contact constraint to avoid foot-floor penetration (Sec. 4.3.5) are integrated as constraints. Note that the contact constraint is integrated only when the foot is in contact with the floor; otherwise, only the equation of motion without GRF is introduced as a constraint in (10). v) Lastly, the pose is updated using the finite-difference method (Eq. (1)) with the estimated acceleration q̈.
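Schematically, one frame of stage III can be summarised as below. The helper callables are placeholders for the solvers of Secs. 4.3.1–4.3.5, and the iteration count is a free parameter of this sketch:

```python
def stage_three_frame(q_kin, b, correct_pose, desired_accel,
                      estimate_grf, solve_dynamics, integrate, n_iters=4):
    """One frame of stage III. The callables stand in for the actual solvers:
    correct_pose   -> i)   pre-correction of implausible poses (Sec. 4.3.1)
    desired_accel  -> ii)  PD control rule (Sec. 4.3.2)
    estimate_grf   -> iii) GRF estimation on foot contact (Secs. 4.3.3-4.3.4)
    solve_dynamics -> iv)  constrained optimisation (10) (Sec. 4.3.5)
    integrate      -> v)   finite-difference pose update (Eq. (1))."""
    q_kin = correct_pose(q_kin, b)   # i)
    q = q_kin
    for _ in range(n_iters):         # steps ii)-v) are iterated
        qdd_des = desired_accel(q_kin, q)
        grf = estimate_grf(qdd_des, b)
        qdd, tau = solve_dynamics(qdd_des, grf, b)
        q = integrate(q, qdd)
    return q
```

The loop structure mirrors the steps i)–v) above; the concrete solvers are described in the following subsections.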
The steps ii)–v) are iterated n times per frame; note that it is also possible to estimate q̈, τ and λ simultaneously (Zheng and Yamane 2013). Our algorithm thus finds a plausible balance between pose accuracy, physical accuracy, the naturalness of the captured motion and real-time performance.

Due to error accumulation in stage I (e.g., as a result of the deviation of 3D annotations from the joint rotation centres in the skeleton model, see Fig. 5-(a), as well as inaccuracies in the neural network predictions and skeleton fitting), the estimated 3D pose q_kin^t is often not physically plausible. Therefore, prior to torque-based optimisation, we pre-correct a pose q_kin^t from stage I if it is 1) stationary and 2) unbalanced, i.e., the CoG projects outside the BoS. If both correction criteria are fulfilled, we compute the angle θ_t between the ground plane normal v_n and the vector v_b that defines the direction of the spine relative to the root in the character's local coordinate system (see Fig. 5-(b) for a schematic visualisation). We then correct the orientation of the virtual character towards a posture for which the CoG projects inside the BoS. Correcting θ_t in one large step could lead to instabilities in the physics-based pose optimisation. Instead, we reduce θ_t by a small rotation of the virtual character around its horizontal axis (i.e., the axis passing through the transverse plane of a human body), starting with the corrective angle ξ_t = θ_t for the first frame. Thereby, we accumulate the degree of correction in ξ for the subsequent frames, i.e., ξ_{t+1} = ξ_t + θ_t. Note that θ_t decreases in every frame, and the correction step is performed for all subsequent frames until 1) the pose becomes non-stationary or 2) the CoG projects inside the BoS. However, simply correcting the spine orientation by a skeleton rotation around the horizontal axis can lead to implausible standing poses, since the knees can still be unnaturally bent for the obtained upright posture (see Fig. 5-(c) for an example).
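The two correction criteria and the accumulative angle update can be sketched as follows. This is an illustration only: the convex-polygon BoS test (counter-clockwise contact hull) and the vector conventions are our assumptions.

```python
import numpy as np

def cog_inside_bos(cog, support_polygon):
    """Test whether the CoG's ground projection lies inside the base of support.
    support_polygon: (N, 2) convex polygon of contact points, counter-clockwise."""
    p = cog[:2]
    v = np.roll(support_polygon, -1, axis=0) - support_polygon  # edge vectors
    w = p - support_polygon                                     # vertex-to-point
    cross = v[:, 0] * w[:, 1] - v[:, 1] * w[:, 0]
    return bool(np.all(cross >= 0.0))  # point is left of every CCW edge

def corrective_angle(xi_prev, v_n, v_b):
    """Accumulate the corrective rotation: xi_{t+1} = xi_t + theta_t, where
    theta_t is the angle between the floor normal v_n and spine direction v_b."""
    cos_t = np.dot(v_n, v_b) / (np.linalg.norm(v_n) * np.linalg.norm(v_b))
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
    return xi_prev + theta
```

In the pipeline, the character would be rotated by a small step around its horizontal axis each frame until `cog_inside_bos` holds or the pose becomes non-stationary.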
To account for that, we adjust the respective DoFs of the knees and hips such that the relative orientation between the upper legs and the spine, as well as between the upper and lower legs, becomes more straight. The hip and knee correction starts if both correction criteria are still fulfilled and θ_t is already very small. Similarly to the θ correction, we introduce accumulator variables for every knee and every hip. The correction step for knees and hips is likewise performed until 1) the pose becomes non-stationary or 2) the CoG projects inside the BoS.

To control the physics-based virtual character such that it reproduces the kinematic estimate q_kin^t, we set the desired joint acceleration q̈_des following the PD controller rule:

q̈_des = q̈_kin^t + k_p (q_kin^t − q) + k_d (q̇_kin^t − q̇).    (4)

The desired acceleration q̈_des is later used in the GRF estimation step (Sec. 4.3.4) and the final pose optimisation (Sec. 4.3.5). Controlling the character motion on the basis of a PD controller enables the character to exert torques τ that reproduce the kinematic estimate q_kin^t while significantly mitigating undesired effects such as joint and base position jitter.

To avoid foot-floor penetration in the final pose sequence and to mitigate contact position sliding, we integrate hard constraints in the physics-based pose optimisation that enforce zero velocity of the forefoot and heel links in Sec. 4.3.5. However, these constraints can lead to unnatural motion in rare cases when the state prediction network fails to estimate the correct foot contact states (e.g., when the foot suddenly stops in the air while walking). We thus update the contact state output of the state prediction network, b_{t,j}, j ∈ {1,...,4}, to yield b′_{t,j} as follows:

b′_{t,j} = 1, if (b_j = 1 and h_j < ψ) or the j-th link collides with the floor plane,
b′_{t,j} = 0, otherwise.    (5)

This means we consider a forefoot or heel link to be in contact only if its height h_j is less than a threshold ψ = 0.1 m above the calibrated ground plane. In addition, we employ the Pybullet (Coumans and Bai 2016) physics engine to detect foot-floor collisions for the left and right foot links. Note that combining the mesh collision information (either after the correction of Sec. 4.3.1 or already in q_kin^t provided by stage I) with the predictions from the state prediction network is necessary because 1) the foot may not touch the floor plane in the simulation when the subject's foot is actually in contact with the floor, due to the inaccuracy of q_kin^t, and 2) the foot can penetrate the mesh floor plane if the network misdetects the contact state when there is actually a foot contact in I_t.

We first compute the GRF λ – when there is a contact between a foot and the floor – which best explains the motion of the root as coming from stage I. However, the target trajectory from stage I can be physically implausible, and we will thus eventually also require a residual force applied directly to the root to explain the target trajectory; this force will be computed in the final optimisation. To compute the GRF, we solve the following minimisation problem:

min_λ ‖M̄ q̈_des − J̄_G^T λ‖²,  s.t.  λ ∈ F,    (6)

where ‖·‖ denotes the ℓ2-norm, and M̄ together with J̄_G^T are the first six rows of M and J_G^T, i.e., the rows that correspond to the root joint. Since we do not consider sliding contact, the contact force λ has to satisfy friction cone constraints. Thus, we formulate a linearised friction cone constraint F. That is,

F_j = { λ_j ∈ R³ | λ_j^n > 0, |λ_j^t| ≤ µ̄ λ_j^n, |λ_j^b| ≤ µ̄ λ_j^n },    (7)

where λ_j^n is the normal component and λ_j^t and λ_j^b are the tangential components of a contact force at the j-th contact position; µ is a friction coefficient, which we set to 0.8, and µ̄ = µ/√2. λ is then integrated into the subsequent optimisation step (10) to estimate torques and accelerations of all joints in the body, including an additional residual direct root actuation component that is needed to explain the difference between the global 3D root trajectory of the kinematic estimate and the final, physically correct result. The aim is to keep this direct root actuation as small as possible, which is best achieved by a two-stage strategy that first estimates the GRF separately. Moreover, we observed that this two-step optimisation enables faster computation than estimating λ, q̈ and τ all at once; it is hence more suitable for our approach, which aims at real-time operation.

In this step, we solve an optimisation problem to estimate τ and q̈ to track q_kin^t, using the equation of motion (2) as a constraint. When contact is detected (Sec. 4.3.3), we integrate the estimated ground reaction force λ (Sec. 4.3.4) into the equation of motion. In addition, we introduce contact constraints to prevent foot-floor penetration and foot sliding when contacts are detected. Let ṙ_j be the velocity of the j-th contact link. Then, using the relationship between ṙ_j and q̇ (Featherstone 2014), we can write:

J_j q̇ = ṙ_j.    (8)

Table 1. Names and durations of our six newly recorded outdoor sequences (1: building 1, 2: building 2, 3: forest, 4: backyard, 5: balance beam 1, 6: balance beam 2), captured using a SONY DSC-RX0 at 25 fps.

When the link is in contact with the floor, the velocity perpendicular to the floor has to be zero or positive to prevent penetration. Also, we allow the contact links to have a small tangential velocity σ to prevent an immediate foot motion stop, which creates visually unnatural motion.
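The GRF estimation of Eqs. (6)–(7) is a small constrained least-squares problem. A minimal numerical sketch is given below; the projected-gradient scheme and the per-component clamping onto the linearised cone are our simplifications, not the paper's solver:

```python
import numpy as np

def project_friction_cone(lam, mu_bar):
    """Clamp each 3D contact force (n, t, b components) into the linearised
    friction cone of Eq. (7). Clamping keeps the iterate feasible; it is a
    simplification of an exact Euclidean projection."""
    lam = lam.reshape(-1, 3).copy()
    lam[:, 0] = np.maximum(lam[:, 0], 0.0)         # lambda_n >= 0
    bound = mu_bar * lam[:, 0]
    lam[:, 1] = np.clip(lam[:, 1], -bound, bound)  # |lambda_t| <= mu_bar*lambda_n
    lam[:, 2] = np.clip(lam[:, 2], -bound, bound)  # |lambda_b| <= mu_bar*lambda_n
    return lam.ravel()

def estimate_grf(A, b, mu_bar=0.8 / np.sqrt(2.0), iters=500):
    """Minimise ||A lam - b||^2 s.t. lam in the friction cone, by projected
    gradient descent. A stands for the root rows of J_G^T and b for
    M_bar @ qdd_des in Eq. (6); mu_bar = mu/sqrt(2) with mu = 0.8."""
    lam = np.zeros(A.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 + 1e-9)  # 1/L for L-smooth f
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ lam - b)
        lam = project_friction_cone(lam - step * grad, mu_bar)
    return lam
```

For instance, a target root wrench that pulls the character downwards yields the zero force, since the cone only admits pushing (non-negative normal) contact forces.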
Our contact constraint inequalities read:

0 ≤ ṙ_j^n, |ṙ_j^t| ≤ σ, and |ṙ_j^b| ≤ σ,    (9)

where ṙ_j^n is the normal component of ṙ_j, and ṙ_j^t along with ṙ_j^b are the tangential components of ṙ_j. Using the desired acceleration q̈_des (Eq. (4)), the equation of motion (2), the optimal GRF λ estimated in (6) and the contact constraints (9), we formulate the optimisation problem for finding the physics-based motion capture result as:

min_{q̈,τ} ‖q̈ − q̈_des‖² + ‖τ‖²,
s.t. M q̈ − τ = J_G^T λ − c(q, q̇), and
0 ≤ ṙ_j^n, |ṙ_j^t| ≤ σ, |ṙ_j^b| ≤ σ, ∀j.    (10)

The first energy term forces the character to reproduce q_kin^t. The second energy term is a regulariser that minimises τ to prevent overshooting, thus modelling natural, human-like motion. After solving (10), the character pose is updated by Eq. (1). We iterate the steps ii)–v) (see stage III in Fig. 3) n times and use the n-th output from v) as the final character pose q_phys^t. The final output of stage III is a sequence of joint angles and global root translations and rotations that explains the image observations and follows the purely kinematic reconstruction from stage I, yet is physically and anatomically plausible and temporally stable.

We first provide implementation details of
PhysCap (Sec. 5.1) and then demonstrate its qualitative state-of-the-art results (Sec. 5.2). We next evaluate PhysCap's performance quantitatively (Sec. 5.3) and conduct a user study to assess the visual physical plausibility of the results (Sec. 5.4).

We test PhysCap on widely-used benchmarks (Habermann et al. 2020; Ionescu et al. 2013; Mehta et al. 2017a) as well as on the backflip and jump sequences provided by Peng et al. (2018). We also collect a new dataset with various challenging motions. It features six sequences in general scenes performed by two subjects, recorded at 25 fps with a SONY DSC-RX0 camera; see Table 1 for more details on the sequences. Note that the variety of motions per subject is high; there are only two subjects in the new dataset due to COVID-19-related recording restrictions.

Fig. 6. Two examples of reprojected 3D keypoints obtained by our approach (light blue colour) and VNect (Mehta et al. 2017b) (yellow colour), together with the corresponding 3D visualisations from different view angles. PhysCap produces much more natural and physically plausible postures, whereas VNect suffers from unnatural body leaning (see also the supplementary video).
Our method runs in real time (25 fps on average) on a PC with a Ryzen 7 2700 8-core processor, 32 GB RAM and a GeForce RTX 2070 graphics card. In stage I, we proceed from a freely available demo version of VNect (Mehta et al. 2017b). Stages II and III are implemented in Python. In stage II, the network is implemented with PyTorch (Paszke et al. 2019). In stage III, we use the Rigid Body Dynamics Library (Felis 2017) to compute dynamic quantities. We employ Pybullet (Coumans and Bai 2016) as the physics engine for character motion visualisation and collision detection. In this paper, we set the proportional gain value k_p and the derivative gain value k_d for all joints to 300 and 20, respectively. For the root angular acceleration, k_p and k_d are set to 340 and 30, respectively. k_p and k_d of the root linear acceleration are set to 1000 and 80, respectively. These settings are used in all experiments.
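With these gains, the PD control rule (Eq. (4)) is straightforward to implement. A minimal sketch, with the state layout (stacked joint angles, velocities and accelerations as arrays) assumed:

```python
import numpy as np

def desired_acceleration(qdd_kin, q_kin, q, qd_kin, qd, k_p=300.0, k_d=20.0):
    """PD control rule (Eq. (4)):
        qdd_des = qdd_kin + k_p * (q_kin - q) + k_d * (qd_kin - qd).
    k_p = 300 and k_d = 20 are the per-joint gains from the paper; the root
    uses separate gains (340/30 angular, 1000/80 linear)."""
    qdd_kin, q_kin, q, qd_kin, qd = map(np.asarray, (qdd_kin, q_kin, q, qd_kin, qd))
    return qdd_kin + k_p * (q_kin - q) + k_d * (qd_kin - qd)
```

A larger k_d smooths the tracked motion at the cost of delay, which is the trade-off discussed in Sec. 6.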
Fig. 7. Reprojected 3D keypoints onto two images with different view angles for squatting. Frontal-view images are used as inputs, and images of the reference view are used only for quantitative evaluation. Our results are drawn in light blue, whereas the results by VNect (Mehta et al. 2017b) are shown in yellow. Our reprojections are more plausible, which is especially noticeable in the reference view. See also our supplementary video.
The supplementary video and the result figures in this paper, in particular Figs. 1 and 11, show that PhysCap captures global 3D human poses in real time, even for fast and difficult motions such as a backflip and a jump, at significantly improved quality compared to previous monocular methods. In particular, the captured motions are much more temporally stable and adhere to the laws of physics with respect to the naturalness of body postures and the fulfilment of environmental constraints; see Figs. 6–8 and 10 for examples of more natural 3D reconstructions. These properties are essential for many applications in graphics, in particular for stable real-time character animation, which is feasible by directly applying our method's output (see Fig. 1 and the supplementary video).
In the following, we first describe our evaluation methodology in Sec. 5.3.1. We evaluate PhysCap and competing methods under a variety of criteria, i.e., 3D joint positions, reprojected 2D joint positions, foot penetration into the floor plane and motion jitter. We compare our approach with current state-of-the-art monocular pose estimation methods, i.e., HMR (Kanazawa et al. 2018), HMMR (Kanazawa et al. 2019) and VNect (Mehta et al. 2017b); here we use the so-called demo version provided by the authors, with further improved accuracy over the original paper due to improved training. For the comparison, we use the benchmark datasets Human 3.6M (Ionescu et al. 2013), DeepCap (Habermann et al. 2020) and MPI-INF-3DHP (Mehta et al. 2017a). From the Human 3.6M dataset, we use the subset of actions that does not have occluding objects in the frame, i.e., directions, discussions, eating, greeting, posing, purchases, taking photos, waiting, walking, walking dog and walking together. From the DeepCap dataset, we use subject 2 for this comparison.

Fig. 8. Several visualisations of the results by our approach and VNect (Mehta et al. 2017b). The first and second rows show our estimated 3D poses after reprojection into the input image and in a 3D view, respectively. Similarly, the third and fourth rows show the reprojected 3D pose and 3D view for VNect. Note that our motion capture shows no foot penetration into the floor plane, whereas this artefact is apparent in the VNect results.

The established evaluation methodology in monocular 3D human pose estimation and capture consists of testing a method on multiple sequences and reporting the accuracy of the 3D joint positions as well as the accuracy of the reprojection into the input views. The accuracy in 3D is evaluated by the mean per joint position error (MPJPE) in mm, the percentage of correct keypoints (PCK) and the area under the receiver operating characteristic (ROC) curve, abbreviated as AUC. The reprojection or mean pixel error e_2D^input is obtained by projecting the estimated 3D joints onto the input images and taking the average per-frame distance to the ground-truth 2D joint positions.
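The mean pixel error can be sketched as follows; a pinhole camera with intrinsics K and 3D joints in camera coordinates are assumed for this illustration:

```python
import numpy as np

def mean_pixel_error(joints_3d, joints_2d_gt, K):
    """Mean per-frame reprojection error e_2D and its standard deviation.
    joints_3d: (T, J, 3) estimated joints in camera coordinates;
    joints_2d_gt: (T, J, 2) ground-truth 2D joints in pixels;
    K: (3, 3) pinhole intrinsics."""
    proj = joints_3d @ K.T                    # homogeneous image coordinates
    px = proj[..., :2] / proj[..., 2:3]       # perspective division
    dists = np.linalg.norm(px - joints_2d_gt, axis=-1)  # (T, J) pixel distances
    per_frame = dists.mean(axis=1)            # average distance per frame
    return per_frame.mean(), per_frame.std()
```

The same routine, applied after transforming the 3D joints into a held-out camera, yields the side-view error e_2D^side introduced below.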
We report e_2D^input and its standard deviation, denoted by σ_2D^input, with images of size 1024 × 1024. In addition, we evaluate several further metrics, i.e., the reprojection error to unseen views e_2D^side, the motion jitter error e_smooth and two floor penetration errors – the Mean Penetration Error (MPE) and the Percentage of Non-Penetration (PNP).

When choosing a reference side view for e_2D^side, we make sure that the viewing angle between the input and side views is sufficiently large. Otherwise, if a side view is close to the input view, effects such as unnatural forward leaning can still remain undetected by e_2D^side in some cases. After reprojection of a 3D structure to the image plane of a side view, all further steps for calculating e_2D^side are the same as for the standard reprojection error. We also report σ_2D^side, i.e., the standard deviation of e_2D^side.

To quantitatively compare the motion jitter, we report the deviation of the temporal consistency from the ground-truth 3D pose. Our smoothness error e_smooth is computed as follows:

Jit_X = ‖p_X^{s,t} − p_X^{s,t−1}‖, Jit_GT = ‖p_GT^{s,t} − p_GT^{s,t−1}‖,
e_smooth = 1/(Tm) Σ_{t=1}^T Σ_{s=1}^m |Jit_GT − Jit_X|,    (11)

where p^{s,t} represents the 3D position of joint s in time frame t; T and m denote the total numbers of frames in the video sequence and of target 3D joints, respectively. The subscripts X and GT stand for the predicted output and the ground truth, respectively. A lower e_smooth indicates lower motion jitter in the predicted motion sequence.

MPE and PNP measure the degree of non-physical foot penetration into the ground. MPE is the mean distance between the floor and the 3D foot position, and it is computed only when the foot is in contact with the floor. We use the ground-truth foot contact labels (Sec. 4.2) to judge the presence of actual foot contacts. The complementary PNP metric is the ratio of frames in which the feet are not below the floor plane over the entire sequence.

Table 2 summarises the MPJPE, PCK and AUC for root-relative joint positions with (first row) and without (second row) Procrustes alignment before the error computation for our and related methods.
We also report the global root position accuracy in the third row. Since HMR and HMMR do not return global root positions as their outputs, we estimate the root translation in 3D by solving an optimisation with a 2D projection energy term, using the 2D and 3D keypoints obtained from these algorithms (similar to the solution in VNect). The 3D bone lengths of HMR and HMMR were rescaled so that they match the ground-truth bone lengths.

[Table 2: MPJPE [mm] ↓, PCK [%] ↑ and AUC [%] ↑ on the DeepCap, Human 3.6M and MPI-INF-3DHP datasets for ours, VNect, HMR and HMMR, with and without Procrustes alignment, and for the global root position.]

Table 2. 3D error comparison on benchmark datasets with VNect (Mehta et al. 2017b), HMR (Kanazawa et al. 2018) and HMMR (Kanazawa et al. 2019). We report the MPJPE in mm, PCK at 150 mm and AUC. Higher AUC and PCK are better, and lower MPJPE is better. Note that the global root positions for HMR and HMMR were estimated by solving an optimisation with a 2D projection loss using the 2D and 3D keypoints obtained from the methods. Our method is on par with and often close to the best-performing approaches on all datasets. It consistently produces the best global root trajectory. As indicated in the text, these widely-used metrics from the pose estimation literature only paint an incomplete picture. For more details, please refer to Sec. 5.3.

[Table 3: frontal-view and side-view reprojection errors e_2D^input [pixel], σ_2D^input, e_2D^side [pixel] and σ_2D^side for ours and VNect (Mehta et al. 2017b).]

Table 3. 2D projection error for the frontal (input) view and a side (non-input) view on the DeepCap dataset (Habermann et al. 2020). PhysCap performs similarly to VNect on the frontal view, and significantly better on the side view. For further details, see Sec. 5.3 and Fig. 7.

In terms of MPJPE, PCK and AUC, our method does not outperform the other approaches consistently but achieves an accuracy that is comparable and often close to the highest on Human 3.6M, DeepCap and MPI-INF-3DHP. In the third row, we additionally evaluate the global 3D base position accuracy, which is critical for character animation from the captured data. Here,
PhysCap consistently outperforms the other methods on all the datasets.

As noted earlier, the above metrics only paint an incomplete picture. Therefore, we also measure the 2D projection errors to the input and side views on the DeepCap dataset, since this dataset includes multiple synchronised views of dynamic scenes with a wide baseline. Table 3 summarises the mean pixel errors e_2D^input and e_2D^side together with their standard deviations. In the frontal view, i.e., on e_2D^input, VNect has higher accuracy than PhysCap. However, this comes at the price of frequently violating physics constraints (floor penetration) and producing unnaturally leaning and jittering 3D poses (see also the supplemental video). In contrast, since PhysCap explicitly models physical pose plausibility, it surpasses VNect in the side view, which reveals VNect's implausibly leaning postures and root position instability in depth; see also Figs. 6 and 7.

To assess motion smoothness, we report e_smooth and its standard deviation σ_smooth in Table 4. Our approach outperforms VNect and HMR by a big margin on both datasets. Our method is better than HMMR on the DeepCap dataset and marginally worse on Human 3.6M.

[Table 4: e_smooth and σ_smooth on the DeepCap and Human 3.6M datasets for ours, VNect, HMR and HMMR.]

Table 4. Comparison of temporal smoothness on the DeepCap (Habermann et al. 2020) and Human 3.6M (Ionescu et al. 2013) datasets. PhysCap significantly outperforms VNect and HMR, and fares comparably to HMMR in terms of this metric. For a detailed explanation, see Sec. 5.3.
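The smoothness metric of Eq. (11) can be sketched as follows; we average over the T−1 frame differences, a minor simplification of the normalisation:

```python
import numpy as np

def smoothness_error(pred, gt):
    """Jitter metric of Eq. (11): compare the frame-to-frame 3D displacement
    magnitudes of prediction and ground truth.
    pred, gt: (T, m, 3) arrays of 3D joint positions."""
    jit_pred = np.linalg.norm(pred[1:] - pred[:-1], axis=-1)  # (T-1, m)
    jit_gt = np.linalg.norm(gt[1:] - gt[:-1], axis=-1)
    return float(np.abs(jit_gt - jit_pred).mean())
```

A perfectly tracked sequence scores zero even if the motion itself is fast, since only the deviation from the ground-truth displacement magnitude is penalised.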
HMMR is one of the current state-of-the-art algorithms with an explicit temporal component in the architecture.

[Table 5: MPE [mm] ↓, σ_MPE ↓ and PNP [%] ↑ for ours and VNect (Mehta et al. 2017b); VNect scores 39.3, 37.5 and 45.6, respectively.]

Table 5. Comparison of the Mean Penetration Error (MPE) and the Percentage of Non-Penetration (PNP) on the DeepCap dataset (Habermann et al. 2020). PhysCap significantly outperforms VNect on this metric, measuring an essential aspect of physical motion correctness.

Table 5 summarises the MPE and PNP for VNect and
PhysCap on the DeepCap dataset. Our method shows significantly better results compared to VNect, i.e., about a 30% lower MPE and a 100% better result in PNP; see Fig. 8 for qualitative examples. Fig. 9 shows plots of the contact forces as functions of time, calculated by our approach on the walking sequence from our newly recorded dataset (sequence 1). The estimated forces fall into a reasonable range for walking motions (Shahabpoor and Pavic 2017).
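The MPE and PNP computations can be sketched as follows; the per-frame "no foot below the floor" convention for PNP is our reading of the definition, and heights are assumed to be in the same unit as the reported error:

```python
import numpy as np

def penetration_metrics(foot_heights, contact_mask, floor_height=0.0):
    """MPE: mean |foot height - floor| over link-frames with ground-truth
    foot contact. PNP: percentage of frames in which no foot link is below
    the floor plane.
    foot_heights: (T, 4) heights of the heel/forefoot links;
    contact_mask: (T, 4) ground-truth contact labels (Sec. 4.2)."""
    if contact_mask.any():
        mpe = float(np.abs(foot_heights[contact_mask] - floor_height).mean())
    else:
        mpe = 0.0
    pnp = 100.0 * float(np.mean((foot_heights >= floor_height).all(axis=1)))
    return mpe, pnp
```

Note that a method can have a low MPE yet a poor PNP if it penetrates the floor slightly but in many frames.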
The notion of physical plausibility can be understood and perceived differently from person to person. Therefore, in addition to the quantitative evaluation with existing and new metrics, we perform an online user study, which allows a broad audience of people with different backgrounds in computer graphics and vision to subjectively assess and compare the perceived degree of different effects in the reconstructions. In total, we prepared 34 questions with videos, in which we always showed one or two reconstructions at a time (our result, a result by a competing method, or both at the same time). In total, 27 respondents participated.

Fig. 9. The estimated contact forces as functions of time for the walking sequence. We observe that the contact forces remain in a reasonable range for walking motions (Shahabpoor and Pavic 2017).

There were different types of questions. In 16 questions (category I), the respondents were asked to decide which 3D reconstruction out of two looks more physically plausible to them (the first, the second, or undecided). In 12 questions (category II), the respondents were asked to rate how natural the 3D reconstructed motions are, or to evaluate the degree of an indicated effect (foot sliding, body leaning, etc.) on a predefined scale. In five questions (category III), the respondents were asked to decide which visualisation has a more pronounced indicated artefact. For two questions out of five, 2D projections onto the input 2D image sequence were shown, whereas the remaining questions in this category featured 3D reconstructions. Finally (category IV), the participants were encouraged to list which artefacts in the reconstructions seem to be most apparent and most frequent.

In category I, our reconstructions were preferred in 89.2% of the cases, whereas a competing method was preferred in 1.6% of the cases; in the remaining 8.9% of the cases, no decision between the methods was made. In category II, the respondents also found the results of our approach to be significantly more physically plausible than the results of competing methods. The latter were also found to have consistently more jitter, foot sliding and unnatural body leaning. In category III, it is noteworthy that the participants indicated a higher average perceived accuracy of our reprojections, with the choice falling on the competing methods in only 22.6% of the cases. Note that the smoothness and jitter in the results are also reflected in the reprojections and, thus, both influence how natural the reprojected skeletons look. At the same time, a high uncertainty of 44.2% indicates that the difference between the reprojections of PhysCap and other methods is volatile. For the 3D motions in this category, 82.7% voted that our results show fewer of the indicated artefacts compared to other approaches, whereas the competing methods were chosen in about 13% of the cases. In category IV, 59% of the participants named jitter as the most frequent and apparent disturbing effect of the competing methods, followed by unnatural body leaning (22%), foot-floor penetration (15%) and foot sliding (15%).

The user study confirms a high level of physical plausibility and naturalness of
PhysCap results. We see that, also subjectively, a broad audience coherently finds our results of high visual quality, and the gap to the competing methods is substantial. This strengthens our belief in the suitability of PhysCap for computer graphics and, primarily, for virtual character animation in real time.
Our physics-based monocular 3D human motion capture algorithm significantly reduces the common artefacts of other monocular 3D pose estimation methods, such as motion jitter, penetration into the floor, foot sliding and unnatural body leaning. The experiments have shown that our state prediction network generalises well across scenes with different backgrounds (see Fig. 11). However, in the case of foot occlusion, our state prediction network can sometimes mispredict the foot contact states, resulting in an erroneous hard zero-velocity constraint for the feet. Additionally, our approach requires a calibrated floor plane to apply the foot contact constraint effectively; standard calibration techniques can be used for this.

Swift motions can be challenging for stage I of our pipeline, which can cause inaccuracies in the estimates of the subsequent stages, as well as in the final estimate. In the future, monocular kinematic pose estimators other than (Mehta et al. 2017b) could be tested in stage I, in case they are trained to handle occlusions and very fast motions better. Moreover, note that – although we use a single parameter set for PhysCap in all our experiments (see Sec. 5) – users can adjust the quality of the reconstructed motions by tuning the gain parameters of the PD controller depending on the scenario. By increasing the derivative gain value, the reconstructed poses become smoother, which, however, can cause motion delay compared to the input video, especially when the observed motions are very fast. By reducing the derivative gain value, our optimisation with a virtual character can track the image sequence with less motion delay, at the cost of less temporally coherent motion. We demonstrate this trade-off in the supplemental video.

Further, while our method works in front of general backgrounds, we assume there is a ground plane in the scene, which is the case for most man-made environments but not for irregular outdoor terrains. Finally, our method currently only considers a subset of potential body-to-environment contacts in a physics-based way. As part of future work, we will investigate explicit modelling of self-collisions, as well as hand-scene interactions or contacts of the legs and body in sitting and lying poses.

Fig. 10. Several side (non-input) view visualisations of the results by our approach, VNect (Mehta et al. 2017b), HMR (Kanazawa et al. 2018) and HMMR (Kanazawa et al. 2019) on the DeepCap dataset. The green dashed lines indicate the expected root positions over time. It is apparent from the side view that our PhysCap does not suffer from unnatural body sliding along the depth direction, unlike the other approaches. The global base positions for HMR and HMMR were computed by us using the root-relative predictions of these techniques; see Sec. 5.3.2 for more details.
We have presented PhysCap – the first physics-based approach for global 3D human motion capture from a single RGB camera that runs in real time at 25 fps. Thanks to the pose optimisation framework using PD joint control, the results of PhysCap evince improved physical plausibility, temporal consistency and significantly fewer artefacts such as jitter, foot sliding, unnatural body leaning and foot-floor penetration, compared to other existing approaches (some of which include temporal constraints). We also introduced new error metrics to evaluate these improved properties, which are not easily captured by the metrics used in established pose estimation benchmarks. Moreover, our user study further confirmed these improvements. In future work, our algorithm can be extended to various contact positions (not only the feet).
REFERENCES
Farhan A. Salem and Ayman Aly. 2015. PD Controller Structures: Comparison and Selection for an Electromechanical System. International Journal of Intelligent Systems and Applications (IJISA).
European Conference on Visual Media Production (CVMP).
Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2014. Human Pose Estimation: New Benchmark and State of the Art Analysis. In Computer Vision and Pattern Recognition (CVPR).
Ronen Barzel, John F. Hughes, and Daniel N. Wood. 1996. Plausible Motion Simulation for Computer Graphics Animation. In Proceedings of the Eurographics Workshop on Computer Animation and Simulation.
Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. 2019. DReCon: Data-Driven Responsive Control of Physics-Based Characters. ACM Transactions on Graphics (TOG) 38, 6 (2019).
Liefeng Bo and Cristian Sminchisescu. 2010. Twin Gaussian Processes for Structured Prediction. International Journal of Computer Vision (IJCV) 87 (2010), 28–52.
Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. 2016. Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. In European Conference on Computer Vision (ECCV).
Derek Bradley, Tiberiu Popa, Alla Sheffer, Wolfgang Heidrich, and Tamy Boubekeur. 2008. Markerless garment capture. ACM Transactions on Graphics (TOG) 27, 3 (2008), 99.
Ernesto Brau and Hao Jiang. 2016. 3D Human Pose Estimation via Deep Learning from 2D Annotations. In International Conference on 3D Vision (3DV).
Thomas Brox, Bodo Rosenhahn, Juergen Gall, and Daniel Cremers. 2010. Combined region- and motion-based 3D tracking of rigid and articulated objects. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32, 3 (2010), 402–415.
Cedric Cagniart, Edmond Boyer, and Slobodan Ilic. 2010. Free-Form Mesh Tracking: a Patch-Based Approach. In Computer Vision and Pattern Recognition (CVPR).
Géry Casiez, Nicolas Roussel, and Daniel Vogel. 2012. 1€ Filter: A Simple Speed-Based Low-Pass Filter for Noisy Input in Interactive Systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Ching-Hang Chen and Deva Ramanan. 2017. 3D Human Pose Estimation = 2D Pose Estimation + Matching. In Computer Vision and Pattern Recognition (CVPR).
Stelian Coros, Philippe Beaudoin, and Michiel van de Panne. 2010. Generalized Biped Walking Control. ACM Transactions on Graphics (TOG) 29, 4 (2010).
Erwin Coumans and Yunfei Bai. 2016. Pybullet, a python module for physics simulation for games, robotics and machine learning. GitHub repository (2016).

Fig. 11. Representative 2D reprojections and the corresponding 3D poses of our PhysCap approach. Note that, even with the challenging motions, our global poses in 3D have high quality and 2D reprojections to the input images are accurate as well. See our supplementary video for more results on these sequences. The backflip video in the first row is taken from (Peng et al. 2018). Other sequences are from our own recordings.

Rishabh Dabral, Anurag Mundhada, Uday Kusupati, Safeer Afaque, Abhishek Sharma, and Arjun Jain. 2018. Learning 3D Human Pose from Structure and Motion. In European Conference on Computer Vision (ECCV).
Edilson De Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. 2008. Performance Capture from Sparse Multi-View Video.
ACM Transactions on Graphics (TOG) 27, 3 (2008).
Ahmed Elhayek, Edilson de Aguiar, Arjun Jain, Jonathan Thompson, Leonid Pishchulin, Mykhaylo Andriluka, Christoph Bregler, Bernt Schiele, and Christian Theobalt. 2016. MARCOnI: ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39, 3 (2016), 501–514.
Ahmed Elhayek, Carsten Stoll, Kwang In Kim, and Christian Theobalt. 2014. Outdoor Human Motion Capture by Simultaneous Optimization of Pose and Camera Parameters. Computer Graphics Forum (2014).
Petros Faloutsos, Michiel van de Panne, and Demetri Terzopoulos. 2001. Composable Controllers for Physics-Based Character Animation. In Annual Conference on Computer Graphics and Interactive Techniques. 251–260.
Roy Featherstone. 2014. Rigid body dynamics algorithms.
Martin L. Felis. 2017. RBDL: an Efficient Rigid-Body Dynamics Library using Recursive Algorithms. Autonomous Robots 41, 2 (2017), 495–511.
Juergen Gall, Bodo Rosenhahn, Thomas Brox, and Hans-Peter Seidel. 2010. Optimization and Filtering for Human Motion Capture - a Multi-Layer Framework. International Journal of Computer Vision (IJCV) 87, 1 (2010), 75–92.
Juergen Gall, Carsten Stoll, Edilson De Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter Seidel. 2009. Motion Capture Using Joint Skeleton Tracking and Surface Estimation. In Computer Vision and Pattern Recognition (CVPR).
Marc Habermann, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. 2020. DeepCap: Monocular Human Performance Capture Using Weak Supervision. In Computer Vision and Pattern Recognition (CVPR).
Marc Habermann, Weipeng Xu, Michael Zollhöfer, Gerard Pons-Moll, and Christian Theobalt. 2019. LiveCap: Real-Time Human Performance Capture From Monocular Video. ACM Transactions on Graphics (TOG) 38, 2 (2019), 14:1–14:17.
Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Gerard Pons-Moll, and Christian Theobalt. 2019. In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations. In Computer Vision and Pattern Recognition (CVPR).
Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. 2019. Resolving 3D Human Pose Ambiguities with 3D Scene Constraints. In International Conference on Computer Vision (ICCV).
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Computer Vision and Pattern Recognition (CVPR).
Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. 2013. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 36, 7 (2013), 1325–1339.
Yifeng Jiang, Tom Van Wouwe, Friedl De Groote, and C. Karen Liu. 2019. Synthesis of Biologically Realistic Human Motion Using Joint Torque Actuation. ACM Transactions on Graphics (TOG) 38, 4 (2019).
S. Johnson and M. Everingham. 2011. Learning Effective Human Pose Estimation from Inaccurate Annotation. In Computer Vision and Pattern Recognition (CVPR).
Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. 2018. End-to-end Recovery of Human Shape and Pose. In Computer Vision and Pattern Recognition (CVPR).
Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, and Jitendra Malik. 2019. Learning 3D Human Dynamics from Video. In Computer Vision and Pattern Recognition (CVPR).
Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. 2020. VIBE: Video Inference for Human Body Pose and Shape Estimation. In Computer Vision and Pattern Recognition (CVPR).
Onorina Kovalenko, Vladislav Golyanik, Jameel Malik, Ahmed Elhayek, and Didier Stricker. 2019. Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data. Sensors 19, 20 (2019).
Seunghwan Lee, Moonseok Park, Kyoungmin Lee, and Jehee Lee. 2019. Scalable Muscle-Actuated Human Simulation and Control.
ACM Transactions on Graphics (TOG) 38, 4 (2019).
Kenneth Levenberg. 1944. A Method for the Solution of Certain Non-Linear Problems in Least Squares. Quarterly Journal of Applied Mathematics II, 2 (1944), 164–168.
Sergey Levine and Jovan Popović. 2012. Physically Plausible Simulation for Character Animation. In
Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation.
Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, and Josef Sivic. 2019. Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video. In Computer Vision and Pattern Recognition (CVPR).
Libin Liu, KangKang Yin, Michiel van de Panne, Tianjia Shao, and Weiwei Xu. 2010. Sampling-Based Contact-Rich Motion Control. ACM Transactions on Graphics (TOG) 29, 4 (2010), 128:1–128:10.
Yebin Liu, Carsten Stoll, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. 2011. Markerless Motion Capture of Interacting Characters using Multi-View Image Segmentation. In Computer Vision and Pattern Recognition (CVPR).
Adriano Macchietto, Victor Zordan, and Christian R. Shelton. 2009. Momentum Control for Balance. In ACM SIGGRAPH.
Donald W. Marquardt. 1963. An Algorithm for Least-Squares Estimation of Nonlinear Parameters. SIAM J. Appl. Math. 11, 2 (1963), 431–441.
Ricardo Martin-Brualla, Rohit Pandey, Shuoran Yang, Pavel Pidlypenskyi, Jonathan Taylor, Julien Valentin, Sameh Khamis, Philip Davidson, Anastasia Tkach, Peter Lincoln, Adarsh Kowdle, Christoph Rhemann, Dan B Goldman, Cem Keskin, Steve Seitz, Shahram Izadi, and Sean Fanello. 2018. LookinGood: Enhancing Performance Capture with Real-Time Neural Re-Rendering. ACM Transactions on Graphics (TOG) 37, 6 (2018).
Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. 2017. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In International Conference on Computer Vision (ICCV).
Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. 2017a. Monocular 3D Human Pose Estimation In The Wild Using Improved CNN Supervision. In International Conference on 3D Vision (3DV).
Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Mohammad Elgharib, Hans-Peter Seidel, Helge Rhodin, Gerard Pons-Moll, and Christian Theobalt. 2020. XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera. ACM Transactions on Graphics (TOG) 39, 4 (2020).
Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. 2017b. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Transactions on Graphics 36, 4, 14.
Aron Monszpart, Paul Guerrero, Duygu Ceylan, Ersin Yumer, and Niloy J. Mitra. 2019. iMapper: Interaction-Guided Scene Mapping from Monocular Videos. ACM Transactions on Graphics (TOG) 38, 4 (2019).
Francesc Moreno-Noguer. 2017. 3D Human Pose Estimation From a Single Image via Distance Matrix Regression. In Computer Vision and Pattern Recognition (CVPR).
Shin'ichiro Nakaoka, Atsushi Nakazawa, Fumio Kanehiro, Kenji Kaneko, Mitsuharu Morisawa, Hirohisa Hirukawa, and Katsushi Ikeuchi. 2007. Learning from Observation Paradigm: Leg Task Models for Enabling a Biped Humanoid Robot to Imitate Human Dances. The International Journal of Robotics Research 26, 8 (2007), 829–844.
Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked Hourglass Networks for Human Pose Estimation. In European Conference on Computer Vision (ECCV).
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (NeurIPS). 8026–8037.
Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. 2018. Ordinal Depth Supervision for 3D Human Pose Estimation. In Computer Vision and Pattern Recognition (CVPR).
Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. 2017. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose. In Computer Vision and Pattern Recognition (CVPR).
Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. 2018. SFV: Reinforcement Learning of Physical Skills from Videos. ACM Transactions on Graphics (TOG) 37, 6 (2018).
Helge Rhodin, Mathieu Salzmann, and Pascal Fua. 2018. Unsupervised Geometry-Aware Representation Learning for 3D Human Pose Estimation. In European Conference on Computer Vision (ECCV).
Grégory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. 2019. LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images. Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2019).
Erfan Shahabpoor and Aleksandar Pavic. 2017. Measurement of Walking Ground Reactions in Real-Life Environments: A Systematic Review of Techniques and Technologies. Sensors 17, 9 (2017), 2085.
Dana Sharon and Michiel van de Panne. 2005. Synthesis of Controllers for Stylized Planar Bipedal Walking. In International Conference on Robotics and Automation (ICRA).
Jonathan Starck and Adrian Hilton. 2007. Surface capture for performance-based animation. IEEE Computer Graphics and Applications (CGA) 27, 3 (2007), 21–31.
Carsten Stoll, Nils Hasler, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. 2011. Fast articulated motion tracking using a sums of Gaussians body model. In International Conference on Computer Vision (ICCV).
Bugra Tekin, Isinsu Katircioglu, Mathieu Salzmann, Vincent Lepetit, and Pascal Fua. 2016. Structured Prediction of 3D Human Pose with Deep Neural Networks. In British Machine Vision Conference (BMVC).
Denis Tomè, Chris Russell, and Lourdes Agapito. 2017. Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image. In Computer Vision and Pattern Recognition (CVPR).
Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. 2008. Articulated mesh animation from multi-view silhouettes. In ACM Transactions on Graphics (TOG), Vol. 27. 97.
Daniel Vlasic, Pieter Peers, Ilya Baran, Paul Debevec, Jovan Popović, Szymon Rusinkiewicz, and Wojciech Matusik. 2009. Dynamic Shape Capture using Multi-View Photometric Stereo.
ACM Transactions on Graphics (TOG) 28, 5 (2009), 174.
Marek Vondrak, Leonid Sigal, Jessica Hodgins, and Odest Jenkins. 2012. Video-based 3D Motion Capture Through Biped Control. ACM Transactions on Graphics (TOG) 31, 4 (2012), 1–12.
Bastian Wandt and Bodo Rosenhahn. 2019. RepNet: Weakly Supervised Training of an Adversarial Reprojection Network for 3D Human Pose Estimation. In Computer Vision and Pattern Recognition (CVPR).
Yangang Wang, Yebin Liu, Xin Tong, Qionghai Dai, and Ping Tan. 2018. Robust Non-rigid Motion Tracking and Surface Reconstruction Using L0 Regularization. IEEE Transactions on Visualization and Computer Graphics (TVCG) 24, 5 (2018), 1770–1783.
Michael Waschbüsch, Stephan Würmlin, Daniel Cotting, Filip Sadlo, and Markus Gross. 2005. Scalable 3D Video of Dynamic Scenes. The Visual Computer 21, 8-10 (2005), 629–638.
Xiaolin Wei and Jinxiang Chai. 2010. VideoMocap: Modeling Physically Realistic Human Motion from Monocular Video Sequences. In ACM Transactions on Graphics (TOG), Vol. 29.
Pawel Wrotek, Odest Chadwicke Jenkins, and Morgan McGuire. 2006. Dynamo: Dynamic, Data-Driven Character Control with Adjustable Balance. In ACM Sandbox Symposium on Video Games 2006.
Chenglei Wu, Kiran Varanasi, and Christian Theobalt. 2012. Full Body Performance Capture under Uncontrolled and Varying Illumination: A Shading-Based Approach. In European Conference on Computer Vision (ECCV).
Lan Xu, Weipeng Xu, Vladislav Golyanik, Marc Habermann, Lu Fang, and Christian Theobalt. 2020. EventCap: Monocular 3D Capture of High-Speed Human Motions using an Event Camera. In Computer Vision and Pattern Recognition (CVPR).
Wei Yang, Wanli Ouyang, Xiaolong Wang, Jimmy Ren, Hongsheng Li, and Xiaogang Wang. 2018. 3D Human Pose Estimation in the Wild by Adversarial Learning. In Computer Vision and Pattern Recognition (CVPR).
Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. 2018. Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes - The Importance of Multiple Scene Constraints. In Computer Vision and Pattern Recognition (CVPR).
Petrissa Zell, Bastian Wandt, and Bodo Rosenhahn. 2017. Joint 3D Human Motion Capture and Physical Analysis from Monocular Videos. In Computer Vision and Pattern Recognition (CVPR) Workshops.
Peizhao Zhang, Kristin Siu, Jianjie Zhang, C. Karen Liu, and Jinxiang Chai. 2014. Leveraging Depth Cameras and Wearable Pressure Sensors for Full-Body Kinematics and Dynamics Capture. ACM Transactions on Graphics (TOG) 33, 6 (2014), 1–14.
Yuxiang Zhang, Liang An, Tao Yu, Xiu Li, Kun Li, and Yebin Liu. 2020. 4D Association Graph for Realtime Multi-Person Motion Capture Using Multiple Video Cameras. In International Conference on Computer Vision (ICCV).
Yu Zheng and Katsu Yamane. 2013. Human Motion Tracking Control with Strict Contact Force Constraints for Floating-Base Humanoid Robots. In International Conference on Humanoid Robots (Humanoids).
Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. 2017. Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach. In International Conference on Computer Vision (ICCV).
Xingyi Zhou, Xiao Sun, Wei Zhang, Shuang Liang, and Yichen Wei. 2016. Deep Kinematic Pose Regression. In European Conference on Computer Vision (ECCV).
Yuliang Zou, Jimei Yang, Duygu Ceylan, Jianming Zhang, Federico Perazzi, and Jia-Bin Huang. 2020. Reducing Footskate in Human Motion Reconstruction with Ground Contact Constraints. In Winter Conference on Applications of Computer Vision (WACV).