Contact and Human Dynamics from Monocular Video
Davis Rempe, Leonidas J. Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, Jimei Yang
Stanford University, Adobe Research
geometry.stanford.edu/projects/human-dynamics-eccv-2020
Abstract.
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles. In this paper, we present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input. We first estimate ground contact timings with a novel prediction network which is trained without hand-labeled data. A physics-based trajectory optimization then solves for a physically-plausible motion, based on the inputs. We show this process produces motions that are significantly more realistic than those from purely kinematic methods, substantially improving quantitative measures of both kinematic and dynamic plausibility. We demonstrate our method on character animation and pose estimation tasks on dynamic motions of dancing and sports with complex contact patterns.
1 Introduction

Recent methods for human pose estimation from monocular video [1,20,34,47] estimate accurate overall body pose with small absolute differences from the true poses in body-frame 3D coordinates. However, the recovered motions in world-frame are visually and physically implausible in many ways, including feet that float slightly or penetrate the ground, implausible forward or backward body lean, and motion errors like jittery, vibrating poses. These errors would prevent many subsequent uses of the motions. For example, inference of actions, intentions, and emotion often depends on subtleties of pose, contact and acceleration, as does computer animation; human perception is highly sensitive to physical inaccuracies [16,38]. Adding more training data would not solve these problems, because existing methods do not account for physical plausibility.

Physics-based trajectory optimization presents an appealing solution to these issues, particularly for dynamic motions like walking or dancing. Physics imposes important constraints that are hard to express in pose space but easy in terms of dynamics. For example, feet in static contact do not move, the body moves smoothly overall relative to contacts, and joint torques are not large. However,
Fig. 1.
Our contact prediction and physics-based optimization corrects numerous physically implausible artifacts common in 3D human motion estimations from, e.g., Monocular Total Capture (MTC) [47], such as foot floating (top row), foot penetrations (middle), and unnatural leaning (bottom).

full-body dynamics is notoriously difficult to optimize [40], in part because contact is discontinuous, and the number of possible contact events grows exponentially in time. As a result, combined optimization of contact and dynamics is enormously sensitive to local minima.

This paper introduces a new strategy for extracting dynamically valid full-body motions from monocular video (Figure 1), combining learned pose estimation with physical reasoning through trajectory optimization. As input, we use the results of kinematic pose estimation techniques [4,47], which produce accurate overall poses but inaccurate contacts and dynamics. Our method leverages a reduced-dimensional body model with centroidal dynamics and contact constraints [9,46] to produce a physically-valid motion that closely matches these inputs. We first infer foot contacts from 2D poses in the input video, which are then used in a physics-based trajectory optimization to estimate 6D center-of-mass motion, feet positions, and contact forces. We show that a contact prediction network can be accurately trained on synthetic data. This allows us to separate initial contact estimation from motion optimization, making the optimization more tractable. As a result, our method is able to handle highly dynamic motions without sacrificing physical accuracy.
We focus on single-person dynamic motions from dance, walking, and sports. Our approach substantially improves the realism of inferred motions over state-of-the-art methods, and estimates numerous physical properties that could be useful for further inference of scene properties and action recognition. We primarily demonstrate our method on character animation by retargeting captured motion from video to a virtual character. We evaluate our approach using numerous kinematics and dynamics metrics designed to measure the physical plausibility of the estimated motion. The proposed method takes an important step toward incorporating physical constraints into human motion estimation from video, and shows the potential to reconstruct realistic, dynamic sequences.
2 Related Work

We build on several threads of work in computer vision, animation, and robotics, each with a long history [11]. Recent vision results are detailed here.

Recent progress in pose estimation can accurately detect 2D human keypoints [4,14,31] and infer 3D pose [1,20,34] from a single image. Several recent methods extract 3D human motions from monocular videos by exploring various forms of temporal cues [21,30,48,47]. While these methods focus on explaining human motion in pixel space, they do not account for physical plausibility. Several recent works interpret interactions between people and their environment in order to make inferences about each [7,13,49]; each of these works uses only static kinematic constraints. Zou et al. [50] infer contact constraints to optimize 3D motion from video. We show how dynamics can improve inference of human-scene interactions, leading to more physically plausible motion capture.

Some works have proposed physics constraints to address the issues of kinematic tracking. Brubaker et al. [3] propose a physics-based tracker based on a reduced-dimensional walking model. Wei and Chai [45] track body motion from video, assuming keyframe and contact constraints are provided. Similar to our own work, Brubaker and Fleet [2] perform trajectory optimization for full-body motion. To jointly optimize contact and dynamics, they use a continuous approximation to contact. However, soft contact models introduce new difficulties, including inaccurate transitions and sensitivity to stiffness parameters, while still suffering from local minima issues. Moreover, their reduced-dimensional model includes only center-of-mass positional motion, which does not handle rotational motion well. In contrast, we obtain accurate contact initialization in a preprocessing step to simplify optimization, and we model rotational inertia.

Li et al. [27] estimate dynamic properties from videos.
We share the same overall pipeline of estimating pose and contacts, followed by trajectory optimization. Whereas they focus on the dynamics of human-object interactions, we focus on videos where the human motion itself is much more dynamic, with complex variation in pose and foot contact; we do not consider human-object interaction. They use a simpler data term, and perform trajectory optimization in full-body dynamics, unlike our reduced representation. Their classifier training requires hand-labeled data, unlike our automatic dataset creation method.
Fig. 2.
Method overview. Given an input video, our method starts with initial estimates from existing 2D and 3D pose methods [4,47]. The lower-body 2D joints are used to infer foot contacts (orange box). Our optimization framework contains two parts (blue boxes). Inferred contacts and initial poses are used in a kinematic optimization that refines the 3D full-body motion and fits the ground. These are given to a reduced-dimensional physics-based trajectory optimization that applies dynamics.
Prior methods learn character animation controllers from video. Vondrak et al. [42] train a state-machine controller using image silhouette features. Peng et al. [36] train a controller to perform skills by following kinematically-estimated poses from input video sequences. They demonstrate impressive results on a variety of skills. They do not attempt accurate reconstruction of motion or contact, nor do they evaluate for these tasks; rather, they focus on control learning.

Our optimization is related to physics-based methods in computer animation, e.g., [10,19,24,28,29,37,44]. Two unique features of our optimization are the use of low-dimensional dynamics optimization that includes 6D center-of-mass motion and contact constraints, thereby capturing important rotational and footstep quantities without requiring full-body optimization, and the use of a classifier to determine contacts before optimization.
3 Method

This section describes our approach, which is summarized in Figure 2. The core of our method is a physics-based trajectory optimization that enforces dynamics on the input motion (Section 3.1). Foot contact timings are estimated in a preprocess (Section 3.2), along with other inputs to the optimization (Section 3.3). Similar to previous work [27,47], in order to recover full-body motion we assume there is no camera motion and that the full body is visible.
3.1 Physics-Based Trajectory Optimization

The core of our framework is an optimization which enforces dynamics on an initial motion estimate given as input (see Section 3.3). The goal is to improve the plausibility of the motion by applying physical reasoning through the objective and constraints. We aim to avoid common perceptual errors, e.g., jittery, unnatural motion with feet skating and ground penetration, by generating a smooth trajectory with physically-valid momentum and static feet during contact.

The optimization is performed on a reduced-dimensional body model that captures overall motion, rotation, and contacts, but avoids the difficulty of optimizing all joints. Modeling rotation is necessary for important effects like arm swing and counter-oscillations [15,24,29], and the reduced-dimensional centroidal dynamics model can produce plausible trajectories for humanoid robots [5,9,32]. Our method is based on a recent robot motion planning algorithm from Winkler et al. [46] that leverages a simplified version of centroidal dynamics, which treats the robot as a rigid body with a fixed mass and moment of inertia. Their method finds a feasible trajectory by optimizing the position and rotation of the center-of-mass (COM) along with feet positions, contact forces, and contact durations, as described in detail below. We modify this algorithm to suit our computer vision task: we use a temporally varying inertia tensor, which allows for changes in mass distribution (swinging arms) and enables estimating the dynamic motions of interest; we add energy terms to match the input kinematic motion and foot contacts; and we add new kinematics constraints for our humanoid skeleton.
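The temporally varying world-frame inertia tensor used in the dynamics constraints is obtained by rotating the body-frame tensor with the current orientation. A minimal sketch, assuming a Z-Y-X Euler angle convention (the convention is not specified here and is an assumption):

```python
import numpy as np

def euler_to_matrix(theta):
    """Rotation matrix from Euler angles theta = (roll, pitch, yaw).

    The Z-Y-X composition used here is an assumption; any consistent
    convention works as long as it matches the rest of the pipeline.
    """
    cx, sx = np.cos(theta[0]), np.sin(theta[0])
    cy, sy = np.cos(theta[1]), np.sin(theta[1])
    cz, sz = np.cos(theta[2]), np.sin(theta[2])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    return Rz @ Ry @ Rx

def world_inertia(I_body, theta):
    """I_w(t) = R(t) I_b(t) R(t)^T at a single timestep."""
    R = euler_to_matrix(theta)
    return R @ I_body @ R.T
```

Because the rotation is a similarity transform, the trace and symmetry of the inertia tensor are preserved, which is a convenient sanity check on an implementation.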
Inputs.
The method takes initial estimates of: COM position r̄(t) ∈ ℝ³ and orientation θ̄(t) ∈ ℝ³ trajectories, a body-frame inertia tensor trajectory I_b(t) ∈ ℝ^{3×3}, and trajectories of the foot joint positions p̄_i(t) ∈ ℝ³. There are four foot joints: left toe base, left heel, right toe base, and right heel, indexed as i ∈ {1, 2, 3, 4}. These inputs are at discrete timesteps, but we write them here as functions for clarity. The 3D ground plane height h_floor and upward normal n̂ are provided. Additionally, for each foot joint at each time, a binary label is provided indicating whether the foot is in contact with the ground. These labels determine initial estimates of contact durations for each foot joint T̄_{i,1}, T̄_{i,2}, ..., T̄_{i,n_i} as described below. The distance from toe to heel ℓ_foot and the maximum distance from toe to hip ℓ_leg are also provided. All quantities are computed from video input as described in Sections 3.2 and 3.3, and are used both to initialize the optimization variables and as targets in the objective function.

Optimization Variables.
The optimization variables are the COM position and Euler angle orientation r(t), θ(t) ∈ ℝ³, foot joint positions p_i(t) ∈ ℝ³, and contact forces f_i(t) ∈ ℝ³. These variables are continuous functions of time, represented by piece-wise cubic polynomials with continuity constraints. We also optimize contact timings. The contacts for each foot joint are independently parameterized by a sequence of phases that alternate between contact and flight. The optimizer cannot change the type of each phase (contact or flight), but it can modify their durations T_{i,1}, T_{i,2}, ..., T_{i,n_i} ∈ ℝ, where n_i is the number of total contact phases for the i-th foot joint.

Objective.
Our complete formulation is shown in Figure 3:

  min  Σ_{t=0}^{T} ( E_data(t) + E_vel(t) + E_acc(t) ) + E_dur

  s.t.  m r̈(t) = Σ_{i=1}^{4} f_i(t) + m g                          (dynamics)
        I_w(t) ω̇(t) + ω(t) × I_w(t) ω(t) = Σ_{i=1}^{4} f_i(t) × (r(t) − p_i(t))
        ṙ(0) = r̄̇(0),  ṙ(T) = r̄̇(T)                               (velocity boundaries)
        ||p_1(t) − p_2(t)|| = ||p_3(t) − p_4(t)|| = ℓ_foot        (foot kinematics)
        for every foot joint i:
          ||p_i(t) − p_hip,i(t)|| ≤ ℓ_leg                         (leg kinematics)
          Σ_{j=1}^{n_i} T_{i,j} = T                               (contact durations)
        for foot joint i in contact at time t:
          ṗ_i(t) = 0                                              (no slip)
          p_i^z(t) = h_floor(p_i^xy)                              (on floor)
          0 ≤ f_i(t)^T n̂ ≤ f_max                                  (pushing/max force)
          |f_i(t)^T t̂_{1,2}| < μ f_i(t)^T n̂                       (friction pyramid)
        for foot joint i in flight at time t:
          p_i^z(t) ≥ h_floor(p_i^xy)                              (above floor)
          f_i(t) = 0                                              (no force in air)

Fig. 3. Physics-based trajectory optimization formulation. Please see text for details.

E_data and E_dur seek to keep the motion and contacts as close as possible to the initial inputs, which are derived from video, at discrete steps over the entire duration T:

  E_data(t) = w_r ||r(t) − r̄(t)||² + w_θ ||θ(t) − θ̄(t)||² + w_p Σ_{i=1}^{4} ||p_i(t) − p̄_i(t)||²   (1)

  E_dur = w_d Σ_{i=1}^{4} Σ_{j=1}^{n_i} (T_{i,j} − T̄_{i,j})²   (2)

We weight these terms with fixed scalar weights w_r, w_θ, w_p, and w_d. E_vel and E_acc encourage smooth motion by penalizing velocities and accelerations:

  E_vel(t) = γ_r ||ṙ(t)||² + γ_θ ||θ̇(t)||² + γ_p Σ_{i=1}^{4} ||ṗ_i(t)||²   (3)

  E_acc(t) = β_r ||r̈(t)||² + β_θ ||θ̈(t)||² + β_p Σ_{i=1}^{4} ||p̈_i(t)||²   (4)

with fixed weights γ_r, γ_θ, γ_p and β_r, β_θ, β_p.

Constraints.
The first set of constraints strictly enforces valid rigid-body mechanics, including linear and angular momentum. This enforces important properties of motion; for example, during flight the COM must follow a parabolic arc according to Newton's Second Law. During contact, the body acceleration is limited by the possible contact forces, e.g., one cannot walk at a 45° lean.

At each timestep, we use the world-frame inertia tensor I_w(t) computed from the input I_b(t) and the current orientation θ(t). This assumes that the final output poses will not be dramatically different from those of the input: a reasonable assumption since our optimization does not operate on upper-body joints and changes in feet positioning are typically small (though perceptually important). We found that using a constant inertia tensor (as in Winkler et al. [46]) made convergence difficult to achieve. The gravity vector is g = −9.8 n̂, where n̂ is the ground normal. The angular velocity ω is a function of the rotations θ [46].

The contact forces are constrained to ensure that they push away from the floor but are not greater than f_max = 1000 N in the normal direction. With 4 foot joints, this allows 4000 N of normal contact force: about the magnitude that a 100 kg (220 lb) person would produce for extremely dynamic dancing motion [23]. We assume no foot slipping during contact, so forces must also remain in a friction pyramid defined by the friction coefficient μ and tangent directions t̂₁, t̂₂. Lastly, forces should be zero at any foot joint not in contact.

Foot contact is enforced through constraints. When a foot joint is in contact, it should be stationary (no-slip) and at floor height h_floor. When not in contact, feet should always be on or above the ground.
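As a sanity check, the linear dynamics and contact-force constraints can be evaluated pointwise on a candidate trajectory. A sketch, where the friction coefficient default is purely illustrative (the paper's value is not stated here):

```python
import numpy as np

def linear_residual(m, r_ddot, forces, g):
    """Residual of the linear dynamics constraint m r'' = sum_i f_i + m g.

    forces: (4, 3) contact forces at the four foot joints. The residual
    vanishes for a dynamically consistent trajectory, e.g. ballistic
    flight (all forces zero, acceleration equal to gravity).
    """
    forces = np.asarray(forces)
    return m * np.asarray(r_ddot) - (forces.sum(axis=0) + m * np.asarray(g))

def contact_force_ok(f, n_hat, t1_hat, t2_hat, mu=0.5, f_max=1000.0):
    """Pushing/max-force and friction-pyramid checks for one contact force."""
    f = np.asarray(f)
    fn = float(f @ n_hat)
    if not (0.0 <= fn <= f_max):
        return False  # must push off the floor, bounded by f_max
    return abs(float(f @ t1_hat)) < mu * fn and abs(float(f @ t2_hat)) < mu * fn
```

In the actual optimization these appear as hard constraints handed to the solver; here they merely verify a given trajectory point.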
These constraints avoid feet skating and penetration of the ground.

In order to make the optimized motion valid for a humanoid skeleton, the toe and heel of each foot should maintain a constant distance of ℓ_foot. Finally, no foot joint should be farther from its corresponding hip than the length of the leg ℓ_leg. The hip position p_hip,i(t) is computed from the COM orientation at that time based on the hip offset in the skeleton detailed in Section 3.3.

Optimization Algorithm.
We optimize with IPOPT [43], a nonlinear interior-point optimizer, using analytical derivatives. We perform the optimization in stages: we first use fixed contact phases and no dynamics constraints to fit the polynomial representation for the COM and feet position variables as closely as possible to the input motion. Next, we add in dynamics constraints to find a physically valid motion, and finally we allow contact phase durations to be optimized to further refine the motion if possible.

Following the optimization, we compute a full-body motion from the physically-valid COM and foot joint positions using Inverse Kinematics (IK) on a desired skeleton S_tgt (see supplementary Appendix C).

3.2 Contact Estimation

Before performing our physics-based optimization, we need to infer when the subject's feet are in contact with the ground, given an input video. These contacts are a target for the physics optimization objective and their accuracy is crucial to its success. To do so, we train a network that, for each video frame, classifies whether the toe and heel of each foot are in contact with the ground.
The main challenge is to construct a suitable dataset and feature representation. There is currently no publicly-available dataset of videos with labeled foot contacts and a wide variety of dynamic motions. Manually labeling a large, varied dataset would be difficult and costly. Instead, we generate synthetic data using motion capture (mocap) sequences. We automatically label contacts in the mocap and then use 2D joint position features from OpenPose [4] as input to our model, rather than image features from the raw rendered video frames. This allows us to train on synthetic data but then apply the model to real inputs.
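A 2D-pose feature representation of this kind can be assembled roughly as follows. This is a sketch: the root-joint index and array layout are assumptions, with the window size and joint count matching the values given in the model description of this section.

```python
import numpy as np

def pose_window_features(joints_2d, conf, t, w=9, root=0):
    """Contact-classifier input for the frame at index t.

    joints_2d: (T, J, 2) lower-body 2D joints from OpenPose, conf: (T, J)
    per-joint detection confidences. The w-frame window centered on t is
    translated so the target frame's root joint (index assumed) sits at
    the origin, and each joint's confidence is appended, giving a flat
    vector of length 3 * J * w.
    """
    half = w // 2
    win = joints_2d[t - half : t + half + 1] - joints_2d[t, root]  # (w, J, 2)
    c = conf[t - half : t + half + 1][..., None]                   # (w, J, 1)
    return np.concatenate([win, c], axis=-1).reshape(-1)
```

Keeping the confidences alongside the coordinates lets the classifier discount noisy or occluded detections, which is one reason pose features transfer from rendered to real video.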
Dataset.
To construct our dataset, we obtained 65 mocap sequences for the 13 most human-like characters, ranging from dynamic dancing motions to idling. Our set contains a diverse range of mocap sequences, retargeted to a variety of animated characters. At each time of each motion sequence, four possible contacts are automatically labeled by a heuristic: a toe or heel joint is considered to be in contact when (i) it has moved less than 2 cm from the previous time, and (ii) it is within 5 cm of the known ground plane. Although more sophisticated labeling [17,25] could be used, we found this approach sufficiently accurate to learn a model for the videos we evaluated on.

We render these motions (see Figure 5(c)) on their rigged characters with motion blur, randomized camera viewpoint, lighting, and floor texture. For each sequence, we render two views, resulting in over 100k frames of video with labeled contacts and 2D and 3D poses. Finally, we run a 2D pose estimation algorithm, OpenPose [4], to obtain the 2D skeleton which our model uses as input.
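The labeling heuristic amounts to a displacement and height threshold per foot joint. A sketch, assuming a z-up world measured in meters:

```python
import numpy as np

def label_contacts(joint_pos, ground_height, move_thresh=0.02, height_thresh=0.05):
    """Heuristic per-frame contact labels for one toe or heel joint.

    joint_pos: (T, 3) world positions in meters (z-up assumed). A joint is
    labeled in contact when it moved less than 2 cm since the previous
    frame and lies within 5 cm of the known ground plane. The first frame
    has no predecessor; treating it as "not moving" is a convention here.
    """
    disp = np.linalg.norm(np.diff(joint_pos, axis=0), axis=1)
    moved_little = np.concatenate([[True], disp < move_thresh])
    near_ground = np.abs(joint_pos[:, 2] - ground_height) < height_thresh
    return moved_little & near_ground
```

Because mocap provides clean 3D joints and a known floor, this simple rule is reliable for dataset creation, even though (as shown later) similar velocity rules fail on noisy 2D/3D estimates from video.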
Model and Training.
The classification problem is to map from the 2D pose in each frame to the four contact labels for the feet joints. As we demonstrate in Section 4.1, simple heuristics based on 2D velocity do not accurately label contacts due to the ambiguities of 3D projection and noise.

For a given time t, our labeling neural network takes as input the 2D poses over a temporal window of duration w centered on the target frame at t. The 2D joint positions over the window are normalized to place the root position of the target frame at (0, 0). We use a window of w = 9 video frames and use the 13 lower-body joint positions as shown in Figure 4. Additionally, the OpenPose confidence c for each joint position is included as input. Hence, the input to the network is a vector of (x, y, c) values of dimension 3 × 13 × 9.

3.3 Kinematic Initialization

Along with contact labels, our physics-based optimization requires as input a ground plane and initial trajectories for the COM, feet, and inertia tensor. In order to obtain these, we compute an initial 3D full-body motion from video. Since this stage uses standard elements, e.g., [12], we summarize the algorithm here, and provide full details in Appendix B.

First, Monocular Total Capture [47] (MTC) is applied to the input video to obtain an initial noisy 3D pose estimate for each frame. Although MTC accounts for motion through a texture-based refinement step, the output still contains a number of artifacts (Figure 1) that make it unsuitable for direct use in our physics optimization. Instead, we initialize a skeleton S_src containing 28 body joints from the MTC input poses, and then use a kinematic optimization to solve for an optimal root translation and joint angles over time, along with parameters of the ground plane.
The objective for this optimization contains terms to smooth the motion, ensure feet are stationary and on the ground when in contact, and to stay close to both the 2D OpenPose and 3D MTC pose inputs. We first optimize so that the feet are stationary, but not at a consistent height. Next, we use a robust regression to find the ground plane which best fits the foot joint contact positions. Finally, we continue the optimization to ensure all feet are on this ground plane when in contact.

The full-body output motion of the kinematic optimization is used to extract inputs for the physics optimization. Using a predefined body mass (73 kg for all experiments) and distribution [26], we compute the COM and inertia tensor trajectories. We use the orientation about the root joint as the COM orientation, and the feet joint positions are used directly.

4 Results

Here we present extensive qualitative and quantitative evaluations of our contact estimation and motion optimization.
4.1 Contact Estimation Evaluation

We evaluate our learned contact estimation method and compare to baselines on the synthetic test set (78 videos) and 9 real videos with manually-labeled foot contacts. The real videos contain dynamic dancing motions and include 700 labeled frames in total. In Table 1, we report classification accuracy for our method and numerous baselines.

We compare to using a velocity heuristic on foot joints, as described in Section 3.2, for both the 2D OpenPose and 3D MTC estimations. We also compare to using different subsets of joint positions. Our MLP using all lower-body joints is substantially more accurate on both synthetic and real videos than all baselines. Using upper-body joints down to the knees yields surprisingly good results.

In order to test the benefit of contact estimation, we compared our full optimization pipeline on the synthetic test set using network-predicted contacts
Table 1.
Classification accuracy of estimating foot contacts from video. Left: comparison to various baselines; Right: ablations using subsets of joints as input features.
Baselines:
  Method           Synthetic Accuracy   Real Accuracy
  Random           0.507                0.480
  Always Contact   0.677                0.647
  2D Velocity      0.853                0.867
  3D Velocity      0.818                0.875

MLP ablations (input joints):
  Input Joints          Synthetic Accuracy   Real Accuracy
  Upper down to hips    0.919                0.692
  Upper down to knees   0.935                0.865
  Lower up to ankles    0.933                0.923
  Lower up to hips      0.941                0.935
Fig. 4.
Foot contact estimation on a video using our learned model compared to a 2D velocity heuristic. All visualized joints are used as input to the network, which outputs four contact labels (left toes, left heel, right toes, right heel). Red joints are labeled as contacting. Key differences are shown with orange boxes.

versus contacts predicted using a velocity heuristic on the 3D joints from MTC input. Optimization using network-predicted contacts converged for 94.9% of the test set videos, compared to 69.2% for the velocity heuristic. This illustrates how contact prediction is crucial to the success of motion optimization.

Qualitative results of our contact estimation method are shown in Figure 4. Our method is compared to the 2D velocity baseline, which has difficulty for planted feet when detections are noisy, and often labels contacts for joints that are stationary but off the ground (e.g. heels).
4.2 Qualitative Results

Our method provides key qualitative improvements over prior kinematic approaches. We urge the reader to view the supplementary video in order to fully appreciate the generated motions. For qualitative evaluation, we demonstrate animation from video by retargeting captured motion to a computer-animated character. Given a target skeleton S_tgt for a character, we insert an IK retargeting step following the kinematic optimization as shown in Figure 2 (see Appendix D for details), allowing us to perform the usual physics-based

Fig. 5.
Qualitative results on synthetic and real data. a) Results on a synthetic test video with a ground-truth alternate view. Two nearby frames are shown for the input video and the alternate view. We fix penetration, floating, and leaning prevalent in our method's input from MTC. b) Dynamic exercise video (top) and the output full-body motion (middle) and optimized COM trajectory and contact forces (bottom).

optimization on this new skeleton. We use the same IK procedure to compare to MTC results directly targeted to the character.

Figure 1 shows that our proposed method fixes artifacts such as foot floating (top row), foot penetrations (middle), and unnatural leaning (bottom). Figure 5(a) shows frames comparing the MTC input to our final result on a synthetic video for which we have a ground-truth alternate view. For this example only, we use the true ground plane as input to our method for a fair comparison (see Section 4.3). From the input view, our method fixes feet floating and penetration. From the first frame of the alternate view, we see that the MTC pose is in fact extremely unstable, leaning backward while balancing on its heels; our method has placed the contacting feet in a stable position to support the pose, better matching the true motion.

Figure 5(b) shows additional qualitative results on a real video. We faithfully reconstruct dynamic motion with complex contact patterns in a physically accurate way. The bottom row shows the outputs of the physics-based optimization stage of our method at multiple frames: the COM trajectory and contact forces at the heel and toe of each foot.
4.3 Quantitative Results

Quantitative evaluation of high-quality motion estimation presents a significant challenge. Recent pose estimation work evaluates average positional errors of joints in the local body frame up to various global alignment methods [35]. However, those pose errors can be misleading: a motion can be pose-wise close to ground truth on average, but produce extremely implausible dynamics, including vibrating positions and extreme body lean. These errors can be perceptually objectionable when remapping the motion onto an animated character, and prevent the use of inferred dynamics for downstream vision tasks.
Fig. 6.
Contact forces from our physics-based optimization for a walking and dancing motion. The net contact forces around 1000 N are 140% of the assumed body weight (73 kg), a reasonable estimate compared to prior force plate data [2].
Therefore, we propose to use a set of metrics inspired by the biomechanics literature [2,15,19], namely, to evaluate plausibility of physical quantities based on known properties of human motion.

We use two baselines: MTC, which is the state-of-the-art for pose estimation, and our kinematic-only initialization (Section 3.3), which transforms the MTC input to align with the estimated contacts from Section 3.2. We run each method on the synthetic test set of 78 videos. For these quantitative evaluations only, we use the ground-truth floor plane as input to our method to ensure a fair comparison. Note that our method does not need the ground-truth floor, but using it ensures a proper evaluation of our primary contributions rather than that of the floor-fitting procedure, which is highly dependent on the quality of MTC input (see Appendix E for quantitative results using the estimated floor).
Dynamics Metrics.
To evaluate dynamic plausibility, we estimate net ground reaction forces (GRF), defined as f_GRF(t) = Σ_i f_i(t). For our full pipeline, we use the physics-based optimized GRFs, which we compare to implied forces from the kinematic-only initialization and MTC input. In order to infer the GRFs implied by the kinematic optimization and MTC, we estimate the COM trajectory of the motion using the same mass and distribution as for our physics-based optimization (73 kg). We then approximate the acceleration at each time step and solve for the implied GRFs for all time steps (both in contact and flight).

We assess plausibility using GRFs measured in force plate studies, e.g., [2,15,39]. For walking, GRFs typically reach 80% of body weight; for a dance jump, GRFs can reach up to about 400% of body weight [23]. Since we do not know the body weights of our subjects, we use a conservative range of 50 kg–80 kg for evaluation. Figure 6 shows the optimized GRFs produced by our method for a walking and swing dancing motion. The peak GRFs produced by our method match the data: for the walking motion, 115–184% of body weight, and 127–204% for dancing. In contrast, the kinematic-only GRFs are 319–510% (walking) and 765–1223% (dancing); these are implausibly high, a consequence of noisy and unrealistic joint accelerations.

We also measure GRF plausibility across the whole test set (Table 2(left)). GRF values are measured as a percentage of the GRF exerted by an idle 73 kg person. On average, our estimate is within 1% of the idle force, while the kinematic motion implies GRFs as if the person were 24.4% heavier. Similarly, the peak force of the kinematic motion is equivalent to the subject carrying an extra 830 kg of weight, compared to only 174 kg after physics optimization. The Max GRF for MTC is even less plausible, as the COM motion is jittery before smoothing during kinematic and dynamics optimization.

Table 2. Physical plausibility evaluation on synthetic test set. Mean/Max GRF are contact forces as a proportion of body weight; see text for discussion of plausible values. Ballistic GRF are unexplained forces during flight; smaller values are better. Foot position metrics measure the percentage of frames containing typical foot contact errors per joint; smaller values are better.

                      Dynamics (Contact forces)             Kinematics (Foot positions)
  Method              Mean GRF   Max GRF    Ballistic GRF   Floating   Penetration   Skate
  MTC [47]            143.0%     9055.3%    115.6%          58.7%      21.1%         16.8%
  Kinematics (ours)   124.4%     1237.5%    255.2%
  Physics (ours)
Ballistic GRF measures the median GRF on the COM when no feet joints should be in contact according to ground-truth labels. The GRF should be exactly 0%, meaning there are no contact forces and only gravity acts on the COM; the kinematic method obtains results of 255%, as if the subject were wearing a powerful jet pack.
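The implied-GRF computation reduces to Newton's second law applied to a finite-differenced COM trajectory. A sketch using central differences:

```python
import numpy as np

def implied_grf(com, dt, mass=73.0, g=np.array([0.0, 0.0, -9.8])):
    """Net ground reaction force implied by a COM trajectory.

    com: (T, 3) COM positions. Acceleration is approximated with central
    finite differences, and f = m (a - g) at the T-2 interior steps.
    A noisy trajectory yields implausibly large implied forces, which is
    exactly what these plausibility metrics expose.
    """
    a = (com[2:] - 2.0 * com[1:-1] + com[:-2]) / dt**2
    return mass * (a - g)
```

For an idle subject the implied force balances gravity, and for ballistic flight it is zero; deviations from these reference values are what the Mean/Max/Ballistic GRF columns of Table 2 quantify.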
Kinematics Metrics.
We consider three kinematic measures of plausibility (Table 2(right)). These metrics evaluate the accuracy of foot contact measurements. Specifically, given ground-truth labels of foot contact, we compute instances of foot Floating, Penetration, and Skate for heel and toe joints. Floating is the fraction of foot joints more than 3 cm off the ground when they should be in contact. Penetration is the fraction penetrating the ground more than 3 cm at any time. Skate is the fraction moving more than 2 cm when in contact.

After our kinematics initialization, the scores on these metrics are best (lower is better for all metrics) and degrade slightly after adding physics. This is due to the IK step which produces full-body motion following the physics-based optimization. Both the kinematic and physics optimization results substantially outperform MTC, which is rarely at a consistent foot height.
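These metrics are straightforward to compute from a joint trajectory, a ground plane, and ground-truth contact labels. A sketch (restricting Floating and Skate to contact frames is our reading of the definitions above):

```python
import numpy as np

def foot_metrics(height, disp, contact,
                 float_thresh=0.03, pen_thresh=0.03, skate_thresh=0.02):
    """Floating / Penetration / Skate fractions for one foot joint.

    height: (T,) signed height above the ground plane in meters,
    disp: (T,) per-frame displacement of the joint,
    contact: (T,) boolean ground-truth contact labels.
    """
    floating = float(np.mean(height[contact] > float_thresh))     # in contact but > 3 cm up
    penetration = float(np.mean(height < -pen_thresh))            # > 3 cm below floor, any time
    skate = float(np.mean(disp[contact] > skate_thresh))          # moving > 2 cm while in contact
    return floating, penetration, skate
```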
Positional Metrics.
For completeness, we evaluate the 3D pose output of our method on variations of standard positional metrics; results are shown in Table 3. In addition to our synthetic test set, we evaluate on all walking sequences from the training split of HumanEva-I [41], using the known ground plane as input. We measure the mean global per-joint position error (mm) for the ankle and toe joints (Feet in Table 3) and over all joints (Body). We also report the error after aligning the root joint of only the first frame of each sequence to the ground-truth skeleton (Body-Align 1), essentially removing any spurious constant offset from the predicted trajectory. Note that this differs from the common practice of aligning the roots at every frame, which would negate the effect of our trajectory optimization and thus not provide an informative performance measure. The errors between all methods are comparable, differing by at most 5 cm, which is very small considering these are global joint positions. Though the goal of our method is to improve physical plausibility, it does not negatively affect the pose on these standard measures.

Table 3. Pose evaluation on the synthetic and HumanEva-I walking datasets. We measure mean global per-joint 3D position error (no alignment) for feet and full-body joints. For full-body joints, we also report errors after root alignment on only the first frame of each sequence. We remain competitive while providing key physical improvements.

                    Synthetic Data                      HumanEva-I Walking
Method              Feet      Body      Body-Align 1    Feet      Body      Body-Align 1
MTC [47]            581.095
Kinematics (ours)   573.097   562.356   281.044
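The Body-Align 1 protocol differs from per-frame root alignment only in using a single constant offset. A minimal sketch, with names and array shapes being our own assumptions:

```python
import numpy as np

def global_mpjpe(pred, gt, joint_idx=None):
    """Mean global per-joint position error over a sequence.

    pred, gt  : (T, J, 3) joint positions in world coordinates (mm).
    joint_idx : optional joint subset (e.g., ankles and toes for "Feet").
    """
    if joint_idx is not None:
        pred, gt = pred[:, joint_idx], gt[:, joint_idx]
    return np.linalg.norm(pred - gt, axis=-1).mean()

def mpjpe_align_first_frame(pred, gt, root=0):
    """"Body-Align 1": translate the prediction so its root matches the
    ground truth at frame 0 only, removing any constant global offset,
    then compute global MPJPE. Aligning at every frame instead would
    hide trajectory errors, so the alignment is applied once."""
    offset = gt[0, root] - pred[0, root]
    return global_mpjpe(pred + offset, gt)
```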
Contributions.
The method described in this paper estimates physically-valid motions from initial kinematic pose estimates. As we show, this produces motions that are visually and physically much more plausible than those of state-of-the-art methods. We show results on retargeting to characters, but the method could also serve further vision tasks that would benefit from dynamical properties of motion. Estimating accurate human motion entails numerous challenges, and we have focused on one crucial sub-problem. There are several other important unknowns in this space, such as motion for partially-occluded individuals and ground plane position. Each of these problems, along with the limitations discussed below, is an enormous challenge in its own right and is therefore reserved for future work. However, we believe that the ideas in this work could contribute to solving these problems and open multiple avenues for future exploration.
Limitations.
We make a number of assumptions to keep the problem manageable, all of which can be relaxed in future work: we assume that feet are unoccluded, that there is a single ground plane, and that the subject is not interacting with other objects, and we do not handle contact from other body parts such as knees or hands. These assumptions are permissible for the character-animation-from-video-mocap application, but should be addressed in a general motion estimation approach. Our optimization is also expensive: for a 2 second (60 frame) video clip, the physical optimization usually takes 30 minutes to 1 hour. This runtime is due primarily to the adapted implementation from prior work [46] being ill-suited to the increased size and complexity of human motion optimization. We expect a specialized solver and an optimized implementation to speed up execution.
Acknowledgments.
This work was in part supported by NSF grant IIS-1763268, grants from the Samsung GRO program and the Stanford SAIL Toyota Research Center, and a gift from Adobe Corporation. We thank the following YouTube channels for sharing their videos online: Dance FreaX, Dancercise Studio, Fencers Edge, MihranTV, DANCE TUTORIALS, Deepak Tulsyan, Gibson Moraes, and pigmie.
References
1. Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In: European Conference on Computer Vision (ECCV). pp. 561–578 (2016)
2. Brubaker, M.A., Sigal, L., Fleet, D.J.: Estimating contact dynamics. In: The IEEE International Conference on Computer Vision (ICCV). pp. 2389–2396 (2009)
3. Brubaker, M.A., Fleet, D.J., Hertzmann, A.: Physics-based person tracking using the anthropomorphic walker. International Journal of Computer Vision (1), 140–155 (2010)
4. Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
5. Carpentier, J., Mansard, N.: Multicontact locomotion of legged robots. IEEE Transactions on Robotics (6), 1441–1460 (2018)
6. Chan, C., Ginosar, S., Zhou, T., Efros, A.A.: Everybody dance now. In: IEEE International Conference on Computer Vision (ICCV) (2019)
7. Chen, Y., Huang, S., Yuan, T., Qi, S., Zhu, Y., Zhu, S.C.: Holistic++ scene understanding: Single-view 3D holistic scene parsing and human pose estimation with human-object interaction and physical commonsense. In: The IEEE International Conference on Computer Vision (ICCV). pp. 8648–8657 (2019)
8. Choi, K.J., Ko, H.S.: On-line motion retargetting. In: Pacific Conference on Computer Graphics and Applications. pp. 32– (1999)
9. Dai, H., Valenzuela, A., Tedrake, R.: Whole-body motion planning with centroidal dynamics and full kinematics. In: IEEE-RAS International Conference on Humanoid Robots. pp. 295–302 (2014)
10. Fang, A.C., Pollard, N.S.: Efficient synthesis of physically valid human motion. ACM Trans. Graph. (3), 417–426 (2003)
11. Forsyth, D.A., Arikan, O., Ikemoto, L., O'Brien, J., Ramanan, D.: Computational studies of human motion: Part 1, tracking and motion synthesis. Foundations and Trends in Computer Graphics and Vision (23), 77–254 (2006)
12. Gleicher, M.: Retargetting motion to new characters. In: SIGGRAPH. pp. 33–42 (1998)
13. Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3D human pose ambiguities with 3D scene constraints. In: The IEEE International Conference on Computer Vision (ICCV). pp. 2282–2292 (2019)
14. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: The IEEE International Conference on Computer Vision (ICCV). pp. 2961–2969 (2017)
15. Herr, H., Popovic, M.: Angular momentum in human walking. Journal of Experimental Biology (4), 467–481 (2008)
16. Hoyet, L., McDonnell, R., O'Sullivan, C.: Push it real: Perceiving causality in virtual interactions. ACM Trans. Graph. (4), 90:1–90:9 (2012)
17. Ikemoto, L., Arikan, O., Forsyth, D.: Knowing when to put your foot down. In: Symposium on Interactive 3D Graphics and Games (I3D). pp. 49–53 (2006)
18. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (7), 1325–1339 (2014)
19. Jiang, Y., Van Wouwe, T., De Groote, F., Liu, C.K.: Synthesis of biologically realistic human motion using joint torque actuation. ACM Trans. Graph. (4), 72:1–72:12 (2019)
20. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7122–7131 (2018)
21. Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3D human dynamics from video. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5614–5623 (2019)
22. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
23. Kulig, K., Fietzer, A.L., Popovich Jr., J.M.: Ground reaction forces and knee mechanics in the weight acceptance phase of a dance leap take-off and landing. Journal of Sports Sciences (2), 125–131 (2011)
24. de Lasa, M., Mordatch, I., Hertzmann, A.: Feature-based locomotion controllers. In: SIGGRAPH. pp. 131:1–131:10 (2010)
25. Le Callennec, B., Boulic, R.: Robust kinematic constraint detection for motion data. In: ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA). pp. 281–290 (2006)
26. de Leva, P.: Adjustments to Zatsiorsky-Seluyanov's segment inertia parameters. Journal of Biomechanics (9), 1223–1230 (1996)
27. Li, Z., Sedlar, J., Carpentier, J., Laptev, I., Mansard, N., Sivic, J.: Estimating 3D motion and forces of person-object interactions from monocular video. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8640–8649 (2019)
28. Liu, C.K., Hertzmann, A., Popović, Z.: Learning physics-based motion style with nonlinear inverse optimization. ACM Trans. Graph. (3), 1071–1081 (2005)
29. Macchietto, A., Zordan, V., Shelton, C.R.: Momentum control for balance. In: SIGGRAPH. pp. 80:1–80:8 (2009)
30. Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H.P., Xu, W., Casas, D., Theobalt, C.: VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Trans. Graph. (4) (2017)
31. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision (ECCV). pp. 483–499 (2016)
32. Orin, D.E., Goswami, A., Lee, S.H.: Centroidal dynamics of a humanoid robot. Autonomous Robots (2-3), 161–176 (2013)
33. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NeurIPS). pp. 8026–8037 (2019)
34. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3D hands, face, and body from a single image. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10975–10985 (2019)
35. Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7753–7762 (2019)
36. Peng, X.B., Kanazawa, A., Malik, J., Abbeel, P., Levine, S.: SFV: Reinforcement learning of physical skills from videos. ACM Trans. Graph. (6), 178:1–178:14 (2018)
37. Popović, Z., Witkin, A.: Physically based motion transformation. In: SIGGRAPH. pp. 11–20 (1999)
38. Reitsma, P.S.A., Pollard, N.S.: Perceptual metrics for character animation: Sensitivity to errors in ballistic motion. ACM Trans. Graph. (3), 537–542 (2003)
39. Robertson, D.G.E., Caldwell, G.E., Hamill, J., Kamen, G., Whittlesey, S.N.: Research Methods in Biomechanics. Human Kinetics (2004)
40. Safonova, A., Hodgins, J.K., Pollard, N.S.: Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. In: SIGGRAPH. pp. 514–521. ACM (2004)
41. Sigal, L., Balan, A., Black, M.: HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 4–27 (2010)
42. Vondrak, M., Sigal, L., Hodgins, J., Jenkins, O.: Video-based 3D motion capture through biped control. ACM Trans. Graph. (4), 27:1–27:12 (2012)
43. Wächter, A., Biegler, L.T.: On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming (1), 25–57 (2006)
44. Wang, J.M., Hamner, S.R., Delp, S.L., Koltun, V.: Optimizing locomotion controllers using biologically-based actuators and objectives. ACM Trans. Graph. (4) (2012)
45. Wei, X., Chai, J.: VideoMocap: Modeling physically realistic human motion from monocular video sequences. In: SIGGRAPH. pp. 42:1–42:10 (2010)
46. Winkler, A.W., Bellicoso, D.C., Hutter, M., Buchli, J.: Gait and trajectory optimization for legged systems through phase-based end-effector parameterization. IEEE Robotics and Automation Letters (RA-L), 1560–1567 (2018)
47. Xiang, D., Joo, H., Sheikh, Y.: Monocular Total Capture: Posing face, body, and hands in the wild. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10965–10974 (2019)
48. Xu, W., Chatterjee, A., Zollhöfer, M., Rhodin, H., Mehta, D., Seidel, H.P., Theobalt, C.: MonoPerfCap: Human performance capture from monocular video. ACM Trans. Graph. (2), 27:1–27:15 (2018)
49. Zanfir, A., Marinoiu, E., Sminchisescu, C.: Monocular 3D pose and shape estimation of multiple people in natural scenes: the importance of multiple scene constraints. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2148–2157 (2018)
50. Zou, Y., Yang, J., Ceylan, D., Zhang, J., Perazzi, F., Huang, J.B.: Reducing footskate in human motion reconstruction with ground contact constraints. In: The IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 459–468 (2020)

Appendices

A Contact Estimation Details
Here we detail the contact estimation model and the dataset from Section 3.2.
A.1 Synthetic Dataset
Our synthetic dataset was rendered using Blender and includes 13 characters, each performing 65 different motion capture sequences retargeted to that character. Each motion is recorded from 2 camera viewpoints, resulting in 1690 videos and 101k frames of data. The motions include samba, swing, and salsa dancing; boxing, football, and baseball actions; walking; and idle poses. Videos are rendered at 1280x720 with motion blur and are 2 seconds long at 30 fps. Example frames from the dataset are shown in Figure A1. In addition to the RGB frames, at each timestep the dataset includes 2D OpenPose [4] detections, the 3D pose in the form of the character's skeleton (skeletons differ for each character, and pose is provided in a .bvh motion capture file), foot contact labels for the heel and toe base of each foot as described in the main paper, and camera parameters.

For each video, many parameters are randomized; in particular, the camera is placed at a uniformly random distance from the character and at a Gaussian random height.

A.2 Model Details
We implement our contact estimation MLP (layer sizes 1024, 512, 128, 32, 20) in PyTorch [33]. All but the last layer are followed by batch normalization, and we use a single dropout layer before the size-128 layer. We apply L2 weight decay and use early stopping based on the validation set loss. We scale all 2D joint inputs to lie largely within [-1, 1].

Fig. A1. Example RGB frames from our synthetic dataset for contact estimation learning. The data also contains 2D OpenPose [4] detections, 3D pose in the form of the character's skeleton, and automatically labeled contacts for the toe base and heels.

At inference time, overlapping prediction windows produce multiple votes per frame; a joint is marked in contact at a frame if a majority of the votes for that frame classify it as in contact.
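The voting scheme can be sketched as follows for a single joint. The window layout here (each prediction window covering consecutive frames, shifted by one frame) is an assumption for illustration; note the final MLP layer size of 20 is consistent with 4 contact joints times a 5-frame prediction window.

```python
import numpy as np

def majority_vote_contacts(window_preds: np.ndarray, num_frames: int,
                           pred_window: int = 5) -> np.ndarray:
    """Merge overlapping per-window contact predictions by majority vote.

    window_preds : (num_windows, pred_window) binary predictions, where
                   window w covers frames w .. w + pred_window - 1.
    Returns a (num_frames,) boolean contact label per frame; a frame is in
    contact if a strict majority of the votes covering it say so.
    """
    votes = np.zeros(num_frames)
    counts = np.zeros(num_frames)
    for w, preds in enumerate(window_preds):
        votes[w:w + pred_window] += preds
        counts[w:w + pred_window] += 1
    return votes > 0.5 * counts
```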
B Kinematic Optimization Details
Fig. A2.
Skeleton. Bone lengths and pose are initialized from the MTC input for each motion sequence before kinematic optimization.

Here we detail the kinematic optimization procedure used to initialize our physics-based optimization. See Section 3.3 for an overview.
B.1 Inputs and Initialization
Our kinematic optimization takes in J body joints over T timesteps that make up the motion. Specifically, for j ∈ J and t ∈ T we have the 2D pose detection from OpenPose x_{j,t} ∈ R^2 with confidence σ_{j,t}; the contacts estimated in the previous stage of our pipeline c_{j,t} ∈ {0, 1}, where c_{j,t} = 1 if joint j is in contact with the ground at time t; and finally the full-body 3D pose from MTC.

We use the MTC input to initialize a custom skeleton, called S_src in the main paper, which contains J = 28 joints. In particular, we use the MTC COCO regressor to obtain 19 body joint positions (excluding feet) over the sequence, and vertices on the Adam mesh for the 6 foot joints as in the original MTC paper. We choose these 25 joints in order to use a re-projection error in our optimization objective, as described below. To better map to the character rigs used during animation retargeting, we additionally use 3 spine joint positions directly from the MTC body model, giving the final total of 28 joints. Note that we do not use any hand or face information from MTC. Our skeleton has fixed bone lengths, which are determined from the input 3D joint locations throughout the motion sequence: the median distance between two joints over the entire sequence defines the bone length. Our skeleton (before fitting bone lengths to the input data) is visualized in Figure A2.

We normalize the input positions to get the root-relative positions of each joint q_{j,t} ∈ R^3, j = 1, ..., J, t = 1, ..., T, which we will target during optimization, and let the global translation be our initial root translation p_root,t. All these positions are preprocessed to remove obviously incorrect frames based on OpenPose detection confidence: for the 25 joints with corresponding 2D OpenPose detections (all non-spine joints in our skeleton), if the confidence is below 0.3, the frame is replaced with a linear interpolation between the closest adjacent frames of sufficient confidence.

Because we optimize for the joint angles of our skeleton (see below), we must next find initial joint angles that match the MTC joint position inputs. We roughly initialize the joint angles of our skeleton by copying those from the MTC body model, and finally perform inverse kinematics (IK) targeting the preprocessed joint positions, which reconstructs the MTC input on our skeleton. We use a Jacobian-based full-body IK solver based on [8]. This is the skeleton that is optimized throughout our kinematic initialization.

B.2 Optimization Variables
We optimize over the global 3D root translation p_root,t ∈ R^3 and the skeleton joint Euler angles θ_{j,t} ∈ R^3 with j = 1, ..., J, t = 1, ..., T. We also find the ground plane parameters n̂, p_floor ∈ R^3, which are the plane's normal vector and a point on the plane. As described below, we do not jointly optimize all of these at once; we proceed in stages and fit the floor separately.

B.3 Problem Formulation
We seek to minimize the objective function

E = α_proj E_proj + α_vel E_vel + α_ang E_ang + α_acc E_acc + α_data E_data + α_cont E_cont + α_floor E_floor,

where the α are constant weights; we use α_data = 0.3 and α_cont = α_floor = 10. We now detail each of these energy terms.

Suppose q̃_{j,t} ∈ R^3, j = 1, ..., J, t = 1, ..., T are the current root-relative joint position estimates during optimization, computed by forward kinematics on S_src with the current joint angles θ_{j,t}. Then our energy terms are defined as follows.

– 2D Re-projection Error: minimizes deviation of the joints from the corresponding OpenPose detections, weighted by detection confidence:

E_proj = Σ_{t=1}^{T} Σ_{j=1}^{J} σ_{j,t} ||Π(q̃_{j,t} + p_root,t) − x_{j,t}||^2,   (B1)

where Π is the perspective projection parameterized by focal length f (assumed to be 2000) and camera center [c_x, c_y].

– Velocity Smoothing: minimizes change in joint positions and angles over time:

E_vel = Σ_{t=1}^{T−1} Σ_{j=1}^{J} ||q̃_{j,t+1} − q̃_{j,t}||^2,   (B2)

E_ang = Σ_{t=1}^{T−1} Σ_{j=1}^{J} ||θ_{j,t+1} − θ_{j,t}||^2.   (B3)

– Linear Acceleration Smoothing: minimizes change in joint linear velocity over time:

E_acc = Σ_{t=1}^{T−2} Σ_{j=1}^{J} ||(q̃_{j,t+2} − q̃_{j,t+1}) − (q̃_{j,t+1} − q̃_{j,t})||^2.   (B4)

– 3D Data Error: minimizes deviation from the 3D MTC joint initialization q_{j,t}:

E_data = Σ_{t=1}^{T} Σ_{j=1}^{J} ||q̃_{j,t} − q_{j,t}||^2.   (B5)

– Contact Velocity Error: encourages foot joints (toes and heels) to be stationary when labeled as in contact:

E_cont = Σ_{t=1}^{T−1} Σ_{j∈J_F} ||c_{j,t}((q̃_{j,t+1} + p_root,t+1) − (q̃_{j,t} + p_root,t))||^2,   (B6)

where J_F is the set of foot joints.

– Contact Position Error: encourages toe and heel joints to be on the ground plane when labeled as in contact:

E_floor = Σ_{t=1}^{T} Σ_{j∈J_F} ||c_{j,t}(n̂ · (q̃_{j,t} + p_root,t − p_floor))||^2.   (B7)

B.4 Optimization Algorithm
We perform this optimization in three main stages. First, we enforce all objectives except the contact position error and solve only for the skeleton root position and joint angles (no floor parameters). Next, we use a robust Huber regression to find the floor plane that best matches the foot joint contact positions and to reject outliers, i.e., joints labeled as in contact when they are far from the ground. Outlier contacts are re-labeled as non-contacts for all subsequent processing. Finally, we repeat the full-body optimization, now enabling the contact position objective to ensure the feet are on the ground plane. We optimize using the Trust Region Reflective algorithm with analytical derivatives.
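The robust floor-fitting stage can be approximated with a Huber-loss least-squares plane fit. This is a sketch rather than the authors' implementation; the y-up convention, the `f_scale` value, and the 5 cm outlier threshold for re-labeling contacts are our own assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_floor(contact_points: np.ndarray, outlier_thresh: float = 0.05):
    """Robustly fit a ground plane to labeled foot-contact positions.

    contact_points : (N, 3) world positions (y-up assumed) of heel/toe
    joints on frames labeled as in contact. Fits height as an affine
    function of the horizontal coordinates, y = a*x + b*z + c, with a
    Huber loss so that mislabeled contacts act as outliers.

    Returns (normal, point, outlier_mask); outlier contacts would be
    re-labeled as non-contacts, as in the text.
    """
    xz = contact_points[:, [0, 2]]
    y = contact_points[:, 1]

    def residuals(params):
        a, b, c = params
        return xz[:, 0] * a + xz[:, 1] * b + c - y

    sol = least_squares(residuals, x0=np.zeros(3), loss="huber", f_scale=0.01)
    a, b, c = sol.x
    # Plane y = a*x + b*z + c  =>  normal proportional to (-a, 1, -b).
    normal = np.array([-a, 1.0, -b])
    normal /= np.linalg.norm(normal)
    point = np.array([0.0, c, 0.0])
    outliers = np.abs(residuals(sol.x)) > outlier_thresh
    return normal, point, outliers
```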
B.5 Extracting Inputs for Physics-Based Optimization
From the full-body output of this kinematic optimization, we extract the inputs for the physics-based optimization (Section 3.1). To get the COM targets r(t) ∈ R^3, we treat each body part as a point with a pre-defined mass [26]. This also allows calculation of the body-frame inertia tensor at each time step, I_b(t) ∈ R^{3x3}, which is used to enforce the dynamics constraints. Unless otherwise noted, we assume a body mass of 73 kg for the character. We use the orientation about the root joint as the COM orientation θ(t) ∈ R^3, and the foot joint positions p(t) are taken directly from the full-body motion.

C Physics-Based Optimization Details
Here we detail the physics-based trajectory optimization from Section 3.1.
C.1 Polynomial Parameterization
COM position and orientation, foot positions during flight, and contact forces during stance are parameterized by sequences of cubic polynomials, as in Winkler et al. [46]. These polynomials use a Hermite parameterization: we do not optimize over the polynomial coefficients directly, but rather over the duration, the starting and ending positions, and the boundary velocities.

The COM position and orientation use one polynomial every 0.1 s. Foot positions and forces always use at least 6 polynomials per phase, which we found necessary to accurately produce extremely dynamic motions. We adaptively add polynomials depending on the length of the phase: if a phase is longer than 2 s, additional polynomials are added commensurately. Foot positions during stance are a single constant value, and contact forces during flight are a constant zero; this ensures that the no-slip and no-force-during-flight constraints are met. Please see Winkler et al. [46] for a more in-depth discussion of the polynomial parameterization along with the contact phase duration parameterization.
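A cubic in Hermite form is fully determined by its duration, endpoint values, and endpoint velocities; a minimal scalar sketch of evaluating such a segment (function name is our own):

```python
def hermite_cubic(p0, p1, v0, v1, duration, t):
    """Evaluate a cubic Hermite segment at time t in [0, duration].

    The segment is parameterized by its endpoint values p0, p1 and
    endpoint velocities v0, v1 (the quantities one optimizes over
    instead of raw polynomial coefficients) and its duration.
    """
    s = t / duration  # normalized time in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    # The velocity basis terms are scaled by the duration so that the
    # time derivative at the endpoints matches v0 and v1 in real time.
    return h00 * p0 + h10 * duration * v0 + h01 * p1 + h11 * duration * v1
```

Because the endpoints and boundary velocities are shared between adjacent segments, continuity of position and velocity across a sequence of such polynomials holds by construction.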
C.2 Constraint Parameters
Though the optimization variables are continuous polynomials, objective energies and constraints are enforced at discrete intervals. Leg and foot kinematic constraints are enforced at 0.08 s intervals, the above-floor constraint at 0.1 s intervals, and dynamics constraints every 0.1 s. In practice, the velocity boundary constraints match the mean initial velocity over the first (last) 5 frames, making them more robust to noisy motion.

Objective terms, including smoothing, are enforced at every step for which we have input data. For example, the synthetic dataset at 30 fps provides an objective term at (1/30) s intervals.
C.3 Contact Timing Optimization
As explained in Section 3.1, our physics optimization proceeds in stages such that contact phase durations are not optimized until the very last stage. We found that optimizing these durations along with the dynamics does not always yield a better solution, as it is an inherently harder and less stable optimization. Therefore, in the presented results we use the better of the two solutions: either the solution using the fixed input contact timings (from our neural network), or the solution after subsequently allowing phase durations to change, if the motion is improved.
C.4 Full-Body Output
Following the physics-based optimization, we must compute a full-body motion from the physically-valid COM and foot joint positions using IK. For the upper body (including the root), we calculate the offset of each joint from the COM in the input motion to the physics optimization, and add this offset to the new optimal COM to obtain the joint targets for IK. The upper-body motion is therefore essentially identical to the result of the kinematic optimization (though the posture may improve due to the new COM position). For the lower body, we target the toe and heel joints directly to the physically optimized output and let the remaining joints (i.e., ankles, knees, and hips) follow from IK; these can differ drastically from the input. We use the same IK algorithm as in Appendix B.1.
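The upper-body targeting step can be sketched as follows; the function name and array shapes are our own assumptions:

```python
import numpy as np

def upper_body_ik_targets(kin_joints, kin_com, phys_com, upper_idx):
    """Upper-body IK targets after physics optimization.

    Keeps each upper-body joint's offset from the COM as it was in the
    kinematic input, and re-applies that offset to the physically
    optimized COM trajectory.

    kin_joints        : (T, J, 3) kinematic full-body joint positions.
    kin_com, phys_com : (T, 3) COM before / after physics optimization.
    upper_idx         : indices of the upper-body joints to target.
    """
    offsets = kin_joints[:, upper_idx] - kin_com[:, None]
    return phys_com[:, None] + offsets
```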
D Retargeting to a New Character
In many cases, we wish to retarget the estimated motion to a new animated character mesh; we do this for the qualitative evaluation in Section 4.2 of the main paper. One could apply physics-based motion retargeting methods, e.g., [37], to the output motion after an IK retargeting procedure. However, we avoid this extra step by directly performing our physics-based optimization on the target character skeleton.

Given a target skeleton S_tgt, we insert an additional retargeting step following the kinematic optimization (see Figure 2). Namely, we uniformly scale S_src to the approximate size of our target skeleton, and then perform an IK optimization based on a predefined joint mapping to recover joint angles for S_tgt. The subsequent physics-based optimization and full-body upgrade are then performed with this skeleton replacing S_src. We use the same IK algorithm as in Appendix B.1.

Note that for the qualitative comparison to MTC, we perform a very similar procedure: we first fit our skeleton to the raw MTC input, similar to Appendix B.1 but without the preprocessing, and then perform the IK retargeting described in this section. This provides a stronger baseline than a naive approach like directly copying joint angles from MTC to S_tgt.

Table E1. Precision, recall, and F1 score (Prec/Rec/F1) of estimating foot contacts from video. Left: comparison to various baselines; right: ablations using subsets of joints as features. Supplements Table 1.

Baseline           Synthetic               Real                     MLP               Synthetic               Real
Method             Prec / Rec / F1         Prec / Rec / F1          Input Joints      Prec / Rec / F1         Prec / Rec / F1
Random             0.679 / 0.516 / 0.586   0.627 / 0.487 / 0.548    Upper to hips     0.940 / 0.941 / 0.940   0.728 / 0.837 / 0.779
Always Contact     0.677 / 1.000 / 0.808   0.647 / 1.000 / 0.786    Upper to knees          / 0.946 / 0.952   0.926 / 0.859 / 0.892
2D Velocity        0.861 / 0.933 / 0.896   0.922 / 0.868 / 0.894    Lower to ankles   0.933 / 0.971 / 0.952         / 0.916 / 0.939
3D Velocity        0.858 / 0.876 / 0.867   0.920 / 0.884 / 0.902    Lower to hips     0.941 /       /                /       /
Here we include additional results and details for evaluations.
E.1 Contact Estimation
Table E2. Ablation study of input and output window sizes for learned contact estimation. Classification accuracy is shown for many different combinations.

Input Window    Prediction Window    Synthetic Accuracy    Real Accuracy

Main results for contact estimation are presented in Section 4.1. Table E1 supplements the main results with the precision, recall, and F1 score of each method. These give additional insight beyond accuracy, since the data labels are slightly imbalanced (more in-contact frames than no-contact).

Table E2 shows an ablation study over different input and output window size combinations for our network. Input Window is the number of frames of 2D lower-body joints given to the network, and Prediction Window is the number of frames for which the network outputs foot contact classifications. We use an input window of 9 and a prediction window of 5 in our experiments, since this achieves the best accuracy on real videos, as shown in the table. In general, there is no clear trend in Prediction Window size, but as the Input Window size increases, so does accuracy on the real dataset.

Figure E1 shows the accuracy of contact estimations over the entire 5-frame prediction window on the synthetic test set. Though the target frame in this case is frame index 2, predictions on the off-target frames degrade only slightly and are still very accurate since the input window is 9 frames. This motivates the use of the majority voting scheme at inference time.
Fig. E1.
Contact estimation classification accuracy for all frames in the 5-frame output window on the synthetic test set (given 9 frames as input). The center frame, index 2, is the target frame; off-target contact predictions are nonetheless still accurate.
E.2 Qualitative Motion Evaluation
For extensive qualitative evaluation, see the supplementary video. For real data, we use videos from publicly available datasets [6,36], YouTube videos that are licensed under Creative Commons or used with permission from the content creators, and licensed stock footage.
E.3 Quantitative Motion Evaluation
Primary quantitative results for motion reconstruction are presented in Section 4.3. These quantitative evaluations use the ground truth floor plane as input. Note that our method does not need the ground truth floor: our floor fitting procedure works well, as demonstrated in all qualitative results on live-action monocular videos. However, we performed quantitative evaluations on data that contains many cases of movement directly toward or away from the camera: a challenging case for MTC, which results in noisy foot joints as input to our method and causes a poor floor fit. This makes optimization difficult and interferes with evaluating our primary contributions.

For completeness, however, here we include quantitative results using the floor fitting procedure (rather than taking the ground truth floor as input) on the synthetic test set. Table E3 shows the kinematic and dynamics evaluations using the fitted floor, while Table E4 shows the pose evaluation. Trends are similar to those in Tables 2 and 3 using the ground truth floor.
E.4 Discussion on Global Pose Estimation
For the quantitative evaluations in Section 4.3, we do not compare to other methods in global human motion estimation, but instead evaluate ablations of our own method: the kinematics-only version and the initialization from MTC. The problem of predicting a temporally-consistent global motion (as MTC and this work do) is vastly under-explored, so there are few comparable prior works. Many methods perform traditional local 3D pose estimation or even predict the global root translation from the camera, but these rarely result in a coherent global motion.

For example, a recent work from Pavlakos et al. [34], SMPLify-X, estimates global camera extrinsics, local pose, and body shape, which gives global motion when applied to video. However, we found that MTC, which uses a temporal tracking procedure, gave better results, motivating its use to initialize our pipeline. Figure E2 shows a fixed side view of results from SMPLify-X and MTC on the same video clip: SMPLify-X is noisy and inconsistent, especially in terms of global translation, while MTC is much smoother and more coherent.

Table E3. Physical plausibility evaluation using the estimated floor on the synthetic test set. Supplements Table 2.

                    Dynamics (Contact forces)             Kinematics (Foot positions)
Method              Mean GRF   Max GRF    Ballistic GRF   Floating   Penetration   Skate
MTC [47]            142.7%     9036.7%    120.8%          19.1%      10.0%         16.5%
Kinematics (ours)   119.7%     1252.4%    103.6%
Physics (ours)

Table E4. Pose evaluation on the synthetic test set using the estimated floor. Supplements Table 3.

Method              Feet      Body      Body-Align 1
MTC [47]            585.303
Kinematics (ours)   582.400   565.097   281.416
Physics (ours)
E.5 Pose Estimation Evaluation Details
We quantitatively evaluate pose estimation in Section 4.3, on our synthetic test set and on HumanEva-I [41] walking sequences. Like many pose estimation benchmarks (e.g., Human3.6M [18]), few motions in HumanEva are dynamic with interesting foot contact patterns; we therefore evaluate on the subset of walking sequences, which meet this criterion.

For the MTC baseline, we measure accuracy directly on the regressed joints given as input to our method. For our method, we use the estimated joints after the full physics-based motion pipeline on our custom skeleton, which is initially fit from the MTC input as described in Appendix B.1.

For the synthetic test set, we measure joint errors with respect to a subset of the known character rig that includes 16 joints: neck, shoulders, elbows, wrists, hips, knees, ankles, and toes (no spine joints). The "Feet" column of Table 3 includes the ankle and toe joints only.

On the right side of Table 3, we evaluate methods on the walking sequences from the training split of HumanEva-I [41] (subjects 1, 2, and 3). Following prior work [35], we first split the walking sequences into contiguous chunks by removing corrupted motion capture frames, and then further split these chunks into sequences of roughly 120 frames (about 2 seconds) to use as input to our method. We extract the ground truth floor plane using the camera extrinsics from the dataset and use it as input to our method. Joint errors are measured with respect to an adapted 15-joint skeleton [35] from the HumanEva ground truth motion capture, which includes: root, head, neck, shoulders, elbows, wrists, hips, knees, and ankles. The "Feet" column of Table 3 includes the ankle joints only.

Fig. E2. A fixed side view of results from SMPLify-X [34] and Monocular Total Capture (MTC) [47]. SMPLify-X gives noisy and inconsistent global motion, whereas the tracking refinement of MTC gives smoother results.

Fig. E3. Our method can be applied to multiple characters with varying body masses and mass distributions. From left to right, the animated characters are Ybot (body mass 73 kg), Ty (36.5 kg), and Skeleton Zombie (146 kg).