Online Adaptation for Consistent Mesh Reconstruction in the Wild
Xueting Li, Sifei Liu, Shalini De Mello, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
University of California, Merced; NVIDIA; University of California, San Diego
Abstract
This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of 3D mesh, 2D keypoints, or camera pose for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video. We first learn a category-specific 3D reconstruction model from a collection of single-view images of the same category that jointly predicts the shape, texture, and camera pose of an image. Then, at inference time, we adapt the model to a test video over time using self-supervised regularization terms that exploit temporal consistency of an object instance to enforce that all reconstructed meshes share a common texture map, a base shape, as well as parts. We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild – an extremely challenging task rarely addressed before. Code and other resources will be maintained at https://sites.google.com/nvidia.com/vmr-2020.

1 Introduction

When we humans try to understand the object shown in Fig. 1(a), we instantly recognize it as a “duck”. We also instantly perceive and imagine its shape in the 3D world, its viewpoint, and its appearance from other views. Furthermore, when we see it in a video, its 3D structure and deformation become even more apparent to us. Our ability to perceive the 3D structure of objects contributes vitally to our rich understanding of them.

While 3D perception is easy for humans, 3D reconstruction of deformable objects remains a very challenging problem in computer vision, especially for objects in the wild. For learning-based algorithms, the key bottleneck is the lack of supervision. It is extremely challenging to collect 3D annotations such as 3D shape and camera pose [4, 18]. Consequently, existing research mostly focuses on limited domains (e.g., rigid objects [26], human bodies [15, 53] and faces [50]) for which 3D annotations can be captured in constrained environments. However, these approaches do not generalize well to non-rigid objects captured in naturalistic environments (e.g., animals). In non-rigid structure from motion methods [2, 30], the 3D structure can be partially recovered from correspondences between multiple viewpoints, which are also hard to label. Due to constrained environments and limited annotations, it is nearly impossible to generalize these approaches to the 3D reconstruction of non-rigid objects (e.g., animals) from images and videos captured in the wild.

Instead of relying on 3D supervision, weakly supervised or self-supervised approaches have been proposed for 3D mesh reconstruction. They use annotated 2D object keypoints [14], category-level templates [23, 22] or silhouettes [24]. However, scaling up learning with 2D annotations to hundreds of thousands of images is still non-trivial. This limits the generalization ability of current models to new domains. For example, a 3D reconstruction model trained on single-view images, e.g., [14], produces unstable and erratic predictions for video data. This is unsurprising, due to perturbations over time.
Preprint. Under review.
Figure 1: By utilizing the consistency of texture, shape, and object part correspondences in videos (red box) as self-supervision signals in (a), we learn a model that reconstructs temporally consistent meshes of deformable object instances in videos in (b).

However, the temporal signal in videos should give us an advantage rather than a disadvantage, as recently shown on the task of optimizing a 3D rigid object mesh w.r.t. a particular video [56, 26]. The question is, can we also take advantage of the redundancy in temporal sequences as a form of self-supervision in order to improve the reconstruction of dynamic non-rigid objects?

In this work, we address this problem with two important innovations. First, we strike a balance between model generalization and specialization. That is, we train an image-based network on a set of images, while at test time we adapt it online to an input video of a particular instance. Test-time training [40] is non-trivial since no labels are provided for the video. The key is to introduce self-supervised objectives that can continuously improve the model. To do so, we exploit the UV texture space, which provides a parameterization that is invariant to object deformation. We encourage the sampled texture, as well as a group of object parts, to be consistent among all the individual frames in the UV space, as shown in Fig. 1(a). Using this constraint of temporal consistency, the recovered shape and camera pose are stabilized considerably and are adapted to the current video.

One bottleneck of existing image-based 3D mesh reconstruction methods [14, 24] is that the predicted shapes are assumed to be symmetric. This assumption does not hold for most non-rigid animals, e.g., birds tilting their heads or walking horses. Our second innovation is to remove this assumption and to allow the reconstructed meshes to fit more complex, non-rigid poses via an as-rigid-as-possible (ARAP) constraint. As another constraint that does not require any labels, we enforce ARAP during test-time training as well, to substantially improve shape prediction. We use two image-based 3D reconstruction models for training: (i) a weakly supervised one (i.e., with object silhouettes and 2D keypoints provided), and (ii) a self-supervised one where only object silhouettes are available. The image-based models are then adapted to in-the-wild bird and zebra videos collected from the internet. We show that for both models, our innovations lead to an effective and robust approach to deformable, dynamic 3D reconstruction of non-rigid objects captured in the wild.
2 Related Work

3D object reconstruction from images.
A triangular mesh has long been used for object reconstruction [14, 18, 27, 16, 47, 31, 48]. It is a memory-efficient representation with vertices and faces, and is amenable to differentiable rendering techniques [18, 27]. The task of 3D reconstruction entails the simultaneous recovery of the 3D shape, texture, and camera pose of objects from 2D images. It is highly ill-posed due to the inherent ambiguity of correctly estimating both the shape and camera pose together. A major trend of recent works is to gradually reduce supervision from 3D vertices [4, 48, 47], shading [10], or multi-view images [52, 18, 49, 36, 26] and move towards weakly supervised methods that instead use 2D semantic keypoints [14] or a category-level 3D template [23]. This progress makes the reconstruction of objects, e.g., birds, captured in the wild possible. More recently, self-supervised methods [24, 50, 17] have been developed to further remove the need for annotations. Our method exploits different levels of supervision: weak supervision (i.e., using 2D semantic keypoints) and self-supervision to learn an image-based 3D reconstruction network from a collection of images of a category (Sec. 3.1).

Non-rigid structure from motion (NR-SFM).
NR-SFM aims to recover the pose and 3D structure of a non-rigid object, or an object deforming non-rigidly over time, solely from 2D landmarks without 3D supervision [2]. It is a highly ill-posed problem and needs to be regularized by additional shape priors [2, 57]. Recently, deep networks [21, 30] have been developed that serve as more powerful priors than the traditional approaches. However, obtaining reliable landmarks or correspondences for videos is still a bottleneck. Our method bears resemblance to deep NR-SFM [30], which jointly predicts camera pose and shape deformation. Differently from them, we reconstruct dense meshes instead of sparse keypoints, without requiring labeled correspondences from videos.
3D object reconstruction from videos.
Existing video-based object reconstruction methods mostly focus on specific domains, e.g., videos of faces [6, 41] or human bodies [42, 1, 5, 15, 53], where dense labelling is possible [43]. To augment video labels, [15] formulates dynamic human mesh reconstruction as an omni-supervision task, where a combination of labeled images and videos with pseudo-ground truth are used for training. For human video-based 3D pose estimation, [33] introduces semi-supervised learning to leverage unlabeled videos with a self-supervised component. Dealing with specific application domains, all the aforementioned works rely on predefined shape priors, such as a parametric body model (e.g., SMPL [28]) or a morphable face model. While our work also exploits unlabeled videos, we do not assume any predefined shape prior, which, practically, is hard to obtain for the majority of objects captured in the wild.
Optimization-based methods.
Optimization-based methods have also been extensively explored for scene or object reconstruction from videos. Several works [55, 46, 37, 34] first obtain a single-view 3D reconstruction and then optimize the mesh and skeletal parameters. Another line of methods instead optimizes the weights of deep models to produce more robust results for a video of a particular instance [42, 26, 29, 58]. Our method falls into this category. While [42] enforces consistency between observed 2D and a re-projection from 3D, [26, 29] take a further step and encourage consistency between frames via a network that inherently encodes an entire video into an invariant representation. In this work, instead of limiting ourselves to rigid objects as in [26], or depth estimation as in [29], we recover dynamic meshes from videos captured in the wild – a much more challenging problem that is rarely explored.
3 Method

Our goal is to recover coherent sequences of mesh shapes, texture maps, and camera poses from unlabeled videos, with a two-stage learning approach: (i) first, we learn a 3D mesh reconstruction model on a collection of single-view images of a category, as described in Sec. 3.1; (ii) at inference time, we adapt the model to fit the sequence via temporal consistency constraints, as described in Sec. 3.2. We focus on the weakly-supervised setting in Sec. 3.1 and 3.2, where both silhouettes and keypoints are annotated in the image dataset. We then describe how to generalize the approach to a self-supervised setting, where only silhouettes are available in the image dataset, in Sec. 3.3.
Notations.
We represent a textured mesh with |V| vertices (V ∈ R^{|V|×3}), |F| faces (F ∈ R^{|F|×3}), and a UV texture image (I_uv ∈ R^{H_uv × W_uv × 3}) of height H_uv and width W_uv. Similarly to [14], we use a weak perspective transformation to represent the camera pose θ of an input image. We denote R(·) as a general projection, which can represent (i) a differentiable renderer [27, 18] that renders a mesh to a 2D silhouette as R(V, θ), or a textured mesh to an RGB image as R(V, θ, I_uv) (we omit the mesh faces F for conciseness); or (ii) a projection of a 3D point v to the image space as R(v, θ). The Soft Rasterizer [27] is used as the differentiable renderer in this work.

3.1 Single-view Reconstruction

In the first stage, we train a network with a collection of category-specific images that jointly estimates the shape, texture, and camera pose of an input image. Similarly to [14], we predict a texture flow I_flow ∈ R^{H_uv × W_uv × 2} that maps pixels from the input image to the UV space. A predefined UV mapping function Φ [11, 14] is then used to map these pixels from the UV space to the mesh surface. With a differentiable renderer [27], we train the network with supervision from object silhouettes, texture, and the Laplacian objectives as in [14, 24]. More details of learning texture and camera pose can be found in [14]. An overview of our reconstruction framework is shown in Fig. 2(a).
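To make the general projection R(·) introduced above concrete, the following is a minimal sketch of a weak-perspective point projection. Decomposing the camera pose θ into an isotropic scale, a 2D image-plane translation, and a rotation matrix is our own assumption for illustration; the exact parameterization used in [14] may differ.

```python
import torch

def weak_perspective_project(points, scale, trans, rotmat):
    """Weak-perspective projection R(v, theta) of 3D points onto the image plane.

    points: (B, N, 3) 3D points in the object frame.
    scale:  (B, 1)    isotropic scale.
    trans:  (B, 2)    2D translation in the image plane.
    rotmat: (B, 3, 3) camera rotation matrices.
    Returns (B, N, 2) projected image coordinates.
    """
    rotated = torch.bmm(points, rotmat.transpose(1, 2))   # rotate into the camera frame
    return scale[:, None, :] * rotated[..., :2] + trans[:, None, :]
```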
Figure 2:
Overview. We show the single-view image reconstruction network on the left and the test-time training procedure to adapt it to a video on the right. Bold red arrows indicate the invariance constraints in Sec. 3.2.
Recovering asymmetric shapes.
We propose a novel shape reconstruction module as shown in Fig. 2(a). The key idea is to remove the symmetry requirement on object shapes, which is employed by many prior works [14, 24]. This is particularly important for recovering dynamic meshes in sequences, e.g., when a bird rotates its head as shown in Fig. 4, its mesh is no longer mirror-symmetric. Prior works [14, 24] model object shape by predicting vertex offsets from a jointly learned 3D template. Simply removing the symmetry assumption for the predicted vertex offsets leads to excessive freedom in shape deformation, e.g., see Fig. 4(f). To resolve this, we learn a group of N_b shape bases {V_i}_{i=1}^{N_b}, and replace the template by a weighted combination of them, denoted as the base shape V_base. Compared to a single mesh template, the base shape V_base is more powerful in capturing the object's identity and saves the model from predicting large motion deformation, e.g., deforming a standing bird template into a flying bird. The full shape reconstruction is obtained by:

V = V_base + ∆V,   V_base = Σ_{i=1}^{N_b} β_i V_i,   (1)

where ∆V encodes the object's asymmetric non-rigid motion and {β_i}_{i=1}^{N_b} are learned coefficients. The computation of our shape bases is inspired by parametric models [28, 59, 60], where the basis shapes are extracted from an existing mesh dataset [38] or toy scans [60]. However, we make our model completely free of 3D supervision and obtain the bases by applying K-Means clustering to all meshes reconstructed by CMR [14]. We use each cluster center as a basis shape in our model.
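The sketch below illustrates one way to obtain such bases and to evaluate Eq. (1): meshes reconstructed by a pretrained CMR model are clustered with K-Means, and the base shape is formed as a weighted sum of the cluster centers. The helper names and the use of scikit-learn are our own assumptions, not the authors' released implementation.

```python
import torch
from sklearn.cluster import KMeans

def build_shape_bases(cmr_vertices, n_bases=8):
    """Cluster meshes reconstructed by CMR into N_b shape bases {V_i}.

    cmr_vertices: (M, |V|, 3) numpy array of reconstructed vertex positions.
    Returns a (n_bases, |V|, 3) tensor of cluster centers used as the bases.
    """
    m, nv, _ = cmr_vertices.shape
    kmeans = KMeans(n_clusters=n_bases).fit(cmr_vertices.reshape(m, -1))
    return torch.from_numpy(kmeans.cluster_centers_).float().view(n_bases, nv, 3)

def compose_shape(bases, betas, delta_v):
    """Eq. (1): V = V_base + dV, with V_base a weighted combination of the bases.

    bases:   (n_bases, |V|, 3) shape bases.
    betas:   (B, n_bases)      predicted combination coefficients.
    delta_v: (B, |V|, 3)       predicted asymmetric motion deformation.
    Returns the full shape V and the base shape V_base.
    """
    v_base = torch.einsum('bk,kvc->bvc', betas, bases)
    return v_base + delta_v, v_base
```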
Figure 3: 3D canonical keypoints computation: (a) annotated 2D keypoints and their location heatmaps; (b) keypoint heatmaps mapped to the UV space using learned texture flows; (c) aggregated canonical keypoint heatmaps in the UV space; (d) canonical keypoints on different instance mesh surfaces. Φ(·) is the UV mapping function discussed in Sec. 3.1.

Keypoint re-projection.
In the weakly-supervised setting, 2D keypoints are provided that semantically associate different instances. When projected onto the mesh surface, the same semantic keypoint (e.g., the tail keypoint in the orange circle in Fig. 3(a)) of different object instances should be matched to the same face on the mesh (the tail keypoint in the orange circle in Fig. 3(d)). To model the mapping between the 3D mesh surface and the 2D keypoints, prior work [14] learns an affinity matrix that describes the probability of each 2D keypoint mapping to each vertex on the mesh. The affinity matrix is shared among all instances and is independent of individual shape variations. However, this approach is sub-optimal because: (i) mesh vertices are a subset of discrete points on a continuous mesh surface, so their weighted combination defined by the affinity matrix may not lie on the surface, leading to inaccurate mappings of 2D keypoints; (ii) the mapping from the image space to the mesh surface described by the affinity matrix is, in our case, already modeled by the texture flow, so it is potentially redundant to learn both of them independently.

In this work, we re-utilize the texture flow to map 2D keypoints from each image to the mesh surface. We first map each 2D keypoint to the UV space, which is independent of shape deformation (Fig. 3(b)). Ideally, each semantic keypoint from different instances should map to the same point in the UV space as discussed above. In practice, this does not hold due to inaccurate texture flow predictions. To accurately map each keypoint to the UV space, we compute a canonical keypoint UV map as shown in Fig. 3(c) by: (i) mapping the keypoint heatmap in Fig. 3(a) for each instance to the UV space via its predicted texture flow, and (ii) aggregating these keypoint UV maps in Fig. 3(b) across all instances to eliminate outliers caused by incorrect texture flow predictions.

We further utilize the pre-defined UV mapping function Φ discussed above to map each semantic keypoint from the UV space to the mesh surface. Given the 3D correspondence (denoted as K_i^{3D}) of each 2D semantic keypoint K_i^{2D}, the keypoint re-projection loss enforces the projection of the former to be consistent with the latter:

L_kp = (1 / N_k) Σ_{i=1}^{N_k} || R(K_i^{3D}, θ) − K_i^{2D} ||,   (2)

where N_k is the number of keypoints.
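A compact sketch of the re-projection loss in Eq. (2) is given below. It assumes the 3D keypoints have already been lifted onto the mesh surface via the canonical keypoint UV map, and it reuses the illustrative weak_perspective_project helper from above as a stand-in for R(·).

```python
def keypoint_reprojection_loss(kp3d, kp2d, scale, trans, rotmat):
    """Eq. (2): penalize the distance between projected 3D keypoints and 2D annotations.

    kp3d: (B, N_k, 3) 3D keypoints on the mesh surface (from the canonical UV map).
    kp2d: (B, N_k, 2) annotated 2D keypoints.
    """
    proj = weak_perspective_project(kp3d, scale, trans, rotmat)  # (B, N_k, 2)
    return (proj - kp2d).norm(dim=-1).mean()
```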
As-rigid-as-possible (ARAP) constraint.

Without any pose-related regularization, the predicted motion deformation ∆V often leads to erroneous random deformations and spikes, as shown in Fig. 4(f), which do not faithfully describe the motion of a non-rigid object. Therefore, we introduce an as-rigid-as-possible (ARAP) constraint [39, 8] to encourage rigidity of local transformations and the preservation of the local mesh structure. Instead of solving the optimization in [39, 8], we reformulate it as an objective that encourages the predicted shape V to be a locally rigid transformation of the predicted base shape V_base:

L_arap(V_base, V) = Σ_{i=1}^{|V|} Σ_{j ∈ N(i)} w_ij || (V_i − V_j) − R_i (V_i^base − V_j^base) ||²,   (3)

where N(i) denotes the neighboring vertices of a vertex i, and w_ij and R_i are the cotangent weight and the best approximating rotation matrix, respectively, as described in [39].

3.2 Online Adaptation

Applying the image-based model developed in Sec. 3.1 independently to each frame of an unseen video usually results in inconsistent mesh reconstructions (Fig. 5(a)), mainly due to differences in video quality, lighting conditions, etc. In this section, we propose to perform online adaptation to fit the model to an individual test video, which contains a single object instance that moves over time. Inspired by the keypoint re-projection constraint described in Sec. 3.1, we resort to the UV space, where the (i) RGB texture and (ii) object parts of an instance should be constant when mapped from 2D via the predicted texture flow, and invariant to shape deformation. By enforcing the predicted values for (i) and (ii) to be consistent in the UV space across different frames, the adapted network is regularized to generate coherent reconstructions over time. In the following, we describe how to exploit these temporal invariances as self-supervisory signals to tune the model.
Part correspondence constraint.
We propose a part correspondence constraint that utilizes corresponding parts across video frames to facilitate camera pose learning. The idea bears resemblance to NR-SFM methods [30, 21], but in contrast, we do not know the ground truth correspondences between frames. Instead, we resort to an unsupervised video correspondence (UVC) method [25]. The UVC model learns an affinity matrix that captures pixel-level correspondences among video frames. It can be used to propagate any annotation (e.g., segmentation labels, keypoints, part labels, etc.) from an annotated keyframe to the other unannotated frames. In this work, we generate part correspondences within a clip: we "paint" a group of random parts on the object, e.g., the vertical stripes in Fig. 2(f), on the first frame and propagate them to the rest of the video using the UVC model.

Given the propagated part correspondences in all the frames, we map them to the UV space via the texture flow, similar to our approach for the canonical keypoint map (Sec. 3.1). We then average all part UV maps to obtain a video-level part UV map ("UV parts" in Fig. 1(a)) for the object depicted in the video. We map the part UV map to each individually reconstructed mesh, and render it via the predicted camera pose of each frame (see Fig. 1(a), bottom). Finally, we penalize the discrepancy between the parts rendered back to the 2D space and the propagated part maps, for each frame. As the propagated part maps are usually temporally smooth and continuous, this loss implicitly regularizes the network to predict coherent camera pose and shape over time. In practice, instead of minimizing the discrepancy between the rendered part map and the propagated part map of a frame, we found that it is more robust to penalize the geometric distance between the projections of the vertices assigned to each part and 2D points sampled from the corresponding part:

L_c = Σ_{j=1}^{N_f} Σ_{i=1}^{N_p} |V_i^j| Chamfer( R(V_i^j, θ^j), Y_i^j ),   (4)

where N_f is the number of frames in the video, N_p = 6 is the number of parts, and V_i^j denotes the vertices assigned to part i in frame j. We use the Chamfer distance because the vertex projections R(V_i^j, θ^j) do not strictly correspond one-to-one to the sampled 2D points Y_i^j.
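The following sketches the Chamfer-based part loss of Eq. (4) for a single frame. The per-part weighting, the helper names, and the reuse of the earlier weak_perspective_project stand-in are our own reading of the equation and should be treated as assumptions.

```python
import torch

def chamfer_2d(a, b):
    """Symmetric Chamfer distance between two 2D point sets a: (N, 2) and b: (M, 2)."""
    d = torch.cdist(a, b)                                   # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def part_correspondence_loss(verts, part_ids, part_points, scale, trans, rotmat):
    """Eq. (4) for one frame: match projected part vertices to propagated part pixels.

    verts:       (|V|, 3) predicted vertices of the frame.
    part_ids:    (|V|,)   part index of each vertex, from the video-level part UV map.
    part_points: list of (M_i, 2) 2D points sampled from the propagated part maps.
    """
    proj = weak_perspective_project(verts[None], scale, trans, rotmat)[0]  # (|V|, 2)
    loss = 0.0
    for i, pts in enumerate(part_points):
        sel = proj[part_ids == i]
        if len(sel) > 0 and len(pts) > 0:
            # Weight each part by the number of assigned vertices, following the
            # |V_i^j| factor in Eq. (4); other normalizations are possible.
            loss = loss + len(sel) * chamfer_2d(sel, pts)
    return loss
```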
Texture invariance constraint.

Based on the observation that object texture mapped to the UV space should be invariant to shape deformation and stay constant over time, we propose a texture invariance constraint to encourage consistent texture reconstruction from all frames. However, naively aggregating the UV texture maps from all the frames, via a scheme similar to the one described for keypoints and parts, leads to a blurry video-level texture map. We instead enforce texture consistency between random pairs of frames via a swap loss. Given two randomly sampled frames I^i and I^j, we swap their texture maps I_uv^i and I_uv^j, and combine them with the original mesh reconstructions V^i and V^j:

L_t = dist( R(V^i, θ^i, I_uv^j) ⊙ S^i, I^i ⊙ S^i ) + dist( R(V^j, θ^j, I_uv^i) ⊙ S^j, I^j ⊙ S^j ),   (5)

where S^i and S^j are the silhouettes of frames i and j, respectively, and dist(·, ·) is the perceptual metric used in [54, 14, 24].
Base shape invariance constraint.

As discussed in Sec. 3.1, our shape model is represented by a base shape V_base and a deformation term ∆V, in which the base shape V_base intuitively corresponds to the "identity" of the instance, e.g., a duck or a flying bird. During online adaptation, we enforce the network to predict a consistent V_base to preserve this identity, via a swapping loss:

L_s = niou( R(V_base^j + ∆V^i, θ^i), S^i ) + niou( R(V_base^i + ∆V^j, θ^j), S^j ),   (6)

where V_base^i and V_base^j are the base shapes for frames i and j; ∆V^i and ∆V^j are the motion deformations for frames i and j; and niou(·, ·) denotes the negative intersection over union (IoU) objective [16, 24]. All other notations are defined in Eq. 5.
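Both swap losses can be sketched for one random frame pair as below. The renderer, perceptual metric, and negative-IoU callables are passed in as stand-ins for R(·), dist(·, ·), and niou(·, ·); the dictionary keys are hypothetical names for the per-frame predictions.

```python
def swap_losses(fi, fj, render_rgb, render_sil, perceptual, neg_iou):
    """Texture swap loss (Eq. 5) and base-shape swap loss (Eq. 6) for frames i and j.

    Each frame dict holds 'v_base', 'delta_v', 'cam', 'uv_tex', the observed
    image 'img', and the silhouette 'mask'.
    """
    vi = fi['v_base'] + fi['delta_v']
    vj = fj['v_base'] + fj['delta_v']

    # Eq. (5): render each mesh with the *other* frame's UV texture map.
    l_t = perceptual(render_rgb(vi, fi['cam'], fj['uv_tex']) * fi['mask'],
                     fi['img'] * fi['mask']) \
        + perceptual(render_rgb(vj, fj['cam'], fi['uv_tex']) * fj['mask'],
                     fj['img'] * fj['mask'])

    # Eq. (6): swap base shapes, keep each frame's own deformation and camera.
    l_s = neg_iou(render_sil(fj['v_base'] + fi['delta_v'], fi['cam']), fi['mask']) \
        + neg_iou(render_sil(fi['v_base'] + fj['delta_v'], fj['cam']), fj['mask'])
    return l_t, l_s
```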
As-rigid-as-possible (ARAP) constraint.

Besides the consistency constraints, we keep the ARAP objective of Sec. 3.1 during online adaptation, since it also does not require any form of supervision. We found that the ARAP constraint clearly improves the qualitative results, as visualized for the online adaptation procedure in Fig. 6.
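A simplified sketch of the ARAP objective in Eq. (3) is shown below; it assumes precomputed one-ring neighbor indices and cotangent weights, estimates each best-fit rotation R_i in closed form (Procrustes, without reflection handling), and treats R_i as a constant during back-propagation. This is our own simplification for illustration rather than the authors' implementation.

```python
import torch

def arap_loss(v, v_base, neighbors, weights):
    """Eq. (3): as-rigid-as-possible energy between the full shape V and V_base.

    v, v_base: (|V|, 3) deformed and base vertex positions.
    neighbors: (|V|, K) indices of one-ring neighbors (padded with the vertex itself).
    weights:   (|V|, K) cotangent weights w_ij (zero on padded entries).
    """
    e = v[:, None, :] - v[neighbors]                  # deformed edge vectors (|V|, K, 3)
    e_base = v_base[:, None, :] - v_base[neighbors]   # base edge vectors
    # Best-fit rotation per vertex from the weighted edge covariance (Kabsch/Procrustes).
    cov = torch.einsum('vk,vkc,vkd->vcd', weights, e_base, e)
    u, _, vh = torch.linalg.svd(cov)
    rot = (vh.transpose(1, 2) @ u.transpose(1, 2)).detach()  # R_i treated as constant
    rigid = torch.einsum('vcd,vkd->vkc', rot, e_base)        # R_i (V_i^base - V_j^base)
    return (weights * ((e - rigid) ** 2).sum(-1)).sum()
```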
Online adaptation.
During inference, we fine-tune the model on a particular video with the invariance constraints discussed above, along with a silhouette objective, a texture objective, a Laplacian term as in [14, 24], and the ARAP constraint discussed in Sec. 3.1. The foreground masks used for the silhouette and texture objectives are obtained by a segmentation model [3] trained with the ground truth foreground masks available for the image collection. More details of the objectives used for online adaptation can be found in the supplementary material.

To obtain accurate propagation of object parts by the UVC [25] model, we employ two strategies. First, we fine-tune all parameters of the reconstruction model on sliding windows instead of on all video frames. Each sliding window includes N_w = 50 consecutive frames and the sliding stride is set to N_s = 10. We tune the reconstruction model for N_t = 40 iterations with the frames in each sliding window. Second, instead of "painting" random parts onto the first frame and propagating them sequentially to the rest of the frames in a window, we "paint" random parts onto the middle frame of the window and propagate the parts backward to the first frame as well as forward to the last frame in the window. This strategy improves the propagation quality by halving the propagation range relative to the window size. A sketch of this procedure is given below.
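The sliding-window procedure can be summarized as the following pseudo-driver. All callables (painting random parts, the UVC propagation model [25], and one optimization step over a window) are hypothetical stand-ins passed in as arguments; this is a sketch under those assumptions, not the released code.

```python
def online_adapt(model, frames, paint_random_parts, uvc_propagate, adapt_step,
                 win=50, stride=10, iters=40):
    """Test-time tuning over sliding windows of a video (Sec. 3.2)."""
    for start in range(0, max(len(frames) - win, 0) + 1, stride):
        window = frames[start:start + win]
        mid = len(window) // 2
        parts_mid = paint_random_parts(window[mid])
        # Propagate from the middle frame backward and forward, which halves the
        # propagation range compared to propagating from the first frame.
        backward = uvc_propagate(parts_mid, window[:mid][::-1])[::-1]
        forward = uvc_propagate(parts_mid, window[mid + 1:])
        parts = backward + [parts_mid] + forward
        for _ in range(iters):
            adapt_step(model, window, parts)
```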
Figure 4: Mesh reconstructions from single-view images. All meshes are visualized from the predicted camera pose except for (d) and (e), where the reconstructions in (c) are visualized from two extra views. Meshes in (f) are reconstructed by a model trained without the ARAP constraint.
3.3 Self-supervised Setting

Our model can also be easily generalized to a self-supervised setting in which keypoints are not provided for the image datasets. In this setting, the template prior as well as the camera poses computed from keypoints in the CMR method [14] are no longer available. The self-supervised setting is trained differently from the weakly-supervised one in the following ways: (i) the first stage still assumes shape symmetry to ensure stability when training without keypoints; (ii) it learns a single template from scratch via the progressive training in [24]; (iii) we train this model without the keypoint re-projection and ARAP constraints of Sec. 3.1; (iv) without the shape bases, the base shape invariance constraint is removed from the online adaptation procedure. All other structure and training settings of the self-supervised model are the same as in the weakly-supervised model discussed in Sec. 3.1 and 3.2.
4 Experiments

We conduct experiments on animals, i.e., birds and zebras. We evaluate our contributions in two aspects: (i) the improvement of single-view mesh reconstruction, and (ii) the reconstruction of a sequence of frames via online adaptation. Due to the lack of ground truth meshes for images and videos captured in the wild, we evaluate the reconstruction results via mask and keypoint re-projection accuracy; e.g., we follow, and compare against, [14] to evaluate the model trained on the image dataset. We also describe a new bird video dataset that we curate and on which we evaluate the test-time tuned model. We focus on evaluations of the bird category in the paper and leave evaluations of the zebra category to the supplementary material.
Datasets. We first train the image reconstruction models discussed in Sec. 3.1 on the CUB bird [45] and the synthetic zebra [59] datasets. For test-time adaptation on videos, we collect a new bird video dataset for quantitative evaluation. Specifically, we collect 19 slow-motion, high-resolution bird videos from YouTube, and 3 bird videos from the DAVIS dataset [19]. For each slow-motion video collected from the Internet, we apply a segmentation model [3] trained on the CUB bird dataset [45] to obtain its foreground segmentation for online adaptation.
Evaluation metrics.
We evaluate the image-based model on the test split of the CUB dataset. Note that for keypoint re-projection, instead of using the keypoint assignment matrix in [14], we apply the canonical keypoint UV map to obtain the 3D keypoints (Sec. 3.1). For the video dataset, we annotate frame-level object masks and keypoints via a semi-automatic procedure. We train a segmentation model and a keypoint detector [7] on the CUB dataset. Then, we manually adjust and filter out inaccurate predictions to ensure the correctness of the ground-truth labels. To evaluate the accuracy of mask re-projection, we compute the Jaccard index J (IoU) and the contour-based accuracy F proposed in [35], between the rendered masks and the ground truth silhouettes of all annotated frames. Evaluations on keypoint re-projection can be found in the supplementary material.

In addition, to further quantitatively evaluate shape reconstruction quality, we animate a synthetic 3D bird model and create a video with 520 frames in various poses such as flying, landing, and walking, as shown in Fig. 7. We then compare the predicted mesh with the ground truth mesh using the Chamfer distance every 10 frames.
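Generic implementations of the mask IoU (the Jaccard index J) and of the symmetric Chamfer distance used for the synthetic sequence are sketched below; these are standard definitions and not necessarily the exact evaluation scripts used by the authors.

```python
import torch

def mask_iou(pred, gt):
    """Jaccard index J between two binary masks of shape (H, W)."""
    pred, gt = pred.bool(), gt.bool()
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    return inter / union.clamp(min=1.0)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two 3D point sets a: (N, 3) and b: (M, 3)."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```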
Figure 5: Mesh reconstruction from video frames. (a) Input video frames. (b) Reconstruction from the model trained only on single-view images. (c) Reconstruction from the model test-time trained on the video without the invariance constraints in Sec. 3.2. (d) Reconstruction from the proposed video reconstruction model.
Network architecture.
For fair comparison to the baseline method [14], we train our model using the same network as [14], i.e., a ResNet-18 [9] encoder with batch normalization layers [13]. We call this model "ACMR" in the following, which is short for "asymmetric CMR". However, ACMR cannot be adapted well to test videos due to the batch normalization layers and the domain gap between images and videos (see Table 2(d)). Thus, we train a variant model in which we use ResNet-50 [9] as the encoder and replace all batch normalization layers in the network with group normalization layers [51]. We call this variant "ACMR-vid". All test-time training is carried out on the ACMR-vid model unless otherwise specified. A minimal sketch of this encoder change is shown below.
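The sketch assumes 32 groups for group normalization (a common choice, not stated in the text) and builds the backbone from scratch rather than from ImageNet-pretrained batch-norm weights.

```python
import torch.nn as nn
import torchvision

def make_acmr_vid_encoder(num_groups=32):
    """ResNet-50 backbone with every batch normalization layer replaced by group norm."""
    norm_layer = lambda channels: nn.GroupNorm(num_groups, channels)
    return torchvision.models.resnet50(norm_layer=norm_layer)
```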
Network training.
Taking the weakly-supervised setting as an example, to train the image reconstruction model, we first warm up the model without the motion deformation branch, the keypoint re-projection objective, or the ARAP constraint for a number of warm-up epochs. This warm-up effectively avoids the trivial solution in which the model depends solely on the motion deformation branch for shape deformation while ignoring the base shape branch. We then train the full image reconstruction network with all objectives for additional epochs. Other training details, including those of the self-supervised setting, and an illustration of our network architecture can be found in the supplementary material.
In Fig. 4, we show visualizations of reconstructed bird meshes from single-view images. Thanks to the motion deformation branch discussed in Sec. 3.1, the proposed ACMR model is able to capture asymmetric motion of the bird, such as head rotation (Fig. 4(c)), which cannot be modeled by the baseline method [14] (Fig. 4(g)).
Figure 6:
Qualitative comparison of online adaptation with/without the ARAP constraint.
Online adaptation on a video.
We visualize the meshes reconstructed by our ACMR-vid model for video frames in Fig. 5. Without online adaptation, the ACMR-vid model applied independently to each frame suffers from the domain gap and shows instability over time (Fig. 5(b)). With online adaptation as discussed in Sec. 3.2, the ACMR-vid model reconstructs plausible meshes for each video frame, as shown in Fig. 5(c) and (d). To demonstrate the effectiveness of the proposed invariance constraints, we also show reconstructions of an ACMR-vid model test-time trained without any of the invariance constraints in Fig. 5(c), which predicts less reliable shape, camera pose, and texture compared to our full ACMR-vid model. Finally, we visualize the effectiveness of ARAP for online adaptation in Fig. 6. Without this constraint, the reconstructed meshes are less plausible, especially from unobserved views. More video examples can be found in the supplementary material.

4.3 Quantitative Results
Table 1:
Quantitative evaluation of mask IoU and keypoint re-projection ([email protected]) on the CUB dataset [45]. Columns: (a) Metric, (b) CMR [14], (c) ACMR, (d) ACMR, no ∆V, (e) ACMR, no ARAP, (f) ACMR-vid. Rows: Mask IoU ↑ and [email protected] ↑.

Evaluations on the image dataset.
As shown in Table 1(b) vs. (c), our ACMR model achieves comparable mask IoU and higher keypoint re-projection accuracy compared to the baseline model [14] with the same network architecture. This confirms the correctness of both the reconstructed meshes and the predicted camera poses. In addition, our ACMR-vid model achieves even better performance, as shown in Table 1(f). We note that our full ACMR model does not quantitatively outperform the model trained without the ARAP constraint, because without any regularization the motion deformation ∆V freely over-fits to the mask and keypoint supervision. However, the model without the ARAP constraint visually shows spikes and unnatural deformations, as shown in Fig. 4(f) and in the supplementary material.

Table 2: Quantitative evaluation of mask re-projection accuracy on the bird video dataset. "(T)" indicates that the model is test-time trained on the given video; L_c, L_t, L_s are defined in Eq. 4, 5, and 6, respectively. Columns: (a) Metric, (b) CMR [14], (c) ACMR, (d) ACMR (T), (e) ACMR-vid (T), no L_c, L_t, L_s, (f) ACMR-vid (T). Rows: J (Mean) ↑ and F (Mean) ↑.
Figure 7:
Reconstructions of an animated video clip.
Evaluations on the video dataset.
As shown in Table 2(b) and (c) vs. (e), by using the proposed online adaptation method discussed in Sec. 3.2, the model tuned on videos achieves higher J and F scores than the model trained only on images. This indicates that the test-time trained model successfully adapts to unlabeled videos and can reconstruct meshes that conform well to the frames. The performance of the model is further improved by adding the correspondence, texture, and shape invariance constraints discussed in Sec. 3.2 during online adaptation, as shown in Table 2(f).

Evaluation on animated sequences.
We apply the proposed model to an animated video clip and compare the predicted mesh with the ground truth mesh using the Chamfer distance every 10 frames. We show the qualitative reconstructions in Fig. 7 and the quantitative evaluation results in Table 3. The proposed ACMR method outperforms the baseline CMR [14] model and is further improved via the proposed online adaptation strategy discussed in Sec. 3.2.
Table 3: Evaluation on synthetic data. Columns: Metric, CMR, ACMR, ACMR (T); row: Chamfer distance ↓.
Evaluations for the self-supervised setting (Sec. 3.3).
After online adaptation, this model too achieves both a higher J score (0.843) and a higher F score (0.678) than the corresponding image-only model.

5 Conclusion

We propose a method to reconstruct temporally consistent 3D meshes of deformable objects from videos captured in the wild. We learn a category-specific 3D mesh reconstruction model that jointly predicts the shape, texture, and camera pose from single-view images, and that is capable of capturing asymmetric non-rigid motion deformation of objects. We then adapt this model to any unlabeled video by exploiting self-supervised signals in videos, including those of shape, texture, and part consistency. Experimental results demonstrate the superiority of the proposed method compared to state-of-the-art works, both qualitatively and quantitatively.

Broader Impact
The developed method can make significant contributions to both 3D vision and endangered species research. It provides a way to study animals that can only be captured in the wild as 2D videos, e.g., endangered species of birds and zebras. The broader impact includes enhancing our understanding of such endangered animals simply from videos, as they can be reconstructed and viewed in 3D. The method can also be applied to tasks such as bird watching, motion analysis, and shape analysis, to name a few. Furthermore, another important application is to simplify an artist's workflow, as an initial animated and textured 3D shape can be directly derived from a video.

Appendix
In the Appendix, we provide additional details, discussions, and experiments to support the original submission. In the following, we first discuss our self-supervised setting in Sec. 6. We then describe the evaluation of keypoint re-projection accuracy on videos in Sec. 7. Next, we show more ablation studies in Sec. 8. More qualitative results on both bird and zebra image reconstructions are presented in Sec. 9. Details of the network design and implementation are discussed in Sec. 10 and Sec. 11, respectively. Finally, we describe failure cases and limitations in Sec. 12.
6 Self-supervised Setting

We train the self-supervised image reconstruction model with only silhouettes and single-view images of a category. To this end, we also learn a template from scratch as in the self-supervised 3D mesh reconstruction approach [24]. Essentially, we do not apply the semantic parts from the SCOPS method [12], which means that no additional modules are required for training. After training the image model, we adapt it to each unlabeled video using the method discussed in Sec. 3.2 of the submission. Since neither keypoint annotations nor additional self-supervised blocks such as SCOPS are adopted, the results of this self-supervised single-view image reconstruction model do not outperform those of existing methods, i.e., [14] and [24], or of the proposed ACMR model (see Sec. 4.3 for the quantitative results). However, the test-time training improves the fidelity and the robustness of the reconstruction results, as shown in Fig. 8. The reconstructions are more plausible after online adaptation, especially from unobserved views.
Figure 8:
Comparison of the self-supervised image model and the test-time tuned self-supervised model.
Figure 9:
Keypoint annotation using the Labelme [44] toolkit.
7 Keypoint Re-projection Evaluation on Videos

Video keypoints annotation.
We evaluate the keypoint re-projection accuracy (see Sec. 4.1 in the paper) on the 22 videos we collected. To create the ground truth keypoints, we follow the protocol of the CUB dataset [45] and annotate 15 semantic keypoints every five frames in each video via the Labelme [44] toolkit (see Fig. 9 for the annotation interface). The re-projected keypoints of different methods are visualized and compared in Fig. 10.
Figure 10: Visualization of re-projected keypoints on videos. We use white circles to highlight the keypoints for better visualization.
Table 4:
Keypoint re-projection evaluation on videos. "(T)" indicates that the model is test-time trained on the given video; L_c, L_t, L_s are defined in Eq. 4, 5, and 6 of the main paper, respectively. Columns: (a) Metric, (b) CMR [14], (c) ACMR, (d) ACMR (T), (e) ACMR-vid (T), no L_c, L_t, L_s, (f) ACMR-vid (T). Row: PCK@0.1 ↑.

Details of Keypoint Re-projection.
Since we do not have the keypoint assignment matrix proposed in [14], we employ the canonical keypoint UV map to obtain the 3D keypoints (Sec. 3.1 in the paper). The keypoint re-projection is done by (i) warping the canonical keypoint UV map to each individual predicted mesh surface, (ii) projecting the canonical keypoints back to the 2D space via the predicted camera pose, and (iii) comparing against the ground truth keypoints in 2D. This evaluation implicitly reveals the correctness of both the predicted shape and the camera pose of a mesh reconstruction algorithm, especially for objects that do not have 3D ground truth annotations.

Compared to frame-wise application of CMR [14] (Table 4(b)) or ACMR (Table 4(c)), discussed in Sec. 3.1 of the main paper, the test-time tuned model achieves a higher PCK score, as shown in Table 4(f). This verifies the effectiveness of the proposed test-time training procedure and the invariance constraints. Essentially, as noted in Sec. 4.1, although the original ACMR, i.e., using a ResNet-18 [9] image encoder with batch normalization layers [13] (Table 4(c)), achieves relatively promising results, it is hard to adapt this model to new domains such as low-quality videos (e.g., when switching from the .eval() mode to the .train() mode in PyTorch [32]). Its performance drops significantly after test-time tuning, as shown in Table 4(d).
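The PCK metric reported in Table 4 can be sketched as below; normalizing the error threshold by the image size is our assumption (bounding-box normalization is also common), as is the visibility handling.

```python
def pck(pred_kp, gt_kp, visible, alpha=0.1, img_size=256):
    """PCK@alpha: fraction of visible keypoints re-projected within alpha * img_size pixels.

    pred_kp, gt_kp: (N_k, 2) predicted and ground-truth 2D keypoints.
    visible:        (N_k,)   boolean visibility flags.
    """
    err = (pred_kp - gt_kp).norm(dim=-1)
    correct = (err < alpha * img_size) & visible
    return correct.sum().float() / visible.sum().clamp(min=1)
```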
8 More Ablation Studies

Table 5: Quantitative comparison of the single-base model with the proposed ACMR model. Columns: (a) Metric, (b) Single base, (c) ACMR. Rows: Mask IoU ↑, … ↑.

To demonstrate the superiority of using a set of shape bases versus a single template, we train a baseline model in which we replace the shape combination branch with the template obtained by the CMR approach [14]. This setting is equivalent to using a single shape basis (denoted as single base). We show quantitative and qualitative comparisons with the proposed ACMR model in Table 5 and Fig. 11, respectively. As shown in Table 5(b) vs. (c), the model trained with a single base template struggles to fit the final shape when the instance is largely different from the given template (also shown in Fig. 11). In contrast, the proposed ACMR model with 8 shape bases performs favorably against the single-base model.
Figure 11:
Qualitative comparison of the single-base model and the proposed ACMR model with multiple shape bases. The single-base model suffers when the instance is largely different from the template, e.g., a flying bird or a duck.
Table 6:
Ablation study on the ARAP constraint in online adaptation. Columns: (a) Metric, (b) without ARAP, (c) ACMR-vid (T). Rows: J (Mean) ↑ and F (Mean) ↑.

To verify the effectiveness of the ARAP constraint in the online adaptation process, we test-time tune on the videos without this constraint. Although performing online adaptation without the ARAP constraint yields better quantitative evaluations, as shown in Table 6, the reconstructed meshes are not plausible from unobserved views, as shown in Fig. 6 of the submission.
To visually demonstrate the effectiveness of the test-time training procedure in stabilizing camera pose prediction, we visualize the differences in camera pose predictions between adjacent frames in Fig. 13. Compared to the model that is only trained on images, the proposed method predicts more stable camera poses that change smoothly over time.
We visualize re-projected keypoints on test images in Fig. 14; the corresponding quantitative results are presented in Sec. 4.3, Table 1 of the main paper. The proposed ACMR model predicts more accurate keypoints than the CMR [14] method, especially when the bird exhibits an asymmetric pose, e.g., the first row in Fig. 14.

9.3 Single-view Image Reconstructions
Figure 16: Part correspondence constraint. (a) Input frame and part propagations. (b) Predicted texture flows. (c) Part UV map. (d) Aggregated video-level part UV map. (e) Base shape and differentiable renderer. (f) Part rendering.
In Fig. 12, we show more visualizations of reconstructions from single-view images of the CUB bird test set [45], as well as comparisons with the baseline method [14]. By removing the symmetry assumption, our model is able to reconstruct the objects in the input images more faithfully than the baseline method [14] (Fig. 12(g)).

We also demonstrate the effectiveness of the ARAP constraint, as discussed in Sec. 3.1 of the paper. Without this constraint, the reconstructed meshes contain unnatural spikes, as shown in Fig. 12(f). Finally, we show reconstruction results on zebra images of the test set of [58] in Fig. 15. Our ACMR model successfully captures motions such as head bending or walking for zebras.
10 Network Architecture
We visualize the eight shape bases obtained by applying K-Means clustering to all meshes reconstructed by the CMR [14] method for birds in Fig. 17(a). We also show the bases obtained by applying PCA to the bottleneck features of the image encoder. Note that the latter fails to discover rare shape modes (e.g., duck and flying bird) in the dataset, as shown in Fig. 17(b). Thus we choose K-Means to obtain the shape bases.
We illustrate the part correspondence constraint in detail in Fig. 16. Given the propagated parts in each frame in Fig. 16(a), we map them to the UV space with the predicted texture flow in Fig. 16(b) and obtain the part UV maps in Fig. 16(c). By aggregating (i.e., averaging) these part UV maps, we reduce the noise in each individual part UV map and obtain a video-level part UV map in Fig. 16(d). This video-level part UV map is shared by all frames in the video. Thus, for each frame, we warp the video-level part UV map onto the base shape prediction and render it under the predicted camera pose, as shown in Fig. 16(f). Finally, we encourage consistency between the part renderings and the part propagations, as shown by the red arrow in Fig. 16. Through the differentiable renderer, this loss implicitly improves the predicted camera pose.
In Fig. 18(a), we show the details of our single-view reconstruction network. Given an input image, the network jointly predicts texture, shape, and camera pose. By utilizing a differentiable renderer [27], we are able to use 2D supervision, i.e., silhouettes and input images.
We show the proposed test-time tuning process in Fig. 18(b). Within each sliding window, we encourage the consistency of the UV texture, UV parts, and base shape across all frames.
11 Implementation Details

We summarize the objectives used in the single-view reconstruction model (Fig. 18(a)), discussed in Sec. 3.1 of the submission, as follows: (i) foreground mask loss: a negative intersection-over-union objective between rendered and ground truth silhouettes [14, 24, 16]; (ii) foreground RGB texture loss: a perceptual metric [14, 24, 54] between rendered and input RGB images; (iii) mesh smoothness: a Laplacian constraint [14, 24] to encourage smooth mesh reconstruction; (iv) keypoint re-projection loss: as discussed in Sec. 3.1 of the paper; and (v) the ARAP constraint: described in Sec. 3.1 of the paper. The weights of these objectives are set to 3.0, 3.0, 0.0008, 5.0, and 10.0, respectively.
We summarize the objectives used in the online adaptation process (Fig. 18(b)) in the following. Since it is feasible to predict a segmentation mask via a pretrained segmentation model, we make use of the predicted foreground mask and compute the (i), (ii), and (iii) losses (mentioned above) similarly to the image-based training. We also adopt the ARAP constraint described in Sec. 3.1 of the paper, and the three invariance constraints discussed in Sec. 3.2 for online adaptation. The weights of these objectives are set to 0.1, 0.5, 0.0006, 2.0, and 2.0 (texture invariance), 1.0 (part correspondence), 1.0 (base shape invariance), respectively. These weights are collected into a configuration sketch below.
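For reference, the weights quoted above can be grouped as in the following configuration sketch; the key names are our own labels for the objectives, not identifiers from the released code.

```python
# Objective weights for the single-view (image) stage, in the order listed in the text above.
IMAGE_STAGE_WEIGHTS = {
    'mask_niou': 3.0,
    'texture_perceptual': 3.0,
    'laplacian': 0.0008,
    'keypoint_reprojection': 5.0,
    'arap': 10.0,
}

# Objective weights for online adaptation, in the order listed in the text above.
ADAPTATION_WEIGHTS = {
    'mask_niou': 0.1,
    'texture_perceptual': 0.5,
    'laplacian': 0.0006,
    'arap': 2.0,
    'texture_invariance': 2.0,
    'part_correspondence': 1.0,
    'base_shape_invariance': 1.0,
}
```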
We implement the proposed method in PyTorch [32] and use the Adam optimizer [20] with a learningrate of 0.0001 for both the image reconstruction model training and online adaptation. The weightof each objective for the image reconstruction model as well as the online adaptation process isdiscussed in Sec. 11.1 and Sec. 11.2, respectively.
We adopt a different scheme to train the single-view reconstruction model on zebras: (i) since natural zebra images labeled with keypoints are not publicly available, we adopt a synthetic dataset [58]; (ii) zebras have more complex shapes with large concavities, so it is not suitable to learn the shape by deforming a sphere primitive. Instead, we utilize a readily available zebra mesh as a template and learn the motion deformation on top of it. We first train an image reconstruction model using the synthetic dataset provided by [58]. Similarly to [58], we utilize the silhouettes, keypoints, texture maps, and the partially available UV texture flow as supervision. For shape reconstruction, instead of utilizing the SMAL parametric model [60], we use the proposed shape module, i.e., the combination of base shapes and motion deformation. Due to the limited motion of zebras, we use only one base shape, which is a readily available zebra mesh with 3889 vertices and 7774 faces. For camera pose prediction, we use the perspective camera pose discussed in Sec. 3 of the submission as well as in [14]. Due to the limited capacity of a single UV texture map, we also model the texture by cutting the UV texture map into four pieces and stitching them together, similarly to [58]. We note that this "cutting and stitching" operation does not influence the mapping and aggregation of the part UV maps discussed in Sec. 3.2 of the submission.
12 Failure Cases
Our work is the first to explore the challenging task of reconstructing 3D meshes of deformable object instances from videos in the wild. Impressive as the performance is, this challenging task is far from fully solved. We discuss failure cases and limitations of the proposed method in the following. To begin with, we focus on genus-0 objects such as birds and zebras in this work; thus, our model suffers when generalized to objects with large concave holes, such as chairs or humans. Second, our work struggles to reconstruct meshes from videos with large motion, lighting changes, or occlusion (see Fig. 19). This is mainly due to failures in correctly propagating parts by the self-supervised UVC model [25], which is out of the scope of this work. We leave these failure cases and limitations to future work.

References
[1] A. Arnab, C. Doersch, and A. Zisserman. Exploiting temporal context for 3d human pose estimation in the wild. In CVPR, 2019.
[2] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In CVPR, 2000.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.
[4] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
[5] C. Doersch and A. Zisserman. Sim2real transfer learning for 3d human pose estimation: motion to the rescue. In NeurIPS, 2019.
[6] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, 2018.
[7] P. Guo and R. Farrell. Aligned to the object, not to the image: A unified pose-aligned representation for fine-grained recognition. In WACV, 2019.
[8] M. Habermann, W. Xu, M. Zollhoefer, G. Pons-Moll, and C. Theobalt. Deepcap: Monocular human performance capture using weak supervision. In CVPR, 2020.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[10] P. Henderson and V. Ferrari. Learning to generate and reconstruct 3d meshes with only 2d supervision. In BMVC, 2018.
[11] J. F. Hughes, A. Van Dam, J. D. Foley, M. McGuire, S. K. Feiner, and D. F. Sklar. Computer graphics: principles and practice. Pearson Education, 2014.
[12] W.-C. Hung, V. Jampani, S. Liu, P. Molchanov, M.-H. Yang, and J. Kautz. Scops: Self-supervised co-part segmentation. In CVPR, 2019.
[13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[14] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[15] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik. Learning 3d human dynamics from video. In CVPR, 2019.
[16] H. Kato and T. Harada. Learning view priors for single-view 3d reconstruction. In CVPR, 2019.
[17] H. Kato and T. Harada. Self-supervised learning of 3d objects from natural images. arXiv preprint arXiv:1911.08850, 2019.
[18] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In CVPR, 2018.
[19] A. Khoreva, A. Rohrbach, and B. Schiele. Video object segmentation with language referring expressions. In ACCV, 2018.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[21] C. Kong and S. Lucey. Deep non-rigid structure from motion. In ICCV, 2019.
[22] N. Kulkarni, A. Gupta, D. F. Fouhey, and S. Tulsiani. Articulation-aware canonical surface mapping. In CVPR, 2020.
[23] N. Kulkarni, A. Gupta, and S. Tulsiani. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019.
[24] X. Li, S. Liu, K. Kim, S. De Mello, V. Jampani, M.-H. Yang, and J. Kautz. Self-supervised single-view 3d reconstruction via semantic consistency. arXiv preprint arXiv:2003.06473, 2020.
[25] X. Li, S. Liu, S. De Mello, X. Wang, J. Kautz, and M.-H. Yang. Joint-task self-supervised learning for temporal correspondence. In NeurIPS, 2019.
[26] C.-H. Lin, O. Wang, B. C. Russell, E. Shechtman, V. G. Kim, M. Fisher, and S. Lucey. Photometric mesh optimization for video-aligned 3d object reconstruction. In CVPR, 2019.
[27] S. Liu, T. Li, W. Chen, and H. Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In ICCV, 2019.
[28] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 2015.
[29] X. Luo, J. Huang, R. Szeliski, K. Matzen, and J. Kopf. Consistent video depth estimation. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH), 39(4), 2020.
[30] D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi. C3dpo: Canonical 3d pose networks for non-rigid structure from motion. In ICCV, 2019.
[31] J. Pan, X. Han, W. Chen, J. Tang, and K. Jia. Deep mesh reconstruction from single rgb images via topology modification networks. In ICCV, 2019.
[32] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[33] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, 2019.
[34] X. B. Peng, A. Kanazawa, J. Malik, P. Abbeel, and S. Levine. Sfv: Reinforcement learning of physical skills from videos. ACM Trans. Graph., 37(6), 2018.
[35] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
[36] D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In NeurIPS, 2016.
[37] H. Rhodin, N. Robertini, D. Casas, C. Richardt, H.-P. Seidel, and C. Theobalt. General automatic human shape and motion capture using volumetric contour cues. In ECCV, 2016.
[38] K. Robinette, S. Blackwell, H. Daanen, M. Boehmer, S. Fleming, T. Brill, D. Hoeferlin, and D. Burnsides. Civilian american and european surface anthropometry resource (caesar) final report. Tech. Rep. AFRL-HE-WP-TR-2002-0169, US Air Force Research Laboratory, 2002.
[39] O. Sorkine and M. Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry Processing, 2007.
[40] Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt. Test-time training for out-of-distribution generalization. arXiv preprint arXiv:1909.13231, 2019.
[41] L. Tran and X. Liu. On learning 3d face morphable model from in-the-wild images. TPAMI, 2019.
[42] H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self-supervised learning of motion capture. In NeurIPS, 2017.
[43] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, 2018.
[44] K. Wada. labelme: Image Polygonal Annotation with Python. https://github.com/wkentaro/labelme, 2016.
[45] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset, 2011.
[46] B. Wandt, H. Ackermann, and B. Rosenhahn. 3d reconstruction of human motion from monocular image sequences. TPAMI, 2016.
[47] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018.
[48] C. Wen, Y. Zhang, Z. Li, and Y. Fu. Pixel2mesh++: Multi-view 3d mesh generation via deformation. In ICCV, 2019.
[49] O. Wiles and A. Zisserman. Silnet: Single- and multi-view reconstruction by learning from silhouettes. arXiv preprint arXiv:1711.07888, 2017.
[50] S. Wu, C. Rupprecht, and A. Vedaldi. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In CVPR, 2020.
[51] Y. Wu and K. He. Group normalization. In ECCV, 2018.
[52] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, 2016.
[53] J. Y. Zhang, P. Felsen, A. Kanazawa, and J. Malik. Predicting 3d human dynamics from video. In ICCV, 2019.
[54] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[55] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3d human pose estimation from monocular video. In CVPR, 2016.
[56] R. Zhu, C. Wang, C.-H. Lin, Z. Wang, and S. Lucey. Object-centric photometric bundle adjustment with deep shape prior. In WACV, 2018.
[57] Y. Zhu, D. Huang, F. De La Torre, and S. Lucey. Complex non-rigid motion 3d reconstruction by union of subspaces. In CVPR, 2014.
[58] S. Zuffi, A. Kanazawa, T. Berger-Wolf, and M. J. Black. Three-d safari: Learning to estimate zebra pose, shape, and texture from images in the wild. In ICCV, 2019.
[59] S. Zuffi, A. Kanazawa, T. Berger-Wolf, and M. J. Black. Three-d safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In ICCV, 2019.
[60] S. Zuffi, A. Kanazawa, D. Jacobs, and M. J. Black. 3D menagerie: Modeling the 3D shape and pose of animals. In CVPR, 2017.
Figure 12: More qualitative reconstruction results on CUB birds [45].
Figure 13: Camera stability visualization. We visualize the differences between adjacent camera pose predictions. The blue and red lines represent the model trained only with images and the test-time tuned model, respectively.
Figure 14:
Visualization of re-projected keypoints of the single-view image reconstruction model.
Figure 15: Visualization of reconstructed zebras.

Figure 17: Bases visualization. (a) Applying K-Means to the meshes reconstructed by CMR [13]. (b) Applying PCA to the feature space of CMR [13].
Figure 18: Frameworks. For the purposes of illustration only, we show the test-time tuning procedure in (b) with a sliding window size of 3 and a sliding stride of 2. In all our experiments, we use a sliding window size of 50 and a stride of 10. The gray dashed box shows the previous sliding window, while the orange box shows the current sliding window.
Figure 19:
Failure cases. Our model fails when there is occlusion (e.g., t = 50, where t stands for the frame number) or large lighting changes (e.g., t = 60).