Implicit Mesh Reconstruction from Unannotated Image Collections
Shubham Tulsiani, Nilesh Kulkarni, Abhinav Gupta
Facebook AI Research, University of Michigan, Carnegie Mellon University
https://shubhtuls.github.io/imr/
Figure 1:
Given a single input image, we can infer the shape, texture and camera viewpoint for the underlying object. In rows 1 and 2, we show the input image, the inferred 3D shape and texture from the predicted viewpoint, and three novel viewpoints. As we learn 3D inference using only in-the-wild image collections with approximate instance segmentations, our approach can be easily applied across a diverse set of categories. Rows 3 and 4 show sample predictions across a broad set of categories, with the predicted 3D shape overlaid on the input image. Please see the project page for 360 degree visualizations.
Abstract
We present an approach to infer the 3D shape, texture, and camera pose for an object from a single RGB image, using only category-level image collections with foreground masks as supervision. We represent the shape as an image-conditioned implicit function that transforms the surface of a sphere to that of the predicted mesh, while additionally predicting the corresponding texture. To derive supervisory signal for learning, we enforce that: a) our predictions when rendered should explain the available image evidence, and b) the inferred 3D structure should be geometrically consistent with learned pixel to surface mappings. We empirically show that our approach improves over prior work that leverages similar supervision, and in fact performs competitively to methods that use stronger supervision. Finally, as our method enables learning with limited supervision, we qualitatively demonstrate its applicability over a set of about 30 object categories.
Introduction
Inferring the 3D structure of diverse categories of objects from images in the wild (see Figure 1) has been one of the long term goals in computer vision. Despite decades of progress in computing, graphics and machine learning, we do not yet have systems that can infer the underlying 3D for objects in natural images. This is in stark contrast to the progress witnessed in related problems such as object recognition and detection, where we have developed scalable methods that make accurate predictions for hundreds of categories for images in the wild. Why is there this disconnect between progress in 3D perception and 2D understanding, and what can we do to overcome it? We argue that the central bottleneck for 3D perception in the wild has been the strong reliance on 3D supervision, or the low expressivity of previous models (e.g., a fixed 3D template). In this work, we aim to bypass these bottlenecks and present an approach that can learn inference of a deformable 3D shape using only the form of data that 2D recognition systems leverage – in-the-wild category-level image collections with (approximate) instance segmentations.

Given a single input image, our goal is to be able to infer the 3D shape, texture and camera pose for the underlying object. A scalable solution needs to have two ingredients: (a) 3D modeling which can handle instance variations and deformations of objects; (b) minimal supervision, to allow scalability across diverse categories. There have been recent attempts on both these axes. For example, a recent approach [13] learns explicit 3D representations from image collections, and can handle instance variations and pose variations. But this approach crucially relies on 2D keypoint annotations to guide learning, thereby making it difficult to scale beyond a handful of categories. On the other hand, Kulkarni et al. [17, 18] learn pixel to surface mappings that are consistent with a global (articulated) 3D template, and show that this can help learn accurate prediction. This approach bypasses keypoint supervision, but cannot model any shape variations (fat vs thin bird) and additionally requires 3D part supervision for modeling articulations.

Our approach handles both shape and pose variation by inferring implicit category-specific shape representations. To bypass the need for direct supervision, we also infer the corresponding texture and enforce reprojection consistency between our predictions and the available images and segmentations. Drawing inspiration from the work by Kulkarni et al. [18], we use a geometric cycle-consistency loss between global 3D and local pixel to surface predictions to derive additional learning signal from unannotated image collections. Leveraging a single 3D template shape per category as initialization, we learn the category-level implicit shape space from image collections. We show sample shape predictions obtained using our approach in Figure 1 and also visualize the 3D shape with predicted texture from the predicted and novel camera viewpoints. We observe that, despite the lack of direct supervision, our method effectively captures the shape variation across instances, e.g. head bending down, length of tail. As illustrated in Figure 1, our approach is applicable across a diverse set of object categories, and the reliance on only image collections with approximate segmentation masks allows learning in settings where previous approaches could not.
Related Work

Learning 3D Shape from Annotated Image Collections.
The recent success of deep learning has resulted in a number of learning based approaches for the task of single-view 3D inference. The initial approaches [4, 7] showed impressive volumetric inference results using synthetic data as supervision, and these were then generalized to other representations such as point clouds [6], octrees [9, 29], or meshes [33]. However, these approaches relied on ground-truth 3D as supervision, and this is difficult to obtain at scale or for images in the wild. There have therefore been attempts to relax the supervision required, e.g. by instead using multi-view image collections [25, 28, 31, 32, 37]. Closer to our setup, several approaches [2, 3, 13, 14] have also addressed the task of learning 3D inference using only single-view image collections, although relying on additional pose or keypoint annotations. While these approaches yield encouraging results, their reliance on annotated 2D keypoints limits their applicability for generic categories. Our work also similarly learns from image collections, but does so without using any supervision in the form of 2D keypoints or pose, and this allows us to learn from image collections of objects in the wild.
Learning 3D from Unannotated Image Collections.
Our motivation of learning 3D from unannotated image collections is also shared by some recent works [17, 18, 21]. Nguyen-Phuoc et al. [21] use geometry-driven generative modeling to learn 3D structure, but their approach only infers a volumetric feature capable of view synthesis. Unlike our work, this does not output a tangible 3D shape or texture for the underlying object. Closer to our approach, Kulkarni et al. [17, 18] predict explicit 3D representations by proposing a cycle-consistency loss between inferred 3D and learned pixelwise 2D to 3D mappings. However, their method only produces a limited 3D representation in the form of a rigid or articulated template, and cannot handle instance-specific shape variations, e.g. a fat vs a thin bird. Our implicit shape representation allows capturing such variation in addition to the instance texture, and we generalize their cycle consistency loss for our representation.
Implicit 3D Representations.
Several recent works [20, 23, 27, 36] using implicit functions have shown impressive results on the task of 3D reconstruction. Unlike explicit representations (e.g. meshes, voxels, point clouds), these methods learn functions to parameterize a 3D volume or surface. Our representation is inspired by work from Groueix et al. [8], which learns a mapping, conditioned on a latent code, from points on a 2D manifold to a 3D surface, and we further equip this with an implicit texture as in Oechsle et al. [22]. However, all these prior approaches require settings with ground-truth 3D and texture supervision available. In contrast, our work leverages these representations in an unsupervised setup with corresponding technical insights to make learning feasible, e.g. the use of a 'texture flow' and the incorporation of category-specific shape consistency.
Approach

Given an image of an object, we aim to infer its 3D shape, texture, and camera pose. Moreover, we want to learn this inference using only image collections with instance segmentations as supervisory signal. Towards this, our approach leverages implicit shape and texture representations, and enforces geometric consistency with the available image evidence to bypass the need for direct supervision. Specifically, we leverage geometry-driven objectives to learn an encoder which predicts the desired properties from the input image: $f_{\theta'}(I) \equiv (\pi, z_s, z_t)$ – here π denotes a weak perspective camera, and $z_s, z_t$ correspond to latent variables that instantiate the underlying shape and texture. We first describe in Section 3.1 the category-specific implicit shape representation pursued, and then describe the proposed learning framework in Section 3.2. While we initially consider only shape inference, we show in Section 3.3 how our approach can also incorporate texture prediction.
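As a concrete illustration of this interface, the sketch below shows one possible encoder that outputs a weak-perspective camera along with the shape and texture codes. The head dimensionalities and the quaternion-based camera parametrization are illustrative assumptions; the text only specifies that a ResNet-18 based encoder is used.

```python
import torch
import torch.nn as nn
import torchvision

class ImagePredictor(nn.Module):
    """Predicts (camera pi, shape code z_s, texture code z_t) from an image.

    The head layout below is an assumption for illustration; the paper only
    specifies a ResNet-18 backbone.
    """
    def __init__(self, z_dim=64):
        super().__init__()
        backbone = torchvision.models.resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        feat_dim = 512
        # Weak-perspective camera: scale (1), 2D translation (2), rotation quaternion (4).
        self.camera_head = nn.Linear(feat_dim, 7)
        self.shape_head = nn.Linear(feat_dim, z_dim)    # z_s
        self.texture_head = nn.Linear(feat_dim, z_dim)  # z_t

    def forward(self, img):
        feat = self.features(img).flatten(1)
        cam = self.camera_head(feat)
        # Normalize the quaternion part so it is a valid rotation.
        quat = nn.functional.normalize(cam[:, 3:], dim=-1)
        pi = torch.cat([cam[:, :3], quat], dim=-1)
        return pi, self.shape_head(feat), self.texture_head(feat)
```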
Figure 2: Mesh Parametrization. An explicit representation (left) of the deformation of a sphere to a mesh is parametrized via the locations of a fixed set of 3D vertices. In contrast, an implicit representation (right) is parametrized via a function that maps any point on the sphere to a 3D coordinate.
Figure 3: Category-specific Implicit Mesh Representation. We represent the shapes for different instances in a category via a latent-variable conditioned implicit function. The mesh for a given instance is obtained by adding an instance-specific deformation to a shared category-level mean shape, where both the shared template and the deformations are represented as neural networks.
We model a predicted shape as a deformation of a unit sphere. One possible representation for such a deformation is an explicit one: we can represent the unit sphere as a mesh with a fixed set of vertices V, and parametrize its deformation as a per-vertex translation $\delta \in \mathbb{R}^{|V| \times 3}$. While several prior works [13, 15] have successfully leveraged this explicit representation, it is computationally challenging to scale to a finer mesh, and it also lacks certain desirable inductive biases, e.g. correlation of vertex locations, as these are parametrized independently. To overcome these challenges, we instead implicitly parametrize the shape (see Figure 2). Denoting by S the surface of a unit sphere, we parametrize its deformation via a function $\phi: S \to \mathbb{R}^3$, such that for all $u \in S$, $\phi(u)$ is a point on the implied surface.

However, a given φ as defined above only represents a specific shape, whereas we are interested in modeling the different possible shapes across instances of a category. We therefore model the implicit shape function as additionally being dependent on a latent variable $z_s \in \mathbb{R}^d$, such that $\phi(\cdot, z_s): S \to \mathbb{R}^3$ can describe different shapes according to the variable $z_s$. While the instance-specific latent variable allows representing the variation within the category, we factorize φ to also benefit from the commonalities within it – we model the instance level shape as a combination of an instance-agnostic mean shape and an instance dependent deformation (see Figure 3).

$$\phi(u, z_s) = \bar{\phi}(u) + \delta(u, z_s); \quad u \in S;\ z_s \in \mathbb{R}^d \qquad (1)$$

Here $\bar{\phi}(\cdot)$ and $\delta(\cdot, \cdot)$ are both modeled as neural networks, and represent the implicit functions for a category-level mean shape and instance level deformation respectively. To overcome ambiguities and possible degenerate solutions when learning without supervision, we initialize $\bar{\phi}(\cdot)$ for each category to match a manually chosen template 3D shape (see appendix for details and visualizations).
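As a minimal sketch of Eq. (1), the implicit shape function can be realized with two small MLPs, one for the category-level mean shape and one for the latent-conditioned deformation; the layer widths and depths below are illustrative assumptions (the text only specifies a simple 4 layer feedforward MLP).

```python
import torch
import torch.nn as nn

class ImplicitShape(nn.Module):
    """phi(u, z_s) = mean_shape(u) + deformation(u, z_s), as in Eq. (1).

    u: points on the unit sphere, shape (N, 3); z_s: shape code, shape (d,).
    Widths and depths here are illustrative, not the exact architecture used.
    """
    def __init__(self, z_dim=64, hidden=256):
        super().__init__()
        self.mean_shape = nn.Sequential(        # category-level bar-phi(u)
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))
        self.deform = nn.Sequential(            # instance deformation delta(u, z_s)
            nn.Linear(3 + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, u, z_s):
        z = z_s.unsqueeze(0).expand(u.shape[0], -1)
        return self.mean_shape(u) + self.deform(torch.cat([u, z], dim=-1))
```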
Enforcing Symmetry. Almost all naturally occurring objects, as well as several man-made ones, exhibit reflectional symmetry. We incorporate this by constraining the learnt mean shape $\bar{\phi}$ and deformations $\delta(\cdot, z_s)$ to be symmetric along the X = 0 plane. For example, the deformation is constrained to be symmetric as follows (where $R(\cdot)$ denotes a reflection function):

$$\delta(u, z_s) \equiv \big(\delta'(u, z_s) + R(\delta'(R(u), z_s))\big) / 2 \qquad (2)$$
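The symmetrization of Eq. (2) amounts to averaging a raw deformation prediction with its mirror image about the X = 0 plane; a sketch, assuming the `deform` network from the previous snippet and per-point shape codes `z`:

```python
import torch

def reflect_x(points):
    """Mirror 3D points about the X = 0 plane."""
    return points * torch.tensor([-1.0, 1.0, 1.0], device=points.device)

def symmetric_deform(deform_net, u, z):
    """delta(u, z_s) = (delta'(u, z_s) + R(delta'(R(u), z_s))) / 2, as in Eq. (2).

    u: (N, 3) sphere points; z: (N, d) shape codes broadcast per point.
    """
    d = deform_net(torch.cat([u, z], dim=-1))
    d_mirror = deform_net(torch.cat([reflect_x(u), z], dim=-1))
    return 0.5 * (d + reflect_x(d_mirror))
```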
Encouraging Locally Rigid Transforms. In our formulation, the instance-specific deformation $\delta(\cdot, z_s)$ is required to capture any change from the category-level mean shape. This change in 3D structure can stem from intrinsic variation, e.g. a bird can be thin or fat, or can be caused by articulation, e.g. the same horse would induce different deformations if its head is bent down or held upright. As articulations can be viewed as local rigid transformations of the underlying shape, we encourage the learnt deformations to explain the variation using locally rigid transformations if possible. We note that under any rigid transform, the distance between two corresponding points remains unchanged, and incorporate a regularization that penalizes the mean of this variation across local neighborhoods. Denoting by $\mathcal{N}(u)$ a local neighborhood of u, this objective is:

$$L_{rigid} = \mathbb{E}_{u \in S}\, \mathbb{E}_{u' \in \mathcal{N}(u)} \Big|\, \|\phi(u, z_s) - \phi(u', z_s)\| - \|\bar{\phi}(u) - \bar{\phi}(u')\| \,\Big| \qquad (3)$$
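A sketch of the local rigidity regularizer of Eq. (3), assuming precomputed samples $\phi(u_i, z_s)$ and $\bar{\phi}(u_i)$ at sphere points $u_i$; the construction of neighborhoods as k nearest neighbors on the sphere is our assumption.

```python
import torch

def rigidity_loss(phi_pts, mean_pts, u, k=8):
    """L_rigid (Eq. 3): penalize the change in pairwise distances within local
    neighborhoods between the instance shape and the category mean shape.

    phi_pts, mean_pts: (N, 3) points phi(u, z_s) and bar-phi(u) for samples u (N, 3).
    Neighborhoods are taken as k nearest neighbors on the sphere (an assumption).
    """
    nbrs = torch.cdist(u, u).topk(k + 1, largest=False).indices[:, 1:]  # (N, k)
    d_inst = (phi_pts.unsqueeze(1) - phi_pts[nbrs]).norm(dim=-1)        # (N, k)
    d_mean = (mean_pts.unsqueeze(1) - mean_pts[nbrs]).norm(dim=-1)
    return (d_inst - d_mean).abs().mean()
```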
While we do not have direct supervision available for learning the shape and pose inference, we can nevertheless derive supervisory signal by encouraging our 3D predictions to be geometrically consistent with the available image evidence. Concretely, we enforce that the predicted 3D shape, when rendered according to the predicted camera, matches the foreground mask, while specially emphasizing boundary alignment. Following a surface mapping consistency formulation by Kulkarni et al. [18], we also implicitly encourage semantically similar regions across instances to be 'explained' by consistent regions of the deformable shape space.

Mask and Boundary Reprojection Consistency.
Given the inferred (implicit) 3D shape $\phi_\theta(\cdot, z_s)$, we recover an explicit mesh M by sampling the implicit function at a fixed resolution. We then use a differentiable renderer [15, 24] to obtain the foreground mask for this predicted mesh under the camera π, and define a loss against the ground-truth mask $I_f$: $L_{mask} = \|I_f - f_{render}(M, \pi)\|$.

While this loss coarsely aligns the predictions to the image, it does not emphasize details such as the long tails of birds, or animal legs. We therefore incorporate an objective $L_{boundary}$ as proposed by Kar et al. [14] – points should project within the foreground, and contour points should have some projected 3D points nearby (see appendix for the formulation).
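For concreteness, the mask reprojection term can be sketched as below; here `render_silhouette` is a hypothetical stand-in for a differentiable silhouette renderer such as those of [15, 24], and its signature is an assumption rather than the actual API of either library.

```python
def mask_loss(verts, faces, camera, gt_mask, render_silhouette):
    """L_mask = || I_f - f_render(M, pi) ||, sketched as an L1 discrepancy.

    verts, faces: explicit mesh sampled from the implicit shape; gt_mask: (H, W)
    foreground mask tensor. `render_silhouette(verts, faces, camera)` is a
    placeholder for a differentiable silhouette renderer.
    """
    pred_mask = render_silhouette(verts, faces, camera)   # (H, W), values in [0, 1]
    return (gt_mask.float() - pred_mask).abs().mean()     # L1 discrepancy
```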
Figure 4: Overview of Training Procedure. We train a network to predict the shape, texture, and camera pose for an object in an image, where the shape and texture are parametrized via latent-variable conditioned implicit functions. We learn this network using a combination of reprojection consistency losses between the inferred 3D and the input image and a cycle consistency objective with learned pixel to 3D mappings (see text for details).

Regularization via Pixel to Surface Mappings. In our formulation, each point on the predicted 3D shape corresponds to a unique point u on the surface of a unit sphere. To ensure that the inferred 3D is consistent across different instances in a category, we would ideally like to enforce that semantically similar regions across images are 'explained' by similar regions of the unit sphere, e.g. the u projecting onto the horse head is similar across instances. However, we do not have access to any semantic supervision to directly operationalize this insight, but we can do so indirectly.

Following the work of Kulkarni et al. [18], we train a convolutional predictor to infer pixel to surface mappings $C \equiv g_\Theta(I)$ given an input image I. The inferred mapping $C[p] \in S$ predicts for each pixel p a corresponding coordinate on the unit sphere, and thereby on the 3D shape, and we encourage this prediction to be cycle-consistent with the inferred 3D. Intuitively, the convolutional predictor, using local image appearance, predicts what 3D point each pixel corresponds to. Therefore, enforcing consistency between the inferred 3D and the predicted mapping implicitly encourages our inferred 3D to be semantically consistent. Concretely, via the predicted mapping, a pixel p is mapped to $C[p] \in S$, and thus to the 3D point $\phi(C[p], z_s)$ on the implicit mesh, and we encourage its reprojection under the camera π to map back to the pixel. While [18] learned these mappings in the context of a rigid or rigged template, we can extend their loss formulation to our implicit mesh representation.

$$L_{gcc} = \sum_p \|\pi(\phi(C[p], z_s)) - p\| \qquad (4)$$
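A sketch of the cycle-consistency objective of Eq. (4), assuming the sum is taken over foreground pixels and that `project` implements the weak-perspective projection; both helpers are illustrative rather than part of the released implementation.

```python
import torch

def gcc_loss(surface_map, shape_fn, z_s, camera, project, fg_mask):
    """L_gcc (Eq. 4): pixels mapped to the sphere, lifted to 3D via phi, should
    reproject back onto themselves under the predicted camera.

    surface_map: (H, W, 3) per-pixel unit-sphere coordinates C[p] (from g_Theta).
    shape_fn(u, z_s): the implicit shape function phi.
    project(points, camera): assumed weak-perspective projection to pixel coords.
    """
    H, W, _ = surface_map.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).float()              # (H, W, 2) pixel grid
    pts3d = shape_fn(surface_map.reshape(-1, 3), z_s)        # phi(C[p], z_s)
    reproj = project(pts3d, camera).reshape(H, W, 2)
    err = (reproj - pix).norm(dim=-1)
    return (err * fg_mask).sum() / fg_mask.sum().clamp(min=1)
```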
Leveraging Optional Keypoint Supervision. While our goal is to learn 3D reconstruction with minimal supervision, our approach easily allows using semantic keypoint labels, e.g. nose, tail etc., if available. For each semantic keypoint k annotated in the 2D images, we first manually annotate a corresponding 3D keypoint on the category-level template mesh. This allows us to associate a unique spherical coordinate $u_k$ with each keypoint k for a category. Given an input training image with annotated 2D keypoint locations $\{x_k\}$, we can penalize the reprojection error.

$$L_{kp} = \sum_k \|\pi(\phi(u_k, z_s)) - x_k\| \qquad (5)$$

Note that we only leverage this supervision in certain ablations, and that all other results (including visualized predictions) are obtained without using this additional signal, i.e. with only mask supervision.
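The optional keypoint term of Eq. (5) then reduces to a standard reprojection error; a sketch, where the visibility masking is our assumption:

```python
import torch

def keypoint_loss(u_kp, kp_2d, kp_vis, shape_fn, z_s, camera, project):
    """L_kp (Eq. 5): project phi(u_k, z_s) and penalize distance to annotated x_k.

    u_kp: (K, 3) spherical coordinates of template keypoints; kp_2d: (K, 2) labels;
    kp_vis: (K,) 0/1 visibility (masking invisible keypoints is our assumption).
    """
    pred_2d = project(shape_fn(u_kp, z_s), camera)           # (K, 2)
    err = (pred_2d - kp_2d).norm(dim=-1)
    return (err * kp_vis).sum() / kp_vis.sum().clamp(min=1)
```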
Training Details. Our reprojection and cycle consistency losses depend on both the predicted camera π and the inferred shape $\phi(\cdot, z_s)$. Unfortunately, learning these together is susceptible to local minima where only a narrow range of poses is predicted. We follow suggestions from prior work [11, 31] to overcome this and predict multiple diverse pose hypotheses and their likelihoods, and minimize the expected loss. Additionally, as the inferred poses in initial training iterations are often inaccurate, the learned deformations are not meaningful. We therefore only allow training δ after a certain number of epochs. We use a pretrained ResNet-18 [10] based encoder as $f_{\theta'}$, and a simple 4 layer feedforward MLP to instantiate the shape and texture implicit functions.

Towards capturing the appearance of the depicted object, we predict the texture that gets overlaid on the predicted (implicit) 3D shape, and learn this inference by enforcing consistency with the observed foreground pixels. As our 3D shape is parametrized via a deformation of a sphere, we simply need to associate each point on the sphere with a corresponding texture to induce a textured 3D shape.

While related approaches [13] create a fixed resolution texture map, we instead propose to infer an implicit texture function $\tau(\cdot, z_t)$ s.t. $\tau(u, z_t)$ yields the texture corresponding to u given the instance-specific texture encoding $z_t$. Instead of directly regressing to the color value, we follow Zhou et al.'s [39] insight of copying the pixel value, and predict an implicit texture flow: i.e. $\tau(u, z_t) \in \mathbb{R}^2$ indicates the coordinate of the pixel whose color should be copied (via bilinear sampling) to obtain the texture at u. As with our implicit shape representation, we also symmetrize the predicted texture flow function to propagate textures to unseen regions.
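A sketch of the implicit texture flow: an MLP maps a sphere point and the texture code to normalized image coordinates, and colors are copied from the input image via bilinear sampling; the network layout mirrors the shape MLP and is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureFlow(nn.Module):
    """tau(u, z_t) -> 2D image coordinate in [-1, 1]; colors are then copied from
    the input image by bilinear sampling (texture flow, following [39]).
    Layer sizes are illustrative assumptions.
    """
    def __init__(self, z_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh())  # normalized (x, y) in [-1, 1]

    def forward(self, u, z_t, image):
        """u: (N, 3) sphere points; z_t: (d,) texture code; image: (3, H, W)."""
        z = z_t.unsqueeze(0).expand(u.shape[0], -1)
        flow = self.mlp(torch.cat([u, z], dim=-1))             # (N, 2) flow coords
        grid = flow.view(1, -1, 1, 2)                          # grid_sample layout
        tex = F.grid_sample(image.unsqueeze(0), grid, align_corners=False)
        return tex.squeeze(0).squeeze(-1).t()                  # (N, 3) RGB per point
```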
(a) Keypoint Transfer Accuracy

Supv        Method            Bird    Horse   Cow     Sheep
KP + Mask   CMR [12]          47.3    –       –       –
            CSM [18]          45.8    42.1    28.5    31.5
            A-CSM [17]        51.0    44.6    29.2    39.0
            IMR (ours)
Mask        Dense-Equi [30]   33.5    23.3    20.9    19.6
            CSM [18]          36.4    31.2    26.3    24.7
            A-CSM [17]        42.6    32.9    26.3    28.6
            IMR (ours)

(b) Keypoint Reprojection Accuracy

Supv        Method            Bird    Horse   Cow     Sheep
KP + Mask   CMR [12]          80.0    –       –       –
            CSM [18]          68.5    46.4    52.6    47.9
            A-CSM [17]        72.4    57.3    56.8
            IMR (ours)

Table 1: Evaluation via Semantic Keypoints. We report comparisons to previous approaches for: a) transferring semantic keypoints between images using the dense correspondence induced by our 3D predictions, and b) predicting 2D keypoints via reprojection of 3D keypoints. Higher is better.
To derive a learning signal for this prediction, we follow Kanazawa et al. [13] and differentiably render [19] the predicted 3D shape with the implied texture, and penalize a perceptual loss [38] against the foreground pixels of the image. We additionally incorporate a term encouraging the texture flow to sample from foreground pixels instead of background ones.
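The texture learning signal can be sketched as a feature-space distance between the rendered textured shape and the masked input image; the VGG-based loss below is a simplified stand-in for the perceptual metric of [38], and the choice of layers is an assumption.

```python
import torch
import torch.nn as nn

class PerceptualTextureLoss(nn.Module):
    """Feature-space distance between the rendered textured shape and the
    foreground of the input image; a simplified stand-in for the metric of [38].

    vgg_features: a pretrained VGG16 `.features` module (e.g. from torchvision);
    the layer indices below (relu1_2, relu2_2, relu3_3) are an assumed choice.
    """
    def __init__(self, vgg_features, layers=(3, 8, 15)):
        super().__init__()
        self.vgg = vgg_features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layers = set(layers)

    def extract(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                feats.append(x)
        return feats

    def forward(self, rendered, image, fg_mask):
        target = image * fg_mask            # compare only against foreground pixels
        f_r, f_t = self.extract(rendered), self.extract(target)
        return sum((a - b).abs().mean() for a, b in zip(f_r, f_t))
```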
Experiments

We present empirical and qualitative results across a diverse set of categories, and leverage several existing datasets to obtain the required training data. Across all these datasets, we only use the images and annotated (or automatically obtained) segmentation masks for training, but use the additional available annotations, e.g. keypoints or approximate 3D, for evaluation. We download a representative template model (used to initialize $\bar{\phi}$) for all the categories from [1].
We use images from the PASCAL3D+ [35] dataset, which combines images from PASCAL VOC [5] and ImageNet [26]. For the former set, a manually annotated foreground mask is available, whereas an automatically obtained one is used for the ImageNet subset. We follow the splits used in prior work [32] but, unlike them, do not use the keypoint annotations for training. As annotations for approximate 3D models are available on this dataset, it allows us to empirically measure the reconstruction accuracy.
Curated Animate Categories (Bird, Cow, Horse, Sheep).
We use the CUB-200-2011 [34] dataset for obtaining bird images with segmentation masks. For the other categories, we use the splits from [18] which, similar to PASCAL3D+, combine images from VOC and ImageNet. Across all these categories, we have keypoint annotations available for images in the test set, and we use these for indirect evaluation of the quality and consistency of the inferred 3D.
Quadrupeds from ImageNet (Lion, Bear, Elephant, and 20 others).
We also apply our method on categories from ImageNet using automatically obtained segmentations [16]. We use the images from Kulkarni et al. [17], who (noisily) filter out instances with significant truncation and occlusion.
Recall that each point on our predicted 3D mesh corresponds to some u ∈ S. A rendering of this predicted 3D according to the predicted camera therefore induces a per-pixel mapping to S (see appendix for a visualization). To indirectly evaluate the accuracy and consistency of our inferred 3D, we can use this mapping to infer and evaluate dense 2D to 2D correspondence across images. Following prior work, we report a keypoint-transfer accuracy 'PCK-T' (see [17] for metric details) in Table 1 a). We observe that our approach significantly improves over prior work in both supervision settings – with and without keypoint annotations. Using the annotations of 3D keypoints on the category-level template (see Section 3.2), we can predict their 2D locations in an image as $\pi(\phi(u_k, z_s))$. This allows us to compute a reprojection accuracy 'PCK-R' (the percentage of keypoints reprojected within a certain threshold of the ground-truth) for annotated semantic keypoints. We report this reprojection accuracy in Table 1 b).

Figure 5: Sample Results. We show (from left to right) the input image, the inferred 3D shape and texture from the predicted viewpoint, and the textured 3D shape from a novel viewpoint. Please see the project page and appendix for additional visualizations of results across all classes.
Supv        Method        Aero    Car
KP + Mask   CSDM [14]     0.40    0.60
            DRC [32]      0.42    0.67
            CMR [13]      0.46    0.64
Mask        IMR (ours)    0.44    0.66

Table 2: 3D Reconstruction Accuracy. We report the mean intersection over union (IoU) of our 3D predictions with the available 3D annotations on the PASCAL3D+ dataset. We observe competitive performance with prior approaches that require stronger supervision.
In Section 4.2, we compared our approach to previous works that learn using similar supervision, but only infer a constrained 3D representation. However, there have also been prior approaches which, similar to ours, allow more expressive 3D inference, but require additional keypoint or pose supervision. We compare our learned 3D inference to these (relatively) strongly supervised methods using the PASCAL3D+ dataset, which has approximate 3D ground-truth available in the form of manually selected templates.

We report in Table 2 the mean intersection over union (IoU) of the predicted 3D shape with the available ground-truth on held out test images. We compare our approach to a deformable model fitting [14], a volumetric prediction [32], and an explicit mesh inference [13] approach – all of which rely on keypoint or pose labels for learning. We observe that across both the examined categories, our approach, using only the foreground mask as supervision, yields competitive (and sometimes better) performance.
To allow direct or indirect empirical evaluation, we have so far only examined categories where annotated image collections are available. However, as we do not leverage these annotations for learning, we can go beyond this limited set of classes and apply our approach to generic categories from ImageNet. In particular, we consider several quadruped categories, e.g. lion, bear, zebra etc., and train our approach with automatically obtained approximate instance segmentations using PointRend [16]. We visualize our predictions in Figure 5 and also show additional random results in the supplementary. We see that our predictions can capture variation, e.g. thin or fat body, head bending down, sitting vs standing etc., and that the inferred texture for visible and invisible regions is also meaningful.
Method               PCK-P   PCK-T
Ours                 54.1    41.3
Ours − L_gcc
Ours − L_boundary
Ours − L_rigid

Table 3: Ablation of loss components. We report mean keypoint reprojection and transfer accuracy across categories when removing certain terms from the objectives. Higher is better.
We ablate various terms in our learning objective using the semantic keypoint reprojection (PCK-P) and transfer (PCK-T) metrics. In particular, we examine whether incorporating the consistency loss with pixelwise surface mappings is helpful, and whether prioritizing locally rigid transforms and boundary alignment is beneficial. We report in Table 3 the mean accuracy for both evaluations across the four animate categories evaluated (bird, horse, cow, and sheep). We observe that all the components improve performance, though we note that some components, e.g. encouraging rigidity or boundary alignment, lead to more significant qualitative improvements.
Discussion

We presented an approach for learning implicit shape and texture inference from unannotated image collections. Although this enables 3D prediction for a diverse set of categories, several challenges still remain towards being able to reconstruct thousands of object classes in generic images. As we model shapes via deformation of a (learned) template, our model does not allow for large or topological shape changes that may be common in artificial object categories (e.g. chairs). Additionally, our reprojection based objectives implicitly assume that the object is largely unoccluded, and it would be interesting to generalize these objectives to allow partial visibility. Lastly, our results do not always capture the fine details or precise pose, and it may be desirable to leverage additional learning signal if available, e.g. from videos. While there is clearly more progress needed to build systems that can reconstruct any object in any image, we believe our work represents an exciting step towards learning scalable and self-supervised 3D inference.

References

[1] Free3d.com.
[2] Thomas J Cashman and Andrew W Fitzgibbon. What shape are dolphins? Building 3D morphable models from 2D images. TPAMI, 2013.
[3] Wenzheng Chen, Huan Ling, Jun Gao, Edward Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. In NeurIPS, 2019.
[4] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.
[5] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 2015.
[6] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.
[7] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In ECCV, 2016.
[8] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mâché approach to learning 3D surface generation. In CVPR, 2018.
[9] Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3D object reconstruction. In 3DV, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differentiable point clouds. In NeurIPS, 2018.
[12] Angjoo Kanazawa, Shahar Kovalsky, Ronen Basri, and David Jacobs. Learning 3D deformation of animals from 2D images. In Eurographics, 2016.
[13] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[14] Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In CVPR, 2015.
[15] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3D mesh renderer. In CVPR, 2018.
[16] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. arXiv preprint arXiv:1912.08193, 2019.
[17] Nilesh Kulkarni, Abhinav Gupta, David F Fouhey, and Shubham Tulsiani. Articulation-aware canonical surface mapping. In CVPR, 2020.
[18] Nilesh Kulkarni, Abhinav Gupta, and Shubham Tulsiani. Canonical surface mapping via geometric cycle consistency. In ICCV, 2019.
[19] Shichen Liu, Tianye Li, Weikai Chen, and Hao Li. Soft Rasterizer: A differentiable renderer for image-based 3D reasoning. In ICCV, 2019.
[20] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In CVPR, 2019.
[21] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3D representations from natural images. In ICCV, 2019.
[22] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. Texture fields: Learning texture representations in function space. In ICCV, 2019.
[23] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In CVPR, 2019.
[24] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. PyTorch3D. https://github.com/facebookresearch/pytorch3d, 2020.
[25] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3D structure from images. In NeurIPS, 2016.
[26] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[27] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, 2019.
[28] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3D models from single images with a convolutional network. In ECCV, 2016.
[29] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In ICCV, 2017.
[30] James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. In NeurIPS, 2017.
[31] Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In CVPR, 2018.
[32] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
[33] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV, 2018.
[34] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[35] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV, 2014.
[36] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. DISN: Deep implicit surface network for high-quality single-view 3D reconstruction. In NeurIPS, 2019.
[37] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In NeurIPS, 2016.
[38] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[39] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In ECCV, 2016.

Appendix

Initializing Implicit Shape Space using a Template Mesh.
To initialize $\bar{\phi}(\cdot)$ in our category-level implicit shape representation, we use a single template shape per category. We do so by training the network $\bar{\phi}(\cdot)$ to minimize a matching loss with the available template. Specifically, denoting by S the surface of a given template, we generate a target point set by randomly sampling N = 1000 points on the surface: $P_t = \{p_i \sim S \mid i = 1 \cdots N\}$. We also randomly sample points on the implicit shape corresponding to the current $\bar{\phi}(\cdot)$ by transforming random samples on the sphere: $P_p = \{\bar{\phi}(u_i) \mid i = 1 \cdots N\}$, where the $u_i$ are random samples on the unit sphere. We then iteratively train $\bar{\phi}(\cdot)$ to minimize a Hungarian matching loss between the two point clouds (using different random samples every iteration).
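One initialization step can be sketched as follows, assuming the `mean_shape` MLP from the earlier sketch and using `scipy`'s `linear_sum_assignment` for the Hungarian matching; the squared-distance matching cost and the sphere sampling scheme are assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def template_init_step(mean_shape, template_pts, n=1000):
    """One fitting iteration for bar-phi: sample points on the sphere and on the
    template, Hungarian-match them, and return a matching loss to minimize.

    template_pts: (M, 3) points pre-sampled on the template surface.
    The squared-distance cost and the sampling scheme are illustrative assumptions.
    """
    u = torch.nn.functional.normalize(torch.randn(n, 3), dim=-1)   # sphere samples
    pred = mean_shape(u)                                           # bar-phi(u), (n, 3)
    target = template_pts[torch.randperm(template_pts.shape[0])[:n]]
    cost = torch.cdist(pred, target).detach().cpu().numpy() ** 2   # matching cost
    row, col = linear_sum_assignment(cost)                         # Hungarian matching
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    return ((pred[row] - target[col]) ** 2).sum(dim=-1).mean()
```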
Figure 6: We visualize the obtained initialization of $\bar{\phi}$ for different categories. We use a single template shape per class, and learn $\bar{\phi}$ such that the shape represented via the deformation of a sphere matches the given template.

We visualize the resulting meshes learned for some categories in Figure 6, and show the sphere and the resulting shapes to highlight correspondence. We observe that these capture the underlying 3D mesh well, but exhibit certain artifacts, e.g. the hind legs of the cow (top right). While this initialization helps in learning, we note that the learned shape space and inferred deformations allow us to model shapes that significantly vary from this initial template, e.g. fat vs thin birds, articulated animals. Please see the main text and additional visualizations below for examples.
Boundary Reprojection Consistency.
To encourage our inferred 3D shapes to match the foreground mask boundary, we adapt the objectives proposed by Kar et al. [14]. Specifically, we encourage that the projected 3D points should lie inside the object, and that each point on the 2D boundary of the foreground mask should be close to some projected point(s).

Let us denote by $I_{fg}$ the foreground mask image corresponding to the RGB image I (from which we predicted a shape $z_s$ and camera π). Further, let $B_{fg}$ represent the 2D points on the mask boundary, and $D_{fg}$ the distance field induced by the mask. Additionally, let $P = \{\phi(u_i, z_s) \mid i = 1 \cdots N\}$, with $u_i$ randomly sampled on the sphere, represent a set of N points on the predicted 3D shape. Using these notations, our boundary reprojection objective can be formulated as:

$$L_{boundary} = \mathbb{E}_{p \in P}\, D_{fg}(\pi(p)) + \mathbb{E}_{b \in B_{fg}} \min_{p \in P} \|\pi(p) - b\|$$
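A sketch of this boundary objective, assuming a precomputed distance transform of the mask (e.g. obtained via `scipy.ndimage.distance_transform_edt`) and the same assumed `project` helper as in the earlier sketches.

```python
import torch
import torch.nn.functional as F

def boundary_loss(points3d, camera, dist_field, boundary_pts, project):
    """L_boundary: projected points should lie inside the mask (small distance-field
    value), and every mask boundary point should have some projected point nearby.

    points3d: (N, 3) samples on the predicted shape; dist_field: (H, W) distance
    transform of the foreground mask; boundary_pts: (B, 2) boundary pixels (x, y).
    `project` is an assumed weak-perspective projection returning pixel coords.
    """
    H, W = dist_field.shape
    proj = project(points3d, camera)                               # (N, 2) pixel coords
    # Differentiable lookup of D_fg at the projected locations via grid_sample.
    gx = 2 * proj[:, 0] / (W - 1) - 1                              # x to [-1, 1]
    gy = 2 * proj[:, 1] / (H - 1) - 1                              # y to [-1, 1]
    grid = torch.stack([gx, gy], dim=-1).view(1, -1, 1, 2)
    d = F.grid_sample(dist_field[None, None], grid, align_corners=True)
    inside_term = d.mean()                                         # E_p D_fg(pi(p))
    cover_term = torch.cdist(boundary_pts, proj).min(dim=1).values.mean()
    return inside_term + cover_term
```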
Figure 7: Our inferred 3D representation allows us to render a per-pixel spherical coordinate, which can then be used to transfer semantics across two different images.
Correspondence Transfer via 3D Inference.
As our category-specific implicit shape representation deforms a common sphere to yield the shape for any given instance, each mesh point is associated with a unique spherical coordinate. Given a predicted shape $z_s$ and pose π for an image I, we can therefore render a per-pixel spherical coordinate (akin to rendering a textured mesh), as shown in Figure 7. These per-pixel spherical mappings subsequently allow us to transfer semantics (e.g. keypoints) from a source image to a target image. For example, given a keypoint annotation in a source image, we can use the rendered spherical coordinate at that (or the nearest) pixel as a query to find the corresponding point in a target image.

Figures 8–12: Random Results. We show (from left to right) the input image, and the inferred 3D shape from the predicted view and a novel view, for 6 randomly sampled images from the test set per category.