Learning 3D Shape Completion under Weak Supervision
David Stutz · Andreas Geiger

Abstract
We address the problem of 3D shape completion from sparse and noisy point clouds, a fundamental problem in computer vision and robotics. Recent approaches are either data-driven or learning-based: data-driven approaches rely on a shape model whose parameters are optimized to fit the observations; learning-based approaches, in contrast, avoid the expensive optimization step by learning to directly predict complete shapes from incomplete observations in a fully-supervised setting. However, full supervision is often not available in practice. In this work, we propose a weakly-supervised learning-based approach to 3D shape completion which neither requires slow optimization nor direct supervision. While we also learn a shape prior on synthetic data, we amortize, i.e., learn, maximum likelihood fitting using deep neural networks, resulting in efficient shape completion without sacrificing accuracy. On synthetic benchmarks based on ShapeNet (Chang et al, 2015) and ModelNet (Wu et al, 2015) as well as on real robotics data from KITTI (Geiger et al, 2012) and Kinect (Yang et al, 2018), we demonstrate that the proposed amortized maximum likelihood approach is able to compete with the fully supervised baseline of Dai et al (2017) and outperforms the data-driven approach of Engelmann et al (2016), while requiring less supervision and being significantly faster.
David Stutz
Max Planck Institute for Informatics
Campus E1 4, 66123 Saarbrücken, Germany
Tel.: +49 681 9325 2021
E-mail: [email protected]

Andreas Geiger
Max Planck Institute for Intelligent Systems and University of Tübingen
Max-Planck-Ring 4, 72076 Tübingen, Germany

(a) ShapeNet (Synthetic) (b) KITTI (Real) (c) ModelNet (Synthetic) (d) Kinect (Real)
Fig. 1:
3D Shape Completion.
Results for cars on ShapeNet (Chang et al, 2015) and KITTI (Geiger et al, 2012) and for chairs and tables on ModelNet (Wu et al, 2015) and Kinect (Yang et al, 2018). Learning shape completion on real-world data is challenging due to sparse and noisy observations and missing ground truth. Occupancy grids (bottom) or meshes from signed distance functions (SDFs, top) at various resolutions in beige and point cloud observations in red.
Keywords
3D shape completion ·
3D reconstruction · weakly-supervised learning · amortized inference · benchmark
1 Introduction

3D shape perception is a long-standing and fundamental problem both in human and computer vision (Pizlo, 2007, 2010; Furukawa and Hernandez, 2013) with many applications to robotics. A large body of work focuses on 3D reconstruction, e.g., reconstructing objects or scenes from one or multiple views, which is an inherently ill-posed inverse problem where many configurations of shape, color, texture and lighting may result in the very same image. While the primary goal

[Fig. 2 diagram. Panel (1) Shape Prior (Section 3.2): a reference shape y is passed through an encoder to a code z and through a decoder to a reconstructed shape ỹ under a reconstruction loss; the fixed decoder is retained and no correspondence is needed. Panel (2) Shape Inference (Section 3.3): a real observation x without targets is passed through a new encoder and the fixed decoder to a proposed shape ỹ under a maximum likelihood loss.]
Fig. 2:
Amortized Maximum Likelihood (AML) for 3D Shape Completion on KITTI. (1) We train a denoising variational auto-encoder (DVAE) (Kingma and Welling, 2014; Im et al, 2017) as shape prior on ShapeNet using occupancy grids and signed distance functions (SDFs) to represent shapes. (2) The fixed generative model, i.e., decoder, then allows to learn shape completion using an unsupervised maximum likelihood (ML) loss by training a new recognition model, i.e., encoder. The retained generative model constrains the space of possible shapes while the ML loss aligns the predicted shape with the observations.

of human vision is to understand how the human visual system accomplishes such tasks, research in computer vision and robotics is focused on the task of devising 3D reconstruction systems. Generally, work by Pizlo (2010) suggests that the constraints and priors used for 3D perception are innate and not learned. Similarly, in computer vision, cues and priors are commonly built into 3D reconstruction pipelines through explicit assumptions. Recently, however – leveraging the success of deep learning – researchers started to learn shape models from large collections of data, as for example ShapeNet (Chang et al, 2015). Predominantly, generative models have been used to learn how to generate, manipulate and reason about 3D shapes (Girdhar et al, 2016; Brock et al, 2016; Sharma et al, 2016; Wu et al, 2016b, 2015).

In this paper, we focus on the specific problem of inferring and completing 3D shapes based on sparse and noisy 3D point observations as illustrated in Fig. 1. This problem occurs when only a single view of an individual object is provided or large parts of the object are occluded as common in robotic applications. For example, autonomous vehicles are commonly equipped with LiDAR scanners providing a 360 degree point cloud of the surrounding environment in real-time.
This point cloud is inherently incomplete: back and bottom of objects are typically occluded and – depending on material properties – the observations are sparse and noisy, see Fig. 1 (top-right) for an illustration. Similarly, indoor robots are generally equipped with low-cost, real-time RGB-D sensors providing noisy point clouds of the observed scene. In order to make informed decisions (e.g., for path planning and navigation), it is of utmost importance to efficiently establish a representation of the environment which is as complete as possible.

Existing approaches to 3D shape completion can be categorized into data-driven and learning-based methods. The former usually rely on learned shape priors and formulate shape completion as an optimization problem over the corresponding (lower-dimensional) latent space (Rock et al, 2015; Haene et al, 2014; Li et al, 2015; Engelmann et al, 2016; Nan et al, 2012; Bao et al, 2013; Dame et al, 2013; Nguyen et al, 2016). These approaches have demonstrated good performance on real data, e.g., on KITTI (Geiger et al, 2012), but are often slow in practice.

Learning-based approaches, in contrast, assume a fully supervised setting in order to directly learn shape completion on synthetic data (Riegler et al, 2017a; Smith and Meger, 2017; Dai et al, 2017; Sharma et al, 2016; Fan et al, 2017; Rezende et al, 2016; Yang et al, 2018; Wang et al, 2017; Varley et al, 2017; Han et al, 2017). They offer advantages in terms of efficiency as prediction can be performed in a single forward pass; however, they require full supervision during training. Unfortunately, even multiple, aggregated observations (e.g., from multiple views) will not be fully complete due to occlusion, sparse sampling of views and noise, see Fig. 14 (right column) for an example.

In this paper, we propose an amortized maximum likelihood approach for 3D shape completion (cf. Fig.
2), avoiding the slow optimization problem of data-driven approaches and the required supervision of learning-based approaches. Specifically, we first learn a shape prior on synthetic shapes using a (denoising) variational auto-encoder (Im et al, 2017; Kingma and Welling, 2014). Subsequently, 3D shape completion can be formulated as a maximum likelihood problem. However, instead of maximizing the likelihood independently for distinct observations, we follow the idea of amortized inference (Gershman and Goodman, 2014) and learn to predict the maximum likelihood solutions directly. Towards this goal, we train a new encoder which embeds the observations in the same latent space using an unsupervised maximum likelihood loss. This allows us to learn 3D shape completion in challenging real-world situations, e.g., on KITTI, and obtain sub-voxel accurate results using signed distance functions at resolutions up to 64 voxels. For experimental evaluation, we introduce two novel, synthetic shape completion benchmarks based on ShapeNet and ModelNet (Wu et al, 2015). We compare our approach to the data-driven approach by Engelmann et al (2016), a baseline inspired by Gupta et al (2015) and the fully-supervised learning-based approach by Dai et al (2017); we additionally present experiments on real data from KITTI and Kinect (Yang et al, 2018). Experiments show that our approach outperforms data-driven techniques and rivals learning-based techniques while significantly reducing inference time and using only a fraction of supervision.

A preliminary version of this work has been published at CVPR’18 (Stutz and Geiger, 2018). However, we improved the proposed shape completion method, the constructed datasets, and present more extensive experiments. In particular, we extended our weakly-supervised amortized maximum likelihood approach to enforce more variety and increase visual quality significantly.
On ShapeNet and ModelNet, we use volumetric fusion to obtain more detailed, watertight meshes and manually selected – per object category – 220 high-quality models to synthesize challenging observations. We additionally increased the spatial resolution and consider two additional baselines (Dai et al, 2017; Gupta et al, 2015). Our code and datasets will be made publicly available at https://avg.is.tuebingen.mpg.de/research_projects/3d-shape-completion.

The paper is structured as follows: We discuss related work in Section 2. In Section 3, we introduce the weakly-supervised shape completion problem and describe the proposed amortized maximum likelihood approach. Subsequently, we introduce our synthetic shape completion benchmarks and discuss the data preparation for KITTI and Kinect in Section 4.1. Next, we discuss evaluation in Section 4.2, our training procedure in Section 4.3, and the evaluated baselines in Section 4.4. Finally, we present experimental results in Section 4.5 and conclude in Section 5.
3D Shape Completion:
Following Sung et al (2015), classical shape completion approaches can roughly be categorized into symmetry-based methods and data-driven methods. The former leverage observed symmetry to complete shapes; representative works include (Thrun and Wegbreit, 2005; Pauly et al, 2008; Zheng et al, 2010; Kroemer et al, 2012; Law and Aliaga, 2011). Data-driven approaches, in contrast, as pioneered by Pauly et al (2005), pose shape completion as a retrieval and alignment problem. While Pauly et al (2005) allow shape deformations, Gupta et al (2015) use the iterative closest point (ICP) algorithm (Besl and McKay, 1992) for fitting rigid shapes. Subsequent work usually avoids explicit shape retrieval by learning a latent space of shapes (Rock et al, 2015; Haene et al, 2014; Li et al, 2015; Engelmann et al, 2016; Nan et al, 2012; Bao et al, 2013; Dame et al, 2013; Nguyen et al, 2016). Alignment is then formulated as an optimization problem over the learned, low-dimensional latent space. For example, Bao et al (2013) parameterize the shape prior through anchor points with respect to a mean shape, while Engelmann et al (2016) and Dame et al (2013) directly learn the latent space using principal component analysis and Gaussian process latent variable models (Prisacariu and Reid, 2011), respectively. In these cases, shapes are usually represented by signed distance functions (SDFs). Nguyen et al (2016) use 3DShapeNets (Wu et al, 2015), a deep belief network trained on occupancy grids, as shape prior. In general, data-driven approaches are applicable to real data assuming knowledge about the object category. However, inference involves a possibly complex optimization problem, which we avoid by amortizing, i.e., learning, the inference procedure.
Additionally, we also consider multiple object categories.

With the recent success of deep learning, several learning-based approaches have been proposed (Firman et al, 2016; Smith and Meger, 2017; Dai et al, 2017; Sharma et al, 2016; Rezende et al, 2016; Fan et al, 2017; Riegler et al, 2017a; Han et al, 2017; Yang et al, 2017). Here, shape retrieval and fitting are both avoided by directly learning shape completion end-to-end, under full supervision – usually on synthetic data from ShapeNet (Chang et al, 2015) or ModelNet (Wu et al, 2015). Riegler et al (2017a) additionally leverage octrees to predict higher-resolution shapes; most other approaches use low-resolution occupancy grids (e.g., 32 voxels). Instead, Han et al (2017) use a patch-based approach to obtain high-resolution results. In practice, however, full supervision is often not available; thus, existing models are primarily evaluated on synthetic datasets. In order to learn shape completion without full supervision, we utilize a learned shape prior to constrain the space of possible shapes. In addition, we use SDFs to obtain sub-voxel accuracy at higher resolutions (up to 48 × × voxels) without using patch-based refinement or octrees. We also consider significantly sparser observations.

Single-View 3D Reconstruction:
Single-view 3D reconstruction has received considerable attention over the last years; we refer to (Oswald et al, 2013) for an overview and focus on recent deep learning approaches instead. Following Tulsiani et al (2018), these can be categorized by the level of supervision. For example, (Girdhar et al, 2016; Choy et al, 2016; Wu et al, 2016b; Häne et al, 2017) require full supervision, i.e., pairs of images and ground truth 3D shapes. These are generally derived synthetically. More recent work (Yan et al, 2016; Tulsiani et al, 2017, 2018; Kato et al, 2017; Lin et al, 2017; Fan et al, 2017; Tatarchenko et al, 2017; Wu et al, 2016a), in contrast, self-supervises the problem by enforcing consistency across multiple input views. Tulsiani et al (2018), for example, use a differentiable ray consistency loss; and in (Yan et al, 2016; Kato et al, 2017; Lin et al, 2017), differentiable rendering allows to define reconstruction losses on the images directly. While most of these approaches utilize occupancy grids, Fan et al (2017) and Lin et al (2017) predict point clouds instead. Tatarchenko et al (2017) use octrees to predict higher-resolution shapes. Instead of employing multiple views as weak supervision, however, we do not assume any additional views in our approach. Instead, knowledge about the object category is sufficient. In this context, concurrent work by Gwak et al (2017) is more related to ours: a set of reference shapes implicitly defines a prior of shapes which is enforced using an adversarial loss. In contrast, we use a denoising variational auto-encoder (DVAE) (Kingma and Welling, 2014; Im et al, 2017) to explicitly learn a prior for 3D shapes.

2.2 Shape Models

Shape models and priors have found application in a wide variety of different tasks. In 3D reconstruction, in general, shape priors are commonly used to resolve ambiguities or specularities (Dame et al, 2013; Güney and Geiger, 2015; Kar et al, 2015).
Furthermore, pose estimation (Sandhu et al, 2011, 2009; Prisacariu et al, 2013; Aubry et al, 2014), tracking (Ma and Sibley, 2014; Leotta and Mundy, 2009), segmentation (Sandhu et al, 2011, 2009; Prisacariu et al, 2013), object detection (Zia et al, 2013, 2014; Pepik et al, 2015; Song and Xiao, 2014; Zheng et al, 2015) and recognition (Lin et al, 2014) – to name just a few – have been shown to benefit from shape models. While most of these works use hand-crafted shape models, for example based on anchor points or part annotations (Zia et al, 2013, 2014; Pepik et al, 2015; Lin et al, 2014), recent work (Liu et al, 2017; Sharma et al, 2016; Girdhar et al, 2016; Wu et al, 2016b, 2015; Smith and Meger, 2017; Nash and Williams, 2017) has shown that generative models such as VAEs (Kingma and Welling, 2014) or generative adversarial networks (GANs) (Goodfellow et al, 2014) allow to efficiently generate, manipulate and reason about 3D shapes. We use these more expressive models to obtain high-quality shape priors for various object categories.

2.3 Amortized Inference

To the best of our knowledge, the notion of amortized inference was introduced by Gershman and Goodman (2014) and picked up repeatedly in different contexts (Rezende and Mohamed, 2015; Wang et al, 2016; Ritchie et al, 2016). Generally, it describes the idea of learning to infer (or learning to sample). We refer to (Wang et al, 2016) for a broader discussion of related work. In our context, a VAE can be seen as a specific example of learned variational inference (Kingma and Welling, 2014; Rezende and Mohamed, 2015). Besides using a VAE as shape prior, we also amortize the maximum likelihood problem corresponding to our 3D shape completion task.
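The contrast between per-observation optimization and amortized inference can be made concrete with a small, hypothetical toy example (this is purely illustrative and not the paper's model; the decoder, loss, and all values are ours): for a one-dimensional decoder g(z) = z with a quadratic data term and a Gaussian prior, the per-observation maximum likelihood solution has the closed form z*(x) = x/(1 + λ), and a trained encoder z(x; w) = w·x recovers it.

```python
# Toy amortized inference sketch (hypothetical 1-D setup, not the paper's model).
# Per-observation ML would minimize  L(z; x) = (z - x)^2 + lam * z^2
# independently for every x; its closed-form minimizer is z*(x) = x / (1 + lam).
# Amortizing instead trains an encoder z(x; w) = w * x on the same loss over
# many observations, so inference later is a single forward pass.
import random

random.seed(0)
lam = 0.5   # weight of the prior term
w = 0.0     # encoder parameter
lr = 0.05   # SGD step size

for _ in range(2000):
    x = random.uniform(-1.0, 1.0)  # an "observation"
    z = w * x                      # amortized prediction
    grad = (2.0 * (z - x) + 2.0 * lam * z) * x  # dL/dw by the chain rule
    w -= lr * grad

# The learned encoder matches the per-observation optimum 1 / (1 + lam).
print(w, 1.0 / (1.0 + lam))
```

Here a single scalar weight suffices because the toy problem is linear; in the setting of this paper the encoder is a deep network and the decoder is the fixed generative model of the shape prior.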
3 Method

In the following, we introduce the mathematical formulation of the weakly-supervised 3D shape completion problem. Subsequently, we briefly discuss denoising variational auto-encoders (DVAEs) (Kingma and Welling, 2014; Im et al, 2017) which we use to learn a strong shape prior that embeds a set of reference

(a) Reference Shapes Y (b) Observation x_n (c) Ground Truth y*_n

Fig. 3:
Weakly-Supervised Shape Completion.
Given reference shapes Y and incomplete observations X, we want to learn a mapping x_n ↦ ỹ(x_n) such that ỹ(x_n) matches the unknown ground truth shape y*_n as close as possible. The observations x_n are split into free space (i.e., x_{n,i} = 0, right) and point observations (i.e., x_{n,i} = 1, left). Shapes are shown in beige and observations in red.

shapes in a low-dimensional latent space. Then, we formally derive our proposed amortized maximum likelihood (AML) approach. Here, we use maximum likelihood to learn an embedding of the observations within the same latent space – thereby allowing to perform shape completion. The overall approach is also illustrated in Fig. 2.

3.1 Problem Formulation

In a supervised setting, the task of 3D shape completion can be described as follows: Given a set of incomplete observations X = {x_n}_{n=1}^N ⊆ ℝ^R and corresponding ground truth shapes Y* = {y*_n}_{n=1}^N ⊆ ℝ^R, learn a mapping x_n ↦ y*_n that is able to generalize to previously unseen observations and possibly across object categories. We assume ℝ^R to be a suitable representation of observations and shapes; in practice, we resort to occupancy grids and signed distance functions (SDFs) defined on regular grids, i.e., x_n, y*_n ∈ ℝ^{H × W × D} ≅ ℝ^R. Specifically, occupancy grids indicate occupied space, i.e., voxel y*_{n,i} = 1 if and only if the voxel lies on or inside the shape’s surface. To represent shapes with sub-voxel accuracy, SDFs hold the distance of each voxel’s center to the surface; for voxels inside the shape’s surface, we use negative sign. Finally, for the (incomplete) observations, we write x_n ∈ {0, 1, ⊥}^R to make missing information explicit; in particular, x_{n,i} = ⊥ corresponds to unobserved voxels, while x_{n,i} = 1 and x_{n,i} = 0 correspond to occupied and unoccupied voxels, respectively.
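These representations can be sketched on a hypothetical 1-D toy grid (purely illustrative; the helper names and the surface placement are ours, not from the paper):

```python
# Toy 1-D illustration of the three representations above: occupancy,
# SDF (negative inside the shape), and observations with explicit
# "unobserved" entries (x_i = ⊥, encoded as None here).

# A "shape" occupying cells 3..5 of an 8-cell grid:
occupancy = [1 if 3 <= i <= 5 else 0 for i in range(8)]

def sdf(i, lo=3, hi=5):
    """Distance of cell center i to the surface; negative inside the shape."""
    if lo <= i <= hi:
        return -(min(i - lo, hi - i) + 0.5)
    return min(abs(i - lo), abs(i - hi)) - 0.5

# A sparse observation: one surface point (x_i = 1), two free-space cells
# (x_i = 0), everything else unobserved (None); only the non-None entries
# contribute to the maximum likelihood loss introduced below.
x = [0, 0, None, 1, None, None, None, None]
observed = [i for i, v in enumerate(x) if v is not None]
print(occupancy, [sdf(i) for i in range(8)], observed)
```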
On real data, e.g., KITTI (Geiger et al, 2012), supervised learning is often not possible as obtaining ground truth annotations is labor intensive, cf. (Menze and Geiger, 2015; Xie et al, 2016). Therefore, we target a weakly-supervised variant of the problem instead: Given observations X and reference shapes Y = {y_m}_{m=1}^M ⊆ ℝ^R, both of the same, known object category, learn a mapping x_n ↦ ỹ(x_n) such that the predicted shape ỹ(x_n) matches the unknown ground truth shape y*_n as close as possible – or, in practice, the sparse observation x_n, while being plausible considering the set of reference shapes, cf. Fig. 3. Here, supervision is provided in the form of the known object category. Alternatively, the reference shapes Y can also include multiple object categories, resulting in an even weaker notion of supervision as the correspondence between observations and object categories is unknown. Except for the object categories, however, the set of reference shapes Y, and its size M, is completely independent of the set of observations X, and its size N, as also highlighted in Fig. 2. On real data, e.g., KITTI, we additionally assume the object locations to be given in the form of 3D bounding boxes in order to extract the corresponding observations X. In practice, the reference shapes Y are derived from watertight, triangular meshes, e.g., from ShapeNet (Chang et al, 2015) or ModelNet (Wu et al, 2015).

3.2 Shape Prior

We approach the weakly-supervised shape completion problem by first learning a shape prior using a denoising variational auto-encoder (DVAE). Later, this prior constrains shape inference (see Section 3.3) to predict reasonable shapes. In the following, we briefly discuss the standard variational auto-encoder (VAE), as introduced by Kingma and Welling (2014), as well as its denoising extension, as proposed by Im et al (2017).

Variational Auto-Encoder (VAE):
We propose to use the provided reference shapes Y to learn a generative model of possible 3D shapes over a low-dimensional latent space Z = ℝ^Q, i.e., Q ≪ R. In the framework of VAEs, the joint distribution p(y, z) of shapes y and latent codes z decomposes into p(y|z) p(z) with p(z) being a unit Gaussian, i.e., N(z; 0, I_Q) with I_Q ∈ ℝ^{Q × Q} being the identity matrix. This decomposition allows to sample z ∼ p(z) and y ∼ p(y|z) to generate random shapes. For training, however, we additionally need to approximate the posterior p(z|y). To this end, the so-called recognition model q(z|y) ≈ p(z|y) takes the form

q(z|y) = N(z; µ(y), diag(σ²(y)))   (1)

where µ(y), σ²(y) ∈ ℝ^Q are predicted using the encoder neural network. The generative model p(y|z) decomposes over voxels y_i; the corresponding probabilities p(y_i|z) are represented using Bernoulli distributions for occupancy grids or Gaussian distributions for SDFs:

p(y_i|z) = Ber(y_i; θ_i(z))  or  p(y_i|z) = N(y_i; µ_i(z), σ²).   (2)

In both cases, the parameters, i.e., θ_i(z) or µ_i(z), are predicted using the decoder neural network. For SDFs, we explicitly set σ² to be constant (see Section 4.3). Then, σ² merely scales the corresponding loss, thereby implicitly defining the importance of accurate SDFs relative to occupancy grids as described below.

In the framework of variational inference, the parameters of the encoder and the decoder neural networks are found by maximizing the likelihood p(y). In practice, the likelihood is usually intractable and the evidence lower bound is maximized instead, see (Kingma and Welling, 2014; Blei et al, 2016). This results in the following loss to be minimized:

L_VAE(w) = −E_{q(z|y)}[ln p(y|z)] + KL(q(z|y) || p(z)).   (3)
Here, w are the weights of the encoder and decoder hidden in the recognition model q(z|y) and the generative model p(y|z), respectively. The Kullback-Leibler divergence KL can be computed analytically as described in the appendix of (Kingma and Welling, 2014). The negative log-likelihood −ln p(y|z) corresponds to a binary cross-entropy error for occupancy grids and a scaled sum-of-squared error for SDFs. The loss L_VAE is minimized using stochastic gradient descent (SGD) by approximating the expectation using samples:

−E_{q(z|y)}[ln p(y|z)] ≈ −(1/L) Σ_{l=1}^L ln p(y|z^{(l)}).   (4)

The required samples z^{(l)} ∼ q(z|y) are computed using the so-called reparameterization trick,

z^{(l)} = µ(y) + ε^{(l)} ⊙ σ(y)  with  ε^{(l)} ∼ N(ε; 0, I_Q),   (5)

in order to make L_VAE, specifically the sampling process, differentiable. In practice, we found L = 1 samples to be sufficient – which conforms with results by Kingma and Welling (2014). At test time, the sampling process z ∼ q(z|y) is replaced by the predicted mean µ(y). Overall, the standard VAE allows us to embed the reference shapes in a low-dimensional latent space. In practice, however, the learned prior might still include unreasonable shapes.

Denoising VAE (DVAE):
In order to avoid inappropriate shapes being included in our shape prior, we consider a denoising variant of the VAE allowing to obtain a tighter bound on the likelihood p(y). More specifically, a corruption process y′ ∼ p(y′|y) is considered and the corresponding evidence lower bound results in the following loss:

L_DVAE(w) = −E_{q(z|y′)}[ln p(y|z)] + KL(q(z|y′) || p(z)).   (6)

Note that the reconstruction error −ln p(y|z) is still computed with respect to the uncorrupted shape y while z, in contrast to Eq. (3), is sampled conditioned on the corrupted shape y′. In practice, the corruption process p(y′|y) is modeled using Bernoulli noise for occupancy grids and Gaussian noise for SDFs. In experiments, we found DVAEs to learn more robust latent spaces – meaning the prior is less likely to contain unreasonable shapes. In the following, we always use DVAEs as shape priors.

3.3 Shape Inference

After learning the shape prior, defining the joint distribution p(y, z) of shapes y and latent codes z as product of generative model p(y|z) and prior p(z), shape completion can be formulated as a maximum likelihood (ML) problem for p(y, z) over the lower-dimensional latent space Z = ℝ^Q. The corresponding negative log-likelihood −ln p(y, z) to be minimized can be written as

L_ML(z) = −Σ_{x_i ≠ ⊥} ln p(y_i = x_i|z) − ln p(z).   (7)

As the prior p(z) is Gaussian, the negative log-probability −ln p(z) is proportional to ‖z‖² and constrains the problem to likely, i.e., reasonable, shapes with respect to the shape prior. As before, the generative model p(y|z) decomposes over voxels; here, we can only consider actually observed voxels x_i ≠ ⊥. We assume that the learned shape prior can complete the remaining, unobserved voxels x_i = ⊥. Instead of solving Eq.
(7) for each observation x ∈ X independently, however, we follow the idea of amortized inference (Gershman and Goodman, 2014) and train a new encoder z(x; w) to learn ML. To this end, we keep the generative model p(y|z) fixed and train only the weights w of the new encoder z(x; w) using the ML objective as loss:

L_dAML(w) = −Σ_{x_i ≠ ⊥} ln p(y_i = x_i|z(x; w)) − λ ln p(z(x; w)).   (8)

Here, λ controls the importance of the shape prior. The exact form of the probabilities p(y_i = x_i|z) depends on the used shape representation. For occupancy grids, this term results in a cross-entropy error as both the predicted voxels y_i and the observations x_i are, for x_i ≠ ⊥, binary. For SDFs, however, the term is not well-defined as p(y_i|z) is modeled with a continuous Gaussian distribution, while the observations x_i are binary. As a solution, we could compute (signed) distance values along the rays corresponding to observed points (e.g., following (Steinbrucker et al, 2013)) in order to obtain continuous observations x_i ∈ ℝ for x_i ≠ ⊥. However, as illustrated in Fig. 4, noisy observations cause the distance values along the whole ray to be invalid. This can partly be avoided when relying only on occupancy to represent the observations; in this case, free space (cf. Fig. 3) observations are partly correct even though observed points may lie within the corresponding shapes.

For making SDFs tractable (i.e., to predict sub-voxel accurate, visually smooth and appealing shapes, see Section 4.5) while using binary observations, we propose to define p(y_i = x_i|z) through a simple transformation.
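As p(y_i|z) is Gaussian with mean µ_i(z) and fixed variance σ², the resulting occupancy probability is the Gaussian CDF evaluated at zero, which can be written via the error function. A minimal numeric sketch (the function name is ours):

```python
# Gaussian-to-Bernoulli transformation sketch: the probability that a
# Gaussian SDF prediction N(mu, sigma^2) is non-positive, i.e., occupied.
import math

def occupancy_prob(mu, sigma):
    """p(y_i <= 0) for y_i ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf(-mu / (sigma * math.sqrt(2.0))))

# Negative predicted distance (inside the surface) -> probability above 0.5,
# positive distance (outside) -> below 0.5, zero -> exactly 0.5.
print(occupancy_prob(-1.0, 0.5), occupancy_prob(0.0, 0.5), occupancy_prob(1.0, 0.5))
```

Applied voxel-wise, this acts as a differentiable non-linearity on top of the fixed decoder, so binary observations can supervise the SDF predictions directly.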
In particular, as p(y_i|z) is modeled using a Gaussian distribution N(y_i; µ_i(z), σ²) where µ_i(z) is predicted using the fixed decoder (σ² is constant), and x_i is binary (for x_i ≠ ⊥), we introduce a mapping θ_i(µ_i(z)) transforming the predicted mean SDF value to an occupancy probability θ_i(µ_i(z)):

p(y_i = x_i|z) = Ber(y_i = x_i; θ_i(µ_i(z))).   (9)

As, by construction (see Section 3.1), occupied voxels have negative sign or value zero in the SDF, we can derive the occupancy probability θ_i(µ_i(z)) as the probability of a non-positive distance:

θ_i(µ_i(z)) = N(y_i ≤ 0; µ_i(z), σ²)   (10)
            = ½ (1 + erf(−µ_i(z) / (σ√2))).   (11)

Here, erf is the error function which, in practice, can be approximated following (Abramowitz, 1974). Eq. (11) is illustrated in Fig. 4 where the occupancy probability θ_i(µ_i(z)) is computed as the area under the Gaussian bell curve for y_i ≤
0. This per-voxel transformation can easily be implemented as a non-linear layer and its derivative wrt. µ_i(z) is, by construction, a Gaussian. Note that the transformation is correct, not approximate, based on our model assumptions and the definitions in Section 3.1. Overall, this transformation allows us to easily minimize Eq. (8) for both occupancy grids and SDFs using binary observations. The obtained encoder embeds the observations in the latent shape space to perform shape completion.

Fig. 4:
Left: Problem with SDF Observations.
Illustration of a ray (red line) correctly hitting a surface (blue line), causing the (signed) distance values and occupancy values computed for voxels along the ray to be correct (cf. (a)). A noisy ray, however, causes all voxels along the ray to be assigned incorrect distance values (marked red) wrt. the true surface (blue line) because the ray ends far behind the actual surface (cf. (b)). When using occupancy only, in contrast, only the voxels behind the surface are assigned invalid occupancy states (marked red); the remaining voxels are labeled correctly (marked green; cf. (c)).
Right: Proposed Gaussian-to-Bernoulli Transformation.
For p(y_i) := p(y_i|z) = N(y_i; µ_i(z), σ²) (blue), we illustrate the transformation discussed in Section 3.3 allowing to use the binary observations x_i (for x_i ≠ ⊥) to supervise the SDF predictions. This is achieved by transforming the predicted Gaussian distribution to a Bernoulli distribution with occupancy probability θ_i(µ_i(z)) = p(y_i ≤
0) (blue area).

3.4 Practical Considerations
Encouraging Variety:
So far, our AML formulation assumes a deterministic encoder z(x; w) which predicts, given the observation x, a single code z corresponding to a completed shape. A closer look at Eq. (8), however, reveals an unwanted problem: the data term scales with the number of observations, i.e., |{x_i ≠ ⊥}|, while the regularization term stays constant – with fewer observations, the regularizer gains in importance, leading to limited variety in the predicted shapes because z(x; w) tends towards zero.

In order to encourage variety, we draw inspiration from the VAE shape prior. Specifically, we use a probabilistic recognition model

q(z|x) = N(z; µ(x), diag(σ²(x)))   (12)

(cf. Eq. (1)) and replace the negative log-likelihood −ln p(z) with the corresponding Kullback-Leibler divergence KL(q(z|x) || p(z)) with p(z) = N(z; 0, I_Q). Intuitively, this makes sure that the encoder’s predictions

(a) Original (b) TSDF Fusion, 256 (c) Simplification, 5k Faces (d) Reconstruction, 24 × × (e) Observations (f) Voxelization, 24 × ×

Fig. 5:
ShapeNet and ModelNet Data Generation Pipeline.
On ShapeNet and ModelNet we illustrate: (a) samples from the original datasets; (b) fused watertight meshes from TSDF fusion at 256 voxels resolution using (Riegler et al, 2017a); (c) simplified meshes (5k faces); (d) marching cubes (Lorensen and Cline, 1987) reconstructions from the SDFs computed from (c) (resolutions 24 × ×
24 and 32 voxels; note that steps (b) and (c) are necessary to derive exact SDFs); (e) observations obtained by projection into a single view; and (f) voxelized observations and shapes. Shapes (meshes and occupancy grids) in beige and observations in red.

“cover” the prior distribution – thereby enforcing variety. Mathematically, the resulting loss, i.e.,

L_AML(w) = −E_{q(z|x)}[Σ_{x_i ≠ ⊥} ln p(y_i = x_i|z)] + λ KL(q(z|x) || p(z)),   (13)

can be interpreted as the result of maximizing the evidence lower bound of a model with observation process p(x|y) (analogously to the corruption process p(y′|y) for DVAEs in (Im et al, 2017) and Section 3.2). The expectation is approximated using samples (following the reparameterization trick in Eq. (5)) and, during testing, the sampling process z ∼ q(z|x) is replaced by the mean prediction µ(x). In practice, we find that Eq. (13) improves visual quality of the completed shapes. We compare this AML model to its deterministic variant dAML in Section 4.5.

Handling Noise:
Another problem of our AML formulation concerns noise. On KITTI, for example, specular or transparent surfaces cause invalid observations: laser rays traversing through these surfaces cause observations to lie within shapes or not get reflected at all. Our AML framework, however, assumes deterministic, i.e., trustworthy, observations – as can be seen in the reconstruction error in Eq. (13). Therefore, we introduce per-voxel weights κ_i computed using the reference shapes Y = {y_m}_{m=1…M}:

κ_i = 1 − (1/M Σ_{m=1}^{M} y_{m,i}) ∈ [0, 1]   (14)

where y_{m,i} = 1 if and only if the corresponding voxel is occupied. Applied to observations x_i = 0, these are trusted less if they are unlikely under the shape prior. Note that for point observations, i.e., x_i = 1, this is not necessary as we explicitly consider "filled" shapes (see Section 4.1). This can also be interpreted as imposing an additional mean shape prior on the predicted shapes with respect to the observed free space. In addition, we use a corruption process p(x′ | x) consisting of Bernoulli and Gaussian noise during training (analogously to the DVAE shape prior). Table 1 additionally reports the average fraction of observed voxels, i.e., |{x_{n,i} ≠ ⊥}| / HWD, averaged over observations x_n.

ShapeNet:
We utilize the truncated SDF (TSDF) fusion approach of Riegler et al (2017a) to obtain watertight versions of the provided car shapes, allowing to reliably and efficiently compute occupancy grids and SDFs. Specifically, we use 100 depth maps per shape and perform TSDF fusion at a resolution of 256³ voxels. Detailed watertight meshes, without inner structures, can then be extracted using marching cubes (Lorensen and Cline, 1987) and simplified to 5k faces using MeshLab's quadratic simplification algorithm (Cignoni et al, 2008), see Fig. 5a to c. Finally, we manually selected 220 shapes from this collection, removing exotic cars, unwanted configurations, or shapes with large holes (e.g., missing floors or open windows).

The shapes are split into |Y| = 100 reference shapes, |Y*| = 100 shapes for training the inference model, and 20 test shapes. We randomly perturb rotation and scaling to obtain 5 variants of each shape, voxelize them using triangle-voxel intersections and subsequently "fill" the obtained volumes using a connected components algorithm (Jones et al, 2001). For computing SDFs we use SDFGen (https://github.com/christopherbatty/SDFGen). We use three different resolutions: H × W × D = 24 × 54 × 24, 32 × 72 × 32 and 48 × 108 × 48 voxels. Examples are shown in Fig. 5d to f. Finally, we use the OpenGL renderer of Güney and Geiger (2015) to obtain 10 depth maps per shape. The incomplete observations X are obtained by re-projecting them into 3D and marking voxels with at least one point as occupied and voxels between occupied voxels and the camera center as free space. We obtain more dense point clouds at 48 × 64 pixels resolution and sparser point clouds using depth maps of 24 × 32 pixels resolution. For the latter, more challenging case we also add exponentially distributed noise (with rate parameter 70) to the depth values, or randomly invalidate depth values; we refer to the resulting clean and noisy variants as SN-clean and SN-noisy, respectively. The obtained observations are illustrated in Fig. 5e.

Fig. 6: Extracted KITTI and Kinect Data. (a) KITTI, point clouds; (b) Kinect, occupancy grids. For KITTI, we show observed points in red and the accumulated, partial ground truth in green. Note that for the first example ground truth is not available due to missing past/future observations. For Kinect, we show observations in red and ElasticFusion (Whelan et al, 2015) ground truth in beige. Note that the objects are rotated and not aligned as in ModelNet (cf. Fig. 5).
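The observation generation just described — marking voxels that receive at least one re-projected point as occupied and voxels between those points and the camera center as free space — can be sketched as follows. This is an illustrative numpy sketch, not the authors' implementation: the pinhole intrinsics, the grid layout and the encoding (1 = occupied, 0 = free space, −1 = unobserved, standing in for ⊥) are all assumptions.

```python
import numpy as np

def voxelize_observation(depth, intrinsics, grid_min, voxel_size, grid_shape):
    """Project a depth map into a grid of observations: 1 = occupied
    (contains a re-projected point), 0 = observed free space between the
    camera center and a point, -1 = unobserved."""
    fx, fy, cx, cy = intrinsics
    x = np.full(grid_shape, -1, dtype=np.int8)
    vs, us = np.nonzero(depth > 0)  # valid pixels
    zs = depth[vs, us]
    # re-project valid pixels into 3D (pinhole model, camera at the origin)
    pts = np.stack([(us - cx) * zs / fx, (vs - cy) * zs / fy, zs], axis=1)

    def to_idx(p):
        return tuple(np.floor((p - grid_min) / voxel_size).astype(int))

    def inside(i):
        return all(0 <= i[d] < grid_shape[d] for d in range(3))

    # free space: walk along each ray from the camera center to the point
    for p in pts:
        steps = max(int(np.linalg.norm(p) / (0.5 * voxel_size)), 2)
        for t in np.linspace(0.0, 1.0, steps, endpoint=False)[1:]:
            i = to_idx(t * p)
            if inside(i) and x[i] < 0:
                x[i] = 0
    # occupied voxels override free space
    for p in pts:
        i = to_idx(p)
        if inside(i):
            x[i] = 1
    return x
```

A real pipeline would additionally fuse multiple views and clip rays at the bounding box, but this three-state encoding is all that the masked data term of Eq. (13) requires.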
KITTI:
We extract observations from KITTI's Velodyne point clouds using the provided ground truth 3D bounding boxes to avoid the inaccuracies of 3D object detectors (train/test split by Chen et al (2016)). As the 3D bounding boxes in KITTI fit very tightly, we first pad them by a factor of 0.25 on all sides; afterwards, the observed points are voxelized into voxel grids of size H × W × D = 24 × 54 × 24, 32 × 72 × 32 and 48 × 108 × 48 voxels.

Table 1: Dataset Statistics. We report, for the synthetic (SN-clean/SN-noisy, ModelNet) and real (KITTI, Kinect) datasets, the number of (rotated and scaled) meshes, used as reference shapes, the resulting number of observations (i.e., views, 10 per shape) and the training/test set sizes. We also report the average fraction of observed voxels, i.e., |{x_i ≠ ⊥}| / HWD; at low resolution, this amounts to 7.66%/3.86% on SN-clean/SN-noisy. For ModelNet, we exemplarily report statistics for chairs; and for Kinect, we report statistics for tables.

ModelNet:
We use ModelNet10, comprising 10 popular object categories (bathtub, bed, chair, desk, dresser, monitor, night stand, sofa, table, toilet) and select, for each category, the first 200 and 20 shapes from the provided training and test sets. Then, we follow the pipeline outlined in Fig. 5, as on ShapeNet, using 10 random variants per shape. Due to thin structures, however, SDF computation does not work well (especially for low resolution, e.g., 32³ voxels). Therefore, we approximate the SDFs using a 3D distance transform on the occupancy grids. Our experiments are conducted at resolutions of H × W × D = 32³, 48³ and 64³ voxels. Given the increased difficulty, we use resolutions of 64, 96 and 128 pixels for the observation-generating depth maps. In our experiments, we consider bathtubs, chairs, desks and tables individually, as well as all 10 categories together (resulting in 100k views overall). For Kinect, we additionally used a dataset of rotated chairs and tables aligned with Kinect's ground plane.

Fig. 7:
Network Architectures.
We use different resolutions for ShapeNet and KITTI as well as ModelNet and Kinect (bottom and top, respectively). In both cases, architectures for higher resolutions employ one additional stage in the en- and decoder (in gray). Each convolutional layer is followed by ReLU activations and batch normalization (Ioffe and Szegedy, 2015); the window sizes for max pooling and nearest-neighbor upsampling can be derived from the context; the number of channels is given in parentheses.
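For training (Section 4.3 below), SDFs are log-transformed and truncated, computing sign(y_i) log(1 + min(5, |y_i|)) per voxel. A minimal sketch of this transform; the function name and the vectorized numpy form are ours:

```python
import numpy as np

def log_tsdf(sdf, truncation=5.0):
    """Log-transformed truncated SDF: sign(y) * log(1 + min(t, |y|)).
    Truncation caps far-from-surface values; the logarithm increases the
    relative importance of values around the surface (the zero crossing)."""
    return np.sign(sdf) * np.log1p(np.minimum(truncation, np.abs(sdf)))
```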
Kinect:
Yang et al. provide Kinect scans of various chairs and tables. They provide both single-view observations as well as ground truth from ElasticFusion (Whelan et al, 2015) as occupancy grids. However, the ground truth is not fully accurate, and only 40 views are provided per object category. Still, the objects have been segmented to remove clutter and are appropriate for experiments in conjunction with ModelNet10. Unfortunately, Yang et al. do not provide SDFs; again, we use 3D distance transforms as approximation. Additionally, the observations do not indicate free space and we were required to guess an appropriate ground plane. For our experiments, we use 30 views for training and 10 views for testing, see Fig. 6 for examples.

4.2 Evaluation

For occupancy grids, we use Hamming distance (Ham) and intersection-over-union (IoU) between the (thresholded) predictions and the ground truth; note that for Ham, lower is better, while for IoU, higher is better. For SDFs, we consider a mesh-to-mesh distance on ShapeNet and a mesh-to-point distance on KITTI. We follow (Jensen et al, 2014) and consider accuracy (Acc) and completeness (Comp). To measure Acc, we uniformly sample roughly 10k points on the reconstructed mesh and average their distance to the target mesh. Analogously, Comp is the distance from the target mesh (or the ground truth points on KITTI) to the reconstructed mesh. Note that for both Acc and Comp, lower is better. On ShapeNet and ModelNet, we report both Acc and Comp in voxels, i.e., in multiples of the voxel edge length (i.e., in [vx], as we do not know the absolute scale of the models); on KITTI, we report Comp in meters (i.e., in [m]).

Fig. 8: DVAE Shape Prior. (a) Reconstructions, low and high resolution; (b) random samples, low and high resolution (cf. Table 1). Reconstructions and random samples on ShapeNet and ModelNet at multiple resolutions; false negative and false positive voxels in green and red. Our DVAE shape prior provides high-quality reconstructions and meaningful random samples across resolutions.

4.3 Architectures and Training

As depicted in Fig. 7, our network architectures are kept simple and shallow. Considering a resolution of 24 × 54 ×
24 voxels on ShapeNet and KITTI, the encoder comprises three stages, each consisting of two convolutional layers (followed by ReLU activations and batch normalization (Ioffe and Szegedy, 2015)) and max pooling; the decoder mirrors the encoder, replacing max pooling by nearest-neighbor upsampling. We consistently use 3 × 3 × 3 convolutional kernels. We use a latent space of size Q = 10 and predict occupancy using Sigmoid activations.

We found that the shape representation has a significant impact on training. Specifically, learning both occupancy grids and SDFs works better compared to training on SDFs only. Additionally, following prior art in single image depth prediction (Eigen and Fergus, 2015; Eigen et al, 2014; Laina et al, 2016), we consider log-transformed, truncated SDFs (logTSDFs) for training: given a signed distance y_i, we compute sign(y_i) log(1 + min(5, |y_i|)) as the corresponding log-transformed, truncated signed distance. TSDFs are commonly used in the literature (Newcombe et al, 2011; Riegler et al, 2017a; Dai et al, 2017; Engelmann et al, 2016; Curless and Levoy, 1996) and the logarithmic transformation additionally increases the relative importance of values around the surfaces (i.e., around the zero crossing).

Fig. 9: Comparison of AML and dAML. Our deterministic variant, dAML, produces inferior results. Predicted shapes in beige and observations in red at low resolution (24 × 54 × 24 voxels).

For training, we combine occupancy grids and logTSDFs in separate feature channels and randomly translate both by up to 3 voxels per axis. Additionally, we use Bernoulli noise (probability 0.1) and Gaussian noise (with small variance) during training. The shape prior is trained with a learning rate that is decayed by 0.925 every 215 iterations until a minimum learning rate has been reached; in addition, weight decay is applied. For shape inference, training takes 30 to 50 epochs and the learning rate is decayed analogously. Note that predicting σ directly may lead to difficulties during training, including divergence. On ShapeNet, ModelNet and Kinect, the weight λ of the Kullback-Leibler divergence KL (for both DVAE and (d)AML) was determined empirically, with λ = 1 used for all resolutions in the remaining cases. In practice, λ controls the trade-off between diversity (low λ) and quality (high λ) of the completed shapes. In addition, we reduce the weight in free space areas to one fourth on SN-noisy and KITTI to balance between occupied and free space. We implemented our networks in Torch (Collobert et al, 2011).

(a) DVAE t-SNE; (b) DVAE Projection; (c) AML t-SNE; (d) AML Projection
Fig. 10:
Learned Latent Spaces.
In (a) and (b), we show a t-SNE (van der Maaten and Hinton, 2008) visualization and a two-dimensional projection of the DVAE latent space on ModelNet10. The plots illustrate that the DVAE is able to separate the ten object categories. In (c) and (d), we show a t-SNE visualization and a projection of the latent space corresponding to our learned AML model on SN-clean. We randomly picked 10 ground truth shapes (shown as "x") and the corresponding observations (10 per shape, shown as points); gray pixels indicate the remaining shapes/observations. The plots illustrate that AML is able to associate observations with the corresponding ground truth shapes under weak supervision.

4.4 Baselines
Data-Driven Approaches:
We consider the works by Engelmann et al (2016) and Gupta et al (2015) as data-driven baselines. Additionally, we consider regular maximum likelihood (ML). Engelmann et al (2016) – referred to as Eng16 – use a principal component analysis shape prior trained on a manually selected set of car models (https://github.com/VisualComputingInstitute/ShapePriors_GCPR16). Shape completion is posed as an optimization problem considering both shape and pose. The pre-trained shape prior provided by Engelmann et al. assumes a ground plane which is, according to KITTI's LiDAR data, fixed at 1 m height. Thus, we do not need to optimize pose on KITTI as we use the ground truth bounding boxes; on ShapeNet, in contrast, we need to optimize both pose and shape to deal with the random rotations in SN-clean and SN-noisy.

Supervision    Method        SN-clean                           SN-noisy                           KITTI
in %                         Ham↓   IoU↑   Acc[vx]↓  Comp[vx]↓  Ham↓   IoU↑   Acc[vx]↓  Comp[vx]↓  Comp[m]↓
Low Resolution: 24 × 54 × 24 voxels; * independent of resolution
(shape prior)  DVAE          0.019  0.885  0.283     0.527      (same shape prior as on SN-clean)
*              Eng16         (mesh only)   1.235     1.237      (mesh only)   1.974     1.312      0.13
Low Resolution: 24 × 54 × 24 voxels; multiple views
100            Sup, k = 5    0.022  0.866  0.336     0.566      0.024  0.86   0.331     0.573      –
< 16           AML, k = 2    0.032  0.794  0.489     0.695      0.034  0.79   0.52      0.725      n/a
Medium Resolution: 32 × 72 × 32 voxels
(shape prior)  DVAE          0.019  0.877  0.24      0.47       (same shape prior as on SN-clean)
100            Sup           0.027  0.834  0.498     0.789      0.029  0.815  0.571     0.843      0.09
High Resolution: 48 × 108 × 48 voxels
(shape prior)  DVAE          0.018  0.87   0.272     0.434      (same shape prior as on SN-clean)
100            Sup           0.023  0.843  0.677     1.032      –      –      –         –          –

Table 2:
Quantitative Results on ShapeNet and KITTI.
We consider Hamming distance (Ham) and intersection over union (IoU) for occupancy grids as well as accuracy (Acc) and completeness (Comp) for meshes on SN-clean, SN-noisy and KITTI. For Ham, Acc and Comp, lower is better; for IoU, higher is better. The unit of Acc and Comp is voxels (i.e., multiples of the voxel edge length at the corresponding resolution) or meters on KITTI. Note that the DVAE shape prior (in gray) is only reported as reference (i.e., as a bound for (d)AML). We indicate the level of supervision in percent, relative to the corresponding resolution (see Table 1), and mark the best results under full supervision in red and under weak supervision in green.

Inspired by the work of Gupta et al (2015), we also consider a shape retrieval and fitting baseline. Specifically, we perform iterative closest point (ICP) (Besl and McKay, 1992) fitting on all training shapes and subsequently select the best-fitting one. To this end, we uniformly sample 1 million points on the training shapes, and perform point-to-point ICP for a maximum of 100 iterations using [R t] = [I 0] as initialization. On the training set, we verified that this approach is always able to retrieve the perfect shape.

Finally, we consider a simple ML baseline iteratively minimizing Eq. (7) using stochastic gradient descent (SGD). This baseline is similar to the work by Engelmann et al.; however, like our approach, it is bound to the voxel grid. Per example, we allow a maximum of 5000 iterations, starting with latent code z = 0 and a decaying learning rate.
Learning-Based Approaches:
Learning-based approaches usually employ an encoder-decoder architecture to directly learn a mapping from observations x_n to ground truth shapes y*_n in a fully supervised setting (Wang et al, 2017; Varley et al, 2017; Yang et al, 2018, 2017; Dai et al, 2017). While existing architectures differ slightly, they usually rely on a U-net architecture (Ronneberger et al, 2015; Cicek et al, 2016). In this paper, we use the approach of Dai et al (2017) – referred to as Dai17 – as a representative baseline for this class of approaches; we use the implementation at https://github.com/angeladai/cnncomplete. On ModelNet, we added one convolutional stage in the en- and decoder for larger resolutions; on ShapeNet and KITTI, we needed to adapt the convolutional strides to fit the corresponding resolutions. In addition, we consider a custom learning-based baseline – referred to as Sup – which uses the architecture of our DVAE shape prior, cf. Fig. 7. In contrast to (Dai et al, 2017), this baseline is also limited by the low-dimensional (Q = 10) bottleneck as it does not use skip connections.

4.5 Experimental Evaluation

Quantitative results are summarized in Table 2 (ShapeNet and KITTI) and 3 (ModelNet). Qualitative results
(a) SN-clean (top) and SN-noisy (bottom), low resolution (24 × 54 × 24); (b) ModelNet bathtubs, chairs, desks and tables, low resolution (32³). Fig. 11:
Qualitative Results on ShapeNet and ModelNet.
Results for AML, Dai17, Eng16, ICP and ML on SN-clean, SN-noisy and ModelNet's bathtubs, chairs, desks and tables. AML outperforms data-driven approaches (ML, Eng16, ICP) and rivals Dai17 while requiring significantly less supervision. Occupancy grids and meshes in beige, observations in red.

for the shape prior are shown in Fig. 8 and 10; shape completion results are shown in Fig. 11 (ShapeNet and ModelNet) and 14 (KITTI and Kinect).
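The occupancy metrics of Section 4.2 can be stated compactly. A sketch with numpy; thresholding soft predictions at 0.5 is our assumption:

```python
import numpy as np

def hamming_distance(pred, target):
    """Fraction of voxels on which the thresholded prediction and the
    ground truth occupancy grid disagree (lower is better)."""
    pred = (pred >= 0.5).astype(bool)
    target = target.astype(bool)
    return np.mean(pred != target)

def intersection_over_union(pred, target):
    """IoU between thresholded occupancy grids (higher is better)."""
    pred = (pred >= 0.5).astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both grids empty: define IoU as perfect
    return np.logical_and(pred, target).sum() / union
```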
Latent Space Dimensionality:
Regarding our DVAE shape prior, we found the dimensionality Q to be of crucial importance as it defines the trade-off between reconstruction accuracy and random sample quality (i.e., the quality of the generative model). A higher-dimensional latent space usually results in higher-quality reconstructions but also imposes the difficulty of randomly generating meaningful shapes. Across all datasets, we found Q = 10 to be suitable – which is significantly smaller compared to related work: 35 in (Liu et al, 2017), 6912 in (Sharma et al, 2016), 200 for (Wu et al, 2016b; Smith and Meger, 2017) or 64 in (Girdhar et al, 2016). Still, we are able to obtain visually appealing results. Finally, in Fig. 8 we show qualitative results, illustrating good reconstruction performance and reasonable random samples across resolutions.

Fig. 10 shows a t-SNE (van der Maaten and Hinton, 2008) visualization as well as a projection of the Q = 10 dimensional latent space, color coding the 10 object categories of ModelNet10. The DVAE clusters the object categories within the support region of the unit Gaussian. In the t-SNE visualization, we additionally see ambiguities arising in ModelNet10, e.g., night stands and dressers often look indistinguishable while monitors are very dissimilar to all other categories. Overall, these findings support our decision to use a DVAE with Q = 10 as shape prior.

Supervision    Method   bathtub        chair                              desk           table          ModelNet10
in %                    Ham↓   IoU↑    Ham↓   IoU↑   Acc[vx]↓  Comp[vx]↓  Ham↓   IoU↑   Ham↓   IoU↑   Ham↓   IoU↑
Low Resolution: 32³ voxels; * independent of resolution
(shape prior)  DVAE     0.015  0.699   0.025  0.517  0.884     0.72       0.028  0.555  0.011  0.608  0.023  0.714
*              ICP      (mesh only)    (mesh only)   1.483     0.89       (mesh only)   (mesh only)   (mesh only)
Medium Resolution: 48³ voxels
(shape prior)  DVAE     0.014  0.671   0.021  0.491  0.748     0.697      0.025  0.525  0.01   0.548   –      –
High Resolution: 64³ voxels
(shape prior)  DVAE     0.014  0.644   0.02   0.474  0.702     0.705      0.024  0.506  0.009  0.548   –      –

Table 3: Quantitative Results on ModelNet. Results for bathtubs, chairs, desks, tables and all ten categories combined (ModelNet10). As the ground truth SDFs are merely approximations (cf. Section 4.1), we concentrate on Hamming distance (Ham; lower is better) and intersection-over-union (IoU; higher is better). Only for chairs, we report accuracy (Acc) and completeness (Comp) in voxels (voxel edge length at 32³ voxels). We also indicate the level of supervision (see Table 1). Again, we report the DVAE shape prior as reference and mark the best weakly-supervised approach in green and the best fully-supervised approach in red.

Ablation Study:
In Table 2, we show quantitative results of our model on SN-clean and SN-noisy. First, we report the reconstruction quality of the DVAE shape prior as reference. Then, we consider the DVAE shape prior (Naïve) and its mean prediction (Mean) as simple baselines. The poor performance of both illustrates the difficulty of the benchmark. For AML, we also consider its deterministic variant, dAML (see Section 3). Quantitatively, there is essentially no difference; however, Fig. 9 demonstrates that AML is able to predict more detailed shapes. We also found that using both occupancy and SDFs is necessary to obtain good performance – as is using both point observations and free space.

Considering Fig. 10, we additionally demonstrate that the embedding learned by AML, i.e., the embedding of incomplete observations within the latent shape space, is able to associate observations with corresponding shapes even under weak supervision. In particular, we show a t-SNE visualization and a projection of the latent space for AML trained on SN-clean. We color-code 10 randomly chosen ground truth shapes, resulting in 100 observations (10 views per shape). AML is usually able to embed observations near the corresponding ground truth shapes, without explicit supervision (e.g., for violet, pink, blue or teal, the observations – points – are close to the corresponding ground truth shapes – "x"). Additionally, AML also matches the unit Gaussian prior distribution reasonably well.
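For reference, the weakly-supervised objective of Eq. (13) — a masked Bernoulli negative log-likelihood over observed voxels, a reparameterized expectation, and a weighted KL term — can be sketched as follows, optionally combined with the per-voxel weights κ of Eq. (14). The interface is ours: an abstract `decode` function stands in for the fixed DVAE decoder, and −1 encodes unobserved voxels (⊥).

```python
import numpy as np

def kl_unit_gaussian(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def aml_loss(x, mu, log_var, decode, lam=1.0, kappa=None, n_samples=1, rng=None):
    """Monte Carlo estimate of Eq. (13): negative log-likelihood over
    observed voxels (x != -1) plus lambda-weighted KL term. `decode` maps
    a latent code z to per-voxel occupancy probabilities."""
    rng = np.random.default_rng(0) if rng is None else rng
    observed = x != -1
    nll = 0.0
    for _ in range(n_samples):
        # reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
        p = np.clip(decode(z), 1e-7, 1 - 1e-7)
        # Bernoulli log-likelihood of the observations
        ll = np.where(x == 1, np.log(p), np.log(1 - p))
        if kappa is not None:
            # Eq. (14): down-weight free-space observations (x == 0) that
            # are unlikely under the shape prior
            ll = np.where(x == 0, kappa * ll, ll)
        nll -= ll[observed].sum()
    return nll / n_samples + lam * kl_unit_gaussian(mu, log_var)
```

At test time, the sampling step would be replaced by the mean prediction µ(x), as described after Eq. (13).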
Comparison to Baselines on Synthetic Data:
For ShapeNet, Table 2 demonstrates that AML outperforms data-driven approaches such as Eng16, ICP and ML and is able to compete with fully-supervised approaches, Dai17 and Sup, while using only 8% or less supervision. We also note that AML outperforms ML, illustrating that amortized inference is beneficial. Furthermore, Dai17 outperforms Sup, illustrating the advantage of propagating low-level information (through skip connections) without bottleneck. Most importantly, the performance gap between AML and Dai17 is rather small considering the difference in supervision (more than 92%) and, on SN-noisy, the drop in performance for Dai17 and Sup is larger than for AML, suggesting that AML handles noise and sparsity more robustly. Fig. 11 shows that these conclusions also apply visually, where AML performs on par with Dai17.

For ModelNet, in Table 3, we mostly focus on occupancy grids (as the derived SDFs are approximate, cf. Section 4.1) and show that chairs, desks or tables are more difficult. However, AML is still able to predict high-quality shapes, outperforming data-driven approaches. Additionally, in comparison to ShapeNet, the gap between AML and fully-supervised approaches (Dai17 and Sup) is surprisingly small – not reflecting the difference in supervision. This means that even under full supervision, these object categories are difficult to complete. In terms of accuracy (Acc) and completeness (Comp), e.g., for chairs, AML outperforms ICP and ML; Dai17 and Sup, on the other hand, outperform AML. Still, considering Fig. 11, AML predicts visually appealing meshes although the reference shape SDFs on ModelNet are merely approximate. Qualitatively, AML also outperforms its data-driven rivals; only Dai17 predicts shapes slightly closer to the ground truth.

Fig. 12: Multi-View and Higher-Resolution Results on ShapeNet and ModelNet. (a) SN-clean and SN-noisy, k views, low resolution (24 × 54 × 24); (b) SN-clean and SN-noisy, medium (32 × 72 × 32) and high (48 × 108 × 48) resolution; (c) ModelNet desks and chairs, medium (48³) and high (64³) resolution. While AML is designed for especially sparse observations, it also performs well in a multi-view setting. Additionally, higher resolutions allow to predict more detailed shapes. Shapes, occupancy grids or meshes, in beige and observations in red.
Multiple Views and Higher Resolutions:
In Table 2, we consider multiple, k ∈ {2, 3, 5}, randomly fused observations (from the 10 views per shape). Generally, additional observations are beneficial (also cf. Fig. 12); however, fully-supervised approaches such as Dai17 benefit more significantly than AML. Intuitively, especially on SN-noisy, k = 5 noisy observations seem to impose contradictory constraints that cannot be resolved under weak supervision. We also show that higher resolution allows both AML and Dai17 to predict more detailed shapes, see Fig. 12; for AML this is significant as, e.g., on SN-noisy, the level of supervision reduces to less than 1%. Also note that AML is able to handle the slightly asymmetric desks in Fig. 12 due to the strong shape prior, which itself includes symmetric and less symmetric shapes.

Fig. 13: Category-Agnostic Results on ModelNet10. AML is able to recover detailed shapes of the correct object category even without category supervision (as provided to Dai17). Shapes (occupancy grids and meshes) in beige and observations in red at low resolution (32³ voxels).

Multiple Object Categories:
We also investigate the category-agnostic case, considering all ten ModelNet10 object categories; here, we train a single DVAE shape prior (as well as a single model for Dai17 and Sup) across all ten object categories. As can be seen in Table 3, the gap between AML and the fully-supervised approaches, Dai17 and Sup, further shrinks; even fully-supervised methods have difficulties distinguishing object categories based on sparse observations. Fig. 13 shows that AML is able to not only predict reasonable shapes, but also identify the correct object category. In contrast to Dai17, which predicts slightly more detailed shapes, this is significant as AML does not have access to object category information during training.
Comparison on Real Data:
On KITTI, considering Fig. 14, we illustrate that AML consistently predicts detailed shapes regardless of the noise and sparsity in the inputs. Our qualitative results suggest that AML is able to predict more detailed shapes compared to Dai17 and Eng16; additionally, Eng16 is distracted by sparse and noisy observations. Quantitatively, instead, Dai17 and Sup outperform AML. However, this is mainly due to two factors: first, the ground truth collected on KITTI rarely covers the full car; and second, we put significant effort into faithfully modeling KITTI's noise statistics in SN-noisy, allowing Dai17 and Sup to generalize very well. The latter effort, especially, can be avoided by using our weakly-supervised approach, AML.
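The SN-noisy noise model referred to here (cf. Section 4.1) is simple to reproduce approximately: exponentially distributed noise with rate parameter 70 added to the depth values, plus random invalidation of depth values. In this sketch, the invalidation probability and the convention that 0 encodes an invalid depth are our assumptions:

```python
import numpy as np

def perturb_depth(depth, rate=70.0, p_invalid=0.1, rng=None):
    """Inject KITTI-style noise into a synthetic depth map: exponentially
    distributed noise (rate parameter 70, as for SN-noisy) added to the
    depth values, and random invalidation with probability p_invalid
    (the exact probability used for SN-noisy is an assumption here)."""
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = depth + rng.exponential(1.0 / rate, size=depth.shape)
    invalid = rng.random(depth.shape) < p_invalid
    noisy[invalid] = 0.0  # 0 encodes an invalid/missing depth value
    return noisy
```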
(a) KITTI, medium resolution (32 × 72 × 32); (b) Kinect, low resolution (32³). Fig. 14:
Qualitative Results on KITTI and Kinect.
On KITTI, AML visually outperforms both Dai17 and Eng16 while being faster and requiring less supervision. On Kinect, AML demonstrates that it is able to generalize from as few as 30 training samples. Predicted shapes (occupancy grids or meshes) in beige and observations in red; additionally, partial ground truth in green.

On Kinect, also considering Fig. 14, only 30 observations are available for training. It can be seen that AML predicts reasonable shapes for tables. We find it interesting that AML is able to generalize from only 30 training examples. In this sense, AML functions similarly to ML, in that the objective is trained to overfit to few samples. This, however, cannot work in all cases, as demonstrated by the chairs, where AML tries to predict a suitable chair but does not fit the observations as well. Another problem witnessed on Kinect is that the shape prior training samples need to be aligned to the observations (with respect to the viewing angles). For the chairs, we were not able to guess the viewing trajectory correctly (cf. (Yang et al, 2018)).
Failure Cases:
AML and Dai17 often face similar problems, as illustrated in Fig. 15, suggesting that these problems are inherent to the used shape representations or the learning approach, independent of the level of supervision. For example, both AML and Dai17 have
(a) Difficulties with Exotic Shapes and Fine Structures
(b) Difficulties with Multiple Object Categories
Fig. 15:
Failure Cases.
On the top, we show that AML has difficulties with exotic shapes not represented in the latent space; and both AML and Dai17 have difficulties with fine details. The bottom row demonstrates that it is difficult to infer the correct object category from sparse observations, even under full supervision as required by Dai17. Shapes (occupancy grids and meshes) in beige and observations in red, from various resolutions.

problems with fine, thin structures that are hard to reconstruct properly at any resolution. Furthermore, identifying the correct object category on ModelNet10 from sparse observations is difficult for both AML and Sup. Finally, AML additionally has difficulties with exotic objects that are not well represented in the latent shape space, e.g., designer chairs.
Runtime:
At low resolution, AML as well as the fully-supervised approaches, Dai17 and Sup, are particularly fast, requiring up to 2 ms on an NVIDIA™ GeForce® GTX TITAN using Torch (Collobert et al, 2011). Data-driven approaches (e.g., Eng16, ICP and ML), on the other hand, take considerably longer. Eng16, for instance, requires 168 ms on average for completing the shape of a sparse LiDAR observation from KITTI using an Intel® Xeon® E5-2690 @ 2.6 GHz and the multi-threaded Ceres solver (Agarwal et al, 2012). ICP and ML take longest, requiring up to 38 s and 75 s (not taking into account the point sampling process for the shapes), respectively. Except for Eng16 and ICP, all approaches scale with the used resolution and the employed architecture.

5 Conclusion

In this paper, we presented a novel, weakly-supervised learning-based approach to 3D shape completion from sparse and noisy point cloud observations. We used a (denoising) variational auto-encoder (Im et al, 2017; Kingma and Welling, 2014) to learn a latent space of shapes for one or multiple object categories using synthetic data from ShapeNet (Chang et al, 2015) or ModelNet (Wu et al, 2015). Based on the learned generative model, i.e., decoder, we formulated 3D shape completion as a maximum likelihood problem. In a second step, we then fixed the learned generative model and trained a new recognition model, i.e., encoder, to amortize, i.e., learn, the maximum likelihood problem. Thus, our
Amortized Maximum Likelihood (AML) approach to 3D shape completion can be trained in a weakly-supervised fashion. Compared to related data-driven approaches, e.g., (Rock et al, 2015; Haene et al, 2014; Li et al, 2015; Engelmann et al, 2016, 2017; Nan et al, 2012; Bao et al, 2013; Dame et al, 2013; Nguyen et al, 2016), our approach offers fast inference at test time; in contrast to other learning-based approaches, e.g., (Riegler et al, 2017a; Smith and Meger, 2017; Dai et al, 2017; Sharma et al, 2016; Fan et al, 2017; Rezende et al, 2016; Yang et al, 2018; Wang et al, 2017; Varley et al, 2017; Han et al, 2017), we do not require full supervision during training. Both characteristics render our approach useful for robotic scenarios where full supervision is often not available, such as in autonomous driving, e.g., on KITTI (Geiger et al, 2012), or indoor robotics, e.g., on Kinect (Yang et al, 2018).

On two newly created synthetic shape completion benchmarks, derived from ShapeNet's cars and ModelNet10, as well as on real data from KITTI and Kinect, we demonstrated that AML outperforms related data-driven approaches (Engelmann et al, 2016; Gupta et al, 2015) while being significantly faster. We further showed that AML is able to compete with fully-supervised approaches (Dai et al, 2017), both quantitatively and qualitatively, while using only 3 −
10% supervision or less. In contrast to (Rock et al, 2015; Haene et al, 2014; Li et al, 2015; Engelmann et al, 2016, 2017; Nan et al, 2012; Bao et al, 2013; Dame et al, 2013), we additionally showed that AML is able to generalize across object categories without category supervision during training. On Kinect, we also demonstrated that our AML approach is able to generalize from very few training examples. In contrast to (Girdhar et al, 2016; Liu et al, 2017; Sharma et al, 2016; Wu et al, 2015; Dai et al, 2017; Firman et al, 2016; Han et al, 2017; Fan et al, 2017), we considered resolutions up to 48 and 64 voxels as well as significantly sparser observations. Overall, our experiments demonstrate two key advantages of the proposed approach: significantly reduced runtime and increased performance compared to data-driven approaches, showing that amortizing inference is highly effective.

In future work, we would like to address several aspects of our AML approach. First, the shape prior is essential for weakly-supervised shape completion, as also noted by Gwak et al (2017); however, training expressive generative models in 3D is still difficult. Second, larger resolutions imply significantly longer training times; alternative shape representations and data structures such as point clouds (Qi et al, 2017a,b; Fan et al, 2017) or octrees (Riegler et al, 2017b,a; Häne et al, 2017) might be beneficial. Finally, jointly tackling pose estimation and shape completion seems promising (Engelmann et al, 2016).

References
Abramowitz M (1974) Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover Publications
Agarwal S, Mierle K, Others (2012) Ceres solver. http://ceres-solver.org
Aubry M, Maturana D, Efros A, Russell B, Sivic J (2014) Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Bao S, Chandraker M, Lin Y, Savarese S (2013) Dense object reconstruction with semantic priors. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Besl P, McKay ND (1992) A method for registration of 3D shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 14:239–256
Blei DM, Kucukelbir A, McAuliffe JD (2016) Variational inference: A review for statisticians. arXiv.org 1601.00670
Brock A, Lim T, Ritchie JM, Weston N (2016) Generative and discriminative voxel modeling with convolutional neural networks. arXiv.org 1608.04236
Chang AX, Funkhouser TA, Guibas LJ, Hanrahan P, Huang Q, Li Z, Savarese S, Savva M, Song S, Su H, Xiao J, Yi L, Yu F (2015) ShapeNet: An information-rich 3D model repository. arXiv.org 1512.03012
Chen X, Kundu K, Zhu Y, Ma H, Fidler S, Urtasun R (2016) 3D object proposals using stereo imagery for accurate object class detection. arXiv.org 1608.07711
Choy CB, Xu D, Gwak J, Chen K, Savarese S (2016) 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In: Proc. of the European Conf. on Computer Vision (ECCV)
Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O (2016) 3D U-Net: Learning dense volumetric segmentation from sparse annotation. arXiv.org 1606.06650
Cignoni P, Callieri M, Corsini M, Dellepiane M, Ganovelli F, Ranzuglia G (2008) MeshLab: an open-source mesh processing tool
Collobert R, Kavukcuoglu K, Farabet C (2011) Torch7: A Matlab-like environment for machine learning. In: Advances in Neural Information Processing Systems (NIPS) Workshops
Curless B, Levoy M (1996) A volumetric method for building complex models from range images. In: ACM Trans. on Graphics (SIGGRAPH)
Dai A, Qi CR, Nießner M (2017) Shape completion using 3D-encoder-predictor CNNs and shape synthesis. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Dame A, Prisacariu V, Ren C, Reid I (2013) Dense reconstruction using 3D object shape priors. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Eigen D, Fergus R (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV)
Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems (NIPS)
Engelmann F, Stückler J, Leibe B (2016) Joint object pose estimation and shape reconstruction in urban street scenes using 3D shape priors. In: Proc. of the German Conference on Pattern Recognition (GCPR)
Engelmann F, Stückler J, Leibe B (2017) SAMP: shape and motion priors for 4D vehicle reconstruction. In: Proc. of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp 400–408
Fan H, Su H, Guibas LJ (2017) A point set generation network for 3D object reconstruction from a single image. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Firman M, Mac Aodha O, Julier S, Brostow GJ (2016) Structured prediction of unobserved voxels from a single depth image. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Furukawa Y, Hernandez C (2013) Multi-view stereo: A tutorial. Foundations and Trends in Computer Graphics and Vision 9(1-2):1–148
Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Gershman S, Goodman ND (2014) Amortized inference in probabilistic reasoning. In: Proc. of the Annual Meeting of the Cognitive Science Society
Girdhar R, Fouhey DF, Rodriguez M, Gupta A (2016) Learning a predictable and generative vector representation for objects. In: Proc. of the European Conf. on Computer Vision (ECCV)
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Conference on Artificial Intelligence and Statistics (AISTATS)
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville AC, Bengio Y (2014) Generative adversarial nets. In: Advances in Neural Information Processing Systems (NIPS)
Güney F, Geiger A (2015) Displets: Resolving stereo ambiguities using object knowledge. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Gupta S, Arbeláez PA, Girshick RB, Malik J (2015) Aligning 3D models to RGB-D images of cluttered scenes. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Gwak J, Choy CB, Garg A, Chandraker M, Savarese S (2017) Weakly supervised generative adversarial networks for 3D reconstruction. arXiv.org 1705.10904
Haene C, Savinov N, Pollefeys M (2014) Class specific 3D object shape priors using surface normals. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Han X, Li Z, Huang H, Kalogerakis E, Yu Y (2017) High-resolution shape completion using deep neural networks for global structure and local geometry inference. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV), pp 85–93
Häne C, Tulsiani S, Malik J (2017) Hierarchical surface prediction for 3D object reconstruction. arXiv.org 1704.00710
Im DJ, Ahn S, Memisevic R, Bengio Y (2017) Denoising criterion for variational auto-encoding framework. In: Proc. of the Conf. on Artificial Intelligence (AAAI), pp 2059–2065
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proc. of the International Conf. on Machine Learning (ICML)
Jensen RR, Dahl AL, Vogiatzis G, Tola E, Aanæs H (2014) Large scale multi-view stereopsis evaluation. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)
Jones E, Oliphant T, Peterson P, et al (2001) SciPy: Open source scientific tools for Python. URL