3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation

Abstract

We present 3DMV, a novel method for 3D semantic scene segmentation of RGB-D scans in indoor environments using a joint 3D-multi-view prediction network. In contrast to existing methods that either use geometry or RGB data as input for this task, we combine both data modalities in a joint, end-to-end network architecture. Rather than simply projecting color data into a volumetric grid and operating solely in 3D, which would result in insufficient detail, we first extract feature maps from associated RGB images. These features are then mapped into the volumetric feature grid of a 3D network using a differentiable backprojection layer. Since our target is 3D scanning scenarios with possibly many frames, we use a multi-view pooling approach in order to handle a varying number of RGB input views. This learned combination of RGB and geometric features with our joint 2D-3D architecture achieves significantly better results than existing baselines. For instance, our final result on the ScanNet 3D segmentation benchmark increases from 52.8% to 75% accuracy compared to existing volumetric architectures.


Angela Dai (Stanford University), Matthias Nießner (Technical University of Munich)

Fig. 1.


Corresponding author: [email protected]

Semantic scene segmentation is important for a large variety of applications as it enables understanding of visual data. In particular, deep learning-based approaches have led to remarkable results in this context, allowing prediction of accurate per-pixel labels in images [2,3]. Typically, these approaches operate on a single RGB image; however, one can easily formulate the analogous task in 3D on a per-voxel basis [1], which is a common scenario in the context of 3D scene reconstruction methods. In contrast to the 2D task, the third dimension offers a unique opportunity as it not only predicts semantics, but also provides a spatial semantic map of the scene content based on the underlying 3D representation. This is particularly relevant for robotics applications since a robot relies not only on information of what is in a scene but also needs to know where things are.

In 3D, the representation of a scene is typically obtained from RGB-D surface reconstruction methods [4,5,6,7], which often store scanned geometry in a 3D voxel grid where the surface is encoded by an implicit surface function such as a signed distance field [8]. One approach towards analyzing these reconstructions is to leverage a CNN with 3D convolutions, which has been used for shape classification [9,10], and recently also for predicting dense semantic 3D voxel maps [11,1,12]. In theory, one could simply add an additional color channel to the voxel grid in order to incorporate RGB information; however, the limited voxel resolution prevents encoding feature-rich image data.

In this work, we specifically address this problem of how to incorporate RGB information for the 3D semantic segmentation task, and leverage the combined geometric and RGB signal in a joint, end-to-end approach. To this end, we propose a novel network architecture that takes as input the 3D scene representation as well as the input of nearby views in order to predict a dense semantic label set on the voxel grid. Instead of mapping color data directly on the voxel grid, the core idea is to first extract 2D feature maps from 2D images using the full-resolution RGB input. These features are then downsampled through convolutions in the 2D domain, and the resulting 2D feature map is subsequently backprojected into 3D space. In 3D, we leverage a 3D convolutional network architecture to learn from both the backprojected 2D features as well as 3D geometric features. This way, we can join the benefits of existing approaches and leverage all available information, significantly improving on existing approaches.

Our main contribution is the formulation of a joint, end-to-end convolutional neural network which learns to infer 3D semantics from both 3D geometry and 2D RGB input. In our evaluation, we provide a comprehensive analysis of the design choices of the joint 2D-3D architecture, and compare it with current state-of-the-art methods. In the end, our approach increases 3D segmentation accuracy from 52.8% to 75% compared to the best existing volumetric architecture.

Deep Learning in 3D.

An important avenue for 3D scene understanding has been opened through recent advances in deep learning. Similar to the 2D domain, convolutional neural networks (CNNs) can operate in volumetric domains using an additional spatial dimension for the filter banks. 3D ShapeNets [13] was one of the first works in this context; they learn a 3D convolutional deep belief network from a shape database. Several works have followed, using 3D CNNs for object classification [14,10] or generative scene completion tasks [15,16,12]. In order to address the memory and compute requirements, hierarchical 3D CNNs have been proposed to more efficiently represent and process 3D volumes [17,18,19,20,21,16]. The spatial extent of a 3D CNN can also be increased with dilated convolutions [22], which have been used to predict missing voxels and infer semantic labels [11], or by using fully-convolutional networks in order to decouple the dimensions of training and test time [12]. Very recently, we have also seen network architectures that operate on an (unstructured) point-based representation [23,24].

Multi-view Deep Networks.

An alternative way of learning a classifier on 3D input is to render the geometry, run a 2D feature extractor, and combine the extracted features using max pooling. The multi-view CNN approach by Su et al. [25] was one of the first to propose such an architecture for object classification. However, since the output is a classification score, this architecture does not spatially correlate the accumulated 2D features.

Very recently, a multi-view network has been proposed for part-based mesh segmentation [26]. Here, 2D confidence maps of each part label are projected on top of ShapeNet [13] models, where a mesh-based CRF accumulates inputs of multiple images to predict the part labels on the mesh geometry. This approach handles only relatively small label sets (e.g., 2-6 part labels), and its input is 2D renderings of the 3D meshes; i.e., the multi-view input is meant as a replacement input for 3D geometry. Although these methods are not designed for 3D semantic segmentation, we consider them as the main inspiration for our multi-view component.

Multi-view networks have also been proposed in the context of stereo reconstruction. For instance, Choi et al. [27] use an RNN to accumulate features from different views and Tulsiani et al. [28] propose an unsupervised approach that takes multi-view input to learn a latent 3D space for 3D reconstruction. Another work in the context of stereo reconstruction was proposed by Kar et al. [29], which uses a sequence of 2D input views to reconstruct ShapeNet [13] models. An alternative way to combine several input views with 3D is to project colors directly into the voxels, maintaining one channel for each input view per voxel [30]. However, due to memory requirements, this becomes impractical for a large number of input views.

3D Semantic Segmentation.

Semantic segmentation on 2D images is a popular task and has been heavily explored using cutting-edge neural network approaches [2,3]. The analogous task can be formulated in 3D, where the goal is to predict semantic labels on a per-voxel level [31,32]. Although this is a relatively recent task, it is extremely relevant to a large range of applications, in particular robotics, where a spatial understanding of the inferred semantics is essential. For the 3D semantic segmentation task, several datasets and benchmarks have recently been developed. The ScanNet [1] dataset introduced a 3D semantic segmentation task on approx. 1.5k RGB-D scans and reconstructions obtained with a Structure Sensor. It provides ground truth annotations for training, validation, and testing directly on the 3D reconstructions; it also includes approx. 2.5 million RGB-D frames whose 2D annotations are derived using rendered 3D-to-2D projections. Matterport3D [33] is another recent dataset of about 90 building-scale scenes in the same spirit as ScanNet; it includes fewer RGB-D frames (approx. 194,400) but has more complete reconstructions.

The goal of our method is to predict a 3D semantic segmentation based on the input of commodity RGB-D scans. More specifically, we want to infer semantic class labels on a per-voxel level of the grid of a 3D reconstruction. To this end, we propose a joint 2D-3D neural network that leverages both RGB and geometric information obtained from 3D scans. For the geometry, we consider a regular volumetric grid whose voxels encode a ternary state (known-occupied, known-free, unknown). To perform semantic segmentation on full 3D scenes, our network operates on a per-chunk basis; i.e., it predicts columns of a scene in a sliding-window fashion through the xy-plane at test time. For a given xy-location in a scene, the network takes as input the volumetric grid of the surrounding area (chunks of 31 × 31 × 62 voxels). The network then extracts geometric features using a series of 3D convolutions, and predicts per-voxel class labels for the center column at the current xy-location. In addition to the geometry, we select nearby RGB views at the current xy-location that overlap with the associated chunk. For all of these 2D views, we run the respective images through a 2D neural network that extracts their corresponding features. Note that all of these 2D networks have the same architecture and share the same weights.

In order to combine the 2D and 3D features, we introduce a differentiable backprojection layer that maps 2D features onto the 3D grid. These projected features are then merged with the 3D geometric information through a 3D convolutional part of the network. In addition to the projection, we add a voxel pooling layer that enables handling a variable number of RGB views associated with a 3D chunk; the pooling is performed on a per-voxel basis. In order to run 3D semantic segmentation for entire scans, this network is run for each xy-location of a scene, taking as input the corresponding local chunks.

In the following, we will first introduce the details of our network architecture (see Sec. 4) and then show how we train and implement our method (see Sec. 5).

Our network is composed of a 3D stream and several 2D streams that are combined in a joint 2D-3D network architecture. The 3D part takes as input a volumetric grid representing the geometry of a 3D scan, and the 2D streams take as input the associated RGB images. To this end, we assume that the 3D scan is composed of a sequence of RGB-D images obtained from a commodity RGB-D camera, such as a Kinect or a Structure Sensor, although our method generalizes to other sensor types. We further assume that the RGB-D images are aligned with respect to their world coordinate system using an RGB-D reconstruction framework; in the case of ScanNet [1] scenes, the BundleFusion [7] method is used. Finally, the RGB-D images are fused together in a volumetric grid, which is commonly done by using an implicit signed distance function [8]. An overview of the network architecture is provided in Fig. 2.

Fig. 2.

Network overview: our architecture is composed of a 2D and a 3D part. The 2D side takes as input several aligned RGB images from which features are learned with a proxy loss. These are mapped to 3D space using a differentiable backprojection layer. Features from multiple views are max-pooled on a per-voxel basis and fed into a stream of 3D convolutions. At the same time, we input the 3D geometry into another 3D convolution stream. Then, both 3D streams are joined and the 3D per-voxel labels are predicted. The whole network is trained in an end-to-end fashion.
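As a rough illustration of the per-voxel multi-view pooling described above, the following is a minimal sketch in plain Python rather than the PyTorch implementation used in this work. The function name `max_pool_views`, the explicit coverage masks, and the choice of an all-zero feature for voxels that no view covers are assumptions made for this example.

```python
def max_pool_views(view_features, view_masks):
    """Element-wise max over views, computed independently for every voxel.

    view_features[v][i] is the feature vector (list of floats) that view v
    backprojects onto voxel i; view_masks[v][i] is True where view v actually
    covers voxel i. Voxels covered by no view get a zero vector (assumption).
    """
    num_views = len(view_features)
    num_voxels = len(view_features[0])
    feat_dim = len(view_features[0][0])
    pooled = []
    for i in range(num_voxels):
        # Only views that really see this voxel take part in the max.
        covering = [view_features[v][i] for v in range(num_views) if view_masks[v][i]]
        if covering:
            pooled.append([max(f[c] for f in covering) for c in range(feat_dim)])
        else:
            pooled.append([0.0] * feat_dim)
    return pooled


# Two views, three voxels, two feature channels (toy values).
features = [
    [[1.0, -2.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.5, 3.0], [4.0, -1.0], [0.0, 0.0]],
]
masks = [
    [True, False, False],
    [True, True, False],
]
pooled = max_pool_views(features, masks)
```

Because the maximum is taken over however many views cover a voxel, the output has the same shape for 1, 3, or 5 input views, which is what lets a fixed 3D network consume a variable number of images.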

Our 3D network part is composed of a series of 3D convolutions operating on a regular volumetric grid. The volumetric grid is a subvolume of the voxelized 3D representation of the scene. Each subvolume is centered around a specific xy-location at a size of 31 × 31 × 62 voxels, with a voxel size of 4.8cm, i.e., approximately 1.5m × 1.5m in the xy-plane and 3m in height. Note that we use a height of 3m in order to cover the height of most indoor environments, such that we only need to train the network to operate in varying xy-space. The 3D network takes these subvolumes as input, and predicts the semantic labels for the center columns of the respective subvolume at a resolution of 1 × 1 × 62 voxels; i.e., it simultaneously predicts labels for 62 voxels. For each voxel, we encode the corresponding value of the scene reconstruction state: known-occupied (i.e., on the surface), known-free space (i.e., based on empty space carving [8]), or unknown space (i.e., we have no knowledge about the voxel). We represent this through a 2-channel volumetric grid, the first a binary encoding of the occupancy, and the second a binary encoding of the known/unknown space. The 3D network then processes these subvolumes with a series of nine 3D convolutions which expand the feature dimension and reduce the spatial dimensions, along with dropout regularization during training, before a final set of fully connected layers which predict the classification scores for each voxel.

In the following, we show how to incorporate learned 2D features from associated 2D RGB views.
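The 2-channel encoding of the ternary voxel state can be sketched as follows. This is only an illustration: the integer state constants, the function name `encode_ternary`, and the polarity of each channel (1 = occupied, 1 = known) are assumptions, since the text specifies only that one channel holds occupancy and the other the known/unknown state.

```python
# Hypothetical integer codes for the three reconstruction states.
KNOWN_OCCUPIED, KNOWN_FREE, UNKNOWN = 0, 1, 2


def encode_ternary(states):
    """Split a flat list of ternary voxel states into two binary channels."""
    # Channel 1: occupancy (1 only for voxels on the observed surface).
    occupancy = [1 if s == KNOWN_OCCUPIED else 0 for s in states]
    # Channel 2: known vs. unknown (1 for occupied or carved-free voxels).
    known = [0 if s == UNKNOWN else 1 for s in states]
    return occupancy, known


occupancy, known = encode_ternary([KNOWN_OCCUPIED, KNOWN_FREE, UNKNOWN])
```

With this layout, a known-free voxel and an unknown voxel both have occupancy 0 but differ in the second channel, so the network can distinguish carved empty space from unobserved space.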

The aim of the 2D part of the network is to extract features from each of the input RGB images. To this end, we use a 2D network architecture based on ENet [34] to learn those features. Note that although we can use a variable number of 2D input views, all 2D networks share the same weights as they are jointly trained. Our choice to use ENet is due to its simplicity as it is both fast to run and memory-efficient to train. In particular, the low memory requirements are critical since they allow us to jointly train our 2D-3D network in an end-to-end fashion with multiple input images per train sample. Although our aim is 2D-3D end-to-end training, we additionally use a 2D proxy loss for each image that allows us to make the training more stable; i.e., each 2D stream is asked to predict meaningful semantic features for an RGB image segmentation task. Here, we use semantic labels of the 2D images as ground truth; in the case of ScanNet [1], these are derived from the original 3D annotations by rendering the annotated 3D mesh from the camera points of the respective RGB image poses. The final goal of the 2D network is to obtain the features in the last layer before the proxy loss per-pixel classification scores; these feature maps are then backprojected into 3D to join with the 3D network, using a differentiable backprojection layer. In particular, from an input RGB image of size 328 × 256, we obtain a 2D feature map of size (128×)41 × 32, which is then backprojected into the space of the corresponding 3D volume, obtaining a 3D representation of the feature map of size (128×)31 × 31 × 62.

In order to connect the learned 2D features from each of the input RGB views with the 3D network, we use a differentiable backprojection layer. Since we assume known 6-DoF pose alignments for the input RGB images with respect to each other and the 3D reconstruction, we can compute 2D-3D associations on-the-fly. The layer is essentially a loop over every voxel in the 3D subvolume that a given image is associated with. For every voxel, we compute the 3D-to-2D projection based on the corresponding camera pose, the camera intrinsics, and the world-to-grid transformation matrix. We use the depth data from the RGB-D images in order to prune projected voxels beyond a threshold of the voxel size of 4.8cm. The forward pass then maps the 2D features onto the voxel grid:

$n_{feat} \times w_{2d} \times h_{2d} \rightarrow n_{feat} \times w_{3d} \times h_{3d} \times d_{3d}$

For the backward pass, we use the inverse mapping of the forward pass, which we store in a temporary index map. We use 2D feature maps (feature dim. of 128) of size (128×)41 × 31 and project them to a grid of size (128×)31 × 31 × 62.

The joint 2D-3D network combines 2D RGB features and 3D geometric features using the mapping from the backprojection layer. These two inputs are processed with a series of 3D convolutions, and then concatenated together; the joined feature is then further processed with a set of 3D convolutions. We have experimented with several options as to where to join these two parts: at the beginning (i.e., directly concatenated together without independent 3D processing), approximately 1/3 or 2/3 through the 3D network, and at the end (i.e., directly before the classifier). We use the variant that provided the best results, fusing the 2D and 3D features together at 2/3 of the architecture (i.e., after the 6th 3D convolution of 9); see Tab. 5 for the corresponding ablation study. Note that the entire network, as shown in Fig. 2, is trained in an end-to-end fashion, which is feasible since all components are differentiable. Tab. 1 shows an overview of the distribution of learnable parameters of our 3DMV model.
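To make the voxel-to-pixel association of the backprojection layer more concrete, here is a minimal single-voxel sketch in plain Python. It is not the custom PyTorch layer described in this work: the names `transform` and `voxel_to_pixel`, the pinhole intrinsics tuple `(fx, fy, cx, cy)`, the use of a grid-to-world matrix (the inverse of the world-to-grid transform), and nearest-pixel rounding are all assumptions made for illustration.

```python
def transform(mat, p):
    """Apply a 3x4 (or 4x4) rigid/affine matrix to a 3D point."""
    x, y, z = p
    return tuple(mat[r][0] * x + mat[r][1] * y + mat[r][2] * z + mat[r][3]
                 for r in range(3))


def voxel_to_pixel(voxel, grid_to_world, world_to_camera, intrinsics,
                   depth, max_depth_diff):
    """Return the pixel (u, v) a voxel projects to, or None if it is pruned.

    depth is a row-major 2D list of metric depth values; max_depth_diff is
    the pruning threshold (the text uses the voxel size for this).
    """
    fx, fy, cx, cy = intrinsics
    xc, yc, zc = transform(world_to_camera, transform(grid_to_world, voxel))
    if zc <= 0:
        return None  # behind the camera
    u = int(round(fx * xc / zc + cx))
    v = int(round(fy * yc / zc + cy))
    if not (0 <= v < len(depth) and 0 <= u < len(depth[0])):
        return None  # projects outside the image
    if abs(depth[v][u] - zc) > max_depth_diff:
        return None  # too far from the observed surface: prune
    return u, v


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
grid_to_world = [[0.048, 0, 0, 0], [0, 0.048, 0, 0], [0, 0, 0.048, 0]]
intrinsics = (100.0, 100.0, 2.0, 2.0)
depth_near = [[1.2] * 5 for _ in range(5)]
hit = voxel_to_pixel((0, 0, 25), grid_to_world, identity, intrinsics,
                     depth_near, 0.048)
```

Collecting the returned pixel index for every voxel gives a voxel-to-pixel index map; reusing that same map to route gradients back to the 2D feature map is one way to realize the inverse mapping mentioned for the backward pass.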

2d only 3d (2d ft only) 3d (3d geo only) 3d (fused 2d/3d)

Table 1.

Distribution of learnable parameters of our 3DMV model. Note that the majority of the network weights are part of the combined 3D stream just before the per-voxel predictions where we rely on strong feature maps; see top left of Fig. 2.

Our joint 2D-3D network operates on a per-chunk basis; i.e., it takes fixed subvolumes of a 3D scene as input (along with associated RGB views), and predicts labels for the voxels in the center column of the given chunk. In order to perform a semantic segmentation of large 3D environments, we slide the subvolume through the 3D grid of the underlying reconstruction. Since the height of the subvolume (3m) is sufficient for most indoor environments, we only need to slide over the xy-domain of the scene. Note, however, that for training, the training samples do not need to be spatially connected, which allows us to train on a random set of subvolumes. This decoupling of training and test extents is particularly important since it allows us to provide a good label and data distribution of training samples (e.g., chunks with sufficient coverage and variety).

We train our joint 2D-3D network architecture in an end-to-end fashion. To this end, we prepare correlated 3D and RGB input to the network for the training process. The 3D geometry is encoded in a ternary occupancy grid that encodes known-occupied, known-free, and unknown states for each voxel. The ternary information is split across 2 channels, where the first channel encodes occupancy and the second channel encodes the known vs. unknown state. To select train subvolumes from a 3D scene, we randomly sample subvolumes as potential training samples. For each potential train sample, we check its label distribution and discard samples containing only structural elements (i.e., wall/floor) with 95% probability. In addition, all samples with empty center columns are discarded, as well as samples with less than 70% of the center column geometry annotated.

For each subvolume, we then associate k nearby RGB images whose alignment is known from the 6-DoF camera pose information. We select images greedily based on maximum coverage; i.e., we first pick the image covering the most voxels in the subvolume, and subsequently take each next image which covers the most voxels not covered by the current set. We typically select 3-5 images since additional gains in coverage become smaller with each added image. For each sampled subvolume, we augment it with 8 random rotations for a total of 1 , , 080 train samples. Since existing 3D datasets, such as ScanNet [1] or Matterport3D [33], contain unannotated regions in the ground truth (see Fig. 3, right), we mask out these regions in both our 3D loss and 2D proxy loss. Note that this strategy still allows for making predictions for all voxels at test time.
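The greedy, coverage-based view selection described above can be sketched in a few lines of plain Python. The function name `select_views`, the representation of each view's coverage as a set of voxel indices, and the early stop once no remaining view adds new voxels are assumptions made for this example.

```python
def select_views(coverage, k):
    """Greedily pick up to k views that maximize newly covered voxels.

    coverage maps a view id to the set of voxel indices that view covers
    inside the subvolume.
    """
    selected, covered = [], set()
    remaining = dict(coverage)
    while remaining and len(selected) < k:
        # Pick the view that adds the most voxels not yet covered.
        best = max(remaining, key=lambda v: len(remaining[v] - covered))
        if not remaining[best] - covered:
            break  # no remaining view adds coverage (assumed stopping rule)
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered


views = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6, 7},
    "d": {1, 2},
}
selected, covered = select_views(views, k=3)
```

In this toy case the first pick is the view with the largest coverage and the second is the one with the most new voxels; the remaining views add nothing new, which mirrors the diminishing gains noted in the text.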

We implement our approach in PyTorch. While 2D and 3D conv layers are already provided by the PyTorch API, we implement a custom layer for the backprojection. We implement this backprojection in Python, as a custom PyTorch layer, representing the projection as a series of matrix multiplications in order to exploit PyTorch parallelization, and run the backprojection on the GPU through the PyTorch API. For training, we have tried only training parts of the network; however, we found that the end-to-end version that jointly optimizes both 2D and 3D performed best. In the training process, we use an SGD optimizer with a learning rate of 0.001 and a momentum of 0.9; we set the batch size to 8. Note that our training set is quite biased towards structural classes (e.g., wall, floor), even when discarding most structural-only samples, as these elements are vastly dominant in indoor scenes. In order to account for this data imbalance, we use the histogram of classes represented in the train set to weight the loss during training. We train our network for 200,000 iterations; for our network trained on 3 views, this takes ≈ 24 hours, and for 5 views, ≈ 48 hours.
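The text states that the class histogram of the train set is used to weight the loss, and that unannotated regions are masked out, but it does not give the exact weighting formula. The sketch below therefore uses normalized inverse frequency purely as an assumption; the names `class_weights` and `weighted_cross_entropy` and the `ignore_label` value of -1 are likewise hypothetical.

```python
import math


def class_weights(label_counts):
    """Inverse-frequency weights from a class histogram (assumed scheme)."""
    total = sum(label_counts)
    n = len(label_counts)
    # Classes that never occur get weight 0 to avoid division by zero.
    return [total / (n * c) if c > 0 else 0.0 for c in label_counts]


def weighted_cross_entropy(probs, labels, weights, ignore_label=-1):
    """Weighted mean negative log-likelihood, skipping unannotated voxels."""
    num = den = 0.0
    for p, y in zip(probs, labels):
        if y == ignore_label:
            continue  # masked out: contributes neither loss nor weight
        num += -weights[y] * math.log(p[y])
        den += weights[y]
    return num / den if den > 0 else 0.0


# Toy histogram: class 0 (e.g., a structural class) dominates class 1.
weights = class_weights([80, 20])
loss = weighted_cross_entropy([[0.5, 0.5], [0.5, 0.5]], [0, -1], weights)
```

Under this scheme a rare class receives a larger weight than a dominant one, so errors on infrequent categories are not drowned out by wall and floor voxels.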

In this section, we provide an evaluation of our proposed method with a comparison to existing approaches. We evaluate on the ScanNet dataset [1], which contains 1513 RGB-D scans composed of 2.5M RGB-D images. We use the public train/val/test split of 1045, 156, 312 scenes, respectively, and follow the 20-class semantic segmentation task defined in the original ScanNet benchmark. We evaluate our results with per-voxel class accuracies, following the evaluations of previous work [1,24,12]. Additionally, we visualize our results qualitatively and in comparison to previous work in Fig. 3, with close-ups shown in Fig. 4. Note that we map the predictions from all methods back onto the mesh reconstruction for ease of visualization.
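For reference, per-voxel class accuracy and its average over classes can be computed as in the following plain-Python sketch. The function name `per_class_accuracy`, the `ignore_label` of -1 for unannotated voxels, and the decision to skip classes that do not occur in the ground truth are assumptions made for this example.

```python
def per_class_accuracy(pred, gt, num_classes, ignore_label=-1):
    """Return ({class: accuracy}, mean accuracy over classes present in gt)."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore_label:
            continue  # unannotated voxel: excluded from the evaluation
        total[g] += 1
        correct[g] += int(p == g)
    accs = {c: correct[c] / total[c] for c in range(num_classes) if total[c] > 0}
    mean = sum(accs.values()) / len(accs) if accs else 0.0
    return accs, mean


# Five voxels, two classes; the last voxel is unannotated.
accs, mean_acc = per_class_accuracy(pred=[0, 0, 1, 1, 1],
                                    gt=[0, 0, 0, 1, -1],
                                    num_classes=2)
```

Averaging over classes rather than over voxels means a rare class counts as much as a frequent one, which matters for indoor scenes dominated by structural classes.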

Comparison to state of the art.

Our main results are shown in Tab. 2, where we compare to several state-of-the-art volumetric (ScanNet [1], ScanComplete [12]) and point-based approaches (PointNet++ [24]) on the ScanNet test set. Additionally, we show an ablation study regarding our design choices in Tab. 3.

The best variant of our 3DMV network achieves 75% average classification accuracy, which is quite significant considering the difficulty of the task and the performance of existing approaches. That is, we improve by 22.2% over existing volumetric architectures and by 14.8% over the state-of-the-art PointNet++ architecture.

How much does RGB input help?

Tab. 3 includes a direct comparison between our 3D network architecture when using RGB features against the exact same 3D network without the RGB input. Performance improves from 54.4% to 70.1% with RGB input, even with just a single RGB view. In addition, we tried out the naive alternative of using per-voxel colors rather than a 2D feature extractor. Here, we see only a marginal difference compared to the purely geometric baseline (54.4% vs. 55.9%). We attribute this relatively small gain to the limited grid resolution.

How much does geometric input help?

Another important question is whether we actually need the 3D geometric input, or whether geometric information is a redundant subset of the RGB input; see Tab. 3. The first experiment we conduct in this context is simply a projection of the predicted 2D labels on top of the geometry. If we only use the labels from a single RGB view, we obtain 27% average accuracy (vs. 70.1% with 1 view + geometry); for 3 views, this label backprojection achieves 44.2% (vs. 73.0% with 3 views + geometry). Note that this is related to the limited coverage of the RGB backprojections (see Tab. 4).

However, the interesting experiment now is what happens if we still run a series of 3D convolutions after the backprojection of the 2D labels. Again, we omit inputting the scene geometry, but we now learn how to combine and propagate the backprojected features in the 3D grid; essentially, we ignore the first part of our 3D network; cf. Fig. 2. For 3 RGB views, this results in an accuracy of 58.2%; this is higher than the 54.4% of geometry only; however, it is much lower than our final 3-view result of 73.0% from the joint network. Overall, this shows that RGB and geometric information complement each other, and that the synergies allow for an improvement over the individual inputs by 14.8% and 18.6%, respectively (for 3 views).

wall floor cab bed chair sofa table door wind bkshf pic cntr desk curt fridg show toil sink bath other avg
ScanNet [1] 70.1 90.3 49.8 62.4 69.3 75.7

Table 2.

Comparison of our final trained model (5 views, end-to-end) against other state-of-the-art methods on the ScanNet dataset [1]. We can see that our approach makes significant improvements, 22.2% over existing volumetric and approx. 14.8% over state-of-the-art PointNet++ architectures.

How to feed 2D features into the 3D network?

An interesting question is where to join 2D and 3D features; i.e., at which layer of the 3D network do we fuse together the features originating from the RGB images with the features from the 3D geometry. On the one hand, one could argue that it makes more sense to feed the 2D part early into the 3D network in order to have more capacity for learning the joint 2D-3D combination. On the other hand, it might make more sense to keep the two streams separate for as long as possible to first extract strong independent features before combining them.

To this end, we conduct an experiment with different 2D-3D network combinations (for simplicity, always using a single RGB view without end-to-end training); see Tab. 5. We tried four combinations, where we fused the 2D and 3D features at the beginning, after the first third of the network, after the second third, and at the very end of the 3D network. Interestingly, the results are relatively similar (67.6%, 65.4%, 69.1%, and 67.5%, respectively), suggesting that the 3D network can adapt quite well to the 2D features. Across these experiments, the second-third option turned out to be a few percentage points higher than the alternatives; hence, we use that as a default in all other experiments.


How much do additional views help?

In Tab. 3, we also examine the effect of each additional view on classification performance. For geometry only, we obtain an average classification accuracy of 54.4%; adding only a single view per chunk increases to 70.1% (+15.7%); for 3 views, it increases to 73.1% (+3.0%); for 5 views, it reaches 75.0% (+1.9%). Hence, for every additional view the incremental gains become smaller; this is somewhat expected as a large part of the benefits are attributed to additional coverage of the 3D volume with 2D features. If we already use a substantial number of views, each additional added feature shares redundancy with previous views, as shown in Tab. 4.

Is end-to-end training of the joint 2D-3D network useful?

Here, we examine the benefits of training the 2D-3D network in an end-to-end fashion, rather than simply using a pre-trained 2D network. We conduct this experiment with 1, 3, and 5 views. The end-to-end variant consistently outperforms the fixed version, improving the respective accuracies by 1.0%, 0.2%, and 0.5%. Although the end-to-end variants are strictly better, the increments are smaller than we initially hoped for. We also tried removing the 2D proxy loss that enforces good 2D predictions, which led to a slightly lower performance. Overall, end-to-end training with a proxy loss always performed best and we use it as our default.

                          wall  floor cab   bed   chair sofa  table door  wind  bkshf pic   cntr  desk  curt  fridg show  toil  sink  bath  other avg
2d only (1 view)          37.1  39.1  26.7  33.1  22.7  38.8  17.5  38.7  13.5  32.6  14.9  7.8   19.1  34.4  33.2  13.3  32.7  29.2  36.3  20.4  27.1
2d only (3 views)         58.6  62.5  40.8  51.6  38.6  59.7  31.1  55.9  25.9  52.9  25.1  14.2  35.0  51.2  57.3  36.0  47.1  44.7  61.5  34.3  44.2
Ours (no geo input)       76.2  92.9  59.3  65.6  80.6  73.9  63.3  75.1  22.6  80.2  13.3  31.8  43.4  56.5  53.4  43.2  82.1  55.0  80.8  9.3   58.2
Ours (3d geo only)        60.4  95.0  54.4  69.5  79.5  70.6  71.3  65.9  20.7  71.4  4.2   20.0  38.5  15.2  59.9  57.3  78.7  48.8  87.0  20.6  54.4
Ours (3d geo+voxel color) 58.8  94.7  55.5  64.3  72.1  80.1  65.5

Ours (5 view)

Table 3.

Ablation study for different design choices of our approach on ScanNet [1]. We first test simple baselines where we backproject 2D labels from 1 and 3 views (rows 1-2), then run a set of 3D convs after the backprojections (row 3). We then test a 3D-geometry-only network (row 4). Augmenting the 3D-only version with per-voxel colors shows only small gains (row 5). In rows 6-11, we test our joint 2D-3D architecture with a varying number of views, and the effect of end-to-end training. Our 5-view, end-to-end variant performs best.

Evaluation in 2D domains using NYUv2.

Although we are predicting 3D per-voxel labels, we can also project the obtained voxel labels into the 2D images. In Tab. 6, we show such an evaluation on the NYUv2 [35] dataset. For this task, we train our network on both ScanNet data as well as the NYUv2 train annotations projected into 3D. Although this is not the actual task of our method, it can be seen as an efficient way to accumulate semantic information from multiple RGB-D frames by using the 3D geometry as a proxy for the learning framework. Overall, our joint 2D-3D architecture compares favorably against the respective baselines on this 13-class task.

Summary Evaluation.

- RGB and geometric features are orthogonal and help each other.

Fig. 3.

Qualitative semantic segmentation results on the ScanNet [1] test set. We compare with the 3D-based approaches of ScanNet [1], ScanComplete [12], and PointNet++ [24]. Note that the ground truth scenes contain some unannotated regions, denoted in black. Our joint 3D-multi-view approach achieves more accurate semantic predictions.

          1 view  3 views  5 views
coverage  40.3%   64.4%    72.3%

Table 4.

Amount of coverage from varying number of views over the annotated ground truth voxels of the ScanNet [1] test scenes.

wall floor cab bed chair sofa table door wind bkshf pic cntr desk curt fridg show toil sink bath other avg
begin 78.8 96.3 63.7 72.8 83.3 81.9
end

Table 5.

Evaluation of various network combinations for joining the 2D and 3D streams in the 3D architecture (cf. Fig. 2, top). We use the single-view variant with a fixed 2D network here for simplicity. Interestingly, performance only changes slightly; however, the 2/3 version performed the best, which is our default for all other experiments.

Fig. 4.

Additional qualitative semantic segmentation results (close-ups) on the ScanNet [1] test set. Note the consistency of our predictions compared to the other baselines.

– More views help, but increments get smaller with every view.
– End-to-end training is strictly better, but the improvement is not that big.
– Variations of where to join the 2D and 3D features change performance to some degree; 2/3 performed best in our tests.
– Our results are significantly better than the best volumetric or PointNet baseline (+22.2% and +14.8%, respectively).

                                bed   books  ceil.  chair  floor  furn.  obj.  pic.  sofa  table  tv    wall  wind.  avg.
SceneNet [36]                   70.8  5.5    76.2   59.6   95.9   62.3   50.0  18.0  61.3  42.2   22.2  86.1  32.1   52.5
Hermans et al. [37]             68.4  45.4   83.4   41.9   91.5   37.1   8.6   35.8  58.5  27.7   38.4  71.8  48.0   54.3
SemanticFusion [38] (RGBD+CRF)  62.0

Table 6.

We can also evaluate our method on 2D semantic segmentation tasks by projecting the predicted 3D labels into the respective RGB-D frames. Here, we show a comparison on dense pixel classification accuracy on NYUv2 [40]. Note that the reported ScanNet classification is on the 11-class task.

Limitations.

While our joint 3D-multi-view approach achieves significant performance gains over the previous state of the art in 3D semantic segmentation, there are still several important limitations. Our approach operates on dense volumetric grids, which quickly become impractical at high resolutions; e.g., RGB-D scanning approaches typically produce reconstructions with sub-centimeter voxel resolution. Sparse approaches, such as OctNet [17], might be a good remedy. Additionally, we currently predict only the voxels of each column of a scene jointly, while each column is predicted independently, which can give rise to some label inconsistencies in the final predictions since different RGB views might be selected; note, however, that due to the convolutional nature of the 3D networks, the geometry remains spatially coherent.
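To give a sense of why dense grids become impractical, the following sketch counts the memory of a dense single-channel float grid at two voxel sizes. The room extent and the voxel sizes are illustrative assumptions rather than values taken from the experiments, and `dense_grid_bytes` is a hypothetical helper.

```python
def dense_grid_bytes(extent_mm, voxel_mm, channels=1, bytes_per_value=4):
    """Bytes needed for a dense grid covering a box of size extent_mm (x, y, z)."""
    # Integer ceiling division avoids floating-point rounding in the voxel counts.
    dims = [-(-e // voxel_mm) for e in extent_mm]
    return dims[0] * dims[1] * dims[2] * channels * bytes_per_value

room_mm = (5000, 5000, 3000)              # hypothetical 5 m x 5 m x 3 m room
coarse = dense_grid_bytes(room_mm, 50)    # 5 cm voxels
fine = dense_grid_bytes(room_mm, 5)       # 5 mm voxels
print(coarse, fine, fine // coarse)
```

Shrinking the voxel edge by a factor of 10 multiplies the voxel count, and hence the memory, by a factor of 1000. Feature grids with many channels scale this further, which is the growth that sparse representations aim to avoid.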

We presented 3DMV, a joint 3D-multi-view approach built on the core idea of combining geometric and RGB features in a joint network architecture. We show that our joint approach can achieve significantly better accuracy for semantic 3D scene segmentation. In a series of evaluations, we carefully examine our design choices; for instance, we demonstrate that the 2D and 3D features complement each other rather than being redundant; we also show that our method can successfully take advantage of using several input views from an RGB-D sequence to gain higher coverage, thus resulting in better performance. In the end, we are able to show results at more than 14% higher classification accuracy than the best existing 3D segmentation approach. Overall, we believe that these improvements will open up new possibilities where not only the semantic content, but also the spatial 3D layout plays an important role.

For the future, we still see many open questions in this area. First, the 3D semantic segmentation problem is far from solved, and semantic instance segmentation in 3D is still in its infancy. Second, there are many fundamental questions about the scene representation for realizing 3D convolutional neural networks, and how to handle mixed sparse-dense data representations. And third, we also see tremendous potential for combining multi-modal features for generative tasks in 3D reconstruction, such as scan completion and texturing.

Acknowledgments

This work was supported by a Google Research Grant, a Stanford Graduate Fellowship, and a TUM-IAS Rudolf Mößbauer Fellowship.

References

1. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. (2017)
2. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 3431–3440
3. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE (2017) 2980–2988
4. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: Real-time dense surface mapping and tracking. In: Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, IEEE (2011) 127–136
5. Nießner, M., Zollhöfer, M., Izadi, S., Stamminger, M.: Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG) (2013)
6. Kähler, O., Prisacariu, V.A., Ren, C.Y., Sun, X., Torr, P., Murray, D.: Very high frame rate volumetric integration of depth images on mobile devices. IEEE transactions on visualization and computer graphics (11) (2015) 1241–1250
7. Dai, A., Nießner, M., Zollhöfer, M., Izadi, S., Theobalt, C.: Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG) (3) (2017) 24
8. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, ACM (1996) 303–312
9. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1912–1920
10. Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.: Volumetric and multi-view cnns for object classification on 3d data. In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. (2016)
11. Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition (2017)
12. Dai, A., Ritchie, D., Bokeloh, M., Reed, S., Sturm, J., Nießner, M.: Scancomplete: Large-scale scene completion and semantic segmentation for 3d scans. arXiv preprint arXiv:1712.10215 (2018)
13. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago (2015)
14. Maturana, D., Scherer, S.: Voxnet: A 3d convolutional neural network for real-time object recognition. In: Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, IEEE (2015) 922–928
15. Dai, A., Qi, C.R., Nießner, M.: Shape completion using 3d-encoder-predictor cnns and shape synthesis. In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. (2017)
16. Han, X., Li, Z., Huang, H., Kalogerakis, E., Yu, Y.: High Resolution Shape Completion Using Deep Neural Networks for Global Structure and Local Geometry Inference. In: IEEE International Conference on Computer Vision (ICCV). (2017)
17. Riegler, G., Ulusoy, A.O., Geiger, A.: Octnet: Learning deep 3d representations at high resolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017)
18. Wang, P.S., Liu, Y., Guo, Y.X., Sun, C.Y., Tong, X.: O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG) (4) (2017) 72
19. Riegler, G., Ulusoy, A.O., Bischof, H., Geiger, A.: Octnetfusion: Learning depth fusion from data. arXiv preprint arXiv:1704.01047 (2017)
20. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. arXiv preprint arXiv:1703.09438 (2017)
21. Häne, C., Tulsiani, S., Malik, J.: Hierarchical surface prediction for 3d object reconstruction. arXiv preprint arXiv:1704.00710 (2017)
22. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
23. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE (2) (2017) 4
24. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems. (2017) 5105–5114
25. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 945–953
26. Kalogerakis, E., Averkiou, M., Maji, S., Chaudhuri, S.: 3d shape segmentation with projective convolutional networks. Proc. CVPR, IEEE (2017)
27. Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In: European Conference on Computer Vision, Springer (2016) 628–644
28. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: CVPR. Volume 1. (2017) 3
29. Kar, A., Häne, C., Malik, J.: Learning a multi-view stereo machine. In: Advances in Neural Information Processing Systems. (2017) 364–375
30. Ji, M., Gall, J., Zheng, H., Liu, Y., Fang, L.: Surfacenet: An end-to-end 3d neural network for multiview stereopsis. arXiv preprint arXiv:1708.01749 (2017)
31. Valentin, J., Vineet, V., Cheng, M.M., Kim, D., Shotton, J., Kohli, P., Nießner, M., Criminisi, A., Izadi, S., Torr, P.: Semanticpaint: Interactive 3d labeling and learning at your fingertips. ACM Transactions on Graphics (TOG) (5) (2015) 154
32. Vineet, V., Miksik, O., Lidegaard, M., Nießner, M., Golodetz, S., Prisacariu, V.A., Kähler, O., Murray, D.W., Izadi, S., Pérez, P., et al.: Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction. In: Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE (2015) 75–82
33. Chang, A., Dai, A., Funkhouser, T., Halber, M., Nießner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV) (2017)
34. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016)
35. Silberman, N., Fergus, R.: Indoor scene segmentation using a structured light sensor. In: Proceedings of the International Conference on Computer Vision - Workshop on 3D Representation and Recognition. (2011)
36. Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S., Cipolla, R.: Scenenet: Understanding real world indoor scenes with synthetic data. arXiv preprint arXiv:1511.07041 (2015)
37. Hermans, A., Floros, G., Leibe, B.: Dense 3D semantic mapping of indoor scenes from RGB-D images. In: Robotics and Automation (ICRA), 2014 IEEE International Conference on, IEEE (2014) 2631–2638
38. McCormac, J., Handa, A., Davison, A., Leutenegger, S.: Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In: Robotics and Automation (ICRA), 2017 IEEE International Conference on, IEEE (2017) 4628–4635
39. Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 2650–2658
40. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV. (2012)

In this appendix, we provide additional quantitative results for 3D semantic segmentation using our 3DMV method. In particular, we use the Matterport3D [33] benchmark for this purpose; see Sec. A. Note that the results on the ScanNet [1] and NYUv2 [40] datasets can be found in the main document. Furthermore, in Sec. B, we visualize additional qualitative results.

A Evaluation on Matterport3D [33]

Matterport3D provides 90 building-scale RGB-D reconstructions, densely annotated similar to the ScanNet dataset annotations. In Tab. 7, we compare against state-of-the-art volumetric-based semantic 3D segmentation approaches (ScanNet [1] and ScanComplete [12]) on the Matterport3D [33] dataset. Additionally, we evaluate the performance of our method over a varying number of views in the ablation study in Tab. 8. Note that our final result improves by over 10% compared to the best existing volumetric-based 3D semantic segmentation method.

wall floor cab bed chair sofa table door wind bkshf pic cntr desk curt ceil fridg show toil sink bath other avg
ScanNet [1]

Table 7.

Comparison of our final trained model (5 views, end-to-end) against the state-of-the-art volumetric-based semantic 3D segmentation methods on the Matterport3D dataset [33].

wall floor cab bed chair sofa table door wind bkshf pic cntr desk curt ceil fridg show toil sink bath other avg
1-view 76.5

Table 8.

Evaluation of 1-, 3-, and 5-view (end-to-end) variants of our approach on the Matterport3D [33] test set. Note that each additional view improves the results by several percentage points, which confirms the findings of our ablation study on the ScanNet [1] dataset shown in the main document.

B Additional Qualitative Results

In Fig. 5, we show additional qualitative results using our 3DMV approach. We additionally show a qualitative comparison to the volumetric semantic segmentation approaches of ScanNet [1] and ScanComplete [12] on the Matterport3D [33] dataset in Fig. 6.

Fig. 5.

Additional qualitative semantic 3D segmentation results on the ScanNet [1] test set. Note that black denotes regions that are unannotated or contain labels not in the 20-label set.