Points2Vec: Unsupervised Object-level Feature Learning from Point Clouds
Joël Bachmann, Kenneth Blomqvist, Julian Förster, Roland Siegwart
Mantis Technologies
Autonomous Systems Lab, Swiss Federal Institute of Technology, Zurich, Switzerland
Correspondence to [email protected]
Abstract
Unsupervised representation learning techniques, such as learning word embeddings, have had a significant impact on the field of natural language processing. Similar representation learning techniques have not yet become commonplace in the context of 3D vision. This, despite the fact that physical 3D spaces have a similar semantic structure to bodies of text: words are surrounded by words that are semantically related, just like objects are surrounded by other objects that are similar in concept and usage. In this work, we exploit this structure to learn semantically meaningful, low-dimensional vector representations of objects. We learn these vector representations by mining a dataset of scanned 3D spaces using an unsupervised algorithm. We represent objects as point clouds, a flexible and general representation for 3D data, which we encode into a vector representation. We show that using our method to include context increases the ability of a clustering algorithm to distinguish different semantic classes from each other. Furthermore, we show that our algorithm produces continuous and meaningful object embeddings through interpolation experiments.
Introduction

Unsupervised learning has seen much success in natural language processing [1, 2, 3]. Many current methods rely on models that have been pretrained on standard tasks. Others use representations learned ahead of time, such as word vectors. Largely, this is driven by the massive amounts of text data available through the internet. Similarly, in computer vision, pretrained models are used to improve generalization and to learn from a small amount of data. Methods have been developed to learn representations from unlabeled data, to speed up learning on downstream tasks [4]. This has been much less explored in the context of geometric 3D data.
Figure 1: An application of our work is finding semantically similar objects in a scene. Here, we give our algorithm an instance (red), and ask it to search for the instances that are most similar to that instance (green). The algorithm correctly outputs all other chairs and the table.

Methods like word2vec [1] leverage the context in which a word appears to learn a semantic representation for each word. Not only can the meaning of a word be altered by its context, but words appearing as neighbors are also correlated. Words from the same semantic realm often appear close to each other in text or have similar neighbors. Just like words in bodies of text, objects in physical spaces are defined by the context in which they appear and are used. A hammer will likely be found near a screwdriver. A chair could be swapped with a stool without it appearing out of place. We build on this realization by designing an unsupervised approach to learning low-dimensional object representations.

Our goal is to find a mapping from 3D objects into a low-dimensional vector representation that is representative of the semantic meaning of the object – an embedding function. We base our approach on the idea that semantically relevant features of 3D objects depend both on the context in which the objects appear and on the geometry of the objects. Within the low-dimensional representation, a chair in a kitchen, for instance, should be separable from an office chair, or conversely, pillows should occupy a region of the semantic space near the region occupied by beds.

We choose to use point clouds as the representation for our objects and 3D geometry. The reason is threefold. First, any 3D representation, such as a triangle mesh, a CAD model or a voxel grid, can be converted to a point cloud by sampling points on the surface of the encoded geometry. Second, point clouds can be obtained from increasingly common depth sensors, such as lidar scanners. Lastly, effective differentiable functions that operate on point clouds have been developed, PointNet [5] being one such example.

In this paper we:
• Propose an algorithm that embeds point clouds into a low-dimensional vector space while taking the context in which an object appears into account. We do this by combining a PointNet encoder-decoder architecture with a contrastive loss function.
• Train our algorithm on the Replica dataset [6] and analyze the obtained semantic embedding function through a series of experiments.
Related Work

Learning vector representations of 3D data has been and still is an active topic of research in the computer vision and machine learning communities. This is driven by the motivation to use representations in various downstream tasks that require an abstract and generalizable understanding of the entity at hand. Examples of such tasks are classification, scene completion, and image inpainting.

A common format for 3D data is the voxel grid. This structured representation lends itself well to applying convolutions. In [7], a model is trained to predict voxels that were removed from the original grid. After training, the representations generated by the model can be used for shape recognition or interpolation between instances. Another application for representations learned from voxel data [8] is the generation of 3D data from 2D images.

Despite these successes, voxel grids have disadvantages compared to point clouds, such as their fixed scale and the need to discretize sensor data before they can be applied. As an alternative, methods directly operating on point cloud data have been investigated, especially after the introduction of PointNet [5]. Building on top of PointNet, various works have attempted to learn meaningful representations in an unsupervised fashion using autoencoders [9], generative adversarial networks (GANs) [10], combinations of both [11, 12], or recurrent neural networks (RNNs) [13]. While the representations obtained with these methods were shown to greatly improve downstream tasks such as reconstruction or classification based on geometry, there is no method that leverages additional information for a more semantically meaningful representation. Autoencoders are often used to extract meaningful features of point clouds. To the best of our knowledge, all research in the field analyzes each point cloud separately and ignores context.
Figure 2: The architecture of our proposed Points2Vec algorithm.

[14] learns dense 3D descriptors for points in 3D spaces by using a contrastive loss. Their goal is to learn a descriptor which is invariant to lighting and viewpoint, to be used for finding correspondences between images captured by a depth camera. [15] extends this approach to learn more object-centric descriptors for use in robotic manipulation.

For this paper, we drew inspiration from a method called word2vec [1], a technique from the field of natural language processing. The method is trained to predict a word from its neighbors in a sentence, resulting in vectors that capture semantic information about the original words.

With a similar motivation, in this paper we propose a method to extract semantically meaningful embeddings from 3D point cloud data by not only considering an object's geometric shape, but also its neighborhood in a scene.
Method

In Points2Vec, we extend a PointNet autoencoder with a context module that takes the surrounding objects into account. We conjecture that objects that are close to each other in 3D space or geometrically similar should occupy a similar region in latent space. This is enforced by adding a contrastive loss in latent space. To some extent, this context module is inspired by word2vec [1], in that it leverages surrounding embeddings to improve the quality of the target embedding. Moreover, we assume that adding a loss in latent space improves stability in training: if the only loss function is applied after reconstruction, backpropagation is entirely dependent on the decoder, which may add a significant source of error, particularly with unstructured data. For our algorithm, we assume point cloud instances to be segmented, but we do not depend on class labels. Figure 2 shows an overview of our proposed algorithm, which is further described in the following sections.
Autoencoder

To extract features from the point clouds, we make use of an autoencoder architecture. We use PointNet [5] as an encoder, which we briefly describe in the following (for the exact network specifications, see Table 5 in the Appendix): the N × 3 point cloud (consisting of N points) is transformed by means of matrix multiplication with a learnable T-Net. Then, a multi-layer perceptron (MLP) extracts features from each point separately, and another transform module is applied to the resulting feature matrix. Further MLPs widen the per-point features, and the max operator reduces the feature matrix to a single global feature vector. This vector is then encoded to a lower embedding size using fully connected layers with ReLU activation functions. To avoid overfitting, we employ dropout layers. The embeddings are projected onto the unit hypersphere by normalization. To reconstruct the point cloud from the code, we use a series of fully connected layers with ReLU activation functions. To account for the unstructured nature of point clouds, the reconstruction loss has to be independent of the ordering of the points. To this end, we use the chamfer distance.

Context Module

We refer to the part of our algorithm that samples context instances and computes the margin loss of the embeddings as the context module. Our context module is inspired by the triplet margin loss of [16] and aims at inducing domain knowledge into the hyperspace by computing a contrastive loss over four different embeddings. For this, we make use of the segmentation and run point clouds through the encoder on a per-instance basis. We sample the following point cloud instances:
• Anchor Point Cloud: The anchor point cloud is sampled randomly among all instances.
• Close Point Cloud: We sample an instance situated close to the anchor point cloud in euclidean space. Specifically, we compute the distance between centroids for all instances and sample a point cloud among the ten closest instances. Sampling is performed by inverse transform sampling, using the distance to the anchor point cloud to inversely weigh the sample probability. The motivation is, as described above, that objects close to each other in euclidean space are often from the same semantic realm (e.g. kitchen utensils, or tools in a workshop). Therefore, we aim for the close point cloud embedding to be close to the anchor point cloud embedding.
• Similar Point Cloud: Apart from the close instance, we sample an instance that is geometrically similar to the anchor. To get a similarity measure between point clouds, we use the chamfer distance. To account for different rotations of instances, we pose computing the chamfer distance as an optimization problem: the goal is to find the angle around the z-axis for which the chamfer distance is minimal. As with the close point cloud, we sample among the ten most similar instances, using inverse transform sampling to achieve a sample probability inversely proportional to the distance score. Taking similar point clouds into consideration is motivated by the observation that geometrically similar objects are often from the same semantic realm (e.g. they can be used for the same tasks), even if they do not appear close to each other in the observed scene (e.g. cups appearing in the kitchen, the dining room, and the office). Therefore, we aim for the similar point cloud embedding to be close to the anchor point cloud embedding.
• Negative Point Cloud: The negative point cloud is sampled at random among all point clouds except the anchor, close, and similar point clouds. This is to drive the anchor point cloud embedding away from the embeddings of unrelated objects. Random sampling was chosen to reduce the chance of inducing a bias.

Figure 3: Visualization of the training instance sampling algorithm in Points2Vec: anchor point cloud (bed), close point cloud (bed sheets), similar point cloud (bed), negative point cloud (toilet). The point clouds in question are marked in black, the rest in grey.

The sampled point clouds are then run through our encoder, where we apply the following margin loss to the embeddings:

L_margin = max(d_close + d_similar − d_negative + α, 0)    (1)
d_close = ‖e_anchor − e_close‖    (2)
d_similar = ‖e_anchor − e_similar‖    (3)
d_negative = ‖e_anchor − e_negative‖    (4)

where α is the margin parameter and the e are the embedding vectors. The margin parameter is a tunable hyperparameter that defines the margin for which a loss is generated: if the two positive embeddings are close and the negative embedding is far away, the margin function does not produce a loss. In essence, L_margin pulls the close and similar point cloud embeddings closer to the anchor embedding and pushes the negative point cloud embedding away from it, as shown in Figure 4. For the examples in Figure 3, this means that the embedding of the bed sheets and the embedding of the second bed are pulled closer to the embedding of the first bed, whereas the embedding of the toilet is pushed further away from the embedding of the bed.

For backpropagation, the margin loss and the reconstruction loss are summed to update the weights of the encoder. The decoder is updated using only the reconstruction loss. At inference time, we only sample the anchor point cloud and run it through the encoder.
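As a concrete illustration of Eq. (1)–(4), a minimal PyTorch sketch of the margin loss is given below. It assumes the four embeddings have already been produced by the shared-weight encoder and normalized; the function and variable names are illustrative, and this is not the reference implementation.

```python
import torch
import torch.nn.functional as F

def margin_loss(e_anchor, e_close, e_similar, e_negative, alpha=1.0):
    """Eq. (1)-(4): pull close/similar codes toward the anchor, push the negative away.

    All inputs are (batch, code_size) tensors of L2-normalized embeddings.
    """
    d_close = torch.norm(e_anchor - e_close, dim=1)          # Eq. (2)
    d_similar = torch.norm(e_anchor - e_similar, dim=1)      # Eq. (3)
    d_negative = torch.norm(e_anchor - e_negative, dim=1)    # Eq. (4)
    # Eq. (1): the loss is zero once the positives beat the negative by the margin alpha.
    return F.relu(d_close + d_similar - d_negative + alpha).mean()
```

During training, this term would be added to the chamfer reconstruction loss (weighted by the loss ratio listed in Table 4) before backpropagating through the encoder.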
Figure 4: Visualization of the margin loss in latent space. Applying our margin loss pulls e_similar and e_close closer to e_anchor in latent space and pushes e_negative away from it.

Dataset

To test our method, we need a dataset of real-world scenes with object segmentation. We use the Facebook Replica dataset [6]. There would have been larger candidate datasets, but most of them had drawbacks. InteriorNet [17] would have required a dense mapping preprocessing step. SceneNet [18] is randomly generated and the positions of objects do not make any semantic sense. ScanNet [19] and Matterport3D [20] would have been two other datasets we could have used for evaluation.

The Facebook Replica dataset contains 18 photo-realistic indoor scene reconstructions. It includes semantic segmentation labels, but we only use this information to evaluate our algorithm. Of the 18 scenes, we use one for validation and one for testing. The 3D scenes are represented as polygon mesh files. We convert them to point clouds by randomly sampling points on the surface of the mesh. The number of samples on each quadrilateral is proportional to its area. For our experiments, we sampled 1000 points on each instance. To remove uninformative biases, we center all instances in euclidean space (a simplified sketch of this preprocessing is given at the end of this section).

Several variations of the algorithm were tested, such as a PointNet++ encoder [21] and both deeper and broader encoder and decoder architectures. Moreover, we tested the combination of our autoencoder with an RGB-D inpainting module. To stay concise, we only report the results of our best-performing architecture, as described in Section 3 and in Table 5.

Figure 5: Principal component analysis of the latent space of the test set. A subset of classes is highlighted to show the clustering.

Experiments

In the following sections, we evaluate the results of the experiments conducted with the proposed algorithm. In essence, there are two outputs that we can evaluate quantitatively, namely the latent-space embeddings and the reconstructions.
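To illustrate the instance preprocessing described above (surface sampling proportional to face area, a fixed 1000 points per instance, and centering), the following is a simplified sketch assuming triangulated faces; the meshes themselves would be loaded with a standard mesh library, and this is not the exact pipeline used for our experiments.

```python
import numpy as np

def mesh_to_point_cloud(vertices, faces, n_points=1000, seed=0):
    """Sample points on a triangle mesh proportionally to face area, then center.

    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    """
    rng = np.random.default_rng(seed)
    tris = vertices[faces]                                        # (F, 3, 3)
    # Face areas from the cross product of two edge vectors.
    areas = 0.5 * np.linalg.norm(
        np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0]), axis=1)
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates inside each selected triangle.
    u, v = rng.random(n_points), rng.random(n_points)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    t = tris[face_idx]
    points = t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])
    return points - points.mean(axis=0)                           # remove position bias
```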
Latent Space

The latent space is not straightforward to evaluate as there is no distinct metric that directly measures the semantic value of the information in the embeddings. Figure 5 shows the latent space of the test set, reduced to two dimensions using principal component analysis. We can observe that different classes occupy different regions in embedding space. Moreover, classes that we perceive to be semantically similar, such as "indoor-plant" and "vase", are close in semantic space. Two properties that are implied by a rich hyperspace are alignment and uniformity.

ARI-KM                 Train     Test
Points2Vec             0.2331    0.63
PointNet Autoencoder   0.2246    0.5178
Margin Only            0.2278    0.5491
Table 1: Evaluation of the unsupervised k-means clustering in latent space.
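A two-dimensional view of the latent space such as Figure 5 can be produced with a few lines of scikit-learn. This is only a visualization sketch with illustrative names; ground-truth labels are used for coloring, not for computing the embeddings.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_space(embeddings, labels):
    """Project (N, D) codes to 2D with PCA and color points by ground-truth class."""
    xy = PCA(n_components=2).fit_transform(embeddings)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=5, cmap="tab20")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```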
Alignment is achieved when similar features are assigned to similar data inputs, resulting in embeddings of the same classes being close in latent space. In our algorithm, this is enforced by both losses: the margin loss pulls geometrically similar instances closer together, and the reconstruction loss implicitly forces geometrically similar instances to occupy a similar region in the embedding space. To evaluate alignment, we run two tests over the embeddings.

First, we run an unsupervised k-means clustering algorithm over the embeddings in hyperspace and evaluate its performance. In an ideal embedding space, each class occupies a particular region of the hyperspace that should be linearly separable from other regions. To quantify the performance of the clustering, we make use of the ground-truth labels and compute the adjusted Rand index (ARI-KM). This measure takes the permutations of class IDs into account when scoring the unsupervised clustering. In essence, it can be viewed as a performance measure for a downstream classification task. The ARI-KM scores are given in Table 1. We see that our method achieves a higher ARI-KM than a pure autoencoder, both on the test and the train set. Moreover, we run a training in which we skip the reconstruction and only backpropagate the margin loss. This results in an ARI-KM that is lower than for the method we propose, but higher than for the pure autoencoder. Note that the ARI-KM on the test set is higher than on the train set due to a different number of instances per class in the two sets.

The second metric for alignment is the averaged cosine distance between instances of two classes, as reported in Table 2. For this, we take all possible combinations of two instances from two classes, compute each cosine distance, and average over the combinations. We perform these calculations both for two different classes and for all combinations of instances of the same class. In a well-clustered latent space, we expect the averaged cosine distance between instances of the same class to be lower than between two different classes. Not only can we see that intra-class cosine distances are lower than inter-class distances, but we also get a sense of the distances between instances of different classes. For example, the averaged distance between all windows and all blinds is relatively small, indicating closeness in semantic space. This is consistent with our semantic understanding of the two objects. Moreover, if we average all intra-class distances, we get a lower distance (0.262) than if we average all inter-class distances (0.997).

Averaged cosine distances
          Chair    Plate    Blinds   Window
Chair     0.001    0.924    0.863    1.13
Plate     0.924    0.186    1.255    1.289
Blinds    0.863    1.255    0.113    0.137
Window    1.13     1.289    0.137    0.035
Table 2: Averaged cosine distances between all instances of two classes on the test set. Intra-class distances are lower, indicating alignment in latent space.
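Both alignment metrics can be computed with scikit-learn. The sketch below assumes embeddings are stored as an (N, D) array with integer ground-truth labels; the names are illustrative, not those of our implementation.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_distances

def ari_kmeans(embeddings, labels, n_clusters):
    """Unsupervised k-means in latent space, scored against ground truth (ARI-KM)."""
    pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    return adjusted_rand_score(labels, pred)   # invariant to permutations of cluster IDs

def mean_cosine_distance(codes_a, codes_b):
    """Averaged cosine distance over all instance pairs from two classes (cf. Table 2)."""
    return cosine_distances(codes_a, codes_b).mean()
```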
Second, we evaluate uniformity. Uniformity describes the trait that features are evenly distributed in latent space. A uniform distribution preserves maximal information in the latent space, as shown in [22]. Figure 6 shows a histogram of the embeddings. For this visualization, we reduced the dimensionality of the hyperspace from 256 dimensions down to two using principal component analysis and normalized the embeddings. As is visible, the embeddings are evenly distributed along the circle. We conjecture that the areas of higher density are due to class imbalance.

Figure 6: Heatmap of the hyperspace after principal component analysis and normalization. A uniform distribution of feature vectors indicates that maximal information is preserved.

Further, we test uniformity by interpolating and reconstructing embeddings. A uniform distribution in latent space should result in a smooth transition of the reconstructed point clouds. Results are shown in Figure 7. Here, we interpolate between two instances of different classes, namely a bed and a table. We observe a smooth transition between the two classes.

Figure 7: Interpolation between two different classes.
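The interpolation experiment can be sketched as follows. Here, decoder stands for the trained reconstruction network; re-projecting intermediate codes onto the unit hypersphere is our assumption (the learned codes are normalized), and a plain linear interpolation would work similarly.

```python
import torch

def interpolate_and_decode(decoder, code_a, code_b, steps=8):
    """Decode point clouds along the line between two embeddings (cf. Figure 7)."""
    clouds = []
    for t in torch.linspace(0.0, 1.0, steps):
        code = (1.0 - t) * code_a + t * code_b
        code = code / code.norm()              # stay on the unit hypersphere
        clouds.append(decoder(code.unsqueeze(0)))
    return clouds
```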
To give an example of a use case in robotics, we give our algorithm the task of finding semantically similar objects in an unseen environment. In the example shown in Figure 1, we search for objects that are semantically similar to a chair. With our method, the most semantically similar instances in the scene are all the other chairs and the table. This showcases the principle of our method. The chairs are perceived as semantically similar, as they have a similar shape. The table, on the other hand, does not have a similar shape but regularly co-occurs with chairs in our training set. The margin loss thus pulls its embedding closer to the embedding region of the chairs. When running the same experiment with a pure PointNet autoencoder, the table is not among the most similar instances, even though we would intuitively associate a table with chairs.
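The retrieval shown in Figure 1 amounts to a nearest-neighbor search in the latent space. A minimal sketch, assuming all instance embeddings in the scene have been precomputed (for unit-norm codes, ranking by Euclidean distance is equivalent to ranking by cosine similarity):

```python
import numpy as np

def most_similar(query_code, codes, k=5):
    """Return the indices of the k instances whose codes are closest to the query.

    query_code: (D,) embedding of the query instance; codes: (N, D) scene embeddings.
    """
    distances = np.linalg.norm(codes - query_code, axis=1)
    return np.argsort(distances)[:k]
```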
Reconstruction

In this section, we evaluate the reconstructions of our algorithm. While achieving high-quality reconstructions is not the actual goal of this method, it is still a viable quantitative measure of the semantic richness of the feature vectors. Figure 8 shows ground truth/reconstruction pairs from the unseen test set. To measure the quality of the reconstruction, we compute the chamfer distance between the ground truth and the reconstruction. In Table 3, we list the resulting averaged chamfer distance (ACD) on the train and test sets. We compare our approach to the PointNet autoencoder in order to evaluate how adding our context module affects reconstruction. We see that both on the training and on the test set, our method achieves a lower ACD, indicating a better reconstruction. On the test set, we observe an improvement of 8% compared to the PointNet autoencoder.

ACD                    Train     Test
Points2Vec             0.0654    0.0946
PointNet Autoencoder   0.067     0.103
Table 3: Averaged chamfer distances (ACD) on the training and test set for Points2Vec and the PointNet autoencoder. Our method achieves a lower ACD than a PointNet autoencoder, indicating better reconstruction.

Figure 8: Ground truth (left) and reconstructions (right) of unseen instances.
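For completeness, one common formulation of the chamfer distance between two point clouds, usable both as the reconstruction loss and as the ACD metric, is sketched below; the exact variant used for the experiments (e.g. squared versus unsquared distances) may differ from this sketch.

```python
import torch

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point clouds a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b)                                  # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```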
Conclusion

We presented an algorithm for unsupervised learning of object representations from point clouds. To the best of our knowledge, this is the first algorithm that approaches the task of extracting semantic feature vectors of point clouds by not just looking at the isolated point cloud, but by taking the context in which it appears into account.

We trained and evaluated our algorithm on the Replica dataset. For evaluation, we tested our method on unseen data and evaluated the latent space for alignment by running an unsupervised k-means clustering algorithm over the embeddings and quantifying the performance using ground-truth annotations. Moreover, we were able to reconstruct realistic point cloud instances. For benchmarking, we ran a PointNet autoencoder, as proposed in [11], on the same dataset. Our method achieves a better score both in latent-space clustering and in reconstruction than this benchmark. These results support our hypothesis that context can add semantic information in point cloud feature learning. Due to better clustering in latent space, less data annotation is required for a downstream classification task.

For now, our method runs on instance-segmented point clouds. In a real-life application, it might therefore be combined with a point cloud object segmentation algorithm that segments object instances from point clouds, such as [23] or [24].

Our algorithm is a first proposal of what an algorithm that learns semantic representations from real-world scenes can achieve. Future work might make use of generative adversarial networks [10] to make better use of the data and better encode the geometry of the objects. A NeuralSampler [25] could be used to deal with the fact that not all objects are equally well represented with the same number of points.

We ran our algorithm on a small-scale dataset consisting of 18 indoor spaces with limited range. Our work is inspired by methods in natural language processing which are learned using datasets with many orders of magnitude more data. 3D sensors are rapidly becoming more common in human environments, as exemplified by the release of the iPhone 12 Pro with a lidar sensor. Our promising early results are an encouraging sign that context provides valuable semantic information in unsupervised point cloud feature learning. We have little doubt that the quality and range of the learned representations would only improve when learning from larger, more varied datasets. Imagine what we could learn if our dataset contained 1 million scanned scenes.
Appendix
Additional implementation details
Learning Rate                                0.0001
Code Size                                    256
Batch Size                                   20
Number of Epochs                             4000
Margin Loss Parameter                        1
Margin Loss to Reconstruction Loss Ratio     10
Table 4: Hyperparameters used in the experiments.
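For orientation, the following is a heavily simplified PointNet-style encoder consistent with the description in the method section: a per-point MLP, max pooling to a global feature, fully connected layers down to the code size, and projection onto the unit hypersphere. Layer widths are illustrative; the T-Net transforms, dropout layers, and exact sizes of Table 5 are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplePointNetEncoder(nn.Module):
    def __init__(self, code_size=256):
        super().__init__()
        self.point_mlp = nn.Sequential(            # applied to every point independently
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, code_size),
        )

    def forward(self, points):                     # points: (batch, N, 3)
        per_point = self.point_mlp(points)         # (batch, N, 1024)
        global_feature = per_point.max(dim=1).values   # symmetric max pooling
        code = self.head(global_feature)
        return F.normalize(code, dim=1)            # project onto the unit hypersphere
```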
References

[1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013.
[2] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.
[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[4] L. Jing and Y. Tian, "Self-supervised visual feature learning with deep neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[5] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," CoRR, vol. abs/1612.00593, 2016.
[6] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. A. Newcombe, "The Replica dataset: A digital replica of indoor spaces," CoRR, vol. abs/1906.05797, 2019.
[7] A. Sharma, O. Grau, and M. Fritz, "VConv-DAE: Deep volumetric shape learning without object labels," in Geometry Meets Deep Learning Workshop at European Conference on Computer Vision (ECCV-W), 2016.
[8] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta, "Learning a predictable and generative vector representation for objects," in European Conference on Computer Vision, pp. 484–499, Springer, 2016.
[9] S. Chen, C. Duan, Y. Yang, D. Li, C. Feng, and D. Tian, "Deep unsupervised learning of 3D point clouds via graph topology inference and filtering," IEEE Transactions on Image Processing, vol. 29, 2019.
[10] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," 2014.
[11] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas, "Representation learning and adversarial generation of 3D point clouds," CoRR, vol. abs/1707.02392, 2017.
[12] M. Zamorski, M. Zieba, R. Nowak, W. Stokowiec, and T. Trzcinski, "Adversarial autoencoders for generating 3D point clouds," CoRR, vol. abs/1811.07605, 2018.
[13] Z. Han, M. Shang, Y. Liu, and M. Zwicker, "View inter-prediction GAN: Unsupervised representation learning for 3D shapes by learning global shape memories to support local view predictions," CoRR, vol. abs/1811.02744, 2018.
[14] T. Schmidt, R. Newcombe, and D. Fox, "Self-supervised visual descriptor learning for dense correspondence," IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 420–427, 2016.
[15] P. R. Florence, L. Manuelli, and R. Tedrake, "Dense object nets: Learning dense visual object descriptors by and for robotic manipulation," in Conference on Robot Learning, pp. 373–385, 2018.
[16] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," CoRR, vol. abs/1503.03832, 2015.
[17] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger, "InteriorNet: Mega-scale multi-sensor photo-realistic indoor scenes dataset," in British Machine Vision Conference (BMVC), 2018.
[18] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison, "SceneNet RGB-D: Can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation?," 2017.
[19] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[20] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, "Matterport3D: Learning from RGB-D data in indoor environments," International Conference on 3D Vision (3DV), 2017.
[21] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Advances in Neural Information Processing Systems, pp. 5099–5108, 2017.
[22] T. Wang and P. Isola, "Understanding contrastive representation learning through alignment and uniformity on the hypersphere," 2020.
[23] Z. Ding, X. Han, and M. Niethammer, "VoteNet: A deep learning label fusion method for multi-atlas segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 202–210, Springer, 2019.
[24] T. Pham, T.-T. Do, N. Sünderhauf, and I. Reid, "SceneCut: Joint geometric and object segmentation for indoor scenes," 2018.
[25] E. Remelli, P. Baqué, and P. Fua, "NeuralSampler: Euclidean point cloud auto-encoder and sampler."