Info3D: Representation Learning on 3D Objects using Mutual Information Maximization and Contrastive Learning
Aditya Sanghi
Autodesk AI Lab, Toronto, Canada [email protected]
Abstract.
A major endeavor of computer vision is to represent, understand and extract structure from 3D data. Towards this goal, unsupervised learning is a powerful and necessary tool. Most current unsupervised methods for 3D shape analysis use datasets that are aligned, require objects to be reconstructed, and suffer from deteriorated performance on downstream tasks. To solve these issues, we propose to extend the InfoMax and contrastive learning principles to 3D shapes. We show that we can maximize the mutual information between 3D objects and their "chunks" to improve the representations in aligned datasets. Furthermore, we can achieve rotation invariance in the SO(3) group by maximizing the mutual information between the 3D objects and their geometrically transformed versions. Finally, we conduct several experiments such as clustering, transfer learning and shape retrieval, and achieve state-of-the-art results.
Keywords:
3D Shape Analysis, Unsupervised Learning, Rotation Invariance, InfoMax, Contrastive Learning
Introduction

Recently, several unsupervised methods have managed to extract powerful features for 3D objects, such as [16], [49], [1], [30] and [26]. However, these methods assume all 3D objects are aligned and have the same pose in the given category. In real-world scenarios, this is not the case. For example, when a robot is identifying and picking up an object, the object is in an unknown pose. Even in online repositories of 3D shapes, most of the data is randomly oriented, as users create objects in different poses. To use these methods effectively, we would have to align all objects in a given category, which is a very expensive and time-consuming process.

Furthermore, these unsupervised methods require reconstruction of 3D shapes, which is not ideal for many reasons. Firstly, it is not always feasible to reconstruct the 3D representation of a shape. For example, due to the discrete nature of meshes, reconstructing their representation may not be attainable. Moreover, in cases where we need an invariant representation, it is hard to reconstruct the shape back from that invariant representation.
Fig. 1:
General idea of the method.
We try to bring the 3D shape and a different view of the 3D shape closer in latent space while pushing the representations of other objects in the dataset further away.

For example, if you wanted to create rotation invariant embeddings, you would need to remove pose information from the embeddings of 3D objects. Once the pose information is lost, it is no longer possible to reconstruct the shape, as it would have to be reconstructed in that given pose.

To overcome the challenges mentioned above, we propose a decoder-free, unsupervised representation learning mechanism which is rotation insensitive. The method we introduce takes inspiration from the Contrastive Predictive Coding [29] and Deep InfoMax [43] approaches. These methods usually require a different "view" of the object, which is used to maximize the mutual information with the object. This other view can be a different modality, a data augmentation of the object, a local substructure of the object, etc.

We consider two different views of a given object in this work. First, we consider maximizing the mutual information between a local chunk of a 3D object and the whole 3D object. The intuition for using this view is that the 3D shape is forced to learn about its local regions, as it has to distinguish them from parts of other objects. This greatly enhances the representations learnt on aligned objects. Second, we consider maximizing the mutual information between a 3D shape and a geometrically transformed version of that shape. The advantage of maximizing the mutual information in this scenario is that it can create globally geometry-invariant representations. This is very useful for achieving rotation-insensitive representations in SO(3). Figure 1 illustrates the rough intuition behind the method. Note that, despite using objects from different categories in the figure, we push away every other shape in the dataset, which may include objects from the same category. This method can be thought of as instance discrimination [48].

The key contributions of our work are as follows:
– We introduce a decoder-free representation learning method which can easily be extended to any 3D descriptor without needing to construct complex decoder architectures.
– We show how local chunks of 3D objects can be used to get very effective representations for downstream tasks.
– We demonstrate the effectiveness of the method on rotated inputs and show how it is insensitive to such rotations in the SO(3) group.
– We conduct several experiments to show the efficacy of our method and achieve state-of-the-art results in transfer learning, clustering and semi-supervised learning.
Related Work

Representation Learning on 3D Objects.
Much progress has been made on learning good representations of 3D objects in an unsupervised manner, which can then be used in several downstream tasks. For point clouds, works such as [1], [49], [16], [18], [36] and [52] have been proposed. For voxels and implicit representations, works such as [46], [26], [10], [27] and [30] create powerful representation features which are used for several downstream tasks such as shape completion, shape generation and classification. Recently, there has also been progress in auto-encoding meshes, such as in [40] and [11]. One disadvantage of the above approaches is that they require you to reconstruct or generate the 3D shapes. As stated earlier, it might be expensive or impossible to reconstruct the shape. A recent approach [51] does not require reconstructing the shape and instead trains in two stages: it first uses parts of a shape to learn features using contrastive learning, and then uses pseudo-clusters to cluster all the data. Our method does not require two stages of training and, furthermore, allows us to create rotation invariant embeddings.
Maximizing Mutual Information and Contrastive Learning.
Many methods have used mutual information to do unsupervised learning. Historically, works such as [24], [4], [6], [5] have explored the InfoMax principle and mutual information maximization. More recently, [29] proposed the Contrastive Predictive Coding framework, which uses the embeddings to capture maximal information about future samples. The Deep InfoMax (DIM) [43] approach is similar but has the advantage of doing orderless autoregression. Concurrently to these works, works such as [48] and [53] have extended these ideas from the metric learning point of view. These methods were extended in [3], [19], [42] by considering multiple views of the data, achieving state-of-the-art representation learning on images. The InfoMax principle has also been extended to graphs [43], [39] and to state representation in reinforcement learning [2]. Our method is inspired by these methods and we extend them to 3D representations. For the multiple views, we use geometric transformations and chunks of a 3D object. Finally, concurrent to our work, several new works such as [9], [28] have been proposed.
Rotation Invariance on 3D objects.
Traditionally, several methods have focused on hand-engineered features to get local rotation invariant descriptors, such as [34], [37], [22]. More recently, methods such as [13] and [14] first obtain local invariant features by encoding local geometry into patches and then use an autoencoder to reconstruct the local features. These approaches require creating hand-crafted local features and need normal information. Methods such as MVCNN [38], Rot-SO-Net [23] and [35] explicitly force invariance by taking multiple poses of the object and aggregating over the poses. However, such methods only work on discrete rotations or can only reconstruct objects rotated along one axis. Many deep learning methods have also attempted to use equivariance-based architectures to achieve local and global rotation equivariance on 2D and 3D data. Methods such as [12], [45], [15], [41] either use constrained filters that achieve rotation equivariance or use filter orbits which are themselves equivariant. It is usually difficult to create such architectures for different 3D representations. Furthermore, generating rotation-invariant representations from equivariant representations usually requires a post-processing step. Our method uses the InfoMax loss with rotation transformations to enforce rotation invariance, and can easily be extended to voxels, implicit representations and meshes without needing to create complex architectures.
Fig. 2:
Overview of the method.
A 3D object and a different view of that object are encoded using the same encoder. The features across these views are made similar, while a memory bank is used for negative examples. We also store the features obtained in the memory bank.

Method

Our goal is to extract good features from 3D shapes in an unsupervised manner. To achieve this, we maximize the mutual information between the features extracted from a 3D shape and a different view of that shape. We consider two such views in this work, each serving a different purpose. First, we use a "chunk" of a 3D object to improve the representation of aligned shapes. The goal of using chunks of a 3D object is to incorporate some form of locality and input structure into the objective. We investigate different ways to extract a chunk of a 3D object and show how this can improve the representations learned. Next, we consider a geometric transformation as the second view. Intuitively, the goal is to make the shape and its geometrically transformed version closer in the latent space while distancing them from other objects. We show how this can create transformation invariant embeddings to a large extent. We discuss this method in the context of point clouds, but it should be straightforward to extend it to other 3D representations. The method is shown pictorially in Figure 2.

Let $x_i$ and $T(x_i)$ represent a 3D shape and the object obtained after applying a transformation as mentioned above. We use an encoder, $f(\cdot)$, to transform $x_i$ and $T(x_i)$ into the latent representations $z_i^a$ and $z_i^b$. Note that this encoder also includes a final normalizing layer which makes the embeddings unit vectors. In the next sections, we detail the motivation and mechanism for using chunks of a 3D object and geometric transformations as the second view.
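As a concrete illustration of this setup, the sketch below is ours, not the paper's published code: PyTorch is assumed, and any PointNet-style network can stand in for $f(\cdot)$. It encodes the two views with a single weight-shared encoder and L2-normalizes the outputs into unit vectors, mirroring the final normalizing layer described above.

```python
import torch
import torch.nn.functional as F

def encode_views(encoder: torch.nn.Module,
                 x: torch.Tensor,        # (B, N, 3) point clouds x_i
                 x_view: torch.Tensor):  # (B, N, 3) second views T(x_i)
    """Shared-encoder forward pass for both views.

    Returns z_a, z_b of shape (B, D); each row is a unit vector,
    corresponding to z_i^a = f(x_i) and z_i^b = f(T(x_i)).
    """
    z_a = F.normalize(encoder(x), dim=1)       # final normalizing layer
    z_b = F.normalize(encoder(x_view), dim=1)
    return z_a, z_b
```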
Fig. 3:
Types of chunks.
The first object represents a sample 3D object. The rest of the objects are chunks obtained from that object. Note that the last three chunks use the same random point; despite this, very different chunks are obtained with different distance metrics.

Chunks of a 3D Object

As mentioned above, chunks provide a way to incorporate a locality structure into the objective. We do this by first defining a local subset of the 3D object. For point clouds, randomly selecting a subset of points just yields a coarser representation of the 3D object, so we investigate some more meaningful ways to define local sub-structures in point clouds. We then force the network to distinguish between its own local subsets and other objects' local subsets using the InfoMax principle. This objective forces the network to learn about its own local sub-structures and creates more informative embeddings.

We define chunks using two mechanisms. In the first approach, we randomly select a point from the point cloud. Then, we use a distance measure in Euclidean space to select a subset of points. The distance measures we consider are the Euclidean distance, cosine distance and Chebyshev distance. Once we select a subset of points, we normalize it using a bounding sphere. In the second approach, we take a chunk of a 3D object by chopping the object randomly
along the cartesian axes. The chunk is again normalized using a bounding sphere. The different chunks obtained from a sample 3D object are shown in Figure 3.
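A minimal NumPy sketch of the two chunk mechanisms follows; the function names and the chunk size k are ours for illustration, as the paper does not publish code. Distance-based chunks keep the k points nearest to a random seed point under the chosen metric; the axis-chopped chunk keeps the points on one side of a random axis-aligned cut. Both are then normalized into a unit bounding sphere.

```python
import numpy as np

def _unit_sphere(points: np.ndarray) -> np.ndarray:
    """Normalize a chunk into a unit bounding sphere centered at the origin."""
    points = points - points.mean(axis=0)
    return points / (np.linalg.norm(points, axis=1).max() + 1e-8)

def distance_chunk(points: np.ndarray, k: int = 512, metric: str = "euclidean"):
    """Keep the k points nearest to a random seed point under the given metric."""
    seed = points[np.random.randint(len(points))]
    diff = points - seed
    if metric == "euclidean":
        d = np.linalg.norm(diff, axis=1)
    elif metric == "chebyshev":
        d = np.abs(diff).max(axis=1)
    else:  # cosine distance between the position vectors
        d = 1.0 - points @ seed / (
            np.linalg.norm(points, axis=1) * np.linalg.norm(seed) + 1e-8)
    return _unit_sphere(points[np.argsort(d)[:k]])

def axis_chopped_chunk(points: np.ndarray) -> np.ndarray:
    """Keep the points on one side of a random cut along a random cartesian axis."""
    axis = np.random.randint(3)
    cut = np.random.uniform(points[:, axis].min(), points[:, axis].max())
    return _unit_sphere(points[points[:, axis] < cut])
```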
Geometric Transformations

In this work, we consider several different geometrically transformed versions of a 3D object. This can be thought of as a form of data augmentation. However, our method differs from the traditional use of data augmentation by explicitly influencing the latent space to create better embeddings, rather than implicitly hoping that the augmentation creates meaningful representations. Furthermore, we are doing data augmentation in an unsupervised setting, so the increased cost of augmentation can be shared across several tasks instead of just one task, as in a supervised setting.

Though several other data augmentation methods can be used with the InfoMax principle, we consider geometric transformations for two major reasons. First, it is trivial to apply an affine transformation to a 3D object: we simply multiply by a transformation matrix. Second, we can use the rotation affine transformation to create embeddings which are less sensitive to alignment.

In this paper, we only consider translation, rotation along the z axis, rotation in the SO(3) rotation group, uniform scale and non-uniform scale as our geometric augmentations. We also combine different transformations to create more complex ones. When we learn representations from unaligned datasets, we always rotate the object before applying a transformation, to ensure the representation is not sensitive to rotation.

Estimating Mutual Information

To estimate mutual information we use the InfoNCE [29] objective. Consider $N$ samples from some unknown joint distribution $p(x, y)$. For this objective, we need to construct positive and negative samples. The positive examples are sampled from the joint distribution $p(x, y)$ and the negative samples from the product of marginals $p(x)p(y)$. The objective is to learn a critic function, $h(\cdot)$, by increasing the probability of positive examples and decreasing the probability of negative examples. The bound is given by

$$I_{NCE} = \sum_{i=1}^{N} \log \frac{h(x_i, y_i)}{\sum_{j=1}^{N} h(x_i, y_j)} \quad (1)$$

In our case, the positive samples are constructed using the shape, $x_i$, and a different view, $T(x_i)$, of the shape. We construct negative examples by uniformly sampling $k$ pairs over the whole transformed version of the dataset. Note that this procedure can lead to objects from the same category being part of the negative examples. We consider a batch size of $N$. The critic function is defined as the exponential of a bilinear function of $f(x_i)$ and $f(T(x_i))$, parameterized by $W$. Note that the critic can be defined on the global features from $f(\cdot)$ or on intermediate features of $f(\cdot)$. We can also modulate the distribution using the temperature parameter $\tau$. The critic function is shown below:

$$h(x_i, T(x_i)) = \exp\big(f(x_i)^{\top} W f(T(x_i)) / \tau\big) \quad (2)$$

We now consider the objective where we maximize the mutual information between the global representations of $x$ and $T(x)$. That is, we maximize the mutual information between features from the latter layers of the encoder. The loss is shown below:

$$L = \sum_{i=1}^{N} -\log \frac{h(x_i, T(x_i))}{h(x_i, T(x_i)) + \sum_{j=1}^{k} h(x_i, T(x_j))} \quad (3)$$

In theory, more negative examples, $k$, should lead to a tighter mutual information bound. One way of achieving this involves using a large batch size, which might not be ideal. To avoid this, we take inspiration from [48], [42], [53] and use a memory bank to store features from previous batches. This allows us to use a large number of negative examples. Increasing the number of negative examples, however, leads to a prohibitive cost in computing the softmax. Hence, we use Noise-Contrastive Estimation [17] to approximate the above loss, as in [42]:

$$L_{NCE} = \frac{1}{N} \sum_{i=1}^{N} \Big( -\log\big[h(x_i, T(x_i))\big] - \sum_{j=1}^{k} \log\big[1 - h(x_i, T(x_j))\big] \Big) \quad (4)$$
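The sketch below (ours; PyTorch assumed) implements the bilinear critic of equation (2) and the softmax objective of equation (3) with negatives drawn from a memory bank. The bank update here is a simplified random overwrite rather than the per-sample bookkeeping of [48], and we show the softmax form rather than the NCE approximation of equation (4) for brevity.

```python
import torch
import torch.nn.functional as F

class Info3DLoss(torch.nn.Module):
    """Bilinear critic h(x, T(x)) = exp(z_a^T W z_b / tau) with bank negatives."""

    def __init__(self, dim: int, bank_size: int, tau: float = 0.07):
        super().__init__()
        self.W = torch.nn.Parameter(torch.eye(dim))  # critic weights W
        self.tau = tau
        # memory bank of unit-normalized embeddings from past batches
        self.register_buffer("bank", F.normalize(torch.randn(bank_size, dim), dim=1))

    def forward(self, z_a: torch.Tensor, z_b: torch.Tensor, k: int = 512):
        """z_a, z_b: (B, D) unit embeddings of x_i and T(x_i)."""
        B = z_a.size(0)
        wz = z_a @ self.W                                     # (B, D)
        pos = (wz * z_b).sum(dim=1, keepdim=True) / self.tau  # (B, 1) positive logits
        idx = torch.randint(self.bank.size(0), (k,), device=z_a.device)
        neg = wz @ self.bank[idx].t() / self.tau              # (B, k) negative logits
        logits = torch.cat([pos, neg], dim=1)
        # cross-entropy with target 0 equals Eq. (3), since h = exp(logit)
        target = torch.zeros(B, dtype=torch.long, device=z_a.device)
        loss = F.cross_entropy(logits, target)
        # refresh a random slice of the bank with the current batch
        with torch.no_grad():
            slot = torch.randint(self.bank.size(0), (B,), device=z_a.device)
            self.bank[slot] = z_b.detach()
        return loss
```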
Experiments

We conduct several experiments to test the efficacy of our method and the representations learned by the encoder, on both aligned and unaligned shape datasets. We divide the experiment section into three parts. In the first part, we conduct experiments on aligned datasets and show the effectiveness of our method. In the second part, we discuss representation learning on rotated 3D shapes. Finally, in the last part, we look at different hyperparameters and factors affecting our method.
Training Details.
For most of the experiments we use a batch size of 32, sample 2048 points on the shapes and use the Adam optimizer [20] with a learning rate of 0.0001. For ModelNet40, we run the experiment for 250 epochs in the case of aligned datasets, whereas we run it for 750 epochs for unaligned datasets. As ModelNet10 is a smaller dataset, we run 750 epochs for the aligned dataset and 1000 epochs for the unaligned dataset. We set the number of negative examples to 512 and use 0.07 as the temperature parameter. For the ShapeNet v1/v2 dataset, we run for 200 epochs and use 2000 negative examples. For aligned datasets we use the features from the 6th layer of our model, whereas for unaligned datasets
we use features from the 7th layer. Furthermore, for aligned datasets we use the cosine distance based chunks for the clustering task and the axis chopped chunks for all other experiments as the second view, whereas for rotated datasets we use rotation in the SO(3) rotation group plus translation as the second view. The choice of these parameters is further discussed in the ablation study section. We also do early stopping if the network has converged. More details about the encoder structures and training details are in the appendix. Finally, for the ABC dataset [21] we use a batch size of 64 and 1024 sample points on the surface.
Baseline setup.
For the tasks of clustering, rotation invariance and shape retrieval, we compare our method with three important works on representation learning on point clouds: FoldingNet [49], Latent-GAN Autoencoder [1] and AtlasNet [16]. For all three baselines, we use training conditions similar to those mentioned above, except that we train the three models on ShapeNet [8] for 750 epochs and on ModelNet40 [47] for 1500 epochs for the unaligned dataset. For the aligned dataset we train on ModelNet40 for 500 epochs. We use the PointNet encoder as the encoder for these baselines. Note that this is different from the respective paper implementations. More details regarding the architectures used for the baselines can be found in the appendix.
Data Preparation.
All our experiments are conducted on ModelNet10 [47], ModelNet40 [47], ShapeNet v1/v2 [8] and the ABC dataset [21]. In some of the above datasets, objects are aligned according to their categories, so to unalign them we randomly generate a quaternion and rotate them in SO(3) space. During unsupervised training, for each epoch, we rotate the shape differently so that the network can see different poses of the same shape. However, when we test on the downstream tasks, we only rotate the dataset once. This is to ensure that we test the effectiveness of the unsupervised learning part rather than the effectiveness of the downstream part.
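A sketch of this unalignment step (ours, for illustration): a unit quaternion drawn from an isotropic Gaussian is uniform on SO(3), and the corresponding rotation matrix is applied to the point cloud.

```python
import numpy as np

def random_so3_rotation(points: np.ndarray) -> np.ndarray:
    """Rotate an (N, 3) point cloud by a uniformly random SO(3) rotation."""
    q = np.random.randn(4)
    w, x, y, z = q / np.linalg.norm(q)  # random unit quaternion (w, x, y, z)
    # standard quaternion-to-rotation-matrix conversion
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    return points @ R.T
```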
Experiments on Aligned Datasets

In this section, we demonstrate how our method performs when doing representation learning on aligned shapes. Here we compare with well-established baselines and show the advantages of our method. Moreover, we also show how the embeddings obtained from our method are more clusterable than those of autoencoder methods. Note that for the clustering experiment we use the cosine distance based chunks, whereas for all other experiments we use the axis chopped chunks as the second view.
Transfer Learning, semi-supervised learning and pre-training.
A well-established benchmark for unsupervised learning is transfer learning. We follow the same procedure as [1] and [49]: we first use unsupervised learning to train on the ShapeNet v1 dataset, and we then train a linear SVM using the training dataset of ModelNet40. In Table 1 we report the accuracy score on the test set of ModelNet40. Furthermore, we use the pre-trained weights from training on ShapeNet to initialize the PointNet classifier, and compare against randomly initialized weights by reporting the classification accuracy in Table 1. Finally, we test our method in limited-data scenarios. We compare our method to [49] and [52] as mentioned in the appendix of [52]. It is not clear how they select the subset of the data; this especially matters with very limited data, as there can be as few as 0 to 3 shapes per category. We therefore report both the best and mean accuracy over 10 runs of choosing a random subset. The results are reported in Table 2.

Unsup. Method              Acc.    Sup. Method             Acc.
3D-GAN [46]                83.3    PointNet [32]           89.2
Latent-GAN [1]             85.7    DeepSets [50]           90.3
ClusterNet [51]            86.8    PointNet++ [33]         90.7
FoldingNet [49]            88.4    DGCNN [44]              93.5
Multi-Task PC [18]         89.1    Relational PC [25]      —
Recon. PC (PointNet) [36]  87.3    PointNet (Pretrained)   —
Recon. PC (DGCNN) [36]     90.6    DGCNN (Pretrained)      —
Ours (PointNet)            —
Ours (DGCNN)               —

Table 1: Results on Aligned Datasets. The left columns give the transfer learning results on ModelNet40, whereas the right columns give the supervised learning results.
Method            1%      2%      5%      20%     100%
FoldingNet [49]   56.15   67.05   75.97   84.06   88.41
3D Cap. Net [52]  59.24   67.67   76.49   84.48   88.91
Ours (best)       —       —       —       —       —
Ours (mean)       54.42±— —       —       —       —

Table 2: Semi-supervised results on ModelNet40.

It can be seen from Table 1 that we achieve state-of-the-art results on the transfer learning benchmark, beating the current state of the art by 1% when we use the DGCNN encoder with our method. Furthermore, using a simple encoder like PointNet beats many previous unsupervised methods with complex architectures, and surprisingly beats the original PointNet supervised learning benchmark. Initializing our model with pre-trained weights also helps in reaching high accuracy in very few epochs: we achieve 91% accuracy within 3 epochs, whereas it takes about 32 epochs with randomly initialized weights. This is shown in more detail in the test accuracy training curve in the appendix. Finally, we perform very well in limited-data scenarios, achieving about 87% accuracy with 20% of the labelled data.
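For concreteness, a sketch of the linear evaluation protocol (scikit-learn; variable names and the regularization strength C are ours, and the embeddings come from the frozen, ShapeNet-pretrained encoder):

```python
from sklearn.svm import LinearSVC

def linear_svm_transfer(z_train, y_train, z_test, y_test, C=1.0):
    """Fit a linear SVM on frozen embeddings and report test accuracy."""
    clf = LinearSVC(C=C).fit(z_train, y_train)
    return clf.score(z_test, y_test)
```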
Clusterable representation.
A good unsupervised method should create a separable manifold associated with the object classes [7]. A good way to test this is to see how easily the data can be naturally clustered. We train the network using our unsupervised method and then run the k-means algorithm on the embeddings obtained, using the implementation in sklearn [31]. We set the number of clusters to 40 and keep the rest of the default parameters. To test the association of the embeddings with the object classes we use the adjusted mutual information (AMI) metric. The results are shown in the aligned column of Table 3. We use the training set of ModelNet40 for the embeddings.
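A sketch of this clustering evaluation (scikit-learn, matching the setup above; the function name is ours):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

def clustering_ami(embeddings, labels, n_clusters=40):
    """Cluster embeddings with k-means and score against class labels with AMI."""
    assignments = KMeans(n_clusters=n_clusters).fit_predict(embeddings)
    return adjusted_mutual_info_score(labels, assignments)
```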
Method           Aligned (AMI)   Unaligned (AMI)
Latent-GAN [1]   0.646±—         —
AtlasNet [16]    —               —
FoldingNet [49]  —               —
Ours             —               —

Table 3:
Clustering on Aligned and Unaligned Dataset.
As seen in Table 3, our method produces more clusterable embeddings than the autoencoder methods. This can be surprising, as we are doing instance discrimination and trying to push away every other object in the dataset. Our intuition is that, as the neural network compresses the 3D objects into a lower dimension, it has to arrange the embeddings strategically, which we believe leads to objects of the same semantic category lying closer in space than objects from different categories.
Experiments on Unaligned Datasets

In this part, the goal is to test how our method compares to the autoencoder baselines on datasets which are randomly rotated in SO(3) space. As mentioned earlier, most data in real-world scenarios is unaligned. Note that we use rotation in the SO(3) rotation group plus translation as the second view for this section.
Simple rotation invariance experiment.
We create a simple experiment to test the sensitivity of the embedding of a shape with respect to its pose. We conduct the experiment by randomly selecting 10 shapes from the ShapeNet v1 dataset and randomly rotating each of them 50 times in SO(3) space, generating 50 separate objects with different poses per shape. We then generate embeddings for all these objects and apply clustering. We use t-SNE to give a visual representation of the clustering in $\mathbb{R}^2$, as shown in Figure 4. We compare with the Latent-GAN [1] model.

It can be seen from Figure 4 that our model manages to cluster objects and their different poses together. In contrast, the Latent-GAN model fails to create meaningful clusters. To quantify this, we use the k-means algorithm on the embeddings. In terms of the AMI metric, our model achieves a score of 1.0 whereas the baseline achieves 0.555. This implies that our method successfully learns to place objects and their different poses in close proximity in latent space, leading to less pose-sensitive embeddings for downstream tasks.

Fig. 4: t-SNE visualization of the rotation invariance check: (a) baseline autoencoder, (b) our model.
The figure illustrates how 3D shapes and their random poses are clustered by our method but fail to cluster with the baseline method.
Clusterable representations.
We also test how well our method does on clustering of embeddings when the objects are rotated in SO(3) space. The experimental setup is similar to the section above, and the results are shown in the last column of Table 3. It can be seen that our method significantly outperforms the autoencoder baselines. This illustrates that the autoencoder baselines are very sensitive to rotation, and that incorporating some form of rotation invariance into the objective can lead to a significant improvement in the embeddings obtained. We also show how this affects transfer learning results in the appendix.
Shape retrieval.
In many applications, retrieving an object similar to a query object is very useful, irrespective of their poses. For such applications, we take the embedding of a query object from the ShapeNet v1 dataset and the ABC dataset, and retrieve the 5 nearest neighbours of a given shape using the Euclidean distance. The results are shown in Figure 5. We again compare with the Latent-GAN baseline model.

It can be seen from the first row of Figure 5 that the objects retrieved by the autoencoder baseline are affected by the pose of the object: the retrieved objects are similar in pose and sometimes even from a different category. In contrast, our method retrieves more semantically similar objects with different poses, as shown in the second row. The last two rows show sample objects retrieved on the ABC dataset. More examples are shown in the appendix.

Data Type   Layer 1   Layer 2   Layer 3   Layer 4   Layer 5   Layer 6   Layer 7
Aligned     0.504±—   —         —         —         —         —         —
Unaligned   —         —         —         —         —         —         —

Table 4: Layer-wise embeddings clustering accuracy.
Fig. 5:
Shape Retrieval.
The first object in each row is the query object and the next five objects are the retrieved objects.
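A sketch of the retrieval step (ours, for illustration): rank the gallery embeddings by Euclidean distance to the query embedding and keep the five nearest.

```python
import numpy as np

def retrieve_nearest(query_z: np.ndarray, gallery_z: np.ndarray, k: int = 5):
    """Indices of the k gallery embeddings closest to the query (Euclidean)."""
    d = np.linalg.norm(gallery_z - query_z, axis=1)
    return np.argsort(d)[:k]
```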
Ablation Studies

We detail the effect of using different architectures, transformations and chunks on the performance of our method. We run most of the experiments on the ModelNet40 dataset and use the clustering task. We run the clustering algorithm 3 times because of the stochastic nature of k-means, and report the mean and standard deviation over this set of experiments. All results are measured in AMI.
Choice of layer.
We study the effect of choosing different layers to obtain the embeddings on the task of clustering. The goal of this experiment is to empirically test which layer contains the most information about the shape. The last layer (7th) of the architecture consists only of a linear layer. We experiment on the aligned as well as the unaligned version of the ModelNet40 dataset. The results are shown in Table 4.

Based on the results of Table 4, there are two interesting observations. First, models trained on aligned datasets produce qualitative embeddings from layer 6, whereas models trained on unaligned datasets have informative embeddings from layer 7. Hence, we take those respective embeddings for our experiments in the sections above. Secondly, in the case of models trained on unaligned datasets, layers 1-5 contain very little clusterable information, indicating that using exact position information in the point cloud might not be ideal.

Data Transformation         Aligned (AMI)
Translate (T)               0.633±—
…                           …
Chebyshev distance chunk    0.671±—

Table 5: Effect of different types of augmentation on aligned data of ModelNet40.
Data Augmentation                          Unaligned (AMI)
Rotate SO(3) + Translate                   0.496±—
…                                          …
Rotate SO(3) + Random Scale + Translate    0.485±—

Table 6:
Effect of different types of augmentation on unaligned data of ModelNet40.

Different types of geometric transformation and chunk selection.
In this section, we investigate the effectiveness of different ways of obtaining a chunk from a 3D object, and of different geometric transformations of a 3D object, for mutual information maximization. We do separate experiments for the aligned and unaligned versions of the ModelNet40 dataset. The transformations for the aligned dataset are shown in Table 5, whereas those for the unaligned dataset are shown in Table 6. In the case of uniform scaling, we scale the object uniformly across the three axes in the range of 0.5 to 1.5 units of the original object. For the translation data augmentation we randomly translate between −… . The effect of the different augmentations on transfer learning is shown in Table 7. Based on the results from these tables, we choose our transformations for a given task.

Encoder (Data Augmentation)               Transfer Learning Acc. (%)
PointNet (Translate)                      87.8
PointNet (Axis chopped chunk)             89.8
DGCNN (Axis chopped chunk)                —
DGCNN (Euclidean distance based chunk)    90.9
DGCNN (Cosine distance based chunk)       90.8
DGCNN (Chebyshev distance based chunk)    91.3
Table 7:
Effect of different types of augmentation on transfer learning of ModelNet40.

Effect of chunk size.
This experiment illustrates the trade-off between local and global information. If the chunk size is big, more global information will be incorporated and the network might fail to capture locality. If the chunk size is small, we will capture finer details of the object, but accuracy will suffer due to the contrastive nature of the algorithm. The results are shown in Table 8.
Chunk size   Aligned (AMI)
128          0.653±—
…            …
768          0.658±—
…            …

Table 8:
Effect of different chunk sizes.
Conclusion

In this paper, we investigated using different views of 3D objects to create effective embeddings which generalize well to different downstream tasks. We showed how considering local substructure in the objective is very effective, while considering rotation as a different view can create rotation invariant embeddings. In terms of future work, we would like to explore how our method generalizes to other tasks such as segmentation and part detection. Secondly, we would like to investigate other views of 3D objects, such as surface normals. Finally, we would like to extend this method to other 3D representations such as meshes and voxels.
References
1. Achlioptas, P., Diamanti, O., Mitliagkas, I., Guibas, L.: Learning representations and generative models for 3d point clouds. arXiv preprint arXiv:1707.02392 (2017)
2. Anand, A., Racah, E., Ozair, S., Bengio, Y., Côté, M.A., Hjelm, R.D.: Unsupervised state representation learning in atari. arXiv preprint arXiv:1906.08226 (2019)
3. Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910 (2019)
4. Becker, S.: An information-theoretic unsupervised learning algorithm for neural networks. University of Toronto (1992)
5. Becker, S.: Mutual information maximization: models of cortical self-organization. Network: Computation in Neural Systems (1), 7–31 (1996)
6. Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Computation (6), 1129–1159 (1995)
7. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (8), 1798–1828 (2013)
8. Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
9. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020)
10. Chen, Z., Zhang, H.: Learning implicit fields for generative shape modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5939–5948 (2019)
11. Cheng, S., Bronstein, M., Zhou, Y., Kotsia, I., Pantic, M., Zafeiriou, S.: Meshgan: Non-linear 3d morphable models of faces. arXiv preprint arXiv:1903.10384 (2019)
12. Cohen, T., Welling, M.: Group equivariant convolutional networks. In: International Conference on Machine Learning. pp. 2990–2999 (2016)
13. Deng, H., Birdal, T., Ilic, S.: Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 602–618 (2018)
14. Deng, H., Birdal, T., Ilic, S.: 3d local features for direct pairwise registration. arXiv preprint arXiv:1904.04281 (2019)
15. Esteves, C., Xu, Y., Allen-Blanchette, C., Daniilidis, K.: Equivariant multi-view networks. arXiv preprint arXiv:1904.00993 (2019)
16. Groueix, T., Fisher, M., Kim, V.G., Russell, B.C., Aubry, M.: Atlasnet: A papier-mâché approach to learning 3d surface generation. arXiv preprint arXiv:1802.05384 (2018)
17. Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 297–304 (2010)
18. Hassani, K., Haley, M.: Unsupervised multi-task feature learning on point clouds. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 8160–8171 (2019)
19. Hénaff, O.J., Razavi, A., Doersch, C., Eslami, S., Oord, A.v.d.: Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272 (2019)
20. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
21. Koch, S., Matveev, A., Jiang, Z., Williams, F., Artemov, A., Burnaev, E., Alexa, M., Zorin, D., Panozzo, D.: Abc: A big cad model dataset for geometric deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9601–9611 (2019)
22. Lazebnik, S., Schmid, C., Ponce, J.: Semi-local affine parts for object recognition (2004)
23. Li, J., Bi, Y., Lee, G.H.: Discrete rotation equivariance for point cloud recognition. arXiv preprint arXiv:1904.00319 (2019)
24. Linsker, R.: An application of the principle of maximum information preservation to linear systems. In: Advances in Neural Information Processing Systems. pp. 186–194 (1989)
25. Liu, Y., Fan, B., Xiang, S., Pan, C.: Relation-shape convolutional neural network for point cloud analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8895–8904 (2019)
26. Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3d reconstruction in function space. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4460–4470 (2019)
27. Michalkiewicz, M., Pontes, J.K., Jack, D., Baktashmotlagh, M., Eriksson, A.: Deep level sets: Implicit surface representations for 3d shape inference. arXiv preprint arXiv:1901.06802 (2019)
28. Misra, I., van der Maaten, L.: Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991 (2019)
29. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
30. Park, J.J., Florence, P., Straub, J., Newcombe, R., Lovegrove, S.: Deepsdf: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103 (2019)
31. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. Journal of Machine Learning Research (Oct), 2825–2830 (2011)
32. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 652–660 (2017)
33. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems. pp. 5099–5108 (2017)
34. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.R.: Orb: An efficient alternative to sift or surf. In: ICCV. vol. 11, p. 2. Citeseer (2011)
35. Sanghi, A., Danielyan, A.: Towards 3d rotation invariant embeddings
36. Sauder, J., Sievers, B.: Context prediction for unsupervised deep learning on point clouds. arXiv preprint arXiv:1901.08396 (2019)
37. Steder, B., Rusu, R.B., Konolige, K., Burgard, W.: Narf: 3d range image features for object recognition. In: Workshop on Defining and Solving Realistic Perception Problems in Personal Robotics at the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). vol. 44 (2010)
38. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 945–953 (2015)
39. Sun, F.Y., Hoffmann, J., Tang, J.: Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000 (2019)
40. Tan, Q., Gao, L., Lai, Y.K., Xia, S.: Variational autoencoders for deforming 3d mesh models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5841–5850 (2018)
41. Thomas, N., Smidt, T., Kearnes, S., Yang, L., Li, L., Kohlhoff, K., Riley, P.: Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219 (2018)
42. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)
43. Veličković, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm, R.D.: Deep graph infomax. arXiv preprint arXiv:1809.10341 (2018)
44. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38