Cloud Transformers: A Universal Approach To Point Cloud Processing Tasks
Kirill Mazur Victor Lempitsky
Samsung AI Center Moscow
Skolkovo Institute of Science and Technology

ABSTRACT
We present a new versatile building block for deep point cloud processing architectures. This building block combines the ideas of self-attention layers from the transformer architecture with the efficiency of standard convolutional layers in two- and three-dimensional dense grids. The new block operates via multiple parallel heads, whereas each head projects feature representations of individual points into a low-dimensional space, treats the first two or three dimensions as spatial coordinates and then uses dense convolution to propagate information across points. The results of the processing of individual heads are then combined together, resulting in the update of point features. Using the new block, we build architectures for point cloud segmentation as well as for image-based point cloud reconstruction. We show that despite the dissimilarity between these tasks, the resulting architectures achieve state-of-the-art performance for both of them, demonstrating the versatility of the new block.
1 INTRODUCTION
Convolutional neural networks (ConvNets) LeCun et al. (1989) and Transformers Vaswani et al. (2017) have emerged as the most successful data processing architectures across a variety of data domains. ConvNets are naturally suited for high-dimensional data that are sampled on low-dimensional grids and have spatial, temporal or spatial-temporal nature (e.g. images or sound). At the same time, Transformers scale less well to high-dimensional data but excel at handling less structured data such as phrases of a natural language.

In this work, we focus on point clouds, which are data of relatively high dimensionality but lacking the regularity of images. Despite the lack of regularity and due to the high dimensionality and spatial nature of point clouds, most state-of-the-art architectures for point cloud processing are derived from ConvNets. These ConvNet adaptations are based on direct rasterization of point clouds onto regular grids followed by convolutional pipelines Su et al. (2015b); Graham et al. (2018a), as well as on generalizations of the convolutional operators to irregularly sampled data Mao et al. (2019b); Wang et al. (2018) or non-rectangular grids Klokov & Lempitsky (2017); Jampani et al. (2016).

Here, we propose a new building block (a cloud transform block) for point cloud processing architectures that combines the ideas of ConvNets and Transformers (Figure 1). Similarly to the (self-)attention layers within transformers, cloud transform blocks take unordered sets of vectors as an input, and process such input using multiple parallel heads. For an input set element, each head computes a two- or three-dimensional key and a higher-dimensional value, and then uses the computed keys to rasterize the respective values onto a regular grid. A two- or three-dimensional convolution is then used to propagate the information across elements. The results of parallel heads are then probed at key locations and are recombined together, producing an update to element features.

We show that multiple cloud transform blocks can be stacked sequentially and trained end-to-end, as long as special care is taken when implementing the forward and backward pass through the rasterization operations. We then design cloud transformer architectures that concatenate multiple cloud transform blocks together with task-specific 3D convolutional layers. Specifically, we design a cloud transformer for semantic segmentation (which we evaluate on the S3DIS benchmark Armeni et al. (2016) and the ShapeNet-Part benchmark), and a cloud transformer for image-based geometric reconstruction (which we evaluate on a recently introduced ShapeNet-based benchmark Tatarchenko* et al. (2019)). In the evaluation, the designed cloud transformers achieve state-of-the-art accuracy for the semantic segmentation task and considerably outperform the state-of-the-art for image-based reconstruction. We note that such versatility is rare among previously introduced point cloud processing architectures, which can handle either recognition tasks (such as semantic segmentation) or generation tasks (such as image-based reconstruction), but usually not both.

Figure 1: Our building block has several planar heads and several volumetric heads operating in parallel. Each head is a cloud transform, using a two-dimensional or a three-dimensional grid for rasterization, followed by convolutional operations, and de-rasterization (differentiable sampling).
2 RELATED WORK
Point cloud processing with deep architectures has grown into a large field of study. The ideas implemented within cloud transform blocks are closely related to a large body of prior works. Below, we review only the most related ones.

A number of works use rasterizations of the point cloud over regular 3D grids Maturana & Scherer (2015); Graham et al. (2018b); Moon et al. (2018); Mao et al. (2019a), where each point is rasterized at its original position within the point cloud. Multi-view ConvNets Su et al. (2015a) project point clouds to multiple predefined 2D views. The approaches that use splat convolutions on a permutohedral grid Kiefel et al. (2015); Su et al. (2018) are perhaps most similar to ours (and have been an inspiration to us), as they also interleave rasterization (splatting), (permutohedral) convolution, and probing (slicing). In contrast to all above-mentioned works, which use initial positions or data-independent projections of points for rasterization, our architectures learn a variety of different and data-dependent projections (one projection per head in each block).

In Wang et al. (2019b), a dynamic graph ConvNet (DGCNN) architecture based on graph convolutions is presented. The graph is computed from spatial positions of the points that are modified in a data-dependent way within the architecture. In their case, the loss cannot be backpropagated through the graph node position estimation since the spatial graph construction is non-differentiable. In contrast, our approach is based on regular grid convolutions and includes backpropagation through position estimation (key computation). We also note that differentiable point cloud projection onto a 2D grid (from 3D space) has been used in Insafutdinov & Dosovitskiy (2018), though in a different way and for a different purpose than in our case.

Our approach is also strongly related to the seminal work on spatial transformers
Jaderberg et al. (2015), which introduced blocks that warp signals on regular grids through data-dependent parametric warping and bilinear sampling. Our blocks also use bilinear sampling at the end of each head processing. Inspired by spatial transformers, Wang et al. (2019a) investigate how data-independent and data-dependent deformations of the original point clouds can be used to boost the performance of several recognition architectures including DGCNN Simonovsky & Komodakis (2017), SplatNet Su et al. (2018), and VoxelNet Zhou & Tuzel (2018). Similarly to Wang et al. (2019b) and unlike Jaderberg et al. (2015), Wang et al. (2019a) do not propagate the loss fully through the deformation computation (in the case of data-dependent deformations). Compared to Wang et al. (2019a), our architectures employ regular 2D and 3D convolutions, can handle both recognition and generative tasks (the latter not considered in Wang et al. (2019a)), and are trained with gradient propagation through key position computation.

The transformers' quadratic complexity has recently been addressed by sparse transformers Child et al. (2019), which alleviate the quadratic complexity of the original transformers in the set size by restricting the interaction between elements in the set to predefined sparse subsets. Our mechanism based on rasterization and convolution can be seen as an alternative to sparse transformers that restricts interaction to elements that have been projected into adjacent grid cells.
3 METHOD
We proceed by first defining the key operation in our point processing pipeline that we call cloud transform. We then discuss how it can be embedded in a multi-head processing block. We finalize the section by discussing the architectures for point cloud segmentation and for image-based geometry reconstruction.

3.1 CLOUD TRANSFORM
The cloud transform takes as an input an unordered set (to which we further refer as point cloud) $X = \{x_1, \dots, x_N \mid x_i \in \mathbb{R}^f\}$, whose elements are vectors $x_i \in \mathbb{R}^f$ of a potentially high dimension $f$. The cloud transform $T(X)$ maps such input into a new $g$-dimensional point cloud $Y \in \mathbb{R}^{N \times g}$ of the same size $N$. In other words, each point $x_i \in X$ gets transformed into a new point $y_i \in Y$. When designing such a transform, one might want every point $y_i \in Y$ to be dependent on the whole set $X$. Even though this property holds for the self-attention operation in Vaswani et al. (2017), it becomes problematic in the case of a large set size because of the quadratic complexity.

The cloud transform first applies a learnable projection $P$ (further called rasterization), which generates a two-dimensional feature map with $c$ channels, i.e. $P: X \mapsto I \in \mathbb{R}^{w \times w \times c}$. Or, in a volumetric setting, the cloud transform starts with a learnable projection $P$ that generates a three-dimensional volumetric feature map, i.e. $P: X \mapsto I \in \mathbb{R}^{w \times w \times w \times c}$. In both cases, $w$ stands for the spatial resolution of the grid, while $c$ stands for the number of channels.

Once an irregular point cloud $X$ is projected onto a regular feature map, the cloud transform applies a single convolution or a more complex combination of convolutional operations. We denote the result of these convolutional layers as $\tilde{I} \in \mathbb{R}^{w \times w \times g}$ ($\tilde{I} \in \mathbb{R}^{w \times w \times w \times g}$ in the volumetric case). Note that we expect $\tilde{I}$ to be of the same spatial size as $I$. However, the channel dimension of $\tilde{I}$ might be changed from $c$ to $g$.

The last step in our cloud transform operation is de-rasterization (also called slicing) $\tilde{P}: \tilde{I} \to Y$ from the processed feature map $\tilde{I}$ into a new transformed point cloud $Y \in \mathbb{R}^{N \times g}$. Note that the cloud transform passes information from $x_i$ to $x_j$ as long as these two points have been projected to sufficiently close positions. Thus, the cloud transform can be seen as a variant of a self-attention layer with an adaptive sparse attention mechanism. Below, for the sake of simplicity, we detail the steps of the cloud transform for the two-dimensional feature map case. The volumetric case is completely analogous.
Figure 2: The cloud transform consists of rasterization (left) and de-rasterization (right) steps, with the convolutional part in between. It projects the high-dimensional point cloud onto a low-dimensional (two-dimensional in this case) grid, applies convolutional processing, and lifts the result back to the high-dimensional space.

Rasterization step. To rasterize each point $x_i$, we predict the value $v_i \in \mathbb{R}^c$ and the key $k_i \in [0,1]^2$. These two vectors stand for what to rasterize and where to rasterize respectively. In practice, this key-value prediction $\xi: x_i \mapsto (v_i, k_i)$ can be implemented as a multi-layer perceptron (MLP) applied to the vector $x_i$ and producing a $(c{+}2)$-dimensional vector. In our implementation, we use a single affine layer with the output dimension equal to six (i.e. $c{=}4$), followed by the normalization layer. Depending on the architecture, the normalization layer can be batch normalization Ioffe & Szegedy (2015), instance normalization Ulyanov et al. (2017) or adaptive instance normalization Huang & Belongie (2017b). Finally, we apply clipping to the key values, ensuring that all key values are between zero and one.

We then rasterize the value $v_i \in \mathbb{R}^c$ onto the grid $I \in \mathbb{R}^{w \times w \times c}$ using the predicted key $k_i$ as a position. Specifically, $k_i = (k_i^1, k_i^2) \in [0,1]^2$ may be interpreted as a relative coordinate inside the spatial grid of $I$. Thus, the position defined by $k_i$ falls into the enclosing integer cell $(h_1, w_1), (h_1, w_2), (h_2, w_1), (h_2, w_2)$, where $h_1 = \lfloor (w-1) \cdot k_i^1 \rfloor$, $h_2 = \lceil (w-1) \cdot k_i^1 \rceil$, $w_1 = \lfloor (w-1) \cdot k_i^2 \rfloor$, $w_2 = \lceil (w-1) \cdot k_i^2 \rceil$.

The value $v_i$ is then rasterized into the four neighbouring feature map pixels $I[h_1, w_1], I[h_1, w_2], I[h_2, w_1], I[h_2, w_2] \in \mathbb{R}^c$ via bilinear assignment. In more detail, we compute the bilinear weights $b_i = (b_i^{11}, b_i^{12}, b_i^{21}, b_i^{22})$ of the key $k_i$ with respect to the cell it falls into:

$$
\begin{aligned}
b_i^{11} &= \bigl(1 - ((w-1)\,k_i^1 - h_1)\bigr)\,\bigl(1 - ((w-1)\,k_i^2 - w_1)\bigr) \\
b_i^{12} &= \bigl(1 - ((w-1)\,k_i^1 - h_1)\bigr)\,\bigl((w-1)\,k_i^2 - w_1\bigr) \\
b_i^{21} &= \bigl((w-1)\,k_i^1 - h_1\bigr)\,\bigl(1 - ((w-1)\,k_i^2 - w_1)\bigr) \\
b_i^{22} &= \bigl((w-1)\,k_i^1 - h_1\bigr)\,\bigl((w-1)\,k_i^2 - w_1\bigr)
\end{aligned}
\tag{1}
$$

The bilinear weights are then used to update the feature map $I$ at the corresponding locations:

$$
\begin{aligned}
I[h_1, w_1] &\leftarrow I[h_1, w_1] + b_i^{11}\, v_i &\qquad I[h_1, w_2] &\leftarrow I[h_1, w_2] + b_i^{12}\, v_i \\
I[h_2, w_1] &\leftarrow I[h_2, w_1] + b_i^{21}\, v_i &\qquad I[h_2, w_2] &\leftarrow I[h_2, w_2] + b_i^{22}\, v_i
\end{aligned}
\tag{2}
$$

The feature map $I$ is initialized with zeros, and the rasterization is repeated for every $x_i \in X$, $i \in 1..N$, accumulating rasterized results at the respective cells of the feature map $I$.
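For concreteness, the following is a minimal PyTorch sketch of the 2D rasterization step (our own illustration, not the authors' code). It assumes keys already clipped to [0,1]^2 and uses a flat scatter-add to accumulate the bilinear contributions of equations (1)-(2).

```python
import torch

def rasterize_2d(keys, values, w):
    """keys: (N, 2) in [0, 1]^2; values: (N, c). Returns a (w, w, c) feature map."""
    pos = keys.clamp(0.0, 1.0) * (w - 1)           # continuous grid coordinates
    lo, hi = pos.floor(), pos.ceil()               # enclosing cell corners
    u, t = (pos - lo)[:, 0], (pos - lo)[:, 1]      # fractional offsets inside the cell
    grid = values.new_zeros(w * w, values.shape[1])
    corners = [(lo[:, 0], lo[:, 1], (1 - u) * (1 - t)),   # b^{11}
               (lo[:, 0], hi[:, 1], (1 - u) * t),         # b^{12}
               (hi[:, 0], lo[:, 1], u * (1 - t)),         # b^{21}
               (hi[:, 0], hi[:, 1], u * t)]               # b^{22}
    for h, x, b in corners:
        flat = (h * w + x).long()                  # flattened cell index
        grid.index_add_(0, flat, b.unsqueeze(1) * values)  # accumulate b_i * v_i
    return grid.view(w, w, -1)
```

Both the bilinear weights and the accumulated values remain differentiable, so gradients can flow back to the keys and the values, as discussed in Section 3.2.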
Convolution step. As discussed above, after rasterization, we transform the feature map $I$ into $\tilde{I}$ with any convolutional architecture that preserves the spatial resolution. Unless noted otherwise, we use a single convolutional layer that keeps the number of channels equal to four.
De-rasterization step. As the last step, we perform the de-rasterization transform $\tilde{P}: \tilde{I} \to Y$, which produces the transformed feature cloud $Y$ using the standard bilinear grid sampling operation. Thus, the transformed values $\tilde{I}[h_1, w_1], \tilde{I}[h_1, w_2], \tilde{I}[h_2, w_1], \tilde{I}[h_2, w_2] \in \mathbb{R}^g$ of the feature map are combined with the bilinear weights $b_i = (b_i^{11}, b_i^{12}, b_i^{21}, b_i^{22})$ into the transformed value vector $\tilde{v}_i$. We apply the normalization layer and the ReLU nonlinearity to the result of the de-rasterization step, and further map each value from $c{=}4$ dimensions back to $g$ dimensions ($g{=}128$ unless noted otherwise) using a learnable affine transform.
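De-rasterization can be expressed with a standard bilinear grid sampler. Below is a hedged sketch (again our own illustration); the (x, y) coordinate ordering and the use of align_corners=True, which matches the $(w-1)$ scaling above, are our assumptions.

```python
import torch
import torch.nn.functional as F

def derasterize_2d(feature_map, keys):
    """feature_map: (g, w, w); keys: (N, 2) in [0, 1]^2. Returns per-point features (N, g)."""
    # grid_sample expects coordinates in [-1, 1] and (x, y) ordering (assumed here)
    grid = (keys.clamp(0.0, 1.0) * 2 - 1).flip(-1).view(1, -1, 1, 2)
    sampled = F.grid_sample(feature_map.unsqueeze(0), grid,
                            mode='bilinear', align_corners=True)
    return sampled[0, :, :, 0].t()                 # (g, N) -> (N, g)
```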
3.1.1 DENSITY NORMALIZATION

Surprisingly to us, the cloud transform performs marginally better with no density normalization compared with its normalized counterpart. We find this particularly curious since the converse holds for SPLATNet-like architectures.

Given the spatial positions $P$ of the $n$ points, an auxiliary "density normalization" term $\mathrm{Conv}_g(\mathrm{Splat}(P, \mathbf{1}))$ is introduced. This term consists of identity features $\mathbf{1}$, splatted (rasterized) at the positions $P$, which is $\mathrm{Splat}(P, \mathbf{1})$. It is designed to approximate the uneven point cloud density. Finally, these densities are blurred with a Gaussian filter $\mathrm{Conv}_g$. This step approximates the new density of the convolved features $\mathrm{Conv}(\mathrm{Splat}(P, F))$. The term is used prior to the slicing (de-rasterization) stage. We present a direct comparison with and without such normalization in our ablation studies section.
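One plausible reading of this term (a sketch under our assumptions, not the authors' exact formulation) is SPLATNet-style normalization: ones are splatted at the same keys, blurred with a Gaussian kernel, and the convolved feature map is divided by the result before slicing. The Gaussian kernel and the exact placement of the division are assumptions; the sketch reuses rasterize_2d from above.

```python
def density_normalize(conv_map, keys, w, gaussian_blur, eps=1e-6):
    """conv_map: (g, w, w); keys: (N, 2) in [0, 1]^2; gaussian_blur: callable on (1, w, w) maps."""
    ones = keys.new_ones(keys.shape[0], 1)
    density = rasterize_2d(keys, ones, w)                 # (w, w, 1): splatted point density
    density = gaussian_blur(density.permute(2, 0, 1))     # approximate post-convolution density
    return conv_map / (density + eps)                     # normalize before de-rasterization
```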
3.2 BACKPROPAGATION THROUGH CLOUD TRANSFORM

We have found that learning architectures with multiple sequentially-stacked cloud transform blocks via back-propagation Rumelhart et al. (1986) is highly unstable, as the gradients tend either to explode or vanish. The issue of exploding and vanishing gradients in deep neural networks has been thoroughly studied in Glorot & Bengio (2010); He et al. (2015). An ideal assumption on the gradient variance during back-propagation is that it preserves its scale throughout the network.

In our case, the instability can be traced to the gradient of the bilinear weights $b$ w.r.t. the key $k$ at the rasterization and de-rasterization steps. According to the chain rule, the gradients are multiplied by $w$ during backpropagation through the keys.
Problem discussion. Ultimately, the problem can be pinpointed to the fact that as the key $k_i$ moves from the top-left to the bottom-right corner of a certain grid cell, thus traversing only a $1/w$-th of its variation range, the assignment weight of $v_i$ to the bottom-right corner changes from $0$ to $1$ (i.e. traverses the full variation range). This means that gradients w.r.t. keys $k_i$ in our architecture will always be roughly $w$ times stronger than w.r.t. the values $v_i$.
Gradient balancing trick. Based on the observation above, during back-propagation through the keys we simply divide the partial derivatives w.r.t. both coordinates of $k_i$ by $w$, i.e. we apply:

$$
\frac{\partial \mathcal{L}}{\partial k_i} \;\leftarrow\; \frac{1}{w}\,\frac{\partial \mathcal{L}}{\partial k_i}.
\tag{3}
$$

We have found that this gradient balancing trick is sufficient to enable the learning of deep architectures containing multiple layers with cloud transforms. Let us justify this trick by an exact derivation.
Lemma 1. Let $k = (k^1, k^2) \in [0,1]^2$ be a key, which is typically an output of the network's key prediction branch in our cloud transform block. Let $b$ be the vector of bilinear weights of $k$ inside the enclosing cell, as in Section 3.1. Denoting $u = (w-1)\,k^1 - \lfloor (w-1)\,k^1 \rfloor$ and $t = (w-1)\,k^2 - \lfloor (w-1)\,k^2 \rfloor$, the derivative of $b$ with respect to $k$ is the following:

$$
\frac{\partial b}{\partial k} = (w-1)\cdot
\begin{pmatrix}
-(1-t) & -(1-u)\\
-t & 1-u\\
1-t & -u\\
t & u
\end{pmatrix}
= (w-1)\cdot D
\tag{4}
$$

where the rows correspond to $b^{11}, b^{12}, b^{21}, b^{22}$ and the columns to $k^1, k^2$.

The derivation is straightforward, given formula (1) for the bilinear weights. From the lemma above we see that $\frac{\partial b}{\partial k} = (w-1)\cdot D$, where each element of the matrix $D$ is bounded by $1$ in absolute value. Thus, the back-propagation from the bilinear weights $b_i$ to $k_i$ has the form:

$$
\frac{\partial \mathrm{Cost}}{\partial k_i} = \left(\frac{\partial b_i}{\partial k_i}\right)^{\!T} \cdot \frac{\partial \mathrm{Cost}}{\partial b_i} = (w-1)\cdot D^{T}\cdot \frac{\partial \mathrm{Cost}}{\partial b_i}
\tag{5}
$$

Intuitively, the gradients are thus scaled up by roughly $w$ at each layer during the fair back-propagation through the keys. Therefore, given a network with $d$ cloud transform layers, the gradient norm would "explode" as $w^d$. We have observed such explosions experimentally. The balancing trick discussed above successfully fixes this problem.

Figure 3: The architecture used for semantic segmentation is based on (1) ten standard multi-headed cloud transform blocks, followed by (2) a single-headed 3D cloud transform block with a deep Voxel-to-Voxel convolutional network, and (3) another sequence of ten standard multi-headed cloud transform blocks.
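The balancing of equation (3) does not require a custom backward pass; one possible implementation (a sketch, not the authors' code) keeps the forward value of the keys unchanged while scaling their gradient by $1/w$:

```python
def balance_key_gradients(keys, w):
    """Returns a tensor equal to `keys` in the forward pass whose gradient is scaled by 1/w."""
    return keys / w + (keys - keys / w).detach()
```

In the forward pass the two terms sum back to the original keys, while in the backward pass only the first (non-detached) term contributes, yielding exactly the $1/w$ scaling of equation (3).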
3.3 MULTI-HEADED CLOUD TRANSFORM BLOCK

The rasterization and de-rasterization operations may lead to information loss due to the limited number of nodes in the two-dimensional and three-dimensional lattices (we use $w{=}64$ for two-dimensional grids and $w{=}32$ for three-dimensional grids). We therefore build our architectures from blocks that combine multiple cloud transforms operating in parallel. This is reminiscent of both the multiple self-attention heads in the Transformer architecture Vaswani et al. (2017) and the multi-view convolutional networks Su et al. (2015a). Following Vaswani et al. (2017), we call each of the parallel cloud transform modules a head and thus consider a multi-head architecture. Each head predicts keys and values independently, and may use its own spatial resolution $w$. In fact, two-dimensional and three-dimensional heads can operate in parallel.

The results of the parallel heads for each point $i$ are summed together, so that the resulting multi-head cloud transform (MHCT) block (Figure 1) still maps each input vector $x_i$ to a $g$-dimensional vector $y_i$. We add another normalization layer and ReLU nonlinearity after the results of the heads are summed, and complete the block with a residual skip connection from the start to the end He et al. (2015). We note that the multi-head cloud transform block also resembles the Inception block Szegedy et al. (2015), which uses heterogeneous parallel convolutions, as well as the blocks of the ResNeXt networks Xie et al. (2016), which use grouped convolutions with a small number of channels in each group.

Unless noted otherwise, we use MHCT blocks with eight two-dimensional heads (with $w = 64$) and eight three-dimensional heads (with $w = 32$).
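Schematically, a multi-head cloud transform block under these defaults can be written as follows (a simplified sketch: the head internals are the 2D/3D cloud transforms described above, and treating the N points of a single cloud as the normalization batch is our simplification):

```python
import torch
import torch.nn as nn

class MHCTBlock(nn.Module):
    def __init__(self, heads, g=128):
        super().__init__()
        self.heads = nn.ModuleList(heads)   # parallel 2D/3D cloud transforms, each (N, g) -> (N, g)
        self.norm = nn.BatchNorm1d(g)

    def forward(self, x):                          # x: (N, g) point features
        y = sum(head(x) for head in self.heads)    # sum the outputs of the parallel heads
        y = torch.relu(self.norm(y))               # normalization + ReLU
        return x + y                               # residual skip connection
```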
3.4 CLOUD TRANSFORMERS

We now discuss the architectures that can be constructed from MHCT blocks.
Semantic segmentation.
The semantic segmentation cloud transformer (Figure 3) consists of an initial one-layer perceptron, which is applied to each point independently and transforms its 3D coordinates and 3D color features to an $f$-dimensional vector ($f{=}128$). Afterwards, we apply ten multi-headed cloud transform layers with the default settings.

After the tenth MHCT layer, we insert a single cloud transform that enhances information propagation between distant points. This cloud transform uses a volumetric grid with resolution $w{=}32$ and feature dimensionality $c{=}32$. The convolutional part is a Voxel-to-Voxel Moon et al. (2018)-like network with downsampling and upsampling layers.

We then append ten more standard MHCT blocks, and conclude the architecture with a two-layer shared perceptron that maps the features of each point to the logits of the segmentation classes. Following the U-Net Ronneberger et al. (2015) idea, we add skip connections from the initial five MHCT layers to the last five MHCT layers (the features passed through skip connections are merged through summation).

All normalization layers in the architecture are BatchNorm layers Ioffe & Szegedy (2015). The architecture has –M parameters, of which –M are in the central block. The architecture is trained with the cross-entropy loss.

Figure 4: The architecture used for image-based reconstruction is based on fourteen standard multi-headed cloud transform blocks, conditioned via adaptive instance normalizations on the output of the convolutional encoder (green). The input to the first MHCT block is sampled from a uniform 3D distribution.
Point cloud generation.
To create the architecture that generates a point cloud, we stack 14 MHCT blocks sequentially, followed by a point-wise multi-layer perceptron that has two layers followed by a tanh non-linearity that generates 3D points. The input point cloud is sampled from a uniform 3D distribution in the unit cube and then passed through a point-wise linear layer, mapping each feature to $f{=}128$ dimensions.

To solve the image-based geometry reconstruction task (recovering point clouds from images), we use adaptive instance normalization (AdaIN) layers Huang & Belongie (2017a) in the MHCT blocks. We create an image encoder with the ResNet-50 architecture He et al. (2015) (pretrained on ILSVRC Russakovsky et al. (2015)). The output of the encoder is a vector, which is transformed into AdaIN coefficients via an affine layer (Figure 4). The architecture is trained with the earth mover's distance (EMD) loss Liu et al. (2020).
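The conditioning mechanism can be sketched as follows (our own illustration; the exact affine parameterization is an assumption): the encoder output z is mapped by an affine layer to per-channel scale and shift coefficients that modulate instance-normalized point features inside each MHCT block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    def __init__(self, z_dim, g=128):
        super().__init__()
        self.affine = nn.Linear(z_dim, 2 * g)      # predicts per-channel scale and shift from z

    def forward(self, x, z):                       # x: (B, g, N) point features, z: (B, z_dim)
        scale, shift = self.affine(z).chunk(2, dim=-1)
        x = F.instance_norm(x)                     # normalize each channel within each cloud
        return x * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
```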
4 EXPERIMENTS

Below, we report on the experiments with our architectures for the semantic segmentation and the image-based reconstruction tasks. We note that these tasks were chosen as arguably the most popular representatives of recognition and generation tasks respectively.
4.1 EXPERIMENTAL DETAILS

We have observed that our model quality is sensitive to the first steps of the optimizer, which is natural given the fact that our heads learn data projections into low-dimensional spaces. We therefore used the RAdam optimizer Liu et al. (2019) with the learning rate 0.0001 for the image-based reconstruction experiments and with the learning rate 0.001 for the semantic segmentation experiments. We halve the learning rate every k iterations for the image-based reconstruction experiments and every k iterations for the semantic segmentation experiments. We observed similar performance with a warmed-up Adam optimizer Kingma & Ba (2015) (in which the learning rate is linearly increased from 0 to a target value, typically within several thousand iterations), but in the end settled on RAdam in order to avoid setting extra hyper-parameters. Our models were trained on four NVIDIA Tesla P40 GPUs.

Figure 5: Sample image-based reconstructions obtained by our cloud transformer architecture. For each input, we show the ground truth, the result of the reconstruction (from the same viewpoint) and another view of the reconstruction. Note the ability of the cloud transformer to recover unseen parts despite being trained to reconstruct in the viewer-based coordinate frame, where reasoning about symmetries and object regularities is harder. Ground truth points are colored according to their distances to the reconstruction and vice versa.
4.2 SINGLE-VIEW OBJECT RECONSTRUCTION

In our generation experiments we follow the recently introduced benchmark Tatarchenko* et al. (2019) on 3D object reconstruction. The benchmark is based on ShapeNet Chang et al. (2015) renderings. Unlike previous ShapeNet-based benchmarks for image-based reconstruction that used canonical coordinate frames, the new benchmark argues that the reconstruction should be evaluated in the viewer-based coordinate frame, where the task is more challenging and more realistic. The work Tatarchenko* et al. (2019) also provides evaluations of several recent methods on image-based reconstruction, as well as a retrieval-based oracle. The dataset consists of ShapeNet Chang et al. (2015) models, where each model belongs to one of the benchmark's object classes. Each object has been rendered with ShapeNet-Viewer from five random viewpoints. We employ the same train/val/test split as Tatarchenko* et al. (2019).

In the benchmark, objects were rendered to images of a fixed resolution, which we resize before feeding to our model. Our model outputs a point set representing the reconstructed object in the viewer-aligned coordinate system. Since the protocol requires predicting a fixed number of points, we perform the reconstruction twice with different uniform noise and the same style vector $z$ extracted by the encoder, and randomly select the required number of points from the union of the two reconstructions. We note that the ability to sample an arbitrarily large number of points is an attractive property of our architecture.

The main evaluation metric proposed in Tatarchenko* et al. (2019) is the F-score computed at a 1% distance threshold. The methods are compared by the averaged per-class F-score @ 1%, and by the number of classes in which a method has the highest mean F-score @ 1%. Our quantitative results are summarized in Table 1. As can be seen, our method outperforms all methods evaluated in Tatarchenko* et al. (2019), including the retrieval-based oracle, very significantly. We note that we have not performed extensive architecture search or hyperparameter tuning for this application, and it is very likely that another architecture based on our new block can achieve a much better result.

Method                            avg. cl. F-score @ 1%   Top-1 cat.
AtlasNet                          0.252                   2
Matryoshka                        0.217                   3
OGN                               0.264                   2
Retrieval                         0.236                   0
Retrieval (oracle)                0.290                   7
Cloud Transformer Cubic (ours)    0.367                   47
Cloud Transformer Sphere (ours)   –                       N/A

Table 1: F-score evaluation (@1%) of 3D shape reconstruction in the viewer-based coordinate frame. The cloud transformer outperforms other methods, including the retrieval-based oracle, considerably. Only the cubic version is compared by the number of Top-1 categories for clarity, though the spherical version achieves similar performance.

We additionally observed an improvement in performance with a different type of noise fed to our model. Instead of a random point cloud sampled uniformly from a three-dimensional cube $[0,1]^3$, the points are uniformly sampled from a unit sphere $S^2 \subset \mathbb{R}^3$. We speculate this is because most of the ShapeNet objects are two-dimensional manifolds, in contrast with a three-dimensional filled cube.

In Figure 5, we provide several qualitative examples of input-output pairs, and note the ability of our method to recover fine details and to infer symmetries.
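The two noise variants compared above can be sampled as follows (a sketch; the unit-cube range and the Gaussian-normalization trick for uniform sphere sampling are standard constructions, not taken from the paper):

```python
import torch

def sample_cube_noise(n):
    """n points sampled uniformly from the unit cube [0, 1]^3."""
    return torch.rand(n, 3)

def sample_sphere_noise(n):
    """n points sampled uniformly from the unit sphere S^2 (normalized Gaussians)."""
    x = torch.randn(n, 3)
    return x / x.norm(dim=1, keepdim=True)
```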
4.3 INDOOR SEMANTIC SEGMENTATION

The Stanford Indoor Dataset (S3DIS) Armeni et al. (2016) is a popular 3D point cloud segmentation benchmark that consists of large 3D point cloud scenes captured in three different buildings, annotated with 13 semantic labels at the point level. The dataset comes with six splits.

For the sake of fair comparison, we evaluate on S3DIS using the conventional protocol established by Qi et al. (2017a), which chunks rooms into 1m × 1m blocks. Each block consists of 4096 points, and each point is represented by its xyz position and its color, which results in a six-dimensional input vector. Following many previous works, we evaluate on the 'Area 5' split and train on the remaining five splits, as Tchapmi et al. (2017b) advocate this fold as representative in measuring generalization ability due to Area 5 being captured in a separate building.

Since the current state-of-the-art method KPConv Thomas et al. (2019) uses a different protocol, we also evaluate our model using their protocol. In this setting we also train a higher-capacity cloud transformer with two subsequent Voxel-to-Voxel 3D heads in the middle (instead of the single head in the default architecture). In the KPConv protocol, at each step an input point cloud is dynamically sampled from a sphere of 2m radius. During evaluation, the same data strategy is applied together with voting.

In both protocols, the spatial xyz coordinates are augmented with random rotation, anisotropic scaling, jitter and shifts, whereas for color augmentation we use chromatic autocontrast, jitter and translation (following Choy et al. (2019)).

The results of the comparison with the state-of-the-art are given in Table 2. Our cloud transformers achieve state-of-the-art performance in both protocols.

We also visualize the operation of our default model in Figure 6. Here, we show the keys of the points of a sample S3DIS chunk near the beginning of the architecture, in the middle Voxel-to-Voxel layer, and near the end of the architecture. Ground truth labels are used for color coding. It can be seen that the model uses rather diverse transforms within parallel heads in both multi-head blocks. In the penultimate layer, several heads collapse the cloud to a small area/volume. Such a head therefore performs global information propagation and is similar to a PointNet-like block Qi et al. (2017a). Interestingly, in the middle layer, the point cloud is transformed with relatively little deformation as compared to the input. This is not encoded into the architecture in any way and emerges naturally.

Table 2: Semantic segmentation intersection-over-union scores on the S3DIS Area-5 split (compared methods: PointNet Qi et al. (2017a), SegCloud* Tchapmi et al. (2017a), Eff 3D Conv Zhang et al. (2018), TangentConv Tatarchenko et al. (2018), RNN Fusion Ye et al. (2018), SPGraph* Landrieu & Simonovsky (2018), ParamConv Wang et al. (2018), PointCNN Li et al. (2018b), Minkowski32* Choy et al. (2019), KPConv* Thomas et al. (2019)). The models without * use the standard protocol with chunking of the scene into blocks, while the models with * employ different protocols. Our default cloud transformer model is denoted 'CT', while the model with extra capacity is denoted 'CT2'. Cloud transformers match or outperform the state-of-the-art in both protocols (standard and KPConv's).
Figure 6: Information flow inside the segmentation architecture. Using the ground truth labels as color coding, we show the key locations inside the heads of the third block (top), the key locations inside the heads of the penultimate block (bottom), and the key locations in the middle of the architecture (middle). The input example with ground truth labels is shown on the left, while the predicted labels are shown on the right (on top of the input point cloud). For 3D heads, we show the projection of the keys onto the XY coordinate plane. See text for discussion.

4.4 ABLATION STUDY
We also perform an ablation study to justify our architecture choices. We consider the following ablations:

• We add a density normalization term prior to the de-rasterization step, following the discussion in Section 3.1.1.

• We replace all learnable keys with different non-learnable projections. More precisely, in each head we apply a random affine transformation to the input point cloud, followed by scaling with a logarithmically chosen scale. In the case of the planar heads we also add a projection to the plane z = 0 before the scaling. This makes the method more similar to the SPLATNet architecture and allows us to verify our learnable key approach.

• In the Coarser feature maps experiment the spatial dimensions of feature maps are halved. Namely, the volumetric heads of spatial size 32 in MHCT are shrunk to 16, and the planar heads from 64 to 32.

• We also train an architecture without planar heads in order to see if using only volumetric heads might be sufficient.

• Furthermore, we consider a high-capacity variant with no multihead processing, where we use a single planar head with 32 channels (note that the capacity of each head is quadratic in the number of channels).

• Finally, we consider an architecture with fewer layers, replacing the ten blocks in the beginning and the ten blocks in the end of the architecture with six and six respectively.

As can be seen from Table 3, all ablated variants performed notably worse than the default architecture. These ablations also point out the possibility of improvement that can be easily achieved by, e.g., increasing the number of MHCT blocks (at the cost of running on more parallel GPUs).

Method                     mIoU
Density norm.              55.1
Non-learnable proj.        55.9
Without planar heads       57.3
Coarser feature maps       58.5
No multihead               55.4
Less layers (6)            57.4
Cloud Transformer (full)   61.0

Table 3: Ablation study on the Area-5 S3DIS conventional protocol. See text for discussion.

4.4.1 SHAPENET-PART PART SEGMENTATION
We follow the protocol of SPH3D-GCN Lei et al. (2019) and train a separate network for each class. Our model architecture is the same as for the semantic segmentation, except for adding dropout regularization to the two last linear transformations in the network. Such regularization is needed due to the tiny train set size for some of the classes (e.g. rocket). The results are presented in Table 4. While in this benchmark our method does not exceed the state-of-the-art, it performs essentially on par with it.

Method                        inst. mIoU  cls. mIoU  Airplane  Bag   Cap   Car   Chair  Earphone  Guitar  Knife  Lamp  Laptop  Motorbike  Mug   Pistol  Rocket  Skateboard  Table
Su et al. (2015a)             84.6        82.0       81.9      83.9  88.6  79.5  90.1   73.5      91.3    84.7   84.5  96.3    69.7       95.0  81.7    59.2    70.4        81.3
KCNet Shen et al. (2017)      84.7        82.2       82.8      81.5  86.4  77.6  90.3   76.8      91.0    87.2   84.5  95.5    69.2       94.4  81.6    60.1    75.2        81.3
SO-Net Li et al. (2018a)      84.9        81.0       82.8      77.8  88.0  77.3  90.6   73.5      90.7    83.9   82.8  94.8    69.1       94.2  80.9    53.1    72.9        83.0
PointNet++ Qi et al. (2017b)  85.1        81.9       82.4      79.0  87.7  77.3  90.8   71.8      91.0    85.9   83.7  95.3    71.6       94.1  81.3    58.7    76.4        82.6
SpiderCNN Xu et al. (2018)    85.3        81.7       83.5      81.0  87.2  77.5  90.7   76.8      91.1    87.3   83.3  95.8    70.2       93.5  82.7    59.7    75.8        82.8
SFCNN Rao et al. (2019)       85.4        82.7       83.0      83.4  87.0  80.2  90.1   75.9      91.1    86.2   84.2  96.7    69.5       94.8  82.5    59.9    75.1        82.9
PointCNN Li et al. (2018b)    86.1        84.6       84.1      86.5  86.0  80.8  90.6   79.7      92.3    88.4   85.3  96.1    77.2       95.3  84.2    –       –           –
Ψ-CNN Lei et al. (2019)       86.8        83.4       84.2      82.1  83.8  80.5  91.0   78.3      91.6    86.7   84.7  95.6    74.8       94.5  83.4    61.3    75.9        –
SPH3D-GCN Lei et al. (2019)   –           –          –         –     –     –     –      –         –       –      –     –       –          –     –       –       –           –
CT (ours)                     –           –          –         –     –     –     –      –         –       –      –     –       –          –     –       –       –           –

Table 4: Part segmentation results on the ShapeNet Parts benchmark. Our results are very similar to the state-of-the-art.

Figure 7: Results of image-based reconstruction. Our network performs the reconstruction by sampling points from the unit cube and then "folding" this set into the answer. Here, we show the results, while coloring each point according to its initial position within the sphere (red, green, and blue values are used to color-code each coordinate).
5 CONCLUSION
We have presented a new block for neural architectures that process point clouds (and, more generally, vectorial sets). The new block has been encouraged by the success of the Transformer architecture and its self-attention blocks, and harnesses the efficiency of 2D and 3D grid convolutions in modern parallel processors, GPUs in particular.

Based on the new block, we have presented architectures for point cloud semantic segmentation and single-image-based geometry reconstruction that achieve state-of-the-art results. Additionally, we evaluated our model on a part segmentation dataset with a quality comparable to the state-of-the-art. Among these three applications, it is perhaps the ability of our building block to perform well for generative tasks that is most interesting, and in the future we would like to investigate the performance of our approach on further generation tasks. Our reconstruction architecture can be retargeted for other generation tasks easily by changing the encoder part.

REFERENCES
Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and SilvioSavarese. 3d semantic parsing of large-scale indoor spaces. In
Proc. CVPR , 2016.Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li,Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu.ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012[cs.GR], Stanford University — Princeton University — Toyota Technological Institute atChicago, 2015.Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparsetransformers.
CoRR , abs/1904.10509, 2019.Christopher Bongsoo Choy, JunYoung Gwak, and Silvio Savarese. 4d spatio-temporal convnets:Minkowski convolutional neural networks.
Proc. CVPR, pp. 3070–3079, 2019. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In
AISTATS , 2010.Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation withsubmanifold sparse convolutional networks. In
Proc. CVPR , 2018a.Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3d semantic segmentation withsubmanifold sparse convolutional networks.
Proc. CVPR , pp. 9224–9232, 2018b.Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassinghuman-level performance on imagenet classification.
Proc. ICCV , pp. 1026–1034, 2015.Xun Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance nor-malization.
Proc. ICCV , pp. 1510–1519, 2017a.Xun Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance nor-malization.
Proc. ICCV , pp. 1510–1519, 2017b.Eldar Insafutdinov and Alexey Dosovitskiy. Unsupervised learning of shape and pose with differ-entiable point clouds. In
Proc. NeurIPS , pp. 2802–2812. 2018.Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training byreducing internal covariate shift. In
Proc. ICML , volume 37, pp. 448–456, 2015.Max Jaderberg, Karen Simonyan, Andrew Zisserman, and koray kavukcuoglu. Spatial transformernetworks. In
Proc. NIPS , pp. 2017–2025. 2015.Varun Jampani, Martin Kiefel, and Peter V. Gehler. Learning sparse high dimensional filters: Imagefiltering, dense crfs and bilateral neural networks. In
Proc. CVPR , June 2016.Martin Kiefel, Varun Jampani, and Peter V. Gehler. Permutohedral lattice cnns. In
ICLR WorkshopTrack , May 2015. URL http://arxiv.org/abs/1412.6618 .Diederick P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In
InternationalConference on Learning Representations (ICLR) , 2015.Roman Klokov and Victor S. Lempitsky. Escape from cells: Deep kd-networks for the recognitionof 3d point cloud models.
Proc. ICCV, pp. 863–872, 2017. Loïc Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs.
Proc. CVPR , pp. 4558–4567, 2018.Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel.Backpropagation applied to handwritten zip code recognition.
Neural Comput. , 1(4):541–551,December 1989. ISSN 0899-7667.Huan Lei, Naveed Akhtar, and Ajmal Mian. Octree guided cnn with spherical kernels for 3d pointclouds.
IEEE Conference on Computer Vision and Pattern Recognition , 2019.Jiaxin Li, Ben M. Chen, and Gim Hee Lee. So-net: Self-organizing network for point cloud analysis. , pp. 9397–9406, 2018a.Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convo-lution on x-transformed points. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.),
Proc. NeurIPS , pp. 820–830. 2018b.Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and JiaweiHan. On the variance of the adaptive learning rate and beyond.
ArXiv , abs/1908.03265, 2019.Minghua Liu, Lu Sheng, Shilin Yang, Jing Shao, and Shi-Min Hu. Morphing and sampling networkfor dense point cloud completion. In
AAAI , 2020.Jiageng Mao, Xiaogang Wang, and Hongsheng Li. Interpolated convolutional networks for 3d pointcloud understanding. In
Proc. ICCV, October 2019a. Jiageng Mao, Xiaogang Wang, and Hongsheng Li. Interpolated convolutional networks for 3d point cloud understanding.
Proc. ICCV , pp. 1578–1587, 2019b.Daniel Maturana and Sebastian A. Scherer. Voxnet: A 3d convolutional neural network for real-timeobject recognition. In
Proc. IROS , pp. 922–928. IEEE, 2015.Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. V2v-posenet: Voxel-to-voxel predictionnetwork for accurate 3d hand and human pose estimation from a single depth map.
Proc. CVPR ,pp. 5079–5088, 2018.Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning onpoint sets for 3d classification and segmentation.
Proc. CVPR , pp. 77–85, 2017a.Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchicalfeature learning on point sets in a metric space. In
NIPS , 2017b.Yongming Rao, Jiwen Lu, and Jie Zhou. Spherical fractal convolutional neural networks for pointcloud recognition. , pp. 452–460, 2019.O. Ronneberger, P.Fischer, and T. Brox. U-net: Convolutional networks for biomedical image seg-mentation. In
Proc. MICCAI , volume 9351 of
LNCS , pp. 234–241, 2015.David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations byback-propagating errors.
Nature , 323:533–536, 1986.Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, ZhihengHuang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei.ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision(IJCV) , 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernelcorrelation and graph pooling. , pp. 4548–4557, 2017.Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neu-ral networks on graphs.
Proc. CVPR , pp. 29–38, 2017.Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. Multi-view convo-lutional neural networks for 3d shape recognition.
Proc. ICCV , pp. 945–953, 2015a.Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. Multi-view convo-lutional neural networks for 3d shape recognition. In
Proc. ICCV , 2015b.Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang,and Jan Kautz. Splatnet: Sparse lattice networks for point cloud processing.
Proc. CVPR , pp.2530–2539, 2018.Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Du-mitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.
Proc. CVPR , pp. 1–9, 2015.Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions fordense prediction in 3d. In
Proc. CVPR, pp. 3887–3896, 2018. Maxim Tatarchenko*, Stephan R. Richter*, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? 2019. Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In
International Conference on 3D Vision (3DV) , pp.537–547. IEEE, 2017a.Lyne P. Tchapmi, Christopher Bongsoo Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese.Segcloud: Semantic segmentation of 3d point clouds.
International Conference on 3D Vision (3DV), pp. 537–547, 2017b. Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J. Guibas. Kpconv: Flexible and deformable convolution for point clouds.
Proc.ICCV , pp. 6410–6419, 2019.Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Improved texture networks: Maximizingquality and diversity in feed-forward stylization and texture synthesis.
Proc. CVPR , pp. 4105–4113, 2017.Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In
Proc. NIPS , 2017.Jiayun Wang, Rudrasis Chakraborty, and Stella X. Yu. Spatial transformer for 3d points.
ArXiv ,abs/1906.10887, 2019a.S. Wang, S. Suo, W. Ma, A. Pokrovsky, and R. Urtasun. Deep parametric continuous convolutionalneural networks. In
Proc. CVPR , pp. 2589–2597, 2018.Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M.Solomon. Dynamic graph cnn for learning on point clouds.
ACM Transactions on Graphics (TOG), 2019b. Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks.
Proc. CVPR , pp. 5987–5995, 2016.Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. Spidercnn: Deep learning on pointsets with parameterized convolutional filters. In
ECCV , 2018.Xiaoqing Ye, Jiamao Li, Hexiao Huang, Liang Du, and Xiaolin Zhang. 3d recurrent neural networkswith context fusion for point cloud semantic segmentation. In
Proc. ECCV , pp. 415–430. Springer,2018.Li Yi, Hao Su, Xingwen Guo, and Leonidas J. Guibas. Syncspeccnn: Synchronized spectral cnnfor 3d shape segmentation. , pp. 6584–6592, 2016.Chris Zhang, Wenjie Luo, and Raquel Urtasun. Efficient convolutions for real-time semantic seg-mentation of 3d point clouds. In
International Conference on 3D Vision (3DV) , pp. 399–408.IEEE, 2018.Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection.
Proc. CVPR , pp. 4490–4499, 2018.
A APPENDIX
Per-class F-score @ 1% on the single-view reconstruction benchmark (cf. Table 1):

Class         AtlasNet  OGN   Matryoshka  Retrieval  Oracle NN  CT (ours)
airplane      0.39      0.26  0.33        0.37       0.45       –
ashcan        0.18      0.23  0.26        0.21       0.24       –
bag           0.16      0.14  0.18        0.13       0.15       –
basket        0.19      0.16  0.21        0.15       0.15       –
bathtub       0.25      0.13  0.26        0.22       0.26       –
bed           0.19      0.12  0.18        0.15       0.17       –
bench         0.34      0.09  0.32        0.30       0.34       –
birdhouse     0.17      0.13  0.18        0.15       0.15       –
bookshelf     0.24      0.18  0.25        0.20       0.20       –
bottle        0.34      0.54  0.45        0.46       0.55       –
bowl          0.22      0.18  0.24        0.20       0.25       –
bus           0.35      0.38  0.41        0.36       0.44       –
cabinet       0.25      0.29  0.33        0.23       0.27       –
camera        0.13      0.08  0.12        0.11       0.12       –
can           0.23      –     –           –          –          –
cellular      0.34      0.45  0.47        0.41       0.50       –
chair         0.25      0.15  0.27        0.20       0.23       –
clock         0.24      0.21  0.25        0.22       0.27       –
dishwasher    0.20      0.29  –           –          –          –
display       0.22      0.15  0.23        0.19       0.24       –
earphone      0.14      0.07  0.11        0.11       0.13       –
faucet        0.19      0.06  0.13        0.14       0.20       –
file          0.22      0.33  0.36        0.24       0.25       –
guitar        0.45      0.35  0.36        0.41       0.58       –
helmet        0.10      0.06  0.09        0.08       –          –
keyboard      0.36      0.25  0.37        0.35       –          –
lamp          0.26      0.13  0.20        0.21       0.27       –
laptop        0.29      0.21  0.33        0.26       0.33       –
loudspeaker   0.20      0.26  0.27        0.19       0.23       –
mailbox       0.21      0.20  0.23        0.20       0.19       –
microphone    0.23      0.22  0.19        0.18       0.21       –
microwave     0.23      0.36  0.35        0.22       0.25       –
motorcycle    0.27      0.12  0.22        0.24       0.28       –
mug           0.13      0.11  0.15        0.11       0.17       –
piano         0.17      0.11  0.16        0.14       0.17       –
pillow        0.19      0.14  0.17        0.18       0.30       –
pistol        0.29      0.22  0.23        0.25       0.30       –
pot           0.19      0.15  0.19        0.14       0.16       –
printer       0.13      0.11  0.13        0.11       0.14       –
remote        0.30      0.33  0.31        0.31       0.37       –
rifle         0.43      0.28  0.30        0.36       0.48       –
rocket        –         –     –           –          –          –
sofa          0.24      0.23  0.27        0.21       0.27       –
stove         0.20      0.19  0.24        0.18       0.19       –
table         0.31      0.24  0.34        0.26       0.34       –
telephone     0.33      0.42  –           –          –          –
train         0.34      0.29  0.30        0.32       0.38       –
vessel        0.28      0.19  0.22        0.23       0.29       –
washer        0.20      0.31  –           0.21       0.25       0.23