Point Attention Network for Semantic Segmentation of 3D Point Clouds
Mingtao Feng (a), Liang Zhang (b), Xuefei Lin (c), Syed Zulqarnain Gilani (d) and Ajmal Mian (d)
(a) Hunan University, Changsha, China; (b) Xidian University, Xi'an, China; (c) Hunan Agricultural University, Changsha, China; (d) The University of Western Australia, Perth, Australia
Abstract
Convolutional Neural Networks (CNNs) have performed extremely well on data represented by regularly arranged grids such as images. However, directly leveraging the classic convolution kernels or parameter sharing mechanisms on sparse 3D point clouds is inefficient due to their irregular and unordered nature. We propose a point attention network that learns rich local shape features and their contextual correlations for 3D point cloud semantic segmentation. Since the geometric distribution of the neighboring points is invariant to the point ordering, we propose a Local Attention-Edge Convolution (LAE-Conv) that constructs a local graph based on the neighborhood points searched in multiple directions. We assign attention coefficients to each edge and then aggregate the point features as a weighted sum of its neighbors. The learned LAE-Conv layer features are then given to a point-wise spatial attention module to generate an interdependency matrix of all points regardless of their distances, which captures long-range spatial contextual features contributing to more precise semantic information. The proposed point attention network consists of an encoder and a decoder which, together with the LAE-Conv layers and the point-wise spatial attention modules, make it an end-to-end trainable network for predicting dense labels for 3D point cloud segmentation. Experiments on challenging benchmarks of 3D point clouds show that our algorithm can perform at par or better than the existing state-of-the-art methods.
Keywords:
Semantic segmentation, 3D point cloud, point attention network, deep learning
1. Introduction

With the widespread availability of 3D scanning devices and depth sensors [1], 3D geometric data is being increasingly used in many different application domains such as robotics, autonomous driving, 3D scene understanding, city planning, infrastructure maintenance, etc. [2, 3, 4, 5]. Several representations of 3D shape have been investigated, such as depth maps, voxels, multi-views, meshes and point clouds [6]. However, the point cloud is arguably the simplest format for 3D data representation and has hence attracted increasing research interest. Similar to the pixels in a 2D image, points in the three-dimensional coordinate system are the basic building units of point clouds, which naturally encode the geometric features and spatial distributions of a real 3D scene.

The extraction of meaningful information from 3D point clouds requires semantic segmentation. Point cloud semantic segmentation has been a challenging and active research topic for the last few years. Unlike pixels of 2D images, which have a rectangular grid-like structure with no missing bits, 3D point clouds are sparse, irregular, unordered and contain missing regions due to the limited range of scanners and occlusions. While deep learning has been very successful in semantic segmentation of 2D images, its use for 3D point clouds has not been fully exploited yet. Qi et al. [7] first proposed PointNet, which learns point features directly from unordered point sets. In PointNet, all 3D points are independently passed through a set of multi-layer perceptrons (MLPs) and then aggregated into a global feature using max-pooling. Recent research directions focus on extending the basic idea of PointNet to incorporate local geometric features for abstracting more discriminative high-level features [8, 9, 10]. Among these methods, PointNet++ [8] exploited neighborhood points within a ball query radius, where each local point is processed separately by a PointNet-based hierarchical network. However, the relationships between local points are neglected. Recently, dynamic graph CNN [9] was proposed, which considers neighborhood points as a local graph and uses a filter generating network to assign edge labels. Since the edge-conditioned network does not consider the order of local points, it does not have transformation invariance. Similar to dynamic graph CNN [9], dynamic edge-conditioned filters [10] were introduced as an edge function to encode local information by combining the relative coordinates (raw features) between the center point and its K-nearest neighbors (KNN). Although dynamic edge-conditioned filters [10] attempt to use a function designed to handle local points, they do not fully exploit the geometrical correlations of the local neighborhood points.

To address the above shortcomings, we propose a local attention-edge convolution (LAE-Conv) layer that extends the ideas of [8, 9] and [10]. The LAE-Conv layer constructs a local graph based on the neighborhood points searched along multiple directions. Unlike KNN and ball query methods, we propose a multi-directional search strategy that finds all neighborhood points from 16 directions spread systematically within a ball query, making the local geometric shape more generalizable across space. After the search operation, the LAE-Conv layer assigns attention coefficients to each edge and then aggregates the central point features as a weighted sum of its neighbors.
Aggregating features from a group of points with their contribution coefficients, rather than using a single max-pooling operation, better exploits the correlations between points to obtain accurate and robust local geometric details. Moreover, the LAE-Conv layer is invariant to the ordering of points and can implicitly infer how the points contribute to the overall 3D shape.

Equipped with the LAE-Conv layer, we are able to design hierarchical deep learning architectures on point clouds for semantic segmentation. Since each LAE-Conv layer has a limited local receptive field, each unit of the output features (at the initial layers) exploits correlations within its local scale only. However, later LAE-Conv layers have progressively larger receptive fields, enabling the network to learn hierarchical features. While existing networks [8, 11, 9] capture multi-scale shapes for high-level point feature learning, they do not leverage the long-range contextual relationships among points belonging to the same categories, which are important for semantic segmentation. Superpoint graphs [12] employed a recurrent neural network to exploit long-range dependencies based on an unsupervised geometric partitioning. However, that method relies heavily on the partitioning results. To address the above problems, in this paper, we propose a point-wise spatial attention module, which captures long-range contextual information in the spatial dimension. Features obtained from the LAE-Conv layer are fed into the point-wise spatial attention module to generate a global dependency matrix which models the correlations between any two points of the feature maps. By multiplying the dependency matrix with the original features, the differences between point features of the same category are reduced. Hence, any two points with similar features can contribute to mutual improvement regardless of their spatial distance.

Using the proposed LAE-Conv layer and point-wise spatial attention module as the main building blocks, we design a U-shape network to predict dense labels for semantic segmentation of 3D point clouds. The unorganized 3D points (raw data) are input directly to our point attention network comprising an encoder and a decoder. This is different from other approaches [8, 11, 9] since our method stacks the point-wise attention module after the LAE-Conv layer at different stages of the network, enabling it to learn more accurate local geometric features and long-range relationships.

To summarize, our contributions include: (1) A novel local attention-edge convolution (LAE-Conv) layer to encode point features using a weighted sum of its neighborhood points with edge attention coefficients. The proposed multi-directional search strategy makes the local geometric shape more generalizable across space. (2) A novel point-wise spatial attention module that learns long-range contextual information and significantly improves the segmentation results by boosting the representation power of the local features obtained from the LAE-Conv layers. (3) Extending the U-shaped network to incorporate the proposed LAE-Conv layer and point-wise spatial attention module. Experimental results show that our method obtains on par or better performance than existing state-of-the-art methods quantitatively and qualitatively on challenging benchmark datasets. Finally, we show that our proposed point attention block can generalize to other networks and improve their performance.
2. Related Work
A number of deep learning architectures have been recently proposed to learn directly from 3D point cloud data or its derived representations for applications such as semantic segmentation, object part segmentation and object categorization. We provide a brief survey of these methods and divide them into three categories based on the underlying data representations they use.
This category includes methods that transform the irregular 3D point cloud data to a canonical form so that traditional convolutions can be applied [13, 14]. Volumetric representations [15, 6, 16, 17, 18] are the most common canonical form used by these methods due to their simplicity. However, voxel representations have cubic complexity, leading to a dramatic increase in the memory consumption and computing resources required to process even medium size point clouds. To alleviate this problem, Octree-Net [19, 20] and Kd-Net [21] have been proposed, which skip representation and computations at empty spaces to save memory and processing resources respectively [22]. Moreover, sparse convolutional operations, where the activations are kept sparse in the convolution layers [23, 24], have been introduced to process spatially-sparse 3D point clouds. Nevertheless, the kernels are still dense and inefficient in their implementation. Multi-view convolutional neural networks and their variants [25, 26, 27, 28] have also been proposed. These methods render the 3D shape from multiple pre-defined views, which are then processed by conventional image-based convolution networks. The main drawback of the multi-view frameworks is that the 3D geometric information is not always fully retained in the 2D projections.

The sparse lattice networks proposed by Hang et al. [29] project the input 3D points onto a high dimensional lattice, perform standard spatial convolution on it and then filter the features back to the input points. Matan et al. [30] extended the function over the point cloud to a volumetric function, where volumetric convolution is applied and then a restriction operator is used to do the inverse action. Qiangui et al. [31] used a slice pooling layer to project unordered point clouds into an ordered format, making it feasible to apply traditional deep learning algorithms. Fully convolutional networks [32] have been proposed that sample the input point cloud uniformly and use PointNet as a low-level feature learner, followed by 3D convolutions to learn features at multiple scales. Finally, tangent convolutions [33] have also been proposed that operate directly on surface geometry in the tangent space. Although the above methods have used deep learning techniques to realize 3D data analysis tasks, they have not used the 3D point clouds directly. We believe that learning directly from raw 3D point cloud data can achieve higher accuracy and efficiency as learning from raw data is the major strength of deep learning.
Graph convolutional methods combine the power of the convolution operation with graph representations of irregular data. Graph convolutional networks have been designed to perform convolutions either in the spectral or the spatial domain. More recently, Joan et al. [34] proposed a generalization of convolution for graphs via the Laplacian operator. In that method, the spectral network can learn convolutional layers with a number of parameters for low dimensional graphs. Wang et al. [35] proposed a local spectral graph convolution to construct a local graph from a point's neighborhood and aggregate information from nodes using their spectral coordinates. The PointNet++ architecture is then applied along with the local spectral graph convolution layers and graph pooling layers. The regularized graph convolution network proposed by Gusi et al. [36] treats the point cloud as a graph and defines a convolution operation over it. Moreover, a graph smoothness prior is used in the loss function to regularize the learning process. Graph Laplacian based methods have a number of drawbacks, including the computational complexity of the Laplacian eigen-decomposition, the large number of parameters needed to express the convolutional filters, and the lack of spatial localization. Different from these methods, Martin et al. [10] proposed a convolution-like operation on graph signals in the spatial domain and used an asymmetric edge function to describe the relationships between local points. However, the edge labels are dynamically generated and hence the irregular distribution of local points is not taken into account. This method was improved by Wang et al. [9] through a max pooling operation on local features. However, the max pooling operation is still unable to fully utilize the correlations of local points. Our proposed method exploits local feature learning using a completely different approach. We propose a local attention-edge convolution layer that learns local relationships between points.
Many researchers have proposed deep learning architectures that learn directly from point clouds. One of the earliest methods in this category is PointNet [7], which operates on point clouds using multi-layer perceptrons (MLPs). PointNet is robust to global transformations of the 3D shape because the spatial transformer network [37] is used to learn the 3D alignment. The main limitation of PointNet is that it only relies on the max-pooling layer to learn global features. Since PointNet does not consider local relationships, Qi et al. [8] introduced an improved network named PointNet++, which exploits local geometric features in point sets and aggregates them for hierarchical inference. However, PointNet++ still treats points within local regions individually and does not consider relationships between the neighborhood points.

Later, Francis et al. [38] designed a multi-scale architecture to enlarge the receptive field over the 3D scene by incorporating larger-scale spatial grid blocks into PointNet. Loic et al. [12] used an unsupervised method to cluster input points into superpoint graphs, and then fed the graphs to a PointNet-based gated recurrent unit. Li et al. [11] proposed the X-Conv layer instead of MLPs to permute unordered local points into a latent, potentially canonical order. A similar approach was proposed in [39], where kernel correlation was introduced to incorporate local information extracted from the point cloud by PointNet. Wang et al. [40] introduced a similarity group proposal network for point cloud instance segmentation, which uses a similarity matrix to produce grouping proposals based on features extracted from PointNet. Different from these PointNet-based frameworks, Hua et al. [41] presented a point-wise convolution operator that can be applied to each point of the point set. Recently, Zhao et al. [42] proposed PointWeb for point cloud processing, which connects all points densely in a local neighborhood for better encoding of local geometric features. Wu et al. [43] introduced PointConv, a nonlinear function kernel for point clouds, which is used to learn translation-invariant and permutation-invariant features in 3D space. Wang et al. [44] designed a graph attention kernel to adapt to the local geometry, which is useful for fine-grained segmentation.

A common limitation of all the aforementioned methods is that they are unable to simultaneously exploit fine local details and long-range contextual information. We fill this gap and propose a network that learns local geometrical features using their edge attention coefficients and allows deep learning architectures to exploit fine details as well as interactions over longer distances.
3. Proposed Approach
We first give details of the LAE-Conv layer that captures accurate local geometric details. Next, we explain the point-wise spatial attention module that aggregates long-range contextual information based on the output of the LAE-Conv layers. Finally, we present the general framework of our network.
The Local Attention-Edge Convolution (LAE-Conv) layer forms the basic component of our point attention network architecture for 3D point cloud semantic segmentation. Inspired by DGCNN [9], ECC [10], GATs [45] and the Non-local network [46], we construct a multi-directional neighborhood graph and apply a graph attention mechanism to compute local edge features. Similar to traditional convolution in images, LAE-Conv explores local regions to leverage correlations between unordered points and exploits the local geometric structure of the points. We summarize the LAE-Conv operator in Algorithm 1.

Algorithm 1 LAE-Conv Operation
Input: local points h, central point p_i; number of selected points m in each bin.
Output: filtered central point p_i^LAE.
1. Search the K neighbor points p_j of p_i in the point cloud h.
2. p_j − p_i: move the points p_j into the local coordinate system of p_i.
3. W(p_j − p_i): transform the input points into higher-level features.
4. α_ij: compute the normalized attention edge coefficients with softmax.
5. p'_i: use the graph attention aggregator to obtain the updated feature at p_i.
6. MLP(p'_i): feature transformation operation.
7. return p_i^LAE.

Figure 1: (a) An illustration of our multi-directional search method. The ball space around the center point within the search radius is divided into 16 uniform directions. The azimuth θ, the radius r and the number m of selected points in one bin are hyperparameters. (b) When m = 1, 16 neighborhood points are considered along different directions. If all neighbors are projected onto the xy coordinate plane, there are two points in each of the eight directions. The thickness of the line connecting the center point to a neighbor represents its contribution value.

In the image convolution operation, the local region of a pixel can be represented in a grid-like structure given a convolution kernel size. However, the neighborhood of a center point in a point cloud is defined by metric distance in a 3D coordinate system where neighboring points are irregularly distributed. To robustly leverage local point correlations, we endeavour to explicitly capture geometric information in different orientations. Consider an unordered point cloud P = {p_1, p_2, ..., p_N} with p_i ∈ R^C, where N is the number of points and C is the feature dimension at each point. When each point is represented by its 3D coordinates p_i = (x_i, y_i, z_i), then C = 3. We denote a central point in P as p_i and its K neighbors in P as p_j, j ∈ N(i). As shown in Figure 1(a), the space around the reference point p_i within a radius of r is split into 16 bins, where each bin indicates a direction and spans an azimuth angle θ. Within the spatial range represented by each bin, we select the m nearest points of p_i from all the points that fall in that bin and use their features to represent the bin, i.e., when m = 1 there are K = 16 neighbors, one per bin. Since points far away from p_i are not very useful to represent p_i, we set the radius r empirically as a hyper-parameter for each layer. In case there are insufficient points inside a bin, the point p_i itself is repeated. This is similar to self convolution.

Two common ways for range query are K-nearest neighbor (KNN) search and ball query. KNN returns a fixed number of K neighboring points, while ball query returns all points that are within a radius. The local shape will not be well represented if all the points selected using either of these methods come from a small region or a single direction. Different from KNN and ball query, our search method guarantees that the neighborhood points come from different directions to ensure sufficient expressive power for encoding the local geometric information. We compare the effectiveness of our search method against ball query and KNN in the experiments section.
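As a concrete illustration of this search strategy, the following NumPy sketch partitions the ball around a centre point into 16 bins and picks the m closest points per bin. The exact bin layout used here (eight 45-degree azimuthal sectors, each split by the sign of z) is our assumption based on Figure 1 and is not a detail taken from the released code.

```python
import numpy as np

def multi_directional_search(points, p_i, r=0.2, m=1):
    """Pick the m nearest neighbours of p_i from each of 16 directional bins.

    Assumed bin layout (a reading of Figure 1): 8 azimuthal sectors of 45 degrees,
    each split into an upper (z >= 0) and lower (z < 0) half, giving 16 bins.
    Empty bins fall back to p_i itself (self connection), as described in the text.
    """
    rel = points - p_i
    dist = np.linalg.norm(rel, axis=1)
    in_ball = (dist > 0) & (dist <= r)           # neighbours inside the query radius
    azimuth = np.arctan2(rel[:, 1], rel[:, 0])   # angle in the xy plane
    sector = ((azimuth + np.pi) // (np.pi / 4)).astype(int) % 8
    bin_id = sector + 8 * (rel[:, 2] < 0)        # bin index in 0..15
    neighbours = []
    for b in range(16):
        idx = np.where(in_ball & (bin_id == b))[0]
        idx = idx[np.argsort(dist[idx])][:m]     # m closest points in this bin
        picked = [points[j] for j in idx]
        picked += [p_i] * (m - len(picked))      # pad empty bins with the centre point
        neighbours.extend(picked)
    return np.stack(neighbours)                  # shape (16 * m, 3)

# Toy usage on a random point set.
pts = np.random.default_rng(1).uniform(-0.3, 0.3, size=(500, 3))
print(multi_directional_search(pts, pts[0]).shape)   # (16, 3) for m = 1
```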
For a set of local points h = {p_i, p_{j1}, p_{j2}, ..., p_{jK}}, h ∈ R^C, where p_i is the central point and the others are its K neighbors, we consider a graph G = (V, E), where V is a finite set of points with |V| = K + 1 and E ⊆ V × V is the set of directed edges {(p_i, p_{j1}), (p_i, p_{j2}), ..., (p_i, p_{jK})}. We define the attention edge coefficients e_{ij}, which represent the importance of neighbor p_j to the central point p_i, computed by an attention mechanism a:

e_{ij} = a(W h_i, W h_j),   (1)

where W ∈ R^{C×C'} is a learnable weight matrix that transforms the input point set to higher-level features, h_i and h_j represent the central point and its neighbors respectively, and the mechanism a is a single-layer MLP parametrized by a weight vector a ∈ R^{C'}. To make the edge coefficients easily comparable across different points, we use the softmax function to normalize them across all neighbors of the reference point p_i:

α_{ij} = softmax(e_{ij}) = exp(e_{ij}) / Σ_{j∈N(i)} exp(e_{ij}).   (2)

The final edge coefficients computed by the attention mechanism may then be expressed as:

α_{ij} = exp(a(W(p_j − p_i))) / Σ_{j∈N(i)} exp(a(W(p_j − p_i))),   (3)

where the neighbor points are first transformed to the local coordinate system of the central point by (p_j − p_i) and the local coordinates of each point are then lifted to higher-order features by W.

Once obtained, the normalized edge coefficients α_{ij} are used to assign attributes to each edge. Our approach computes the filtered feature at point p_i as a weighted sum of the points in its neighborhood. The proposed commutative aggregation method not only solves the problem of undefined point ordering, but also smoothes out the structural information. The local graph attention aggregator is defined as

p'_i = Σ_{j∈N(i)} α_{ij} W p_j,   (4)

where p'_i is the updated feature of the central point p_i. Now that we have an aggregated representation for the central point p_i, it is natural to add a feature transformation function f to incorporate additional non-linearity and increase the learning capacity of the model. The transformation is realized by an MLP with a non-linear activation function. The output of the transformation function is p_i^LAE ∈ R^{1×C'}. The proposed LAE-Conv layer is summarized in Algorithm 1.

The output point cloud P_LAE ∈ R^{N×C'} of the LAE-Conv layer has rich representation power for local geometric features. However, since each LAE-Conv layer has a local receptive field, individual units of the filtered features are unable to exploit contextual information outside of their local regions. In P_LAE, features corresponding to points with the same label can be significantly different when the points are far apart. These differences affect the point-wise segmentation accuracy of the scene as a whole. To address this issue, we focus on global spatial relationships to boost the representation power of the LAE-Conv layer. We design a point-wise spatial attention module that captures global dependencies by building associations among features within the point set. We demonstrate that by stacking these blocks after LAE-Conv layers, we can construct local-global architectures that adaptively encode long-range contextual information, thus improving the semantic segmentation accuracy of 3D point clouds that cover large areas. Next, we introduce the process that adaptively aggregates point-wise spatial contexts.
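To make Eqs. (1)-(4) concrete, here is a minimal NumPy sketch of the operator for a single centre point. The lifting matrix W, the attention vector a and the final MLP are random stand-ins introduced only for illustration; they are not the paper's trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lae_conv(p_i, neighbors, W, a_vec, mlp):
    """LAE-Conv for one centre point, following Algorithm 1 and Eqs. (1)-(4).

    p_i       : (C,)       centre point (e.g. xyz coordinates, C = 3)
    neighbors : (K, C)     neighbours from the multi-directional search
    W         : (C, C_out) learnable lifting matrix
    a_vec     : (C_out,)   weight vector of the single-layer attention MLP a(.)
    mlp       : callable   final feature transformation f
    """
    rel_feats = (neighbors - p_i) @ W     # W(p_j - p_i): local coordinates lifted to C_out
    logits = rel_feats @ a_vec            # attention mechanism a(.), one logit per edge
    alpha = softmax(logits)               # normalised edge coefficients, Eq. (3)
    p_updated = alpha @ (neighbors @ W)   # weighted sum of W p_j, Eq. (4)
    return mlp(p_updated)                 # p_i^LAE after the extra non-linearity

# Toy usage with random parameters.
rng = np.random.default_rng(0)
K, C, C_out = 16, 3, 64
p_lae = lae_conv(rng.normal(size=C), rng.normal(size=(K, C)),
                 rng.normal(size=(C, C_out)), rng.normal(size=C_out),
                 mlp=lambda x: np.maximum(x, 0.0))
print(p_lae.shape)  # (64,)
```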
Figure 2: The proposed point-wise spatial attention module. The feature maps are represented by the shape of their tensors, e.g., N × F, where N denotes the number of points and F denotes the feature dimension. For simplicity, the batch size is set to 1. ⊗ denotes matrix multiplication and ⊕ denotes point-wise sum. The green, yellow and blue boxes denote MLP layers. Long-range correlations are learned once the input features pass through this module.

Inspired by the position attention operation [47], we define a point-wise spatial attention module for 3D point clouds. As illustrated in Figure 2, two MLP layers are used to transform the local feature P_LAE into two new representations A and B respectively, where A, B ∈ R^{N×F}. We compute the relationships between different points based on the transpose of A and B. Unlike [47], we calculate the spatial correlations of all points directly from the transpose of A and B without reshaping the matrices, hence maintaining the original spatial distribution. Softmax is then applied to normalize the relationship map and obtain the point-wise spatial attention map S of size N × N:

s_{ij} = exp(A_i · B_j) / Σ_{i=1}^{N} exp(A_i · B_j),   (5)

where i and j denote the point positions in A and B respectively, s_{ij} is the i-th point's impact on the j-th point, and · denotes matrix multiplication. Two points have a strong correlation when their features carry similar semantic information.

At the same time, the local feature P_LAE is transformed to a new feature D by an MLP layer. This is followed by a matrix multiplication between S and D. Finally, the output is multiplied by a scale parameter α and element-wise summation with the features P_LAE is performed to obtain the final output P_final ∈ R^{N×C''}:

P_final = S · D + P_LAE,   (6)

where · denotes matrix multiplication. The resulting feature P_final contains long-range contextual information and selectively aggregates contexts according to the point-wise spatial attention map S. This module improves the feature representation power and leads to more accurate 3D point cloud semantic segmentation.
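The module can be sketched in a few lines of NumPy. The three MLP branches are emulated here by random linear maps, and the learnable scale on S·D mentioned above is omitted, so this is only a shape-level illustration of Eqs. (5)-(6), not the trained module.

```python
import numpy as np

def pointwise_spatial_attention(p_lae, feat_dim, rng=np.random.default_rng(0)):
    """Sketch of the point-wise spatial attention module, Eqs. (5)-(6).

    p_lae    : (N, C) features produced by an LAE-Conv layer
    feat_dim : width F of the two MLP branches that produce A and B
    """
    n, c = p_lae.shape
    w_a = rng.normal(size=(c, feat_dim))          # stand-ins for the three MLP branches
    w_b = rng.normal(size=(c, feat_dim))
    w_d = rng.normal(size=(c, c))
    A, B, D = p_lae @ w_a, p_lae @ w_b, p_lae @ w_d   # (N,F), (N,F), (N,C)
    logits = A @ B.T                                  # pairwise affinities A_i . B_j, (N, N)
    S = np.exp(logits - logits.max(axis=0, keepdims=True))
    S = S / S.sum(axis=0, keepdims=True)              # softmax over points, Eq. (5)
    return S @ D + p_lae                              # long-range context plus residual, Eq. (6)

feats = np.random.default_rng(1).normal(size=(128, 64))
print(pointwise_spatial_attention(feats, feat_dim=32).shape)   # (128, 64)
```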
Figure 3: Illustration of the proposed point attention network for point cloud segmentation. The encoder and decoder parts are based on the LAE-Conv layer and the point-wise spatial attention module. B, N_i and C_i denote the batch size, point number and point feature dimension respectively. The downsampling and upsampling processes follow [8]. The encoder and decoder parts are linked by three skip connections.

For dense point label prediction, the output resolution is high. Moreover, there are multiple objects with different scales in one scene. Selecting the most representative scale for each kind of object is important for semantic segmentation. Following the hierarchical structure of PointNet++ [8], our network consists of encoder and decoder parts. As shown in Figure 3, our point attention network comprises LAE-Conv layers and point-wise spatial attention modules. In the encoder, the input point set is processed by a sequence of LAE-Conv layers, which transform it into fewer representation points with richer features. The input point cloud is represented by its 3D coordinates, and sometimes with RGB color values as well. Point-wise spatial attention modules are stacked after the third and fourth LAE-Conv layers to aggregate long-range point-wise contextual information from the output of the previous LAE-Conv layer. The long-range contextual features, together with the local features from the LAE-Conv layers, achieve robust and accurate 3D point cloud semantic segmentation.

In the decoder, three skip connections are used to combine features from the encoder. A point-wise spatial attention module is also inserted after the fifth LAE-Conv layer in the decoder. In our hierarchical architecture, we use three steps of down-sampling and three steps of up-sampling, which follow the set abstraction and feature propagation modules of PointNet++ [8]. Finally, all the features in the last decoder layer pass through a fully connected layer and are converted to class probabilities.
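For orientation, the stage layout of Figure 3 can be written down as a small list of (points, channels) pairs. The concrete numbers below are the ScanNet configuration reported in Section 5, with "attn" marking the layers followed by a point-wise spatial attention module (layers 3, 4 and 5); both the sizes and the attention placement vary with the dataset.

```python
# Sketch of the U-shaped point attention network (ScanNet variant).
encoder = [(8192, 64), (2048, 128), (512, 256, "attn"), (128, 512, "attn")]
decoder = [(512, 256, "attn"), (2048, 256), (8192, 128)]
fc_out  = (8192, 21)   # per-point class probabilities, 21 ScanNet categories

for i, stage in enumerate(encoder + decoder, start=1):
    print(f"LAE-Conv layer {i}: {stage}")
print("fully connected:", fc_out)
```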
4. Comparison with Existing Methods
Our point attention network is a more generalized form of the classic approach PointNet++ [8]. We explain how PointNet++ is a special case of our network. PointNet++ is an extension of [7] which considers local point structure. Given a reference point p_i, ball query searches for K local points, giving data of size N_l × K × C_l; PointNet processes the local region points individually and then max-pools them to obtain the most representative point feature as the output N_l × C'_l of the local region. Different from PointNet++, the LAE-Conv layer constructs a local graph over the K neighbors and the central point p_i. We compute attention edge coefficients e_{ij} = {e_{i1}, e_{i2}, ..., e_{iK}} to indicate the different contribution of each neighbor to the central point. When one coefficient equals 1 and all the others are 0, the features of a single neighbor are selected to represent the local region. We can observe that the basic convolution layer of PointNet++ is an instance of our LAE-Conv layer.

DGCNN [9] uses KNN to establish the local point shape and proposes the aggregation operation max(MLP(p_i, p_j − p_i)). In that operation, the neighbor points are first moved to the local coordinate system and then stacked with the central point. All the neighbors have an equal contribution to the central point, which is equivalent to our operator when all edge coefficients are equal. Since DGCNN is based on PointNet, the receptive field remains constant (K) at different layers, which is a disadvantage when encoding point clouds with different spatial distribution densities.

Similar to PointNet++, PointCNN [11] follows the encoder-decoder architecture and learns an X-transformation to lift the input irregular points into an unknown canonical format, then applies a typical convolution on the transformed point cloud. In PointCNN, the dilated convolution process from image convolution networks is employed to expand the local receptive field of different layers. The local receptive field changes the number of neighborhood points K by adjusting the dilation ratio. Different from the grid structure of local pixels, points are unordered in a three-dimensional coordinate system and their density distribution is not uniform. Although KNN searches for neighborhood points in proportion to the dilation ratio, the global geometric features learned by changing the receptive field are limited. To address this issue, our point attention network inserts a point-wise attention module in the high-level feature layers. A crucial difference between these two operations is that the latter assumes a long-range dependency, which reduces the gap between features corresponding to points with the same label and encodes more accurate global information. The more similar the feature representations of two points are, the greater the correlation between them.
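The two special cases discussed above can be written directly in terms of the edge coefficients of Eq. (4). The snippet below is only a schematic reading: a single selected neighbour stands in for PointNet++-style pooling and uniform weights stand in for DGCNN's equal contributions, with an arbitrary selection rule used purely for illustration.

```python
import numpy as np

feats = np.random.default_rng(0).normal(size=(16, 64))   # K = 16 lifted neighbour features W p_j

# PointNet++-style pooling, read as selecting the features of one neighbour:
# a one-hot coefficient vector picks a single row (the selection rule is arbitrary here).
alpha_select = np.eye(16)[feats.sum(axis=1).argmax()]
agg_select = alpha_select @ feats

# DGCNN-style equal contributions: every edge coefficient has the same value.
alpha_equal = np.full(16, 1.0 / 16)
agg_equal = alpha_equal @ feats

print(agg_select.shape, agg_equal.shape)                  # (64,) (64,)
```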
5. Experiments and Discussion
We evaluate the performance of the proposed network on the ShapeNet [48] 3D part segmentation dataset and on the two largest point cloud segmentation benchmarks, ScanNet [49] and Stanford Large-Scale 3D Indoor Spaces (S3DIS) [50]. While ShapeNet is synthetic data, ScanNet and S3DIS are real point clouds obtained with a scanner. We perform ablation studies of different design choices and network variations, and compare the performance of our network with the existing state of the art.
ScanNet [49] contains 1513 scans annotated with semantic voxel labels from 21 categories (bed, refrigerator, floor, table, etc., plus other furniture). ScanNet is divided into 1201 training and 312 test samples. Similar to [8], we split the ScanNet training scenes into 2m by 2m by 3m blocks, with 0.5m padding in each direction (x, y, z), and sample 8192 points randomly from each block on the fly. To predict the semantic label of every point of a test scene, we similarly split it into cubes using a sliding window strategy along the xy plane with different stride sizes. If the same point gets different predictions in the overlap regions, we choose the one with the highest confidence.

Although ScanNet also contains RGB values for each point, we only use the xyz coordinates as point features for a fair comparison with other methods. Hence, the input data size for the network is 8192 × 3. As shown in Figure 3, we use the downsampling and upsampling operations of PointNet++ [8] for both the encoder and decoder parts. The output point numbers and feature dimensions of the different LAE-Conv layers are (N = 8192, C = 64), (N = 2048, C = 128), (N = 512, C = 256), (N = 128, C = 512), (N = 512, C = 256), (N = 2048, C = 256) and (N = 8192, C = 128) respectively. The fully connected layer with size (N_fc = 8192, C_fc = 21) converts the final features into class probabilities. We set (m = 1, K = 16) for the neighborhood search. For the three point-wise attention blocks, the output point numbers and feature dimensions are (N_p = 512, C_p = 256), (N_p = 128, C_p = 512) and (N_p = 512, C_p = 256) respectively.
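The block preparation described above can be sketched as follows. The block size, padding and point count are the values stated in the text, while the grid stepping and border handling are our own simplification of the pipeline.

```python
import numpy as np

def make_training_blocks(scene_xyz, block=2.0, depth=3.0, pad=0.5, n_pts=8192,
                         rng=np.random.default_rng(0)):
    """Split one training scene into padded blocks and sample points from each.

    A rough sketch of the preparation described above: 2m x 2m x 3m blocks with
    0.5m padding, and 8192 points drawn at random from each block on the fly.
    """
    mins, maxs = scene_xyz.min(axis=0), scene_xyz.max(axis=0)
    blocks = []
    for x0 in np.arange(mins[0], maxs[0], block):
        for y0 in np.arange(mins[1], maxs[1], block):
            lo = np.array([x0 - pad, y0 - pad, mins[2] - pad])
            hi = np.array([x0 + block + pad, y0 + block + pad, mins[2] + depth + pad])
            idx = np.where(np.all((scene_xyz >= lo) & (scene_xyz <= hi), axis=1))[0]
            if len(idx) == 0:
                continue
            pick = rng.choice(idx, size=n_pts, replace=len(idx) < n_pts)
            blocks.append(scene_xyz[pick])
    return blocks   # each entry has shape (8192, 3)

scene = np.random.default_rng(1).uniform([0, 0, 0], [6, 4, 3], size=(50000, 3))
print(len(make_training_blocks(scene)), "blocks")
```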
Table 1: 3D point cloud semantic segmentation results on ScanNet scenes. The metrics are mean per-class Intersection over Union (mIoU, %) and per-voxel overall accuracy (OA, %).

Method           mIoU    OA
PointNet [7]     -       73.9
PointNet++ [8]   -       84.5
RSNet [31]       39.35   -
TCDP [33]        40.9    80.9
FCPN [32]        -       82.6
3DRCNN [51]      -       76.5
PointCNN [11]    -       85.1
Ours

Table 2: Model size and inference time comparison, where "M" means million and "s" denotes seconds. We use the size of the model checkpoint file (.ckpt) obtained after training with TensorFlow to represent the complexity of the different methods. The entire scenes were tested 5 times and the average time was recorded.
Method                      Size (M)   Time (s)
PointNet [7]                321.9      2.16
PointNet++ (MSG) [8]        177.3      3.8
DGCNN [9]                   180.1      3.94
SpiderCNN (3 layers) [14]   349.5      4.3
Ours                        183        3.97

The initial learning rate is 0.001, the batch size is 22 and the momentum is 0.9. We set a decay rate of 0.7 and stop training after 1000 epochs.

Table 1 shows a quantitative comparison of our proposed point attention network with PointNet++ [8] and PointCNN [11] on the ScanNet dataset. The comparison uses two metrics, namely the mean per-class IoU (mIoU, %) and the per-voxel overall accuracy (OA, %). For a fair comparison, Table 1 shows the results of the baseline methods as reported in the original papers, since the trained models are not available for testing. Compared to the baseline methods, our network achieves the highest accuracy on both metrics. Table 2 reports the model size and average inference time of a few representative methods [7], [8], [9], [14] whose released source code is easy to use. Experiments are conducted on a single NVIDIA GTX TitanX GPU with TensorFlow and an Intel Core i7 CPU.

Figure 4: Qualitative comparison on three scenes from ScanNet. (a) Input point cloud with only xyz coordinate features. Semantic segmentation by (b) PointNet++ [8], (c) PointCNN [11] and (d) our method. (e) Ground truth semantic labels. Colors denote different categories. These scenes contain only 13 categories out of 21. The areas marked by the boxes are some examples where our method performed significantly better than the others.

As mentioned in Sec. 3.1, there are three options (KNN, ball query and our multi-directional search method) for searching the neighbors of the central point. We use ScanNet as a test benchmark to compare these options. We also test different numbers of points per bin for our proposed search method. From Table 3, we can see that our method is more effective for selecting local point shapes. When (m = 2, K = 32) and (m = 3, K = 48), the segmentation accuracy is greatly reduced. This is because the number of parameters of the LAE-Conv layer increases with the number of neighbors. Too many neighbors bring information redundancy, which reduces the efficiency and accuracy of the LAE-Conv layer.

To take full advantage of the point-wise spatial attention block, we report the segmentation results with more attention blocks in the network architecture. We add 7 attention blocks (after LAE-Conv layers 1-7), 5 attention blocks (after LAE-Conv layers 2-6) and 3 attention blocks (after LAE-Conv layers 3-5). As shown in the first part of Table 4, more point-wise spatial attention blocks do not
lead to an improvement in performance. One explanation is that more attention blocks massively increase the number of parameters and the network cannot find a local optimal solution within the specified training steps on ScanNet. The second part of Table 4 compares the same number of attention blocks added at different stages of the network. The attention blocks are added right after LAE-Conv layers (2,4,6) and (1,4,7) respectively. We can see that the results deteriorate when the attention blocks are added to layers with lower feature dimensions. A possible explanation is that the point features do not contain enough representative semantic information when their dimensions are low, the features of points with the same labels are significantly different, and the number of parameters of the attention blocks also increases from (2,4,6) to (1,4,7). Under these conditions, the effectiveness of the attention block is limited. Finally, we choose to add three attention blocks right after LAE-Conv layers (3,4,5) in Figure 3. We also tested adding three attention blocks to vanilla PointNet++ (without MSG and DP [8]) at the corresponding stages as in our network. As shown in the third part of Table 4, the performance of the baseline network (vanilla PointNet++ [8]) is improved by 1.4%. This shows that our proposed point attention block is generic and is able to improve the performance of other network architectures.

Table 3: Ablation analysis on ScanNet with different search methods and neighbor numbers. All results are based on our LAE-Conv layer while other settings are kept constant.

Neighborhood Search Method              Overall Accuracy (OA %)
KNN (K=16)                              85.0
Ball query (K=16)                       85.3
Proposed Multi-direction (m=1, K=16)
Proposed Multi-direction (m=2, K=32)    85.9
Proposed Multi-direction (m=3, K=48)    84.4

Table 4: Ablation analysis on ScanNet comparing 3, 5 and 7 point-wise spatial attention modules added to our network, and comparing the results when 3 point-wise spatial attention modules are added at different stages of our network. We also test adding three attention blocks to standard PointNet++ [8].

Block Position                          Overall Accuracy (OA %)
LAE-Conv layers (1-7)                   84.9
LAE-Conv layers (2,3,4,5,6)             85.7
LAE-Conv layers (3,4,5)
LAE-Conv layers (2,4,6)                 86.0
LAE-Conv layers (1,4,7)                 85.5
PointNet++ [8] (vanilla) baseline       83.3
PointNet++ [8] (vanilla, 3-5)           84.7

The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset [50] contains 3D scans obtained with Matterport scanners in 6 areas from three different buildings, divided into 271 individual rooms. Each point in the scene is annotated with one label from 13 categories (ceiling, wall, beam, chair, column, etc., and clutter), and is represented by its 3D coordinates, RGB features and normalized location. S3DIS is a highly unbalanced dataset [50]: floor, wall, chair and other common furniture items are the dominant classes, while bookcase, window, beam, etc. are the rare classes. To prepare the training
data, rooms in S3DIS are split into blocks of 1m × 1m, with 0.5m padding in each direction (x, y). We randomly sample 4096 points from each block during training, while all points are used at test time. Similar to PointNet [7], we follow the same 6-fold cross validation strategy across the 6 areas. To obtain the overall segmentation accuracy, we evaluate the 6 models on their corresponding test areas and report the average results.

For comparison, we use the xyz coordinates and RGB information as the point features. Therefore, the input data size to the network is 4096 × 6. As shown in Figure 3, we use the downsampling and upsampling operations of PointNet++ [8] for both the encoder and decoder parts. The output point numbers and feature dimensions of the different LAE-Conv layers are (N = 4096, C = 64), (N = 1024, C = 128), (N = 512, C = 256), (N = 128, C = 512), (N = 512, C = 256), (N = 1024, C = 256) and (N = 4096, C = 128) respectively. The fully connected layer with size (N_fc = 4096, C_fc = 13) converts the final features into the probability of each class. We set (m = 1, K = 16) during the neighbor search. For the three point-wise attention modules, the output point numbers and feature dimensions are (N_p = 512, C_p = 256), (N_p = 128, C_p = 512) and (N_p = 512, C_p = 256) respectively. We set the initial learning rate to 0.001, the batch size to 32 and the momentum to 0.9. We set the decay rate to 0.7 and stop the training process after 1000 epochs.

Table 5 summarizes the quantitative results, where our proposed method outperforms the baseline methods PointNet [7], SPGraph [12], RSNet [31], 3DRCNN [51] and PointCNN [11]. It is worth noting that our method achieves higher accuracy for some rare-class objects, such as beam, column, window, board and
We also obtain a smoother predictionfor chair in the front row than the other methods. In the second scene, the boardon the wall is more accurately estimated by our method compared to PointNetand PointCNN. Our method makes fewer mistakes in predicting the bookshelves,beam and clutter which are up and below the table compared to other approaches.Notably, our method can predict the clutter on the left wall, even though it isnot marked by ground truth. In the third scene, our method also outputs fewerincorrect predictions for bookcase, table, chair and beam compared to other ap-proaches. We also extend our network architecture to perform part segmentation on theShapeNet dataset [52], which consists of shape models from 16 object cat-egories. Each object in ShapeNet is annotated with to parts. We follow thesettings from [48] to divide the ShapeNet dataset for training, validation and test-ing. During training, we randomly sample 2048 points from each 3D shape whileall points from each 3D shape are used during the test stage.21 able 6: Quantitative comparison on ShapeNet part dataset [52]. The values show part-averagedIoU (pIoU % ), mean per-category pIoU (mpIoU % ) and per-category IoU ( % ) scores. pIoU mpIoU airplane bag cap car chair earphone guitar knife lamp laptop motor mug pistol rocket skateboard tableshapes 2690 76 55 898 3758 69 787 392 1547 451 202 184 283 66 152 5271PointNet [7] 83.7 80.4 83.4 78.7 82.5 74.9 89.6 73.0 91.5 85.9 80.8 95.3 65.2 93.0 81.2 57.9 72.8 80.6PointNet++ [8] 85.1 81.9 82.4 79.0 87.7 77.3 90.8 71.8 91.0 85.9 83.7 95.3 71.6 94.1 81.3 58.7 76.4 82.6DGCNN [9] 85.1 82.3 84.2 83.7 84.4 77.1 90.9 78.5 91.5 87.3 82.9 96.0 67.0 93.3 82.6 59.7 75.5 82.0RSNet [31] 84.9 81.4 82.7 86.4 84.1 78.2 90.4 69.3 91.4 87.0 83.5 95.4 66.0 92.6 81.8 56.1 75.8 82.2SGPN [40] 85.8 82.8 80.4 78.6 78.8 71.5 88.6 78.0 90.9 83.0 78.8 95.8 77.8 93.8 ASCNet [53] 84.6 81.78 83.8 80.8 83.5 79.3 90.5 69.8 91.7 86.5 82.9 96.0 69.2 93.8 82.5 62.9 74.4 80.8PCNNet [30] 85.1 81.8 82.4 80.1 85.5 79.5 90.8 73.2 91.3 86.0 85.0 95.7 73.2 94.8 83.3 51.0 75.0 81.8PGrid [13]
For a fair comparison, we only use the xyz coordinates as the point features. The size of the input data for the network is 2048 × 3. The network architecture is illustrated in Figure 3; we adjust the network parameters to suit ShapeNet. The output point numbers and feature dimensions of the different LAE-Conv layers are (N = 2048, C = 64), (N = 1024, C = 128), (N = 256, C = 256), (N = 128, C = 512), (N = 256, C = 256), (N = 1024, C = 256) and (N = 2048, C = 128) respectively. A fully connected layer with size (N_fc = 128, C_fc = 16) is used at the end to convert the point features into part predictions. We set (m = 1, K = 16) for the neighborhood search. For the three point-wise attention blocks, the output point numbers and feature dimensions are (N_p = 256, C_p = 256), (N_p = 128, C_p = 512) and (N_p = 256, C_p = 256) respectively. We set the initial learning rate to 0.003, the batch size to 16, the momentum to 0.9 and the decay rate to 0.7, and stop training after 500 epochs.

We use the same point-based evaluation metric (mean IoU) as PointNet [7] to compare our method with other methods [7, 8, 9, 31, 40, 53, 30, 13, 29, 39, 14, 54, 23, 11]. We report the part-averaged IoU (pIoU, %), mean per-category pIoU (mpIoU, %) and per-category IoU (%) scores in Table 6. Our method achieves on-par performance with most methods in the pIoU and mpIoU metrics. In individual categories, we rank best in earphone, lamp, motor and rocket. As we can see, our method performs better when there are fewer data points, as in the case of earphone, motor and rocket.

6. Conclusion

We proposed a point attention network for 3D point cloud semantic segmentation. Our network adaptively integrates local point features and long-range contextual information. We introduced a novel local attention-edge convolution (LAE-Conv) layer which applies an attention mechanism to a local graph constructed from the central point and its neighborhood to capture accurate and robust geometric details. To refine the local features output by the LAE-Conv layers, we proposed a point-wise spatial attention module and showed that this module can generalize to other networks to improve their accuracy. Finally, we adapted the U-shaped network to combine the LAE-Conv layers and point-wise spatial attention modules. Experiments on challenging benchmark datasets show that our method quantitatively and qualitatively obtains on-par or better performance than the existing state-of-the-art in 3D point cloud semantic segmentation.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grants 61573134 and 61973106, and in part by the Australian Research Council (ARC) grant DP190102443. We thank Yifeng Zhang and Tingting Yang from Hunan University for help with the baseline experiment setup.