(AF)2-S3Net: Attentive Feature Fusion with Adaptive Feature Selection for Sparse Semantic Segmentation Network
Ran Cheng, Ryan Razani, Ehsan Taghavi, Enxu Li, Bingbing Liu
Noah's Ark Lab, Huawei, Markham, ON, Canada
Abstract
Autonomous robotic systems and self-driving cars rely on accurate perception of their surroundings, as the safety of passengers and pedestrians is the top priority. Semantic segmentation is one of the essential components of road scene perception that provides semantic information about the surrounding environment. Recently, several methods have been introduced for 3D LiDAR semantic segmentation. While they can lead to improved performance, they are either afflicted by high computational complexity, and are therefore inefficient, or they lack the fine details of smaller instances. To alleviate these problems, we propose (AF)2-S3Net, an end-to-end encoder-decoder CNN network for 3D LiDAR semantic segmentation. We present a novel multi-branch attentive feature fusion module in the encoder and a unique adaptive feature selection module with feature map re-weighting in the decoder. Our (AF)2-S3Net fuses voxel-based and point-based learning methods into a unified framework to effectively process large 3D scenes. Our experimental results show that the proposed method outperforms the state-of-the-art approaches on the large-scale SemanticKITTI benchmark, ranking 1st on the competitive public leaderboard upon publication.
1. Introduction
Understanding the surrounding environment has been one of the most fundamental tasks in autonomous robotic systems. With the challenges introduced by recent technologies such as self-driving cars, a detailed and accurate understanding of the road scene has become a main part of any outdoor autonomous robotic system in the past few years. To achieve an acceptable level of road scene understanding, many frameworks benefit from image semantic segmentation, where a specific class is predicted for every pixel in the input image, giving a clear perspective of the scene.
Figure 1: Comparison of our proposed method with SalsaNext [9] and MinkNet42 [8] on the SemanticKITTI benchmark [3].

Although image semantic segmentation is an important step in realizing self-driving cars, the limitations of a vision sensor, such as the inability to record data in poor lighting conditions, variable sensor sensitivity, lack of depth information and limited field-of-view (FOV), make it difficult for vision sensors to be the sole primary source for scene understanding and semantic segmentation. In contrast, Light Detection and Ranging (LiDAR) sensors can record accurate depth information regardless of the lighting conditions, with high density and frame rate, making them a reliable source of information for critical tasks such as self driving.

A LiDAR sensor generates a point cloud by scanning the environment and calculating the time-of-flight for the emitted laser beams. In doing so, LiDARs can collect valuable information, such as range (e.g., in Cartesian coordinates) and intensity (a measure of reflection from the surface of the objects). Recent advancements in LiDAR technology make it possible to generate high quality, low noise and dense scans from desired environments, making the task of scene understanding a possibility using LiDARs. Although rich in information, LiDAR data often comes in an unstructured format and is partially sparse at far ranges. These characteristics make the task of scene understanding challenging using LiDAR as the primary sensor. Nevertheless, research in scene understanding and, specifically, semantic segmentation using LiDARs has seen an increase in the past few years with the availability of datasets such as SemanticKITTI [3].

The unstructured nature and partial sparsity of LiDAR data bring challenges to semantic segmentation. However, great effort has been put in by researchers to address these obstacles, and many successful methods have been proposed in the literature (see Section 2). From real-time methods which use projection techniques to benefit from the available 2D computer vision techniques, to fully 3D approaches which target higher accuracy, there exists a range of methods to build on. To better process LiDAR point clouds in 3D and to overcome limitations such as non-uniform point densities and the loss of granular information in the voxelization step, we propose (AF)2-S3Net, which is built upon the Minkowski Engine [8] to suit varying levels of sparsity in LiDAR point clouds, achieving state-of-the-art accuracy among semantic segmentation methods on SemanticKITTI [3]. Fig. 1 demonstrates qualitative results of our approach compared to SalsaNext [9] and MinkNet42 [8]. We summarize our contributions as:

• An end-to-end encoder-decoder 3D sparse CNN that achieves state-of-the-art accuracy on the SemanticKITTI benchmark [3];
• A multi-branch attentive feature fusion module in the encoder to learn both global contexts and local details;
• An adaptive feature selection module with feature map re-weighting in the decoder to actively emphasize the contextual information from the feature fusion module and improve generalizability;
• A comprehensive analysis of the semantic segmentation and classification performance of our model as opposed to existing methods on three benchmarks, SemanticKITTI [3], nuScenes-lidarseg [5], and ModelNet40 [33], through ablation studies and qualitative and quantitative results.
2. Related Work
SqueezeSeg [31] is one of the first works on LiDAR semantic segmentation using range images, where the LiDAR point cloud is projected onto a 2D plane using a spherical transformation. The SqueezeSeg [31] network is based on an encoder-decoder using a Fully Convolutional Neural Network (FCNN) and a Conditional Random Field (CRF) as a Recurrent Neural Network (RNN) layer. In order to reduce the number of parameters in the network, SqueezeSeg incorporates "fireModules" from [14]. In a subsequent work, SqueezeSegV2 [32] introduced a Context Aggregation Module (CAM), a refined loss function and batch normalization to further improve the model. SqueezeSegV3 [34] stands on the shoulders of [31, 14], adopting a Spatially-Adaptive Convolution (SAC) to use different filters at different locations in relation to the input image. Inspired by YOLOv3 [25], RangeNet++ [21] uses a DarkNet backbone to process a range image. In addition to a novel CNN, RangeNet++ [21] proposes an efficient way of predicting labels for the full point cloud using a fast implementation of K-nearest neighbours (KNN).

Benefiting from a new 2D projection, PolarNet [37] takes a different approach, using a polar Bird's-Eye-View (BEV) instead of the standard 2D grid-based BEV projections. Moreover, PolarNet encapsulates the information regarding each polar grid using PointNet, rather than using hand-crafted features, resulting in data-driven feature extraction, a nearest-neighbour-free method and a balanced grid distribution. Finally, in a more successful attempt, SalsaNext [9] makes a series of improvements to the backbone introduced in SalsaNet [1], such as a new global contextual block, an improved encoder-decoder and the Lovász-Softmax loss [4], to achieve state-of-the-art results in 3D LiDAR semantic segmentation using range-image input.
The category of large-scale 3D perception methods was kicked off by early works such as [6, 20, 23, 29, 39], in which a voxel representation was adopted to capitalize on vanilla 3D convolutions. In an attempt to process unstructured point clouds directly, PointNet [22] proposed a Multi-Layer Perceptron (MLP) to extract features from input points without any voxelization. PointNet++ [24], which is an extension of the nominal work PointNet [22], introduced sampling at different scales to extract relevant features, both local and global. Although effective for smaller point clouds, methods relying on PointNet [22] and its variations are slow in processing large-scale data.

Down-sampling is at the core of the method proposed in RandLA-Net [13]. As down-sampling removes features randomly, a local feature aggregation module is also introduced to progressively increase the receptive field for each 3D point. The two techniques are used jointly to achieve both efficiency and accuracy in large-scale point cloud semantic segmentation. In a different approach, Cylinder3D [38] uses cylindrical grids to partition the raw point cloud. To extract features, the authors in [38] introduced two new CNN blocks: an asymmetric residual block to ensure that features related to cuboid objects are preserved, and dimension-decomposition based context modeling, in which multiple low-rank contexts are merged to model a high-rank tensor suitable for 3D point cloud data.

The authors of KPConv [28] introduced a new point convolution without any intermediate steps taken in processing point clouds. In essence, KPConv is a convolution operation which takes points in a neighborhood as input and processes them with spatially located weights. Furthermore, a deformable version of this convolution operator was also introduced that learns local shifts to adapt to the point cloud geometry. Finally, MinkowskiNet [8] introduces a novel 4D sparse convolution for spatio-temporal 3D point cloud data, along with an open-source library to support auto-differentiation for sparse tensors. Overall, when we consider accuracy and efficiency, voxel-based methods such as MinkowskiNet [8] stand above others, achieving state-of-the-art results within all sub-categories of 3D semantic segmentation.
Hybrid methods, where a mixture of voxel-based, projection-based and/or point-wise operations is used to process the point cloud, have been less investigated in the past, but with the availability of more memory-efficient designs, they are becoming more successful at producing competitive results. For example, FusionNet [35] uses a voxel-based MLP, called a voxel-based mini-PointNet, which directly aggregates features from all the points in the neighborhood voxels to the target voxel. This allows FusionNet [35] to search neighborhoods with low complexity, processing large-scale point clouds with acceptable performance. In another approach, 3D-MiniNet [2] proposes a learning-based projection module to extract local and global information from the 3D data and then feeds it to a 2D FCNN in order to generate semantic segmentation predictions. In a slightly different approach, MVLidarNet [7] benefits from range-image LiDAR semantic segmentation to refine object instances in a bird's-eye-view perspective, showcasing the applicability of LiDAR semantic segmentation in real-world applications. Finally, SPVNAS [27] builds upon the Minkowski Engine [8] and designs a hybrid approach using 3D sparse convolutions and point-wise operations to achieve state-of-the-art results in LiDAR semantic segmentation. To do this, the authors of SPVNAS [27] use neural architecture search (NAS) [18] to efficiently design a network based on their novel Sparse Point-Voxel Convolution (SPVConv) operation.
3. Proposed Approach
The sparsity of outdoor-scene point clouds makes it difficult to extract spatial information compared to indoor-scene point clouds with a fixed number of points or dense image-based datasets. Therefore, it is difficult to leverage indoor-scene or image-based segmentation methods to achieve good performance on a large-scale driving scene covering a large area with non-uniform point densities. The majority of LiDAR segmentation methods attempt to either transform the 3D LiDAR point cloud into a 2D image using a projection (e.g., spherical/perspective or bird's-eye-view) or directly process the raw point clouds. The former approach abandons valuable 3D geometric structure and suffers from information loss due to the projection process. The latter approach requires heavy computation and is not feasible to deploy in constrained systems with limited resources. Recently, sparse 3D convolutions became popular due to their success on the outdoor LiDAR semantic segmentation task. However, beyond the few methods proposed in [8, 27], no advanced feature extractors have been proposed to enhance the results in a manner similar to computer vision and 2D convolutions.

To overcome this, we propose (AF)2-S3Net for LiDAR semantic segmentation, in which a baseline model of MinkNet42 [8] is transformed into an end-to-end encoder-decoder with attention blocks, achieving state-of-the-art results. In this Section we first present the proposed network architecture along with its novel components, namely AF2M and AFSM. Then, the network optimization is introduced, followed by the training details.

Let us consider a semantic segmentation task in which a LiDAR point cloud frame is given as a set of unordered points $(P, L) = \{(p_i, l_i)\}$ with $p_i \in \mathbb{R}^{d_{in}}$ and $i = 1, ..., N$, where $N$ denotes the number of points in an input point cloud scan. Each point $p_i$ contains $d_{in}$ input features, i.e., Cartesian coordinates $(x, y, z)$, intensity of the returning laser beam $(i)$, colors $(R, G, B)$, etc. Here, $l_i \in \mathbb{R}$ represents the ground truth label corresponding to each point $p_i$. In the object classification task, however, a single class label $L$ is assigned to an individual scene containing $P$ points.

Our goal is to learn a function $F_{cls}(\cdot, \Phi)$ parameterized by $\Phi$ that assigns a single class label $\hat{L}$ to all the points in the point cloud, or, in other words, $F_{seg}(\cdot, \Phi)$ that assigns a per-point label $\hat{c}_i$ to each point $p_i$. To this end, we propose (AF)2-S3Net to minimize the difference between the predicted label(s), $\hat{L}$ and $\hat{c}_i$, and the ground truth class label(s), $L$ and $l_i$, for the tasks of classification and segmentation, respectively.

The block diagram of the proposed method, (AF)2-S3Net, is illustrated in Fig. 2. (AF)2-S3Net consists of a residual-network-based backbone and two novel modules, namely the Attentive Feature Fusion Module (AF2M) and the Adaptive Feature Selection Module (AFSM). The model takes in a 3D LiDAR point cloud and transforms it into sparse tensors containing coordinates and features corresponding to each point.
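To make this data preparation concrete, the following is a minimal sketch of building such a sparse tensor with the Minkowski Engine [8]. The exact API signatures are version-dependent, the 5 cm voxel size is an illustrative assumption, and the surface normals are assumed to be precomputed by an external estimator:

```python
import numpy as np
import torch
import MinkowskiEngine as ME

def to_sparse_tensor(xyz, normals, intensity, voxel_size=0.05):
    """Build P_s = [C, F] from one LiDAR scan (illustrative sketch)."""
    # F: [N, 4] per-point features (n_x, n_y, n_z, intensity)
    feats = np.hstack([normals, intensity[:, None]]).astype(np.float32)
    # quantize continuous (x, y, z) into integer voxel coordinates
    coords, feats = ME.utils.sparse_quantize(
        coordinates=xyz, features=feats, quantization_size=voxel_size)
    coords = ME.utils.batched_coordinates([coords])  # prepend batch index
    return ME.SparseTensor(features=torch.as_tensor(feats),
                           coordinates=coords)
```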
Figure 2: Overview of (AF)2-S3Net. The top-left block is the Attentive Feature Fusion Module (AF2M), which aggregates local and global context using a weighted combination of mutually exclusive learnable masks, α, β, and γ. The top-right block illustrates how the Adaptive Feature Selection Module (AFSM) uses shared parameters to learn the inter-relationship between channels across multi-scale feature maps from AF2M. (best viewed on display)

Then, the input sparse tensor is processed by (AF)2-S3Net, which is built upon 3D sparse convolution operations suited to sparse point clouds, and effectively predicts a class label for each point given a LiDAR scan.

A sparse tensor can be expressed as $P_s = [C, F]$, where $C \in \mathbb{R}^{N \times M}$ represents the input coordinate matrix with $M$ coordinates and $F \in \mathbb{R}^{N \times K}$ denotes its corresponding feature matrix with $K$ feature dimensions. In this work, we consider the 3D coordinates of the points $(x, y, z)$ as our sparse tensor coordinates $C$, and per-point normal features $(n_x, n_y, n_z)$ along with the intensity of the returning laser beam $(i)$ as our sparse tensor features $F$. Exploiting normal features helps the model learn additional directional information; hence, the model performance can be improved by differentiating the fine details of objects. The detailed description of the network architecture is provided below.

Attentive Feature Fusion Module (AF2M): To better extract global contexts, AF2M embodies a hybrid approach covering small, medium and large kernel sizes, which focus on point-based, medium-scale voxel-based and large-scale voxel-based features, respectively. The block diagram of AF2M is depicted in Fig. 2 (top-left). Principally, the proposed AF2M fuses the features $\bar{x} = [x_1, x_2, x_3]$ at the corresponding branches using $g(\cdot)$, which is defined as

$$g(x_1, x_2, x_3) \triangleq \alpha x_1 + \beta x_2 + \gamma x_3 + \Delta \quad (1)$$

where α, β and γ are the corresponding coefficients that scale the feature columns for each point in the sparse tensor, and are processed by the function $f_{enc}(\cdot)$ as shown in Fig. 2. Moreover, the attention residual Δ is introduced to stabilize the attention layers $h_i(\cdot), \forall i \in \{1, 2, 3\}$, by adding a residual damping factor. This damping factor is the output of the residual convolution layer π, which can be formulated as

$$\pi \triangleq \mathrm{sigmoid}(\mathrm{bn}(\mathrm{conv}(f_{enc}(\cdot)))) \quad (2)$$

Finally, the output of AF2M is generated by $F(g(\cdot))$, where $F$ is used to align the sparse tensor scale space with the next convolution block. As illustrated in Fig. 2 (top-left), for each $h_i, \forall i \in \{1, 2, 3\}$, the corresponding gradient update of the weight $w_{h_i}$ can be computed as:

$$w_{h_i} = w_{h_i} - \frac{\partial J}{\partial g}\frac{\partial g}{\partial f_{enc}}\frac{\partial f_{enc}}{\partial h_i} - \frac{\partial J}{\partial g}\frac{\partial g}{\partial \pi}\frac{\partial \pi}{\partial h_i} \quad (3)$$

where $J$ is the output of $F$. Considering that $g(\cdot)$ is a linear function of the concatenated features $\bar{x}$ and Δ, we can rewrite Eq. 3 as follows:

$$w_{h_i} = w_{h_i} - \frac{\partial J}{\partial g}\,\bar{x}\,\frac{\partial f_{enc}}{\partial h_i} - \frac{\partial J}{\partial g}\frac{\partial \pi}{\partial h_i} \quad (4)$$

where $\frac{\partial f_{enc}}{\partial h_i} = S_j(\delta_{ij} - S_j)$ is the Jacobian of the softmax function $S(\bar{x}): \mathbb{R}^N \rightarrow \mathbb{R}^N$ that maps the $i$-th input feature column to the $j$-th output feature column, and δ is the Kronecker delta function, with $\delta_{i=j} = 1$ and $\delta_{i \neq j} = 0$. As shown in Eq. 5,

$$S_j(\delta_{ij} - S_j) = \begin{bmatrix} S_1(1 - S_1) & \cdots & -S_1 S_N \\ \vdots & \ddots & \vdots \\ -S_N S_1 & \cdots & S_N(1 - S_N) \end{bmatrix} \quad (5)$$

When the softmax output $S$ is close to 0, the term $\frac{\partial f_{enc}}{\partial h_i}$ approaches zero, which prompts no gradient, and when $S$ is close to 1, the gradient is close to the identity matrix.
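The Jacobian form in Eq. 5 can be verified numerically; a minimal sketch using PyTorch autograd (illustrative only):

```python
import torch

# check J_ij = S_i * (delta_ij - S_j) for the softmax, as used in Eq. 5
x = torch.randn(4, dtype=torch.float64)
S = torch.softmax(x, dim=0)
J = torch.autograd.functional.jacobian(
    lambda v: torch.softmax(v, dim=0), x)
# diagonal: S_i (1 - S_i); off-diagonal: -S_i S_j
analytic = torch.diag(S) - torch.outer(S, S)
assert torch.allclose(J, analytic, atol=1e-8)
```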
As a result, when $S \rightarrow 1$, all values in α, β or γ get very high confidence and the update of $w_{h_i}$ becomes:

$$w_{h_i} = w_{h_i} - \frac{\partial J}{\partial g}\frac{\partial \pi}{\partial h_i} + \frac{\partial J}{\partial g}\,\bar{x}\,I \quad (6)$$

and in the case of $S \rightarrow 0$, the update gradient only depends on π. Fig. 3 further illustrates the capability of the proposed AF2M and highlights the effect of each branch visually.
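A dense-tensor sketch of the AF2M fusion in Eqs. 1 and 2 is given below. The real module operates on Minkowski sparse tensors; here plain [N, C] tensors stand in for the three branch outputs, the attention layers $h_i$ are reduced to linear layers, and the exact layer composition inside $f_{enc}$ and π is an assumption based on Fig. 2:

```python
import torch
import torch.nn as nn

class AttentiveFeatureFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # h_1..h_3: per-branch attention layers, one score per point
        self.h = nn.ModuleList([nn.Linear(channels, 1) for _ in range(3)])
        # pi: residual damping branch, sigmoid(bn(conv(f_enc)))
        self.pi = nn.Sequential(
            nn.Linear(3 * channels, channels),
            nn.BatchNorm1d(channels),
            nn.Sigmoid())

    def forward(self, x1, x2, x3):
        x_bar = torch.cat([x1, x2, x3], dim=1)              # f_enc input, [N, 3C]
        scores = torch.cat([h(x) for h, x in
                            zip(self.h, (x1, x2, x3))], dim=1)  # [N, 3]
        # softmax over the three branches: mutually exclusive masks
        alpha, beta, gamma = torch.softmax(scores, dim=1).unbind(dim=1)
        delta = self.pi(x_bar)                              # attention residual
        return (alpha[:, None] * x1 + beta[:, None] * x2
                + gamma[:, None] * x3 + delta)              # Eq. 1
```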
Figure 3: Illustration of Attentive Feature Fusion and the spatial geometry of the point cloud. The three labels $\alpha x_1$, $\beta x_2$, and $\gamma x_3$ represent the branches in the AF2M encoder block. The first branch, $\alpha x_1$, learns to capture and emphasize the fine details of smaller instances such as person, pole and traffic-sign across driving scenes with varying point densities. The shallower branches, $\beta x_2$ and $\gamma x_3$, learn different attention maps that focus on global contexts embodied in larger instances such as vegetation, sidewalk and road surface. (best viewed on display)

Adaptive Feature Selection Module (AFSM): The block diagram of AFSM is shown in Fig. 2 (top-right). In the AFSM decoder block, the feature maps from the multiple branches in AF2M, $x_1$, $x_2$, and $x_3$, are further processed by residual convolution units. The resulting outputs, $x'_1$, $x'_2$, and $x'_3$, are concatenated, shown as $f_{dec}$, and passed into a shared squeeze re-weighting network [12] in which the different feature maps are voted on. This module acts like an adaptive dropout that intentionally filters out feature maps that do not contribute to the final results. Instead of directly passing the weighted feature maps through as output, we employ a damping factor θ to regularize the weighting effect. It is worth noting that the skip connection connecting the attentive feature fusion module branches to the last decoder block ensures that the error gradient propagates back to the encoder branches for better learning stability.

We leverage a linear combination of the geo-aware anisotropic loss [17], the Exponential-log loss [30] and the Lovász loss [4] to optimize our network. In particular, the geo-aware anisotropic loss is beneficial for recovering fine details in a LiDAR scene. Moreover, the Exponential-log loss [30] is used to further improve segmentation performance by focusing on both small and large structures given a highly unbalanced dataset.

The geo-aware anisotropic loss can be computed by

$$L_{geo}(y, \hat{y}) = -\sum_{i,j,k}^{N} \sum_{c=1}^{C} \frac{M_{LGA}}{\Psi}\, y_{ijk,c} \log \hat{y}_{ijk,c} \quad (7)$$

where $y$ and $\hat{y}$ are the ground truth and predicted labels. Parameter $N$ is the local tensor neighborhood, which we empirically set as a cube of 10 voxels per side in our experiments. Parameter $C$ is the number of semantic classes, and $M_{LGA} = \sum_{\psi=1}^{\Psi}(c_p \oplus c_{q_\psi})$ is defined in [17]. We normalize the local geometric anisotropy within the sliding window $\Psi$ of the current voxel cell $p$ and its neighboring voxel grid $q_\psi \in \Psi$.

Therefore, the total loss used to train the proposed network is given by

$$L_{tot}(y, \hat{y}) = w_1 L_{exp}(y, \hat{y}) + w_2 L_{geo}(y, \hat{y}) + w_3 L_{lov}(y, \hat{y}) \quad (8)$$

where $w_1$, $w_2$, and $w_3$ denote the weights of the Exponential-log loss [30], the geo-aware anisotropic loss, and the Lovász loss, respectively. They are set to 1, 1.5 and 1.5 in our experiments.
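A minimal sketch of the total objective in Eq. 8, with the weights reported above; the three component losses are assumed to be available as callables implementing [30], [17] and [4] (hypothetical names):

```python
def total_loss(logits, labels, exp_log_loss, geo_aware_loss, lovasz_loss,
               w1=1.0, w2=1.5, w3=1.5):
    """L_tot = w1*L_exp + w2*L_geo + w3*L_lov (Eq. 8)."""
    return (w1 * exp_log_loss(logits, labels)
            + w2 * geo_aware_loss(logits, labels)
            + w3 * lovasz_loss(logits, labels))
```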
4. Experimental Results
We base our experimental results on three different datasets, namely SemanticKITTI, nuScenes and ModelNet40, to show the applicability of the proposed method in different scenes and domains. For SemanticKITTI and ModelNet40, (AF)2-S3Net is compared to the previous state-of-the-art, but due to the recently announced challenge for the nuScenes-lidarseg dataset [5], we provide our own evaluation results against the baseline model. To evaluate the performance of the proposed method and compare it with others, we leverage mean Intersection over Union (mIoU) as our evaluation metric. mIoU is the most popular metric for evaluating semantic point cloud segmentation and can be formalized as $mIoU = \frac{1}{n}\sum_{c=1}^{n} \frac{TP_c}{TP_c + FP_c + FN_c}$, where $TP_c$ is the number of true positive points for class $c$, $FP_c$ is the number of false positives, and $FN_c$ is the number of false negatives.
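A minimal sketch of this metric, computed from a confusion matrix (per-class IoU followed by the mean):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """pred, gt: integer class labels per point, shape [N]."""
    # rows: ground truth class, cols: predicted class
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)  # guard against empty classes
    return iou.mean()
```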
Figure 4: Compared to SalsaNext and MinkNet42, our method has a lower error (shown in red) in recognizing region surfaces and smaller objects on the nuScenes validation set, thanks to the proposed attention modules.

As for the training parameters, we trained our model with an SGD optimizer with momentum and weight decay for a fixed number of epochs. The experiments are conducted using Nvidia V100 GPUs.
In this Section, we provide a quantitative evaluation of (AF)2-S3Net on two outdoor large-scale public datasets, SemanticKITTI [3] and the nuScenes-lidarseg dataset [5], for the semantic segmentation task, and on ModelNet40 [33] for the classification task.

SemanticKITTI dataset: we conduct our experiments on the SemanticKITTI [3] dataset, the largest dataset for autonomous vehicle LiDAR segmentation. This dataset is based on the KITTI dataset introduced in [11], containing 43,552 total frames captured in 22 sequences. We list our experiments along with all other published works in Table 1. As shown in Table 1, our method achieves state-of-the-art performance on the SemanticKITTI test set in terms of mean IoU. With our proposed method, (AF)2-S3Net, we see a clear improvement over the second best method [27] and over the baseline model (MinkNet42 [8]). Our method dominates greatly in classifying small objects such as bicycle, person and motorcycle, making it a reliable solution for understanding complex scenes. It is worth noting that (AF)2-S3Net only uses the voxelized data as input, whereas competing methods like SPVNAS [27] use both voxelized data and point-wise features.

nuScenes dataset: to prove the generalizability of our proposed method, we trained our network on the nuScenes-lidarseg dataset [5], one of the recently available large-scale datasets that provides point-level labels for LiDAR point clouds. It consists of 1000 driving scenes from various locations in Boston and Singapore, providing a rich set of labeled data to advance self-driving car technology. Among these scenes, 850 are reserved for training and validation, and the remaining 150 scenes for testing. The labels are, to some extent, similar to the SemanticKITTI dataset [3], making it a new challenge to propose methods that can handle both datasets well, given the different sensor setups and environments in which the datasets were recorded. In Table 2, we compare our proposed method with the MinkNet42 [8] baseline and the projection-based method SalsaNext [9]. The results in Table 2 show that our proposed method can handle the small objects in the nuScenes dataset and indicate a large-margin improvement over the competing methods. Considering the large difference between the two public datasets, this shows that our work generalizes well.
ModelNet40: to expand and evaluate the capabilities of the proposed method in different applications, ModelNet40, a 3D object classification dataset [33], is adopted for evaluation. ModelNet40 contains 12,311 meshed CAD models from 40 object categories. From all the samples, 9,843 models are used for training and 2,468 models for testing. To evaluate our method against the existing state-of-the-art, we compare (AF)2-S3Net with techniques in which a single input (e.g., single view, sampled point cloud, voxel) has been used to train and evaluate the models. To make (AF)2-S3Net compatible with the task of classification, the decoder part of the network is removed and the output of the encoder is directly reshaped to the number of classes in the ModelNet40 dataset. Moreover, the model is trained only using cross-entropy loss. Table 3 presents the overall classification accuracy results for our proposed method and the previous state-of-the-art. With the introduction of AF2M in our network, we achieve performance similar to the point-based methods, which leverage fine-grained local features.

Table 1: Segmentation IoU (%) results on the SemanticKITTI [3] test dataset. Methods compared: S-BKI [10], RangeNet++ [21], LatticeNet [26], RandLA-Net [13], PolarNet [37], MinkNet42 [8], SqueezeSegV3 [34], KPConv [28], SalsaNext [9], FusionNet [35], KPRNet [15], SPVNAS [27], and (AF)2-S3Net [Ours]; columns report mean IoU and per-class IoU for car, bicycle, motorcycle, truck, other-vehicle, person, bicyclist, motorcyclist, road, parking, sidewalk, other-ground, building, fence, vegetation, trunk, terrain, pole, and traffic-sign.
Table 2: Segmentation IoU (%) results on the nuScenes-lidarseg [5] validation dataset for SalsaNext [9], MinkNet42 [8], and (AF)2-S3Net [Ours]; columns report frequency-weighted IoU, mean IoU and per-class IoU for barrier, bicycle, bus, car, construction vehicle, motorcycle, pedestrian, traffic cone, trailer, truck, driveable surface, other flat ground, sidewalk, terrain, manmade, and vegetation. Frequency-Weighted IoU denotes that each IoU is weighted by the point-level frequency of its class.
Figure 5: Reference image (top), prediction (bottom-right) and attention map (bottom-left) on the SemanticKITTI test set. Color codes are: road | side-walk | parking | car | bicyclist | pole | vegetation | terrain | trunk | building | other-structure | other-object.

In this section, we visualize the attention maps in AF2M by projecting the scaled feature maps back onto the original point cloud. Moreover, to better present the improvements made over the baseline model MinkNet42 [8] and SalsaNext [9], we provide error maps which highlight the superior performance of our method. As shown in Fig. 5, our method is capable of capturing fine details in a scene. To demonstrate this, we train (AF)2-S3Net on SemanticKITTI as explained above and visualize a test frame. In Fig. 5 we highlight the points with the top 5% feature norm from each of the scaled feature maps $\alpha x_1$, $\beta x_2$ and $\gamma x_3$ with cyan, orange and green colors, respectively. It can be observed that our model learns to put its attention on small instances (i.e., person, pole, bicycle, etc.) as well as larger instances (i.e., car, region boundaries, etc.).
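This attention visualization reduces to a per-point feature-norm threshold; a minimal sketch (with illustrative names, `feats` being one branch's scaled feature map projected back onto the N points):

```python
import numpy as np

def top_norm_mask(feats, top_fraction=0.05):
    """Return a boolean mask of points in the top 5% by feature norm."""
    norms = np.linalg.norm(feats, axis=1)                # per-point L2 norm
    threshold = np.quantile(norms, 1.0 - top_fraction)   # top-5% cutoff
    return norms >= threshold                            # mask of shape [N]
```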
Fig. 4 shows some qualitative results on the SemanticKITTI (top) and nuScenes (bottom) benchmarks. It can be observed that the proposed method surpasses the baseline (MinkNet42 [8]) and range-based SalsaNext [9] by a large margin; both fail to capture fine details such as cars and vegetation.

To show the effectiveness of the proposed attention mechanisms, namely AF2M and AFSM introduced in Section 3, along with other design choices such as the loss functions, this section is dedicated to a thorough ablation study starting from our baseline model introduced in [8]. The baseline is MinkNet42, a semantic segmentation residual network for 3D sparse data. To start off with a well-trained baseline, we use the Exponential Logarithmic Loss [30] to train the model, which results in 59.8 mIoU on the SemanticKITTI validation set.

Method              | Input       | Main operator       | Overall Accuracy (%)
VoxNet [20]         | voxels      | 3D Operation        | 83.00
Mink-ResNet50 [8]   | voxels      | Sparse 3D Operation | 85.30
PointNet [22]       | point cloud | Point-wise MLP      | 89.20
PointNet++ [24]     | point cloud | Local feature       | 90.70
DGCNN [36] (1 vote) | point cloud | Local feature       | 91.84
GGM-Net [16]        | point cloud | Local feature       | 92.60
RS-CNN [19]         | point cloud | Local feature       |
Ours (AF2M)         | voxels      | Sparse 3D Operation | 93.16

Table 3: Classification accuracy results on the ModelNet40 dataset [33].

Next, we add our proposed AF2M to the baseline model to help the model extract richer features from the raw data. This addition of AF2M improves the mIoU over the baseline. In our second study, to show the effectiveness of AFSM alone, we first reduce the AF2M block to only output {x1, x2, x3} (see Fig. 2 for reference), and then add AFSM to the model. Adding AFSM also shows an increase in mIoU over the baseline. In the last step of improving the model, we combine AF2M and AFSM together as shown in Fig. 2, which results in a further mIoU increase over the baseline model. Finally, in our last two experiments, we study the effect of our loss function by adding the Lovász loss and the combination of the Lovász and geo-aware anisotropic losses, each of which further improves the mIoU. The ablation studies presented show a series of adequate steps in the design of (AF)2-S3Net, proving that the steps taken in the design of the proposed model are effective and can be used separately in other network models to improve accuracy.

Architecture | AF2M | AFSM | Lovász | Lovász + Geo | mIoU
Baseline     |      |      |        |              | 59.8
Proposed     | ✓    |      |        |              |
Proposed     |      | ✓    |        |              |
Proposed     | ✓    | ✓    |        |              |
Proposed     | ✓    | ✓    | ✓      |              |
Proposed     | ✓    | ✓    |        | ✓            |
Table 4: Ablation study of the proposed method vs baselineevaluated on SemanticKITTI [3] validation dataset (seq 08).
In this section, we investigate how segmentation is affected by the distance of the points to the ego-vehicle. In order to show the improvements, we follow our ablation study and compare (AF)2-S3Net and the baseline (MinkNet42) on the SemanticKITTI validation set (seq 08). Fig. 6 illustrates the mIoU of (AF)2-S3Net as opposed to the baseline and SalsaNext w.r.t. the distance to the ego-vehicle's LiDAR sensor. The results of all methods get worse with increasing distance, due to the fact that point clouds generated by LiDAR are relatively sparse, especially at large distances. However, the proposed method produces better results at all distances, making it an effective method to be deployed on autonomous systems. It is worth noting that, while the baseline methods attempt to alleviate the sparsity problem of point clouds by using sparse convolutions in a residual-style network, they lack the necessary encapsulation of features proposed in Section 3 to robustly predict the semantics.

Figure 6: mIoU vs. distance for (AF)2-S3Net vs. the baseline (MinkNet42) and SalsaNext, over ranges 0-10, 10-20, 20-30, 30-40, 40-50, 50-60 and 60+ m.
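The distance-binned evaluation behind Fig. 6 can be sketched as follows, reusing the `mean_iou` helper from earlier in this section; the Euclidean range and the 10 m bin edges follow the figure, and names are illustrative:

```python
import numpy as np

def miou_by_range(xyz, pred, gt, num_classes,
                  edges=(0, 10, 20, 30, 40, 50, 60, np.inf)):
    """Per-bin mIoU over distance to the sensor origin (assumed at [0,0,0])."""
    dist = np.linalg.norm(xyz, axis=1)  # Euclidean range per point
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dist >= lo) & (dist < hi)
        if mask.any():
            results.append(mean_iou(pred[mask], gt[mask], num_classes))
        else:
            results.append(float('nan'))  # no points in this range bin
    return results
```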
5. Conclusion
In this paper, we presented an end-to-end CNN model to address the problem of semantic segmentation and classification of 3D LiDAR point clouds. We proposed (AF)2-S3Net, a 3D sparse convolution based network with two novel attention blocks, called the Attentive Feature Fusion Module (AF2M) and the Adaptive Feature Selection Module (AFSM), to effectively learn local and global contexts and emphasize the fine detailed information in a given LiDAR point cloud. Extensive experiments on several benchmarks, SemanticKITTI, nuScenes-lidarseg, and ModelNet40, demonstrated the ability of our proposed model to capture local details and its state-of-the-art performance. Future work will include the extension of our method to end-to-end 3D instance segmentation and object detection on large-scale LiDAR point clouds.

References
[1] Eren Erdal Aksoy, Saimir Baci, and Selcuk Cavdar. SalsaNet: Fast road and vehicle segmentation in lidar point clouds for autonomous driving. In IEEE Intelligent Vehicles Symposium (IV2020), 2020.
[2] Iñigo Alonso, Luis Riazuelo, Luis Montesano, and Ana C. Murillo. 3D-MiniNet: Learning a 2D representation from point clouds for fast and efficient 3D lidar semantic segmentation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020.
[3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jürgen Gall. SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9296-9306. IEEE, 2019.
[4] Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4413-4421, 2018.
[5] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621-11631, 2020.
[6] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015.
[7] Ke Chen, Ryan Oldja, Nikolai Smolyanskiy, Stan Birchfield, Alexander Popov, David Wehr, Ibrahim Eden, and Joachim Pehserl. MVLidarNet: Real-time multi-class scene understanding for autonomous driving using multiple views. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2288-2294, 2020.
[8] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075-3084, 2019.
[9] Tiago Cortinhal, George Tzelepis, and Eren Erdal Aksoy. SalsaNext: Fast semantic segmentation of lidar point clouds for autonomous driving, pages 655-661, 2020.
[10] Lu Gan, Ray Zhang, Jessy W. Grizzle, Ryan M. Eustice, and Maani Ghaffari. Bayesian spatial kernel smoothing for scalable dense semantic mapping. IEEE Robotics and Automation Letters, 5(2):790-797, 2020.
[11] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3354-3361, 2012.
[12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132-7141, 2018.
[13] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11108-11117, 2020.
[14] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[15] Deyvid Kochanov, Fatemeh Karimi Nejadasl, and Olaf Booij. KPRNet: Improving projection-based lidar semantic segmentation. ECCV Workshop, 2020.
[16] Dilong Li, Xin Shen, Yongtao Yu, Haiyan Guan, Hanyun Wang, and Deren Li. GGM-Net: Graph geometric moments convolution neural network for point cloud shape classification. IEEE Access, 8:124989-124998, 2020.
[17] Jie Li, Yu Liu, Xia Yuan, Chunxia Zhao, Roland Siegwart, Ian Reid, and Cesar Cadena. Depth based semantic scene completion with position importance aware loss. IEEE Robotics and Automation Letters, 5(1):219-226, 2019.
[18] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19-34, 2018.
[19] Yongcheng Liu, Bin Fan, Shiming Xiang, and Chunhong Pan. Relation-shape convolutional neural network for point cloud analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8895-8904, 2019.
[20] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922-928. IEEE, 2015.
[21] Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. RangeNet++: Fast and accurate lidar semantic segmentation. In Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2019.
[22] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652-660, 2017.
[23] Charles R. Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648-5656, 2016.
[24] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099-5108, 2017.
[25] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[26] Radu Alexandru Rosu, Peer Schütt, Jan Quenzel, and Sven Behnke. LatticeNet: Fast point cloud segmentation using permutohedral lattices. Robotics: Science and Systems (RSS), 2020.
[27] Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching efficient 3D architectures with sparse point-voxel convolution. In European Conference on Computer Vision, 2020.
[28] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J. Guibas. KPConv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision, pages 6411-6420, 2019.
[29] Zongji Wang and Feng Lu. VoxSegNet: Volumetric CNNs for semantic part segmentation of 3D shapes. IEEE Transactions on Visualization and Computer Graphics, 2019.
[30] Ken C. L. Wong, Mehdi Moradi, Hui Tang, and Tanveer Syeda-Mahmood. 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 612-619. Springer, 2018.
[31] Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer. SqueezeSeg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3D lidar point cloud. In IEEE International Conference on Robotics and Automation (ICRA), pages 1887-1893. IEEE, 2018.
[32] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer. SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In IEEE International Conference on Robotics and Automation (ICRA), pages 4376-4382. IEEE, 2019.
[33] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912-1920, 2015.
[34] Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, and Masayoshi Tomizuka. SqueezeSegV3: Spatially-adaptive convolution for efficient point-cloud segmentation, 2020.
[35] Feihu Zhang, Jin Fang, Benjamin Wah, and Philip Torr. Deep FusionNet for point cloud semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[36] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[37] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh. PolarNet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9601-9610, 2020.
[38] Hui Zhou, Xinge Zhu, Xiao Song, Yuexin Ma, Zhe Wang, Hongsheng Li, and Dahua Lin. Cylinder3D: An effective 3D framework for driving-scene lidar semantic segmentation. arXiv preprint arXiv:2008.01550, 2020.
[39] Yin Zhou and Oncel Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.