Self-Prediction for Joint Instance and Semantic Segmentation of Point Clouds
Jinxian Liu*, Minghui Yu*, Bingbing Ni†, and Ye Chen
Shanghai Jiao Tong University {liujinxian,1475265722,nibingbing,chenye123}@sjtu.edu.cn
Huawei Hisilicon {liujinxian1,yuminghui2,nibingbing,chenye17}@hisilicon.com

Abstract.
We develop a novel learning scheme named Self-Prediction for 3D instance and semantic segmentation of point clouds. Distinct from most existing methods that focus on designing convolutional operators, our method designs a new learning scheme to enhance point relation exploring for better segmentation. More specifically, we divide a point cloud sample into two subsets and construct a complete graph based on their representations. Then we use a label propagation algorithm to predict labels of one subset when given labels of the other subset. By training with this Self-Prediction task, the backbone network is constrained to fully explore relational context/geometric/shape information and learn more discriminative features for segmentation. Moreover, a general associated framework equipped with our Self-Prediction scheme is designed for enhancing instance and semantic segmentation simultaneously, where instance and semantic representations are combined to perform Self-Prediction. In this way, instance and semantic segmentation are collaborated and mutually reinforced. Significant performance improvements on instance and semantic segmentation over baselines are achieved on S3DIS and ShapeNet. Our method achieves state-of-the-art instance segmentation results on S3DIS and comparable semantic segmentation results with state-of-the-arts on S3DIS and ShapeNet when only taking PointNet++ as the backbone network.
Keywords:
Self-Prediction, Instance Segmentation, Semantic Segmentation, Point Cloud, State-of-the-art, S3DIS, ShapeNet
1 Introduction

With the growing popularity of low-cost 3D sensors, e.g., LiDAR and RGB-D cameras, 3D scene understanding has been in tremendous demand recently due to its great application value in autonomous driving, robotics, augmented reality, etc. 3D data provides rich information about the environment; however, it is hard for traditional convolutional neural networks (CNNs) to process this irregular data. Fortunately, many ingenious works [20,21,33,24,36,23,5,10,14,8,42,13,15] have been proposed to directly process point clouds, which are the simplest 3D data format. This motivates us to work with 3D point clouds.

(* Equal contribution. This work was done during the authors' internships at Huawei Hisilicon. † Corresponding author: Bingbing Ni.)

The key to better understanding a 3D scene is to learn more discriminative point representations. To this end, many works [14,33,27,37,8] elaborately design various point convolution operators to capture semantic or geometric relations among points. DGCNN [33] proposes to construct a KNN graph and an operator named EdgeConv to process this graph, where semantic relation among points is explicitly modeled. RelationShape [14] attempts to model geometric point relations in local areas, hence local shape information is captured. Other methods also share a similar design philosophy. Although good segmentation performance is achieved by explicitly modeling point relations, the lack of constraint/guidance on relation exploring limits the network from reaching its full potential. Hence a constraint is urgently needed to enforce/guide/encourage this relation exploring and help the network learn more representative features.

3D instance and semantic segmentation are two of the most important tasks in 3D scene understanding. Many works [25,3,1,9,40,41,38] tackle these two tasks separately. Some works [17,19] address them in a serial fashion, where instance segmentation is usually formulated as a post-processing step after semantic segmentation. However, this formulation often yields a sub-optimal solution, since the performance of instance segmentation highly depends on that of semantic segmentation. Actually, these two tasks can be associated and cooperate with each other, as demonstrated in ASIS [32] and JSIS3D [18], which propose to couple the two tasks in a parallel fashion.
ASIS makes instance segmentation benefit from semantic segmentation through learning semantic-aware instance embeddings. Semantic features of the points belonging to the same instance are fused to make more accurate semantic predictions. However, extra parameters and computation burden are introduced during inference. JSIS3D combines these two tasks in a simple way: it formulates them as a simple multi-task problem and just trains the two tasks simultaneously. A multi-value conditional random field model is proposed to jointly optimize class labels and object instances. However, it is a time-consuming post-processing scheme and cannot be optimized end-to-end. Moreover, the performance improvements achieved by ASIS and JSIS3D are both limited.

To address these issues, we propose a novel learning scheme named Self-Prediction to constrain the network to fully capture point relations, and a unified framework equipped with this scheme to associate instance and semantic segmentation. The framework of our method is shown in Figure 1, which contains a backbone network and three heads, named instance-head, semantic-head and Self-Prediction head respectively. The instance-head learns instance embeddings for instance clustering, and the semantic-head outputs semantic embeddings for semantic prediction. In the Self-Prediction head, the instance and semantic embeddings of each point are combined. We then concatenate semantic and instance labels to form a multi-label for every point. After that, we divide the point cloud into two groups, with one group's labels being discarded. Given the combined embeddings of the whole point cloud and the labels of one group, we construct a complete graph and then predict semantic and instance labels simultaneously for the other group using a label propagation algorithm. Note that bidirectional propagation between the two groups is performed. Through this procedure of multi-label Self-Prediction, the instance and semantic embeddings are associatively enhanced. The process of Self-Prediction incorporates the embedding similarity of points, which enforces the network to explore effective relations among points and learn more discriminative representations. The three heads are jointly optimized at training time. During inference, our Self-Prediction head is discarded, so no computation burden or network parameters are introduced. Our framework is demonstrated to be general and effective on different backbone networks such as PointNet, PointNet++, etc. Significant performance improvements over baselines are achieved on both instance and semantic segmentation. By only taking PointNet++ as the backbone, our method achieves state-of-the-art instance segmentation results and comparable semantic segmentation results compared with state-of-the-art networks.

2 Related Work
Instance Segmentation in 3D Point Clouds.
A pioneering work for instance segmentation in 3D point clouds can be found in [31], which uses a similarity matrix to yield proposals, followed by a confidence map for pruning proposals, and utilizes a semantic map for assigning labels. ASIS [32] proposes to associate instance segmentation and semantic segmentation to achieve semantic awareness for instance segmentation. JSIS3D [18] introduces a multi-value CRF model to jointly optimize class labels and object instances. However, their performance is quite limited. Encouraged by the success of RPN and RoI, GSPN [41] generates proposals by reconstructing shapes and proposes a Region-based PointNet to get final segmentation results. 3D-SIS [4] is also a proposal-based method. However, proposal-based methods are usually two-stage and need proposal pruning. 3D-BoNet [38] directly predicts point-level masks for instances within detected object boundaries. It is single-stage, anchor-free and computationally efficient. However, it is limited in adapting to different types of input point clouds. In this work, we propose a unified framework equipped with an efficient learning scheme to simultaneously improve instance and semantic segmentation significantly.
Semantic Segmentation in 3D Point Clouds.
PointNet [20] is the first to directly consume raw point clouds; it processes each point identically and independently and then aggregates them through global max pooling. It well respects the order invariance of the input data and achieves strong performance. PointNet++ [21] applies PointNet in a recursive way to learn local features with increasing contextual scales, thus achieving both robustness and detailed features. Attention mechanisms [36,39,29,43] have also been applied to aggregate local features effectively. RSNet [6] proposes a lightweight local dependency module to efficiently model local structures in point clouds, which is composed of slice pooling layers, RNN layers and slice unpooling layers. SPLATNET [24] utilizes sparse bilateral convolutional layers to maintain efficiency and flexibility. PointCNN [10] explores an X-transformation to promote both the weighting of input point features and the permutation of points into a latent and potentially canonical order. Graph convolutions [23,9,27] are also proposed for improving the semantic segmentation task. SPG [9] adapts a graph convolutional network on compact but rich representations of the contextual relationship between object parts. SEGCloud [25] combines the advantages of neural networks and conditional random fields to get coarse-to-fine semantics on points. DGCNN [33] tries to capture local geometric structures by a new operation named EdgeConv, which generates edge features describing the relation between a point and its neighbors. [7] shares the same idea of edge features and constructs an edge branch to hierarchically integrate point features and edge features. Different from these methods, PointConv [34] proposes a density re-weighted convolution which can closely approximate 3D continuous convolution on 3D point sets. KPConv [26] uses kernel points located in Euclidean space to define the area where each kernel weight is applied, which well models local geometry. Our method can take most of these models as the backbone network and achieve better segmentation performance.

Fig. 1. The overall framework of our method. The input point cloud goes through a backbone network to extract instance and semantic features for instance and semantic segmentation respectively. These two features are then combined to construct a complete graph to perform bidirectional Self-Prediction in the Self-Prediction head.

Label Propagation Algorithms.
Label propagation is derived from semi-supervised learning. [35] is an early attempt to address this problem, where the labeled data act as sources that push out labels through unlabeled data; it was then developed by [45], which introduces a consistency assumption to guide label propagation. It is mentioned in [28] that the scale parameter σ affects performance significantly; to address this issue, LNP uses overlapped linear neighborhood patches to approximate the whole graph. How to automatically learn the optimal σ is worth exploring. [45] proposes to learn the parameter σ by a minimum spanning tree heuristic and entropy minimization. Label propagation algorithms are designed to enhance models with unlabeled samples. Motivated by our intention of enhancing point relation exploring, we design a new learning scheme that constrains the results predicted by the label propagation algorithm to be identical to their ground truth.
3 Method

We design a novel learning scheme named Self-Prediction to strengthen our backbone networks and learn more discriminative representations for better segmentation. Within a point cloud sample, this proposed scheme encourages the network to capture more effective relations between points by predicting the labels of a part of the points when given the labels of the remaining points and the embeddings of all points. Equipped with Self-Prediction, a unified framework is proposed to combine instance and semantic segmentation and conduct these two tasks in a mutually reinforced fashion. The overall framework of our method is shown in Figure 1. In this section, we first introduce our proposed general Self-Prediction scheme. Then we present how to use this scheme to conduct instance and semantic segmentation jointly, and describe the overall framework. Finally, we summarize the global optimization objectives of our method.
3.1 Self-Prediction

Self-Prediction is an auxiliary task parallel to the instance and semantic segmentation tasks, and is designed to enforce backbone networks to learn stronger and more discriminative representations. To get better segmentation performance, many existing works [33,43,27,26] elaborately design convolution operators to capture the relational, geometric and shape information contained in point clouds. Our common goal is to learn more discriminative representations; however, we take a new perspective. We argue that if the learned representations can be utilized to predict the instance/semantic labels of one part of a point cloud when given the labels of the remaining points, they can be considered to have fully exploited the relation information and to be representative enough. Hence we naturally formulate a Self-Prediction task: equally divide a point cloud into two groups, and then perform bidirectional prediction between the two groups given their representations. By constraining the network to perform well on the Self-Prediction task, we get stronger features and perform better on the specific tasks, i.e., instance and semantic segmentation.

Given a point cloud example that contains N points X = {x_1, x_2, ..., x_N}, each point x_i ∈ R^h can be represented by coordinates, color, normal, etc., where h is the feature dimension of an input point. For each point x_i, its class label is represented by a one-hot vector. We formulate a label matrix Y ∈ 𝒴, where the i-th row of Y denotes the one-hot label of point x_i and 𝒴 denotes the set of N × C matrices (C is the number of classes) with non-negative elements. We equally divide the point cloud into two groups, i.e., X_S = {x_1, x_2, ..., x_M} with label matrix Y_{1:M} and X_U = {x_{M+1}, x_{M+2}, ..., x_N} with label matrix Y_{M+1:N}. We use a label propagation algorithm to perform bidirectional Self-Prediction between the point subsets X_S and X_U, i.e., propagating labels from X_S to X_U and from X_U to X_S inversely.
First, we construct a complete graph W ∈ R^{N×N}, each element of which is defined by the Gaussian similarity function:

    W_{ij} = exp( −d(φ(x_i), φ(x_j))² / (2σ²) ),    (1)

where φ is the backbone network and φ(x_i) denotes the extracted features of point x_i, d is the Euclidean distance function, and σ is the length-scale parameter used to adjust the strength of neighbors. We set σ to 1 in all our experiments. Then we normalize the constructed graph by computing the Laplacian matrix:

    L = D^{−1/2} W D^{−1/2},    (2)

where D is a diagonal matrix with D_{ii} being the sum of the i-th row of W, i.e., D_{ii} = Σ_{j=1}^{N} W_{ij}. To predict the labels of X_U given the labels of X_S, and the labels of X_S given the labels of X_U, we prepare two initial label matrices S^0 and U^0 by padding Y_{1:M} and Y_{M+1:N} with zero vectors correspondingly. Specifically, S^0 and U^0 are represented by:

    S^0 = [Y_1^T, ..., Y_M^T, 0^T, ..., 0^T]^T,   U^0 = [0^T, ..., 0^T, Y_{M+1}^T, ..., Y_N^T]^T,    (3)

where Y_i denotes the i-th row of the label matrix Y. The Self-Prediction procedure is conducted by the label propagation algorithm, the iterative version of which is:

    S(t+1) = αLS(t) + (1−α)S^0,   U(t+1) = αLU(t) + (1−α)U^0,    (4)

where α is a parameter that controls the propagation proportion, i.e., how much the initial label matrix affects the propagated results. Following the common setting [12], we set α to 0.99 in all our experiments. S(t) ∈ 𝒴 and U(t) ∈ 𝒴 are the t-th iteration results. We obtain the final results S* and U* by iterating Equation 4 until convergence. In practice, we directly use the closed form of the above iteration proposed in [45] to get the propagated/predicted results:

    S* = (I − αL)^{−1} S^0,   U* = (I − αL)^{−1} U^0,    (5)

where I ∈ R^{N×N} is the identity matrix. Note that S*_{M+1:N} and U*_{1:M} are the valid propagated results. We predict the label of x_i by argmax U*_i when 1 ≤ i ≤ M and by argmax S*_i when M < i ≤ N. We formulate the final self-predicted results Y* ∈ 𝒴 as:

    Y* = [U*^T_{1:M}, S*^T_{M+1:N}]^T.    (6)

Finally, we use the ground-truth label matrix Y as the supervision signal to train this Self-Prediction task.
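To make the procedure concrete, the following is a minimal NumPy sketch of Equations (1)-(5): build the Gaussian similarity graph, normalize it, and solve the closed-form propagation. All function and variable names are illustrative; this is a didactic sketch for a small point set, not the paper's implementation (which operates on the learned joint embeddings during training).

```python
import numpy as np

def self_predict(features, labels_onehot, known_mask, sigma=1.0, alpha=0.99):
    """Closed-form label propagation: predict labels of unknown points
    from known points over a complete similarity graph (Eqs. 1-5)."""
    n = features.shape[0]
    # Pairwise squared Euclidean distances between point embeddings.
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))        # Gaussian similarity, Eq. (1)
    np.fill_diagonal(w, 0.0)                    # no self-loops
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    lap = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]  # D^{-1/2} W D^{-1/2}, Eq. (2)
    # Initial label matrix: known rows keep their one-hot labels, rest are zero.
    y0 = labels_onehot * known_mask[:, None]    # Eq. (3)
    # Closed form of the iteration in Eq. (4); the (1-alpha) factor is a
    # positive scalar and does not change the argmax, so it is omitted.
    scores = np.linalg.solve(np.eye(n) - alpha * lap, y0)   # Eq. (5)
    return scores.argmax(axis=1)
```

Bidirectional Self-Prediction amounts to calling this twice, once with each half of the points marked as known.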
3.2 Unified Framework

As shown in Figure 1, our proposed framework contains one backbone network and three heads. The backbone network can be almost any existing point cloud learning architecture; we take PointNet and PointNet++ as examples in our work. On top of the backbone network, three heads perform the instance segmentation, semantic segmentation and Self-Prediction tasks respectively.

Taking a point cloud X as input, the backbone network outputs a feature matrix F ∈ R^{N×H}, where H denotes the dimension of the output features. The instance-head takes F as input and transforms it into point-wise instance embeddings F_ins ∈ R^{N×H_ins}, where H_ins is the dimension of the instance embeddings and is set to 32 in all our experiments. We adopt the same discriminative loss function as [32] and [18] to supervise instance segmentation. If a point cloud example contains K instances and the k-th (k ∈ 1, 2, ..., K) instance contains N_k points, we denote e_j ∈ R^{H_ins} as the instance embedding of the j-th point and μ_k ∈ R^{H_ins} as the mean embedding of the k-th instance. The instance loss is written as:

    L_var = (1/K) Σ_{k=1}^{K} (1/N_k) Σ_{j=1}^{N_k} [ ||μ_k − e_j|| − δ_v ]_+²,    (7)

    L_dist = (1/(K(K−1))) Σ_{k=1}^{K} Σ_{m=1, m≠k}^{K} [ 2δ_d − ||μ_k − μ_m|| ]_+²,    (8)

    L_reg = (1/K) Σ_{k=1}^{K} ||μ_k||,    (9)

    L_ins = L_var + L_dist + 0.001 · L_reg,    (10)

where [x]_+ = max(0, x), and δ_v and δ_d are margins for L_var and L_dist respectively. Instance labels are obtained by conducting mean-shift clustering [2] on the instance embeddings during inference.

The semantic-head takes the feature matrix F as input and learns a semantic embedding matrix F_sem ∈ R^{N×H_sem} to further perform point-wise classification supervised by a cross-entropy loss. H_sem is the dimension of the point semantic embedding and is set to 128 in all our experiments.

In the Self-Prediction head, we combine the instance and semantic embeddings and jointly self-predict instance and semantic labels.
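A NumPy sketch of the discriminative loss in Eqs. (7)-(10) may help. The margin defaults and the 0.001 regularizer weight below are the common settings for this style of loss and are used here for illustration, not quoted from the paper; the function name is ours.

```python
import numpy as np

def discriminative_loss(embeddings, instance_ids, delta_v=0.5, delta_d=1.5):
    """Sketch of the discriminative instance loss: pull points toward their
    instance mean, push instance means apart, and regularize the means."""
    ids = np.unique(instance_ids)
    mus = np.stack([embeddings[instance_ids == k].mean(axis=0) for k in ids])
    # L_var: hinged pull toward the instance mean (Eq. 7).
    l_var = 0.0
    for mu, k in zip(mus, ids):
        dist = np.linalg.norm(embeddings[instance_ids == k] - mu, axis=1)
        l_var += np.mean(np.maximum(dist - delta_v, 0.0) ** 2)
    l_var /= len(ids)
    # L_dist: hinged push between different instance means (Eq. 8).
    l_dist = 0.0
    for i in range(len(ids)):
        for j in range(len(ids)):
            if i != j:
                d = np.linalg.norm(mus[i] - mus[j])
                l_dist += np.maximum(2 * delta_d - d, 0.0) ** 2
    if len(ids) > 1:
        l_dist /= len(ids) * (len(ids) - 1)
    # L_reg: keep the means close to the origin (Eq. 9).
    l_reg = np.mean(np.linalg.norm(mus, axis=1))
    return l_var + l_dist + 0.001 * l_reg   # Eq. (10)
```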
Specifically, we concatenate F_ins and F_sem along the feature axis and transform them into a joint embedding matrix F_joint ∈ R^{N×H_joint}, where H_joint is the dimension of the joint embeddings and is set to 160 in all our experiments. For each point in X, we transform its semantic and instance labels into one-hot form respectively. The instance label of each point denotes which instance it belongs to. This instance label is semantic-agnostic, i.e., we cannot infer the semantic label of a point from its instance label. We assume that a dataset contains C_sem semantic classes and the input point cloud sample X contains C_ins instances. Then we denote the semantic label matrix and instance label matrix as Y_sem ∈ 𝒴_sem and Y_ins ∈ 𝒴_ins respectively, where 𝒴_sem is the set of N × C_sem matrices with non-negative elements and 𝒴_ins is the set of N × C_ins matrices with non-negative elements. Given the two label matrices, we formulate a multi-label matrix Y_joint ∈ 𝒴_joint by concatenating the semantic and instance labels of each point, where 𝒴_joint is the set of N × (C_sem + C_ins) matrices with non-negative elements. In other words, one can infer which semantic class and which instance each point belongs to from Y_joint. We finally carry out the Self-Prediction described in Section 3.1 based on the joint feature matrix F_joint and the multi-label matrix Y_joint. We slice the self-predicted results Y*_joint ∈ 𝒴_joint into semantic results Y*_sem ∈ 𝒴_sem and instance results Y*_ins ∈ 𝒴_ins, which are then supervised by the semantic ground truth Y_sem and the instance ground truth Y_ins respectively. Note that our Self-Prediction is conducted within one point cloud sample at a time, hence it does not matter that the meaning of an instance label varies from sample to sample.

The instance-head, semantic-head and Self-Prediction head are jointly optimized. The instance-head and semantic-head aim to produce the segmentation results.
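The multi-label construction above can be sketched as follows, assuming integer per-point semantic and instance IDs; the helper name is ours.

```python
import numpy as np

def build_multilabel(sem_labels, ins_labels, num_sem):
    """Concatenate one-hot semantic labels and per-sample one-hot instance
    labels into the multi-label matrix Y_joint of shape N x (C_sem + C_ins)."""
    n = len(sem_labels)
    # Instance IDs are sample-local: remap whatever IDs appear to 0..C_ins-1.
    ins_ids, ins_idx = np.unique(ins_labels, return_inverse=True)
    y_sem = np.zeros((n, num_sem))
    y_sem[np.arange(n), sem_labels] = 1.0
    y_ins = np.zeros((n, len(ins_ids)))
    y_ins[np.arange(n), ins_idx] = 1.0
    return np.concatenate([y_sem, y_ins], axis=1)
```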
Our proposed Self-Prediction head incorporates the similarity relations among points and enforces the backbone to learn more discriminative representations. The three heads cooperate with each other to achieve better segmentation performance. We emphasize that the Self-Prediction head is discarded and only the instance-head and semantic-head are used during inference, hence no extra computational burden or space usage is introduced.

3.3 Optimization Objectives

We train the instance-head with the instance loss L_ins formulated in Equation 10. The semantic-head is trained by the classical cross-entropy loss, supervised by the semantic label Y_sem:

    L_sem = −(1/N) Σ_{i=1}^{N} [Y_sem]_i log p_i,    (11)

where p_i denotes the output probability distribution computed by the softmax function. Given the jointly self-predicted results Y*_ins and Y*_sem, we also train our Self-Prediction head by cross-entropy loss:

    L_sp = −(1/N) Σ_{i=1}^{N} ( [Y_ins]_i log q_i + [Y_sem]_i log r_i ),    (12)

where q_i and r_i are the output probability distributions (computed by softmax) of the i-th rows of Y*_ins and Y*_sem respectively.

The three heads are jointly optimized and the overall optimization objective is a weighted sum of the above three losses:

    L = L_ins + L_sem + β·L_sp,    (13)

where β is used to balance the contributions of the three terms such that they contribute equally to the overall loss. β is set to 0.8.
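A compact sketch of the cross-entropy terms and the weighted sum in Eqs. (11)-(13); names are illustrative, and the instance loss L_ins is assumed to be computed separately:

```python
import numpy as np

def cross_entropy(scores, onehot):
    """Mean cross-entropy between softmax(scores) and one-hot targets (Eq. 11)."""
    z = scores - scores.max(axis=1, keepdims=True)          # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(np.sum(onehot * log_p, axis=1))

def total_loss(l_ins, sem_scores, y_sem, sp_ins_scores, y_ins,
               sp_sem_scores, beta=0.8):
    """Overall objective of Eq. (13): instance loss + semantic cross-entropy
    + weighted Self-Prediction cross-entropy (Eq. 12)."""
    l_sem = cross_entropy(sem_scores, y_sem)
    l_sp = cross_entropy(sp_ins_scores, y_ins) + cross_entropy(sp_sem_scores, y_sem)
    return l_ins + l_sem + beta * l_sp
```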
4 Experiments

Datasets. The Stanford 3D Indoor Semantics Dataset (S3DIS) is a large-scale real-scene segmentation benchmark containing 6 areas with a total of 272 rooms. Each 3D RGB point is annotated with an instance label and a semantic label from 13 categories. Each room is typically parsed into about 10-80 object instances. The ShapeNet part dataset contains 16681 samples from 16 categories. There are 50 parts in total, and each category contains 2-6 parts. The instance annotations are obtained from [31] and used as ground-truth instance labels.
Evaluation Metrics
On the S3DIS dataset, following the common evaluation settings, we validate our method in a 6-fold cross-validation fashion over the 6 areas, i.e., 5 areas are used for training and the remaining area for validation each time. Moreover, test results on Area 5 are reported individually because there is no overlap between Area 5 and the other areas, which better shows the generalization ability of methods. For the evaluation of semantic segmentation, we use mean IoU (mIoU) across all categories, class-wise mean accuracy (mAcc) and point-wise overall accuracy (oAcc) as metrics. We adopt the same evaluation metrics as [32] for instance segmentation. Apart from the commonly used metrics mean precision (mPrec) and mean recall (mRec) with an IoU threshold of 0.5, coverage and weighted coverage (Cov, WCov) [11,22,46] are adopted. Cov is the average instance-wise IoU between prediction and ground truth; WCov is Cov weighted by the size of the ground-truth instances. On ShapeNet, part-averaged IoU (pIoU) and mean per-class pIoU (mpIoU) are taken as evaluation metrics for semantic segmentation. Following [31,32], we only provide qualitative results of part instance segmentation on ShapeNet.
Implementation Details
For experiments on S3DIS, we follow the same setting as PointNet [20], where each room is split into blocks of area 1m × 1m. Each 3D point is represented by a 9-dim vector (XYZ, RGB and the normalized location in the room). We sample 4096 points for each block during training, and all points are used for testing. As mentioned above, we construct a graph and then divide the point cloud into two groups to perform Self-Prediction in the Self-Prediction head. In practice, we partition the point cloud into more than two groups for acceleration. Specifically, we divide every block equally into 8 groups according to their instance labels, i.e., we guarantee that the points of each instance are evenly distributed over the groups. As a result, the points of each semantic class are also evenly distributed over the groups. The 8 groups are then randomly paired into 4 pairs to conduct Self-Prediction. We train all models on S3DIS for 100 epochs with the SGD optimizer and batch size 8. The base learning rate is set to 0.01 and divided by 2 every 20 epochs. For the instance head, we set the margins δ_v and δ_d as in [32]. β for L_sp is set to 0.8. The BlockMerging algorithm [32,18] is used to merge instances from different blocks during inference, and the mean-shift bandwidth is set as in [32]. For ShapeNet, we train all models for 200 epochs with the Adam optimizer and batch size 16. The base learning rate is set to 0.001 and divided by 2 every 20 epochs. Other settings are the same as in the experiments on S3DIS.

We report instance and semantic segmentation results in Table 1 and Table 2 respectively, where results on Area 5 and 6-fold cross validation are both shown. Baseline results in the tables denote training our backbone network with only the instance-head and semantic-head. All baseline results for PointNet and PointNet++ in the tables are taken from the vanilla results of [32], which are almost the same as ours. In all tables, InsSem-SP denotes the complete version of our method, i.e., performing instance and semantic Self-Prediction jointly. To demonstrate the effectiveness of our proposed Self-Prediction scheme and associated framework more clearly, we report the results of Ins-SP and Sem-SP in Table 1 and Table 2 respectively. Ins-SP means that we only perform instance Self-Prediction by taking F_ins and Y_ins as input. Sem-SP means that we only perform semantic Self-Prediction by taking F_sem and Y_sem as input.

Results on Area 5
Backbone  Method            mPrec  mRec  mCov  mWCov
PN        Baseline [32]     42.3   34.9  38.0  40.6
PN        ASIS [32]         44.5   37.4  40.4  43.3
PN        Ours (Ins-SP)     48.2   39.9  44.7  47.6
PN        Ours (InsSem-SP)
PN++      Baseline [32]     53.4   40.6  42.6  45.7
PN++      ASIS [32]         55.3   42.4  44.6  47.8
PN++      Ours (Ins-SP)     58.9   46.3  52.8  54.9
PN++      Ours (InsSem-SP)

Results on 6-fold CV
Backbone  Method            mPrec  mRec  mCov  mWCov
PN        Baseline [32]     50.6   39.2  43.0  46.3
PN        ASIS [32]         53.2   40.7  44.7  48.2
PN        Ours (Ins-SP)     55.1   44.3  48.9  50.1
PN        Ours (InsSem-SP)
PN++      Baseline [32]     62.7   45.8  49.6  53.4
PN++      ASIS [32]         63.6   47.5  51.2  55.1
PN++      Ours (Ins-SP)     65.9   53.2  58.0  60.7
PN++      Ours (InsSem-SP)

Table 1. Instance segmentation results on S3DIS dataset.
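The instance-balanced group division described in the implementation details can be sketched as follows (names are illustrative; a real pipeline would do this per block during batching):

```python
import numpy as np

def split_groups(ins_labels, num_groups=8, rng=None):
    """Split point indices into groups so that each instance's points are
    spread evenly across groups (the class-averaged dividing used to form
    Self-Prediction pairs)."""
    rng = rng or np.random.default_rng(0)
    groups = [[] for _ in range(num_groups)]
    for k in np.unique(ins_labels):
        idx = np.flatnonzero(ins_labels == k)
        rng.shuffle(idx)
        # Round-robin assignment spreads this instance over all groups.
        for i, point in enumerate(idx):
            groups[i % num_groups].append(point)
    return [np.array(g) for g in groups]
```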
From Table 1 and Table 2, we observe that our method significantly improves the baseline with all backbone networks on both instance and semantic segmentation tasks. For example, with PointNet as the backbone, our method improves the baseline by 8.3 mPrec, 8.7 mRec, 11.2 mCov and 11.2 mWCov in instance segmentation, and by 7.7 mIoU, 9.5 mAcc and 3.9 oAcc in semantic segmentation on Area 5. The effectiveness of the proposed Self-Prediction scheme is fully demonstrated by comparing the results of Ins-SP with the baseline in Table 1 and the results of Sem-SP with the baseline in Table 2. Moreover, performance is further improved when we conduct instance and semantic Self-Prediction jointly. In Figure 2, we show some visualization results of the baseline and Ours (InsSem-SP) based on PointNet++. We observe that our method achieves obviously more accurate predictions and produces better instance/semantic class boundaries.
                              Area 5              6-fold CV
Backbone  Method            mIoU  mAcc  oAcc    mIoU  mAcc  oAcc
PN        Baseline [32]     44.7  52.9  83.7    49.5  60.7  80.4
PN        ASIS [32]         46.4  55.7  84.5    51.1  62.3  81.7
PN        Ours (Sem-SP)     48.0  58.6  85.5    52.3  64.5  83.0
PN        Ours (InsSem-SP)
PN++      Baseline [32]     50.8  58.3  86.7    58.2  69.0  85.9
PN++      ASIS [32]         53.4  60.9  86.9    59.3  70.1  86.2
PN++      Ours (Sem-SP)     55.9  63.6  87.3    61.1  72.2  87.3
PN++      Ours (InsSem-SP)

Table 2. Semantic segmentation results on S3DIS dataset.
Based on the baseline, ASIS associates instance and semantic segmentation and designs a module to make these two tasks cooperate with each other. Obvious improvements over the baseline are achieved by ASIS, while our method performs significantly better. Another advantage of our method is that the proposed Self-Prediction head is formulated as a loss function and is removed during inference, hence no extra computation burden or space usage is introduced compared with the baseline.
(Columns: Baseline, Ours, Ground Truth. (a) Instance segmentation. (b) Semantic segmentation.)
Fig. 2.
Visualization results of instance and semantic segmentation. Our method obviously performs better than the baseline. Best viewed in color.
Comparison with State-of-the-arts
We also compare our method with other state-of-the-art methods. Instance segmentation results are shown in Table 3, from which we see that our method achieves state-of-the-art performance. To the best of our knowledge, 3D-BoNet [38] is the best published method for instance segmentation in 3D point clouds. Obviously better performance compared with 3D-BoNet is achieved by our method, especially for mean recall. JSNet [44] achieves excellent performance by designing a feature fusion module based on PointConv [34]. Compared with JSNet, our method (PN++) performs better, especially for mCov and mWCov. For semantic segmentation, results are shown in Table 4. Our method achieves comparable results to state-of-the-art methods when we only use PointNet++ as the backbone. Even better performance on Area 5 is achieved compared with PointCNN, which is an excellent point cloud learning architecture. Moreover, our method is general and can use the most advanced architectures as the backbone to achieve superior performance.
Results on Area 5
Method           mPrec  mRec  mCov  mWCov
SGPN (PN) [31]   36.0   28.7  32.7  35.5
3D-BoNet [38]    57.5   40.2  -     -
Ours (PN++)

Results on 6-fold CV
Method           mPrec  mRec  mCov  mWCov
SGPN (PN) [31]   38.2   31.2  37.9  40.8
PartNet [16]     56.4   43.4  -     -
3D-BoNet [38]    65.6   47.6  -     -
JSNet [44]       66.9   53.9  54.1  58.0
Ours (PN++)

Table 3. Instance segmentation results of state-of-the-art methods on S3DIS dataset.

                     Area 5              6-fold CV
Method           mIoU  mAcc  oAcc    mIoU  mAcc  oAcc
RSNet [6]        -     -     -       56.5  66.5  -
JSNet [44]       54.5  61.4  87.7    61.7  71.7  -
SPGraph [9]      58.0  66.5  86.5    62.1  73.0  85.5
PointCNN [10]    57.3  63.9  85.9    65.4  75.6  88.1
PCCN [30]        58.3  -     -       -     -     -
PointWeb [43]    60.3  66.6  87.0    -     87.8  -
Ours (PN++)      58.8  65.9

Table 4. Semantic segmentation results of state-of-the-art methods on S3DIS dataset.
We provide qualitative results of part instance segmentation in Figure 3, following [31] and [32]. As shown in Figure 3, our method successfully segments instances of the same part, such as the different legs of a chair. Semantic segmentation results are shown in Table 5, from which we observe that our method achieves significant improvements over the baselines. More improvements compared with ASIS are also achieved by our method. In addition to PointNet and PointNet++, we add a stronger network, DGCNN [33], as our backbone on this dataset. Obvious performance improvement over the baseline can also be observed with this backbone.
Fig. 3.
Visualization results of our method on ShapeNet. (a) Instance segmentation results. (b) Instance segmentation ground truth. (c) Semantic segmentation results. (d) Semantic segmentation ground truth.

Method                    pIoU  mpIoU
PointNet (RePr)           83.3  79.7
PointNet++ (RePr)         84.5  80.5
DGCNN (RePr)              85.2  82.3
ASIS (PN)                 84.0  -
ASIS (PN++)               85.0  -
Ours (InsSem-SP, PN)      84.5  81.5
Ours (InsSem-SP, PN++)    85.8  82.6
Ours (InsSem-SP, DGCNN)

Table 5. Semantic segmentation results on ShapeNet dataset. RePr denotes our reproduced results. All models in the table are trained without normal information.
In this section, we analyze some important components and hyper-parameters of our method. All experiments in this section are performed on S3DIS Area 5 using PointNet as the backbone.
Component Analyses. As shown in Section 4.2, the effectiveness of our proposed Self-Prediction scheme and joint learning framework has been demonstrated. We further investigate how much our method benefits from bidirectional Self-Prediction and from the class-averaged group dividing strategy. To this end, two corresponding experiments are conducted: 1) we perform only unidirectional Self-Prediction, with the direction randomly selected between the two directions; 2) we divide the point cloud into groups randomly rather than according to instance labels in the Self-Prediction head. Experimental results are reported in Table 6, where mPrec and mRec are shown for instance segmentation and mIoU and mAcc for semantic segmentation. We conclude that bidirectional Self-Prediction brings visible improvements over unidirectional Self-Prediction, and that random grouping slightly degrades performance.
Method              mPrec  mRec  mIoU  mAcc
Unidirectional      49.9   40.7  51.0  60.8
Randomly Dividing   50.5   42.1  51.6  61.3
Ours (InsSem-SP)

Table 6. Component analyses. Results on S3DIS Area 5 are shown.
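To make the "Randomly Dividing" ablation concrete, the snippet below sketches one plausible implementation of the two dividing strategies: spreading each instance's points evenly over the groups versus partitioning points uniformly at random. The function name and the exact splitting details are illustrative assumptions, not our actual code.

```python
import numpy as np

def divide_groups(instance_labels, num_groups=2, by_instance=True, seed=0):
    """Split point indices into num_groups groups.
    by_instance=True: each instance's points are distributed evenly
    across groups, so every group sees every instance.
    by_instance=False: points are partitioned uniformly at random."""
    rng = np.random.default_rng(seed)
    instance_labels = np.asarray(instance_labels)
    groups = [[] for _ in range(num_groups)]
    if by_instance:
        for inst in np.unique(instance_labels):
            idx = rng.permutation(np.flatnonzero(instance_labels == inst))
            for g, chunk in enumerate(np.array_split(idx, num_groups)):
                groups[g].extend(chunk.tolist())
    else:
        idx = rng.permutation(len(instance_labels))
        groups = [c.tolist() for c in np.array_split(idx, num_groups)]
    return groups
```

With random dividing, a group may miss some instances entirely, which plausibly explains the slight performance drop observed in Table 6.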
Parameter Analyses. Three important parameters introduced by our method are analyzed in this section. The first is β, used to balance the weight of L_sp. The analysis results are shown in Figure 4(a), from which we can see that our method is not sensitive to this parameter and works well over a wide range (0.4-1.4). The second parameter is the number of divided groups G, which trades off performance against training speed. We show the results in Figure 4(b): the performance is relatively stable and not sensitive to G within a reasonable range. The last parameter is α, used to control the propagation portion in the label propagation process. Although we follow the common setting [12] (α = 0.99) in all our experiments, we still conduct experiments to analyze the sensitivity of our method to this parameter. As shown in Figure 4(c), our method outperforms the baseline over a large range of α.

Fig. 4. Results of parameter analyses: (a) analysis of β, (b) analysis of G, (c) analysis of α. mPrec for instance segmentation and mIoU for semantic segmentation are shown; dotted lines represent baseline results.
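For reference, the closed-form label propagation of [12] that the α analysis above refers to is F* = (I - αS)^(-1) Y, with S the symmetrically normalized affinity matrix. The sketch below uses a Gaussian affinity on a complete graph; this affinity construction is a simplifying assumption for illustration, not necessarily the one used in our Self-Prediction head.

```python
import numpy as np

def label_propagation(features, labels, alpha=0.99):
    """Closed-form label propagation: F* = (I - alpha * S)^(-1) Y,
    with S = D^(-1/2) W D^(-1/2) the normalized affinity matrix and
    Y one-hot labels (zero rows for unlabeled points, labels == -1)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n = len(features)
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2)              # Gaussian affinity on a complete graph
    np.fill_diagonal(W, 0.0)     # no self-loops
    deg = W.sum(axis=1)
    S = W / np.sqrt(deg[:, None] * deg[None, :])
    Y = np.zeros((n, labels.max() + 1))
    known = labels >= 0
    Y[np.flatnonzero(known), labels[known]] = 1.0
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return F.argmax(axis=1)      # predicted class per point
```

With α close to 1 (0.99 here), predictions rely almost entirely on the graph structure; as α decreases, points stay closer to their initial labels.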
In this paper, we present a novel learning scheme named Self-Prediction to enforce relation exploring, together with a joint framework for associating instance and semantic segmentation of point clouds. Extensive experiments demonstrate that our method can be combined with popular networks and significantly improve their performance. By taking only PointNet++ as the backbone, our method achieves state-of-the-art or comparable results. Moreover, our method is a general learning framework and is easy to apply to most existing learning networks.
References
1. Choy, C., Gwak, J., Savarese, S.: 4D spatio-temporal convnets: Minkowski convolutional neural networks. In: CVPR (2019)
2. Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. (2002)
3. Graham, B., Engelcke, M., van der Maaten, L.: 3D semantic segmentation with submanifold sparse convolutional networks. In: CVPR (2018)
4. Hou, J., Dai, A., Nießner, M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In: CVPR (2019)
5. Hua, B., Tran, M., Yeung, S.: Pointwise convolutional neural networks. In: CVPR (2018)
6. Huang, Q., Wang, W., Neumann, U.: Recurrent slice networks for 3D segmentation of point clouds. In: CVPR (2018)
7. Jiang, L., Zhao, H., Liu, S., Shen, X., Fu, C.W., Jia, J.: Hierarchical point-edge interaction network for point cloud semantic segmentation. In: ICCV (2019)
8. Lan, S., Yu, R., Yu, G., Davis, L.S.: Modeling local geometric structure of 3D point clouds using Geo-CNN. In: CVPR (2019)
9. Landrieu, L., Simonovsky, M.: Large-scale point cloud semantic segmentation with superpoint graphs. In: CVPR (2018)
10. Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B.: PointCNN: Convolution on X-transformed points. In: NeurIPS (2018)
11. Liu, S., Jia, J., Fidler, S., Urtasun, R.: SGN: Sequential grouping networks for instance segmentation. In: ICCV (2017)
12. Liu, Y., Lee, J., Park, M., Kim, S., Yang, E., Hwang, S.J., Yang, Y.: Learning to propagate labels: Transductive propagation network for few-shot learning. In: ICLR (2019)
13. Liu, Y., Fan, B., Meng, G., Lu, J., Xiang, S., Pan, C.: DensePoint: Learning densely contextual representation for efficient point cloud processing. In: ICCV (2019)
14. Liu, Y., Fan, B., Xiang, S., Pan, C.: Relation-shape convolutional neural network for point cloud analysis. In: CVPR (2019)
15. Mao, J., Wang, X., Li, H.: Interpolated convolutional networks for 3D point cloud understanding. In: ICCV (2019)
16. Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: CVPR (2019)
17. Pham, Q., Hua, B., Nguyen, D.T., Yeung, S.: Real-time progressive 3D semantic segmentation for indoor scenes. In: WACV (2019)
18. Pham, Q., Nguyen, D.T., Hua, B., Roig, G., Yeung, S.: JSIS3D: Joint semantic-instance segmentation of 3D point clouds with multi-task pointwise networks and multi-value conditional random fields. In: CVPR (2019)
19. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data. In: CVPR (2018)
20. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: Deep learning on point sets for 3D classification and segmentation. In: CVPR (2017)
21. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. In: NIPS (2017)
22. Ren, M., Zemel, R.S.: End-to-end instance segmentation with recurrent attention. In: CVPR (2017)
23. Shen, Y., Feng, C., Yang, Y., Tian, D.: Mining point cloud local structures by kernel correlation and graph pooling. In: CVPR (2018)
24. Su, H., Jampani, V., Sun, D., Maji, S., Kalogerakis, E., Yang, M., Kautz, J.: SPLATNet: Sparse lattice networks for point cloud processing. In: CVPR (2018)
25. Tchapmi, L.P., Choy, C.B., Armeni, I., Gwak, J., Savarese, S.: SEGCloud: Semantic segmentation of 3D point clouds. In: 3DV (2017)
26. Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.J.: KPConv: Flexible and deformable convolution for point clouds. In: ICCV (2019)
27. Wang, C., Samari, B., Siddiqi, K.: Local spectral graph convolution for point set feature learning. In: ECCV (2018)
28. Wang, F., Zhang, C.: Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering (2007)
29. Wang, L., Huang, Y., Hou, Y., Zhang, S., Shan, J.: Graph attention convolution for point cloud semantic segmentation. In: CVPR (2019)
30. Wang, S., Suo, S., Ma, W., Pokrovsky, A., Urtasun, R.: Deep parametric continuous convolutional neural networks. In: CVPR (2018)
31. Wang, W., Yu, R., Huang, Q., Neumann, U.: SGPN: Similarity group proposal network for 3D point cloud instance segmentation. In: CVPR (2018)
32. Wang, X., Liu, S., Shen, X., Shen, C., Jia, J.: Associatively segmenting instances and semantics in point clouds. In: CVPR (2019)
33. Wang, Y., Sun, Y., Liu, Z., Sarma, S., Bronstein, M., Solomon, J.: Dynamic graph CNN for learning on point clouds. ACM Transactions on Graphics 38