Rethinking Task and Metrics of Instance Segmentation on 3D Point Clouds
Kosuke Arase, Yusuke Mukuta, Tatsuya Harada
The University of Tokyo, RIKEN AIP
{arase,mukuta,harada}@mi.t.u-tokyo.ac.jp

Abstract
Instance segmentation on 3D point clouds is one of the most extensively researched areas toward the realization of autonomous cars and robots. Certain existing studies have split input point clouds into small square regions; one reason for this is that the models in these studies cannot consume a large number of points because of their large space complexity. However, because such small regions occasionally include a very small number of instances belonging to the same class, an evaluation using existing metrics such as mAP is largely affected by the category recognition performance. To address these problems, we propose a new method with space complexity O(N_p), so that large regions can be consumed, as well as novel metrics for the task that are independent of the categories or size of the inputs. Our method learns a mapping from input point clouds to an embedding space, where the embeddings form clusters for each instance, and distinguishes instances using these clusters during testing. Our method achieves state-of-the-art performance using both existing and the proposed metrics. Moreover, we show that our new metric can evaluate the performance of the task without being affected by any other condition.
1. Introduction
3D environment recognition has been extensively researched toward the realization of autonomous cars and robots. In particular, instance segmentation, the task of not only labeling each point but also distinguishing each instance belonging to the same class, is one of the key tasks for such realization. Instance segmentation is challenging because the number of instances is not fixed, and thus, methods for categorical classification cannot be directly applied. Although there are several typical 3D data representations, such as voxels, meshes, and point clouds, in this study, we focus on point clouds, which can be obtained directly from depth sensors such as Light Detection and Ranging (LiDAR).

An instance segmentation model learns the mapping from each input point to the semantics of the corresponding point. When evaluating an instance segmentation model, the pairing of a prediction and a ground truth is considered a true positive when the intersection over union (IoU) between them is higher than a threshold. In many cases, semantic segmentation can be solved simultaneously, and thus, the important issue is distinguishing objects in the same category.

There have been many studies on instance segmentation where the input point clouds are split into small square regions [21, 22]; however, conducting evaluations on such small regions is somewhat complicated. One solution is first merging the small regions into one entire-scene prediction and then evaluating the entire scene [22]. However, the final result is largely affected by the merging algorithm, and it is difficult to evaluate the pure instance segmentation performance. Another way is evaluating the instance segmentation in the small regions themselves [21]; however, this is not desirable for the following reason.
As shown in Figure 1, small regions often contain only one instance of a certain category, and in such cases, semantic segmentation is sufficient for instance segmentation because it is not necessary to distinguish objects belonging to the same class. When the input regions are too small and there is only one object in each region, it is unnecessary to distinguish the object, and thus, instance segmentation does not have to be conducted. Conversely, when there are many objects belonging to the same class, it is necessary to consume larger regions in order to evaluate instance segmentation. Consuming large regions is also challenging because it is necessary to consume a large number of points to avoid a sparse input, which decreases the performance of certain models, including PointNet [15, 16]. Handling dense point clouds is also helpful in applications of instance segmentation. However, as an example, the Similarity Group Proposal Network (SGPN) [22] calculates the similarities for each pair of points, and its space complexity is O(N_p^2) for N_p points, which makes it difficult to consume large point clouds. Thus, a memory-efficient method is required.

Figure 1. Objects within grids of various grid sizes (S3DIS [1], Area 6, Office 29)

Moreover, there are certain problems in existing metrics. The method in [22] was evaluated using the mean average precision (mAP), which has the characteristic that the effect of false positives with low confidence scores is small. Although this property is appropriate for tasks such as object detection, where multiple candidates with overlaps are allowed, or retrieval, where the rank of the output is important, it is not suitable for instance segmentation. Because the outputs are objects without an overlap for each point, we need to evaluate the outputs of each point equally, regardless of the confidence score.
In addition, when evaluating instance segmentation, we focus on whether two objects are properly distinguished and whether one object is incorrectly split. However, these failures cannot be distinguished from a misclassification when performing evaluations using existing metrics, and misclassification is often the main factor in decreasing mAP.

In this study, we first experimentally support our claim that evaluating instance segmentation in small regions with existing metrics is inappropriate and reveal a problem with these metrics that has not been investigated in previous studies. Then, we propose a novel instance segmentation method with small space complexity that enables the consumption of large regions. Our loss function learns a one-to-one mapping from an input feature space to an embedding space, where embeddings from the same instance form a cluster, and we can distinguish instances by clustering at test time. Because our method does not have to handle point pairs, the space complexity is O(N_p) and is scalable in the number of points. We show that the proposed memory-efficient method outperforms other state-of-the-art methods.

In addition, we demonstrate that consuming small regions and evaluating them using existing metrics is not appropriate. This fact has been overlooked by previous research, so we propose a novel metric that can evaluate instance segmentation correctly for the first time. Our metric is based on inclusion, which is the relationship of one set being a subset of another. Using the proposed metric, we can evaluate the pure performance regardless of the size of the regions, categories, or confidence scores.
We can also analyze the types of errors quantitatively. We conducted extensive experiments to reveal the effects of the size of the regions and the density of the points on the instance segmentation performance and showed that consuming a large number of points increases the performance for large regions.

The key contributions of this study are as follows:

• We propose a new loss function that learns to push the embeddings for each instance to be clustered and is scalable in the number of points; we also experimentally demonstrate that the proposed method outperforms existing methods.

• We reveal the problems associated with existing metrics, including the fact that they are affected by the size of the inputs or the categories, which has been overlooked in previous research.

• We propose a novel metric that is not affected by these factors and can evaluate instance segmentation performance correctly.
2. Related Work
The effective handling of point clouds is challenging because they are unordered, non-uniformly distributed data. Methods to extract features from point clouds can be roughly classified into two approaches, namely, describing local features [18, 19] and describing relationships among multiple points [2, 6]. PointNet [15], which addresses the problem of unordered data by using symmetric functions, and PointNet++ [16], which stacks PointNets and is able to handle local features, have made recent breakthroughs in deep learning on point clouds. We use PointNet and PointNet++ as feature extractors in this study.
Segmentation is the task of labeling each minimum element in the data, such as a pixel or point. In particular, labeling the category of each element while distinguishing objects belonging to the same category is called instance segmentation, as opposed to semantic segmentation.
2D Images
Many studies on instance segmentation on images have been published recently [4, 5, 8, 12–14], and Novotny et al. [14] classified instance segmentation into two approaches: propose & verify (P&V) and instance coloring (IC). P&V is an approach that first proposes candidate objects based on their objectness and then verifies the candidates. This is currently a popular approach in the fields of object detection [17] and instance segmentation [8] on images. Although P&V approaches have achieved significant success in image segmentation, they have the weakness that object candidates are approximations of the object shapes, and a second stage to refine the candidates is necessary for segmentation, as in Mask R-CNN [8]; thus, the network architecture tends to be complex.

Approaches that label an object identifier directly on each pixel are called IC, and some studies have been conducted in this area for image segmentation [5, 12, 14]. Brabandere et al. [5] proposed a discriminative loss function that learns a mapping to an embedding space where the embeddings form clusters for each object. The loss function is simple and efficient but has some shortcomings, as described in Section 2.3. We choose an IC-based approach because the architecture tends to be simpler, and it is thus expected to be computationally efficient.
3D Point Clouds
SGPN [22] and deep functional dictionaries (DFD) [21] have tackled instance segmentation on 3D point clouds. SGPN first predicts, for every pair of points, a similarity that describes whether the two points belong to the same object, and then merges points into instance proposals by considering a pair of points with a similarity higher than a certain threshold as being contained in the same object. Although it is a pioneering work on instance segmentation on point clouds, the space complexity of the similarity matrix is proportional to the square of the number of points, so SGPN cannot handle a large number of points. We discuss this problem in Section 3.2. Thus, input scenes are split into small square regions, and the results are then aggregated over the regions using a heuristic algorithm. However, the final performance depends on the merging algorithm, as described in Section 1, and applying the method to every small region is computationally inefficient.

Recently, Sung et al. [21] proposed a general method called DFD that produces a dictionary of probe functions. The authors proposed a general framework that learns a mapping from the shape to the dictionary. Each atom of the dictionary can be associated with semantics, instances, or something else, depending on the task and constraint. A performance comparable to that of state-of-the-art techniques was achieved on S3DIS, but the authors evaluated the performance on small regions. Thus, this evaluation has certain problems, as discussed in Section 1.
Our method performs instance segmentation by first learning feature embeddings for each point such that the diameter of the embedding cluster corresponding to the same object is small compared to the distance among clusters from different objects; then, clustering is conducted in the embedding space. Such feature learning, which trains the embedding to minimize the distance between embeddings with the same semantics while maximizing the distance between embeddings with different semantics, is widely used in category classification [3, 23] and similarity learning [11, 20]. This concept has also been used in recent instance segmentation studies on images, such as those on the discriminative loss [5, 12]. Inspired by this, we propose a novel instance segmentation method that overcomes the problems of the discriminative loss.

The discriminative loss L consists of L_var, which makes the distance between points and the centroid of the corresponding cluster smaller than δ_v; L_dist, which makes the distance between cluster centroids larger than 2δ_d; and a regularizer L_reg, which prevents the feature norms from diverging. Here, L is written as follows:

L_var = (1/C) Σ_{c=1}^{C} (1/N_c) Σ_{i=1}^{N_c} [ ||μ_c − x_i|| − δ_v ]_+^2    (1)

L_dist = 1/(C(C−1)) Σ_{c_A ≠ c_B} [ 2δ_d − ||μ_{c_A} − μ_{c_B}|| ]_+^2    (2)

L_reg = (1/C) Σ_{c=1}^{C} ||μ_c||    (3)

L = L_sem + α L_var + β L_dist + γ L_reg,    (4)

where C denotes the number of clusters; μ_c and N_c are the centroid and the number of points of cluster c, respectively; x_i is the embedding; L_sem is the softmax cross-entropy loss of the category classification; ||·|| is the Euclidean norm in the feature space; and [x]_+ = max(0, x). When we conduct instance segmentation, we apply clustering on the learned embedding space.
When we set δ_d ≥ 2δ_v and the learned embedding space satisfies L_var = L_dist = 0, we can guarantee that all points whose distances from a point are smaller than δ_v belong to the same object.

However, there are some drawbacks in this original formulation of the discriminative loss. First, it is difficult to select the hyperparameters β and γ that balance the weights of L_reg and L_dist. The optimization is hyperparameter-sensitive because L_reg attempts to reduce the feature norms (i.e., pull the points together), whereas L_dist attempts to increase the distances between cluster centroids (i.e., push them apart). Empirically, it turns out that γ should be about 100 times smaller than α and β, and seeking such a balance is a cumbersome task.

Figure 2. Overview of the proposed feature learning method. Each point (•) represents one feature embedding, and the points with the same color belong to the same object. The crosses (×) are cluster centroids. As the training progresses, points with the same color move to a nearby spot and clusters move away from each other.

Moreover, when we concatenate the learned feature with other features, such as the raw coordinates of a point, we need to arrange the scales of the features such that both are effective for clustering. However, it is difficult to arrange the scale because the feature norms differ among feature spaces. In contrast, when we normalize each feature after learning the feature space, we cannot distinguish between points with the same unit vector and different norms.

In the following sections, we propose a novel embedding method that solves these problems.
3. Method
In this section, we describe the proposed feature learning method. As described in the previous section, Equation (2) in L attempts to increase the distances between different clusters, while Equation (3) minimizes the feature norms. Thus, L is sensitive to the hyperparameters β and γ that balance these conflicting losses. Moreover, it is difficult to combine a learned feature with other features for clustering.

In this study, we overcome these difficulties by restricting the features to a unit hypersphere and learning the feature space based on a cosine similarity instead of the Euclidean loss. We present an overview of our method in Figure 2, where each point (•) represents one feature embedding, points with the same color belong to the same object, and the cross (×) indicates the cluster centroid. Moreover, θ and φ satisfy δ_v = cos(θ) and δ_d < cos(φ), respectively. Using the cosine similarity between two embeddings x_i and x_j, which is calculated as s(x_i, x_j) = x_i^T x_j / (||x_i|| ||x_j||), the proposed loss function is written as follows:

L_var = (1/C) Σ_{c=1}^{C} (1/N_c) Σ_{i=1}^{N_c} [ δ_v − s(μ_c, x_i) ]_+    (5)

L_dist = 1/(C(C−1)) Σ_{c_A ≠ c_B} [ s(μ_{c_A}, μ_{c_B}) − δ_d ]_+    (6)

L = L_sem + α L_var + β L_dist,    (7)

where δ_d and δ_v satisfy δ_d ≪ δ_v ≈ 1 so that s(μ_c, x_i) becomes larger than s(μ_{c_A}, μ_{c_B}). In addition, we use the absolute error of the [·]_+ terms instead of the squared error adopted in [5] because the [·]_+ terms are smaller than 1 and the squared errors become considerably smaller when these terms are near zero.

When the angle between an embedding and its cluster centroid is larger than θ, L_var attempts to reduce the distance between the embedding and the centroid. In addition, L_dist attempts to increase the distance between cluster centroids when their cosine similarity is larger than δ_d.
In an image recognition study [12], the features were also learned using the cosine similarity on a unit hypersphere. However, in that study, similarities between all pairs of points were calculated, whereas our method only considers the similarities between points and the corresponding cluster centroids. Thus, our method is considerably more computationally efficient. Compared to [5], the advantages of our method are as follows:

• We do not need to consider the scale of the feature space; thus, we can omit L_reg and do not need to consider the balance between β and γ.

• Because the embeddings are guaranteed to have a unit norm, it is easy to combine the learned embeddings with other features.

We learn the mapping from the feature space to the embedding space by adding one fully connected layer.
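The proposed loss (Equations (5)–(7), without the L_sem term) can be sketched in a few lines of NumPy. This is an illustrative implementation rather than the paper's training code; the function name is ours, and the default δ_v and δ_d values below are placeholders, not the settings used in the experiments.

```python
import numpy as np

def proposed_instance_loss(x, labels, delta_v=0.9, delta_d=0.1):
    """Sketch of Eqs. (5)-(7) without L_sem.

    x: (N, d_e) embeddings; labels: (N,) instance ids.
    """
    # Restrict embeddings to the unit hypersphere.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    ids = np.unique(labels)
    C = len(ids)
    # Cluster centroids, normalized so cosine similarity is a dot product.
    mu = np.stack([x[labels == c].mean(axis=0) for c in ids])
    mu = mu / np.linalg.norm(mu, axis=1, keepdims=True)

    # L_var: pull each embedding toward its centroid (hinge at delta_v).
    l_var = 0.0
    for k, c in enumerate(ids):
        s = x[labels == c] @ mu[k]          # cosine similarity to centroid
        l_var += np.maximum(delta_v - s, 0.0).mean()
    l_var /= C

    # L_dist: push centroids apart (hinge at delta_d), over ordered pairs.
    sim = mu @ mu.T
    off_diag = ~np.eye(C, dtype=bool)
    l_dist = np.maximum(sim[off_diag] - delta_d, 0.0).sum() / (C * (C - 1))
    return l_var, l_dist
```

Note that, as in Equations (5) and (6), the hinge terms are left unsquared.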
In Section 2.2, we discussed the fact that one of the problems of the existing IC-based instance segmentation method SGPN [22] is its large space complexity. Because our method and SGPN require only a few extra layers on top of the feature extractor, the number of iterations for training is nearly the same; thus, we focus on analyzing the detailed computational complexity of the loss functions of SGPN and our method. In the following, we denote the batch size as B, the number of points as N_p, the number of points in a cluster c as N_c, and the dimensions of the input feature space and embedding space as d_f and d_e, respectively. We also write the input and embedded features of the i-th point as f_i ∈ R^{d_f} and h_i ∈ R^{d_e}, respectively.

In SGPN, the similarity S_ij between the i-th and j-th points is calculated as

S_ij = ||f_i − f_j|| = sqrt( ||f_i||^2 − 2 f_i^T f_j + ||f_j||^2 ).    (8)

Because the method calculates S_ij for all pairs of points, the space complexity of the similarity matrix is O(B N_p^2). As for the time complexity, because we need to evaluate ||f_i||^2 for each i and f_i^T f_j for each pair (i, j) to calculate S_ij, the time complexity is O(B N_p^2 d_f).

In contrast, the proposed loss function obtains d_e-dimensional embedded features and calculates the cosine similarity between each point and its cluster centroid, and between each pair of cluster centroids. Therefore, the space complexity for the embeddings of the points is O(B N_p d_e), and the time complexity is O(B (N_p + C^2) d_e); this order is equivalent to O(B N_p d_e) because C^2 ≪ N_p in most cases. Both complexities are linear in N_p. Because we use N_p = 2^n (n = 12, 13, 14) and d_e = 32 in the experiments, the proposed method can calculate the loss function with a smaller space/time complexity than SGPN.
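To make the gap concrete, a back-of-envelope calculation (assuming float32 storage and a batch size of B = 1; the variable names are ours) compares an N_p × N_p similarity matrix with per-point d_e-dimensional embeddings at N_p = 16,384 and d_e = 32:

```python
# Back-of-envelope memory comparison (float32 = 4 bytes, single sample B = 1).
BYTES = 4
N_p = 2 ** 14          # 16,384 points
d_e = 32               # embedding dimension

pairwise_matrix = N_p * N_p * BYTES   # SGPN-style N_p x N_p similarity matrix
per_point_embed = N_p * d_e * BYTES   # per-point embeddings in our method

print(pairwise_matrix // 2**20, "MiB vs", per_point_embed // 2**20, "MiB")
# prints: 1024 MiB vs 2 MiB
```

The quadratic term alone reaches 1 GiB per sample at this point count, before any batching, which illustrates why pairwise-similarity approaches are restricted to small regions.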
We describe our feature learning method in Section 3.1. In this section, we explain the clustering method applied to the learned feature space to conduct instance segmentation. The requirements for the clustering method are as follows:

• The number of clusters is variable.

• The clustering result is robust to outliers.

• The clustering does not fail even when the number of points per cluster varies widely.

In this study, we adopt density-based spatial clustering of applications with noise (DBSCAN) [7], which satisfies these requirements. DBSCAN is a density-based clustering method that first calculates the density around each point based on the number of neighboring points and then constructs clusters by considering a continuous region with a density above a certain threshold as a single cluster. The number of clusters output by DBSCAN can vary, and DBSCAN is robust to outliers because it accepts noise points that do not belong to any cluster.

We apply this clustering to the embeddings that are predicted as the same category. We concatenate the learned embeddings with the normalized coordinates of the points as the input for the clustering method. In addition, some clusters consist of a very small number of points. Because such clusters are false positives in most cases, we treat the points of such clusters as not belonging to any cluster in the evaluation.
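As a rough illustration of the two properties that matter here (a variable number of clusters and noise rejection), the following is a naive O(N²) pure-Python DBSCAN sketch. It is a toy version for exposition, not the implementation used in the experiments.

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN sketch: returns one label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):
        # Naive O(N) scan; real implementations use a spatial index.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                  # noise (may be claimed later)
            continue
        cluster += 1                        # start a new cluster
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster         # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:      # j is a core point: expand from it
                seeds.extend(j_nbrs)
    return labels
```

The number of clusters emerges from the data, and isolated points stay labeled −1, which matches the requirements listed above.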
Figure 3. Error patterns for instance segmentation
4. Evaluation Metrics
As described in Section 1, existing metrics for instance segmentation are affected by misclassification, the confidence of the prediction, and the size of the regions. Therefore, we propose a novel evaluation metric that focuses on the ability to distinguish objects regardless of the confidence or semantics, and which can be used for input regions of any size. When we neglect semantic errors, we can observe four patterns for each prediction output:

• There is a corresponding ground truth (GT) for the prediction output (true positive (TP)).

• The prediction output covers some part of a GT (partial detection (PD)).

• The prediction output contains more than one GT (false merging (FM)).

• There is no corresponding GT (false positive (FP)).

Figure 3 shows a diagram of these four patterns. Note that one prediction output can fulfill more than one of the patterns even though each point corresponds to exactly one GT and one prediction. For example, one prediction output, 90% of which is contained in a GT, can cover other small GTs with the remaining 10%. In particular, PD and FM are characteristic of instance segmentation.

To formulate these patterns, we define the "intersection over a set (IoS)", which describes the part of an object A that is contained in an object B, as follows:

IoS(A, B) = N(A ∩ B) / N(A),    (9)

where N(X) denotes the number of points in X, and object A is considered to be contained in object B when IoS(A, B) exceeds a certain threshold t. Note that IoS(·) is an asymmetric function, and A cannot be contained in more than one object when we set t > 0.5.

Using this IoS, we establish the proposed metrics as follows. We first calculate a map from the GTs to the prediction outputs (gt_pred) that describes which prediction outputs are contained in each GT, and conversely calculate a map from the prediction outputs to the GTs (pred_gt) that describes which GTs are contained in each prediction. Note that gt_pred and pred_gt are not exclusive.
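Equation (9) is straightforward to implement when objects are represented as sets of point indices; a minimal sketch follows (the threshold value 0.55 is only an example satisfying t > 0.5, not a value prescribed by the paper):

```python
def ios(a, b):
    """IoS(A, B) = N(A ∩ B) / N(A): fraction of object a contained in b."""
    return len(a & b) / len(a)

def contained_in(a, b, t=0.55):
    """a is contained in b when IoS(a, b) exceeds t; with t > 0.5, a can
    be contained in at most one object of a partition."""
    return ios(a, b) > t
```

Note the asymmetry: ios(a, b) and ios(b, a) generally differ, which is what lets the metric separate partial detections from false mergings.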
Then, we label each prediction with at least one of the patterns. As for gt_pred, for each GT g, we can obtain the list of prediction outputs contained in g (g_p). If a prediction p on the list also contains g itself, p is considered a TP; otherwise, p is considered a PD because p is a proper subset of g. In contrast, for pred_gt, for each prediction output p, we can obtain the list of GTs contained in p (p_g). If p_g is empty and p was not labeled as a PD in the previous step, p is considered an FP. Otherwise, if a GT g on the list also contains p itself, it must have been labeled as a TP in the previous step owing to the symmetry; a g that does not contain p is considered an FM because g is a proper subset of p in this case.

We define the ratio of TPs to the number of predictions as precision and the ratio of TPs to the number of GTs as recall; we define the F-score as their harmonic mean. We also evaluate the error patterns based on the ratios of PD, FM, and FP to the number of predictions. Because we ignore the semantic segmentation in the calculation, one prediction output can be a TP even if its predicted semantics are incorrect. The proposed metric does not depend on the semantics, confidence, or size of the input regions. Therefore, we can evaluate the pure performance of the instance segmentation. The procedure explained above is written as Algorithm 1.
Algorithm 1: Criteria for instance segmentation

procedure AggregateResults(GTs G, predictions P)
    gt_pred[len(G)][]                      ▷ map from GT to preds
    pred_gt[len(P)][]                      ▷ map from pred to GTs
    for each g in G do
        for each p in P do
            if s(g ∩ p)/s(g) > t then      ▷ g is included in p
                pred_gt[p].append(g)
            if s(g ∩ p)/s(p) > t then      ▷ p is included in g
                gt_pred[g].append(p)
    Summarize(gt_pred, pred_gt)

procedure Summarize(gt_pred, pred_gt)
    results[len(P)][]                      ▷ 2D array to store labels of predictions
    for each g in G do
        for each p in gt_pred[g] do
            if g in pred_gt[p] then
                results[p].append("TP")    ▷ true positive
            else
                results[p].append("PD")    ▷ partial detection
    for each p in P do
        if len(pred_gt[p]) == 0 and results[p] == [] then
            results[p].append("FP")        ▷ false positive
        for each g in pred_gt[p] do
            if p not in gt_pred[g] then
                results[p].append("FM")    ▷ false merging
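Algorithm 1 becomes runnable Python once GTs and predictions are represented as sets of point ids; the sketch below makes that assumption (t = 0.55 is again only an illustrative value above 0.5):

```python
def aggregate_results(gts, preds, t=0.55):
    """Python sketch of Algorithm 1; gts/preds are lists of sets of point ids."""
    gt_pred = [[] for _ in gts]    # predictions contained in each GT
    pred_gt = [[] for _ in preds]  # GTs contained in each prediction
    for gi, g in enumerate(gts):
        for pi, p in enumerate(preds):
            inter = len(g & p)
            if inter / len(g) > t:      # g is included in p
                pred_gt[pi].append(gi)
            if inter / len(p) > t:      # p is included in g
                gt_pred[gi].append(pi)
    return summarize(gt_pred, pred_gt, len(preds))

def summarize(gt_pred, pred_gt, n_preds):
    """Label each prediction with TP / PD / FM / FP."""
    results = [[] for _ in range(n_preds)]
    for gi, g_p in enumerate(gt_pred):
        for pi in g_p:
            # Mutual containment -> TP, otherwise p is a subset of g -> PD.
            results[pi].append("TP" if gi in pred_gt[pi] else "PD")
    for pi, p_g in enumerate(pred_gt):
        if not p_g and not results[pi]:
            results[pi].append("FP")    # no corresponding GT at all
        for gi in p_g:
            if pi not in gt_pred[gi]:
                results[pi].append("FM")  # g is a proper subset of p
    return results
```

Precision is then the fraction of predictions labeled TP, and recall the fraction of GTs matched by a TP.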
5. Experiments
In this section, we conduct experiments to compare our method with existing methods in order to demonstrate its effectiveness and to show that existing evaluation metrics for instance segmentation in small regions are inappropriate. We then clearly distinguish errors of misclassification from errors of splitting instances by using our proposed evaluation metric, which cannot be achieved using existing evaluation metrics such as mAP. Moreover, we evaluate the relationships between the size of the split regions and the instance segmentation performance to validate our assumption that evaluating instance segmentation methods using existing evaluation metrics on small regions is inappropriate.
We use the Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) [1]. S3DIS consists of 270 indoor scenes scanned from six areas, with 13 object categories. We use 203 scenes for training and the remaining 67 scenes for evaluation.

Although PointNet++ [16] and SGPN [22] have been evaluated by splitting the input scene horizontally into small square regions, we conducted additional experiments using larger regions as input. This is because one of our aims is to construct a method that can be applied to wide regions with a greater number of points. During each training iteration, we randomly sample subregions of a fixed size from each scene and then randomly sample a fixed number of points from the sampled subregion as the input. In the following experiments, the input region is square, and the number of points is 4,096 unless otherwise noted. Each point has a nine-dimensional normalized feature consisting of RGB values, relative coordinates in the subregion, and absolute coordinates in the room. For data augmentation, we apply random noise to some of the input features.

The number of objects in the dataset differs significantly among categories. For example, the number of objects in the category with the largest number of objects is 55 times as large as that in the category with the smallest number of objects. To eliminate the effect of this imbalance, we weight the categorization cross-entropy loss such that categories with a small number of points receive a large weight.

Although we can apply our embedding learning method to any feature extractor, we use PointNet (PN) [15] and PointNet++ (PN++) [16] as the feature extractors in our experiments. We use 131-dimensional features, consisting of 128-dimensional features extracted by the feature extractor and three-dimensional RGB features, as the input for the feature embedding.
The 131-dimensional features are then passed through a fully connected layer, which produces 32-dimensional embeddings. We use the Adam [10] optimizer with an initial learning rate of 0.001 and a batch size of 32. We train our network for 6,000 steps, and the learning rate is divided by 10 at the 4,500th step. We set the hyperparameters for our method as δ_v = 0. , δ_d = 0. , and α = β = 0. .

In this section, we evaluate the proposed embedding learning method using existing instance segmentation metrics. We compare our method with SGPN [22] and DFD [21], which have been found to exhibit the highest accuracy for this task. The scores for these two methods are reported in [21]. Following [21], we chose the proposal recall [9] as the evaluation metric and used PointNet as the feature extractor for a fair comparison. The proposal recall is calculated as follows: first, for each GT object, we select the predicted object with the highest intersection over union (IoU), regardless of the category of the object, and consider the output a true positive when the IoU is higher than a certain threshold (we chose a value of 0.5). The ratio of the number of true positives to the number of GTs is then calculated. Because the number of objects per category is unbalanced, we evaluated both the mean of the proposal recall over 12 categories, excluding the 'clutter' class (mean), and the overall proposal recall regardless of the categories (total). Note that the overall proposal recall (total) can be high even if the model overfits some of the categories with many instances and ignores the categories with fewer instances, and thus, it may not be reliable.
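The proposal recall computation described above can be sketched as follows, again representing objects as sets of point ids; this mirrors the description in the text, not the original evaluation code:

```python
def proposal_recall(gts, preds, iou_thresh=0.5):
    """Proposal recall [9] sketch: for each GT, take the prediction with the
    highest IoU (category-agnostic) and count a TP when IoU > iou_thresh."""
    tp = 0
    for g in gts:
        best = max((len(g & p) / len(g | p) for p in preds), default=0.0)
        if best > iou_thresh:
            tp += 1
    return tp / len(gts)
```

Note that, as discussed in the text, this metric never penalizes extra predictions: adding arbitrarily many false-positive proposals can only raise the score, never lower it.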
However, DFD, which does not use category information for training, cannot resolve the imbalance between categories, and thus, it was necessary to add the total proposal recall.

Moreover, to validate our argument that instance segmentation on small regions is substantially semantic segmentation because there is often only one instance in a region, we also evaluated the result obtained using a semantic segmentation model (SemSeg), which never splits objects belonging to the same category. We also report the score obtained when using PointNet++ (PN++) instead of PN as the feature extractor; however, we do not compare PN++ with PN, as this would not be a fair comparison.

Table 1 shows a comparison of our methods with existing methods as well as the obtained semantic segmentation results. We can see that our method with PointNet (PN) outperforms existing methods in terms of the mean proposal recall by a large margin, and the use of PointNet++ leads to a considerably better score. As described earlier, DFD achieves a high total score; however, its mean score is low, which means that the model ignores categories with fewer instances. In addition, for some categories, such as ceiling, floor, and beam, a mere semantic segmentation result achieves a very high score because there is essentially only one instance of each such category. This result supports our argument that semantic segmentation results affect the instance segmentation performance, and that an evaluation on small regions using existing metrics is inappropriate.

Figure 4. Comparison of our method and [21] using the proposed evaluation metrics (precision, recall, F-score, partial detection, false merging, and false positive plotted against the threshold t, for Ours, DFD, and SemSeg)
We then evaluated our method, together with DFD, which outperforms SGPN, using the proposed evaluation metrics. Some predicted objects consist of a very small number of points. Because such predicted objects are often false positives, we set a threshold and use only the predicted objects containing more points than the threshold as targets for evaluation. There is a trade-off between precision and recall, which we introduced in Section 4: as the threshold decreases, precision decreases while recall increases. We fix t = 0. for the IoS defined by Equation 9 and search for the threshold that yields the highest F-score. As a result, we use a threshold of 150 for DFD and 35 for our method. DFD requires the larger threshold, which implies that it outputs noisy small predicted objects that are false positives.

Furthermore, as discussed in Section 5.2, the output of the semantic segmentation model (SemSeg), which never splits objects belonging to the same category, achieves high scores under the existing metrics for instance segmentation. This occurs in categories in which multiple objects seldom exist in a single subregion. We evaluated the results of semantic segmentation using our metrics to examine whether this problem is solved.

We plot the results when varying t of the IoS from 0.5 to 0.95 in Figure 4. Note that precision, recall, and F-score evaluate performance, whereas partial detection, false merging, and false positive represent types of mistakes. Although the semantic segmentation model shows high scores under the existing metrics, particularly for recall, its false merging score is quite high. This is because semantic segmentation outputs at most one prediction per category; it is also why the partial detection score of the semantic segmentation model is low. To the best of our knowledge, this fact is revealed for the first time by the proposed metrics. We can also observe fine-grained patterns and the properties of instance segmentation errors once semantic errors are eliminated, which cannot be obtained using existing evaluation metrics. For example, for DFD, most errors arise from partial detection, whereas false merging is the dominant cause for our method. Such information is useful not only for analyzing and improving a model but also for building an ensemble of models that takes the characteristics of each model into account.

Table 1. Comparison with existing methods ([21, 22]) by proposal recall [%]. Columns: ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, mean, total. SGPN [22]: 67.0, 71.4, 66.8, 54.5, 45.4, 51.2, 69.9, 63.1, …; SemSeg: 95.8, 95.2, 61.7, …

Figure 5. Effect of region size and number of points (proposal recall and F-score plotted against the length of a side, for Ours with 1,024, 4,096, 8,192, and 16,384 points and for DFD with 4,096 points).
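The thresholded evaluation described above can be sketched as follows. This is an illustrative simplification, not the paper's exact procedure: the IoS of Equation 9 is replaced here by plain IoU, the greedy matching order is an assumption, and the function names are hypothetical.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    # Intersection over union between two boolean point masks.
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

def evaluate(pred_masks, gt_masks, t=0.5, min_points=35):
    # Drop tiny predicted objects, which are often noisy false positives.
    preds = [p for p in pred_masks if p.sum() >= min_points]
    matched_gt = set()
    tp = 0
    for p in preds:
        # Each ground-truth instance can be matched at most once.
        scores = [iou(p, g) if i not in matched_gt else 0.0
                  for i, g in enumerate(gt_masks)]
        if scores and max(scores) >= t:
            matched_gt.add(int(np.argmax(scores)))
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gt_masks) if gt_masks else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```

Lowering `min_points` admits more (possibly noisy) predictions, which is the precision/recall trade-off used to pick the per-method thresholds.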
In this section, we analyze the effects of the input region size and the number of points. We varied the number of points from 1,024 to 16,384 and the side length of the square input regions. Note that the instance segmentation results can also be affected by the density of the points. The settings with 1,024, 4,096, and 16,384 points (on correspondingly sized squares) have the same density.

Figure 5 shows the proposal recall and F-score for each setting. Even when the density is the same, the instance segmentation score decreases as the size of the region increases. As discussed in Section 1, when the input region is small, we rarely need to distinguish objects because few objects of the same category appear together. Therefore, the task becomes more difficult as the input region grows, and the apparent score decreases. In particular, the F-score of DFD drops significantly compared with its proposal recall; one reason is that proposal recall does not penalize false positives and thus cannot reveal the weakness of DFD, namely that its instance segmentation results are quite noisy. Moreover, the figure shows that the density of points does not considerably affect instance segmentation performance.

Although we can apply instance segmentation to entire scenes by first applying instance segmentation to each subregion and then integrating the results through a post-processing technique, this approach has a significantly high computational cost because instance segmentation must be repeated on every subregion, making it unsuitable for practical use. Moreover, choosing an appropriate subregion size is difficult, and the integration procedure can add noise to the final result. Therefore, it is desirable to use as large a region as possible for the input.
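The constant-density relation among those settings can be checked with a small sketch. The side lengths used here are illustrative placeholders, since the paper's exact region sizes are not reproduced in this text.

```python
# Point density of a square input region: density = n_points / side**2.
# The side lengths below are hypothetical; only their ratios matter.
def density(n_points, side_length):
    return n_points / side_length ** 2

settings = [(1024, 1.0), (4096, 2.0), (16384, 4.0)]
densities = [density(n, s) for n, s in settings]
# Quadrupling the point count while doubling the side keeps density constant.
print(densities)  # → [1024.0, 1024.0, 1024.0]
```

This is why comparing these settings isolates the effect of region size from that of point density.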
However, this figure implies that the instance segmentation task becomes significantly more difficult when the input size is large; this difficulty has not been adequately investigated in existing work. Handling large regions, such as an entire scene, is a challenging task; however, it is of great importance in applications of instance segmentation.
6. Conclusion
We proposed a new method for instance segmentation on 3D point clouds. Our memory-efficient loss function learns a mapping to an embedding space in which the embeddings form clusters for each object. We experimentally showed that our method outperforms existing methods; it can handle a large number of points and performs well even when consuming large regions. Moreover, we claimed and experimentally demonstrated that existing metrics are not suitable for evaluating instance segmentation because they are considerably affected by the input size and by misclassification. We proposed novel metrics that are unaffected by such external conditions and aid in evaluating instance segmentation performance correctly. Using the proposed metrics, we not only evaluated the instance segmentation task without the influence of external conditions but also analyzed the types of errors each method makes.
7. Acknowledgement
This work was partially supported by JST CREST Grant Number JPMJCR1403 and partially supported by JSPS KAKENHI Grant Number JP19H01115.

References

[1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1534–1543, 2016.
[2] C. Choi, Y. Taguchi, O. Tuzel, M.-Y. Liu, and S. Ramalingam. Voting-based pose estimation for robotic assembly using a 3D sensor. In IEEE International Conference on Robotics and Automation (ICRA), pages 1724–1731. IEEE, 2012.
[3] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–546. IEEE, 2005.
[4] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
[5] B. De Brabandere, D. Neven, and L. Van Gool. Semantic instance segmentation for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 7–9, 2017.
[6] B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3D object recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 998–1005. IEEE, 2010.
[7] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[9] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):814–830, 2016.
[10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[11] M. Koestinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2288–2295. IEEE, 2012.
[12] S. Kong and C. C. Fowlkes. Recurrent pixel embedding for instance grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9018–9028, 2018.
[13] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2359–2367, 2017.
[14] D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi. Semi-convolutional operators for instance segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 86–102, 2018.
[15] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
[16] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[18] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (FPFH) for 3D registration. In IEEE International Conference on Robotics and Automation (ICRA), pages 3212–3217. IEEE, 2009.
[19] R. B. Rusu, N. Blodow, Z. C. Marton, and M. Beetz. Aligning point cloud views using persistent feature histograms. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3384–3391. IEEE, 2008.
[20] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[21] M. Sung, H. Su, R. Yu, and L. J. Guibas. Deep functional dictionaries: Learning consistent semantic structures on 3D models from functions. In Advances in Neural Information Processing Systems, pages 483–493, 2018.
[22] W. Wang, R. Yu, Q. Huang, and U. Neumann. SGPN: Similarity group proposal network for 3D point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2569–2578, 2018.
[23] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification.