4D Panoptic LiDAR Segmentation
Mehmet Aygün, Aljoša Ošep, Mark Weber, Maxim Maximov, Cyrill Stachniss, Jens Behley, Laura Leal-Taixé
Technical University of Munich, Germany — University of Bonn, Germany
{mehmet.ayguen, aljosa.osep, leal.taixe, mark-cs.weber, maxim.maximov}@tum.de, {firstname.lastname}@igg.uni-bonn.de

Figure 1: Types of LiDAR-based scene understanding. Semantic and panoptic segmentation assign semantic classes and determine instances in 3D space. Multi-object tracking encompasses 3D object detection in space, followed by association over time. 4D panoptic LiDAR segmentation jointly tackles semantic and instance segmentation in 3D space over time.
Abstract
Temporal semantic scene understanding is critical for self-driving cars or robots operating in dynamic environments. In this paper, we propose 4D panoptic LiDAR segmentation to assign a semantic class and a temporally-consistent instance ID to a sequence of 3D points. To this end, we present an approach and a point-centric evaluation metric. Our approach determines a semantic class for every point while modeling object instances as probability distributions in the 4D spatio-temporal domain. We process multiple point clouds in parallel and resolve point-to-instance associations, effectively alleviating the need for explicit temporal data association. Inspired by recent advances in benchmarking of multi-object tracking, we propose to adopt a new evaluation metric that separates the semantic and point-to-instance association aspects of the task. With this work, we aim at paving the road for future developments of temporal LiDAR panoptic perception.
1. Introduction
Spatio-temporal interpretation of raw sensory data is important for autonomous vehicles to understand how to interact with the environment and perceive how trajectories of moving agents evolve in 3D space and time.

* Authors contributed equally.
In the past, different aspects of dynamic scene understanding such as semantic segmentation [22, 18, 49, 73, 86, 70], object detection [23, 63, 41, 66, 65, 67], instance segmentation [28], and multi-object tracking [44, 8, 54, 10, 77, 62, 59] have been tackled independently. The developments in these fields were largely fueled by the rapid progress in deep learning-based image [38] and point-set representation learning [60, 61, 73], together with contributions of large-scale datasets, benchmarks, and unified evaluation metrics [45, 22, 25, 19, 17, 75, 26, 5, 18, 13, 69]. In the pursuit of image-based holistic scene understanding, recent community efforts have been moving towards convergence of tasks, such as multi-object tracking (MOT) and segmentation [75, 84], and semantic and instance segmentation, i.e., panoptic segmentation [36]. Recently, panoptic segmentation was extended to the video domain [35]. Here, the dataset, task formalization, and evaluation metrics focused on interpreting short and sparsely labeled video snippets in 3D (2D image + time) in an offline setting. Autonomous vehicles, however, need to continuously interpret sensory data and localize objects in a 4D continuum.

Tackling sequence-level LiDAR panoptic segmentation is a challenging problem, since state-of-the-art methods [73] usually need to downsample even single-scan point clouds to satisfy memory constraints. Therefore, the common approach in (3D) multi-object tracking is detecting objects in individual scans, followed by temporal association [24, 77, 78], often guided by a hand-crafted motion model. In this paper, we take a substantially different approach, inspired by the unified space-time treatment philosophy. We form overlapping 4D volumes of scans (see Fig. 1) and, in parallel, assign to 4D points a semantic interpretation while grouping object instances jointly in 4D space-time.

Importantly, these 4D volumes can be processed in a single network pass, and the temporal association is resolved implicitly via clustering. This way, we retain inference efficiency while resolving long-term association between overlapping volumes based on the point overlap, alleviating the need for explicit data association.

For the evaluation, we introduce a point-centric higher-order tracking metric, inspired by recent metrics for multi-object tracking [46] and concurrent work on video panoptic segmentation [76], which differ from the available metrics [36, 9] that overemphasize the recognition part of the task. Our metric consists of two intuitive terms, one measuring the semantic aspect and the second the spatio-temporal association aspect of the task.
Together with the recently proposed SemanticKITTI [5, 6] dataset, this gives us a test bed to analyze our method and compare it with existing LiDAR semantic/instance segmentation [41, 73, 77, 48] approaches, adapted to the sequence-level domain.

In summary, our contributions are: (i) we propose a unified space-time perspective on the task of 4D LiDAR panoptic segmentation, and pose detection/segmentation/tracking jointly as point clustering, which can effectively leverage the sequential nature of the data and process several LiDAR scans while maintaining memory efficiency; (ii) we adopt a point-centric evaluation protocol that fairly weights the semantic and association aspects of this task and summarizes the final performance with a single number; (iii) we establish a test bed for this task, which we use to thoroughly analyze our model's performance and the existing LiDAR panoptic segmentation methods used in conjunction with a tracking-by-detection mechanism. Our code, experimental data and benchmark are publicly available at https://github.com/mehmetaygun/4d-pls and http://bit.ly/4d-panoptic-benchmark.
2. Related Work
Our work is related to tasks covering different aspects of scene perception, such as semantic segmentation, object detection/segmentation, and tracking. In the following, we review related methods and tasks.
Datasets and Metrics.
The growing interest in autonomous vehicles has sparked interest in scene perception using LiDAR sensors. Here, the progress has been fueled by advances in deep learning on point sets [60, 61, 31, 37, 39, 80, 81, 71, 73, 48] and datasets with standardized benchmarks for 3D semantic/instance segmentation [25, 5] and 3D object detection and multi-object tracking [13, 69]. This confirms the importance of advancing both spatial and temporal aspects of mobile robot perception. Our proposed task formulation and evaluation metric is, to the best of our knowledge, the first that unifies both aspects.

Recent community efforts in the field of image-based perception have been moving towards the convergence of different tasks. For instance, Kirillov et al. [36] proposed to unify semantic and instance segmentation, which they termed panoptic segmentation, together with an evaluation metric, the panoptic quality (PQ). Others proposed to tackle multi-object tracking and instance segmentation (MOTS) in videos jointly [75, 84]. Moreover, [35] recently extended panoptic segmentation to videos – however, the dataset and the evaluation metrics focus on interpreting short and sparse video snippets offline. This is reflected in the evaluation metric, which is essentially PQ evaluated based on the 3D IoU [84] and averaged over temporal windows of varying sizes to compensate for the fact that the difficulty of the task depends on the sequence length. This setting is not suitable for autonomous vehicles that need to interpret raw sensor data continuously. Hurtado et al. [32] propose to combine ideas from MOTA [9] and PQ [36] by adding a penalty related to ID switches to the PQ. Nonetheless, both PQ and MOTA were criticized [58, 46], and the proposed evaluation inherits all of their well-known issues.

In this paper, we propose a different approach and bring ideas recently introduced in the context of benchmarking vision-based multi-object tracking [46] to the domain of sequential LiDAR semantic and instance segmentation. Together with the metric, we also propose an approach that operates directly on spatio-temporal point clouds, providing object instances in space and time.
Point Cloud Segmentation.
Semantic segmentation or point-wise classification of point clouds is a well-known research topic [2]. Traditionally, it was solved using feature extractors in combination with traditional classifiers [1] and conditional random fields to enforce label consistency of neighboring points [74, 51, 82]. The availability of large-scale datasets, such as S3DIS [3], Semantic3D [27], and recently SemanticKITTI [5], made it possible to also investigate end-to-end pipelines [40, 49, 73, 30, 86, 64, 61, 60, 48]. Similar to recent trends in RGB-D [34, 20] and LiDAR segmentation [79], our method performs bottom-up point grouping in a data-driven fashion. However, different from the aforementioned, we perform grouping in 3D space and time. We use the backbone by [73] that applies deformable point convolutions directly on the point clouds. In our case, this empirically performed better compared to backbones specifically designed for point sequences [15, 64].

Multi-Object Tracking and Segmentation.
The majority of vision-based MOT methods follow tracking-by-detection [53]. Here, the idea is to first run a pre-trained object detector independently in each video frame and then associate detections across time. In the past, there was a strong focus on developing robust and, preferably, globally optimal methods for data association [85, 43, 57, 11, 12]. Recent data-driven trends mainly focus on learning to associate detections [42, 68] or to regress targets [8], often in combination with end-to-end learning [16, 83, 10].

In the realm of robot vision, it is critical to localize object trajectories in 3D space and time. Early methods localized monocular detections in 3D, e.g., using stereo [44, 54, 21], or performed tracking in a category-agnostic manner by first performing bottom-up segmentation based on spatial proximity, followed by point-segment association [72, 29]. Recently, LiDAR-based MOT has become a popular task, thanks to the emergence of reliable 3D object detectors [66, 41] and LiDAR-centric datasets [13, 69]. Weng et al. [77] demonstrated that even simple methods based on linear assignment and constant-velocity motion models can perform surprisingly well when object detections are localized reliably in 3D space. Our method departs from 3D object detection in the spatial domain, followed by detection association in the temporal domain. Instead, we follow recent advances in image [52, 14] and video instance segmentation [4]. We localize possible object instance centers within a 4D volume and associate points to the estimated centers in a bottom-up manner, while a semantic branch assigns semantic classes to points.
3. Method
In this paper, we propose a method and a metric for the 4D Panoptic LiDAR Segmentation task, which tackles LiDAR semantic segmentation and instance segmentation jointly in the spatial and temporal domain. Given a sequence of LiDAR scans, the goal of this task is to predict for each 3D point (i) a semantic label for both stuff and thing classes, and (ii) a unique, identity-preserving object instance ID that should persist over the whole sequence.
In this work, we take a different path compared to the tracking-by-detection paradigm used in video-instance and video-panoptic segmentation [75, 84, 35, 33]. We pose 4D panoptic segmentation as two joint processes. The first one is responsible for point grouping in the 4D continuum using clustering, while the second assigns a semantic interpretation to each point.

We provide an overview of our method in Fig. 2. In a nutshell, we first form 4D point clouds from several consecutive LiDAR scans. In parallel, within a single network pass, we localize the most likely object centers (inspired by point-based tracking methods by [87, 14]) in the sequence (objectness map O), assign semantic classes to points (semantic map S), and compute per-point embeddings (embedding map ε) and variances (variance map Σ). The clustering can be performed efficiently by evaluating the probability of each 4D point belonging to a certain "seed" point, which is similarly performed in the context of image and video segmentation [52, 4]. Finally, to associate 4D sub-volumes, we examine point intersections between overlapping point volumes.

Figure 2: Visualization of our method. We sample points from past scans to form a 4D point cloud. Our encoder-decoder network estimates a point objectness map (O), a point variance map (Σ), and point embeddings (ε). We use these maps to assign points to their respective instances via density-based clustering in a 4D continuum. We obtain the semantic interpretation from the semantic decoder (S).
4D Volume Formation.
During inference and training, we form overlapping 4D point cloud volumes in an online setting. In particular, for scan t and temporal window size τ, we align together the point clouds within the temporal window {max(0, t − τ), ..., t} using ego-motion estimates provided by a SLAM approach [7]. Our experiments in Sec. 4.1 reveal that processing multiple point clouds significantly improves spatial and temporal point association performance. However, due to the linear growth in memory requirements, stacking point clouds along the temporal dimension is prohibitively expensive. To overcome this issue, we build on the intuition that thing classes are most critical for a stable temporal association, since these classes correspond to potentially moving objects. As we operate in an online setting, where the past scans have already been processed, we can sample points that belong to thing classes or lie near object centers from earlier scans.
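The scan alignment described above can be sketched as follows. This is a minimal illustration, assuming each scan is an (N_i, 3) array of xyz coordinates and that a SLAM system such as [7] provides a 4x4 ego-pose per scan; function and variable names are ours, not the released implementation's.

```python
import numpy as np

def build_4d_volume(scans, poses, cur_idx):
    """Align past scans to the current scan's frame and stack them with a time channel.

    scans: list of (N_i, 3) arrays of xyz points; scans[cur_idx] is the current scan.
    poses: list of 4x4 ego-poses (world <- sensor), e.g. from a SLAM system [7].
    Returns an (N, 4) array of (x, y, z, t) points in the current sensor frame.
    """
    T_world_cur = poses[cur_idx]
    chunks = []
    for i, (pts, T_world_i) in enumerate(zip(scans, poses)):
        # Rigid transform that maps scan i into the coordinate frame of the current scan.
        T_cur_i = np.linalg.inv(T_world_cur) @ T_world_i
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])           # homogeneous coordinates
        pts_cur = (pts_h @ T_cur_i.T)[:, :3]
        t = np.full((len(pts), 1), i - cur_idx, dtype=np.float64)  # relative timestamp
        chunks.append(np.hstack([pts_cur, t]))
    return np.vstack(chunks)
```

In practice, the past scans would first be sub-sampled (see the point propagation strategies in the supplementary) before being stacked this way.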
Density-based Clustering.

We model object instances via Gaussian probability distributions. Given an estimate of the object center, i.e., the clustering "seed" point, we can assign points to their respective instance by evaluating each point under the Gaussian pdf based on the point's embedding vector. The estimated centers do not need to correspond to exact object centers but are merely used to initiate the clustering. Thus, our approach is in practice fairly robust to occlusions and cross-time view changes. We note that the Gaussian assumption is only valid for shorter temporal windows. In particular, given a point p_i representing the instance center with its embedding vector e_i, and a query point p_j with its embedding vector e_j, we can evaluate the probability of point p_j belonging to its center "seed" point p_i as:

\hat{p}_{ij} = \frac{1}{\sqrt{(2\pi)^D |\Sigma_i|}} \exp\left(-\frac{1}{2}(e_i - e_j)^\top \Sigma_i^{-1} (e_i - e_j)\right),   (1)

where Σ_i is a diagonal matrix constructed using the variance prediction σ_i of point p_i. We concatenate the coordinate values (x, y, z, t) with the learned point embedding vectors to combine spatial and temporal coordinates with learned embeddings. We account for these additional dimensions during the training of the variance map.
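A direct way to read Eq. 1 is as a per-point evaluation of a diagonal-covariance Gaussian centred at the seed's embedding. The sketch below assumes PyTorch tensors and keeps the normalizer written in Eq. 1; dropping it (as in related embedding-clustering work [52, 4]) or evaluating in log-space are common practical variants.

```python
import math
import torch

def instance_probability(e_seed, var_seed, e_points):
    """Evaluate Eq. 1 for one seed against a set of candidate points.

    e_seed:   (D,) embedding of the clustering seed point (e_i).
    var_seed: (D,) predicted variances, i.e. the diagonal of Sigma_i.
    e_points: (N, D) embeddings of candidate points (e_j).
    Returns (N,) probabilities p_ij.
    """
    D = e_seed.shape[0]
    diff = e_points - e_seed                                   # (N, D)
    maha = (diff * diff / var_seed).sum(dim=1)                 # Mahalanobis term, diagonal Sigma
    norm = torch.sqrt((2.0 * math.pi) ** D * torch.prod(var_seed))
    return torch.exp(-0.5 * maha) / norm
```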
Network and Training.

To perform such clustering, we need to identify the most likely instance centers, i.e., "seed" points, in a 4D point cloud. We also need variance predictions for each point to evaluate probability scores during clustering, and a posterior over all semantic classes.

We estimate all these quantities using an encoder-decoder architecture that operates directly on the 4D point cloud P ∈ R^{N×4}. The encoder network is based on the KPConv [73] backbone that uses deformable point convolutions. The decoder predicts point-wise feature embeddings ε ∈ R^{N×D} using consecutive point convolutions. On top of the encoder, we add an object centerness decoder (R^{N×1}), a point variance decoder (R^{N×D}), and a semantic decoder (R^{N×C}). We train our network in an end-to-end manner and in an online fashion.

To train the semantic decoder, we use the cross-entropy classification loss L_{class}. As the semantic classes are highly imbalanced, we sample points to ensure that the probability of sampling a point from a certain class is roughly uniform.

To learn the point centerness and point variance, we use three different losses. First, we impose the mean squared error (MSE) loss to train the object centerness decoder. Due to the sparsity of the LiDAR signal, there will generally be no points near the actual object centers, unlike in the image and video domain [52, 4]. Therefore, instead of predicting per-point centerness, we predict the proximity of the point to its instance center. We compute for each point p_i the point objectness o_i as the Euclidean distance between the point and its instance center, i.e., the mean point of all instance points, normalized to the range [0, 1]. This objectness o_i is then compared to the regressed objectness score \hat{o}_i:

L_{obj} = \sum_{i=1}^{N} (\hat{o}_i - o_i)^2, \quad \hat{o}_i, o_i \in [0, 1].   (2)

Since we want the embeddings of instances to form clusters in the spatio-temporal domain, we introduce our instance loss. Given a 4D point cloud of N points and K instances, it is defined as:

L_{ins} = \sum_{j=1}^{K} \sum_{i=1}^{N} (\hat{p}_{ij} - p_{ij})^2, \quad p_{ij} = \begin{cases} 1, & \text{if } p_i \in I_j \\ 0, & \text{otherwise,} \end{cases}   (3)

where \hat{p}_{ij} is evaluated under the Gaussian pdf (Eq. 1) with the point embedding e_i as well as the instance embedding and variance e_j and σ_j. In addition, we employ a variance smoothness loss L_{var}, similar to [4, 52], for training the variance decoder. In summary, we use four different losses to train our network in an end-to-end manner: L = L_{class} + L_{obj} + L_{ins} + L_{var}.
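A compact sketch of the objectness loss (Eq. 2) and the instance loss (Eq. 3) follows. The reduction (sum vs. mean over points) and any loss weighting are implementation details we do not fix here.

```python
import torch

def objectness_loss(obj_pred, obj_target):
    # Eq. 2: squared error between regressed and ground-truth point objectness, both in [0, 1].
    return ((obj_pred - obj_target) ** 2).sum()

def instance_loss(prob_pred, point_inst_ids, seed_inst_ids):
    """Eq. 3: push p_ij towards 1 for points of instance j and towards 0 otherwise.

    prob_pred:      (K, N) probabilities of every point under every instance Gaussian (Eq. 1).
    point_inst_ids: (N,) ground-truth instance id of each point.
    seed_inst_ids:  (K,) ground-truth instance id the corresponding row of prob_pred refers to.
    """
    targets = (point_inst_ids[None, :] == seed_inst_ids[:, None]).float()  # (K, N) binary p_ij
    return ((prob_pred - targets) ** 2).sum()
```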
Inference.

We resolve point-to-instance associations in two stages, first within a processed 4D volume and then across volumes. First, based on the point cloud centerness map, we select the point p_i which has the highest objectness score. Then, we evaluate the probabilities \hat{p}_{ij} for all candidate points and assign them to the cluster in case \hat{p}_{ij} > 0.5. The assigned points are then removed from the candidate pool. We repeat these steps until the next highest objectness score is below a certain threshold. To transfer identities across processed 4D volumes, we perform cross-volume association greedily based on the overlap score, taking all scans into account. When the overlap is below a threshold, we assign a new ID.
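The within-volume clustering can be sketched as a greedy loop; the threshold values below are illustrative stand-ins for the ones described above, and prob_fn is assumed to implement Eq. 1 on numpy arrays.

```python
import numpy as np

def greedy_cluster(objectness, embeddings, variances, prob_fn,
                   obj_thresh=0.1, prob_thresh=0.5):
    """Greedily grow instances around the highest-objectness seed points in one 4D volume.

    Returns an (N,) array of instance ids; 0 means "not assigned to any thing instance".
    """
    instance_ids = np.zeros(len(objectness), dtype=np.int64)
    remaining = np.arange(len(objectness))
    next_id = 1
    while remaining.size > 0:
        seed = remaining[np.argmax(objectness[remaining])]
        if objectness[seed] < obj_thresh:
            break                                    # no confident seeds left
        probs = prob_fn(embeddings[seed], variances[seed], embeddings[remaining])
        members = remaining[probs > prob_thresh]
        if members.size == 0:
            members = np.array([seed])               # always consume at least the seed
        instance_ids[members] = next_id
        next_id += 1
        remaining = np.setdiff1d(remaining, members, assume_unique=True)
    return instance_ids
```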
The central question when proposing a novel task and benchmark is how to evaluate and compare different methods. Preferably, we would like to summarize performance with a single number to rank the methods while retaining the capability of looking at different aspects of the task.

To motivate our approach to evaluation, we first briefly discuss established metrics for image-based panoptic segmentation (PQ [36]) and multi-object tracking and segmentation (MOTSA/MOTSP [9, 75]). Then, we discuss two recently proposed extensions of PQ to the temporal domain and argue why we do not promote their adaptation for the task of 4D LiDAR panoptic segmentation.
Segment-centric Evaluation.
PQ and MOTSA/MOTSP are instance-centric evaluation metrics. Both first determine a unique matching between sets of ground-truth objects and model predictions for each frame individually to determine true positives (TPs), false positives (FPs), and false negatives (FNs). Both metrics provide measures for the segmentation and recognition aspects of the task. The segmentation quality (SQ) term of PQ and MOTSP integrates IoUs over the set of TPs and normalizes it by the size of the TP set. The recognition quality (RQ) term of PQ is expressed as the F1 score. Similarly, MOTSA combines detection errors (FNs and FPs) with an ID switch (IDSW) penalty in a single term. An IDSW occurs when a track is lost and the tracker assigns a new identity to a tracked object. This is the only term that takes the temporal aspect of the task into account.

A criticism of PQ is that it over-emphasizes the importance of very small segments, and stuff classes can be difficult to match [58]. MOTSA overemphasizes the detection compared to the association aspect, and it is nonintuitive, since the score can be negative and is unbounded, as can be seen in Sec. 4. Furthermore, the influence of ID switches on the final score depends on the frame rate, and MOTSA does not reward trackers that recover from incorrect associations. Importantly, both metrics are sensitive to the selection of the matching threshold. Thus, instances that slightly miss this threshold will cause both a FN and a FP. This is not the case for pixel- or point-centric metrics used for evaluating semantic segmentation. The standard mean IoU (mIoU) metric [22] computes the sets of TPs, FPs, and FNs on a per-pixel (or per-point) basis, effectively bypassing the segment matching.

PQ Extensions.
Recent work [35] proposes video panoptic quality, a variant of PQ for the sequential domain. Different from PQ, the gt-to-prediction mapping is established based on the sequential IoU matching criterion proposed in the context of video instance segmentation [84]. As objects are not present throughout the clip and the difficulty of the task critically depends on the length of the temporal window, the final metric is averaged over varying temporal window sizes. This is suitable for the setting defined in [35], where the task is to evaluate short, sparsely labeled video snippets. However, this approach does not scale to real-world sequences of arbitrary length. Another extension to PQ, panoptic tracking quality (PTQ) [33], combines MOTA and PQ by adding an ID penalty to the PQ measure. This approach inherits issues from both the PQ and MOTSA metrics.
In the following, we assume a sequence of 3D point clouds of length l, sampled at discrete time-steps: Ω = {(p, n) ∈ R^3 × N | n < l}. We define the ground-truth assignment function as gt(p, n) → (c, id) and a prediction function as pr(p, n) → (c, id), which map each 4D tuple, consisting of a point p and a timestamp n, to a certain class c and identity id. In the following, we devise an evaluation metric that, for each pair (p, n), evaluates whether (i) it was assigned to the correct class, and (ii) for the thing classes, whether it was assigned to the correct object instance. Inspired by the recently introduced Higher Order Tracking Accuracy (HOTA) [46], proposed in the context of MOT, and concurrent work on video panoptic segmentation proposing the Segmentation and Tracking Quality (STQ) [76], our LSTQ (LiDAR Segmentation and Tracking Quality) consists of two terms, the classification score S_cls and the association score S_assoc.

We adopt a fundamentally different evaluation philosophy compared to other metrics [46, 36, 35, 32]. In particular, we drop the concept of frame-level "detection" and do not match segments between ground truth and prediction. Our association score measures point-to-instance association quality in a unified way – in space and time at the point level, which is more natural for segmentation tasks.
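For concreteness, the metric only needs two per-point label assignments per sequence, one for the ground truth and one for the prediction. The layout below is illustrative and is not the benchmark's actual file format; class and instance ids are made up.

```python
import numpy as np

# Every 4D point (p, n) of a sequence carries a semantic class and, for "thing"
# points, an instance id; 0 marks "stuff" / unassigned points.
gt_class = np.array([1, 1, 1, 9, 9])   # e.g. 1 = car, 9 = road
gt_id    = np.array([7, 7, 7, 0, 0])   # all three car points belong to instance 7
pr_class = np.array([1, 1, 9, 9, 9])   # one car point misclassified as road
pr_id    = np.array([3, 3, 0, 0, 0])   # the predicted car instance received id 3
```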
Classification Score.

For the classification score, we first define instance-agnostic ground-truth and prediction sets:

gt_{agn}(c) = {(p, n) | gt(p, n) = (c, *)},
pr_{agn}(c) = {(p, n) | pr(p, n) = (c, *)},

representing the ground-truth and predicted points that belong to class c, irrespective of their assigned ids. Then, the TP, FP, and FN sets are computed as in standard semantic segmentation evaluation with respect to the ground-truth class c and the predicted class c':

TP_c = |pr_{agn}(c) ∩ gt_{agn}(c)|,
FP_c = |pr_{agn}(c) − pr_{agn}(c) ∩ gt_{agn}(c)|,
FN_c = |gt_{agn}(c) − pr_{agn}(c) ∩ gt_{agn}(c)|.

The classification score then simply boils down to the intersection-over-union (IoU) over these sets, which is the standard approach for evaluating semantic segmentation (however, this is different from segment-centric PQ, where points contribute to the TP_c term only if the segment they belong to is matched). We follow the standard procedure and average over the classes:

S_{cls} = \frac{1}{|C|} \sum_{c=1}^{|C|} \frac{|TP_c|}{|TP_c| + |FN_c| + |FP_c|} = \frac{1}{|C|} \sum_{c=1}^{|C|} IoU(c).
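A point-level sketch of S_cls follows; note that no segment matching is involved. How classes without any ground-truth or predicted points are handled (here: counted as 0) is a convention of this sketch, not taken from the benchmark code.

```python
import numpy as np

def classification_score(gt_class, pr_class, classes):
    """S_cls: per-class point IoU, averaged over all classes."""
    ious = []
    for c in classes:
        gt_c, pr_c = gt_class == c, pr_class == c
        tp = np.logical_and(gt_c, pr_c).sum()
        fp = np.logical_and(~gt_c, pr_c).sum()
        fn = np.logical_and(gt_c, ~pr_c).sum()
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else 0.0)
    return float(np.mean(ious))
```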
Association Score.

To evaluate the association score, we introduce the following class-agnostic predictions and ground truth for the thing classes:

gt_{id}(id) = {(p, n) | gt(p, n) = (c, id), c ∈ things},
pr_{id}(id) = {(p, n) | pr(p, n) = (c, id), c ∈ things}.

We define the true positive association (TPA) set between a ground-truth object t with identity id and a prediction s that was assigned identity id'. This gives us the set of points with mutually consistent identities id and id' over the whole 4D volume:

TPA(id, id') = |pr_{id}(id') ∩ gt_{id}(id)|.   (4)

Analogously, we define the set of false positive associations:

FPA(id, id') = |pr_{id}(id') − pr_{id}(id') ∩ gt_{id}(id)|.   (5)

Intuitively, this set contains predicted point assignments with identity id' that were assigned a different ground-truth identity (≠ id), or were not assigned to a valid object instance. Finally, the set of false negative assignments:

FNA(id, id') = |gt_{id}(id) − pr_{id}(id') ∩ gt_{id}(id)|   (6)

contains ground-truth points with identity id that were assigned an identity different from id', or were missed. We note that the concept of TPA, FPA, and FNA was first introduced in the context of MOT evaluation for measuring the quality of temporal detection association. There, to establish these sets, a bijective mapping between ground truth and prediction needs to be established (as in the case of [9]). However, in LSTQ, these sets are established with respect to each 4D point, treating association in space and time in a unified manner.

Once we have quantified these sets, we can evaluate how well a predicted segment s agrees with a ground-truth segment t. Because a ground-truth segment t may be explained by multiple different predictions, we sum the contributions of all pairs with non-zero overlap:

S_{assoc} = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|gt_{id}(t)|} \sum_{s \in S,\, s \cap t \neq \emptyset} |TPA(s, t)| \, IoU(s, t),   (7)

where the IoU term is evaluated using the TPA, FNA, and FPA sets (Eq. 4, 6, 5). In practice, we do not need to perform any point-segment association, and even a prediction with a single common point will contribute to this term. We normalize these contributions by the tube volume, and we weigh each contribution by the volume of the TPA set. This weighting term ensures that instances with larger temporal spans have a higher contribution to the final score. Finally, our metric is computed as the geometric mean of the two terms: LSTQ = \sqrt{S_{cls} \times S_{assoc}}. The advantage over the arithmetic mean is that the final score becomes zero if either of the two terms approaches zero. This reflects our intuition that failing at either of the two aspects of the task should yield a very low final score.

LSTQ tolerates different semantic predictions within a spatio-temporal segment by design (the IoU term in Eq. 7 is evaluated in a class-agnostic manner). Following STQ [76], we decouple semantic and association errors; otherwise, e.g., a truck mistaken for a bus would be harshly penalized by the association term, even though it was tracked correctly. This behavior, which disentangles association and classification errors, is different from MOTSA/PTQ/VPQ, where semantics and temporal association are entangled.
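A point-level sketch of Eq. 7 and of the final LSTQ is given below. Here gt_id and pr_id are per-point instance-id arrays over the whole sequence, with 0 marking points that do not belong to a thing instance; this is a simplification of the class-agnostic sets defined above.

```python
import numpy as np

def association_score(gt_id, pr_id):
    """S_assoc (Eq. 7): unified space-time association quality at the point level."""
    gt_tubes = np.unique(gt_id[gt_id > 0])
    if gt_tubes.size == 0:
        return 0.0
    score = 0.0
    for t in gt_tubes:
        gt_mask = gt_id == t
        inner = 0.0
        for s in np.unique(pr_id[gt_mask]):               # predictions with non-zero overlap
            if s == 0:
                continue
            pr_mask = pr_id == s
            tpa = np.logical_and(gt_mask, pr_mask).sum()   # Eq. 4
            iou = tpa / np.logical_or(gt_mask, pr_mask).sum()
            inner += tpa * iou
        score += inner / gt_mask.sum()                     # normalize by the tube volume
    return score / gt_tubes.size

def lstq(s_cls, s_assoc):
    # Final metric: geometric mean of the classification and association terms.
    return float(np.sqrt(s_cls * s_assoc))
```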
4. Experimental Evaluation
In this section, we first evaluate different strategies for forming 4D point cloud volumes, assess the impact of processing multiple scans on the final performance, and discuss several possibilities for designing the embedding function used for point grouping. We compare our method to several baselines for single-scan LiDAR panoptic segmentation [6] and 4D panoptic LiDAR segmentation, obtained by extending existing methods to the sequential domain.

We use the SemanticKITTI [5] LiDAR dataset to conduct our experiments. It contains sequences from the KITTI odometry dataset [25] and provides point-wise semantic and temporally consistent instance annotations [6]. We use the training/validation/test split of SemanticKITTI [5, 6]. We perform all ablation studies on the validation split and interpret results through the lens of the LSTQ metric (Sec. 3.2.2).

Point Propagation.
As discussed in Sec. 3, we cannot simply stack point clouds along the temporal dimension due to memory constraints. We build on the intuition that we can subsample a set of points from the past scans that are most beneficial for the end-task performance. As we are operating in an online setting, and the past scans have already been processed, we can leverage predictions from the past scans.

In this experiment, we discuss different temporal point sampling strategies for varying temporal window sizes τ. In the thing-propagation strategy, we exclusively sample points which are assigned to a thing class, as they represent only a small fraction of all points. In the importance sampling strategy, we sample 10% of the points with a probability proportional to the objectness. This way, we focus on points likely to represent thing classes, while still allowing to propagate points belonging to stuff classes, which can aid the semantic segmentation part of the task. Similarly, the temporal decay sampling uses the objectness score as a deciding factor, but we decay the number of sampled points based on the proximity to the current scan. Finally, the strided sampling samples points with a stride of 2 along the temporal dimension.

Strategy        τ   LSTQ    S_assoc  S_cls   IoU_St  IoU_Th  Mem.
Base            1   51.92   45.16    59.69   64.60   60.40   1x
Thing-prop.     2   59.20   58.71

Table 1: Ablation study on point sampling strategies for building 4D point cloud volumes with respect to different temporal window sizes.

As can be seen in Tab. 1, the importance sampling strategy yields higher performance compared to sampling only thing classes, at a slightly increased memory cost. As expected, this approach improves association quality and aids semantics, as it also propagates points representing stuff classes. Interestingly, even a small temporal window drastically improves the performance compared to the single-scan baseline, at negligible additional memory consumption. We observe the largest gains when the scans are temporally close: our 4-scan baseline clearly improves the LSTQ over the single-scan baseline. The association term benefits more from processing multiple scans than the segmentation term. This confirms that our model learns to exploit temporal cues very well. While temporal decay sampling aids the semantic and temporal aspects, introducing a temporal stride of 2 yields the highest performance gains for semantic point classification. However, denser sampling in the temporal domain benefits association, which is why, in the following experiments, we focus on the importance sampling strategy with τ = 4. In Tab. 7 (supplementary), we highlight the performance of this approach for a range of temporal window sizes. As can be seen, the association accuracy increases up to τ = 4 and then saturates, while classification accuracy saturates at τ = 2; however, it only decreases marginally.

Embedding Design.
In this experiment, we study different design decisions to formulate the point embeddings for clustering and show our findings in Tab. 2. We investigate the base performance of using only 3D spatial (xyz) and 4D spatio-temporal point coordinates (xyzt), and of using only learned embeddings (Emb.). Next, we investigate the performance of the coordinate mixing formulation that combines learned embeddings with 3D spatial and 4D spatio-temporal coordinates. As can be seen, the variant in which we combine both yields the best results, not only in terms of S_assoc, but also S_cls. This shows that a well-designed embedding branch has a positive effect on learning the backbone features. Note that for the baseline that uses only spatio-temporal coordinates, we still use our fully trained network.

Table 2: Embedding design ablation.

Single-scan Prediction.
First, we evaluate our method using the standard single-scan LiDAR panoptic segmentation setting [5, 6] to demonstrate the effectiveness of our network solely in the spatial domain. We use points from single scans during training and testing. We follow the standard evaluation protocol and compare to published and peer-reviewed methods. As can be seen from Tab. 5, our method achieves state-of-the-art results on all metrics for semantic and panoptic segmentation [36, 5, 6]. The first two entries use two different networks for object detection and semantic segmentation, followed by a fusion of the results. We use a single network to obtain both semantic and instance segmentation of the point cloud in a single network pass. We note that the recently proposed Panoptic RangeNet [48] and RangeNet++ [49] combined with the PointPillars [41] detector operate on the range image and not the point cloud, and therefore use a different backbone. However, KPConv combined with PointPillars uses the same backbone as our method.
4D Panoptic Segmentation.
For the evaluation in the multi-scan setting for the 4D panoptic segmentation task, we extend all single-scan methods reported in Tab. 5, except for the Panoptic RangeNet [48], as we were not able to obtain point cloud predictions from the authors. We adapt them to the sequential domain using two strategies. AB3DMOT [77] uses a constant-velocity motion model to obtain track predictions, associated with object detections based on 3D bounding box overlap. The second strategy, Scene Flow Propagation (SFP), is inspired by standard baselines that perform optical flow warping of points, followed by mask-IoU based association. This approach is commonly used in the domain of vision-based video object segmentation [47], video instance segmentation [84], and multi-object tracking and segmentation [55, 56, 75]. Instead of optical flow, we use the state-of-the-art LiDAR scene flow by [50]. We outline our results, obtained on the test set, in Tab. 6. As can be seen, the baseline that uses KPConv [73] to obtain per-point classification, the PointPillars (PP) detector [41], and a network for point cloud propagation (SFP [50]) performs slightly better in terms of association accuracy compared to the standard 3D multi-object tracking baseline. Our method that unifies all three aspects in a single network outperforms all tracking-by-detection baselines by a large margin, including our single-scan baseline, even though we are using a single network. This confirms the importance of tackling all three aspects of these tasks in a unified manner. An important contribution of our paper is the finding that even when processing smaller overlapping sub-sequences with our network (and resolving intra-window associations with a simple overlap-based approach), we perform significantly better compared to single-scan baselines that use more elaborate association techniques (e.g., a Kalman filter), as can be confirmed in Tab. 6.

Method                           LSTQ   S_assoc  S_cls  IoU_St  IoU_Th  sPTQ   PTQ    sMOTSA  MOTSA
RangeNet++ [49] + PP + MOT       43.76  36.28    52.78  60.49   42.17   34.58  33.83  -7.88   -4.57
KPConv [73] + PP + MOT           46.27  37.58    56.97  64.21   54.13   39.13  38.11  -6.16   -2.41
RangeNet++ [49] + PP + SFP       43.38  35.66    52.78  60.49   42.17   35.83  35.46  -3.13   -0.01
KPConv [73] + PP + SFP           45.95  37.07    56.97  64.21   54.13   41.44  41.05   2.83    6.1
MOPT [32]                        24.80  11.73    52.41  62.37   45.27   41.82  42.39  12.88   17.07
Our (single scan) + MOT          51.92  45.16    59.69  64.60   60.40   48.36  47.84   6.65   12.69
Our (single scan) + SFP          45.45  34.61    59.69  64.60   60.40   48.24  47.72   3.01    7.93
Ours (2 scans)                   59.86  58.79
Ours (4 scans)

Table 3: 4D Panoptic (validation set). MOT – tracking-by-detection method by [77]; SFP – tracking-by-detection via scene flow based segment propagation [50]; PP – PointPillars [41] detector.

Motorcycle     255   0.01
  2 scans:     209   151    46    31   0.58  0.82   0.11  0.56  0.88
  4 scans:     231   747    24     9   0.24  0.91  -2.06  0.81  0.74
Other-vehicle  2138  0.06
  2 scans:     778   362   1360   162  0.68  0.36   0.12  0.17  0.56
  4 scans:     1022  1131  1116    99  0.47  0.48  -0.10  0.38  0.55

Table 4: Per-class evaluation on the SemanticKITTI validation set (2- and 4-scan versions).
Metric Insights.
In this section, we analyze the performance on the validation split (Tab. 3) through the lens of several evaluation metrics and analyze the per-class performance (Tab. 4). Our method outperforms all baselines with respect to all metrics. However, while our 4-scan variant performs better than the 2-scan variant in terms of LSTQ, we observe a significant drop in the MOTSA score. Our analysis shows that this is due to the fact that we obtain negative MOTSA scores on some classes due to a decrease in precision, while having fewer ID switches (see Tab. 4 and Tab. 8). We visualize such a case in Fig. 3. As can be seen, the difference is due to the semantic interpretation of the points and not due to the segmentation and tracking quality at the instance level. This confirms the nonintuitive behavior of MOTSA, while our metric provides insights on both semantic interpretation and instance segmentation and tracking. For a more detailed discussion, we refer to the supplementary (Sec. B.2).

Figure 3: The performance of our model with the 2- and 4-scan versions. While both models track the instance correctly, MOTSA scores differ drastically due to a slight difference in the semantic segmentation predictions.

Method                                   PQ    PQ†   SQ    RQ    mIoU
RangeNet++ [49] + PointPillars [41]      37.1  45.9  75.9  47.0  52.4
KPConv [73] + PointPillars [41]          44.5  52.5  80.0  54.4  58.8
Panoptic RangeNet [48]                   38.0  47.0  76.5  48.2  50.9
Our method (single scan)

Table 5: Single scan Panoptic Segmentation (test set).

Method                           LSTQ   S_assoc  S_cls  IoU_St  IoU_Th
RangeNet++ [49] + PP + MOT       35.52  24.06    52.43  64.52   35.82
KPConv [73] + PP + MOT           38.01  25.86    55.86  66.90   47.66
RangeNet++ [49] + PP + SFP       34.91  23.25    52.43  64.52   35.82
KPConv [73] + PP + SFP           38.53  26.58    55.86  66.90   47.66
Our (single scan) + MOT          40.18  28.07    57.51  66.95   51.50
Our (single scan) + SFP          43.88  33.48    57.51  66.95   51.50
Ours (multi scan)                56.89  56.36    57.43  66.86   51.64

Table 6: 4D Panoptic (test set). MOT – tracking-by-detection by [77]; SFP – tracking-by-detection via scene flow based propagation [50]; PP – PointPillars [41].

5. Conclusion
In this paper, we extended LiDAR panoptic segmentation to the temporal domain, resulting in the 4D Panoptic Segmentation task. We presented an evaluation metric suitable for analyzing performance on this task and proposed a new model. Importantly, we have shown that a single model tackling semantic point classification and point-to-instance association jointly in space and time substantially outperforms methods that independently tackle these aspects. We hope that our unified view and model, accompanied by a public benchmark, will pave the road to future developments in this field.
Acknowledgements.
This project was funded by the Humboldt Foundation through the Sofja Kovalevskaja Award and the EU Horizon 2020 research and innovation programme under grant agreement No. 101017008 (Harmony). We thank Ismail Elezi and the whole DVL group for helpful discussions.
References

[1] Anuraag Agrawal, Atsushi Nakazawa, and Haruo Takemura. MMM-classification of 3D Range Data. In
ICRA , 2009. 2[2] Dragomir Anguelov, Ben Taskar, Vassil Chatalbashev,Daphne Koller, Dinkar Gupta, Geremy Heitz, and AndrewNg. Discriminative Learning of Markov Random Fields forSegmentation of 3D Scan Data. In
CVPR , pages 169–176,2005. 2[3] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioan-nis Brilakis, Martin Fischer, and Silvio Savarese. 3D Seman-tic Parsing of Large-Scale Indoor Spaces. In
CVPR , 2016. 2[4] Ali Athar, Sabarinath Mahadevan, Aljosa Osep, Laura Leal-Taix´e, and Bastian Leibe. Stem-seg: Spatio-temporal em-beddings for instance segmentation in videos. In
ECCV ,2020. 3, 4[5] Jens Behley, Martin Garbade, Andres Milioto, Jan Quen-zel, Sven Behnke, Cyrill Stachniss, and Juergen Gall. Se-manticKITTI: A Dataset for Semantic Scene Understandingof LiDAR Sequences. In
ICCV , 2019. 1, 2, 6, 7[6] Jens Behley, Andres Milioto, and Cyrill Stachniss. A Bench-mark for LiDAR-based Panoptic Segmentation based onKITTI. arXiv preprint arXiv:2003.02371 , 2020. 2, 6, 7[7] Jens Behley and Cyrill Stachniss. Efficient Surfel-BasedSLAM using 3D Laser Range Data in Urban Environments.In
RSS , 2018. 3[8] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taix´e.Tracking without bells and whistles. In
ICCV , 2019. 1, 3[9] Keni Bernardin and Rainer Stiefelhagen. Evaluating multipleobject tracking performance: The clear mot metrics.
JIVP ,2008:1:1–1:10, 2008. 2, 4, 6[10] Guillem Braso and Laura Leal-Taixe. Learning a neuralsolver for multiple object tracking. In
CVPR , June 2020.1, 3[11] William Brendel, Mohamed R. Amer, and Sinisa Todorovic.Multi object tracking as maximum weight independent set.
CVPR , 2011. 3 [12] Asad A. Butt and Robert T. Collins. Multi-target tracking bylagrangian relaxation to min-cost network flow. In
CVPR ,June 2013. 3[13] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora,Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi-ancarlo Baldan, and Oscar Beijbom. nuScenes: A multi-modal dataset for autonomous driving. In
CVPR , 2020. 1, 2,3[14] Bowen Cheng, Maxwell D Collins, Yukun Zhu, Ting Liu,Thomas S Huang, Hartwig Adam, and Liang-Chieh Chen.Panoptic-deeplab: A simple, strong, and fast baseline forbottom-up panoptic segmentation. In
CVPR , 2020. 3[15] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4Dspatio-temporal convnets: Minkowski convolutional neuralnetworks. In
CVPR , 2019. 2[16] Peng Chu and Haibin Ling. FAMNet: Joint learning of fea-ture, affinity and multi-dimensional assignment for onlinemultiple object tracking. In
ICCV , 2019. 3[17] Marius Cordts, Mohamed Omran, Sebastian Ramos, TimoRehfeld, Markus Enzweiler, Rodrigo Benenson, UweFranke, Stefan Roth, and Bernt Schiele. The cityscapesdataset for semantic urban scene understanding. In
CVPR ,2016. 1[18] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal-ber, Thomas Funkhouser, and Matthias Nießner. Scannet:Richly-annotated 3d reconstructions of indoor scenes. In
CVPR , 2017. 1[19] Patrick Dendorfer, Aljoˇsa Oˇsep, Anton Milan, KonradSchindler, Daniel Cremers, Ian Reid, and Stefan Roth LauraLeal-Taix´e. Motchallenge: A benchmark for single-cameramultiple target tracking.
IJCV , 2020. 1[20] Francis Engelmann, Martin Bokeloh, Alireza Fathi, BastianLeibe, and Matthias Nießner. 3D-MPA: Multi Proposal Ag-gregation for 3D Semantic Instance Segmentation. In
CVPR ,2020. 2[21] Andreas Ess, Bastian Leibe, Konrad Schindler, and LucVan Gool. Robust multiperson tracking from a mobile plat-form.
PAMI , 31(10):1831–1846, 2009. 3[22] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, andA. Zisserman. The pascal visual object classes (VOC) chal-lenge.
IJCV , 88(2):303–338, 2010. 1, 5[23] P. Felzenszwalb, D. Mcallester, and D. Ramanan. A dis-criminatively trained, multiscale, deformable part model. In
CVPR , 2008. 1[24] Davi Frossard and Raquel Urtasun. End-to-end learning ofmulti-sensor 3d tracking by detection.
ICRA , 2018. 2[25] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are weready for autonomous driving? the KITTI vision benchmarksuite. In
CVPR , 2012. 1, 2, 6[26] Agrim Gupta, Piotr Dollar, and Ross Girshick. LVIS: Adataset for large vocabulary instance segmentation. In
CVPR ,2019. 1[27] Timo Hackel, Nikolay Savinov, Lubor Ladicky, Jan D.Wegner, Konrad Schindler, and Marc Pollefeys. SEMAN-TIC3D.NET: A new large-scale point cloud classificationbenchmark. In
ISPRS Annals of the Photogrammetry, Re-mote Sensing and Spatial Information Sciences , 2017. 2
[28] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In
ICCV , 2017. 1[29] David Held, Jesse Levinson, Sebastian Thrun, and SilvioSavarese. Combining 3d shape, color, and motion for robustanytime tracking. In
RSS , 2014. 3[30] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, YulanGuo, Zhihua Wang, Niki Trigoni, and Andrew Markham.Randla-net: Efficient semantic segmentation of large-scalepoint clouds. In
CVPR , 2020. 2[31] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. Point-wise convolutional neural networks. In
CVPR , 2018. 2[32] Juana Valeria Hurtado, Rohit Mohan, and Abhinav Val-ada. Mopt: Multi-object panoptic tracking. arXiv preprintarXiv:2004.08189 , 2020. 2, 5, 8[33] Juana Valeria Hurtado, Rohit Mohan, and Abhinav Val-ada. Mopt: Multi-object panoptic tracking. arXiv preprintarXiv:2004.08189 , 2020. 3, 5[34] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point groupingfor 3d instance segmentation. In
CVPR , 2020. 2[35] Dahun Kim, Sanghyun Woo, Joon-Young Lee, and In SoKweon. Video panoptic segmentation. In
CVPR , 2020. 1, 2,3, 5[36] Alexander Kirillov, Kaiming He, Ross Girshick, CarstenRother, and Piotr Doll´ar. Panoptic segmentation. In
CVPR ,2019. 1, 2, 4, 5, 7[37] Artem Komarichev, Zichun Zhong, and Jing Hua. A-cnn:Annularly convolutional neural networks on point clouds. In
CVPR , 2019. 2[38] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Ima-genet classification with deep convolutional neural networks.In
NIPS , 2012. 1[39] Shiyi Lan, Ruichi Yu, Gang Yu, and Larry S Davis. Modelinglocal geometric structure of 3d point clouds using geo-cnn.In
CVPR , 2019. 2[40] Loic Landrieu and Martin Simonovsky. Large-scale pointcloud semantic segmentation with superpoint graphs. In
CVPR , 2018. 2[41] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou,Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encodersfor object detection from point clouds. In
CVPR , 2019. 1, 2,3, 7, 8[42] Laura Leal-Taix´e, Cristian Canton-Ferrer, and KonradSchindler. Learning by tracking: Siamese cnn for robust tar-get association.
CVPR Workshops , 2016. 3[43] Laura Leal-Taix´e, Gerard Pons-Moll, and Bodo Rosenhahn.Everybody needs somebody: Modeling social and groupingbehavior on a linear programming multiple people tracker.In
ICCV Workshops , 2011. 3[44] Bastian Leibe, Konrad Schindler, Nico Cornelis, andLuc Van Gool. Coupled object detection and tracking fromstatic cameras and moving vehicles.
PAMI , 30(10):1683–1698, 2008. 1, 3[45] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C. LawrenceZitnick. Microsoft COCO: Common objects in context. In
ECCV , 2014. 1 [46] Jonathon Luiten, Aljoˇsa Oˇsep, Patrick Dendorfer, PhilipTorr, Andreas Geiger, Laura Leal-Taix´e, and Bastian Leibe.Hota: A higher order metric for evaluating multi-objecttracking.
IJCV , 2020. 2, 5[47] Jonathon Luiten, Paul Voigtlaender, and Bastian Leibe. Pre-mvos: Proposal-generation, refinement and merging forvideo object segmentation. In
Asian Conference on Com-puter Vision , 2018. 8[48] Andres Milioto, Jens Behley, Chris McCool, and CyrillStachniss. Lidar panoptic segmentation for autonomous driv-ing. In
IROS , 2020. 2, 7, 8[49] Andres Milioto, Ignacio Vizzo, Jens Behley, and CyrillStachniss. RangeNet++: Fast and Accurate LiDAR SemanticSegmentation. In
IROS , 2019. 1, 2, 7, 8[50] Himangi Mittal, Brian Okorn, and David Held. Just go withthe flow: Self-supervised scene flow estimation. In
CVPR ,2020. 8[51] Daniel Munoz, J. Andrew Bagnell, Nicolas Vandapel, andMartial Hebert. Contextual Classification with FunctionalMax-Margin Markov Networks. In
CVPR , 2009. 2[52] Davy Neven, Bert De Brabandere, Marc Proesmans, andLuc Van Gool. Instance segmentation by jointly optimiz-ing spatial embeddings and clustering bandwidth. In
CVPR ,2019. 3, 4[53] Kenji Okuma, Ali Taleghani, Nando De Freitas, James J Lit-tle, and David G Lowe. A boosted particle filter: Multitargetdetection and tracking. In
ECCV , 2004. 3[54] Aljoˇsa Oˇsep, Wolfgang Mehner, Markus Mathias, and Bas-tian Leibe. Combined image- and world-space tracking intraffic scenes. In
ICRA , 2017. 1, 3[55] Aljoˇsa Oˇsep, Wolfgang Mehner, Paul Voigtlaender, and Bas-tian Leibe. Track, then decide: Category-agnostic vision-based multi-object tracking.
ICRA , 2018. 8[56] Aljoˇsa Oˇsep, Paul Voigtlaender, Mark Weber, JonathonLuiten, and Bastian Leibe. 4d generic video object propos-als.
ICRA , 2020. 8[57] Hamed Pirsiavash, Deva Ramanan, and Charles C.Fowlkes.Globally-optimal greedy algorithms for tracking a variablenumber of objects. In
CVPR , 2011. 3[58] Lorenzo Porzi, Samuel Rota Bulo, Aleksander Colovic, andPeter Kontschieder. Seamless scene segmentation. In
CVPR ,2019. 2, 5[59] Johannes P¨oschmann, Tim Pfeifer, and Peter Protzel. Factorgraph based 3d multi-object tracking in point clouds. arXivpreprint arXiv:2008.05309 , 2020. 1[60] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas.Pointnet: Deep learning on point sets for 3d classificationand segmentation. In
CVPR , 2017. 1, 2[61] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Point-net++: Deep hierarchical feature learning on point sets in ametric space. In
NIPS , 2017. 1, 2[62] Haozhe Qi, Chen Feng, Zhiguo Cao, Feng Zhao, and YangXiao. P2b: Point-to-box network for 3d object tracking inpoint clouds. In
CVPR , 2020. 1[63] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.Faster R-CNN: Towards real-time object detection with re-gion proposal networks. In
NIPS , 2015. 1
[64] Hanyu Shi, Guosheng Lin, Hao Wang, Tzu-Yi Hung, and Zhenhua Wang. Spsequencenet: Semantic segmentation network on 4d point clouds. In
CVPR , 2020. 2[65] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, JianpingShi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In
CVPR , 2020. 1[66] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. PointR-CNN: 3D Object Proposal Generation and Detection FromPoint Cloud. In
CVPR , 2019. 1, 3[67] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang,and Hongsheng Li. From Points to Parts: 3D Object Detec-tion from Point Cloud with Part-aware and Part-aggregationNetwork.
PAMI , 2020. 1[68] Jeany Son, Mooyeol Baek, Minsu Cho, and Bohyung Han.Multi-object tracking with quadruplet convolutional neuralnetworks. In
CVPR , 2017. 3[69] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, AurelienChouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou,Yuning Chai, Benjamin Caine, et al. Scalability in perceptionfor autonomous driving: Waymo open dataset. In
CVPR ,2020. 1, 2, 3[70] Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin,Hanrui Wang, and Song Han. Searching Efficient 3D Ar-chitectures with Sparse Point-Voxel Convolution. In
ECCV ,2020. 1[71] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3d. In
CVPR , 2018. 2[72] Alex Teichman and Sebastian Thrun. Tracking-based semi-supervised learning.
IJRR , 31(7):804–818, 2012. 3[73] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud,Beatriz Marcotegui, Franc¸ois Goulette, and Leonidas J.Guibas. Kpconv: Flexible and deformable convolution forpoint clouds. In
ICCV , 2019. 1, 2, 4, 8[74] Rudolph Triebel, Krisitian Kersting, and Wolfram Bur-gard. Robust 3D Scan Point Classification using AssociativeMarkov Networks. In
ICRA , pages 2603–2608, 2006. 2[75] Paul Voigtlaender, Michael Krause, Aljosa Osep, JonathonLuiten, B.B.G Sekar, Andreas Geiger, and Bastian Leibe.MOTS: Multi-object tracking and segmentation. In
CVPR ,2019. 1, 2, 3, 4, 8[76] Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu,Paul Voigtlaender, Hartwig Adam, Bradley Green, AndreasGeiger, Bastian Leibe, Daniel Cremers, Aljoˇsa Oˇsep, LauraLeal-Taix´e, and Liang-Chieh Chen. STEP: Segmenting andTracking Every Pixel. arXiv preprint arXiv:2102.11859 ,2021. 2, 5, 6[77] Xinshuo Weng, Jianren Wang, David Held, and Kris Kitani.3D Multi-Object Tracking: A Baseline and New EvaluationMetrics.
IROS , 2020. 1, 2, 3, 7, 8[78] Xinshuo Weng, Yongxin Wang, Yunze Man, and Kris Kitani.Gnn3dmot: Graph neural network for 3d multi-object track-ing with multi-feature learning. 2020. 2[79] Kelvin Wong, Shenlong Wang, Mengye Ren, Ming Liang,and Raquel Urtasun. Identifying unknown instances for au-tonomous driving. In
CoRL , 2020. 2 [80] Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer.Squeezeseg: Convolutional neural nets with recurrent crf forreal-time road-object segmentation from 3d lidar point cloud.In
ICRA , 2018. 2[81] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, andKurt Keutzer. Squeezesegv2: Improved model structure andunsupervised domain adaptation for road-object segmenta-tion from a lidar point cloud. In
ICRA , 2019. 2[82] Xuehan Xiong, Daniel Munoz, J. Andrew Bagnell, and Mar-tial Hebert. 3-D Scene Analysis via Sequenced Predictionsover Points and Regions. In
ICRA , pages 2609–2616, 2011.2[83] Yihong Xu, Aljoˇsa Oˇsep, Yutong Ban, Radu Horaud, LauraLeal-Taix´e, and Xavier Alameda-Pineda. How to train yourdeep multi-object tracker. In
CVPR , 2020. 3[84] Linjie Yang, Yuchen Fan, and Ning Xu. Video instance seg-mentation. In
ICCV , 2019. 1, 2, 3, 5, 8[85] Li Zhang, Li Yuan, and Ramakant Nevatia. Global data as-sociation for multi-object tracking using network flows. In
CVPR , 2008. 3[86] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Ze-rong Xi, Boqing Gong, and Hassan Foroosh. Polarnet: Animproved grid representation for online lidar point clouds se-mantic segmentation. In
CVPR , 2020. 1, 2[87] Xingyi Zhou, Vladlen Koltun, and Philipp Kr¨ahenb¨uhl.Tracking objects as points. In
ECCV, 2020.

Supplementary Material

A. Implementation Details
In this section, we (i) provide details about the four different point propagation strategies we experimented with for forming 4D point clouds and (ii) detail the point-overlap based association procedure we use to link 4D object instances across overlapping point clouds.
A.1. 4D Point Cloud Formation
Our method works directly on 4D volumes constructed from consecutive LiDAR scans. However, due to memory constraints, stacking all points is not feasible. To reduce memory usage, when we process scan f_i together with the previous scans f_{i−τ}, ..., f_{i−1}, we take all points from f_i and sub-sample points from the other scans. Moreover, since the previous scans f_{i−τ}, ..., f_{i−1} have already been processed, we know the semantic class and objectness score of each of their points. We use four different strategies to sub-sample points from previous scans by leveraging this information.

Thing Propagation:
In this strategy, we only sample points from previous scans if they are assigned to a thing class. If the total number of points exceeds the GPU memory limit, we randomly sub-sample again.
Importance Sampling:
We select 10% of the points from each previous scan using the objectness scores predicted by the network at earlier time steps. Thus, points with higher objectness scores have a higher chance of being used in the clustering process for the following scans.
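A minimal sketch of this sampling step, assuming per-point objectness predictions from the already-processed scan are available; the 10% ratio matches the description above, everything else is illustrative.

```python
import numpy as np

def importance_sample(points, objectness, keep_ratio=0.1, rng=None):
    """Sample a fraction of a past scan's points with probability proportional to objectness."""
    rng = rng or np.random.default_rng(0)
    n_keep = max(1, int(keep_ratio * len(points)))
    probs = (objectness + 1e-8) / (objectness + 1e-8).sum()   # avoid zero-probability points
    idx = rng.choice(len(points), size=n_keep, replace=False, p=probs)
    return points[idx]
```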
Temporal Decay:
In this strategy, we again use importance sampling with objectness scores. However, instead of sampling 10% of the points from each past scan, we select the fraction of points based on the temporal proximity of the scans. Given a temporal window size of τ, we select the number of points n_i as:

n_i = \frac{e^i}{\sum_{n=1}^{\tau-1} e^n}, \quad i = 1, 2, ..., \tau - 1,   (8)

where i = τ − 1 corresponds to the scan closest to the current scan. In this strategy, more points are sampled from scans which are temporally close.
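For illustration, the per-scan fractions implied by Eq. 8 can be computed as below; the indexing (i = τ − 1 being the most recent past scan) follows the description above.

```python
import numpy as np

def decay_fractions(tau):
    """Fraction of sampled points per past scan according to Eq. 8 (oldest to newest)."""
    i = np.arange(1, tau)
    weights = np.exp(i)
    return weights / weights.sum()

# e.g. tau = 4 -> roughly [0.09, 0.24, 0.67]: the closest scan contributes most points
print(decay_fractions(4))
```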
Temporal Stride: We use importance sampling in this strategy as well, but instead of using points from all previous scans, we use every second scan. For the points from the remaining scans, we assign predictions by looking at the closest points which already have a class and instance prediction.

Table 7: Panoptic Tracking on the SemanticKITTI validation set.
A.2. Clustering
Our method can cluster points with different semantics and does not provide a single class label for a specific instance. This can be adapted depending on the requirements of the downstream application (e.g., via a majority vote). Moreover, if the number of points assigned to a specific cluster is lower than a threshold, we remove that instance from the final prediction.
A.3. Tracking
As discussed in the main paper (Section 3), we process multiple scans together in an overlapping fashion. For a window size of τ, at time t we jointly process the scans f_{t−τ}, ..., f_t by stacking them into a single 4D point cloud.

To associate instances at time t and t + 1, we look at instance intersections in the scans that are common to both time steps. For example, with a temporal window of size two, consecutive 4D volumes share one scan; since instance IDs are consistent within each jointly processed volume, transferring IDs from the previous step to the newest scan only requires matching instances within the shared scans, which completes the association between overlapping 4D volumes.

For the intersection, we consider all common scans. When there is a conflict (i.e., one instance overlaps with two instances in the next step), we pick the instance pair with the higher intersection-over-union. If none of the intersections surpasses the IoU threshold, we create a new ID for the instance.
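The overlap-based ID transfer can be sketched as a greedy matching over the points of the scans shared by two consecutive 4D volumes; the IoU threshold and the new-id bookkeeping below are illustrative.

```python
import numpy as np

def associate_volumes(prev_ids, cur_ids, iou_thresh=0.5, next_new_id=1000):
    """Transfer instance ids between overlapping 4D volumes.

    prev_ids / cur_ids: instance ids (0 = stuff/unassigned) of the shared scans' points,
    as labelled by the previous and the current window respectively.
    Returns a mapping {current instance id -> final track id}.
    """
    pairs = []
    for c in np.unique(cur_ids[cur_ids > 0]):
        for p in np.unique(prev_ids[cur_ids == c]):
            if p == 0:
                continue
            inter = np.logical_and(cur_ids == c, prev_ids == p).sum()
            union = np.logical_or(cur_ids == c, prev_ids == p).sum()
            pairs.append((inter / union, c, p))
    mapping, used_prev = {}, set()
    for iou, c, p in sorted(pairs, reverse=True):          # best overlaps first
        if iou < iou_thresh or c in mapping or p in used_prev:
            continue
        mapping[c] = p
        used_prev.add(p)
    for c in np.unique(cur_ids[cur_ids > 0]):              # unmatched instances get new ids
        if c not in mapping:
            mapping[c] = next_new_id
            next_new_id += 1
    return mapping
```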
B. Additional Results

B.1. Ablation on the Temporal Window Size
In Tab. 7, we highlight the performance of our method for a range of temporal window sizes τ. As can be seen, the association accuracy increases up to τ = 4 and then saturates, while classification accuracy saturates at τ = 2; however, it only decreases marginally.

Figure 4: Comparison of evaluation metrics for some failure cases. The respective instances for which we calculate the metrics are depicted with bounding boxes. In (a), ID recovery is punished by MOTSA and PTQ. In (b), two instances are predicted as a single instance, and in (c), an ID switch happens and in the second scan the instance is not segmented correctly.

B.2. Per-class Evaluation
In this section, we analyze the performance on the validation split (Tab. 3) through the lens of several evaluation metrics and analyze the per-class performance in Tab. 8 (this table extends Tab. 4 from the main paper). While our 4-scan variant performs better than the 2-scan variant in terms of