PanopticFusion: Online Volumetric Semantic Mapping at the Level of Stuff and Things
Gaku Narita, Takashi Seno, Tomoya Ishikawa, Yohsuke Kaji

The authors are with the R&D Center, Sony Corporation. {gaku.narita, takashi.seno, tomoya.ishikawa, yohsuke.kaji}@sony.com

Abstract— We propose PanopticFusion, a novel online volumetric semantic mapping system at the level of stuff and things. In contrast to previous semantic mapping systems, PanopticFusion is able to densely predict class labels of a background region (stuff) and individually segment arbitrary foreground objects (things). In addition, our system is able to reconstruct a large-scale scene and extract a labeled mesh thanks to its use of a spatially hashed volumetric map representation. Our system first predicts pixel-wise panoptic labels (class labels for stuff regions and instance IDs for thing regions) for incoming RGB frames by fusing 2D semantic and instance segmentation outputs. The predicted panoptic labels are integrated into the volumetric map together with depth measurements while keeping the consistency of the instance IDs, which could vary from frame to frame, by referring to the 3D map at that moment. In addition, we construct a fully connected conditional random field (CRF) model with respect to panoptic labels for map regularization. For online CRF inference, we propose a novel unary potential approximation and a map division strategy. We evaluated the performance of our system on the ScanNet (v2) dataset. PanopticFusion outperformed or was comparable to state-of-the-art offline 3D DNN methods in both semantic and instance segmentation benchmarks. Also, we demonstrate a promising augmented reality application using a 3D panoptic map generated by the proposed system.
I. INTRODUCTION

Geometric and semantic scene understanding in 3D environments has an important role in autonomous robotics and context-aware augmented reality (AR) applications. Geometric scene understanding, such as visual simultaneous localization and mapping (SLAM) and 3D reconstruction, has been widely discussed since the early days of both the robotics and computer vision communities. In recent years, semantic mapping, which not only reconstructs the 3D structure of a scene but also recognizes what exists in the environment, has attracted much attention because of the great progress of deep neural networks.

Fig. 1. PanopticFusion realizes online volumetric semantic mapping at the level of stuff and things. The system performs large-scale 3D reconstruction, as well as dense semantic labeling of stuff regions and segmentation of individual things in an online manner, as shown in the top figure. It is also able to restore the class labels of things and yield a colored mesh, as shown in the bottom figures. The results obtained with scene0645_01 of ScanNet v2 are shown.

Semantic mapping systems could take a variety of approaches in terms of geometry and semantics. When we think about robotic and AR applications that deeply interact with the real world, what kind of properties are required for the ideal semantic mapping system? In terms of geometry, it needs to be able to reconstruct a large-scale scene, not sparsely but densely. Additionally, the 3D reconstruction desirably needs to be represented as a volumetric map, not just point clouds or surfels, because it is difficult to directly utilize point clouds and surfels for robot–object collision detection or robot navigation. In terms of semantics, which we mainly focus on in this paper, we believe that it is important for the mapping system to have a holistic scene understanding capability, that is to say, dense semantic labeling as well as individual object discrimination. This is because densely labeled semantics is a crucial cue for intelligent robot navigation, and discriminating individual objects is essential for robot–object interaction.

Turning our eyes to the field of 2D image recognition, an image understanding task called panoptic segmentation has been proposed recently [11]. In the panoptic segmentation task, semantic classes are defined as a set of stuff classes (amorphous regions, such as floors, walls, the sky and roads) and thing classes (countable objects, such as chairs, tables, people and vehicles), and one needs to predict class labels on stuff regions and both class labels and instance IDs on thing regions, where the predictions should be performed for each pixel. Extending this point of view to 3D mapping, in this paper we propose the
PanopticFusion system. To the best of our knowledge, it is the first semantic mapping system that realizes scene understanding at the level of stuff and things. Our system incrementally performs large-scale 3D surface reconstruction online, as well as dense class label prediction on the background region and segmentation and recognition of individual foreground objects, as shown in Fig. 1.

Our approach first passes the incoming RGB frame to 2D semantic and instance segmentation networks and obtains a panoptic label image in which class labels are assigned to stuff pixels and instance IDs to thing pixels. The predicted panoptic labels and depth measurements are integrated into the volumetric map. Before integration, we keep the consistency of instance IDs, which possibly change from frame to frame, by referring to the volumetric map at that moment. In addition, we regularize the map using a fully connected CRF model with respect to panoptic labels. For CRF inference, we propose a unary potential approximation using limited information stored in the map. We also present a map division strategy that achieves a significant reduction in computational time without a drop in accuracy.

We evaluated the performance of our system on the ScanNet v2 dataset [4], a richly annotated large-scale dataset for indoor scene understanding. The results revealed that PanopticFusion is superior or comparable to state-of-the-art offline 3D DNN methods in both the 3D semantic and instance segmentation tasks. Note that our system is not limited to indoor scenes. Finally, we demonstrated a promising AR application using the 3D panoptic map generated by our system.

The main contributions of this paper are the following:
• The first reported semantic mapping system that realizes scene understanding at the level of stuff and things.
• Large-scale 3D reconstruction and labeled mesh extraction thanks to the use of a spatially hashed volumetric map representation.
• Map regularization using a fully connected CRF with a novel unary potential approximation and map division strategy.
• Superior or comparable results in both 3D semantic and instance segmentation tasks, in comparison with state-of-the-art offline 3D DNN methods.

II. RELATED WORK

Previously proposed representative semantic mapping systems related to our PanopticFusion system are shown in Table I. These systems can be divided into two categories from the perspective of semantics: the dense labeling approach and the object-oriented approach.

The dense labeling approach builds a single 3D map and assigns a class label or a probability distribution of class labels to each surfel or voxel to realize dense 3D semantic segmentation. Hermans et al. [8] utilize random decision forests for 2D semantic segmentation and transfer the inferred probability distributions to point clouds with a Bayesian update scheme. Extending the approach of Hermans et al. [8], SemanticFusion [17] improves the recognition performance by using CNNs for 2D semantic segmentation and makes use of ElasticFusion [31] as a SLAM system to generate a globally consistent map. Xiang et al. [32] presented KinectFusion [19]-based volumetric mapping with novel data associated RNNs for improving the segmentation accuracy. While these methods realize dense scene understanding, they suffer from the drawback that they are not able to distinguish individual objects in the scene.
TABLE I: Semantic mapping systems related to PanopticFusion. Columns are grouped into Speed (Online), Geometry (TSDF Volume, Surfels, Large-scale, Model-free) and Semantics (Dense Labeling, Object-level, Map Reg.).

Method                 | Online | TSDF Volume | Surfels | Large-scale | Model-free | Dense Labeling | Object-level | Map Reg.
SLAM++ [25]            |   ✓    |      ✓      |         |      ✓      |            |                |      ✓       |    ✓
SemanticFusion [17]    |   ✓    |             |    ✓    |      ✓      |     ✓      |       ✓        |              |    ✓
DA-RNN [32]            |   ✓    |      ✓      |         |             |     ✓      |       ✓        |              |
MaskFusion [24]        |   ✓    |             |    ✓    |             |     ✓      |                |      ✓       |
Fusion++ [16]          |   ✓    |      ✓      |         |             |     ✓      |                |      ✓       |
PanopticFusion (Ours)  |   ✓    |      ✓      |         |      ✓      |     ✓      |       ✓        |      ✓       |    ✓
Methods adopted in the early days of the object-oriented approach leverage 3D model databases. SLAM++ [25] performs point pair feature-based object detection and feeds the detected objects into a pose graph. Tateno et al. [29] proposed a 3D object detection and pose estimation system that combines unsupervised geometric segmentation and global 3D descriptor matching. These methods, however, require the shapes of objects in the scene to be exactly the same as the 3D models in the database. Recently, several studies on the object-oriented approach using a CNN-based 2D object detector have been reported. Sünderhauf et al. [27] and Nakajima et al. [18] combine a 2D object detector and unsupervised geometric segmentation in order to detect objects in point clouds or a surfel map. MaskFusion [24], Fusion++ [16] and MID-Fusion [33] introduced an object-oriented map representation that individually builds 3D maps for each object based on 2D object detection. The object-oriented map representation enables tracking of individual objects [24], [33] and object-level pose graph optimization [16]. However, the quantitative recognition performance of these methods is not clear because they mainly evaluate the camera trajectory accuracy. Furthermore, they focus on foreground objects, resulting in a lack of semantics and/or geometry of background regions.

In contrast to these related studies, PanopticFusion realizes holistic scene reconstruction and dense semantic labeling with the ability to discriminate individual objects. Our system builds a single volumetric map, similar to the dense labeling approaches, yet each voxel stores neither class labels nor class probability distributions but DNN-predicted panoptic labels, in order to seamlessly manage both stuff and things semantics. The class labels of foreground objects can be restored by a probability integration process. In addition, our 3D reconstruction leverages a truncated signed distance field (TSDF) volumetric map with the voxel hashing data structure [20], which allows us to reconstruct a large-scale scene as well as extract labeled meshes by using marching cubes [15], in contrast to the 3D maps of previous methods, which are based on point clouds [8], [27], surfels [17], [18], [24] and a fixed-sized voxel grid [32], [16]. It should be noted that high recognition performance has been reported with 3D DNN methods that directly apply deep networks to 3D data such as point clouds or voxel grids [22], [5], [35], [9]. Nevertheless, with those methods it is basically necessary to reconstruct the whole scene in advance, requiring offline processing, which could limit their application to robotics and AR. On the contrary, PanopticFusion is an online and incremental framework.

III. METHOD

Fig. 2. System overview of PanopticFusion.

Fig. 2 shows the system overview of PanopticFusion. Our system first feeds an incoming RGB frame into 2D semantic and instance segmentation networks and obtains pixel-wise panoptic labels by fusing the two outputs (Section III-C). The panoptic labels are carefully tracked by referring to the volumetric map at that moment (Section III-D) and are integrated into the map with depth measurements (Section III-E). Probability distributions of class labels for foreground objects are also incrementally integrated (Section III-F). In addition, online map regularization with a fully connected CRF model is performed for a further improvement of the recognition accuracy.
Note that camera poses with respect to the volumetric map are given by an external vSLAM, and labeled meshes are extracted by using marching cubes [15].
A. Notations
We denote the set of all class labels by L, which is divided into stuff labels L^St and thing labels L^Th such that L = L^St ∪ L^Th and L^St ∩ L^Th = ∅. The set of instance IDs for discriminating individual things is denoted by Z. We define the set of panoptic labels L^Pa = L^St ∪ Z ∪ {l_unk} in order to seamlessly manage stuff- and things-level semantics in the 3D map, where l_unk denotes the unknown label.
B. Volumetric Map

We use a TSDF-based volumetric map representation with a voxel hashing approach [20], which manages spatially hashed small regular voxel grids called voxel blocks. This approach is memory efficient compared with a single voxel grid approach like the original KinectFusion [19] and enables us to reconstruct large-scale scenes. Our implementation is based on voxblox [21], a CPU-based TSDF mapping system, which we extend to integrate semantics. As in [19], our volumetric map stores the truncated signed distance D_t(v) ∈ R, the RGB color C_t(v) ∈ R^3 and the associated weight W^D_t(v) ≥ 0 at each voxel location v ∈ R^3. Our system additionally stores the panoptic label L^Pa_t(v) ∈ L^Pa and its weight W^L_t(v) ≥ 0. Here t denotes the time index.
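To make the stored quantities concrete, the following is a minimal sketch of a per-voxel record; the field names and the block layout are our own illustration under the assumptions above and do not come from the authors' voxblox-based implementation.

```python
from dataclasses import dataclass

# Illustrative per-voxel record for the map described above. The field names
# are ours and do not come from the authors' voxblox-based implementation.
@dataclass
class PanopticTsdfVoxel:
    distance: float = 0.0        # truncated signed distance D_t(v)
    color: tuple = (0, 0, 0)     # RGB color C_t(v)
    weight: float = 0.0          # TSDF/color weight W^D_t(v) >= 0
    panoptic_label: int = 0      # panoptic label L^Pa_t(v); 0 encodes l_unk
    label_weight: float = 0.0    # label weight W^L_t(v) >= 0

# Voxels are grouped into spatially hashed voxel blocks, e.g. a dictionary that
# maps integer block coordinates to a dense 16 x 16 x 16 array of voxels.
voxel_blocks = {}
```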
C. 2D Panoptic Label Prediction

For the incoming RGB frame, we predict pixel-wise panoptic labels by fusing the 2D semantic and instance segmentation outputs. We utilize the state-of-the-art CNN architectures of PSPNet [36] and Mask R-CNN [7] for 2D semantic and instance segmentation, respectively. PSPNet infers pixel-wise class labels L_t(u) ∈ L, where u ∈ R^2 denotes the image coordinates. Mask R-CNN outputs instance IDs for each pixel Z_t(u) ∈ Z ∪ {l_unk}, where the regions without any foreground objects are filled with l_unk. The foreground object probability p_t(z, O) and the conditional probability distribution of thing labels p_t(z, l^Th | O) with respect to instance z are utilized in the probability integration step described in Section III-F. We obtain pixel-wise panoptic labels L^Pa_t(u) from L_t(u) and Z_t(u), giving precedence to the instance IDs:

\[
L^{Pa}_t(u) =
\begin{cases}
Z_t(u) & Z_t(u) \neq l_{unk} \\
L_t(u) & Z_t(u) = l_{unk} \wedge L_t(u) \in \mathcal{L}^{St} \\
l_{unk} & \text{otherwise.}
\end{cases}
\tag{1}
\]
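As a concrete illustration of Eq. (1), a per-pixel fusion of the two network outputs could be sketched as follows in NumPy. The encoding conventions (an explicit set of stuff labels, instance IDs offset into their own integer range, 0 for l_unk) are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of Eq. (1): fuse a semantic label image and an instance ID
# image into per-pixel panoptic labels. Encoding conventions are assumed.
INSTANCE_OFFSET = 1000  # assumed: instance ID z is stored as 1000 + z
UNKNOWN = 0             # assumed: 0 encodes the unknown label l_unk


def fuse_panoptic(semantic, instance, stuff_labels):
    """semantic, instance: (H, W) integer arrays; returns (H, W) panoptic labels."""
    panoptic = np.full(semantic.shape, UNKNOWN, dtype=np.int32)
    # Thing pixels: wherever Mask R-CNN assigned an instance ID, it wins.
    thing_mask = instance != UNKNOWN
    panoptic[thing_mask] = instance[thing_mask] + INSTANCE_OFFSET
    # Stuff pixels: take the PSPNet label where no instance overlaps and the
    # predicted class is a stuff class; everything else stays unknown.
    stuff_mask = ~thing_mask & np.isin(semantic, list(stuff_labels))
    panoptic[stuff_mask] = semantic[stuff_mask]
    return panoptic
```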
D. Panoptic Label Tracking

Direct integration of the raw panoptic labels L^Pa_t(u) into the volumetric map induces label inconsistency because Mask R-CNN does not necessarily output a consistent instance ID for the same object through multiple frames. To avoid this problem, we need to estimate consistency-resolved panoptic labels L̂^Pa_t(u) before the integration. The simplest way is to track the foreground objects in the 2D image sequence using a visual object tracker. This approach, unfortunately, is not able to re-identify an object in the case of a loopy camera trajectory. Therefore, we take a map reference approach similar to [24], [16].

We first prepare the reference panoptic labels L̃^Pa_{t−1}(u) by accessing the map. Here, T_t denotes the live camera pose, K the camera intrinsic parameters, and D_t(u) the live depth map:

\[
\tilde{L}^{Pa}_{t-1}(u) = L^{Pa}_{t-1}\big(T_t K^{-1} D_t(u) [u^\top, 1]^\top\big).
\tag{2}
\]

To track labels, we compute the intersection over union (IoU) U(z̃, z) of instance ID z of the raw panoptic labels L^Pa_t(u) and instance ID z̃ of the reference panoptic labels L̃^Pa_{t−1}(u):

\[
U(\tilde{z}, z) = \mathrm{IoU}\big(\{u \mid \tilde{L}^{Pa}_{t-1}(u) = \tilde{z}\}, \{u \mid L^{Pa}_t(u) = z\}\big).
\tag{3}
\]

Here, IoU is defined as
IoU(A, B) = |A ∩ B| / |A ∪ B|. When the maximum value of the IoU exceeds a threshold θ_U, the z̃ giving the maximum value is associated with z. Otherwise, a new instance ID is assigned to z:

\[
\hat{z} =
\begin{cases}
\arg\max_{\tilde{z}} U(\tilde{z}, z) & \max_{\tilde{z}} U(\tilde{z}, z) > \theta_U \\
z_{new} & \text{otherwise.}
\end{cases}
\tag{4}
\]

The association is processed in descending order of the mask area |{u | L^Pa_t(u) = z}|. Once a reference instance ID z̃ is associated with z, that instance ID is not associated with any other z. The utilization of the IoU instead of an overlap ratio, as used in [24], [16], and the exclusive label association are for avoiding under-segmentation of foreground objects in the map. From the associated instance IDs and raw stuff labels, we obtain the consistency-resolved panoptic labels L̂^Pa_t(u) as follows, which are used in the integration step:

\[
\hat{L}^{Pa}_t(u) =
\begin{cases}
L^{Pa}_t(u) & L^{Pa}_t(u) \in \mathcal{L}^{St} \\
\hat{z} & L^{Pa}_t(u) \in \mathcal{Z} \\
l_{unk} & \text{otherwise.}
\end{cases}
\tag{5}
\]
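The greedy, exclusive IoU association of Eqs. (3)-(5) can be sketched as below; the array encodings, threshold handling and new-ID bookkeeping are illustrative assumptions rather than the authors' code.

```python
import numpy as np

# Sketch of Eqs. (3)-(4): greedily associate raw Mask R-CNN instance IDs with
# instance IDs rendered from the map, in descending order of mask area, using
# the IoU and an exclusive one-to-one assignment.
def track_instances(raw_panoptic, ref_panoptic, raw_ids, ref_ids,
                    iou_threshold, next_new_id):
    """Return (mapping from raw IDs to consistent map IDs, updated ID counter)."""
    mapping, used_refs = {}, set()
    areas = {z: int(np.sum(raw_panoptic == z)) for z in raw_ids}
    for z in sorted(raw_ids, key=lambda i: -areas[i]):
        raw_mask = raw_panoptic == z
        best_iou, best_ref = 0.0, None
        for z_ref in ref_ids:
            if z_ref in used_refs:
                continue  # exclusive association: each map ID is used once
            ref_mask = ref_panoptic == z_ref
            union = np.logical_or(raw_mask, ref_mask).sum()
            iou = np.logical_and(raw_mask, ref_mask).sum() / union if union else 0.0
            if iou > best_iou:
                best_iou, best_ref = iou, z_ref
        if best_iou > iou_threshold:   # Eq. (4), first case
            mapping[z] = best_ref
            used_refs.add(best_ref)
        else:                          # Eq. (4), second case: assign a new ID
            mapping[z] = next_new_id
            next_new_id += 1
    return mapping, next_new_id
```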
E. Volumetric Integration

For integration, we take the raycasting approach, as in [21]. For each pixel u, we cast a ray from the sensor origin s to the back-projected 3D point p_u = T_t K^{-1} D_t(u) [u^⊤, 1]^⊤ and update the voxels along the ray within the truncation distance. Regarding TSDF values, we update them by weighted averaging, similar to [19]:

\[
D_t(v) = \frac{W^D_{t-1}(v)\, D_{t-1}(v) + w_t(v, p_u)\, d_t(v, p_u, s)}{W^D_{t-1}(v) + w_t(v, p_u)},
\tag{6}
\]
\[
W^D_t(v) = W^D_{t-1}(v) + w_t(v, p_u).
\tag{7}
\]

Here, d_t denotes the distance between the voxel and the surface boundary, and w_t a quadratic weight [21] that takes the reliability of depth measurements into account. Similar updating is applied to the voxel color C_t(v).

In contrast to the TSDF and colors, which are continuous values, weighted averaging cannot be applied to panoptic labels, which are discrete. The most reliable and simplest way to manage panoptic labels is to record all integrated labels. This, unfortunately, results in a significant increase in memory usage and frequent memory allocation. Instead, we store a single label at each voxel and update its weight with an increment/decrement strategy. If the pixel-wise panoptic label L̂^Pa_t(u) estimated in the previous section is the same as the current voxel panoptic label L^Pa_{t−1}(v), we increment the weight W^L_t(v) by the quadratic weight:

\[
L^{Pa}_t(v) = L^{Pa}_{t-1}(v), \quad W^L_t(v) = W^L_{t-1}(v) + w_t(v, p_u).
\tag{8}
\]

In contrast, if those panoptic labels do not coincide, we decrement the weight:

\[
L^{Pa}_t(v) = L^{Pa}_{t-1}(v), \quad W^L_t(v) = W^L_{t-1}(v) - w_t(v, p_u).
\tag{9}
\]

Note that in the case where w_t > W^L_{t−1}, that is, when the weight would otherwise fall below zero, we replace the voxel label with the newly estimated label:

\[
L^{Pa}_t(v) = \hat{L}^{Pa}_t(u), \quad W^L_t(v) = w_t(v, p_u) - W^L_{t-1}(v).
\tag{10}
\]
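A compact sketch of the per-voxel update rules in Eqs. (6)-(10) is given below, assuming a voxel record with the fields introduced in Section III-B; it illustrates the scheme and is not the authors' implementation.

```python
# Sketch of the per-voxel update rules in Eqs. (6)-(10). `voxel` is assumed to
# carry the fields (distance, weight, panoptic_label, label_weight) of the
# record sketched in Section III-B; `sdf` plays the role of d_t(v, p_u, s) and
# `w` of the quadratic weight w_t(v, p_u).
def update_voxel(voxel, sdf, w, panoptic_label):
    # TSDF: weighted running average, Eqs. (6)-(7).
    voxel.distance = (voxel.weight * voxel.distance + w * sdf) / (voxel.weight + w)
    voxel.weight += w

    # Panoptic label: a single stored label with an increment/decrement weight.
    if panoptic_label == voxel.panoptic_label:
        voxel.label_weight += w          # Eq. (8): agreement strengthens the label
    elif w <= voxel.label_weight:
        voxel.label_weight -= w          # Eq. (9): disagreement weakens it
    else:
        # Eq. (10): the stored label has been outvoted, so replace it.
        voxel.panoptic_label = panoptic_label
        voxel.label_weight = w - voxel.label_weight
```

In this sketch the replacement case keeps the label weight non-negative, which is what allows a voxel label to flip once contradicting observations have accumulated.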
F. Thing Label Probability Integration

The thing label predicted by Mask R-CNN is frequently uncertain even when the segmentation mask is accurate, especially in the case where only a small part of the object is visible. Hence, we probabilistically integrate thing labels instead of assigning a single label to each foreground object:

\[
\bar{p}_t(z, l^{Th}) = \frac{\sum_{\tau} p_\tau(z, \mathcal{O})\, p_\tau(z, l^{Th} \mid \mathcal{O})}{\sum_{\tau} p_\tau(z, \mathcal{O})},
\tag{11}
\]

where the sums run over the frames τ ≤ t in which instance z was observed. Weighting the probability distributions with the detection confidence p_τ(z, O) allows the final distribution to preferentially reflect reliable detections.
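A minimal sketch of the running average in Eq. (11), keeping one accumulator per tracked instance, could look as follows; the class name and interface are assumptions made for illustration.

```python
import numpy as np

# Sketch of Eq. (11): a per-instance running average of the Mask R-CNN class
# distributions, weighted by the detection confidence p_tau(z, O).
class ThingLabelAccumulator:
    def __init__(self, num_thing_classes):
        self.weighted_sum = np.zeros(num_thing_classes)  # sum_tau p_tau(z,O) * p_tau(z,l|O)
        self.confidence_sum = 0.0                        # sum_tau p_tau(z,O)

    def integrate(self, detection_confidence, class_distribution):
        """Fold one detection of this instance into the accumulator."""
        self.weighted_sum += detection_confidence * np.asarray(class_distribution)
        self.confidence_sum += detection_confidence

    def distribution(self):
        """Integrated distribution over thing labels for this instance."""
        return self.weighted_sum / max(self.confidence_sum, 1e-12)
```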
G. Online Map Regularization

While the integration scheme described above yields a reliable 3D panoptic map, it is possible to further improve the recognition accuracy by using map regularization with a fully connected CRF model. A fully connected CRF with Gaussian edge potentials has been widely used in 2D image segmentation since an efficient inference method was proposed [12]. Recently, several studies that apply it to a 3D map, such as surfels or occupancy grids, have been reported [8], [17], [34]. In those approaches, CRF models are constructed with respect to class labels, whose number is fixed, whereas we consider a CRF with respect to panoptic labels, whose number depends on the scene and is theoretically not limited. Here we are faced with two problems: how to properly compute unary potentials for panoptic labels, and how to infer a CRF whose number of labels is potentially large within a practical time.
1) Problem Setting:
We construct a fully connected graph whose nodes are individual voxels. We assign a label variable x_v ∈ L^Pa to each node and infer the optimal labels x = {x_v} that minimize the Gibbs energy E by the mean-field approximation and a message passing scheme:

\[
E(\mathbf{x}) = \sum_{v} \psi_u(x_v) + \sum_{v < v'} \psi_p(x_v, x_{v'}).
\tag{12}
\]

The unary potential ψ_u(x_v) is derived from the label probability p(x_v) (Eq. (13)), and the pairwise potential ψ_p is a Gaussian edge potential [12].

2) Unary Potential Approximation: Previous approaches [8], [17], [34] assigned a probability distribution to each surfel or voxel, which can be used directly to compute unary potentials; in contrast, from the viewpoint of memory efficiency, we store only a single label in each voxel. Therefore, we approximate the unary potentials using only the single label and the weights stored in a voxel, based on the following assumption.

Let us focus on the integration scheme of panoptic labels shown in Eqs. (8)-(10). We denote the sets of times when the predicted panoptic label is the same as, and not the same as, the current voxel label by T⁺ = {τ | L̂^Pa_τ(u) = L^Pa_t(v)} and T⁻ = {τ | L̂^Pa_τ(u) ≠ L^Pa_t(v)}, respectively. If L^Pa_τ(v) = L^Pa_t(v) for all τ = 1, ..., t−1, that is to say, if the voxel label has never changed, Eq. (17) holds strictly. If p(x_v = L^Pa_t(v)) > 0.5 and the number of integrations is sufficiently large, Eq. (17) holds asymptotically:

\[
\sum_{\tau \in \mathcal{T}^{+}} w_\tau(v, p_u) - \sum_{\tau \in \mathcal{T}^{-}} w_\tau(v, p_u) \simeq W^L_t(v).
\tag{17}
\]

In addition, from the TSDF update scheme in Eq. (7) we have

\[
\sum_{\tau \in \mathcal{T}^{+}} w_\tau(v, p_u) + \sum_{\tau \in \mathcal{T}^{-}} w_\tau(v, p_u) = W^D_t(v).
\tag{18}
\]

Consequently, the probability that the current panoptic label in the voxel is actually correct can be calculated as

\[
p(x_v = L^{Pa}_t(v)) = \frac{\sum_{\tau \in \mathcal{T}^{+}} w_\tau(v, p_u)}{\sum_{\tau \in \mathcal{T}^{+}} w_\tau(v, p_u) + \sum_{\tau \in \mathcal{T}^{-}} w_\tau(v, p_u)} \simeq \frac{1}{2}\left(1 + \frac{W^L_t(v)}{W^D_t(v)}\right).
\tag{19}
\]

It is unfortunately not possible to calculate the exact probability that the voxel takes a label other than the current label, because the map does not record all the information about previously integrated labels. Therefore, we approximate this probability as follows, where M denotes the number of panoptic labels in the map:

\[
p(x_v) = \frac{1}{M - 1}\big(1 - p(x_v = L^{Pa}_t(v))\big) \quad (x_v \neq L^{Pa}_t(v)).
\tag{20}
\]

Finally, we obtain the unary potential from Eqs. (13), (19) and (20). In spite of the approximation, this approach realizes quantitative and qualitative improvements in recognition accuracy, as shown in Section IV-C.

3) Map Division for Online Inference: The computational complexity of the inference algorithm proposed by Krähenbühl et al. [12] is O(NM), where N and M are the numbers of voxels and panoptic labels, respectively. In our problem setting, however, M is theoretically limitless and could in practice be large, e.g. several hundred, which would make online inference impracticable. To solve this problem, we present a map division strategy. When we divide the volumetric map into S spatially contiguous submaps, the number of panoptic labels in each submap can be expected to be O(M/S). Hence, the total computational complexity can be reduced to S × O(N/S × M/S) = O(NM)/S. The map is divided by a block-wise region growing approach based on a predefined maximum number of voxel blocks. The division process itself has little effect on the computational time.
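The unary potential approximation of Eqs. (19)-(20) only needs the stored label and the two weights W^L_t(v) and W^D_t(v). A minimal sketch is given below; it assumes the standard dense-CRF choice ψ_u(x_v) = −log p(x_v) for turning the probabilities into potentials, which is not reproduced from the paper's Eq. (13).

```python
import numpy as np

# Sketch of the approximated unary potentials in Eqs. (19)-(20). From the single
# stored label and the two weights W^L, W^D we recover an approximate probability
# for the stored label and spread the remainder uniformly over the other M - 1
# panoptic labels present in the (sub)map. The final -log is the standard dense
# CRF unary, assumed here rather than quoted from Eq. (13).
def approx_unary_potential(stored_label_index, label_weight, tsdf_weight,
                           num_labels):
    """Return a (num_labels,) vector of unary potentials for one voxel."""
    p_stored = 0.5 * (1.0 + label_weight / tsdf_weight)      # Eq. (19)
    p_stored = float(np.clip(p_stored, 1e-6, 1.0 - 1e-6))
    p_other = (1.0 - p_stored) / max(num_labels - 1, 1)      # Eq. (20)
    probs = np.full(num_labels, p_other)
    probs[stored_label_index] = p_stored
    return -np.log(probs)
```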
IV. EVALUATION

A. Experimental Setup

For evaluating the performance of our system, we used the ScanNet v2 dataset [4], a richly annotated large-scale dataset for indoor scene understanding. It provides RGB-D images captured by hand-held consumer-grade depth sensors, camera trajectories, reconstructed 3D models, and 2D/3D semantic annotations. In the following experiments, we used RGB-D images of size 640×480 pixels and the provided camera trajectories for a fair comparison. The dataset is composed of 1201 training scenes and 312 open test scenes. In addition, 100 hidden test scenes without publicly available semantic annotations are provided for the ScanNet Benchmark Challenge [2]. For quantitative evaluations, 20 class annotations are generally used. In this paper, we define the wall and floor as the stuff classes L^St and the other 18 classes, such as chairs and sofas, as the thing classes L^Th. Note that our system is not limited to indoor scenes, and the numbers of stuff and thing classes can be arbitrarily defined.

We employed ResNet-50 as the backbone of PSPNet. The network was initialized with ADE20K [37] pre-trained weights and was then fine-tuned using an SGD optimizer for 30 epochs with a learning rate of 0.01 and a batch size of 2. We leveraged ResNet-101-FPN for the Mask R-CNN backbone. After initialization with MS COCO [13] pre-trained weights, the network was fine-tuned by 4-step alternating learning [23] using an ADAM optimizer for 25 epochs with a learning rate of 0.001 and a batch size of 1. We used publicly available implementations of [1] and [3] for PSPNet and Mask R-CNN, respectively.

We used the following parameters for the integration process: a voxel size of 0.024 m, a truncation distance of 4× the voxel size, 16 × 16 × 16 voxels per voxel block, and a fixed IoU threshold θ_U. In the map regularization, w(1) = 10, w(2) = 15, θ_α = 0.05 m and θ_β = 20 were used with 5 iterations of CRF inference. The following experiments were performed on a computer equipped with an Intel Core i7-7800X CPU at 3.50 GHz and two NVIDIA GeForce GTX 1080Ti GPUs.

B. Quantitative and Qualitative Results

Fig. 3. Qualitative results obtained with the PanopticFusion system. From left to right, typical scenes in ScanNet v2 of scene0608_00, scene0643_00 and scene0488_01 are displayed. Note that the ground truth and our results leverage different reconstruction algorithms, and the colors of things in our results are not necessarily the same as in the ground truth.

Fig. 3 shows examples of 3D panoptic maps generated by our system. Unfortunately, there are no semantic mapping systems or 3D DNNs that can recognize a 3D scene at the level of stuff and things. Therefore, we evaluated the performance on two sub-tasks, 3D semantic segmentation and 3D instance segmentation, for a quantitative comparison. In this evaluation, we used the hidden test set of ScanNet v2. We show the results in Tables II and III. The tables list state-of-the-art methods that apply 3D DNNs to points or volumetric grids. Note that the methods of [10], [5], [9] leverage RGB images with associated camera poses as well. Our system, which uses only 2D-based recognition modules, surprisingly achieves comparable or superior performance compared with those methods, thanks to the careful integration of multi-view predictions. In terms of the class-wise accuracy, the results revealed that our system has advantages especially in the case of small objects such as sinks and pictures, and objects that are confusing to recognize only by their geometry, such as beds, bookshelves, and curtains. In Table II, several semantic segmentation methods outperform
our system because of their large receptive fields in 3D space. However, these methods basically need to reconstruct the entire scene in advance, assuming offline processing, while our system is an online and incremental framework. How to apply 3D DNNs to partial observations and how to integrate them into an online mapping system are left for future work.

Additionally, we evaluated 3D panoptic quality on the open test set of ScanNet v2, although there are no quantitatively comparable methods. We employed the evaluation criteria originally proposed in [11]. Note that the quality was evaluated with respect to each vertex instead of each pixel, and, as with the ScanNet 3D semantic instance benchmark, we ignored predicted things with fewer than 100 vertices. We show the panoptic quality (PQ) as well as the segmentation quality (SQ) and recognition quality (RQ) in Table IV. We hope these results will invigorate research in this field.

C. Evaluation of Map Regularization

In this section, we evaluate the map regularization proposed in Section III-G. First, we evaluated the effects of the map division on the recognition accuracy and computational time. We used the open test set for the recognition accuracy and typical scenes in ScanNet v2 for the computational time. The result is shown in Fig. 4. Note that, in this experiment, we applied regularization to the pre-generated map as a post process to evaluate solely the effects of the CRF.

Fig. 4. Results of the map regularization with the map division strategy: the relationship between the maximum number of voxel blocks and (a) recognition accuracy and (b) computational time. Note that the computational time is shown on a logarithmic scale.

As can be seen, the recognition performance was improved by the map regularization with the proposed unary potential approximation, regardless of whether or not map division was used. The results also show that the map division strategy drastically reduced the computational time without a decrease in recognition performance, compared with the case of building a CRF model for the whole map. Based on the above results, our online system employed map regularization with the map division strategy. We chose a maximum number of voxel blocks of 25 because of the better recognition accuracy and acceptable computational time. Table IV shows the difference in recognition performance due to whether or not map regularization was used in online processing. This result shows that the map regularization improved the recognition performance even when the system ran online. Note that the scores of almost all the classes were boosted by the proposed regularization. See Fig. 5 for qualitative effects of the map regularization.

Fig. 5. Qualitative results of map regularization. The noisy predictions within red circles are appropriately regularized, taking the spatial context into account.
TABLE II: 3D semantic segmentation results on the ScanNet (v2) 3D semantic label benchmark (hidden test set) [2]. The table shows IoU (%). Bold and underlined numbers denote first and second ranks, respectively.

                    avg. | wall floor cab  bed  chair sofa tabl door wind bkshf pic  cntr desk curt fridg showr toil sink bath ofurn
ScanNet [4]         30.6 | 43.7 78.6 31.1 36.6 52.4 34.8 30.0 18.9 18.2 50.1 10.2 21.1 34.2 0.2 24.5 15.2 46.0 31.8 20.3 14.5
PointNet++ [22]     33.9 | 52.3 67.7 25.6 47.8 36.0 34.6 23.2 26.1 25.2 45.8 11.7 25.0 27.8 24.7 21.2 14.5 54.8 36.4 58.4 18.3
SPLATNet [26]       39.3 | 69.9 92.7 31.1 51.1 65.6 51.0 38.3 19.7 26.7 60.6 0.0 24.5 32.8 40.5 0.1 24.9 59.3 27.1 47.2 22.7
Tangent Conv. [28]  43.8 | 63.3 91.8 36.9 64.6 64.5 56.2 42.7 27.9 35.2 47.4 14.7 35.3 28.2 25.8 28.3 29.4 61.9 48.7 43.7 29.8
3DMV [5]            48.4 | 60.2 79.6 42.4 53.8 60.6 50.7 41.3 37.8 53.9 64.3 21.4 31.0 43.3 57.4 53.7 20.8 69.3 47.2 48.4 30.1
TextureNet [10]     56.6 | 68.0 93.5 49.4 66.4 71.9 63.6 46.4 39.6 56.8 67.1 22.5 44.5 41.1 67.8 41.2 53.5 79.4 56.5

TABLE III: 3D instance segmentation results on the ScanNet (v2) 3D semantic instance benchmark (hidden test set) [2]. The table shows AP_50, the average precision with an IoU threshold of 0.5. Bold and underlined numbers denote first and second ranks, respectively.

                        avg. | cab  bed  chair sofa tabl door wind bkshf pic  cntr desk curt fridg showr toil sink bath ofurn
SGPN [30]               14.3 | 6.5 39.0 27.5 35.1 16.8 8.7 13.8 16.9 1.4 2.9 0.0 6.9 2.7 0.0 43.8 11.2 20.8 4.3
GSPN [35]               30.6 | 34.8 40.5 58.9 39.6 27.5 28.3 24.5 31.1 2.8
PanopticFusion (Ours)   47.8 |

TABLE IV: 3D panoptic segmentation results on the ScanNet (v2) open test set.

Columns: all, things, stuff | wall floor cab bed chair sofa tabl door wind bkshf pic cntr desk curt fridg showr toil sink bath ofurn
PanopticFusion w/o CRF
  PQ 29.7 26.7 56.7 | 37.5 76.0 18.6 29.1 37.8 38.2 29.5 13.8 14.1 13.0 26.5 8.3 14.9 11.6 38.0 28.8 72.4 33.3 28.0 24.3
  SQ 71.2 71.4 69.5 | 62.3 76.7 69.4 68.5 69.3 72.3 70.1 74.6 69.9 70.7 72.9 65.0 60.6 70.5 75.3 75.8 79.2 71.9 74.0 75.3
  RQ 41.1 36.8 79.6 | 60.2 99.0 26.8 42.5 54.6 52.8 42.1 18.5 20.1 18.4 36.3 12.8 24.6 16.4 50.4 37.9 91.3 46.4 37.8 32.2
PanopticFusion with CRF
  PQ 33.5 30.8 58.4 | 40.4 76.4 23.8 35.8 46.7 42.1 34.8 18.0 19.3 16.4 26.4 10.4 16.1 16.6 39.5 36.3 76.1 36.7 31.0 27.7
  SQ 73.0 73.3 70.7 | 64.0 77.4 71.1 70.1 74.3 74.6 74.3 76.0 72.5 73.9 71.2 65.1 61.7 72.3 77.7 79.5 81.4 72.7 75.3 75.8
  RQ 45.3 41.3 80.9 | 63.1 98.7 33.5 51.1 62.8 56.3 46.9 23.6 26.7 22.2 37.1 16.0 26.0 23.0 50.8 45.7 93.5 50.5 41.2 36.5

TABLE V: Run-time analysis.

Frequency                | Component                      | Time
Every Mask R-CNN frame   | PSPNet                         | 80 ms
                         | Mask R-CNN                     | 235 ms
                         | Panoptic label fusion          | 2 ms
                         | Reference panoptic label gen.  | 19 ms
                         | Panoptic label tracking        | 9 ms
                         | Volumetric integration         | 139 ms
                         | Probability integration        | ∼ ms
Every 10 sec.            | Map regularization             | 4.5 s
Every 1 sec.             | Mesh extraction                | 14 ms
Throughput               |                                | ∼4.3 Hz

D. Run-time Analysis

Table V shows the computational times of each component of our system, measured on scene0645_01, a typical large-scale scene in ScanNet v2 (shown in Fig. 1). PSPNet and Mask R-CNN each run on a GPU, and the other components are processed on the CPU. All components are basically processed in parallel. The throughput of our system is around 4.3 Hz, which is determined by Mask R-CNN, the bottleneck of our system. Although our current implementation is not highly optimized, our system is able to run at a rate allowing interaction. Note that the computational time, except for the map regularization, depends on neither the scale of the scene nor the number of things, because we utilize the raycasting approach for the integrations. The processing time of the map regularization increases to about 10 seconds at the end of the sequence, but it could be reduced by processing only the voxel blocks near the camera frustum.
V. APPLICATIONS

In this section, we demonstrate a promising AR application utilizing a 3D panoptic map generated online by the proposed system. A 3D panoptic map reconstructed as 3D meshes allows us to realize the following visualizations according to the context of the scene:
• Path planning on stuff regions such as floors and walls.
• Interaction with individual objects, i.e., the thing regions.
• Interaction appropriate for the semantics of each region.
• Natural occlusion and collision visualization.

Fig. 6. An example of an AR application using a 3D panoptic map generated by the PanopticFusion system.

We show an example of an AR application utilizing the above visualizations in Fig. 6. Humanoids and insect-type robots are able to locomote on the floor and wall meshes, respectively, according to automatic path planning. Additionally, the semantics of each object enables context-aware interactions such that humanoids sit and lie on chairs and sofas, respectively, and CG objects appear on tables. Moreover, we can naturally visualize occlusion effects, which are important for AR, because the 3D meshes of the scene are extracted. Note that, taking advantage of the accurately recognized 3D panoptic map, we can easily estimate the poses of the seats of chairs and sofas, and the top panels of tables, by using simple normal- and curvature-based segmentation and plane detection.

We believe that our system is useful not only for AR scenarios but also for autonomous robots that explore scenes and manipulate objects.

VI. CONCLUSIONS

In this paper, we have introduced a novel online volumetric semantic mapping system at the level of stuff and things. It performs dense semantic labeling while discriminating individual objects, as well as large-scale 3D reconstruction and labeled mesh extraction thanks to the use of a spatially hashed volumetric map representation. This was realized by pixel-wise panoptic label prediction and its volumetric integration with careful label tracking. In addition, we constructed a fully connected CRF model with respect to panoptic labels and inferred it online with a novel unary potential approximation and a map division strategy, which further improved the recognition performance. The experimental results showed that our system outperformed or compared well with state-of-the-art offline 3D DNN methods in terms of both 3D semantic and instance segmentation. In future work, we plan to extend our system to ensure global consistency against long-term pose drift, to perform high-throughput mapping by network reduction, and to support dynamic environments.

We believe that stuff and things-level semantic mapping will open the way to new applications of intelligent autonomous robotics and context-aware augmented reality that deeply interact with the real world.

REFERENCES

[1] PSPNet-Keras-tensorflow. https://github.com/Vladkryvoruchko/PSPNet-Keras-tensorflow.
[2] ScanNet benchmark challenge. http://kaldir.vc.in.tum.de/scannet_benchmark/, accessed 2019-02-27.
[3] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, 2017.
[4] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] Angela Dai and Matthias Nießner. 3DMV: Joint 3D-multi-view prediction for 3D semantic scene segmentation.
arXiv preprint arXiv:1803.10409, 2018.
[6] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE Int. Conf. on Computer Vision (ICCV), 2017.
[8] Alexander Hermans, Georgios Floros, and Bastian Leibe. Dense 3D semantic mapping of indoor scenes from RGB-D images. In IEEE Int. Conf. on Robotics and Automation (ICRA), 2014.
[9] Ji Hou, Angela Dai, and Matthias Nießner. 3D-SIS: 3D semantic instance segmentation of RGB-D scans. arXiv preprint arXiv:1812.07003, 2018.
[10] Jingwei Huang, Haotian Zhang, Li Yi, Thomas Funkhouser, Matthias Nießner, and Leonidas Guibas. TextureNet: Consistent local parametrizations for learning from high-resolution signals on meshes. arXiv preprint arXiv:1812.00020, 2018.
[11] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. arXiv preprint arXiv:1801.00868, 2018.
[12] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, 2011.
[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conf. on Computer Vision (ECCV), 2014.
[14] Chen Liu and Yasutaka Furukawa. MASC: Multi-scale affinity with sparse convolution for 3D instance segmentation. arXiv preprint arXiv:1902.04478, 2019.
[15] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH Computer Graphics, volume 21, pages 163-169. ACM, 1987.
[16] John McCormac, Ronald Clark, Michael Bloesch, Andrew Davison, and Stefan Leutenegger. Fusion++: Volumetric object-level SLAM. In IEEE Int. Conf. on 3D Vision (3DV), 2018.
[17] John McCormac, Ankur Handa, Andrew Davison, and Stefan Leutenegger. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In IEEE Int. Conf. on Robotics and Automation (ICRA), 2017.
[18] Yoshikatsu Nakajima and Hideo Saito. Efficient object-oriented semantic mapping with object detector. IEEE Access, 7:3206-3213, 2019.
[19] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE Int. Symposium on Mixed and Augmented Reality (ISMAR), 2011.
[20] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Transactions on Graphics (ToG), 32(6):169, 2013.
[21] Helen Oleynikova, Zachary Taylor, Marius Fehr, Roland Siegwart, and Juan Nieto. Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2017.
[22] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, 2017.
[23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[24] M. Runz, M. Buffier, and L. Agapito.
MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In IEEE Int. Symposium on Mixed and Augmented Reality (ISMAR), 2018.
[25] Renato F Salas-Moreno, Richard A Newcombe, Hauke Strasdat, Paul HJ Kelly, and Andrew J Davison. SLAM++: Simultaneous localisation and mapping at the level of objects. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
[26] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[27] Niko Sünderhauf, Trung T Pham, Yasir Latif, Michael Milford, and Ian Reid. Meaningful maps with object-oriented semantic mapping. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS). IEEE, 2017.
[28] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3D. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[29] Keisuke Tateno, Federico Tombari, and Nassir Navab. When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM. In IEEE Int. Conf. on Robotics and Automation (ICRA), 2016.
[30] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. SGPN: Similarity group proposal network for 3D point cloud instance segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.
[31] Thomas Whelan, Stefan Leutenegger, Renato F. Salas-Moreno, Ben Glocker, and Andrew J. Davison. ElasticFusion: Dense SLAM without a pose graph. In Robotics: Science and Systems (RSS), 2015.
[32] Yu Xiang and Dieter Fox. DA-RNN: Semantic mapping with data associated recurrent neural networks. In Robotics: Science and Systems (RSS), 2017.
[33] Binbin Xu, Wenbin Li, Dimos Tzoumanikas, Michael Bloesch, Andrew Davison, and Stefan Leutenegger. MID-Fusion: Octree-based object-level multi-instance dynamic SLAM. arXiv preprint arXiv:1812.07976, 2018.
[34] S. Yang, Y. Huang, and S. Scherer. Semantic 3D occupancy mapping through efficient high order CRFs. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2017.
[35] Li Yi, Wang Zhao, He Wang, Minhyuk Sung, and Leonidas Guibas. GSPN: Generative shape proposal network for 3D instance segmentation in point cloud. arXiv preprint arXiv:1812.03320, 2018.
[36] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
[37] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset.