Fusion-Aware Point Convolution for Online Semantic 3D Scene Segmentation
Jiazhao Zhang*, Chenyang Zhu*, Lintao Zheng, Kai Xu†
National University of Defense Technology and SpeedBot Robotics Ltd.
Abstract
Online semantic 3D segmentation in company with real-time RGB-D reconstruction poses special challenges, such as how to perform 3D convolution directly over the progressively fused 3D geometric data, and how to smartly fuse information from frame to frame. We propose a novel fusion-aware 3D point convolution which operates directly on the geometric surface being reconstructed and effectively exploits the inter-frame correlation for high-quality 3D feature learning. This is enabled by a dedicated dynamic data structure which organizes the online acquired point cloud with global-local trees. Globally, we compile the online reconstructed 3D points into an incrementally growing coordinate interval tree, enabling fast point insertion and neighborhood query. Locally, we maintain the neighborhood information for each point using an octree whose construction benefits from the fast query of the global tree. Both levels of trees update dynamically and help the 3D convolution effectively exploit the temporal coherence for information fusion across RGB-D frames. Through evaluation on public benchmark datasets, we show that our method achieves state-of-the-art accuracy of semantic segmentation with online RGB-D fusion at interactive frame rates.
1. Introduction
Semantic segmentation of 3D scenes is a fundamental task in 3D vision. The recent state-of-the-art methods mostly apply deep learning on either 3D geometric data alone [25] or the fusion of 2D and 3D data [20]. These approaches, however, are usually offline, working with an already reconstructed 3D scene geometry [5, 14]. Online scene understanding associated with real-time RGB-D reconstruction [13, 22], on the other hand, is deemed to be more appealing due to the potential applications in robotics and AR. Technically, online analysis can also fully exploit the spatial-temporal information during RGB-D fusion.

For the task of semantic scene segmentation in company with RGB-D fusion, deep-learning-based approaches commonly adopt the frame feature fusion paradigm.

* Joint first authors.
† Corresponding author: [email protected]
Frame 180 | Frame 260 | Frame 350
Figure 1: We present fusion-aware 3D point convolution, which operates directly over the progressively acquired and online reconstructed scene surface. The point-wise labeling is gradually improved (the chairs are recognized) as more and more frames (first row) are fused in.

Such methods first perform 2D convolution in the individual RGB-D frames and then fuse the extracted 2D features across consecutive frames. Previous works conduct such feature fusion through either a max-pooling operation [14] or Bayesian probability updating [20]. We advocate the adoption of direct convolution over 3D surfaces for frame feature fusion. 3D convolution on surfaces learns features of the intrinsic structure of the geometric surfaces [2] that cannot be well captured by view-based convolution and fusion. During online RGB-D fusion, however, the scene geometry changes progressively with the incremental scanning and reconstruction, making it difficult to perform 3D convolution directly over the time-varying geometry. Besides, to attain powerful 3D feature learning, special designs are needed to exploit the temporal correlation between adjacent frames.

In this work, we argue that a fast and powerful 3D convolution for online segmentation necessitates an efficient and versatile in-memory organization of dynamic 3D geometric data. To this end, we propose a tree-based global-local dynamic data structure to enable efficient data maintenance and 3D convolution of time-varying geometry. Globally, we organize the online fused 3D points with an incrementally growing coordinate interval tree, which enables fast point insertion and neighborhood query. Locally, we maintain the neighborhood information for each point using an octree whose dynamic update benefits from the fast query of the global tree. The local octrees facilitate efficient point-wise 3D convolution directly over the reconstructed geometric surfaces. Both levels of trees update dynamically along with the online reconstruction.

The dynamic maintenance of the two-level trees supports 3D point convolution with feature fusion across RGB-D frames, leading to the so-called fusion-aware point convolution. First, point correspondence between consecutive frames can be easily retrieved from the global tree, so that both the 2D and 3D features of a point can be efficiently aggregated from frame to frame when the point is observed by multiple frames.
Second, with the help of per-point octrees, we realize adaptive convolution kernels at each point by weighting its neighboring points based on approximate geodesic distance. This allows a progressive improvement of labeling accuracy across frames.

Through extensive evaluation on public benchmark datasets, we demonstrate that our method performs online 3D scene semantic segmentation at interactive frame rates (and even higher for key-frame-based processing) while achieving high accuracy, outperforming state-of-the-art offline methods. In particular, the accuracy achieves top ranking in the ScanNet benchmark, outperforming many existing approaches including both online and offline ones. Our main contributions include:

• A tree-based global-local dynamic data structure enabling efficient and powerful 3D convolution on time-varying 3D geometry.
• A fusion-aware point convolution which exploits inter-frame correlation for quality 3D feature learning.
• An interactive system implementing our online segmentation with real-time RGB-D reconstruction.
2. Related work
3D scene segmentation.
Scene segmentation is a long-standing problem in computer vision. Here, we only review offline approaches handling 3D geometric data obtained either by RGB-D fusion or LiDAR acquisition. For fusion-based 3D reconstruction, many works [16, 18] show that the 2D labeling of the RGB-D frames can be incorporated into the volumetric or surfel map, resulting in stable 3D labeling. The labeling can be further improved with MRF or CRF inference over the 3D map. These works enjoy the advances in image-based CNNs for 2D segmentation. Taking advantage of direct 3D geometric feature learning, 3D deep learning approaches have become increasingly popular, where an efficient 3D convolution operation is the key. In these methods, 3D labeling is attained with CNNs operating directly on point clouds [25] or their voxelization [7]. Several other approaches conduct object detection over the 3D reconstruction and then predict a segmentation mask for each detection, leading to instance segmentation [14, 35, 40].
Online scene segmentation.
Apart from the majority of offline batch methods, online and incremental mapping and labeling has gained renewed interest lately due to the big success of multi-view deep learning [24, 32]. Since the early attempts on 3D semantic mapping from RGB-D sequences [10, 28, 30], a notable recent work of this kind is SemanticFusion [20]. It performs CNN-based 2D labeling for individual RGB-D frames and then probabilistically fuses the 2D predictions into a semantic 3D map. Instead of fusing prediction results, some methods [5, 14] adopt feature-map fusion based on a max-pooling operation, which is more deep-learning friendly. In our method, we advocate the use of 3D convolution to aggregate 2D features, where a major challenge is handling time-varying 3D geometric data. The DA-RNN method [38] aggregates frame features using recurrent neural networks with dynamically routed connections; it smartly utilizes the data association from SLAM to connect recurrent units on the fly. Our method, on the other hand, pursues effective direct 3D convolution by exploiting the data association between frames.
Point cloud convolution.
Since 3D convolution is naturally performed on 3D Euclidean grids, early practice opts to first convert 3D point clouds to 2D images [32] or 3D volumes [19] and then perform Euclidean convolution. For the task of semantic segmentation, most approaches choose to extract features in 2D and then perform segmentation in 3D based on the 2D features [18]. Volumetric convolution is limited in resolution due to computational cost, which can be relieved with efficient data structures [15, 27]. These accelerations, however, cannot handle dynamically changing point clouds like ours. Atzmon et al. [1] propose a unique volume-based point convolution which consists of two operators, extension and restriction, mapping point cloud functions to volumetric ones and vice versa; point cloud convolution is then defined by the extension and restriction of volumetric convolution against the point cloud. Since the pioneering work of PointNet [25], there have been many works focusing on direct convolution on 3D point clouds. Existing works aim either to improve the neighborhood structure [7, 17, 26, 31, 33] or to enhance the convolutional filters [11, 29, 36, 37, 39]. These methods are designed to process a fixed neighborhood on static point clouds. We design a new point convolution for time-varying geometric data with dedicated designs targeting both aspects. First, we maintain and update a surface-aware neighborhood structure based on a tree-based dynamic data structure. Second, we learn adaptive convolutional filters by exploiting the temporal coherence between consecutive frames.
Figure 2: An overview of our pipeline. The input to our method is an online acquired RGB-D sequence being reconstructed in real-time (a). Based on the online reconstruction, we construct global-local trees to maintain a global spatial organization of the reconstructed point cloud as well as per-point local neighborhoods (b). The dynamic data structure supports fusion-aware point convolution encompassing intra-frame and inter-frame feature fusion (c). Finally, the point-wise features are used for point label prediction, leading to a semantic segmentation (d).
3. Method
Overview.
Figure 2 provides an overview of the proposed online 3D scene segmentation method. The input to our method is an online acquired RGB-D sequence being reconstructed with real-time depth fusion [6]. Let us denote the RGB-D sequence by $f_k = \{(c_m^k, p_m^k)\}_{m=0}^{M}$, $k = 0, 1, \ldots, K$, where $c_m^k$ and $p_m^k$ store the RGB-D information of pixel $m$ of frame $k$ and the coordinates of its corresponding 3D point, respectively. Given a reconstruction represented by a point set $\mathcal{P}$, we construct a global tree $\mathcal{T}_G$ maintaining the spatial organization of all points, as well as per-point local trees $\{\mathcal{T}_L(p)\}_{p \in \mathcal{P}}$ storing the 1-ring neighborhood of each point (Section 3.1). The dynamic data structure supports fusion-aware point convolution encompassing intra-frame and inter-frame feature fusion (Section 3.2). The point-wise features are used for point label prediction, resulting in a semantic segmentation (Section 3.3).

To support both intra-frame and inter-frame feature learning with point-based convolution, we require a data structure to organize the dynamically reconstructed, unstructured point cloud. There are several considerations in designing such a dynamic data structure.
First, to facilitate point-based convolution, we need to construct the local neighborhood of any given point.
Second, the data structure should support fast updates of the local neighborhoods under time-varying geometry.
Third, to realize 2D-to-3D and frame-to-frame feature fusion, the data structure should allow us to find correspondences between image pixels and reconstructed points. This way, pixels across different frames can be matched through the shared corresponding 3D point.

To meet these requirements, we design a two-level tree-based data structure. Globally, we construct a coordinate-based tree organization of points which supports fast neighborhood query for any given point. This allows us to find pixel-to-point correspondences and point-based neighborhoods efficiently. Based on the global tree, we build for each point an octree from which multi-scale local neighborhoods can be found quickly for point-based convolution.
Global coordinate interval tree.
We maintain three coordinate interval trees $\mathcal{T}_G^x$, $\mathcal{T}_G^y$ and $\mathcal{T}_G^z$, one for each dimension. Without loss of generality, we take $\mathcal{T}_G^x$ as an example to describe the tree construction. Each node $n_i \in \mathcal{T}_G^x$ records a set of points $P_i^x \subset \mathcal{P}$ in which each point has its x-coordinate lying in the interval $[x_{\min}(n_i), x_{\max}(n_i)]$; $x_{\min}(n_i)$ and $x_{\max}(n_i)$ are the minimum and maximum thresholds for node $n_i$. We stipulate that adjacent nodes in a coordinate interval tree comply with the following interval constraints:
$$x_{\max}(n_l) < x_{\min}(n_p), \qquad x_{\max}(n_p) < x_{\min}(n_r),$$
with $n_l$ and $n_r$ being the left and right child of node $n_p$. The entire 3D scene is thus split into slices along the x-dimension.

The coordinate interval tree is constructed dynamically as more 3D points are reconstructed and inserted. Point insertion into a coordinate interval tree is conducted as follows. Given a 3D point $p = (x_p, y_p, z_p)$, we first look for a node $n_i \in \mathcal{T}_G^x$ satisfying $x_p \in [x_{\min}(n_i), x_{\max}(n_i)]$ through a top-down traversal of the tree. If such a node exists, $p$ is added to the corresponding point set $P_i^x$ of the node. Otherwise, we create a new leaf node whose point set is initialized as $\{p\}$ and whose coordinate interval is $[x_p - h, x_p + h]$, where $h$ is the half size of coordinate intervals. This new node is then attached to the node whose interval is closest to the new node's interval. A detailed explanation of the tree construction with balance maintenance can be found in the supplemental material.

After constructing the coordinate interval trees for all three dimensions, we can achieve efficient point correspondence search and neighborhood retrieval for any given query 3D point $q = (x_q, y_q, z_q)$. Through traversing the three trees, we obtain three nodes $n_i \in \mathcal{T}_G^x$, $n_j \in \mathcal{T}_G^y$, $n_k \in \mathcal{T}_G^z$ satisfying $x_q \in [x_{\min}(n_i), x_{\max}(n_i)]$, $y_q \in [y_{\min}(n_j), y_{\max}(n_j)]$ and $z_q \in [z_{\min}(n_k), z_{\max}(n_k)]$, respectively. The neighborhood of point $q$ is simply the intersection of the three corresponding point sets: $\mathcal{N}(q) = P_i^x \cap P_j^y \cap P_k^z$. Point correspondences can also be found efficiently within the neighborhood point set by using a distance threshold, and the adjacent intervals $\cup\mathcal{N}(q)$ around $\mathcal{N}(q)$ can be retrieved in a similar fashion.

Figure 3: Illustration of global-local trees. The global coordinate interval trees are shown for the x- and y-dimensions only. With these trees, we can find a local neighborhood for any given point as well as the correspondence between two pixels from different frames. The per-point octrees (illustrated with 2D quadtrees) can be used to find multi-ring neighborhoods.
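To make the global structure concrete, below is a minimal Python sketch of the per-axis slice organization and the three-way intersection query described above. The class and method names (`IntervalTree1D`, `GlobalTree`, `insert`, `neighborhood`) and the fixed slice width are our own illustration; the paper's implementation uses balanced (red-black) trees rather than a hash map, but the query semantics are the same.

```python
from collections import defaultdict

class IntervalTree1D:
    """One coordinate interval tree: maps fixed-width coordinate slices
    to the ids of the points falling inside them. (A dict keyed by the
    slice index stands in for the paper's balanced red-black tree.)"""

    def __init__(self, half_width=0.05):
        self.width = 2.0 * half_width          # interval size 2h
        self.slices = defaultdict(set)         # slice id -> point ids

    def insert(self, coord, pid):
        self.slices[int(coord // self.width)].add(pid)

    def query(self, coord, rings=0):
        """Points whose slice is within `rings` slices of coord's slice."""
        c = int(coord // self.width)
        out = set()
        for s in range(c - rings, c + rings + 1):
            out |= self.slices.get(s, set())
        return out

class GlobalTree:
    """Three per-axis interval trees; a neighborhood query intersects the
    three per-axis point sets, as in N(q) = P_i^x ∩ P_j^y ∩ P_k^z."""

    def __init__(self, half_width=0.05):
        self.axes = [IntervalTree1D(half_width) for _ in range(3)]
        self.points = []                       # pid -> (x, y, z)

    def insert(self, p):
        pid = len(self.points)
        self.points.append(p)
        for axis, tree in enumerate(self.axes):
            tree.insert(p[axis], pid)
        return pid

    def neighborhood(self, q, rings=0):
        sets = [t.query(q[a], rings) for a, t in enumerate(self.axes)]
        return sets[0] & sets[1] & sets[2]

# Usage: insert fused points on the fly, then query any location.
gt = GlobalTree(half_width=0.05)
for p in [(0.0, 0.0, 0.0), (0.02, 0.01, 0.0), (1.0, 1.0, 1.0)]:
    gt.insert(p)
print(gt.neighborhood((0.01, 0.0, 0.0)))       # -> {0, 1}
```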
Local per-point octrees.

Although the global coordinate interval tree can be used to find a local neighborhood for any given point, that neighborhood is merely an unstructured point set. To conduct point convolution, a distance metric between points is required to apply convolution operations with distance-based kernels [17, 26]. To this end, we need to sort the set of neighboring points into a structured organization based on a surface-aware metric. This is achieved by maintaining per-point octrees, so that the surface-aware neighborhood at arbitrary scale can be found efficiently.

Given a point $p \in \mathcal{P}$, we first retrieve its local neighborhood $\mathcal{N}(p)$ using the coordinate interval trees. We then divide the extended point set $\cup\mathcal{N}(p)$ according to the eight octants of the Cartesian coordinate system originated at $p$. Within each octant, we add the point that is closest to $p$ as the child of the corresponding direction, if the shortest distance is smaller than a threshold $d_T$. If an octant does not contain a point, the corresponding child node is left empty. A detailed description of the per-point octree construction can be found in the supplemental material. After this process, each point obtains a 1-ring neighborhood organized in a direction-aware octree. Based on the direction-aware octrees, one can easily extend the 1-ring neighborhood of a point into multiple rings by chaining octree-based neighbor searches; see Figure 3.

The neighborhood search enabled by the per-point octrees has two important characteristics. First, for each point, its eight neighbor points (child nodes) are scattered in the eight octants of its local Cartesian frame. When finding n-ring neighbors based on the octree-based point connections, the consecutive searches roughly follow the eight directions. Consequently, the octrees direct the search to find evenly distributed neighbors in all directions, which would be expensive to realize with naive region growing.
Second, since the octrees maintain a fine-scale local neighborhood, the search path along the octree-based connections approximately follows the 3D surface. This results in surface-aware n-ring neighborhoods. Both characteristics substantially benefit point convolution, yielding improved 3D features over those learned with Euclidean-distance-based neighborhoods, as demonstrated in Section 4.
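The following sketch illustrates the two local operations just described: assigning, per octant, the closest candidate point as a child, and chaining those links to expand an n-ring neighborhood. Names (`build_local_octree`, `n_ring`) and the threshold value are illustrative assumptions, not the authors' code.

```python
import math
from itertools import product

def octant(p, q):
    """Index (0..7) of the octant of q relative to p."""
    return (q[0] > p[0]) << 2 | (q[1] > p[1]) << 1 | (q[2] > p[2])

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_local_octree(p, candidates, d_T=0.1):
    """Per-point 'octree': for each of the 8 octants around p, keep the
    closest candidate within threshold d_T, or None if the octant is empty."""
    children = [None] * 8
    best = [float("inf")] * 8
    for q in candidates:
        if q == p:
            continue
        o, d = octant(p, q), dist(p, q)
        if d < d_T and d < best[o]:
            best[o], children[o] = d, q
    return children

def n_ring(p, octrees, n):
    """Expand the 1-ring into n rings by chaining octree links, so the
    search path roughly follows the scanned surface."""
    ring, seen = {p}, {p}
    for _ in range(n):
        nxt = set()
        for r in ring:
            for child in octrees.get(r, []):
                if child is not None and child not in seen:
                    nxt.add(child)
        seen |= nxt
        ring = nxt
    return seen - {p}

# Usage on a small planar grid of points.
pts = [(x * 0.05, y * 0.05, 0.0) for x, y in product(range(4), range(4))]
octrees = {p: build_local_octree(p, pts) for p in pts}
print(len(n_ring(pts[0], octrees, n=2)))   # points reachable within 2 rings
```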
We propose a convolution operation, named fusion-aware point convolution, which extends the recent PointConv [37] with intra-frame and inter-frame feature fusion. PointConv introduces a convolution operation over a point cloud:
$$\mathrm{PC}_p(W, F) = \sum_{\Delta p \in \Omega} W(\Delta p)\, F(p + \Delta p),$$
where $F(p + \Delta p)$ is the feature of a point in the local region $\Omega$ centered at $p$, and $W$ is a weight function.

An RGB-D frame sequence mixes rich 2D and 3D information with time stamps. However, PointConv only utilizes limited 3D information, and the rest makes no contribution to the segmentation task. To improve on this, there are three primary questions that we seek to answer in this section. First, PointConv is mainly about 3D, but how can 2D information be fused properly with 3D? Second, can we construct a better local region $\Omega$ which ensures the neighborhood is more relevant? Third, how can we utilize the inter-frame information given by the sequence?
2D-3D feature fusion.

3D feature fusion with online RGB-D reconstruction should better utilize the temporal correlation between adjacent frames, which goes beyond simplistic projection-based 2D-3D correlation. The feature encoding at a 3D point should consider all the matched pixels in different frames if the point is observed from multiple views. Pixel correspondences between consecutive frames can be easily retrieved based on $\mathcal{T}_G^x$, $\mathcal{T}_G^y$ and $\mathcal{T}_G^z$. This way, each 3D point $p$ in the scene has a set of corresponding 2D pixels $I(p) = \{c_k\}$, one per frame in which $p$ is observed. We can extract a feature for each pixel $c_k$ intra-frame via 2D convolution on the image. We adopt a pre-trained FuseNet [9] without multi-scale layers as our 2D feature encoder (Section 3.3):
$$F_D(c_k) = \mathrm{FuseNet}(f_k, c_k),$$
where $\mathrm{FuseNet}(f_k, c_k)$ is the 2D feature given by FuseNet for the pixel $c_k$ in frame $f_k$. Each 3D point $p$ in the scene therefore has a set of corresponding 2D features, and max-pooling is adopted to fuse them into one feature:
$$F(p) = \mathrm{maxpool}\,\{\, F_D(c_k) \mid c_k \in I(p) \,\}.$$
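A short sketch of this 2D-to-3D fusion step in PyTorch-style Python; the FuseNet encoder is abstracted away, and the per-point pixel set $I(p)$ is assumed to have been gathered via the global trees:

```python
import torch

def fuse_2d_features(pixel_features):
    """F(p) = maxpool{ F_D(c_k) | c_k in I(p) }: element-wise max over
    the 2D features of all pixels observing the same 3D point p.
    pixel_features: (K, C) tensor, one row per observing frame."""
    return pixel_features.max(dim=0).values

# A point observed in 3 frames, each contributing a 128-dim 2D feature.
fused = fuse_2d_features(torch.randn(3, 128))   # -> shape (128,)
```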
Octree-induced surface-aware 3D convolution.

Most deep convolutional neural networks for 3D point clouds gather neighborhood information on the basis of Euclidean distance. Geodesic distance, however, better captures the underlying geometry and topology of 3D surfaces. We propose an octree-based neighborhood to take advantage of the geodesic approximation offered by our octree structure. In particular, the local region $\Omega$ for each point $p$ is given by its octree $\mathcal{T}_L(p)$. This ensures the neighborhood only grows along the object surface, i.e., it is surface-aware, and does not skip across gaps to reach the shortest distance under the Euclidean metric. A visual example is shown in Figure 8: the neighborhood found by our approach is more semantically related to the central point, which benefits the subsequent segmentation task. The construction of the n-ring local region $\Omega_n(p)$ follows
$$\Omega_n(p) = \{\, \mathcal{T}_L(p') \mid p' \in \mathcal{O}(\Omega_{n-1}(p)) \,\}, \qquad \Omega_1(p) = \Omega(p).$$
We name the n-ring local region $\Omega_n(p)$ the octree-based neighborhood, and we adopt it to improve our fusion-aware point convolution. The convolution $\mathrm{FPC}_p(W, F)$ is formulated as
$$\mathrm{FPC}_p(W, F) = \sum_{\Delta p \in \Omega_n(p)} W(\Delta p)\, F(p + \Delta p).$$
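As a sketch of how this convolution could look in PyTorch, with the weight function $W$ realized as a small MLP on the offset $\Delta p$ (a common PointConv-style choice; the exact kernel parameterization here is our assumption, not the released architecture):

```python
import torch
import torch.nn as nn

class FusionAwarePointConv(nn.Module):
    """FPC_p(W, F) = sum over Δp in Ω_n(p) of W(Δp) * F(p + Δp),
    with Ω_n(p) the octree-induced n-ring neighborhood of p."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # W: offset (3,) -> per-channel kernel weight
        self.weight_mlp = nn.Sequential(
            nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, in_dim))
        self.project = nn.Linear(in_dim, out_dim)

    def forward(self, p, neighbors, feats):
        """p: (3,); neighbors: (K, 3) points of Ω_n(p); feats: (K, C)."""
        offsets = neighbors - p              # Δp for each neighbor
        w = self.weight_mlp(offsets)         # (K, C) kernel weights
        agg = (w * feats).sum(dim=0)         # Σ W(Δp) F(p + Δp)
        return self.project(agg)             # (out_dim,)

conv = FusionAwarePointConv(in_dim=128, out_dim=128)
p = torch.zeros(3)
neighbors = torch.randn(8, 3) * 0.05         # e.g., the octree 1-ring
out = conv(p, neighbors, torch.randn(8, 128))
```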
Frame-to-frame feature fusion.

Besides 2D-3D feature fusion, inter-frame information about segmentation uncertainty can benefit the segmentation task as well. A recent work on active scene segmentation [41] demonstrates that segmentation entropy, or uncertainty, is crucial for online processing. We introduce a frame-to-frame feature fusion which utilizes the segmentation results given by previous frames to improve the performance on following frames. For each 3D point $p$, our method updates its segmentation result whenever $p$ is observed in a new frame $f_i$. Although we do not know at test time whether the result is correct, the predicted segmentation uncertainty $U(p, i)$ given by the convolved feature $\mathrm{FPC}_p^i(W, F)$ at frame $i$ can be easily retrieved from the network. Our basic idea is that if $p$ has low segmentation uncertainty in frame $f_i$, the current form of feature fusion should be useful for future predictions. In practice, we record every uncertainty $U(p, i)$ while processing the frame sequence, and max-pool the feature $\mathrm{FPC}_p^{\mathrm{current}}(W, F)$ updated in the current frame with the recorded feature of lowest uncertainty. Thus, we rewrite the feature fusion as
$$F_{\mathrm{fused}}(p) = \mathrm{maxpool}\,\bigl\{\, \mathrm{FPC}_p^{\mathrm{current}}(W, F),\ \mathrm{FPC}_p^{i^*}(W, F) \,\bigr\}, \qquad i^* = \arg\min_i U(p, i).$$
This way, the intra- and inter-frame information of $p$ is fused into $F_{\mathrm{fused}}(p)$, maximally utilizing the information of the input RGB-D frames and significantly improving the performance of semantic segmentation.

As shown in Figure 4, the backbone of our proposed online segmentation network is the global-local tree structure introduced in Section 3.1. The global coordinate interval tree helps map corresponding 2D features for 2D-3D feature fusion, and the local octrees help search the surface-aware neighborhood for 3D fusion-aware convolution.
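A minimal sketch of this fusion rule, assuming entropy of the predicted label distribution as the uncertainty $U(p, i)$ (the paper only requires that some per-point uncertainty be readable from the network; the class name and bookkeeping are our illustration):

```python
import torch

class PointHistory:
    """Per-point record of the lowest-uncertainty convolved feature."""

    def __init__(self):
        self.best_feat = None
        self.best_uncertainty = float("inf")

    def fuse(self, feat_current, logits_current):
        # Uncertainty as entropy of the predicted label distribution.
        probs = torch.softmax(logits_current, dim=-1)
        u = -(probs * torch.log(probs + 1e-8)).sum().item()

        if self.best_feat is None:
            fused = feat_current
        else:
            # F_fused(p) = maxpool{ FPC_p^current, FPC_p^{argmin_i U(p,i)} }
            fused = torch.maximum(feat_current, self.best_feat)

        # Keep the feature of the frame with the lowest uncertainty so far.
        if u < self.best_uncertainty:
            self.best_uncertainty, self.best_feat = u, feat_current.detach()
        return fused

# Usage: one history object per 3D point, fused every time it is observed.
hist = PointHistory()
fused = hist.fuse(torch.randn(128), torch.randn(20))   # 20 = #classes
```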
Network architecture.
The backbone of our 2D feature encoder is FuseNet [9]. Note, however, that we discard the multi-scale layers in our implementation, since the re-sampling operation is too time-consuming; please refer to the supplemental material for details. This modification enables our method to achieve a close-to-interactive frame rate with high accuracy.

Following Section 3.2, the 2D features $\mathrm{FuseNet}(f_i, c_k)$ in different frames corresponding to the same 3D point $p$ are fused into $F(p)$. The fused features are fed to our fusion-aware convolution, and the convolved feature $\mathrm{FPC}_p^i(W, F)$ for point $p$ in frame $i$ is calculated over its n-ring neighborhood $\Omega_n(p)$ along the geometric surface. The convolved feature is sent to a simple fully connected network FC, consisting of 3 MLP layers, to obtain a final feature of length 128. It is then further fused, through max-pooling, with the stored feature with the highest segmentation confidence from previous frames.
Figure 4: Network architecture. We show the pipeline with two consecutive frames. The global and local trees are dynamic data structures which evolve through time. The network outputs point-wise labels along with confidences. The label prediction is used to update features from the previous frame to be fused into the next frame.

Finally, we use the fused feature to predict the semantic label for $p$ with a one-layer classifier. Note that our network updates $\mathrm{FPC}_p^{i^*}(W, F)$, with $i^* = \arg\min_i U(p, i)$, simultaneously, to ensure that the feature with the lowest segmentation uncertainty is adopted in future fusion-aware processing.

Training details.
The batch size in training is 64. For each batch, we randomly select 8 different scenes, each contributing a sequence of 8 frames. The first frames of the sampled sequences form the first 8 entries of the batch, and the $n$-th frames form the $(8(n-1))$-th to $(8n-1)$-th entries. We back-propagate the training gradients once every 8 forward passes of frames, and update the network weights after the forward pass of the whole batch. The training of our network on ScanNet [4] takes about ~ hours on a single Titan Xp GPU. For more details please refer to the supplemental material.
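Under one reading of this schedule (one backward pass per 8 per-frame forward passes, gradients accumulated into a single weight update per batch), the training loop could be sketched as follows; dataset access, the model and the loss are abstracted placeholders:

```python
import torch

def train_batch(model, optimizer, scenes, loss_fn):
    """One batch: 8 scenes x 8 consecutive frames = 64 samples.
    The n-th frames of all clips occupy batch slots 8n .. 8n+7 (0-indexed).
    Gradients are back-propagated once per 8 forward passes; weights are
    updated once per batch."""
    optimizer.zero_grad()
    for n in range(8):                       # frame index within each clip
        loss = sum(loss_fn(model(scene[n]["inputs"]), scene[n]["labels"])
                   for scene in scenes) / 64.0
        loss.backward()                      # one backward per 8 forwards
    optimizer.step()                         # single update per batch
```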
4. Results and evaluations
We first introduce our benchmark datasets and experimental setup. Comparisons with state-of-the-art alternatives are presented on both online and offline semantic segmentation tasks. We then conduct extensive evaluations of each component of our method, and demonstrate the advantage of its surface-aware characteristic through further experiments.
Dataset.
We evaluate our method on two datasets: ScanNet [4] and SceneNN [12]. ScanNet contains 1513 scanned scene sequences, out of which we use 1200 sequences for training and the remaining 312 for testing. SceneNN contains 50 high-quality scanned scene sequences with semantic labels. However, this dataset is not specifically organized for the online segmentation task: some of the scanned sequences lack camera pose information, and in some sequences the color images and depth maps are not well aligned. After filtering, we select 15 clean sequences from SceneNN with proper scan information for our evaluation.

Table 1: Accuracy comparison between our method and two state-of-the-art online scene segmentation methods.

Dataset | SemanticFusion [20] | ProgressiveFusion [23] | Ours
ScanNet | 0.518 | 0.566 |
SceneNN | 0.628 | 0.666 |
Experiment configuration.
To evaluate the performance of our method, we adopt accuracy and IoU as the two indicators in our experiments. Since different online segmentation methods may adopt different 3D reconstruction approaches, it is difficult to measure these two indicators on differing 3D point clouds. In our experiments, we therefore project the semantic labels of the 3D points into their corresponding 2D frames and measure accuracy and IoU in 2D.
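A sketch of this evaluation protocol with a standard pinhole projection; the function names and camera conventions are our assumptions:

```python
import numpy as np

def project_labels(points, labels, K, T_wc, img_shape):
    """Render per-point semantic labels into one frame.
    points: (N, 3) world coordinates; labels: (N,) predicted labels;
    K: (3, 3) camera intrinsics; T_wc: (4, 4) world-to-camera pose."""
    h, w = img_shape
    label_img = np.full((h, w), -1, dtype=np.int64)       # -1 = no point
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (T_wc @ pts_h.T).T[:, :3]
    front = cam[:, 2] > 1e-6                              # in front of camera
    uv = (K @ cam[front].T).T
    uv = np.round(uv[:, :2] / uv[:, 2:3]).astype(int)
    lbl = labels[front]
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    label_img[uv[ok, 1], uv[ok, 0]] = lbl[ok]
    return label_img

def frame_accuracy(label_img, gt_img):
    """2D accuracy over pixels that received a projected 3D label."""
    mask = label_img >= 0
    return float((label_img[mask] == gt_img[mask]).mean()) if mask.any() else 0.0
```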
Comparison with other online methods.
Our method is compared to two state-of-the-art online segmentation methods for indoor scenes: SemanticFusion [20] and ProgressiveFusion [23]. The comparison is conducted on SceneNN and ScanNet respectively. The mean accuracies of the three methods are shown in Table 1. From the results we can clearly see that our method achieves the highest segmentation accuracy on both datasets. Note that we only train our method on the ScanNet dataset and do not fine-tune it on SceneNN, yet it still outperforms the other methods on both datasets. This demonstrates the generality of our method: it can be easily adopted on different datasets.
Comparison with offline methods.
To further demonstrate the superiority of our segmentation method, we also set up a comparison with three state-of-the-art offline segmentation methods for indoor scenes: SparseConvNet [8], PointConv [37] and MinkowskiNet [3]. Note that online segmentation methods face a particular challenge when compared to offline alternatives: a partial scene is more difficult to segment than the whole scene. We show some visual comparisons on partial scenes in Figure 5; our method achieves much better results on these challenging cases, which is crucial for online tasks. We also present a segmentation comparison on complete scenes in Table 2, where our method achieves performance comparable to the offline alternatives.

Table 2: IoU comparison between our method and state-of-the-art offline scene segmentation methods. Our method has the highest mean IoU, outperforming the state-of-the-art methods in nine semantic categories.

model | mean | wall | floor | cabinet | bed | chair | sofa | table | door | window | bookshelf | picture | counter | desk | curtain | fridge | shower curtain | toilet | sink | bathtub | others
SparseConvNet | 0.685 | 0.828 |
PointConv | 0.580 | 0.741 | 0.948 | 0.474 | 0.672 | 0.813 | 0.633 | 0.651 | 0.346 | 0.446 | 0.713 | 0.067 | 0.568 | 0.525 | 0.551 | 0.370 | 0.520 | 0.840 | 0.590 | 0.750 | 0.387
Ours |

Figure 5: Visual comparison between our method and MinkowskiNet [3]. Our method works better than MinkowskiNet, especially on small and incomplete objects. A live demo is provided in the accompanying video.
Feature fusion study.
We investigate the effect of two crucial designs on segmentation performance. We turn off the 2D-3D feature fusion and the frame-to-frame feature fusion in succession to assess how these two components benefit our method. The results are shown in Table 3. Without 2D-3D feature fusion, our method can no longer improve its performance using the multi-view information in the sequence. The absence of frame-to-frame feature fusion makes our method lose the ability to learn from the history of the sequence. We observe a significant drop in performance when either design is turned off, which proves the importance of both. Similar results are also demonstrated in the right plot of Figure 6.

Table 3: Ablation study on the two feature fusion operations of our method. The results justify each design.

2D-3D fusion | frame-to-frame fusion | accuracy
× | × |
✓ | × |
× | ✓ |
✓ | ✓ |
Temporal information study.
The segmentation label of a 3D point is updated whenever new information is fused in from the sequence. To further investigate how our fusion-aware point convolution benefits segmentation performance, we plot the accuracy of point labeling against the number of feature fusions in Figure 6. In the plots, "vanilla" refers to our basic model with the global-local trees and the improved point convolution. We observe a clear increase of segmentation accuracy with increasing scans. The effect is more observable on objects with complex structures, such as tables and chairs. Note that the performance of the vanilla model also improves with increasing scans; this is because both tree construction and point convolution benefit from a more complete point cloud.

Figure 6: Segmentation accuracy improves over time. The left plot shows per-category accuracy; objects with complex structures, such as tables and chairs, benefit more from our feature fusion. The right plot shows mean accuracy, which improves significantly with increasing scans.

To explore why points with more fusions achieve better segmentation performance, we track the feature changes of some points in Figure 7. We plot the feature embedding of the points marked in the yellow box, and find that the boundary between points with different semantic labels becomes clearer during the scanning. Our fusion-aware features are more separable in the embedding space, which makes the segmentation network easier to train.

Frame 180 | Frame 360 | Frame 520

Figure 7: The evolution of point-wise feature embedding as more frames are acquired and fused. We show the t-SNE plots of the feature embedding of the same group of points, with semantic labels indicated by color.
Surface-aware convolution context.
Another important characteristic of our method is that the neighborhood used in point convolution is surface-aware, enabled by our local octree structure. To verify this claim, we select two points on the point cloud and estimate their distance in Figure 8. We observe that tracing along the local octrees yields a path whose length is closer to the ground-truth geodesic distance than the Euclidean distance. The visual example also shows that the tree-induced path traces along the surface, making the point neighborhood surface-aware. Figure 8 also visualizes the kernel weight distributions around two points on the point cloud, showing that neighboring points which are geodesically closer receive higher weights.

This characteristic makes our method more geometry-aware. To verify that the geometric information benefits online segmentation, we compare convolutions based on Euclidean distance against our local-octree-induced distance in Figure 9. The results show that our surface-aware convolution performs better across the whole range of convolution areas (receptive field sizes).

Length of search path: 0.898 | Euclidean distance: 0.730 | Geodesic distance: 0.925 | Fusion-aware kernel weight distribution (panels a-d)

Figure 8: (a-c) Tracing the path between two selected points along the local octrees. We compare the path length to the Euclidean distance and the ground-truth geodesic distance, and find that the path traced along the local octrees is a good approximation of the geodesic distance. (d) The kernel weight distributions around two points (green dots) are visualized (red is high and blue is low).

Figure 9: Segmentation accuracy of different neighborhood-search approaches. Our local-octree-induced neighborhood always leads to better segmentation accuracy than the Euclidean-based one. We also find that increasing the neighborhood size beyond a certain value does not improve performance significantly.
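The path-length measurement of Figure 8 amounts to a shortest-path search over the octree links. A toy sketch follows, reusing per-point octrees keyed by point (as in the Section 3.1 sketch); the Dijkstra formulation is our choice of search:

```python
import heapq
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def octree_path_length(src, dst, octrees):
    """Shortest path along octree links (Dijkstra). The accumulated edge
    length approximates the geodesic distance along the scanned surface,
    whereas dist(src, dst) may tunnel straight across gaps."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, p = heapq.heappop(heap)
        if p == dst:
            return d
        if d > best.get(p, float("inf")):
            continue                        # stale heap entry
        for child in octrees.get(p, []):
            if child is None:
                continue
            nd = d + dist(p, child)
            if nd < best.get(child, float("inf")):
                best[child] = nd
                heapq.heappush(heap, (nd, child))
    return float("inf")                     # dst not reachable along the surface
```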
5. Conclusion
For the task of semantic segmentation of a 3D scene being reconstructed with RGB-D fusion, we have presented a tree-based dynamic data structure to organize online fused 3D point clouds. It supports 3D point convolution over time-varying geometry, fusing information between 2D and 3D and from frame to frame. Our method achieves online segmentation at a close-to-interactive frame rate while reaching state-of-the-art accuracy. Our current method has a few limitations which we plan to investigate in future work.
First, our current system does not support data streaming, which confines the per-frame point cloud density we can handle due to memory limits.
Second, our method still relies on accurate camera poses for high-quality point fusion and convolution. Although it works well with popular RGB-D fusion methods (such as BundleFusion [6]), it would be interesting to investigate how to integrate online semantic segmentation with camera tracking, accomplishing semantic SLAM with state-of-the-art accuracy for both. Further, we would like to test our method on outdoor scenes [34].
Acknowledgement
We thank the anonymous reviewers for their valuable suggestions. We are also grateful to Quang-Hieu Pham for the help with the comparison with ProgressiveFusion. This work was supported in part by NSFC (61572507, 61532003, 61622212), NUDT Research Grants (No. ZK19-30) and the Natural Science Foundation of Hunan Province for Distinguished Young Scientists (2017JJ1002).

References

[1] Matan Atzmon, Haggai Maron, and Yaron Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018.
[2] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[3] Christopher Choy, JunYoung Gwak, and Silvio Savarese. 4D spatio-temporal ConvNets: Minkowski convolutional neural networks. arXiv preprint arXiv:1904.08755, 2019.
[4] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. CVPR, pages 5828–5839, 2017.
[5] Angela Dai and Matthias Nießner. 3DMV: Joint 3D-multi-view prediction for 3D semantic scene segmentation. In Proc. ECCV, pages 452–468, 2018.
[6] Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG), 36(3):24, 2017.
[7] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In Proc. CVPR, pages 9224–9232, 2018.
[8] Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In Proc. CVPR, 2018.
[9] Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Proc. ACCV, 2016.
[10] Alexander Hermans, Georgios Floros, and Bastian Leibe. Dense 3D semantic mapping of indoor scenes from RGB-D images. In Proc. ICRA, pages 2631–2638. IEEE, 2014.
[11] Pedro Hermosilla, Tobias Ritschel, Pere-Pau Vázquez, Àlvar Vinacua, and Timo Ropinski. Monte Carlo convolution for learning on non-uniformly sampled point clouds. ACM Trans. on Graph. (SIGGRAPH Asia), page 235, 2018.
[12] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. In Proc. 3DV, 2016.
[13] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proc. UIST, pages 559–568, 2011.
[14] Ji Hou, Angela Dai, and Matthias Nießner. 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In Proc. CVPR, 2019.
[15] Roman Klokov and Victor Lempitsky. Escape from cells: Deep KD-networks for the recognition of 3D point cloud models. In Proc. ICCV, pages 863–872, 2017.
[16] Kevin Lai, Liefeng Bo, and Dieter Fox. Unsupervised feature learning for 3D scene labeling. In Proc. ICRA, pages 3050–3057. IEEE, 2014.
[17] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, pages 820–830, 2018.
[18] Lingni Ma, Jörg Stückler, Christian Kerl, and Daniel Cremers. Multi-view deep learning for consistent semantic mapping with RGB-D cameras. In Proc. IROS, pages 598–605. IEEE, 2017.
[19] Daniel Maturana and Sebastian Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Proc. IROS, pages 922–928. IEEE, 2015.
[20] John McCormac, Ankur Handa, Andrew Davison, and Stefan Leutenegger. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In Proc. ICRA, pages 4628–4635. IEEE, 2017.
[21] Gaku Narita, Takashi Seno, Tomoya Ishikawa, and Yohsuke Kaji. PanopticFusion: Online volumetric semantic mapping at the level of stuff and things. arXiv preprint arXiv:1903.01177, 2019.
[22] Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Marc Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. on Graph. (SIGGRAPH Asia), 32(6):169, 2013.
[23] Quang-Hieu Pham, Binh-Son Hua, Thanh Nguyen, and Sai-Kit Yeung. Real-time progressive 3D semantic segmentation for indoor scenes. In Proc. WACV, pages 1089–1098. IEEE, 2019.
[24] Charles R. Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proc. CVPR, 2016.
[25] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. CVPR, pages 652–660, 2017.
[26] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[27] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. OctNet: Learning deep 3D representations at high resolutions. In Proc. CVPR, pages 3577–3586, 2017.
[28] Renato F. Salas-Moreno, Richard A. Newcombe, Hauke Strasdat, Paul H. J. Kelly, and Andrew J. Davison. SLAM++: Simultaneous localisation and mapping at the level of objects. In Proc. CVPR, pages 1352–1359, 2012.
[29] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In Proc. CVPR, pages 3693–3702, 2017.
[30] Jörg Stückler and Sven Behnke. Multi-resolution surfel maps for efficient dense 3D modeling and tracking. Journal of Visual Communication and Image Representation, 25(1):137–147, 2014.
[31] Hang Su, Varun Jampani, Deqing Sun, Subhransu Maji, Evangelos Kalogerakis, Ming-Hsuan Yang, and Jan Kautz. SPLATNet: Sparse lattice networks for point cloud processing. In Proc. CVPR, pages 2530–2539, 2018.
[32] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proc. ICCV, pages 945–953, 2015.
[33] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3D. In Proc. CVPR, pages 3887–3896, 2018.
[34] The SYNTHIA dataset. https://synthia-dataset.net/.
[35] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. SGPN: Similarity group proposal network for 3D point cloud instance segmentation. In Proc. CVPR, pages 2569–2578, 2018.
[36] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
[37] Wenxuan Wu, Zhongang Qi, and Li Fuxin. PointConv: Deep convolutional networks on 3D point clouds. In Proc. CVPR, pages 9621–9630, 2019.
[38] Yu Xiang and Dieter Fox. DA-RNN: Semantic mapping with data associated recurrent neural networks. arXiv preprint arXiv:1703.03098, 2017.
[39] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. SpiderCNN: Deep learning on point sets with parameterized convolutional filters. In Proc. ECCV, pages 87–102, 2018.
[40] Li Yi, Wang Zhao, He Wang, Minhyuk Sung, and Leonidas J. Guibas. GSPN: Generative shape proposal network for 3D instance segmentation in point cloud. In Proc. CVPR, pages 3947–3956, 2019.
[41] Lintao Zheng, Chenyang Zhu, Jiazhao Zhang, Hang Zhao, Hui Huang, Matthias Nießner, and Kai Xu. Active scene understanding via online semantic reconstruction. arXiv preprint arXiv:1906.07409, 2019.

Supplementary Material

A. Introduction
This supplemental material contains four parts:
• Section B reports the construction details of the global coordinate interval tree and the local octrees.
• Section C reports the module details of our full-pipeline network, including parameter selection, network layers and loss functions.
• Section D shows the comparison with the online method PanopticFusion [21].
• Section E shows more progressive results of online semantic segmentation on the ScanNet [4] dataset.
B. Algorithm Details
Algorithm 1 details how we maintain nodes in a global coordinate interval tree when a point $p$ is detected in an incoming frame. Algorithm 2 describes the construction of the octree $\mathcal{T}_L(p)$ for a given 3D point $p$ based on the global coordinate interval trees $\mathcal{T}_G^x$, $\mathcal{T}_G^y$ and $\mathcal{T}_G^z$.

Algorithm 1: Adding $p$ into the global coordinate interval tree $\mathcal{T}_G^x$
Input: existing nodes $n_i \in \mathcal{T}_G^x$, new 3D position $p = (x_p, y_p, z_p)$ and threshold $d$.
Output: updated $\mathcal{T}_G^x$ and the node $n$ containing $p$.

    // find a node n satisfying x_min(n) < x_p < x_max(n)
    n <- TraverseTree(T_G^x, x_p)
    if such a node n exists then
        n <- AddIntoNode(n, x_p)
        return T_G^x, n
    // otherwise, create a node n containing p
    // (neighbor nodes can be pre-created to reduce creation costs)
    n <- CreateNewNode(x_p)
    interval <- GetInterval(x_p, d)
    x_min(n) <- interval_min;  x_max(n) <- interval_max
    // find the nearest node to n in T_G^x
    dist, node <- NearestNode(n, T_G^x)
    SetAsChild(node, n)
    // rebalance as a red-black tree
    return T_G^x, n

Algorithm 2: Construct the octree $\mathcal{T}_L(p)$ for $p \in \mathcal{P}$
Input: 3D point $p$, its neighborhood $\mathcal{N}(p)$ and threshold $h$.
Output: octree $\mathcal{T}_L(p)$.

    // initialize T_L(p) with 8 child slots N_m(p), one per direction m
    Initialize(T_L(p))
    foreach N_m(p) in T_L(p) do
        NearestDist(N_m(p)) <- inf
    // extend the neighborhood for the octree search
    UN(p) <- AdjIntervals(N(p))
    foreach p_i in UN(p) do
        // early exit: reuse the octree of a coincident point
        if Dist(p, p_i) < h then return T_L(p_i)
        foreach N_m(p) in T_L(p) do
            if SameDirection(m, p_i - p) then
                if Dist(p, p_i) < NearestDist(N_m(p)) then
                    NearestDist(N_m(p)) <- Dist(p, p_i)
                    N_m(p) <- p_i
    return T_L(p)

C. Network Architecture
Figure 10 presents the detailed configuration of our whole network. There are two important differences between our method and other semantic segmentation networks. First, since the predicted results from previous frames are adopted in future segmentation predictions, the online construction of our global-local trees would be time-consuming without special care during training. We find that some parameter tuning helps improve efficiency: in training, we only take 4096 points for global-local tree construction in each frame, a configuration that balances performance and efficiency. Second, our network needs to take continuous frames as input, since frame-to-frame information is required during training. However, adjacent frames may only have a limited number of pixels with large differences. If we updated the network weights after each frame's forward pass in a continuous sequence, training performance would be poor, since the gradients generated in the backward pass may be too small for stable weight updates. Therefore, we only update the network weights after 8 continuous backward passes, and we observe a significant performance improvement compared to the naive weight-update approach.

D. Website Benchmark Result
We test our result on the benchmark of the ScanNet website. Because our reconstructed point cloud differs from the official point cloud, we have to map the labels to the nearest points. In Table 4, there is a decrease in the result due to this mismatch, but our method still outperforms the state-of-the-art online method, which proves its effectiveness.
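This label transfer is a nearest-neighbor lookup; a sketch using SciPy's KD-tree (the specific tooling is our choice, not stated by the authors):

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_labels(recon_pts, recon_labels, official_pts):
    """Map predicted labels from our reconstructed point cloud onto the
    official benchmark point cloud via nearest-neighbor lookup.
    recon_pts: (N, 3); recon_labels: (N,); official_pts: (M, 3)."""
    tree = cKDTree(recon_pts)
    _, idx = tree.query(official_pts)     # nearest recon point per official point
    return recon_labels[idx]              # (M,) labels for the official cloud
```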
E. More Online Segmentation Results
Figure 11 shows the comparison of the full-pipeline 2D model [9] with our method. Figures 12 to 20 present visual results of our online semantic segmentation method on the ScanNet dataset. Note that the test data shown here are selected from the validation and test sets of ScanNet. For a live demo, please refer to the video attached in the supplemental material.
FuseNet-Encoder-Decoder feature map add ...... feature mapfeature mapCBRCBR add ...
CBR feature map512×7×10512×7×10downsample 128×60×80 interval-octree treeNode:{coordinate:(x,y,z)feature{ feature2d:(128×1)uncertainty_feature:(128×1) } matrix multiplication downsample downsampledownsample ... ... MLP 128 ×N num_class×NMLP uncertainty _feature:(128×N)
MaxPooling
Figure 10: Network architecture.Table 4: IOU comparison between our method and state-of-the-art online scene segmentation methods. Our method outper-forms the state-of-the-art methods in 19 semantic categories(except curtain). model mean wall floor cabinet bed chair sofa table door window bookshelf picture counter desk curtain fridge curtain toilet sink bathtub othersPanopticFusion 0.529 0.491 0.688 0.604 0.386 0.632 0.225 0.705 0.434 0.293 0.815 0.348 0.241 0.499 0.669 0.507 0.649 0.442 0.796 0.602 0.561Ours
Color Image | Single Frame | Ours

Figure 11: The first row shows the complete result of our method. The last three rows (in red, blue, and green boxes) give the input color image, the semantic label of the single frame, and the projection result of our method.
Figure 12: Visual results of our online semantic segmentation method. Our method works properly even when the input scan is incomplete.

Figure 13: Visual results of our online semantic segmentation method. Our method works properly even when the input scan is incomplete.

Figure 14: Visual results of our online semantic segmentation method. Our method works properly even when the input scan is incomplete.

Figure 15: Visual results of our online semantic segmentation method. Our method works properly even when the input scan is incomplete.

Figure 16: Visual results of our online semantic segmentation method. Our method works properly even when the input scan is incomplete.

Figure 17: Visual results of our online semantic segmentation method. Our method works properly even when the input scan is incomplete.

Figure 18: Visual results of our online semantic segmentation method. Our method works properly even when the input scan is incomplete.

Figure 19: Visual results of our online semantic segmentation method. Our method works properly even when the input scan is incomplete.