Active Scene Understanding via Online Semantic Reconstruction

Lintao Zheng, Chenyang Zhu, Jiazhao Zhang, Hang Zhao, Hui Huang, Matthias Niessner, Kai Xu

National University of Defense Technology, China; Shenzhen University, China; Technical University of Munich, Germany
Figure 1: We introduce a method for robot-operated active semantic understanding of unknown indoor scenes, based on online RGBD reconstruction with semantic segmentation. The method performs online volumetric RGBD reconstruction, on which real-time voxel-based semantic labeling is conducted. The robot is guided by the requirement of fast online segmentation with minimal scanning effort. The image on the left shows the robot paths computed by our method, and the corresponding scene parsing results (correspondence indicated by color) are shown on the right.
Abstract
We propose a novel approach to robot-operated active understanding of unknown indoor scenes, based on online RGBD reconstruction with semantic segmentation. In our method, the exploratory robot scanning is both driven by and targeted at the recognition and segmentation of semantic objects in the scene. Our algorithm is built on top of a volumetric depth fusion framework (e.g., KinectFusion) and performs real-time voxel-based semantic labeling over the online reconstructed volume. The robot is guided by an online estimated discrete viewing score field (VSF) parameterized over the 3D space of 2D location and azimuth rotation. The VSF stores for each grid cell the score of the corresponding view, which measures how much it reduces the uncertainty (entropy) of both geometric reconstruction and semantic labeling. Based on the VSF, we select the next best view (NBV) as the target for each time step. We then jointly optimize the traverse path and camera trajectory between two adjacent NBVs by maximizing the integral viewing score (information gain) along the path and trajectory. Through extensive evaluation, we show that our method achieves efficient and accurate online scene parsing during exploratory scanning.
CCS Concepts • Computing methodologies → Shape analysis;
1. Introduction
With the proliferation of commodity RGBD sensors and the rapid progress of 3D deep learning techniques, 3D scene understanding based on RGBD data has been emerging as a core problem of 3D vision and has lately gained much attention from both the graphics and vision communities [SLX15, GAGM15, NKP19]. The majority of existing works pursue offline, passive analysis, in which scene understanding, encompassing object detection and/or segmentation, is conducted over already acquired RGBD sequences or their 3D reconstruction. In such an approach, data acquisition is analysis-agnostic. Therefore, the offline analysis often suffers from incomplete and uninformative data acquisition, which greatly limits the performance of scene understanding.

Online scene understanding is a different paradigm in which acquisition and analysis are intertwined [XHS*15, LXS*18, YLL*18]: the scene is analyzed while it is being scanned, with a minimum traversing and scanning effort. Therefore, scanning is both driven by and targeted at understanding.

Online scene understanding can be performed either directly over the online acquired RGBD sequence or based on an online RGBD reconstruction. Most recent works adopt the former due to the deep-learning-friendly representation of RGBD images [GAGM15]. However, 3D object segmentation is best performed over the 3D reconstruction of the scene geometry, which facilitates 3D spatial and structural reasoning [ZXTZ14, XHS*15]. Our method therefore builds on the volumetric depth fusion framework [NDI*11, IKH*11, NZIS13, WLSM*15, DNZ*17] and performs real-time voxel-based semantic labeling over the online reconstructed volume. Our contributions include:
• A new approach to active scene understanding based on online semantic reconstruction.
• An efficient semantic segmentation network with incremental volumetric feature aggregation.
• A method for estimating the next best view based on the uncertainty in scene reconstruction and understanding.
• A method for joint optimization of robot path and camera trajectory in three-dimensional view space.
2. Related Work

Scene understanding.
Scene understanding has been a long-standing problem in both vision and graphics. The two main problems of scene understanding are scene classification and semantic parsing (object detection and/or segmentation). With the development of commodity depth sensors, the input of interest has been shifting from 2D RGB images [LSFFX10], 3D CAD models [FSH11, XMZ*14] or 3D point clouds [NXS12], to RGBD images [GAM13, SX16] and/or their 3D reconstruction [KKS13, HDN18, MHDL17]. To take advantage of deep learning, much attention has been paid to designing suitable representations and efficient neural networks for the task of RGBD-based understanding [SHB*12, CDF*17, SYZ*17, QLJ*17].
Online RGB-D reconstruction. With the introduction of commodity depth cameras, we have seen significant advances in online RGB-D reconstruction. KinectFusion [NDI*11, IKH*11] was one of the first systems to realize a real-time volumetric fusion framework in the spirit of [CL96]. In order to handle larger environments, spatial hierarchies [CBI13] and hashing schemes [NZIS13, KPR*15] have been proposed. At scale, these methods also require robust global pose optimization, which is common in offline approaches [CZK15]; however, fast GPU optimization techniques [DNZ*17] and online re-localization methods [WLSM*15] allow for real-time global pose alignment. Our work builds upon this line of research to achieve active RGBD-based scene understanding.
Active object recognition.
Autonomous object detection and/or recognition is one of the most important abilities of domestic robots. A common solution to active object recognition is to actively resolve the ambiguities of a certain viewpoint in recognizing an object. In cases where the target object is known, Browatzki et al. [BTM*12] actively change viewpoints on a humanoid robot to disambiguate recognition; related view planning strategies are studied in [SZX15, XSZ*16, PBSS16]. When the target objects are unknown, detection and recognition need to be solved simultaneously. Ye et al. [YLL*18] propose navigation policy learning guided by active object detection and recognition. The work in [LXS*18] is the most similar in spirit to ours. They develop a data-driven solution to autonomous object detection and recognition with one navigation pass in an indoor room. The problem is formulated as an online scene segmentation with database 3D models serving as templates. Our work frames the problem as online volumetric reconstruction and deep-learning-based voxel labeling.
Active scene segmentation.
Semantic segmentation of an indoor scene is critical to accurate robot-environment interaction. However, many existing approaches do not involve online active view selection. Mishra et al. [MAF09] propose fixation-based active scene segmentation in which the agent segments only one image region at a time, specifically the one containing the fixation point of an active observer. A similar method is studied in [BK10], which integrates different cues in a temporal framework for improving object hypotheses over time. Xu et al. [XHS*15] present an autoscanning system for indoor scene reconstruction with object-level segmentation. They adopt a proactive approach where objects are detected and segmented with the help of physical interaction (poking). In our system, scene segmentation is achieved by actively selecting the best view points and traverse paths that maximally determine the volumetric labeling.

Figure 2: An overview of our method. Given the current reconstruction and understanding in (a), the robot performs online progressive reconstruction and entropy map computation/updating (b). Based on that, the view scoring field (VSF) is generated (c). Based on the VSF, it performs field-guided optimization of the robot path and camera trajectory (d). (e) shows the online reconstruction with semantic segmentation and (f) visualizes the updated entropy map for the next iteration.
3. Method

3.1. Problem Statement and Overview

Problem statement.
Given an indoor scene whose map is unknown, the objective of our system is to drive a ground robot mounted with an RGBD camera to explore and actively parse the scene into semantic objects. It is impossible to plan the complete scan path in advance since the map of the target scene is unavailable at the beginning, which makes it a chicken-and-egg problem. We therefore have to solve scene understanding and path planning simultaneously. Existing approaches to active scene scanning usually take a "scan and plan" paradigm, which considers only geometric but not semantic information when planning the robot scanning. In this work, we frame the problem as online reconstruction with semantic segmentation and propose a novel "scan, understand, and plan" solution.
Method overview.
For the purpose of online scene understanding, we introduce a semantic segmentation network based on online volumetric reconstruction, inspired by [HDN18]. The basic idea of our network is to first extract multi-view 2D features and then perform feature aggregation based on 3D convolution over the online reconstructed TSDF volume. Different from the offline scene understanding in [HDN18], the input for semantic labeling is dynamic due to the progressive scanning and online reconstruction. Therefore, the feature aggregation must follow the online reconstructed TSDF volume. Furthermore, to avoid redundant computation, our network bypasses the known and unchanged voxels in the TSDF volume during feature aggregation, thus significantly improving the online efficiency.

To guide the robot toward a fast online semantic reconstruction with minimal scanning effort, we adopt an information-theoretic approach to Next-Best-View (NBV) prediction that minimizes the uncertainty (entropy) of the semantic reconstruction. The entropy measures the uncertainty of both the geometric reconstruction and the semantic segmentation. In particular, we present a field-guided optimization of robot path and camera trajectory to maximize the information gain in traversing and scanning between every two adjacent NBVs.

An overview of the process is given in Algorithm 1. Any scanning move of the robot collects some semantic information S about the unknown scene. An entropy-based (Section 3.3) view scoring field F is generated from the online reconstructed TSDF with semantic labels D (Section 3.2). To maximize the scanning efficiency, the Next-Best-View (NBV) should enable the robot to reduce the overall entropy as much as possible in the next move. Based on the online updated entropy map and the occupancy grid T, we compute a view scoring field F, based on which the robot path and camera orientation can be optimized jointly (Section 3.4). The above process repeats until the termination condition is met.
Algorithm 1: Robot scanning guided by online reconstruction and semantic segmentation.
Input: Initial TSDF D and occupancy grid T from a few random scans, and robot location v_r
Output: Semantic labeling S and optimized scanning path {P_i, C_i | i = 1...k}
Initialize S ← f_rec(D);
Initialize entropy map H from T and S;
Initialize view scoring field F from H and T;
repeat
    // Path planning and camera rotation optimization based on F
    Find the NBV v_i ← argmax_v F(v);
    Find the optimal robot path P_i and camera trajectory C_i from v_r to v_i;
    // Update the scene mapping along the planned path
    Scan along P_i, C_i and update the semantic map S;
    Update S_i ← f_rec(D_i);
    Update H_i from T_i and S_i;
    Update F_i from H_i, T_i, H_{i-1}, T_{i-1};
    // Record the current planned path
    {P_i, C_i | i = 1...k} ← {P_i, C_i};
until the termination condition is met;
return S, {P_i, C_i | i = 1...k};
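To make the control flow concrete, the following Python sketch mirrors the loop of Algorithm 1. Every callable it receives is a hypothetical placeholder for one of the modules described in Sections 3.2-3.4, not part of any released implementation.

```python
def active_scanning(tsdf, occupancy, robot_pose,
                    reconstruct_semantics,      # S <- f_rec(D)
                    compute_entropy_map,        # H from T and S
                    build_view_scoring_field,   # F from H and T
                    select_nbv,                 # v_i <- argmax_v F(v)
                    plan_path_and_camera,       # (P_i, C_i) from v_r to v_i
                    execute_scan,               # drive the robot, fuse new frames
                    total_entropy,              # scalar uncertainty of H
                    entropy_threshold):
    """Sketch of Algorithm 1; every argument above is a placeholder callable."""
    paths = []
    labels = reconstruct_semantics(tsdf)
    entropy = compute_entropy_map(occupancy, labels)
    vsf = build_view_scoring_field(entropy, occupancy)
    while total_entropy(entropy) > entropy_threshold:      # termination condition
        nbv = select_nbv(vsf)
        path, camera = plan_path_and_camera(vsf, robot_pose, nbv)
        tsdf, occupancy = execute_scan(path, camera, tsdf, occupancy)
        labels = reconstruct_semantics(tsdf)               # update S_i
        entropy = compute_entropy_map(occupancy, labels)   # update H_i
        vsf = build_view_scoring_field(entropy, occupancy) # update F_i
        robot_pose = nbv                                   # robot is now at the NBV
        paths.append((path, camera))
    return labels, paths
```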
We measure the quality of a scan view by how much the uncertainty of scene understanding would be reduced through the corresponding move. In our work, the uncertainty of scene understanding is measured from two aspects, i.e., geometric reconstruction and semantic segmentation.
RGBD-based reconstruction with volumetric representation.
Given a sequence of RGBD images, we adopt the volumetric representation (TSDF) for depth fusion [CL96]. The construction of the TSDF D is incremental. The occupancy uncertainty of each voxel v is reduced as more images are fused into D. Usually, the occupancy of v can be modeled with a 1D half-normal distribution: $t(v) = -|X|$, $X \sim \mathcal{N}(0, \sigma(v))$. The variance $\sigma(v)$ provides a measure of reconstruction uncertainty. More specifically, $\sigma(v)$ is defined based on how many images provide positive support for the occupancy of v [HWB*13]: a depth image provides positive support for v when it observes v as occupied, and negative support when it observes v as free. To keep it simple, every positive support i contributes $\sigma_{occ}(v, i) = 0.85$ and every negative support i contributes a fixed negative penalty $\sigma_{free}(v, i)$:

$\sigma(v) = \sum_i \sigma_{occ}(v, i) + \sum_i \sigma_{free}(v, i)$  (1)
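As a small illustration of Equation (1), the sketch below accumulates per-voxel support over a stream of observations. The positive contribution of 0.85 is taken from the text; the negative penalty value is an assumed placeholder, since its exact value is not specified here.

```python
import numpy as np

SIGMA_OCC = 0.85    # support added per positive (occupied) observation, as in Eq. (1)
SIGMA_FREE = -0.4   # assumed placeholder penalty; the exact value is not given here

def update_support(sigma, occupied_mask, free_mask):
    """Accumulate sigma(v) = sum_i sigma_occ(v, i) + sum_i sigma_free(v, i)."""
    sigma[occupied_mask] += SIGMA_OCC
    sigma[free_mask] += SIGMA_FREE
    return sigma

# toy example: one voxel observed as occupied, one observed as free
sigma = np.zeros((4, 4, 4))
occupied = np.zeros_like(sigma, dtype=bool); occupied[2, 2, 2] = True
free = np.zeros_like(sigma, dtype=bool); free[0, 0, 0] = True
sigma = update_support(sigma, occupied, free)
```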
Semantic reconstruction network. To incrementally gain semantic information during scanning, we propose a network that predicts a 3D semantic segmentation based on the TSDF D. More specifically, we want to infer the semantic labeling over the TSDF on a per-voxel basis. The backbone of our network is similar to [HDN18]; we first briefly review the network architecture and then discuss our improvements over it. The network is composed of two main components: object detection and per-voxel label prediction. Each component has its own feature extraction module, composed of 2D and 3D feature extraction layers. The extracted 2D and 3D features are aggregated by a series of 3D convolutional layers over the TSDF volume. The object detection component comprises a 3D region proposal network (3D-RPN) to predict bounding box locations, followed by a 3D region-of-interest (3D-RoI) pooling layer for classification. The per-voxel mask prediction network takes the geometry as well as the predicted bounding box locations as input. The cropped feature channels are used to create a per-voxel mask prediction for semantic labeling, together with a confidence score. However, this network is designed for offline scene understanding where the reconstruction is already given. In our problem setting, the reconstruction is executed online, with smooth and progressive RGBD acquisition. This means that there is an immense overlap between the volumes processed in consecutive steps.
Incremental 3D feature aggregation. Since most offline scene understanding methods do not consider how to process dynamic inputs, we present an incremental semantic segmentation network specifically designed for online understanding. The key insight of our approach is that 3D convolution should be performed only on the newly observed voxels, reusing the previous results for overlapping areas as much as possible; Figure 4 gives an illustration of this. More specifically, we maintain a global data structure that records the TSDF and 3D feature information for all observed voxels. When the network receives a new local input, the first step is to find the overlapping areas between the input and the global record. Our network skips the 3D convolution for these overlapping areas and reuses the stored information directly, which saves a large amount of computation time. Moreover, our network also reuses the results of the 3D-RPN: box proposals in the overlapping areas are not processed again for a new local input. By removing these redundant proposals, our incremental network improves the efficiency one step further. In our experiments, the incremental processing makes our network 23.6% faster when the input has 50% overlapping area and 41.1% faster when the input has 75% overlapping area. More details about our network can be found in Figure 3.
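The caching idea can be sketched as follows. This is a simplified illustration under the assumption of a per-voxel feature function; the actual network applies 3D convolutions over voxel neighborhoods, and the class and function names here are hypothetical.

```python
import numpy as np

class IncrementalFeatureCache:
    """Reuse 3D features of voxels already processed in earlier scanning steps."""

    def __init__(self, feature_fn):
        # feature_fn stands in for the expensive 3D feature extraction; here it
        # maps a list of per-voxel TSDF values to per-voxel feature vectors.
        self.feature_fn = feature_fn
        self.cache = {}                      # (x, y, z) -> cached feature vector

    def aggregate(self, voxels, tsdf_values):
        """voxels: list of (x, y, z) keys; tsdf_values: dict voxel -> TSDF value."""
        new_voxels = [v for v in voxels if v not in self.cache]
        if new_voxels:
            # run the expensive feature extraction only on newly observed voxels
            new_feats = self.feature_fn([tsdf_values[v] for v in new_voxels])
            self.cache.update(zip(new_voxels, new_feats))
        # overlapped voxels are read straight from the cache, skipping recomputation
        return np.stack([self.cache[v] for v in voxels])

# toy feature function: a 2-channel "feature" per voxel derived from its TSDF value
cache = IncrementalFeatureCache(lambda ts: [np.array([t, abs(t)]) for t in ts])
feats_1 = cache.aggregate([(0, 0, 0), (0, 0, 1)], {(0, 0, 0): 0.2, (0, 0, 1): -0.1})
feats_2 = cache.aggregate([(0, 0, 1), (0, 0, 2)], {(0, 0, 1): -0.1, (0, 0, 2): 0.7})
```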
We adopt Shannon entropy to measure the information gain of robot scanning. In particular, we estimate the average new information the robot can collect under a specific pose; in other words, we measure how much uncertainty would be reduced by a potential scanning view. The entropy map H is defined on each voxel of the 3D scene. Different from previous methods such as [BWCE16], we count not only the geometric occupancy probability of each voxel v but also the predicted semantic label as new information. The general definition of entropy in our problem is $H(v) = -\sum p(v) \log p(v)$, and the gained information is measured as $I(v) = H(v) - H(v \mid v_{new})$. The key to evaluating the quality of new information through I is how to define the probability p in H for the geometric and semantic information, respectively. We then sum these two terms in a weighted fashion to obtain the final formulation of the gained information, where α and β are constants weighting the semantic and geometry terms:

$I(v) = \alpha I_{semantic}(v) + \beta I_{geometry}(v)$  (2)
Geometry reconstruction entropy. As discussed in Section 3.2, the uncertainty of voxel v in geometry reconstruction is defined by Equation (1). However, the range of this uncertainty is not [0, 1], so it cannot be adopted directly as the probability function p in an entropy formulation. We simply map it to [0, 1] as below and use it in our geometry reconstruction entropy:

$I_{geometry}(v) = H_g(v) - H_g(v \mid v_{new})$  (3)

$H_g(v) = -p_g(v) \log p_g(v) - (1 - p_g(v)) \log (1 - p_g(v))$  (4)

$p_g(v) = \dfrac{e^{\sigma(v)}}{1 + e^{\sigma(v)}}$  (5)
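Equations (3)-(5) translate directly into code; the sketch below assumes the accumulated support values σ(v) from Equation (1) as input.

```python
import numpy as np

def occupancy_probability(sigma):
    """Map the accumulated support sigma(v) to [0, 1] (Eq. 5)."""
    return np.exp(sigma) / (1.0 + np.exp(sigma))

def binary_entropy(p, eps=1e-9):
    """H_g(v) of Eq. 4; eps guards log(0) at fully certain voxels."""
    p = np.clip(p, eps, 1.0 - eps)
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def geometry_information_gain(sigma_old, sigma_new):
    """I_geometry(v) = H_g(v) - H_g(v | v_new) (Eq. 3)."""
    return binary_entropy(occupancy_probability(sigma_old)) - \
           binary_entropy(occupancy_probability(sigma_new))

# example: three additional positive supports reduce the occupancy uncertainty
print(geometry_information_gain(np.array([0.85]), np.array([4 * 0.85])))
```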
Semantic segmentation entropy. To measure the uncertainty of the semantic segmentation, we take both the predicted semantic label and the corresponding confidence score of a voxel v into consideration. If the predicted semantic label for v in the current scan move stays the same as the previously predicted one and the confidence score becomes higher, then the uncertainty of the semantic segmentation of v is reduced. There is another case in which we gain more information about v: the confidence score is higher even though the predicted semantic labels differ. Therefore, we use the following formulation for the semantic segmentation entropy, where $p_s(v)$ denotes the score of the semantic prediction given by our semantic reconstruction network for a specific label s:

$I_{semantic}(v) = \begin{cases} \sum_s p_s(v \mid v_{new}) \log p_s(v \mid v_{new}) - \sum_s p_s(v) \log p_s(v), & \text{if } S_v = S_{v_{new}} \\ -\sum_s p_s(v \mid v_{new}) \log p_s(v \mid v_{new}), & \text{if } S_v \neq S_{v_{new}} \text{ and } p_v < p_{v_{new}} \\ 0, & \text{otherwise} \end{cases}$  (6)

where $v_{new}$ denotes the new observation of voxel v and $S_v$ is the semantic label of v.

To make the idea of this combined entropy clearer, we present a visual example in Figure 6. Note that a higher entropy value means lower confidence, since more information is needed to become more confident. It is clear that in Figure 6, if we consider only the geometry term, the robot has no idea what to do next, since the uncertainty of the geometric reconstruction is similar everywhere in this case. The semantic term, however, gives very good guidance about which area should be focused on in the next move, since there is a valid semantic object (a sofa) in the right view. To show that our combined entropy is positively correlated with the quality of semantic labeling, we provide a side-by-side visual comparison in Figure 5.

Figure 3: The architecture of our online semantic segmentation network. Note that the key difference between our network and 3D-SIS [HDN18] is that our feature aggregation is incremental. The input of our network contains two components, local and global (red box). We save massive processing time by removing redundant operations on overlapped areas. More specifically, voxels that have been observed in previous steps do not go through the 3D convolution layers and 3D-RoI pooling (blue box) again; our network re-utilizes the stored information directly.

Figure 4: An illustration of incremental semantic segmentation.

Figure 5: The evolution of scanning entropy over increasing scans.

Figure 6: Visualization of scanning entropy encompassing both geometric and semantic uncertainty.
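For reference, the following is a direct, per-voxel transcription of Equation (6), assuming each voxel carries a per-class score vector from the segmentation network.

```python
import numpy as np

def semantic_information_gain(scores_old, scores_new, eps=1e-9):
    """I_semantic(v) of Eq. 6 for a single voxel.

    scores_old -- per-class scores p_s(v) from the previous prediction
    scores_new -- per-class scores p_s(v | v_new) from the new prediction
    """
    p_old = np.clip(np.asarray(scores_old, dtype=float), eps, 1.0)
    p_new = np.clip(np.asarray(scores_new, dtype=float), eps, 1.0)
    label_old, label_new = int(np.argmax(p_old)), int(np.argmax(p_new))

    if label_old == label_new:
        # same predicted label: gain is the negative-entropy difference
        return float(np.sum(p_new * np.log(p_new)) - np.sum(p_old * np.log(p_old)))
    if np.max(p_old) < np.max(p_new):
        # label changed, but the new prediction is more confident
        return float(-np.sum(p_new * np.log(p_new)))
    return 0.0

# example: confidence on the winning class grows from 0.5 to 0.8
print(semantic_information_gain([0.5, 0.3, 0.2], [0.8, 0.1, 0.1]))
```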
Figure 7: (a) Illustration of the 3D parametric space of location and orientation. (b, c) Visualization of the view scoring field and the optimal path found by the A* algorithm between two camera poses in the field. (d) The computed robot trajectory based on field-guided optimization.

In this way, the objective of our NBV prediction is now clear: the NBV should be the view whose voxels have the highest overall uncertainty. Based on Equation (2), we formulate NBV prediction as follows, where $\Omega(n)$ represents all the voxels in the camera view n:

$\mathrm{NBV} = \arg\max_n \sum_{v \in \Omega(n)} \alpha I_{semantic}(v) + \beta I_{geometry}(v)$  (7)

Our scan planning is composed of two components: NBV prediction and path planning with camera optimization. Both components are implemented upon a 3D field that records the entropy information described in Section 3.3. Please note that this field is constructed incrementally along with the scanning process. However, the gained information $\sum_{v \in \Omega} I(v)$ is not the only factor that should be considered in constructing the field. The following factors are also important components of our view scoring field:
• Safety: a view point must be in free space and keep a safe distance away from obstacles;
• Visibility: views should orient toward objects or frontiers to maximize information gain;
• Movement cost: the robot traverse path should be as short as possible.
Equation (7) alone is therefore not sufficient to find the most appropriate NBV for our system. Besides the gained information I, we also use the occupancy grid T in the view scoring field to account for the above three factors.
To ensure robot safety, we obtain obstacle information from the 2D projection of T and only sample views that keep a safe distance from obstacles.
The next factor we consider is visibility to frontiers. A frontier is the boundary between (known) empty regions and unknown ones, which is a well-known driving factor for robot exploration. We measure the visibility to frontiers by counting the frontier voxels visible in the current view frustum. Specifically, it is estimated based on T:

$V(v, r) = \sum_{k \in \Omega(v, r)} T(k), \quad T(k) = \begin{cases} 1, & \text{if voxel } k \text{ is a frontier} \\ 0, & \text{otherwise} \end{cases}$  (8)

where r is the given view from voxel v and $\Omega(v, r)$ denotes all the voxels in this view frustum.
We also need to make sure the planned path is not too long. However, the exact path length is not available before the final path planning, so we use an approximate distance estimate as the movement factor: $L(v) = e^{-\mathrm{dist}^2(v, v_{robot}) / \sigma}$, where $v_{robot}$ is the current robot location.
After formulating all these factors, we assemble them into the 3D view scoring field. For each ground grid voxel v of the given scene, we sample several different views, and the safety, visibility, and movement factors are computed for each view $r_i$, $i \in \{1 \dots k\}$, of every v. The final view score for each grid voxel v with camera view $r_i$ is given below; a 3D visualization of this field is shown in Figure 7(a):

$F(v, r_i) = \alpha V(v, r_i) + \sum_{k \in \Omega(v, r_i)} I(k) \, L(v)$  (9)
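A single entry of the view scoring field of Equation (9) can be evaluated as sketched below. The helper voxels_in_frustum is a hypothetical placeholder (e.g., ray casting over the occupancy grid), and the grouping of the movement factor follows Equation (9) above.

```python
import numpy as np

def movement_factor(view_xy, robot_xy, sigma=2.0):
    """L(v) = exp(-dist^2(v, v_robot) / sigma): closer candidate views score higher."""
    d2 = float(np.sum((np.asarray(view_xy) - np.asarray(robot_xy)) ** 2))
    return np.exp(-d2 / sigma)

def view_score(view_xy, view_angle, robot_xy, frontier_mask, info_gain,
               voxels_in_frustum, alpha=1.0):
    """F(v, r) of Eq. 9 for one sampled (location, azimuth) candidate.

    frontier_mask     -- boolean grid marking frontier voxels (T in Eq. 8)
    info_gain         -- per-voxel gained information I(v) (Eq. 2)
    voxels_in_frustum -- hypothetical helper returning index arrays of the voxels
                         visible from (view_xy, view_angle), e.g. via ray casting
    """
    idx = voxels_in_frustum(view_xy, view_angle)
    visibility = float(frontier_mask[idx].sum())      # V(v, r), Eq. 8
    gain = float(info_gain[idx].sum())                # summed I over the frustum
    return alpha * visibility + gain * movement_factor(view_xy, robot_xy)

# the NBV is then the sampled candidate with the highest F(v, r)
```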
Optimization formulation. We update the view scoring field F after each scan move, and the NBV can then simply be computed by the optimization $\mathrm{NBV} = \arg\max_{v, r} F(v, r)$. The main challenge, however, lies in how to compute a collision-free path from the current robot position $v_{robot}$ to the NBV such that the path maximizes the information gain of the semantic reconstruction while minimizing the traverse distance. To guarantee robot safety and scanning efficiency, the view scoring field F plays a significant role in the path planning algorithm. Formally, we define C(P) as the total cost of the optimal path P:

$C(P) = \inf_{\pi \in \Pi(v_{robot}, v_i)} \int \big(\eta - F(v, r)\big) \, d\pi(v, r), \quad \text{s.t.} \; \Big|\dfrac{\partial \pi_\theta}{\partial c}\Big| < V_{rotation}$  (10)

where $\Pi(v_{robot}, v_i)$ is the set of all possible paths from $v_{robot}$ to $v_i$, $\pi_\theta$ denotes the rotation component of a path, $V_{rotation}$ is the maximum rotation speed of the robot camera, and η is a large constant (set to 500 in our experiments) that turns the field values into costs. However, even though the safety factor is considered in the design of the view scoring field F, it still cannot guarantee that the path found through Equation (10) is collision-free. We therefore introduce a 2D obstacle costmap to augment F and solve this problem. The 2D obstacle map $f_o$ is obtained by projecting the 3D occupancy grid map T; the obstacle cost is generated with a two-dimensional Gaussian distribution:

$f_o(v) = e^{-\min_{v_k \in \{v' \mid T(v') = 1\}} \mathrm{dist}^2(v, v_k) / \sigma}$  (11)

We integrate the 2D obstacle map $f_o$ into the view scoring field F and change the optimization formulation from Equation (10) to the following:

$C(P) = \inf_{\pi \in \Pi(v_{robot}, v_i)} \int \big(\eta - F(v, r)\big) \, d\pi(v, r) + \int \eta \, f_o(v) \, dv, \quad \text{s.t.} \; \Big|\dfrac{\partial \pi_\theta}{\partial c}\Big| < V_{rotation}$  (12)
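The 2D obstacle costmap of Equation (11) can be computed directly from the projected occupancy grid; the following sketch uses a brute-force nearest-obstacle search, which is an implementation choice of this example rather than a description of our system.

```python
import numpy as np

def obstacle_costmap(occupancy_2d, sigma=1.0):
    """f_o(v) of Eq. 11 on a 2D grid projected from the occupancy volume T.

    occupancy_2d -- boolean 2D array, True where an obstacle is projected
    Returns exp(-d^2 / sigma) where d is the distance to the nearest obstacle.
    """
    h, w = occupancy_2d.shape
    ys, xs = np.nonzero(occupancy_2d)
    if len(xs) == 0:
        return np.zeros((h, w))
    gy, gx = np.mgrid[0:h, 0:w]
    # squared distance from every cell to its nearest obstacle cell
    d2 = ((gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2).min(axis=-1)
    return np.exp(-d2 / sigma)

# cells next to obstacles get cost close to 1, free space decays toward 0
print(obstacle_costmap(np.array([[0, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=bool)))
```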
Scan planning by optimization. To solve the path and camera optimization defined in Equation (12), we adopt the A* algorithm to find the optimal solution at the discrete level. Figure 7 illustrates how we obtain the camera view path and the robot path from the optimal path given by the 3D costmap: the optimal path returned by A* is projected onto the θ axis to obtain the camera rotation sequence, and its projection onto the xy plane gives the optimal 2D robot path. This "scan, analyze and plan" process is repeated until the termination condition is met, leading to a progressive understanding of the scene by the robot. In our experiments, the robot stops the exploration when the overall entropy $\sum_v I(v)$ drops below a certain threshold, which means there is no significant uncertainty left in our scene understanding.
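For illustration, a discrete version of the search over the (x, y, θ) costmap can be implemented with a standard A* as below. The per-cell cost is assumed to combine η − F with the obstacle term, and the heuristic choice is an assumption of this sketch, not a description of the actual planner.

```python
import heapq
import numpy as np

def astar_costmap(cost, start, goal):
    """A* search on a discrete (x, y, theta) costmap.

    cost        -- nonnegative per-cell traversal cost, e.g. (eta - F) + eta * f_o
    start, goal -- (x, y, theta) index triples
    Returns the list of cells from start to goal, or None if unreachable.
    """
    nx, ny, nt = cost.shape
    h_scale = float(cost.min())                    # keeps the heuristic admissible
    def h(c):
        return h_scale * np.hypot(c[0] - goal[0], c[1] - goal[1])

    counter = 0                                    # tie-breaker for the heap
    open_set = [(h(start), 0.0, counter, start, None)]
    parents, best_g = {}, {start: 0.0}
    while open_set:
        _, g, _, cell, parent = heapq.heappop(open_set)
        if cell in parents:                        # already expanded
            continue
        parents[cell] = parent
        if cell == goal:                           # reconstruct and return the path
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        x, y, t = cell
        for dx, dy, dt in ((1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)):
            n = (x + dx, y + dy, (t + dt) % nt)    # the azimuth axis wraps around
            if 0 <= n[0] < nx and 0 <= n[1] < ny and n not in parents:
                ng = g + float(cost[n])
                if ng < best_g.get(n, np.inf):
                    best_g[n] = ng
                    counter += 1
                    heapq.heappush(open_set, (ng + h(n), ng, counter, n, cell))
    return None

# The xy projection of the returned cell sequence gives the 2D robot path;
# the theta component gives the camera rotation sequence.
```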
4. Results and Evaluation
There are three primary questions that we seek to answer with our experiments and evaluations.
• How does our approach compare to previous work in terms of distance traveled, time cost, and semantic quality?
• How much effect does the semantic entropy term have on the results?
• How well does field-guided path planning improve the scanning efficiency?
The simulation is conducted using the Gazebo simulator [KH04]. We adopt a differential-drive ground robot equipped with a virtual RGB-D camera simulating the Kinect v1 sensor. We assume the sensor has a limited depth range with 0.03 m depth noise. The camera is mounted on top of the robot and has one DoF of azimuth rotation. To make the simulation more realistic, the ground robot obtains a noisy pose estimation from the simulator. The simulation runs on a computer with an Intel i7-5930K CPU (3.5 GHz, 12 threads), 32 GB RAM, and an NVIDIA GeForce GTX 1080 graphics card.
Dataset.
Our benchmark dataset is built upon the virtual scene dataset SUNCG. SUNCG contains 40K human-modeled 3D indoor scenes with visually realistic geometry and texture. It encompasses indoor rooms ranging from single-room studios to multi-floor houses. We select 180 scenes which are suitable for the navigation and exploration task. Different interiors, including offices, bedrooms, sitting rooms, kitchens, etc., are involved in our dataset to guarantee test variety. The dataset also provides ground-truth object segmentation and labeling for the scenes.
Parameters and details. The 3D occupancy grid T is constructed at a fixed voxel resolution, and the maximum moving speed and camera rotation speed of the robot are limited to fixed values (in meters and degrees per second, respectively). The weighting coefficients α and β in Equation (2) are fixed constants in all experiments.

In this section, we conduct a series of experiments and comparisons which focus on evaluating the scanning efficiency and the semantic mapping quality of our method. Since it is impossible to obtain a fully voxel-wise labeled input scene online, we evaluate scanning efficiency by measuring the time our system takes to achieve a given number of correctly labeled voxels. To evaluate semantic quality, we measure the accuracy of the final scene segmentation.
Comparison with alternative NBV methods.
Our method is compared to several state-of-the-art NBV techniques: the Bayesian optimization-based exploration method (BO) [BWCE16], the information-theoretic planning approach (IG) [CKP*15], and the object-aware guidance method (NBO) [LXS*18].
Scanning efficiency. We compare the scanning time and travel distance of the four approaches while virtually scanning the scenes. The initial positions and orientations of the robot are the same for all methods. The comparison of scanning time and traveled distance over the number of correctly labeled voxels is plotted in Figure 8. We observe that the scanning time and traveled distance increase as the scene semantic mapping becomes more complete (more and more occupied voxels get labeled), but the proposed approach always takes the least time and the shortest distance.

Figure 8: Comparing scanning efficiency between our method (red), NBO (green), BO (magenta) and IG (blue), over different scenes, measured by traveled distance and time over the number of correctly labeled voxels.
Semantic segmentation performance.
To evaluate the quality of semantic segmentation, we measure the segmentation accuracy and the number of identified objects (excluding wall, ceiling and floor), respectively. The segmentation accuracy and the number of identified objects over traveled distance are plotted in Figure 9. The number of correctly labeled voxels increases as the robot explores more area, and the accuracy increases as well. From the results we can clearly see that our method achieves the highest semantic accuracy and the maximum number of identified objects almost all the way. To demonstrate the superiority of our algorithm one step further, we show in Figure 12 some visual results of the final semantic segmentation. The results show that our scanning strategy leads to more complete and better results. For more visual results, please refer to the supplemental material.

Figure 9: Comparing segmentation accuracy (left) and object recognition (right) against NBO, BO, and IG. Note that the numbers of total semantic voxels and objects are obtained from the ground truth.

Figure 10: Effect of various entropy terms on semantic segmentation performance (left) and exploration efficiency (right). The combined entropy leads to faster scene exploration and more complete segmentation results.

Figure 11: Effect of field-guided path planning on scanning efficiency. The proposed algorithm (red) is compared against the classical Dijkstra (blue) method. Using field-guided path planning, the robot travels less distance and time.
Ablation study on semantic entropy.
Occupancy entropy tends to guide the robot to explore more unknown space, while semantic entropy is more likely to guide the robot to exploit the already scanned region. In this experiment, we investigate the effect of the semantic entropy term on semantic segmentation efficiency and quality. Figure 10 shows the numbers of correctly labeled voxels and of all observed voxels over robot traveled distance, with only occupancy entropy, with only semantic entropy, and with the combined entropy. As shown in the plot, when the observed region is still small in the early stage, the benefit of semantic entropy is significant, due to the better exploitation of the partially scanned scene. When the robot travels a longer distance, the occupancy entropy starts to take effect, which leads to faster discovery of unknown space and more observed voxels. Since the semantic entropy alone cannot guarantee discovering more regions, the robot sometimes gets stuck in the already scanned region when using only semantic entropy. With occupancy entropy only, the robot finds new regions faster, so the total number of scanned voxels is always the highest. However, when little unknown space is left at a later stage, the robot has no idea where to find better observations, which leads to poor semantic segmentation performance. The combined entropy achieves the best final scanning results, balancing exploration and exploitation. In Figure 14, we compare the semantic segmentation quality of these three entropy variants. The visual results verify the above analysis.

Table 1: Comparison between our field-guided method and the Dijkstra method on four scenes, in terms of total scanning time (s) and traveled distance (m). Our field-guided method is much more efficient than the Dijkstra method in both time cost and traveled distance.
Effect of viewing score field.
To verify the efficiency of our view scoring field-guided path planning approach, we conducted a number of experiments in four synthetic scenes, comparing our method with the classical Dijkstra path planning algorithm. Table 1 reports the total scanning time and traveling distance of field-guided and Dijkstra path planning on these scenes. The termination conditions are the same for both algorithms. From the comparison, field-guided planning saves considerable scanning effort. To further demonstrate the superiority of our field-guided approach, Figure 11 also plots the time cost and traveled distance over the number of correctly labeled voxels. It can be clearly seen that the field-guided path planning leads to shorter scanning time and less traveled distance all the way. In addition to the above results, we show more visual results of our active scene understanding in Figure 13. In these examples, it is clear that our scene understanding is guided by collecting more semantic information: our method drives the robot to discover most of the semantic objects in a local area before it enters a new area, which maximizes the scene understanding efficiency.
5. Conclusions
We have presented a method for active scene understanding based on online RGBD reconstruction with volumetric segmentation. Our method leverages the online reconstructed TSDF volume and learns a deep neural network for voxel-based semantic labeling. It attains the following key features.
First, the online scene segmentation is conducted over the online reconstruction, thus benefiting from 3D spatial reasoning.
Second, the robot scanning is guided by the information gain of both geometric reconstruction and semantic understanding.

Figure 12: Qualitative comparison of indoor scene semantic segmentation on the SUNCG dataset (columns: GT, Ours, NBO, BO, IG). Note that different colors represent different semantic labels.
Third, the online estimated viewing score field (VSF) facilitates the joint optimization of the moving path and camera orientation.

There are a few promising venues for future research.
First, our NBV prediction is based on the VSF estimated online. A more favorable approach would be training a network to achieve end-to-end NBV estimation. The difficulty lies in how to account for the uncertainty in both reconstruction and segmentation within one neural network.
Second, we would like to explore the use of the proposed framework on a real robot.
Third, our VSF-based path/trajectory optimization can be extended to support more flexible scanning settings, for example, a robot holding a depth camera in its arm, similar to [XZY*17]. Last, another interesting future direction would be extending our framework to achieve multi-robot collaborative scene understanding.
Figure 13: Visualization of our active scene understanding process for three different scenes.

Figure 14: Qualitative comparison of semantic segmentation results with different entropy terms (left: global quality; right: local quality).

References

[BK10] Björkman M., Kragic D.: Active 3D scene segmentation and detection of unknown objects. In (2010), IEEE, pp. 3114–3120.
[BTM*12] Browatzki B., Tikhanoff V., Metta G., Bülthoff H. H., Wallraven C.: Active object recognition on a humanoid robot. In (2012), IEEE, pp. 2021–2028.
[BWCE16] Bai S., Wang J., Chen F., Englot B.: Information-theoretic exploration with Bayesian optimization. In (2016), IEEE, pp. 1816–1822.
[CBI13] Chen J., Bautembach D., Izadi S.: Scalable real-time volumetric surface reconstruction. ACM Transactions on Graphics (TOG) 32, 4 (2013), 113.
[CDF*17] Chang A., Dai A., Funkhouser T., Halber M., Niessner M., Savva M., Song S., Zeng A., Zhang Y.: Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV) (2017).
[CKP*15] Charrow B., Kahn G., Patil S., Liu S., Goldberg K., Abbeel P., Michael N., Kumar V.: Information-theoretic planning with trajectory optimization for dense 3D mapping. In Proceedings of Robotics: Science and Systems (2015).
[CL96] Curless B., Levoy M.: A volumetric method for building complex models from range images. In Proc. of SIGGRAPH (1996), pp. 303–312.
[CZK15] Choi S., Zhou Q.-Y., Koltun V.: Robust reconstruction of indoor scenes. In Proc. CVPR (2015), pp. 5556–5565.
[DNZ*17] Dai A., Niessner M., Zollhöfer M., Izadi S., Theobalt C.: BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG) 36, 3 (2017), 24.
[FSH11] Fisher M., Savva M., Hanrahan P.: Characterizing structural relationships in scenes using graph kernels. ACM Trans. on Graph. (SIGGRAPH) (2011).
[GAGM15] Gupta S., Arbeláez P., Girshick R., Malik J.: Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation. International Journal of Computer Vision 112, 2 (2015), 133–149.
[GAM13] Gupta S., Arbelaez P., Malik J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In Proc. CVPR (2013), pp. 564–571.
[HDN18] Hou J., Dai A., Niessner M.: 3D-SIS: 3D semantic instance segmentation of RGB-D scans. arXiv preprint arXiv:1812.07003 (2018).
[HWB*13] Hornung A., Wurm K. M., Bennewitz M., Stachniss C., Burgard W.: OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots 34, 3 (2013), 189–206.
[IKH*11] Izadi S., Kim D., Hilliges O., Molyneaux D., Newcombe R., Kohli P., Shotton J., Hodges S., Freeman D., Davison A., Fitzgibbon A.: KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In UIST (2011), pp. 559–568.
[KH04] Koenig N., Howard A.: Design and use paradigms for Gazebo, an open-source multi-robot simulator. In International Conference on Intelligent Robots and Systems (2004), pp. 2149–2154, vol. 3.
[KKS13] Kim B.-S., Kohli P., Savarese S.: 3D scene understanding by voxel-CRF. In Proceedings of the IEEE International Conference on Computer Vision (2013), pp. 1425–1432.
[KPR*15] Kähler O., Prisacariu V. A., Ren C. Y., Sun X., Torr P. H. S., Murray D. W.: Very high frame rate volumetric integration of depth images on mobile devices. IEEE Trans. Vis. & Computer Graphics (ISMAR) 22, 11 (2015).
[LSFFX10] Li L.-J., Su H., Fei-Fei L., Xing E. P.: Object bank: A high-level image representation for scene classification & semantic feature sparsification. In Advances in Neural Information Processing Systems (2010), pp. 1378–1386.
[LXS*18] Liu L., Xia X., Sun H., Shen Q., Xu J., Chen B., Huang H., Xu K.: Object-aware guidance for autonomous scene reconstruction. ACM Trans. on Graph. (SIGGRAPH) 37, 4 (2018).
[MAF09] Mishra A., Aloimonos Y., Fermuller C.: Active segmentation for robotics. In (2009), IEEE, pp. 3133–3139.
[MHDL17] McCormac J., Handa A., Davison A., Leutenegger S.: SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In (2017), IEEE, pp. 4628–4635.
[NDI*11] Newcombe R. A., Davison A. J., Izadi S., Kohli P., Hilliges O., Shotton J., Molyneaux D., Hodges S., Kim D., Fitzgibbon A.: KinectFusion: Real-time dense surface mapping and tracking. In Proc. IEEE Int. Symp. on Mixed and Augmented Reality (2011), pp. 127–136.
[NKP19] Naseer M., Khan S., Porikli F.: Indoor scene understanding in 2.5/3D for autonomous agents: A survey. IEEE Access 7 (2019), 1859–1887.
[NXS12] Nan L., Xie K., Sharf A.: A search-classify approach for cluttered indoor scene understanding. ACM Trans. on Graph. (SIGGRAPH Asia) 31, 6 (2012), 137:1–137:10.
[NZIS13] Niessner M., Zollhöfer M., Izadi S., Stamminger M.: Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. on Graph. (SIGGRAPH Asia) 32, 6 (2013), 169.
[PBSS16] Potthast C., Breitenmoser A., Sha F., Sukhatme G. S.: Active multi-view object recognition: A unifying view on online feature selection and view planning. Robotics and Autonomous Systems 84 (2016), 31–47.
[QLJ*17] Qi X., Liao R., Jia J., Fidler S., Urtasun R.: 3D graph neural networks for RGBD semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 5199–5208.
[SHB*12] Socher R., Huval B., Bath B., Manning C. D., Ng A. Y.: Convolutional-recursive deep learning for 3D object classification. In Advances in Neural Information Processing Systems (2012), pp. 656–664.
[SLX15] Song S., Lichtenberg S. P., Xiao J.: SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 567–576.
[SX16] Song S., Xiao J.: Deep sliding shapes for amodal 3D object detection in RGB-D images. In Proc. CVPR (2016).
[SYZ*17] Song S., Yu F., Zeng A., Chang A. X., Savva M., Funkhouser T.: Semantic scene completion from a single depth image. In Proc. CVPR (2017).
[SZX15] Song S., Zhang L., Xiao J.: Robot in a room: Toward perfect object recognition in closed environments. arXiv preprint arXiv:1507.02703 (2015).
[WLSM*15] Whelan T., Leutenegger S., Salas-Moreno R. F., Glocker B., Davison A. J.: ElasticFusion: Dense SLAM without a pose graph. In Proc. Robotics: Science and Systems (2015).
[WSK*15] Wu Z., Song S., Khosla A., Yu F., Zhang L., Tang X., Xiao J.: 3D ShapeNets: A deep representation for volumetric shapes. In Proc. CVPR (2015), pp. 1912–1920.
[XHS*15] Xu K., Huang H., Shi Y., Li H., Long P., Caichen J., Sun W., Chen B.: Autoscanning for coupled scene reconstruction and proactive object analysis. ACM Trans. on Graph. 34, 6 (2015), 177.
[XMZ*14] Xu K., Ma R., Zhang H., Zhu C., Shamir A., Cohen-Or D., Huang H.: Organizing heterogeneous scene collection through contextual focal points. ACM Trans. on Graph. (SIGGRAPH) 33, 4 (2014), 35:1–35:12.
[XSZ*16] Xu K., Shi Y., Zheng L., Zhang J., Liu M., Huang H., Su H., Cohen-Or D., Chen B.: 3D attention-driven depth acquisition for object identification. ACM Trans. on Graph. 35, 6 (2016), 238.
[XZY*17] Xu K., Zheng L., Yan Z., Yan G., Zhang E., Niessner M., Deussen O., Cohen-Or D., Huang H.: Autonomous reconstruction of unknown indoor scenes guided by time-varying tensor fields. ACM Transactions on Graphics (TOG) (2017).
[YLL*18] Ye X., Lin Z., Li H., Zheng S., Yang Y.: Active object perceiver: Recognition-guided policy learning for object searching on mobile robots. In (2018), IEEE, pp. 6857–6863.
[ZXTZ14] Zhang Y., Xu W., Tong Y., Zhou K.: Online structure analysis for real-time indoor scene reconstruction. ACM Trans. on Graph. 34, 5 (2014), 159.