Deep Reinforcement Learning of Volume-guided Progressive View Inpainting for 3D Point Scene Completion from a Single Depth Image
Xiaoguang Han, Zhaoxuan Zhang, Dong Du, Mingdai Yang, Jingming Yu, Pan Pan, Xin Yang, Ligang Liu, Zixiang Xiong, Shuguang Cui
The Chinese University of Hong Kong (Shenzhen), Dalian University of Technology, Shenzhen Research Institute of Big Data, University of Science and Technology of China, Alibaba Group, Texas A&M University
Abstract
We present a deep reinforcement learning method of progressive view inpainting for 3D point scene completion under volume guidance, achieving high-quality scene reconstruction from only a single depth image with severe occlusion. Our approach is end-to-end, consisting of three modules: 3D scene volume reconstruction, 2D depth map inpainting, and multi-view selection for completion. Given a single depth image, our method first goes through the 3D volume branch to obtain a volumetric scene reconstruction as a guide to the next view inpainting step, which attempts to make up the missing information; the third step involves projecting the volume under the same view as the input, concatenating them to complete the current-view depth, and integrating all depth maps into the point cloud. Since the occluded areas are unavailable, we resort to a deep Q-Network to glance around and pick the next best view for large hole completion progressively until a scene is adequately reconstructed while guaranteeing validity. All steps are learned jointly to achieve robust and consistent results. We perform qualitative and quantitative evaluations with extensive experiments on the SUNCG data, obtaining better results than the state of the art.
1. Introduction
Recovering missing information in occluded regions of a 3D scene from a single depth image is a very active research area of late [36, 54, 12, 23, 9, 46]. This is due to its importance in robotics and vision tasks such as indoor navigation, surveillance, and augmented reality. Although this problem is mild for the human vision system, it becomes severe in machine vision because of the sheer imbalance between input and output information. One class of popular approaches [32, 2, 13, 11] to this problem is based on classify-and-search: pixels of the depth map are classified into several semantic object regions, which are mapped to the most similar 3D ones in a prepared dataset to construct a full 3D scene. Owing to the limited capacity of the database, results from classify-and-search are often far away from the ground truth. By transforming the depth map into an incomplete point cloud, Song et al. [36] recently presented the first end-to-end deep network to map it to a fully voxelized scene, while simultaneously outputting the class label each voxel belongs to. The availability of volumetric representations makes it possible to leverage 3D convolutional neural networks (3DCNN) to effectively capture the global contextual information; however, starting with an incomplete point cloud results in loss of input information and, consequently, low-resolution outputs. Several recent works [23, 12, 9, 46] attempt to compensate for the lost information by extracting features from the 2D input domain in parallel and feeding them to the 3DCNN stream. To our best knowledge, no work has been done on addressing the second issue of improving output quality.

Figure 1. Surface-based scene completion. (a) A single-view depth map as input; (b) visible surface from the depth map, represented as a point cloud (in our paper, the color of the depth maps and point clouds is for visualization only); (c) our scene completion results: directly recovering the missing points of the occluded regions. Here we choose two views for better display.

Taking an incomplete depth map as input, in this work we advocate the approach of straightforwardly reconstructing 3D points to fill missing regions and achieve high-resolution completion (Figure 1). To this end, we propose to carry out completion on multi-view depth maps in an iterative fashion until all holes are filled, with each iteration focusing on one viewpoint. At each iteration/viewpoint, we render a depth image relative to the current view and fill the produced holes using 2D inpainting. The recovered pixels are re-projected to 3D points and used for the next iteration. Our approach faces two issues. First, different choices of viewpoint sequences strongly affect the quality of the final results: given a partial point cloud, different visible contexts captured from myriad perspectives present various levels of difficulty in the completion task, producing diverse prediction accuracies; moreover, selecting a larger number of views for the sake of easier inpainting with smaller holes in each iteration will lead to error accumulation in the end. Thus we need a policy to determine the next best view as well as the appropriate number of selected viewpoints. Second, although existing deep-learning-based approaches [28, 16, 20] show excellent performance for image completion, directly applying them to depth maps across different viewpoints usually yields inaccurate and inconsistent reconstructions, because of the lack of global context understanding. To address the first issue, we employ a reinforcement learning optimization strategy for view path planning. In particular, the current state is defined as the updated point cloud after the previous iteration, and the action space is spanned by a set of pre-sampled viewpoints chosen to maximize 3D content recovery. The policy that maps the current state to the next action is approximated by a multi-view convolutional neural network (MVCNN) [38] for classification.

The second issue is handled by a volume-guided view completion deep net. It combines a 2D inpainting network [20] and a 3D completion network [36] to form a joint learning machine, in which low-resolution volumetric results of the 3D net are projected and concatenated to the inputs of the 2D net, lending better global context information to depth map inpainting. At the same time, losses from the 2D net are back-propagated to the 3D stream to benefit its optimization and further help improve the quality of the 2D outputs. As demonstrated in our experimental results, the proposed joint learning machine significantly outperforms existing methods quantitatively and qualitatively.

In summary, our contributions are:
• The first surface-based algorithm for 3D scene completion from a single depth image by directly generating the missing points.
• A novel deep reinforcement learning strategy for determining the optimal sequence of viewpoints for progressive scene completion.
• A volume-guided view inpainting network that not only produces high-resolution outputs but also makes full use of the global context.
2. Related Works
Many prior works are related to scene completion. The literature review is conducted in the following aspects.
Geometry Completion
Geometry completion has a long history in 3D processing, known for cleaning up broken single objects or incomplete scenes. Small holes can be filled by primitive fitting [31, 19], smoothness minimization [37, 56, 17], or structure analysis [25, 35, 39]. These methods, however, seriously depend on prior knowledge. Template- or part-based approaches can successfully recover the underlying structures of a partial input by retrieving the most similar shape from a database, matching it with the input, deforming disparate parts, and assembling them [34, 18, 30, 39]. However, these methods require manually segmented data and tend to fail when the input does not match well with the template due to the limited capacity of the database. Recently, deep learning based methods have gained much attention for shape completion [30, 42, 33, 45, 5, 14], while scene completion from sparse observed views remains challenging due to large-scale data loss in occluded regions. Song et al. [36] first propose an end-to-end network based on 3DCNNs, named SSCNet, which takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum. ScanComplete [6] extends it to handle larger scenes with varying spatial extent. Wang et al. [46] combine it with an adversarial mechanism to make the results more plausible. Zhang et al. [54] apply a dense CRF model after SSCNet to further increase the accuracy. In order to exploit the information of input images, Garbade et al. [9] adopt a two-stream neural network, leveraging both depth information and semantic context features extracted from the RGB images. Guo et al. [12] present a view-volume CNN which extracts detailed geometric features from the 2D depth image and projects them into a 3D volume to assist completed scene inference. However, all these works based on the volumetric representation result in low-resolution outputs. In this paper, we directly predict a point cloud to achieve high-resolution completion by conducting inpainting on multi-view depth images.
Figure 2. The pipeline of our method. Given a single depth image $D_0$, we convert it to a point cloud $P_0$, here shown in two different views. A DQN is used to seek the next-best view, under which the point cloud is projected to a new depth image $D_i$ containing holes. In parallel, $P_0$ is also completed in volumetric space by SSCNet, resulting in $V_c$. Under the view of $D_i$, $V_c$ is projected to guide the inpainting of $D_i$ with a 2DCNN network. Repeating this process several times, we achieve the final high-quality scene completion.

Depth Inpainting
Similar to geometry completion, researchers have employed various priors or optimization models to complete a depth image [15, 21, 27, 41, 3, 22, 50, 53]. The patch-based image synthesis idea has also been applied [7, 10]. Recently, significant progress has been achieved in image inpainting with deep convolutional networks and generative adversarial networks (GANs) for regular or free-form holes [16, 20, 52]. Zhang et al. [55] imitate them with a deep end-to-end model for depth inpainting. Compared with inpainting of color images, recovering missing information from a single depth map is more challenging due to the absence of strong context features in depth maps. To address this, an additional 3D global context is provided in our paper, guiding the inpainting on diverse views to reach a more accurate and consistent output.
View Path Planning
Projecting a scene or an object to the image plane causes severe information loss because of self-occlusions. A straightforward solution is to utilize dense views to make up for it [38, 29, 40], yet this leads to heavy computation cost. Choy et al. [4] propose a 3D recurrent neural network to integrate information from multiple views, which decreases the number of views to five or fewer. Even so, how many views are sufficient for completion, and which views provide the most informative features, are still open questions. Optimal view path planning, i.e., the problem of predicting the next best view from the current state, has been studied in recent years. It plays a critical role in scene reconstruction as well as environment navigation in autonomous robotics systems [24, 1, 57, 48]. Most recently, this problem has also been explored in the area of object-level shape reconstruction [51]. A learning framework is designed in [49], exploiting the spatial and temporal structure of the sequential observations to predict a view sequence for ground-truth fitting. Our work explores view path planning for scene completion. We propose to train a Deep Q-Network (DQN) [26] to choose the best view sequence in a reinforcement learning framework.
3. Algorithm
Overview
Taking a depth image $D_0$ as input, we first convert it to a point cloud $P_0$, which suffers from severe data loss. Our goal is to generate 3D points to complete $P_0$. The main thrust of our proposed algorithm is to represent the incomplete point cloud as multi-view depth maps and perform 2D inpainting tasks on them. To take full advantage of the context information, we execute these inpainting operations view by view in an accumulative way, with inferred points for the current viewpoint kept and used to help the inpainting of the next viewpoint. Assuming $D_0$ is rendered from $P_0$ under viewpoint $v_0$, we start our completion procedure with a new view $v_1$ and render $P_0$ under $v_1$ to obtain a new depth map $D_1$, which potentially has many holes. We fill these holes in $D_1$ with 2D inpainting, turning $D_1$ into $\hat{D}_1$. The inferred depth pixels in $\hat{D}_1$ are then converted to 3D points and aggregated with $P_0$ to output a denser point cloud $P_1$. This procedure is repeated for a sequence of new viewpoints $v_2, \ldots, v_n$, yielding point clouds $P_2, \ldots, P_n$, with $P_n$ being our final output. Figure 2 depicts the overall pipeline of our proposed algorithm. Since $P_n$ depends on the view path $v_1, v_2, \ldots, v_n$, we describe in Section 3.2 a deep reinforcement learning framework to seek the best view path. Before that, we first introduce in Section 3.1 our solution to another critical problem, 2D inpainting, i.e., transforming $D_i$ into $\hat{D}_i$.

Deep convolutional neural networks (CNNs) have been widely utilized to effectively extract context features for image inpainting tasks, achieving excellent performance. Although such a network can be directly applied to each viewpoint independently, this simplistic approach leads to inconsistencies across views because of the lack of global context understanding. We propose a volume-guided view inpainting framework that first conducts completion in the voxel space, converting $P_0$'s volumetric occupancy grid $V$ to its completed version $V_c$. Denote the projected depth map from $V_c$ to the view $v_i$ as $D^c_i$. Our inpainting of the $i$-th view takes both $D_i$ and $D^c_i$ as input and outputs $\hat{D}_i$. As shown in Figure 2, this is implemented using a three-module neural network architecture consisting of a volume completion network, a depth inpainting network, and a differentiable projection layer connecting them. The details of each module and our training strategy are described below.
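The iterative procedure above can be summarized by the following structural sketch. It is not the authors' released code: all callables passed in (backproject, render, complete_volume, render_volume, inpaint, choose_view) are hypothetical placeholders for the components described in this section, and encoding missing depth as 0 is an assumption.

```python
# Structural sketch of the progressive view-inpainting loop (assumptions noted above).
import numpy as np

def complete_scene(D0, v0, backproject, render, complete_volume,
                   render_volume, inpaint, choose_view, max_views=10):
    P = backproject(D0, v0)               # initial incomplete point cloud P_0
    V_c = complete_volume(P)              # coarse volumetric completion (done once)
    for _ in range(max_views):
        v_i = choose_view(P)              # DQN picks the next-best view (Sec. 3.2)
        if v_i is None:                   # placeholder: termination criterion reached
            break
        D_i = render(P, v_i)              # depth of P under v_i, with holes (0 = missing)
        D_ci = render_volume(V_c, v_i)    # volume guidance projected to the same view
        D_hat = inpaint(D_i, D_ci)        # 2D hole filling (volume-guided inpainting)
        mask = (D_i <= 0)                 # pixels that were missing in D_i
        # assume backproject returns one 3D point per pixel in raster order
        new_pts = backproject(D_hat, v_i)[mask.ravel()]
        P = np.concatenate([P, new_pts], axis=0)   # accumulate into P_i
    return P
```

Keeping only the pixels that were missing in $D_i$ ensures that points observed in earlier iterations are not duplicated when the inferred depth is merged back into the point cloud.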
Volume Completion

We employ the SSCNet proposed in [36] to map $V$ to $V_c$ for volume completion. SSCNet predicts not only volumetric occupancy but also the semantic label of each voxel. Such a multi-task learning scheme helps us better capture object-aware context features and contributes to higher accuracy. The readers are referred to [36] for details on how to set up this network architecture. We train the network as a voxel-wise binary classification task and take the output 3D probability map as $V_c$. The resolutions of the input and output volumes follow the settings in [36], with the output at a considerably lower resolution than the input.
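Building the occupancy grid $V$ from the point cloud is a standard voxelization step; a minimal sketch is given below. The grid origin, voxel size, and dimensions are illustrative assumptions rather than the paper's actual settings, which follow the SSCNet configuration in [36].

```python
import numpy as np

def voxelize(points, origin, voxel_size, dims):
    """Binary occupancy grid from an Nx3 point cloud.

    origin: (3,) world coordinate of the grid corner; voxel_size: voxel edge
    length; dims: (dx, dy, dz) voxels per axis. All three are illustrative."""
    idx = np.floor((np.asarray(points) - origin) / voxel_size).astype(np.int64)
    inside = np.all((idx >= 0) & (idx < np.asarray(dims)), axis=1)
    V = np.zeros(dims, dtype=np.float32)
    V[tuple(idx[inside].T)] = 1.0      # mark occupied voxels
    return V
```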
Depth Inpainting

In our work, the depth map is rendered as a grayscale image of fixed resolution. Among various existing approaches, the method of [20] is chosen to handle our case with holes of irregular shapes. Specifically, $D_i$ and $D^c_i$ are first concatenated to form a two-channel map. The resulting map is then fed into a U-Net structure implemented with masked and re-normalized convolution operations (also called partial convolutions), followed by an automatic mask-updating step. The output has the same resolution as the input. We refer the readers to [20] for details of the architecture settings and the design of the loss functions.
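A minimal sketch of how the two inputs and the hole mask could be assembled for such a partial-convolution network follows; `PartialConvUNet` and its `(x, mask)` calling convention are placeholders standing in for the network of [20], not an actual released API, and encoding missing depth as 0 is again an assumption.

```python
import torch

def prepare_inpainting_input(D_i, D_ci):
    """D_i:  HxW rendered depth with holes encoded as 0 (assumption),
    D_ci: HxW depth projected from the completed volume V_c.
    Returns the 1x2xHxW input tensor and its per-channel validity mask."""
    x = torch.stack([D_i, D_ci], dim=0).unsqueeze(0)   # 1 x 2 x H x W
    mask = torch.ones_like(x)
    mask[:, 0] = (D_i > 0).float()     # holes exist only in the rendered-view channel
    return x, mask

# usage with a hypothetical partial-convolution U-Net as in [20]:
# net = PartialConvUNet(in_channels=2, out_channels=1)
# x, mask = prepare_inpainting_input(D_i, D_ci)
# D_hat = net(x, mask)
```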
Projection Layer

As validated in the experiments described in Section 4.2, the projection of $V_c$ greatly benefits the inpainting of 2D depth maps. We further exploit the benefit of 2D inpainting for volume completion by propagating the 2D loss back to optimize the parameters of the 3D CNN. Doing so requires a differentiable projection layer, which was recently proposed in [43]. Thus, we connect $V_c$ and $D^c_i$ using this layer. For notational convenience, we use $V$ to represent $V_c$ and $D$ to represent $D^c_i$. Specifically, for each pixel $x$ in $D$, we launch a ray that starts from the viewpoint $v_i$, passes through $x$, and intersects a sequence of voxels in $V$, denoted $l_1, l_2, \ldots, l_{N_x}$. We denote the value of the $k$-th voxel in $V$ as $V_k$, which represents the probability of this voxel being empty. Then, we define the depth value of this pixel $x$ as
$$D(x) = \sum_{k=1}^{N_x} P^x_k d_k, \quad (1)$$
where $d_k$ is the distance from the viewpoint to voxel $l_k$ and $P^x_k$ is the probability that the ray corresponding to $x$ first meets voxel $l_k$:
$$P^x_k = (1 - V_k) \prod_{j=1}^{k-1} V_j, \quad k = 1, 2, \ldots, N_x. \quad (2)$$
The derivative of $D(x)$ with respect to $V_k$ can be calculated as
$$\frac{\partial D(x)}{\partial V_k} = \sum_{i=k}^{N_x} (d_{i+1} - d_i) \prod_{1 \le t \le i,\, t \ne k} V_t. \quad (3)$$
This guarantees back-propagation through the projection layer. To speed up the implementation, all rays are processed in parallel on the GPU.
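A minimal PyTorch sketch of Eqs. (1)-(2) is given below. Gathering the per-ray voxel probabilities and distances (the voxel traversal itself) is assumed to be done elsewhere, and relying on autograd instead of the closed form of Eq. (3) is a simplification, not the authors' implementation.

```python
import torch

def expected_depth_along_rays(V_probs, dists):
    """Differentiable projection of Eqs. (1)-(2).

    V_probs: (R, N) probability that each of the N voxels crossed by ray r is
             empty (V_k in the text).
    dists:   (R, N) distance d_k from the viewpoint to each of those voxels.
    Returns a length-R tensor of expected depths; backpropagating through it
    realizes the gradient of Eq. (3) via autograd."""
    cum = torch.cumprod(V_probs, dim=1)                                   # prod_{j<=k} V_j
    prev = torch.cat([torch.ones_like(cum[:, :1]), cum[:, :-1]], dim=1)   # prod_{j<k} V_j
    p_first = (1.0 - V_probs) * prev                                      # Eq. (2)
    return (p_first * dists).sum(dim=1)                                   # Eq. (1)
```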
Joint Training

Because our network consists of three sub-networks, we divide the entire training process into three stages to guarantee convergence: 1) the 3D convolution network is trained independently for scene completion; 2) with the parameters of the 3D convolution network fixed, we train the 2D convolution network for depth image inpainting under the guidance of the 3D model; 3) we train the entire network jointly, fine-tuning it with all parameters of the 2D and 3D convolution networks freed.

The training data are generated based on the SUNCG synthetic scene dataset provided in [36]. We first create $N$ depth images by rendering randomly selected scenes under randomly picked camera viewpoints. Each depth image $D_0$ is then converted to a point cloud $P_0$. Assuming $D_0$ is the projection of $P_0$ under the viewpoint $v_0$, we project $P_0$ to $m$ depth maps from $m$ randomly sampled views near $v_0$, to avoid causing large holes and to ensure that sufficient contextual information is available in the learning process. Each training sample consists of a point cloud and one of its corresponding depth maps.

Given an incomplete point cloud $P_0$ converted from $D_0$ with respect to view $v_0$, we describe in this subsection how to obtain the optimal next view sequence $v_1, v_2, \ldots, v_n$. The problem is defined as a Markov decision process (MDP) consisting of a state, an action, a reward, and an agent that takes actions during the process. The agent takes the current state as input, outputs the corresponding optimal action, and receives the reward from the environment. We train our agent using DQN [26], a deep reinforcement learning algorithm. The definition of the proposed MDP and the training procedure are given below.
State

We define the state as the updated point cloud at each iteration, with the initial state being $P_0$. As the iterations continue, the state for performing completion on the $i$-th view is $P_{i-1}$, which is accumulated from all previous iteration updates.
Action Space

The action at the $i$-th iteration is to determine the next best view $v_i$. To ease the training process and support the use of DQN, we evenly sample a set of scene-centric camera views to form a discrete action space. Specifically, we first place $P_0$ in its bounding sphere and keep it upright. Then, two circular paths are created, one on the equator and one on the 45-degree latitude line. In our experiments, 20 camera views are uniformly selected on these two paths (10 per circle). All views face the center of the bounding sphere, and we fix these views for all training samples. The set of 20 views is denoted as $C = \{c_1, c_2, \ldots, c_{20}\}$.

Figure 3. The architecture of our DQN. For a point-cloud state, an MVCNN is used to predict the best view for the next inpainting.
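A sketch of how such a two-circle, scene-centric action space could be constructed is given below; the equal 10/10 split per circle and the coordinate convention are assumptions consistent with the 20-view set $C$ described above.

```python
import numpy as np

def sample_action_views(center, radius, n_per_circle=10):
    """20 scene-centric camera positions: n_per_circle on the equator and
    n_per_circle on the 45-degree latitude circle of the bounding sphere."""
    views = []
    for elev in (0.0, np.deg2rad(45.0)):
        for k in range(n_per_circle):
            azim = 2.0 * np.pi * k / n_per_circle
            pos = center + radius * np.array([np.cos(elev) * np.cos(azim),
                                              np.cos(elev) * np.sin(azim),
                                              np.sin(elev)])
            views.append((pos, center))    # each camera looks at the sphere center
    return views
```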
Reward

A reward function is commonly utilized to evaluate the result of an action executed by the agent. In our work, at the $i$-th iteration, the input is an incomplete depth map $D_i$ rendered from $P_{i-1}$ under the view $v_i$ chosen from the action space $C$. The result of the agent's action is an inpainted depth image $\hat{D}_i$. Hence the accuracy of this inpainting operation can be used as the primary rewarding strategy. It is measured by the mean error, over the pixels inside the holes, between $\hat{D}_i$ and its ground truth $D^{gt}_i$. All ground-truth depth maps are pre-rendered from the SUNCG dataset. Thus we define the reward function as
$$R^{acc}_i = -\frac{1}{|\Omega|} L_1(\hat{D}_i, D^{gt}_i), \quad (4)$$
where $L_1$ denotes the $L_1$ loss computed over $\Omega$, $\Omega$ is the set of pixels inside the holes, and $|\Omega|$ is the number of pixels in $\Omega$.

If we only use the above reward function $R^{acc}_i$, the agent tends to change the viewpoint only slightly in each action cycle, since doing so results in small holes. However, this incurs higher computational cost while accumulating errors. We thus introduce a new reward term to encourage inferring more missing points at each step. This is implemented by measuring the percentage of filled original holes. To do so, we need to calculate the area of the missing regions in an incomplete point cloud $P$, which is not trivial in 3D space. Therefore, we project $P$ under all camera views of the action space $C$ and count the number of pixels inside the generated holes in each rendered image. The sum of these numbers, denoted $Area_h(P)$, measures the area.
Figure 4. Comparisons on variants of the depth inpainting network. Given incomplete depth images, we show results of our proposed method w/o volume guidance, w/o projection back-propagation, and ours, compared with the ground truth. Both the inpainted map and its error map are shown.
We thus define the new reward term as
$$R^{hole}_i = \frac{Area_h(P_{i-1}) - Area_h(P_i)}{Area_h(P_0)}, \quad (5)$$
which also discourages the agent from choosing the same action as in previous steps. We further define a termination criterion to stop the view path search when $Area_h(P_i)/Area_h(P_0)$ falls below a small threshold, which means that nearly all missing points of $P_0$ have been recovered. We set the reward at the terminal state to zero. Therefore, our final reward function is
$$R^{total}_i = w R^{acc}_i + (1 - w) R^{hole}_i, \quad (6)$$
where $w$ is a fractional weight that balances the two reward terms.
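A sketch of how the reward terms of Eqs. (4)-(6) could be computed follows; the placeholder renderer, the hole encoding (missing depth as 0), and the default weight w=0.5 are assumptions, since the paper's actual weight value is not reproduced in this text.

```python
import numpy as np

def hole_area(P, views, render):
    """Area_h(P): total number of hole pixels when P is rendered under every
    camera view of the action space C (render is a placeholder returning an
    HxW depth map with 0 at missing pixels)."""
    return sum(int((render(P, v) <= 0).sum()) for v in views)

def total_reward(D_hat, D_gt, hole_mask, area_prev, area_cur, area_init, w=0.5):
    """R_total of Eq. (6): accuracy term of Eq. (4) plus hole-filling term of Eq. (5)."""
    r_acc = -np.abs(D_hat[hole_mask] - D_gt[hole_mask]).mean()   # Eq. (4)
    r_hole = (area_prev - area_cur) / float(area_init)           # Eq. (5)
    return w * r_acc + (1.0 - w) * r_hole
```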
DQN Training

Our DQN is built upon the MVCNN [38]. It takes multi-view depth maps projected from $P_{i-1}$ as inputs and outputs the Q-value of each action. The whole network is trained to approximate the action-value function $Q(P_{i-1}, v_i)$, which is the expected reward the agent receives when taking action $v_i$ at state $P_{i-1}$. To ensure stability of the learning process, we introduce a separate target network following [26], and the loss function for training the DQN is
$$Loss(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{v_{i+1}} Q(P_i, v_{i+1}; \theta') - Q(P_{i-1}, v_i; \theta)\big)^2\Big]. \quad (7)$$
Here $r$ is the reward, $\gamma$ a discount factor, and $\theta'$ the parameters of the target network. For effective learning, we create an experience replay buffer to reduce the correlation between data. The buffer stores the tuples $(P_{i-1}, v_i, r, P_i)$ produced during the episodes. We also employ the technique of [44] to remove the upward bias caused by $\max_{v_{i+1}} Q(P_i, v_{i+1}; \theta')$ and change the loss function to
$$L_{our} = \mathbb{E}\Big[\big(r + \gamma Q(P_i, \arg\max_{v_{i+1}} Q(P_i, v_{i+1}; \theta); \theta') - Q(P_{i-1}, v_i; \theta)\big)^2\Big]. \quad (8)$$
Combined with the dueling DQN structure [47], our network structure is shown in Figure 3. At state $P_{i-1}$, we render the point cloud at all viewpoints $c_1, c_2, \ldots, c_{20}$ of the action space $C$ and obtain the corresponding multi-view depth maps $D^1_i, D^2_i, \ldots, D^{20}_i$. These depth maps are sent to the same CNN as inputs. After a view-pooling layer and a fully-connected layer, we obtain a 512-D vector, which is split evenly into two parts to learn the advantage function $A(v, P)$ and the state value function $V(P)$ [47]. Finally, after combining the results of the two functions, we obtain our final output, a 20-D vector of Q-values over the action space $C$. We use an $\epsilon$-greedy policy to choose the action $v_i$ for state $P_{i-1}$, i.e., a random action with probability $\epsilon$ or the action that maximizes the Q-values with probability $1-\epsilon$. In the end, we reach the decision on the depth map $D_i$ for inpainting.

The training data are also generated from SUNCG. We use the same $N$ depth images as in Section 3.1 and use the action space $C$ to generate new data. The ground-truth depth maps used in the reward calculation are rendered from the same viewpoints of the action space $C$.
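The dueling multi-view Q-network and the double-DQN loss of Eq. (8) could be sketched as follows; the small per-view backbone, the 256/256 feature split, and the discount value are illustrative stand-ins rather than the authors' exact architecture or hyper-parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingMVCNNQ(nn.Module):
    """Sketch of the DQN in Figure 3: a shared CNN applied to each rendered
    view, max view-pooling, then dueling value/advantage heads over the 20
    actions. The per-view backbone below is a stand-in, not the MVCNN of [38]."""
    def __init__(self, n_actions=20):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, 512), nn.ReLU())
        self.value = nn.Linear(256, 1)
        self.advantage = nn.Linear(256, n_actions)

    def forward(self, depths):                  # depths: (B, n_views, H, W)
        B, nv, H, W = depths.shape
        f = self.backbone(depths.view(B * nv, 1, H, W)).view(B, nv, -1)
        f = f.max(dim=1).values                 # view pooling across the renderings
        val, adv = f[:, :256], f[:, 256:]       # split the 512-D feature into two streams
        v, a = self.value(val), self.advantage(adv)
        return v + a - a.mean(dim=1, keepdim=True)   # dueling combination [47]

def double_dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Double-DQN loss of Eq. (8) over a replay batch
    (states, actions, rewards, next_states, done); gamma is illustrative."""
    s, a, r, s2, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a2 = q_net(s2).argmax(dim=1, keepdim=True)      # argmax with the online net
        q2 = target_net(s2).gather(1, a2).squeeze(1)    # evaluated with the target net
        target = r + gamma * q2 * (1.0 - done)
    return F.mse_loss(q, target)
```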
4. Experimental Results
Dataset
The dataset used to train our 2DCNN and DQN is generated from SUNCG [36]. Specifically, for the 2DCNN, we render depth maps as described in Section 3.1 with $m = 10$ views per point cloud and remove the maps whose camera views are occluded by doors or walls. A subset of the remaining maps is held out for testing and the rest is used for training. For the DQN, a separate set of depth images is generated and split into training and testing episodes.
Implementation Details

Our network architecture is implemented in PyTorch. The pre-trained model of SSCNet provided by [36] is used to initialize the parameters of our 3DCNN part. It takes 30 hours to train the inpainting network on our training dataset and 20 hours to fine-tune the whole network after the addition of the projection layer. During DQN training, we first run a number of warm-up episodes to fill the experience replay buffer: in each such episode, the DQN chooses an action randomly at every iteration step and stores the tuple $(P_{i-1}, v_i, r, P_i)$ in the buffer. After this pre-training, the network learns from randomly sampled batches drawn from the buffer at each step of subsequent episodes. The batch size is set to 16. The weight $w$ for the reward calculation and the discount factor $\gamma$ are fixed, and $\epsilon$ is annealed from a high to a low value over training and then held constant. Training the DQN takes 3 days, and running our complete algorithm once adopts five viewpoints on average.

In this part, we evaluate our proposed method against SSCNet [36], which is one of the most popular approaches in this area. Although many incremental works based on SSCNet exist, such as [46] and [12], they all produce volumetric outputs at the same resolution as SSCNet. Since neither the code nor the pre-trained models of these methods are public, we also compare our result with the corresponding 3D ground-truth volume, whose accuracy can be treated as the upper bound of all existing volume-based scene completion methods. We denote this method as volume-gt. For evaluation, we first render the volumes obtained from SSCNet and volume-gt to several depth maps under the same viewpoints as our method, and then convert these depth maps to point clouds.
Quantitative Comparisons
The Chamfer Distance (CD) [8] is used as one of our metrics to evaluate the accuracy of the generated point set $P$ compared with the ground-truth point cloud $P^{GT}$. Similar to [8], we also use a completeness metric to evaluate how complete the generated result is. We define it as
$$C_r(P, P^{GT}) = \frac{\big|\{\, x \in P^{GT} \mid d(x, P) < r \,\}\big|}{\big|\{\, y \mid y \in P^{GT} \,\}\big|}, \quad (9)$$
where $d(x, P)$ denotes the distance from a point $x$ to a point set $P$, $|\cdot|$ denotes the number of elements in a set, and $r$ is the distance threshold. In our experiments, we report the completeness w.r.t. five different values of $r$. The results are reported in Tab 1. As seen, our approach significantly outperforms all the others. This also validates that using a volumetric representation greatly reduces the quality of the outputs.
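A sketch of the two evaluation metrics is given below; the symmetric, mean-distance form of the Chamfer Distance used here is one common variant and may differ from the exact formulation of [8].

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(P, P_gt):
    """Symmetric Chamfer Distance between two Nx3 point sets (mean nearest-
    neighbour distance in both directions; one common variant of [8])."""
    d_pg = cKDTree(P_gt).query(P)[0]       # P    -> P_gt
    d_gp = cKDTree(P).query(P_gt)[0]       # P_gt -> P
    return d_pg.mean() + d_gp.mean()

def completeness(P, P_gt, r):
    """C_r of Eq. (9): fraction of ground-truth points within distance r of P."""
    d = cKDTree(P).query(P_gt)[0]
    return float((d < r).mean())
```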
Qualitative Comparisons

The visual comparisons of these methods are shown in Figure 5. It can be seen that the point cloud generated by SSCNet has no surface details. Although our method shows more errors than volume-gt in some local regions, it overall produces more accurate results, as validated in Tab 1. In addition, by conducting completion in multiple views, our approach also recovers more missing points, showing better completeness, again as validated in Tab 1.
Figure 5. Comparisons against the state of the art. Given different inputs and the referenced ground truth, we show the completion results of three methods, with the corresponding point cloud error maps below and zoom-in areas beside.

Table 1. Quantitative comparisons against existing methods. The CD metric and the completeness metric (w.r.t. different thresholds) are used.
(Table 1 body: CD and completeness $C_r$ at five thresholds for SSCNet, Volume-GT, ScanComplete, $U_5$, $U_{10}$, DQN w/o-hole, and Ours; the numeric entries are not recoverable from this text.)

To assess the effectiveness of several key components of our system, we conduct ablation experiments by removing each component in turn.
On Depth Inpainting
Firstly, to evaluate the efficacy of the volume guidance, we propose two variants of our method: 1) we train a 2D inpainting network directly without projecting the volume as guidance, denoted as DepIn w/oVG; 2) we train the volume-guided 2D inpainting network without projection back-propagation, denoted as DepIn w/oPBP. We use the $L_1$, PSNR, and SSIM metrics for the comparisons. The quantitative results are reported in Tab 2 and the visual comparisons are shown in Figure 4. All of them show the superiority of our design.
Table 2. Quantitative ablation studies on inpainting network.
(Table 2 body: $L_1$, PSNR, and SSIM for DepIn w/oVG, DepIn w/oPBP, and Ours; the numeric values are not recoverable from this text.)
On View Path Planning
Without using DQN for path planning, there exists a straightforward way to do completion: we can uniformly sample a fixed number of views from $C$ and directly perform depth inpainting on them. In this uniform manner, two methods with two different numbers of views (5 and 10) are evaluated.
Figure 6. Comparisons on the variants of view path planning. Given different inputs and the referenced ground truth, we show the completion results of four different approaches, with the corresponding point cloud error maps below.
We denote them as $U_5$ and $U_{10}$. The results of CD and $C_r(P, P^{GT})$ for these two methods and ours are reported in Tab 1. As seen, increasing the number of uniformly sampled views reduces accuracy, possibly because of the increased accumulated errors. Using DQN greatly improves the accuracy, which validates the importance of a better view path, while all of them give rise to similar completeness. In addition, we also train a new DQN with only the reward $R^{acc}_i$, denoted DQN w/o-hole, which chooses seven viewpoints on average since it tends to pick views with small holes for a higher $R^{acc}_i$. The results in Tab 1 verify the effectiveness of the reward $R^{hole}_i$. Visual comparison results on some sampled scenes are shown in Figure 6, where our proposed model produces much better appearances than the others.
5. Conclusion
In this paper, we propose the first surface-based approach for 3D scene completion from a single depth image. The missing 3D points are inferred by conducting completion on multi-view depth maps. To guarantee a more accurate and consistent output, a volume-guided view inpainting network is proposed. In addition, a deep reinforcement learning framework is devised to seek the optimal view path that contributes the best accuracy. The experiments demonstrate that our model significantly outperforms existing methods. There are two research directions worth further exploration in the future: 1) how to make use of the texture information from the input RGB-D images to achieve more accurate depth inpainting; 2) how to perform texture completion together with depth inpainting, to output a complete textured 3D scene.
Acknowledgements
We thank the anonymous reviewers for the insightful and constructive comments. This work was funded in part by the Pearl River Talent Recruitment Program Innovative and Entrepreneurial Teams in 2017 under grant No. 2017ZT07X152, by Shenzhen Fundamental Research Fund under grants No. KQTD2015033114415450 and No. ZDSYS201707251409055, and by the National Natural Science Foundation of China under Grants 91748104, 61632006, and 61751203.
References

[1] Paul S. Blaer and Peter K. Allen. Data acquisition and view planning for 3-D modeling tasks. In IROS, 2007.
[2] Kang Chen, Yu-Kun Lai, Yu-Xin Wu, Ralph Martin, and Shi-Min Hu. Automatic semantic modeling of indoor scenes from low-quality RGB-D data using contextual information. ACM Trans. Graph., 33(6), 2014.
[3] Weihai Chen, Haosong Yue, Jianhua Wang, and Xingming Wu. An improved edge detection algorithm for depth map inpainting. Optics and Lasers in Engineering, 55:69-77, 2014.
[4] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.
[5] Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3D-encoder-predictor CNNs and shape synthesis. In CVPR, 2017.
[6] Angela Dai, Daniel Ritchie, Martin Bokeloh, Scott Reed, Jürgen Sturm, and Matthias Nießner. ScanComplete: Large-scale scene completion and semantic segmentation for 3D scans. In CVPR, 2018.
[7] David Doria and Richard J. Radke. Filling large holes in lidar data by inpainting depth gradients. In CVPR Workshops, 2012.
[8] Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.
[9] Martin Garbade, Johann Sawatzky, Alexander Richard, and Juergen Gall. Two stream 3D semantic scene completion. arXiv preprint arXiv:1804.03550, 2018.
[10] Josselin Gautier, Olivier Le Meur, and Christine Guillemot. Depth-based image completion for view synthesis. IEEE, 2011.
[11] Ruiqi Guo, Chuhang Zou, and Derek Hoiem. Predicting complete 3D models of indoor scenes. arXiv preprint arXiv:1504.02437, 2015.
[12] Yu-Xiao Guo and Xin Tong. View-volume network for semantic scene completion from a single depth image. In IJCAI, 2018.
[13] Saurabh Gupta, Pablo Andrés Arbeláez, Ross B. Girshick, and Jitendra Malik. Aligning 3D models to RGB-D images of cluttered scenes. In CVPR, 2015.
[14] Xiaoguang Han, Zhen Li, Haibin Huang, Evangelos Kalogerakis, and Yizhou Yu. High-resolution shape completion using deep neural networks for global structure and local geometry inference. In ICCV, 2017.
[15] Daniel Herrera, Juho Kannala, Janne Heikkilä, et al. Depth map inpainting under a second-order smoothness prior. In Scandinavian Conference on Image Analysis, 2013.
[16] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics, 36(4), 2017.
[17] Michael Kazhdan and Hugues Hoppe. Screened Poisson surface reconstruction. ACM Transactions on Graphics, 32(3), 2013.
[18] Vladimir G. Kim, Wilmot Li, Niloy J. Mitra, Siddhartha Chaudhuri, Stephen DiVerdi, and Thomas Funkhouser. Learning part-based templates from large collections of 3D shapes. ACM Transactions on Graphics, 32(4), 2013.
[19] Yangyan Li, Xiaokun Wu, Yiorgos Chrysathou, Andrei Sharf, Daniel Cohen-Or, and Niloy J. Mitra. GlobFit: Consistently fitting primitives by discovering global relations. ACM Transactions on Graphics, 30(4), 2011.
[20] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723, 2018.
[21] Junyi Liu, Xiaojin Gong, and Jilin Liu. Guided inpainting and filtering for Kinect depth maps. In ICPR, 2012.
[22] Miaomiao Liu, Xuming He, and Mathieu Salzmann. Building scene models by completing and hallucinating depth and semantics. In ECCV, 2016.
[23] Shice Liu, Yu Hu, Yiming Zeng, Qiankun Tang, Beibei Jin, Yinhe Han, and Xiaowei Li. See and think: Disentangling semantic scene completion. In NIPS, 2018.
[24] Kok-Lim Low and Anselmo Lastra. An adaptive hierarchical next-best-view algorithm for 3D reconstruction of indoor scenes. In Pacific Graphics, 2006.
[25] Niloy J. Mitra, Leonidas J. Guibas, and Mark Pauly. Partial and approximate symmetry detection for 3D geometry. ACM Transactions on Graphics, 25(3), 2006.
[26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[27] Suryanarayana M. Muddala, Marten Sjostrom, and Roger Olsson. Depth-based inpainting for disocclusion filling. IEEE, 2014.
[28] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[29] Charles R. Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, 2016.
[30] Jason Rock, Tanmay Gupta, Justin Thorsen, JunYoung Gwak, Daeyun Shin, and Derek Hoiem. Completing 3D object shape from one depth image. In CVPR, 2015.
[31] Ruwen Schnabel, Patrick Degener, and Reinhard Klein. Completion and reconstruction with primitive shapes. Computer Graphics Forum, 28(2):503-512, 2009.
[32] Tianjia Shao, Weiwei Xu, Kun Zhou, Jingdong Wang, Dongping Li, and Baining Guo. An interactive approach to semantic modeling of indoor scenes with an RGBD camera. ACM Trans. Graph., 31(6), 2012.
[33] Abhishek Sharma, Oliver Grau, and Mario Fritz. VConv-DAE: Deep volumetric shape learning without object labels. In ECCV, 2016.
[34] Chao-Hui Shen, Hongbo Fu, Kang Chen, and Shi-Min Hu. Structure recovery by part assembly. ACM Transactions on Graphics, 31(6), 2012.
[35] Ivan Sipiran, Robert Gregor, and Tobias Schreck. Approximate symmetry detection in partial 3D meshes. Computer Graphics Forum, 33:131-140, 2014.
[36] Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas A. Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017.
[37] Olga Sorkine and Daniel Cohen-Or. Least-squares meshes. In Shape Modeling Applications, 2004.
[38] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In ICCV, 2015.
[39] Minhyuk Sung, Vladimir G. Kim, Roland Angst, and Leonidas Guibas. Data-driven structural priors for shape completion. ACM Transactions on Graphics, 34(6), 2015.
[40] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Multi-view 3D models from single images with a convolutional network. In ECCV, 2016.
[41] Ali K. Thabet, Jean Lahoud, Daniel Asmar, and Bernard Ghanem. 3D aware correction and completion of depth maps in piecewise planar scenes. In ACCV, 2014.
[42] Duc Thanh Nguyen, Binh-Son Hua, Khoi Tran, Quang-Hieu Pham, and Sai-Kit Yeung. A field model for repairing 3D shapes. In CVPR, 2016.
[43] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. CoRR, abs/1704.06254, 2017.
[44] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, 2016.
[45] Jacob Varley, Chad DeChant, Adam Richardson, Joaquín Ruales, and Peter Allen. Shape completion enabled robotic grasping. In IROS, 2017.
[46] Yida Wang, David Joseph Tan, Nassir Navab, and Federico Tombari. Adversarial semantic scene completion from a single depth image. IEEE, 2018.
[47] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
[48] Shihao Wu, Wei Sun, Pinxin Long, Hui Huang, Daniel Cohen-Or, Minglun Gong, Oliver Deussen, and Baoquan Chen. Quality-driven Poisson-guided autoscanning. ACM Transactions on Graphics, 33(6), 2014.
[49] Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, Daniel Cohen-Or, and Baoquan Chen. 3D attention-driven depth acquisition for object identification. ACM Transactions on Graphics, 35(6), 2016.
[50] Hongyang Xue, Shengming Zhang, and Deng Cai. Depth image inpainting: Improving low rank matrix completion with low gradient regularization. IEEE Transactions on Image Processing, 26(9):4311-4320, 2017.
[51] Xin Yang, Yuanbo Wang, Yaru Wang, Baocai Yin, Qiang Zhang, Xiaopeng Wei, and Hongbo Fu. Active object reconstruction using a guided view planner. arXiv preprint arXiv:1805.03081, 2018.
[52] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589, 2018.
[53] Hai-Tao Zhang, Jun Yu, and Zeng-Fu Wang. Probability contour guided depth map inpainting and superresolution using non-local total generalized variation. Multimedia Tools and Applications, 77(7):9003-9020, 2018.
[54] Liang Zhang, Le Wang, Xiangdong Zhang, Peiyi Shen, Mohammed Bennamoun, Guangming Zhu, Syed Afaq Ali Shah, and Juan Song. Semantic scene completion with dense CRF from a single depth image. Neurocomputing, 318:182-195, 2018.
[55] Yinda Zhang and Thomas Funkhouser. Deep depth completion of a single RGB-D image. In CVPR, 2018.
[56] Wei Zhao, Shuming Gao, and Hongwei Lin. A robust hole-filling algorithm for triangular mesh. The Visual Computer, 23(12):987-997, 2007.
[57] Qian-Yi Zhou and Vladlen Koltun. Dense scene reconstruction with points of interest. ACM Transactions on Graphics, 32(4), 2013.

Supplemental Material
In this supplemental material, more comparison results are shown. Fig 7 shows the results of our method and other methods tested on the NYU dataset. Fig 8 shows comparisons of different methods selecting different view paths. Fig 9 shows more results, where our method is compared with voxel-based algorithms and the other methods appearing in our paper. Fig 10 shows comparisons on all variants of our inpainting network.
Figure 7. NYU data(a) testing results: SSCNet(c) and ours(d).Figure 8. Comparisons of different methods choosing different view paths. Given the same input and the referenced groundtruth, we showthe completion results after processing the first viewpoint and after the second viewpoint, and the final results where the whole view pathshave been completed. The corresponding point cloud error maps are shown.igure 9. Comparisons of our method against the others. Given different inputs and the referenced groundtruth, we show the completionresults of the six methods with the corresponding point cloud error maps shown below.igure 10. Comparisons on variants of depth inpainting network in eight groups. Given incompleted depth images, we show results ofour proposed method with and without . ) volume-guidance and . ))