Boundary-induced and Scene-aggregated Network for Monocular Depth Prediction
Feng Xue a,1, Junfeng Cao a,1, Yu Zhou b,∗, Fei Sheng a, Yankai Wang a, Anlong Ming a
a Beijing University of Posts and Telecommunications, Beijing, China
b Huazhong University of Science and Technology, Wuhan, China
Abstract
Monocular depth prediction is an important task in scene understanding. It aims to predict the dense depth of a single RGB image. With the development of deep learning, the performance of this task has improved greatly. However, two issues remain unresolved: (1) the deep features encode the wrong farthest region of a scene, which distorts the 3D structure of the predicted depth; (2) the low-level features are insufficiently utilized, which makes it hard to estimate the depth near edges with sudden depth changes. To tackle these two issues, we propose the Boundary-induced and Scene-aggregated network (BS-Net). In this network, the Depth Correlation Encoder (DCE) is first designed to obtain the contextual correlations between regions in an image and to perceive the farthest region by considering these correlations. Meanwhile, the Bottom-Up Boundary Fusion (BUBF) module is designed to extract an accurate boundary that indicates depth change. Finally, the Stripe Refinement Module (SRM) is designed to refine the dense depth induced by the boundary cue, which improves the boundary accuracy of the predicted depth. Experimental results on the NYUD v2 dataset and the iBims-1 dataset illustrate the state-of-the-art performance of the proposed approach, and the SUN-RGBD dataset is employed to evaluate its generalization. Code is available at https://github.com/XuefengBUPT/BS-Net.
Keywords: monocular depth prediction, boundary-induced, depth correlation
∗ Corresponding author. Email address: [email protected] (Yu Zhou)
1 Equal contribution.
Figure 1: Visualization of the predicted depth. The white boxes indicate the farthest region in each depth map. The black boxes indicate the shape detail of the bookcase.
1. Introduction
Dense and accurate depth is widely used in many computer vision applications, such as autonomous driving [1, 2] and robotics [3, 4, 5, 6]. Generally, acquiring dense depth depends on specific sensors, i.e., stereo cameras [7, 8] and time-of-flight cameras. To lower the requirements on the sensor, predicting dense depth from a single image has attracted the attention of many researchers.

In recent years, many deep learning based methods have been proposed to tackle this task and have achieved large performance gains. Some of them [9, 10, 11, 12, 13] are based on the encoder-decoder structure and utilize specific modules or extra labels to improve the ability of their networks for scene understanding. Although these methods achieve good performance in reducing pixel-wise depth error, the ability to recover 3D space is still limited by an unstable 3D scene structure, which manifests as mistakenly estimating the farthest region of the scene. As shown in Fig. 1, previous methods mistakenly regard the bookcase or the wall as the globally farthest region, leading to scene distortion. The intrinsic reason is that the encoder lacks consideration of the correlation between distant points when modeling the scene with high-level semantics, so each pixel of the high-level feature only represents the depth inside a region of the input image. As a result, when the decoder recovers the pixel-wise depth, the farthest region may be wrongly estimated. Besides, the boundary of objects is an important cue in 3D space, since it indicates sudden changes of depth. To obtain a depth map with sharp boundaries, several recent methods [11, 14, 10, 15] focus on improving the accuracy of the boundary in the predicted depth. But due to the insufficient use of low-level features, the boundaries are still hard to recover accurately. The black polygon in Fig. 1 depicts the detail of the bookcase: previous methods fail to recover the accurate shapes of the small objects in the bookcase.

In this paper, the Boundary-induced and Scene-aggregated network (BS-Net) is proposed. Firstly, to perceive the farthest region, we design the Depth Correlation Encoder (DCE). On the one hand, dilated convolutions are utilized to extract the correlation between long-distance pixels that are independent of each other, based on which the relative depth between different independent pixels is built. On the other hand, the Pyramid Scene Encoder (PSE) extracts the dominant features in multi-scale regions and fuses them into one, which obtains the correlation between different regions. Secondly, to effectively recover the boundary, we design the Bottom-Up Boundary Fusion (BUBF) module. Starting from stage 2 of the encoder, the module gradually fuses the features of each two adjacent levels, passing the location information to the high-level feature. Thus, it eliminates the ineffective edges (indicating small depth changes) in the low-level cues under the guidance of high-level cues, and eventually extracts the boundary edges (indicating sudden depth changes).
To effectively exploit the boundary, the Stripe Refinement Module (SRM) is designed to replace the conventional regression module. Convincing experiments on the NYUD v2 dataset [16] demonstrate the effectiveness of our method, and additional results on the SUN-RGBD dataset [17] prove its generalization.

The main contributions of our method can be summarized as follows:
• To perceive the farthest region in a depth map, the well-designed Depth Correlation Encoder (DCE) extracts the correlations between long-distance pixels and the correlations between multi-scale regions.
• To effectively extract and utilize the boundary, the Bottom-Up Boundary Fusion (BUBF) module is designed to gradually fuse features of adjacent levels. Meanwhile, the Stripe Refinement Module (SRM) is designed to refine the depth around the boundary.
• Numerous experiments on the NYUD v2 dataset [16] prove the effectiveness of our method: the proposed method achieves state-of-the-art performance. An additional experiment on the SUN-RGBD dataset [17] proves the generalization of our method.
2. Related Work
Handcrafted-feature based depth prediction methods utilize handcrafted features to estimate the depth from a single image. Saxena et al. [18] design a multi-scale Markov Random Field to predict the depth and extend it with a combination of multiple cues [19]. Liu et al. [20] define the task as a discrete-continuous optimization via a Conditional Random Field. However, handcrafted features limit the expression of depth and are unable to predict accurate depth.
Deep learning based depth prediction methods significantly improve the performance of depth prediction. Initially, the encoder-decoder structure was widely utilized and deep learning based depth prediction was divided into two parts [21, 22]: global coarse depth prediction and local shape detail refinement. The former obtains a low-resolution depth map as an intermediate product, while the latter obtains a final depth map at the original resolution. To obtain smooth and accurate depth, Hu et al. [11] utilize multi-level features and a multi-task loss. With a similar goal, Chen et al. [10] gradually refine the final depth. Hao et al. [23] preserve the high resolution of feature maps by utilizing continuous dilated convolutions. Zhang et al. [24] retain the object shape by hierarchically fusing the side-outputs with the decoder. Fu et al. [26] introduce the spacing-increasing discretization to regress ordinal depth. Similarly, Lee et al. [13] refine the final map by combining predicted relative depth at various scales. Besides, multi-task learning is widely studied: several methods [27, 22, 28, 29, 30] jointly utilize the labels of depth, semantic segmentation, and surface normals in predicting monocular depth. However, existing encoder-decoder based methods always focus on how to obtain detailed shapes of objects and ignore the false prediction of the farthest region, which leads to the distortion of the 3D structure. Our method focuses on addressing this problem.
The datasets for depth prediction are divided into two categories according to the scene, namely indoor datasets and outdoor datasets. Among the indoor datasets [16, 17, 31], only the NYUD v2 dataset [16] provides dense and accurate depth aligned with the corresponding RGB images; thus, it is utilized to evaluate the performance of monocular depth prediction methods. Other datasets provide raw dense depth and RGB pairs captured by Kinect, and are therefore utilized to evaluate generalization. Since our method uses gradients to express the boundary in the depth map, we utilize the NYUD v2 dataset to verify the performance of our method and the other datasets to verify its generalization. Among the outdoor datasets [19, 32, 33], KITTI [32] and CityScapes [33] provide RGB images captured by a camera and the corresponding sparse depth captured by a 3D LiDAR. But since sparse depth cannot be used to extract pixel-wise gradients, these two datasets are not employed to verify our method. Mono3D [19] provides dense but inaccurate depth; thus, it is not employed either.
Contextual extractors are significant for scene understanding tasks, e.g., monocular depth prediction, semantic segmentation, and object detection. Chen et al. [34] employ Atrous Spatial Pyramid Pooling (ASPP) to encode the image context at multiple scales. Fu et al. [26] propose a Scene Understanding Module to achieve a comprehensive understanding of the input image. Liu et al. [35] extract global context information by global pooling and combine it with a simple Fully Convolutional Network structure. Different from these methods, our method extracts two types of contextual features in order to predict the correct farthest region in the depth map.
Multi-level feature fusion exploits the features from different levels of the encoder and is widely utilized in many computer vision tasks [36, 37, 11, 38, 23, 39, 40]. Hu et al. [11] concatenate all side-outputs of the encoder to fuse multi-level features. Xu et al. [36] learn the relationship between features of different levels and extend it with an attention model for multi-level feature fusion [37]. Hao et al. [23] gradually fuse the features of different levels. Different from these methods, our motivation is to reduce the gap between multi-level features in the fusion process. Thus, starting from the lowest level, our method gradually fuses the features of each two adjacent levels.
Boundary cues reveal the boundaries of objects. Thus, they have a high response at the depth discontinuity between foreground and background, and they affect the accuracy of depth prediction around object boundaries. Different from other works [41, 42, 38, 43, 44], which utilize the boundary as a label, we lack a boundary label. Thus, this paper applies the gradient of depth to guide the generation of the boundary cue.
3. Method
In this section, we introduce the Boundary-induced and Scene-aggregated Network (BS-Net). This network consists of an encoder, a decoder, and three modules. As shown in Fig. 2, the network first employs ResNet [45] as the encoder. To obtain a larger receptive field, the down-sampling operations of stages 4 and 5 are replaced by dilated convolutions [46]. Then, to perceive the farthest region, the Depth Correlation Encoder (DCE) is designed to aggregate the correlations between long-distance pixels and multi-scale regions from Res 5 (see Sec. 3.1). Subsequently, a decoder is designed to predict the depth from the learned feature. It consists of four consecutive steps: the first two steps utilize Large-Kernel Refinement Blocks (l-RB) to compress channels while keeping the resolution, and the last two steps utilize the combined operation of upsampling and l-RB, which is equal to the UpProjection in [47], to expand the resolution. Meanwhile, taking the side-outputs of Res 2/3/4/5 as input, the Bottom-Up Boundary Fusion (BUBF) module is introduced to gradually extract boundaries indicating sudden changes in depth (see Sec. 3.2). Finally, the features from the decoder and BUBF are concatenated, and a Stripe Refinement Module (SRM) is designed to refine the high-resolution depth under the guidance of the learned boundary (see Sec. 3.3).

Figure 2: Our network structure, which consists of an encoder-decoder network, a Depth Correlation Encoder (DCE), a Bottom-Up Boundary Fusion (BUBF) module and a Stripe Refinement Module (SRM).
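As a structural summary of Fig. 2, a minimal PyTorch skeleton of the forward pass is sketched below. BSNetSkeleton and all of its stand-in layers (plain convolutions in place of the DCE, BUBF, decoder, and SRM, and without the dilated stages 4-5) are hypothetical placeholders used only to show how the pieces connect; they are not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


class BSNetSkeleton(nn.Module):
    """Structural sketch of the BS-Net data flow in Fig. 2; every layer below is a stand-in."""

    def __init__(self):
        super().__init__()
        backbone = resnet50()
        # ResNet-50 stem and stages Res2-Res5. The paper additionally replaces the
        # down-sampling of stages 4-5 with dilated convolutions, which is omitted here.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.res2, self.res3 = backbone.layer1, backbone.layer2
        self.res4, self.res5 = backbone.layer3, backbone.layer4
        self.dce = nn.Conv2d(2048, 512, 1)        # stand-in for the Depth Correlation Encoder
        self.decoder = nn.Sequential(             # stand-in for the l-RB / UpProjection decoder
            nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(inplace=True))
        self.bubf = nn.Conv2d(256, 64, 1)         # stand-in for BUBF (the paper uses Res2-Res5)
        self.srm = nn.Conv2d(128 + 64, 1, 3, padding=1)  # stand-in for the Stripe Refinement Module

    def forward(self, x):
        x2 = self.res2(self.stem(x))              # side-output of stage 2
        x3 = self.res3(x2)
        x4 = self.res4(x3)
        x5 = self.res5(x4)                        # highest-level feature fed to the DCE
        dense = self.decoder(self.dce(x5))        # scene-aggregated dense feature
        boundary = self.bubf(x2)                  # boundary cue (simplified: Res2 only)
        boundary = F.interpolate(boundary, size=dense.shape[2:],
                                 mode="bilinear", align_corners=True)
        return self.srm(torch.cat([dense, boundary], dim=1))  # boundary-guided depth regression


# e.g. BSNetSkeleton()(torch.randn(1, 3, 228, 304)) yields a coarse one-channel depth map.
```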
3.1. Depth Correlation Encoder (DCE)

To perceive the farthest region, the Depth Correlation Encoder (DCE) is proposed to capture the correlations between long-distance pixels and multi-scale regions. Given the feature of Res 5 as input, the module captures the correlations with eight parallel branches and perceives the farthest region by considering these correlations, as depicted in Fig. 3. Specifically, the first three branches utilize three parallel dilated convolutions. They all have the same kernel size but different dilation rates, i.e., 6, 12, and 18. For the contextual feature, namely Res 5, each pixel encodes the depth of a region of the input image. Thus, these dilated convolutions encode the correlation between two distant regions of the input image. The largest-rate kernel has a receptive field covering the entire image, while the smallest one only covers a fraction of it, so the multi-scale correlations between pixels are extracted sufficiently. After each dilated convolution, a convolution is employed to integrate the information across channels and remove the grid artifacts caused by the dilated convolution. As shown in Fig. 3, the farthest region perceived by the dilated convolutions is different from Res 5 and is more consistent with the true depth. The fourth branch consists of two consecutive convolutions that learn local cross-channel interaction.

Figure 3: Illustration of our proposed Depth Correlation Encoder (DCE). The black arrow indicates that the feature maps have the same channel number.

The last four branches construct the Pyramid Scene Encoder (PSE) to encode the correlations between multi-scale regions and locate the farthest region by considering these correlations. In detail, by utilizing adaptive average pooling, these four branches downsample Res 5 to four grid sizes (n × n, n ∈ {1, 2, 3, 6}). Let H_res and W_res denote the height and width of Res 5. The pooling strides in the vertical and horizontal directions are s_H = ⌊H_res/n⌋ and s_W = ⌊W_res/n⌋, and the pooling kernel sizes in the two directions are H_res − (n − 1) × s_H and W_res − (n − 1) × s_W, respectively. For clarity, the four paths corresponding to n = 1/2/3/6 are denoted as the 1-st, 2-nd, 3-rd, and 4-th paths. Since each pixel of the pooled feature is the average of the Res 5 feature inside a region of size (H_res/n) × (W_res/n), abnormally large values are smoothed by the surrounding pixels. Then, for the features of all paths, a convolution is used to combine the feature of each channel and reduce the channel number. Subsequently, the features of all paths are upsampled to a fixed size by UpProjection [47] and concatenated. Because of the average pooling, the features of different-scale regions encode the dominant features of their regions. Finally, a convolution is employed to learn the correlations between the multi-scale regions and further reduce the channel number. By considering the depth information of different-scale regions and their correlations, the PSE extracts the relative depth change between regions. As the visualizations of the last four paths in Fig. 3 show, the farthest region is in the upper-left bin, and the feature of the 4-th path provides the location information of the farthest region.

Eventually, all five branches of the DCE are concatenated, and a final convolution is employed to fuse the contextual correlations of depth and reduce the channel number. The effectiveness of the DCE is shown in Fig. 3. The response of the lower half of Res 5 is low, which is consistent with the ground truth. However, the left part of Res 5 has a high response, which leads to the wrong prediction of the farthest region (see the dark red left side of the baseline in Fig. 3) and is completely different from the true depth. By encoding the correlation between long-distance points and the features inside regions, the farthest region perceived by the DCE becomes consistent with the true depth.
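A minimal PyTorch sketch of the DCE under the description above follows. The 3 × 3 kernels of the dilated branches, the 1 × 1 channel-mixing convolutions, and the 512-channel intermediate width are assumptions (those numbers are not recoverable from the text); the pooled grid sizes {1, 2, 3, 6} follow the PSE ablation in Sec. 4, and bilinear upsampling stands in for the UpProjection step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthCorrelationEncoder(nn.Module):
    """Sketch of the DCE: three dilated branches, a pointwise branch, and the PSE."""

    def __init__(self, in_ch=2048, mid_ch=512, out_ch=512, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        # Dilated convolutions with rates 6/12/18; each is followed by a pointwise
        # convolution that mixes channels and suppresses gridding artifacts.
        self.dilated = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r), nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True))
            for r in (6, 12, 18))
        # Fourth branch: two consecutive pointwise convolutions (local cross-channel interaction).
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True))
        # Pyramid Scene Encoder: adaptive average pooling to 1x1 / 2x2 / 3x3 / 6x6 grids.
        self.pool_sizes = pool_sizes
        self.pse_reduce = nn.ModuleList(
            nn.Conv2d(in_ch, mid_ch // len(pool_sizes), 1) for _ in pool_sizes)
        self.pse_fuse = nn.Conv2d(mid_ch // len(pool_sizes) * len(pool_sizes), mid_ch, 1)
        # Final fusion over the five concatenated branches.
        self.fuse = nn.Conv2d(mid_ch * 5, out_ch, 1)

    def forward(self, res5):
        h, w = res5.shape[2:]
        feats = [branch(res5) for branch in self.dilated]
        feats.append(self.pointwise(res5))
        pyramids = []
        for n, reduce in zip(self.pool_sizes, self.pse_reduce):
            p = F.adaptive_avg_pool2d(res5, n)          # dominant feature of each n x n region
            p = F.interpolate(reduce(p), size=(h, w), mode="bilinear", align_corners=True)
            pyramids.append(p)
        feats.append(self.pse_fuse(torch.cat(pyramids, dim=1)))
        return self.fuse(torch.cat(feats, dim=1))
```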
3.2. Bottom-Up Boundary Fusion (BUBF)

To accurately recover the depth around the boundary, the Bottom-Up Boundary Fusion (BUBF) module is designed to extract the boundary (indicating sudden changes in depth) and to remove non-boundary pixels (indicating smooth changes in depth). As shown in Fig. 4, this module is based on the complementarity of low-level and high-level features: low-level features are rich in edge location information but lack semantic information, while high-level features are rich in semantic information, so the boundary of objects is encoded but inaccurately located. Since sudden changes in depth generally occur not inside an object area but near the boundary of objects, the proposed module delivers the low-level location information to the higher levels layer by layer, fusing the features of two adjacent levels to obtain an accurate boundary.

Figure 4: Illustration of the proposed BUBF module and the mean of channel maps (Y1, Y2, Y3, Y4) at four levels. The black arrow indicates that the feature maps have the same channel number.

Specifically, the whole module is composed of four steps. In the first step, all side-outputs, i.e., Res 2/3/4/5, are refined by a well-designed Refinement Block (RB). In more detail, the RB first uses a convolution to reduce the channel number to 64, which integrates the information across all channels; then, the following residual block refines the feature. In the second step, to align the features of multiple levels, the refined features of all levels are upsampled to the original resolution of the input image and further improved by the l-RB. This operation is the same as the UpProjection in [47]. In the third step, beginning from the lowest level (corresponding to Res 2), the cue of each level is concatenated with that of the adjacent higher level for later fusion. In the fourth step, the cues of two adjacent levels are fused by utilizing another RB. In the same way, the BUBF module iteratively combines the refined cues of two adjacent levels until the highest level (corresponding to Res 5) is reached.

Since the final loss function computes the gradient of depth (see Sec. 3.4), which indicates the depth boundary, the boundaries indicating depth changes are encoded by the extracted features. Therefore, the high-level features (such as Res 5) encode the boundary with depth changes but lack pixel-level location information. In contrast, the low-level features (such as Res 2) contain sufficient location information but fail to reveal the depth changes.
The BUBF passes the accurate location information to the high-level features from stage 2 to stage 5, making the depth boundary located more accurately. As shown in Fig. 4, compared to the initial feature at the lowest level (namely Y_1 in Fig. 4), the features combined from adjacent levels suppress the pixels with smooth depth change, as shown by Y_2, Y_3, and Y_4 in Fig. 4. Let the function R(·) denote the RB, the function up(·) denote the upsampling, and the function lR(·) denote the l-RB. X_i is the side-output of Res i, and Y_i is the fused feature map of each level. The first fused feature map is stated as Y_1 = up(R(X_1)), and the other fused features Y_i are formulated as:

Y_i = R( lR(up(R(X_i))) ⊕_C Y_{i−1} ),  i ∈ {2, 3, 4}    (1)

where ⊕_C denotes the concatenation operation. Y_4 is the output of the whole module.
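A minimal sketch of Eq. (1) is given below. RefinementBlock follows the RB description (a 1 × 1 channel reduction to 64 followed by a small residual block); bilinear upsampling to a common size stands in for the up(·)/l-RB step, and the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefinementBlock(nn.Module):
    """RB in Fig. 4: 1x1 conv to 64 channels, then a small residual block."""

    def __init__(self, in_ch, ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, ch, 1)
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        x = self.reduce(x)
        return F.relu(x + self.body(x))


class BottomUpBoundaryFusion(nn.Module):
    """Sketch of Eq. (1): Y_1 = up(R(X_1)), Y_i = R(up(R(X_i)) concatenated with Y_{i-1})."""

    def __init__(self, side_channels=(256, 512, 1024, 2048), ch=64):
        super().__init__()
        self.refine_in = nn.ModuleList(RefinementBlock(c, ch) for c in side_channels)
        # Each fusion RB takes the concatenation of two 64-channel cues (128 channels).
        self.refine_fuse = nn.ModuleList(RefinementBlock(2 * ch, ch) for _ in side_channels[1:])

    def forward(self, side_outputs, out_size):
        # side_outputs: features of Res2..Res5, lowest level first; out_size: (H, W).
        y = F.interpolate(self.refine_in[0](side_outputs[0]), size=out_size,
                          mode="bilinear", align_corners=True)
        for x, r_in, r_fuse in zip(side_outputs[1:], self.refine_in[1:], self.refine_fuse):
            cue = F.interpolate(r_in(x), size=out_size, mode="bilinear", align_corners=True)
            y = r_fuse(torch.cat([cue, y], dim=1))      # fuse two adjacent levels
        return y                                        # Y_4: the boundary cue passed to the SRM
```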
3.3. Stripe Refinement Module (SRM)

The previous method [11] exploits three small convolutions to predict the final depth map from the decoder. However, the small-kernel convolution causes two problems because of its limited receptive field:
• It only aggregates the local feature at each pixel, making local confusion in depth prediction inevitable.
• It fails to take full advantage of the boundary and the global contextual features.
To address these issues, a Stripe Refinement Module (SRM) is proposed to refine the depth across the boundary by utilizing stripe convolutions. Specifically, after the channel-wise concatenation of the decoder output and the BUBF output, the obtained features are taken as the input of the regression module, as illustrated in Fig. 5. In the first step, two stripe convolutions (one horizontal and one vertical) are utilized to aggregate the pixels near the boundary over a wide range in both the vertical and horizontal directions. Since the global contexts along orthogonal directions make significant contributions to indicating relative depth, the depth change between an object and its background is better recognized. Secondly, a convolution is employed to fuse the features extracted by the two stripe convolutions. Thirdly, three convolutions are employed to refine the final depth map. In particular, to predict the depth map more accurately, the fused feature before these convolutions is delivered to the last convolution by a skip connection.

Figure 5: The structure of our Stripe Refinement Module.

Differences with whole strip masking (WSM) [48] and vertical pooling [49]: Intuitively, the structure of the SRM looks like the whole strip masking [48] and vertical pooling [49]. However, the SRM is substantially different from them. Firstly, we use two stripe convolutions with fixed kernel sizes, instead of 1 × W and H × 1 convolutions or an H × 1 pooling operation, where W and H denote the width and height of the input feature. Secondly, the element-wise sum is employed to fuse the features obtained by the two stripe convolutions. Thirdly, the stripe convolutions are employed to refine the depth near the boundary acquired by the BUBF, instead of exploiting the trend of the scene [48] or vertically aggregating image features [49].

3.4. Loss Function

To train the network, a ground-truth depth map in the training data is denoted as G and its corresponding prediction is denoted as P. Each pixel of the ground-truth depth map is denoted as g_i ∈ G, and p_i ∈ P for the prediction. Following [11], the loss function of our network is composed of three terms, i.e., the pixel-wise depth difference l_depth, the gradient difference l_grad, and the surface normal difference l_normal. Let ∇_x(·) and ∇_y(·) denote the spatial gradients of a pixel in the x and y directions; the surface normals of a ground-truth depth map and its predicted depth map are denoted as n_{p_i} = [−∇_x(p_i), −∇_y(p_i), 1]^T and n_{g_i} = [−∇_x(g_i), −∇_y(g_i), 1]^T. The three loss terms can be formulated as follows:

• l_depth = (1/N_p) Σ_{i=1}^{N_p} ln(‖p_i − g_i‖ + α)
• l_grad = (1/N_p) Σ_{i=1}^{N_p} ln( ∇_x(‖p_i − g_i‖ + α) + ∇_y(‖p_i − g_i‖ + α) )
• l_normal = (1/N_p) Σ_{i=1}^{N_p} ( 1 − ⟨n_{g_i}, n_{p_i}⟩ / ( √⟨n_{g_i}, n_{g_i}⟩ √⟨n_{p_i}, n_{p_i}⟩ ) )

where ⟨·,·⟩ denotes the inner product, N_p denotes the number of pixels in an image, and α is a hyper-parameter. For the overall loss of the BS-Net, the weights of these three loss terms are equal:

l_overall = l_depth + l_normal + l_grad    (2)

where l_overall is the overall loss function. Since the pixels around the boundary have large depth gradients, the gradient difference l_grad naturally guides the BUBF module to learn the boundary.
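A sketch of the three loss terms under the definitions above is given below, using simple forward differences for the spatial gradients. The default value of alpha is a placeholder (the value used in the paper is not recoverable from the text), and the gradient term is implemented with absolute differences inside the logarithm so that it stays defined; treat this as an interpretation rather than the exact released loss.

```python
import torch
import torch.nn.functional as F


def spatial_gradients(x):
    """Forward-difference gradients in x and y for maps shaped (B, 1, H, W), zero-padded at the border."""
    dx = F.pad(x[..., :, 1:] - x[..., :, :-1], (0, 1, 0, 0))
    dy = F.pad(x[..., 1:, :] - x[..., :-1, :], (0, 0, 0, 1))
    return dx, dy


def bsnet_loss(pred, gt, alpha=0.5):
    """l_depth + l_grad + l_normal as described in Sec. 3.4 (alpha default is a placeholder)."""
    diff = torch.abs(pred - gt)
    l_depth = torch.log(diff + alpha).mean()

    gx, gy = spatial_gradients(diff)
    l_grad = torch.log(torch.abs(gx) + torch.abs(gy) + alpha).mean()

    # Surface normals n = [-dZ/dx, -dZ/dy, 1]; penalise one minus the cosine similarity.
    px, py = spatial_gradients(pred)
    tx, ty = spatial_gradients(gt)
    ones = torch.ones_like(pred)
    n_p = torch.stack([-px, -py, ones], dim=1)
    n_g = torch.stack([-tx, -ty, ones], dim=1)
    l_normal = (1.0 - F.cosine_similarity(n_p, n_g, dim=1)).mean()

    return l_depth + l_grad + l_normal
```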
4. Experiment
In this section, we evaluate the proposed method on three challenging datasets. Firstly, the NYUD v2 dataset [16] is employed to evaluate the accuracy of the predicted depth and its boundaries. Secondly, the iBims-1 dataset [50] is employed to evaluate the quality of depth boundaries and other relevant metrics. Thirdly, the SUN-RGBD dataset [17] is employed to evaluate the generalization of our method.
Our network is implemented with the PyTorch framework on NVIDIA 1080Ti GPUs. ResNet-50 is adopted as the backbone network and initialized with the model pre-trained on ILSVRC [51]; the classification layers are removed. To preserve the size of the feature maps, dilated convolutions are employed in the last two stages of the backbone. The other parameters of the decoder, DCE, BUBF, and SRM are randomly initialized. We train our model for 20 epochs with a batch size of 8. The Adam optimizer is adopted, weight decay is applied, and the initial learning rate is set to 0.0001 and reduced by 10% every 5 epochs.

Following previous work [10, 11], we use 654 RGB-D pairs for testing and 50k pairs for training. Data augmentation is performed on the training images in the same way as [11]. To train the model, all images and labels are downsampled from the original resolution using bilinear interpolation and then cropped around the central part; the cropped labels are further downsampled to align with the network output. During testing, the output of the network is upsampled back to the original resolution for evaluation. Note that we do not clip the predicted depth maps to a fixed range.
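Under the stated schedule (20 epochs, batch size 8, Adam, initial learning rate 1e-4 reduced by 10% every 5 epochs), the optimizer setup might look as follows. "Reduced by 10%" is interpreted as multiplying the learning rate by 0.9, and the weight-decay and momentum values are left at library defaults because they are not recoverable from the text; model, train_loader, and bsnet_loss are assumed to exist.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR


def build_optimizer(model):
    # Initial learning rate 1e-4, multiplied by 0.9 every 5 epochs over 20 epochs in total.
    optimizer = Adam(model.parameters(), lr=1e-4)
    scheduler = StepLR(optimizer, step_size=5, gamma=0.9)
    return optimizer, scheduler


# Typical training loop (sketch):
# optimizer, scheduler = build_optimizer(model)
# for epoch in range(20):
#     for rgb, depth in train_loader:      # batch size 8
#         optimizer.zero_grad()
#         loss = bsnet_loss(model(rgb), depth)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()
```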
Three kinds of metrics are employed to thoroughly evaluate the proposed method.

Depth accuracy: Let N denote the total number of pixels in the test set, d̄_i denote pixel i of the predicted depth, and d*_i denote pixel i of the true depth. Following [26, 11, 52], four metrics indicating pixel-wise accuracy are employed: (1) mean absolute relative error (REL): (1/N) Σ_i |d̄_i − d*_i| / d*_i; (2) mean log10 error (log10): (1/N) Σ_i ‖log10 d̄_i − log10 d*_i‖; (3) root mean squared error (RMS): sqrt( (1/N) Σ_i (d̄_i − d*_i)² ); (4) accuracy under a threshold t_d: the percentage of pixels with max(d*_i / d̄_i, d̄_i / d*_i) = δ < t_d, with t_d ∈ {1.25, 1.25², 1.25³}.
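The four depth-accuracy metrics can be computed as sketched below; the valid-pixel mask (gt > 0) and the conventional thresholds 1.25, 1.25², and 1.25³ are assumptions.

```python
import torch


def depth_metrics(pred, gt):
    """Pixel-wise REL, log10, RMS and delta accuracies over valid pixels (gt > 0)."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    rel = (torch.abs(p - g) / g).mean()
    log10 = torch.abs(torch.log10(p) - torch.log10(g)).mean()
    rms = torch.sqrt(((p - g) ** 2).mean())
    ratio = torch.max(p / g, g / p)
    delta = [(ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)]
    return {"REL": rel, "log10": log10, "RMS": rms,
            "d1": delta[0], "d2": delta[1], "d3": delta[2]}
```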
Boundary accuracy in predicted depth: Following [11], we measure the boundary accuracy of the predicted depth. The Sobel operator is used to recover boundaries from the predicted depth map P and the true depth map G, obtaining P_sobel and G_sobel. Pixels whose response is larger than a threshold t_e are regarded as boundary pixels; tp is the number of correct boundary pixels, and fp and fn correspond to the false positives and false negatives. The precision P = tp / (tp + fp), the recall R = tp / (tp + fn), and the F1 score 2 × P × R / (P + R) are used to evaluate the performance.
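A sketch of this boundary metric: Sobel responses are thresholded on both maps and precision/recall/F1 are computed from the resulting masks. The small epsilon guarding empty denominators is an addition for numerical safety.

```python
import torch
import torch.nn.functional as F


def sobel_magnitude(depth):
    """Gradient magnitude of a depth map shaped (B, 1, H, W) using 3x3 Sobel kernels."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3).to(depth)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(depth, kx, padding=1)
    gy = F.conv2d(depth, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2)


def boundary_prf(pred, gt, threshold):
    """Precision / recall / F1 of boundary pixels obtained by thresholding Sobel responses."""
    pred_edge = sobel_magnitude(pred) > threshold
    gt_edge = sobel_magnitude(gt) > threshold
    tp = (pred_edge & gt_edge).sum().float()
    fp = (pred_edge & ~gt_edge).sum().float()
    fn = (~pred_edge & gt_edge).sum().float()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return precision, recall, f1
```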
Depth boundary error and other relevant metrics: Several further metrics are introduced in [50] and are used to evaluate our algorithm more comprehensively. In more detail, the depth boundary errors (DBEs) measure the completeness and accuracy of the boundaries in the predicted dense depth map. In addition, we employ two other depth error metrics from [50]: the planarity error (PE), which measures the depth accuracy in 3D space, and the directed depth errors (DDEs), which measure the proportions of too-far and too-close predicted depth pixels.
Normalized distance error of the farthest region: To evaluate the farthest region in the predicted depth, another metric is introduced. The maps P and G are partitioned into m × m rectangular regions of the same size, and the mean depth inside each region is calculated. In map P, the region with the largest mean depth is located as P_max = (u_1, v_1), and G_max = (u_2, v_2) corresponds to that of map G, where u_i, v_i ∈ N+ and 0 < u_i, v_i ≤ m. The normalized distance error of the farthest region is stated as the distance between P_max and G_max:

E = (1 / N_test) Σ_{n=1}^{N_test} (1 / (m√2)) ‖P^n_max − G^n_max‖    (3)

where N_test is the number of test images, n indicates the n-th depth map, and m√2 is used to normalize the error distance.
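A sketch of Eq. (3) for a single image follows, approximating the equal-size partition with adaptive average pooling and assuming the normalizer is m√2 (the diagonal of the m × m grid).

```python
import math
import torch
import torch.nn.functional as F


def farthest_region_error(pred, gt, m=12):
    """Normalized distance between the farthest m x m cells of prediction and ground truth (Eq. 3).

    pred and gt are depth maps shaped (1, 1, H, W).
    """
    def farthest_cell(depth):
        means = F.adaptive_avg_pool2d(depth, m)       # mean depth of each of the m x m regions
        idx = torch.argmax(means.flatten()).item()
        return torch.tensor([idx // m, idx % m], dtype=torch.float32)   # (row, col)

    p_max, g_max = farthest_cell(pred), farthest_cell(gt)
    return torch.norm(p_max - g_max).item() / (m * math.sqrt(2))
```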
The NYUD v2 dataset [16] includes 120K pairs of RGB images and depth maps captured by a Microsoft Kinect, and is split into a training set (249 scenes) and a test set (215 scenes). The images in this dataset have a resolution of 640 × 480. The labeled data contains 1449 RGB-D pairs, which are split into two parts: 795 pairs for training and 654 pairs for testing. We first conduct a quantitative and qualitative comparison with several state-of-the-art methods on this dataset. Secondly, we conduct an in-depth analysis of our method through sufficient ablation studies, evaluating all metrics for depth prediction and scene edges on depth maps.

Table 1 illustrates the comparisons between our proposed method and recent approaches with a ResNet-50 backbone. The first three metrics (i.e., accuracy under threshold t_d, t_d ∈ {1.25, 1.25², 1.25³}) measure the pixel-wise accuracy of the predicted depth map. Note that SARPN [10] employs ResNet-50 as the backbone here, which is different from the official implementation (SARPN originally employs SENet as the backbone; in order to conduct the comparison objectively, we modify its backbone to ResNet-50, the same as our method, so the quantitative values of SARPN are not exactly equal to those of the original paper). In addition, for the other methods providing source code, the quantitative values are obtained by running their source code. Since our platform may be slightly different from those of the original papers, there may be small differences in performance; however, the differences are very small. As a result, our method achieves the second-best performance. Another method [15] employs an additional occlusion boundary label, which is the reason why it achieves a higher accuracy than ours. In addition, our method achieves a better accuracy than the other methods [21, 22, 54, 47, 11, 56, 55], because the proposed network combines the outputs of the PSE and MSL with various kernel sizes and dilation rates, which effectively preserves the contextual information of the global depth layout. Thus, our method is even competitive with some methods with a heavier backbone.

Method | Backbone | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ | REL ↓ | RMS ↓ | log10 ↓
Saxena et al. [19] | – | 0.447 | 0.745 | 0.897 | 0.349 | 1.214 | –
Karsch et al. [53] | – | – | – | – | 0.35 | 1.20 | 0.131
Liu et al. [20] | – | – | – | – | 0.335 | 1.06 | 0.127
Xu et al. [36] | ResNet-50 | 0.811 | 0.954 | 0.987 | 0.121 | 0.586 | 0.052
Eigen et al. [21] | – | 0.611 | 0.887 | 0.971 | 0.215 | 0.907 | –
Eigen et al. [22] | VGG | 0.769 | 0.950 | 0.988 | 0.158 | 0.641 | –
Dharmasiri et al. [54] | VGG | 0.776 | 0.953 | 0.989 | 0.156 | 0.624 | –
Laina et al. [47] | ResNet-50 | 0.811 | 0.953 | 0.988 | 0.127 | 0.573 | 0.055
Lee et al. [55] | ResNet-152 | 0.815 | 0.963 | 0.991 | 0.139 | 0.572 | –
Hu et al. [11] | ResNet-50 | 0.843 | 0.968 | 0.991 | 0.126 | 0.555 | 0.054
Xu et al. [56] | ResNet-50 | 0.817 | 0.954 | 0.987 | 0.120 | 0.582 | 0.055
Fu et al. [26] | ResNet-101 | 0.828 | 0.965 | 0.992 | 0.115 | 0.509 | 0.051
Lee et al. [13] | DenseNet-162 | 0.837 | 0.971 | 0.994 | – | 0.538 | –
Chen et al. * [10] | ResNet-50 | 0.853 | 0.959 | 0.991 | 0.121 | 0.545 | 0.052
Ramamonjisoa et al. [15] | ResNet-50 | – | – | – | – | – | –
Yin et al. [57] | ResNeXt-101 | 0.875 | 0.976 | 0.994 | – | – | –
Swami et al. [58] | SENet-154 | 0.870 | 0.974 | 0.993 | 0.115 | 0.528 | 0.049
Ours | ResNet-50 | 0.846 | 0.969 | 0.992 | 0.123 | 0.550 | 0.053

Table 1: Depth accuracy and error of different methods on the NYUD v2 dataset. The bold type indicates the best performance. The red number indicates the best performance with the same backbone of ResNet-50. * Using a ResNet-50 backbone, which is different from the original paper.

Method | Backbone | Precision | Recall | F1-score
(t_e > 0.25)
Eigen et al. [21] | – | 0.346 | 0.322 | 0.323
Eigen et al. [22] | VGG | 0.544 | 0.481 | 0.500
Dharmasiri et al. [54] | VGG | 0.577 | – | –
Laina et al. [47] | ResNet-50 | 0.489 | 0.435 | 0.454
Xu et al. [36] | ResNet-50 | 0.516 | 0.400 | 0.436
Fu et al. [26] | ResNet-101 | 0.320 | 0.583 | 0.402
Hu et al. [11] | ResNet-50 | 0.635 | 0.480 | 0.540
Lee et al. [13] | DenseNet-162 | 0.475 | 0.354 | 0.390
Chen et al. * [10] | ResNet-50 | – | – | –
Ramamonjisoa et al. [15] | ResNet-50 | 0.416 | 0.353 | 0.374
Yin et al. [57] | ResNeXt-101 | 0.523 | 0.459 | 0.480
Ours | ResNet-50 | 0.644 | 0.483 | 0.546
(t_e > 0.5)
Eigen et al. [21] | – | 0.443 | 0.278 | 0.327
Eigen et al. [22] | VGG | 0.587 | 0.456 | 0.501
Dharmasiri et al. [54] | VGG | 0.531 | – | –
Laina et al. [47] | ResNet-50 | 0.536 | 0.422 | 0.463
Xu et al. [36] | ResNet-50 | 0.600 | 0.366 | 0.439
Fu et al. [26] | ResNet-101 | 0.316 | 0.473 | 0.412
Hu et al. [11] | ResNet-50 | 0.664 | 0.476 | 0.547
Lee et al. [13] | DenseNet-162 | 0.648 | 0.331 | 0.424
Chen et al. * [10] | ResNet-50 | – | – | –
Ramamonjisoa et al. [15] | ResNet-50 | 0.598 | 0.338 | 0.419
Yin et al. [57] | ResNeXt-101 | 0.605 | 0.457 | 0.510
Ours | ResNet-50 | 0.665 | 0.492 | 0.558
(t_e > 1)
Eigen et al. [21] | – | 0.730 | 0.347 | 0.456
Eigen et al. [22] | VGG | 0.733 | 0.488 | 0.574
Dharmasiri et al. [54] | VGG | 0.617 | 0.489 | 0.533
Laina et al. [47] | ResNet-50 | 0.670 | 0.479 | 0.548
Xu et al. [36] | ResNet-50 | 0.794 | 0.407 | 0.525
Fu et al. [26] | ResNet-101 | 0.483 | 0.512 | 0.485
Hu et al. [11] | ResNet-50 | 0.755 | 0.514 | 0.604
Lee et al. [13] | DenseNet-162 | – | – | –
Chen et al. * [10] | ResNet-50 | 0.763 | 0.526 | –
Ramamonjisoa et al. [15] | ResNet-50 | 0.797 | 0.404 | 0.524
Yin et al. [57] | ResNeXt-101 | 0.740 | 0.502 | 0.589
Ours | ResNet-50 | 0.750 | – | –

Table 2: Accuracy of predicted boundary pixels in depth maps under different thresholds. The bold type indicates the best performance. * Using a ResNet-50 backbone, which is different from the original paper.
Method and Variants | Backbone | m = 6 | m = 12 | m = 24
Eigen et al. [21] | – | 0.1692 | 0.1827 | 0.1871
Fergus et al. [22] | VGG | 0.1181 | 0.1354 | 0.1460
Dharmasiri et al. [54] | VGG | 0.1573 | 0.1725 | 0.1796
Laina et al. [47] | ResNet-50 | 0.1162 | 0.1400 | 0.1573
Fu et al. [26] | ResNet-101 | 0.1090 | 0.1254 | 0.1281
Hu et al. [11] | ResNet-50 | 0.1123 | 0.1308 | 0.1428
Lee et al. [13] | DenseNet-162 | 0.1214 | 0.1408 | 0.1496
Chen et al. * [10] | ResNet-50 | 0.1029 | 0.1225 | 0.1350
Ramamonjisoa et al. [15] | ResNet-50 | 0.1162 | 0.1352 | 0.1475
Yin et al. [57] | ResNeXt-101 | – | – | –
Baseline | ResNet-50 | 0.1113 | 0.1338 | 0.1427
Baseline + DCE | ResNet-50 | 0.1122 | 0.1314 | 0.1402
Baseline + BUBF + SRM | ResNet-50 | 0.1085 | 0.1279 | 0.1373
Ours | ResNet-50 | 0.1061 | 0.1263 | 0.1349

Table 3: Farthest-region error under different partition ratios of different methods on the NYUD v2 dataset. The bold type indicates the best performance. The red number indicates the best performance with the same backbone of ResNet-50. * Using a ResNet-50 backbone, which is different from the original paper.
Figure 6 shows the qualitative results on the NYUD v2 dataset. The first to the twelfth rows show the original RGB images, the ground truth, the depth maps predicted by [22, 47, 26, 15, 13, 10, 57, 11], our baseline, and the proposed method, respectively. The depth maps are visualized with different colors corresponding to different depth values, i.e., dark blue corresponds to the minimum depth and dark red corresponds to the maximum depth. The white rectangular boxes mark the farthest region of the depth maps.

In detail, the first column shows an office desk and a chair with a clear depth. Due to the numerous objects stacked on the desk, it is hard to finely recover the depth changes between these objects. The other methods predict the wrong farthest region and blurred depth changes, e.g., on the monitor on the desk. In contrast, our method fully predicts the real depth of each object without any blur and correctly recovers the farthest region, which reveals the effectiveness of the boundary cues and of extracting global information. The scenes in the second to fourth columns show a similar depth over a large area, e.g., the wall in these scenes. Due to the lack of global context, the other methods fail to correctly find the farthest region in these scenes and thus suffer from the distortion of the 3D scene structure, while our method works admirably in predicting depth in these scenes. Intuitively, compared to the others, our method avoids the grid artifacts on the wall and completely recovers the shape details of the objects, especially the wooden chair in the fourth column. The reason is that the boundary cue locates the depth change better than other edge cues, and that the global context is necessary. In the fifth and sixth columns, due to the mirror in the scene, it is hard to predict the depth change between the areas inside and outside the mirror. Most methods fail to predict the depth level between the two areas, causing a distorted 3D scene structure. By utilizing the boundary cue, our method fully predicts the depth level of the scene. In the last two columns, since the depths of the wall facing the camera are extremely similar, it is very difficult to accurately predict its depth. It can be seen that almost all other methods fail to predict the farthest region. By utilizing the global context of the scene and the boundary cue, our method correctly predicts the depth.

Figure 6: Example depth maps predicted by several methods and our proposed method. 1st row: input images; 2nd row: ground truth; remaining rows: predicted depth maps of Eigen et al. [22], Laina et al. [47], Fu et al. [26], Ramamonjisoa et al. [15], Lee et al. [13], Chen et al. * [10], Yin et al. [57], Hu et al. [11], our baseline, and the proposed method. The white boxes mark the predicted farthest region (m = 12). * Using a ResNet-50 backbone, which is different from the original paper.

Variants | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ | RMS ↓ | REL ↓ | log10 ↓
Baseline | 0.840 | 0.966 | 0.991 | 0.557 | 0.128 | 0.055
+DCE | – | – | – | – | – | –

Table 4: The prediction results on the NYUD v2 dataset. The bold type indicates the best performance.

Figure 7: Example predicted depth and visualized boundary of several variants. Following the boundary metric, the Sobel operator is used to recover boundaries from the depth maps, and pixels larger than 1 are regarded as boundary pixels. Green indicates true-positive pixels, red indicates false-positive pixels, and blue indicates false-negative pixels.

Table 5: Accuracy of predicted boundary pixels in depth maps under different thresholds. The bold type indicates the best performance.
To clarify the contribution of each proposed module, the baseline is defined as the combination of our encoder and decoder, and several ablation studies are conducted to verify the improvement.
Role of the global context and boundary cue:
Two issues are assumed in this paper: (1) global context effectively improves the prediction of the depth layout; (2) the boundary cue boosts the prediction of the depth gradient around object boundaries. To verify them, three variants are constructed for evaluation, as shown in Table 4 and Table 5. Intuitively, compared to the baseline, since the DCE extracts the global contextual information to predict the farthest region of the scene, the variant Baseline+DCE achieves better performance on all evaluation metrics. By combining the baseline, the BUBF, and the SRM, the variant Baseline+BUBF+SRM locates the edges with depth change more precisely and fits the depth near the contours more precisely. Note that, in Table 5, the DCE seems to play a more significant role than BUBF + SRM in depth boundary recovery when the threshold is larger than 1. Since the boundary is not a local visual cue, the network needs sufficient contextual cues to suppress boundaries that are visually significant but have no depth change. This enables Baseline+DCE to eliminate many false-positive boundary pixels and even outperform the variant Baseline+BUBF+SRM in depth boundary prediction. Furthermore, the whole network with all modules improves the accuracy of the pixel-wise depth while ensuring sharp object details. The intrinsic reason is that the DCE and BUBF play important and different roles in depth prediction: on the one hand, the DCE aggregates the global information with a larger receptive field, boosting the accuracy of the global depth layout prediction; on the other hand, the combination of BUBF and SRM extracts the boundary cue and then refines the depth change around the boundary.

Fig. 7 illustrates the visualized results of several variants. Compared to the baseline, the other two variants generate fewer false-positive boundary pixels (marked in red), such as on the surface of the desk in the first row, the small objects on the desk in the second row, and the chair boundaries in the third row. Among them, the variant Baseline+DCE+BUBF+SRM performs the best. The reason is that BUBF and SRM improve the accuracy of the overall boundary. In addition, the two variants both achieve a lower distance error of the farthest region than the baseline.
Contribution of the Pyramid Scene Encoder:
The PSE applies average pooling with various kernel sizes and then integrates scene features from different sub-regions. The global contextual information of the sub-regions helps to estimate the depth layout effectively. Table 6 reports the experimental results with different settings of the pooling grid sizes in the PSE: we first remove all pooling layers and then gradually add them back. These experiments show that the setting in which the sizes of the feature maps after pooling are 1, 2, 3, and 6 achieves the best performance. This illustrates that, different from the multi-scale dilated convolutions, the PSE aggregates the global information that is significant for predicting the global depth layout.

Pooling i × i | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ | RMS ↓ | REL ↓ | log10 ↓
None | 0.844 | 0.967 | 0.991 | 0.553 | 0.125 | 0.053
i ∈ {1} | – | – | – | – | – | –
i ∈ {1, 2} | – | – | – | – | – | –
i ∈ {1, 2, 3} | – | – | – | – | – | –
i ∈ {1, 2, 3, 6} | – | – | – | – | – | –

Table 6: The experimental results of the Pyramid Scene Encoder with different pooling rates. The bold type indicates the best performance.

Fig. 8 illustrates the visualization results of the PSE variants. It can be seen that the predicted depth gets closer to the real depth as more pooling layers are added. In the first row, as more pooling layers are used, the network gradually predicts the farthest region of the scene. In the second row, when more pooling layers are used, the farthest region in the mirror is located more accurately. In the third row, it is difficult to determine which area on either side of the pillar is farther; as more pooling layers are used, the farthest region is accurately predicted. In the fourth row, the hollow in the wall can easily be regarded as coplanar with the wall; when the 6 × 6 pooling layer is employed, the farthest region in this hollow is accurately predicted.

Figure 8: Example predicted depth of several variants with different pooling rates. The white boxes indicate the farthest regions.
Learning method for boundary cue: Previous methods apply the multi-level information from the encoder to improve the edge accuracy, such as multi-scale feature fusion (MFF) [10, 11]. Different from these methods, the proposed BUBF explicitly learns the boundary cue from the scene, i.e., the locations with a sudden change in depth. To evaluate the effectiveness of learning the boundary, the MFF is employed for comparison with the proposed BUBF. As shown in Table 7, our BUBF module better locates the object contours and fine-tunes the depth near the contours. The reason is that the MFF introduces a lot of useless noisy cues into depth prediction, a problem that is addressed by our BUBF.

Table 7: Accuracy of predicted boundary pixels in depth maps under different thresholds. The bold type indicates the best performance.

Furthermore, Fig. 9 visualizes the results of our model with MFF and with BUBF. In the first row, there are a large number of boundary pixels on the window and the plant; our method not only obtains fewer false-positive boundary pixels (marked in red) but also obtains more true-positive boundary pixels (marked in green). In the second row, there is a large carpet with a varied appearance; by directly introducing low-level cues, MFF leads to a large number of false-positive boundary pixels, while the proposed BUBF alleviates this issue. In the third row, MFF locates the boundary of the desk inaccurately, but BUBF locates these boundary pixels better.

Figure 9: Example predicted depth and visualized boundary of several variants. Following the boundary metric, the Sobel operator is used to recover boundaries from the depth maps, and pixels larger than 1 are regarded as boundary pixels. Green indicates true-positive pixels, red indicates false-positive pixels, and blue indicates false-negative pixels.

iBims-1 is a new RGB-D dataset that contains pairs of a high-quality depth map and a high-resolution image. These pairs are acquired by a digital single-lens reflex (DSLR) camera and a high-precision laser scanner. Thus, compared with the NYUD v2 dataset, iBims-1 achieves a very low noise level, sharp depth transitions, no occlusions, and high depth ranges. This dataset contains 100 pairs for evaluation and is also useful to evaluate the generalization of our method. Note that, since the iBims-1 dataset lacks a training set, all models tested on the iBims-1 dataset are trained on the NYUD v2 dataset.

Method | Backbone | δ<1.25 ↑ | δ<1.25² ↑ | δ<1.25³ ↑ | REL ↓ | RMS ↓ | log10 ↓
Eigen et al. [21] | – | 0.36 | 0.65 | 0.84 | 0.32 | 1.55 | 0.17
Eigen et al. [22] | AlexNet | 0.40 | 0.73 | 0.88 | 0.30 | 1.38 | 0.15
Eigen et al. [22] | VGG | 0.47 | 0.78 | 0.93 | 0.25 | 1.26 | 0.13
Dharmasiri et al. [54] | VGG | 0.22 | 0.55 | 0.78 | 0.35 | 1.61 | 0.19
Laina et al. [47] | ResNet-50 | 0.50 | 0.78 | 0.91 | 0.26 | 1.20 | 0.13
Hu et al. [11] | ResNet-50 | 0.52 | 0.85 | – | – | – | –
Fu et al. [26] | ResNet-101 | 0.55 | 0.81 | 0.92 | 0.24 | 1.13 | 0.12
Lee et al. [13] | DenseNet-162 | 0.53 | 0.83 | 0.95 | 0.23 | 1.09 | 0.11
Chen et al. * [10] | ResNet-50 | 0.51 | 0.84 | 0.94 | 0.22 | 1.14 | 0.11
Liu et al. [59] | – | 0.48 | 0.78 | 0.91 | 0.30 | 1.26 | 0.13
Li et al. [60] | VGG | 0.58 | 0.85 | 0.94 | 0.22 | 1.09 | 0.11
Liu et al. [61] | – | 0.41 | 0.70 | 0.86 | 0.29 | 1.45 | 0.17
Ramamonjisoa et al. [15] | ResNet-50 | 0.59 | 0.84 | 0.94 | 0.26 | 1.07 | 0.11
Yin et al. [57] | ResNeXt-101 | 0.54 | 0.84 | 0.93 | 0.24 | 1.06 | 0.11
Swami et al. [58] | SENet-154 | – | – | – | – | – | –
Ours | ResNet-50 | 0.51 | 0.82 | 0.93 | 0.24 | 1.19 | 0.12

Table 8: Conventional depth error and accuracy on the iBims-1 dataset. The bold type indicates the best performance. The red number indicates the best performance with the same backbone of ResNet-50. * Using a ResNet-50 backbone, which is different from the original paper.

Method | Backbone | ε_plan ↓ | ε_orie ↓ | ε_acc ↓ | ε_comp ↓ | ε_0 ↑ | ε_+ ↓ | ε_− ↓
(PE: ε_plan in cm, ε_orie in °; DBE: ε_acc and ε_comp in px; DDE: ε_0, ε_+, ε_− in %)
Eigen et al. [21] | – | 7.70 | 24.91 | 9.97 | 9.99 | 70.37 | 27.42 | 2.22
Eigen et al. [22] | AlexNet | 7.52 | 21.50 | 4.66 | 8.68 | 77.48 | 18.93 | 3.59
Eigen et al. [22] | VGG | 5.97 | 17.65 | 4.05 | 8.01 | 79.88 | 18.72 | 1.41
Dharmasiri et al. [54] | VGG | 6.97 | 28.56 | 5.07 | 7.83 | 70.10 | 29.46 | 0.43
Laina et al. [47] | ResNet-50 | 6.46 | 19.13 | 6.19 | 9.17 | 81.02 | 17.01 | 1.97
Hu et al. [11] | ResNet-50 | 3.88 | 28.06 | 2.36 | 5.40 | 82.20 | 16.10 | 1.69
Fu et al. [26] | ResNet-101 | 10.50 | 23.83 | 4.07 | – | 82.78 | – | –
Lee et al. [13] | DenseNet-162 | – | – | – | – | – | – | –
Chen et al. * [10] | ResNet-50 | 3.45 | 43.44 | 2.98 | – | – | – | –
Liu et al. [59] | – | 8.45 | 28.69 | 2.42 | 7.11 | 79.70 | 14.16 | 6.14
Li et al. [60] | VGG | 7.82 | 22.20 | 3.90 | 8.17 | 83.71 | 13.20 | 3.09
Liu et al. [61] | – | 7.26 | 17.24 | 4.84 | 8.86 | 71.24 | 28.36 | –
Ramamonjisoa et al. [15] | ResNet-50 | 9.95 | 25.67 | 3.52 | 7.61 | 84.03 | – | –
Yin et al. [57] | ResNeXt-101 | 5.73 | 16.91 | 3.65 | 7.16 | 82.72 | 13.91 | 3.36
Swami et al. [58] | SENet-154 | 6.67 | – | – | – | – | – | –
Ours | ResNet-50 | 3.98 | 28.75 | 2.25 | 5.18 | 80.54 | 17.64 | 1.80

Table 9: Planarity error, depth boundary errors, and directed depth error on the iBims-1 dataset. The bold type indicates the best performance. The red number indicates the best performance with the same backbone of ResNet-50. * Using a ResNet-50 backbone, which is different from the original paper.

Method and Variants | Backbone | m = 6 | m = 12 | m = 24
Dharmasiri et al. [54] | VGG | 0.1927 | 0.2011 | 0.2128
Ramamonjisoa et al. [15] | ResNet-50 | 0.2020 | 0.2217 | 0.2400
Lee et al. [13] | DenseNet-162 | 0.1819 | 0.1961 | 0.2152
Laina et al. [47] | ResNet-50 | – | – | –
Hu et al. [11] | ResNet-50 | 0.1804 | 0.2075 | 0.2147
Chen et al. * [10] | ResNet-50 | 0.1693 | 0.1895 | 0.2021
Yin et al. [57] | ResNeXt-101 | 0.2022 | 0.2086 | 0.2343
Ours | ResNet-50 | 0.1724 | – | –

Table 10: Distance error of the farthest region under different partition rates on the iBims-1 dataset. The bold type indicates the best performance. The red number indicates the best performance with the same backbone of ResNet-50. * Using a ResNet-50 backbone, which is different from the original paper.
Table 8 shows the performance on the commonly used error metrics on the iBims-1 dataset. For the smaller δ thresholds, although our method does not focus on the overall depth accuracy and error, it is still comparable with the previous methods. Table 9 illustrates the new metrics of different methods on the iBims-1 dataset. It can be seen that our method achieves a low 3D planarity error. The reason is that our method mainly focuses on the accurate estimation of the farthest region to improve the depth. Noteworthily, our method outperforms most methods on the depth boundary errors, even the method trained with an extra boundary label [15] and some methods with a heavier backbone, which proves the effectiveness of our method in improving the boundary prediction. Table 10 illustrates the farthest-region distance error of several methods on the iBims-1 dataset. The method of [57], which performs well on the NYUD v2 dataset, has a high error on the iBims-1 dataset. Our method not only performs well on the NYUD v2 dataset but also outperforms the other methods on the iBims-1 dataset, achieving the lowest error when m = 12 and m = 24, which indicates that our method can accurately predict the farthest region. Moreover, it also proves the generalization of our method.

Figure 10: The visualization results on the iBims-1 dataset. 1st row: input images; 2nd row: ground truth; remaining rows: predicted depth maps of Dharmasiri et al. [54], Laina et al. [47], Lee et al. [13], Hu et al. [11], Chen et al. * [10], Ramamonjisoa et al. [15], Yin et al. [57], and the proposed method. The white rectangular boxes mark the farthest region of the corresponding depth maps. * Using a ResNet-50 backbone, which is different from the original paper.

Fig. 10 illustrates the visualized results of different methods on the iBims-1 dataset. The first column shows a standard indoor scene. Both our method and [10] correctly predict the farthest region, while the method of [57], which uses the plane normal vector, fails to correctly estimate the structure of the scene. The second column shows a kitchen scene. Most methods struggle to recover the farthest region of the scene (i.e., the window) and the shape details of the objects in the scene (the boundary of the window) at the same time, but our method performs very well in both aspects. The third column shows an office scene with an irregular tent placed inside. Intuitively, the proposed method predicts the edge of the tent more accurately, and at the same time the farthest point is more accurate. These results prove the effectiveness of the proposed BUBF and DCE. The fourth scene is mainly occupied by a large wall. Our method avoids wrongly predicting the door on the wall as the farthest region, and at the same time recovers richer shape details of the small objects on the cabinet. The fifth scene shows a lot of sundries. Most methods fail to fully recover the shape details of the objects, while both our method and [10] recover the details of the small objects and the farthest area. The last column shows a bathroom. Although most methods give a more accurate farthest area, the edges of the small objects generated by our method are sharper. All the results above demonstrate that our method achieves high accuracy in farthest-region prediction and depth boundary recovery.
Besides the NYUD v2 dataset and the iBims-1 dataset, another indoor dataset is employed to evaluate the generalization of our method. Fig. 11 shows the visualization results on the SUN-RGBD dataset [17]; the predicted depth is generated by our network trained on the NYUD v2 dataset only. Although the data distribution of SUN-RGBD is totally different from that of NYUD v2, our method achieves plausible results. In detail, regions that are relatively farther and closer are correctly predicted, which means that our method correctly predicts the structure of each scene. Furthermore, the predicted depth is reasonable where the depth sensor cannot capture accurate values.
Figure 11: More visualization results on the SUN-RGBD dataset.
The proposed method perceives the farthest region by capturing contextual correlations and recovers the depth boundary by fusing adjacent levels of features. Thus, our approach handles scenarios with rich visual cues well. In contrast, the iBims-1 dataset contains several images that are occupied by a single large plane, lacking the visual cues needed to assess the farthest area and the depth. Consequently, in these cases our method suffers from a greater performance degradation than others, which limits our method on the commonly used error metrics on the iBims-1 dataset. This is the main limitation of our method. Fig. 12 shows four examples with a large plane occupying the whole image; it can be seen that our method obtains a larger error than other methods in these cases. In the future, we plan to avoid the incorrect depth change on large planes by enhancing the feature embedding of the whole image. Besides, we also consider handling these cases by introducing additional semantic information into our network to further constrain the depth change.

Figure 12: Several failure cases. MSE denotes the mean squared error between the predicted depth and the corresponding ground truth. * Using a ResNet-50 backbone, which is different from the original paper.
5. Conclusion
In this paper, we propose a novel Boundary-induced and Scene-aggregated network (BS-Net), which considers the important roles of the farthest region and the boundary cue in depth prediction. To perceive the farthest region, the DCE is introduced, which captures the correlations between multi-scale regions. To extract the important edge cue, namely the boundary cue, the BUBF is proposed to gradually locate sudden changes in depth without any additional labels. Besides, the SRM is proposed to fully fuse the boundary cue and the global depth layout. Numerous experiments on the NYUD v2 dataset indicate that our approach achieves state-of-the-art performance.
6. Acknowledgment
This work is supported by the National Natural Science Foundation of China No. 61703049. This work is also supported by the BUPT Excellent Ph.D. Students Foundation No. CX2020114 and the Natural Science Foundation of Hubei Province of China under Grant 2019CFA022.