UniFuse: Unidirectional Fusion for 360° Panorama Depth Estimation
Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong and Rui Huang

Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong, Shenzhen; Alibaba Cloud A.I. Lab. This work was mainly done when Hualie Jiang interned at Alibaba Cloud A.I. Lab.

Abstract — Learning depth from spherical panoramas is becoming a popular research topic, because a panorama has a full field-of-view of the environment and provides a relatively complete description of a scene. However, applying well-studied CNNs for perspective images to the standard representation of spherical panoramas, i.e., the equirectangular projection, is suboptimal, as the projection becomes more distorted towards the poles. Another representation is the cubemap projection, which is distortion-free but discontinuous at face edges and limited in field-of-view. This paper introduces a new framework to fuse features from the two projections, unidirectionally feeding the cubemap features to the equirectangular features only at the decoding stage. Unlike the recent bidirectional fusion approach operating at both the encoding and decoding stages, our fusion scheme is much more efficient. In addition, we design a more effective fusion module for our fusion scheme. Experiments verify the effectiveness of the proposed fusion strategy and module, and our model achieves state-of-the-art performance on four popular datasets. Additional experiments show that our model also has advantages in model complexity and generalization capability.
I. INTRODUCTION

Depth estimation is a fundamental step in 3D reconstruction, with many applications such as robot navigation and virtual/augmented reality. A spherical (or 360°, omnidirectional) panoramic image has a full field-of-view of the environment and thus has the potential to produce a more accurate, complete, and scale-consistent reconstruction of scenes. This paper presents our work on better predicting depth from a single spherical panoramic image.

The 360° panorama is usually represented as the equirectangular projection (ERP) or the cubemap projection (CMP) [1]. Both of them differ from the perspective image and have their respective advantages and disadvantages. ERP provides a complete view of the scene but contains distortion that becomes more severe towards the poles. In contrast, CMP is distortion-free but discontinuous at face boundaries and limited in field-of-view. Applying deep CNNs to panoramic images for accurate depth estimation is thus challenging.

Recently, BiFuse [2] combined the two above projections for depth estimation; it builds bidirectional fusion between the two branches at both the encoding and decoding stages and finally uses a refinement network to fuse the estimated depth maps from both branches. To alleviate the discontinuity of CMP, BiFuse also adopts spherical padding among cube faces. However, with so many modules added, BiFuse becomes over-complicated, as discussed in detail in Sec. IV-B.3. We argue that feeding the ERP features to the CMP branch is unnecessary, as the ultimate goal is to output an equirectangular depth map. Optimizing the cubemap depth may cause the training to lose focus on the equirectangular depth. Furthermore, performing the fusion at the encoding stage may disturb the learning of the encoder, as it is usually initialized with ImageNet [3] pretrained parameters.

To address the above limitations, we propose a new fusion framework, which unidirectionally feeds the features extracted from CMP to the ERP branch only at the decoding stage to better support the ERP prediction, as shown in Fig. 1. The fusion scheme uses the simple U-Net [4] and performs the fusion at the skip connections, so that the fusion has minimum coupling with the backbones. Besides, we design a fusion module for our fusion framework, aiming at using the Cubemap to Enhance the Equirectangular features (denoted as CEE). We first adopt a residual modulation of the cubemap features to mitigate their discontinuity. Because the concatenation of the modulated cubemap features and the equirectangular features doubles the feature map channels, we introduce the Squeeze-and-Excitation (SE) [5] block into the CEE module to better model the channel-wise importance. Our CEE module works better for our fusion scheme than the simple concatenation or the Bi-Projection [2] module does.

Our contributions are summarized as follows: (1) we propose a new fusion framework of equirectangular and cubemap features for single spherical panorama depth estimation, (2) we design a better fusion module for our unidirectional fusion framework than existing modules, and (3) we perform experiments to show our approach's effectiveness; the final model establishes state-of-the-art performance and has advantages in complexity and generalization ability.
II. RELATED WORK

Make3D [6] is a seminal work on single perspective image depth estimation, which uses a traditional graphical model. With the development of deep learning, convolutional neural networks were applied to this task [7], [8], [9], [10], [11], [12], [13]. The task is treated either as a dense regression problem [7], [8], [9], [10] or as a classification problem [11], [12], [13] by discretizing the depth. The experiments are usually performed on datasets with ground truth depth obtained with physical sensors. To avoid the direct usage of ground truth depth, some works tried to utilize other data sources for training, e.g., stereo images [14], [15], [16] and monocular videos [17], [18], [16], [19], where the training objective is to minimize the between-view reconstruction error. However, the performance of these unsupervised approaches is inferior to that of the supervised ones.
Fig. 1: Our Proposed Unidirectional Fusion Framework.
The spherical panorama has a full field-of-view of a scene, from which more accurate and scale-consistent depth can be extracted than from a perspective image. Zioulis et al. [20] first performed depth estimation on panoramas and proposed to replace the first two layers of the network with a set of rectangle filters [21] to handle distortion. They constructed the 3D60 dataset rendered from several datasets. However, this dataset is relatively easy due to a rendering problem, as pointed out in Sec. IV-B.1. Later, they constructed both vertical and horizontal stereo panoramas to perform unsupervised 360° depth learning [22]. Similarly, Wang et al. [23] composed a purely virtual panorama dataset, PanoSUNCG, with panoramic video frames to perform unsupervised depth learning like SfMLearner [17]. Pano Popups [24] jointly learns depth with surface normals and boundaries to improve depth estimation. More recently, ODE-CNN [25] reduces the 360° depth estimation problem to an extension problem from the front-face depth. Both of their experiments are still performed on virtual datasets only.

However, virtual datasets tend to be too easy, and the models trained on them are probably hard to transfer to real applications. Tateno et al. [26] first experimented on the real dataset Stanford2D3D [27]. They proposed to train on common perspective sub-views and then transfer to the full panoramic images by applying a distortion-aware deformation to the convolutional kernels of the trained model. Though such a method has the potential to utilize more available RGBD datasets to learn a panorama depth estimation model, it fails to take advantage of the large receptive field-of-view of the panorama. More recent works [28], [2] tend to build complex models for this task. Jin et al. [28] leveraged layout elements as both a prior and a regularizer for depth estimation, resulting in a model with three encoders and seven decoders. In comparison, our UniFuse contains only two encoders and one decoder. Wang et al. [2] first proposed to utilize both ERP and CMP for 360° depth estimation. Their model, BiFuse, is composed of two networks, i.e., the equirectangular and cubemap branches, between which are bidirectional fusion modules. There is also a refinement network to refine the predicted depth maps of the two branches. However, its complex structure may hinder the concentration on the learning of equirectangular features, which is critical to the final equirectangular depth map. In contrast, both our unidirectional fusion framework and the CEE fusion module are designed to use the cubemap to enhance the equirectangular feature learning.

There are some existing elaborate convolutions for handling the distortion of ERP and special padding techniques for the discontinuity of both ERP and CMP. The convolutions include the Spherical Convolution (SC) [21], which is a set of rectangle filters used at the front layers of the network, and the Distortion-aware Convolution (DaC) [26], [29], [30], which samples the features for convolution from a regular grid on the tangent plane instead of the ERP. The padding methods include the Circular Padding (CirP) [31] for ERP, and Cube Padding (CuP) [32] and Spherical Padding (SP) [2] for CMP. Readers may refer to the original papers for more technical details. We do not adopt these special convolutions and paddings in our models. Experiments in Sec. IV-B.3 show that using these methods in UniFuse does not improve the performance but usually adds complexity.

III. METHODOLOGY

A. Preliminaries
In this section, we introduce the two common projections for the spherical image, i.e., the equirectangular projection and the cubemap projection, and their mutual conversion.
Equirectangular Projection is the representation of a spherical surface obtained by uniformly sampling it in the longitudinal and latitudinal angles. The sampling grid is rectangular, with its width twice its height. Suppose that the longitude and latitude are φ and θ respectively, with (φ, θ) ∈ [−π, π] × [−π/2, π/2]. The angular position (φ, θ) can be converted into the coordinate P_s = (p_s^x, p_s^y, p_s^z) on the standard spherical surface with radius r by

p_s^x = r sin(φ) cos(θ),
p_s^y = r sin(θ),                      (1)
p_s^z = r cos(φ) cos(θ).

Cubemap Projection is the projection of a spherical surface to the faces of its inscribed cube. The six faces are specific perspective images of size r × r with focal length r/2.
The faces can be denoted as f_i, i ∈ {B, D, F, L, R, U}, corresponding to the looking directions −z (back), −y (down), z (front), x (left), −x (right) and y (up). The front face has the identical coordinate system to the spherical surface, while the other faces are related to it by 90° or 180° rotations around one axis. Let us denote the rotation matrix from the coordinate system of the spherical surface to that of the i-th face as R_{f_i}. Then we can project a pixel P_c = (p_c^x, p_c^y, p_c^z) in f_i by

P_s = s · R_{f_i} P_c,                 (2)

where p_c^x, p_c^y ∈ [0, r], p_c^z = r/2, and the factor s = r/|P_c|.

C2E is the reprojection of the contents (raw RGB values or features) in cubemaps onto the equirectangular grid.
C2E is usually performed as an inverse warping, where for every equirectangular pixel we have to compute the corresponding cube-face point from its angular position (φ, θ). Specifically, we first use Equ. (1) to project the angular position onto the spherical surface. Then we determine the projected face by finding the minimum angular distance between the point and the looking directions of the cube faces. Finally, we compute the corresponding position in that cube face by the inverse process of Equ. (2).
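To make the conversion concrete, the following is a minimal NumPy sketch of this C2E inverse warping under the conventions above (unit radius, longitude/latitude grid, nearest-neighbour sampling). The face-orientation helper, the centered face coordinates, and all function names are our own illustrative assumptions rather than the released implementation.

```python
import numpy as np

# Looking directions of the six cube faces (B, D, F, L, R, U), as in Sec. III-A.
FACE_DIRS = {
    "F": np.array([0., 0., 1.]), "B": np.array([0., 0., -1.]),
    "L": np.array([1., 0., 0.]), "R": np.array([-1., 0., 0.]),
    "U": np.array([0., 1., 0.]), "D": np.array([0., -1., 0.]),
}

def face_rotation(d):
    """Sphere-to-face rotation for a face looking along d; the in-plane (up
    vector) convention is an arbitrary choice and differs between codebases."""
    up = np.array([0., 1., 0.]) if abs(d[1]) < 0.9 else np.array([0., 0., 1.])
    x_axis = np.cross(up, d); x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(d, x_axis)
    return np.stack([x_axis, y_axis, d])        # rows: face axes in sphere coordinates

def sphere_points(h, w):
    """Equ. (1): map every equirectangular pixel to a point on the unit sphere."""
    phi = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi       # longitude in [-pi, pi]
    theta = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi     # latitude in [-pi/2, pi/2]
    phi, theta = np.meshgrid(phi, theta)
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(theta),
                     np.cos(phi) * np.cos(theta)], axis=-1)  # (h, w, 3)

def c2e(cube_faces, h, w):
    """Inverse warping: sample cubemap contents onto an (h, w) ERP grid.
    cube_faces maps a face name to an (s, s, C) array of RGB values or features."""
    pts = sphere_points(h, w)
    names = list(cube_faces)
    dirs = np.stack([FACE_DIRS[n] for n in names])
    face_idx = np.argmax(pts @ dirs.T, axis=-1)              # closest looking direction
    out = np.zeros((h, w) + cube_faces[names[0]].shape[2:])
    for i, name in enumerate(names):
        face, s = cube_faces[name], cube_faces[name].shape[0]
        mask = face_idx == i
        p_c = pts[mask] @ face_rotation(FACE_DIRS[name]).T   # rotate into the face frame
        p_c = p_c / p_c[:, 2:3] * 0.5                        # rescale so that p_c^z = r/2
        u = np.clip(((p_c[:, 0] + 0.5) * s).astype(int), 0, s - 1)
        v = np.clip(((p_c[:, 1] + 0.5) * s).astype(int), 0, s - 1)
        out[mask] = face[v, u]
    return out
```

E2C, which produces the cubemap input of the network, can be implemented analogously by inverting the sampling direction.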
B. The Unidirectional Fusion Network

Our unidirectional fusion network of ERP and CMP is illustrated in Fig. 1. The reason to perform fusion in a unidirectional manner is that the ultimate goal of 360° depth estimation is to produce an equirectangular depth map, and feeding distortion-free cubemap features to the full-view equirectangular features as a supporting component is a natural choice. We do not perform fusion in the reverse direction, as CMP is limited in field-of-view and the spherical padding [2] for the discontinuity of CMP is time-consuming. Additionally, a decoder for the CMP branch to predict cube depth maps would increase the complexity, and optimizing the cube depth maps would distract the learning of the equirectangular depth. Therefore, we do not adopt a decoder for the cubemaps, and the network contains only two encoders and one decoder. To avoid disturbing the learning of the backbone, we choose to perform fusion only at the decoding stage, and it is better to fuse the well-encoded features. To this end, we adopt a U-Net [4] as the baseline network and perform fusion within the skip connections.
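The PyTorch sketch below illustrates this layout under our own naming and a simplified decoder; it is not the released implementation. It uses two ImageNet-pretrained ResNet-18 encoders (as in Sec. IV) and assumes an injected fusion-module constructor (e.g., the CEE sketch in the next subsection) and a feature-level C2E function with the hypothetical signature c2e_fn(cube_features, erp_size); cubemap face batching is also left to those placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class UniFuseSketch(nn.Module):
    """Schematic of the unidirectional scheme: two encoders, one ERP decoder,
    and fusion of cubemap features into the ERP skip connections only."""

    def __init__(self, fuse_module_fn, c2e_fn):
        super().__init__()
        self.erp_enc = resnet18(pretrained=True)   # ImageNet-pretrained backbones
        self.cmp_enc = resnet18(pretrained=True)
        self.c2e = c2e_fn                          # reprojects cube features onto the ERP grid
        chs = [64, 64, 128, 256, 512]              # ResNet-18 skip channels (strides 2..32)
        self.fuse = nn.ModuleList(fuse_module_fn(c) for c in chs)
        self.dec = nn.ModuleList(
            nn.Conv2d(in_c + skip_c, out_c, 3, padding=1)
            for in_c, skip_c, out_c in [(512, 256, 256), (256, 128, 128),
                                        (128, 64, 64), (64, 64, 32)])
        self.head = nn.Conv2d(32, 1, 3, padding=1)

    def encode(self, enc, x):
        x = enc.relu(enc.bn1(enc.conv1(x)))
        feats = [x]
        x = enc.maxpool(x)
        for layer in (enc.layer1, enc.layer2, enc.layer3, enc.layer4):
            x = layer(x)
            feats.append(x)
        return feats

    def forward(self, erp_img, cube_img):
        erp_feats = self.encode(self.erp_enc, erp_img)
        cmp_feats = self.encode(self.cmp_enc, cube_img)
        # Unidirectional fusion: cubemap features only enhance the ERP skips.
        skips = [fuse(e, self.c2e(c, e.shape[-2:]))
                 for fuse, e, c in zip(self.fuse, erp_feats, cmp_feats)]
        x = skips[-1]
        for conv, skip in zip(self.dec, reversed(skips[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode="nearest")
            x = F.relu(conv(torch.cat([x, skip], dim=1)))
        return self.head(F.interpolate(x, scale_factor=2, mode="bilinear",
                                       align_corners=False))
```

A trivial fusion module for testing could simply add the reprojected cubemap features to the equirectangular ones; the CEE module described next is the intended choice.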
Fig. 2: The Fusion Modules.

C. The Fusion Modules

In this section, we introduce our proposed fusion module for the UniFuse framework, as well as two baseline fusion methods, concatenation and the Bi-Projection [2], as illustrated in Fig. 2. The dimensions of the features common to these three modules are:

• F_equi / F'_equi, F_c2e / F'_c2e, F_res / F'_res, F_fused: H × W × C
• F_cube: 6 × H/2 × H/2 × C
• F_cat / F'_cat: H × W × 2C

where H, W and C are the height, width, and channels of the equirectangular features; the cubemap features have the same number of channels, but the face size is just H/2.

The Concatenation module first casts the CMP features to ERP by
C2E, then concatenates them with F_equi, and finally uses a 1 × 1 conv. module to reduce the channels from 2C to C. The number of parameters is 2C².

Bi-Projection [2] aims at generating a masked feature map from one branch and adding it to the other. In Fig. 2 (b), we omit the E2C path, as it is not necessary in our unidirectional fusion framework. The Mask has size H × W × 1. To generate the mask, Bi-Projection first uses two conv. modules to encode F_equi and F_c2e as F'_equi and F'_c2e, then reduces the concatenated feature map's channels to 1, and finally applies a Sigmoid function to scale the Mask between 0 and 1. Such masked modulation may be useful for gradually improving the feature learning in the two branches of BiFuse, but it seems not effective enough for our unidirectional fusion framework at the decoding stage.

CEE is a more elaborate concatenation that better facilitates the fusion process. It aims at using the distortion-free cubemap to enhance the equirectangular features. Because the cubemap features are probably inconsistent at cube-face boundaries, we first generate a residual feature map F_res to be added to F_c2e to reduce such an effect. To generate F_res, we apply a residual block inspired by ResNet [33] to the concatenation of F_equi and F_c2e. The residual block contains a 1 × 1 conv. module to squeeze the channels and a 3 × 3 conv. module to generate the residual feature map. Fig. 3 shows the feature maps F_c2e and F'_c2e at one decoding resolution stage: cracks appear between cube faces in F_c2e but disappear in F'_c2e, which indicates that the residual modulation has filled them. An intuitive explanation is that the continuous F_equi helps the residual block localize the inconsistent boundaries of F_c2e, and the supervision from continuous ground truth helps it learn to generate values for filling these boundaries. The remaining part is similar to Concatenation. As we have dual channels of features from both branches, before the final 1 × 1 conv. module we add a Squeeze-and-Excitation (SE) [5] block, which adaptively recalibrates the channel-wise feature responses and thus masks the fusion better. We set r in the SE block as 16; the resulting number of parameters in our CEE is considerably smaller (75%) than that of the Bi-Projection.

Fig. 3: The Visualization of F_c2e and F'_c2e.
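For reference, the following is a minimal PyTorch sketch of a CEE-style module following the description above (residual modulation, concatenation, SE recalibration with r = 16, and a final convolution back to C channels). The internal channel width of the residual block and the choice of a 1 × 1 output convolution are our assumptions, not values from the released code.

```python
import torch
import torch.nn as nn

class CEESketch(nn.Module):
    """Sketch of Cubemap-to-Enhance-Equirectangular (CEE) fusion."""

    def __init__(self, c, se_reduction=16):
        super().__init__()
        # residual block: a 1x1 conv squeezes the concatenated channels, a 3x3
        # conv produces a residual that repairs cube-face boundary discontinuities
        self.res = nn.Sequential(
            nn.Conv2d(2 * c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c))
        # Squeeze-and-Excitation over the 2C concatenated channels (r = 16)
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * c, 2 * c // se_reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * c // se_reduction, 2 * c, 1), nn.Sigmoid())
        self.out = nn.Conv2d(2 * c, c, 1)            # reduce 2C back to C for the decoder

    def forward(self, f_equi, f_c2e):
        f_cat = torch.cat([f_equi, f_c2e], dim=1)
        f_c2e = f_c2e + self.res(f_cat)              # residual modulation of cube features
        f_cat = torch.cat([f_equi, f_c2e], dim=1)
        f_cat = f_cat * self.se(f_cat)               # channel-wise recalibration
        return self.out(f_cat)
```

Such a module could serve as the fuse_module_fn placeholder in the network sketch of Sec. III-B.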
IV. EXPERIMENTS

A. Experimental Settings

1) Datasets:
Our experiments are conducted on four datasets: Matterport3D [34], Stanford2D3D [27], 3D60 [20], and PanoSUNCG [23]. Matterport3D and Stanford2D3D are real-world datasets collected by Matterport's Pro 3D Camera. While Matterport3D provides the raw depth, Stanford2D3D constructs the depth maps from reconstructed 3D models. Thus, the bottom and top depth is missing in Matterport3D, and some depth in Stanford2D3D is inaccurate, as shown in Fig. 4. 3D60 is a 360° depth dataset provided by OmniDepth [20]; it is rendered from the 3D models of two realistic datasets, Matterport3D and Stanford2D3D, and two synthetic datasets, SceneNet [35] and SunCG [36]. In contrast, PanoSUNCG is a purely virtual dataset rendered from SunCG [36]. The statistics of the datasets are listed in Tab. I; the real datasets are smaller than the virtual ones.
2) Implementation Details:
We implement the proposed approach using PyTorch [37]. ResNet-18 [33] pretrained on ImageNet [3] is used as the backbone for most experiments, except for some using other backbones in Sec. IV-B.3. We use Adam [38] with default parameters as the optimizer and a constant learning rate. Besides the common data augmentation techniques of random color adjustment and left-right flipping, we also use random yaw rotation, as the ERP property is invariant under such a transformation.
TABLE I: The Statistics of the Datasets (Matterport3D, Stanford2D3D, 3D60, PanoSUNCG).
We use the popular BerHu loss [9] as the regression objective in training. During training, we randomly select 40 and 800 samples from the training sets of Stanford2D3D and 3D60, respectively, for validation, and we use the last five scenes (1091 samples) of the 80 training scenes of PanoSUNCG for validation. We train on the real datasets for 100 epochs and on the virtual datasets for 30 epochs, as the virtual datasets are quite large, whereas BiFuse [2] trains on all datasets for 100 epochs. Following BiFuse, we set the input sizes for the real and virtual datasets accordingly. We train most models on an NVIDIA 2080Ti GPU; the batch size for the virtual datasets is 8, but the batch size for the real datasets is only 6 due to the limited GPU memory. For some models in Sec. IV-B.3, we have to use two GPUs, each with a batch size of 3. These models include UniFuse with CuP [32], SP [2] and CirP [31], and the equirectangular baseline with DaC [29].
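For reference, a common formulation of the BerHu (reverse Huber) loss of [9] is sketched below: an L1 penalty for small residuals and a scaled quadratic penalty beyond an adaptive threshold set to one fifth of the largest residual in the batch. The masking of invalid ground-truth pixels is our own assumption about how missing depth would be handled.

```python
import torch

def berhu_loss(pred, target, mask=None):
    """BerHu (reverse Huber) regression loss with an adaptive threshold."""
    if mask is None:
        mask = target > 0                  # depth maps often mark missing depth as 0
    diff = (pred - target).abs()[mask]
    c = 0.2 * diff.max().detach()          # threshold: one fifth of the max residual
    l1 = diff
    l2 = (diff ** 2 + c ** 2) / (2 * c + 1e-12)
    return torch.where(diff <= c, l1, l2).mean()
```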
3) Evaluation Metrics:
We use several standard metrics for evaluation, including four error metrics, the mean absolute error (MAE), absolute relative error (Abs Rel), root mean square error (RMSE) and the root mean square error in log space (RMSE log), and three accuracy metrics, i.e., the percentages of pixels where the ratio (δ) between the estimated depth and the ground truth depth is smaller than 1.25, 1.25², and 1.25³. Note that, while most papers on depth estimation use log_e in RMSE log, the latest BiFuse adopts log_10. As BiFuse is the state-of-the-art method with which we mainly compare our UniFuse model, we also adopt log_10.
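A compact sketch of these metrics is given below; the log base for RMSE log is controlled by a flag so that either the log_10 convention of BiFuse or the natural-log convention can be reproduced. The valid-pixel masking via a positive-depth test is our simplifying assumption.

```python
import torch

def depth_metrics(pred, gt, log_base_10=True):
    """Standard depth error and accuracy metrics over valid (positive) pixels."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    mae = (pred - gt).abs().mean()
    abs_rel = ((pred - gt).abs() / gt).mean()
    rmse = ((pred - gt) ** 2).mean().sqrt()
    log = torch.log10 if log_base_10 else torch.log
    rmse_log = ((log(pred) - log(gt)) ** 2).mean().sqrt()
    ratio = torch.max(pred / gt, gt / pred)
    acc = [(ratio < 1.25 ** k).float().mean() for k in (1, 2, 3)]
    return mae, abs_rel, rmse, rmse_log, acc
```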
B. Experimental Results

1) Performance Comparison:
The quantitative comparison among the state-of-the-art methods of spherical depth estimation, our equirectangular baseline, and our UniFuse model on the four datasets is shown in Tab. II. We directly take the results from the related papers for comparison. Our UniFuse model establishes new state-of-the-art performance on all four datasets, especially on the biggest realistic dataset, Matterport3D, by a significant margin. To be specific, UniFuse outperforms BiFuse [2] by reducing the Abs Rel error from 0.2048 to 0.1063 and improving the δ < 1.25 accuracy from 84.52 to 88.97. In terms of fusion effectiveness, our UniFuse framework reduces the error metrics by about 14% on average over our equirectangular baseline, while BiFuse only reduces them by about 4% over its equirectangular baseline. Although UniFuse's improvement on the loosest accuracy metric, δ < 1.25³, is slightly smaller than BiFuse's improvement, UniFuse performs much better than BiFuse in enhancing the other two tighter accuracy metrics, especially more than three times as much on the accuracy of δ < 1.25.

Dataset       Method                MAE      Abs Rel  RMSE     RMSE log  δ<1.25  δ<1.25²  δ<1.25³
Matterport3D  BiFuse [2] –Equi.     0.3701   0.2074   0.6536   0.1176    83.02   92.45    95.77
              BiFuse [2] –Fusion    0.3470   0.2048   0.6259   0.1134    84.52   93.19    96.32
              BiFuse [2] –Improve   -6.24%   -1.25%   -4.24%   -3.57%    +1.50   +0.74    +0.55
              Ours –Equi.           0.3267   0.1304   0.5460   0.0817    83.70   94.84    97.81
              Ours –Fusion          0.2814   0.1063   0.4941   –         88.97   –        –
              Ours –Improve         -13.87%  -18.48%  -9.51%   -14.20%   +5.27   +1.39    +0.50
Stanford2D3D  Jin et al. [28]       –        0.1180   0.4210   –         85.10   –        –
              Ours –Equi.           0.2696   0.1417   0.4224   0.0871    82.96   95.59    98.35
              Ours –Improve         -22.77%  -21.38%  -12.62%  -17.22%   +4.15   +1.05    +0.47
3D60          OmniDepth [20]        –        0.0702   0.2911   0.1017†   95.74   99.33    99.79
              Cheng et al. [25]     –        0.0467   –        –         –       –        –
              BiFuse [2] –Equi.     0.1172   0.0606   0.2667   0.0437    96.67   99.20    99.66
              BiFuse [2] –Fusion    0.1143   0.0615   0.2440   0.0428    96.99   99.27    99.69
              BiFuse [2] –Improve   -2.47%   +1.49%   -8.51%   -2.06%    +0.32   +0.07    +0.03
              Ours –Equi.           0.1099   0.0517   0.2134   0.0342    97.64   99.59    99.86
              Ours –Improve         -9.37%   -9.86%   -8.16%   –         +0.71   +0.06    +0.01
PanoSUNCG     BiFuse [2] –Equi.     0.0836   0.0687   0.2902   0.0496    95.29   97.87    98.86
              BiFuse [2] –Fusion    0.0789   0.0592   –        –         –       –        –
              BiFuse [2] –Improve   –        -13.83%  -10.54%  -10.69%   +0.61   +0.51    +0.21
              Ours –Equi.           0.0839   0.0531   0.2982   0.0444    96.09   98.25    99.00
              Ours –Improve         -8.82%   -8.66%   -6.04%   -6.31%    +0.46   +0.21    +0.12

TABLE II: Quantitative Comparison on Four Datasets (error metrics ↓, accuracy metrics ↑). †The RMSE log_e of our UniFuse on 3D60 is 0.0725.
For the Stanford2D3D dataset, UniFuse outperforms BiFuse and another recent method by Jin et al. [28]. UniFuse performs the best on most metrics, being only slightly inferior to the model by Jin et al. [28] on one accuracy metric. Note that Jin et al. only experimented on a small portion of the Stanford2D3D dataset that satisfies the Manhattan structure (404 samples of the training set and 113 of the test set), as they proposed joint learning of 360° depth and Manhattan layout. In contrast, UniFuse is not limited to such a specific structure, and if the model by Jin et al. were evaluated on the entire test set of Stanford2D3D, its performance might degrade considerably. In terms of fusion effectiveness, both UniFuse and BiFuse [2] reduce the errors on Stanford2D3D to a larger extent than on Matterport3D, perhaps because the former dataset is much less complex. Similarly, UniFuse outperforms BiFuse to a bigger extent on the other four error metrics. For the accuracy metrics, our UniFuse still has a better improvement on the tightest one than BiFuse.

For 3D60, our UniFuse still has a much better improvement on MAE, Abs Rel, RMSE log, and δ < 1.25, and a slightly smaller improvement on the other three metrics, compared with BiFuse [2]. Overall, the performance on 3D60 is much higher than on the two realistic datasets, and thus the effect of the fusion is smaller. Our UniFuse model significantly outperforms BiFuse and performs approximately on par with the model by Cheng et al. [25]. However, their model takes the depth map of the front face and extends it to the entire 360° space, which requires an extra depth camera and careful calibration between the depth camera and the panorama camera. The virtual PanoSUNCG is also very easy, and our final model still outperforms BiFuse on most metrics except the RMSE.

We also provide qualitative results of our equirectangular baseline and UniFuse model in Fig. 4. Two examples from the test set of each dataset are shown, and the dark regions of the ground truth depth maps indicate unavailable depth. It can be observed that the UniFuse model produces accurate depth maps with fewer artifacts than the equirectangular baseline, which further verifies the effectiveness of our proposed approach. The two examples of the 3D60 dataset are rendered from the 3D models of Matterport3D, and in them it appears that the farther a region is, the darker it is, which is not the case in the realistic dataset. We believe that this problematic rendering makes 3D60 easy, as the network can probably use the brightness as a cue for predicting depth. However, such a cue may sometimes cause problems. For instance, the regions within the blue rectangles in the two examples contain dark areas, and our equirectangular baseline tends to predict these regions as farther.
2) Ablation Study:
We compare the effectiveness of ImageNet [3] pretraining (pt) and the three different fusion modules in our proposed fusion framework on the Matterport3D dataset in Tab. III. Pretraining is useful for both the baseline and UniFuse. For the baseline, disabling pretraining increases the Abs Rel error from 0.1304 to 0.1413 and drops the δ < 1.25 accuracy by over 2. Disabling pretraining in both the equirectangular and cubemap branches of UniFuse results in a bigger performance degradation.
Fig. 4: Qualitative Comparison between Our Equirectangular Baseline and UniFuse Model. Best viewed in color.
Method          MAE ↓   Abs Rel ↓  RMSE ↓  δ<1.25 ↑
Equi. w/o pt    0.3548  0.1413     0.5946  81.53
Equi.           0.3267  0.1304     0.5460  83.70
Concat.         0.3162  0.1237     0.5340  85.00
Bi-Proj.        0.3096  0.1188     0.5283  85.94
CEE w/o SE      0.3046  0.1161     0.5217  86.53
UniFuse w/o pt  0.3164  0.1195     0.5440  85.53
UniFuse         0.2814  0.1063     0.4941  88.97
TABLE III: Ablation Study.

Therefore, pretraining is also useful for ERP. An explanation for this effect is given in [28], which adopts a pretrained U-Net for ERP: the high-level parameters learned from perspective images can be easily fine-tuned to equirectangular ones. The simple concatenation of equirectangular features and cubemap features produces a good performance gain over the equirectangular baseline. Specifically, the Abs Rel error is reduced from 0.1304 to 0.1237, and the δ < 1.25 accuracy increases by 1.30. This indicates that our unidirectional fusion strategy is effective, despite being simple. Our fusion strategy is also compatible with the Bi-Projection module of BiFuse [2]; Bi-Projection roughly doubles the performance gain of the naive concatenation. Looking back at Tab. II, it is easy to see that the performance gain of our simple unidirectional fusion scheme with Bi-Projection significantly surpasses that of the complex bidirectional scheme of BiFuse. This further verifies that our simplified fusion scheme is even more effective.
Furthermore, our proposed CEE fusion module produces a large performance gap over Bi-Projection. To be specific, our CEE module has an Abs Rel error 0.0125 lower than Bi-Projection, and its δ < 1.25 accuracy is 3.03 higher. Even when turning off the SE block of our CEE module, the performance is still markedly better than Bi-Projection. Thus, our proposed CEE module is much more effective in fusing the equirectangular and cubemap features than Bi-Projection under our unidirectional fusion scheme.

Method                         #Params  GPU memory  Time     Abs Rel ↓  δ<1.25 ↑
BiFuse's Equi. [2] (R50)       63.56M   2247M       113ms    0.2075     83.02
BiFuse [2] (R50)               253.1M   4003M       1125ms   0.2048     84.52
Our Equi. (R18)                14.33M   907M        12.3ms   0.1304     83.70
Our Equi. (R50)                32.52M   1039M       27.2ms   0.1207     85.31
UniFuse (R18)                  30.26M   1221M       24.1ms   0.1063     88.97
Our Equi. (MV2)                4.00M    791M        13.9ms   0.1274     84.27
UniFuse (MV2)                  7.35M    889M        28.9ms   0.1116     87.56
UniFuse w/ CuP [32] (R18)      30.26M   1271M       365ms    0.1081     88.44
UniFuse w/ SP [2] (R18)        30.26M   1271M       574ms    0.1089     88.36
Our Equi. w/ CirP [31] (R18)   14.33M   999M        25.6ms   0.1229     85.39
UniFuse w/ CirP [31] (R18)     30.26M   1275M       40.8ms   0.1060     88.92
Our Equi. w/ DaC [29] (R18)    14.33M   1455M       116ms    0.1194     86.00
Our Equi. w/ SC [21] (R18)     14.34M   897M        12.6ms   0.1300     83.98
UniFuse w/ SC [21] (R18)       30.27M   1239M       24.5ms   0.1134     87.45

TABLE IV: Model Complexity Comparison.
3) Complexity Analysis:
We compare the complexity of the models of this work with the models in BiFuse [2], as well as some of their performance on Matterport3D, to understand how the performance is boosted by adding complexity. We also examine the efficiency and effectiveness of some existing methods for handling the discontinuity and distortion of panoramas in our models, which are listed at the end of Sec. II. The complexity metrics include the number of neural model parameters, the GPU memory, and the computation time when the model infers a single panorama. The experiment is performed on an NVIDIA Titan Xp GPU, and the computation time is averaged over 1000 images. The results are listed in Tab. IV. R50 and R18 indicate that the backbones are ResNet-50 and ResNet-18 [33], respectively, while MV2 represents MobileNetV2 [39].

We obtained the complexity of BiFuse from its openly released inference code. The models of BiFuse are much more complicated than ours. Its baseline is much more complex than our baseline: it is an FCRN [9] with the R50 backbone whose first layer is replaced with SC [21], whereas ours is just a simple U-Net using a pretrained R18 backbone. However, our baseline still performs slightly better. BiFuse's fusion scheme significantly increases the complexity; it almost quadruples the number of parameters. One fold of the parameters comes from the cubemap branch, and the Bi-Projection modules at both the encoding and decoding stages, together with the final refinement module that fuses the depth maps from the equirectangular and cubemap branches, increase the number of neural parameters further. BiFuse also increases the inference time to about 10 times that of its baseline. We believe that only part of this comes from the Bi-Projection modules and the final refinement module; most of it can probably be attributed to SP [2] on the cubemap branch, which will be explained later when using SP and CuP [32] in UniFuse. In contrast, our UniFuse on ResNet-18 only doubles the parameter count and inference time of our baseline, yet the performance is boosted considerably and inference remains real-time (over 40 fps). To show that the performance of our UniFuse is not simply obtained by adding complexity, we also experiment with our baseline on ResNet-50, whose complexity is similar to that of UniFuse on ResNet-18 but whose performance is far behind.

To explore the possibility of applying our models on mobile devices, we also experiment with MobileNetV2 [39]. The results show that our UniFuse can still boost the performance, albeit to a smaller extent than with ResNet-18. The inference time is still real-time, and the GPU memory and parameters decrease a great deal, which indicates that our models have the potential to be applied to mobile robots or AR/VR devices.

The remaining experiments concern using special padding and convolution methods in our models, and the results indicate that it is unnecessary to adopt them in UniFuse. We find that both SP and CuP are implemented in BiFuse's code, so we introduce them into UniFuse. The results show that the inference time increases about 15 times for CuP and about 24 times for SP. They have similar implementations: for each of the six cube faces, the sides of the feature map have to be sampled from the adjacent cube faces, so each padding operation requires many separate sampling loops. The padding has to be performed for every convolution with kernel size bigger than 1, so the inference time greatly increases. SP additionally uses interpolation in each loop, so its time increases further. However, they do not improve the performance of UniFuse.
Although BiFuse's paper has an ablation study showing that they enhance the predictions of the cubemap branch, we hypothesize that they are less useful for fusion, especially for UniFuse, which uses CMP only as a supplement. A similar observation can be made in the experiments with CirP [31]: CirP barely improves the performance of UniFuse but is effective for our baseline. There are small technical differences in DaC among different studies [26], [29], [30]; we implement the version by Coors et al. [29] in our equirectangular baseline. DaC improves the performance of our baseline but is still far inferior to UniFuse. Moreover, as the interpolation has to be performed densely in the convolution, the resulting space and time complexity are much higher, which is in accordance with the experiments of CFL [30].
We also replace the first layer of our models with BiFuse's implementation of SC [21]. The resulting complexity is almost the same, since only the first layer is changed. SC slightly improves the performance of our baseline but reduces the performance of UniFuse. Therefore, SC is not effective for our UniFuse.

                          Method       Abs Rel ↓  RMSE ↓  δ<1.25 ↑
Train on Matterport3D     BiFuse [2]   0.1014     0.4070  90.48
                          Our UniFuse  0.0348     0.1863  98.47
Transfer to Stanford2D3D  BiFuse [2]   0.1195     0.4339  86.16
                          Our UniFuse  0.0944     0.3800  91.31

TABLE V: Generalization Comparison.
4) Generalization Capability:
We examine the generalization capability of BiFuse [2] and our UniFuse in Tab. V. We can perform such an examination because BiFuse provides a pretrained model on the entire
Matterport3D dataset. We also trained our UniFuse on the whole Matterport3D dataset. Next, we evaluate both BiFuse and our UniFuse on the test set of Matterport3D to see the effectiveness of fitting, as the test set has also been trained on. Finally, we evaluate both models on the test set of Stanford2D3D. As the sensed depth of Matterport3D does not cover the top and bottom of the panorama, to make a fair comparison we do not count the topmost and lowest 68 pixels, following BiFuse's code, when evaluating on Stanford2D3D. These two datasets have a certain domain gap, as Matterport3D contains various household scenes, while Stanford2D3D contains the office scenes of a university. Thus, this transfer can examine the generalization capability of the different models.

From Tab. V, BiFuse's Abs Rel is about 3 times that of our UniFuse, and UniFuse is nearly 8 points higher on the δ < 1.25 accuracy than BiFuse. This indicates that UniFuse fits Matterport3D much better than BiFuse. Yet the better fitting is not overfitting, as UniFuse also transfers to Stanford2D3D much better. The visualization results in Fig. 5 also verify this fact, and one extra merit of UniFuse is that even where there is no ground truth depth at the top and bottom, it can produce plausible depth.

Fig. 5: Qualitative Comparison between BiFuse and Our UniFuse Model. Best viewed in color.

V. CONCLUSIONS

In this paper, we have presented our UniFuse model for single spherical panorama depth estimation. UniFuse is a simple yet effective framework that utilizes both the equirectangular projection and the cubemap projection for 360° depth estimation. We have also designed the new CEE fusion module for our framework to better enhance the equirectangular features. Experiments have verified that both our framework and our module are effective. The final UniFuse model makes significant progress over the state-of-the-art methods on four popular 360° panoramic datasets, especially on the biggest realistic dataset, Matterport3D. Furthermore, we have shown that our model has much lower model complexity and higher generalization capability than previous works, indicating its potential for real-world applications. We are exploring the possibility of applying our model to practical fields, such as mobile robots.

REFERENCES

[1] R. Skupin, Y. Sanchez, Y.-K. Wang, M. M. Hannuksela, J. Boyce, and M. Wien, "Standardization status of 360 degree video coding and delivery," IEEE, 2017, pp. 1–4.
[2] F.-E. Wang, Y.-H. Yeh, M. Sun, W.-C. Chiu, and Y.-H. Tsai, "BiFuse: Monocular 360 depth estimation via bi-projection fusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 462–471.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," IEEE, 2009.
[4] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[5] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[6] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2009.
[7] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.
[8] F. Liu, C. Shen, G. Lin, and I. Reid, "Learning depth from single monocular images using deep convolutional neural fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024–2039, 2015.
[9] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, "Deeper depth prediction with fully convolutional residual networks," in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 239–248.
[10] H. Jiang and R. Huang, "High quality monocular depth estimation via a multi-scale network and a detail-preserving objective," IEEE, 2019.
[11] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, "Deep ordinal regression network for monocular depth estimation," in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[12] B. Li, Y. Dai, and M. He, "Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference," Pattern Recognition, vol. 83, pp. 328–339, 2018.
[13] H. Jiang and R. Huang, "Hierarchical binary classification for monocular depth estimation," IEEE, 2019, pp. 1975–1980.
[14] R. Garg, V. K. BG, G. Carneiro, and I. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," in European Conference on Computer Vision. Springer, 2016, pp. 740–756.
[15] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270–279.
[16] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into self-supervised monocular depth estimation," in Proceedings of the IEEE International Conference on Computer Vision, 2019.
[17] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[18] G. Wang, H. Wang, Y. Liu, and W. Chen, "Unsupervised learning of monocular depth and ego-motion using multiple masks," in International Conference on Robotics and Automation, 2019.
[19] H. Jiang, L. Ding, Z. Sun, and R. Huang, "DiPE: Deeper into photometric errors for unsupervised learning of depth and ego-motion from monocular videos," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
[20] N. Zioulis, A. Karakottas, D. Zarpalas, and P. Daras, "OmniDepth: Dense depth estimation for indoors spherical panoramas," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[21] Y.-C. Su and K. Grauman, "Learning spherical convolution for fast features from 360 imagery," in Advances in Neural Information Processing Systems, 2017, pp. 529–539.
[22] N. Zioulis, A. Karakottas, D. Zarpalas, F. Alvarez, and P. Daras, "Spherical view synthesis for self-supervised 360° depth estimation," IEEE, 2019.
[23] F.-E. Wang, H.-N. Hu, H.-T. Cheng, J.-T. Lin, S.-T. Yang, M.-L. Shih, H.-K. Chu, and M. Sun, "Self-supervised learning of depth and camera motion from 360° videos," in Asian Conference on Computer Vision. Springer, 2018, pp. 53–68.
[24] M. Eder, P. Moulon, and L. Guan, "Pano Popups: Indoor 3D reconstruction with a plane-aware network," IEEE, 2019, pp. 76–84.
[25] X. Cheng, P. Wang, Y. Zhou, C. Guan, and R. Yang, "ODE-CNN: Omnidirectional depth extension networks," 2020, pp. 589–595.
[26] K. Tateno, N. Navab, and F. Tombari, "Distortion-aware convolutional filters for dense prediction in panoramic images," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[27] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, "Joint 2D-3D-semantic data for indoor scene understanding," arXiv preprint arXiv:1702.01105, 2017.
[28] L. Jin, Y. Xu, J. Zheng, J. Zhang, R. Tang, S. Xu, J. Yu, and S. Gao, "Geometric structure based and regularized depth estimation from 360 indoor imagery," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 889–898.
[29] B. Coors, A. P. Condurache, and A. Geiger, "SphereNet: Learning spherical representations for detection and classification in omnidirectional images," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 518–533.
[30] C. Fernandez-Labrador, J. M. Facil, A. Perez-Yus, C. Demonceaux, J. Civera, and J. J. Guerrero, "Corners for layout: End-to-end layout recovery from 360 images," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1255–1262, 2020.
[31] T.-H. Wang, H.-J. Huang, J.-T. Lin, C.-W. Hu, K.-H. Zeng, and M. Sun, "Omnidirectional CNN for visual place recognition and navigation," IEEE, 2018, pp. 2341–2348.
[32] H.-T. Cheng, C.-H. Chao, J.-D. Dong, H.-K. Wen, T.-L. Liu, and M. Sun, "Cube padding for weakly-supervised saliency prediction in 360 videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1420–1429.
[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[34] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, "Matterport3D: Learning from RGB-D data in indoor environments," International Conference on 3D Vision (3DV), 2017.
[35] A. Handa, V. Pătrăucean, S. Stent, and R. Cipolla, "SceneNet: An annotated model generator for indoor scene understanding," in IEEE International Conference on Robotics and Automation, 2016.
[36] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, "Semantic scene completion from a single depth image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1746–1754.
[37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NeurIPS-W, 2017.
[38] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in