SeasonDepth: Cross-Season Monocular Depth Prediction Dataset and Benchmark under Multiple Environments
Hanjiang Hu, Baoquan Yang, Zhijian Qiao, Ding Zhao, Hesheng Wang
Abstract—Monocular depth prediction has been well studied recently, while few works focus on depth prediction across multiple environments, e.g., changing illumination and seasons, owing to the lack of such a real-world dataset and benchmark. In this work, we derive a new cross-season scaleless monocular depth prediction dataset, SeasonDepth, from the CMU Visual Localization dataset through structure from motion. We then formulate several metrics to benchmark the performance under different environments using recent state-of-the-art open-source pretrained depth prediction models from the KITTI benchmark. Through extensive zero-shot experimental evaluation on the proposed dataset, we show that long-term monocular depth prediction is far from solved, and we identify promising directions for future work, including geometry-based and scale-invariant training. Moreover, multi-environment synthetic datasets and cross-dataset validation are beneficial to robustness against real-world environmental variance.
I. INTRODUCTION
Outdoor monocular depth prediction plays an essential role in the perception of autonomous driving and mobile robotics. In recent years, monocular depth prediction has made significant progress due to the boost of deep convolutional neural networks [1]–[4]. However, since the outdoor environmental condition changes with seasons, weather and time of day [5], pixel appearance is drastically affected, which poses a huge challenge for robust visual perception and localization. Considering cross-dataset performance [6], especially under adverse environments for depth prediction [7], it is of great necessity to study how environmental changes influence monocular depth prediction. Besides, cross-season real-world datasets and benchmarks would also promote related tasks [8], like long-term visual localization [9] and place recognition [10].

Since the ground truth of an outdoor depth map from a calibrated LiDAR device is usually sparse, and it is expensive and effort-costly to obtain high-quality dense depth maps, we derive a scaleless dense depth prediction dataset SeasonDepth with multi-environment traverses based on the urban part of the CMU Visual Localization dataset [11]. Different from datasets with dense depth maps from stereo cameras, we adopt a structure-from-motion (SfM) and multi-view stereo (MVS) pipeline with clustering-based RANSAC to filter the dynamic objects on the road, followed by refinement in HSV color space for areas with extreme illumination change.

This work was supported in part by the Natural Science Foundation of China under Grants U1613218, 61722309 and U1913204, and in part by the Beijing Advanced Innovation Center for Intelligent Robots and Systems under Grant 2019IRS01. Corresponding Author: Hesheng Wang. The authors are with the Department of Automation, Institute of Medical Robotics, Key Laboratory of System Control and Information Processing of Ministry of Education, and Key Laboratory of Marine Intelligent Equipment and System of Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China. H. Wang is also with the Beijing Advanced Innovation Center for Intelligent Robots and Systems, Beijing Institute of Technology, China. The dataset is available at https://github.com/SeasonDepth/SeasonDepth.

Fig. 1. The first row shows image samples from the proposed dataset SeasonDepth and the second row gives the ground truths. Column (a) is under the Cloudy + Foliage condition, (b) is under the Overcast + Mixed Foliage condition and (c) is under the Low Sun + Mixed Foliage condition.

To benchmark the performance on the dataset, several practical metrics are proposed for the zero-shot evaluation of state-of-the-art open-source methods with pretrained models from the
KITTI benchmark [12], [13] on
SeasonDepth. The baselines we choose include supervised [14]–[16], semi-supervised [17] and self-supervised [18]–[21] algorithms trained on real-world datasets, and domain adaptation algorithms [22], [23] trained on a virtual synthetic dataset with changing environments [24]. From the evaluation results, no algorithm robustly solves depth prediction under changing environments, and promising avenues in geometric constraints, normal-based and scale-invariant training are identified. Furthermore, it is shown that cross-dataset validation and model training on a multi-environment virtual dataset are also helpful for robustness to real-world changing environments. In summary, our contributions in this work lie in:

• A new monocular depth prediction dataset SeasonDepth with drastic environmental changes is built through an MVS pipeline with modified RANSAC for refinement, and is made publicly available.
• Several metrics are proposed to measure depth maps with relative depth values under multiple environments, evaluating the effectiveness and robustness of each algorithm.
• Recent state-of-the-art open-source depth prediction algorithms are evaluated on SeasonDepth with models pretrained on the KITTI benchmark, showing what kinds of methods are robust to changing environments.

The rest of the paper is structured as follows. Section II analyzes the related work on monocular depth prediction and related datasets. Section III presents the proposed dataset. Section IV introduces the metrics and benchmark setup. The experimental evaluation is shown in Section V. Finally, in Section VI we draw the conclusions.

II. RELATED WORK
A. Depth Prediction Datasets
Depth prediction plays an important role in perception and localization for mobile robotics and other computer vision applications, and datasets are essential to the development of the depth prediction task. Acquiring depth map ground truth is challenging: it must be obtained through other calibrated devices or complicated reconstruction algorithms. For indoor scenarios, the depth map ground truth is quite easy to obtain directly through calibrated RGB-D cameras [25]–[27], web stereo photos [16], [28], [29] or sparse depth maps from laser scanners [30].

The ground truth of outdoor depth maps is more complex to obtain. One common method is to project LiDAR point clouds onto the image plane [12], [21], [30], which is quite accurate but too sparse and thus needs inaccurate interpolation for a dense depth map. A stereo camera rig with two aligned cameras can generate disparity maps through stereo matching and depth calculation algorithms, as in the CityScapes dataset [31]. But this calculation has the theoretical drawback of a limited depth range determined by the baseline of the stereo camera, and stereo matching is sensitive to illumination variance between image pairs as well. Another way to get depth maps is through geometric image registration [15], [32], which recovers a dense 3D model by feature matching from monocular sequences. Although this method is time-consuming, it generates fairly accurate dense depth maps with relative scale, which is more general for depth prediction under different scenarios. MegaDepth [15] is the dataset most similar to ours, where structure from motion is used to reconstruct the 3D model and depth maps. However, its images come from the Internet, so they do not form multiple traverses under different environments and contain many dynamic objects.

Changing environments cast a great challenge on outdoor visual perception and localization of mobile robots, which makes it important to build a new dataset with multiple environments. Though [31] includes some environmental changes, there is still no real-world dataset with multi-environment traverses, which is essential for evaluating generalization across environmental changes and so far could only be found in virtual synthetic datasets [24], [33]. The synthetic depth maps are of high quality, but the RGB images differ markedly from real-world images, and therefore domain adaptation is indispensable when training on them. A detailed comparison of datasets is given in Section III-C.
B. Outdoor Monocular Depth Prediction Algorithms
Since depth maps cannot be directly obtained from a monocular camera in outdoor scenarios, monocular depth prediction aims to predict the depth map given a single RGB image. Early studies including MRF and other graphical models [30], [34], [35] largely depend on hand-crafted descriptors, which constrains the performance of depth prediction. Afterwards, numerous studies on convolutional neural networks [1]–[3], [36] have shown promising results for monocular depth estimation. Among these, [2] combines neural networks with a continuous CRF pixel by pixel, which differs from the regression formulation of depth prediction. Recent work [37] analyzes the mechanism by which CNNs infer depth maps from monocular images. Supervised methods [15], [16] and semi-supervised methods [17] for monocular depth prediction have been well studied through supervision with depth map ground truth. Recently, normal-based methods [14], [38], [39] have been proposed to regress the normal and depth map using geometric constraints.

However, since the ground truth of outdoor depth maps is effort-costly and time-consuming to obtain, unsupervised learning methods [18], [40]–[44] for depth estimation have appeared to solve this problem using stereo geometry information as a secondary supervisory signal. In [43], depth prediction is even treated as image reconstruction in an unsupervised manner. Additionally, left-right consistency [20], [40] and ego-motion pose constraints [44] have also been exploited in self-supervised ways [19], [21], [41], [45] to estimate depth maps.

Also, to avoid using expensive real-world depth map ground truth, other algorithms are trained on multi-environment synthetic virtual datasets [24] to leverage the high-quality depth maps. Such methods [22], [23], [46], [47] address the domain adaptation from the synthetic domain to the real-world domain by training the model on virtual datasets.

III. SEASONDEPTH DATASET
To study how depth prediction performance is influenced by environmental factors, a new dataset SeasonDepth is derived from the CMU Visual Localization dataset [11] through an SfM method with clustering-based RANSAC and HSV filtering refinement. The original CMU Visual Localization dataset [11] covers over 12 months in Pittsburgh, USA, including 12 different environmental conditions. It was collected from two cameras mounted on the left and right of a vehicle along an 8.5 km route, and is also used for long-term visual localization [5] by estimating the 6-DoF camera poses of images under different environments. We adopt the Urban area according to the categorization in [5], with 7 slices containing about 21,652 images in total, for the city-view scenarios of mobile robotics and autonomous driving. The different conditions consist of multiple weather and season combinations of Sunny, Cloudy, Overcast, Low Sun and Snow together with Foliage, Mixed Foliage and No Foliage. Some examples of the dataset are shown in Figure 2.

Fig. 2. Columns (a)-(e) present examples and depth map ground truths of the SeasonDepth dataset under the Sunny + No Foliage, Overcast + Foliage, Low Sun + Mixed Foliage, Cloudy + Mixed Foliage and Cloudy + Foliage conditions respectively.
A. Image Reconstruction with Clustering-based RANSAC
We reconstruct a dense model for each traversal under every environmental condition through a multi-view stereo (MVS) pipeline [48]. Specifically, COLMAP [48], [49] is used to obtain depth maps through photometric and geometric consistency from sequential images. To make full use of the image sequences, we set the sequential matching overlap to 15 instead of the whole sequence, improving local optimization with less noise.

Since there are many dynamic objects along the road in every image sequence, we devise a modified RANSAC algorithm in the MVS pipeline based on COLMAP. During each RANSAC iteration in triangulation, the depths of all pixels are first reconstructed, even with heavy noise due to dynamic objects, and then K-means clustering is applied based on 3D distance. After this rough clustering, inliers and outliers are determined based on a maximum relative matching distance of 0.55, and any whole cluster with an inlier ratio below 50% is labeled as outliers as well. After optimizing the camera poses and the depth values of the inliers through bundle adjustment, the iteration continues until the total inlier ratio surpasses the minimum inlier ratio of 0.65, and the outliers are finally set to an empty value. An example of filtering a dynamic object can be found in Figure 3-(a).

The MVS algorithm also generates ground truth maps with some erroneous points that are far out of range or too close, like clouds in the sky or noisy points on the very near road. Consequently, points exceeding the normal range of the depth map are rejected: pixels between 5% and 92% of the largest original depth are kept, which preserves as many valid pixels of the original depth map as possible while rejecting most noise pixels.
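As a rough sketch of the two filtering steps above (the function names, the fixed number of clusters and the use of scikit-learn's KMeans are our illustrative assumptions, not the released implementation):

```python
# Sketch of the clustering-based outlier filtering and depth-range clipping
# described above. Assumes triangulated points as an (N, 3) NumPy array.
import numpy as np
from sklearn.cluster import KMeans

def label_outliers(points3d, n_clusters=8,
                   max_rel_dist=0.55, min_cluster_inlier_ratio=0.5):
    """Return a boolean inlier mask for triangulated 3D points."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(points3d)
    inlier = np.zeros(len(points3d), dtype=bool)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        dist = np.linalg.norm(points3d[idx] - points3d[idx].mean(axis=0), axis=1)
        # Points beyond the maximum relative matching distance (0.55 in the
        # paper) within their cluster are treated as outliers.
        ok = dist / (dist.max() + 1e-9) <= max_rel_dist
        # A whole cluster with an inlier ratio below 50% is rejected, which
        # is what removes dynamic objects such as moving vehicles.
        if ok.mean() < min_cluster_inlier_ratio:
            ok[:] = False
        inlier[idx] = ok
    return inlier

def clip_depth_range(depth, low=0.05, high=0.92):
    """Keep pixels between 5% and 92% of the largest original depth;
    everything else is set to the empty value (0)."""
    d_max = depth.max()
    keep = (depth >= low * d_max) & (depth <= high * d_max)
    return np.where(keep, depth, 0.0)
```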
B. Depth Map Refinement
Although refinement based on semantic segmentation is commonly used after SfM in previous work [15], it is quite challenging to generate semantic segmentation maps for images under drastically changing environments. Instead, we adopt a simple but effective noise filtering method for refinement. After the depth maps are generated from COLMAP, we investigate the noise distribution with respect to pixel values in color space; for example, blue pixels usually appear in the sky, which tends to be noise in most cases. Instead of filtering the noise pixels in RGB color space, we use the HSV color space to remove noisy areas. For most clear or cloudy sky, the Value tends to be high, around 100, and the Hue is usually blue or white. For areas in the shadow of low sun, the Saturation and Value are extremely low, around 10, so the depth map pixels are too hard to reconstruct correctly and need to be filtered. A sample of filtering in HSV space is shown in Figure 3-(b). Since pixels of the near road rarely overlap across the multiple views used for local optimization, noise in these areas often occurs and lies far beyond the range of neighboring pixels; such depth map noise can be filtered through its specific position on the image.

Fig. 3. The filtered areas are shown in dashed red blocks while the reconstructed areas are shown in dashed green blocks. Column (a) shows the effect of RANSAC removing a dynamic object while keeping static objects. Column (b) shows the effective refinement in HSV color space for special noisy areas.
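A minimal sketch of this HSV-based refinement with OpenCV is given below; the concrete thresholds are illustrative assumptions around the values reported above (Value near 100 for sky, Saturation and Value near 10 for shadows), not the authors' exact ones.

```python
# Minimal sketch of the HSV-space refinement described above, using OpenCV.
# The threshold values are illustrative assumptions, not the released ones.
import cv2
import numpy as np

def refine_depth_hsv(bgr_image, depth):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    # Bright pixels with a blue-ish hue or near-white (low saturation):
    # likely sky, where the reconstructed depth tends to be noise.
    sky = (v > 100) & (((h > 90) & (h < 130)) | (s < 30))
    # Extremely dark, unsaturated pixels: shadows under low sun, where
    # depth can hardly be reconstructed correctly.
    shadow = (s < 10) & (v < 10)
    refined = depth.copy()
    refined[sky | shadow] = 0.0  # set filtered pixels to the empty value
    return refined
```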
C. Comparison with Other Datasets
The current datasets are introduced in Section II-A. The comparison between SeasonDepth and several popular datasets is shown in Table I. SeasonDepth contains comprehensive and multiple outdoor real-world environmental conditions for every scene sequence, as the synthetic virtual dataset V-KITTI does. Though CityScapes includes different environments, it lacks traverses under different conditions and so cannot be used to evaluate performance across changing environments. Similar to the outdoor DIW and MegaDepth datasets, our depth maps are scaleless: they contain the relative depth values and geometry of the images, and the values need to be normalized for evaluation as the following section shows. The depth map ground truths are dense compared to LiDAR-based sparse depth maps. Also, dynamic objects are removed to empty values through RANSAC while static vehicles are kept, which ensures the diversity of objects for evaluation.

TABLE I: COMPARISON BETWEEN SeasonDepth AND OTHER DATASETS

Name            | Scene            | Real or Virtual | Depth Value | Sparse or Dense | Multiple Traverses | Different Environments | Dynamic Objects
NYUv2 [25]      | Indoor           | Real    | Absolute | Dense  | × | × | ✓
DIML [26]       | Indoor           | Real    | Absolute | Dense  | × | × | ×
iBims-1 [27]    | Indoor           | Real    | Absolute | Dense  | × | × | ×
Make3D [30]     | Outdoor & Indoor | Real    | Absolute | Sparse | × | × | ×
ReDWeb [29]     | Outdoor & Indoor | Real    | Relative | Dense  | × | × | ✓
WSVD [28]       | Outdoor & Indoor | Real    | Relative | Dense  | × | × | ✓
HR-WSI [16]     | Outdoor & Indoor | Real    | Absolute | Dense  | × | × | ✓
KITTI [12]      | Outdoor          | Real    | Absolute | Sparse | × | × | ✓
CityScapes [31] | Outdoor          | Real    | Absolute | Dense  | × | ✓ | ✓
DIW [32]        | Outdoor          | Real    | Relative | Sparse | × | × | ✓
MegaDepth [15]  | Outdoor          | Real    | Relative | Dense  | × | × | ✓
DDAD [21]       | Outdoor          | Real    | Absolute | Dense  | × | × | ✓
V-KITTI [24]    | Outdoor          | Virtual | Absolute | Dense  | ✓ | ✓ | ✓
SYNTHIA [33]    | Outdoor          | Virtual | Absolute | Dense  | × | × | ×
SeasonDepth     | Outdoor          | Real    | Relative | Dense  | ✓ | ✓ | ×

Fig. 4. Example of depth adjustment and normalization of evaluation results. Column (a) shows the RGB image and depth map ground truth. Columns (b) (c) (d) are results of PackNet [21], VNL [14] and GASDA [23]. The depth maps in the first row are the original depth outputs, while those in the second row are the adjusted depth map results obtained through Equation (1).

IV. BENCHMARK SETUP
In this section, several metrics are proposed for SeasonDepth to benchmark state-of-the-art open-source algorithms with pretrained models from the KITTI benchmark [13].
A. Evaluation Metrics
The challenge for the evaluation lies in two points: one is to cope with the scaleless and partially-valid dense depth map ground truth, while the other is to find appropriate metrics to fully measure depth prediction performance across different environments.

In the SeasonDepth dataset, the scale of the depth map ground truth generated by the SfM method is relative. In order to accommodate algorithms that predict depth maps with absolute scale, and different from the way [44] does, the predicted depth maps of all evaluated methods are normalized to match the scaleless ground truth based on both the mean value and the variance under a Gaussian distribution assumption, aligning the distribution of the predicted map to the ground truth. First, for each predicted and ground truth map, the valid pixels $D^{i,j}_{valid\_pred}$ of the predicted depth map $D_{valid\_pred}$ are determined by correspondence with the non-empty valid pixels $D^{i,j}_{valid\_GT}$ of the depth map ground truth. Then the average and variance over the valid pixels of both $D_{valid\_GT}$ and $D_{valid\_pred}$ are calculated as $Avg_{GT}$, $Avg_{pred}$ and $Var_{GT}$, $Var_{pred}$. We normalize the depth map to $D_{adj}$ so that it has the same distribution as $D_{valid\_GT}$:

$$D_{adj} = \left(D_{pred} - Avg_{pred}\right) \times \sqrt{\frac{Var_{GT}}{Var_{pred}}} + Avg_{GT} \quad (1)$$

After this operation, the difference in scale across datasets is reduced for the current methods, which makes this zero-shot evaluation dataset reliable and usable for all models no matter what datasets they were trained on, improving the robustness and accuracy of the evaluation of generalization ability. Denote the adjusted valid predicted depth map $D_{adj}$ of Equation (1) as $D^P$ in the following formulation. Results of adjusted depth predictions are shown in Figure 4.

The most commonly used evaluation metrics for monocular depth prediction are AbsRel, SqRel, RMSE, RMSElog, SILog, $\delta < 1.25$, $\delta < 1.25^2$ and $\delta < 1.25^3$ [13]. However, for the evaluation under changing environments, we choose the most distinguishable metrics under multiple environments, AbsRel and $\delta < 1.25$ (denoted $a$), to benchmark the performance of the baselines. For environment $k$, $AbsRel_k$ and $a_k$ are calculated as in Equations (2) and (3), where $\mathbb{1}(\cdot)$ is the indicator function:

$$AbsRel_k = \frac{1}{n}\sum_{i,j} \left| D^{P_k}_{i,j} - D^{GT_k}_{i,j} \right| \Big/ D^{GT_k}_{i,j} \quad (2)$$

$$a_k = \frac{1}{n}\sum_{i,j} \mathbb{1}\!\left( \max\left\{ \frac{D^{P_k}_{i,j}}{D^{GT_k}_{i,j}}, \frac{D^{GT_k}_{i,j}}{D^{P_k}_{i,j}} \right\} < 1.25 \right) \quad (3)$$
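A minimal NumPy sketch of Equations (1)-(3), assuming empty ground-truth pixels are stored as zeros (the helper names are ours, not from the released evaluation code):

```python
# Sketch of the normalization of Equation (1) and the per-environment
# metrics of Equations (2)-(3); empty ground-truth pixels are assumed 0.
import numpy as np

def adjust(d_pred, d_gt):
    """Eq. (1): align mean and variance of the prediction with the
    scaleless ground truth over the valid (non-empty) pixels."""
    valid = d_gt > 0
    p, g = d_pred[valid], d_gt[valid]
    return (d_pred - p.mean()) * np.sqrt(g.var() / p.var()) + g.mean()

def abs_rel(d_adj, d_gt):
    """Eq. (2): mean absolute relative error over valid pixels."""
    valid = d_gt > 0
    return float(np.mean(np.abs(d_adj[valid] - d_gt[valid]) / d_gt[valid]))

def delta_125(d_adj, d_gt):
    """Eq. (3): fraction of valid pixels whose ratio max is below 1.25.
    Assumes the adjusted depths are strictly positive on valid pixels."""
    valid = d_gt > 0
    ratio = np.maximum(d_adj[valid] / d_gt[valid], d_gt[valid] / d_adj[valid])
    return float(np.mean(ratio < 1.25))
```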
For the evaluation under different environments, six secondary metrics are proposed based on these original metrics in Equations (4)-(9):

$$AbsRel_{avg} = \frac{1}{m}\sum_{k} AbsRel_k \quad (4)$$

$$AbsRel_{var} = \frac{1}{m}\sum_{k} \left| AbsRel_k - \frac{1}{m}\sum_{k} AbsRel_k \right| \quad (5)$$

$$AbsRel_{dev} = \frac{\max_k\{AbsRel_k\} - \min_k\{AbsRel_k\}}{\frac{1}{m}\sum_{k} AbsRel_k} \quad (6)$$

$$a_{avg} = \frac{1}{m}\sum_{k} a_k \quad (7)$$

$$a_{var} = \frac{1}{m}\sum_{k} \left| a_k - \frac{1}{m}\sum_{k} a_k \right| \quad (8)$$

$$a_{dev} = \frac{\max_k\{1 - a_k\} - \min_k\{1 - a_k\}}{\frac{1}{m}\sum_{k} (1 - a_k)} \quad (9)$$

where the avg terms (4), (7) and var terms (5), (8) come from the average and variance in statistics, indicating the average performance and the fluctuation around the mean across multiple environments respectively. Considering the practical case of depth prediction, it is more rigorous to penalize fluctuation of better results than that of worse results under changing conditions. We therefore also propose the relative deviation terms (6), (9), which calculate the relative difference between the maximum and minimum over all environments. The relative deviation terms are formulated on AbsRel and $1 - a$; they are the lower the better and stricter than the variance terms (5), (8).
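Given the per-environment scores, Equations (4)-(9) reduce to simple statistics; a sketch under the same assumptions as above, with illustrative input names:

```python
# Sketch of the secondary metrics of Equations (4)-(9), given one AbsRel_k
# and one a_k score per environment (input names are illustrative).
import numpy as np

def cross_env_metrics(abs_rel_k, a_k):
    r, a = np.asarray(abs_rel_k), np.asarray(a_k)
    return {
        "AbsRel_avg": r.mean(),                                    # Eq. (4)
        "AbsRel_var": np.abs(r - r.mean()).mean(),                 # Eq. (5)
        "AbsRel_dev": (r.max() - r.min()) / r.mean(),              # Eq. (6)
        "a_avg": a.mean(),                                         # Eq. (7)
        "a_var": np.abs(a - a.mean()).mean(),                      # Eq. (8)
        "a_dev": ((1 - a).max() - (1 - a).min()) / (1 - a).mean(), # Eq. (9)
    }
```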
B. Evaluated Algorithms

As introduced in Section II-B, the best open-source depth prediction methods on the KITTI benchmark [13] that we evaluate on the SeasonDepth dataset include supervised, semi-supervised, unsupervised and self-supervised methods. For the supervised methods, we choose VNL [14], MegaDepth [15] and SGRL [16]. VNL proposes the virtual normal, which provides a stable geometric constraint for long-range relations in a global view and helps depth prediction. MegaDepth introduces an end-to-end hourglass network for depth prediction with several scale-invariant and ordinal depth losses using semantic and geometric information, and validates on multiple datasets. SGRL uses a novel sampling strategy for better structure and important regions from sparse depth maps through a structure-guided ranking loss, and conducts cross-dataset validation. The semi-supervised method semiDepth [17] is also benchmarked, which explicitly exploits left-right consistency as a loss function along with supervised learning.

We choose adareg [18], monoResMatch [19], Monodepth2 [20] and PackNet [21] as unsupervised or self-supervised methods for depth prediction. adareg introduces a novel bilateral consistency constraint on the left-right disparity and an adaptive regularization technique. monoResMatch leverages traditional stereo algorithms to find proxy labels for predicting depth maps in a self-supervised way. Monodepth2 proposes a simple model using geometric information for robustness under occlusions and camera motion. PackNet uses a self-supervised method combining geometry with novel symmetrical packing and unpacking blocks using 3D convolutions for depth prediction.

For models trained on the V-KITTI dataset with multiple environments, we use two recent competitive algorithms, T2Net [22] and GASDA [23], as baselines. T2Net is fully supervised on both the KITTI and Virtual KITTI datasets and enables training of synthetic-to-real translation and depth prediction simultaneously, while GASDA is unsupervised for real-world images, incorporating a geometry-aware loss through warping of stereo images together with image translation from the synthetic to the real-world domain. The evaluation results are shown in Section V.

V. ZERO-SHOT EXPERIMENTAL EVALUATION
In this section, the supervised and semi-supervised methods VNL [14], MegaDepth [15], SGRL [16] and semiDepth [17], the unsupervised and self-supervised methods adareg [18], monoResMatch [19], Monodepth2 [20] and PackNet [21], and the methods trained on multi-environment virtual datasets, T2Net [22] and GASDA [23], are evaluated with the metrics introduced in Section IV-A on the test set of SeasonDepth, consisting of 4 slices with about 18,367 images, directly using the state-of-the-art well-pretrained models from the KITTI benchmark. Specifically, for adareg, GASDA, MegaDepth and SGRL, we use the single released model for evaluation. Besides, we use the mono+stereo 640x192 model for Monodepth2, the KITTI pretrained model for monoResMatch, the ResNet18 self-supervised 192x640 model with ImageNet pretraining for PackNet, the ResNet50 Eigen fine-tuned CityScapes model for semiDepth, the weakly-supervised outdoor model for T2Net, and the KITTI pretrained model for VNL.

A. Evaluation Comparison from Overall Metrics
From the benchmark of the current open-source best depth prediction baselines in Table II, it can be seen that all methods perform worse on the challenging SeasonDepth dataset than on the KITTI dataset. VNL [14], semiDepth [17] and Monodepth2 [20] present relatively impressive results on both KITTI and SeasonDepth, while MegaDepth [15] performs better relative to the other baselines on SeasonDepth than on KITTI, showing the significance of scale-invariant and ordinal depth losses when training on a scaleless dataset and validating on multiple datasets. Additionally, PackNet [21] and monoResMatch [19] show limited generalization ability compared with the other methods on SeasonDepth, although they perform relatively well on KITTI.

For the AbsRel of SeasonDepth Average and Variance, the geometric-constraint-free methods SGRL [16] and T2Net [22] perform relatively worse than the other methods, from which it can be inferred that geometry information is essential to the robustness of monocular depth prediction under changing environments. Focusing on SeasonDepth Average, SGRL [16] and PackNet [21] perform better on a than on AbsRel, which hints that their poor AbsRel performance might be due to extreme but limited outliers in the depth maps across multiple environments.
TABLE II: EVALUATION RESULTS ON THE KITTI AND SeasonDepth DATASETS (↓: lower is better, ↑: higher is better)

Method            | KITTI          | SeasonDepth Average | SeasonDepth Variance        | SeasonDepth Deviation
                  | AbsRel↓   a↑   | AbsRel↓   a↑        | AbsRel (10^-)↓  a (10^-)↓   | AbsRel↓   a↓
VNL [14]          | 0.0720   0.938 | 0.297    0.543      | 0.114    0.187              | 0.476    0.397
MegaDepth [15]    | 0.220    0.632 | 0.505    0.426      | 0.0883   0.0338             | 0.197    0.128
SGRL [16]         | 0.179    0.746 | 1.15     0.417      | 0.577    0.0305             | 0.229    0.0974
semiDepth [17]    | 0.0784   0.923 | 0.406    0.465      | 0.119    0.104              | 0.276    0.190
adareg [18]       | 0.126    0.840 | 0.510    0.416      | 0.116    0.0579             | 0.224    0.132
Monodepth2 [20]   | 0.106    0.876 | 0.403    0.451      | 0.0568   0.0695             | 0.225    0.170
monoResMatch [19] | 0.096    0.890 | 0.483    0.391      | 0.306    0.103              | 0.440    0.181
PackNet [21]      | 0.116    0.865 | 0.718    0.429      | 0.173    0.0612             | 0.196    0.146
T2Net [22]        | 0.168    0.769 | 0.807    0.400      | 0.405    0.0747             | 0.256    0.148
GASDA [23]        | 0.143    0.836 | 0.434    0.409      | 0.111    0.0654             | 0.305    0.184

Fig. 5. Diagrams (a) and (b) show the detailed evaluation results for AbsRel and a on the SeasonDepth dataset under 12 different environments with dates. The abbreviations of the x-tick labels are S for Sunny, C for Cloudy, O for Overcast, LS for Low Sun, Sn for Snow, F for Foliage, NF for No Foliage, and MF for Mixed Foliage.

Coming to the Variance and Deviation, thanks to effective cross-dataset validation, MegaDepth [15] gives relatively stable performance on both AbsRel and a across multiple environments, and SGRL [16] shows impressive performance on the Variance and Deviation of the a metric despite bad performance on the AbsRel Average and Variance. On the contrary, although VNL [14] and semiDepth [17] perform the best on KITTI and SeasonDepth Average, they clearly suffer from large fluctuations under changing environments in terms of Variance and Deviation, especially on the a metric.

B. Performance under Different Environmental Conditions
The evaluated performance under different environments is illustrated in Figure 5 using lines in different colors and markers. From Figure 5-(a), it can be seen that although different methods perform differently on AbsRel, the influence of certain environments is similar for all methods. Most methods perform well under Sunny + Foliage in September, which is close to the environmental condition of the KITTI dataset. However, Cloudy + Foliage in October, Low Sun + Mixed Foliage in November and Low Sun + No Foliage + Snow in December pose a great challenge to depth prediction for most algorithms.

From Figure 5-(b), the trend of a under environmental impact is less consistent among all the algorithms but mainly similar to that on AbsRel, especially the good performance under Sunny + Foliage in September and the bad performance under Low Sun + No Foliage + Snow in December. For example, for all methods except MegaDepth [15], PackNet [21] and SGRL [16], Low Sun + No Foliage + Snow in December largely degrades depth prediction, indicating there is still a long way to go for robust long-term depth prediction. Nevertheless, a multi-environment virtual synthetic dataset with domain adaptation is beneficial to robustness to changing environments, judging from the performance of T2Net [22] and GASDA [23], whose only bad results occur under snowy weather, which is not included in the synthetic V-KITTI dataset.

VI. CONCLUSION
In this paper, a new dataset, SeasonDepth, and evaluation metrics are developed for monocular depth prediction under different environments. The best open-source supervised, semi-supervised and self-supervised depth prediction algorithms from the KITTI benchmark are evaluated. Through the experimental evaluation, we find that there is still a long way to go to achieve robust long-term monocular depth prediction. Several promising aspects for future work lie in involving geometry information in model training and in scale-invariant losses to improve generalization ability. Besides, multi-dataset validation and domain adaptation with training on virtual datasets containing images under multiple environmental conditions are beneficial to robustness under changing real-world environments.
REFERENCES

[1] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 2366–2374.
[2] F. Liu, C. Shen, G. Lin, and I. Reid, "Learning depth from single monocular images using deep convolutional neural fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024–2039, 2015.
[3] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, "Deeper depth prediction with fully convolutional residual networks," in International Conference on 3D Vision (3DV). IEEE, 2016, pp. 239–248.
[4] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe, "Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5354–5362.
[5] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, et al., "Benchmarking 6DOF outdoor visual localization in changing conditions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8601–8610.
[6] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, "Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[7] J. Spencer, R. Bowden, and S. Hadfield, "DeFeat-Net: General monocular depth via simultaneous unsupervised representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[8] M. Larsson, E. Stenborg, L. Hammarstrand, M. Pollefeys, T. Sattler, and F. Kahl, "A cross-season correspondence dataset for robust semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9532–9542.
[9] M. Larsson, E. Stenborg, C. Toft, L. Hammarstrand, T. Sattler, and F. Kahl, "Fine-grained segmentation networks: Self-supervised segmentation for improved long-term visual localization," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 31–41.
[10] T. Jenicek and O. Chum, "No fear of the dark: Image retrieval under varying illumination conditions," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9696–9704.
[11] H. Badino, D. Huber, and T. Kanade, "The CMU Visual Localization Data Set," http://3dvis.ri.cmu.edu/data-sets/localization, 2011.
[12] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3354–3361.
[13] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity invariant CNNs," in International Conference on 3D Vision (3DV), 2017.
[14] W. Yin, Y. Liu, C. Shen, and Y. Yan, "Enforcing geometric constraints of virtual normal for depth prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5684–5693.
[15] Z. Li and N. Snavely, "MegaDepth: Learning single-view depth prediction from internet photos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2041–2050.
[16] K. Xian, J. Zhang, O. Wang, L. Mai, Z. Lin, and Z. Cao, "Structure-guided ranking loss for single image depth prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 611–620.
[17] A. J. Amiri, S. Y. Loo, and H. Zhang, "Semi-supervised monocular depth estimation with left-right consistency using deep neural network," IEEE, 2019, pp. 602–607.
[18] A. Wong and S. Soatto, "Bilateral cyclic constraint and adaptive regularization for unsupervised monocular depth prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5644–5653.
[19] F. Tosi, F. Aleotti, M. Poggi, and S. Mattoccia, "Learning monocular depth estimation infusing traditional stereo knowledge," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9799–9809.
[20] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into self-supervised monocular depth estimation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3828–3838.
[21] V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, "3D packing for self-supervised monocular depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2485–2494.
[22] C. Zheng, T.-J. Cham, and J. Cai, "T2Net: Synthetic-to-realistic translation for solving single-image depth estimation tasks," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 767–783.
[23] S. Zhao, H. Fu, M. Gong, and D. Tao, "Geometry-aware symmetric domain adaptation for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9788–9798.
[24] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, "Virtual worlds as proxy for multi-object tracking analysis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4340–4349.
[25] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in European Conference on Computer Vision. Springer, 2012, pp. 746–760.
[26] Y. Kim, H. Jung, D. Min, and K. Sohn, "Deep monocular depth estimation via integration of global and local predictions," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 4131–4144, 2018.
[27] T. Koch, L. Liebel, F. Fraundorfer, and M. Körner, "Evaluation of CNN-based single-image depth estimation methods," in European Conference on Computer Vision Workshop (ECCV-WS), L. Leal-Taixé and S. Roth, Eds. Springer International Publishing, 2018, pp. 331–348.
[28] C. Wang, S. Lucey, F. Perazzi, and O. Wang, "Web stereo video supervision for depth prediction from dynamic scenes," IEEE, 2019, pp. 348–357.
[29] K. Xian, C. Shen, Z. Cao, H. Lu, Y. Xiao, R. Li, and Z. Luo, "Monocular relative depth perception with web stereo data supervision," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[30] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, 2008.
[31] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[32] W. Chen, Z. Fu, D. Yang, and J. Deng, "Single-image depth perception in the wild," in Advances in Neural Information Processing Systems, 2016, pp. 730–738.
[33] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3234–3243.
[34] A. Saxena, S. H. Chung, and A. Y. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, 2006, pp. 1161–1168.
[35] B. Liu, S. Gould, and D. Koller, "Single image depth estimation from predicted semantic labels," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 1253–1260.
[36] X. Wang, D. Fouhey, and A. Gupta, "Designing deep networks for surface normal estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 539–547.
[37] J. Hu, Y. Zhang, and T. Okatani, "Visualization of convolutional neural networks for monocular depth estimation," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3869–3878.
[38] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, "Deep ordinal regression network for monocular depth estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
[39] U. Kusupati, S. Cheng, R. Chen, and H. Su, "Normal assisted stereo depth estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[40] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270–279.
[41] J. Watson, M. Firman, G. J. Brostow, and D. Turmukhambetov, "Self-supervised monocular depth hints," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2162–2171.
[42] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid, "Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 340–349.
[43] R. Garg, V. K. BG, G. Carneiro, and I. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," in European Conference on Computer Vision. Springer, 2016, pp. 740–756.
[44] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851–1858.
[45] A. Johnston and G. Carneiro, "Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[46] Y. Chen, W. Li, X. Chen, and L. V. Gool, "Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1841–1850.
[47] B. Bozorgtabar, M. S. Rad, D. Mahapatra, and J.-P. Thiran, "SynDeMo: Synergistic deep feature alignment for joint learning of depth and ego-motion," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4210–4219.
[48] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys, "Pixelwise view selection for unstructured multi-view stereo," in European Conference on Computer Vision. Springer, 2016, pp. 501–518.
[49] J. L. Schönberger and J.-M. Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.