Multi-Modal Depth Estimation Using Convolutional Neural Networks
Sadique Adnan Siddiqui, Axel Vierling and Karsten Berns
Robotics Research Lab, Dept. of Computer Science, TU Kaiserslautern, Kaiserslautern, Germany
Abstract — This paper addresses the problem of dense depth prediction from sparse distance sensor data and a single camera image in challenging weather conditions. This work explores the significance of different sensor modalities, such as camera, Radar, and Lidar, for estimating depth with Deep Learning approaches. Although Lidar has higher depth-sensing ability than Radar and has been fused with camera images in many previous works, CNN-based depth estimation from the fusion of robust Radar distance data and camera images has received little attention. In this work, a deep regression network is proposed that follows a transfer learning approach: an encoder, initialized from a high-performing pre-trained model, extracts dense features, and a decoder upsamples them and predicts the desired depth. Results are demonstrated on Nuscenes, KITTI, and a Synthetic dataset created with the CARLA simulator. In addition, top-view zoom-camera images captured from a crane on a construction site are evaluated to estimate the distance of the crane boom carrying heavy loads from the ground, demonstrating usability in safety-critical applications.
I. INTRODUCTION

In recent years, there has been a meteoric rise in autonomous driving and automation research, paired with the undeniable success of machine learning approaches on computer vision tasks. Although advanced driving assistance systems can significantly reduce accidents caused by human-controlled driving, there is still considerable room for improvement concerning human safety. Fully autonomous vehicles driving in every part of the world remain a distant goal; achieving a standard level of autonomy with sustainable assistance systems that mitigate the chance of crashes due to human error is the pressing priority. A sense of depth is crucial for understanding the world around us. Humans can easily assess the distance between objects through their binocular vision. Intelligent systems do not have the insight a human has; they need to assign specific features to every object visible in the scene so that they can recognize objects and infer their semantics and geometric structure. For computer vision in the autonomous driving field, sensors continuously capture information, either in the form of images or point clouds, to build a representation that the vehicle can understand. Depth information, i.e. the distance of objects from the vehicle captured by depth sensors, is vital for dynamic driving tasks such as avoiding collisions with obstacles. Several sensors are used to estimate robust depth, such as stereo cameras, depth cameras, and Lidars, but they have their limitations. Lidars are usually expensive and produce sparse depth point clouds, whereas stereo cameras require careful calibration of both cameras.
Fig. 1: Prediction from the deep regression network trained by fusing camera images taken in different weather conditions (such as fog, as depicted in the figure) with sparse Radar depth input.

Stereo cameras often result in mediocre performance in low light and on occluded or weakly textured regions. Motion sensors such as the Kinect are sensitive to sunlight and are not suitable for outdoor applications. Due to these limitations, depth estimation using Deep Learning approaches has gained much attention in recent years, and its fusion with a robust depth sensor such as Radar seems promising. The significant contribution of this work is a Convolutional Neural Network based regression network that takes as input a fusion of sparse Radar point cloud data and a camera image and predicts a full-resolution depth image for images recorded in challenging weather conditions such as day, night, cloud, fog, and rain. A positive aspect of using Radar is its robustness to adverse weather conditions. A Radar sensor is cheap, lightweight, small, and capable of measuring a considerably larger distance than Lidar. RGB images provide information about the appearance and texture of the scene, whereas depth sensors provide information about the shape of the object that is invariant to color alteration and illumination [1].

Furthermore, it is demonstrated that the presented work can be extended to predict depth on top-view zoomed images, which can be utilized for a better understanding of the surroundings when viewed from cranes. Here a stereo zoom camera is very challenging to use, as it requires careful calibration of the cameras depending on the zoom, and neither Lidar nor normal stereo cameras can be used due to the distance. Safety, however, is of utmost concern due to the sheer size of a crane and the weight of the hoisted objects.

Fig. 2: Encoder-Decoder architecture with RGB and sparse data fusion as input. For networks using only a 3-channel input, the sparse data branch should be removed.

II. RELATED WORK

Depth estimation and image reconstruction from a single RGB image is an "ill-posed problem", as many 3D scenes project to the same 2D image because of scale ambiguity. There have been significant works on depth estimation using stereo and depth cameras, but they have their limitations. In recent times, monocular depth estimation algorithms using Deep Learning have gained much recognition, as they do not rely on hand-crafted features: features are mostly extracted by a Convolutional Neural Network, resulting in superior results.
Monocular Depth Estimation - One of the early works was proposed by Saxena et al. [2], involving the use of probabilistic models such as Markov Random Fields (MRF) or Conditional Random Fields (CRF) on handcrafted features extracted from a single camera image. Eigen et al. [3] proposed a two-stack convolutional neural network and directly regressed the pixels for the depth estimation task. Laina et al. [4] developed a deep residual network using ResNet as an encoder and a custom decoder; they proposed an "efficient scheme for upconvolutions using skip connections to create up-projection blocks for the effective upsampling of feature maps". Fu et al. [5] reformulated depth learning from a regression problem into an ordinal quantized regression problem. Supervised methods for monocular depth estimation require the creation of ground truth for training, which is an expensive and time-consuming task. To overcome this problem, many unsupervised and self-supervised learning approaches for depth estimation have been proposed that do not require ground-truth depth at training time [6], [7]. However, they require a stereo image pair for training, which is computationally expensive, and using images of the same modality is prone to errors in noisy environments.
Multi-Modal Depth Estimation - Some deep learning models, when trained with data from different modalities, perform exceptionally well on tasks such as semantic segmentation and depth estimation by utilizing vital information from both modalities. For depth estimation tasks, sparse depth input from Lidars or Radars can be propitious when fused with noisy RGB images. Ma et al. [8] proposed a supervised learning approach to train an encoder-decoder model by fusing RGB with sparse data sampled randomly from the ground-truth depth image. Following their previous work, they trained a network [9] with a semi-dense ground-truth depth map, using a fusion of sparse Lidar point clouds and an RGB image as input. In comparison with previous works, the proposed method predicts a full-resolution depth image and explores the importance of sparse Radar measurements which, when fused with camera images, can supplement RGB images in areas with low visibility by providing distance data, thus helping to achieve low-level autonomy in vehicles.

III. METHODOLOGY

In this section, the proposed architecture is described. Convolutional Neural Networks are used for the dense depth prediction task. The data augmentation techniques and loss functions used for efficiently training the network are also discussed.
A. Architecture
In computer vision, dense problems that require per-pixel predictions, such as depth estimation and semantic segmentation, are usually addressed with an encoder-decoder architecture. The encoder maps raw inputs into a feature representation and downsamples the spatial resolution of the inputs. In contrast, the decoder takes these feature representations as input, processes them, and produces an output that is the closest match to the original input. The decoder thus works in the opposite direction to the encoder: it consists of one or more layers that increase the spatial resolution of the downsampled feature representations coming from the encoder. The final structure is based on the work of Alhashim et al. [10], as it utilizes a transfer learning approach and achieved state-of-the-art accuracy in RGB-based depth prediction on the NYU dataset. The network is modified so that it fits the presented problem with a fused input of different modalities. The encoder used in this work is DenseNet-169 pre-trained on ImageNet [11]; experiments with ResNet-50 were also conducted, but the errors were comparatively high compared to the model trained with DenseNet. As the pre-trained networks are restricted to 3-channel inputs, a 1x1 convolution layer is used to truncate the four-channel input to three channels before passing it to the network. A 1x1 filter has only a single parameter per input channel, so it is widely used for dimensionality reduction. Separable convolutions were attempted instead of standard convolution layers, but they resulted in substandard performance of the network. The resulting feature map extracted from the encoder is passed through a series of upsampling blocks, each comprising a bilinear upsampling layer followed by two convolution layers with 3x3 filters and output channels set to half the channels of the input; the result of the upsampling layer is concatenated with the encoder output of the same spatial dimension, similar to [12]. A 3x3 convolution filter is favored as it looks at a few pixels at a time, extracting a large number of local features that are highly useful in later layers. Fig. 2 gives an overview of the architecture.
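For illustration, the following is a minimal TensorFlow/Keras sketch of how such a fusion encoder-decoder could be assembled: a 1x1 convolution reduces the four-channel RGB plus sparse-depth input to three channels, an ImageNet-pretrained DenseNet-169 acts as the encoder, and bilinear upsampling blocks with skip connections form the decoder. The skip-layer names (which vary between Keras versions), the input resolution, and the output head are assumptions for the sketch, not the authors' released implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, applications


def build_fusion_depth_net(input_shape=(384, 1248, 4)):
    """Sketch: 4-channel RGB + sparse-depth input, 1x1 conv to 3 channels,
    DenseNet-169 encoder, bilinear-upsampling decoder with skip connections."""
    inputs = layers.Input(shape=input_shape)

    # 1x1 convolution truncates the 4-channel fused input to 3 channels so
    # that the ImageNet-pretrained encoder can be reused unchanged.
    x = layers.Conv2D(3, kernel_size=1, padding='same')(inputs)

    # DenseNet-169 encoder initialised with ImageNet weights.
    encoder = applications.DenseNet169(include_top=False, weights='imagenet',
                                       input_shape=input_shape[:2] + (3,))
    # Encoder feature maps kept for skip connections (layer names assumed).
    skip_names = ['conv1/relu', 'pool1', 'pool2_pool', 'pool3_pool']
    encoder_taps = Model(
        encoder.input,
        [encoder.get_layer(n).output for n in skip_names] + [encoder.output])
    *skips, bottleneck = encoder_taps(x)

    def up_block(tensor, skip):
        # Bilinear upsampling, concatenation with the matching encoder map,
        # then two 3x3 convolutions with half the input channels.
        ch = tensor.shape[-1] // 2
        t = layers.UpSampling2D(interpolation='bilinear')(tensor)
        t = layers.Concatenate()([t, skip])
        t = layers.Conv2D(ch, 3, padding='same', activation='relu')(t)
        t = layers.Conv2D(ch, 3, padding='same', activation='relu')(t)
        return t

    d = bottleneck
    for skip in reversed(skips):
        d = up_block(d, skip)

    depth = layers.Conv2D(1, 3, padding='same')(d)  # predicted (reciprocal) depth
    return Model(inputs, depth)
```

As in [10], the decoder in this sketch stops at half the input resolution, which matches the later statement that predictions are upscaled from 192x624 to 384x1248 for evaluation.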
B. Data Augmentation
Data Augmentation is a strategy for increasing the amount of data and inducing diversity in the datasets used for training Deep Learning models. For this work, augmentation is performed in an online manner; transformations such as random horizontal flipping, random contrast in the range [0.9, 1.1], and random brightness in the range [0.75, 1.25] are each applied with a fifty percent chance.
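A minimal sketch of this online augmentation in TensorFlow is shown below. The multiplicative form of the brightness jitter and the independent per-transform coin flips are assumptions about the exact implementation.

```python
import tensorflow as tf


def augment(rgb, sparse_depth, gt_depth):
    """Online augmentation: 50% horizontal flip, random contrast in [0.9, 1.1],
    random brightness in [0.75, 1.25]; geometric flips are applied to all maps."""
    if tf.random.uniform([]) < 0.5:
        rgb = tf.image.flip_left_right(rgb)
        sparse_depth = tf.image.flip_left_right(sparse_depth)
        gt_depth = tf.image.flip_left_right(gt_depth)
    if tf.random.uniform([]) < 0.5:
        rgb = tf.image.random_contrast(rgb, 0.9, 1.1)
    if tf.random.uniform([]) < 0.5:
        # Brightness is applied here as a multiplicative factor in [0.75, 1.25],
        # which is an assumption about the exact form of the jitter.
        rgb = rgb * tf.random.uniform([], 0.75, 1.25)
        rgb = tf.clip_by_value(rgb, 0.0, 1.0)
    return rgb, sparse_depth, gt_depth
```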
C. Loss Functions
Neural Networks are trained by minimizing the difference between the ground truth and the predictions of the network. Standard loss functions for regression tasks are the L1 and L2 losses. Different variants of loss functions, such as the scale-invariant loss [13], the inverse Huber loss [14], and a combination of smoothness and reconstruction losses [6], can be found in the depth estimation literature. For our work, we use loss functions similar to [10] that penalize distortions at object boundaries in the image while focusing on minimizing the error between the ground truth and the predictions of the network. For training our network we use a composition of three different loss functions. If d* is the predicted depth and d is the ground-truth depth, the loss function is defined as

L_total(d, d*) = L_ssim(d, d*) + L_edge(d, d*) + L_pixel(d, d*)   (1)

The first loss term is the Structural Similarity Loss (L_ssim). L_ssim is commonly used for image reconstruction tasks and is based on the Structural Similarity (SSIM) index for measuring the similarity between two images. The SSIM index can be viewed as a quality measure for the comparison of the two provided images [15]. The second loss term (L_edge) is defined over the image gradients of the predictions and the depth ground truth, i.e. the directional change in the intensity of the image [16]. There is often a discontinuity at object boundaries, so it is necessary to penalize errors around the edges more strongly. For the third loss term (L_pixel), three different losses, L1, L2, and the Berhu loss [14], were examined. The difference between the L1 and L2 loss functions is that L2 penalizes larger errors but is more tolerant of small errors, whereas L1 does not produce a high error for large prediction errors. The L1 loss tends to make the reconstructed image less blurry than the L2 loss.

The Berhu loss, or reverse Huber loss, is a combination of the L1 and L2 losses as described in Equation 2. It acts as an L1 loss when the difference between ground truth and prediction falls below a threshold c, and as an L2 loss when the error exceeds that threshold:

B(x) = |x|                  if |x| <= c
B(x) = (x^2 + c^2) / (2c)   if |x| > c      (2)

The network is trained using reciprocal depth to compensate for the problem of large errors resulting from larger ground-truth depth values [10]. The final depth value is obtained by taking the inverse of the obtained prediction.
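The composite loss of Equation 1, with the Berhu option of Equation 2 as the pixel term, could be implemented roughly as follows. The equal weighting of the three terms and the choice of the Berhu threshold c (here 20 percent of the maximum absolute error, following common practice after [4]) are assumptions rather than values stated in the paper.

```python
import tensorflow as tf


def berhu_loss(error, c):
    """Reverse Huber (Eq. 2): L1 below threshold c, quadratic above it."""
    abs_err = tf.abs(error)
    quadratic = (tf.square(abs_err) + c ** 2) / (2.0 * c)
    return tf.where(abs_err <= c, abs_err, quadratic)


def total_loss(d_true, d_pred, max_depth=1.0, pixel_loss='l1'):
    """Sketch of Eq. 1: SSIM term + image-gradient (edge) term + per-pixel term.
    Inputs are batched depth maps of shape [B, H, W, 1]."""
    # Structural similarity term.
    l_ssim = tf.reduce_mean(
        (1.0 - tf.image.ssim(d_true, d_pred, max_val=max_depth)) * 0.5)

    # Edge term on the image gradients of prediction and ground truth.
    dy_true, dx_true = tf.image.image_gradients(d_true)
    dy_pred, dx_pred = tf.image.image_gradients(d_pred)
    l_edge = tf.reduce_mean(tf.abs(dy_pred - dy_true) + tf.abs(dx_pred - dx_true))

    # Per-pixel term: L1, L2 or Berhu on the depth difference.
    err = d_true - d_pred
    if pixel_loss == 'l1':
        l_pixel = tf.reduce_mean(tf.abs(err))
    elif pixel_loss == 'l2':
        l_pixel = tf.reduce_mean(tf.square(err))
    else:  # Berhu; threshold c = 20% of the largest error is an assumption.
        c = 0.2 * tf.reduce_max(tf.abs(err))
        l_pixel = tf.reduce_mean(berhu_loss(err, c))

    return l_ssim + l_edge + l_pixel
```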
IV. EXPERIMENTS

A. Datasets
Nuscenes is a publicly available large-scale dataset for autonomous driving by Aptiv Autonomous Mobility [17]. The dataset was collected in Boston and Singapore, two cities known for their dense traffic and highly arduous driving conditions. Nuscenes is the only publicly available real recorded dataset that provides Radar data. Ground-truth depth maps have been generated by Lidar reprojection, and missing depth values are filled using the colorization method [18].
KITTI is one of the benchmark autonomous driving datasets for depth completion and estimation, 3D object detection, and scene flow estimation tasks [19]. One of the significant advantages of using the KITTI dataset for depth estimation is the availability of semi-dense depth map ground truth, obtained by registering 11 consecutive Lidar point clouds using the Iterative Closest Point algorithm. Although this work focuses on incorporating Radar data with RGB images for depth estimation, the impact of Lidar point clouds, fused with RGB images through the same network, on the accuracy of depth estimation is also examined.
Synthetic Dataset has been created with the CARLA simulator, providing nearly perfect ground-truth depth. CARLA is an open-source simulator built on top of the Unreal Engine 4 (UE4) gaming engine for autonomous driving research [20]. With the CARLA release of December 2019, a low-fidelity Radar sensor was added to complete its sensor suite. The data is recorded in multiple scenes with different fields of view, thus providing a versatile range of images recorded in adverse weather conditions together with their corresponding depth maps and Radar distance data. Each Radar point cloud is projected into the image, and the depth information from the Radar measurements is encoded as an image-like representation. An unseen test set of 600 images recorded in scenes with different weather conditions, such as day, night, fog, and rain, is created for evaluation.
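A minimal sketch of such a projection is given below. It assumes the Radar points have already been transformed into the camera frame and that the 3x3 camera intrinsics matrix K is available; the exact projection pipeline used for the datasets is not detailed in the paper.

```python
import numpy as np


def radar_to_sparse_depth(points_cam, K, image_shape):
    """Project Radar points (camera frame, shape [N, 3]) onto the image plane
    and rasterise their depths into a sparse, image-like depth map."""
    h, w = image_shape
    sparse = np.zeros((h, w), dtype=np.float32)

    # Keep only points in front of the camera.
    points_cam = points_cam[points_cam[:, 2] > 0]

    # Perspective projection with the 3x3 intrinsics matrix.
    uvw = (K @ points_cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    depth = points_cam[:, 2]

    # Discard points that fall outside the image.
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, depth = u[valid], v[valid], depth[valid]

    # If several points hit the same pixel, keep the closest one
    # (sort far-to-near so nearer points overwrite farther ones).
    order = np.argsort(-depth)
    sparse[v[order], u[order]] = depth[order]
    return sparse
```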
Top-View Dataset is generated with the Unreal Engine 4 gaming engine [21]. The images are captured from cameras located at the boom of a crane on a construction site. The role of the crane is to lower the boom, pick up the load, and move it to the desired position. The camera takes a snapshot of the load and the ground while the boom is moving [22].
Name        Type         Sensing Modalities      GT                            Size
Nuscenes    Real         Camera, Lidar, Radar    Created using Lidar frames    3400 frames
Synthetic   Simulation   Camera, Radar           Created using Carla           8K frames
KITTI       Real         Camera, Lidar           Created using Lidar frames    9K frames
Top-View    Simulation   Camera                  Created from UE4              1200 frames

TABLE I: Overview of the datasets used in this work.
B. Implementation Details
The network is implemented using TensorFlow 2.0 [23] and trained on the above-mentioned datasets on a single machine with a 12GB NVIDIA TITAN X GPU and 32GB of internal memory, to allow a fair comparison of all methods. The weights of the encoder are initialized with pre-trained weights from the model trained on the ImageNet dataset [11]. In this work, a batch size of 2 is used throughout all experiments. The networks are trained for 20 epochs on all datasets, except the Top-View dataset which is trained for 30 epochs, using the Adam optimizer [24] with exponential decay rates for the first and second moment estimates of β1 = 0.9 and β2 = 0.999. A constant learning rate of 0.0001 was also tried for training the network, but the best results were achieved by reducing the learning rate to 20 percent of its previous value every 7 epochs. Therefore, for all networks used in this work, a learning rate of 0.0001, reduced to 20 percent of its previous value starting at the eighth epoch, is used for training.
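The learning-rate schedule described above can be expressed, for example, as a staircase exponential decay in TensorFlow; the `steps_per_epoch` value below is a placeholder, not a value from the paper.

```python
import tensorflow as tf

steps_per_epoch = 1000  # placeholder: depends on dataset size and batch size

# Adam with beta_1 = 0.9, beta_2 = 0.999, starting at 1e-4 and decayed to 20%
# of its previous value every 7 epochs (i.e. at the 8th, 15th, ... epoch).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=7 * steps_per_epoch,  # one decay step every 7 epochs
    decay_rate=0.2,
    staircase=True)                   # step-wise rather than continuous decay
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule,
                                     beta_1=0.9, beta_2=0.999)
```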
C. Evaluation Metrics

To evaluate the quantitative performance of the networks, standard depth estimation metrics are used, as defined in the depth estimation literature [3]. The metrics are listed below, where d_i is the original depth of a pixel in the depth image d, d*_i is the corresponding pixel of the predicted depth d*, and N is the total number of pixels per depth image.

• Absolute Relative Distance (ARD): (1/N) Σ_i |d_i − d*_i| / d*_i
• Squared Relative Difference (SRD): (1/N) Σ_i |d_i − d*_i|^2 / d*_i
• Root Mean Square Error (RMSE): sqrt( (1/N) Σ_i |d_i − d*_i|^2 )
• Threshold Accuracy: % of d_i s.t. max(d_i/d*_i, d*_i/d_i) < thr, where thr ∈ {1.25, 1.25^2, 1.25^3}
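A straightforward NumPy implementation of these metrics, evaluated only on pixels with valid ground truth, might look as follows; the masking criterion is an assumption.

```python
import numpy as np


def depth_metrics(gt, pred):
    """Standard depth-estimation metrics over valid ground-truth pixels."""
    mask = gt > 0                       # ignore pixels without ground truth
    d, d_star = gt[mask], pred[mask]

    ard = np.mean(np.abs(d - d_star) / d_star)    # absolute relative distance
    srd = np.mean((d - d_star) ** 2 / d_star)     # squared relative difference
    rmse = np.sqrt(np.mean((d - d_star) ** 2))    # root mean square error

    ratio = np.maximum(d / d_star, d_star / d)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]  # threshold accuracies
    return ard, srd, rmse, deltas
```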
V. RESULTS

This section presents empirical evaluations on all the mentioned datasets and discusses the results of all experiments performed with the described model. The experiments are supported by qualitative and quantitative results.

A. Architecture Evaluation

a). Evaluation on the Top-View dataset
In this section, the quantitative evaluations on the Top-View dataset trained with different loss functions are discussed. The network is trained with the RGB modality only. From the results it can be inferred that the model trained with the Berhu loss slightly outperformed the models trained with the other loss functions in terms of RMSE, ARD, and SRD, as listed in Table II. Predictions are depicted in Fig. 3.
Modality   Loss    RMSE   ARD   SRD   δ1   δ2   δ3
RGB        L1
RGB        L2

TABLE II: Quantitative performance of the network trained on the Top-View dataset with different loss functions. Note that L1, L2, and Berhu loss refer to the combination of that loss with the SSIM and image-gradient terms. For every metric except the threshold accuracies, lower values indicate better performance.

Fig. 3: Predictions on the Top-View dataset. From top to bottom: (Input) RGB images recorded from the camera located at the boom of the crane. (Output) RGB-based predictions. (Groundtruth) Near-perfect depth map generated from Unreal Engine. Note that a colormap is used for visualization of the depth map and does not encode the original depth of objects from the camera.

b). Evaluation on the Synthetic dataset
In this section, the quantitative evaluations on the Synthetic dataset trained with different loss functions are discussed. The fusion modality trained with the L1 loss is the best performing model in terms of ARD and threshold accuracies compared to the models trained with the other loss functions, as listed in Table III. The modalities trained with the Berhu loss also perform on par with the modality trained with the L1 loss; the reason may be that the Berhu loss behaves like the L1 loss for very small errors and like the L2 loss otherwise. For Deep Learning models to be utilized in real-life applications, it is important to have a small inference time. The inference time per frame is 0.10 seconds for the model trained on the RGB modality and 0.11 seconds for the fusion modality. Qualitative results are depicted in Fig. 4.

Modality   Loss    RMSE   ARD   SRD   δ1   δ2   δ3

TABLE III: Quantitative performance of the network trained on the Synthetic dataset with different loss functions. The RGB modality refers to the model trained only on camera images, the RGB-Fusion modality refers to channel-wise concatenation of RGB and Radar data, and the RGB-Add modality refers to element-wise addition of RGB and Radar data. For every metric except the threshold accuracies, lower values indicate better performance. Note that L1, L2, and Berhu loss refer to the combination of that loss with the SSIM and image-gradient terms.

Fig. 4: Predictions on the Synthetic dataset for scenes with different weather conditions. From top to bottom: (Input) night, foggy and cloudy scenes. (Output(RGB)) RGB-based predictions. (Output(Fusion)) Predictions from the fusion of RGB and sparse depth input. (Output(Add)) Predictions from the element-wise addition of RGB and sparse depth input. (Groundtruth) Near-perfect depth map generated from CARLA. Note that a colormap is used for visualization of the depth map and does not encode the original depth of objects from the camera.

c). Evaluation on the KITTI dataset

In this section, the quantitative evaluations on the KITTI dataset trained on different modalities are presented and compared with the results of other benchmarked networks. The proposed network is evaluated on 652 images as specified by [3]. During evaluation, the predictions are upscaled from 192x624 to the original input resolution of 384x1248. The ground truth used for evaluation is acquired by filling the missing depth values as mentioned in [10]. The network is expected to perform much better when trained with the fusion of Lidar and RGB than when trained on RGB images only, because the input modality is roughly a subset of the ground truth. The quantitative results are listed in Table IV.
Modality   Method       RMSE    ARD     SRD   δ1      δ2      δ3
RGB+sD     Liao [25]    4.50    0.113   -     0.874   0.96    0.984
RGB+sD     Ma [8]       3.378   -             0.935   0.976
RGB+sD     Ours

TABLE IV: Comparison with the state of the art on the KITTI dataset with input consisting of an RGB image and sparse depth. Note that the evaluation is performed on 652 images as mentioned in [3]. The ground truth used for evaluation is acquired by filling the missing depth values as mentioned in [10]. Ma et al. [8] used a random subset of 3200 images for the final evaluation, with ground truth acquired by projecting Velodyne LiDAR measurements onto the RGB images. For every metric except the threshold accuracies, lower values indicate better performance.

d). Evaluation on the Nuscenes dataset
In this section, the quantitative evaluations on the Nuscenes dataset trained with different loss functions are presented. As the ground truth for Nuscenes is generated using Lidar point clouds and RGB images, the RGB modality is much more important than the Radar point clouds. Among all modalities, the network trained with the RGB modality only performed best. The metric results for the Radar fusion and addition modalities are slightly inferior to those of the network trained only on RGB, but on visual inspection their predictions are much more comprehensible than the predictions of the network trained only on RGB images, as can be seen in Fig. 5.
Modality   Loss    RMSE   ARD   SRD   δ1   δ2   δ3

TABLE V: Quantitative performance of the network trained on the Nuscenes dataset with different loss functions. For every metric except the threshold accuracies, lower values indicate better performance.
VI. CONCLUSIONS

A convolutional neural network based depth prediction method for predicting dense depth images is proposed by fusing RGB images with sparse Radar depth input recorded in various weather conditions such as day, night, fog, cloud, and rain. It was evaluated on the Nuscenes dataset, a Synthetic dataset generated with CARLA, and a Top-View dataset comprising images of construction sites generated with Unreal Engine. The experimental results reveal that the fusion of sparse input with RGB slightly enhanced the predicted depth maps on the Synthetic dataset. Furthermore, we demonstrated that this model performs significantly better for predicting depth on the Top-View zoomed images captured from cranes, where stereo cameras or Lidars are not feasible to use. We believe that one avenue of future work is to examine the use of neural architecture search for the encoder of dense depth prediction tasks. For the real-time operation of fusion methods, it is essential to have a short latency for depth map predictions on embedded devices. Model compression through pruning methods could be explored to reduce the parameters of the network by masking the irrelevant features learned by the network. Moreover, self-supervised learning strategies on multi-modal datasets could be used, as the creation of a ground-truth depth map is an expensive and tedious task.

Fig. 5: Predictions on the Nuscenes dataset. From top to bottom: (Input) RGB images. (Output(RGB)) RGB-based predictions. (Output(Fusion)) Predictions from the fusion of RGB and sparse depth input. (Output(Add)) Predictions from the element-wise addition of RGB and sparse depth input. Note that a colormap is used for visualization of the depth map and does not encode the original depth of objects from the camera.
REFERENCES

[1] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, "Multimodal deep learning for robust RGB-D object recognition," 2015.
[2] A. Saxena, S. Chung, and A. Ng, "Learning depth from single monocular images," in Advances in Neural Information Processing Systems, Y. Weiss, B. Schölkopf, and J. Platt, Eds., vol. 18. MIT Press, 2006, pp. 1161-1168. [Online]. Available: https://proceedings.neurips.cc/paper/2005/file/17d8da815fa21c57af9829fb0a869602-Paper.pdf
[3] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," CoRR, vol. abs/1411.4734, 2014. [Online]. Available: http://arxiv.org/abs/1411.4734
[4] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, "Deeper depth prediction with fully convolutional residual networks," CoRR, vol. abs/1606.00373, 2016. [Online]. Available: http://arxiv.org/abs/1606.00373
[5] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, "Deep ordinal regression network for monocular depth estimation," CoRR, vol. abs/1806.02446, 2018. [Online]. Available: http://arxiv.org/abs/1806.02446
[6] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in CVPR, 2017.
[7] R. Garg, V. K. B. G, and I. D. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," CoRR, vol. abs/1603.04992, 2016. [Online]. Available: http://arxiv.org/abs/1603.04992
[8] F. Ma and S. Karaman, "Sparse-to-dense: Depth prediction from sparse depth samples and a single image," in ICRA, 2018.
[9] F. Ma, G. V. Cavalheiro, and S. Karaman, "Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera," arXiv preprint arXiv:1807.00275, 2018.
[10] I. Alhashim and P. Wonka, "High quality monocular depth estimation via transfer learning," arXiv e-prints, vol. abs/1812.11941, 2018. [Online]. Available: https://arxiv.org/abs/1812.11941
[11] J. Deng, R. Socher, L. Fei-Fei, W. Dong, K. Li, and L.-J. Li, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009, pp. 248-255. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/5206848/
[12] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI), ser. LNCS, vol. 9351. Springer, 2015, pp. 234-241 (available on arXiv:1505.04597 [cs.CV]). [Online]. Available: http://lmb.informatik.uni-freiburg.de/Publications/2015/RFB15a
[13] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," CoRR, vol. abs/1406.2283, 2014. [Online]. Available: http://arxiv.org/abs/1406.2283
[14] L. Zwald and S. Lambert-Lacroix, "The berhu penalty and the grouped effect," 2012.
[15] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," Trans. Img. Proc.
[16] Image gradients, course notes. [Online]. Available: ∼djacobs/CMSC426/ImageGradients.pdf
[17] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," arXiv preprint arXiv:1903.11027, 2019.
[18] A. Levin, D. Lischinski, and Y. Weiss, "Colorization using optimization," ACM Trans. Graph., vol. 23, no. 3, pp. 689-694, 2004. [Online]. Available: http://dblp.uni-trier.de/db/journals/tog/tog23.html
[19] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[20] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proceedings of the 1st Annual Conference on Robot Learning, 2017.
[21] Epic Games, "Unreal Engine 4."
[22] in Proceedings of the 35th International Symposium on Automation and Robotics in Construction (ISARC), J. Teizer, Ed. Taipei, Taiwan: International Association for Automation and Robotics in Construction (IAARC), July 2018, pp. 882-888.
[23] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," CoRR, vol. abs/1603.04467, 2016. [Online]. Available: http://arxiv.org/abs/1603.04467
[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.