Noise-Aware Unsupervised Deep Lidar-Stereo Fusion
Xuelian Cheng*, Yiran Zhong*, Yuchao Dai, Pan Ji, Hongdong Li
Northwestern Polytechnical University, Australian National University, NEC Laboratories America, ACRV, Data61 CSIRO
* These authors contributed equally to this work.
Abstract
In this paper, we present LidarStereoNet, the first unsupervised Lidar-stereo fusion network, which can be trained in an end-to-end manner without the need for ground truth depth maps. By introducing a novel "Feedback Loop" that connects the network input with its output, LidarStereoNet can tackle both noisy Lidar points and misalignment between sensors, issues that have been ignored in existing Lidar-stereo fusion studies. In addition, we propose to incorporate a piecewise planar model into network learning to further constrain depths to conform to the underlying 3D geometry. Extensive quantitative and qualitative evaluations on both real and synthetic datasets demonstrate the superiority of our method, which significantly outperforms state-of-the-art stereo matching, depth completion and Lidar-Stereo fusion approaches.
1. Introduction
Accurately perceiving surrounding 3D information from passive and active sensors is crucial for numerous applications such as localization and mapping [15], autonomous driving [18], obstacle detection and avoidance [25], and 3D reconstruction [10, 33]. However, each kind of sensor alone suffers from its inherent drawbacks. Stereo cameras are well known to suffer from high computational complexity and their incompetence in dealing with texture-less or repetitive areas and occluded regions [28], while Lidar sensors often provide accurate but relatively sparse depth measurements [5].

Therefore, it is highly desirable to fuse measurements from Lidar and stereo cameras to achieve high-precision depth perception by exploiting their complementary properties. However, this is a non-trivial task, as accurate Stereo-Lidar fusion requires a proper registration between the Lidar and stereo images and noise-free Lidar points. Existing methods are not satisfactory due to the following drawbacks:

• Existing deep neural network based Lidar-Stereo fusion studies [2, 20, 26] strongly depend on the availability of large-scale ground truth depth maps, and thus their performance is fundamentally limited by their generalization ability to real-world applications.
• Due to rolling-shutter effects of Lidar and other calibration imperfections, a direct registration will introduce significant alignment errors between Lidar and stereo depth. Furthermore, existing methods tend to assume the Lidar measurements are noise-free [19, …]

Figure 1. Results on KITTI 2015. We highlight the displacement error of Lidar points with bounding boxes. Lidar points are dilated for better visualization and we overlay our disparity maps on the colour images for illustration. Note that the Lidar points for the foreground car and utility pole have been aligned to the background. Our method successfully recovers accurate disparities on tiny and moving objects while the other methods are misled by drifted and noisy Lidar points.
2. Related Work
Stereo Matching
Deep convolutional neural network (CNN) based stereo matching methods have recently achieved great success. Existing supervised deep methods either formulate the task as depth regression [21] or as multi-label classification [31]. Recently, unsupervised deep stereo matching methods have also been introduced to relieve the need for a large amount of labeled training data. Godard et al. [11] proposed to exploit the photometric consistency loss between left images and the warped version of right images, thus forming an unsupervised stereo matching framework. Zhong et al. [36] presented a stereo matching network for estimating depths from continuous video input. Very recently, Zhang et al. [34] extended the self-supervised stereo network [35] from passive stereo cameras to active stereo scenarios. Even though stereo matching has been greatly advanced, it still suffers from challenging scenarios such as texture-less regions and low-lighting conditions.
Depth Completion/Interpolation
Lidar scanners can provide accurate but sparse and incomplete 3D measurements. Therefore, there is a strong demand for increasing the density of Lidar scans, which is crucial for applications such as self-driving cars. Uhrig et al. [30] proposed a masked sparse convolution layer to handle sparse and irregular Lidar inputs. Chodosh et al. [4] utilized compressed sensing to approach the sparsity problem for scene depth completion. With the guidance of corresponding color images, Ma et al. [19] extended the up-projection blocks proposed by [17] as decoding layers to achieve full depth reconstruction. Jaritz et al. [14] handled sparse inputs of various densities without any additional mask input.
Lidar-Stereo Fusion
Existing studies mainly focus on fusing stereo and time-of-flight (ToF) cameras for indoor scenarios [7, 23, 24, 12, 8], while Lidar-Stereo fusion for outdoor scenes has seldom been approached in the literature. Badino et al. [2] used Lidar measurements to reduce the search space for stereo matching and to provide predefined paths for dynamic programming. Later on, Maddern et al. [20] proposed a probabilistic model to fuse Lidar and disparities by combining priors from each sensor. However, their performance degrades significantly when the Lidar information is missing. To tackle this issue, instead of using a manually selected probabilistic model, Park et al. [26] utilized CNNs to learn such a model, which takes two disparities as input: one from the interpolated Lidar and the other from semi-global matching [13]. Compared with these supervised approaches, our unsupervised method can be trained end-to-end using stereo pairs and sparse Lidar points without using external stereo matching algorithms.
3. Lidar-Stereo Fusion
In this section, we formulate Lidar-Stereo fusion as an unsupervised learning problem and present our main ideas for dealing with the inherent challenge encountered by existing methods, i.e., noise in Lidar measurements.
Lidar-Stereo fusion aims at recovering a dense and accurate depth/disparity map from sparse Lidar measurements $S \in \mathbb{R}^{n \times 3}$ and a pair of stereo images $I_l, I_r$. We assume the Lidar and stereo camera have been calibrated with an extrinsic matrix $T$, and that the stereo camera itself is calibrated with intrinsic matrices $K_l, K_r$ and projection matrices $P_l, P_r$. We can then project the sparse Lidar points $S$ onto the image plane of $I_l$ by $d_l^s = P_l T S$. Since disparity is used in the stereo matching task, we convert the projected depth $d_l^s$ to disparity $D_l^s$ using $D = Bf/d$, where $B$ is the baseline between the stereo camera pair and $f$ is the focal length. The same process is applied to the right image as well. Mathematically, the problem can be defined as:

$$(\widehat{D}_l, \widehat{D}_r) = F(I_l, I_r, D_l^s, D_r^s; \Theta), \qquad (1)$$

where $F$ is the learned Lidar-Stereo fusion model (a deep network in our paper) parameterized by $\Theta$, and $\widehat{D}_l, \widehat{D}_r$ are the fusion outputs defined on the left and right image coordinates.

Under our problem setting, we do not make any assumption on the Lidar points' configuration (e.g., the number or the pattern) or on the error distribution of the Lidar points and stereo disparities. Removing all these restrictions makes our method more generic and gives it wider applicability.
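As a concrete illustration of the projection and conversion above, the sketch below assumes a KITTI-style 4×4 Lidar-to-camera extrinsic and a 3×4 projection matrix; the function name, the nearest-pixel rounding and the near-plane cut-off are illustrative choices rather than the authors' exact preprocessing.

```python
import numpy as np

def lidar_to_sparse_disparity(points, T, P, B, f, height, width):
    """Project Lidar points into one camera view and convert depth to disparity.

    points : (n, 3) Lidar points in the Lidar frame.
    T      : (4, 4) Lidar-to-camera extrinsic matrix.
    P      : (3, 4) camera projection matrix.
    B, f   : stereo baseline and focal length (D = B * f / d).
    Returns a (height, width) map that is zero where no Lidar point projects.
    """
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])        # homogeneous coordinates, (n, 4)
    cam = T @ pts_h.T                                   # (4, n) points in the camera frame
    cam = cam[:, cam[2] > 0.1]                          # keep only points in front of the camera
    proj = P @ cam                                      # (3, n) projection onto the image plane
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    depth = cam[2]
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    disp = np.zeros((height, width), dtype=np.float32)
    disp[v[valid], u[valid]] = B * f / depth[valid]     # depth -> disparity
    return disp
```

The same routine, run with the right camera's projection matrix, yields the sparse disparity map for the right image.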
Figure 2. KITTI VO Lidar points and our cleaned Lidar points. Erroneous Lidar points on transparent/reflective areas and on tiny object surfaces have been successfully removed.

In Lidar-Stereo fusion, existing methods usually assume that the Lidar points are noise free and that the alignment between the Lidar and the stereo images is perfect. We argue that even for dedicated systems such as the KITTI dataset, the Lidar points are never perfect and the alignment cannot be consistently accurate, c.f. Fig. 1. The errors in Lidar scanning are inevitable for two reasons. (1) Even for well calibrated Lidar-stereo systems, e.g., KITTI, and after eliminating the rolling-shutter effect in Lidar scans by compensating for the ego-motion, Lidar errors still persist even for stationary scenes, as shown in Fig. 2. According to the readme file of the KITTI VO dataset, the rolling-shutter effect has already been removed from the Lidar scans. However, we still find Lidar errors on transparent (white box) and reflective (red box) surfaces. Also, due to the displacement between the Lidar and the cameras, the Lidar can see through tiny objects, as shown in the yellow box. (2) It is hard to perform motion compensation on dynamic scenes, so the rolling-shutter effect persists for moving objects.

It is possible to eliminate these errors by manually inserting 3D models and other post-processing steps [22]. However, considerable human effort would be involved. Our method can deal with these Lidar errors automatically, without manual intervention. Hence the problem we are tackling ("Noise-Aware Lidar-Stereo Fusion") is not a simple "system-level" problem that can be solved through "engineering registration".
Figure 3. The feedback loop.
For each iteration, the input stereo pair is first used to compute initial disparities, which are used to filter errors in the sparse Lidar points. No back-propagation takes place at this stage, so we call it the Verify phase. Then, in the Update phase, the Core Architecture takes the stereo pair and the cleaned sparse Lidar points as inputs to generate the final disparities. The parameters of the Core Architecture are updated through back-propagation this time.

It is known that the ability of deep CNNs to overfit or memorize corrupted labels can lead to poor generalization performance. Therefore, we aim to deal with the noise in the Lidar measurements properly when training deep CNNs. Robust functions such as the $\ell_1$ norm, the Huber function or the truncated $\ell_1$ norm are natural choices for dealing with noisy measurements. However, these functions do not eliminate the effects caused by noise; they only suppress them. Furthermore, these errors also exist in the input, and automatically correcting or ignoring these erroneous points creates an extra difficulty for the network. To this end, we introduce a feedback loop in our network that allows the input to also depend on the output of the network. In this way, the input Lidar points can be cleaned before being fed into the network.

The Feedback Loop
We propose a novel framework to progressively detect and remove erroneous Lidar points during the training process and to generate a highly accurate Lidar-Stereo fusion. Fig. 3 illustrates an unfolded structure of our network design, namely the feedback loop. It consists of two phases: the "Verify" phase and the "Update" phase. Each phase shares the same network structure, the Core Architecture, whose details are given in Section 4.1. In the Verify phase, the network takes the stereo image pair $(I_l, I_r)$ and the noisy Lidar disparities $(D_l^s, D_r^s)$ as input and generates two disparity maps $(D_l^v, D_r^v)$. No back-propagation takes place in this phase. We then compare $(D_l^v, D_r^v)$ with $(D_l^s, D_r^s)$ and retain the sparse Lidar points $(D_l^{sc}, D_r^{sc})$ that are consistent between stereo matching and the Lidar measurements. In the Update phase, the network takes both the stereo pair $(I_l, I_r)$ and the cleaned sparse Lidar points $(D_l^{sc}, D_r^{sc})$ as inputs to recover dense disparity maps $(D_l^f, D_r^f)$. All loss functions are evaluated on the final disparity outputs $(D_l^f, D_r^f)$ only.
Figure 4. Core Architecture of our LidarStereoNet. It consists of a feature extraction and fusion block, a stacked-hourglass feature matching block and a disparity computing layer. Given a stereo pair $I_l, I_r$ and the corresponding projected Lidar points $D_l^s, D_r^s$, the feature extraction block produces feature maps separately for images and Lidar points. The feature maps are then concatenated to form the final input features, which are aggregated into a feature volume. The feature matching block learns the cost over the feature volume, and the disparity computing layer then produces the disparity estimate. Details of the feature extraction and fusion block are illustrated on the right.

Once the network is trained, we empirically find that there is no performance drop if we directly feed the Core Architecture with noisy Lidar points. Therefore, we remove the feedback loop module and use only the Core Architecture at test time. Our feedback loop detects erroneous Lidar points by measuring the consistency between the Lidar points and stereo matching. Lidar and stereo matching are active and passive depth acquisition techniques, respectively; hence, it is unlikely that they make the same errors. The loop may also filter out some correct Lidar points at first, but the image warping loss and other regularization losses keep the network training on the right track.
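A schematic of one Verify/Update iteration is sketched below, assuming `core_net` denotes the Core Architecture; the consistency threshold `tau` and truncation value `eps` are illustrative, and only the Lidar term of the loss is written out (the warping, smoothness and plane-fitting terms of Section 4 would be added to it).

```python
import torch

def feedback_training_step(core_net, optimizer, I_l, I_r, D_sl, D_sr, tau=1.0, eps=3.0):
    """One Verify/Update iteration of the feedback loop (schematic).

    core_net maps (left image, right image, left Lidar, right Lidar) to dense disparities.
    D_sl, D_sr are sparse Lidar disparity maps with zeros at empty pixels.
    """
    # Verify phase: run the Core Architecture without back-propagation.
    with torch.no_grad():
        D_vl, D_vr = core_net(I_l, I_r, D_sl, D_sr)

    # Keep only the Lidar points that agree with the stereo prediction.
    keep_l = ((D_sl > 0) & ((D_sl - D_vl).abs() < tau)).float()
    keep_r = ((D_sr > 0) & ((D_sr - D_vr).abs() < tau)).float()
    D_scl, D_scr = D_sl * keep_l, D_sr * keep_r

    # Update phase: feed the cleaned Lidar points and back-propagate the losses.
    D_fl, D_fr = core_net(I_l, I_r, D_scl, D_scr)

    # Truncated-L1 Lidar loss on the cleaned points (cf. Eqs. (7)-(8)); the full
    # objective also adds the warping, smoothness and plane-fitting terms.
    err_l = (keep_l * (D_fl - D_scl)).abs().clamp(max=eps)
    err_r = (keep_r * (D_fr - D_scr)).abs().clamp(max=eps)
    loss = (err_l.sum() + err_r.sum()) / (keep_l.sum() + keep_r.sum() + 1e-6)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```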
4. Our Network Design
In this section, we present our "LidarStereoNet" for Lidar-Stereo fusion, which can be learned in an unsupervised, end-to-end manner. To remove the need for large-scale training data with ground truth, we propose to exploit the photometric consistency between stereo images and the depth consistency between the stereo cameras and the Lidar. This novel network design enables the following benefits: 1) a wide generalization ability of our framework in various real-world scenarios; 2) the sparsity of the input Lidar points may vary, and the network can even handle the extreme case where the Lidar sensor is completely unavailable. Furthermore, to alleviate noisy Lidar measurements and the misalignment between the Lidar and the stereo cameras, we incorporate the "Feedback Loop" into the network design to connect the output with the input, which enables the Lidar points to be cleaned before being fed into the network.

We illustrate the detailed structure of the Core Architecture of our LidarStereoNet in Fig. 4. LidarStereoNet consists of the following blocks: 1) feature extraction and fusion; 2) feature matching; and 3) disparity computing. The general information flow of our network is similar to [35], but with some crucial modifications in the feature extraction and fusion block. In view of the different characteristics of dense colour images and sparse disparity maps, we use different convolution layers to extract features from each of them. For colour images, we use the same feature extraction block as [3], while for sparse Lidar inputs the sparsity-invariant convolution layer [30] is used. The final feature maps are produced by concatenating the stereo image features and the Lidar features. Feature maps from the left and right branches are concatenated to form a 4D feature volume with a maximum disparity range of 192. Feature matching is then processed through an hourglass structure of 3D convolutions to compute the matching cost at each disparity level. Similar to [16], we use the soft-argmin operation to produce a 2D disparity map from the cost volume.
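The two steps just described, building the concatenation-based feature volume and regressing disparity with the soft-argmin of [16], can be sketched as follows. The quarter-resolution features and the 48 disparity levels (192/4) are assumptions based on the description above, and the 3D hourglass matching network itself is omitted.

```python
import torch
import torch.nn.functional as F

def build_feature_volume(feat_l, feat_r, max_disp_feat=48):
    """Concatenate left features with right features shifted by each candidate disparity.

    feat_l, feat_r : (B, C, H, W) fused image+Lidar feature maps (assumed 1/4 resolution).
    Returns a (B, 2C, max_disp_feat, H, W) 4D feature volume.
    """
    B, C, H, W = feat_l.shape
    volume = feat_l.new_zeros(B, 2 * C, max_disp_feat, H, W)
    for d in range(max_disp_feat):
        if d == 0:
            volume[:, :C, d] = feat_l
            volume[:, C:, d] = feat_r
        else:
            volume[:, :C, d, :, d:] = feat_l[:, :, :, d:]
            volume[:, C:, d, :, d:] = feat_r[:, :, :, :-d]
    return volume

def soft_argmin(cost, max_disp):
    """Turn a matching-cost volume (B, max_disp, H, W) into a disparity map."""
    prob = F.softmax(-cost, dim=1)                       # lower cost -> higher probability
    disp_values = torch.arange(max_disp, dtype=prob.dtype,
                               device=prob.device).view(1, -1, 1, 1)
    return (prob * disp_values).sum(dim=1)               # expected disparity per pixel
```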
Dealing with dense and sparse inputs

To extract features from sparse Lidar points, Uhrig et al. [30] proposed a sparsity-invariant CNN after observing the failure of conventional convolutions. However, Ma et al. [19] and Jaritz et al. [14] argued that using a standard CNN with special training strategies can achieve better performance and also handle varying input densities. We compared both approaches and found that standard CNNs can indeed handle sparse inputs and even obtain better performance, but they require a much deeper network (a ResNet38 encoder versus 5 convolutional layers) with about 500 times more trainable parameters. Using such a "deep" network as a feature extractor would make our network infeasible for end-to-end training and hard to converge.

In our network, we choose sparsity-invariant convolutional layers [30] to assemble our Lidar feature extractor, which handles varying Lidar point distributions elegantly. It consists of 5 sparse convolutional layers with a stride of 1. Each convolution has 16 output channels and is followed by a ReLU activation function. We attach a plain convolution with a stride of 4 to generate the final 16-channel Lidar features, in order to make the Lidar features compatible with the image features.
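The sparsity-invariant convolution of [30] normalizes each output by the number of valid inputs under the kernel window and propagates a validity mask to the next layer. A minimal sketch of such a layer is given below; it follows the published formulation, but it is not the authors' exact implementation and the hyper-parameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv(nn.Module):
    """Sparsity-invariant convolution [30]: convolve only over valid (observed) pixels."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, pad, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        # Fixed all-ones kernel used to count valid inputs under each window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.pad = stride, pad

    def forward(self, x, mask):
        # x: (B, C, H, W) sparse input; mask: (B, 1, H, W) with 1 at valid pixels.
        x = self.conv(x * mask)
        norm = F.conv2d(mask, self.ones, stride=self.stride, padding=self.pad)
        x = x / norm.clamp(min=1e-8) + self.bias.view(1, -1, 1, 1)
        # An output pixel is valid if any input pixel in its window was valid.
        new_mask = F.max_pool2d(mask, kernel_size=self.ones.shape[-1],
                                stride=self.stride, padding=self.pad)
        return F.relu(x), new_mask
```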
Our loss function consists of two data terms and two regularization terms. For the data terms, we directly use the image warping error $\mathcal{L}_w$ as a dense supervision for every pixel, and the discrepancy on the filtered sparse Lidar points $\mathcal{L}_l$. For the regularization terms, we use a colour-weighted smoothness term $\mathcal{L}_s$ and our novel slanted plane fitting loss $\mathcal{L}_p$. Our overall loss function is a weighted sum of the above loss terms:

$$\mathcal{L} = \mathcal{L}_l + \mu_1 \mathcal{L}_w + \mu_2 \mathcal{L}_s + \mu_3 \mathcal{L}_p, \qquad (2)$$

where we empirically set $\mu_1 = 1$; $\mu_2$ and $\mu_3$ are set to values below 1.

We assume photometric consistency between the stereo pair, such that corresponding points in the two images should have a similar appearance. However, in some cases this assumption does not hold. Hence, we also compare the difference between small patches' Census transforms, which is robust to photometric changes. Our image warping loss is defined as follows:

$$\mathcal{L}_w = \mathcal{L}_i + \lambda_1 \mathcal{L}_c + \lambda_2 \mathcal{L}_g, \qquad (3)$$

where $\mathcal{L}_i$ denotes the photometric loss, $\mathcal{L}_c$ the Census loss and $\mathcal{L}_g$ the image gradient loss. We set $\lambda_1 < 1$ and $\lambda_2 = 1$ to balance the different terms.

The photometric loss is defined as the difference between the observed left (right) image and the warped left (right) image, where each term is weighted by the occlusion mask over the observed pixels:

$$\mathcal{L}_i = \Big[ \sum_{i,j} \varphi\big( \hat{I}(i,j) - I(i,j) \big) \cdot O(i,j) \Big] \Big/ \sum_{i,j} O(i,j), \qquad (4)$$

where $\varphi(s) = \sqrt{s^2 + \epsilon^2}$ is a robust penalty with a small constant $\epsilon$, and the occlusion mask $O$ is computed through a left-right consistency check.

To further improve the robustness in evaluating the image warping error, we use the Census transform $C$ to measure the difference:

$$\mathcal{L}_c = \Big[ \sum_{i,j} \varphi\big( \hat{C}(i,j) - C(i,j) \big) \cdot O(i,j) \Big] \Big/ \sum_{i,j} O(i,j). \qquad (5)$$

Lastly, we also use the difference between image gradients as an error metric:

$$\mathcal{L}_g = \Big[ \sum_{i,j} \varphi\big( \nabla\hat{I}(i,j) - \nabla I(i,j) \big) \cdot O(i,j) \Big] \Big/ \sum_{i,j} O(i,j). \qquad (6)$$

The cleaned sparse Lidar points after our feedback verification can also be used as a sparse supervision for generating disparities. We leverage the truncated $\ell_1$ function to handle noise and errors in these sparse Lidar measurements:

$$\mathcal{L}_l = \big\| M\,(\widehat{D} - D^{sc}) \big\|_{\tau}, \qquad (7)$$

where $M$ is the mask computed in the Verify phase. The truncated $\ell_1$ function is defined as:

$$\| x \|_{\tau} = \begin{cases} |x|, & |x| < \epsilon, \\ \epsilon, & \text{otherwise}. \end{cases} \qquad (8)$$

The smoothness term in the loss function is defined as:

$$\mathcal{L}_s = \sum \Big( e^{-\alpha_1 |\nabla_u I|}\,|\nabla_u d| + e^{-\alpha_2 |\nabla_v I|}\,|\nabla_v d| \Big) \Big/ N, \qquad (9)$$

where $\alpha_1$ and $\alpha_2$ are small positive weights. Note that previous studies [11, 35] often neglect the weights $\alpha_1, \alpha_2$, which actually play a crucial role in the colour-weighted smoothness term.

We also introduce a slanted plane model into the deep learning framework to enforce a structural constraint. This model has been commonly used in conventional Conditional Random Field (CRF) based stereo matching and optical flow algorithms. It assumes that all pixels within a superpixel lie on a 3D plane. By leveraging this piecewise plane fitting loss, we can enforce strong regularization on the 3D structure. Although our slanted plane model is defined in disparity space, it has been proved that a plane in disparity space is still a plane in 3D space [29]. Mathematically, the disparity $d_p$ of each pixel $p$ is parameterized by a local plane,

$$d_p = a_p u + b_p v + c_p, \qquad (10)$$

where $(u, v)$ is the image coordinate and the triplet $(a_p, b_p, c_p)$ denotes the parameters of a local disparity plane.

Define $P$ as the matrix of pixel homogeneous coordinates within a SLIC superpixel [1], with dimension $N \times 3$ where $N$ is the number of pixels within a segment, and denote by $a$ the planar parameters. Given the current disparity predictions $\widehat{d}$, we can estimate the plane parameters in closed form via $a^* = (P^T P)^{-1} P^T \widehat{d}$. With the estimated plane parameters, the fitted planar disparities $\widetilde{d} \in \mathbb{R}^N$ can be computed as $\widetilde{d} = P a^* = P (P^T P)^{-1} P^T \widehat{d}$. Our plane fitting loss can then be defined as

$$\mathcal{L}_p = \big\| \widehat{d} - \widetilde{d} \big\| = \big\| [\, I - P (P^T P)^{-1} P^T \,]\, \widehat{d} \big\|. \qquad (11)$$
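To make Eq. (11) concrete, the following is a minimal sketch of a soft plane-fitting penalty over precomputed SLIC superpixels. The per-segment pseudo-inverse, the L1 averaging and the minimum segment size are illustrative choices, not the authors' released implementation.

```python
import torch

def plane_fitting_loss(disp, segments):
    """Soft slanted-plane loss: penalize deviation of disparities from a per-superpixel plane.

    disp     : (H, W) predicted disparity map (requires grad).
    segments : (H, W) integer SLIC superpixel labels.
    """
    H, W = disp.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=disp.dtype, device=disp.device),
                          torch.arange(W, dtype=disp.dtype, device=disp.device),
                          indexing="ij")
    loss, count = disp.new_zeros(()), 0
    for s in segments.unique():
        idx = (segments == s)
        if idx.sum() < 3:                       # need at least 3 pixels to fit a plane
            continue
        P = torch.stack([u[idx], v[idx], torch.ones_like(u[idx])], dim=1)   # (N, 3)
        d_hat = disp[idx]                                                   # (N,)
        a = torch.linalg.pinv(P) @ d_hat        # closed-form plane parameters (a, b, c)
        d_tilde = P @ a                          # fitted planar disparities
        loss = loss + (d_hat - d_tilde).abs().mean()
        count += 1
    return loss / max(count, 1)
```

Because $P$ contains only pixel coordinates, the fitted disparities are a fixed linear projection of the predictions, so gradients flow back to the network exactly as prescribed by Eq. (11).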
Table 1. Quantitative results on the selected KITTI 141 subset. We compare our LidarStereoNet with various state-of-the-art Lidar-Stereo fusion methods; our proposed method outperforms all the competing methods by a wide margin.

| Methods | Input | Supervised | Abs Rel | >px | >px | >px | δ < 1.25 | Density |
|---|---|---|---|---|---|---|---|---|
| Input Lidar | Lidar | - | - | 0.0572 | 0.0457 | 0.0375 | - | 7.27% |
| S2D [19] | Lidar | Yes | 0.0665 | 0.0849 | 0.0659 | 0.0430 | 0.9626 | 100.00% |
| SINet [30] | Lidar | Yes | 0.0659 | 0.0908 | 0.0660 | 0.0456 | 0.9576 | 100.00% |
| Probabilistic fusion [20] | Stereo + Lidar | No | - | - | 0.0591 | - | - | 99.6% |
| CNN Fusion [26] | Stereo + Lidar | Yes | - | - | 0.0484 | - | - | 99.8% |
| Our method | Stereo + Lidar | No | | | | | | |
Figure 5.
Qualitative results of the methods from Tab. 1.
Our method is trained on the KITTI VO dataset and tested on the selected unseen KITTI 141 subset without any finetuning.
5. Experiments
We implemented our LidarStereoNet in PyTorch. All input images were randomly cropped during the training phase, while we used their original size at inference. The typical processing speed of our network was about 0.5 fps on a Titan XP. We used the Adam optimizer with a constant learning rate of 0.001 and a batch size of 1. We performed a series of experiments to evaluate our LidarStereoNet on both real-world and synthetic datasets. In addition to analyzing the accuracy of depth prediction in comparison to previous work, we also conducted a series of ablation studies on different sensor fusion architectures and investigated how each component of the proposed losses contributes to the performance.

The KITTI dataset [9] was created to set a benchmark for autonomous driving visual systems. It captures depth information from a Velodyne HDL-64E Lidar and corresponding stereo images from a moving platform. A highly accurate inertial measurement unit is used to accumulate 20 frames of raw Lidar depth data in a reference frame, which serves as ground truth for the stereo matching benchmark. In KITTI 2015 [22], moving objects are also taken into consideration: the dynamic objects are first removed and then re-inserted by fitting CAD models to the point clouds, resulting in a clean and dense ground truth for depth evaluation.
Dataset Preparation
After these processes, the raw Lidar points and the ground truth differ significantly in terms of outliers and density, as shown in Fig. 1. In the raw data, due to the large displacement between the Lidar and the stereo cameras [29], boundaries of objects may not align perfectly when projecting Lidar points onto the image planes. Also, since the Lidar system scans depth line by line, it creates a rolling-shutter effect in the reference image, especially on a moving platform. Instead of heuristically removing measurements, our method is able to omit these outliers automatically, as is evident in Fig. 1 and Fig. 2.

We used the KITTI VO dataset [9] as our training set. We sorted all 22 KITTI VO sequences and found 7 frames from sequences 17 and 20 that have corresponding frames in the KITTI 2015 training set. We therefore excluded these two sequences and used the remaining 20 stereo sequences as our training dataset. Our training dataset contains 42104 images at the typical KITTI image resolution. To obtain the sparse disparity inputs, we projected the raw Lidar points onto the left and right images using the provided extrinsic and intrinsic parameters and converted the raw Lidar depths to disparities. Maddern et al. [20] traced 141 frames from the KITTI raw dataset that have corresponding frames in the KITTI 2015 dataset and reported their results on this subset. For consistency, we used the same subset to evaluate our performance, and we used the 6 frames from the KITTI VO dataset as our validation set (we excluded 1 frame that overlaps the KITTI 141 subset from our validation).

Figure 6. Test results of our network on the selected KITTI 141 subset with varying levels of input Lidar point sparsity. Left column: lower is better; right column: higher is better.
Comparisons with State-of-the-Art
We compared our results with depth completion methods and Lidar-stereo fusion methods, using the depth metrics from [6] and the bad-pixel-ratio disparity error from KITTI [22]. We also provide a comparison between our method and stereo matching methods in the supplementary material.

For depth completion, we compared against S2D [19] and SINet [30]. In our implementation of S2D and SINet, we trained them on the KITTI depth completion dataset [30]. From the 151 training sequences, we excluded 28 sequences that overlap with the KITTI 141 subset and used the remaining 123 sequences to train these networks from scratch in a supervised manner. As a reference, we also computed the error rate of the input Lidar. It is worth noting that our method increases the disparity density from less than 7.3% to 100% while reducing the error rate by half.

We also compared our method with two existing Lidar-Stereo fusion methods, Probabilistic fusion [20] and CNN fusion [26], and outperform them by a large margin. The quantitative comparison between our method and the competing state-of-the-art methods is reported in Tab. 1. We can clearly see that our self-supervised LidarStereoNet achieves the best performance across all the evaluated metrics. Note that our method even outperforms the recent supervised CNN-based fusion method [26] by a large margin. More qualitative evaluations of our method in challenging scenes are provided in Fig. 11. These results demonstrate the superiority of our method, which effectively leverages the complementary information between Lidar and stereo images.
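For reference, the depth metrics of [6] and the bad-pixel ratios used in the tables can be computed as in the sketch below; the pixel-error thresholds are left as parameters since their exact values are not restated here, and the δ < 1.25 threshold follows the standard definition of [6].

```python
import numpy as np

def depth_metrics(pred, gt, valid=None, px_thresholds=(1, 2, 3)):
    """Return Abs Rel, bad-pixel ratios and the delta < 1.25 accuracy.

    pred, gt : disparity (or depth) maps of the same shape;
               zeros in gt are treated as invalid by default.
    """
    if valid is None:
        valid = gt > 0
    p, g = pred[valid], gt[valid]
    err = np.abs(p - g)
    metrics = {"abs_rel": float(np.mean(err / g))}
    for t in px_thresholds:
        metrics[f">{t}px"] = float(np.mean(err > t))     # fraction of bad pixels
    ratio = np.maximum(p / g, g / p)
    metrics["delta<1.25"] = float(np.mean(ratio < 1.25))
    return metrics
```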
On Input Sparsity
Thanks to the nature of the deep network and the sparsity-invariant convolution, our LidarStereoNet can handle Lidar input of varying density, ranging from no Lidar input at all to the full 64-line input. To study this trend, we downsampled the vertical and horizontal resolution of the Lidar points. As shown in Fig. 6, our method performs equally well when using 8 or more lines of Lidar points. Note that even when there are no Lidar points as input (in which case the problem becomes a pure stereo matching problem), our method still outperforms state-of-the-art stereo matching methods.
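The sparsity sweep above can be simulated by subsampling the projected sparse disparity maps; the keep-every-k-th-row/column scheme below is an assumed protocol for illustration, not necessarily the exact one used.

```python
import numpy as np

def downsample_lidar(sparse_disp, keep_rows=2, keep_cols=1):
    """Reduce the vertical/horizontal resolution of a projected sparse Lidar disparity map.

    Keeps measurements only on every `keep_rows`-th image row and every
    `keep_cols`-th column; everything else is set to zero (no measurement).
    """
    out = np.zeros_like(sparse_disp)
    out[::keep_rows, ::keep_cols] = sparse_disp[::keep_rows, ::keep_cols]
    return out
```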
Figure 7.
Gradually cleaned input Lidar points.
From top to bottom, left column: left image, cleaned Lidar points at the 2nd epoch, cleaned Lidar points at a later epoch; right column: raw Lidar points, error points found at the 2nd epoch, error points found at the later epoch. Note that the erroneous measurements on the right car have been gradually removed.
Table 2. Ablation study on the feedback loop. Type 1 and Type 2 show the performance when only the Core Architecture is used, without and with removing erroneous Lidar points from the input, respectively, while Full model means our proposed feedback loop.

| Methods | Abs Rel | >px | >px | >px |
|---|---|---|---|---|
| Type 1 | 0.0539 | 0.0411 | 0.0310 | 0.0229 |
| Type 2 | 0.0468 | 0.0401 | 0.0302 | 0.0226 |
| Full model | | | | |

In this section, we perform ablation studies to evaluate the importance of our feedback loop and the proposed losses. Notably, all ablation studies on losses and fusion strategies are evaluated on the Core Architecture only, in order to remove the randomness introduced by our feedback loop module.
Importance of the feedback loop
We evaluate the importance of the feedback loop in two aspects. One is to remove the erroneous points from the back-end, i.e., the loss computation; the other is to remove them from the input. In our baseline model (Type 1), we use the raw Lidar as our input and compute the Lidar loss on the raw points. For the Type 2 model, we also use the raw Lidar as input but compute the Lidar loss only on the cleaned Lidar points. Our full model uses cleaned Lidar points in both parts. As shown in Tab. 2, removing errors in the back-end already improves the performance. However, using cleaned Lidar points as input boosts the performance further on the bad-pixel metric, which demonstrates the importance of our feedback loop module.

Comparing different loss functions
Tab. 3 shows the performance gain with different losses. As we can see, when only using Lidar points as supervision, the performance is affected by the outliers in the Lidar measurements. Adding a warping loss reduces the error rate considerably, and adding our proposed plane fitting loss reduces it further. In the supplementary material, we additionally compare our soft slanted plane model with a hard plane fitting model; the soft one achieves better performance.

Table 3. Evaluation of different loss functions. $\mathcal{L}_w$, $\mathcal{L}_s$, $\mathcal{L}_l$ and $\mathcal{L}_p$ represent the warping loss, smoothness loss, Lidar loss and plane fitting loss, respectively.

| Loss | Abs Rel | >px | >px | >px |
|---|---|---|---|---|
| $\mathcal{L}_l$ | | | | |
| $\mathcal{L}_w + \mathcal{L}_s$ | | | | |
| $\mathcal{L}_w + \mathcal{L}_s + \mathcal{L}_l$ | | | | |
| $\mathcal{L}_w + \mathcal{L}_s + \mathcal{L}_l + \mathcal{L}_p$ | | | | |

Considering the problem of utilizing sparse depth information, one no-fusion approach is to directly use the Lidar measurements for supervision. As shown in Tab. 4, its performance is affected by the misaligned Lidar points and it has a relatively high error rate. A second method is to use the depth as a fourth channel in addition to the RGB image; we term this an early fusion strategy. As shown in Tab. 4, it has the worst performance among the baselines. This may be due to the incompatible characteristics of RGB images and depth maps, which a common convolution layer is unable to handle well. Our late fusion strategy achieves the best performance among them.

To illustrate that our method can generalize to other datasets, we compare our method with several methods on the Synthia dataset [27]. Synthia contains 5 sequences under different scenarios, and for each scenario, images are captured under different lighting and weather conditions such as Spring, Winter, Soft-rain, Fog and Night. We show quantitative results in Tab. 5; qualitative results are provided in the supplementary material.

For the sparse disparity inputs, we randomly selected 10% of the pixels at full image resolution. As discussed before, the projected Lidar points are misaligned with the stereo images in the KITTI dataset. To simulate a similar interference, we add various levels of Gaussian noise to the sparse disparity maps. As shown in Fig. 8, our proposed LidarStereoNet adapts well to the noisy input disparity maps, while S2D [19] fails to recover the disparity information.
Table 4.
Comparison of different fusion strategies.
| Methods | Abs Rel | >px | >px | >px |
|---|---|---|---|---|
| No Fusion | 0.0555 | 0.0733 | 0.0471 | 0.0296 |
| Early fusion | 0.0644 | 0.0667 | 0.0526 | 0.0398 |
| Our method | | | | |

Figure 8.
Ablation study on noise resistance on Synthia dataset.
Our method maintains consistent performance while the others show a notable performance drop.
Table 5.
Quantitative results on the Synthia dataset.
| Methods | Abs Rel | >px | >px | >px |
|---|---|---|---|---|
| SPS-ST [32] | 0.0475 | 0.0980 | 0.0879 | 0.0713 |
| S2D [19] | 0.0864 | 0.5287 | 0.4414 | 0.270 |
| SINet [30] | | | | |
| Our method | 0.0334 | | | |
6. Conclusion
In this paper, we have proposed an unsupervised, end-to-end learning based Lidar-Stereo fusion network, "LidarStereoNet", for accurate 3D perception in real-world scenarios. To effectively handle noisy Lidar points and misalignment between sensors, we presented a novel "Feedback Loop" that sorts out clean measurements by comparing the output stereo disparities with the input Lidar points. We have also introduced a piecewise slanted plane fitting loss to enforce strong 3D structural regularization on the generated disparity maps. Our LidarStereoNet does not need ground truth disparity maps for training and has good generalization capabilities. Extensive experiments demonstrate the superiority of our approach, which outperforms state-of-the-art stereo matching and depth completion methods by a large margin. Our approach can work reliably even when Lidar points are completely missing. In the future, we plan to extend our method to other depth perception and sensor fusion scenarios.
Acknowledgement
Y. Dai ([email protected]) is the corresponding author. This research was supported in part by the Australia Centre for Robotic Vision, Data61 CSIRO, the Natural Science Foundation of China grants (61871325, 61420106007) and the Australian Research Council (ARC) grants (LE190100080, CE140100016, DP190102261, DE140100180). The authors are grateful for the GPUs donated by NVIDIA.

References

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell., 34(11):2274–2282, Nov. 2012.
[2] H. Badino, D. Huber, and T. Kanade. Integrating lidar into stereo for fast and improved disparity computation. In International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, pages 405–412, May 2011.
[3] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5410–5418, 2018.
[4] Nathaniel Chodosh, Chaoyang Wang, and Simon Lucey. Deep convolutional compressed sensing for lidar depth completion. arXiv preprint arXiv:1803.08949, 2018.
[5] Jeffrey S. Deems, Thomas H. Painter, and David C. Finnegan. Lidar measurement of snow depth: a review. Journal of Glaciology, 59(215):467–479, 2013.
[6] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Proc. Adv. Neural Inf. Process. Syst., pages 2366–2374, 2014.
[7] Samir El-Omari and Osama Moselhi. Integrating 3d laser scanning and photogrammetry for progress measurement of construction work. Automation in Construction, 18(1):1–9, 2008.
[8] V. Gandhi, J. Čech, and R. Horaud. High-resolution depth maps based on tof-stereo fusion. In IEEE International Conference on Robotics and Automation, pages 4742–4749, May 2012.
[9] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012.
[10] Andreas Geiger, Julius Ziegler, and Christoph Stiller. Stereoscan: Dense 3d reconstruction in real-time. In IEEE Intelligent Vehicles Symposium, pages 963–968. IEEE, 2011.
[11] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
[12] Alastair Harrison and Paul Newman. Image and sparse laser fusion for dense scene reconstruction. In Field and Service Robotics, pages 219–228. Springer, 2010.
[13] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell., 30(2):328–341, Feb. 2008.
[14] Maximilian Jaritz, Raoul De Charette, Emilie Wirbel, Xavier Perrotton, and Fawzi Nashashibi. Sparse and dense data with cnns: Depth completion and semantic segmentation. In International Conference on 3D Vision, 2018.
[15] Jonathan Kelly and Gaurav S. Sukhatme. Visual-inertial sensor fusion: Localization, mapping and sensor-to-sensor self-calibration. International Journal of Robotics Research, 30(1):56–79, 2011.
[16] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. In Proc. IEEE Int. Conf. Comp. Vis., Oct 2017.
[17] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), pages 239–248. IEEE, 2016.
[18] Jesse Levinson, Jake Askeland, Jan Becker, Jennifer Dolson, David Held, Soeren Kammel, J. Zico Kolter, Dirk Langer, Oliver Pink, Vaughan Pratt, et al. Towards fully autonomous driving: Systems and algorithms. In IEEE Intelligent Vehicles Symposium (IV), pages 163–168. IEEE, 2011.
[19] Fangchang Ma and Sertac Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In IEEE International Conference on Robotics and Automation, pages 1–8. IEEE, 2018.
[20] W. Maddern and P. Newman. Real-time probabilistic fusion of sparse 3d lidar and dense stereo. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2181–2188, Oct 2016.
[21] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4040–4048, 2016.
[22] Moritz Menze, Christian Heipke, and Andreas Geiger. Joint 3d estimation of vehicles and scene flow. In ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
[23] Peyman Moghadam, Wijerupage Sardha Wijesoma, and Dong Jun Feng. Improving path planning and mapping based on stereo vision and lidar. In International Conference on Control, Automation, Robotics and Vision, pages 384–389. IEEE, 2008.
[24] Kevin Nickels, Andres Castano, and Christopher M. Cianci. Fusion of lidar and stereo range for mobile robots. International Conference on Advanced Robotics (ICAR), 1:65–70, 2003.
[25] Florin Oniga and Sergiu Nedevschi. Processing dense stereo data using elevation maps: Road surface, traffic isle, and obstacle detection. IEEE Transactions on Vehicular Technology, 59(3):1172–1182, 2010.
[26] Kihong Park, Seungryong Kim, and Kwanghoon Sohn. High-precision depth estimation with the 3d lidar and stereo fusion. In IEEE International Conference on Robotics and Automation (ICRA), pages 2156–2163. IEEE, 2018.
[27] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio Lopez. The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[28] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comp. Vis., 47(1-3):7–42, 2002.
[29] Nick Schneider, Lukas Schneider, Peter Pinggera, Uwe Franke, Marc Pollefeys, and Christoph Stiller. Semantically guided depth upsampling. In German Conference on Pattern Recognition, pages 37–48. Springer, 2016.
[30] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In International Conference on 3D Vision, 2017.
[31] Jure Žbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res., 17(1):2287–2318, Jan. 2016.
[32] Koichiro Yamaguchi, David McAllester, and Raquel Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In Proc. Eur. Conf. Comp. Vis., pages 756–771. Springer, 2014.
[33] Chi Zhang, Zhiwei Li, Yanhua Cheng, Rui Cai, Hongyang Chao, and Yong Rui. Meshstereo: A global stereo model with mesh alignment regularization for view interpolation. In Proc. IEEE Int. Conf. Comp. Vis., pages 2057–2065, 2015.
[34] Yinda Zhang, Sameh Khamis, Christoph Rhemann, Julien Valentin, Adarsh Kowdle, Vladimir Tankovich, Michael Schoenberg, Shahram Izadi, Thomas Funkhouser, and Sean Fanello. Activestereonet: End-to-end self-supervised learning for active stereo systems. In Proc. Eur. Conf. Comp. Vis., September 2018.
[35] Yiran Zhong, Yuchao Dai, and Hongdong Li. Self-supervised learning for stereo matching with self-improving ability. arXiv preprint arXiv:1709.00930, 2017.
[36] Yiran Zhong, Hongdong Li, and Yuchao Dai. Open-world stereo video matching with deep rnn. In Proc. Eur. Conf. Comp. Vis., September 2018.

Noise-Aware Unsupervised Deep Lidar-Stereo Fusion – Supplementary Material
Abstract
In this supplementary material, we provide our detailed network structure, a qualitative comparison of the hard and soft slanted plane constraints, qualitative and quantitative comparisons to stereo matching algorithms, and qualitative results on the Synthia dataset.
1. Detailed Network Structure
The core architecture of our LidarStereoNet contains three blocks: 1) feature extraction and fusion; 2) feature matching; and 3) disparity computing. We provide the detailed structure of the feature extraction and fusion block in Table 6. The feature matching block and the disparity computing block share the same structures as PSMnet [3].
Table 6. Feature extraction and fusion block architecture, where k, s, chns represent the kernel size, stride, and the number of input/output channels. We use "+" to represent feature concatenation.

Lidar feature extraction:

| layer | k, s | chns | input |
|---|---|---|---|
| conv_s1 | 11×11, 1 | 1/16 | disparity |
| conv_s2 | 7×7, 2 | 16/16 | conv_s1 |
| conv_s3 | 5×5, 1 | 16/16 | conv_s2 |
| conv_s4 | 3×3, 2 | 16/16 | conv_s3 |
| conv_s5 | 3×3, 1 | 16/16 | conv_s4 |
| conv_mask | 1×1, 1 | 17/16 | conv_s5 + mask |

Stereo feature extraction (this branch follows the feature extractor of PSMnet [3]): conv0_1 (3×3, stride 2, 3/32 channels, input: image) is followed by conv0_2 and conv0_3 (each 3×3, stride 1, 32/32 channels). Stacks of residual blocks conv1_n to conv4_n follow, each block composed of paired 3×3 convolutions; the first block of conv2 uses a stride of 2, and conv2_n repeats its block 15 times after conv2_1 (64/64 channels), with conv3_1 taking conv2_16 as input. Four pooling branches branch1 to branch4, with pooling sizes 64×64, 32×32, 16×16 and 8×8, each map 128/32 channels from conv4_3. A final lastconv ([3×3, 1; 1×1, 1]) is applied to the concatenation conv2_16 + conv4_3 + branch1 + branch2 + branch3 + branch4.

Feature fusion: lastconv + conv_mask.
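As an illustration, the Lidar branch of Table 6 can be assembled as below. `SparseConv` refers to the sparsity-invariant layer sketched in Section 4 of the main paper (not released code), and only the kernel sizes, strides and channel counts follow the table; the module path in the import is hypothetical.

```python
import torch
import torch.nn as nn

from sparse_conv import SparseConv  # hypothetical module holding the SparseConv sketch

class LidarFeatureExtractor(nn.Module):
    """Lidar branch of Table 6: five sparse convolutions plus a masked 1x1 fusion conv."""

    def __init__(self):
        super().__init__()
        # (in_ch, out_ch, kernel, stride) per the conv_s1 ... conv_s5 rows of Table 6.
        cfg = [(1, 16, 11, 1), (16, 16, 7, 2), (16, 16, 5, 1), (16, 16, 3, 2), (16, 16, 3, 1)]
        self.layers = nn.ModuleList([SparseConv(i, o, k, s) for i, o, k, s in cfg])
        self.conv_mask = nn.Conv2d(17, 16, kernel_size=1)  # conv_s5 output concatenated with the mask

    def forward(self, sparse_disp):
        mask = (sparse_disp > 0).float()
        x = sparse_disp
        for layer in self.layers:
            x, mask = layer(x, mask)
        return self.conv_mask(torch.cat([x, mask], dim=1))
```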
2. Hard versus Soft Plane Fitting
There are two kinds of plane fitting constraints. Conventional CRF-based methods use one slanted plane model to describe all disparities in one segment, i.e., the disparities inside one segment exactly obey one slanted plane model. We term this a "hard" plane fitting constraint. Our method, on the other hand, only applies this term as part of the whole optimization target. In other words, we only require the recovered disparities to fit a plane within a segment when possible, and this requirement can still be balanced by the other loss terms.

Fig. 9 illustrates the difference between our soft constraint and the CRF-style hard constraint on a recovered disparity map. As can be seen in Fig. 9, strictly applying the slanted plane model to the recovered disparity map decreases its performance, and the result is very sensitive to the segmentation as well. By switching the segments from Stereo SLIC to SLIC, the performance decreases further.

Figure 9.
Comparison of soft and hard constraints on the slanted plane model with different superpixel segmentation methods. Panels show the SLIC and Stereo SLIC segmentations with the corresponding hard-constraint results, and the ground-truth disparity with the soft-constraint result. Note that our recovered disparity map has boundaries that align better with the color image.
Figure 10.
Qualitative results on the Synthia dataset.
The first row shows the colorized disparity results, and the second row shows the corresponding error maps. From left to right, the columns show the sparse disparity input and the results of SPS-ST [32], S2D [19], SINet [30] and ours.

3. Comparisons with state-of-the-art stereo matching methods
For the sake of completeness, we provide qualitative and quantitative comparisons with state-of-the-art stereo matching methods. We choose SPS-ST [32], MC-CNN [31], PSMnet [3] and SsSMnet [35] for reference. Note that SPS-ST is a traditional (non-deep) method, and its meta-parameters were tuned on the KITTI dataset. For the deep MC-CNN we used a model that was first trained on the Middlebury dataset; for PSMnet we used both the model trained on the SceneFlow [21] dataset and the model ("-ft") that we fine-tuned on the KITTI VO dataset. We also compared our method with the state-of-the-art self-supervised stereo matching network SsSMnet [35].
Table 7.
Quantitative comparison on the selected KITTI 141 subset.
We compare our LidarStereoNet with various state-of-the-art stereo matching methods; our proposed method outperforms all the competing methods by a wide margin.
| Methods | Input | Supervised | Abs Rel | >px | >px | >px | δ < 1.25 | Density |
|---|---|---|---|---|---|---|---|---|
| MC-CNN [31] | Stereo | Yes | 0.0798 | 0.1070 | 0.0809 | 0.0555 | 0.9472 | 100.00% |
| PSMnet [3] | Stereo | Yes | 0.0807 | 0.2480 | 0.1460 | 0.0639 | 0.9399 | 100.00% |
| PSMnet-ft [3] | Stereo | Yes | 0.0609 | 0.0635 | 0.0410 | 0.0277 | 0.9689 | 100.00% |
| SPS-ST [32] | Stereo | No | 0.0633 | 0.0702 | 0.0413 | 0.0265 | 0.9660 | 100.00% |
| SsSMnet [35] | Stereo | No | 0.0619 | 0.0743 | 0.0498 | 0.0334 | 0.9633 | 100.00% |
| Our method | Stereo + Lidar | No | | | | | | |
Figure 11.
Qualitative results of the methods from Table 7.
Our method is trained on the KITTI VO dataset and tested on the selected unseen KITTI 141 subset without any finetuning.
4. Qualitative results on the Synthia dataset