Reducing Drift in Visual Odometry by Inferring Sun Direction Using a Bayesian Convolutional Neural Network
Valentin Peretroukhin†, Lee Clement†, and Jonathan Kelly

Abstract—We present a method to incorporate global orientation information from the sun into a visual odometry pipeline using only the existing image stream, where the sun is typically not visible. We leverage recent advances in Bayesian Convolutional Neural Networks to train and implement a sun detection model that infers a three-dimensional sun direction vector from a single RGB image. Crucially, our method also computes a principled uncertainty associated with each prediction, using a Monte Carlo dropout scheme. We incorporate this uncertainty into a sliding window stereo visual odometry pipeline where accurate uncertainty estimates are critical for optimal data fusion. Our Bayesian sun detection model achieves a median error of approximately 12 degrees on the KITTI odometry benchmark training set, and yields improvements of up to 42% in translational ARMSE and 32% in rotational ARMSE compared to standard VO. An open-source implementation of our Bayesian CNN sun estimator (Sun-BCNN) using Caffe is available at https://github.com/utiasSTARS/sun-bcnn-vo.
I. INTRODUCTION
Egomotion estimation is a fundamental building block of mobile autonomy. Although there exists an array of possible algorithm-sensor combinations that can estimate motion in unknown environments (e.g., LIDAR-based point-cloud matching [1] and visual-inertial navigation [2]), egomotion estimation remains a dead-reckoning technique that accumulates unbounded estimation error over time in the absence of global information such as GPS or a known map.

In this work, we focus on one technique to infer global orientation information without a known map: computing the direction of the sun. By leveraging recent advances in Bayesian Convolutional Neural Networks (BCNNs), we demonstrate how we can train a deep model to compute a sun direction vector from a single RGB image using only 20,000 training images. Furthermore, we show that our network can produce a principled covariance estimate that can readily be used in an egomotion estimation pipeline. We demonstrate one such use by incorporating sun direction estimates into a stereo visual odometry (VO) pipeline and report significant error reductions of up to 42% in translational average root mean squared error (ARMSE) and 32% in rotational ARMSE compared to plain VO on the KITTI odometry benchmark training set [3].

† Valentin Peretroukhin and Lee Clement contributed equally to this work and jointly assert first authorship. All authors are with the Space & Terrestrial Autonomous Robotic Systems (STARS) laboratory at the University of Toronto Institute for Aerospace Studies (UTIAS), Canada. {lee.clement, v.peretroukhin}@mail.utoronto.ca, [email protected]

‡ This version of the paper corrects minor errors in Section V-B.3 and the caption of Figure 5 of the published version.
Fig. 1: Sun-BCNN (Sun Bayesian Convolutional Neural Network) incorporated into a visual odometry (VO) pipeline. A Bayesian CNN infers sun direction estimates as a mean and covariance, which are then incorporated into a sliding window bundle adjuster to produce a final trajectory estimate.

Our main contributions are as follows:
1) We apply a Bayesian CNN to the problem of sun direction estimation, incorporating the resulting covariance estimates into a visual odometry pipeline;
2) We show that a Bayesian CNN with dropout layers after each convolutional and fully-connected layer can achieve state-of-the-art accuracy at test time;
3) We learn a 3D unit-length sun direction vector, appropriate for full 6-DOF pose estimation;
4) We present experimental results on 21.6 km of urban driving data from the KITTI odometry benchmark training set [3]; and
5) We release our Bayesian CNN sun estimator (Sun-BCNN) as open-source code.

II. RELATED WORK
Visual odometry (VO), a technique to estimate the egomotion of a moving platform equipped with one or more cameras, has a rich history of research including a notable implementation onboard the Mars Exploration Rovers (MERs) [4]. Modern approaches to VO can achieve estimation errors below 1% of total distance traveled [3]. To achieve such accurate and robust estimates, modern techniques use careful visual feature pruning [5], adaptive robust methods [6], [7], or operate directly on pixel intensities [8].

Independent of the estimator, VO exhibits super-linear error growth [9], and is particularly sensitive to errors in orientation [9], [5]. One way to reduce orientation error is to incorporate observations of a landmark whose position or direction in the navigation frame is known a priori. The sun is an example of such a known directional landmark. Accordingly, sun sensors have been used to improve the accuracy of VO in planetary analogue environments [10], [11], and were also incorporated into the MERs [12], [13]. More recently, software-based alternatives have been developed that can estimate the direction of the sun from a single image, making sun-aided navigation possible without additional sensors or a specially-oriented camera [14]. Some of these methods have been based on hand-crafted illumination cues [15], [14], while others have attempted to learn such cues from data using deep Convolutional Neural Networks (CNNs) [16].

Fig. 2: Three conv1 layer activation maps (well-lit regions, sky, and shadows) superimposed on two images from the KITTI [3] odometry benchmark for three selected filters. Each filter picks out salient parts of the image that aid in sun direction inference.

CNNs have been applied to a wide range of classification, segmentation, and learning tasks in computer vision [17]. Recent work has shown that CNNs can learn orientation information directly from images by modifying the loss functions of existing discrete classification-based CNN architectures into continuous regression losses [16], [18], [19]. Despite their success in improving prediction accuracy, most existing CNN-based models do not report principled uncertainty estimates, which are important in the context of data fusion. To address this, Gal and Ghahramani [20] showed that it is possible to achieve principled covariance outputs with only minor modifications to existing CNN architectures. An early application of this uncertainty quantification was presented by Kendall et al. [19], who used it to improve their prior work on camera pose regression.

Our method is similar in spirit to the work of Ma et al. [16], who built a CNN-based sun sensor as part of a relocalization pipeline. We also extend the work of Clement et al. [14], who demonstrated that virtual sun sensors can improve VO accuracy. Our model makes three important improvements: 1) in addition to a point estimate of the sun direction, we output a principled covariance estimate that is incorporated into our estimator; 2) we produce a full 3D sun direction estimate with azimuth and zenith angles that is better suited to 6-DOF estimation problems (as opposed to only the azimuth angle and 3-DOF estimator in [16]); and 3) we incorporate the sun direction covariance into a VO estimator that accounts for growth in pose uncertainty over time (unlike [14]). Furthermore, our Bayesian CNN includes a dropout layer after every convolutional and fully connected layer (as outlined in [20] but not done in [19]), which produces more principled covariance outputs.
III. INDIRECT SUN DETECTION USING A BAYESIAN CONVOLUTIONAL NEURAL NETWORK
We use a Convolutional Neural Network (CNN) to infer the direction of the sun. We motivate the choice of a deep model through the empirical findings of Clement et al. [14] and Ma et al. [16], who demonstrated that a CNN-based sun detector can substantially outperform hand-crafted models such as that of Lalonde et al. [15].

We choose a deep neural network structure based on GoogLeNet [21] due to its use in past work that adapted it for orientation regression [18]. Unlike Ma et al. [16], we choose to transfer weights trained on the MIT Places dataset [22] rather than ImageNet [23]. We believe the MIT Places dataset is a more appropriate starting point for localization tasks than ImageNet, since it includes outdoor scenes and is concerned with classifying physical locations rather than objects.
A. Cost Function
We train the network by minimizing the cosine distance between the (unit-norm) target sun direction vector $\mathbf{s}_k$ and the predicted (unit-norm) sun direction vector $\hat{\mathbf{s}}_k$, where $k$ indexes the images in the training set:

$$\mathcal{L}(\hat{\mathbf{s}}_k) = 1 - (\hat{\mathbf{s}}_k \cdot \mathbf{s}_k). \quad (1)$$

Note that in our implementation, we do not formulate the cosine distance loss explicitly, but instead minimize half the square of the Euclidean distance between $\mathbf{s}_k$ and $\hat{\mathbf{s}}_k$. Since both vectors have unit length, this is equivalent to minimizing Equation (1):

$$\frac{1}{2}\|\hat{\mathbf{s}}_k - \mathbf{s}_k\|^2 = \frac{1}{2}\left(\|\hat{\mathbf{s}}_k\|^2 + \|\mathbf{s}_k\|^2 - 2\,\hat{\mathbf{s}}_k \cdot \mathbf{s}_k\right) = 1 - (\hat{\mathbf{s}}_k \cdot \mathbf{s}_k) = \mathcal{L}(\hat{\mathbf{s}}_k).$$
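As a quick numerical check of this equivalence (a minimal numpy sketch; the actual loss is implemented as a Caffe layer, which is not shown here):

```python
import numpy as np

def cosine_distance_loss(s_hat, s):
    # L = 1 - s_hat . s, for unit-norm vectors (Equation 1)
    return 1.0 - np.dot(s_hat, s)

def half_squared_euclidean_loss(s_hat, s):
    # (1/2) ||s_hat - s||^2, the form minimized in practice
    return 0.5 * np.dot(s_hat - s, s_hat - s)

# Two random unit-norm sun direction vectors
rng = np.random.default_rng(0)
s_hat = rng.normal(size=3); s_hat /= np.linalg.norm(s_hat)
s = rng.normal(size=3); s /= np.linalg.norm(s)

# The two losses agree for unit-length inputs
assert np.isclose(cosine_distance_loss(s_hat, s),
                  half_squared_euclidean_loss(s_hat, s))
```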
B. Uncertainty Estimation

To output principled covariances for sun direction estimates, we adopt Bayesian Convolutional Neural Networks (BCNNs) [20], [24], [25]. BCNNs rely on a connection between stochastic regularization (e.g., dropout, a widely adopted technique in deep learning) and approximate variational inference of a Bayesian Neural Network. We outline the technique here briefly, and refer the reader to [24] for more details.

The method begins with a prior on the weights in a deep neural network, $p(\mathbf{w})$, and attempts to compute a posterior distribution $p(\mathbf{w} \mid \mathbf{X}, \mathbf{S})$ given training inputs $\mathbf{X}$ and targets $\mathbf{S} = \{\mathbf{s}_k\}$. This posterior can be used to compute a predictive distribution for test samples but is generally intractable. To overcome this, the BCNN approach notes that CNN training with stochastic regularization can be viewed as variational inference if we define a variational distribution $q(\mathbf{w})$ as

$$q(\mathbf{w}_i) = \mathbf{M}_i \,\mathrm{diag}\left(\{b_{ij}\}_{j=1}^{K_i}\right), \quad (2)$$
$$b_{ij} \sim \mathrm{Bernoulli}(p_i). \quad (3)$$

Here, $i$ indexes a particular layer in the neural network with $K_i$ weights, $\mathbf{M}_i$ are the weights to be optimized, $b_{ij}$ are Bernoulli-distributed binary variables, and $p_i$ is the dropout probability for weights in layer $i$.

With this variational distribution $q(\mathbf{w})$, training a CNN with dropout results in the same weights $\mathbf{w}$ as minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior, $\mathrm{KL}\left(q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathbf{X}, \mathbf{S})\right)$. At test time, the first two moments of the predictive distribution are approximated using Monte Carlo integration over the weights $\mathbf{w}$:

$$\mathbb{E}(\hat{\mathbf{s}}^*_k) = \bar{\hat{\mathbf{s}}}^*_k \approx \frac{1}{N} \sum_{n=1}^{N} \hat{\mathbf{s}}^*_k(\mathbf{x}^*, \mathbf{w}_n), \quad (4)$$

$$\mathrm{Cov}(\hat{\mathbf{s}}^*_k) \approx \tau^{-1} \mathbf{I} + \frac{1}{N} \sum_{n=1}^{N} \hat{\mathbf{s}}^*_k(\mathbf{x}^*, \mathbf{w}_n)\, \hat{\mathbf{s}}^*_k(\mathbf{x}^*, \mathbf{w}_n)^T - \bar{\hat{\mathbf{s}}}^*_k \bar{\hat{\mathbf{s}}}^{*T}_k, \quad (5)$$

where $\mathbf{I}$ is the identity matrix and $\mathbf{w}_n$ is a sample from $q(\mathbf{w})$ (obtained by sampling the network with dropout). The model precision, $\tau$, is computed as

$$\tau = \frac{p\, l^2}{2 M \lambda}, \quad (6)$$

where $p$ is the dropout probability, $l$ is the characteristic length scale, $M$ is the number of samples in the training data, and $\lambda$ is the weight decay.

Following Gal and Ghahramani [24], we build our BCNN by adding dropout layers after every convolutional and fully connected layer in the network. We then retain these layers at test time to sample the network stochastically (following the technique of Monte Carlo dropout), and obtain the relevant statistical quantities using Equations (4) and (5).
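To make the test-time procedure concrete, the following numpy sketch implements Equations (4)-(6); `stochastic_forward` is a hypothetical stand-in for a dropout-enabled forward pass of the network and is not part of the released Caffe implementation:

```python
import numpy as np

def mc_dropout_moments(stochastic_forward, x, N=25, tau=1.0):
    """Approximate the predictive mean and covariance of a BCNN output
    via Monte Carlo dropout (Equations 4-5).

    stochastic_forward: callable returning one 3-vector prediction per call,
                        with dropout active (a new weight sample each call).
    """
    samples = np.stack([stochastic_forward(x) for _ in range(N)])  # (N, 3)
    mean = samples.mean(axis=0)                                    # Equation (4)
    # Equation (5): tau^-1 * I + second moment - outer product of the mean
    second_moment = samples.T @ samples / N
    cov = np.eye(3) / tau + second_moment - np.outer(mean, mean)
    return mean, cov

def model_precision(p_dropout, length_scale, M, weight_decay):
    # Equation (6): tau = p * l^2 / (2 * M * lambda)
    return p_dropout * length_scale**2 / (2.0 * M * weight_decay)
```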
IV. SLIDING WINDOW STEREO VISUAL ODOMETRY

We adopt a sliding window sparse stereo VO technique that has been used in a number of successful mobile robotics applications [26], [27], [28], [29]. Our task is to estimate a window of $SE(3)$ poses $\{\mathbf{T}_{k_1,0}, \ldots, \mathbf{T}_{k_2,0}\}$ expressed in a base coordinate frame $\mathcal{F}_0$, given a prior estimate of the transformation $\mathbf{T}_{k_1,0}$. We accomplish this by tracking keypoints across pairs of stereo images and computing an initial guess for each pose in the window using frame-to-frame point cloud alignment, which we then refine by solving a local bundle adjustment problem over the window. In our experiments we choose a window size of two, which provides good VO accuracy at low computational cost. As discussed in Section IV-C, we select the initial pose $\mathbf{T}_{0,0}$ to be the first GPS ground truth pose, such that $\mathcal{F}_0$ is a local East-North-Up (ENU) coordinate system with its origin at the first GPS position.

A. Observation Model
We assume that our stereo images have been de-warped and rectified in a pre-processing step, and model the stereo camera as a pair of perfect pinhole cameras with focal lengths $f_u, f_v$ and principal points $(c_u, c_v)$, separated by a fixed and known baseline $b$. If we take $\mathbf{p}^j$ to be the homogeneous 3D coordinates of keypoint $j$, expressed in our chosen base frame $\mathcal{F}_0$, we can transform the keypoint into the camera frame at pose $k$ to obtain $\mathbf{p}^{jk} = \mathbf{T}_{k,0}\,\mathbf{p}^j = \begin{bmatrix} p^{jk}_x & p^{jk}_y & p^{jk}_z & 1 \end{bmatrix}^T$. Our observation model $g(\cdot)$ can then be formulated as

$$\mathbf{y}_{k,j} = g\left(\mathbf{p}^{jk}\right) = \begin{bmatrix} u \\ v \\ d \end{bmatrix} = \begin{bmatrix} f_u\, p^{jk}_x / p^{jk}_z + c_u \\ f_v\, p^{jk}_y / p^{jk}_z + c_v \\ f_u\, b / p^{jk}_z \end{bmatrix}, \quad (7)$$

where $(u, v)$ are the keypoint coordinates in the left image and $d$ is the disparity in pixels.
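A direct transcription of the observation model in Equation (7) (numpy sketch; the function and parameter names are ours):

```python
import numpy as np

def stereo_project(p_cam, fu, fv, cu, cv, b):
    """Project a 3D point in the camera frame to (u, v, d) per Equation (7).

    p_cam: 3D point [x, y, z] expressed in the camera frame at pose k.
    Returns pixel coordinates (u, v) in the left image and disparity d.
    """
    x, y, z = p_cam
    u = fu * x / z + cu   # horizontal pixel coordinate
    v = fv * y / z + cv   # vertical pixel coordinate
    d = fu * b / z        # disparity in pixels
    return np.array([u, v, d])
```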
B. Sliding Window Bundle Adjustment

We use the open-source libviso2 package [28] to detect and track keypoints between stereo image pairs. Based on these keypoint tracks, a three-point Random Sample Consensus (RANSAC) algorithm generates an initial guess of the inter-frame motion and rejects outlier keypoint tracks by thresholding their reprojection error. We compound these pose-to-pose transformation estimates through our chosen window and refine them using a local bundle adjustment, which we solve using the nonlinear least-squares solver Ceres [30]. The objective function to be minimized can be written as

$$J = J_{\text{reprojection}} + J_{\text{prior}}, \quad (8)$$

where

$$J_{\text{reprojection}} = \sum_{k=k_1}^{k_2} \sum_{j=1}^{J} \mathbf{e}_{\mathbf{y}_{k,j}}^T \mathbf{R}_{\mathbf{y}_{k,j}}^{-1} \mathbf{e}_{\mathbf{y}_{k,j}} \quad (9)$$

and

$$J_{\text{prior}} = \mathbf{e}_{\check{\mathbf{T}}_{k_1,0}}^T \mathbf{R}_{\check{\mathbf{T}}_{k_1,0}}^{-1} \mathbf{e}_{\check{\mathbf{T}}_{k_1,0}}. \quad (10)$$

The quantity $\mathbf{e}_{\mathbf{y}_{k,j}} = \hat{\mathbf{y}}_{k,j} - \mathbf{y}_{k,j}$ represents the reprojection error of keypoint $j$ for camera pose $k$, with $\mathbf{R}_{\mathbf{y}_{k,j}}$ being the covariance of these errors. The predicted measurements are given by $\hat{\mathbf{y}}_{k,j} = g\left(\hat{\mathbf{T}}_{k,0}\, \hat{\mathbf{p}}^j\right)$, where $\hat{\mathbf{T}}_{k,0}$ and $\hat{\mathbf{p}}^j$ are the estimated poses and keypoint positions in base frame $\mathcal{F}_0$.

The cost term $J_{\text{prior}}$ imposes a normally distributed prior $\check{\mathbf{T}}_{k_1,0}$ on the first pose in the current window, based on the estimate of this pose in the previous window. The error in the current estimate $\hat{\mathbf{T}}_{k_1,0}$ of this pose compared to the prior can be computed using the $SE(3)$ matrix logarithm, $\mathbf{e}_{\check{\mathbf{T}}_{k_1,0}} = \log\left(\check{\mathbf{T}}_{k_1,0}^{-1} \hat{\mathbf{T}}_{k_1,0}\right)^\vee$. The $6 \times 6$ matrix $\mathbf{R}_{\check{\mathbf{T}}_{k_1,0}}$ is the covariance associated with $\check{\mathbf{T}}_{k_1,0}$ in its local tangent space, and is obtained as part of the previous window's bundle adjustment solution. This prior term allows consecutive windows of pose estimates to be combined in a principled way that appropriately propagates global pose uncertainty from window to window, which is essential in the context of optimal data fusion.
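The prior term can be illustrated with a small numpy sketch (our own simplified helpers; the paper's optimizer is Ceres in C++, and the $SE(3)$ log map below is a first-order approximation that omits the inverse left Jacobian on the translation component, which is adequate only for small prior errors):

```python
import numpy as np

def so3_log(R):
    """Rotation vector (axis-angle) from a 3x3 rotation matrix.
    Note: the near-pi singularity is not handled in this sketch."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return np.zeros(3)
    return theta / (2.0 * np.sin(theta)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

def se3_log(T):
    """6-vector [rho, phi] from a 4x4 transform (first-order approximation)."""
    phi = so3_log(T[:3, :3])
    rho = T[:3, 3]
    return np.concatenate([rho, phi])

def prior_error(T_check, T_hat):
    # e = log(T_check^-1 * T_hat)^v, the tangent-space error in Equation (10)
    return se3_log(np.linalg.inv(T_check) @ T_hat)

def prior_cost(T_check, T_hat, R_prior):
    # J_prior = e^T R^-1 e, with R_prior the 6x6 tangent-space covariance
    e = prior_error(T_check, T_hat)
    return e @ np.linalg.solve(R_prior, e)
```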
C. Sun-based Orientation Correction

In order to combat drift in the VO estimate produced by accumulated orientation error, we adopt the technique of Lambert et al. [11] to incorporate absolute orientation information from the sun directly into the estimation problem. We assume the initial camera pose and its timestamp are available from GPS and use them to determine the global direction of the sun $\mathbf{s}$, expressed as a 3D unit vector, based on a solar ephemeris model that computes the sun direction for a given date, time, and location on Earth. We define the world frame $\mathcal{F}_w$ to be a local ENU coordinate system with the initial GPS position as its origin. At each timestep we update $\mathbf{s}$ by querying the ephemeris model using the current timestamp and the initial camera pose, allowing us to account for the apparent motion of the sun over long trajectories. Note that here we are using the notation $\mathbf{s}_k$ to represent the sun vector predicted by our sun sensing apparatus (denoted $\hat{\mathbf{s}}_k$ in Section III), not the ground truth training vector.

By transforming the global sun direction into each camera frame $\mathcal{F}_k$ in the window, we obtain expected sun directions $\hat{\mathbf{s}}_k = \hat{\mathbf{T}}_{k,0}\, \mathbf{s}$, where $\hat{\mathbf{T}}_{k,0}$ is the current estimate of camera pose $k$ in the base frame. We compare the expected sun direction $\hat{\mathbf{s}}_k$ to the estimated sun direction $\mathbf{s}_k$ to introduce an additional error term into the bundle adjustment cost function (cf. Equation (8)):

$$J = J_{\text{reprojection}} + J_{\text{prior}} + J_{\text{sun}}, \quad (11)$$

where

$$J_{\text{sun}} = \sum_{k=k_1}^{k_2} \mathbf{e}_{\mathbf{s}_k}^T \mathbf{R}_{\mathbf{s}_k}^{-1} \mathbf{e}_{\mathbf{s}_k}, \quad (12)$$

and $J_{\text{reprojection}}$ and $J_{\text{prior}}$ are defined in Equations (9) and (10), respectively. This additional cost term constrains the orientation of the camera, which helps limit drift in the VO result due to orientation error [11].

Since $\mathbf{s}_k$ is constrained to be unit length, there are only two underlying degrees of freedom. We therefore define $f(\cdot)$ to be a function that transforms a 3D unit vector in camera frame $\mathcal{F}_k$ to a zenith-azimuth parameterization:

$$\begin{bmatrix} \theta \\ \phi \end{bmatrix} = f(\mathbf{s}_k) = \begin{bmatrix} \mathrm{acos}(-s_{k,y}) \\ \mathrm{atan2}(s_{k,x}, s_{k,z}) \end{bmatrix}, \quad (13)$$

where $\mathbf{s}_k = \begin{bmatrix} s_{k,x} & s_{k,y} & s_{k,z} \end{bmatrix}^T$. We can then define the term $\mathbf{e}_{\mathbf{s}_k} = f(\mathbf{s}_k) - f(\hat{\mathbf{s}}_k)$ to be the error in the predicted sun direction, expressed in zenith-azimuth coordinates, and $\mathbf{R}_{\mathbf{s}_k}$ to be the covariance of these errors. While $\mathbf{R}_{\mathbf{s}_k}$ would generally be treated as an empirically determined static covariance, in our approach we use the per-observation covariance computed using Equation (5), which allows us to weight each observation individually according to a measure of its intrinsic quality. In practice, we also attempt to mitigate the effect of outlier sun predictions by applying a robust Huber loss to the sun measurements in our optimizer.
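A direct transcription of Equation (13) and the sun error term (numpy sketch; function and variable names are ours):

```python
import numpy as np

def zenith_azimuth(s):
    """Map a unit sun direction vector in the camera frame to the
    zenith-azimuth parameterization of Equation (13)."""
    sx, sy, sz = s
    theta = np.arccos(-sy)       # zenith angle
    phi = np.arctan2(sx, sz)     # azimuth angle
    return np.array([theta, phi])

def sun_error(s_meas, T_hat_k0, s_global):
    """Sun residual e_sk = f(s_k) - f(s_hat_k) used in Equation (12).

    s_meas:   unit sun direction inferred by Sun-BCNN in camera frame k.
    T_hat_k0: current 4x4 estimate of camera pose k in the base frame.
    s_global: ephemeris sun direction (unit 3-vector, base frame).
    """
    # Directions are translation-free, so only the rotation block applies
    s_pred = T_hat_k0[:3, :3] @ s_global
    e = zenith_azimuth(s_meas) - zenith_azimuth(s_pred)
    e[1] = (e[1] + np.pi) % (2 * np.pi) - np.pi  # wrap azimuth difference
    return e
```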
V. EXPERIMENTS

To train and test Sun-BCNN we used the KITTI odometry benchmark training sequences [3]. Because we rely on the first pose reported by the GPS/INS system, we used the raw (rectified and synchronized) sequences corresponding to each odometry sequence. However, the raw sequence corresponding to odometry sequence 03 was not available on the KITTI website at the time of writing, so we omit sequence 03 from our analysis. In this section, the test datasets simply correspond to each odometry sequence, while the corresponding training datasets consist of the union of the remaining nine sequences.

A. Training Sun-BCNN
We implemented our network in Caffe [31] (for the normalization layers, we used the L2Norm layer from the Caffe-SL fork, https://github.com/wanji/caffe-sl) and trained the network using stochastic gradient descent, performing 30,000 iterations with a batch size of 64. This results in approximately 100 epochs of training on an average of roughly 20,000 images. We set all dropout probabilities to 0.5.
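The epoch count follows directly from this configuration (a quick arithmetic check):

```python
iterations = 30_000
batch_size = 64
num_train_images = 20_000  # approximate average size of each training split

epochs = iterations * batch_size / num_train_images
print(epochs)  # 96.0, i.e., roughly 100 epochs
```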
1) Data Preparation & Transfer Learning:
We resized the KITTI images from their original, rectified size of 1242 × 375 pixels to 224 × 224 pixels to achieve the image size expected by GoogLeNet. We experimented with preserving the aspect ratio of the original image (padding zeros to the top and bottom of the resized image), but found that preserving the vertical resolution (as in [16]) resulted in better test-time accuracy. We performed no additional cropping or rotating of the images.
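A minimal sketch of this preprocessing step (assuming OpenCV; the paper does not specify the resizing library or interpolation mode):

```python
import cv2  # any image library with bilinear resizing would do

def prepare_kitti_image(img):
    """Resize a rectified KITTI image (1242x375) to the 224x224 input
    expected by GoogLeNet, without preserving the aspect ratio."""
    return cv2.resize(img, (224, 224), interpolation=cv2.INTER_LINEAR)
```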
2) Model Precision:
We found an empirically optimal model precision τ (see Equation (6)) by optimizing the Average Normalized Estimation Error Squared (ANEES) on test error. In principle, this hyperparameter should be tuned using a validation set, but we omitted this step to keep our training procedure close to that of Ma et al. [16]. We note that the BCNN uncertainty estimates are affected by two significant factors: 1) variational inference is known to underestimate predictive variance [25]; and 2) we assume the observation noise is homoscedastic. As noted by Gal [25], the BCNN can be made heteroscedastic by learning the model precision during training, but this extension is outside the scope of this work.
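For reference, ANEES can be computed from the angle-space errors and predicted covariances as follows (a minimal sketch; we assume normalization by the two-degree-of-freedom error dimension, which makes a consistent estimator score near 1, in line with the values in Table I):

```python
import numpy as np

def anees(errors, covariances):
    """Average Normalized Estimation Error Squared.

    errors:      (K, 2) array of zenith-azimuth errors e_k.
    covariances: (K, 2, 2) array of predicted covariances R_sk.
    """
    nees = np.array([e @ np.linalg.solve(R, e)
                     for e, R in zip(errors, covariances)])
    dim = errors.shape[1]
    return nees.mean() / dim
```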
B. Testing Sun-BCNN

Once trained, we analyzed the accuracy and consistency of the Sun-BCNN mean ($\mathbf{s}_k$) and covariance ($\mathbf{R}_{\mathbf{s}_k}$) estimates.

TABLE I: Test errors for Sun-BCNN on KITTI odometry sequences, with estimates computed at every image. We compute Average Normalized Estimation Error Squared (ANEES) values using all sun directions that fall below a fixed cosine distance threshold, with τ⁻¹ set to a fixed value.

         |  Zenith Error [deg]    |  Azimuth Error [deg]    | Vector Angle Error [deg] |
Sequence |  Mean   Median  Stdev  |  Mean    Median  Stdev  |  Mean   Median  Stdev    | ANEES
00       |  -2.59   -1.37   5.15  |  -0.33     0.81  25.61  | 13.56   10.31   13.14    | 1.00
01       | -12.53   -8.31  10.33  |   8.95     8.83  33.67  | 22.16   17.85   15.00    | 1.38
02       |  -6.13   -4.26   7.38  |  -1.03     0.74  37.61  | 19.69   14.32   18.25    | 1.40
04       |  -2.42   -2.11   1.64  |  -3.89    -2.18   9.14  |  5.33    3.29    6.44    | 0.30
05       |  -4.31   -2.51   6.18  |  -0.74    -3.80  29.81  | 15.66   11.33   14.80    | 1.05
06       |  -2.48   -2.52   2.27  | -12.22   -17.86  25.78  | 19.78   17.72   11.35    | 1.93
07       |  -0.69   -0.16   3.26  |   1.25     5.98  20.27  | 12.44   10.05    9.97    | 0.97
08       |  -4.46   -1.61   8.14  |   3.66    -0.14  41.73  | 19.90   13.30   19.59    | 1.04
09       |  -1.35   -0.75   5.60  |   4.78     2.36  23.84  | 13.09    9.48   12.66    | 0.73
10       |                        |                         |                          |
Fig. 3: Box-and-whiskers plots of the final zenith, azimuth, and vector angle test errors on all ten KITTI odometry sequences (cf. Table I).
1) Computing $\mathbf{s}_k$: We evaluated Equation (4) (setting $N = 25$) and then renormalized the resulting mean vector to preserve unit length.
2) Computing $\mathbf{R}_{\mathbf{s}_k}$: To obtain the required covariance on azimuth and zenith angles (recall that the BCNN outputs unit-length direction vectors), we sampled the vector outputs, converted them to azimuth and zenith angles using Equation (13), and then applied Equation (5). It is also possible to retain the samples in unit vector form, apply Equation (5), and then propagate this covariance through a linearized Equation (13). In this paper we used the former approach, leaving a comparison of these two uncertainty propagation schemes to future work.
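Putting Sections V-B.1 and V-B.2 together, a numpy sketch of how the N = 25 stochastic samples are condensed into the quantities used by the estimator (names are ours):

```python
import numpy as np

def sun_bcnn_estimate(samples, tau):
    """Combine N stochastic forward passes into a renormalized mean
    direction s_k (Section V-B.1) and a zenith-azimuth covariance R_sk
    (Section V-B.2).

    samples: (N, 3) array of unit sun direction vectors from MC dropout.
    """
    # Mean direction, renormalized to preserve unit length
    s_mean = samples.mean(axis=0)
    s_mean /= np.linalg.norm(s_mean)

    # Convert each sample to (zenith, azimuth) via Equation (13)...
    angles = np.stack([np.array([np.arccos(-s[1]), np.arctan2(s[0], s[2])])
                       for s in samples])

    # ...then apply Equation (5) in angle space
    mean_angles = angles.mean(axis=0)
    second_moment = angles.T @ angles / len(angles)
    R_sk = np.eye(2) / tau + second_moment - np.outer(mean_angles, mean_angles)
    return s_mean, R_sk
```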
3) Results:
Table I summarizes the test errors numerically, while Figures 3 and 4 plot the error distributions for azimuth, zenith, and angular distance for all ten KITTI odometry sequences. Table I also lists the Average Normalized Estimation Error Squared (ANEES) values for each sequence. Figure 5 shows three characteristic plots of the azimuth and zenith predictions over time. Sun-BCNN achieved median vector angle errors of less than 15 degrees on every sequence except 01 and 06, which were particularly difficult in places due to challenging lighting conditions. As illustrated in Figure 2, Sun-BCNN often relies on strong shadows to estimate the sun direction.

C. Visual Odometry with Simulated Sun Sensing
In order to gauge the effectiveness of incorporating sun information in each sequence, and to determine the impact of measurement error, we constructed several sets of simulated sun measurements by computing ground truth sun vectors and artificially corrupting them with varying levels of zero-mean Gaussian noise (a sketch of this procedure appears at the end of this subsection). We obtained these ground truth sun vectors by transforming the ephemeris vector into each camera frame using ground truth vehicle poses. We selected our noise levels such that the mean angular error of each simulated dataset was approximately 0, 10, 20, and 30 degrees, and denote each such dataset as "GT-Sun-0", "GT-Sun-10", "GT-Sun-20", and "GT-Sun-30", respectively.

Fig. 4: Distributions of azimuth error, zenith error, and angular distance for Sun-BCNN compared to ground truth over each test sequence. Top row: Cumulative distributions of errors for each test sequence individually. Bottom row: Histograms and Gaussian fits of aggregated errors (azimuth: µ = 0.68, σ = 32.23; zenith: µ = -4.01, σ = 7.06; angular distance: µ = 16.66).

Fig. 5: Azimuth (Sun-BCNN and Sun-CNN [16]) and zenith (Sun-BCNN only) predictions over time for KITTI test sequences 04, 06, and 10. Sun-CNN is trained and tested on every tenth image, whereas Sun-BCNN is trained and tested on all frames (in our VO experiments, we use the Sun-BCNN predictions of every tenth image to make a fair comparison).

Figures 6a to 6c show the results we obtained using simulated sun measurements on the 2.2 km odometry sequence 05, in which the basic VO suffers from substantial orientation drift. Incorporating absolute orientation information from the simulated sun sensor allows the VO to correct these errors, but the magnitude of the correction decreases as sensor noise increases. As shown in Table II, which summarizes our VO results for all ten sequences, this is typical of sequences where orientation drift is the dominant source of error.

While the VO solutions for sequences such as 00 do not improve in terms of translational ARMSE, Table II shows that rotational ARMSE nevertheless improves on all ten sequences when low-noise simulated sun measurements are included. This implies that the estimation errors of the basic VO solutions for certain sequences are dominated by non-rotational effects, and that the apparent benefit of the Lalonde method on translational ARMSE in some sequences is likely coincidental. We speculate that incorporating a motion prior in our VO pipeline may mitigate these additional translational errors, and leave such an investigation to future work.
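A minimal sketch of one way to generate such corrupted measurements (our own construction; the paper does not specify the exact sampling scheme, and calibrating sigma to hit a target mean angular error is omitted):

```python
import numpy as np

def corrupt_sun_vector(s_true, sigma_rad, rng):
    """Perturb a ground truth unit sun vector with zero-mean Gaussian
    angular noise, as used to build the GT-Sun-* datasets.

    A random rotation axis perpendicular to s_true is drawn, and s_true
    is rotated about it by an angle ~ N(0, sigma_rad^2).
    """
    # Random unit axis orthogonal to s_true
    v = rng.normal(size=3)
    axis = np.cross(s_true, v)
    axis /= np.linalg.norm(axis)

    angle = rng.normal(scale=sigma_rad)
    # Rodrigues' rotation; the axis . s_true term vanishes by construction
    return (np.cos(angle) * s_true
            + np.sin(angle) * np.cross(axis, s_true))

# Example usage
rng = np.random.default_rng(0)
s_noisy = corrupt_sun_vector(np.array([0.0, -1.0, 0.0]), np.deg2rad(10), rng)
```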
Figures 6d to 6f show the results we obtained for sequence 05 using the Sun-CNN of Ma et al. [16], which estimates only the azimuth angle of the sun; our Bayesian Sun-BCNN, which provides full 3D estimates of the sun direction as well as a measure of the uncertainty associated with each estimate; and the method of Lalonde et al. in its original [15] and VO-informed [14] forms, which provide 3D estimates of the sun direction without reasoning about uncertainty. A selection of results using simulated sun measurements is also displayed for reference. All four sun detection methods succeed in reducing the growth of total estimation error on this sequence, with Sun-BCNN reducing both translational and rotational error growth significantly more than the other three methods. Both Sun-CNN and Sun-BCNN outperform the two Lalonde variants, consistent with the results of Ma et al. [16] and Clement et al. [14].

Table II shows results for all ten sequences using each method. With few exceptions, the VO results using Sun-BCNN achieve improvements in rotational and translational ARMSE comparable to those achieved using the simulated sun measurements with between 10 and 30 degrees average error. As previously noted, sequences such as 00 do not benefit significantly from sun sensing, since rotational drift is not the dominant source of estimation error in these cases. Nevertheless, these results indicate that CNN-based sun sensing is a valuable tool for improving localization accuracy in VO, an improvement that comes without the need for additional sensors or a specially oriented camera.

TABLE II: Comparison of translational and rotational average root mean squared error (ARMSE) on KITTI odometry sequences with and without sun direction estimates at every tenth image. Because we rely on the timestamps and first pose reported by the GPS/INS system, we use the raw (rectified and synchronized) sequences corresponding to each odometry sequence. However, the raw sequence corresponding to odometry sequence 03 was not available on the KITTI website at the time of writing, so we omit sequence 03 from our analysis. Sequence 01 consists largely of self-similar, corridor-like highway driving, which causes difficulties when detecting and matching features using libviso2. The base VO result is of low quality, although we note that including global orientation from the sun nevertheless improves the VO result.

Sequence        00      01      02      04      05      06      07      08      09      10
Length [km]    3.7     2.5     5.1     0.4     2.2     1.2     0.7     3.2     1.7     0.9

Trans. ARMSE [m]
Without Sun    4.33  198.52   28.59    2.48    9.90    3.35    4.55   28.05   10.44    5.54
GT-Sun-0       5.40  114.69   23.83    2.23    4.84    3.50    1.58   31.55    8.21    3.67
GT-Sun-10      4.85  123.84   25.34    2.45    5.84    2.80    2.94   28.47    8.65    4.81
GT-Sun-20      4.78  136.60   22.33    2.46    8.16    3.03    3.90   27.54    8.68    5.45
GT-Sun-30      4.83  157.14   27.30    2.48    8.93    3.44    4.62   26.73   10.10    5.28
Lalonde
Lalonde-VO     4.87  199.03   29.41    2.48    9.74
Sun-CNN
Sun-BCNN

Trans. ARMSE (EN plane) [m]
Without Sun    4.53  230.73   30.66    1.81   11.50    3.68    5.44   32.37   11.65    5.95
GT-Sun-0       3.41  136.76   24.12    1.46    3.67    3.96    1.80   21.51    7.77    3.71
GT-Sun-10      5.05  149.36   24.79    1.79    6.29    2.73    3.51   22.41    8.90    5.09
GT-Sun-20      5.14  164.37   22.04    1.80    9.01    3.13    4.66   27.58    8.86    5.81
GT-Sun-30      5.12  188.61   22.65    1.83   10.31    3.83    5.50   27.65   11.16    5.58
Lalonde
Lalonde-VO     5.38  231.33   33.68    1.82   11.13
Sun-CNN
Sun-BCNN

Rot. ARMSE (×10⁻³) [axis-angle]
Without Sun   23.88  185.30   63.18   12.97   70.18   23.24   49.96   63.13   26.77   21.54
GT-Sun-0      11.20   38.82   53.48   11.75   29.38   17.66   20.37   56.39   17.00   12.60
GT-Sun-10     17.05   64.51   58.78   12.86   41.47   18.90   34.05   54.89   19.71   14.26
GT-Sun-20     18.84   94.65   58.03   12.91   55.39   19.67   43.34   58.82   20.99   25.87
GT-Sun-30     23.40  121.21   57.79   13.01   62.73   23.96   49.92   56.74   25.63   20.15
Lalonde
Lalonde-VO    27.91  185.52   69.52   12.98   68.09
Sun-CNN
Sun-BCNN

Fig. 6: VO results for sequence 05: top-down trajectory plots in the Easting-Northing (EN) plane ((a) simulated, (d) estimated sun measurements) and Cumulative Root Mean Squared Error (CRMSE) plots for translational ((b), (e), EN plane) and rotational ((c), (f), axis-angle) error. Top row: Results using a selection of simulated sun measurements of varying accuracy (cf. Section V-C). Bottom row: Results using different sun estimation techniques, with selected simulated measurements added for reference. The sun direction estimates provided by Sun-BCNN significantly improve the VO solution, while the Lalonde [15], Lalonde-VO [14], and Sun-CNN [16] methods provide modest reductions in estimation error. In both simulated and estimated measurements, sun directions are computed at every tenth pose.
VI. CONCLUSION & FUTURE WORK
In this work, we have presented Sun-BCNN, a Bayesian CNN applied to the problem of sun direction estimation from a single RGB image in which the sun may not be visible. By leveraging the principled uncertainty estimates of the BCNN, we incorporated the sun direction estimates into a stereo visual odometry pipeline and demonstrated significant reductions in error growth over 21.6 km of urban driving data from the KITTI odometry benchmark. By using a full complement of dropout layers, we were able to train the network using a relatively small training set while achieving a median test error of approximately 12 degrees. We stress that although we integrated Sun-BCNN into a visual odometry pipeline in this work, it can just as readily be used to inject global orientation information into any egomotion estimator.

Possible avenues for future work include investigating the effect of cloud cover on sun direction estimates, an analysis of the effect of hyperparameters such as length scale and weight decay on the final model, and the use of multiple cameras with non-overlapping fields of view to compute and combine sun direction estimates from multiple perspectives.
REFERENCES

[1] J. Zhang and S. Singh, "Visual-lidar odometry and mapping: low-drift, robust, and fast," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2015, pp. 2174-2181.
[2] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," Int. J. Robot. Res., vol. 34, no. 3, pp. 314-334, Mar. 2015.
[3] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231-1237, Sep. 2013.
[4] D. Scaramuzza and F. Fraundorfer, "Visual odometry [tutorial]," IEEE Robot. Automat. Mag., vol. 18, no. 4, pp. 80-92, Dec. 2011.
[5] I. Cvišić and I. Petrović, "Stereo odometry based on careful feature selection and tracking," in Proc. European Conf. Mobile Robots (ECMR), 2015, pp. 1-6.
[6] P. F. Alcantarilla and O. J. Woodford, "Noise models in feature-based stereo visual odometry," Jul. 2016.
[7] V. Peretroukhin, W. Vega-Brown, N. Roy, and J. Kelly, "PROBE-GK: Predictive robust estimation using generalized kernels," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2016, pp. 817-824.
[8] J. Engel, J. Stuckler, and D. Cremers, "Large-scale direct SLAM with stereo cameras," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Syst. (IROS), 2015, pp. 1935-1942.
[9] C. F. Olson, L. H. Matthies, M. Schoppers, and M. W. Maimone, "Rover navigation using stereo ego-motion," Robot. Auton. Syst., vol. 43, no. 4, pp. 215-229, Jun. 2003.
[10] P. Furgale, J. Enright, and T. Barfoot, "Sun sensor navigation for planetary rovers: Theory and field testing," IEEE Trans. Aerosp. Electron. Syst., vol. 47, no. 3, pp. 1631-1647, Jul. 2011.
[11] A. Lambert, P. Furgale, T. D. Barfoot, and J. Enright, "Field testing of visual odometry aided by a sun sensor and inclinometer," J. Field Robot., vol. 29, no. 3, pp. 426-444, May 2012.
[12] M. Maimone, Y. Cheng, and L. Matthies, "Two years of visual odometry on the Mars Exploration Rovers," J. Field Robot., vol. 24, no. 3, pp. 169-186, Mar. 2007.
[13] A. R. Eisenman, C. C. Liebe, and R. Perez, "Sun sensing on the Mars Exploration Rovers," in Proc. IEEE Aerosp. Conf., vol. 5, 2002, pp. 2249-2262.
[14] L. Clement, V. Peretroukhin, and J. Kelly, "Improving the accuracy of stereo visual odometry using visual illumination estimation," in Proc. Int. Symp. Experimental Robot. (ISER), Oct. 2016.
[15] J.-F. Lalonde, A. A. Efros, and S. G. Narasimhan, "Estimating the natural illumination conditions from a single outdoor image," Int. J. Comput. Vis., vol. 98, no. 2, pp. 123-145, Oct. 2011.
[16] W.-C. Ma, S. Wang, M. A. Brubaker, S. Fidler, and R. Urtasun, "Find your way by observing the sun and other semantic cues," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Jun. 2017.
[17] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
[18] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Dec. 2015.
[19] A. Kendall and R. Cipolla, "Modelling uncertainty in deep learning for camera relocalization," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), May 2016, pp. 4762-4769.
[20] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in Proc. Int. Conf. Mach. Learning (ICML), 2016, pp. 1050-1059.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1-9.
[22] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Inform. Process. Syst. (NIPS), 2014, pp. 487-495.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vision and Pattern Recognition (CVPR), 2009, pp. 248-255.
[24] Y. Gal and Z. Ghahramani, "Bayesian convolutional neural networks with Bernoulli approximate variational inference," in Proc. Int. Conf. Learning Representations (ICLR), Workshop Track, 2016.
[25] Y. Gal, "Uncertainty in deep learning," Ph.D. dissertation, University of Cambridge, 2016.
[26] Y. Cheng, M. W. Maimone, and L. Matthies, "Visual odometry on the Mars Exploration Rovers - a tool to ensure accurate driving and science imaging," IEEE Robot. Automat. Mag., vol. 13, no. 2, pp. 54-62, Jun. 2006.
[27] P. Furgale and T. D. Barfoot, "Visual teach and repeat for long-range rover autonomy," J. Field Robot., vol. 27, no. 5, pp. 534-560, Sep. 2010.
[28] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3D reconstruction in real-time," in Proc. IEEE Intelligent Vehicles Symp. (IV), Jun. 2011, pp. 963-968.
[29] J. Kelly, S. Saripalli, and G. S. Sukhatme, "Combined visual and inertial navigation for an unmanned aerial vehicle," in Proc. Field and Service Robot. (FSR), 2008, pp. 255-264.
[30] S. Agarwal, K. Mierle, et al., "Ceres solver." [Online]. Available: http://ceres-solver.org
[31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Int. Conf. Multimedia, 2014.