Unsupervised Learning of Lidar Features for Use in a Probabilistic Trajectory Estimator
David J. Yoon, Haowei Zhang, Mona Gridseth, Hugues Thomas, Timothy D. Barfoot
Abstract—We present unsupervised parameter learning in a Gaussian variational inference setting that combines classic trajectory estimation for mobile robots with deep learning for rich sensor data, all under a single learning objective. The framework is an extension of an existing system identification method that optimizes for the observed data likelihood, which we improve with modern advances in batch trajectory estimation and deep learning. Though the framework is general to any form of parameter learning and sensor modality, we demonstrate application to feature and uncertainty learning with a deep network for 3D lidar odometry. Our framework learns from only the on-board lidar data, and does not require any form of groundtruth supervision. We demonstrate that our lidar odometry performs better than existing methods that learn the full estimator with a deep network, and comparable to state-of-the-art ICP-based methods on the KITTI odometry dataset. We additionally show results on lidar data from the Oxford RobotCar dataset.
Index Terms—Localization, Deep Learning Methods, Field Robots
I. INTRODUCTION

Probabilistic state estimation is a mature component of autonomous navigation. Large estimation problems with rich data (e.g., lidar, camera) can be solved efficiently by exploiting the sparse structure inherent to the probabilistic formulation. However, the implementation of estimators can vary depending on the platform, sensor, and application environment. Rather than requiring expert engineers to adapt existing estimators for new deployments, our vision is to develop a learning framework that can learn model parameters specific to the deployment, solely from the sensor data.

Recently, Barfoot et al. [1] presented Exactly Sparse Gaussian Variational Inference (ESGVI), a nonlinear batch state estimation framework that starts from a variational objective, and provides a family of scalable estimators by exploiting the factorization of the joint likelihood between the observed measurements (data) and state. Wong et al. [2] demonstrated the extension of ESGVI to parameter learning, and successfully learned robot noise models that are robust to noisy measurements and outliers, solely from the observed measurements (i.e., without groundtruth).
Manuscript received: October 14, 2020; Revised: January 7, 2021; Accepted: January 28, 2021. This paper was recommended for publication by Editor Sven Behnke upon evaluation of the Associate Editor and Reviewers' comments. This work was supported by Applanix Corporation and the Natural Sciences and Engineering Research Council of Canada (NSERC). All authors are with the University of Toronto Institute for Aerospace Studies (UTIAS), Canada. {david.yoon, gareth.zhang, mona.gridseth, hugues.thomas}@robotics.utias.utoronto.ca, [email protected]. Digital Object Identifier (DOI): see top of this page.
Fig. 1: An example factor graph diagram of a lidar odometry problem. We optimize the trajectory over a sliding window of w time frames (e.g., w = 3 above), where x_k is our state at time t_k. The first frame is locked (grey) and is the reference frame; we do not optimize it. Each frame receives a pointcloud from the lidar sensor. A deep network takes each pointcloud as input and outputs features with uncertainty (stars) that can be associated to other frames and composed into measurement factors, φ^m (circles). Motion prior factors, φ^p (squares), are applied between every frame. We do not require supervision, and only learn from the on-board lidar data.

In this paper, we demonstrate parameter learning for a deep neural network with ESGVI for the first time, resulting in a novel hybrid of deep learning and probabilistic state estimation. We apply our method to feature and uncertainty learning for lidar odometry with a network architecture built using KPConv [3], a pointcloud convolution operator. We do not train with groundtruth supervision; all trainable network parameters are learned solely from the on-board lidar data. Experimental results on Velodyne HDL-64 lidar data of the KITTI odometry dataset [4] show that our method performs better than those that learn the estimator with a deep network, and comparable to the current state of the art, i.e., Iterative Closest Point (ICP)-based methods. We additionally show results on Velodyne HDL-32 data from the Oxford RobotCar dataset [5], [6], where we demonstrate the simplicity in retraining the network for different deployment regions.

We review related work in Section II, and provide an overview of ESGVI parameter learning in Section III. Section IV is the methodology of our lidar odometry, and experimental results are presented in Section V. Finally, we discuss concluding remarks and future work in Section VI.

II. RELATED WORK
Parameter learning with ESGVI originates from a linear system identification method, where Ghahramani and Hinton [7] optimize the likelihood of the observed measurements (data) by introducing a latent trajectory (state) and applying Expectation-Maximization (EM). In the E-step, model parameters are held fixed and the trajectory is optimized with Kalman smoothing. In the M-step, the trajectory is held fixed and the model parameters are optimized. Critically, this method is able to learn entire linear models from just the observed data, with no prior knowledge. Ghahramani and Roweis [8] extend this concept to simple nonlinear models approximated with Gaussian Radial Basis Functions (RBF).

Barfoot et al. [1] recently presented a new nonlinear estimation framework that exploits the sparsity in not just smoothing problems (i.e., linear chain), but any factorization of the joint likelihood between data and state (i.e., cycles in the factor graph). Wong et al. [2] apply ESGVI estimation to parameter learning with EM, thus bringing the EM learning framework up to date with modern advances in nonlinear batch estimation. They were able to learn robot noise models robust to outliers, including robustness to false loop closures for pose-graph optimization. Our work further extends this framework to deep learning in order to directly handle rich sensor data such as lidar.

A similar concept based on optimizing the data likelihood by introducing a latent state is applied in the Variational Autoencoder (VAE) [9] framework. With VAEs, inference of the latent state is approximated with a deep network (i.e., data input maps to state output), so the bound on the data likelihood can be optimized without EM. This approximation is restrictive for time-series applications, such as trajectory estimation. EM parameter learning with ESGVI gives us all the benefits of classic probabilistic estimation, such as information propagation over the entire latent trajectory, and the flexibility of multiple sensors as additional factors.

Alternatively, Bloesch et al. [10] use a VAE in probabilistic trajectory estimation without directly inferring the state with a deep network. They train a VAE to learn an efficient (lower-dimensional) latent space for geometry, and optimize over this domain jointly with pose variables at test time for monocular vision estimation problems. Czarnowski et al. [11] extend this work by using the same depth representation in a full dense Simultaneous Localization and Mapping (SLAM) system. Unlike our approach, their network is trained independently from the trajectory estimator, and a training dataset with groundtruth depth images is required.

Evidently, deep learning with core components of probabilistic estimation is an increasingly popular avenue of research. Tang and Tan [12] maintain the differentiability of the Levenberg-Marquardt optimizer by iterating for a fixed number of steps and proposing a network to predict the damping factor. Similarly, von Stumberg et al. [13] backpropagate through the Gauss-Newton update step from a random initial condition. Jatavallabhula et al. [14] go even further by proposing differentiable alternatives to all modules in full SLAM systems as computational graphs.
In contrast, our approach does not rely on making the estimator differentiable and so facilitates using any probabilistic estimation method.

Most similar to our method is the work of DeTone et al. [15], where they alternate between training a deep network frontend that outputs visual features from images, and a bundle adjustment backend that optimizes the feature observations as landmarks. The optimized landmarks become the training signal for learning the frontend network. Our approach is derived from a probabilistic objective, and as a consequence, our learning objective is different and accounts for uncertainty in the posterior estimates.

While our learning framework is sensor agnostic, our choice of application to demonstrate our work is lidar odometry. Due to the vast amount of literature related to lidar estimation, we restrict our review to the most relevant to our work in the interest of space.

The current state of the art for non-learned lidar estimation consists of ICP-based methods. Zhang and Singh [17] present Lidar Odometry and Mapping (LOAM), which has been the top contender for lidar-only odometry in the KITTI odometry benchmark [4] since its inception. Behley and Stachniss [18] present Surfel-based Mapping (SuMa), which is notably the method used as the trajectory groundtruth in SemanticKITTI [19], the KITTI odometry sequences with semantic labels.

At the opposite end, we have fully learned lidar odometry methods that infer relative poses with a deep network. Li et al. [20] present LO-Net, a network that takes two pointclouds as input and outputs the relative pose. Their method demonstrates competent odometry performance comparable to ICP-based ones, but requires training with supervision from the groundtruth trajectory. Cho et al. [21] present DeepLO, a network that similarly outputs a relative pose change from two input pointclouds and is trained unsupervised. However, their unsupervised approach comes at a cost, as the odometry performance they present falls short compared to LO-Net and existing ICP-based methods.

The learned estimator in LO-Net is impressive, but Li et al. [20] demonstrate better odometry in the same publication with ICP enhanced with point measurement masks that were trained alongside LO-Net. In similar fashion, Chen et al. [22] improve on SuMa (SuMa++) by incorporating a pretrained semantic classification network. These outcomes suggest that a hybrid of non-learned estimators with learned components can be beneficial. Our work is motivated by the idea that in the spectrum between non-learned and fully learned estimators, there is an optimal balance that can benefit from the advantages of both extremes. The learning framework and odometry solution we present is our attempt at meeting this balance.

Compared to fully learning the estimator, such as in DeepLO [21] and LO-Net [20], our odometry solution achieves better performance while being trained unsupervised. Compared to SuMa++ [22] and LO-Net with ICP, our approach learns more and does not rely on nearest neighbours for data association. Having more learnable components has the advantage of automation, i.e., being able to tune models from data for different deployments, rather than requiring an expert engineer. Another advantage of not relying on ICP is that our method inherently has a good estimate of trajectory uncertainty. Uncertainty estimation for ICP is on its own a challenging research problem [23], [24].
III. EXACTLY SPARSE GAUSSIAN VARIATIONAL INFERENCE
In this section, we summarize parameter learning in the ESGVI framework as presented by Barfoot et al. [1].
Fig. 2: Our network architecture is based on the work of Barnes and Posner [16], which we adapt for pointclouds using the KPConv [3] pointcloud convolution operator. Input to the network is a lidar pointcloud with an intensity channel. Descriptor vectors for each point are composed from the output of the first four encoder layers. The 6-channel output of the top decoder is composed into inverse measurement covariances (see (7) in body text). The single-channel output of the remaining decoder is detector scores used to compute keypoints (see (8) in body text). Refer to the KPConv [3] publication for implementation details of the various operations.

We begin with the maximum-likelihood problem for the given data, z, which is expressed as

\theta^\star = \arg\max_{\theta} \, p(z \mid \theta), \qquad (1)

where θ represents the parameters of our system that we wish to learn (e.g., parameters of a neural network). We define the loss to minimize as the negative log-likelihood of the data, L = −ln p(z | θ), and introduce the latent trajectory, x. Applying the usual EM decomposition,

L = \int_{-\infty}^{\infty} q(x) \ln\!\left(\frac{p(x \mid z, \theta)}{q(x)}\right) dx \; - \; \int_{-\infty}^{\infty} q(x) \ln\!\left(\frac{p(x, z \mid \theta)}{q(x)}\right) dx, \qquad (2)

where we define our approximate posterior trajectory as a multivariate Gaussian distribution, q(x) = N(μ, Σ). While we cannot optimize the first term as it requires computing the true posterior, p(x | z, θ), we can optimize the second (upper bound) term, the so-called (negative) Evidence Lower Bound (ELBO). Using the expression for the entropy, −∫ q(x) ln q(x) dx, of a Gaussian and dropping constants, the upper bound term can be written as

V(q \mid \theta) = \mathbb{E}_q\!\left[\phi(x, z \mid \theta)\right] + \tfrac{1}{2}\ln\!\left(\left|\Sigma^{-1}\right|\right), \qquad (3)

where we define φ(x, z | θ) = −ln p(x, z | θ), E[·] is the expectation operator, and |·| is the matrix determinant. As defined by Barfoot et al. [1], (3) is the ESGVI loss functional.

We now apply EM and proceed iteratively in two steps: the E-step and the M-step. In the E-step, we hold the parameters, θ, fixed and optimize the posterior estimate, q(x). In the M-step, we hold the posterior estimate, q(x), fixed, and optimize for the parameters, θ. This iterative algorithm gradually optimizes the data likelihood, L, in (2). (We work with the negative log-likelihood, therefore we are technically applying Expectation Minimization. However, the acronym stays the same.)

Barfoot et al. [1] explain in detail how the E-step can be evaluated efficiently by taking advantage of the factorization of φ(x, z | θ), the joint likelihood of the state and data. When the expectation over the posterior, q(x), is approximated at only the mean of the Gaussian, the E-step is the familiar Maximum A Posteriori (MAP) state estimator [1].
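To make the loss functional (3) concrete, the following small numerical sketch (our own illustration, not code from the authors) evaluates V(q | θ) for a single quadratic Gaussian factor, comparing the mean-only approximation of the expectation used later in the E-step against a Monte-Carlo estimate and the closed-form value. All quantities (z, μ, Σ, W) are made up for the example.

import numpy as np

rng = np.random.default_rng(0)
z = np.array([1.0, 2.0])                          # "measurement"
mu = np.array([0.8, 2.3])                         # posterior mean of q(x)
Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])    # posterior covariance of q(x)
W = np.array([[25.0, 0.0], [0.0, 10.0]])          # inverse measurement covariance (a parameter in theta)

def phi(x):
    # negative log-likelihood factor: 0.5*(z - x)^T W (z - x) - 0.5*ln|W|
    e = z - x
    return 0.5 * e @ W @ e - 0.5 * np.log(np.linalg.det(W))

entropy_term = 0.5 * np.log(np.linalg.det(np.linalg.inv(Sigma)))

# expectation approximated at the mean only (the MAP-style approximation)
V_mean = phi(mu) + entropy_term

# Monte-Carlo approximation of the full expectation over q(x)
samples = rng.multivariate_normal(mu, Sigma, size=50000)
V_mc = np.mean([phi(x) for x in samples]) + entropy_term

# closed form for this quadratic factor: E_q[0.5 e^T W e] = 0.5*(z-mu)^T W (z-mu) + 0.5*tr(W Sigma)
V_exact = (0.5 * (z - mu) @ W @ (z - mu) + 0.5 * np.trace(W @ Sigma)
           - 0.5 * np.log(np.linalg.det(W)) + entropy_term)

print(V_mean, V_mc, V_exact)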
IV. UNSUPERVISED DEEP LEARNING FOR LIDAR ODOMETRY

A. Problem Definition
We define our state at time t_k as x_k = {T_{k,0}, ϖ_k}, where the pose T_{k,0} ∈ SE(3) is a transformation matrix between the frames at t_k and t_0, and ϖ_k ∈ R^6 is the body-centric velocity. We assume we receive a new pointcloud frame from the lidar sensor at each new time t_k.

Our odometry implementation is an optimization over a window of w lidar frames, t_τ, ..., t_{τ+w−1}. The first pose of the window at t_τ, T_{τ,0}, is locked (not optimized) and treated as the reference frame for keypoint matching. The factorization of our joint likelihood of the state and data is

\phi(x, z \mid \theta) = \sum_{k=\tau+1}^{\tau+w-1} \left( \phi^p(x_{k-1}, x_k) + \sum_{\ell=1}^{L_k} \phi^m(z^{\ell}_{k} \mid x_\tau, x_k, \theta) \right), \qquad (4)

where z^ℓ_k is the ℓth keypoint measurement in lidar frame k, which has a total of L_k keypoints. Figure 1 shows an example factor graph illustration.

Referring to Figure 1, the square factors, φ^p, are motion prior factors. We apply a white-noise-on-acceleration prior as presented by Anderson and Barfoot [25], which is defined by the following kinematic equations:

\dot{T}(t) = \varpi(t)^{\wedge} T(t), \quad \dot{\varpi}(t) = w(t), \quad w(t) \sim \mathcal{GP}\!\left(0, Q_c\, \delta(t - t')\right), \qquad (5)

where w(t) ∈ R^6 is a zero-mean, white-noise Gaussian process, and the operator, ∧, transforms an element of R^6 into a member of the Lie algebra, se(3) [26]. In the interest of space, see Wong et al. [2] for further details on this factor.
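As a rough illustration of the prior's nominal kinematics in (5), the snippet below propagates a pose with a constant body-centric velocity over one timestep using the matrix exponential. This is only a sketch under an assumed simple discretization; it is not the continuous-time implementation of [25].

import numpy as np
from scipy.linalg import expm

def hat(varpi):
    # se(3) 'hat' operator: 6-vector [rho; phi] -> 4x4 Lie-algebra matrix
    rho, phi = varpi[:3], varpi[3:]
    Phi = np.array([[0.0, -phi[2], phi[1]],
                    [phi[2], 0.0, -phi[0]],
                    [-phi[1], phi[0], 0.0]])
    Xi = np.zeros((4, 4))
    Xi[:3, :3] = Phi
    Xi[:3, 3] = rho
    return Xi

T = np.eye(4)                                        # pose T_{k,0} at time t_k
varpi = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.1])     # body-centric velocity (m/s, rad/s), made up
dt = 0.1                                             # time between lidar frames (assumed)

T_next = expm(dt * hat(varpi)) @ T                   # predicted mean pose at t_{k+1}
print(T_next)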
Fig. 3: Network outputs coloured in the following order: blue (low value), cyan, green, yellow, and red (high value). (Left) Detector score visualization, highlighting structure such as wall corners and vertical posts. The nearby ground is favoured over vehicles, possibly due to their dynamic nature. (Right) Visualization of sphericity (see Section IV-D) computed with the learned measurement covariance. Planar surfaces have low values, the expected result.
The circle factors in Figure 1, φ^m, are the measurement factors defined by the lidar keypoint measurements (the figure is a general illustration with measurement factors between frames; for implementation, we associate each frame only to the reference, see (6)):

\phi^m(z^{\ell}_{k} \mid x_\tau, x_k, \theta) = \tfrac{1}{2}\left(z^{\ell}_{k} - g(x_\tau, x_k)\right)^T W^{\ell}_{k} \left(z^{\ell}_{k} - g(x_\tau, x_k)\right) - \tfrac{1}{2}\ln\left|W^{\ell}_{k}\right|, \qquad (6)

where we use the log-likelihood of a Gaussian as the factor, and W^ℓ_k is the inverse covariance matrix corresponding to measurement z^ℓ_k. The keypoint, z^ℓ_k, its inverse covariance matrix, W^ℓ_k, and the measurement model, g(·), are quantities that depend on the network parameters, θ. These quantities will be further explained in Section IV-B.

We apply EM (see Section III) to jointly optimize for the posterior, q(x), and the network parameters, θ, under the single objective in (3). We emphasize that we do not use any form of groundtruth, such as pose estimates from a global positioning system, for training θ. We learn only from the on-board lidar data.

However, ESGVI parameter learning is a general framework based on factor graph optimization and can accommodate additional data sources beyond those presented in this work. For example, supervision from a global positioning system could be applied as unary factors for the poses of the posterior. Alternatively, a weaker form of supervision can come from applying a known measurement factor for inertial measurement unit (IMU) data, which could be used at both train and test time for a lidar-IMU odometry solution.
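For concreteness, a single measurement factor of the form (6) can be evaluated as below; the keypoint, its predicted match g(x_τ, x_k), and the learned inverse covariance W are made-up values for illustration only.

import numpy as np

def phi_m(z, g, W):
    # Gaussian negative log-likelihood of one keypoint measurement (up to a constant), eq. (6)
    e = z - g
    return 0.5 * e @ W @ e - 0.5 * np.log(np.linalg.det(W))

z = np.array([10.2, -3.1, 0.7])    # detected keypoint in frame k
g = np.array([10.0, -3.0, 0.8])    # reference match transformed into frame k
W = np.diag([50.0, 50.0, 5.0])     # learned inverse covariance: less trust along one direction
print(phi_m(z, g, W))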
B. Network

We adapt the network architecture of Barnes and Posner [16] for pointclouds. They present a U-Net [27] style convolutional encoder-multi-decoder network architecture that outputs keypoints and descriptors from radar data projected into a 2-dimensional (2D) bird's-eye view image, and thus the convolutional kernels have a spatial extent in 2D Euclidean space. We achieve an equivalent effect for 3D pointclouds with KPConv [3], a pointcloud convolution method that uses kernel points arranged in a sphere of fixed radius. (Thomas et al. [3] also present a deformable kernel implementation, but we do not apply it in our work.) Figure 2 shows the network architecture, where in place of pixels of an image with feature channels, we have points of a pointcloud with feature channels.

The input to the network is a lidar frame pointcloud with a channel for intensity data. Each network layer j, including the input layer, uniformly subsamples the pointcloud into a voxel grid of dimension dl_j. Successive layers in the encoder increase the grid dimension by a factor of 2, i.e., dl_{j+1} = 2 dl_j, and therefore the convolutions are applied at different scales in Euclidean space. The opposite is true for the decoder layers in order to have the input and output dimensions be equal. We set dl_1 to be . m.

Each layer of the encoder consists of two KPConv variations of bottleneck ResNet blocks [28], followed by a strided variation for spatial dimension reduction. Each layer of the decoder consists of a nearest upsample operation, for spatial dimension enlargement, and a single KPConv bottleneck ResNet block. These convolution blocks apply the leaky ReLU nonlinearity and batch normalization. We direct readers to the KPConv publication [3] for detailed definitions of the various block operations. As in U-Net [27], we use skip connections between encoder and decoder layers.

The network outputs descriptor vectors, inverse measurement covariance matrices, and detector scores for each input point. These outputs are used to compute the keypoints, z^ℓ_k, their corresponding inverse covariance matrices, W^ℓ_k, and the output of the measurement model, g(·) (see Section IV-A).

The descriptor vectors are computed for each point in the input layer and are composed of the output feature channels of all but the last encoder layer. The channel output dimension of the first layer is 64, and doubles for each subsequent layer. The output channels of the layers are concatenated with nearest upsampling to create descriptors of length 960, which are normalized into unit vectors.

The inverse measurement covariances are derived from the output of one of the two decoders (top decoder of Figure 2). Applying a 1×1 linear convolution to the last decoder layer gives a final output with 6 channels. We compose the output values into inverse covariance matrices, W ∈ R^{3×3}, with the approach of Liu et al. [29], which uses the following LDU decomposition for symmetric, positive definite matrices:

W = \begin{bmatrix} 1 & 0 & 0 \\ \ell_1 & 1 & 0 \\ \ell_2 & \ell_3 & 1 \end{bmatrix} \begin{bmatrix} \exp d_1 & 0 & 0 \\ 0 & \exp d_2 & 0 \\ 0 & 0 & \exp d_3 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ \ell_1 & 1 & 0 \\ \ell_2 & \ell_3 & 1 \end{bmatrix}^T, \qquad (7)
where [ℓ_1, ℓ_2, ℓ_3, d_1, d_2, d_3] is the 6D output for each point.

The detector scores are the output of the remaining decoder. After applying a 1×1 linear convolution to the last layer, the final output has 1 channel, i.e., a scalar detector score for each point. The pointcloud is then partitioned into voxels of grid size dg (we set dg to be . m) for the purpose of computing one keypoint per voxel. For each voxel, we apply a softmax function over the detector scores, resulting in weights we use to compute the keypoint's coordinates along with its descriptor and inverse covariance. For example, the ℓth keypoint coordinate in frame k is

z^{\ell}_{k} = \sum_{i=1}^{M} \frac{\exp s_i}{\sum_{j=1}^{M} \exp s_j} \, p_i, \qquad (8)

where s_1, ..., s_M are the detector scores of voxel ℓ, and p_1, ..., p_M ∈ R^3 are the corresponding point coordinates. A similar computation is done to get the descriptor vector, d^ℓ_k, and inverse covariance, W^ℓ_k, for each keypoint. For the inverse covariance, we apply the weighted summation over the 6D vector, and compose it into the 3×3 matrix afterward.

We use the keypoint descriptor for data association, which will be matched to a point in the reference pointcloud at time t_τ. Differentiability is maintained by approximating all matches with a softmax [16], [30]. We compute the dot product between each keypoint descriptor and all descriptors of the reference pointcloud:

c^{\ell\,T}_{k} = d^{\ell\,T}_{k} \begin{bmatrix} d^{1}_{\tau} & \cdots & d^{N}_{\tau} \end{bmatrix}, \qquad (9)

where d^ℓ_k is the descriptor vector of keypoint z^ℓ_k, and d^1_τ, ..., d^N_τ are the descriptor vectors of the N points in the reference frame. We apply a softmax function on c^ℓ_k, and compute a weighted summation. The reference point match for keypoint z^ℓ_k is therefore

r^{\ell k}_{\tau} = \sum_{i=1}^{N} \frac{\exp c^{\ell}_{k,i}}{\sum_{j=1}^{N} \exp c^{\ell}_{k,j}} \, p^{i}_{\tau}, \qquad (10)

where c^ℓ_{k,1}, ..., c^ℓ_{k,N} are the scalar elements of c^ℓ_k, and p^1_τ, ..., p^N_τ are the reference point coordinates.

We can now fully define the measurement factor in (6) with outputs of the network:

\phi^m(z^{\ell}_{k} \mid x_\tau, x_k, \theta) = \tfrac{1}{2}\left(z^{\ell}_{k} - D\, T_{k,0} T_{0,\tau} \begin{bmatrix} r^{\ell k}_{\tau} \\ 1 \end{bmatrix}\right)^T W^{\ell}_{k} \left(z^{\ell}_{k} - D\, T_{k,0} T_{0,\tau} \begin{bmatrix} r^{\ell k}_{\tau} \\ 1 \end{bmatrix}\right) - \tfrac{1}{2}\ln\left|W^{\ell}_{k}\right|, \qquad (11)

where D is a 3×4 constant projection matrix that removes the homogeneous element.

Figure 3 shows visualizations of the learned detector scores and covariances. The detector favours points on, and in the vicinity of, structure such as wall corners and vertical posts. Interestingly, the nearby ground is favoured over vehicles, possibly due to their dynamic nature. We visualize sphericity [31] (see Section IV-D) to demonstrate the covariance. Instead of manually choosing the error metric (e.g., point-to-plane), the network adapts to low-level geometry.
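The following sketch pulls together (7), (8), and (10) to show how per-point network outputs could be turned into a keypoint, a unit descriptor, an inverse covariance, and a soft reference match. Array shapes, variable names, and values are assumptions for illustration; this is not the released implementation.

import numpy as np

def compose_W(out6):
    # LDU composition of eq. (7): 6D network output -> symmetric positive-definite 3x3 inverse covariance
    l1, l2, l3, d1, d2, d3 = out6
    L = np.array([[1.0, 0.0, 0.0],
                  [l1, 1.0, 0.0],
                  [l2, l3, 1.0]])
    D = np.diag(np.exp([d1, d2, d3]))
    return L @ D @ L.T

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
M, N, desc_dim = 5, 100, 8            # M points in one voxel, N reference points, toy descriptor size

points = rng.normal(size=(M, 3))      # point coordinates in the voxel
scores = rng.normal(size=M)           # detector scores
descs = rng.normal(size=(M, desc_dim))
covs6 = rng.normal(size=(M, 6))       # per-point 6D covariance outputs

w = softmax(scores)                   # softmax weights over the voxel, eq. (8)
keypoint = w @ points                 # weighted keypoint coordinate
keypoint_desc = w @ descs
keypoint_desc /= np.linalg.norm(keypoint_desc)
W_keypoint = compose_W(w @ covs6)     # weight the 6D vector, then compose into a 3x3 matrix

ref_points = rng.normal(size=(N, 3))
ref_descs = rng.normal(size=(N, desc_dim))
ref_descs /= np.linalg.norm(ref_descs, axis=1, keepdims=True)
c = ref_descs @ keypoint_desc         # descriptor dot products, eq. (9)
match = softmax(c) @ ref_points       # softmax-weighted reference match, eq. (10)
print(keypoint, W_keypoint, match)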
C. Training and Inference

The general idea of training and inference in the ESGVI parameter learning framework using EM is presented in Section III. In the E-step, we hold all network parameters, θ, fixed and optimize for the posterior, q(x). In the M-step, we hold the posterior, q(x), fixed and optimize for the network parameters, θ. Critically, the M-step does not have to be computed to completion (i.e., convergence) before alternating to the E-step in order to satisfy the iterative update scheme of the data likelihood. When the M-step is not computed to completion, the algorithm is referred to as Generalized EM (GEM).

We adapt GEM to seamlessly fit into conventional network training (i.e., stochastic gradient optimization) by including the E-step in the forward propagation routine. A window of sequential lidar frames is treated as a mini-batch of data, and forward propagation involves the following steps:
1) Evaluate the lidar features and other associated outputs of each lidar frame (see Section IV-B).
2) Construct the motion prior, φ^p, and measurement, φ^m, factors (see Sections IV-A and IV-B).
3) The E-step: inference for the current best posterior estimate, q(x), of the mini-batch (window) of frames.

Wong et al. [2] demonstrate learning Q_c of the motion prior (see Section IV-A) in the M-step, but we do not apply it in our work and manually set suitable values for urban driving.

The E-step is simply a factor graph optimization problem, and can be solved efficiently. We apply MAP estimation by optimizing the loss functional (3) with the Gauss-Newton algorithm, which involves taking two approximations:
• Approximate the Hessian with first-order derivatives.
• Approximate the expectation in (3) at only the mean of the posterior q(x).
The first approximation is commonly made in practical applications, the alternative being Newton's method, which requires second-order derivatives. In the ESGVI framework, the second approximation is reasonable under the condition that the posterior is concentrated [1], which we find to be the case with lidar data (i.e., rich data with accurate geometry). In future work we may revisit these approximations and return to the full ESGVI optimizer for the E-step.

We compute backpropagation for the network parameters, θ, on the loss functional (3), where only the measurement factors, φ^m, are affected since the motion prior factors are constant with respect to θ. We use the spherical-cubature rule [32] to compute sigmapoints for the posterior, q(x), in order to approximate the expectation in (3). We do not need to compute sigmapoints over the entire posterior, which can be expensive, but just the marginals for each factor [1]. Training continues until the loss functional (3) converges. Once converged, inference can be computed on new sequences of lidar frames for odometry (i.e., the E-step).
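The GEM scheme above can be illustrated on a toy problem where it runs end to end: a scalar latent state with a known prior and an unknown measurement variance exp(θ). The E-step below is a closed-form Gaussian posterior and the M-step is a single gradient step on the loss functional with the posterior held fixed. Everything here (model, values, learning rate) is an assumption for illustration, not the lidar pipeline.

import numpy as np

rng = np.random.default_rng(0)
P = 1.0                        # prior variance on the latent state (plays the role of the motion prior)
true_R = 0.25                  # true measurement variance we hope to recover
zs = rng.normal(0.0, np.sqrt(P + true_R), size=2000)   # one measurement per independent latent state

theta = np.log(5.0)            # initial (wrong) log measurement variance
lr = 0.5
for epoch in range(500):
    R = np.exp(theta)
    # E-step: Gaussian posterior q(x) = N(mu, Sigma) for every measurement, theta held fixed
    Sigma = 1.0 / (1.0 / P + 1.0 / R)
    mu = Sigma * zs / R
    # M-step (partial): one gradient step on E_q[0.5*(z - x)^2 / R + 0.5*ln R] w.r.t. theta, q held fixed
    grad = np.mean(-0.5 * ((zs - mu) ** 2 + Sigma) / R + 0.5)
    theta -= lr * grad

print("learned R:", np.exp(theta), "true R:", true_R)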
D. Outlier Rejection

Outlier rejection is an important component to improve robustness for any estimation algorithm, and is traditionally handled with M-estimation [1] in factor graph optimization. We apply M-estimation in the E-step by applying the Geman-McClure cost function on the measurement factors, φ^m (see Section IV-A), when optimizing with Gauss-Newton. M-estimation is applied at both train and test time.

While the robust cost function is sufficient for the E-step, we cannot apply it to the measurement factor when backpropagating to learn the inverse measurement covariances, W, in the M-step. Instead, we apply a hard threshold on the squared Mahalanobis term in the measurement factor with the current best posterior estimate,

\left(z^{\ell}_{k} - g(x_\tau, x_k)\right)^T W^{\ell}_{k} \left(z^{\ell}_{k} - g(x_\tau, x_k)\right) > \alpha, \qquad (12)

and do not backpropagate any factor terms that exceed the threshold, α. This threshold, which we set to , is only applied during training at the backpropagation step.

Our keypoint detector, adapted from Barnes and Posner [16], determines the best keypoint in each voxel partition of the pointcloud (see Section IV-B). This is suboptimal for our problem formulation, as it results in keypoints in uninteresting areas (e.g., the ground plane). We can compensate at test time by judging the quality of each keypoint, z^ℓ_k, with the learned inverse measurement covariance, W^ℓ_k. Computing the sphericity metric [31] using the eigenvalues of the measurement covariance, λ_3/λ_1, where λ_1 ≥ λ_2 ≥ λ_3 are the eigenvalues of the covariance (W^ℓ_k)^{-1} (i.e., the reciprocals of the eigenvalues of W^ℓ_k), is a potential way to judge the quality of each keypoint.

However, we found the computation of the eigenvalues to be too inefficient in practice. Alternatively, we apply a metric that achieves a similar effect using the diagonal elements of W^ℓ_k (see (7)). We define this metric with a threshold as

\exp d_{\min} / \exp d_{\max} = \exp(d_{\min} - d_{\max}) < \beta, \qquad (13)

where d_min and d_max are the smallest and largest of the diagonal elements, respectively. We do not use keypoints below the threshold, β. We found through experimentation that this metric works well on planar surfaces that are axis-aligned to the sensor frame. This threshold, which we set to . , is only applied in the E-step, i.e., we still backpropagate keypoints below the threshold for covariance learning.
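A minimal sketch of the two checks above is given below. The threshold values passed in are placeholders, not the values used in the paper, and the keypoint quantities are made up.

import numpy as np

def pass_alpha(z, g, W, alpha):
    # eq. (12): keep a factor for backpropagation only if its squared Mahalanobis error is below alpha
    e = z - g
    return float(e @ W @ e) <= alpha

def pass_beta(out6, beta):
    # eq. (13): cheap sphericity proxy from the d1, d2, d3 terms of the 6D output in eq. (7)
    d = np.asarray(out6[3:])
    return np.exp(d.min() - d.max()) >= beta

z = np.array([10.2, -3.1, 0.7])
g = np.array([10.0, -3.0, 0.8])
W = np.diag([50.0, 50.0, 5.0])
out6 = np.array([0.1, -0.2, 0.05, np.log(50.0), np.log(50.0), np.log(5.0)])
print(pass_alpha(z, g, W, alpha=16.0), pass_beta(out6, beta=0.05))   # placeholder thresholds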
V. EXPERIMENTAL RESULTS

A. Experiment Setup
We evaluate lidar odometry on two datasets, each with a different lidar sensor. The KITTI odometry benchmark [4] has 22 sequences of Velodyne HDL-64 data collected at 10 Hz. The first 11 (00-10) are provided as the training set, and the remaining 11 (11-21) are provided without groundtruth and act as the benchmark. Following existing work [20], [21], we split the first 11 sequences into training and testing sequences and evaluate against the provided groundtruth.

Velodyne HDL-32 sensors were introduced to the Oxford RobotCar dataset [5] in the radar dataset extension [6]. Two HDL-32 sensors (left and right) were placed on the roof of the data collection vehicle. We opted for the simpler setup of only evaluating odometry using one (left) of the two sensors. The dataset contains sequences, each km in length, and shorter sequences. All sequences were collected from a similar driving route over a week, and thus there is little variation for lidar data. We evaluate odometry on of the km sequences, where of the are used for training.

KITTI preprocessed the lidar data to account for motion distortion. While the Oxford dataset does not motion-compensate the data, we chose not to account for this effect as it is not the focus of this work, and the faster spin-rate alleviates this problem to some degree. We demonstrated motion compensation in past work with the same estimator [33], [34], and note that it is applicable to this work as well.

TABLE I: A comparison of our odometry method to those that fully learn the estimator with a deep network. DeepLO [21] and our method are trained unsupervised, while LO-Net [20] is trained with supervision from the groundtruth trajectory. Our method and LO-Net trained on sequences 00-06, while DeepLO trained on sequences 00-08. Using the KITTI odometry benchmark metric [4], the average translation (%) and orientation (°/m) errors over lengths of 100 m to 800 m are presented. The average over sequences 00-08 is presented, as DeepLO does not present them individually. The best results are in bold.

Seq. | Ours (Unsupervised) | DeepLO [21] (Unsupervised) | LO-Net [20] (Supervised)
00-08 | / | / | /

TABLE II: A comparison of our method to the current state of the art for lidar odometry methods. Using the KITTI odometry benchmark metric [4], the average translation (%) and orientation (°/m) errors over lengths of 100 m to 800 m are presented. The best results are in bold.

Seq. | Ours (Unsupervised) | LO-Net+Mapping (ICP) [20] (Supervised) | LOAM [17] (Non-Learned)
00† | / | / | /-
01† | / | / | /-
02† | / | / | /-
03† | / | / | /-
04† | / | / | /-
05† | / | / | /-
06† | / | / | /-
07∗ | / | / | /-
08∗ | / | / | /-
09∗ | / | / | /-
10∗ | / | / | /-
Avg. | 0.89/ | /0.43 | 0.84/-
†: Sequences that our method and LO-Net train on. ∗: Sequences that are not used for training.

We follow the KITTI odometry evaluation metric for all datasets, which averages the relative position and orientation errors over trajectory segments of 100 m to 800 m. We implemented the network using a KPConv implementation in PyTorch (https://github.com/HuguesTHOMAS/KPConv-PyTorch). Network parameters were trained with the Adam optimizer [35], and always trained from random initialization. Pointclouds were augmented during training with random rotations in the z-axis for more variation. The estimator (Gauss-Newton) was implemented using STEAM (https://github.com/utiasASRL/steam), a C++ optimization library. Loop closures were not implemented.

Our current implementation is not real-time for a HDL-64 (timings measured on an Nvidia Tesla V100 GPU and a 2.2 GHz Intel Xeon CPU). KPConv is a bottleneck, in both the pre-processing and the forward propagation of each frame (computed only for the latest frame, since previous ones are saved).
Updates are in the works to improve runtime. Data association, which is repeated for each frame of the window,
and the Gauss-Newton solve account for the remaining time; Gauss-Newton is the only C++ implementation and runs on the CPU. The rest is overhead.

B. Odometry Results
We train and evaluate odometry with a window of 4 frames for KITTI. The relative pose between the latest two frames of the window is taken as the odometry output, reflecting online operation (i.e., we do not use data from future frames). We compare to the current state-of-the-art methods that fully learn the estimator: LO-Net [20] and DeepLO [21]. Since DeepLO only presents the average over sequences 00-08, Table I presents the results in the same way. DeepLO trains on sequences 00-08, while we follow LO-Net and train on sequences 00-06. DeepLO does not perform as well as LO-Net, but has the advantage that it is unsupervised. Our method maintains the advantage of being unsupervised, and achieves better performance than both methods.

Fig. 4: Odometry error of KITTI sequence with σ variance envelopes. In the interest of space, only two dimensions, ρ and ψ, are shown (see (14)). Our estimator is in general consistent, but at times slightly overconfident.

The uncertainty output of our estimator is shown in Figure 4, which plots the relative pose error of the sequence with σ variance envelopes. Errors are computed as

\xi_{k,k-1} = \begin{bmatrix} \rho_1 & \rho_2 & \rho_3 & \psi_1 & \psi_2 & \psi_3 \end{bmatrix}^T = \ln\!\left(T_{k,k-1} T^{gt\,-1}_{k,k-1}\right)^{\vee}, \qquad (14)

where T_{k,k−1} is the relative pose estimate between frames t_k and t_{k−1}, T^{gt}_{k,k−1} is the groundtruth, ln(·) is the inverse exponential map, and ∨ is the inverse of the ∧ operator [26].

Table II compares our odometry to the current state of the art for lidar odometry, which are ICP-based methods. LOAM [17] is currently leading the KITTI benchmark leaderboard, and LO-Net+Mapping is the ICP solution presented by Li et al. [20] that applies point masks trained with LO-Net and manually computed surface normals. (Our understanding is that the masks are trained without supervision from mask targets, but with supervision from the groundtruth trajectory [20].) Overall, we demonstrate that our method is comparable to the current state of the art. Compared to LO-Net+Mapping, our method is learned unsupervised and does not rely on ICP for data association. Compared to LOAM, our method can easily be tuned for different platforms and lidar sensors by learning from just the on-board lidar data.

Our submission to the online benchmark, with the same network and parameters, achieved . average translation and . °/m average orientation error. In comparison, LOAM currently has . translation and . °/m orientation ( . translation in the original publication [17]; the orientation results for LOAM are not provided in their publication). Our results are more comparable to SuMa [18] ( . and . °/m) and SuMa++ [22] ( . and . °/m), which are both well-regarded ICP-based methods. DeepLO and LO-Net currently do not have submissions. Considering we do not apply loop closure, which the benchmark permits, we believe our method achieved reasonable performance.
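The relative-pose error of (14) can be computed with a standard SE(3) logarithmic map, as in the sketch below; the two poses are made-up examples and the helper functions are our own, assumed implementations.

import numpy as np
from scipy.linalg import expm, logm

def vee(Xi):
    # inverse of the se(3) 'hat' operator: 4x4 Lie-algebra matrix -> 6-vector [rho; phi]
    return np.array([Xi[0, 3], Xi[1, 3], Xi[2, 3], Xi[2, 1], Xi[0, 2], Xi[1, 0]])

def yaw_pose(rho, yaw):
    # helper: SE(3) matrix from a translation 3-vector and a yaw angle (for the example only)
    Xi = np.zeros((4, 4))
    Xi[0, 1], Xi[1, 0] = -yaw, yaw
    Xi[:3, 3] = rho
    return expm(Xi)

T_est = yaw_pose([1.02, 0.01, 0.00], 0.011)   # estimated relative pose between consecutive frames
T_gt = yaw_pose([1.00, 0.00, 0.00], 0.010)    # groundtruth relative pose

xi = vee(np.real(logm(T_est @ np.linalg.inv(T_gt))))   # eq. (14)
rho_err, psi_err = xi[:3], xi[3:]
print(rho_err, psi_err)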
Fig. 5: Odometry paths for sequence 2019-01-15-14-24-38 of the Oxford RobotCar dataset [5], [6] with a Velodyne HDL-32. We compare against the performance when using a network trained on a different dataset, KITTI [4], which uses a Velodyne HDL-64. Odometry fails prematurely due to numerical instability when the network is not trained (red).
We demonstrate automated tuning by training and testing on the Oxford dataset, and comparing it to the performance of the network trained on KITTI (i.e., the network applied in Tables I and II). We optimize over a window of 7 and use the lidar data at a reduced rate (i.e., skip every other frame), settings with which the KITTI-trained network performed best. The same settings were applied when training on the Oxford dataset. The results in Table III show a clear improvement when parameters are trained with data related to the deployment. Figure 5 shows a qualitative plot of the odometry paths for sequence 2019-01-15-14-24-38, where we additionally show the performance with an untrained network.
TABLE III: Odometry results for our method on Velodyne HDL-32 data of the Oxford RobotCar dataset [5], [6]. Using the KITTI odometry benchmark metric [4], the average translation (%) and orientation (°/m) errors over lengths of 100 m to 800 m are presented. The best results are in bold.

Seq. | Trained on Oxford | Trained on KITTI
2019-01-10-11-46-21† | / | /
...† | / | /
...∗ | / | /
...∗ | / | /
...∗ | / | /
...∗ | / | /
†: Sequences that we train on (applicable only to 'Trained on Oxford'). ∗: Sequences that are not used for training.

C. Ablation Study
Table IV shows the results of an ablation study, where we remove various components of our method. In addition to the
KITTI benchmark metrics for translation and orientation, we compute an average Mahalanobis distance metric [24],

\left( \frac{1}{\dim(\xi_{k,k-1})\, K} \sum_{k=1}^{K} \xi^T_{k,k-1} Q^{-1}_{k,k-1} \xi_{k,k-1} \right)^{1/2}, \qquad (15)

where ξ_{k,k−1} is the error as defined in (14) and Q_{k,k−1} is the corresponding covariance. A value close to 1 is ideal.

'No Sampling' refers to evaluating the expectation in the loss functional (3) at only the mean of the posterior in the M-step (see Section IV-C). We see that approximating the expectation with sigmapoints is insignificant for this problem, which is consistent with the approximation made in the E-step. 'No β' and 'No α' refer to not applying the β and α thresholds of Section IV-D, which clearly perform worse for translation and orientation. The exception is the Mahalanobis metric for 'No α', which performs the most consistently and more conservatively than the rest. Backpropagating outliers means the learned uncertainties must account for them, thus it makes sense for the estimator to become more conservative.
TABLE IV: An ablation study over components of our method on the KITTI odometry dataset. Using the KITTI odometry benchmark metric [4], the average translation (%) and orientation (°/m) errors over lengths of 100 m to 800 m are presented. We additionally compute the average squared Mahalanobis distance (third metric in each column) of the relative pose estimates (see (15)), which ideally is 1 for a consistent estimator.

Seq. | Full method | No Sampling | No β | No α
07 | 0.49/0.33/1.22 | / /1.23 | 0.61/0.38/1.67 | 1.22/0.96/
08 | 1.01/0.36/2.66 | / /2.71 | 1.17/0.45/6.98 | 2.23/0.84/
09 | / /1.31 | 0.98/0.36/1.34 | 1.36/0.56/1.98 | 2.44/0.92/
10 | / /1.54 | 1.56/0.58/1.55 | 2.13/0.84/2.09 | 2.13/1.58/
Avg. | / /1.68 | 0.99/0.39/1.71 | 1.32/0.56/3.18 | 2.00/1.07/
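As a sanity check of the consistency metric in (15), the sketch below computes it for errors that are, by construction, drawn from the reported covariances, in which case the value should be close to 1. Data and covariances are synthetic.

import numpy as np

def consistency(errors, covs):
    # eq. (15): errors is (K, 6) relative-pose errors xi, covs is (K, 6, 6) corresponding covariances Q
    K, dim = errors.shape
    total = sum(e @ np.linalg.solve(Q, e) for e, Q in zip(errors, covs))
    return np.sqrt(total / (dim * K))

rng = np.random.default_rng(0)
K = 500
Q = np.diag([0.01, 0.01, 0.02, 1e-4, 1e-4, 2e-4])
covs = np.tile(Q, (K, 1, 1))
errors = rng.multivariate_normal(np.zeros(6), Q, size=K)
print(consistency(errors, covs))   # ~1.0 when the covariances are consistent with the errors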
VI. CONCLUSION AND FUTURE WORK
In this paper, we presented the first application of deep network parameter learning for ESGVI. We showed that our parameter learning framework can learn the parameters of a deep network without groundtruth supervision. Our application to lidar odometry resulted in performance comparable to current state-of-the-art ICP-based methods, while being simple to train for new deployments with different lidars.

For future work, we plan on extending parameter learning for ESGVI to other rich sensors (i.e., camera, radar). We are interested in estimation problems beyond odometry, and will focus on localization and mapping. Unfortunately, the current odometry implementation is incapable of running in real-time. The current computational bottleneck is the pointcloud convolution operator, KPConv [3]. A real-time implementation of KPConv is in the works, such that we believe real-time performance is easily achievable.
REFERENCES
[1] T. D. Barfoot, J. R. Forbes, and D. J. Yoon, "Exactly sparse Gaussian variational inference with application to derivative-free batch nonlinear state estimation," IJRR, 2020.
[2] J. N. Wong, D. J. Yoon, A. P. Schoellig, and T. D. Barfoot, "Variational inference with parameter learning applied to vehicle trajectory estimation," RAL, 2020.
[3] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, "KPConv: Flexible and deformable convolution for point clouds," in Int. Conf. on Comp. Vision, 2019.
[4] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in CVPR, 2012.
[5] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "1 Year, 1000 km: The Oxford RobotCar dataset," IJRR, 2017.
[6] D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner, "The Oxford Radar RobotCar dataset: A radar extension to the Oxford RobotCar dataset," in ICRA, 2020.
[7] Z. Ghahramani and G. E. Hinton, "Parameter estimation for linear dynamical systems," U. of Toronto, Tech. Rep. CRG-TR-96-2, 1996.
[8] Z. Ghahramani and S. T. Roweis, "Learning nonlinear dynamical systems using an EM algorithm," in Advances in Neural Information Processing Systems, 1999.
[9] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Int. Conf. on Learning Representations, 2013.
[10] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison, "CodeSLAM - learning a compact, optimisable representation for dense visual SLAM," in CVPR, 2018.
[11] J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison, "DeepFactors: Real-time probabilistic dense monocular SLAM," RAL, 2020.
[12] C. Tang and P. Tan, "BA-Net: Dense bundle adjustment network," in Int. Conf. on Learning Representations, 2018.
[13] L. von Stumberg, P. Wenzel, Q. Khan, and D. Cremers, "GN-Net: The Gauss-Newton loss for multi-weather relocalization," RAL, 2020.
[14] K. M. Jatavallabhula, G. Iyer, and L. Paull, "gradSLAM: Dense SLAM meets automatic differentiation," in ICRA, 2020.
[15] D. DeTone, T. Malisiewicz, and A. Rabinovich, "Self-improving visual odometry," arXiv preprint arXiv:1812.03245, 2018.
[16] D. Barnes and I. Posner, "Under the radar: Learning to predict robust keypoints for odometry estimation and metric localisation in radar," in ICRA, 2020.
[17] J. Zhang and S. Singh, "Low-drift and real-time lidar odometry and mapping," Autonomous Robots, 2017.
[18] J. Behley and C. Stachniss, "Efficient surfel-based SLAM using 3D laser range data in urban environments," in RSS, 2018.
[19] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "SemanticKITTI: A dataset for semantic scene understanding of lidar sequences," in Int. Conf. on Comp. Vision, 2019.
[20] Q. Li, S. Chen, C. Wang, X. Li, C. Wen, M. Cheng, and J. Li, "LO-Net: Deep real-time lidar odometry," in CVPR, 2019.
[21] Y. Cho, G. Kim, and A. Kim, "Unsupervised geometry-aware deep lidar odometry," in ICRA, 2020.
[22] X. Chen, A. Milioto, E. Palazzolo, P. Giguère, J. Behley, and C. Stachniss, "SuMa++: Efficient lidar-based semantic SLAM," in IROS, 2019.
[23] D. Landry, F. Pomerleau, and P. Giguère, "CELLO-3D: Estimating the covariance of ICP in the real world," in ICRA, 2019.
[24] M. Brossard, S. Bonnabel, and A. Barrau, "A new approach to 3D ICP covariance estimation," RAL, 2020.
[25] S. Anderson and T. D. Barfoot, "Full STEAM ahead: Exactly sparse Gaussian process regression for batch continuous-time trajectory estimation on SE(3)," in IROS, 2015.
[26] T. D. Barfoot, State Estimation for Robotics. Cambridge University Press, 2017.
[27] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Int. Conf. on Medical Image Computing and Comp.-Assisted Intervention, 2015.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770-778.
[29] K. Liu, K. Ok, W. Vega-Brown, and N. Roy, "Deep inference for covariance estimation: Learning Gaussian noise models for state estimation," in ICRA, 2018.
[30] Y. Wang and J. M. Solomon, "Deep closest point: Learning representations for point cloud registration," in Int. Conf. on Comp. Vision, 2019, pp. 3523-3532.
[31] H. Thomas, F. Goulette, J.-E. Deschaud, B. Marcotegui, and Y. LeGall, "Semantic classification of 3D point clouds with multiscale spherical neighborhoods," in Int. Conf. on 3D Vision, 2018.
[32] S. Särkkä, Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[33] T. Y. Tang, D. J. Yoon, F. Pomerleau, and T. D. Barfoot, "Learning a bias correction for lidar-only motion estimation," in Conf. on Comp. and Robot Vision, 2018.
[34] T. Y. Tang, D. J. Yoon, and T. D. Barfoot, "A WNOA motion prior for continuous-time trajectory estimation on SE(3)," RAL, 2019.
[35] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Int. Conf. on Learning Representations, 2015.