Fusion of neural networks, for LIDAR-based evidential road mapping
Edouard Capellier, Franck Davoine, Veronique Cherfaoui, You Li
Édouard Capellier*
Heudiasyc, Université de technologie de Compiègne, CNRS, Alliance Sorbonne Université, CS 60 319, 60 203 Compiègne Cedex, France
[email protected]

Franck Davoine
Heudiasyc, Université de technologie de Compiègne, CNRS, Alliance Sorbonne Université, CS 60 319, 60 203 Compiègne Cedex, France
[email protected]

Véronique Cherfaoui
Heudiasyc, Université de technologie de Compiègne, CNRS, Alliance Sorbonne Université, CS 60 319, 60 203 Compiègne Cedex, France
[email protected]

You Li
Renault S.A.S, Research department (DEA-IR), 1 av. du Golf, 78288 Guyancourt, France
[email protected]

*Édouard Capellier is also a research engineer in computer vision at Renault S.A.S, France.
Abstract
LIDAR sensors are usually used to provide autonomous vehicles with 3D representations of their environment. In ideal conditions, geometrical models could detect the road in LIDAR scans, at the cost of a manual tuning of numerical constraints, and a lack of flexibility. We instead propose an evidential pipeline, to accumulate road detection results obtained from neural networks. First, we introduce RoadSeg, a new convolutional architecture that is optimized for road detection in LIDAR scans. RoadSeg is used to classify individual LIDAR points as either belonging to the road, or not. Yet, such point-level classification results need to be converted into a dense representation, that can be used by an autonomous vehicle. We thus secondly present an evidential road mapping algorithm, that fuses consecutive road detection results. We benefitted from a reinterpretation of logistic classifiers, which can be seen as generating a collection of simple evidential mass functions. An evidential grid map that depicts the road can then be obtained, by projecting the classification results from RoadSeg into grid cells, and by handling moving objects via conflict analysis. The system was trained and evaluated on real-life data. A python implementation maintains a 10 Hz framerate. Since road labels were needed for training, a soft labelling procedure, relying on lane-level HD maps, was used to generate coarse training and validation sets. An additional test set was manually labelled for evaluation purposes. So as to reach satisfactory results, the system fuses road detection results obtained from three variants of RoadSeg, processing different LIDAR features.
Grid mapping algorithms are traditionally deployed, within robotic systems, to infer the traversability of discretized areas of the environment. In particular, evidential grid mapping algorithms are commonly used to fuse sensor inputs over time [1, 2, 3, 4]. The evidential framework better represents uncertainties, and the fact of not knowing, than regular Bayesian frameworks [5]. Indeed, the evidential framework allows for an explicit classification of unobserved, or ambiguous, areas as being in an Unknown state.
Specifically, LIDAR-only evidential grid mapping systems tend to rely on a first ground detection step, strong geometrical assumptions, and manually tuned thresholds, to generate evidential mass functions at the cell level [6, 7, 8]. However, such systems might prove to have a limited applicability, for autonomous vehicles that are intended to evolve in urban or peri-urban areas. The first reason is that autonomous ground vehicles can only, typically, drive on roads. Yet, the road does not necessarily cover the whole ground, especially in urban areas. Moreover, the need for manual finetuning, and the use of geometrical assumptions, lead to grid mapping systems that lack flexibility, and might fail when used in uncontrolled environments. For instance, the use of ray tracing, to densify evidential grid maps as in [2], supposes the absence of negative obstacles in unobserved areas, which cannot be guaranteed in practice when only using LIDAR sensors [9]. We consider that a representation of the drivable area should only be generated by fusing actual observations, without any upsampling nor downsampling. Similarly, a grid mapping system relying on the flat-ground assumption, and altitude thresholds to detect obstacles [6], might lead to false detections when encountering slopes or speed bumps.

To tackle those limitations, we propose to classify individual LIDAR points as either belonging to the road, or to an obstacle. The use of machine learning, for classification, has the benefit of removing the need for manually tuned geometrical assumptions and thresholds, during the inference. The use of dense LIDAR scans, for road detection, has been a common practice in aerial data analysis for several years [10], and several approaches relying on machine learning techniques have recently been proposed to address automotive use cases [11, 12]. However, those systems never generate a point-level classification, and always output an upsampled road area. Indeed, those approaches are designed to process data from the KITTI road dataset [13], and are evaluated with regards to a dense ground-truth in the image domain. As such, the fact that LIDAR scans are sparse is not considered when evaluating those systems. Recent machine learning architectures, such as SqueezeSeg [14] and PointNet [15], have been proposed to process raw point-clouds. In particular, SqueezeSeg relies on a very efficient convolutional architecture that processes LIDAR scans, which are organized as dense range images by simply using a spherical projection. This representation actually corresponds to the raw outputs of most state-of-the-art LIDAR sensors, that only measure ranges at fixed angular positions. We propose to adapt SqueezeSegV2 [16], and make it able to classify each LIDAR point as road or obstacle. Those classification results can then be accumulated into a 2D road map, thanks to an evidential grid mapping algorithm. Figure 1 depicts the whole system.

To generate evidential mass functions from the classification results, we propose to reinterpret the output of neural networks as a collection of simple evidential mass functions, as proposed in [17]. The system was trained on a coarse training set, which was generated by automatically labelling LIDAR scans from pre-existing lane-level HD Maps, thanks to a simple localization error model.
An additional test set was also manually labelled, in order to reliably verify the classification performances. The best classification performances are achieved by fusing three distinct neural networks, that process different sets of LIDAR features: the Cartesian coordinates, the spherical coordinates, and the returned intensity and elevation angles. The classification results are then accumulated on the XY-plane, in order to generate an evidential grid that depicts the road surface. Moving objects are detected by analyzing the evidential conflict during the fusion, and by a grid-level clustering pipeline. Those objects are then excluded from the final road surface estimation.

Our main contributions are:

• RoadSeg, a refined version of SqueezeSegV2 for road detection in LIDAR scans, optimized for the generation of evidential mass functions

• An automatic labelling procedure of the road in LIDAR scans, from HD Maps

• A road mapping and object detection algorithm relying on road detection results, obtained from a fusion of independent RoadSeg networks

The remainder of the paper is organized as follows. Section 2 reviews some related works on road detection in LIDAR scans, machine learning architectures for LIDAR scan processing, and evidential grid mapping. Section 3 describes how evidential mass functions can be generated from a neural network. Section 4 describes RoadSeg, our refined version of SqueezeSegV2 for road detection, optimized for the generation of evidential mass functions. Section 5 presents our training, validation and test sets, alongside our map-based automatic labelling pipeline. In Section 6, we evaluate the performances of RoadSeg, and of a fusion of RoadSeg networks that process independent inputs, both on the KITTI dataset and on our manually labelled test set. Section 7 finally presents a road mapping and moving object detection algorithm, that relies on evidential mass functions generated from several RoadSeg networks. A python implementation of this algorithm processes LIDAR scans at a 10Hz rate.

Figure 1: Overview of the proposed system
Evidential occupancy grids associate evidential mass functions to individual grid cells. Each mass value represents evidence towards the fact that the corresponding cell is Free, Occupied, or Unknown, and their sum is equal to 1. The first evidential occupancy grid mapping system that relied on LIDAR sensors was proposed in [18]. Evidential mass functions were constructed at the cell level from several stationary-beam LIDAR sensors, via ray tracing and ad-hoc false alarm and missed detection rates. The use of the evidential framework was shown to better represent uncertainty and the fact of not knowing, when compared to a traditional Bayesian fusion scheme. The Caroline vehicle already used evidential occupancy grid maps during the 2007 DARPA Urban Challenge [2], but the reasoning was done globally from both LIDAR scans and point-clouds generated by stereo-vision, without considering the specificity of LIDAR sensors. It was proposed, in [19], to extend the work in [18] to four-layer LIDAR scanners. Based on [20], a discount factor was applied to an evidential polar grid, before fusing it with new sensor observations. Evidential mass functions were again generated from ray tracing, and ad-hoc false alarm and missed detection rates. This model was generalized, in [6], to a more complex 360° Velodyne HDL-64 LIDAR scanner. A first ground-detection step, based on a simple thresholding, is used to classify LIDAR points as either belonging to an obstacle or to the drivable area. Then, evidential mass functions were created at the cell level, from the classification results and from, again, ad-hoc false-alarm and missed-detection rates.

More recent works aim at tackling some intrinsic limitations of LIDAR-based evidential occupancy grid mapping. It was proposed, in [4], to couple an evidential model with particle-filtering, in order to estimate the speed of each grid cell, and detect moving objects from a four-layer LIDAR scan. This comes at the cost of various hyper-parameters to manually tune, and of a high computational complexity, as virtually any occupied cell could be associated with a set of particles. Another recent work aims at predicting dense evidential occupancy grid maps from individual LIDAR scans [8]. Consecutive LIDAR scans are registered, and dense grids are obtained from the resulting pointcloud thanks to a ground removal step, and manually defined false-alarm and missed detection rates. A convolutional neural network is then trained to recreate the dense evidential grids from a feature grid generated from only one of the original scans.

All those approaches rely on several strong assumptions, like the absence of negative obstacles that justifies the use of ray tracing, or the fact that the ground is flat. As such, they might lack flexibility, when used in uncontrolled environments. We thus consider that in complex and uncontrolled environments, evidential LIDAR grids should only be built from raw observations. The use of ad-hoc parameters should also be limited, so as to create general models. Moreover, autonomous urban vehicles are expected to drive on roads. Therefore, current evidential occupancy grid mapping algorithms cannot be used alone, in autonomous driving scenarios, as unoccupied grid cells are not necessarily drivable. An explicit road detection step is thus needed, when generating evidential grid maps for autonomous driving.
Road detection from LIDAR scans can be addressed by focusing on the detection of road borders. One of the first road detection approaches that only relies on LIDAR data was proposed in the context of the DARPA Urban Challenge 2007 [21]. Elevation information was obtained from a single-layer LIDAR mounted on a prototype vehicle, and oriented towards the ground. The road was then considered as the lowest smooth surface within the LIDAR scan, and a comparison with detected road edges was used to confirm the results. A similar sensor setup was used in [22], to detect breakpoints among LIDAR points, so that smooth road segments and obstacles could be jointly detected. More recently, [23] coupled a ground detection step and a probabilistic fusion framework, to accumulate and vectorize road edge candidates. In [24], road markings are detected among LIDAR points, from the reflectance information, and the geometry of the lane is inferred from the detected markings. Simply detecting road borders from LIDAR scans is nevertheless limited, as the provided information is extremely coarse by nature. Those approaches also heavily rely on assumptions about the sensor setup and the geometry of the scene, such as the absence of negative obstacles, the flatness of the ground, and the presence of road markings. In complex urban scenarios, those assumptions are not necessarily verified, which validates the need for algorithms that explicitly detect the road, instead of simple road borders.

State-of-the-art road detection algorithms usually rely on the fusion of LIDAR scans and RGB images [25, 26]. Yet, the release of the KITTI dataset allowed for the emergence of LIDAR-only road detection algorithms. Most of those approaches actually readapt image processing techniques, to detect a road area instead of road edges. In [27], LIDAR points are projected and upsampled into a 2D image plane. The road area is then detected in this plane, via a histogram similarity measure. Another proposition was to project LIDAR points into a 2D sparse feature grid corresponding to a bird's eye view, and to train a convolutional neural network to predict a dense road region from this representation [11]. Another proposal was to train a neural network on range images, obtained from the spherical projection of LIDAR scans, and to fit a polygon on the predicted road points to obtain a dense prediction [12]. Yet, some LIDAR points are lost in the projection process, as each pixel of the range image can represent several LIDAR points. Although those approaches are currently the best performing LIDAR-only road-detection approaches on the KITTI dataset, they share the same limitation: they aim at predicting a dense road area although LIDAR scans are sparse, since they are evaluated with regards to dense, image-level road labels. All those approaches then predict the presence of road at locations where no actual LIDAR measurements are available, which is an incorrect behavior for a LIDAR-only road detection algorithm. Indeed, gaps or small obstacles could be present but unobserved, because of the sparsity of LIDAR scans. Moreover, due to the limitations of the KITTI dataset, which only has road labels for the front camera view, those systems are not designed to detect the road at 360°, and might require to be significantly modified to detect the road on full LIDAR scans, without any guarantee of success.
However, machine learning approaches [11, 12] have the advantage of being able to detect roads, after their training, without any explicit geometrical model, nor manually tuned thresholds. A machine-learning algorithm that could process raw LIDAR scans, and classify each individual LIDAR point as either belonging to the road or to an obstacle, would then be valuable when generating evidential grid maps in the context of autonomous driving.

The machine learning algorithms that can process raw LIDAR scans, and output per-point results, can be split into two main categories. The first category of approaches considers LIDAR scans as unorganized point-clouds, and usually relies on PointNet-like architectures [15]. The second category of approaches is strongly inspired by SqueezeSeg [14], and considers that LIDAR scans are organized point-clouds, which can be processed as range images obtained by spherical projections, similarly to what is done in [12].

PointNet applies a common multi-layer perceptron to the features of each point of a point-cloud. A global max-pooling operation is then used to extract a cloud-level feature. Such an architecture was mathematically demonstrated to be able to approximate any set function. Further layers can then be trained to perform object classification or semantic segmentation. However, PointNet expects normalized, and relatively constrained, inputs, which involves several pre-processing steps. Several PointNet-like architectures, optimized for the semantic segmentation of LIDAR scans, are proposed in [28]. They require an initial pre-processing step, as they process independent subsets of the overall scan, and none of them reach fully satisfactory results. PointNet was also tested for object detection in LIDAR scans. VoxelNet [29] applies a PointNet network on individual voxel partitions of a LIDAR scan. It then uses additional convolutions, and a region proposal network, to perform bounding-box regression. PointNet can also be used to perform bounding-box regression on LIDAR frustums, corresponding to a pre-computed 2D bounding-box obtained from an RGB image [30]. Given the limitations observed in [28] for semantic segmentation, and the need for pre-processing, we chose to rely on another type of approach to perform road detection. Indeed, as the results are intended to be fused in evidential grid maps in real-time, a solution that could directly process complete LIDAR scans would be more relevant.

SqueezeSeg [14] was introduced as a refined version of SqueezeNet [31], a highly efficient convolutional neural network reaching AlexNet-level accuracy, with a limited number of parameters and a low memory footprint. SqueezeNet relies heavily on Fire layers, that require less computation and use fewer parameters than traditional convolution layers. SqueezeSeg adapts SqueezeNet for the semantic segmentation of LIDAR scans, by processing organized LIDAR scans. Spherical projection can indeed be used to generate dense range images from LIDAR scans. SqueezeSeg was initially designed to perform semantic segmentation. The labels were obtained from the KITTI dataset, and ground-truth bounding-boxes: the LIDAR points that were falling into those boxes were classified according to the class of the related object. Again, as only the front camera view is labelled in KITTI, the labels do not cover complete LIDAR scans. A conditional random field (CRF), reinterpreted as a recurrent neural network, was also trained alongside SqueezeSeg, to further improve the segmentation results.
Recently, SqueezeSegV2 introduced a context aggregation module, to cope with the fact that LIDAR scans usually include missing points, due to sensor noise. The input range image used in SqueezeSegV2 also includes an additional channel that indicates whether a valid sensor reading was available, at each angular position. Finally, SqueezeSegV2 extensively uses batch-normalization, contrary to SqueezeSeg. Both SqueezeSeg and SqueezeSegV2 are highly efficient networks, as their original implementation could process up to 100 frames per second. More recently, RangeNet++ [32] reused a similar approach, to perform semantic segmentation on the SemanticKITTI dataset [33]. As the SemanticKITTI dataset is a finely labelled dataset, with numerous labelled classes, RangeNet++ has more parameters than SqueezeSegV2. It is also significantly slower, so that its original implementation does not match the 10Hz original frame rate of the Velodyne HDL64, which was used to collect the raw data of the SemanticKITTI dataset.

As SqueezeSegV2 is highly efficient, and processes full LIDAR scans that are easy to organize, refining it for road detection seems to be a natural option. Such a network, thanks to its fast inference time, can also be coupled with an evidential road mapping algorithm, to convert the segmentation results into a representation that can be directly used by an autonomous vehicle. A straightforward way to do it would be to create a dataset with evidential labels, train the network on them, and fuse the detection results over time. However, such labels are hard to obtain, due to the presence of an unknown class. We instead propose to extend the model described in [17], which offers a way to generate evidential mass functions from pre-trained binary generalized logistic regression (GLR) classifiers.
Generation of evidential mass functions from a binary GLR classifier
Let Ω = {R, ¬R} be a binary frame of discernment, with R corresponding to the fact that a given LIDAR point belongs to the road, and ¬R to the fact that it does not. Following the evidential framework, the reasoning is done on the power set 2^Ω = {∅, R, ¬R, Ω}. Ω indicates ignorance on the class of the point. ∅ indicates that the point does not fit the frame of discernment, which is not possible in our case, since each LIDAR point either belongs, or does not belong, to the road.

Consider the binary classification problem with X = (x_1, ..., x_d), a d-dimensional input vector representing a LIDAR point, and Y ∈ Ω a class variable. Let p_R(x) be the probability that Y = R given that X = x. Then 1 − p_R(x) is the corresponding probability that Y = ¬R. Let a binary logistic regression classifier be trained to predict the probability that a LIDAR point belongs to the road. Then, p_R(x) is such that:

p_R(x) = S( Σ_{j=1}^{d} β_j φ_j(x) + β_0 )    (1)

where S is the sigmoid function, and the β parameters are usually learnt alongside those of the φ_j mappings. Equation 1 exactly corresponds to the behavior of a deep neural network trained as a binary GLR classifier, with x being its input.

Let ⊕ be Dempster's rule of combination, that can be used to fuse two independent evidential mass functions m_1, m_2 that follow the same frame of discernment. The resulting mass function m_{1,2} = m_1 ⊕ m_2 is such that:

m_{1,2}(∅) = 0    (2a)
K = Σ_{B ∩ C = ∅} m_1(B) m_2(C)    (2b)
∀ A ∈ 2^Ω \ {∅}: m_{1,2}(A) = ( Σ_{D ∩ E = A ≠ ∅} m_1(D) m_2(E) ) / (1 − K)    (2c)

A simple evidential mass function m on Ω is such that ∃ θ ⊂ Ω, m(θ) = s, m(Ω) = 1 − s. Let w = −ln(1 − s) be the weight of evidence associated to m; m can then be represented as {θ}^w.

The sigmoid function is strictly increasing. Then, in Equation 1, the larger the value generated by the classifier is, the larger p_R(x) is and the smaller 1 − p_R(x) is. Equation 1 can be rewritten as follows:

p_R(x) = S( Σ_{j=1}^{d} (β_j φ_j(x) + α_j) ) = S( Σ_{j=1}^{d} w_j )    (3)

with

Σ_{j=1}^{d} α_j = β_0    (4)

Each w_j can then be seen as a piece of evidence towards R or ¬R, depending on its sign. Let us assume that the w_j values are weights of evidence of simple mass functions, denoted by m_j. Let w_j^+ = max(0, w_j) be the positive part of w_j, and let w_j^− = max(0, −w_j) be its negative part. Whatever the sign of w_j, the corresponding evidential mass function m_j can be written as:

m_j = {R}^{w_j^+} ⊕ {¬R}^{w_j^−}    (5)

Under the assumption that all the m_j mass functions are independent, the Dempster-Shafer operator can be used to fuse them together. The resulting mass function obtained from the output of the binary logistic regression classifier, noted m_LR, is as follows, with w^+ = Σ_{j=1}^{d} w_j^+ and w^− = Σ_{j=1}^{d} w_j^−:

m_LR({R}) = [1 − exp(−w^+)] exp(−w^−) / (1 − K)    (6a)
m_LR({¬R}) = [1 − exp(−w^−)] exp(−w^+) / (1 − K)    (6b)
m_LR(Ω) = exp(−Σ_{j=1}^{d} |w_j|) / (1 − K)    (6c)

with

K = [1 − exp(−w^+)] [1 − exp(−w^−)]    (6d)

By applying the plausibility transformation described in [34] to the evidential mass function in Equation 6, the expression of p_R(x) can be reconstructed. Indeed:

p_R(x) = ( m_LR({R}) + m_LR(Ω) ) / ( m_LR({R}) + m_LR({¬R}) + 2 m_LR(Ω) )    (7)

This means that any binary GLR classifier can be seen as a fusion of simple mass functions, that can be derived from the parameters of the final linear layer of the classifier. The previous assumptions are thus justified.
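To make the construction above concrete, the following sketch (our own illustration, not the authors' implementation; the names glr_mass_function and plausibility_probability are ours) computes m_LR and the recovered probability p_R(x) from the weights of a final linear layer, for one input vector, using a uniform split of β_0 as the α vector:

```python
import numpy as np

def glr_mass_function(beta, beta_0, phi, alpha=None):
    """Derive the evidential mass function of a binary GLR classifier.

    beta   : (d,) weights of the final linear layer
    beta_0 : scalar bias of the final linear layer
    phi    : (d,) penultimate features phi_j(x) for one sample
    alpha  : (d,) split of beta_0 such that sum(alpha) == beta_0
             (uniform split by default, see Eq. 4)
    """
    d = len(beta)
    if alpha is None:
        alpha = np.full(d, beta_0 / d)
    w = beta * phi + alpha                    # weights of evidence (Eq. 3)
    w_pos = np.maximum(0.0, w).sum()          # total evidence towards R
    w_neg = np.maximum(0.0, -w).sum()         # total evidence towards not-R
    K = (1 - np.exp(-w_pos)) * (1 - np.exp(-w_neg))            # Eq. 6d
    m_R = (1 - np.exp(-w_pos)) * np.exp(-w_neg) / (1 - K)      # Eq. 6a
    m_notR = (1 - np.exp(-w_neg)) * np.exp(-w_pos) / (1 - K)   # Eq. 6b
    m_Omega = np.exp(-np.abs(w).sum()) / (1 - K)               # Eq. 6c
    return m_R, m_notR, m_Omega

def plausibility_probability(m_R, m_notR, m_Omega):
    """Plausibility transformation (Eq. 7): recovers p_R(x)."""
    return (m_R + m_Omega) / (m_R + m_notR + 2 * m_Omega)

# Example: the recovered probability matches the sigmoid output of Eq. 1.
rng = np.random.default_rng(0)
beta, beta_0, phi = rng.normal(size=8), 0.3, rng.normal(size=8)
m = glr_mass_function(beta, beta_0, phi)
p_sigmoid = 1 / (1 + np.exp(-(beta @ phi + beta_0)))
assert np.isclose(plausibility_probability(*m), p_sigmoid)
```

The final assertion illustrates the consistency stated by Equation 7: whatever the chosen α split, the plausibility transformation of the fused simple mass functions recovers the probability predicted by the classifier.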
However, the α_j values introduced in Equation 4 have to be estimated. Let α = (α_1, ..., α_d). A cautious approach would be to select them so as to maximize the mass values on Ω. A solution to the resulting minimization problem over the training set was proposed in [17]. Alongside this approach, we propose to explore two other possibilities.

The original approach proposed in [17] was to select the α vector that maximizes the sum of the m_LR(Ω) mass values over the training set, so as to get the most uncertain evidential mass functions. This leads to the following minimization problem:

min f(α) = Σ_{i=1}^{n} Σ_{j=1}^{d} (β_j φ_j(x_i) + α_j)^2    (8)

with {(x_i, y_i)}_{i=1}^{n} being the training dataset. Let φ̄_k be the mean of φ_k(x_i) over the training set. The optimal value for α_j is then:

α_j = β_0/d + (1/d) Σ_{q=1}^{d} β_q φ̄_q − β_j φ̄_j    (9)

This solution thus relies on a post-processing step, and is dependent on the parameters that were initially learnt by the network. Typically, if either d, the β_j values or the φ_j(x_i) values in Equation 3 are very large, the resulting optimal α_j values might remain negligible with regards to the corresponding w_j values. Given that binary GLR classifiers are usually trained from binary labels, and thus tend to saturate the sigmoid function by predicting probabilities close to 0 or 1, this case is likely to happen, especially for deep GLR classifiers for which the d value can be extremely high.

There is actually no practical reason to optimize the α vector over the training set, apart from the lack of data. Indeed, the predicted probabilities are not dependent on the α vector, since the sum of its elements is always equal to β_0. Moreover, the minimization problem in Equation 8 does not depend on the original labels. As we intend to use this model for road detection in LIDAR scans, abundant and varied unlabelled data can be acquired easily by simply recording LIDAR scans. The minimization problem in Equation 8 can then be optimized on these abundant LIDAR scans.

It was observed in [17] that if the φ_k(x_i) features in Equation 8 were centered around their mean, the minimization problem might be partially solved by applying L2-regularization during the training. Indeed, Equation 8 then becomes:

min f(α) = Σ_{j=1}^{d} β_j^2 ( Σ_{i=1}^{n} φ_j(x_i)^2 ) + (n/d) β_0^2    (10)

However, nothing would then prevent the parameters of the φ_k functions from overcompensating for the L2-regularization during the training, especially if non-linearities are used, and if the φ_k mappings are modelled by a very deep network. This means that the total sum might actually not be minimized. Moreover, centering the φ_k(x_i) is not a straightforward operation. If the classifier only used linear operations, this could easily be done by centering the training set. However, performant classifiers are usually non-linear. The centering thus has to be done as an internal operation of the network.

We previously observed that using Instance-Normalization [35] in the final layer of the classifier would be an easy way to solve those problems [36]. Let υ(x) = (υ_1(x), ..., υ_d(x)) be the mapping modelled by all the consecutive layers of the classifier but the last one; let ῡ_j be the mean value of the υ_j function on the training set, and σ²(υ_j) its corresponding variance.
Then, if it is assumed that the final layer of the network relies on an Instance-Normalization and a feature-wise sum, Equation 8 becomes:

min f(α) = Σ_{i=1}^{n} Σ_{j=1}^{d} ( β_j (υ_j(x_i) − ῡ_j) / √(σ²(υ_j) + ε) + α_j )^2    (11)

The α vector then actually corresponds to the biases of the Instance-Normalization layer. After development, if ε is assumed to be negligible, the following expression is obtained:

min f(α) = n Σ_{j=1}^{d} β_j^2 + n Σ_{j=1}^{d} α_j^2    (12)

Again, this expression would be minimized during the training by simply applying L2-regularization. Instance-Normalization is thus an interesting way of centering the penultimate features of the network, since it inhibits the influence of the intermediate layers of the network. It should be noted that the use of Instance-Normalization is not incompatible with a post-processing step to further optimize the α vector over either the training set, or abundant unlabelled data. The best strategy thus has to be defined with regards to the road detection and grid mapping tasks.

To properly detect the road in LIDAR scans, we propose an architecture that is heavily inspired by SqueezeSeg [14] and SqueezeSegV2 [16], as those networks are particularly efficient at processing LIDAR scans. However, we introduce several refinements to the original architectures and training schemes, so as to better fit the task of road detection, and to allow for evidential fusion of segmented scans. We name the resulting architecture "RoadSeg".
Figure 2: Conversion of a LIDAR scan into a multi-channel range image by spherical projection. The channels in the lower part correspond to the returned intensity, the Cartesian coordinates, and the measured range.

Similarly to SqueezeSeg, RoadSeg processes LIDAR scans that are converted into dense range images. Such a representation corresponds to raw sensor measurements, as LIDAR scanners actually return distances at fixed angular positions. The dimensions of those range images then depend on the specifications of the LIDAR sensor. In this work, RoadSeg is assumed to process scans obtained from Velodyne VLP32C sensors, operating at 10Hz, and scanning at 360°. These sensors rely on 32 stacked lasers, each scanning at a different zenith angle, with a horizontal angular resolution of 0.2° when operating at 10Hz. As such, only 32*360/0.2 angular positions can be observed by the sensor. Each LIDAR scan obtained from a VLP32C can then be represented by a 32*1800*C grid, with C being the number of features available for each point. This grid can be considered as an image with a height of 32 pixels, a width of 1800 pixels, and C channels. Let (x, y, z) be the Cartesian coordinates of a LIDAR point, let (α, β) be the indexes of the pixel that corresponds to this point, and let RingId be the index of the laser that measured the point. A RingId of 0 (respectively 31) indicates that the point is measured by the topmost (respectively lowest) laser. The elevation angle of the point, arcsin(z/√(x² + y² + z²)), is fixed for each laser, so that one pixel index is directly given by β = RingId; the other index, α, corresponds to the azimuth position of the point, quantized at the 0.2° horizontal resolution of the sensor.
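As an illustration of this projection, the following sketch (our own code, not the authors'; it assumes that the row index is given by the laser ring and the column index by the azimuth at the 0.2° resolution, and it only fills a subset of the channels used by RoadSeg) builds a 32×1800 multi-channel range image from an unordered list of VLP32C points:

```python
import numpy as np

def scan_to_range_image(points, ring_ids, intensities,
                        n_rings=32, h_res_deg=0.2):
    """Project a LIDAR scan onto a dense (n_rings, n_cols, C) range image.

    points      : (N, 3) Cartesian coordinates (x, y, z)
    ring_ids    : (N,) laser index of each point (0 = topmost laser)
    intensities : (N,) returned intensity of each point
    """
    n_cols = int(360.0 / h_res_deg)                    # 1800 columns
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)                    # measured range
    azimuth = np.degrees(np.arctan2(y, x)) % 360.0     # azimuth in [0, 360)
    cols = np.minimum((azimuth / h_res_deg).astype(int), n_cols - 1)
    rows = ring_ids.astype(int)

    # Channels: x, y, z, range, intensity, validity.
    image = np.zeros((n_rings, n_cols, 6), dtype=np.float32)
    image[rows, cols, 0:3] = points
    image[rows, cols, 3] = r
    image[rows, cols, 4] = intensities
    image[rows, cols, 5] = 1.0     # a valid return exists at this position
    return image
```

In practice, as explained below, such an image can also be filled directly from the sensor packets, since the returned ranges are already ordered by laser and azimuth position.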
The channels of the pixel at the (α, β) position can then be filled with the features of the corresponding point. Figure 2 presents the resulting range images obtained after spherical projection of LIDAR points. In practice, the VLP32C scanner returns ethernet packets containing measured ranges that are already ordered by laser and azimuth position. The range images can then be obtained directly from the sensor, without having to pre-process a Cartesian LIDAR scan. In practice, this projection was only used to manually label data; when run on the vehicle, RoadSeg processes range images that are directly obtained from the ethernet packets.

In total, we chose to represent a LIDAR point by its Cartesian coordinates, its spherical coordinates, and its returned intensity. Additionally, we also use a validity channel, as done for SqueezeSegV2. A point is valid, and the corresponding validity is equal to 1, if the range that was measured at its angular position is strictly positive. Otherwise, the point is not valid, which can correspond to missed detections, or lost ethernet LIDAR packets.

Figure 3 depicts the RoadSeg architecture, which is very close to the original SqueezeSeg architecture, as it also relies on the use of Fire and Fire-deconvolutional layers, and skip connections. RoadSeg also only downsamples horizontally, as SqueezeSeg and SqueezeSegV2 do. RoadSeg uses one downsampling step less than SqueezeSegV2. Indeed, the first convolutional layer of RoadSeg has a stride of 1 in both directions, while SqueezeSeg has a horizontal stride of 2. The kernel size is however still equal to 3. This is justified by the fact that, in the case of road detection, downsampling might make the road hard to distinguish at long range. SqueezeSeg benefitted from this downsampling because it was originally designed for semantic segmentation of road users, which are often relatively close to the sensor. RoadSeg then also has one deconvolutional layer less than SqueezeSeg. Additionally, as done in SqueezeSegV2, RoadSeg uses Batch Normalization after each convolution, both inside and outside the Fire layers. The number of output channels of the Fire layers is also reduced with regards to SqueezeSeg and SqueezeSegV2, as reported in Table 1.

An initial Batch Normalization is also applied on the input of the network. This mechanism was proposed to replace the manual normalization step that is used in both SqueezeSeg and SqueezeSegV2. As the normalization parameters initially used in SqueezeSeg and SqueezeSegV2 are computed from the train set, manual normalization assumes that the sensor setup is the same in the train, validation and test sets. As our labelled LIDAR scans were not acquired by the same vehicles, this assumption does not hold anymore. Applying Batch Normalization to the inputs of RoadSeg is thus a way to train the network on inputs that are not exactly normalized. A T-Net, inspired by PointNet [15], can also be used to realign the LIDAR scans, by predicting a 3×3 transformation matrix that is applied to the Cartesian coordinates of the points.

(a) General road classification architecture; (b) T-Net of RoadSeg
Figure 3: RoadSeg, a refined SqueezeSeg-like architecture for road detection and evidential fusion

Layer | Output channels (SqueezeSegV1/V2) | Output channels (RoadSeg)
Fire2 | 128 | 96
Fire3 | 128 | 128
Fire4 | 256 | 192
Fire5 | 256 | 256
Fire6 | 384 | 256
Fire7 | 384 | 256
Fire8 | 384 | 256
Fire9 | 512 | 256
Table 1: Comparison of the Fire layers in RoadSeg and SqueezeSeg

Finally, contrary to SqueezeSeg and SqueezeSegV2, RoadSeg does not use a CRF layer. First, the final layer of RoadSeg is intended to generate evidential mass functions, either from a regular convolutional layer, or from an Instance-Normalization layer. Secondly, removing this layer reduces the inference time of RoadSeg. Finally, experiments showed that the CRF layer was degrading the performances of SqueezeSegV2, for the road detection task, and was thus assumed not to be suitable for RoadSeg.
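For reference, here is a minimal sketch of a Fire layer with Batch Normalization after each convolution, in the spirit of the blocks listed in Table 1. This is our own illustrative re-implementation, assuming a PyTorch implementation (the paper only states that the system is implemented in Python); the layer widths used in the example are hypothetical.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style Fire layer: a 1x1 'squeeze' convolution followed by
    parallel 1x1 and 3x3 'expand' convolutions, with Batch Normalization
    after each convolution, as in SqueezeSegV2 and RoadSeg."""

    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Sequential(
            nn.Conv2d(in_ch, squeeze_ch, kernel_size=1),
            nn.BatchNorm2d(squeeze_ch), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(
            nn.Conv2d(squeeze_ch, expand_ch // 2, kernel_size=1),
            nn.BatchNorm2d(expand_ch // 2), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(
            nn.Conv2d(squeeze_ch, expand_ch // 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(expand_ch // 2), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.squeeze(x)
        # Concatenate the two expand branches along the channel dimension.
        return torch.cat([self.expand1x1(x), self.expand3x3(x)], dim=1)

# Example: a Fire layer producing 96 output channels (Fire2 in Table 1),
# applied to a 32x1800 range image with 64 input channels (illustrative).
fire2 = Fire(in_ch=64, squeeze_ch=16, expand_ch=96)
out = fire2(torch.zeros(1, 64, 32, 1800))   # -> shape (1, 96, 32, 1800)
```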
To carry out those experiments, LIDAR scans were collected in several urban and peri-urban areas in France. Coarse training and validation sets were automatically labelled from HD Maps. Moreover, a test set was manually labelled, to obtain reliable performance indicators.
Reliable semantic segmentation labels are usually generated manually, by expert annotators. This however proves to be expensive and time consuming. Automatic labelling can instead be used to generate labelled LIDAR scans, on which a road detection algorithm can be trained. Ideally, an error model should be associated with an automatic labelling procedure, as the resulting data is likely to include errors. We thus chose to softly label the road in LIDAR scans from pre-existing lane-level maps, and according to a localization error model.
The automatic soft-labelling procedure used in the context of this work is presented in Figure 4. It was assumed that the LIDAR scans are acquired in areas where reliable georeferenced maps, with correct positions and road dimensions, are available. Moreover, the LIDAR scans were supposed to be acquired from a moving data acquisition platform, which includes a LIDAR scanner that can perceive the ground and obstacles, and a GNSS localization system which returns an estimation of its pose in the map frame, and a corresponding covariance matrix. Those sensors were considered to be rigidly attached to the acquisition platform, and calibrated together. The coordinates in the map frame were expressed in terms of northing and easting offsets with regards to an origin. This origin was close to the data acquisition platform.

Figure 4: Automatic soft-labelling of LIDAR scans from road maps

Under the classical assumption that the localization error follows a zero-mean Gaussian model [37], an uncertainty ellipse could then be obtained, from the covariance matrix associated with the current pose. A probability of belonging to a mapped road could then be estimated, for each LIDAR point.

Let X_i^l = (x_i^l, y_i^l, z_i^l) represent a LIDAR point, and X_i^m = (x_i^m, y_i^m, z_i^m) the corresponding point after projection in the map frame. Let R_i and T_i be the rotation matrix, and translation vector, used to project X_i^l into the map frame. We were using a scanning LIDAR with a parameterized sweeping period that we note S, and assumed that this rotation speed remained constant over time. The vehicle motion thus had to be accounted for, before projection into the map frame. We neglected the rotation speed of the vehicle during the scan. We also assumed that the longitudinal speed of the vehicle was constant during the scan, and that the sensor was positioned so that its X-axis was parallel with the vehicle's forward direction. Let Δt_i be the estimated time offset between the beginning of the scan, and the moment when X_i^l was acquired. A spinning LIDAR records points at pre-defined azimuth positions. Let θ_i be the azimuth position, in degrees, at which X_i^l was acquired. Then, Δt_i can be estimated by:

Δt_i ≈ S * θ_i / 360    (13)

We also had access to the closest instantaneous measure of the vehicle's speed at the end of the scan, via its CAN network. We note this speed v_s. The projection of X_i^l into the map frame is then done as follows:

X_i^m = (x_i^m, y_i^m, z_i^m)^T = [R_i | T_i] (x_i^l + Δt_i * v_s, y_i^l, z_i^l, 1)^T    (14)

Let P_R(X_i^l) be the labelled probability that X_i^l belongs to the road. First, a ground detection step was used to segregate obvious obstacles from points that, potentially, belonged to the road. It was assumed that the algorithm does not falsely classify ground points as obstacles. If X_i^l did not belong to the detected ground, then P_R(X_i^l) = 0, since it was considered to belong to an obstacle. Otherwise, X_i^l was projected into the map frame. Given that the projection into the map frame relied on a rigid transformation, the localization uncertainty of the resulting X_i^m was equal to the uncertainty of the pose measured by the GNSS localization system. The closer to a road edge X_i^m was, the lower the probability that it belonged to the road was. Let d_i be the minimum distance between X_i^m and a mapped road edge. The covariance matrix given by the GNSS localization system could, ideally, have been used to estimate the standard deviation of the Gaussian distribution that models the error on d_i, by considering the heading of the road in the map frame [38]. However, the value of this heading can be ambiguous in curves or roundabouts. To account for those use cases, and to facilitate the computations, a bound on this standard deviation, noted σ_b, was thus estimated instead. Let σ_n and σ_e be, respectively, the standard deviations in the northing and easting directions; then σ_b was estimated as follows:

σ_b = max(σ_n, σ_e) + γ    (15)

The γ term is a hyperparameter used to account for the uncertainty in the measures of the LIDAR sensor, and possible errors in motion compensation. It also covers inevitable errors in the extrinsic calibration parameters that we used to register the LIDAR and the localization. Indeed, such parameters are only accurate up to a certain point [39]. Then, if X_i^l belonged to the ground, P_R(X_i^l) was estimated as follows. If X_i^m falls into a mapped road:

P_R(X_i^l) = ∫_{−∞}^{d_i} 1/(σ_b √(2π)) exp(−x²/(2σ_b²)) dx    (16)

Otherwise:

P_R(X_i^l) = 1 − ∫_{−∞}^{d_i} 1/(σ_b √(2π)) exp(−x²/(2σ_b²)) dx    (17)

It has to be noted that, even though the motion was compensated in the forward direction via Equation 14, this probability is associated to the non-compensated LIDAR point. The reason for that is that, even though our compensation procedure empirically seemed sufficient to build our training set, it was far from being very accurate, as many parameters were not considered (among others: the variations in the longitudinal speed of the vehicle over the scanning process, the variations in the rotation speed of the LIDAR over the scanning process, the rotation speed of the vehicle, the noise in speed measurements, and the differences in framerates among the odometry and LIDAR sensors). In actual autonomous driving situations, those parameters would have to be accounted for, probably via approaches inspired by Kalman filters, which rely on manually tuned hyper-parameters.
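As a minimal illustration of Equations 15 to 17 (our own sketch, with hypothetical inputs; the ground detection and the distance to the closest mapped road edge are assumed to be computed elsewhere), the soft road label of a ground point can be derived as follows:

```python
import math

def road_probability(d_i, inside_mapped_road, sigma_n, sigma_e, gamma=0.10):
    """Soft road label of a ground LIDAR point (Eqs. 15-17).

    d_i                : distance (m) between the projected point and the
                         closest mapped road edge
    inside_mapped_road : True if the projected point falls inside a mapped road
    sigma_n, sigma_e   : localization standard deviations (northing, easting)
    gamma              : margin for LIDAR, compensation and calibration errors
    """
    sigma_b = max(sigma_n, sigma_e) + gamma                       # Eq. 15
    # Gaussian CDF evaluated at d_i, with standard deviation sigma_b
    cdf = 0.5 * (1.0 + math.erf(d_i / (sigma_b * math.sqrt(2.0))))
    return cdf if inside_mapped_road else 1.0 - cdf               # Eqs. 16-17

# A point 0.8 m inside a mapped road, with a ~0.2 m localization uncertainty:
print(road_probability(0.8, True, sigma_n=0.2, sigma_e=0.15))     # ~0.996
```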
So as not to have a road detection system trained on data with a specific compensation process, which would potentially have to be tuned manually and could thus have to be modified for every experiment, we decided to build our training set from uncompensated LIDAR scans, and to only compensate roughly, via Equation 14, when labelling automatically from the map. This way, the training set, and the road detection systems trained on it, are independent from any compensation method. This allows for the use of any compensation method after road detection, without having to potentially generate training scans for every possible compensation method. Figure 5 presents two use cases and the resulting softly labelled LIDAR scans.

Appropriate data was needed to apply this automatic labelling procedure: accurate maps and localization were required. Open source maps, such as OpenStreetMap, were thus not considered, as the road dimensions, and especially their width, are rarely available and accurate. The nuScenes dataset [40] includes LIDAR scans and an accurate map; however, the localization of the vehicle, though extremely accurate, is not provided with uncertainty indicators, which prevented us from using this dataset. Moreover, we observed that the Velodyne HDL32E LIDAR sensors, that were used to collect the nuScenes dataset, seemed improper for road detection. Indeed, this sensor has a vertical resolution of 1.33°. As such, most of the road points in nuScenes were actually acquired very close to the data collection platform, as presented in Figure 6, limiting the use of this dataset for long-range road detection. Data was thus collected via a LIDAR scanner that has a better vertical resolution, in several areas where Renault S.A.S has access to lane-level HD maps. The framework of those maps is described in [41].
ZoeCab platforms were used to collect data. ZoeCabs are robotized electric vehicles, based on Renault Zoes, that are augmented with perception and localization sensors, and intended to be deployed as autonomous shuttles in urban and peri-urban areas. Each vehicle embedded a VLP32C Velodyne LIDAR running at 10 Hz, and a Trimble BX940 inertial positioning system, which was coupled with an RTK Satinfo, so that centimeter-accurate localization could be achieved when RTK corrections were available. The PPS signal provided by the GPS receiver was used to synchronize the computers and sensors together. Figure 7 presents one of the ZoeCab systems that were used for this work.

Figure 5: Use cases ((a) two opposite lanes; (b) a roundabout) and the resulting automatically labelled LIDAR scans ((c) and (d)). The data acquisition platform is depicted in white, and the map in yellow. Green points are pre-classified as ground points; the more purple a point is, the higher its probability of belonging to the road; the redder a point is, the lower its probability of belonging to the road.
In order to get diverse training samples, LIDAR and localization data was acquired in four different cities and locations in France, with two different ZoeCab systems: Compiègne, Rambouillet, Renault Technocentre and the Paris-Saclay campus. One of the vehicles was used to collect the data in Compiègne and Rambouillet, and the other one was used in Renault Technocentre and the Paris-Saclay campus. All those locations are urban or peri-urban areas. The sensor setup was similar, but not identical, between those vehicles, with a VLP32C LIDAR and a GNSS antenna on top of the vehicle, and the Trimble localization system in the trunk. LIDAR scans were only automatically labelled every ten meters, to limit the redundancy in the training set. As an accurate localization was needed to label the LIDAR point-clouds from the map, the LIDAR scans were only labelled when the variances in the easting and northing directions, associated with the pose estimated by the localization system, were lower than 0.5 m. The γ parameter in Equation 15 was empirically set to 10 cm. The initial ground detection was realized thanks to a reimplementation of the algorithm described in [42]. The vehicles were driven at a typical speed of 30 km/h, in order to be able to rely on our compensation procedure. Figure 8 presents the areas in which data was collected. The total number of collected and labelled samples for each location is reported in Table 2.

Figure 6: Example of labelled scan from the nuScenes dataset. The horizontal resolution of the scan is significant, and most of the points belonging to the road are, actually, very close to the data acquisition platform.

Though a training and a validation dataset were obtained, they would not enable road detection algorithms to be properly evaluated, as the obtained labels are not binary, and may still include errors. An additional test set, that would only be used to evaluate the performances of the road detection system, was thus also created. As no open-source dataset uses a VLP32C LIDAR scanner, this test dataset was specifically labelled by hand for our work. The platform used to collect the data was the ZoeCab system that was used to collect the Renault Technocentre and Paris-Saclay datasets, and it was driven in Guyancourt, France. Yet, the LIDAR sensor that was used to collect the test data was different from the one that was, previously, used to create the automatically labelled dataset. Indeed, we observed that some differences with regards to the returned intensity exist among different VLP32C LIDARs, even if the sensors are properly calibrated. Additionally, the sensor was put five centimeters lower than the position that was used, previously, to collect the training and validation sets. This test dataset was pre-labelled thanks to the same automatic labelling procedure that was used for the training set, and manually refined afterwards. We chose the Guyancourt area for our test set because it contains road setups that are very different from our training scans: most of the roads in Guyancourt are two-lane one-way roads, while the training scans were mostly acquired on single-lane two-way roads. Moreover, this area contains numerous roundabouts with very varying sizes. In the Guyancourt area, the vehicle was driven at up to 70 km/h, following local driving rules, as the motion compensation for projection into the map did not need to be as accurate as for the training set, since the automatic labels were manually refined. In total, the resulting test dataset includes 347 scans. Figure 9 presents the locations where the test dataset was collected.

A specificity of the Guyancourt area is that there exist reserved lanes for buses that are physically separated from the rest of the road. Those lanes might have unique and particular setups. Figure 10 for instance presents a bus lane that goes through a roundabout, while the other vehicles have to drive in the circle. In Figure 10b, it can be observed that the bus lane is separated from the other lanes before the roundabout. In Figure 10c, it can be seen that the part of the bus lane that goes through the roundabout does not have the same texture as the other parts of the road. As those bus lanes are very particular, they were labelled as belonging to a Do not care class. We considered that classifying those bus lanes as roads was not relevant, but not an error per se, as they are still technically roads. They were not considered in the evaluations done on the test set. Our training, validation and test sets are made publicly available (https://datasets.hds.utc.fr, under the "Automatic and manual LIDAR road labels" project), alongside metadata that include GPS locations and CAN measurements from the data acquisition platforms.

The performances of RoadSeg are mainly evaluated on our manually labelled test set. The KITTI dataset is also used to give some preliminary, though limited, results, as it is one of the most commonly used datasets for road detection.

Figure 7: Data collection platform and sensor setup ((a) a ZoeCab platform; (b) Velodyne VLP32C and GNSS antenna)

Location | Technocentre | Rambouillet | Compiègne | Paris-Saclay Campus | Total
Number of samples | 647 | 337 | 160 | 356 | 1500
Table 2: Number of automatically labelled LIDAR scans per location
Contrary to other approaches that are evaluated on the KITTI road dataset, we evaluate our system at the scan level, and do not aim at predicting the presence of road in areas that are not observed by the LIDAR sensor. No publication to the KITTI road benchmark was therefore possible. The labels of the KITTI road dataset are only given in an image plane corresponding to a camera whose intrinsic calibration parameters are available, alongside extrinsic calibration parameters with regards to a Velodyne HDL64 LIDAR, which was synchronized with the camera. The image road labels can then be projected into the LIDAR scans, to create ground-truth for road detection in LIDAR scans. Each point is represented by its Cartesian coordinates, its reflectance, the range measured by the LIDAR sensor, and its validity. The scans are then represented as 6 × 64 × 512 images. The TNet of RoadSeg only predicts transformations for the Cartesian coordinates, and leaves the ranges and intensities unchanged. The LIDAR scans can only be labelled partially, because the camera view only covers a section of the corresponding LIDAR scan. Moreover, some LIDAR points can easily be mislabelled. Indeed, the LIDAR labels in the KITTI road dataset are not accurate, especially around road objects, and synchronization and calibration errors between the sensors exist in the KITTI dataset. Figure 11 displays an example of labels obtained from the KITTI dataset. Short-range points are unlabelled, most of the 360° scan is excluded from the labelling procedure, and the image labels around the cyclist and the parked vehicles are noticeably coarse. Public labels are only available for the 289 scans that are present in the training set of the KITTI road dataset.

Given the small number of labelled scans available in the KITTI dataset, a 5-fold cross-validation procedure was used to estimate the performances of RoadSeg. RoadSeg is compared with both SqueezeSegV2, and SqueezeSegV2 without its CRF layer. Each model was trained for 200 epochs on 4 of the folds. The training procedure was the same for the three models, and followed the one originally used by SqueezeSegV2, with a batch size of 16 for all the models. The training was repeated until each model had been trained on every possible combination of folds. The 5 folds were the same for each model. After the 200 epochs, the selected parameters for each model correspond to those that maximize the F1-score over the 4 training folds. Only the points that are both valid and labelled are considered in the loss function and by those metrics. We consider that a point, for which a probability of belonging to the road is strictly higher than 0.5, is predicted as being a road point. Finally, average F1, Recall, Precision and IoU scores were computed from the results of each model on the corresponding test folds, and reported in Table 3. We also report maximum and minimum scores over the 5 test folds, alongside the average execution time of each model on an NVidia TitanX GPU.

Figure 8: Automatically labelled dataset, in four locations ((a) Technocentre; (b) Rambouillet; (c) Compiègne; (d) Paris-Saclay campus). Each point indicates a position where a LIDAR scan was automatically labelled, and the white arrows are oriented towards the north direction. The blue points correspond to the training set, the orange points correspond to the validation set.

The amount of data that is available in the KITTI dataset is probably inadequate for proper training, and thus limits the relevance of those evaluations. The limitations of the projection procedure also prevent us from considering that those results are completely reliable. However, they tend to show that RoadSeg vastly outperforms SqueezeSegV2 for road detection. This however comes with an increased mean inference time. Yet, RoadSeg still processes LIDAR scans at a high rate, with a mean inference time of 30.8 ms. We also observe that the CRF layer of SqueezeSegV2 seems not to improve the performances in road detection. This may mean that using a CRF alongside SqueezeSegV2 may require other hyperparameters, or kernels, for road detection. This could also mean that the CRF layer struggles with large objects, such as the road, as it aims at locally smoothing the segmentation results.
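As an illustration of the evaluation protocol described above (our own sketch, not the authors' evaluation code), the per-point Precision, Recall, F1-score and IoU can be computed as follows, counting only the points that are both valid and labelled:

```python
import numpy as np

def road_metrics(prob_road, labels, mask):
    """Per-point road detection metrics.

    prob_road : (N,) predicted probabilities of belonging to the road
    labels    : (N,) ground truth (1 = road, 0 = not road)
    mask      : (N,) boolean, True for points that are both valid and labelled
    """
    pred = prob_road[mask] > 0.5       # decision threshold used in the paper
    gt = labels[mask].astype(bool)
    tp = np.sum(pred & gt)             # road points correctly detected
    fp = np.sum(pred & ~gt)            # obstacle points predicted as road
    fn = np.sum(~pred & gt)            # road points predicted as obstacle
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou
```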
More reliable results are given in the next section, as the models are evaluated on our manually labelled dataset, after having been trained on the automatically labelled dataset that was generated from our HD Maps.
Figure 9: Generation of the test dataset in Guyancourt. Each green point indicates a location where a manually labelled LIDAR scan was recorded.
Model | Precision (Avg / Max / Min) | Recall (Avg / Max / Min) | F1-score (Avg / Max / Min) | IoU (Avg / Max / Min) | Mean inference time
SqueezeSegV2 without CRF | 0.9267 / 0.9573 / 0.8882 | 0.9108 / 0.9675 / 0.8345 | 0.9177 / 0.9426 / 0.8792 | 0.8487 / 0.8914 / 0.7844 |
SqueezeSegV2 with CRF | 0.9277 / 0.9424 / | | | |
Table 3: Road detection performances of SqueezeSegV2 and RoadSeg on the KITTI road dataset, estimated by 5-fold cross-validation

To give more significant results, we report metrics on our manually labelled test set. All the approaches were only trained thanks to the 1500 automatically labelled LIDAR scans that were recorded in Compiègne, Renault Technocentre, Rambouillet and the Paris-Saclay Campus, and thus have never been trained on the area where the manually labelled data was recorded. The training/validation split is exactly the one presented in Figure 8, and all the approaches were trained with the same procedure, which is close to the one used for the KITTI dataset, except that the batch size was reduced to 10 for all the approaches. This was needed because, while the scans from the KITTI dataset only cover the front view, our training set is composed of 360° scans. This also justified modifying the behavior regarding the padding of the feature maps processed by the networks. Indeed, the left and right sides of the inputs actually correspond to neighboring areas. The left (respectively right) padding is then obtained by mirroring the right (respectively left) side. Tests on several variants are presented. First of all, SqueezeSegV2, SqueezeSegV2 without CRF, RoadSeg, and RoadSeg without TNet have been trained on the whole automatically labelled dataset, and each point was represented by the eight available features (Cartesian coordinates, spherical coordinates, intensity and validity). We also trained four additional and lighter networks:

• RoadSeg-Intensity: the points are only represented by their intensity, elevation angle, and validity.

• RoadSeg-Spherical: the points are only represented by their spherical coordinates, and their validity.

• RoadSeg-Cartesian: the points are only represented by their Cartesian coordinates, and their validity.

• RoadSeg-Cartesian without TNet: like RoadSeg-Cartesian, but without the TNet.

RoadSeg-Intensity and RoadSeg-Spherical do not include TNets. Indeed, TNets normally predict an affine transformation matrix for Cartesian coordinates. Yet, in the case of RoadSeg-Intensity and RoadSeg-Spherical, only spherical features are used. A spherical transformation could be predicted by a modified TNet. Yet, since the elevation and azimuth angles are unique for each position in the range image, such a transformation would be equivalent to an image deformation, which could complexify the further convolutions used in RoadSeg. RoadSeg-Intensity processes the elevation angle, alongside the intensity, to cope with the individual behavior of each LIDAR laser. Indeed, the intensity might be inconsistent among the LIDAR lasers. Each model was trained ten times on the training set, to account for the possible variability in the results due to the random initialization of the network. At each training session, the parameters that were kept were those that maximize the F1-score on the validation set.

Figure 10: Example of a reserved bus lane in Guyancourt that goes through a roundabout. (a) Overview of the situation in the test set; (b) and (c) camera views, where the orange mask represents the reserved bus lane; (d) the corresponding manually labelled LIDAR scan, where purple points are labelled as road, red points as obstacle, and green points as do not care, since they belong to a reserved bus lane.
We again report, in Table 4, average F1, Recall, Precision and IoU scores reached by all the models on the test set. We also report maximum and minimum scores, alongside the average execution time of each model on an NVidia TitanX GPU. As the test set is manually labelled, and considered as reliable, it is a proper way to evaluate what was learnt by the networks from the automatically labelled dataset.

Interestingly, SqueezeSegV2 without CRF seems to have the best precision on the test set. Yet, this is compensated by its poor performances in terms of recall. We can assume that this is mainly due to the additional subsampling that it uses with regards to RoadSeg: SqueezeSegV2 cannot properly process points at long range, and considers a significant amount of remote road points as obstacles.

Model | Precision (Avg / Max / Min) | Recall (Avg / Max / Min) | F1-score (Avg / Max / Min) | IoU (Avg / Max / Min) | Mean inference time
SqueezeSegV2 without CRF | | | | | ms
RoadSeg-Spherical | 0.8565 / 0.8968 / 0.8253 | 0.8776 / 0.9045 / 0.8414 | 0.8665 / 0.8830 / 0.8497 | 0.7645 / 0.7409 / |
Table 4: Comparison of the variants of RoadSeg and SqueezeSegV2 (average, maximum and minimum Precision, Recall, F1-score and IoU on the test set, and mean inference time per model).

(a) Labels from the KITTI road dataset in the image plane; (b) Labelled LIDAR scan by projection of the image labels. Points that do not intersect the field-of-view of the camera are not displayed. Green points are labelled as road, red points as not road, and grey points are unlabelled.
Figure 11: Generation of LIDAR labels from the KITTI road dataset.

We again observe that the use of the CRF degrades the performances for road detection. RoadSeg, when used without the TNet and trained on the full set of features, outperforms SqueezeSegV2, while being slightly faster. The use of the TNet does not seem relevant when training on the full set of features, as it doubles the inference time while slightly degrading the performances, except for the precision scores. Another interesting result is that RoadSeg-Spherical and both versions of RoadSeg-Cartesian outperform the networks that are trained on the full set of features. The best F1-score and IoU are even reached by a specific instance of RoadSeg-Cartesian that relies on a TNet. However, RoadSeg-Cartesian without TNet is faster, and easier to train, since it has the best F1-score and IoU on average. This network thus seems to be the best trade-off for road detection. However, the performances reached so far are still relatively low. This can be explained by the automatic labelling procedure that we used to generate the training set, as it is very sensitive to unmapped roads, localization errors and improper obstacle filtering. These results also highlight the difficulties, for RoadSeg and SqueezeSeg, in processing numerous features for a given point, as the best performing approaches only process a limited number of features. Fusing the results from several networks that process different sets of features however leads to significant improvements.
RoadSeg has been designed to allow for the generation of evidential mass functions from its outputs. A straightforward way to fuse several RoadSeg-like networks could then be to rely on an evidential fusion. We thus propose to use the model described in Equation 6 to generate evidential mass functions for each LIDAR point, from a set of neural networks. The resulting mass functions can then be fused, for each point, thanks to Dempster's rule of combination, which is described in Equation 2. Figure 12 displays examples of evidential mass functions that can be obtained from all the variants of RoadSeg that are expected to be independent. Those mass functions are directly obtained after the training, from the biases of the final Instance-Normalization layer, and without any optimization of the α vector. The plausibility transformation in Equation 7 can then be used to compute the fused probability that each point belongs to the road, and the results can then be evaluated again on our test set.

Figure 12: Examples of evidential mass values m({R}), m({¬R}) and m({R, ¬R}) obtained, from the weights optimized after the training, for RoadSeg-Intensity, RoadSeg-Spherical, RoadSeg-Cartesian with TNet and RoadSeg-Cartesian without TNet, i.e. all the variants of RoadSeg that can be fused together. Better seen in color, and by zooming in the electronic version of the article.

Given the properties of the plausibility transformation [34], fusing the evidential mass functions obtained from the different networks via Dempster's rule, and then applying the plausibility transformation on the resulting mass function, is equivalent to a Bayesian independent opinion poll fusion among the probabilities obtained from each network. This means that a classification which purely relies on a Bayesian interpretation of the probabilities obtained from the networks only depends on the sum of the values of the α vector in Equation 3, which is supposed to be equal for all the possible α vectors. As we have chosen to consider that LIDAR points are classified as belonging to the road when the probabilities predicted by the networks are higher than 0.5, we are in this case. We thus report in Table 5 the recomputed F1-score, Precision, Recall and IoU obtained after the fusion of different sets of networks. To prevent data incest, only RoadSeg-Intensity, RoadSeg-Spherical, RoadSeg-Cartesian and RoadSeg-Cartesian without TNet are considered. We also do not fuse RoadSeg-Cartesian and RoadSeg-Cartesian without TNet together, again to prevent data incest. Since the networks are not trained jointly, a set of parameters has to be chosen for each one, among the ten that are available. For a fair comparison, the parameters that were selected for each network correspond to the best ones in terms of F1-score on the validation set. Indeed, they are not necessarily the best performing ones on the test set. We thus only report one Precision, Recall, F1-score and IoU for each combination.
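As a minimal sketch of this per-point fusion, assuming that each network outputs a mass function (m({R}), m({¬R}), m(Ω)) over the binary frame, Dempster's rule and the plausibility transformation can be written as follows; the array layout and function names are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of the per-point evidential fusion described above: each
# network provides, for every LIDAR point, a mass function over the binary
# frame {R, not R}. Array layout and function names are illustrative
# assumptions.
import numpy as np

def dempster_binary(m1: np.ndarray, m2: np.ndarray) -> np.ndarray:
    """Dempster's rule for arrays of shape (n_points, 3), whose columns are
    m({R}), m({not R}) and m({R, not R})."""
    r1, n1, o1 = m1[:, 0], m1[:, 1], m1[:, 2]
    r2, n2, o2 = m2[:, 0], m2[:, 1], m2[:, 2]
    conflict = r1 * n2 + n1 * r2                 # mass assigned to the empty set
    k = 1.0 / (1.0 - conflict)                   # Dempster normalization
    m_r = k * (r1 * r2 + r1 * o2 + o1 * r2)
    m_n = k * (n1 * n2 + n1 * o2 + o1 * n2)
    return np.stack([m_r, m_n, k * o1 * o2], axis=1)

def plausibility_transform(m: np.ndarray) -> np.ndarray:
    """Road probability Pl({R}) / (Pl({R}) + Pl({not R})) for each point."""
    pl_r = m[:, 0] + m[:, 2]
    pl_n = m[:, 1] + m[:, 2]
    return pl_r / (pl_r + pl_n)

# Fusing the outputs of three networks (random placeholders here):
rng = np.random.default_rng(0)
outputs = [rng.dirichlet([1.0, 1.0, 1.0], size=5) for _ in range(3)]
fused = outputs[0]
for m in outputs[1:]:
    fused = dempster_binary(fused, m)
p_road = plausibility_transform(fused)           # fused road probability per point
```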
Fusion leads to significantly better Precision, F1-score and IoU scores, with Recall scores that are on par with the best ones obtained from single networks. Fusion then prevents false positives from happening, without significantly increasing the false-negative rate. Especially, the fusion of RoadSeg-Intensity, RoadSeg-Spherical and RoadSeg-Cartesian leads to the best results. The obtained road detection is satisfactory, as the F1-score is equal to 0.9, and the IoU is higher than 0.8. This is significantly better than the performances of each individual network, which seems to confirm that, originally, RoadSeg does not have enough parameters to properly detect the road from all the available features. This result is moreover particularly satisfactory because the training and validation labels were obtained automatically.

Table 5: Evaluation of the results for several fusion schemes among RoadSeg networks (Precision, Recall, F1-score and IoU for each fused combination of RoadSeg-Intensity, RoadSeg-Spherical, RoadSeg-Cartesian and RoadSeg-Cartesian without TNet).

The use of a TNet improves the results, but that comes at the cost of an increased inference time, as observed in Table 4. We however consider that the use of the TNet is relevant when using evidential fusion. Indeed, a risk of data incest exists between RoadSeg-Spherical and RoadSeg-Cartesian, because the Cartesian coordinates can be obtained from the spherical ones, and vice-versa. As Dempster's rule of combination is designed to fuse independent mass functions, this risk of data incest should better be limited when operating in real-life conditions. We point out that, when being trained, all the networks were initialized randomly, and trained on random batches. Although these elements seem to be enough to ensure a certain level of independence [43], we preferred to further enforce this independence by fusing networks with different inputs. Moreover, as the TNet transforms the input to the fire modules by an affine transformation, this dependence between RoadSeg-Cartesian and RoadSeg-Spherical is less likely to happen. To support this claim, we observe that just fusing RoadSeg-Spherical and RoadSeg-Cartesian leads to the second best results in terms of F1-score and IoU, while the fusion of RoadSeg-Spherical and RoadSeg-Cartesian without any TNet is significantly less performant. We also show, in Figure 13, a LIDAR scan belonging to the test set, and the point cloud obtained after transformation by the TNet used in the evidential fusion. The transformation predicted by the TNet is normally applied to a point cloud that was normalized by batch normalization but, for the sake of clarity, we apply it on the original point cloud. Indeed, the normalized point cloud is, by definition, more compact, and harder to visualize. We can see that, in this case, the TNet mainly applies rotations around the x and z axes. The global alignment of the scan is thus modified. The initial Batch-Normalization layer of RoadSeg-Spherical could also be seen as a global affine transformation applied to the scan. Yet, it is extremely unlikely that both a RoadSeg-Cartesian using a TNet and a RoadSeg-Spherical actually apply the same affine transformation to their input. The TNet can then be considered as a way to enforce independence between RoadSeg-Cartesian and RoadSeg-Spherical.

(a) Original test scan (b) Test scan transformed by the TNet
Figure 13: Effect of the TNet on a LIDAR scan. Purple points were manually labelled as road points, red ones as not belonging to the road. The two pictures correspond to the same field of view and scale.

As a reminder, RoadSeg networks are trained as SqueezeSegV2 originally was on the KITTI object dataset, and thus use a weight decay of 0.0001, which was also required by the results displayed in Equation 12. We propose to compare the different evidential mass functions that can be obtained from the fusion of RoadSeg-Cartesian, RoadSeg-Spherical and RoadSeg-Intensity, as this is our best performing model, so as to verify their behavior depending on how the α vector is obtained. Especially, we focus on the points of the test set that are misclassified by this fusion of networks. A desirable behavior would be to have uncertain mass values generated for those falsely classified points. We chose to rely on the decomposable entropy for the evidential theory, defined in [44], to quantify the uncertainty level of the mass functions obtained from the fused RoadSeg networks. This entropy is similar to Shannon's entropy, especially in the sense that an uncertain evidential mass function will lead to a high entropy. In our case, this entropy measure H on a mass function m is computed as:

H(m) = −(m({R}) + m({R, ¬R})) log(m({R}) + m({R, ¬R})) − (m({¬R}) + m({R, ¬R})) log(m({¬R}) + m({R, ¬R})) + m({R, ¬R}) log(m({R, ¬R}))    (18)

We report in Table 6 the mean entropy on the misclassified points, from evidential mass functions obtained by solving the minimization problem described in Equation 8 on several sets. We compare those values with their equivalent obtained without post-processing of the weights. The considered sets for the post-processing were the training set, the training and validation sets, a collection of 2221 unlabelled random LIDAR scans that were acquired at the same locations as the training and validation sets, and a collection of 695 unlabelled scans acquired in Guyancourt alongside the test set. The latter sets correspond to a tenth of the whole sequences that were recorded to make the training, validation and test sets. To ensure variety among the scans, the difference between the timestamps of those unlabelled scans is at least of one second. Those sets are denoted respectively as additional train and additional test.

Table 6: Comparison of the mean evidential entropies that were generated, for misclassified points, by RoadSeg-Intensity, RoadSeg-Spherical and RoadSeg-Cartesian, depending on the set(s) used for post-processing (none, train, validation, additional train, additional test, and combinations thereof). The mean entropy is 0.3537 without post-processing, and lies between 0.3663 and 0.3676 for the different post-processing sets.

The lowest mean entropy on the misclassified points corresponds to the vanilla results, when no post-processing is used on the weights of the networks. This is thus the most over-confident case. However, all the values of mean entropy are extremely close to each other. Interestingly, the maximum entropy is achieved when only additional train is used for post-processing. The use of data similar to the test set does not seem particularly useful and, counterintuitively, the use of bigger sets did not necessarily lead to more cautious evidential mass functions. As a conclusion, evidential mass functions can be generated using only the training and validation sets, and potentially additional similar and unlabelled data. The use of Instance Normalization and weight decay is confirmed as a way to obtain near-optimal evidential mass functions during the training, as the differences in terms of entropy are barely visible.
It could even be considered that no post-processing should be used, since the dataset on which it should be computed is not clear, but this should be confirmed by further experimental and theoretical analyses.
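For reference, Equation 18 can be evaluated directly from the three mass values; the sketch below assumes the natural logarithm and guards against zero masses, which are implementation choices rather than part of the original definition.

```python
# Direct transcription of Equation 18 for the binary frame. The natural
# logarithm and the epsilon guard against zero masses are assumptions of
# this sketch.
import numpy as np

def decomposable_entropy(m_r: float, m_n: float, m_o: float) -> float:
    """Entropy of a mass function with m({R}) = m_r, m({not R}) = m_n,
    m({R, not R}) = m_o."""
    eps = 1e-12
    pl_r = m_r + m_o            # Pl({R})
    pl_n = m_n + m_o            # Pl({not R})
    return (-pl_r * np.log(pl_r + eps)
            - pl_n * np.log(pl_n + eps)
            + m_o * np.log(m_o + eps))

print(decomposable_entropy(0.5, 0.5, 0.0))    # ~0.693 (= ln 2), maximally conflicting
print(decomposable_entropy(0.9, 0.05, 0.05))  # confident road point: much lower
```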
We now present some segmentation results from the test set, that were generated from the fusion of RoadSeg-Cartesian, RoadSeg-Spherical and RoadSeg-Intensity.
We first show the scan for which the fused networks reach the best results, in terms of per-scan F1-score. In this situation, the fused networks achieved an F1-score of 0.9569, and an IoU of 0.9173. As displayed in Figure 14, this is a very simple situation with a straight road, and no other vehicles, although a small part of the scan is missing. The fact that the system can handle such simple situations is reassuring. We also point out that the reserved bus lane, labelled in green, was fully classified as road. Following the policy we previously exposed, the points corresponding to the bus lane were not considered in the evaluation.
We show, in Figure 15, the scan for which the fused networks achieve the worst result in terms of F1-score and IoU. The F1-score for this scan is equal to 0.8971, and the IoU is equal to 0.8133. The errors are mainly localized on the left side, probably because the central median partially occluded this area. The results on the ego-lane are still satisfactory.
We present a result at a crossroad. Most of the actual road surface is classified as road. However, remote, narrow roads on the top-right and bottom-left are undetected. This is because those roads are hard to distinguish when projected into the range images processed by the RoadSeg networks, due to their narrowness. The network achieves an F1-score of 0.9271 and an IoU of 0.8641 on this scan.

(a) Labels (b) Predicted road probabilities
Figure 14: Scan for which the fused networks achieve the best F1-score. On the left side, purple points were manually labelled as road; red ones as obstacles; green ones as do not care, as they correspond to a reserved bus lane. On the right, the purpler a point is, the higher its probability of being a road point, according to the fused networks.
(a) Labels (b) Predicted road probabilities
Figure 15: Scan for which the fused networks achieve the worst F1-score.
To counterbalance the results of the previous use case, we present results obtained at a junction. Again, most of the road surface is properly detected. However, the networks consider that the entrance to the road on the right is narrower than it actually is. This is because of our label generation procedure, which relied on a map. Indeed, many of the junctions in those maps were mapped similarly, which led to under-estimated entrance widths, as they were considered to be equal to the roads' length. The fused networks still achieve an F1-score of 0.9323 and an IoU of 0.8732 on this scan.
We conclude with a roundabout. The results are very satisfying, as most of the actual road surface is properly detected. The remote vehicle is also properly considered as an obstacle. The central median in front of the vehicle is however partially considered as belonging to the road. The fused networks achieve an F1-score of 0.9010 and an IoU of 0.8199 on this scan.

(a) Labels (b) Predicted road probabilities
Figure 16: Example at a crossroad (a) Labels (b) Predicted road probabilities
Figure 17: Example at a junction (a) Labels (b) Predicted road probabilities
Figure 18: Example at a roundabout

The fusion of RoadSeg-Cartesian, RoadSeg-Spherical and RoadSeg-Intensity leads to very promising results, especially on straight roads and roundabouts. Junctions and crossroads can also be processed with a certain efficiency, although the network is limited by what was present in the training set. Indeed, very few crossroads were automatically labelled, and the approximate representation of the junctions in the maps that we used influences the final results of the network. Hopefully, the fact that these results have been achieved with a relatively small training set, that was automatically labelled, indicates that there is probably room for improvement. Additional and finer training data would certainly correct the behavior of the fused networks.
From the satisfactory results obtained by fusing several RoadSeg networks, and the evidential mass functions that can be generated from their outputs, a usable representation of the road for a navigation algorithm can be created, by fusing consecutive road detection results. We chose to rely on an evidential grid mapping framework. A naive approach would be to extend the evidential grid mapping algorithm depicted in [6], by replacing the original geometrical model with the evidential mass functions generated from RoadSeg. Such an approach however cannot handle moving objects properly, as it only relies on the fusion of consecutive observations. Inspired by the work in [19], in which it is observed that the conflict induced by the fusion of evidential grids can correspond to moving objects, we propose an evidential road mapping algorithm, to generate both a grid depicting the actual road surface, and a list of moving road obstacles. We consider that the area below a moving road obstacle should be considered as road, and that the road grid should only depict the reality of the road, independently of the presence of obstacles. Figure 19 depicts the whole algorithm, and Figure 20 presents a possible output of the system. In the next sections, we present its different steps in detail.
As an evidential fusion of RoadSeg-Intensity, RoadSeg-Spherical and RoadSeg-Cartesian leads to the best classification performances, the algorithm processes the segmentation results obtained after the evidential fusion of the three networks. The generation of evidential mass functions can be performed either from the original weights, that are directly obtained after the training of the networks, or from post-processed weights. We arbitrarily chose to make the road grid correspond to the xy-plane, in the reference coordinate system used by the LIDAR. This plane is split into equally sized grid cells, which cover a pre-defined area around the sensor. The state of each cell of index i can be represented by three evidential mass values m_i({R}) (road), m_i({¬R}) (not road) and m_i({R, ¬R}) (unknown). Similarly to what is done at the LIDAR point level, those evidential mass values respectively quantify the evidence towards the fact that the i-th cell belongs to the road, does not belong to the road, or is in an unknown state. A straightforward way to compute m_i({R}), m_i({¬R}) and m_i({R, ¬R}) is to project, into the xy-plane, all the LIDAR points, and the evidential mass values that are obtained after the fusion of the RoadSeg networks. As previously stated, the RoadSeg networks process scans in which the motion of the vehicle was not compensated. As such, this motion has to be compensated, in the segmented point cloud that we obtain from the RoadSeg networks, before projecting the points into the grid map. We thus reuse the same approach and assumptions as for the soft-labelling procedure, which we described in Section 5.1, except that we use an identity rotation/translation matrix, because the grid map and the LIDAR sensor share the same coordinate system. Then, m_i({R}), m_i({¬R}) and m_i({R, ¬R}) can be obtained by fusing the mass values of the points projected into the grid cell i, thanks to Dempster's rule of combination. To reduce the computational complexity of this projection and fusion step, each grid cell is processed in parallel. For the sake of clarity, we drop the cell index i. The number of points projected into each grid cell is unknown, and varies over time and for each cell of the grid. To solve this issue, we rely on the rewriting of the Dempster-Shafer operator in terms of commonality functions [45].

Figure 19: General evidential road mapping and road object detection algorithm.

Figure 20: Results obtained from the road mapping and object detection pipeline. The LIDAR scan classified by the three RoadSeg networks is visible. Below, a greyscale RoadGrid represents the belief for m({R}) in each cell. The clustered objects (mainly vehicles on the road) are colored according to their cluster id.

For Ω = {R, ¬R} our binary frame of discernment, a commonality value Q(A) can be computed from a mass function m, for each A ⊆ Ω, as follows:

Q(A) = Σ_{B ⊇ A} m(B)    (19)

The evidential mass function m can be recovered from the commonality values, as follows:

m(A) = Σ_{B ⊇ A} (−1)^{|B|−|A|} Q(B)    (20)

Commonality functions can be used to fuse n evidential mass functions into a fused mass function m_res as follows:

1. Compute Q_1, ..., Q_n from the n mass functions, using Equation 19.
2. For each A ⊆ Ω, compute Q_res(A) = exp(Σ_{j=1}^{n} ln(Q_j(A))).
3. Compute m*_res, the unnormalized version of m_res, from Q_res, using Equation 20.
4. Normalize m_res as follows: ∀A ⊆ Ω, A ≠ ∅, m_res(A) = K m*_res(A), with K = 1/(1 − m*_res(∅)), and m_res(∅) = 0.
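As a quick sanity check of step 2 (this derivation is not in the original text), one can verify, for the binary frame, that multiplying commonalities is equivalent to the unnormalized conjunctive combination of two mass functions; the same argument holds for {¬R}, and extends to n mass functions by induction:

\begin{align*}
Q_k(\{R\}) &= m_k(\{R\}) + m_k(\Omega), \qquad Q_k(\Omega) = m_k(\Omega), \qquad k \in \{1, 2\},\\
m^*_{12}(\Omega) &= m_1(\Omega)\, m_2(\Omega),\\
m^*_{12}(\{R\}) &= m_1(\{R\})\, m_2(\{R\}) + m_1(\{R\})\, m_2(\Omega) + m_1(\Omega)\, m_2(\{R\}),\\
Q^*_{12}(\{R\}) &= m^*_{12}(\{R\}) + m^*_{12}(\Omega) = \big(m_1(\{R\}) + m_1(\Omega)\big)\big(m_2(\{R\}) + m_2(\Omega)\big) = Q_1(\{R\})\, Q_2(\{R\}).
\end{align*}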
This procedure is equivalent to consecutively applying Dempster's rule of combination among the n evidential mass functions. However, this formulation enables the projection and fusion operations to be reinterpreted as an operation on a 2D histogram. The log-commonalities associated to each point can trivially be computed in parallel, after the fusion of the results generated by the three RoadSeg networks. If n corresponds to the number of points that are projected into a grid cell, and whose evidential mass functions have to be fused, Σ_{j=1}^{n} ln(Q_j(A)) can be computed by histogramming the x and y coordinates of each point, and by weighting the samples by the corresponding log-commonalities. The evidential mass values associated to each cell can then be recovered. We call the resulting evidential grid, which was only generated from a single scan, a ScanGrid.
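A minimal sketch of this projection-and-fusion step is given below, assuming that each LIDAR point already carries a fused mass function (m({R}), m({¬R}), m(Ω)); the grid extent, resolution and function names are illustrative placeholders rather than the actual implementation.

```python
# Sketch of the per-cell fusion described above. Grid extent, resolution and
# variable names are illustrative assumptions.
import numpy as np

def scan_grid(points_xy, masses, extent=40.0, cell=0.5, eps=1e-12):
    """Project point-level mass functions into a 2D grid by summing
    log-commonalities per cell (Equations 19 and 20)."""
    n_cells = int(2 * extent / cell)
    bins = np.linspace(-extent, extent, n_cells + 1)

    # Commonalities of each point: Q({R}) = m({R}) + m(Omega), etc.
    q_r = masses[:, 0] + masses[:, 2]
    q_n = masses[:, 1] + masses[:, 2]
    q_o = masses[:, 2]

    def cell_sum(weights):
        h, _, _ = np.histogram2d(points_xy[:, 0], points_xy[:, 1],
                                 bins=[bins, bins], weights=weights)
        return h

    counts = cell_sum(np.ones(len(masses)))
    # Per-cell sums of log-commonalities, mapped back to fused commonalities.
    Q_r = np.exp(cell_sum(np.log(q_r + eps)))
    Q_n = np.exp(cell_sum(np.log(q_n + eps)))
    Q_o = np.exp(cell_sum(np.log(q_o + eps)))

    # Moebius inversion (Equation 20) for the binary frame.
    m_o = Q_o
    m_r = Q_r - Q_o
    m_n = Q_n - Q_o
    m_empty = 1.0 - m_r - m_n - m_o          # conflict of the unnormalized result
    k = 1.0 / (1.0 - m_empty + eps)          # Dempster normalization
    grid = np.stack([k * m_r, k * m_n, k * m_o], axis=-1)
    grid[counts == 0] = (0.0, 0.0, 1.0)      # unobserved cells stay unknown
    return grid
```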
In order to generate a dense representation of the road, ScanGrids have to be fused over time. Let a RoadGrid be an evidential grid that has been obtained by accumulating several previous ScanGrids. A RoadGrid is supposed to only represent the road surface, without considering objects that might stand on the road. The latest ScanGrid is noted ScanGrid(t_i), and the latest RoadGrid available is noted RoadGrid(t_{i−1}). Let m_{t_i} be the evidential mass function that corresponds to a given cell of ScanGrid(t_i), and m_{t_{i−1}} the evidential mass function of the cell of RoadGrid(t_{i−1}) that is at the same position. A naive way to accumulate ScanGrids would be to use, again, the Dempster-Shafer operator to fuse all the m_{t_i} with the corresponding m_{t_{i−1}}. However, this could lead to a catastrophic accumulation of objects over time, that would affect the estimated road surface, without corresponding to actual objects. This case is depicted in Figure 21. The current LIDAR scan is depicted in green and red. Green points are classified as road points, and red points as obstacles (not road). Under the scan, an evidential grid corresponding to the simple accumulation of ScanGrids is depicted. White cells are classified as road cells (m({R}) > 0.5), black cells as obstacle cells (m({¬R}) > 0.5), and grey cells as unknown cells (m({R, ¬R}) > 0.5). To cope with this accumulation of objects, we introduce two additional frames of discernment, Ω_obs = {O, ¬O} and Ω_displaced = {D, ¬D}. The first one models the presence of road obstacles that do not belong to the road surface, as the O proposition. The second one models the fact that a previously present road obstacle is no longer present, as the D proposition. This case typically corresponds to a vehicle that was static on the road, while the RoadGrid was being generated, and that started moving.

Ω_obs can be used to detect which cells of ScanGrid(t_i) should not be fused with RoadGrid(t_{i−1}). The evidential mass functions in this frame of discernment can be computed, for each cell, from the conflict between the m_{t_{i−1}} and m_{t_i} mass functions.

Figure 21: Example of RoadGrid obtained by accumulating ScanGrids without considering the objects that stand on the road. White cells have an m({R}) value higher than 0.5, black cells have an m({¬R}) value higher than 0.5, grey cells have an m({R, ¬R}) value higher than 0.5.

Indeed, a high value for both m_{t_i}({¬R}) and m_{t_{i−1}}({R}) can indicate that a moving road obstacle is currently observed in the corresponding ScanGrid(t_i) cell, which thus should not be fused with RoadGrid(t_{i−1}). However, it could also indicate that the neural networks have trouble detecting a given road, meaning that the corresponding cells should instead be fused. Let m_obs be the evidential mass function associated to a cell of ScanGrid(t_i) under the Ω_obs frame of discernment. We propose to compute m_obs as follows:

m_obs({O}) = α(Z) m_{t_{i−1}}({R}) m_{t_i}({¬R})    (21a)
m_obs({¬O}) = 0    (21b)
m_obs({O, ¬O}) = 1 − m_obs({O})    (21c)

This formulation supposes that only m_{t_{i−1}}({R}) and m_{t_i}({¬R}) can indicate the presence of a road obstacle. The α function computes a discounting factor, which depends on the mean z coordinate of the points that have been projected, while creating ScanGrid(t_i), into the considered grid cell. This mean elevation is noted Z. As the LIDAR used by the ZoeCab systems is put on the roof of the vehicles, Z is typically negative when only ground points have been projected into a grid cell. We define α as follows:

α(z) = min(exp(ν(z + ξ)), 1)    (22)
This function only generates discounting factors in the ]0, 1] range. The ξ parameter indicates the absolute value of the height from which the conflict does not have to be discounted. The ν factor monitors the growth of α(z). Similarly, we can define m_displaced as the evidential mass function associated to a cell of ScanGrid(t_i) under the Ω_displaced frame of discernment. We propose to compute m_displaced as follows:

m_displaced({D}) = (1 − α(Z)) m_{t_i}({R}) m_{t_{i−1}}({¬R})    (23a)
m_displaced({¬D}) = 0    (23b)
m_displaced({D, ¬D}) = 1 − m_displaced({D})    (23c)

From m_displaced, the cells that should not be considered anymore as occupied in RoadGrid(t_{i−1}) can easily be detected. The cells in RoadGrid(t_{i−1}) for which m_displaced({D}) is higher than 0.5 are reinitialized to a fully unknown state: m_{t_{i−1}}({R}) = 0, m_{t_{i−1}}({¬R}) = 0, m_{t_{i−1}}({R, ¬R}) = 1. Similarly, the m_obs mass can be used to detect grid cells that belong, in a ScanGrid, to a road obstacle, and should not be fused with RoadGrid(t_{i−1}). A binary ObsMap(t_i) map is created from ScanGrid(t_i) and the m_obs values. This binary map represents, for each cell, the presence of a road obstacle. A binary cell is set to 1 if the corresponding cell in ScanGrid(t_i) has an m_obs({O}) value higher than 0.5. Otherwise, it is set to 0.

(a) m({R}) in RoadGrid(t_{i−1}) (b) m({¬R}) in ScanGrid(t_i) (c) Maximum filtered ObsGrid(t_i) (d) ClusterMap(t_i)
Figure 22: Grids used for clustering and road object detection. The evidential mass values were generated from the weights obtained after the training, without any post-processing.

ObsMap(t_i) can be used to generate a list of detected road obstacles. First of all, a 5 × 5 maximum filter is applied to ObsMap(t_i), to inflate the detected objects. This pessimistic behavior is justified by the need to take into account the fact that the LIDAR points at the edges of those objects, when having been projected into the grid cells, might have been projected into cells where road points were also present. The α function might then be affected, and return an under-confident discounting factor. This maximum filtering is also used to connect grid cells that belong to the same physical obstacle, which might otherwise not be the case because of the sparsity of the LIDAR scans. Finally, the resulting ObsGrid(t_i) is converted into a grid of cluster ids, noted ClusterMap(t_i), by connected component labelling, with an 8-connectivity. Each cell of ClusterMap(t_i) indicates the id of the cluster to which the cell belongs, or 0 if the cell does not correspond to a clustered object. This ClusterMap(t_i) can be seen as a list of localized road obstacles. Afterwards, the cells of ScanGrid(t_i) for which a cluster id has been returned are also reinitialized to a fully unknown state: m_{t_i}({R}) = 0, m_{t_i}({¬R}) = 0, m_{t_i}({R, ¬R}) = 1. Each grid used in this step is presented in Figure 22. As potential road objects and displaced objects have been removed, ScanGrid(t_i) and RoadGrid(t_{i−1}) can trivially be fused, by simply using the Dempster-Shafer operator on m_{t_i} and m_{t_{i−1}} for each grid cell. The resulting RoadGrid(t_i) is then available for a navigation system, and can be fused with new incoming LIDAR scans. However, when a new ScanGrid is generated, the displacement of the vehicle over time has to be considered, before fusing it with the RoadGrid. An odometry model is thus needed to reproject the RoadGrid.
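The following sketch summarizes this conflict-analysis and clustering step, assuming grids stored as (H, W, 3) arrays of (m({R}), m({¬R}), m(Ω)) and using scipy for the maximum filter and the connected-component labelling; variable names and the in-place updates are illustrative choices.

```python
# Sketch of the conflict-analysis step (Equations 21 to 23) and of the
# clustering described above. nu, xi and the 5x5 filter size follow the text;
# the rest of the implementation details are assumptions.
import numpy as np
from scipy.ndimage import maximum_filter, label

def alpha(mean_z, nu=4.0, xi=1.5):
    """Discounting factor of Equation 22."""
    return np.minimum(np.exp(nu * (mean_z + xi)), 1.0)

def conflict_analysis(road_grid, scan_grid, mean_z):
    """Detect road obstacles and displaced objects, and prepare both grids
    for fusion. mean_z holds the mean elevation of the points in each cell."""
    a = alpha(mean_z)
    m_obs = a * road_grid[..., 0] * scan_grid[..., 1]                # Eq. 21a
    m_displaced = (1.0 - a) * scan_grid[..., 0] * road_grid[..., 1]  # Eq. 23a

    # Cells of the RoadGrid where an object has left are reset to unknown.
    road_grid[m_displaced > 0.5] = (0.0, 0.0, 1.0)

    # Binary obstacle map, inflated by a 5x5 maximum filter, then clustered
    # with 8-connectivity.
    obs_map = (m_obs > 0.5).astype(np.uint8)
    obs_grid = maximum_filter(obs_map, size=5)
    cluster_map, n_clusters = label(obs_grid, structure=np.ones((3, 3)))

    # Clustered ScanGrid cells are reset to unknown before fusion.
    scan_grid[cluster_map > 0] = (0.0, 0.0, 1.0)
    return road_grid, scan_grid, cluster_map, n_clusters
```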
A CAN odometry model can be used to track the movement of the acquisition platform, and to reproject the RoadGrid when a new ScanGrid becomes available. Cells of the RoadGrid that are not projected into the area covered by the new ScanGrid are dropped. New cells that cover previously unobserved areas are initialized to a fully unknown state, with a mass value of 1 for {R, ¬R}.

The algorithm was implemented as a Python ROS node. The inference and evidential fusion of the neural networks are done via the PyTorch framework, and the operations on the grid are performed thanks to the Numpy, OpenCV and Scipy libraries. The TitanX GPU that was used for the training is reused for the inference of the neural networks, but all the grid operations are done on an Intel i7-6700K octacore CPU. An 80 m × m grid is computed around the vehicle, with a cell size of 0. m. Only points that have a Z coordinate in the [-2.5, 0] range are considered. The odometry is evaluated from an Extended Kalman filter relying on a classical Constant Turn Rate and Velocity (CTRV) model. The CAN network provides the system with speed and heading direction measurements at 10 Hz, and a yaw rate measurement at 100 Hz. The CTRV model normally also fuses position measurements obtained from a GNSS sensor, but we chose to rely on a pure CAN odometry, so as to be agnostic to the localization system that is in use. The ν and ξ values in the α function are empirically set to 4 and 1.5. Similarly to what was done for the training, validation and test sets, the LIDAR scans are obtained from a VLP32C running at 10 Hz. We report in Figure 24 the temporal behavior of the algorithm over a 12-minute recording session in Guyancourt. The measured runtimes cover the unpacking of the LIDAR scans, the inference of the neural networks, their fusion, and all the steps of the grid-level mapping and detection algorithm. Our current implementation manages to match, on this recording session, the publication rate of the LIDAR, as the runtime is always below 100 ms. Yet, the processing time is sometimes very close to 100 ms, due to significant jitter. We thus cannot guarantee that the LIDAR scans will always be processed at 10 Hz with the current implementation. However, the fact that most of the current implementation relies on standard functions, without extensive use of the GPU, indicates that the performances could be improved by a dedicated, pure GPU implementation of the functions used in this road mapping and object detection algorithm. We report an additional example of the outputs that are available from the algorithm in Figure 23. This example highlights one limitation of using conflict analysis for object detection: false positives tend to happen at road edges. This can be explained by the fact that road edges are ambiguous by nature, especially because the system was trained on coarse labels. Errors while estimating the odometry from CAN readings, due to sensor noise, can also lead to false positives. A video depicting the whole sequence is available online: https://datasets.hds.utc.fr/tmp/automatic-and-manual-lidar-road-labels/road-mapping-demo/
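Going back to the reprojection step described above, a possible implementation is sketched below, assuming a planar displacement (dx, dy, dyaw) between two scans and a RoadGrid stored as an (H, W, 3) array; the axis conventions, the use of cv2.warpAffine and the nearest-neighbour interpolation are illustrative assumptions, not the actual implementation.

```python
# Sketch of the RoadGrid reprojection: warp the previous grid into the new
# sensor frame, and fill uncovered cells with the unknown state. Axis
# conventions and interpolation choices are assumptions of this sketch.
import numpy as np
import cv2

def reproject_road_grid(road_grid, dx, dy, dyaw, cell_size):
    """Warp the previous RoadGrid by the estimated vehicle displacement;
    cells that fall outside the previous grid become fully unknown."""
    h, w, _ = road_grid.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0          # vehicle at the grid center

    # Rigid transform in pixel coordinates: rotation around the center,
    # then translation by the displacement expressed in cells.
    rot = cv2.getRotationMatrix2D((cx, cy), np.degrees(dyaw), 1.0)
    rot[0, 2] -= dx / cell_size
    rot[1, 2] -= dy / cell_size

    # Nearest-neighbour interpolation keeps valid mass triplets; uncovered
    # cells are filled with the unknown state (0, 0, 1).
    return cv2.warpAffine(road_grid.astype(np.float32), rot, (w, h),
                          flags=cv2.INTER_NEAREST,
                          borderMode=cv2.BORDER_CONSTANT,
                          borderValue=(0.0, 0.0, 1.0))
```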
By using the HD Maps from which the training and validation sets were labelled, a ground truth to evaluate the quality of the road grids can be obtained. An empty grid around the vehicle, which follows the same dimensions as the grids obtained from our algorithm, is first created. The center of each cell is then projected into the corresponding map of the environment. Then, if the center of a cell is projected into a mapped road, the cell is considered as fully belonging to the road; if not, it is supposed not to belong to the road. We assume a perfect localization system; otherwise, the resulting grids would be ambiguous, and could not be considered as a ground truth anymore. Figure 26 shows an example of such a ground-truth grid. The grid mapping algorithm can then be evaluated over a driving sequence, provided that the localization remains accurate enough, and that all the roads are properly and unambiguously mapped. The Guyancourt area, where the test set was recorded, was thus not suitable, because of the reserved bus lanes that are present over the area. We chose to evaluate the road mapping algorithm over a driving sequence recorded in the Rambouillet area. In order to make this evaluation fair, the recording was done on roads that are perfectly mapped, and were not part of the training dataset. We selected a peri-urban section where the roads were bordered by fields and small buildings, so as to ensure an accurate GNSS positioning, and reliable RTK corrections, over the whole sequence. The driving area is depicted in Figure 25. The LIDAR grids are generated from the same LIDAR and vehicle as the test set. The vehicle was driving on open roads, and traffic was thus present during the recording.

(a) m({R}) for each point (b) m({¬R}) for each point (c) m({R, ¬R}) for each point (d) m({R}) in RoadGrid(t_i) (e) m({¬R}) in RoadGrid(t_i) (f) m({R, ¬R}) in RoadGrid(t_i) (g) ClusterMap(t_i) and (d)
Figure 23: Outputs from the road mapping and object detection algorithm. The evidential mass values were generated from the weights obtained after the training, without any post-processing.

Figure 24: Runtime of the road mapping and object detection algorithm relying on the fused RoadSeg networks.

Figure 25: Area of evaluation of the road mapping algorithm. The red line indicates the path of the vehicle over the driving sequence used for evaluation. The arrow indicates the north direction.

To evaluate the quality of our evidential road grids, we propose to rely on three metrics that are traditionally used to compare occupancy grids with simulated ground truth, as in [46]: the cross correlation coefficient [47], the Map-Score [48], and the Overall Error [49]. At each update of the road grid, and at each new LIDAR scan, those metrics can be computed with regards to the ground truth grid obtained from the HD map. Those three metrics are complementary, since the cross correlation coefficient globally compares the statistics of the ground truth and estimated grids, the Overall Error estimates the error on the evidential mass values at the cell level, and the Map-Score compares how well the probabilities estimated at the cell level match the ground truth. We however adapt those metrics to our system, as they were originally used to assess the performances of fully autonomous robotic systems that relied on path planning and exploration algorithms. That is the reason why we only compute the Map-Score, the Overall Error, and the cross correlation coefficient from grid cells in which at least one LIDAR point has been projected.
The Map-Score and Overall Error are also normalized with regards to the number of grid cells that were used to compute them. Let i be the frame index, N_c the number of cells in the ground truth and estimated road grids, 1_R(j) a binary indicator stating that the j-th grid cell of the ground truth grid belongs to the road, 1_L(j) a binary indicator stating that at least one LIDAR point was projected into the j-th cell of our road grid, m_j(R) the estimated mass value on R evaluated on the j-th cell by our algorithm, and Pl_P_{m_j}(R) the probability that the j-th grid cell belongs to the road, computed from the corresponding evidential mass values and the plausibility transformation.

Figure 26: Generation of a ground-truth grid from HD Maps. The yellow lines represent the map skeleton; black cells are considered as belonging to the road, and grey ones as not belonging to the road.

We compute the Map-Score as follows:

MapScore_i = ( Σ_{j=1}^{N_c} 1_L(j) [1 + log(Pl_P_{m_j}(R))] 1_R(j) ) / ( Σ_{j=1}^{N_c} 1_L(j) ) + ( Σ_{j=1}^{N_c} 1_L(j) (1 − Pl_P_{m_j}(R)) (1 − 1_R(j)) ) / ( Σ_{j=1}^{N_c} 1_L(j) )    (24)

We compute an Overall Error on m_j(R) as follows:

OverallError_i = ( Σ_{j=1}^{N_c} 1_L(j) |m_j(R) − 1_R(j)| ) / ( Σ_{j=1}^{N_c} 1_L(j) )    (25)

Finally, the cross-correlation coefficient is estimated as follows:

CrossCorrelation_i = ( mean(Pl_P_{m_j}(R) · 1_R(j)) − mean(Pl_P_{m_j}(R)) · mean(1_R(j)) ) / ( σ(Pl_P_{m_j}(R)) · σ(1_R(j)) )    (26)

where the mean values and standard deviations are only computed from cells in which at least one LIDAR point has been projected.
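The masked Overall Error and cross-correlation coefficient of Equations 25 and 26 can be sketched as follows; the Map-Score of Equation 24 can be computed with the same masking. Variable names are illustrative assumptions.

```python
# Sketch of the masked metrics of Equations 25 and 26; only cells that
# received at least one LIDAR point are considered.
import numpy as np

def overall_error(m_road, gt_road, observed):
    """Mean absolute difference between m({R}) and the ground truth,
    over observed cells (Equation 25)."""
    mask = observed.astype(bool)
    return np.abs(m_road[mask] - gt_road[mask]).mean()

def cross_correlation(p_road, gt_road, observed):
    """Cross-correlation between the road probabilities obtained from the
    plausibility transformation and the ground truth (Equation 26)."""
    mask = observed.astype(bool)
    p, g = p_road[mask], gt_road[mask]
    cov = (p * g).mean() - p.mean() * g.mean()
    return cov / (p.std() * g.std() + 1e-12)
```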
Those metrics do not directly indicate whether a road grid can be used by a robotic system. They only indicate how well the estimated grids match the ground truth grids. However, they can be used to compare several grid mapping algorithms, as approaches that match the ground truth grids have high cross correlation coefficients and Map-Scores, and low Overall Error rates. We thus propose to compare the evidential road grids obtained from RoadSeg networks using weights that were not post-processed, and grids obtained from RoadSeg networks using post-processed weights. We especially compare the results without post-processing of the weights with the results from weights that were post-processed on the training set, as originally proposed in [17]. Doing so, we evaluate the interest of this post-processing, with regards to the use of RoadSeg weights directly obtained after the training.

Figure 27 depicts the evolution of these metrics over the driving sequence used for evaluation. For the three metrics, both curves, obtained from the post-processed and the original weights, follow the same evolution. None of the two approaches consistently outperforms the other over the whole sequence. The local minima of the curves depicting the performances of the weights that were not post-processed, in terms of Map-Score and cross correlation, are significantly lower than their counterparts in the curves depicting the performances of the post-processed weights. The local maxima, however, are not necessarily lower for the post-processed weights, which indicates that the post-processed weights have led to more cautious results in instances where the road grids were wrong. For the Overall Error rates, the local maxima reached by the original weights are often significantly larger than their counterparts corresponding to the post-processed weights. The fact that the post-processed weights are more cautious, especially where classification errors have happened, is thus clearly visible. However, the original weights still perform relatively well, when compared to their post-processed counterparts. Indeed, in 37.5% of the frames, the original weights lead to road maps that have strictly higher Map-Scores than the post-processed weights. In 40.7% of the frames, the original weights lead to lower Overall Error rates than the post-processed weights. Finally, in 53.3% of the frames, the grids obtained from the original weights even have strictly higher cross correlation coefficients than the grids obtained from the post-processed weights.

Figure 27: Map-Score, Overall Error and cross correlation of our road grids, over a driving session.

The fact that the post-processed weights seem to lead to more cautious grids when analyzed at the cell level, via Map-Scores and Overall Error rates, while the impact on the global cross correlation coefficient is not as clearly visible, indicates that, overall, the grids obtained from the post-processed weights and from the original weights are very similar. The differences seem to mainly come from individual cells that might happen to be misclassified, and more uncertain when using post-processed weights, without significantly impacting the overall grid. This is compatible with our previous observations on misclassified points, for which the entropy was indeed higher when using post-processed weights, but still close to its counterpart calculated from the original weights.

We presented a system that relies on the evidential fusion of three neural networks to detect the road in LIDAR scans. From a training set that was automatically labelled thanks to an HD Map, we achieve satisfactory results on a manually labelled test set. The evidential framework thus seems to be an efficient way to fuse neural networks, provided that their respective inputs are independent. We also presented an algorithm that uses this road detection system to map the road surface over time, and to cluster road objects. A simple CPU/GPU implementation of this algorithm is able to process LIDAR scans at approximately 10 Hz, which fits the usual publication rate of state-of-the-art LIDAR sensors. Additional training data, which is easy to obtain from our automatic label generation procedure provided that accurate maps are available, is likely to lead to even better results. A refinement of the training procedure, to cope with the label noise in the automatic labels, is also a possible research direction. Moreover, a more accurate compensation mechanism, to handle the movement of the vehicle during the scanning process, would be beneficial. However, such a mechanism would have to be able to cope with sensor noise and timing errors, and to produce uncertainty estimates on the individual coordinates of each LIDAR point, while constructing a coherent scan during the sweeping process. This uncertainty information could even be directly used in a grid mapping algorithm, to further improve the representation of the environment. To the best of our knowledge, such an approach has not been proposed yet. Another interesting research direction, to improve our road mapping algorithm, would be to force the networks to generate even more cautious evidential mass values for misclassified points, as the current post-processing procedure proposed in [17] does not lead to road grids that are significantly better than the grids obtained from the weights directly obtained after the training.
Multi-class classification would also be valuable, especially for road object detection, but this would come at the cost of either semantically enhanced maps, or intensive manual labelling. This would however be very useful to detect more objects, as conflict analysis only allows us to detect objects on the road, but not on the sidewalks for instance. Huge LIDAR datasets for semantic segmentation are emerging [33], but the burden of manual labelling is still a reality. Furthermore, LIDAR scans produced by different sensors can be very different, in terms of granularity and resolution. This makes the use of external data much more complex than for image segmentation. Finally, even if we manage to detect the road from a neural network, without relying on an explicit model, we still rely on hyper-parameters for conflict analysis and object detection. We thus produce false positives at the road edges. A neural network, trained to detect objects, could potentially replace this conflict analysis step. Nevertheless, a mechanism to ensure the absence of false negatives would then have to be implemented. We instead believe that knowledge-based approaches could properly handle false positives. For instance, it has been proposed to ensure the consistency of the perception information via rule-based systems, so as to build a coherent World Model from pre-existing knowledge [50]. Adequate rules could potentially be used to cope with the false positives, depending on their dimensions and location for instance. For example, flat objects that are localized at road borders could easily be considered as false positives.
Acknowledgments
This work is supported by a CIFRE fellowship from Renault S.A.S, and was realized within the SIVALab joint laboratory between Renault S.A.S and Heudiasyc (UMR 7253 UTC/CNRS). We also benefited from funding from the Equipex ROBOTEX (ANR-10-EQPX-44-01). We thank Nicolas Caddart who, during an internship at Renault S.A.S carried out to complete his engineering cursus at the ENSTA school of engineering, participated in the labelling of the test dataset and in the software developments.

References
[1] S. Lacroix, A. Mallet, D. Bonnafous, G. Bauzil, S. Fleury, M. Herrb, and R. Chatila, "Autonomous rover navigation on unknown terrains: Functions and integration," The International Journal of Robotics Research, vol. 21, no. 10-11, pp. 917–942, 2002.
[2] F. W. Rauskolb, K. Berger, C. Lipski, M. Magnor, K. Cornelsen, J. Effertz, T. Form, F. Graefe, S. Ohl, W. Schumacher, et al., "Caroline: An autonomously driving vehicle for urban environments," Journal of Field Robotics, vol. 25, no. 9, pp. 674–724, 2008.
[3] G. Tanzmeister and D. Wollherr, "Evidential grid-based tracking and mapping," IEEE Transactions on Intelligent Transportation Systems, vol. 18, pp. 1454–1467, June 2017.
[4] D. Nuss, S. Reuter, M. Thom, T. Yuan, G. Krehl, M. Maile, A. Gern, and K. Dietmayer, "A random finite set approach for dynamic occupancy grid maps with real-time application," The International Journal of Robotics Research, vol. 37, no. 8, pp. 841–866, 2018.
[5] J. Mullane, M. D. Adams, and W. S. Wijesoma, "Evidential versus bayesian estimation for radar map building," pp. 1–8, IEEE, 2006.
[6] C. Yu, V. Cherfaoui, and P. Bonnifait, "An evidential sensor model for velodyne scan grids," pp. 583–588, IEEE, 2014.
[7] E. Capellier, F. Davoine, V. Frémont, J. Ibañez-Guzman, and Y. Li, "Evidential grid mapping, from asynchronous LIDAR scans and RGB images, for autonomous driving," pp. 2595–2602, IEEE, 2018.
[8] S. Wirges, C. Stiller, and F. Hartenbach, "Evidential occupancy grid map augmentation using deep learning," in IEEE Intelligent Vehicles Symposium (IV), pp. 668–673, IEEE, 2018.
[9] E. Shang, X. An, T. Wu, T. Hu, Q. Yuan, and H. He, "Lidar based negative obstacle detection for field autonomous land vehicles," Journal of Field Robotics, vol. 33, no. 5, pp. 591–617, 2016.
[10] S. Clode, F. Rottensteiner, and P. J. Kootsookos, "Improving city model determination by using road detection from lidar data," ISPRS, 2005.
[11] L. Caltagirone, S. Scheidegger, L. Svensson, and M. Wahde, "Fast lidar-based road detection using fully convolutional neural networks," pp. 1019–1024, IEEE, 2017.
[12] Y. Lyu, L. Bai, and X. Huang, "Chipnet: Real-time lidar processing for drivable region segmentation on an FPGA," IEEE Transactions on Circuits and Systems I: Regular Papers, 2018.
[13] J. Fritsch, T. Kuehnl, and A. Geiger, "A new performance measure and evaluation benchmark for road detection algorithms," in International Conference on Intelligent Transportation Systems (ITSC), 2013.
[14] B. Wu, A. Wan, X. Yue, and K. Keutzer, "SqueezeSeg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3D lidar point cloud," in IEEE International Conference on Robotics and Automation (ICRA), pp. 1887–1893, IEEE, 2018.
[15] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on Point Sets for 3D classification and segmentation," in Proceedings of the Computer Vision and Pattern Recognition conference (CVPR), IEEE, vol. 1, p. 4, 2017.
[16] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer, "SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud," pp. 4376–4382, IEEE, 2019.
[17] T. Denoeux, "Logistic regression, neural networks and dempster–shafer theory: A new perspective," Knowledge-Based Systems, vol. 176, pp. 54–67, 2019.
[18] T. Yang and V. Aitken, "Evidential mapping for mobile robots with range sensors," vol. 3, pp. 2180–2185, May 2005.
[19] J. Moras, V. Cherfaoui, and P. Bonnifait, "Moving objects detection by conflict analysis in evidential grids," pp. 1122–1127, IEEE, 2011.
[20] P. Smets, "Data fusion in the transferable belief model," in Proceedings of the Third International Conference on Information Fusion, vol. 1, pp. PS21–PS33, July 2000.
[21] W. Zhang, "Lidar-based road and road-edge detection," pp. 845–848, 2010.
[22] J. Han, D. Kim, M. Lee, and M. Sunwoo, "Enhanced road boundary and obstacle detection using a downward-looking lidar sensor," IEEE Transactions on Vehicular Technology, vol. 61, no. 3, pp. 971–985, 2012.
[23] X. He, J. Zhao, L. Sun, Y. Huang, X. Zhang, J. Li, and C. Ye, "Automatic vector-based road structure mapping using multi-beam lidar," pp. 417–422, IEEE, 2018.
[24] J. Jung and S.-H. Bae, "Real-time road lane detection in urban areas using LIDAR data," Electronics, vol. 7, no. 11, p. 276, 2018.
[25] L. Xiao, R. Wang, B. Dai, Y. Fang, D. Liu, and T. Wu, "Hybrid conditional random field based camera-lidar fusion for road detection," Information Sciences, vol. 432, pp. 543–558, 2018.
[26] Z. Chen, J. Zhang, and D. Tao, "Progressive lidar adaptation for road detection," IEEE/CAA Journal of Automatica Sinica, vol. 6, no. 3, pp. 693–702, 2019.
[27] R. Fernandes, C. Premebida, P. Peixoto, D. Wolf, and U. Nunes, "Road detection using high resolution lidar," pp. 1–6, IEEE, 2014.
[28] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe, "Exploring spatial context for 3D semantic segmentation of point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–724, 2017.
[29] Y. Zhou and O. Tuzel, "Voxelnet: End-to-end learning for point cloud based 3d object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490–4499, 2018.
[30] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D object detection from rgb-d data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 918–927, 2018.
[31] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <," arXiv preprint arXiv:1602.07360, 2016.
[32] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, "Rangenet++: Fast and accurate lidar semantic segmentation," in Proc. of the IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2019.
[33] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "Semantickitti: A dataset for semantic scene understanding of lidar sequences," in Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), 2019.
[34] B. R. Cobb and P. P. Shenoy, "On the plausibility transformation method for translating belief function models to probability models," International Journal of Approximate Reasoning, vol. 41, no. 3, pp. 314–330, 2006.
[35] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Instance normalization: The missing ingredient for fast stylization," arXiv preprint arXiv:1607.08022, 2016.
[36] E. Capellier, F. Davoine, V. Cherfaoui, and Y. Li, "Evidential deep learning for arbitrary lidar object classification in the context of autonomous driving," pp. 1304–1311, IEEE, 2019.
[37] K. A. B. Ahmad, M. Sahmoudi, and C. Macabiau, "Characterization of GNSS receiver position errors for user integrity monitoring in urban environments," in ENC-GNSS 2014, European Navigation Conference, 2014.
[38] Z. Tao and P. Bonnifait, "Road invariant extended kalman filter for an enhanced estimation of GPS errors using lane markings," pp. 3119–3124, IEEE, 2015.
[39] T. A. D'Adamo, T. G. Phillips, and P. R. McAree, "Registration of three-dimensional scanning lidar sensors: An evaluation of model-based and model-free methods," Journal of Field Robotics, vol. 35, no. 7, pp. 1182–1200, 2018.
[40] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "Nuscenes: A multimodal dataset for autonomous driving," arXiv preprint arXiv:1903.11027, 2019.
[41] F. Li, P. Bonnifait, J. Ibanez-Guzman, and C. Zinoune, "Lane-level map-matching with integrity on high-definition maps," pp. 1176–1181, IEEE, 2017.
[42] P. Chu, S. Cho, S. Sim, K. Kwak, and K. Cho, "A fast ground segmentation method for 3D point cloud," Journal of Information Processing Systems, vol. 13, no. 3, pp. 491–499, 2017.
[43] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.
[44] R. Jiroušek and P. P. Shenoy, "A decomposable entropy of belief functions in the dempster-shafer theory," in International Conference on Belief Functions, pp. 146–154, Springer, 2018.
[45] G. Shafer, A Mathematical Theory of Evidence, vol. 42. Princeton University Press, 1976.
[46] B. Balaguer, S. Balakirsky, S. Carpin, and A. Visser, "Evaluating maps produced by urban search and rescue robots: lessons learned from robocup," Autonomous Robots, vol. 27, no. 4, p. 449, 2009.
[47] S. O'Sullivan, "An empirical evaluation of map building methodologies in mobile robotics using the feature prediction sonar noise filter and metric grid map benchmarking suite," Master's thesis, University of Limerick, Ireland, 2003.
[48] M. C. Martin and H. P. Moravec, "Robot evidence grids," tech. rep., Carnegie-Mellon Univ Pittsburgh PA Robotics Inst, 1996.
[49] J. Carlson, R. R. Murphy, S. Christopher, and J. Casper, "Conflict metric as a measure of sensing quality," in Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pp. 2032–2039, IEEE, 2005.
[50] J. S. Albus, "4D/RCS: a reference model architecture for intelligent unmanned ground vehicles," in