[PDF] LookUP: Vision-Only Real-Time Precise Underground Localisation for Autonomous Mining Vehicles

Abstract

A key capability for autonomous underground mining vehicles is real-time accurate localisation. While significant progress has been made, currently deployed systems have several limitations ranging from dependence on costly additional infrastructure to failure of both visual and range sensor-based techniques in highly aliased or visually challenging environments. In our previous work, we presented a lightweight coarse vision-based localisation system that could map and then localise to within a few metres in an underground mining environment. However, this level of precision is insufficient for providing a cheaper, more reliable vision-based automation alternative to current range sensor-based systems. Here we present a new precision localisation system dubbed "LookUP", which learns a neural-network-based pixel sampling strategy for estimating homographies based on ceiling-facing cameras without requiring any manual labelling. This new system runs in real time on limited computation resource and is demonstrated on two different underground mine sites, achieving real time performance at ~5 frames per second and a much improved average localisation error of ~1.2 metre.

Full PDF

LLookUP: Vision-Only Real-Time Precise Underground Localisation forAutonomous Mining Vehicles

Fan Zeng , Adam Jacobson , David Smith , Nigel Boswell , Thierry Peynot , Michael Milford ©This paper is a preprint (IEEE accepted status). IEEE copyright notice. ©2019 IEEE. Personal use of this material is permitted. Permission from IEEEmust be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Abstract — A key capability for autonomous undergroundmining vehicles is real-time accurate localisation. While sig-niﬁcant progress has been made, currently deployed systemshave several limitations ranging from dependence on costlyadditional infrastructure to failure of both visual and range-sensor-based techniques in highly aliased or visually challengingenvironments. In our previous work, we presented a lightweightcoarse vision-based localisation system that could map andthen localise to within a few metres in an underground miningenvironment. However, this level of precision is insufﬁcient forproviding a cheaper, more reliable vision-based automationalternative to current range sensor-based systems. Here wepresent a new precision localisation system dubbed “LookUP”,which learns a neural-network-based pixel sampling strategyfor estimating homographies based on ceiling-facing cameraswithout requiring any manual labelling. This new system runsin real time on limited computation resource and is demon-strated on two different underground mine sites, achievingreal time performance at ∼ ∼ I. INTRODUCTIONReal-time high-accuracy localisation for autonomous ve-hicles in underground mine sites is challenging due to alack of GPS, severe lighting changes, dust and environ-ment ambiguity. As the mining industry seeks to becomemore efﬁcient, companies are looking for more economicaltechnology that will enable less lucrative secondary miningresources to be feasibly mined. One consequence of this fornavigating autonomous mine vehicles is that infrastructure-based techniques are less feasible, while range-based sensorsare often expensive and have been reported to struggle ingeometrically aliased environments such as long uniformtunnels. Low-cost vision-based localisation technologies areamong the most promising alternatives for overcoming theselimitations.Among vision-based localisation methods, the state-of-the-art general-purpose SLAM (Simultaneous Localisationand Mapping) algorithm ORB-SLAM [1] has been shownto perform unsatisfactorily in underground mine site envi-ronments [2]. Our previous work [3], [2] on coarse locali-sation based on whole-image matching has a demonstratedlocalisation accuracy out-performing a state-of-the-art deeplearning approach [4] with a mean localisation error of afew metres [2]. Because it localises to the nearest node in

This research was supported by an Advance Queensland InnovationPartnerships grant from the Queensland Government, Mining3, Caterpillarand the Queensland University of Technology (QUT). MM also receivedsupport from an ARC Future Fellowship FT140101229. FZ, AJ, TP and MM are with QUT [email protected] DS and NB are with Caterpillar, Inc. (a) (b) (c)(d) (e) (f) (g)

Fig. 1: (a) The proposed system consists of a coarse lo-calisation stage using a forward-facing camera (pink arrow)and a reﬁnement stage, LookUP, using an upward-facing one(orange arrow). (b)(c) Examples of “optical ﬂow” between aquery image (top) and a nearby reference image (bottom)when (b) a regular grid is used, and (c) an FCN (FullyConvolutional Network) is used with LookUP. Note thesigniﬁcant reduction in number of sample points from (b) to(c). Inlier optical ﬂow vectors are coloured in green. (d)(f)Sample point quality heat maps generated by the FCN inLookUP for the original images in (e)(g), respectively.the database, its accuracy is limited to the resolution ofthe node separation in the map. The goal of the researchpresented in this paper is to build on the previously presentedcoarse localisation system to enable a higher degree ofprecision with the eventual aim of enabling reliable, vision-only autonomous control of underground mining vehicles.Developing localisation for underground autonomous ve-hicles presents some challenges and opportunities. Sensingand hardware capabilities are limited as sensors must betoughened, severely restricting the use of recent hardwareand limiting the deployment of computationally intensivealgorithms including full size deep learning architectures.There are also limitations in the practical amount of trainingdata that can be obtained from a site. Naive deploymentof full 6DOF SLAM systems (e.g. [5]) is not necessaryas there are a range of constraints that can be applied: thepitch and roll variations of the vehicle (therefore the camera)relative to the tunnel can be assumed to be limited, as isthe variation in the height of the ceiling. Even allowingfor the occasional three-dimensional structures such as windpipes, the ceiling of mine tunnels is mostly planar. Thisoffers an opportunity to signiﬁcantly reduce computation, a r X i v : . [ c s . R O ] J un ince theoretically as few as four point-correspondences arerequired for planar homography estimation. Furthermore, theceiling-facing camera is less affected by dust and lightingfrom other vehicles.The paper makes the following contributions: • A new vision-only localisation system, designed forunderground mine environments, which takes coarselocalisation results and reﬁnes them through rapid quasi-planar surface homography estimation. • An efﬁcient neural-network-based sample point selectorthat generates quality heat maps of candidate pointsfor effective pixel-correspondence calculations, and anassociated off-line training process that does not requiremanual dataset labelling. • Demonstration of new levels of vision-only localisationaccuracy in two new challenging underground mine sitedatasets.The paper proceeds as follows. Section II reviews previouswork on robust localisation algorithms and various saliencygeneration methods used as preprocessing ﬁlters for imagematchers. Section III provides a detailed description of theproposed localisation system. Section IV describes exper-imental settings including the datasets and our method tobuild the evaluation benchmark, with the results presentedin Section V followed by the conclusion in Section VI.II. L

ITERATURE R EVIEW

A. LIDAR-based Localisation Methods

Laser scanners (LIDARs) can provide metric positionestimations when there are ample features across the scannedangle span, but laser-scanner-based localisation systems [6],[7], [8], [9] can easily get lost in long tunnels, which areubiquitous in underground mines, as the scanned point cloudsappear confusingly similar along the tunnel. This problemis uncommon in environments such as typical rooms andwarehouses because the shape of enclosing walls providessalient variations across the scanned angle span. In a longtunnel, a LIDAR essentially becomes one dimensional - itonly knows its distance to the walls but has no idea abouthow far it has travelled along them. Moreover, in the areas ofthe mine where there are more features, such as draw points,there could be objects like metal meshes that could confuselocalisation methods based on 2D laser scanners, becausethe returns from the mesh may also form occupied spacethat could be misinterpreted as a wall to align scans with.Therefore, due to the current limitations of LIDAR-basedmethods, we choose to exploit vision-based place recognitionmethods to do localisation, for enhanced global robustness.

B. Vision-Based Methods

Traditional feature-based place recognition algorithmssuch as FAB-MAP [10], [11] work poorly in mine-tunnelenvironments due to severe visual aliasing. SeqSLAM [12]and many other SLAM frameworks [13], [14], [15] are lesssensitive but require external sources like GPS or wheelodometry to provide metric information. As demonstrated (a) (b) (c)

Fig. 2: (a) Displacement (optical ﬂow) between matchedimages - shown in (b)(c) - could be large. Blue optical ﬂowvectors were rejected by the Pixel Correspondence Matcher,the rest were RANSAC ﬁltered, with inliers and outliersshown in green and gray, respectively.by our previous results, the coarse localisation unit - Semi-Supervised SLAM - is able to produce better maps thanORB-SLAM [1] with 2.5 times smaller localisation error.Nevertheless, a higher localisation accuracy is desirable tobetter assist the automated control of vehicle pose duringvarious activities such as digging, dumping and driving.Given the range of uncertainty of the coarse localisationresults and the sparse density of reference images sampledacross the mine, the translation between reference and queryimages can be quite signiﬁcant comparing to the capturedrange, even when a wide Field Of View (FOV) camerais used, because the walls and ceiling of the tunnel areusually a short distance away from the camera. As a re-sult, the matched point pair, if it exists, can be a largedistance apart (Fig. 2), under limited frame-rate constraints.Although we still refer to this translation vector as “opticalﬂow”, traditional optical ﬂow algorithms [16], [17] typicallyassume small displacement [18] and are not suitable forour application. I2-S2 [19] has been proposed to extracthomographies between query and candidate reference imagesfor pixels at predeﬁned image locations. Different saliencygenerators [20], [3] have been proposed for sample point orpatch ﬁltering, however, they are based on pre-determinedmetrics of pixel intensities and do not adapt automatically toa different context.

C. Deep Convolutional Networks

Deep convolutional networks [21] have been proven to besuccessful in place recognition [4], [22], image classiﬁcationand semantic segmentation [23], [24]. However, there is nodirect metric information output from these methods. Deeplearning based methods [25], [26], [27] have also been usedto analyse large optical ﬂows, among which FCN-basedpixel labelling [28], [29] is suitable for our application ofsample point selection, and an FCN similar to [28] is usedin this paper. In the next section, our precise localisation unit“LookUP” will be described.III. A

PPROACH

The more precise localisation unit takes in a query image,a coarse localisation result, and has access to a databaseof images with known camera poses. This database can becollected with a single camera or an array of cameras duringthe surveying process accompanying the construction of amine. The associated poses can be obtained via surveyingtools and recorded alongside the image frames. Based onhe coarse localisation result, relevant reference images in thedatabase are cross-examined with the query image. Since theceiling of mine tunnels provide quasi-planar surfaces to allowhomography calculation based on only a handful of points,the cameras used in the precise localisation unit looks uptowards the ceiling; in addition, the pose estimation requiresa “look up” in the database to ﬁnd the reference camera pose,hence the name “LookUP”.Fig. 3: Schematic diagram of the underground localisationsystem showing the query images (black), the coarse local-isation unit (pink) and the precise localisation unit LookUP(orange).

A. Pixel Correspondence Matcher

The Pixel Correspondence Matcher (Fig. 3) is used to ﬁndthe most-likely corresponding pixel in a query image for aselected pixel in the reference image. It takes an l patch -sizedreference patch centred at the selected reference pixel, andgenerates a search neighbourhood in the query image, whichis an L SR × L SR sized square centred at the same pixelcoordinates as the selected reference pixel. It then comparesthe reference patch to a set of candidate patches centredat every pixel in search neighbourhood. The best matchcandidate pixel in terms of Sum of Absolute Difference(SAD) score is reported. The process is visualised as colour-coded “optical ﬂows” in Fig. 4. (a) (b) Fig. 4: Two examples of “optical ﬂow” between a queryimage (top row) and a set of reference images (bottom row),featuring “optical ﬂow” outliers caused by (a) 3D objects onthe ceiling, and (b) uneven rock surfaces.The processing time for each pair of reference and queryimages is proportional to the number of sampled pixels forwhich the pixel correspondence is to be found. Under thesmall pitch and roll assumption, most likely the inlier vectorsfrom the output of a RANSAC [30] ﬁlter are similar indirection and magnitude (Fig. 1b). If we can identify such inliers in advance and only ﬁnd pixel correspondence forthem, the computation time could be reduced. As for theoutliers, they are excluded from the homography calculationanyway, it would be better if they were not sampled in theﬁrst place. In this paper we used an off-the-shelf neuralnetwork (VGG16-based FCN) to produce sampling pointqualities.

B. Sample Point Selector

As can be seen from Fig. 4, the outlier sample pointsthat produce inconsistent optical ﬂow vectors are most likelyon 3D objects (Fig. 4a) or uneven rock surfaces (Fig. 4b)on the ceiling. However, rather than deﬁning rigid rules forclassiﬁcation, such as “avoid long wires, pipes and stronglights”, more general and adaptive qualiﬁcation criteria aredesirable. This is because in some situations, certain objectsmay provide high-quality sample points for template match-ing, but they may not work well in other cases - the semanticsof the features affect their quality as a sample point, involvingcontextual information many pixels away from them. SupportVector Machines [31] are usually effective binary classiﬁersbut they are limited to local information around the samplepoint. A neural network architecture that incorporates moreholistic information is preferred.Although feature-based methods are susceptible to visualaliasing in our underground localisation application, theymay work well for sample point quality generation becausevisual aliasing is not a problem for this task. We implementedan FCN similar to the one described in [29]. The query imageis fed into the convolutional layers of a VGG16 [32] networkpre-trained with the ImageNet dataset [33], the output ofwhich goes into a × convolution and three up-samplinglayers, with skip connections to layer3 and layer4 of theoriginal VGG16. The output of the network is a heat map ofsample point quality (Figs. 1d and 1f), according to whichthe sample points are selected (Fig. 3). The training dataset ofthe FCN is generated by applying the Pixel CorrespondenceMatcher in LookUP to a training image dataset, processingpoints densely sampled on a regular grid, as shown inFigs. 1b and 4. RANSAC is used to classify the sampledpoints into inliers (coloured coded green) and outliers (colourcoded gray), according to the optical ﬂow vector obtainedon that sample point. The FCN is then trained using thislabelled data. The loss function is deﬁned as proportionalto the total number of misclassiﬁed sample points for thetraining images. The output of the sample point selector is aheat map of quality for all candidate pixels. After the FCNis trained, all the reference images are processed with itand corresponding sample quality heat maps are generatedalongside the reference image database. The training andclassiﬁcation processes are completed off-line, therefore theyneither take up on-line run time nor require a GPU inthe localisation system. At run time, it is up to the pixelcorrespondence matcher to decide how this heat map shouldbe used. . Homography Estimator The set of “optical ﬂow” vectors calculated by the pixelcorrespondence matcher from all selected sample points areused to compute a × homography matrix that relates thepose of the query and reference images. Although multiplesolutions exist for the homography matrix, it is not hardto identify the one that makes physical sense by choosingthe solution that gives the smaller pitch and roll. Before thehomography is found, there is an optional RANSAC ﬁlteringif the number of sampled points is greater than 10. D. Determination of Scaling Constant

The above homography estimation process can be donewith multiple reference images (each column in Fig. 4(a)(b)).If there are more than one reference images for which goodmatches are found, it is possible to estimate the constant thatconverts the distance from pixel to metric space, using theassumption that the scaling constant should be similar forboth homography relations. If only one reference image isused for faster processing, it is also possible to use a pre-determined constant for this conversion under the assumptionof small variations in the ceiling height.

E. Integration with the Coarse Localisation Unit

Currently, the interface between LookUP and Semi-Supervised SLAM is simply the time stamp of the databaseimage that is considered a match. LookUP will fetch thereference images from the ceiling-facing camera that weretaken most closely in time to the matched database imagefrom the forward facing camera. A reﬁned location is esti-mated by LookUP using this reference image and the systemthen decides whether this reﬁnement should be applied. Twoﬁlters are applied. The reﬁnement is deemed not reliable if1) the percentage of inliers after the RANSAC ﬁltering islower than a threshold N th or 2) if the x or y translationfrom the reference pose extracted from the homography islarger than a threshold d th . These could happen if the coarselocalisation result is incorrect, or the relative displacementis larger than the search range. The system will simply fallback to the coarse localisation result when LookUP is notconﬁdent. Apart from the above interface, the coarse andﬁne localisation units are highly independent and can beoptimised separately. Next we describe the experiments wehave done to evaluate the performance of LookUP system.IV. E XPERIMENTS

In order to evaluate the precise localisation system, coarselocalisation needs to be performed ﬁrst. Based on a map inwhich the reference poses are deﬁned, the coarse localisationsystem, Semi-Supervised SLAM [2], takes in images withknown locations and constructs an internal database accord-ing to their associated locations, grouping images taken at ad-jacent places to the same node and saves them in a database.When sequences of query images arrive, it compares queryimages to reference images in the database, and generatesa confusion matrix corresponding to the sequence of query images. Using the confusion matrix, LookUP was then runto output metric location results in the map.To evaluate the reﬁnement achieved by LookUP, the lo-calisation results corresponding to the confusion matrix werealso generated by disabling LookUP and directly outputtingthe reference poses corresponding to the time stamp ofthe matched reference image. The frames for which Semi-Supervised SLAM generated localisation errors that weregreater than 10 metres, for which a reﬁnement is hardly pos-sible, were excluded from the evaluation. Next we describethe real-world datasets collected to do such evaluation andhow the maps, reference poses and benchmark localisationresults were obtained.

A. Datasets

In order to evaluate the localisation accuracy, a differentlocalisation system that can generate benchmark localisationresults that are at least locally accurate must be applicableto the datasets. If the datasets contain many draw points andjunctions but few long stretches of tunnels, algorithms basedon laser scan matching can be used for benchmarking. Basedon such criteria, the following datasets were collected.1) Mine A dataset: This dataset includes nine traversesof a heavy vehicle in two connected tunnels of anunderground mine (Fig. 5a). Four of the traversesare used to build the map and the reference imagedatabase, the other ﬁve are used as localisation query.This is the same dataset used in [2].2) Mine B dataset: The majority of the optical ﬂowsbetween images in the Mine A dataset are alongthe travelling direction of the vehicle. On the otherhand, LookUP does not constrain the optical ﬂowsearch along one direction. To study the generality ofLookUP, a second dataset was collected in a differentmine, featuring four traverses of a light vehicle in amine tunnel (Fig. 5b). Traverse Middle(M): the lightvehicle was driven along the centre of the tunnel.Traverse Left(L) and Right(R): the light vehicle wasdriven close to the left and right wall, respectively.Traverse Zigzag(Z): the light vehicle was driven de-liberately in a zigzag motion. Traverse M was used tobuild the SLAM map; Traverses L, M and R were usedto build the reference image database; Traverse Z wasused as the localisation query. In this way, the queryimages in this dataset can have optical ﬂows in variousdirections w.r.t the references.Altogether the two datasets contain 276,063 dataframes over 5,117 seconds of ∼

50 kilometre traverses(average vehicle speed ∼

35 km/h). These datasets areparticular challenging due to the afﬂuence of heavilyaliased patterns on multiple scales.

B. Map Building and Reference Poses

1) Mine A: The coarse localisation results were directlytaken from [2]. However, the metric locations from [2]were based on an external Radio Telemetry Systemhat is not accurate enough for evaluating the pre-cise localisation system. A more precise occupancygrid map was required to generate the reference andbenchmark poses. The attempt to build such a mapusing Hector-Mapping [34] was unsuccessful since thisdataset contains a few sections of long tunnels andmetal meshes. Therefore, a different approach was usedto build the map. First, four separate maps, one foreach reference traverse were built using Cartographer[35], then the four maps were manually aligned to forma large map, shown as the black occupancy grid inFig. 5a. The manual assembly was necessary becausethe four traverses used for map building were notcollected continuously in time and space. The referenceposes were then obtained by running AMCL [36] onthe stitched map subscribing to the “ROS tf frames”[37] published by Cartographer.2) Mine B: The map of mine tunnel was successfully builtusing Hector-Mapping, shown as the blue occupancygrid in Fig. 5b. The camera poses of the referenceimages were obtained by running AMCL on the mapsubscribing to the “ROS tf frames” published byHector-Mapping. Unlike Mine A dataset, no coarselocalisation results were available, so AMCL wasused on the same map used to generate the locationsassociated with the images used in Semi-SupervisedSLAM.It should be clariﬁed that it is not necessary to obtainreference poses in this way. We obtained the referenceposes using the laser scan data with occupancy gridmap simply because this dataset was collected after theconstruction of the mine and we did not have surveyingcapabilities. (a) SLAM Map of Mine A, built by Cartographer [35].(b) SLAM Map of Mine B. Blue: Hector-Mapping [34]; Black:Cartographer [35].

Fig. 5: Maps built by the SLAM algorithms in [34], [35].Note our system does not depend on these algorithms.

C. Localisation benchmark

To calculate the localisation errors for evaluating differentsystem settings, AMCL was run on the query traverses toproduce the benchmark poses. During the AMCL runs, theposes of the vehicle and the laser scan results are visualisedtogether with the maps (in Fig. 5). Except for the beginningof each traverse, when AMCL is “initializing likelihood ﬁeldmodel with probabilities”, and a few times in the tunnelsections in the maps (Fig. 5) where there are no draw pointsor junctions, the laser scans align with the map pretty well.Since the reference poses are built with the same maps in thesame way, although the maps and the AMCL poses may notbe globally accurate, the AMCL poses can be reasoned aslocally reliable enough to be used for the local reﬁnementspresented in this paper, which are essentially relative posetransformations indifferent to absolute global coordinates.Additionally, the global accuracy of the whole system iscross-veriﬁed with an independent algorithm on the Mine Bdataset. The state-of-the-art SLAM algorithm - Cartographer[35], not used for Mine B, was chosen to build a second setof map (black occupancy grid in Fig. 5b). The two SLAMalgorithms work under different principles: AMCL usesparticle ﬁlters and Cartographer uses iterative optimisationsof pose graph. Proper loop closure was achieved by bothalgorithms, which is non-trivial for such datasets. As shownin Fig. 5b, the difference between the two maps is within5 metres, indicating the accuracy of the AMCL poses in amore global sense.

D. Comparison of FCN with Regular Grid

The FCN was implemented with Tensorﬂow [38] inPython. It was trained with Stochastic Gradient Descent(SGD) with batch size of 8 (the maximum that can ﬁt intoan NVIDIA GeForce GTX 1080 GPU) and drop out rateof 50%. Adam Optimiser and Softmax activation were usedto generate the sample quality heat map. LookUP iterativelyselects the best sample point (the one with highest heat mapvalue), and apply a ﬁxed reduction ratio ρ to its l n -sizedneighbourhood in the heat map. It continues to pick the nextbest sample point until the required sample point number isreached. The FCN-based sample point selector was evaluatedon the Mine A dataset in comparison with a regular gridsampling method. The regular grid contains 24 sample points(at the cost of more computation), whereas only the top 12from the FCN-based sample point classiﬁer were processed.All other parameters were kept the same. Selected framesof query images from query traverse 0 were used to trainthe FCN. After that the FCN generated sample point qualityfor all reference images in the database, which does notinclude any image the FCN was trained on. The FCN forMine B dataset was trained on sub-sampled query frames,and classiﬁed sample points for the reference images. E. Parameters

The parameters in Table I were used to obtain the resultsin the next section.ABLE I: PARAMETER LIST

Parameter Value Unit Description L SR

40 pixels Search range, Mine A L SR

70 pixels Search range, Mine B l patch

40 pixels Patch size, Mine A l patch

60 pixels Patch size, Mine B ρ l n

10 pixels Neighbourhood size, Mine A l n

20 pixels Neighbourhood size, Mine B N th

60% Min. inlier percentage d th V. R

ESULTS

A. Evaluation of the FCN

The performance of the FCN in generating high-qualitysample points was evaluated on test sets of images differentfrom the training sets. The classiﬁcation accuracy of the bestsample point selected by the FCN was compared with thatof a random point generator (representing the percentage ofgood sampling points in the ground truth). The percentageof correct classiﬁcations for test sets of Mine A dataset was ∼ ∼

62% from a random sampler; for MineB dataset it was ∼

41% compared to ∼ B. Localisation Results of LookUP

As shown in Fig. 6a, LookUP can successfully extractoptical ﬂow in various directions and its ability to reﬁne thecoarse localisation results is not limited to the travel direction(Figs. 6b-6d). (a)(b) (c) (d)

Fig. 6: (a) Optical ﬂow between the query image (top row)and various reference images (bottom row) for the frame in(d). (b-d) Localisation results of three sample frames fromMine B dataset, showing reﬁnements in different directions.

C. Effectiveness of Sample Point Classiﬁer

The mean localisation errors obtained for each traversewith: a) Semi-Supervised SLAM without reﬁnement, b)LookUP with FCN and c) LookUP with regular grid sam-ple point selector, are shown in Fig. 7. The localisationreﬁnements computed by LookUP with regular grid leadsto consistent but small error reductions, while LookUP with FCN sample point selector consistently leads to signiﬁcanterror reduction (as much as ∼

27% for traverse 3). This isbecause the indiscriminately sampled points on a regular gridresulted in false positive matches and therefore inaccurateoptical ﬂows for the Pixel Correspondence Matcher. Notethat the mean errors reported for the coarse localisationmethod in Fig. 7 are signiﬁcantly lower than the ∼ D. Computation Time

To study computation time performance of the LookUPunit, the coarse localisation result was obtained ﬁrst byrunning Semi-Supervised SLAM unit with all the queryimages and the confusion matrix was saved before the timerwas started. The following processes are all included inthe computation time: for each query image, The LookUPreads the pre-computed confusion matrix, searches for thebest-match coarse reference image from the forward-facingcamera for that query, and “looks up” the correspondingceiling images with the closest time stamp. The homographyresult is then calculated and saved as a ﬁle. Subsequentﬁltering, analyses and plotting are not timed. On an Intel i7-7700K 4.20GHz CPU, LookUP with FCN took 15 minutesto generate all results for Traverse 0 of Mine A, an averaged ∼ ∼

22 fps.VI. C

ONCLUSION

In this paper, we designed and characterised a reﬁnementunit “LookUP” to our localisation system for vehicles inunderground mine tunnel environments. It works by ﬁndinghomographies based on matched pixels between query andreference images of the mine ceiling. The accuracy ofookUP is enhanced by generating pixel correspondencesonly on high-quality sample points proposed by an FCN.Selectively processing high-quality sample points also sig-niﬁcantly increased the frame rate to ∼ EFERENCES[1] J. M. M. Mur-Artal Ra´ul, Montiel and J. D. Tard´os, “ { ORB-SLAM } :a Versatile and Accurate Monocular { SLAM } System,”

IEEE Trans.Robot. , vol. 31, no. 5, pp. 1147–1163, 2015.[2] A. Jacobson, F. Zeng, D. Smith, N. Boswell, T. Peynot, and M. Mil-ford, “Semi-supervised slam: Leveraging low-cost sensors on under-ground autonomous vehicles for position tracking,” in

IEEE Int. Conf.Intell. Robot. Syst. (IROS) , 2018.[3] F. Zeng, A. Jacobson, D. Smith, N. Boswell, T. Peynot, and M. Mil-ford, “Enhancing Underground Visual Place Recognition with Shan-non Entropy Saliency,” in

Australas. Conf. Robot. Autom. (ACRA) ,Sydney, Australia, 2017.[4] N. S¨underhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford,“On the performance of convnet features for place recognition,” in

Intell. Robot. Syst. (IROS) , 2015.[5] P. Newman, G. Sibley, M. Smith, M. Cummins, A. Harrison, C. Mei,I. Posner, R. Shade, D. Schroeter, L. Murphy, W. Churchill, D. Cole,and I. Reid, “Navigating, Recognizing and Describing Urban SpacesWith Vision and Lasers,”

Int. J. Rob. Res. , 2009.[6] D. M. Cole and P. M. Newman, “Using laser range data for 3D SLAMin outdoor environments,” in

IEEE Int. Conf. Robot. Autom. (ICRA) ,2006.[7] M. Magnusson, H. Andreasson, A. N¨uchter, and A. J. Lilienthal,“Appearance-Based place recognition from 3D laser data using thenormal distributions transform,” in

IEEE Int. Conf. Robot. Autom.(ICRA) , 2009.[8] C. Sprunk, G. D. Tipaldi, A. Cherubini, and W. Burgard, “Lidar-basedteach-and-repeat of mobile robot trajectories,” in

IEEE Int. Conf. Intell.Robot. Syst. (IROS) , 2013.[9] M. Bosse and J. Roberts, “Histogram Matching and Global Initializa-tion for Laser-only SLAM in Large Unstructured Environments,” in

IEEE Int. Conf. Robot. Autom. (ICRA) , 2007.[10] M. Cummins and P. Newman, “FAB-MAP: Probabilistic localizationand mapping in the space of appearance,”

Int. J. Rob. Res. , vol. 27,no. 6, pp. 647–665, 2008.[11] ——, “Highly scalable appearance-only SLAM - FAB-MAP 2.0,” in

Robot. Sci. Syst. , Seattle, United States, 2009.[12] M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-basednavigation for sunny summer days and stormy winter nights,” in

IEEEInt. Conf. Robot. Autom. (ICRA) , 2012.[13] M. J. Milford, G. Wyeth, and D. Prasser, “RatSLAM: A HippocampalModel for Simultaneous Localization and Mapping,” in

IEEE Int.Conf. Robot. Autom. (ICRA) , 2004.[14] W. Maddern, M. Milford, and G. Wyeth, “CAT-SLAM: probabilisticlocalisation and mapping using a continuous appearance-based trajec-tory,”

Int. J. Rob. Res. , vol. 31, no. 4, pp. 429–451, 2012.[15] A. J. Glover, W. P. Maddern, M. J. Milford, and G. F. Wyeth, “FAB-MAP + RatSLAM: Appearance-based SLAM for Multiple Times ofDay,” in

IEEE Int. Conf. Robot. Autom. (ICRA) , Anchorage, UnitedStates, 2010.[16] B. D. Lucas, T. Kanade, and Others, “An iterative image registrationtechnique with an application to stereo vision,” , 1981.[17] J. Shi and C. Tomasi, “Good features to track,” Cornell University,Tech. Rep., 1993.[18] P. Fua and V. Lepetit, “Monocular model-based 3d tracking of rigidobjects,”

Comput. Graph. Vis , vol. 1, no. 1, pp. 1–89, 2005.[19] F. Zeng, A. Jacobson, D. Smith, N. Boswell, T. Peynot, and M. Mil-ford, “I2-S2: Intra-Image-SeqSLAM for more accurate vision-basedlocalisation in underground mines,” in

Australas. Conf. Robot. Autom.(ACRA) , Canterbury, New Zealand, 2018.[20] M. Milford, E. Vig, W. Scheirer, and D. Cox, “Vision-based simulta-neous localization and mapping in changing outdoor environments,”

J. F. Robot. , vol. 31, no. 5, pp. 814–836, 2014. [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classiﬁcationwith deep convolutional neural networks,” in

Adv. Neural Inf. Process.Syst. , 2012, pp. 1097–1105.[22] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic,“NetVLAD: CNN architecture for weakly supervised place recogni-tion,” in

IEEE Conf. Comput. Vis. Pattern Recognit. , 2016.[23] R. Girshick, “Fast r-cnn,” in

IEEE Int. Conf. Comput. Vis. , 2015.[24] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in

Adv. NeuralInf. Process. Syst. , 2015.[25] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov,P. van der Smagt, D. Cremers, and T. Brox, “Flownet: Learning opticalﬂow with convolutional networks,” in

IEEE Int. Conf. Comput. Vis. ,2015.[26] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “DeepFlow:Large displacement optical ﬂow with deep matching,” in

IEEE Int.Conf. Comput. Vis. , 2013.[27] J. Wulff, L. Sevilla-Lara, and M. J. Black, “Optical Flow in MostlyRigid Scenes,” in

IEEE Conf. Comput. Vis. Pattern Recognit. , 2017.[28] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networksfor semantic segmentation,” in

IEEE Conf. Comput. Vis. PatternRecognit. , 2015.[29] H. Noh, S. Hong, and B. Han, “Learning deconvolution network forsemantic segmentation,” in

IEEE Int. Conf. Comput. Vis. , 2015.[30] M. A. Fischler and R. C. Bolles, “Random sample consensus: aparadigm for model ﬁtting with applications to image analysis andautomated cartography,”

Commun. ACM , vol. 24, no. 6, pp. 381–395,1981.[31] J. A. K. Suykens and J. Vandewalle, “Least squares support vectormachine classiﬁers,”

Neural Process. Lett. , vol. 9, no. 3, pp. 293–300,1999.[32] K. Simonyan and A. Zisserman, “Very deep convolutional networksfor large-scale image recognition,” arXiv preprint arXiv:1409.1556 ,2014.[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,“Imagenet: A large-scale hierarchical image database,” in

IEEE Conf.Comput. Vis. Pattern Recognit. , 2009.[34] S. Kohlbrecher, J. Meyer, O. von Stryk, and U. Klingauf, “A Flexibleand Scalable SLAM System with Full 3D Motion Estimation,” in

IEEEInt. Symp. Safety, Secur. Rescue Robot. , 2011.[35] W. Hess, D. Kohler, H. Rapp, and D. Andor, “Real-Time Loop Closurein 2D LIDAR SLAM,” in , 2016,pp. 1271–1278.[36] F. Dellaert, D. Fox, W. Burgard, and S. Thrun, “Monte carlo local-ization for mobile robots,” in

IEEE Int. Conf. Robot. Autom. (ICRA) ,1999.[37] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs,R. Wheeler, and A. Y. Ng, “ROS: an open-source Robot OperatingSystem,” in

ICRA Work. open source Softw. , vol. 3, no. 3.2. Kobe,Japan, 2009, p. 5.[38] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,S. Ghemawat, G. Irving, M. Isard, et al. , “Tensorﬂow: A systemfor large-scale machine learning,” in { USENIX } Symposium onOperating Systems Design and Implementation ( { OSDI }16)