Graph-based non-linear least squares optimization for visual place recognition in changing environments

Abstract

Visual place recognition is an important subproblem of mobile robot localization. Since it is a special case of image retrieval, the basic source of information is the pairwise similarity of image descriptors. However, the embedding of the image retrieval problem in this robotic task provides additional structure that can be exploited, e.g., spatio-temporal consistency. Several algorithms exist to exploit this structure, e.g., sequence processing approaches or descriptor standardization approaches for changing environments. In this paper, we propose a graph-based framework to systematically exploit different types of additional structure and information. The graphical model is used to formulate a non-linear least squares problem that can be optimized with standard tools. Beyond sequences and standardization, we propose the usage of intra-set similarities within the database and/or the query image set as an additional source of information. If available, our approach also allows seamless integration of additional knowledge about poses of database images. We evaluate the system on a variety of standard place recognition datasets and demonstrate performance improvements for a large number of different configurations including different sources of information, different types of constraints, and online or offline place recognition setups.


To appear in IEEE Robotics and Automation Letters (RA-L), 2021. ACCEPTED VERSION ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Graph-based non-linear least squares optimization for visual place recognition in changing environments

Stefan Schubert, Peer Neubert and Peter Protzel

I. INTRODUCTION

Visual place recognition in changing environments is the problem of finding matchings between two sets of observations, database and query, despite severe appearance changes. It is required for loop closure detection in SLAM (Simultaneous Localization and Mapping) and for candidate selection in image-based 6D localization systems. The common pipeline for visual place recognition in changing environments is shown in Fig. 1 (top): Given the two sets of images, database (the “map”) and query (later or current run), a descriptor is computed for each image. Subsequently, each descriptor of the database is compared pairwise to each descriptor of the query to get a similarity matrix $\hat{S}^{DB \times Q}$ that can finally be used to determine potential place matches. As illustrated in Fig. 1 (bottom), there are various additional approaches to improve the place recognition pipeline either by preprocessing the descriptors, e.g., with feature standardization, or by postprocessing the similarity matrix, e.g., with sequence search in the similarity matrix. In addition to these pre- and postprocessing steps, additional information can be exploited like database image poses or the so far rarely used intra-database and intra-query similarities. However, there are only few methods that exploit such additional knowledge in order to further improve the place recognition performance.

All authors are with Faculty of Electrical Engineering and Automation Technology, Chemnitz University of Technology, 09126 Chemnitz, Germany. {firstname.lastname}@etit.tu-chemnitz.de

[Figure 1: block diagram. Basic pipeline: database (DB) and query (Q) images, descriptors, pairwise descriptor similarity, matching decisions. Extensions based on structural information of the task: standardization, sequence processing, and graph-based processing, fed by additional knowledge (intra-set similarities, database poses, sequences).]

Fig. 1. The basic place recognition pipeline (above the horizontal dashed line) can be extended with additional information (below this line). Established approaches are standardization of descriptors and sequence processing. We propose a graph-based framework to integrate various sources of additional information in a common optimization formulation. An example for so far rarely used information are intra-set similarities.
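The basic pipeline of Fig. 1 (top) can be sketched in a few lines. This is only an illustration under stated assumptions: cosine similarity as the comparison measure, the rescaling to [0, 1], a per-query argmax as matching decision, and the function names `pairwise_similarity` and `best_matches` are all choices made for this example and are not taken from the paper.

```python
import numpy as np

def pairwise_similarity(db_desc, q_desc):
    """Cosine similarity between every database and every query descriptor.

    db_desc: (M, D) array, q_desc: (N, D) array.
    Returns an (M, N) matrix rescaled from [-1, 1] to [0, 1] (assumed convention).
    """
    db = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    q = q_desc / np.linalg.norm(q_desc, axis=1, keepdims=True)
    cos = np.clip(db @ q.T, -1.0, 1.0)
    return 0.5 * (cos + 1.0)

def best_matches(S):
    """For each query image (column), index of the most similar database image."""
    return np.argmax(S, axis=0)

rng = np.random.default_rng(0)
db_desc = rng.normal(size=(5, 16))
# Toy query descriptors: noisy copies of database images 3, 0 and 4.
q_desc = db_desc[[3, 0, 4]] + 0.05 * rng.normal(size=(3, 16))

S_hat = pairwise_similarity(db_desc, q_desc)   # initial similarity matrix
matches = best_matches(S_hat)                  # simplest possible matching decision
```

In a real system the matching decision would typically be a threshold on the similarities rather than a plain argmax, since a query image may have no match or several matches in the database.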

In this paper, we address the problem of systematically exploiting additional information by proposing a versatile, expandable framework for different kinds of prior knowledge from or about database and query. Specifically, we

• propose a versatile, expandable graph-based framework that formulates place recognition as a sparse non-linear least squares optimization problem in a factor graph (Sec. III)
• discuss different sources of information in place recognition problems and propose a loop-rule and an exclusion-rule to exploit inherent structural properties of the place recognition problem (Sec. III-C)
• demonstrate how this framework can be used to integrate the different sources of information, e.g., we provide implementations of the loop- and exclusion-rules in terms of cost functions for factors. These either exploit intra-set descriptor similarities within database and/or query, or, if available, additional knowledge about database image poses (Sec. III-C)
• demonstrate the versatility of the framework by proposing an n-ary factor that mimics the sequence processing approach of SeqSLAM [1] (Sec. III-D)
• present the optimization using standard non-linear least squares optimization tools (Sec. IV)
• provide comprehensive experimental evaluation on a variety of datasets, configurations and existing methods from the literature (Sec. V)

The paper closes with a discussion of the current implementation and potential extensions in Sec. VI.

II. RELATED WORK

Visual place recognition in changing environments is a subject of active research. An overview of existing methods is given in [2]. In the present paper, the basic source of information to match places are image descriptors. The authors of [3] demonstrated that intermediate CNN-layer outputs like the conv3-layer from AlexNet [4] trained for image classification can serve as holistic image descriptors to match places across condition changes between database and query.
Moreover, there are CNNs especially trained for place recognition that return either holistic image descriptors like NetVLAD [5] or local features like DELF [6]. The performance of holistic descriptors can be further improved by additional pre- and postprocessing steps. To improve the performance of holistic descriptors, feature standardization and unsupervised learning techniques like PCA- and clustering-based methods can be used [7]. In [8] it is shown how a neurologically-inspired neural network can be used to combine several descriptors along a sequence to a new descriptor for each place. [9] extended this approach to encode additional odometry information in the new descriptors. Given the final pairwise similarities between descriptors from database and query, sequence-based methods [1][10] for postprocessing can be used to improve the place recognition performance further.

In this paper, we propose a graph-based approach to optimize the descriptor similarities by incorporating prior knowledge. In [10] Hansen and Browning used a hidden Markov model (HMM) to formulate a graph-based sequence search method in the similarity matrix. Naseer et al. [11] used a flow network with edges defined for sequence search to improve place recognition results. In [12] a graph is used to prevent place matches between adjacent places and to distribute high matching scores to neighboring places. In contrast to these approaches, our approach exploits not only sequence information but also integrates other additional knowledge, e.g., about intra-set similarities in the database set or the query set. The literature provides several approaches for localization where the database is known in advance. For example, given the images from the database (or another representative training set), FabMap [13] learns statistics about feature occurrences in order to determine the most descriptive features. McManus et al. [14] train offline condition-invariant broad-region detectors from beforehand collected images with a variety of appearances at particular locations. In [15] intra-database descriptor comparisons are used to reduce the number of required image comparisons during the query run. Vysotska and Stachniss [16] use binary intra-database place matches to perform jumps within the similarity matrix during sequence search. In contrast, in our presented approach we use potentially continuous intra-database similarities to optimize the place recognition result instead of just accelerating it. Moreover, we do not only use this information to find loops but in addition to potentially inhibit wrong loop closures.

Graphical models, and in particular factor-graphs, are a well established tool in mobile robotics [17], e.g., in the form of robust pose graph SLAM [18]. Similar to the here proposed approach, pose graph SLAM represents each place as a node and the edges (factors) model constraints between these places. However, a significant difference is that pose graph SLAM deals with spatial information, i.e., the places are represented by pose coordinates in the world and the factors are spatial transformations between these poses (e.g., odometry or detected loop closures). In contrast, our presented approach is intended to be used before SLAM to establish loop closures. In particular, we do not optimize metric errors between spatial constraints but errors in the mutual consistency of descriptor similarities.

III. ARCHITECTURE OF THE GRAPHICAL MODEL

A graphical model serves as a structured representation of prior knowledge in terms of dependencies, rules, and available information. Given the variable nodes in the graph with their initial values together with dependencies between nodes based on additional knowledge, optimization algorithms can be used to modify the variables in order to resolve inconsistencies between nodes.
We are going to exploit this mechanism and present a graph-based framework for visual place recognition that optimizes the initial similarity values $\hat{S}^{DB \times Q}$ from pairwise image descriptor comparisons.

The graph-based framework is expressed as factor graph. Factor graphs are graphical representations of least squares problems – for a detailed introduction to factor graphs please refer to [17]. The graph’s basic architecture consists of nodes with unary factors that penalize deviations from the initial similarity values. Depending on additional knowledge, we can add different factors in the graph to introduce connections (i.e. dependencies) between nodes. Two architectures with nodes, unary factors, and additional factors are shown in Fig. 3. Each factor defines a quadratic cost function based on the variables it connects. The resulting combined optimization problem is defined as a weighted sum of the individual cost functions. The optimization is subject of Sec. IV. Here, we first define the basic architecture of the graph with corresponding nodes and unary factors. Next, we propose two ways to exploit prior knowledge with binary and n-ary factors. We will structure the explanation of each factor by the prior knowledge that can be exploited, a corresponding rule that expresses the resulting dependency between nodes, a proposed related cost function $f(\cdot)$ that punishes a violation of this rule, and the factor’s usage in the graph. Except for the unary factor, each factor is optional and depends on the available knowledge. Accordingly, the proposed framework can be extended in future work with additional factors.

A. Graph nodes

The graph-based framework is designed to optimize the initial pairwise descriptor similarity matrix $\hat{S}^{DB \times Q} \in \mathbb{R}^{M \times N}$. $M$ is the number of database images and $N$ the number of query images. Accordingly, we define $M \times N$ nodes where each node $s_{ij} \in S^{DB \times Q}$ is a variable version of its initial value $\hat{s}_{ij} \in \hat{S}^{DB \times Q}$ with $s_{ij}, \hat{s}_{ij} \in [0, 1]$.

B. Unary factor

1) Prior Knowledge:

Descriptor similarities $\hat{s}_{ij}$ are the primary source of information for place recognition. Beyond initializing the variables to these similarities, we have to prevent too large deviations of $s_{ij}$ from $\hat{s}_{ij}$, in particular to prevent trivial solutions during optimization.

2) Rule “prior”:

$$ s_{ij} \approx \hat{s}_{ij} \tag{1} $$

3) Cost function “prior”:

$$ f_{\mathrm{prior}}(\cdot) = (s_{ij} - \hat{s}_{ij})^2 \tag{2} $$
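As a small illustration of the unary factor, the following sketch evaluates the prior cost for every node of a toy similarity matrix. The helper names and the toy values are assumptions made for this example; the residual form (difference before squaring) is what least squares solvers usually expect as input.

```python
import numpy as np

def prior_residuals(S, S_hat):
    """Residuals of the unary 'prior' factors, one per node s_ij."""
    return (S - S_hat).ravel()

def prior_cost(S, S_hat):
    """Per-node cost (s_ij - s_hat_ij)^2 of the rule 'prior'."""
    return (S - S_hat) ** 2

# Toy example with M = 3 database images and N = 2 query images.
S_hat = np.array([[0.9, 0.2],
                  [0.1, 0.8],
                  [0.3, 0.4]])
S = S_hat.copy()
S[0, 1] = 0.5          # move one node away from its initial value

costs = prior_cost(S, S_hat)   # only the moved node is penalized
```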

4) Usage:

Each node $s_{ij}$ is connected with a single unary factor to its initial value $\hat{s}_{ij}$ as shown in Fig. 3. Thus, $M \times N$ unary factors are used in a graph.

C. Binary factors for the exploitation of intra-database or intra-query similarities from poses or descriptor-comparisons

1) Prior Knowledge – intra-database similarities from poses:

In some applications, the database is created with a more advanced sensor setup than the query. Due to the missing position information for the query images, direct position-based matching cannot be conducted. Nevertheless, available poses for database images can be used to create binary intra-database similarities $\hat{s}^{DB}_{ij} \in \hat{S}^{DB}$ that encode whether two database images $i$ and $j$ show the same place ($\hat{s}^{DB}_{ij} := 1$) or different places ($\hat{s}^{DB}_{ij} := 0$).
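One way to obtain such binary intra-database similarities is to threshold the distance between database positions. The sketch below assumes 2D metric coordinates (e.g., projected GPS) and a distance threshold `max_dist`; both the threshold and the function name are assumptions of this example, since the text does not specify how poses are compared.

```python
import numpy as np

def binary_intra_db_similarity(positions, max_dist):
    """Binary intra-database similarities from 2D positions.

    positions: (M, 2) array of metric coordinates.
    max_dist:  distance threshold in the same unit (assumed parameter).
    Returns an (M, M) matrix with 1.0 where two images are treated as
    the same place and 0.0 otherwise.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    return (dist <= max_dist).astype(float)

# Toy trajectory that returns to its start: image 3 revisits the place of image 0.
positions = np.array([[0.0, 0.0],
                      [10.0, 0.0],
                      [10.0, 10.0],
                      [1.0, 0.5]])
S_db_pose = binary_intra_db_similarity(positions, max_dist=3.0)
```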

2) Prior Knowledge – intra-set similarities from image descriptors:

Even if position data is not available, we can acquire similar information directly from image comparisons within the database and also within the query to get intra-database similarities $\hat{s}^{DB}_{ij} \in \hat{S}^{DB}$ and intra-query similarities $\hat{s}^{Q}_{ij} \in \hat{S}^{Q}$ with $\hat{s}^{DB}_{ij}, \hat{s}^{Q}_{ij} \in [0, 1]$. These intra-set similarities are an inherent and almost always available source of additional information, which has not been used often yet. The computation of intra-set similarities can usually be done more reliably than the computation of inter-set similarities, because the condition within database or query is potentially more constant than between both. The intra-set image comparisons could be done with methods like FabMap [13] that are more suited for place recognition under constant condition. Fig. 2 shows how the intra-set similarities $\hat{S}^{DB}$, $\hat{S}^{Q}$ are related to the inter-set similarities $\hat{S}^{DB \times Q}$ and how they can reveal loops and zero-velocity stages.

We define separate binary factors for both intra-set similarities, either from poses or from image descriptors. Each binary factor is a combination of two complementing rules and cost functions $f^{Q}_{\mathrm{loop}}(\cdot) + f^{Q}_{\mathrm{exclusion}}(\cdot)$ (Eq. (5), (9)) and $f^{DB}_{\mathrm{loop}}(\cdot) + f^{DB}_{\mathrm{exclusion}}(\cdot)$ (Eq. (6), (10)), respectively.

3) Rule “loop”: If $\hat{s}^{Q}_{jl}$ is high (indicated by “$\hat{s}^{Q}_{jl}\uparrow$”), the $j$-th and $l$-th query image likely show the same place. If the $i$-th database image is compared to these two query images, the following ternary relation can be derived:

$$ s_{ij} \approx s_{il}, \text{ iff } \hat{s}^{Q}_{jl}\uparrow \tag{3} $$

The rule expresses that “$s_{ij}$ should be equal to $s_{il}$ iff the query images $j$, $l$ show the same place”, because if both query images $j$, $l$ show the same place, database image $i$ can either be equal to both or to none of both. Accordingly, this rule exploits loops within the intra-query similarities. This rule is inherent to the place recognition problem and valid as well for intra-database similarities:

$$ s_{ij} \approx s_{kj}, \text{ iff } \hat{s}^{DB}_{ik}\uparrow \tag{4} $$

[Figure 2: intra-database similarity $\hat{S}^{DB}$, intra-query similarity $\hat{S}^{Q}$ and inter-set similarity $S^{DB \times Q}$, with annotated regions for zero vehicle velocity in DB, in Q, and in both DB and Q, and for a database part without overlap with Q.]

Fig. 2. The information of the similarity matrix between database and query (bottom right) can be extended with similarity information of images within the query set (top right) and/or within the database set (bottom left).

4) Cost function “loop”:

For equations (3) and (4), cost functions similar to (2) can be formulated:

$$ f^{Q}_{\mathrm{loop}}(\cdot) = \hat{s}^{Q}_{jl} \cdot (s_{ij} - s_{il})^2 \tag{5} $$

$$ f^{DB}_{\mathrm{loop}}(\cdot) = \hat{s}^{DB}_{ik} \cdot (s_{ij} - s_{kj})^2 \tag{6} $$

The quadratic error term is multiplied by the intra-set similarity $\hat{s}^{Q}_{jl}$ or $\hat{s}^{DB}_{ik}$ to apply weighting in case of non-binary intra-set similarities from image descriptors or to turn it on and off for binary intra-set similarities.
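The loop cost functions can be evaluated directly for single node pairs. The following sketch uses toy values; the function names and the example matrices are assumptions made for illustration.

```python
import numpy as np

def loop_cost_query(S, S_q_hat, i, j, l):
    """Loop cost for database image i and query images j, l (cf. Eq. (5))."""
    return S_q_hat[j, l] * (S[i, j] - S[i, l]) ** 2

def loop_cost_db(S, S_db_hat, i, k, j):
    """Loop cost for database images i, k and query image j (cf. Eq. (6))."""
    return S_db_hat[i, k] * (S[i, j] - S[k, j]) ** 2

# Inter-set similarities (M = 2 database images, N = 3 query images).
S = np.array([[0.8, 0.5, 0.2],
              [0.1, 0.1, 0.1]])
# Intra-query similarities: query images 0 and 2 likely show the same place.
S_q_hat = np.array([[1.0, 0.0, 0.9],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.0, 1.0]])
# Binary intra-database similarities: both database images show the same place.
S_db_hat = np.ones((2, 2))

cost_same = loop_cost_query(S, S_q_hat, 0, 0, 2)  # penalized: 0.8 vs 0.2 disagree
cost_diff = loop_cost_query(S, S_q_hat, 0, 0, 1)  # switched off by zero weight
cost_db = loop_cost_db(S, S_db_hat, 0, 1, 0)      # penalized: 0.8 vs 0.1 disagree
```

The weight acts as a soft switch: the pair of query images with intra-query similarity 0 contributes nothing, regardless of how different the two inter-set similarities are.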

5) Rule “exclusion”: If $\hat{s}^{Q}_{jl}$ is low (indicated by “$\hat{s}^{Q}_{jl}\downarrow$”), the $j$-th and $l$-th query image probably show different places. If the $i$-th database image is compared to these two query images, the following ternary relation can be derived:

$$ \neg (s_{ij}\uparrow \wedge s_{il}\uparrow), \text{ iff } \hat{s}^{Q}_{jl}\downarrow \tag{7} $$

This rule expresses that “not both similarity measurements $s_{ij}$ AND $s_{il}$ can be high iff the query images $j$, $l$ show different places”, i.e., the rule excludes one or both similarities $s_{ij}$, $s_{il}$ from being high; otherwise a single database image $i$ would show two different places concurrently. Again, this rule is inherent to the place recognition problem and is supposed to add valuable information that can be exploited. As before, the same rule is valid for intra-database similarities:

$$ \neg (s_{ij}\uparrow \wedge s_{kj}\uparrow), \text{ iff } \hat{s}^{DB}_{ik}\downarrow \tag{8} $$

[Figure 3: two factor graphs over the nodes $s_{ij}$, arranged along the Q and DB axes. Factors: unary $f_{\mathrm{prior}}(s_{ij})$; binary $f^{Q}_{\mathrm{loop}}(s_{ij}, s_{kl}) + f^{Q}_{\mathrm{exclusion}}(s_{ij}, s_{kl})$ and $f^{DB}_{\mathrm{loop}}(s_{ij}, s_{kl}) + f^{DB}_{\mathrm{exclusion}}(s_{ij}, s_{kl})$; n-ary $f_{\mathrm{seq}}(s_{ij}, s_{i-L/2,j-L/2}, \ldots, s_{i+L/2,j+L/2})$.]

Fig. 3. Illustration of the graph structure. (left) Unary and binary factors. (right) Unary and n-ary factors, which connect structured local blocks.

6) Cost function “exclusion”:

It is less natural how to express the rule “exclusion” in a cost function. This is part of the discussion in Sec. VI. In this work, we define the following cost functions for equations (7) and (8):

$$ f^{Q}_{\mathrm{exclusion}}(\cdot) = (1 - \hat{s}^{Q}_{jl}) \cdot (s_{ij} \cdot s_{il})^2 \tag{9} $$

$$ f^{DB}_{\mathrm{exclusion}}(\cdot) = (1 - \hat{s}^{DB}_{ik}) \cdot (s_{ij} \cdot s_{kj})^2 \tag{10} $$

The quadratic error term is multiplied by the negated intra-set similarity $(1 - \hat{s}^{Q}_{jl})$ or $(1 - \hat{s}^{DB}_{ik})$ to weight it in case of non-binary intra-set similarities from image descriptors or to turn it on and off for binary intra-set correspondences.
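To see how the product form behaves, the sketch below evaluates the exclusion cost for all query pairs $j < l$ within one row of the similarity matrix. The function name and the toy values are assumptions of this example.

```python
import numpy as np

def exclusion_costs_row(s_row, S_q_hat):
    """Exclusion costs (cf. Eq. (9)) for all query pairs j < l of one row."""
    j, l = np.triu_indices(len(s_row), k=1)   # pairs (0,1), (0,2), (1,2), ...
    return (1.0 - S_q_hat[j, l]) * (s_row[j] * s_row[l]) ** 2

# One database image compared to three query images.
s_row = np.array([0.9, 0.8, 0.1])
# Intra-query similarities: all three query images show different places.
S_q_hat = np.array([[1.0, 0.1, 0.1],
                    [0.1, 1.0, 0.1],
                    [0.1, 0.1, 1.0]])

costs = exclusion_costs_row(s_row, S_q_hat)
```

Only the first pair, where both similarities are high (0.9 and 0.8), receives a noticeable cost; as soon as one of the two similarities is low, the product and therefore the cost nearly vanishes.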

7) Usage:

Fig. 3 (left) shows a graphical model with nodes $s_{ij}$, unary factors $f_{\mathrm{prior}}(\cdot)$, and the proposed binary factors. To apply the binary factors for existing intra-database similarities $\hat{S}^{DB}$, each node $s_{ij}$ has to be connected to every node $s_{kj}$ for all $k = 1, \ldots, M$ with $k \neq i$ within each column in $S^{DB \times Q}$; i.e., $\binom{M}{2}$ factors per column. For existing intra-query similarities $\hat{S}^{Q}$, each node $s_{ij}$ has to be connected to every node $s_{il}$ for all $l = 1, \ldots, N$ with $l \neq j$ within each row in $S^{DB \times Q}$; i.e., $\binom{N}{2}$ factors per row. The potentially high number of connections is part of the discussion in Sec. VI.

D. N-ary factors for the exploitation of sequences

1) Prior Knowledge:

If both database and query are recorded as spatio-temporal sequence, sequences also appear in the inter-set similarities $S^{DB \times Q}$ (see Fig. 2). In this case, SeqSLAM [1] showed the benefit from a simple combination of similarities of neighboring images. Originally, it was implemented as postprocessing of the similarity matrix $S^{DB \times Q}$. We can integrate such sequential information within the graph in a similar fashion using n-ary factors.

2) Rule “sequence”:

$$ s_{ij} \approx \frac{1}{L} \sum_{\forall \{k,l\} \in \mathrm{Seq}(i,j;L,v_p)} s_{kl} \tag{11} $$

$\mathrm{Seq}(i, j; L, v_p)$ is a function that returns similarities that are part of a sequence along a line segment with slope $v_p$ and sequence length $L$ within $S^{DB \times Q}$ around $s_{ij}$. For a full explanation how SeqSLAM works, please refer to [1].

3) Cost function “sequence”:

$$ f_{\mathrm{seq}}(\cdot) = \Big( s_{ij} - \frac{1}{L} \max_{v_p \in V} \Big( \sum_{\{k,l\} \in \mathrm{Seq}(\{i,j\};L,v_p)} s_{kl} \Big) \Big)^2 \tag{12} $$

with $V$ being the set of allowed velocities within $S^{DB \times Q}$.
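A possible reading of the sequence rule and its cost function is sketched below. The exact definition of $\mathrm{Seq}(\cdot)$ is given in [1]; here it is assumed that the line segment is centered at $(i, j)$, that the slope $v_p$ maps query offsets to rounded database offsets, and that entries outside the matrix are dropped so that the mean is taken over the remaining entries. All function names are chosen for this example.

```python
import numpy as np

def seq_indices(i, j, L, v, shape):
    """Indices (k, l) on a line segment of length L with slope v around (i, j).

    Indices outside the matrix are dropped (truncation at the borders).
    """
    offsets = np.arange(L) - L // 2
    l = j + offsets
    k = i + np.round(v * offsets).astype(int)
    valid = (k >= 0) & (k < shape[0]) & (l >= 0) & (l < shape[1])
    return k[valid], l[valid]

def seq_cost(S, i, j, L, velocities):
    """Squared difference between s_ij and the best mean along a line segment."""
    best = max(S[seq_indices(i, j, L, v, S.shape)].mean() for v in velocities)
    return (S[i, j] - best) ** 2

# Toy similarity matrix: a strong diagonal sequence with one weak entry at (3, 3).
S = np.full((7, 7), 0.1)
np.fill_diagonal(S, 0.9)
S[3, 3] = 0.3

V = [0.5, 1.0, 1.5]                       # allowed slopes (assumed values)
cost_on_seq = seq_cost(S, 3, 3, 5, V)     # weak node inside a strong sequence
cost_off_seq = seq_cost(S, 1, 5, 5, V)    # node consistent with its surroundings
```

The weak node at (3, 3) is surrounded by high similarities along the slope 1.0, so its cost is high and the optimization would pull it upwards; the off-diagonal node agrees with its neighborhood and is not penalized.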

4) Usage:

Fig. 3 (right) shows a graphical model with nodes $s_{ij}$, unary factors $f_{\mathrm{prior}}(\cdot)$, and the proposed n-ary factors for sequences. As for the unary factors in Sec. III-B, each node in the graph is equipped with one n-ary factor. Each n-ary factor connects its node to all nodes that are part of any sequence with length $L$ and slope $v_p \in V$. Accordingly, $M \times N$ n-ary factors are introduced into a graph if sequences are exploited.

IV. OPTIMIZATION OF THE GRAPHICAL MODEL

The objective of the graph-based framework is the formulation of dependencies from prior knowledge for a subsequent optimization of the similarities $s_{ij}$ to incorporate the prior knowledge into the final similarity values, and to potentially resolve contradictory dependencies.

We use factor graphs as graphical representation of least squares problems and defined every cost function for each type of factor as (weighted) quadratic error function. Accordingly, the global error $E$ is defined as a sum over the (weighted) cost functions of all factors in the graph.

A. Weighting of factor’s costs for global error computation in the graph

As usually done in error computation for a graph, costs from different factor-types have to be balanced by weighting them separately. Therefore, we normalize the cost of each factor by the number of factors per factor-type, and introduce user-specified weights $w$ for all factors except for the basic unary factor (which gets weight 1).

For the unary factor of our basic architecture with cost function $f_{\mathrm{prior}}(\cdot)$, we define the partial global error $E_{\mathrm{prior}}$ with

$$ E_{\mathrm{prior}} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} f_{\mathrm{prior}}(s_{ij}) \tag{13} $$

In case of dependencies in the graph that are introduced by available intra-database or intra-query similarities, we define a partial global error $E^{DB}_{\mathrm{loop,exclusion}}$ or $E^{Q}_{\mathrm{loop,exclusion}}$ for the weighted summation over all binary factors (Sec. III-C):

$$ E^{DB}_{\mathrm{loop,exclusion}} = \frac{1}{N \binom{M}{2}} \sum_{j=1}^{N} \sum_{i=1}^{M-1} \sum_{k=i+1}^{M} w^{DB}_{\mathrm{loop}} \cdot f^{DB}_{\mathrm{loop}}(s_{ij}, s_{kj}) + w^{DB}_{\mathrm{exclusion}} \cdot f^{DB}_{\mathrm{exclusion}}(s_{ij}, s_{kj}) \tag{14} $$

$$ E^{Q}_{\mathrm{loop,exclusion}} = \frac{1}{M \binom{N}{2}} \sum_{i=1}^{M} \sum_{j=1}^{N-1} \sum_{l=j+1}^{N} w^{Q}_{\mathrm{loop}} \cdot f^{Q}_{\mathrm{loop}}(s_{ij}, s_{il}) + w^{Q}_{\mathrm{exclusion}} \cdot f^{Q}_{\mathrm{exclusion}}(s_{ij}, s_{il}) \tag{15} $$

For the n-ary factors (Sec. III-D) in case of sequence exploitation, the partial global error $E_{\mathrm{seq}}$ is defined with

$$ E_{\mathrm{seq}} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} w_{\mathrm{seq}} \cdot f_{\mathrm{seq}}(s_{ij}, \mathrm{Seq}(\{i, j\}; L, v^{*}_{ij})) \tag{16} $$

with $v^{*}_{ij}$ being the optimal velocity (slope) with the highest mean of connected similarities around $s_{ij}$.

Finally, summation over all partial global errors $E_i$ that occur in the graph yields the global error $E$:

$$ E = \sum_{\forall i} E_i \tag{17} $$

B. Implementation of the optimization

Error $E$ is a sum solely over quadratic cost functions. Thus, many tools for non-linear least squares (NLSQ) optimization can be used (e.g., scipy’s least_squares function in Python). For an easier formulation of the optimization problem, factor graph tools like g2o [19] can be used. However, one should be aware that depending on the introduced factors, the optimization problem can get quite huge. Thus, efficient implementations should be preferred that perform a quick and memory efficient optimization. Sec. V-F reports our achieved runtimes that are presumably sufficient for many applications. Sec. VI provides some more discussion on alternative optimization methods and approximation techniques for runtime improvements.

V. EXPERIMENTAL RESULTS

We present experiments that investigate the performance gains achieved by the graph-based framework with three different extensions that exploit 1) intra-database similarities; 2) intra-database and intra-query similarities; 3) intra-database similarities, intra-query similarities and sequences. In order to evaluate the potential benefit beyond available pre- and postprocessing methods, we repeat these experiments with the descriptor standardization approach from [7] for preprocessing and SeqSLAM for postprocessing. In a final experiment, we compare our graph-based method with sequence-based approaches from the literature.

A. Experimental Setup

1) Image descriptor:

NetVLAD [5] is used as CNN-image descriptor in all experiments. We use the authors’ implementation trained on the Pitts30k dataset with VGG-16 and whitening.

2) Metric:

The performance is measured with averageprecision which is the area under the precision-recall curve.
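A minimal sketch of how such an average precision could be computed from a similarity matrix and ground-truth matches is given below. It assumes that all matrix entries are ranked jointly and that the area is accumulated with the step-wise rule $\sum_n (R_n - R_{n-1}) P_n$; the exact interpolation used in the experiments is not stated, so this convention and the function name are assumptions.

```python
import numpy as np

def average_precision(S, gt):
    """Area under the precision-recall curve for a similarity matrix.

    S:  (M, N) similarities.
    gt: (M, N) boolean ground-truth place matches.
    """
    order = np.argsort(-S.ravel(), kind="stable")   # rank by decreasing similarity
    hits = gt.ravel()[order].astype(float)
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, hits.size + 1)
    recall = tp / hits.sum()
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))

gt = np.eye(2, dtype=bool)
S_good = np.array([[0.9, 0.1],
                   [0.2, 0.8]])     # both true matches ranked first
S_bad = np.array([[0.9, 0.95],
                  [0.2, 0.8]])      # one false match ranked above the true ones

ap_good = average_precision(S_good, gt)
ap_bad = average_precision(S_bad, gt)
```

For `S_bad`, the first true match is found at rank 2 (precision 1/2) and the second at rank 3 (precision 2/3), each contributing a recall step of 1/2, which gives 7/12.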

3) Datasets:

All experiments are based on the five different datasets Nordland [20], StLucia (Various Times of the Day) [21], CMU (Visual Localization) [22], GardensPoint Walking [23] and Oxford RobotCar [24] with different characteristics regarding the environment, appearance changes, in-sequence loops, stops, or viewpoint changes. We use the datasets as described in our previous work [8]. Images from StLucia, CMU and Oxford were sampled with one frame per second, which preserves varying camera speeds, stops and loops in the datasets, and leads to translation and orientation changes during revisits. GardensPoint contains lateral viewpoint changes.

For the binary intra-database similarities $\hat{S}^{DB}_{\mathrm{pose}}$ from poses, we use the GPS from the datasets or a main diagonal for GardensPoint Walking and Nordland. For the query sequence, we assume that no GPS is available.

4) Implementation:

Graph creation and optimization were implemented in Python with scipy’s least_squares optimization function; the Trust Region Reflective algorithm was used for minimization. Due to the huge number of factors within a graph in case of the usage of intra-database and intra-query similarities, we divided $S^{DB \times Q}$ into equally sized patches with height and width ≤ , and optimized each patch separately. No information is shared between patches, and the n-ary factors are truncated at borders. A full optimization on $S^{DB \times Q}$ without patches was performed if only intra-database similarities $\hat{S}^{DB}$ were used. The variables in the optimization are initialized with $\hat{S}^{DB \times Q}$ from the pairwise descriptor comparisons.
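The following sketch shows how such an optimization could look for the configuration with intra-database similarities only, i.e., $E = E_{\mathrm{prior}} + E^{DB}_{\mathrm{loop,exclusion}}$. It is not the authors' implementation: the residual layout, the function name `residuals` and the toy matrices are assumptions. Each weighted, normalized cost is written as a squared residual, so that the sum of squared residuals equals $E$ (scipy minimizes half of that sum, which has the same minimizer).

```python
from math import comb

import numpy as np
from scipy.optimize import least_squares

def residuals(x, S_hat, S_db_hat, w_loop, w_excl):
    """Stacked residuals whose squared sum equals E_prior + E^DB_loop,exclusion."""
    M, N = S_hat.shape
    S = x.reshape(M, N)
    # Unary 'prior' factors, normalized by their number M*N (weight 1).
    r_prior = (S - S_hat).ravel() / np.sqrt(M * N)
    # All database pairs i < k; every pair is applied to each column j.
    i, k = np.triu_indices(M, k=1)
    norm = N * comb(M, 2)
    sim = S_db_hat[i, k][:, None]                      # shape (pairs, 1)
    r_loop = np.sqrt(w_loop * sim / norm) * (S[i] - S[k])
    r_excl = np.sqrt(w_excl * (1.0 - sim) / norm) * (S[i] * S[k])
    return np.concatenate([r_prior, r_loop.ravel(), r_excl.ravel()])

# Toy problem: database images 0 and 1 show the same place, image 2 a different one.
S_hat = np.array([[0.9, 0.2],
                  [0.3, 0.1],    # s_10 = 0.3 is a missed match for query image 0
                  [0.2, 0.8]])
S_db_hat = np.array([[1.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

x0 = S_hat.ravel()                                     # initialize with the pairwise similarities
result = least_squares(residuals, x0, args=(S_hat, S_db_hat, 4.0, 40.0), method="trf")
S_opt = result.x.reshape(S_hat.shape)
```

In this toy example the loop factor pulls the two similarities of the revisited place towards each other, so the missed match `S_opt[1, 0]` rises above its initial value, while the clear match `S_opt[2, 1]` stays high. For large matrices, the dense pair enumeration used here would have to be replaced by a sparse or patch-wise formulation.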

5) Parameters:

In all experiments, we used a fixed parameter set that was determined from a toy-example and a small real-world dataset. We used $w^{DB}_{\mathrm{loop}} = 4$, $w^{DB}_{\mathrm{exclusion}} = 40$ for intra-set similarities from GPS, $w^{DB}_{\mathrm{loop}} = w^{Q}_{\mathrm{loop}} = 1$, $w^{DB}_{\mathrm{exclusion}} = w^{Q}_{\mathrm{exclusion}} = 20$ for intra-set similarities from descriptors, and $w_{\mathrm{seq}} = 10$, $L = 11$ for sequences.

B. Contributions of information sources and rules

In Sec. III we identified four rules: “prior”, “loop”, “exclusion” and “sequence”. In the following, we are going to evaluate the influence of the rules when they are successively added and exploited in the graph. Note that rule “prior” alone would merely return the initial values $\hat{S}^{DB \times Q}$. Input to the graph are the pairwise descriptor similarities from the raw NetVLAD descriptors that serve as baseline as well (termed “pairwise”). All results are summarized in Table I.

1) Exploitation of intra-database similarities from poses or descriptors (rules “loop” and “exclusion”):

Intra-database similarities $\hat{S}^{DB}$ can be used in most place recognition setups as they can be acquired either from the pairwise image comparisons within the database or from poses, e.g., from GPS or SLAM.

Table I shows the results (indicated by $\hat{S}^{DB}_{\mathrm{pose}}$ and $\hat{S}^{DB}_{\mathrm{desc}}$) when intra-database similarities are used either from poses or descriptor comparisons. In most cases, the pairwise performance is significantly improved and never gets worse. If intra-database similarities from poses (here: GPS) are used, the performance gain is slightly better, presumably because place matchings and distinctions from poses are binary and less error prone. However, even when the intra-database similarities from descriptor comparisons are used, the performance can be improved significantly.

The results support that most existing place recognition pipelines could be improved with the proposed graph-based framework together with intra-database similarities. Moreover, since it is not necessary to know the query sequence in advance, the proposed graph-based framework can be employed in an online fashion.

2) Exploitation of intra-database and intra-query similarities from poses or descriptors (rules “loop” and “exclusion”):

In addition to the previous experiment, supplementary intra-query similarities could be used to model dependencies within the graph not only between database images but also between query images. We used intra-query similarities solely from descriptor comparisons since we do not assume global pose information during the query run; otherwise, place matchings could be received directly from pose-comparisons.

Table I shows the results (indicated by $\hat{S}^{DB}_{\mathrm{pose}} + \hat{S}^{Q}_{\mathrm{desc}}$ and $\hat{S}^{DB}_{\mathrm{desc}} + \hat{S}^{Q}_{\mathrm{desc}}$). For intra-database similarities from poses, the performance is improved only for a few sequence-combinations compared to the performance with intra-database but without intra-query similarities. When using intra-query similarities in addition to intra-database similarities from descriptors, the performance could be improved at least by additional for of all datasets.

The results indicate that additional data from intra-query similarities within the proposed graph-based framework can be used to improve the place recognition performance further. It is important to note that the system again never performs worse in comparison to a graph with intra-database but without intra-query similarity exploitation. The result is interesting, since intra-query similarities from descriptors could always be collected for a subsequent postprocessing.

TABLE I. Average precision for different configurations of the graph-based framework. Colored arrows indicate large (≥ better/worse) or medium (≥ ) deviation compared to the configuration that exploits less prior knowledge (i.e., “pairwise” vs $\hat{S}^{DB}$; $\hat{S}^{DB}$ vs $\hat{S}^{DB} + \hat{S}^{Q}$; $\hat{S}^{DB} + \hat{S}^{Q}$ vs $\hat{S}^{DB} + \hat{S}^{Q}$ + Seq). Bold text indicates the best performance per row and per intra-database source.

[Only the trend arrows of Table I are preserved; the numeric values, the “pairwise” column and the database/query sequence names are missing. Columns 2 to 4: $\hat{S}^{DB}$ from poses; columns 5 to 7: $\hat{S}^{DB}$ from descriptors.]

| Dataset | $\hat{S}^{DB}_{\mathrm{pose}}$ (online) | $\hat{S}^{DB}_{\mathrm{pose}} + \hat{S}^{Q}_{\mathrm{desc}}$ (offline / delayed) | $\hat{S}^{DB}_{\mathrm{pose}} + \hat{S}^{Q}_{\mathrm{desc}}$ + Seq (offline / delayed) | $\hat{S}^{DB}_{\mathrm{desc}}$ (online) | $\hat{S}^{DB}_{\mathrm{desc}} + \hat{S}^{Q}_{\mathrm{desc}}$ (offline / delayed) | $\hat{S}^{DB}_{\mathrm{desc}} + \hat{S}^{Q}_{\mathrm{desc}}$ + Seq (offline / delayed) |
|---|---|---|---|---|---|---|
| Nordland | ↑ | → | ↑ | ↑ | → | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↑ | ↑ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↑ | ↑ | ↑ |
| Nordland | ↑ | ↗ | ↑ | ↑ | → | ↑ |
| Nordland | ↑ | → | ↑ | ↑ | ↗ | ↑ |
| Nordland | ↑ | → | ↑ | ↑ | → | ↑ |
| StLucia | ↑ | → | ↑ | ↑ | ↗ | ↑ |
| StLucia | ↑ | → | ↑ | ↗ | ↗ | ↑ |
| StLucia | ↑ | → | ↑ | ↑ | ↗ | ↑ |
| StLucia | ↑ | → | ↑ | ↑ | ↗ | ↑ |
| StLucia | ↑ | → | ↑ | ↑ | ↗ | ↑ |
| CMU | ↗ | → | ↗ | → | → | ↗ |
| CMU | → | → | ↗ | → | → | ↑ |
| CMU | ↗ | → | ↗ | → | → | ↑ |
| CMU | ↑ | ↗ | ↑ | ↗ | → | ↑ |
| GardensPoint Walking | → | → | → | → | → | → |
| GardensPoint Walking | ↑ | ↗ | ↑ | ↑ | ↗ | ↑ |
| GardensPoint Walking | ↑ | ↗ | ↑ | ↑ | ↑ | ↑ |
| Oxford | ↗ | → | → | → | → | → |
| Oxford | → | → | → | → | → | → |
| Oxford | ↑ | → | → | ↗ | → | → |
| Oxford | ↗ | → | → | → | ↗ | → |

3) Additional exploitation of sequences within the graph (rules “loop”, “exclusion” and “sequence”):

In this experiment, we used sequence information within the graph in addition to the intra-database and intra-query similarities. Table I shows the results (indicated by $\hat{S}^{DB}_{\mathrm{pose}} + \hat{S}^{Q}_{\mathrm{desc}}$ + Seq and $\hat{S}^{DB}_{\mathrm{desc}} + \hat{S}^{Q}_{\mathrm{desc}}$ + Seq). With sequence information the graph can again significantly improve the place recognition performance compared to the previous experiments without this additional assumption no matter if the intra-database similarities come from poses or descriptor comparisons. Moreover, the full setup of the graph with all proposed factors clearly outperforms the baseline.

C. Combination with preprocessed descriptors

The place recognition performance can be improved if the descriptors are preprocessed [7]. To investigate the influence of preprocessing on all configurations of the graph from Sec. V-B, we repeated all previous experiments with standardized image descriptors [7]. Results are shown in Table II (left). Since the preprocessing requires complete knowledge of the query descriptors, we do not provide results for the online configuration of the graph. The full configuration of the graph with intra-set similarities and sequences can again show significant performance improvements for the majority of the sequence combinations. The results demonstrate that the graph-based framework benefits from better-performing descriptors.
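
The reason for the missing online results can be seen in a small sketch of the preprocessing. It assumes that standardization means shifting and scaling every descriptor dimension to zero mean and unit standard deviation over one image set; this is an assumed reading of [7], and `standardize` is a hypothetical helper name. The statistics need all descriptors of the set, which is why the query set must be complete before it can be processed.

```python
import math


def standardize(descriptors):
    """Standardize each descriptor dimension over the given image set.

    Assumption: per-dimension zero mean and unit (population) standard
    deviation, computed separately for each set (database or query).
    """
    n = len(descriptors)
    dims = len(descriptors[0])
    means = [sum(d[k] for d in descriptors) / n for k in range(dims)]
    stds = [
        math.sqrt(sum((d[k] - means[k]) ** 2 for d in descriptors) / n)
        for k in range(dims)
    ]
    # Constant dimensions carry no information; avoid division by zero.
    stds = [s if s > 0.0 else 1.0 for s in stds]
    return [
        [(d[k] - means[k]) / stds[k] for k in range(dims)]
        for d in descriptors
    ]
```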

D. Combination with sequence-based postprocessing

In the next experiment, a modified version of SeqSLAM [1] without local contrast normalization and without the single-matching constraint is used; otherwise SeqSLAM would fail on datasets with in-sequence loops. It postprocesses the inter-set similarities $S^{DB \times Q}$ either from the pairwise descriptor comparison (baseline) or from all configurations of the graph from Sec. V-B.

Results are shown in Table II (right); note that descriptor comparisons from the raw NetVLAD descriptors without sequence-based postprocessing are used as inputs to the graphs. The baseline performance after postprocessing is already comparatively high compared to the baseline performance without postprocessing (Table I). Accordingly, it is hard to achieve large performance improvements. Nonetheless, the graph-based framework could improve the results in almost [...] of the cases, often by more than [...]. However, intra-database similarities from both poses and descriptors seem to be sufficient, since more information could not be used for further performance improvements on most datasets.

E. Combination with pre- and postprocessing
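
The sequence-based postprocessing from Sec. V-D, which is reused in this combination, can be sketched as follows. This is a strongly simplified stand-in and not the modified SeqSLAM itself: it assumes that database and query move at the same speed, so each similarity is replaced by the mean along the slope-1 diagonal through it. The function name and the default window length are illustrative assumptions.

```python
def sequence_filter(s, seq_len=11):
    """Average similarities along diagonals of s (rows: database, cols: query).

    Simplifying assumption: equal speed in database and query, so only the
    slope-1 diagonal through each entry is evaluated.
    """
    rows, cols = len(s), len(s[0])
    half = seq_len // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total, count = 0.0, 0
            for k in range(-half, half + 1):
                ii, jj = i + k, j + k
                if 0 <= ii < rows and 0 <= jj < cols:  # stay inside s
                    total += s[ii][jj]
                    count += 1
            out[i][j] = total / count  # count >= 1 because k = 0 is valid
    return out
```

For a 3 x 3 matrix with ones on the main diagonal, a single off-diagonal outlier of 0.9 at position (0, 1) and `seq_len=3`, the outlier is damped to 0.45 while the consistent diagonal entries keep the value 1.0.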

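We also conducted experiments with preprocessing from Sec. V-C and postprocessing from Sec. V-D. The baseline performance was already almost perfect for most of the datasets, so that the graph-based approach could improve the performance only slightly, by less than [...]. Again, the performance was never worse than the baseline.

TABLE II
EVALUATION OF THE AVERAGE PRECISION OF THE GRAPH-BASED FRAMEWORK WITH PRE- OR POSTPROCESSING

(Only the deviation arrows survive in this copy; the “pairwise” columns and all numeric average-precision values are missing. Rows appear in the same order in both parts.)

Left: with preprocessing of descriptors (Sec. V-C), all configurations offline/delayed

| Dataset | $\hat{S}^{DB}_{pose} + \hat{S}^{Q}_{desc}$ | $\hat{S}^{DB}_{pose} + \hat{S}^{Q}_{desc}$ + Seq | $\hat{S}^{DB}_{desc} + \hat{S}^{Q}_{desc}$ | $\hat{S}^{DB}_{desc} + \hat{S}^{Q}_{desc}$ + Seq |
|---|---|---|---|---|
| Nordland | → | ↑ | → | ↑ |
| Nordland | ↗ | ↑ | → | ↑ |
| Nordland | → | ↑ | → | ↑ |
| Nordland | → | ↑ | → | ↑ |
| Nordland | → | ↑ | → | ↑ |
| Nordland | → | ↑ | → | ↑ |
| StLucia | → | ↑ | → | ↑ |
| StLucia | → | ↑ | → | ↑ |
| StLucia | → | ↑ | → | ↑ |
| StLucia | → | ↑ | → | ↑ |
| StLucia | → | ↑ | → | ↑ |
| CMU | ↗ | ↗ | → | ↗ |
| CMU | ↗ | ↗ | → | ↗ |
| CMU | → | ↗ | → | ↑ |
| CMU | ↑ | ↑ | → | ↑ |
| GP | → | → | → | → |
| GP | → | ↑ | → | ↑ |
| GP | → | ↑ | → | ↑ |
| Oxford | → | → | → | → |
| Oxford | → | → | → | → |
| Oxford | ↗ | → | → | → |
| Oxford | → | → | → | → |

Right: with postprocessing of similarities (Sec. V-D)

| Dataset | $\hat{S}^{DB}_{pose}$ (online) | $\hat{S}^{DB}_{pose} + \hat{S}^{Q}_{desc}$ (offline/delayed) | $\hat{S}^{DB}_{pose} + \hat{S}^{Q}_{desc}$ + Seq (offline/delayed) | $\hat{S}^{DB}_{desc}$ (online) | $\hat{S}^{DB}_{desc} + \hat{S}^{Q}_{desc}$ (offline/delayed) | $\hat{S}^{DB}_{desc} + \hat{S}^{Q}_{desc}$ + Seq (offline/delayed) |
|---|---|---|---|---|---|---|
| Nordland | → | → | → | → | → | → |
| Nordland | ↑ | → | ↓ | ↑ | → | ↑ |
| Nordland | ↗ | ↗ | ↓ | ↗ | ↗ | → |
| Nordland | ↑ | → | ↘ | ↑ | → | → |
| Nordland | → | → | → | → | → | → |
| Nordland | → | → | → | → | → | → |
| StLucia | ↗ | → | → | → | → | → |
| StLucia | → | → | → | → | → | → |
| StLucia | → | → | → | → | → | → |
| StLucia | ↗ | → | → | → | → | → |
| StLucia | ↑ | → | ↘ | ↑ | → | ↑ |
| CMU | → | → | → | → | → | → |
| CMU | → | → | → | → | → | → |
| CMU | → | → | → | → | → | → |
| CMU | ↑ | → | → | ↗ | → | → |
| GP | → | → | → | → | → | → |
| GP | ↑ | → | → | ↗ | → | ↑ |
| GP | ↑ | → | → | ↗ | ↗ | ↑ |
| Oxford | → | → | → | → | → | → |
| Oxford | → | → | → | → | → | → |
| Oxford | ↗ | → | → | ↗ | → | → |
| Oxford | → | → | → | → | → | → |

F. Comparison with state-of-the-art sequence-based methods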

To compare performance and runtime of our method with approaches from the literature, we conducted additional experiments with the sequence-based methods SeqSLAM [1], MCN [8], VPR [16] and ABLE [25]. The sequence length was L = 11, if required. The experiments were performed twice, without and with feature standardization [7] for descriptor preprocessing.

Table III shows the achieved performances. Our method clearly outperformed the compared approaches and achieved the best performance on most datasets. Only on Nordland did SeqSLAM and ABLE achieve better performance, since they benefit from constant camera speed in database and query. With feature standardization, ABLE additionally achieved the best performance on GardensPoint, and MCN performed best on three Oxford datasets.

Runtimes were measured on an Intel i7-7700K CPU with 64 GB RAM. The maximum runtimes per query for all methods are shown in Table III (bottom). Our method required approx. 5.7 sec per query on Oxford [...]

[...] $S^{DB \times Q}$. This makes it relatively independent of the actually chosen image descriptor, which may require only a slight parameter adjustment. This property even allows an optimization of place descriptor similarities from different sensor modalities like LiDAR.

In this paper, we defined several factor types to express prior knowledge (“rules”) about place recognition problems. Each factor implements a cost function for a rule. Often, there are alternative formulations of a cost function for a rule. For example, the cost function in case of no loop in query ($\hat{s}^{Q}_{ij} \downarrow$; Eq. (9)) or database ($\hat{s}^{DB}_{ij} \downarrow$; Eq. (10)) is defined as multiplication $(s_1 \cdot s_2)^2$. Especially for this rule, alternative formulations may apply like $\min(s_1, s_2)^2$ or $\max(s_1 + s_2 - 1, 0)^2$; both are piecewise linear, which could be beneficial.

Factor graphs are often (but not exclusively) used in combination with probabilities. Presumably, a more probabilistic view on the proposed graph-based framework could provide additional insights.
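
The alternative formulations for the no-loop rule can be compared numerically with a small sketch. It assumes similarities in [0, 1] and that each term enters the least-squares objective as a squared residual; the function names are hypothetical and the snippet is not taken from the authors' implementation.

```python
def cost_product(s1, s2):
    """Squared product of the two similarities."""
    return (s1 * s2) ** 2


def cost_min(s1, s2):
    """Squared minimum of the two similarities."""
    return min(s1, s2) ** 2


def cost_hinge(s1, s2):
    """Squared hinge: only penalizes when s1 + s2 exceeds 1."""
    return max(s1 + s2 - 1.0, 0.0) ** 2


if __name__ == "__main__":
    for s1, s2 in [(0.0, 0.8), (0.5, 0.5), (1.0, 1.0)]:
        print(s1, s2, cost_product(s1, s2), cost_min(s1, s2), cost_hinge(s1, s2))
```

All three variants vanish as soon as one of the two similarities is zero. They differ in how strongly two simultaneously high similarities are penalized: for s1 = s2 = 0.5 the product gives 0.0625, the minimum gives 0.25 and the hinge gives 0.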
For instance, the chosen cost functions (2), (6) and (9) with structure $(s_1 - s_2)^2$ can be considered as the negative log-likelihood of a single Gaussian:

$$(s_1 - s_2)^2 \;\Leftrightarrow\; -\ln\left(e^{-(s_1 - s_2)^2}\right) \tag{18}$$

Moreover, a piecewise linear formulation of the factor discussed above could help to formulate the proposed graph-based framework in a more probabilistic way; for instance, a cost function $\min(s_1, s_2)^2$ corresponds to the negative log-likelihood of a maximum of two Gaussians:

$$\min(s_1, s_2)^2 \;\Leftrightarrow\; -\ln\left(\max\left(e^{-(s_1 - 0)^2},\, e^{-(s_2 - 0)^2}\right)\right) \tag{19}$$

These cost functions, however, solely work on the descriptor similarities. In the related work (Sec. II), we already mentioned the important difference to pose graph SLAM, which solely works on spatial poses (and not their similarities). An interesting question for future work is whether the proposed approach can be extended to also work directly on descriptors instead of their similarities (i.e., the variables would be descriptor vectors, not their scalar similarities). This would significantly increase the complexity of the optimization problem, but could allow the simultaneous optimization of spatial poses and descriptors for a potentially tightly coupled loop closure detection and SLAM.

TABLE III
AVERAGE PRECISION AND MAXIMUM RUNTIME PER QUERY OF OUR METHOD COMPARED TO SEQUENCE-BASED APPROACHES FROM THE LITERATURE. COLORED ARROWS INDICATE DEVIATION FROM “PAIRWISE”. BOLD TEXT INDICATES THE BEST PERFORMANCE PER ROW AND PER PREPROCESSING.

(Only the deviation arrows and the runtimes survive in this copy; the “pairwise” columns and all numeric average-precision values are missing. Rows appear in the same order in both parts.)

(Online) without preprocessing

| Dataset | $\hat{S}^{DB}_{desc} + \hat{S}^{Q}_{desc}$ + Seq (ours) | SeqSLAM [1] | MCN [8] | VPR [16] | ABLE [25] |
|---|---|---|---|---|---|
| Nordland | ↑ | ↑ | ↑ | ↑ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↑ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↑ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↑ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↑ | ↑ |
| Nordland | ↑ | ↑ | ↓ | ↑ | ↑ |
| StLucia | ↑ | ↓ | ↑ | ↑ | → |
| StLucia | ↑ | ↓ | ↑ | → | → |
| StLucia | ↑ | ↓ | ↑ | ↘ | ↘ |
| StLucia | ↑ | ↓ | ↑ | ↑ | ↑ |
| StLucia | ↑ | ↓ | ↑ | ↑ | ↑ |
| CMU | ↑ | ↓ | ↑ | ↓ | → |
| CMU | ↑ | ↓ | → | ↓ | ↘ |
| CMU | ↑ | ↓ | ↗ | ↓ | ↘ |
| CMU | ↑ | ↓ | ↑ | ↓ | ↓ |
| GP | → | ↓ | → | ↓ | → |
| GP | ↑ | ↓ | ↑ | ↘ | ↑ |
| GP | ↑ | ↓ | ↗ | ↓ | ↑ |
| Oxford | → | ↓ | → | ↓ | ↓ |
| Oxford | → | ↓ | → | ↓ | ↓ |
| Oxford | ↑ | ↓ | ↑ | ↓ | ↓ |
| Oxford | ↑ | ↓ | → | ↓ | → |
| max. runtime per query (Oxford [...]) | 5.6 sec | 24 msec | 280 msec | 3.7 msec | 63 µsec |

(Offline) descriptor preprocessing with feature standardization [7]

| Dataset | $\hat{S}^{DB}_{desc} + \hat{S}^{Q}_{desc}$ + Seq (ours) | SeqSLAM [1] | MCN [8] | VPR [16] | ABLE [25] |
|---|---|---|---|---|---|
| Nordland | ↑ | ↑ | ↑ | ↓ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↑ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↘ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↘ | ↑ |
| Nordland | ↑ | ↑ | ↑ | ↓ | ↑ |
| Nordland | ↑ | ↑ | ↗ | ↓ | ↑ |
| StLucia | ↑ | ↓ | → | ↓ | ↓ |
| StLucia | ↑ | ↓ | → | ↓ | ↓ |
| StLucia | ↑ | ↓ | → | ↓ | ↓ |
| StLucia | ↑ | ↓ | → | ↓ | → |
| StLucia | ↑ | ↓ | ↑ | ↓ | → |
| CMU | ↑ | ↓ | ↑ | ↓ | → |
| CMU | ↑ | ↓ | → | ↓ | ↘ |
| CMU | ↑ | ↓ | → | ↓ | ↓ |
| CMU | ↑ | ↓ | ↑ | ↓ | ↓ |
| GP | → | ↓ | → | ↓ | → |
| GP | ↑ | ↓ | → | → | ↑ |
| GP | ↑ | ↓ | ↗ | ↑ | ↑ |
| Oxford | → | ↓ | ↗ | ↓ | ↓ |
| Oxford | → | ↓ | → | ↓ | ↓ |
| Oxford | ↗ | ↓ | ↗ | ↓ | ↓ |
| Oxford | → | ↓ | → | ↓ | → |
| max. runtime per query (Oxford [...]) | 5.7 sec | 23 msec | 441 msec | 3.6 msec | 68 µsec |

Even without such an extension, as indicated in Sec. III and Sec. IV, the number of factors or connections between nodes can become very large, and grows cubically if intra-set similarities are used. In our experiments, we addressed the problem by dividing $S^{DB \times Q}$ into patches if both intra-set similarities were used (Sec. V-A.4). Presumably, more efficient implementations, e.g. C++ implementations in Ceres [26], can improve memory consumption and computational efficiency. Another promising direction is approximation techniques like a systematic removal of low-relevance connections in the graph. Finally, different optimization techniques could be used: in earlier work on graph optimization, minimization techniques based on hill-climbing algorithms like ICM (iterated conditional modes) were used [27, p. 599]; these may allow a different and more compact representation and optimization of the graph.

REFERENCES

[1] M. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in

Proc. of Int. Conf. on Robotics and Automation, 2012.
[2] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” Trans. Rob., vol. 32, no. 1, 2016.
[3] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, “On the performance of ConvNet features for place recognition,” Proc. of Int. Conf. on Intelligent Robots and Systems (IROS), 2015.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
[5] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” Trans. on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, 2018.
[6] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in Int. Conf. on Computer Vision (ICCV), 2017.
[7] S. Schubert, P. Neubert, and P. Protzel, “Unsupervised learning methods for visual place recognition in discretely and continuously changing environments,” in Int. Conf. on Rob. & Autom. (ICRA), 2020.
[8] P. Neubert, S. Schubert, and P. Protzel, “A neurologically inspired sequence processing model for mobile robot place recognition,” IEEE Robotics and Automation Letters, vol. 4, no. 4, 2019.
[9] S. Schubert, P. Neubert, and P. Protzel, “Towards combining a neocortex model with entorhinal grid cells for mobile robot localization,” in European Conf. on Mobile Robots, 2019.
[10] P. Hansen and B. Browning, “Visual place recognition using HMM sequence matching,” in Int. Conf. on Intel. Robots and Systems, 2014.
[11] T. Naseer, L. Spinello, W. Burgard, and C. Stachniss, “Robust visual robot localization across seasons using network flows,” in Proc. of AAAI Conf. on Artificial Intelligence, 2014.
[12] X. Zhang, L. Wang, Y. Zhao, and Y. Su, “Graph-based place recognition in image sequences with CNN features,” Journal of Intelligent & Robotic Systems, vol. 95, no. 2, pp. 389–403, 2019.
[13] M. Cummins and P. Newman, “FAB-MAP: Probabilistic localization and mapping in the space of appearance,” The Int. J. of Robotics Research, vol. 27, no. 6, 2008.
[14] C. McManus, B. Upcroft, and P. Newman, “Scene signatures: Localised and point-less features for localisation,” in Proc. of Robotics: Science and Systems, 2014.
[15] P. Neubert, S. Schubert, and P. Protzel, “Exploiting intra database similarities for selection of place recognition candidates in changing environments,” Computer Vision and Pattern Recognition Workshop on Visual Place Recognition in Changing Environments, 2015.
[16] O. Vysotska and C. Stachniss, “Relocalization under substantial appearance changes using hashing,” Int. Conf. on Intelligent Robots and Systems Workshop PPNIV’17, 2017.
[17] F. Dellaert and M. Kaess, Factor Graphs for Robot Perception, 2017.
[18] N. Sünderhauf and P. Protzel, “Switchable constraints for robust pose graph SLAM,” in Int. Conf. on Intelligent Robots and Systems, 2012.
[19] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “G2o: A general framework for graph optimization,” in Int. Conf. on Robotics and Automation, 2011.
[20] N. Sünderhauf, P. Neubert, and P. Protzel, “Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons,” Int. Conf. on Rob. & Autom. Workshop on Long-Term Autonomy, 2013.
[21] A. Glover, W. Maddern, M. Milford, and G. Wyeth, “FAB-MAP + RatSLAM: Appearance-based SLAM for Multiple Times of Day,” in Proc. of Int. Conf. on Robotics and Automation, 2010.
[22] H. Badino, D. Huber, and T. Kanade, “Visual topometric localization,” in Proc. of Intelligent Vehicles Symp., 2011.
[23] A. Glover, “Day and night with lateral pose change datasets,” https://wiki.qut.edu.au/display/raq/Day+and+Night+with+Lateral+Pose+Change+Datasets, 2014.
[24] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,” The Int. J. of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
[25] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, and E. Romera, “Towards life-long visual localization using an efficient matching of binary sequences from images,” in Int. Conf. on Rob. & Autom., 2015.
[26] S. Agarwal and Others, “Ceres solver,” http://ceres-solver.org.
[27] D. Koller and N. Friedman,