OpenMPR: Recognize Places Using Multimodal Data for People with Visual Impairments
Ruiqi Cheng & Kaiwei Wang & Jian Bai
State Key Laboratory of Modern Optical Instrumentation, Zhejiang University, Hangzhou, China
E-mail: [email protected]
Zhijie Xu
School of Computing and Engineering, University of Huddersfield, Queensgate, Huddersfield, UK
November 2018
Abstract.
Place recognition plays a crucial role in navigational assistance, and is also a challenging issue of assistive technology. Place recognition is prone to erroneous localization owing to various changes between database and query images. Aiming at wearable assistive devices for visually impaired people, we propose an open-source place recognition algorithm, OpenMPR, which utilizes multimodal data to address the challenging issues of place recognition. Compared with conventional place recognition, the proposed OpenMPR not only leverages multiple effective descriptors, but also assigns different weights to those descriptors in image matching. Incorporating GNSS data into the algorithm, cone-based sequence searching is used for robust place recognition. The experiments illustrate that the proposed algorithm manages to solve the place recognition issue in real-world scenarios and surpasses the state-of-the-art algorithms in terms of assistive navigation performance. On the real-world testing dataset, the online OpenMPR achieves 88.7% precision at 100% recall without illumination changes, and achieves 57.8% precision at 99.3% recall with illumination changes. OpenMPR is available at https://github.com/chengricky/OpenMultiPR.
Keywords: Visual Localization, Computer Vision, Navigational Assistance, Assistive Technology
1. Introduction
Vision provides people with the majority of environmental information. Up to 253 million people in the world live with visual impairments [1], and they encounter various difficulties in their daily life. Visually impaired people have limited capability to acquire spatial knowledge [2], hence visual place recognition is desired by them, especially in complex and unfamiliar outdoor environments.

Over the decades, GNSS (global navigation satellite system) has become a prevailing approach to positioning in many applications, such as vehicle navigation and engineering measurement. In order to promote positioning performance, a number of GNSS processing methods [3, 4, 5, 6, 7] have been proposed by the research community to reduce the localization error to as little as several millimeters. However, on low-cost portable devices, the performance of GNSS localization is usually insufficient for the localization demands of visually impaired people. Compared with that, optical images containing extra positioning cues can be exploited to achieve precise localization. Leveraging images to localize is known as place recognition, which is to select the corresponding image of a given query image from the database images.

The challenging issues of place recognition lie in applying place recognition algorithms to real-world scenarios, where the visual appearance of query and database images suffers from variations, such as illuminance changes and viewpoint changes [8]. With the proliferation of computer vision, the challenging place recognition task has attracted many researchers to make contributions in this area. Apart from the appearance changes between database and query images, navigational assistance for people with visual impairments brings in more challenges for the task of place recognition. In the research area of intelligent vehicles, the stationary car-mounted cameras capture images with high resolution and large field of view, and an accuracy of several tens of meters is sufficient for car localization. However, the images captured by wearable devices usually feature low quality, such as severe motion blur and continuously changing viewpoints. Moreover, assistive navigation requires more accurate localization, especially at some key positions like street corners, gates and bus stations.

In our previous work [9], multimodal images and GNSS data were used to achieve key position prediction, which aimed to localize the visually impaired person at positions of interest. Besides, we also implemented Visual Localizer [8], which utilized CNN (convolutional neural network) descriptors and a data association graph to achieve place recognition for visually impaired people. Aiming at the scenarios of assistive
Figure 1.
The schematic diagram of OpenMPR, an open-source multimodal place recognition algorithm proposed in this paper.

technology, we propose a real-time place recognition algorithm, OpenMPR (open-source multimodal place recognition), which extends our preceding research. In this paper, multiple descriptors of multimodal data and parameter tuning schemes are incorporated to robustify the performance of place recognition in the real world. Compared with existing algorithms, OpenMPR runs in an online fashion where only the "past" query images are utilized for place recognition, hence it can be used on wearable assistive devices in real time.

The place recognition procedure of OpenMPR is shown in Figure 1. Multiple descriptors are extracted from the multimodal data in both database and query sequences, and the multiple distance matrices are subsequently calculated. Subsequently, the score matrix is synthesized from the distance matrices of the different modal data. Finally, the place recognition results are selected from the candidates with high matching scores. The contributions of this paper are summarized as follows:

• To cope with the appearance changes in place recognition, multimodal data, including images of different modalities and GNSS data, are leveraged for place recognition tasks.

• In order to exploit the latent "place fingerprint" embedded in those data, training-free multiple image descriptors are utilized. The weights of those descriptors are tuned to improve the performance of place recognition.

• Aiming at tackling the localization issues of people with impaired vision, we propose an online place recognition algorithm, OpenMPR, that surpasses the state of the art, and we release a place recognition dataset with multimodal data for assistive navigation.
2. State of the Art
Place recognition is a prevalent research topic among the communities of computer vision and robotics. According to the type of map abstraction, visual localization falls into metric place recognition and topological place recognition [10]. Metric place recognition returns localization results with metric information. It includes various SLAM (simultaneous localization and mapping) systems (e.g. ORB-SLAM2 [11]) and deep pose prediction networks (e.g. PoseNet [12]). Although SLAM systems build three-dimensional metric maps which can be reused to estimate precise camera poses, they are not suitable for visual localization in changing and large-scale outdoor environments. The deep networks, though featuring superior robustness against appearance changes, need to be trained exclusively for each region to predict camera poses in that specific region. For building metric maps, video streams are required as input data to ensure enough scene overlap between successive frames, which is not necessarily available to wearable assistive devices with limited computational resources. Therefore, metric place recognition is not the optimal choice for assistive technology. Avoiding building metric maps, topological place recognition generates localization results without metric information. Topological place recognition is suitable for assistive navigation, considering it does not require high-performance hardware or ideal environments.

The community of autonomous vehicles has developed a number of algorithms to pursue better performance on topological place recognition. Using the bag-of-words method, OpenFAB-MAP [13] is one of the earliest open-source packages to achieve appearance-based place recognition. Different kinds of data were leveraged in the existing place recognition algorithms. GNSS priors were exploited in the computationally expensive matching process based on a minimum network flow model [14]. Sequence-based LDB (local difference binary) features derived from intensity, gradient and disparity images were utilized to depict images and achieved life-long visual localization in OpenABLE [15]. However, the multimodal LDB descriptors were simply concatenated into a single image feature, thus the weights of different modalities in place recognition were not considered. Multiple descriptors were leveraged to achieve sequence-based image matching [16], but only color images were used as visual knowledge. Taking advantage of sequence search and match selection, OpenSeqSLAM2.0 [17] designed configurable parameters to explore the optimal performance of place recognition under changing conditions.

Appearance variations impede the performance of visual place recognition, and many researchers are dedicated to mitigating their impact by different methods [8, 9, 12, 18]. The illumination change is one of the vital appearance variations, and quite a few place recognition algorithms [19, 20] addressed this issue. Illumination invariant transformation was proposed to improve visual localization performance during daylight hours [19]. Change removal based on unsupervised learning was utilized to achieve robust place recognition under day-to-night circumstances [20].
Despite the fact that inspiring progress has been obtained by those works, there are challenging issues to be addressed in place recognition for assistive navigation, which have not aroused sufficient attention from the research community.

To evaluate the performance of place recognition, substantial datasets have been proposed by the research community, and some typical datasets feature different appearance variations between query and database images. Those datasets involve the cross-season Nordland dataset [21] as well as the Gardens Point Walking dataset [22] with viewpoint and illuminance variations. The Bonn dataset [23] and the Freiburg dataset [14] both feature multiple variations, including season, illuminance and viewpoint changes. Most of these datasets are designed for place recognition on autonomous vehicles, and the images captured by car-mounted cameras are different from those captured by wearable devices. Besides, the ground truths of those datasets are labeled with GNSS data, hence the localization resolution is not sufficient for assistive technology. To the best of our knowledge, no dataset with multimodal images for assistive technology has been released.
3. OpenMPR
Different from the existing place recognition approaches, OpenMPR leverages multimodal data to address the issues of place recognition. Apart from vanilla color images, other visual modalities (i.e. depth images and infrared images), as well as GNSS data, are also considered in the system. Multiple descriptors are utilized to exploit the latent information embedded in the multimodal data.
The multimodal images involved in OpenMPR are color images, depth images and near-infrared images. The vanilla color image is an indispensable modality in the place recognition task, in that it conveys both holistic scenes and local textures with chromatic visual cues. Compared with color images, infrared images occupy a longer-wavelength band in the spectrum, thus naturally carrying different scene information. Depth images contain three-dimensional shapes, which reduce the odds of mismatching between query and database images. In order to describe scenes comprehensively, not only are the multimodal images captured to enrich the input information, but also both holistic and local image descriptors are utilized to extract the key visual cues embedded in the images. As shown in Figure 2, four training-free or pre-trained descriptors are chosen to depict scenes, which avoids training procedures toward the regions to be deployed, so that the algorithm can be applied to assistive navigation. The descriptor vector extracted by descriptor f from modality m is defined as d^{f,m} in this paper. The descriptor f can be one of the descriptors in the set

F = { GIST, LDB, BoW, CNN },   (1)

and the modality m can be one of the modalities in the set

M = { color, depth, infrared }.   (2)

The concrete extraction configurations of those descriptors have been illustrated in our previous work [9, 8]. Herein, we summarize the descriptor extraction as follows. Based on the local feature ORB (oriented FAST and rotated BRIEF) [24], BoW (bag of words) characterizes the image details by the occurrence of each visual word clustered from local features. BoW is widely applied to object and scene categorization, due to its simplicity, computational efficiency and invariance to affine transformation [25]. In this paper, the key points [see Figure 2 (c1)] are detected by oriented FAST (features from accelerated segment test) and are described by rBRIEF (rotated binary robust independent elementary features). The ORB descriptors of all key points are merged together and compose the concatenated descriptors [see Figure 2 (c2)]. Subsequently, the BoW descriptor [see Figure 2 (c3)] is generated using the extracted ORB descriptors and the pre-trained vocabularies [26]. In view that the off-the-shelf vocabularies were trained on photometric images, BoW descriptors are extracted from the color and infrared modalities.
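To make this step concrete, the sketch below shows how such a BoW descriptor could be computed with OpenCV's ORB implementation and the DBoW3 library that OpenMPR depends on; the vocabulary file name, the feature count and the header path are assumptions rather than values taken from the paper.

```cpp
// Hedged sketch: extracting a BoW descriptor from one image with ORB + DBoW3.
#include <opencv2/opencv.hpp>
#include "DBoW3.h"   // header name may differ depending on how DBoW3 is installed

DBoW3::BowVector extractBoW(const cv::Mat& image, const DBoW3::Vocabulary& voc)
{
    // Detect oriented FAST key points and compute rBRIEF descriptors.
    cv::Ptr<cv::ORB> orb = cv::ORB::create(/*nfeatures=*/1000);
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                    // one rBRIEF descriptor per row
    orb->detectAndCompute(image, cv::noArray(), keypoints, descriptors);

    // Quantize the local descriptors against the pre-trained vocabulary,
    // yielding a sparse histogram of visual-word occurrences.
    DBoW3::BowVector bow;
    voc.transform(descriptors, bow);
    return bow;
}

int main()
{
    DBoW3::Vocabulary voc("orb_vocab.dbow3");   // hypothetical vocabulary file
    cv::Mat image = cv::imread("query.png", cv::IMREAD_GRAYSCALE);
    DBoW3::BowVector bow = extractBoW(image, voc);
    return 0;
}
```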
The holistic image descriptors, i.e. the GIST [27, 28], LDB [29] and CNN descriptors, emphasize whole visual features rather than local details, hence they are used to alleviate the impact of appearance changes on image matching. As shown in Figure 2 (a), the LDB descriptor is extracted as a global descriptor after the preprocessing of illumination invariance transformation. It is worthwhile to note that bit selection [29] is not executed in this paper, in that the compression of the global descriptor hinders the performance of image description. LDB descriptors are extracted from all of the modalities separately.
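The sketch below is one plausible OpenCV implementation of that preprocessing, following the illumination-invariant image of [19]; the camera-dependent parameter alpha used here is an assumption, not a value from the paper.

```cpp
// Hedged sketch of the illumination invariance transformation from [19], applied
// before LDB extraction: I = 0.5 + log(G) - alpha*log(B) - (1 - alpha)*log(R).
#include <opencv2/opencv.hpp>

cv::Mat illuminationInvariant(const cv::Mat& bgr, double alpha = 0.47)
{
    cv::Mat f;
    bgr.convertTo(f, CV_32FC3, 1.0 / 255.0, 1e-6);   // small offset avoids log(0)
    std::vector<cv::Mat> ch(3);
    cv::split(f, ch);                                 // ch[0]=B, ch[1]=G, ch[2]=R
    cv::Mat logB, logG, logR;
    cv::log(ch[0], logB);
    cv::log(ch[1], logG);
    cv::log(ch[2], logR);
    cv::Mat ii = 0.5 + logG - alpha * logB - (1.0 - alpha) * logR;
    return ii;   // single-channel image, largely invariant to illuminant changes
}
```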
Also a holistic image descriptor, GIST represents the scene with a very low-dimensional vector. The global GIST descriptor is extracted from the preprocessed image, which involves image normalization [see Figure 2 (b1)], Gabor filtering [see Figure 2 (b2)] and response averaging [see Figure 2 (b3)]. GIST descriptors are extracted from all of the modalities separately.
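OpenMPR itself relies on LibGIST [35] for this step; the sketch below is only a schematic GIST-style pipeline built from OpenCV's Gabor kernels, with a filter bank and grid size that are assumptions.

```cpp
// Hedged GIST-style sketch: Gabor filter bank, rectified responses averaged
// over a 4x4 grid, mirroring the (b1)-(b3) steps in the text.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<float> gistDescriptor(const cv::Mat& gray8u)
{
    cv::Mat img;
    gray8u.convertTo(img, CV_32F, 1.0 / 255.0);        // (b1) normalization
    std::vector<float> desc;
    const int grid = 4;
    for (int scale = 0; scale < 3; ++scale) {
        for (int ori = 0; ori < 8; ++ori) {
            double theta = CV_PI * ori / 8.0;
            double lambd = 8.0 * (scale + 1);           // wavelength grows with scale
            cv::Mat kernel = cv::getGaborKernel(cv::Size(31, 31), 4.0, theta,
                                                lambd, 0.5, 0.0, CV_32F);
            cv::Mat resp;
            cv::filter2D(img, resp, CV_32F, kernel);    // (b2) Gabor filtering
            resp = cv::abs(resp);
            // (b3) average the rectified response within each grid cell
            int ch = resp.rows / grid, cw = resp.cols / grid;
            for (int r = 0; r < grid; ++r)
                for (int c = 0; c < grid; ++c)
                    desc.push_back(static_cast<float>(
                        cv::mean(resp(cv::Rect(c * cw, r * ch, cw, ch)))[0]));
        }
    }
    return desc;   // 3 scales x 8 orientations x 16 cells = 384 dimensions
}
```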
Different from the hand-crafted descriptors above, descriptors selected from a CNN are also used to enhance the description ability of the system. As presented in Figure 2 (d), the CNN descriptor is generated from the intermediate layers of a pre-trained GoogLeNet fed with the preprocessed color image. The compressed concatenation of two intermediate inception layers of the GoogLeNet pre-trained on the Places365 dataset [30] is used as the image descriptor. Constrained by the structure of GoogLeNet, the CNN descriptor is only extracted from color images.
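A sketch of this extraction with OpenCV's dnn module is shown below; the model files and output layer names are placeholders, not the exact layers used by OpenMPR, and the compression step is only indicated.

```cpp
// Hedged sketch of CNN descriptor extraction with cv::dnn. Load the network once,
// e.g. cv::dnn::Net net = cv::dnn::readNet("places365_googlenet.caffemodel",
// "deploy.prototxt");   // placeholder file names
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>

cv::Mat cnnDescriptor(const cv::Mat& color, cv::dnn::Net& net)
{
    cv::Mat blob = cv::dnn::blobFromImage(color, 1.0, cv::Size(224, 224),
                                          cv::Scalar(104, 117, 123));
    net.setInput(blob);
    // Forward pass up to the two chosen intermediate layers (placeholder names).
    std::vector<cv::Mat> outs;
    net.forward(outs, std::vector<cv::String>{"inception_layer_a",
                                              "inception_layer_b"});
    cv::Mat a = outs[0].reshape(1, 1), b = outs[1].reshape(1, 1);
    cv::Mat desc;
    cv::hconcat(a, b, desc);      // concatenation; compression (e.g. pooling) omitted
    cv::normalize(desc, desc);    // L2-normalize for Euclidean matching
    return desc;
}
```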
Figure 2. The multiple descriptors extracted from the multimodal images: (a) LDB, (b) GIST, (c) bag of words, (d) CNN.
The extracted multiple descriptors { d^{f,m} | f ∈ F, m ∈ M } are leveraged to measure the similarity between images, and thus to characterize the correspondence of query images and database images. In this paper, sequential images, rather than single images, are utilized during image matching. Assuming that the query sequence has the size n and the database sequence has the size l, the distance matrix features the size n × l. Herein, we define D^{f,m} as the distance matrix of descriptor f extracted from modality m. The element D^{f,m}_{i,j} of the matrix D^{f,m} is attained by measuring the descriptor distance between the i-th query image and the j-th database image. The distance measurement varies between descriptors: for binary descriptors (LDB), the Hamming distance is measured as the distance of images, while the distances of GIST and CNN descriptors are measured with the Euclidean distance.

Despite their insufficient positioning accuracy, GNSS data consisting of the coordinates of longitude and latitude provide a priori knowledge for visual place recognition. With the GNSS priors, those query-database pairs that leave a large spatial distance between each other need not be matched, so as to improve the computational efficiency and to reduce the possibility of image mismatching. The metric distance between the i-th query image and the j-th database image is specified as G_{i,j}, hence the final distance matrix E^{f,m} containing the GNSS data is obtained by

E^{f,m}_{i,j} = D^{f,m}_{i,j} if G_{i,j} ≤ g, and E^{f,m}_{i,j} = ∞ if G_{i,j} > g,   (3)

where g is the threshold of possible matching pairs. The smaller the threshold g, the smaller the searching range of image matching. Considering the observation error of the GNSS module used in this paper, the threshold g should not be too small, otherwise correct matching results would be ruled out. In this paper, g is set to 15 meters.
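A minimal sketch of this gating step is given below. The haversine formula is one common way to turn latitude/longitude pairs into the metric distance G_{i,j}; the paper does not specify which geodesic formula is used, so that choice is an assumption.

```cpp
// Hedged sketch of the GNSS gating in Eq. (3): entries of the distance matrix
// whose metric distance exceeds g are set to infinity.
#include <cmath>
#include <limits>
#include <opencv2/opencv.hpp>

double metricDistance(double lat1, double lon1, double lat2, double lon2)
{
    const double R = 6371000.0, d2r = CV_PI / 180.0;   // Earth radius in meters
    double dlat = (lat2 - lat1) * d2r, dlon = (lon2 - lon1) * d2r;
    double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
               std::cos(lat1 * d2r) * std::cos(lat2 * d2r) *
               std::sin(dlon / 2) * std::sin(dlon / 2);
    return 2.0 * R * std::asin(std::sqrt(a));          // haversine distance
}

// D: n x l descriptor-distance matrix; qGnss/dbGnss: (lat, lon) per image; g = 15 m.
void applyGnssGate(cv::Mat& D, const std::vector<cv::Point2d>& qGnss,
                   const std::vector<cv::Point2d>& dbGnss, double g = 15.0)
{
    for (int i = 0; i < D.rows; ++i)
        for (int j = 0; j < D.cols; ++j)
            if (metricDistance(qGnss[i].x, qGnss[i].y,
                               dbGnss[j].x, dbGnss[j].y) > g)
                D.at<float>(i, j) = std::numeric_limits<float>::infinity();
}
```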
Having obtained a distance matrix E^{f,m}, we execute an online cone-based search upon every query-database pair, which achieves sequential image matching and produces a matching score for each pair. As shown in Figure 3, the horizontal axis denotes the database sequence, and the vertical axis denotes the query sequence. Within the distance matrix, each query-database pair (i, j) is associated with only one cone region, which is limited by the sequential length n_q, the maximal velocity v_max and the minimal velocity v_min. The online cone-based searching algorithm proposed in this paper is different from the offline one in [31]: the offline searching algorithm makes use of the "future" query images, thus place recognition cannot run in real time.

Figure 3. The online cone-based searching schematics.
Within the region, the number of best-matching pairs (represented by blue squares in Figure 3) is counted first. A best-matching pair is defined as the minimum value of a certain row in the distance matrix. In other words, a query descriptor and the database descriptor featuring the minimum distance to that query descriptor compose a best-matching pair. Herein, the number of best-matching pairs in a cone region is defined as n_match, and the score s_{i,j} of the query-database pair (i, j) is defined as

s_{i,j} = n_match / n_q.   (4)

Naturally, all of the matching scores s_{i,j} form a score matrix S^{f,m}. Multiple descriptors extracted from different modalities carry diverse visual information, so assigning the same weight to different descriptors during image matching does not necessarily promote the matching robustness. Therefore, the coefficients of score matrix synthesis {λ^{f,m}} need to be adjusted for better place recognition accuracy. The score matrices derived from the different descriptors of the different modalities are synthesized into a single score matrix S, which is presented as

S_{i,j} = ( Σ_{f∈F, m∈M} λ^{f,m} × S^{f,m}_{i,j} ) / ( Σ_{f∈F, m∈M} λ^{f,m} ).   (5)

The genetic algorithm [32] is used to determine the values of {λ^{f,m}}, which is described later in Section 4. With the matching score matrix, each query image corresponds to the best database image with the highest score. In order to get the final place recognition results, the matching score of the best query-database pair is evaluated to rule out mismatching pairs. In this paper, we use score thresholding [17] to remove low-confidence matching results whose scores are lower than a threshold t.
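The sketch below illustrates Eq. (4) for a single pair (i, j) and the synthesis of Eq. (5), under the assumption that the per-row best matches have already been computed; it mirrors the description above rather than the exact OpenMPR implementation.

```cpp
// Hedged sketch of online cone-based scoring (Eq. 4) and score synthesis (Eq. 5).
// bestDb[i] holds the database index with minimum distance in query row i
// (the best-matching pair of that row); only "past" rows i' <= i are used.
#include <opencv2/opencv.hpp>
#include <vector>

float coneScore(const std::vector<int>& bestDb, int i, int j,
                int nq, double vmin, double vmax)
{
    int nMatch = 0;
    for (int di = 0; di < nq && i - di >= 0; ++di) {
        int qi = i - di;                               // an earlier ("past") query image
        double lo = j - vmax * di, hi = j - vmin * di; // cone bounds at row qi
        if (bestDb[qi] >= lo && bestDb[qi] <= hi)
            ++nMatch;                                  // best-matching pair inside cone
    }
    return static_cast<float>(nMatch) / nq;            // Eq. (4)
}

// Eq. (5): weighted average of the per-descriptor score matrices.
cv::Mat synthesize(const std::vector<cv::Mat>& S, const std::vector<double>& lambda)
{
    cv::Mat sum = cv::Mat::zeros(S[0].size(), CV_32F);
    double wsum = 0.0;
    for (size_t k = 0; k < S.size(); ++k) {
        sum += lambda[k] * S[k];
        wsum += lambda[k];
    }
    return sum / wsum;
}
```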
Figure 4. The assistive device Intoer is used to capture multimodal data.
The proposed OpenMPR algorithm is implemented in C++, considering portability and effectiveness. The open-source code of OpenMPR is available online [33]. The dependencies include OpenCV 4.0 [34], as well as DBoW3 [26] for BoW extraction, LibGIST [35] for GIST extraction, libLDB [29] for LDB extraction, and OpenGA [32] for parameter tuning.

The settings of OpenMPR can be easily switched through the configuration file Config.yaml. There are two modes in OpenMPR: testing mode and tuning mode. In testing mode, place recognition is executed using the default or customized parameters. In tuning mode, the optimal parameters are searched to achieve the best performance. The other configurable parts of OpenMPR involve the resolution of the input images, whether to use GNSS data, and whether to use a certain image modality or image descriptor.
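As an illustration of how such a configuration could be parsed, the sketch below reads a Config.yaml-style file with OpenCV's cv::FileStorage; the key names and values are hypothetical, the actual keys being defined in the OpenMPR repository.

```cpp
// Hedged sketch: reading a Config.yaml-style file with cv::FileStorage.
#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    cv::FileStorage fs("Config.yaml", cv::FileStorage::READ);
    if (!fs.isOpened()) { std::cerr << "cannot open Config.yaml\n"; return 1; }

    std::string mode;            // "testing" or "tuning" (hypothetical values)
    int width = 0, height = 0;
    int useGnss = 0, useDepth = 0;
    fs["Mode"] >> mode;          // all key names below are hypothetical
    fs["ImageWidth"] >> width;
    fs["ImageHeight"] >> height;
    fs["UseGNSS"] >> useGnss;    // whether GNSS priors gate the matching
    fs["UseDepth"] >> useDepth;  // whether the depth modality is used
    fs.release();

    std::cout << "mode=" << mode << " res=" << width << "x" << height << "\n";
    return 0;
}
```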
4. Experiments
In this section, the real-world place recognition dataset collected by the assistive device is illustrated first. In order to achieve the optimal performance of place recognition, the experiments on parameter tuning were carried out, and the tuning results are analyzed thoroughly. Finally, the state-of-the-art performance of OpenMPR is validated through a comparative study.

In view that OpenMPR is intended to be implanted into assistive devices, the experiments were carried out on the assistive device Intoer [36], which is shown in Figure 4. The assistive device Intoer is utilized not only to capture multimodal images and GNSS data but also to run the OpenMPR algorithm.
Table 1. The specifications of multimodal images captured by the ZR300.
             Color      Depth      Infrared
Resolution   320 × 240  320 × 240  320 × 240

In view that a place recognition dataset with multimodal data has not been released, we collected a real-world dataset, available at [33], within the Yuquan campus of Zhejiang University.

One frame of data consists of a color image, a depth image, an infrared image, and a GNSS coordinate. The multimodal images were collected using the Intel RealSense ZR300 camera [37] embedded in Intoer, which is an infrared-assisted stereo vision camera. In terms of the effective range and density of depth images, the RealSense ZR300 represents a moderate level among commercial RGB-D cameras. Thereby, the dataset proposed in this paper is adequate to evaluate the performance of OpenMPR. The imaging specifications are illustrated in Table 1. The GNSS data were collected with the customized GNSS receiver embedded in Intoer.

Up to 1,671 frames of data are involved in the dataset, where four subsets were collected on three routes, as shown in Table 2 and Figure 5. Although Train-1 and Train-2 cover the same route, they were collected in opposite traversing directions. It is worthwhile to note that no route overlap exists between the training subsets and the testing subsets. Moreover, the images of the four subsets were not selected artificially. In the experiments, Train-1 and Train-2 are utilized to tune the parameters, and Test-3 and Test-4 are used to validate the performance of multimodal place recognition.

Each subset is composed of one query sequence and one database sequence. The collected multimodal images feature apparent viewpoint changes between the query and database sequences, since the camera is embedded in the wearable device and the query and database were not captured on completely identical routes. Apart from that, all of the images also present dynamic object changes between query and database. For example, a person passing by in front of the camera appears in the query sequence, but does not appear in the database sequence. Moreover, illumination changes exist in Train-2 and Test-4: in those subsets, the database and query images were captured in the afternoon and at dusk respectively. All of those real-world changes form substantial challenges for place recognition. In brief, the dataset was collected in real-world scenarios, and is suitable for evaluating place recognition in assistive navigation.
Figure 5. The three traversed paths in the multimodal place recognition dataset. Route A: from the teaching building to the gate (orange). Route B: from the teaching building to the library (green). Route C: from the library to the teaching building (blue).
Table 2. The characteristics of the multimodal place recognition dataset. v = viewpoint changes, o = dynamic objects, and i = illumination changes.
Subset   Changes
Train-1  v, o
Train-2  v, o, i
Test-3   v, o
Test-4   v, o, i
In order to optimize the performance of place recognition, a series of parameters is tuned on the training datasets, which are separate from the testing datasets. The parameters include the length (n_q) and velocity limits (v_max and v_min) of the cone search, the coefficients of score matrix synthesis (λ^{f,m}), and the threshold t of score thresholding.

If a query image matches a database image, the result is defined as a positive result. If no database image is matched, the result is defined as a negative result. Considering that the query and database images are sequential in the dataset, the place recognition result of a query image can be represented as the sequential index of the best-matching database image. If the index difference between the place recognition result and the ground truth is less than or equal to the tolerance (set to 5 in this paper), the result is defined as a TP (true positive) result. Otherwise, the positive result is defined as an FP (false positive) result. Moreover, if the result should match a database image but does not match any database image, the result is defined as an FN (false negative) result.

Precision = TP / (TP + FP)   (6)
Recall = TP / (TP + FN)   (7)

In this section, the objective of parameter tuning is to choose the parameters that achieve the greatest F score:

F = 2 × Precision × Recall / (Precision + Recall).   (8)

The following section outlines the procedures and results of configurable parameter tuning.
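As a worked example of Eqs. (6)-(8), the helper below computes the three metrics from TP/FP/FN counts accumulated over a query sequence; the counts in main are illustrative only.

```cpp
// Minimal helper for Eqs. (6)-(8): precision, recall and F score from counts.
#include <iostream>

struct PRF { double precision, recall, f; };

PRF evaluate(int tp, int fp, int fn)
{
    double p = tp / static_cast<double>(tp + fp);   // Eq. (6)
    double r = tp / static_cast<double>(tp + fn);   // Eq. (7)
    return {p, r, 2.0 * p * r / (p + r)};           // Eq. (8)
}

int main()
{
    PRF m = evaluate(/*tp=*/120, /*fp=*/15, /*fn=*/2);   // illustrative counts
    std::cout << m.precision << " " << m.recall << " " << m.f << "\n";
    return 0;
}
```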
The coefficient λ^{f,m} denotes the importance of the specific descriptor d^{f,m} during place recognition. In order to tune the coefficients efficiently, we leverage the genetic algorithm implemented by [32] to seek the optimal combination of coefficients. The length of the cone region (n_q) is set to 1, so that the sequential searching does not affect coefficient optimization.

The genetic algorithm is an analogue of natural selection, which optimizes the parameters (genes) by bio-inspired operators such as mutation, crossover and selection. The coefficient array with the size of 9 (see Table 3) is defined as the genes, and the fitness of the genes is evaluated with the F score at that coefficient combination. The principles and implementation details of the genetic algorithm can be found in [32]. Empirically, the maximum number of generations (80 in this paper) is set as the stopping criterion, which is sufficient for the genetic algorithm to generate a stable iterative result. The genetic algorithm runs multiple times (15 in this paper) with randomly initialized genes to avoid local optima. The mean coefficients of the multiple results obtained by the genetic algorithm are set as the final parameter searching results. The mean coefficients of the two datasets are chosen as the optimal parameters, as presented in Table 3.

As demonstrated in Table 2, Train-1 suffers from viewpoint changes and dynamic objects, while Train-2 suffers from more illumination changes than Train-1. On Train-2, the CNN descriptor presents the highest weight compared with the other descriptors, which illustrates that the descriptors derived from the GoogLeNet pre-trained on Places365 yield superior description performance even under severe changes. On Train-1, the GIST descriptors show better description performance compared with the other descriptors, which indicates that the GIST descriptor is suitable for depicting images without large illumination changes. Besides, LDB presents suboptimal place recognition performance in the complicated environments. The dataset used in this paper features severe viewpoint changes and dynamic objects, hence the holistic descriptors are important for grasping the global information.

Compared with the holistic descriptors, the performance of the BoW descriptors on the two datasets reveals that BoW is advantageous and stable for place recognition in various environments. More conclusions can be drawn when inspecting the results on the color and infrared modalities carefully. The BoW descriptor on the infrared modality features a higher weight than that on the color modality. A reasonable explanation is that the local ORB features are susceptible to image details with motion blur, which is prone to occur on color images captured with the rolling shutter. On the contrary, the infrared modality features better imaging stability thanks to the global shutter.
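The following is a conceptual sketch of this tuning loop, not OpenGA's actual API: a genome is the 9-element coefficient array {λ^{f,m}}, the fitness is the F score obtained with those coefficients, and the population evolves by selection, crossover and mutation for a fixed number of generations, as described above. The population size, mutation noise and fitness stub are assumptions.

```cpp
#include <algorithm>
#include <array>
#include <random>
#include <utility>
#include <vector>

using Genome = std::array<double, 9>;

// Placeholder fitness: in OpenMPR this would run the whole matching pipeline on a
// training subset with the given coefficients and return the resulting F score.
double computeF(const Genome& lambda)
{
    double s = 0.0;
    for (double x : lambda) s += x;   // dummy body so the sketch compiles
    return s;
}

Genome tuneCoefficients(std::mt19937& rng, int popSize = 50, int generations = 80)
{
    std::uniform_real_distribution<double> init(0.0, 3.0), unit(0.0, 1.0);
    std::normal_distribution<double> noise(0.0, 0.1);

    std::vector<Genome> pop(popSize);
    for (auto& g : pop)
        for (auto& x : g) x = init(rng);              // randomly initialized genes

    for (int gen = 0; gen < generations; ++gen) {
        // Evaluate and rank the population by F score (descending).
        std::vector<std::pair<double, Genome>> scored(popSize);
        for (int k = 0; k < popSize; ++k) scored[k] = {computeF(pop[k]), pop[k]};
        std::sort(scored.begin(), scored.end(),
                  [](const auto& a, const auto& b) { return a.first > b.first; });
        for (int k = 0; k < popSize; ++k) pop[k] = scored[k].second;

        // Keep the better half; refill by crossover and mutation of random parents.
        for (int k = popSize / 2; k < popSize; ++k) {
            const Genome& pa = pop[rng() % (popSize / 2)];
            const Genome& pb = pop[rng() % (popSize / 2)];
            for (int d = 0; d < 9; ++d) {
                pop[k][d] = (unit(rng) < 0.5 ? pa[d] : pb[d]) + noise(rng);
                pop[k][d] = std::max(0.0, pop[k][d]);  // keep weights non-negative
            }
        }
    }
    return pop[0];   // best coefficient combination found
}
```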
Having chosen the optimal coefficients (shown in Table 3), the tuning procedures for the other parameters are executed. The parameters of the cone-based searching algorithm include the length of the sequence (n_q) and the velocity limits (v_max and v_min). They represent the quantity of information used in the cone search. In the parameter sweeping, the maximal velocity (v_max) is set as the reciprocal of the minimal velocity (v_min), so there are only two parameters to be tuned. The minimal velocity (v_min) is varied from 0.1 to 0.75, and the length (n_q) is varied from 3 to 79.

Different velocity limits of the cone-based search are used to test the performance of place recognition. Figure 6 demonstrates that v_min ≥ 0.4 (v_max ≤ 2.5) features good performance, and that a larger velocity range results in suboptimal performance. A large searching range introduces more best-matching pairs, but meanwhile introduces more potentially inaccurate results. The velocity limits should be moderate to tolerate real-world conditions, such as the inconsistency of the carrier's walking speed when recording the query and database sequences. Thereby, we set v_min = 0.4 and v_max = 2.5.

Figure 7 presents the sweeping results of the sequence length (n_q). Whether n_q is too large or too small, the performance is limited. For the sake of computational efficiency, we set the optimal parameter n_q to 10. The threshold t of score thresholding affects the precision and recall of place recognition.
Table 3. The searching results of the coefficients λ^{f,m} using the genetic algorithm (c = color, d = depth, i = infrared).

Dataset   λ^{BoW,c}  λ^{BoW,i}  λ^{GIST,c}  λ^{GIST,d}  λ^{GIST,i}  λ^{LDB,c}  λ^{LDB,d}  λ^{LDB,i}  λ^{CNN,c}  F
Train-1   1.359      1.666      2.269       1.102       0.986       0.617      0.469      1.042      0.491      0.73
Train-2   1.132      1.705      0.889       1.081       0.989       0.436      0.778      0.638      2.353      0.63
Optimal   1.245      1.685      1.579       1.091       0.987       0.526      0.623      0.840      1.422      -
Figure 6. The parameter sweeping results of v_min.
Figure 7. The parameter sweeping results of n_q.

The score threshold t is used to eliminate bad matching results and improve the performance of place recognition. As shown in Figure 8, the precision-recall curve under different thresholds t is plotted. When the threshold is low, matching results with low confidence degrade the precision of place recognition. On the contrary, a high threshold results in a low recall rate. The optimal value of the threshold t is set to 0.16, where the recall has not descended substantially and the precision maintains a high level.

In order to validate the parameter tuning results and the systematic performance of OpenMPR, the testing sets, whose routes are different from those of the training sets, are utilized to evaluate the proposed algorithm. As demonstrated in Table 2, viewpoint changes and dynamic objects exist in both subsets, while illumination changes are introduced in Test-4.
Figure 8. The precision-recall curve as the parameter t is swept.

To validate the effectiveness of the optimized coefficients {λ^{f,m}}, the place recognition performance under different coefficient configurations is compared. In addition to the optimized coefficients, the other configurations involve:

(1) λ = 1: let all of the coefficients be 1, which means that all descriptors feature the same importance.

(2) w/o certain descriptor: let the coefficients of the corresponding descriptor be 0, which means abandoning that descriptor in the system.

The mean localization error of the different coefficient configurations is shown in Figure 9. Herein, the localization error refers to the index difference between the OpenMPR result and the ground truth. The mean localization error is obtained by averaging the localization errors of all query images in the testing set. It is concluded that the configuration of optimized coefficients shows balanced performance on both testing sets, though it is not the best configuration on a single dataset.

On both testing sets, the configuration w/o BoW features the worst performance, which illustrates that the BoW descriptor is essential for place recognition. On Test-3, the configuration w/o CNN yields the optimal performance, and the configuration w/o GIST shows the suboptimal performance. Those phenomena are consistent with the analysis in Section 4.2.1 that the GIST descriptor, instead of the CNN descriptor, plays the vital role in place recognition if there are no illumination changes.
Figure 9. The mean localization error under different configurations of parameters.
Table 4. The place recognition results on the testing datasets.
Algorithm         Subset  Precision  Recall   Error
OpenMPR           Test-3  88.7%      100.0%
                  Test-4  57.8%      99.3%
OpenSeqSLAM2.0    Test-3  26.6%      34.0%    7.80
                  Test-4  25.7%      82.0%    47.91
Visual Localizer  Test-3  48.5%      100%     19.14
                  Test-4  58.4%      100%     9.99

In contrast, on the testing set with illumination changes, the GIST descriptor is no longer eligible for good performance, while the CNN descriptor and the other descriptors are indispensable.
With the optimal parameters determined in the preceding sections, the place recognition results of OpenMPR on the two testing sets are compared with the state-of-the-art place recognition algorithms. OpenSeqSLAM2.0 [17] and Visual Localizer [8] are chosen as the baselines of OpenMPR. Though OpenSeqSLAM2.0 was designed for visual place recognition on autonomous vehicles, it provides important inspiration in terms of sequence searching and matching selection techniques. In the experiments, the OpenSeqSLAM2.0 parameters related to sequence searching and matching selection were set to the optimal values presented above. As a preliminary work, Visual Localizer proposed a place recognition solution for the mobility of visually impaired people using pre-trained CNN descriptors and global optimization.

As shown in Table 4, three performance indicators (precision, recall and mean localization error) are leveraged to evaluate the results of OpenMPR on the two testing sets. In terms of mean localization error, the proposed OpenMPR is superior to the two state-of-the-art algorithms. According to the statistics of OpenMPR, the place recognition on Test-3 is more precise than that on Test-4, in that fewer appearance changes are involved in Test-3. Fortunately, with the help of the multiple descriptors extracted from multimodal images, the mean localization error of OpenMPR on Test-4 is acceptable, slightly exceeding the tolerance of 5. The GNSS priors play an important role in ruling out the definite negatives during image matching.

Compared with OpenMPR, OpenSeqSLAM2.0 yields inferior localization performance both on Test-3 and Test-4, in view of its low recall and large localization error. Apparently, OpenSeqSLAM2.0, which measures the similarity of images via the sum of absolute differences of normalized images, does not make place recognition robust against various appearance changes. The comparison between OpenMPR and OpenSeqSLAM2.0 justifies that the proposed image descriptors robustify place recognition under practical conditions. For Visual Localizer, the performance on Test-4 with more appearance changes surpasses that on Test-3, which resembles the phenomenon that the CNN descriptor features a higher weight on Train-2 than on Train-1. It is evident that the proposed CNN descriptor (the compressed concatenation of two intermediate GoogLeNet layers) is capable of extracting an effective semantic "place fingerprint" between images with large appearance changes. However, without the aid of the other descriptors and multimodal images, the performance of the CNN descriptor on Test-3 is limited, which further confirms the necessity of the multiple descriptors proposed in this paper.

As shown in Figure 10, the place recognition result of OpenMPR is visualized as a matrix with the size n × l, where n is the number of query images and l is the number of database images. Each element of the matrix denotes a query-database pair. In the matrix, green and red points denote the ground truths (with the tolerance of 5) and the localization results respectively. From the diagrams, it is concluded that the place recognition results basically conform to the corresponding ground truths, despite the serious viewpoint variations, motion blur and dynamic objects (e.g. pedestrians). Even on Test-4 with obvious illumination changes, most of the mismatched images are not far from the tolerance of place recognition.
In Figure 10, some successful matching results are presented, which indicates that OpenMPR still recognizes places under the conditions of various appearance changes.

Real-time performance is crucial for assistive navigation. OpenSeqSLAM2.0 uses both the "past" images and the "future" images during cone-based searching, hence it cannot be used in real time. Unfortunately, the network flow-based global optimization scheme embedded in Visual Localizer features inferior computational efficiency.
Figure 10. The place recognition results and some localization instances on the (a) Test-3 and (b) Test-4 datasets. In the left diagrams, the horizontal axis denotes the database sequence, and the vertical axis denotes the query sequence.

The single-frame computation speed is analyzed on the Intoer with an Intel Atom x5-Z8500 and on a desktop with an Intel Core i5-6500 to evaluate the real-time performance of OpenMPR, as shown in Table 5. The real-time requirement is basically satisfied by OpenMPR according to the results on the Intoer. With the update of the Intoer hardware, the real-time performance would be further improved, in view of the speed test on the desktop. After inspecting the running time of descriptor extraction, it is found that the time consumed in extracting GIST descriptors from the multimodal images accounts for the major proportion (more than 80% of descriptor extraction). In the future, applying a GIST library with superior computational efficiency to OpenMPR would lead to better real-time performance of the system.
5. Conclusion
Different from the majority of place recognition work, this paper focuses on the traveling demands of visually impaired people, and proposes an open-source software package, OpenMPR, which leverages multimodal data for the online place recognition task.

In the area of assistive technology, the wearable camera tends to capture images with motion blur and low resolution.
Table 5. The real-time performance of OpenMPR on different platforms.

Platform              Descriptor Extraction  Matching  Overall
Intel Atom x5-Z8500   2,056 ms               98 ms     2,154 ms
Intel Core i5-6500    362 ms                 25 ms     387 ms

Due to the limited computational resources, discrete images (one image per second in this paper), instead of video streams, are captured and processed on the portable devices. Apart from that, the query and database sequences feature various appearance changes, including viewpoint changes, illumination changes and dynamic objects. In those real-world scenarios, the proposed OpenMPR utilizes configured multiple descriptors extracted from multimodal data and online sequence-based searching to obtain good place recognition performance. It achieves 88.7% precision at 100% recall without illumination changes, and 57.8% precision at 99.3% recall with illumination changes.
6. Acknowledgments
This work was supported by the State Key Laboratory of Modern Optical Instrumentation.
7. References

[1] R. R. A. Bourne et al., "Magnitude, temporal trends, and projections of the global prevalence of blindness and distance and near vision impairment: a systematic review and meta-analysis," The Lancet Global Health, vol. 5, no. 9, pp. e888-e897, 2017.
[2] V. R. Schinazi, T. Thrash, and D.-R. Chebat, "Spatial navigation by congenitally blind individuals," Wiley Interdisciplinary Reviews: Cognitive Science, vol. 7, no. 1, pp. 37-58, 2016.
[3] J. Paziewski, R. Sieradzki, and R. Baryla, "Multi-GNSS high-rate RTK, PPP and novel direct phase observation processing method: application to precise dynamic displacement detection," Measurement Science and Technology, vol. 29, no. 3, p. 035002, 2018.
[4] R. Odolinski and P. J. G. Teunissen, "Low-cost, 4-system, precise GNSS positioning: a GPS, Galileo, BDS and QZSS ionosphere-weighted RTK analysis," Measurement Science and Technology, vol. 28, no. 12, p. 125801, 2017.
[5] R. Odolinski, P. J. G. Teunissen, and D. Odijk, "Combined GPS + BDS for short to long baseline RTK positioning," Measurement Science and Technology, vol. 26, no. 4, p. 045801, 2015.
[6] F. Guo and X. Zhang, "Adaptive robust Kalman filtering for precise point positioning," Measurement Science and Technology, vol. 25, no. 10, p. 105011, 2014.
[7] E. Realini and M. Reguzzoni, "goGPS: open source software for enhancing the accuracy of low-cost receivers by single-frequency relative kinematic positioning," Measurement Science and Technology, vol. 24, no. 11, p. 115010, 2013.
[8] S. Lin, R. Cheng, K. Wang, and K. Yang, "Visual Localizer: outdoor localization based on ConvNet descriptor and global optimization for visually impaired pedestrians," Sensors, vol. 18, no. 8, 2018.
[9] R. Cheng, K. Wang, L. Lin, and K. Yang, "Visual localization of key positions for visually impaired people," in Proceedings of the International Conference on Pattern Recognition (ICPR), Aug 2018, pp. 2893-2898.
[10] S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, "Visual place recognition: a survey," IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1-19, 2016.
[11] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255-1262, 2017.
[12] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: a convolutional network for real-time 6-DoF camera relocalization," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 2938-2946.
[13] A. Glover, W. Maddern, M. Warren, S. Reid, M. Milford, and G. Wyeth, "OpenFABMAP: an open source toolbox for appearance-based loop closure detection," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2012, pp. 4730-4735.
[14] O. Vysotska, T. Naseer, L. Spinello, W. Burgard, and C. Stachniss, "Efficient and effective matching of image sequences under substantial appearance changes exploiting GPS priors," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2015, pp. 2774-2779.
[15] R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, and E. Romera, "OpenABLE: an open-source toolbox for application in life-long visual localization of autonomous vehicles," in Proceedings of the IEEE International Conference on Intelligent Transportation Systems (ITSC), Nov 2016, pp. 965-970.
[16] F. Han, H. Wang, G. Huang, and H. Zhang, "Sequence-based sparse optimization methods for long-term loop closure detection in visual SLAM," Autonomous Robots, vol. 42, no. 7, pp. 1323-1335, 2018.
[17] B. Talbot, S. Garg, and M. Milford, "OpenSeqSLAM2.0: an open source toolbox for visual place recognition under changing conditions," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018, pp. 7758-7765.
[18] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1437-1451, 2018.
[19] W. Maddern, A. Stewart, C. McManus, B. Upcroft, W. Churchill, and P. Newman, "Illumination invariant imaging: applications in robust vision-based localisation, mapping and classification for autonomous vehicles," in Proceedings of the Visual Place Recognition in Changing Environments Workshop, IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, vol. 2, 2014, p. 3.
[20] S. Lowry and M. J. Milford, "Supervised and unsupervised linear learning techniques for visual place recognition in changing environments," IEEE Transactions on Robotics, vol. 32, no. 3, pp. 600-613, 2016.
[21] N. Sünderhauf, P. Neubert, and P. Protzel, "Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons," in Proc. of Workshop on Long-Term Autonomy, IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 2013.
[22] N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford, "On the performance of ConvNet features for place recognition," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept 2015, pp. 4297-4304.
[23] O. Vysotska and C. Stachniss, "Relocalization under substantial appearance changes using hashing," in Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS) Workshop on Planning, Perception and Navigation for Intelligent Vehicles, 2017.
[24] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: an efficient alternative to SIFT or SURF," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Nov 2011, pp. 2564-2571.
[25] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188-1197, 2012.
[26] R. Muñoz-Salinas, "DBoW3," 2017. [Online]. Available: https://github.com/rmsalinas/DBow3
[27] A. Oliva and A. Torralba, "Modeling the shape of the scene: a holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145-175, 2001.
[28] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin, "Context-based vision system for place and object recognition," in Proceedings of the Ninth IEEE International Conference on Computer Vision, Oct 2003, pp. 273-280.
[29] X. Yang and K. T. Cheng, "Local difference binary for ultrafast and distinctive feature description," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 188-194, 2014.
[30] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: a 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[31] M. Milford, J. Firn, J. Beattie, A. Jacobson, E. Pepperell, E. Mason, M. Kimlin, and M. Dunbabin, "Automated sensory data alignment for environmental and epidermal change monitoring," in Australasian Conference on Robotics and Automation 2014, Melbourne, Australia, December 2014, pp. 1-10.
[32] A. Mohammadi, H. Asadi, S. Mohamed, K. Nelson, and S. Nahavandi, "OpenGA, a C++ genetic algorithm library," in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)