InLoc: Indoor Visual Localization with Dense Matching and View Synthesis
Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, Akihiko Torii
Tokyo Institute of Technology    Department of Computer Science, ETH Zürich    CIIRC, CTU in Prague∗    Microsoft, Redmond    Inria†

Abstract
We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map. The contributions of this work are three-fold. First, we develop a new large-scale visual localization method targeted for indoor environments. The method proceeds along three steps: (i) efficient retrieval of candidate poses that ensures scalability to large-scale environments, (ii) pose estimation using dense matching rather than local features to deal with textureless indoor scenes, and (iii) pose verification by virtual view synthesis to cope with significant changes in viewpoint, scene layout, and occluders. Second, we collect a new dataset with reference 6DoF poses for large-scale indoor localization. Query photographs are captured by mobile phones at a different time than the reference 3D map, thus presenting a realistic indoor localization scenario. Third, we demonstrate that our method significantly outperforms current state-of-the-art indoor localization approaches on this new challenging data.
1. Introduction
Autonomous navigation inside buildings is a key ability of robotic intelligent systems [24, 39]. Successful navigation requires both localizing the robot and determining a path to its goal. One approach to solving the localization problem is to build a 3D map of the building and then use a camera to estimate the current position and orientation of the robot (Figure 1). Imagine also the benefit of an intelligent indoor navigation system that helps you find your way, for example, at Chicago airport, Tokyo Metropolitan station, or the CVPR conference center.

∗ CIIRC - Czech Institute of Informatics, Robotics, and Cybernetics, Czech Technical University in Prague.
† WILLOW project, Département d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548, PSL Research University.
While RGBD sensors could also be used indoors, they are often too energy-consuming for mobile scenarios or have only a short range to scan close-by objects (faces). Thus, purely RGB-based localization approaches are also relevant in indoor scenes. Obviously, indoor scenes are GPS-denied environments.
Figure 1. Large-scale indoor visual localization. Given a database of geometrically-registered RGBD images, we predict the 6DoF camera pose of a query RGB image by retrieving candidate images, estimating candidate camera poses, and selecting the best matching camera pose. To address inherent difficulties in indoor visual localization, we introduce the "InLoc" approach that performs a sequence of progressively stricter verification steps.

Besides intelligent systems, the visual localization problem is also highly relevant for any type of Mixed Reality application, including Augmented Reality [16, 44, 72]. Due to the availability of datasets, e.g., obtained from Flickr [38] or captured from autonomous vehicles [19, 43], large-scale localization in urban environments has been an active field of research [6, 9, 14, 15, 19, 20, 27, 29, 34, 38, 44, 53–57, 65–67, 75, 79, 80]. In contrast, indoor localization [11, 12, 39, 58, 59, 64, 69, 74] has received less attention in recent years. At the same time, indoor localization is, in many ways, a harder problem than urban localization: 1) Due to the short distance to the scene geometry, even small changes in viewpoint lead to large changes in image appearance. For the same reason, occluders such as humans or chairs often have a stronger impact compared to urban scenes. Thus, indoor localization approaches have to handle significantly larger changes in appearance between query and reference images. 2) Large parts of indoor scenes are textureless, and textured areas are typically rather small. As a result, feature matches are often clustered in small regions of the images, resulting in unstable pose estimates [29]. 3) To make matters worse, buildings are often highly symmetric with many repetitive elements, both on a large (similar corridors, rooms, etc.) and a small (similar chairs, tables, doors, etc.) scale. While structural ambiguities also cause problems in urban environments, they often only occur in larger scenes [9, 54, 67]. 4) The appearance of indoor scenes changes considerably over the course of a day due to complex illumination conditions (indirect light through windows and active illumination from lamps). 5) Indoor scenes are often highly dynamic over time, as furniture and personal effects are moved through the environment. In contrast, the overall appearance of building facades does not change much over time.

This paper addresses these difficulties inherent to indoor visual localization by proposing a new localization method. Our approach starts with an image retrieval step, using a compact image representation [6] that scales to large scenes. Given a shortlist of potentially relevant database images, we apply two progressively more discriminative geometric verification steps: (i) We use dense matching of CNN descriptors that capture spatial configurations of higher-level structures (rather than individual local features) to obtain the correspondences required for camera pose estimation. (ii) We then apply a novel pose verification step based on virtual view synthesis that can accurately verify whether the query image depicts the same place by dense pixel-level matching, again not relying on sparse local features.

Historically, the datasets used to evaluate indoor visual localization were restricted to small, often room-scale, scenes.
Driven by the interest in semantic scene understanding [10, 23, 78] and enabled by scalable reconstruction techniques [28, 47, 48], large-scale indoor datasets covering multiple rooms or even whole buildings are becoming available [10, 17, 23, 64, 74, 76–78]. However, most of these datasets focus on reconstruction [76, 77] and semantic scene understanding [10, 17, 23, 78] and are not suitable for localization. To address this issue, we create a new dataset for indoor localization that, in contrast to other existing indoor localization datasets [10, 26, 64], has two important properties. First, the dataset is large-scale, capturing two university buildings. Second, the query images are acquired using a smartphone months apart from the date of capture of the reference 3D model. As a result, the query images and the reference 3D model often contain large changes in scene appearance due to the different layout of furniture, occluders (people), and illumination, representing a realistic and challenging indoor localization scenario.

Contributions.
Our contributions are three-fold. First, we develop a novel visual localization approach suitable for large-scale indoor environments. The key novelty of our approach lies in carefully introducing dense feature extraction and matching in a sequence of progressively stricter verification steps. To the best of our knowledge, the present work is the first to clearly demonstrate the benefit of dense data association for indoor localization. Second, we create a new dataset suitably designed for large-scale indoor localization that contains large variation in appearance between queries and the 3D database due to large viewpoint changes, moving furniture, occluders, or changing illumination. The query images are taken at a different time from the reference database, using a handheld device, and at different moments of the day, to capture enough variability, bridging the gap to realistic usage scenarios. The code and data are publicly available on the project page [1]. Third, the proposed method shows a solid improvement over existing state-of-the-art results, with an absolute improvement of 17–20% in the percentage of correctly localized queries within a 0.25–0.5 m error, which is of high importance for indoor localization.
2. Related work
We next review previous work on visual localization.
Image-retrieval-based localization.
Visual localization in large-scale urban environments is often approached as an image retrieval problem. The location of a given query image is predicted by transferring the geotag of the most similar image retrieved from a geotagged database [6, 9, 18, 35, 54, 66, 67]. This approach scales to entire cities thanks to compact image descriptors and efficient indexing techniques [7, 8, 22, 31, 33, 49, 63, 70] and can be further improved by spatial re-ranking [51], informative feature selection [21, 22], or feature weighting [27, 32, 54, 67]. Most of the above methods are based on image representations using sparsely sampled local invariant features. While these representations have been very successful, outdoor image-based localization has recently also been approached using densely sampled local descriptors [66] or (densely extracted) descriptors based on convolutional neural networks [6, 35, 40, 75]. However, the main shortcoming of all the above methods is that they output only an approximate location of the query, not an exact 6DoF pose.
Visual localization using 3D maps.
Another approach is to directly obtain the 6DoF camera pose with respect to a pre-built 3D map. The map is usually composed of a 3D point cloud constructed via Structure-from-Motion (SfM) [2], where each 3D point is associated with one or more local feature descriptors. The query pose is then obtained by feature matching and solving a Perspective-n-Point (PnP) problem [14, 15, 20, 29, 34, 38, 53, 55]. Alternatively, pose estimation can be formulated as a learning problem, where the goal is to train a regressor from the input RGB(-D) space to camera pose parameters [11, 34, 59, 73]. While promising, scaling these methods to large-scale datasets is still an open challenge.
Indoor 3D maps.
Indoor scene datasets [50, 52, 62, 68] have been introduced for tasks such as scene recognition, classification, and object retrieval. With the increased availability of laser range scanners and time-of-flight (ToF) sensors, several datasets include depth data besides RGB images [5, 10, 23, 26, 36, 60, 78], and some of these datasets also provide reference camera poses registered into the 3D point cloud [10, 26, 78], though their focus is not on localization. Datasets focused specifically on indoor localization [59, 64, 69] have so far captured fairly small spaces, such as a single room (or a single floor at largest), and have been constructed from densely-captured sequences of RGBD images. More recent datasets [17, 76] provide larger-scale (multi-floor) indoor 3D maps containing RGBD images registered to a global floor map. However, they are designed for object retrieval, 3D reconstruction, or training deep-learning architectures. Most importantly, they do not contain query images taken from viewpoints far from database images, which are necessary for evaluating visual localization.

To address the shortcomings of the above datasets for large-scale indoor visual localization, we introduce a new dataset that includes query images captured at a different time from the database, taken from a wide range of viewpoints, with a considerably larger 3D database distributed across multiple floors of multiple buildings. Furthermore, our dataset contains various situations that are difficult for visual localization, e.g., textureless and highly symmetric office scenes, repetitive tiles, and repetitive objects that confuse existing visual localization methods designed for outdoor scenes. The newly collected dataset is described next.
3. The InLoc dataset for visual localization
Our dataset is composed of a database of RGBD images geometrically registered to the floor maps, augmented with a separate set of RGB query images taken by hand-held devices, making it suitable for the task of indoor localization (Figure 2). The provided query images are annotated with manually verified ground-truth 6DoF camera poses (reference poses) in the global coordinate system of the 3D map.
Table 1. Statistics of the InLoc dataset.

            Number   Image size [pixel]   FoV [degree]
Database    9,972    1,600 x 1,200        60
Query       356      4,032 x 3,024        65.5

Figure 2. Example images from the InLoc dataset. (Top) Database images. (Bottom) Query images. The selected images show the challenges encountered in indoor environments: even small changes in viewpoint lead to large differences in appearance; large textureless surfaces (e.g., walls); self-repetitive structures (e.g., corridors); significant variation throughout the day due to different illumination sources (e.g., active vs. indirect illumination).

Database. The base indoor RGBD dataset [76] consists of
277 RGBD panoramic images obtained from scanning two buildings at the Washington University in St. Louis with a Faro 3D scanner. Each RGBD panorama has about 40M 3D points in color. The base images are divided into five scenes: DUC1, DUC2, CSE3, CSE4, and CSE5, representing five floors of the mentioned buildings, and are geometrically registered to a known floor plan [76]. The scenes are scanned sparsely on purpose, to cover a larger area with a small number of scans, both to reduce the required manual work and because of the long operating times of the high-end scanner used. The area per scan varies between 23.5 and 185.8 m². This inherently leads to critical view changes between query and database images when compared with other existing datasets [64, 69, 74]. (For example, in the database of [64], the scans are distributed on one single floor, and the area per database image is less than 45 m².)

To create an image database suitable for indoor visual localization evaluation, a set of perspective images is generated by following the best practices from outdoor visual localization [19, 66, 79]. We obtain 36 perspective RGBD images from each panorama by extracting standard perspective views (60° FoV) with a sampling stride of 30° in yaw and ±30° in pitch directions, resulting in 10K perspective images in total (Table 1). Our database contains significant challenges, such as repetitive patterns (stairs, pillars), frequently appearing building structures (doors, windows), furniture changing position, people moving across the scene, and textureless and highly symmetric areas (walls, floors, corridors, classrooms, open spaces).

Query images. We captured 356 photos using a smartphone camera (iPhone 7), distributed only across two floors, DUC1 and DUC2. The other three floors in the database are not represented in the query images and play the role of confusers at search time, contributing to the building-scale localization scenario. Note that these query photos are taken at different times of the day, to capture the variety of occluders and layouts (e.g., people, furniture) as well as illumination changes.

Figure 3. Examples of verified query poses. We evaluated the quality of the reference camera poses both visually and quantitatively, as described in section 3. Red dots are the database 3D points projected onto a query image using its estimated pose.
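To make the database construction concrete, cutting perspective views out of an equirectangular panorama can be sketched as follows. This is a minimal numpy sketch, not the authors' released code: the function name, the nearest-neighbour sampling, and the spherical conventions (y pointing down, z forward) are our assumptions.

```python
import numpy as np

def perspective_from_pano(pano, yaw, pitch, fov_deg=60.0, size=1600):
    """Sample one size x size perspective view with the given field of view
    from an equirectangular panorama at orientation (yaw, pitch) in radians."""
    H, W = pano.shape[:2]
    f = 0.5 * size / np.tan(0.5 * np.radians(fov_deg))   # focal length [px]
    # Pixel grid -> unit viewing rays in the camera frame (y down, z forward).
    u, v = np.meshgrid(np.arange(size) - size / 2.0,
                       np.arange(size) - size / 2.0)
    rays = np.stack([u, v, np.full_like(u, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate the rays: pitch about the x-axis, then yaw about the y-axis.
    cp, sp, cy, sy = np.cos(pitch), np.sin(pitch), np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rays = rays @ (Ry @ Rx).T
    # Rays -> spherical angles -> panorama pixels (nearest neighbour).
    lon = np.arctan2(rays[..., 0], rays[..., 2])          # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))     # [-pi/2, pi/2]
    px = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    py = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[py, px]

# 36 views per panorama: 12 yaw steps of 30 deg x 3 pitch values (0, +/-30 deg).
pano = np.zeros((1024, 2048, 3), dtype=np.uint8)          # dummy panorama
views = [perspective_from_pano(pano, np.radians(y), np.radians(p), size=320)
         for y in range(0, 360, 30) for p in (-30, 0, 30)]
assert len(views) == 36
```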
Reference pose generation.
For all query photos, we estimate 6DoF reference camera poses w.r.t. the 3D map. Each query camera reference pose is computed as follows:

(i) Selection of the visually most similar database images. For each query, we manually select the one panorama location which is visually most similar to the query image, using the perspective images generated from the panorama.

(ii) Automatic matching of query images to selected database images. We match the query and perspective images using affine covariant features [45] and nearest-neighbor search followed by Lowe's ratio test [42].

(iii) Computing the query camera pose and visually verifying the reprojection. All the panoramas (and perspective images) are already registered to the floor plan and have pixel-wise depth information. Therefore, we compute the query pose via P3P-RANSAC [25], followed by bundle adjustment [3], using correspondences between query image points and scene 3D points obtained by feature matching. We evaluate the obtained poses visually by inspecting the reprojection of edges detected in the corresponding RGB panorama into the query image (see examples in figure 3).

(iv) Manual matching of difficult queries to selected database images. Pose estimation from automatic matches often gives inaccurate poses for difficult queries which are, e.g., far from any database image. Hence, for queries with significant misalignment in the reprojected edges, we manually annotate 5 to 20 correspondences between image pixels and 3D points and apply step (iii) on the manual matches.

(v) Quantitative and visual inspection. For all estimated poses, we measure the median reprojection error, computed as the distance of the reprojected 3D database point to the nearest edge pixel detected in the query image, after removing correspondences with gross errors (distance over 20 pixels) due to, e.g., occlusions. For query images that have a median reprojection error under 5 pixels, we manually inspect the reprojected edges in the query image and finally accept
329 reference poses out of the 356 query images.
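The acceptance test of step (v) is easy to state in code. The following is an illustrative sketch under the stated definitions (pinhole intrinsics K, world-to-camera rotation R and translation t, and a precomputed array of query edge pixels); the helper name and the brute-force nearest-edge search are our assumptions, not the authors' implementation.

```python
import numpy as np

def median_reproj_error(pts3d, K, R, t, edge_pixels, gross_px=20.0):
    """Median distance from reprojected database 3D points to the nearest
    detected edge pixel in the query image, ignoring gross outliers."""
    cam = pts3d @ R.T + t                    # world -> camera frame
    cam = cam[cam[:, 2] > 0]                 # keep points in front of the camera
    proj = cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]        # perspective division -> pixels
    # Distance of each reprojection to its nearest edge pixel (brute force;
    # a k-d tree would be used for large point sets).
    d = np.linalg.norm(proj[:, None, :] - edge_pixels[None, :, :], axis=2).min(axis=1)
    d = d[d <= gross_px]                     # drop correspondences with gross errors
    return np.median(d) if d.size else np.inf

# A pose is kept for the final manual inspection only if the median error
# is below 5 pixels:
# accept = median_reproj_error(pts3d, K, R, t, edges) < 5.0
```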
4. Indoor visual localization with dense matching and view synthesis
We propose a new method for large-scale indoor visual localization. We address the three main challenges of indoor environments:

(1) Lack of sparse local features. Indoor environments are full of large textureless areas, e.g., walls, ceilings, floors, and windows, where sparse feature extraction methods detect very few features. To overcome this problem, we use multi-scale dense CNN features for both image description and feature matching. Our features are generic enough to be pre-trained beforehand on (outdoor) scenes, avoiding costly re-training of the localization machine for each particular environment, e.g., as in [11, 34, 73].

(2) Large image changes. Indoor environments are cluttered with movable objects, e.g., furniture and people, and 3D structures, e.g., pillars and concave bays, causing severe occlusions when viewed from a close distance. The most similar images obtained by retrieval may therefore be visually very different from a query image. To overcome this problem, we rely on dense feature matches to collect as much positive evidence as possible. We employ image descriptors extracted from a convolutional neural network that can match higher-level structures of the scene rather than relying on matching individual local features. In detail, our pose estimation step performs coarse-to-fine dense feature matching, followed by geometric verification and estimation of the camera pose using P3P-RANSAC.

(3) Self-similarity. Indoor environments are often very self-similar, e.g., due to many symmetric and repetitive elements on both large and small scales (corridors, rooms, tiles, windows, chairs, doors, etc.). Existing matching strategies count the positive evidence, i.e., how much of the image (or how many inliers) has been matched, to decide whether two images match. This is, however, problematic, as large textureless areas can be matched well, hence providing strong (incorrect) positive evidence. To overcome this problem, we propose to also count the negative evidence, i.e., what portion of the image does not match, to decide whether two views are taken from the same location. To achieve this, we perform explicit pose estimate verification based on view synthesis. In detail, we compare the query image with a virtual view of the 3D model rendered from the estimated camera pose of the query. This novel approach takes advantage of the high quality of the RGBD image database and incorporates both the positive and the negative evidence by counting matching and non-matching pixels across the entire query image. As shown by our experiments, this approach is orthogonal to the choice of local descriptors: the proposed verification by view synthesis consistently shows a significant improvement regardless of the features used for estimating the pose.

The InLoc pipeline has the following three steps. Given a query image, (1) we obtain a set of candidate images by finding the N best matching images from the reference image database registered to the map. (2) For these N retrieved candidate images, we compute the query poses using the associated 3D information that is stored together with the database images. (3) Finally, we re-rank the computed camera poses based on verification by view synthesis. The three steps are detailed next; a structural sketch of the whole pipeline follows below.
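Structurally, the pipeline can be summarized as follows. This is a sketch only: the four callables stand in for the concrete components of sections 4.1-4.3, and the assumption that each estimated pose carries a num_inliers attribute is ours.

```python
def inloc_localize(query, database, retrieve, estimate_pose, render, similarity,
                   n_candidates=100, n_poses=10):
    """InLoc pipeline sketch; the callables stand in for the components
    described in sections 4.1-4.3."""
    # (1) Candidate retrieval with compact global descriptors (section 4.1).
    shortlist = retrieve(query, database)[:n_candidates]
    # (2) DensePE: dense matching + P3P-RANSAC per candidate (section 4.2);
    #     keep the poses supported by the most inlier matches.
    poses = sorted((estimate_pose(query, db_image) for db_image in shortlist),
                   key=lambda pose: pose.num_inliers, reverse=True)[:n_poses]
    # (3) DensePV: render a virtual view from each pose and re-rank by the
    #     similarity between the query and the rendering (section 4.3).
    return max(poses, key=lambda pose: similarity(query, render(pose)))
```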
4.1. Candidate pose retrieval

As demonstrated by existing work [6, 35, 66], aggregating feature descriptors computed densely on a regular grid mitigates issues such as the lack of repeatability of local features detected in textureless scenes, large illumination changes, and the lack of discriminability of image descriptions dominated by features from repetitive structures (burstiness). As already mentioned in section 1, these problems also occur in large-scale indoor localization, which motivates our choice of an image descriptor based on dense feature aggregation. Both query and database images are described by NetVLAD [6] (but other variants could also be used), L2 distances between the normalized descriptors are computed, and the poses of the N best matching images from the database are chosen as candidate poses. In section 5, we compare our approach with state-of-the-art image descriptors based on local feature detection and show the benefits of our approach for indoor localization.
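With L2-normalized global descriptors, shortlist retrieval reduces to a matrix product followed by sorting. A minimal sketch (the random unit vectors below merely stand in for NetVLAD descriptors, whose extraction is assumed to happen elsewhere):

```python
import numpy as np

def retrieve_candidates(query_desc, db_descs, top_n=100):
    """Rank database images by descriptor similarity. query_desc: (D,) and
    db_descs: (M, D), both L2-normalized, so Euclidean distance is monotone
    in the negative dot product."""
    sims = db_descs @ query_desc             # cosine similarities
    return np.argsort(-sims)[:top_n]         # indices of best matches first

# Example with random unit vectors standing in for NetVLAD descriptors.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 256)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = rng.normal(size=256).astype(np.float32)
q /= np.linalg.norm(q)
shortlist = retrieve_candidates(q, db)
```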
4.2. Pose estimation using dense matching

A severe problem in indoor localization is that standard geometric verification based on local feature detection [51, 54] does not work on textureless or self-repetitive scenes, such as corridors, where robots (and also humans) often get lost. Motivated by the improvements in candidate pose retrieval with dense feature aggregation (Section 4.1), we use features densely extracted on a regular grid for verifying and re-ranking the candidate images by feature matching and pose estimation. A possible approach would be to match DenseSIFT [41] followed by RANSAC-based verification. Instead of tailoring the DenseSIFT description parameters (patch sizes, strides, scales) to match across images with significant viewpoint changes, we use an image representation extracted by a convolutional neural network (VGG-16 [61]) as a set of multi-scale features extracted on a regular grid that describes higher-level information with a larger receptive field (patch size).

We first find geometrically consistent sets of correspondences using the coarser conv5 layer containing high-level information. Then we refine the correspondences by searching for additional matches on the conv3 layer. The examples in figure 4 demonstrate that our dense CNN matching (4th column) obtains better matches in indoor environments than matching standard local features (3rd column), even for less-textured areas. Notice that dense feature extraction and description requires no additional computation at query time, as the intermediate convolutional layers are already computed when extracting the NetVLAD descriptors described in section 4.1. As will also be demonstrated in section 5, the memory requirements and computational speed of feature matching can be addressed by binarizing the convolutional features without loss in matching performance. As the perspective images in our database have depth values, and hence associated 3D points, the query camera pose can be estimated by finding pixel-to-pixel correspondences between the query and the matching database image, followed by P3P-RANSAC [25].
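The heart of the coarse stage is mutual nearest-neighbour search between the two conv5 feature grids; the conv3 refinement repeats the same search restricted to the neighbourhood implied by each coarse match. A sketch of the mutual-NN step, assuming L2-normalized features flattened from their spatial grids:

```python
import numpy as np

def mutual_nn_matches(feat_a, feat_b):
    """Mutually nearest matches between two sets of L2-normalized CNN
    features, (Na, D) and (Nb, D), flattened from their spatial grids."""
    sim = feat_a @ feat_b.T                  # (Na, Nb) cosine similarities
    ab = sim.argmax(axis=1)                  # best b for each a
    ba = sim.argmax(axis=0)                  # best a for each b
    a_idx = np.arange(feat_a.shape[0])
    mutual = ba[ab] == a_idx                 # keep only mutual agreements
    return np.stack([a_idx[mutual], ab[mutual]], axis=1)

# conv5 grids are small relative to the image (roughly a 16x downsampling),
# so the full (Na, Nb) similarity matrix comfortably fits in memory.
```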
4.3. Pose verification with view synthesis

We propose to collect both positive and negative evidence to determine what is and is not matched. This is achieved by harnessing the power of the high-quality RGBD image database that provides a dense and accurate 3D structure of the indoor environment. This structure is used to render a virtual view that shows how the scene would look from the estimated query pose. The rendered image enables us to count, in a pixel-wise manner, both positive and negative evidence by determining which regions are and are not consistent between the query image and the underlying 3D structure. To gain invariance to illumination changes and small misalignments, we evaluate image similarity by comparing local patch descriptors (DenseRootSIFT [7, 41]) at corresponding pixel locations. The final similarity is computed as the median of the descriptor distances across the entire image, while ignoring areas with missing 3D structure.
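Both ingredients of DensePV, view synthesis from the RGBD point cloud and the median-distance score, are simple to sketch. The z-buffer splatting below (a far-to-near sort so that nearer points overwrite farther ones) and the generic per-pixel descriptor maps are our simplifications of the paper's renderer and DenseRootSIFT:

```python
import numpy as np

def render_view(pts_world, colors, R, t, K, hw):
    """Synthesize a view from pose (R, t): project the colored 3D points and
    keep the nearest point per pixel to handle self-occlusion."""
    H, W = hw
    cam = pts_world @ R.T + t                      # world -> camera frame
    front = cam[:, 2] > 0
    cam, col = cam[front], colors[front]
    uv = cam @ K.T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)      # project to pixels
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, z, col = uv[ok], cam[ok, 2], col[ok]
    image = np.zeros((H, W, 3), dtype=col.dtype)
    mask = np.zeros((H, W), dtype=bool)
    order = np.argsort(-z)     # far-to-near: nearer points overwrite farther
    image[uv[order, 1], uv[order, 0]] = col[order]
    mask[uv[order, 1], uv[order, 0]] = True
    return image, mask         # mask marks pixels backed by 3D structure

def densepv_score(query_desc, render_desc, mask):
    """DensePV-style similarity (lower is better): median per-pixel distance
    between dense descriptor maps (H, W, D), ignoring missing geometry."""
    dist = np.linalg.norm(query_desc - render_desc, axis=2)
    return np.median(dist[mask])
```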
5. Experiments
We first describe the experimental setup for evaluating visual localization performance using our dataset (Section 5.1). The proposed method, termed "InLoc", is compared with state-of-the-art methods (Section 5.2), and we show the benefits of each component in detail (Section 5.3).
5.1. Experimental setup

In the candidate pose retrieval step, we retrieve 100 candidate database images using NetVLAD. We use the implementation provided by the authors and the pre-trained Pitts30K [6] VGG-16 [61] model to generate the NetVLAD descriptor vectors. (The impact of negative evidence in feature aggregation is demonstrated in [30].)
Figure 4. Qualitative comparison of different localization methods (columns): Disloc [9], NetVLAD [6], NetVLAD [6]+SparsePE, and variants using the proposed DensePE and DensePV. From top to bottom: query image, the best matching database image, synthesized view at the estimated pose (without inter/extra-polation), error map between the query image and the synthesized view, and localization error (meters, degrees). Green dots are the inlier matches obtained by P3P-LO-RANSAC. Methods using the proposed dense pose estimation (DensePE) and dense pose verification (DensePV) are shown in bold. The query images in the 2nd, 4th, and 6th columns are well localized within 1.0 meter and 5.0 degrees, whereas the localization results in the 1st, 3rd, and 5th columns are incorrect.
In the second, pose estimation step, we obtain tentative correspondences by matching densely extracted convolutional features in a coarse-to-fine manner: we first find mutually nearest matches among the conv5 features and then find matches in the finer conv3 features, restricted by the coarse conv5 correspondences. The tentative matches are geometrically verified by estimating up to two homographies using RANSAC [25]. We re-rank the 100 candidates using the number of RANSAC inliers and keep the top-10 database images. For each of the 10 images, the 6DoF query pose is computed by P3P-LO-RANSAC [37] (referred to as DensePE), assuming a known focal length, e.g., from EXIF data, using the inlier matches and the depth (i.e., the 3D structure) associated with each database image.

In the final, pose verification step, we generate synthesized views by rendering the colored 3D points while taking care of self-occlusions. For computing the scores that measure the similarity of the query image and the image rendered from the estimated pose, we use the DenseSIFT extractor and its RootSIFT descriptor [7, 41] from VLFeat [71]. (When computing the descriptors, the blank pixels induced by missing 3D points are filled by linear inter-/extrapolation using the values of non-blank pixels on the boundary.) Finally, we localize the query image by the best pose among its top-10 candidates.
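The pose computation from the verified dense matches can be sketched with OpenCV's PnP-RANSAC as a plain stand-in for the locally optimized P3P-LO-RANSAC [37] used in the paper; the depth-lifting logic and the argument conventions below (world-to-camera database pose, pixel coordinates as (x, y)) are our assumptions:

```python
import numpy as np
import cv2

def pose_from_matches(query_pts, db_pts, db_depth, db_pose, K_db, K_query):
    """Lift matched database pixels to 3D via depth, then solve PnP-RANSAC.
    query_pts, db_pts: (N, 2) matched pixel coordinates; db_depth: depth map
    of the database image; db_pose: (R, t), world->camera of the database view."""
    R_db, t_db = db_pose
    z = db_depth[db_pts[:, 1].astype(int), db_pts[:, 0].astype(int)]
    ok = z > 0                                       # drop pixels without depth
    # Back-project database pixels to the database camera frame, then to world:
    # world = R_db.T @ (cam - t_db), written with row vectors below.
    homog = np.concatenate([db_pts[ok], np.ones((ok.sum(), 1))], axis=1)
    cam = (np.linalg.inv(K_db) @ homog.T).T * z[ok, None]
    world = (cam - t_db) @ R_db
    success, rvec, tvec, inliers = cv2.solvePnPRansac(
        world.astype(np.float64), query_pts[ok].astype(np.float64),
        K_query, None, reprojectionError=8.0, flags=cv2.SOLVEPNP_P3P)
    return (cv2.Rodrigues(rvec)[0], tvec.ravel()) if success else None
```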
Evaluation metrics.
We evaluate the localization accuracy as the consistency of the estimated poses with our reference poses. We measure the positional and angular differences, in meters and degrees, between the estimated poses and the manually verified reference poses.
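Both error measures follow directly from the 4x4 pose matrices; a short sketch, assuming poses are stored with the rotation in the top-left 3x3 block and the camera position in the last column:

```python
import numpy as np

def pose_errors(T_est, T_ref):
    """Positional error (meters) and angular error (degrees) between two
    4x4 camera poses with rotation R and camera center in the last column."""
    t_err = np.linalg.norm(T_est[:3, 3] - T_ref[:3, 3])
    R_rel = T_est[:3, :3].T @ T_ref[:3, :3]
    # Rotation angle from the trace, clipped for numerical safety.
    cos_a = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, np.degrees(np.arccos(cos_a))

# A query counts as correctly localized at distance threshold d if
# t_err <= d and the angular error is within the angular threshold.
```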
5.2. Comparison with state-of-the-art methods

Direct 2D-3D matching [53, 55]. We first compare with a variation of a state-of-the-art 3D structure-based image localization approach [53]. We compute affine covariant RootSIFT features for all the database images and associate them with 3D coordinates via the known scene geometry. Features extracted from a query image are then matched to the database 3D descriptors [46]. We select at most five database images receiving the largest numbers of matches and use all these matches together for pose estimation. Similar to [53], we did not apply Lowe's ratio test [42], as it lowered the performance. The 6DoF query pose is finally computed by P3P-LO-RANSAC [37]. As shown in table 2, InLoc outperforms direct 2D-3D matching by a large margin at the localization accuracy of 0.5 m. We believe that this is because our large-scale indoor dataset involves many distractors and large viewpoint changes that present a major challenge for 3D structure-based methods. (Due to the sparse sampling of viewpoints in our indoor dataset, we cannot establish feature tracks between database images. This prevents us from applying algorithms relying on co-visibility [20, 38, 53, 55, 80].)

Table 2. Comparison with the state-of-the-art localization methods on the InLoc dataset. We show the rate (%) of correctly localized queries within a given distance (m) threshold and within a 10° angular error threshold.

          Direct 2D-3D [53]   Disloc [9]+SparsePE   NetVLAD [6]+SparsePE   InLoc (Ours)
0.25 m    11.9                20.1                  21.3                   38.9

Disloc [9] + sparse pose estimation (SparsePE) [51].
We next compare with a state-of-the-art image retrieval-based localization method. Disloc represents images using bag-of-visual-words with Hamming embedding [31] while also taking the local descriptor space density into account. We use a publicly available implementation [54] of Disloc with a 200K vocabulary trained on affine covariant features [45], described by RootSIFT [7], extracted from the database images of our indoor dataset. The top-100 candidate images shortlisted by Disloc are re-ranked by spatial verification [51] using (sparse) affine covariant features [45]. The ratio test [42] was not applied here, as it was removing too many features that need to be retained in the indoor scenario. Using the inliers, the 6DoF query pose is computed with P3P-LO-RANSAC [37]. To make a fair comparison, we use exactly the same features and P3P-LO-RANSAC for pose estimation as in the direct 2D-3D matching method described above. As shown in table 2, Disloc [9]+SparsePE [51] yields a performance gain compared to direct 2D-3D matching [55]. This can be attributed to the image retrieval step, which discounts bursts of repetitive features. However, the results are still significantly worse compared to our InLoc approach.

NetVLAD [6] + sparse pose estimation (SparsePE) [51].
We also evaluate a variation of the above image retrieval-based localization method. Here the candidate shortlist is obtained by NetVLAD [6] and then re-ranked using SparsePE [51], followed by pose estimation using P3P-LO-RANSAC [37]. This is a strong baseline building on the state-of-the-art place recognition results obtained by [6]. Interestingly, as shown in table 2, there is no significant difference between NetVLAD+SparsePE and DisLoc+SparsePE, which is in line with results reported in outdoor settings [57]. Yet, NetVLAD outperforms DisLoc at the localization accuracy of 0.5 m before re-ranking via SparsePE (cf. figure 5) in this indoor setting (see also figure 4). Overall, both methods, even though they represent the state of the art in outdoor localization, still perform significantly worse than our proposed approach based on dense feature matching and view synthesis.

5.3. Benefits of the individual components

Next, we demonstrate the benefits of the individual components of our approach.
Benefits of pose estimation using dense matching.
Using the NetVLAD retrieval as the base retrieval method (Figure 5 (a)), our pose estimation with dense matching (NetVLAD [6]+DensePE, blue line) consistently improves the localization rate compared to the state-of-the-art sparse local feature matching (NetVLAD [6]+SparsePE, green line). This result supports our conclusion that dense feature matching and verification is superior to sparse feature matching for the often weakly textured indoor scenes. This effect is also clearly demonstrated in the qualitative results in figure 4 (cf. columns 3 and 4).
Benefits of pose verification with view synthesis.
We apply our pose verification step (DensePV) to the top-10 pose estimates obtained by the different spatial re-ranking methods. Results are shown in figure 5 and demonstrate significant and consistent improvements obtained by our pose verification approach (compare the dashed and solid lines in figure 5). Improvements are most pronounced for position accuracy within 1.5 meters (13% or more).

Binarized representation. A binary representation (instead of floats) of the features in the intermediate CNN layers significantly reduces memory requirements. We use feature binarization that follows the standard Hamming embedding approach [31], but without dimensionality reduction. Matching is then performed by computing Hamming distances. This simple binarization scheme results in a negligible performance loss (less than 1% at 0.5 meters) compared to the original descriptors, which is in line with results reported for object recognition [4]. At the same time, binarization reduces the memory requirements by a factor of 32, compressing 428 GB of original descriptors to just 13.4 GB.
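The binarization can be sketched as thresholding each descriptor dimension, here at the per-dimension median in the spirit of Hamming embedding [31] but without the projection step, and packing bits so that Hamming distances reduce to XOR and popcount; the exact threshold choice is our assumption:

```python
import numpy as np

def binarize(descs, thresholds=None):
    """Binarize (N, D) float descriptors by per-dimension thresholds and
    pack into uint8 words: D float32 values -> D bits (a 32x reduction)."""
    if thresholds is None:
        thresholds = np.median(descs, axis=0)    # estimated once on the database
    return np.packbits(descs > thresholds, axis=1), thresholds

def hamming(a, b):
    """Hamming distances between one packed code a (W,) and codes b (N, W)."""
    return np.unpackbits(np.bitwise_xor(a[None, :], b), axis=1).sum(axis=1)

codes, th = binarize(np.random.randn(1000, 512).astype(np.float32))
d = hamming(codes[0], codes)                     # matching via bit operations
```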
Comparison with learning-based localization methods.
We attempted a comparison with DSAC [11], a state-of-the-art pose estimator for indoor scenes. Despite our best efforts, training DSAC on our indoor dataset failed to converge. We believe this is because the RGBD scans in our database are sparsely distributed [76] and each scan has only a small overlap with neighboring scans. Training on such a dataset is challenging for methods designed for densely captured RGBD sequences [26]. We believe this would also be the case for PoseNet [34], another method for CNN-based pose regression. We provide a comparison with DSAC and PoseNet on much smaller datasets next.
Figure 5. Impact of different components. The graphs show the impact of dense matching (DensePE) and dense pose verification (DensePV) on pose estimation quality for (a) the pose candidates retrieved by NetVLAD and (b) state-of-the-art baselines. Plots show the fraction of correctly localized queries (y-axis) within a certain distance threshold in meters (x-axis) whose rotation error is at most 10°. Compared methods: Direct 2D-3D, DisLoc [9], DisLoc [9]+SparsePE, their DensePV variants, NetVLAD [6], NetVLAD [6]+SparsePE (with and without DensePV), NetVLAD [6]+DensePE, and InLoc (Ours), i.e., NetVLAD [6]+DensePE+DensePV.

Table 3. Comparison on Matterport3D [17]. Numbers show the median positional (m) and angular (degrees) errors.

            Disloc [9]+SparsePE   NetVLAD [6]+SparsePE   NetVLAD [6]+DensePE   InLoc (Ours)
90 bldgs.   0.42, 4.58°           -                      -                     -
Table 4. Evaluation on the 7 Scenes dataset [26, 59]. Numbers show the median positional (cm) and angular (degrees) errors.

Scene      PoseNet [34]   ActiveSearch [55]   DSAC [11, 13]   NetVLAD [6]+SparsePE [51]   NetVLAD [6]+DensePE
Chess      13, 4.48°      4, 1.96°            2, 1.2°         4, 1.83°                    3, -
Fire       27, 11.3°      3, 1.53°            4, 1.5°         4, 1.55°                    -
Heads      17, 13.0°      2, 1.45°            3, 2.7°         -, 1.65°                    -
Office     19, 5.55°      9, 3.61°            4, 1.6°         5, 1.49°                    -
Pumpkin    26, 4.75°      8, 3.10°            5, 2.0°         7, 1.87°                    -
Red kit.   23, 5.35°      7, 3.37°            5, 2.0°         5, 1.61°                    -
Stairs     35, 12.4°      3, 2.22°            -               12, 3.41°                   9, 2.47°
We also evaluate InLoc on two existing indoor datasets [17, 59] to confirm the relevance of our results. The Matterport3D [17] dataset consists of RGBD scans of 90 buildings. Each RGBD scan contains 18 images that capture the scene around the scan position with known camera poses. We created a test set by randomly choosing 10% of the scan positions and selecting their horizontal views. This resulted in 58,074 database images and a query set of 6,726 images. Results are shown in table 3. Our approach (InLoc) outperforms the baselines, which is in line with the results on the InLoc dataset. We also tested PoseNet [34] and DSAC [11] on a single (the largest) building. The test set is created in the same manner as above and contains 1,884 database images and 210 query images. Even in this much easier case, DSAC fails to converge. PoseNet produces large localization errors (24.8 meters and 80.0 degrees) in comparison with InLoc (0.26 meters and 2.78 degrees).

We also report results on the 7 Scenes dataset [26, 59], which, while relatively small, is a standard benchmark for indoor localization. The 7 Scenes dataset [59] consists of geometrically-registered video frames representing seven scenes, together with associated depth images and camera poses. Table 4 shows the localization results for our approach (NetVLAD+DensePE) compared with state-of-the-art methods [11, 34, 55]. Note that our approach performs comparably to these methods on this relatively small and densely captured data, while it does not need any scene-specific training (which is needed by [11, 34]).
6. Conclusion
We have presented InLoc, a new approach for large-scale indoor visual localization that estimates the 6DoF camera pose of a query image with respect to a large indoor 3D map. To overcome the difficulties of indoor camera pose estimation, we have developed new pose estimation and verification methods that use dense feature extraction and matching in a sequence of progressively stricter verification steps. The localization performance is evaluated on a new large indoor dataset with realistic and challenging query images captured by mobile phones. Our results demonstrate significant improvements compared to state-of-the-art localization methods. To encourage further progress on high-accuracy large-scale indoor localization, we make our dataset publicly available [1].
Acknowledgements.
This work was partially supported by JSPS KAKENHI Grant Numbers 15H05313, 17H00744, and 17J05908, EU-H2020 project LADIO No. 731970, ERC grant LEAP No. 336845, the CIFAR Learning in Machines & Brains program, and the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468). The authors would like to express their deepest appreciation to Yasutaka Furukawa for his arrangement to capture query photographs at Washington University in St. Louis.

Appendix

This appendix first provides additional examples of query images and their reference poses in our InLoc dataset (section A). We also present additional qualitative results, illustrating situations in which the proposed InLoc method succeeds while the investigated baseline methods fail (section B).
A. Additional examples of query images and reference poses in the InLoc dataset
Figure A shows the 3D maps (grey dots), the 329 reference poses of the query images (blue dots), and the 129 database scan positions (red circles) in our InLoc dataset. The query images are distributed across two floors (DUC1 and DUC2) that each cover an area of approximately 100,000 ft² (9,290 m²) [76], and are taken from positions significantly distant from the database scans.

Figure B illustrates the verification process for the reference poses. We show example query images in the 1st and 3rd rows. The edges extracted from the best matching database image were reprojected onto the query image (2nd and 4th rows) to verify the quality of the reference poses. The manually and visually verified reference poses, 329 in total, have at most a 5-pixel median reprojection error; out of these, 101 reference poses have a median reprojection error below 1 pixel.

Figure A. Query reference positions in the InLoc dataset. (a) DUC1 (first floor). (b) DUC2 (second floor). The 329 reference poses of query images (blue dots) are plotted on the 3D maps (grey dots) that are generated by panoramic 3D scans at 277 distinct positions (red circles).

Figure B. Examples of query images and verified reference poses. Each of the two groups shows query images on top, followed by the same images with database edges projected onto the queries.

B. Qualitative results
In what follows, we consider a query image correctly localized if the error of the estimated pose is within 1 meter and 5° with respect to the reference pose.

We first consider situations in which InLoc successfully localizes the query images while the state-of-the-art NetVLAD+SparsePE fails. Figure C shows qualitative examples of the results obtained by NetVLAD+SparsePE (a, c, e) versus our InLoc (b, d, f). As shown in (a) and (c), sparse features are often detected on highly repetitive structures, e.g., fonts (text) or textured surfaces (the fabric pattern on the sofa). As shown in (a) for the baseline, matching features found on such objects can result in matches with unrelated parts of the scene, leading to incorrect camera pose estimates. The fact that sparse features are predominantly found in a few textured regions leads to problems in the largely untextured indoor scenes. This is shown in (e), where matches are found only in a small part of the query image, which leads to an unstable configuration for camera pose estimation. Dense matching, in turn, leads to more stable pose estimates in (b), (d), and (f). Our pose verification, DensePV (section 4.3), allows us to identify incorrect poses resulting from features found on repetitive structures, since most parts of the image rendered from a false pose are not consistent with the query image. Thus, InLoc is better suited to handle highly repetitive indoor scenes with rich feature correspondences.

The next set of qualitative results demonstrates the benefits of dense pose verification. For this, figure D compares results obtained by InLoc (b, d, f) with results obtained by the baseline NetVLAD+DensePE (a, c, e). In this case, the baseline NetVLAD+DensePE uses our dense matching (DensePE) but selects the best pose based only on the number of inlier matches, without our pose verification by virtual view synthesis (DensePV). For scenes dominated by symmetries and repetitive structures (a, c) or largely textureless regions (a, e), there can be a large number of geometrically consistent matches even for unrelated database images. This holds true even when the matches are obtained by dense features and geometrically verified. Our dense pose verification strategy using synthesized images (b, d, f) effectively provides "negative" evidence in such situations. The error maps (bottom row) clearly show that it detects (in)consistent areas between the query and its synthesized image.

Figure C. Qualitative comparison between InLoc and NetVLAD+SparsePE. In these examples, InLoc successfully localizes a query image within 1 meter distance error and 5° angular error with respect to the reference pose, whereas the state-of-the-art NetVLAD+SparsePE fails. From top to bottom: query image, the best matching database image, synthesized view rendered from the estimated pose, error map between the query image and the synthesized view, localization error (meters, degrees). Warm colors correspond to large errors. Green dots are the inlier matches obtained by P3P-LO-RANSAC.

Figure D. Qualitative comparison between InLoc and NetVLAD+DensePE. InLoc successfully localizes a query image within 1 meter distance error and 5° angular error with respect to the reference pose, whereas NetVLAD+DensePE (no pose verification via synthesis) fails. From top to bottom: query image, the best matching database image, synthesized view rendered from the estimated pose, error map between the query image and the synthesized view, localization error (meters, degrees). Warm colors correspond to large errors. Green dots are the inlier matches obtained by P3P-LO-RANSAC.
Limitations.
Our pose verification (section 4.3) evaluates the estimated camera pose by dense pixel-level matching between the query image and the synthesized view. This verification is robust up to a certain level of scene change, e.g., illumination changes and some amount of misalignment, but cannot deal with extreme changes in the scene, such as very large occlusions or views dominated by moving objects.

Figure E shows typical failure cases of InLoc, where our pose verification is unable to identify the correct pose in highly dynamic scenes. In both cases, the query images capture many moving objects, e.g., people (a) or chairs (b), and highly dynamic scenes, e.g., opened/closed shutters (a) or pictures added to/removed from the wall (b). These moving objects cover a large part of the image. These remaining open issues can potentially be addressed by incorporating further semantic information [5, 35].

Figure E. Failure cases. Our InLoc approach fails to localize these examples due to many moving objects, e.g., people (a) or chairs (b), and highly dynamic scenes, e.g., opened/closed shutters (a) or pictures added to/removed from the wall (b). From top to bottom: query image and the reference database image.
References
[1] Project webpage.
[2] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building Rome in a day. Comm. ACM, 54(10):105–112, 2011.
[3] S. Agarwal, K. Mierle, and others. Ceres solver. http://ceres-solver.org.
[4] P. Agrawal, R. B. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In Proc. ECCV, 2014.
[5] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Contextually guided semantic labeling and search for three-dimensional point clouds. Intl. J. of Robotics Research, 32(1):19–34, 2013.
[6] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proc. CVPR, 2016.
[7] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In Proc. CVPR, 2012.
[8] R. Arandjelović and A. Zisserman. All about VLAD. In Proc. CVPR, 2013.
[9] R. Arandjelović and A. Zisserman. DisLocation: Scalable descriptor distinctiveness for location recognition. In Proc. ACCV, 2014.
[10] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In Proc. CVPR, 2016.
[11] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. DSAC - differentiable RANSAC for camera localization. In Proc. CVPR, 2017.
[12] E. Brachmann, F. Michel, A. Krull, M. Y. Yang, S. Gumhold, and C. Rother. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In Proc. CVPR, 2016.
[13] E. Brachmann and C. Rother. Learning less is more - 6D camera localization via 3D surface regression. In Proc. CVPR, 2018.
[14] F. Camposeco, T. Sattler, A. Cohen, A. Geiger, and M. Pollefeys. Toroidal constraints for two-point localization under high outlier ratios. In Proc. CVPR, 2017.
[15] S. Cao and N. Snavely. Minimal scene descriptions from structure from motion models. In Proc. CVPR, 2014.
[16] R. O. Castle, G. Klein, and D. W. Murray. Video-rate localization in multiple maps for wearable augmented reality. In Proc. ISWC, 2008.
[17] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In Proc. 3DV, 2017.
[18] D. Chen, S. Tsai, V. Chandrasekhar, G. Takacs, H. Chen, R. Vedantham, R. Grzeszczuk, and B. Girod. Residual enhanced visual vectors for on-device image matching. In Proc. ASILOMAR, 2011.
[19] D. M. Chen, G. Baatz, K. Köser, S. S. Tsai, R. Vedantham, T. Pylvänäinen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, et al. City-scale landmark identification on mobile devices. In Proc. CVPR, 2011.
[20] S. Choudhary and P. Narayanan. Visibility probability structure from SfM datasets and applications. In Proc. ECCV, 2012.
[21] O. Chum, A. Mikulik, M. Perdoch, and J. Matas. Total recall II: Query expansion revisited. In Proc. CVPR, 2011.
[22] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In Proc. ICCV, 2007.
[23] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. CVPR, 2017.
[24] A. Debski, W. Grajewski, W. Zaborowski, and W. Turek. Open-source localization device for indoor mobile robots. Procedia Computer Science, 76:139–146, 2015.
[25] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM, 24(6):381–395, 1981.
[26] B. Glocker, S. Izadi, J. Shotton, and A. Criminisi. Real-time RGB-D camera relocalization. In Proc. ISMAR, 2013.
[27] P. Gronat, G. Obozinski, J. Sivic, and T. Pajdla. Learning and calibrating per-location classifiers for visual place recognition. In Proc. CVPR, 2013.
[28] M. Halber and T. Funkhouser. Fine-to-coarse global registration of RGB-D scans. In Proc. CVPR, 2017.
[29] A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In Proc. CVPR, 2009.
[30] H. Jégou and O. Chum. Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening. In Proc. ECCV, 2012.
[31] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In Proc. ECCV, 2008.
[32] H. Jégou, M. Douze, and C. Schmid. On the burstiness of visual elements. In Proc. CVPR, 2009.
[33] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE PAMI, 34(9):1704–1716, 2012.
[34] A. Kendall and R. Cipolla. Geometric loss functions for camera pose regression with deep learning. In Proc. CVPR, 2017.
[35] H. J. Kim, E. Dunn, and J.-M. Frahm. Learned contextual feature reweighting for image geo-localization. In Proc. CVPR, 2017.
[36] K. Lai, L. Bo, and D. Fox. Unsupervised feature learning for 3D scene labeling. In Proc. Intl. Conf. on Robotics and Automation, 2014.
[37] K. Lebeda, J. Matas, and O. Chum. Fixing the locally optimized RANSAC - full experimental evaluation. In Proc. BMVC, 2012.
[38] Y. Li, N. Snavely, D. P. Huttenlocher, and P. Fua. Worldwide pose estimation using 3D point clouds. In Proc. ECCV, 2012.
[39] H. Lim, S. N. Sinha, M. F. Cohen, and M. Uyttendaele. Real-time image-based 6-DoF localization in large-scale environments. In Proc. CVPR, 2012.
[40] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In Proc. ICCV, 2015.
[41] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT Flow: Dense correspondence across different scenes. In Proc. ECCV, 2008.
[42] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[43] W. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The Oxford RobotCar dataset. IJRR, 36(1):3–15, 2017.
[44] S. Middelberg, T. Sattler, O. Untzelmann, and L. Kobbelt. Scalable 6-DoF localization on mobile devices. In Proc. ECCV, 2014.
[45] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. IJCV, 60(1):63–86, 2004.
[46] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithmic configuration. In Proc. VISAPP, 2009.
[47] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Proc. ISMAR, 2011.
[48] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D reconstruction at scale using voxel hashing. ACM TOG, 32(6):169, 2013.
[49] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. CVPR, 2006.
[50] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In Proc. ICCV, 2011.
[51] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. CVPR, 2007.
[52] A. Quattoni and A. Torralba. Recognizing indoor scenes. In Proc. CVPR, 2009.
[53] T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys. Hyperpoints and fine vocabularies for large-scale location recognition. In Proc. ICCV, 2015.
[54] T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys. Large-scale location recognition and the geometric burstiness problem. In Proc. CVPR, 2016.
[55] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE PAMI, 39(9):1744–1756, 2017.
[56] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla. Benchmarking 6DOF outdoor visual localization in changing conditions. In Proc. CVPR, 2018.
[57] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Okutomi, and T. Pajdla. Are large-scale 3D models really necessary for accurate visual localization? In Proc. CVPR, 2017.
[58] T. Schmidt, R. Newcombe, and D. Fox. Self-supervised visual descriptor learning for dense correspondence. RAL, 2(2):420–427, 2017.
[59] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proc. CVPR, 2013.
[60] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In Proc. ECCV, 2012.
[61] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.
[62] S. Singh, A. Gupta, and A. A. Efros. Unsupervised discovery of mid-level discriminative patches. In Proc. ECCV, 2012.
[63] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
[64] X. Sun, Y. Xie, P. Luo, and L. Wang. A dataset for benchmarking image-based localization. In Proc. CVPR, 2017.
[65] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson. City-scale localization for cameras with known vertical direction. IEEE PAMI, 39(7):1455–1461, 2017.
[66] A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In Proc. CVPR, 2015.
[67] A. Torii, J. Sivic, T. Pajdla, and M. Okutomi. Visual place recognition with repetitive structures. In Proc. CVPR, 2013.
[68] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision systems for place and object recognition. In Proc. ICCV, 2003.
[69] J. Valentin, A. Dai, M. Nießner, P. Kohli, P. Torr, S. Izadi, and C. Keskin. Learning to navigate the energy landscape. In Proc. 3DV, 2016.
[70] J. C. van Gemert, C. J. Veenman, A. W. Smeulders, and J.-M. Geusebroek. Visual word ambiguity. IEEE PAMI, 32(7):1271–1283, 2010.
[71] A. Vedaldi and B. Fulkerson. VLFeat - an open and portable library of computer vision algorithms. In Proc. ACMM, 2010.
[72] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg. Real-time detection and tracking for augmented reality on mobile phones. Visualization and Computer Graphics, 16(3):355–368, 2010.
[73] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers. Image-based localization using LSTMs for structured feature correlation. In Proc. ICCV, 2017.
[74] S. Wang, S. Fidler, and R. Urtasun. Lost shopping! Monocular localization in large indoor spaces. In Proc. ICCV, 2015.
[75] T. Weyand, I. Kostrikov, and J. Philbin. PlaNet - photo geolocation with convolutional neural networks. In Proc. ECCV, 2016.
[76] E. Wijmans and Y. Furukawa. Exploiting 2D floorplan for building-scale panorama RGBD alignment. In Proc. CVPR, 2017.
[77] J. Xiao and Y. Furukawa. Reconstructing the world's museums. IJCV, 110(3):243–258, 2014.
[78] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In Proc. ICCV, 2013.
[79] A. R. Zamir and M. Shah. Accurate image localization based on Google Maps Street View. In Proc. ECCV, 2010.
[80] B. Zeisl, T. Sattler, and M. Pollefeys. Camera pose voting for large-scale image-based localization. In Proc. ICCV, 2015.