Beyond Cartesian Representations for Local Descriptors
Patrick Ebel, Anastasiia Mishchuk, Kwang Moo Yi, Pascal Fua, Eduard Trulls
Computer Vision Lab, École Polytechnique Fédérale de Lausanne · Visual Computing Group, University of Victoria · Google Switzerland
{firstname.lastname}@epfl.ch, [email protected], [email protected]

Abstract
The dominant approach for learning local patch descriptors relies on small image regions whose scale must be properly estimated a priori by a keypoint detector. In other words, if two patches are not in correspondence, their descriptors will not match. A strategy often used to alleviate this problem is to “pool” the pixel-wise features over log-polar regions, rather than regularly spaced ones. By contrast, we propose to extract the “support region” directly with a log-polar sampling scheme. We show that this provides us with a better representation by simultaneously oversampling the immediate neighbourhood of the point and undersampling regions far away from it. We demonstrate that this representation is particularly amenable to learning descriptors with deep networks. Our models can match descriptors across a much wider range of scales than was possible before, and also leverage much larger support regions without suffering from occlusions. We report state-of-the-art results on three different datasets.
1. Introduction
Keypoint matching has played a pivotal role in computer vision for well over a decade. This is clearly demonstrated by the fact that SIFT [23] remains the most cited paper in computer vision history. While many areas of computer vision are currently dominated by dense deep networks, that is, methods that take entire images as input, some problems remain best approached using sparse features. For example, despite recent attempts at tackling 6DOF pose estimation using dense networks, the top-performing models for wide-baseline stereo and large-scale Structure-from-Motion (SfM) still rely on sparse features [49, 51, 33]. As a result, the quest for ever-improving local feature descriptors goes on [23, 5, 46, 42, 39, 12, 50, 38, 41, 28, 45, 19, 25, 15, 24, 10, 31].
This research was partially funded by Google's Visual Positioning System, the Swiss National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, and by Compute Canada.
These methods all seek to achieve invariance to small changes in location, orientation, scale, perspective, and illumination, along with imaging artefacts and partial occlusions. Most descriptors, however, whether learned or hand-crafted, operate on SIFT-like keypoints and thus rely on simple heuristics to estimate the scale. If the scales for two keypoints do not correspond, neither will the support regions used to extract their descriptors, which is widely accepted as an unrecoverable situation. This is damaging because scale detection is often unreliable.

In this paper we demonstrate that this does not need to be the case. To this end, we go beyond the current paradigm for local descriptors, which we call the cartesian approach. This paradigm confines local descriptors to small, regularly sampled regions and relies on accurate scale estimates. By contrast, we posit that extracting the support region with a log-polar sampling scheme allows us to generate a better local representation by oversampling the immediate neighborhood of the point. We show that this approach is conducive to learning scale-invariant descriptors with off-the-shelf deep networks, enabling us to match keypoints across mismatched scales; see Fig. 4. Furthermore, we demonstrate that this representation is far less sensitive to occlusions or background motion than its cartesian counterpart, which allows us to exploit much larger image regions than was possible before to further boost performance.

Note that while log-polar representations have been used extensively by local features, this has typically involved log-polar aggregation of local statistics that are still computed on the cartesian image grid. By contrast, we propose to warp the patch using a log-polar sampling scheme and learn an optimal descriptor on this data. Fig. 1 illustrates the difference between these two approaches.

In short, we propose a new approach to represent local patches and show how to leverage it to achieve scale invariance. In the remainder of the paper, we first briefly review how scale has been handled in the vast body of literature pertaining to matching descriptors, whether learned or designed. We then describe our method and show that it outperforms the state of the art on several challenging datasets.

2. Related Work

In this section we first review techniques representative of the many that have been proposed to achieve scale invariance for local feature matching, with and without explicit scale detection. Next, we discuss approaches to learning models for patch descriptors. Finally, we study the use of log-polar representations in local features. For a thorough, up-to-date survey on local features please refer to [8].
Scale Invariance via Scale Detection.
The vast majority of work in the literature assumes that scale estimation is handled by the keypoint detector and that keypoints can be put in correspondence only if their scales match. This includes classical hand-crafted pipelines such as SIFT [23] or SURF [5]. Image measurements are then aggregated over a correspondingly-sized support region to extract the descriptor. As a result, errors in this a priori scale estimation cannot be recovered from, and the affected keypoints are simply written off as potential correspondences.
Two-stage pipelines.
Special strategies can be used for rigid matching under large zoom. Zhou et al. [52] propose a two-stage approach to first coarsely register the image in scale-space and then narrow down the search scope to matches of commensurate scale. Shan et al. [36] assume that dense SfM models are available, along with an approximate pose, and synthesize ground views from aerial viewpoints using the 3D model, for aerial-to-ground matching. Both methods rely on SIFT features and would directly benefit from improved, scale-invariant descriptors such as ours.
Scale Invariance without Scale Detection.
A simple way to achieve scale invariance is to concatenate multi-scale descriptors and find the best match among them. This was done in [47] to improve robustness against scale changes with ORB features [32]. Scale-Less SIFT (SLS) [14] goes beyond that and exploits the observation that SIFT descriptors do not change drastically over close, contiguous scales, which suggests that they are embedded in a low-dimensional space. This observation can be used to find a representation more compact than their concatenation. The resulting feature vectors are still high-dimensional (8k) but can be reduced by PCA to a 512-dimensional vector. However, this requires a singular value decomposition for each keypoint to find its subspace, which is very costly.

The Scale- and rotation-Invariant Descriptor (SID) [20] samples axis-aligned derivatives over a log-polar grid, along with incremental smoothing over image regions further away from the keypoint. Thus, scale changes and rotations result in translations on the measurement matrix. Using the Fourier Transform Modulus of this signal, which is translation-invariant, makes the descriptors scale- and rotation-invariant. However, SID requires fine sampling over large support regions, which fails in real-world scenarios with viewpoint changes and occlusions. Seg-SID [43] addresses this shortcoming by leveraging segmentation cues to suppress image measurements from image regions not associated to the keypoint, but this requires image-level segmentation and is failure-prone. SID also suffers from high dimensionality.

Learned Descriptors.
Early works applied PCA to SIFT [18], learned comparison metrics [40], or learned descriptors with convex optimization [39]. Current research on patch descriptors is dominated by convolutional neural networks. MatchNet [12] and DeepCompare [50] train descriptor extraction and distance metric networks using a Siamese architecture. DeepDesc [38] uses hard positive and negative mining to learn discriminative features. A triplet-based loss is introduced in [4]. L2-Net [41] improves the loss function by enforcing similarity in the intermediate feature maps and penalizing highly correlated descriptor bins. HardNet [28] extends the formulation of [38] to mine over all the samples in the batch. In [15], mining heuristics are replaced by a differentiable approximation of the average precision metric that is then used for optimization. Spectral pooling is introduced in [45] to deal with geometric transformations. An alternative to siamese- and triplet-based loss functions is proposed in [19] to address their shortcomings. GeoDesc [25] uses geometry constraints for optimization. ContextDesc [24] incorporates global context, and geometric context from the keypoint distribution.

All of the deep methods, except [25, 24], are trained on the same dataset [7], which consists of patches pre-extracted on keypoints using Difference of Gaussians (DoG) [23] or multi-scale Harris corners [13]. Only keypoints that survive a 3D reconstruction by Structure from Motion (SfM) are considered, and similarly to the traditional approach, the learned models are simply expected to fail if the detector fails first. To the best of our knowledge there is no learning-based method that explicitly addresses scale invariance.

Another line of works comprises those that use deep architectures to learn keypoints and descriptors jointly. LIFT [48] is trained on patches extracted around SIFT keypoints with corresponding scales, similarly to the previous methods. LF-Net [29] learns to detect the scale with self-supervision, but in practice seems to perform best over a very narrow set of scales. SuperPoint [9] learns scale invariance at the descriptor level, which works for visual odometry but breaks in more generalized problems. D2-Net [10] focuses on difficult imaging conditions and relies on a single network for detection and description. R2D2 [31] applies L2-Net convolutionally while penalizing repeatable but non-discriminative patches.
Figure 1: Pooling vs Sampling. Panels: (a) cartesian pooling (SIFT), (b) log-polar pooling, (c) log-polar sampling, (d) log-polar patch. (a,b) The red patterns depict the regions that most descriptors use to pool features computed on a cartesian pixel-grid. The size of the pattern depends on the local scale, and we show three versions. Under large scale changes, many regions of the cartesian and log-polar grids, such as the ones highlighted by the yellow dots, can no longer be put in correspondence. (c) By contrast, we first resample the patch according to the patterns shown in blue (32 × 32).

Leveraging Polar representations.
Polar and log-polar representations have been extensively used in computer vision to aggregate local information, because they are robust to small changes in scale and rotation. Traditional hand-crafted patch descriptors typically consist of two stages: feature extraction and feature pooling. First, image measurements such as gradients are computed for every pixel. Then, they are aggregated over small regions around the point given its location, orientation, and scale. SIFT, for instance, aggregates features (histograms of gradient orientations) over 4 × 4 square regions, while many other hand-crafted descriptors pool them over log-polar regions instead. In all of these cases the log-polar layout is only used for pooling, that is, the pixel-wise features are always computed in cartesian space, and it is only their aggregation that takes place in log-polar space. As shown in Fig. 1, this is drastically different to our approach, which consists in warping the raw pixel data and using that representation to learn scale-invariant models.
3. Method
First, we describe our sampling scheme in Section 3.1 and, then, our network architecture and training strategy in Section 3.2. For the purposes of this section, we assume that the training data consists of pairs of keypoints across two images that are in correspondence in terms of location and orientation, but not necessarily scale. The actual procedure used to generate the training data is described in Section 4.1.
As in most papers about learning descriptors [12, 50, 38, 4, 41, 28], we use SIFT keypoints [23]. Given an image I of size H × W, a keypoint p_i on I is fully described by its center coordinates (x_i, y_i), its scale σ_i ∈ R+, and its orientation θ_i ∈ [0, π). We use a Polar Transformer Network (PTN) [11] to extract an L × L patch around keypoint p_i. To this end, we rely on the following coordinate transform:

x_i^s = x_i + e^{log(r_i) · x_i^t / W} · cos(φ_i),        (1)
y_i^s = y_i + e^{log(r_i) · x_i^t / W} · sin(φ_i).

Variables (x_i^s, y_i^s) denote source coordinates and (x_i^t, y_i^t) target coordinates, after the transform. The coordinate origin is centered on (x_i, y_i), the angle is φ_i = θ_i + 2π · y_i^t / H, and the radius r_i is given by λ σ_i, where λ is a factor that converts the SIFT scale to image pixels. Given the convention followed by OpenCV, λ = 12 denotes the scale multiplier of SIFT; one can extract larger image regions by setting λ > 12. Finally, we construct the warped patches by looking up the intensity values in image I at the source coordinates (x_i^s, y_i^s) with bilinear interpolation, as done in [11]. This process is illustrated in Fig. 1. We denote patches extracted in this way as LogPol.

For comparison purposes, we also consider the standard cartesian approach, using Spatial Transformer Networks (STN) [17] on a regularly spaced sampling grid, defined by

x_i^s = x_i + x_i^t · cos(θ_i) · σ_i / W − y_i^t · sin(θ_i) · σ_i / H,        (2)
y_i^s = y_i + x_i^t · sin(θ_i) · σ_i / W + y_i^t · cos(θ_i) · σ_i / H.

We denote these patches as Cart.

Figure 2: Cartesian vs Log-Polar. (a,c) Two images taken from different viewpoints with four pairs of corresponding keypoints, denoted by their color. (b,d) Patches around these keypoints extracted with their estimated scale and orientation, with λ = 16, similarly color-coded. On each column, we show cartesian patches on the left and log-polar patches on the right. While cartesian patches can look very different, log-polar ones remain similar. This is particularly visible for the red keypoint, whose scale estimates are very different in the two images.
Note that STN and PTN were designed to facilitate whole-image classification by allowing deep networks to manipulate data spatially, thus removing the burden of learning spatial invariance from the classifier. This is not applicable here: we only use their respective samplers, which allow us to efficiently sample the images with in-line data augmentation at a negligible computational cost by applying small perturbations when extracting the patches.

The following properties of log-polar patches distinguish them from cartesian ones:

• Rotations in cartesian space correspond to shifts along the polar axis in log-polar space (rotation equivariance).
• Peripheral regions are undersampled, which means that paired patches look similar to the eye even under drastic scale changes (scale equivariance).

This phenomenon is illustrated in Fig. 2. Note how the log-polar representation facilitates visual matching even when scales are mismatched. Our approach is predicated on leveraging this information effectively with the deep networks and training framework introduced in the next section.
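For illustration, the following is a minimal sketch of the log-polar warp of Eq. (1); it is not the released implementation. It assumes a square L × L patch (so W = H = L in Eq. (1)), takes the keypoint parameters as plain floats, and performs the bilinear lookup with PyTorch's grid_sample. All function and variable names are our own.

import math
import torch
import torch.nn.functional as F

def logpolar_patch(img, x, y, sigma, theta, L=32, lam=12.0):
    # img: (1, 1, H, W) grayscale tensor; (x, y) keypoint center in pixels,
    # sigma its SIFT scale, theta its orientation in radians.
    _, _, H, W = img.shape
    r = lam * sigma                                      # support-region radius, r_i = lambda * sigma_i
    xt = torch.arange(L, dtype=torch.float32).view(1, L).expand(L, L)   # target x: log-radial axis
    yt = torch.arange(L, dtype=torch.float32).view(L, 1).expand(L, L)   # target y: angular axis
    phi = theta + 2.0 * math.pi * yt / L                 # phi_i = theta_i + 2*pi*y^t / H
    rad = torch.exp(math.log(r) * xt / L)                # e^{log(r_i) * x^t / W}
    xs = x + rad * torch.cos(phi)                        # source coordinates, Eq. (1)
    ys = y + rad * torch.sin(phi)
    # grid_sample expects coordinates normalized to [-1, 1], ordered (x, y).
    grid = torch.stack((2.0 * xs / (W - 1) - 1.0, 2.0 * ys / (H - 1) - 1.0), dim=-1)
    return F.grid_sample(img, grid.unsqueeze(0), mode='bilinear', align_corners=True)

# Example: a 32 x 32 log-polar patch around a keypoint at (120, 85).
# patch = logpolar_patch(torch.rand(1, 1, 480, 640), 120.0, 85.0, sigma=2.5, theta=0.3)

The innermost rings of this grid oversample the immediate neighbourhood of the keypoint, while the outer rings undersample the periphery, which is the property discussed above.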
To extract patch descriptors, we use a HardNet [28] architecture. As shown in Fig. 3, our network has seven convolutional layers and takes as input grayscale patches of size 32 × 32.
Input patches are pre-processed with Instance Normalization [44]. Feature maps are zero-padded after each convolutional layer, and we use strided convolutions instead of pooling layers. Each convolution is followed by a ReLU and Batch Normalization, but the last convolutional layer omits the ReLU. We apply dropout regularization with a rate of 0.1 after the last ReLU. The final convolutional layer is followed by Batch Normalization and l2 normalization. The output of the network is a descriptor of unit length and size 128. We found this to be a good compromise between descriptor size and performance.

The standard way to train such networks is in a siamese configuration, with two copies of the network, sharing weights. Among the many loss formulations that have been proposed [38, 4, 19, 15], we use the triplet loss of [4], as in [28]. To build the required triplets, we consider a collection of patch pairs {P_k^a, P_k^b} which contain two different views of a 3D point, where k = 1 ... K, with K denoting the batch size. We systematically check that the 3D points in a given batch are unique, so that P_i^a and P_j^b only correspond if i = j. We denote their respective descriptors as {f_k^a, f_k^b}. We then mine negative samples with the ‘hardest-in-batch’ procedure of [28]. Specifically, we build a pairwise distance matrix D_{i,j} = d(f_i^a, f_j^b), i, j ∈ [1, K], where d(f_i^a, f_j^b) is the Euclidean distance between descriptors f_i^a and f_j^b if i ≠ j, and an arbitrarily large value otherwise. We denote the hardest negative sample for P_k^a, i.e., the one with the smallest distance, as P_{k_min}^b, and the hardest negative sample for P_k^b as P_{k_min}^a. We consider both P_k^a and P_k^b as possible anchors, for all k. Denoting a triplet with anchor (A), positive (+) and negative (−) patches as (A, +, −), we form triplet k taking the hardest negative example, i.e. {P_k^a, P_k^b, P_{k_min}^b} if d(P_k^a, P_{k_min}^b) < d(P_k^b, P_{k_min}^a), and {P_k^b, P_k^a, P_{k_min}^a} otherwise. We then take the loss to be

L(f^A, f^+, f^−) = Σ_{k=1}^{K} max(0, 1 + ‖f_k^A − f_k^+‖ − ‖f_k^A − f_k^−‖).

We set the batch size K to 1000. For optimization we use Stochastic Gradient Descent (SGD) with a learning rate of 10, momentum of 0.9, and weight decay of 10^-4, and decay the learning rate linearly to zero within a set number of training epochs [28]. Sampling the patches in-line allows us to apply data augmentation at training time, jittering the orientation of each anchor keypoint by a zero-mean Gaussian perturbation ∆θ, in degrees. Our implementation uses PyTorch as a back-end. Code, models and training data are all available at https://github.com/cvlab-epfl/log-polar-descriptors.
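As an illustration of the ‘hardest-in-batch’ mining described above, here is a compact sketch; it is our own paraphrase, not the released training code. It assumes L2-normalized descriptors and a margin of 1, and averages the per-triplet losses instead of summing them, which only rescales the gradient.

import torch

def hardest_in_batch_triplet_loss(fa, fb, margin=1.0):
    # fa, fb: (K, 128) descriptors of corresponding patches P^a_k, P^b_k.
    K = fa.shape[0]
    D = torch.cdist(fa, fb)                               # D[i, j] = d(f^a_i, f^b_j)
    D = D + 1e8 * torch.eye(K, device=fa.device)          # positives get an arbitrarily large value
    d_pos = (fa - fb).norm(dim=1)                         # d(f^a_k, f^b_k)
    d_neg_a = D.min(dim=1).values                         # hardest negative when P^a_k is the anchor
    d_neg_b = D.min(dim=0).values                         # hardest negative when P^b_k is the anchor
    d_neg = torch.min(d_neg_a, d_neg_b)                   # keep the side with the harder negative
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()

# Example with a batch of K = 1000 unit-length descriptors:
# fa = torch.nn.functional.normalize(torch.randn(1000, 128), dim=1)
# fb = torch.nn.functional.normalize(torch.randn(1000, 128), dim=1)
# loss = hardest_in_batch_triplet_loss(fa, fb)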
Figure 3: Network architecture. We extract patches of size 32 × 32 in-line with a sampler (pictured: PTN) on the desired keypoints. This enables data augmentation techniques. The patches are then given to a network which produces descriptors of size 128.
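For completeness, a compact sketch of a descriptor network matching the description above (seven convolutions, strided downsampling instead of pooling, BN and ReLU after every convolution except the last, dropout 0.1, and a final L2 normalization). The channel widths follow the public HardNet model and are an assumption on our part; this is not the released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

class LogPolDesc(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.InstanceNorm2d(1),                     # input normalization [44]
            conv_bn_relu(1, 32), conv_bn_relu(32, 32),
            conv_bn_relu(32, 64, stride=2), conv_bn_relu(64, 64),
            conv_bn_relu(64, 128, stride=2), conv_bn_relu(128, 128),
            nn.Dropout(0.1),                          # dropout after the last ReLU
            nn.Conv2d(128, dim, 8, bias=False),       # 8x8 convolution, no ReLU afterwards
            nn.BatchNorm2d(dim))

    def forward(self, patches):                       # patches: (N, 1, 32, 32)
        x = self.features(patches).view(patches.shape[0], -1)
        return F.normalize(x, p=2, dim=1)             # unit-length descriptors of size 128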
4. Experiments
In Section 4.1, we introduce the dataset we built to train scale-invariant descriptors, because there is currently none available for this purpose. We then compare ourselves to the state of the art on it. In Sections 4.2, 4.3 and 4.4, we benchmark our models on three publicly available datasets: HPatches [3], AMOS patches [30], and the CVPR'19 PhotoTourism image matching challenge [1]. As baselines, we consider: SIFT [23], TFeat [4], L2-Net [41], HardNet [28], and GeoDesc [25]. We use OpenCV for SIFT, and public implementations for the rest. For our own method we consider descriptors learned with either cartesian or log-polar patches.
Nearly all learned descriptors rely on the dataset of [7] for training, which provides patches extracted over different viewpoints for three different scenes. Correspondences were established from SfM reconstructions and SIFT. They are thus biased towards keypoints that can be matched with SIFT, i.e., commensurate in terms of scale. In order to learn scale-invariant descriptors under real-world conditions, we require patches extracted at non-corresponding scales, for which we need the original images, which are not provided by [7]. Other datasets, such as [27] or [3], provide images along with homographies for correspondence, but focus on affine transformations and are much too small to train deep networks effectively. Therefore we collected a new dataset for training purposes. In the remainder of this section, we detail how we created it and then report our results on it.
We applied COLMAP [34], a state-of-the-art SfM framework, over large collections of photo-tourism images originally collected by [16]. These images show drastic changes in terms of viewpoint, illumination, and other imaging properties, which is crucial to learn invariance [48]. In addition to sparse reconstructions, COLMAP provides dense correspondences in the form of depth maps. We used them to generate training data by randomly selecting a pair of images I_i and I_j, extracting SIFT keypoints for both, and using the depth maps to build ground truth correspondences. To do this we projected each keypoint from one image to the other using the estimated poses and depth maps. We took a correspondence (m, n) to be valid if the projection of keypoint m in image i falls within 1.5 pixels of keypoint n in image j. We performed a bijective check to ensure one-to-one correspondences. We applied this projection in a cycle, from i to j and back to i, to ensure that the depth estimates are consistent across both views, and discarded the putative correspondence otherwise. Points which fall in occluded areas were likewise discarded. Note that we only check for corresponding locations, but not scales: in this manner we are collecting SIFT keypoints with non-matching scales whose distribution comes from real-world data.

We also require the orientations to be compatible across views. To guarantee this we use the ground truth camera poses to compute the difference between orientation estimates and filter out keypoint matches outside 25°, as in [7]. Finally, we suppress pairs of keypoints closer than 7 pixels to each other, to exclude patches with large overlaps, which would be particularly problematic for cartesian patches.

We can similarly use the ground truth to warp the scale across images, which we do in order to estimate the frequency of inaccurate scale estimates. Given a correspondence (m, n) comprised of two keypoints with scales (s_i^m, s_j^n), we warp the scale from image i to image j to obtain ŝ_i^m, and compute the scale difference ratio as r = max(ŝ_i^m, s_j^n) / min(ŝ_i^m, s_j^n), so that r ≥ 1, with 1 encoding perfect scale correspondence. We histogram this ratio and use it to evaluate each method under scale changes, as depicted in Fig. 4.

We select 11 sequences for training, and 9 for testing. Please refer to the supplementary material for details. We split the training sequences into training and validation sets on a per-image basis, with a 70:30 ratio. Images are downsampled to a maximum height or width of 1024 pixels, which is the resolution that we extract keypoints at, and mirror-padded to 1500 × 1500 pixels in order to accommodate large values of λ. We generate up to 1000 correspondences for each image pair, and extract the patches from the images on the fly.

Training requires negative samples, that is, points not in correspondence. Finding negatives is easy when a SfM reconstruction is available, as done in [7], ensuring that keypoints are stable across all views. This is not feasible in our case. Instead, we generate training samples from a single image pair at a time.
Specifically, we take one image pair from each of the 11 sequences and use it to fill roughly 1/11th of each training batch. We can then perform negative mining over the entire batch, as outlined in Section 3.2.
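A minimal sketch of the ground-truth correspondence test and the scale-ratio statistic described above follows; it is our own paraphrase, not the dataset-generation code. The 1.5-pixel threshold, the i → j → i cycle, and r = max/min follow the text, while the pinhole reprojection helper and all names are assumptions.

import numpy as np

def reproject(xy, depth_src, K_src, K_dst, T_dst_src):
    # Back-project pixel xy with its depth in the source view, then project it into the
    # destination view. T_dst_src is the 4x4 relative pose; returns None if depth is missing.
    x, y = xy
    h, w = depth_src.shape
    xi, yi = int(round(float(x))), int(round(float(y)))
    if not (0 <= xi < w and 0 <= yi < h):
        return None
    z = depth_src[yi, xi]
    if z <= 0:                                            # occluded or missing depth
        return None
    p_src = z * np.linalg.inv(K_src) @ np.array([x, y, 1.0])
    p_dst = (T_dst_src @ np.append(p_src, 1.0))[:3]
    uv = K_dst @ p_dst
    return uv[:2] / uv[2]

def is_valid_match(kp_i, kp_j, depth_i, depth_j, K_i, K_j, T_ji, T_ij, thr=1.5):
    # kp_i, kp_j: (2,) keypoint locations. Accept (m, n) if keypoint m projects within
    # thr pixels of keypoint n, and the i -> j -> i cycle lands back near the start.
    p_ij = reproject(kp_i, depth_i, K_i, K_j, T_ji)
    if p_ij is None or np.linalg.norm(p_ij - kp_j) > thr:
        return False
    p_iji = reproject(p_ij, depth_j, K_j, K_i, T_ij)
    return p_iji is not None and np.linalg.norm(p_iji - kp_i) <= thr

def scale_ratio(s_i_warped, s_j):
    # r >= 1, with r = 1 encoding perfect scale correspondence.
    return max(s_i_warped, s_j) / min(s_i_warped, s_j)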
Sequence                     SIFT    TFeat   L2-Net   GeoDesc   HardNet   Ours LogPol (λ=12)   Ours LogPol (λ=96)
british museum               5.91    3.53    3.52     4.30      3.21      2.17                 2.18
florence cathedral side      4.36    1.30    0.51     2.13      0.40      0.23                 0.23
lincoln memorial statue      2.89    4.32    2.28     2.61      1.65      1.30                 1.31
milan cathedral              7.08    1.98    1.48     1.86      0.35      0.19                 0.12
mount rushmore              18.71   11.94    2.52     2.27      0.43      0.42                 0.32
reichstag                    2.22    0.44    0.30     0.42      0.21      0.19                 0.19
sagrada familia              9.01    2.41    0.85     1.08      0.27      0.21                 0.19
st pauls cathedral           8.64    2.01    1.48     2.45      0.68      0.42                 0.46
united states capitol        8.67    3.90    2.64     5.43      1.60      1.33                 0.98
Average                      7.50    3.54    1.73     2.51      0.98      0.72                 0.67

Table 1: FPR95 on our new dataset.
We benchmark our models against the baselines with patches extracted at the SIFT scale, λ = 12. We also show that log-polar models are able to leverage much larger support regions, using λ = 96. By contrast, with cartesian patches performance degrades as we increase the support region, as we demonstrate in the ablation study of Table 2.

In this section we evaluate performance in terms of patch matching over the test sequences. We extract descriptors for SIFT keypoints with corresponding locations, but using their original scales, which are not always in correspondence. We train our networks with cartesian and log-polar patches, keeping all other settings identical. We use the standard metric in patch matching benchmarks, FPR95, i.e., the False Positive Rate at 95% True Positive recall. For the baselines, we extract patches at the SIFT scale, i.e., λ = 12. We also consider λ > 12 for log-polar patches. We report the results in Table 1 and discuss them below.
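For reference, FPR95 can be computed as follows; this is a small helper of our own, not part of any benchmark code.

import numpy as np

def fpr95(dist, is_match):
    # dist: (N,) descriptor distances for candidate matches; is_match: (N,) ground-truth booleans.
    dist = np.asarray(dist, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    pos = np.sort(dist[is_match])
    thr = pos[int(np.ceil(0.95 * len(pos))) - 1]        # distance threshold reaching 95% recall
    return float(np.mean(dist[~is_match] <= thr))       # fraction of negatives accepted at that threshold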
Comparison to the state of the art.
Our models trained with log-polar patches deliver the best performance on each sequence, followed by our models trained on cartesian patches, and then HardNet. Remarkably, we achieve our best results with λ = 96, which corresponds to patches much larger than those best-suited for traditional descriptors, extracted with λ = 12, a fact that we will examine more closely below. Note the small gap between HardNet and Ours-Cartesian, which is due to the innate differences between datasets and training the latter with mismatched scales. The other baselines perform significantly worse.

Performance under large scale mismatches.
In Fig. 4 we break down the results of Table 1 in terms of orientation and scale mismatches. Note how models trained on log-polar representations can tolerate a wide range of scale mismatches. Our results show a negligible drop in performance under scale changes of up to 2-3x, and remain useful even at 3-4x. All baselines degrade significantly under scale changes of 2x and become essentially useless beyond that. Note that this invariance is made possible by leveraging log-polar representations and cannot be achieved by simply exposing the models to cartesian patches exhibiting scale changes, as evidenced by the performance of Ours-Cartesian shown in Fig. 4-(c). Finally, remember that this data has been collected from real-world settings with unreliable scale detection. In other words, our models allow us to retrieve more correspondences without changing the detector.
Increasing the size of the support region.
As shown in Fig. 2, patches extracted with log-polar sampling are remarkably similar across different scales, because scale changes correspond to shifts in the horizontal dimension. This representation is not only easier to interpret visually, but also easier to learn invariant models with. Moreover, oversampling the immediate neighbourhood of the point allows us to leverage larger support regions, because the effect of occlusions and background motion in log-polar patches is smaller than in their cartesian counterparts. We demonstrate this by training models for different values of λ, and report the results in Table 2. Our models are able to exploit support regions much larger than cartesian-based approaches. We see performance flatten out at λ = 96, and observe boundary issues beyond that point, so we use this value for all experiments in the paper. Note how the radius of the circle determining the support region is 8 times larger than the optimal value for cartesian patches, and its area 64 times larger.
Note that we use an identical architecture, which can only leverage this information effectively thanks to the log-polar representation.
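Explicitly, since the radius of the support region is r_i = λ σ_i (Section 3.1), going from λ = 12 to λ = 96 scales the region as

\[
\frac{r_{96}}{r_{12}} = \frac{96\,\sigma_i}{12\,\sigma_i} = 8,
\qquad
\frac{A_{96}}{A_{12}} = 8^2 = 64 .
\]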
Figure 4: FPR95 vs scale and orientation changes. Panels: (a) SIFT, (b) L2-Net, (c) Ours-Cart (λ = 12), (d) HardNet, (e) Ours-LogPol (λ = 12), (f) Ours-LogPol (λ = 96). We break down the results of Table 1, histogramming them by the error in the keypoint detection stage. Orientation misdetections increase top to bottom, up to 25°; scale misdetections increase left to right, up to 4x. (a,b,d) All baselines degrade quickly under scale changes. (c) Training deep networks with cartesian patches exhibiting scale changes is not sufficient. (e,f) By contrast, our log-polar representation enables them to learn scale invariance. Note that some bins are sparsely populated, which explains sudden discontinuities.

Next, we evaluate performance in terms of patch retrieval. For every image pair in the test sequences, we extract SIFT keypoints on each image and establish ground-truth correspondences using the procedure outlined in Section 4.1. Matches with an orientation difference of up to 25 degrees are considered positives. Typically, a large percentage of the image pixels are occluded, so it is not possible to generate a large number of matches. Instead, for every pair of images, we extract up to N_m = 500 matches and then generate N_d = 3000 distractors, defined as keypoints further than 3 pixels away from any matching keypoint. The task is thus to find the needle in the haystack, where every keypoint has one positive match and N_m + N_d − 1 negatives. We compute the distance between descriptors, extract the rank of each match, and accumulate it over all keypoints and image pairs. The results are summarized in Fig. 5. Our models with log-polar patches obtain the best results, retrieving the correct match about 97% of the time for our best model, for λ = 96. They are followed by our models with cartesian patches, and HardNet. Notice that, contrary to the previous experiment, we evaluate on a realistic patch retrieval scenario with a large number of distractors, which indicates that our performance holds even when sampling keypoints densely, and that it does so regardless of λ.
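The retrieval protocol above amounts to ranking, for every keypoint, its true match against all other candidates; a minimal sketch of our own is given below.

import torch

def match_ranks(query, positive, distractors):
    # query, positive: (N, D) descriptors of corresponding keypoints; distractors: (M, D).
    # For each query, the other N - 1 positives and the M distractors act as negatives.
    cand = torch.cat([positive, distractors], dim=0)             # (N + M, D) candidate pool
    dist = torch.cdist(query, cand)                              # (N, N + M) descriptor distances
    idx = torch.arange(query.shape[0])
    d_true = dist[idx, idx]                                      # distance to the true match
    return 1 + (dist < d_true.unsqueeze(1)).sum(dim=1)           # rank 1 = correct match retrieved first

# The curves in Fig. 5 are CDFs of these ranks, e.g. (ranks <= k).float().mean() for each k.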
Table 2: FPR95 vs λ. We evaluate models trained with differently-sized support regions, λ ∈ {12, 16, 32, 64, 96, 128}, for both cartesian and log-polar patches. Performance increases with λ for log-polar patches, but quickly degrades for cartesian ones.
Figure 5: Patch retrieval on the new dataset. We plot the cumulative distribution function of the matching rank in a patch retrieval scenario with a large number of distractors, comparing SIFT, TFeat, L2-Net, HardNet and Ours-Cart at λ = 12 with Ours-LogPol at λ = 96. Our models outperform all the baselines. Log-polar models (pink) are significantly better than cartesian ones (purple) and baselines based on cartesian patches, such as HardNet (red).

The HPatches dataset [3] contains 116 sequences with 6 images each, with either viewpoint or illumination changes. As in [7], HPatches provides pre-extracted patches sampled at corresponding scales, which are not useful for our purposes. However, it also provides the original images and ground truth homographies. We thus define the following protocol. We use SIFT to find keypoints and determine correspondences among them using the ground truth homographies. We consider sequences with viewpoint and illumination changes separately. This provides us with 20733 correspondences for the illumination split and 22079 correspondences for the viewpoint split. For every match, we compute the distance between the pair of corresponding descriptors and also to all the negatives in the dataset, and evaluate our models in terms of the rank-1 metric, i.e., the percentage of samples for which we can retrieve the correct match with rank 1. We show the results in Table 3. As expected, our log-polar models outperform most of the baselines, and perform better as λ increases. For this experiment we use the models trained on our dataset, without fine-tuning.
Results on HPatches.
Rank-1 performance on the view-point and illumination splits of the HPatches dataset. Our log-polar sampling approach performs better on average than all thebaselines, and performance increases with λ , until it saturates. our models in terms of the rank-1 metric, i.e ., the percent-age of samples for which we can retrieve the correct matchwith rank 1. We show the results in Table 3. As expected,our log-polar models outperform most of the baselines, andperform better as λ increases. For this experiment we usethe models trained on our dataset, without fine-tuning. We also consider AMOS patches [30], a dataset releasedrecently featuring pairs of images captured by webcams andcarefully curated in order to provide correspondences. Weevaluate our method on the training split, which consists of27 sequences, each with 50 images, and which also provideskeypoints with scales and orientations for every image. Weuse unique matching keypoint pairs across all images, ob-taining a split of 13268 unique keypoint pairs. We use thesame metric as for HPatches, and summarize the results inTable 4. As before, we do not re-train the models in anyway. Again, our models outperform the state of the art andour results improve with the size of the support region, un-like for methods based on cartesian patches.
Patch matching performance does not always translate to upstream applications, as evidenced by [48, 35]. We thus also evaluate our method on the public Phototourism Image Matching challenge [1]. This benchmark features two tracks: stereo and multi-view matching, and evaluates local features in terms of the quality of the reconstructed poses. Features are submitted to the organizers, who compute the results. We provide them in Table 5, including comparable baselines (up to 8k features per image, matched by brute-force nearest-neighbour) extracted from the public leaderboards. Our method ranks second on both tracks, and first in terms of average rank. Note that our observations from Section 4.1.2 carry over: models trained on log-polar patches improve with patch size, and outperform cartesian models.
Table 4: Results on AMOS patches. Rank-1 performance on the AMOS patches dataset. We noticed that for this dataset, extracting descriptors with smaller patches produces better results for most baselines, so we also consider λ = 6. Our models trained on log-polar patches outperform the state of the art, and performance increases with λ.

Type   Method              Stereo task (mAP 15° / rank†)   Multi-view task (mAP 15° / rank†)
DoG    SIFT (IJCV'04)      0.0277 / 9                      0.4146 / 8
DoG    TFeat (BMVC'16)     0.0357 / 8                      0.4643 / 7
DoG    L2-Net (CVPR'17)    0.0400 / 6                      0.5087 / 5
DoG    HardNet (NIPS'17)   0.0425 / 4
Table 5: PhotoTourism challenge. Mean average precision in pose estimation with an error threshold of 15°. Top method († among comparable submissions) in red, runner-up in green. We rank 2nd on both tracks, and 1st on average.
5. Conclusions and Future Work
We have introduced a novel approach to learn local descriptors that goes beyond the current paradigm, which relies on image measurements sampled in cartesian space. We show that we can learn richer and more scale-invariant representations by coupling log-polar sampling with state-of-the-art deep networks. This allows us to match local descriptors across a wider range of scales, virtually for free. Our approach could be used to learn invariance to arbitrary scale changes. This can be, however, counterproductive when used alongside SIFT, as the majority of its detections are accurate enough. Instead, we intend to bypass scale detection and learn end-to-end pipelines as in [48, 29].
References

[1] Phototourism Challenge, CVPR 2019 Image Matching Workshop. https://image-matching-workshop.github.io. Accessed August 1, 2019.
[2] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. FREAK: Fast Retina Keypoint. In CVPR, 2012.
[3] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. In CVPR, 2017.
[4] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning Local Feature Descriptors with Triplets and Shallow Convolutional Neural Networks. In BMVC, 2016.
[5] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In ECCV, 2006.
[6] Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape Matching and Object Recognition Using Shape Contexts. PAMI, 24(24):509–522, April 2002.
[7] Matthew Brown, Gang Hua, and Simon Winder. Discriminative Learning of Local Image Descriptors. PAMI, 2011.
[8] Gabriela Csurka and Martin Humenberger. From Hand-crafted to Deep Local Invariant Features. arXiv preprint arXiv:1807.10254, 2018.
[9] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. CVPR Workshop on Deep Learning for Visual SLAM, 2018.
[10] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In CVPR, 2019.
[11] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar Transformer Networks. In ICLR, 2018.
[12] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C. Berg. MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching. In CVPR, 2015.
[13] Christopher G. Harris and Mike J. Stephens. A Combined Corner and Edge Detector. In Fourth Alvey Vision Conference, 1988.
[14] Tal Hassner, Viki Mayzels, and Lihi Zelnik-Manor. On SIFTs and Their Scales. In CVPR, 2012.
[15] Kun He, Yan Lu, and Stan Sclaroff. Local Descriptors Optimized for Average Precision. In CVPR, 2018.
[16] Jared Heinly, Johannes L. Schoenberger, Enrique Dunn, and Jan-Michael Frahm. Reconstructing the World in Six Days. In CVPR, 2015.
[17] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial Transformer Networks. In NIPS, pages 2017–2025, 2015.
[18] Yan Ke and Rahul Sukthankar. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In CVPR, pages 111–119, 2000.
[19] Michel Keller, Zetao Chen, Fabiola Maffra, Patrik Schmuck, and Margarita Chli. Learning Deep Descriptors with Scale-Aware Triplet Networks. In CVPR, 2018.
[20] Iasonas Kokkinos, Michael Bronstein, and Alan Yuille. Dense Scale Invariant Descriptors for Images and Surfaces. Technical report, INRIA, 2012.
[21] Stefan Leutenegger, Margarita Chli, and Roland Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. In ICCV, 2011.
[22] Ce Liu, Jenny Yuen, and Antonio Torralba. SIFT Flow: Dense Correspondence Across Scenes and Its Applications. In ECCV, 2008.
[23] David Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 20(2):91–110, Nov 2004.
[24] Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. ContextDesc: Local Descriptor Augmentation with Cross-Modality Context. In CVPR, 2019.
[25] Zixin Luo, Tianwei Shen, Lei Zhou, Siyu Zhu, Runze Zhang, Yao Yao, Tian Fang, and Long Quan. GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints. In ECCV, 2018.
[26] Krystian Mikolajczyk and Cordelia Schmid. A Performance Evaluation of Local Descriptors. PAMI, 27(10):1615–1630, 2004.
[27] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A Comparison of Affine Region Detectors. IJCV, 65(1/2):43–72, 2005.
[28] Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working Hard to Know Your Neighbor's Margins: Local Descriptor Learning Loss. In NIPS, 2017.
[29] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. LF-Net: Learning Local Features from Images. In NIPS, 2018.
[30] Milan Pultar, Dmytro Mishkin, and Jiri Matas. Leveraging Outdoor Webcams for Local Descriptor Learning. In Computer Vision Winter Workshop, 2019.
[31] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2D2: Repeatable and Reliable Detector and Descriptor. arXiv preprint, 2019.
[32] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An Efficient Alternative to SIFT or SURF. In ICCV, 2011.
[33] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions. In CVPR, 2018.
[34] Johannes L. Schönberger and Jan-Michael Frahm. Structure-From-Motion Revisited. In CVPR, 2016.
[35] Johannes L. Schönberger, Hans Hardmeier, Torsten Sattler, and Marc Pollefeys. Comparative Evaluation of Hand-Crafted and Learned Local Features. In CVPR, 2017.
[36] Qi Shan, Changchang Wu, Brian Curless, Yasutaka Furukawa, Carlos Hernandez, and Steven M. Seitz. Accurate Geo-registration by Ground-to-Aerial Image Matching. In 3DV, 2014.
[37] Eli Shechtman and Michal Irani. Matching Local Self-Similarities Across Images and Videos. In CVPR, 2007.
[38] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In ICCV, 2015.
[39] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Learning Local Feature Descriptors Using Convex Optimisation. PAMI, 2014.
[40] Christoph Strecha, Alex Bronstein, Michael Bronstein, and Pascal Fua. LDAHash: Improved Matching with Smaller Descriptors. PAMI, 34(1), January 2012.
[41] Yurun Tian, Bin Fan, and Fuchao Wu. L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In CVPR, 2017.
[42] Engin Tola, Vincent Lepetit, and Pascal Fua. A Fast Local Descriptor for Dense Matching. In CVPR, 2008.
[43] Eduard Trulls, Iasonas Kokkinos, Alberto Sanfeliu, and Francesc Moreno-Noguer. Dense Segmentation-Aware Descriptors. In Dense Image Correspondences for Computer Vision, 2015.
[44] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Improved Texture Networks: Maximizing Quality and Diversity in Feed-Forward Stylization and Texture Synthesis. In CVPR, 2017.
[45] Xing Wei, Yue Zhang, Yihong Gong, and Nanning Zheng. Kernelized Subspace Pooling for Deep Local Descriptors. In CVPR, 2018.
[46] Simon Winder and Matthew Brown. Learning Local Image Descriptors. In CVPR, June 2007.
[47] Alessio Xompero, Oswald Lanz, and Andrea Cavallaro. MORB: A Multi-Scale Binary Descriptor. In ICIP, 2018.
[48] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
[49] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to Find Good Correspondences. In CVPR, 2018.
[50] Sergey Zagoruyko and Nikos Komodakis. Learning to Compare Image Patches via Convolutional Neural Networks. In CVPR, 2015.
[51] Dang Zheng, Kwang Moo Yi, Yinlin Hu, Fei Wang, Pascal Fua, and Mathieu Salzmann. Eigendecomposition-Free Training of Deep Networks with Zero Eigenvalue-Based Losses. In ECCV, 2018.
[52] Lei Zhou, Siyu Zhu, Tianwei Shen, Jinglu Wang, Tian Fang, and Long Quan. Progressive Large Scale-Invariant Image Matching in Scale Space. In ICCV, 2017.

Beyond Cartesian Representations for Local Descriptors: Supplementary Material
In order to train scale-invariant models with real data relevant to wide-baseline stereo, it was necessary to collect training data. For this we rely on public collections of photo-tourism images in the Yahoo Flickr Creative Commons 100M (YFCC) dataset. We use COLMAP, a Structure from Motion (SfM) framework, to obtain 3D reconstructions. COLMAP provides us with sparse point clouds and depth maps for each image. We clean up the depth maps following the procedure outlined in the paper and use them, along with the ground truth camera poses, to project keypoints between corresponding images.

We sample pairs of images with a visibility check in order to guarantee that a minimum number of keypoints can be extracted and matched across both views. Specifically, we retrieve the SfM keypoints in common over both views, extract their bounding box, and reject the image pair if it is smaller than a given threshold (we use 0.5) for either image.

We use 11 sequences for training and validation and 9 for testing. We list their details in Table 6, and give some examples in Fig. 6. This data will be made publicly available along with code and pre-trained models.

Training sequences:
Sequence name                Num. images
brandenburg gate             1363
buckingham palace            1676
colosseum exterior           2063
grand place brussels         1083
notre dame front facade      3765
palace of westminster        983
pantheon exterior            1401
sacre coeur                  1179
st peters square             2504
taj mahal                    1312
temple nara japan            904
Total                        18233

Test sequences:
Sequence name                Num. images
british museum               660
florence cathedral side      108
lincoln memorial statue      850
milan cathedral              124
mount rushmore               138
reichstag                    75
sagrada familia              401
st pauls cathedral           615
united states capitol        258
Total                        4107
Table 6:
Dataset details.
Left: training sequences. Right: test sequences.