SEKD: Self-Evolving Keypoint Detection and Description
Yafei Song*, Ling Cai, Jia Li, Yonghong Tian*, and Mingyang Li
A.I. Labs, Alibaba Group; School of Electronics Engineering and Computer Science, Peking University; School of Computer Science and Engineering, Beihang University
Abstract
Researchers have attempted to utilize deep neural networks (DNNs) to learn novel local features from images, inspired by recent DNN successes on a variety of vision tasks. However, existing DNN-based algorithms have not achieved comparably remarkable progress, which can be partly attributed to insufficient utilization of the interactive characters between the local feature detector and descriptor. To alleviate these difficulties, we emphasize two desired properties, i.e., repeatability and reliability, to simultaneously summarize the inherent and interactive characters of the local feature detector and descriptor. Guided by these properties, a self-supervised framework, namely self-evolving keypoint detection and description (SEKD), is proposed to learn an advanced local feature model from unlabeled natural images. Additionally, to guarantee performance, novel training strategies have been dedicatedly designed to minimize the gap between the learned feature and its desired properties. We benchmark the proposed method on homography estimation, relative pose estimation, and structure-from-motion tasks. Extensive experimental results demonstrate that the proposed method outperforms popular hand-crafted and DNN-based methods by remarkable margins. Ablation studies also verify the effectiveness of each critical training strategy. We will release our code along with the trained model publicly.
1. Introduction
Local feature, specifically referring to the local point feature in this paper, is extensively employed in a large number of computer vision applications, such as image stitching [5], content-based image retrieval [11], image-based localization [16, 32], structure-from-motion (SfM) [1], and simultaneous localization and mapping (SLAM) [38]. In these applications, the quality of the local feature module significantly influences the overall system performance and thus must be studied and optimized in depth.

*Corresponding authors: Yafei Song and Yonghong Tian. E-mail: {[email protected], [email protected]}

Figure 1. Desired properties of local features, illustrated on two images of the same scene with their detection and description results. Detector repeatability (1.1): a visible scene point should be detected in all images. Descriptor repeatability (1.2): the descriptor of the same point is invariant over different images. Detector reliability (2.1): given a descriptor, detected keypoints can be distinguished by their descriptors. Descriptor reliability (2.2): given a detector, descriptors can distinguish the detected keypoints.

In general, a standard local feature algorithm can be divided into two modules, i.e., keypoint detection and description. For each keypoint, its inner-image location is determined via the detection module, while its descriptor is calculated by summarizing the local context information via the description module. Early works on local features primarily originated from hand-crafted methodologies; representative methods include SIFT [19], SURF [4], KAZE [2], AKAZE [25], BRISK [15], ORB [29], and so on.
Although hand-crafted features have been widely used in various computer vision tasks, their rule-based design prevents further performance enhancement along with increasing model representation ability.

Inspired by the great successes of DNNs on a variety of computer vision tasks [12, 27, 6], researchers have been actively working on designing and learning advanced local feature models. Since a local feature consists of both detection and description, each module can be individually replaced and improved by DNN-based methods [13, 31]. Alternatively, both modules can be jointly designed using one DNN model, either by sequentially connected neural networks that first calculate keypoint locations and subsequently compute descriptors [37, 24], or by a single network with a shared backbone and two separate branches for regressing detectors and descriptors, respectively [23, 7, 9, 28].

However, unlike on most tasks, existing DNN-based local features have not achieved such great progress compared with hand-crafted methods, which indicates that it is very challenging to exploit DNNs for local feature learning. As a local feature algorithm consists of two modules, we partly attribute this difficulty to insufficient utilization of their inherent and interactive properties. To alleviate this problem, we analyze the desired properties of local features, including the detector, the descriptor, and their mutual relations. As demonstrated in Fig. 1, the properties can be summarized into two sets, i.e., 'repeatability' and 'reliability', and explained as:

Property 1
Repeatability property of local feature.
Property 1.1
Detector repeatability: If a scene point is detected as a keypoint in one image, it should be detected in all images where it is visible.
Property 1.2
Descriptor repeatability: The descriptor of a scene point should be invariant across all images.
Property 2
Reliability property of local feature.
Property 2.1
Detector reliability: Given a description method, the detector should localize points that can be reliably distinguished by their descriptors.
Property 2.2
Descriptor reliability: Given a detection method, the descriptor should reliably distinguish the detected keypoints.
Repeatability is an inherent property of the detector and the descriptor, respectively, while reliability is the interactive property between them. We note that similar analyses and properties have also been adopted to guide algorithm design in previous works [7, 9, 28]. However, instead of optimizing the detector and descriptor at the same time, we propose to optimize each module in turn. When optimizing the detector or descriptor, both its inherent repeatability property and its interactive reliability property are exploited to design the training strategies. Specifically, we figure out keypoints with reliable descriptors among all points. These keypoints are taken as ground-truth to optimize the detector, guided by the detector reliability property. The optimized detector is then used to detect keypoints from images, and the descriptor is optimized to reliably distinguish the detected keypoints, guided by the descriptor reliability property. This process is iterated until the learned model converges. Moreover, several strategies are adopted to ensure the repeatability property and the convergence of the whole process. This training process is self-evolving, as it needs no additional supervised signals. Extensive experiments have been conducted to compare our model with state-of-the-art methods on homography estimation, relative pose estimation, and structure-from-motion tasks on public datasets; the results verify the effectiveness of our algorithm.

Our main contributions can be summarized as follows:
1. We propose a self-evolving framework guided by the properties of local features, by which an advanced model can be trained effectively using unannotated images.
2. Training strategies are elaborately designed and deployed to keep the learned local feature model aligned with the desired properties.
3. Extensive experiments verify the effectiveness of our framework and training strategies, outperforming state-of-the-art methods.
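The alternating process described above can be sketched as a simple driver loop. This is a hypothetical sketch: the callable arguments (`update_descriptor`, `compute_keypoints`, `update_detector`, `detect`) stand in for the actual training routines elaborated in Sec. 4, and the random initialization mirrors the random keypoint selection at iteration 0.

```python
import random

def self_evolve(images, n_iters, update_descriptor, compute_keypoints,
                update_detector, detect):
    """Driver for the self-evolving loop (hypothetical sketch).

    At iteration 0 the detector is untrained, so keypoints are sampled
    at random; afterwards the current detector provides them.
    """
    # iteration 0: random keypoints for each image
    keypoints = {im: [(random.randrange(64), random.randrange(64))
                      for _ in range(32)] for im in images}
    for _ in range(n_iters):
        update_descriptor(images, keypoints)                      # step (b)
        keypoints = {im: compute_keypoints(im) for im in images}  # step (c)
        update_detector(images, keypoints)                        # step (d)
        keypoints = {im: detect(im) for im in images}             # step (a)
    return keypoints
```

In practice each callable would run many gradient steps; the loop only illustrates the order in which the two branches evolve.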
2. Related Work
In this section, we briefly review well-known local features, which can be categorized into four main groups: hand-crafted methods and three sets of DNN-based approaches.
Hand-crafted methods. Early works on local features primarily rely on hand-crafted rules. One of the most well-known local feature algorithms is SIFT [19], which builds its detector on difference-of-Gaussian operators and calculates descriptors via orientation histograms. After SIFT, plenty of algorithms have been proposed either to approximate the image processing operators for computational efficiency or to seek performance gains by redesigning the detector or descriptor. Representative methods include SURF [4], KAZE [2], AKAZE [25], BRISK [15], and ORB [29]. To date, despite their rule-based design, hand-crafted features can still achieve leading performance in specific applications [14].
DNN-based two-stage methods. Hand-crafted local feature algorithms typically first detect keypoints in images and subsequently calculate descriptors around each keypoint by cropping and summarizing the local context information. This procedure can also be adopted in DNN-based methods by using sequentially connected neural networks [37, 24]. Each network has its own training strategy, optimizing the detector or the descriptor, respectively. We name this kind of method two-stage methods; they can utilize previous expert knowledge in this area. The major disadvantage of the two-stage design is its computational inefficiency, since sequentially connected networks can neither share a large number of computations and parameters nor enable fully parallel computing.
DNN-based one-stage methods. To improve the efficiency of DNN-based local features, researchers have proposed the one-stage paradigm, which typically connects a backbone network with two lightweight head branches [23, 7, 9, 28]. Since the backbone network shares most computations between the detector and descriptor calculation, this type of algorithm achieves significantly lower runtime. The two lightweight branches can be designed either using small neural networks [23, 7, 28] or by hand-crafted methods [9]. In terms of training strategies, all these methods require annotated information for supervised learning: [23] adopted a landmark image dataset with image-level annotations; [9, 28] obtained ground-truth correspondences between images via SfM reconstruction; and [7] relied on synthetic images with generated 'corner'-style keypoints.
DNN-based individual detector/descriptor methods. There are also a number of methods that focus only on a DNN-based detector or descriptor: e.g., [36, 30, 8, 13] proposed DNN-based keypoint detectors, and [31, 22, 34, 21, 20, 33] worked on descriptor computation. However, a local feature algorithm is usually employed as a whole, since the detector and descriptor influence each other's performance. Those methods can be considered pluggable modules for a two-stage algorithm. In this paper, we focus on developing an advanced DNN-based one-stage model.
3. Formulation and Network Architecture
To describe our method better, we first introduce basic notations along with the network architecture; the self-evolving framework and training strategies are elaborated in the next section. As shown in Fig. 2, our network consists of a shared backbone N_b and two lightweight head branches, i.e., a detector branch N_det and a descriptor branch N_des. The backbone N_b consists of 1 convolutional layer and 9 ResNet-v2 blocks [10], and extracts feature maps F ∈ R^{C×H×W} from the input image I ∈ R^{H×W}. Here, H and W are the height and width of the input image I, and C is the number of channels of the extracted feature maps; the hidden feature maps at the initial and the k-th scale are denoted as F_0 and F_k, respectively. The detector branch N_det consists of 2 deconvolutional layers and 1 softmax layer, and predicts the keypoint probability map P ∈ R^{1×H×W} from the feature maps F. Moreover, this branch also contains two shortcut links from low-level features to enhance its localization ability. The descriptor branch N_des consists of 1 ResNet-v2 block and 1 bi-linear up-sampling layer, and extracts a descriptor F(h,w) ∈ R^C of dimension C for each pixel (h,w), where F ∈ R^{C×H×W}.

Figure 2. Overview of our network, which consists of a heavy shared backbone and two lightweight head branches for detection and description, respectively. (Panels: (a) input image, (b) backbone, (c) detector branch, (d) descriptor branch.)

Benefiting from this network structure, our detector and descriptor share most parameters and computations.
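The shared-backbone layout can be illustrated with a toy PyTorch module. This is a minimal sketch, not the paper's actual architecture: layer counts, channel widths, and the sigmoid detector output are placeholders standing in for the 1-conv + 9-ResNet-v2-block backbone, the deconvolution/softmax detector head, and the shortcut links described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySEKD(nn.Module):
    """Toy shared-backbone network with a detector and a descriptor head.

    All layer sizes are illustrative placeholders, not the paper's design.
    """

    def __init__(self, c=32, desc_dim=16):
        super().__init__()
        # shared backbone: downsamples the input by a factor of 4
        self.backbone = nn.Sequential(
            nn.Conv2d(1, c, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # detector head: one per-pixel keypoint probability channel
        self.det_head = nn.Conv2d(c, 1, 3, padding=1)
        # descriptor head: a desc_dim-dimensional descriptor per pixel
        self.des_head = nn.Conv2d(c, desc_dim, 3, padding=1)

    def forward(self, img):
        h, w = img.shape[-2:]
        feat = self.backbone(img)
        # upsample both heads back to the input resolution
        prob = torch.sigmoid(
            F.interpolate(self.det_head(feat), (h, w), mode='bilinear',
                          align_corners=False))
        desc = F.interpolate(self.des_head(feat), (h, w), mode='bilinear',
                             align_corners=False)
        desc = F.normalize(desc, dim=1)  # unit-length descriptors per pixel
        return prob, desc


prob, desc = TinySEKD()(torch.zeros(1, 1, 64, 64))
```

The key point the sketch captures is that a single forward pass through the backbone feeds both heads, so detection and description share most computation.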
4. Self-Evolving Framework
To train the network constructed in Sec. 3, two types of supervisory signals should be provided: the location of each keypoint, and the keypoint correspondences between different images. With the desired properties of local features in mind, we propose to figure out the points with reliable descriptors as keypoints, while pairs of images along with their correspondences can be obtained via affine transformation. The network can then be trained using only unlabeled images. However, as the training data carry no additional annotation, we must carefully design the training strategies to ensure performance.

The overview of our framework is shown in Fig. 3. It mainly consists of four steps: (a) compute the keypoint probability map P using the current detector and filter the keypoints via the non-maximum suppression (NMS) algorithm; (b) update the descriptor branch on the detected keypoints by heightening their descriptors' repeatability and reliability properties; (c) compute keypoints by figuring out points with reliable (both repeatable and distinct) descriptors; (d) update the detector using the newly computed keypoints following the detector repeatability and reliability properties. In what follows, we present each step in detail.

4.1. Detect Keypoints via the Detector

For an input image I, the backbone network N_b extracts feature maps F_0, F_1, F_2 via

F_0, F_1, F_2 = N_b(I).    (1)

The feature maps are subsequently used by the detector branch N_det to estimate the keypoint probability map P
Figure 3. Overview of our self-evolving framework, which consists of four main steps: (a) detect keypoints using the current detector (at iteration 0, keypoints are selected at random), (b) update the descriptor with the detected keypoints, (c) compute keypoints with reliable (both repeatable and distinct) descriptors, where the reliability metric is the ratio between the distinctness metric and the repeatability metric, and (d) refine the detector using the newly computed keypoints.

as

P = N_det(F_0, F_1, F_2).    (2)

A strong response at a pixel of the probability map P indicates a potential keypoint; the map is further filtered by non-maximum suppression (NMS). We set the suppression radius to 4 pixels in all experiments and cap the maximum number of keypoints during the training process.

However, the above process is not designed to ensure robust detection of the same keypoints under varying conditions. In other words, the detection process is not optimized to satisfy the detector repeatability property 1.1 and might lead to sub-optimal results. To this end, we adopt a dedicated data augmentation strategy, namely affine adaption [7]. Specifically, we first apply a random affine transformation and color jitter to each input image and calculate the keypoint probability map. This process is repeated several times, and the average detection result

P = AVG(P_0, P_1, ..., P_m)    (3)

is computed as the final output, where P_0 corresponds to the initial image and the others correspond to the transformed counterparts. Representative examples of the detection process are demonstrated in Fig. 4. Note that the affine adaption is only applied during training.

As the detector has not been optimized at iteration 0, another problem is how to detect keypoints at the start. As shown in Fig. 3(a), we simply select random keypoints for each input image. Even so, we show in experiments that the proposed self-evolving framework converges quickly within just a few iterations.
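The NMS filtering of step (a) can be sketched as a greedy loop over the probability map: repeatedly take the strongest remaining response and suppress a square neighborhood around it. The suppression radius of 4 follows the text; the keypoint cap here is a placeholder, since the exact number is not legible in this copy.

```python
import numpy as np


def nms_keypoints(prob, radius=4, max_kp=1000):
    """Greedy radius-NMS on a keypoint probability map (a simple sketch).

    Returns (row, col) keypoint locations in decreasing response order.
    """
    p = prob.astype(float).copy()
    kps = []
    while len(kps) < max_kp:
        y, x = np.unravel_index(np.argmax(p), p.shape)
        if p[y, x] <= 0.0:
            break  # nothing left above zero response
        kps.append((int(y), int(x)))
        # zero out a (2*radius+1)^2 window around the accepted keypoint
        p[max(0, y - radius):y + radius + 1,
          max(0, x - radius):x + radius + 1] = 0.0
    return kps
```

For the affine adaption of eq. (3), the same routine would be run on the average of the probability maps (e.g., `np.mean(np.stack(maps), axis=0)`) rather than on a single map.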
4.2. Update the Descriptor on Detected Keypoints

A keypoint descriptor is a vector associated with each keypoint, used both for re-identifying the same keypoints and for distinguishing different keypoints across images. These descriptor properties are summarized by the repeatability property 1.2 and the reliability property 2.2 in Sec. 1, which serve as guidelines in our descriptor training process.

Figure 4. Representative examples of the keypoint detection process. Our detector operates on both the input image and its affine-transformed counterparts, and calculates the average detection result as the final output.

For each image I, the keypoint detection process described in Sec. 4.1 provides a set of keypoints Q = {Q_i | Q_i = (h_i, w_i)}. The training process starts by applying a random affine transformation and color jitter H to both I and Q, leading to

Î = H(I),    (4)

and

Q̂ = {Q̂_i | Q̂_i = H(h_i, w_i)}.    (5)

Denoting a pair of keypoints by ⟨·,·⟩, ⟨Q_i, Q̂_i⟩ represents a pair of 'ground-truth' matched keypoints. According to the descriptor repeatability property 1.2, their descriptors F_{Q_i} and F̂_{Q̂_i} should be close to each other. On the other hand, according to the descriptor reliability property 2.2, F_{Q_i} should be distinct from all other descriptors except that of its matched keypoint, F̂_{Q̂_i}. Representative examples of the matched and distinct cases are shown in Fig. 3(b) by green and red lines, respectively. Inspired by HardNet [22], we use a triplet loss along with a hard example mining strategy to train the descriptor. Specifically, the loss function is defined as

L_des = (1/n) Σ_i max(0, D_{i,i} − min(D_{i,ĩ}, D_{ĩ,i}) + m),    (6)

where n is the number of keypoints, m denotes the margin parameter, ||·|| represents the L2 distance, and

D_{i,i} = ||F_{Q_i} − F̂_{Q̂_i}||,    (7)
D_{i,ĩ} = min_{j≠i} ||F_{Q_i} − F̂_{Q̂_j}||,    (8)
D_{ĩ,i} = min_{j≠i} ||F_{Q_j} − F̂_{Q̂_i}||.    (9)

The triplet loss (6) endows the descriptor with both the repeatability property (by (7)) and the reliability property (by (8) and (9)).

In addition, as our network shares a common backbone to simultaneously perform keypoint detection and description, the detector branch should also be considered when training the descriptor. To this end, we add a regularization loss term

L'_det = (1/2)(MSE(P, P') + MSE(P̂, P̂'))    (10)

to keep the detection results unchanged, where P is given by (2) and

P' = N'_det(N'_b(I)),  P̂ = N_det(N_b(Î)),  P̂' = N'_det(N'_b(Î)),    (11)

with N'_det, N'_b and N_det, N_b the networks before and after this descriptor training step. The final loss to update the descriptor is

L = L_des + α L'_det,    (12)

where α is a parameter balancing the two losses and is set empirically.

4.3. Compute Keypoints via the Descriptor

The next step of our self-evolving framework is to compute keypoints from the descriptor maps, which remains a challenging problem in the research community. In our work, we propose to calculate keypoints by evaluating the repeatability property 1.2 and reliability property 2.2 of their corresponding descriptors. Furthermore, as the reliability property to some extent contains the repeatability property, these two properties can be summarized into a single reliability score with two aspects, namely repeatability and distinctness.
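The hard-negative triplet loss of eqs. (6)-(9) can be sketched as follows. Here `F1[i]` and `F2[i]` are the unit descriptors of the matched pair ⟨Q_i, Q̂_i⟩ in the two images; the margin value is a guess, since the paper's digit is not legible in this copy.

```python
import numpy as np


def hard_triplet_loss(F1, F2, margin=1.0):
    """Triplet loss with in-batch hardest negatives, in the style of
    eqs. (6)-(9); the margin is an assumed value."""
    # pairwise L2 distances between the two descriptor sets
    D = np.linalg.norm(F1[:, None, :] - F2[None, :, :], axis=-1)
    pos = np.diag(D)                        # D_{i,i}, eq. (7)
    off = D + 1e9 * np.eye(len(D))          # mask out the matched pairs
    hard_neg = np.minimum(off.min(axis=1),  # D_{i,~i}, eq. (8)
                          off.min(axis=0))  # D_{~i,i}, eq. (9)
    return float(np.maximum(0.0, pos - hard_neg + margin).mean())
```

The loss is zero once every matched pair is closer than its hardest in-batch negative by at least the margin, which is exactly the repeatability-plus-reliability behavior the text asks for.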
Given the outputs of the descriptor branch, F_P and F̂_{P̂}, from the original image I and its affine-transformed counterpart Î, the descriptor repeatability can be evaluated at each point as

D_{i,i} = ||F_{P_i} − F̂_{P̂_i}||.    (13)

Figure 5. Representative maps of the repeatability metric D_{i,i}, the distinctness metric D_{i,ĩ}, and the reliability metric R_i.

The lower D_{i,i} is, the more repeatable the descriptor is. In addition, the distinctness of a descriptor can be evaluated as

D_{i,ĩ} = min_{j≠i} ||F_{P_i} − F̂_{P̂_j}||.    (14)

Similarly, the higher D_{i,ĩ} is, the more distinct the descriptor is. As a reliable descriptor should be both repeatable and distinct, we combine the repeatability and distinctness metrics into a single metric following the ratio term

R_i = D_{i,ĩ} / D_{i,i}.    (15)

Representative examples of the computed maps D_{i,i}, D_{i,ĩ}, and R_i are shown in Fig. 5. One may notice that the ratio term (15) is the same as the ratio in the ratio-test algorithm [19], a well-known method for finding keypoint correspondences. This means that points with higher ratios can be reliably distinguished by subsequent correspondence-finding algorithms; such points should, without doubt, be detected by the detector as much as possible. Therefore, strongly responsive elements of the ratio map R are selected as keypoints by applying the NMS algorithm.

Moreover, to ensure high-quality performance, three strategies are applied in the keypoint computing process. Firstly, the ratio map R does not cover all points in image I, since some elements have no correspondence in the affine-transformed image Î; besides, computing keypoints from a single ratio map R is not robust.
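The three metrics of eqs. (13)-(15) can be sketched over a set of corresponding descriptors. Here `F[i]` and `F_hat[i]` are the descriptors of the same scene point i in the image and its warped counterpart; a dense-map version would flatten the H×W descriptor maps into such arrays.

```python
import numpy as np


def reliability_scores(F, F_hat):
    """Repeatability (13), distinctness (14), and reliability ratio (15)."""
    D = np.linalg.norm(F[:, None, :] - F_hat[None, :, :], axis=-1)
    rep = np.diag(D)                               # lower = more repeatable
    dist = (D + 1e9 * np.eye(len(D))).min(axis=1)  # higher = more distinct
    ratio = dist / np.maximum(rep, 1e-8)           # eq. (15), ratio-test style
    return rep, dist, ratio
```

Points with a high ratio (small matched distance, large distance to everything else) are exactly those a ratio-test matcher would later accept, which is why they are promoted to keypoints.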
To make the ratio map both robust and complete, we adopt a data augmentation strategy similar to the affine adaption described in Sec. 4.1. Specifically, we randomly warp the input image via an affine transformation, calculate the ratio map, and repeat this process multiple times to generate an average ratio map

R = AVG(R_0, R_1, ..., R_m),    (16)

where R_i corresponds to the i-th result. An example of computing the average ratio map is given in Fig. 6.

Figure 6. Representative examples of the average reliability map R.

Secondly, computing D_{i,ĩ} exactly is an extremely heavy task. To reduce the computation, we modify D_{i,ĩ} as

D_{i,ĩ} = min_{j≠i, P̂_j ∈ Ω(P̂_i)} ||F_{P_i} − F̂_{P̂_j}||,    (17)

where Ω(P̂_i) contains the local neighbors of point P̂_i.

Thirdly, the feature maps F are usually too coarse for keypoint computing, as the descriptor branch contains a bi-linear up-sampling layer. We therefore use the feature maps at two scales to compute a coarse-scale and a fine-scale ratio map, respectively, and fuse them to obtain the final result.

4.4. Update the Detector with Computed Keypoints

After the keypoints have been computed via their descriptor reliability, they can be taken as ground-truth to train the detector following the detector reliability property 2.1. We formulate keypoint detection as a per-pixel classification task that determines whether the point at each pixel is a keypoint or not. Since keypoints are very sparse among all points, we adopt the focal loss [17] as

L_det = FL(P, Y),    (18)

where Y denotes the computed keypoints.

Besides the detector reliability property 2.1, the detector should also satisfy the repeatability property 1.1. To this end, we further apply an affine transformation to the input image and obtain the affined image Î and detection output P̂.
The detector should also correctly detect the keypoints in image Î, so the detection loss (18) is modified as

L_det = (1/2)(FL(P, Y) + FL(P̂, Ŷ)),    (19)

where Ŷ = H(Y). To further enhance the repeatability property 1.1, we minimize the difference between the detection probabilities of corresponding keypoints via the loss

L_rep = (1/2) Σ_i (KLD(P_{Q_i} || P̂_{Q̂_i}) + KLD(P̂_{Q̂_i} || P_{Q_i})),    (20)

where KLD(·||·) is the Kullback-Leibler divergence. To keep the description results unchanged, we also add a regularization term

L'_des = (1/2)(MSE(F, F') + MSE(F̂, F̂')),    (21)

where F' and F̂' are obtained by the initial network before this detector training step. The final loss to update the detector is defined as

L = L_det + β L_rep + λ L'_des,    (22)

where β = 1 and λ is set to a small value (a negative power of ten) empirically in our experiments.
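The two detector-side losses can be sketched per pixel as follows. The focal-loss hyperparameters gamma and alpha use the common defaults of [17] (the values used by SEKD are not stated in this text), and the symmetric KL term treats each matched detection probability as a Bernoulli variable.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Per-pixel focal loss FL(P, Y), eq. (18); gamma/alpha are assumed
    defaults from the focal-loss paper [17]."""
    pt = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    at = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float((-at * (1.0 - pt) ** gamma
                  * np.log(np.clip(pt, 1e-8, 1.0))).mean())


def symmetric_kl(p, q, eps=1e-8):
    """Symmetrised KL divergence between matched detection probabilities,
    as in eq. (20), with each pixel treated as a Bernoulli distribution."""
    def kl(a, b):
        return (a * np.log((a + eps) / (b + eps))
                + (1 - a) * np.log((1 - a + eps) / (1 - b + eps)))
    return float(0.5 * (kl(p, q) + kl(q, p)).mean())
```

The focal term keeps the sparse positive pixels from being drowned out by the background, while the KL term pulls the detection probabilities of corresponding points together, as eq. (20) intends.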
5. Experiments and Comparisons
In this section, we first present the details of training our local feature model, and then compare it with 11 popular methods on homography estimation, relative pose estimation (stereo), and structure-from-motion tasks. Finally, we conduct an ablation experiment to examine the effectiveness of the key training strategies.
Our local feature model is trained on the Microsoft COCO validation dataset [18], which consists of realistic images. We repeat the self-evolving iteration several times to prevent under-fitting or over-fitting. In each iteration, we train the detector and descriptor in turn for a fixed number of epochs, starting from an initial learning rate that is decayed whenever the average loss stops declining for several epochs. The whole training process takes 45 hours on a GPU server with two NVIDIA Tesla P100 GPUs. To test the inference speed, we deploy our model on a desktop machine with one NVIDIA GTX-1080Ti GPU to process 10K images at a fixed resolution; our model processes 301 images per second on average. We implemented our algorithm based on the PyTorch framework [26].

For affine adaption, we uniformly sample the in-plane rotation and shear parameters from [−40°, +40°], and the translation and scale parameters from fixed ranges. For color jitter, we also uniformly sample the brightness, contrast, saturation, and hue parameters from fixed ranges.

For comparison methods, we select 6 hand-crafted methods, i.e., ORB [29], AKAZE [25], BRISK [15], SURF [4], KAZE [2], and SIFT [19], implemented directly using OpenCV. We also select 5 recently proposed DNN-based methods, i.e., D2-Net [9], DELF [23], LF-Net [24], SuperPoint [7], and R2D2 [28], implemented using the code and models released by the authors. All of these methods can perform keypoint detection and description. Individual detector or descriptor algorithms are not included in the comparison, since their possible combinations are numerous and a fair comparison with the methods above is difficult.

Before comparing performance, we first review the training data (fewer constraints are better), model size (smaller is better), and descriptor dimension (lower is better) of each DNN-based method in Tab. 1.
On all of these aspects, our method is superior or comparable to the other methods.

Table 1. The training data (fewer constraints are better), model size (smaller is better), and descriptor dimension (lower is better) of each DNN-based method.

Method         | Training Data        | Model (MB) | Desc. Dim.
D2-Net [9]     | SfM data             | 30.5       | 512 float
DELF [23]      | landmarks data       | 36.4       | 1024 float
LF-Net [24]    | SfM data             | 31.7       | 256 float
SuperPoint [7] | rendered & web imgs  | 5.2        | 256 float
R2D2 [28]      | web imgs, SfM data   | 2.0        | 128 float
SEKD (ours)    | web imgs             | 2.7        | 128 float
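The affine adaption sampling described above can be sketched by composing a 2×3 affine matrix from the drawn parameters. The rotation and shear ranges of [−40°, +40°] follow the text; the translation and scale ranges below, as well as the composition order, are placeholders, since those digits are not legible in this copy.

```python
import numpy as np


def affine_matrix(rot_deg=0.0, shear_deg=0.0, scale=1.0, tx=0.0, ty=0.0):
    """Compose a 2x3 affine matrix from sampled augmentation parameters
    (rotation/shear in degrees, translation in pixels).  The composition
    order is an illustrative choice."""
    r, s = np.deg2rad(rot_deg), np.deg2rad(shear_deg)
    R = np.array([[np.cos(r), -np.sin(r)], [np.sin(r), np.cos(r)]])
    S = np.array([[1.0, np.tan(s)], [0.0, 1.0]])  # horizontal shear
    A = scale * (R @ S)
    return np.hstack([A, [[tx], [ty]]])


rng = np.random.default_rng(0)
# rotation and shear from [-40, +40] degrees as in the text;
# translation and scale bounds below are assumptions
M = affine_matrix(rot_deg=rng.uniform(-40, 40),
                  shear_deg=rng.uniform(-40, 40),
                  scale=rng.uniform(0.7, 1.4),
                  tx=rng.uniform(-8, 8), ty=rng.uniform(-8, 8))
```

Such a matrix can be applied both to images (e.g., via a warp routine) and to keypoint coordinates, producing the matched pairs of eqs. (4)-(5).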
Following many previous works, e.g., [19, 7], we also evaluate and compare our method with previous methods on the homography estimation task. As the benchmark dataset we adopt HPatches [3], the most popular and largest dataset for this task. It includes 116 sequences of images, where each sequence consists of one reference image and five target images; the homography between the reference image and each target image has been carefully calibrated. There are 57 sequences with only illumination changes and 59 sequences with only viewpoint changes. We follow the common experimental setup and use the homography accuracy metric of [7].

To estimate the homography, we use our model and the 11 comparison methods to extract the top-500 most confident keypoints from each input image. Keypoint correspondences are constructed via nearest matching of descriptors, and a cross-check step is further applied to eliminate unstable matches. The homography is then estimated with the RANSAC algorithm with default parameters by directly calling the findHomography() function in OpenCV.

As shown in Fig. 7, we plot the homography accuracy curve of each method over reprojection error thresholds from 1 through 10. The average homography accuracy (Avg.HA@1:10) is also calculated and presented in Tab. 2, with the results on the Illumination and Viewpoint subsets reported respectively. The results show that our SEKD model achieves the best overall performance. On the Illumination subset, DELF [23] achieves the best result; however, its performance on the Viewpoint subset is the worst, due to its poor keypoint localization ability. On the Viewpoint subset, our SEKD model outperforms all comparison methods.

The HPatches dataset is planar, and the relation between a pair of images is a homography. However, images from unconstrained real environments usually do not satisfy this constraint.
To this end, we resort to the Image Matching Challenge (IMC) dataset [35], which consists of images from 26 scenes, with each image annotated with a ground-truth 6-DoF pose. For each scene, IMC collected adequate images to reconstruct the scene and estimate the pose of each image using an SfM algorithm; the estimated poses are taken as pseudo ground-truth. Only a subset of images is then selected for evaluation via relative pose estimation and structure-from-motion tasks. By adjusting the error threshold from 1 to 10 degrees, IMC calculates the mean Average Accuracy (mAA) as the metric for comparing methods. Please see the website [35] for more details about this dataset.

We adopt the validation set, since both its images and ground-truth have been released at the moment. It consists of three scenes, i.e., sacre coeur, st peters square, and reichstag. We extract up to 2K keypoints from each image using each comparison method. The keypoint correspondences between each pair of images are then constructed via the same matching algorithm, which in our experiments is the ratio-test for float descriptors and nearest matching for binary descriptors. The mAA metrics are then computed by evaluating the relative pose estimation and structure-from-motion results. For a fair comparison, besides keypoint extraction, all other processes are implemented using the benchmark code released by IMC [35] with the same experimental setups and parameters.

As demonstrated in Tab. 2, our SEKD achieves the best overall performance on the IMC dataset and outperforms the second-place method, i.e., SuperPoint [7], by a large margin of 0.035. Specifically, on the relative pose estimation task, our method outperforms the second place by a large margin of 0.049. On the structure-from-motion task, SuperPoint [7] slightly outperforms our method by 0.006; however, it achieves an unsatisfactory result on relative pose estimation, 0.076 lower than our method.
This experiment indicates that, though our SEKD model is trained only on web images with synthetic affine transformations, it generalizes fairly well to 3D datasets and problems.
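The evaluation utilities used in the two experiments above can be sketched as follows: mutual nearest-neighbor matching with a cross-check, a corner-reprojection homography accuracy criterion in the style of [7], and the IMC-style mAA over thresholds of 1 to 10 degrees. These are simplified stand-ins, not the benchmark implementations.

```python
import numpy as np


def cross_check_matches(d1, d2):
    """Mutual nearest-neighbour matching with a cross-check."""
    D = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=-1)
    nn12, nn21 = D.argmin(axis=1), D.argmin(axis=0)
    return [(i, int(j)) for i, j in enumerate(nn12) if nn21[j] == i]


def homography_correct(H_est, H_gt, size, thresh):
    """A homography is counted correct when the mean reprojection error
    of the four image corners under H_est vs. H_gt is below `thresh` px."""
    w, h = size
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=float)

    def warp(H, pts):
        ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
        return ph[:, :2] / ph[:, 2:3]

    err = np.linalg.norm(warp(H_est, corners) - warp(H_gt, corners), axis=1)
    return bool(err.mean() < thresh)


def mean_average_accuracy(errors_deg, thresholds=range(1, 11)):
    """IMC-style mAA: fraction of poses with angular error below each
    threshold, averaged over thresholds 1..10 degrees."""
    e = np.asarray(errors_deg, dtype=float)
    return float(np.mean([(e < t).mean() for t in thresholds]))
```

Averaging `homography_correct` over all image pairs at a fixed threshold gives one point of the accuracy curve in Fig. 7; sweeping the threshold from 1 to 10 and averaging gives Avg.HA@1:10, mirroring how mAA is built from angular errors.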
To examine the effectiveness of each key training strategy in our framework, we further conduct an ablation experiment on the homography estimation task with the HPatches dataset. As shown in Tab. 3, when we replace the descriptor repeatability (13) or the descriptor distinctness (14) with a constant value, the Avg.HA@1:10 decreases dramatically, which verifies the rationality of our algorithm. When we instead delete the detector repeatability loss (20) or the affine adaption (3)&(16), the performance also decreases, which verifies that these two strategies improve the stability of our framework along with the trained model.

Figure 7. The homography accuracy curves of our SEKD model and 11 comparison methods (ORB, AKAZE, BRISK, SURF, KAZE, SIFT, D2-Net, DELF, LF-Net, SuperPoint, and R2D2) along different reprojection error thresholds from 1 through 10 on the HPatches overall data, illumination subset, and viewpoint subset, respectively.

Table 2. The average homography accuracy (Avg.HA) of our SEKD model and 11 comparison methods on the HPatches dataset, and the mean average accuracy (mAA) of relative pose estimation (stereo) and structure-from-motion (SfM) on the IMC dataset.

Method       | Avg.HA@1:10 on HPatches       | mAA on IMC
             | Mean     ILL.     VIEW.       | Mean    Stereo   SfM
ORB [29]     | 48.96%   60.28%   38.03%      | 0.064   0.032    0.097
AKAZE [25]   | 59.22%   70.63%   48.20%      | 0.190   0.079    0.302
BRISK [15]   | 61.15%   71.08%   51.55%      | 0.111   0.040    0.183
SURF [4]     | 66.77%   78.94%   55.01%      | 0.238   0.149    0.328
KAZE [2]     | 68.10%   81.82%   54.84%      | 0.270   0.169    0.371
SIFT [19]    | 74.13%   84.28%   64.33%      | 0.342   0.258    0.427
D2-Net [9]   | 30.96%   47.12%   15.35%      | 0.025   0.025    0.025
DELF [23]    |
R2D2 [28]    |

On the IMC dataset, we reduce the dimension of the DELF descriptor from 1024 to 512 using PCA, as the benchmark code refuses to take longer descriptors as input. R2D2 adopts an image pyramid as input for better performance; for a fair comparison, we only compare the results taking the initial image as input. Actually, with an image pyramid as input, the mean results of R2D2 and our method would be updated to 72.81% and 0.442, and 79.74% and 0.496, on HPatches and IMC, respectively. However, this has no influence on the conclusions.
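The Avg.HA@1:10 score reported above can be illustrated with a toy sketch: warp the image corners with the estimated and ground-truth homographies, average the corner reprojection error, and then average the per-threshold accuracies from 1 through 10 pixels. This is a hypothetical simplification of the benchmark (which aggregates over many image pairs); the function names and corner set are assumptions, not the HPatches evaluation code.

```python
import math

def warp(H, pt):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = pt
    d = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / d,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / d)

def avg_homography_accuracy(H_est, H_gt, corners, thresholds=range(1, 11)):
    """Mean corner reprojection error between the estimated and ground-truth
    homographies, converted to an accuracy averaged over pixel thresholds
    (an Avg.HA@1:10-style score: 1.0 when the error beats every threshold)."""
    err = sum(math.dist(warp(H_est, c), warp(H_gt, c))
              for c in corners) / len(corners)
    return sum(err <= t for t in thresholds) / len(thresholds)

# Toy example: the estimate is off by a pure 2.5 px horizontal translation,
# so the corner error is 2.5 px and only thresholds 3..10 are satisfied.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
T = [[1, 0, 2.5], [0, 1, 0], [0, 0, 1]]
corners = [(0, 0), (640, 0), (0, 480), (640, 480)]
print(avg_homography_accuracy(T, I, corners))  # → 0.8
```

Averaging over a range of thresholds, rather than picking a single one, is also what makes the IMC mAA metric robust to the choice of error tolerance.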
6. Discussion and Conclusion
In this paper, we analyze the inherent and interactive properties of local feature detector and descriptor. Guided by these properties, a self-evolving framework is elaborately designed to update the detector and descriptor iteratively using unlabeled images. Extensive experiments verify the effectiveness of our method on both planar and 3D datasets, though our model is trained only using planar data. Moreover, as our framework works well using only unlabeled data, theoretically, besides natural images, it can also be adopted to discover novel local features from other types of data, e.g., medical images, infrared images, and remote sensing images. We leave these as our future work.

References

[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a Day. Commun. ACM, 54(10):105–112, Oct. 2011.
[2] Pablo Fernández Alcantarilla, Adrien Bartoli, and Andrew J. Davison. KAZE Features. In
ECCV, 2012.
[3] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. In CVPR, 2017.
[4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In ECCV, 2006.
[5] Matthew Brown and David G. Lowe. Automatic Panoramic Image Stitching using Invariant Features. International Journal of Computer Vision, 74(1):59–73, Aug. 2007.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[7] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In CVPR Workshops, 2018.
[8] Paolo Di Febbo, Carlo Dal Mutto, Kinh Tieu, and Stefano Mattoccia. KCNN: Extremely-Efficient Hardware Keypoint Detection With a Compact Convolutional Neural Network. In CVPR Workshops, 2018.
[9] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A Trainable CNN for Joint Description and Detection of Local Features. In CVPR, 2019.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings in Deep Residual Networks. In ECCV, 2016.
[11] Josef Sivic and Andrew Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In ICCV, 2003.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS, pages 1097–1105, 2012.
[13] Axel Barroso Laguna, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters. In ICCV, 2019.
[14] C. Leng, H. Zhang, B. Li, G. Cai, Z. Pei, and L. He. Local Feature Descriptor for Image Matching: A Survey. IEEE Access, pages 6424–6434, 2019.
[15] Stefan Leutenegger, Margarita Chli, and Roland Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. In ICCV, 2011.
[16] Yunpeng Li, Noah Snavely, Dan Huttenlocher, and Pascal Fua. Worldwide Pose Estimation Using 3D Point Clouds. In ECCV, 2012.
[17] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. In ICCV, 2017.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[19] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[20] Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. ContextDesc: Local Descriptor Augmentation with Cross-Modality Context. In CVPR, 2019.
[21] Zixin Luo, Tianwei Shen, Lei Zhou, Siyu Zhu, Runze Zhang, Yao Yao, Tian Fang, and Long Quan. GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints. In ECCV, 2018.
[22] Anastasya Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working Hard to Know Your Neighbor's Margins: Local Descriptor Learning Loss. In NeurIPS, 2017.
[23] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-Scale Image Retrieval With Attentive Deep Local Features. In ICCV, 2017.
[24] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. LF-Net: Learning Local Features from Images. In NeurIPS, 2018.
[25] Pablo F. Alcantarilla, Jesús Nuevo, and Adrien Bartoli. Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces. In BMVC, 2013.
[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic Differentiation in PyTorch. In NeurIPS Workshop, 2017.
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, pages 91–99, 2015.
[28] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2D2: Repeatable and Reliable Detector and Descriptor. In NeurIPS, 2019.
[29] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R. Bradski. ORB: An Efficient Alternative to SIFT or SURF. In ICCV, 2011.
[30] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. Quad-Networks: Unsupervised Learning to Rank for Interest Point Detection. In CVPR, 2017.
[31] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In ICCV, 2015.
[32] Yafei Song, Xiaowu Chen, Xiaogang Wang, Yu Zhang, and Jia Li. 6-DOF Image Localization From Massive Geo-Tagged Reference Images. IEEE Transactions on Multimedia, 18(8):1542–1554, 2016.
[33] Yafei Song, Di Zhu, Jia Li, Yonghong Tian, and Mingyang Li. Learning Local Feature Descriptor with Motion Attribute for Vision-based Localization. In IROS, 2019.
[34] Yurun Tian, Bin Fan, and Fuchao Wu. L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In CVPR, 2017.
[35] Eduard Trulls, Yuhe Jin, Kwang Yi, Dmytro Mishkin, Jiri Matas, Anastasiia Mishchuk, and Pascal Fua. Image Matching Challenge 2020. https://vision.uvic.ca/image-matching-challenge/.
[36] Yannick Verdie, Kwang Yi, Pascal Fua, and Vincent Lepetit. TILDE: A Temporally Invariant Learned DEtector. In CVPR, 2015.
[37] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
[38] Mingming Zhang, Xingxing Zuo, Yiming Chen, and Mingyang Li. Localization for Ground Robots: On Manifold Representation, Integration, Re-Parameterization, and Optimization. In IROS, 2019.