SEKD: Self-Evolving Keypoint Detection and Description
Yafei Song*, Ling Cai, Jia Li, Yonghong Tian*, and Mingyang Li
A.I. Labs, Alibaba Group; School of Electronics Engineering and Computer Science, Peking University; School of Computer Science and Engineering, Beihang University
Abstract
Researchers have attempted to utilize deep neural networks (DNNs) to learn novel local features from images, inspired by recent DNN successes on a variety of vision tasks. However, existing DNN-based algorithms have not achieved comparably remarkable progress, which can be partly attributed to insufficient utilization of the interactive characters between the local feature detector and descriptor. To alleviate these difficulties, we emphasize two desired properties, i.e., repeatability and reliability, to simultaneously summarize the inherent and interactive characters of the local feature detector and descriptor. Guided by these properties, a self-supervised framework, namely self-evolving keypoint detection and description (SEKD), is proposed to learn an advanced local feature model from unlabeled natural images. Additionally, to guarantee performance, novel training strategies have been dedicatedly designed to minimize the gap between the learned feature and its desired properties. We benchmark the proposed method on homography estimation, relative pose estimation, and structure-from-motion tasks. Extensive experimental results demonstrate that the proposed method outperforms popular hand-crafted and DNN-based methods by remarkable margins. Ablation studies also verify the effectiveness of each critical training strategy. We will release our code along with the trained model publicly.
1. Introduction
Local feature, specifically referring to the local point feature in this paper, is extensively employed in a large number of computer vision applications, such as image stitching [5], content-based image retrieval [11], image-based localization [16, 32], structure-from-motion (SfM) [1], and simultaneous localization and mapping (SLAM) [38]. In these applications, the quality of the local feature module significantly influences the overall system performance and thus must be studied and optimized in depth.

*Corresponding authors: Yafei Song and Yonghong Tian. E-mail: {[email protected], [email protected]}

Figure 1. Desired properties of local features, illustrated on two images of the same scene with their detection and description results. Detector repeatability (1.1): a visible scene point should be detected in all images. Descriptor repeatability (1.2): the descriptor of the same point is invariant over different images. Detector reliability (2.1): given a descriptor, detected keypoints can be distinguished by their descriptors. Descriptor reliability (2.2): given a detector, descriptors can distinguish the detected keypoints.

In general, a standard local feature algorithm can be divided into two modules, i.e., keypoint detection and description. For each keypoint, its inner-image location is determined via the detection module, while its descriptor is calculated by summarizing the local context information via the description module. Early works on local features primarily originated from hand-crafted methodologies; representative methods include SIFT [19], SURF [4], KAZE [2], AKAZE [25], BRISK [15], ORB [29], and so on.
Although hand-crafted features have been widely used in various computer vision tasks, their rule-based design prevents further performance enhancement along with increasing model representation ability.

Inspired by the great successes of DNNs on a variety of computer vision tasks [12, 27, 6], researchers have been actively working on designing and learning advanced local feature models. Since a local feature consists of both detection and description, each module can be individually replaced and improved by DNN-based methods [13, 31]. Alternatively, both modules can be jointly designed using one DNN model, either by sequentially connected neural networks that first calculate keypoint locations and subsequently compute descriptors [37, 24], or by a single network with a shared backbone and two separate branches for regressing detectors and descriptors, respectively [23, 7, 9, 28].

However, unlike on most tasks, existing DNN-based local features have not achieved such great progress compared with hand-crafted methods, which indicates that it is very challenging to exploit DNNs for local feature learning. As a local feature algorithm consists of two modules, we partly attribute this difficulty to insufficient utilization of their inherent and interactive properties. To alleviate this problem, we analyze the desired properties of local features, including the detector, the descriptor, and their mutual relations. As demonstrated in Fig. 1, the properties can be summarized into two sets, i.e., 'repeatability' and 'reliability', and explained as:

Property 1
Repeatability property of local feature.
Property 1.1
Detector repeatability: If a scene point is detected as a keypoint in one image, it should be detected in all images where it is visible.
Property 1.2
Descriptor repeatability: The descriptor of a scene point should be invariant across all images.
Property 2
Reliability property of local feature.
Property 2.1
Detector reliability: Given a description method, the detector should localize points that can be reliably distinguished by their descriptors.
Property 2.2
Descriptor reliability: Given a detection method, the descriptor should reliably distinguish the detected keypoints.
Repeatability is an inherent property of the detector and the descriptor, respectively, while reliability is the interactive property between them. We note that similar analyses and properties have also been adopted to guide algorithm design in previous works [7, 9, 28]. However, instead of optimizing the detector and descriptor at the same time, we propose to optimize each module in turn. When optimizing the detector or descriptor, both its inherent repeatability property and its interactive reliability property are exploited to design the training strategies. Specifically, we figure out keypoints with reliable descriptors among all points. These keypoints are taken as ground-truth to optimize the detector, guided by the detector reliability property. The optimized detector is then used to detect keypoints from images, and the descriptor is optimized to reliably distinguish the detected keypoints, guided by the descriptor reliability property. This process is iterated until the learned model converges. Moreover, several strategies are adopted to ensure the repeatability property and the convergence of the whole process. This training process is self-evolving, as it needs no additional supervised signals. Extensive experiments have been conducted to compare our model with state-of-the-art methods on homography estimation, relative pose estimation, and structure-from-motion tasks on public datasets; the results verify the effectiveness of our algorithm.

Our main contributions can be summarized as follows:
1. We propose a self-evolving framework guided by the properties of local features, by which an advanced model can be trained effectively using unannotated images.
2. Training strategies are elaborately designed and deployed to keep the learned local feature model aligned with the desired properties.
3. Extensive experiments verify the effectiveness of our framework and training strategies, outperforming state-of-the-art methods.
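The alternating process described above can be sketched as a simple driver loop. This is a hypothetical sketch: the callable arguments (`update_descriptor`, `compute_keypoints`, `update_detector`, `detect`) stand in for the actual training routines elaborated in Sec. 4, and the random initialization mirrors the random keypoint selection at iteration 0.

```python
import random

def self_evolve(images, n_iters, update_descriptor, compute_keypoints,
                update_detector, detect):
    """Driver for the self-evolving loop (hypothetical sketch).

    At iteration 0 the detector is untrained, so keypoints are sampled
    at random; afterwards the current detector provides them.
    """
    # iteration 0: random keypoints for each image
    keypoints = {im: [(random.randrange(64), random.randrange(64))
                      for _ in range(32)] for im in images}
    for _ in range(n_iters):
        update_descriptor(images, keypoints)                      # step (b)
        keypoints = {im: compute_keypoints(im) for im in images}  # step (c)
        update_detector(images, keypoints)                        # step (d)
        keypoints = {im: detect(im) for im in images}             # step (a)
    return keypoints
```

In practice each callable would run many gradient steps; the loop only illustrates the order in which the two branches evolve.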
2. Related Work
In this section, we briefly review well-known local features, which can be categorized into four main groups: hand-crafted methods and three sets of DNN-based approaches.
Hand-crafted methods. Early works on local features primarily rely on hand-crafted rules. One of the most well-known local feature algorithms is SIFT [19], which builds its detector on difference-of-Gaussian operators and calculates descriptors via orientation histograms. After SIFT, plenty of algorithms have been proposed either to approximate the image processing operators for computational efficiency or to seek performance gains by redesigning the detector or descriptor. Representative methods include SURF [4], KAZE [2], AKAZE [25], BRISK [15], and ORB [29]. To date, despite their rule-based design, hand-crafted features can still achieve leading performance in specific applications [14].
DNN-based two-stage methods. Hand-crafted local feature algorithms typically first detect keypoints in images and subsequently calculate descriptors around each keypoint by cropping and summarizing the local context information. This procedure can also be adopted in DNN-based methods by using sequentially connected neural networks [37, 24]. Each network has its own training strategy, optimizing the detector or the descriptor, respectively. We name this kind of method two-stage methods; they can utilize previous expert knowledge in this area. The major disadvantage of the two-stage design is its computational inefficiency, since sequentially connected networks can neither share a large number of computations and parameters nor enable fully parallel computing.
DNN-based one-stage methods. To improve the efficiency of DNN-based local features, researchers have proposed the one-stage paradigm, which typically connects a backbone network with two lightweight head branches [23, 7, 9, 28]. Since the backbone network shares most computations between the detector and descriptor calculation, this type of algorithm achieves significantly lower runtime. The two lightweight branches can be designed either using small neural networks [23, 7, 28] or by hand-crafted methods [9]. In terms of training strategies, all these methods require annotated information for supervised learning: [23] adopted a landmark image dataset with image-level annotations; [9, 28] obtained ground-truth correspondences between images via SfM reconstruction; and [7] relied on synthetic images with generated 'corner'-style keypoints.
DNN-based individual detector/descriptor methods. There are also a number of methods that focus only on a DNN-based detector or descriptor: e.g., [36, 30, 8, 13] proposed DNN-based keypoint detectors, and [31, 22, 34, 21, 20, 33] worked on descriptor computation. However, a local feature algorithm is usually employed as a whole, since the detector and descriptor influence each other's performance. Those methods can be considered pluggable modules for a two-stage algorithm. In this paper, we focus on developing an advanced DNN-based one-stage model.
3. Formulation and Network Architecture
To describe our method better, we first introduce basic notations along with the network architecture; the self-evolving framework and training strategies are elaborated in the next section. As shown in Fig. 2, our network consists of a shared backbone N_b and two lightweight head branches, i.e., a detector branch N_det and a descriptor branch N_des. The backbone N_b consists of 1 convolutional layer and 9 ResNet-v2 blocks [10], and extracts feature maps F ∈ R^{C×H×W} from the input image I ∈ R^{H×W}. Here, H and W are the height and width of the input image I, and C is the number of channels of the extracted feature maps; the hidden feature maps at the initial and the k-th scale are denoted as F_0 and F_k, respectively. The detector branch N_det consists of 2 deconvolutional layers and 1 softmax layer, and predicts the keypoint probability map P ∈ R^{1×H×W} from the feature maps F. Moreover, this branch also contains two shortcut links from low-level features to enhance its localization ability. The descriptor branch N_des consists of 1 ResNet-v2 block and 1 bi-linear up-sampling layer, and extracts a descriptor F(h,w) ∈ R^C of dimension C for each pixel (h,w), where F ∈ R^{C×H×W}.

Figure 2. Overview of our network, which consists of a heavy shared backbone and two lightweight head branches for detection and description, respectively. (Panels: (a) input image, (b) backbone, (c) detector branch, (d) descriptor branch.)

Benefiting from this network structure, our detector and descriptor share most parameters and computations.
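The shared-backbone layout can be illustrated with a toy PyTorch module. This is a minimal sketch, not the paper's actual architecture: layer counts, channel widths, and the sigmoid detector output are placeholders standing in for the 1-conv + 9-ResNet-v2-block backbone, the deconvolution/softmax detector head, and the shortcut links described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySEKD(nn.Module):
    """Toy shared-backbone network with a detector and a descriptor head.

    All layer sizes are illustrative placeholders, not the paper's design.
    """

    def __init__(self, c=32, desc_dim=16):
        super().__init__()
        # shared backbone: downsamples the input by a factor of 4
        self.backbone = nn.Sequential(
            nn.Conv2d(1, c, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # detector head: one per-pixel keypoint probability channel
        self.det_head = nn.Conv2d(c, 1, 3, padding=1)
        # descriptor head: a desc_dim-dimensional descriptor per pixel
        self.des_head = nn.Conv2d(c, desc_dim, 3, padding=1)

    def forward(self, img):
        h, w = img.shape[-2:]
        feat = self.backbone(img)
        # upsample both heads back to the input resolution
        prob = torch.sigmoid(
            F.interpolate(self.det_head(feat), (h, w), mode='bilinear',
                          align_corners=False))
        desc = F.interpolate(self.des_head(feat), (h, w), mode='bilinear',
                             align_corners=False)
        desc = F.normalize(desc, dim=1)  # unit-length descriptors per pixel
        return prob, desc


prob, desc = TinySEKD()(torch.zeros(1, 1, 64, 64))
```

The key point the sketch captures is that a single forward pass through the backbone feeds both heads, so detection and description share most computation.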
4. Self-Evolving Framework
To train the network constructed in Sec. 3, two types of supervisory signals should be provided: the location of each keypoint, and the keypoint correspondences between different images. With the desired properties of local features in mind, we propose to figure out the points with reliable descriptors as keypoints, while pairs of images along with their correspondences can be obtained via affine transformation. The network can then be trained using only unlabeled images. However, as the training data carry no additional annotation, we must carefully design the training strategies to ensure performance.

The overview of our framework is shown in Fig. 3. It mainly consists of four steps: (a) compute the keypoint probability map P using the current detector and filter the keypoints via the non-maximum suppression (NMS) algorithm; (b) update the descriptor branch on the detected keypoints by heightening their descriptors' repeatability and reliability properties; (c) compute keypoints by figuring out points with reliable (both repeatable and distinct) descriptors; (d) update the detector using the newly computed keypoints following the detector repeatability and reliability properties. In what follows, we present each step in detail.

4.1. Detect Keypoints via the Detector

For an input image I, the backbone network N_b extracts feature maps F_0, F_1, F_2 via

F_0, F_1, F_2 = N_b(I).    (1)

The feature maps are subsequently used by the detector branch N_det to estimate the keypoint probability map P
Figure 3. Overview of our self-evolving framework, which consists of four main steps: (a) detect keypoints using the current detector (at iteration 0, keypoints are selected at random), (b) update the descriptor with the detected keypoints, (c) compute keypoints with reliable (both repeatable and distinct) descriptors, where the reliability metric is the ratio between the distinctness metric and the repeatability metric, and (d) refine the detector using the newly computed keypoints.

as

P = N_det(F_0, F_1, F_2).    (2)

A strong response at a pixel of the probability map P indicates a potential keypoint; the map is further filtered by non-maximum suppression (NMS). We set the suppression radius to 4 pixels in all experiments and cap the maximum number of keypoints during the training process.

However, the above process is not designed to ensure robust detection of the same keypoints under varying conditions. In other words, the detection process is not optimized to satisfy the detector repeatability property 1.1 and might lead to sub-optimal results. To this end, we adopt a dedicated data augmentation strategy, namely affine adaption [7]. Specifically, we first apply a random affine transformation and color jitter to each input image and calculate the keypoint probability map. This process is repeated several times, and the average detection result

P = AVG(P_0, P_1, ..., P_m)    (3)

is computed as the final output, where P_0 corresponds to the initial image and the others correspond to the transformed counterparts. Representative examples of the detection process are demonstrated in Fig. 4. Note that the affine adaption is only applied during training.

As the detector has not been optimized at iteration 0, another problem is how to detect keypoints at the start. As shown in Fig. 3(a), we simply select random keypoints for each input image. Even so, we show in experiments that the proposed self-evolving framework converges quickly within just a few iterations.
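The NMS filtering of step (a) can be sketched as a greedy loop over the probability map: repeatedly take the strongest remaining response and suppress a square neighborhood around it. The suppression radius of 4 follows the text; the keypoint cap here is a placeholder, since the exact number is not legible in this copy.

```python
import numpy as np


def nms_keypoints(prob, radius=4, max_kp=1000):
    """Greedy radius-NMS on a keypoint probability map (a simple sketch).

    Returns (row, col) keypoint locations in decreasing response order.
    """
    p = prob.astype(float).copy()
    kps = []
    while len(kps) < max_kp:
        y, x = np.unravel_index(np.argmax(p), p.shape)
        if p[y, x] <= 0.0:
            break  # nothing left above zero response
        kps.append((int(y), int(x)))
        # zero out a (2*radius+1)^2 window around the accepted keypoint
        p[max(0, y - radius):y + radius + 1,
          max(0, x - radius):x + radius + 1] = 0.0
    return kps
```

For the affine adaption of eq. (3), the same routine would be run on the average of the probability maps (e.g., `np.mean(np.stack(maps), axis=0)`) rather than on a single map.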
4.2. Update the Descriptor on Detected Keypoints

A keypoint descriptor is a vector associated with each keypoint, used both for re-identifying the same keypoints and for distinguishing different keypoints across images. These descriptor properties are summarized by the repeatability property 1.2 and the reliability property 2.2 in Sec. 1, which serve as guidelines in our descriptor training process.

Figure 4. Representative examples of the keypoint detection process. Our detector operates on both the input image and its affine-transformed counterparts, and calculates the average detection result as the final output.

For each image I, the keypoint detection process described in Sec. 4.1 provides a set of keypoints Q = {Q_i | Q_i = (h_i, w_i)}. The training process starts by applying a random affine transformation and color jitter H to both I and Q, leading to

Î = H(I),    (4)

and

Q̂ = {Q̂_i | Q̂_i = H(h_i, w_i)}.    (5)

Denoting a pair of keypoints by ⟨·,·⟩, ⟨Q_i, Q̂_i⟩ represents a pair of 'ground-truth' matched keypoints. According to the descriptor repeatability property 1.2, their descriptors F_{Q_i} and F̂_{Q̂_i} should be close to each other. On the other hand, according to the descriptor reliability property 2.2, F_{Q_i} should be distinct from all other descriptors except that of its matched keypoint, F̂_{Q̂_i}. Representative examples of the matched and distinct cases are shown in Fig. 3(b) by green and red lines, respectively. Inspired by HardNet [22], we use a triplet loss along with a hard example mining strategy to train the descriptor. Specifically, the loss function is defined as

L_des = (1/n) Σ_i max(0, D_{i,i} − min(D_{i,ĩ}, D_{ĩ,i}) + m),    (6)

where n is the number of keypoints, m denotes the margin parameter, ||·|| represents the L2 distance, and

D_{i,i} = ||F_{Q_i} − F̂_{Q̂_i}||,    (7)
D_{i,ĩ} = min_{j≠i} ||F_{Q_i} − F̂_{Q̂_j}||,    (8)
D_{ĩ,i} = min_{j≠i} ||F_{Q_j} − F̂_{Q̂_i}||.    (9)

The triplet loss (6) endows the descriptor with both the repeatability property (by (7)) and the reliability property (by (8) and (9)).

In addition, as our network shares a common backbone to simultaneously perform keypoint detection and description, the detector branch should also be considered when training the descriptor. To this end, we add a regularization loss term

L'_det = (1/2)(MSE(P, P') + MSE(P̂, P̂'))    (10)

to keep the detection results unchanged, where P is given by (2) and

P' = N'_det(N'_b(I)),  P̂ = N_det(N_b(Î)),  P̂' = N'_det(N'_b(Î)),    (11)

with N'_det, N'_b and N_det, N_b the networks before and after this descriptor training step. The final loss to update the descriptor is

L = L_des + α L'_det,    (12)

where α is a parameter balancing the two losses and is set empirically.

4.3. Compute Keypoints via the Descriptor

The next step of our self-evolving framework is to compute keypoints from the descriptor maps, which remains a challenging problem in the research community. In our work, we propose to calculate keypoints by evaluating the repeatability property 1.2 and reliability property 2.2 of their corresponding descriptors. Furthermore, as the reliability property to some extent contains the repeatability property, these two properties can be summarized into a single reliability score with two aspects, namely repeatability and distinctness.
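The hard-negative triplet loss of eqs. (6)-(9) can be sketched as follows. Here `F1[i]` and `F2[i]` are the unit descriptors of the matched pair ⟨Q_i, Q̂_i⟩ in the two images; the margin value is a guess, since the paper's digit is not legible in this copy.

```python
import numpy as np


def hard_triplet_loss(F1, F2, margin=1.0):
    """Triplet loss with in-batch hardest negatives, in the style of
    eqs. (6)-(9); the margin is an assumed value."""
    # pairwise L2 distances between the two descriptor sets
    D = np.linalg.norm(F1[:, None, :] - F2[None, :, :], axis=-1)
    pos = np.diag(D)                        # D_{i,i}, eq. (7)
    off = D + 1e9 * np.eye(len(D))          # mask out the matched pairs
    hard_neg = np.minimum(off.min(axis=1),  # D_{i,~i}, eq. (8)
                          off.min(axis=0))  # D_{~i,i}, eq. (9)
    return float(np.maximum(0.0, pos - hard_neg + margin).mean())
```

The loss is zero once every matched pair is closer than its hardest in-batch negative by at least the margin, which is exactly the repeatability-plus-reliability behavior the text asks for.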
Given the outputs of the descriptor branch, F_P and F̂_{P̂}, from the original image I and its affine-transformed counterpart Î, the descriptor repeatability can be evaluated at each point as

D_{i,i} = ||F_{P_i} − F̂_{P̂_i}||.    (13)

Figure 5. Representative maps of the repeatability metric D_{i,i}, the distinctness metric D_{i,ĩ}, and the reliability metric R_i.

The lower D_{i,i} is, the more repeatable the descriptor is. In addition, the distinctness of a descriptor can be evaluated as

D_{i,ĩ} = min_{j≠i} ||F_{P_i} − F̂_{P̂_j}||.    (14)

Similarly, the higher D_{i,ĩ} is, the more distinct the descriptor is. As a reliable descriptor should be both repeatable and distinct, we combine the repeatability and distinctness metrics into a single metric following the ratio term

R_i = D_{i,ĩ} / D_{i,i}.    (15)

Representative examples of the computed maps D_{i,i}, D_{i,ĩ}, and R_i are shown in Fig. 5. One may notice that the ratio term (15) is the same as the ratio in the ratio-test algorithm [19], a well-known method for finding keypoint correspondences. This means that points with higher ratios can be reliably distinguished by subsequent correspondence-finding algorithms; such points should, without doubt, be detected by the detector as much as possible. Therefore, strongly responsive elements of the ratio map R are selected as keypoints by applying the NMS algorithm.

Moreover, to ensure high-quality performance, three strategies are applied in the keypoint computing process. Firstly, the ratio map R does not cover all points in image I, since some elements have no correspondence in the affine-transformed image Î; besides, computing keypoints from a single ratio map R is not robust.
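The three metrics of eqs. (13)-(15) can be sketched over a set of corresponding descriptors. Here `F[i]` and `F_hat[i]` are the descriptors of the same scene point i in the image and its warped counterpart; a dense-map version would flatten the H×W descriptor maps into such arrays.

```python
import numpy as np


def reliability_scores(F, F_hat):
    """Repeatability (13), distinctness (14), and reliability ratio (15)."""
    D = np.linalg.norm(F[:, None, :] - F_hat[None, :, :], axis=-1)
    rep = np.diag(D)                               # lower = more repeatable
    dist = (D + 1e9 * np.eye(len(D))).min(axis=1)  # higher = more distinct
    ratio = dist / np.maximum(rep, 1e-8)           # eq. (15), ratio-test style
    return rep, dist, ratio
```

Points with a high ratio (small matched distance, large distance to everything else) are exactly those a ratio-test matcher would later accept, which is why they are promoted to keypoints.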
To make the ratio map both robust and complete, we adopt a data augmentation strategy similar to the affine adaption described in Sec. 4.1. Specifically, we randomly warp the input image via an affine transformation, calculate the ratio map, and repeat this process multiple times to generate an average ratio map

R = AVG(R_0, R_1, ..., R_m),    (16)

where R_i corresponds to the i-th result. An example of computing the average ratio map is given in Fig. 6.

Figure 6. Representative examples of the average reliability map R.

Secondly, computing D_{i,ĩ} exactly is an extremely heavy task. To reduce the computation, we modify D_{i,ĩ} as

D_{i,ĩ} = min_{j≠i, P̂_j ∈ Ω(P̂_i)} ||F_{P_i} − F̂_{P̂_j}||,    (17)

where Ω(P̂_i) contains the local neighbors of point P̂_i.

Thirdly, the feature maps F are usually too coarse for keypoint computing, as the descriptor branch contains a bi-linear up-sampling layer. We therefore use the feature maps at two scales to compute a coarse-scale and a fine-scale ratio map, respectively, and fuse them to obtain the final result.

4.4. Update the Detector with Computed Keypoints

After the keypoints have been computed via their descriptor reliability, they can be taken as ground-truth to train the detector following the detector reliability property 2.1. We formulate keypoint detection as a per-pixel classification task that determines whether the point at each pixel is a keypoint or not. Since keypoints are very sparse among all points, we adopt the focal loss [17] as

L_det = FL(P, Y),    (18)

where Y denotes the computed keypoints.

Besides the detector reliability property 2.1, the detector should also satisfy the repeatability property 1.1. To this end, we further apply an affine transformation to the input image and obtain the affined image Î and detection output P̂.
The detector should also correctly detect the keypoints in image Î, so the detection loss (18) is modified as

L_det = (1/2)(FL(P, Y) + FL(P̂, Ŷ)),    (19)

where Ŷ = H(Y). To further enhance the repeatability property 1.1, we minimize the difference between the detection probabilities of corresponding keypoints via the loss

L_rep = (1/2) Σ_i (KLD(P_{Q_i} || P̂_{Q̂_i}) + KLD(P̂_{Q̂_i} || P_{Q_i})),    (20)

where KLD(·||·) is the Kullback-Leibler divergence. To keep the description results unchanged, we also add a regularization term

L'_des = (1/2)(MSE(F, F') + MSE(F̂, F̂')),    (21)

where F' and F̂' are obtained by the initial network before this detector training step. The final loss to update the detector is defined as

L = L_det + β L_rep + λ L'_des,    (22)

where β = 1 and λ is set to a small value (a negative power of ten) empirically in our experiments.
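The two detector-side losses can be sketched per pixel as follows. The focal-loss hyperparameters gamma and alpha use the common defaults of [17] (the values used by SEKD are not stated in this text), and the symmetric KL term treats each matched detection probability as a Bernoulli variable.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Per-pixel focal loss FL(P, Y), eq. (18); gamma/alpha are assumed
    defaults from the focal-loss paper [17]."""
    pt = np.where(y == 1, p, 1.0 - p)          # probability of the true class
    at = np.where(y == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float((-at * (1.0 - pt) ** gamma
                  * np.log(np.clip(pt, 1e-8, 1.0))).mean())


def symmetric_kl(p, q, eps=1e-8):
    """Symmetrised KL divergence between matched detection probabilities,
    as in eq. (20), with each pixel treated as a Bernoulli distribution."""
    def kl(a, b):
        return (a * np.log((a + eps) / (b + eps))
                + (1 - a) * np.log((1 - a + eps) / (1 - b + eps)))
    return float(0.5 * (kl(p, q) + kl(q, p)).mean())
```

The focal term keeps the sparse positive pixels from being drowned out by the background, while the KL term pulls the detection probabilities of corresponding points together, as eq. (20) intends.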
5. Experiments and Comparisons
In this section, we first present the details of training our local feature model, and then compare it with 11 popular methods on homography estimation, relative pose estimation (stereo), and structure-from-motion tasks. Finally, we conduct an ablation experiment to examine the effectiveness of the key training strategies.
Our local feature model is trained on the Microsoft COCO validation dataset [18], which consists of realistic images. We repeat the self-evolving iteration several times to prevent under-fitting or over-fitting. In each iteration, we train the detector and descriptor in turn for a fixed number of epochs, starting from an initial learning rate that is decayed whenever the average loss stops declining for several epochs. The whole training process takes 45 hours on a GPU server with two NVIDIA Tesla P100 GPUs. To test the inference speed, we deploy our model on a desktop machine with one NVIDIA GTX-1080Ti GPU to process 10K images at a fixed resolution; our model processes 301 images per second on average. We implemented our algorithm based on the PyTorch framework [26].

For affine adaption, we uniformly sample the in-plane rotation and shear parameters from [−40°, +40°], and the translation and scale parameters from fixed ranges. For color jitter, we also uniformly sample the brightness, contrast, saturation, and hue parameters from fixed ranges.

For comparison methods, we select 6 hand-crafted methods, i.e., ORB [29], AKAZE [25], BRISK [15], SURF [4], KAZE [2], and SIFT [19], implemented directly using OpenCV. We also select 5 recently proposed DNN-based methods, i.e., D2-Net [9], DELF [23], LF-Net [24], SuperPoint [7], and R2D2 [28], implemented using the code and models released by the authors. All of these methods can perform keypoint detection and description. Individual detector or descriptor algorithms are not included in the comparison, since their possible combinations are numerous and a fair comparison with the methods above is difficult.

Before comparing performance, we first review the training data (fewer constraints are better), model size (smaller is better), and descriptor dimension (lower is better) of each DNN-based method in Tab. 1.
On all of these aspects, our method is superior or comparable to the other methods.

Table 1. The training data (fewer constraints are better), model size (smaller is better), and descriptor dimension (lower is better) of each DNN-based method.

Method         | Training Data        | Model (MB) | Desc. Dim.
D2-Net [9]     | SfM data             | 30.5       | 512 float
DELF [23]      | landmarks data       | 36.4       | 1024 float
LF-Net [24]    | SfM data             | 31.7       | 256 float
SuperPoint [7] | rendered & web imgs  | 5.2        | 256 float
R2D2 [28]      | web imgs, SfM data   | 2.0        | 128 float
SEKD (ours)    | web imgs             | 2.7        | 128 float
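The affine adaption sampling described above can be sketched by composing a 2×3 affine matrix from the drawn parameters. The rotation and shear ranges of [−40°, +40°] follow the text; the translation and scale ranges below, as well as the composition order, are placeholders, since those digits are not legible in this copy.

```python
import numpy as np


def affine_matrix(rot_deg=0.0, shear_deg=0.0, scale=1.0, tx=0.0, ty=0.0):
    """Compose a 2x3 affine matrix from sampled augmentation parameters
    (rotation/shear in degrees, translation in pixels).  The composition
    order is an illustrative choice."""
    r, s = np.deg2rad(rot_deg), np.deg2rad(shear_deg)
    R = np.array([[np.cos(r), -np.sin(r)], [np.sin(r), np.cos(r)]])
    S = np.array([[1.0, np.tan(s)], [0.0, 1.0]])  # horizontal shear
    A = scale * (R @ S)
    return np.hstack([A, [[tx], [ty]]])


rng = np.random.default_rng(0)
# rotation and shear from [-40, +40] degrees as in the text;
# translation and scale bounds below are assumptions
M = affine_matrix(rot_deg=rng.uniform(-40, 40),
                  shear_deg=rng.uniform(-40, 40),
                  scale=rng.uniform(0.7, 1.4),
                  tx=rng.uniform(-8, 8), ty=rng.uniform(-8, 8))
```

Such a matrix can be applied both to images (e.g., via a warp routine) and to keypoint coordinates, producing the matched pairs of eqs. (4)-(5).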
Following many previous works, e.g., [19, 7], we also evaluate and compare our method with previous methods on the homography estimation task. As the benchmark dataset we adopt HPatches [3], the most popular and largest dataset for this task. It includes 116 sequences of images, where each sequence consists of one reference image and five target images; the homography between the reference image and each target image has been carefully calibrated. There are 57 sequences with only illumination changes and 59 sequences with only viewpoint changes. We follow the common experimental setup and use the homography accuracy metric of [7].

To estimate the homography, we use our model and the 11 comparison methods to extract the top-500 most confident keypoints from each input image. Keypoint correspondences are constructed via nearest matching of descriptors, and a cross-check step is further applied to eliminate unstable matches. The homography is then estimated with the RANSAC algorithm with default parameters by directly calling the findHomography() function in OpenCV.

As shown in Fig. 7, we plot the homography accuracy curve of each method over reprojection error thresholds from 1 through 10. The average homography accuracy (Avg.HA@1:10) is also calculated and presented in Tab. 2, with the results on the Illumination and Viewpoint subsets reported respectively. The results show that our SEKD model achieves the best overall performance. On the Illumination subset, DELF [23] achieves the best result; however, its performance on the Viewpoint subset is the worst, due to its poor keypoint localization ability. On the Viewpoint subset, our SEKD model outperforms all comparison methods.

The HPatches dataset is planar, and the relation between a pair of images is a homography. However, images from unconstrained real environments usually do not satisfy this constraint.
To this end, we resort to the Image Matching Challenge (IMC) dataset [35], which consists of images from 26 scenes, with each image annotated with a ground-truth 6-DoF pose. For each scene, IMC collected adequate images to reconstruct the scene and estimate the pose of each image using an SfM algorithm; the estimated poses are taken as pseudo ground-truth. Only a subset of images is then selected for evaluation via relative pose estimation and structure-from-motion tasks. By adjusting the error threshold from 1 to 10 degrees, IMC calculates the mean Average Accuracy (mAA) as the metric for comparing methods. Please see the website [35] for more details about this dataset.

We adopt the validation set, since both its images and ground-truth have been released at the moment. It consists of three scenes, i.e., sacre coeur, st peters square, and reichstag. We extract up to 2K keypoints from each image using each comparison method. The keypoint correspondences between each pair of images are then constructed via the same matching algorithm, which in our experiments is the ratio-test for float descriptors and nearest matching for binary descriptors. The mAA metrics are then computed by evaluating the relative pose estimation and structure-from-motion results. For a fair comparison, besides keypoint extraction, all other processes are implemented using the benchmark code released by IMC [35] with the same experimental setups and parameters.

As demonstrated in Tab. 2, our SEKD achieves the best overall performance on the IMC dataset and outperforms the second-place method, i.e., SuperPoint [7], by a large margin of 0.035. Specifically, on the relative pose estimation task, our method outperforms the second place by a large margin of 0.049. On the structure-from-motion task, SuperPoint [7] slightly outperforms our method by 0.006; however, it achieves an unsatisfactory result on relative pose estimation, 0.076 lower than our method.
This experiment indicates that, though our SEKD model is trained only on web images with synthetic affine transformations, it generalizes fairly well to 3D datasets and problems.
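The evaluation utilities used in the two experiments above can be sketched as follows: mutual nearest-neighbor matching with a cross-check, a corner-reprojection homography accuracy criterion in the style of [7], and the IMC-style mAA over thresholds of 1 to 10 degrees. These are simplified stand-ins, not the benchmark implementations.

```python
import numpy as np


def cross_check_matches(d1, d2):
    """Mutual nearest-neighbour matching with a cross-check."""
    D = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=-1)
    nn12, nn21 = D.argmin(axis=1), D.argmin(axis=0)
    return [(i, int(j)) for i, j in enumerate(nn12) if nn21[j] == i]


def homography_correct(H_est, H_gt, size, thresh):
    """A homography is counted correct when the mean reprojection error
    of the four image corners under H_est vs. H_gt is below `thresh` px."""
    w, h = size
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=float)

    def warp(H, pts):
        ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
        return ph[:, :2] / ph[:, 2:3]

    err = np.linalg.norm(warp(H_est, corners) - warp(H_gt, corners), axis=1)
    return bool(err.mean() < thresh)


def mean_average_accuracy(errors_deg, thresholds=range(1, 11)):
    """IMC-style mAA: fraction of poses with angular error below each
    threshold, averaged over thresholds 1..10 degrees."""
    e = np.asarray(errors_deg, dtype=float)
    return float(np.mean([(e < t).mean() for t in thresholds]))
```

Averaging `homography_correct` over all image pairs at a fixed threshold gives one point of the accuracy curve in Fig. 7; sweeping the threshold from 1 to 10 and averaging gives Avg.HA@1:10, mirroring how mAA is built from angular errors.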
To examine the effectiveness of each key training strategy in our framework, we further conduct an ablation experiment on the homography estimation task with the HPatches dataset. As shown in Tab. 3, when we replace the descriptor repeatability (13) or the descriptor distinctness (14) with a constant value, the Avg.HA@1:10 decreases dramatically, which verifies the rationality of our algorithm. When we instead delete the detector repeatability loss (20) or the affine adaption (3)&(16), the performance also decreases, which verifies that these two strategies improve the stability of our framework along with the trained model.

Figure 7. The homography accuracy curves of our SEKD model and 11 comparison methods (ORB, AKAZE, BRISK, SURF, KAZE, SIFT, D2-Net, DELF, LF-Net, SuperPoint, and R2D2) along different reprojection error thresholds from 1 through 10 on the HPatches overall data, illumination subset, and viewpoint subset, respectively.

Table 2. The average homography accuracy (Avg.HA) of our SEKD model and 11 comparison methods on the HPatches dataset, and the mean average accuracy (mAA) of relative pose estimation (stereo) and structure-from-motion (SfM) on the IMC dataset.

Method       | Avg.HA@1:10 on HPatches       | mAA on IMC
             | Mean     ILL.     VIEW.       | Mean    Stereo   SfM
ORB [29]     | 48.96%   60.28%   38.03%      | 0.064   0.032    0.097
AKAZE [25]   | 59.22%   70.63%   48.20%      | 0.190   0.079    0.302
BRISK [15]   | 61.15%   71.08%   51.55%      | 0.111   0.040    0.183
SURF [4]     | 66.77%   78.94%   55.01%      | 0.238   0.149    0.328
KAZE [2]     | 68.10%   81.82%   54.84%      | 0.270   0.169    0.371
SIFT [19]    | 74.13%   84.28%   64.33%      | 0.342   0.258    0.427
D2-Net [9]   | 30.96%   47.12%   15.35%      | 0.025   0.025    0.025
DELF [23]    |
R2D2 [28]    |

On the IMC dataset, we reduce the dimension of the DELF descriptor from 1024 to 512 using PCA, as the benchmark code refuses to take longer descriptors as input. R2D2 adopts an image pyramid as input for better performance; for a fair comparison, we only compare the results taking the initial image as input. Actually, with an image pyramid as input, the mean results of R2D2 and our method would be updated to 72.81% and 0.442, and 79.74% and 0.496, on HPatches and IMC, respectively. However, this has no influence on the conclusions.
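The Avg.HA@1:10 score reported above can be illustrated with a toy sketch: warp the image corners with the estimated and ground-truth homographies, average the corner reprojection error, and then average the per-threshold accuracies from 1 through 10 pixels. This is a hypothetical simplification of the benchmark (which aggregates over many image pairs); the function names and corner set are assumptions, not the HPatches evaluation code.

```python
import math

def warp(H, pt):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = pt
    d = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / d,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / d)

def avg_homography_accuracy(H_est, H_gt, corners, thresholds=range(1, 11)):
    """Mean corner reprojection error between the estimated and ground-truth
    homographies, converted to an accuracy averaged over pixel thresholds
    (an Avg.HA@1:10-style score: 1.0 when the error beats every threshold)."""
    err = sum(math.dist(warp(H_est, c), warp(H_gt, c))
              for c in corners) / len(corners)
    return sum(err <= t for t in thresholds) / len(thresholds)

# Toy example: the estimate is off by a pure 2.5 px horizontal translation,
# so the corner error is 2.5 px and only thresholds 3..10 are satisfied.
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
T = [[1, 0, 2.5], [0, 1, 0], [0, 0, 1]]
corners = [(0, 0), (640, 0), (0, 480), (640, 480)]
print(avg_homography_accuracy(T, I, corners))  # → 0.8
```

Averaging over a range of thresholds, rather than picking a single one, is also what makes the IMC mAA metric robust to the choice of error tolerance.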
6. Discussion and Conclusion
In this paper, we analyze the inherent and interactive properties of local feature detector and descriptor. Guided by these properties, a self-evolving framework is elaborately designed to update the detector and descriptor iteratively using unlabeled images. Extensive experiments verify the effectiveness of our method on both planar and 3D datasets, though our model is trained only using planar data. Moreover, as our framework works well using only unlabeled data, theoretically, besides natural images, it can also be adopted to discover novel local features from other types of data, e.g., medical images, infrared images, and remote sensing images. We leave these as our future work.

References

[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a Day. Commun. ACM, 54(10):105–112, Oct. 2011.
[2] Pablo Fernández Alcantarilla, Adrien Bartoli, and Andrew J. Davison. KAZE Features. In
ECCV, 2012.
[3] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. In CVPR, 2017.
[4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In ECCV, 2006.
[5] Matthew Brown and David G. Lowe. Automatic Panoramic Image Stitching using Invariant Features. International Journal of Computer Vision, 74(1):59–73, Aug. 2007.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[7] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In CVPR Workshops, 2018.
[8] Paolo Di Febbo, Carlo Dal Mutto, Kinh Tieu, and Stefano Mattoccia. KCNN: Extremely-Efficient Hardware Keypoint Detection With a Compact Convolutional Neural Network. In CVPR Workshops, 2018.
[9] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A Trainable CNN for Joint Description and Detection of Local Features. In CVPR, 2019.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings in Deep Residual Networks. In ECCV, 2016.
[11] Josef Sivic and Andrew Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In ICCV, 2003.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS, pages 1097–1105, 2012.
[13] Axel Barroso Laguna, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Key.Net: Keypoint Detection by Handcrafted and Learned CNN Filters. In ICCV, 2019.
[14] C. Leng, H. Zhang, B. Li, G. Cai, Z. Pei, and L. He. Local Feature Descriptor for Image Matching: A Survey. IEEE Access, pages 6424–6434, 2019.
[15] Stefan Leutenegger, Margarita Chli, and Roland Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. In ICCV, 2011.
[16] Yunpeng Li, Noah Snavely, Dan Huttenlocher, and Pascal Fua. Worldwide Pose Estimation Using 3D Point Clouds. In ECCV, 2012.
[17] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. In ICCV, 2017.
[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[19] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[20] Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. ContextDesc: Local Descriptor Augmentation with Cross-Modality Context. In CVPR, 2019.
[21] Zixin Luo, Tianwei Shen, Lei Zhou, Siyu Zhu, Runze Zhang, Yao Yao, Tian Fang, and Long Quan. GeoDesc: Learning Local Descriptors by Integrating Geometry Constraints. In ECCV, 2018.
[22] Anastasya Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working Hard to Know Your Neighbor's Margins: Local Descriptor Learning Loss. In NeurIPS, 2017.
[23] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-Scale Image Retrieval With Attentive Deep Local Features. In ICCV, 2017.
[24] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. LF-Net: Learning Local Features from Images. In NeurIPS, 2018.
[25] Pablo F. Alcantarilla, Jesús Nuevo, and Adrien Bartoli. Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces. In BMVC, 2013.
[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic Differentiation in PyTorch. In NeurIPS Workshop, 2017.
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS, pages 91–99, 2015.
[28] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2D2: Repeatable and Reliable Detector and Descriptor. In NeurIPS, 2019.
[29] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R. Bradski. ORB: An Efficient Alternative to SIFT or SURF. In ICCV, 2011.
[30] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. Quad-Networks: Unsupervised Learning to Rank for Interest Point Detection. In CVPR, 2017.
[31] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In ICCV, 2015.
[32] Yafei Song, Xiaowu Chen, Xiaogang Wang, Yu Zhang, and Jia Li. 6-DOF Image Localization From Massive Geo-Tagged Reference Images. IEEE Transactions on Multimedia, 18(8):1542–1554, 2016.
[33] Yafei Song, Di Zhu, Jia Li, Yonghong Tian, and Mingyang Li. Learning Local Feature Descriptor with Motion Attribute for Vision-based Localization. In IROS, 2019.
[34] Yurun Tian, Bin Fan, and Fuchao Wu. L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In CVPR, 2017.
[35] Eduard Trulls, Yuhe Jin, Kwang Yi, Dmytro Mishkin, Jiri Matas, Anastasiia Mishchuk, and Pascal Fua. Image Matching Challenge 2020. https://vision.uvic.ca/image-matching-challenge/.
[36] Yannick Verdie, Kwang Yi, Pascal Fua, and Vincent Lepetit. TILDE: A Temporally Invariant Learned DEtector. In CVPR, 2015.
[37] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
[38] Mingming Zhang, Xingxing Zuo, Yiming Chen, and Mingyang Li. Localization for Ground Robots: On Manifold Representation, Integration, Re-Parameterization, and Optimization. In IROS, 2019.