Progressive Correspondence Pruning by Consensus Learning
Chen Zhao†*1, Yixiao Ge†3, Jiaqi Yang, Feng Zhu‡2, Rui Zhao, and Hongsheng Li
1École Polytechnique Fédérale de Lausanne (EPFL)   2SenseTime Research   3CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong   School of Computer Science, Northwestern Polytechnical University
Abstract
Correspondence selection between two groups of feature points aims to correctly recognize the consistent matches (inliers) from the initial noisy matches. The selection is generally challenging since the initial matches are typically extremely unbalanced, where outliers can easily dominate. Moreover, random distributions of outliers lead to the limited robustness of previous works when applied to different scenarios. To address this issue, we propose to denoise correspondences with a local-to-global consensus learning framework to robustly identify correspondences. A novel "pruning" block is introduced to distill reliable candidates from the initial matches according to their consensus scores, which are estimated by dynamic graphs from local to global regions. The proposed correspondence denoising is progressively achieved by stacking multiple pruning blocks sequentially. Our method outperforms state-of-the-art methods on robust line fitting, wide-baseline image matching, and image localization benchmarks by noticeable margins and shows promising generalization capability on different distributions of initial matches.
1. Introduction
Accurate pixel-wise correspondences act as a premise to tackle many important tasks in computer vision and robotics, e.g., Structure from Motion (SfM) [31], Simultaneous Localization and Mapping (SLAM) [24], image stitching [5], visual localization [25], virtual reality [33], etc. But unfortunately, initial feature correspondences established by off-the-shelf detector-descriptors [20, 24, 22, 8] are far from satisfactory due to the possible challenging cross-image variations, e.g., rotations, translations, scale changes, viewpoint changes, illumination changes, etc. Correspondence selection [40, 14] is therefore needed to select correct matches (inliers) and reject false matches (outliers), where deep learning-based methods [23, 39, 38, 32] have been proven effective.

* The work was done when the author was an intern at SenseTime Research.
† First two authors contributed equally.
‡ Corresponding author.

Figure 1. Correspondence denoising via local-to-global consensus learning. Given initial correspondences (bottom-left image) with dominant outliers, correctly recognizing inliers remains challenging. The correspondences are gradually tailored into reliable candidates (bottom-right image) based on their consensus scores estimated by local-to-global graphs, encouraging accurate inlier selection and robust model estimation.

Correspondence selection is generally cast as a per-match classification task in existing learning-based methods [23, 39, 38, 32], where Multi-Layer Perceptrons (MLPs) are adopted to classify putative matches into the subsets of inliers or outliers. The optimization of such a binary classification problem is non-trivial, since the initial matches are likely to be extremely unbalanced, with a ratio of around 90% outliers [39]. Moreover, directly identifying inliers by two-class prediction is prone to suffering from randomly distributed outliers in real-world scenarios. We illustrate the above limitations in Fig. 2 via a toy line fitting task. It requires the model to fit a line from given data points corrupted by randomly sampled outliers. Note that we adopt identical inliers and different outliers in the two cases. We observe that the baseline method PointCN [23] is not robust and fails in the second case.

Figure 2. Robust line fitting under different distributions of outliers with PointCN [23] and our method, where inliers are identical in the two cases while outliers are randomly sampled. PointCN fails to find the correct solution in the second case, showing inferior robustness under arbitrary outlier distributions. By contrast, our method gradually tailors the correspondences into reliable candidates for further line fitting, mitigating the effects of randomly sampled noise to a large extent. The introduced denoising allows our method to be better generalized to different scenarios.

Given the observed limitations of existing learning-based methods, we argue that properly denoising correspondences is crucial for robust model estimation, i.e., gradually tailoring the noisy initial matches into reliable candidates (see Fig. 1). Such an intuition is inspired by the classic algorithm RANSAC [9], whose core idea is to iteratively sample the most reliable subset with sufficient inliers. The introduced correspondence denoising could largely mitigate the effects of unbalanced initial matches and stably improve the robustness of model estimation, since inliers are expected to account for a larger proportion in the selected subset than in the raw matches. Apparently, a distinctive pruning strategy is a cornerstone of correspondence denoising.
As one cannot classify an isolated correspondence without context information, we propose to operate the pruning via a local-to-global consensus learning framework, which explicitly captures local and global context for correspondences. Specifically, we dynamically construct local graphs for correspondences, where the nodes (neighbors) and edges are determined from feature embeddings. An annular convolutional layer is introduced to estimate the consensus scores of aggregated features that represent the consistency in local graphs. The local graphs are further extended to a global graph guided by the local consensus scores, where the global consensus is described by a spectral graph convolutional layer [16]. The local and global consensus learning layers together form a novel "pruning" block, which preserves potential inliers with higher consensus scores while filtering out noisy ones with lower scores. Our proposed correspondence denoising is progressively achieved by building up multiple pruning blocks sequentially. Such an architecture design encourages the refinement of local and global consensus learning at multiple scales. Note that the context information is implicitly modeled in previous works [23, 32] by feature normalization, while we explicitly exploit the context with local-to-global graphs. The superiority of our method is empirically shown in Sec. 4.
Our contributions are summarized as three-fold.
• We for the first time propose to gradually prune correspondences for better inlier recognition and model estimation, which alleviates the effects of unbalanced initial matches and random outlier distributions.
• A local-to-global consensus learning framework is introduced to distill the putative correspondences into reliable candidates, where both local and global consensuses are properly estimated by establishing dynamic graphs on-the-fly*.
• We demonstrate the effectiveness of our method on the tasks of robust line fitting, wide-baseline image matching, and image localization, where our method considerably outperforms state-of-the-art approaches.
2. Related Works
Generation-verification framework.
The generation-verification framework has been widely used for robust model estimation; it iteratively generates hypotheses and verifies their confidence, as in RANSAC [9], LO-RANSAC [7], PROSAC [6], USAC [27], NG-RANSAC [4], etc. Specifically, RANSAC [9] randomly samples a minimal subset of data points to estimate a parametric model, and then verifies its confidence by evaluating the consistency between the data points and the generated parametric model. NG-RANSAC [4] proposes a two-stage approach which improves the sampling strategy of RANSAC with a pre-trained deep neural network. RANSAC and its variants appear as powerful solutions when proper inliers exist in the initial matches, where a reliable parametric model estimated from a noise-free sample can be expected after enough iterations. But unfortunately, such a pipeline is vulnerable to enormous outliers due to the inevitable noise in the sampled subsets.

* The code will be released.
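As a concrete illustration of this sample-and-verify loop, the following is a minimal RANSAC sketch for the 2D line-fitting setting discussed later in Sec. 4 (not the cited implementations; the iteration count and inlier threshold are illustrative assumptions):

```python
import numpy as np

def ransac_line(points, iters=500, thresh=0.05, seed=None):
    """Minimal RANSAC for 2D line fitting: repeatedly hypothesize a line
    from a random 2-point sample and keep the largest consensus set."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        # line through the two sampled points: a*x + b*y + c = 0
        a, b, c = y2 - y1, x1 - x2, x2 * y1 - x1 * y2
        norm = np.hypot(a, b)
        if norm < 1e-9:  # degenerate (coincident) sample, skip
            continue
        dist = np.abs(points @ np.array([a, b]) + c) / norm
        inliers = dist < thresh
        if inliers.sum() > best.sum():  # verification: consensus size
            best = inliers
    return best
```

The returned mask can then feed a final least-squares refit on the consensus set, which is exactly the step that fails when outliers dominate every sampled subset.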
Per-match classification.
Inspired by the tremendous success of deep learning [12, 28, 17], it is desirable to perform correspondence selection via deep networks. However, it is non-trivial to employ 2D convolutions to consume correspondences due to their irregular and unordered characteristics. As a pioneer among learning-based methods, PointCN [23] treats correspondence selection as per-match classification, using MLPs to predict the label (inlier or outlier) for each correspondence. Following PointCN, per-match classification has become mainstream. NM-Net [39] extracts reliable local information for correspondences via compatibility-specific mining, which relies on known affine attributes. OANet [38] presents a combination of differentiable pooling and unpooling to describe local context and generate the full-size prediction, respectively. An attentive context normalization is proposed in [32], which implicitly represents global context by the weighted mean and variance. Although existing methods have shown satisfactory performance, they still suffer from the issue of dominant outliers in the putative correspondences. To address this issue, we suggest denoising correspondences into reliable candidates for more robust model estimation. Correspondence denoising is found effective in mitigating the effects of noisy outliers to a large extent (see Sec. 4).
Consensus in correspondences.
Correct matches have been proven consistent in epipolar geometry or under the homography constraint [11], while mismatches are inconsistent because of their random distribution. The consensus in correspondences has been studied by hand-crafted methods in the literature. GTM [1] performs game-theoretic matching based on a payoff function which utilizes affine information around keypoints to measure the consistency between a pair of correspondences. LPM [21] assumes that the local geometry around correct matches does not change freely. The geometry variation is represented by the consensus of k-nearest neighbors on keypoint coordinates, which is not discriminative enough as evaluated in [40]. Inspired by these hand-crafted efforts, we suggest learning consensus via local-to-global graphs.
3. Method
To tackle the challenge of noisy initial correspondences, we present a local-to-global Consensus Learning framework (CLNet) (see Fig. 3), which consists of sequential "pruning" blocks. The key innovation of our framework lies in progressively tailoring the putative correspondences into more reliable candidates by exploiting their consensus scores, which are learned from dynamic local-to-global graphs. A parametric model is estimated from the most confident inliers in the denoised subset and further employed to determine inliers and outliers out of the full-size putative correspondences. In order to encourage a continuous latent space for more accurate consensus estimation, we introduce adaptive temperatures in the inlier/outlier classification losses as the training objective.

Figure 3. Overall framework. N represents the number of matches and C denotes the 4D locations of matched keypoints. Raw data are gradually tailored into N̂ candidates with K pruning blocks guided by local-to-global consensus learning. A parametric model is estimated from the N̂ denoised candidates and is further employed for the full-size verification, yielding N×1 inlier/outlier predictions for the initial correspondences.
Given an image pair (I, I′), putative correspondences C can be established via nearest neighbor matching between extracted keypoints according to the corresponding descriptors, denoted as C = [c_1, · · · , c_N] ∈ R^{N×4}. c_i = [x_i, y_i, x′_i, y′_i] indicates a correspondence between a keypoint (x_i, y_i) in the image I and another keypoint (x′_i, y′_i) in the paired image I′. There is no restriction on the off-the-shelf detectors or descriptors; either hand-crafted methods [20, 29] or learned ones [22, 8] are compatible. As the putative correspondences C are prone to containing enormous mismatches, the task of correspondence selection is introduced to select the correct matches (inliers) C_p and reject the noisy ones (outliers) C_n.

Existing learning-based methods [23, 32] generally cast the correspondence selection task as an inlier/outlier classification problem by adopting permutation-invariant neural networks to predict inlier weights w = tanh(ReLU(o)) ∈ [0, 1) for all putative correspondences C, where o is the output of the networks. It is worth noting that only C serves as input for the networks. The correspondence c_i ∈ C will be categorized into the outliers C_n if its predicted weight w_i = 0. The predicted weights w are not only utilized to recognize inliers but also employed as an auxiliary input for model estimation, e.g., yielding the essential matrix Ê for camera pose estimation [11].

3.2. Local-to-Global Consensus Learning

Predicting accurate weights w is at the core of learning-based correspondence selection methods. However, it is non-trivial due to the dominant outliers in putative correspondences. Specifically, the initial matches are prone to being extremely unbalanced, with over 90% outliers [39] for each image pair. The outliers are randomly distributed in real-world scenarios and might not be "seen" in the training samples.
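For illustration, the inlier-weight activation described above, w = tanh(ReLU(o)), can be sketched in NumPy (the input values are hypothetical network outputs):

```python
import numpy as np

def inlier_weights(o):
    """w = tanh(ReLU(o)): non-positive outputs map to exactly 0 (predicted
    outliers), positive outputs map smoothly into (0, 1)."""
    return np.tanh(np.maximum(o, 0.0))
```

Because every negative output collapses to exactly zero, the activation gives a hard outlier decision (w_i = 0) while still providing a soft confidence for the surviving matches.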
We argue that existing methods [23] heavily suffer from this issue, since training a network to straightforwardly perform full-size per-match classification is sub-optimal.

Correspondence denoising.
By contrast, we introduce to gradually tailor the putative correspondences C into reliable candidates Ĉ by removing most outliers, mitigating the effects of randomly distributed outliers to a large extent. By predicting inlier weights ŵ for the denoised subset Ĉ, we can expect more confident inliers Ĉ_p and regress a more accurate parametric model (the essential matrix Ê as an example) from Ĉ_p. In turn, the computed model can be adopted to perform a full-size verification on the entire set C for specific applications, e.g., image localization. Our solution therefore becomes a "generation-verification" pipeline, inspired by the classic and effective RANSAC [9] algorithm, such that

(ŵ, Ĉ) = f_φ(C),  Ê = g(ŵ, Ĉ),  w = h(Ê, C),   (1)

where f_φ is a deep neural network with learnable parameters φ that performs correspondence denoising and weight prediction simultaneously, g(·, ·) denotes parametric model estimation (generation), and the optional h(·, ·) performs full-size predictions (verification).

As inliers in C_p are expected to be consistent in both local and global contexts, we propose to estimate local-to-global consensus scores for correspondences. Correspondences with higher scores are preserved and the remaining ones are immediately removed as outliers. Note that the consensus scores also serve as inlier weights in our work, so we use a unified notation w for these two concepts. The local and global consensus layers are used sequentially, forming a novel pruning module, dubbed the "pruning" block in our paper. As illustrated in Fig. 3, multiple "pruning" blocks are established for progressive denoising, i.e., C → · · · → Ĉ. The detailed network architecture of each "pruning" block can be found in Fig. 4. Without loss of generality, we assume that the input of a "pruning" block is C ∈ R^{N×4} and the output is Ĉ ∈ R^{N̂×4} in the remainder of this section, where N̂ < N.

Local consensus.
We propose to leverage local context for each anchor correspondence c_i, building upon a graph with its k-nearest neighbors, denoted as

G_i^l = (V_i^l, E_i^l), 1 ≤ i ≤ N,   (2)

where V_i^l = {c_i^1, · · · , c_i^k} is the set of k-nearest neighbors of c_i, and E_i^l = {e_i1^l, · · · , e_ik^l} indicates the edges describing the affinities between c_i and its neighbors in V_i^l.

Specifically, given an input correspondence c_i ∈ C, we encode it into a feature representation z_i by a series of ResNet blocks [12]. The k-nearest neighbors of each c_i are determined by ranking the Euclidean distances between z_i and the other features {z_j | 1 ≤ j ≤ N, j ≠ i}. Following [36], we utilize the concatenated anchor feature and residual feature as the edge, denoted as

e_ij^l = [z_i, z_i − z_i^j], 1 ≤ j ≤ k,   (3)

where z_i^j is the feature representation of c_i's j-th neighbor and e_ij^l is the edge linking c_i and c_i^j in the local graph.

Given the established local graphs, our goal is to describe the local consensus, represented by a consensus score for each anchor correspondence. Intuitively, we can split such a process into two steps: 1) aggregating the feature z_i → z̃_i by passing messages along the graph edges E_i^l, and 2) predicting a consensus score for z̃_i via MLPs. A naïve way of feature aggregation is using convolutions followed by pooling layers [36] (discussed in Table 4). However, this operation may lose the structural information in the graphs, i.e., the k-nearest neighbors in V_i^l are actually sorted by their affinities and should be treated differently. To make the most of the graph knowledge, we propose a novel annular convolutional layer (shown in Fig. 5), gradually aggregating features in groups.
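A NumPy sketch of the dynamic graph construction (Eqs. 2-3) and the grouped annular aggregation it feeds; the feature dimensions and kernel shapes are illustrative assumptions, and the learned kernels are stand-ins for the paper's convolutions:

```python
import numpy as np

def local_graph_edges(z, k):
    """kNN graph in feature space with edge features e_ij = [z_i, z_i - z_j]
    (Eq. 3); neighbors come out sorted by affinity, nearest first."""
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-matches
    nbr = np.argsort(d2, axis=1)[:, :k]                  # (N, k) neighbor indices
    anchor = np.repeat(z[:, None, :], k, axis=1)         # (N, k, d)
    return nbr, np.concatenate([anchor, anchor - z[nbr]], axis=-1)  # (N, k, 2d)

def annular_aggregate(edges, p, W1, b1, W2, b2):
    """Split the k sorted neighbors into k/p annuli of p nodes each, aggregate
    every annulus with one kernel, then fuse the k/p annulus features with a
    second, separately parameterized kernel (cf. Eq. 4 and Fig. 5)."""
    n, k, d = edges.shape
    assert k % p == 0, "k must be divisible by p"
    pooled = np.maximum(edges.reshape(n, k // p, p * d) @ W1 + b1, 0.0)
    return pooled.reshape(n, -1) @ W2 + b2               # one feature per anchor
```

Grouping p consecutive (affinity-sorted) neighbors into one annulus is what preserves the ordering information that plain max/avg pooling would discard.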
In detail, we assign the nodes in V_i^l to k/p annuli, where p denotes the number of nodes in each annulus and k is expected to be divisible by p. We aggregate the features in each annulus via a convolution kernel as

out = W^l · E_i^l(s) + b^l, 1 ≤ s ≤ k/p,   (4)

where E_i^l(s) is a subset of E_i^l, and W^l and b^l are the learned weights and bias of the convolution. The k/p aggregated features are further integrated by another annular convolution (see the second one in Fig. 5), where the parameters W^l and b^l are not shared between different annular convolutional layers. Subsequently, several ResNet blocks are adopted for feature embedding, yielding z̃_i. We further encode the aggregated feature z̃_i into a local consensus score w_i^l, which reflects the consistency of c_i in the local receptive field. In other words, w_i^l roughly measures the inlier weight of c_i when only considering the local context.

Global consensus.
As aforementioned, the local consensus scores {w_1^l, · · · , w_N^l} evaluate the correspondence consistency only in a fractional context, i.e., a limited k-nearest region. To this end, we introduce to estimate full-size correlations by connecting the local graphs into a global one. We denote the global graph as

G^g = (V^g, E^g),   (5)

where the nodes V^g are represented by the locally aggregated features {z̃_1, · · · , z̃_N}. The edges E^g, which indicate the compatibility of pairs of correspondences, are estimated based on the local consensus scores w^l. Specifically, an edge e_ij^g ∈ E^g is computed by

e_ij^g = w_i^l · w_j^l, 1 ≤ i, j ≤ N.   (6)

Using all the entries in E^g, the adjacency matrix A ∈ R^{N×N} (A_ij = e_ij^g) is formed, which explicitly describes the global context. Subsequently, the graph Laplacian for the GCN layer [16] is approximated as

L = D̃^{-1/2} Ã D̃^{-1/2},   (7)

where Ã = A + I_N for numerical stability, and D̃ ∈ R^{N×N} is the diagonal degree matrix of Ã. The global embedding is then acquired by

out = L · [z̃_1, · · · , z̃_N] · W^g,   (8)

where {z̃_i} are the aggregated features yielded by the local consensus learning and W^g is a learnable matrix. L modulates {z̃_i} into the spectral domain, considering the isolated local embeddings in a joint manner. The feature filters in the spectral domain enable the propagated features to reflect the consensus from the global graph Laplacian. Similar to the local consensus learning, the global consensus scores w^g are estimated by encoding the aggregated features via a ResNet block followed by MLPs.

Figure 4. Detailed architecture of the proposed pruning block, which consists of local-to-global consensus learning layers. Each ResNet block [12] contains two MLPs followed by Context Normalization [23], Batch Normalization [13], and ReLU. Note that Attentive Context Normalization [32] is not used in our method, because it requires additional supervision. According to the estimated global consensus scores, the reliable candidates of the input correspondences can be selected.

Figure 5. Illustration of the proposed annular convolution. The nodes (colored dots) in a local graph are assigned to annuli based on their affinities to the anchor. The features in each annulus are aggregated by a convolution kernel.
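A sketch of the global propagation of Eqs. (5)-(8), followed by the score-guided selection it enables (keeping the higher-scoring correspondences, as described earlier). The ResNet-block-plus-MLP score head is omitted; the pruning below consumes whatever global scores such a head would produce:

```python
import numpy as np

def global_consensus(z_tilde, w_local, Wg):
    """Eqs. (5)-(8): adjacency A_ij = w_i^l * w_j^l, self-loops added,
    symmetric normalization, then one GCN-style propagation step."""
    A = np.outer(w_local, w_local)                  # Eq. (6)
    A_tilde = A + np.eye(len(w_local))              # ~A = A + I_N
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    L = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]  # ~D^-1/2 ~A ~D^-1/2
    return L @ z_tilde @ Wg                         # Eq. (8)

def prune_by_consensus(C, w_global, keep_ratio=0.5):
    """Keep the top-N_hat correspondences by global consensus score,
    preserving the original ordering of the survivors."""
    n_keep = int(len(C) * keep_ratio)
    kept = np.sort(np.argsort(-w_global)[:n_keep])
    return C[kept], kept
```

Because the adjacency is the outer product of local scores, a correspondence with a near-zero local score contributes almost nothing to any other node's global embedding, so local evidence directly gates the global message passing.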
Consensus-guided pruning.
Since the global consensus scores jointly consider both global and local context, we prune the putative correspondences based on their global consensus scores w^g. Specifically, the elements in C are sorted in descending order by w^g. The top-N̂ correspondences are preserved and the remaining ones are removed as noisy outliers. Besides, inspired by [38], we employ an iterative design, which takes the local and global consensus scores as additional input to the next pruning block. A ResNet block and MLPs are used after the last pruning block to predict the inlier weights ŵ of the denoised candidates for the weighted model estimation.

Training objectives.
Learning-based correspondence selection methods [23, 32] generally require an inlier/outlier classification loss and a regression loss for model estimation as the training objective. For camera pose estimation, the ground-truth labels of C are assigned by epipolar distances with an ad-hoc threshold d_thr on widely-used benchmarks [38, 14], which is empirically set as 1e-4. Although training with a conventional binary cross-entropy loss has achieved satisfactory performance, we argue that inevitable label ambiguity exists, especially for correspondences whose epipolar distances are around d_thr. Intuitively, the confidence of c_i should be negatively correlated with the corresponding epipolar distance d_i, i.e., d_i → 0 for an inlier. To this end, we introduce an adaptive temperature for putative inliers (d_i < d_thr), which is computed by a Gaussian kernel

τ_i = exp(−‖d_i − d_thr‖ / (α · d_thr)),   (9)

where α is the kernel width. For outliers c_i with d_i ≥ d_thr, we set τ_i = 1. Note that the error of label assignment cannot be eliminated due to the inherent ambiguity of the epipolar constraint [11]. The overall training objective is denoted as

L = L_cls + λ L_reg(Ê, E),   (10)

where L_reg represents a regression loss [38] on the estimated parametric model Ê, and λ is a weighting factor. The binary classification loss L_cls with our proposed adaptive temperature is formulated as

L_cls = Σ_{j=1}^{K} ( ℓ_bce(H(o_j^l), y_j) + ℓ_bce(H(o_j^g), y_j) ) + ℓ_bce(H(ô), ŷ),   (11)

where o_j^l and o_j^g are the outputs of the local and global consensus learning layers in the j-th pruning block, respectively; ô is the output of the last MLP in CLNet (recall that w = tanh(ReLU(o))); H(o) = σ(τ · o) (σ is the sigmoid activation); y_j and ŷ denote the sets of binary ground-truth labels; ℓ_bce indicates a binary cross-entropy loss; and K is the number of pruning blocks.
As a result, an inlier c_i with a smaller d_i is more confident and enforces larger regularization on the model optimization via a smaller temperature.
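A sketch of the adaptive temperature of Eq. (9) and the tempered cross-entropy term used inside Eq. (11); the logits and labels in the usage example are hypothetical:

```python
import numpy as np

def adaptive_temperature(d, d_thr, alpha=1.0):
    """Eq. (9): tau_i = exp(-|d_i - d_thr| / (alpha * d_thr)) for putative
    inliers (d_i < d_thr); tau_i = 1 for outliers."""
    tau = np.ones_like(d, dtype=float)
    m = d < d_thr
    tau[m] = np.exp(-np.abs(d[m] - d_thr) / (alpha * d_thr))
    return tau

def tempered_bce(o, y, tau):
    """Binary cross-entropy on H(o) = sigmoid(tau * o)."""
    p = 1.0 / (1.0 + np.exp(-tau * o))
    eps = 1e-12  # numerical guard for log
    return float(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean())
```

With alpha = 1, an inlier lying exactly on the epipolar line (d_i = 0) gets the smallest temperature exp(-1), while a borderline match near d_thr gets a temperature close to 1, smoothly down-weighting ambiguous labels.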
4. Experiments
Experiments are conducted on four datasets, covering the tasks of robust line fitting (Sec. 4.2), wide-baseline image matching (Sec. 4.3), and image localization (Sec. 4.4). Comprehensive analyses in Sec. 4.5 demonstrate the effectiveness of each component in our method.
Two pruning blocks are adopted sequentially, tailoring the putative N correspondences into N/4 candidates, i.e., pruning by half for each block. We set k = 9 and k = 6 as the numbers of nearest neighbors in the two blocks when establishing local graphs, respectively. We use p = 3 in Eq. (4) and α = 1 in Eq. (9). The ADAM optimizer is employed with a batch size of 32 and a constant learning rate for training.

We assume a 2D line ax + by + c = 0 with randomly sampled parameters (a, b, c). Inliers are generated by randomly sampling x and computing y from the line equation. Outliers are taken into account by randomly locating (x, y) within the sampling range. N = 1000 points are selected in total to fit each line by determining inliers from the noisy data and approximating (a, b, c) by least squares [18]. Separate sample sets are used for training and testing, respectively.
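The synthetic setup above can be sketched as follows; the exact sampling intervals are garbled in the source text, so the ranges used here ([0, 1] for the parameters, [−1, 1] for the points) are assumptions:

```python
import numpy as np

def make_line_sample(n=1000, outlier_ratio=0.9, rng=None):
    """Synthetic line-fitting data: a line ax + by + c = 0 with random
    parameters; inliers lie exactly on the line, outliers are uniform noise.
    Sampling ranges are assumed, not taken from the paper."""
    rng = np.random.default_rng(rng)
    a, c = rng.uniform(0.0, 1.0, 2)
    b = rng.uniform(0.1, 1.0)  # keep b away from 0 so y is well defined
    n_out = int(n * outlier_ratio)
    x = rng.uniform(-1.0, 1.0, n - n_out)
    inliers = np.stack([x, -(a * x + c) / b], axis=1)  # y solved from the line
    outliers = rng.uniform(-1.0, 1.0, (n_out, 2))
    pts = np.concatenate([inliers, outliers])
    labels = np.concatenate([np.ones(n - n_out), np.zeros(n_out)])
    return pts, labels, np.array([a, b, c])
```

A least-squares refit on the points a model flags as inliers then recovers (a, b, c) up to scale, which is how the predicted and ground-truth parameters are compared.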
Figure 6. Line fitting performance. The methods are tested on five datasets where the outlier ratio varies from 50% to 90%. The evaluation metric is the L2 distance between the predicted line parameters and the ground-truth ones.

The results of line fitting are illustrated in Fig. 6. PointCN [23], OANet [38], and PointACN [32] are retrained on the synthetic data, using the official codes released by the authors. We add different levels of perturbation by varying the outlier ratio from 50% to 90%. The L2 distance between the ground-truth (a, b, c) and the predicted ones is reported. Compared with the competitors, our method generalizes well at all five levels and achieves significantly superior results in the hardest case, i.e., 90% outliers.

We conduct experiments of wide-baseline image matching for camera pose estimation. The experiments are performed on the outdoor YFCC100M [34] and indoor SUN3D [37] datasets, following the settings in [38]. As suggested in [23, 38, 32], a weighted 8-point algorithm is used in our model estimation to compute the essential matrix (Ê) with the denoised candidates Ĉ, which is pivotal to recover camera poses. Keypoint coordinates in C are normalized using the camera intrinsics [23]. Epipolar distances of C are estimated under the constraint of Ê and then compared with a threshold (d_thr = 1e-4 by default) to perform a full-size verification. The mean average precision (mAP) of recovered poses under different error thresholds is reported. We also evaluate the inlier/outlier classification results by precision, recall, and F-measure [39].

Table 1 lists the quantitative results on YFCC100M and SUN3D. For the hand-crafted methods, i.e., RANSAC [9] and MAGSAC [3], we clean the initial correspondences by the ratio test [20], because the results without the ratio test are significantly worse than those of the other competitors. For the learned methods, the initial correspondences are consumed directly.
The solution that uses RANSACas a post-processing of learned approaches is also consid-ered. Note that SuperGlue [30] is not compared since ittargets at another task, i.e. predicting high-quality initial matches rather than correspondence selection. As reported,CLNet delivers the best mAP5 and F-measure on both twodatasets. It achieves . mAP5 and . F-measureon YFCC100M, which outperforms other approaches bynoticeable margins, i.e. over . The ground-truth la-6
FCC100M [34] (outdoor) (%)mAP5 mAP10 mAP15 mAP20 Precision Recall F-measureRANSAC [9] 30.25/- 39.58/- 46.02/- 50.64/- 74.92 51.98 60.35MAGSAC [3] 32.80/- 41.61/- 47.70/- 52.20/- 73.00 14.42 23.56PointCN [23] 49.75/29.80 59.66/42.65 65.94/51.33 70.07/57.74 55.80 84.32 64.51NM-Net [39] 51.90/32.93 61.75/46.85 67.99/55.98 72.27/62.51 55.30 85.80 64.71OANet [38] 51.98/35.30 61.76/48.49 67.98/57.38 72.13/63.43 55.35 83.37 64.79OANet++ [38] 52.48/40.93 62.74/54.56 68.87/62.70 72.82/67.93 53.80 87.57 63.99PointACN [32] 52.56/42.80 62.05/56.76 68.60/64.99 72.70/70.42 53.85
SUN3D [37] (indoor) (%)mAP5 mAP10 mAP15 mAP20 Precision Recall F-measureRANSAC [9] 9.97/- 16.24/- 21.70/- 26.21/- 59.81 29.98 38.00MAGSAC [3] 13.15/- 20.35/- 26.02/- 30.60/- 56.89 0.08 13.17PointCN [23] 15.55/9.82 24.51/18.44 31.33/25.96 36.72/32.09 47.28 82.85 56.48NM-Net [39] 16.86/14.13 25.55/24.01 32.56/31.85 38.09/38.19 46.68 83.98 56.34OANet [38] 17.41/15.56 27.22/26.19 34.33/34.37 40.00/40.76 47.21 84.04 56.73OANet++ [38] 17.29/16.85 26.69/27.20 33.80/35.30 39.36/41.65 47.12 84.21 56.81PointACN [32] 17.09/17.44 26.67/28.23 34.01/ / /36.29 /42.52 Table 1. Performance on YFCC100M [34] and SUN3D [37], in terms of mAP on recovered poses, precision, recall and F-measure metricson inlier recognition. As for the mAP, -/- represents results with / without RANSAC post-processing. Ratio test is employed as a pre-processing technique for both RANSAC and MAGSAC.
Trained on YFCC100M [34] with SIFT [20], mAP5 (%)
Method          SUN3D [37] (SIFT)  YFCC100M (ORB [29])  YFCC100M (DoG-HardNet [22])
PointCN [23]    1.39               9.65                 29.45
NM-Net [39]     0.94               10.45                29.65
OANet++ [38]    2.91               13.00                41.88
PointACN [32]   1.83               12.38                41.70
Our CLNet       –                  –                    –
Table 2. Evaluation of generalization capability. All models are trained on YFCC100M [34] with SIFT [20], and are directly tested on SUN3D [37] with SIFT and on YFCC100M with ORB [29] and DoG-HardNet [22]. mAP5 (%) is reported.

The ground-truth label assignments of the SUN3D dataset are considerably ambiguous due to repetitive and low-texture content [23], which makes RANSAC an important post-processing step [14]. Our CLNet-RANSAC combination outperforms all state-of-the-arts, and CLNet alone achieves competitive performance on SUN3D. Note that previous learning-based methods, e.g.,
PointACN and OANet++, are not well compatible with RANSAC and show inferior final performance, which would limit their applications in real-world scenarios.

We also take generalization capacity into account by testing the learned methods on SUN3D with SIFT and on YFCC100M with ORB [29] and DoG-HardNet [22]. We choose ORB and DoG-HardNet to generate initial matches with two different data distributions, because the former has been widely used in SLAM [24] and the latter is the most robust detector-descriptor combination evaluated in [14]. The tested models are trained on YFCC100M with SIFT. As listed in Table 2, CLNet achieves the best performance under all settings, which shows the flexibility of our method across different datasets and detector-descriptor combinations.
We next consider an application to image-based localization, which aims at localizing a query image by retrieving nearby reference images with geographical tags [10]. Existing image-based methods [2, 10] take raw images as input and learn discriminative feature descriptors for retrieval. We suggest utilizing correspondence-based methods as a post-processing technique for image-based ones, achieving coarse-to-fine localization in three steps. 1) Image-based methods [2, 10] are used to search for the top-k (k is empirically set to 100) images for each query. We do not re-rank all reference images due to the intractable time consumption. 2) Feature matching techniques [20, 9, 23] are then carried out on each pair of retrieved reference image and query image, producing putative correspondences. 3) The top-k images are re-ranked by a refined similarity. Specifically, the similarity between each query image and gallery image is refined as S_img + S_inl, where S_img is the original similarity estimated by the image-based method, and S_inl is the inlier number normalized to [0, 1].

In the experiments, we use the state-of-the-art SFRS [10] as the image-based method and employ the correspondence-based RANSAC [9], PointCN [23], and our CLNet as the post-processing technique. Note that PointCN and CLNet

Method              Tokyo 24/7 [35] (%)
                    R@1          R@5          R@10
NetVLAD [2]         73.3         82.9         86.0
CRN [15]            75.2         83.8         87.3
SARE [19]           79.7         86.7         90.5
SFRS [10]           85.4         91.1         93.3
RANSAC [9]-SFRS     88.6 (+3.2)  93.0 (+1.9)  93.7 (+0.4)
PointCN [23]-SFRS   89.5 (+4.1)  93.3 (+2.2)  94.3 (+1.0)
Our CLNet-SFRS      –            –            –

Table 3. Evaluation on the image localization dataset Tokyo 24/7 [35], in terms of Recall@1/5/10.
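Step 3 above, re-ranking the top-k shortlist by S_img + S_inl, can be sketched in a few lines of Python. Normalizing the inlier count by the per-query maximum is our assumption; the text only states that S_inl is normalized to [0, 1]:

```python
def rerank(query_scores, inlier_counts):
    """Refine retrieval similarities with correspondence inliers.

    query_scores: dict image_id -> S_img from the retrieval model.
    inlier_counts: dict image_id -> raw inlier count from matching.
    Returns the shortlist ids re-ranked by S_img + S_inl.
    """
    max_inl = max(inlier_counts.values()) or 1  # avoid division by zero
    refined = {
        img: s + inlier_counts[img] / max_inl   # S_inl normalized to [0, 1]
        for img, s in query_scores.items()
    }
    return sorted(refined, key=refined.get, reverse=True)
```

A gallery image with a modest retrieval score but many verified inliers can thus overtake a visually similar but geometrically inconsistent one.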
Figure 7. (a) mAP5 of poses estimated from candidates sampled by CLNet and PointCN [23] under different sampling rates (100%, 50%, 25%, 10%, and 5%); (b) inlier ratios of nodes in local graphs anchored by inliers (positive anchors) and outliers (negative anchors) in CLNet.
Figure 8. The inlier ratio of candidates (blue line) and mAP5 of poses (orange line) estimated by CLNet with different numbers of pruning iterations.

are pretrained on YFCC100M [34]. As shown in Table 3, CLNet further improves SFRS by a noticeable gain in terms of Recall@1. Since image-based methods focus on global descriptors, while correspondence-based methods pay more attention to local patterns, the two kinds of methods are well compatible with each other. The superiority of our method can also be observed by comparing "CLNet-SFRS" with "RANSAC-SFRS" and "PointCN-SFRS", where our CLNet achieves more advanced performance than both RANSAC and PointCN.

Consensus-guided pruning.
To distill candidates from the initial data, a vanilla solution is to sample matches according to the weights predicted by an off-the-shelf method, e.g., PointCN [23]. One may wonder whether consensus learning is overkill for pruning. To investigate this, we replace consensus learning with PointCN for pruning as a fair comparison. Specifically, PointCN pretrained on YFCC100M is applied iteratively, twice, in the inference phase.
          Annular Conv.   Conv. & Max-pooling
mAP5 (%)  –               –

Table 4. Comparison between the proposed annular convolution and the convolution-pooling strategy. "Conv. & Max-pooling" extracts and aggregates local features with convolutions and max-pooling layers, respectively. mAP5 on YFCC100M [34] is listed.
Local Conv.   Global Conv.   Adaptive temp.   mAP5 (%)
✗             ✗              ✗                –
✓             ✗              ✗                –
✓             ✓              ✗                –
✓             ✓              ✓                –
Table 5. Ablation studies of CLNet on individual components, where the models are trained and tested on YFCC100M [34].

Initial matches are tailored to a fixed number of candidates in the first iteration, and the pruned correspondences are consumed to estimate essential matrices in the second iteration. Fig. 7(a) shows the results with different sampling rates on YFCC100M. We observe that the mAP5 of PointCN declines as the sampling rate gets smaller, which suggests that effective pruning for camera pose estimation is non-trivial. By contrast, our method significantly outperforms PointCN-guided pruning and achieves the optimal mAP5 at a reduced sampling rate. The effectiveness of our consensus learning is further explained in Fig. 7(b), which illustrates the inlier ratios of nodes in local graphs. The grouped neighbors are significantly more consistent, with higher inlier ratios, when graphs are anchored by inliers. The results demonstrate that our method is well capable of enlarging the consensus of inliers while restraining the consensus of outliers.

Moreover, since our pruning block can be applied iteratively, we analyze the effect of the number of iterations on candidate consistency and pose estimation accuracy. As shown in Fig. 8, an increase of inlier ratios is observed after pruning, i.e., from 10% to 80%, which indicates that the candidates are much more consistent than the initial data. The mAP drops after the second iteration, because the number of remaining matches becomes too small to carry out robust model estimation.
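The iterative pruning analyzed above can be sketched generically: score the current candidates, keep the top fraction, and repeat. The scoring network is abstracted here as a callable `score_fn`, and the keep ratio and iteration count are illustrative parameters (our framework stacks two pruning blocks):

```python
import numpy as np

def progressive_prune(matches, score_fn, keep_ratio=0.5, iterations=2):
    """Generic progressive pruning in the spirit of stacked pruning blocks.

    matches: (N, 4) array of correspondences (x1, y1, x2, y2).
    score_fn: callable mapping an (M, 4) array to (M,) confidence scores.
    Returns the surviving candidate correspondences.
    """
    candidates = matches
    for _ in range(iterations):
        scores = score_fn(candidates)
        k = max(1, int(len(candidates) * keep_ratio))
        keep = np.argsort(-scores)[:k]   # indices of the top-k scores
        candidates = candidates[keep]
    return candidates
```

A PointCN-guided variant simply plugs the pretrained network's per-match weights in as `score_fn`; the experiment above shows that this naive choice degrades as `keep_ratio` shrinks.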
Component analysis. We perform ablation studies to shed more light on the rationale of each component in CLNet. Specifically, Table 4 compares the proposed annular convolution with the convolution-pooling strategy that has been widely employed in [26, 36]. The annular convolution achieves a clear improvement in mAP5 on YFCC100M, demonstrating its effectiveness. Furthermore, we evaluate the necessity of each component under different combinations. As reported in Table 5, the optimal performance cannot be achieved by removing any one of them. The local-to-global consensus learning leads to a substantial overall improvement of mAP5 (the third line v.s. the first line), and the performance is further boosted by applying the adaptive temperatures.
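For reference, the "Conv. & Max-pooling" baseline of Table 4 can be sketched as a shared pointwise transform applied to each neighbor's feature, followed by max-pooling over the neighborhood, in the spirit of PointNet++ [26] and DGCNN [36]; the weights and activation here are illustrative, not the trained model:

```python
import numpy as np

def conv_maxpool_aggregate(neighbor_feats, W, b):
    """'Conv. & Max-pooling' baseline, sketched: a shared pointwise
    convolution over every neighbor feature, then max-pooling over
    the neighborhood.

    neighbor_feats: (N, k, C_in) features of the k graph neighbors per node.
    W: (C_in, C_out) shared weights; b: (C_out,) bias.
    Returns (N, C_out) aggregated node features.
    """
    h = np.maximum(neighbor_feats @ W + b, 0.0)  # shared conv + ReLU
    return h.max(axis=1)                         # max-pool over the k neighbors
```

Max-pooling keeps only the strongest activation per channel, discarding the spatial arrangement of neighbors; the annular convolution is designed to retain more of that local structure.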
5. Conclusion
Given the observed effects of dominant noisy outliers in correspondence selection tasks, we for the first time propose to gradually tailor the putative correspondences into reliable candidates with a local-to-global consensus learning framework. Both local and global graphs are established on-the-fly to explicitly describe the consensus in context, leading to better pruning. The proposed correspondence denoising largely alleviates the effect of randomly distributed outliers in various scenarios, showing significant improvements on multiple benchmarks.
References

[1] Andrea Albarelli, Emanuele Rodolà, and Andrea Torsello. Imposing semi-local geometric constraints for accurate correspondences selection in structure from motion: A game-theoretic perspective. Int. J. Comput. Vis., 97(1):36–53, 2012.
[2] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5297–5307, 2016.
[3] Daniel Barath, Jiri Matas, and Jana Noskova. MAGSAC: Marginalizing sample consensus. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10197–10205, 2019.
[4] Eric Brachmann and Carsten Rother. Neural-guided RANSAC: Learning where to sample model hypotheses. In Int. Conf. Comput. Vis., pages 4322–4331, 2019.
[5] Matthew Brown and David G Lowe. Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis., 74(1):59–73, 2007.
[6] Ondrej Chum and Jiri Matas. Matching with PROSAC: Progressive sample consensus. In IEEE Conf. Comput. Vis. Pattern Recog., volume 1, pages 220–226, 2005.
[7] Ondřej Chum, Jiří Matas, and Josef Kittler. Locally optimized RANSAC. In Joint Pattern Recognition Symposium, pages 236–243. Springer, 2003.
[8] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 224–236, 2018.
[9] Martin A Fischler and Robert C Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[10] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In Eur. Conf. Comput. Vis., 2020.
[11] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 770–778, 2016.
[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[14] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. arXiv preprint arXiv:2003.01587, 2020.
[15] Hyo Jin Kim, Enrique Dunn, and Jan-Michael Frahm. Learned contextual feature reweighting for image geo-localization. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3251–3260, 2017.
[16] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[18] Charles L Lawson and Richard J Hanson. Solving Least Squares Problems. SIAM, 1995.
[19] Liu Liu, Hongdong Li, and Yuchao Dai. Stochastic attraction-repulsion embedding for large scale image localization. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2570–2579, 2019.
[20] David G Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004.
[21] Jiayi Ma, Ji Zhao, Junjun Jiang, Huabing Zhou, and Xiaojie Guo. Locality preserving matching. Int. J. Comput. Vis., 127(5):512–531, 2019.
[22] Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working hard to know your neighbor's margins: Local descriptor learning loss. In Adv. Neural Inform. Process. Syst., pages 4826–4837, 2017.
[23] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2666–2674, 2018.
[24] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
[25] James Philbin, Michael Isard, Josef Sivic, and Andrew Zisserman. Descriptor learning for efficient retrieval. In Eur. Conf. Comput. Vis., pages 677–691. Springer, 2010.
[26] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Adv. Neural Inform. Process. Syst., pages 5099–5108, 2017.
[27] Rahul Raguram, Ondrej Chum, Marc Pollefeys, Jiri Matas, and Jan-Michael Frahm. USAC: A universal framework for random sample consensus. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):2022–2038, 2012.
[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Adv. Neural Inform. Process. Syst., pages 91–99, 2015.
[29] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In Int. Conf. Comput. Vis., pages 2564–2571, 2011.
[30] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4938–4947, 2020.
[31] Noah Snavely, Steven M Seitz, and Richard Szeliski. Modeling the world from internet photo collections. Int. J. Comput. Vis., 80(2):189–210, 2008.
[32] Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. ACNe: Attentive context normalization for robust permutation-equivariant learning. In IEEE Conf. Comput. Vis. Pattern Recog., pages 11286–11295, 2020.
[33] Richard Szeliski. Image mosaicing for tele-reality applications. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, pages 44–53, 1994.
[34] Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[35] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1808–1817, 2015.
[36] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. ACM Trans. Graph., 38(5):1–12, 2019.
[37] Jianxiong Xiao, Andrew Owens, and Antonio Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1625–1632, 2013.
[38] Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning two-view correspondences and geometry using order-aware network. In Int. Conf. Comput. Vis., pages 5845–5854, 2019.
[39] Chen Zhao, Zhiguo Cao, Chi Li, Xin Li, and Jiaqi Yang. NM-Net: Mining reliable neighbors for robust feature correspondences. In IEEE Conf. Comput. Vis. Pattern Recog., pages 215–224, 2019.
[40] Chen Zhao, Zhiguo Cao, Jiaqi Yang, Ke Xian, and Xin Li. Image feature correspondence selection: A comparative study and a new contribution. IEEE Trans. Image Process., 29:3506–3519, 2020.

Appendix

A.1. Comparison with NGRANSAC [4]
Outdoor (%)
Method           mAP5   mAP10   mAP20
NGRANSAC [4]     49     54      59
Our CLNet        54     60      66

Indoor (%)
Method           mAP5   mAP10   mAP20
NGRANSAC [4]     –      20      29
Our CLNet        14     21      31

Table 6. Comparison with NGRANSAC on wide-baseline image matching. mAPs (%) of recovered camera poses are reported.
NGRANSAC [4] also targets correspondence selection, but on a different benchmark. We therefore do not include its results in Table 1 of the main paper; instead, we compare fairly by training our CLNet on the same benchmark as used in NGRANSAC. Specifically, one sequence from YFCC100M [34] and one sequence from SUN3D [37] are selected for joint training. Five sequences from YFCC100M are employed for testing the outdoor scenarios, and sixteen sequences from SUN3D are used for testing the indoor scenarios. We adopt the same experimental settings as [4]. NGRANSAC uses PointCN [23] to guide the model hypothesis search, which utilizes the strengths of both PointCN and RANSAC. For a fair comparison, we also employ RANSAC [9] as a post-processing step for our CLNet. The results are shown in Table 6. Compared with NGRANSAC, our CLNet achieves significantly superior results on the outdoor sequences, i.e., at least a 5% improvement in terms of mAP. Better performance on the indoor sequences is also observed.
A.2. Analysis on k for Local Graphs

Figure 9. The effect of different numbers (k) of neighbors in local graphs. (·, ·) denotes the k values in the local graphs of the first and the second pruning blocks, respectively; the tested combinations are (6, 6), (9, 6), (9, 9), and (12, 9). Results without RANSAC [9] on YFCC100M [34] are reported in terms of mAP5.

We adopt local consensus learning in the proposed pruning blocks, where local graphs are established by k-nearest neighbor search. Specifically, two pruning blocks are utilized sequentially in our framework. We select k = 9 in the first block and k = 6 in the second block. Note that we suggest using a smaller k in the second pruning block, since only a pruned subset of the correspondences is consumed in this block. To verify the effectiveness of this design, we conduct experiments with different combinations of k values. As illustrated in Fig. 9, our CLNet achieves the optimal performance with the combination (9, 6). The results with other combinations still outperform state-of-the-art methods [32, 38] (please refer to Table 1 of the main paper for the results of the competitors).

A.3. Visualizations
Visual results on YFCC100M and SUN3D are shown in this section. We illustrate the correspondences selected by state-of-the-art methods [23, 32] and our CLNet in Fig. 10 and Fig. 11. We use red and green lines to distinguish the false positives (outliers) and true positives (inliers) identified by the ground truth ("GT"). Note that the ground-truth labels are generated by the epipolar constraint [11] and are inevitably noisy, i.e., outliers might be falsely classified as inliers. For our CLNet, we show the matches selected from the denoised candidates, since the parametric model is estimated from the selected results. The results with RANSAC as a post-processing step are also illustrated. Compared with the competitors, the matches preserved by our CLNet are much more consistent, with fewer outliers, on both the YFCC100M and SUN3D datasets.

In order to discuss the limitations of our method, we show some failure cases of CLNet. As illustrated in Fig. 12, outliers dominate the denoised candidates. One can observe that the ground-truth labels are remarkably unreliable in these cases, where the keypoints of many inliers are actually located in visually inconsistent regions. Such unreliable ground-truth labels may mislead the consensus learning in our method. We attempt to mitigate this effect by introducing an adaptive temperature in the binary cross-entropy loss (see Eq. (9) and (11) of the main paper), which has been proven effective in the ablation studies (Table 5 of the main paper). We expect to further improve the robustness of our method to noisy ground-truth labels in future work.