Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

Lee Xiong∗, Chenyan Xiong∗, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk
Microsoft Corporation
{lexion, chenyan.xiong, yeli1, kwokfung.tang, jialliu, paul.n.bennett, jahmed, arnold.overwijk}@microsoft.com

∗ Lee and Chenyan contributed equally. Preprint. Under review.
Abstract
Conducting text retrieval in a dense learned representation space has many intriguing advantages over sparse retrieval. Yet the effectiveness of dense retrieval (DR) often requires combination with sparse retrieval. In this paper, we identify that the main bottleneck is in the training mechanisms, where the negative instances used in training are not representative of the irrelevant documents seen in testing. This paper presents Approximate nearest neighbor Negative Contrastive Estimation (ANCE), a training mechanism that constructs negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is updated in parallel with the learning process to select more realistic negative training instances. This fundamentally resolves the discrepancy between the data distributions used in the training and testing of DR. In our experiments, ANCE boosts the BERT-Siamese DR model to outperform all competitive dense and sparse retrieval baselines. Using a simple dot product in the ANCE-learned representation space, it nearly matches the accuracy of the sparse-retrieval-and-BERT-reranking pipeline while providing an almost 100x speed-up.
1 Introduction

Many language systems rely on text retrieval as their first step to find relevant information. For example, search ranking [1], open domain question answering [2], and fact verification [3, 4] all first retrieve relevant documents as the input to their later-stage reranking, machine reading, and reasoning models. All these later-stage models enjoy the advancements of deep learning techniques [5, 6], while, in contrast, the first-stage retrieval still mainly relies on matching discrete bag-of-words [1, 2, 3, 7]. Due to intrinsic challenges such as vocabulary mismatch [8], sparse retrieval inevitably introduces noisy information and often becomes the bottleneck of many systems [3, 9].

Dense Retrieval (DR) using learned distributed representations is a promising direction to overcome this sparse retrieval bottleneck [9, 10, 11, 12, 13, 14, 15]: the representation space is fully learnable and can leverage the strength of pretraining, while the retrieval operation is sufficiently efficient thanks to the recent progress in Approximate Nearest Neighbor (ANN) search [16]. With these intriguing properties, one would expect dense retrieval to revolutionize the first-stage retrieval, as deep learning has done in almost all language tasks. However, this is not yet the case: recent studies found that dense retrieval often underperforms BM25, especially on documents [9, 10]. The effectiveness of DR is more often observed when it is combined with sparse retrieval, instead of replacing it [12, 13].

In this paper, we identify that the underwhelming performance of dense retrieval resides in its learning mechanisms, as there exists a severe mismatch between the negatives used to train DR representations and those seen in testing. An example t-SNE [17] representation used in DR is shown in Fig. 1.

Figure 1: Representations of query, relevant documents, actual dense retrieval negatives (DR Neg), and the negatives used in different training.

As expected, the negatives dense retrieval models need to handle in testing (DR Neg) are quite close to the relevant documents. However, the negatives used to train DR models, sampled from sparse retrieval (BM25 Neg) or randomly from the corpus (Rand Neg), are rather separated from both the relevant documents and the actual negatives in testing. Training with those negatives may never guide the model to learn a proper representation space that separates relevant documents from the actual negatives in dense retrieval.

We fundamentally eliminate this discrepancy by developing Approximate nearest neighbor Negative Contrastive Estimation (ANCE), which constructs more realistic training negatives for dense retrieval exactly as how DR is performed. During training, we maintain an ANN index of document encodings, from the same representation model being optimized for DR, which we update in parallel and asynchronously refresh as the learning goes on. The top dense-retrieved documents from the ANN index are used as negatives for each training query; they are retrieved by the same function, in the same representation space, and thus belong to the same distribution as the irrelevant documents to discriminate during testing.

In TREC Deep Learning Track's text retrieval benchmarks [18], ANCE significantly boosts the accuracy of dense retrieval. With ANCE training, BERT-Siamese, the DR architecture used in multiple parallel research efforts [9, 12, 13], significantly outperforms all sparse retrieval baselines.
Impressively, a simple dot product in the ANCE-learned representation space is nearly as effective as the sparse retrieval and BERT reranking cascade pipeline while being 100 times more efficient.

Our analyses further confirm that the negatives from sparse retrieval or other sampling methods differ drastically from the actual negatives in DR, and that ANCE fundamentally resolves this mismatch. We also show the influence of the asynchronous ANN refreshing on learning convergence and demonstrate that the efficiency bottleneck of ANCE training is in the encoding update, not in the ANN search. These observations demonstrate the advantages, and perhaps also the necessity, of our asynchronous ANCE learning in dense retrieval.

2 Preliminaries

In this section, we discuss the background of sparse retrieval, cascade information retrieval, and dense retrieval.
Sparse Retrieval and Cascade IR:
Given a query q and a corpus C, the text retrieval task is to find a set of documents D = {d_1, ..., d_i, ..., d_n} in C and rank them based on relevance to the query. Because the corpus C is often at the scale of millions or billions of documents, efficient retrieval often requires cascade pipelines. These systems first use an efficient sparse retrieval to zoom in to a small set of candidate documents and then feed them to one or several more sophisticated reranking steps [8]. The sparse retrieval (e.g., BM25) usually performs exact matching between query and document in the bag-of-words space using frequency-based statistics. The reranking step often applies BERT on top of the sparse-retrieved documents, i.e., by concatenating each of them with the query and feeding the pair into a fine-tuned BERT reranker [1, 19].

The quality of the first-stage retrieval defines the upper bound of many language systems: if a relevant document is not retrieved, for example, because of no overlap between the query's and the document's bag-of-words, then its information is never available to later-stage models. Addressing this vocabulary mismatch is a core research topic in IR [8, 20, 21, 22].

Dense Retrieval aims to fundamentally redesign the first-stage text retrieval with representation learning. Instead of retrofitting to sparse retrieval, recent approaches in dense retrieval first learn a distributed representation space of the query and documents, in which the relevance function f(q, d) can be a simple similarity calculation [9, 10, 11, 12, 13, 15, 23]. (Code, trained models, and pre-computed embeddings are available at https://github.com/microsoft/ANCE.)

f(q, d) = BERT-Siamese(q, d)                    (1)
        = Encoder(q) · Encoder(d);               (2)
Encoder(·) = LayerNorm(Linear(BERT(·))).         (3)

The encoder uses a layer-normalized projection on the last layer's "[CLS]" token, and its weights can be shared between q and d [9]. The similarity metric in BERT-Siamese is often as simple as a dot product or cosine similarity. Dense retrieval is then performed using efficient ANN search with the learned encoder:

DR(q, ·) = ANN_{f(q,d)}(q, ·).                   (4)

We use DR(q, ·) to refer to the documents retrieved by dense retrieval for query q, which come from the ANN index with the learned model, ANN_{f(q,d)}.

This leads to several intriguing properties of dense retrieval (a sketch of the scoring function follows this list):
1. Learnability: Compared to bag-of-words, the representation in dense retrieval is fully learned, following the advancement of representation learning.
2. Efficiency: Compared to the costly reranking in cascade pipelines, the document representations in dense retrieval can be pre-computed offline. Only the query needs to be encoded online, and retrieval from the ANN index has many efficient solutions [16].
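To make Eqs. (1)-(3) concrete, here is a minimal PyTorch sketch of BERT-Siamese scoring. It is an illustration under stated assumptions, not the authors' exact implementation: the Hugging Face `roberta-base` checkpoint, the 768-dimensional projection size, and the class name are ours.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer


class BertSiamese(nn.Module):
    """Sketch of Eqs. (1)-(3): Encoder(x) = LayerNorm(Linear(BERT(x)_[CLS]))."""

    def __init__(self, model_name="roberta-base", dim=768):  # dim is assumed
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.proj = nn.Linear(self.bert.config.hidden_size, dim)
        self.norm = nn.LayerNorm(dim)

    def encode(self, inputs):
        # Take the last layer's "[CLS]" (position 0), project, then layer norm.
        cls = self.bert(**inputs).last_hidden_state[:, 0]
        return self.norm(self.proj(cls))

    def forward(self, q_inputs, d_inputs):
        # f(q, d) = Encoder(q) . Encoder(d), Eq. (2); weights shared between q and d.
        return (self.encode(q_inputs) * self.encode(d_inputs)).sum(-1)


tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = BertSiamese().eval()
q = tokenizer(["what is the most popular food in switzerland"], return_tensors="pt")
d = tokenizer(["Swiss cuisine bears witness to many regional influences."],
              return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    print(model(dict(q), dict(d)))  # one relevance score per (q, d) pair
```

Because the document side of this dot product does not depend on the query, document encodings can be computed once offline and only the query encoder runs at serving time.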
Representation Learning for Dense Retrieval:
The effectiveness of DR depends on learning a representation space that aligns a query with its relevant documents d^+ and separates it from irrelevant ones d^-. This is often done using the following learning objective:

l(q, d^+, D^-) = -\log \frac{\exp(f(q, d^+))}{\exp(f(q, d^+)) + \sum_{d^- \in D^-} \exp(f(q, d^-))},   (5)

where we use the negative log likelihood (NLL) loss [9] on the positive and negative documents for each query. Other similar loss functions have also been explored [24]. The positive documents (d^+) are those labeled relevant (D^+) for the query. The construction of negative documents (D^-), however, is not as straightforward. For reranking models, the negatives in both training and inference are the irrelevant documents in their candidate set, for example, the top documents retrieved by BM25:

D^-_{BM25} = BM25(q, ·) \ D^+.   (6)

However, in dense retrieval, the optimal training negatives are different from those in reranking. To address this concern, several recent works enrich the BM25 negatives with random samples from the corpus:

D^-_{hybrid} = D^-_{BM25} ∪ D^-_{rand},   (7)

where D^-_{rand} is sampled from the entire corpus [9] or in batch [15].

Intuitively, the strong negatives close to the relevant documents in an effective dense retrieval representation space should be different from those from sparse retrieval, as the goal of DR is to find documents beyond those retrieved by sparse retrieval. Random sampling from a large corpus is also unlikely to hit those strong negatives, as most documents are not relevant to the query.

3 Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE)

In this section, we present how to principally align the negatives used in DR representation learning and in inference. We first describe a conceptually simple approach, Approximate nearest neighbor Negative Contrastive Estimation (ANCE), which pairs each training query and its relevant document with negatives retrieved from the ANN index, the same way the learned representations are used in DR inference. Then we discuss the challenge of updating negative representations in the ANN index during training and how we address it using asynchronous learning.
Figure 2: ANCE Asynchronous Training. The Trainer learns the representation using negatives from the ANN index, while the Inferencer uses a recent checkpoint to update the representation of documents in the corpus and, once finished, refreshes the ANN index with the most up-to-date encodings.
ANCE:
We use the standard dense retrieval model and loss function described in the last section:

f(q, d) = BERT-Siamese(q, d), same as Eq. (1);   (8)
l(q, d^+, D^-) = NLL(q, d^+, D^-), same as Eq. (5).   (9)

The only difference is the negatives used in training:

D^- = D^-_{ANCE} = ANN_{f(q,d)}(q, ·) \ D^+,   (10)

which are the top documents retrieved from the ANN index using the learned representation model f(), exactly the same as in inference with the learned DR model. This eliminates the gap between the learning and the application of the representation space.
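A minimal sketch of Eqs. (9) and (10), assuming the document encodings are already stored in a Faiss inner-product index (as in our implementation); the helper names and the uniform sampling from the top candidates are our illustration:

```python
import numpy as np
import torch
import torch.nn.functional as F


def ance_negatives(index, q_emb, positive_ids, top_k=200, n_neg=1, rng=np.random):
    """Eq. (10): D^-_ANCE = top ANN-retrieved documents for q, minus D^+."""
    _, ids = index.search(q_emb.reshape(1, -1).astype(np.float32), top_k)
    candidates = [int(i) for i in ids[0] if i not in positive_ids]
    return rng.choice(candidates, size=n_neg, replace=False).tolist()


def nll_loss(q, d_pos, d_negs):
    """Eq. (5)/(9): negative log likelihood of d^+ against the sampled negatives."""
    pos = (q * d_pos).sum().view(1)        # f(q, d^+)
    neg = d_negs @ q                       # f(q, d^-) for each negative
    scores = torch.cat([pos, neg]).unsqueeze(0)
    target = torch.zeros(1, dtype=torch.long)   # index 0 is the positive
    return F.cross_entropy(scores, target)      # = -log softmax(scores)[0]
```

Here `F.cross_entropy` with the positive at index 0 computes exactly the softmax-normalized NLL of Eq. (5).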
Asynchronous Training: Since the training is almost always stochastic, the encoder in f is updated at each training batch. To update the representations used to construct ANCE negatives (D^-_{ANCE}), the following two steps are needed:
1. Inference: refresh the representations of all documents in the corpus with the new encoder;
2. Index: rebuild the ANN index using the updated representations.

Although rebuilding the ANN index is efficiently implemented in recent libraries [16], Inference is costly as it re-encodes the entire corpus. Doing so after every training batch is unrealistic in stochastic settings, where the corpus is at a much bigger scale than the training batch size.

To overcome this, we propose asynchronous ANCE training, which refreshes the ANN index used to construct D^-_{ANCE} only at each checkpoint k, where a checkpoint covers m training batches (i.e., the negatives are D^-_{f_k}). As illustrated in Fig. 2, besides the Trainer job, we also maintain a parallel Inferencer job, which
1. takes the latest checkpoint of the representation model, e.g., f_k at the (m · k)-th training step;
2. parallelly infers the encodings of the entire corpus using f_k, while the Trainer keeps optimizing with D^-_{f_{k-1}} from the index ANN_{f_{k-1}} of the last checkpoint;
3. reconstructs the ANN index (ANN_{f_k}) once the parallel inference finishes, and connects it with the Trainer to provide the more up-to-date D^-_{f_k}.

In this parallel process, the ANCE negatives (D^-_{ANCE}) are asynchronously updated to "catch up" with the stochastic training as soon as the Inferencer refreshes the ANN index. The asynchronous lag between the training and the negative construction depends on the allocation of computing resources between the Trainer and the Inferencer: one can refresh the ANN index after every back-propagation (m = 1) to get synchronous ANCE negatives, never refresh it (m = ∞) to save compute, or choose anywhere in between. In our experiments, we analyze this efficiency-effectiveness trade-off and its influence on training stability and retrieval accuracy. A toy sketch of this loop follows.
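Below is a self-contained toy simulation of the asynchronous pattern (sizes, helper names, and the stand-in encoder are ours). The real system runs the Inferencer as a parallel job on separate GPUs; here the refresh is inlined sequentially to keep the sketch short, so only the "train against a possibly one-checkpoint-stale ANN_{f_{k-1}}, refresh every m batches" structure carries over.

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)
dim, n_docs, m = 64, 10_000, 10        # toy sizes; the paper refreshes every ~10k batches

doc_feats = rng.standard_normal((n_docs, dim)).astype(np.float32)


def encode(feats, ckpt):
    """Stand-in for the BERT-Siamese encoder under checkpoint weights f_k."""
    return (feats @ ckpt).astype(np.float32)


def refresh_index(ckpt):
    """Inferencer job: re-encode the whole corpus with f_k, rebuild ANN_{f_k}."""
    index = faiss.IndexFlatIP(dim)     # exact inner-product index, as in the paper
    index.add(encode(doc_feats, ckpt))
    return index


ckpt = np.eye(dim, dtype=np.float32)   # "model" at the initial checkpoint
index = refresh_index(ckpt)

for step in range(50):                 # Trainer loop
    q = rng.standard_normal((1, dim)).astype(np.float32)
    # ANCE negatives come from a possibly one-checkpoint-stale index:
    _, neg_ids = index.search(encode(q, ckpt), 200)
    # ... one optimization step on (q, d^+, sampled negatives) would update ckpt ...
    if (step + 1) % m == 0:            # checkpoint k reached: refresh the index
        ckpt += 0.01 * rng.standard_normal((dim, dim)).astype(np.float32)
        index = refresh_index(ckpt)
```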
4 Experimental Methodologies

This section describes our experimental setups. More details can be found in Appendices A.1 and A.2.
Benchmarks:
Our experiments are mainly conducted on the TREC 2019 Deep Learning (DL) Track benchmark [18]. It includes the most recent, realistic, and standard large-scale text retrieval datasets. The training and dev sets are passage relevance labels for one million Bing queries from MS MARCO [25]. The testing sets are labeled by NIST assessors on the top 10 ranked results from past Track participants [18]. Our experiments follow the official settings of the TREC DL Track and use both the passage and the document task. We mainly evaluate dense retrieval in the retrieval setting but also show the results of DR models as rerankers on the top 100 candidates from BM25. TREC DL official metrics include NDCG@10 on test and MRR@10 on MARCO Passage Dev. MARCO Document Dev is noisy, and the recall on the DL Track testing sets is less meaningful due to low label coverage on DR results (more in Appendices A.1 and A.2).

We also evaluate ANCE on the OpenQA benchmark used in a parallel work (DPR) [15]. It includes five OpenQA tasks: Natural Questions (NQ) [26], TriviaQA [27], WebQuestions (WQ) [28], CuratedTREC [29], and SQuAD [5]. At the time of our experiments, only the pre-processed NQ and TriviaQA data were released (https://github.com/facebookresearch/DPR). Our experiments use the two released tasks and inherit their retriever evaluation, which uses Coverage@20/100: whether the top 20/100 retrieved passages include the answer [15].

Sparse Retrieval Baselines:
By keeping the settings consistent with the TREC DL Track, our methods are directly comparable with all the TREC participating runs. We list the results of the most representative runs in this paper. Detailed descriptions of these runs and the results of many other systems can be found in Appendix A.1 and the Track overview paper [18].
Dense Retrieval Baselines:
As there are no open-source dense retrieval baselines for our document retrieval tasks, we implement all DR baselines and try our best to tune their hyperparameters. All DR baselines use the same BERT-Siamese (base) model as used in various parallel research [9, 12, 13, 15]. The DR baselines only vary in their mechanisms to construct the negative instances: random samples from the entire corpus or in batch (Rand Neg); random samples from the BM25 top 100 (BM25 Neg) [12]; Noise Contrastive Estimation, which uses the highest-scored negatives in batch (NCE Neg) [30]; and the 1:1 combination of BM25 and random negatives (BM25 + Rand Neg) [9, 15].

Participants in TREC DL found the passage training labels cleaner than the post-constructed document labels, and that using them leads to better results on the document task [31]. Recent DR research also finds that including BM25 negatives helps training convergence by providing stronger contrast for the representation learning [9, 15]. In all our experiments on TREC DL, we therefore include the "BM25 Warm Up" setting (BM25 → ∗), in which the representation model is first trained using the MARCO official passage training triples with BM25 negatives.
Our Methods and Implementation Details:
ANCE uses the same BERT-Siamese model and only differs from the DR baselines in the training mechanism. To fit long documents into BERT-Siamese, we use the two settings from Dai et al. [32]: FirstP, where only the first 512 tokens of the document are used, and MaxP, where the document is split into 512-token passages (maximum 4) and the scores on these passages are max-pooled. The max-pooling operation is natively supported by ANN [9], with an overhead of four times more vectors in the index.

Our ANN search uses the Faiss IndexFlatIP index [16]. We implemented the parallel training and ANCE index refreshing upon Faiss and plan to include it in our code release. To reduce the computing cost required to navigate from randomly initialized representations, we first warm up all BERT-Siamese models from the standard RoBERTa (base) and then continue ANCE training on TREC DL using BM25 → ∗. On OpenQA, we start ANCE from the released DPR checkpoints [15].

Our main ANCE setting uses a 1:1 training:index-refreshing GPU allocation, 1:1 positives to negatives with the negative documents sampled from the ANN top 200, index refreshing every 10k training batches, batch size 8, and gradient accumulation step 2 on 4 GPUs. We measured ANCE efficiency in Table 3 using a single 32GB V100 GPU, on an Azure VM with an Intel(R) Xeon(R) Platinum 8168 CPU and 650GB of RAM. More details of our implementation can be found in Appendix A.1 and our upcoming code release.

Table 1: Results on TREC 2019 Deep Learning Track. "Rerank" applies the model on the top 100 BM25 candidates; "Retrieval" retrieves from the full corpus. Unavailable results are marked "n.a."; not-applicable ones are marked "–".

                               MARCO Dev            TREC DL Passage      TREC DL Document
                               Passage Retrieval    NDCG@10              NDCG@10
                               MRR@10   Recall@1k   Rerank    Retrieval  Rerank    Retrieval
Sparse & Cascade IR
BM25                           0.240    0.814       –         0.506      –         0.519
Best DeepCT [22]               0.243    n.a.        –         n.a.       –         0.554
Best TREC Trad Retrieval       0.240    n.a.        –         0.554      –         0.549
Best TREC Trad LeToR           –        –           0.556     –          0.561     –
BERT Reranker [1]              –        –                     –          0.646     –
Dense Retrieval
Rand Neg                       0.261    0.949       0.605     0.552      0.615     0.543
NCE Neg [30]                   0.256    0.943       0.602     0.539      0.618     0.542
BM25 Neg [12]                  0.299    0.928       0.664     0.591      0.626     0.529
BM25 + Rand Neg [15, 9]        0.311    0.952       0.653     0.600      0.629     0.557
BM25 → Rand                    0.280    0.948       0.609     0.576      0.637     0.566
BM25 → NCE Neg                 0.279    0.942       0.608     0.571      0.638     0.564
BM25 → BM25 + Rand             0.306    0.939       0.648     0.591      0.626     0.540
ANCE (FirstP)
Table 2: Retrieval results (Answer Coverage at Top-20/100) on the OpenQA benchmarks collected in DPR [15]. Models are trained using the training split from each Single Task or from Multiple Tasks.

                 Single Task Training                     Multi Task Training
                 Natural Questions    TriviaQA            Natural Questions    TriviaQA
Retriever        Top-20   Top-100     Top-20   Top-100    Top-20   Top-100     Top-20   Top-100
BM25             59.1     73.7        66.9     76.7       –        –           –        –
DPR              78.4     85.4        79.4     85.0       79.4     86.0        78.8     84.7
BM25 + DPR       76.6     83.8        79.8     84.5       78.0     83.9        79.9     84.4
ANCE
5 Evaluation Results

This section first presents the evaluations of ANCE's effectiveness and efficiency, and then studies the influence of asynchronous learning. More evaluations can be found in the Appendix.
The results on the TREC Deep Learning Track benchmarks are presented in Table 1. ANCE empowers dense retrieval to significantly outperform all sparse retrieval baselines on all evaluation metrics. Without using any sparse bag-of-words signals in retrieval, ANCE yields 20%+ relative NDCG gains over BM25 and significantly outperforms DeepCT, which uses BERT to optimize sparse retrieval [22].

Among the learning mechanisms used in DR, the contemporary method that combines BM25 and random negatives [9, 12, 13, 15] outperforms sparse retrieval in passage retrieval. However, as observed in various parallel research [9, 13], the DR models it trains are no better than tuned traditional retrieval (Best TREC Trad Retrieval) on long documents, where the term frequency signals are more robust. ANCE is the only method that elevates the same BERT-Siamese architecture to robustly exceed the sparse methods in document retrieval. It also convincingly surpasses the concurrent DR models in passage retrieval on the OpenQA benchmarks, as shown in Table 2.

When reranking documents, the ANCE-learned BERT-Siamese outperforms the interaction-based BERT Reranker (0.671 NDCG versus 0.646). This overthrows a previously-held belief that it is necessary to capture the interactions between the discrete query and document terms [34, 35]. With ANCE, it is now feasible to learn a representation space that captures the finesse in search relevance.
Solely using first-stage retrieval, ANCE nearly matches the accuracy of the cascade retrieval-and-reranking pipeline (BERT Reranker): with effective representation learning, dot product is all you need.

Table 3: Efficiency of offline (indexing and training) operations and online (query time) operations. Online time is per query with 100 retrieved documents.

Operation                               Offline    Online
Sparse & Cascade IR
  BM25 Index Build                                 –
  BM25 Retrieval                        –
  BERT Rerank                           –          1.15s
  Cascade Total (BM25 + BERT)           –
Dense Retrieval
  Per Document Encoding                 4.5ms      –
  Query Encoding                        –          2.6ms
  ANN Retrieval (batched q)             –          9ms
  Dense Retrieval Total                 –          11.6ms
ANCE Training
  Encoding of the Training Corpus                  –
  ANN Index Build                       10s        –
  ANCE Neg Construction (per batch)     72ms       –
  Back Propagation (per batch)          19ms       –
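The online path in Table 3 is just two steps: encode the query, then one batched ANN lookup over pre-computed document vectors. A toy sketch with random vectors (sizes are ours; absolute timings depend on the environment):

```python
import time
import numpy as np
import faiss

dim, n_docs = 64, 100_000
index = faiss.IndexFlatIP(dim)                              # built offline
index.add(np.random.randn(n_docs, dim).astype(np.float32))  # pre-computed doc encodings

q = np.random.randn(1, dim).astype(np.float32)              # stand-in for Encoder(q)
t0 = time.perf_counter()
_, top100 = index.search(q, 100)                            # the only online ANN work
print(f"ANN lookup: {(time.perf_counter() - t0) * 1e3:.1f} ms for top 100 of {n_docs}")
```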
Figure 3: Overlap of negatives used in training and faced in testing. The x-axis is training steps to convergence (in thousands); the y-axis is the overlap of the negatives used in different training mechanisms (ANCE, BM25, BM25 + Rand) versus those seen in testing with their own dense retrieval.

Figure 4: Training loss and testing NDCG@10 of ANCE (FirstP) on documents, with different ANN index refreshing rates (e.g., per 10k batches), Trainer:Inferencer GPU allocations, and learning rates (e.g., 1e-5). X-axes are training steps in thousands. Panels: (a) 10k Batch; 4:4; 1e-5. (b) 20k Batch; 8:4; 1e-6. (c) 5k Batch; 4:8; 1e-6. (d) 10k Batch; 4:4; 5e-6.
Table 3 measures the efficiency of sparse retrieval and ANN-based dense retrieval, the latter using ANCE (FirstP) on TREC DL Track document retrieval. These numbers may vary in different environments. Impressively, ANCE DR with standard batching takes only 11.6ms per query, an almost 100x speed-up over the BERT reranking cascade. This is a natural advantage of dense retrieval: only the query encoding and the ANN retrieval need to be performed online. Encoding one short query is efficient, while ANN retrieval enjoys the advantages of fast approximate search [16]. The document encoding can be done offline (e.g., at the crawling or indexing phase) and costs only 4.5ms per document. This leads to a remarkable return on investment (ROI) in computing resources and engineering: the 1.42s throughput of BERT Rerank is prohibitive in many production systems and makes distillation or complicated caching necessary, while ANCE is just a dot product.

The quantification of the ANCE training time reveals that the main efficiency bottleneck is the encoding of the training corpus, i.e., refreshing the encodings of the entire corpus with the newly updated representation model. In general, it is not feasible to refresh the representations of the entire corpus to select perfectly up-to-date negatives after each training batch: the corpus is orders of magnitude larger than one training batch, and a forward pass in the neural network is only linearly more efficient than a backward one. We address this efficiency bottleneck with asynchronous Trainer and Inferencer updates; the next experiment studies their influence.
In this experiment, we first demonstrate the main advantage of ANCE in providing realistic training negatives, and then study the influence of the delayed updates in asynchronous learning.

Fig. 3 shows the overlap of the negatives used in training versus those seen in final testing, measured through the learning process on the same set of sampled dev queries. As in Figure 1, which illustrates the ANCE-learned representations for the query "what is the most popular food in Switzerland", there is very low overlap (<20%) between the BM25 or random negatives and the negatives from their corresponding trained DR models. This discrepancy between the training and testing candidate distributions risks optimizing DR models toward undesired local minima.

ANCE eliminates this discrepancy. The non-perfect overlap at the beginning merely reflects that the representation is still being learned; the retrieval of training negatives and of testing documents is equivalent, subject to a small delay from the async Inferencer. By simply aligning the training distribution with testing, ANCE unleashes the power of representation learning in dense retrieval.

Fig. 4 illustrates the behavior of asynchronous learning under different configurations. A large learning rate or a low refreshing rate (Figures 4(a) and 4(b)) leads to fluctuations, as the async gap of the ANN index may drive the representation learning to undesired local optima. Refreshing as often as every 5k batches yields a smooth convergence (Figure 4(c)), but requires twice as many GPUs allocated to the Inferencer. We found that a 1:1 allocation of Trainer and Inferencer GPUs, at an appropriate learning rate, leads to an asynchronous learning process adequate for training effective representations for dense retrieval.

More ablation studies, retrieval results, and case studies are included in the Appendix.
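To make the overlap measurement behind Fig. 3 concrete, here is a minimal sketch (the function name and input format are ours): for each sampled dev query, it computes the fraction of training negatives that reappear among the negatives the trained model retrieves at test time.

```python
def negative_overlap(train_negs, test_negs):
    """Mean per-query overlap between training negatives and testing DR negatives.

    train_negs / test_negs: dicts mapping a query id to the set of document ids
    used as negatives in training / retrieved as negatives by the trained model.
    """
    per_query = [len(train_negs[q] & test_negs[q]) / max(len(train_negs[q]), 1)
                 for q in train_negs]
    return sum(per_query) / len(per_query)


# Toy example: BM25 negatives barely overlap the dense-retrieval negatives.
print(negative_overlap({"q1": {1, 2, 3, 4}}, {"q1": {3, 9, 11, 42}}))  # 0.25
```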
6 Related Work

In neural information retrieval, neural ranking models are categorized into representation-based and interaction-based, depending on whether they represent the query and the document separately or model the interactions between discrete term pairs [36]. BERT Reranker is interaction-based, as its self-attention is applied to all term pairs, while BERT-Siamese is representation-based. Previous research found interaction-based models more effective, as they capture the relevance match between all query-document term pairs [32, 34, 35, 36]. However, the effectiveness of interaction-based models is only available at the reranking stage, as the model needs to go through each query and candidate document pair [23]. Their efficiency also becomes a concern when pretrained models are used [37, 38].

Recently, researchers have revisited representation-based models with BERT for dense retrieval. Progress includes the BERT dual-encoder latent retrieval model [10] and customized pretraining [11], among others. Promising effectiveness has been achieved on OpenQA passage retrieval tasks, where passages are shorter and questions are cleaner [15, 23]. On documents, the effectiveness of dense retrieval was more underwhelming, and DR was considered more as an add-on to sparse retrieval [9, 12, 13].

Constructing stronger training negatives is a rapidly growing topic in representation learning, especially in contrastive learning for visual representations [39], where remarkable progress has been made in the past year, for example, SimCLR [24], MoCo [40], and MoCo v2 [41]. These methods are also rooted in Noise Contrastive Estimation [30, 42], but their technical choices differ from ANCE, as visual representation learning does not have a natural query and sparse retrieval to start with. Technically, maintaining a parallelly-updated ANN index during learning is also used in REALM, but there the index retrieves background information for language model pretraining [43]. Our open-source solution can also be used by the community to conduct REALM-style pretraining.
7 Conclusion

ANCE fundamentally eliminates the discrepancy between the representation learning of texts and their usage in dense retrieval. Our ANCE-trained dense retrieval model, the vanilla BERT-Siamese, convincingly outperforms all dense retrieval and sparse retrieval baselines in our large-scale document retrieval and passage retrieval experiments. It nearly matches the ranking accuracy of the state-of-the-art cascade of sparse retrieval and BERT reranking. More importantly, all these advantages are achieved with a standard transformer encoder at 1% of the online inference latency, using a simple dot product in the ANCE-learned representation space.

Broader Impact
For the past decades, the academic community has joked that every year we make 10% progress upon BM25, but it has always been 10% upon the same BM25; the techniques developed require more and more IR domain knowledge that may be unfamiliar to researchers in related fields. For example, in OpenQA, document retrieval was often done with vanilla BM25 instead of the well-tuned BM25F, query expansion, or SDM. In industry, many places build their search stacks upon open-source solutions, such as Lucene and ElasticSearch, where BM25, a technique invented in the 1970s and 1980s, was incorporated as late as 2015 [44]; the required expertise, complex infrastructure, and computing resources leave many missing out on the benefits of Neu-IR.

With their effectiveness, efficiency, and simplicity, ANCE and dense retrieval have the potential to redefine the next stage of information systems and provide broader impact on many fronts.
Empower Users with Better Information Access:
The effectiveness of DR is particularly prominent for exploratory or knowledge-acquisition information needs. Formulating good queries that have term overlap with the target documents often requires certain domain knowledge, which is a barrier for users trying to learn new information. A medical expert trying to learn how to build a small search functionality on her patients' medical records may not be aware of the terminology "BM25" or "Dense Retrieval". By matching the user's information need and the target information in a learned representation space, ANCE has the potential to overcome this language barrier and empower users to achieve more in their daily interactions with search engines.
Reduce Computing Cost and Energy Consumption in Neural Search Stack:
The nature of dense retrieval makes it straightforward to conduct most of the costly operations offline and reuse the pre-computed document vectors. This leads to 100x better efficiency and will significantly reduce the hardware cost and energy consumption needed to serve deep pretrained models online. We consider this a solid step towards carbon neutrality in the search stack.
Democratize the Benefit of Neural Techniques:
Building, maintaining, and serving a cascade IR pipeline with advanced pretrained models is daunting and may not yield a good ROI for many companies outside the web search business. In comparison, a simple dot-product operation in a mostly pre-computed representation space is much more accessible. Faiss and many other libraries provide easy-to-access solutions for efficient ANN retrieval; our (to be) released pretrained encoders and ANCE open-source solution will fill in the effectiveness part. Together they will democratize the recent revolutions in neural information retrieval to a much broader audience and end users.
References

[1] Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019.
[2] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1870-1879, 2017.
[3] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369-2380, 2018.
[4] James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. The fact extraction and verification (FEVER) shared task. In Proceedings of the 1st Workshop on Fact Extraction and VERification (FEVER), pages 1-9, 2018.
[5] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, 2016.
[6] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[7] Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Song, Paul Bennett, and Saurabh Tiwary. Transformer-XH: Multi-evidence reasoning with extra hop attention. In International Conference on Learning Representations, 2020.
[8] W. Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice, volume 520. Addison-Wesley Reading, 2010.
[9] Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. Sparse, dense, and attentional representations for text retrieval. arXiv preprint arXiv:2005.00181, 2020.
[10] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086-6096, 2019.
[11] Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932, 2020.
[12] Luyu Gao, Zhuyun Dai, Zhen Fan, and Jamie Callan. Complementing lexical retrieval with semantic residual embedding. arXiv preprint arXiv:2004.13969, 2020.
[13] Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. Zero-shot neural retrieval via domain-targeted synthetic query generation. arXiv preprint arXiv:2004.14503, 2020.
[14] Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, and William W. Cohen. Differentiable reasoning over a virtual knowledge base. arXiv preprint arXiv:2002.10640, 2020.
[15] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
[16] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019.
[17] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.
[18] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. Overview of the TREC 2019 deep learning track. In Text REtrieval Conference (TREC), 2020.
[19] Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424, 2019.
[20] Victor Lavrenko and W. Bruce Croft. Relevance-based language models. In ACM SIGIR Forum, volume 51, pages 260-267. ACM, 2017.
[21] Chenyan Xiong and Jamie Callan. Query expansion with Freebase. In Proceedings of the 2015 International Conference on the Theory of Information Retrieval, pages 111-120, 2015.
[22] Zhuyun Dai and Jamie Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687, 2019.
[23] Amin Ahmad, Noah Constant, Yinfei Yang, and Daniel Cer. ReQA: An evaluation for end-to-end answer retrieval models. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 137-146, 2019.
[24] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[25] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
[26] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453-466, 2019.
[27] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1601-1611, 2017.
[28] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533-1544, 2013.
[29] Petr Baudiš and Jan Šedivý. Modeling of the question answering task in the YodaQA system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222-228. Springer, 2015.
[30] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 297-304, 2010.
[31] Ming Yan, Chenliang Li, Chen Wu, Bin Bi, Wei Wang, Jiangnan Xia, and Luo Si. IDST at TREC 2019 deep learning track: Deep cascade ranking with generation-based document expansion and pre-trained language modeling. In Text REtrieval Conference (TREC), 2019.
[32] Zhuyun Dai and Jamie Callan. Deeper text understanding for IR with contextual neural language modeling. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 985-988, 2019.
[33] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978-2988, 2019.
[34] Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 55-64, 2017.
[35] Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Understanding the behaviors of BERT in ranking. arXiv preprint arXiv:1904.07531, 2019.
[36] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 55-64, 2016.
[37] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In International Conference on Learning Representations, 2020.
[38] Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, and Ophir Frieder. Efficient document re-ranking for transformers by precomputing term representations. arXiv preprint arXiv:2004.14255, 2020.
[39] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[40] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[41] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[42] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265-2273, 2013.
[43] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020.
[44] BM25, the next generation of Lucene relevance. https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/, October 2015.
[45] Ellen M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing & Management, 36(5):697-716, 2000.
[46] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query prediction. arXiv preprint arXiv:1904.08375, 2019.
[47] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Appendix
A.1 More Implementation Details
More Details on TREC Deep Learning Benchmarks:
There are two tasks in the Track: document retrieval and passage retrieval. The training and development sets are from MS MARCO, which includes passage-level relevance labels for one million Bing queries [25]. The document corpus was post-constructed by back-filling the body texts of the passages' URLs, and the document labels were inherited from the passages [18].

There is a two-year gap between the construction of the passage training data and the back-filling of the full document content. Some original documents were no longer available, and there was a decent amount of content change in the documents during the two-year gap, so many no longer contain the passages. This back-filling is perhaps the reason why many Track participants found the passage training data more effective than the inherited document labels. Note that the TREC testing labels are not influenced, as the annotators were shown the same document contents when judging.

All the TREC DL runs were trained on these training data. Their inference results on the testing queries of the document and passage retrieval tasks were evaluated by NIST assessors using the standard TREC-style pooling technique [45]. The pooling depth is set to 10, that is, the top 10 ranked results from all participating runs are assessed, and these labels are released as the official TREC DL benchmarks for the passage and document retrieval tasks.
More Details on Baselines:
The most representative sparse retrieval baselines in TREC DL include the standard BM25 ("bm25base" or "bm25base_p"), Best TREC Sparse Retrieval ("bm25tuned_rm3" or "bm25tuned_prf_p") with tuned query expansion [20], and Best DeepCT ("dct_tp_bm25e2", document task only), which uses BERT to estimate term importance for BM25 [22]. These three runs represent the standard sparse retrieval, the best classical sparse retrieval, and the recent progress in using BERT to improve sparse retrieval.

We also include two cascade retrieval-and-reranking systems: Best TREC LeToR ("srchvrs_run1" or "srchvrs_ps_run3"), the best feature-based learning-to-rank in the Track, and BERT Reranker ("bm25exp_marcomb" or "p_exp_rm3_bert"), the best run using standard BERT on top of query/document expansion, from the groups with multiple top MARCO runs [1, 46].
BERT-Siamese Configurations:
We follow the network configurations in Luan et al. [9] in all dense retrieval methods, which we found provides the most stable results. More specifically, we initialize BERT-Siamese with RoBERTa base [47] and add a projection layer on top of the last layer's "[CLS]" token, followed by a layer norm.

Training Details: Training takes about 1-2 hours per ANCE epoch; whenever new ANCE negatives are ready, they immediately replace the existing negatives in training, without waiting. Training converges in about 10 epochs, similar to the other DR baselines. The optimization uses the LAMB optimizer with learning rate 5e-6 for document retrieval and 1e-6 for passage retrieval, and linear warm-up and decay after 5000 steps. More detailed hyperparameter settings can be found in our code release.
A.2 Coverage of TREC 2019 DL Track Labels on Dense Retrieval Results
By the nature of TREC-style pooling evaluation, only the documents ranked in the top 10 by the 2019 TREC participating systems were labeled. As a result, documents not in the pool, and thus not labeled, are all considered irrelevant, even though there may be relevant ones among them. When reusing TREC-style relevance labels, it is very important to track the "hole rate" of the evaluated systems, i.e., the fraction of their top-K ranked results without TREC labels (not in the pool). A larger hole rate shows that the evaluated method differs more from the systems that participated in the Track and contributed to the pool, and thus that the evaluation is less complete. Note that the hole rate does not necessarily reflect the accuracy of a system, only how much it differs from the pooled ones.

In the TREC 2019 Deep Learning Track, all the participating systems were based on sparse retrieval. Dense retrieval methods often differ considerably from sparse retrieval and in general retrieve many new documents. This is confirmed in Table 4: all DR methods have very low overlap with the official BM25 in their top 100 retrieved documents. At most, only 25% of the documents retrieved by DR are also retrieved by BM25. This makes the hole rate quite high and the recall metric not very informative. It also suggests that DR methods might benefit more in this year's TREC 2020 Deep Learning Track if participants contribute DR-based systems.

The MS MARCO ranking labels were not constructed by pooling sparse retrieval results but came from Bing [25], which includes many signals beyond term overlap. This makes the recall metric in MS MARCO more robust, as it reflects how well a single model can recover a complex online system.
Table 4: Hole rates and overlap with BM25 of dense retrieval methods on TREC 2019 DL Track.

                     TREC DL Passage                           TREC DL Document
Method               Recall@1K  Hole@10  Overlap w. BM25      Recall@100  Hole@10  Overlap w. BM25
BM25                 0.685      5.9%     100%                 0.387       0.2%     100%
BM25 Neg             0.569      25.8%    11.9%                0.217       28.1%    17.9%
BM25 + Rand Neg      0.662      20.2%    16.4%                0.240       21.4%    21.0%
ANCE (FirstP)        0.661      14.8%    17.4%                0.266       13.3%    24.4%
ANCE (MaxP)          –          –        –                    0.286       11.9%    24.9%
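A minimal sketch of the hole-rate bookkeeping described above (function name and input format are ours): `ranked` holds a system's ranked document ids per query, and `judged` holds the pooled (labeled) ids per query.

```python
def hole_rate(ranked, judged, k=10):
    """Fraction of a system's top-k results with no TREC label (not in the pool)."""
    holes = total = 0
    for qid, docs in ranked.items():
        top = docs[:k]
        holes += sum(1 for d in top if d not in judged.get(qid, set()))
        total += len(top)
    return holes / total


print(hole_rate({"q1": ["d1", "d7", "d9"]}, {"q1": {"d1", "d2"}}, k=3))  # ~0.67
```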
Table 5: Results of several different hyperparameter configurations. "Top K Neg" lists the top-k ANN-retrieved candidates from which the ANCE negatives are sampled.

                                                           MARCO Dev Passage   TREC DL Document
Hyperparameter   Learning rate  Top K Neg  Refresh (step)  Retrieval MRR@10    Retrieval NDCG@10
Passage ANCE     1e-6           500        10k             0.31                –
                 2e-6           200        10k             0.29                –
                 2e-7           500        20k             0.303               –
                 2e-7           1000       20k             0.302               –
Document ANCE
A.3 Hyperparameter Studies
We show the results of some hyperparameter configurations in Table 5. The cost of training with BERT makes it difficult to conduct a more detailed hyperparameter exploration; often a failed configuration simply diverges in training loss. We barely explored other configurations, given the time-consuming nature of working with pretrained language models. Our DR model architecture is kept consistent with recent parallel work, and the learning configurations in Table 5 are about all the exploration we did. Most of the hyperparameter choices were decided solely on the training loss curve, and otherwise by the loss on the MARCO Dev set. We found the training loss, validation NDCG, and testing performance align well in our (limited) hyperparameter exploration.
A.4 Case Studies
In this section, we show win/loss case studies between ANCE and BM25. Among the 43 TREC 2019 DL Track evaluation queries in the document task, ANCE outperforms BM25 on 29 queries, loses on 13 queries, and ties on the remaining query. Winning examples are shown in Table 6 and losing ones in Table 7; their corresponding ANCE-learned (FirstP) representations are illustrated by t-SNE in Fig. 5 and Fig. 6.

In general, we found that ANCE better captures the semantics of the documents and their relevance to the query. The winning cases show the intrinsic limitations of sparse retrieval. For example, BM25 exactly matches "most popular food" in the query "what is the most popular food in Switzerland", but the document it retrieves is about Mexico; the term "Switzerland" only appears in the related-questions section of the web page.

The losing cases in Table 7 are also quite interesting. Often it is not that DR fails completely and retrieves documents unrelated to the query's information need at all, which was a big concern when we started research in DR. The errors ANCE makes include retrieving documents that are related, just not exactly relevant, to the query, for example, "yoga pose" for "bow in yoga". In other cases, ANCE retrieves wrong documents due to a lack of domain knowledge: the pretrained language model may not know that "active margin" is a geographical terminology, not a financial one (which we did not know ourselves and took some time to figure out when conducting this case study). There are also some cases where the dense-retrieved documents do make sense but were labeled irrelevant due to noise in the labels.

The t-SNE plots in Fig. 5 and Fig. 6 also show many interesting patterns in the learned representation space. The ANCE winning cases often correspond to clear separations of different document groups, while the losing cases are those where the representation space is more mixed, or where there are too few relevant documents, which may cause variance in model performance. There are many different patterns in the ANCE-learned representation space, which we find quite interesting. We include the t-SNE plots for all 43 TREC DL Track queries in our open-source repository (attached in the supplementary material). Future analyses of the learned patterns in the representation space may help provide more insights into dense retrieval.
Table 6: Queries in the TREC 2019 DL Track document ranking task where ANCE performs better than BM25. Snippets are manually extracted; the documents at the first ranking position of each system are shown, with the NDCG@10 of ANCE and BM25 on the corresponding query.

Query (qid 104861): Cost of interior concrete flooring
  ANCE: "Concrete network: Concrete Floor Cost" (D293855), rank 1, TREC label 3 (Very Relevant), NDCG@10 0.86.
    Snippet: "For a concrete floor with a basic finish, you can expect to pay $2 to $12 per square foot..."
  BM25: "Pinterest: Types of Flooring" (D2692315), rank 1, TREC label 0 (Irrelevant), NDCG@10 0.15.
    Snippet: "...Know About Hardwood Flooring And Its Types White Oak Floors Oak Flooring Laminate Flooring In Bathroom..."

Query (qid 833860): What is the most popular food in Switzerland
  ANCE: "Wikipedia: Swiss cuisine" (D1927155), rank 1, TREC label 3 (Very Relevant), NDCG@10 0.90.
    Snippet: "Swiss cuisine bears witness to many regional influences... Switzerland was historically a country of farmers, so traditional Swiss dishes tend not to be..."
  BM25: "Answers.com: Most popular traditional food dishes of Mexico" (D3192888), rank 1, TREC label 0 (Irrelevant), NDCG@10 0.14.
    Snippet: "One of the most popular traditional Mexican deserts is a spongy cake... (in the related questions section) What is the most popular food dish in Switzerland?..."

Query (qid 1106007): Define visceral
  ANCE: "Vocabulary.com: Visceral" (D542828), rank 1, TREC label 3 (Very Relevant), NDCG@10 0.80.
    Snippet: "When something's visceral, you feel it in your guts. A visceral feeling is intuitive — there might not be a rational explanation, but you feel that you know what's best..."
  BM25: "Quizlet.com: A&P EX3 autonomic 9-10" (D830758), rank 1, TREC label 0 (Irrelevant), NDCG@10 0.14.
    Snippet: "Acetylcholine A neurotransmitter liberated by many peripheral nervous system neurons and some central nervous system neurons..."

Figure 5: t-SNE plots for the winning cases in Table 6: (a) 104861: interior flooring cost; (b) 833860: popular Swiss food; (c) 1106007: define visceral.

Table 7: Queries in the TREC 2019 DL Track document ranking task where ANCE performs worse than BM25. Snippets are manually extracted. The documents at the first position where BM25 wins are shown, with the NDCG@10 of ANCE and BM25 on the corresponding query. Typos in the queries are from the real web search queries in TREC.
Query (qid 182539): Example of monotonic function
  ANCE: "Wikipedia: Monotonic function" (D510209), rank 1, TREC label 0 (Irrelevant), NDCG@10 0.25.
    Snippet: "In mathematics, a monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order... For example, if y=g(x) is strictly monotonic on the range [a,b]..."
  BM25: "Explain Extended: Things SQL needs: sargability of monotonic functions" (D175960), rank 1, TREC label 2 (Relevant), NDCG@10 0.61.
    Snippet: "I'm going to write a series of articles about the things SQL needs to work faster and more efficienly..."

Query (qid 1117099): What is a active margin
  ANCE: "Wikipedia: Margin (finance)" (D166625), rank 2, TREC label 0 (Irrelevant), NDCG@10 0.44.
    Snippet: "In finance, margin is collateral that the holder of a financial instrument..."
  BM25: "Yahoo Answer: What is the difference between passive and active continental margins" (D2907204), rank 2, TREC label 3 (Very Relevant), NDCG@10 0.74.
    Snippet: "An active continental margin is found on the leading edge of the continent where..."

Query (qid 1132213): How long to hold bow in yoga
  ANCE: "Yahoo Answer: How long should you hold a yoga pose for" (D3043610), rank 3, TREC label 0 (Irrelevant), NDCG@10 0.66.
    Snippet: "so i've been doing yoga for a few weeks now and already notice that my flexiablity has increased drastically... That depends on the posture itself..."
  BM25: "yogaoutlet.com: How to do bow pose in yoga" (D3378723), rank 3, TREC label 3 (Very Relevant), NDCG@10 0.74.
    Snippet: "Bow Pose is an intermediate yoga backbend that deeply opens the chest and the front of the body... Hold for up to 30 seconds..."

Figure 6: t-SNE plots for the losing cases in Table 7: (a) 182539: monotonic function; (b) 1117099: active margin; (c) 1132213: yoga bow.