Exploring Classic and Neural Lexical Translation Models for Information Retrieval: Interpretability, Effectiveness, and Efficiency Benefits
Leonid Boytsov (Bosch Center for Artificial Intelligence, [email protected]) and Zico Kolter (Carnegie Mellon University, [email protected])
Abstract.
We study the utility of the lexical translation model (IBM Model 1) for English text retrieval, in particular, its neural variants that are trained end-to-end. We use the neural Model 1 as an aggregator layer applied to context-free or contextualized query/document embeddings. This new approach to designing a neural ranking system has benefits for effectiveness, efficiency, and interpretability. Specifically, we show that adding an interpretable neural Model 1 layer on top of BERT-based contextualized embeddings (1) does not decrease accuracy and/or efficiency; and (2) may overcome the limitation on the maximum sequence length of existing BERT models. The context-free neural Model 1 is less effective than a BERT-based ranking model, but it can run efficiently on a CPU (without expensive index-time precomputation or query-time operations on large tensors). Using Model 1 we produced the best neural and non-neural runs on the MS MARCO document ranking leaderboard in late 2020.
Introduction.

A typical text retrieval system relies on simple term-matching techniques to generate an initial list of candidates, which can be further re-ranked using a learned model [10,13]. Thus, retrieval performance is adversely affected by a mismatch between query and document terms, which is known as the vocabulary gap problem [18,74]. Two decades ago, Berger and Lafferty [4] proposed to reduce the vocabulary gap and, thus, to improve retrieval effectiveness with the help of a lexical translation model called IBM Model 1 (henceforth, simply Model 1). Model 1 has strong performance when applied to finding answers in English question-answer (QA) archives using questions as queries [35,57,65,71], as well as to cross-lingual retrieval [73,38]. Yet, little is known about its effectiveness on realistic monolingual English queries, partly because training Model 1 requires large query sets, which previously were not publicly available.
Research Question 1.
In the past, Model 1 was trained on question-document pairs of similar lengths, which simplifies the task of finding useful associations between query terms and terms in relevant documents. It is not clear if Model 1 can be successfully trained if queries are substantially, e.g., two orders of magnitude, shorter than corresponding relevant documents.
Research Question 2.
Furthermore, Model 1 was trained in a translation task using an expectation-maximization (EM) algorithm [16,9] that produces a sparse matrix of conditional translation probabilities, i.e., a non-parametric model. Can we do better by parameterizing conditional translation probabilities with a neural network and learning the model end-to-end in a ranking—rather than a translation—task?

To answer these research questions we experiment with lexical translation models on two recent MS MARCO collections, which have hundreds of thousands of real user queries [49,12]. Specifically, we consider a novel class of ranking models where an interpretable neural Model 1 layer aggregates the output of a token-embedding neural network. The resulting composite network (including token embeddings) is learned end-to-end using a ranking objective. We consider two scenarios: context-independent token embeddings [11,22] and contextualized token embeddings generated by BERT [17]. Note that our approach is generic and can be applied to other embedding networks as well.

The neural Model 1 layer produces pairwise similarities T(q|d) for all query and document BERT word pieces, which are combined via a straightforward product-of-sums formula without any learned weights:

$$P(Q|D) = \prod_{q \in Q} \sum_{d \in D} T(q|d)\, P(d|D), \quad (1)$$

where P(d|D) is a maximum-likelihood estimate of the occurrence of d in D. Indeed, a query-document score is a product of scores for individual query word pieces, which makes it easy to pinpoint word pieces with the largest contributions. Likewise, for every query word piece we can easily identify document word pieces with the highest contributions to its score. This makes our model more interpretable compared to prior work (see the scoring sketch after the list of contributions below).

Our contributions can be summarized as follows:
1. Adding an interpretable neural Model 1 layer on top of BERT entails virtually no loss in accuracy and efficiency compared to the vanilla BERT ranker, which is not readily interpretable.
2. In fact, for long documents the BERT-based Model 1 may outperform baseline models applied to truncated documents, thus overcoming the limitation on the maximum sequence length of existing pretrained Transformer [67] models. However, the evidence was somewhat inconclusive, and we found it was also not conclusive for previously proposed CEDR [44] models that too incorporate an aggregator layer (though a non-interpretable one).
3. A fusion of the non-parametric Model 1 with BM25 scores can outperform the baseline models, though the gain is modest.
4. The context-free neural Model 1 can be sparsified and executed efficiently on a CPU, many times faster than a BERT-based ranker on a GPU. We can, thus, improve the first retrieval stage without expensive index-time precomputation approaches.
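To make the aggregation concrete, the following is a minimal sketch (our illustration, not the authors' code) of the product-of-sums scoring in Eq. (1), computed in log space for numerical stability. The translation matrix T is assumed to come from a (neural) Model 1 layer; with the inner sum taken over all document positions, P(d|D) reduces to the uniform weight 1/|D|:

```python
import torch

def eq1_log_score(T: torch.Tensor) -> torch.Tensor:
    """T[i, j] = T(q_i | d_j) for query tokens q_i and document tokens d_j.
    Returns log P(Q|D) = sum_q log( sum_d T(q|d) / |D| )."""
    num_doc_tokens = T.shape[1]
    per_query_tok = T.sum(dim=1) / num_doc_tokens  # sum_d T(q|d) P(d|D)
    return torch.log(per_query_tok + 1e-10).sum()  # log prod_q P(q|D)

# Example: 3 query tokens, 5 document tokens, random similarities in [0, 1].
T = torch.rand(3, 5)
print(eq1_log_score(T))
```

Because the score factorizes over query tokens, the per-token summands are exactly the contributions one inspects for interpretability.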
Translation Models for Text Retrieval.

This line of work begins with an influential paper by Berger and Lafferty [4], who first applied Model 1 to text retrieval. It was later proved to be useful for finding answers in monolingual QA archives [35,57,65,71], as well as for cross-lingual document retrieval [73,38].
Model 1 is a non-parametric lexical translation model that learns context-independent translation probabilities of lexemes (or tokens) from a set of paired documents called a parallel corpus or bitext. The learning method is a variant of the expectation-maximization (EM) algorithm [16,9].

A generic approach to improving the performance of non-parametric statistical learning models consists in parameterizing respective probabilities using neural networks. Early successful implementations of this idea in language processing were the hybrid HMM-DNN/RNN systems for speech recognition [5,26]. More concretely, our proposal to use the neural Model 1 as a last network layer was inspired by the LSTM-CRF [32] and CEDR [44] architectures.

There is prior history of applying the neural Model 1 to retrieval, however, without training the model on a ranking task. Zuccon et al. [75] computed translation probabilities using the cosine similarity between word embeddings (normalized over the sum of similarities for the top-k closest words). They achieved modest 3-7% gains on four small-scale TREC collections. Ganguly et al. [19] used a nearly identical approach (on similar TREC collections) and reported slightly better (6-12%) gains. Neither Zuccon et al. [75] nor Ganguly et al. [19] attempted to learn translation probabilities from a large set of real user queries.

Zbib et al. [73] employed a context-dependent lexical neural translation model for cross-lingual retrieval. They first learn context-dependent translation probabilities from a bilingual parallel corpus in a lexical translation task. Given a document, the highest translation probabilities together with respective tokens are precomputed in advance and stored in the index. Zbib et al. [73] trained their model on aligned sentences of similar lengths. In the case of monolingual retrieval, however, we do not have such fine-grained training data, as queries are paired only with much longer relevant documents. To our knowledge, there is no reliable way to obtain sentence-level relevance labels from this data.
Neural Ranking Models have been a popular topic in recent years [24], but the success of early approaches—which predate BERT—was controversial [40]. This changed with the adoption of large pretrained models [55], especially after the introduction of the Transformer model [67] and the release of BERT [17]. Nogueira and Cho were the first to apply BERT to ranking of text documents [50]. In the TREC 2019 deep learning track [12], as well as on the MS MARCO leaderboard [1], BERT-based models outperformed all other approaches by a large margin.

The Transformer model [67] uses an attention mechanism [3] where each sequence position can attend to all the positions in the previous layer. Because self-attention complexity is quadratic with respect to the sequence length, Transformer models (BERT included) support only limited-length inputs. A number of proposals—see Tay et al. [66] for a survey—aim to mitigate this constraint, which is complementary to our work.

To process longer documents with existing pretrained models, one has to split documents into several chunks, process each chunk separately, and aggregate the results, e.g., by computing a maximum or a weighted prediction score [72,15]. Such models cannot be trained end-to-end on full documents. Furthermore, the training procedure has to assume that each chunk in a relevant document is relevant as well, which is not quite accurate. To improve upon simple aggregation approaches, MacAvaney et al. [44] combined the output of several document chunks using three simpler models: KNRM [70], PACRR [33], and DRMM [23]. The more recent PARADE architecture uses even simpler aggregation approaches [39]. However, none of the mentioned aggregator models is interpretable, and we propose to replace them with our neural Model 1 layer.
Interpretability and Explainability of statistical models has become a busy area of research. However, a vast majority of approaches rely on training a separate explanation model or exploiting saliency/attention maps [41,59]. This is problematic, because explanations provided by extraneous models cannot be verified and, thus, trusted [59]. Moreover, saliency/attention maps reveal which data parts are being processed by a model, but not how the model processes them [62,34,59]. Instead of producing unreliable post hoc explanations, Rudin [59] advocates for networks whose computation is transparent by design. If full transparency is not feasible, there is still a benefit to last-layer interpretability.

In text retrieval, we know of only two implementations of this idea. Hofstätter et al. [29] use a kernel-based formula by Xiong et al. [70] to compute soft-match counts over contextualized embeddings. Because each pair of query-document tokens produces several soft-match values corresponding to different thresholds, it is problematic to aggregate these values in an explainable way. Though this approach does offer insights into model decisions, the aggregation formula is a relatively complicated two-layer neural network with a non-linear (logarithm) activation function after the first layer [29]. ColBERT in the re-ranking mode can be seen as an interpretable interaction layer; however, unlike the neural Model 1, its use entails a 3% degradation in accuracy [37].
Efficiency.
It is possible to speed up ranking by deferring some computation to index time. Such approaches can be divided into two groups. First, it is possible to precompute separate query and document representations, which can be quickly combined at query time in a non-linear fashion [37,20]. This method entails little to no performance degradation. Second, one can generate (or enhance) independent query and document representations and compare them via an inner-product computation. Such representations—either dense or sparse—were shown to improve first-stage retrieval, albeit at the cost of expensive index processing and some loss in effectiveness. In particular, Khattab et al. [36] show that dense representations are inferior to the vanilla BERT ranker [52] in a QA task.
In the case of sparse representations, one can rely on Transformer [67] models to generate importance weights for document or query terms [14], augment documents with most likely query terms [51,52], or use a combination of these methods [43]. Due to the sparsity of data generated by term expansion and re-weighting models, it can be stored in a traditional inverted file to improve performance of the first retrieval stage. However, these models are less effective than the vanilla BERT ranker [52] and they require costly index-time processing.
Token Embeddings and Transformers.
We assume that an input text is split into small chunks of text called tokens. A token can be a complete English word, a word piece, or a lexeme (a lemma). The length of a document d—denoted as |d|—is measured in the number of tokens. Because neural networks cannot operate directly on text, a sequence of tokens t_1 t_2 ... t_n is first converted to a sequence of d-dimensional embedding vectors w_1 w_2 ... w_n by an embedding network. Initially, embedding networks were context independent, i.e., each token was always mapped to the same vector [22,11,46]. Peters et al. [55] demonstrated the superiority of contextualized, i.e., context-dependent, embeddings produced by a multi-layer bi-directional LSTM [61,27,21] pretrained on a large corpus in a self-supervised manner. These were later outstripped by large pretrained Transformers [17,56].

In our work we use two types of embeddings: vanilla context-free embeddings (see [22] for an excellent introduction) and BERT-based contextualized embeddings [17]. Due to space constraints, we do not discuss the BERT architecture in detail (see [60,17] instead). It is crucial, however, to know the following:
– contextualized token embeddings are vectors of the last-layer hidden state;
– BERT operates on word pieces [69] rather than complete words;
– the vocabulary has close to 30K tokens and includes two special tokens: [CLS] (an aggregator) and [SEP] (a separator);
– [CLS] is always prepended to every token sequence, and its embedding is used as a sequence representation for classification and ranking tasks.

The "vanilla" BERT ranker uses a single fully-connected layer as a prediction head, which converts the [CLS] vector into a scalar. It makes a prediction based on the following sequence of tokens: [CLS] q [SEP] d [SEP], where q is a query and d = t_1 t_2 ... t_n is a document. Long documents and queries need to be truncated so that the overall number of tokens does not exceed 512. To overcome this limitation, MacAvaney et al. [44] proposed an approach that:
– splits a longer document d into m chunks: d = d_1 d_2 ... d_m;
– generates m token sequences [CLS] q [SEP] d_i [SEP];
– processes each sequence with BERT to generate contextualized embeddings for regular tokens as well as for [CLS].

The outcome of this procedure is m [CLS]-vectors cls_i and n contextualized vectors w_1 w_2 ... w_n: one for each document token t_i (see the sketch below). MacAvaney et al. [44] explore several approaches to combining these contextualized vectors. First, they extend the vanilla BERT ranker by making a prediction on the averaged [CLS] vector: $\frac{1}{m}\sum_{i=1}^{m} \mathrm{cls}_i$. Second, they use contextualized embeddings as a direct replacement of context-free embeddings in the following neural architectures: KNRM [70], PACRR [33], and DRMM [23]. Third, they introduced the CEDR architecture, where the [CLS] embedding is additionally incorporated into KNRM, PACRR, and DRMM in a model-specific way, which further boosts performance.
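The following is a minimal sketch of this chunked encoding (assuming the HuggingFace transformers library and the standard bert-base-uncased checkpoint; not the authors' exact code): each chunk is paired with the query, and BERT returns a [CLS] vector plus contextualized token vectors per chunk:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_chunks(query: str, doc_tokens: list, chunk_len: int = 400):
    """Split a long document into chunks and contextualize each one with
    the query, collecting per-chunk [CLS] and token embeddings."""
    cls_vecs, tok_vecs = [], []
    for i in range(0, len(doc_tokens), chunk_len):
        chunk = " ".join(doc_tokens[i:i + chunk_len])
        enc = tokenizer(query, chunk, truncation=True, max_length=512,
                        return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]  # (seq_len, 768)
        cls_vecs.append(hidden[0])   # [CLS] embedding of this chunk
        tok_vecs.append(hidden[1:])  # query + chunk token embeddings
    return cls_vecs, tok_vecs
```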
Non-parametric Model 1.

Let P(D|Q) denote the probability that a document D is relevant to the query Q. Using the Bayes rule, P(D|Q) is conveniently rewritten as $P(D|Q) \propto P(Q|D)\, P(D)$. Assuming a uniform prior for the document occurrence probability P(D), one concludes that the relevance probability is proportional to P(Q|D). Berger and Lafferty proposed to estimate this probability with a term-independent and context-free model known as Model 1 [4].

Let T(q|d) be the probability that a query token q is a translation of a document token d, and let P(d|D) be the probability that a token d is "generated" by a document D. Then, the probability that query Q is a translation of document D can be computed as a product of individual query term likelihoods:

$$P(Q|D) = \prod_{q \in Q} P(q|D), \qquad P(q|D) = \sum_{d \in D} T(q|d)\, P(d|D). \quad (2)$$

The summation in Eq. 2 is carried out over unique document tokens. The in-document term probability P(d|D) is a maximum-likelihood estimate. Making the non-parametric Model 1 effective requires quite a few tricks. First, P(q|D)—the likelihood of a query term q—is linearly combined with the collection probability P(q|C) using a parameter λ [71,65]:

$$P(q|D) = (1 - \lambda) \left[ \sum_{d \in D} T(q|d)\, P(d|D) \right] + \lambda P(q|C). \quad (3)$$

Here, P(q|C) is a maximum-likelihood estimate; for an out-of-vocabulary term q, P(q|C) is set to a small positive number. A sketch of this smoothed scoring formula is given after the following list. We take several additional measures to improve Model 1 effectiveness:
– we propose to create a parallel corpus by splitting documents and passages into small contiguous chunks whose length is comparable to query lengths;
– T(q|d) are learned from a symmetrized corpus, as proposed by Jeon et al. [35];
– we discard all translation probabilities T(q|d) below a small, empirically found threshold and keep only a limited number of the most frequent tokens;
– we set self-translation probabilities T(t|t) to an empirically found positive value and rescale T(t′|t) so that $\sum_{t'} T(t'|t) = 1$, as in [35,65].
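A minimal sketch (the data layout, the λ value, and the out-of-vocabulary constant are illustrative assumptions) of computing the smoothed query log-likelihood of Eq. (3) from a sparse translation table:

```python
import math
from collections import Counter

def smoothed_model1_log_score(query_toks, doc_toks, trans, coll_prob,
                              lam=0.1, oov_prob=1e-9):
    """trans[q][d] = T(q|d); coll_prob[q] = P(q|C).
    Returns log P(Q|D) under Eq. (3)."""
    doc_tf = Counter(doc_toks)          # term frequencies of unique tokens
    doc_len = len(doc_toks)
    score = 0.0
    for q in query_toks:
        # Sum over *unique* document tokens d: T(q|d) * P(d|D).
        p_q_d = sum(trans.get(q, {}).get(d, 0.0) * tf / doc_len
                    for d, tf in doc_tf.items())
        p_q_c = coll_prob.get(q, oov_prob)  # small constant for OOV terms
        score += math.log((1.0 - lam) * p_q_d + lam * p_q_c)
    return score
```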
Our Neural Model 1.

Let us rewrite Eq. 2 so that the inner summation is carried out over all document tokens rather than over the set of unique ones. This is particularly relevant for contextualized embeddings, where embeddings of identical tokens are not guaranteed to be the same (and typically they are not):

$$P(Q|D) = \prod_{q \in Q} \sum_{i=1}^{|D|} \frac{T(q|d_i)}{|D|}. \quad (4)$$

We further propose to compute T(q|d) in Eq. 4 with a simple and efficient neural network. The network "consumes" context-free or contextualized embeddings of tokens q and d and produces a value in the range [0, 1]. As in the non-parametric case, we set self-translation probabilities T(t|t) = p_self and multiply all other probabilities by 1 − p_self. However, it was not practical to rescale conditional probabilities to ensure that $\forall t: \sum_{t'} T(t'|t) = 1$. Thus, T(·|·) is a similarity function, but not a true probability distribution. Note that—unlike CEDR [44]—we do not use the embedding of the [CLS] token.

We explored several approaches to the neural parametrization of T(q|d). Let embed_q(t) and embed_d(t) denote embeddings of query and document tokens, respectively. One of the simplest approaches is to learn separate embedding networks for queries and documents and use the scaled cosine similarity: $T(q|d) = 0.5 \cdot \{\cos(\mathrm{embed}_q(q), \mathrm{embed}_d(d)) + 1\}$. However, this neural network is not sufficiently expressive, and the resulting context-free Model 1 is inferior to the non-parametric Model 1 learned via EM. We then found that a key performance ingredient was the concatenation of embeddings with their Hadamard product, which we think helps the following layers discover better interaction features. We pass this combination through one or more fully-connected linear layers with ReLUs [25] followed by a sigmoid:

$$T(q|d) = \sigma(F_3(\mathrm{relu}(F_2(\mathrm{relu}(F_1([x_q, x_d, x_q \circ x_d]))))))$$
$$x_q = P_q(\tanh(\text{layer-norm}(\mathrm{embed}_q(q))))$$
$$x_d = P_d(\tanh(\text{layer-norm}(\mathrm{embed}_d(d))))$$

where P_q, P_d, and F_i are fully-connected linear layers; [x, y] is vector concatenation; layer-norm is layer normalization [2]; x ∘ y is the Hadamard product.
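A minimal PyTorch sketch of this parametrization (layer dimensions, the shared layer norm, and the omission of the self-translation adjustment are our own simplifications):

```python
import torch
import torch.nn as nn

class NeuralModel1Head(nn.Module):
    def __init__(self, emb_dim=768, proj_dim=256, hidden_dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(emb_dim)
        self.proj_q = nn.Linear(emb_dim, proj_dim)   # P_q
        self.proj_d = nn.Linear(emb_dim, proj_dim)   # P_d
        self.f1 = nn.Linear(3 * proj_dim, hidden_dim)
        self.f2 = nn.Linear(hidden_dim, hidden_dim)
        self.f3 = nn.Linear(hidden_dim, 1)

    def forward(self, q_emb, d_emb):
        # q_emb: (|Q|, emb_dim), d_emb: (|D|, emb_dim)
        x_q = self.proj_q(torch.tanh(self.norm(q_emb)))
        x_d = self.proj_d(torch.tanh(self.norm(d_emb)))
        # All pairwise combinations: (|Q|, |D|, proj_dim)
        xq = x_q.unsqueeze(1).expand(-1, x_d.size(0), -1)
        xd = x_d.unsqueeze(0).expand(x_q.size(0), -1, -1)
        h = torch.cat([xq, xd, xq * xd], dim=-1)  # concat + Hadamard product
        h = torch.relu(self.f1(h))
        h = torch.relu(self.f2(h))
        return torch.sigmoid(self.f3(h)).squeeze(-1)  # T(q|d) in [0, 1]
```

The returned |Q| × |D| matrix plugs directly into the product-of-sums score of Eq. (4).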
Neural Model 1 Sparsification/Export to Non-Parametric Format.

We can precompute T(t′|t) for all pairs of vocabulary tokens, discard small values (below a threshold), and store the result as a sparse matrix. This format permits extremely efficient execution on a CPU (see the results below).
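A minimal sketch of this export (batch size and threshold are illustrative; `head` is a trained pairwise similarity network such as the one above, and the vocabulary embeddings are the context-free embedding tables):

```python
import torch
from scipy.sparse import csr_matrix, vstack

@torch.no_grad()
def export_sparse(head, vocab_emb_q, vocab_emb_d, threshold=1e-3, batch=128):
    """Precompute T(q|d) for all vocabulary pairs, drop small values,
    and return a sparse V x V translation table for fast CPU scoring."""
    blocks = []
    for start in range(0, vocab_emb_q.size(0), batch):
        probs = head(vocab_emb_q[start:start + batch], vocab_emb_d)
        probs[probs < threshold] = 0.0            # sparsify
        blocks.append(csr_matrix(probs.cpu().numpy()))
    return vstack(blocks).tocsr()
```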
Data sets.

We experiment with MS MARCO collections, which include data for passage and document retrieval tasks [49,12]. Each MS MARCO collection has a large number of real user queries (see Table 1). To our knowledge, there are no other collections comparable to MS MARCO in this respect. The large set of queries is sampled from the log file of the search engine Bing. In that, the data set creators ensured that all queries can be answered using a short text snippet. These queries are only sparsely judged (about one relevant passage per query). Sparse judgments are binary: relevant documents have grade one and all other documents have grade zero.
Table 1. MS MARCO data set details (documents and passages; the table body is not recoverable in this copy).
In addition to large query sets with sparse judgments, we use two evaluation sets from the TREC 2019/2020 deep learning tracks [12]. These query sets are quite small, but they have been thoroughly judged by NIST assessors, separately for a document and a passage retrieval task. TREC NIST judgements range from zero (not relevant) to three (perfectly relevant).

We randomly split the publicly available training and validation sets into the following subsets: a small training set to train a linear fusion model (train/fusion), a large set to train neural models and the non-parametric Model 1 (train/modeling), a development set (development), and a test set (MS MARCO test) containing at most 3K queries. Detailed data set statistics are summarized in Table 1. Note that the training subsets were obtained from the original training set, whereas the new development and test sets were obtained from the original development set. The leaderboard validation set is not publicly available.

We processed collections using Spacy 2.2.3 [30] to extract tokens (text words) and lemmas (lexemes) from text. Frequently occurring words and lemmas were filtered out using Indri's list of stopwords [64], which was expanded to include a few contractions such as "n't" and "'ll". Lemmas were indexed using Lucene 7.6. We also generated sub-word tokens, namely BERT word pieces [69,17], using the HuggingFace Transformers library (version 0.6.2) [68]. We did not apply the stopword list to BERT word pieces.
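A minimal sketch of these three tokenizations (model names are standard defaults, and the stopword set is a stand-in for the Indri list, not the authors' exact configuration):

```python
import spacy
from transformers import BertTokenizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
STOPWORDS = {"the", "a", "of", "n't", "'ll"}  # stand-in for the Indri list

def tokenize(text: str):
    doc = nlp(text)
    words = [t.text.lower() for t in doc if not t.is_space]
    lemmas = [t.lemma_.lower() for t in doc
              if t.lemma_.lower() not in STOPWORDS and not t.is_space]
    bwps = bert_tok.tokenize(text)  # no stopword filtering for word pieces
    return words, lemmas, bwps

print(tokenize("Exploring translation models for retrieval isn't hard."))
```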
Basic Setup.
We experimented on a Linux server equipped with a six-core (12 threads) i7-6800K 3.4 GHz CPU, 125 GB of memory, and four GeForce GTX 1080 Ti GPUs. We used the text retrieval framework FlexNeuART [8], which is implemented in Java. It employs Lucene 7.6 with a BM25 scorer [58] to generate an initial list of candidates, which can be further re-ranked using either traditional or neural re-rankers. The traditional re-rankers, including the non-parametric Model 1, are implemented in Java as well. They run in a multi-threaded mode (12 threads) and fully utilize the CPU. The neural rankers are implemented using PyTorch 1.4 [54] and Apache Thrift (https://thrift.apache.org/). A neural ranker operates as a standalone single-threaded server. Our software is available online (https://github.com/oaqa/FlexNeuART).

Ranking speed is measured as the overall CPU/GPU throughput—rather than latency—per one thousand documents/passages. Ranking accuracy is measured using the standard utility trec_eval provided by TREC organizers (https://github.com/usnistgov/trec_eval). Statistical significance is computed using a two-sided t-test with a threshold of 0.05. All ranking models are applied to the candidate list generated by a tuned BM25 scorer [58]. BERT-based models re-rank the 100 entries with highest BM25 scores: using a larger pool of candidates hurts both efficiency and accuracy. All other models, including the neural context-free Model 1, re-rank 1000 entries: further increasing the number of candidates does not improve accuracy.

Training Models.
Neural models are trained using a pairwise margin loss with the reduction type sum. Training pairs are obtained by combining known relevant documents with 20 negative examples selected from a set of top-500 candidates returned by Lucene. In each epoch, we randomly sample one positive and one negative example per query. BERT-based models first undergo target-corpus pretraining [31] using a masked language modeling and next-sentence prediction objective [17]. Then, we train them for one epoch in a ranking task. We use batch size 16 simulated via gradient accumulation. The context-free Model 1 is trained from scratch for 32 epochs using batch size 32. The non-parametric Model 1 is trained for five epochs with MGIZA [53] (https://github.com/moses-smt/mgiza/): further increasing the number of epochs does not substantially improve results. MGIZA computes probabilities of spurious insertions (i.e., a translation from an empty word), but we discard them, as in prior work [65].

We use a small weight decay and a warm-up schedule where the learning rate grows linearly from zero for 10-20% of the steps until it reaches the base learning rate [48,63]. The optimizer is AdamW [42]. For BERT-based models we use a larger base rate for the fully-connected prediction head than for the main Transformer layers. For the context-free Model 1, the base rate is decayed by 0.9 after each epoch. The learning rate schedule is the same for all parameters (see the sketch below).

The trained neural Model 1 is "exported" to a non-parametric format by precomputing all pairwise translation probabilities and discarding small probabilities. This sparsification/export procedure takes three minutes, and the exported model is executed using the same Java code as the non-parametric Model 1. Each neural model and the sparsified Model 1 is trained and evaluated for five seeds. To this end, we compute the value for each query and seed and average query-specific values (over five seeds). All hyper-parameters are tuned on the development set.

Because context-free Model 1 rankers are not strong on their own, we evaluate them in a fusion mode. First, Model 1 is trained on train/modeling. Then, we linearly combine the model score with the BM25 score [58]. Optimal weights are computed on the train/fusion subset using the coordinate ascent algorithm [45] from RankLib (https://sourceforge.net/p/lemur/wiki/RankLib/). To improve the effectiveness of this linear fusion, we use Model 1 log-scores normalized by the number of query words. In turn, BM25 scores are normalized by the sum of query-term IDF values (see [58] for a description of BM25 and IDF). As one of the baselines, we use a fusion of BM25 scores for different tokenization approaches (basically a multi-field BM25); fusion weights are obtained via RankLib on train/fusion.
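A minimal sketch of this training setup (the margin, the weight-decay value, and the parameter-naming convention are illustrative assumptions): a pairwise margin loss with sum reduction, and AdamW with separate base learning rates for the prediction head and the main Transformer layers:

```python
import torch
import torch.nn as nn

loss_fn = nn.MarginRankingLoss(margin=1.0, reduction="sum")

def pairwise_loss(pos_scores, neg_scores):
    # Relevant documents should outscore sampled negatives by the margin.
    target = torch.ones_like(pos_scores)
    return loss_fn(pos_scores, neg_scores, target)

def make_optimizer(model, head_lr, bert_lr, weight_decay=1e-7):
    # "head" in parameter names is a hypothetical naming convention.
    head_params = [p for n, p in model.named_parameters() if "head" in n]
    bert_params = [p for n, p in model.named_parameters() if "head" not in n]
    return torch.optim.AdamW(
        [{"params": head_params, "lr": head_lr},
         {"params": bert_params, "lr": bert_lr}],
        weight_decay=weight_decay)
```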
Table 2. Evaluation results on the document ranking task: bwps denotes BERT word pieces, lemm denotes text lemmas, and word denotes original words. NN-Model1 and NN-Model1-exp are the context-free neural Model 1 models: they use only bwps. NN-Model1 runs on a GPU whereas NN-Model1-exp runs on a CPU. Ranking speed is throughput and not latency! Statistical significance is denoted by ⋆ and •; the tested hypotheses are explained in the main text. Values lost from this copy are marked with –; the passage-retrieval half of the table is omitted for the same reason.

                            MS MARCO test  TREC 2019  TREC 2020  rank. speed
                            MRR            NDCG@10    NDCG@10    per 1K
baselines
BM25 (lemm)                 0.270          0.544      0.524      0.8 ms
BM25 (lemm)+BM25 (word)     0.274          0.544      0.523      2.5 ms
BM25 (lemm)+BM25 (bwps)     0.283          0.528      0.537      2.2 ms
BERT-vanilla (short)        0.387          0.655      0.623      39 sec
BERT-vanilla (full)         0.376          –          –          – sec
BERT-CEDR-KNRM              0.387          0.665      0.649⋆     – sec
BERT-CEDR-DRMM              0.377          –          –          – sec
BERT-CEDR-PACRR             –              –          –          – sec
our methods
BM25 (lemm)+Model1 (word)   0.283⋆         –          –          – ms
BM25 (lemm)+Model1 (bwps)   0.284          0.557      0.525      33 ms
BM25 (lemm)+NN-Model1-exp   0.307⋆         –          –          – ms
BM25 (lemm)+NN-Model1       0.311⋆         –          –          – sec
BERT-Model1 (short)         0.384          0.657      0.631      36 sec
BERT-Model1 (full)          0.391⋆         –          –          – sec

Model Overview.
We compare several models (see Table 2). First, we use BM25 scores [58] computed for the lemmatized text, henceforth, BM25 (lemm). Second, we evaluate several variants of the context-free Model 1. The non-parametric Model 1 was trained for both original words and BERT word pieces: the respective models are denoted as Model1 (word) and Model1 (bwps). The neural context-free Model 1—denoted as NN-Model1—was used only with BERT word pieces. This model was also sparsified and exported to a non-parametric format (see the export procedure above): the exported variant is denoted as NN-Model1-exp. Note that context-free Model 1 rankers are not strong on their own; thus, we evaluate them in a fusion mode by combining their scores with BM25 (lemm).

Crucially, all context-free models incorporate an exact term-matching signal, via either the self-translation probability or explicit smoothing with a word collection probability (see Eq. 3). Thus, these models should be compared not only with BM25, but also with a fusion model incorporating BM25 scores for original words or BERT word pieces. We denote these baselines as BM25 (lemm)+BM25 (word) and BM25 (lemm)+BM25 (bwps), respectively.

As we describe in § 3, our contextualized Model 1 applies the neural Model 1 layer to the contextualized embeddings produced by BERT. We denote this model as BERT-Model1. Due to the limitation of existing pretrained Transformer models, long documents need to be split into chunks, each of which is processed, i.e., contextualized, separately. This is done in the BERT-Model1 (full), BERT-vanilla (full), and BERT-CEDR [44] models. These models operate on (mostly) complete documents: for efficiency reasons, we nevertheless use only the first 1431 tokens (three BERT chunks). Another approach is to make predictions on much shorter (one BERT chunk) fragments [15]. This is done in BERT-Model1 (short) and BERT-vanilla (short). In the passage retrieval task, all passages are short and no truncation or chunking is needed. Note that we use a base, i.e., a 12-layer, Transformer [67] model, since it is more practical than the 24-layer BERT-large and performs on par with BERT-large on MS MARCO data [29].

We tested several hypotheses using a two-sided t-test (see the sketch below):
– BM25 (lemm)+Model1 (word) is the same as BM25 (lemm)+BM25 (word);
– BM25 (lemm)+Model1 (bwps) is the same as BM25 (lemm)+BM25 (bwps);
– BERT-Model1 (full) is the same as BERT-vanilla (short);
– for each BERT-CEDR model, it is the same as BERT-vanilla (short);
– BERT-vanilla (full) is the same as BERT-vanilla (short);
– BERT-Model1 (full) is the same as BERT-Model1 (short).
The main purpose of these tests is to assess whether special aggregation layers (including the neural Model 1) can be more accurate compared to models that run on truncated documents. In Table 2, statistical significance is indicated by a special symbol: the last two hypotheses use •; all other hypotheses use ⋆.
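A minimal sketch (with made-up per-query values) of such a paired two-sided t-test at the 0.05 threshold:

```python
from scipy.stats import ttest_rel

# Hypothetical per-query MRR values for two rankers on the same query set.
model_a = [0.50, 1.00, 0.33, 0.25, 1.00, 0.00, 0.50, 1.00]
model_b = [1.00, 1.00, 0.50, 0.33, 1.00, 0.25, 0.50, 1.00]

t_stat, p_value = ttest_rel(model_a, model_b)  # paired (same queries)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, "
      f"significant = {p_value < 0.05}")
```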
Discussion of Results.

The results are summarized in Table 2. First note that there is less consistency in results on TREC 2019/2020 sets compared to MS MARCO test sets. In that, some statistically significant differences (on MS MARCO test) "disappear" on TREC 2019/2020. TREC 2019/2020 query sets are quite small, and it is more likely (compared to MS MARCO test) to obtain spurious results. Furthermore, the fusion model BM25 (lemm)+Model1 (bwps) is either worse than the baseline model BM25 (lemm)+BM25 (bwps) or the difference is not significant. BM25 (lemm)+Model1 (word) is mostly better than the respective baseline, but the gain is quite small. In contrast, the fusion of the neural Model 1 with BM25 scores for BERT word pieces is more accurate on all the query sets. On the MS MARCO test sets it is 15-17% better than BM25 (lemm). These differences are significant on both MS MARCO test sets, as well as on TREC 2019/2020 test sets for the passage retrieval task. Sparsification of the neural Model 1 leads only to a small (0.6-1.3%) loss in accuracy. In that, the sparsified model—executed on a CPU—is orders of magnitude faster than BERT-based rankers, which run on a GPU, with an even larger speed-up for passage retrieval. In contrast, on a GPU, the fastest neural model KNRM is only 500 times faster than vanilla BERT [28] (also for passage retrieval). For large candidate sets, the computation of Model 1 scores can be further sped up, so BM25 (lemm)+NN-Model1-exp can be useful at the candidate generation stage.

We also compared the BERT-based neural Model 1 with BERT-CEDR and BERT-vanilla models on the MS MARCO test set for the document retrieval task. By comparing BERT-vanilla (short), BERT-Model1 (short), and BERT-Model1 (full) we can see that the neural Model 1 layer entails virtually no efficiency or accuracy loss. In fact, BERT-Model1 (full) is 1.8% and 1% better than BERT-Model1 (short) and BERT-vanilla (short), respectively. Yet, only the former difference is statistically significant.

Furthermore, the same holds for BERT-CEDR-PACRR, which was shown to outperform BERT-vanilla by MacAvaney et al. [44]. In our experiments it is 1% better than BERT-vanilla (short), but the difference is neither substantial nor statistically significant. This does not invalidate the results of MacAvaney et al. [44]: they compared BERT-CEDR-PACRR only with BERT-vanilla (full), which makes predictions on the averaged [CLS] embeddings. However, in our experiments, this model is noticeably worse (by 4.2%) than BERT-vanilla (short), and the difference is statistically significant. We think that obtaining more conclusive evidence about the effectiveness of aggregation layers requires a different data set where relevance is harder to predict from a truncated document.
Leaderboard Submissions.
We combined BERT-Model1 with a strong first-stage pipeline, which uses Lucene to index documents expanded with doc2query [51,52] and re-ranks them using a mix of traditional and NN-Model1-exp scores (our exported neural Model 1). This first-stage pipeline is about as effective as the Conformer-Kernel model [47]. The combination model achieved the top place on a well-known leaderboard in November and December 2020. Furthermore, using the non-parametric Model 1, we produced the best traditional run in December 2020, which outperformed several neural baselines [7].
Conclusion.

We study a neural Model 1 combined with a context-free or contextualized embedding network and show that such a combination has benefits for efficiency, effectiveness, and interpretability. To our knowledge, the context-free neural Model 1 is the only neural model that can be sparsified to run efficiently on a CPU (orders of magnitude faster than BERT on a GPU) without expensive index-time precomputation or query-time operations on large tensors. We hope that the effectiveness of this approach can be further improved, e.g., by designing a better parametrization of conditional translation probabilities.

References
1. MS MARCO leaderboard. https://microsoft.github.io/msmarco/
2. Ba, L.J., Kiros, J.R., Hinton, G.E.: Layer normalization. CoRR abs/1607.06450 (2016)
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015 (2015)
4. Berger, A., Lafferty, J.: Information retrieval as statistical translation. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 222–229 (1999)
5. Bourlard, H.A., Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach, vol. 247. Springer Science & Business Media (1994)
6. Boytsov, L.: Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching. Ph.D. thesis, Carnegie Mellon University (2018)
7. Boytsov, L.: Traditional IR rivals neural models on the MS MARCO document ranking leaderboard (2020)
8. Boytsov, L., Nyberg, E.: Flexible retrieval with NMSLIB and FlexNeuART. In: Proceedings of the Second Workshop for NLP Open Source Software (NLP-OSS). pp. 32–43 (2020)
9. Brown, P.F., Pietra, S.D., Pietra, V.J.D., Mercer, R.L.: The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics (2), 263–311 (1993)
10. Büttcher, S., Clarke, C.L., Cormack, G.V.: Information Retrieval: Implementing and Evaluating Search Engines. MIT Press (2016)
11. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res., 2493–2537 (2011)
12. Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.M.: Overview of the TREC 2019 deep learning track. CoRR abs/2003.07820 (2020)
13. Croft, W.B., Metzler, D., Strohman, T.: Search Engines: Information Retrieval in Practice, vol. 520. Addison-Wesley Reading (2010)
14. Dai, Z., Callan, J.: Context-aware sentence/passage term importance estimation for first stage retrieval. CoRR abs/1910.10687 (2019)
15. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: SIGIR. pp. 985–988. ACM (2019)
16. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) (1), 1–22 (1977)
17. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT. pp. 4171–4186 (2019)
18. Furnas, G.W., Landauer, T.K., Gomez, L.M., Dumais, S.T.: The vocabulary problem in human-system communication. Commun. ACM (11), 964–971 (1987)
19. Ganguly, D., Roy, D., Mitra, M., Jones, G.J.: Word embedding based generalized language model for information retrieval. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 795–798 (2015)
20. Gao, L., Dai, Z., Callan, J.: EARL: Speedup transformer-based rankers with pre-computed representation. CoRR abs/2004.13313 (2020)
21. Gers, F.A., Schmidhuber, J., Cummins, F.A.: Learning to forget: Continual prediction with LSTM. Neural Comput. (10), 2451–2471 (2000)
22. Goldberg, Y.: A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 345–420 (2016)
23. Guo, J., Fan, Y., Ai, Q., Croft, W.B.: A deep relevance matching model for ad-hoc retrieval. In: CIKM. pp. 55–64. ACM (2016)
24. Guo, J., Fan, Y., Pang, L., Yang, L., Ai, Q., Zamani, H., Wu, C., Croft, W.B., Cheng, X.: A deep look into neural ranking models for information retrieval. Information Processing & Management p. 102067 (2019)
25. Hahnloser, R.H.R.: On the piecewise analysis of networks of linear threshold neurons. Neural Networks (4), 691–697 (1998)
26. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine (6), 82–97 (2012)
27. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. (8), 1735–1780 (1997)
28. Hofstätter, S., Hanbury, A.: Let's measure run time! Extending the IR replicability infrastructure to include performance aspects. In: OSIRRC@SIGIR. CEUR Workshop Proceedings, vol. 2409, pp. 12–16. CEUR-WS.org (2019)
29. Hofstätter, S., Zlabinger, M., Hanbury, A.: Interpretable & time-budget-constrained contextualization for re-ranking. In: ECAI. Frontiers in Artificial Intelligence and Applications, vol. 325, pp. 513–520. IOS Press (2020)
30. Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear (2017)
31. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: ACL (1). pp. 328–339. Association for Computational Linguistics (2018)
32. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991 (2015)
33. Hui, K., Yates, A., Berberich, K., de Melo, G.: Co-PACRR: A context-aware neural IR model for ad-hoc retrieval. In: WSDM. pp. 279–287. ACM (2018)
34. Jain, S., Wallace, B.C.: Attention is not explanation. In: NAACL-HLT (1). pp. 3543–3556. Association for Computational Linguistics (2019)
35. Jeon, J., Croft, W.B., Lee, J.H.: Finding similar questions in large question and answer archives. In: CIKM. pp. 84–90. ACM (2005)
36. Khattab, O., Potts, C., Zaharia, M.: Relevance-guided supervision for OpenQA with ColBERT. CoRR abs/2007.00814 (2020)
37. Khattab, O., Zaharia, M.: ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In: SIGIR. pp. 39–48. ACM (2020)
38. Lavrenko, V., Choquette, M., Croft, W.B.: Cross-lingual relevance models. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 175–182 (2002)
39. Li, C., Yates, A., MacAvaney, S., He, B., Sun, Y.: PARADE: Passage representation aggregation for document reranking. CoRR abs/2008.09093 (2020)
40. Lin, J.: The neural hype and comparisons against weak baselines. In: ACM SIGIR Forum. vol. 52, pp. 40–51. ACM New York, NY, USA (2019)
41. Lipton, Z.C.: The mythos of model interpretability. Commun. ACM (10), 36–43 (2018)
42. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
43. MacAvaney, S., Nardini, F.M., Perego, R., Tonellotto, N., Goharian, N., Frieder, O.: Expansion via prediction of importance with contextualization. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1573–1576. ACM (2020)
44. MacAvaney, S., Yates, A., Cohan, A., Goharian, N.: CEDR: Contextualized embeddings for document ranking. In: SIGIR. pp. 1101–1104. ACM (2019)
45. Metzler, D., Croft, W.B.: Linear feature-based models for information retrieval. Inf. Retr. (3), 257–274 (2007). https://doi.org/10.1007/s10791-006-9019-z
46. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS. pp. 3111–3119 (2013)
47. Mitra, B., Hofstätter, S., Zamani, H., Craswell, N.: Conformer-kernel with query term independence for document retrieval. CoRR abs/2007.10434 (2020)
48. Mosbach, M., Andriushchenko, M., Klakow, D.: On the stability of fine-tuning BERT: Misconceptions, explanations, and strong baselines. CoRR abs/2006.04884 (2020)
49. Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., Deng, L.: MS MARCO: A human generated MAchine Reading COmprehension dataset (November 2016)
50. Nogueira, R., Cho, K.: Passage re-ranking with BERT. CoRR abs/1901.04085 (2019)
51. Nogueira, R., Lin, J.: From doc2query to docTTTTTquery. MS MARCO passage retrieval task publication (2019)
52. Nogueira, R., Yang, W., Lin, J., Cho, K.: Document expansion by query prediction. CoRR abs/1904.08375 (2019)
53. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics (1), 19–51 (2003)
54. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8026–8037 (2019)
55. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of NAACL-HLT. pp. 2227–2237 (2018)
56. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding with unsupervised learning. Technical report, OpenAI (2018)
57. Riezler, S., Vasserman, A., Tsochantaridis, I., Mittal, V.O., Liu, Y.: Statistical machine translation for query expansion in answer retrieval. In: ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (2007)
58. Robertson, S.: Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation (5), 503–520 (2004)
59. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence (5), 206–215 (2019)
60. Rush, A.M.: The annotated transformer. In: Proceedings of the Workshop for NLP Open Source Software (NLP-OSS). pp. 52–60 (2018)
61. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. (11), 2673–2681 (1997)
62. Serrano, S., Smith, N.A.: Is attention interpretable? In: ACL (1). pp. 2931–2951. Association for Computational Linguistics (2019)
63. Smith, L.N.: Cyclical learning rates for training neural networks. In: WACV. pp. 464–472. IEEE Computer Society (2017)
64. Strohman, T., Metzler, D., Turtle, H., Croft, W.B.: Indri: A language-model based search engine for complex queries. http://ciir.cs.umass.edu/pubfiles/ir-407.pdf [Last checked Apr 2017] (2005)
65. Surdeanu, M., Ciaramita, M., Zaragoza, H.: Learning to rank answers to non-factoid questions from web collections. Computational Linguistics (2), 351–383 (2011)
66. Tay, Y., Dehghani, M., Bahri, D., Metzler, D.: Efficient transformers: A survey. CoRR abs/2009.06732 (2020)
67. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
68. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv abs/1910.03771 (2019)
69. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, L., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016)