Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations
Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira
David R. Cheriton School of Computer Science, University of Waterloo
ABSTRACT
Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. We also describe how our group has built a culture of replicability through shared norms and tools that enable rigorous automated testing.
1 INTRODUCTION

The advent of pretrained transformers has led to many exciting recent developments in information retrieval [15]. In our view, the two most important research directions are transformer-based reranking models and learned dense representations for ranking. Despite many exciting opportunities and rapid research progress, the need for easy-to-use, replicable baselines has remained a constant. In particular, stable first-stage retrieval within a multi-stage ranking architecture has become even more important, as it provides the foundation for increasingly complex modern approaches that leverage hybrid techniques.

We present Pyserini, our Python IR toolkit designed to serve this role: it aims to provide a solid foundation to help researchers pursue work on modern neural approaches to information retrieval. The toolkit is specifically designed to support the complete "research lifecycle" of systems-oriented inquiries aimed at building better ranking models, where "better" can mean more effective, more efficient, or some tradeoff thereof. This typically involves working with one or more standard test collections to design ranking models as part of an end-to-end architecture, iteratively improving components and evaluating the impact of those changes. In this context, our toolkit provides the following key features:

• Pyserini is completely self-contained as a Python package, available via pip install. The package comes with queries, collections, and qrels for standard IR test collections, as well as pre-built indexes and evaluation scripts. In short, batteries are included. Pyserini supports, out of the box, the entire research lifecycle of efforts aimed at improving ranking models.

• Pyserini can be used as a standalone module to generate batch retrieval runs or be integrated as a library into an application designed to support interactive retrieval.

• Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches.

• Pyserini provides access to data structures and system internals to support advanced users. This includes access to postings, document vectors, and raw term statistics that allow our toolkit to support use cases that we had not anticipated.

Pyserini began as the Python interface to Anserini [27, 28], which our group has been developing for several years, with its roots in a community-wide replicability exercise dating back to 2015 [14]. Anserini builds on the open-source Lucene search library and was motivated by the desire to better align academic research with the practice of building real-world search applications; see, for example, Grand et al. [9]. More recently, we recognized that Anserini's reliance on the Java Virtual Machine (due to Lucene) greatly limited its reach [2, 3], as Python has emerged as the language of choice for both data scientists and researchers. This is particularly the case for work on deep learning today, since the major toolkits (PyTorch [22] and TensorFlow [1]) have both adopted Python as their front-end language. Thus, Pyserini aims to be a "feature-complete" Python interface to Anserini.

Sparse retrieval support in Pyserini comes entirely from Lucene (via Anserini).
To support dense and hybrid retrieval, Pyserini integrates Facebook's FAISS library for efficient similarity search over dense vectors [11], which in turn integrates the HNSW library [17] to support low-latency querying. Thus, Pyserini provides a superset of the features in Anserini; dense and hybrid retrieval are entirely missing from the latter.

This paper is organized in the following manner: After a preamble on our design philosophy, we begin with a tour of Pyserini, highlighting its main features and providing the reader with a sense of how it might be used in a number of common scenarios. This is followed by a presentation of empirical results illustrating the use of Pyserini to provide first-stage retrieval in two popular ranking tasks today. Before concluding with future plans, we discuss how our group has internalized replicability as a shared norm through social processes supported by technical infrastructure.
2 DESIGN PHILOSOPHY

The design of Pyserini emphasizes ease of use and replicability. Larry Wall, the creator of the Perl programming language, once remarked that "easy things should be easy, and hard things should be possible." While aspects of the lifecycle for systems-oriented IR research are not difficult per se, there are many details that need to be managed: downloading the right version of a corpus, building indexes with the appropriate settings (tokenization, stopwords, etc.), downloading queries and relevance judgments (deciding between available "variants"), manipulating runs into the correct output format for the evaluation script, selecting the right metrics to obtain meaningful results, etc. The list goes on. (As a concrete example of the "variants" problem, TREC-COVID has at least 12 different sets of qrels. All of them are useful for answering different research questions. Which one do you use?) These myriad details often trip up new researchers who are just learning systems-oriented IR evaluation methodology (motivating work such as Akkalyoncu Yilmaz et al. [2]), and occasionally subtle issues confuse experienced researchers as well. The explicit goal of Pyserini is to make these "easy things" easy, supporting common tasks and reducing the possibility of confusion as much as possible.

At the other end of the spectrum, "hard things should be possible." In our context, this means that Pyserini provides access to data structures and system internals to support researchers who may use our toolkit in ways we had not anticipated. For sparse retrieval, the Lucene search library that underlies Anserini provides interfaces to control various aspects of indexing and retrieval, and Pyserini exposes a subset of features that we anticipate will be useful for IR researchers. These include, for example, traversing postings lists to access raw term statistics, manipulating document vectors to reconstruct term weights, and fine-grained control over document processing (tokenization, stemming, stopword removal, etc.). Pyserini aims to sufficiently expose Lucene internals to make "hard things" possible.

Finally, the most common use case of Pyserini, first-stage retrieval in a multi-stage ranking architecture, means that replicability is of utmost concern, since it is literally the foundation that complex reranking pipelines are built on. In our view, replicability can be divided into technical and social aspects: an example of the former is an internal end-to-end regression framework that automatically validates experimental results. The latter includes a commitment to "eat our own dog food" and the adoption of shared norms. We defer more detailed discussions of replicability to Section 5.
3 TOUR OF PYSERINI

Pyserini is packaged as a Python module available on the Python Package Index. Thus, the toolkit can be installed via pip, as follows:

$ pip install pyserini==0.11.0.0
In this paper, we are explicitly using v0.11.0.0. The code for the toolkit itself is available on GitHub at pyserini.io; for users who may be interested in contributing to Pyserini, we recommend a "development" installation, i.e., cloning the source repository itself. However, for researchers interested only in using Pyserini, the module installed via pip suffices.

In this section, we will mostly use the MS MARCO passage ranking dataset [5] as our running example. The dataset has many features that make it ideal for highlighting various aspects of our toolkit: the corpus, queries, and relevance judgments are all freely downloadable; the corpus is manageable in size and thus experiments require only modest compute resources (and time); the task is popular and thus well-studied by many researchers.

3.1 Interactive Retrieval

1 from pyserini.search import SimpleSearcher
2
3 searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
4 hits = searcher.search('what is a lobster roll?', 10)
5
6 for i in range(0, 10):
7     print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

Figure 1: Simple example of interactive sparse retrieval (i.e., bag-of-words BM25 ranking).
In Figure 1, we begin with a simple example of using Pyserini to perform bag-of-words ranking with BM25 (the default ranking model) on the MS MARCO passage corpus (comprising 8.8M passages). To establish a parallel with "dense retrieval" techniques using learned transformer-based representations (see below), we refer to this as "sparse retrieval", although this is not common parlance in the IR community at present.

The SimpleSearcher class provides a single point of entry for sparse retrieval functionality. In (L3), we initialize the searcher with a pre-built index. For many commonly used collections where there are no data distribution restrictions, we have built indexes that can be directly downloaded from our project servers. For researchers who simply want an "out-of-the-box" keyword retrieval baseline, this provides a simple starting point. Specifically, the researcher does not need to download the collection and build the index from scratch. In this case, the complete index, which includes a copy of all the texts, is a modest 2.6GB.

Using an instance of SimpleSearcher, we issue a query to retrieve the top 10 hits (L4), the results of which are stored in the array hits. Naturally, there are methods to control ranking behavior, such as setting BM25 parameters and enabling the use of pseudo-relevance feedback, but for space considerations these options are not shown in the figure (a brief sketch appears below). In (L6–7), we iterate through the results and print out rank, docid, and score. If desired, the actual text can be fetched from the index (e.g., to feed a downstream reranker).
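To give a flavor of these options, the following is a minimal sketch (not one of the paper's figures). The method names set_bm25, set_rm3, doc, and raw reflect our understanding of the v0.11.0.0 API; readers should consult the documentation for the version they are using.

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')

# Set BM25 parameters explicitly (the values shown are Anserini's defaults).
searcher.set_bm25(0.9, 0.4)

# Enable RM3 pseudo-relevance feedback on top of BM25; note that RM3
# requires an index with the appropriate stored structures.
searcher.set_rm3(fb_terms=10, fb_docs=10, original_query_weight=0.5)

hits = searcher.search('what is a lobster roll?', 10)

# Fetch the raw text of the top hit, e.g., to feed a downstream reranker.
doc = searcher.doc(hits[0].docid)
print(doc.raw())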
Figure 2 shows an example of interactive retrieval using dense learned representations. Here, we are using TCT-ColBERT [16], a model our group has constructed from ColBERT [13] using knowledge distillation. As with sparse retrieval, we provide pre-built indexes that can be directly downloaded from our project servers. The SimpleDenseSearcher class serves as the entry point to nearest-neighbor search functionality that provides top-k retrieval on dense vectors. Here, we are taking advantage of HNSW [17], which has been integrated into FAISS [11] to enable low-latency interactive querying (L6).

1 from pyserini.dsearch import SimpleDenseSearcher, \
2     TCTColBERTQueryEncoder
3
4 encoder = TCTColBERTQueryEncoder('castorini/tct_colbert-msmarco')
5 searcher = SimpleDenseSearcher.from_prebuilt_index(
6     'msmarco-passage-tct_colbert-hnsw', encoder
7 )
8
9 hits = searcher.search('what is a lobster roll')

Figure 2: Simple example of interactive dense retrieval (i.e., approximate nearest-neighbor search on dense learned representations).

The final component needed for dense retrieval is a query encoder that converts user queries into the same representational space as the documents. We initialize the query encoder in (L4), which is passed into the method that constructs the searcher. The encoder itself is a lightweight wrapper around the Transformers library by Hugging Face [25]. Retrieval is performed in the same manner (L9), and we can manipulate the returned hits array in a manner similar to sparse retrieval (Figure 1). At present, we support the TCT-ColBERT model [16] as well as DPR [12]. Note that our goal here is to provide retrieval capabilities based on existing models; quite explicitly, representational learning lies outside the scope of our toolkit (see additional discussion in Section 6).

Of course, the next step is to combine sparse and dense retrieval, which is shown in Figure 3. Our HybridSearcher takes as its constructor arguments the sparse retriever and the dense retriever, and performs weighted interpolation on the individual results to arrive at a final ranking. This is a standard approach and Pyserini adopts the specific implementation in TCT-ColBERT [16], but similar techniques are used elsewhere as well [12].

1 from pyserini.search import SimpleSearcher
2 from pyserini.dsearch import SimpleDenseSearcher, \
3     TCTColBERTQueryEncoder
4 from pyserini.hsearch import HybridSearcher
5
6 ssearcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
7 encoder = TCTColBERTQueryEncoder('castorini/tct_colbert-msmarco')
8 dsearcher = SimpleDenseSearcher.from_prebuilt_index(
9     'msmarco-passage-tct_colbert-hnsw', encoder
10 )
11 hsearcher = HybridSearcher(dsearcher, ssearcher)
12
13 hits = hsearcher.search('what is a lobster roll', 10)

Figure 3: Simple example of interactive search with hybrid sparse–dense retrieval.
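To make the interpolation concrete, the following is an illustrative sketch of weighted sparse–dense fusion over two ranked lists. It mirrors the general idea rather than reproducing HybridSearcher's exact internal implementation, and the weight alpha is an arbitrary placeholder.

# Illustrative sketch of weighted sparse-dense interpolation (ours, not
# the toolkit's internal code); assumes both hit lists are non-empty.
def interpolate(dense_hits, sparse_hits, alpha=0.1, k=10):
    # Map docid -> score for each ranked list.
    dense = {h.docid: h.score for h in dense_hits}
    sparse = {h.docid: h.score for h in sparse_hits}
    fused = {}
    for docid in set(dense) | set(sparse):
        # A document missing from a list contributes that list's minimum score.
        d = dense.get(docid, min(dense.values()))
        s = sparse.get(docid, min(sparse.values()))
        fused[docid] = d + alpha * s
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]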
3.2 Working with Queries and Relevance Judgments

Beyond the corpus, topics (queries) and relevance judgments (qrels) form indispensable components of IR test collections to support systems-oriented research aimed at producing better ranking models. Many topics and relevance judgments are freely available for download, but at disparate locations (in various formats), and often it may not be obvious to a newcomer where to obtain these resources and which exact files to use.

Pyserini tackles this challenge by packaging together these evaluation resources and providing a unified interface for accessing them. Figure 4 shows an example of loading topics via get_topics (L3) and loading qrels via get_qrels (L4) for the standard 6980-query subset of the development set of the MS MARCO passage ranking test collection. We have taken care to name the text descriptors consistently, so the associations between topics and relevance judgments are unambiguous.

1 from pyserini.search import get_topics, get_qrels
2
3 topics = get_topics('msmarco-passage-dev-subset')
4 qrels = get_qrels('msmarco-passage-dev-subset')
5
6 # Compute the average length of queries:
7 sum([len(topics[t]['title'].split()) for t in topics]) / len(topics)
8
9 # Compute the average number of judgments per query:
10 sum([len(qrels[t]) for t in topics]) / len(topics)

Figure 4: Simple example of working with queries and qrels from the MS MARCO passage ranking test collection.
Using Pyserini's provided functions, the topics and qrels are loaded into simple Python data structures and are thus easy to manipulate. A standard TREC topic has different fields (e.g., title, description, narrative), which we model as a Python dictionary. Similarly, qrels are nested dictionaries: query ids mapping to a dictionary of docids to (possibly graded) relevance judgments. Our choice to use Python data structures means that they can be manipulated using standard constructs such as list comprehensions. For example, we can straightforwardly compute the average length of queries (L7) and the average number of relevance judgments per query (L10).
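As a further illustrative sketch of how these data structures compose with the searcher, the snippet below checks which of the top hits for one development query are judged relevant. The str(...) normalization is a defensive assumption, since identifier types may differ across collections.

from pyserini.search import SimpleSearcher, get_topics, get_qrels

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
topics = get_topics('msmarco-passage-dev-subset')
qrels = get_qrels('msmarco-passage-dev-subset')

# Take an arbitrary query from the development subset.
qid = next(iter(topics))
hits = searcher.search(topics[qid]['title'], 10)

# Mark retrieved passages that appear in the qrels with a positive grade.
judged_relevant = {str(d) for d, rel in qrels[qid].items() if rel > 0}
for i, hit in enumerate(hits):
    marker = '*' if str(hit.docid) in judged_relevant else ' '
    print(f'{i+1:2} {hit.docid:10} {marker}')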
3.3 Batch Retrieval and Evaluation

Putting everything discussed above together, it is easy in Pyserini to perform an end-to-end batch retrieval run with queries from a standard test collection. For example, the following command generates a run on the development queries of the MS MARCO passage ranking task (with BM25):

$ python -m pyserini.search --topics msmarco-passage-dev-subset \
    --index msmarco-passage --output run.msmarco-passage.txt \
    --bm25 --msmarco
The option --msmarco specifies the MS MARCO output format; an alternative is the TREC format. We can evaluate the effectiveness of the run with another simple command:

$ python -m pyserini.eval.msmarco_passage_eval \
    msmarco-passage-dev-subset run.msmarco-passage.txt
Pyserini includes a copy of the official evaluation script and provides a lightweight convenience wrapper around it. The toolkit manages qrels internally, so the user simply needs to provide the name of the test collection, without having to worry about downloading, storing, and specifying external files. Otherwise, the usage of the evaluation module is exactly the same as the official evaluation script; in fact, Pyserini simply dispatches to the underlying script after it translates the qrels mapping internally.

The above result corresponds to an Anserini baseline on the MS MARCO passage leaderboard. This is worth emphasizing and nicely illustrates our goal of making Pyserini easy to use: with one simple command, it is possible to replicate a run that serves as a common baseline on a popular leaderboard, providing a springboard to experimenting with different ranking models in a multi-stage architecture. Similar commands provide replication for batch retrieval with dense representations as well as hybrid retrieval.

3.4 Working with Custom Collections
Beyond existing corpora and test collections, a common use case for Pyserini is users who wish to search their own collections. For bag-of-words sparse retrieval, we have built in Anserini (written in Java) custom parsers and ingestion pipelines for common document formats used in IR research, for example, the TREC SGML format used in many newswire collections and the WARC format for web collections. However, exposing the right interfaces and hooks to support custom implementations in Python is awkward. Instead, we have implemented support for a generic and flexible JSON-formatted collection in Anserini, and Pyserini's indexer directly accesses the underlying capabilities in Anserini. Thus, searching custom collections in Pyserini necessitates first writing a simple script to reformat existing documents into our JSON specification, and then invoking the indexer (a sketch appears below). For dense retrieval, support for custom collections is less mature at present, but we provide utility scripts that take an encoder model to convert documents into dense representations, and then build indexes that support querying.

The design of Pyserini makes it easy to use as a standalone module or to integrate as a library in another application. In the first use case, a researcher can replicate a baseline (first-stage retrieval) run with a simple invocation, take the output run file (which is just plain text) to serve as input for downstream reranking, or as part of ensembles [6, 8]. As an alternative, Pyserini can be used as a library that is tightly integrated into another package; see additional discussions in Section 6.
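To make the sparse-retrieval workflow for custom collections concrete, here is a minimal sketch assuming Anserini's generic JSON document format with 'id' and 'contents' fields. The indexing flags follow the project documentation at the time of writing and may differ in later releases.

# Sketch: convert documents into the generic JSON (JSONL) format.
import json
import os

os.makedirs('collection_jsonl', exist_ok=True)
docs = [
    {'id': 'doc1', 'contents': 'The quick brown fox.'},
    {'id': 'doc2', 'contents': 'Lobster rolls are delicious.'},
]
with open('collection_jsonl/docs.jsonl', 'w') as f:
    for doc in docs:
        f.write(json.dumps(doc) + '\n')

Then, invoke the indexer on the reformatted collection:

$ python -m pyserini.index -collection JsonCollection \
    -generator DefaultLuceneDocumentGenerator -threads 1 \
    -input collection_jsonl -index indexes/my-custom-index \
    -storePositions -storeDocvectors -storeRaw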
3.5 Access to System Internals

Beyond simplifying the research lifecycle of working with standard IR test collections, Pyserini provides access to system internals to support use cases that we might not have anticipated. A number of these features for sparse retrieval are illustrated in Figure 5 and available via the IndexReader object, which can be initialized with pre-built indexes in the same way as the searcher classes. (For these examples, we use the Robust04 index because access to many of the features requires positional indexes and stored document vectors. Due to size considerations, this information is not included in the pre-built MS MARCO indexes.)

1 from pyserini.index import IndexReader
2 import itertools
3
4 reader = IndexReader.from_prebuilt_index('robust04')
5
6 # Iterate over the terms in the dictionary:
7 for term in itertools.islice(reader.terms(), 10):
8     print(f'{term.term} (df={term.df}, '
9           f'cf={term.cf})')
10
11 # Analyze a term:
12 term = 'atomic'
13 analyzed = reader.analyze(term)
14 print(f'The analyzed form of "{term}" is "{analyzed[0]}"')
15
16 # Look up collection statistics directly:
17 df, cf = reader.get_term_counts(term)
18 print(f'term "{term}": df={df}, cf={cf}')
19
20 # Traverse a term's postings list:
21 postings_list = reader.get_postings_list(term)
22 for p in postings_list:
23     print(f'docid={p.docid}, tf={p.tf}, pos={p.positions}')
24
25 # Access the forward index:
26 tf = reader.get_document_vector('LA071090-0047')
27 tp = reader.get_term_positions('LA071090-0047')
28 df = {
29     term: (reader.get_term_counts(term, analyzer=None))[0]
30     for term in tf.keys()
31 }
32 bm25_vector = {
33     term: reader.compute_bm25_term_weight(
34         'LA071090-0047', term,
35         analyzer=None)
36     for term in tf.keys()
37 }

Figure 5: Examples of using Pyserini to access system internals such as term statistics and postings lists.

In (L7–9), we illustrate how to iterate over all terms in a corpus (i.e., its dictionary) and access each term's document frequency and collection frequency. Here, we use standard Python tools to select and print out the first 10 terms alphabetically. In the next example, (L12–14), we show how to "analyze" a word (what Lucene calls tokenization, stemming, etc.). For example, the analyzed form of "atomic" is "atom". Since terms in the dictionary (and document vectors, see below) are stored in analyzed form, these methods are necessary to access system internals. Another way to access collection statistics is shown in (L17–18) by direct lookup.

Pyserini also provides raw access to index structures, both the inverted index as well as the forward index (i.e., to fetch document vectors). In (L21–23), we show an example of looking up a term's postings list and traversing its postings, printing out term frequency and term position occurrences. Access to the forward index is shown in (L26–27) based on a docid: in the first case, Pyserini returns a dictionary mapping from terms in the document to their term frequencies; in the second case, Pyserini returns a dictionary mapping from terms to their term positions in the document. From these methods, we can, for example, look up document frequencies for all terms in a document using a dictionary comprehension in Python (L28–31). This might be further manipulated to compute tf–idf scores (a sketch appears below). Finally, the toolkit provides a convenience method for computing BM25 term weights, with which we can reconstruct the BM25-weighted document vector (L32–37).

At present, access to system internals focuses on manipulating sparse representations. Dense retrieval capabilities in Pyserini are less mature. It is not entirely clear what advanced features would be desired by researchers, but we anticipate adding support as the needs and use cases become more clear.
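For example, the following sketch computes one common tf–idf variant from the structures in Figure 5. It assumes that the stats() method reports the document count under the 'documents' key, and the log-based idf formulation is just one of many possible choices.

import math

from pyserini.index import IndexReader

reader = IndexReader.from_prebuilt_index('robust04')

# N: total number of documents in the collection (assumed key name).
N = reader.stats()['documents']

# Term frequencies and document frequencies for one document, as in Figure 5.
tf = reader.get_document_vector('LA071090-0047')
df = {t: (reader.get_term_counts(t, analyzer=None))[0] for t in tf}

# One common tf-idf weighting; the +1 guards against division by zero.
tfidf = {t: tf[t] * math.log(N / (df[t] + 1)) for t in tf}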
4 EXPERIMENTAL RESULTS

Having provided a "tour" of Pyserini and some of the toolkit's features, in this section we present experimental results to quantify its effectiveness for first-stage retrieval. Currently, Pyserini provides support for approximately 30 test collections; here, we focus on two popular leaderboards.

Pyserini provides baselines for two MS MARCO datasets [5]: the passage ranking task (Table 1) and the document ranking task (Table 2). In both cases, we report the official metric (MRR@10 for passage, MRR@100 for document). For the development set, we additionally report recall at rank 1000, which is useful in establishing an upper bound on reranking effectiveness. Note that evaluation results on the test sets are only available via submissions to the leaderboard, and therefore we do not have access to recall figures. Furthermore, since the organizers discourage submissions that are "too similar" (e.g., minor differences in parameter settings) and actively limit the number of submissions to the leaderboard, we follow their guidance and hence do not have test results for all of our experimental conditions.

MS MARCO Passage                              Development          Test
Method                                      MRR@10    R@1k       MRR@10
Pyserini: sparse retrieval
(1a) BM25 (default), original text           0.184    0.853       0.186
Comparisons
(4c) BERT dot product w/ knowledge
     distillation [10]                       0.323    0.957         -
Pyserini: multi-stage pipelines
(4d) monoBERT [20]                           0.372      -         0.365
(4e) Expando-Mono-Duo-T5 [23]                0.420      -         0.408

Table 1: Results on the MS MARCO passage ranking task.

For the passage ranking task, Pyserini supports sparse retrieval, dense retrieval, as well as hybrid dense–sparse retrieval; all results in rows (1) through (3) are replicable with our toolkit. Row (1a) reports the effectiveness of sparse bag-of-words ranking using BM25 with default parameter settings on the original text; row (1b) shows results after tuning the parameters on a subset of the dev queries via simple grid search to maximize recall at rank 1000 (a sketch of such a sweep appears below). Parameter tuning makes a small difference in this case. Pyserini also provides document expansion baselines using our doc2query method [21]; the latest model uses T5 [24] as described in Nogueira and Lin [19]. Bag-of-words BM25 ranking over the corpus with document expansion is shown in rows (1c) and (1d) for default and tuned parameters. We see that doc2query yields a large jump in effectiveness, while still using bag-of-words retrieval, since neural inference is applied to generate expansions prior to the indexing phase. With doc2query, parameter tuning also makes a difference.
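The following sketch illustrates what such a parameter sweep might look like. The grid values are purely illustrative, and evaluate_recall is a hypothetical helper that computes recall@1000 from the runs; the toolkit ships its own tuning scripts.

from pyserini.search import SimpleSearcher

def tune_bm25(index_name, queries, evaluate_recall):
    # queries: dict mapping qid -> query string.
    # evaluate_recall: hypothetical helper computing recall@1000 from runs.
    searcher = SimpleSearcher.from_prebuilt_index(index_name)
    best = None
    for k1 in [0.6, 0.8, 0.9, 1.0, 1.2]:   # illustrative grid
        for b in [0.4, 0.6, 0.68, 0.8]:
            searcher.set_bm25(k1, b)
            runs = {qid: searcher.search(q, 1000)
                    for qid, q in queries.items()}
            score = evaluate_recall(runs)
            if best is None or score > best[0]:
                best = (score, k1, b)
    return best  # (recall@1000, k1, b) of the best setting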
For dense retrieval, results using TCT-ColBERT [16] are shown in rows (2) using different indexes. Row (2a) refers to brute-force scans over the document vectors in FAISS [11], which provides exact nearest-neighbor search. Row (2b) refers to approximate nearest-neighbor search using HNSW [17]; the latter yields a small loss in effectiveness, but enables interactive querying. We see that retrieval using dense learned representations is much more effective than retrieval using sparse bag-of-words representations, even taking into account document expansion techniques.

Results of hybrid techniques that combine sparse and dense retrieval using weighted interpolation are shown next in Table 1. Row (3a) shows the results of combining TCT-ColBERT with BM25 bag-of-words search over the original texts, while row (3b) shows results that combine document expansion using doc2query with the T5 model. In both cases we used a brute-force approach. Results show that combining sparse and dense signals is more effective than either alone, and that the hybrid technique continues to benefit from document expansion.

To put these results in context, rows (4) provide a few additional points of comparison. Row (4a) shows the BM25 baseline provided by the MS MARCO leaderboard organizers, which appears to be less effective than Pyserini's implementation. Rows (4b) and (4c) refer to two alternative dense-retrieval techniques; these results show that our TCT-ColBERT model performs on par with competing models. Finally, rows (4d) and (4e) show results from two of our own reranking pipelines built on Pyserini as first-stage retrieval: monoBERT, a standard BERT-based reranker [20], and our "Expando-Mono-Duo" design pattern with T5 [23]. These illustrate how Pyserini can serve as the foundation for further explorations in neural ranking techniques.

Results on the MS MARCO document ranking task are shown in Table 2.

MS MARCO Document                             Development          Test
Method                                      MRR@100   R@1k       MRR@100
Pyserini: sparse retrieval
(1a) BM25 (default), original text
     (per-document)                          0.230    0.886       0.201

Table 2: Results on the MS MARCO document ranking task.

For this task, there are two common configurations, what we call "per-document" vs. "per-passage" indexing. In the former, each document in the corpus is indexed as a separate document; in the latter, each document is first segmented into multiple passages, and each passage is indexed as a separate "document". Typically, for the "per-passage" index, a document ranking is constructed by simply taking the maximum of per-passage scores (as illustrated in the sketch below); the motivation for this design is to reduce the amount of text that computationally expensive downstream rerankers need to process. Rows (1a)–(1d) show the per-document and per-passage approaches on the original texts, using default parameters and after tuning for recall@100 using grid search. With default parameters, there appears to be a large effectiveness gap between the per-document and per-passage approaches, but with properly tuned parameters, (1b) vs. (1d), we see that they achieve comparable effectiveness. As with passage retrieval, we can include document expansion with either the per-document or per-passage approaches (the difference is whether we append the expansions to each document or each passage); these results are shown in (1e) and (1f). Similarly, the differences in effectiveness between the two approaches are quite small.

Dense retrieval using TCT-ColBERT is shown in row (2); this is a new experimental condition that was not reported in Lin et al. [16]. Here, we are simply using the encoder that has been trained on the MS MARCO passage data in a zero-shot manner. Since these encoders were not designed to process long segments of text, only the per-passage condition makes sense here.
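A minimal sketch of this max-passage aggregation follows. It assumes, purely for illustration, that passage identifiers take the form '<docid>#<passage index>'.

from collections import defaultdict

def max_passage_aggregate(passage_hits):
    # A document's score is the maximum score among its passages.
    doc_scores = defaultdict(lambda: float('-inf'))
    for hit in passage_hits:
        docid = hit.docid.split('#')[0]  # strip the passage suffix
        doc_scores[docid] = max(doc_scores[docid], hit.score)
    # Return (docid, score) pairs sorted by descending score.
    return sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)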
In row (3a), we combine row (2) with the per-passage sparse retrieval results on the original text, and in row (3b), with the per-passage sparse retrieval results using document expansion. Overall, the findings are consistent with the passage ranking task: dense retrieval is more effective than sparse retrieval (although the improvements for document ranking are smaller, most likely due to zero-shot application). Dense and sparse signals are complementary, as shown by the effectiveness of the dense–sparse hybrid, which further benefits from document expansion (although the gains from expansion appear to be smaller).

Similar to the passage ranking task, Table 2 provides a few points of comparison. Row (4a) shows the effectiveness of the BM25 baseline provided by the leaderboard organizers; once again, we see that Pyserini's results are better. Row (4b) shows ANCE results [26], which are more effective than TCT-ColBERT, although the comparison isn't quite fair since our models were not trained on MS MARCO document data. Finally, row (4c) shows the results of applying our "Expando-Mono-Duo" design pattern with T5 [23] in a zero-shot manner.

In summary, Pyserini "covers all the bases" in terms of providing first-stage retrieval for modern research on neural ranking approaches: sparse retrieval, dense retrieval, as well as hybrid techniques combining both approaches. Experimental results on two popular leaderboards show that our toolkit provides a good starting point for further research.
5 REPLICABILITY

As replicability is a major consideration in the design and implementation of Pyserini, it is worthwhile to spend some time discussing practices that support this goal. At a high level, we can divide replicability into technical and social aspects. Of the two, we believe the latter are more important, because any technical tool to support replicability will either be ignored or circumvented unless there is a shared commitment to the goal and established social practices to promote it. Replicability is often in tension with other important desiderata, such as the ability to rapidly iterate, and thus we are constantly struggling to achieve the right balance.

Perhaps the most important principle that our group has internalized is "to eat our own dog food", which refers to the colloquialism of using one's own "product". Our group uses Pyserini as the foundation for our own research on transformer-based reranking models, dense learned representations for reranking, and beyond (see more details in Section 6). Thus, replicability comes at least partially from our self-interest: to ensure that group members can repeat their own experiments and replicate each other's results. If we can accomplish replicability internally, then external researchers should be able to replicate our results if we ensure that there is nothing peculiar about our computing environment.

Our shared commitment to replicability is operationalized into social processes and is supported by technical infrastructure. To start, Pyserini as well as the underlying Anserini toolkit adopt standard best practices in open-source software development. Our code base is available on GitHub, issues are used to describe proposed feature enhancements and bugs, and code changes are mediated via pull requests that are code reviewed by members of our group.

Over the years, our group has worked hard to internalize the culture of writing replication guides for new capabilities, typically paired with our publications; these are all publicly available and stored alongside our code. These guides include, at a minimum, the sequence of command-line invocations that are necessary to replicate a particular set of experimental results, with accompanying descriptions in prose. In theory, copying and pasting commands from the guide into a shell should succeed in replication. In practice, we regularly "try out" each other's replication guides to uncover what didn't work and to offer improvements to the documentation. Many of these guides are associated with a "replication log" at the bottom of the guide, which contains a record of individuals who have successfully replicated the results and the commit id of the code version they used. With these replication logs, if some functionality breaks, it becomes much easier to debug by rewinding the code commits back to the previous point where it last "worked".

How do we motivate individuals to write these guides and replicate each other's results? We have two primary tools: appealing to reciprocity and providing learning experiences for new group members. For new students who wish to become involved in our research, conducting replications is an easy way to learn our code base, and hence provides a strong motivation. Replications are particularly fruitful exercises for undergraduates as their first step in learning about research.
For students who eventually contribute to Pyserini, appeals to reciprocity are effective: they are the beneficiaries of previous group members who "paved the way" and thus it behooves them to write good documentation to support future students. Once established, such a culture becomes a self-reinforcing virtuous cycle.

Building on these social processes, replicability in Anserini is further supported by an end-to-end regression framework that, for each test collection, runs through the following steps: it builds the index from scratch (i.e., from the raw corpus), performs multiple retrieval runs (using different ranking models), evaluates the output (e.g., with trec_eval), and verifies effectiveness figures against expected results. Furthermore, the regression framework automatically generates documentation pages from templates, populating results on each successful execution. All of this happens automatically without requiring any human intervention. There are currently around 30 such tests, which take approximately two days to run end to end. The largest of these tests, which occupies most of the time, builds a 12TB index on all 733 million pages of the ClueWeb12 collection. Although it is not practical to run these regression tests for each code change, we do try to run them as often as possible, resources permitting. This has the effect of catching new commits that break existing regressions early so they are easier to debug. We keep a change log that tracks divergences from expected results (e.g., after a bug fix) or when new regressions are added.

On top of the regression framework in Anserini, further end-to-end regression tests in Pyserini compare its output against Anserini's output to verify that the Python interface does not introduce any bugs. These regression tests, for example, exercise different parameter settings from the command line, ensure that single-threaded and multi-threaded execution yield identical results, verify that pre-built indexes can be successfully downloaded, etc. (A simplified sketch of such a check appears at the end of this section.)

Written guides and automated regression testing lie along a spectrum of replication rigor. We currently do not have clear-cut criteria as to what features become "enshrined" in automated regressions. However, as features become more critical and foundational in Pyserini or Anserini, we become more motivated to include them in our automated testing framework.

In summary, replicability has become ingrained as a shared norm in our group, operationalized in social processes and facilitated by technical infrastructure. This has allowed us to balance the demands of replicability with the ability to iterate at a rapid pace.
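To convey the flavor of such a check, the following is a highly simplified sketch, not our actual regression framework. Here, parse_mrr is a hypothetical helper that extracts MRR@10 from the evaluation script's output, and the expected value is taken from Table 1, row (1a).

import subprocess

EXPECTED_MRR10 = 0.184  # expected figure for this condition (Table 1, row 1a)

def run(cmd):
    # Run a command, fail loudly on error, and capture its output.
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout

def test_bm25_msmarco_passage(parse_mrr):
    # Generate the run with the same command shown in Section 3.3.
    run(['python', '-m', 'pyserini.search',
         '--topics', 'msmarco-passage-dev-subset',
         '--index', 'msmarco-passage',
         '--output', 'run.regression.txt',
         '--bm25', '--msmarco'])
    # Evaluate and compare against the expected value.
    out = run(['python', '-m', 'pyserini.eval.msmarco_passage_eval',
               'msmarco-passage-dev-subset', 'run.regression.txt'])
    assert abs(parse_mrr(out) - EXPECTED_MRR10) < 0.0005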
6 FUTURE PLANS

Anserini has been in development for several years and our group has been working on Pyserini since late 2019. The most recent major feature added to Pyserini (in 2021) has been dense retrieval capabilities alongside bag-of-words sparse retrieval, and their integration in hybrid sparse–dense techniques.

Despite much activity and continued additions to our toolkit, the broad contours of what Pyserini "aims to be" are fairly well defined. We plan to stay consistent with our goal of providing replicable and easy-to-use techniques that support innovations in neural ranking methods. Because it is not possible for any single piece of software to do everything, an important part of maintaining focus on our goals is to be clear about what Pyserini isn't going to do. While we are planning to add support for more dense retrieval techniques based on learned representations, quite explicitly the training of these models is outside the scope of Pyserini. At a high level, the final "product" of any dense retrieval technique comprises an encoder for queries and an encoder for documents (and in some cases, these are the same). The process of training these encoders can be quite complex, involving, for example, knowledge distillation [10, 16] and complex sampling techniques [26]. This is an area of active exploration and it would be premature to try to build a general-purpose toolkit for learning such representations.

For dense retrieval techniques, Pyserini assumes that query/document encoders have already been learned: in modern approaches based on pretrained transformers, Hugging Face's Transformers library has become the de facto standard for working with such models, and our toolkit provides tight integration. From this starting point, Pyserini provides utilities for building indexes that support nearest-neighbor search on these dense representations (a sketch of this kind of workflow appears at the end of this section). However, it is unlikely that Pyserini will, even in the future, become involved in the training of dense retrieval models.

Another conscious decision we have made in the design of Pyserini is to not prescribe an architecture for multi-stage ranking and to not include neural reranking models in the toolkit. Our primary goal is to provide replicable first-stage retrieval, and we did not want to express an opinion on how multi-stage ranking should be organized. Instead, our group is working on a separate toolkit, called PyGaggle, that provides implementations for much of our work on multi-stage ranking, including our "mono" and "duo" designs [23] as well as ranking with sequence-to-sequence models [18]. PyGaggle is designed specifically to work with Pyserini, but the latter was meant to be used independently, and we explicitly did not wish to "hard code" our own research agenda. This separation has made it easier for other neural IR toolkits to build on Pyserini, for example, the Capreolus toolkit [29, 30].

On top of PyGaggle, we have been working on faceted search interfaces to provide a complete end-to-end search application: this was initially demonstrated in our Covidex [31] search engine for COVID-19 scientific articles.
We have since generalized the application into Cydex, which provides infrastructure for searching the scientific literature, demonstrated in different domains [7].

Our ultimate goal is to provide reusable libraries for crafting end-to-end information access applications, and we have organized the abstractions in a manner that allows users to pick and choose what they wish to adopt and build on: Pyserini to provide first-stage retrieval and basic support, PyGaggle to provide neural reranking models, and Cydex to provide a faceted search interface.
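To illustrate the kind of workflow Pyserini targets but does not itself train models for, the following sketch encodes a handful of documents with an off-the-shelf Hugging Face checkpoint and builds a brute-force FAISS index. The checkpoint name and mean pooling are generic placeholders; Pyserini's own utility scripts wrap similar steps.

import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Generic placeholder checkpoint; a trained bi-encoder would be used in practice.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def encode(texts):
    # Tokenize, run the encoder, and mean-pool token embeddings
    # (one pooling choice among many).
    inputs = tokenizer(texts, padding=True, truncation=True,
                       return_tensors='pt')
    with torch.no_grad():
        out = model(**inputs).last_hidden_state
    return out.mean(dim=1).numpy()

# Encode a toy "collection" and build a brute-force inner-product index.
vectors = encode(['passage one about lobster rolls',
                  'passage two about something else'])
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.ascontiguousarray(vectors, dtype='float32'))

# Nearest-neighbor search with an encoded query.
scores, ids = index.search(encode(['a query']).astype('float32'), 2)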
7 CONCLUSIONS

Our group's efforts to promote and support replicable IR research date back to 2015 [4, 14], and the landscape has changed quite a bit since then. Today, there is much more awareness of the issues surrounding replicability; norms such as the sharing of source code have become more entrenched than before, and we have access to better tools now (e.g., Docker, package managers, etc.) than we did before. At the same time, however, today's software ecosystem has become more complex; ranking models have become more sophisticated and modern multi-stage ranking architectures involve more complex components than before. In this changing environment, the need for stable foundations on which to build remains. With Pyserini, it has been and will remain our goal to provide easy-to-use tools in support of replicable IR research.
ACKNOWLEDGEMENTS
This research was supported in part by the Canada First Research Excellence Fund, the Natural Sciences and Engineering Research Council (NSERC) of Canada, and the Waterloo–Huawei Joint Innovation Laboratory.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16). 265–283.
[2] Zeynep Akkalyoncu Yilmaz, Charles L. A. Clarke, and Jimmy Lin. 2020. A Lightweight Environment for Learning Experimental IR Research Practices. In Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020). 2113–2116.
[3] Zeynep Akkalyoncu Yilmaz, Shengjin Wang, Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Applying BERT to Document Retrieval with Birch. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. Hong Kong, China, 19–24.
[4] Jaime Arguello, Matt Crane, Fernando Diaz, Jimmy Lin, and Andrew Trotman. 2015. Report on the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR). SIGIR Forum 49, 2 (2015), 107–116.
[5] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268v3 (2018).
[6] Michael Bendersky, Honglei Zhuang, Ji Ma, Shuguang Han, Keith Hall, and Ryan McDonald. 2020. RRF102: Meeting the TREC-COVID Challenge with a 100+ Runs Ensemble. arXiv:2010.00200 (2020).
[7] Shane Ding, Edwin Zhang, and Jimmy Lin. 2020. Cydex: Neural Search Infrastructure for the Scholarly Literature. In Proceedings of the First Workshop on Scholarly Document Processing. 168–173.
[8] Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. 2020. CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization. arXiv:2006.09595 (2020).
[9] Adrien Grand, Robert Muir, Jim Ferenczi, and Jimmy Lin. 2020. From MaxScore to Block-Max WAND: The Story of How Lucene Significantly Improved Query Evaluation Performance. In Proceedings of the 42nd European Conference on Information Retrieval, Part II (ECIR 2020). 20–27.
[10] Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. 2021. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. arXiv:2010.02666 (2021).
[11] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-Scale Similarity Search with GPUs. arXiv:1702.08734 (2017).
[12] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781.
[13] Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020). 39–48.
[14] Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016). Padua, Italy, 408–420.
[15] Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2010.06467 (2020).
[16] Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2020. Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. arXiv:2010.11386 (2020).
[17] Yu A. Malkov and D. A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2020), 824–836.
[18] Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of the Association for Computational Linguistics: EMNLP 2020. 708–718.
[19] Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery.
[20] Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-Stage Document Ranking with BERT. arXiv:1910.14424 (2019).
[21] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv:1904.08375 (2019).
[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems. 8024–8035.
[23] Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2021. The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. arXiv:2101.05667 (2021).
[24] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67.
[25] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45.
[26] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. arXiv:2007.00808 (2020).
[27] Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017). Tokyo, Japan, 1253–1256.
[28] Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: Reproducible Ranking Baselines Using Lucene. Journal of Data and Information Quality 10, 4 (2018), Article 16.
[29] Andrew Yates, Siddhant Arora, Xinyu Zhang, Wei Yang, Kevin Martin Jose, and Jimmy Lin. 2020. Capreolus: A Toolkit for End-to-End Neural Ad Hoc Retrieval. In Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020). Houston, Texas, 861–864.
[30] Andrew Yates, Kevin Martin Jose, Xinyu Zhang, and Jimmy Lin. 2020. Flexible IR Pipelines with Capreolus. In Proceedings of the 29th International Conference on Information and Knowledge Management (CIKM 2020). 3181–3188.
[31] Edwin Zhang, Nikhil Gupta, Raphael Tang, Xiao Han, Ronak Pradeep, Kuang Lu, Yue Zhang, Rodrigo Nogueira, Kyunghyun Cho, Hui Fang, and Jimmy Lin. 2020. Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset. In Proceedings of the First Workshop on Scholarly Document Processing.