A White Box Analysis of ColBERT
Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant
Sorbonne Université, LIP6, F-75005 Paris, France [email protected]
Naver Labs Europe, Meylan, France [email protected]
Abstract.
Transformer-based models are nowadays state-of-the-art in adhoc Information Retrieval, but their behavior is far from being understood. Recent work has claimed that BERT does not satisfy the classical IR axioms. However, we propose to dissect the matching process of ColBERT, through the analysis of term importance and exact/soft matching patterns. Even if the traditional axioms are not formally verified, our analysis reveals that ColBERT (i) is able to capture a notion of term importance; (ii) relies on exact matches for important terms.
Keywords:
Information Retrieval · Term Matching · Transformer · BERT
Introduction

Over the last two years, Natural Language Processing has been shaken by the release of large pre-trained language models based on self-attention, like BERT [4]. Ranking models based on BERT are currently state-of-the-art for adhoc IR, and rank first by a large margin on the leaderboards of the MSMARCO passage and document (re-)ranking tasks (https://microsoft.github.io/msmarco/) [10], as well as on more standard IR datasets such as Robust04 [3,9,11]. They have excelled where previous neural models had been struggling so far [15]. It is thus interesting to better understand what is happening inside those models during ranking, and what phenomena are captured. Some works have been conducted in this direction [2,12], but focused on whether IR axioms are respected – or not – by neural and transformer-based models. In [2], BERT has been shown to not fully respect axioms that have proved to be important for standard IR models, such as the axiom stating that words occurring in more documents are less important (IDF effect). Instead of investigating whether these models behave like standard ones, in this paper we make a step towards understanding how they manage to improve over traditional models through their specific matching process.

There exists a wide variety of BERT-based ranking models, as summarized in the recent overview [8]. Canonical BERT models are difficult to analyse because they require a thorough analysis of attention mechanisms, which is a complex task [1]. We rather choose to focus on contextual interaction models [6,7,9], where query and document are encoded independently – contrary to the usual case [10]. Among such models, ColBERT [7] exhibits the best trade-off between effectiveness and efficiency, with performance on par with standard BERT, suggesting that the power of these models comes from learning rich contextual representations, rather than from modeling complex matching patterns. Moreover, the structure of ColBERT (a sum of similarity scores over query terms) is similar to that of standard IR models like BM25, and makes the analysis easier, as the contribution of each query term is explicit.

In this paper, we hence focus on ColBERT and look at two research questions: first, the link between term importance as computed by standard IR models and the one computed by (Col)BERT; second, how (Col)BERT deals with exact and soft matches, as this is known to be critical for IR systems.
Dataset
For our analysis, we use the passage retrieval tasks from TREC-DL 2019 and 2020 [14] (400 queries in total). We consider a re-ranking setting, where for a given query q, the model needs to re-rank a set of documents S_q selected by a first-stage ranker. Following the MSMARCO setting, we consider candidates from BM25, with |S_q| ≤ 1000. We then analyze how the model attributes scores to each query token, for the documents in S_q.

ColBERT
We now introduce the variant of ColBERT [7] we used to simplify the analysis – we checked each time that the drop in performance was minor. In particular, we did not include query/document specific tokens ([Q] and [D]), since these tokens could bias the representation of query/document terms. Second, while query augmentation has been shown to be beneficial in [5,7], we omit this component to avoid analyzing the induced implicit query expansion mechanism. We however keep the compression layer, which projects token representations from the BERT representation space (d = 768) to the ColBERT representation space (d = 128). By fine-tuning our model in a similar fashion to [7], we obtain an MRR@10 of 0.343 on the MSMARCO dev set, a minor drop compared to the original ColBERT [7].

The relevance score of a document d for a query q, given the contextualized token embeddings E_q = (E_{q_i})_i for the query (after WordPiece tokenization) and E_d = (E_{d_j})_j for the document, is:

$$ s(q, d) = \sum_{i \in q} \max_{j \in d} \cos(E_{q_i}, E_{d_j}) = \sum_{i \in q} \max_{j \in d} C_{ij} = \sum_{i \in q} C^\star_{id} \qquad (1) $$

In the following, we say that a query token i matches the document token j* if C_{ij*} = C*_{id}. We denote this token j* by d*_i.
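To make the late-interaction score concrete, here is a minimal sketch of Equation (1) in Python; the function name, tensor shapes, and the use of PyTorch are illustrative assumptions, not the authors' implementation.

```python
import torch

def colbert_score(E_q: torch.Tensor, E_d: torch.Tensor) -> torch.Tensor:
    """Late-interaction relevance score of Equation (1).

    E_q: (n_q, dim) contextualized query token embeddings (dim = 128 here)
    E_d: (n_d, dim) contextualized document token embeddings
    """
    # Cosine similarity matrix C of shape (n_q, n_d): normalize rows, then dot products.
    q = torch.nn.functional.normalize(E_q, dim=-1)
    d = torch.nn.functional.normalize(E_d, dim=-1)
    C = q @ d.T
    # C*_{id}: best-matching document token score for each query token i;
    # the argmax identifies the matched token d*_i used in the analyses below.
    C_star, d_star = C.max(dim=-1)
    # Summing over query tokens gives s(q, d).
    return C_star.sum()
```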
Term Importance

Our first research question focuses on comparing the term importance of standard IR models (e.g. BM25) with the term importance as determined by ColBERT. With respect to the former, given that documents are (small) passages, term frequency is close to 1 for most terms. Moreover, passage length does not vary much, and is capped at 512 tokens. Hence, we can reasonably assume that a term's BM25 score roughly corresponds to its IDF – this might not be true for terms with very low IDF values, but it is a good enough approximation for other terms.

For ColBERT, it is difficult to measure the importance of a term because it depends on both document and query contexts. We hence resort to an indirect means, by measuring the correlation between the original ColBERT ranking and the ranking obtained when the corresponding word is masked, i.e. when we remove from the sum in Equation (1) all the contributions of the subwords that compose the word. Finally, to compare rankings, we use the AP-correlation τ_AP [16] (using the Python implementation provided by [13]), which is akin to Kendall rank correlation, but gives more importance to the top of the ranking. Values close to 1 indicate a strong correlation, meaning that the two rankings are similar, implying a low contribution of the term in the ranking process. Note that such a measure of importance is query dependent: when a term appears in several queries, we consider the average as the final measure of importance.
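For illustration, a minimal sketch of this masking procedure, assuming the per-token contributions C*_{id} have been collected into a matrix; the helper names are ours, and we use Kendall's τ from SciPy as a stand-in for the τ_AP implementation of [13].

```python
import numpy as np
from scipy.stats import kendalltau  # stand-in; the paper uses tau_AP [13, 16]

def masked_ranking_correlation(contributions: np.ndarray, word_rows: list) -> float:
    """contributions[i, d] = C*_{id}: contribution of query subword i to document d in S_q.
    word_rows: row indices of the subwords composing the word being masked.

    Returns the rank correlation between the original ranking of S_q and the ranking
    obtained when the word's contributions are dropped from the sum in Equation (1).
    Values close to 1 mean the word barely influences the ranking.
    """
    full_scores = contributions.sum(axis=0)
    kept = [i for i in range(contributions.shape[0]) if i not in word_rows]
    masked_scores = contributions[kept].sum(axis=0)
    tau, _ = kendalltau(full_scores, masked_scores)
    return tau
```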
Fig. 1. ColBERT term importance (as computed using τ_AP) with respect to IDF (standard term importance), for the pre-trained only and fine-tuned models.

In Figure 1, we show how IDF and τ_AP are connected. There is a linear negative correlation between both metrics: the rarer a term, the more its removal affects the ranking. However, some terms with a high IDF obtain a high τ_AP, i.e. a low measured importance. Several hypotheses can explain this: (i) ColBERT relies on its own notion of term importance, and such terms might actually not be so important; (ii) as most of the documents contain the term, the effect on τ_AP might not be high; (iii) another query term (with no semantics) is bearing the same semantics as the target one. The first hypothesis is probably true since ColBERT improves over BM25. As for the second one, this is a more general observation regarding the re-ranking setting, where IR axioms might not fully apply. Finally, to investigate hypothesis (iii), we looked, for each query token, at the frequency of exact matching (i.e. the max similarity is obtained with the same token in a document) and at the frequency with which it matches, in documents, other terms of the query. We observed that stopwords (the, of, etc.) did indeed match terms in the documents that were other query terms. For instance, in the query (and associated τ_AP) "the (0.94) symptoms (0.87) of (0.93) shingles (0.88)", the word "of" actually mostly matches with "shingles" in documents from S_q.

Exact and Soft Matching

After having looked at term importance, we now turn our attention to the issue of exact matches, i.e. how exact string matching is processed by ColBERT. Because the model has been trained to re-order a standard term-based IR model, it is interesting to check whether it might be less sensitive to such signals.

To look into this, we need a measure indicating whether ColBERT treats a term as an exact match or not (i.e., as a soft match). To do so, we compute, for each query term i, the difference between the average ColBERT scores when i matches the same term within a document (i.e., when d*_i → t) or not (i.e., when d*_i does not map to t). We then average at the query level, to obtain one measure per term (for terms appearing in several queries). This measure is formally defined as:

$$ \Delta_{ES}(t) = \operatorname{mean}_{i,q \,/\, i \to t} \Big( \operatorname{mean}_{d \in S_q / d^\star_i \to t} \{ C^\star_{id} \} - \operatorname{mean}_{d \in S_q / d^\star_i \not\to t} \{ C^\star_{id} \} \Big) \qquad (2) $$

where j → t means that the j-th token corresponds to token t.

For a word w composed of several WordPiece components t_1, ..., t_n, we use Σ_{t∈w} Δ_ES(t), which corresponds to the way ColBERT works (summing over subwords). Then, for all query words w, we plot Δ_ES(w) with respect to IDF(w) (Figure 2). We observe a moderate positive correlation (r = 0.667) between terms focusing more on exact matching in ColBERT (larger Δ_ES) and IDF.
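As an illustration, here is a minimal sketch of the inner term of Equation (2) for a single query token, assuming the per-document max similarities C*_{id} and the WordPiece ids of the matched tokens d*_i are available; the function and argument names are ours.

```python
import numpy as np

def delta_es_single_query(c_star: np.ndarray, matched_token_ids: np.ndarray, token_id: int) -> float:
    """Inner term of Equation (2) for one query token i over the candidate set S_q.

    c_star[d]            = C*_{id}: max similarity of query token i in document d
    matched_token_ids[d] = WordPiece id of the document token d*_i achieving that max
    token_id             = WordPiece id of the query token t (exact-match test d*_i -> t)
    """
    exact = matched_token_ids == token_id
    if exact.all() or not exact.any():
        # The difference is undefined when the term is always (or never) matched exactly.
        return float("nan")
    return float(c_star[exact].mean() - c_star[~exact].mean())

# Delta_ES(t) then averages this quantity over all queries containing t, and
# Delta_ES(w) for a word w sums it over the WordPiece components of w.
```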
Fig. 2. Δ_ES with respect to IDF, for the pre-trained only and fine-tuned models: we observe a moderate correlation (0.667) between Δ_ES and IDF, showing that the less frequent a term is, the more likely it is to be matched exactly.

For instance, in the query (and associated Δ_ES) "causes (0.35) of (0.11) left (0.64) ventricular (1.14) hypertrophy (1.62)", we can see that the model relies a lot on exact match for the last two terms.

To explain this behavior, our hypothesis is that exact matches correspond to contextual embeddings that do not vary much, while terms that carry less "information" are more heavily influenced by their context (they act as some sort of reservoirs to encode concepts of the sequence), and thus their embeddings vary a lot. To check this hypothesis, we conducted a spectral analysis of contextual term embeddings. More specifically, we use an SVD decomposition of the matrix composed of all the contextual representations of a given term t in the test documents, and look at the relative magnitude of the singular values λ_1 ≥ ... ≥ λ_d, where d is the dimension of the embedding space. If the magnitude of λ_1 is much larger than the others, it means that all the contextual representations point in the same direction in the embedding space. In Figure 3, we report the ratio of the first singular value λ_1 to Σ_k λ_k for terms that appear in the test queries. This figure confirms the above hypothesis, as the ratio increases with the subword IDF.
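The spectral measure itself reduces to a plain SVD; here is a minimal sketch, assuming the contextual embeddings of all occurrences of a term in the test documents have been stacked into one matrix.

```python
import numpy as np

def first_singular_value_ratio(embeddings: np.ndarray) -> float:
    """Ratio lambda_1 / sum_k lambda_k of the singular values of the matrix whose
    rows are the contextual embeddings of one term (one row per occurrence).

    A ratio close to 1 means all occurrences point in nearly the same direction,
    i.e. the term's representation barely depends on its context.
    """
    singular_values = np.linalg.svd(embeddings, compute_uv=False)
    return float(singular_values[0] / singular_values.sum())
```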
Fig. 3. Ratio of the first singular value to the sum of the singular values, with respect to IDF (subword level), for the pre-trained only and fine-tuned models. The less frequent the term is, the higher the ratio, showing that all contextualized embeddings of a rare term are concentrated in the same direction.

Moreover, fine-tuning on relevance further increases this ratio, and thus promotes exact matches. By looking at the distribution of singular values (not shown here), we can confirm this trend. In particular, words with a low IDF tend to point each time in a different direction, showing that what they capture is more about their context. For instance, in the query "when did family feud come out?" (a TV show), the term "come", for all the documents in S_q, matches 97% of the time to document terms that are not in the query, but are synonyms (in a broad sense), e.g. {july, happen, item, landing, released, name, en, going, it, rodgers}.

Conclusion

While the axiomatic approach is appropriate to analyze traditional IR models, its application to BERT-based models remains limited and somehow inadequate. To the best of our knowledge, our study is one of the first to shed light on some matching behaviors of BERT, through the analysis of a simpler counterpart, ColBERT. We showed that (i) even if the IDF effect from the axiomatic theory is not enforced, (Col)BERT does have a notion of term importance; (ii) exact matching remains an important component of the model, and is amplified after fine-tuning on relevance; (iii) our analysis gave some hints on the properties of frequent words, which tend to capture the contexts in which they appear.
Although this work is a first step towards understanding the matching properties of BERT in IR, we believe there is much more to uncover, by either analyzing a wider range of models, or by extending our analysis of ColBERT to first-stage ranking, where retrieval axioms might be more critical.
References
1. Brunner, G., Liu, Y., Pascual, D., Richter, O., Ciaramita, M., Wattenhofer, R.: On identifiability in transformers (02 2020)
2. Câmara, A., Hauff, C.: Diagnosing BERT with retrieval heuristics. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.) Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part I. Lecture Notes in Computer Science, vol. 12035, pp. 605–618. Springer (2020). https://doi.org/10.1007/978-3-030-45439-5_40
3. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. CoRR abs/1905.09217 (2019), http://arxiv.org/abs/1905.09217
4. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
5. Hofstätter, S., Althammer, S., Schröder, M., Sertkan, M., Hanbury, A.: Improving efficient neural ranking models with cross-architecture knowledge distillation (2020)
6. Hofstätter, S., Zlabinger, M., Hanbury, A.: Interpretable & time-budget-constrained contextualization for re-ranking (2020)
7. Khattab, O., Zaharia, M.: ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 39–48. SIGIR '20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3397271.3401075
8. Lin, J., Nogueira, R., Yates, A.: Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2010.06467 [cs] (Oct 2020), http://arxiv.org/abs/2010.06467
9. MacAvaney, S., Yates, A., Cohan, A., Goharian, N.: CEDR: Contextualized embeddings for document ranking. In: SIGIR (2019)
10. Nogueira, R., Cho, K.: Passage re-ranking with BERT (2019)
11. Nogueira, R., Jiang, Z., Lin, J.: Document ranking with a pretrained sequence-to-sequence model (2020)
12. Rennings, D., Moraes, F., Hauff, C.: An Axiomatic Approach to Diagnosing Neural IR Models. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) Advances in Information Retrieval. pp. 489–503. Lecture Notes in Computer Science, Springer International Publishing, Cham (2019). https://doi.org/10/ggcmnb
13. Urbano, J., Marrero, M.: The treatment of ties in AP correlation. In: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval. pp. 321–324. ICTIR '17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3121050.3121106
14. Voorhees, E.M., Ellis, A. (eds.): Proceedings of the Twenty-Eighth Text REtrieval Conference, TREC 2019, Gaithersburg, Maryland, USA, November 13-15, 2019. NIST Special Publication, vol. 1250. National Institute of Standards and Technology (NIST) (2019), https://trec.nist.gov/pubs/trec28/trec2019.html
15. Yang, W., Lu, K., Yang, P., Lin, J.: Critically examining the "neural hype". In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Jul 2019). https://doi.org/10.1145/3331184.3331340
16. Yilmaz, E., Aslam, J.A., Robertson, S.: A new rank correlation coefficient for information retrieval. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 587–594. SIGIR '08, Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/1390334.1390435