A White Box Analysis of ColBERT
Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant
Sorbonne Université, LIP6, F-75005 Paris, France [email protected]
Naver Labs Europe, Meylan, France [email protected]
Abstract.
Transformer-based models are nowadays state-of-the-art in adhoc Information Retrieval, but their behavior is far from being understood. Recent work has claimed that BERT does not satisfy the classical IR axioms. However, we propose to dissect the matching process of ColBERT, through the analysis of term importance and exact/soft matching patterns. Even if the traditional axioms are not formally verified, our analysis reveals that ColBERT (i) is able to capture a notion of term importance; (ii) relies on exact matches for important terms.
Keywords:
Information Retrieval · Term Matching · Transformer · BERT
Introduction

Over the last two years, Natural Language Processing has been shaken by the release of large pre-trained language models based on self-attention, like BERT [4]. Ranking models based on BERT are currently state-of-the-art for adhoc IR, and rank first by a large margin on the leaderboards of the MSMARCO passage and document (re-)ranking tasks (https://microsoft.github.io/msmarco/) [10], as well as on more standard IR datasets such as Robust04 [3,9,11]. They have excelled where previous neural models had been struggling so far [15]. It is thus interesting to better understand what is happening inside those models during ranking, and what phenomena are captured. Some works have been conducted in this direction [2,12], but focused on whether IR axioms are respected – or not – by neural and transformer-based models. In [2], BERT has been shown to not fully respect axioms that have proved to be important for standard IR models, such as the axiom stating that words occurring in more documents are less important (IDF effect). Instead of investigating whether these models behave like standard ones, in this paper we make a step towards understanding how they manage to improve over traditional models through their specific matching process.

There exists a wide variety of BERT-based ranking models, as summarized in the recent overview [8]. Canonical BERT models are difficult to analyse because they require a thorough analysis of attention mechanisms, which is a complex task [1]. We rather choose to focus on contextual interaction models [6,7,9], where query and document are encoded independently – contrary to the usual case [10]. Among such models, ColBERT [7] exhibits the best trade-off between effectiveness and efficiency, with performance on par with standard BERT, suggesting that the power of these models comes from learning rich contextual representations, rather than from modeling complex matching patterns. Moreover, the structure of ColBERT (a sum of similarity scores over query terms) is similar to that of standard IR models like BM25, and makes the analysis easier, as the contribution of each query term is explicit.

In this paper, we hence focus on ColBERT and look at two research questions: first, the link between term importance as computed by standard IR models and the one computed by (Col)BERT; second, how (Col)BERT deals with exact and soft matches, as this is known to be critical for IR systems.
Dataset
For our analysis, we use the passage retrieval tasks from TREC-DL 2019 and 2020 [14] (400 queries in total). We consider a re-ranking setting, where for a given query q, the model needs to re-rank a set of documents S_q selected by a first-stage ranker. Following the MSMARCO setting, we consider candidates from BM25, with |S_q| ≤ 1000. We then analyze how the model attributes scores to each query token, for the documents in S_q.

ColBERT
We now introduce the variant of ColBERT [7] we used to simplify the analysis – we checked each time that the drop in performance was minor. In particular, we did not include query/document specific tokens ([Q] and [D]), since these tokens could bias the representation of query/document terms. Second, while query augmentation has been shown to be beneficial in [5,7], we omit this component to avoid analyzing the induced implicit query expansion mechanism. We however keep the compression layer, which projects token representations from the BERT representation space (d = 768) to the ColBERT representation space (d = 128). By fine-tuning our model in a similar fashion to [7], we obtain an MRR@10 of 0.343 on the MSMARCO dev set, a minor drop compared to the original ColBERT [7].

The relevance score of a document d for a query q, given the contextualized token embeddings E_q = (E_{q_i})_i for the query (after WordPiece tokenization) and E_d = (E_{d_j})_j for the document, is:

$$ s(q, d) = \sum_{i \in q} \max_{j \in d} \cos(E_{q_i}, E_{d_j}) = \sum_{i \in q} \max_{j \in d} C_{ij} = \sum_{i \in q} C^\star_{id} \qquad (1) $$

In the following, we say that a query token i matches the document token j* if C_{ij*} = C*_{id}. We denote this token j* by d*_i.
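To make the late-interaction score concrete, here is a minimal sketch of Equation (1) in Python; the function name, tensor shapes, and the use of PyTorch are illustrative assumptions, not the authors' implementation.

```python
import torch

def colbert_score(E_q: torch.Tensor, E_d: torch.Tensor) -> torch.Tensor:
    """Late-interaction relevance score of Equation (1).

    E_q: (n_q, dim) contextualized query token embeddings (dim = 128 here)
    E_d: (n_d, dim) contextualized document token embeddings
    """
    # Cosine similarity matrix C of shape (n_q, n_d): normalize rows, then dot products.
    q = torch.nn.functional.normalize(E_q, dim=-1)
    d = torch.nn.functional.normalize(E_d, dim=-1)
    C = q @ d.T
    # C*_{id}: best-matching document token score for each query token i;
    # the argmax identifies the matched token d*_i used in the analyses below.
    C_star, d_star = C.max(dim=-1)
    # Summing over query tokens gives s(q, d).
    return C_star.sum()
```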
Term Importance

Our first research question focuses on comparing the term importance of standard IR models (e.g. BM25) with the term importance as determined by ColBERT. With respect to the former, given that documents are (small) passages, term frequency is close to 1 for most terms. Moreover, passage length does not vary much, and is capped at 512 tokens. Hence, we can reasonably assume that a term's BM25 score roughly corresponds to its IDF – this might not be true for terms with very low IDF values, but it is a good enough approximation for other terms.

For ColBERT, it is difficult to measure the importance of a term because it depends on both document and query contexts. We hence resort to an indirect means, by measuring the correlation between the original ColBERT ranking and the ranking obtained when the corresponding word is masked, i.e. when we remove from the sum in Equation (1) all the contributions of the subwords that compose the word. Finally, to compare rankings, we use the AP-correlation τ_AP [16] (using the Python implementation provided by [13]), which is akin to Kendall rank correlation, but gives more importance to the top of the ranking. Values close to 1 indicate a strong correlation, meaning that the two rankings are similar, implying a low contribution of the term in the ranking process. Note that such a measure of importance is query dependent: when a term appears in several queries, we consider the average as the final measure of importance.
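For illustration, a minimal sketch of this masking procedure, assuming the per-token contributions C*_{id} have been collected into a matrix; the helper names are ours, and we use Kendall's τ from SciPy as a stand-in for the τ_AP implementation of [13].

```python
import numpy as np
from scipy.stats import kendalltau  # stand-in; the paper uses tau_AP [13, 16]

def masked_ranking_correlation(contributions: np.ndarray, word_rows: list) -> float:
    """contributions[i, d] = C*_{id}: contribution of query subword i to document d in S_q.
    word_rows: row indices of the subwords composing the word being masked.

    Returns the rank correlation between the original ranking of S_q and the ranking
    obtained when the word's contributions are dropped from the sum in Equation (1).
    Values close to 1 mean the word barely influences the ranking.
    """
    full_scores = contributions.sum(axis=0)
    kept = [i for i in range(contributions.shape[0]) if i not in word_rows]
    masked_scores = contributions[kept].sum(axis=0)
    tau, _ = kendalltau(full_scores, masked_scores)
    return tau
```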
Fig. 1. ColBERT term importance (as computed using τ_AP) with respect to IDF (standard term importance), for the pre-trained only and fine-tuned models.

In Figure 1, we show how IDF and τ_AP are connected. There is a linear negative correlation between both metrics: the rarer a term, the more its removal affects the ranking. However, some terms with a high IDF obtain a high τ_AP, i.e. a low measured importance. Several hypotheses can explain this: (i) ColBERT relies on its own notion of term importance, and such terms might actually not be so important; (ii) as most of the documents contain the term, the effect on τ_AP might not be high; (iii) another query term (with no semantics) is bearing the same semantics as the target one. The first hypothesis is probably true since ColBERT improves over BM25. As for the second one, this is a more general observation regarding the re-ranking setting, where IR axioms might not fully apply. Finally, to investigate hypothesis (iii), we looked, for each query token, at the frequency of exact matching (i.e. the max similarity is obtained with the same token in a document) and at the frequency with which it matches, in documents, other terms of the query. We observed that stopwords (the, of, etc.) did indeed match terms in the documents that were other query terms. For instance, in the query (and associated τ_AP) "the (0.94) symptoms (0.87) of (0.93) shingles (0.88)", the word "of" actually mostly matches with "shingles" in documents from S_q.

Exact and Soft Matching

After having looked at term importance, we now turn our attention to the issue of exact matches, i.e. how exact string matching is processed by ColBERT. Because the model has been trained to re-order a standard term-based IR model, it is interesting to check whether it might be less sensitive to such signals.

To look into this, we need a measure indicating whether ColBERT treats a term as an exact match or not (i.e., as a soft match). To do so, we compute, for each query term i, the difference between the average ColBERT scores when i matches the same term within a document (i.e., when d*_i → t) or not (i.e., when d*_i does not map to t). We then average at the query level, to obtain one measure per term (for terms appearing in several queries). This measure is formally defined as:

$$ \Delta_{ES}(t) = \operatorname{mean}_{i,q \,/\, i \to t} \Big( \operatorname{mean}_{d \in S_q / d^\star_i \to t} \{ C^\star_{id} \} - \operatorname{mean}_{d \in S_q / d^\star_i \not\to t} \{ C^\star_{id} \} \Big) \qquad (2) $$

where j → t means that the j-th token corresponds to token t.

For a word w composed of several WordPiece components t_1, ..., t_n, we use Σ_{t∈w} Δ_ES(t), which corresponds to the way ColBERT works (summing over subwords). Then, for all query words w, we plot Δ_ES(w) with respect to IDF(w) (Figure 2). We observe a moderate positive correlation (r = 0.667) between terms focusing more on exact matching in ColBERT (larger Δ_ES) and IDF.
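As an illustration, here is a minimal sketch of the inner term of Equation (2) for a single query token, assuming the per-document max similarities C*_{id} and the WordPiece ids of the matched tokens d*_i are available; the function and argument names are ours.

```python
import numpy as np

def delta_es_single_query(c_star: np.ndarray, matched_token_ids: np.ndarray, token_id: int) -> float:
    """Inner term of Equation (2) for one query token i over the candidate set S_q.

    c_star[d]            = C*_{id}: max similarity of query token i in document d
    matched_token_ids[d] = WordPiece id of the document token d*_i achieving that max
    token_id             = WordPiece id of the query token t (exact-match test d*_i -> t)
    """
    exact = matched_token_ids == token_id
    if exact.all() or not exact.any():
        # The difference is undefined when the term is always (or never) matched exactly.
        return float("nan")
    return float(c_star[exact].mean() - c_star[~exact].mean())

# Delta_ES(t) then averages this quantity over all queries containing t, and
# Delta_ES(w) for a word w sums it over the WordPiece components of w.
```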
Fig. 2. Δ_ES with respect to IDF, for the pre-trained only and fine-tuned models: we observe a moderate correlation (0.667) between Δ_ES and IDF, showing that the less frequent a term is, the more likely it is to be matched exactly.

For instance, in the query (and associated Δ_ES) "causes (0.35) of (0.11) left (0.64) ventricular (1.14) hypertrophy (1.62)", we can see that the model relies a lot on exact match for the last two terms.

To explain this behavior, our hypothesis is that exact matches correspond to contextual embeddings that do not vary much, while terms that carry less "information" are more heavily influenced by their context (they act as some sort of reservoirs to encode concepts of the sequence), and thus their embeddings vary a lot. To check this hypothesis, we conducted a spectral analysis of contextual term embeddings. More specifically, we use an SVD decomposition of the matrix composed of all the contextual representations of a given term t in the test documents, and look at the relative magnitude of the singular values λ_1 ≥ ... ≥ λ_d, where d is the dimension of the embedding space. If the magnitude of λ_1 is much larger than the others, it means that all the contextual representations point in the same direction in the embedding space. In Figure 3, we report the ratio of the first singular value λ_1 to Σ_k λ_k for terms that appear in the test queries. This figure confirms the above hypothesis, as the ratio increases with the subword IDF.
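The spectral measure itself reduces to a plain SVD; here is a minimal sketch, assuming the contextual embeddings of all occurrences of a term in the test documents have been stacked into one matrix.

```python
import numpy as np

def first_singular_value_ratio(embeddings: np.ndarray) -> float:
    """Ratio lambda_1 / sum_k lambda_k of the singular values of the matrix whose
    rows are the contextual embeddings of one term (one row per occurrence).

    A ratio close to 1 means all occurrences point in nearly the same direction,
    i.e. the term's representation barely depends on its context.
    """
    singular_values = np.linalg.svd(embeddings, compute_uv=False)
    return float(singular_values[0] / singular_values.sum())
```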
Fig. 3. Ratio of the first singular value to the sum of the singular values, with respect to IDF (subword level), for the pre-trained only and fine-tuned models. The less frequent the term is, the higher the ratio, showing that all contextualized embeddings of a rare term are concentrated in the same direction.

Moreover, fine-tuning on relevance further increases this ratio, and thus promotes exact matches. By looking at the distribution of singular values (not shown here), we can confirm this trend. In particular, words with a low IDF tend to point each time in a different direction, showing that what they capture is more about their context. For instance, in the query "when did family feud come out?" (a TV show), the term "come", for all the documents in S_q, matches 97% of the time to document terms that are not in the query, but are synonyms (in a broad sense), e.g. {july, happen, item, landing, released, name, en, going, it, rodgers}.

Conclusion

While the axiomatic approach is appropriate to analyze traditional IR models, its application to BERT-based models remains limited and somehow inadequate. To the best of our knowledge, our study is one of the first to shed light on some matching behaviors of BERT, through the analysis of a simpler counterpart, ColBERT. We showed that (i) even if the IDF effect from the axiomatic theory is not enforced, (Col)BERT does have a notion of term importance; (ii) exact matching remains an important component of the model, and is amplified after fine-tuning on relevance; (iii) our analysis gave some hints on the properties of frequent words, which tend to capture the contexts in which they appear.
Although this work is a first step towards understanding the matching properties of BERT in IR, we believe there is much more to uncover, by either analyzing a wider range of models, or by extending our analysis of ColBERT to first-stage ranking, where retrieval axioms might be more critical.
References
1. Brunner, G., Liu, Y., Pascual, D., Richter, O., Ciaramita, M., Wattenhofer, R.: On identifiability in transformers (02 2020)
2. Câmara, A., Hauff, C.: Diagnosing BERT with retrieval heuristics. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.) Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part I. Lecture Notes in Computer Science, vol. 12035, pp. 605–618. Springer (2020). https://doi.org/10.1007/978-3-030-45439-5_40
3. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. CoRR abs/1905.09217 (2019), http://arxiv.org/abs/1905.09217
4. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
5. Hofstätter, S., Althammer, S., Schröder, M., Sertkan, M., Hanbury, A.: Improving efficient neural ranking models with cross-architecture knowledge distillation (2020)
6. Hofstätter, S., Zlabinger, M., Hanbury, A.: Interpretable & time-budget-constrained contextualization for re-ranking (2020)
7. Khattab, O., Zaharia, M.: ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 39–48. SIGIR '20, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3397271.3401075
8. Lin, J., Nogueira, R., Yates, A.: Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2010.06467 [cs] (Oct 2020), http://arxiv.org/abs/2010.06467
9. MacAvaney, S., Yates, A., Cohan, A., Goharian, N.: CEDR: Contextualized embeddings for document ranking. In: SIGIR (2019)
10. Nogueira, R., Cho, K.: Passage re-ranking with BERT (2019)
11. Nogueira, R., Jiang, Z., Lin, J.: Document ranking with a pretrained sequence-to-sequence model (2020)
12. Rennings, D., Moraes, F., Hauff, C.: An Axiomatic Approach to Diagnosing Neural IR Models. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) Advances in Information Retrieval. pp. 489–503. Lecture Notes in Computer Science, Springer International Publishing, Cham (2019). https://doi.org/10/ggcmnb
13. Urbano, J., Marrero, M.: The treatment of ties in AP correlation. In: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval. pp. 321–324. ICTIR '17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3121050.3121106
14. Voorhees, E.M., Ellis, A. (eds.): Proceedings of the Twenty-Eighth Text REtrieval Conference, TREC 2019, Gaithersburg, Maryland, USA, November 13-15, 2019. NIST Special Publication, vol. 1250. National Institute of Standards and Technology (NIST) (2019), https://trec.nist.gov/pubs/trec28/trec2019.html
15. Yang, W., Lu, K., Yang, P., Lin, J.: Critically examining the "neural hype". In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Jul 2019). https://doi.org/10.1145/3331184.3331340
16. Yilmaz, E., Aslam, J.A., Robertson, S.: A new rank correlation coefficient for information retrieval. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 587–594. SIGIR '08, Association for Computing Machinery, New York, NY, USA (2008). https://doi.org/10.1145/1390334.1390435