Adaptive Distributional Extensions to DFR Ranking

Casper Petersen, University of Copenhagen, [email protected]
Jakob Grue Simonsen, University of Copenhagen, [email protected]
Kalervo Järvelin, University of Tampere, [email protected]
Christina Lioma, University of Copenhagen, [email protected]
ABSTRACT
Divergence From Randomness (DFR) ranking models assume that informative terms are distributed in a corpus differently than non-informative terms. Different statistical models (e.g. Poisson, geometric) are used to model the distribution of non-informative terms, producing different DFR models. An informative term is then detected by measuring the divergence of its distribution from the distribution of non-informative terms. However, there is little empirical evidence that the distributions of non-informative terms used in DFR actually fit current datasets. Practically, this risks providing a poor separation between informative and non-informative terms, thus compromising the discriminative power of the ranking model. We present a novel extension to DFR, which first detects the best-fitting distribution of non-informative terms in a collection, and then adapts the ranking computation to this best-fitting distribution. We call this model Adaptive Distributional Ranking (ADR) because it adapts the ranking to the statistics of the specific dataset being processed each time. Experiments on TREC data show ADR to outperform DFR models (and their extensions) and to be comparable in performance to a query likelihood language model (LM).
CCS Concepts: • Information systems → Retrieval models and ranking;
1. INTRODUCTION
Early work on automatic indexing [3, 8, 9] observed that informative words, e.g. those belonging to a technical vocabulary, are distributed in a document collection differently than non-informative words, e.g. those usually treated as stopwords. Specifically, informative words were observed to appear more densely in a few so-called elite documents. Based on this, non-informative words were modelled by a Poisson distribution, and it was hypothesised
that an informative word can be detected by measuring the extent to which its distribution deviates from a Poisson distribution, or, in other words, by testing the hypothesis that the word distribution over the whole document collection does not fit the Poisson model [2]. This so-called 2-Poisson model has led to well-known ranking models such as BM25 [18] and Divergence from Randomness (DFR) [2, 14].

For DFR ranking models in particular (presented in Section 2), the idea that non-informative terms tend to follow a specific distribution is central. Different distributions (e.g. the Poisson or geometric distribution) can be used to obtain different DFR ranking models. The goal is to choose a distribution that provides a good fit to the empirical distribution of non-informative terms in a collection C. However, there is little empirical evidence that the distributions used in DFR actually fit current datasets. Practically, this risks providing a poor separation between informative and non-informative terms, thus compromising the discriminative power of the ranking model.

Motivated by this, we present a novel extension to DFR, which first detects the best-fitting distribution (among a candidate set of distributions) of non-informative terms in a collection, and then adapts the ranking computation to this "best-fitting" distribution. We call this model Adaptive Distributional Ranking (ADR, presented in Section 3) because it adapts the ranking to the statistics of the specific dataset being processed each time. With our approach, only one ranking model is obtained per collection: the one that provides the best fit to the distribution of non-informative terms in that collection. Evaluation on TREC datasets (Section 4) shows that our ADR models outperform the original DFR models, as well as more recent DFR extensions, and perform on a par with a query likelihood LM [17].
2. DIVERGENCE FROM RANDOMNESS
Given a query q and a document d, a DFR ranking model estimates the relevance R(q, d) of d to q as [2]:

R(q, d) = Σ_{t∈q} f_t,q · (−log P_1) · (1 − P_2)    (1)

where t is a term, f_t,q is the frequency of t in q, −log P_1 is the information content of t in d, and (1 − P_2) is the risk of accepting t as a descriptor of d's topic. We explain each of these two components next.

The information content −log P_1 measures the divergence of f_t,d from f_t,C, where a higher divergence means more information is carried by t in d. The assumption is that terms that bring little information are distributed over all documents in a certain way across the entire corpus; different distributions (called "models of randomness" in DFR) give rise to different DFR ranking models. For example, using a Poisson distribution, P_1 is:

P_1(t, λ, d) = e^{−λ} λ^{f̂_t,d} / f̂_t,d!    (2)

where λ = f_t,C / |C| is the parameter of the Poisson distribution, and f̂_t,d = f_t,d · log(1 + c · avg_l / |d|) is a logarithmic term frequency normalisation, where c is a free parameter, avg_l is the average document length in the collection, and |d| is the length of d (its number of terms). Eqn. 2 returns the probability of seeing f̂_t,d occurrences of t in d, and effectively tests whether t's distribution on C fits the Poisson distribution. If the probability of obtaining f̂_t,d occurrences of t is low, then t carries a high amount of information [2]. In addition to the Poisson distribution used in this example, DFR models can also be instantiated with the geometric, tf-idf (I_n), tf-itf (I_F) and tf-expected-idf (I_ne) models [2].

The second component of Eqn. 1, the information gain P_2, is the conditional probability of encountering f_t,d + 1 occurrences of t in d. If P_2 is high, the risk (1 − P_2) associated with accepting t as a descriptor of d is low. Such "after-effect" sampling was taken [2] to be normalised either with Laplace normalisation, P_2 = f_t,d / (f_t,d + 1), or Bernoulli normalisation, P_2 = (f_t,C + 1) / (n_t · (f_t,d + 1)), where n_t is the number of documents in which t occurs.
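To make the ranking mechanics concrete, the following is a minimal Python sketch (not the authors' implementation) of Eqns. 1 and 2 with Laplace after-effect normalisation. The index statistics (f_t,d, f_t,C, |C|, avg_l) are assumed to come from an inverted index; the logarithm base only rescales scores and is taken as natural here.

```python
import math

def normalised_tf(f_td, doc_len, avg_len, c=1.0):
    # f^_t,d = f_t,d * log(1 + c * avg_l / |d|): logarithmic tf normalisation
    return f_td * math.log(1.0 + c * avg_len / doc_len)

def log_poisson_p1(f_hat, f_tC, n_docs):
    # Eqn. 2 in log space: log P1 = -lambda + f^ * log(lambda) - log(f^!),
    # with lambda = f_t,C / |C|; lgamma(f^ + 1) stands in for log(f^!)
    # because the normalised frequency f^ is real-valued.
    lam = f_tC / n_docs
    return -lam + f_hat * math.log(lam) - math.lgamma(f_hat + 1.0)

def dfr_pl2_score(query_tf, doc_tf, doc_len, coll_tf, n_docs, avg_len, c=1.0):
    # Eqn. 1: R(q,d) = sum_t f_t,q * (-log P1) * (1 - P2)
    score = 0.0
    for t, f_tq in query_tf.items():
        f_td = doc_tf.get(t, 0)
        if f_td == 0:
            continue
        f_hat = normalised_tf(f_td, doc_len, avg_len, c)
        p2 = f_td / (f_td + 1.0)  # Laplace after-effect: P2 = f_t,d / (f_t,d + 1)
        score += f_tq * (-log_poisson_p1(f_hat, coll_tf[t], n_docs)) * (1.0 - p2)
    return score
```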
3. ADAPTIVE DISTRIBUTIONAL RANKING
Our ADR models follow the basic DFR rationale presented above, but adapt the computation of P_1 in Eqn. 1 to the best-fitting distribution of non-informative term collection frequencies for each collection. The ADR algorithm is shown below.

Algorithm ADR(C)
Input: Collection C; Y = {parameterised statistical models}
1. extract the set T of non-informative terms in C
2. for each model M ∈ Y
3.   find M's best parameters for fitting T
4. find M̂ ∈ Y that fits T best
5. replace P_1 in Eqn. 1 by M̂
Output: an ADR model derived for C

Given some collection C and a set Y of candidate parameterised statistical models, we determine for each candidate model the optimal parameter values (if any) that make it fit how non-informative terms are distributed in C (step 3 in the algorithm), select the best-fitting distribution (step 4), and plug it in the place of P_1 in Eqn. 1 (step 5). The output is an ADR model adapted specifically to C (i.e. a collection-specific ranking model). In this work, we use as candidate parameterised statistical distributions all the discrete models in [16], namely the geometric, negative Binomial, Poisson, power law and Yule–Simon. However, any other set of candidate distributions can be used in step 2. We explain steps 3–5 next.

A parameterised statistical distribution M = {g(T | θ) : θ ∈ Θ}, where θ ∈ Θ is a real- or integer-valued vector, is a family of probability density or mass functions g(T | θ). Parameter estimation is then the problem of finding, among all the probability density/mass functions of M, the θ̂ that most likely generated T. We estimate θ̂ with maximum likelihood estimation as follows: we seek the density/mass function that makes the observed term frequencies "most likely" [13] using a likelihood function, L(θ | T), that specifies the likelihood of θ given T. θ̂ is then obtained by maximising the average likelihood:

θ̂ = arg max_{θ∈Θ} L(θ | T) / |T|, where L(θ | T) = Σ_{i=1}^{|T|} log g(f^i_t,C | θ)

and f^i_t,C is the collection frequency of the i-th term t ∈ T.

In step 4 of the ADR algorithm we select, from the set of candidate statistical distributions, the one that best quantifies the distribution of term frequencies for non-informative terms in C. We do this using Vuong's Closeness Test [20] and Akaike's Information Criterion [1], both of which are among the least controversial and most widely used statistical tests for model selection [5]. Both tests are based on the Kullback–Leibler divergence and will favour the same model in the limit of large sample sizes, i.e. the model that minimises the information loss w.r.t. the unknown but "true" model.

Let M̂ be the best-fitting distribution to the collection term frequencies of the non-informative terms. Then, in step 5, we replace the distribution P_1 in Eqn. 1 by M̂, yielding:

R(q, d) = Σ_{t∈q} f_t,q · (−log M̂) · (1 − P_2)    (3)

M̂ captures the same assumption as all DFR models: that non-informative terms are distributed in a certain way. However, whereas in DFR this assumption is not tested, ADR empirically validates the choice of M̂ per collection.
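The following is a minimal sketch of steps 3 and 4, assuming the collection frequencies of the already-detected non-informative terms are held in a NumPy array. It fits each candidate distribution by maximum likelihood and picks the winner by AIC; Vuong's test, which we also apply, is omitted for brevity, and the starting values and bounds are illustrative.

```python
import numpy as np
from scipy import optimize, stats

def neg_log_likelihood(logpmf, freqs):
    # NLL objective for a given log-pmf; maximising the average likelihood
    # L(theta|T)/|T| has the same argmax as minimising this sum.
    return lambda theta: -np.sum(logpmf(freqs, *theta))

def fit_and_select(freqs):
    """freqs: collection frequencies f_t,C of non-informative terms (>= 1)."""
    candidates = {
        # name: (log-pmf, initial parameters, box bounds)
        'poisson':      (stats.poisson.logpmf,   [freqs.mean()], [(1e-6, None)]),
        'geometric':    (stats.geom.logpmf,      [0.5], [(1e-6, 1 - 1e-6)]),
        'yule-simon':   (stats.yulesimon.logpmf, [2.0], [(1e-6, None)]),
        'neg-binomial': (stats.nbinom.logpmf,    [1.0, 0.5],
                         [(1e-6, None), (1e-6, 1 - 1e-6)]),
        'power-law':    (stats.zipf.logpmf,      [2.0], [(1.0 + 1e-6, None)]),
    }
    fits = {}
    for name, (logpmf, x0, bounds) in candidates.items():
        res = optimize.minimize(neg_log_likelihood(logpmf, freqs), x0=x0,
                                bounds=bounds, method='L-BFGS-B')
        aic = 2 * len(x0) + 2 * res.fun  # Akaike's Information Criterion
        fits[name] = (aic, res.x)
    best = min(fits, key=lambda n: fits[n][0])  # lowest AIC = least info loss
    return best, fits[best][1]
```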
4. EVALUATION
We next evaluate the retrieval performance of our ADR models. As ADR can produce different ranking models for different datasets, we split this section into two parts. Section 4.1 presents our datasets and the ADR models instantiated for these datasets. Section 4.2 compares the retrieval performance of these ADR models against relevant baselines.
As datasets, we use TREC disks 4 and 5 (TREC-d45) with queries 301-450 and 601-700, and ClueWeb09 cat. B (CWEB09) with queries 1-200, minus query 20, for which no documents are judged relevant. All datasets are indexed without stopword removal and without stemming using Indri 5.10. For our ADR models, we must identify the non-informative terms in each collection and the best-fitting distribution of their collection frequencies f_t,C. We detect non-informative terms using SVM classification, as follows. Given an initial list of 40 informative and 40 non-informative terms (manually compiled by three humans with complete agreement, see Table 1), we compute the term weights of these terms (using IDF [11], x_I [3], residual IDF (RIDF) [4], gain [15] and the z-measure [8]; see [12] for an overview of term weighting), and use these term weights as features to train a binary SVM classifier using 10-fold cross-validation on all feature combinations. We use an SVM approach instead of any of the above term-weighting approaches alone, as the latter require (manually) setting some ad hoc threshold above/below which terms are considered informative/non-informative. The best classifier correctly classifies ≈86% of the terms using RIDF and gain term weights as features. We use this classifier to classify each term in TREC-d45 and CWEB09 as informative or non-informative.
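A sketch of the classifier training, assuming the term-weight features (here the RIDF and gain pair that performed best) have already been computed for the seed terms of Table 1; the toy arrays, kernel and scaling choices below are illustrative placeholders, not our exact setup.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature matrix: one row per seed term, columns = (RIDF, gain).
# In the paper these are computed from the index for the 80 terms of Table 1.
X = np.array([[2.1, 0.9], [1.8, 0.7], [0.3, 0.1], [0.2, 0.2]])
y = np.array([1, 1, 0, 0])  # 1 = informative, 0 = non-informative

clf = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(clf, X, y, cv=2).mean())  # paper: 10-fold CV, ~86% acc.

# The fitted classifier then labels every term in the collection:
clf.fit(X, y)
# labels = clf.predict(all_term_features)  # hypothetical full feature matrix
```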
Informative terms                  Non-informative terms
hypothermia      intercepted       welcome    awesome
congolese        furloughs         beginner   jolly
outpaced         randomization     spade      out
anthropocentric  existentialist    delete     temp
iridescence      canvass           quit       feels
archdiocesan     colonisation      least      loss
nonconformist    airbrushed        silly      clear
overclocking     leviathans        cent       off
nominalization   inflammation      jar        test
translucent      handmade          chat       fork
shortwave        monasticism       yards      move
crystallography  aperitif          fast       pair
expressionist    pathologize       view       stop
cephalopod       abolishment       fold       colour
paparazzi        bookcase          roll       sit
presided         hydraulic         follow     back
beneficiary      vested            pen        hello
convection       floods            day        stair
custodian        chivalry          flash      best
populist         constrained       money      considerable
Table 1: Informative and non-informative terms.

Next, using steps 3–4 of the ADR algorithm, we find that the best-fitting statistical model to the distribution of non-informative term frequencies in both datasets is the discrete Yule–Simon (YS) distribution, with parameter p = 1.804 (TREC-d45) and p = 1.627 (CWEB09).

The YS distribution is defined for x ∈ Z+, x ≥ 1, as:

P_YS(x | p+1) = (p + 1) · Γ(x) · Γ(p + 1) / Γ(x + p + 1)    (4)

where p > 0. Replacing x with f̂_t,d (see Section 2), we obtain:

P_YS(f̂_t,d | p+1) = (p + 1) · Γ(f̂_t,d) · Γ(p + 1) / Γ(f̂_t,d + p + 1)    (5)

which, for a term t, returns the probability of observing f̂_t,d occurrences of t. As with the original Poisson model [2], Eqn. 5 is theoretically dubious, as it is a discrete distribution whose input is a real number. Consequently, we approximate the Γ function with the Lanczos method, identically to [2]. Plugging Eqn. 5 into Eqn. 1 gives:

R(q, d) = Σ_{t∈q} f_t,q · (−log P_YS) · (1 − P_2)    (6)

We refer to Eqn. 6 as our YS ADR model, or YS for short.
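A sketch of Eqns. 5–6, again assuming index statistics are at hand. Here math.lgamma computes log Γ directly, which serves the same purpose as the Lanczos approximation we use for real-valued arguments.

```python
import math

def log_p_ys(f_hat, p):
    # Eqn. 5 in log space:
    # log P_YS(f^ | p+1) = log(p+1) + lgamma(f^) + lgamma(p+1) - lgamma(f^+p+1)
    return (math.log(p + 1.0) + math.lgamma(f_hat)
            + math.lgamma(p + 1.0) - math.lgamma(f_hat + p + 1.0))

def ys_term_score(f_tq, f_td, f_hat, p):
    # Eqn. 6 summand: f_t,q * (-log P_YS) * (1 - P2), Laplace after-effect P2
    p2 = f_td / (f_td + 1.0)
    return f_tq * (-log_p_ys(f_hat, p)) * (1.0 - p2)

# e.g. with the fitted TREC-d45 parameter p = 1.804 and f^ = 3.2:
# ys_term_score(1, 3, 3.2, 1.804)
```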
Figure 1: Collection term frequencies for non-informative terms (grey) in (a) TREC Disks 4 & 5 and (b) CWEB09. Superimposed on each panel are MLE-fitted log-logistic, Poisson, power law and Yule–Simon distributions as references.
Figure 2: Per-query difference in nDCG between YSL2 and LMDir for all queries. The horizontal line indicates no difference; points above zero favour YSL2 over LMDir.
Baselines and Tuning.
We compare our ADR ranking model (YS) to a Poisson (P), tf-idf (I_n), log-logistic (LL), smoothed power law (SPL) and query-likelihood unigram LM with Dirichlet smoothing (LMDir). P and I_n are original DFR models [2]; LL [6] and SPL [7] are information models that extend the original DFR models. All DFR/ADR models are postfixed with L2, meaning that Laplace and logarithmic term normalisation (see Section 2) are used. Following previous work [2, 7], the parameter values of the distributions used in DFR/ADR (e.g. Poisson or Yule–Simon) are induced from the collection statistics in two ways (we experiment with both): T_tc = f_t,C / |C|, or T_dc = n_t / |C|. E.g., PL2-T_tc is a Poisson model using Laplace and logarithmic term normalisation, where the Poisson's λ = T_tc = f_t,C / |C| (see Eqn. 2), and YSL2-T_dc is a Yule–Simon ADR model using Laplace and logarithmic term normalisation, where the Yule–Simon's p = T_dc = n_t / |C| (see Eqn. 4). We retrieve the top-1000 documents according to each ranking model using title-only queries, and measure performance using P@10, Bpref, ERR@20, nDCG@10 and nDCG. All models are tuned using 3-fold cross-validation, and we report the average over all three test folds. We vary c (the free parameter in the logarithmic term normalisation; see Section 2) in the range {…} [7], and µ of LMDir in the range {…}.
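The two induction schemes reduce to one statistic per term; a trivial sketch, with f_tC (collection frequency), n_t (document frequency) and n_docs (|C|) assumed to come from the index:

```python
def induce_parameter(f_tC, n_t, n_docs, scheme='tc'):
    # T_tc = f_t,C / |C|; T_dc = n_t / |C|. The result is used directly as
    # the distribution parameter (lambda for PL2, p for YSL2).
    return (f_tC if scheme == 'tc' else n_t) / n_docs
```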
Findings.

The results are shown in Table 2. For CWEB09, our YSL2 models obtain the best performance on all measures. For TREC-d45, our YSL2 models obtain the best performance on early-precision measures (P@10, ERR@20 and nDCG@10), while the SPLL2 models perform best on nDCG and Bpref (with YSL2 following closely). As Fig. 1 shows, the power law (used in SPLL2) and the Yule–Simon (used in YSL2) have very similar fits, which explains their similar scores in Table 2. In Fig. 1 we also see that the log-logistic, power law and Yule–Simon all approximate the head of the distribution (collection term frequencies up to ≈ …).

                   TREC disks 4 & 5                          ClueWeb09 cat. B.
Model            nDCG    P@10    Bpref   ERR@20  nDCG@10   nDCG    P@10    Bpref   ERR@20  nDCG@10
LMDir            .4643   .3845   .2239   .1043   .3968     .2973   .2586   .2209   .0973   .1769
PL2-Ttc [2]      .2524*  .1273*  .1009*  .0359*  .1332*    .1448*  .0712*  .1258*  .0211*  .0472*
PL2-Tdc [2]      .2487*  .1217*  .0960*  .0347*  .1273*    .1444*  .0709*  .1252*  .0314*  .0471*
InL2-Ttc [2]     .2917*  .1627*  .1114*  .0478*  .1742*    .1596*  .0782*  .1405*  .0352*  .0511*
InL2-Tdc [2]     .2818*  .1626*  .1088*  .0481*  .1745*    .1596*  .0783*  .1407*  .0352*  .0512*
LLL2-Ttc [6]     .4812   .4049   .2341   .1072   .4142     .3184   .2542   .2349   .0926   .1706
LLL2-Tdc [6]     .4810   .3982   .2329   .1069   .4097     .3180   .2542   .2349   .0928   .1707
SPLL2-Ttc [7]    .4863   .4144   .2375   .1103   .4276     .3207   .2529   .2357   .0945   .1720
SPLL2-Tdc [7]    .4876   .4176   .2387   .1107   .4299     .3224   .2586   .2370   .0958   .1752
YSL2-Ttc (ADR)   .4644   .3982   .2280   .1048   .4069     .3197   .2601   .2359   .0951   .1752
YSL2-Tdc (ADR)   .4860   .4182   .2381   .1113   .4312     .3240   .2666   .2376   .0985   .1810

Table 2: Retrieval performance. Grey denotes larger than the LMDir baseline. Bold marks the best results. * marks a statistically significant difference from the LMDir baseline using a t-test at the 0.05 level.
5. RELATED WORK
Our work can be seen as a refinement of DFR ranking
models. Several other approaches attempt such refinements. For example, Clinchant and Gaussier [6] formulated heuristic retrieval constraints to assess the validity of DFR ranking models, and found that several DFR ranking models do not conform to these heuristics. On this basis, a simplified DFR model based on a log-logistic (LL) statistical distribution was proposed, which adheres to all heuristic retrieval constraints. Experimental evaluation showed that the LL model improved MAP on several collections. In a follow-up study, Clinchant and Gaussier [7] introduced information models, which are simplified DFR models that do not rely on so-called after-effect normalisation. They instantiated the LL and a "smoothed power law" (SPL) ranking model, and showed that both models tend to outperform a LM and BM25, but not two original DFR models, on smaller TREC and CLEF collections. Closest to ours is the work by Hui et al. [10], who developed DFR extensions where the distributions of f_t,C are modelled using multiple statistical distributions. However, (i) no justification or empirical validation of the choice of statistical distributions used was given, and (ii) the distribution of all terms, rather than only the non-informative terms, was used. Finally, Hui et al.'s models also removed e.g. collection-wide term frequencies from the ranking component, hence effectively removing the notion of divergence from the DFR framework.
6. CONCLUSION
Divergence From Randomness (DFR) ranking models assume that informative terms in a collection are distributed differently than non-informative terms. Different DFR models are produced depending on the statistical model (e.g. Poisson, geometric) used to quantify the distribution of non-informative terms, where an informative term is detected by measuring the divergence of its distribution from the distribution of non-informative terms. However, there is little evidence that the distributions used in DFR actually fit the distribution of non-informative terms. To address this, we presented a novel DFR extension called Adaptive Distributional Ranking (ADR), which adapts the ranking computation to the statistics of the specific collection being processed each time. Our ADR models first determine the best-fitting distribution of non-informative terms, and then integrate this distribution into ranking. Experiments on TREC data showed that our ADR models outperformed DFR models (and their extensions), and achieved performance comparable to a query likelihood LM. In future work we plan to experiment with additional collections. We will also study automatically deriving ad hoc distributions suited to the collection data in ADR, instead of selecting among a list of standard distributions.
7. ACKNOWLEDGMENTS
Work partially funded by C. Lioma's FREJA research excellence fellowship (grant no. 790095).
8. REFERENCES
[1] H. Akaike. A new look at the statistical model identification. IEEE TAC, 19(6):716–723, 1974.
[2] G. Amati and C. J. van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM TOIS, 20(4):357–389, 2002.
[3] A. Bookstein and D. R. Swanson. Probabilistic models for automatic indexing. JASIS, 25(5):312–316, 1974.
[4] K. Church and W. Gale. Inverse document frequency (IDF): A measure of deviations from Poisson. In NLP-VLC, pages 283–295. Springer, 1999.
[5] K. A. Clarke. Nonparametric model discrimination in international relations. JCR, 47(1):72–93, 2003.
[6] S. Clinchant and É. Gaussier. Bridging language modeling and divergence from randomness models: A log-logistic model for IR. In ICTIR, pages 54–65, 2009.
[7] S. Clinchant and É. Gaussier. Information-based models for ad hoc IR. In SIGIR, pages 234–241. ACM, 2010.
[8] S. P. Harter. A probabilistic approach to automatic keyword indexing. Part I: On the distribution of specialty words in a technical literature. JASIS, 26(4):197–206, 1975.
[9] S. P. Harter. A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. JASIS, 26(5):280–289, 1975.
[10] K. Hui, B. He, T. Luo, and B. Wang. Relevance weighting using within-document term statistics. In CIKM, pages 99–104. ACM, 2011.
[11] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. JD, 28(1):11–21, 1972.
[12] C. Lioma and R. Blanco. Part of speech based term weighting for information retrieval. In ECIR, volume 5478 of LNCS, pages 412–423. Springer, 2009.
[13] I. J. Myung. Tutorial on maximum likelihood estimation. JMP, 47(1):90–100, 2003.
[14] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: A search engine for advanced retrieval on the web. Novatica, VIII(1):49–56, 2007.
[15] K. Papineni. Why inverse document frequency? In ACL, pages 1–8. ACL, 2001.
[16] C. Petersen, J. G. Simonsen, and C. Lioma. Power law distributions in information retrieval. ACM TOIS, 21(4):1–36, 2015.
[17] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR, pages 275–281. ACM, 1998.
[18] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR, pages 232–241. Springer, 1994.
[19] H. A. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.
[20] Q. H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2):307–333, 1989.