Adaptive Distributional Extensions to DFR Ranking

Casper Petersen, University of Copenhagen, [email protected]
Jakob Grue Simonsen, University of Copenhagen, [email protected]
Kalervo Järvelin, University of Tampere, [email protected]
Christina Lioma, University of Copenhagen, [email protected]
ABSTRACT
Divergence From Randomness (DFR) ranking models assume that informative terms are distributed in a corpus differently than non-informative terms. Different statistical models (e.g. Poisson, geometric) are used to model the distribution of non-informative terms, producing different DFR models. An informative term is then detected by measuring the divergence of its distribution from the distribution of non-informative terms. However, there is little empirical evidence that the distributions of non-informative terms used in DFR actually fit current datasets. Practically, this risks providing a poor separation between informative and non-informative terms, thus compromising the discriminative power of the ranking model. We present a novel extension to DFR, which first detects the best-fitting distribution of non-informative terms in a collection, and then adapts the ranking computation to this best-fitting distribution. We call this model Adaptive Distributional Ranking (ADR) because it adapts the ranking to the statistics of the specific dataset being processed each time. Experiments on TREC data show ADR to outperform DFR models (and their extensions) and to be comparable in performance to a query likelihood language model (LM).
CCS Concepts: • Information systems → Retrieval models and ranking;
1. INTRODUCTION
Early work on automatic indexing [3, 8, 9] observed that informative words, e.g. those belonging to a technical vocabulary, are distributed in a document collection differently than non-informative words, e.g. those usually treated as stopwords. Specifically, informative words were observed to appear more densely in a few so-called elite documents. Based on this, non-informative words were modelled by a Poisson distribution, and it was hypothesised
that an informative word can be detected by measuring the extent to which its distribution deviates from a Poisson distribution, or, in other words, by testing the hypothesis that the word distribution over the whole document collection does not fit the Poisson model [2]. This so-called 2-Poisson model has led to well-known ranking models such as BM25 [18] and Divergence from Randomness (DFR) [2, 14].

For DFR ranking models in particular (presented in Section 2), the idea that non-informative terms tend to follow a specific distribution is central. Different distributions (e.g. the Poisson or geometric distribution) can be used to obtain different DFR ranking models. The goal is to choose a distribution that provides a good fit to the empirical distribution of non-informative terms in a collection C. However, there is little empirical evidence that the distributions used in DFR actually fit current datasets. Practically, this risks providing a poor separation between informative and non-informative terms, thus compromising the discriminative power of the ranking model.

Motivated by this, we present a novel extension to DFR, which first detects the best-fitting distribution (among a candidate set of distributions) of non-informative terms in a collection, and then adapts the ranking computation to this "best-fitting" distribution. We call this model Adaptive Distributional Ranking (ADR, presented in Section 3) because it adapts the ranking to the statistics of the specific dataset being processed each time. With our approach, only one ranking model is obtained per collection: the one that provides the best fit to the distribution of non-informative terms in that collection. Evaluation on TREC datasets (Section 4) shows that our ADR models outperform the original DFR models, as well as more recent DFR extensions, and perform on a par with a query likelihood LM [17].
2. DIVERGENCE FROM RANDOMNESS
Given a query q and a document d, a DFR ranking model estimates the relevance R(q, d) of d to q as [2]:

R(q, d) = Σ_{t∈q} f_t,q · (−log P_1) · (1 − P_2)    (1)

where t is a term, f_t,q is the frequency of t in q, −log P_1 is the information content of t in d, and (1 − P_2) is the risk of accepting t as a descriptor of d's topic. We explain each of these two components next.

The information content −log P_1 measures the divergence of f_t,d from f_t,C, where a higher divergence means more information is carried by t in d. The assumption is that terms that bring little information are distributed over all documents in a certain way across the entire corpus; different distributions (called "models of randomness" in DFR) give rise to different DFR ranking models. For example, using a Poisson distribution, P_1 is:

P_1(t, λ, d) = e^{−λ} λ^{f̂_t,d} / f̂_t,d!    (2)

where λ = f_t,C / |C| is the parameter of the Poisson distribution, and f̂_t,d = f_t,d · log(1 + c · avg_l / |d|) is a logarithmic term frequency normalisation, where c is a free parameter, avg_l is the average document length in the collection, and |d| is the length of d (its number of terms). Eqn. 2 returns the probability of seeing f̂_t,d occurrences of t in d, and effectively tests whether t's distribution on C fits the Poisson distribution. If the probability of obtaining f̂_t,d occurrences of t is low, then t carries a high amount of information [2]. In addition to the Poisson distribution used in this example, DFR models can also be instantiated with the geometric, tf-idf (I_n), tf-itf (I_F) and tf-expected-idf (I_ne) models [2].

The second component of Eqn. 1, the information gain P_2, is the conditional probability of encountering f_t,d + 1 occurrences of t in d. If P_2 is high, the risk (1 − P_2) associated with accepting t as a descriptor of d is low. Such "after-effect" sampling was taken [2] to be normalised either with Laplace normalisation, P_2 = f_t,d / (f_t,d + 1), or Bernoulli normalisation, P_2 = (f_t,C + 1) / (n_t · (f_t,d + 1)), where n_t is the number of documents in which t occurs.
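To make the ranking mechanics concrete, the following is a minimal Python sketch (not the authors' implementation) of Eqns. 1 and 2 with Laplace after-effect normalisation. The index statistics (f_t,d, f_t,C, |C|, avg_l) are assumed to come from an inverted index; the logarithm base only rescales scores and is taken as natural here.

```python
import math

def normalised_tf(f_td, doc_len, avg_len, c=1.0):
    # f^_t,d = f_t,d * log(1 + c * avg_l / |d|): logarithmic tf normalisation
    return f_td * math.log(1.0 + c * avg_len / doc_len)

def log_poisson_p1(f_hat, f_tC, n_docs):
    # Eqn. 2 in log space: log P1 = -lambda + f^ * log(lambda) - log(f^!),
    # with lambda = f_t,C / |C|; lgamma(f^ + 1) stands in for log(f^!)
    # because the normalised frequency f^ is real-valued.
    lam = f_tC / n_docs
    return -lam + f_hat * math.log(lam) - math.lgamma(f_hat + 1.0)

def dfr_pl2_score(query_tf, doc_tf, doc_len, coll_tf, n_docs, avg_len, c=1.0):
    # Eqn. 1: R(q,d) = sum_t f_t,q * (-log P1) * (1 - P2)
    score = 0.0
    for t, f_tq in query_tf.items():
        f_td = doc_tf.get(t, 0)
        if f_td == 0:
            continue
        f_hat = normalised_tf(f_td, doc_len, avg_len, c)
        p2 = f_td / (f_td + 1.0)  # Laplace after-effect: P2 = f_t,d / (f_t,d + 1)
        score += f_tq * (-log_poisson_p1(f_hat, coll_tf[t], n_docs)) * (1.0 - p2)
    return score
```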
3. ADAPTIVE DISTRIBUTIONAL RANKING
Our ADR models follow the basic DFR rationale presented above, but adapt the computation of P_1 in Eqn. 1 to the best-fitting distribution of non-informative term collection frequencies for each collection. The ADR algorithm is shown below.

Algorithm ADR(C)
Input: Collection C; Y = {parameterised statistical models}
1. extract the set T of non-informative terms in C
2. for each model M ∈ Y
3.   find M's best parameters for fitting T
4. find M̂ ∈ Y that fits T best
5. replace P_1 in Eqn. 1 by M̂
Output: an ADR model derived for C

Given some collection C and a set Y of candidate parameterised statistical models, we determine for each candidate model the optimal parameter values (if any) that make it fit how non-informative terms are distributed in C (step 3 in the algorithm), select the best-fitting distribution (step 4), and plug it in the place of P_1 in Eqn. 1 (step 5). The output is an ADR model adapted specifically to C (i.e. a collection-specific ranking model). In this work, we use as candidate parameterised statistical distributions all the discrete models in [16], namely the geometric, negative Binomial, Poisson, power law and Yule–Simon. However, any other set of candidate distributions can be used in step 2. We explain steps 3–5 next.

A parameterised statistical distribution M = {g(T | θ) : θ ∈ Θ}, where θ ∈ Θ is a real- or integer-valued vector, is a family of probability density or mass functions g(T | θ). Parameter estimation is then the problem of finding, among all the probability density/mass functions of M, the θ̂ that most likely generated T. We estimate θ̂ with maximum likelihood estimation as follows: we seek the density/mass function that makes the observed term frequencies "most likely" [13] using a likelihood function, L(θ | T), that specifies the likelihood of θ given T. θ̂ is then obtained by maximising the average likelihood:

θ̂ = arg max_{θ∈Θ} L(θ | T) / |T|, where L(θ | T) = Σ_{i=1}^{|T|} log g(f^i_t,C | θ)

and f^i_t,C is the collection frequency of the i-th term t ∈ T.

In step 4 of the ADR algorithm we select, from the set of candidate statistical distributions, the one that best quantifies the distribution of term frequencies for non-informative terms in C. We do this using Vuong's Closeness Test [20] and Akaike's Information Criterion [1], both of which are among the least controversial and most widely used statistical tests for model selection [5]. Both tests are based on the Kullback–Leibler divergence and will favour the same model in the limit of large sample sizes, i.e. the model that minimises the information loss w.r.t. the unknown but "true" model.

Let M̂ be the best-fitting distribution to the collection term frequencies of the non-informative terms. Then, in step 5, we replace the distribution P_1 in Eqn. 1 by M̂, yielding:

R(q, d) = Σ_{t∈q} f_t,q · (−log M̂) · (1 − P_2)    (3)

M̂ captures the same assumption as all DFR models: that non-informative terms are distributed in a certain way. However, whereas in DFR this assumption is not tested, ADR empirically validates the choice of M̂ per collection.
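The following is a minimal sketch of steps 3 and 4, assuming the collection frequencies of the already-detected non-informative terms are held in a NumPy array. It fits each candidate distribution by maximum likelihood and picks the winner by AIC; Vuong's test, which we also apply, is omitted for brevity, and the starting values and bounds are illustrative.

```python
import numpy as np
from scipy import optimize, stats

def neg_log_likelihood(logpmf, freqs):
    # NLL objective for a given log-pmf; maximising the average likelihood
    # L(theta|T)/|T| has the same argmax as minimising this sum.
    return lambda theta: -np.sum(logpmf(freqs, *theta))

def fit_and_select(freqs):
    """freqs: collection frequencies f_t,C of non-informative terms (>= 1)."""
    candidates = {
        # name: (log-pmf, initial parameters, box bounds)
        'poisson':      (stats.poisson.logpmf,   [freqs.mean()], [(1e-6, None)]),
        'geometric':    (stats.geom.logpmf,      [0.5], [(1e-6, 1 - 1e-6)]),
        'yule-simon':   (stats.yulesimon.logpmf, [2.0], [(1e-6, None)]),
        'neg-binomial': (stats.nbinom.logpmf,    [1.0, 0.5],
                         [(1e-6, None), (1e-6, 1 - 1e-6)]),
        'power-law':    (stats.zipf.logpmf,      [2.0], [(1.0 + 1e-6, None)]),
    }
    fits = {}
    for name, (logpmf, x0, bounds) in candidates.items():
        res = optimize.minimize(neg_log_likelihood(logpmf, freqs), x0=x0,
                                bounds=bounds, method='L-BFGS-B')
        aic = 2 * len(x0) + 2 * res.fun  # Akaike's Information Criterion
        fits[name] = (aic, res.x)
    best = min(fits, key=lambda n: fits[n][0])  # lowest AIC = least info loss
    return best, fits[best][1]
```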
4. EVALUATION
We next evaluate the retrieval performance of our ADR models. As ADR can produce different ranking models for different datasets, we split this section into two parts. Section 4.1 presents our datasets and the ADR models instantiated for these datasets. Section 4.2 compares the retrieval performance of these ADR models against relevant baselines.
As datasets, we use TREC disks 4 and 5 (TREC-d45) with queries 301-450 and 601-700, and ClueWeb09 cat. B (CWEB09) with queries 1-200, minus query 20, for which no documents are judged relevant. All datasets are indexed without stopword removal and without stemming using Indri 5.10. For our ADR models, we must identify the non-informative terms in each collection and the best-fitting distribution of their collection frequencies f_t,C. We detect non-informative terms using SVM classification, as follows. Given an initial list of 40 informative and 40 non-informative terms (manually compiled by three humans with complete agreement, see Table 1), we compute the term weights of these terms (using IDF [11], x_I [3], residual IDF (RIDF) [4], gain [15] and the z-measure [8]; see [12] for an overview of term weighting), and use these term weights as features to train a binary SVM classifier using 10-fold cross-validation on all feature combinations. We use an SVM approach instead of any of the above term-weighting approaches alone, as the latter require (manually) setting some ad hoc threshold above/below which terms are considered informative/non-informative. The best classifier correctly classifies ≈86% of the terms using RIDF and gain term weights as features. We use this classifier to classify each term in TREC-d45 and CWEB09 as informative or non-informative.
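A sketch of the classifier training, assuming the term-weight features (here the RIDF and gain pair that performed best) have already been computed for the seed terms of Table 1; the toy arrays, kernel and scaling choices below are illustrative placeholders, not our exact setup.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature matrix: one row per seed term, columns = (RIDF, gain).
# In the paper these are computed from the index for the 80 terms of Table 1.
X = np.array([[2.1, 0.9], [1.8, 0.7], [0.3, 0.1], [0.2, 0.2]])
y = np.array([1, 1, 0, 0])  # 1 = informative, 0 = non-informative

clf = make_pipeline(StandardScaler(), SVC())
print(cross_val_score(clf, X, y, cv=2).mean())  # paper: 10-fold CV, ~86% acc.

# The fitted classifier then labels every term in the collection:
clf.fit(X, y)
# labels = clf.predict(all_term_features)  # hypothetical full feature matrix
```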
Informative terms                  Non-informative terms
hypothermia      intercepted       welcome    awesome
congolese        furloughs         beginner   jolly
outpaced         randomization     spade      out
anthropocentric  existentialist    delete     temp
iridescence      canvass           quit       feels
archdiocesan     colonisation      least      loss
nonconformist    airbrushed        silly      clear
overclocking     leviathans        cent       off
nominalization   inflammation      jar        test
translucent      handmade          chat       fork
shortwave        monasticism       yards      move
crystallography  aperitif          fast       pair
expressionist    pathologize       view       stop
cephalopod       abolishment       fold       colour
paparazzi        bookcase          roll       sit
presided         hydraulic         follow     back
beneficiary      vested            pen        hello
convection       floods            day        stair
custodian        chivalry          flash      best
populist         constrained       money      considerable
Table 1: Informative and non-informative terms.

Next, using steps 3–4 of the ADR algorithm, we find that the best-fitting statistical model to the distribution of non-informative term frequencies in both datasets is the discrete Yule–Simon (YS) distribution, with parameter p = 1.804 (TREC-d45) and p = 1.627 (CWEB09).

The YS distribution is defined for x ∈ Z+, x ≥ 1, as:

P_YS(x | p+1) = (p + 1) · Γ(x) · Γ(p + 1) / Γ(x + p + 1)    (4)

where p > 0. Replacing x with f̂_t,d (see Section 2), we obtain:

P_YS(f̂_t,d | p+1) = (p + 1) · Γ(f̂_t,d) · Γ(p + 1) / Γ(f̂_t,d + p + 1)    (5)

which, for a term t, returns the probability of observing f̂_t,d occurrences of t. As with the original Poisson model [2], Eqn. 5 is theoretically dubious, as it is a discrete distribution whose input is a real number. Consequently, we approximate the Γ function with the Lanczos method, identically to [2]. Plugging Eqn. 5 into Eqn. 1 gives:

R(q, d) = Σ_{t∈q} f_t,q · (−log P_YS) · (1 − P_2)    (6)

We refer to Eqn. 6 as our YS ADR model, or YS for short.
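A sketch of Eqns. 5–6, again assuming index statistics are at hand. Here math.lgamma computes log Γ directly, which serves the same purpose as the Lanczos approximation we use for real-valued arguments.

```python
import math

def log_p_ys(f_hat, p):
    # Eqn. 5 in log space:
    # log P_YS(f^ | p+1) = log(p+1) + lgamma(f^) + lgamma(p+1) - lgamma(f^+p+1)
    return (math.log(p + 1.0) + math.lgamma(f_hat)
            + math.lgamma(p + 1.0) - math.lgamma(f_hat + p + 1.0))

def ys_term_score(f_tq, f_td, f_hat, p):
    # Eqn. 6 summand: f_t,q * (-log P_YS) * (1 - P2), Laplace after-effect P2
    p2 = f_td / (f_td + 1.0)
    return f_tq * (-log_p_ys(f_hat, p)) * (1.0 - p2)

# e.g. with the fitted TREC-d45 parameter p = 1.804 and f^ = 3.2:
# ys_term_score(1, 3, 3.2, 1.804)
```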
Figure 1: Collection term frequencies for non-informative terms (grey) in (a) TREC Disks 4 & 5 and (b) CWEB09. Superimposed on each panel are MLE-fitted log-logistic, Poisson, power law and Yule–Simon distributions as references.
Figure 2: Per-query difference in nDCG between YSL2 and LMDir for all queries. The horizontal line indicates no difference; points above zero favour YSL2 over LMDir.
Baselines and Tuning.
We compare our ADR ranking model (YS) to a Poisson (P), tf-idf (I_n), log-logistic (LL), smoothed power law (SPL) and query-likelihood unigram LM with Dirichlet smoothing (LMDir). P and I_n are original DFR models [2]; LL [6] and SPL [7] are information models that extend the original DFR models. All DFR/ADR models are postfixed with L2, meaning that Laplace and logarithmic term normalisation (see Section 2) are used. Following previous work [2, 7], the parameter values of the distributions used in DFR/ADR (e.g. Poisson or Yule–Simon) are induced from the collection statistics in two ways (we experiment with both): T_tc = f_t,C / |C|, or T_dc = n_t / |C|. E.g., PL2-T_tc is a Poisson model using Laplace and logarithmic term normalisation, where the Poisson's λ = T_tc = f_t,C / |C| (see Eqn. 2), and YSL2-T_dc is a Yule–Simon ADR model using Laplace and logarithmic term normalisation, where the Yule–Simon's p = T_dc = n_t / |C| (see Eqn. 4). We retrieve the top-1000 documents according to each ranking model using title-only queries, and measure performance using P@10, Bpref, ERR@20, nDCG@10 and nDCG. All models are tuned using 3-fold cross-validation, and we report the average over all three test folds. We vary c (the free parameter in the logarithmic term normalisation; see Section 2) in the range {…} [7], and µ of LMDir in the range {…}.
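The two induction schemes reduce to one statistic per term; a trivial sketch, with f_tC (collection frequency), n_t (document frequency) and n_docs (|C|) assumed to come from the index:

```python
def induce_parameter(f_tC, n_t, n_docs, scheme='tc'):
    # T_tc = f_t,C / |C|; T_dc = n_t / |C|. The result is used directly as
    # the distribution parameter (lambda for PL2, p for YSL2).
    return (f_tC if scheme == 'tc' else n_t) / n_docs
```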
Findings.

The results are shown in Table 2. For CWEB09, our YSL2 models obtain the best performance on all measures. For TREC-d45, our YSL2 models obtain the best performance on early-precision measures (P@10, ERR@20 and nDCG@10), while the SPLL2 models perform best on nDCG and Bpref (with YSL2 following closely). As Fig. 1 shows, the power law (used in SPLL2) and the Yule–Simon (used in YSL2) have very similar fits, which explains their similar scores in Table 2. In Fig. 1 we also see that the log-logistic, power law and Yule–Simon all approximate the head of the distribution (collection term frequencies up to ≈ …).

                   TREC disks 4 & 5                          ClueWeb09 cat. B.
Model            nDCG    P@10    Bpref   ERR@20  nDCG@10   nDCG    P@10    Bpref   ERR@20  nDCG@10
LMDir            .4643   .3845   .2239   .1043   .3968     .2973   .2586   .2209   .0973   .1769
PL2-Ttc [2]      .2524*  .1273*  .1009*  .0359*  .1332*    .1448*  .0712*  .1258*  .0211*  .0472*
PL2-Tdc [2]      .2487*  .1217*  .0960*  .0347*  .1273*    .1444*  .0709*  .1252*  .0314*  .0471*
InL2-Ttc [2]     .2917*  .1627*  .1114*  .0478*  .1742*    .1596*  .0782*  .1405*  .0352*  .0511*
InL2-Tdc [2]     .2818*  .1626*  .1088*  .0481*  .1745*    .1596*  .0783*  .1407*  .0352*  .0512*
LLL2-Ttc [6]     .4812   .4049   .2341   .1072   .4142     .3184   .2542   .2349   .0926   .1706
LLL2-Tdc [6]     .4810   .3982   .2329   .1069   .4097     .3180   .2542   .2349   .0928   .1707
SPLL2-Ttc [7]    .4863   .4144   .2375   .1103   .4276     .3207   .2529   .2357   .0945   .1720
SPLL2-Tdc [7]    .4876   .4176   .2387   .1107   .4299     .3224   .2586   .2370   .0958   .1752
YSL2-Ttc (ADR)   .4644   .3982   .2280   .1048   .4069     .3197   .2601   .2359   .0951   .1752
YSL2-Tdc (ADR)   .4860   .4182   .2381   .1113   .4312     .3240   .2666   .2376   .0985   .1810

Table 2: Retrieval performance. Grey denotes larger than the LMDir baseline. Bold marks the best results. * marks a statistically significant difference from the LMDir baseline using a t-test at the 0.05 level.
5. RELATED WORK
Our work can be seen as a refinement of DFR ranking
models. Several other approaches attempt such refinements. For example, Clinchant and Gaussier [6] formulated heuristic retrieval constraints to assess the validity of DFR ranking models, and found that several DFR ranking models do not conform to these heuristics. On this basis, a simplified DFR model based on a log-logistic (LL) statistical distribution was proposed, which adheres to all heuristic retrieval constraints. Experimental evaluation showed that the LL model improved MAP on several collections. In a follow-up study, Clinchant and Gaussier [7] introduced information models, which are simplified DFR models that do not rely on so-called after-effect normalisation. They instantiated the LL and a "smoothed power law" (SPL) ranking model, and showed that both models tend to outperform a LM and BM25, but not two original DFR models, on smaller TREC and CLEF collections. Closest to ours is the work by Hui et al. [10], who developed DFR extensions where the distributions of f_t,C are modelled using multiple statistical distributions. However, (i) no justification or empirical validation of the choice of statistical distributions used was given, and (ii) the distribution of all terms, rather than only the non-informative terms, was used. Finally, Hui et al.'s models also removed e.g. collection-wide term frequencies from the ranking component, hence effectively removing the notion of divergence from the DFR framework.
6. CONCLUSION
Divergence From Randomness (DFR) ranking models assume that informative terms in a collection are distributed differently than non-informative terms. Different DFR models are produced depending on the statistical model (e.g. Poisson, geometric) used to quantify the distribution of non-informative terms, where an informative term is detected by measuring the divergence of its distribution from the distribution of non-informative terms. However, there is little evidence that the distributions used in DFR actually fit the distribution of non-informative terms. To address this, we presented a novel DFR extension called Adaptive Distributional Ranking (ADR), which adapts the ranking computation to the statistics of the specific collection being processed each time. Our ADR models first determine the best-fitting distribution of non-informative terms, and then integrate this distribution into ranking. Experiments on TREC data showed that our ADR models outperformed DFR models (and their extensions), and achieved performance comparable to a query likelihood LM. In future work we plan to experiment with additional collections. We will also study automatically deriving ad hoc distributions suited to the collection data in ADR, instead of selecting among a list of standard distributions.
7. ACKNOWLEDGMENTS
Work partially funded by C. Lioma's FREJA research excellence fellowship (grant no. 790095).
8. REFERENCES
[1] H. Akaike. A new look at the statistical model identification. IEEE TAC, 19(6):716–723, 1974.
[2] G. Amati and C. J. van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM TOIS, 20(4):357–389, 2002.
[3] A. Bookstein and D. R. Swanson. Probabilistic models for automatic indexing. JASIS, 25(5):312–316, 1974.
[4] K. Church and W. Gale. Inverse document frequency (IDF): A measure of deviations from Poisson. In NLP-VLC, pages 283–295. Springer, 1999.
[5] K. A. Clarke. Nonparametric model discrimination in international relations. JCR, 47(1):72–93, 2003.
[6] S. Clinchant and É. Gaussier. Bridging language modeling and divergence from randomness models: A log-logistic model for IR. In ICTIR, pages 54–65, 2009.
[7] S. Clinchant and É. Gaussier. Information-based models for ad hoc IR. In SIGIR, pages 234–241. ACM, 2010.
[8] S. P. Harter. A probabilistic approach to automatic keyword indexing. Part I: On the distribution of specialty words in a technical literature. JASIS, 26(4):197–206, 1975.
[9] S. P. Harter. A probabilistic approach to automatic keyword indexing. Part II: An algorithm for probabilistic indexing. JASIS, 26(5):280–289, 1975.
[10] K. Hui, B. He, T. Luo, and B. Wang. Relevance weighting using within-document term statistics. In CIKM, pages 99–104. ACM, 2011.
[11] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. JD, 28(1):11–21, 1972.
[12] C. Lioma and R. Blanco. Part of speech based term weighting for information retrieval. In ECIR, volume 5478 of LNCS, pages 412–423. Springer, 2009.
[13] I. J. Myung. Tutorial on maximum likelihood estimation. JMP, 47(1):90–100, 2003.
[14] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: A search engine for advanced retrieval on the web. Novatica, VIII(1):49–56, 2007.
[15] K. Papineni. Why inverse document frequency? In ACL, pages 1–8. ACL, 2001.
[16] C. Petersen, J. G. Simonsen, and C. Lioma. Power law distributions in information retrieval. ACM TOIS, 21(4):1–36, 2015.
[17] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR, pages 275–281. ACM, 1998.
[18] S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR, pages 232–241. Springer, 1994.
[19] H. A. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.
[20] Q. H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57(2):307–333, 1989.