Evaluation Measures for Relevance and Credibility in Ranked Lists
Christina Lioma, Jakob Grue Simonsen
Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
{c.lioma,simonsen}@di.ku.dk
Birger Larsen
Department of Communication, University of Aalborg in Copenhagen, Copenhagen, Denmark
[email protected]
ABSTRACT
Recent discussions on alternative facts, fake news, and post-truth politics have motivated research on creating technologies that allow people not only to access information, but also to assess the credibility of the information presented to them by information retrieval systems. Whereas technology is in place for filtering information according to relevance and/or credibility [15], no single measure currently exists for evaluating the accuracy or precision (and more generally effectiveness) of both the relevance and the credibility of retrieved results. One obvious way of doing so is to measure relevance and credibility effectiveness separately, and then consolidate the two measures into one. There are at least two problems with such an approach: (I) it is not certain that the same criteria are applied to the evaluation of both relevance and credibility (and applying different criteria introduces bias to the evaluation); (II) many more and richer measures exist for assessing relevance effectiveness than for assessing credibility effectiveness (hence risking further bias).

Motivated by the above, we present two novel types of evaluation measures that are designed to measure the effectiveness of both relevance and credibility in ranked lists of retrieval results. Experimental evaluation on a small human-annotated dataset (that we make freely available to the research community) shows that our measures are expressive and intuitive in their interpretation.

KEYWORDS

relevance; credibility; evaluation measures
ACM Reference format:
Christina Lioma, Jakob Grue Simonsen and Birger Larsen. 2017. Evaluation Measures for Relevance and Credibility in Ranked Lists. In Proceedings of ICTIR '17, Amsterdam, Netherlands, October 1–4, 2017. DOI: http://dx.doi.org/10.1145/3121050.3121072
1 INTRODUCTION

Recent discussions on alternative facts, fake news, and post-truth politics have motivated research on creating technologies that allow people not only to access information, but also to assess the credibility of the information presented to them [8, 14].
In the broader area of information retrieval (IR), various methods for approximating [10, 23] or visualising [11, 17, 18, 21] information credibility have been presented, both stand-alone and in relation to relevance [15]. Collectively, these approaches can be seen as steps in the direction of building IR systems that retrieve information that is both relevant and credible. Given such a list of IR results, which are ranked decreasingly by both relevance and credibility, the question arises: how can we evaluate the quality of this ranked list?

One could measure retrieval effectiveness first, using any suitable existing relevance measure, such as NDCG or AP, and then separately measure credibility accuracy in a similar way, e.g. using the F-1 or the G-measure. This approach would output scores by two separate metrics, which would need to somehow be consolidated or considered together when optimising system performance. In such a case, and depending on the choice of relevance and credibility measures, it would not always be certain that the same criteria are applied to the evaluation of both relevance and credibility. For instance, whereas the state-of-the-art metrics in relevance evaluation treat relevance as graded and consider it in relation to the rank position of the retrieved documents (we discuss these in Section 2), no metrics exist that consider graded credibility accuracy in relation to rank position. Hence, using two separate metrics for relevance and credibility may, in practice, bias the overall evaluation process in favour of relevance, for which more thorough evaluation metrics exist.

To provide a more principled approach that obviates this bias, we present two new types of evaluation measures that are designed to measure the effectiveness of both relevance and credibility in ranked lists of retrieval results simultaneously and without bias in favour of either relevance or credibility. Our measures take as input a ranked list of documents, and assume that assessments (or their approximations) exist both for the relevance and for the credibility of each document. Given this information, our Type I measures define different ways of measuring the effectiveness of both relevance and credibility based on differences in the rank position of the retrieved documents with respect to their ideal rank position (when ranked only by relevance or credibility). Unlike Type I, our Type II measures operate directly on document scores of relevance and credibility, instead of rank positions. We evaluate our measures both axiomatically (in terms of their properties) and empirically on a small human-annotated dataset that we build specifically for the purposes of this work. We find that our measures are expressive and intuitive in their interpretation.

In this paper, we use metric and measure interchangeably, as is common in the IR community, even though the terms are not synonymous. Strictly speaking, measure should be used for more concrete or objective attributes, and metric should be used for more abstract, higher-level, or somewhat subjective attributes [2]. When discussing effectiveness, which is generally hard to define objectively, but for which we have some consistent feel, Black et al. argue that the term metric should be used [2].
2 RELATED WORK

The aim of evaluation is to measure how well some method achieves its intended purpose. This makes it possible to discover weaknesses in the given method, potentially leading to the development of improved approaches and generally more informed deployment decisions. For this reason, evaluation has been a strong driving force in IR, where, for instance, the literature on IR evaluation measures is rich and voluminous, spanning several decades. Generally speaking, relevance metrics for IR can be split into three high-level categories:
(i) earlier metrics, assuming binary relevance assessments;
(ii) later metrics, considering graded relevance assessments; and
(iii) more recent metrics, approximating relevance assessments from user clicks.
We overview some of the main developments in each of these categories next.
Binary relevance metrics are numerous and widely used. Examples include:
• Precision@k (P@k): the proportion of retrieved documents that are relevant, up to and including position k in the ranking;
• Average Precision (AP): the average of (un-interpolated) precision values (proportion of retrieved documents that are relevant) at all ranks where relevant documents are found;
• Binary Preference (bPref): this is identical to AP except that bPref ignores non-assessed documents (whereas AP treats non-assessed documents as non-relevant). Because of this, bPref does not violate the completeness assumption, according to which "all relevant documents within a test collection have been identified and are present in the collection" [5];
• Mean Reciprocal Rank (MRR): the reciprocal of the position in the ranking of the first relevant document only;
• Recall: the proportion of relevant documents that are retrieved;
• F-score: the equally weighted harmonic mean of precision and recall.
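For concreteness, the following Python sketch gives minimal implementations of these binary relevance metrics. The function names, the toy labels, and the convention of dividing AP by the total number of relevant documents in the collection are our own choices for illustration, not definitions taken from this paper.

# Illustrative implementations of the binary relevance metrics above.
# Function names and the toy example are ours, for illustration only.

def precision_at_k(rels, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(rels[:k]) / k

def average_precision(rels, n_relevant):
    """Mean of P@i over the ranks i at which relevant documents occur,
    divided by the total number of relevant documents (TREC convention)."""
    hits, precisions = 0, []
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / n_relevant if n_relevant else 0.0

def reciprocal_rank(rels):
    """1 / rank of the first relevant document (0 if none is retrieved)."""
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0

def recall(rels, n_relevant):
    """Proportion of all relevant documents that are retrieved."""
    return sum(rels) / n_relevant if n_relevant else 0.0

def f_score(p, r):
    """Equally weighted harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

if __name__ == "__main__":
    rels = [1, 0, 1, 1, 0]   # binary relevance of the top 5 retrieved results
    n_relevant = 4           # relevant documents in the whole collection
    p, r = precision_at_k(rels, 5), recall(rels, n_relevant)
    print(precision_at_k(rels, 3), average_precision(rels, n_relevant),
          reciprocal_rank(rels), f_score(p, r))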
There exist noticeably fewer graded relevance metrics than binary ones. The two main graded relevance metrics are NDCG and ERR:
• Normalised Discounted Cumulative Gain (NDCG): the cumulative gain a user obtains by examining the retrieval result up to a rank position, where the relevance scores of the retrieved documents are:
  • accumulated over all the rank positions that are considered,
  • discounted in order to devaluate late-retrieved documents, and
  • normalised in relation to the maximum score that this metric can possibly yield on an ideal reranking of the same documents.
Two useful properties of NDCG are that it rewards retrieved documents according to both (i) their degree (or grade) of relevance, and (ii) their rank position. Put simply, this means that the more relevant a document is and the closer to the top it is ranked, the higher the NDCG score will be [12].
• Expected Reciprocal Rank (ERR):
ERR operates on the same high-level idea as NDCG but differs from it in that it penalises documents that are shown below very relevant documents. That is, whereas NDCG makes the independence assumption that "a document in a given position has always the same gain and discount independently of the documents shown above it", ERR does not make this assumption, and, instead, considers (implicitly) the immediate context of each document in the ranking. In addition, instead of the discounting of NDCG, ERR approximates the expected reciprocal length of time that a user will take to find a relevant document. Thus, ERR can be seen as an extension of (the binary) MRR for graded relevance assessments [6].
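The sketch below illustrates these two graded relevance metrics. The exponential gain, the base-2 discount, and the cascade-style satisfaction probabilities follow the formulations commonly associated with [12] and [6]; the helper names and the toy grades are ours, and the snippet is illustrative rather than a reference implementation.

import math

# Minimal sketches of the two graded relevance metrics discussed above.

def dcg(grades):
    """Discounted cumulative gain of a ranked list of graded judgements."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades):
    """DCG normalised by the DCG of the ideal (descending) reordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0

def err(grades, max_grade=4):
    """Expected reciprocal rank: expectation of 1/rank at which the user
    stops, under a cascade model of per-document satisfaction."""
    p_continue, total = 1.0, 0.0
    for i, g in enumerate(grades, start=1):
        p_sat = (2 ** g - 1) / (2 ** max_grade)  # prob. the user is satisfied here
        total += p_continue * p_sat / i
        p_continue *= (1 - p_sat)                # otherwise the user reads on
    return total

if __name__ == "__main__":
    grades = [3, 2, 3, 0, 1]   # graded relevance of the ranked results
    print(ndcg(grades), err(grades))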
The most recent type of evaluation measures are designed to operate, not on traditionally-constructed relevance assessments (defined by human assessors), but on approximations of relevance assessments from user clicks (actual or simulated). Most of these metrics have underlying user models, which capture how users interact with retrieval results. In this case, the quality of the evaluation measure is a direct function of the quality of its underlying user model [24]. The main advances in this area include the following:
• Expected Browsing Utility (EBU): an evaluation measure whose underlying user click model has been tuned by observations over many thousands of real search sessions [24];
• Converting click models to evaluation measures: a general method for converting any click model into an evaluation metric [7]; and
• Online evaluation: various different algorithms for interleaving [19] or multileaving [3, 4, 20] multiple initial ranked lists into a single combined ranking, and, by approximating clicks (through user click models) on the resulting combined ranking, assigning credit to (hence evaluating) the methods that produced each initial ranked list [9].

In addition to the above three types of IR evaluation measures, there also exists further literature on IR measures that consider additional dimensions on top of relevance, such as, for instance, query difficulty [16]. To the best of our knowledge, none of these measures consider credibility. The closest to a credibility measure we could find is the work by Balakrishnan et al. [1] on source selection for deep web databases: their method considers the agreement between different sources in answering a query as an indication of the credibility of the sources. An adjusted version of this agreement is modeled as a graph with vertices representing sources. Given such a graph, the credibility (or quality) of each source is calculated as the stationary visit probability of a random walk on this graph.

The evaluation measures we present in Sections 4–5 are the only ones, to our knowledge, that are designed to operate both on relevance and credibility. Beyond these two particular dimensions, reasoning more generally about different dimensions of effectiveness, the F-score, and its predecessor, van Rijsbergen's E-score [22], are early examples of a single evaluation measure combining two different aspects, namely precision and recall. We return to this discussion in Section 5, where we present a variant of the F-score for aggregating relevance and credibility.
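As a toy illustration of the random-walk idea behind [1], the following sketch scores sources by the stationary distribution of a damped random walk (computed by power iteration) over a pairwise source-agreement matrix. The agreement values, the damping factor, and the function names are ours and are not taken from that paper.

# Toy illustration: credibility of sources as the stationary visit
# probability of a random walk on an agreement graph (in the spirit of [1]).

def stationary_distribution(agreement, damping=0.85, iters=100):
    n = len(agreement)
    # Row-normalise agreement weights into transition probabilities.
    trans = []
    for row in agreement:
        s = sum(row)
        trans.append([w / s if s else 1.0 / n for w in row])
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [(1 - damping) / n +
             damping * sum(p[j] * trans[j][i] for j in range(n))
             for i in range(n)]
    return p

if __name__ == "__main__":
    # Hypothetical pairwise agreement between three sources on a query.
    agreement = [[0.0, 0.8, 0.7],
                 [0.8, 0.0, 0.2],
                 [0.7, 0.2, 0.0]]
    print(stationary_distribution(agreement))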
3 PROBLEM FORMULATION

Given a ranked list of documents, the aim is to produce a measure that reflects how effective this ranking is with respect to both the relevance of these documents to some query and also the credibility of these documents (irrespective of a query). There are at least two basic ways to produce such a metric. Either:
(I) gauge the difference in rank position(s) between an input ranking and "ideal" relevance and credibility rankings, or
(II) employ relevance and credibility scores to gauge how well the input ranking reflects high versus low scores.
Note that while (II) is reminiscent of existing measures for relevance ranking, the fact that two distinct kinds of scores (relevance and credibility) – perhaps having different ranges and behaviour – must be combined may lead to further complications. Accordingly, in the remainder of the paper, we call measures
Type I if they are based primarily on differences in rank position, and
Type II if they are based primarily on relevance and credibility scores.

Regardless of whether it is Type I or Type II, we reason that any measure must be easily interpretable. Hence, its scores should be normalised between 0 and 1, where low scores should indicate poor rankings, and high scores should indicate good rankings. The extreme points (0 and 1) of the scale should preferably be attainable by particularly bad or particularly good rankings; as a minimum, if the ranking can be measured against an "ideal" ranking (as in, e.g. NDCG), the value 1 should be attainable by the ideal ranking.

In addition to the above, there also exist desiderata for evaluation measures that are more debatable (e.g., how the measure should act in case of identical ranking scores for distinct documents). Below, we list what we believe to be the most pertinent desiderata. The list encompasses desiderata tailored to evaluation measures that gauge ranking based on either rank position or on (relevance or credibility) scores. For the desiderata pertaining to rank position, we need the following ancillary definition. Let D_i be a document at rank i. We then define an error as any instance where
• either (a) the relevance of a document at rank i is greater than the relevance of a document at rank i − 1,
• or (b) the credibility of a document at rank i is greater than the credibility of a document at rank i − 1.

We posit the following desiderata (D1–D8 henceforth):

D1 Larger errors should be penalised more than smaller errors;
D2 Errors high in the ranking should be penalised more than errors low in the ranking;
D3 Let δ_r be the difference in relevance score between D_i and D_{i-1} when D_{i-1} is more relevant than D_i. Similarly, let δ_c be the difference in credibility score between D_i and D_{i-1} when D_{i-1} is more credible than D_i. Then, larger δ_r and δ_c values should imply larger error;
D4 Ceteris paribus, a credibility error on documents of high relevance should be penalised more than a credibility error on documents of low relevance;
D5 The metric should be well-defined even if all documents have identical ranking/credibility scores;
D6 Scaling the document scores used to produce the ranking by some constant should not affect the metric;
D7 If all documents have the same relevance score, the metric should function as a credibility metric; and vice versa;
D8 We should be able to adjust (by some easily interpretable parameter) how much we wish to penalise low credibility with respect to low relevance, if at all.

Next, we present two types of evaluation measures of relevance and credibility that satisfy (wholly or partially) the above desiderata: Type I measures (Section 4) operate solely on the rank positions of documents; Type II measures (Section 5) operate solely on document scores.

4 TYPE I MEASURES

Given a ranking of documents that we want to evaluate (let us call this the input ranking), we reason in terms of two additional ideal rankings: one by relevance only, and one by credibility only (the two ideal rankings are entirely independent of each other). So, for each document, we have:
(1) its rank position in the input ranking;
(2) its rank position in the ideal relevance ranking; and
(3) its rank position in the ideal credibility ranking.
The basic idea is then to take each adjacent pair of documents in the input ranking, check for errors in the input ranking compared to the ideal relevance and separately the ideal credibility ranking, and aggregate those errors. We explain next how we do this.

Let D_i be the document at rank position i in the input ranking.
We then denote by R^r_{D_i} the rank position of D_i in the ideal relevance ranking, and by R^c_{D_i} the rank position of D_i in the ideal credibility ranking. Note that the subscript i refers to the rank position of D in the input ranking at all times. That is, R^r_{D_i} should be read as: the position in the ideal relevance ranking of the document that is at position i in the input ranking; similarly for R^c_{D_i}.

Let the monus operator ·− be defined on non-negative real numbers by:

a ·− b = \begin{cases} 0 & \text{if } a \le b \\ a - b & \text{if } a > b \end{cases}    (1)

That is, a ·− b is simply subtraction as long as a > b, and otherwise just returns 0. Then, using the monus operator and the notation introduced above, we define a "relevance error" (ϵ_r) and a "credibility error" (ϵ_c) as:

ϵ_r = R^r_{D_i} ·− R^r_{D_{i+1}}    (2)
ϵ_c = R^c_{D_i} ·− R^c_{D_{i+1}}    (3)

In the above, i and i+1 are adjacent rank positions in the input ranking, and the "relevance error" is non-zero only if the document at position i in the input ranking is ranked after the other document in the ideal relevance ranking. Otherwise, the error is zero. Similarly for the "credibility error".

For example, if three documents A, B and C are ranked as C, A, B in the input ranking (i.e., D_1 = C, D_2 = A, D_3 = B), but ranked as R^r = [A, B, C] in an ideal relevance ranking, the two relevance error terms are:
(i) R^r_{D_1} ·− R^r_{D_2} = R^r_C ·− R^r_A = 3 ·− 1 = 2, and
(ii) R^r_{D_2} ·− R^r_{D_3} = R^r_A ·− R^r_B = 1 ·− 2 = 0.

Let n be the total number of documents in the ranked list. We define the Local Rank Error (LRE) evaluation measure as LRE = 0 if n = 1, and otherwise:

LRE = \sum_{i=1}^{n-1} \frac{1}{\log(1+i)} \big( (\mu + \epsilon_r)(\nu + \epsilon_c) - \mu\nu \big)    (4)

where ϵ_r, ϵ_c are the relevance error and credibility error defined in Equations 2–3 (computed for the pair of documents at positions i and i+1), and µ, ν are non-negative real numbers (with µ + ν >
0) controlling how much we wish to penalise low relevance with respect to low credibility. For instance, a high ν weighs credibility more, whereas a high µ weighs relevance more. The reason for the −µν term inside the summation is to ensure that the value of the LRE measure is zero if no error occurs.

Because Equation 4 is large for bad rankings and small for good rankings, we invert and normalise it (Normalised LRE, or NLRE) as follows:

NLRE = 1 - \frac{LRE}{C_{LRE}}    (5)
where C_LRE is the normalisation constant, defined as:

C_{LRE} = \sum_{j=0}^{\lfloor (n-1)/2 \rfloor} \frac{(n-2j-1)^2 + (\mu+\nu)(n-2j-1)}{\log(2+2j)}    (6)

ensuring that LRE / C_LRE ≤
1. Note the "floor" function of the angular brackets in the upper limit of the sum in Equation 6, which rounds the contents of the brackets down to the next (lowest) integer. The somewhat involved definition of C_LRE is due to the fact that we wish the maximal possible error attainable (i.e., rankings that produce the largest possible credibility and relevance errors) to correspond to a value of 1 for
LRE / C_LRE. Observe that
NLRE is 1 if no errors of any kind occur (because, in that case, LRE is 0).

Our NLRE measure satisfies the desiderata presented in Section 3 as follows:
• D1 holds if we interpret error size as the size of the rank differences;
• D2 holds due to the discount factor of 1/log(1+i);
• D3 is satisfied in the sense that larger differences in credibility or relevance ranks mean larger error;
• D4: the credibility error is scaled by the relevance error, if there is any (i.e., they are multiplied). If there is no relevance error, the credibility error is still strictly greater than zero;
• D5: the measure is well-defined in all cases;
• D6: no scores occur explicitly, only rankings, so scaling makes no difference;
• D7 is satisfied because if all documents have equal relevance, the relevance error will be zero, and the resulting score will measure only credibility error. And vice versa;
• D8 is satisfied through µ and ν.

We call NLRE a local measure because it is affected by differences in credibility and relevance between documents at each rank position in the input ranking. We present next a global evaluation metric that does not take such "local" effects at each rank into account (i.e., any differences in credibility and relevance between documents at rank i in the input ranking do not affect the global metric; only the total difference of credibility and relevance of the entire input ranking affects the global metric).
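Before turning to the global variant, the following sketch illustrates NLRE on a toy ranking, assuming a base-2 logarithm for the discount and our reading of the normalisation constant in Equation 6; the function names and the example data are ours, so treat it as an illustration rather than a reference implementation.

import math

# Illustrative sketch of NLRE (Equations 2-6). Rankings are lists of
# document ids; ideal_rel / ideal_cred map a document to its 1-based
# position in the ideal relevance / credibility ranking.

def monus(a, b):
    """The monus operator of Equation 1."""
    return a - b if a > b else 0

def nlre(ranking, ideal_rel, ideal_cred, mu=0.5, nu=0.5):
    n = len(ranking)
    if n == 1:
        return 1.0                              # no adjacent pairs, no errors
    lre = 0.0
    for i in range(1, n):                       # adjacent positions i, i+1 (1-based)
        d, d_next = ranking[i - 1], ranking[i]
        e_r = monus(ideal_rel[d], ideal_rel[d_next])    # Eq. 2
        e_c = monus(ideal_cred[d], ideal_cred[d_next])  # Eq. 3
        lre += ((mu + e_r) * (nu + e_c) - mu * nu) / math.log2(1 + i)  # Eq. 4
    # Normalisation constant (our reading of Eq. 6): worst-case alternating ranking.
    c = sum(((n - 2*j - 1) ** 2 + (mu + nu) * (n - 2*j - 1)) / math.log2(2 + 2*j)
            for j in range((n - 1) // 2 + 1))
    return 1.0 - lre / c                        # Eq. 5

if __name__ == "__main__":
    ranking = ["C", "A", "B"]                   # input ranking to evaluate
    ideal_rel = {"A": 1, "B": 2, "C": 3}        # ideal relevance ranking [A, B, C]
    ideal_cred = {"C": 1, "A": 2, "B": 3}       # input order happens to be ideal for credibility
    print(nlre(ranking, ideal_rel, ideal_cred))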
We define the Global Rank Error (GRE) evaluation measure as GRE = 0 if n = 1, and otherwise:

GRE = \Big( 1 + \mu \sum_{i=1}^{n-1} \frac{\epsilon_r}{\log(1+i)} \Big) \Big( 1 + \nu \sum_{i=1}^{n-1} \frac{\epsilon_c}{\log(1+i)} \Big) - 1    (7)

The notation is the same as for LRE. Similarly to LRE, we invert and normalise GRE, to produce its normalised version (NGRE), as follows:
NGRE = 1 - \frac{GRE}{C_{GRE}}    (8)

where C_GRE is the normalisation constant, defined as:

C_{GRE} = \mu\nu \Big( \sum_{j=0}^{\lfloor (n-1)/2 \rfloor} \frac{n-2j-1}{\log(2+2j)} \Big)^2 + (\mu+\nu) \sum_{j=0}^{\lfloor (n-1)/2 \rfloor} \frac{n-2j-1}{\log(2+2j)}    (9)

C_GRE is chosen to ensure that
GRE / C_GRE ≤ 1, with GRE / C_GRE = 1 attainable by a maximally bad ranking. The Σs in Equation 9 also use the floor function, exactly like in Equation 6.

As with NLRE, NGRE is 1 if no errors of any kind occur. In spite of the differences in computation, NGRE satisfies all eight desiderata for the same reasons given for NLRE.

The main intuitive difference between NLRE and NGRE is that in NGRE the credibility errors and relevance errors are cumulated separately, and then multiplied at the end. Thus, there is no immediate connection between credibility and relevance errors at the same rank (locally), hence we say that the metric is global.

The advantage of such a global versus local measure is that, in the global case, it is more straightforward to perform mathematical manipulations to achieve, e.g., normalisation, and easier to intuitively grasp what the measure means. The disadvantage is that local information is lost, and this may, in theory, lead to poorly performing measures. As the notion of "error" defined earlier is inherently a local phenomenon, the desiderata concerning errors are harder to satisfy formally for global measures.
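Under the same assumptions as the NLRE sketch above (base-2 discount, our reading of the normalisation constant in Equation 9), a corresponding sketch of NGRE on the same toy data follows; again, all names are ours and the snippet is illustrative only.

import math

# Illustrative sketch of NGRE (Equations 7-9), mirroring the NLRE sketch.

def ngre(ranking, ideal_rel, ideal_cred, mu=0.5, nu=0.5):
    n = len(ranking)
    if n == 1:
        return 1.0
    sum_r = sum_c = 0.0
    for i in range(1, n):
        d, d_next = ranking[i - 1], ranking[i]
        e_r = max(ideal_rel[d] - ideal_rel[d_next], 0)    # relevance error (Eq. 2)
        e_c = max(ideal_cred[d] - ideal_cred[d_next], 0)  # credibility error (Eq. 3)
        sum_r += e_r / math.log2(1 + i)                   # errors cumulated separately
        sum_c += e_c / math.log2(1 + i)
    gre = (1 + mu * sum_r) * (1 + nu * sum_c) - 1          # Eq. 7
    # Worst-case discounted error sum, used in the normalisation constant (Eq. 9).
    s_max = sum((n - 2*j - 1) / math.log2(2 + 2*j) for j in range((n - 1) // 2 + 1))
    c = mu * nu * s_max ** 2 + (mu + nu) * s_max
    return 1.0 - gre / c                                   # Eq. 8

if __name__ == "__main__":
    ranking = ["C", "A", "B"]
    ideal_rel = {"A": 1, "B": 2, "C": 3}
    ideal_cred = {"C": 1, "A": 2, "B": 3}
    print(ngre(ranking, ideal_rel, ideal_cred))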
5 TYPE II MEASURES

The two evaluation measures presented above (NLRE and NGRE) operate on the rank positions of documents. We now present three evaluation measures that operate, not on the rank positions of documents, but directly on document scores.

Given a ranking of documents that we wish to evaluate, let Z_r(i) denote the relevance score with respect to some query of the document ranked at position i, and let Z_c(i) denote the credibility score of the document ranked at position i. Then, we define the Weighted Cumulative Score (WCS) measure as:

WCS = \sum_{i=1}^{n} \frac{1}{\log(1+i)} \big( \lambda Z_r(i) + (1-\lambda) Z_c(i) \big)    (10)

where n is the total number of documents in the ranked list, and λ is a real number in [0, 1] controlling the impact of relevance versus credibility in the computation. We normalise WCS by dividing it by the value obtained by an "ideal" ranking maximising the value of WCS (this is inspired by the normalisation of the NDCG evaluation measure [12]):

NWCS = \frac{WCS}{IWCS}    (11)

where IWCS is the ideal
WCS, i.e. the maximum WCS that can be obtained on an ideal ranking of the same documents. NWCS uses a simple weighted combination of relevance or credibility scores in the same manner as the metric
NGRE, but is applicable directly to relevance or credibility scores (instead of ranking positions).

Our NWCS measure satisfies the following of the desiderata presented in Section 3:
• D1 is satisfied as both Z_r and Z_c occur linearly in WCS;
• D2 is satisfied due to the logarithmic discounting for increasing rank positions;
• D3 is satisfied by design as both Z_r and Z_c occur directly in the formula for WCS;
• D5 is satisfied as the measure is well-defined in all cases;
• D6 is satisfied due to normalization;
• D7 is satisfied because the contribution of the credibility scores (if all are equal) is just a constant in each term (and vice versa if relevance scores are all equal);
• D8 is satisfied due to the presence of λ.

Of all desiderata, only D4 is not satisfied: there is no scaling of credibility errors based on relevance. Despite this, the advantage of NWCS is that it is interpretable in much the same way as NDCG.

The main idea of the next two measures is that any two separate measures of either relevance or credibility, but not both, can be combined into a single aggregating measure of relevance and credibility. We next present two such aggregating measures.

We define the convex aggregating measure (CAM) of relevance and credibility as:
CAM = \lambda M_r + (1-\lambda) M_c    (12)

where M_r and M_c denote respectively any valid relevance and credibility evaluation measure, and λ is a real number in [0, 1] controlling the impact of the individual relevance or credibility measure in the overall computation. CAM is normalised if both M_r and M_c are normalised.

Our CAM measure satisfies the following desiderata:
• D1 is satisfied for the same reasons as NWCS;
• D2 is not satisfied in general;
• D3 is satisfied for the same reasons as NWCS;
• D4 is not satisfied in general;
• D5 is satisfied for the same reasons as NWCS;
• D6 is not satisfied in general; it is satisfied if both M_r and M_c are scale-free;
• D7 is satisfied because the contribution of the credibility scores (if all are equal) is just a constant in each term (and vice versa if relevance scores are all equal);
• D8 is satisfied for the same reasons as NWCS.

With respect to D2, D4, and D6 not being satisfied in general: the tradeoff in this case is that, as CAM is just a convex combination of existing measures, the scores are readily interpretable by anyone able to interpret M_r and M_c scores.

WHAM (an "F-score for credibility and ranking"). We define the weighted harmonic mean aggregating measure (WHAM) as zero if either M_r or M_c is zero, and otherwise:

WHAM = \Big( \frac{\lambda}{M_r} + \frac{1-\lambda}{M_c} \Big)^{-1}    (13)

where the notation is the same as for CAM in Equation 12 above. WHAM is the weighted harmonic mean of M_r and M_c.
Observe that if λ = 0.5, WHAM is simply the F-1 score of M_r and M_c. Note that WHAM is normalised if both M_r and M_c are normalised. Similar definitions of metrics can be made that use other averages. For example, one can use the weighted arithmetic and geometric means instead of the harmonic mean.

Our WHAM measure satisfies the following desiderata:
• D1 is satisfied for the same reasons as CAM;
• D2 is not satisfied in general;
• D3 is satisfied for the same reasons as CAM;
• D4 is not satisfied in general;
• D5 is satisfied for the same reasons as CAM;
• D6 is not satisfied in general; it is satisfied if both M_r and M_c are scale-free;
• D7 is satisfied for the same reasons as CAM;
• D8 is satisfied for the same reasons as CAM.

The primary advantage of CAM and WHAM is that their definitions appeal to simple concepts already known to larger audiences (convex combinations and averages), and hence the measures are simple to state and interpret. The consequent disadvantage is that this simplicity comes at the cost of not satisfying all desiderata.
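The following sketch illustrates the three score-based measures of Equations 10–13. The ideal ranking for NWCS is obtained by sorting documents by their combined score, which maximises the discounted weighted sum; the variable names and toy scores are ours, and the base-2 discount is an assumption, as above.

import math

# Illustrative sketches of the Type II measures: NWCS (Eqs. 10-11),
# CAM (Eq. 12) and WHAM (Eq. 13).

def wcs(rel_scores, cred_scores, lam=0.5):
    """Weighted cumulative score of a ranking, given per-position scores."""
    return sum((lam * zr + (1 - lam) * zc) / math.log2(1 + i)
               for i, (zr, zc) in enumerate(zip(rel_scores, cred_scores), start=1))

def nwcs(rel_scores, cred_scores, lam=0.5):
    """WCS normalised by the best WCS over all reorderings of the same documents."""
    pairs = list(zip(rel_scores, cred_scores))
    ideal = sorted(pairs, key=lambda p: lam * p[0] + (1 - lam) * p[1], reverse=True)
    iwcs = wcs([p[0] for p in ideal], [p[1] for p in ideal], lam)
    return wcs(rel_scores, cred_scores, lam) / iwcs if iwcs else 0.0

def cam(m_r, m_c, lam=0.5):
    """Convex combination of a relevance measure and a credibility measure."""
    return lam * m_r + (1 - lam) * m_c

def wham(m_r, m_c, lam=0.5):
    """Weighted harmonic mean of a relevance measure and a credibility measure."""
    if m_r == 0 or m_c == 0:
        return 0.0
    return 1.0 / (lam / m_r + (1 - lam) / m_c)

if __name__ == "__main__":
    rel = [3, 2, 0, 1]      # relevance scores of the ranked documents
    cred = [1, 3, 2, 3]     # credibility scores of the same documents
    print(nwcs(rel, cred), cam(0.93, 0.48), wham(0.93, 0.48))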
We next present an empirical evaluation of all our measures.

6 EMPIRICAL EVALUATION

There are two main approaches for evaluating evaluation measures:

Axiomatic: Define some general fundamental properties that a measure should adhere to, and then reflect on how many of these properties are satisfied by a new measure, and to what extent.

Empirical: Present a drawback of existing standard and accepted measures, and illustrate how a new measure addresses this. Ideally, the new measure should generally correlate well with the existing measures, except for the problematic cases, where it should perform better [13].

We have already conducted the axiomatic evaluation of our measures, having presented 8 fundamental properties they should adhere to (the desiderata in Section 3), and having subsequently discussed each of our measures in relation to these fundamental properties in Sections 4–5. We now present the empirical evaluation. We first present our in-house dataset and experimental setup, and then our findings.
The goal is to determine how good our measures are at evaluating both relevance and credibility in ranked lists. We do this by comparing the scores of our measures to the scores of well-known relevance and, separately, credibility measures. This comparison is done on a small dataset that we create for the purposes of this work as follows (our dataset is freely available here: https://github.com/diku-irlab/A66). We formulated 10 queries that we thought were likely to fetch results of various levels of credibility if submitted to a web search engine. These queries are shown in Table 1. We then recruited 10 assessors (1 MSc student, 5 PhD students, 3 postdocs, and 1 assistant professor, all within Computer Science, but none working on this project; 1 female, 9 males). Assessors were asked to submit each query to Google, and to assign separately a score of relevance and a score of credibility to each of the top 5 results. Assessors were instructed to use the same graded scale of relevance and credibility shown in the first column of Table 2.

Assessors were asked to use their own understanding of relevance and credibility, and not to let relevance affect their assessment of credibility, or vice versa (relevance and credibility were to be treated as unrelated aspects).
Table 1: The 10 queries used in our experiments.

Query no. | Query
1  | Smoking not bad for health
2  | Princess Diana alive
3  | Trump scientologist
4  | UFO sightings
5  | Loch Ness monster sightings
6  | Vaccines bad for children
7  | Time travel proof
8  | Brexit illuminati
9  | Climate change not dangerous
10 | Digital tv surveillance
Table 2: Conversion of graded assessments to binary. The same conversion is applied to both relevance and credibility assessments.

Graded         | Binary
1 (not at all) | 0 (not at all)
2 (marginally) | 0 (not at all)
3 (medium)     | 1 (completely)
4 (completely) | 1 (completely)

Assessors were instructed that, if they did not understand a query, or if they were unsure about the credibility of a result, they should open a separate browser and try to gather more information on the topic. Assessors received a nominal reward for their effort.

Even though assessors used the same queries, the top 5 results retrieved from Google per query were not always identical. Consequently, we compute our measures separately on each assessed ranking, and we report the arithmetic average. For NLRE and NGRE, we set µ = ν = 0.5,
meaning that relevance and credibility are weighted equally. Similarly, for NWCS, CAM, and WHAM, we set λ = 0.5. As no existing measures of both relevance and credibility exist, we compare the scores of our measures on the above dataset to the scores of:
• NDCG (for graded relevance), AP (for binary relevance);
• F-1, G-measure (for binary credibility).

F-1 was introduced in Section 2 for relevance. We use it here to assess credibility, by defining its constituent precision and recall in terms of true/false positives/negatives (as is standard in classification evaluation). The G-measure is the geometric mean of precision and recall, which are defined as for F-1. To render our graded assessments binary (for AP, F-1, G-measure), we use the conversion shown in Table 2.
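As an illustration of the credibility-side computation, the sketch below applies the Table 2 binarisation and computes F-1 and the G-measure from true/false positive/negative counts. How those counts are derived from a ranking (e.g. by treating every retrieved result as predicted credible) is an assumption of this example, not a prescription from the paper.

import math

# Sketch of the credibility-side computations: the graded-to-binary mapping
# of Table 2, and F-1 / G-measure from TP/FP/FN counts.

def binarise(grade):
    """Table 2: grades 1-2 (not at all / marginally) -> 0, grades 3-4 -> 1."""
    return 0 if grade <= 2 else 1

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def f1(tp, fp, fn):
    p, r = precision_recall(tp, fp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def g_measure(tp, fp, fn):
    """Geometric mean of precision and recall."""
    p, r = precision_recall(tp, fp, fn)
    return math.sqrt(p * r)

if __name__ == "__main__":
    grades = [4, 3, 1, 2, 3]                 # graded credibility of 5 results
    labels = [binarise(g) for g in grades]   # -> [1, 1, 0, 0, 1]
    # Example counts, here treating all 5 retrieved results as predicted credible.
    print(labels, f1(tp=3, fp=2, fn=0), g_measure(tp=3, fp=2, fn=0))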
Table 3: Our evaluation measures compared to NDCG, AP, F-1 and G. For NDCG we use our graded assessments. For the rest, we convert our graded assessments to binary as follows: 1 or 2 = not relevant/credible; 3 or 4 = relevant or credible. All measures are computed on the top 5 results returned for each query shown in Table 1. We report the average across all assessors.

RELEVANCE
NDCG | 0.9329
AP   | 0.7842

CREDIBILITY
F-1  | 0.4786
G    | 0.5475

RELEVANCE and CREDIBILITY
NLRE | 0.8262
NGRE | 0.6919
NWCS | 0.9413
CAM (NDCG, F-1)
CAM (NDCG, G)
CAM (AP, F-1)
CAM (AP, G)
WHAM (NDCG, F-1)
WHAM (NDCG, G)
WHAM (AP, F-1)
WHAM (AP, G)

Table 3 displays the scores of all evaluation measures on our dataset. We see that relevance-only measures (NDCG, AP) give overall higher scores than credibility-only measures (F-1, G). It is not surprising to see such high NDCG and AP scores, considering that we assess only the top 5 results returned by Google for each query. Overall, the scores of the different measures correlate positively with each other (from ρ = 0.79 for NDCG and F-1, up to ρ = 0.97 for NDCG and NLRE).

Table 4 shows examples of high divergence between the relevance and credibility of the retrieved documents, for three of our measures (the scores of our remaining metrics can be easily deduced from the respective relevance-only and credibility-only scores, as our omitted measures, CAM and WHAM, aggregate the existing relevance-only and credibility-only metrics shown in Table 4). Note that, whereas we found several examples of max relevance and min credibility in our data, there were (understandably) significantly fewer examples of max credibility and min relevance (this distribution is reflected in Table 4).

Table 4: Examples of max/min relevance and credibility, from our experiments. Only one out of the 5 retrieved documents is shown per query. The urls of the retrieved results are reduced to their most content-bearing parts, for brevity.

EXAMPLES OF HIGH RELEVANCE AND LOW CREDIBILITY
Query | Result (rank) | Relevance | Credibility | NDCG | AP | F-1 | G | NLRE | NGRE | NWCS
2  | ... princess-diana-found-alive (3) | 4 | 1 | .883 | .679 | .333 | .387 | .819 | .585 | .950
3  | tonyortega.org ... scientology ... where-does-trump-stand (1) | 4 | 1 | .938 | 1.00 | .571 | .631 | .949 | .797 | .913
4  | ... (1) | 4 | 1 | 1.00 | 1.00 | .333 | .431 | .808 | .262 | .941
6  | articles.mercola.com ... vaccines-adverse-reaction (4) | 4 | 1 | .938 | .950 | .571 | .500 | .872 | .534 | .927
8  | ... brexit-what-is-the-globalist-game (1) | 4 | 1 | .884 | .679 | .000 | .000 | .889 | .666 | .985
10 | educate-yourself.org ... HDtvcovertsurveillanceagenda (3) | 4 | 1 | .979 | 1.00 | .000 | .000 | .926 | .885 | .997

EXAMPLES OF HIGH CREDIBILITY AND LOW RELEVANCE
10 | cctvcamerapros.com/Connect-CCTV-Camera-to-TV-s (2) | 1 | 4 | .780 | .533 | .571 | .715 | .863 | .710 | .931
10 | ieeexplore.ieee.org/document/891879 (5) | 1 | 4 | .780 | .533 | .571 | .715 | .899 | .605 | .874

We see that NWCS gives higher scores for queries 2 and 4-10 than NLRE and NGRE. For the first five examples (of max relevance and min credibility), this is likely because NWCS does not satisfy D4, namely that credibility errors should be penalised more on high relevance versus low relevance documents. We also see that NGRE gives consistently lower scores than NLRE and NWCS. This is due to its global aspect discussed earlier: NGRE accumulates credibility and relevance errors separately and then multiplies them at the end, meaning that local errors in each rank do not impact the final score as much (unlike NLRE and NWCS, which are both local in that sense, the first using document ranks, the second using document scores).

7 CONCLUSION

The credibility of search results is important in many retrieval tasks, and should, we reason, be integrated into IR evaluation measures that are, as of now, targeting mostly relevance. We have presented several measures and types of measures that can be used to gauge the effectiveness of a ranking, taking into account both credibility and relevance. The measures are both axiomatically and empirically sound, the latter illustrated in a small user study.

There are at least two natural extensions of our approach. First, the combination of rankings based on different criteria goes beyond the combination of relevance and credibility, and several such combinations are used in practice based on different criteria (e.g., combinations of relevance and upvotes on social media sites); we believe that much of our work can be encompassed in more general approaches, suitably axiomatised, that do not necessarily have to satisfy the same desiderata as those of this paper (e.g., do not have to scale credibility error by relevance errors as in our D4). Second, while we have chosen to devise measures that are both theoretically principled and conceptually simple using simple criteria (satisfaction of desiderata, local versus global, amenable to principled interpretation), there are many more measures that can be defined within the same limits. For example, our Type II measures are primarily built on simple combinations of scores or pre-existing measures that can easily be understood by the community, but at the price that some desiderata are hard or impossible to satisfy; however, there is no theoretical reason why one could not create Type II measures that incorporate some of the ideas from Type I metrics. We intend to investigate these two extensions in the future, and invite the community to do so as well.

Lastly, while the notion of credibility, in particular in news media, is subject to intense public discussion, very few empirical studies exist that contain user preferences, credibility rankings, or information needs related to credibility.
The small study included in this paper, while informative, is a very small step in this direction. We believe that future substantial discussion of practically relevant research involving credibility in information retrieval would greatly benefit from having access to larger-scale empirical user studies.

REFERENCES

[1] Raju Balakrishnan and Subbarao Kambhampati. 2011. SourceRank: relevance and trust assessment for deep web sources based on inter-source agreement. In
Proceedings of the 20th International Conference on World Wide Web (WWW 2011). ACM, 227–236. DOI: https://doi.org/10.1145/1963405.1963440
[2] Paul E. Black, Karen A. Scarfone, and Murugiah P. Souppaya (Eds.). 2008. Cyber Security Metrics and Measures. Wiley Handbook of Science and Technology for Homeland Security.
[3] Brian Brost, Ingemar J. Cox, Yevgeny Seldin, and Christina Lioma. 2016. An Improved Multileaving Algorithm for Online Ranker Evaluation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016). ACM, 745–748. DOI: https://doi.org/10.1145/2911451.2914706
[4] Brian Brost, Yevgeny Seldin, Ingemar J. Cox, and Christina Lioma. 2016. Multi-Dueling Bandits and Their Application to Online Ranker Evaluation. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016). ACM, 2161–2166.
[5] Chris Buckley and Ellen M. Voorhees. 2004. Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004). ACM, 25–32. DOI: https://doi.org/10.1145/1008992.1009000
[6] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009). ACM, 621–630. DOI: https://doi.org/10.1145/1645953.1646033
[7] Aleksandr Chuklin, Pavel Serdyukov, and Maarten de Rijke. 2013. Click model-based information retrieval metrics. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013). ACM, 493–502. DOI: https://doi.org/10.1145/2484028.2484071
[8] Rob Ennals, Dan Byler, John Mark Agosta, and Barbara Rosario. 2010. What is disputed on the web?. In Proceedings of the 4th ACM Workshop on Information Credibility on the Web (WICOW 2010). ACM, 67–74. DOI: https://doi.org/10.1145/1772938.1772952
[9] Katja Hofmann, Lihong Li, and Filip Radlinski. 2016. Online Evaluation for Information Retrieval. Foundations and Trends in Information Retrieval 10, 1 (2016), 1–117.
[10] Christopher Horn, Alisa Zhila, Alexander F. Gelbukh, Roman Kern, and Elisabeth Lex. 2013. Using Factual Density to Measure Informativeness of Web Documents. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), Linköping Electronic Conference Proceedings.
[11] ... , Wendy E. Mackay, Stephen A. Brewster, and Susanne Bødker (Eds.). ACM, 1887–1892.
[12] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (2002), 422–446. DOI: https://doi.org/10.1145/582415.582418
[13] Ravi Kumar and Sergei Vassilvitskii. 2010. Generalized distances between rankings. In Proceedings of the 19th International Conference on World Wide Web (WWW 2010). ACM, 571–580. DOI: https://doi.org/10.1145/1772690.1772749
[14] Elisabeth Lex, Inayat Khan, Horst Bischof, and Michael Granitzer. 2014. Assessing the Quality of Web Content. CoRR abs/1406.3188 (2014). http://arxiv.org/abs/1406.3188
[15] Christina Lioma, Birger Larsen, Wei Lu, and Yong Huang. 2016. A study of factuality, objectivity and relevance: three desiderata in large-scale information retrieval?. In Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT 2016). ACM, 107–117. DOI: https://doi.org/10.1145/3006299.3006315
[16] Stefano Mizzaro. 2008. The Good, the Bad, the Difficult, and the Easy: Something Wrong with Information Retrieval Evaluation?. In Advances in Information Retrieval, 30th European Conference on IR Research (ECIR 2008), Lecture Notes in Computer Science, Vol. 4956. Springer, 642–646. DOI: https://doi.org/10.1007/978-3-540-78646-7_71
[17] Meredith Ringel Morris, Scott Counts, Asta Roseway, Aaron Hoff, and Julia Schwarz. 2012. Tweeting is believing?: understanding microblog credibility perceptions. In Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW 2012). ACM, 441–450. DOI: https://doi.org/10.1145/2145204.2145274
[18] Souneil Park, Seungwoo Kang, Sangyoung Chung, and Junehwa Song. 2009. NewsCube: delivering multiple aspects of news to mitigate media bias. In Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI 2009). ACM, 443–452.
[19] Anne Schuth, Katja Hofmann, and Filip Radlinski. 2015. Predicting Search Satisfaction Metrics with Interleaved Comparisons. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2015). ACM, 463–472.
[20] Anne Schuth, Harrie Oosterhuis, Shimon Whiteson, and Maarten de Rijke. 2016. Multileave Gradient Descent for Fast Online Learning to Rank. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM 2016). 457–466.
[21] Julia Schwarz and Meredith Ringel Morris. 2011. Augmenting web pages and search results to support credibility assessment. In Proceedings of the International Conference on Human Factors in Computing Systems (CHI 2011). ACM, 1245–1254.
[22] C. J. Keith van Rijsbergen. 1974. Foundation of evaluation. Journal of Documentation 30, 4 (1974), 365–373.
[23] Janyce Wiebe and Ellen Riloff. 2011. Finding Mutual Benefit between Subjectivity Analysis and Information Extraction. IEEE Trans. Affective Computing 2, 4 (2011), 175–191. DOI: https://doi.org/10.1109/T-AFFC.2011.19
[24] Emine Yilmaz, Milad Shokouhi, Nick Craswell, and Stephen Robertson. 2010. Expected browsing utility for web search evaluation. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010). ACM, 1561–1564.