A Cardinal Comparison of Experts

Abstract

In various situations, decision makers face experts that may provide conflicting advice. This advice may be in the form of probabilistic forecasts over critical future events. We consider a setting where the two forecasters provide their advice repeatedly and ask whether the decision maker can learn to compare and rank the two forecasters based on past performance. We take an axiomatic approach and propose three natural axioms that a comparison test should comply with. We propose a test that complies with our axioms. Perhaps not surprisingly, this test is closely related to the likelihood ratio of the two forecasts over the realized sequence of events. More surprisingly, this test is essentially unique. Furthermore, using results on the rate of convergence of supermartingales, we show that whenever the two experts' advice is sufficiently distinct, the proposed test will detect the informed expert to any desired degree of precision in some fixed finite time.


arXiv [econ.TH]

Itay Kavaler, Rann Smorodinsky

Davidson Faculty of Industrial Engineering and Management, Technion, Haifa 3200003, [email protected] (IK); [email protected] (RS)


Key words: forecasting; probability; testing.

MSC2000 subject classification: C11; C70; C73; D83

1. Introduction

Consider an individual who repeatedly consults two weather forecasting websites. It is reasonable to ask what the individual should do when the two forecasts repeatedly contradict each other. In what way can the individual rank the two? Should the individual trust one site and (eventually) ignore the other?

The weather example above serves as a metaphor for a plethora of settings where a decision maker faces conflicting expert advice. Take, for example, an elected official who must rely on professional input from civil servants, a patient who receives prognoses from various doctors or, more abstractly, a learning algorithm that uses input from various sources.

In this paper, we set the stage for defining the notion of a cardinal comparison test. The setting we have in mind is a sequential one. At each stage $t$ two forecasters provide a probability over some future event (e.g., the occurrence of rain) and then the event is either realized or its complement is. Before the next day's forecasts the test must rank the two forecasters. We calibrate these ranks so they add up to one. One way to think of the rank is as a recommendation for a coin flip to decide which of the two experts' advice should be taken.

We pursue a test that complies with the following set of properties, which we consider natural:

Anonymity - A test is anonymous if it does not depend on the identity of the experts but only on their forecasts.

Error-free - A test is error-free if, from their perspective, each of the experts cannot entertain the thought that the other expert will be overwhelmingly preferred (i.e., he assigns that event a relatively lower probability). Another way to think about the notion of an error-free test is to assume that one of the experts has the correct model. In such a case, the test will probably not point at the second expert as the superior one.

Reasonable - Let us consider an event, $A$, that has positive probability according to the first expert but relatively lower probability according to the second. Conditional on the occurrence of event $A$, a reasonable test must assign positive probability to the first expert being better informed than the second.

One thing to emphasize about the cardinal comparison test we pursue and the related properties is that they are not designed to evaluate whether either of the two forecasters is correct in some objective sense. They are only designed to compare the two. To make this point, assume that Nature follows a fair coin for deciding on rain and one forecaster insists on forecasting rain with probability 60% while the other insists on 10%. While both are wrong, a cardinal comparison test should somehow gravitate towards the former one as being better.

There is a large body of literature on expert testing that studies the question of whether a self-proclaimed expert is a true expert or a charlatan (see Section 1.2 for more details), and many of the results point to the difficulty or impossibility of designing such tests that are immune to strategic forecasters.

Comparing forecasters may often be a more natural question than asking whether a forecaster is correct. Indeed, when a decision maker must act, she must choose which of the experts to follow. In the case of a single expert, the dismissal of that expert leaves the decision maker working with her own unsubstantiated beliefs, which may lead to an even worse outcome. When a decision maker faces two forecasters with conflicting input, she may choose to somehow aggregate the two instead of dismissing one or the other. We discuss this alternative line of research in Section 1.2.
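The fair-coin example above can be checked with a short calculation. The sketch below is only an illustration under stated assumptions: outcomes are independent, both forecasters repeat a constant forecast, and the rank is taken to be the calibrated likelihood ratio $f/(f+g)$ described in the text. The function names are hypothetical. It computes the expected one-period log-likelihood ratio between the 60% and the 10% forecaster, and the rank obtained when the cumulative log ratio sits at its mean after 50 days (this is not the expected rank).

```python
import math


def expected_log_likelihood_ratio(p_true, p_f, p_g):
    """Expected one-period value of log(f/g) when the event occurs with
    probability p_true and the forecasters announce p_f and p_g."""
    event = p_true * math.log(p_f / p_g)
    complement = (1.0 - p_true) * math.log((1.0 - p_f) / (1.0 - p_g))
    return event + complement


def rank_from_log_ratio(log_ratio):
    """Calibrated rank f/(f+g), written in terms of log(f/g)."""
    return 1.0 / (1.0 + math.exp(-log_ratio))


# Nature flips a fair coin; f always says 60%, g always says 10%.
drift = expected_log_likelihood_ratio(0.5, 0.6, 0.1)

# Rank of f when the cumulative log ratio equals its mean after 50 days.
rank_after_50_days = rank_from_log_ratio(50 * drift)
```

The drift is positive, so the log-likelihood ratio tends to grow in favor of the 60% forecaster even though neither forecaster is correct.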

Given an ordered pair of forecasters, $f$ and $g$, at any finite time $t$, we consider the corresponding likelihood ratio of the actual outcome and calibrate it so that it and its inverse add up to one. We call this the finite derivative test at time $t$. We prove that this test is anonymous, error-free and reasonable. Furthermore, modulo an equivalence relation, it is unique. In fact, for any test that differs from the aforementioned construction and which is anonymous and reasonable, there exist two forecasters which render the test not error-free.

Moreover, our constructed test perfectly identifies the correct forecaster whenever the two measures induced by the forecasters are mutually singular with respect to each other. Requiring the test to identify the correct expert when the measures are not mutually singular is shown to be impossible.

A test could potentially take a long while until it converges to a verdict on the better expert. We show that the proposed comparison test converges fast and uniformly. In fact, when disregarding the stages at which the two experts provide similar forecasts, then with high probability the correct verdict will emerge in finite time that is independent of the underlying probabilities.

One can ask whether ideal tests can exist, that is, tests that always rank the correct forecaster higher regardless of what forecasting strategies other experts might submit. Unfortunately, this turns out to be impossible, as we discuss in Appendix A. Since an ideal test does not exist, it is natural to explore the ideality of a test over a limited class of data-generating processes. We provide a full characterization for the existence of ideal tests over sets by showing that an ideal test with respect to a set $A$ exists if and only if $A$ is pairwise mutually singular.

Single expert testing. A substantial part of the literature on expert testing focuses on the single expert setting. This literature dates back to the seminal paper of Dawid [5], who proposes the calibration test as a scheme to evaluate the validity of weather forecasters. Dawid asserts that a test must not fail a true expert. Foster and Vohra [9] show how a charlatan, who has no knowledge of the weather, can produce forecasts which are always calibrated. The basic ingredient that allows the charlatan to fool the test is the use of random forecasts. Lehrer [12] and Sandroni, Smorodinsky, and Vohra [18] extend this observation to a broader class of calibration-like tests. Finally, Sandroni [17] shows that there exists no error-free test that is immune to such random charlatans (see also extensions of Sandroni's result in Shmaya [19] and Olszewski and Sandroni [15]).

To circumvent the negative results, various authors suggest limiting the set of models for which the test must be error-free (e.g., Al-Najjar, Sandroni, Smorodinsky, and Weinstein [1] and Pomatto [16]), or limiting the computational power associated with the charlatan (e.g., Fortnow and Vohra [8]), or replacing measure-theoretic implausibility with topological implausibility by resorting to the notion of category one sets (e.g., Dekel and Feinberg [6]).

Multiple expert testing. Comparing the performance of two (or more) experts has received very little attention in the literature. Apart from our previous work, Kavaler and Smorodinsky [11], we are only familiar with Al-Najjar and Weinstein [2]. That paper proposes a test based on the likelihood ratio for comparing two experts. They show that if one expert knows the true process whereas the other is uninformed, then one of the following must occur: either the test correctly identifies the informed expert, or the forecasts made by the uninformed expert are close to those made by the informed one. It turns out that the test they propose is anonymous and reasonable but is not error-free (please refer to Section 5 for the formal definition).

Another approach was suggested by Feinberg and Stewart [7], who study an infinite horizon model of testing multiple experts, using a cross-calibration test. In their test, $N$ experts are tested simultaneously; each expert is tested according to a calibration restricted to dates where not only does the expert have a fixed forecast but the other experts also have a fixed forecast, possibly with different values. That is to say, where the calibration test checks the empirical frequency of observed outcomes conditional on each forecast, the cross-calibration test checks the empirical frequency of observed outcomes conditional on each profile of forecasts (please refer to Appendix C for the formal definition).

They show that if an expert predicts according to the data-generating process, the expert is guaranteed to pass the cross-calibration test with probability 1, no matter what strategies the other experts use. In addition, they prove that in the presence of an informed expert, the subset of data-generating processes under which an ignorant expert (a charlatan) will pass the cross-calibration test with positive probability is topologically "small".

In a previous paper, Kavaler and Smorodinsky [11], we construct a comparison test over the infinite horizon. In that paper, the test outputs one verdict at the end of all times, which takes one of three forms: it points to either one of the forecasters as advantageous or it is indecisive. The main result in that paper was the identification of an essentially unique infinite-horizon, ordinal test that adheres to some natural properties. The properties studied in the current paper (as well as the associated terminology) are inspired by the ones studied in Kavaler and Smorodinsky [11]. The test we identify is based on the likelihood ratio. Interestingly, the tests identified in Al-Najjar and Weinstein [2] and that identified by Pomatto [16] for testable paradigms are also based on the likelihood ratio.

An alternative approach to that of comparing and ranking experts is that of aggregating forecasts by a non-Bayesian aggregator. For aggregation schemes that do well in a single stage setting, see Arieli et al. [3], as well as Levy and Razin [13] and Levy and Razin [14]; for schemes that work well in a repeated setting and produce small regret, see the rich literature in machine learning surveyed in Cesa-Bianchi and Lugosi [4].

2. Model

At the beginning of each period $t = 1, 2, \ldots$ an outcome, $\omega_t$, drawn randomly by Nature from the set $\Omega = \{0, 1\}$, is realized. A realization is an infinite sequence of outcomes, $\omega := \{\omega_1, \omega_2, \ldots\} \in \Omega^\infty$. We denote by $\omega^t := \{\omega_1, \omega_2, \ldots, \omega_t\}$ the prefix of length $t$ of $\omega$ (sometimes referred to as the partial history of outcomes up to period $t$) and use the convention that $\omega^0 := \emptyset$. At the risk of abusing notation, we will also use $\omega^t$ to denote the cylinder set $\{\hat\omega \in \Omega^\infty : \hat\omega^t = \omega^t\}$. In other words, $\omega^t$ will also denote the set of realizations which share a common prefix of length $t$. For any $t$ we denote by $\mathcal{G}_t$ the $\sigma$-algebra on $\Omega^\infty$ generated by the cylinder sets $\omega^t$ and let $\mathcal{G}_\infty := \sigma\left(\bigcup_{t=0}^{\infty} \mathcal{G}_t\right)$ denote the smallest $\sigma$-algebra which consists of all cylinders (also known as the Borel $\sigma$-algebra). Let $\Delta(\Omega^\infty)$ be the set of all probability measures defined over the measurable space $(\Omega^\infty, \mathcal{G}_\infty)$.

Before $\omega_t$ is realized, two self-proclaimed experts (sometimes referred to as forecasters) simultaneously announce their forecasts in the form of a probability distribution over $\Omega$. Let $(\Omega \times \Delta(\Omega) \times \Delta(\Omega))^t$ be the set of all sequences composed of realizations and pairs of forecasts made up to time $t$ and let $\bigcup_{t \ge 0} (\Omega \times \Delta(\Omega) \times \Delta(\Omega))^t$ be the set of all such finite sequences.

A (pure) forecasting strategy $f$ is a function that maps finite histories to a probability distribution over $\Omega$. Formally,

$$f : \bigcup_{t \ge 0} (\Omega \times \Delta(\Omega) \times \Delta(\Omega))^t \longrightarrow \Delta(\Omega).$$

Note that each forecast provided by one expert may depend, inter alia, on those provided by the other expert in previous stages. Let $F$ denote the set of all forecasting strategies.

A probability measure $P \in \Delta(\Omega^\infty)$ naturally induces a (set of) corresponding forecasting strategy, denoted $f_P$, that satisfies, for any $\omega \in \Omega^\infty$ and any $t$ such that $P(\omega^t) > 0$,

$$f_P(\omega^t, \cdot, \cdot)(\omega_{t+1}) = P(\omega^{t+1} \mid \omega^t).$$

Thus, the forecasting strategy $f_P$ derives its forecasts from the original measure, $P$, via Bayes' rule. Note that this does not restrict the forecast of $f_P$ over cylinders, $\omega^t$, for which $P(\omega^t) = 0$.

In the other direction, a realization $\omega$ and an ordered pair of forecasting strategies, $\vec f := (f, g)$, induce a unique play path, $(\omega, \vec f) \in (\Omega \times \Delta(\Omega) \times \Delta(\Omega))^\infty$, where the corresponding $t$-history is denoted by $(\omega, \vec f)^t \in (\Omega \times \Delta(\Omega) \times \Delta(\Omega))^t$, starting at the null history, $(\omega, \vec f)^0 := \emptyset$. These in turn induce a pair of probability measures, denoted for simplicity by $(f, g)$, over $\Omega^\infty$, as follows:

$$f(\omega^t) = \prod_{n=1}^{t} f\big((\omega, \vec f)^{n-1}\big)[\omega_n], \qquad g(\omega^t) = \prod_{n=1}^{t} g\big((\omega, \vec f)^{n-1}\big)[\omega_n].$$

By Kolmogorov's extension theorem, the above is sufficient in order to derive the whole measure. Observe that a pair of forecasting strategies induces a pair of probability measures, whereas each single forecasting strategy does not induce a single measure, due to the dependency between the two forecasters.
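The construction of the induced measures along a play path can be sketched as follows. This is a minimal illustration, not the paper's formalism: a forecast is represented by the probability it assigns to outcome 1, a history is a list of (outcome, first forecast, second forecast) triples, and the two strategies `fair_coin` and `copy_first_forecaster` are hypothetical. The second strategy depends on the other expert's past forecasts, which is why only the pair of strategies determines the pair of measures.

```python
from typing import Callable, List, Tuple

# A t-history: one (outcome, forecast of f, forecast of g) triple per period.
# Each forecast is the announced probability of outcome 1.
History = List[Tuple[int, float, float]]
Strategy = Callable[[History], float]


def prob_of(outcome: int, p_one: float) -> float:
    """Probability that a forecast p_one assigns to the realized outcome."""
    return p_one if outcome == 1 else 1.0 - p_one


def induced_measures(omega_prefix: List[int], f: Strategy, g: Strategy):
    """Return (f(omega^t), g(omega^t)): the products of the one-step
    forecasts of the realized outcomes along the play path."""
    history: History = []
    f_prob, g_prob = 1.0, 1.0
    for outcome in omega_prefix:
        p_f, p_g = f(history), g(history)
        f_prob *= prob_of(outcome, p_f)
        g_prob *= prob_of(outcome, p_g)
        history.append((outcome, p_f, p_g))
    return f_prob, g_prob


def fair_coin(history: History) -> float:
    """Hypothetical strategy: always forecast 1/2."""
    return 0.5


def copy_first_forecaster(history: History) -> float:
    """Hypothetical strategy: repeat the first forecaster's previous
    forecast, starting from 0.9 on day one."""
    return history[-1][1] if history else 0.9


f_prob, g_prob = induced_measures([1, 1, 0], fair_coin, copy_first_forecaster)
```

Pairing `copy_first_forecaster` with a different first strategy would change the measure it induces, in line with the observation that a single strategy does not induce a single measure.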

At each stage $t$ a third party (the 'tester') who observes the forecasts and outcomes compares the performance of both forecasters and decides who she thinks is better. Formally,

Definition 1. A cardinal comparison test is a sequence $T := (T_t)_{t>0}$, where $T_t : (\Omega \times \Delta(\Omega) \times \Delta(\Omega))^\infty \longrightarrow [0, 1]$ is $\mathcal{G}_t$-measurable for all $t > 0$.

At each period $t$, for any realization $\omega$ and any sequence of forecasts $\vec f$, the tester, conditional on a $t$-history, announces her level of confidence that the first forecaster (the one using $f$) is better than the second one (we will interchangeably refer to this as her propensity that $f$ is superior to $g$). Note that announcing $0.5$ means that the tester favors neither forecaster, whereas whenever $T_t(\omega, \vec f) = 1$ (respectively, $0$) the tester is confident that $f$ outperforms $g$ (respectively, $g$ outperforms $f$).

Definition 2. $T$ is called anonymous if for all $\omega \in \Omega^\infty$, $t > 0$, and $f, g \in F$,

$$T_t(\omega, f, g) = 1 - T_t(\omega, g, f).$$

In other words, the test's propensity at each period should not depend on the experts' identities. Note that whenever $f = g$ an anonymous test $T$ must output a propensity of $0.5$ for all $\omega \in \Omega^\infty$ and $t > 0$.

For a test $T$, an ordered pair of forecasting strategies $\vec f = (f, g)$, and a realization $\omega$, we denote $T(\omega, \vec f) = \lim_{t} T_t(\omega, \vec f)$ whenever the limit exists. For $\epsilon \in (0, 1)$, let $L^{\vec f}_{T,\epsilon} := \{\omega : T(\omega, \vec f) > \epsilon\}$ be the set of realizations for which the limit of $T$ exists and which, from some time on, assigns a propensity larger than $\epsilon$ to $f$ (similarly, we denote $R^{\vec f}_{T,\epsilon} := \{\omega : T(\omega, \vec f) < \epsilon\}$). The following is a straightforward observation derived from Definition 2: if $T$ is an anonymous test, then $\omega \in R^{(f,g)}_{T,\epsilon}$ if and only if $\omega \in L^{(g,f)}_{T,1-\epsilon}$. We use this observation in some of our proofs.

When $\omega$ is in $L^{\vec f}_{T,\epsilon}$ and $\epsilon > 0.5$, the test eventually assigns a higher propensity to $f$ than to $g$. On the other hand, for $\epsilon < 0.5$, the test assigns a higher propensity to $g$ whenever $\omega$ is in $R^{\vec f}_{T,\epsilon}$. Thus, we will typically focus on the sets $L^{\vec f}_{T,\epsilon}$ with $\epsilon > 0.5$ and $R^{\vec f}_{T,\epsilon}$ for $\epsilon < 0.5$.

In this section, we introduce a set of axioms we deem desirable for a cardinal comparison test. Our first property asserts that any set that is contained in $R^{\vec f}_{T,\epsilon}$ must not be assigned a high probability according to $f$ in comparison with the probability assigned by $g$. In particular, the ratio of these probabilities must be bounded by $\frac{\epsilon}{1-\epsilon}$.

Definition 3. $T$ is error-free if for all $\vec f := (f, g) \in F \times F$, for all $\epsilon \in (0, \tfrac{1}{2})$ and for all measurable sets $A$,

$$f(A \cap R^{\vec f}_{T,\epsilon}) \le \left(\frac{\epsilon}{1-\epsilon}\right) g(A \cap R^{\vec f}_{T,\epsilon}). \tag{1}$$

(Similarly, $g(A \cap L^{\vec f}_{T,\epsilon}) \le \left(\frac{1-\epsilon}{\epsilon}\right) f(A \cap L^{\vec f}_{T,\epsilon})$ for $\epsilon \in (\tfrac{1}{2}, 1)$.)

When $\epsilon$ approaches $0$, the set $R^{\vec f}_{T,\epsilon}$ captures the paths where $g$ is clearly deemed better than $f$, and so the property of error-freeness implies that although $g$ may assign a subset of $R^{\vec f}_{T,\epsilon}$ a positive probability, it must be the case that $f$ assigns it near-zero probability. On the other hand, whenever $\epsilon$ approaches $0.5$, the corresponding ratio approaches $1$ and so error-freeness requires that $f$ assigns that event a probability no greater than $g$ does.

In particular, each forecaster must believe that a test cannot point out the other forecaster as correct. From his perspective, he is either preferred or the test is indecisive.

Consider a set of realizations assigned positive probability by one forecaster whereas his colleague assigns it a relatively lower probability. We shall call a test 'reasonable' if the former forecaster assigns a positive probability to the event that the test will eventually provide a high propensity to him. Formally:

Definition 4. $T$ is reasonable if for all $\vec f \in F \times F$, for all $\epsilon \in (0, \tfrac{1}{2})$ and for all measurable sets $A$,

$$g(A) > 0 \ \text{ and } \ f(A) < \left(\frac{\epsilon}{1-\epsilon}\right) g(A) \implies g(A \cap R^{\vec f}_{T,\epsilon}) > 0. \tag{2}$$

(Similarly, $f(A) > 0$ and $g(A) < \left(\frac{1-\epsilon}{\epsilon}\right) f(A) \implies f(A \cap L^{\vec f}_{T,\epsilon}) > 0$ for $\epsilon \in (\tfrac{1}{2}, 1)$.)

Remark 1. One could propose to replace error-freeness with a stronger and more appealing property in which a test points out the better informed expert with probability one. Informally, we would like to consider tests that have the following property: $f(T(\omega, \vec f) = 1) = 1$ whenever $f \ne g$. However, there could be pairs of forecasters that are not equal but induce the same probability distribution. In Appendix A, we formalize this and refer to tests that satisfy this stronger requirement as ideal. Furthermore, we show that, as the name suggests, such tests essentially do not exist.
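The inequality behind error-freeness can be illustrated on a finite horizon. The sketch below is an assumption-laden analogue rather than the definition itself: it uses two hypothetical i.i.d. forecasters with constant forecasts strictly between 0 and 1, takes the calibrated likelihood ratio $f/(f+g)$ at a fixed horizon in place of the limit $T(\omega, \vec f)$, and enumerates all prefixes exactly. On every prefix where the rank of $f$ is below $\epsilon$, one has $f < \frac{\epsilon}{1-\epsilon}\, g$ pointwise, so the same bound holds for the total mass of such prefixes.

```python
from itertools import product


def prefix_probability(prefix, p_one):
    """Probability of a 0/1 prefix under an i.i.d. forecast p_one for outcome 1."""
    prob = 1.0
    for outcome in prefix:
        prob *= p_one if outcome == 1 else 1.0 - p_one
    return prob


def mass_where_f_ranked_below(p_f, p_g, horizon, eps):
    """Return (f(R), g(R)), where R collects the prefixes of the given length
    on which the rank f/(f+g) is below eps.
    Assumes 0 < p_f, p_g < 1, so that f + g is never zero."""
    f_mass, g_mass = 0.0, 0.0
    for prefix in product((0, 1), repeat=horizon):
        f_prob = prefix_probability(prefix, p_f)
        g_prob = prefix_probability(prefix, p_g)
        if f_prob / (f_prob + g_prob) < eps:
            f_mass += f_prob
            g_mass += g_prob
    return f_mass, g_mass


eps = 0.25
f_mass, g_mass = mass_where_f_ranked_below(p_f=0.5, p_g=0.8, horizon=10, eps=eps)
bound = eps / (1.0 - eps) * g_mass
```

Here `f_mass` never exceeds `bound`, which mirrors inequality (1) with $A = \Omega^\infty$ for this finite-horizon stand-in.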

3. An error-free and reasonable test

We now turn to propose an anonymous cardinal comparison test that is error-free and reasonable. For any pair of forecasters, $\vec f := (f, g) \in F \times F$, $\omega \in \Omega^\infty$, $t \ge 0$, the finite derivative test, $D$, is defined as follows:

$$D_{t+1}(\omega, \vec f) = \begin{cases} \dfrac{f(\omega^t)}{f(\omega^t) + g(\omega^t)}, & g(\omega^t) > 0 \text{ or } f(\omega^t) > 0, \\[2mm] \dfrac{1}{2}, & \text{otherwise}. \end{cases}$$

It should be noted that the ratio between $D_{t+1}(\omega, \vec f)$, the rank associated with the forecast $f$, and $1 - D_{t+1}(\omega, \vec f)$, the rank associated with the forecast $g$, equals the likelihood ratio between the two forecasters. Clearly, $D$ is anonymous. We turn to show that it is reasonable and error-free. Before doing so, some preliminaries are required.

Lemma 1. Let $\vec f := (f, g)$. Then the limit of $D_t(\cdot, \vec f)$ exists and is finite $f$-a.s.

Proof. For $\omega \in \Omega^\infty$ where $f(\omega^t) > 0$, define the likelihood ratio at time $t$ as

$$D^t_f g(\omega) = \prod_{n=1}^{t} \frac{g\big((\omega, \vec f)^{n-1}\big)[\omega_n]}{f\big((\omega, \vec f)^{n-1}\big)[\omega_n]},$$

and observe that $D_{t+1}(\omega, \vec f) = \dfrac{1}{1 + D^t_f g(\omega)}$. Applying Lemma 2 from Kavaler and Smorodinsky [11], we know that the limit of $D^t_f g$, denoted $D_f g$, exists and is finite $f$-a.s. It readily follows that $D(\omega, \vec f) := \dfrac{1}{1 + D_f g(\omega)} = \lim_t D_t(\omega, \vec f)$ exists and is finite $f$-a.s. $\square$

Now that we have established the existence and the finiteness of the test $D$, let us prove that it complies with the two central properties for cardinal comparison tests:

Proposition 1. $D$ is error-free.

Proof. Let $\vec f := (f, g) \in F \times F$, $\epsilon \in (0, \tfrac{1}{2})$, and let $A$ be a measurable set. From Lemma 1, the limit of $D_t(\cdot, \vec f)$ exists and is finite $f$-a.s. Hence,

$$A \cap R^{\vec f}_{D,\epsilon} = \Big\{\omega \in A : \frac{1}{1 + D_f g(\omega)} < \epsilon \Big\} \subset \Big\{\omega : D_f g(\omega) \ge \frac{1-\epsilon}{\epsilon}\Big\}.$$

Thus, applying Kavaler and Smorodinsky [11], Lemma 2, part b, we obtain

$$f(A \cap R^{\vec f}_{D,\epsilon}) \le \left(\frac{\epsilon}{1-\epsilon}\right) g(A \cap R^{\vec f}_{D,\epsilon}).$$

Similarly, by applying Kavaler and Smorodinsky [11], Lemma 2, part a, we show that

$$g(A \cap L^{\vec f}_{D,1-\epsilon}) \le \left(\frac{\epsilon}{1-\epsilon}\right) f(A \cap L^{\vec f}_{D,1-\epsilon}),$$

and $D$ is error-free. $\square$

Proposition 2. $D$ is reasonable.

Proof. Let $\vec f \in F \times F$, $\epsilon \in (0, \tfrac{1}{2})$, and let $A$ be a measurable set, and suppose (w.l.o.g.) that

$$g(A) > 0 \ \text{ and } \ f(A) < \left(\frac{\epsilon}{1-\epsilon}\right) g(A). \tag{3}$$

Denote $A_1 := (A \cap L^{\vec f}_{D,\epsilon}) \cup (A \cap \{\omega : D(\omega, \vec f) = \epsilon\})$ and $A_2 := (A \cap R^{\vec f}_{D,\epsilon})$, and observe that $A = A_1 \cup A_2$. Assume by contradiction that $f(A_2) = 0$ and notice that, by construction,

$$A_1 \subset \Big\{\omega : \lim_t D_t(\omega, \vec f) = \frac{1}{1 + D_f g(\omega)} \ge \epsilon \Big\}.$$

Thus, applying Kavaler and Smorodinsky [11], Lemma 2, part a, together with $f(A) = f(A_1)$, we obtain that $f(A) \ge \left(\frac{\epsilon}{1-\epsilon}\right) g(A)$, which contradicts (3), and hence $f(A_2) > 0$. By similar considerations we show that $f(A \cap L^{\vec f}_{D,\epsilon}) > 0$ whenever $f(A) > 0$ and $g(A) < \left(\frac{1-\epsilon}{\epsilon}\right) f(A)$ for $\epsilon \in (\tfrac{1}{2}, 1)$. Hence $D$ is reasonable. $\square$

Propositions 1 and 2 jointly prove our first main theorem:

Theorem 1. $D$ is an anonymous, reasonable and error-free test.

We now turn to show that the finite derivative test is essentially the unique anonymous cardinal comparison test that is reasonable and error-free.
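The finite derivative test of this section can be sketched in a few lines. The sketch takes as input the probabilities $f(\omega^t)$ and $g(\omega^t)$ that the two induced measures assign to the realized prefix; the i.i.d. helper used to produce them and the sample prefix are hypothetical. The value $0.5$ returned when both probabilities are zero is an assumption: it is the value that anonymity forces in that case.

```python
def finite_derivative_test(f_prob: float, g_prob: float) -> float:
    """Rank of the first forecaster, D_{t+1}, given f(omega^t) and g(omega^t)."""
    if f_prob > 0.0 or g_prob > 0.0:
        return f_prob / (f_prob + g_prob)
    # Both measures assign zero probability to the prefix: no preference.
    return 0.5


def iid_prefix_probability(prefix, p_one):
    """Probability of a 0/1 prefix under an i.i.d. forecast p_one for outcome 1."""
    prob = 1.0
    for outcome in prefix:
        prob *= p_one if outcome == 1 else 1.0 - p_one
    return prob


prefix = [1, 1, 0, 1]
f_prob = iid_prefix_probability(prefix, 0.5)  # first forecaster: fair coin
g_prob = iid_prefix_probability(prefix, 0.8)  # second forecaster: 80% on outcome 1

rank_f = finite_derivative_test(f_prob, g_prob)
rank_g = finite_derivative_test(g_prob, f_prob)  # roles swapped
```

Swapping the roles of the two forecasters gives complementary ranks (anonymity), the ratio `rank_f / rank_g` equals the likelihood ratio `f_prob / g_prob`, and `rank_f` coincides with $1/(1 + D^t_f g)$ as in the proof of Lemma 1.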

4. Uniqueness

Although there may be other error-free and reasonable cardinal comparison tests, they are essentially equivalent to the finite derivative test. To motivate this idea, consider the following example.

Example 1.

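Consider the realization $\tilde\omega := (1, 1, 1, 1, 1, \ldots)$ and two forecasters $\tilde f$ and $\tilde g$, both using a coin to make predictions. $\tilde f$ uses a fair coin whereas $\tilde g$ uses a biased coin with probability one for the outcome to be 1. Let $\overrightarrow{h}^t$ be the history of length $t$ induced by $(\tilde\omega, \tilde f, \tilde g)$ and let $\overleftarrow{h}^t$ be the one induced by $(\tilde\omega, \tilde g, \tilde f)$. Let $c > 1$ and consider the test

$$T_t(\omega, \vec f) = \begin{cases} D_t(\omega, \vec f), & \text{otherwise}, \\ c \cdot D^t_f g(\omega), & (\omega, \vec f)^t = \overrightarrow{h}^t, \\ 1 - c \cdot D^t_g f(\omega), & (\omega, \vec f)^t = \overleftarrow{h}^t. \end{cases}$$

Hence, the propensities of $T$ differ from those provided by $D$ only along the play paths $\overrightarrow{h}$, $\overleftarrow{h}$, in which case the limit of $T$ converges more slowly to $1$, $0$, respectively, than $D$.

Claim 1. $T$ is an anonymous, error-free and reasonable test.

Proof. Let $\vec f := (f, g) \in F \times F$, $\epsilon \in (0, \tfrac{1}{2})$, and let $A$ be a measurable set. Recall that $\vec f$ and $\tilde\omega$ induce a unique play path, $(\tilde\omega, \vec f)$. Thus, if $\tilde\omega \notin A \cap R^{\vec f}_{T,\epsilon}$, or if $\tilde\omega \in A \cap R^{\vec f}_{T,\epsilon}$ and $(\tilde\omega, \vec f) \ne \overleftarrow{h}$, then the construction yields $A \cap R^{\vec f}_{T,\epsilon} = A \cap R^{\vec f}_{D,\epsilon}$. In addition, note that $(\tilde\omega, \vec f) = \overrightarrow{h}$ implies that $T_t(\tilde\omega, \vec f) \longrightarrow 1$, in which case $\tilde\omega \notin R^{\vec f}_{T,\epsilon}$. If, on the other hand, $\tilde\omega \in A \cap R^{\vec f}_{T,\epsilon}$ and $(\tilde\omega, \vec f) = \overleftarrow{h}$, then $T_t(\tilde\omega, \vec f) \longrightarrow c D^t_g f(\tilde\omega) \longrightarrow 0$, in which case $f(\tilde\omega) = \tilde f(\tilde\omega) = 0$. Since by Proposition 1 $D$ is error-free, the following is obtained:

$$f(A \cap R^{\vec f}_{T,\epsilon}) = f\big((A \setminus \tilde\omega) \cap R^{\vec f}_{T,\epsilon}\big) = f\big((A \setminus \tilde\omega) \cap R^{\vec f}_{D,\epsilon}\big) \le \left(\frac{\epsilon}{1-\epsilon}\right) g\big((A \setminus \tilde\omega) \cap R^{\vec f}_{D,\epsilon}\big) \le \left(\frac{\epsilon}{1-\epsilon}\right) g(A \cap R^{\vec f}_{T,\epsilon}).$$

The case for which $\epsilon \in (\tfrac{1}{2}, 1)$ is analogous and hence omitted. We therefore conclude that $T$ is error-free.

To see why $T$ is reasonable, assume that $g(A) > 0$ and $f(A) \le \left(\frac{\epsilon}{1-\epsilon}\right) g(A)$. Similar considerations show that either $\tilde\omega \in A \cap R^{\vec f}_{T,\epsilon}$ and $(\tilde\omega, \vec f) = \overleftarrow{h}$, in which case $g(A \cap R^{\vec f}_{T,\epsilon}) = \tilde g(\tilde\omega) = 1$, or $g(A \cap R^{\vec f}_{T,\epsilon}) = g(A \cap R^{\vec f}_{D,\epsilon}) > 0$, since by Proposition 2 $D$ is a reasonable test. The proof for $\epsilon \in (\tfrac{1}{2}, 1)$ is analogous. Finally, by construction, the anonymity of $D$ implies the anonymity of $T$. $\square$

To capture the concept of equivalence, we introduce the following equivalence relation over tests: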

Definition 5.

Let $\vec f := (f, g) \in F \times F$. We say that $T \sim_{\vec f} \hat T$ if

$$f(\{\omega : T(\omega, \vec f) \ne \hat T(\omega, \vec f)\}) = g(\{\omega : T(\omega, \vec f) \ne \hat T(\omega, \vec f)\}) = 0.$$

We say that $T \sim \hat T$ if and only if $T \sim_{\vec f} \hat T$ for all $\vec f$. That is, two tests are equivalent if and only if, given an ordered pair of forecasting strategies, there is zero probability according to each forecaster that the tests will converge to different propensities.

Claim 2.

The relation $\sim$ is an equivalence relation on $\top := \{T : T \text{ is a cardinal comparison test}\}$.

The proof of Claim 2 is relegated to Appendix B. The next theorem asserts that, up to an equivalence class representative, there exists a unique anonymous, reasonable and error-free cardinal comparison test. That is, any anonymous test $T \nsim D$ which is reasonable admits an error. To this end, we will show that any $T \nsim D$ can be associated with a pair of forecasting strategies for which the error-free condition fails. More importantly, the power of the theorem stems from the fact that $T$ admits an error at any pair $\vec f$ for which $T \nsim_{\vec f} D$.

Before proceeding, we observe that Definition 5 can be stated equivalently by the next lemma, which is invoked in the proof of the uniqueness theorem.

Lemma 2.

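Let $\vec f := (f, g) \in F \times F$. Then $T \sim_{\vec f} \hat T$ if and only if for all $\epsilon \in (0, 1) \cap \mathbb{Q}$,

$$f\big((L^{\vec f}_{T,\epsilon} \cap R^{\vec f}_{\hat T,\epsilon}) \cup (L^{\vec f}_{\hat T,\epsilon} \cap R^{\vec f}_{T,\epsilon})\big) = g\big((L^{\vec f}_{T,\epsilon} \cap R^{\vec f}_{\hat T,\epsilon}) \cup (L^{\vec f}_{\hat T,\epsilon} \cap R^{\vec f}_{T,\epsilon})\big) = 0.$$

The proof of Lemma 2 is relegated to Appendix B.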

Theorem 2.

Let $T$ be an anonymous and reasonable cardinal comparison test. If $T \nsim D$ then $T$ is not error-free.

Proof. Assume by contradiction that $T$ is error-free. Let $\vec f := (f, g)$ be such that $T \nsim_{\vec f} D$; then from Lemma 2 there exists $\epsilon \in (0, 1)$ such that (w.l.o.g. for $f$)

$$f\big((L^{\vec f}_{D,\epsilon} \cap R^{\vec f}_{T,\epsilon}) \cup (L^{\vec f}_{T,\epsilon} \cap R^{\vec f}_{D,\epsilon})\big) > 0.$$

We shall consider the following cases, each of which results in a contradiction.

Case 1: $f(L^{\vec f}_{D,\epsilon} \cap R^{\vec f}_{T,\epsilon}) > 0$. Assume that $\epsilon \in (\tfrac{1}{2}, 1)$ and observe that since

$$L^{\vec f}_{D,\epsilon} = \bigcup_{n \in \mathbb{N} : n > \lceil \frac{1}{1-\epsilon} \rceil} L^{\vec f}_{D,\epsilon + \frac{1}{n}}$$

and $f$ is monotone with respect to inclusion, there exists $\epsilon' > \epsilon$ such that $f(\hat A := L^{\vec f}_{D,\epsilon'} \cap R^{\vec f}_{T,\epsilon}) > 0$. By Proposition 1, $D$ is error-free, where $\hat A \subset L^{\vec f}_{D,\epsilon'}$; hence

$$g(\hat A) \le \left(\frac{1-\epsilon'}{\epsilon'}\right) f(\hat A) < \left(\frac{1-\epsilon}{\epsilon}\right) f(\hat A).$$

In addition, by assumption $T$ is reasonable; hence $f(\hat A \cap L^{\vec f}_{T,\epsilon}) > 0$, which yields a contradiction since $R^{\vec f}_{T,\epsilon}$ and $L^{\vec f}_{T,\epsilon}$ are disjoint sets.

For $\epsilon \in (0, \tfrac{1}{2})$ note that

$$R^{\vec f}_{T,\epsilon} = \bigcup_{n \in \mathbb{N} : n > \lceil \frac{1}{\epsilon} \rceil} R^{\vec f}_{T,\epsilon - \frac{1}{n}},$$

and hence there exists $\epsilon' < \epsilon$ such that $f(\hat A := L^{\vec f}_{D,\epsilon} \cap R^{\vec f}_{T,\epsilon'}) > 0$. By assumption $T$ is an error-free test; hence

$$f(\hat A) \le \left(\frac{\epsilon'}{1-\epsilon'}\right) g(\hat A) < \left(\frac{\epsilon}{1-\epsilon}\right) g(\hat A).$$

In addition, by Proposition 2, $D$ is reasonable; hence $g(\hat A \cap R^{\vec f}_{D,\epsilon}) > 0$, which yields a contradiction since $R^{\vec f}_{D,\epsilon}$ and $L^{\vec f}_{D,\epsilon}$ are disjoint sets.

Case 2: $f(\hat A := L^{\vec f}_{T,\epsilon} \cap R^{\vec f}_{D,\epsilon}) > 0$. Assume (w.l.o.g.) that $\epsilon \in (\tfrac{1}{2}, 1)$. By assumption $T$ is an error-free test and, by Proposition 2, $D$ is reasonable; therefore, the contradiction $g(\hat A \cap L^{\vec f}_{T,\epsilon}) > 0$ follows analogously to Case 1 and is hence omitted. $\square$

5. Independence of axioms

The notions of error-freeness and reasonableness which were introduced in Subsection 2.2 are not related. Obviously, as the sets $\{T = \tfrac{1}{2}\}$, $R^{\vec f}_{T,\epsilon}$, $L^{\vec f}_{T,1-\epsilon}$ are disjoint, inequality (1) is satisfied trivially, and hence the constant fair test, $T_t(\omega, \vec f) \equiv 1/2$, is error-free and is not reasonable. Using the result of Theorem 1, the next example illustrates that reasonableness does not imply error-freeness.

Example 2. Let $\overrightarrow{h}$, $\overleftarrow{h}$ be play paths composed of the realization $\tilde\omega := (1, 1, 1, 1, 1, \ldots)$ and pairs of forecasts along $\tilde\omega$ which, from day two onward, coincide and follow an i.i.d. distribution with parameter 1, whereas on day one, one forecast assigns 1 to the outcome 1 and the other assigns one half. Let $\overrightarrow{h}^t$, $\overleftarrow{h}^t$ be the corresponding uniquely induced $t$-histories and consider the following test:

$$T_t(\omega, \vec f) = \begin{cases} D_t(\omega, \vec f), & \text{otherwise}, \\ 0, & (\omega, \vec f)^t = \overrightarrow{h}^t, \\ 1, & (\omega, \vec f)^t = \overleftarrow{h}^t. \end{cases}$$

Claim 3. $T$ is anonymous and reasonable but is not error-free.

Proof. Since $D$ is anonymous and $T_t(\omega, f, g) = 1 - T_t(\omega, g, f)$ whenever $(\omega, \vec f)^t$ equals $\overrightarrow{h}^t$ or $\overleftarrow{h}^t$, it follows that $T$ is anonymous. Further, let $\vec{\tilde f} = (\tilde f, \tilde g)$ be such that $(\tilde\omega, \vec{\tilde f}) = \overrightarrow{h}$. Since $\{\tilde\omega\} = R^{\vec{\tilde f}}_{T,\epsilon}$ for all $\epsilon \in (0, \tfrac{1}{2})$, one has

$$\tilde f\big(\{\tilde\omega\} \cap R^{\vec{\tilde f}}_{T,\frac{1}{3}}\big) = 1 > \frac{1}{2}\, \tilde g\big(\{\tilde\omega\} \cap R^{\vec{\tilde f}}_{T,\frac{1}{3}}\big),$$

and hence $T$ is not error-free.

To verify that $T$ is a reasonable test, let $\vec f := (f, g)$, let $A$ be a measurable set, and let $\epsilon \in (0, \tfrac{1}{2})$. If $(\tilde\omega, \vec f) \ne \overrightarrow{h}$ and $(\tilde\omega, \vec f) \ne \overleftarrow{h}$, then by construction $T_t(\cdot, \vec f) \equiv D_t(\cdot, \vec f)$, and since by Proposition 2 $D$ is a reasonable test, condition (2) is satisfied. Now, observe that if $(\tilde\omega, \vec f) = \overrightarrow{h}$ and $\tilde\omega \in A$ then $f(A) = 1 > \frac{\epsilon}{1-\epsilon}\, g(A)$, which rules out the left-hand side of condition (2). If, on the other hand, $(\tilde\omega, \vec f) = \overrightarrow{h}$ and $\tilde\omega \notin A$, then $g(A) > 0$ implies $\hat\omega := (0, 1, 1, 1, 1, \ldots) \in A$ where $g(A \setminus \hat\omega) = 0$, in which case $f(A) = 0$ and $T_t(\hat\omega, \vec f) \equiv D_t(\hat\omega, \vec f) = 0$. Hence, since by Proposition 2 $D$ is reasonable, we have $g(\hat\omega \cap R^{\vec f}_{T,\epsilon}) = g(A \cap R^{\vec f}_{T,\epsilon}) = g(A \cap R^{\vec f}_{D,\epsilon}) > 0$. For the remaining case, note that if $(\tilde\omega, \vec f) = \overleftarrow{h}$ then either $\tilde\omega \in A$, in which case $f(A) \ge \tfrac{1}{2} > \frac{\epsilon}{1-\epsilon}\, g(A)$, or $\tilde\omega \notin A$, yielding that $g(A) = 0$. Since the case for which $\epsilon \in (\tfrac{1}{2}, 1)$ is proven analogously, the result follows. $\square$

Al-Najjar and Weinstein [2] introduce an alternative cardinal comparison test:

$$L_t(\omega, f, g) = \begin{cases} 0, & \dfrac{g(\omega^t)}{f(\omega^t)} > 1, \\[2mm] 0.5, & \text{otherwise}, \\[1mm] 1, & \dfrac{g(\omega^t)}{f(\omega^t)} < 1. \end{cases}$$

Note that this test differs from $D$ whenever the likelihood ratio is high but finite. In our case, the test does not prefer any expert but provides a relative ranking, whereas the likelihood ratio test, $L$, does.

Claim 4. $L$ is anonymous and reasonable and is not error-free.

Proof. Let $\epsilon \in (\tfrac{1}{2}, 1)$, let $g$ be a forecasting strategy which deterministically predicts $\tilde\omega$, and let $f$ be such that it predicts $(1 - \epsilon)$ at day one and meets $g$ from day two onward regardless of any past history. Note that whenever $f$ is assumed to be the true measure, the likelihood ratio satisfies $\frac{g(\tilde\omega^t)}{f(\tilde\omega^t)} = \frac{1}{1-\epsilon} > 1$ for all $t > 0$, in which case $g$ is deterministically ranked by 1 along $(\tilde\omega, \vec f)$, yielding $\tilde\omega \in L^{\overrightarrow{f}}_{L,\epsilon}$. A simple calculation shows that

$$g\big(\tilde\omega \cap L^{\overrightarrow{f}}_{L,\epsilon}\big) > \frac{1-\epsilon}{\epsilon}\, f\big(\tilde\omega \cap L^{\overrightarrow{f}}_{L,\epsilon}\big),$$

as $g(\tilde\omega) = 1$. Since $\epsilon$ is taken arbitrarily, the following important conclusion can be drawn: not only is $L$ not error-free, but it admits an arbitrarily large error. The fact that $L$ is reasonable follows directly from Proposition 2. $\square$

6. Decisiveness in ﬁnite time

In this section we provide a natural sufficient condition under which a tester achieves a higher level of confidence in favor of the informed forecaster, with any desired degree of precision, in some fixed finite time. To this end, we show the existence of a uniform bound on the rate at which a cardinal comparison test converges. Consider expert $f$'s point of view. Not only should he maintain that, whenever expert $g$'s forecasts are different from his, he should eventually be ranked higher than $g$, but also that, if expert $g$'s forecasts are relatively far from his, this should happen essentially uniformly fast. Indeed, as we show in this section, this holds for our finite derivative test. This observation builds closely on the theory of active supermartingales due to Fudenberg and Levine [10].

To determine whether a test is 'almost' certain about a forecaster requires the two forecasters to provide significantly different forecasts, as captured by the following definition:

Definition 6. A pair of forecasting strategies $\vec f := (f, g)$ is $\epsilon$-close along $\omega$ at period $t > 0$ if
$$\left| f((\omega,\vec f)^{t-1})[\omega_t] - g((\omega,\vec f)^{t-1})[\omega_t] \right| < \epsilon.$$

The next theorem asserts that, given an arbitrarily small $\epsilon > 0$, there exists a finite uniform bound $K$, independent of the pair of forecasting strategies, such that if the forecasts of the uninformed expert are sufficiently different from those of the informed one in more than $K$ periods, then the finite derivative test, $\mathcal D$, will eventually settle on the informed expert with a high level of confidence. In that scenario it furthermore asserts, perhaps surprisingly, that for any sufficiently large time $n$, $\mathcal D_n$ ranks the informed expert higher than $(1-\epsilon)$ and within $\epsilon$ of the rank it would assign if the test were continued indefinitely.

Theorem 3.

For all $0 < \epsilon < 1$ there exists $K = K(\epsilon)$ such that for all $\vec f := (f, g)$ and for all $n > 0$, there is a set whose probability according to $f$ is at least $(1-\epsilon)$ such that for any $\omega$ in that set:
1. Either $\vec f$ is $\epsilon$-close along $\omega$ in all but $K$ periods in $\{1, \dots, n\}$, or
2. $\omega \in L^{\vec f}_{\mathcal D, 1-\epsilon}$. Furthermore, $|\mathcal D_t(\omega,\vec f) - \mathcal D_n(\omega,\vec f)| < \epsilon$ for all $t \ge n$.

In words, with high probability, given any sufficiently large $n$ and any sufficiently small $\epsilon$, the only reason that the tester is not 'almost' settled on the correct forecaster at time $n$ is that the uninformed expert made excellent predictions along the play path. Moreover, Theorem 3 is universal in the following sense: the bound $K$ on the number of periods in which the two experts' forecasts must differ, for the finite derivative test to rank the informed one higher, depends on the required level of accuracy but is independent of the pair of forecasting strategies $f$ and $g$.

The proof of Theorem 3 is relegated to Appendix B. Nevertheless, let us briefly provide some technical intuition. At the heart of the proof of Theorem 3 lies a theorem due to Fudenberg and Levine [10] regarding the rate of decrease of active supermartingales. Consider an abstract setting with a probability measure $P \in \Delta(\Omega^\infty)$ and a filtration $\{\mathcal G_t\}_{t=1}^{\infty}$.

Definition 7.

A $(\mathcal G_t)$-adapted, real-valued process $\tilde D := \{\tilde D_t\}_{t=0}^{\infty}$ is called a supermartingale under $P$ if
1. $E|\tilde D_t| < \infty$ for all $t > 0$;
2. $E[\tilde D_t \mid \mathcal G_s] \le \tilde D_s$ for all $s \le t$, $P$-a.s.

Intuitively, a supermartingale is a process that decreases on average. The proof of Theorem 3 implies that the finite derivative test is associated with a supermartingale with respect to the natural filtration defined in Section 2. Let us further consider the following class of supermartingales, called active supermartingales. This notion was first introduced by Fudenberg and Levine [10], who studied reputations in infinitely repeated games:

Definition 8. A non-negative supermartingale $\tilde D$ is active with activity $\psi \in (0,1)$ under $P$ if
$$P\left(\left\{\omega : \left|\frac{\tilde D_t(\omega)}{\tilde D_{t-1}(\omega)} - 1\right| > \psi\right\} \,\middle|\, \tilde\omega^{t-1}\right) > \psi$$
for $P$-almost all histories $\tilde\omega^{t-1}$ such that $\tilde D_{t-1}(\tilde\omega) > 0$. In words, a supermartingale has activity $\psi$ if the probability of a jump of size $\psi$ at time $t$ exceeds $\psi$ for almost all histories.

Note that $\tilde D$, being a supermartingale, is weakly decreasing in expectation. Showing that it is active implies that $\tilde D_t$ goes substantially up or down relative to $\tilde D_{t-1}$ with probability bounded away from zero in each period. Fudenberg and Levine [10], Theorem A.1, showed the following remarkable result.

Theorem 4 (Fudenberg and Levine [10]). For every $\epsilon > 0$, $\psi \in (0,1)$, and $0 < \underline{D} < 1$ there is a time $K < \infty$ such that
$$P\left(\left\{\omega : \sup_{t > K} \tilde D_t(\omega) \le \underline{D}\right\}\right) \ge 1 - \epsilon$$
for every active supermartingale $\{\tilde D_t\}$ with $\tilde D_0 \equiv 1$ and activity $\psi$.

Theorem 4 asserts that if $\tilde D$ is an active supermartingale with activity $\psi$, then there is a fixed time $K$ by which, with high probability, $\tilde D_t$ drops below $\underline{D}$ and remains below $\underline{D}$ for all future periods. The power of the theorem stems from the fact that the bound $K$ depends solely on the parameters $\epsilon > 0$, $\psi$ and $\underline{D}$, and is otherwise independent of the underlying stochastic process $P$.

We exploit the active supermartingale property in a different way. In the context of cardinal comparison testing, we consider two strategies, one for each expert, which are updated using Bayes' rule. Given a sufficiently small $\epsilon > 0$, our comparative test ranks an expert depending on whether the posterior odds ratio is above or below $\epsilon$. The active supermartingale result implies that there is a uniform bound (independent of both the length of the game and the true distribution) on the number of periods in which the uninformed expert can be substantially wrong without being detected, such that if this bound is exceeded, the probability that the tester ranks the uninformed expert high is small.
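To make the finite-time behavior concrete, the following minimal Python sketch simulates the finite derivative test along a single play path. It assumes the form $\mathcal D_t = 1/(1 + D^t_{fg})$ with $D^t_{fg} = g(\omega^t)/f(\omega^t)$, as used in the proof of Theorem 3, and, purely for illustration, two constant (iid) forecasters with parameters `a_f` and `a_g` strictly between 0 and 1. The function names, parameter values and seed are hypothetical choices and are not part of the paper.

```python
import math
import random


def rank_from_log_ratio(log_ratio):
    """Return 1 / (1 + exp(log_ratio)), computed without overflow."""
    if log_ratio >= 0:
        e = math.exp(-log_ratio)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(log_ratio))


def finite_derivative_path(outcomes, a_f, a_g):
    """Ranks D_1, ..., D_n assigned to f when both experts forecast constants.

    a_f and a_g are the (assumed) constant probabilities of outcome 1 and
    must lie strictly between 0 and 1 so that every logarithm is finite.
    """
    log_ratio = 0.0  # log of the likelihood ratio g(w^t) / f(w^t)
    path = []
    for w in outcomes:
        p_f = a_f if w == 1 else 1.0 - a_f
        p_g = a_g if w == 1 else 1.0 - a_g
        log_ratio += math.log(p_g) - math.log(p_f)
        path.append(rank_from_log_ratio(log_ratio))
    return path


if __name__ == "__main__":
    rng = random.Random(0)        # fixed seed, illustrative only
    a_f, a_g, n = 0.7, 0.5, 2000  # f plays the informed expert here
    outcomes = [1 if rng.random() < a_f else 0 for _ in range(n)]
    path = finite_derivative_path(outcomes, a_f, a_g)
    for t in (10, 100, 1000, 2000):
        print(t, round(path[t - 1], 6))
```

When `a_f` and `a_g` differ and the outcomes are drawn from `a_f`, the log-likelihood ratio drifts downward on average, so the printed ranks should approach 1. Drawing the outcomes from `a_g` instead should push them toward 0.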

7. Concluding remarks

The paper proposes a normative approach to the challenge of comparing two forecasters who repeatedly provide probabilistic forecasts. The paper postulates three basic norms: anonymity, error-freeness and reasonableness, and provides a cardinal comparison test, the finite derivative test, that complies with them. It also shows that this test is essentially unique. Finally, it shows that the test converges fast and hence is meaningful in finite time. In the future we hope to extend our results to settings with more than two forecasters and to study alternative sets of norms.

The approach taken in this paper can be considered a contribution to the hypothesis testing literature in statistics, where a forecaster is associated with a hypothesis. In this context we propose a hypothesis test that complies with a set of fundamental properties which we refer to as axioms. In contrast, a central thrust of the hypothesis testing literature (for two hypotheses) is the pair of notions of significance level and power of a test. In that literature one hypothesis is considered as the null hypothesis while the other serves as an alternative. A test is designed to either reject the null hypothesis, in which case it accepts the alternative, or fail to reject it (a binary outcome). The significance level of a test is the probability of rejecting the null hypothesis whenever it is correct (type-1 error), while the power of the test is the probability of rejecting the null hypothesis assuming the alternative one is correct (the complement of a type-2 error).

In contrast with the aforementioned binary outcome that is prevalent in the hypothesis testing literature we allow, in addition, for an inconclusive outcome. Recall the celebrated Neyman-Pearson lemma, which characterizes a test with the maximal power subject to an upper bound on the significance level. The possibility of an inconclusive (ranking) outcome, in our framework, allows us to design a test where both type-1 and type-2 errors have relatively low probability.

Interestingly, the test proposed in the Neyman-Pearson lemma, similar to ours, also hinges on the likelihood ratio. In our approach we, a priori, treat both hypotheses symmetrically. In the statistics literature, however, this is not the case, and the null hypothesis is, in some sense, the status quo hypothesis. This asymmetry is manifested, for example, in the Neyman-Pearson lemma. Note that in order to design a test that complies with a given significance level and a given power one must know the full specification of the two hypotheses. This is in contrast with our test, which is universal in the sense that it does not rely on the specifications of the two forecasts. Finally, let us comment that whereas hypothesis testing is primarily discussed in the context of a finite sample, typically from some iid distribution, our framework allows for sequences of forecasts that depend on past outcomes as well as on past forecasts of the other expert.
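The role of the inconclusive outcome can be illustrated numerically. The sketch below is a hypothetical Monte Carlo experiment, not a procedure taken from the paper. It draws outcomes from a constant forecaster $f$, computes the rank $1/(1 + g(\omega^n)/f(\omega^n))$ at a fixed horizon $n$, and labels a run as favoring $f$ if the rank exceeds $1-\epsilon$, as favoring $g$ if it falls below $\epsilon$, and as inconclusive otherwise. The thresholds, the constant forecasts and all function names are assumptions made for this example.

```python
import math
import random


def rank_at_horizon(a_f, a_g, n, rng):
    """Rank of f after n outcomes drawn from f (constant forecasts in (0, 1))."""
    log_ratio = 0.0  # log of g(w^n) / f(w^n)
    for _ in range(n):
        w = 1 if rng.random() < a_f else 0
        p_f = a_f if w == 1 else 1.0 - a_f
        p_g = a_g if w == 1 else 1.0 - a_g
        log_ratio += math.log(p_g) - math.log(p_f)
    # 1 / (1 + exp(log_ratio)), written to avoid overflow
    if log_ratio >= 0:
        e = math.exp(-log_ratio)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(log_ratio))


def classify(rank, eps):
    """Three-way verdict with an explicit inconclusive region."""
    if rank > 1.0 - eps:
        return "f"
    if rank < eps:
        return "g"
    return "inconclusive"


def outcome_frequencies(a_f, a_g, n, eps, runs, seed=0):
    """Empirical frequency of each verdict when f generates the data."""
    rng = random.Random(seed)
    counts = {"f": 0, "g": 0, "inconclusive": 0}
    for _ in range(runs):
        counts[classify(rank_at_horizon(a_f, a_g, n, rng), eps)] += 1
    return {k: v / runs for k, v in counts.items()}


if __name__ == "__main__":
    for n in (10, 50, 400):
        print(n, outcome_frequencies(0.7, 0.5, n, eps=0.05, runs=1000))
```

At short horizons most runs should land in the inconclusive region, and the mass should move toward the verdict for $f$ as $n$ grows. As a sanity check on the wrong verdict, note that under $f$ the likelihood ratio $g(\omega^n)/f(\omega^n)$ has expectation at most 1, so by Markov's inequality the probability that the rank falls below $\epsilon$ is at most $\epsilon/(1-\epsilon)$ at every horizon.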

Acknowledgments

Smorodinsky gratefully acknowledges the United States-Israel Binational Science Foundation and the National Science Foundation (grant 2016734), the German-Israel Foundation (grant I-1419-118.4/2017), the Ministry of Science and Technology (grant 19400214), the Technion VPR grants, and the Bernard M. Gordon Center for Systems Engineering at the Technion.

Appendix A: On ideal tests

Recall that an error-free test eliminates the necessity of pointing out the less informed expert. A stronger and more appealing property is to point out the better informed expert, in which case the tester eventually settles on one forecaster as being better than the other. We consider tests that exhibit such a property as ideal. Formally,

Definition 9. $T$ is decisive on $f$ at $(\omega, \vec f)$ (respectively, on $g$) if $T_t(\omega, \vec f) \to 1$ (respectively, $(1 - T_t(\omega, \vec f)) \to 1$). For any $T, \vec f$, we denote by
$$A^{\vec f}_{T,f} := \{\omega : T \text{ is decisive on } f \text{ at } (\omega, \vec f)\}$$
the measurable set of realizations (in $L^{\vec f}_{T,N}$) for which $T$ is decisive on $f$ at $(\omega, \vec f)$.

Definition 10. A test $T$ is ideal with respect to $A \subseteq F$ if for all $\vec f := (f, g \neq f) \in A \times A$,
$$f(A^{\vec f}_{T,f}) = g(A^{\vec f}_{T,g}) = 1.$$
It is called ideal if it is ideal with respect to $F$.

In other words, whenever the left expert knows the actual data-generating process and the right expert does not, an ideal test will surely identify the informed expert.

Trivially, any ideal test with respect to a subset of forecasts $A$ is also error-free with respect to the same set. The following is a straightforward corollary of Theorem 1.

Corollary 1. There exists no ideal test with respect to a set of forecasts $A$ whenever it contains two forecasts which induce measures, one of which is absolutely continuous with respect to the other.

This immediately entails:

Corollary 2. There exists no ideal test.

However, whenever $A$ contains no such pair of forecasts, an ideal test does exist. To prove this we must first define precisely the notion of mutually singular forecasts.

Definition 11. Two forecasting strategies, $\vec f = (f, g \neq f) \in F \times F$, are said to be mutually singular with respect to each other if there exist two disjoint sets $C^{\vec f}_f, C^{\vec f}_g \subset (\Omega \times \Delta(\Omega) \times \Delta(\Omega))^{\infty}$ such that
$$f(\{\omega : (\omega, \vec f) \in C^{\vec f}_f\}) = g(\{\omega : (\omega, \vec f) \in C^{\vec f}_g\}) = 1.$$
A set $A \subseteq F$ is pairwise mutually singular if for all $\vec f = (f, g \neq f) \in A \times A$, $f$ and $g$ are mutually singular with respect to each other.

The next lemma asserts that a reasonable test is able to perfectly distinguish between measures that are far apart, namely those induced by forecasting strategies which are mutually singular with respect to each other.

Lemma 3. Let $f, g \neq f \in F$ be mutually singular with respect to each other. If $T$ is reasonable then
$$f(A^{\vec f}_{T,f}) = g(A^{\vec f}_{T,g}) = 1.$$

The proof of Lemma 3 is relegated to Appendix B. It should be noted that Lemma 3 holds even for $T$ which is not error-free.

The next theorem provides a necessary and sufficient condition for the existence of an ideal test over sets.

Theorem 5. There exists an anonymous ideal test with respect to $A$ if and only if $A$ is pairwise mutually singular.

Proof. ($\Leftarrow$) Follows directly from Lemma 3 and Proposition 2.

($\Rightarrow$) Let $T$ be an ideal anonymous test with respect to a set $A$. Let $\vec f := (f, g \neq f) \in A \times A$ and denote
$$C^T_f := \{(\omega, \vec f) : \omega \in A^{\vec f}_{T,f}\}, \qquad C^T_g := \{(\omega, \vec f) : \omega \in A^{\vec f}_{T,g}\}.$$
Since $A^{\vec f}_{T,f}$ and $A^{\vec f}_{T,g}$ are disjoint, it follows that $C^T_f$ and $C^T_g$ are disjoint, where $T$ being ideal yields
$$f(\{\omega : (\omega, \vec f) \in C^T_f\}) = f(A^{\vec f}_{T,f}) = g(A^{\vec f}_{T,g}) = g(\{\omega : (\omega, \vec f) \in C^T_g\}) = 1. \qquad \square$$

We conclude the paper with an example of an ideal test over a domain of mutually singular forecasts:

Example 3. Let
$$A_{iid} \times A_{iid} := \{\vec f := (f, g) : \text{there exist } a_f, a_g \in [0,1] \text{ s.t. } f(\omega^t)[1] \equiv a_f,\ g(\omega^t)[1] \equiv a_g \text{ for all } \omega \in \Omega^{\infty}\}$$
and for $\omega \in \Omega^{\infty}$ denote the average realization by
$$a_\omega := \lim_{t \to \infty} \frac{\sum_{n=1}^{t} 1_{\{\omega_n = 1\}}}{t}$$
(whenever the limit exists). Let $\vec f \in A_{iid} \times A_{iid}$ be such that $a_f \neq a_g$ and observe that for any $\omega$,
$$a_\omega = a_f \iff \lim_{t \to \infty} D^t_{fg}(\omega) = 0 \iff \lim_{t \to \infty} \mathcal D_t(\omega, \vec f) = 1.$$
Since the induced measures $f, g$ are iid with different parameters, a mere application of the law of large numbers yields $f(A^{\vec f}_{\mathcal D,f}) = 1$ and $g(A^{\vec f}_{\mathcal D,f}) = 0$, showing that $\mathcal D$ is ideal with respect to $A_{iid}$.

Appendix B: Missing proofs

Proof of Lemma 2 . Observe that for all ω ∈ { T = ˆ T } there exists ǫ ∈ (0 , ∩ Q such that eitherˆ T ( ω, ~f ) < ǫ < T ( ω, ~f ) or T ( ω, ~f ) < ǫ < ˆ T ( ω, ~f ) . We thus have { ˆ T < T } = [ ǫ ∈ (0 , ∩ Q { ˆ T < ǫ < T } = [ ǫ ∈ (0 , ∩ Q ( L ~fT,ǫ ∩ R ~f ˆ T ,ǫ ) (B.1)as well as { ˆ T > T } = [ ǫ ∈ (0 , ∩ Q { T < ǫ < ˆ T } = [ ǫ ∈ (0 , ∩ Q ( L ~f ˆ T ,ǫ ∩ R ~fT,ǫ ) . (B.2) ⇐ =Assume by contradiction that T ≁ ~f ˆ T and observe that since { T = ˆ T } = { ˆ T < T } ∪ { ˆ T > T } it follows that (w.l.o.g. for f ) either f ( { ˆ T < T } ) > f ( { ˆ T > T } ). If f ( { ˆ T < T } ) > ǫ ′ ∈ (0 , ∩ Q such that f ( L ~fT,ǫ ′ ∩ R ~f ˆ T ,ǫ ′ ) > f ( { ˆ T < T } ) > ⇒ Assume by contradiction that (w.l.o.g. for f ) f (( L ~fT,ǫ ′ ∩ R ~f ˆ T ,ǫ ′ ) ∪ ( L ~f ˆ T,ǫ ′ ∩ R ~fT,ǫ ′ )) > ǫ ′ ∈ (0 , ∩ Q . Therefore, either f ( S ǫ ∈ (0 , ∩ Q ( L ~fT,ǫ ∩ R ~f ˆ T ,ǫ )) > f ( S ǫ ∈ (0 , ∩ Q ( L ~f ˆ T,ǫ ∩ R ~fT,ǫ )) > . In whichcase, using (B.1) and (B.2), we conclude f ( { ˆ T < T } ∪ { ˆ T > T } ) > ∼ ~f ˆ T . (cid:3)

Proof of Claim 2 . Let

T, T , T ∈ ⊤ , ~f ∈ F × F .Reﬂexivity: Applying Lemma 2 it is readily seen that f ( L ~fT,ǫ ∩ R ~fT,ǫ ) = g ( L ~fT,ǫ ∩ R ~fT,ǫ ) = 0 as L ~fT,ǫ , R ~fT,ǫ are disjoint sets for all ǫ ∈ (0 , T ∼ ~f T. Anonymity: From Lemma 2 we obtain that for all ǫ ∈ (0 , ∩ Q ,T ∼ ~f T ⇐⇒ f (( L ~fT ,ǫ ∩ R ~fT ,ǫ ) ∪ ( L ~fT ,ǫ ∩ R ~fT ,ǫ )) = g (( L ~fT ,ǫ ∩ R ~fT ,ǫ ) ∪ ( L ~fT ,ǫ ∩ R ~fT ,ǫ ))= f (( L ~fT ,ǫ ∩ R ~fT ,ǫ ) ∪ ( L ~fT ,ǫ ∩ R ~fT ,ǫ )) = g (( L ~fT ,ǫ ∩ R ~fT ,ǫ ) ∪ ( L ~fT ,ǫ ∩ R ~fT ,ǫ )) ⇐⇒ T ∼ ~f T . Transitivity: Suppose by contradiction that T ∼ ~f T, and T ∼ ~f T where T ≁ ~f T . Then fromLemma 2 (wl.o.g. for f ) we are provided with ¯ ǫ ∈ (0 ,

1) such that f (( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ) ∪ ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ )) > , where for all ǫ ∈ (0 , , T ∼ ~f T = ⇒ f (( L ~fT ,ǫ ∩ R ~fT,ǫ ) ∪ ( L ~fT,ǫ ∩ R ~fT ,ǫ )) = 0 ,T ∼ ~f T = ⇒ f (( L ~fT,ǫ ∩ R ~fT ,ǫ ) ∪ ( L ~fT ,ǫ ∩ R ~fT,ǫ )) = 0 . (B.3) avaler and Smorodinsky: A Cardinal Comparison of Experts Case 1: f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ) > . Note that, f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ) == f (( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ∩ L ~fT, ¯ ǫ ) ∪ ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ∩ ( L ~fT, ¯ ǫ ) c ))= f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ∩ L ~fT, ¯ ǫ ) + f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ∩ R ~fT, ¯ ǫ ) + f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ∩ { T = ¯ ǫ } ) . Thus, if f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ∩ L ~fT, ¯ ǫ ) > , then f ( R ~fT , ¯ ǫ ∩ L ~fT, ¯ ǫ ) > , which contradicts the second conditionof (B.3); otherwise if f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ∩ R ~fT, ¯ ǫ ) >

0, then f ( L ~fT , ¯ ǫ ∩ R ~fT, ¯ ǫ ) > , which contradicts the ﬁrstcondition of (B.3). Otherwise, f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ∩ { T = ¯ ǫ } ) > f ( L ~fT , ¯ ǫ ∩ { T = ¯ ǫ } ) > . In addition, since { L ~fT , ¯ ǫ + n } n> ⌈ − ¯ ǫ ⌉ is increasing to L ~fT , ¯ ǫ and R ~fT, ¯ ǫ ⊂ { R ~fT, ¯ ǫ + n } for all n it followsthat there exists a suﬃciently large n ′ and ǫ ′ = ¯ ǫ + n ′ such that f ( L ~fT ,ǫ ′ ∩ R ~fT,ǫ ′ ) > , which againcontradicts the ﬁrst condition of (B.3).Case 2: f ( L ~fT , ¯ ǫ ∩ R ~fT , ¯ ǫ ) > . The contradiction follows analogously from Case 1 and henceomitted. (cid:3)

Claim 5. If T is reasonable then for all ~f and ǫ ∈ ( , and for all measurable set Af ( A ∩ R ~fT,ǫ ) > ⇒ g ( A ∩ R ~fT,ǫ ) > . (similarly for g where ǫ ∈ (0 , ) ).Proof. Let ~f and ǫ ∈ ( ,

1) and a measurable set A , and (w.l.o.g.) assume by contradiction that f ( A ∩ R ~fT,ǫ ) > ⇒ g ( A ∩ R ~fT,ǫ ) = 0 . Since 0 = g ( A ∩ R ~fT,ǫ ) < ( − ǫǫ ) f ( A ∩ R ~fT,ǫ ) and T is reasonable (2) yields that f ( A ∩ R ~fT,ǫ ∩ L ~fT,ǫ ) > , which contradicts the fact as R ~fT,ǫ , L ~fT,ǫ are disjoint sets. (cid:3) Proof of Lemma 3 . W.l.o.g. let A be such that: f ( A ) = 1 , g ( A ) = 0 , and let ǫ ∈ ( , f ( A ∩ R ~fT,ǫ ) > , T is reasonable, therefore applying Claim (5) with the set A yields g ( A ∩ R ~fT,ǫ ) > g ( A ) = 0 . Hence, f ( A ∩ R ~fT,ǫ ) = 0 . On the other hand, sincethe left-hand side of condition (2) is satisﬁed trivially for A, we are provided with f ( A ∩ L ~fT,ǫ ) ≥ f ( A ∩ L ~fT,ǫ ) > . As a result,1 = f ( A ) = f ( A ∩ R ~fT,ǫ ) + f ( A ∩ L ~fT,ǫ ) + f ( A ∩ { T = ǫ } )and therefore f ( A ∩ L ~fT,ǫ . ) = 1 . Similarly, assuming that f ( B ) = 0 , g ( B ) = 1 we obtain that g ( R ~fT,ǫ . ) = 1 for all ǫ ∈ (0 , ) . Since A ~fT,f ⊂ L ~fT,ǫ for all ǫ ∈ ( ,

1) and L ~fT,ǫ is decreasing as ǫ → f ( A ~fT,f ) = f ( T ǫ L ~fT,ǫ ) = 1 and the result follows. (cid:3) B.1. Decisiveness in ﬁnite time

The proof of Theorem 3 is generalized to the case where the number of elements, $|\Omega|$, is arbitrary, and it relies on achieving a uniform bound on the up-crossing probability of any non-negative supermartingale which admits sufficiently (finitely) many fluctuations.

Proof of Theorem 3.

Let ǫ ∈ (0 , K = K ( ǫ ) such that on the set of histories ω t of f − probability − (1 − ǫ ), and for all n > , onlytwo scenarios are possible; if there exists a subsequence of times ( t i ) K +1 i =1 ⊂ { , ..., n } , and thereexists a subsequence of corresponding outcomes ( ̟ t i ) K +1 i =1 ⊂ Ω K +1 such that | f (( ω, ~f ) t i − )[ ̟ t i ] − g (( ω, ~f ) t i − )[ ̟ t i ] | ≥ ǫ for all 1 ≤ i ≤ K + 1, then, the limit of D is strictly greater than (1 − ǫ ), andmore importantly, the value of D at time n , D n , is ǫ − close for all ranks from time n onward. Inall other scenarios, | f (( ω, ~f ) t − )[ ̟ ] − g (( ω, ~f ) t − )[ ̟ ] | < ǫ for all ̟ ∈ Ω in all but K periods t in { , ..., n } . Construction of the faster process

As in Fudenberg and Levine [10], deﬁne an increasingsequence of stopping times { τ k } ∞ k =0 relative to { D tf g } and ǫ inductively as follows. First set τ = 0and if τ k − ( ω ) = ∞ set τ k ( ω ) = ∞ . If τ k − ( ω ) < ∞ set τ k ( ω ) to be the smallest integer t > τ k − ( ω )such that either f ( ω t − ) > and f ( { ¯ ω ∈ Ω ∞ : | D tf g (¯ ω ) D t − f g (¯ ω )) -1—¿ ǫ | Ω | }| ω t − } ) > ǫ | Ω | (B.1.1)or D tf g ( ω ) D τ k − f g ( ω ) − ≥ ǫ | Ω | . (B.1.2)If there is no such t, set τ k ( ω ) = ∞ . Now deﬁne the process { ˜ D k } ∞ k =0 by ˜ D k = D τ k f g if τ k < ∞ and ˜ D k = 0 if τ k = ∞ . Now, From Fudenberg and Levine [10], Lemma 4.1, ( D tf g ( ω ) := g ( ω t ) f ( ω t ) ) t> is a supermartingale; hence from a standard result, the process { ˜ D k } ∞ k =0 is a supermartingale.Furthermore, by Fudenberg and Levine [10], Lemma 4.3, { ˜ D k } ∞ k =0 is an active supermartingalewith activity ǫ | Ω | . Applying Theorem 4 with ǫ, | Ω | , acitivity = ǫ | Ω | , and ˜ D ≡ , there exists an integer K = K ( ǫ ) > { ˜ D k } with activity ǫ | Ω | , one has f ( sup k>K ˜ D k < ǫ ) > − ǫ. (B.1.3)In addition, by Fudenberg and Levine [10], Lemma 4.2, if | f (( ω, ~f ) t )[ ̟ ] − g (( ω, ~f ) t )[ ̟ ] | > ǫ, forsome ̟ ∈ Ω then condition (B.1.1) holds. Consequently, the process { ˜ D k } ∞ k =0 takes into accountall observations where | f (( ω, ~f ) t )[ ̟ ] − g (( ω ~f ) t )[ ̟ ] | > ǫ for some ̟ ∈ Ω and omits only observa-tions where | f (( ω, ~f ) t )[ ̟ ] − g (( ω, ~f ) t )[ ̟ ] | ≤ ǫ for all ̟ ∈ Ω (although, by condition (B.1.2), notnecessarily all of them).As a result, under the assumption that expert f is truthful (meaning, the realizations are gen-erated via f ), there exists a constant K = K ( ǫ ), which does not depend on the true process f orthe forecasting strategy g, so that on the set of histories, ω t , of probability (1 − ǫ ) under f , in allbut K periods either | f (( ω, ~f ) t )[ ̟ ] − g (( ω, ~f ) t )[ ̟ ] | ≤ ǫ for all ̟ ∈ Ω or D tf g ( ω ) < ǫ. 
Now assume that there exist K + 1 periods ( t i ) K +1 i =1 ⊂ { , ..., n } and ( ̟ t i ) K +1 i =1 ⊂ Ω K +1 such that | f (( ω, ~f ) t i − )[ ̟ t i ] − g (( ω, ~f ) t i − )[ ̟ t i ] | ≥ ǫ for all 1 ≤ i ≤ K + 1 with f ( ω t i − ) > n > K + 1 . Then equation (B.1.3) ensures us that with f − probability − (1 − ǫ )˜ D K +1 = D τ K +1 f g < ǫ (B.1.4)where by condition (B.1.2) for any t ≥ n ≥ τ K +1 we obtain that either ˜ D t drops below ǫ or D tf g ( ω ) < D τ K +1 f g ( ω )(1 + ǫ < ǫ (1 + ǫ ) (B.1.5)and hence it cannot exceed ǫ (1 + ǫ ). avaler and Smorodinsky: A Cardinal Comparison of Experts We conclude that there exists a constant K , which does not depend on the forecasting strategies f, g , such that for any suﬃciently large n > K , with f - probability - (1 − ǫ ); if there exist K + 1periods in which f and g are slightly diﬀerent above ǫ along a play path then the likelihood ratioat any point t after n never exceeds ǫ (1 + ǫ ) . Now from (B.1.4) and (B.1.5) we conclude that with f - probability - (1 − ǫ ), either | f (( ω, ~f ) t − )[ ̟ ] − g (( ω, ~f ) t − )[ ̟ ] | < ǫ for all ̟ ∈ Ω in all but K periods t in { , ..., n } , or1 − ǫ (1 + ǫ )1 + ǫ (1 + ǫ ) = 11 + ǫ (1 + ǫ ) <

11 + D tf g ( ω ) = D t ( ω, ~f ) ≤ t ≥ n, and as a result, the liminf of D is always greater than 1 − ǫ (1+ ǫ )1+ ǫ (1+ ǫ ) . Consequently, forall t ≥ n inequality (B.1.6) yields1 − ǫ (1 + ǫ )1 + ǫ (1 + ǫ ) < |D n ( ω, ~f ) − D t ( ω, ~f ) | ≤ , and hence |D n ( ω, ~f ) − D t ( ω, ~f ) | < ǫ (1+ ǫ )1+ ǫ (1+ ǫ ) , which, together with the ﬁrst scenario, holds with f -probability - (1 − ǫ ). Since ǫ (1+ ǫ )1+ ǫ (1+ ǫ ) < ǫ , the result follows. (cid:3) Appendix C: The cross-calibration test

We now restate the cross-calibration test as suggested by Feinberg and Stewart [7]. Fix a positive integer $N > 4$ and divide the interval $[0,1]$ into $N$ equal closed subintervals $I_1, \dots, I_N$, so that $I_j = [\frac{j-1}{N}, \frac{j}{N}]$, $1 \le j \le N$. All results in their paper hold when $[0,1]$ is replaced with the set of distributions over any finite set $\Omega$ and the intervals $I_j$ are replaced with a cover of the set of distributions by sufficiently small closed convex subsets. At the beginning of each period $t = 1, 2, \dots$, all forecasters (or experts) $i \in \{0, \dots, M-1\}$ simultaneously announce predictions $I^i_t \in \{I_1, \dots, I_N\}$, which are interpreted as probabilities with which the realization 1 will occur in that period. We assume that forecasters observe both the realized outcome and the predictions of the other forecasters at the end of each period.

The cross-calibration test is defined over outcomes $(\omega_t, I^0_t, \dots, I^{M-1}_t)_{t=1}^{\infty}$, which specify, for each period $t$, the realization $\omega_t \in \Omega$, together with the prediction intervals announced by each of the $M$ forecasters. Given any such outcome and any $M$-tuple $l = (I_{l_0}, \dots, I_{l_{M-1}}) \in \{I_1, \dots, I_N\}^M$, let
$$\zeta^l_t = 1_{\{I^i_t = I_{l_i},\ \forall i = 0, \dots, M-1\}} \qquad \text{and} \qquad \nu^l_t = \sum_{n=1}^{t} \zeta^l_n,$$
where $\nu^l_t$ represents the number of times that the forecast profile $l$ is chosen up to time $t$. For $\nu^l_t > 0$, the frequency $f^l_t$ of realizations conditional on this forecast profile is given by
$$f^l_t = \frac{1}{\nu^l_t} \sum_{n=1}^{t} \zeta^l_n \omega_n.$$
Forecaster $i$ passes the cross-calibration test at the outcome $(\omega_t, I^0_t, \dots, I^{M-1}_t)_{t=1}^{\infty}$ if
$$\limsup_{t \to \infty} \left| f^l_t - \frac{2 l_i - 1}{2N} \right| \le \frac{1}{2N}$$
for every $l$ satisfying $\lim_{t \to \infty} \nu^l_t = \infty$.

In the case of a single forecaster, the cross-calibration test reduces to the classic calibration test, which checks the frequency of realizations conditional on each forecast that is made infinitely often. With multiple forecasters, the cross-calibration test checks the empirical frequencies of the realization conditional on each profile of forecasts that occurs infinitely often. Note that if an expert is cross-calibrated, he will also be calibrated.

Endnotes

For expository reasons, we restrict attention to a binary set $\Omega = \{0, 1\}$. The results extend to any finite set.
An expert who uses f P to derive the correct predictions is referred to as informed, whereas an expert who concoctspredictions strategically to pass the test without any knowledge on P is referred to as uninformed. It should be emphasized that the results in this paper hold even for the general case for which deﬁnition 1 isextended such that a tester may condition his one step ahead decisions on his own past decisions. Formally, whenever T has the form T t : (Ω × ∆(Ω) × ∆(Ω) × { f, g } ) ∞ −→ [0 , . Notice that D is unaﬀected by the so-called “counterfactual” predictions. These predictions are referred to eventswhich may not occur. On the contrary, the outcome of D depends only on predictions which were made along therealized play path. If f (( ω, ~f ) n − )[ ω n ] = 0 for some n , we set D tf g ( ω ) = ∞ for all t ≥ n . In fact we show a stronger result: since f is monotone and R ~f D ,ǫ = S ¯ ǫ ∈ Q ∩ (0 ,ǫ ] R ~f D , ¯ ǫ it follows that, conditional on f ( A ) > ǫ < ǫ such that g ( A ∩ R ~f D , ¯ ǫ ) > A diﬀerent existing test, which was introduced in Feinberg and Stewart [7], is the cross-calibration test which isdiscussed in the introduction. However, it turns out that this test does not naturally induce a cardinal comparison test;rather than ranking the experts, this test outputs a binary verdict (pass/fail) for each of the two experts separatelyand hence may rule out anonymity. Moreover, it can be shown that any cardinal comparison test which naturallyranks an expert according to its empirical frequency would fail to be reasonable if an expert, who had calibrated onlyalong one proﬁle and had failed along all others, is tested against an informed expert . Note that we abuse the statistical terminology. In statistics the notion of rejection is always used in the contextof the null hypothesis. In our model, we assume symmetry between the alternatives and so we discuss rejection alsoin the context of the alternative hypothesis. 
As a consequence, an error of type-1 is deﬁned as the probability ofaccepting the alternative hypothesis whenever the null hypothesis is correct, and symmetrically, an error of type-2 isthe probability of accepting the null hypothesis whenever the alternative one is correct. The test proposed in the Neyman-Pearson lemma rejects the null hypothesis whenever the likelihood ratio fallsbelow some positive threshold. Recall that ~f induces a unique play path ( ω, ~f ) . References [1] Al-Najjar N, Sandroni A, Smorodinsky R, Weinstein J (2010) Testing theories with learnable andpredictive representations.

Journal of Economic Theory

Econometrica

Proceedings of the NationalAcademy of Sciences

Prediction, Learning, and Games (New York, NY, USA: CambridgeUniversity Press), ISBN 0521841089.[5] Dawid P (1982) The well-calibrated bayesian.

Journal of the American Statistical Association

Review of Economic Studies

Econometrica

Biometrika

Reviewof Economic Studies

Games and Economic Behavior

Econometrica

Unpublished results .[14] Levy G, Razin R (2018) An explanation-based approach to combining forecasts.

Unpublished results .[15] Olszewski W, Sandroni A (2008) Manipulability of future-independent tests.

Econometrica

Caltech. Unpublished results.

International Journal of GameTheory

Mathematics ofOperations Research