From Proper Scoring Rules to Max-Min Optimal Forecast Aggregation
Eric Neyman, Tim Roughgarden

February 16, 2021
Abstract
This paper forges a strong connection between two seemingly unrelated forecasting problems: incentive-compatible forecast elicitation and forecast aggregation. Proper scoring rules are the well-known solution to the former problem. To each such rule s we associate a corresponding method of aggregation, mapping expert forecasts and expert weights to a "consensus forecast," which we call quasi-arithmetic (QA) pooling with respect to s. We justify this correspondence in several ways:

• QA pooling with respect to the two most well-studied scoring rules (quadratic and logarithmic) corresponds to the two most well-studied forecast aggregation methods (linear and logarithmic).

• Given a scoring rule s used for payment, a forecaster agent who sub-contracts several experts, paying them in proportion to their weights, is best off aggregating the experts' reports using QA pooling with respect to s, meaning this strategy maximizes its worst-case profit (over the possible outcomes).

• The score of an aggregator who uses QA pooling is concave in the experts' weights. As a consequence, online gradient descent can be used to learn appropriate expert weights from repeated experiments with low regret.

• The class of all QA pooling methods is characterized by a natural set of axioms (generalizing classical work by Kolmogorov on quasi-arithmetic means).

1. Introduction and motivation

You are a meteorologist tasked with advising the governor of Florida on hurricane preparations. A hurricane is threatening to make landfall in Miami, and the governor needs to decide whether to order a mass evacuation. The governor asks you what the likelihood is of a direct hit, so you decide to consult several weather models at your disposal. These models all give you different answers: 10%, 25%, 70%.
You trust the models equally, but your job is to come up with one number for the governor — your best guess, all things considered. What is the most sensible way for you to aggregate these numbers?

This is one of many applications of probabilistic opinion pooling. The problem of probabilistic opinion pooling (or forecast aggregation) asks: how should you aggregate several probabilities, or probability distributions, into one? This question is relevant in nearly every domain involving probabilities or risks: meteorology, national security, climate science, epidemiology, and economic policy, to name a few.

The setting that interests us is as follows: there are m experts, who report probability distributions p_1, ..., p_m over n possible outcomes (we call these reports, or forecasts). Additionally, each expert i has a non-negative weight w_i (with weights adding to 1); this weight represents the expert's quality, i.e. how much the aggregator trusts the expert. A pooling method takes these distributions and weights as input and outputs a single distribution p. (Where do these weights come from? How can one learn weights for experts? More on this later.)

Linear pooling is arguably the simplest of all reasonable pooling methods: a weighted arithmetic mean of the probability distributions:

    p = Σ_{i=1}^m w_i p_i.

Logarithmic pooling (sometimes called log-linear or geometric pooling) consists of taking a weighted geometric mean of the probabilities and scaling appropriately:

    p(j) = c Π_{i=1}^m (p_i(j))^{w_i}.

Here, p(j) denotes the probability of the j-th outcome and c is a normalizing constant to make the probabilities add to 1. Logarithmic pooling can be interpreted as averaging the experts' Bayesian evidence (see Appendix B).

The linear and logarithmic pooling methods are by far the two most studied ones, see e.g. [GZ86], [PR00], [KR08]. This is because they are simple and follow certain natural rules, which we briefly discuss in Section 2.
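Both pooling methods take only a few lines of code. The following sketch applies them to the hurricane example (10%, 25%, 70%, equal weights); the function names `linear_pool` and `log_pool` are ours, chosen for illustration.

```python
import math

def linear_pool(forecasts, weights):
    """Weighted arithmetic mean of probability distributions."""
    n = len(forecasts[0])
    return [sum(w * p[j] for p, w in zip(forecasts, weights)) for j in range(n)]

def log_pool(forecasts, weights):
    """Weighted geometric mean of probabilities, renormalized to sum to 1."""
    n = len(forecasts[0])
    unnorm = [math.prod(p[j] ** w for p, w in zip(forecasts, weights)) for j in range(n)]
    c = sum(unnorm)
    return [x / c for x in unnorm]

# Hurricane example: three models, two outcomes (hit, no hit), equal weights.
forecasts = [(0.10, 0.90), (0.25, 0.75), (0.70, 0.30)]
weights = [1/3, 1/3, 1/3]
print(linear_pool(forecasts, weights))  # hit probability 0.35
print(log_pool(forecasts, weights))     # hit probability ≈ 0.31
```

Note that the two methods disagree: the geometric mean is pulled further toward the lower forecasts than the arithmetic mean.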
Furthermore, they are each optimal according to some natural optimality metrics, see e.g. [Abb09].

1.2. Proper scoring rules

A seemingly unrelated topic within probabilistic forecasting is the truthful elicitation of forecasts: how can a principal structure a contract so as to elicit an expert's probability distribution in a way that incentivizes truthful reporting? This is usually done using a proper scoring rule.

A scoring rule is a function s that takes as input (1) a probability distribution over n outcomes and (2) a particular outcome, and assigns a score, or reward. The interpretation is that if the expert reports a distribution p and event j comes to pass, then the expert receives reward s(p; j) from the principal. A scoring rule is called proper if the expert's expected score is strictly maximized by reporting their probability distribution truthfully. That is, s is proper if

    Σ_{j=1}^n p(j) s(p; j) ≥ Σ_{j=1}^n p(j) s(x; j)

for all x, with equality only for x = p. It is worth noting that properness is preserved under positive affine transformations. That is, if s is proper, then s'(p; j) := a·s(p; j) + b is proper if a > 0.

Quadratic scoring rule
One example of a proper scoring rule is Brier's quadratic scoring rule, introduced in [Bri50]. It is given by

    s_quad(p; j) := 2p(j) − Σ_{k=1}^n p(k)².

The quadratic scoring rule can be interpreted as penalizing the expert by an amount equal to the squared distance from their report p to the "true answer" δ_j (i.e. the vector with a 1 in the j-th position and zeros elsewhere).

Logarithmic scoring rule
Another example of a proper scoring rule is the logarithmic scoring rule, introduced in [Goo52]. It is given by

    s_log(p; j) := ln p(j).

The logarithmic rule is the only proper scoring rule for which an expert's score depends only on the probability assigned to the eventual outcome and not on other outcomes [SAM66]. The quadratic and logarithmic scoring rules are by far the most studied and most frequently used ones in practice.
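Both rules are easy to state in code. The sketch below (the names `s_quad`, `s_log`, and `expected_score` are ours) also checks properness numerically: over a grid of possible reports, an expert with belief (0.7, 0.3) maximizes their expected quadratic score by reporting truthfully.

```python
import math

def s_quad(p, j):
    """Brier's quadratic score: 2*p[j] - sum_k p[k]^2 (outcomes 0-indexed)."""
    return 2 * p[j] - sum(q * q for q in p)

def s_log(p, j):
    """Logarithmic score: ln p[j] (undefined when p[j] = 0)."""
    return math.log(p[j])

p = (0.7, 0.3)
print(s_quad(p, 0))  # ≈ 0.82 if the 70% outcome happens
print(s_quad(p, 1))  # ≈ 0.02 if it does not

def expected_score(s, report, belief):
    """Expected score of `report` under the distribution `belief`."""
    return sum(belief[j] * s(report, j) for j in range(len(belief)))

# Properness check: the best report on a fine grid is the belief itself.
reports = [(x / 100, 1 - x / 100) for x in range(1, 100)]
best = max(reports, key=lambda r: expected_score(s_quad, r, p))
print(best)  # the grid report matching the true belief
```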
Choice of scoring rule as a value judgment
There are infinitely many proper scoring rules. How might a principal go about deciding which one to use? To gain some intuition, we will take a closer look at the quadratic and logarithmic scoring rules in the case of n = 2 outcomes. In Figure 1, for both of these scoring rules, we show the difference between the expert's reward if an outcome happens and if it does not happen, as a function of the expert's report. (We specify the domain of s more precisely in Section 3.)

Figure 1: Difference between expert's reward if an outcome happens and if it does not happen, as a function of the expert's report, for the quadratic and logarithmic scoring rules.

For example, if the expert reports a 70% probability of an outcome, then under the quadratic rule they receive a score of 2 · 0.7 − 0.7² − 0.3² = 0.82 if the outcome happens and 2 · 0.3 − 0.7² − 0.3² = 0.02 if it does not: a difference of 0.8. If rewarded with the logarithmic rule (scaled down by a factor of 2 ln 2 to make the two rules comparable), this difference would be (ln 0.7 − ln 0.3)/(2 ln 2) ≈ 0.61.

This difference scales linearly with the expert's report for the quadratic rule. Meanwhile, for the logarithmic rule, the difference changes more slowly than for the quadratic rule for probabilities in the middle, but much more quickly at the extremes. Informally speaking, this means that the logarithmic rule indicates a preference (of the elicitor) for high precision close to 0 and 1, while the quadratic rule indicates a more even preference for precision across [0, 1]. An elicitor who uses the logarithmic rule indicates that the probabilities 0.01 and 0.001 are quite different; one who uses the quadratic rule indicates that these probabilities are very similar.

On its surface, the elicitation of forecasts has seemingly little to do with their aggregation. However, given that the choice of scoring rule implies a subjective judgment about how different probabilities compare to one another, it makes sense to apply this judgment to the aggregation of forecasts as well. For example, if the logarithmic scoring rule accurately reflects a principal's preferences, how does that value judgment inform how that principal should aggregate multiple forecasts? This brings us to the main focus of our paper: namely, we prove a novel correspondence between proper scoring rules and opinion pooling methods.
Before introducing the aforementioned correspondence, we need to introduce the Savage representation of a proper scoring rule. (In Figure 1 we scaled down the logarithmic rule by a factor of 2 ln 2 to make the two rules comparable; this factor was chosen to make the range of values taken on by the Savage representations of the two scoring rules the same, see Section 1.3.)

Savage representation

A proper scoring rule has a unique representation in terms of its expected reward function G, i.e. the expected score of an expert who believes (and reports) a distribution p:

    G(p) := E_{j←p}[s(p; j)] = Σ_{j=1}^n p(j) s(p; j).

This representation of s, introduced in [Sav71], is known as the Savage representation, though we will usually refer to it as the expected reward function. Given that s is proper, G is strictly convex; and conversely, given a strictly convex function G, one can re-derive s with the formula

    s(p; j) = G(p) + ⟨g(p), δ_j − p⟩,   (1)

where g is the gradient of G [GR07]. Pictorially, draw the tangent plane to G at p; then the expert's score if outcome j is realized is the height of the plane at δ_j.

The Savage representation of the quadratic scoring rule is G_quad(p) = Σ_{j=1}^n p(j)². The Savage representation of the logarithmic scoring rule is G_log(p) = Σ_{j=1}^n p(j) ln p(j).

The function g, which will be central to our paper, describes the difference in the expert's score depending on which outcome happens. More precisely, the vector (s(p; j_1), ..., s(p; j_n)) is exactly the vector g(p), except possibly for a uniform translation in all coordinates. For example, s(p; j_1) − s(p; j_2) = g_1(p) − g_2(p); this is precisely the quantity plotted in Figure 1 for the quadratic and logarithmic scoring rules. This observation about the function g motivates the connection that we will establish between proper scoring rules and opinion pooling methods.

Quasi-arithmetic opinion pooling
We can now define our correspondence between proper scoring rules and opinion pooling methods. Given a proper scoring rule s used for elicitation, and given m probability distributions p_1, ..., p_m and expert weights w_1, ..., w_m, the aggregate distribution p* that we suggest is the one satisfying

    g(p*) = Σ_{i=1}^m w_i g(p_i).   (2)

(Here g is the gradient of the expected reward function G, or a subgradient if G is not differentiable. It is natural to ask whether this p* exists and whether it is unique. We will discuss this shortly.)

We refer to this pooling method as quasi-arithmetic pooling with respect to g (or the scoring rule s), or QA pooling for short. (This term comes from the notion of quasi-arithmetic means: given a continuous, strictly increasing function f and values x_1, ..., x_m, the quasi-arithmetic mean with respect to f of these values is f⁻¹((1/m) Σ_i f(x_i)).) To get a sense of QA pooling, let us determine what this method looks like for the quadratic and logarithmic scoring rules.

QA pooling with respect to the quadratic scoring rule
We have g_quad(x) = (2x_1, ..., 2x_n), so we are looking for the p* such that

    (2p*(1), ..., 2p*(n)) = Σ_{i=1}^m w_i (2p_i(1), ..., 2p_i(n)).

This gives p* = Σ_{i=1}^m w_i p_i. Therefore, QA pooling for the quadratic scoring rule is precisely linear pooling.

QA pooling with respect to the logarithmic scoring rule
We have g_log(x) = (ln x_1 + 1, ..., ln x_n + 1), so we are looking for the p* such that

    (ln p*(1) + 1, ..., ln p*(n) + 1) = Σ_{i=1}^m w_i (ln p_i(1) + 1, ..., ln p_i(n) + 1).

By exponentiating the components on both sides, we find that p*(j) = c Π_{i=1}^m (p_i(j))^{w_i} for all j, for some proportionality constant c. This is precisely the definition of the logarithmic pooling method. (The constant c comes from the fact that values of g(·) should be interpreted modulo translation by the all-ones vector; see Remark 3.9.)

The fact that this pooling scheme maps the two most well-studied scoring rules to the two most well-studied opinion pooling methods has not been noted previously, to our knowledge. This correspondence suggests that — beyond just our earlier informal justification — QA pooling with respect to a given scoring rule may be a fundamental concept. The rest of this paper argues that this is indeed the case.

(Section 4) Max-min optimality

Suppose that a principal asks you to issue a forecast and will pay you according to s. You are not knowledgeable on the subject but know some experts whom you trust on the matter (perhaps to varying degrees). You sub-contract the experts, promising to pay each expert i according to w_i · s. By using QA pooling according to s on the experts' forecasts, you guarantee yourself a profit; in fact, this strategy maximizes your worst-case profit, and is the unique such report. Furthermore, this profit is the same for all outcomes. This fact can be interpreted to mean that you have, in a sense, pooled the forecasts "correctly": you do not care which outcome will come to pass, which means that you have correctly factored the expert opinions into your forecast.
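The two derivations above can be checked numerically. In this sketch (function names are ours), the quadratic rule's g-average halves back into the linear pool, and exponentiating and renormalizing the logarithmic rule's g-average yields the logarithmic pool.

```python
import math

def qa_pool_quadratic(forecasts, weights):
    """QA pool w.r.t. the quadratic rule: average g_quad(p) = 2p, then invert."""
    n = len(forecasts[0])
    avg_g = [sum(w * 2 * p[j] for p, w in zip(forecasts, weights)) for j in range(n)]
    return [v / 2 for v in avg_g]  # equals the linear pool

def qa_pool_logarithmic(forecasts, weights):
    """QA pool w.r.t. the log rule: average ln p(j) (the '+1' is a uniform
    translation that cancels), exponentiate, and renormalize."""
    n = len(forecasts[0])
    avg_g = [sum(w * math.log(p[j]) for p, w in zip(forecasts, weights)) for j in range(n)]
    unnorm = [math.exp(v) for v in avg_g]
    c = sum(unnorm)
    return [v / c for v in unnorm]  # equals the logarithmic pool

forecasts = [(0.10, 0.90), (0.25, 0.75), (0.70, 0.30)]
w = [1/3, 1/3, 1/3]
print(qa_pool_quadratic(forecasts, w))   # the linear pool: hit probability 0.35
print(qa_pool_logarithmic(forecasts, w)) # the logarithmic pool
```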
We give an additional interpretation of this optimality notion as maximizing an aggregator's guaranteed improvement over choosing an expert at random.

In Section 4.2, we give an additional interpretation of QA pooling as an optimal method relative to a proper scoring rule: namely, the QA pool of expert forecasts is the forecast with respect to which the experts would be the least wrong (as measured via the weighted average of Bregman divergences associated with G).

(Section 5) Learning expert weights
Opinion pooling entails assigning weights to experts. Where do these weights come from? How might one learn them from experience? Suppose we have a fixed proper scoring rule s, and further consider fixing the reports of the m experts as well as the eventual outcome. One can ask: what does the score of the aggregate distribution (per QA pooling with respect to s) look like as a function of w, the vector of expert weights? We prove that this function is concave. This is useful because it allows for online convex optimization over expert weights.

Theorem (informal). Let s be a bounded proper scoring rule. For time steps t = 1, ..., T, m experts report forecasts to an aggregator, who combines them into a forecast p_t using QA pooling with respect to s and suffers a loss of −s(p_t; j_t), where j_t is the outcome at time step t. If the aggregator updates the experts' weights using online gradient descent, then the aggregator's regret compared to the best weights in hindsight is O(√T).

The aforementioned concavity property is a nontrivial fact that demonstrates an advantage of QA pooling over e.g. linear and logarithmic pooling: these pooling methods satisfy the concavity property for some proper scoring rules s but not others.

(Section 6) Natural axiomatization for QA pooling methods

[Kol30] and [Nag30] independently came up with a simple axiomatization of quasi-arithmetic means. We show how to change these axioms to allow for weighted means; the resulting axiomatization is a natural characterization of all quasi-arithmetic pooling methods in the case of n = 2 outcomes. Furthermore, although quasi-arithmetic means are typically defined for scalar-valued functions, we demonstrate that these axioms can be extended to describe quasi-arithmetic means with respect to vector-valued functions, as is necessary for our purposes if n >
2. This extension is nontrivial but natural, and to our knowledge has not previously been described.
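Returning to the learning result previewed under "(Section 5)" above: the procedure can be sketched for the quadratic rule, for which QA pooling is linear pooling. All names here are ours, the gradient is taken numerically for brevity (a real implementation would use the closed-form gradient and a tuned step size), and the step is gradient ascent on the (concave) score, i.e. gradient descent on the loss.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w_i >= 0, sum_i w_i = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for k, x in enumerate(u, start=1):
        css += x
        t = (css - 1.0) / k
        if x - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def s_quad(p, j):
    return 2 * p[j] - sum(q * q for q in p)

def qa_pool_quad(forecasts, w):
    """QA pool w.r.t. the quadratic rule = linear pool."""
    n = len(forecasts[0])
    return [sum(wi * p[j] for p, wi in zip(forecasts, w)) for j in range(n)]

def ogd_step(w, forecasts, outcome, eta=0.1, eps=1e-6):
    """One projected gradient-ascent step on the score of the QA pool."""
    grad = []
    for i in range(len(w)):
        w_hi = w[:]
        w_hi[i] += eps  # numerical partial derivative in coordinate i
        grad.append((s_quad(qa_pool_quad(forecasts, w_hi), outcome)
                     - s_quad(qa_pool_quad(forecasts, w), outcome)) / eps)
    return project_to_simplex([wi + eta * gi for wi, gi in zip(w, grad)])

# Expert 2 is consistently closer to the truth (outcome 0 always occurs),
# so its weight should grow over repeated rounds.
w = [0.5, 0.5]
for _ in range(200):
    w = ogd_step(w, [(0.5, 0.5), (0.9, 0.1)], outcome=0)
print(w)  # weight shifts toward expert 2
```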
The reverse direction: pooling to scoring
Although we have mostly discussed the correspondence between proper scoring rules and pooling methods in the context of "given a scoring rule, what is the most natural pooling method," the correspondence holds in reverse. That is, if a principal has a pooling method in mind, they can choose the scoring rule with which to reward the aggregator according to this correspondence. In Section 7 we give an interpretation of this reverse connection in the context of collusion between experts.
When is QA pooling well defined?
Is there always a p* satisfying Equation 2, and is it guaranteed to be unique? In Section 3 we show that the answer to the uniqueness question is yes, and that in the n = 2 outcomes case, the answer to the existence question is also yes. For larger values of n, such a p* may not exist. In particular, QA pooling with respect to s is well defined if and only if the range of g(·) is a convex set. For reasons that we will discuss, we call this the convex exposure property. In Appendix F, we will discuss when this property holds. In particular, we will show that it holds for the quadratic, logarithmic, and spherical scoring rules, as well as other natural classes of proper scoring rules.
2. Related work
Opinion pooling

[CW07] categorize mathematical approaches to opinion pooling as either Bayesian or axiomatic. A Bayesian approach to this problem is one that entails Bayes updating on each expert's opinion. While quite natural, Bayesian opinion pooling is difficult to apply and, in full generality, computationally intractable. This is because the Bayes updates must fully account for interdependencies between expert opinions.

By contrast, axiomatic approaches do not make assumptions about the structure of information underlying the experts' opinions; instead, they aim to come up with pooling methods that satisfy certain axioms or desirable properties. Such axioms include unanimity preservation, eventwise independence, and external Bayesianality; see e.g. [DL14] for statements of these axioms. One line of prior work [§4] defines a notion of pooling analogous to our Definition 4.5, though in a different context. The main focus of their line of work is on connecting opinion pooling to Bregman divergence; our approach connects opinion pooling to proper scoring rules, and a connection to Bregman divergence falls naturally out of this pursuit.
Scoring rules
The literature on scoring rules is quite large; we recommend [GR07] for a thorough but technical overview, or [Car16] for a less technical overview that focuses more on applications (while still introducing the basic theory). Seminal work on the theory behind scoring rules includes Brier's paper introducing the quadratic rule [Bri50], Good's paper introducing the logarithmic rule [Goo52], and Savage's work on the general theory of proper scoring rules [Sav71]. Additionally, see [DM14] for an overview of various families of proper scoring rules.
Aggregation via prediction markets
One common way to aggregate probabilistic forecasts is through prediction markets, some of which are based on scoring rules. [Han03] introduced market scoring rules (MSRs), in which experts are sequentially presented with an opportunity to update an aggregate forecast and are rewarded (or penalized) by the amount that their update changed the aggregate prediction's eventual score. [CP07] introduced cost-function markets, in which a market maker sells n types of shares — one for each outcome — where the price of a share depends on the number of shares sold thus far according to some cost function. They established a connection between cost-function markets and MSRs, where a market with a given cost function will behave the same way as a certain MSR (under certain conditions later formalized in [ACV13]). Subsequent work explored this area further, tying cost-function market making to online learning of probability distributions [CV10], [ACV13]. This work differs from ours in that the goal of their online learning problem is to learn a probability distribution over outcomes, whereas our goal in Section 5 is to learn expert weights.

While MSRs and cost-function markets have superficial similarities to our work, they have quite different goals and properties. For both MSRs and cost-function markets, incentives are set up so that an expert brings the market into alignment with their own opinion, rather than an aggregate. Thus, in the well-studied setting of experts whose beliefs do not depend on other experts' actions, the final state of such a market reflects only the beliefs of the most recent trader, rather than an aggregate of the experts' beliefs.

Arbitrage from collusion
Part of our work can be viewed as a generalization of previous work by Chun and Shachter done in a different context: namely, preventing colluding experts from exploiting arbitrage opportunities [CS11]. The authors show that for the case of n = 2 outcomes, if experts are rewarded with the same scoring rule s, preventing this is impossible: the experts can successfully collude by all reporting what we are calling the QA pool of their reports with respect to s. Our Theorem 4.1 recovers this result as a special case. See [Che+14] for related work in the context of wagering mechanisms and [Fre+20] for follow-up work on preventing arbitrage from colluding experts.

Prediction with expert advice
In Section 5 we discuss learning expert weights online. The online learning literature is vast, but our approach fits into the framework of prediction with expert advice. In this setting, at each time step each expert submits a report (in our context a probability distribution). The agent then submits a report based on the experts' submissions, and suffers a loss depending on this report and the eventual outcome. See [CBL06] for a detailed account of this setting; the authors prove a variety of no-regret bounds, ranging (depending on the setting) from O(√T) to O(1). Our setting is an ambitious one: while typically one desires low regret compared to the best expert in hindsight, we desire low regret compared to the best mixture of experts in hindsight. The authors of [CBL06] prove O(log T) regret for exp-concave losses in comparison with the best linear pool of experts in hindsight. This setting is different from ours in two important ways: first, the losses that we consider are not in general exp-concave (e.g. the quadratic loss); and second, the authors consider linear pooling for any loss, whereas we consider QA pooling with respect to the loss function.

Quasi-arithmetic means
Our notion of quasi-arithmetic pooling is an adaptation (and extension to higher dimensions) of the existing notion of quasi-arithmetic means. These were originally defined and axiomatized independently in [Kol30] and [Nag30]. Aczél generalized this work to include weighted quasi-arithmetic means [Acz48], though these means have weights baked in rather than taking them as inputs, which is different from our setting. See [Gra+11] for further background.
3. Preliminaries
Throughout this paper, we will let m be the number of experts and use the index i to refer to any particular expert. We will let n be the number of outcomes and use the index j to refer to any particular outcome.

Let ∆_n be the standard simplex in R^n, i.e. the one with vertices δ_1, ..., δ_n. (Here, δ_j denotes the vector with a 1 in the j-th coordinate and zeros elsewhere.) Note that ∆_n is an (n − 1)-dimensional object. We define an n-outcome forecast domain to be any convex (n − 1)-dimensional subset of ∆_n. (Formally, we require that the forecast domain contain a subset that is homeomorphic to R^{n−1}.) We will use D to denote an arbitrary forecast domain. Although our results will apply to any forecast domain, the two forecast domains that we expect to be by far the most useful are ∆_n and ∆_n minus its boundary: the former for bounded scoring rules (such as the quadratic rule) and the latter for unbounded ones (such as the logarithmic rule).

Given an n-outcome forecast domain D, a proper scoring rule on D is a function s : D × [n] → R such that for all p ∈ D, we have E_{j←p}[s(p; j)] ≥ E_{j←p}[s(x; j)] for all x ∈ D, with equality only when x = p. (Here, j ← p means that j is drawn randomly from the probability distribution p.) Some authors refer to such scoring rules as strictly proper while others assume that propriety entails strictness; we choose the latter convention. Also, while many sources define the range of s to include ±∞ so as to e.g. make the logarithmic scoring rule well defined on the boundary of ∆_n, we do not do this. This is an application-specific choice: we will be interested in pooling forecasts, and it is unclear how one would sensibly pool e.g. (1, 0) and (0, 1).

Given a proper scoring rule s, we define its expected reward function (or Savage representation) G : D → R by

    G(p) := E_{j←p}[s(p; j)] = Σ_{j=1}^n p(j) s(p; j).

We will henceforth assume that s is continuous; to our knowledge, this is the case for all frequently-used proper scoring rules. This is equivalent to assuming that G is differentiable, or that G is continuously differentiable (see Proposition C.1 for a proof of this equivalence).

Proposition 3.1 ([GR07]). Given a proper scoring rule s, its expected reward function G is strictly convex, and s can be re-derived from G via the formula

    s(p; j) = G(p) + ⟨g(p), δ_j − p⟩,   (3)

where g = ∇G (i.e. the gradient of G). Conversely, given a differentiable, strictly convex function G, the function s defined by Equation 3 is a proper scoring rule. (The statement in [GR07] is slightly more complicated because they consider scoring rules with ±∞ in their range, as well as discontinuous scoring rules.)

The important intuition to keep in mind for Equation 3 is that the score of an expert who reports p is determined by drawing the tangent plane to G at p; the value of this plane at δ_j, where j is the outcome that happens, is the expert's score.

We refer to g as the exposure function of s. We borrow this term from finance, where exposure refers to how much an agent stands to gain or lose from various possible outcomes — informally speaking, how much the agent cares about which outcome will happen. If we view G(p) − ⟨g(p), p⟩ as the agent's "baseline profit," then the j-th component of g(p) is the amount that the agent stands to gain (or lose) on top of the baseline profit if outcome j happens.

We give a geometric intuition for Proposition 3.1 to help explain why the properness of s corresponds to the convexity of G; this intuition will be helpful for understanding Bregman divergence (below) and the proofs in Section 4. Consider Figure 2, which depicts some G in the n = 2 outcome case, with the x-axis corresponding to the probability of Outcome 1 (see Remark 3.5 for formal details).
Suppose that the expert believes that the probability of Outcome 1 is 0.7. If the expert reports p = 0.7, then the expert's rewards in the cases of Outcome 1 and Outcome 2 are the y-values of the rightmost and leftmost points on the red line, respectively. Thus, in expectation, the expert's reward is the y-value of the red point. If instead the expert lies and reports p = 0.4, the y-values of the rightmost and leftmost points on the blue line represent the expert's rewards in the cases of Outcome 1 and Outcome 2, respectively. In this case, since Outcome 1 is still 70% likely, the expert's expected reward is the y-value of the blue point. Because G is strictly convex, the blue point is strictly below the red point; that is, the expert is strictly better off reporting p = 0.7. This argument holds in full generality: for any strictly convex function G in any number of dimensions. (Our standing continuity assumption on s is partially for ease of exposition, though e.g. our axiomatization in Section 6 depends on it.)

Figure 2: A convex function G, with tangent lines drawn at x = 0.7 and x = 0.4. If an expert believes that the probability of an event is 0.7, their expected score if they report p = 0.7 is the y-value of the red point; if they instead report p = 0.4, their expected score is the y-value of the blue point. Because G is convex, the red point is guaranteed to be above the blue point, so the expert is incentivized to be truthful.

We will find
Bregman divergence to be a useful concept for some of our proofs.
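Equation 3 can be checked numerically before we proceed. In this sketch (the names `s_from_G`, `G_quad`, and `g_quad` are ours, and outcomes are 0-indexed), re-deriving the score from the quadratic rule's Savage representation reproduces s_quad(p; j) = 2p(j) − Σ_k p(k)².

```python
def s_from_G(G, g, p, j):
    """Equation 3: s(p; j) = G(p) + <g(p), delta_j - p>."""
    n = len(p)
    return G(p) + sum(g(p)[k] * ((1.0 if k == j else 0.0) - p[k]) for k in range(n))

# Savage representation of the quadratic rule and its gradient:
G_quad = lambda p: sum(x * x for x in p)
g_quad = lambda p: [2 * x for x in p]

p = (0.7, 0.3)
print(s_from_G(G_quad, g_quad, p, 0))  # ≈ 0.82, matching 2*0.7 - (0.7**2 + 0.3**2)
print(s_from_G(G_quad, g_quad, p, 1))  # ≈ 0.02
```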
Definition 3.2 (Bregman divergence). Given a differentiable, strictly convex function G : D → R with gradient g, where D is a convex subset of R^n, and given p, q ∈ D, the Bregman divergence between p and q with respect to G is

    D_G(p ∥ q) := G(p) − G(q) − ⟨g(q), p − q⟩.

(Note that Bregman divergence is not symmetric.) A geometric interpretation of D_G(p ∥ q) is: if you draw the tangent plane to G at q, the divergence is how far below G(p) the value of that plane will be at p. For example, the distance between the red and blue points in Figure 2 is the Bregman divergence between the expert's belief (0.7, 0.3) and their report (0.4, 0.6).

Remark 3.3. If G is the expected reward function of a proper scoring rule, then D_G(p ∥ q) is the expected reward lost by reporting q when your belief is p. Put otherwise, D_G(p ∥ q) measures the "wrongness" of the report q relative to a correct answer of p.

Proposition 3.4 (Well-known facts about Bregman divergence).

• D_G(p ∥ q) ≥ 0, with equality only when p = q.

• For any q, D_G(x ∥ q) is a strictly convex function of x.

Finally, we make a note about interpreting the n = 2 outcome case in one dimension.

Remark 3.5.
Because ∆_n is (n − 1)-dimensional, it is convenient to interpret the case of n = 2 outcomes in one dimension. All probabilities in D are of the form (p, 1 − p); we map D to [0, 1] (or a subset) via the first coordinate. Thus, overloading notation, we let G(p) := G(p, 1 − p) and g(p) := G′(p) = ⟨g(p, 1 − p), (1, −1)⟩. The tangent line to G at p, e.g. as in Figure 2, will intersect the line x = 1 at s(p; 1) (i.e. the score if Outcome 1 happens) and intersect the line x = 0 at s(p; 2) (i.e. the score if Outcome 2 happens). This formulation will be helpful when discussing the two-outcome case, e.g. in Section 6.

3.2. Probabilistic opinion pooling

We now introduce the central concept of this paper: quasi-arithmetic pooling. Let s be a proper scoring rule over a forecast domain D, and let G and g be as previously defined. We denote the quasi-arithmetic (QA) pooling operator with respect to g by ⊕_g. This operator takes as input (probability, weight) pairs with weights non-negative and adding to 1, and outputs a probability (all probabilities are in D). In particular, ⊕_g is defined by

    ⊕_{g, i=1}^m (p_i, w_i) := p*, where g(p*) = Σ_{i=1}^m w_i g(p_i).   (4)

Is p* in Equation 4 well defined? That is, does it exist, and if so, is it unique? It is indeed the case that p*, if defined, is unique. This is because g cannot take on the same value at two different points, as it is the gradient of a strictly convex function. As for existence: p* is guaranteed to exist if and only if s satisfies the following property.

Definition 3.6 (convex exposure). A proper scoring rule s has convex exposure if the range of its exposure function g is a convex set.

Proposition 3.7.

(a) Given a proper scoring rule s, p* as defined above is guaranteed to exist for any p_1, ..., p_m if and only if s has convex exposure.

(b) In the case of n = 2, every (continuous) proper scoring rule has convex exposure (so p* exists).

Proof. (a) is clear: the right-hand side of Equation 4 is an arbitrary convex combination of values in the range of g.
As for (b), since D is connected and g is continuous (see Proposition C.1), the range of g is connected. In the n = 2 outcome case, the range of g lies on the line {(x_1, x_2) : x_1 + x_2 = 0}, and a connected subset of a line is convex.

We now formally state the definition of quasi-arithmetic pooling:

Definition 3.8 (quasi-arithmetic pooling). Let s be a proper scoring rule with convex exposure on a forecast domain D. Given forecasts p_1, ..., p_m ∈ D with non-negative weights w_1, ..., w_m adding to 1, the quasi-arithmetic (QA) pool of these forecasts with respect to s (or with respect to its exposure function g), denoted by

    ⊕_{g, i=1}^m (p_i, w_i),

is the unique p* ∈ D such that

    g(p*) = Σ_{i=1}^m w_i g(p_i).

If the forecasts and weights are clear from context, we may simply write p* to refer to their quasi-arithmetic pool; or, if only the forecasts are clear, we may write p*_w, where w is the vector of weights.
In Section 4.2 we will discuss a natural generalization of QA pooling to proper scoring rules that do not have convex exposure. In Appendix F, we will explore for which commonly used proper scoring rules the convex exposure property holds. (The upshot is: most of them.)

We conclude with a technical note about the range of $g$ and the correct interpretation of Equation 4.

Remark 3.9.
Because the domain of $G$ is a subset of $\Delta_n$ (and thus lies in a plane that is orthogonal to the all-ones vector $\mathbf{1}_n$), its gradient function $g$ only takes on values orthogonal to $\mathbf{1}_n$. When we treat $G$ as a function of $n$ variables rather than $n - 1$ (that is, if we extend $G$ outside of the plane containing $\Delta_n$), $g$ might gain a component parallel to $\mathbf{1}_n$. However, the correct way to think of the codomain of $g$ is either as $\{x : \sum_i x_i = 0\}$, or else as $\mathbb{R}^n$ modulo translation by $\mathbf{1}_n$, which we will denote by $\mathbb{R}^n / T(\mathbf{1}_n)$. Consequently, the correct way to think of the equality in Equation 4 is as an equality in $\mathbb{R}^n / T(\mathbf{1}_n)$, rather than an equality in $\mathbb{R}^n$; this came up in Section 1.3 when we discussed the fact that QA pooling for the logarithmic scoring rule is logarithmic opinion pooling.
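As an illustration of Remark 3.9 (a sketch of ours, not the paper's), take the logarithmic rule with $n = 3$ and represent $g(p)$ by $(\log p_1, \log p_2, \log p_3)$. Shifting this representative by any multiple of the all-ones vector does not change the QA pool, because the parallel component cancels when the pooled forecast is normalized; this is the sense in which Equation 4 holds in $\mathbb{R}^n / T(\mathbf{1}_n)$:

```python
import math

def qa_pool_log(ps, ws, shift=0.0):
    # QA pool for the logarithmic rule over n outcomes, taking the exposure
    # representative g(p) = (log p_1, ..., log p_n) + shift * (1, ..., 1).
    # Averaging in g-space and inverting amounts to a weighted geometric
    # mean of the forecasts, followed by normalization.
    n = len(ps[0])
    avg = [sum(w * (math.log(p[k]) + shift) for p, w in zip(ps, ws))
           for k in range(n)]
    unnorm = [math.exp(a) for a in avg]
    total = sum(unnorm)
    return [x / total for x in unnorm]

ps = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
ws = [0.4, 0.6]
pool0 = qa_pool_log(ps, ws)           # representative with shift 0
pool7 = qa_pool_log(ps, ws, shift=7)  # representative shifted by 7 * (1, 1, 1)
```

The two calls agree to machine precision: the component of $g$ parallel to $\mathbf{1}_n$ is a common factor $e^{\text{shift}}$ that cancels in the normalization.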
4. Optimality results
Our goal is to give a formal justification for quasi-arithmetic pooling. This section gives two such justifications: one in terms of max-min optimality and one (as a corollary) in terms of minimizing the weighted average of the Bregman divergences between the pooled forecast and the experts' forecasts. We will give additional justifications in later sections.
Theorem 4.1.
Let $s$ be a proper scoring rule with convex exposure on a forecast domain $D$. Fix any forecasts $p_1, \ldots, p_m \in D$ with non-negative weights $w_1, \ldots, w_m$ adding to 1. Define
$$u(p; j) := s(p; j) - \sum_{i=1}^{m} w_i\, s(p_i; j).$$
Then the quantity $\min_j u(p; j)$ is uniquely maximized by setting $p$ to $p^* := \bigoplus_{i=1}^{m}{}^{g} (p_i, w_i)$. Furthermore, $u(p^*; j)$ is the same for all $j$. This quantity is non-negative, and is positive unless all reports $p_i$ with positive weights are equal.

Footnote: Formally, consider the change of coordinates given by $z_j = x_n - x_j$ for $j \le n - 1$ and $z_n = \sum_j x_j$, so that the domain of $G$ lies in the plane $z_n = 1$. Then for $j \le n - 1$, $\partial G / \partial z_j$ at a given point in the domain of $G$ does not change if 1 is substituted for $z_n$; only $\partial G / \partial z_n$ changes (to zero). Equivalently, in terms of our original coordinates, the change that $g$ undergoes when we consider $G$ to be a function only defined on $D$ instead of $\mathbb{R}^n$ is precisely a projection of $g$ onto $H_n(0)$.

One interpretation for this theorem statement is as follows. Consider an agent who is tasked with submitting a forecast, and who will be paid according to $s$. The agent decides
to sub-contract $m$ experts to get their opinions, paying expert $i$ the amount $w_i s(p_i; j)$ if the expert reports $p_i$ and outcome $j$ happens. (Perhaps experts whom the agent trusts more have higher $w_i$'s.) Finally, the agent reports some (any) forecast $p$. Then $u(p; j)$ is precisely the agent's profit (utility).

The quantity $\min_j u(p; j)$ is the agent's minimum possible profit over all outcomes. It is natural to ask which report $p$ maximizes this quantity. Theorem 4.1 states that this maximum is achieved by the QA pool of the experts' forecasts with respect to $s$, and that this is the unique maximizer.

A possible geometric intuition to keep in mind for the proof (below): for each expert $i$, draw the plane tangent to $G$ at $p_i$. For any $j$, the value of this plane at $\delta_j$ is $s(p_i; j)$. Now take the weighted average of all $m$ planes; this is a new plane whose intersection with any $\delta_j$ is the total reward received by the experts if $j$ happens. Since $G$ is convex, this plane lies below $G$. To figure out which point maximizes the agent's guaranteed profit, push the plane upward until it hits $G$. It will hit $G$ at $p^*$, and the agent's profit will be the vertical distance that the plane was pushed.

Proof of Theorem 4.1.
We first show that $u(p^*; j)$ is the same for all $j$. We have
$$\begin{aligned}
u(p^*; j) &= s(p^*; j) - \sum_i w_i\, s(p_i; j) \\
&= G(p^*) + \langle g(p^*), \delta_j - p^* \rangle - \sum_i w_i \big( G(p_i) + \langle g(p_i), \delta_j - p_i \rangle \big) \\
&= G(p^*) - \sum_i w_i G(p_i) + \Big\langle \sum_i w_i\, g(p_i),\, \delta_j - p^* \Big\rangle - \sum_i w_i \langle g(p_i), \delta_j - p_i \rangle \\
&= G(p^*) - \sum_i w_i G(p_i) + \sum_i w_i \langle g(p_i), p_i - p^* \rangle,
\end{aligned}$$
which indeed does not depend on $j$. We can in fact rewrite this expression as a weighted sum of Bregman divergences:
$$u(p^*; j) = \sum_i w_i\, D_G(p^* \,\|\, p_i).$$
It follows that $u(p^*; j)$ is non-negative (see Proposition 3.4), and positive except when all $p_i$'s with positive weights are equal.

Finally, we show that $p = p^*$ maximizes $\min_j u(p; j)$. Suppose that for some report $q$ we have $\min_j u(q; j) \ge \min_j u(p^*; j)$. Then $u(q; j) \ge u(p^*; j)$ for every $j$, since $u(p^*; j)$ is the same for every $j$. But this means that $s(q; j) \ge s(p^*; j)$ for every $j$, since the $\sum_i w_i s(p_i; j)$ term in the definition of $u(p; j)$ does not depend on $p$. But then $q = p^*$: under the belief $p^*$, the report $q$ would score at least as well in expectation as the report $p^*$, and since $s$ is (strictly) proper this is possible only if $q = p^*$.

Remark 4.2.
We can reformulate Theorem 4.1 as follows: suppose that an agent has access to forecasts $p_1, \ldots, p_m$ and needs to issue a forecast, for which the agent will be rewarded using a proper scoring rule $s$ with convex exposure. The agent can improve upon selecting an expert at random according to weights $w_1, \ldots, w_m$, no matter the outcome $j$, by reporting $p^*$. This improvement is the same no matter the outcome, and is a strict improvement unless all forecasts with positive weights are the same.

4.2. QA pooling as a Bregman divergence minimizing method

The quantity $\sum_i w_i\, D_G(p \,\|\, p_i)$ that came up in the proof of Theorem 4.1 is a natural quantity to consider, as it is a measure of how far away, overall, $p$ is from the expert reports $p_i$. In fact, it is a natural quantity to minimize if we care about aggregation. This brings us to our second formal justification of QA pooling.

Proposition 4.3.
Given a proper scoring rule $s$ with convex exposure and reports $p_1, \ldots, p_m$ with weights $w_1, \ldots, w_m$, the quantity
$$d(x) := \sum_i w_i\, D_G(x \,\|\, p_i)$$
is uniquely minimized at $x = p^*$.

This fact makes sense in light of the geometric intuition we described for Theorem 4.1. In these terms, Proposition 4.3 states that $p^*$ is the point at which $G$ is closest (in vertical distance) to the average of the experts' planes. Formally:

Proof.
Since Bregman divergence is strictly convex in its first argument (see Proposition 3.4), $d(x)$ is strictly convex. This means that if there is a point $p \in D$ where $\nabla d(p) = 0$, then $p$ is the unique minimizer of $d$. Now, we have
$$\nabla d(x) = \sum_i w_i\, \nabla_x D_G(x \,\|\, p_i) = \sum_i w_i \big( g(x) - g(p_i) \big) = g(x) - \sum_i w_i\, g(p_i).$$
But in fact, $g(p^*) = \sum_i w_i\, g(p_i)$ by definition, so $\nabla d(p^*) = 0$. This completes the proof.

Proposition 4.3 generalizes [Abb09, Proposition 4], which showed that logarithmic pooling minimizes the weighted average of the KL divergences between the pooled forecast and the expert forecasts. KL divergence is Bregman divergence relative to negative entropy, which is precisely the function $G$ for the logarithmic scoring rule.

In light of the fact that $D_G(p \,\|\, q)$ is the expected reward lost by an expert who, believing $p$, reports $q$ (see Remark 3.3), Proposition 4.3 gives another natural interpretation of QA pooling.

Remark 4.4.
Consider a proper scoring rule $s$ with convex exposure and forecasts $p_1, \ldots, p_m$ with weights $w_1, \ldots, w_m$. The QA pool of these forecasts is the forecast $p^*$ that, if it is the correct answer (i.e. if the outcome is drawn according to $p^*$), would minimize the expected loss of a randomly chosen (according to $w$) expert relative to reporting $p^*$.

In this sense, QA pooling reflects a compromise between experts: it is the probability that, if it were correct, would make the experts' forecasts least wrong overall. This motivates a generalization of QA pooling to non-convex exposure scoring rules that uses the formulation of $p^*$ in terms of minimizing Bregman divergence.

Definition 4.5 (generalized quasi-arithmetic pooling). Let $s$ be a proper scoring rule on a closed forecast domain $D$. Given forecasts $p_1, \ldots, p_m \in D$ with non-negative weights $w_1, \ldots, w_m$ adding to 1, the generalized quasi-arithmetic pool of these forecasts with respect to $s$ (or $g$) is the unique $p^* \in D$ minimizing
$$d(x) := \sum_i w_i\, D_G(x \,\|\, p_i).$$

To check that Definition 4.5 is well defined, we need to check that $p^*$ exists and is unique. Existence follows from the fact that $D$ is (by assumption) closed, and therefore compact (being a bounded subset of $\Delta_n$), and that a continuous function on a compact domain achieves its minimum. Uniqueness follows from the fact that a strictly convex function on a convex set has at most one minimum. (As part of proving Proposition 4.3, we showed that $d$ is strictly convex.)

Remark 4.6.
Since $d$ is convex, minimizing it is a matter of convex optimization. In particular, if $G$ is bounded, given oracle access to $g$, the ellipsoid method can be used to efficiently find the generalized QA pool of a list of forecasts.

Remark 4.7.
It is natural to ask which pooling method minimizes the Bregman divergence going the other way, i.e. $\sum_i w_i\, D_G(p_i \,\|\, x)$. The answer is linear pooling [Ban+05, Proposition 1]. This makes sense, because this minimization question asks: if an expert is selected at random to be "correct" according to the weights, what is the overall probability of any event $j$? The answer is achieved by linear pooling.
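The optimality results of this section are easy to check numerically. The sketch below (our illustration, with hypothetical helper names) instantiates Theorem 4.1 with the two-outcome quadratic score, whose QA pool is the linear pool, and Proposition 4.3 and Remark 4.7 with the logarithmic score, whose Bregman divergence $D_G$ is KL divergence:

```python
import math

ps, ws = [0.10, 0.25, 0.70], [0.2, 0.3, 0.5]
grid = [k / 1000 for k in range(1, 1000)]

# Theorem 4.1 with the quadratic score s(p; 1) = 2p - (p^2 + (1-p)^2).
def s(p, j):
    pj = p if j == 1 else 1 - p
    return 2 * pj - (p * p + (1 - p) * (1 - p))

def u(p, j):
    return s(p, j) - sum(w * s(pi, j) for pi, w in zip(ps, ws))

p_star = sum(w * p for p, w in zip(ps, ws))      # QA pool = linear pool here
assert abs(u(p_star, 1) - u(p_star, 2)) < 1e-12  # u(p*; j) equal across outcomes
best = max(grid, key=lambda p: min(u(p, 1), u(p, 2)))  # max-min over a fine grid
assert abs(best - p_star) < 1e-6

# Proposition 4.3 and Remark 4.7 with the logarithmic score (D_G = KL).
def kl(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

logit = lambda p: math.log(p / (1 - p))
log_pool = 1 / (1 + math.exp(-sum(w * logit(p) for p, w in zip(ps, ws))))

fwd = min(grid, key=lambda x: sum(w * kl(x, p) for p, w in zip(ps, ws)))
rev = min(grid, key=lambda x: sum(w * kl(p, x) for p, w in zip(ps, ws)))
assert abs(fwd - log_pool) < 1e-3  # forward direction: the QA (logarithmic) pool
assert abs(rev - p_star) < 1e-3    # reverse direction: the linear pool
```

The grid search stands in for the convex optimization of Remark 4.6; it suffices here because both objectives are convex in one variable.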
5. Convex losses and learning expert weights
Thus far when discussing QA pooling, we have regarded expert weights as given. Where do these weights come from? As we will show in this section, these weights can be learned from experience. The key observation is the following theorem, which states that an agent's score is a concave function of the weights it uses for the experts.
Theorem 5.1.
Let $s$ be a proper scoring rule with convex exposure on a forecast domain $D$, and fix any $p_1, \ldots, p_m \in D$. Given a weight vector $w = (w_1, \ldots, w_m) \in \Delta_m$, define the weight-score of $w$ for an outcome $j$ as
$$\mathrm{WS}_j(w) := s\Big( \bigoplus_{i=1}^{m}{}^{g} (p_i, w_i);\, j \Big).$$
Then for every $j \in [n]$, $\mathrm{WS}_j(w)$ is a concave function of $w$.

We defer the proof to Appendix D, but the basic idea is that for any two weight vectors $v$ and $w$ and any $c \in [0, 1]$, the quantity
$$\mathrm{WS}_j(c v + (1 - c) w) - c\, \mathrm{WS}_j(v) - (1 - c)\, \mathrm{WS}_j(w)$$
can be expressed as a sum of Bregman divergences, and is therefore non-negative.

Footnote: Note that this is a weaker condition than $s$ being bounded; e.g. $G_{\log}$ is the negative entropy function, which is bounded.

Remark 5.2. Theorem 5.1 can be stated in more generality: $s$ need not have convex exposure; it suffices to have that for the particular $p_1, \ldots, p_m$, the QA pool of these forecasts exists for every weight vector.

Remark 5.3.
Theorem 5.1 would not hold if the pooling operator in the definition of WS were replaced by linear pooling or by logarithmic pooling. This is an advantage of QA pooling over using the linear or logarithmic method irrespective of the scoring rule.

Beyond Theorem 5.1's instrumental use for no-regret online learning of expert weights (Theorem 5.5 below), the result is interesting in its own right. For example, the following fact (loosely speaking, that QA pooling cannot benefit from weight randomization) follows as a corollary. (Recall the definition of $p^*_w$ from Definition 3.8.)

Corollary 5.4.
Consider a randomized algorithm $A$ with the following specifications:

• Input: a proper scoring rule $s$ with convex exposure, and expert forecasts $p_1, \ldots, p_m$.
• Output: a weight vector $w \in \Delta_m$.

Consider any input $s, p_1, \ldots, p_m$, and let $\hat{w} = \mathbb{E}_A[w]$. Then for every $j$, we have $s(p^*_{\hat{w}}; j) \ge \mathbb{E}_A[s(p^*_w; j)]$, where $p^*_x$ denotes the QA pool of $p_1, \ldots, p_m$ with weight vector $x$.

We now state the no-regret result that we have alluded to. The algorithm referenced in the statement (which is in Appendix D.2) is an application of the standard online gradient descent algorithm (see e.g. [Haz19, Theorem 3.1]) to our particular setting.
Theorem 5.5.
Let $s$ be a bounded proper scoring rule with convex exposure over a forecast domain $D$. For time steps $t = 1, \ldots, T$, an agent chooses a weight vector $w^t \in \Delta_m$. The agent then receives a score of
$$s\Big( \bigoplus_{i=1}^{m}{}^{g} (p_i^t, w_i^t);\, j^t \Big),$$
where $p_1^t, \ldots, p_m^t \in D$ and $j^t \in [n]$ are chosen adversarially. By choosing $w^t$ according to Algorithm D.3 (online gradient descent on the experts' weights), the agent achieves $O(\sqrt{T})$ regret in comparison with the best weight vector in hindsight. In particular, if $M$ is an upper bound on $\|g\|$, then for every $w^* \in \Delta_m$ we have
$$\sum_{t=1}^{T} \left( s\Big( \bigoplus_{i=1}^{m}{}^{g} (p_i^t, w_i^*);\, j^t \Big) - s\Big( \bigoplus_{i=1}^{m}{}^{g} (p_i^t, w_i^t);\, j^t \Big) \right) \le \sqrt{m}\, M \sqrt{T}.$$

Footnote: For a counterexample to logarithmic pooling (cf. Remark 5.3), one may take $n = 2$ with $s$ the quadratic scoring rule; for a counterexample to linear pooling, one may take $n = 2$ with $s$ given by $G(p_1, p_2) = \sqrt{p_1^2 + p_2^2}$ (this is known as the spherical scoring rule).

Footnote: Such an $M$ exists because $s$ is bounded by assumption, and so $g$ is also bounded (this follows from Equation 3).

We also note that this result is quite strong in that it does not merely achieve low regret compared to the best expert, but in fact compared to the best possible weighted pool of experts in hindsight. This is a substantial distinction, as it is possible for a mixture of experts to substantially outperform any individual expert.

We defer the proof to Appendix D.2. The proof amounts to applying the standard bounds for online gradient descent, though with an extra step: we use the bound $M$ on $\|g\|$ to bound the gradient of the loss as a function of expert weights.
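A toy instantiation (ours; the data-generating setup, the constant step size, and the finite-difference gradient are all simplifying assumptions, standing in for the tuned Algorithm D.3) of online gradient ascent on expert weights, for $m = 2$ experts and the quadratic score. With two experts, the simplex is the segment $w = (x, 1 - x)$, so projection is just clipping $x$ to $[0, 1]$:

```python
import random

def s(p, j):  # quadratic (Brier-style) score for two outcomes
    pj = p if j == 1 else 1 - p
    return 2 * pj - (p * p + (1 - p) * (1 - p))

def weight_score(x, p1, p2, j):
    # With the quadratic score, the QA pool is the linear pool.
    return s(x * p1 + (1 - x) * p2, j)

random.seed(0)
T, eta = 2000, 0.02
x, total, rounds = 0.5, 0.0, []
for _ in range(T):
    # Expert 1 is calibrated (the truth is Bernoulli(0.7)); expert 2 is noise.
    p1, p2 = 0.7, random.random()
    j = 1 if random.random() < 0.7 else 2
    rounds.append((p1, p2, j))
    total += weight_score(x, p1, p2, j)
    # One step of projected online gradient ascent on the weight x,
    # using a finite-difference gradient; projection is clipping to [0, 1].
    h = 1e-6
    grad = (weight_score(x + h, p1, p2, j) - weight_score(x - h, p1, p2, j)) / (2 * h)
    x = min(1.0, max(0.0, x + eta * grad))

# Cumulative score of the best fixed weight in hindsight (the comparator
# in Theorem 5.5), approximated on a grid.
best = max(sum(weight_score(x0, *r) for r in rounds)
           for x0 in [k / 100 for k in range(101)])
```

The per-round gap `(best - total) / T` stays small, consistent with the $O(\sqrt{T})$ guarantee; the concavity from Theorem 5.1 is what makes this gradient method sound.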
6. Axiomatization of QA pooling
In this section, we aim to show that the class of all quasi-arithmetic pooling operators is a natural one, by showing that these operators are precisely those which satisfy a natural set of axioms.

[Kol30] and [Nag30] independently considered the class of quasi-arithmetic means. Given an interval $I \subseteq \mathbb{R}$ and a continuous, injective function $f : I \to \mathbb{R}$, the quasi-arithmetic mean with respect to $f$, or $f$-mean, is the function $M_f$ that takes as input $x_1, \ldots, x_m \in I$ (for any $m \ge$
1) and outputs
$$M_f(x_1, \ldots, x_m) := f^{-1}\left( \frac{f(x_1) + \cdots + f(x_m)}{m} \right).$$
For example, the arithmetic mean corresponds to $f(x) = x$; the quadratic to $f(x) = x^2$; the geometric to $f(x) = \log x$; and the harmonic to $f(x) = -\frac{1}{x}$.

Kolmogorov proved that the class of quasi-arithmetic means is precisely the class of functions $M : \bigcup_{m=1}^{\infty} I^m \to I$ satisfying the following natural properties:

(1) $M(x_1, \ldots, x_m)$ is continuous and strictly increasing in each variable.
(2) $M$ is symmetric in its arguments.
(3) $M(x, x, \ldots, x) = x$.
(4) $M(x_1, \ldots, x_k, x_{k+1}, \ldots, x_m) = M(y, \ldots, y, x_{k+1}, \ldots, x_m)$, where $y := M(x_1, \ldots, x_k)$ appears $k$ times on the right-hand side. Informally, a subset of arguments to the mean function can be replaced with their mean.

Footnote: Nagumo also provided a characterization, though with slightly different properties.

The four properties listed above can be viewed as an axiomatization of quasi-arithmetic means.

Our notion of quasi-arithmetic pooling is exactly that of a quasi-arithmetic mean, except that it is more general in two ways. First, it allows for weights to accompany the arguments
to the mean. Second, we are considering quasi-arithmetic means with respect to vector-valued functions $g$. In the $n = 2$ outcome case, $g$ can be considered a scalar-valued function, since it is defined on a one-dimensional space (see Remark 3.5 for details); but in general we cannot treat $g$ as scalar-valued.

Our goal is to extend the above axiomatization of quasi-arithmetic means in these two ways: first (below) to include weights as arguments, and second (in Appendix E) to general $n$ (while still allowing arbitrary weights).

Generalizing to include weights as arguments
The objects that we will be studying in this section are ones of the form $(p, w)$, where $w \ge 0$ and $p \in D$. In this subsection, $D$ is a two-outcome forecast domain, which we will think of as a sub-interval of $[0, 1]$ identified by the probability of the first outcome (see Remark 3.5). We will fix the set $D$ for the remainder of the subsection. Our results generalize to any interval of $\mathbb{R}$ (as in Kolmogorov's work), but we focus on forecast domains since that is our application.

Definition 6.1. A weighted forecast is an element of $D \times \mathbb{R}_{>0}$: a probability and a positive weight. Given a weighted forecast $\Pi = (p, w)$ we define $\mathrm{pr}(\Pi) := p$ and $\mathrm{wt}(\Pi) := w$.

We will think of the output of pooling operators as weighted forecasts. This is a simple extension of our earlier definition of quasi-arithmetic pooling (Definition 3.8), which only output a probability.
Definition 6.2 (Quasi-arithmetic pooling with arbitrary weights ($n = 2$)). Given a continuous, strictly increasing function $g : D \to \mathbb{R}$, and weighted forecasts $\Pi_1 = (p_1, w_1), \ldots, \Pi_m = (p_m, w_m)$, define the quasi-arithmetic pool of $\Pi_1, \ldots, \Pi_m$ with respect to $g$ as
$$\bigoplus_{i=1}^{m}{}^{g} (p_i, w_i) := \left( g^{-1}\left( \frac{\sum_i w_i\, g(p_i)}{\sum_i w_i} \right),\ \sum_i w_i \right).$$

Remark 6.3.
In Definition 3.8, $g$ was the derivative of a differentiable, strictly convex function. Here, $g$ is a continuous, strictly increasing function. These are the same condition (see Proposition C.1).

In the case that $\sum_i w_i = 1$, Definition 6.2 reduces to Definition 3.8. In general, by linearly scaling the weights in Definition 6.2 to add to 1, we recover quasi-arithmetic pooling as previously defined.

The proof of the following proposition is straightforward.

Proposition 6.4.
Given two continuous, strictly increasing functions $g_1$ and $g_2$, $\oplus_{g_1}$ and $\oplus_{g_2}$ are the same if and only if $g_2 = a g_1 + b$ for some $a > 0$ and $b \in \mathbb{R}$.

We now define properties (i.e. axioms) of a pooling operator $\oplus$, such that these properties are satisfied if and only if $\oplus$ is $\oplus_g$ for some $g$. Our axiomatization will look somewhat different from Kolmogorov's, in part because we choose to define $\oplus$ as a binary operator that (if it satisfies the associativity axiom) extends to the $m$-ary case. This is a simpler domain and will simplify notation. Another difference is that in the $n = 2$ case, we are restricting $D \subseteq [0, 1]$.

Definition 6.5 (Axioms for pooling operators ($n = 2$)). For a pooling operator $\oplus$ on $D$ (i.e. a binary operator on weighted forecasts), we define the following axioms.

1. Weight additivity: $\mathrm{wt}(\Pi_1 \oplus \Pi_2) = \mathrm{wt}(\Pi_1) + \mathrm{wt}(\Pi_2)$ for every $\Pi_1, \Pi_2$.
2. Commutativity: $\Pi_1 \oplus \Pi_2 = \Pi_2 \oplus \Pi_1$ for every $\Pi_1, \Pi_2$.
3. Associativity: $\Pi_1 \oplus (\Pi_2 \oplus \Pi_3) = (\Pi_1 \oplus \Pi_2) \oplus \Pi_3$ for every $\Pi_1, \Pi_2, \Pi_3$.
4. Continuity: For every $p_1, p_2$, the quantity $\mathrm{pr}((p_1, w_1) \oplus (p_2, w_2))$ is a continuous function of $(w_1, w_2)$ on $\mathbb{R}^2_{\ge 0} \setminus \{(0, 0)\}$.
5. Idempotence: For every $\Pi_1, \Pi_2$, if $\mathrm{pr}(\Pi_1) = \mathrm{pr}(\Pi_2)$ then $\mathrm{pr}(\Pi_1 \oplus \Pi_2) = \mathrm{pr}(\Pi_1)$.
6. Monotonicity: Let $w > 0$ and let $p_1 > p_2 \in D$. Then for $x \in (0, w)$, the quantity $\mathrm{pr}((p_1, x) \oplus (p_2, w - x))$ is a strictly increasing function of $x$.

The motivation for the weight additivity axiom is that the weight of a weighted forecast can be thought of as the amount of evidence for its prediction. When pooling weighted forecasts, the weight of an individual forecast can be thought of as the strength of its vote in the aggregate.

The monotonicity axiom essentially states that if one pools two forecasts with different probabilities and a fixed total weight, then the larger the share of the weight belonging to the larger of the two probabilities, the larger the aggregate probability.

We now state this section's main result: these axioms describe the class of QA pooling operators.
Theorem 6.6.
A pooling operator is a QA pooling operator (as in Definition 6.2) with respect to some $g$ if and only if it satisfies the axioms in Definition 6.5.

We defer the proof of Theorem 6.6 to Appendix E, though we briefly summarize it here. We first show that the axioms in Definition 6.5 hold for any QA pooling operator. Weight additivity, commutativity, and idempotence are trivial; associativity is a matter of simple algebra. Continuity and monotonicity both follow from the fact that $g$ is continuous and strictly increasing, as is $g^{-1}$.

Showing that any pooling operator $\oplus$ satisfying our axioms is a QA pooling operator involves constructing a $g$ such that $\oplus = \oplus_g$. This is simplest if $D$ is a closed interval. For example, if $D = [0, 1]$, we may define $g$ as follows: $g(0) = 0$, $g(1) = 1$, and for $0 < p < 1$, $g(p) = w$ where $(1, w) \oplus (0, 1 - w) = (p, 1)$ (such a $w$ exists by the continuity axiom and is unique by the monotonicity axiom). The remainder (showing that $\oplus$ pools weighted forecasts in the same way as $\oplus_g$ as in Definition 6.2) follows by a sequence of applications of the axioms.

Footnote: We allow one weight to be 0 by defining $(p, w) \oplus (q, 0) = (q, 0) \oplus (p, w) = (p, w)$.

Footnote: As we mentioned, for an associative pooling operator $\oplus$, $\Pi_1 \oplus \Pi_2 \oplus \cdots \oplus \Pi_m$ is a well-specified quantity, even without indicating parenthesization. This lets us use the notation $\bigoplus_{i=1}^{m} \Pi_i$. This is why the statement of Theorem 6.6 makes sense despite pooling operators not being $m$-ary by default.

The case where $D$ is not closed is a little trickier. The idea there is to define $g$ on two points in the interior of $D$ and build $g$ by essentially applying the same construction as described in the previous paragraph, but with "negative weights." We describe the details of this argument in the proof of the analogous result for general $n$ (i.e. Theorem E.9).

Generalizing to higher dimensions
In Appendix E, we discuss extending our axiomatization to arbitrary values of $n$ in a way that, again, describes the class of QA pooling operators. An important challenge is extending the monotonicity axiom: what is an appropriate generalization of an increasing function in higher dimensions? We show that the notion we need is cyclical monotonicity, which we define and discuss. We then present our axiomatization (Definition E.8) and prove that the axioms represent precisely all QA pooling operators (Theorem E.9). On a high level, the proof is not dissimilar to that of Theorem 6.6, though the details are fairly different and more technical.

In conclusion, in Definition 6.5 we made a list of natural properties that a pooling operator may satisfy. Theorem 6.6 shows that the pooling operators satisfying these properties are exactly the QA pooling operators. In Appendix E, we generalize this theorem to higher dimensions, thus fully axiomatizing QA pooling. This result gives us an additional important reason to believe that QA pooling with respect to a proper scoring rule is a fundamental notion.
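As a concrete sanity check (a sketch of ours, not from the paper), the binary operator of Definition 6.2 with $D = (0, 1)$ and $g$ the logit satisfies the axioms of Definition 6.5 on sample inputs: weight additivity holds by construction, and commutativity, associativity, and idempotence can be verified numerically:

```python
import math

logit = lambda p: math.log(p / (1 - p))
expit = lambda y: 1 / (1 + math.exp(-y))

def pool(P1, P2, g=logit, g_inv=expit):
    # Binary QA pooling on weighted forecasts (Definition 6.2 with m = 2):
    # take the weight-normalized g-average of the probabilities, add the weights.
    (p1, w1), (p2, w2) = P1, P2
    p = g_inv((w1 * g(p1) + w2 * g(p2)) / (w1 + w2))
    return (p, w1 + w2)

A, B, C = (0.10, 1.0), (0.25, 2.0), (0.70, 0.5)

def close(P, Q):
    return abs(P[0] - Q[0]) < 1e-9 and abs(P[1] - Q[1]) < 1e-9

assert pool(A, B)[1] == A[1] + B[1]                      # weight additivity
assert close(pool(A, B), pool(B, A))                     # commutativity
assert close(pool(A, pool(B, C)), pool(pool(A, B), C))   # associativity
assert close(pool((0.3, 1.0), (0.3, 4.0)), (0.3, 5.0))   # idempotence
```

Replacing $g$ by $a g + b$ with $a > 0$ leaves every pooled probability unchanged, matching Proposition 6.4; and since the operator is associative, the $m$-ary pool $\bigoplus_{i=1}^m \Pi_i$ is well defined regardless of parenthesization.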
7. Conclusions and future work
We conclude with one observation and a number of suggestions for future work.

We have established and motivated a connection between proper scoring rules and opinion pooling methods. This connection is, in fact, a bijective correspondence between proper scoring rules with convex exposure (up to positive affine transformation) and pooling methods satisfying our axioms. In most of our discussion, the focus has been on one side of this correspondence: given a proper scoring rule $s$ (with convex exposure), we have shown that QA pooling with respect to $s$ satisfies many desirable properties. It is natural to consider the reverse direction as well: if a principal wishes to pool some experts' forecasts using a particular pooling method $\oplus$, does it make sense for the principal to use the corresponding scoring rule for elicitation?

We argue that choosing such a scoring rule makes sense in a context where experts may collude. If experts are rewarded with a proper scoring rule $s$ with convex exposure, they can guarantee themselves a larger total reward by all reporting the QA pool with respect to $s$ of their true beliefs; in fact, this is the collusion strategy that maximizes their minimum (over $j$) possible surplus over truthful reporting. This is essentially a restatement of Theorem 4.1, and was observed for the $n = 2$ case in [CS11]. This means that it makes sense for the principal to choose the scoring rule that will incentivize the experts to collude in the manner prescribed by $\oplus$. As such, the principal should choose the scoring rule corresponding to $\oplus$.

Footnote: This is a stronger statement because it asserts that this is the best of all collusion strategies, not just those in which all experts give the same report. This fact follows from the same techniques that we used to prove Theorem 4.1.

Of course, the principal may wish to reward the experts using a contract function for which collusion never generates a surplus.
The question of whether such a contract function exists is explored (but not resolved) in [Fre+20]. Given the close ties between this line of work and ours, it will be natural to explore whether the tools that we develop shed light on this question.

Moving on to further intriguing research directions, it is natural to expand our notion of forecast aggregation. All pooling methods that we considered satisfy what we called idempotence: pooling $p$ and $p$ gives back $p$. This is a natural assumption, but may be undesirable in some Bayesian setups. In a situation where two experts use different evidence to arrive at the same small probability of an outcome, it may be sensible for the aggregate probability to be even smaller. It would be interesting to explore whether our results and techniques are applicable to notions of pooling that do not satisfy idempotence, or to other (perhaps Bayesian) settings.

Finally, we present some concrete future directions for potential work:

• As we discussed in Section 2, there is a fair amount of work on aggregating forecasts with prediction markets, often ones that are based on proper scoring rules. Is there a natural trading-based interpretation of QA pooling?

• Definition 4.5 gave a natural generalization of QA pooling to proper scoring rules that do not have convex exposure. There is another potentially natural generalization, which is to define the QA pool as the forecast $p^*$ maximizing $\min_j u(p^*; j)$ (as defined in Theorem 4.1). Is this generalization equivalent to Definition 4.5? If not, how does it behave?

• Although our proof of Theorem 5.1 (that the score of the QA pool is concave in the experts' weights) relies on the convex exposure property, we have not ruled out the possibility that the result holds even without this assumption (with QA pooling defined as in Definition 4.5). Is this the case? Even if not, are no-regret algorithms for learning weights still possible?
• Our no-regret algorithm for learning weights relies on $s$ being bounded, because this allows us to place a concrete upper bound on $\|\nabla L^t(\cdot)\|$. Intuitively it seems unlikely that no-regret compared to the best weight vector in hindsight can be achieved if we cannot place a bound on the loss function; is this so? Are there natural restrictions to the model (i.e. ways to make it less than fully adversarial) under which a no-regret algorithm would be possible?

• We have presented a list of axioms characterizing QA pooling operators. Is there an alternative axiomatization that uses equally natural but fewer axioms?

References

[Abb09] Ali E. Abbas. "A Kullback-Leibler View of Linear and Log-Linear Pools". In:
Decision Analysis 6.1 (2009), pp. 25–37. url: https://ideas.repec.org/a/inm/ordeca/v6y2009i1p25-37.html.

[ABS18] Itai Arieli, Yakov Babichenko, and Rann Smorodinsky. "Robust forecast aggregation". In: Proceedings of the National Academy of Sciences (2018). issn: 0027-8424.

[ACR12] D. Allard, A. Comunian, and Philippe Renard. "Probability aggregation methods in geoscience". In: Mathematical Geosciences (2012).

[Acz48] J. Aczél. "On mean values". In: Bull. Amer. Math. Soc. (1948). url: https://projecteuclid.org:443/euclid.bams/1183511892.

[Ada14] M. Adamčík. "Collective reasoning under uncertainty and inconsistency". PhD thesis. University of Manchester, 2014.

[AK14] Aaron Archer and Robert Kleinberg. "Truthful germs are contagious: A local-to-global characterization of truthfulness". In: Games and Economic Behavior 86 (2014), pp. 340–366. url: https://EconPapers.repec.org/RePEc:eee:gamebe:v:86:y:2014:i:c:p:340-366.

[Ash+10] Itai Ashlagi, Mark Braverman, Avinatan Hassidim, and Dov Monderer. "Monotonicity and Implementability". In: Econometrica (2010). doi: https://doi.org/10.3982/ECTA8882. url: https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA8882.

[AW80] J. Aczél and C. Wagner. "A Characterization of Weighted Arithmetic Means". In: SIAM Journal on Algebraic Discrete Methods (1980).

[Ban+05] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. "Clustering with Bregman Divergences". In: J. Mach. Learn. Res. 6 (2005). url: http://jmlr.org/papers/v6/banerjee05b.html.

[BB20] Shalev Ben-David and Eric Blais. "A New Minimax Theorem for Randomized Algorithms". In: CoRR abs/2002.10802 (2020). url: https://arxiv.org/abs/2002.10802.

[BC11] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 1st ed. Springer, 2011. isbn: 1441994661.

[Bri50] G. W. Brier. "Verification of forecasts expressed in terms of probability". In: Monthly Weather Review 78 (1950), pp. 1–3.

[Car16] Arthur Carvalho. "An Overview of Applications of Proper Scoring Rules". In: Decision Analysis 13 (Nov. 2016).

[CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Jan. 2006. isbn: 978-0-521-84108-5.

[Che+14] Yiling Chen, Nikhil R. Devanur, David M. Pennock, and Jennifer Wortman Vaughan. "Removing arbitrage from wagering mechanisms". In: ACM Conference on Economics and Computation, EC '14, Stanford, CA, USA, June 8-12, 2014. Ed. by Moshe Babaioff, Vincent Conitzer, and David A. Easley. ACM, 2014, pp. 377–394.

[CL13] Arthur Carvalho and Kate Larson. "A Consensual Linear Opinion Pool". In: IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013. Ed. by Francesca Rossi. IJCAI/AAAI, 2013, pp. 2518–2524.

[CP07] Yiling Chen and David M. Pennock. "A Utility Framework for Bounded-Loss Market Makers". In: UAI 2007, Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, July 19-22, 2007. Ed. by Ronald Parr and Linda C. van der Gaag. AUAI Press, 2007, pp. 49–56. url: https://dl.acm.org/doi/abs/10.5555/3020488.3020495.

[CS11] SangIn Chun and Ross D. Shachter. "Strictly Proper Mechanisms with Cooperating Players". In: UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 14-17, 2011. Ed. by Fábio Gagliardi Cozman and Avi Pfeffer. AUAI Press, 2011, pp. 125–134.

[CV10] Yiling Chen and Jennifer Wortman Vaughan. "A new understanding of prediction markets via no-regret learning". In: Proceedings 11th ACM Conference on Electronic Commerce (EC-2010), Cambridge, Massachusetts, USA, June 7-11, 2010. Ed. by David C. Parkes, Chrysanthos Dellarocas, and Moshe Tennenholtz. ACM, 2010, pp. 189–198.

[CW07] R. T. Clemen and R. L. Winkler. "Aggregating probability distributions". In: Advances in Decision Analysis: From Foundations to Applications (Jan. 2007), pp. 154–176.

[Daw+95] A. Dawid, M. DeGroot, J. Mortera, R. Cooke, S. French, C. Genest, M. Schervish, D. Lindley, K. McConway, and R. Winkler. "Coherent combination of experts' opinions". In: Test (1995).

Probabilistic Opinion Pooling. Oct. 2014. url: http://philsci-archive.pitt.edu/11349/.

[DM14] Alexander Dawid and Monica Musio. "Theory and Applications of Proper Scoring Rules". In: METRON 72 (Jan. 2014).

[FCK15] Rafael M. Frongillo, Yiling Chen, and Ian A. Kash. "Elicitation for Aggregation". In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA. Ed. by Blai Bonet and Sven Koenig. AAAI Press, 2015, pp. 900–906.

[FES20] Christian Feldbacher-Escamilla and Gerhard Schurz. "Optimal probability aggregation based on generalized brier scoring". In: Annals of Mathematics and Artificial Intelligence 88 (July 2020).

[FK14] Rafael Frongillo and Ian Kash. "General Truthfulness Characterizations via Convex Analysis". In: Web and Internet Economics. Ed. by Tie-Yan Liu, Qi Qi, and Yinyu Ye. Cham: Springer International Publishing, 2014, pp. 354–370. isbn: 978-3-319-13129-0.

[Fre+20] Rupert Freeman, David M. Pennock, Dominik Peters, and Bo Waggoner. "Preventing Arbitrage from Collusion When Eliciting Probabilities". In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 2020, pp. 1958–1965. url: https://aaai.org/ojs/index.php/AAAI/article/view/5566.

[Gen84] Christian Genest. "A Characterization Theorem for Externally Bayesian Groups". In: Ann. Statist. (1984).

[Goo52] I. J. Good. "Rational Decisions". In: Journal of the Royal Statistical Society. Series B (Methodological) (1952). issn: 00359246.

[GR07] Tilmann Gneiting and Adrian E. Raftery. "Strictly proper scoring rules, prediction, and estimation". In: Journal of the American Statistical Association (2007).

In: Information Sciences. issn: 0020-0255. doi: https://doi.org/10.1016/j.ins.2010.08.043.

[GZ86] Christian Genest and James V. Zidek. "Combining Probability Distributions: A Critique and an Annotated Bibliography". In: Statistical Science (1986). issn: 08834237.

[Han03] Robin Hanson. "Combinatorial Information Market Design". In: Information Systems Frontiers 5.1 (2003). url: https://ideas.repec.org/a/spr/infosf/v5y2003i1d10.1023_a1022058209073.html.

[Haz19] Elad Hazan. "Introduction to Online Convex Optimization". In: CoRR abs/1909.05207 (2019). url: http://arxiv.org/abs/1909.05207.

[Kol30] A. N. Kolmogorov. Sur la notion de la moyenne. G. Bardi, tip. della R. Accad. dei Lincei, 1930. url: https://books.google.com/books?id=iUqLnQEACAAJ.

[KR08] Christian Kascha and Francesco Ravazzolo. Combining inflation density forecasts. Working Paper 2008/22. Norges Bank, Dec. 2008. url: https://ideas.repec.org/p/bno/worpap/2008_22.html.

[LS07] Ron Lavi and Chaitanya Swamy. "Truthful Mechanism Design for Multi-Dimensional Scheduling via Cycle Monotonicity". In: Proceedings of the 8th ACM Conference on Electronic Commerce. EC '07. San Diego, California, USA: Association for Computing Machinery, 2007, pp. 252–261. isbn: 9781595936530.
10 . 1145 /1250910.1250947 .[Nag30] Mitio Nagumo. “ ¨Uber eine Klasse der Mittelwerte”. In:
Japanese journal ofmathematics :transactions and abstracts doi :
10 . 4099 /jjm1924.7.0_71 .[Pet19] Richard Pettigrew. “Aggregating incoherent agents who disagree”. In:
Synthese
196 (July 2019). doi : .[PR00] David Poole and Adrian E. Raftery. “Inference for Deterministic SimulationModels: The Bayesian Melding Approach”. In: Journal of the American Sta-tistical Association doi :
10 . 1080 / 01621459 .2000.10474324 . url : .[Roc70a] R. T. Rockafellar. “On the maximal monotonicity of subdifferential mappings.”In: Pacific J. Math. url : https://projecteuclid.org:443/euclid.pjm/1102977253 .[Roc70b] R. Tyrrell Rockafellar. Convex Analysis . Princeton University Press, 1970. isbn :9780691015866. url : .[SAM66] Emir Shuford, Arthur Albert, and H. Edward Massengill. “Admissible probabil-ity measurement procedures”. In: Psychometrika url : https://EconPapers.repec.org/RePEc:spr:psycho:v:31:y:1966:i:2:p:125-145 .[Sat+14] Ville Satop¨a¨a, Jonathan Baron, Dean Foster, Barbara Mellers, Philip Tetlock,and Lyle Ungar. “Combining multiple probability predictions using a simplelogit model”. In: International Journal of Forecasting
30 (Apr. 2014), 344–356. doi : .[Sav71] Leonard J. Savage. “Elicitation of Personal Probabilities and Expectations”. In: Journal of the American Statistical Association issn :01621459. url : .[SY05] Michael Saks and Lan Yu. “Weak monotonicity suffices for truthfulness on con-vex domains”. In: Jan. 2005, pp. 286–293. doi : .[Tsa88] Constantino Tsallis. “Possible generalization of Boltzmann-Gibbs statistics”.In: Journal of Statistical Physics
52 (July 1988), pp. 479–487. doi : . 26Voh07] Rakesh V. Vohra. Paths, Cycles and Mechanism Design . 2007.[Wik18] Wikipedia contributors.
Mahler’s inequality — Wikipedia, The Free Encyclopedia. [Online; accessed 07-February-2021]. 2018. url: https://en.wikipedia.org/wiki/Mahler%27s_inequality.

A. Outline of appendices

• In Appendix B we give an interpretation of logarithmic opinion pooling as averaging experts’ Bayesian evidence, as we mentioned in Section 1.
• In Appendix C we prove that s is continuous if and only if G is differentiable if and only if g is continuous, as we stated in Section 3.
• In Appendix D we give the proof of Theorem 5.1 (that a QA aggregator’s score is concave in the experts’ weights), state the no-regret algorithm referenced in Theorem 5.5, and then prove that the algorithm indeed has low regret.
• In Appendix E we give the proof of Theorem 6.6, which states that the axioms in Definition 6.5 capture the class of QA pooling methods in the case of n = 2 outcomes. We then present and prove an analogous axiomatization in full generality (i.e. for arbitrary values of n).
• In Appendix F we discuss which well-known proper scoring rules satisfy the convex exposure property.
B. Details omitted from Section 1
Logarithmic pooling as averaging Bayesian evidence
We discuss for simplicity the binary outcome case, though this discussion holds in general. Suppose that an expert assigns a probability to an event X occurring by updating on some prior (50%, say, though this does not matter). Suppose that the expert receives evidence E. Bayesian updating works as follows:
\[ \frac{\Pr[X \mid E]}{\Pr[\neg X \mid E]} = \frac{\Pr[X]}{\Pr[\neg X]} \cdot \frac{\Pr[E \mid X]}{\Pr[E \mid \neg X]}. \]
That is, the expert multiplies their odds of X (i.e. the probability of X divided by the probability of $\neg X$) by the relative likelihood that E would be the case conditioned on X versus $\neg X$. Equivalently, we can take the log of both sides; this tells us that the posterior log odds of X are equal to the prior log odds plus $\log \frac{\Pr[E \mid X]}{\Pr[E \mid \neg X]}$. For every piece of evidence $E_k$ that the expert receives, they make this update (assuming that the $E_k$'s are mutually independent conditioned on X).

This means that for any $E_k$, we can view the quantity $\log \frac{\Pr[E_k \mid X]}{\Pr[E_k \mid \neg X]}$ as the strength of evidence that $E_k$ gives in favor of X. We call this the "Bayesian evidence" in favor of X given by $E_k$. The total Bayesian evidence that the expert has in favor of X (i.e. the sum of these values over all $E_k$) is the expert's log odds of X (i.e. $\log \frac{\Pr[X]}{\Pr[\neg X]}$).

Logarithmic pooling takes the average of experts' log odds of X. As we have shown, if the experts are Bayesian, this amounts to taking the average of all experts' amounts of Bayesian evidence in favor of X.

C. Details omitted from Section 3
Proposition C.1.
Given a proper scoring rule s, the following are equivalent:
(1) s is continuous.
(2) G is differentiable.
(3) G is continuously differentiable.

Proof. Any differentiable convex function is continuously differentiable [Roc70b, Theorem 25.5], so (2) implies (3).

If G is continuously differentiable, then both G and g are continuous, so s is continuous by Equation 3. Thus, (3) implies (1).

Finally, assume that s is continuous. Since $G(p) = \sum_j p(j)\, s(p; j)$, it follows that G is continuous. It follows by Equation 3 (with g taken to be a subgradient of G, as in [GR07]) that g is continuous. A convex function with a continuous subgradient is differentiable [BC11, Proposition 17.41]. This proves that (1) implies (2).

D. Details omitted from Section 5
D.1. Proof of Theorem 5.1
Theorem 5.1.
Let s be a proper scoring rule with convex exposure on a forecast domain D, and fix any $p_1, \dots, p_m \in D$. Given a weight vector $\mathbf{w} = (w_1, \dots, w_m) \in \Delta_m$, define the weight-score of $\mathbf{w}$ for an outcome j as
\[ \mathrm{WS}_j(\mathbf{w}) := s\left( \bigoplus_{i=1}^m{}^{g}\, (p_i, w_i);\ j \right). \]
Then for every $j \in [n]$, $\mathrm{WS}_j(\mathbf{w})$ is a concave function of $\mathbf{w}$.

Proof of Theorem 5.1. Let $\mathbf{v}$ and $\mathbf{w}$ be two weight vectors. We wish to show that for any $c \in [0, 1]$,
\[ \mathrm{WS}_j(c\mathbf{v} + (1-c)\mathbf{w}) - c\,\mathrm{WS}_j(\mathbf{v}) - (1-c)\,\mathrm{WS}_j(\mathbf{w}) \ge 0. \]
Recall the notation $p^*_{\mathbf{w}}$ from Definition 3.8. Note that
\[ g(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) = \sum_{i=1}^m (c v_i + (1-c) w_i)\, g(p_i) = c\, g(p^*_{\mathbf{v}}) + (1-c)\, g(p^*_{\mathbf{w}}). \tag{5} \]
We have
\begin{align*}
&\mathrm{WS}_j(c\mathbf{v} + (1-c)\mathbf{w}) - c\,\mathrm{WS}_j(\mathbf{v}) - (1-c)\,\mathrm{WS}_j(\mathbf{w}) \\
&= s(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}; j) - c\,s(p^*_{\mathbf{v}}; j) - (1-c)\,s(p^*_{\mathbf{w}}; j) \\
&= G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) + \langle g(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}),\, \delta_j - p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \rangle \\
&\quad - c\left( G(p^*_{\mathbf{v}}) + \langle g(p^*_{\mathbf{v}}),\, \delta_j - p^*_{\mathbf{v}} \rangle \right) - (1-c)\left( G(p^*_{\mathbf{w}}) + \langle g(p^*_{\mathbf{w}}),\, \delta_j - p^*_{\mathbf{w}} \rangle \right) \\
&= G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) - \langle g(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \rangle \\
&\quad - c\left( G(p^*_{\mathbf{v}}) - \langle g(p^*_{\mathbf{v}}),\, p^*_{\mathbf{v}} \rangle \right) - (1-c)\left( G(p^*_{\mathbf{w}}) - \langle g(p^*_{\mathbf{w}}),\, p^*_{\mathbf{w}} \rangle \right).
\end{align*}
Step 1 follows from the definition of WS. Step 2 follows from Equation 3. Step 3 follows from Equation 5, and specifically from the fact that the inner product of each side with $\delta_j$ is the same (so the $\delta_j$ terms cancel out, leaving a quantity that does not depend on j).
Continuing where we left off:
\begin{align*}
&\mathrm{WS}_j(c\mathbf{v} + (1-c)\mathbf{w}) - c\,\mathrm{WS}_j(\mathbf{v}) - (1-c)\,\mathrm{WS}_j(\mathbf{w}) \\
&= G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) - c\langle g(p^*_{\mathbf{v}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \rangle - (1-c)\langle g(p^*_{\mathbf{w}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \rangle \\
&\quad - c\left( G(p^*_{\mathbf{v}}) - \langle g(p^*_{\mathbf{v}}),\, p^*_{\mathbf{v}} \rangle \right) - (1-c)\left( G(p^*_{\mathbf{w}}) - \langle g(p^*_{\mathbf{w}}),\, p^*_{\mathbf{w}} \rangle \right) \\
&= c\left( G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) - G(p^*_{\mathbf{v}}) - \langle g(p^*_{\mathbf{v}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} - p^*_{\mathbf{v}} \rangle \right) \\
&\quad + (1-c)\left( G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) - G(p^*_{\mathbf{w}}) - \langle g(p^*_{\mathbf{w}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} - p^*_{\mathbf{w}} \rangle \right) \\
&= c\, D_G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \,\|\, p^*_{\mathbf{v}}) + (1-c)\, D_G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \,\|\, p^*_{\mathbf{w}}) \ge 0.
\end{align*}
Step 4 again follows from Equation 5. Step 5 is a rearrangement of terms. Finally, step 6 follows from the definition of Bregman divergence, and step 7 follows from the fact that Bregman divergence is always non-negative. This completes the proof.
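Theorem 5.1 can also be sanity-checked numerically. The sketch below is ours, not part of the proof: it uses the logarithmic scoring rule $s(p; j) = \log p(j)$ on two outcomes, for which QA pooling is logarithmic pooling, and checks the concavity inequality on the forecasts from the introduction. The helper names (`log_pool`, `weight_score`) are hypothetical.

```python
import math

def log_pool(ps, ws):
    # QA pooling with respect to the log scoring rule (two outcomes):
    # average the forecasts' log odds, i.e. take a renormalized
    # weighted geometric mean of the probabilities.
    num = math.prod(p ** w for p, w in zip(ps, ws))
    den = math.prod((1 - p) ** w for p, w in zip(ps, ws))
    return num / (num + den)

def weight_score(w, ps, j):
    # WS_j(w) = s(pooled forecast; j), with s(p; j) = log p(j), j in {0, 1}.
    total = sum(w)
    q = log_pool(ps, [wi / total for wi in w])
    return math.log(q if j == 1 else 1 - q)

ps = [0.10, 0.25, 0.70]  # the three model forecasts from the introduction
v = [0.8, 0.1, 0.1]
w = [0.2, 0.3, 0.5]
for j in (0, 1):
    for c in (0.0, 0.25, 0.5, 0.75, 1.0):
        mix = [c * a + (1 - c) * b for a, b in zip(v, w)]
        # Concavity: WS_j(c v + (1-c) w) >= c WS_j(v) + (1-c) WS_j(w).
        lhs = weight_score(mix, ps, j)
        rhs = c * weight_score(v, ps, j) + (1 - c) * weight_score(w, ps, j)
        assert lhs >= rhs - 1e-12
```

For this choice of s, concavity can also be seen directly: $\mathrm{WS}_1(\mathbf{w})$ is the log of a sigmoid of a linear function of the weights.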
D.2. Gradient descent algorithm for Theorem 5.5
We follow the online gradient descent algorithm as presented in [Haz19].

Algorithm D.1 ([Haz19], Algorithm 6). Input: convex set $\mathcal{K}$, convex functions $f_t$ on $\mathcal{K}$, T, $x_1 \in \mathcal{K}$, step sizes $\{\eta_t\}$.
for t = 1 to T do
  Play $x_t$ and observe cost $f_t(x_t)$.
  Update and project:
    $y_{t+1} = x_t - \eta_t \nabla f_t(x_t)$
    $x_{t+1} = \Pi_{\mathcal{K}}(y_{t+1})$
end for

Algorithm D.1 achieves the following regret guarantee.

Theorem D.2 ([Haz19], Theorem 3.1). Online gradient descent with step sizes $\{\eta_t = \frac{D}{G\sqrt{t}},\ t \in [T]\}$ guarantees the following for all $T \ge 1$:
\[ \mathrm{regret}_T = \sum_{t=1}^T f_t(x_t) - \min_{x^* \in \mathcal{K}} \sum_{t=1}^T f_t(x^*) \le \frac{3}{2}\, GD\sqrt{T}. \]
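The update-and-project loop of Algorithm D.1 can be sketched as follows for the case where $\mathcal{K}$ is the probability simplex (the setting used below). This is our illustration, not code from [Haz19]; in particular, the sort-based Euclidean projection is a standard routine that we chose for $\Pi_{\mathcal{K}}$, and the function names are ours.

```python
import numpy as np

def project_simplex(y):
    # Euclidean projection of y onto the probability simplex,
    # via the standard sort-and-threshold algorithm.
    y = np.asarray(y, dtype=float)
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, y.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(y - css[rho] / (rho + 1.0), 0.0)

def online_gradient_descent(grad_fns, x1, step_sizes):
    # Algorithm D.1: play x_t, observe the cost's gradient at x_t,
    # take a gradient step, and project back onto the feasible set.
    x = np.asarray(x1, dtype=float)
    plays = []
    for grad, eta in zip(grad_fns, step_sizes):
        plays.append(x)
        x = project_simplex(x - eta * grad(x))
    return plays
```

In the setting of Theorem 5.5, `grad` would return $\nabla L_t$ at the played weight vector and the step sizes would be those of Algorithm D.3 below.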
Here, D is an upper bound on the diameter of $\mathcal{K}$ and G is an upper bound on the Lipschitz constant of any $f_t$, i.e. $|f_t(x) - f_t(y)| \le G\|x - y\|$ for any t and any $x, y \in \mathcal{K}$.

In our setting, $\mathcal{K} = \Delta_m$ and $f_t$ is our loss function (i.e. the negative of the score) at time step t, as a function of $\mathbf{w}$; we will denote this function by $L_t$. That is, $L_t(\mathbf{w}) = -\mathrm{WS}_{j_t}(\mathbf{w})$, where WS is as in Theorem 5.1, relative to forecasts $p^t_1, \dots, p^t_m$. Our adaptation of Algorithm D.1 is as follows.

Algorithm D.3 (Online gradient descent algorithm for Theorem 5.5). We proceed as follows:
• For t ≥
1, define $\eta_t := \frac{1}{M\sqrt{mt}}$.
• Start with an arbitrary guess $\mathbf{w}^1 \in \Delta_m$.
• At each time step t from 1 to T:
  – Play $\mathbf{w}^t$ and observe loss $L_t(\mathbf{w}^t)$.
  – Let $\tilde{\mathbf{w}}^{t+1} = \mathbf{w}^t - \eta_t \nabla L_t(\mathbf{w}^t)$. If $\tilde{\mathbf{w}}^{t+1} \in \Delta_m$, let $\mathbf{w}^{t+1} = \tilde{\mathbf{w}}^{t+1}$. Otherwise, let $\mathbf{w}^{t+1}$ be the orthogonal projection of $\tilde{\mathbf{w}}^{t+1}$ onto $\Delta_m$.

We now prove that Algorithm D.3 satisfies the guarantee of Theorem 5.5.

Theorem 5.5.
Let s be a bounded proper scoring rule with convex exposure over a forecast domain D. For time steps $t = 1, \dots, T$, an agent chooses a weight vector $\mathbf{w}^t \in \Delta_m$. The agent then receives a score of
\[ s\left( \bigoplus_{i=1}^m{}^{g}\, (p^t_i, w^t_i);\ j_t \right), \]
where $p^t_1, \dots, p^t_m \in D$ and $j_t \in [n]$ are chosen adversarially. By choosing $\mathbf{w}^t$ according to Algorithm D.3 (online gradient descent on the experts' weights), the agent achieves $O(\sqrt{T})$ regret in comparison with the best weight vector in hindsight. In particular, if M is an upper bound on $\|g\|$, then for every $\mathbf{w}^* \in \Delta_m$ we have
\[ \sum_{t=1}^T \left( s\left( \bigoplus_{i=1}^m{}^{g}\, (p^t_i, w^*_i); j_t \right) - s\left( \bigoplus_{i=1}^m{}^{g}\, (p^t_i, w^t_i); j_t \right) \right) \le 3\sqrt{m}\, M\sqrt{T}. \]

Proof. In our setting, D = √
2. We claim that $B \le \sqrt{2m}\, M$, where B is an upper bound on $\|\nabla L_t(\mathbf{w})\|$. This would make our choice of $\eta_t$ match that of Algorithm D.1 and guarantee a regret of at most $\frac{3}{2}\sqrt{2}\, B\sqrt{T} \le 3\sqrt{m}\, M\sqrt{T}$. The remainder of the proof is demonstrating that indeed $B \le \sqrt{2m}\, M$.

Let L be an arbitrary loss function, i.e. $L(\mathbf{w}) = -\mathrm{WS}_j(\mathbf{w})$ for some $j, p_1, \dots, p_m$. Let $p^*(\mathbf{w}) = \bigoplus_{i=1}^m{}^{g}\, (p_i, w_i)$. We claim that
\[ \nabla L(\mathbf{w}) = \begin{pmatrix} g(p_1)^\top \\ \vdots \\ g(p_m)^\top \end{pmatrix} (p^*(\mathbf{w}) - \delta_j), \tag{6} \]
where this m-dimensional vector should be interpreted modulo translation by the all-ones vector (see Remark 3.9). To see this, observe that
\[ \nabla L(\mathbf{w}) = -\nabla \mathrm{WS}_j(\mathbf{w}) = -\nabla_{\mathbf{w}}\, s(p^*(\mathbf{w}); j) = -\nabla_{\mathbf{w}} \left( G(p^*(\mathbf{w})) + \langle g(p^*(\mathbf{w})),\, \delta_j - p^*(\mathbf{w}) \rangle \right), \]
where $\nabla_{\mathbf{w}}$ denotes the gradient with respect to change in the weight vector $\mathbf{w}$ (as opposed to change in the probability vector). Now, by the chain rule for gradients, we have
\[ \nabla_{\mathbf{w}}\, G(p^*(\mathbf{w})) = (J p^*(\mathbf{w}))^\top g(p^*(\mathbf{w})), \]
where $J p^*$ denotes the Jacobian matrix of the function $p^*(\mathbf{w})$. Also, we have
\[ g(p^*(\mathbf{w})) = \sum_{i=1}^m w_i\, g(p_i), \]
so (again by the chain rule) we have
\[ \nabla_{\mathbf{w}} \left( \langle g(p^*(\mathbf{w})),\, \delta_j - p^*(\mathbf{w}) \rangle \right) = \begin{pmatrix} g(p_1)^\top \\ \vdots \\ g(p_m)^\top \end{pmatrix} (\delta_j - p^*(\mathbf{w})) - (J p^*(\mathbf{w}))^\top g(p^*(\mathbf{w})). \]
This gives us Equation 6. Now, for any i, we have
\[ |\langle g(p_i),\, p^*(\mathbf{w}) - \delta_j \rangle| \le \|g(p_i)\| \|p^*(\mathbf{w}) - \delta_j\| \le \sqrt{2}\, M. \]
Therefore,
\[ \|\nabla L(\mathbf{w})\| \le \sqrt{m \cdot (\sqrt{2} M)^2} = \sqrt{2m}\, M, \]
completing the proof.

E. Details omitted from Section 6

E.1. Proof of Theorem 6.6
Theorem 6.6.
A pooling operator is a QA pooling operator (as in Definition 6.2) with respect to some g if and only if it satisfies the axioms in Definition 6.5.

Proof.
For this proof, we will use ⊕ (without a g subscript) to denote an arbitrary pooling operator that satisfies the axioms in Definition 6.5. We begin by noting a few important facts about weighted forecasts and pooling operators. First, we find it natural to define a notion of multiplying a weighted forecast pair by a positive constant.

Definition E.1.
Given a weighted forecast
$\Pi = (p, w)$ and $c > 0$, define $c\Pi := (p, cw)$. Note that $m\Pi = \bigoplus_{i=1}^m \Pi$ for any positive integer m, by idempotence; this definition is a natural extension to all $c > 0$. We note the following (quite obvious) fact.
Proposition E.2.
For every weighted forecast $\Pi$ and $c_1, c_2 > 0$, we have $c_1(c_2\Pi) = (c_1 c_2)\Pi$.

A natural property that is not listed in Definition 6.5 is scale invariance, i.e. that $\mathrm{pr}((p_1, w_1) \oplus (p_2, w_2)) = \mathrm{pr}((p_1, cw_1) \oplus (p_2, cw_2))$ for any positive c; or, equivalently, that $c(\Pi_1 \oplus \Pi_2) = c\Pi_1 \oplus c\Pi_2$. This in fact follows from the listed axioms.

Proposition E.3 (Distributive property/scale invariance). For every $\Pi_1, \Pi_2$ and any operator ⊕ satisfying the axioms in Definition 6.5, we have $c(\Pi_1 \oplus \Pi_2) = c\Pi_1 \oplus c\Pi_2$.

Proof. First suppose c is an integer. Then
\[ c\Pi_1 \oplus c\Pi_2 = \bigoplus_{i=1}^c \Pi_1 \oplus \bigoplus_{i=1}^c \Pi_2 = \bigoplus_{i=1}^c (\Pi_1 \oplus \Pi_2) = c(\Pi_1 \oplus \Pi_2). \]
Here, the first and last steps follow by weight additivity and idempotence. Now suppose that $c = \frac{k}{\ell}$ is a rational number. Let $\Pi'_1 = \frac{1}{\ell}\Pi_1$ and $\Pi'_2 = \frac{1}{\ell}\Pi_2$. We have
\[ \frac{k}{\ell}(\Pi_1 \oplus \Pi_2) = \frac{k}{\ell}(\ell\Pi'_1 \oplus \ell\Pi'_2) = \frac{k}{\ell} \cdot \ell(\Pi'_1 \oplus \Pi'_2) = k(\Pi'_1 \oplus \Pi'_2) = k\Pi'_1 \oplus k\Pi'_2 = \frac{k}{\ell}\Pi_1 \oplus \frac{k}{\ell}\Pi_2. \]
Here, the second and second-to-last steps follow from the fact that the distributive property holds for integers.

Finally, we make use of the continuity axiom to extend our proof to all positive real numbers c. In particular, it suffices to show that $\mathrm{pr}(\Pi_1 \oplus \Pi_2) = \mathrm{pr}(c\Pi_1 \oplus c\Pi_2)$. Let p be the former quantity; note that $\mathrm{pr}(r\Pi_1 \oplus r\Pi_2) = p$ for positive rational numbers r. Since the rationals are dense among the reals, it follows that for every $\epsilon > 0$, we have $|\mathrm{pr}(c\Pi_1 \oplus c\Pi_2) - p| \le \epsilon$. Therefore, $\mathrm{pr}(c\Pi_1 \oplus c\Pi_2) = p$. This completes the proof.

(As we mentioned, for an associative pooling operator ⊕, $\Pi_1 \oplus \Pi_2 \oplus \cdots \oplus \Pi_m$ is a well-specified quantity, even without indicating parenthesization. This lets us use the notation $\bigoplus_{i=1}^m \Pi_i$, which is why the statement of Theorem 6.6 makes sense despite pooling operators not being m-ary by default.)

We first show that $\oplus_g$ satisfies the axioms in Definition 6.5. Weight additivity, commutativity, and idempotence are trivial. Associativity is also clear: given $\Pi_1 = (p_1, w_1)$ and likewise $\Pi_2, \Pi_3$, we have
\[ g(\mathrm{pr}((\Pi_1 \oplus_g \Pi_2) \oplus_g \Pi_3)) = \frac{(w_1 + w_2)\, g(\mathrm{pr}(\Pi_1 \oplus_g \Pi_2)) + w_3\, g(p_3)}{(w_1 + w_2) + w_3} = \frac{(w_1 + w_2)\frac{w_1 g(p_1) + w_2 g(p_2)}{w_1 + w_2} + w_3\, g(p_3)}{w_1 + w_2 + w_3} = \frac{w_1 g(p_1) + w_2 g(p_2) + w_3 g(p_3)}{w_1 + w_2 + w_3} \]
and likewise for $g(\mathrm{pr}(\Pi_1 \oplus_g (\Pi_2 \oplus_g \Pi_3)))$, so $\mathrm{pr}((\Pi_1 \oplus_g \Pi_2) \oplus_g \Pi_3) = \mathrm{pr}(\Pi_1 \oplus_g (\Pi_2 \oplus_g \Pi_3))$ (since g is strictly increasing and therefore injective). The fact that the weights are also the same is trivial. Continuity follows from the fact that
\[ \mathrm{pr}(\Pi_1 \oplus_g \Pi_2) = g^{-1}\left( \frac{w_1 g(p_1) + w_2 g(p_2)}{w_1 + w_2} \right) \]
is continuous in $(w_1, w_2)$ (when $w_1, w_2$ are not both zero). Here we are using the fact that g is strictly increasing, which means that $g^{-1}$ is continuous.

Finally, regarding the monotonicity axiom, for any fixed w and $p_1 > p_2$ (as in the axiom statement), we have
\[ g(\mathrm{pr}((p_1, x) \oplus_g (p_2, w - x))) = \frac{x\, g(p_1) + (w - x)\, g(p_2)}{x + w - x} = \frac{x\, g(p_1) + (w - x)\, g(p_2)}{w}. \]
Since $p_1 > p_2$, we have $g(p_1) > g(p_2)$, so the right-hand side strictly increases with x. Since $g^{-1}$ is also strictly increasing, it follows that $\mathrm{pr}((p_1, x) \oplus_g (p_2, w - x))$ strictly increases with x.

The converse — that every pooling operator satisfying the axioms in Definition 6.5 is $\oplus_g$ for some g — works by constructing g by fixing it at two points and then determining g at all other points. We now show how to do this when the forecast domain is [0,
1] (though the technique works for any closed D); see the proof of Theorem E.9 for the argument in full generality.

Let ⊕ be a pooling operator that satisfies our axioms. Define g as follows: let g(0) = 0 and g(1) = 1. For 0 < p <
1, define g(p) = w, where w is such that $(1, w) \oplus (0, 1 - w) = (p, 1)$. (Such a w exists by continuity and the intermediate value theorem; it is unique by the "strictly" increasing stipulation of monotonicity.) Note that g is continuous and increasing by monotonicity.¹ ²

We wish to show that for any $\Pi_1 = (p_1, w_1)$ and $\Pi_2 = (p_2, w_2)$, we have that $\Pi_1 \oplus \Pi_2 = \Pi_1 \oplus_g \Pi_2$. Clearly the weight of both sides is $w_1 + w_2$, so we wish to show that the probabilities on each side are the same. We have³
\begin{align*}
\mathrm{pr}(\Pi_1 \oplus \Pi_2) &= \mathrm{pr}(w_1(p_1, 1) \oplus w_2(p_2, 1)) \\
&= \mathrm{pr}(w_1((1, g(p_1)) \oplus (0, 1 - g(p_1))) \oplus w_2((1, g(p_2)) \oplus (0, 1 - g(p_2)))) \\
&= \mathrm{pr}(w_1(1, g(p_1)) \oplus w_1(0, 1 - g(p_1)) \oplus w_2(1, g(p_2)) \oplus w_2(0, 1 - g(p_2))) \\
&= \mathrm{pr}((1, w_1 g(p_1)) \oplus (0, w_1(1 - g(p_1))) \oplus (1, w_2 g(p_2)) \oplus (0, w_2(1 - g(p_2)))) \\
&= \mathrm{pr}((1, w_1 g(p_1) + w_2 g(p_2)) \oplus (0, w_1(1 - g(p_1)) + w_2(1 - g(p_2)))) \\
&= \mathrm{pr}\left( \frac{1}{w_1 + w_2}\left( (1, w_1 g(p_1) + w_2 g(p_2)) \oplus (0, w_1(1 - g(p_1)) + w_2(1 - g(p_2))) \right) \right) \\
&= \mathrm{pr}\left( \left(1, \frac{w_1 g(p_1) + w_2 g(p_2)}{w_1 + w_2}\right) \oplus \left(0, \frac{w_1(1 - g(p_1)) + w_2(1 - g(p_2))}{w_1 + w_2}\right) \right),
\end{align*}
which by definition of g is equal to the probability p such that $g(p) = \frac{g(p_1) w_1 + g(p_2) w_2}{w_1 + w_2}$. That is, $\mathrm{pr}(\Pi_1 \oplus \Pi_2) = \mathrm{pr}(\Pi_1 \oplus_g \Pi_2)$.

Showing that ⊕ and $\oplus_g$ are equivalent for more than two arguments is now trivial:
\[ \bigoplus_{i=1}^m{}^{g}\, \Pi_i = \Pi_1 \oplus_g \Pi_2 \oplus_g \Pi_3 \cdots \oplus_g \Pi_m = \Pi_1 \oplus \Pi_2 \oplus_g \Pi_3 \cdots \oplus_g \Pi_m = \cdots = \bigoplus_{i=1}^m \Pi_i. \]
(Here we are implicitly using the fact that $\oplus_g$ is associative, as we proved earlier.) This completes the proof.

¹ As a matter of fact, g is strictly increasing, because it is impossible for $g(p_1)$ to equal $g(p_2)$ for $p_1 \ne p_2$, as that would mean that $(1, g(p_1)) \oplus (0, 1 - g(p_1)) = (p_1, 1) = (p_2, 1)$.
² $(1, w) \oplus (0, 1 - w)$ is continuous in w by the continuity axiom. In a sense, the continuity of g corresponds to the strictness of increase in the monotonicity axiom and the strictness of increase of g corresponds to the continuity axiom.
³ Steps 3 and 7 use the distributive property (Proposition E.3).

E.2. Generalization of our axiomatization to higher dimensions
Just as we fixed a two-outcome forecast domain D in Section 6, we now fix an n-outcome forecast domain D for any n ≥ 2. Our definition of weighted forecasts remains the same (except that now pr(Π) is a vector). Our definition of quasi-arithmetic pooling, however, needs to change to make g vector-valued. This raises the question: what is the analogue of "increasing" for vector-valued functions? It turns out that the relevant notion for us is cyclical monotonicity, introduced by Rockafellar [Roc70a] (see also [Roc70b]). Define $H_n(c) := \{x \in \mathbb{R}^n : \sum_i x_i = c\}$. Recall from Remark 3.9 that the range of the gradient of a function defined on D is a subset of $H_n(0)$.

Definition E.4 (Quasi-arithmetic pooling with arbitrary weights). Given a continuous, strictly cyclically monotone vector-valued function $g: D \to H_n(0)$ whose range is a convex set, and weighted forecasts $\Pi_1 = (p_1, w_1), \dots, \Pi_m = (p_m, w_m)$, define the quasi-arithmetic pool of $\Pi_1, \dots, \Pi_m$ with respect to g as
\[ \bigoplus_{i=1}^m{}^{g}\, (p_i, w_i) := \left( g^{-1}\left( \frac{\sum_i w_i\, g(p_i)}{\sum_i w_i} \right),\ \sum_i w_i \right). \]

Definition E.5 (Cyclical monotonicity). A function $g: U \subseteq \mathbb{R}^n \to \mathbb{R}^n$ is cyclically monotone if for every list of points $x_1, x_2, \dots, x_{k-1}, x_k = x_0 \in U$, we have
\[ \sum_{i=1}^k \langle g(x_i),\, x_i - x_{i-1} \rangle \ge 0. \]
We also say that g is strictly cyclically monotone if the inequality is strict except when $x_0 = \cdots = x_{k-1}$.

To gain an intuition for this notion, consider the case of k = 2; then this condition says that $\langle g(x_1) - g(x_0),\, x_1 - x_0 \rangle \ge 0$. In other words, the change in g from $x_0$ to $x_1$ is in the same general direction as the direction from $x_0$ to $x_1$. This property is called two-cycle (or weak) monotonicity.

Cyclical monotonicity is a stronger notion, which may be familiar to the reader for its applications in mechanism design and revealed preference theory; see e.g. [LS07], [Ash+10], [FK14], [Voh07]. Two-cycle monotonicity is equivalent to cyclical monotonicity when the range of g is finite [SY05]. However, cyclical monotonicity is substantially stronger than two-cycle monotonicity when the range of g is infinitely large, as in our setting. In fact, the difference between these two conditions is that a two-cycle monotone function is cyclically monotone if and only if it is also vortex-free [AK14, Theorem 3.9]. Vortex-freeness means that the path integral of g along any triangle vanishes. See [AK14] for a detailed comparison of these two notions.

The immediately relevant fact for us is that cyclically monotone functions are gradients of convex functions (and vice versa). Speaking more precisely:

Theorem E.6.
A vector-valued function g is continuous and strictly cyclically monotone if and only if it is the gradient of a differentiable, strictly convex function G.

Proof of Theorem E.6. Per a theorem of Rockafellar ([Roc70a]; see also Theorem 24.8 in [Roc70b]), a function g is cyclically monotone if and only if it is a subgradient of a convex function G. The proof of this fact shows just as easily that a function is strictly cyclically monotone if and only if it is a subgradient of a strictly convex function.

Consider a differentiable, strictly convex function G. Its gradient is continuous (see Proposition C.1). Conversely, consider a continuous, strictly cyclically monotone vector-valued function g. As we just discussed, it is a subgradient of some strictly convex function G. A convex function with a continuous subgradient is differentiable [BC11, Proposition 17.41].

This means that the conditions on g in Definition E.4 are precisely those necessary to let g be any function that it could be in our original definition of quasi-arithmetic pooling (Definition 3.8). Our new definition is thus equivalent to the old one (after normalizing weights to add to 1).

We now discuss our axioms for pooling operators, which will again capture the class of QA pooling operators. We keep the weight additivity, commutativity, associativity, and idempotence axioms verbatim from our discussion of the n = 2 case. We slightly strengthen the continuity axiom (see below).

We also add a new axiom, subtraction, which states that if $\Pi_1 \oplus \Pi = \Pi_2 \oplus \Pi$ then $\Pi_1 = \Pi_2$. Subtraction in the n = 2 case follows from monotonicity; in this case, however, the subtraction axiom will help us state the monotonicity axiom. In particular, it allows us to make the following definition, which essentially extends the notion of pooling to allow for negative weights.

Definition E.7.
Let ⊕ be a pooling operator satisfying weight additivity, commutativity, associativity, and subtraction. Fix $p_1, \dots, p_k \in D$. Define a function $p: \Delta_k \to D$ (with $p_1, \dots, p_k$ serving as implicit arguments) by
\[ p(w_1, \dots, w_k) = \mathrm{pr}\left( \bigoplus_{i=1}^k (p_i, w_i) \right). \]
We extend the definition of p to a partial function on $H_k(1)$, as follows: given input $(w_1, \dots, w_k)$, let $S \subseteq [k]$ be the set of indices i such that $w_i < 0$ and $T \subseteq [k]$ be the set of indices i such that $w_i > 0$. We define $p(w_1, \dots, w_k)$ to be the $q \in D$ such that
\[ (q, 1) \oplus \left( \bigoplus_{i \in S} (p_i, -w_i) \right) = \bigoplus_{i \in T} (p_i, w_i). \]
Note that q is not guaranteed to exist, which is why we call p a partial function. However, if q exists then it is unique, by the subtraction axiom.

We can now state the full axiomatization, including the monotonicity axiom.
Definition E.8 (Axioms for pooling operators). For a pooling operator ⊕ on D, we define the following axioms.
1. Weight additivity: $\mathrm{wt}(\Pi_1 \oplus \Pi_2) = \mathrm{wt}(\Pi_1) + \mathrm{wt}(\Pi_2)$ for every $\Pi_1, \Pi_2$.
2. Commutativity: $\Pi_1 \oplus \Pi_2 = \Pi_2 \oplus \Pi_1$ for every $\Pi_1, \Pi_2$.
3. Associativity: $\Pi_1 \oplus (\Pi_2 \oplus \Pi_3) = (\Pi_1 \oplus \Pi_2) \oplus \Pi_3$ for every $\Pi_1, \Pi_2, \Pi_3$.
4. Continuity: For every positive integer k and $p_1, \dots, p_k$, the quantity $\mathrm{pr}\left( \bigoplus_{i=1}^k (p_i, w_i) \right)$ is a continuous function of $(w_1, \dots, w_k)$ on $\mathbb{R}^k_{\ge 0} \setminus \{0\}$.
5. Idempotence: For every $\Pi_1$ and $\Pi_2$, if $\mathrm{pr}(\Pi_1) = \mathrm{pr}(\Pi_2)$ then $\mathrm{pr}(\Pi_1 \oplus \Pi_2) = \mathrm{pr}(\Pi_1)$.
6. Subtraction: If $\Pi_1 \oplus \Pi = \Pi_2 \oplus \Pi$ then $\Pi_1 = \Pi_2$.
7. Monotonicity: There exist vectors $p_1, \dots, p_n \in D$ such that p (as in Definition E.7) is a strictly cyclically monotone function from its domain to $\mathbb{R}^n$.

(The continuity axiom is only well-defined conditioned on ⊕ being associative, which is fine for our purposes. We allow a proper subset of weights to be zero by defining the aggregate to ignore forecasts with weight zero.)

The monotonicity axiom asserts the existence of n "anchor points" in D such that the function p from weight vectors to D that pools the anchor points with the weights given as input obeys a notion of monotonicity (namely cyclical monotonicity). Informally, this means that the vector of weights that one would need to give to the anchor points in order to arrive at a forecast p "correlates" with the forecast p itself.

We now state the main theorem of our axiomatization.

Theorem E.9.
A pooling operator is a QA pooling operator (as in Definition E.4) with respect to some g if and only if it satisfies the axioms in Definition E.8.

Proof. We begin by noting the following fact, which follows from results in [Roc70b].

Proposition E.10.
A strictly cyclically monotone function $g: D \to \mathbb{R}^n$ is injective, and its inverse $g^{-1}$ is strictly cyclically monotone and continuous.

We provide a partial proof below; it relies on the following observation.
Remark E.11.
We can instead write the condition as
\[ \sum_{i=1}^k (g(x_i) - g(x_{i-1})) \cdot x_i \ge 0. \]
This is equivalent to the condition in Definition E.5, because it is the same statement (with rearranged terms) when the $x_i$'s are listed in reverse order.

Proof.
First, suppose that g(x) = g(y). Then
\[ \langle g(x),\, x - y \rangle + \langle g(y),\, y - x \rangle = 0. \]
Since g is strictly cyclically monotone, this implies that x = y. (Note that we only use two-cycle monotonicity.)

We now show that $g^{-1}$ is strictly cyclically monotone. That is, we wish to show that
\[ \sum_{i=1}^k x_i \cdot (g^{-1}(x_i) - g^{-1}(x_{i-1})) > 0 \]
for $x_1, \dots, x_k = x_0$ that are not all the same. (See Remark E.11.) By the strict cyclical monotonicity of g, we have that
\[ \sum_{i=1}^k g(p_i) \cdot (p_i - p_{i-1}) > 0, \]
where $p_i := g^{-1}(x_i)$. (Here we are using the injectivity of g: if $x_i \ne x_j$ then $p_i \ne p_j$.) This means that
\[ \sum_{i=1}^k x_i \cdot (g^{-1}(x_i) - g^{-1}(x_{i-1})) > 0, \]
as desired. As for continuity, we defer to [Roc70b, Theorem 26.5]. (Why can't we apply this result again to $g^{-1}$ to conclude that g is continuous, even though we did not assume it to be? The reason is that the proof of continuity relies on the convexity of D; if g is discontinuous then the domain of $g^{-1}$ may not be convex, or even connected, so we cannot apply the result to $g^{-1}$.)

Back to the proof of Theorem E.9, we first prove that any such $\oplus_g$ satisfies the stated axioms. Weight additivity, commutativity, associativity, and idempotence are clear. Continuity follows from the formula
\[ \mathrm{pr}((p_1, w_1) \oplus_g (p_2, w_2)) = g^{-1}\left( \frac{w_1\, g(p_1) + w_2\, g(p_2)}{w_1 + w_2} \right), \]
noting that $g^{-1}$ is continuous by Proposition E.10. Likewise, subtraction follows from the fact that g is injective (by Proposition E.10), as is $g^{-1}$ (likewise). Monotonicity remains.

The range of g contains an open subset of $H_n(0)$, so in particular it contains the vertices of some translated and dilated copy of the standard simplex. That is, there are n points $x_1, \dots, x_n$ in the range of g for which there is a positive scalar a and vector b such that $a\delta_i + b = x_i$ for every i. (Here $\delta_i$ is the i-th standard basis vector in $\mathbb{R}^n$.)
We will let $p_i$ be the pre-image of $x_i$ under g, so that $g(p_i) = a\delta_i + b$. Observe that for any $\mathbf{w}$ in the domain of p, we have
\[ g(p(\mathbf{w})) = \sum_{i=1}^n w_i\, g(p_i) = \sum_{i=1}^n w_i (a\delta_i + b) = a\mathbf{w} + b, \]
so
\[ p(\mathbf{w}) = g^{-1}(a\mathbf{w} + b). \]
We have that $g^{-1}$ is strictly cyclically monotone (by Proposition E.10), and it is easy to verify that for any strictly cyclically monotone function f, any a > 0, and any vector b, $f(a\mathbf{x} + b)$ is a strictly cyclically monotone function of $\mathbf{x}$. Therefore, $p(\mathbf{w}) = g^{-1}(a\mathbf{w} + b)$ is strictly cyclically monotone, as desired.

Now we prove the converse. Assume that we have a pooling operator ⊕ satisfying the axioms in Definition E.8. We wish to show that ⊕ is $\oplus_g$ for some $g: D \to H_n(0)$.

For the remainder of this proof, let $p_1, \dots, p_n$ be vectors certifying the monotonicity of ⊕, and let p(·) be as in Definition E.7. For any $q \in D$, let $g(q) := \mathbf{w} - \frac{1}{n}\vec{1}$, where $\mathbf{w} \in H_n(1)$ is such that $p(\mathbf{w}) = q$ and $\vec{1}$ is the all-ones vector. This raises the question of well-definedness: does this $\mathbf{w}$ necessarily exist, and if so, is it unique? The following claim shows that this is indeed the case.

Claim E.12.
The function p, from the subset of $H_n(1)$ where it is defined to D, is bijective. (This follows from the invariance of domain theorem, which states that the image of an open subset of a manifold under an injective continuous map is open.)

Proof. The fact that p is injective follows from the fact that it is strictly cyclically monotone (see Proposition E.10). We now show that p is surjective.

Let $q \in D$. Define the function $\tilde{p}: \Delta_{n+1} \to D$ by
\[ \tilde{p}(w_1, \dots, w_{n+1}) := \mathrm{pr}\left( \left( \bigoplus_{i=1}^n (p_i, w_i) \right) \oplus (q, w_{n+1}) \right). \]
Since $\tilde{p}$ is a continuous map from $\Delta_{n+1}$ (an n-dimensional manifold) to D (an (n−1)-dimensional manifold), $\tilde{p}$ is not injective. So in particular, let $\mathbf{w}_1 \ne \mathbf{w}_2 \in \Delta_{n+1}$ be such that $\tilde{p}(\mathbf{w}_1) = \tilde{p}(\mathbf{w}_2)$. That is, we have
\[ \left( \bigoplus_{i=1}^n (p_i, w_{1,i}) \right) \oplus (q, w_{1,n+1}) = \left( \bigoplus_{i=1}^n (p_i, w_{2,i}) \right) \oplus (q, w_{2,n+1}). \tag{7} \]
Observe that $w_{1,n+1} \ne w_{2,n+1}$; for otherwise it would follow from the subtraction axiom that two different combinations of the $p_i$'s would give the same probability, contradicting the fact that p is injective. Without loss of generality, assume that $w_{1,n+1} > w_{2,n+1}$. We can rearrange the terms in Equation 7 to look as follows:
\[ (q, w_{1,n+1} - w_{2,n+1}) \oplus \left( \bigoplus_{i \in S} (p_i, v_i) \right) = \bigoplus_{i \in T \subseteq [n] \setminus S} (p_i, v_i) \]
for some positive $v_1, \dots, v_n$. By the distributive property, we may divide all weights by $w_{1,n+1} - w_{2,n+1}$. The result will be an equation as in Definition E.7, certifying that q is in the range of the function p, as desired.

We return to our main proof, now that we have shown that our function $g(q) := \mathbf{w} - \frac{1}{n}\vec{1}$, where $\mathbf{w} \in H_n(1)$ is such that $p(\mathbf{w}) = q$, is well-defined. In fact, we can simply write $g(q) = p^{-1}(q) - \frac{1}{n}\vec{1}$.
(The vector $\frac{1}{n}\mathbf{1}_n$ is fairly arbitrary; it only serves the purpose of forcing the range of $g$ to lie in $H_n(0)$ instead of $H_n(1)$.)

We first show that the defining equation of $\oplus_g$ holds, that is, that if $(q_1, v_1) \oplus (q_2, v_2) = (q, v_1 + v_2)$ (with $v_1, v_2 \geq 0$, not both zero), then
\[g(q) = \frac{v_1 g(q_1) + v_2 g(q_2)}{v_1 + v_2}.\]
Let $w_1, w_2 \in H_n(1)$ be such that $q_1 = p(w_1)$ and $q_2 = p(w_2)$. It is intuitive that $q = p\left(\frac{v_1 w_1 + v_2 w_2}{v_1 + v_2}\right)$, but we show this formally.

Claim E.13.
Given $q_1, q_2 \in D$ with $q_1 = p(w_1)$, $q_2 = p(w_2)$, and $0 \leq \alpha \leq 1$, we have $p(\alpha w_1 + (1-\alpha) w_2) = (q_1, \alpha) \oplus (q_2, 1-\alpha)$.

[Footnote: By the continuity axiom; here we use the more general form we stated earlier.] [Footnote: This follows, e.g., from the Borsuk-Ulam theorem.]

Proof. Note that
\[(q_1, 1) \oplus \bigoplus_{i: w_{1,i} < 0} (p_i, -w_{1,i}) = \bigoplus_{i: w_{1,i} > 0} (p_i, w_{1,i}) \quad \text{and} \quad (q_2, 1) \oplus \bigoplus_{i: w_{2,i} < 0} (p_i, -w_{2,i}) = \bigoplus_{i: w_{2,i} > 0} (p_i, w_{2,i}).\]
Applying the distributive property to the two above equations with constants $\alpha$ and $1-\alpha$, respectively, and adding them, we get that
\[(q_1, \alpha) \oplus (q_2, 1-\alpha) \oplus \bigoplus_{i: w_{1,i} < 0} (p_i, -\alpha w_{1,i}) \oplus \bigoplus_{i: w_{2,i} < 0} (p_i, -(1-\alpha) w_{2,i}) = \bigoplus_{i: w_{1,i} > 0} (p_i, \alpha w_{1,i}) \oplus \bigoplus_{i: w_{2,i} > 0} (p_i, (1-\alpha) w_{2,i}).\]
We have that $(q_1, \alpha) \oplus (q_2, 1-\alpha) = (q', 1)$ for some $q' \in D$, and the above equation then certifies, as in Definition E.7, that $q' = p(\alpha w_1 + (1-\alpha) w_2)$, proving the claim.

Returning to the main argument: since $(q_1, v_1) \oplus (q_2, v_2) = (q, v_1 + v_2)$, we have $q = p\left(\frac{v_1 w_1 + v_2 w_2}{v_1 + v_2}\right)$. Applying Claim E.13 with $\alpha = \frac{v_1}{v_1 + v_2}$, we find that
\[g(q) = \frac{v_1 w_1 + v_2 w_2}{v_1 + v_2} - \frac{1}{n}\mathbf{1}_n = \frac{v_1\left(g(q_1) + \frac{1}{n}\mathbf{1}_n\right) + v_2\left(g(q_2) + \frac{1}{n}\mathbf{1}_n\right)}{v_1 + v_2} - \frac{1}{n}\mathbf{1}_n = \frac{v_1 g(q_1) + v_2 g(q_2)}{v_1 + v_2},\]
as desired.

It remains to show that $g$ is continuous, strictly cyclically monotone, and has convex range. By the monotonicity axiom, $p$ is strictly cyclically monotone. It follows by Proposition E.10 that its inverse is continuous and strictly cyclically monotone. Therefore, $g$ is continuous and strictly cyclically monotone (as it is simply a translation of $p^{-1}(q)$ by $-\frac{1}{n}\mathbf{1}_n$).

Finally, to show that $g$ has convex range, we wish to show that $p^{-1}$ has convex range; or, in other words, that the domain on which $p$ is defined is convex. And indeed, this follows straightforwardly from Claim E.13. Let $w_1, w_2$ be in the domain of $p$, with $p(w_1) = q_1$, $p(w_2) = q_2$. Then for any $0 \leq \alpha \leq 1$, we have that $p(\alpha w_1 + (1-\alpha) w_2) = (q_1, \alpha) \oplus (q_2, 1-\alpha)$, so in particular $\alpha w_1 + (1-\alpha) w_2$ is in the domain of $p$. This concludes the proof.

F. The convex exposure property
Several of our results have been contingent on the convex exposure property. We showed in Proposition 3.7 that this property always holds in the case of $n = 2$ outcomes (assuming, as we have been, that $s$ is continuous). In this appendix, we take this discussion further by considering when the convex exposure property holds in higher dimensions. As we shall see, the property holds for nearly all of the most commonly used scoring rules.

(A note on notation: in this section we use $p_j$ instead of $p(j)$ to refer to the $j$-th coordinate of a probability distribution $p$.)

Our first result says, roughly speaking, that scoring rules that, like the logarithmic scoring rule, "go off to infinity" have convex exposure.

Proposition F.1.
Let $s$ be a proper scoring rule on a forecast domain $D$ that is an open subset of $\Delta_n$, such that for any point $x$ on the boundary of $D$, and for any sequence $x_1, x_2, \dots$ converging to $x$, we have $\lim_{k \to \infty} \|g(x_k)\| = \infty$. Then $s$ has convex exposure.

This is a statement of convex analysis: namely, that if $\|g\|$ approaches $\infty$ on the boundary of a convex set, then the range of $g$ is convex (assuming $g$ is the gradient of a differentiable convex function). We refer the reader to [Roc70b, Theorem 26.5] for the proof. In non-pathological cases, the basic intuition is that every $v \in \{x : \sum_i x_i = 0\}$ is the gradient of $G$ at some point. In these cases, $\nabla G(x) = v$ where $x$ minimizes $G(x) - v \cdot x$; the $\lim_{k \to \infty} \|g(x_k)\| = \infty$ condition means that this minimum does not occur on the boundary of $D$.

Corollary F.2.
Let $\mathrm{int}(\Delta_n) := \{p \in \Delta_n : p_i > 0 \;\; \forall i\}$. The following scoring rules have convex exposure over forecast domain $D = \mathrm{int}(\Delta_n)$.

• The logarithmic scoring rule.

• The scoring rule given by $G(p) = -\sum_j p_j^\gamma$ for $\gamma \in (0,1)$ and by $G(p) = \sum_j p_j^\gamma$ for $\gamma < 0$.

• The scoring rule given by $G(p) = -\sum_j \ln p_j$, which can be thought of as the limit of the $G$ in the previous bullet point as $\gamma \to 0$. [Footnote: This is a natural way to think of this scoring rule because $\nabla G(p) = -(p_1^{-1}, \dots, p_n^{-1})$.]

• The scoring rule $hs$ given by $G_{hs}(p) = -\prod_j p_j^{1/n}$.

The hs scoring rule. The last of these scoring rules is a generalization of the scoring rule $hs(q) = 1 - \sqrt{\frac{1-q}{q}}$ [Footnote: Here we are using the shorthand notation for the $n = 2$ outcome case discussed in Remark 3.5.] used in [BB20] as a key ingredient in their minimax theorem for randomized algorithms. The key property of the scoring rule was a result about its amplification [BB20, Lemma 3.10]. The authors define a forecasting algorithm to be a generalization of a randomized algorithm that outputs an estimated probability that an output should be accepted. Then, roughly speaking, the authors show that given a forecasting algorithm $R$, it is possible to create a forecasting algorithm $R'$ that has a much larger expected score from the scoring rule $hs$ by running $R$ a small number of times and combining the results. For this reason, $hs$ deserves more attention.

Since additive and multiplicative constants are irrelevant, we may treat $hs(q) = -\sqrt{\frac{1-q}{q}}$. Observe that (in the case of two outcomes) the expected score $G_{hs}$ on a report of $q$ is
\[G_{hs}(q) = q\left(-\sqrt{\frac{1-q}{q}}\right) + (1-q)\left(-\sqrt{\frac{q}{1-q}}\right) = -2\sqrt{q(1-q)}.\]
That is, $G_{hs}$ is precisely (up to a constant factor) negative the geometric mean of $q$ and $1-q$.
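The two-outcome computation above can be checked numerically. The sketch below (our own function names, not code from [BB20]) confirms that the expected score of a truthful report $q$ equals $-2\sqrt{q(1-q)}$, and that truthful reporting maximizes the expected score, i.e. that $hs$ is proper:

```python
import math

def hs(q: float) -> float:
    """Score when the realized outcome was forecast with probability q,
    using the normalized form hs(q) = -sqrt((1 - q) / q) (constants dropped)."""
    return -math.sqrt((1 - q) / q)

def expected_score(p: float, q: float) -> float:
    """Expected hs-score of reporting q when the true probability is p."""
    return p * hs(q) + (1 - p) * hs(1 - q)

# The expected score of a truthful report equals -2 * sqrt(q * (1 - q)).
for q in [0.1, 0.3, 0.5, 0.9]:
    assert abs(expected_score(q, q) - (-2 * math.sqrt(q * (1 - q)))) < 1e-12

# Propriety: for fixed true p, the expected score is maximized at q = p.
p = 0.3
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda q: expected_score(p, q))
assert abs(best - p) < 1e-3
```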
This motivates us to generalize $hs$ to a setting with $n$ outcomes by setting
\[G_{hs}(p) := -\prod_{i=1}^n p_i^{1/n}.\]
It should not be obvious that this function is convex, but it turns out to be; this is the precise statement of an inequality known as Mahler's inequality [Wik18].

Next we note that the quadratic scoring rule has convex exposure, since its exposure function $g(p) = 2p$ (modulo $\mathbf{1}_n$, as discussed in Remark 3.9) maps any convex set to a convex set.

Proposition F.3.
The quadratic scoring rule has convex exposure (for any convex $D$).

Spherical scoring rules, the third most studied proper scoring rules after the quadratic and logarithmic rules, also have convex exposure.

Definition F.4 (Spherical scoring rules). [GR07, Example 2] For any $\alpha > 1$, define the spherical scoring rule with parameter $\alpha$ to be the scoring rule given by
\[G_{\mathrm{sph},\alpha}(p) := \left(\sum_{i=1}^n p_i^\alpha\right)^{1/\alpha}.\]
If the "spherical scoring rule" is referenced with no parameter $\alpha$ given, $\alpha$ is presumed to equal $2$.

Proposition F.5.
For any $\alpha > 1$, the spherical scoring rule with parameter $\alpha$ (over $D = \Delta_n$) has convex exposure.

Proof. Fix $\alpha > 1$. We will write $G$ in place of $G_{\mathrm{sph},\alpha}$. We have
\[g(p) = \left(\sum_{j=1}^n p_j^\alpha\right)^{(1/\alpha) - 1} \left(p_1^{\alpha-1}, \dots, p_n^{\alpha-1}\right). \tag{8}\]
Now, define the $n$-dimensional unit $\beta$-sphere to be $\{x : \sum_j x_j^\beta = 1\}$, and define the $n$-dimensional unit $\beta$-ball correspondingly (i.e. with $\leq$ in place of $=$). The range of $g$ is the part of the $n$-dimensional unit $\frac{\alpha}{\alpha-1}$-sphere with all non-negative coordinates. [Footnote: As discussed in Remark 3.9, the range of $g$ should be thought of as modulo $T(\mathbf{1}_n)$. However, we find it convenient for this proof to think of it as lying in $\mathbb{R}^n$ and project later.] Indeed, on the one hand, for any $p$ we have
\[\sum_j g_j(p)^{\alpha/(\alpha-1)} = \left(\sum_j p_j^\alpha\right)^{-1} \cdot \sum_j p_j^\alpha = 1\]
(where $g_j(p)$ denotes the $j$-th coordinate of $g(p)$ as in Equation 8). On the other hand, given a point $x$ on the unit $\frac{\alpha}{\alpha-1}$-sphere with all non-negative coordinates,
\[p = \left(\sum_j x_j^{1/(\alpha-1)}\right)^{-1} \left(x_1^{1/(\alpha-1)}, \dots, x_n^{1/(\alpha-1)}\right)\]
lies in $\Delta_n$ and satisfies $g(p) = x$.

The crucial point for us is that for $\beta > 1$, the unit $\beta$-ball is convex. This means that for any such $\beta$, the convex combination of any number of points on the unit $\beta$-sphere will lie in the unit $\beta$-ball. Since $\frac{\alpha}{\alpha-1} > 1$ for $\alpha > 1$, we have that for arbitrary $p, q \in \Delta_n$ and $w \in [0,1]$, the point $w g(p) + (1-w) g(q)$ lies in the unit $\frac{\alpha}{\alpha-1}$-ball, in fact in the part with all non-negative coordinates. Now, consider casting a ray from this convex combination point in the positive $\mathbf{1}_n$ direction. All points on this ray are equivalent to this point modulo $T(\mathbf{1}_n)$, and this ray will intersect the unit $\frac{\alpha}{\alpha-1}$-sphere at some point $x$ with all non-negative coordinates. The point $p^* \in \Delta_n$ with $g(p^*) = x$ then satisfies $g(p^*) = w g(p) + (1-w) g(q)$ modulo $T(\mathbf{1}_n)$. This completes the proof.
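The geometric construction in the proof can be made concrete for $\alpha = 2$, where $g(p) = p / \|p\|_2$. The following sketch (our own code, written under the assumption that all forecasts have positive coordinates) averages the normalized forecasts, shifts along the all-ones direction back onto the unit sphere, and rescales to the simplex:

```python
import math

def spherical_pool(p, q, w):
    """QA pooling for the spherical rule with alpha = 2, following the geometric
    construction in the proof: average g(p) = p/||p||_2 and g(q), move along the
    all-ones direction back onto the unit sphere, then rescale to sum to 1."""
    n = len(p)
    norm = lambda v: math.sqrt(sum(t * t for t in v))
    gp = [t / norm(p) for t in p]
    gq = [t / norm(q) for t in q]
    x = [w * a + (1 - w) * b for a, b in zip(gp, gq)]  # lies in the unit 2-ball
    # Solve ||x + shift * 1||_2 = 1 for the nonnegative root:
    # n * shift^2 + 2 * (sum x) * shift + (||x||^2 - 1) = 0.
    s = sum(x)
    shift = (-s + math.sqrt(s * s - n * (norm(x) ** 2 - 1))) / n
    y = [a + shift for a in x]        # on the unit sphere, nonnegative coords
    total = sum(y)
    return [a / total for a in y]     # scale back onto the simplex

pooled = spherical_pool([0.7, 0.2, 0.1], [0.2, 0.5, 0.3], 0.6)
assert abs(sum(pooled) - 1) < 1e-12 and all(t >= 0 for t in pooled)
```

Note that pooling a forecast with itself returns that forecast unchanged, as expected of any sensible pooling operator: the averaged point already lies on the sphere, so the shift is zero.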
Remark F.6.
The above proof gives a geometric interpretation of QA pooling with respect to the spherical scoring rule, particularly for $\alpha = 2$. In the $\alpha = 2$ case, pooling amounts to taking the following steps:

(1) Scale each forecast so it lies on the unit sphere.

(2) Take the weighted average of the resulting points in $\mathbb{R}^n$.

(3) Shift the resulting point in the positive $\mathbf{1}_n$ direction to the unique point in that direction that lies on the unit sphere.

(4) Scale this point so that its coordinates add to 1.

Finally we consider the parametrized family known as Tsallis scoring rules, defined in [Tsa88].
Definition F.7 (Tsallis scoring rules). For $\gamma > 1$, the Tsallis scoring rule with parameter $\gamma$ is the rule given by
\[G_{\mathrm{Tsa},\gamma}(p) = \sum_{j=1}^n p_j^\gamma.\]

Note that $\gamma = 2$ above yields the quadratic scoring rule. Note also that we have already addressed the scoring rule given by $G(p) = \pm\sum_j p_j^\gamma$ for $\gamma \leq 1$ (except $\gamma = 0, 1$, which are degenerate), with the sign chosen to make $G$ convex: these scoring rules have convex exposure by Proposition F.1. The following proposition completes our analysis for this natural class of scoring rules.

Proposition F.8.
For $\gamma \leq 2$, the Tsallis scoring rule with parameter $\gamma$ (over $D = \Delta_n$) has convex exposure. For $\gamma > 2$, this is not the case if $n > 2$.

Proof of Proposition F.8. Fix $\gamma > 1$. We will write $G$ in place of $G_{\mathrm{Tsa},\gamma}$. Up to a multiplicative factor of $\gamma$ that we are free to ignore, we have
\[g(p) = \left(p_1^{\gamma-1}, \dots, p_n^{\gamma-1}\right).\]
Let $p, q \in \Delta_n$ and $w \in [0,1]$. We seek $x \in \Delta_n$ such that $g(x) = w g(p) + (1-w) g(q)$ modulo $T(\mathbf{1}_n)$, i.e.
\[w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} + c = x_j^{\gamma-1}\]
for all $j \in [n]$, for some $c$. Since $\sum_j x_j = 1$, this $c$ must satisfy
\[\sum_j \left(w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} + c\right)^{1/(\gamma-1)} = 1. \tag{9}\]
Let $h(x) := \sum_j \left(w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} + x\right)^{1/(\gamma-1)}$. Note that $h$ is increasing in $x$.

First consider the case that $\gamma \leq 2$. By concavity, we have that
\[w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} \leq \left(w p_j + (1-w) q_j\right)^{\gamma-1}.\]
This means that
\[h(0) = \sum_j \left(w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1}\right)^{1/(\gamma-1)} \leq \sum_j \left(w p_j + (1-w) q_j\right) = 1.\]
On the other hand, $\lim_{x \to \infty} h(x) = \infty$. Since $h$ is continuous, there must be some $x \in [0, \infty)$ such that $h(x) = 1$; call this value $c$. Then let $x_j = \left(w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} + c\right)^{1/(\gamma-1)}$. Then every $x_j$ is nonnegative and $\sum_j x_j = 1$, so we have succeeded.

Now consider the case that $\gamma > 2$, and consider as a counterexample $p = (1, 0, \dots, 0)$, $q = (0, 1, 0, \dots, 0)$, $w = \frac{1}{2}$. To satisfy Equation 9, we are looking for $c$ such that
\[h(c) = 2\left(\frac{1}{2} + c\right)^{1/(\gamma-1)} + (n-2)\, c^{1/(\gamma-1)} = 1.\]
Note that $h(0) = 2 \cdot 2^{-1/(\gamma-1)} = 2^{(\gamma-2)/(\gamma-1)} > 1$, so $c < 0$ (as $h$ is increasing). But in that case $x_j^{\gamma-1} < 0$ for $j \geq 3$, a contradiction (assuming $n > 2$).

Since $\nabla G_{\mathrm{Tsa},\gamma}(p) = \left(p_1^{\gamma-1}, \dots, p_n^{\gamma-1}\right)$ (up to a constant factor), QA pooling with respect to the Tsallis scoring rule can be thought of as an appropriately scaled coordinate-wise $(\gamma-1)$-th power mean. For $\gamma = 2$ it is the coordinate-wise arithmetic average. For $\gamma = 3$ it is the coordinate-wise root mean square, but with the average of the squares shifted by an appropriate additive constant so that, upon taking the square roots, the probabilities add to 1. (However, as the Tsallis score with parameter 3 does not have convex exposure, this is not always well-defined.)

In Corollary F.2 we mentioned that the scoring rule given by $G(p) = -\sum_j \ln p_j$ can be thought of as an extension to $\gamma = 0$ of (what we are now calling) the Tsallis score, because the derivative of $\ln x$ is $x^{-1}$. QA pooling with respect to this scoring rule is, correspondingly, the $(-1)$-th power mean, i.e. harmonic pooling; see e.g. [Daw+95]. (The logarithmic scoring rule can similarly be thought of as the extension to $\gamma = 1$, in that the second derivative of $x \ln x$ is $x^{-1}$.)
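The existence argument in the proof of Proposition F.8 is constructive: for $1 < \gamma \leq 2$, the constant $c$ in Equation 9 can be found by bisection, since $h(0) \leq 1$ and $h$ is increasing and unbounded. A minimal sketch (our own code, not from the paper) for forecasts with positive coordinates:

```python
def tsallis_pool(p, q, w, gamma):
    """QA pooling w.r.t. the Tsallis rule for 1 < gamma <= 2: find c >= 0 with
    sum_j (w*p_j^(gamma-1) + (1-w)*q_j^(gamma-1) + c)^(1/(gamma-1)) = 1
    (Equation 9), then read off the pooled forecast x."""
    a = gamma - 1
    base = [w * pj**a + (1 - w) * qj**a for pj, qj in zip(p, q)]
    h = lambda c: sum((b + c) ** (1 / a) for b in base)
    lo, hi = 0.0, 1.0
    while h(hi) < 1:           # bracket the root: h(lo) <= 1 <= h(hi)
        hi *= 2
    for _ in range(200):       # bisection on [lo, hi]
        mid = (lo + hi) / 2
        if h(mid) < 1:
            lo = mid
        else:
            hi = mid
    c = (lo + hi) / 2
    return [(b + c) ** (1 / a) for b in base]

x = tsallis_pool([0.6, 0.3, 0.1], [0.2, 0.2, 0.6], 0.5, 1.5)
assert abs(sum(x) - 1) < 1e-9 and all(t >= 0 for t in x)
```

For $\gamma = 2$ the base terms already sum to 1, the bisection drives $c$ to 0, and the procedure reduces to linear pooling, as the discussion above predicts.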