From Proper Scoring Rules to Max-Min Optimal Forecast Aggregation
Eric Neyman, Tim Roughgarden

February 16, 2021
Abstract
This paper forges a strong connection between two seemingly unrelated forecasting problems: incentive-compatible forecast elicitation and forecast aggregation. Proper scoring rules are the well-known solution to the former problem. To each such rule s we associate a corresponding method of aggregation, mapping expert forecasts and expert weights to a "consensus forecast," which we call quasi-arithmetic (QA) pooling with respect to s. We justify this correspondence in several ways:

• QA pooling with respect to the two most well-studied scoring rules (quadratic and logarithmic) corresponds to the two most well-studied forecast aggregation methods (linear and logarithmic).

• Given a scoring rule s used for payment, a forecaster agent who sub-contracts several experts, paying them in proportion to their weights, is best off aggregating the experts' reports using QA pooling with respect to s, meaning this strategy maximizes its worst-case profit (over the possible outcomes).

• The score of an aggregator who uses QA pooling is concave in the experts' weights. As a consequence, online gradient descent can be used to learn appropriate expert weights from repeated experiments with low regret.

• The class of all QA pooling methods is characterized by a natural set of axioms (generalizing classical work by Kolmogorov on quasi-arithmetic means).

1. Introduction and motivation

You are a meteorologist tasked with advising the governor of Florida on hurricane preparations. A hurricane is threatening to make landfall in Miami, and the governor needs to decide whether to order a mass evacuation. The governor asks you what the likelihood is of a direct hit, so you decide to consult several weather models at your disposal. These models all give you different answers: 10%, 25%, 70%.
You trust the models equally, but your job is to come up with one number for the governor — your best guess, all things considered. What is the most sensible way for you to aggregate these numbers?

This is one of many applications of probabilistic opinion pooling. The problem of probabilistic opinion pooling (or forecast aggregation) asks: how should you aggregate several probabilities, or probability distributions, into one? This question is relevant in nearly every domain involving probabilities or risks: meteorology, national security, climate science, epidemiology, and economic policy, to name a few.

The setting that interests us is as follows: there are m experts, who report probability distributions p_1, ..., p_m over n possible outcomes (we call these reports, or forecasts). Additionally, each expert i has a non-negative weight w_i (with weights adding to 1); this weight represents the expert's quality, i.e. how much the aggregator trusts the expert. A pooling method takes these distributions and weights as input and outputs a single distribution p. (Where do these weights come from? How can one learn weights for experts? More on this later.)

Linear pooling is arguably the simplest of all reasonable pooling methods: a weighted arithmetic mean of the probability distributions:

    p = Σ_{i=1}^m w_i p_i.

Logarithmic pooling (sometimes called log-linear or geometric pooling) consists of taking a weighted geometric mean of the probabilities and scaling appropriately:

    p(j) = c Π_{i=1}^m (p_i(j))^{w_i}.

Here, p(j) denotes the probability of the j-th outcome and c is a normalizing constant to make the probabilities add to 1. Logarithmic pooling can be interpreted as averaging the experts' Bayesian evidence (see Appendix B).

The linear and logarithmic pooling methods are by far the two most studied ones, see e.g. [GZ86], [PR00], [KR08]. This is because they are simple and follow certain natural rules, which we briefly discuss in Section 2.
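Both pooling methods take only a few lines of code. The following sketch applies them to the hurricane example (10%, 25%, 70%, equal weights); the function names `linear_pool` and `log_pool` are ours, chosen for illustration.

```python
import math

def linear_pool(forecasts, weights):
    """Weighted arithmetic mean of probability distributions."""
    n = len(forecasts[0])
    return [sum(w * p[j] for p, w in zip(forecasts, weights)) for j in range(n)]

def log_pool(forecasts, weights):
    """Weighted geometric mean of probabilities, renormalized to sum to 1."""
    n = len(forecasts[0])
    unnorm = [math.prod(p[j] ** w for p, w in zip(forecasts, weights)) for j in range(n)]
    c = sum(unnorm)
    return [x / c for x in unnorm]

# Hurricane example: three models, two outcomes (hit, no hit), equal weights.
forecasts = [(0.10, 0.90), (0.25, 0.75), (0.70, 0.30)]
weights = [1/3, 1/3, 1/3]
print(linear_pool(forecasts, weights))  # hit probability 0.35
print(log_pool(forecasts, weights))     # hit probability ≈ 0.31
```

Note that the two methods disagree: the geometric mean is pulled further toward the lower forecasts than the arithmetic mean.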
Furthermore, they are each optimal according to some natural optimality metrics, see e.g. [Abb09].

1.2. Proper scoring rules

A seemingly unrelated topic within probabilistic forecasting is the truthful elicitation of forecasts: how can a principal structure a contract so as to elicit an expert's probability distribution in a way that incentivizes truthful reporting? This is usually done using a proper scoring rule.

A scoring rule is a function s that takes as input (1) a probability distribution over n outcomes and (2) a particular outcome, and assigns a score, or reward. The interpretation is that if the expert reports a distribution p and event j comes to pass, then the expert receives reward s(p; j) from the principal. A scoring rule is called proper if the expert's expected score is strictly maximized by reporting their probability distribution truthfully. That is, s is proper if

    Σ_{j=1}^n p(j) s(p; j) ≥ Σ_{j=1}^n p(j) s(x; j)

for all x, with equality only for x = p. It is worth noting that properness is preserved under positive affine transformations. That is, if s is proper, then s'(p; j) := a·s(p; j) + b is proper if a > 0.

Quadratic scoring rule
One example of a proper scoring rule is Brier's quadratic scoring rule, introduced in [Bri50]. It is given by

    s_quad(p; j) := 2p(j) − Σ_{k=1}^n p(k)².

The quadratic scoring rule can be interpreted as penalizing the expert by an amount equal to the squared distance from their report p to the "true answer" δ_j (i.e. the vector with a 1 in the j-th position and zeros elsewhere).

Logarithmic scoring rule
Another example of a proper scoring rule is the logarithmic scoring rule, introduced in [Goo52]. It is given by

    s_log(p; j) := ln p(j).

The logarithmic rule is the only proper scoring rule for which an expert's score depends only on the probability assigned to the eventual outcome and not on other outcomes [SAM66]. The quadratic and logarithmic scoring rules are by far the most studied and most frequently used ones in practice.
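Both rules are easy to state in code. The sketch below (the names `s_quad`, `s_log`, and `expected_score` are ours) also checks properness numerically: over a grid of possible reports, an expert with belief (0.7, 0.3) maximizes their expected quadratic score by reporting truthfully.

```python
import math

def s_quad(p, j):
    """Brier's quadratic score: 2*p[j] - sum_k p[k]^2 (outcomes 0-indexed)."""
    return 2 * p[j] - sum(q * q for q in p)

def s_log(p, j):
    """Logarithmic score: ln p[j] (undefined when p[j] = 0)."""
    return math.log(p[j])

p = (0.7, 0.3)
print(s_quad(p, 0))  # ≈ 0.82 if the 70% outcome happens
print(s_quad(p, 1))  # ≈ 0.02 if it does not

def expected_score(s, report, belief):
    """Expected score of `report` under the distribution `belief`."""
    return sum(belief[j] * s(report, j) for j in range(len(belief)))

# Properness check: the best report on a fine grid is the belief itself.
reports = [(x / 100, 1 - x / 100) for x in range(1, 100)]
best = max(reports, key=lambda r: expected_score(s_quad, r, p))
print(best)  # the grid report matching the true belief
```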
Choice of scoring rule as a value judgment
There are infinitely many proper scoring rules. How might a principal go about deciding which one to use? To gain some intuition, we will take a closer look at the quadratic and logarithmic scoring rules in the case of n = 2 outcomes. In Figure 1, for both of these scoring rules, we show the difference between the expert's reward if an outcome happens and if it does not happen, as a function of the expert's report. (We specify the domain of s more precisely in Section 3.)

Figure 1: Difference between expert's reward if an outcome happens and if it does not happen, as a function of the expert's report, for the quadratic and logarithmic scoring rules.

For example, if the expert reports a 70% probability of an outcome, then under the quadratic rule they receive a score of 2 · 0.7 − 0.7² − 0.3² = 0.82 if the outcome happens and 2 · 0.3 − 0.7² − 0.3² = 0.02 if it does not: a difference of 0.8. If rewarded with the logarithmic rule (scaled down by a factor of 2 ln 2 to make the two rules comparable), this difference would be (ln 0.7 − ln 0.3)/(2 ln 2) ≈ 0.61.

This difference scales linearly with the expert's report for the quadratic rule. Meanwhile, for the logarithmic rule, the difference changes more slowly than for the quadratic rule for probabilities in the middle, but much more quickly at the extremes. Informally speaking, this means that the logarithmic rule indicates a preference (of the elicitor) for high precision close to 0 and 1, while the quadratic rule indicates a more even preference for precision across [0, 1]. An elicitor who uses the logarithmic rule indicates that the probabilities 0.01 and 0.001 are quite different; one who uses the quadratic rule indicates that these probabilities are very similar.

On its surface, the elicitation of forecasts has seemingly little to do with their aggregation. However, given that the choice of scoring rule implies a subjective judgment about how different probabilities compare to one another, it makes sense to apply this judgment to the aggregation of forecasts as well. For example, if the logarithmic scoring rule accurately reflects a principal's preferences, how does that value judgment inform how that principal should aggregate multiple forecasts? This brings us to the main focus of our paper: namely, we prove a novel correspondence between proper scoring rules and opinion pooling methods.
Before introducing the aforementioned correspondence, we need to introduce the Savage representation of a proper scoring rule. (In Figure 1 we scaled down the logarithmic rule by a factor of 2 ln 2 to make the two rules comparable; this factor was chosen to make the range of values taken on by the Savage representations of the two scoring rules the same, see Section 1.3.)

Savage representation

A proper scoring rule has a unique representation in terms of its expected reward function G, i.e. the expected score of an expert who believes (and reports) a distribution p:

    G(p) := E_{j←p}[s(p; j)] = Σ_{j=1}^n p(j) s(p; j).

This representation of s, introduced in [Sav71], is known as the Savage representation, though we will usually refer to it as the expected reward function. Given that s is proper, G is strictly convex; and conversely, given a strictly convex function G, one can re-derive s with the formula

    s(p; j) = G(p) + ⟨g(p), δ_j − p⟩,   (1)

where g is the gradient of G [GR07]. Pictorially, draw the tangent plane to G at p; then the expert's score if outcome j is realized is the height of the plane at δ_j.

The Savage representation of the quadratic scoring rule is G_quad(p) = Σ_{j=1}^n p(j)². The Savage representation of the logarithmic scoring rule is G_log(p) = Σ_{j=1}^n p(j) ln p(j).

The function g, which will be central to our paper, describes the difference in the expert's score depending on which outcome happens. More precisely, the vector (s(p; j_1), ..., s(p; j_n)) is exactly the vector g(p), except possibly for a uniform translation in all coordinates. For example, s(p; j_1) − s(p; j_2) = g_1(p) − g_2(p); this is precisely the quantity plotted in Figure 1 for the quadratic and logarithmic scoring rules. This observation about the function g motivates the connection that we will establish between proper scoring rules and opinion pooling methods.

Quasi-arithmetic opinion pooling
We can now define our correspondence between proper scoring rules and opinion pooling methods. Given a proper scoring rule s used for elicitation, and given m probability distributions p_1, ..., p_m and expert weights w_1, ..., w_m, the aggregate distribution p* that we suggest is the one satisfying

    g(p*) = Σ_{i=1}^m w_i g(p_i).   (2)

(Here g is the gradient of the expected reward function G, or a subgradient if G is not differentiable. It is natural to ask whether this p* exists and whether it is unique. We will discuss this shortly.)

We refer to this pooling method as quasi-arithmetic pooling with respect to g (or the scoring rule s), or QA pooling for short. (This term comes from the notion of quasi-arithmetic means: given a continuous, strictly increasing function f and values x_1, ..., x_m, the quasi-arithmetic mean with respect to f of these values is f⁻¹((1/m) Σ_i f(x_i)).) To get a sense of QA pooling, let us determine what this method looks like for the quadratic and logarithmic scoring rules.

QA pooling with respect to the quadratic scoring rule
We have g_quad(x) = (2x_1, ..., 2x_n), so we are looking for the p* such that

    (2p*(1), ..., 2p*(n)) = Σ_{i=1}^m w_i (2p_i(1), ..., 2p_i(n)).

This gives p* = Σ_{i=1}^m w_i p_i. Therefore, QA pooling for the quadratic scoring rule is precisely linear pooling.

QA pooling with respect to the logarithmic scoring rule
We have g_log(x) = (ln x_1 + 1, ..., ln x_n + 1), so we are looking for the p* such that

    (ln p*(1) + 1, ..., ln p*(n) + 1) = Σ_{i=1}^m w_i (ln p_i(1) + 1, ..., ln p_i(n) + 1).

By exponentiating the components on both sides, we find that p*(j) = c Π_{i=1}^m (p_i(j))^{w_i} for all j, for some proportionality constant c. This is precisely the definition of the logarithmic pooling method. (The constant c comes from the fact that values of g(·) should be interpreted modulo translation by the all-ones vector; see Remark 3.9.)

The fact that this pooling scheme maps the two most well-studied scoring rules to the two most well-studied opinion pooling methods has not been noted previously, to our knowledge. This correspondence suggests that — beyond just our earlier informal justification — QA pooling with respect to a given scoring rule may be a fundamental concept. The rest of this paper argues that this is indeed the case.

(Section 4) Max-min optimality

Suppose that a principal asks you to issue a forecast and will pay you according to s. You are not knowledgeable on the subject but know some experts whom you trust on the matter (perhaps to varying degrees). You sub-contract the experts, promising to pay each expert i according to w_i · s. By using QA pooling according to s on the experts' forecasts, you guarantee yourself a profit; in fact, this strategy maximizes your worst-case profit, and is the unique such report. Furthermore, this profit is the same for all outcomes. This fact can be interpreted to mean that you have, in a sense, pooled the forecasts "correctly": you do not care which outcome will come to pass, which means that you have correctly factored the expert opinions into your forecast.
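The two derivations above can be checked numerically. In this sketch (function names are ours), the quadratic rule's g-average halves back into the linear pool, and exponentiating and renormalizing the logarithmic rule's g-average yields the logarithmic pool.

```python
import math

def qa_pool_quadratic(forecasts, weights):
    """QA pool w.r.t. the quadratic rule: average g_quad(p) = 2p, then invert."""
    n = len(forecasts[0])
    avg_g = [sum(w * 2 * p[j] for p, w in zip(forecasts, weights)) for j in range(n)]
    return [v / 2 for v in avg_g]  # equals the linear pool

def qa_pool_logarithmic(forecasts, weights):
    """QA pool w.r.t. the log rule: average ln p(j) (the '+1' is a uniform
    translation that cancels), exponentiate, and renormalize."""
    n = len(forecasts[0])
    avg_g = [sum(w * math.log(p[j]) for p, w in zip(forecasts, weights)) for j in range(n)]
    unnorm = [math.exp(v) for v in avg_g]
    c = sum(unnorm)
    return [v / c for v in unnorm]  # equals the logarithmic pool

forecasts = [(0.10, 0.90), (0.25, 0.75), (0.70, 0.30)]
w = [1/3, 1/3, 1/3]
print(qa_pool_quadratic(forecasts, w))   # the linear pool: hit probability 0.35
print(qa_pool_logarithmic(forecasts, w)) # the logarithmic pool
```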
We give an additional interpretation of this optimality notion as maximizing an aggregator's guaranteed improvement over choosing an expert at random.

In Section 4.2, we give an additional interpretation of QA pooling as an optimal method relative to a proper scoring rule: namely, the QA pool of expert forecasts is the forecast with respect to which the experts would be the least wrong (as measured via the weighted average of Bregman divergences associated with G).

(Section 5) Learning expert weights
Opinion pooling entails assigning weights to experts. Where do these weights come from? How might one learn them from experience? Suppose we have a fixed proper scoring rule s, and further consider fixing the reports of the m experts as well as the eventual outcome. One can ask: what does the score of the aggregate distribution (per QA pooling with respect to s) look like as a function of w, the vector of expert weights? We prove that this function is concave. This is useful because it allows for online convex optimization over expert weights.

Theorem (informal). Let s be a bounded proper scoring rule. For time steps t = 1, ..., T, m experts report forecasts to an aggregator, who combines them into a forecast p_t using QA pooling with respect to s and suffers a loss of −s(p_t; j_t), where j_t is the outcome at time step t. If the aggregator updates the experts' weights using online gradient descent, then the aggregator's regret compared to the best weights in hindsight is O(√T).

The aforementioned concavity property is a nontrivial fact that demonstrates an advantage of QA pooling over e.g. linear and logarithmic pooling: these pooling methods satisfy the concavity property for some proper scoring rules s but not others.

(Section 6) Natural axiomatization for QA pooling methods

[Kol30] and [Nag30] independently came up with a simple axiomatization of quasi-arithmetic means. We show how to change these axioms to allow for weighted means; the resulting axiomatization is a natural characterization of all quasi-arithmetic pooling methods in the case of n = 2 outcomes. Furthermore, although quasi-arithmetic means are typically defined for scalar-valued functions, we demonstrate that these axioms can be extended to describe quasi-arithmetic means with respect to vector-valued functions, as is necessary for our purposes if n >
2. This extension is nontrivial but natural, and to our knowledge has not previously been described.
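Returning to the learning result previewed under "(Section 5)" above: the procedure can be sketched for the quadratic rule, for which QA pooling is linear pooling. All names here are ours, the gradient is taken numerically for brevity (a real implementation would use the closed-form gradient and a tuned step size), and the step is gradient ascent on the (concave) score, i.e. gradient descent on the loss.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w_i >= 0, sum_i w_i = 1}."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for k, x in enumerate(u, start=1):
        css += x
        t = (css - 1.0) / k
        if x - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def s_quad(p, j):
    return 2 * p[j] - sum(q * q for q in p)

def qa_pool_quad(forecasts, w):
    """QA pool w.r.t. the quadratic rule = linear pool."""
    n = len(forecasts[0])
    return [sum(wi * p[j] for p, wi in zip(forecasts, w)) for j in range(n)]

def ogd_step(w, forecasts, outcome, eta=0.1, eps=1e-6):
    """One projected gradient-ascent step on the score of the QA pool."""
    grad = []
    for i in range(len(w)):
        w_hi = w[:]
        w_hi[i] += eps  # numerical partial derivative in coordinate i
        grad.append((s_quad(qa_pool_quad(forecasts, w_hi), outcome)
                     - s_quad(qa_pool_quad(forecasts, w), outcome)) / eps)
    return project_to_simplex([wi + eta * gi for wi, gi in zip(w, grad)])

# Expert 2 is consistently closer to the truth (outcome 0 always occurs),
# so its weight should grow over repeated rounds.
w = [0.5, 0.5]
for _ in range(200):
    w = ogd_step(w, [(0.5, 0.5), (0.9, 0.1)], outcome=0)
print(w)  # weight shifts toward expert 2
```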
The reverse direction: pooling to scoring
Although we have mostly discussed the correspondence between proper scoring rules and pooling methods in the context of "given a scoring rule, what is the most natural pooling method," the correspondence holds in reverse. That is, if a principal has a pooling method in mind, they can choose the scoring rule with which to reward the aggregator according to this correspondence. In Section 7 we give an interpretation of this reverse connection in the context of collusion between experts.
When is QA pooling well defined?
Is there always a p* satisfying Equation 2, and is it guaranteed to be unique? In Section 3 we show that the answer to the uniqueness question is yes, and that in the n = 2 outcomes case, the answer to the existence question is also yes. For larger values of n, such a p* may not exist. In particular, QA pooling with respect to s is well defined if and only if the range of g(·) is a convex set. For reasons that we will discuss, we call this the convex exposure property. In Appendix F, we will discuss when this property holds. In particular, we will show that it holds for the quadratic, logarithmic, and spherical scoring rules, as well as other natural classes of proper scoring rules.
2. Related work
Opinion pooling

[CW07] categorize mathematical approaches to opinion pooling as either Bayesian or axiomatic. A Bayesian approach to this problem is one that entails Bayes updating on each expert's opinion. While quite natural, Bayesian opinion pooling is difficult to apply and, in full generality, computationally intractable. This is because the Bayes updates must fully account for interdependencies between expert opinions.

By contrast, axiomatic approaches do not make assumptions about the structure of information underlying the experts' opinions; instead, they aim to come up with pooling methods that satisfy certain axioms or desirable properties. Such axioms include unanimity preservation, eventwise independence, and external Bayesianality; see e.g. [DL14] for statements of these axioms. One line of prior work [§4] defines a notion of pooling analogous to our Definition 4.5, though in a different context. The main focus of their line of work is on connecting opinion pooling to Bregman divergence; our approach connects opinion pooling to proper scoring rules, and a connection to Bregman divergence falls naturally out of this pursuit.
Scoring rules
The literature on scoring rules is quite large; we recommend [GR07] for a thorough but technical overview, or [Car16] for a less technical overview that focuses more on applications (while still introducing the basic theory). Seminal work on the theory behind scoring rules includes Brier's paper introducing the quadratic rule [Bri50], Good's paper introducing the logarithmic rule [Goo52], and Savage's work on the general theory of proper scoring rules [Sav71]. Additionally, see [DM14] for an overview of various families of proper scoring rules.
Aggregation via prediction markets
One common way to aggregate probabilistic forecasts is through prediction markets, some of which are based on scoring rules. [Han03] introduced market scoring rules (MSRs), in which experts are sequentially presented with an opportunity to update an aggregate forecast and are rewarded (or penalized) by the amount that their update changed the aggregate prediction's eventual score. [CP07] introduced cost-function markets, in which a market maker sells n types of shares — one for each outcome — where the price of a share depends on the number of shares sold thus far according to some cost function. They established a connection between cost-function markets and MSRs, where a market with a given cost function will behave the same way as a certain MSR (under certain conditions later formalized in [ACV13]). Subsequent work explored this area further, tying cost-function market making to online learning of probability distributions [CV10], [ACV13]. This work differs from ours in that the goal of their online learning problem is to learn a probability distribution over outcomes, whereas our goal in Section 5 is to learn expert weights.

While MSRs and cost-function markets have superficial similarities to our work, they have quite different goals and properties. For both MSRs and cost-function markets, incentives are set up so that an expert brings the market into alignment with their own opinion, rather than an aggregate. Thus, in the well-studied setting of experts whose beliefs do not depend on other experts' actions, the final state of such a market reflects only the beliefs of the most recent trader, rather than an aggregate of the experts' beliefs.

Arbitrage from collusion
Part of our work can be viewed as a generalization of previous work by Chun and Shachter done in a different context: namely, preventing colluding experts from exploiting arbitrage opportunities [CS11]. The authors show that for the case of n = 2 outcomes, if experts are rewarded with the same scoring rule s, preventing this is impossible: the experts can successfully collude by all reporting what we are calling the QA pool of their reports with respect to s. Our Theorem 4.1 recovers this result as a special case. See [Che+14] for related work in the context of wagering mechanisms and [Fre+20] for follow-up work on preventing arbitrage from colluding experts.

Prediction with expert advice
In Section 5 we discuss learning expert weights online. The online learning literature is vast, but our approach fits into the framework of prediction with expert advice. In this setting, at each time step each expert submits a report (in our context a probability distribution). The agent then submits a report based on the experts' submissions, and suffers a loss depending on this report and the eventual outcome. See [CBL06] for a detailed account of this setting; the authors prove a variety of no-regret bounds, ranging (depending on the setting) from O(√T) to O(1). Our setting is an ambitious one: while typically one desires low regret compared to the best expert in hindsight, we desire low regret compared to the best mixture of experts in hindsight. The authors of [CBL06] prove O(log T) regret for exp-concave losses in comparison with the best linear pool of experts in hindsight. This setting is different from ours in two important ways: first, the losses that we consider are not in general exp-concave (e.g. the quadratic loss); and second, the authors consider linear pooling for any loss, whereas we consider QA pooling with respect to the loss function.

Quasi-arithmetic means
Our notion of quasi-arithmetic pooling is an adaptation (and extension to higher dimensions) of the existing notion of quasi-arithmetic means. These were originally defined and axiomatized independently in [Kol30] and [Nag30]. Aczél generalized this work to include weighted quasi-arithmetic means [Acz48], though these means have weights baked in rather than taking them as inputs, which is different from our setting. See [Gra+11] for further background.
3. Preliminaries
Throughout this paper, we will let m be the number of experts and use the index i to refer to any particular expert. We will let n be the number of outcomes and use the index j to refer to any particular outcome.

Let ∆_n be the standard simplex in R^n, i.e. the one with vertices δ_1, ..., δ_n. (Here, δ_j denotes the vector with a 1 in the j-th coordinate and zeros elsewhere.) Note that ∆_n is an (n − 1)-dimensional object. We define an n-outcome forecast domain to be any convex (n − 1)-dimensional subset of ∆_n. (Formally, we require that the forecast domain contain a subset that is homeomorphic to R^{n−1}.) We will use D to denote an arbitrary forecast domain. Although our results will apply to any forecast domain, the two forecast domains that we expect to be by far the most useful are ∆_n and ∆_n minus its boundary: the former for bounded scoring rules (such as the quadratic rule) and the latter for unbounded ones (such as the logarithmic rule).

Given an n-outcome forecast domain D, a proper scoring rule on D is a function s : D × [n] → R such that for all p ∈ D, we have E_{j←p}[s(p; j)] ≥ E_{j←p}[s(x; j)] for all x ∈ D, with equality only when x = p. (Here, j ← p means that j is drawn randomly from the probability distribution p.) Some authors refer to such scoring rules as strictly proper while others assume that propriety entails strictness; we choose the latter convention. Also, while many sources define the range of s to include ±∞ so as to e.g. make the logarithmic scoring rule well defined on the boundary of ∆_n, we do not do this. This is an application-specific choice: we will be interested in pooling forecasts, and it is unclear how one would sensibly pool e.g. (1, 0) and (0, 1).

Given a proper scoring rule s, we define its expected reward function (or Savage representation) G : D → R by

    G(p) := E_{j←p}[s(p; j)] = Σ_{j=1}^n p(j) s(p; j).

We will henceforth assume that s is continuous; to our knowledge, this is the case for all frequently-used proper scoring rules. This is equivalent to assuming that G is differentiable, or that G is continuously differentiable (see Proposition C.1 for a proof of this equivalence).

Proposition 3.1 ([GR07]). Given a proper scoring rule s, its expected reward function G is strictly convex, and s can be re-derived from G via the formula

    s(p; j) = G(p) + ⟨g(p), δ_j − p⟩,   (3)

where g = ∇G (i.e. the gradient of G). Conversely, given a differentiable, strictly convex function G, the function s defined by Equation 3 is a proper scoring rule. (The statement in [GR07] is slightly more complicated because they consider scoring rules with ±∞ in their range, as well as discontinuous scoring rules.)

The important intuition to keep in mind for Equation 3 is that the score of an expert who reports p is determined by drawing the tangent plane to G at p; the value of this plane at δ_j, where j is the outcome that happens, is the expert's score.

We refer to g as the exposure function of s. We borrow this term from finance, where exposure refers to how much an agent stands to gain or lose from various possible outcomes — informally speaking, how much the agent cares about which outcome will happen. If we view G(p) − ⟨g(p), p⟩ as the agent's "baseline profit," then the j-th component of g(p) is the amount that the agent stands to gain (or lose) on top of the baseline profit if outcome j happens.

We give a geometric intuition for Proposition 3.1 to help explain why the properness of s corresponds to the convexity of G; this intuition will be helpful for understanding Bregman divergence (below) and the proofs in Section 4. Consider Figure 2, which depicts some G in the n = 2 outcome case, with the x-axis corresponding to the probability of Outcome 1 (see Remark 3.5 for formal details).
Suppose that the expert believes that the probability of Outcome 1 is 0.7. If the expert reports p = 0.7, then the expert's rewards in the cases of Outcome 1 and Outcome 2 are the y-values of the rightmost and leftmost points on the red line, respectively. Thus, in expectation, the expert's reward is the y-value of the red point. If instead the expert lies and reports p = 0.4, the y-values of the rightmost and leftmost points on the blue line represent the expert's rewards in the cases of Outcome 1 and Outcome 2, respectively. In this case, since Outcome 1 is still 70% likely, the expert's expected reward is the y-value of the blue point. Because G is strictly convex, the blue point is strictly below the red point; that is, the expert is strictly better off reporting p = 0.7. This argument holds in full generality: for any strictly convex function G in any number of dimensions. (Our standing continuity assumption on s is partially for ease of exposition, though e.g. our axiomatization in Section 6 depends on it.)

Figure 2: A convex function G, with tangent lines drawn at x = 0.7 and x = 0.4. If an expert believes that the probability of an event is 0.7, their expected score if they report p = 0.7 is the y-value of the red point; if they instead report p = 0.4, their expected score is the y-value of the blue point. Because G is convex, the red point is guaranteed to be above the blue point, so the expert is incentivized to be truthful.

We will find
Bregman divergence to be a useful concept for some of our proofs.
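Equation 3 can be checked numerically before we proceed. In this sketch (the names `s_from_G`, `G_quad`, and `g_quad` are ours, and outcomes are 0-indexed), re-deriving the score from the quadratic rule's Savage representation reproduces s_quad(p; j) = 2p(j) − Σ_k p(k)².

```python
def s_from_G(G, g, p, j):
    """Equation 3: s(p; j) = G(p) + <g(p), delta_j - p>."""
    n = len(p)
    return G(p) + sum(g(p)[k] * ((1.0 if k == j else 0.0) - p[k]) for k in range(n))

# Savage representation of the quadratic rule and its gradient:
G_quad = lambda p: sum(x * x for x in p)
g_quad = lambda p: [2 * x for x in p]

p = (0.7, 0.3)
print(s_from_G(G_quad, g_quad, p, 0))  # ≈ 0.82, matching 2*0.7 - (0.7**2 + 0.3**2)
print(s_from_G(G_quad, g_quad, p, 1))  # ≈ 0.02
```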
Definition 3.2 (Bregman divergence). Given a differentiable, strictly convex function G : D → R with gradient g, where D is a convex subset of R^n, and given p, q ∈ D, the Bregman divergence between p and q with respect to G is

    D_G(p ∥ q) := G(p) − G(q) − ⟨g(q), p − q⟩.

(Note that Bregman divergence is not symmetric.) A geometric interpretation of D_G(p ∥ q) is: if you draw the tangent plane to G at q, the divergence is how far below G(p) the value of that plane will be at p. For example, the distance between the red and blue points in Figure 2 is the Bregman divergence between the expert's belief (0.7, 0.3) and their report (0.4, 0.6).

Remark 3.3. If G is the expected reward function of a proper scoring rule, then D_G(p ∥ q) is the expected reward lost by reporting q when your belief is p. Put otherwise, D_G(p ∥ q) measures the "wrongness" of the report q relative to a correct answer of p.

Proposition 3.4 (Well-known facts about Bregman divergence).

• D_G(p ∥ q) ≥ 0, with equality only when p = q.

• For any q, D_G(x ∥ q) is a strictly convex function of x.

Finally, we make a note about interpreting the n = 2 outcome case in one dimension.

Remark 3.5.
Because ∆_n is (n − 1)-dimensional, it is convenient to interpret the case of n = 2 outcomes in one dimension. All probabilities in D are of the form (p, 1 − p); we map D to [0, 1] (or a subset) via the first coordinate. Thus, overloading notation, we let G(p) := G(p, 1 − p) and g(p) := G′(p) = ⟨g(p, 1 − p), (1, −1)⟩. The tangent line to G at p, e.g. as in Figure 2, will intersect the line x = 1 at s(p; 1) (i.e. the score if Outcome 1 happens) and intersect the line x = 0 at s(p; 2) (i.e. the score if Outcome 2 happens). This formulation will be helpful when discussing the two-outcome case, e.g. in Section 6.

3.2. Probabilistic opinion pooling

We now introduce the central concept of this paper: quasi-arithmetic pooling. Let s be a proper scoring rule over a forecast domain D, and let G and g be as previously defined. We denote the quasi-arithmetic (QA) pooling operator with respect to g by ⊕_g. This operator takes as input (probability, weight) pairs with weights non-negative and adding to 1, and outputs a probability (all probabilities are in D). In particular, ⊕_g is defined by

    ⊕_{g, i=1}^m (p_i, w_i) := p*, where g(p*) = Σ_{i=1}^m w_i g(p_i).   (4)

Is p* in Equation 4 well defined? That is, does it exist, and if so, is it unique? It is indeed the case that p*, if defined, is unique. This is because g cannot take on the same value at two different points, as it is the gradient of a strictly convex function. As for existence: p* is guaranteed to exist if and only if s satisfies the following property.

Definition 3.6 (convex exposure). A proper scoring rule s has convex exposure if the range of its exposure function g is a convex set.

Proposition 3.7.

(a) Given a proper scoring rule s, p* as defined above is guaranteed to exist for any p_1, ..., p_m if and only if s has convex exposure.

(b) In the case of n = 2, every (continuous) proper scoring rule has convex exposure (so p* exists).

Proof. (a) is clear: the right-hand side of Equation 4 is an arbitrary convex combination of values in the range of g.
As for (b), since D is connected and g is continuous (see Proposition C.1), the range of g is connected. In the n = 2 outcome case, the range of g lies on the line {(x_1, x_2) : x_1 + x_2 = 0}, and a connected subset of a line is convex.

We now formally state the definition of quasi-arithmetic pooling:

Definition 3.8 (quasi-arithmetic pooling). Let s be a proper scoring rule with convex exposure on a forecast domain D. Given forecasts p_1, ..., p_m ∈ D with non-negative weights w_1, ..., w_m adding to 1, the quasi-arithmetic (QA) pool of these forecasts with respect to s (or with respect to its exposure function g), denoted by

    ⊕_{g, i=1}^m (p_i, w_i),

is the unique p* ∈ D such that

    g(p*) = Σ_{i=1}^m w_i g(p_i).

If the forecasts and weights are clear from context, we may simply write p* to refer to their quasi-arithmetic pool; or, if only the forecasts are clear, we may write p*_w, where w is the vector of weights.
In Section 4.2 we will discuss a natural generalization of QA pooling to proper scoring rules that do not have convex exposure. In Appendix F, we will explore for which commonly used proper scoring rules the convex exposure property holds. (The upshot is: most of them.)

We conclude with a technical note about the range of $g$ and the correct interpretation of Equation 4.

Remark 3.9.
Because the domain of $G$ is a subset of $\Delta_n$ (and thus lies in a plane that is orthogonal to the all-ones vector $\mathbf{1}_n$), its gradient function $g$ only takes on values orthogonal to $\mathbf{1}_n$. When we treat $G$ as a function of $n$ variables rather than $n - 1$ (that is, if we extend $G$ outside of the plane containing $\Delta_n$), $g$ might gain a component parallel to $\mathbf{1}_n$. However, the correct way to think of the codomain of $g$ is either as $\{x : \sum_i x_i = 0\}$, or else as $\mathbb{R}^n$ modulo translation by $\mathbf{1}_n$, which we will denote by $\mathbb{R}^n / T(\mathbf{1}_n)$. Consequently, the correct way to think of the equality in Equation 4 is as an equality in $\mathbb{R}^n / T(\mathbf{1}_n)$, rather than an equality in $\mathbb{R}^n$; this came up in Section 1.3 when we discussed the fact that QA pooling for the logarithmic scoring rule is logarithmic opinion pooling.
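As an illustration of Remark 3.9 (a sketch of ours, not the paper's), take the logarithmic rule with $n = 3$ and represent $g(p)$ by $(\log p_1, \log p_2, \log p_3)$. Shifting this representative by any multiple of the all-ones vector does not change the QA pool, because the parallel component cancels when the pooled forecast is normalized; this is the sense in which Equation 4 holds in $\mathbb{R}^n / T(\mathbf{1}_n)$:

```python
import math

def qa_pool_log(ps, ws, shift=0.0):
    # QA pool for the logarithmic rule over n outcomes, taking the exposure
    # representative g(p) = (log p_1, ..., log p_n) + shift * (1, ..., 1).
    # Averaging in g-space and inverting amounts to a weighted geometric
    # mean of the forecasts, followed by normalization.
    n = len(ps[0])
    avg = [sum(w * (math.log(p[k]) + shift) for p, w in zip(ps, ws))
           for k in range(n)]
    unnorm = [math.exp(a) for a in avg]
    total = sum(unnorm)
    return [x / total for x in unnorm]

ps = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
ws = [0.4, 0.6]
pool0 = qa_pool_log(ps, ws)           # representative with shift 0
pool7 = qa_pool_log(ps, ws, shift=7)  # representative shifted by 7 * (1, 1, 1)
```

The two calls agree to machine precision: the component of $g$ parallel to $\mathbf{1}_n$ is a common factor $e^{\text{shift}}$ that cancels in the normalization.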
4. Optimality results
Our goal is to give a formal justification for quasi-arithmetic pooling. This section gives two such justifications: one in terms of max-min optimality and one (as a corollary) in terms of minimizing the weighted average of the Bregman divergences between the pooled forecast and the experts' forecasts. We will give additional justifications in later sections.
Theorem 4.1.
Let $s$ be a proper scoring rule with convex exposure on a forecast domain $D$. Fix any forecasts $p_1, \ldots, p_m \in D$ with non-negative weights $w_1, \ldots, w_m$ adding to 1. Define
$$u(p; j) := s(p; j) - \sum_{i=1}^{m} w_i\, s(p_i; j).$$
Then the quantity $\min_j u(p; j)$ is uniquely maximized by setting $p$ to $p^* := \bigoplus_{i=1}^{m}{}^{g} (p_i, w_i)$. Furthermore, $u(p^*; j)$ is the same for all $j$. This quantity is non-negative, and is positive unless all reports $p_i$ with positive weights are equal.

Footnote: Formally, consider the change of coordinates given by $z_j = x_n - x_j$ for $j \le n - 1$ and $z_n = \sum_j x_j$, so that the domain of $G$ lies in the plane $z_n = 1$. Then for $j \le n - 1$, $\partial G / \partial z_j$ at a given point in the domain of $G$ does not change if 1 is substituted for $z_n$; only $\partial G / \partial z_n$ changes (to zero). Equivalently, in terms of our original coordinates, the change that $g$ undergoes when we consider $G$ to be a function only defined on $D$ instead of $\mathbb{R}^n$ is precisely a projection of $g$ onto $H_n(0)$.

One interpretation for this theorem statement is as follows. Consider an agent who is tasked with submitting a forecast, and who will be paid according to $s$. The agent decides
to sub-contract $m$ experts to get their opinions, paying expert $i$ the amount $w_i s(p_i; j)$ if the expert reports $p_i$ and outcome $j$ happens. (Perhaps experts whom the agent trusts more have higher $w_i$'s.) Finally, the agent reports some (any) forecast $p$. Then $u(p; j)$ is precisely the agent's profit (utility).

The quantity $\min_j u(p; j)$ is the agent's minimum possible profit over all outcomes. It is natural to ask which report $p$ maximizes this quantity. Theorem 4.1 states that this maximum is achieved by the QA pool of the experts' forecasts with respect to $s$, and that this is the unique maximizer.

A possible geometric intuition to keep in mind for the proof (below): for each expert $i$, draw the plane tangent to $G$ at $p_i$. For any $j$, the value of this plane at $\delta_j$ is $s(p_i; j)$. Now take the weighted average of all $m$ planes; this is a new plane whose intersection with any $\delta_j$ is the total reward received by the experts if $j$ happens. Since $G$ is convex, this plane lies below $G$. To figure out which point maximizes the agent's guaranteed profit, push the plane upward until it hits $G$. It will hit $G$ at $p^*$, and the agent's profit will be the vertical distance that the plane was pushed.

Proof of Theorem 4.1.
We first show that $u(p^*; j)$ is the same for all $j$. We have
$$\begin{aligned}
u(p^*; j) &= s(p^*; j) - \sum_i w_i\, s(p_i; j) \\
&= G(p^*) + \langle g(p^*), \delta_j - p^* \rangle - \sum_i w_i \big( G(p_i) + \langle g(p_i), \delta_j - p_i \rangle \big) \\
&= G(p^*) - \sum_i w_i G(p_i) + \Big\langle \sum_i w_i\, g(p_i),\, \delta_j - p^* \Big\rangle - \sum_i w_i \langle g(p_i), \delta_j - p_i \rangle \\
&= G(p^*) - \sum_i w_i G(p_i) + \sum_i w_i \langle g(p_i), p_i - p^* \rangle,
\end{aligned}$$
which indeed does not depend on $j$. We can in fact rewrite this expression as a weighted sum of Bregman divergences:
$$u(p^*; j) = \sum_i w_i\, D_G(p^* \,\|\, p_i).$$
It follows that $u(p^*; j)$ is non-negative (see Proposition 3.4), and positive except when all $p_i$'s with positive weights are equal.

Finally, we show that $p = p^*$ maximizes $\min_j u(p; j)$. Suppose that for some report $q$ we have $\min_j u(q; j) \ge \min_j u(p^*; j)$. Then $u(q; j) \ge u(p^*; j)$ for every $j$, since $u(p^*; j)$ is the same for every $j$. But this means that $s(q; j) \ge s(p^*; j)$ for every $j$, since the $\sum_i w_i s(p_i; j)$ term in the definition of $u(p; j)$ does not depend on $p$. But then $q = p^*$: under the belief $p^*$, the report $q$ would score at least as well in expectation as the report $p^*$, and since $s$ is (strictly) proper this is possible only if $q = p^*$.

Remark 4.2.
We can reformulate Theorem 4.1 as follows: suppose that an agent has access to forecasts $p_1, \ldots, p_m$ and needs to issue a forecast, for which the agent will be rewarded using a proper scoring rule $s$ with convex exposure. The agent can improve upon selecting an expert at random according to weights $w_1, \ldots, w_m$, no matter the outcome $j$, by reporting $p^*$. This improvement is the same no matter the outcome, and is a strict improvement unless all forecasts with positive weights are the same.

4.2. QA pooling as a Bregman divergence minimizing method

The quantity $\sum_i w_i\, D_G(p \,\|\, p_i)$ that came up in the proof of Theorem 4.1 is a natural quantity to consider, as it is a measure of how far away, overall, $p$ is from the expert reports $p_i$. In fact, it is a natural quantity to minimize if we care about aggregation. This brings us to our second formal justification of QA pooling.

Proposition 4.3.
Given a proper scoring rule $s$ with convex exposure and reports $p_1, \ldots, p_m$ with weights $w_1, \ldots, w_m$, the quantity
$$d(x) := \sum_i w_i\, D_G(x \,\|\, p_i)$$
is uniquely minimized at $x = p^*$.

This fact makes sense in light of the geometric intuition we described for Theorem 4.1. In these terms, Proposition 4.3 states that $p^*$ is the point at which $G$ is closest (in vertical distance) to the average of the experts' planes. Formally:

Proof.
Since Bregman divergence is strictly convex in its first argument (see Proposition 3.4), $d(x)$ is strictly convex. This means that if there is a point $p \in D$ where $\nabla d(p) = 0$, then $p$ is the unique minimizer of $d$. Now, we have
$$\nabla d(x) = \sum_i w_i\, \nabla_x D_G(x \,\|\, p_i) = \sum_i w_i \big( g(x) - g(p_i) \big) = g(x) - \sum_i w_i\, g(p_i).$$
But in fact, $g(p^*) = \sum_i w_i\, g(p_i)$ by definition, so $\nabla d(p^*) = 0$. This completes the proof.

Proposition 4.3 generalizes [Abb09, Proposition 4], which showed that logarithmic pooling minimizes the weighted average of the KL divergences between the pooled forecast and the expert forecasts. KL divergence is Bregman divergence relative to negative entropy, which is precisely the function $G$ for the logarithmic scoring rule.

In light of the fact that $D_G(p \,\|\, q)$ is the expected reward lost by an expert who, believing $p$, reports $q$ (see Remark 3.3), Proposition 4.3 gives another natural interpretation of QA pooling.

Remark 4.4.
Consider a proper scoring rule $s$ with convex exposure and forecasts $p_1, \ldots, p_m$ with weights $w_1, \ldots, w_m$. The QA pool of these forecasts is the forecast $p^*$ that, if it is the correct answer (i.e. if the outcome is drawn according to $p^*$), would minimize the expected loss of a randomly chosen (according to $w$) expert relative to reporting $p^*$.

In this sense, QA pooling reflects a compromise between experts: it is the probability that, if it were correct, would make the experts' forecasts least wrong overall. This motivates a generalization of QA pooling to non-convex exposure scoring rules that uses the formulation of $p^*$ in terms of minimizing Bregman divergence.

Definition 4.5 (generalized quasi-arithmetic pooling). Let $s$ be a proper scoring rule on a closed forecast domain $D$. Given forecasts $p_1, \ldots, p_m \in D$ with non-negative weights $w_1, \ldots, w_m$ adding to 1, the generalized quasi-arithmetic pool of these forecasts with respect to $s$ (or $g$) is the unique $p^* \in D$ minimizing
$$d(x) := \sum_i w_i\, D_G(x \,\|\, p_i).$$

To check that Definition 4.5 is well defined, we need to check that $p^*$ exists and is unique. Existence follows from the fact that $D$ is (by assumption) closed, and therefore compact (being a bounded subset of $\Delta_n$), and that a continuous function on a compact domain achieves its minimum. Uniqueness follows from the fact that a strictly convex function on a convex set has at most one minimum. (As part of proving Proposition 4.3, we showed that $d$ is strictly convex.)

Remark 4.6.
Since $d$ is convex, minimizing it is a matter of convex optimization. In particular, if $G$ is bounded, given oracle access to $g$, the ellipsoid method can be used to efficiently find the generalized QA pool of a list of forecasts.

Remark 4.7.
It is natural to ask which pooling method minimizes the Bregman divergence going the other way, i.e. $\sum_i w_i\, D_G(p_i \,\|\, x)$. The answer is linear pooling [Ban+05, Proposition 1]. This makes sense, because this minimization question asks: if an expert is selected at random to be "correct" according to the weights, what is the overall probability of any event $j$? The answer is achieved by linear pooling.
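The optimality results of this section are easy to check numerically. The sketch below (our illustration, with hypothetical helper names) instantiates Theorem 4.1 with the two-outcome quadratic score, whose QA pool is the linear pool, and Proposition 4.3 and Remark 4.7 with the logarithmic score, whose Bregman divergence $D_G$ is KL divergence:

```python
import math

ps, ws = [0.10, 0.25, 0.70], [0.2, 0.3, 0.5]
grid = [k / 1000 for k in range(1, 1000)]

# Theorem 4.1 with the quadratic score s(p; 1) = 2p - (p^2 + (1-p)^2).
def s(p, j):
    pj = p if j == 1 else 1 - p
    return 2 * pj - (p * p + (1 - p) * (1 - p))

def u(p, j):
    return s(p, j) - sum(w * s(pi, j) for pi, w in zip(ps, ws))

p_star = sum(w * p for p, w in zip(ps, ws))      # QA pool = linear pool here
assert abs(u(p_star, 1) - u(p_star, 2)) < 1e-12  # u(p*; j) equal across outcomes
best = max(grid, key=lambda p: min(u(p, 1), u(p, 2)))  # max-min over a fine grid
assert abs(best - p_star) < 1e-6

# Proposition 4.3 and Remark 4.7 with the logarithmic score (D_G = KL).
def kl(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

logit = lambda p: math.log(p / (1 - p))
log_pool = 1 / (1 + math.exp(-sum(w * logit(p) for p, w in zip(ps, ws))))

fwd = min(grid, key=lambda x: sum(w * kl(x, p) for p, w in zip(ps, ws)))
rev = min(grid, key=lambda x: sum(w * kl(p, x) for p, w in zip(ps, ws)))
assert abs(fwd - log_pool) < 1e-3  # forward direction: the QA (logarithmic) pool
assert abs(rev - p_star) < 1e-3    # reverse direction: the linear pool
```

The grid search stands in for the convex optimization of Remark 4.6; it suffices here because both objectives are convex in one variable.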
5. Convex losses and learning expert weights
Thus far when discussing QA pooling, we have regarded expert weights as given. Where do these weights come from? As we will show in this section, these weights can be learned from experience. The key observation is the following theorem, which states that an agent's score is a concave function of the weights it uses for the experts.
Theorem 5.1.
Let $s$ be a proper scoring rule with convex exposure on a forecast domain $D$, and fix any $p_1, \ldots, p_m \in D$. Given a weight vector $w = (w_1, \ldots, w_m) \in \Delta_m$, define the weight-score of $w$ for an outcome $j$ as
$$\mathrm{WS}_j(w) := s\Big( \bigoplus_{i=1}^{m}{}^{g} (p_i, w_i);\, j \Big).$$
Then for every $j \in [n]$, $\mathrm{WS}_j(w)$ is a concave function of $w$.

We defer the proof to Appendix D, but the basic idea is that for any two weight vectors $v$ and $w$ and any $c \in [0, 1]$, the quantity
$$\mathrm{WS}_j(c v + (1 - c) w) - c\, \mathrm{WS}_j(v) - (1 - c)\, \mathrm{WS}_j(w)$$
can be expressed as a sum of Bregman divergences, and is therefore non-negative.

Footnote: Note that this is a weaker condition than $s$ being bounded; e.g. $G_{\log}$ is the negative entropy function, which is bounded.

Remark 5.2. Theorem 5.1 can be stated in more generality: $s$ need not have convex exposure; it suffices to have that for the particular $p_1, \ldots, p_m$, the QA pool of these forecasts exists for every weight vector.

Remark 5.3.
Theorem 5.1 would not hold if the pooling operator in the definition of WS were replaced by linear pooling or by logarithmic pooling. This is an advantage of QA pooling over using the linear or logarithmic method irrespective of the scoring rule.

Beyond Theorem 5.1's instrumental use for no-regret online learning of expert weights (Theorem 5.5 below), the result is interesting in its own right. For example, the following fact (loosely speaking, that QA pooling cannot benefit from weight randomization) follows as a corollary. (Recall the definition of $p^*_w$ from Definition 3.8.)

Corollary 5.4.
Consider a randomized algorithm $A$ with the following specifications:

• Input: a proper scoring rule $s$ with convex exposure, and expert forecasts $p_1, \ldots, p_m$.
• Output: a weight vector $w \in \Delta_m$.

Consider any input $s, p_1, \ldots, p_m$, and let $\hat{w} = \mathbb{E}_A[w]$. Then for every $j$, we have $s(p^*_{\hat{w}}; j) \ge \mathbb{E}_A[s(p^*_w; j)]$, where $p^*_x$ denotes the QA pool of $p_1, \ldots, p_m$ with weight vector $x$.

We now state the no-regret result that we have alluded to. The algorithm referenced in the statement (which is in Appendix D.2) is an application of the standard online gradient descent algorithm (see e.g. [Haz19, Theorem 3.1]) to our particular setting.
Theorem 5.5.
Let $s$ be a bounded proper scoring rule with convex exposure over a forecast domain $D$. For time steps $t = 1, \ldots, T$, an agent chooses a weight vector $w^t \in \Delta_m$. The agent then receives a score of
$$s\Big( \bigoplus_{i=1}^{m}{}^{g} (p_i^t, w_i^t);\, j^t \Big),$$
where $p_1^t, \ldots, p_m^t \in D$ and $j^t \in [n]$ are chosen adversarially. By choosing $w^t$ according to Algorithm D.3 (online gradient descent on the experts' weights), the agent achieves $O(\sqrt{T})$ regret in comparison with the best weight vector in hindsight. In particular, if $M$ is an upper bound on $\|g\|$, then for every $w^* \in \Delta_m$ we have
$$\sum_{t=1}^{T} \left( s\Big( \bigoplus_{i=1}^{m}{}^{g} (p_i^t, w_i^*);\, j^t \Big) - s\Big( \bigoplus_{i=1}^{m}{}^{g} (p_i^t, w_i^t);\, j^t \Big) \right) \le \sqrt{m}\, M \sqrt{T}.$$

Footnote: For a counterexample to logarithmic pooling (cf. Remark 5.3), one may take $n = 2$ with $s$ the quadratic scoring rule; for a counterexample to linear pooling, one may take $n = 2$ with $s$ given by $G(p_1, p_2) = \sqrt{p_1^2 + p_2^2}$ (this is known as the spherical scoring rule).

Footnote: Such an $M$ exists because $s$ is bounded by assumption, and so $g$ is also bounded (this follows from Equation 3).

We also note that this result is quite strong in that it does not merely achieve low regret compared to the best expert, but in fact compared to the best possible weighted pool of experts in hindsight. This is a substantial distinction, as it is possible for a mixture of experts to substantially outperform any individual expert.

We defer the proof to Appendix D.2. The proof amounts to applying the standard bounds for online gradient descent, though with an extra step: we use the bound $M$ on $\|g\|$ to bound the gradient of the loss as a function of expert weights.
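A toy instantiation (ours; the data-generating setup, the constant step size, and the finite-difference gradient are all simplifying assumptions, standing in for the tuned Algorithm D.3) of online gradient ascent on expert weights, for $m = 2$ experts and the quadratic score. With two experts, the simplex is the segment $w = (x, 1 - x)$, so projection is just clipping $x$ to $[0, 1]$:

```python
import random

def s(p, j):  # quadratic (Brier-style) score for two outcomes
    pj = p if j == 1 else 1 - p
    return 2 * pj - (p * p + (1 - p) * (1 - p))

def weight_score(x, p1, p2, j):
    # With the quadratic score, the QA pool is the linear pool.
    return s(x * p1 + (1 - x) * p2, j)

random.seed(0)
T, eta = 2000, 0.02
x, total, rounds = 0.5, 0.0, []
for _ in range(T):
    # Expert 1 is calibrated (the truth is Bernoulli(0.7)); expert 2 is noise.
    p1, p2 = 0.7, random.random()
    j = 1 if random.random() < 0.7 else 2
    rounds.append((p1, p2, j))
    total += weight_score(x, p1, p2, j)
    # One step of projected online gradient ascent on the weight x,
    # using a finite-difference gradient; projection is clipping to [0, 1].
    h = 1e-6
    grad = (weight_score(x + h, p1, p2, j) - weight_score(x - h, p1, p2, j)) / (2 * h)
    x = min(1.0, max(0.0, x + eta * grad))

# Cumulative score of the best fixed weight in hindsight (the comparator
# in Theorem 5.5), approximated on a grid.
best = max(sum(weight_score(x0, *r) for r in rounds)
           for x0 in [k / 100 for k in range(101)])
```

The per-round gap `(best - total) / T` stays small, consistent with the $O(\sqrt{T})$ guarantee; the concavity from Theorem 5.1 is what makes this gradient method sound.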
6. Axiomatization of QA pooling
In this section, we aim to show that the class of all quasi-arithmetic pooling operators is a natural one, by showing that these operators are precisely those which satisfy a natural set of axioms.

[Kol30] and [Nag30] independently considered the class of quasi-arithmetic means. Given an interval $I \subseteq \mathbb{R}$ and a continuous, injective function $f : I \to \mathbb{R}$, the quasi-arithmetic mean with respect to $f$, or $f$-mean, is the function $M_f$ that takes as input $x_1, \ldots, x_m \in I$ (for any $m \ge$
1) and outputs
$$M_f(x_1, \ldots, x_m) := f^{-1}\left( \frac{f(x_1) + \cdots + f(x_m)}{m} \right).$$
For example, the arithmetic mean corresponds to $f(x) = x$; the quadratic to $f(x) = x^2$; the geometric to $f(x) = \log x$; and the harmonic to $f(x) = -\frac{1}{x}$.

Kolmogorov proved that the class of quasi-arithmetic means is precisely the class of functions $M : \bigcup_{m=1}^{\infty} I^m \to I$ satisfying the following natural properties:

(1) $M(x_1, \ldots, x_m)$ is continuous and strictly increasing in each variable.
(2) $M$ is symmetric in its arguments.
(3) $M(x, x, \ldots, x) = x$.
(4) $M(x_1, \ldots, x_k, x_{k+1}, \ldots, x_m) = M(y, \ldots, y, x_{k+1}, \ldots, x_m)$, where $y := M(x_1, \ldots, x_k)$ appears $k$ times on the right-hand side. Informally, a subset of arguments to the mean function can be replaced with their mean.

Footnote: Nagumo also provided a characterization, though with slightly different properties.

The four properties listed above can be viewed as an axiomatization of quasi-arithmetic means.

Our notion of quasi-arithmetic pooling is exactly that of a quasi-arithmetic mean, except that it is more general in two ways. First, it allows for weights to accompany the arguments
to the mean. Second, we are considering quasi-arithmetic means with respect to vector-valued functions $g$. In the $n = 2$ outcome case, $g$ can be considered a scalar-valued function, since it is defined on a one-dimensional space (see Remark 3.5 for details); but in general we cannot treat $g$ as scalar-valued.

Our goal is to extend the above axiomatization of quasi-arithmetic means in these two ways: first (below) to include weights as arguments, and second (in Appendix E) to general $n$ (while still allowing arbitrary weights).

Generalizing to include weights as arguments
The objects that we will be studying in this section are ones of the form $(p, w)$, where $w \ge 0$ and $p \in D$. In this subsection, $D$ is a two-outcome forecast domain, which we will think of as a sub-interval of $[0, 1]$ identified by the probability of the first outcome (see Remark 3.5). We will fix the set $D$ for the remainder of the subsection. Our results generalize to any interval of $\mathbb{R}$ (as in Kolmogorov's work), but we focus on forecast domains since that is our application.

Definition 6.1. A weighted forecast is an element of $D \times \mathbb{R}_{>0}$: a probability and a positive weight. Given a weighted forecast $\Pi = (p, w)$ we define $\mathrm{pr}(\Pi) := p$ and $\mathrm{wt}(\Pi) := w$.

We will think of the output of pooling operators as weighted forecasts. This is a simple extension of our earlier definition of quasi-arithmetic pooling (Definition 3.8), which only output a probability.
Definition 6.2 (Quasi-arithmetic pooling with arbitrary weights ($n = 2$)). Given a continuous, strictly increasing function $g : D \to \mathbb{R}$, and weighted forecasts $\Pi_1 = (p_1, w_1), \ldots, \Pi_m = (p_m, w_m)$, define the quasi-arithmetic pool of $\Pi_1, \ldots, \Pi_m$ with respect to $g$ as
$$\bigoplus_{i=1}^{m}{}^{g} (p_i, w_i) := \left( g^{-1}\left( \frac{\sum_i w_i\, g(p_i)}{\sum_i w_i} \right),\ \sum_i w_i \right).$$

Remark 6.3.
In Definition 3.8, $g$ was the derivative of a differentiable, strictly convex function. Here, $g$ is a continuous, strictly increasing function. These are the same condition (see Proposition C.1).

In the case that $\sum_i w_i = 1$, Definition 6.2 reduces to Definition 3.8. In general, by linearly scaling the weights in Definition 6.2 to add to 1, we recover quasi-arithmetic pooling as previously defined.

The proof of the following proposition is straightforward.

Proposition 6.4.
Given two continuous, strictly increasing functions $g_1$ and $g_2$, $\oplus_{g_1}$ and $\oplus_{g_2}$ are the same if and only if $g_2 = a g_1 + b$ for some $a > 0$ and $b \in \mathbb{R}$.

We now define properties (i.e. axioms) of a pooling operator $\oplus$, such that these properties are satisfied if and only if $\oplus$ is $\oplus_g$ for some $g$. Our axiomatization will look somewhat different from Kolmogorov's, in part because we choose to define $\oplus$ as a binary operator that (if it satisfies the associativity axiom) extends to the $m$-ary case. This is a simpler domain and will simplify notation. Another difference is that in the $n = 2$ case, we are restricting $D \subseteq [0, 1]$.

Definition 6.5 (Axioms for pooling operators ($n = 2$)). For a pooling operator $\oplus$ on $D$ (i.e. a binary operator on weighted forecasts), we define the following axioms.

1. Weight additivity: $\mathrm{wt}(\Pi_1 \oplus \Pi_2) = \mathrm{wt}(\Pi_1) + \mathrm{wt}(\Pi_2)$ for every $\Pi_1, \Pi_2$.
2. Commutativity: $\Pi_1 \oplus \Pi_2 = \Pi_2 \oplus \Pi_1$ for every $\Pi_1, \Pi_2$.
3. Associativity: $\Pi_1 \oplus (\Pi_2 \oplus \Pi_3) = (\Pi_1 \oplus \Pi_2) \oplus \Pi_3$ for every $\Pi_1, \Pi_2, \Pi_3$.
4. Continuity: For every $p_1, p_2$, the quantity $\mathrm{pr}((p_1, w_1) \oplus (p_2, w_2))$ is a continuous function of $(w_1, w_2)$ on $\mathbb{R}^2_{\ge 0} \setminus \{(0, 0)\}$.
5. Idempotence: For every $\Pi_1, \Pi_2$, if $\mathrm{pr}(\Pi_1) = \mathrm{pr}(\Pi_2)$ then $\mathrm{pr}(\Pi_1 \oplus \Pi_2) = \mathrm{pr}(\Pi_1)$.
6. Monotonicity: Let $w > 0$ and let $p_1 > p_2 \in D$. Then for $x \in (0, w)$, the quantity $\mathrm{pr}((p_1, x) \oplus (p_2, w - x))$ is a strictly increasing function of $x$.

The motivation for the weight additivity axiom is that the weight of a weighted forecast can be thought of as the amount of evidence for its prediction. When pooling weighted forecasts, the weight of an individual forecast can be thought of as the strength of its vote in the aggregate.

The monotonicity axiom essentially states that if one pools two forecasts with different probabilities and a fixed total weight, then the larger the share of the weight belonging to the larger of the two probabilities, the larger the aggregate probability.

We now state this section's main result: these axioms describe the class of QA pooling operators.
Theorem 6.6.
A pooling operator is a QA pooling operator (as in Definition 6.2) with respect to some $g$ if and only if it satisfies the axioms in Definition 6.5.

We defer the proof of Theorem 6.6 to Appendix E, though we briefly summarize it here. We first show that the axioms in Definition 6.5 hold for any QA pooling operator. Weight additivity, commutativity, and idempotence are trivial; associativity is a matter of simple algebra. Continuity and monotonicity both follow from the fact that $g$ is continuous and strictly increasing, as is $g^{-1}$.

Showing that any pooling operator $\oplus$ satisfying our axioms is a QA pooling operator involves constructing a $g$ such that $\oplus = \oplus_g$. This is simplest if $D$ is a closed interval. For example, if $D = [0, 1]$, we may define $g$ as follows: $g(0) = 0$, $g(1) = 1$, and for $0 < p < 1$, $g(p) = w$ where $(1, w) \oplus (0, 1 - w) = (p, 1)$ (such a $w$ exists by the continuity axiom and is unique by the monotonicity axiom). The remainder (showing that $\oplus$ pools weighted forecasts in the same way as $\oplus_g$ as in Definition 6.2) follows by a sequence of applications of the axioms.

Footnote: We allow one weight to be 0 by defining $(p, w) \oplus (q, 0) = (q, 0) \oplus (p, w) = (p, w)$.

Footnote: As we mentioned, for an associative pooling operator $\oplus$, $\Pi_1 \oplus \Pi_2 \oplus \cdots \oplus \Pi_m$ is a well-specified quantity, even without indicating parenthesization. This lets us use the notation $\bigoplus_{i=1}^{m} \Pi_i$. This is why the statement of Theorem 6.6 makes sense despite pooling operators not being $m$-ary by default.

The case where $D$ is not closed is a little trickier. The idea there is to define $g$ on two points in the interior of $D$ and build $g$ by essentially applying the same construction as described in the previous paragraph, but with "negative weights." We describe the details of this argument in the proof of the analogous result for general $n$ (i.e. Theorem E.9).

Generalizing to higher dimensions
In Appendix E, we discuss extending our axiomatization to arbitrary values of $n$ in a way that, again, describes the class of QA pooling operators. An important challenge is extending the monotonicity axiom: what is an appropriate generalization of an increasing function in higher dimensions? We show that the notion we need is cyclical monotonicity, which we define and discuss. We then present our axiomatization (Definition E.8) and prove that the axioms represent precisely all QA pooling operators (Theorem E.9). On a high level, the proof is not dissimilar to that of Theorem 6.6, though the details are fairly different and more technical.

In conclusion, in Definition 6.5 we made a list of natural properties that a pooling operator may satisfy. Theorem 6.6 shows that the pooling operators satisfying these properties are exactly the QA pooling operators. In Appendix E, we generalize this theorem to higher dimensions, thus fully axiomatizing QA pooling. This result gives us an additional important reason to believe that QA pooling with respect to a proper scoring rule is a fundamental notion.
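As a concrete sanity check (a sketch of ours, not from the paper), the binary operator of Definition 6.2 with $D = (0, 1)$ and $g$ the logit satisfies the axioms of Definition 6.5 on sample inputs: weight additivity holds by construction, and commutativity, associativity, and idempotence can be verified numerically:

```python
import math

logit = lambda p: math.log(p / (1 - p))
expit = lambda y: 1 / (1 + math.exp(-y))

def pool(P1, P2, g=logit, g_inv=expit):
    # Binary QA pooling on weighted forecasts (Definition 6.2 with m = 2):
    # take the weight-normalized g-average of the probabilities, add the weights.
    (p1, w1), (p2, w2) = P1, P2
    p = g_inv((w1 * g(p1) + w2 * g(p2)) / (w1 + w2))
    return (p, w1 + w2)

A, B, C = (0.10, 1.0), (0.25, 2.0), (0.70, 0.5)

def close(P, Q):
    return abs(P[0] - Q[0]) < 1e-9 and abs(P[1] - Q[1]) < 1e-9

assert pool(A, B)[1] == A[1] + B[1]                      # weight additivity
assert close(pool(A, B), pool(B, A))                     # commutativity
assert close(pool(A, pool(B, C)), pool(pool(A, B), C))   # associativity
assert close(pool((0.3, 1.0), (0.3, 4.0)), (0.3, 5.0))   # idempotence
```

Replacing $g$ by $a g + b$ with $a > 0$ leaves every pooled probability unchanged, matching Proposition 6.4; and since the operator is associative, the $m$-ary pool $\bigoplus_{i=1}^m \Pi_i$ is well defined regardless of parenthesization.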
7. Conclusions and future work
We conclude with one observation and a number of suggestions for future work.

We have established and motivated a connection between proper scoring rules and opinion pooling methods. This connection is, in fact, a bijective correspondence between proper scoring rules with convex exposure (up to positive affine transformation) and pooling methods satisfying our axioms. In most of our discussion, the focus has been on one side of this correspondence: given a proper scoring rule $s$ (with convex exposure), we have shown that QA pooling with respect to $s$ satisfies many desirable properties. It is natural to consider the reverse direction as well: if a principal wishes to pool some experts' forecasts using a particular pooling method $\oplus$, does it make sense for the principal to use the corresponding scoring rule for elicitation?

We argue that choosing such a scoring rule makes sense in a context where experts may collude. If experts are rewarded with a proper scoring rule $s$ with convex exposure, they can guarantee themselves a larger total reward by all reporting the QA pool with respect to $s$ of their true beliefs; in fact, this is the collusion strategy that maximizes their minimum (over $j$) possible surplus over truthful reporting. This is essentially a restatement of Theorem 4.1, and was observed for the $n = 2$ case in [CS11]. This means that it makes sense for the principal to choose the scoring rule that will incentivize the experts to collude in the manner prescribed by $\oplus$. As such, the principal should choose the scoring rule corresponding to $\oplus$.

Footnote: This is a stronger statement because it asserts that this is the best of all collusion strategies, not just those in which all experts give the same report. This fact follows from the same techniques that we used to prove Theorem 4.1.

Of course, the principal may wish to reward the experts using a contract function for which collusion never generates a surplus.
The question of whether such a contract function exists is explored (but not resolved) in [Fre+20]. Given the close ties between this line of work and ours, it will be natural to explore whether the tools that we develop shed light on this question.

Moving on to further intriguing research directions, it is natural to expand our notion of forecast aggregation. All pooling methods that we considered satisfy what we called idempotence: pooling $p$ and $p$ gives back $p$. This is a natural assumption, but may be undesirable in some Bayesian setups. In a situation where two experts use different evidence to arrive at the same small probability of an outcome, it may be sensible for the aggregate probability to be even smaller. It would be interesting to explore whether our results and techniques are applicable to notions of pooling that do not satisfy idempotence, or to other (perhaps Bayesian) settings.

Finally, we present some concrete future directions for potential work:

• As we discussed in Section 2, there is a fair amount of work on aggregating forecasts with prediction markets, often ones that are based on proper scoring rules. Is there a natural trading-based interpretation of QA pooling?

• Definition 4.5 gave a natural generalization of QA pooling to proper scoring rules that do not have convex exposure. There is another potentially natural generalization, which is to define the QA pool as the forecast $p^*$ maximizing $\min_j u(p^*; j)$ (as defined in Theorem 4.1). Is this generalization equivalent to Definition 4.5? If not, how does it behave?

• Although our proof of Theorem 5.1 (that the score of the QA pool is concave in the experts' weights) relies on the convex exposure property, we have not ruled out the possibility that the result holds even without this assumption (with QA pooling defined as in Definition 4.5). Is this the case? Even if not, are no-regret algorithms for learning weights still possible?
• Our no-regret algorithm for learning weights relies on $s$ being bounded, because this allows us to place a concrete upper bound on $\|\nabla L^t(\cdot)\|$. Intuitively it seems unlikely that no-regret compared to the best weight vector in hindsight can be achieved if we cannot place a bound on the loss function; is this so? Are there natural restrictions to the model (i.e. ways to make it less than fully adversarial) under which a no-regret algorithm would be possible?

• We have presented a list of axioms characterizing QA pooling operators. Is there an alternative axiomatization that uses equally natural but fewer axioms?

References

[Abb09] Ali E. Abbas. "A Kullback-Leibler View of Linear and Log-Linear Pools". In:
Decision Analysis 6.1 (2009), pp. 25–37. url: https://ideas.repec.org/a/inm/ordeca/v6y2009i1p25-37.html.

[ABS18] Itai Arieli, Yakov Babichenko, and Rann Smorodinsky. "Robust forecast aggregation". In: Proceedings of the National Academy of Sciences (2018). issn: 0027-8424.

[ACR12] D. Allard, A. Comunian, and Philippe Renard. "Probability aggregation methods in geoscience". In: Mathematical Geosciences (2012).

[Acz48] J. Aczél. "On mean values". In: Bull. Amer. Math. Soc. (1948). url: https://projecteuclid.org:443/euclid.bams/1183511892.

[Ada14] M. Adamčík. "Collective reasoning under uncertainty and inconsistency". PhD thesis. University of Manchester, 2014.

[AK14] Aaron Archer and Robert Kleinberg. "Truthful germs are contagious: A local-to-global characterization of truthfulness". In: Games and Economic Behavior 86 (2014), pp. 340–366. url: https://EconPapers.repec.org/RePEc:eee:gamebe:v:86:y:2014:i:c:p:340-366.

[Ash+10] Itai Ashlagi, Mark Braverman, Avinatan Hassidim, and Dov Monderer. "Monotonicity and Implementability". In: Econometrica (2010). doi: https://doi.org/10.3982/ECTA8882. url: https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA8882.

[AW80] J. Aczél and C. Wagner. "A Characterization of Weighted Arithmetic Means". In: SIAM Journal on Algebraic Discrete Methods (1980).

[Ban+05] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. "Clustering with Bregman Divergences". In: J. Mach. Learn. Res. 6 (2005). url: http://jmlr.org/papers/v6/banerjee05b.html.

[BB20] Shalev Ben-David and Eric Blais. "A New Minimax Theorem for Randomized Algorithms". In: CoRR abs/2002.10802 (2020). url: https://arxiv.org/abs/2002.10802.

[BC11] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 1st ed. Springer, 2011. isbn: 1441994661.

[Bri50] G. W. Brier. "Verification of forecasts expressed in terms of probability". In: Monthly Weather Review 78 (1950), pp. 1–3.

[Car16] Arthur Carvalho. "An Overview of Applications of Proper Scoring Rules". In: Decision Analysis 13 (Nov. 2016).

[CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Jan. 2006. isbn: 978-0-521-84108-5.

[Che+14] Yiling Chen, Nikhil R. Devanur, David M. Pennock, and Jennifer Wortman Vaughan. "Removing arbitrage from wagering mechanisms". In: ACM Conference on Economics and Computation, EC '14, Stanford, CA, USA, June 8-12, 2014. Ed. by Moshe Babaioff, Vincent Conitzer, and David A. Easley. ACM, 2014, pp. 377–394.

[CL13] Arthur Carvalho and Kate Larson. "A Consensual Linear Opinion Pool". In: IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013. Ed. by Francesca Rossi. IJCAI/AAAI, 2013, pp. 2518–2524.

[CP07] Yiling Chen and David M. Pennock. "A Utility Framework for Bounded-Loss Market Makers". In: UAI 2007, Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, July 19-22, 2007. Ed. by Ronald Parr and Linda C. van der Gaag. AUAI Press, 2007, pp. 49–56. url: https://dl.acm.org/doi/abs/10.5555/3020488.3020495.

[CS11] SangIn Chun and Ross D. Shachter. "Strictly Proper Mechanisms with Cooperating Players". In: UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 14-17, 2011. Ed. by Fábio Gagliardi Cozman and Avi Pfeffer. AUAI Press, 2011, pp. 125–134.

[CV10] Yiling Chen and Jennifer Wortman Vaughan. "A new understanding of prediction markets via no-regret learning". In: Proceedings 11th ACM Conference on Electronic Commerce (EC-2010), Cambridge, Massachusetts, USA, June 7-11, 2010. Ed. by David C. Parkes, Chrysanthos Dellarocas, and Moshe Tennenholtz. ACM, 2010, pp. 189–198.

[CW07] R. T. Clemen and R. L. Winkler. "Aggregating probability distributions". In: Advances in Decision Analysis: From Foundations to Applications (Jan. 2007), pp. 154–176.

[Daw+95] A. Dawid, M. DeGroot, J. Mortera, R. Cooke, S. French, C. Genest, M. Schervish, D. Lindley, K. McConway, and R. Winkler. "Coherent combination of experts' opinions". In: Test (1995).

Probabilistic Opinion Pooling. Oct. 2014. url: http://philsci-archive.pitt.edu/11349/.

[DM14] Alexander Dawid and Monica Musio. "Theory and Applications of Proper Scoring Rules". In: METRON 72 (Jan. 2014).

[FCK15] Rafael M. Frongillo, Yiling Chen, and Ian A. Kash. "Elicitation for Aggregation". In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA. Ed. by Blai Bonet and Sven Koenig. AAAI Press, 2015, pp. 900–906.

[FES20] Christian Feldbacher-Escamilla and Gerhard Schurz. "Optimal probability aggregation based on generalized brier scoring". In: Annals of Mathematics and Artificial Intelligence 88 (July 2020).

[FK14] Rafael Frongillo and Ian Kash. "General Truthfulness Characterizations via Convex Analysis". In: Web and Internet Economics. Ed. by Tie-Yan Liu, Qi Qi, and Yinyu Ye. Cham: Springer International Publishing, 2014, pp. 354–370. isbn: 978-3-319-13129-0.

[Fre+20] Rupert Freeman, David M. Pennock, Dominik Peters, and Bo Waggoner. "Preventing Arbitrage from Collusion When Eliciting Probabilities". In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 2020, pp. 1958–1965. url: https://aaai.org/ojs/index.php/AAAI/article/view/5566.

[Gen84] Christian Genest. "A Characterization Theorem for Externally Bayesian Groups". In: Ann. Statist. (1984).

[Goo52] I. J. Good. "Rational Decisions". In: Journal of the Royal Statistical Society. Series B (Methodological) (1952). issn: 00359246.

[GR07] Tilmann Gneiting and Adrian E. Raftery. "Strictly proper scoring rules, prediction, and estimation". In: Journal of the American Statistical Association (2007).

In: Information Sciences. issn: 0020-0255. doi: https://doi.org/10.1016/j.ins.2010.08.043.

[GZ86] Christian Genest and James V. Zidek. "Combining Probability Distributions: A Critique and an Annotated Bibliography". In: Statistical Science (1986). issn: 08834237.

[Han03] Robin Hanson. "Combinatorial Information Market Design". In: Information Systems Frontiers 5.1 (2003). url: https://ideas.repec.org/a/spr/infosf/v5y2003i1d10.1023_a1022058209073.html.

[Haz19] Elad Hazan. "Introduction to Online Convex Optimization". In: CoRR abs/1909.05207 (2019). url: http://arxiv.org/abs/1909.05207.

[Kol30] A. N. Kolmogorov. Sur la notion de la moyenne. G. Bardi, tip. della R. Accad. dei Lincei, 1930. url: https://books.google.com/books?id=iUqLnQEACAAJ.

[KR08] Christian Kascha and Francesco Ravazzolo. Combining inflation density forecasts. Working Paper 2008/22. Norges Bank, Dec. 2008. url: https://ideas.repec.org/p/bno/worpap/2008_22.html.

[LS07] Ron Lavi and Chaitanya Swamy. "Truthful Mechanism Design for Multi-Dimensional Scheduling via Cycle Monotonicity". In: Proceedings of the 8th ACM Conference on Electronic Commerce. EC '07. San Diego, California, USA: Association for Computing Machinery, 2007, pp. 252–261. isbn: 9781595936530.
10 . 1145 /1250910.1250947 .[Nag30] Mitio Nagumo. “ ¨Uber eine Klasse der Mittelwerte”. In:
Japanese journal ofmathematics :transactions and abstracts doi :
10 . 4099 /jjm1924.7.0_71 .[Pet19] Richard Pettigrew. “Aggregating incoherent agents who disagree”. In:
Synthese
196 (July 2019). doi : .[PR00] David Poole and Adrian E. Raftery. “Inference for Deterministic SimulationModels: The Bayesian Melding Approach”. In: Journal of the American Sta-tistical Association doi :
10 . 1080 / 01621459 .2000.10474324 . url : .[Roc70a] R. T. Rockafellar. “On the maximal monotonicity of subdifferential mappings.”In: Pacific J. Math. url : https://projecteuclid.org:443/euclid.pjm/1102977253 .[Roc70b] R. Tyrrell Rockafellar. Convex Analysis . Princeton University Press, 1970. isbn :9780691015866. url : .[SAM66] Emir Shuford, Arthur Albert, and H. Edward Massengill. “Admissible probabil-ity measurement procedures”. In: Psychometrika url : https://EconPapers.repec.org/RePEc:spr:psycho:v:31:y:1966:i:2:p:125-145 .[Sat+14] Ville Satop¨a¨a, Jonathan Baron, Dean Foster, Barbara Mellers, Philip Tetlock,and Lyle Ungar. “Combining multiple probability predictions using a simplelogit model”. In: International Journal of Forecasting
30 (Apr. 2014), 344–356. doi : .[Sav71] Leonard J. Savage. “Elicitation of Personal Probabilities and Expectations”. In: Journal of the American Statistical Association issn :01621459. url : .[SY05] Michael Saks and Lan Yu. “Weak monotonicity suffices for truthfulness on con-vex domains”. In: Jan. 2005, pp. 286–293. doi : .[Tsa88] Constantino Tsallis. “Possible generalization of Boltzmann-Gibbs statistics”.In: Journal of Statistical Physics
52 (July 1988), pp. 479–487. doi : . 26Voh07] Rakesh V. Vohra. Paths, Cycles and Mechanism Design . 2007.[Wik18] Wikipedia contributors.
Mahler’s inequality — Wikipedia, The Free Encyclopedia. [Online; accessed 07-February-2021]. 2018. url: https://en.wikipedia.org/wiki/Mahler%27s_inequality.

A. Outline of appendices

• In Appendix B we give an interpretation of logarithmic opinion pooling as averaging experts’ Bayesian evidence, as we mentioned in Section 1.
• In Appendix C we prove that s is continuous if and only if G is differentiable if and only if g is continuous, as we stated in Section 3.
• In Appendix D we give the proof of Theorem 5.1 (that a QA aggregator’s score is concave in the experts’ weights), state the no-regret algorithm referenced in Theorem 5.5, and then prove that the algorithm indeed has low regret.
• In Appendix E we give the proof of Theorem 6.6, which states that the axioms in Definition 6.5 capture the class of QA pooling methods in the case of n = 2 outcomes. We then present and prove an analogous axiomatization in full generality (i.e. for arbitrary values of n).
• In Appendix F we discuss which well-known proper scoring rules satisfy the convex exposure property.
B. Details omitted from Section 1
Logarithmic pooling as averaging Bayesian evidence
We discuss for simplicity the binary outcome case, though this discussion holds in general. Suppose that an expert assigns a probability to an event X occurring by updating on some prior (50%, say, though this does not matter). Suppose that the expert receives evidence E. Bayesian updating works as follows:
\[ \frac{\Pr[X \mid E]}{\Pr[\neg X \mid E]} = \frac{\Pr[X]}{\Pr[\neg X]} \cdot \frac{\Pr[E \mid X]}{\Pr[E \mid \neg X]}. \]
That is, the expert multiplies their odds of X (i.e. the probability of X divided by the probability of $\neg X$) by the relative likelihood that E would be the case conditioned on X versus $\neg X$. Equivalently, we can take the log of both sides; this tells us that the posterior log odds of X are equal to the prior log odds plus $\log \frac{\Pr[E \mid X]}{\Pr[E \mid \neg X]}$. For every piece of evidence $E_k$ that the expert receives, they make this update (assuming that the $E_k$'s are mutually independent conditioned on X).

This means that for any $E_k$, we can view the quantity $\log \frac{\Pr[E_k \mid X]}{\Pr[E_k \mid \neg X]}$ as the strength of evidence that $E_k$ gives in favor of X. We call this the "Bayesian evidence" in favor of X given by $E_k$. The total Bayesian evidence that the expert has in favor of X (i.e. the sum of these values over all $E_k$) is the expert's log odds of X (i.e. $\log \frac{\Pr[X]}{\Pr[\neg X]}$).

Logarithmic pooling takes the average of experts' log odds of X. As we have shown, if the experts are Bayesian, this amounts to taking the average of all experts' amounts of Bayesian evidence in favor of X.

C. Details omitted from Section 3
Proposition C.1.
Given a proper scoring rule s, the following are equivalent:
(1) s is continuous.
(2) G is differentiable.
(3) G is continuously differentiable.

Proof. Any differentiable convex function is continuously differentiable [Roc70b, Theorem 25.5], so (2) implies (3).

If G is continuously differentiable, then both G and g are continuous, so s is continuous by Equation 3. Thus, (3) implies (1).

Finally, assume that s is continuous. Since $G(p) = \sum_j p(j)\, s(p; j)$, it follows that G is continuous. It follows by Equation 3 (with g taken to be a subgradient of G, as in [GR07]) that g is continuous. A convex function with a continuous subgradient is differentiable [BC11, Proposition 17.41]. This proves that (1) implies (2).

D. Details omitted from Section 5
D.1. Proof of Theorem 5.1
Theorem 5.1.
Let s be a proper scoring rule with convex exposure on a forecast domain D, and fix any $p_1, \dots, p_m \in D$. Given a weight vector $\mathbf{w} = (w_1, \dots, w_m) \in \Delta_m$, define the weight-score of $\mathbf{w}$ for an outcome j as
\[ \mathrm{WS}_j(\mathbf{w}) := s\left( \bigoplus_{i=1}^m{}^{g}\, (p_i, w_i);\ j \right). \]
Then for every $j \in [n]$, $\mathrm{WS}_j(\mathbf{w})$ is a concave function of $\mathbf{w}$.

Proof of Theorem 5.1. Let $\mathbf{v}$ and $\mathbf{w}$ be two weight vectors. We wish to show that for any $c \in [0, 1]$,
\[ \mathrm{WS}_j(c\mathbf{v} + (1-c)\mathbf{w}) - c\,\mathrm{WS}_j(\mathbf{v}) - (1-c)\,\mathrm{WS}_j(\mathbf{w}) \ge 0. \]
Recall the notation $p^*_{\mathbf{w}}$ from Definition 3.8. Note that
\[ g(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) = \sum_{i=1}^m (c v_i + (1-c) w_i)\, g(p_i) = c\, g(p^*_{\mathbf{v}}) + (1-c)\, g(p^*_{\mathbf{w}}). \tag{5} \]
We have
\begin{align*}
&\mathrm{WS}_j(c\mathbf{v} + (1-c)\mathbf{w}) - c\,\mathrm{WS}_j(\mathbf{v}) - (1-c)\,\mathrm{WS}_j(\mathbf{w}) \\
&= s(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}; j) - c\,s(p^*_{\mathbf{v}}; j) - (1-c)\,s(p^*_{\mathbf{w}}; j) \\
&= G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) + \langle g(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}),\, \delta_j - p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \rangle \\
&\quad - c\left( G(p^*_{\mathbf{v}}) + \langle g(p^*_{\mathbf{v}}),\, \delta_j - p^*_{\mathbf{v}} \rangle \right) - (1-c)\left( G(p^*_{\mathbf{w}}) + \langle g(p^*_{\mathbf{w}}),\, \delta_j - p^*_{\mathbf{w}} \rangle \right) \\
&= G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) - \langle g(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \rangle \\
&\quad - c\left( G(p^*_{\mathbf{v}}) - \langle g(p^*_{\mathbf{v}}),\, p^*_{\mathbf{v}} \rangle \right) - (1-c)\left( G(p^*_{\mathbf{w}}) - \langle g(p^*_{\mathbf{w}}),\, p^*_{\mathbf{w}} \rangle \right).
\end{align*}
Step 1 follows from the definition of WS. Step 2 follows from Equation 3. Step 3 follows from Equation 5, and specifically from the fact that the inner product of each side with $\delta_j$ is the same (so the $\delta_j$ terms cancel out, leaving a quantity that does not depend on j).
Continuing where we left off:
\begin{align*}
&\mathrm{WS}_j(c\mathbf{v} + (1-c)\mathbf{w}) - c\,\mathrm{WS}_j(\mathbf{v}) - (1-c)\,\mathrm{WS}_j(\mathbf{w}) \\
&= G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) - c\langle g(p^*_{\mathbf{v}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \rangle - (1-c)\langle g(p^*_{\mathbf{w}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \rangle \\
&\quad - c\left( G(p^*_{\mathbf{v}}) - \langle g(p^*_{\mathbf{v}}),\, p^*_{\mathbf{v}} \rangle \right) - (1-c)\left( G(p^*_{\mathbf{w}}) - \langle g(p^*_{\mathbf{w}}),\, p^*_{\mathbf{w}} \rangle \right) \\
&= c\left( G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) - G(p^*_{\mathbf{v}}) - \langle g(p^*_{\mathbf{v}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} - p^*_{\mathbf{v}} \rangle \right) \\
&\quad + (1-c)\left( G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}}) - G(p^*_{\mathbf{w}}) - \langle g(p^*_{\mathbf{w}}),\, p^*_{c\mathbf{v}+(1-c)\mathbf{w}} - p^*_{\mathbf{w}} \rangle \right) \\
&= c\, D_G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \,\|\, p^*_{\mathbf{v}}) + (1-c)\, D_G(p^*_{c\mathbf{v}+(1-c)\mathbf{w}} \,\|\, p^*_{\mathbf{w}}) \ge 0.
\end{align*}
Step 4 again follows from Equation 5. Step 5 is a rearrangement of terms. Finally, step 6 follows from the definition of Bregman divergence, and step 7 follows from the fact that Bregman divergence is always non-negative. This completes the proof.
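Theorem 5.1 can also be sanity-checked numerically. The sketch below is ours, not part of the proof: it uses the logarithmic scoring rule $s(p; j) = \log p(j)$ on two outcomes, for which QA pooling is logarithmic pooling, and checks the concavity inequality on the forecasts from the introduction. The helper names (`log_pool`, `weight_score`) are hypothetical.

```python
import math

def log_pool(ps, ws):
    # QA pooling with respect to the log scoring rule (two outcomes):
    # average the forecasts' log odds, i.e. take a renormalized
    # weighted geometric mean of the probabilities.
    num = math.prod(p ** w for p, w in zip(ps, ws))
    den = math.prod((1 - p) ** w for p, w in zip(ps, ws))
    return num / (num + den)

def weight_score(w, ps, j):
    # WS_j(w) = s(pooled forecast; j), with s(p; j) = log p(j), j in {0, 1}.
    total = sum(w)
    q = log_pool(ps, [wi / total for wi in w])
    return math.log(q if j == 1 else 1 - q)

ps = [0.10, 0.25, 0.70]  # the three model forecasts from the introduction
v = [0.8, 0.1, 0.1]
w = [0.2, 0.3, 0.5]
for j in (0, 1):
    for c in (0.0, 0.25, 0.5, 0.75, 1.0):
        mix = [c * a + (1 - c) * b for a, b in zip(v, w)]
        # Concavity: WS_j(c v + (1-c) w) >= c WS_j(v) + (1-c) WS_j(w).
        lhs = weight_score(mix, ps, j)
        rhs = c * weight_score(v, ps, j) + (1 - c) * weight_score(w, ps, j)
        assert lhs >= rhs - 1e-12
```

For this choice of s, concavity can also be seen directly: $\mathrm{WS}_1(\mathbf{w})$ is the log of a sigmoid of a linear function of the weights.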
D.2. Gradient descent algorithm for Theorem 5.5
We follow the online gradient descent algorithm as presented in [Haz19].

Algorithm D.1 ([Haz19], Algorithm 6). Input: convex set $\mathcal{K}$, convex functions $f_t$ on $\mathcal{K}$, T, $x_1 \in \mathcal{K}$, step sizes $\{\eta_t\}$.
for t = 1 to T do
  Play $x_t$ and observe cost $f_t(x_t)$.
  Update and project:
    $y_{t+1} = x_t - \eta_t \nabla f_t(x_t)$
    $x_{t+1} = \Pi_{\mathcal{K}}(y_{t+1})$
end for

Algorithm D.1 achieves the following regret guarantee.

Theorem D.2 ([Haz19], Theorem 3.1). Online gradient descent with step sizes $\{\eta_t = \frac{D}{G\sqrt{t}},\ t \in [T]\}$ guarantees the following for all $T \ge 1$:
\[ \mathrm{regret}_T = \sum_{t=1}^T f_t(x_t) - \min_{x^* \in \mathcal{K}} \sum_{t=1}^T f_t(x^*) \le \frac{3}{2}\, GD\sqrt{T}. \]
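The update-and-project loop of Algorithm D.1 can be sketched as follows for the case where $\mathcal{K}$ is the probability simplex (the setting used below). This is our illustration, not code from [Haz19]; in particular, the sort-based Euclidean projection is a standard routine that we chose for $\Pi_{\mathcal{K}}$, and the function names are ours.

```python
import numpy as np

def project_simplex(y):
    # Euclidean projection of y onto the probability simplex,
    # via the standard sort-and-threshold algorithm.
    y = np.asarray(y, dtype=float)
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, y.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(y - css[rho] / (rho + 1.0), 0.0)

def online_gradient_descent(grad_fns, x1, step_sizes):
    # Algorithm D.1: play x_t, observe the cost's gradient at x_t,
    # take a gradient step, and project back onto the feasible set.
    x = np.asarray(x1, dtype=float)
    plays = []
    for grad, eta in zip(grad_fns, step_sizes):
        plays.append(x)
        x = project_simplex(x - eta * grad(x))
    return plays
```

In the setting of Theorem 5.5, `grad` would return $\nabla L_t$ at the played weight vector and the step sizes would be those of Algorithm D.3 below.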
Here, D is an upper bound on the diameter of $\mathcal{K}$ and G is an upper bound on the Lipschitz constant of any $f_t$, i.e. $|f_t(x) - f_t(y)| \le G\|x - y\|$ for any t and any $x, y \in \mathcal{K}$.

In our setting, $\mathcal{K} = \Delta_m$ and $f_t$ is our loss function (i.e. the negative of the score) at time step t, as a function of $\mathbf{w}$; we will denote this function by $L_t$. That is, $L_t(\mathbf{w}) = -\mathrm{WS}_{j_t}(\mathbf{w})$, where WS is as in Theorem 5.1, relative to forecasts $p^t_1, \dots, p^t_m$. Our adaptation of Algorithm D.1 is as follows.

Algorithm D.3 (Online gradient descent algorithm for Theorem 5.5). We proceed as follows:
• For t ≥
1, define $\eta_t := \frac{1}{M\sqrt{mt}}$.
• Start with an arbitrary guess $\mathbf{w}^1 \in \Delta_m$.
• At each time step t from 1 to T:
  – Play $\mathbf{w}^t$ and observe loss $L_t(\mathbf{w}^t)$.
  – Let $\tilde{\mathbf{w}}^{t+1} = \mathbf{w}^t - \eta_t \nabla L_t(\mathbf{w}^t)$. If $\tilde{\mathbf{w}}^{t+1} \in \Delta_m$, let $\mathbf{w}^{t+1} = \tilde{\mathbf{w}}^{t+1}$. Otherwise, let $\mathbf{w}^{t+1}$ be the orthogonal projection of $\tilde{\mathbf{w}}^{t+1}$ onto $\Delta_m$.

We now prove that Algorithm D.3 satisfies the guarantee of Theorem 5.5.

Theorem 5.5.
Let s be a bounded proper scoring rule with convex exposure over a forecast domain D. For time steps $t = 1, \dots, T$, an agent chooses a weight vector $\mathbf{w}^t \in \Delta_m$. The agent then receives a score of
\[ s\left( \bigoplus_{i=1}^m{}^{g}\, (p^t_i, w^t_i);\ j_t \right), \]
where $p^t_1, \dots, p^t_m \in D$ and $j_t \in [n]$ are chosen adversarially. By choosing $\mathbf{w}^t$ according to Algorithm D.3 (online gradient descent on the experts' weights), the agent achieves $O(\sqrt{T})$ regret in comparison with the best weight vector in hindsight. In particular, if M is an upper bound on $\|g\|$, then for every $\mathbf{w}^* \in \Delta_m$ we have
\[ \sum_{t=1}^T \left( s\left( \bigoplus_{i=1}^m{}^{g}\, (p^t_i, w^*_i); j_t \right) - s\left( \bigoplus_{i=1}^m{}^{g}\, (p^t_i, w^t_i); j_t \right) \right) \le 3\sqrt{m}\, M\sqrt{T}. \]

Proof. In our setting, D = √
2. We claim that $B \le \sqrt{2m}\, M$, where B is an upper bound on $\|\nabla L_t(\mathbf{w})\|$. This would make our choice of $\eta_t$ match that of Algorithm D.1 and guarantee a regret of at most $\frac{3}{2}\sqrt{2}\, B\sqrt{T} \le 3\sqrt{m}\, M\sqrt{T}$. The remainder of the proof is demonstrating that indeed $B \le \sqrt{2m}\, M$.

Let L be an arbitrary loss function, i.e. $L(\mathbf{w}) = -\mathrm{WS}_j(\mathbf{w})$ for some $j, p_1, \dots, p_m$. Let $p^*(\mathbf{w}) = \bigoplus_{i=1}^m{}^{g}\, (p_i, w_i)$. We claim that
\[ \nabla L(\mathbf{w}) = \begin{pmatrix} g(p_1)^\top \\ \vdots \\ g(p_m)^\top \end{pmatrix} (p^*(\mathbf{w}) - \delta_j), \tag{6} \]
where this m-dimensional vector should be interpreted modulo translation by the all-ones vector (see Remark 3.9). To see this, observe that
\[ \nabla L(\mathbf{w}) = -\nabla \mathrm{WS}_j(\mathbf{w}) = -\nabla_{\mathbf{w}}\, s(p^*(\mathbf{w}); j) = -\nabla_{\mathbf{w}} \left( G(p^*(\mathbf{w})) + \langle g(p^*(\mathbf{w})),\, \delta_j - p^*(\mathbf{w}) \rangle \right), \]
where $\nabla_{\mathbf{w}}$ denotes the gradient with respect to change in the weight vector $\mathbf{w}$ (as opposed to change in the probability vector). Now, by the chain rule for gradients, we have
\[ \nabla_{\mathbf{w}}\, G(p^*(\mathbf{w})) = (J p^*(\mathbf{w}))^\top g(p^*(\mathbf{w})), \]
where $J p^*$ denotes the Jacobian matrix of the function $p^*(\mathbf{w})$. Also, we have
\[ g(p^*(\mathbf{w})) = \sum_{i=1}^m w_i\, g(p_i), \]
so (again by the chain rule) we have
\[ \nabla_{\mathbf{w}} \left( \langle g(p^*(\mathbf{w})),\, \delta_j - p^*(\mathbf{w}) \rangle \right) = \begin{pmatrix} g(p_1)^\top \\ \vdots \\ g(p_m)^\top \end{pmatrix} (\delta_j - p^*(\mathbf{w})) - (J p^*(\mathbf{w}))^\top g(p^*(\mathbf{w})). \]
This gives us Equation 6. Now, for any i, we have
\[ |\langle g(p_i),\, p^*(\mathbf{w}) - \delta_j \rangle| \le \|g(p_i)\| \|p^*(\mathbf{w}) - \delta_j\| \le \sqrt{2}\, M. \]
Therefore,
\[ \|\nabla L(\mathbf{w})\| \le \sqrt{m \cdot (\sqrt{2} M)^2} = \sqrt{2m}\, M, \]
completing the proof.

E. Details omitted from Section 6

E.1. Proof of Theorem 6.6
Theorem 6.6.
A pooling operator is a QA pooling operator (as in Definition 6.2) with respect to some g if and only if it satisfies the axioms in Definition 6.5.

Proof.
For this proof, we will use ⊕ (without a g subscript) to denote an arbitrary pooling operator that satisfies the axioms in Definition 6.5. We begin by noting a few important facts about weighted forecasts and pooling operators. First, we find it natural to define a notion of multiplying a weighted forecast pair by a positive constant.

Definition E.1.
Given a weighted forecast
$\Pi = (p, w)$ and $c > 0$, define $c\Pi := (p, cw)$. Note that $m\Pi = \bigoplus_{i=1}^m \Pi$ for any positive integer m, by idempotence; this definition is a natural extension to all $c > 0$. We note the following (quite obvious) fact.
Proposition E.2.
For every weighted forecast $\Pi$ and $c_1, c_2 > 0$, we have $c_1(c_2\Pi) = (c_1 c_2)\Pi$.

A natural property that is not listed in Definition 6.5 is scale invariance, i.e. that $\mathrm{pr}((p_1, w_1) \oplus (p_2, w_2)) = \mathrm{pr}((p_1, cw_1) \oplus (p_2, cw_2))$ for any positive c; or, equivalently, that $c(\Pi_1 \oplus \Pi_2) = c\Pi_1 \oplus c\Pi_2$. This in fact follows from the listed axioms.

Proposition E.3 (Distributive property/scale invariance). For every $\Pi_1, \Pi_2$ and any operator ⊕ satisfying the axioms in Definition 6.5, we have $c(\Pi_1 \oplus \Pi_2) = c\Pi_1 \oplus c\Pi_2$.

Proof. First suppose c is an integer. Then
\[ c\Pi_1 \oplus c\Pi_2 = \bigoplus_{i=1}^c \Pi_1 \oplus \bigoplus_{i=1}^c \Pi_2 = \bigoplus_{i=1}^c (\Pi_1 \oplus \Pi_2) = c(\Pi_1 \oplus \Pi_2). \]
Here, the first and last steps follow by weight additivity and idempotence. Now suppose that $c = \frac{k}{\ell}$ is a rational number. Let $\Pi'_1 = \frac{1}{\ell}\Pi_1$ and $\Pi'_2 = \frac{1}{\ell}\Pi_2$. We have
\[ \frac{k}{\ell}(\Pi_1 \oplus \Pi_2) = \frac{k}{\ell}(\ell\Pi'_1 \oplus \ell\Pi'_2) = \frac{k}{\ell} \cdot \ell(\Pi'_1 \oplus \Pi'_2) = k(\Pi'_1 \oplus \Pi'_2) = k\Pi'_1 \oplus k\Pi'_2 = \frac{k}{\ell}\Pi_1 \oplus \frac{k}{\ell}\Pi_2. \]
Here, the second and second-to-last steps follow from the fact that the distributive property holds for integers.

Finally, we make use of the continuity axiom to extend our proof to all positive real numbers c. In particular, it suffices to show that $\mathrm{pr}(\Pi_1 \oplus \Pi_2) = \mathrm{pr}(c\Pi_1 \oplus c\Pi_2)$. Let p be the former quantity; note that $\mathrm{pr}(r\Pi_1 \oplus r\Pi_2) = p$ for positive rational numbers r. Since the rationals are dense among the reals, it follows that for every $\epsilon > 0$, we have $|\mathrm{pr}(c\Pi_1 \oplus c\Pi_2) - p| \le \epsilon$. Therefore, $\mathrm{pr}(c\Pi_1 \oplus c\Pi_2) = p$. This completes the proof.

(As we mentioned, for an associative pooling operator ⊕, $\Pi_1 \oplus \Pi_2 \oplus \cdots \oplus \Pi_m$ is a well-specified quantity, even without indicating parenthesization. This lets us use the notation $\bigoplus_{i=1}^m \Pi_i$, which is why the statement of Theorem 6.6 makes sense despite pooling operators not being m-ary by default.)

We first show that $\oplus_g$ satisfies the axioms in Definition 6.5. Weight additivity, commutativity, and idempotence are trivial. Associativity is also clear: given $\Pi_1 = (p_1, w_1)$ and likewise $\Pi_2, \Pi_3$, we have
\[ g(\mathrm{pr}((\Pi_1 \oplus_g \Pi_2) \oplus_g \Pi_3)) = \frac{(w_1 + w_2)\, g(\mathrm{pr}(\Pi_1 \oplus_g \Pi_2)) + w_3\, g(p_3)}{(w_1 + w_2) + w_3} = \frac{(w_1 + w_2)\frac{w_1 g(p_1) + w_2 g(p_2)}{w_1 + w_2} + w_3\, g(p_3)}{w_1 + w_2 + w_3} = \frac{w_1 g(p_1) + w_2 g(p_2) + w_3 g(p_3)}{w_1 + w_2 + w_3} \]
and likewise for $g(\mathrm{pr}(\Pi_1 \oplus_g (\Pi_2 \oplus_g \Pi_3)))$, so $\mathrm{pr}((\Pi_1 \oplus_g \Pi_2) \oplus_g \Pi_3) = \mathrm{pr}(\Pi_1 \oplus_g (\Pi_2 \oplus_g \Pi_3))$ (since g is strictly increasing and therefore injective). The fact that the weights are also the same is trivial. Continuity follows from the fact that
\[ \mathrm{pr}(\Pi_1 \oplus_g \Pi_2) = g^{-1}\left( \frac{w_1 g(p_1) + w_2 g(p_2)}{w_1 + w_2} \right) \]
is continuous in $(w_1, w_2)$ (when $w_1, w_2$ are not both zero). Here we are using the fact that g is strictly increasing, which means that $g^{-1}$ is continuous.

Finally, regarding the monotonicity axiom, for any fixed w and $p_1 > p_2$ (as in the axiom statement), we have
\[ g(\mathrm{pr}((p_1, x) \oplus_g (p_2, w - x))) = \frac{x\, g(p_1) + (w - x)\, g(p_2)}{x + w - x} = \frac{x\, g(p_1) + (w - x)\, g(p_2)}{w}. \]
Since $p_1 > p_2$, we have $g(p_1) > g(p_2)$, so the right-hand side strictly increases with x. Since $g^{-1}$ is also strictly increasing, it follows that $\mathrm{pr}((p_1, x) \oplus_g (p_2, w - x))$ strictly increases with x.

The converse — that every pooling operator satisfying the axioms in Definition 6.5 is $\oplus_g$ for some g — works by constructing g by fixing it at two points and then determining g at all other points. We now show how to do this when the forecast domain is [0,
1] (though the technique works for any closed D); see the proof of Theorem E.9 for the argument in full generality.

Let ⊕ be a pooling operator that satisfies our axioms. Define g as follows: let g(0) = 0 and g(1) = 1. For 0 < p <
1, define g(p) = w, where w is such that $(1, w) \oplus (0, 1 - w) = (p, 1)$. (Such a w exists by continuity and the intermediate value theorem; it is unique by the "strictly" increasing stipulation of monotonicity.) Note that g is continuous and increasing by monotonicity.¹ ²

We wish to show that for any $\Pi_1 = (p_1, w_1)$ and $\Pi_2 = (p_2, w_2)$, we have that $\Pi_1 \oplus \Pi_2 = \Pi_1 \oplus_g \Pi_2$. Clearly the weight of both sides is $w_1 + w_2$, so we wish to show that the probabilities on each side are the same. We have³
\begin{align*}
\mathrm{pr}(\Pi_1 \oplus \Pi_2) &= \mathrm{pr}(w_1(p_1, 1) \oplus w_2(p_2, 1)) \\
&= \mathrm{pr}(w_1((1, g(p_1)) \oplus (0, 1 - g(p_1))) \oplus w_2((1, g(p_2)) \oplus (0, 1 - g(p_2)))) \\
&= \mathrm{pr}(w_1(1, g(p_1)) \oplus w_1(0, 1 - g(p_1)) \oplus w_2(1, g(p_2)) \oplus w_2(0, 1 - g(p_2))) \\
&= \mathrm{pr}((1, w_1 g(p_1)) \oplus (0, w_1(1 - g(p_1))) \oplus (1, w_2 g(p_2)) \oplus (0, w_2(1 - g(p_2)))) \\
&= \mathrm{pr}((1, w_1 g(p_1) + w_2 g(p_2)) \oplus (0, w_1(1 - g(p_1)) + w_2(1 - g(p_2)))) \\
&= \mathrm{pr}\left( \frac{1}{w_1 + w_2}\left( (1, w_1 g(p_1) + w_2 g(p_2)) \oplus (0, w_1(1 - g(p_1)) + w_2(1 - g(p_2))) \right) \right) \\
&= \mathrm{pr}\left( \left(1, \frac{w_1 g(p_1) + w_2 g(p_2)}{w_1 + w_2}\right) \oplus \left(0, \frac{w_1(1 - g(p_1)) + w_2(1 - g(p_2))}{w_1 + w_2}\right) \right),
\end{align*}
which by definition of g is equal to the probability p such that $g(p) = \frac{g(p_1) w_1 + g(p_2) w_2}{w_1 + w_2}$. That is, $\mathrm{pr}(\Pi_1 \oplus \Pi_2) = \mathrm{pr}(\Pi_1 \oplus_g \Pi_2)$.

Showing that ⊕ and $\oplus_g$ are equivalent for more than two arguments is now trivial:
\[ \bigoplus_{i=1}^m{}^{g}\, \Pi_i = \Pi_1 \oplus_g \Pi_2 \oplus_g \Pi_3 \cdots \oplus_g \Pi_m = \Pi_1 \oplus \Pi_2 \oplus_g \Pi_3 \cdots \oplus_g \Pi_m = \cdots = \bigoplus_{i=1}^m \Pi_i. \]
(Here we are implicitly using the fact that $\oplus_g$ is associative, as we proved earlier.) This completes the proof.

¹ As a matter of fact, g is strictly increasing, because it is impossible for $g(p_1)$ to equal $g(p_2)$ for $p_1 \ne p_2$, as that would mean that $(1, g(p_1)) \oplus (0, 1 - g(p_1)) = (p_1, 1) = (p_2, 1)$.
² $(1, w) \oplus (0, 1 - w)$ is continuous in w by the continuity axiom. In a sense, the continuity of g corresponds to the strictness of increase in the monotonicity axiom and the strictness of increase of g corresponds to the continuity axiom.
³ Steps 3 and 7 use the distributive property (Proposition E.3).

E.2. Generalization of our axiomatization to higher dimensions
Just as we fixed a two-outcome forecast domain D in Section 6, we now fix an n-outcome forecast domain D for any n ≥ 2. Our definition of weighted forecasts remains the same (except that now pr(Π) is a vector). Our definition of quasi-arithmetic pooling, however, needs to change to make g vector-valued. This raises the question: what is the analogue of "increasing" for vector-valued functions? It turns out that the relevant notion for us is cyclical monotonicity, introduced by Rockafellar [Roc70a] (see also [Roc70b]). Define $H_n(c) := \{x \in \mathbb{R}^n : \sum_i x_i = c\}$. Recall from Remark 3.9 that the range of the gradient of a function defined on D is a subset of $H_n(0)$.

Definition E.4 (Quasi-arithmetic pooling with arbitrary weights). Given a continuous, strictly cyclically monotone vector-valued function $g: D \to H_n(0)$ whose range is a convex set, and weighted forecasts $\Pi_1 = (p_1, w_1), \dots, \Pi_m = (p_m, w_m)$, define the quasi-arithmetic pool of $\Pi_1, \dots, \Pi_m$ with respect to g as
\[ \bigoplus_{i=1}^m{}^{g}\, (p_i, w_i) := \left( g^{-1}\left( \frac{\sum_i w_i\, g(p_i)}{\sum_i w_i} \right),\ \sum_i w_i \right). \]

Definition E.5 (Cyclical monotonicity). A function $g: U \subseteq \mathbb{R}^n \to \mathbb{R}^n$ is cyclically monotone if for every list of points $x_1, x_2, \dots, x_{k-1}, x_k = x_0 \in U$, we have
\[ \sum_{i=1}^k \langle g(x_i),\, x_i - x_{i-1} \rangle \ge 0. \]
We also say that g is strictly cyclically monotone if the inequality is strict except when $x_0 = \cdots = x_{k-1}$.

To gain an intuition for this notion, consider the case of k = 2; then this condition says that $\langle g(x_1) - g(x_0),\, x_1 - x_0 \rangle \ge 0$. In other words, the change in g from $x_0$ to $x_1$ is in the same general direction as the direction from $x_0$ to $x_1$. This property is called two-cycle (or weak) monotonicity.

Cyclical monotonicity is a stronger notion, which may be familiar to the reader for its applications in mechanism design and revealed preference theory; see e.g. [LS07], [Ash+10], [FK14], [Voh07]. Two-cycle monotonicity is equivalent to cyclical monotonicity when the range of g is finite [SY05]. However, cyclical monotonicity is substantially stronger than two-cycle monotonicity when the range of g is infinitely large, as in our setting. In fact, the difference between these two conditions is that a two-cycle monotone function is cyclically monotone if and only if it is also vortex-free [AK14, Theorem 3.9]. Vortex-freeness means that the path integral of g along any triangle vanishes. See [AK14] for a detailed comparison of these two notions.

The immediately relevant fact for us is that cyclically monotone functions are gradients of convex functions (and vice versa). Speaking more precisely:

Theorem E.6.
A vector-valued function g is continuous and strictly cyclically monotone if and only if it is the gradient of a differentiable, strictly convex function G.

Proof of Theorem E.6. Per a theorem of Rockafellar ([Roc70a]; see also Theorem 24.8 in [Roc70b]), a function g is cyclically monotone if and only if it is a subgradient of a convex function G. The proof of this fact shows just as easily that a function is strictly cyclically monotone if and only if it is a subgradient of a strictly convex function.

Consider a differentiable, strictly convex function G. Its gradient is continuous (see Proposition C.1). Conversely, consider a continuous, strictly cyclically monotone vector-valued function g. As we just discussed, it is a subgradient of some strictly convex function G. A convex function with a continuous subgradient is differentiable [BC11, Proposition 17.41].

This means that the conditions on g in Definition E.4 are precisely those necessary to let g be any function that it could be in our original definition of quasi-arithmetic pooling (Definition 3.8). Our new definition is thus equivalent to the old one (after normalizing weights to add to 1).

We now discuss our axioms for pooling operators, which will again capture the class of QA pooling operators. We keep the weight additivity, commutativity, associativity, and idempotence axioms verbatim from our discussion of the n = 2 case. We slightly strengthen the continuity axiom (see below).

We also add a new axiom, subtraction, which states that if $\Pi_1 \oplus \Pi = \Pi_2 \oplus \Pi$ then $\Pi_1 = \Pi_2$. Subtraction in the n = 2 case follows from monotonicity; in this case, however, the subtraction axiom will help us state the monotonicity axiom. In particular, it allows us to make the following definition, which essentially extends the notion of pooling to allow for negative weights.

Definition E.7.
Let ⊕ be a pooling operator satisfying weight additivity, commutativity, associativity, and subtraction. Fix $p_1, \dots, p_k \in D$. Define a function $p: \Delta_k \to D$ (with $p_1, \dots, p_k$ serving as implicit arguments) by
\[ p(w_1, \dots, w_k) = \mathrm{pr}\left( \bigoplus_{i=1}^k (p_i, w_i) \right). \]
We extend the definition of p to a partial function on $H_k(1)$, as follows: given input $(w_1, \dots, w_k)$, let $S \subseteq [k]$ be the set of indices i such that $w_i < 0$ and $T \subseteq [k]$ be the set of indices i such that $w_i > 0$. We define $p(w_1, \dots, w_k)$ to be the $q \in D$ such that
\[ (q, 1) \oplus \left( \bigoplus_{i \in S} (p_i, -w_i) \right) = \bigoplus_{i \in T} (p_i, w_i). \]
Note that q is not guaranteed to exist, which is why we call p a partial function. However, if q exists then it is unique, by the subtraction axiom.

We can now state the full axiomatization, including the monotonicity axiom.
Definition E.8 (Axioms for pooling operators). For a pooling operator ⊕ on D, we define the following axioms.
1. Weight additivity: $\mathrm{wt}(\Pi_1 \oplus \Pi_2) = \mathrm{wt}(\Pi_1) + \mathrm{wt}(\Pi_2)$ for every $\Pi_1, \Pi_2$.
2. Commutativity: $\Pi_1 \oplus \Pi_2 = \Pi_2 \oplus \Pi_1$ for every $\Pi_1, \Pi_2$.
3. Associativity: $\Pi_1 \oplus (\Pi_2 \oplus \Pi_3) = (\Pi_1 \oplus \Pi_2) \oplus \Pi_3$ for every $\Pi_1, \Pi_2, \Pi_3$.
4. Continuity: For every positive integer k and $p_1, \dots, p_k$, the quantity $\mathrm{pr}\left( \bigoplus_{i=1}^k (p_i, w_i) \right)$ is a continuous function of $(w_1, \dots, w_k)$ on $\mathbb{R}^k_{\ge 0} \setminus \{0\}$.
5. Idempotence: For every $\Pi_1$ and $\Pi_2$, if $\mathrm{pr}(\Pi_1) = \mathrm{pr}(\Pi_2)$ then $\mathrm{pr}(\Pi_1 \oplus \Pi_2) = \mathrm{pr}(\Pi_1)$.
6. Subtraction: If $\Pi_1 \oplus \Pi = \Pi_2 \oplus \Pi$ then $\Pi_1 = \Pi_2$.
7. Monotonicity: There exist vectors $p_1, \dots, p_n \in D$ such that p (as in Definition E.7) is a strictly cyclically monotone function from its domain to $\mathbb{R}^n$.

(The continuity axiom is only well-defined conditioned on ⊕ being associative, which is fine for our purposes. We allow a proper subset of weights to be zero by defining the aggregate to ignore forecasts with weight zero.)

The monotonicity axiom asserts the existence of n "anchor points" in D such that the function p from weight vectors to D that pools the anchor points with the weights given as input obeys a notion of monotonicity (namely cyclical monotonicity). Informally, this means that the vector of weights that one would need to give to the anchor points in order to arrive at a forecast p "correlates" with the forecast p itself.

We now state the main theorem of our axiomatization.

Theorem E.9.
A pooling operator is a QA pooling operator (as in Definition E.4) with respect to some g if and only if it satisfies the axioms in Definition E.8.

Proof. We begin by noting the following fact, which follows from results in [Roc70b].

Proposition E.10.
A strictly cyclically monotone function $g: D \to \mathbb{R}^n$ is injective, and its inverse $g^{-1}$ is strictly cyclically monotone and continuous.

We provide a partial proof below; it relies on the following observation.
Remark E.11.
We can instead write the condition as
\[ \sum_{i=1}^k (g(x_i) - g(x_{i-1})) \cdot x_i \ge 0. \]
This is equivalent to the condition in Definition E.5, because it is the same statement (with rearranged terms) when the $x_i$'s are listed in reverse order.

Proof.
First, suppose that g(x) = g(y). Then
\[ \langle g(x),\, x - y \rangle + \langle g(y),\, y - x \rangle = 0. \]
Since g is strictly cyclically monotone, this implies that x = y. (Note that we only use two-cycle monotonicity.)

We now show that $g^{-1}$ is strictly cyclically monotone. That is, we wish to show that
\[ \sum_{i=1}^k x_i \cdot (g^{-1}(x_i) - g^{-1}(x_{i-1})) > 0 \]
for $x_1, \dots, x_k = x_0$ that are not all the same. (See Remark E.11.) By the strict cyclical monotonicity of g, we have that
\[ \sum_{i=1}^k g(p_i) \cdot (p_i - p_{i-1}) > 0, \]
where $p_i := g^{-1}(x_i)$. (Here we are using the injectivity of g: if $x_i \ne x_j$ then $p_i \ne p_j$.) This means that
\[ \sum_{i=1}^k x_i \cdot (g^{-1}(x_i) - g^{-1}(x_{i-1})) > 0, \]
as desired. As for continuity, we defer to [Roc70b, Theorem 26.5]. (Why can't we apply this result again to $g^{-1}$ to conclude that g is continuous, even though we did not assume it to be? The reason is that the proof of continuity relies on the convexity of D; if g is discontinuous then the domain of $g^{-1}$ may not be convex, or even connected, so we cannot apply the result to $g^{-1}$.)

Back to the proof of Theorem E.9, we first prove that any such $\oplus_g$ satisfies the stated axioms. Weight additivity, commutativity, associativity, and idempotence are clear. Continuity follows from the formula
\[ \mathrm{pr}((p_1, w_1) \oplus_g (p_2, w_2)) = g^{-1}\left( \frac{w_1\, g(p_1) + w_2\, g(p_2)}{w_1 + w_2} \right), \]
noting that $g^{-1}$ is continuous by Proposition E.10. Likewise, subtraction follows from the fact that g is injective (by Proposition E.10), as is $g^{-1}$ (likewise). Monotonicity remains.

The range of g contains an open subset of $H_n(0)$, so in particular it contains the vertices of some translated and dilated copy of the standard simplex. That is, there are n points $x_1, \dots, x_n$ in the range of g for which there is a positive scalar a and vector b such that $a\delta_i + b = x_i$ for every i. (Here $\delta_i$ is the i-th standard basis vector in $\mathbb{R}^n$.)
We will let $p_i$ be the pre-image of $x_i$ under g, so that $g(p_i) = a\delta_i + b$. Observe that for any $\mathbf{w}$ in the domain of p, we have
\[ g(p(\mathbf{w})) = \sum_{i=1}^n w_i\, g(p_i) = \sum_{i=1}^n w_i (a\delta_i + b) = a\mathbf{w} + b, \]
so
\[ p(\mathbf{w}) = g^{-1}(a\mathbf{w} + b). \]
We have that $g^{-1}$ is strictly cyclically monotone (by Proposition E.10), and it is easy to verify that for any strictly cyclically monotone function f, any a > 0, and any vector b, $f(a\mathbf{x} + b)$ is a strictly cyclically monotone function of $\mathbf{x}$. Therefore, $p(\mathbf{w}) = g^{-1}(a\mathbf{w} + b)$ is strictly cyclically monotone, as desired.

Now we prove the converse. Assume that we have a pooling operator ⊕ satisfying the axioms in Definition E.8. We wish to show that ⊕ is $\oplus_g$ for some $g: D \to H_n(0)$.

For the remainder of this proof, let $p_1, \dots, p_n$ be vectors certifying the monotonicity of ⊕, and let p(·) be as in Definition E.7. For any $q \in D$, let $g(q) := \mathbf{w} - \frac{1}{n}\vec{1}$, where $\mathbf{w} \in H_n(1)$ is such that $p(\mathbf{w}) = q$ and $\vec{1}$ is the all-ones vector. This raises the question of well-definedness: does this $\mathbf{w}$ necessarily exist, and if so, is it unique? The following claim shows that this is indeed the case.

Claim E.12.
The function p, from the subset of $H_n(1)$ where it is defined to D, is bijective. (This follows from the invariance of domain theorem, which states that the image of an open subset of a manifold under an injective continuous map is open.)

Proof. The fact that p is injective follows from the fact that it is strictly cyclically monotone (see Proposition E.10). We now show that p is surjective.

Let $q \in D$. Define the function $\tilde{p}: \Delta_{n+1} \to D$ by
\[ \tilde{p}(w_1, \dots, w_{n+1}) := \mathrm{pr}\left( \left( \bigoplus_{i=1}^n (p_i, w_i) \right) \oplus (q, w_{n+1}) \right). \]
Since $\tilde{p}$ is a continuous map from $\Delta_{n+1}$ (an n-dimensional manifold) to D (an (n−1)-dimensional manifold), $\tilde{p}$ is not injective. So in particular, let $\mathbf{w}_1 \ne \mathbf{w}_2 \in \Delta_{n+1}$ be such that $\tilde{p}(\mathbf{w}_1) = \tilde{p}(\mathbf{w}_2)$. That is, we have
\[ \left( \bigoplus_{i=1}^n (p_i, w_{1,i}) \right) \oplus (q, w_{1,n+1}) = \left( \bigoplus_{i=1}^n (p_i, w_{2,i}) \right) \oplus (q, w_{2,n+1}). \tag{7} \]
Observe that $w_{1,n+1} \ne w_{2,n+1}$; for otherwise it would follow from the subtraction axiom that two different combinations of the $p_i$'s would give the same probability, contradicting the fact that p is injective. Without loss of generality, assume that $w_{1,n+1} > w_{2,n+1}$. We can rearrange the terms in Equation 7 to look as follows:
\[ (q, w_{1,n+1} - w_{2,n+1}) \oplus \left( \bigoplus_{i \in S} (p_i, v_i) \right) = \bigoplus_{i \in T \subseteq [n] \setminus S} (p_i, v_i) \]
for some positive $v_1, \dots, v_n$. By the distributive property, we may divide all weights by $w_{1,n+1} - w_{2,n+1}$. The result will be an equation as in Definition E.7, certifying that q is in the range of the function p, as desired.

We return to our main proof, now that we have shown that our function $g(q) := \mathbf{w} - \frac{1}{n}\vec{1}$, where $\mathbf{w} \in H_n(1)$ is such that $p(\mathbf{w}) = q$, is well-defined. In fact, we can simply write $g(q) = p^{-1}(q) - \frac{1}{n}\vec{1}$.
(The vector $\frac{1}{n}\mathbf{1}_n$ is fairly arbitrary; it only serves the purpose of forcing the range of $g$ to lie in $H_n(0)$ instead of $H_n(1)$.)

We first show that the defining equation of $\oplus_g$ holds, that is, that if $(q_1, v_1) \oplus (q_2, v_2) = (q, v_1 + v_2)$ (with $v_1, v_2 \geq 0$, not both zero), then
\[g(q) = \frac{v_1 g(q_1) + v_2 g(q_2)}{v_1 + v_2}.\]
Let $w_1, w_2 \in H_n(1)$ be such that $q_1 = p(w_1)$ and $q_2 = p(w_2)$. It is intuitive that $q = p\left(\frac{v_1 w_1 + v_2 w_2}{v_1 + v_2}\right)$, but we show this formally.

Claim E.13.
Given $q_1, q_2 \in D$ with $q_1 = p(w_1)$, $q_2 = p(w_2)$, and $0 \leq \alpha \leq 1$, we have $p(\alpha w_1 + (1-\alpha) w_2) = (q_1, \alpha) \oplus (q_2, 1-\alpha)$.

[Footnote: By the continuity axiom; here we use the more general form we stated earlier.] [Footnote: This follows, e.g., from the Borsuk-Ulam theorem.]

Proof. Note that
\[(q_1, 1) \oplus \bigoplus_{i: w_{1,i} < 0} (p_i, -w_{1,i}) = \bigoplus_{i: w_{1,i} > 0} (p_i, w_{1,i}) \quad \text{and} \quad (q_2, 1) \oplus \bigoplus_{i: w_{2,i} < 0} (p_i, -w_{2,i}) = \bigoplus_{i: w_{2,i} > 0} (p_i, w_{2,i}).\]
Applying the distributive property to the two above equations with constants $\alpha$ and $1-\alpha$, respectively, and adding them, we get that
\[(q_1, \alpha) \oplus (q_2, 1-\alpha) \oplus \bigoplus_{i: w_{1,i} < 0} (p_i, -\alpha w_{1,i}) \oplus \bigoplus_{i: w_{2,i} < 0} (p_i, -(1-\alpha) w_{2,i}) = \bigoplus_{i: w_{1,i} > 0} (p_i, \alpha w_{1,i}) \oplus \bigoplus_{i: w_{2,i} > 0} (p_i, (1-\alpha) w_{2,i}).\]
We have that $(q_1, \alpha) \oplus (q_2, 1-\alpha) = (q', 1)$ for some $q' \in D$, and the above equation then certifies, as in Definition E.7, that $q' = p(\alpha w_1 + (1-\alpha) w_2)$, proving the claim.

Returning to the main argument: since $(q_1, v_1) \oplus (q_2, v_2) = (q, v_1 + v_2)$, we have $q = p\left(\frac{v_1 w_1 + v_2 w_2}{v_1 + v_2}\right)$. Applying Claim E.13 with $\alpha = \frac{v_1}{v_1 + v_2}$, we find that
\[g(q) = \frac{v_1 w_1 + v_2 w_2}{v_1 + v_2} - \frac{1}{n}\mathbf{1}_n = \frac{v_1\left(g(q_1) + \frac{1}{n}\mathbf{1}_n\right) + v_2\left(g(q_2) + \frac{1}{n}\mathbf{1}_n\right)}{v_1 + v_2} - \frac{1}{n}\mathbf{1}_n = \frac{v_1 g(q_1) + v_2 g(q_2)}{v_1 + v_2},\]
as desired.

It remains to show that $g$ is continuous, strictly cyclically monotone, and has convex range. By the monotonicity axiom, $p$ is strictly cyclically monotone. It follows by Proposition E.10 that its inverse is continuous and strictly cyclically monotone. Therefore, $g$ is continuous and strictly cyclically monotone (as it is simply a translation of $p^{-1}(q)$ by $-\frac{1}{n}\mathbf{1}_n$).

Finally, to show that $g$ has convex range, we wish to show that $p^{-1}$ has convex range; or, in other words, that the domain on which $p$ is defined is convex. And indeed, this follows straightforwardly from Claim E.13. Let $w_1, w_2$ be in the domain of $p$, with $p(w_1) = q_1$, $p(w_2) = q_2$. Then for any $0 \leq \alpha \leq 1$, we have that $p(\alpha w_1 + (1-\alpha) w_2) = (q_1, \alpha) \oplus (q_2, 1-\alpha)$, so in particular $\alpha w_1 + (1-\alpha) w_2$ is in the domain of $p$. This concludes the proof.

F. The convex exposure property
Several of our results have been contingent on the convex exposure property. We showed in Proposition 3.7 that this property always holds in the case of $n = 2$ outcomes (assuming, as we have been, that $s$ is continuous). In this appendix, we take this discussion further by considering when the convex exposure property holds in higher dimensions. As we shall see, the property holds for nearly all of the most commonly used scoring rules.

(A note on notation: in this section we use $p_j$ instead of $p(j)$ to refer to the $j$-th coordinate of a probability distribution $p$.)

Our first result says, roughly speaking, that scoring rules that, like the logarithmic scoring rule, "go off to infinity" have convex exposure.

Proposition F.1.
Let $s$ be a proper scoring rule on a forecast domain $D$ that is an open subset of $\Delta_n$, such that for any point $x$ on the boundary of $D$, and for any sequence $x_1, x_2, \dots$ converging to $x$, we have $\lim_{k \to \infty} \|g(x_k)\| = \infty$. Then $s$ has convex exposure.

This is a statement of convex analysis: namely, that if $\|g\|$ approaches $\infty$ on the boundary of a convex set, then the range of $g$ is convex (assuming $g$ is the gradient of a differentiable convex function). We refer the reader to [Roc70b, Theorem 26.5] for the proof. In non-pathological cases, the basic intuition is that every $v \in \{x : \sum_i x_i = 0\}$ is the gradient of $G$ at some point. In these cases, $\nabla G(x) = v$ where $x$ minimizes $G(x) - v \cdot x$; the $\lim_{k \to \infty} \|g(x_k)\| = \infty$ condition means that this minimum does not occur on the boundary of $D$.

Corollary F.2.
Let $\mathrm{int}(\Delta_n) := \{p \in \Delta_n : p_i > 0 \;\; \forall i\}$. The following scoring rules have convex exposure over forecast domain $D = \mathrm{int}(\Delta_n)$.

• The logarithmic scoring rule.

• The scoring rule given by $G(p) = -\sum_j p_j^\gamma$ for $\gamma \in (0,1)$ and by $G(p) = \sum_j p_j^\gamma$ for $\gamma < 0$.

• The scoring rule given by $G(p) = -\sum_j \ln p_j$, which can be thought of as the limit of the $G$ in the previous bullet point as $\gamma \to 0$. [Footnote: This is a natural way to think of this scoring rule because $\nabla G(p) = -(p_1^{-1}, \dots, p_n^{-1})$.]

• The scoring rule $hs$ given by $G_{hs}(p) = -\prod_j p_j^{1/n}$.

The hs scoring rule. The last of these scoring rules is a generalization of the scoring rule $hs(q) = 1 - \sqrt{\frac{1-q}{q}}$ [Footnote: Here we are using the shorthand notation for the $n = 2$ outcome case discussed in Remark 3.5.] used in [BB20] as a key ingredient in their minimax theorem for randomized algorithms. The key property of the scoring rule was a result about its amplification [BB20, Lemma 3.10]. The authors define a forecasting algorithm to be a generalization of a randomized algorithm that outputs an estimated probability that an output should be accepted. Then, roughly speaking, the authors show that given a forecasting algorithm $R$, it is possible to create a forecasting algorithm $R'$ that has a much larger expected score from the scoring rule $hs$ by running $R$ a small number of times and combining the results. For this reason, $hs$ deserves more attention.

Since additive and multiplicative constants are irrelevant, we may treat $hs(q) = -\sqrt{\frac{1-q}{q}}$. Observe that (in the case of two outcomes) the expected score $G_{hs}$ on a report of $q$ is
\[G_{hs}(q) = q\left(-\sqrt{\frac{1-q}{q}}\right) + (1-q)\left(-\sqrt{\frac{q}{1-q}}\right) = -2\sqrt{q(1-q)}.\]
That is, $G_{hs}$ is precisely (up to a constant factor) negative the geometric mean of $q$ and $1-q$.
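The two-outcome computation above can be checked numerically. The sketch below (our own function names, not code from [BB20]) confirms that the expected score of a truthful report $q$ equals $-2\sqrt{q(1-q)}$, and that truthful reporting maximizes the expected score, i.e. that $hs$ is proper:

```python
import math

def hs(q: float) -> float:
    """Score when the realized outcome was forecast with probability q,
    using the normalized form hs(q) = -sqrt((1 - q) / q) (constants dropped)."""
    return -math.sqrt((1 - q) / q)

def expected_score(p: float, q: float) -> float:
    """Expected hs-score of reporting q when the true probability is p."""
    return p * hs(q) + (1 - p) * hs(1 - q)

# The expected score of a truthful report equals -2 * sqrt(q * (1 - q)).
for q in [0.1, 0.3, 0.5, 0.9]:
    assert abs(expected_score(q, q) - (-2 * math.sqrt(q * (1 - q)))) < 1e-12

# Propriety: for fixed true p, the expected score is maximized at q = p.
p = 0.3
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda q: expected_score(p, q))
assert abs(best - p) < 1e-3
```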
This motivates us to generalize $hs$ to a setting with $n$ outcomes by setting
\[G_{hs}(p) := -\prod_{i=1}^n p_i^{1/n}.\]
It should not be obvious that this function is convex, but it turns out to be; this is the precise statement of an inequality known as Mahler's inequality [Wik18].

Next we note that the quadratic scoring rule has convex exposure, since its exposure function $g(p) = 2p$ (modulo $\mathbf{1}_n$, as discussed in Remark 3.9) maps any convex set to a convex set.

Proposition F.3.
The quadratic scoring rule has convex exposure (for any convex $D$).

Spherical scoring rules, the third most studied proper scoring rules after the quadratic and logarithmic rules, also have convex exposure.

Definition F.4 (Spherical scoring rules). [GR07, Example 2] For any $\alpha > 1$, define the spherical scoring rule with parameter $\alpha$ to be the scoring rule given by
\[G_{\mathrm{sph},\alpha}(p) := \left(\sum_{i=1}^n p_i^\alpha\right)^{1/\alpha}.\]
If the "spherical scoring rule" is referenced with no parameter $\alpha$ given, $\alpha$ is presumed to equal $2$.

Proposition F.5.
For any $\alpha > 1$, the spherical scoring rule with parameter $\alpha$ (over $D = \Delta_n$) has convex exposure.

Proof. Fix $\alpha > 1$. We will write $G$ in place of $G_{\mathrm{sph},\alpha}$. We have
\[g(p) = \left(\sum_{j=1}^n p_j^\alpha\right)^{(1/\alpha) - 1} \left(p_1^{\alpha-1}, \dots, p_n^{\alpha-1}\right). \tag{8}\]
Now, define the $n$-dimensional unit $\beta$-sphere to be $\{x : \sum_j x_j^\beta = 1\}$, and define the $n$-dimensional unit $\beta$-ball correspondingly (i.e. with $\leq$ in place of $=$). The range of $g$ is the part of the $n$-dimensional unit $\frac{\alpha}{\alpha-1}$-sphere with all non-negative coordinates. [Footnote: As discussed in Remark 3.9, the range of $g$ should be thought of as modulo $T(\mathbf{1}_n)$. However, we find it convenient for this proof to think of it as lying in $\mathbb{R}^n$ and project later.] Indeed, on the one hand, for any $p$ we have
\[\sum_j g_j(p)^{\alpha/(\alpha-1)} = \left(\sum_j p_j^\alpha\right)^{-1} \cdot \sum_j p_j^\alpha = 1\]
(where $g_j(p)$ denotes the $j$-th coordinate of $g(p)$ as in Equation 8). On the other hand, given a point $x$ on the unit $\frac{\alpha}{\alpha-1}$-sphere with all non-negative coordinates,
\[p = \left(\sum_j x_j^{1/(\alpha-1)}\right)^{-1} \left(x_1^{1/(\alpha-1)}, \dots, x_n^{1/(\alpha-1)}\right)\]
lies in $\Delta_n$ and satisfies $g(p) = x$.

The crucial point for us is that for $\beta > 1$, the unit $\beta$-ball is convex. This means that for any such $\beta$, the convex combination of any number of points on the unit $\beta$-sphere will lie in the unit $\beta$-ball. Since $\frac{\alpha}{\alpha-1} > 1$ for $\alpha > 1$, we have that for arbitrary $p, q \in \Delta_n$ and $w \in [0,1]$, the point $w g(p) + (1-w) g(q)$ lies in the unit $\frac{\alpha}{\alpha-1}$-ball, in fact in the part with all non-negative coordinates. Now, consider casting a ray from this convex combination point in the positive $\mathbf{1}_n$ direction. All points on this ray are equivalent to this point modulo $T(\mathbf{1}_n)$, and this ray will intersect the unit $\frac{\alpha}{\alpha-1}$-sphere at some point $x$ with all non-negative coordinates. The point $p^* \in \Delta_n$ with $g(p^*) = x$ then satisfies $g(p^*) = w g(p) + (1-w) g(q)$ modulo $T(\mathbf{1}_n)$. This completes the proof.
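The geometric construction in the proof can be made concrete for $\alpha = 2$, where $g(p) = p / \|p\|_2$. The following sketch (our own code, written under the assumption that all forecasts have positive coordinates) averages the normalized forecasts, shifts along the all-ones direction back onto the unit sphere, and rescales to the simplex:

```python
import math

def spherical_pool(p, q, w):
    """QA pooling for the spherical rule with alpha = 2, following the geometric
    construction in the proof: average g(p) = p/||p||_2 and g(q), move along the
    all-ones direction back onto the unit sphere, then rescale to sum to 1."""
    n = len(p)
    norm = lambda v: math.sqrt(sum(t * t for t in v))
    gp = [t / norm(p) for t in p]
    gq = [t / norm(q) for t in q]
    x = [w * a + (1 - w) * b for a, b in zip(gp, gq)]  # lies in the unit 2-ball
    # Solve ||x + shift * 1||_2 = 1 for the nonnegative root:
    # n * shift^2 + 2 * (sum x) * shift + (||x||^2 - 1) = 0.
    s = sum(x)
    shift = (-s + math.sqrt(s * s - n * (norm(x) ** 2 - 1))) / n
    y = [a + shift for a in x]        # on the unit sphere, nonnegative coords
    total = sum(y)
    return [a / total for a in y]     # scale back onto the simplex

pooled = spherical_pool([0.7, 0.2, 0.1], [0.2, 0.5, 0.3], 0.6)
assert abs(sum(pooled) - 1) < 1e-12 and all(t >= 0 for t in pooled)
```

Note that pooling a forecast with itself returns that forecast unchanged, as expected of any sensible pooling operator: the averaged point already lies on the sphere, so the shift is zero.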
Remark F.6.
The above proof gives a geometric interpretation of QA pooling with respect to the spherical scoring rule, particularly for $\alpha = 2$. In the $\alpha = 2$ case, pooling amounts to taking the following steps:

(1) Scale each forecast so it lies on the unit sphere.

(2) Take the weighted average of the resulting points in $\mathbb{R}^n$.

(3) Shift the resulting point in the positive $\mathbf{1}_n$ direction to the unique point in that direction that lies on the unit sphere.

(4) Scale this point so that its coordinates add to 1.

Finally we consider the parametrized family known as Tsallis scoring rules, defined in [Tsa88].
Definition F.7 (Tsallis scoring rules). For $\gamma > 1$, the Tsallis scoring rule with parameter $\gamma$ is the rule given by
\[G_{\mathrm{Tsa},\gamma}(p) = \sum_{j=1}^n p_j^\gamma.\]

Note that $\gamma = 2$ above yields the quadratic scoring rule. Note also that we have already addressed the scoring rule given by $G(p) = \pm\sum_j p_j^\gamma$ for $\gamma \leq 1$ (except $\gamma = 0, 1$, which are degenerate), with the sign chosen to make $G$ convex: these scoring rules have convex exposure by Proposition F.1. The following proposition completes our analysis for this natural class of scoring rules.

Proposition F.8.
For $\gamma \leq 2$, the Tsallis scoring rule with parameter $\gamma$ (over $D = \Delta_n$) has convex exposure. For $\gamma > 2$, this is not the case if $n > 2$.

Proof of Proposition F.8. Fix $\gamma > 1$. We will write $G$ in place of $G_{\mathrm{Tsa},\gamma}$. Up to a multiplicative factor of $\gamma$ that we are free to ignore, we have
\[g(p) = \left(p_1^{\gamma-1}, \dots, p_n^{\gamma-1}\right).\]
Let $p, q \in \Delta_n$ and $w \in [0,1]$. We seek $x \in \Delta_n$ such that $g(x) = w g(p) + (1-w) g(q)$ modulo $T(\mathbf{1}_n)$, i.e.
\[w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} + c = x_j^{\gamma-1}\]
for all $j \in [n]$, for some $c$. Since $\sum_j x_j = 1$, this $c$ must satisfy
\[\sum_j \left(w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} + c\right)^{1/(\gamma-1)} = 1. \tag{9}\]
Let $h(x) := \sum_j \left(w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} + x\right)^{1/(\gamma-1)}$. Note that $h$ is increasing in $x$.

First consider the case that $\gamma \leq 2$. By concavity, we have that
\[w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} \leq \left(w p_j + (1-w) q_j\right)^{\gamma-1}.\]
This means that
\[h(0) = \sum_j \left(w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1}\right)^{1/(\gamma-1)} \leq \sum_j \left(w p_j + (1-w) q_j\right) = 1.\]
On the other hand, $\lim_{x \to \infty} h(x) = \infty$. Since $h$ is continuous, there must be some $x \in [0, \infty)$ such that $h(x) = 1$; call this value $c$. Then let $x_j = \left(w p_j^{\gamma-1} + (1-w) q_j^{\gamma-1} + c\right)^{1/(\gamma-1)}$. Then every $x_j$ is nonnegative and $\sum_j x_j = 1$, so we have succeeded.

Now consider the case that $\gamma > 2$, and consider as a counterexample $p = (1, 0, \dots, 0)$, $q = (0, 1, 0, \dots, 0)$, $w = \frac{1}{2}$. To satisfy Equation 9, we are looking for $c$ such that
\[h(c) = 2\left(\frac{1}{2} + c\right)^{1/(\gamma-1)} + (n-2)\, c^{1/(\gamma-1)} = 1.\]
Note that $h(0) = 2 \cdot 2^{-1/(\gamma-1)} = 2^{(\gamma-2)/(\gamma-1)} > 1$, so $c < 0$ (as $h$ is increasing). But in that case $x_j^{\gamma-1} < 0$ for $j \geq 3$, a contradiction (assuming $n > 2$).

Since $\nabla G_{\mathrm{Tsa},\gamma}(p) = \left(p_1^{\gamma-1}, \dots, p_n^{\gamma-1}\right)$ (up to a constant factor), QA pooling with respect to the Tsallis scoring rule can be thought of as an appropriately scaled coordinate-wise $(\gamma-1)$-th power mean. For $\gamma = 2$ it is the coordinate-wise arithmetic average. For $\gamma = 3$ it is the coordinate-wise root mean square, but with the average of the squares shifted by an appropriate additive constant so that, upon taking the square roots, the probabilities add to 1. (However, as the Tsallis score with parameter 3 does not have convex exposure, this is not always well-defined.)

In Corollary F.2 we mentioned that the scoring rule given by $G(p) = -\sum_j \ln p_j$ can be thought of as an extension to $\gamma = 0$ of (what we are now calling) the Tsallis score, because the derivative of $\ln x$ is $x^{-1}$. QA pooling with respect to this scoring rule is, correspondingly, the $(-1)$-th power mean, i.e. harmonic pooling; see e.g. [Daw+95]. (The logarithmic scoring rule can similarly be thought of as the extension to $\gamma = 1$, in that the second derivative of $x \ln x$ is $x^{-1}$.)
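The existence argument in the proof of Proposition F.8 is constructive: for $1 < \gamma \leq 2$, the constant $c$ in Equation 9 can be found by bisection, since $h(0) \leq 1$ and $h$ is increasing and unbounded. A minimal sketch (our own code, not from the paper) for forecasts with positive coordinates:

```python
def tsallis_pool(p, q, w, gamma):
    """QA pooling w.r.t. the Tsallis rule for 1 < gamma <= 2: find c >= 0 with
    sum_j (w*p_j^(gamma-1) + (1-w)*q_j^(gamma-1) + c)^(1/(gamma-1)) = 1
    (Equation 9), then read off the pooled forecast x."""
    a = gamma - 1
    base = [w * pj**a + (1 - w) * qj**a for pj, qj in zip(p, q)]
    h = lambda c: sum((b + c) ** (1 / a) for b in base)
    lo, hi = 0.0, 1.0
    while h(hi) < 1:           # bracket the root: h(lo) <= 1 <= h(hi)
        hi *= 2
    for _ in range(200):       # bisection on [lo, hi]
        mid = (lo + hi) / 2
        if h(mid) < 1:
            lo = mid
        else:
            hi = mid
    c = (lo + hi) / 2
    return [(b + c) ** (1 / a) for b in base]

x = tsallis_pool([0.6, 0.3, 0.1], [0.2, 0.2, 0.6], 0.5, 1.5)
assert abs(sum(x) - 1) < 1e-9 and all(t >= 0 for t in x)
```

For $\gamma = 2$ the base terms already sum to 1, the bisection drives $c$ to 0, and the procedure reduces to linear pooling, as the discussion above predicts.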