Bayesian Ensembles of Binary-Event Forecasts: When Is It Appropriate to Extremize or Anti-Extremize?
Kenneth C. Lichtendahl Jr.
Darden School of Business, University of Virginia, Charlottesville, VA 22903, [email protected]
Yael Grushka-Cockayne
Darden School of Business, University of Virginia, Charlottesville, VA 22903, [email protected]
Victor Richmond R. Jose
McDonough School of Business, Georgetown University, Washington, DC 20057, [email protected]
Robert L. Winkler
The Fuqua School of Business, Duke University, Durham, NC 27708, [email protected]
Many organizations face critical decisions that rely on forecasts of binary events. In these situations, organizations often gather forecasts from multiple experts or models and average those forecasts to produce a single aggregate forecast. Because the average forecast is known to be underconfident, methods have been proposed that create an aggregate forecast more extreme than the average forecast. But is it always appropriate to extremize the average forecast? And if not, when is it appropriate to anti-extremize (i.e., to make the aggregate forecast less extreme)? To answer these questions, we introduce a class of optimal aggregators. These aggregators are Bayesian ensembles because they follow from a Bayesian model of the underlying information experts have. Each ensemble is a generalized additive model of experts' probabilities that first transforms the experts' probabilities into their corresponding information states, then linearly combines these information states, and finally transforms the combined information states back into the probability space. Analytically, we find that these optimal aggregators do not always extremize the average forecast, and when they do, they can run counter to existing methods. On two publicly available datasets, we demonstrate that these new ensembles are easily fit to real forecast data and are more accurate than existing methods.
Key words: Forecast aggregation; linear opinion pool; generalized additive model; generalized linear model; stacking.
Date : May 1, 2019
1. Introduction
Many organizations face forecasting challenges that involve binary events. These forecasts are critical to decisions such as the approval of credit (probability of default), the target of a marketing campaign (probability of a click), the recommendation of a drug (probability of having a disease), and the choice of a national security response (probability of a geopolitical event occurring). Often crowds of experts or models issue probabilities for such events. To aggregate the individual probabilities, we offer a new method based on Bayesian principles. This method generalizes several proposed aggregators in the literature, exhibits some structural advantages, and may be more accurate in practice.

Since its introduction by Stone (1961), many researchers have found the linear opinion pool to be an attractive way to aggregate forecasts (DeGroot 1974, McConway 1981, Genest and Zidek 1986, DeGroot and Mortera 1991). The most popular way to aggregate expert forecasts is to take the average of the experts' forecasts, which is a linear opinion pool with equal weights (Clemen and Winkler 1986, Larrick and Soll 2006). In aggregating binary-event forecasts, Winkler and Poses (1993, p. 1533) state, "Simple averages of forecasts seem to work as well as or better than fancier combining methods."

Nonetheless, Hora (2004) and Ranjan and Gneiting (2010) show that the linear opinion pool is underconfident. In the context of binary events, the linear opinion pool is, on average, not extreme enough. To address this problem, Ranjan and Gneiting (2010) propose a method that extremizes the linear opinion pool by pushing it closer to its nearer extreme. Many others have employed schemes to extremize the average forecast (Karmarkar 1978, Erev et al. 1994, Ariely et al. 2000, Shlomi and Wallsten 2010, Turner et al. 2014, Mellers et al. 2014, Baron et al. 2014, Satopää et al. 2014). Extremizing methods, such as the logit aggregator, are now used as benchmarks in practice (IARPA Geopolitical Forecasting Challenge 2018). Baron et al. (2014, p. 134) note that "If every forecaster said 0.6, and they were using different information, then someone who knew all of this would have a right to much higher confidence." This updating process is consistent with Bayesian principles under certain assumptions and is the main justification for extremizing the average forecast.

These existing methods, however, are heuristics that do not follow from Bayesian principles. Consequently, they risk being sub-optimal. The idea that a decision maker can use Bayesian reasoning to aggregate experts' forecasts goes back at least to Winkler (1968) and Morris (1974). "To the expert, the probability assessment is a representation of his state of information; to the decision maker, the probability assessment is information." (Morris 1974, p. 1241) Many other researchers have proposed models along these lines (Lindley et al. 1979, French 1980, Clemen 1987). Dawid et al. (1995) make an important contribution to the literature on aggregating binary-event forecasts. They introduce the first set of aggregation methods based on Bayesian principles and a fundamental condition regarding calibration. The condition says that if the decision maker hears from only one calibrated expert, he adopts that expert's forecast as his own. Recently, Satopää et al. (2016), building on the work of Dawid et al. (1995) and generalizing an example in Ranjan and Gneiting (2010), introduce a probit aggregator that is consistent with Bayesian principles. We call such an aggregator a Bayesian ensemble in the spirit of Chipman et al. (2007).

In this paper, we significantly enlarge the class of Bayesian ensembles. This larger class has three main structural properties, the combination of which offers advantages over previous methods. First, our Bayesian ensembles incorporate the prior-predictive probability into the aggregate forecast. One can think of the prior-predictive probability as the base rate known to the experts and the decision maker alike. For example, when forecasting rain in Phoenix, Arizona, all parties involved in the aggregation may agree on the base rate at which rain occurs daily, say 10%. Second, our Bayesian ensembles are generalized linear models of experts' transformed probabilities where the decision maker can choose a link function beyond those that accompany the logit and probit models. With a customized link function, a generalized linear model can often fit data better. Third, because our Bayesian ensembles are generalized linear models of experts' transformed probabilities, they can be easily and quickly estimated on large datasets using statistical computing software.

With this enlarged class of Bayesian ensembles, we provide new insights into why, when, and how to extremize the average forecast. In this class, it often makes sense to extremize the average forecast, but sometimes it does not. In some cases, it can make sense to anti-extremize, or make the aggregate forecast less extreme than the average forecast. Three different types of information play a crucial role in our understanding of extremizing/anti-extremizing: (i) the prior information known to the decision maker and experts, (ii) each expert's private information, and (iii) the shared information known to all the experts but not the decision maker. When the experts rely on private information only, the decision maker naturally wants to form an aggregate forecast that extremizes the average forecast because each individual forecast contains some weight on the prior information. For example, with two experts, the average forecast double counts the prior information and is pulled too far in the direction of the prior-predictive probability.

When the experts rely on the same amount of information but some of that information is shared information, the decision maker does not have as much new information in the reports to justify the same move away from the prior-predictive probability. Hence, shared information tends to reduce the degree of extremizing. In fact, the presence of shared information can cause a Bayesian ensemble to attenuate the average forecast. In practice, shared information is common, since real-world experts often use similar models or have similar training (Kim et al. 2001, Chen et al. 2004, Marinovic et al. 2012). If models' forecasts are being combined, the models often pick up on similar features from the training set.
In the next two sections, we introduce two types of Bayesian ensembles. The first type, developed in Section 2, involves the use of conjugate pairs of distributions. The conjugate pairs help build intuition for how a Bayesian ensemble emerges from the information experts have. The conjugate-pair Bayesian ensembles also help motivate the assumptions behind our second type of Bayesian ensemble, called the generalized probit ensemble. In Section 3, we provide the form of the generalized probit ensemble and describe how it can be easily fit to data using a generalized linear model with a custom link function. In Section 4, we present two pilot studies where we fit the generalized probit ensemble to large sets of real forecast data. In the first study, the goal is to predict loan defaults in Fannie Mae's single-family loan performance data. In the second study, the task is to predict defective used cars in Kaggle's Don't Get Kicked! competition. The forecasts our ensemble and benchmark methods aggregate come from several leading statistical and machine learning algorithms. The results of our pilot studies suggest that the generalized probit ensemble may be more accurate than several leading aggregation methods in other applications.
2. Conjugate-Pair Bayesian Ensembles
In this section, we introduce our first type of Bayesian ensemble. Before introducing this ensemble, we provide a formal definition of extremizing, an important definition that will be used throughout the paper.
Our definition of extremizing is inspired by the definitions of sharpness in Winkler and Jose (2008) and Ranjan and Gneiting (2010). Sharper, or more extreme, forecasts are those farther away from their marginal event frequencies. Winkler and Jose (2008) state, "If climatology c is used as the baseline probability for probability of precipitation forecasts, sharpness should be viewed in terms of shifts from c toward zero or one instead of shifts from 0.5 toward zero or one." In forecasting rain in Phoenix, Arizona, where the historical daily frequency of rain is about 10%, a forecast of 40% would naturally be considered more extreme than a forecast of 30%.

For the following definition, we assume the average forecast is not equal to either the prior-predictive probability (i.e., the forecast based on the prior information only) or the aggregate forecast.

Definition 1 (Extremizing). The aggregate forecast extremizes the average forecast if it is farther away from the prior-predictive probability in the same direction as the average forecast. Otherwise, the aggregate forecast anti-extremizes the average forecast.

This definition differs from other definitions in the literature. For example, Baron et al. (2014), Satopää et al. (2014), and Satopää et al. (2016) say the aggregate forecast extremizes the average forecast if it is farther away from one-half in the same direction as the average forecast. Under this definition, a forecast of 40% for rain in Phoenix would be considered less extreme than a forecast of 30%.
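To make Definition 1 concrete, the check below (an illustrative sketch in Python; the function name and the use of strict inequalities are our own choices) classifies an aggregate forecast relative to the average forecast and the prior-predictive probability, assuming the average forecast equals neither of the other two quantities.

```python
def extremizes(p_hat, p_bar, p0):
    """Definition 1: the aggregate forecast p_hat extremizes the average forecast p_bar
    if it lies farther from the prior-predictive probability p0, in the same direction."""
    same_direction = (p_hat - p0) * (p_bar - p0) > 0
    return same_direction and abs(p_hat - p0) > abs(p_bar - p0)

# With a 10% base rate for rain, an aggregate of 0.40 extremizes an average of 0.30,
# even though 0.40 is closer to one-half than 0.30 is.
print(extremizes(0.40, 0.30, 0.10))   # True
```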
Below we introduce an information structure that describes the information the decision maker and experts have in order to make their forecasts. In this setting, each of $k \geq 2$ experts forecasts the binary event $y$ given her private sample information $\mathbf{x}_i$ and the shared sample information $\mathbf{x}_s$. The event $y$ equals 1 if the event occurs, or it equals 0 if the event does not occur. We denote expert $i$'s forecast $P(y = 1 \mid \mathbf{x}_i, \mathbf{x}_s)$ by $p_i$. After hearing from his $k \geq 2$ experts, the decision maker forms an aggregate forecast $\hat{p}$; it is optimal if it is the posterior-predictive probability $P(y = 1 \mid p_1, \ldots, p_k)$ derived from the joint distribution of $(p_1, \ldots, p_k, y)$ using Bayes' Theorem. This aggregate forecast is optimal because using any other forecast would yield a worse score when evaluated by a proper scoring rule (Gneiting and Raftery 2007). Later in this section we compare several Bayesian ensembles to the average forecast, $\bar{p} = (1/k)\sum_{i=1}^k p_i$, to see when these ensembles extremize (or anti-extremize) the average forecast.

Suppose there is an exchangeable sequence of data points:

$$\bigl(\underbrace{x_1, \ldots, x_{n_1}}_{\text{Expert 1's private sample of size } n_1},\ \underbrace{x_{n_1+1}, \ldots, x_{n_1+n_2}}_{\text{Expert 2's private sample of size } n_2},\ \ldots,\ \underbrace{x_{N_{k-1}+1}, \ldots, x_{N_k}}_{\text{Expert } k\text{'s private sample of size } n_k},\ \underbrace{x_{N_k+1}, \ldots, x_{N_k+n_s}}_{\text{Shared sample of size } n_s},\ \underbrace{x_{N_k+n_s+1}}_{\text{Related to the binary event } y}\bigr). \quad (1)$$

Expert $i$ sees the private sample $\mathbf{x}_i = (x_{N_{i-1}+1}, \ldots, x_{N_i})$ of size $n_i$ for $i = 1, \ldots, k$, where $N_i = \sum_{l=1}^{i} n_l$. The shared information, known to all the experts but not the decision maker, is the sample $\mathbf{x}_s = (x_{N_k+1}, \ldots, x_{N_k+n_s})$ of size $n_s$.

The final data point $x_{N_k+n_s+1}$, abbreviated as $x$, is related to the binary event $y$, which is what the decision maker ultimately cares about. If $x$ is in the event occurrence set $A$, then $y$ equals 1; otherwise, $y$ equals 0. For example, if $x$ is a Bernoulli random variable, the set $A$ might simply be $\{1\}$. Alternatively, if $x$ is a normal random variable, the set $A$ might be the interval $(0, \infty)$.

Data points in the sequence are independent and identically distributed according to a likelihood from a regular, one-parameter exponential family with probability mass or density function $f(x_j \mid \theta) = a(x_j)\, b(\theta) \exp(c(\theta)\, h(x_j))$. The parameter $\theta$ is distributed according to a conjugate prior $f(\theta) = [K(\tau_0, \tau_1)]^{-1}\, [b(\theta)]^{\tau_0} \exp(c(\theta)\, \tau_1)$, where $\tau_0$ and $\tau_1$ are the prior's hyperparameters
and $K(\tau_0, \tau_1)$ is its normalizing constant. The joint distribution of $(\theta, \mathbf{x}_1, \ldots, \mathbf{x}_k, \mathbf{x}_s, x)$ and the event occurrence set $A$ are common knowledge among the decision maker and the experts. The prior/likelihood pair of distributions that describes this joint distribution is called a conjugate pair (Raiffa and Schlaifer 1961, Bernardo and Smith 2000). Hence we call the Bayesian ensemble that a conjugate pair generates a conjugate-pair Bayesian ensemble.

Based on these assumptions, the following function generates a set of predictive distributions: one for the prior-predictive, one for each expert's posterior-predictive, and one for the decision maker's posterior-predictive. We call this function the predictive generating function:

$$F_n(t) = \int_{x \in A} a(x)\, \frac{K(\tau_0 + n + 1,\ t + h(x))}{K(\tau_0 + n,\ t)}\, dx, \quad (2)$$

where $n$ is the relevant sample size and $t$ is the relevant sufficient statistic (Bernardo and Smith 2000). Note that when the random variable $x$ is discrete, the integral in (2), and other integrals like it throughout the paper, naturally become sums. With $n = 0$ and $t = \tau_1$, $F_0(\tau_1)$ is everyone's prior-predictive probability $P(y = 1)$, denoted by $p_0$. With $n = n_i + n_s$ and $t = \tau_1 + t_i + t_s$, where $t_i = \sum_{j=N_{i-1}+1}^{N_i} h(x_j)$ and $t_s = \sum_{j=N_k+1}^{N_k+n_s} h(x_j)$, $F_{n_i+n_s}(\tau_1 + t_i + t_s)$ is expert $i$'s posterior-predictive probability $P(y = 1 \mid \mathbf{x}_i, \mathbf{x}_s)$.

With $n = N_k + n_s$ and $t = \tau_1 + \sum_{i=1}^k t_i + t_s$, $F_{N_k+n_s}(\tau_1 + \sum_{i=1}^k t_i + t_s)$ is the decision maker's posterior-predictive probability $P(y = 1 \mid \mathbf{x}_1, \ldots, \mathbf{x}_k, \mathbf{x}_s)$, as if he had access to all the experts' private and shared information. In reality, the decision maker only hears $p_i$ from expert $i$, so the best he can do is infer $(t_1, \ldots, t_k, t_s)$, the sufficient statistics for $(\mathbf{x}_1, \ldots, \mathbf{x}_k, \mathbf{x}_s)$, from $(t_1 + t_s, \ldots, t_k + t_s)$. He can learn each $t_i + t_s$ from $p_i$ if $F_n$ is invertible. If it is invertible, then $t_i + t_s = F^{-1}_{n_i+n_s}(p_i) - \tau_1$.

In the case of private information only ($n_s = 0$), the conjugate-pair Bayesian ensemble is a generalized additive model in the experts' probabilities and a generalized linear model in the experts' sufficient statistics. For this result, we need the following two definitions. A generalized additive model links the conditional expectation of a quantity of interest $y$ to an additive function of some covariates $(q_1, \ldots, q_k)$: $E[y \mid q_1, \ldots, q_k] = g^{-1}(g_0 + g_1(q_1) + \cdots + g_k(q_k))$, where $g$ is the link function, $g_0$ is a constant, and each $g_i$ for $i = 1, \ldots, k$ is a smooth function (Hastie and Tibshirani 1986). A generalized linear model is a generalized additive model where each $g_i$ is a linear function (Nelder and Wedderburn 1972). Proofs of this and other results appear in the Appendix.

Proposition 1 (Private Information Only). Assume only private information is available to the experts ($n_i > 0$ for $i = 1, \ldots, k$ and $n_s = 0$) and the predictive generating function $F_n(t)$ in (2) is strictly monotonic in $t$. Then the conjugate-pair Bayesian ensemble of the experts' probabilities is a generalized additive model:

$$\hat{p} = P(y = 1 \mid p_1, \ldots, p_k) = F_{N_k}\Bigl( -(k-1)\, F_0^{-1}(p_0) + \sum_{i=1}^k F_{n_i}^{-1}(p_i) \Bigr), \quad (3)$$

where $p_0 = F_0(\tau_1)$, $p_i = F_{n_i}(\tau_1 + t_i)$, and $t_i = \sum_{j=N_{i-1}+1}^{N_i} h(x_j)$.
Also, the Bayesian ensemble of the experts' sufficient statistics is a generalized linear model:

$$P(y = 1 \mid t_1, \ldots, t_k) = F_{N_k}\Bigl( \tau_1 + \sum_{i=1}^k t_i \Bigr). \quad (4)$$

This result provides a large class of Bayesian ensembles. The class is as large as the class of regular, one-parameter exponential families. Any ensemble in this class is a generalized additive model of experts' probabilities that first transforms the experts' probabilities into their corresponding information states, then linearly combines these information states, and finally transforms the combined information states back into the probability space.

Below we provide four examples of conjugate-pair Bayesian ensembles. Our first example appears in Dawid et al. (1995), and our third example is a variant of the models studied in Ranjan and Gneiting (2010) and Satopää et al. (2016). The other two examples are new. For details on the derivation of each example's ensemble, e.g., each example's $\tau_1$, $h(x_j)$, and $F_n(t)$, see the Appendix.

Example 1 (Beta/Bernoulli Pair).
Let $x_j$ given $\theta$ be drawn from a Bernoulli distribution with probability $\theta$. The conjugate prior for this likelihood is the beta distribution with shape parameters $\alpha$ and $\beta$. Suppose $A = \{1\}$ corresponds to the event that a future borrower defaults on a loan. With the event occurrence set $A = \{1\}$, this conjugate pair leads to the Bayesian ensemble

$$\hat{p} = \Bigl(1 - \sum_{i=1}^k w_{n_i}\Bigr) p_0 + \sum_{i=1}^k w_{n_i}\, p_i, \quad\text{where } w_{n_i} = \frac{\alpha + \beta + n_i}{\alpha + \beta + N_k}. \;\square$$
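As a small numerical illustration of Example 1's ensemble (a sketch; the helper function and its defaults are our own), consider two experts who each privately see two data points and both report 3/4 when $\alpha = \beta = 1$:

```python
import numpy as np

def beta_bernoulli_ensemble(p, n, alpha=1.0, beta=1.0):
    """Conjugate-pair Bayesian ensemble of Example 1 (private information only).
    p: expert probabilities p_i; n: private sample sizes n_i."""
    p, n = np.asarray(p, float), np.asarray(n, float)
    p0 = alpha / (alpha + beta)                        # prior-predictive probability
    w = (alpha + beta + n) / (alpha + beta + n.sum())  # weight w_{n_i} on expert i
    return (1.0 - w.sum()) * p0 + np.sum(w * p)

print(beta_bernoulli_ensemble([0.75, 0.75], [2, 2]))   # 0.833..., more extreme than the average 0.75
```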
Example 2 (Gamma/Poisson Pair). Let $x_j$ given $\theta$ be drawn from a Poisson distribution with rate $\theta$. The conjugate prior for this likelihood is the gamma distribution with shape $\alpha$ and rate $\beta$, denoted by $\theta \sim Ga(\alpha, \beta)$. Suppose $A = \{0\}$ corresponds to the event that a piece of equipment, with exponentially distributed interarrival times of breakdowns, does not break down in the next year. With $A = \{0\}$, this conjugate pair leads to the Bayesian ensemble

$$\hat{p} = \exp\Bigl( -(k-1)\,\frac{v_{N_k}}{v_0}\,\log(p_0) + \sum_{i=1}^k \frac{v_{N_k}}{v_{n_i}}\,\log(p_i) \Bigr), \quad\text{where } v_n = \log\Bigl(\frac{\beta+n}{\beta+n+1}\Bigr). \;\square$$
Example 3 (Normal/Normal Pair).
Let $x_j$ given $\theta$ be drawn from a normal distribution with mean $\theta$ and variance $\sigma^2$. The conjugate prior for this likelihood is another normal distribution with mean $\theta_0$ and variance $\sigma_0^2$. Suppose $A = (0, \infty)$ corresponds to the event that a new product makes a profit in its first year. With $A = (0, \infty)$, this conjugate pair leads to the Bayesian ensemble

$$\hat{p} = \Phi\Bigl( -(k-1)\sqrt{\tfrac{v_0}{v_{N_k}}}\,\Phi^{-1}(p_0) + \sum_{i=1}^k \sqrt{\tfrac{v_{n_i}}{v_{N_k}}}\,\Phi^{-1}(p_i) \Bigr),$$

where $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution and $v_n = (\sigma^2/\sigma_0^2 + n)(\sigma^2/\sigma_0^2 + n + 1)\,\sigma^2$. We call this ensemble a probit ensemble because the inverse link function in this generalized linear model is the standard normal cdf. $\square$
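A matching sketch for Example 3's probit ensemble (again our own illustrative code, for the private-information-only setting; the prior-predictive probability $p_0$ is passed in directly):

```python
import numpy as np
from scipy.stats import norm

def probit_ensemble(p, n, p0, sigma=1.0, sigma0=1.0):
    """Normal/normal conjugate-pair (probit) ensemble of Example 3."""
    p, n = np.asarray(p, float), np.asarray(n, float)
    r = sigma**2 / sigma0**2
    v = lambda m: (r + m) * (r + m + 1) * sigma**2     # v_n from Example 3
    k, N_k = len(p), n.sum()
    arg = (-(k - 1) * np.sqrt(v(0) / v(N_k)) * norm.ppf(p0)
           + np.sum(np.sqrt(v(n) / v(N_k)) * norm.ppf(p)))
    return norm.cdf(arg)

# theta_0 = -0.25 with sigma = sigma_0 = 1 gives p0 = Phi(-0.25 / sqrt(2)) ~ 0.43.
print(probit_ensemble([0.36, 0.70], [2, 2], p0=0.43))
```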
Example 4 (Generalized-Gamma/Gumbel Pair). Let $x_j$ given $\theta$ be drawn from a Gumbel distribution with location $\theta$ and scale $\sigma$. The conjugate prior for this likelihood is the reflection of the generalized gamma distribution in Ahuja and Nash (1967, Equation 2.7): $\exp(\theta/\sigma) \sim Ga(\alpha, \beta)$. Suppose $A = (-\infty, 0)$ corresponds to the event that a hedge-fund manager's best investment makes a loss in some year. With $A = (-\infty, 0)$, this conjugate pair leads to the Bayesian ensemble

$$\hat{p} = \Biggl( \frac{-(k-1)\,\dfrac{p_0^{v_0}}{1-p_0^{v_0}} + \sum_{i=1}^k \dfrac{p_i^{v_{n_i}}}{1-p_i^{v_{n_i}}}}{1 - (k-1)\,\dfrac{p_0^{v_0}}{1-p_0^{v_0}} + \sum_{i=1}^k \dfrac{p_i^{v_{n_i}}}{1-p_i^{v_{n_i}}}} \Biggr)^{1/v_{N_k}}, \quad\text{where } v_n = \frac{1}{\alpha+n}. \;\square$$

The ensembles in Examples 1 and 3 are depicted in Figures 1a and 1c, respectively. In both figures, we hold $p_1$ fixed and vary $p_2$. For Figure 1a, we assume $\alpha = \beta = 1$, two experts each privately see two data points, and expert 1 reports $p_1 = 3/4$.
For Figure 1c, we assume $\sigma = \sigma_0 = 1$, $\theta_0 = -0.25$, two experts each privately see two data points, and expert 1 reports $p_1 = F_{n_1}(\tau_1 + t_1) = F_2(-1.25) \approx 0.36$.

When the experts' private samples are of equal size ($n_i = n$), the following strict inequality holds for the beta/Bernoulli ensemble in Example 1:

$$\hat{p} = (1 - k w_n)\, p_0 + \sum_{i=1}^k w_n\, p_i = (1 - k w_n)\, p_0 + k w_n \bar{p} < (1 - k w_n)\bar{p} + k w_n \bar{p} = \bar{p}, \quad (5)$$

for $\bar{p} < p_0$. Similarly, for $\bar{p} > p_0$, the inequality in (5) is reversed. Thus, with private information only, the beta/Bernoulli ensemble always extremizes the average forecast. The probit ensemble, however, sometimes does not. Its region of anti-extremizing (in gray) is sizeable.

The intuition for why a Bayesian ensemble tends to extremize the average forecast is that the experts' forecasts each have some weight on the prior information, but when their forecasts are combined, the aggregate forecast only needs that weight once. This intuition explains the coefficient $-(k-1)$ in front of the term $F_0^{-1}(p_0)$ in (3). In other words, extremizing tends to be the result of removing the redundant weight on the prior information. Because the experts' probabilities are first non-linearly transformed inside the probit ensemble, it does not always extremize the average forecast. In the next subsection, we compare these two ensembles to their shared information counterparts to see what effect this information has on extremizing/anti-extremizing.

[Figure 1: Conjugate-Pair Bayesian Ensembles from Examples 1 and 5 (Beta/Bernoulli) and Examples 3 and 6 (Normal/Normal). Panels: (a) Beta/Bernoulli, private only; (b) Beta/Bernoulli, private and shared; (c) Normal/Normal, private only; (d) Normal/Normal, private and shared.]
The following result generalizes Proposition 1 to include the presence of shared information.
Proposition 2 (Private and Shared Information). Assume private and shared information are available to the experts ($n_i > 0$ for $i = 1, \ldots, k$ and $n_s > 0$) and the predictive generating function $F_n(t)$ in (2) is strictly monotonic in $t$. Then the conjugate-pair Bayesian ensemble of the experts' probabilities is given by

$$\hat{p} = \int F_{N_k+n_s}\Bigl( -(k-1)\, F_0^{-1}(p_0) + \sum_{i=1}^k F^{-1}_{n_i+n_s}(p_i) - (k-1)\, t_s \Bigr)\, f(t_s \mid p_1, \ldots, p_k)\, dt_s, \quad (6)$$

where $p_0 = F_0(\tau_1)$, $p_i = F_{n_i+n_s}(\tau_1 + t_i + t_s)$, and $f(t_s \mid p_1, \ldots, p_k)$ is the probability mass or density function of the shared sample's sufficient statistic $t_s$ conditional on the experts' reported probabilities.

In the presence of shared information, the decision maker can, at most, deduce each expert's sufficient statistic $t_i + t_s$ from $p_i$: $t_i + t_s = F^{-1}_{n_i+n_s}(p_i) - \tau_1$. So, in the integral (or sum) in (6), $f(t_s \mid p_1, \ldots, p_k)$ is equal to $f(t_s \mid t_1 + t_s, \ldots, t_k + t_s)$, which is not always tractable. The example below illustrates a situation where this conditional distribution is quite simple to evaluate.

Example 5 (Beta/Bernoulli Pair with Private and Shared Information).
Assume $\alpha = \beta = 1$, $x_1$ and $x_2$ are private information seen by experts 1 and 2, respectively, $x_3$ is shared information, and $x_4$ is related to the binary event $y$. The binary event $y$ (default or not) is 1 if $x_4 = 1$ and is 0 otherwise. The prior-predictive probability $p_0 = 1/2$. Suppose expert 1 reports $p_1 = 3/4$. Given this report, the decision maker can deduce that $x_1 + x_3 = 2$ (or $x_1 = 1$ and $x_3 = 1$): he knows expert 1 saw two defaults in her private and shared information. Based on expert 1's report, the decision maker also knows that $x_2 + x_3$ can only be either 1 or 2, leading to either $p_2 = 1/2$ or $p_2 = 3/4$. In either case, $f(t_s \mid p_1, \ldots, p_k)$ is a point mass because $P(x_3 = 1 \mid x_1 + x_3 = 2, x_2 + x_3 = 1) = P(x_3 = 1 \mid x_1 + x_3 = 2, x_2 + x_3 = 2) = 1$. Consequently, in this example, the beta/Bernoulli Bayesian ensemble with private and shared information becomes

$$\hat{p} = P(y = 1 \mid p_1 = 3/4, p_2) = \frac{4 p_2 + 1}{5}. \;\square \quad (7)$$

In Figure 1, we compare Example 5's ensemble with private and shared information to Example 1's ensemble with private information only. In both examples, the experts see two data points the decision maker does not see. Turning one piece of private information into shared information, as we move from Figure 1a to Figure 1b, we see a reduction in the degree of extremizing. In Figure 1b, we also see a point of anti-extremizing (in gray).

Note that $f(t_s \mid p_1, \ldots, p_k)$ is not always the point mass we see in Example 5. If $p_1$ were 1/2 there, then the decision maker could be uncertain about the event $x_3 = 1$ and the Bayesian ensemble would be a mixture. In this case, the sum in (6) would contain two terms when expert 2 reports 1/2, rather than the single term in (7). In general, this integral (or sum) is difficult to evaluate. For the normal/normal pair though, the Bayesian ensemble is not a mixture and has a tractable form.
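Equation (7) can be checked by brute-force enumeration of the exchangeable sequence $(x_1, x_2, x_3, x_4)$; the sketch below (our own code) conditions on the two reports and recovers $(4p_2 + 1)/5$ at both values expert 2 can report.

```python
from itertools import product
from fractions import Fraction
from math import comb

# With a Beta(1,1) prior, a 0/1 sequence with s ones among m exchangeable draws has
# marginal probability 1 / ((m + 1) * C(m, s)).
def seq_prob(bits):
    m, s = len(bits), sum(bits)
    return Fraction(1, (m + 1) * comb(m, s))

# Expert 1's report p_1 = 3/4 reveals x_1 + x_3 = 2; expert 2's report reveals x_2 + x_3.
def ensemble_given_reports(x2_plus_x3):
    num = den = Fraction(0)
    for x1, x2, x3, x4 in product((0, 1), repeat=4):
        if x1 + x3 == 2 and x2 + x3 == x2_plus_x3:
            pr = seq_prob((x1, x2, x3, x4))
            den += pr
            num += pr * x4            # y = x_4
    return num / den

print(ensemble_given_reports(1))  # 3/5 = (4*(1/2) + 1)/5, i.e., (7) at p_2 = 1/2
print(ensemble_given_reports(2))  # 4/5 = (4*(3/4) + 1)/5, i.e., (7) at p_2 = 3/4
```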
Proposition 3 (Private and Shared Information: Normal/Normal Pair). Assume private and shared information are available to the experts ($n_i > 0$ for $i = 1, \ldots, k$ and $n_s > 0$) and the conditions of Example 3's normal/normal conjugate pair hold. Then the normal/normal conjugate pair leads to the Bayesian ensemble of the experts' probabilities:

$$\hat{p} = \Phi\Biggl( \frac{-(k-1)\sqrt{v_0}\,\Phi^{-1}(p_0) + \sum_{i=1}^k \sqrt{v_{n_i+n_s}}\,\Phi^{-1}(p_i) - (k-1)\, E[t_s \mid p_1, \ldots, p_k]}{\sqrt{(k-1)^2\, \mathrm{Var}[t_s \mid p_1, \ldots, p_k] + v_{N_k+n_s}}} \Biggr), \quad (8)$$

where $v_n = (\sigma^2/\sigma_0^2 + n)(\sigma^2/\sigma_0^2 + n + 1)\,\sigma^2$. The conditional mean $E[t_s \mid p_1, \ldots, p_k]$ is linear in $(\Phi^{-1}(p_1), \ldots, \Phi^{-1}(p_k))$, and the conditional variance $\mathrm{Var}[t_s \mid p_1, \ldots, p_k]$ is constant in $(p_1, \ldots, p_k)$. (The conditional moments are given in the proof of this result.)

Similar to the probit ensemble with private information only, the ensemble in the result above is a linear combination of $(\Phi^{-1}(p_0), \Phi^{-1}(p_1), \ldots, \Phi^{-1}(p_k))$ inside the standard normal cdf. Therefore, we call the ensemble in (8) a probit ensemble as well. The two probit ensembles differ only in the weights in their respective linear combinations. These weights determine the degree of extremizing, as the next example illustrates.

Example 6 (Normal/Normal Pair with Private and Shared Information).
Assume $\theta_0 = -0.25$ and $\sigma = \sigma_0 = 1$, $x_1$ and $x_2$ are private information seen by experts 1 and 2, respectively, $x_3$ is shared information, and $x_4$ is related to the binary event $y$. The binary event $y$ is 1 if $x_4 > 0$. Suppose expert 1 reports $p_1 = F_{n_1+n_s}(\tau_1 + t_1 + t_s) = F_2(-1.25) \approx 0.36$. Then we have the Bayesian ensemble depicted in Figure 1d for all possible reports from the second expert. Compare this ensemble with private and shared information to Example 3's ensemble with private information only (Figure 1c). The presence of shared information in Figure 1d again reduces the degree of extremizing we see in Figure 1c. The region of anti-extremizing enlarges as one piece of private information becomes shared information. $\square$
For other settings of the conjugate-pair ensembles above, we get similar results. The ensembles tend to extremize the average forecast, although not always, even when there is no shared information. As more of the information is shared, the degree of extremizing reduces and the region of anti-extremizing enlarges. Anti-extremizing emerges (or increases) because of the reduction in the overall sample size the decision maker uses to aggregate the experts' forecasts, as the experts share more information and see less information privately. The smaller sample size means the decision maker is less confident in his aggregate forecast.
The conjugate-pair ensembles above, because they are optimal aggregators under some reasonable assumptions, suggest an important structural property we would like to see in any aggregation method. An aggregation method should incorporate the prior-predictive probability $p_0$. The existing methods in Ranjan and Gneiting (2010), Turner et al. (2014), Baron et al. (2014), Satopää et al. (2014), and Satopää et al. (2016), however, assume $p_0$ is equal to one-half or do not incorporate $p_0$. In many applications though, such as in the two empirical studies of Section 4, $p_0$ (or the base rate of the $y$'s in a training set) will be far from one-half.

Ranjan and Gneiting (2010), Turner et al. (2014), and Baron et al. (2014) propose transformations of the linear opinion pool $p_{LOP} = \sum_{i=1}^k w_i p_i$, where $w_i > 0$ for $i = 1, \ldots, k$ and $\sum_{i=1}^k w_i = 1$. Ranjan and Gneiting (2010) study the beta-transformed linear opinion pool (BLOP): $p_{BLOP} = F_{Be(\alpha,\beta)}(p_{LOP})$, where $F_{Be(\alpha,\beta)}$ is the cdf of the beta distribution. Turner et al. (2014) and Baron et al. (2014) use the Karmarkar equation $p_{LOP}^a/(p_{LOP}^a + (1-p_{LOP})^a)$ to extremize the linear opinion pool. We call this aggregator the Karmarkar-transformed linear opinion pool (KLOP), denoted by $p_{KLOP}$. Both methods involve s-shaped transformations of the linear opinion pool. These two methods, fit to the same data, will be almost identical. For example, it is difficult to distinguish the Karmarkar-transformed linear opinion pool (for a suitable choice of $a$) from the beta-transformed linear opinion pool with $\alpha = \beta = 5$; see Figure 2a, where $p_{LOP} = \bar{p}$ and $p_0 = 1/2$.

Satopää et al. (2014) propose the logit aggregator $p_{logit} = F_{Lo(0,1)}\bigl(\sum_{i=1}^k (a/k)\, F^{-1}_{Lo(0,1)}(p_i)\bigr)$, where $F_{Lo(l,\psi)}(z) = 1/(1 + e^{-(z-l)/\psi})$ is the cdf of the logistic distribution with location $l$ and scale $\psi$ and $F^{-1}_{Lo(l,\psi)}(p) = l + \psi\log(p/(1-p))$ is the inverse of its cdf. Ranjan and Gneiting (2010) and Satopää et al. (2016) study versions of the probit ensemble in Example 3 with $p_0 = 1/2$. When $p_0 = 1/2$, the probit ensemble reduces to a generalized additive model with no intercept. These two ensembles are depicted in Figure 2b. The probit ensemble $\hat{p}$ in Figure 2b has the same settings as the one in Figure 1c (except for $\theta_0 = 0$). This probit ensemble and the logit aggregator with $a = 1.25$ are virtually indistinguishable.

Interestingly, both the logit aggregator and the probit ensemble in Figure 2b involve inverse s-shaped functions. In the next section, we will see that negative correlation between experts' information states results in an s-shaped probit ensemble. The experts' information states according to the probit ensemble of Proposition 3 are always positively correlated (see $v_{i,j}$ in the proof of Proposition 3). Because negative correlation among experts is rare in practice, this result calls into question the use of s-shaped ensembles.

[Figure 2: Ensembles with No Informative Prior-Predictive Probability Incorporated. (a) KLOP and BLOP. (b) Probit Ensemble and Logit Aggregator.]
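For reference, the two extremizing benchmarks discussed above can be written in a few lines (a sketch with equal weights by default; the function and parameter names are ours). Note that neither transformation involves the prior-predictive probability $p_0$, which is the structural limitation emphasized in this subsection.

```python
import numpy as np

def klop(p, a, w=None):
    """Karmarkar-transformed linear opinion pool."""
    p = np.asarray(p, float)
    w = np.full(len(p), 1.0 / len(p)) if w is None else np.asarray(w, float)
    p_lop = np.sum(w * p)
    return p_lop**a / (p_lop**a + (1.0 - p_lop)**a)

def logit_aggregator(p, a):
    """Logit aggregator: scale the average log-odds by a before transforming back."""
    p = np.asarray(p, float)
    z = (a / len(p)) * np.sum(np.log(p / (1.0 - p)))
    return 1.0 / (1.0 + np.exp(-z))
```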
3. Generalized Probit Ensemble
Below we introduce the generalized probit ensemble. For this ensemble, we first assume the experts' information states (e.g., sufficient statistics) are jointly normally distributed, as in the probit ensemble of Example 3, but with a general correlation structure. This general correlation structure can capture more complicated patterns of overlapping information sources, without having to work through detailed combinatorics (Clemen 1987). Second, we assume a generalized linear model of these information states. This second assumption is informed by the Bayesian ensemble in Proposition 1 and, in particular, by its generalized linear model form in (4).

At the outset, the decision maker places beliefs directly on the experts' information states, states that may result from experts seeing some observations in common. He assumes the experts' information states $\mathbf{x} = (x_1, \ldots, x_k)'$ are jointly normally distributed with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, denoted $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The covariance matrix has elements $\sigma_{ij}$, and the correlation between $x_i$ and $x_j$ is defined as $\rho_{ij} = \sigma_{ij}/(\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}})$.

The decision maker also assumes the conditional distribution of $y$ given the information states is given by the generalized linear model of the information states

$$P(y = 1 \mid x_1, \ldots, x_k) = F_z(\alpha_0 + \alpha_1 x_1 + \cdots + \alpha_k x_k), \quad (9)$$
where the inverse link function $F_z$ is the cdf of a continuous random variable $z$ with the entire real line as its support. In addition, he assumes the inverse link function $F_z$ is the standard cdf from a location-scale family, i.e., the random variable $z' = l + \psi z$ has the cdf $F_z((z' - l)/\psi)$, where $l$ is the location parameter and $\psi$ is the scale parameter. Finally, he assumes any expert $i$ reports the probability $p_i = P(y = 1 \mid x_i)$.

We do not restrict ourselves here to the standard normal inverse link function in the generalized linear model of experts' information states. We propose a member of any location-scale family for the inverse link function and later focus on the exponential-power distribution. The exponential-power inverse link function has the flexibility to be either the standard normal near one extreme or the uniform on the other extreme. Therefore, the generalized probit ensemble can take the form of either an inverse s-shape or a linear shape, as needed.

Proposition 4 (Generalized Probit Ensemble). Given the assumptions in this section, the optimal way to aggregate experts' forecasts is with the generalized probit ensemble

$$\hat{p} = P(y = 1 \mid p_1, \ldots, p_k) = F_z\Bigl( \beta_0\, F^{-1}_{z + \sqrt{v_0}\,x}(p_0) + \sum_{i=1}^k \beta_i\, F^{-1}_{z + \sqrt{v_i}\,x}(p_i) \Bigr), \quad (10)$$

where $p_0 = F_{z+\sqrt{v_0}\,x}(m)$, $p_i = F_{z+\sqrt{v_i}\,x}\bigl(m + \beta_i^{-1}\alpha_i(x_i - \mu_i)\bigr)$, $m = \alpha_0 + \sum_{i=1}^k \alpha_i\mu_i$,

$$\beta_0 = 1 - \sum_{i=1}^k \beta_i, \quad\text{and}\quad \beta_i = \frac{\alpha_i\sqrt{\sigma_{ii}}}{\sum_{j=1}^k \alpha_j\sqrt{\sigma_{jj}}\,\rho_{ij}} \quad\text{for } i = 1, \ldots, k.$$

Also, the cdf $F_{z+\sqrt{v_i}\,x}$ is the cdf of the sum of the independent random variables $z$ and $\sqrt{v_i}\,x$, where $z$ is the standard random variable from a location-scale family, $x$ is a standard normal random variable,

$$v_0 = \sum_{i=1}^k\sum_{j=1}^k \alpha_i\alpha_j\sigma_{ij}, \quad\text{and}\quad v_i = \sum_{j\neq i}\sum_{j'\neq i} \alpha_j\alpha_{j'}\sigma_{jj'} - \frac{\bigl(\sum_{j\neq i}\alpha_j\sigma_{ij}\bigr)^2}{\sigma_{ii}} \quad\text{for } i = 1, \ldots, k.$$

Immediately we see that the generalized probit ensemble in (10) is a generalized additive model of the experts' probabilities. The main benefit of this ensemble is that we do not need to work with a conjugate pair's predictive distribution, which can be difficult to do in all but a few cases.

Another immediate consequence of the result is that each expert is calibrated. An expert is calibrated if $P(y = 1 \mid p_i) = p_i$ (French 1986, Murphy and Winkler 1987). As was the case in the previous section, this calibration means that the decision maker, if he heard from only one expert, would use the same probability the expert reported. Note that if the expert is not calibrated, then the decision maker could re-calibrate the expert's reported probability using the re-calibration function $R_i(p_i)$ so that $P(y = 1 \mid x_i) = R_i(p_i)$. All results related to the generalized probit ensemble hold with $R_i(p_i)$ in place of $p_i$, if an expert is not calibrated.

To interpret the coefficients in the ensemble, it is helpful to consider the case of exchangeable experts. If the experts are exchangeable, then the weights are given by

$$\beta_0 = 1 - \frac{k}{(k-1)\rho + 1} < 0 \quad\text{and}\quad \beta_i = \frac{1}{(k-1)\rho + 1} > 0 \quad\text{for } i = 1, \ldots, k.$$

Note that for the exchangeable distribution of information states to be a proper distribution, we must have $\rho > -1/(k-1)$ so that $\boldsymbol{\Sigma}$ is positive definite.
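The exchangeable-expert weights are easy to compute directly; the short sketch below (illustrative only) shows how the weight on the prior-predictive probability becomes less negative, and hence the ensemble extremizes less, as the correlation $\rho$ grows.

```python
def exchangeable_weights(k, rho):
    """Weights for k exchangeable experts with pairwise correlation rho
    (requires rho > -1/(k-1) so that the covariance matrix is positive definite)."""
    beta_i = 1.0 / ((k - 1) * rho + 1.0)   # weight on each expert
    beta_0 = 1.0 - k * beta_i              # weight on the prior-predictive probability
    return beta_0, beta_i

print(exchangeable_weights(2, 0.75))  # (-0.142..., 0.571...): highly correlated states, milder extremizing
print(exchangeable_weights(2, 0.0))   # (-1.0, 1.0): uncorrelated information states, stronger extremizing
```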
Below we propose a family of inverse link functions to use when fitting the generalized probit ensemble to data. The family is based on the exponential-power distribution, also known as a generalized error or generalized Gaussian distribution (Subbotin 1923, Box and Tiao 1973, Mineo and Ruggieri 2005, Zhang et al. 2012). This family contains the normal and Laplace distributions as special cases. Before we give the family's general form, we look at the example of the normal distribution.
Example 7 (Normal Distribution).
Let the inverse link function $F_z$ in Proposition 4 be the standard normal cdf. In this case, $F_{z+\sqrt{v_i}\,x}(u) = \Phi\bigl(u/\sqrt{1+v_i}\bigr)$, so that $F^{-1}_{z+\sqrt{v_i}\,x}(p_i) = \sqrt{1+v_i}\,\Phi^{-1}(p_i)$. The probit ensemble, according to the assumptions in this section, is given by

$$\hat{p} = \Phi\Bigl( \beta_0\sqrt{1+v_0}\,\Phi^{-1}(p_0) + \sum_{i=1}^k \beta_i\sqrt{1+v_i}\,\Phi^{-1}(p_i) \Bigr). \;\square$$

To estimate this probit ensemble's coefficients $\beta_i' = \beta_i\sqrt{1+v_i}$ from data, one can fit a generalized linear model of $y$ on $(\Phi^{-1}(p_1), \ldots, \Phi^{-1}(p_k))$ using a standard normal inverse link function.
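To make the estimation step concrete, here is a minimal maximum-likelihood sketch (our own code, written with a generic optimizer rather than any particular GLM routine); the intercept plays the role of $\beta_0'\,\Phi^{-1}(p_0)$, which is constant across observations.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_probit_ensemble(P, y):
    """Probit-ensemble coefficients by maximum likelihood: regress y on Phi^{-1}(p_i).
    P: (n_obs, k) array of expert forecasts in (0, 1); y: array of 0/1 outcomes."""
    Z = np.column_stack([np.ones(len(y)), norm.ppf(P)])   # intercept + transformed forecasts

    def neg_log_lik(b):
        q = np.clip(norm.cdf(Z @ b), 1e-12, 1 - 1e-12)
        return -np.sum(y * np.log(q) + (1 - y) * np.log(1 - q))

    return minimize(neg_log_lik, x0=np.zeros(Z.shape[1]), method="BFGS").x

def predict_probit_ensemble(coefs, P):
    Z = np.column_stack([np.ones(len(P)), norm.ppf(P)])
    return norm.cdf(Z @ coefs)
```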
Example 8 (Exponential-Power Distribution). Let the inverse link function be the cdf $F_z$ with density

$$f_z(z) = \frac{1}{2\,\eta^{1/\eta}\,\Gamma(1 + 1/\eta)\,\psi}\exp\Biggl( -\frac{1}{\eta}\Bigl|\frac{z - l}{\psi}\Bigr|^{\eta} \Biggr).$$

This is the density of an exponential-power distribution. We denote this distribution by $z \sim EP(l, \psi, \eta)$, where $l$ is the mean, $\eta^{2/\eta}\,\psi^2\,\Gamma(3/\eta)/\Gamma(1/\eta)$ is the variance, and $\eta > 0$. $\square$
For the ensemble based on the exponential-power distribution, the inside function is $F_{z+\sqrt{v_i}\,x}$, which is not tractable (Soury and Alouini 2015). Nonetheless, we can approximate this inside function closely by matching the first two moments. We choose $v_i'$ in $\sqrt{v_i'}\, z_{EP(0,1,\eta)}$ so that the variance of this random variable equals the variance of $z_{EP(0,1,\eta)} + \sqrt{v_i}\,x$. Their means are both zero by construction. Consequently, the approximate generalized probit ensemble with an exponential-power inverse link function, denoted by $\tilde{p}$, is given by

$$\tilde{p} = F_{EP(0,1,\eta)}\Bigl( \beta_0\sqrt{v_0'}\, F^{-1}_{EP(0,1,\eta)}(p_0) + \sum_{i=1}^k \beta_i\sqrt{v_i'}\, F^{-1}_{EP(0,1,\eta)}(p_i) \Bigr). \quad (11)$$

This ensemble is useful in applications because one can estimate the coefficients $\beta_i' = \beta_i\sqrt{v_i'}$ by fitting a generalized linear model of $y$ on $(F^{-1}_{EP(0,1,\eta)}(p_1), \ldots, F^{-1}_{EP(0,1,\eta)}(p_k))$ using the exponential-power inverse link function $F_{EP(0,1,\eta)}$. For $\eta = 2$, the exponential-power distribution becomes the normal distribution, and the resulting ensemble is a probit ensemble. For $\eta = 1$, the exponential-power distribution becomes the Laplace distribution. For $\eta$ strictly less (greater) than 2, the distribution has fat (thin) tails. As $\eta \to \infty$, the distribution of $z_{EP(l,\psi,\eta)}$ goes to a uniform distribution on $(l - \psi, l + \psi)$ and the resulting ensemble approaches a linear ensemble, like the one in the beta/Bernoulli ensemble of Example 1.

In Figure 3, we show several approximate generalized probit ensembles of two exchangeable experts' forecasts. We plot these ensembles as a function of $p_2$ with $p_0 = 1/2$, $\alpha_i = 1$, $\sigma_{ii} = 1/20$, and $\rho = 3/4$. The weights on these positively correlated experts are $\beta_1 = \beta_2 = 0.57$. Figure 3a depicts three approximate generalized probit ensembles: the ensemble with $\eta = 1$, the probit ensemble ($\eta = 2$), and the ensemble with $\eta = 4$. As the power parameter increases, the ensemble becomes more linear. Also, we see in Figure 3b that the ensemble of positively correlated experts is inverse s-shaped, while the ensemble of negatively correlated ($\rho = -1/2$) experts is s-shaped.

[Figure 3: Approximate Generalized Probit Ensembles as the Power Parameter (η) and the Correlation Coefficient (ρ) Change. (a) More linear as η increases. (b) S-shaped with ρ < 0.]
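A sketch of the ensemble in (11) is given below. It relies on scipy's generalized normal distribution for the exponential-power cdf; the rescaling by $\eta^{1/\eta}$ reflects the $1/\eta$ factor in the exponent of Example 8's density, the coefficients are taken as already estimated, and the function names are ours.

```python
import numpy as np
from scipy.stats import gennorm

def F_ep(z, eta):
    """CDF of EP(0, 1, eta), whose density is proportional to exp(-|z|**eta / eta).
    scipy's gennorm uses exp(-|x|**beta), so EP(0, 1, eta) = gennorm(eta) scaled by eta**(1/eta)."""
    return gennorm.cdf(z, eta, scale=eta ** (1.0 / eta))

def F_ep_inv(p, eta):
    return gennorm.ppf(p, eta, scale=eta ** (1.0 / eta))

def approx_generalized_probit(p0, p, coefs, eta):
    """Approximate generalized probit ensemble of (11).
    coefs = (b0, b1, ..., bk) are the estimated coefficients beta_i' = beta_i * sqrt(v_i')."""
    b = np.asarray(coefs, float)
    z = b[0] * F_ep_inv(p0, eta) + np.sum(b[1:] * F_ep_inv(np.asarray(p, float), eta))
    return F_ep(z, eta)
```

Setting $\eta = 2$ recovers the probit ensemble of Example 7, while large $\eta$ produces the nearly linear ensembles seen in the empirical studies below.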
4. Empirical Studies
In this section, we present two empirical studies where we fit the generalized probit ensemble and compare its out-of-sample forecasting performance to several leading aggregation methods. The challenge in the first study is to predict defaults on loans acquired by Fannie Mae in 2007, just before the Great Recession. These data are available at https://loanperformancedata.fanniemae.com/lppub/index.html. In the Fannie Mae data, there are 1,056,724 records on acquired loans and 20 independent variables, such as the borrower's credit score, the home's loan-to-value, and the borrower's debt-to-income ratio. This year of acquisitions had the highest rate of defaults at 8.5% in the period 2000-2015. The second study's challenge
is to predict bad used-car buys by Carvana (a large used-car retailer) during the period 2009-2010. These data are available from Kaggle. In the Carvana data, there are 72,983 records on used-car purchases and 34 independent variables, such as the vehicle's age, the vehicle's odometer, the vehicle's model, the buyer's id number, and the auction's location. This prediction challenge was part of a Kaggle competition called Don't Get Kicked!. The base rate of defective cars in the training set from Kaggle is 12.3%.
In the machine learning literature, forecast aggregation is known as stacking (Wolpert 1992, Breiman 1996, Smyth and Wolpert 1999, Džeroski and Ženko 2004). The idea is that the predictions from several base models become features in a second-stage stacker model. Breiman (1996) and Smyth and Wolpert (1999), for example, both consider stacker models that are linear opinion pools of base models' probabilities. They choose optimal weights in a linear opinion pool that maximize the likelihood on a training set.

For each study, we trained three base models using the covariates available in the datasets. Each base model is a leading statistical or machine learning algorithm or part of a competition-winning ensemble. The first model is a regularized logistic regression (RLR), the lasso proposed by Tibshirani (1996). The second model is the random forest (RF) introduced by Breiman (2001).
Bayesian Ensembles of Binary-Event Forecasts
The third model is the extreme gradient boosted trees model called xgboost (XGB) (Chen and Guestrin 2016). This model is an extension of the gradient boosted trees model introduced by Friedman (2001). We could add more base models to our ensemble, but our goal here is to demonstrate the plausibility of our approach. The xgboost model, by itself, represents a difficult benchmark to beat. It was a part of 17 winning solutions published on Kaggle in 2015, and it was used by every team in the top 10 at KDD Cup 2015 (Chen and Guestrin 2016).

To ensure that an ensemble was trained on out-of-sample probabilities from the base models, we employed a two-step process for building a stacker model. First, we randomly split the data into 10 equal folds and used these folds for both steps. In Step 1, the base models were fit to the first nine folds of data using the available covariates (e.g., credit score and loan-to-value) and the outcomes of the binary event of interest (e.g., loan defaults). Then the base models were used to predict the binary event of interest in the tenth fold. Next the base models were trained on Folds 1-8 and 10 and used to predict the binary events of interest in Fold 9. This process continued until each fold was held out once and out-of-sample predictions were made for it. In Step 2, an ensemble, or stacker model, was trained on the out-of-sample predictions made by the base models in Step 1. For example, a stacker model was trained on Folds 1-9 and tested (or evaluated) on Fold 10. This training and testing of a stacker model was done 10 times, with each fold serving as the hold-out sample once. For more details on how these models were trained and tested, see the supplemental materials. All data and code are available from the authors.
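The two-step procedure can be sketched as follows (illustrative scikit-learn-style code; the helper name and the assumption that the data are numpy arrays and that each base model exposes fit and predict_proba are ours).

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def out_of_fold_probs(base_models, X, y, n_splits=10, seed=0):
    """Step 1: out-of-sample base-model probabilities, one column per base model.
    Each base model is fit on nine folds and predicts the held-out fold."""
    P = np.zeros((len(y), len(base_models)))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        for j, model in enumerate(base_models):
            fitted = clone(model).fit(X[train], y[train])
            P[test, j] = fitted.predict_proba(X[test])[:, 1]
    return P

# Step 2: a stacker (e.g., the approximate generalized probit ensemble) is then trained
# on P and y, again holding out each fold once for evaluation.
```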
To evaluate out-of-sample forecasts, we use three different scoring rules: (i) the log score (LS), (ii) the asymmetric log score (ALS), and (iii) the area under the curve (AUC). The first scoring rule is negatively oriented (a lower score is better), and the second and third scoring rules are positively oriented (a higher score is better). The log score of a probability forecast $p$ for a binary event $y$ is given by $LS(p, y) = -(y\log(p) + (1-y)\log(1-p))$ for $0 < p < 1$. This score is consistent with maximum likelihood estimation; the model (or ensemble) with the lowest average log score maximizes the log-likelihood. The asymmetric log score of a probability forecast $p$ for a binary event $y$ is given by $ALS(y, p) = (LS(c, y) - LS(p, y))/LS(c, I_{p>c})$, where $I_{p>c}$ equals 1 if $p > c$ and equals 0 otherwise (Winkler 1994). This score is "adjusted for the difficulty of the forecast task . . . with the value of $c \in (0, 1)$ adapted to reflect a baseline probability" (Gneiting and Raftery 2007, p. 365). In the results we report below, $c$ is taken to be the base rate of occurrence of $y$ (denoted $\bar{y}$) in the training set. The area under the curve is a popular score in the machine-learning community (Hand and Till 2001).
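The two log-score-based criteria are straightforward to compute (a vectorized sketch, with clipping added for numerical safety); the AUC is available in standard libraries, e.g., roc_auc_score in scikit-learn.

```python
import numpy as np

def log_score(p, y, eps=1e-12):
    """LS(p, y): negatively oriented log score of forecast p for binary outcome y."""
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def asymmetric_log_score(p, y, c):
    """ALS(y, p) with baseline c (here, the training-set base rate); positively oriented."""
    p = np.asarray(p, float)
    return (log_score(c, y) - log_score(p, y)) / log_score(c, (p > c).astype(float))
```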
Table 1 reports the average scores of out-of-sample predictions from the three base models ($p_{RLR}$, $p_{RF}$, and $p_{XGB}$), four existing aggregation models (the linear opinion pool with equal weights $\bar{p}$, the linear opinion pool with optimal weights $p_{OLOP}$, the beta-transformation of the linear opinion pool with optimal weights $p_{BLOP}$, and the logit aggregator $p_{logit}$), and the best approximate generalized probit ensemble with an exponential-power inverse link function, $\tilde{p}$ in (11). The best $\tilde{p}$ has the power parameter $\eta^*$ that minimizes the average log loss out-of-sample. The estimates of $\eta^*$ are 40 and 9 in the two studies, respectively. In both studies, the best approximate generalized probit ensemble outperforms all other models on average over the 10 cross-validation folds.

[Table 1: Average Scores of Out-of-Sample Predictions in the Two Studies. Rows: $p_{RLR}$, $p_{RF}$, $p_{XGB}$, $\bar{p}$, $p_{OLOP}$, $p_{BLOP}$, $p_{logit}$, $\tilde{p}$; columns: LS, ALS, and AUC for Fannie Mae and for Carvana.]
After performing 10-fold cross validation for each study, we estimate the coefficients in the generalized linear model for the best approximate generalized probit ensemble on the entire dataset. Table 2 lists these estimates. Not surprisingly, the ensemble puts the most weight on the best base model: xgboost. We also report in Table 2 the base rate at which the binary events occur in each dataset. In addition, we provide the percentage of times in out-of-sample forecasting that the best approximate generalized probit ensemble extremizes the average forecast.

Figure 4 depicts the best approximate generalized probit ensemble as a function of the best base model's forecasts, with the other two base models' forecasts set to twice the base rate in their respective dataset. For these settings, the ensembles lie somewhere in between all weight on the best base model (xgboost) and equal weights on the base models. In each plot, we also highlight the anti-extremizing region in gray. In addition, because the estimated power parameters are high in these studies, we can see that the ensembles are nearly linear over much of the domain of $p_{XGB}$.
[Table 2: Final Estimation of Best Approximate Generalized Probit Ensembles. Rows: constant (0.0496 for Fannie Mae, 0.0653 for Carvana), coefficients $\beta'_{RLR}$, $\beta'_{RF}$, $\beta'_{XGB}$, the power parameter $\eta^*$ (40 for Fannie Mae, 9 for Carvana), the percentage of forecasts for which $\tilde{p}$ extremizes $\bar{p}$, and the base rate $\bar{y}$.]

[Figure 4: Best Approximate Generalized Probit Ensemble as a Function of the Best Base Model's Forecast. (a) Study 1: Fannie Mae. (b) Study 2: Carvana.]
5. Summary and Conclusions
In this paper, we introduce a large class of Bayesian ensembles. Because the ensembles in this class are based on Bayesian reasoning, they can help us understand why, when, and how much extremizing is appropriate. It is appropriate to extremize the average forecast in order to remove the redundant reports of the prior-predictive probability in the experts' forecasts. When some information is shared by the experts, however, the decision maker may sometimes want to anti-extremize. Due to the shared information, the decision maker does not have as much total information upon which to base his aggregate forecast, and he may in turn issue a less confident aggregate forecast. This theoretical result matches the empirical findings of Tetlock and Gardner (2015). They also find that when teams are good at sharing information, like teams of superforecasters, extremizing does not help them much.

The two types of Bayesian ensembles we introduce here, conjugate-pair Bayesian ensembles and the generalized probit ensemble, take the form of a generalized additive model. In the first type, the link function is determined by a regular, one-parameter exponential family and its conjugate prior. This type is easy to interpret because of the natural sampling process that underlies the ensemble. In the second type, the inverse link function is any distribution from a location-scale family. This type is more flexible and therefore more suitable for applications. In practice, we find that the exponential-power distribution is a good choice for the inverse link function in the generalized probit ensemble. Within the family of generalized probit ensembles are the probit ensemble near one extreme and a linear ensemble at the other extreme, depending on where the power parameter is set. Using a generalized linear modeling framework, this power parameter can be tuned to fit real-world training data. Importantly, our ensembles include a constant term inside their generalized additive model forms. This constant term incorporates in the aggregate forecast the prior-predictive probability, e.g., the base rate of occurrences of the binary event in a training set.

In our two empirical studies, the generalized probit ensemble performed well in making out-of-sample predictions. It outperformed three base models and four leading aggregation methods. The base models in these studies were leading statistical and machine learning algorithms: the lasso, random forest, and xgboost. The latter two models are ensembles themselves and can be difficult to beat in practice, especially xgboost. In outperforming xgboost, it is important to stress that the Bayesian ensembles here are not replacements for an ensemble like xgboost. The Bayesian ensemble we fit used xgboost as one of its inputs. More importantly, the empirical results presented here demonstrate the plausibility of our generalized probit ensemble, not its superiority. We see several possible avenues for future work, including applications of the generalized probit ensemble to (possibly poorly calibrated) human judgments.

Appendix
In this appendix, we provide proofs of Propositions 1-4 and derivations of Examples 1-4. We also state a lemma and its proof. This lemma is useful in the proof of Proposition 4.
Proof of Proposition 1.
By Bernardo and Smith (2000, Prop. 5.5), expert $i$'s forecast of $x$ given $\mathbf{x}_i$ is given by

$$f(x \mid \mathbf{x}_i) = a(x)\,\frac{K(\tau_0 + n_i + 1,\ \tau_1 + t_i + h(x))}{K(\tau_0 + n_i,\ \tau_1 + t_i)},$$

where $t_i = \sum_{j=N_{i-1}+1}^{N_i} h(x_j)$ is the sufficient statistic for $\mathbf{x}_i$. Expert $i$'s forecast of $(y \mid \mathbf{x}_i)$ is given by $P(y = 1 \mid \mathbf{x}_i) = \int_{x\in A} f(x \mid \mathbf{x}_i)\,dx$. Hence, $p_i = P(y = 1 \mid \mathbf{x}_i) = F_{n_i}(\tau_1 + t_i)$. The decision maker, if he had direct access to the experts' samples, would form the following forecast of $(y \mid \mathbf{x}_1, \ldots, \mathbf{x}_k)$:

$$P(y = 1 \mid \mathbf{x}_1, \ldots, \mathbf{x}_k) = \int_{x\in A} a(x)\,\frac{K\bigl(\tau_0 + \sum_{i=1}^k n_i + 1,\ \tau_1 + \sum_{i=1}^k t_i + h(x)\bigr)}{K\bigl(\tau_0 + \sum_{i=1}^k n_i,\ \tau_1 + \sum_{i=1}^k t_i\bigr)}\,dx = F_{N_k}\Bigl(\tau_1 + \sum_{i=1}^k t_i\Bigr). \quad (12)$$

The decision maker knows all the sample sizes because of the common knowledge assumption. The prior-predictive probability is given by

$$p_0 = P(y = 1) = \int_{x\in A} a(x)\,\frac{K(\tau_0 + 1,\ \tau_1 + h(x))}{K(\tau_0, \tau_1)}\,dx = F_0(\tau_1).$$

By the assumption that $F_n(t)$ is strictly monotonic in $t$, we can invert $F_n(t)$. By substitution according to $\tau_1 = F_0^{-1}(p_0)$ and $t_i = F_{n_i}^{-1}(p_i) - \tau_1$ into (12), we have the result.

Derivation of Example 1's Ensemble.
The Bernoulli distribution is a regular, one-parameter exponential family with pmf

$$f(x_j \mid \theta) = \theta^{x_j}(1-\theta)^{1-x_j} = (1-\theta)\exp\Bigl(\log\Bigl(\frac{\theta}{1-\theta}\Bigr)x_j\Bigr) \quad\text{for } x_j \in \{0, 1\},$$

where $a(x_j) = 1$, $b(\theta) = 1-\theta$, $c(\theta) = \log(\frac{\theta}{1-\theta})$, and $h(x_j) = x_j$. Its conjugate prior has the pdf

$$f(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = \frac{1}{K(\tau_0, \tau_1)}(1-\theta)^{\tau_0}\exp\Bigl(\log\Bigl(\frac{\theta}{1-\theta}\Bigr)\tau_1\Bigr) \quad\text{for } \theta \in (0, 1),$$

where $\tau_0 = \alpha + \beta - 2$, $\tau_1 = \alpha - 1$, and $K(\tau_0, \tau_1) = \Gamma(\tau_1+1)\Gamma(\tau_0-\tau_1+1)/\Gamma(\tau_0+2)$. With $A = \{1\}$,

$$F_n(t) = \int_{x\in A} a(x)\,\frac{K(\tau_0+n+1,\ t+h(x))}{K(\tau_0+n,\ t)}\,dx = \frac{\Gamma(t+2)\,\Gamma(\tau_0+n+1-(t+1)+1)\,/\,\Gamma(\tau_0+n+3)}{\Gamma(t+1)\,\Gamma(\tau_0+n-t+1)\,/\,\Gamma(\tau_0+n+2)} = \frac{t+1}{\alpha+\beta+n},$$

and $F^{-1}_{n_i}(p_i) = (\alpha+\beta+n_i)\,p_i - 1$.

Derivation of Example 2's Ensemble.
The Poisson distribution is a regular, one-parameter exponential family with pmf

$$f(x_j \mid \theta) = \frac{\theta^{x_j}\exp(-\theta)}{x_j!} = \frac{1}{x_j!}\exp(-\theta)\exp(\log(\theta)\,x_j) \quad\text{for } x_j \in \{0, 1, 2, \ldots\},$$

where $a(x_j) = 1/x_j!$, $b(\theta) = \exp(-\theta)$, $c(\theta) = \log(\theta)$, and $h(x_j) = x_j$. Its conjugate prior has the pdf

$$f(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\theta^{\alpha-1}\exp(-\beta\theta) = \frac{1}{K(\tau_0, \tau_1)}\,\theta^{\tau_1}\exp(-\tau_0\theta) \quad\text{for } \theta \in (0, \infty),$$

where $\tau_0 = \beta$, $\tau_1 = \alpha - 1$, and $K(\tau_0, \tau_1) = \Gamma(\tau_1+1)/\tau_0^{\tau_1+1}$. With $A = \{0\}$,

$$F_n(t) = \int_{x\in A} a(x)\,\frac{K(\tau_0+n+1,\ t+h(x))}{K(\tau_0+n,\ t)}\,dx = \frac{\Gamma(t+1)/(\tau_0+n+1)^{t+1}}{\Gamma(t+1)/(\tau_0+n)^{t+1}} = \exp\bigl(v_n(t+1)\bigr),$$

where $v_n = \log((\beta+n)/(\beta+n+1))$, and $F^{-1}_{n_i}(p_i) = \log(p_i)/v_{n_i} - 1$.

Derivation of Example 3's Ensemble.
This example’s normal distribution is a regular, one-parameter exponential family with pdf f ( x j | θ ) = 1 √ πσ exp (cid:18) − σ ( x j − θ ) (cid:19) = 1 √ πσ exp (cid:18) − x j σ (cid:19) exp (cid:18) − θ σ (cid:19) exp (cid:18) θσ x j (cid:19) for x j ∈ ( −∞ , ∞ ) , where a ( x ) = (1 / √ πσ ) exp( − x j / (2 σ )), b ( θ ) = exp (cid:0) − θ σ (cid:1) , c ( θ ) = θ , and h ( x j ) = x j . Its conjugateprior has the pdf f ( θ ) = 1 p πσ exp (cid:18) − σ ( θ − θ ) (cid:19) = 1 p πσ exp (cid:18) − θ σ (cid:19)(cid:18) exp (cid:18) − θ σ (cid:19)(cid:19) τ exp (cid:18) θσ τ (cid:19) for θ ∈ ( −∞ , ∞ ) , where τ = σ /σ , τ = σ θ /σ , and K ( τ , τ ) = p (2 πσ ) /τ exp( τ / (2 σ τ )). With A = (0 , ∞ ), F n ( t ) = Z x ∈ A a ( x ) K ( τ + n + 1 , t + h ( x )) K ( τ + n, t ) dx = Z ∞ q π τ + n +1 τ + n σ exp (cid:18) − τ + n σ ( τ + n + 1) (cid:18) x − tτ + n (cid:19) (cid:19) dx = 1 − Φ (cid:18) − tτ + n q τ + n +1 τ + n σ (cid:19) = Φ (cid:18) tτ + n q τ + n +1 τ + n σ (cid:19) = Φ (cid:18) t √ v n (cid:19) , where v n = ( σ /σ + n )( σ /σ + n + 1) σ , and F − n i ( p i ) = √ v n i Φ − ( p i ). The second integrand in(13) is the posterior-predictive distribution for the normal/normal model with known precision inBernardo and Smith (2000, p. 439). For example, with t = τ + t i and n = n i , this integrand isexpert i ’s posterior-predictive distribution of x . Derivation of Example 4’s Ensemble.
Derivation of Example 4's Ensemble.
This example's Gumbel distribution is a regular, one-parameter exponential family with pdf
$$
f(x_j \mid \theta) = \frac{1}{\sigma}\exp\!\Big(-\frac{x_j-\theta}{\sigma}\Big)\exp\!\Big(-\exp\!\Big(-\frac{x_j-\theta}{\sigma}\Big)\Big) = \frac{1}{\sigma}\exp\!\Big(-\frac{x_j}{\sigma}\Big)\exp\!\Big(\frac{\theta}{\sigma}\Big)\exp\!\Big(-\exp\!\Big(-\frac{x_j-\theta}{\sigma}\Big)\Big) \quad \text{for } x_j \in (-\infty,\infty),
$$
where $a(x_j) = \exp(-x_j/\sigma)/\sigma$, $b(\theta) = \exp(\theta/\sigma)$, $c(\theta) = -\exp(\theta/\sigma)$, and $h(x_j) = \exp(-x_j/\sigma)$. Its conjugate prior has the pdf
$$
f(\theta) = \frac{\beta^{\alpha}}{\sigma\,\Gamma(\alpha)}\Big(\exp\!\Big(\frac{\theta}{\sigma}\Big)\Big)^{\alpha}\exp\!\Big(-\beta\exp\!\Big(\frac{\theta}{\sigma}\Big)\Big) \quad \text{for } \theta \in (-\infty,\infty),
$$
where $\tau_0 = \alpha$, $\tau_1 = \beta$, and $K(\tau_0,\tau_1) = \Gamma(\tau_0)\,\tau_1^{-\tau_0}\,\sigma$. With $A = (-\infty, 0)$,
$$
F_n(t) = \int_{x\in A} a(x)\,\frac{K(\tau_0+n+1,\ t+h(x))}{K(\tau_0+n,\ t)}\,dx = \int_{-\infty}^{0}\frac{1}{\sigma}\exp\!\Big(-\frac{x}{\sigma}\Big)\frac{\Gamma(\alpha+n+1)\big(t+\exp(-x/\sigma)\big)^{-(\alpha+n+1)}}{\Gamma(\alpha+n)\,t^{-(\alpha+n)}}\,dx
$$
$$
= t^{\alpha+n}\int_{-\infty}^{0}\frac{\alpha+n}{\sigma}\exp\!\Big(-\frac{x}{\sigma}\Big)\Big(t+\exp\!\Big(-\frac{x}{\sigma}\Big)\Big)^{-(\alpha+n+1)}dx = t^{\alpha+n}\Big[\Big(t+\exp\!\Big(-\frac{x}{\sigma}\Big)\Big)^{-(\alpha+n)}\Big]_{-\infty}^{0} = \Big(\frac{t}{t+1}\Big)^{\alpha+n},
$$
and $F_{n_i}^{-1}(p_i) = p_i^{v_{n_i}}/(1 - p_i^{v_{n_i}})$, where $v_n = 1/(\alpha+n)$.
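For completeness, a numerical sketch of Example 4's ensemble under assumed values of $\alpha$, the prior probability, and the sample sizes:

```python
# Example 4 (Gumbel data with known scale, conjugate Gamma-type prior on exp(theta/sigma)).
def gumbel_ensemble(p, n, p0, alpha):
    v = lambda m: 1.0 / (alpha + m)                          # v_m
    F_inv = lambda ni, pi: pi**v(ni) / (1.0 - pi**v(ni))     # F_{n_i}^{-1}(p_i)
    tau1 = F_inv(0, p0)                                      # recovers tau_1 = beta from p_0
    t_sum = sum(F_inv(ni, pi) - tau1 for ni, pi in zip(n, p))
    Nk = sum(n)
    t = tau1 + t_sum
    return (t / (t + 1.0))**(alpha + Nk)                     # F_{N_k}(tau_1 + sum_i t_i)

print(gumbel_ensemble([0.6, 0.6], [5, 5], p0=0.5, alpha=2.0))  # about 0.61
```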
Proof of Proposition 2.
By Bernardo and Smith (2000, Prop. 5.5), expert $i$'s forecast of $x$ given the private sample $\mathbf{x}_i$ and the shared sample $\mathbf{x}_s$ is given by
$$
f(x \mid \mathbf{x}_i, \mathbf{x}_s) = a(x)\,\frac{K(\tau_0 + n_i + n_s + 1,\ \tau_1 + t_i + t_s + h(x))}{K(\tau_0 + n_i + n_s,\ \tau_1 + t_i + t_s)},
$$
where $t_i$ and $t_s$ are the sufficient statistics for the samples $\mathbf{x}_i$ and $\mathbf{x}_s$, respectively. Expert $i$'s forecast of $(y \mid \mathbf{x}_i, \mathbf{x}_s)$ is given by $P(y=1 \mid \mathbf{x}_i, \mathbf{x}_s) = \int_{x\in A} f(x \mid \mathbf{x}_i, \mathbf{x}_s)\,dx$. Hence, $p_i = P(y=1 \mid \mathbf{x}_i, \mathbf{x}_s) = F_{n_i+n_s}(\tau_1 + t_i + t_s)$. The decision maker, if he had direct access to the experts' private samples and the shared sample, would form the following forecast of $(y \mid \mathbf{x}_1, \dots, \mathbf{x}_k, \mathbf{x}_s)$:
$$
P(y=1 \mid \mathbf{x}_1, \dots, \mathbf{x}_k, \mathbf{x}_s) = \int_{x\in A} a(x)\,\frac{K\big(\tau_0 + \sum_{i=1}^k n_i + n_s + 1,\ \tau_1 + \sum_{i=1}^k t_i + t_s + h(x)\big)}{K\big(\tau_0 + \sum_{i=1}^k n_i + n_s,\ \tau_1 + \sum_{i=1}^k t_i + t_s\big)}\,dx = F_{N_k+n_s}\Big(\tau_1 + \sum_{i=1}^k t_i + t_s\Big). \qquad (13)
$$
The decision maker knows all the sample sizes because of the common knowledge assumption. From the proof of Proposition 1, the prior-predictive probability is given by $p_0 = P(y=1) = F_0(\tau_1)$. By the assumption that $F_n(t)$ is strictly monotonic in $t$, we can invert $F_n(t)$. Substituting $\tau_1 = F_0^{-1}(p_0)$ and $t_i = F_{n_i+n_s}^{-1}(p_i) - \tau_1 - t_s$ into (13), we have
$$
\begin{aligned}
P(y=1 \mid t_1, \dots, t_k, t_s) &= F_{N_k+n_s}\Big(\tau_1 + \sum_{i=1}^k t_i + t_s\Big)\\
&= F_{N_k+n_s}\Big(F_0^{-1}(p_0) + \sum_{i=1}^k\big(F_{n_i+n_s}^{-1}(p_i) - F_0^{-1}(p_0) - t_s\big) + t_s\Big)\\
&= F_{N_k+n_s}\Big(-(k-1)\big(F_0^{-1}(p_0) + t_s\big) + \sum_{i=1}^k F_{n_i+n_s}^{-1}(p_i)\Big)\\
&= P(y=1 \mid p_1, \dots, p_k, t_s).
\end{aligned}
$$
Consequently, because $P(y=1 \mid p_1, \dots, p_k) = E\big[E(y \mid p_1, \dots, p_k, t_s)\big]$, we have the result.
Proof of Proposition 3.
The first step in the proof is to derive $f(t_s \mid p_1, \dots, p_k)$, which is equivalent to $f(t_s \mid t_1+t_s, \dots, t_k+t_s)$ because $t_i + t_s = \sqrt{v_{n_i+n_s}}\,\Phi^{-1}(p_i) - \tau_1$ is a known function of $p_i$. The fact that $(t_s, t_1+t_s, \dots, t_k+t_s)$ is jointly normally distributed follows from two facts: (a) $(x_1, \dots, x_{N_k+n_s})$ is jointly normally distributed according to Bernardo and Smith (2000, Prop. 5.5(ii)), and (b) $(t_s, t_1+t_s, \dots, t_k+t_s)$ is a linear combination of $(x_1, \dots, x_{N_k+n_s})$. The requisite means are
$$
m_s = E[t_s] = E\Big[E\Big[\sum_{j=N_k+1}^{N_k+n_s} x_j \,\Big|\, \theta\Big]\Big] = E[n_s\theta] = n_s\theta_0, \qquad
m_i = E[t_i+t_s] = E\Big[E\Big[\sum_{j=N_{i-1}+1}^{N_i} x_j + \sum_{j=N_k+1}^{N_k+n_s} x_j \,\Big|\, \theta\Big]\Big] = E[(n_i+n_s)\theta] = (n_i+n_s)\theta_0.
$$
The variance of $t_i+t_s$ is given by
$$
v_{i,i} = \mathrm{Var}[t_i+t_s] = E\big[\mathrm{Var}[t_i+t_s \mid \theta]\big] + \mathrm{Var}\big[E[t_i+t_s \mid \theta]\big] = (n_i+n_s)\sigma^2 + \mathrm{Var}[(n_i+n_s)\theta] = (n_i+n_s)\sigma^2 + (n_i+n_s)^2\sigma_0^2.
$$
Similarly, the variance of $t_s$ is given by $v_{s,s} = \mathrm{Var}[t_s] = n_s\sigma^2 + n_s^2\sigma_0^2$. The covariance between $t_i+t_s$ and $t_j+t_s$ (for $i \neq j$) is given by
$$
\begin{aligned}
v_{i,j} = \mathrm{Cov}[t_i+t_s, t_j+t_s] &= E\big[\mathrm{Cov}[t_i+t_s, t_j+t_s \mid \theta]\big] + \mathrm{Cov}\big[E[t_i+t_s \mid \theta],\ E[t_j+t_s \mid \theta]\big]\\
&= E\big[\mathrm{Cov}[t_i,t_j \mid \theta] + \mathrm{Cov}[t_i,t_s \mid \theta] + \mathrm{Cov}[t_s,t_j \mid \theta] + \mathrm{Cov}[t_s,t_s \mid \theta]\big]\\
&\quad + \mathrm{Cov}\big[E[t_i \mid \theta] + E[t_s \mid \theta],\ E[t_j \mid \theta] + E[t_s \mid \theta]\big]\\
&= n_s\sigma^2 + \mathrm{Cov}[n_i\theta, n_j\theta] + \mathrm{Cov}[n_i\theta, n_s\theta] + \mathrm{Cov}[n_s\theta, n_j\theta] + \mathrm{Cov}[n_s\theta, n_s\theta]\\
&= n_s\sigma^2 + (n_in_j + n_in_s + n_sn_j + n_s^2)\sigma_0^2,
\end{aligned}
$$
which follows by the law of total covariance and the fact that $t_i$ and $t_j$ are conditionally independent given $\theta$. Similarly, the covariance between $t_s$ and $t_j+t_s$ is given by $v_{s,j} = \mathrm{Cov}[t_s, t_j+t_s] = n_s\sigma^2 + (n_sn_j + n_s^2)\sigma_0^2$.
Thus, $(t_s, t_1+t_s, \dots, t_k+t_s)' \sim N(\mathbf{m}, \mathbf{V})$ (West and Harrison 1997, p. 637). The mean vector is $\mathbf{m} = (\mathbf{m}_1, \mathbf{m}_2)'$, where $\mathbf{m}_1 = m_s$ and $\mathbf{m}_2 = (m_1, \dots, m_k)'$. The covariance matrix is
$$
\mathbf{V} = \begin{pmatrix} \mathbf{V}_{11} & \mathbf{V}_{12} \\ \mathbf{V}_{21} & \mathbf{V}_{22} \end{pmatrix},
$$
where $\mathbf{V}_{11} = v_{s,s}$, $\mathbf{V}_{12} = (v_{s,1}, \dots, v_{s,k})$, and $\mathbf{V}_{22}$ has elements $v_{i,j}$ for $i,j \in \{1,\dots,k\}$.
According to West and Harrison (1997, p. 637), $(t_s \mid t_1+t_s, \dots, t_k+t_s)$ is normally distributed with mean and variance
$$
E[t_s \mid t_1+t_s, \dots, t_k+t_s] = \mathbf{m}_1 + \mathbf{V}_{12}\mathbf{V}_{22}^{-1}\big((t_1+t_s, \dots, t_k+t_s)' - \mathbf{m}_2\big), \qquad
\mathrm{Var}[t_s \mid t_1+t_s, \dots, t_k+t_s] = \mathbf{V}_{11} - \mathbf{V}_{12}\mathbf{V}_{22}^{-1}\mathbf{V}_{21}.
$$
Hence, $(t_s \mid p_1, \dots, p_k)$ is normally distributed with mean and variance
$$
E[t_s \mid p_1, \dots, p_k] = \mathbf{m}_1 + \mathbf{V}_{12}\mathbf{V}_{22}^{-1}\Big(\big(\sqrt{v_{n_1+n_s}}\,\Phi^{-1}(p_1) - \tau_1, \dots, \sqrt{v_{n_k+n_s}}\,\Phi^{-1}(p_k) - \tau_1\big)' - \mathbf{m}_2\Big), \qquad
\mathrm{Var}[t_s \mid p_1, \dots, p_k] = \mathbf{V}_{11} - \mathbf{V}_{12}\mathbf{V}_{22}^{-1}\mathbf{V}_{21}.
$$
Note that $\tau_1 = \sigma^2\theta_0/\sigma_0^2$ and $p_0 = F_0(\tau_1) = \Phi(\tau_1/\sqrt{v_0})$, which implies that $\tau_1 = \sqrt{v_0}\,\Phi^{-1}(p_0)$ and $\theta_0 = (\sigma_0^2/\sigma^2)\sqrt{v_0}\,\Phi^{-1}(p_0)$.
The second step in the proof is to apply Proposition 2. Let $L = -(k-1)\sqrt{v_0}\,\Phi^{-1}(p_0) + \sum_{i=1}^k \sqrt{v_{n_i+n_s}}\,\Phi^{-1}(p_i)$. According to Proposition 2 and the derivation of Example 3, where $F_n(t) = \Phi(t/\sqrt{v_n})$, we have
$$
\begin{aligned}
P(y=1 \mid p_1, \dots, p_k) &= \int_{-\infty}^{\infty} P(y=1 \mid p_1, \dots, p_k, t_s)\,f(t_s \mid p_1, \dots, p_k)\,dt_s\\
&= \int_{-\infty}^{\infty} \Phi\!\Big(\frac{L - (k-1)t_s}{\sqrt{v_{N_k+n_s}}}\Big)\,f(t_s \mid p_1, \dots, p_k)\,dt_s\\
&= \int_{-\infty}^{\infty} \Big(1 - \Phi\!\Big(\frac{t_s - L/(k-1)}{\sqrt{v_{N_k+n_s}}/(k-1)}\Big)\Big)\,f(t_s \mid p_1, \dots, p_k)\,dt_s\\
&= 1 - \int_{-\infty}^{\infty}\Big(\int_{-\infty}^{t_s}\frac{k-1}{\sqrt{v_{N_k+n_s}}}\,\phi\!\Big(\frac{z - L/(k-1)}{\sqrt{v_{N_k+n_s}}/(k-1)}\Big)\,dz\Big)\,f(t_s \mid p_1, \dots, p_k)\,dt_s\\
&= 1 - \int\!\!\int_{z \le t_s}\frac{k-1}{\sqrt{v_{N_k+n_s}}}\,\phi\!\Big(\frac{z - L/(k-1)}{\sqrt{v_{N_k+n_s}}/(k-1)}\Big)\,f(t_s \mid p_1, \dots, p_k)\,dz\,dt_s\\
&= 1 - P(z \le t_s \mid p_1, \dots, p_k) = P(t_s \le z \mid p_1, \dots, p_k) = P(t_s - z \le 0 \mid p_1, \dots, p_k),
\end{aligned}
$$
where $(z \mid p_1, \dots, p_k) \sim N\big(L/(k-1),\ v_{N_k+n_s}/(k-1)^2\big)$ and $z$ and $t_s$ are independent given $(p_1, \dots, p_k)$. Consequently, $t_s - z \sim N\big(E[t_s \mid p_1, \dots, p_k] - L/(k-1),\ \mathrm{Var}[t_s \mid p_1, \dots, p_k] + v_{N_k+n_s}/(k-1)^2\big)$, which gives us the result.
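Because every quantity in this proof is available in closed form for the normal model, the resulting shared-information ensemble can be evaluated directly. The sketch below does so for hypothetical inputs; the prior probability, $\sigma$, $\sigma_0$, the private sample sizes, and the shared sample size $n_s$ are all assumptions of the illustration, and at least two experts are required.

```python
import numpy as np
from scipy.stats import norm

# Proposition 3's ensemble (normal model of Example 3 with a shared sample of size n_s).
def shared_info_probit_ensemble(p, n, n_s, p0, sigma, sigma0):
    k = len(p)                                               # requires k >= 2
    r = sigma**2 / sigma0**2
    v = lambda m: (r + m) * (r + m + 1.0) * sigma**2         # v_m from Example 3
    tau1 = np.sqrt(v(0)) * norm.ppf(p0)                      # tau_1 = sqrt(v_0) Phi^{-1}(p_0)
    theta0 = (sigma0**2 / sigma**2) * tau1                   # prior mean of theta

    # Joint normal moments of (t_s, t_1 + t_s, ..., t_k + t_s).
    m2 = np.array([(ni + n_s) * theta0 for ni in n])
    V22 = np.array([[(n[i] + n_s if i == j else n_s) * sigma**2
                     + (n[i] + n_s) * (n[j] + n_s) * sigma0**2
                     for j in range(k)] for i in range(k)])
    V12 = np.array([n_s * sigma**2 + n_s * (nj + n_s) * sigma0**2 for nj in n])
    V11 = n_s * sigma**2 + n_s**2 * sigma0**2

    # Conditional moments of t_s given the reported probabilities.
    obs = np.array([np.sqrt(v(ni + n_s)) * norm.ppf(pi) - tau1 for ni, pi in zip(n, p)])
    mean_ts = n_s * theta0 + V12 @ np.linalg.solve(V22, obs - m2)
    var_ts = V11 - V12 @ np.linalg.solve(V22, V12)

    # P(t_s - z <= 0), as in the last step of the proof.
    Nk = sum(n)
    L = -(k - 1) * tau1 + sum(np.sqrt(v(ni + n_s)) * norm.ppf(pi) for ni, pi in zip(n, p))
    mean_diff = mean_ts - L / (k - 1)
    var_diff = var_ts + v(Nk + n_s) / (k - 1)**2
    return norm.cdf(-mean_diff / np.sqrt(var_diff))

print(shared_info_probit_ensemble([0.6, 0.6], [5, 5], n_s=3, p0=0.5, sigma=1.0, sigma0=1.0))
```

In this sketch, increasing the assumed shared sample size moves the aggregate toward the individual reports, which is in line with the intuition that shared information warrants less extremizing.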
Lemma 1 and its Proof.
The result below provides key properties of the linear combination of information states. We use these properties in the proof of Proposition 4 below.
Lemma 1 (Linear Combination of Information States). Assume $\mathbf{x} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Then the following statements hold.
(i) The distribution of $\alpha_0 + \sum_{i=1}^k \alpha_i x_i$ is normal with mean $m = \alpha_0 + \sum_{i=1}^k \alpha_i\mu_i$ and variance $v = \boldsymbol{\alpha}\boldsymbol{\Sigma}\boldsymbol{\alpha}' = \sum_{i=1}^k\sum_{j=1}^k \alpha_i\alpha_j\sigma_{ij}$.
(ii) The joint distribution of $\alpha_0 + \sum_{j\neq i}\alpha_j x_j$ and $\alpha_i x_i$ is normal with mean $\mathbf{a} + \mathbf{A}_i\boldsymbol{\mu}$ and variance $\mathbf{A}_i\boldsymbol{\Sigma}\mathbf{A}_i'$, where
$$
\mathbf{a} = \begin{pmatrix}\alpha_0 \\ 0\end{pmatrix} \quad\text{and}\quad
\mathbf{A}_i = \begin{pmatrix}\alpha_1 & \cdots & \alpha_{i-1} & 0 & \alpha_{i+1} & \cdots & \alpha_k \\ 0 & \cdots & 0 & \alpha_i & 0 & \cdots & 0\end{pmatrix}.
$$
(iii) The conditional distribution of $\alpha_0 + \sum_{j\neq i}\alpha_j x_j$ given $\alpha_i x_i$ is normal with mean $m_i = \theta_{-i} + b_{-i,i}\,b_{i,i}^{-1}(\alpha_i x_i - \theta_i)$ and variance $v_i = b_{-i,-i} - b_{-i,i}\,b_{i,i}^{-1}\,b_{i,-i}$, where
$$
\begin{pmatrix}\theta_{-i} \\ \theta_i\end{pmatrix} = \mathbf{a} + \mathbf{A}_i\boldsymbol{\mu}, \qquad
\mathbf{B}_i = \begin{pmatrix}b_{-i,-i} & b_{-i,i} \\ b_{i,-i} & b_{i,i}\end{pmatrix} = \mathbf{A}_i\boldsymbol{\Sigma}\mathbf{A}_i',
$$
$$
b_{-i,-i} = \sum_{j\neq i}\sum_{j'\neq i}\alpha_j\alpha_{j'}\sigma_{jj'}, \qquad
b_{-i,i} = \sum_{j\neq i}\alpha_j\alpha_i\sigma_{ij}, \qquad
b_{i,i} = \alpha_i^2\sigma_{ii}.
$$
A second expression for the variance $v_i$ is $v_i = \sum_{j\neq i}\sum_{j'\neq i}\alpha_j\alpha_{j'}\sigma_{jj'} - \sigma_{ii}^{-1}\big(\sum_{j\neq i}\alpha_j\sigma_{ij}\big)^2$.
Proof. Statements (i) and (ii) follow directly from results in West and Harrison (1997, p. 637) on linear transformations of jointly distributed normal random variables. Statement (iii) follows from Statement (ii) and West and Harrison's (1997, p. 637) result on conditional distributions of jointly distributed normal random variables. The variance of $\alpha_0 + \sum_{i=1}^k \alpha_i x_i$ follows from the formula for the covariance of two linear transformations:
$$
v = \mathrm{Cov}\Big[\alpha_0 + \sum_{i=1}^k\alpha_i x_i,\ \alpha_0 + \sum_{i=1}^k\alpha_i x_i\Big] = \sum_{i=1}^k\sum_{j=1}^k\alpha_i\alpha_j\,\mathrm{Cov}[x_i, x_j] = \sum_{i=1}^k\sum_{j=1}^k\alpha_i\alpha_j\sigma_{ij}.
$$
The elements of $\mathbf{B}_i$ follow from the same formula:
$$
b_{-i,-i} = \mathrm{Cov}\Big[\alpha_0 + \sum_{j\neq i}\alpha_j x_j,\ \alpha_0 + \sum_{j\neq i}\alpha_j x_j\Big] = \sum_{j\neq i}\sum_{j'\neq i}\alpha_j\alpha_{j'}\,\mathrm{Cov}[x_j, x_{j'}] = \sum_{j\neq i}\sum_{j'\neq i}\alpha_j\alpha_{j'}\sigma_{jj'},
$$
$$
b_{-i,i} = \mathrm{Cov}\Big[\alpha_0 + \sum_{j\neq i}\alpha_j x_j,\ \alpha_i x_i\Big] = \sum_{j\neq i}\alpha_j\alpha_i\,\mathrm{Cov}[x_j, x_i] = \sum_{j\neq i}\alpha_j\alpha_i\sigma_{ij},
$$
and $b_{i,i} = \mathrm{Cov}[\alpha_i x_i, \alpha_i x_i] = \alpha_i^2\sigma_{ii}$, so that
$$
v_i = b_{-i,-i} - b_{-i,i}\,b_{i,i}^{-1}\,b_{i,-i} = b_{-i,-i} - \frac{b_{-i,i}^2}{b_{i,i}} = \sum_{j\neq i}\sum_{j'\neq i}\alpha_j\alpha_{j'}\sigma_{jj'} - \frac{\big(\sum_{j\neq i}\alpha_j\alpha_i\sigma_{ij}\big)^2}{\alpha_i^2\sigma_{ii}} = \sum_{j\neq i}\sum_{j'\neq i}\alpha_j\alpha_{j'}\sigma_{jj'} - \frac{\big(\sum_{j\neq i}\alpha_j\sigma_{ij}\big)^2}{\sigma_{ii}}.
$$
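A quick numerical check of the two expressions for $v_i$ in Lemma 1(iii), under arbitrary assumed values of $\boldsymbol{\alpha}$ and $\boldsymbol{\Sigma}$ (the specific numbers carry no meaning beyond the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
k, i = 3, 1                                       # condition on information state i (0-indexed)
alpha = np.array([0.5, 0.8, 0.3])                 # alpha_1, ..., alpha_k (alpha_0 drops out of v_i)
A = rng.normal(size=(k, k))
Sigma = A @ A.T                                   # an arbitrary positive-definite covariance

mask = np.arange(k) != i
b_mm = alpha[mask] @ Sigma[np.ix_(mask, mask)] @ alpha[mask]        # b_{-i,-i}
b_mi = alpha[i] * (alpha[mask] @ Sigma[mask, i])                    # b_{-i,i}
b_ii = alpha[i]**2 * Sigma[i, i]                                    # b_{i,i}

v_i = b_mm - b_mi**2 / b_ii                                         # first expression
v_i_alt = b_mm - (alpha[mask] @ Sigma[mask, i])**2 / Sigma[i, i]    # second expression
print(np.isclose(v_i, v_i_alt))                                     # True
```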
Proof of Proposition 4.
The conditional probability of $(y \mid \alpha_i x_i)$ is given by
$$
P(y=1 \mid \alpha_i x_i) = \int_{-\infty}^{\infty} f(x \mid \alpha_i x_i)\,F_z(x + \alpha_i x_i)\,dx,
$$
where $x = \alpha_0 + \sum_{j\neq i}\alpha_j x_j$, $f(x \mid \alpha_i x_i)$ is the normal density according to the joint normality assumption and Lemma 1(iii), and $P(y=1 \mid x_1, \dots, x_k) = F_z(x + \alpha_i x_i)$ comes from the generalized linear model assumption. We can rewrite this expression as
$$
P(y=1 \mid \alpha_i x_i) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{v_i}}\,\phi\!\Big(\frac{x - m_i}{\sqrt{v_i}}\Big)\Big(\int_{-\infty}^{x} f_z(z + \alpha_i x_i)\,dz\Big)dx = \int\!\!\int_{z\le x}\frac{1}{\sqrt{v_i}}\,\phi\!\Big(\frac{x - m_i}{\sqrt{v_i}}\Big)\,f_z(z + \alpha_i x_i)\,dz\,dx = P(z\le x) = P(z - x\le 0), \qquad (14)
$$
where on the right-hand side of the third equality, $z$ and $x \sim N(m_i, v_i)$ become random variables. By the assumption about the link function, the random variable $z$ is from a location-scale family with standard cdf $F_z$, location parameter $l_i = -\alpha_i x_i$, and scale parameter $\psi = 1$. Since the last integrand involves a product of these random variables' densities, $z$ and $x$ are independent, as are $z$ and $-x \sim N(-m_i, v_i)$. The random variable $-x$, being normally distributed, is also from a location-scale family, where $-x = \sqrt{v_i}\,x - m_i$ and $x \sim N(0,1)$. Hence,
$$
P(z - x \le 0) = P(z + l_i + \sqrt{v_i}\,x - m_i \le 0) = P(z + \sqrt{v_i}\,x \le m_i - l_i).
$$
Let $F_{z+\sqrt{v_i}x}(u) = P(z + \sqrt{v_i}\,x \le u)$, the cdf of the sum of the independent random variables $z$ and $\sqrt{v_i}\,x$. Because $m_i - l_i = \theta_{-i} - b_{-i,i}\,b_{i,i}^{-1}\theta_i + (b_{-i,i}\,b_{i,i}^{-1} + 1)\alpha_i x_i$, we have
$$
P(y=1 \mid \alpha_i x_i) = F_{z+\sqrt{v_i}x}\big(\theta_{-i} - b_{-i,i}\,b_{i,i}^{-1}\theta_i + (b_{-i,i}\,b_{i,i}^{-1} + 1)\alpha_i x_i\big),
$$
which is strictly monotonic in $x_i$.
By Bayes' Theorem and a change of variable, we have that $P(y=1 \mid q) = P(y=1 \mid x_i)$, where $q = \alpha_i x_i$:
$$
P(y=1 \mid q) = \frac{P(y=1)\,f_{q\mid y=1}(q \mid y=1)}{f_q(q)} = \frac{P(y=1)\,f_{q\mid y=1}(\alpha_i x_i \mid y=1)\,\alpha_i}{f_q(\alpha_i x_i)\,\alpha_i} = \frac{P(y=1)\,f_{x_i\mid y=1}(x_i \mid y=1)}{f_{x_i}(x_i)} = P(y=1 \mid x_i).
$$
Then, by the assumption that $p_i = P(y=1 \mid x_i)$, we have that $p_i = P(y=1 \mid \alpha_i x_i)$. Consequently,
$$
p_i = F_{z+\sqrt{v_i}x}\big(\theta_{-i} - b_{-i,i}\,b_{i,i}^{-1}\theta_i + (b_{-i,i}\,b_{i,i}^{-1} + 1)\alpha_i x_i\big) \iff \alpha_i x_i = \frac{F_{z+\sqrt{v_i}x}^{-1}(p_i) + b_{-i,i}\,b_{i,i}^{-1}\theta_i - \theta_{-i}}{b_{-i,i}\,b_{i,i}^{-1} + 1}.
$$
Substitution of this last expression into the assumed generalized linear model gives us
$$
\alpha_0 + \sum_{i=1}^k\alpha_i x_i = \alpha_0 + \sum_{i=1}^k\frac{b_{-i,i}\,b_{i,i}^{-1}\theta_i - \theta_{-i}}{b_{-i,i}\,b_{i,i}^{-1} + 1} + \sum_{i=1}^k\beta_i\,F_{z+\sqrt{v_i}x}^{-1}(p_i),
$$
where
$$
\beta_i = \frac{1}{b_{-i,i}\,b_{i,i}^{-1} + 1} = \frac{b_{i,i}}{b_{-i,i} + b_{i,i}} = \frac{\alpha_i^2\sigma_{ii}}{\sum_{j=1}^k\alpha_j\alpha_i\sigma_{ij}} = \frac{\alpha_i\sqrt{\sigma_{ii}}}{\sum_{j=1}^k\alpha_j\sqrt{\sigma_{jj}}\,\rho_{ij}}.
$$
Because $\theta_i = \alpha_i\mu_i$ and $\theta_{-i} = \alpha_0 + \sum_{j\neq i}\alpha_j\mu_j$, which implies that $\theta_i + \theta_{-i} = m$ from Lemma 1, we have that
$$
\alpha_0 + \sum_{i=1}^k\frac{b_{-i,i}\,b_{i,i}^{-1}\theta_i - \theta_{-i}}{b_{-i,i}\,b_{i,i}^{-1} + 1} = \alpha_0 + \sum_{i=1}^k(1-\beta_i)\theta_i - \sum_{i=1}^k\beta_i\theta_{-i} = \alpha_0 + \sum_{i=1}^k\theta_i - \sum_{i=1}^k\beta_i(\theta_i + \theta_{-i}) = \alpha_0 + \sum_{i=1}^k\alpha_i\mu_i - \sum_{i=1}^k\beta_i m = \Big(1 - \sum_{i=1}^k\beta_i\Big)m.
$$
From the expressions above, we can rewrite expert $i$'s reported probability as follows:
$$
\begin{aligned}
p_i &= F_{z+\sqrt{v_i}x}\big(\theta_{-i} - b_{-i,i}\,b_{i,i}^{-1}\theta_i + (b_{-i,i}\,b_{i,i}^{-1} + 1)\alpha_i x_i\big)\\
&= F_{z+\sqrt{v_i}x}\big(\theta_i + \theta_{-i} - (b_{-i,i}\,b_{i,i}^{-1} + 1)\theta_i + (b_{-i,i}\,b_{i,i}^{-1} + 1)\alpha_i x_i\big)\\
&= F_{z+\sqrt{v_i}x}\big(\theta_i + \theta_{-i} + (b_{-i,i}\,b_{i,i}^{-1} + 1)\alpha_i(x_i - \mu_i)\big)\\
&= F_{z+\sqrt{v_i}x}\Big(\alpha_0 + \sum_{i=1}^k\alpha_i\mu_i + \beta_i^{-1}\alpha_i(x_i - \mu_i)\Big).
\end{aligned}
$$
Finally, $P(y=1) = \int_{-\infty}^{\infty} f(w)\,F_z(w)\,dw$, where $w = \alpha_0 + \sum_{i=1}^k\alpha_i x_i$ and $f(w)$ is the normal density according to the joint normality assumption, the generalized linear model assumption, and Lemma 1(i). This expression can be simplified:
$$
P(y=1) = \int_{-\infty}^{\infty}\frac{1}{\sqrt{v}}\,\phi\!\Big(\frac{w - m}{\sqrt{v}}\Big)\Big(\int_{-\infty}^{w} f_z(z)\,dz\Big)dw = \int\!\!\int_{z\le w}\frac{1}{\sqrt{v}}\,\phi\!\Big(\frac{w - m}{\sqrt{v}}\Big)\,f_z(z)\,dz\,dw = P(z\le w) = P(z - w\le 0),
$$
where on the right-hand side of the third equality, $z$ and $w$ become independent random variables. Because $-w = \sqrt{v}\,x - m$ and $x \sim N(0,1)$, $P(z - w \le 0) = P(z + \sqrt{v}\,x \le m)$. Let $F_{z+\sqrt{v}x}(u) = P(z + \sqrt{v}\,x \le u)$. Then $m = F_{z+\sqrt{v}x}^{-1}(p_0)$ and $\beta_0 = 1 - \sum_{i=1}^k\beta_i$.
References
Ahuja, J C, S W Nash. 1967. The generalized Gompertz-Verhulst family of distributions. Sankhyā: The Indian Journal of Statistics, Series A (2) 141–156.
Ariely, D, W T Au, R H Bender, D V Budescu, C B Dietz, H Gu, T S Wallsten, G Zauberman. 2000. The effects of averaging subjective probability estimates between and within judges. Journal of Experimental Psychology: Applied (2) 130–146.
Baron, J, B A Mellers, P E Tetlock, E Stone, L H Ungar. 2014. Two reasons to make aggregated probability forecasts more extreme. Decision Analysis (2) 133–145.
Bernardo, J M, A F M Smith. 2000. Bayesian Theory. John Wiley & Sons, Ltd, Chichester, England.
Box, G E P, G C Tiao. 1973. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, MA.
Breiman, L. 1996. Bagging predictors. Machine Learning (2) 123–140.
Breiman, L. 2001. Random forests. Machine Learning (1) 5–32.
Chen, K, L R Fine, B A Huberman. 2004. Eliminating public knowledge biases in information-aggregation mechanisms. Management Science.
Chen, T, C Guestrin. 2016. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
Chipman, H A, E I George, R E McCulloch. 2007. Bayesian ensemble learning. B Schölkopf, J C Platt, T Hoffman, eds., Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 265–272.
Clemen, R T. 1987. Combining overlapping information. Management Science (3) 373–380.
Clemen, R T, R L Winkler. 1986. Combining economic forecasts. Journal of Business & Economic Statistics (1) 39–46.
Dawid, A P, M H DeGroot, J Mortera. 1995. Coherent combination of experts' opinions. Test (2) 263–313.
DeGroot, M H. 1974. Reaching a consensus. Journal of the American Statistical Association (345) 118–121.
DeGroot, M H, J Mortera. 1991. Optimal linear opinion pools. Management Science (5) 546–558.
Džeroski, S, B Ženko. 2004. Is combining classifiers with stacking better than selecting the best one? Machine Learning (3) 255–273.
Erev, I, T S Wallsten, D V Budescu. 1994. Simultaneous over- and underconfidence: The role of error in judgment processes. Psychological Review (3) 519–527.
French, S. 1980. Updating of belief in the light of someone else's opinion. Journal of the Royal Statistical Society, Series A (1) 43–48.
French, S. 1986. Calibration and the expert problem. Management Science (3) 315–321.
Friedman, J H. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics (5) 1189–1232.
Genest, C, J V Zidek. 1986. Combining probability distributions: A critique and an annotated bibliography. Statistical Science (1) 114–135.
Gneiting, T, A E Raftery. 2007. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association (477) 359–378.
Hand, D J, R J Till. 2001. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning (2) 171–186.
Hastie, T, R Tibshirani. 1986. Generalized additive models. Statistical Science (3) 297–318.
Hora, S C. 2004. Probability judgments for continuous quantities: Linear combinations and calibration. Management Science.
Karmarkar, U S. 1978. Subjectively weighted utility: A descriptive extension of the expected utility model. Organizational Behavior and Human Performance (1) 61–72.
Kim, O, S C Lim, K W Shaw. 2001. The inefficiency of the mean analyst forecast as a summary forecast of earnings. Journal of Accounting Research.
Larrick, R P, J B Soll. 2006. Intuitions about combining opinions: Misappreciation of the averaging principle. Management Science (1) 111–127.
Lindley, D V, A Tversky, R V Brown. 1979. On the reconciliation of probability assessments. Journal of the Royal Statistical Society, Series A (2) 146–180.
Marinovic, I, M Ottaviani, P N Sorensen. 2013. Forecasters' objectives and strategies. G Elliott, A Timmermann, eds., Handbook of Economic Forecasting, vol. 2. Elsevier, Amsterdam, Netherlands, 690–720.
McConway, K J. 1981. Marginalization and linear opinion pools. Journal of the American Statistical Association (374) 410–414.
Mellers, B, L Ungar, J Baron, J Ramos, B Gurcay, K Fincher, S E Scott, D Moore, P Atanasov, S A Swift, T Murray, E Stone, P E Tetlock. 2014. Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science (5) 1106–1115.
Mineo, A M, M Ruggieri. 2005. A software tool for the exponential power distribution: The normalp package. Journal of Statistical Software (4) 1–24.
Morris, P A. 1974. Decision analysis expert use. Management Science (9) 1233–1241.
Murphy, A H, R L Winkler. 1987. A general framework for forecast verification. Monthly Weather Review (7) 1330–1338.
Nelder, J A, R W M Wedderburn. 1972. Generalized linear models. Journal of the Royal Statistical Society, Series A (3) 370–385.
Raiffa, H, R Schlaifer. 1961. Applied Statistical Decision Theory. Harvard Business School, Boston, MA.
Ranjan, R, T Gneiting. 2010. Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (1) 71–91.
Satopää, V A, J Baron, D P Foster, B A Mellers, P E Tetlock, L H Ungar. 2014. Combining multiple probability predictions using a simple logit model. International Journal of Forecasting.
Satopää, V A, R Pemantle, L H Ungar. 2016. Modeling probability forecasts via information diversity. Journal of the American Statistical Association (516) 1623–1633.
Shlomi, Y, T S Wallsten. 2010. Subjective recalibration of advisors' probability estimates. Psychonomic Bulletin & Review (4) 492–498.
Smyth, P, D Wolpert. 1999. Linearly combining density estimators via stacking. Machine Learning (1-2) 59–83.
Soury, H, M S Alouini. 2015. New results on the sum of two generalized Gaussian random variables. IEEE, 1017–1021.
Stone, M. 1961. The opinion pool. The Annals of Mathematical Statistics (4) 1339–1342.
Subbotin, M T. 1923. On the law of frequency of error. Matematicheskii Sbornik (2) 296–301.
Tetlock, P, D Gardner. 2016. Superforecasting: The Art and Science of Prediction. Broadway Books, New York, NY.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (1) 267–288.
Turner, B M, M Steyvers, E C Merkle, D V Budescu, T S Wallsten. 2014. Forecast aggregation via recalibration. Machine Learning (3) 261–289.
West, M, J Harrison. 1997. Bayesian Forecasting and Dynamic Models, 2nd ed. Springer, New York, NY.
Winkler, R L. 1968. The consensus of subjective probability distributions. Management Science (2) B61–B75.
Winkler, R L. 1994. Evaluating probabilities: Asymmetric scoring rules. Management Science (11) 1395–1405.
Winkler, R L, V R R Jose. 2008. Comments on: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. Test.
Winkler, R L, R M Poses. 1993. Evaluating and combining physicians' probabilities of survival in an intensive care unit. Management Science (12) 1526–1543.
Wolpert, D H. 1992. Stacked generalization. Neural Networks (2) 241–259.
Zhang, Z, S Wang, D Liu, M I Jordan. 2012. EP-GIG priors and applications in Bayesian sparse learning. Journal of Machine Learning Research 13.