Probability aggregation in time-series: Dynamic hierarchical modeling of sparse expert beliefs
Ville A. Satopää, Shane T. Jensen, Barbara A. Mellers, Philip E. Tetlock, Lyle H. Ungar
The Annals of Applied Statistics
© Institute of Mathematical Statistics, 2014
University of Pennsylvania
Most subjective probability aggregation procedures use a single probability judgment from each expert, even though it is common for experts studying real problems to update their probability estimates over time. This paper advances into unexplored areas of probability aggregation by considering a dynamic context in which experts can update their beliefs at random intervals. The updates occur very infrequently, resulting in a sparse data set that cannot be modeled by standard time-series procedures. In response to the lack of appropriate methodology, this paper presents a hierarchical model that takes into account the expert's level of self-reported expertise and produces aggregate probabilities that are sharp and well calibrated both in- and out-of-sample. The model is demonstrated on a real-world data set that includes over 2300 experts making multiple probability forecasts over two years on different subsets of 166 international political events.
1. Introduction.
Experts' probability assessments are often evaluated on calibration, which measures how closely the frequency of event occurrence agrees with the assigned probabilities. For instance, consider all events that an expert believes will occur with a 60% probability. If the expert is well calibrated, 60% of these events will actually end up occurring. Even though several experiments have shown that experts are often poorly calibrated [see, e.g., Cooke (1991), Shlyakhter et al. (1994)], there are noteworthy exceptions. In particular, Wright et al. (1994) argue that higher self-reported expertise can be associated with better calibration.
Received September 2013; revised March 2014. Supported by a research contract to the University of Pennsylvania and the University of California from the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center, contract number D11PC20061.
Key words and phrases. Probability aggregation, dynamic linear model, hierarchical modeling, expert forecast, subjective probability, bias estimation, calibration, time series.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2014, Vol. 8, No. 2, 1256–1280. This reprint differs from the original in pagination and typographic detail.
Calibration by itself, however, is not sufficient for useful probability estimation. Consider a relatively stationary process, such as rain on different days in a given geographic region, where the observed frequency of occurrence in the last 10 years is 45%. In this setting an expert could always assign a constant probability of 0.45 and be well calibrated. This assessment, however, can be made without any subject-matter expertise. For this reason the long-term frequency is often considered the baseline probability: a naive assessment that provides the decision-maker very little extra information. Experts should make probability assessments that are as far from the baseline as possible. The extent to which their probabilities differ from the baseline is measured by sharpness [Gneiting et al. (2008), Winkler and Jose (2008)]. If the experts are both sharp and well calibrated, they can forecast the behavior of the process with high certainty and accuracy. Therefore, useful probability estimation should maximize sharpness subject to calibration [see, e.g., Raftery et al. (2005), Murphy and Winkler (1987)].

There is strong empirical evidence that bringing together the strengths of different experts by combining their probability forecasts into a single consensus, known as the crowd belief, improves predictive performance. Prompted by the many applications of probability forecasts, including medical diagnosis [Wilson et al. (1998), Pepe (2003)], political and socio-economic foresight [Tetlock (2005)], and meteorology [Sanders (1963), Vislocky and Fritsch (1995), Baars and Mass (2005)], researchers have proposed many approaches to combining probability forecasts [see, e.g., Ranjan and Gneiting (2010), Satopää et al. (2014a), Batchelder, Strashny and Romney (2010) for some recent studies, and Genest and Zidek (1986), Wallsten, Budescu and Erev (1997), Clemen and Winkler (2007), Primo et al. (2009) for a comprehensive overview].
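The two criteria above can be made concrete with a short sketch. The code below is our illustration, not code from the paper, and the function names are hypothetical: calibration is checked by comparing binned forecasts with the observed event frequencies, and sharpness is measured as the mean distance from a baseline rate.

```python
import numpy as np

def calibration_table(probs, outcomes, bins=10):
    """Empirical event frequency within each forecast-probability bin."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    table = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so that p = 1.0 is included.
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if mask.any():
            table.append((lo, hi, probs[mask].mean(), outcomes[mask].mean()))
    return table  # rows: (bin_lo, bin_hi, mean forecast, observed frequency)

def sharpness(probs, baseline):
    """Mean absolute distance of the forecasts from the baseline rate."""
    return float(np.mean(np.abs(np.asarray(probs, float) - baseline)))
```

A well-calibrated forecaster's observed frequencies track the mean forecast in each bin, while a sharp forecaster sits far from the baseline; a useful aggregator should score well on both at once.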
The general focus, however, has been on developing one-time aggregation procedures that consult the experts' advice only once before the event resolves. Consequently, many areas of probability aggregation still remain rather unexplored. For instance, consider investors aiming to assess whether a stock index will finish trading above a threshold on a given date. To maximize their overall predictive accuracy, they may consult a group of experts repeatedly over a period of time and adjust their estimate of the aggregate probability accordingly. Given that the experts are allowed to update their probability assessments, the aggregation should be performed by taking into account the temporal correlation in their advice.

This paper adds another layer of complexity by assuming a heterogeneous set of experts, most of whom only make one or two probability assessments over the hundred or so days before the event resolves. This means that the decision-maker faces a different group of experts every day, with only a few experts returning later on for a second round of advice. The problem at hand is therefore strikingly different from many time-series estimation
problems, where one has an observation at every time point, or almost every time point. As a result, standard time-series procedures like ARIMA [see, e.g., Mills (1991)] are not directly applicable. This paper introduces a time-series model that incorporates self-reported expertise and captures a sharp and well-calibrated estimate of the crowd belief. The model is highly interpretable and can be used for the following:

• analyzing under- and overconfidence in different groups of experts,
• obtaining accurate probability forecasts, and
• gaining question-specific quantities with easy interpretations, such as expert disagreement and problem difficulty.

This paper begins by describing our geopolitical database. It then introduces a dynamic hierarchical model for capturing the crowd belief. The model is estimated in a two-step procedure: first, a sampling step produces constrained parameter estimates via Gibbs sampling [see, e.g., Geman and Geman (1984)]; second, a calibration step transforms these estimates to their unconstrained equivalents via a one-dimensional optimization procedure. The model introduction is followed by the first evaluation section, which uses synthetic data to study how accurately the two-step procedure can estimate the crowd belief. The second evaluation section applies the model to our real-world geopolitical forecasting database. The paper concludes with a discussion of future research directions and model limitations.
2. Geopolitical forecasting data.
Forecasters were recruited from professional societies, research centers, alumni associations, science bloggers and word of mouth (n = 2365). Requirements included at least a Bachelor's degree and completion of psychological and political tests that took roughly two hours. These measures assessed cognitive styles, cognitive abilities, personality traits, political attitudes and real-world knowledge. The experts were asked to give probability forecasts (to the second decimal point) and to self-assess their level of expertise (on a 1-to-5 scale with 1 = Not At All Expert and 5 = Extremely Expert) on 166 geopolitical binary events taking place between September 29, 2011 and May 8, 2013. Each question was active for a period during which the participating experts could update their forecasts as frequently as they liked without penalty. The experts knew that their probability estimates would be assessed for accuracy using Brier scores. This incentivized them to report their true beliefs instead of attempting to game the system [Winkler and Murphy (1968)]. In addition to receiving $150 for meeting minimum participation requirements that did not depend on prediction accuracy, the experts received status rewards for their performance via leader-boards displaying Brier scores for the top 20 experts. Given that a typical expert participated only in a small subset of the 166 questions, the experts are considered indistinguishable conditional on the level of self-reported expertise.

Footnote: The Brier score is the squared distance between the probability forecast and the event indicator that equals 1.0 or 0.0 depending on whether the event happened or not, respectively. See Brier (1950) for the original introduction.

Table 1
Five-number summaries of our real-world data
Statistic | Min. | Q1 | Median | Mean | Q3 | Max.
[The numeric entries of this table were lost in extraction.]

The average number of forecasts made by a single expert in one day was around 0.017, and the average group-level response rate was around 13.5 forecasts per day. Given that the group of experts is large and diverse, the resulting data set is very sparse. Tables 1 and 2 provide relevant summary statistics on the data. Notice that the distribution of the self-reported expertise is skewed to the right and that some questions remained active longer than others. For more details on the data set and its collection see Ungar et al. (2012). To illustrate the data with some concrete examples, Figure 1(a) and 1(b) show scatterplots of the probability forecasts given for (a) Will the expansion of the European bailout fund be ratified by all 17 Eurozone nations before 1 November 2011? and (b)
Will the Nikkei 225 index finish trading at or above 9500 on 30 September 2011?
The points have been shaded according to the level of self-reported expertise and jittered slightly to make overlaps visible. The solid line gives the posterior mean of the calibrated crowd belief as estimated by our model. The surrounding dashed lines connect the point-wise 95% posterior intervals. Given that the European bailout fund was ratified before November 1, 2011 and that the Nikkei 225 index finished trading at around 8700 on September 30, 2011, the general trend of the probability forecasts tends to converge toward the correct answers. The individual experts, however, sometimes disagree strongly, with the disagreement persisting even near the closing dates of the questions.
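For reference, the Brier score used to evaluate the experts (the squared distance between forecast and outcome described in the footnote above) reduces to a few lines. This is a generic sketch, not the project's scoring code:

```python
def brier_score(forecasts, outcomes):
    """Mean squared distance between probability forecasts and 0/1 outcomes.
    Lower is better: a perfect forecaster scores 0, and always
    reporting 0.5 scores 0.25 regardless of the outcomes."""
    return sum((p - z) ** 2 for p, z in zip(forecasts, outcomes)) / len(forecasts)
```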
Table 2
Frequencies of the self-reported expertise (1 = Not At All Expert and 5 = Extremely Expert) levels across all the 166 questions in our real-world data

Expertise level: 1 2 3 4 5
Frequency (%): 25 [the remaining frequencies were lost in extraction]

Fig. 1. Scatterplots of the probability forecasts given for two questions in our data set. The solid line gives the posterior mean of the calibrated crowd belief as estimated by our model. The surrounding dashed lines connect the point-wise 95% posterior intervals.
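Before introducing the model, it helps to picture the shape of the data it must accommodate: each question has a ragged set of forecasts indexed by expert and day, with most (day, expert) pairs absent. A minimal sketch of such a sparse structure (the layout, keys and values here are hypothetical, not the study's actual storage format):

```python
from collections import defaultdict

# forecasts[k][(t, i)] = probability given by expert i on day t for question k.
# Most (t, i) pairs are absent, which is what makes the data set sparse.
forecasts = defaultdict(dict)
forecasts["eurozone-bailout"][(3, 17)] = 0.80
forecasts["eurozone-bailout"][(3, 42)] = 0.65
forecasts["nikkei-9500"][(10, 17)] = 0.30

def forecasts_on_day(forecasts, k, t):
    """All probabilities reported for question k on day t (possibly none)."""
    return [p for (day, _), p in forecasts[k].items() if day == t]
```

On most days a question receives only a handful of forecasts, and often none at all, which is why the model below must tolerate a varying, possibly empty, set of observations at each time point.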
3. Model.
Let p_{i,t,k} ∈ (0, 1) be the probability forecast given by the ith expert at time t for the kth question, where i = 1, ..., I_k, t = 1, ..., T_k, and k = 1, ..., K. Denote the logit probabilities by

Y_{i,t,k} = logit(p_{i,t,k}) = log( p_{i,t,k} / (1 − p_{i,t,k}) ) ∈ ℝ

and collect the logit probabilities for question k at time t into a vector Y_{t,k} = [Y_{1,t,k} Y_{2,t,k} ··· Y_{I_k,t,k}]ᵀ. Partition the experts into J groups based on some individual feature, such as self-reported expertise, with each group sharing a common multiplicative bias term b_j ∈ ℝ for j = 1, ..., J. Collect these bias terms into a bias vector b = [b_1 b_2 ··· b_J]ᵀ. Let M_k be an I_k × J matrix denoting the group memberships of the experts in question k; that is, if the ith expert participating in the kth question belongs to the jth group, then the ith row of M_k is the jth standard basis vector e_j. The bias vector b is assumed to be identical across all K questions. Under this notation, the model for the kth question can be expressed as

Y_{t,k} = M_k b X_{t,k} + v_{t,k},   (3.1)
X_{t,k} = γ_k X_{t−1,k} + w_{t,k},   (3.2)
X_{0,k} ∼ N(μ_0, σ_0²),

where (3.1) denotes the observed process, (3.2) shows the hidden process that is driven by the constant γ_k ∈ ℝ, and (μ_0, σ_0²) ∈ (ℝ, ℝ₊) are hyperparameters fixed a priori to 0 and 1, respectively. The error terms follow

v_{t,k} | σ_k² ~ i.i.d. N_{I_k}(0, σ_k² I_{I_k}),   w_{t,k} | τ_k² ~ i.i.d. N(0, τ_k²).

Therefore, the parameters of the model are b, σ_k², γ_k and τ_k² for k = 1, ..., K. Their prior distributions are chosen to be noninformative: p(b, σ_k² | X_k) ∝ σ_k⁻² and p(γ_k, τ_k² | X_k) ∝ τ_k⁻².

The hidden state X_{t,k} represents the aggregate logit probability for the kth event given all the information available up to and including time t. To make this more specific, let Z_k ∈ {0, 1} indicate whether the event associated with the kth question happened (Z_k = 1) or did not happen (Z_k = 0).
If {F_{t,k}}_{t=1}^{T_k} is a filtration representing the information available up to and including a given time point, then according to our model E[Z_k | F_{t,k}] = P(Z_k = 1 | F_{t,k}) = logit⁻¹(X_{t,k}). Ideally this probability maximizes sharpness subject to calibration [for technical definitions of calibration and sharpness see Ranjan and Gneiting (2010), Gneiting and Ranjan (2013)]. Even though a single expert is unlikely to have access to all the available information, a large and diverse group of experts may share a considerable portion of the available information. The collective wisdom of the group therefore provides an attractive proxy for F_{t,k}.

Given that the experts may believe in false information, hide their true beliefs or be biased for many other reasons, their probability assessments should be aggregated via a model that can detect potential bias, separate signal from noise and use the collective opinion to estimate X_{t,k}. In our model the experts are assumed to be, on average, a multiplicative constant b away from X_{t,k}. Therefore, an individual element of b can be interpreted as a group-specific systematic bias that labels the group either as overconfident [b_j ∈ (1, ∞)] or as underconfident [b_j ∈ (0, 1)]. Any remaining deviation from X_{t,k} is considered random noise. This noise is measured in terms of σ_k² and can be assumed to be caused by momentary over-optimism (or pessimism), false beliefs or other misconceptions.

The random fluctuations in the hidden process are measured by τ_k² and are assumed to represent changes or shocks to the underlying circumstances that ultimately decide the outcome of the event. The systematic component γ_k allows the model to incorporate a constant signal stream that drifts the hidden process. If the uncertainty in the question diminishes [γ_k ∈ (1, ∞)], the hidden process drifts to positive or negative infinity.
Alternatively, the hidden process can drift to zero, in which case any available information does not improve predictive accuracy [γ_k ∈ (0, 1)]. Given that uncertainty about an event typically diminishes as its resolution approaches, it is reasonable to expect γ_k ∈ (1, ∞) for all k = 1, ..., K.

Since for any future time T* ≥ t,

X_{T*,k} = γ_k^{T*−t} X_{t,k} + Σ_{i=t+1}^{T*} γ_k^{T*−i} w_{i,k} ∼ N( γ_k^{T*−t} X_{t,k}, τ_k² Σ_{i=t+1}^{T*} γ_k^{2(T*−i)} ),

the model can be used for time-forward prediction as well. The prediction for the aggregate logit probability at time T* is given by an estimate of γ_k^{T*−t} X_{t,k}. Naturally the uncertainty in this prediction grows in T*. To make such time-forward predictions, it is necessary to assume that the past population of experts is representative of the future population. This is a reasonable assumption because even though the future population may consist of entirely different individuals, on average the population is likely to look very similar to the past population. In practice, however, social scientists are generally more interested in an estimate of the current probability than the probability under unknown conditions in the future. For this reason, our analysis focuses on probability aggregation only up to the current time t.

For the sake of model identifiability, it is sufficient to share only one of the elements of b among the K questions. In this paper, however, all the elements of b are assumed to be identical across the questions because some of the questions in our real-world data set involve very few experts with the highest level of self-reported expertise. The model can be extended rather easily to estimate bias at a more general level. For instance, by assuming a hierarchical structure b_{ik} ∼ N(b_{j(i,k)}, σ²_{j(i,k)}), where j(i,k) denotes the self-reported expertise of the ith expert in question k, the bias can be estimated at an individual level. These estimates can then be compared across questions. Individual-level analysis was not performed in our analysis for two reasons.
First, most experts gave only a single prediction per problem, which makes accurate bias estimation at the individual level very difficult. Second, it is unclear how the individually estimated bias terms could be validated.

If the future event can take on M > 2 possible outcomes, the hidden state X_{t,k} is extended to a vector of size M − 1. One of the outcomes, say the Mth one, is chosen as the base case to ensure that the probabilities will sum to one at any given time point. Each of the remaining M − 1 outcomes is then tracked by its own component of the hidden state vector.
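To make the data-generating mechanism in (3.1)–(3.2) and the time-forward distribution concrete, here is a minimal simulation sketch. Every numeric value in it (γ_k, τ_k, σ_k, the bias vector and the sizes) is an illustrative assumption, not an estimate from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values for a single question k
T, n_experts, J = 100, 50, 5
gamma, tau, sigma = 1.01, 0.2, 1.0
b = np.array([0.5, 0.8, 1.0, 1.3, 1.6])    # one multiplicative bias per group

# Hidden process (3.2): X_t = gamma * X_{t-1} + w_t, with X_0 ~ N(0, 1)
X = np.empty(T)
X[0] = rng.normal(0.0, 1.0)
for t in range(1, T):
    X[t] = gamma * X[t - 1] + rng.normal(0.0, tau)

# Observed process (3.1): an expert in group g reports b_g * X_t plus noise
# on the logit scale; the group labels play the role of the matrix M_k.
groups = rng.integers(0, J, size=n_experts)
Y = b[groups][None, :] * X[:, None] + rng.normal(0.0, sigma, size=(T, n_experts))
probs = 1.0 / (1.0 + np.exp(-Y))           # forecasts on the probability scale

def forward_predict(x_t, gamma, tau2, steps):
    """Mean and variance of X_{t+steps} given X_t:
    mean = gamma**steps * x_t, var = tau2 * sum_{i=0}^{steps-1} gamma**(2*i)."""
    mean = gamma ** steps * x_t
    var = tau2 * sum(gamma ** (2 * i) for i in range(steps))
    return mean, var

mean, var = forward_predict(X[-1], gamma, tau ** 2, steps=10)
predicted_prob = 1.0 / (1.0 + np.exp(-mean))
```

The closed-form mean and variance in `forward_predict` follow directly from unrolling the recursion (3.2), which is what makes cheap time-forward prediction possible without further simulation.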
4. Model estimation.
This section introduces a two-step procedure, called
Sample-And-Calibrate (SAC), that captures a well-calibrated estimate of the hidden process without sacrificing the interpretability of our model.

4.1. Sampling step.
Given that the parameter settings (a·b, X_{t,k}/a, a²τ_k²) and (b, X_{t,k}, τ_k²) imply the same distribution for the observed process Y_{t,k} for any a > 0, the model as described by (3.1) and (3.2) is not identifiable. A well-known solution is to choose one of the elements of b, say b₃, as the reference point and fix b₃ = 1. In Section 5 we provide a guideline for choosing the reference point. Denote the constrained version of the model by

Y_{t,k} = M_k b(1) X_{t,k}(1) + v_{t,k},
X_{t,k}(1) = γ_k(1) X_{t−1,k}(1) + w_{t,k},
v_{t,k} | σ_k²(1) ~ i.i.d. N_{I_k}(0, σ_k²(1) I_{I_k}),
w_{t,k} | τ_k²(1) ~ i.i.d. N(0, τ_k²(1)),

where the trailing input notation, (a), signifies the value under the constraint b₃ = a. Given that this version is identifiable, estimates of the model parameters can be obtained. Denote the estimates by placing a hat on the parameter symbol. For instance, b̂(1) and X̂_{t,k}(1) represent the estimates of b(1) and X_{t,k}(1), respectively.

These estimates are obtained by first computing a posterior sample via Gibbs sampling and then taking the average of the posterior sample. The first step of our Gibbs sampler is to sample the hidden states via the Forward-Filtering-Backward-Sampling (FFBS) algorithm. FFBS first predicts the hidden states using a Kalman filter and then performs a backward sampling procedure that treats these predicted states as additional observations [see, e.g., Carter and Kohn (1994), Migon et al. (2005) for details on FFBS]. Given that the Kalman filter can handle varying numbers of forecasts, or even no forecasts, at different time points, it plays a very crucial role in our probability aggregation under sparse data.

Our implementation of the sampling step is written in C++ and runs quite quickly. To obtain 1000 posterior samples for 50 questions, each with 100 time points and 50 experts, takes about 215 seconds on a 1.7 GHz Intel
Core i5 computer. See the supplemental article for the technical details of the sampling steps [Satopää et al. (2014b)] and, for example, Gelman et al. (2003) for a discussion of the general principles of Gibbs sampling.

4.2. Calibration step.
Given that the model parameters can be estimated by fixing b₃ to any constant, the next step is to search for the constant that gives an optimally sharp and calibrated estimate of the hidden process. This section introduces an efficient procedure that finds the optimal constant without requiring any additional runs of the sampling step. First, assume that parameter estimates b̂(1) and X̂_{t,k}(1) have already been obtained via the sampling step described in Section 4.1. Given that for any β ∈ ℝ \ {0},

Y_{t,k} = M_k b(1) X_{t,k}(1) + v_{t,k} = M_k (b(1)β)(X_{t,k}(1)/β) + v_{t,k} = M_k b(β) X_{t,k}(β) + v_{t,k},

we have that b(β) = b(1)β and X_{t,k}(β) = X_{t,k}(1)/β. Recall that the hidden process X_{t,k} is assumed to be sharp and well calibrated. Therefore, b₃ can be estimated with the value of β that simultaneously maximizes the sharpness and calibration of X̂_{t,k}(1)/β. A natural criterion for this maximization is given by the class of proper scoring rules that combine sharpness and calibration [Gneiting et al. (2008), Buja, Stuetzle and Shen (2005)]. Due to the possibility of complete separation in any one question [see, e.g., Gelman et al. (2008)], the maximization must be performed over multiple questions. Therefore,

β̂ = arg max_{β ∈ ℝ \ {0}} Σ_{k=1}^{K} Σ_{t=1}^{T_k} S(Z_k, X̂_{t,k}(1)/β),   (4.1)

where Z_k ∈ {0, 1} is the event indicator for question k. The function S is a strictly proper scoring rule such as the negative Brier score [Brier (1950)]

S_BRI(Z, X) = −(Z − logit⁻¹(X))²

or the logarithmic score [Good (1952)]

S_LOG(Z, X) = Z log(logit⁻¹(X)) + (1 − Z) log(1 − logit⁻¹(X)).

The estimates of the unconstrained model parameters are then given by

X̂_{t,k} = X̂_{t,k}(1)/β̂,  b̂ = b̂(1)β̂,  τ̂_k² = τ̂_k²(1)/β̂²,  σ̂_k² = σ̂_k²(1),  γ̂_k = γ̂_k(1).

Notice that estimates of σ_k² and γ_k are not affected by the constraint.
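The calibration step in (4.1) is a one-dimensional search, so a generic scalar optimizer suffices. The sketch below is a hypothetical implementation using the Brier score; for simplicity it restricts the search to β > 0, whereas the paper allows any nonzero β:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def total_brier(beta, X_hat, Z):
    """Summed Brier score of the rescaled estimates logit^{-1}(X_hat[k][t] / beta).
    X_hat is a list of per-question arrays of constrained logit estimates,
    Z the corresponding list of 0/1 event indicators."""
    total = 0.0
    for x_k, z_k in zip(X_hat, Z):
        p = 1.0 / (1.0 + np.exp(-np.asarray(x_k, float) / beta))
        total += float(np.sum((z_k - p) ** 2))
    return total

def calibrate_beta(X_hat, Z, bounds=(1e-3, 10.0)):
    """One-dimensional search for the beta minimizing the total Brier score,
    which is equivalent to maximizing the negative Brier score in (4.1)."""
    res = minimize_scalar(total_brier, bounds=bounds, args=(X_hat, Z),
                          method="bounded")
    return res.x
```

Because the sampling step never has to be rerun, trying a different scoring rule only requires swapping the objective passed to the optimizer.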
5. Synthetic data results.
This section uses synthetic data to evaluate how accurately the SAC-procedure captures the hidden states and bias vector. The hidden process is generated from standard Brownian motion. More specifically, if Z_{t,k} denotes the value of a path at time t, then

Z_k = 1(Z_{T_k,k} > 0),
X_{t,k} = logit[ Φ( Z_{t,k} / √(T_k − t) ) ]

gives a sequence of calibrated logit probabilities for the event Z_k = 1. A hidden process is generated for K questions with a time horizon of T_k = 101. The questions involve 50 experts allocated evenly among five expertise groups. Each expert gives one probability forecast per day with the exception of time t = 101, when the event resolves. The forecasts are generated by applying bias and noise to the hidden process as described by (3.1). Our simulation study considers a three-dimensional grid of parameter values for σ, β and K [the specific grid values were lost in extraction], where β scales the bias vector b. Forty synthetic data sets are generated for each combination of σ, β and K values. The SAC-procedure runs for 200 iterations of which the first 100 are used for burn-in.

SAC under the Brier (SAC_BRI) and logarithmic score (SAC_LOG) are compared with the Exponentially Weighted Moving Average (EWMA). EWMA, which serves as a baseline, can be understood by first denoting the (expertise-weighted) average forecast at time t for the kth question with

p̄_{t,k} = Σ_{j=1}^{J} ω_j ( (1/|E_j|) Σ_{i ∈ E_j} p_{i,t,k} ),   (5.1)

where E_j refers to an index set of all experts in the jth expertise group and ω_j denotes the weight associated with the jth expertise group. The EWMA forecasts for the kth problem are then constructed recursively from

p̂_{t,k}(α) = p̄_{1,k} for t = 1, and p̂_{t,k}(α) = α p̄_{t,k} + (1 − α) p̂_{t−1,k}(α) for t > 1,

where α and ω are learned from the training set by

(α̂, ω̂) = arg min_{α, ω_j ∈ [0,1]} Σ_{k=1}^{K} Σ_{t=1}^{T_k} (Z_k − p̂_{t,k}(α, ω))²  s.t.  Σ_{j=1}^{J} ω_j = 1.

Table 3
Summary measures of the estimation accuracy under synthetic data. As EWMA does not produce an estimate of the bias vector, its accuracy on the bias term cannot be reported

Model | Quadratic loss | Absolute loss
Hidden process: SAC_BRI, SAC_LOG, EWMA; Bias vector: SAC_BRI, SAC_LOG
[The numeric entries of this table were lost in extraction.]
If p_{t,k} = logit⁻¹(X_{t,k}) and p̂_{t,k} is the corresponding probability estimated by the model, the model's accuracy in estimating the hidden process is measured with the quadratic loss, (p_{t,k} − p̂_{t,k})², and the absolute loss, |p_{t,k} − p̂_{t,k}|. Table 3 reports these losses averaged over all conditions, simulations and time points. The three competing methods, SAC_BRI, SAC_LOG and EWMA, estimate the hidden process with great accuracy. Based on other performance measures that are not shown for the sake of brevity, all three methods suffer from an increasing level of noise in the expert logit probabilities but can make efficient use of extra data.

Some interesting differences emerge from Figure 2, which shows the marginal effect of β on the average quadratic loss.

Fig. 2. The marginal effect of β on the average quadratic loss.

As can be expected, EWMA performs well when the experts are, on average, close to unbiased. Interestingly, SAC estimates the hidden process more accurately when the experts are overconfident (large β) than when they are underconfident (small β). To understand this result, assume that the experts in the third group are highly underconfident. Their logit probabilities are then expected to be closer to zero than the corresponding hidden states. After adding white noise to these expected logit probabilities, they are likely to cross to the other side of zero. If the sampling step fixes b₃ = 1, as it does in our case, the third group is treated as unbiased and some of the constrained estimates of the hidden states are likely to be on the other side of zero as well. Unfortunately, this discrepancy cannot be corrected by the calibration step, which is restricted to shifting the constrained estimates either closer to or further from zero but not across it. To maximize the likelihood of having all the constrained estimates on the right side of zero and hence avoiding the discrepancy, the reference point in the sampling step should be chosen with care. A helpful guideline is to fix the element of b that is a priori believed to be the largest.

The accuracy of the estimated bias vector is measured with the quadratic loss, (b_j − b̂_j)², and the absolute loss, |b_j − b̂_j|. Table 3 reports these losses averaged over all conditions, simulations and elements of the bias vector. Unfortunately, EWMA does not produce an estimate of the bias vector. Therefore, it cannot be used as a baseline for the estimation accuracy in this case. Given that the losses for SAC_BRI and SAC_LOG are quite small, they estimate the bias vector accurately.
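For illustration, the Brownian-motion construction of a calibrated hidden process used above can be sketched in a few lines. The seed, horizon and floating-point guard are our choices, not the paper's:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
T = 101                                  # time horizon T_k

# Standard Brownian motion path observed at t = 1, ..., T
Z_path = np.cumsum(rng.normal(0.0, 1.0, size=T))
Z_event = int(Z_path[-1] > 0)            # event indicator Z_k = 1(Z_{T_k,k} > 0)

# Calibrated probabilities: P(Z_T > 0 | Z_t) = Phi(Z_t / sqrt(T - t)),
# computed for t = 1, ..., T - 1 (at t = T the event has resolved).
t = np.arange(1, T)
p = norm.cdf(Z_path[:-1] / np.sqrt(T - t))
p = np.clip(p, 1e-12, 1.0 - 1e-12)       # guard against floating-point saturation
X = np.log(p / (1.0 - p))                # calibrated logit probabilities X_{t,k}
```

By construction these probabilities are exactly calibrated for the event that the path ends above zero, which is what makes them a clean ground truth for judging the aggregators.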
6. Geopolitical data results.
This section presents results for the real-world data described in Section 2. The goal is to provide application-specific insight by discussing the specific research objectives itemized in Section 1. First, however, we discuss two practical matters that must be taken into account when aggregating real-world probability forecasts.

6.1. Incoherent and imbalanced data.
The first matter regards human experts making probability forecasts of 0.0 or 1.0 even if they are not completely sure of the outcome of the event. For instance, all 166 questions in our data set contain both a zero and a one. Transforming such forecasts into the logit space yields infinities that can cause problems in model estimation. To avoid this, Ariely et al. (2000) suggest changing p = 0.00 and 1.00 to p = 0.02 and 0.98, respectively. This is similar to winsorising, which sets the extreme probabilities to a specified percentile of the data [see, e.g., Hastings et al. (1947) for more details on winsorising]. Allard, Comunian and Renard (2012), on the other hand, consider only probabilities that fall within a constrained interval, changing p = 0.00 and 1.00 to p = 0.01 and 0.99, respectively. Our results remain insensitive to the exact choice of
censoring as long as this is done in a reasonable manner to keep the extreme probabilities from becoming highly influential in the logit space.

The second matter is related to the distribution of the class labels in the data. If the set of occurrences is much larger than the set of nonoccurrences (or vice versa), the data set is called imbalanced. On such data the modeling procedure can end up over-focusing on the larger class and, as a result, give very accurate forecast performance over the larger class at the cost of performing poorly over the smaller class [see, e.g., Chen (2008), Wallace and Dahabreh (2012)]. Fortunately, it is often possible to use a well-balanced version of the data. The first step is to find a partition S₀ and S₁ of the question indices {1, 2, ..., K} such that the equality Σ_{k ∈ S₀} T_k = Σ_{k ∈ S₁} T_k is as closely approximated as possible. This is equivalent to an NP-hard problem known in computer science as the Partition Problem: determine whether a given set of positive integers can be partitioned into two sets such that the sums of the two sets are equal to each other [see, e.g., Karmarkar and Karp (1982), Hayes (2002)]. A simple solution is to use a greedy algorithm that iterates through the values of T_k in descending order, assigning each T_k to the subset that currently has the smaller sum [see, e.g., Kellerer, Pferschy and Pisinger (2004), Gent and Walsh (1996) for more details on the Partition Problem]. After finding a well-balanced partition, the next step is to assign the class labels such that the labels for the questions in S_x are equal to x for x = 0 or 1. Recall from Section 4.2 that Z_k represents the event indicator for the kth question. To define a balanced set of indicators Z̃_k for all k ∈ S_x, let

Z̃_k = x,
p̃_{i,t,k} = 1 − p_{i,t,k}, if Z_k = 1 − x,
p̃_{i,t,k} = p_{i,t,k}, if Z_k = x,

where i = 1, ..., I_k and t = 1, ..., T_k. The resulting set {(Z̃_k, {p̃_{i,t,k} | i = 1, ..., I_k, t = 1, ..., T_k})}_{k=1}^{K} is a balanced version of the data. This procedure was used to balance our real-world data set both in terms of events and time points. The final output splits the events exactly in half (|S₀| = |S₁| = 83) such that the numbers of time points in the first and second halves are 8737 and 8738, respectively.

6.2. Out-of-sample aggregation.
The goal of this section is to evaluate the accuracy of the aggregate probabilities made by SAC and several other procedures. The models are allowed to utilize a training set before making aggregations on an independent testing set. To clarify some of the upcoming notation, let S_train and S_test be index sets that partition the data into training and testing sets of sizes |S_train| = N_train and |S_test| = 166 − N_train, respectively. This means that the kth question is in the training set if and only if k ∈ S_train. Before introducing the competing models, note that all choices of thinning and burn-in made in this section are conservative and have been made based on pilot runs of the models. This was done to ensure a posterior sample that has low autocorrelation and arises from a converged chain. The competing models are as follows:

1. Simple Dynamic Linear Model (SDLM). This is equivalent to the dynamic model from Section 3 but with b = 1 (a vector of ones) and β = 1. Thus,

Y_{t,k} = X_{t,k} + v_{t,k},
X_{t,k} = γ_k X_{t−1,k} + w_{t,k},

where X_{t,k} is the aggregate logit probability. Given that this model does not share any parameters across questions, estimates of the hidden process can be obtained directly for the questions in the testing set without fitting the model first on the training set. The Gibbs sampler is run for 500 iterations of which the first 200 are used for burn-in. The remaining 300 iterations are thinned by discarding every other observation, leaving a final posterior sample of 150 observations. The average of this sample gives the final estimates.

2. The Sample-And-Calibrate procedure both under the Brier (SAC_BRI) and the Logarithmic score (SAC_LOG). The model is first fit on the training set by running the sampling step for 3000 iterations of which the first 500 iterations are used for burn-in. The remaining 2500 observations are thinned by keeping every fifth observation. The calibration step is performed for the final 500 observations. The out-of-sample aggregation is done by running the sampling step for 500 iterations with each consecutive iteration reading in and conditioning on the next value of β and b found during the training period. The first 200 iterations are used for burn-in. The remaining 300 iterations are thinned by discarding every other observation, leaving a final posterior sample of 150 observations. The average of this sample gives the final estimates.

3. A fully Bayesian version of
SAC
LOG (BSAC
LOG ). Denote the calibratedlogit probabilities and event indicators across all K questions with X (1)and Z , respectively. The posterior distribution of β conditional on X (1)is given by p ( β | X (1) , Z ) ∝ p ( Z | β, X (1)) p ( β | X (1)). The likelihood is p ( Z | β, X (1))(6.1) ∝ K Y k =1 T k Y t =1 logit − ( X t,k (1) /β ) Z k (1 − logit − ( X t,k (1) /β )) − Z k . As in Gelman et al. (2003), the prior for β is chosen to be locally uniform, p (1 /β ) ∝
1. Given that this model estimates X t,k (1) and β simultaneously,it is a little more flexible than SAC. Posterior estimates of β can be sam- ROBABILITY AGGREGATION IN TIME-SERIES pled from (6.1) using generic sampling algorithms such as the Metropolisalgorithm [Metropolis et al. (1953)] or slice sampling [Neal (2003)]. Giventhat the sampling procedure conditions on the event indicators, the fullconditional distribution of the hidden states is not in a standard form.Therefore, the Metropolis algorithm is also used for sampling the hiddenstates. Estimation is made with the same choices of thinning and burn-inas described under Sample-And-Calibrate .4. Due to the lack of previous literature on dynamic aggregation of expertprobability forecasts, the main competitors are exponentially weightedversions of procedures that have been proposed for static probabilityaggregation:(a)
Exponentially Weighted Moving Average (EWMA) as described inSection 5.(b)
Exponentially Weighted Moving Logit Aggregator (EWMLA). Thisis a moving version of the aggregator ˆ p G ( b ) that was introducedin Satop¨a¨a et al. (2014a). The EWMLA aggregate probabilities arefound recursively fromˆ p t,k ( α, b ) = (cid:26) G ,k ( b ) , for t = 1, αG t,k ( b ) + (1 − α )ˆ p t − ,k ( α, b ) , for t > b ∈ R J collects the bias terms of the expertisegroups, and G t,k ( ν ) = N t,k Y i =1 (cid:18) p i,t,k − p i,t,k (cid:19) b j ( i,k ) /N t,k !(cid:30) N t,k Y i =1 (cid:18) p i,t,k − p i,t,k (cid:19) b j ( i,k ) /N t,k ! . The parameters α and b are learned from the training set by( ˆ α, ˆ b ) = arg min b ∈ R ,α ∈ [0 , X k ∈ S train T k X t =1 ( Z k − ˆ p t,k ( α, b )) . (c) Exponentially Weighted Moving Beta-transformed Aggregator(EWMBA) . The static version of the Beta-transformed aggregatorwas introduced in Ranjan and Gneiting (2010). A dynamic versioncan be obtained by replacing G t,k ( ν ) in the EWMLA description with H ν,τ (¯ p t,k ), where H ν,τ is the cumulative distribution function of theBeta distribution and ¯ p t,k is given by (5.1). The parameters α, ν, τ and ω are learned from the training set by( ˆ α, ˆ ν, ˆ τ , ˆ ω ) = arg min ν,τ> α,ω j ∈ [0 , X k ∈ S train T k X t =1 ( Z k − ˆ p t,k ( α, ν, τ, ω )) (6.2) s.t. J X j =1 ω j = 1 . V. A. SATOP ¨A ¨A ET AL.
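To make the EWMLA recursion concrete, here is a minimal Python sketch (ours, not the authors' implementation); for simplicity it takes a per-forecast bias array in place of the group lookup b_{j(i,k)}, and the pooling is computed on the log-odds scale for numerical stability.

```python
import numpy as np

def logit_aggregate(probs, biases):
    """G_{t,k}(b): bias-weighted pooling of one day's probability forecasts
    on the odds scale (computed via logs for numerical stability)."""
    probs = np.asarray(probs, dtype=float)
    biases = np.asarray(biases, dtype=float)
    n = len(probs)
    log_odds = np.sum(biases * np.log(probs / (1.0 - probs))) / n
    return 1.0 / (1.0 + np.exp(-log_odds))

def ewmla(daily_probs, daily_biases, alpha):
    """Exponentially weighted moving logit aggregator:
    p_1 = G_1 and p_t = alpha * G_t + (1 - alpha) * p_{t-1} for t > 1."""
    agg, path = None, []
    for probs, biases in zip(daily_probs, daily_biases):
        g = logit_aggregate(probs, biases)
        agg = g if agg is None else alpha * g + (1.0 - alpha) * agg
        path.append(agg)
    return path
```

With all biases equal to 1 and a single repeated forecast p, the pooled value reduces to p, which is a quick sanity check on the formula.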
Table 4
Brier scores based on 10-fold cross-validation. Scores by Day weighs a question by the number of days the question remained open. Scores by Problem gives each question an equal weight regardless of how long the question remained open. The bolded values indicate the best scores in each column. The values in the parentheses represent standard errors in the scores.

Model      All             Short           Medium          Long

Scores by Day
SDLM       0.100 (0.156)   0.066 (0.116)   0.098 (0.154)   0.102 (0.157)
BSAC_LOG         (0.147)                   0.100 (0.215)   0.098 (0.215)
SAC_BRI
SAC_LOG          (0.191)   0.056 (0.134)         (0.189)         (0.193)
EWMBA      0.104 (0.204)   0.057 (0.120)   0.113 (0.205)   0.105 (0.206)
EWMLA      0.102 (0.199)   0.061 (0.130)   0.111 (0.214)   0.103 (0.200)
EWMA       0.111 (0.146)   0.080 (0.101)   0.116 (0.152)   0.112 (0.146)

Scores by Problem
SDLM       0.089 (0.116)   0.064 (0.085)   0.106 (0.141)   0.092 (0.117)
BSAC_LOG         (0.103)                   0.110 (0.198)   0.085 (0.162)
SAC_BRI
SAC_LOG          (0.142)   0.055 (0.096)         (0.174)         (0.144)
EWMBA      0.091 (0.157)   0.057 (0.095)   0.121 (0.187)   0.093 (0.164)
EWMLA      0.090 (0.159)   0.064 (0.109)   0.120 (0.200)   0.090 (0.159)
EWMA       0.102 (0.108)   0.080 (0.075)   0.123 (0.130)   0.103 (0.110)
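The two weighting schemes named in Table 4 amount to two different averages of the per-day Brier scores. A small sketch (our illustration; the nested lists of daily scores are hypothetical inputs):

```python
import numpy as np

def brier(p, z):
    """Brier score of a probability forecast p for a binary outcome z."""
    return (z - p) ** 2

def scores_by_day(daily_scores_per_question):
    """Pool every daily score across all questions, then average;
    a question open for more days contributes proportionally more scores."""
    pooled = [s for question in daily_scores_per_question for s in question]
    return float(np.mean(pooled))

def scores_by_problem(daily_scores_per_question):
    """Average within each question first, then across questions;
    every question receives equal weight regardless of its length."""
    return float(np.mean([np.mean(q) for q in daily_scores_per_question]))
```

For two questions open for three days and one day, scores_by_day([[0.1, 0.1, 0.1], [0.4]]) gives 0.175 while scores_by_problem gives 0.25, showing how the first scheme weighs long questions more heavily.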
The competing models are evaluated via a 10-fold cross-validation that first partitions the 166 questions into 10 sets such that each set has approximately the same number of questions (16 or 17 questions in our case) and the same number of time points (between 1760 and 1764 time points in our case). The evaluation then iterates 10 times, each time using one of the 10 sets as the testing set and the remaining 9 sets as the training set. Therefore, each question is used nine times for training and exactly once for testing. (A 5-fold cross-validation was also performed. The results were, however, very similar to those of the 10-fold cross-validation and hence are not presented in the paper.) The testing proceeds sequentially, one testing question at a time, as follows: First, for a question with a time horizon of T_k, give an aggregate probability at time t = 2 based on the first two days, and compute the Brier score for this probability. Next, give an aggregate probability at time t = 3 based on the first three days and compute the Brier score for this probability. Repeat this process for each subsequent day up to and including t = T_k, yielding T_k − 1 Brier scores for the question.

The first summary, Scores by Day, weighs each question by the number of days the question remained open. This is performed by computing the average of the daily scores across all the questions. The second, Scores by Problem, gives each question an equal weight regardless of how long the question remained open. This is done by first averaging the scores within a question and then averaging the average scores across all the questions. Both scores can be further broken down into subcategories by considering the length of the questions. The final three columns of Table 4 divide the questions into Short questions (30 days or fewer), Medium questions (between 31 and 59 days) and Long questions (60 days or more). The numbers of questions in these subcategories were 36, 32 and 98, respectively. The bolded scores indicate the best score in each column. The values in the parentheses quantify the variability in the scores: under Scores by Day the values give the standard errors of all the scores; under Scores by Problem, on the other hand, the values represent the standard errors of the average scores of the different questions.

As can be seen in Table 4, SAC_LOG achieves the lowest score across all columns except Short, where it is outperformed by BSAC_LOG. It turns out that BSAC_LOG is overconfident (see Section 6.3). This means that BSAC_LOG underestimates the uncertainty in the events and outputs aggregate probabilities that are typically too near 0.0 or 1.0. This results in highly variable performance. The short questions generally involved very little uncertainty. On such easy questions, overconfidence can pay off frequently enough to compensate for a few large losses arising from the overconfident and drastically incorrect forecasts.

SDLM, on the other hand, lacks sharpness and is highly underconfident (see Section 6.3). This behavior is expected, as the experts are underconfident at the group level (see Section 6.4) and SDLM does not use the training set to explicitly calibrate its aggregate probabilities. Instead, it merely smooths the forecasts given by the experts. The resulting aggregate probabilities are therefore necessarily conservative, resulting in high average scores with low variability.

Similar behavior is exhibited by EWMA, which performs the worst of all the competing models. The other two exponentially weighted aggregators, EWMLA and EWMBA, make efficient use of the training set and present moderate forecasting performance in most columns of Table 4. Neither approach, however, appears to dominate the other. The high variability and average of their performance scores indicate that their performance suffers from overconfidence.

6.3. In- and out-of-sample sharpness and calibration.
A calibration plot is a simple tool for visually assessing the sharpness and calibration of a model. The idea is to plot the aggregate probabilities against the observed empirical frequencies. Therefore, any deviation from the diagonal line suggests poor calibration. A model is considered underconfident (or overconfident) if the points follow an S-shaped (or inverted S-shaped) trend. To assess the sharpness of the model, it is common practice to place a histogram of the given forecasts in the corner of the plot. Given that the data were balanced, any deviation from the baseline probability of 0.5 suggests improved sharpness.

Fig. 3. The top and bottom rows show in- and out-of-sample calibration and sharpness, respectively.

The top and bottom rows of Figure 3 present calibration plots for SDLM, SAC_LOG, SAC_BRI and BSAC_LOG under in- and out-of-sample probability aggregation, respectively. Each setting is of interest in its own right: Good in-sample calibration is crucial for model interpretability. In particular, if the estimated crowd belief is well calibrated, then the elements of the bias vector b can be used to study the amount of under- or overconfidence in the different expertise groups. Good out-of-sample calibration and sharpness, on the other hand, are necessary properties in decision making. To guide our assessment, the dashed bands around the diagonal connect the point-wise, Bonferroni-corrected [Bonferroni (1936)] 95% lower and upper critical values under the null hypothesis of calibration. These have been computed by running the bootstrap technique described in Bröcker and Smith (2007) for 10,000 iterations. The in-sample predictions were obtained by running the models for 10,200 iterations, leading to a final posterior sample of 1000 observations after thinning and using the first 200 iterations for burn-in. The out-of-sample predictions were given by the 10-fold cross-validation discussed in Section 6.2.

Overall, SAC is sharp and well calibrated both in- and out-of-sample, with only a few points barely falling outside the point-wise critical values. Given that the calibration does not change drastically from the top to the bottom row, SAC can be considered robust against overfitting. This, however, is not the case with BSAC_LOG, which is well calibrated in-sample but presents overconfidence out-of-sample. Figure 3(a) and (e) serve as baselines by showing the calibration plots for SDLM. Given that this model does not perform any explicit calibration, it is not surprising to see most points outside the critical values. The pattern in the deviations suggests strong underconfidence. Furthermore, the inset histogram reveals a drastic lack of sharpness. Therefore, SAC can be viewed as a well-performing compromise between SDLM and BSAC_LOG that avoids overconfidence without being too conservative.

Fig. 4. Posterior distributions of b_j for j = 1, . . . , 5.

6.4. Group-level expertise bias.
This section explores the bias among the five expertise groups in our data set. Figure 4 compares the posterior distributions of the individual elements of b with side-by-side boxplots. Given that the distributions fall completely below the no-bias reference line at 1.0, all the expertise groups are deemed underconfident. Even though the exact level of underconfidence is affected slightly by the extent to which the extreme probabilities are censored (see Section 6.1), the qualitative results in this section remain insensitive to different levels of censoring.

Figure 4 shows that underconfidence decreases as expertise increases. The posterior probability that the most expert group is the least underconfident is approximately equal to 1.0, and the posterior probability of a strictly decreasing level of underconfidence is approximately 0.87. The latter probability is driven down by the inseparability of the two groups with the lowest levels of self-reported expertise. This inseparability suggests that the experts are poor at assessing how little they know about a topic that is strange to them. If these groups are combined into a single group, the posterior probability of a strictly decreasing level of underconfidence is approximately 1.0.

The decreasing trend in underconfidence can be viewed as a process of Bayesian updating. A completely ignorant expert aiming to minimize a reasonable loss function, such as the Brier score, has no reason to give anything but 0.5 as his probability forecast. However, as soon as the expert gains some knowledge about the event, he produces an updated forecast that is a compromise between his initial forecast and the new information acquired. The updated forecast is therefore conservative and too close to 0.5 as long as the expert remains only partially informed about the event. If most experts fall somewhere on this spectrum between ignorance and full information, their average forecast tends to fall strictly between 0.5 and the most informed probability forecast [see Baron et al. (2014) for more details]. Given that expertise is to a large extent determined by subject matter knowledge, the level of underconfidence can be expected to decrease as a function of the group's level of self-reported expertise.

Finding underconfidence in all the groups may seem like a surprising result given that many previous studies have shown that experts are often overconfident [see, e.g., Lichtenstein, Fischhoff and Phillips (1977), Morgan (1992), Bier (2004) for a summary of numerous calibration studies].
It is, however, worth emphasizing three points. First, our result is a statement about groups of experts and hence does not invalidate the possibility of the individual experts being overconfident. To make conclusions at the individual level based on the group-level bias terms would be considered an ecological inference fallacy [see, e.g., Lubinski and Humphreys (1996)]. Second, the experts involved in our data set are overall very well calibrated [Mellers et al. (2014)]. A group of well-calibrated experts, however, can produce an aggregate forecast that is underconfident. In fact, if the aggregate is linear, the group is necessarily underconfident [see Theorem 1 of Ranjan and Gneiting (2010)]. Third, according to Erev, Wallsten and Budescu (1994), the level of confidence depends on the way the data were analyzed. They explain that experts' probability forecasts suggest underconfidence when the forecasts are averaged or presented as a function of independently defined objective probabilities, that is, the probabilities given by logit^{−1}(X_{t,k}) in our case. This is similar to our context and opposite to many empirical studies on confidence calibration.

6.5. Question difficulty and other measures.
One advantage of our model arises from its ability to produce estimates of the interpretable question-specific parameters γ_k, σ_k and τ_k. These quantities can be combined in many interesting ways to answer questions about different groups of experts or the questions themselves. For instance, being able to assess the difficulty of a question could lead to more principled ways of aggregating performance measures across questions or to novel insight on the kinds of questions that are found difficult by experts [see, e.g., a discussion on the Hard-Easy Effect in Wilson (1994)]. To illustrate, recall that higher values of σ_k suggest greater disagreement among the participating experts. Given that experts are more likely to disagree over a difficult question than an easy one, it is reasonable to assume that σ_k has a positive relationship with question difficulty. An alternative measure is given by τ_k, which quantifies the volatility of the underlying circumstances that ultimately decide the outcome of the event. Therefore, a high value of τ_k can cause the outcome of the event to appear unstable and difficult to predict.

As a final illustration of our model, we return to the two example questions introduced in Figure 1. Given that σ̂_k = 2.43 and σ̂_k = 1.77 for the questions depicted in Figure 1(a) and 1(b), respectively, the first question provokes more disagreement among the experts than the second one. Intuitively this makes sense because the target event in Figure 1(a) is determined by several conditions that may change radically from one day to the next, while the target event in Figure 1(b) is determined by a relatively steady stock market index. Therefore, it is not surprising to find that the estimate of τ_k for the question in Figure 1(a) exceeds τ̂_k = 0.039 for the question in Figure 1(b). We may conclude that the first question is inherently more difficult than the second one.
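To make the roles of the two difficulty measures concrete, the following sketch (our illustration under assumed parameter values, not estimates from the data; in particular the larger τ value is assumed, and the observation equation is the simplified SDLM-style one without bias terms) simulates the hidden logit process together with noisy expert observations: τ controls the volatility of the hidden state, while σ controls the disagreement among experts on any given day.

```python
import numpy as np

def simulate_question(T, gamma, sigma, tau, n_experts, seed=0):
    """Hidden crowd belief X_t = gamma * X_{t-1} + w_t with w_t ~ N(0, tau^2),
    observed through expert logits Y_{i,t} = X_t + v_{i,t}, v_{i,t} ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = gamma * x[t - 1] + rng.normal(0.0, tau)
    y = x[:, None] + rng.normal(0.0, sigma, size=(T, n_experts))
    return x, y

# An "easier" question with low disagreement and low volatility versus a
# "harder" one; sigma values echo the estimates quoted above, the hard tau is assumed.
x_easy, y_easy = simulate_question(T=100, gamma=1.0, sigma=1.77, tau=0.039, n_experts=10)
x_hard, y_hard = simulate_question(T=100, gamma=1.0, sigma=2.43, tau=0.30, n_experts=10, seed=1)
```

Comparing np.std(np.diff(x)) across the two settings isolates the volatility contributed by τ, while the within-day spread of the rows of y reflects σ.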
7. Discussion.
This paper began by introducing a rather unorthodox but nonetheless realistic time-series setting where probability forecasts are made very infrequently by a heterogeneous group of experts. The resulting data are too sparse to be modeled well with standard time-series methods. In response to this lack of appropriate modeling procedures, we propose an interpretable time-series model that incorporates self-reported expertise to capture a sharp and well-calibrated estimate of the crowd belief. This procedure extends the forecasting literature into an under-explored area of probability aggregation.

Our model preserves parsimony while addressing the main challenges in modeling sparse probability forecasting data. Therefore, it can be viewed as a basis for many future extensions. To give some ideas, recall that most of the model parameters were assumed constant over time. It is intuitively reasonable, however, that these parameters behave differently during different time intervals of the question. For instance, the level of disagreement (represented by σ_k in our model) among the experts can be expected to decrease toward the final time point when the question resolves. This hypothesis could be explored by letting σ_{t,k} evolve dynamically as a function of the previous term σ_{t−1,k} and random noise.

This paper modeled the bias separately within each expertise group. This is by no means restricted to the study of bias or its relation to self-reported expertise. Different parameter dependencies could be constructed based on many other expert characteristics, such as gender, education or specialty, to produce a range of novel insights on the forecasting behavior of experts. It would also be useful to know how expert characteristics interact with question types, such as economic, domestic or international.
The results would be of interest to the decision-maker, who could use the information as a basis for hiring only a high-performing subset of the available experts.

Other future directions could remove some of the obvious limitations of our model. For instance, recall that the random components are assumed to follow a normal distribution. This is a strong assumption that may not always be justified. Logit probabilities, however, have been modeled with the normal distribution before [see, e.g., Erev, Wallsten and Budescu (1994)]. Furthermore, the normal distribution is a rather standard assumption in psychological models [see, e.g., signal-detection theory in Tanner, Wilson and Swets (1954)].

A second limitation resides in the assumption that both the observed and hidden processes are expected to grow linearly. This assumption could be relaxed, for instance, by adding higher-order terms to the model. A more complex model, however, is likely to sacrifice interpretability. Given that our model can detect very intricate patterns in the crowd belief (see Figure 1), compromising interpretability for the sake of facilitating nonlinear growth is hardly necessary.

A third limitation appears in an online setting where new forecasts are received at a fast rate. Given that our model is fit in a retrospective fashion, it is necessary to refit the model every time a new forecast becomes available. Therefore, our model can be applied only to offline aggregation and online problems that tolerate some delay. A more scalable and efficient alternative would be to develop an aggregator that operates recursively on streams of forecasts. Such a filtering perspective would offer an aggregator that estimates the current crowd belief accurately without having to refit the entire model each time a new forecast arrives. Unfortunately, this typically implies being less accurate in estimating the model parameters such as the bias term.
However, as estimation of the model parameters was addressed in this paper, designing a filter for probability forecasts seems like the next natural development in time-series probability aggregation.

Acknowledgments. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions expressed herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC or the U.S. Government.

We deeply appreciate the project management skills and work of Terry Murray and David Wayrynen, which went far beyond the call of duty on this project.

SUPPLEMENTARY MATERIAL

Sampling step (DOI: 10.1214/14-AOAS739SUPP; .pdf). This supplementary material provides a technical description of the sampling step of the SAC-algorithm.

REFERENCES
Allard, D., Comunian, A. and Renard, P. (2012). Probability aggregation methods in geoscience. Math. Geosci.

Ariely, D., Au, W. T., Bender, R. H., Budescu, D. V., Dietz, C. B., Gu, H., Wallsten, T. S. and Zauberman, G. (2000). The effects of averaging subjective probability estimates between and within judges. Journal of Experimental Psychology: Applied.

Baars, J. A. and Mass, C. F. (2005). Performance of National Weather Service forecasts compared to operational, consensus, and weighted model output statistics. Weather and Forecasting.

Baron, J., Mellers, B. A., Tetlock, P. E., Stone, E. and Ungar, L. H. (2014). Two reasons to make aggregated probability forecasts more extreme. Decis. Anal. DOI:10.1287/deca.2014.0293.

Batchelder, W. H., Strashny, A. and Romney, A. K. (2010). Cultural consensus theory: Aggregating continuous responses in a finite interval. In Advances in Social Computing (S.-K. Chai, J. J. Salerno and P. L. Mabry, eds.) 98–107. Springer, Berlin.

Bier, V. (2004). Implications of the research on expert overconfidence and dependence. Reliability Engineering & System Safety.

Bonferroni, C. E. (1936). Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze.

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review.

Bröcker, J. and Smith, L. A. (2007). Increasing the reliability of reliability diagrams. Weather and Forecasting.

Buja, A., Stuetzle, W. and Shen, Y. (2005). Loss functions for binary class probability estimation and classification: Structure and applications. Statistics Department, Univ. Pennsylvania, Philadelphia, PA. Available at http://stat.wharton.upenn.edu/~buja/PAPERS/paper-proper-scoring.pdf.

Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika.

Chen, Y. (2008). Learning classifiers from imbalanced, only positive and unlabeled data sets. Project Report for UC San Diego Data Mining Contest. Dept. Computer Science, Iowa State Univ., Ames, IA.

Clemen, R. T. and Winkler, R. L. (2007). Aggregating probability distributions. In Advances in Decision Analysis: From Foundations to Applications (W. Edwards, R. F. Miles and D. von Winterfeldt, eds.) 154–176. Cambridge Univ. Press, Cambridge.

Cooke, R. M. (1991). Experts in Uncertainty: Opinion and Subjective Probability in Science. Clarendon Press, New York. MR1136548.

Erev, I., Wallsten, T. S. and Budescu, D. V. (1994). Simultaneous over- and underconfidence: The role of error in judgment processes. Psychological Review.

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis. CRC Press, Boca Raton.

Gelman, A., Jakulin, A., Pittau, M. G. and Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. Ann. Appl. Stat.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Genest, C. and Zidek, J. V. (1986). Combining probability distributions: A critique and an annotated bibliography. Statist. Sci.

Gent, I. P. and Walsh, T. (1996). Phase transitions and annealed theories: Number partitioning as a case study. In Proceedings of the European Conference on Artificial Intelligence (ECAI 1996).

Gneiting, T. and Ranjan, R. (2013). Combining predictive distributions. Electron. J. Stat.

Gneiting, T., Stanberry, L. I., Grimit, E. P., Held, L. and Johnson, N. A. (2008). Rejoinder on: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds [MR2434318]. TEST.

Good, I. J. (1952). Rational decisions. J. R. Stat. Soc. Ser. B Stat. Methodol.

Hastings, C., Jr., Mosteller, F., Tukey, J. W. and Winsor, C. P. (1947). Low moments for small samples: A comparative study of order statistics. Ann. Math. Statistics.

Hayes, B. (2002). The easiest hard problem. American Scientist.

Karmarkar, N. and Karp, R. M. (1982). The differencing method of set partitioning. Technical Report UCB/CSD 82/113, Computer Science Division, Univ. California, Berkeley, CA.

Kellerer, H., Pferschy, U. and Pisinger, D. (2004). Knapsack Problems. Springer, Dordrecht. MR2161720.

Lichtenstein, S., Fischhoff, B. and Phillips, L. D. (1977). Calibration of probabilities: The state of the art. In Decision Making and Change in Human Affairs (H. Jungermann and G. De Zeeuw, eds.) 275–324. Springer, Berlin.

Lubinski, D. and Humphreys, L. G. (1996). Seeing the forest from the trees: When predicting the behavior or status of groups, correlate means. Psychology, Public Policy, and Law.

Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S. E., Moore, D., Atanasov, P., Swift, S. A. et al. (2014). Psychological strategies for winning a geopolitical forecasting tournament. Psychological Science. DOI:10.1177/0956797614524255.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics.

Migon, H. S., Gamerman, D., Lopes, H. F. and Ferreira, M. A. R. (2005). Dynamic models. In Bayesian Thinking: Modeling and Computation. Handbook of Statist.

Mills, T. C. (1991). Time Series Techniques for Economists. Cambridge Univ. Press, Cambridge.

Morgan, M. G. (1992). Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge Univ. Press, Cambridge.

Murphy, A. H. and Winkler, R. L. (1987). A general framework for forecast verification. Monthly Weather Review.

Neal, R. M. (2003). Slice sampling. Ann. Statist.

Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford Statistical Science Series. Oxford Univ. Press, Oxford. MR2260483.

Primo, C., Ferro, C. A., Jolliffe, I. T. and Stephenson, D. B. (2009). Calibration of probabilistic forecasts of binary events. Monthly Weather Review.

Raftery, A. E., Gneiting, T., Balabdaoui, F. and Polakowski, M. (2005). Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review.

Ranjan, R. and Gneiting, T. (2010). Combining probability forecasts. J. R. Stat. Soc. Ser. B Stat. Methodol.

Sanders, F. (1963). On subjective probability forecasting. Journal of Applied Meteorology.

Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E. and Ungar, L. H. (2014a). Combining multiple probability predictions using a simple logit model. International Journal of Forecasting.

Satopää, V. A., Jensen, S. T., Mellers, B. A., Tetlock, P. E. and Ungar, L. H. (2014b). Supplement to "Probability aggregation in time-series: Dynamic hierarchical modeling of sparse expert beliefs." DOI:10.1214/14-AOAS739SUPP.

Shlyakhter, A. I., Kammen, D. M., Broido, C. L. and Wilson, R. (1994). Quantifying the credibility of energy projections from trends in past data: The US energy sector. Energy Policy.

Tanner, J., Wilson, P. and Swets, J. A. (1954). A decision-making theory of visual detection. Psychological Review.

Tetlock, P. E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton Univ. Press, Princeton, NJ.

Ungar, L., Mellers, B., Satopää, V., Tetlock, P. and Baron, J. (2012). The Good Judgment Project: A large scale test of different methods of combining expert predictions. In The Association for the Advancement of Artificial Intelligence 2012 Fall Symposium Series, Univ. Pennsylvania, Philadelphia, PA.

Vislocky, R. L. and Fritsch, J. M. (1995). Improved model output statistics forecasts through model consensus. Bulletin of the American Meteorological Society.

Wallace, B. C. and Dahabreh, I. J. (2012). Class probability estimates are unreliable for imbalanced data (and how to fix them). In IEEE 12th International Conference on Data Mining.

Wallsten, T. S., Budescu, D. V. and Erev, I. (1997). Evaluating and combining subjective probability estimates. Journal of Behavioral Decision Making.

Wilson, A. G. (1994). Cognitive factors affecting subjective probability assessment. Discussion Paper 94-02, Institute of Statistics and Decision Sciences, Duke Univ., Chapel Hill, NC.

Wilson, P. W., D'Agostino, R. B., Levy, D., Belanger, A. M., Silbershatz, H. and Kannel, W. B. (1998). Prediction of coronary heart disease using risk factor categories. Circulation.

Winkler, R. L. and Jose, V. R. R. (2008). Comments on: Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds [MR2434318]. TEST.

Winkler, R. L. and Murphy, A. H. (1968). "Good" probability assessors. Journal of Applied Meteorology.

Wright, G., Rowe, G., Bolger, F. and Gammack, J. (1994). Coherence, calibration, and expertise in judgmental probability forecasting. Organizational Behavior and Human Decision Processes.

V. A. Satopää
S. T. Jensen
Department of Statistics
The Wharton School
University of Pennsylvania
Philadelphia, Pennsylvania 19104-6340
USA
E-mail: [email protected]
[email protected]

B. A. Mellers
P. E. Tetlock
Department of Psychology
University of Pennsylvania
Philadelphia, Pennsylvania 19104-6340
USA
E-mail: [email protected]
[email protected]