# Models, Markets, and the Forecasting of Elections

Rajiv Sethi, Julie Seager, Emily Cai, Daniel M. Benjamin, Fred Morstatter

February 12, 2021

Abstract

We examine probabilistic forecasts for battleground states in the 2020 US presidential election, using daily data from two sources over seven months: a model published by The Economist, and prices from the PredictIt exchange. We find systematic differences in accuracy over time, with markets performing better several months before the election, and the model performing better as the election approached. A simple average of the two forecasts performs better than either one of them overall, even though no average can outperform both component forecasts for any given state-date pair. This effect arises because the model and the market make different kinds of errors in different states: the model was confidently wrong in some cases, while the market was excessively uncertain in others. We conclude that there is value in using hybrid forecasting methods, and propose a market design that incorporates model forecasts via a trading bot to generate synthetic predictions. We also propose and conduct a profitability test that can be used as a novel criterion for the evaluation of forecasting performance.

Acknowledgments: We thank G. Elliott Morris for providing us with daily forecast data from the Economist model, Parker Howell at PredictIt for access to daily price data, and Pavel Atanasov, David Budescu, Jason Pipkin, and David Rothschild for comments on an earlier version. Rajiv Sethi thanks the Radcliffe Institute for Advanced Study at Harvard University for fellowship support.

Rajiv Sethi: Department of Economics, Barnard College, Columbia University, and the Santa Fe Institute

Julie Seager: Department of Economics, Barnard College, Columbia University

Emily Cai: Fu Foundation School of Engineering and Applied Science, Columbia University

Daniel M. Benjamin: Information Sciences Institute, University of Southern California

Fred Morstatter: Information Sciences Institute, University of Southern California

Introduction

The forecasting of elections is of broad and significant interest. Predictions influence contributions to campaigns, voter turnout, candidate strategies, investment decisions by firms, and a broad range of other activities. But accurate prediction is notoriously challenging, and each approach to forecasting has significant limitations.

In this paper we compare two very different approaches to the forecasting of elections, demonstrate the value of integrating them systematically, and propose a method for doing so. One is a model-based approach that uses a clearly identified set of inputs including polls and fundamentals to generate a probability distribution over outcomes (Heidemanns et al., 2020). Daily state-level forecasts based on such a model were published by The Economist for several months leading up to the 2020 presidential election in the United States. Over the same period, a market-based approach to prediction was implemented via decentralized trading on a peer-to-peer exchange, PredictIt, which operates legally in the United States under a no-action letter from the Commodity Futures Trading Commission. Daily closing prices on this exchange, suitably adjusted, can be interpreted as probabilistic forecasts.

Models and markets both respond to emerging information, but do so in very different ways. Most models are backward-looking by construction; they are calibrated and back-tested based on earlier election cycles, and built using a set of variables that have been selected prior to the prediction exercise. New information is absorbed only when it starts to affect the input variables, for instance by making its presence felt in polling data or economic indicators. By contrast, markets are fundamentally forward-looking, and can rapidly incorporate information from novel and essentially arbitrary sources, as long as any trader considers this to be relevant. Information that affects the prices at which traders are willing to buy and sell contracts changes the market equilibrium, and hence the implied probabilistic forecasts derived from prices.

Each approach has strengths and weaknesses. New information that experienced observers immediately know to be relevant, such as debate performances, military strikes, or sharp changes in unemployment claims, can be reflected in market prices almost instantaneously, while it leads to much more sluggish adjustments in model forecasts. However, markets can be prone to herding, cognitive biases, excess volatility, and even active manipulation. Models are generally not vulnerable to such social and psychological factors, but to avoid overfitting, they need to identify a relatively small set of variables in advance (possibly extracted from a much larger set of potentially relevant variables), and this can lead to the omission of information that turns out to be highly relevant in particular instances. Market forecasts face no such constraint, and traders are free to use models in forming their assessments as they see fit.

We show that data from the Economist model and the PredictIt market for the seven months leading up to the 2020 election exhibit the following patterns. The market performs significantly better than the model during the early part of this period, based on average Brier scores as a measure of accuracy, especially during the month of April. The two mechanisms exhibit comparable performance from May to August, after which the model starts to predict better than the market. Over the entire period the average difference in performance is negligible. However, the two approaches generate very different forecasts for individual states, and make errors of different kinds. For example, the model was confident but wrong in Florida and North Carolina, while the market was excessively uncertain in Minnesota and New Hampshire. This raises the possibility that a simple average of the two predictions could outperform both component forecasts in the aggregate, by avoiding the most egregious errors. We verify that this is indeed the case. Even though no average can ever outperform both components for any given state-date pair, the simple average does so in the aggregate, across all states and periods.

A simple average of model and market forecasts is a very crude approach to generating a hybrid prediction. The fact that it nevertheless exhibits superior performance to either component suggests that more sophisticated synthetic approaches that combine models and markets could be very promising. Such hybrid forecasting can be approached from two different directions. Models could directly incorporate prediction market prices as just another input variable, in combination with polls and fundamentals. Alternatively, markets can be extended to allow for automated traders endowed with budgets and risk preferences that act as if they believe the predictions of the model. Following the latter approach, we show below how a hybrid prediction market can be designed and explored experimentally.

We also conduct a performance test, by examining the profitability of a bot that held and updated beliefs based on the model forecasts throughout the period. We show that a bot of this kind would have made very different portfolio choices in different states, building a large long position in the Democratic candidate in some states, while switching frequently between long and short positions and trading much more cautiously in others. It would have made significant profits in some states and lost large sums in others, but would have made money on the whole, for a double-digit return. As a check for robustness, we consider how performance would have been affected had one or more closely decided states had a different outcome. This profitability test can be used as a novel and dynamic criterion for a comparative evaluation of model performance more generally.

Forecasting elections based on fundamental factors such as economic conditions, presidential approval, and incumbency status has a long history (Fair, 1978; Lewis-Beck and Rice, 1984; Abramowitz, 1988; Campbell and Wink, 1990; Norpoth, 1996; Lewis-Beck and Tien, 1996). In 2020 such models faced the difficulty of an unprecedented collapse in economic activity in response to a global pandemic, and were forced to make a number of ad hoc adjustments (Abramowitz, 2020; Jérôme et al., 2020; DeSart, 2020; Enns and Lagodny, 2020; Lewis-Beck and Tien, 2020; Murr and Lewis-Beck, 2020; Norpoth, 2020). While they varied widely in their predictions for the election, the average across a range of models was reasonably accurate (Dassonneville and Tien, 2020).

Traditional models such as these typically generate forecasts with low frequency, often just once in an election cycle. More recently, with advances in computational capacity, data processing, and machine learning, models have been developed that rapidly incorporate information from trial heat polls and other sources to generate forecasts at daily frequencies. In the 2020 cycle there were two that received significant media attention, published online at FiveThirtyEight and The Economist respectively; see Silver (2020) and Heidemanns et al. (2020) for methodologies. We focus on the latter because of its greater clarity and transparency, though our analysis could easily be replicated for any model that generates forecasts for the same set of events.

Prediction markets have been a fixture in the forecasting ecosystem for decades, dating back to the launch of the pioneering Iowa Electronic Markets in 1988 (Forsythe et al., 1992). This market, like others that followed in its wake, is a peer-to-peer exchange for trading contracts with state-contingent payoffs. Buyers pay a contract price and receive a fixed payment if the referenced event occurs by an expiration date, or else they get no payout. The contract price is commonly interpreted as the probability of event occurrence. The forecasting accuracy of such markets has been shown to be competitive with that of opinion polls and structural models (Leigh and Wolfers, 2006; Berg et al., 2008a,b; Rothschild, 2009).

A prediction market is a type of crowdsourced forecasting pool that leverages a market structure, similar to a stock market, to efficiently combine judgments, provide timely responses, and incentivize truthful reporting of beliefs (Wolfers and Zitzewitz, 2004). Forecasting crowds are not constrained by a fixed methodology or a preselected set of variables, and if suitably diverse, can include a range of ideas and approaches. Crowdsourced forecasts have been demonstrated to be more accurate than experts' predictions (Satopää et al., 2014), which can be attributed to pooling and balancing diverse opinions (Budescu and Yu, 2007). A market can predict well even if individual participants predict poorly, through endogenous changes in the distribution of portfolios over time (Kets et al., 2014).

The value of combining or aggregating forecasts derived from different sources has been discussed by many authors; see Clemen (1989) for an early review and Rothschild (2015) and Graefe et al. (2015) for more recent discussion. We propose a novel method for doing so, through the actions of a trading bot that internalizes the model and is endowed with beliefs, a budget, and risk preferences. In some respects the activities of this trader resemble those of a traditional specialist or market maker, as modeled by Glosten and Milgrom (1985) and Kyle (1985), as well as algorithmic traders of more recent vintage, as in Budish et al. (2015) and Baron et al. (2019). But there are important differences, as we discuss in some detail below.

Heidemanns et al. (2020) constructed a dynamic Bayesian forecasting model that integrates information from two sources (trial heat polls at the state and national level, and fundamental factors such as economic conditions, presidential approval, and the advantage of incumbency) while making allowances for differential non-response by party identification and the degree of political polarization. The model builds on Linzer (2013), and uses a variant of the Abramowitz (2008) fundamentals model as the basis for a prior belief. State-level forecasts based on this model were updated and published daily, and we use data for the period April 1 to November 2, 2020, stopping the day before the election.

For the same forecasting window, we obtained daily closing price data for state markets from the PredictIt exchange. These prices are for contracts that reference events, and each contract pays a dollar if the event occurs and nothing otherwise. For example, a market based on the event "Which party will win Wisconsin in the 2020 presidential election?" contained separate contracts for the "Democratic" and "Republican" nominees. At close of trading on November 2, the eve of the election, the prices of these contracts were 0.70 and 0.33 respectively. These cannot immediately be interpreted as probabilities since they sum to something other than one. To obtain a probabilistic market forecast, we simply scale the prices uniformly by dividing by their sum. In this instance the market forecast for the likelihood of a Democratic victory is taken to be about 0.68.

Footnote: This is a consequence of the fee structure on PredictIt, which took ten percent of all trader profits. In the absence of fees, a trader could sell both contracts and effectively purchase $1.03 for $1.00, making the prices unsustainable. The Iowa Electronic Markets do not charge fees, and the prices of complementary contracts on that exchange preclude such arbitrage opportunities.

Traders on PredictIt were restricted to a maximum position size of $850 per contract. The effect of such restrictions on predictive accuracy is open to debate, but the limits do preclude the possibility that a single large trader could dominate or manipulate the market and distort prices, as was done on the Intrade exchange during the 2012 cycle (Rothschild and Sethi, 2016). Furthermore, despite the restrictions on any individual trader's position sizes, total trading volume on this exchange was substantial. Over the seven-month period under consideration, for markets referencing outcomes in thirteen battleground states, more than 58 million contracts were traded.

Our focus is on 13 swing states that were all considered competitive to some degree at various points in the election cycle: Arizona, Florida, Georgia, Iowa, Michigan, Minnesota, Nevada, New Hampshire, North Carolina, Ohio, Pennsylvania, Texas, and Wisconsin. We exclude states that were considered non-competitive both because there is little disagreement between the model and the market, and also because much of the disagreement that exists arises because the incentive structure created by markets distorts prices as they approach extreme levels.

Footnote: Specifically, as the price of a contract approaches 1, it becomes extremely costly to bet on the event relative to the potential reward, while it becomes extremely cheap to bet against it. This creates a potential imbalance between demand and supply that keeps prices away from extremes.

Figure 1 shows the probability of a Democratic victory in eleven of the thirteen states based on the model and the market. As can be seen, both the levels and trends diverge quite substantially. The model began to place extremely high likelihood (close to 100 […]

Footnote: Two of the states, New Hampshire and Minnesota, have been dropped from the figure for visual clarity; these exhibit roughly similar patterns to the state of Nevada.
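The uniform scaling of contract prices into a probabilistic forecast, as described for the Wisconsin market on November 2, can be sketched in a few lines. This is only an illustration: the function name `normalize_prices` and the dictionary layout are assumptions made here, not part of the original analysis.

```python
def normalize_prices(prices):
    """Scale contract prices uniformly so that they sum to one.

    `prices` maps a contract name to its closing price (hypothetical layout).
    """
    total = sum(prices.values())
    if total <= 0:
        raise ValueError("prices must have a positive sum")
    return {name: price / total for name, price in prices.items()}


# Closing prices for the Wisconsin market on November 2, as quoted in the text.
wisconsin = {"Democratic": 0.70, "Republican": 0.33}
forecast = normalize_prices(wisconsin)

# 0.70 / (0.70 + 0.33) is roughly 0.68, the market forecast reported in the text.
print(round(forecast["Democratic"], 2))
```

Dividing by the sum keeps the relative prices of the two contracts unchanged, which is the simplest way to remove the excess over one created by fees.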

Figure 1: Probabilities of a Democratic Victory in Selected States, based on Model and Market

As Figure 1 reveals, the model predictions span a larger subset of the probability space than the market predictions do, especially over the latter part of the period. This can be seen more clearly in Figure 2, which shows the frequencies with which specific probabilities appear in the data, aggregating across all thirteen states. The range of the market data is considerably more compressed.

Given this, one might expect that at any given probability level, more states would appear at least once in the market data over the period of observation. However, this is not the case. At most seven states appear at any given prediction point in the market, and this happens just once, at probability point 0.63. The model has more than 25 prediction points at which seven or more states appear, and the peak is 10 states at probability point 0.54. That is, not only does the model span a larger range of probabilities, it also has greater movement across this range for individual states.

Figure 2: Frequency Distributions for Model and Market Predictions

A standard tool for assessing the accuracy of probabilistic forecasts is the Brier score (Brier, 1950). In the case of binary outcomes this is just the mean squared error. Letting $p_{it}$ denote the probability assigned in period $t$ to a Democratic victory in state $i$, the score is given by

$$s_{it} = (p_{it} - r_i)^2,$$

where $r_i = 1$ if state $i$ is resolved in favor of the Democratic nominee and $r_i = 0$ otherwise. We have $n = 13$ events (corresponding to outcomes in the battleground states), for each of which there are 216 forecasts on consecutive days. Aggregating across all events we obtain a time series of the average Brier score:

$$s_t = \frac{1}{n} \sum_{i=1}^{n} (p_{it} - r_i)^2.$$

This can be computed for models and markets separately, resulting in the plots in Figure 3. As we have already seen, the market forecasts were much less variable over time, both for individual states and in the aggregate. Looking at relative performance, we see that markets were significantly more accurate on average during the early part of the forecasting window. This was followed by a period of roughly comparable performance, with the model pulling ahead as the election approached.

Figure 3: Mean Brier Scores for the Model and Market over Time
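As a rough illustration of how the daily series $s_t$ and its average over time could be computed, the sketch below uses a tiny made-up data set with two states and three days. The function names and the data layout (a dictionary of daily probabilities per state) are assumptions for this example, and the numbers are not taken from the paper.

```python
def brier(p, r):
    """Brier score of a single binary forecast p against outcome r (0 or 1)."""
    return (p - r) ** 2


def mean_brier_by_day(forecasts, outcomes):
    """Average the Brier score across states for each day.

    forecasts[state] is a list of daily probabilities of a Democratic win;
    outcomes[state] is 1 if the Democrat won the state and 0 otherwise.
    """
    states = list(forecasts)
    n_days = len(forecasts[states[0]])
    return [
        sum(brier(forecasts[s][t], outcomes[s]) for s in states) / len(states)
        for t in range(n_days)
    ]


def overall_brier(forecasts, outcomes):
    """Average the daily series over the whole forecasting window."""
    daily = mean_brier_by_day(forecasts, outcomes)
    return sum(daily) / len(daily)


# Hypothetical forecasts for two states over three days (illustrative only).
toy_forecasts = {"A": [0.6, 0.7, 0.9], "B": [0.5, 0.4, 0.2]}
toy_outcomes = {"A": 1, "B": 0}

daily_scores = mean_brier_by_day(toy_forecasts, toy_outcomes)
overall_score = overall_brier(toy_forecasts, toy_outcomes)
print(daily_scores, overall_score)
```

With thirteen states and 216 days, the same two functions would produce the kind of daily series plotted in Figure 3 and a single scalar summary per forecasting method.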

Averaging across time as well as states, we can obtain a scalar measure of overall performance for each method as follows:

$$s = \frac{1}{nT} \sum_{t=1}^{T} \sum_{i=1}^{n} (p_{it} - r_i)^2,$$

where $T = 216$ is the number of periods in the forecasting window. On this measure we find virtually identical average forecasting performance across methods: $s_{\text{model}}$ = […], $s_{\text{market}}$ = […]

Figure 4: Calibration Curves for the Model and Market

One factor that may have distorted prices as the election approached was a staggering increase in volume in the wake of a sustained inflow of funds. This can be seen in Figure 5, which shows the total number of contracts traded daily in the thirteen battleground states over the seven-month period, using a log scale. Daily volume by the end of the period was a thousand times as great as at the start. In fact, volume continued to remain extremely high even after the election was called on November 7, driven in part by traders who were convinced that the results would somehow be overturned. On November 8, for instance, there were more than 2.7 million contracts traded for six key contested states (Arizona, Georgia, Michigan, Nevada, Pennsylvania, and Wisconsin). All of these had been called for Biden but the implied market forecast for the Democrat ranged from 0.86 in Arizona to 0.93 in Nevada. This suggests that even leading up to the election, a group of traders convinced that a Trump victory was inevitable and willing to bet large sums on it were having significant price effects (Strauss, 2021).

Figure 5: Trading Volume for Major Party Contracts in Thirteen Battleground States

Even setting aside this consideration, it is important to note that our comparison between model and market performance ought not to be seen as definitive. Although we examine more than 2,800 predictions for each mechanism over the seven-month period, they reference just thirteen events, and the scores are accordingly very sensitive to changes in outcome. For example, the model assigned a 97 percent likelihood of a Biden victory in Wisconsin on the eve of the election, and this state was decided by less than 21,000 votes, about 0.6 percent. Had there been a different outcome in Wisconsin (or indeed in Michigan, Pennsylvania, or Nevada, all of which were close, and all of which the model predicted Biden would win with probability 93 percent or higher) the model's measured performance would have been much worse. The value of examining relative performance lies in identifying systematic variation across time, the nature of errors, and the value of hybridization. We consider this next.

As a first step to demonstrating the value of combining model and market forecasts, we compute for each state-date pair the simple average estimate for the probability of a Democratic victory. Brier scores for this synthetic forecast are shown in Figure 6, along with the separate model and market scores shown earlier in Figure 3.

By construction, for any given state-date pair, the score for the synthetic forecast must lie between model and market scores; it cannot be lower than both. But this is no longer true when we average across states at any given date. In fact, as Figure 6 clearly shows, there are many dates on which the synthetic forecast performs better than both model and market. Of the 216 days in the series, there are 87 on which the synthetic forecast beats both model and market, including each of the 26 days leading up to the election. Furthermore, across the entire time period and all states, the hybrid forecast received a Brier score of $s_{\text{hybrid}}$ = […] forecasts, which are shown in Figure 7 for each of the states. For any given state the hybrid forecast Brier score necessarily lies between the two other scores. But averaging across states for this date, we get scores of 0.1414 for the market, 0.1339 for the model, and 0.1228 for the hybrid forecast. The market fared poorly in states such as New Hampshire and Minnesota, for which it did not make confident forecasts even though they ended up not being especially close. The model fared poorly in Florida and North Carolina, where it predicted Democratic wins with probabilities 0.78 and 0.67 respectively, neither of which materialized.

Figure 6: Mean Brier Scores for the Model, Market, and a Simple Average over Time
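The claim that a simple average can never beat both components for a single state, yet can beat both once scores are averaged across states, may be easier to see with a small numerical example. The sketch below uses two hypothetical states with invented probabilities (these are not forecasts from the paper): the "model" is confidently wrong in state A, while the "market" is excessively uncertain in state B.

```python
def brier(p, r):
    """Brier score of a single binary forecast p against outcome r (0 or 1)."""
    return (p - r) ** 2


# Hypothetical outcomes: the Democrat loses state A and wins state B.
outcomes = {"A": 0, "B": 1}

# Invented eve-of-election probabilities of a Democratic win (illustrative only).
model = {"A": 0.70, "B": 0.95}
market = {"A": 0.50, "B": 0.50}
hybrid = {s: 0.5 * (model[s] + market[s]) for s in outcomes}


def mean_score(forecast):
    """Average Brier score across states for one forecasting method."""
    return sum(brier(forecast[s], outcomes[s]) for s in outcomes) / len(outcomes)


for state in outcomes:
    per_state = {
        name: brier(f[state], outcomes[state])
        for name, f in (("model", model), ("market", market), ("hybrid", hybrid))
    }
    print(state, per_state)

print("means:", mean_score(model), mean_score(market), mean_score(hybrid))
```

In this toy case the hybrid score lies strictly between the model and market scores in each state, but the mean scores are approximately 0.246 for the model, 0.250 for the market, and 0.218 for the hybrid. The underlying reason is that the squared error is convex: for any two forecasts $a$ and $b$ and outcome $r$, the score of the average equals the average of the two scores minus $(a - b)^2 / 4$, so the hybrid gains most where the two forecasts disagree most.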

To summarize, the model fared poorly in states that it confidently expected to go to the Democratic nominee, and which he ended up losing. The market fared poorly in a number of states where it correctly predicted that the Democrat would win, but with much lower confidence than the model. The hybrid forecast could not beat both model and market for any given state, but was able to beat both when averaging across states, by avoiding the most egregious errors.

The simple unweighted average, taken uniformly across time, is the crudest possible way of generating a hybrid forecast. One alternative is to use prediction market prices as direct inputs in the model, along with information from trial-heat polls and fundamentals. A different approach, which we describe next, consists of constructing markets in which a model is represented by a virtual trader.

Footnote: See, however, Dawes (1979) on the "robust beauty" of linear models and how even unsophisticated models succeed at rule-based prediction and minimizing error.

Figure 7: Brier Scores for Market, Model, and Hybrid Forecasts on November 2

Algorithmic trading is a common feature of modern financial markets, where low latency can be highly rewarding (Budish et al., 2015; Baron et al., 2019). In this sense markets for stocks and derivatives are already hybrid markets. But algorithms in these settings operate as part of an organic trading ecosystem, competing with each other and more traditional strategies implemented by fund managers and retail investors. In order to construct a hybrid prediction market that can be tuned and tested experimentally, we propose a somewhat different approach.

Footnote: Algorithmic trading was also extremely common on the Intrade prediction market (Rothschild and Sethi, 2016), even though it is disallowed on PredictIt.

Given a model that generates forecasts updated with relatively high frequency, one can insert into the market a trading bot that acts as if it believes the model forecast. In order to do so, the bot has to be endowed with a budget and preferences that exhibit some degree of risk aversion. These parameters can be tuned in experimental settings to examine their effects on forecasting accuracy. The bot posts orders to buy and sell securities based on its beliefs (derived from the model), its preferences, and its existing […]

[…] $m$ jurisdictions, where each jurisdiction has $n$ candidates on the ballot. Let $S$ denote an $n \times m$ matrix that represents an outcome realization, with typical element $s_{ij} \in \{0, 1\}$. Columns $s_j$ of $S$ correspond to jurisdictions, and $s_{ij} = 1$ if and only if candidate $i$ is the (unique) winner in jurisdiction $j$. Let $\Omega$ denote the set of all possible electoral outcomes, and let $p : \Omega \to [0, 1]$ denote the probability distribution over these outcomes generated by the model.

Footnote: There is no loss of generality in assuming that all jurisdictions have the same number of candidates, since any with fewer than $n$ candidates can be assigned notional candidates who win with zero probability.

Now suppose that for each jurisdiction there exists a prediction market listing $n$ contracts, one for each candidate. In practice, there will be distinct prices for the sale and purchase of a contract, the best bid and best ask prices. We return to this issue in Section 8 below, but assume for the moment that there is no such difference, and that each contract is associated with a unique price at which it can be bought or sold. Let $Q$ denote a set of prices in these markets, with typical element $q_{ij} \in [0, 1]$. Here $q_{ij}$ is the price at which one can purchase or sell a contract that pays a dollar if candidate $i$ wins in jurisdiction $j$, and pays nothing otherwise. Columns $q_j$ of $Q$ contain the prices for each of the $n$ contracts in jurisdiction $j$.

Footnote: For battleground state markets on PredictIt the bid-ask spread was typically just a penny, with large numbers of contracts available for purchase and sale at most times.

Next consider a virtual trader or bot that represents the model in the market, in the sense that it inherits and updates beliefs over the set of outcomes based on the output of the model. At any given point in time, this trader will be holding a portfolio $(y, Z)$, where $y$ is cash and $Z$ is an $n \times m$ matrix of contract holdings. Element $z_{ij} \in \mathbb{R}$ is the number of contracts held by the trader that reference candidate $i$ in jurisdiction $j$. It is possible for $z_{ij}$ to be negative, which corresponds to a short position: if candidate $i$ wins in jurisdiction $j$ then a trader with $z_{ij} < 0$ […]. Buying a contract that references candidate $i$ in jurisdiction $j$ lowers cash by $q_{ij}$ and raises $z_{ij}$ by one unit, while selling such a contract raises cash by $q_{ij}$ and lowers $z_{ij}$ by one unit. Let $z_j$ denote column $j$ of $Z$, the contract holdings in the market corresponding to jurisdiction $j$.

If the outcome is $S \in \Omega$ when all contracts are resolved, the terminal cash value or wealth resulting from portfolio $(y, Z)$ is

$$w = y + \sum_{j \in M} s_j' z_j,$$

where $M = \{1, \ldots, m\}$ is the set of jurisdictions. A risk-neutral trader would choose a portfolio that maximizes the expected value $E(w)$ of terminal wealth, given her beliefs $p$ and the market prices $Q$, subject to solvency constraints that are discussed further below. A risk-averse trader, instead, will maximize expected utility, given by

$$E(u) = \sum_{S \in \Omega} p(S)\, u\!\left( y + \sum_{j \in M} s_j' z_j \right),$$

where $u : \mathbb{R}_+ \to \mathbb{R}$ is strictly increasing and concave.

Given a starting portfolio $(y, Z)$, beliefs $p$, preferences $u$, and prices $Q$, let $X$ denote an $n \times m$ matrix of trades, where $x_{ij} \in \mathbb{R}$ is the (possibly negative) number of contracts purchased that reference candidate $i$ in jurisdiction $j$. Column $j$ of $X$, which registers trades involving jurisdiction $j$, is denoted $x_j$. The set of trades will be chosen to maximize

$$E(u) = \sum_{S \in \Omega} p(S)\, u\!\left( y + \sum_{j \in M} \left( s_j' (z_j + x_j) - q_j' x_j \right) \right). \tag{4}$$

If $x_{ij} > 0$ then $z_{ij}$ rises by this amount, and $y$ falls by the cost of the contracts. If $x_{ij} < 0$ then $z_{ij}$ falls by the absolute value of this amount, and $y$ rises by the cost of the contracts.

A trading bot programmed to execute transactions in accordance with the maximization of (4) will trade whenever there is a change in model output $p$ or in market prices $Q$. Any such trades must be consistent with solvency even in the worst case scenario. That is, the trading bot chooses $X$ to maximize (4) subject to the constraint:

$$\min_{S \in \Omega} \left( y + \sum_{j \in M} \left( s_j' (z_j + x_j) - q_j' x_j \right) \right) \geq 0.$$

Footnote: Most prediction markets, including PredictIt, operate under a 100 percent margin requirement to ensure contract fulfillment with certainty, so that there is no counter-party risk or risk borne by the exchange itself.

[…] $u$, facing actual prices on PredictIt, would have made money.

A Profitability Test

The analysis in the previous section can be put to use to develop an alternative measure of model performance, both relative to the market and to other models. Specifically, one can ask the hypothetical question of whether a bot endowed with model beliefs would have been profitable over the course of the forecasting period, when matched against the actual market participants. And one can use the level of profits or losses to compare different models, and obtain a performance criterion in which early and late forecasts are treated differently in a systematic and dynamic manner. Profitability of a model is also a necessary condition for survival, without which the model could not have a significant and persistent effect on market prices. Only a profitable model can yield hybrid forecasts that differ meaningfully from market forecasts.

In order to implement this idea, we need to specify preferences for the bot. A useful class of functions for which the degree of risk aversion can be tuned for experimental purposes is that exhibiting constant relative risk aversion (CRRA):

$$u(w) = \begin{cases} \dfrac{1}{1-\rho}\, w^{1-\rho}, & \text{if } \rho \geq 0,\ \rho \neq 1, \\[6pt] \log(w), & \text{if } \rho = 1, \end{cases}$$

where $\rho$ is the degree of relative risk aversion. We report results below for the case of log utility.

One piece of information that is unavailable to us is the complete daily probability distribution $p(S)$ over the set of possible outcomes, where $|S|$ = […] $y = \$1000$ and a contract position $z = 0$. Negative values of $z$ correspond to bets against the Democratic nominee. The evolution of $z$ over time for Wisconsin is shown in Figure 8. Initially the market considered a Biden victory to be more likely than the model did, so the bot entered a short position (negative $z$) in this contract. This started to change towards the end of April, and the bot started to cover this short position by buying contracts, eventually ending up with a long position (positive $z$) that grew unevenly but substantially over time until the election.

Figure 8: Evolution of Contract Holdings for Wisconsin
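To make the behavior of a log-utility bot concrete, the sketch below works out the optimal trade in a single state with one contract on a Democratic victory. It is a simplified illustration under assumptions that go beyond the text: a single price $q$ for buying and selling, no fees or position limits, and beliefs and prices strictly between 0 and 1. Under those assumptions, maximizing $p \log(y - qx + z + x) + (1 - p)\log(y - qx)$ over the trade $x$ gives the closed form used in `log_utility_trade`. The function names are hypothetical, and this is not claimed to be the procedure used to generate the reported results.

```python
def log_utility_trade(p, q, cash, holdings):
    """Optimal number of contracts to buy (negative means sell) for a
    log-utility trader in one binary market.

    p: believed probability that the contract pays out
    q: contract price (assumed identical for buying and selling, no fees)
    cash, holdings: current portfolio (y, z)
    """
    if not (0 < p < 1 and 0 < q < 1):
        raise ValueError("p and q must lie strictly between 0 and 1")
    # First-order condition of p*log(W_win) + (1-p)*log(W_lose), solved for x.
    return (p * (1 - q) * cash - (1 - p) * q * (cash + holdings)) / (q * (1 - q))


def apply_trade(q, cash, holdings, x):
    """Return the portfolio (cash, holdings) after buying x contracts at price q."""
    return cash - q * x, holdings + x


def wealth_by_outcome(cash, holdings):
    """Terminal wealth if the contract pays a dollar, and if it pays nothing."""
    return cash + holdings, cash


# Illustrative numbers only: the bot starts with $1000 and no contracts, believes
# the event has probability 0.6, and the market price is 0.7, so it sells short.
p, q = 0.6, 0.7
x = log_utility_trade(p, q, cash=1000.0, holdings=0.0)
cash, holdings = apply_trade(q, 1000.0, 0.0, x)
w_win, w_lose = wealth_by_outcome(cash, holdings)
print(x, cash, holdings, w_win, w_lose)

# With unchanged beliefs and prices there is nothing further to trade.
x_again = log_utility_trade(p, q, cash, holdings)
```

A useful property of this special case is that the post-trade wealth equals $p V / q$ if the event occurs and $(1 - p) V / (1 - q)$ if it does not, where $V = y + qz$ is the market value of the portfolio. Both are strictly positive, so the solvency constraint is never binding for a log-utility trader with interior beliefs.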

There are several ﬂuctuations and interruptions in this growth, reﬂecting changes inbeliefs and prices that warranted a less aggressive posture. But by the eve of the electionthe bot held less than $90 in cash, along with almost 1,500 contracts. This portfolio had amarket value of almost 1,100 on November 2. That is, the bot could have sold all contractsat the then-prevailing market price and ended up with a ten percent return without takingany risk into election day, insulated completely from the eventual outcome. These proﬁtswere generated by buying and holding contracts that had subsequently risen in price, orby buying and selling along the way. But the bot did not liquidate its position, and oncethe contracts were resolved, ended up with a cash position of $1,574 for a 57% return.Had Biden lost Wisconsin this would have resulted in a substantial loss, exceeding 90%.Repeating this analysis for the other twelve states, we get the paths shown in Figure9. In all cases the bot began by betting against a Democratic victory but quickly changedcourse in many states. Several states follow the same pattern as Wisconsin, in that alarge positive position was built up over time. These states resulted in signiﬁcant prof-its (Michigan, Minnesota, New Hampshire, Pennsylvania) or signiﬁcant losses (Florida).There are also some states in which trading was much more cautious, because model18orecasts and market prices were not far apart. And in many states contract holdingsswitched often from short to long and back again.

Figure 9: Bot Contract Holdings for 12 Battleground States.

Additional details for all states, including final portfolios, terminal payoffs, profits, and rates of return, may be seen in Table 1. Only in Texas did the bot hold a short position in a Democratic victory on the eve of the election; elsewhere the model favored the Democrat to a greater degree than the market. The market value of all portfolios on the eve of the election was just slightly greater than the starting value of $13,000. That is, if the bot had liquidated all positions immediately prior to resolution it would have basically broken even. But the eventual payoff was much larger, with a profit exceeding $2,000 and a 16% return, as profits in some states outweighed losses in others. In general the bot made money where the Democrat won and lost where the Republican prevailed, with Texas the single exception.

In Texas the cash position exceeded the value of the portfolio on election day, because the contract position was negative; had Biden won Texas the bot would have had to pay out about $37 and would have ended up with a loss in this state.

| State | Cash | Contracts | Value | Payoff | Profit | Return |
|---|---|---|---|---|---|---|
| Arizona | $566.45 | 748.55 | $944.35 | $1,314.99 | $314.99 | 31% |
| Florida | $339.96 | 1351.73 | $904.28 | $339.96 | -$660.04 | -66% |
| Georgia | $743.54 | 735.23 | $1,034.72 | $1,478.76 | $478.76 | 48% |
| Iowa | $855.20 | 722.99 | $1,039.50 | $855.20 | -$144.80 | -14% |
| Michigan | $68.29 | 1383.63 | $1,004.27 | $1,451.92 | $451.92 | 45% |
| Minnesota | $54.98 | 1305.55 | $1,021.09 | $1,360.53 | $360.53 | 36% |
| Nevada | $190.84 | 1094.15 | $976.93 | $1,284.99 | $284.99 | 28% |
| New Hampshire | $72.74 | 1274.46 | $984.85 | $1,347.20 | $347.20 | 35% |
| North Carolina | $607.15 | 880.03 | $1,004.03 | $607.15 | -$392.85 | -39% |
| Ohio | $904.82 | 669.22 | $1,096.97 | $904.82 | -$95.18 | -10% |
| Pennsylvania | $138.91 | 1359.79 | $921.43 | $1,498.70 | $498.70 | 50% |
| Texas | $1,027.13 | -36.95 | $1,016.47 | $1,027.13 | $27.13 | 3% |
| Wisconsin | $89.92 | 1484.09 | $1,088.83 | $1,574.01 | $574.01 | 57% |
| Total | | | $13,037.72 | $15,045.36 | $2,045.36 | 16% |

Table 1: Terminal portfolios, payoffs, and profits in battleground states.
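The accounting behind Table 1 can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the authors' code: it assumes each contract pays one dollar if the Democrat wins the state and nothing otherwise, and it takes the starting budget of $1,000 per state from the fact that each profit in the table equals the payoff minus 1,000. The names `PORTFOLIOS`, `payoff`, and `settle` are hypothetical. Results can differ from the table by a cent or two because the published cash and contract figures are rounded.

```python
# Election-eve portfolios copied from Table 1: (state, cash, contracts, democrat_won).
# The democrat_won flags record the actual 2020 result in each state.
PORTFOLIOS = [
    ("Arizona", 566.45, 748.55, True),
    ("Florida", 339.96, 1351.73, False),
    ("Georgia", 743.54, 735.23, True),
    ("Iowa", 855.20, 722.99, False),
    ("Michigan", 68.29, 1383.63, True),
    ("Minnesota", 54.98, 1305.55, True),
    ("Nevada", 190.84, 1094.15, True),
    ("New Hampshire", 72.74, 1274.46, True),
    ("North Carolina", 607.15, 880.03, False),
    ("Ohio", 904.82, 669.22, False),
    ("Pennsylvania", 138.91, 1359.79, True),
    ("Texas", 1027.13, -36.95, False),
    ("Wisconsin", 89.92, 1484.09, True),
]

# Assumption: starting budget per state, inferred from Profit = Payoff - 1000.
INITIAL_CASH = 1000.0


def payoff(cash, contracts, democrat_won):
    """Terminal cash: each contract pays $1 only if the Democrat wins.

    A negative contract count is a short position, so a Democratic win
    forces the bot to pay out instead of collecting.
    """
    return cash + contracts if democrat_won else cash


def settle(portfolios, initial_cash=INITIAL_CASH):
    """Return payoff, profit, and rate of return for each state."""
    results = {}
    for state, cash, contracts, democrat_won in portfolios:
        final = payoff(cash, contracts, democrat_won)
        profit = final - initial_cash
        results[state] = {
            "payoff": final,
            "profit": profit,
            "return": profit / initial_cash,
        }
    return results


results = settle(PORTFOLIOS)
total_payoff = sum(r["payoff"] for r in results.values())
total_profit = total_payoff - INITIAL_CASH * len(PORTFOLIOS)
```

Calling `payoff(1027.13, -36.95, True)` gives the hypothetical Texas outcome described above: a Biden win there would have left the bot with roughly $990, a small loss.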

The payoffs and profits shown in Table 1 are based on the actual election outcomes. As a check on robustness, one might ask how the model would have performed if one or more close states had been decided differently. There were three states that were especially close, decided by less than one percentage point: Georgia (0.24 percent), Arizona (0.31 percent) and Wisconsin (0.62 percent). All three were won by the Democrat. Table 2 shows the payoffs and profits that would have accrued to the bot had one or more of these been won by the Republican instead. In each case profits would have been lower, since the bot was betting on a Democratic victory in all three states. If any one of these states had been decided differently, profits would still have been positive. However, if the Democrat had lost Wisconsin along with one or both of Georgia and Arizona, the bot would have made a loss overall, largely because the model's very high confidence that the Democrat would win Wisconsin led the bot to accumulate a substantial number of contracts.

The hypothetical profits in Table 2 provide a measure of robustness by identifying the number of close states (based on some specified threshold) that would have to flip in order for the model to incur a loss. In this case the critical number or "robustness diameter" is two. In comparing the performance of two or more models against the market, one can consider both profitability (which depends in part on pure chance) and the robustness diameter in this way.

| Flipped State(s) | Payoff | Profit | Rate |
|---|---|---|---|
| Georgia | $14,310.14 | $1,310.14 | 10.08% |
| Arizona | $14,296.82 | $1,296.82 | 9.98% |
| Wisconsin | $13,561.27 | $561.27 | 4.32% |
| Georgia, Arizona | $13,561.60 | $561.60 | 4.32% |
| Georgia, Wisconsin | $12,826.05 | -$173.95 | -1.34% |
| Arizona, Wisconsin | $12,812.73 | -$187.27 | -1.44% |
| Georgia, Arizona, Wisconsin | $12,077.51 | -$922.49 | -7.10% |

Table 2: Hypothetical payoffs and profits if the closest states had been decided differently.
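One way to compute the robustness diameter is to enumerate subsets of the close states in order of size and stop at the first size for which some subset produces a loss. The sketch below does this using only figures reported in Tables 1 and 2. It assumes that flipping a state the Democrat won simply removes the one-dollar payout on each contract held there. The function names are hypothetical, and the results match Table 2 only to within a few cents because the inputs are rounded.

```python
from itertools import combinations

TOTAL_STAKE = 13000.0      # thirteen states at $1,000 each (payoff minus profit in Table 2)
ACTUAL_PAYOFF = 15045.36   # total payoff reported in Table 1

# Contracts held on election eve in the three closest states (Table 1).
CLOSE_STATE_CONTRACTS = {
    "Georgia": 735.23,
    "Arizona": 748.55,
    "Wisconsin": 1484.09,
}


def profit_if_flipped(flipped):
    """Overall profit had the states in `flipped` gone to the Republican."""
    lost_payouts = sum(CLOSE_STATE_CONTRACTS[state] for state in flipped)
    return ACTUAL_PAYOFF - lost_payouts - TOTAL_STAKE


def robustness_diameter():
    """Smallest number of close states whose flip produces an overall loss.

    Returns None if no combination of the close states yields a loss.
    """
    states = list(CLOSE_STATE_CONTRACTS)
    for size in range(1, len(states) + 1):
        for subset in combinations(states, size):
            if profit_if_flipped(subset) < 0:
                return size
    return None
```

With these inputs `robustness_diameter()` returns 2: every single flip leaves a profit, while flipping Wisconsin together with either Georgia or Arizona does not.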

The profitability test explored here may be improved by using the complete model-derived probability distribution over outcomes rather than just the marginal distributions for each state, and allowing cash to flow freely across markets. But even in its current form, it provides a new and valuable evaluation criterion that can be used to compare models with each other, and to assess the value of hybrid forecasting.

Note, however, that the risk preferences assigned to the bot will affect the aggressiveness of trading, position sizes, profitability, and the robustness diameter.

To this point we have not distinguished between bid and ask prices, which will typically differ and cause the purchase price to be slightly higher than the sale price. We have also assumed that any demands or supplies can be met at prevailing market prices. But in the absence of any prices at which trading can occur, the bot representing the model can post prices at which it would be willing to trade, thus providing liquidity to the market. This is important for the development of experimental markets which are not already populated with a large number of active traders.

As an example consider a model that forecasts a single event with $n \geq 2$ possible outcomes, each with its own contract: if the outcome is $i$ then that contract pays out while the others expire worthless. Suppose that a model generates a probability distribution $p^* = (p^*_1, \ldots, p^*_n)$ over the possible outcomes. At any point in time, the trading bot will have a cash position $y \geq 0$ and an asset position $z = (z_1, \ldots, z_n)$, where each $z_i$ can be positive or negative (representing a long or short position). The bot is endowed with a utility function $u(w)$ where $w$ is the cash holding after resolution. This is uncertain at the time of trading, and the bot places orders so as to maximize expected utility as before.

Now consider any particular bin or outcome interval $i$. We need to find the highest price at which the bot will be willing to buy shares, and the lowest price at which it will sell. We also need to compute the quantities posted at those prices. Suppose that at any hypothetical price $p_i$ in market $i$ the bot places an order to buy $x_i$ shares in this market. Then, given the beliefs $p^*$, its expected utility is

$$E(u) = p^*_i \, u(z_i + x_i + y - p_i x_i) + \sum_{j \neq i} p^*_j \, u(z_j + y - p_i x_i). \tag{5}$$

Maximizing this yields a demand $x_i(p_i)$ for any market $i$ and price $p_i$.
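For intuition, the maximization in (5) has a closed form in one special case: log utility with no existing contract holdings. Setting $u(w) = \log(w)$ and $z = 0$, the first-order condition gives demand $y(p^*_i - p_i) / (p_i(1 - p_i))$, which is positive below the belief and negative above it. The sketch below uses this formula to find the bot's best bid and ask in each market. The inputs (beliefs of 0.3, 0.5, and 0.2 over three outcomes, $1,000 in cash) are chosen to match the three-bin illustration in this section. The one-cent price grid and the rounding down to whole shares are assumptions made here, not something the paper states, and the function names are hypothetical.

```python
import math


def log_utility_demand(belief, price, cash):
    """Shares demanded at `price` by a log-utility bot holding only cash.

    Closed-form maximizer of expected log wealth when the bot has no
    contracts in any market; negative values are offers to sell.
    """
    return cash * (belief - price) / (price * (1.0 - price))


def quotes(belief, cash, ticks=100):
    """Return ((bid_price, bid_qty), (ask_price, ask_qty)) on a price grid.

    The bid is the highest grid price at which the bot would buy at least
    one whole share; the ask is the lowest grid price at which it would
    sell at least one whole share. Either side is None if no such price exists.
    """
    bid = None
    ask = None
    for k in range(1, ticks):
        price = k / ticks
        shares = log_utility_demand(belief, price, cash)
        if math.floor(shares) >= 1:
            # Demand falls as price rises, so the last hit is the highest bid.
            bid = (price, math.floor(shares))
        elif ask is None and math.floor(-shares) >= 1:
            # The first price with a whole share on offer is the lowest ask.
            ask = (price, math.floor(-shares))
    return bid, ask


beliefs = (0.3, 0.5, 0.2)
order_book = [quotes(b, 1000.0) for b in beliefs]
# order_book[0] -> ((0.29, 48), (0.31, 46))
# order_book[1] -> ((0.49, 40), (0.51, 40))
# order_book[2] -> ((0.19, 64), (0.21, 60))
```

Once the bot holds contracts, the wealth levels in the different outcomes are no longer equal and this closed form does not apply; demand would then have to be found by maximizing (5) numerically.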
Let $p^b_i$ denote the highest price at which $x_i(p_i) > 0$ and $p^s_i$ the lowest price at which $x_i(p_i) < 0$. The bot then places orders to buy $x_i(p^b_i)$ shares at price $p^b_i$ and orders to sell $|x_i(p^s_i)|$ shares at price $p^s_i$ in each market $i$. Note that all buy and sell prices are functions of $(y, z)$ and will change each time the bot is able to buy or sell any contracts.

The bot is effectively acting as a market maker, as in Glosten and Milgrom (1985) and Kyle (1985), but instead of making inferences from the trading activity of counterparties, it derives and updates its beliefs based on the model. It acts as an informed trader rather than an uninformed market maker, and is willing to take and hold large directional positions.

To illustrate, consider a forecasting problem with three options or bins, and a model that assigns probability $p^* = (0.3, 0.5, 0.2)$. The problem could involve discrete outcomes that can take one of three values, or continuous outcomes that can fall into three specified ranges. The model forecast is represented in the market by a bot having initial cash endowment $y = \$1000$ and preferences given by $u(w) = \log(w)$. The bot initially holds no contracts, so $z = (0, 0, 0)$.

To begin with, the bot places six orders, one to buy and one to sell in each of the three bins, resulting in the following order book:

| Bin | Bid Price | Bid Quantity | Ask Price | Ask Quantity |
|---|---|---|---|---|
| 1 | 0.29 | 48 | 0.31 | 46 |
| 2 | 0.49 | 40 | 0.51 | 40 |
| 3 | 0.19 | 64 | 0.21 | 60 |

This is what human traders see when they view the market. For each bin the prices posted by the bot straddle the beliefs generated by the model, though this can change as cash and asset positions are altered through the process of trading, as we shall see momentarily. The bid-ask spreads are tight, and there is plenty of liquidity available for humans who wish to trade.

Now suppose that the offer to buy 48 shares at 0.29 is met (a human trader chooses to bet against the event corresponding to Bin 1, paying 0.71 per contract for 48 contracts, each of which will pay out a dollar if this event does not occur). Then the asset position of the bot goes to $z = (48, 0, 0)$ and its cash position goes to $1000 - 48(0.29) = 986.08$. The bot then cancels its remaining orders and posts new ones in all markets, not just in the market in which the transaction occurred, resulting in the following order book:

| Bin | Bid Price | Bid Quantity | Ask Price | Ask Quantity |
|---|---|---|---|---|
| 1 | 0.28 | 51 | 0.30 | 47 |
| 2 | 0.50 | 28 | 0.51 | 11 |
| 3 | 0.20 | 17 | 0.21 | 42 |

Note that the bot is now willing to sell shares in the first bin at price 0.30, even though this exactly matches its belief. This is because it can secure a reduction in the risk of its portfolio, without lowering expected value. In fact, if it continues to accumulate shares in the first bin, it will offer to sell at prices below 0.30, lowering expected return but also lowering risk.

A naive observer of this market will see a bot acting as a market maker, placing orders on both sides of the market, and canceling and replacing these whenever any order is met. But the strategy implemented by the bot is not that of a traditional market maker; it is maximizing expected utility subject to model-based beliefs about the various outcomes. In fact, when the model is updated the bot's imputed beliefs $p^*$ change, and it will respond by canceling and replacing all orders, possibly resulting in the immediate execution of trades against humans. And human traders can trade against each other, or post limit orders that compete with those posted by the bot.

The hybrid prediction market described here is accordingly quite different from one in which an automated market maker is a counterparty to all trades, and sets prices based on a potential function (Hanson, 2003; Chen and Pennock, 2007; Sethi and Vaughan, 2016).

Iowa Electronic Markets, there will be substantial liquidity available to human traders at all times. No matter what the model forecast or trading history, at least one side of the order book will always be populated. As a result the spreads between bid and ask prices will be narrower, providing greater incentives for market participation.

The analysis in this paper ought not to be seen as a meaningful horse race between models and markets, since all the forecasts reference just thirteen closely connected and correlated events, and a different outcome in a couple of states could have resulted in a very different evaluation of performance. Instead, our primary contribution should be seen as the development of a particular approach to hybrid forecasting that can be implemented, calibrated and tested empirically. It is important to be attentive to concerns about external validity, however, since real-world markets have recently been prone to high volatility arising from the spread of misinformation or coordinated attempts to inflate prices (Strauss, 2021; Phillips and Lorenz, 2021).

Models extrapolate from available data in a manner that is potentially objective, transparent, and consistent. They are designed to produce good average calibration across the probability space. Markets aggregate real-time human judgement and provide agility and versatility. They balance a pool of knowledge, and are able to extrapolate from sparse data further from resolution. We have shown that even a crude simple average of these methods can outperform each one of them when examining data from the 2020 US presidential