Measuring the Completeness of Theories
Drew Fudenberg, Jon Kleinberg, Annie Liang, Sendhil Mullainathan
October 17, 2019

∗ This is an updated version of "The Theory is Predictive, but is it Complete?" We thank Alberto Abadie, Amy Finkelstein, and Johan Ugander for helpful comments. We are also grateful to Adrian Bruhin, Helga Fehr-Duda, Thomas Epper, Kevin Leyton-Brown, and James Wright for sharing data with us.
† Department of Economics, MIT. ‡ Department of Computer Science, Cornell University. § Department of Economics, University of Pennsylvania. ¶ Department of Economics, University of Chicago.
Abstract
We use machine learning to provide a tractable measure of the amount of predictable variation in the data that a theory captures, which we call its "completeness." We apply this measure to three problems: assigning certainty equivalents to lotteries, initial play in games, and human generation of random sequences. We discover considerable variation in the completeness of existing models, which sheds light on whether to focus on developing better models with the same features or instead to look for new features that will improve predictions. We also illustrate how and why completeness varies with the experiments considered, which highlights the role played by the choice of which experiments to run.
Suppose we have a theory of the labor market that says that a person's wages depend on their knowledge and capabilities. We can test this theory by looking at whether more education indeed predicts higher wages in labor data. If it does, this would provide evidence in support of the theory, but it would not tell us whether an alternative theory might be even more predictive. The question of whether there are more predictive theories, and if so how much more predictive they might be, raises the issue of completeness: How close is the performance of a given theory to the best performance that is achievable in the domain? In other words, how much of the predictable variation in the data is captured by the theory?

We cannot gauge the completeness of a theory solely through the level of its predictive accuracy because there is intrinsic noise in the outcome. For example, an accuracy of 55% is strikingly successful for predicting a (discretized) stock movement based on past returns, but extremely weak for predicting the (discretized) movement of Earth based on the masses and positions of the sun and all of the other planets. These two problems differ in the best achievable prediction performance they permit, and so the same quantitative level of predictive accuracy must be interpreted differently in the two domains.

One way to view the contrast between these two problem domains is as follows. In each case, an instance i of the prediction problem consists of a vector $x_i$ of measured features or covariates, and a hidden outcome $y_i$ that must be predicted. In the case of astronomical bodies, we believe that the measured features are sufficient to make highly accurate predictions over short time scales. In the case of stock prices, the measured features—past prices and returns—are only a small fraction of the information that we believe may be relevant to future prices. Thus, the variation in stock movements conditioned on the features we know is large, while planetary motions are well-predicted by known features.

The point then is that prediction error represents a composite of two things: first, the opportunity for a more predictive model; and second, intrinsic noise in the problem due to the limitations of the feature set. If we want to understand how much room there is for improving the predictive performance of existing theories within a given domain—holding constant the set of features that we know how to measure—we need a way to separate these two effects.

We propose that a good way to distinguish between these sources of prediction error is to compare the performance of the existing models with the best achievable level of prediction for our feature set, as computed by a Table Lookup algorithm. This algorithm finds the best prediction for each feature vector, assuming that the distribution of training instances approximates the actual relationship between the observable features and the outcome.
With an ideal (i.e. infinite) data set, Table Lookup minimizes the expected out-of-sample error, but Table Lookup can be quite imperfect when data is sparse. Appendix A provides remarks that justify the use of the performance of Table Lookup as an approximation to the best achievable accuracy in our applications.

We illustrate the usefulness of Table Lookup by applying it to three different problem domains: the evaluation of risk, initial play in games, and human perception of randomness. These are all important topics in economics, with a long line of established models. We use our benchmark to evaluate the completeness of leading models from each domain. Interestingly, we find that the best model we use for the perception of randomness is only 24% complete, while Cumulative Prospect Theory is 95% complete despite having a mean-squared prediction error of 67.38. This, and the subsequent observations we make in Sections 2.1-2.3, are informative about the problem domains and the status of their associated models.

Our main contribution, however, is methodological: since most economic behaviors cannot be perfectly predicted given the available features, the predictive accuracies of our models are difficult to interpret on their own. Understanding not just how well existing models predict, but also how complete they are, is important for guiding the development of the theory. In the three applications in this paper, we show how these benchmarks can be constructed, and that they reveal non-obvious insights into the performance of our existing models. We also illustrate how and why our completeness measure varies with the experiments considered, for example with the choice of lotteries used to evaluate risk preferences. This dependence highlights the key role played by the choice of which experiments to run.

The Table Lookup benchmark is applicable to domains beyond the three that we describe here, but not in all of them, and we discuss various limitations to its applicability and interpretation throughout the paper. First, as we explain in Section 1.3, Table Lookup approximates the best achievable predictive accuracy only when the data set is large relative to the number of unique feature vectors. This requires either that features are discrete-valued (as in our application in Section 2.3), or that the available data involves observations from a finite number of unique instances (as in our two applications in Sections 2.1 and 2.2). Although Table Lookup is not feasible for all problems, the range of applications in the paper suggests that Table Lookup is more effective than one might initially suspect.

Second, our completeness measure depends on a specified set of features, and is evaluated on a given data set. If we change either the underlying feature set or the data, we would expect the measurement of completeness to also change, as we discuss in Section 3.2. The dependence of completeness on what data set we use is important to keep in mind. Moreover, as we show in Section 2.2, the way the completeness of a model varies across data sets can shed light on the domains in which the model performs well or performs poorly.
1 Measuring Completeness

1.1 Prediction Problems

In a prediction problem, there is an outcome Y whose realization is of interest, and features $X_1, \dots, X_N$ that are statistically related to the outcome. The goal is to predict the outcome given the observed features. Some examples include predicting an individual's future wage based on childhood covariates (city of birth, family income, quality of education, etc.), or predicting a criminal defendant's flight risk based on their past record and properties of the crime (Kleinberg et al., 2017). We focus here on three prediction problems that emerge from experimental economics:

Example 1. Can we predict the valuations that people assign to various money lotteries?
Example 2. Can we predict how people will play the first time they encounter a given simultaneous-move game?
Example 3. Given a target random process—for example, a Bernoulli random sequence—can we predict the errors that a human makes while mimicking this process?

Formally, suppose that the observable features belong to some space $X = X_1 \times \dots \times X_N$ and the outcome belongs to Y. A map $f : X \to Y$ from features to outcomes is a (point) prediction rule. [Footnote: Note that a prediction of a probability distribution over Y can be cast as the prediction of a point in the space $Y' = \Delta(Y)$ of distributions on Y.] Many economic models can be described as a parametric family of prediction rules $F = (f_\theta)_{\theta \in \Theta}$. For example, if our model class F imposes a linear relationship between the features and the outcome, then θ would define a vector of weights applied to each of the features. In the application we study in Section 2.1, the expected utility class F describes a family of utility functions $u(z) = z^\theta$ over dollar amounts, and the parameter θ reflects the degree of risk aversion.

1.2 Model Error and Completeness

We suppose that our prediction problem comes with a loss function $\ell : Y \times Y \to \mathbb{R}$, where $\ell(y', y)$ is the error assigned to a prediction of $y'$ when the realized outcome is y. The commonly used loss functions mean-squared error and classification loss correspond to $\ell(y', y) = (y' - y)^2$ and $\ell(y', y) = \mathbb{1}(y' \neq y)$ respectively.
Definition. The expected error (or risk) of prediction rule f on a new observation (x, y) generated according to the joint distribution P is

$$E_P(f) = \mathbb{E}_P[\ell(f(x), y)].$$

The prediction rule in the class F that minimizes the expected prediction error is the one associated with the parameter value

$$\theta^*_P = \arg\min_{\theta \in \Theta} E_P(f_\theta).$$

The expected error of this "best" rule in F is $E_P(f_{\theta^*_P})$. In Section 1.3.1, we discuss how to estimate $E_P(f_{\theta^*_P})$ on finite data; here we discuss how to interpret it. To understand a model's error, it is helpful to distinguish between two very different error sources.

First, if the conditional distribution Y | X is not degenerate, then even the ideal prediction rule

$$f^*_P(x) = \arg\min_{y' \in Y} \mathbb{E}_P[\ell(y', y) \mid x]$$

does not predict perfectly. [Footnote: Different loss functions are typically used when predicting distributions; see e.g. Gneiting and Raftery (2007).]

Definition. The irreducible error in the prediction problem is the expected error

$$E_P(f^*_P) = \mathbb{E}_P[\ell(f^*_P(x), y)] \qquad (1)$$

of the ideal rule on a new test observation.

The irreducible error is an upper bound on how well we can predict Y using the features X. A different source of prediction error is the specification of which prediction rules $f : X \to Y$ are in the class F. Typically the best possible model will not be an element of F—that is, most sets of models are at least slightly misspecified. If F leaves out an important regularity in the data, there may exist models outside of F that give much better predictions on this domain.

These two sources of prediction error have very different implications for how to improve prediction in the domain. If the achieved performance of the model is substantially lower than the best feasible performance, then it may be possible to achieve large improvements without seeking additional inputs, for example by identifying new regularities in behavior. On the other hand, if the achieved prediction error is close to the best achievable level of prediction for our feature set, then only marginal gains are feasible from identification of new structure. This encourages consideration of prediction rules $f : X' \to Y$ based on some larger feature space $X'$. [Footnote: On the other hand, expanding the model class risks overfitting, so more parsimonious model classes can lead to more accurate predictions when data is scarce (Hastie et al., 2009). As we discuss in Sections 1.3.2 and 1.4, all of the data sets we consider here are large relative to the number of features.]

We propose the ratio of the reduction in prediction error achieved by the model, compared to the achievable reduction, as a measure of how close the model comes to the best achievable performance. We call this ratio the model's completeness. To operationalize this measure, let $f_{naive} : X \to Y$ be a naive rule suited to the prediction problem; this rule—such as "predict uniformly at random"—is meant to represent a baseline: the performance achievable with essentially no knowledge of the problem. The toy example below makes these objects concrete.
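To fix ideas, here is a small sketch (ours, not from the paper) that computes the naive error, the best error within a restricted class, and the irreducible error for a toy discrete problem under squared loss; the distribution and the constant-prediction model class are invented for illustration.

```python
import numpy as np

# Toy joint distribution P over a binary feature x and a binary outcome y.
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0.0: 0.8, 1.0: 0.2},   # y | x = 0
               1: {0.0: 0.3, 1.0: 0.7}}   # y | x = 1

def expected_sq_error(f):
    # E_P[(f(x) - y)^2], computed exactly by summing over the support.
    return sum(p_x[x] * q * (f(x) - y) ** 2
               for x in p_x for y, q in p_y_given_x[x].items())

def ideal_rule(x):
    # Under squared loss the ideal rule is the conditional mean E_P[y | x].
    return sum(y * q for y, q in p_y_given_x[x].items())

irreducible = expected_sq_error(ideal_rule)            # E_P(f*_P): here 0.185
naive = expected_sq_error(lambda x: 0.5)               # E_P(f_naive): here 0.25
best_in_class = min(expected_sq_error(lambda x, c=c: c)   # best *constant* rule,
                    for c in np.linspace(0, 1, 101))      # a restricted class F
print(irreducible, naive, best_in_class)
# The ratio defined next, (naive - best_in_class) / (naive - irreducible),
# is the completeness of the constant-prediction class on this problem.
```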
Definition.
The completeness of the model class F is

$$\frac{E_P(f_{naive}) - E_P(f_{\theta^*_P})}{E_P(f_{naive}) - E_P(f^*_P)}. \qquad (2)$$

This measure depends on the joint distribution P. We expect the conditional distribution P(y | x) to be a fixed distribution describing the true dependence of the outcome on the features, but the marginal distribution over the feature space X is frequently a choice variable of the analyst—e.g. which lotteries or games to run in an experiment. As we show in Section 2.2, when we change this marginal distribution, we obtain different measures of completeness for the same model. Ideally, we would like the chosen distribution over features to be the one that is most economically relevant, but in practice we may not know what that is.

1.3 Estimating Completeness from Data

Neither the true joint distribution P over features and outcomes nor the derived quantities $E_P(f_{naive})$, $E_P(f_{\theta^*_P})$, and $E_P(f^*_P)$ are directly observable, but they can be estimated from data. We describe below an approach (tenfold cross-validation) that is standard for estimating expected prediction errors, and describe an algorithm—Table Lookup—for approximating the ideal prediction rule $f^*_P$.

1.3.1 Cross-Validated Prediction Error

To evaluate the predictive accuracy of a model class F on a finite data set, we first choose between prediction rules in F based on how well they predict a sample of training observations. Then we evaluate the trained rule on a new set of test observations.

Formally, for any integer n let $Z^n = (X \times Y)^n$ be the space corresponding to n observations of (x, y), and suppose the analyst has access to a data set $Z \in Z^n$. Using the procedure of K-fold cross-validation, this data is randomly split into K equally-sized disjoint subsets $Z_1, \dots, Z_K$. In each iteration $1 \le i \le K$ of the procedure, the subset $Z_i$ is identified as the test data and the remaining subsets are used as training data. The i-th parameter estimate is the one that minimizes average loss on the i-th training set:

$$\theta^*_i = \arg\min_{\theta \in \Theta} \sum_{(x,y) \in \cup_{j \neq i} Z_j} \ell(f_\theta(x), y).$$

[Footnote: There are no free parameters to estimate for the naive rule $f_{naive}$.] The out-of-sample error of the estimated $f_{\theta^*_i}$ on the test set $Z_i$ is

$$\mathrm{CVErr}_i = \frac{1}{m} \sum_{(x,y) \in Z_i} \ell(f_{\theta^*_i}(x), y), \qquad (3)$$

where m is the number of observations in each fold. If the data in Z are drawn i.i.d. from P, the average out-of-sample error

$$\mathrm{CVErr}(Z) = \frac{1}{K} \sum_{i=1}^K \mathrm{CVErr}_i \qquad (4)$$

is a consistent estimator for the expected error $E_P(f_{\theta^*_P})$. The display in (4) is known as the K-fold cross-validated prediction error. In the main text, we will more simply refer to it as the prediction error of the model class F, understanding that it is a finite-data estimate. Below we write $CV_{naive}(Z)$ for the cross-validated prediction error of the naive rule $f_{naive}$ and $CV_F(Z)$ for the cross-validated prediction error of the model class F. These are respectively our estimates for $E_P(f_{naive})$ and $E_P(f_{\theta^*_P})$. A sketch of this procedure follows.
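The following sketch implements the K-fold procedure above under stated assumptions: `fit` stands in for whatever routine minimizes training loss over Θ, and `loss` is evaluated pointwise; both names are ours.

```python
import numpy as np

def cross_validated_error(X, y, fit, loss, K=10, seed=0):
    """K-fold cross-validated prediction error, as in displays (3) and (4).

    X, y: numpy arrays of features and outcomes. fit(X_train, y_train)
    returns a trained prediction rule (a callable); loss(y_pred, y_true)
    is the pointwise loss. Returns CVErr(Z) together with the per-fold
    errors CVErr_i (used later for standard errors).
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    per_fold = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        f = fit(X[train], y[train])      # theta*_i: best rule on training folds
        per_fold.append(np.mean([loss(f(x), yy)
                                 for x, yy in zip(X[test], y[test])]))
    return float(np.mean(per_fold)), per_fold
```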
1.3.2 The Table Lookup Benchmark

To estimate the expected error of the ideal rule $E_P(f^*_P)$, we apply a Table Lookup algorithm to each iteration i of cross-validation. Formally, let

$$f^{TL}_i = \arg\min_{f \in Y^X} \sum_{(x,y) \in \cup_{j \neq i} Z_j} \ell(f(x), y)$$

be the function that minimizes prediction error on the training data, where we search across the complete (unrestricted) class of mappings from X to Y. Then define the cross-validated Table Lookup error as in (3) and (4). This measure, which we will denote $CV_{TL}$, is a consistent estimator for the irreducible error $E_P(f^*_P)$. How good an approximation it is depends on a comparison between the size of the data n and the "effective" size of the feature set X, by which we mean the number of unique feature vectors x that appear in the data. [Footnote: Table Lookup predicts well when we have a large number of observations for each unique feature vector $x \in X$. This requires either that the feature space X is finite (as in our application in Section 2.3, where $X = \{H, T\}^7$), or that the data-generating measure P has finite support over X (as in our two applications in Sections 2.1 and 2.2). In some settings with a continuum of possible features there may be very few observations for a given feature vector. In these cases, we cannot directly use Table Lookup to approximate the ideal performance, and should instead use approaches that make assumptions on how outcomes are related at "nearby" features, e.g. kernel regression.]

One way to assess the accuracy of $CV_{TL}$ is to look at the standard error of the cross-validated prediction errors, which is

$$\sqrt{\frac{1}{K}\,\mathrm{Var}(\mathrm{CVErr}_1, \dots, \mathrm{CVErr}_K)}.$$

We report these standard errors for each of our applications and model classes. It turns out that for each of the applications we look at, and we suspect for other data sets as well, the Table Lookup standard errors are relatively small. (See Appendix A.1 for more detail.) As another test, we compare the performance of Table Lookup with a different machine learning algorithm that is better suited to smaller data sets (bagged decision trees), and find that Table Lookup's performance is comparable to, but better than, that algorithm's in all of our applications (see Appendix A.2). These analyses suggest that the Table Lookup performance is indeed a reasonable approximation for the best achievable performance in each of our applications.

In place of the ideal completeness measure described in (2), we compute the following ratio from our data:

$$\frac{CV_{naive} - CV_F}{CV_{naive} - CV_{TL}}.$$

This is the reduction in cross-validated prediction error achieved by the model (relative to the naive baseline) compared to the reduction achieved by Table Lookup (again relative to the naive baseline).
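In code, Table Lookup is just a dictionary from observed feature vectors to the loss-minimizing prediction for that row (the mean under squared loss, the mode under classification loss), and the empirical completeness ratio combines the three cross-validated errors. The fallback for feature vectors unseen in training is our implementation choice, not something specified by the paper.

```python
from collections import defaultdict
import numpy as np

def fit_table_lookup(X_train, y_train, squared_loss=True):
    # One row per unique feature vector observed in the training folds.
    rows = defaultdict(list)
    for x, y in zip(X_train, y_train):
        rows[tuple(x)].append(y)
    if squared_loss:
        pred = {x: float(np.mean(ys)) for x, ys in rows.items()}   # row means
        default = float(np.mean(y_train))                          # unseen rows
    else:
        pred = {x: max(set(ys), key=ys.count) for x, ys in rows.items()}  # row modes
        ys_all = list(y_train)
        default = max(set(ys_all), key=ys_all.count)
    return lambda x: pred.get(tuple(x), default)

def completeness(cv_naive, cv_model, cv_table_lookup):
    # Empirical analogue of (2): the model's error reduction relative to
    # the reduction achieved by Table Lookup, both against the naive rule.
    return (cv_naive - cv_model) / (cv_naive - cv_table_lookup)
```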
1.4 Related Work

Irreducible error is an old concept in statistics and machine learning, and a large amount of work has focused on further decomposing this error into bias (reflecting error due to the specification of the model class) and variance (reflecting sensitivity of the estimated rule to the randomness in the training data). Depending on the quantity of data available to the analyst, it may be preferable to trade off bias for variance. [Footnote: For example, given small quantities of data, we may prefer to work with models that have fewer free parameters, leading to higher bias but potentially substantially lower variance.] This paper abstracts from these concerns, as well as the related concern of overfitting. We work exclusively with data sets where the quantity of data is large enough that the most predictive model is approximately the most complex one, i.e. Table Lookup (see Appendix A).

A related literature compares the performance of specific machine learning algorithms to that of existing economic models. These algorithms are themselves potentially incomplete relative to the best achievable level, and thus provide a lower bound for it, where the degree to which they are incomplete is a priori unknown. The closest of these papers to our work is Peysakhovich and Naecker (2017), which studies choices under uncertainty and under ambiguity, and constructs a benchmark based on regularized regression algorithms. [Footnote: In addition, Plonsky et al. (2017), Noti et al. (2016), and Plonsky et al. (2019) develop algorithmic models for predicting choice, Camerer et al. (2018) uses machine learning to predict disagreements in bargaining, and Bodoh-Creed et al. (2019) uses random forests to predict pricing variation. The improvements achieved by the algorithms are sometimes modest, perhaps due to intrinsic noise, as Bourgin et al. (2019) point out. We show how this noise can be quantified.] Erev et al. (2007) define a model's equivalent number of observations as the number n of prior observations such that the mean of a data set of n random observations has the same prediction error as the model. We expect that models with larger numbers of equivalent observations will be more complete by our metric.

Finally, an alternative measure of a model's performance is the proportion of the variance in the outcome that it explains, that is, the model's R². This measure is not well suited to the question of the model's completeness, because the best achievable R² cannot be directly inferred from the R² of any existing model. [Footnote: We could, however, develop a notion of completeness based on comparing the achieved R² with the best achievable R², analogous to what we do here.]

2 Applications

2.1 Assigning Certainty Equivalents to Lotteries

Background and Data.
An important question in economics is how individuals evaluate risk. In addition to the Expected Utility models (von Neumann and Morgenstern, 1944; Samuelson, 1952; Savage, 1954), prominent alternatives such as Cumulative Prospect Theory (Tversky and Kahneman, 1992) have been developed to capture observed departures from expected utility. A standard experimental task in this literature elicits certainty equivalents for lotteries—i.e. the lowest certain payment that the individual would prefer over the lottery. We consider a data set from Bruhin et al. (2010), which includes 8,906 certainty equivalents elicited from 179 subjects, all of whom were students at the University of Zurich or the Swiss Federal Institute of Technology Zurich. Subjects reported certainty equivalents for the same 50 two-outcome lotteries, half over positive outcomes (i.e. gains) and half over negative outcomes (i.e. losses).

Prediction Task and Models.
In this data set, the outcomes are the reported certainty equivalents for a given lottery, and the features are the lottery's two possible monetary prizes $z_1$ and $z_2$, and the probability p of the first prize. A prediction rule is any function that maps the tuple $(z_1, z_2, p)$ into a prediction for the certainty equivalent, i.e. a function $f : \mathbb{R} \times \mathbb{R} \times [0, 1] \to \mathbb{R}$.

We evaluate two prediction rules that are based on established models from the literature. Our Expected Utility (EU) rule sets the agent's utility function to be $u(z) = z^\alpha$, where α is a free parameter that we train. The predicted certainty equivalent is $\big(p z_1^\alpha + (1 - p) z_2^\alpha\big)^{1/\alpha}$.

Second, our Cumulative Prospect Theory (CPT) rule predicts the certainty equivalent $v^{-1}\big(w(p)\,v(z_1) + (1 - w(p))\,v(z_2)\big)$ for each lottery, where w is a probability weighting function and v is a value function. We follow the literature (see e.g. Bruhin et al. (2010)) in assuming the functional forms

$$v(z) = \begin{cases} z^\alpha & \text{if } z > 0 \\ -(-z)^\beta & \text{if } z \le 0 \end{cases} \qquad\qquad w(p) = \frac{\delta p^\gamma}{\delta p^\gamma + (1 - p)^\gamma}. \qquad (5)$$

This model has four free parameters $\alpha, \beta, \delta, \gamma \in \mathbb{R}_+$.

Finally, as a naive benchmark, we predict the expected value of the lottery, which is $p z_1 + (1 - p) z_2$. [Footnote: This naive benchmark is arguably less naive than the naive benchmarks we use for the other prediction problems. Replacing our naive benchmark with, for example, an unconditional mean would result in even higher completeness for CPT than we already find in Table 2.]

Performance Metric. For a given test set of n observations $\{(z_1^i, z_2^i, p^i; y^i)\}_{i=1}^n$—where $(z_1^i, z_2^i, p^i)$ is the lottery shown in observation i, and $y^i$ is the reported certainty equivalent—we evaluate the prediction error of prediction rule f using

$$\frac{1}{n} \sum_{i=1}^n \big(f(z_1^i, z_2^i, p^i) - y^i\big)^2.$$

This loss function, mean-squared error, penalizes the quadratic distance between the predicted and actual responses, and is minimized when $f(z_1, z_2, p)$ is the mean response for lottery $(z_1, z_2, p)$.

To conduct out-of-sample tests of the models described above, we follow the standard approach of tenfold cross-validation described in Section 1.3.1, estimating the free parameters of the model on training data and evaluating how well the estimated model predicts choices in a test set. A sketch of the two model-based rules follows.
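Here is a sketch of the two rules under our reading of (5); the certainty equivalent is recovered by inverting the utility/value function on the relevant domain, and the parameter values passed in are placeholders to be fit on training data.

```python
def eu_certainty_equivalent(z1, z2, p, alpha):
    # Expected Utility with u(z) = z^alpha (stated here for the gains domain).
    return (p * z1 ** alpha + (1 - p) * z2 ** alpha) ** (1 / alpha)

def cpt_certainty_equivalent(z1, z2, p, alpha, beta, delta, gamma):
    """CPT prediction for a two-outcome lottery paying z1 with prob. p, else z2."""
    def v(z):   # value function in (5)
        return z ** alpha if z > 0 else -((-z) ** beta)
    def w(q):   # probability weighting function in (5)
        return delta * q ** gamma / (delta * q ** gamma + (1 - q) ** gamma)
    u = w(p) * v(z1) + (1 - w(p)) * v(z2)
    # Invert v; lotteries in this data are all-gain or all-loss, so u and the
    # certainty equivalent share a sign.
    return u ** (1 / alpha) if u >= 0 else -((-u) ** (1 / beta))
```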
Results. The following table reveals that both models are predictive, improving upon the Expected Value benchmark:

                     Error
Naive Benchmark      103.81 (4.00)
Expected Utility     99.67 (4.50)
CPT                  67.38 (4.49)

Table 1: Both models are predictive.

The improvement of CPT over the naive benchmark is larger than that of Expected Utility, but the CPT performance is substantially worse than perfect prediction. [Footnote: The parameter estimate for EU is α = 0.98, and the parameter estimates for CPT are α = 1.…, β = 0.…, δ = 0.…, and γ = 0.….]
It is possible that this remaining error is simply intrinsic noise: subjects differ in the certainty equivalents they report for the same lottery, so a prediction rule based on the $(z_1, z_2, p)$ input cannot possibly predict every reported certainty equivalent.

But another source of prediction error is the functional form assumptions that we made in (5). Could a different (potentially more complex) specification for the value function or probability weighting function lead to large gains in prediction? Moreover, might there be other features of risk evaluation, yet unmodelled, which lead to even larger improvements in prediction?

To separate these sources of error, we need to understand how the CPT performance compares to the best achievable performance for this data. For this evaluation, we construct an ideal benchmark using a Table Lookup procedure. The lookup table's rows correspond to the 50 unique lotteries in our data, and the predicted certainty equivalent for each lottery is the mean response for that lottery in the training data. Given sufficiently many reports for each lottery, the lookup table prediction approximates the actual mean responses in the test data, and its error approximates the best possible error that is achievable by any prediction rule that takes $(z_1, z_2, p)$ as its input. We report this benchmark below in Table 2:

                     Error           Completeness
Naive Benchmark      103.81 (4.00)   0%
Expected Utility     99.67 (4.50)    11%
CPT                  67.38 (4.49)    95%
Table Lookup         65.58 (3.00)    100%

Table 2: CPT is nearly complete for prediction of our data.

The Table Lookup benchmark shows that no prediction rule based on $(z_1, z_2, p)$ can improve more than slightly over CPT on this data, because CPT obtains 95% of the feasible improvement in prediction. This tells us that to make substantially better predictions, we would need to expand the set of variables on which the model depends. For example, as we discuss in Section 3.1, we could group subjects using auxiliary data such as their evaluations of other lotteries or response times, and make separate predictions for each group.

We note that our completeness measure does not imply that in general
CPT is a nearly-complete model for predicting certainty equivalents, since the completeness measure we obtain is determined from a specific data set, and thus its generalizability depends on the extent to which that data is representative. [Footnote: From this data it is hard to know whether the high completeness of CPT (in the specified functional form) comes from its good match to actual behavior or because it is flexible enough to mimic Table Lookup on many data sets. We leave exploration of this question to future work.] Indeed, the data from Bruhin et al. (2010) has certain special features; for example, all lotteries in the data are over two possible outcomes. It would be an interesting exercise to evaluate the completeness of CPT using observations on lotteries with more complex supports.
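As an implementation note, the Table Lookup benchmark in Table 2 reduces to a group-by over the 50 lotteries. A minimal sketch, with column names of our choosing:

```python
import pandas as pd

def lottery_table_lookup(train: pd.DataFrame):
    """train holds one row per report, with columns z1, z2, p (the lottery)
    and ce (the reported certainty equivalent)."""
    means = train.groupby(["z1", "z2", "p"])["ce"].mean()
    grand_mean = train["ce"].mean()   # fallback for lotteries not in training
    return lambda z1, z2, p: means.get((z1, z2, p), grand_mean)
```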
2.2 Predicting Initial Play in Games

Background and Data.
In many game theory experiments, equilibrium analysis has been shown to be a poor predictor of the choices that people make when they encounter a new game. This has led to models of initial play that depart from equilibrium theory, for example the level-k models of Stahl and Wilson (1994) and Nagel (1995), the Poisson Cognitive Hierarchy Model (Camerer et al., 2004), and the related models surveyed in Crawford et al. (2013). These models represent improvements over the equilibrium predictions, but we do not know how substantial these improvements are. Are there important regularities in play that have not yet been modeled?

To study this question, we use a data set from Fudenberg and Liang (2018) consisting of 23,137 total observations of initial play from 486 3×3 games. [Footnote: This data is an aggregate of three data sets: the first is a meta data set of play in 86 games, collected from six experimental game theory papers by Kevin Leyton-Brown and James Wright, see Wright and Leyton-Brown (2014); the second is a data set of play in 200 games with randomly generated payoffs, which were gathered on MTurk for Fudenberg and Liang (2018); the final is a data set of play in 200 games that were "algorithmically designed" for a certain model (level 1) to perform poorly, again from Fudenberg and Liang (2018).] [Footnote: There was no learning in these experiments—subjects were randomly matched to opponents […].]
As in the previous section, we pool observations across all of the subjects and games.
Prediction Task, Performance Metric, and Models.
In the prediction problem we consider here, the outcome is the action that is chosen by the row player in a given instance of play, and the features are the 18 entries of the payoff matrix. A prediction rule is thus any map $f : \mathbb{R}^{18} \to \{a_1, a_2, a_3\}$ from 3×3 payoff matrices to row-player actions.

For a given prediction rule f and test set of observations $\{(g^i, a^i)\}_{i=1}^n$—where $g^i$ is the payoff matrix in observation i, and $a^i$ is the observed row player action—we evaluate error using the misclassification rate

$$\frac{1}{n} \sum_{i=1}^n \mathbb{1}\big(f(g^i) \neq a^i\big).$$

This is the fraction of observations where the predicted action was not the observed action.

As a naive baseline, we consider guessing uniformly at random for all games, which yields an expected misclassification rate of 2/3.
Additionally, we consider a prediction rule based on the Poisson Cognitive Hierarchy Model (PCHM), which supposes that there is a distribution over players of differing levels of sophistication: the level-0 player is maximally unsophisticated and randomizes uniformly over his available actions, while the level-1 player best responds to level-0 play (Stahl and Wilson, 1994, 1995; Nagel, 1995). Camerer et al. (2004) defines the play of level-k players, $k \ge 2$, as a best response to beliefs that assign to each lower level h the normalized Poisson probability

$$p_k(h) = \frac{\pi_\tau(h)}{\sum_{l=0}^{k-1} \pi_\tau(l)} \qquad \forall\, h \in \{0, \dots, k-1\},$$

where $\pi_\tau$ is the Poisson distribution with rate parameter τ. A sketch of the resulting prediction rule follows.
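Here is a sketch of the implied play distribution, under our reading of the model above; for brevity it treats the game as symmetric, so that a single level hierarchy describes both players (a full implementation would compute the column player's hierarchy from the column payoffs separately).

```python
import numpy as np
from scipy.stats import poisson

def pchm_play(row_payoffs, tau, max_level=8):
    """Row player's action distribution under PCHM for a 3x3 game.

    row_payoffs[i, j]: row player's payoff when playing i against column j.
    Level 0 randomizes uniformly; level k >= 1 best responds to the
    truncated-Poisson mixture p_k(h) over levels h < k.
    """
    level_play = [np.ones(3) / 3]                       # level 0: uniform
    for k in range(1, max_level + 1):
        belief = poisson.pmf(np.arange(k), tau)
        belief = belief / belief.sum()                  # p_k(h), h = 0..k-1
        opponent = sum(b * play for b, play in zip(belief, level_play))
        utilities = row_payoffs @ opponent              # expected payoff per action
        best = np.zeros(3)
        best[np.argmax(utilities)] = 1.0                # pure best response
        level_play.append(best)
    shares = poisson.pmf(np.arange(max_level + 1), tau)
    shares = shares / shares.sum()                      # population level shares
    return sum(s * play for s, play in zip(shares, level_play))
```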
As in Section 2.1, we estimate the free parameter τ on training data, and evaluate the out-of-sample prediction of the estimated model. All reported prediction errors are tenfold cross-validated.

Results. Because we use the classification loss as the loss function, the best attainable classification error will differ across games: in games where all subjects choose the same action, the perfect 0-error prediction is feasible, but when play is close to uniform over the actions, it will be hard to improve over random guessing. This means that the same level of predictive accuracy should potentially be evaluated quite differently, depending on what kinds of games are being predicted.

We illustrate this by comparing predictions for two subsets of our data: Data Set A consists of the 16,660 observations of play from the 359 games with no strictly dominated actions. [Footnote: Specifically, we consider games where no pure action is strictly dominated by another pure action.] Data Set B consists of the 7,860 observations of play from the 161 games in which the action profile with the highest sum of player payoffs is outside of the support of level-k actions, [Footnote: Here we use the classic definition from Stahl and Wilson (1995) and Nagel (1995), where each level-k action is the best response to the level-(k−1) action.] and moreover the difference in the payoff sums is large (at least 20% of the largest row player payoff in the game). One example of a game included in Data Set B works as follows [the full payoff matrix is not recoverable from the source text]: action $a_1$ is level 1, since it yields the highest expected payoff against uniform play, and action $a_3$ is level 2, since it is a best response against play of $a_1$. Because $(a_3, a_3)$ is a pure-strategy Nash equilibrium, action $a_3$ is then level-k for all $k \ge 2$. The highest possible payoff sum achieved by playing either $a_1$ or $a_3$ is 120, but the action profile $(a_2, a_2)$ yields a higher payoff sum of 160. The difference, 40, exceeds 20% of the largest row player payoff in the game.

In our data, different values of τ generate the same predicted modal action, and so have the same cross-validated prediction error. For all of the games in our data, this mode is simply the level-1 action. But as Table 3 shows, PCHM improves upon the naive benchmark by a larger amount for prediction of play in Data Set B, compared to Data Set A. Using perfect prediction as the benchmark, this would imply that PCHM is a more complete model of play for games in Data Set B.

                     Data Set A    Data Set B
Naive Benchmark      0.66          0.66
PCHM                 0.49          0.44
                     (0.004)       (0.009)

Table 3: PCHM improves upon the naive baseline by a larger amount for prediction of play in Data Set B.

But the amount of irreducible error in the two data sets may be quite different, leading to different predictive limits. Thus we need to understand how the prediction errors compare to the best achievable error for the two data sets. We can again gain insight into this by building a lookup table. The rows of the table are the different games, and the associated predictions are the modal actions (observed for those games) in the training data.
Given sufficiently many observations, the modal action in the training data will also be the action most likely to be played in the test data, thus minimizing classification error. Below we report the Table Lookup performance and completeness measures relative to this performance. [Footnote: Instead of our task of predicting each action, Fudenberg and Liang (2018) studies the task of predicting the modal action in each game; the ideal prediction for that task always has no error at all. Correspondingly for that prediction task, Fudenberg and Liang (2018) also used a different cross-validation procedure: instead of dividing the data into folds at random as described above, it split the set of games so that the games in the training set were not used for testing. This alternative is relevant for the study of how well we can extrapolate from one game to another, which is not the question of interest here.]

                     Data Set A            Data Set B
                     Error   Completeness  Error   Completeness
Naive Benchmark      0.66    0%            0.66    0%
PCHM                 0.49    68%           0.44    69%
Table Lookup         0.41    100%          0.34    100%
                     (0.005)               (0.006)

Table 4: Completeness of PCHM relative to the Table Lookup benchmark.

Relative to this benchmark, PCHM achieves roughly the same completeness in the two data sets, even though its raw improvement over the naive baseline is larger in Data Set B: the two data sets differ in how much predictable variation they contain, not only in how well PCHM fits them.

2.3 Human Generation of Random Sequences

Background and Data. Extensive experimental and empirical evidence suggests that humans misperceive randomness, expecting for example that sequences of coin flips "self-correct" (too many Heads in a row must be followed by a Tails) and are balanced (the proportions of Heads and Tails are approximately the same) (Bar-Hillel and Wagenaar, 1991; Tversky and Kahneman, 1971). These misperceptions are significant not only for their basic psychological interest, but also for the ways in which misperception of randomness manifests itself in a variety of contexts: for example, investors' judgment of sequences of (random) stock returns (Barberis et al., 1998), professional decision-makers' reluctance to choose the same (correct) option multiple times in succession (Chen et al., 2016), and people's execution of a mixed strategy in a game (Batzilis et al., 2016).

A common experimental framework in this area is to ask human participants to generate fixed-length strings of k (pseudo-)random coin flips, for some small value of k (e.g. k = 8), and then to compare the produced distribution over length-k strings to the output of a Bernoulli process that generates realizations from {H, T} independently and uniformly at random (Rapaport and Budescu, 1997; Nickerson and Butler, 2009). Following in this tradition, we use the platform Mechanical Turk to collect a large dataset of human-generated strings designed to simulate the output of a Bernoulli(0.5) process, in which each symbol in the string is generated from {H, T} independently and uniformly at random. [Footnote: In one experiment, 537 subjects each produced 50 binary strings of length eight. In a second experiment, an additional 101 subjects were asked to each generate 25 binary strings of length eight.] To incentivize effort, we told subjects that payment would be approved only if their (set of) strings could not be identified as human-generated with high confidence. [Footnote: Subjects were informed: "To encourage effort in this task, we have developed an algorithm (based on previous Mechanical Turkers) that detects human-generated coin flips from computer-generated coin flips. You are approved for payment only if our computer is not able to identify your flips as human-generated with high confidence."] Following removal of subjects who were clearly not attempting to mimic a random process, our final data set consisted of 21,975 strings generated by 471 subjects. [Footnote: Our initial data set consists of 29,375 binary strings. We chose to remove all subjects who repeated any string in more than five rounds. This cutoff was selected by looking at how often each subject generated any given string, and finding the average "highest frequency" across subjects. This turned out to be 10% of the strings, or five strings. Thus, our selection criterion removes all subjects whose highest frequency was above average. This selection eliminated 167 subjects and 7,400 strings, yielding a final dataset with 471 subjects and 21,975 strings. We check that our main results are not too sensitive to this selection criterion by considering two alternative choices in Appendix C.2—first, keeping only the initial 25 strings generated by all subjects, and then, removing the subjects whose strings are "most different" from a Bernoulli process under a χ²-test. We find very similar results under these alternative criteria.]

Prediction Task, Performance Metric, and Models. We consider the problem of predicting the probability that the eighth entry in a string is H given its first seven elements. Thus the outcome here is a number in [0, 1]—the probability that the string continues with H—and the feature space is $\{H, T\}^7$ (note that, as in the previous examples, we fit a representative-agent model and do not treat the identity of the subject as a feature). [Footnote: Alternatively we could have defined the outcome to be an individual realization of H or T, so that prediction rules are maps $f : \{H, T\}^7 \to \{H, T\}$, and then evaluated error using the misclassification rate (i.e. the fraction of instances where the predicted outcome was not the realized outcome). We do not take a stand on which method is better, but note that the completeness measure can depend on which one is used. In Appendix C.1 we show that the completeness measures are very similar using this alternative formulation.]

Given a test dataset $\{(s_1^i, \dots, s_8^i)\}_{i=1}^n$ of n binary strings of length 8, we evaluate the error of the prediction rule f using the mean-squared error

$$\frac{1}{n} \sum_{i=1}^n \big(\mathbb{1}(s_8^i = H) - f(s_1^i, \dots, s_7^i)\big)^2,$$

where $f(s_1^i, \dots, s_7^i)$ is the predicted probability that the eighth flip is 'H' given the observed initial seven flips $s_1^i, \dots, s_7^i$, and $s_8^i$ is the actual eighth flip.
Note that the naive baseline of unconditionally guessing 0.5 guarantees a mean-squared prediction error of 0.25. [Footnote: Due to the convexity of the loss function, it is possible to do worse than the naive baseline, for example by predicting 1 unconditionally.] We expect that the presence of behavioral errors in the generation process will make it possible to improve upon the naive baseline, but do not know how much it is possible to improve upon 0.25.

In this task, the natural naive baseline is the rule that unconditionally guesses that the probability the final flip is 'H' is 0.5. We compare this to prediction rules based on Rabin (2002) and Rabin and Vayanos (2010), both of which predict generation of negatively autocorrelated sequences. [Footnote: Although both of these frameworks are models of mistaken inference from data, as opposed to human attempts to generate random sequences, they are easily adapted to our setting, as the papers explained.] Our prediction rule based on Rabin (2002) supposes that subjects generate sequences by drawing sequentially without replacement from an urn containing 0.5N '1' balls and 0.5N '0' balls. The urn is "refreshed" (meaning the composition is returned to its original) every period with independent probability p. This model has two free parameters: $N \in \mathbb{Z}_+$ and $p \in [0, 1]$. Our prediction rule based on Rabin and Vayanos (2010) supposes that the first flip is distributed $s_1 \sim \mathrm{Bernoulli}(0.5)$, while each subsequent flip $s_k$ is distributed

$$s_k \sim \mathrm{Bernoulli}\Big(0.5 - \alpha \sum_{t=0}^{k-2} \delta^t\,(2 s_{k-1-t} - 1)\Big),$$

where flips are coded H = 1 and T = 0, the parameter $\delta \in \mathbb{R}_+$ reflects the (decaying) influence of past flips, and the parameter $\alpha \in \mathbb{R}_+$ measures the strength of negative autocorrelation. [Footnote: We make a small modification on the Rabin and Vayanos (2010) model, allowing $\alpha, \delta \in \mathbb{R}_+$ instead of $\alpha, \delta \in [0, 1]$.]
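A sketch of the Rabin and Vayanos (2010) rule as reconstructed above, coding H as 1 and T as 0; the clipping into [0, 1] is our addition, since unrestricted α and δ can push the display outside the unit interval.

```python
import numpy as np

def rv_prob_next_heads(flips, alpha, delta):
    """P(next flip = H) given the history so far: the most recent flip gets
    weight delta^0, the one before it delta^1, and so on; positive alpha
    makes the prediction lean against the recent history (negative
    autocorrelation)."""
    signs = np.array([2 * s - 1 for s in reversed(flips)], dtype=float)
    decay = delta ** np.arange(len(flips))
    p = 0.5 - alpha * float(np.sum(decay * signs))
    return float(np.clip(p, 0.0, 1.0))   # keep the output a probability

# After seven heads the model predicts a tail is more likely than a head:
print(rv_prob_next_heads([1] * 7, alpha=0.05, delta=0.7))   # about 0.35
```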
Results. Table 6 shows that both prediction rules improve upon the naive baseline. The need for a benchmark for achievable prediction is starkest in this application, however, as the best improvement is only 0.0008, while the gap between the best achievable error and the naive baseline is not known in advance.

                              Error
Naive Benchmark               0.25
Rabin (2002)                  0.2494 (0.0007)
Rabin and Vayanos (2010)      0.2492 (0.0007)

Table 6: Both models improve upon naive guessing, but the absolute improvement is small.

For this problem, the lookup table's rows correspond to the $2^7$ unique initial seven-flip sequences, and we associate each such string to the empirical frequency with which that string is followed by 'H' in the training data. Given a sufficiently large training set, we can approximate the true continuation frequency for each initial sequence, and hence approximate the best achievable error. We note here that although there are $2^7$ unique initial sequences, with approximately 21,000 strings in our data set, we have (on average) 164 observations per initial sequence.

                              Error            Completeness
Naive Benchmark               0.25             0%
Rabin (2002)                  0.2494 (0.0007)  10%
Rabin & Vayanos (2010)        0.2492 (0.0007)  14%
Table Lookup                  0.2441 (0.0006)  100%

Table 7: The Table Lookup benchmark permits a more accurate representation of the completeness of these models.

We find that Table Lookup achieves a prediction error of 0.2441, so that naively comparing achieved prediction error against perfect prediction (which would suggest a completeness measure of at most 0.4%) grossly misrepresents the performance of the models. Relative to the Table Lookup benchmark, the existing models produce up to 14% of the achievable improvement in prediction error. This suggests that although negative autocorrelation is indeed present in the human-generated strings, and explains a sizable part of the deviation from a Bernoulli(0.5) process, there is additional structure that could yet be exploited for prediction.

3 Extensions

3.1 Models with Subject Heterogeneity

So far we have focused on evaluating representative-agent models that implement a single prediction across all subjects. When we evaluate models that include subject heterogeneity, the question of what is the best achievable level of accuracy is still relevant, and the suitable analogue of Table Lookup—with subject type added as an additional feature—can again help us to determine this. The exact implementation of Table Lookup will depend on how the groups are determined. As a simple illustration, we return to our first domain—evaluation of risk—and demonstrate how to construct a predictive bound for certain models with subject heterogeneity.

The models that we consider extend the Expected Utility and Cumulative Prospect Theory models introduced in Section 2.1 by allowing for three groups of subjects. To test the models, we randomly select 71 (out of 171) subjects to be test subjects, and 45 (out of 50) lotteries to be test lotteries. All other data—the 100 training subjects' choices in all lotteries, as well as the test subjects' choices in the 5 training lotteries—are used for training the models.

In more detail, we first use the training subjects' responses in the training lotteries to develop a clustering algorithm for separating subjects into three groups. [Footnote: We use a simple algorithm, k-means, which minimizes the Euclidean distance between the vectors of reported certainty equivalents for subjects within the same group.] This algorithm can be used to assign a group number to any new subject based on their choices in the five training lotteries.
Second, we use each group's training subjects' responses in the test lotteries to estimate free model parameters—that is, the single free parameter of the EU model, and the four free parameters for CPT. This yields three versions of EU and CPT, one per group.

Out of sample, we first use the clustering algorithm to assign groups to the test subjects, and then use the associated models to predict their certainty equivalents in the test lotteries. We measure accuracy using mean-squared error, as in Section 2.1, and we again report the Expected Value prediction as a naive baseline.

                     Prediction Error
Naive Benchmark      91.13 (10.44)
Expected Utility     86.68 (10.69)
CPT                  57.14 (7.17)

Table 8: Prediction errors achieved by models with subject heterogeneity.

What we find from Table 8 is very similar to what we observed in Section 2.1: both models improve upon the naive baseline, but, without the analogue of the Table Lookup benchmark described above (with subject group added as a feature), we do not know how complete they are.

3.2 Comparing the Predictive Power of Different Feature Sets

In addition to evaluating the predictive limits of a feature set and the completeness of existing models, Table Lookup can be used to compare the predictive power of different feature sets. We illustrate this potential comparison by revisiting our problem from Section 2.3—predicting human generation of randomness—and evaluating the predictive value of certain features. To do this, we consider "compressed" Table Lookup algorithms built on different properties of the string, where strings of the same type are bucketed into the same row, and focus on the predictive value of two properties: the number of Heads, and flips 4-7. Our compressed Table Lookup based on the number of Heads partitions the set of length-7 strings depending on the total number of 'H' flips in the string, and learns a prediction for each partition element; similarly, our compressed Table Lookup based on flips 4-7 partitions the set of strings depending only on outcomes including and after flip 4. Just as our original Table Lookup algorithm returned an approximation of the highest level of predictive accuracy using the full structure of initial flip data, these compressed Table Lookup algorithms approximate the highest level of predictive accuracy that is achievable using a particular kind of structure in the strings (a sketch of these compressed tables appears after the table below).

                     Error            Completeness
Naive Benchmark      0.25             0%
Flips 4-7            0.2478 (0.0010)  36%
Number of Heads      0.2464 (0.0009)  59%
Full Table Lookup    0.2441 (0.0006)  100%

Table 9: Comparison of the value of various feature sets.

We find that these simple features achieve large fractions of the achievable improvement over the naive rule of always predicting that the probability of H is 1/2. [Footnote: Note that the value of individual features will in general depend on what other features are available.]
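A sketch of the compressed tables, where each history is first mapped to a coarser key and the prediction is the training frequency of H within the bucket; the key functions and the 0.5 fallback for empty buckets are our choices.

```python
from collections import defaultdict

def compressed_table_lookup(train, key):
    """train: iterable of (history, next_flip) pairs, where history is a
    7-tuple of 0/1 flips (T/H) and next_flip is 0 or 1."""
    counts = defaultdict(lambda: [0, 0])          # bucket -> [heads, total]
    for hist, nxt in train:
        bucket = counts[key(hist)]
        bucket[0] += nxt
        bucket[1] += 1
    def predict(hist):
        heads, total = counts[key(hist)]
        return heads / total if total else 0.5    # fallback for empty buckets
    return predict

number_of_heads = lambda hist: sum(hist)          # "Number of Heads" row key
flips_4_to_7 = lambda hist: tuple(hist[3:])       # "Flips 4-7" row key
full_history = lambda hist: tuple(hist)           # full Table Lookup (2^7 rows)
```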
4 Discussion

When evaluating the predictive performance of a theory, it is important to know not just whether the theory is predictive, but also how complete its predictive performance is. We propose the use of Table Lookup as a way to measure the best achievable predictive performance for a given problem, and the completeness of a model as a measure of how close it comes to this bound. We demonstrate three domains in which completeness can help us to evaluate the performance of existing models.

The present paper has focused on the criterion of predictiveness. When we take other criteria into account, such as the interpretability or generality of the model, then we may prefer models that are not 100% complete by the measure proposed here—for example, we may prefer to sacrifice some predictive power in return for higher explainability, as in Fudenberg and Liang (2018).

Finally, we note that all the tests mentioned so far involve training and testing models on data drawn from the same domain. A question for future work would be how to compare the transferability of models across domains. Indeed, we may expect that economic models that are outperformed by machine learning models in a given domain have higher transfer performance outside of the domain. In this sense, within-domain completeness may provide an incomplete measure of the "overall completeness" of the model, and we leave development of such notions to future work.

References

Bar-Hillel, M. and W. Wagenaar (1991): "The Perception of Randomness," Advances in Applied Mathematics.

Barberis, N., A. Shleifer, and R. Vishny (1998): "A Model of Investor Sentiment," Journal of Financial Economics.

Batzilis, D., S. Jaffe, S. Levitt, J. A. List, and J. Picel (2016): "How Facebook Can Deepen our Understanding of Behavior in Strategic Settings: Evidence from a Million Rock-Paper-Scissors Games," Working Paper.

Bodoh-Creed, A., J. Boehnke, and B. Hickman (2019): "Using Machine Learning to Explain Price Dispersion," Working Paper.

Bourgin, D. D., J. C. Peterson, D. Reichman, T. L. Griffiths, and S. J. Russell (2019): "Cognitive Model Priors for Predicting Human Decisions," CoRR, abs/1905.09397.

Bruhin, A., H. Fehr-Duda, and T. Epper (2010): "Risk and Rationality: Uncovering Heterogeneity in Probability Distortion," Econometrica.

Camerer, C. F., T.-H. Ho, and J.-K. Chong (2004): "A cognitive hierarchy model of games," The Quarterly Journal of Economics, 119, 861-898.

Camerer, C. F., G. Nave, and A. Smith (2018): "Dynamic unstructured bargaining with private information: theory, experiment, and outcome prediction via machine learning," Management Science.

Chen, D., K. Shue, and T. Moskowitz (2016): "Decision-Making under the Gambler's Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires," Quarterly Journal of Economics.

Crawford, V. P., M. A. Costa-Gomes, and N. Iriberri (2013): "Structural models of nonequilibrium strategic thinking: Theory, evidence, and applications," Journal of Economic Literature, 51, 5-62.

Domingos, P. (2000): "A Unified Bias-Variance Decomposition and its Applications," Proc. 17th International Conf. on Machine Learning.

Erev, I., A. E. Roth, R. L. Slonim, and G. Barron (2007): "Learning and equilibrium as useful approximations: Accuracy of prediction on randomly selected constant sum games," Economic Theory, 33, 29-51.

Fudenberg, D. and A. Liang (2018): "Predicting and Understanding Initial Play," Working Paper.

Gneiting, T. and A. E. Raftery (2007): "Strictly Proper Scoring Rules, Prediction, and Estimation," Journal of the American Statistical Association.

Hastie, T., R. Tibshirani, and J. Friedman (2009): The Elements of Statistical Learning, Springer.

Kleinberg, J., H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan (2017): "Human Decisions and Machine Predictions," The Quarterly Journal of Economics.

Nagel, R. (1995): "Unraveling in Guessing Games: An Experimental Study," American Economic Review, 85, 1313-1326.
Nickerson, R. and S. Butler (2009): "On Producing Random Sequences," American Journal of Psychology.

Noti, G., E. Levi, Y. Kolumbus, and A. Daniely (2016): "Behavior-Based Machine-Learning: A Hybrid Approach for Predicting Human Decision Making," CoRR, abs/1611.10228.

Peysakhovich, A. and J. Naecker (2017): "Using Methods from Machine Learning to Evaluate Behavioral Models of Choice Under Risk and Ambiguity," Journal of Economic Behavior and Organization.

Plonsky, O., I. Erev, T. Hazan, and M. Tennenholtz (2017): "Psychological forest: Predicting human behavior," AAAI Conference on Artificial Intelligence.

Plonsky, O., R. Apel, E. Ert, M. Tennenholtz, D. Bourgin, J. C. Peterson, D. Reichman, T. L. Griffiths, S. J. Russell, E. C. Carter, J. F. Cavanagh, and I. Erev (2019): "Predicting human decisions with behavioral theories and machine learning," CoRR, abs/1904.06866.

Rabin, M. (2000): "Risk Aversion and Expected-utility Theory: A Calibration Theorem," Econometrica, 68, 1281-1292.

——— (2002): "Inference by Believers in the Law of Small Numbers," The Quarterly Journal of Economics.

Rabin, M. and D. Vayanos (2010): "The Gambler's and Hot-Hand Fallacies: Theory and Applications," Review of Economic Studies.

Rapaport, A. and D. Budescu (1997): "Randomization in Individual Choice Behavior," Psychological Review.

Samuelson, P. (1952): "Probability, Utility, and the Independence Axiom," Econometrica.

Savage, L. (1954): The Foundations of Statistics, J. Wiley.

Stahl, D. O. and P. W. Wilson (1994): "Experimental evidence on players' models of other players," Journal of Economic Behavior and Organization, 25, 309-327.

——— (1995): "On players' models of other players: Theory and experimental evidence," Games and Economic Behavior, 10, 218-254.

Tversky, A. and D. Kahneman (1971): "The Belief in the Law of Small Numbers," Psychological Bulletin.

——— (1992): "Advances in Prospect Theory: Cumulative Representation of Uncertainty," Journal of Risk and Uncertainty, 5, 297-323.

von Neumann, J. and O. Morgenstern (1944): Theory of Games and Economic Behavior, Princeton University Press.

Wright, J. R. and K. Leyton-Brown (2014): "Level-0 meta-models for predicting human behavior in games," Proceedings of the Fifteenth ACM Conference on Economics and Computation, 857-874.

Appendix A: Is Table Lookup the most predictive algorithm for our data?

In the main text, we use the performance of Table Lookup as an approximation of the best possible accuracy. Below we investigate whether the data sets we study are large enough for this to be a good approximation.

We first review some results from the machine learning and statistics literatures, which explain how the cross-validated standard errors that we report in the main text can be used as a measure for how well the Table Lookup error approximates the irreducible error (Section A.1).

In Section A.2, we compare Table Lookup's performance with that of bagged decision trees, an algorithm that scales better to smaller quantities of data. We find that in each of our prediction problems, the two prediction errors are similar, and Table Lookup weakly outperforms bagged decision trees. Finally, in Section A.3, we study the sensitivity of the Table Lookup performance to the quantity of data. The predictive accuracies achieved using our full data sets are very close to those achieved using, for example, just 70% of the data. This again suggests that only minimal improvements in predictive accuracy are feasible from further increases in data size.
A.1 Cross-Validated Standard Error

Suppose that the loss function is mean-squared error: $\ell(y', y) = (y' - y)^2$. (Similar arguments apply for the misclassification rate; see e.g. Domingos (2000).) Let $f^*_P(x) = \mathbb{E}_P[y \mid x]$ be the ideal prediction rule discussed in Section 1.2, which assigns to each x its expected outcome y under distribution P. Write $f_{TL}[Z]$ for the random Table Lookup prediction rule that has been estimated from a set Z of n i.i.d. training observations. The expected mean-squared error of $f_{TL}$ on a new observation $(x, y) \sim P$ can be decomposed as follows (Hastie et al., 2009):

$$\mathbb{E}[(f_{TL}[Z](x) - y)^2] = \underbrace{\mathbb{E}[(f^*(x) - y)^2]}_{\text{irreducible noise}} + \underbrace{\big(\mathbb{E}[f_{TL}[Z](x)] - f^*(x)\big)^2}_{\text{bias}} + \underbrace{\mathbb{E}\big[(f_{TL}[Z](x) - \mathbb{E}[f_{TL}[Z](x)])^2\big]}_{\text{sampling error}},$$

where the expectation is both over the realization of the training data Z used to train Table Lookup, and also over the realization of the test observation (x, y).

The first component is the irreducible noise introduced in (1). The second component, bias, is the mean-squared difference between the expected Table Lookup prediction and the prediction of the ideal prediction rule $f^*$. The final component, sampling error or variance, is the variance of the Table Lookup prediction (reflecting the sensitivity of the algorithm to the training data).

Since Table Lookup is an unbiased estimator, the second component is zero. Thus, irreducible noise is the difference between the expected Table Lookup error and the sampling error of the Table Lookup predictor. As described in Section 1.3, we follow standard procedures of using the cross-validated prediction error to estimate the expected Table Lookup error, and using the variance of the cross-validated prediction errors to estimate the sampling error (Hastie et al., 2009). That is,

$$\mathbb{E}\big[(f_{TL}[Z](x) - \mathbb{E}[f_{TL}[Z](x)])^2\big] \approx \frac{1}{K}\,\mathrm{Var}(\{CV_1, \dots, CV_K\}),$$

where $CV_i$ is the prediction error for the i-th iteration of cross-validation. The right-hand side of the display is the square of the cross-validated standard errors reported in the main text; thus, we have from Tables 2, 4, and 7:

                                        Table Lookup Error    Sampling Error
Risk Preferences                        65.58                 9.00
Predicting Initial Play, Data Set A     0.41                  < 0.0001
Predicting Initial Play, Data Set B     0.34                  < 0.0001
Predicting Randomness                   0.2441                < 0.0001

In each case the estimated sampling error is small relative to the Table Lookup error, so the cross-validated Table Lookup error is close to the irreducible noise.
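In code, the reported standard errors and the implied sampling-error estimates come directly from the per-fold errors returned by the cross-validation sketch in Section 1.3.1:

```python
import numpy as np

def cv_standard_error(per_fold_errors):
    # sqrt((1/K) Var(CVErr_1, ..., CVErr_K)), the quantity reported in
    # parentheses throughout the paper's tables.
    k = len(per_fold_errors)
    return float(np.sqrt(np.var(per_fold_errors, ddof=1) / k))

def sampling_error(per_fold_errors):
    # Its square estimates the sampling-error component of the decomposition.
    return cv_standard_error(per_fold_errors) ** 2
```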
A.2 Comparison with Scalable Machine Learning Algorithms

Another way to evaluate whether our Table Lookup algorithm approximates the best possible prediction accuracy is to compare it with the performance of other machine learning algorithms. Below we compare its performance with bagged decision trees (also known as bootstrap-aggregated decision trees). This algorithm creates several bootstrapped data sets from the training data by sampling with replacement, and then trains a decision tree on each bootstrapped training set. Decision trees are nonlinear prediction models that recursively partition the feature space and learn a (best) constant prediction for each partition element. The prediction of the bagged decision tree algorithm is an aggregation of the predictions of the individual decision trees: when the loss function is mean-squared error, the ensemble predicts the average of the individual trees' predictions; when the loss function is the misclassification rate, it predicts based on a majority vote across the ensemble of trees.

Table 10 shows that for each prediction problem, the error of the bagged decision tree algorithm is comparable to and slightly worse than that of the Table Lookup algorithm. These results again suggest that the Table Lookup error is a reasonable approximation for the best achievable error.

                         Risk           Games A        Games B        Sequences
Bagged Decision Trees    65.65 (0.10)   0.45 (0.004)   0.36 (0.005)   0.2442 (0.0005)
Table Lookup             65.58 (3.00)   0.41 (0.005)   0.34 (0.006)   0.2441 (0.0006)

Table 10: Table Lookup outperforms bagged decision trees in each of our prediction problems.

A.3 Performance of Table Lookup on Smaller Samples

Finally, we report the Table Lookup cross-validated performance on random samples of x% of our data, where x ∈ {10, 20, ..., 100}. For each x, we repeat the procedure 100 times, and report the average performance across iterations. We find that the Table Lookup performance flattens out for larger values of x, suggesting that the quantity of data we have is indeed large enough that further increases in the data size will not substantially improve predictive performance.

x%      Risk            Games A           Games B           Sequences
10%     69.47 (11.13)   0.4191 (0.012)    0.3473 (0.018)    0.2592 (0.0034)
20%     67.13 (7.95)    0.4183 (0.0018)   0.3476 (0.024)    0.2504 (0.0018)
30%     66.28 (6.51)    0.4178 (0.0022)   0.3472 (0.0029)   0.2479 (0.0014)
40%     66.25 (5.65)    0.4169 (0.0024)   0.3470 (0.0032)   0.2464 (0.0011)
50%     65.68 (4.59)    0.4157 (0.0025)   0.3459 (0.0036)   0.2458 (0.0010)
60%     65.68 (4.24)    0.4141 (0.0027)   0.3449 (0.0040)   0.2452 (0.0008)
70%     65.68 (3.95)    0.4131 (0.0031)   0.3435 (0.0045)   0.2448 (0.0007)
80%     65.68 (3.95)    0.4119 (0.0034)   0.3427 (0.0046)   0.2445 (0.0007)
90%     65.66 (3.71)    0.4109 (0.0034)   0.3416 (0.0047)   0.2443 (0.0007)
100%    65.58 (3.00)    0.4100 (0.0036)   0.3404 (0.0051)   0.2441 (0.0006)

Table 11: Performance of Table Lookup using x% of the data, averaged over 100 iterations for each x.

Appendix B: Experimental Instructions for Section 2.3

Subjects on Mechanical Turk were presented with the following introduction screen: [screenshot of the instruction screen omitted].

Appendix C: Supplementary Material to Section 2.3

C.1 Robustness

Here we check how our results in Section 2.3 change when the outcome space and error function are changed so that prediction functions are maps $f : \{H, T\}^7 \to \{H, T\}$ and the error for predicting the test data set $\{(s_1^i, \dots, s_8^i)\}_{i=1}^n$ is defined to be

$$\frac{1}{n} \sum_{i=1}^n \mathbb{1}\big(s_8^i \neq f(s_1^i, \dots, s_7^i)\big),$$

i.e. the misclassification rate. We use as a naive benchmark the prediction rule that guesses H and T uniformly at random; this is guaranteed an expected misclassification rate of 0.50.

For this problem, the Table Lookup benchmark learns the modal continuation for each sequence in $\{H, T\}^7$. We find that the completeness of Rabin (2002) and Rabin and Vayanos (2010) relative to the Table Lookup benchmark are respectively 19% and 9%.

                           Error           Completeness
Naive Benchmark            0.50            0%
Rabin (2002)               0.45 (0.003)    19%
Rabin & Vayanos (2010)     0.475 (0.01)    9%
Table Lookup               0.23 (0.002)    100%

C.2 Different Cuts of the Data

Initial strings only. We repeat the analysis in Section 2.3 using data from all subjects, but only their first 25 strings. This selection accounts for potential fatigue in generation of the final strings, and leaves a total of 638 subjects and 15,950 strings. Prediction results for our main exercise are shown below using this alternative selection.
                           Error            Completeness
Naive Benchmark            0.25             0%
Rabin & Vayanos (2010)     0.2491 (0.0008)  5%
Table Lookup               0.2326 (0.0030)  100%

Removing the least random subjects. For each subject, we conduct a Chi-squared test for the null hypothesis that their strings were generated under a Bernoulli process. We order subjects by p-values and remove the 100 subjects with the lowest p-values.
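A sketch of this screen follows; the choice to bin over all 2^8 strings when forming the χ² statistic is ours, since the exact test construction is not spelled out in the surviving text.

```python
from collections import Counter
import numpy as np
from scipy.stats import chisquare

ALL_STRINGS = ["".join("H" if (i >> b) & 1 else "T" for b in range(8))
               for i in range(256)]

def subject_p_value(strings):
    """strings: the length-8 H/T strings produced by one subject. Tests the
    observed string counts against the uniform distribution on 2^8 strings."""
    counts = Counter(strings)
    observed = np.array([counts.get(s, 0) for s in ALL_STRINGS])
    expected = np.full(256, len(strings) / 256)
    return chisquare(observed, expected).pvalue

def drop_least_random(subject_strings, n_drop=100):
    # subject_strings: dict mapping subject id -> list of generated strings.
    pvals = {sid: subject_p_value(s) for sid, s in subject_strings.items()}
    dropped = set(sorted(pvals, key=pvals.get)[:n_drop])   # lowest p-values
    return {sid: s for sid, s in subject_strings.items() if sid not in dropped}
```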