How Flexible is that Functional Form? Quantifying the Restrictiveness of Theories∗

Drew Fudenberg†  Wayne Gao‡  Annie Liang§

July 21, 2020
Abstract
We propose a new way to quantify the restrictiveness of an economic model, based on how well the model fits simulated, hypothetical data sets. The data sets are drawn at random from a distribution that satisfies some application-dependent content restrictions (such as that people prefer more money to less). Models that can fit almost all hypothetical data well are not restrictive. To illustrate our approach, we evaluate the restrictiveness of two widely-used behavioral models, Cumulative Prospect Theory and the Poisson Cognitive Hierarchy Model, and explain how restrictiveness reveals new insights about them.

∗We thank Nikhil Agarwal, Victor Aguiar, Abhijit Banerjee, Tilman Börgers, Vince Crawford, Glenn Ellison, Shaowei Ke, Rosa Matzkin, John Quah, Charles Sprenger, and Emanuel Vespa for helpful comments, and NSF grant SES 1643517 for financial support.
†Department of Economics, MIT
‡Department of Economics, U. Pennsylvania
§Department of Economics, U. Pennsylvania

1 Introduction
If a parametric model fits the available data well, is it because the model captures structure that is specific to the observed data, or because the model is so flexible that it would fit almost all conceivable data?

This paper provides a quantitative measure of model restrictiveness that can distinguish between the two explanations above and is easy to compute across a variety of applications. We test the restrictiveness of a model by simulating hypothetical data sets and seeing how well the model can fit this data. A restrictive model performs poorly on most of the hypothetical data, while an unrestrictive model approximates almost all conceivable data.

What the analyst views as conceivable reflects their ex-ante knowledge or intuition. For example, the analyst might think that everyone prefers more money to less, or that players are less likely to choose strictly dominated actions. To measure restrictiveness, we propose that the analyst first stipulates some basic application-dependent restrictions on the data, and then generates random data sets that obey these properties. Our measure of restrictiveness, based on the model's performance on this hypothetical data, tells us how much the model restricts behaviors beyond these background restrictions.

We complement the evaluation of restrictiveness, which is based solely on hypothetical data, with an evaluation of the model's performance on actual data, using the measure of completeness proposed in Fudenberg et al. (2019). If a model is very unrestrictive, then its completeness on the real data does not directly speak to its relevance. In contrast, a model that is simultaneously restrictive and complete encodes important structure.

Our restrictiveness measure can be computed from data without the guidance of analytical results regarding the model's implications or empirical content, so it can be used in settings where there are no analytic results that describe the model's implications.
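In outline, the measure can be computed by Monte Carlo simulation. The sketch below is illustrative: `fit_error`, `naive_error`, and `draw_mapping` are hypothetical stand-ins for, respectively, the model's best achievable error on a hypothetical data set, the error of a naive benchmark, and a draw of hypothetical data satisfying the background restrictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_restrictiveness(fit_error, naive_error, draw_mapping, n_sims=100):
    """Average the model's normalized approximation error over randomly
    generated hypothetical data sets (small values = flexible model)."""
    ratios = []
    for _ in range(n_sims):
        f = draw_mapping()
        ratios.append(fit_error(f) / naive_error(f))
    return float(np.mean(ratios))

# Toy illustration (hypothetical): a "model" of constant predictions,
# a naive benchmark that predicts zero, and data drawn uniformly in [0, 1].
def fit_error(f):
    return np.mean((f - f.mean()) ** 2)   # best constant prediction

def naive_error(f):
    return np.mean(f ** 2)                # naive zero prediction

def draw_mapping():
    return rng.uniform(0.0, 1.0, size=50)

r_hat = simulate_restrictiveness(fit_error, naive_error, draw_mapping)
```

A value of `r_hat` near 0 means the model can mimic most admissible data; a value near 1 means it rules most of it out.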
We provide estimators for restrictiveness and completeness, and characterize their asymptotic distributions. In our application to predicting initial play in games, we compare the Poisson Cognitive Hierarchy Model (PCHM) with logit level-1, which models the distribution of play as a logistic best reply to the uniform distribution, and with logit PCHM, which allows for logistic best replies in the PCHM (Wright and Leyton-Brown, 2014). We find that logit level-1 not only fits the actual data better than the PCHM, but is also more restrictive. Moreover, logit level-1 performs almost as well as the more complex logit PCHM on the actual data, and is substantially more restrictive.

Our measure of restrictiveness provides a new perspective on the problem of how richly to parameterize a model. Minimizing cross-validated prediction error can help, as overparameterized models can overfit to training data and perform poorly on test data. But cross-validation, like other techniques for guarding against overfitting, tends to favor increasingly flexible models given increasingly large data sets. In contrast, our approach supposes an intrinsic preference for more parsimonious models. As we show, models with a small number of parameters, such as the four-parameter specification of CPT that we examine, can allow for a large range of behaviors, and models with the same number of parameters (PCHM versus logit level-1) can differ substantially in their restrictiveness. Understanding the range of behaviors permitted by these models explains how much a model's success on real data is due to flexibility and how much is due to specifically tracking regularities present in the data.

Koopmans and Reiersol (1950) defined a model to be observationally restrictive if the distributions of observables it allows are a proper subset of the distributions that would otherwise be possible.

Footnote: There are representation theorems for many non-parametric theories of individual choice, and some analytic results for the sets of equilibria in games, but we are unaware of representation theorems for the functional forms that are commonly used in applied work.
Their definition is with respect to an ambient family of outcome distributions; when this ambient family consists of every distribution, a non-restrictive theory cannot be refuted from data.

Selten (1991) subsequently proposed measuring the restrictiveness of a model by the fraction of possible data sets that it can exactly explain.

Our paper is also related to the vast literature in statistics and econometrics on model selection, which dates back to Cox (1961, 1962). Unlike classic measures, including AIC and BIC, restrictiveness is not based on observed data, and it is not designed to guard against overfitting. Instead, it provides a practical procedure for evaluating the restrictiveness of a parametric model within a class of permissible models. Similarly, although the VC dimension, which provides another measure of the "span" of a model, is related to our restrictiveness measure at a high level, it is generally nontrivial to determine the VC dimension of a given model. In contrast, our metric is (by design) easy to compute. Finally, to derive standard errors for our estimator of completeness, the paper utilizes a recent development in the statistics literature (Austern and Zhou, 2020) on the asymptotic theory of the cross-validation risk estimator.

Footnote: As Koopmans and Reiersol (1950) point out, a special case of an observationally restrictive specification is an overidentifying restriction. See e.g. Sargan (1958), Hausman (1978), Hansen (1982), and Chen and Santos (2018) for econometric tests of overidentification.
Let X be an observable (random) feature vector taking values in a finite set X, and let Y be an observable random outcome variable taking values in a finite-dimensional set Y. We use P* to denote the joint distribution of (X, Y), P*_X to denote the marginal distribution of X, and P*_{Y|X} to denote the conditional distribution of Y given X. We assume that the marginal P*_X is known to the analyst, while the conditional distribution P*_{Y|X} is not.

The analyst wants to learn a function of the conditional distribution, s(P*_{Y|X=x}) ∈ S, where S is finite-dimensional. We call any function f : X → S a predictive mapping, or simply mapping, and denote the true mapping f* by f*(x) := s(P*_{Y|X=x}). The set of all possible mappings is denoted by F.

We focus on two leading cases of this problem whose structure makes our methods easier to explain; Section 7 explains how to extend our approach to more general problems.

Footnote: For example, the Harless and Camerer (1994) exercise would be much harder on larger menus of binary lotteries, on 3-outcome lotteries, or if subjects had been asked to report real-valued certainty equivalents.
Footnote: This paper has a different goal than the extensive econometric literature that studies how the "restrictiveness" of an econometric model may affect the identification of parameters and the efficiency of estimators.
Footnote: The VC dimension is known for very few economic models. A recent exception is the work of Basu and Echenique (2020) for various models of decision-making under uncertainty.

Prediction of a Conditional Expectation.
When the statistic of interest is E_{P*}[Y | X], the analyst's objective is to learn the average outcome for each realization of X. To evaluate the error of predicting f(x) when the realized outcome is y, we use squared loss

l(f, (x, y)) := (y − f(x))².

The expected error of a mapping f is then e_{P*}(f) = E_{P*}[(Y − f(X))²], which is minimized by the true mapping f*(x) = E_{P*}[Y | X = x]. We show in Appendix A that the difference between the error e_{P*}(f) of an arbitrary mapping f ∈ F and the best possible error e_{P*}(f*) is

d_MSE(f, f*) := e_{P*}(f) − e_{P*}(f*) = E_{P*_X}[(f*(X) − f(X))²],   (1)

i.e. the expected mean-squared difference between the predicted outcomes.

Our first application, predicting the average reported certainty equivalent for binary lotteries, is an example of this case. Each lottery is described as a tuple x = (z₁, z₂, p), and the feature space X consists of the 50 tuples associated with lotteries in a data set from Bruhin et al. (2010). The outcome space of certainty equivalents is Y = R, and we seek to predict the population average of certainty equivalents for each lottery x ∈ X. A predictive mapping for this problem specifies an average certainty equivalent for each of the 50 binary lotteries.

Prediction of a Conditional Distribution.
Here the statistic of interest is P*_{Y|X}, so the analyst's objective is to learn the conditional distribution itself. To evaluate the error of predicting f(x) when the realized outcome is y, we use the negative (conditional) log-likelihood

l(f, (x, y)) := − log f(y | x).

The expected error of mapping f is e_{P*}(f) = E_{P*}[− log f(Y | X)], which is minimized by the true conditional distribution f*(x) = P*_{Y|X}(x). As we show in Appendix A, the difference between the error of an arbitrary mapping f ∈ F, e_{P*}(f), and the best possible error, e_{P*}(f*), is

d_KL(f, f*) := e_{P*}(f) − e_{P*}(f*) = E_{P*_X}[ Σ_y f*(y | X) (log f*(y | X) − log f(y | X)) ],   (2)

i.e. the expected Kullback–Leibler divergence between f and the true distribution.

Our second application, predicting initial play in matrix games, is an example of this case. Here the feature space X consists of the 466 unique 3 × 3 games in our data, each described as a vector of payoffs. The outcome space is Y = {a₁, a₂, a₃} (the set of row player actions), and the analyst seeks to predict the conditional distribution over Y for each game, interpreted as choices made by a population of subjects for the same game. Thus, S = Δ(Y), the set of all distributions over row player actions. A predictive mapping is any function f : X → S taking the 466 games into predicted distributions of play.

Footnote: For example, in a decision theory experiment the experimenter knows the distribution over menus that the subjects will face.

Our goal is to evaluate the restrictiveness of parametric models F_Θ = {f_θ}_{θ∈Θ} ⊆ F, where the permitted mappings f_θ are indexed by a finite-dimensional parameter θ and Θ is a compact set. If the model F_Θ contains a mapping that can approximate the predictions of the true mapping f*, then inf_{f∈F_Θ} e_{P*}(f) also approximates the true mapping's error, e_{P*}(f*).
Given enough data, such a model will predict about as well as possible, but a good fit to the data could be because the model includes the "right" regularities, or because it is simply flexible enough to accommodate any pattern of behavior (i.e. F_Θ includes most mappings).

Our strategy to determine the restrictiveness of a model is to generate random mappings f from a primitive distribution µ. In our applications below, we choose µ
to be uniform over a set F_M ⊆ F of "permissible mappings," which encodes prior knowledge or intuition about the setting. For example, when predicting certainty equivalents for lotteries, we may assume that people prefer more money to less.

We treat both F_M and µ as primitives. In a sense, their role is analogous to the choice of what alternatives to consider when computing the power of a statistical test. In both cases, the right choice is guided by intuition and prior knowledge, and not derived from formal considerations. For this reason, it can be instructive to compute restrictiveness with respect to different choices of µ, including those that have support on different permissible sets F_M, as we do in Appendix B.2.

We then evaluate how well the generated mappings can be approximated using the model F_Θ. When predicting conditional expectations, we define d : F × F → R to extend d_MSE (as given in (1)) to

d_MSE(f, f′) := E_{P*_X}[(f′(X) − f(X))²].

When predicting a conditional distribution, we define d to extend d_KL (as given in (2)) to

d_KL(f, f′) := E_{P*_X}[ Σ_y f′(y | X) (log f′(y | X) − log f(y | X)) ].

Since our subsequent statements hold for both of these functions, we simply use the notation d, understanding that it means d_MSE for predicting the conditional expectation and d_KL for predicting the conditional distribution.

The model's approximation error to a generated mapping f is then

d(F_Θ, f) := inf_{θ∈Θ} d(f_θ, f).

We normalize this raw error relative to a benchmark naive mapping f_naive ∈ F_Θ chosen to suit the problem. We interpret the naive mapping as a lower bound that any sensible model should outperform. For example, in our application to predicting initial play in games, we define the naive mapping to predict a uniform distribution of play in every game.
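With a finite feature set and a known marginal P*_X, both extended divergences are finite sums and can be computed directly. A minimal sketch (the uniform naive benchmark for the games application follows the text; the array shapes and the uniform marginal over games are illustrative assumptions):

```python
import numpy as np

def d_mse(f, f_prime, p_x):
    """Extended MSE divergence: E_{P*_X}[(f'(X) - f(X))^2]."""
    return float(np.sum(p_x * (f_prime - f) ** 2))

def d_kl(f, f_prime, p_x):
    """Extended KL divergence with the second argument as reference:
    E_{P*_X}[ sum_y f'(y|X) (log f'(y|X) - log f(y|X)) ]."""
    per_x = np.sum(f_prime * (np.log(f_prime) - np.log(f)), axis=1)
    return float(np.sum(p_x * per_x))

# Naive benchmark for the games application: uniform play in every game,
# with the marginal over the 466 games taken to be uniform for illustration.
n_games, n_actions = 466, 3
f_naive = np.full((n_games, n_actions), 1.0 / n_actions)
p_x = np.full(n_games, 1.0 / n_games)
```

Here `f` and `f_prime` are arrays of predictions indexed by the realizations of X (one row per game in the KL case), and `p_x` is the known marginal P*_X.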
Normalizing relative to a naive benchmark removes the dependence of the raw approximation error on the units of the problem; we return to this point below.

Footnote: Thus the choice of the distribution of simulated data is related to the choice of what alternatives to consider when evaluating the power of a test. We note that in many settings where a "correct" distribution does not exist, uniform distributions are used as a default. For example, in computational complexity, the average-case time complexity of an algorithm measures the amount of time used by the algorithm, averaged over all possible inputs (Goldreich and Vadhan, 2007).
Definition 1. The f-discrepancy of model F_Θ is

δ_f := d(F_Θ, f) / d(f_naive, f).

Since f_naive is assumed to be an element of F_Θ, the f-discrepancy of F_Θ is bounded above by 1, and since d is nonnegative, the f-discrepancy is also bounded below by zero. Thus, the f-discrepancy in any problem must fall between 0 and 1. Large values of δ_f imply that the model does not approximate f much better than the naive mapping does. Since the naive mapping itself has no free parameters, and therefore does not have the flexibility to accommodate most mappings, concentration of the distribution of δ_f around large values implies that the model rules out many kinds of regularities.

The restrictiveness of model F_Θ is its average f-discrepancy.

Definition 2. The restrictiveness of model F_Θ is

r := E_µ[δ_f].

If F_Θ = F_M (so that the model is completely unrestrictive), then r = 0 for every choice of µ with support on F_M.

While restrictive models are desirable, a restrictive model is not particularly useful if it fails to predict real data. We would like models to embody regularities that are present in actual behavior, and rule out conceivable regularities that are not. We thus evaluate models from the dual perspectives of how restrictive they are and how well they predict actual data. The latter can be measured using the f*-discrepancy of the model, where f* is the true mapping. This measure is tightly linked to the notion of completeness introduced in Fudenberg et al. (2019).

Definition 3. The completeness of model F_Θ is

κ* := [e_{P*}(f_naive) − e_{P*}(F_Θ)] / [e_{P*}(f_naive) − e_{P*}(f*)],

where e_{P*}(F_Θ) := inf_{θ∈Θ} e_{P*}(f_θ). Completeness is the complement of the f*-discrepancy, since

κ* = 1 − [e_{P*}(F_Θ) − e_{P*}(f*)] / [e_{P*}(f_naive) − e_{P*}(f*)] = 1 − d(F_Θ, f*) / d(f_naive, f*) = 1 − δ_{f*}.   (3)
A model's completeness can be interpreted as the ratio of the reduction in error achieved by the model (relative to the naive baseline) to the largest achievable reduction. By construction, the measure κ* is scale-free and lies within the unit interval. A large κ* suggests that the model is able to approximate the real data well: at the extremes, a model with κ* = 1 matches the true mapping f* exactly, while a model with κ* = 0 is no better at matching f* than the naive model. We will report both restrictiveness r and completeness κ* for each of the models that we consider.

An alternative "area" measure.
Selten's area measure of model flexibility is a := µ(F_Θ), where µ is the Lebesgue measure, i.e. the fraction of possible mappings that are exactly consistent with the model. Our measure of restrictiveness differs both by normalizing with respect to the performance of a naive model, and by measuring how well the model F_Θ approximates a randomly drawn mapping f in F_M, which allows us to quantify the degree of error. A model that does not include most mappings from F_M would be considered highly restrictive under the Selten measure, but would have low restrictiveness by our measure if it approximated most mappings very well.

Role of the normalization.
We define restrictiveness to be the average value of d(F_Θ, f)/d(f_naive, f), rather than of its un-normalized counterpart d(F_Θ, f). Normalizing relative to a naive mapping has several advantages compared to the unit-dependent raw error d(F_Θ, f): if we were to scale up the payoffs in the binary lotteries in our first application, then d(F_Θ, f) would mechanically scale up as well, even though the flexibility of the model has not changed, which makes it hard to say what constitutes a "large" value of d(F_Θ, f). Normalizing relative to the naive error returns a unitless quantity that is easier to interpret, and can more easily be compared across problems that use different error metrics.

Sensitivity to µ. We might prefer that the restrictiveness measure does not respond too sensitively to small changes in µ. We demonstrate now that it does not. For any two measures µ, µ′ ∈ Δ(F),

|E_µ[δ_f] − E_{µ′}[δ_f]| ≤ ∫ δ_f · |dµ − dµ′| ≤ 2 · d_TV(µ, µ′),   (4)

where d_TV is the total variation distance. Thus for any two measures that are close in total variation distance, the corresponding restrictiveness measures must also be close. We complement this theoretical bound with a numerical sensitivity check in Section 5.2, where we evaluate restrictiveness with respect to beta distributions that are close to our specification that µ is uniform. The resulting variation in restrictiveness is quite small.

Combining κ* and r. Ideal models have high κ*, so they approximate the real data well, but also high restrictiveness r, so they rule out regularities that could have been present but are not. These two criteria generate a partial order on models; there are many ways to complete it. One possibility is to use a lexicographic ordering, where models are ordered first by κ* and then by r. Another is to impose a functional form that combines κ* and restrictiveness r, such as

r − (1 − κ*) = E_µ[δ_f] − δ_{f*}.
Yet another possibility is to use the probability that the model fits the actual data better than it fits a randomly generated data set, namely the quantile of δ_{f*} under the distribution of f-discrepancies. In the present paper, we report κ* and r separately, and leave it to the analyst's discretion whether or how to combine these two metrics.

Footnote: Selten (1991) provided an axiomatic characterization of the similar aggregator m = r − a, where r is the pass rate of the model on the actual data and a is the area measure we discussed above.

Point-Identified and Set-Identified Models.
Note that f-discrepancy, restrictiveness, and completeness are well-defined regardless of whether the parametric model F_Θ is point-identified or set-identified. This is because the definitions of d(F_Θ, f), restrictiveness, and e_{P*}(F_Θ) do not rely on the uniqueness of the minimizers. In other words, we evaluate the parametric model F_Θ with d and e_{P*}, so our measures do not differentiate point-identified models from set-identified models that yield the same d(F_Θ, f) and e_{P*}(F_Θ).

We now discuss how to implement our approach in practice.

Estimating r. We provide an algorithm for computing r: sample M times from the distribution µ on F_M, and for each sampled f_m ∈ F_M, compute

δ_m := d(F_Θ, f_m) / d(f_naive, f_m).

The sample mean δ̄_M := (1/M) Σ_{m=1}^M δ_m is an estimator for restrictiveness. In principle, the number of simulations we run, M, can be taken as large as we want, so δ̄_M can be made arbitrarily close to r by the Law of Large Numbers. Moreover, the approximation error under a given finite M can be quantified using standard statistical inference methods. We focus on the case where the distribution of δ_m is nondegenerate.

Assumption 1.
The distribution of δ_m is non-degenerate.

Assumption 1 is a very mild condition that can be easily verified: it suffices that some two simulated values δ_m and δ_{m′} are distinct. The sample variance is

σ̂²_δ := (1/M) Σ_{m=1}^M (δ_m − δ̄_M)²,   (5)

and the standard Central Limit Theorem gives the following result.

Proposition 1.
Under Assumption 1,

√M (δ̄_M − r) / σ̂_δ →_d N(0, 1).

The (1 − α) confidence interval for r is given by

[ δ̄_M − q_{1−α/2} · σ̂_δ / √M ,  δ̄_M − q_{α/2} · σ̂_δ / √M ],

where σ̂²_δ is given in (5).

One-sided hypothesis tests on r, e.g. for the null that r = 0 so that the model is completely unrestrictive, can also be carried out in standard ways. We again note that the confidence intervals here simply measure the approximation error of δ̄_M based on a finite number of simulations and do not reflect randomness in experimental data.

Estimating κ*. In this section, we show how to estimate completeness, κ*. Suppose that the analyst has access to a finite sample of data {Z_i := (X_i, Y_i)}_{i=1}^N drawn from the unknown true distribution P*. To estimate completeness, we use K-fold cross-validation to estimate the out-of-sample prediction error of the model. (In our applications, we take the standard choice of K = 10.) Specifically, we randomly divide Z^N into K (approximately) equal-sized groups. To simplify notation, assume that J_N = N/K is an integer. Let k(i) denote the group number of observation Z_i, and for each group k = 1, ..., K, define

f̂_{−k} := argmin_{f ∈ F̃} (1/(N − J_N)) Σ_{k(i) ≠ k} l(f, Z_i)

to be the mapping from the class F̃ that minimizes error for prediction of observations outside of group k. This estimated mapping is used for prediction of the k-th test set, and

ê_k := (1/J_N) Σ_{k(i) = k} l(f̂_{−k}, Z_i)

is its out-of-sample error on the k-th test set. Then

CV(F̃) := (1/K) Σ_{k=1}^K ê_k
is the average test error across the K folds. This is an estimator for the unobservable expected error of the best mapping in the class F̃.

Setting F̃ to be F_Θ, F, or F_naive := {f_naive}, we can compute CV(F_Θ), CV(F), and CV(F_naive) from the data, leading to the following estimator for κ*:

κ̂* = [CV(F_naive) − CV(F_Θ)] / [CV(F_naive) − CV(F)].

It is crucial that the denominator in κ̂* does not vanish asymptotically, so we impose the following assumption:

Assumption 2 (Naive Rule is Imperfect). e_{P*}(f_naive) − e_{P*}(f*) > 0.

This assumption is quite weak, as it simply says that the naive mapping performs strictly worse in expectation than the best mapping. Under additional technical conditions, we show, by applying and adapting Proposition 5 in Austern and Zhou (2020), that κ̂* is asymptotically normal. See Appendix C for details.

To obtain the standard error, we use a variance estimator adapted from Proposition 1 in Austern and Zhou (2020). Specifically, for the k-th test set, let f_{θ̂_{−k}} and f̂_{−k} be the estimated mappings from models F_Θ and F, respectively. The difference in their test errors on observation Z_i is Δ(Z_i) = l(f_{θ̂_{−k}}, Z_i) − l(f̂_{−k}, Z_i), and the average difference across all observations in test fold k is

Δ̄_k = (1/J_N) Σ_{k(i) = k} Δ(Z_i).

The sample variance of the difference in test errors is

σ̂²_{Δ,k} = (1/(J_N − 1)) Σ_{k(i) = k} (Δ(Z_i) − Δ̄_k)².

Based on this, we define the following variance estimator for κ̂*:

σ̂²_{κ̂*} := (1/K) Σ_{k=1}^K σ̂²_{Δ,k} / [CV(F_naive) − CV(F)]².   (6)

We establish the asymptotic distribution of our proposed estimators via the following theorem.

Theorem 1.
Under Assumption 2 and some regularity conditions,

√N (κ̂* − κ*) / σ̂_{κ̂*} →_d N(0, 1).

Consequently, the (1 − α) two-sided confidence interval for κ* is given by

[ κ̂* − q_{1−α/2} · σ̂_{κ̂*} / √N ,  κ̂* − q_{α/2} · σ̂_{κ̂*} / √N ],

where σ̂_{κ̂*} is given in (6).

Footnote: See Appendix C for details of these assumptions.

Our first application is to predicting certainty equivalents for a set of 25 binary lotteries from Bruhin et al. (2010). Each lottery is described as a tuple x = (z₁, z₂, p), where z₁ > z₂ ≥ 0 and p is the probability of the larger prize z₁. (See Appendix B.1 for our analysis of the Bruhin et al. (2010) lotteries in the loss domain, which is qualitatively very similar.) The feature space X consists of the 25 tuples associated with lotteries in the Bruhin et al. (2010) data. The outcome space is Y = R. Each observation in the data is a pair consisting of a lottery and a certainty equivalent reported for that lottery by a given subject. Note that the variation in Y for fixed X reflects the fact that different subjects report different certainty equivalents for the same lottery. (In Appendix B.5, we discuss how to extend our approach to allow for subject-level heterogeneity.)

We seek to predict the average of the certainty equivalents (over subjects) reported for each lottery. A mapping for this problem is any function f : X → R from the 25 lotteries to predicted average certainty equivalents. We define d(f, f′) to be the expected mean-squared distance between the two mappings' predictions, as in (1).
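The K-fold procedure for estimating completeness can be sketched as follows; `fit` and `loss` are black-box stand-ins for estimation over a class F̃ and for the loss l, and the fold assignment shown is one simple possibility:

```python
import numpy as np

def cv_error(fit, loss, Z, K=10, seed=0):
    """K-fold cross-validated prediction error CV(F~) of a model class.
    fit(train) returns a fitted mapping; loss(f, z) scores one observation."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(Z)) % K   # random, roughly equal-sized folds
    errs = []
    for k in range(K):
        train = [Z[i] for i in range(len(Z)) if folds[i] != k]
        test = [Z[i] for i in range(len(Z)) if folds[i] == k]
        f_hat = fit(train)                # estimate on K-1 folds
        errs.append(np.mean([loss(f_hat, z) for z in test]))
    return float(np.mean(errs))

def completeness(cv_model, cv_naive, cv_flexible):
    """kappa-hat = [CV(naive) - CV(model)] / [CV(naive) - CV(flexible)]."""
    return (cv_naive - cv_model) / (cv_naive - cv_flexible)
```

Running `cv_error` three times, with `fit` estimating over F_Θ, over the unrestricted class F, and returning the fixed naive mapping, yields the three CV errors that `completeness` combines into κ̂*.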
The model we evaluate is the specification of Cumulative Prospect Theory indexed by θ = (α, γ, η) ∈ R₊ × R₊ × R₊, which predicts

f_θ(z₁, z₂, p) = v⁻¹( w(p) v(z₁) + (1 − w(p)) v(z₂) ),

where

v(z) = z^α   (7)

is a value function for money, and

w(p) = η p^γ / (η p^γ + (1 − p)^γ)   (8)

is a probability weighting function. We specify F_Θ as the set of all such functions f_θ, and refer to this model simply as CPT. As a naive benchmark, we consider the function f_naive that maps each lottery into its expected value, corresponding to α = γ = η = 1 in CPT.

CPT is 95% complete for predicting this data,

κ̂* = [CV(F_naive) − CV(F_Θ)] / [CV(F_naive) − CV(F)] = 0.95,

so the model achieves almost all of the possible improvement in prediction accuracy over the naive baseline. (Equivalently, the estimated f*-discrepancy of this model is 0.05.) One explanation is that CPT is a very good model of risk preferences; another possibility is that the model is flexible enough to mimic most functions from binary lotteries to certainty equivalents. These explanations have very different implications for how to interpret CPT's empirical success. To distinguish between these explanations, we now compute CPT's restrictiveness.

Footnote: This parametric form for w(p) was first suggested by Goldstein and Einhorn (1987) and Lattimore et al. (1992). Following Bruhin et al. (2010) and much of the literature, we estimate separate values of these parameters for losses (see Appendix B.1), so in a sense the "overall CPT model" has 6 parameters.

Our primitive distribution µ is a uniform distribution over the set of all mappings f satisfying:
Footnote: A similar result was reported in Fudenberg et al. (2019) for the pooled sample of gain-domain and loss-domain lotteries. This finding is consistent with Peysakhovich and Naecker (2017)'s result that CPT approximates the predictive performance of lasso regression trained on a high-dimensional set of features.

1. z₂ ≤ f(z₁, z₂, p) ≤ z₁
2. if z₁ ≥ z₁′, z₂ ≥ z₂′, and p ≥ p′, then f(z₁, z₂, p) ≥ f(z₁′, z₂′, p′)

3. if z₁ ≥ z₂ and p ≥ p′, then f(z₁, z₂, p) ≥ f(z₁, z₂, p′)

Constraint (1) requires that the certainty equivalent is within the range of the possible payoffs, while constraints (2) and (3) require f to respect first-order stochastic dominance. Note that in the Bruhin et al. (2010) lottery data, there are many pairs of lotteries that can be compared via (2) and (3), so these conditions are not vacuous. Below we plot the distribution of f-discrepancies for 100 random mappings f.

Figure 1: Distribution of f-discrepancies δ_f for 100 randomly generated mappings f (horizontal axis: f-discrepancy; vertical axis: number of mappings).
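Random permissible mappings can be drawn by rejection sampling against constraints (1)-(3); the small lottery menu below is a hypothetical stand-in for the Bruhin et al. (2010) menu.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small hypothetical lottery menu (z1, z2, p): z1 > z2 = 0, p = prob. of z1.
lotteries = [(z1, 0, p) for z1 in (20, 40) for p in (0.25, 0.75)]

def dominates(a, b):
    """Lottery a weakly first-order stochastically dominates lottery b."""
    return a[0] >= b[0] and a[1] >= b[1] and a[2] >= b[2]

def draw_permissible_mapping():
    """Rejection sampling: draw certainty equivalents uniformly within the
    payoff range (constraint 1) and keep the mapping only if it respects
    FOSD across comparable lotteries (constraints 2 and 3)."""
    while True:
        f = {x: rng.uniform(x[1], x[0]) for x in lotteries}
        if all(f[a] >= f[b] for a in lotteries for b in lotteries
               if dominates(a, b)):
            return f
```

Each accepted draw is one hypothetical mapping f; the f-discrepancy of the model is then computed for each draw and averaged to estimate r.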
The restrictiveness of the model (i.e. the average f-discrepancy) is 0.29, so on average CPT's approximation error is less than a third of the error of the naive (expected-value) mapping. Thus CPT is quite flexible, as it rules out very few regularities that are not already restricted by first-order stochastic dominance.

CPT's restrictiveness suggests an explanation of its completeness that is intermediate to the two explanations proposed above: CPT is quite flexible, as its average completeness is 71% on the hypothetical data, but it is even more complete on the real data (95%). Taking both measures into account via a composite such as the difference (r − δ_{f*}), CPT's high completeness on real data somewhat compensates for its flexibility.

Footnote: This uniform distribution is well-defined since F_M is a bounded subset of a finite-dimensional Euclidean space.

As a sensitivity check, we perturb the primitive distribution µ, which we chose to be uniform. The uniform distribution is the same as beta(1, 1), so we consider beta(a, b) distributions with parameters (a, b) sampled uniformly from a range around (1, 1). For each (a, b) pair, we generate certainty equivalents from a beta(a, b) distribution over the prize range, again keeping only those functions f that satisfy FOSD. Over 100 such distributions beta(a, b), the average restrictiveness is 0.30, with a min value of 0.17 and a max value of 0.41. Thus our finding that CPT is quite flexible is robust to these perturbations in µ.

Next, in Appendix B.2, we compute the restrictiveness of the model with respect to a different background constraint, dropping the FOSD restrictions in (2) and (3) while keeping the range restriction in (1). We would expect the restrictiveness of CPT to increase in this case, since (for all parameter values) CPT obeys first-order stochastic dominance. We find, however, that the restrictiveness of CPT relative to this larger permissible set, 0.35, is only slightly higher than the restrictiveness of 0.29 that we find for the main specification of F_M.
This reinforces our finding that CPT is not very restrictive.

Our analysis so far leaves open the possibility that the flexibility of the 4-parameter CPT model is specific to the domain of binary lotteries. In Appendix B.4, we evaluate the restrictiveness of CPT on a set of 3-outcome lotteries from Bernheim and Sprenger (2020). We find that CPT is indeed more restrictive on this domain, but still quite flexible: its restrictiveness on these lotteries is 0.50. In particular, CPT is much less restrictive than the models of initial play that we study in Section 6.

Footnote: The variation in restrictiveness is bounded by the total variation distance between the primitive choices of µ (see (4)), but it can be difficult to compute the total variation distance between complex choices of µ.
Footnote: Normalization plays an important role here: CPT's errors are substantially higher when we drop FOSD, but so are the errors of the naive benchmark (Expected Value). CPT's relative performance compared to the naive benchmark is comparable whether or not we impose FOSD.

5.3 Comparing Models

One way to evaluate the value of additional parameters is to compare the increase in completeness that they permit, relative to the decrease in restrictiveness. As an illustration, we compare the three-parameter specification of CPT with more restrictive special cases that have been studied in the literature: η = 1, as in Tversky and Kahneman (1992); α = 1, which corresponds to a risk-neutral CPT agent whose utility function over money is u(z) = z but who exhibits nonlinear probability weighting; and η = γ = 1, which corresponds to an Expected Utility decision-maker whose utility function is as given in (7). We refer to these models respectively as CPT(α, γ), CPT(γ, η), and CPT(α), where models are associated with their free parameters. We refer to the original three-parameter specification of CPT as CPT(α, η, γ). The distributions of f-discrepancies under these more restrictive models are shown in Figure 2 below.

Figure 2: Comparison of distributions of f-discrepancies for (a) the original CPT(α, η, γ), (b) CPT(α, γ), (c) CPT(η, γ), and (d) CPT(α) (histograms; horizontal axis: f-discrepancy; vertical axis: number of mappings).

Less general specifications are always at least weakly more restrictive, but the restrictiveness of a model must be considered jointly with its ability to explain the actual data.
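To make the comparison of nested specifications concrete, here is an illustrative sketch: the lottery menu and the coarse parameter grid are hypothetical, grid search stands in for the actual estimation, and the certainty equivalent applies v⁻¹ so that predictions are in money units. Fixing a parameter (here η = 1) can only weakly increase the best achievable approximation error.

```python
import numpy as np
from itertools import product

def cpt_ce(z1, z2, p, alpha, gamma, eta):
    """CPT certainty equivalent for a binary lottery, using (7)-(8):
    v(z) = z^alpha, w(p) = eta p^gamma / (eta p^gamma + (1-p)^gamma)."""
    w = eta * p**gamma / (eta * p**gamma + (1 - p)**gamma)
    value = w * z1**alpha + (1 - w) * z2**alpha
    return value ** (1 / alpha)   # invert v to express the CE in money units

def best_fit_error(targets, lotteries, grid):
    """Grid-search approximation of inf_theta MSE between model and target."""
    best = np.inf
    for alpha, gamma, eta in grid:
        preds = np.array([cpt_ce(z1, z2, p, alpha, gamma, eta)
                          for z1, z2, p in lotteries])
        best = min(best, float(np.mean((preds - targets) ** 2)))
    return best

lotteries = [(40, 0, p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]  # hypothetical menu
vals = np.linspace(0.4, 1.6, 7)                              # coarse grid
full_grid = list(product(vals, vals, vals))          # CPT(alpha, gamma, eta)
restricted_grid = [(a, g, 1.0) for a, g, _ in full_grid]     # eta fixed at 1
```

Because the restricted grid is a subset of the full grid, the restricted model's best fit to any target mapping is never better, which is the sense in which less general specifications are weakly more restrictive.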
Table B.1 reports restrictiveness and completeness for all four specifications of CPT, and Figure 5.3 plots these measures.

Free Parameters   Completeness κ*   N   Restrictiveness r   M
α, γ, η           0.95                  0.29
α, γ
γ, η              0.91                  0.49
α

Table 1: N is the number of observations in the data used to estimate κ*. M is the number of generated mappings from F_M for computation of r.
Figure 3: Comparison of models: restrictiveness plotted against completeness for CPT-(α, η, γ), CPT-(α, γ), CPT-(α), and CPT-(η, γ).

We find that CPT(η, γ), which uses only the nonlinear probability weighting parameters η and γ, achieves a higher completeness than CPT(α, γ), and does so despite being more restrictive. This suggests to us that it is a better model of risk preferences. Adding the risk-aversion parameter α to the nonlinear probability weighting parameters η and γ leads to only a slight improvement in completeness (κ* increases from 0.91 to 0.95), but results in a substantial drop in restrictiveness (r falls from 0.49 to 0.29). This suggests that the probability weighting parameters η and γ are more useful than the utility curvature parameter α. (These qualitative comparisons also hold when we consider lotteries on the loss domain; see Appendix B.1.) Our finding is consistent with previous studies which find that probability distortions play an important role in explaining field data (Snowberg and Wolfers, 2010; Barseghyan et al., 2013), and adds a perspective on how much flexibility these parameters introduce. The model CPT-(α) is less complete, but more restrictive, than CPT-(η, γ), so these two models cannot be directly ranked.

Our second application is to predicting the distribution of initial play in games. Here the feature space X consists of the 466 unique 3 × 3 games in our data, each described as a vector of payoffs. The outcome space is Y = {a_1, a_2, a_3} (the set of row player actions), and the analyst seeks to predict the conditional distribution over Y for each game, interpreted as choices made by a population of subjects for the same game. Thus, S = Δ(Y), the set of all distributions over row player actions. A mapping for this problem is any function f : X → S taking the 466 games into predicted distributions of play.
We define d(f, f′) to be the expected Kullback-Leibler divergence between the predicted distributions under f and f′, as in (2). We define the naive mapping to predict the uniform distribution for every game: f_naive(x) = (1/3, 1/3, 1/3) for every x.

This data includes a meta data-set of experimental data aggregated in Wright and Leyton-Brown (2014) from six experimental game theory papers, in addition to Mechanical Turk data from new experiments in Fudenberg and Liang (2019).

Additionally, we consider three economic models for this prediction task. The Poisson Cognitive Hierarchy Model (PCHM) of Camerer et al. (2004) supposes that there is a distribution over players of differing levels of sophistication: the level-0 player randomizes uniformly over his available actions; the level-1 player best responds to level-0 play (Stahl and Wilson, 1994, 1995; Nagel, 1995); and for k ≥ 2, level-k players best respond to a perceived distribution over lower levels

p_k(h, τ) = π_τ(h) / Σ_{l=0}^{k−1} π_τ(l)  for each h ∈ {0, . . . , k − 1},   (9)

where π_τ denotes the Poisson distribution with rate parameter τ.

The second model, logit level-1, has a single free parameter λ ∈ R_+. For each action a_i, the predicted frequency with which a_i is played is

exp(λ · u(a_i)) / Σ_{j=1}^{3} exp(λ · u(a_j)),

where u(a_i) is the expected payoff of a_i against uniform play. This model nests prediction of uniform play (our naive rule) as λ = 0, and predicts a degenerate distribution on the level-1 action when λ is sufficiently large.

Finally, we consider a model that we call logit PCHM (see e.g. Wright and Leyton-Brown (2014)), which replaces the assumption of exact maximization in the PCHM with a logit best response. This model has two free parameters: λ, τ ∈ R_+. The level-0 player chooses g_0 = (1/3, 1/3, 1/3); for k ≥ 1, define

v_k(a_i) = Σ_{h=0}^{k−1} p_k(h, τ) ( Σ_{j=1}^{3} g_h(a_j) u(a_i, a_j) )

to be the expected payoff of action a_i against a player whose type is distributed according to p_k(·, τ), where p_k(h, τ) is as defined in (9), and define

g_k(a_i) = exp(λ · v_k(a_i)) / Σ_{j=1}^{3} exp(λ · v_k(a_j))

to be the distribution of level-k play. We aggregate across levels using a Poisson distribution with rate parameter τ.

The models PCHM, logit level-1, and logit PCHM turn out to be 43.6%, 72.7%, and 72.9% complete on the actual data. (Equivalently, their f*-discrepancies are 0.564, 0.273, and 0.271.) Thus, as observed in a related study by Wright and Leyton-Brown (2014), logit PCHM provides much better predictions of the distribution of play than the baseline PCHM does. Perhaps surprisingly, we find that almost all of this improvement is obtained by simply adding the logit parameter to the level-1 model, i.e. the further improvement from allowing for multiple levels of sophistication is negligible.

The strong performance of logit level-1 for predicting initial play is consistent with the earlier result of Fudenberg and Liang (2019) that the level-1 model provides a good prediction of the modal action.
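The logit PCHM recursion above can be sketched in a few lines. The block below implements the truncated Poisson weights p_k(h, τ), logit responses, and level-by-level aggregation; the truncation at a maximum level and the payoff matrix `U` are illustrative assumptions, not the authors' implementation.

```python
import math

def poisson_pmf(h, tau):
    return math.exp(-tau) * tau**h / math.factorial(h)

def truncated_levels(k, tau):
    # p_k(h, tau): Poisson(tau) weights over levels 0..k-1, renormalized as in (9)
    weights = [poisson_pmf(h, tau) for h in range(k)]
    s = sum(weights)
    return [x / s for x in weights]

def logit(values, lam):
    # logit choice probabilities; lam = 0 gives the uniform distribution
    e = [math.exp(lam * v) for v in values]
    s = sum(e)
    return [x / s for x in e]

def logit_pchm(U, tau, lam, max_level=5):
    """U[i][j]: row player's payoff from a_i against column action a_j.
    Returns a predicted distribution of row play, aggregating logit level-k
    play with Poisson(tau) weights truncated at max_level (an assumption)."""
    n = len(U)
    g = [[1.0 / n] * n]  # level-0 plays uniformly
    for k in range(1, max_level + 1):
        p_k = truncated_levels(k, tau)
        v = [sum(p_k[h] * sum(g[h][j] * U[i][j] for j in range(n))
                 for h in range(k)) for i in range(n)]
        g.append(logit(v, lam))
    agg = truncated_levels(max_level + 1, tau)
    return [sum(agg[k] * g[k][i] for k in range(max_level + 1))
            for i in range(n)]
```

Setting `lam=0` recovers the naive uniform prediction, mirroring the nesting noted in the text.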
It is harder to predict the full distribution of play, so it is not obvious from the previous result that level-1 play with a logit noise parameter would perform so well for prediction of the distribution of play. The strong performance of level-1 for predicting modal play, combined with our new observation that logit level-1 does a good job predicting the distribution of play, suggests that initial play in many games is rather unstrategic.

We turn now to evaluating the restrictiveness of these models. Compared to the case of preferences over binary lotteries, economic theory provides very little in the way of a priori restrictions on initial play. We thus define the permissible set F_M to include all mappings satisfying the following very weak conditions:

1. If an action is strictly dominated, then the frequency with which it is chosen does not exceed 1/3.
2. If an action is strictly dominant, then the frequency with which it is chosen is at least 1/3.

In Fudenberg and Liang (2019), we found that modal play in some sorts of games is better described by equilibrium notions than level-1. Since such regularities cannot be accommodated by the logit level-1 model, these may explain the gap between the completeness of logit level-1 and full completeness. Costa-Gomes et al. (2001) find a sizable fraction of level-2 players in their experimental data, which may further help to explain this gap.

Classic game theory alone would suggest that dominant strategies have probability 1 and dominated strategies have probability 0, but this is inconsistent with our data (and most experimental data of play in games).

In the actual data, the median strictly dominated action receives a frequency of 0.03 and the max frequency is 0.35.
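Drawing a random distribution of play satisfying the two background restrictions above can be done by rejection sampling from the uniform distribution on the simplex. The sketch below does this for one 3-action game; which actions are dominated or dominant is supplied by the caller, and the function names are hypothetical.

```python
import random

def random_simplex_point():
    # uniform draw from the 2-simplex via two sorted uniforms
    a, b = sorted(random.random() for _ in range(2))
    return (a, b - a, 1 - b)

def sample_permissible(dominated, dominant):
    """Draw one distribution over 3 row actions, redrawing until the
    background restrictions hold: each strictly dominated action (indices
    in `dominated`) gets at most 1/3, each strictly dominant action
    (indices in `dominant`) gets at least 1/3."""
    while True:
        p = random_simplex_point()
        if all(p[i] <= 1/3 for i in dominated) and \
           all(p[i] >= 1/3 for i in dominant):
            return p
```

Repeating this draw independently across the 466 games would produce one hypothetical mapping f from the permissible set.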
In the actual data, the median strictly dominant action receives a frequency of 0.86 and the min frequency is 0.69.

For each of the three models (PCHM, logit level-1, and logit PCHM), we generate 100 mappings f from a uniform distribution µ over the set of permissible mappings F_M, and evaluate the f-discrepancies with respect to these mappings. The distributions of f-discrepancies are shown in the figure below.

Figure 4: Distribution of f-discrepancies for the three models: (a) Logit Level-1, (b) PCHM, (c) Logit PCHM.

We find that logit level-1's restrictiveness is 0.930, PCHM's is 0.915, and logit PCHM's is 0.822; across the simulated mappings, the f-discrepancy is always at least 0.72. Equivalently, the completeness of these models across the simulated mappings is bounded above by 0.28. Since the completeness of these models on the actual data ranged from 0.436 to 0.729, each of these models is a much better predictor of the real data than of our hypothetical data sets.

Simply comparing the completeness of the PCHM, 0.436, against the completeness of CPT, 0.95, suggests that the PCHM is a "worse" model of initial play than CPT is of certainty equivalents for lotteries. The contrast in their restrictivenesses (0.915 vs. 0.27) tells us that while PCHM does not capture all of the observed behaviors, it more successfully rules out behaviors that we do not observe. These two perspectives are depicted in Figure 5, where δ_{f*} is smaller for CPT than for PCHM, implying that CPT better fits real data, but the distribution of f-discrepancies computed from simulated data is concentrated at substantially larger values for PCHM, so it is the more restrictive model.

The set F_M can be embedded in [0, 1]^{466×3}, and so the uniform distribution over F_M is well-defined.
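The restrictiveness computation just described has a simple generic shape: fit the model to each hypothetical mapping, compute the discrepancy of the fit, normalize by the naive mapping's discrepancy, and average. The sketch below abstracts this loop; `fit_model`, `d`, and `f_naive` are caller-supplied stand-ins for the paper's estimation step, discrepancy measure, and naive benchmark.

```python
def restrictiveness(simulated, fit_model, d, f_naive):
    """Average normalized discrepancy of a model across hypothetical
    mappings f drawn from the permissible set (a sketch of the paper's r).
    For each f, delta_f = d(best model fit to f, f) / d(f_naive, f)."""
    deltas = []
    for f in simulated:
        f_theta = fit_model(f)   # model mapping that best approximates f
        deltas.append(d(f_theta, f) / d(f_naive, f))
    return sum(deltas) / len(deltas)
```

A model that can match almost any hypothetical f has deltas near 0 (unrestrictive); a model that fits hypothetical data no better than the naive benchmark has deltas near 1 (restrictive).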
Figure 5: Comparison of the distribution of f-discrepancies, and δ_{f*}, across the two applications: (a) CPT, (b) PCHM.

Table 6.2 summarizes completeness and restrictiveness measures for all three models.

                 Completeness κ*   N        Restrictiveness r   M
PCHM             0.436             21,393   0.915               100
                 (0.017)                    (0.003)
logit level-1    0.727             21,393   0.930               100
                 (0.015)                    (0.005)
logit PCHM       0.729             21,393   0.822               100
                 (0.014)                    (0.003)

Table 2: N is the number of observations in the data used to estimate κ*. M is the number of generated mappings from F_M for computation of r.

From Table 6.2 we see that logit level-1 is more complete and also more restrictive than the PCHM. Logit level-1 is also substantially more restrictive than logit PCHM, at the cost of only a slight and statistically insignificant decrease in completeness. These observations suggest that logit level-1 may be a preferable model to the PCHM and logit PCHM for predicting initial play.

The figure naturally suggests composite measures such as r − δ_{f*} (the difference between the average f-discrepancy computed on hypothetical data and the f*-discrepancy computed on real data), or the fraction of sampled f for which δ_f < δ_{f*} (as proposed in Section 3.4). By either of these composite measures, PCHM is the "better" model, but we don't know what the right composite measure is.

We suspect that PCHM and logit PCHM would outperform logit level-1 for predicting the actions of subjects who played these games several times and learned from feedback. Note however that the restrictiveness of the models would not change.

7 Application to General Prediction Problems

In the two leading cases we have analyzed in the main text (Section 3.1), the function d is derived from a primitive loss function l. We call the general property that permits this decomposability.

Definition.
Consider an arbitrary loss function l : F × (X × Y) → R and define e_{P*}(f) = E_{P*}[l(f, (X, Y))] to be the expected loss of mapping f. For any distribution P, let f_P = argmin_{f ∈ F} e_P(f) denote the error-minimizing mapping under that distribution. Say that the problem is decomposable if there exists a function d : F × F → R such that

d(f, f_P) = e_P(f) − e_P(f_P)   (10)

for every distribution P (with fixed marginal distribution P*_X). That is, d(f, f_P) is the difference between the error of mapping f and the error of the best mapping f_P.

In general, prediction problems need not be decomposable. For example, suppose the objective is to predict the conditional median, and the loss function is l(f, (x, y)) = |y − f(x)| instead of squared loss. The expected error is then e_{P*}(f) = E_{P*}|Y − f(X)|, and the error-minimizing function f* takes each x into the median value of Y at x. We might want to use

d(f, f′) = E_{P*_X}(|f(X) − f′(X)|)   (11)

as a measure of how different the predictions are under f and f′, but this function does not satisfy (10). For the absolute value loss function, there is in fact no function d : F × F → R that satisfies (10), because the difference in errors cannot be determined from f and f* alone, but depends on further properties of the conditional distribution P*. (See Appendix D.2 for more details.)

When the problem is decomposable, as in the cases analyzed in the main text, our approach is applicable without change by setting d to be the function satisfying (10). If the problem is not decomposable, we take d as a primitive, rather than deriving it from the loss function l. The key concepts of f-discrepancies and restrictiveness are defined as they are in the main text, using this primitive d. What we lose is the equivalence between 1 − δ_{f*} and completeness κ*, as described in (3).
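The contrast between squared and absolute loss can be checked on a toy discrete example: under squared loss the error gap e(f) − e(f*) equals (f − f*)² pointwise, while under absolute loss the gap depends on the conditional distribution, not just on the two predictions. The numbers below are hypothetical.

```python
# one feature value x; Y takes values ys with conditional probabilities qs
ys = [0.0, 1.0, 4.0]
qs = [0.2, 0.5, 0.3]

def e_sq(pred):   # expected squared loss of predicting `pred`
    return sum(q * (y - pred)**2 for q, y in zip(qs, ys))

def e_abs(pred):  # expected absolute loss of predicting `pred`
    return sum(q * abs(y - pred) for q, y in zip(qs, ys))

f_star_sq = sum(q * y for q, y in zip(qs, ys))  # conditional mean minimizes e_sq
f_star_abs = 1.0                                # conditional median minimizes e_abs

f_x = 2.0                                       # an arbitrary competing prediction
gap_sq = e_sq(f_x) - e_sq(f_star_sq)            # equals (f_x - f_star_sq)**2
gap_abs = e_abs(f_x) - e_abs(f_star_abs)        # NOT equal to |f_x - f_star_abs|
```

Here `gap_sq` matches the candidate discrepancy exactly, witnessing (10), whereas `gap_abs` (0.4) differs from |f_x − f*| (1.0), illustrating why no d satisfying (10) exists for absolute loss.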
One can report restrictiveness r (based on the primitive d) and completeness κ* (based on the primitive l), understanding that there is no inherent relationship between these concepts. Larger values of r and κ* can still be interpreted as indicating more restrictive and more complete models. A second alternative is to report 1 − δ_{f*} instead of completeness. Since δ_{f*} is derived from d, this second approach does not require specification of a loss function at all. A new estimation procedure for δ_{f*} is needed, however, as our approach in Section 4.2 makes use of the relationship δ_{f*} = 1 − κ*. We provide an alternative estimator for δ_{f*} in Appendix D.1 for this purpose.

When a theory fits the data well, it matters whether this is because the theory captures important regularities in the data, or whether the theory is so flexible that it can explain any behavior at all. We provide a practical, algorithmic approach for evaluating the restrictiveness of a theory, and demonstrate that it reveals new insights into models from two economic domains. The method is easily applied to models from different domains. We conclude with a few final comments.

Why prefer restrictive theories? Completely unrestrictive theories, such as the theory of utility maximization with unrestricted dependence of preferences on the menu, can explain any data and so are vacuous. A theory is falsifiable if there is at least one potential observation that it couldn't explain. We can view restrictiveness as a quantitative extension of the idea of falsifiability.
Just as we prefer falsifiable theories to vacuous ones, we prefer theories that are more restrictive, though this is not quite the same as "more falsifiable," as it replaces the binary evaluation of whether or not a data set refutes the theory with a quantitative evaluation of how well the theory can approximate the data.

For example, to measure the restrictiveness of rational aggregate demand, one could generate random demand functions on a finite collection of budget sets, and compute the "distance" between these functions and one that satisfies GARP. (We thank Tilman Börgers for this suggestion.)

Comparing the predictions of two models. A common practice for distinguishing the empirical content of two models is to find instances where the models make different predictions. We do not compare models here, although our approach can be extended to compare the predictions of two models on the hypothetical data sets. Specifically, instead of evaluating the discrepancy between the estimated model and the best mapping, one could evaluate the discrepancy between the estimated models from two parametric families. The average discrepancy in this case would then represent an average disagreement between the two models on hypothetical data. We leave development of such concepts to future work.

References

Austern, M. and W. Zhou (2020): "Asymptotics of Cross-Validation," arXiv preprint arXiv:2001.11111.

Barseghyan, L., F. Molinari, T. O'Donoghue, and J. C. Teitelbaum (2013): "The Nature of Risk Preferences: Evidence from Insurance Choices," American Economic Review, 103, 2499–2529.

Basu, P. and F. Echenique (2020): "On the falsifiability and learnability of decision theories," Theoretical Economics, forthcoming.

Beatty, T. and I. Crawford (2011): "How Demanding Is the Revealed Preference Approach to Demand?" American Economic Review, 101, 2782–95.

Bernheim, D. and C. Sprenger (2020): "Direct Tests of Cumulative Prospect Theory," Working Paper.

Bronars, S.
(1987): "The Power of Nonparametric Tests of Preference Maximization," Econometrica, 55, 693–698.

Bruhin, A., H. Fehr-Duda, and T. Epper (2010): "Risk and Rationality: Uncovering Heterogeneity in Probability Distortion," Econometrica, 78, 1375–1412.

Camerer, C. F., T.-H. Ho, and J.-K. Chong (2004): "A cognitive hierarchy model of games," The Quarterly Journal of Economics, 119, 861–898.

Chen, X. and A. Santos (2018): "Overidentification in regular models," Econometrica, 86, 1771–1817.

Choi, S., R. Fisman, D. Gale, and S. Kariv (2007): "Consistency and Heterogeneity of Individual Behavior under Uncertainty," American Economic Review, 97, 1–15.

Costa-Gomes, M., V. P. Crawford, and B. Broseta (2001): "Cognition and behavior in normal-form games: An experimental study," Econometrica, 69, 1193–1235.

Cox, D. R. (1961): "Tests of separate families of hypotheses," in Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, vol. 1, 105–123.

——— (1962): "Further results on tests of separate families of hypotheses," Journal of the Royal Statistical Society: Series B (Methodological), 24, 406–424.

Fudenberg, D., J. Kleinberg, A. Liang, and S. Mullainathan (2019): "Measuring the Completeness of Theories," Working Paper.

Fudenberg, D. and A. Liang (2019): "Predicting and Understanding Initial Play," American Economic Review, 109, 4112–4141.

Goldreich, O. and S. Vadhan (2007): "Special Issue On Worst-case Versus Average-case Complexity."

Goldstein, W. M. and H. J. Einhorn (1987): "Expression theory and the preference reversal phenomena," Psychological Review, 94, 236–254.

Hansen, L. P. (1982): "Large sample properties of generalized method of moments estimators," Econometrica, 50, 1029–1054.

Harless, D. and C. Camerer (1994): "The Predictive Utility of Generalized Expected Utility Theories," Econometrica, 62, 1251–1289.

Hausman, J. A. (1978): "Specification tests in econometrics," Econometrica, 46, 1251–1271.

Hey, J. D.
(1998): "An application of Selten's measure of predictive success," Mathematical Social Sciences, 35, 1–15.

Koopmans, T. and O. Reiersol (1950): "The Identification of Structural Characteristics," The Annals of Mathematical Statistics, 21, 165–181.

Lattimore, P. K., J. R. Baker, and A. D. Witte (1992): "The influence of probability on risky choice: A parametric examination," Journal of Economic Behavior & Organization, 17, 315–436.

Nagel, R. (1995): "Unraveling in Guessing Games: An Experimental Study," American Economic Review, 85, 1313–1326.

Peysakhovich, A. and J. Naecker (2017): "Using methods from machine learning to evaluate behavioral models of choice under risk and ambiguity," Journal of Economic Behavior and Organization, 133, 373–384.

Polisson, M., J. K.-H. Quah, and L. Renou (2020): "Revealed Preferences over Risk and Uncertainty," American Economic Review, 110, 1782–1820.

Quiggin, J. (1982): "A Theory of Anticipated Utility," Journal of Economic Behavior and Organization, 3, 323–343.

Sargan, J. D. (1958): "The estimation of economic relationships using instrumental variables," Econometrica, 26, 393–415.

Selten, R. (1991): "Properties for a Measure of Predictive Success," Mathematical Social Sciences, 21, 153–167.

Snowberg, E. and J. Wolfers (2010): "Explaining the Favorite-Long Shot Bias: Is It Risk-Love or Misperceptions?" Journal of Political Economy, 118, 723–746.

Stahl, D. O. and P. W. Wilson (1994): "Experimental evidence on players' models of other players," Journal of Economic Behavior and Organization, 25, 309–327.

——— (1995): "On players' models of other players: Theory and experimental evidence," Games and Economic Behavior, 10, 218–254.

Tversky, A. and D. Kahneman (1992): "Advances in Prospect Theory: Cumulative Representation of Uncertainty," Journal of Risk and Uncertainty, 5, 297–323.

Varian, H. (1982): "The Nonparametric Approach to Demand Analysis," Econometrica, 50, 945–973.

Wright, J. R. and K.
Leyton-Brown (2014): "Level-0 meta-models for predicting human behavior in games," Proceedings of the Fifteenth ACM Conference on Economics and Computation, 857–874.

Yaari, M. (1987): "The Dual Theory of Choice under Risk," Econometrica, 55, 95–115.

A Supplementary Material to Section 3.1

We now demonstrate the relationships in (1) and (2).

Mean-Squared Error. Suppose S = Y = R and the loss function is l(f, (x, y)) = (y − f(x))². The following decomposition is standard:

e_{P*}(f) := E_{P*}[(Y − f(X))²] = E_{P*}[(Y − f*(X))²] + E_{P*}[(f(X) − f*(X))²] = e_{P*}(f*) + d(f, f*).

Negative Log-Likelihood. Suppose S = Δ(Y) where Y is a finite set, and the loss function is l(f, (x, y)) = − log f(y | x) for any mapping f : X → S. Then

d(f, f*) = Σ_{x ∈ X} P*_X(x) Σ_{y ∈ Y} f*(y | x) log( f*(y | x) / f(y | x) ) = E_{P*}[log f*(y | x)] − E_{P*}[log f(y | x)] = −e_{P*}(f*) + e_{P*}(f).

So e_{P*}(f) = e_{P*}(f*) + d(f, f*) as desired.

B Supplementary Material for Application 1

B.1 Loss Domain Results

Below we repeat the analysis of Section 5 for the 25 binary lotteries over the loss domain from Bruhin et al. (2010). Again each lottery is denoted (z_1, z_2, p), where p is the probability of the first prize. The prizes satisfy 0 ≥ z_1 ≥ z_2. We evaluate a 3-parameter version of CPT indexed by θ = (β, γ, η) ∈ R_+ × R_+ × R_+, where

f_θ(z_1, z_2, p) = (1 − w(1 − p)) · v(z_1) + w(1 − p) · v(z_2)

with v(z) = −((−z)^β) and w(p) = (ηp^γ)/(ηp^γ + (1 − p)^γ).

We report below the equivalent of Table B.1 for this domain. The qualitative findings are very similar to what we found in the main text. In particular, CPT's restrictiveness is 0.35 (compare to our previous estimate of 0.29), so CPT is fairly unrestrictive on this set of lotteries as well.
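The negative log-likelihood decomposition above can be verified numerically on a toy example: the error gap e(f) − e(f*) equals the P*_X-weighted KL divergence between the true conditionals f* and the model's conditionals f. The distributions below are hypothetical.

```python
import math

px = [0.5, 0.5]                     # P*_X over two feature values (hypothetical)
f_star = [[0.7, 0.3], [0.2, 0.8]]   # true conditionals f*(y|x)
f      = [[0.5, 0.5], [0.4, 0.6]]   # a model's conditionals f(y|x)

def nll(g):
    # expected negative log-likelihood of mapping g under (P*_X, f*)
    return -sum(px[x] * sum(f_star[x][y] * math.log(g[x][y]) for y in range(2))
                for x in range(2))

# d(f, f*): P*_X-weighted KL divergence, as in the decomposition above
d = sum(px[x] * sum(f_star[x][y] * math.log(f_star[x][y] / f[x][y])
                    for y in range(2))
        for x in range(2))
```

By the decomposition, `nll(f)` equals `nll(f_star) + d` exactly (up to floating-point error).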
Additionally, we again find that CPT-(η, γ) is simultaneously more complete and more restrictive than CPT-(β, γ), and that augmenting CPT-(η, γ) with the utility curvature parameter β only marginally improves completeness while substantially decreasing restrictiveness.

Free Parameters   Completeness κ*   N   Restrictiveness r   M
β, γ, η                                 0.35
β, γ
γ, η
β

Table 3: N is the number of observations in the data used to estimate κ*. M is the number of generated mappings from F_M for computation of r.

B.2 Different Specification for the Permissible Set

Consider the alternative permissible set of mappings consisting of all functions f : X → R satisfying f(z_1, z_2, p) ∈ [z_2, z_1]. We sample 100 times from a uniform distribution over this set. The average discrepancy, 0.35, tells us that the model is more restrictive on this expanded domain of mappings, but not substantially so.

Even though the errors are substantially higher than when we require the permissible mappings to respect FOSD, the estimated restrictiveness is almost the same because the naive error also increases. Specifically, the mean naive error is 343.32 (compared to 178.73 under the original F_M), while the mean CPT error is 110.73 (compared to 58.21 under the original F_M).

B.3 Parameter Estimates

We report below the estimated parameters for each of the models that we consider. In the first column, we report the estimated parameters on the actual data. In the second, we report the average parameter estimates across our generated mappings.

Free Parameters   Real Data              Generated Mappings
α, η, γ           (1.02, 0.6, 0.5)       (1.08, 0.71, 0.41)
α, γ              (0.98, 1.01, 0.50)     (1.03, 0.33)
η, γ              (0.70, 0.50)           (1.08, 0.29)
α
PCHM              τ = 0.                 τ = 0.
logit level-1     λ = 0.                 λ = 0.
logit PCHM        (τ, λ) = (1. , 0.11)   (τ, λ) = (1. , 0.)
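Sampling from the expanded permissible set of Appendix B.2 is straightforward, since each certainty equivalent is drawn independently and uniformly from the lottery's outcome range. The sketch below does this for a list of binary lotteries; the function name and lottery values are hypothetical.

```python
import random

def sample_ce_mapping(lotteries, seed=0):
    """Draw one mapping from the expanded permissible set of Appendix B.2:
    each certainty equivalent f(z1, z2, p) is uniform on [z2, z1], with no
    FOSD monotonicity imposed across lotteries (a sketch)."""
    rng = random.Random(seed)
    return {(z1, z2, p): rng.uniform(z2, z1) for (z1, z2, p) in lotteries}
```

Imposing the main text's FOSD restriction would additionally require rejecting (or re-drawing) mappings whose certainty equivalents violate dominance across lotteries.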
B.4 Three-Outcome Lotteries

We use a set of 18 three-outcome lotteries from Bernheim and Sprenger (2020) (listed below) and evaluate the restrictiveness of Cumulative Prospect Theory for predicting certainty equivalents for these lotteries.

z_1  z_2  z_3  p_1  p_2  p_3
34   24   18   0.1  0.3  0.6
34   24   18   0.4  0.3  0.3
34   24   18   0.6  0.3  0.1
32   24   18   0.1  0.3  0.6
32   24   18   0.4  0.3  0.3
32   24   18   0.6  0.3  0.1
30   24   18   0.1  0.3  0.6
30   24   18   0.4  0.3  0.3
30   24   18   0.6  0.3  0.1
24   23   18   0.3  0.1  0.6
24   23   18   0.3  0.4  0.3
24   23   18   0.3  0.6  0.1
24   21   18   0.3  0.1  0.6
24   21   18   0.3  0.4  0.3
24   21   18   0.3  0.6  0.1
24   19   18   0.3  0.1  0.6
24   19   18   0.3  0.4  0.3
24   19   18   0.3  0.6  0.1

The prizes satisfy z_1 > z_2 > z_3 ≥ 0. On the domain of three-outcome lotteries, CPT predicts

v(z_3) + w(p_1 + p_2)(v(z_2) − v(z_3)) + w(p_1)(v(z_1) − v(z_2))

for each lottery (z_1, z_2, z_3; p_1, p_2, p_3) (Tversky and Kahneman, 1992). We use the functional forms for v and w given in the main text.

A predictive mapping f is a map from these 18 lotteries into average certainty equivalents. The set of permissible mappings F_M is again defined to satisfy: (1) each certainty equivalent has to be in the range of the lottery outcomes, and (2) if a lottery first-order stochastically dominates another, then its certainty equivalent must be higher. We generate 100 random mappings from a uniform distribution over mappings satisfying these properties.

Below, we compare the distribution of f-discrepancies from Figure 6 with the distribution of f-discrepancies that we find for these three-outcome lotteries.

Figure 6: Left: Binary lotteries; Right: Three-outcome lotteries.

The restrictiveness of CPT on this set of three-outcome lotteries is 0.496, with a standard error of 0.018. Thus CPT is about 1.5 times as restrictive as a model of certainty equivalents for three-outcome lotteries as it is for binary lotteries. Besides imposing FOSD, CPT imposes rank dependence on the domain of three-outcome lotteries. (This restriction does not apply for binary lotteries.)
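The rank-dependent formula above is easy to implement directly. The sketch below evaluates it for one lottery; the default identity v and w are placeholders for the main-text functional forms, and with those defaults the formula reduces to the lottery's expected value.

```python
def cpt_three_outcome(z1, z2, z3, p1, p2, v=lambda z: z, w=lambda p: p):
    """Rank-dependent CPT value of the lottery (z1, z2, z3; p1, p2, p3),
    z1 > z2 > z3 >= 0, following the formula in Appendix B.4.
    Identity v and w (the defaults here) give the expected value."""
    return v(z3) + w(p1 + p2) * (v(z2) - v(z3)) + w(p1) * (v(z1) - v(z2))
```

Substituting the main text's parametric v and w would give the model's predicted certainty-equivalent utility for each of the 18 lotteries listed above.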
This may explain part of the increase in restrictiveness. Even this higher restrictiveness, however, is substantially less than what we find for models of initial play.

B.5 Heterogeneous Risk Preferences

Our analysis in the main text considered representative-agent models. In some cases, the analyst may have auxiliary data on the subjects that can be used to improve predictions. We show now how completeness and restrictiveness can be evaluated in this case.

Specifically, we return to our first application and group subjects into three clusters identified by Bruhin et al. (2010). We fit CPT for each cluster, allowing parameter values to vary across groups. Table 4 reports completeness measures cluster by cluster.

The performance of the naive expected value rule, the best achievable performance, and the performance of CPT all vary substantially across clusters. For example, the behavior of subjects in cluster 1 is roughly consistent with expected value (the error of the naive rule is 39.90), while the behavior of subjects in cluster 2 departs substantially from this benchmark (the error of the naive rule is 99.94).

                        Cluster 1   Cluster 2   Cluster 3
Naive                   39.90       150.10      99.94
                        (4.98)      (7.24)      (7.97)
CPT                     30.74       43.87       69.62
                        (7.25)      (4.72)      (8.50)
Best Achievable Error   29.59       36.30       67.05
                        (7.36)      (3.34)      (8.02)
Completeness            0.98        0.88        0.92
                        (0.02)      (0.03)      (0.03)
N                       674         1144        2641

Table 4: Completeness measures for each of three subject clusters.

The best achievable prediction for these groups of subjects is also very different (ranging from 29.59 to 67.05), as is the error of CPT (ranging from 30.74 to 69.62).

The average completeness, weighted by the proportion of observations in each cluster, is 0.91, which is very close to what we found for the representative-agent model. This may seem surprising at first, since allowing parameters to vary across subjects improves the accuracy of predictions.
But the best mapping from the extended feature space X′ = X × {1, 2, 3} to Y is also more predictive than the best mapping considered previously. Thus what we find is that the completeness of CPT with three clusters, relative to the best three-cluster mapping, is comparable to the completeness of the representative-agent version of CPT, relative to the best representative-agent mapping.

Similarly, when measuring restrictiveness, we extend the set of permissible mappings to the domain X′. Each generated pattern of behavior is thus a triple (f_1, f_2, f_3) of mappings from the original F_M. We ask how well these tuples can be approximated using mappings (g_1, g_2, g_3) from CPT. It is straightforward to see that the restrictiveness of the three-cluster CPT is identical to the restrictiveness of the representative-agent model. Note that this is true for any number of exogenously specified clusters.

C Estimation of Completeness κ*

C.1 Preliminary Definitions

We now introduce some definitions and notation that will be useful in the derivation of the asymptotic distribution of the CV-based completeness estimator.

C.1.1 Finite-Sample Out-of-Sample Error

Let Z^N := (Z_i)_{i=1}^N be a random sample of observations in a given data set, and let Z_{N+1} ∼ P* denote a random variable with the same distribution P* that is independent of Z^N. For a given data set Z^N and a given model F, we define the conditional out-of-sample error (given data set Z^N) as

e_F(Z^N) := E[ l(f̂_{Z^N}, Z_{N+1}) | Z^N ],

where f̂_{Z^N} ∈ F is an estimator, or an algorithm, that selects a mapping f̂_{Z^N} within the model F based on data Z^N. We also define the out-of-sample error, with expectation taken over different possible data sets Z^N, as

e_{F,N} := E[ e_F(Z^N) ].

From the definition of the K-fold cross-validation estimator, it can be easily shown that E[CV(F)] = e_{F, (K−1)N/K}.
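The K-fold construction just described can be sketched generically: each fold is held out once, the mapping is fitted on the remaining K−1 folds (a sample of size (K−1)N/K, which is why the expectation of CV(F) is e_{F,(K−1)N/K}), and the held-out losses are averaged. The `fit` and `loss` arguments below are caller-supplied stand-ins.

```python
import random

def kfold_cv(data, fit, loss, K=5, seed=0):
    """K-fold cross-validation estimate of out-of-sample error (a sketch).
    `fit` maps a training sample to a fitted mapping; `loss` scores one
    held-out observation against that mapping."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    total, n = 0.0, 0
    for k in range(K):
        held_out = set(folds[k])
        f_hat = fit([data[i] for i in idx if i not in held_out])
        for i in folds[k]:
            total += loss(f_hat, data[i])
            n += 1
    return total / n
```

For instance, with `fit` returning the training mean and squared `loss`, the estimator averages held-out squared errors, matching the mean-squared-error case of Section 3.1.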
As a result, the asymptotic distribution of CV(F) − e_{F,(K−1)N/K} has been studied in the statistics and machine learning literature. Our analysis below will be based on the results in Austern and Zhou (2020) on the asymptotic distribution of CV(F) − e_{F,(K−1)N/K}.

C.1.2 Joint Parametrization of F_Θ and F_M

Recall that the model F_Θ is parametrized by θ ∈ Θ, and f_θ denotes a generic function in F_Θ. Motivated by the applications in this paper, we assume that F_M can be smoothly parameterized by a finite-dimensional parameter β ∈ B_M ⊆ R^{d_M} and use the notation f[β] ∈ F_M to denote a generic function in F_M. Since by assumption f* ∈ F_M, we can define a parameter β* to represent it, i.e. f[β*] = f*.

For arbitrary parameters θ and β, write

l_Θ(θ, Z_i) := l(f_θ, Z_i),  l_B(β, Z_i) := l(f[β], Z_i).

We define the estimation mappings in F_Θ and F_M by

θ̂(Z^N) := argmin_{θ ∈ Θ} Σ_{i=1}^N l_Θ(θ, Z_i),  β̂(Z^N) := argmin_{β ∈ B_M} Σ_{i=1}^N l_B(β, Z_i).

Let α := (θ′, β′)′ denote the concatenation of the parameters θ ∈ Θ and β ∈ B_M, let α* := (θ*′, β*′)′ be the parameters associated with the best mappings in F_Θ and F_M, and also define

α̂(Z^N) := (θ̂′(Z^N), β̂′(Z^N))′ = argmin_{θ ∈ Θ, β ∈ B_M} (1/N) Σ_{i=1}^N [ l_Θ(θ, Z_i) + l_B(β, Z_i) ]

to be an estimator for α*. Finally, define

Δl(θ, β; Z_i) := l(f_θ, Z_i) − l(f[β], Z_i) = l_Θ(θ, Z_i) − l_B(β, Z_i).

C.2 Assumptions and Lemmas Based on Austern and Zhou (2020)

Assumption 3 (Conditions for Asymptotics of CV Estimator).

1. l_Θ(θ, z) and l_B(β, z) are twice differentiable and strictly convex in θ and β.
2. E[sup_{θ ∈ Θ} l_Θ(θ, Z_i)] < ∞ and E[sup_{β ∈ B} l_B(β, Z_i)] < ∞.
3.
There exist open neighborhoods O_{θ*} and O_{β*} of θ* and β* in Θ and B such that

(a) E[ sup_{θ ∈ O_{θ*}} ‖∇_θ l_Θ(θ, Z_i)‖² ] < ∞ and E[ sup_{β ∈ O_{β*}} ‖∇_β l_B(β, Z_i)‖² ] < ∞;

(b) E[ sup_{θ ∈ O_{θ*}} ‖∇²_θ l_Θ(θ, Z_i)‖ ] < ∞ and E[ sup_{β ∈ O_{β*}} ‖∇²_β l_B(β, Z_i)‖ ] < ∞;

(c) there exists c > 0 such that λ_min(∇²_θ l_Θ(θ, Z_i)) ≥ c and λ_min(∇²_β l_B(β, Z_i)) ≥ c a.s., uniformly on O_{θ*} and O_{β*}.

Lemma C.1 (Application of Proposition 5 of Austern and Zhou, 2020). Under Assumption 3:

√N [ CV(F_Θ) − CV(F_M) − ( e_{F_Θ,(K−1)N/K} − e_{F_M,(K−1)N/K} ) ] →d N(0, Var(Δl(f_{θ*}, f*; Z_i))).

Proof. Proposition 5 of Austern and Zhou (2020) establishes the asymptotic normality of the cross-validation risk estimator and its asymptotic variance in parametric settings where the loss function used for training is the same as the loss function used for evaluation. Applying Proposition 5 of Austern and Zhou (2020) under Assumption 3 to θ, β, and α = (θ, β), we obtain:

√N ( CV(F_Θ) − e_{F_Θ,(K−1)N/K} ) →d N(0, Var(l(f_{θ*}, Z_i))),
√N ( CV(F_M) − e_{F_M,(K−1)N/K} ) →d N(0, Var(l(f*, Z_i))),
√N ( CV(F_Θ) + CV(F_M) − e_{F_Θ,(K−1)N/K} − e_{F_M,(K−1)N/K} ) →d N(0, Var(l(f_{θ*}, Z_i) + l(f*, Z_i))).

Using the equality Var(X + Y) + Var(X − Y) = 2 Var(X) + 2 Var(Y), we then deduce that

√N [ CV(F_Θ) − CV(F_M) − ( e_{F_Θ,(K−1)N/K} − e_{F_M,(K−1)N/K} ) ] →d N(0, Var(Δl(f_{θ*}, f*; Z_i))).

Lemma C.2 (Application of Proposition 1 of Austern and Zhou, 2020). Under Assumption 3, σ̂² →p Var(Δl(f_{θ*}, f*; Z_i)).

Proof.
Applying Proposition 1 of Austern and Zhou (2020) under Assumption 3 to $\theta$, $\beta$, and $\alpha = (\theta, \beta)$:
\[
\hat{\sigma}^2_{CV(\mathcal{F}_\Theta)} := \frac{1}{K} \sum_{k=1}^K \frac{1}{J_N} \sum_{k(i)=k} \left[ l\left(f_{\hat{\theta}_{-k}}, Z_i\right) - \frac{1}{J_N} \sum_{k(j)=k} l\left(f_{\hat{\theta}_{-k}}, Z_j\right) \right]^2 \xrightarrow{p} \mathrm{Var}\left(l(f_{\theta^*}, Z_i)\right),
\]
\[
\hat{\sigma}^2_{CV(\mathcal{F}_M)} := \frac{1}{K} \sum_{k=1}^K \frac{1}{J_N} \sum_{k(i)=k} \left[ l\left(f[\hat{\beta}_{-k}], Z_i\right) - \frac{1}{J_N} \sum_{k(j)=k} l\left(f[\hat{\beta}_{-k}], Z_j\right) \right]^2 \xrightarrow{p} \mathrm{Var}\left(l(f^*, Z_i)\right),
\]
and
\[
\hat{\sigma}^2_{CV(\mathcal{F}_\Theta) + CV(\mathcal{F}_M)} := \frac{1}{K} \sum_{k=1}^K \frac{1}{J_N} \sum_{k(i)=k} \left[ l\left(f_{\hat{\theta}_{-k}}, Z_i\right) + l\left(f[\hat{\beta}_{-k}], Z_i\right) - \frac{1}{J_N} \sum_{k(j)=k} \left( l\left(f_{\hat{\theta}_{-k}}, Z_j\right) + l\left(f[\hat{\beta}_{-k}], Z_j\right) \right) \right]^2 \xrightarrow{p} \mathrm{Var}\left(l(f_{\theta^*}, Z_i) + l(f^*, Z_i)\right).
\]
Hence,
\[
\hat{\sigma}^2 = 2\hat{\sigma}^2_{CV(\mathcal{F}_\Theta)} + 2\hat{\sigma}^2_{CV(\mathcal{F}_M)} - \hat{\sigma}^2_{CV(\mathcal{F}_\Theta) + CV(\mathcal{F}_M)} \xrightarrow{p} 2\,\mathrm{Var}\left(l(f_{\theta^*}, Z_i)\right) + 2\,\mathrm{Var}\left(l(f^*, Z_i)\right) - \mathrm{Var}\left(l(f_{\theta^*}, Z_i) + l(f^*, Z_i)\right) = \mathrm{Var}\left(\Delta l(\theta^*, \beta^*; Z_i)\right).
\]

C.3 Proof of Asymptotic Normality of $\hat{\kappa}^*$

Lemma C.1 characterizes the limit distribution of
\[
\sqrt{N}\left[ CV(\mathcal{F}_\Theta) - CV(\mathcal{F}_M) - \left( e_{\mathcal{F}_\Theta, \frac{K-1}{K}N} - e_{\mathcal{F}_M, \frac{K-1}{K}N} \right) \right],
\]
which we now show is also the limit distribution of
\[
\sqrt{N}\left[ CV(\mathcal{F}_\Theta) - CV(\mathcal{F}_M) - \left( e_{\mathcal{F}_\Theta} - e_{\mathcal{F}_M} \right) \right].
\]
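The fold-wise variance estimators take a simple form in code. Below is a minimal sketch (a toy model, not the paper's implementation): under squared loss with $\hat{\theta}_{-k}$ the sample mean of the training folds, $\hat{\sigma}^2_{CV(\mathcal{F}_\Theta)}$ averages the within-fold empirical variances of the held-out losses.

```python
import numpy as np

# Sketch of sigma_hat^2_{CV(F_Theta)} for a toy model: squared loss, with
# theta_hat_{-k} the sample mean of the training folds. Data: Z_i ~ N(1, 1),
# so l(f_{theta*}, Z_i) = (Z_i - 1)^2 ~ chi^2_1, whose variance is 2.
rng = np.random.default_rng(2)
z = rng.normal(loc=1.0, size=3000)
K = 5
folds = np.array_split(rng.permutation(len(z)), K)

fold_vars = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.setdiff1d(np.arange(len(z)), test_idx)
    theta_k = z[train_idx].mean()              # theta_hat_{-k}
    losses = (z[test_idx] - theta_k) ** 2      # l(f_{theta_hat_{-k}}, Z_i)
    fold_vars.append(np.var(losses))           # within-fold empirical variance
sigma2_hat = np.mean(fold_vars)                # near Var(chi^2_1) = 2
print(sigma2_hat)
```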
To see this, notice that
\[
\begin{aligned}
e_{\mathcal{F}_\Theta, \frac{K-1}{K}N} - e_{\mathcal{F}_\Theta}
&= \mathbb{E}\left[ l_\Theta\left(\hat{\theta}_{-k(i)}, Z_i\right) - l_\Theta(\theta^*, Z_i) \right] \\
&= \mathbb{E}\left[ \nabla l_\Theta(\theta^*, Z_i)' \left(\hat{\theta}_{-k(i)} - \theta^*\right) + \frac{1}{2}\left(\hat{\theta}_{-k(i)} - \theta^*\right)' \nabla^2 l_\Theta\left(\tilde{\theta}, Z_i\right) \left(\hat{\theta}_{-k(i)} - \theta^*\right) \right] \\
&= 0 + \mathbb{E}\left[ \frac{1}{2}\left(\hat{\theta}_{-k(i)} - \theta^*\right)' \nabla^2 l_\Theta\left(\tilde{\theta}, Z_i\right) \left(\hat{\theta}_{-k(i)} - \theta^*\right) \right] \\
&= \frac{1}{N - J_N} \mathbb{E}\left[ \sqrt{N - J_N}\left(\hat{\theta}_{-k(i)} - \theta^*\right)' \cdot \frac{1}{2}\nabla^2 l_\Theta\left(\tilde{\theta}, Z_i\right) \cdot \sqrt{N - J_N}\left(\hat{\theta}_{-k(i)} - \theta^*\right) \right] \\
&= \frac{c}{N - J_N} + o\left(\frac{1}{N - J_N}\right) = \frac{cK}{(K-1)N} + o\left(\frac{1}{N}\right),
\end{aligned}
\]
where the first-order term vanishes because $\hat{\theta}_{-k(i)}$ is independent of $Z_i$ and $\mathbb{E}[\nabla l_\Theta(\theta^*, Z_i)] = 0$ by the first-order condition for $\theta^*$, and the last equality uses $J_N = N/K$. Hence
\[
\sqrt{N}\left( e_{\mathcal{F}_\Theta, \frac{K-1}{K}N} - e_{\mathcal{F}_\Theta} \right) = o_p(1).
\]
Similarly, $\sqrt{N}\left( e_{\mathcal{F}_M, \frac{K-1}{K}N} - e_{\mathcal{F}_M} \right) = o_p(1)$. Hence:
\[
\sqrt{N}\left[ CV(\mathcal{F}_\Theta) - CV(\mathcal{F}_M) - \left( e_{\mathcal{F}_\Theta} - e_{\mathcal{F}_M} \right) \right] \xrightarrow{d} \mathcal{N}\left(0, \mathrm{Var}\left(\Delta l(\theta^*, \beta^*; Z_i)\right)\right).
\]
Then, by Lemma C.2, Assumption 2, and the continuous mapping theorem, we have
\[
\frac{\sqrt{N}\left(\hat{\kappa}^* - \kappa^*\right)}{\hat{\sigma}_{\hat{\kappa}^*}} \xrightarrow{d} \mathcal{N}(0, 1).
\]

D Supplementary Material to Section 7

D.1 Alternative Estimator of $f^*$-Discrepancy

We now discuss an alternative estimator for the $f^*$-discrepancy
\[
\delta_{f^*} = \frac{d(f_{\theta^*}, f^*)}{d(f_{\text{naive}}, f^*)}
\]
when the decomposability condition (10) does not hold. We again work with the parameterization of $\mathcal{F}_M$ via $\beta \in \mathcal{B}$. Suppose that we have access to an estimator $\hat{\beta}$ of $\beta^*$ that is consistent and asymptotically normal:
\[
\sqrt{N}\left( \hat{\beta} - \beta^* \right) \xrightarrow{d} \mathcal{N}(0, \Sigma).
\]
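The $c/(N - J_N)$ rate in the Taylor expansion can be seen in a small Monte Carlo. This is a sketch with a toy model (not the paper's code): with squared loss and $\hat{\theta}$ the sample mean of $m$ training draws, the excess risk $e_{\mathcal{F},m} - e_{\mathcal{F}}$ equals $\mathrm{Var}(Z)/m$, so scaling by $m$ recovers the constant $c$.

```python
import numpy as np

# Toy illustration of e_{F,m} - e_F = c/m + o(1/m): squared loss
# l(theta, z) = (z - theta)^2 with Z ~ N(0, 1), theta* = 0, and theta_hat
# the sample mean of m training draws. Then e(theta_hat) - e(theta*) =
# (theta_hat)^2, whose expectation is Var(Z)/m, so m times it should
# hover near c = Var(Z) = 1 at every m.
rng = np.random.default_rng(3)
for m in [100, 400, 1600]:
    theta_hats = rng.normal(0.0, 1.0, size=(5000, m)).mean(axis=1)
    excess = np.mean(theta_hats ** 2)   # Monte Carlo estimate of e_{F,m} - e_F
    print(m, m * excess)                # m * excess stays close to 1
```

Since the excess risk is $O(1/m)$ with $m = (K-1)N/K$, multiplying by $\sqrt{N}$ leaves a term of order $N^{-1/2}$, which is why the centering constants can be swapped in the display above.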
Given that $\theta^* = \arg\min_{\theta \in \Theta} d\left(f_\theta, f[\beta^*]\right)$, we can construct an estimator of $\theta^*$ as
\[
\hat{\theta} := \hat{\theta}\left(\hat{\beta}\right) := \arg\min_{\theta \in \Theta} d\left(f_\theta, f[\hat{\beta}]\right),
\]
with which we can obtain the following estimator of $\delta_{f^*}$:
\[
\hat{\delta}_{f^*} := \frac{d\left(f_{\hat{\theta}(\hat{\beta})}, f[\hat{\beta}]\right)}{d\left(f_{\text{naive}}, f[\hat{\beta}]\right)} = \frac{\min_{\theta \in \Theta} d\left(f_\theta, f[\hat{\beta}]\right)}{d\left(f_{\text{naive}}, f[\hat{\beta}]\right)}.
\]
We impose the following joint assumption on the dissimilarity function $d$ and the parameterization of $\mathcal{F}_\Theta$ and $\mathcal{F}_M$.

Assumption 4. Define $d(\theta, \beta) := d\left(f_\theta, f[\beta]\right)$. Suppose that:

(a) $d$ is jointly differentiable with respect to $(\theta, \beta)$ in a neighborhood of $(\theta^*, \beta^*)$.

(b) $\psi^* := \nabla_\beta d\left(\hat{\theta}(\beta), \beta\right)\Big|_{\beta = \beta^*} \neq 0$.

The requirements in Assumption 4 are very weak. Part (a) is a standard differentiability condition, which should be satisfied in most applications. For (b), notice that by the Envelope Theorem,
\[
\psi^* = \nabla_\beta d\left(\hat{\theta}(\beta^*), \beta^*\right) = \frac{\partial}{\partial \beta} d(\theta^*, \beta^*).
\]
Hence, $\psi^* \neq 0$ essentially requires that the dissimilarity $d(f_{\theta^*}, f)$ between $f_{\theta^*}$ and $f$ is not locally constant as $f$ varies in a neighborhood of $f^*$.

By the Delta Method,
\[
\sqrt{N}\left( \min_{\theta \in \Theta} d\left(f_\theta, f[\hat{\beta}]\right) - \min_{\theta \in \Theta} d\left(f_\theta, f[\beta^*]\right) \right) = \sqrt{N}\left( d\left(\hat{\theta}\left(\hat{\beta}\right), \hat{\beta}\right) - d\left(\hat{\theta}(\beta^*), \beta^*\right) \right) \xrightarrow{d} \mathcal{N}\left(0, \psi^{*\prime} \Sigma \psi^*\right),
\]
implying that
\[
\sqrt{N}\left( \hat{\delta}_{f^*} - \delta_{f^*} \right) \xrightarrow{d} \mathcal{N}\left(0, \frac{\psi^{*\prime} \Sigma \psi^*}{d(f_{\text{naive}}, f^*)^2}\right).
\]
The standard error can be estimated via bootstrapping.

D.2 Example: Lack of Decomposability

Consider a setting where $X$ is degenerate, i.e., $\mathcal{X}$ is a singleton, so that the joint distribution $P$ is completely characterized by the distribution of $Y$.
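The plug-in construction of $\hat{\delta}_{f^*}$ can be sketched as follows. All ingredients here are hypothetical stand-ins for illustration: the model classes, the $L^2$-on-a-grid dissimilarity, the naive constant prediction, and the value of $\hat{\beta}$ (which in practice would come from an actual estimate); the arg-min is approximated by a grid search.

```python
import numpy as np

# Minimal sketch of delta_hat_{f*} (hypothetical classes, not the paper's
# code): F_Theta = {f_theta(x) = theta * x}, F_M = {f[beta](x) = beta_0 +
# beta_1 * x}, d = L2 distance on a grid, f_naive the constant 0.5.
grid = np.linspace(0, 1, 201)

def d(f, g):                       # dissimilarity between two mappings
    return np.sqrt(np.mean((f(grid) - g(grid)) ** 2))

beta_hat = np.array([0.3, 0.9])    # stand-in for an estimate of beta*
f_beta_hat = lambda x: beta_hat[0] + beta_hat[1] * x
f_naive = lambda x: np.full_like(x, 0.5)

# theta_hat(beta_hat) = argmin_theta d(f_theta, f[beta_hat]), via grid search
thetas = np.linspace(0, 3, 3001)
dists = np.array([d(lambda x, t=t: t * x, f_beta_hat) for t in thetas])
theta_hat = thetas[int(np.argmin(dists))]

# delta_hat = min_theta d(f_theta, f[beta_hat]) / d(f_naive, f[beta_hat])
delta_hat = dists.min() / d(f_naive, f_beta_hat)
print(theta_hat, delta_hat)
```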
Furthermore, let $\mathcal{Y} := [0, 1]$ and $f^* := \mathrm{med}(Y) \in S := \mathcal{Y} = [0, 1]$, so that a mapping $f : \mathcal{X} \to S$ is just a number in $[0, 1]$. Take the loss function $l(f, y) := |y - f|$, so that the error function is the mean absolute deviation $e_{P^*}(f) := \mathbb{E}_{P^*}\left[ |Y - f| \right]$. The true median $f^*$ minimizes the error, i.e. $f^* \in \arg\min_{f \in [0,1]} e_{P^*}(f)$. However, it is not true that $|f - f^*| = e_{P^*}(f) - e_{P^*}(f^*)$ for every $f \in [0, 1]$. Suppose, for example, that $Y \sim U[0,1]$ under $P^*$. Then $f^* = 0.5$ and $e_{P^*}(f^*) = 0.25$, while for $f = 0.4$ we have $e_{P^*}(f) = 0.26$, so
\[
|f - f^*| = 0.1 \neq 0.01 = e_{P^*}(f) - e_{P^*}(f^*).
\]
Moreover, there is no function $d : [0,1]^2 \to [0,1]$ such that decomposability (10) holds, which would require that $d(f, f_P) = e_P(f) - e_P(f_P)$ for every distribution $P$ of $Y$ supported on $[0,1]$. When $Y \sim U[0,1]$ under $P$, we have
\[
e_P(f) - e_P(f_P) = (f - 0.5)^2 = (f - f_P)^2, \quad \forall f \in [0,1].
\]
However, when the probability density function of $Y$ under $P$ is given by $2y$ for $y \in [0,1]$, we have $f_P = \sqrt{2}/2$ and $e_P(f_P) = (2 - \sqrt{2})/3$, so that
\[
e_P(f) - e_P(f_P) = \frac{1}{3}\left( 2f^3 - 3f + \sqrt{2} \right) \neq (f - f_P)^2.
\]
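The numbers in this counterexample can be verified numerically; a short Monte Carlo sketch:

```python
import numpy as np

# Monte Carlo check of the counterexample: mean absolute deviation under
# Y ~ U[0,1] and under the density 2y on [0,1].
rng = np.random.default_rng(4)
u = rng.uniform(0, 1, 1_000_000)

# Uniform case: f* = 0.5 with e(0.5) = 0.25, and e(0.4) = 0.26.
e_unif = lambda f: np.mean(np.abs(u - f))
print(e_unif(0.5), e_unif(0.4))            # approx 0.25 and 0.26

# Density-2y case: the CDF is y^2, so Y = sqrt(U), and the median solves
# y^2 = 1/2, i.e. f_P = sqrt(2)/2.
y = np.sqrt(u)
f_P = np.sqrt(2) / 2
print(np.mean(np.abs(y - f_P)), (2 - np.sqrt(2)) / 3)   # both approx 0.195

# e_P(f) - e_P(f_P) matches (2 f^3 - 3 f + sqrt(2)) / 3, not (f - f_P)^2:
f = 0.4
gap = np.mean(np.abs(y - f)) - np.mean(np.abs(y - f_P))
print(gap, (2 * f**3 - 3 * f + np.sqrt(2)) / 3, (f - f_P) ** 2)
```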