Microeconometrics with Partial Identification
Francesca Molinari∗
Cornell University, Department of Economics
[email protected]

March 12, 2020
Abstract
This chapter reviews the microeconometrics literature on partial identification, focusing on the developments of the last thirty years. The topics presented illustrate that the available data combined with credible maintained assumptions may yield much information about a parameter of interest, even if they do not reveal it exactly. Special attention is devoted to discussing the challenges associated with, and some of the solutions put forward to, (1) obtain a tractable characterization of the values for the parameters of interest which are observationally equivalent, given the available data and maintained assumptions; (2) estimate this set of values; (3) conduct tests of hypotheses and make confidence statements. The chapter reviews advances in partial identification analysis both as applied to learning (functionals of) probability distributions that are well-defined in the absence of models, as well as to learning parameters that are well-defined only in the context of particular models. A simple organizing principle is highlighted: the source of the identification problem can often be traced to a collection of random variables that are consistent with the available data and maintained assumptions. This collection may be part of the observed data or be a model implication. In either case, it can be formalized as a random set. Random set theory is then used as a mathematical framework to unify a number of special results and produce a general methodology to carry out partial identification analysis.

∗ This manuscript was prepared for the Handbook of Econometrics, Volume 7A, © North Holland, 2019. I thank Don Andrews, Isaiah Andrews, Levon Barseghyan, Federico Bugni, Ivan Canay, Joachim Freyberger, Hiroaki Kaido, Toru Kitagawa, Chuck Manski, Rosa Matzkin, Ilya Molchanov, Áureo de Paula, Jack Porter, Seth Richards-Shubik, Adam Rosen, Shuyang Sheng, Jörg Stoye, Elie Tamer, Matthew Thirkettle, and participants to the 2017 Handbook of Econometrics Conference, for helpful comments, and the National Science Foundation for financial support through grants SES-1824375 and SES-1824448. I am grateful to Louis Liu and Yibo Sun for research assistance supported by the Robert S. Hatfield Fund for Economic Education at Cornell University. Part of this research was carried out during my sabbatical leave at the Department of Economics at Duke University, whose hospitality I gratefully acknowledge.

1 Introduction
Knowing the population distribution that data are drawn from, what can one learn about a parameter of interest? It has long been understood that assumptions about the data generating process (DGP) play a crucial role in answering this identification question at the core of all empirical research. Inevitably, assumptions brought to bear enjoy a varying degree of credibility. Some are rooted in economic theory (e.g., optimizing behavior) or in information available to the researcher on the DGP (e.g., randomization mechanisms). These assumptions can be argued to be highly credible. Others are driven by concerns for tractability and the desire to answer the identification question with a certain level of precision (e.g., functional form and distributional assumptions). These are arguably less credible.

Early on, Koopmans and Reiersol (1950) highlighted the importance of imposing restrictions based on prior knowledge of the phenomenon under analysis and some criteria of simplicity, but not for the purpose of identifiability of a parameter that the researcher happens to be interested in, stating (p. 169): "One might regard problems of identifiability as a necessary part of the specification problem. We would consider such a classification acceptable, provided the temptation to specify models in such a way as to produce identifiability of relevant characteristics is resisted."

Much work, spanning multiple fields, has been devoted to putting forward strategies to carry out empirical research while relaxing distributional, functional form, or behavioral assumptions. One example, embodied in the research program on semiparametric and nonparametric methods, is to characterize sufficient sets of assumptions, that exclude many suspect ones (sometimes as many as possible), to guarantee that point identification of specific economically interesting parameters attains.
This literature is reviewed in, e.g., Matzkin (2007, 2013), and is not discussed here. Another example, embodied in the research program on Bayesian model uncertainty, is to specify multiple models (i.e., multiple sets of assumptions), put a prior on the parameters of each model and on each model, embed the various separate models within one large hierarchical mixture model, and obtain model posterior probabilities which can be used for a variety of inferences and decisions. This literature is reviewed in, e.g., Wasserman (2000) and Clyde and George (2004), and is not discussed here.

The approach considered here fixes a set of assumptions and a parameter of interest a priori, in the spirit of Koopmans and Reiersol (1950), and asks what can be learned about that parameter given the available data, recognizing that even partial information can be illuminating for empirical research, while enjoying wider credibility thanks to the weaker assumptions imposed. The bounding methods at the core of this approach appeared in the literature nearly a century ago. Arguably, the first exemplar that leverages economic reasoning is given by the work of Marschak and Andrews (1944). They provided bounds on Cobb-Douglas production functions in models of supply and demand, building on optimization principles and restrictions from microeconomic theory. Leamer (1981) revisited their analysis to obtain bounds on the elasticities of demand and supply in a linear simultaneous equations system with uncorrelated errors. The first exemplars that do not rely on specific economic models appear in Gini (1921), Frisch (1934), and Reiersol (1941), who bounded the coefficient of a simple linear regression in the presence of measurement error.
These results were extended to the general linear regression model with errors in all variables by Klepper and Leamer (1984) and Leamer (1987).

This chapter surveys some of the methods proposed over the last thirty years in the microeconometrics literature to further this approach. These methods belong to the systematic program on partial identification analysis started with Manski (1989, 1990, 1995, 2003, 2007a, 2013b) and developed by several authors since the early 1990s. Within this program, the focus shifts from points to sets: the researcher aims to learn what is the set of values for the parameters of interest that can generate the same distribution of observables as the one in the data, for some DGP consistent with the maintained assumptions. In other words, the focus is on the set of observationally equivalent values, which henceforth I refer to as the parameters' sharp identification region. In the partial identification paradigm, empirical analysis begins with characterizing this set using the data alone. This is a nonparametric approach that dispenses with all assumptions, except basic restrictions on the sampling process such that the distribution of the observable variables can be learned as data accumulate. In subsequent steps, one incorporates additional assumptions into the analysis, reporting how each assumption (or set of assumptions) affects what one can learn about the parameters of interest, i.e., how it modifies and possibly shrinks the sharp identification region. Point identification may result from the process of increasingly strengthening the maintained assumptions, but it is not the goal in itself. Rather, the objective is to make transparent the relative role played by the data and the assumptions in shaping the inference that one draws.

There are several strands of independent, but thematically related literatures that are not discussed in this chapter. As a consequence, many relevant contributions are left out of the presentation and the references.
One example is the literature in finance. Hansen and Jagannathan (1991) developed nonparametric bounds for the admissible set for means and standard deviations of intertemporal marginal rates of substitution (IMRS) of consumers. The bounds were developed exploiting the condition, satisfied in many finance models, that the equilibrium price of any traded security equals the expectation (conditioned on current information) of the product of the security's future payoff and the IMRS of any consumer. Luttmer (1996)

(Footnote: Hansen and Jagannathan (1991) deduce a duality relation with the mean-variance theory of Markowitz (1952) and Fama (1996), but the relation does not apply to the sharp bounds they derive. In the Arbitrage Pricing Model (Ross, 1976), bounds on extensions of existing pricing functions, consistent with the absence of arbitrage opportunities, were considered by Harrison and Kreps (1979) and Kreps (1981).)
To carry out econometric analysis with partial identification, one needs: (1) computationally feasible characterizations of the parameters' sharp identification region; (2) methods to estimate this region; and (3) methods to test hypotheses and construct confidence sets. The goal of this chapter is to provide insights into the challenges posed by each of these desiderata, and into some of their solutions. In order to discuss the partial identification literature in microeconometrics with some level of detail while keeping this chapter to a manageable length, I focus on a selection of papers and not on a complete survey of the literature. As a consequence, many relevant contributions are left out of the presentation and the references. I also do not discuss the important but separate topic of statistical decisions in the presence of partial identification, for which I refer to the textbook treatments in Manski (2005, 2007a) and to the review by Hirano and Porter (2019, Chapter XXX in this Volume).

The presumption in identification analysis that the distribution from which the data are drawn is known allows one to keep separate the identification question from the distinct question of statistical inference from a finite sample. I use the same separation in this chapter. I assume solid knowledge of the topics covered in first year Economics PhD courses in econometrics and microeconomic theory.

I begin in Section 2 with the analysis of what can be learned about features of probability distributions that are well defined in the absence of an economic model, such as moments, quantiles, cumulative distribution functions, etc., when one faces measurement problems. Specifically, I focus on cases where the data is incomplete, either due to selectively observed data or to interval measurements. I refer to Manski (1995, 2003, 2007a) for textbook treatments of many other cases.
I lay out formally the maintained assumptions for several examples, and then discuss in detail what is the source of the identification problem. I conclude with providing tractable characterizations of what can be learned about the parameters of interest, with formal proofs. I show that even in simple problems, great care may be needed to obtain the sharp identification region. It is often easier to characterize an outer region, i.e., a collection of values for the parameter of interest that contains the sharp one but may contain also additional values. Outer regions are useful because of their simplicity and because in certain applications they may suffice to answer questions of great interest, e.g., whether a policy intervention has a nonnegative effect. However, compared to the sharp identification region they may afford the researcher less useful predictions, and a lower ability to test for misspecification, because they do not harness all the information in the observed data and maintained assumptions.

In Section 3 I use the same approach to study what can be learned about features of parameters of structural econometric models when the model is incomplete (Tamer, 2003; Haile and Tamer, 2003; Ciliberto and Tamer, 2009). Specifically, I discuss single agent discrete choice models under a variety of challenging situations (interval measured as well as endogenous explanatory variables; unobserved as well as counterfactual choice sets); finite discrete games with multiple equilibria; auction models under weak assumptions on bidding behavior; and network formation models. Again I formally derive sharp identification regions for several examples.

I conclude each of these sections with a brief discussion of further theoretical advances and empirical applications that is meant to give a sense of the breadth of the approach, but not to be exhaustive.
I refer to the recent survey by Ho and Rosen (2017) for a thorough discussion of empirical applications of partial identification methods.

In Section 4 I discuss finite sample inference. I limit myself to highlighting the challenges that one faces for consistent estimation when the identified object is a set, and several coverage notions and requirements that have been proposed over the last 20 years. I refer to the recent survey by Canay and Shaikh (2017) for a thorough discussion of methods to test hypotheses and build confidence sets in moment inequality models.

In Section 5 I discuss the distinction between refutable and non-refutable assumptions, and how model misspecification may be detectable in the presence of the former, even within the partial identification paradigm. I then highlight certain challenges that model misspecification presents for the interpretation of sharp identification (as well as outer) regions, and for the construction of confidence sets.

In Section 6 I highlight that while most of the sharp identification regions characterized in Section 2 can be easily computed, many of the ones in Section 3 are more challenging. This is because the latter are obtained as level sets of criterion functions in moderately dimensional spaces, and tracing out these level sets or their boundaries is a non-trivial computational problem. In Section 7 I conclude providing some considerations on what I view as open questions for future research.

I refer to Tamer (2010) for an earlier review of this literature, and to Lewbel (2018) for a careful presentation of the many notions of identification that are used across the econometrics literature, including an important historical account of how these notions developed over time.
Throughout Sections 2 and 3, a simple organizing principle for much of partial identification analysis emerges. The cause of the identification problems discussed can be traced back to a collection of random variables that are consistent with the available data and maintained assumptions. For the problems studied in Section 2, this set is often a simple function of the observed variables. The incompleteness of the data stems from the fact that instead of observing the singleton variables of interest, one observes set-valued variables to which these belong, but one has no information on their exact value within the sets. For the problems studied in Section 3, the collection of random variables consistent with the maintained assumptions comprises what the model predicts for the endogenous variable(s). The incompleteness of the model stems from the fact that instead of making a singleton prediction for the variable(s) of interest, the model makes multiple predictions but does not specify how one is chosen.

The central role of set-valued objects, both stochastic and nonstochastic, in partial identification renders random set theory a natural toolkit to aid the analysis. This theory originates in the seminal contributions of Choquet (1953/54), Aumann (1965), and Debreu (1967), with the first self-contained treatment of the theory given by Matheron (1975). I refer to Molchanov (2017) for a textbook presentation, and to Molchanov and Molinari (2014, 2018) for a treatment focusing on its applications in econometrics.

Beresteanu and Molinari (2008) introduce the use of random set theory in econometrics to carry out identification analysis and statistical inference with incomplete data. Beresteanu, Molchanov, and Molinari (2011, 2012) propose it to characterize sharp identification regions both with incomplete data and with incomplete models.
Galichon and Henry (2011) propose the use of optimal transportation methods that in some applications deliver the same characterizations as the random set methods. I do not discuss optimal transportation methods in this chapter, but refer to Galichon (2016) for a thorough treatment.

Over the last ten years, random set methods have been used to unify a number of specific results in partial identification, and to produce a general methodology for identification analysis that dispenses completely with case-by-case distinctions. In particular, as I show throughout the chapter, the methods allow for simple and tractable characterizations of sharp identification regions. The collection of these results establishes that indeed this is a useful tool to carry out econometrics with partial identification, as exemplified by its prominent role both in this chapter and in Chapter XXX in this Volume by Chesher and Rosen (2019), which focuses on general classes of instrumental variable models. The random sets approach complements the more traditional one, based on mathematical tools for (single valued) random vectors, that proved extremely productive since the beginning of the research program in partial identification.

This chapter shows that to fruitfully apply random set theory for identification and inference, the econometrician needs to carry out three fundamental steps. First, she needs to define the random closed set that is relevant for the problem under consideration using all information given by the available data and maintained assumptions. This is a delicate task, but one that is typically carried out in identification analysis regardless of whether random set theory is applied. Indeed, throughout the chapter I highlight how relevant random closed sets were characterized in partial identification analysis since the early 1990s, albeit the connection to the theory of random sets was not made. As a second step, the econometrician needs to determine how the observable random variables relate to the random closed set. Often, one of two cases occurs: either the observable variables determine a random set to which the unobservable variable of interest belongs with probability one, as in incomplete data scenarios; or the (expectation of the) (un)observable variable belongs to (the expectation of) a random set determined by the model, as in incomplete model scenarios. Finally, the econometrician needs to determine which tool from random set theory should be utilized.

(Footnote: The first idea of a general random set in the form of a region that depends on chance appears in Kolmogorov (1950), originally published in 1933. For another early example where confidence regions are explicitly described as random sets, see Haavelmo (1944, p. 67). The role of random sets in this chapter is different.)

Table 1.1: Notation Used

(Ω, F, P)    Nonatomic probability space
R^d, ‖·‖    Euclidean space equipped with the Euclidean norm
F, G, K    Collection of closed, open, and compact subsets of R^d (respectively)
S^{d-1} = {x ∈ R^d : ‖x‖ = 1}    Unit sphere in R^d
B^d = {x ∈ R^d : ‖x‖ ≤ 1}    Unit ball in R^d
conv(A), cl(A), |B|    Convex hull and closure of a set A ⊂ R^d (respectively), and cardinality of a finite set B ⊂ R^d
x, y, z, ...    Random vectors
x, y, z, ...    Realizations of random vectors or deterministic vectors
X, Y, Z, ...    Random sets
X, Y, Z, ...    Realizations of random sets or deterministic sets
ϵ, ε, ν, ζ    Unobserved random variables (heterogeneity)
Θ, θ, ϑ    Parameter space, data generating value for the parameter vector, and a generic element of Θ
R    Joint distribution of all variables (observable and unobservable)
P    Joint distribution of the observable variables
Q    Joint distribution whose features one wants to learn
M    A joint distribution of observed variables implied by the model
q_P^τ(α)    Quantile function at level α ∈ (0, 1) for a random variable distributed τ ∈ {R, P, Q}
E_τ    Expectation operator associated with distribution τ ∈ {R, P, Q}
T_X(K) = P{X ∩ K ≠ ∅}, K ∈ K    Capacity functional of random set X
C_X(F) = P{X ⊂ F}, F ∈ F    Containment functional of random set X
→p, →a.s., ⇒    Convergence in probability, convergence almost surely, and weak convergence (respectively)
x =d y    x and y have the same distribution
x ⊥⊥ y    Statistical independence between random variables x and y
x⊤y    Inner product between vectors x and y, x, y ∈ R^d
U, u    Family of utility functions and one of its elements
q_P    Criterion function that aggregates violations of the population moment inequalities
q_n    Criterion function that aggregates violations of the sample moment inequalities
H_P[·]    Sharp identification region of the functional in square brackets (a function of P)
O_P[·]    An outer region of the functional in square brackets (a function of P)
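To fix ideas on the capacity functional T_X(K) and containment functional C_X(F) defined in Table 1.1, the following Monte Carlo sketch (a hypothetical illustration, not taken from the chapter) approximates both for a simple random interval; note that at a common test set the containment functional can never exceed the capacity functional, since {X ⊂ F} implies {X ∩ F ≠ ∅} for nonempty X.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
a = rng.random(n)            # random interval X = [a, a + 1], a ~ Uniform[0, 1]

# Evaluate both functionals at the same test set K = F = [0.5, 2.0]:
lo, hi = 0.5, 2.0
hits = (a <= hi) & (a + 1 >= lo)          # event {X ∩ K ≠ ∅}: here always true
contained = (a >= lo) & (a + 1 <= hi)     # event {X ⊂ F}: requires 0.5 <= a <= 1
T = hits.mean()       # capacity functional T_X(K): exactly 1 in this design
C = contained.mean()  # containment functional C_X(F): approx. P(a >= 0.5) = 0.5
```

The gap between T and C is exactly the probability that X straddles the boundary of the test set, which is the kind of event that drives the incomplete-data analysis in Section 2.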
To date, new applications of random set theory to econometrics have fruitfully exploited (Aumann) expectations and their support functions, (Choquet) capacity functionals, and laws of large numbers and central limit theorems for random sets. Appendix A reports basic definitions from random set theory of these concepts, as well as some useful theorems. The chapter explains in detail through applications to important identification problems how these steps can be carried out.

1.4 Notation

This chapter employs consistent notation that is summarized in Table 1.1. Some important conventions are as follows: y denotes outcome variables, (x, w) denote explanatory variables, and z denotes instrumental variables (i.e., variables that satisfy some form of independence with the outcome or with the unobservable variables, possibly conditional on x, w).

I denote by P the joint distribution of all observable variables. Identification analysis is carried out using the information contained in this distribution, and finite sample inference is carried out under the presumption that one draws a random sample of size n from P. I denote by Q the joint distribution whose features the researcher wants to learn. If Q were identified given the observed data (e.g., if it were a marginal of P), point identification of the parameter or functional of interest would attain. I denote by R the joint distribution of all variables, observable and unobservable ones; both P and Q can be obtained from it. In the context of structural models, I denote by M a distribution for the observable variables that is consistent with the model. I note that model incompleteness typically implies that M is not unique. I let H_P[·] denote the sharp identification region of the functional in square brackets, and O_P[·] an outer region. In both cases, the regions are indexed by P, because they depend on the distribution of the observed data.
2 Partial Identification of Probability Distributions

The literature reviewed in this chapter starts with the analysis of what can be learned about functionals of probability distributions that are well-defined in the absence of a model. The approach is nonparametric, and it is typically constructive, in the sense that it leads to "plug-in" formulae for the bounds on the functionals of interest.

2.1 Selectively Observed Data
As in Manski (1989), suppose that a researcher is interested in learning the probability that an individual who is homeless at a given date has a home six months later. Here the population of interest is the people who are homeless at the initial date, and the outcome of interest y is an indicator of whether the individual has a home six months later (so that y = 1) or remains homeless (so that y = 0). A random sample of homeless individuals is interviewed at the initial date, so that individual background attributes x are observed, but six months later only a subset of the individuals originally sampled can be located. In other words, attrition from the sample creates a selection problem whereby y is observed only for a subset of the population. Let d be an indicator of whether the individual can be located (hence d = 1) or not (hence d = 0). The question is what can the researcher learn about E_Q(y|x = x), with Q the distribution of (y, x)? Manski (1989) showed that E_Q(y|x = x) is not point identified in the absence of additional assumptions, but informative nonparametric bounds on this quantity can be obtained. In this section I review his approach, and discuss several important extensions of his original idea.

Throughout the chapter, I formally state the structure of the problem under study as an "Identification Problem", and then provide a solution, either in the form of a sharp identification region, or of an outer region. To set the stage, and at the cost of some repetition, I do the same here, slightly generalizing the question stated in the previous paragraph.

Identification Problem 2.1: Let y ∈ Y ⊂ R and x ∈ X ⊂ R^d be, respectively, an outcome variable and a vector of covariates with support Y and X respectively, with Y a compact set. Let d ∈ {0, 1}. Suppose that the researcher observes a random sample of realizations of (x, d) and, in addition, observes the realization of y when d = 1. Hence, the observed data is (yd, d, x) ∼ P.
Let g : Y → R be a measurable function that attains its lower and upper bounds g_0 = min_{y∈Y} g(y) and g_1 = max_{y∈Y} g(y), and assume that −∞ < g_0 < g_1 < ∞. Let y_j ∈ Y be such that g(y_j) = g_j, j = 0, 1. In the absence of additional information, what can the researcher learn about E_Q(g(y)|x = x), with Q the distribution of (y, x)? △

(Footnote: The bounds g_0, g_1 and the values y_0, y_1 at which they are attained may differ for different functions g(·).)

Manski's analysis of this problem begins with a simple application of the law of total probability, that yields

Q(y|x = x) = P(y|x = x, d = 1)P(d = 1|x = x) + R(y|x = x, d = 0)P(d = 0|x = x).   (2.1)

Equation (2.1) lends a simple but powerful anatomy of the selection problem. While P(y|x = x, d = 1) and P(d|x = x) can be learned from the observable distribution P(yd, d, x), under the maintained assumptions the sampling process reveals nothing about R(y|x = x, d = 0). Hence, Q(y|x = x) is not point identified. If one were to assume exogenous selection (or data missing at random conditional on x), i.e., R(y|x, d = 0) = P(y|x, d = 1), point identification would obtain. However, that assumption is non-refutable and it is well known that it may fail in applications. (Footnote: Section 5 discusses the consequences of model misspecification (with respect to refutable assumptions).)

Let T denote the space of all probability measures with support in Y. The unknown functional vector is {τ(x), υ(x)} ≡ {Q(y|x = x), R(y|x = x, d = 0)}. What the researcher can learn, in the absence of additional restrictions on R(y|x = x, d = 0), is the region of observationally equivalent distributions for y|x = x, and the associated set of expectations taken with respect to these distributions.

Theorem SIR-2.1: Under the assumptions in Identification Problem 2.1,

H_P[E_Q(g(y)|x = x)] = [E_P(g(y)|x = x, d = 1)P(d = 1|x = x) + g_0 P(d = 0|x = x),
                        E_P(g(y)|x = x, d = 1)P(d = 1|x = x) + g_1 P(d = 0|x = x)]   (2.2)

is the sharp identification region for E_Q(g(y)|x = x).

Proof. Due to the discussion following equation (2.1), the collection of observationally equivalent distribution functions for y|x = x is

H_P[Q(y|x = x)] = {τ(x) ∈ T : τ(x) = P(y|x = x, d = 1)P(d = 1|x = x) + υ(x)P(d = 0|x = x), for some υ(x) ∈ T}.   (2.3)

Next, observe that the lower bound in equation (2.2) is achieved by integrating g(y) against the distribution τ(x) that results when υ(x) places probability one on y_0. The upper bound is achieved by integrating g(y) against the distribution τ(x) that results when υ(x) places probability one on y_1. Both are contained in the set H_P[Q(y|x = x)] in equation (2.3). □

These are the worst case bounds, so called because they are assumptions-free and therefore represent the widest possible range of values for the parameter of interest that are consistent with the observed data. A simple "plug-in" estimator for H_P[E_Q(g(y)|x = x)] replaces all unknown quantities in (2.2) with consistent estimators, obtained, e.g., by kernel or sieve regression. I return to consistent estimation of partially identified parameters in Section 4. Here I emphasize that identification problems are fundamentally distinct from finite sample inference problems. The latter are typically reduced as sample size increases (because, e.g., the variance of the estimator becomes smaller). The former do not improve, unless a different and better type of data is collected, e.g.
with a smaller prevalence of missing data (see Dominitz and Manski, 2017, for a discussion).

Manski (2003, Section 1.3) shows that the proof of Theorem SIR-2.1 can be extended to obtain the smallest and largest points in the sharp identification region of any parameter that respects stochastic dominance. This is especially useful to bound the quantiles of y|x = x. For any given α ∈ (0, 1), define q_P^{g(y)}(α, 1, x) ≡ min{t : P(g(y) ≤ t|d = 1, x = x) ≥ α}. Then the smallest and largest admissible values for the α-quantile of g(y)|x = x are, respectively,

r(α, x) ≡ q_P^{g(y)}([1 − (1 − α)/P(d = 1|x = x)], 1, x) if P(d = 1|x = x) > 1 − α, and r(α, x) ≡ g_0 otherwise;
s(α, x) ≡ q_P^{g(y)}([α/P(d = 1|x = x)], 1, x) if P(d = 1|x = x) ≥ α, and s(α, x) ≡ g_1 otherwise.

(Footnote: Recall that a probability distribution F ∈ T stochastically dominates F′ ∈ T if F(−∞, t] ≤ F′(−∞, t] for all t ∈ R. A real-valued functional d : T → R respects stochastic dominance if d(F) ≥ d(F′) whenever F stochastically dominates F′.)

The lower bound on E_Q(g(y)|x = x) is informative only if g_0 > −∞, and the upper bound is informative only if g_1 < ∞. By comparison, for any value of α, r(α, x) and s(α, x) are generically informative if, respectively, P(d = 1|x = x) > 1 − α and P(d = 1|x = x) ≥ α, regardless of the range of g.

Stoye (2010) further extends partial identification analysis to the study of spread parameters in the presence of missing data (as well as interval data, data combinations, and other applications). These parameters include ones that respect second order stochastic dominance, such as the variance, the Gini coefficient, and other inequality measures, as well as other measures of dispersion which do not respect second order stochastic dominance, such as the interquartile range and ratio.
Stoye shows that the sharp identification region for these parameters can be obtained by fixing the mean or quantile of the variable of interest at a specific value within its sharp identification region, and deriving a distribution consistent with this value which is "compressed" with respect to the ones which bound the cumulative distribution function (CDF) of the variable of interest, and one which is "dispersed" with respect to them. Heuristically, the compressed distribution minimizes spread, while the dispersed one maximizes it (the sense in which this optimization occurs is formally defined in the paper). The intuition for this is that a compressed CDF is first below and then above any non-compressed one; a dispersed CDF is first above and then below any non-dispersed one. Second-stage optimization over the possible values of the mean or the quantile delivers unconstrained bounds. The main results of the paper are sharp identification regions for the expectation and variance, for the median and interquartile ratio, and for many other combinations of parameters.
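The plug-in logic behind the bounds in (2.2), and behind the quantile bounds r(α, x) and s(α, x), can be sketched as follows. This is a minimal illustration under simplifying assumptions (no covariates, simple random sampling); the helper names are hypothetical, not from the chapter.

```python
import numpy as np

def mean_bounds(y, d, g0, g1):
    """Plug-in worst-case bounds (2.2) on E[g(y)]: g(y) is observed only
    when d == 1 and is known to lie in [g0, g1]. No covariates."""
    d = np.asarray(d, dtype=bool)
    y = np.asarray(y, dtype=float)
    p1 = d.mean()                              # estimate of P(d = 1)
    m1 = y[d].mean() if d.any() else 0.0       # estimate of E[g(y) | d = 1]
    return m1 * p1 + g0 * (1 - p1), m1 * p1 + g1 * (1 - p1)

def quantile_bounds(y, d, alpha, g0, g1):
    """Worst-case bounds [r, s] on the alpha-quantile of g(y), following
    the r(alpha, x), s(alpha, x) formulae (again without covariates)."""
    d = np.asarray(d, dtype=bool)
    y_obs = np.sort(np.asarray(y, dtype=float)[d])
    p1 = d.mean()

    def q(a):  # respondents' empirical quantile: min{t : Fhat(t) >= a}
        return y_obs[max(int(np.ceil(a * len(y_obs))) - 1, 0)]

    r = q(1 - (1 - alpha) / p1) if p1 > 1 - alpha else g0
    s = q(alpha / p1) if p1 >= alpha else g1
    return r, s

# Toy data: y in [0, 1], roughly 20% of outcomes missing.
rng = np.random.default_rng(0)
d = rng.random(5_000) < 0.8
y = rng.random(5_000)
lo, hi = mean_bounds(y, d, g0=0.0, g1=1.0)      # interval of width P(d = 0)
r, s = quantile_bounds(y, d, alpha=0.5, g0=0.0, g1=1.0)
```

Note that the width of the mean bounds equals the missing-data rate regardless of the respondents' outcomes, while the quantile bounds stay informative whenever the response rate exceeds the relevant threshold, illustrating the comparison drawn above.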
Key Insight: Identification Problem 2.1 is mathematically simple, but it puts forward a new approach to empirical research. The traditional approach aims at finding a sufficient (possibly minimal) set of assumptions guaranteeing point identification of parameters, viewing identification as an "all or nothing" notion, where either the functional of interest can be learned exactly or nothing of value can be learned. The partial identification approach pioneered by Manski (1989) points out that much can be learned from the combination of data and assumptions that restrict the functionals of interest to a set of observationally equivalent values, even if this set is not a singleton. (Earlier related work includes, e.g., Gastwirth (1972) and Cowell (1991), who obtain worst case bounds on the sample Gini coefficient under the assumption that one knows the income bracket but not the exact income of every household.) Along the way, Manski (1989) points out that in Identification Problem 2.1 the observed outcome is the singleton {y} when d = 1, and the set Y when d = 0. This is a random closed set; see Definition A.1. I return to this connection in Section 2.3.

Despite how transparent the framework in Identification Problem 2.1 is, important subtleties arise even in this seemingly simple context. For a given t ∈ ℝ, consider the function g(y) = 1(y ≤ t), with 1(A) the indicator function taking the value one if the logical condition in parentheses holds and zero otherwise. Then equation (2.2) yields pointwise-sharp bounds on the CDF of y at any fixed t ∈ ℝ:

H_P[Q(y ≤ t | x = x)] = [ P(y ≤ t | x = x, d = 1) P(d = 1 | x = x),
                          P(y ≤ t | x = x, d = 1) P(d = 1 | x = x) + P(d = 0 | x = x) ].   (2.4)

Yet, the collection of CDFs that belong to the band defined by (2.4) is not the sharp identification region for the CDF of y | x = x. Rather, it constitutes an outer region, as originally pointed out by Manski (1994, p. 149 and note 2).
Theorem OR-2.1: Let C denote the collection of cumulative distribution functions on Y. Then, under the assumptions in Identification Problem 2.1,

O_P[F(y | x = x)] = { F ∈ C : P(y ≤ t | x = x, d = 1) P(d = 1 | x = x) ≤ F(t | x)
    ≤ P(y ≤ t | x = x, d = 1) P(d = 1 | x = x) + P(d = 0 | x = x)  ∀t ∈ ℝ }   (2.5)

is an outer region for the CDF of y | x = x.

Proof. Any admissible CDF for y | x = x belongs to the family of functions in equation (2.5). However, the bound in equation (2.5) does not impose the restriction that for any t₀ ≤ t₁,

Q(t₀ ≤ y ≤ t₁ | x = x) ≥ P(t₀ ≤ y ≤ t₁ | x = x, d = 1) P(d = 1 | x = x).   (2.6)

This restriction is implied by the maintained assumptions, but is not necessarily satisfied by all CDFs in O_P[F(y | x = x)], as illustrated in the following simple example.

Figure 2.1: The tube defined by inequalities (2.4) in the set-up of Example 2.1, and the CDF in (2.7). [Plot omitted; the figure shows the upper envelope P(y ≤ t | d = 1) P(d = 1) + P(d = 0), the lower envelope P(y ≤ t | d = 1) P(d = 1), and F(t).]
Example 2.1.
Omit x for simplicity, let P(d = 1) = 1/2, and let

P(y ≤ t | d = 1) = 0 if t < 0;  t/3 if 0 ≤ t < 3;  1 if t ≥ 3,

F(t) = 0 if t < 0;  t/2 if 0 ≤ t < 1;  1/2 if 1 ≤ t < 2;  1/2 + (t − 2)/2 if 2 ≤ t < 3;  1 if t ≥ 3.   (2.7)

For each t ∈ ℝ, F(t) lies in the tube defined by equation (2.4). However, it cannot be the CDF of y, because F(2) − F(1) = 0 < P(1 ≤ y ≤ 2 | d = 1) P(d = 1) = 1/6, directly contradicting equation (2.6). △

How can one characterize the sharp identification region for the CDF of y | x = x under the assumptions in Identification Problem 2.1? In general, there is not a single answer to this question: different methodologies can be used. Here I use results in Manski (2003, Corollary 1.3.1) and Molchanov and Molinari (2018, Theorem 2.25), which yield an alternative characterization of H_P[Q(y | x = x)] that translates directly into a characterization of H_P[F(y | x = x)]. (Whereas Manski (1994) is very clear that the collection of CDFs in (2.4) is an outer region for the CDF of y | x = x, and Manski (2003) provides the sharp characterization in (2.8), Manski (2007a, p. 39) does not state all the requirements that characterize H_P[F(y | x = x)].)

Theorem SIR-2.2: Given τ ∈ T, let τ_K(x) denote the probability that distribution τ assigns to set K conditional on x = x, with τ_y(x) ≡ τ_{{y}}(x). Under the assumptions in Identification Problem 2.1,

H_P[Q(y | x = x)] = { τ(x) ∈ T : τ_K(x) ≥ P(y ∈ K | x = x, d = 1) P(d = 1 | x = x),  ∀K ⊂ Y },   (2.8)

where K is measurable. If Y is countable,

H_P[Q(y | x = x)] = { τ(x) ∈ T : τ_y(x) ≥ P(y = y | x = x, d = 1) P(d = 1 | x = x),  ∀y ∈ Y }.   (2.9)

If Y is a bounded interval,

H_P[Q(y | x = x)] = { τ(x) ∈ T : τ_{[t₀,t₁]}(x) ≥ P(t₀ ≤ y ≤ t₁ | x = x, d = 1) P(d = 1 | x = x),  ∀t₀ ≤ t₁, t₀, t₁ ∈ Y }.   (2.10)

Proof.
The characterization in (2.8) follows from equation (2.3), observing that if τ(x) ∈ H_P[Q(y | x = x)] as defined in equation (2.3), then there exists a distribution υ(x) ∈ T such that τ(x) = P(y | x = x, d = 1) P(d = 1 | x = x) + υ(x) P(d = 0 | x = x). Hence, by construction τ_K(x) ≥ P(y ∈ K | x = x, d = 1) P(d = 1 | x = x), ∀K ⊂ Y. Conversely, if one has τ_K(x) ≥ P(y ∈ K | x = x, d = 1) P(d = 1 | x = x), ∀K ⊂ Y, one can define υ(x) = [τ(x) − P(y | x = x, d = 1) P(d = 1 | x = x)] / P(d = 0 | x = x). The resulting υ(x) is a probability measure, and hence τ(x) ∈ H_P[Q(y | x = x)] as defined in equation (2.3). When Y is countable, if τ_y(x) ≥ P(y = y | x = x, d = 1) P(d = 1 | x = x) for all y ∈ Y, it follows that for any K ⊂ Y,

τ_K(x) = Σ_{y∈K} τ_y(x) ≥ Σ_{y∈K} P(y = y | x = x, d = 1) P(d = 1 | x = x) = P(y ∈ K | x = x, d = 1) P(d = 1 | x = x).

The result in equation (2.10) is proven in Molchanov and Molinari (2018, Theorem 2.25) using elements of random set theory, to which I return in Section 2.3. Using elements of random set theory it is also possible to show that the characterization in (2.8) only requires checking the inequalities for K ranging over the compact subsets of Y.

This section provides sharp identification regions and outer regions for a variety of functionals of interest. The computational complexity of these characterizations varies widely. Sharp bounds on parameters that respect stochastic dominance only require computing the parameters with respect to two probability distributions. An outer region on the CDF can be obtained by evaluating all tail probabilities of a certain distribution. A sharp identification region on the CDF requires evaluating the probability that a certain distribution assigns to all intervals.
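The gap between the outer region (2.5) and the sharp region (2.8)-(2.10) can be checked numerically. The following sketch uses numbers of my own choosing in the spirit of Example 2.1 (P(d = 1) = 1/2 and y | d = 1 uniform on [0, 3]): it exhibits a CDF that lies in the pointwise tube (2.4) at every t yet violates the interval inequality (2.6) on [1, 2].

```python
import numpy as np

p1 = 0.5                                          # P(d = 1)
cdf_obs = lambda t: float(np.clip(t / 3.0, 0.0, 1.0))  # P(y <= t | d = 1): uniform on [0, 3]

lower = lambda t: cdf_obs(t) * p1                 # lower envelope of the tube (2.4)
upper = lambda t: cdf_obs(t) * p1 + (1 - p1)      # upper envelope

def F(t):
    """A candidate CDF: flat on [1, 2], otherwise rising inside the tube."""
    if t < 0:
        return 0.0
    if t < 1:
        return t / 2.0
    if t < 2:
        return 0.5
    if t < 3:
        return 0.5 + (t - 2) / 2.0
    return 1.0

ts = np.linspace(-0.5, 3.5, 801)
in_tube = all(lower(t) <= F(t) <= upper(t) for t in ts)   # pointwise check: passes
# Interval restriction (2.6) on [1, 2]: F(2) - F(1) >= P(1 <= y <= 2 | d = 1) * P(d = 1)
needed = (cdf_obs(2.0) - cdf_obs(1.0)) * p1               # = 1/6
violates = (F(2.0) - F(1.0)) < needed                     # 0 < 1/6: restriction violated
```

So F survives every pointwise test defining the outer region but fails the interval inequality, which is exactly why (2.10) must quantify over all intervals [t₀, t₁] rather than only over tails.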
I return to computational challenges in partial identification in Section 6.

2.2 Treatment Effects with and without Instrumental Variables

The discussion of partial identification of probability distributions of selectively observed data naturally leads to the question of its implications for program evaluation. The literature on program evaluation is vast. The purpose of this section is exclusively to show how the ideas presented in Section 2.1 can be applied to learn features of treatment effects of interest, when no assumptions are imposed on treatment selection and outcomes. I also provide examples of assumptions that can be used to tighten the bounds. To keep this chapter to a manageable length, I discuss only partial identification of the average response to a treatment and of the average treatment effect (ATE). There are many different parameters that have received much interest in the literature. Examples include the local average treatment effect of Imbens and Angrist (1994) and the marginal treatment effect of Heckman and Vytlacil (1999, 2001, 2005). For thorough discussions of the literature on program evaluation, I refer to the textbook treatments in Manski (1995, 2003, 2007a) and Imbens and Rubin (2015), to the Handbook chapters by Heckman and Vytlacil (2007a,b) and Abbring and Heckman (2007), and to the review articles by Imbens and Wooldridge (2009) and Mogstad and Torgovitsky (2018).

Using standard notation (e.g., Neyman, 1923), let y(·) : T ↦ Y be an individual-specific response function, with T = {0, 1, . . . , T} a finite set of mutually exclusive and exhaustive treatments, and let s denote the individual's received treatment (taking its realizations in T). The researcher observes data (y, s, x) ∼ P, with y ≡ y(s) the outcome corresponding to the received treatment s, and x a vector of covariates. The outcome y(t) for s ≠ t is counterfactual, and hence can be conceptualized as missing.
Therefore, we are in the framework of Identification Problem 2.1, and all the results from Section 2.1 apply in this context too, subject to adjustments in notation. For example, using Theorem SIR-2.1,

H_P[E_Q(y(t) | x = x)] = [ E_P(y | x = x, s = t) P(s = t | x = x) + y̲ P(s ≠ t | x = x),
                           E_P(y | x = x, s = t) P(s = t | x = x) + ȳ P(s ≠ t | x = x) ],   (2.11)

where y̲ ≡ inf_{y∈Y} y and ȳ ≡ sup_{y∈Y} y. If y̲ > −∞ and/or ȳ < ∞, these worst case bounds are informative. When both are infinite, the data are uninformative in the absence of additional restrictions.

(Here the treatment response is a function only of the (scalar) treatment received by the given individual, an assumption known as the stable unit treatment value assumption (Rubin, 1978). Beresteanu, Molchanov, and Molinari (2012) and Molchanov and Molinari (2018, Section 2.5) provide a characterization of the sharp identification region for the joint distribution of [y(t), t ∈ T].)
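A simulation sketch of the bounds in (2.11), with a DGP of my own (binary treatment, outcomes bounded in [0, 1]); the worst-case ATE bounds formed from the per-arm bounds, discussed next in the text, necessarily cover zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
y_min, y_max = 0.0, 1.0                      # known logical range of outcomes

# Potential outcomes y(0), y(1) and received treatment s; the validity of
# worst-case bounds does not rely on how s is determined
y0_pot = (rng.random(n) < 0.3).astype(float)
y1_pot = (rng.random(n) < 0.6).astype(float)
s = (rng.random(n) < 0.5).astype(int)
y = np.where(s == 1, y1_pot, y0_pot)         # only the received-treatment outcome

def wc_bounds(t):
    """Worst-case bounds on E[y(t)], eq. (2.11), no covariates."""
    pt = (s == t).mean()
    m = y[s == t].mean()
    return m * pt + y_min * (1 - pt), m * pt + y_max * (1 - pt)

lo1, hi1 = wc_bounds(1)
lo0, hi0 = wc_bounds(0)
ate_lo, ate_hi = lo1 - hi0, hi1 - lo0        # worst-case ATE bounds: always cover 0
```

With a binary treatment the ATE bounds have width 2 − P(s = 1) − P(s = 0) = 1, half of the logically possible range [−1, 1], illustrating the width formula discussed below.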
If the researcher is interested in an average treatment effect (ATE), e.g.,

E_Q(y(t₁) | x = x) − E_Q(y(t₀) | x = x)
  = E_P(y | x = x, s = t₁) P(s = t₁ | x = x) + E_Q(y(t₁) | x = x, s ≠ t₁) P(s ≠ t₁ | x = x)
  − E_P(y | x = x, s = t₀) P(s = t₀ | x = x) − E_Q(y(t₀) | x = x, s ≠ t₀) P(s ≠ t₀ | x = x),

with t₀, t₁ ∈ T, sharp worst case bounds on this quantity can be obtained as follows. First, observe that the empirical evidence reveals E_P(y | x = x, s = t_j) and P(s | x = x), but is uninformative about E_Q(y(t_j) | x = x, s ≠ t_j), j = 0, 1. Each of the latter quantities (the expectations of y(t₀) and y(t₁) conditional on different realizations of s and x = x) can take any value in [y̲, ȳ]. Hence, the sharp lower bound on the ATE is obtained by subtracting the upper bound on E_Q(y(t₀) | x = x) from the lower bound on E_Q(y(t₁) | x = x). The sharp upper bound on the ATE is obtained by subtracting the lower bound on E_Q(y(t₀) | x = x) from the upper bound on E_Q(y(t₁) | x = x). The resulting bounds have width equal to (ȳ − y̲)[2 − P(s = t₁ | x = x) − P(s = t₀ | x = x)] ∈ [(ȳ − y̲), 2(ȳ − y̲)], and hence are informative only if both y̲ > −∞ and ȳ < ∞. As the largest logically possible value for the ATE (in the absence of information from data) cannot be larger than (ȳ − y̲), and the smallest cannot be smaller than −(ȳ − y̲), the sharp bounds on the ATE always cover zero.

Key Insight: How should one think about the finding on the size of the worst case bounds on the ATE? On the one hand, if both y̲ > −∞ and ȳ < ∞ the bounds are informative, because they are a strict subset of the ATE's possible realizations. On the other hand, they reveal that the data alone are silent on the sign of the ATE. This means that assumptions play a crucial role in delivering stronger conclusions about this policy relevant parameter. The partial identification approach to empirical research recommends that, as assumptions are added to the analysis, one systematically reports how each contributes to shrinking the bounds, making transparent their role in shaping inference.

What assumptions may researchers bring to bear to learn more about treatment effects of interest? The literature has provided a wide array of well motivated and useful restrictions. Here I consider two examples. The first one entails shape restrictions on the treatment response function, leaving selection unrestricted.
Manski (1997b) obtains bounds on treatment effects under the assumption that the response functions are monotone, semi-monotone, or concave-monotone. These restrictions are motivated by economic theory, where it is commonly presumed, e.g., that demand functions are downward sloping and supply functions are upward sloping. Let the set T be ordered in terms of degree of intensity. Then Manski's monotone treatment response assumption requires that

t₁ ≥ t₀ ⇒ Q(y(t₁) ≥ y(t₀)) = 1  ∀t₀, t₁ ∈ T.

Under this assumption, each observed pair (y, s) restricts the possible values of y(t):

y(t) ∈ (−∞, y] ∩ Y if t < s;  y(t) ∈ {y} if t = s;  y(t) ∈ [y, ∞) ∩ Y if t > s.   (2.12)

Hence, the sharp bounds on E_Q(y(t) | x = x) are (Manski, 1997b, Proposition M1)

H_P[E_Q(y(t) | x = x)] = [ E_P(y | x = x, s ≤ t) P(s ≤ t | x = x) + y̲ P(s > t | x = x),
                           E_P(y | x = x, s ≥ t) P(s ≥ t | x = x) + ȳ P(s < t | x = x) ].   (2.13)

This finding highlights some important facts. Under the monotone treatment response assumption, the bounds on E_Q(y(t) | x = x) are obtained using information from all (y, s) pairs (given x = x), while the bounds in (2.11) only use the information provided by (y, s) pairs for which s = t (given x = x). As a consequence, the bounds in (2.13) are informative even if P(s = t | x = x) = 0, whereas the worst case bounds are not.

Concerning the ATE with t₁ > t₀, under monotone treatment response its lower bound is zero, and its upper bound is obtained by subtracting the lower bound on E_Q(y(t₀) | x = x) from the upper bound on E_Q(y(t₁) | x = x), where both bounds are obtained as in (2.13) (Manski, 1997b, Proposition M2).

The second example of assumptions used to tighten worst case bounds is that of exclusion restrictions, as in, e.g., Manski (1990). Suppose the researcher observes a random variable z, taking its realizations in Z, such that

E_Q(y(t) | z, x) = E_Q(y(t) | x)  ∀t ∈ T, x-a.s.
(2.14) This assumption is treatment-specific, and requires that the treatment response to t is mean independent of z. It is easy to show that under the assumption in (2.14), the bounds on E_Q(y(t) | x = x) become

H_P[E_Q(y(t) | x = x)] = [ ess sup_z { E_P(y | x = x, s = t, z) P(s = t | x = x, z) + y̲ P(s ≠ t | x = x, z) },
                           ess inf_z { E_P(y | x = x, s = t, z) P(s = t | x = x, z) + ȳ P(s ≠ t | x = x, z) } ].   (2.15)

These are called intersection bounds because they are obtained as follows. Given x and z, one uses (2.11) to obtain sharp bounds on E_Q(y(t) | z = z, x = x). Due to the mean independence assumption in (2.14), E_Q(y(t) | x = x) must belong to each of these bounds z-a.s., hence to their intersection. The expression in (2.15) follows. If the instrument affects the probability of being selected into treatment, or the average outcome for the subpopulation receiving treatment t, the bounds on E_Q(y(t) | x = x) shrink. If the bounds are empty, the mean independence assumption can be refuted (see Section 5 for a discussion of misspecification in partial identification). Manski and Pepper (2000, 2009) generalize the notion of instrumental variable to that of monotone instrumental variable, and show how these can be used to obtain tighter bounds on treatment effect parameters. They also show how shape restrictions and exclusion restrictions can jointly further tighten the bounds.

(Stronger exclusion restrictions include statistical independence of the response function at each t with z: Q(y(t) | z, x) = Q(y(t) | x) ∀t ∈ T, x-a.s.; and statistical independence of the entire response function with z: Q([y(t), t ∈ T] | z, x) = Q([y(t), t ∈ T] | x), x-a.s. Examples of partial identification analysis under these conditions can be found in Balke and Pearl (1997), Manski (2003), Kitagawa (2009), Beresteanu, Molchanov, and Molinari (2012), Machado, Shaikh, and Vytlacil (2018), and many others.)
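A sketch of the intersection bounds (2.15) with a binary instrument z that shifts selection into treatment but not the mean response (the DGP is mine, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300_000
y_min, y_max = 0.0, 1.0

z = rng.integers(0, 2, n)                         # binary instrument
s = (rng.random(n) < 0.3 + 0.4 * z).astype(int)   # z shifts selection into treatment
y1_pot = (rng.random(n) < 0.6).astype(float)      # y(1), mean-independent of z
y = np.where(s == 1, y1_pot, 0.5)                 # untreated outcomes irrelevant below

def wc_given_z(zval):
    """Worst-case bounds (2.11) on E[y(1)], computed within the z = zval cell."""
    cell = z == zval
    pt = (s[cell] == 1).mean()
    m = y[cell][s[cell] == 1].mean()
    return m * pt + y_min * (1 - pt), m * pt + y_max * (1 - pt)

b = [wc_given_z(0), wc_given_z(1)]
lo = max(bz[0] for bz in b)                       # sup over z of the lower bounds
hi = min(bz[1] for bz in b)                       # inf over z of the upper bounds
```

The z = 1 cell, where about 70% receive the treatment, already gives bounds of width roughly 0.3; intersecting with the wider z = 0 cell cannot enlarge them, and the resulting interval still covers the true E[y(1)] = 0.6.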
Manski (2013a) generalizesthese findings to the case where treatment response may have social interactions – that is,each individual’s outcome depends on the treatment received by all other individuals.
Identification Problem 2.1, as well as the treatment evaluation problem in Section 2.2, is an instance of the more general question of what can be learned about (functionals of) probability distributions of interest, in the presence of interval valued outcome and/or covariate data. Such data have become commonplace in Economics. For example, since the early 1990s the Health and Retirement Study has collected income data from survey respondents in the form of brackets, with degenerate (singleton) intervals for individuals who opt to fully reveal their income (see, e.g., Juster and Suzman, 1995). Due to privacy concerns, public use tax data are recorded as the number of taxpayers who belong to each of a finite number of cells (see, e.g., Piketty, 2005). The Occupational Employment Statistics (OES) program at the Bureau of Labor Statistics (Bureau of Labor Statistics, 2018) collects wage data from employers as intervals, and uses these data to construct estimates for wage and salary workers in more than 800 detailed occupations. Manski and Molinari (2010) and Giustinelli, Manski, and Molinari (2019b) document the extensive prevalence of rounding in survey responses to probabilistic expectation questions, and propose to use a person's response pattern across different questions to infer his rounding practice, the result being interpretation of reported numerical values as interval data. Other instances abound. Here I focus first on the case of interval outcome data.
Identification Problem 2.2: Assume that, in addition to being compact, either Y is countable or Y = [y̲, ȳ], with y̲ = min_{y∈Y} y and ȳ = max_{y∈Y} y. Let (y_L, y_U, x) ∼ P be observable random variables and y be an unobservable random variable whose distribution (or features thereof) is of interest, with y_L, y_U, y ∈ Y. Suppose that (y_L, y_U, y) are such that R(y_L ≤ y ≤ y_U) = 1. In the absence of additional information, what can the researcher learn about Q(y | x = x), the conditional distribution of y given x = x? △

(In Identification Problem 2.1 the observable variables are (yd, d, x), and (y_L, y_U) are determined as follows: y_L = yd + y̲(1 − d), y_U = yd + ȳ(1 − d). For the analysis in Section 2.2, with a treatment t fixed, the data is (y, s, x) and y_L = y·1(s = t) + y̲·1(s ≠ t), y_U = y·1(s = t) + ȳ·1(s ≠ t); hence P(y_L ≤ y ≤ y_U) = 1 by construction. See Chesher and Rosen (2019, Chapter XXX in this Volume) for further discussion.)

It is immediate to obtain the sharp identification region

H_P[E_Q(y | x = x)] = [ E_P(y_L | x = x), E_P(y_U | x = x) ].

As in the previous section, it is also easy to obtain sharp bounds on parameters that respect stochastic dominance, and pointwise-sharp bounds on the CDF of y at any fixed t ∈ ℝ:

P(y_U ≤ t | x = x) ≤ Q(y ≤ t | x = x) ≤ P(y_L ≤ t | x = x).   (2.16)

In this case too, however, as in Theorem OR-2.1, the tube of CDFs satisfying equation (2.16) for all t ∈ ℝ is an outer region for the CDF of y | x = x, rather than its sharp identification region. Indeed, also in this context it is easy to construct examples similar to Example 2.1. How can one characterize the sharp identification region for the probability distribution of y | x when one observes (y_L, y_U, x) and assumes R(y_L ≤ y ≤ y_U) = 1? Again, there is not a single answer to this question. Depending on the specific problem at hand, e.g., the specifics of the interval data and whether y is assumed discrete or continuous, different methods can be applied. I use random set theory to provide a characterization of H_P[Q(y | x = x)]. Let Y ≡ [y_L, y_U] ∩ Y.
Then Y is a random closed set according to Definition A.1. (For a proof of this statement, see Molchanov and Molinari (2018, Example 1.11).) The requirement R(y_L ≤ y ≤ y_U) = 1 can be equivalently expressed as

y ∈ Y almost surely.   (2.17)

Equation (2.17), together with knowledge of P, exhausts all the information in the data and maintained assumptions. In order to harness such information to characterize the set of observationally equivalent probability distributions for y, one can leverage a result due to Artstein (1983) (and Norberg, 1992), reported in Theorem A.1 in Appendix A, which allows one to translate (2.17) into a collection of conditional moment inequalities. Specifically, let T denote the space of all probability measures with support in Y.

Theorem SIR-2.3: Given τ ∈ T, let τ_K(x) denote the probability that distribution τ assigns to set K conditional on x = x. Under the assumptions in Identification Problem 2.2, the sharp identification region for Q(y | x = x) is

H_P[Q(y | x = x)] = { τ(x) ∈ T : τ_K(x) ≥ P(Y ⊂ K | x = x),  ∀K ⊂ Y, K compact }.   (2.18)

When Y = [y̲, ȳ], equation (2.18) becomes

H_P[Q(y | x = x)] = { τ(x) ∈ T : τ_{[t₀,t₁]}(x) ≥ P(y_L ≥ t₀, y_U ≤ t₁ | x = x),  ∀t₀ ≤ t₁, t₀, t₁ ∈ Y }.   (2.19)

Proof.
Theorem A.1 yields (2.18). If Y = [ y , y ], Molchanov and Molinari (2018, Theorem2.25) show that it suffices to verify the inequalities in (2.19) for sets K that are intervals.Compare equation (2.18) with equation (2.8). Under the set-up of Identification Problem2.1, when d = 1 we have Y = { y } and when d = 0 we have Y = Y . Hence, for any K (cid:40) Y , P ( Y ⊂ K | x = x ) = P ( y ∈ K | x = x, d = 1) P ( d = 1). It follows that the characterizationsin (2.18) and (2.8) are equivalent. If Y is countable, it is easy to show that (2.18) simplifiesto (2.8) (see, e.g., Beresteanu, Molchanov, and Molinari, 2012, Proposition 2.2). Key Insight : The mathematicalframework for the analysis of random closed sets embodied in random set theory is naturallysuited to conduct identification analysis and statistical inference in partially identified models.This is because, as argued by Beresteanu and Molinari (2008) and Beresteanu, Molchanov,and Molinari (2011, 2012), lack of point identification can often be traced back to a collectionof random variables that are consistent with the available data and maintained assumptions.In turn, this collection of random variables is equal to the family of selections of a properlyspecified random closed set, so that random set theory applies. The interval data case is asimple example that illustrates this point. More examples are given throughout this chapter.As mentioned in the Introduction, the exercise of defining the random closed set that is rel-evant for the problem under consideration is routinely carried out in partial identificationanalysis, even when random set theory is not applied. For example, in the case of treat-ment effect analysis with monotone response function, Manski (1997b) derived the set in theright-hand-side of (2.12) , which satisfies Definition (A.1) . An attractive feature of the characterization in (2.18) is that it holds regardless of thespecific assumptions on y L , y U , and Y . 
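When Y is a small finite set, the containment inequalities of Theorem SIR-2.3 can be verified by brute force over all subsets K. A sketch with toy data of my own (five interval observations on the support {0, 1, 2}):

```python
from itertools import chain, combinations

# Countable support and interval observations: Y_i = support values in [yL_i, yU_i]
support = (0, 1, 2)
obs = [(0, 0), (0, 1), (1, 2), (2, 2), (0, 2)]     # (yL, yU) pairs, one per unit
n = len(obs)

def containment(K):
    """P(Y subset of K): fraction of observations whose interval lies inside K."""
    Kset = set(K)
    return sum(all(v in Kset for v in support if lo <= v <= hi)
               for lo, hi in obs) / n

def in_sharp_region(tau):
    """Artstein's inequality: tau(K) >= P(Y subset of K) for every subset K."""
    subsets = chain.from_iterable(
        combinations(support, r) for r in range(1, len(support) + 1))
    return all(sum(tau[v] for v in K) >= containment(K) - 1e-12 for K in subsets)

tau_ok = {0: 0.4, 1: 0.3, 2: 0.3}    # passes every containment inequality
tau_bad = {0: 0.1, 1: 0.1, 2: 0.8}   # puts too little mass on {0}
```

Here tau_bad fails because P(Y ⊂ {0}) = 1/5 (one observation is the singleton {0}) while tau_bad assigns only 0.1 to that point; the brute-force enumeration over 2^|Y| subsets is exactly the computational burden alluded to above, which motivates the reduction to intervals in (2.19).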
Later sections in this chapter illustrate how Theorem A.1 delivers the sharp identification region in other more complex instances of partial identification of probability distributions, as well as in structural models. In Chapter XXX in this Volume, Chesher and Rosen (2019) apply Theorem A.1 to obtain sharp identification regions for functionals of interest in the important class of generalized instrumental variable models. To avoid repetitions, I do not systematically discuss that class of models in this chapter. (For K = Y, both (2.18) and (2.8) hold trivially.)

To characterize H_P[Q(y | x = x)] in the presence of interval outcome data, an alternative approach (e.g. Tamer, 2010; Ponomareva and Tamer, 2011) looks at all (random) mixtures of y_L, y_U. The approach is based on a random variable u (a selection mechanism that picks an element of Y) with values in [0, 1], whose joint distribution with (y_L, y_U) is left completely unspecified. Using this random variable, one defines

y_u = u y_L + (1 − u) y_U.   (2.20)

The sharp identification region in Theorem SIR-2.3 can be characterized as the collection of conditional distributions of all possible random variables y_u as defined in (2.20), given x = x. This is because each y_u is a (stochastic) convex combination of y_L, y_U, hence each of these random variables satisfies R(y_L ≤ y_u ≤ y_U) = 1. While such characterization is sharp, it can be of difficult implementation in practice, because it requires working with all possible random variables y_u built using all possible random variables u with support in [0, 1]. Random set theory allows one to dispense with modeling u, and to obtain directly a characterization of the sharp identification region for Q(y | x = x) based on conditional moment inequalities.

Horowitz and Manski (1998, 2000) study nonparametric conditional prediction problems with missing outcome and/or missing covariate data. Their analysis shows that this problem is considerably more pernicious than the case where only outcome data are missing.
For the case of interval covariate data, Manski and Tamer (2002) provide a set of sufficient conditions under which simple and elegant sharp bounds on functionals of Q(y | x) can be obtained, even in this substantially harder identification problem. Their assumptions are listed in Identification Problem 2.3, and their result (with proof) in Theorem SIR-2.4.

Identification Problem 2.3: Let (y, x_L, x_U) ∼ P be observable random variables in ℝ × ℝ × ℝ and x ∈ ℝ be an unobservable random variable. Suppose that R, the joint distribution of (y, x, x_L, x_U), is such that: (I) R(x_L ≤ x ≤ x_U) = 1; (M) E_Q(y | x = x) is weakly increasing in x; and (MI) E_R(y | x, x_L, x_U) = E_Q(y | x). In the absence of additional information, what can the researcher learn about E_Q(y | x = x) for given x ∈ X? △

Compared to the earlier discussion for the interval outcome case, here there are two additional assumptions. The monotonicity condition (M) is a simple shape restriction, which however requires some prior knowledge about the joint distribution of (y, x). The mean independence restriction (MI) requires that if x were observed, knowledge of (x_L, x_U) would not affect the conditional expectation of y | x. The assumption is not innocuous.

(It can be shown that the collection of random variables y_u equals the collection of measurable selections of the random closed set Y ≡ [y_L, y_U] (see Definition A.3); see Beresteanu, Molchanov, and Molinari (2011, Lemma 2.1). Theorem A.1 provides a characterization of the distribution of any y_u that satisfies y_u ∈ Y a.s., based on a dominance condition that relates the distribution of y_u to the distribution of the random set Y. Such dominance condition is given by the inequalities in (2.18).)
Theorem SIR-2.4: Under the assumptions of Identification Problem 2.3, the sharp identification region for E_Q(y | x = x) for given x ∈ X is

H_P[E_Q(y | x = x)] = [ sup_{x_U ≤ x} E_P(y | x_L, x_U),  inf_{x_L ≥ x} E_P(y | x_L, x_U) ].   (2.21)

Proof.
The law of iterated expectations and the independence assumption yield E_P(y | x_L, x_U) = ∫ E_Q(y | x) dR(x | x_L, x_U). For all x̲ ≤ x̄, the monotonicity assumption and the fact that x ∈ [x_L, x_U] a.s. yield E_Q(y | x = x̲) ≤ ∫ E_Q(y | x) dR(x | x_L = x̲, x_U = x̄) ≤ E_Q(y | x = x̄). Putting this together with the previous result, E_Q(y | x = x̲) ≤ E_P(y | x_L = x̲, x_U = x̄) ≤ E_Q(y | x = x̄). Then (using again the monotonicity assumption) for any x ≥ x̄, E_P(y | x_L = x̲, x_U = x̄) ≤ E_Q(y | x = x), so that the lower bound holds. The bound is weakly increasing as a function of x, so that the monotonicity assumption on E_Q(y | x = x) holds, and the bound is sharp. The argument for the upper bound can be concluded similarly.

Learning about functionals of Q(y | x = x) naturally implies learning about predictors of y | x = x. For example, H_P[E_Q(y | x = x)] yields the collection of values for the best predictor under square loss; H_P[M_Q(y | x = x)], with M_Q the median with respect to distribution Q, yields the collection of values for the best predictor under absolute loss. And so on. A related but distinct problem is that of parametric conditional prediction. Often researchers specify not only a loss function for the prediction problem, but also a parametric family of predictor functions, and wish to learn the member of this family that minimizes expected loss. To avoid confusion, let me clarify that here I am not referring to a parametric assumption on the best predictor, e.g., that E_Q(y | x) is a linear function of x. I return to such assumptions at the end of this section. For now, in the example of linearity and square loss, I am referring to best linear prediction, i.e., best linear approximation to E_Q(y | x). Manski (2003, pp. 56-58) discusses what can be learned about the best linear predictor of y conditional on x, when only interval data on (y, x) is available. I treat first the case of interval outcome and perfectly observed covariates.

Identification Problem 2.4: Maintain the same assumptions as in Identification Problem 2.2. Let (y_L, y_U, x) ∼ P be observable random variables and y be an unobservable random variable, with R(y_L ≤ y ≤ y_U) = 1. In the absence of additional information, what can the researcher learn about the best linear predictor of y given x = x? △

(For the case of missing covariate data, which is a special case of interval covariate data similarly to arguments in footnote 12, Aucejo, Bugni, and Hotz (2017) show that the MI restriction implies the assumption that data is missing at random.)

Suppose for simplicity that x is a scalar, and let θ = [θ₀ θ₁]′ ∈ Θ ⊂ ℝ² denote the parameter vector of the best linear predictor of y | x. Assume that Var(x) >
0. Combining the definition of best linear predictor with a characterization of the sharp identification region for the joint distribution of (y, x), we have that

H_P[θ] = { ϑ = argmin_θ ∫ (y − θ₀ − θ₁x)² dη :  η ∈ H_P[Q(y, x)] },   (2.22)

where, using an argument similar to the one in Theorem SIR-2.3,

H_P[Q(y, x)] = { η : η([t₀, t₁], (−∞, s]) ≥ P(y_L ≥ t₀, y_U ≤ t₁, x ≤ s)  ∀t₀ ≤ t₁, t₀, t₁ ∈ ℝ, ∀s ∈ ℝ }.   (2.23)

Beresteanu and Molinari (2008, Proposition 4.1) show that (2.22) can be re-written in an intuitive way that generalizes the well-known formula for the best linear predictor that arises when y is perfectly observed. Define the random segment G and the matrix Σ_P as

G = { (y̆, y̆x)′ : y̆ ∈ Sel(Y) } ⊂ ℝ²,   and   Σ_P = E_P [ 1, x ; x, x² ],   (2.24)

where Sel(Y) is the set of all measurable selections from Y; see Definition A.3. Then:

Theorem SIR-2.5: Under the assumptions of Identification Problem 2.4, the sharp identification region for the parameters of the best linear predictor of y | x is

H_P[θ] = Σ_P⁻¹ E_P G,   (2.25)

with E_P G the Aumann (or selection) expectation of G as in Definition A.4.

Proof. By Theorem A.1, (ỹ, x̃) ∈ (Y × x) (up to an ordered coupling as discussed in Appendix A) if and only if the distribution of (ỹ, x̃) belongs to H_P[Q(y, x)]. The result follows.

In either representation (2.22) or (2.25), H_P[θ] is the collection of best linear predictors for each selection of Y. Why should one bother with the representation in (2.25)? The reason is that H_P[θ] is a convex set, as can be evinced from representation (2.25): G has almost surely convex realizations that are segments, and the Aumann expectation of a convex set is convex. Hence, it can be equivalently represented through its support function h_{H_P[θ]}: (Under our assumption that Y is a bounded interval, all the selections of Y are integrable.)
(Beresteanu and Molinari (2008) consider the more general case where Y is not required to be bounded.) The support function is given by

    h_{H_P[θ]}(u) = E_P[ (y_L 1(f(x, u) < 0) + y_U 1(f(x, u) ≥ 0)) f(x, u) ],  u ∈ 𝕊,    (2.26)

where f(x, u) ≡ [1 x] Σ_P^{−1} u and 𝕊 denotes the unit sphere (in ℝ² in our example; in ℝ^{d+1} if x is a d-dimensional vector). The characterization in (2.26) results from Theorem A.2, which yields h_{H_P[θ]}(u) = h_{Σ_P^{−1} E_P G}(u) = E_P h_{Σ_P^{−1} G}(u), and the fact that E_P h_{Σ_P^{−1} G}(u) equals the expression in (2.26). As I discuss in Section 4 below, because the support function fully characterizes the boundary of H_P[θ], (2.26) allows for a simple sample analog estimator, and for inference procedures with desirable properties. It also immediately yields sharp bounds on linear combinations of θ by judicious choice of u. Stoye (2007) and Magnac and Maurin (2008) provide the same characterization as in (2.26) using, respectively, direct optimization and the Frisch-Waugh-Lovell theorem.

A natural generalization of Identification Problem 2.4 allows for both outcome and covariate data to be interval valued.
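As an illustration of how judicious choice of u in (2.26) delivers bounds on linear combinations of θ: with a scalar covariate, the directions u = [0, ±1]′ yield closed-form sharp bounds on the slope θ_1, whose sample analogs are immediate to compute. A minimal sketch (the worst-case selections below follow the covariance-bounding logic of the text; the data in the usage note are simulated placeholders):

```python
import numpy as np

def blp_slope_bounds(yL, yU, x):
    """Sample-analog sharp bounds on the slope of the best linear predictor
    of y given x when only y in [yL, yU] is known (interval outcome).
    Lower bound: select y = yL where x exceeds its mean and y = yU elsewhere,
    which minimizes Cov(x, y); the upper bound uses the reverse selection."""
    yL, yU, x = (np.asarray(v, dtype=float) for v in (yL, yU, x))
    dx = x - x.mean()
    var_x = (x ** 2).mean() - x.mean() ** 2
    lo = (dx * np.where(dx > 0, yL, yU)).mean() / var_x
    hi = (dx * np.where(dx > 0, yU, yL)).mean() / var_x
    return lo, hi
```

When y_L = y_U (point-identified outcome), the two bounds coincide with the ordinary least squares slope; they widen as the intervals [y_L, y_U] lengthen.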
Identification Problem 2.5: Maintain the same assumptions as in Identification Problem 2.4, but with x ∈ 𝒳 ⊂ ℝ unobservable. Let the researcher observe (y_L, y_U, x_L, x_U) such that R(y_L ≤ y ≤ y_U, x_L ≤ x ≤ x_U) = 1. Let X ≡ [x_L, x_U] and let X be bounded. (Here for simplicity I suppose that both x_L and x_U have bounded support; Beresteanu, Molchanov, and Molinari (2011) do not make this simplifying assumption.) In the absence of additional information, what can the researcher learn about the best linear predictor of y given x? △

Abstractly, H_P[θ] is as given in (2.22), with

    H_P[Q(y, x)] = { η : η(K) ≥ P((Y × X) ⊂ K) ∀ compact K ⊂ 𝒴 × 𝒳 }

replacing (2.23), by an application of Theorem A.1. While this characterization is sharp, it is cumbersome to apply in practice; see Horowitz, Manski, Ponomareva, and Stoye (2003). On the other hand, when both y and x are perfectly observed, the best linear predictor is simply equal to the parameter vector that yields a mean zero prediction error that is uncorrelated with x. How can this basic observation help in the case of interval data? The idea is that one can apply the same insight to the set-valued data, and obtain H_P[θ] as the collection of θ's for which there exists a selection (ỹ, x̃) ∈ Sel(Y × X), with associated prediction error ε_θ = ỹ − θ_0 − θ_1 x̃, satisfying E_P ε_θ = 0 and E_P(ε_θ x̃) = 0, as shown by Beresteanu, Molchanov, and Molinari (2011). (See also Beresteanu and Molinari (2008, p. 808) and Bontemps, Magnac, and Maurin (2012, p. 1136). Recall also that, for scalar x, choosing u = [0 1]′ and u = [0 −1]′ yields sharp bounds θ_1 ∈ [θ_{1L}, θ_{1U}], with θ_{1L} = min_{y ∈ [y_L, y_U]} Cov(x, y)/Var(x) = E_P[(x − E_P x)(y_L 1(x > E_P x) + y_U 1(x ≤ E_P x))]/(E_P x² − (E_P x)²) and θ_{1U} = max_{y ∈ [y_L, y_U]} Cov(x, y)/Var(x) = E_P[(x − E_P x)(y_L 1(x < E_P x) + y_U 1(x ≥ E_P x))]/(E_P x² − (E_P x)²).) To obtain the formal result, define the θ-dependent set

    E_θ = { (ỹ − θ_0 − θ_1 x̃, (ỹ − θ_0 − θ_1 x̃) x̃)′ : (ỹ, x̃) ∈ Sel(Y × X) }.

Theorem SIR-2.6: Under the assumptions of Identification Problem 2.5, the sharp identification region for the parameters of the best linear predictor of y given x is

    H_P[θ] = { θ ∈ Θ : 0 ∈ E_P E_θ } = { θ ∈ Θ : min_{u ∈ 𝔹^d} E_P h_{E_θ}(u) = 0 },    (2.27)

where h_{E_θ}(u) = max_{y ∈ Y, x ∈ X} [u_1(y − θ_0 − θ_1 x) + u_2(yx − θ_0 x − θ_1 x²)] is the support function of the set E_θ in direction u ∈ 𝕊^{d−1}, see Definition A.5.

Proof. By Theorem A.1, (ỹ, x̃) ∈ Sel(Y × X) (up to an ordered coupling as discussed in Appendix A) if and only if the distribution of (ỹ, x̃) belongs to H_P[Q(y, x)]. For given θ, one can find (ỹ, x̃) ∈ Sel(Y × X) such that E_P ε_θ = 0 and E_P(ε_θ x̃) = 0, with (ε_θ, ε_θ x̃)′ ∈ Sel(E_θ), if and only if the zero vector belongs to E_P E_θ. By Theorem A.2, E_P E_θ is a convex set and by (A.9), 0 ∈ E_P E_θ if and only if 0 ≤ h_{E_P E_θ}(u) ∀ u ∈ 𝔹^d. The final characterization follows from (A.7).

The support function h_{E_θ}(u) is an easy to calculate convex sublinear function of u, regardless of whether the variables involved are continuous or discrete. The optimization problem in (2.27), determining whether θ ∈ H_P[θ], is a convex program, hence easy to solve; see for example the CVX software by Grant and Boyd (2010). It should be noted, however, that the set H_P[θ] itself is not necessarily convex. Hence, tracing out its boundary is non-trivial. I discuss computational challenges in partial identification in Section 6.

I conclude this section by discussing parametric regression. Manski and Tamer (2002) study identification of parametric regression models under the assumptions in Identification Problem 2.6; Theorem SIR-2.7 below reports the result. The proof is omitted because it follows immediately from the proof of Theorem SIR-2.4.
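Returning to the membership test in (2.27): with scalar y and x, the per-observation support function h_{E_θ}(u) has a closed form because the objective is linear in y and quadratic in x, and the minimization over the unit ball can be approximated on a grid of directions. A minimal sample-analog sketch (a grid search over directions stands in for the convex program; data in the usage note are simulated placeholders):

```python
import numpy as np

def support_fn_mean(u, theta, yL, yU, xL, xU):
    """Sample analog of E_P[h_{E_theta}(u)].  Per observation, maximize
    u1*(y - t0 - t1*x) + u2*(y*x - t0*x - t1*x**2) over (y, x) in
    [yL, yU] x [xL, xU]: the objective is linear in y (endpoints suffice)
    and quadratic in x (endpoints, plus the interior vertex when concave)."""
    (t0, t1), (u1, u2) = theta, u
    yL, yU, xL, xU = (np.asarray(v, dtype=float) for v in (yL, yU, xL, xU))
    a = -u2 * t1                                   # coefficient on x**2
    best = np.full(yL.shape, -np.inf)
    for y in (yL, yU):
        b = -u1 * t1 + u2 * (y - t0)               # coefficient on x
        c = u1 * (y - t0)                          # constant term
        cands = [xL, xU]
        if a < 0:                                  # concave in x: check vertex
            cands.append(np.clip(-b / (2 * a), xL, xU))
        for x in cands:
            best = np.maximum(best, a * x ** 2 + b * x + c)
    return best.mean()

def in_sharp_region(theta, yL, yU, xL, xU, n_dir=360, tol=1e-9):
    """theta belongs to the sample analog of H_P[theta] in (2.27) iff
    E_P[h_{E_theta}(u)] >= 0 in every direction u on the unit circle."""
    angles = np.linspace(0.0, 2 * np.pi, n_dir, endpoint=False)
    return all(support_fn_mean((np.cos(t), np.sin(t)), theta, yL, yU, xL, xU) >= -tol
               for t in angles)
```

With point-identified data (y_L = y_U and x_L = x_U) the test accepts only the usual best linear predictor, up to numerical tolerance; with genuine intervals it accepts a set of parameter values.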
Identification Problem 2.6: Let (y, x_L, x_U, w) ∼ P be observable random variables in ℝ × ℝ × ℝ × ℝ^d, d < ∞, and let x ∈ ℝ be an unobservable random variable. Assume that the joint distribution R of (y, x, x_L, x_U) is such that R(x_L ≤ x ≤ x_U) = 1 and E_R(y|w, x, x_L, x_U) = E_Q(y|w, x). Suppose that E_Q(y|w, x) = f(w, x; θ), with f : ℝ^d × ℝ × Θ ↦ ℝ a known function such that for each w ∈ ℝ^d and θ ∈ Θ, f(w, x; θ) is weakly increasing in x. In the absence of additional information, what can the researcher learn about θ? △

(Note that while G is a convex set, E_θ is not.)

Theorem SIR-2.7: Under the assumptions of Identification Problem 2.6, the sharp identification region for θ is

    H_P[θ] = { ϑ ∈ Θ : f(w, x_L; ϑ) ≤ E_P(y|w, x_L, x_U) ≤ f(w, x_U; ϑ), (w, x_L, x_U)-a.s. }.    (2.28)

Aucejo, Bugni, and Hotz (2017) study Identification Problem 2.6 for the case of missing covariate data, without imposing the mean independence restriction of Manski and Tamer (2002) (Assumption MI in Identification Problem 2.3). As discussed in footnote 16, restriction MI is undesirable in this context because it implies the assumption that data are missing at random. Aucejo, Bugni, and Hotz (2017) characterize H_P[θ] under the weaker assumptions, but face the problem that this characterization is usually too complex to compute or to use for inference. They therefore provide outer regions that are easier to compute, and they show that these regions are informative and relatively easy to use.

One of the first examples of bounding analysis appears in Frisch (1934), to assess the impact in linear regression of covariate measurement error. This analysis was substantially extended in Gilstein and Leamer (1983), Klepper and Leamer (1984), and Leamer (1987).
The more recent literature in partial identification has provided important advances to learn features of probability distributions when the observed variables are error-ridden measures of the variables of interest. Here I briefly mention some of the papers in this literature, and refer to Chapter XXX in this Volume by Schennach (2019) for a thorough treatment of identification and inference with mismeasured and unobserved variables. In an influential paper, Horowitz and Manski (1995) study what can be learned about features of the distribution of y|x in the presence of contaminated or corrupted outcome data. Whereas a contaminated sampling model assumes that data errors are statistically independent of sample realizations from the population of interest, the corrupted sampling model does not. These models are regularly used in the important literature on robust estimation (e.g., Huber, 1964, 2004; Hampel, Ronchetti, Rousseeuw, and Stahel, 2011). However, the goal of that literature is to characterize how point estimators of population parameters behave when data errors are generated in specified ways. As such, the inference problem is approached ex ante: before collecting the data, one looks for point estimators that are not greatly affected by error. The question addressed by Horowitz and Manski (1995) is conceptually distinct. It asks what can be learned about specific population parameters ex post, that is, after the data have been collected. For example, whereas the mean is well known not to be a robust estimator in the presence of contaminated data, Horowitz and Manski (1995) show that it can be (non-trivially) bounded provided the probability of contamination is strictly less than one. Dominitz and Sherman (2004, 2005) and Kreider and Pepper (2007, 2008) extend the results of Horowitz and Manski (1995) to allow for (partial) verification of the distribution from which the data are drawn. They apply the resulting sharp bounds to learn about school performance when the observed test scores may not be valid for all students. Molinari (2008) provides sharp bounds on the distribution of a misclassified outcome variable under an array of different assumptions on the extent and type of misclassification.

A completely different problem is that of data combination.
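To fix ideas on the contaminated sampling result just mentioned: when the outcome is bounded and the probability of contamination is at most p, the mean of the error-free distribution can be bounded by trimmed means that discard a p fraction of the observed mass from the top and from the bottom, respectively. A minimal finite-sample sketch of this trimming logic (the data and the value of p in the usage note are hypothetical):

```python
import numpy as np

def contaminated_mean_bounds(y, p):
    """Trimming bounds on the mean of the error-free outcome distribution
    under contaminated sampling with contamination probability at most p.
    Lower bound: drop the largest p fraction of observations (worst case:
    contamination inflated the mean); upper bound: drop the smallest."""
    y = np.sort(np.asarray(y, dtype=float))
    k = int(np.floor(p * len(y)))      # number of observations to trim
    lo = y[: len(y) - k].mean()
    hi = y[k:].mean()
    return lo, hi
```

With p = 0 both bounds equal the sample mean; the interval widens as the admissible contamination share grows, and becomes uninformative as p approaches one.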
Applied economists often face the problem that no single data set contains all the variables that are necessary to conduct inference on a population of interest. When this is the case, they need to integrate the information contained in different samples; for example, they might need to combine survey data with administrative data (see Ridder and Moffitt, 2007, for a survey of the econometrics of data combination). From a methodological perspective, the problem is that while the samples being combined might contain some common variables, other variables belong only to one of the samples. When the data are collected at the same aggregation level (e.g., individual level, household level, etc.), if the common variables include a unique and correctly recorded identifier of the units constituting each sample, and there is a substantial overlap of units across all samples, then exact matching of the data sets is relatively straightforward, and the combined data set provides all the relevant information to identify features of the population of interest. However, it is rather common that there is a limited overlap in the units constituting each sample, or that variables that allow identification of units are not available in one or more of the input files, or that one sample provides information at the individual or household level (e.g., survey data) while the second sample provides information at a more aggregate level (e.g., administrative data providing information at the precinct or district level). Formally, the problem is that one observes data that identify the joint distributions P(y, x) and P(x, w), but not data that identify the joint distribution Q(y, x, w) whose features one wants to learn. The literature on statistical matching has aimed at using the common variable(s) x as a bridge to create synthetic records containing (y, x, w) (see, e.g., Okner, 1972, for an early contribution).
As Sims (1972) points out, the inherent assumption at the base of statistical matching is that, conditional on x, y and w are independent. This conditional independence assumption is strong and untestable. While it does guarantee point identification of features of the conditional distribution Q(y|x, w), it often finds very little justification in practice. Early on, Duncan and Davis (1953) provided numerical illustrations of how one can bound the object of interest when both y and w are binary variables. Cross and Manski (2002) provide a general analysis of the problem. They obtain bounds on the long regression E_Q(y|x, w), under the assumption that w has finite support. They show that sharp bounds on E_Q(y|x, w = w̄) can be obtained using the results in Horowitz and Manski (1995), thereby establishing a connection with the analysis of contaminated data. They then derive sharp identification regions for [E_Q(y|x = x̄, w = w̄), x̄ ∈ 𝒳, w̄ ∈ 𝒲]. They show that these bounds are sharp when y has finite support, and Molinari and Peski (2006) establish sharpness without this restriction. Fan, Sherman, and Shum (2014) address the question of what can be learned about counterfactual distributions and treatment effects under the data scenario just described, but with x replaced by s, a binary indicator for the received treatment (using the notation of the previous section). In this case, the exogenous selection assumption (conditional on w) does not suffice for point identification of the objects of interest. The authors derive, however, sharp bounds on these quantities using monotone rearrangement inequalities. Pacini (2017) provides partial identification results for the coefficients in the linear projection of y on (x, w).

In order to discuss the partial identification approach to learning features of probability distributions in some level of detail while keeping this chapter to a manageable length, I have focused on a selection of papers.
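The binary illustration of Duncan and Davis (1953) mentioned above has a simple closed form. Writing E(y|x) = E(y|x, w=1)P(w=1|x) + E(y|x, w=0)P(w=0|x), and letting the unidentified term E(y|x, w=0) range over [0, 1], yields sharp bounds on E(y|x, w=1). A minimal sketch (the input probabilities in the usage note are hypothetical):

```python
def long_regression_bounds(q, p):
    """Sharp bounds on m1 = E(y | x, w=1) for binary y and w, when only
    q = P(y=1 | x) and p = P(w=1 | x) are identified (no joint data on y, w).
    Law of total probability: q = m1*p + m0*(1 - p), with m0 = E(y | x, w=0)
    unrestricted in [0, 1]."""
    lo = max(0.0, (q - (1.0 - p)) / p)   # worst case: m0 = 1
    hi = min(1.0, q / p)                 # worst case: m0 = 0
    return lo, hi
```

For example, q = 0.9 and p = 0.5 give bounds [0.8, 1.0], while q = p = 0.5 yields the uninformative interval [0, 1]; the bounds narrow as P(w=1|x) approaches one.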
In this section I briefly mention several other excellent theoretical contributions that could be discussed more closely, as well as several papers that have applied partial identification analysis to answer important empirical questions.

While selectively observed data are commonplace in observational studies, in randomized experiments subjects are randomly placed in designated treatment groups conditional on x, so that the assumption of exogenous selection is satisfied with respect to the assigned treatment. Yet, identification of some highly policy relevant parameters can remain elusive in the absence of strong assumptions. One challenge results from noncompliance, where individuals' received treatments differ from the randomly assigned ones. Balke and Pearl (1997) derive sharp bounds on the ATE in this context, when 𝒴 = 𝒯 = {0, 1}. Even if one is interested in the intention-to-treat parameter, selectively observed data may continue to be a problem. For example, Lee (2009) studies the wage effects of the Job Corps training program, which randomly assigns eligibility to participate in the program. Individuals randomized to be eligible were not compelled to receive treatment, hence Lee (2009) focuses on the intention-to-treat effect. Because wages are only observable when individuals are employed, a selection problem persists despite the random assignment of eligibility to treatment, as employment status may be affected by the training program. Lee obtains sharp bounds on the intention-to-treat effect through a trimming procedure that leverages results in Horowitz and Manski (1995). Molinari (2010) analyzes the problem of identification of the ATE and other treatment effects when the received treatment is unobserved for a subset of the population. Missing treatment data may be due to item or survey nonresponse in observational studies, or noncompliance with randomly assigned treatments that are not directly monitored.
She derives sharp worst case bounds leveraging results in Horowitz and Manski (1995), and she shows that these are a function of the available prior information on the distribution of missing treatments. If the response function is assumed monotone as in (2.13), she obtains informative bounds without restrictions on the distribution of missing treatments.

Even randomly assigned treatments and perfect compliance with no missing data may not suffice for point identification of all policy relevant parameters. Important examples are given by Heckman, Smith, and Clements (1997) and Manski (1997a). Heckman, Smith, and Clements show that features of the joint distribution of the potential outcomes of treatment and control, including the distribution of treatment effects, cannot be point identified in the absence of strong restrictions. This is because although subjects are randomized to treatment and control, nobody's outcome is observed under both states. Nonetheless, the authors obtain bounds for the functionals of interest. Mullahy (2018) derives related bounds on the probability that the potential outcome of one treatment is larger than that of the other treatment, and applies these results to health economics problems. Manski shows that features of outcome distributions under treatment rules in which treatment may vary within groups cannot be point identified in the absence of strong restrictions. This is because data resulting from randomized experiments with perfect compliance allow for point identification of the outcome distributions under treatment rules that assign all persons with the same x to the same treatment group. However, such data only allow for partial identification of outcome distributions under rules in which treatment may vary within groups.
Manski derives sharp bounds for functionals of these distributions.

Analyses of data resulting from natural experiments also face identification challenges. Hotz, Mullin, and Sanders (1997) study what can be learned about treatment effects when one uses a contaminated instrumental variable, i.e., when a mean-independence assumption holds in a population of interest, but the observed population is a mixture of the population of interest and one in which the assumption does not hold. They extend the results of Horowitz and Manski (1995) to learn about the causal effect of teenage childbearing on a teen mother's subsequent outcomes, using the natural experiment of miscarriages to form an instrumental variable for teen births. This instrument is contaminated because miscarriages may not occur randomly for a subset of the population (e.g., higher miscarriage rates are associated with smoking and drinking, and these behaviors may be correlated with the outcomes of interest).

Of course, analyses of selectively observed data present many challenges, including but not limited to the ones described in Section 2.1. Athey and Imbens (2006) generalize the difference-in-differences (DID) design to a changes-in-changes (CIC) model, where the distribution of the unobservables is allowed to vary across groups, but not over time within groups, and the additivity and linearity assumptions of the DID are dispensed with. For the case that the outcomes have a continuous distribution, Athey and Imbens provide conditions for point identification of the entire counterfactual distribution of effects of the treatment on the treatment group, as well as the distribution of effects of the treatment on the control group, without restricting how these distributions differ from each other.
For the case that the outcome variables are discrete, they provide partial identification results, as well as additional conditions compared to their baseline model under which point identification attains.

Motivated by the question of whether the age-adjusted mortality rate from cancer in 2000 was lower than that in the early 1970s, Honoré and Lleras-Muney (2006) study partial identification of competing risk models (see Peterson, 1976, for earlier partial identification results). To answer this question, they need to contend with the fact that the mortality rate from cardiovascular disease declined substantially over the same period of time, so that individuals who, in the early 1970s, might have died from cardiovascular disease before being diagnosed with cancer do not in 2000. In this context, it is important to carry out the analysis without assuming that the underlying risks are independent. Honoré and Lleras-Muney show that bounds for the parameters of interest can be obtained as the solution to linear programming problems. The estimated bounds suggest much larger improvements in cancer mortality rates than previously estimated.

Blundell, Gosling, Ichimura, and Meghir (2007) use UK data to study changes over time in the distribution of male and female wages, and in wage inequality. Because the composition of the workforce changes over time, it is difficult to disentangle that effect from changes in the distribution of wages, given that the latter are observed only for people in the workforce. Blundell, Gosling, Ichimura, and Meghir begin their empirical analysis by reporting worst case bounds (as in Manski, 1994) on the CDF of wages conditional on covariates. They then consider various restrictions on treatment selection, e.g., a first order stochastic dominance assumption according to which people with higher wages are more likely to work, and derive tighter bounds under this assumption (and under weaker ones). Finally, they bring to bear shape restrictions.
At each step of the analysis, they report the resulting bounds, thereby illuminating the role played by each assumption in shaping the inference. Chandrasekhar, Chernozhukov, Molinari, and Schrimpf (2018) provide best linear approximations to the identification region for the quantile gender wage gap using Current Population Survey repeated cross-sections data from 1975-2001, using treatment selection assumptions in the spirit of Blundell, Gosling, Ichimura, and Meghir (2007) as well as exclusion restrictions.

Bhattacharya, Shaikh, and Vytlacil (2012) study the effect of Swan-Ganz catheterization on subsequent mortality. (The Swan-Ganz catheter is a device placed in patients in the intensive care unit to guide therapy.) Previous research had shown, using propensity score matching (assuming that there are no unobserved differences between catheterized and non-catheterized patients), that Swan-Ganz catheterization increases the probability that patients die within 180 days from admission to the intensive care unit. Bhattacharya, Shaikh, and Vytlacil re-analyze the data using (and extending) bounds results obtained by Shaikh and Vytlacil (2011). These results are based on exclusion restrictions combined with a threshold crossing structure for both the treatment and the outcome variables in problems where 𝒴 = 𝒯 = {0, 1}. Bhattacharya, Shaikh, and Vytlacil use as instrument for Swan-Ganz catheterization the day of the week that the patient was admitted to the intensive care unit. The reasoning is that patients are less likely to be catheterized on the weekend, but the admission day to the intensive care unit is plausibly uncorrelated with subsequent mortality. Their results confirm that for some diagnoses, Swan-Ganz catheterization increases mortality at 30 days.

[...] a linear functional ℓ(g), when g : 𝒳 ↦ ℝ is such that y = g(x) + ε and E(ε|z) = 0.
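When x and z have finite support, models of this kind reduce to linear restrictions on the vector of values (g(x_1), ..., g(x_J)): the moments E(y|z = z_k) = Σ_j g(x_j) P(x = x_j|z = z_k) are linear in g, as are shape restrictions such as monotonicity, so bounds on a linear functional ℓ(g) can be computed by linear programming. A minimal sketch with hypothetical probabilities (and the simplifying bound g ∈ [0, 1], which is an assumption of this example, not of the original analysis):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical discrete setup: x has 4 support points, z has 2.
P_x_given_z = np.array([[0.4, 0.3, 0.2, 0.1],     # P(x = x_j | z = z_1)
                        [0.1, 0.2, 0.3, 0.4]])    # P(x = x_j | z = z_2)
g_true = np.array([0.1, 0.3, 0.6, 0.8])           # a monotone g generating the moments
Ey_given_z = P_x_given_z @ g_true                 # identified moments E(y | z = z_k)

c = np.full(4, 0.25)                              # functional l(g): average of g over x-points

# Monotonicity g_1 <= g_2 <= g_3 <= g_4, written as A_ub @ g <= 0.
A_ub = np.array([[1.0, -1.0, 0.0, 0.0],
                 [0.0, 1.0, -1.0, 0.0],
                 [0.0, 0.0, 1.0, -1.0]])

def lp_bound(sign):
    """Minimize (sign=+1) or maximize (sign=-1) l(g) subject to the moment
    equalities, monotonicity, and the bound g in [0, 1]."""
    res = linprog(sign * c, A_ub=A_ub, b_ub=np.zeros(3),
                  A_eq=P_x_given_z, b_eq=Ey_given_z, bounds=[(0.0, 1.0)] * 4)
    assert res.success
    return sign * res.fun

lower, upper = lp_bound(+1), lp_bound(-1)
print(lower, upper)   # interval of values of l(g) consistent with the moments
```

Because the moments were generated from a monotone g, both programs are feasible and the resulting interval contains the value ℓ(g_true) = 0.45.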
The instrumental variable z and regressor x have discrete distributions, and z has fewer points of support than x, so that ℓ(g) can only be partially identified. They impose shape restrictions on g (e.g., monotonicity or convexity) to achieve interval identification of ℓ(g), and they show that the lower and upper points of the interval can be obtained by solving linear programming problems. They also show that the bootstrap can be used to carry out inference.

3 Partial Identification of Structural Models
In this section I focus on the literature concerned with learning features of structural econometric models. These are models where economic theory is used to postulate relationships among observable outcomes y, observable covariates x, and unobservable variables ν. For example, economic theory may guide assumptions on economic behavior (e.g., utility maximization) and equilibrium that yield a mapping from (x, ν) to y. The researcher is interested in learning features of these relationships (e.g., utility function, distribution of preferences), and to this end may supplement the data and economic theory with functional form assumptions on the mapping of interest and distributional assumptions on the observable and unobservable variables.

The earlier literature on partial identification of features of structural models includes important examples of nonparametric analysis of random utility models and revealed preference extrapolation, e.g., Block and Marschak (1960), Marschak (1960), Hall (1973), McFadden (1975), Falmagne (1978), McFadden and Richter (1991), and others. The earlier literature also addresses semiparametric analysis, where the underlying models are specified up to parameters that are finite dimensional (e.g., preference parameters) and parameters that are infinite dimensional (e.g., distribution functions); important examples include Marschak and Andrews (1944), Markowitz (1952), Fisher (1966, Section 2.10), Harrison and Kreps (1979), Kreps (1981), Leamer (1981), Manski (1988b), Jovanovic (1989), Phillips (1989), Hansen and Jagannathan (1991), Hansen, Heaton, and Luttmer (1995), Luttmer (1996), and others. Contrary to the nonparametric bounds results discussed in Section 2, and especially in the case of semiparametric models, structural partial identification often yields an identification region that is not constructive. Indeed, the boundary of the set is not obtained in closed form as a functional of the distribution of the observable data.
Rather, the identification region can often be characterized as a level set of a properly specified criterion function. (Of course, this is not always the case, as exemplified by the bounds in Hansen and Jagannathan (1991).)

The recent spark of interest in partial identification of structural microeconometric models was fueled by the work of Manski and Tamer (2002), Tamer (2003) and Ciliberto and Tamer (2009), and Haile and Tamer (2003). Each of these papers has advanced the literature in fundamental ways, studying conceptually very distinct problems. Manski and Tamer (2002) are concerned with partial identification of the decision process yielding binary outcomes in a semiparametric model, when one of the explanatory variables is interval valued. Hence, the root cause of the identification problem they study is that the data is incomplete. (Manski and Tamer (2002) also study partial identification (and estimation) of nonparametric, semiparametric, and parametric conditional expectation functions that are well defined in the absence of a structural model, when one of the conditioning variables is interval valued; I refer to Section 2 for a discussion.) Tamer (2003) and Ciliberto and Tamer (2009) are concerned with identification (and estimation) of simultaneous equation models with dummy endogenous variables which are [...]. Haile and Tamer (2003) are concerned with nonparametric identification and estimation of the distribution of valuations in a model of English auctions under weak assumptions on bidders' behavior. In both cases, the root cause of the identification problem is that the structural model is incomplete. This is because the model makes multiple predictions for the observed outcome variables (respectively: the players' actions; and the bidders' bids), but does not specify how one of them is selected to yield the observed data.
Set-valued predictions for the observable outcomes (endogenous variables) are a key feature of partially identified structural models. The goal of this section is to explain how they result in a wide array of theoretical frameworks, and how sharp identification regions can be characterized using a unified approach based on random set theory. Although the work of Manski and Tamer (2002), Tamer (2003) and Ciliberto and Tamer (2009), and Haile and Tamer (2003) has spurred many of the developments discussed in this section, for pedagogical reasons I organize the presentation based on application topic rather than chronologically. The work of Pakes (2010) and Pakes, Porter, Ho, and Ishii (2015) further stimulated a large empirical literature that applies partial identification methods to a wide array of questions of substantive economic importance, to which I return in Section 3.5.
Let I denote a population of decision makers and 𝒴 = {c_1, ..., c_{|𝒴|}} a finite universe of potential alternatives (feasible set henceforth). Let 𝒰 be a family of real valued functions defined over the elements of 𝒴. Let "∈*" denote "is chosen from." Then observed choice is consistent with a random utility model if there exists a function u_i drawn from 𝒰 according to some probability distribution, such that P(c ∈* C) = P(u_i(c) ≥ u_i(b) ∀ b ∈ C) for all c ∈ C, all non empty sets C ⊂ 𝒴, and all i ∈ I (Block and Marschak, 1960). See Manski (2007a, Chapter 13) for a textbook presentation of this class of models, and Matzkin (2007) for a review of sufficient conditions for point identification of nonparametric and semiparametric limited dependent variables models. (Ciliberto and Tamer (2009) consider more general multi-player entry games.)

As in the seminal work of McFadden (1974), assume that the decision makers and alternatives are characterized by observable and unobservable vectors of real valued attributes. Denote the observable attributes by x_i, i ∈ I. These include attribute vectors that are specific to the decision maker, as well as attribute vectors x_{ic}, c ∈ 𝒴, that include components that are specific to the alternative and components that are indexed by both. Denote the unobservable attributes (preferences) by ν_i ≡ (ζ_i, {ε_{ic}, c ∈ 𝒴}), i ∈ I. These are idiosyncratic to the decision maker and similarly may include alternative and decision maker specific terms. Denote by 𝒳 and 𝒱 the supports of x and ν, respectively.

In what follows, I label "standard" a random utility model that maintains some form of exogeneity of x_i (e.g., mean or quantile or statistical independence with ν_i) and presupposes observation of data that include {(C_i, y_i, x_i) : y_i ∈* C_i}, i = 1, ..., n, with C_i the choice set faced by decision maker i and |C_i| ≥ 2. (Often C_i = D for all i ∈ I and some known D ⊆ 𝒴, although this requirement is not critical to identification analysis.)

Manski and Tamer (2002) provide inference methods for nonparametric, semiparametric, and parametric conditional expectation functions when one of the conditioning variables is interval valued. I have discussed their nonparametric and parametric sharp bounds on conditional expectations with interval valued covariates in Identification Problems 2.3 and 2.6, and Theorems SIR-2.4 and SIR-2.7, respectively. Here I focus on their analysis of semiparametric binary choice models. Compared to the generic notation set forth at the beginning of Section 3.1, I let C_i = 𝒴 = {0, 1} for all i ∈ I, and with some abuse of notation I denote the vector of observed covariates (x_L, x_U, w).

Identification Problem 3.1: Let (y, x_L, x_U, w) ∼ P be observable random variables in {0, 1} × ℝ × ℝ × ℝ^d, d < ∞, and let x ∈ ℝ be an unobservable random variable. Let y = 1(wθ + δ_1 x + ε > 0), with δ_1 > 0; further normalize δ_1 = 1, because the threshold-crossing condition is invariant to the scale of the parameters. Here ε is an unobserved heterogeneity term with continuous distribution conditional on (w, x, x_L, x_U), (w, x, x_L, x_U)-a.s., and θ ∈ Θ ⊂ ℝ^d is a parameter vector representing decision makers' preferences, with compact parameter space Θ. Assume that R, the joint distribution of (y, x, x_L, x_U, w, ε), is such that R(x_L ≤ x ≤ x_U) = 1; R(ε|w, x, x_L, x_U) = R(ε|w, x); and, for a specified α ∈ (0, 1), q^R_ε(α, w, x) = 0 and R(ε ≤ 0|w, x) = α, (w, x)-a.s. In the absence of additional information, what can the researcher learn about θ? △

Compared to Identification Problem 2.3 (see p. 23), here one continues to impose x ∈ [x_L, x_U] a.s. The sign restriction on δ_1 replaces the monotonicity restriction (M) in Identification Problem 2.3, but does not imply it unless the distribution of ε is independent of x conditional on w. The quantile independence restriction is inspired by Manski (1985).

For given θ ∈ Θ, this model yields set valued predictions because y = 1 can occur whenever ε > −wθ − x_U, whereas y = 0 can occur whenever ε ≤ −wθ − x_L, and −wθ − x_U ≤ −wθ − x_L. Conversely, observation of y = 1 allows one to conclude that ε ∈ (−wθ − x_U, +∞), whereas observation of y = 0 allows one to conclude that ε ∈ (−∞, −wθ − x_L], and these regions of possible realizations of ε overlap. In contrast, when x is observed the prediction is unique, because the value −wθ − x partitions the space of realizations of ε in two disjoint sets, one associated with y = 1 and the other with y = 0.
Figure 3.1 depicts the model's set-valued predictions for y given (w, x_L, x_U) as a function of ε, and the model's set valued predictions for ε given (w, x_L, x_U) as a function of y.

[Figure 3.1: Predicted value of y as a function of ε, and admissible values of ε for each realization of y, in Identification Problem 3.1, conditional on (w, x_L, x_U). The model predicts y = 0 when ε ≤ −wθ − x_U, either y = 0 or y = 1 when ε ∈ (−wθ − x_U, −wθ − x_L], and y = 1 when ε > −wθ − x_L; it admits ε ∈ (−∞, −wθ − x_L] when y = 0, and ε ∈ (−wθ − x_U, +∞) when y = 1.]

Why does this set-valued prediction hinder point identification? The reason is that the distribution of the observable data relates to the model structure in an incomplete manner. The model predicts

M(y = 1|w, x_L, x_U) = ∫ R(y = 1|w, x, x_L, x_U) dR(x|w, x_L, x_U) = ∫ R(ε > −wθ − x|w, x) dR(x|w, x_L, x_U), (w, x_L, x_U)-a.s.

Because the distribution R(x|w, x_L, x_U) is left completely unspecified, one can find multiple values for (θ, R(x|w, x_L, x_U), R(ε|w, x)) satisfying the assumptions in Identification Problem 3.1, such that M(y = 1|w, x_L, x_U) = P(y = 1|w, x_L, x_U), (w, x_L, x_U)-a.s. Nonetheless, in general, not all values of θ ∈ Θ can be paired with some R(x|w, x_L, x_U) and R(ε|w, x) so that they are compatible with P(y = 1|w, x_L, x_U), (w, x_L, x_U)-a.s. and with the maintained assumptions. Hence, θ can be partially identified using the information in the model and observed data.
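The incompleteness just described can be illustrated with a purely numerical sketch (not part of Manski and Tamer's analysis). All numbers below are hypothetical; ε is taken to be logistic, which is one admissible choice of R(ε|w, x) satisfying the median-zero restriction with α = 0.5. Two distinct structures (θ, R(x|w, x_L, x_U)) then generate identical observable choice probabilities:

```python
import math

def logistic_cdf(t):
    # One admissible conditional distribution for eps: Logistic(0, 1), median zero.
    return 1.0 / (1.0 + math.exp(-t))

# One covariate cell: w = 1, with x known only to lie in [x_L, x_U] = [0, 1].
w, x_L, x_U = 1.0, 0.0, 1.0

# Structure A: theta = 0.5 with x degenerate at 0.2; structure B: theta = 0.3 with
# x degenerate at 0.4. Both satisfy x in [x_L, x_U] and the quantile restriction.
p_A = logistic_cdf(w * 0.5 + 0.2)   # P(y = 1 | w, x_L, x_U) under structure A
p_B = logistic_cdf(w * 0.3 + 0.4)   # P(y = 1 | w, x_L, x_U) under structure B

# The two structures are observationally equivalent at this cell:
assert abs(p_A - p_B) < 1e-12
```

Because only the index wθ + x enters the choice probability, shifting probability mass of x within [x_L, x_U] can exactly offset a change in θ, which is the source of partial identification here.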
Theorem SIR-3.1: Under the assumptions of Identification Problem 3.1, the sharp identification region for θ is

H_P[θ] = {ϑ ∈ Θ : P((w, x_L, x_U) : {wϑ + x_L ≥ 0 ∩ P(y = 1|w, x_L, x_U) < 1 − α} ∪ {wϑ + x_U ≤ 0 ∩ P(y = 1|w, x_L, x_U) > 1 − α}) = 0}.   (3.1)

Proof.
For any ϑ ∈ Θ, define the set of possible values for the unobservable associated with (y, w, x_L, x_U), illustrated in Figure 3.1, as

E_ϑ(y, w, x_L, x_U) = (−∞, −wϑ − x_L] if y = 0, and [−wϑ − x_U, +∞) if y = 1.   (3.2)

Figure 3.1 is based on Figure 1 in Manski and Tamer (2002). See Chesher and Rosen (2019, Chapter XXX in this Volume) for an extensive discussion of the duality between the model's set valued predictions for y as a function of ε and for ε as a function of y, in both cases given the observed covariates.

Then E_ϑ(y, w, x_L, x_U) is a random closed set as per Definition A.1. To simplify notation, let E_ϑ(y) ≡ E_ϑ(y, w, x_L, x_U), suppressing the dependence on (w, x_L, x_U). Let (E_ϑ(y), w, x_L, x_U) = E_ϑ(y) × (w, x_L, x_U) = {(e, w, x_L, x_U) : e ∈ E_ϑ(y)}. If the model is correctly specified, for the data generating value θ, (ε, w, x_L, x_U) ∈ (E_θ(y), w, x_L, x_U) a.s. By Theorem A.1 and Theorem 2.33 in Molchanov and Molinari (2018), this occurs if and only if

R(ε ∈ C|w, x_L, x_U) ≥ P(E_θ(y) ⊂ C|w, x_L, x_U), (w, x_L, x_U)-a.s. ∀C ∈ F,   (3.3)

where F here denotes the collection of closed subsets of ℝ. We then have that ϑ is observationally equivalent to θ if and only if (3.3) holds for E_ϑ(y) as defined in (3.2). The condition can be rewritten as

∫ R(ε ∈ C|w, x, x_L, x_U) dR(x|w, x_L, x_U) ≥ P(E_ϑ(y) ⊂ C|w, x_L, x_U), (w, x_L, x_U)-a.s. ∀C ∈ F.

The assumption that R(ε|w, x, x_L, x_U) = R(ε|w, x) yields that the above system of inequalities reduces to

∫ R(ε ∈ C|w, x) dR(x|w, x_L, x_U) ≥ P(E_ϑ(y) ⊂ C|w, x_L, x_U), (w, x_L, x_U)-a.s. ∀C ∈ F.

Next, note that given the possible realizations of E_ϑ(y), the above inequality is trivially satisfied unless C = (−∞, t] or C = [t, ∞) for some t ∈ ℝ. Finally, the only restriction on the distribution of ε is the quantile independence condition, hence it suffices to consider t = 0. To see why this is the case, let for example t > 0 and fix a realization (w, x_L, x_U) of (w, x_L, x_U). Then for the inequality not to be trivially satisfied it must be that either wϑ + x_L ≥ −t or wϑ + x_U ≤ −t (both are not possible because wϑ + x_L ≤ wϑ + x_U). If wϑ + x_U ≤ −t, it must be that t ∈ (0, −wϑ − x_U] and −wϑ − x_U > 0. Then a distribution R such that ∫ R(ε ∈ [0, t)|w = w, x) dR(x|w = w, x_L = x_L, x_U = x_U) = 0 is always feasible for t ∈ (0, −wϑ − x_U]. A similar argument holds if wϑ + x_L ≥ −t; and also if t < 0. We then have that if the inequalities are satisfied for t = 0, they are satisfied also for t ≠ 0.

In the definition of E_ϑ(1, w, x_L, x_U) I exploit the fact that under the maintained assumptions P(ε = −wϑ − x_U|w, x, x_L, x_U) = 0 to enforce its closedness. There are no (w, x_L, x_U)-cross restrictions.

Given the possible realizations of E_ϑ(y), for t = 0 we have

1 − α ≥ P(y = 1|w, x_L, x_U) for all (w, x_L, x_U) such that wϑ + x_U ≤ 0,   (3.4)
1 − α ≤ P(y = 1|w, x_L, x_U) for all (w, x_L, x_U) such that wϑ + x_L ≥ 0.   (3.5)

Any given ϑ ∈ Θ, ϑ ≠ θ, violates the above conditions if and only if

P((w, x_L, x_U) : {wϑ + x_L ≥ 0 ∩ P(y = 1|w, x_L, x_U) < 1 − α} ∪ {wϑ + x_U ≤ 0 ∩ P(y = 1|w, x_L, x_U) > 1 − α}) > 0.

Key Insight: The analysis in Manski and Tamer (2002) systematically studies what can be learned under increasingly strong sets of assumptions. These include both assumptions that constrain the model from fully nonparametric to semiparametric to parametric, as well as assumptions that constrain the distribution of the observable covariates. For example, Manski and Tamer (2002, Corollary to Proposition 2) provide sufficient conditions on the joint distribution of (w, x_L, x_U) that allow for identification of the sign of components of θ, as well as for point identification of θ. The careful analysis of the identifying power of increasingly stronger assumptions is the pillar of the partial identification approach to empirical research proposed by Manski, as illustrated in Section 2. The work of Manski and Tamer (2002) was the first example of this kind in semiparametric structural models.
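When (w, x_L, x_U) has finite support, the "probability zero" requirement in the sharp characterization amounts to checking conditions (3.4)-(3.5) at every support point, so the region can be traced out by brute force. The following sketch uses made-up support points and choice probabilities, a scalar θ, and α = 0.5; it is an illustration, not a general-purpose implementation:

```python
import numpy as np

alpha = 0.5
# Hypothetical discrete support for (w, x_L, x_U) (columns: w, x_L, x_U), each point
# with positive probability mass, and the observed P(y = 1 | w, x_L, x_U) on each point.
support = np.array([[-1.0,  0.0, 1.0],
                    [ 1.0, -1.0, 0.0],
                    [ 1.0,  0.5, 1.5]])
p_y1 = np.array([0.30, 0.55, 0.80])

def in_sharp_region(theta):
    w, xL, xU = support.T
    # Violation of (3.5): w*theta + x_L >= 0 but P(y = 1 | .) < 1 - alpha.
    v1 = (w * theta + xL >= 0) & (p_y1 < 1 - alpha)
    # Violation of (3.4): w*theta + x_U <= 0 but P(y = 1 | .) > 1 - alpha.
    v2 = (w * theta + xU <= 0) & (p_y1 > 1 - alpha)
    # With finite support, "violations occur with probability zero" means: no
    # violation at any support point.
    return not np.any(v1 | v2)

grid = np.linspace(-3, 3, 601)
H_P = [t for t in grid if in_sharp_region(t)]
print(min(H_P), max(H_P))
```

In practice estimation and inference for such regions require the methods discussed in Section 4; the grid search above only illustrates the population-level logic of Theorem SIR-3.1.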
Revisiting Manski and Tamer's (2002) study of Identification Problem 3.1 nearly 20 years later yields important insights on the differences between point and partial identification analysis. It is instructive to take as a point of departure the analysis of Manski (1985), which under the additional assumption that (y, w, x) is observed yields

wθ + x > 0 ⇔ P(y = 1|w, x) > 1 − α.

In this case, θ is identified relative to ϑ ∈ Θ if

P((w, x) : {wθ + x ≤ 0 < wϑ + x} ∪ {wϑ + x ≤ 0 < wθ + x}) > 0.   (3.6)

Manski and Tamer extend this reasoning to the case that x is unobserved, but known to satisfy x ∈ [x_L, x_U] a.s. The first part of their analysis, collected in their Proposition 2, characterizes the collection of values that cannot be distinguished from θ on the basis of P(w, x_L, x_U) alone, through a clear generalization of (3.6):

{ϑ ∈ Θ : P((w, x_L, x_U) : {wθ + x_U ≤ 0 < wϑ + x_L} ∪ {wϑ + x_U ≤ 0 < wθ + x_L}) = 0}.   (3.7)

It is worth emphasizing that the characterization in (3.7) depends on θ, and makes no use of the information in P(y|w, x_L, x_U). The Corollary to Proposition 2 yields conditions on (w, x_L, x_U) under which either the sign of components of θ, or θ itself, can be identified, regardless of the distribution of y|w, x_L, x_U.

This Corollary is related in spirit to the analysis in Manski (1988b).

Manski and Tamer (2002, Lemma 1) provide a second characterization, which presupposes knowledge of P(y, w, x_L, x_U), yields a set smaller than the one in (3.7), and coincides with the result in Theorem SIR-3.1. Manski and Tamer (2002) use the same notation for the two sets, although the sets are conceptually and mathematically distinct. The result in Theorem SIR-3.1 is due to Manski and Tamer (2002, Lemma 1), but the proof provided here is new, as is the use of random set theory in this application.
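The region in (3.7), by contrast, can be computed from the distribution of (w, x_L, x_U) and a hypothesized data generating value θ alone, with no reference to P(y|w, x_L, x_U). A small sketch with hypothetical support points and a scalar θ:

```python
import numpy as np

# Hypothetical support of (w, x_L, x_U) (columns: w, x_L, x_U), each point with
# positive mass; theta_0 plays the role of the data generating value in (3.7).
support = np.array([[-1.0,  0.0, 1.0],
                    [ 1.0, -1.0, 0.0],
                    [ 1.0,  0.5, 1.5]])
theta_0 = 1.0

def potentially_obs_equivalent(theta):
    w, xL, xU = support.T
    # (3.7): theta is in the region iff the event
    # {w theta_0 + x_U <= 0 < w theta + x_L} or {w theta + x_U <= 0 < w theta_0 + x_L}
    # has probability zero; with finite support, no support point may trigger it.
    a = (w * theta_0 + xU <= 0) & (w * theta + xL > 0)
    b = (w * theta + xU <= 0) & (w * theta_0 + xL > 0)
    return not np.any(a | b)

grid = np.linspace(-3, 3, 601)
outer = [t for t in grid if potentially_obs_equivalent(t)]
print(min(outer), max(outer))
```

Because (3.7) ignores the conditional distribution of y, the set it delivers is (weakly) larger than the sharp region in (3.1), illustrating the gap between potential observational equivalence and observational equivalence discussed next.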
Key Insight: The preceding discussion allows me to draw a novel connection between the two characterizations in Manski and Tamer (2002), and the distinction put forward by Chesher and Rosen (2017a) and Chesher and Rosen (2019, Chapter XXX in this Volume, Definition 2) in partial identification between potential observational equivalence and observational equivalence. Applying Chesher and Rosen's definition, parameter vectors θ and ϑ are potentially observationally equivalent if there exists some distribution of y|w, x_L, x_U for which conditions (3.4)-(3.5) hold. Simple algebra confirms that this yields the region in (3.7). This notion of potential observational equivalence parallels one of the notions used to obtain sufficient conditions for point identification in the semiparametric literature (as in, e.g., Manski, 1985). Both notions, as explained in Chesher and Rosen (2019, Section 4.1), make no reference to the conditional distribution of outcomes given covariates delivered by the process being studied. To obtain that parameters θ and ϑ are observationally equivalent one requires instead that conditions (3.4)-(3.5) hold for the observed distribution P(y = 1|w, x_L, x_U) (as opposed to "for some distribution" as in the case of potential observational equivalence). This yields the sharp identification region in (3.1).

This was confirmed in personal communication with Chuck Manski and Elie Tamer.

The proof closes a gap in the argument in Manski and Tamer (2002) connecting their Proposition 2 and Lemma 1, due to the fact that for a given ϑ the sets {(w, x_L, x_U) : {wθ + x_U ≤ 0 < wϑ + x_L} ∪ {wϑ + x_U ≤ 0 < wθ + x_L}} and {(w, x_L, x_U) : {0 < wϑ + x_L ∩ P(y = 1|w, x_L, x_U) ≤ 1 − α} ∪ {wϑ + x_U ≤ 0 ∩ P(y = 1|w, x_L, x_U) > 1 − α}} need not coincide, with the former being a subset of the latter due to part (c) of the proof of Proposition 2 in Manski and Tamer (2002).

This distinction echoes the distinction drawn by Manski (1988a, Section 1.1.1) between point identification and uniform point identification. Manski considers a scenario where a parameter vector of interest θ is defined as the solution to an equation of the form q_P(θ) = 0 for some criterion function q_P : Θ ↦ ℝ₊. Then θ is point identified relative to (P, Θ) if it is the unique solution to q_P(θ) = 0. It is uniformly point identified relative to (P, Θ), with P a space of probability distributions to which P belongs, if for every P̃ ∈ P, q_P̃(ϑ) = 0 has a unique solution.

Manski (2010) studies random expected utility models, where agents choose the alternative that maximizes their expected utility. The core difference with standard models is that Manski does not fully specify the subjective beliefs that agents use to form their expectations, but only a set of such beliefs. Manski shows that the resulting, partially identified, discrete choice model can be formulated similarly to how Manski and Tamer (2002) treat interval valued covariates.

Magnac and Maurin (2008) consider a different but closely related model to the semiparametric binary response model studied by Manski and Tamer. They assume that an instrumental variable z is available, that ε is independent of x conditional on (w, z), and that Corr(z, ε) = 0. They assume that the distribution of x is absolutely continuous with support [v₁, v_k], and that x is not a deterministic linear function of (w, z). They consider the case that x is unobserved but known to belong to one of the fixed (and known) intervals [v_i, v_{i+1}), i = 1, . . . , k −
1, with R[x ∈ [v_i, v_{i+1})|w, z] > 0 for all i. Finally, they assume that (−wθ − ε) ∈ [v₁, v_k] with probability one. They do not, however, make quantile independence assumptions.

Their point of departure is the fact that under these conditions, if x were observed, one could employ a transformation proposed by Lewbel (2000) for the binary outcome y, such that θ can be identified through a simple linear moment condition. Specifically, let

ỹ = (y − 1(x > 0)) / f_x(x|w, z),

where f_x(·|w, z) is the conditional density function of x. Then, using the assumption that z and ε are uncorrelated, one has

E_P(zỹ) − E_P(zw⊤)θ = 0.   (3.8)

With interval valued x, Magnac and Maurin (2008) denote by x* the random variable that takes value i ∈ {1, . . . , k − 1} if x ∈ [v_i, v_{i+1}), so that the observed data are draws from the joint distribution of (y, w, z, x*). They let δ(x*) = v_{x*+1} − v_{x*} denote the length of the x*-th interval, and define the transformed outcome variable

y* = (δ(x*) / P(x*|w, z)) y − v_k.

The assumptions on x yield that, given z and w, ε does not depend on x*. Moreover, P(y = 1|x*, w, z) is non-decreasing in x* and F_ε(·|z, w, x, x*) = F_ε(·|z, w). Magnac and Maurin (2008) show that the sharp identification region for θ is

H_P[θ] = E_P(zw⊤)⁻¹ E_P(zy* + zU),   (3.9)

where E_P(zy* + zU) is the Aumann (or selection) expectation of the random interval zy* + zU, see Definition A.4, with

U = [ −Σ_{i=1}^{k−1} (r_i(w, z) − r_{i−1}(w, z))(v_{i+1} − v_i) ,  Σ_{i=1}^{k−1} (r_{i+1}(w, z) − r_i(w, z))(v_{i+1} − v_i) ].

Beresteanu, Molchanov, and Molinari (2011, Supplementary Appendix F) extend the analysis of Manski and Tamer (2002) to multinomial choice models with interval covariates.
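As a numerical sketch of the interval U, its endpoints are simple weighted sums of increments of r_i(w, z) ≡ P(y = 1|x* = i, w, z), with the conventions r₀ = 0 and r_k = 1. The values below are hypothetical, evaluated at one fixed (w, z) cell with k = 4:

```python
import numpy as np

# Interval endpoints v_1, ..., v_k (k = 4) and hypothetical nondecreasing values of
# r_i(w, z) = P(y = 1 | x* = i, w, z) at one (w, z) cell, with r_0 = 0 and r_k = 1.
v = np.array([0.0, 1.0, 2.0, 3.0])
r = np.array([0.0, 0.2, 0.5, 0.8, 1.0])       # r_0, r_1, r_2, r_3, r_4

lengths = np.diff(v)                           # v_{i+1} - v_i for i = 1, ..., k-1
u_lo = -np.sum((r[1:4] - r[0:3]) * lengths)    # lower endpoint of U
u_hi =  np.sum((r[2:5] - r[1:4]) * lengths)    # upper endpoint of U
print(u_lo, u_hi)
```

The sharp region (3.9) then aggregates such intervals across realizations of (w, z) through the selection expectation of zy* + zU, with z multiplying and possibly flipping each interval; this sketch only illustrates the building block U at a single cell.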
In this expression, r_{x*}(w, z) ≡ P(y = 1|x*, w, z) and by convention r₀(w, z) = 0 and r_k(w, z) = 1, see Magnac and Maurin (2008, Theorem 4). If r_i(w, z), i = 0, . . . , k, were observed, this characterization would be very similar to the one provided by Beresteanu and Molinari (2008) for Identification Problem 2.4, see equation (2.25). However, these random functions need to be estimated. While the first-stage estimation of r_i(w, z), i = 0, . . . , k, does not affect the identification arguments, it does complicate inference, see Chandrasekhar, Chernozhukov, Molinari, and Schrimpf (2018) and the discussion in Section 4.

Whereas the standard random utility model presumes some form of exogeneity for x, in practice often some explanatory variables are endogenous. This problem has been addressed in the literature to obtain point identification of the model through a combination of several assumptions, including large support conditions, special regressors, control function restrictions, and more (see, e.g., Matzkin, 1993; Berry, Levinsohn, and Pakes, 1995; Lewbel, 2000; Petrin and Train, 2010). Hong and Tamer (2003b) analyze the distinct but related problem of identification in a censored regression model with endogenous explanatory variables, and provide sufficient conditions for point identification.

Here I discuss how to carry out identification analysis in the absence of such assumptions when instrumental variables z are available, as proposed by Chesher, Rosen, and Smolinski (2013). They consider a more general case than I do here, with utility function that is not parametrically specified and not restricted to be separable in the unobservables. Even in that more general case, the identification analysis follows through similar steps as reported here.

Identification Problem 3.2: Let (y, x, z) ∼ P be observable random variables in Y × X × Z. Let all members of the population face the same choice set Y.
Suppose that each alternative has one unobservable attribute ε_c, c ∈ Y, and let ν ≡ (ε_{c₁}, . . . , ε_{c_|Y|}). Let ν ∼ Q and assume that ν ⊥⊥ z. Suppose Q belongs to a nonparametric family of distributions T, and that the conditional distribution of ν|x, z, denoted R(ν|x, z), is absolutely continuous with respect to Lebesgue measure with everywhere positive density on its support, (x, z)-a.s. Suppose utility is separable in unobservables and known up to parameter vector θ ∈ Θ ⊂ ℝ^m, so that u_i(c) = g(x_c; θ) + ε_c, (x_c, ε_c)-a.s., for all c ∈ Y. Maintain the normalizations g(x_{c_|Y|}; θ) = 0 for all θ ∈ Θ and all x ∈ X, and g(x̄_c; θ) = ḡ for known (x̄_c, ḡ) for all θ ∈ Θ and c ∈ Y. Given (x, z, ν), suppose y is the utility maximizing choice in Y. In the absence of additional information, what can the researcher learn about (θ, Q)? △

The estimator that they propose extends the minimum distance estimator put forward by Manski and Tamer (2002), see Section 4.2, so that if the conditions required for point identification do not hold, it estimates the parameter's identification region (under regularity conditions). Hong and Tamer (2003a) carry out a similar analysis for the binary choice model with endogenous explanatory variables.

Compared to the general model put forward in Section 3.1, in this model there are no preference heterogeneity terms ζ (random coefficients) that vary only across decision makers.

The key challenge to identification here results because the distribution of ν can vary across different values of x, both conditional and unconditional on z. Why does this fact hinder point identification? For a given ϑ ∈ Θ and for any c ∈ Y and x ∈ X, the model yields that c is optimal, and hence chosen, if and only if ν realizes in the set

E_ϑ(c, x) = {e ∈ V : g(x_c; ϑ) + e_c ≥ g(x_d; ϑ) + e_d ∀d ∈ Y}.   (3.10)

Figure 3.2 plots the set E_ϑ(y, x) in a stylized example with Y = {1, 2, 3} and X = {x⁰, x¹}, as a function of (ε₁ − ε₃, ε₂ − ε₃). Consider the model implied distribution, denoted M below, of the optimal choice. Then, recalling the restriction z ⊥⊥ ν, we have

M(c|x ∈ R_x, z; ϑ) = ∫_{x ∈ R_x} R(E_ϑ(c, x)|x = x, z) dP(x|z), ∀R_x ⊆ X, z-a.s.,   (3.11)

while the independence restriction ν ⊥⊥ z requires

Q(F) = ∫_{x ∈ X} R(F|x = x, z) dP(x|z), ∀F ⊆ V, z-a.s.   (3.12)

Because the joint distribution of (x, ν) conditional on z is left completely unrestricted (other than (3.12)), one can find multiple triplets (ϑ, Q, R(ν|x, z)) satisfying the maintained assumptions and with M(c|x ∈ R_x, z; ϑ) = P(c|x ∈ R_x, z) for all c ∈ Y and R_x ⊆ X, z-a.s.

It is instructive to compare (3.11)-(3.12) with McFadden's (1974) conditional logit. Under the standard assumptions, x ⊥⊥ ν, so that no instrumental variables are needed. This yields Q(ν) = R(ν|x), x-a.s., and in addition Q is typically known, with corresponding simplifications in (3.11). The resulting system of equalities can be inverted under standard order and rank conditions to yield point identification of θ.

Further insights can be gained by looking at Figure 3.2. As the value of x changes from x⁰ to x¹, the region of values where, say, alternative 1 is optimal changes. When x is exogenous, say independent of ν, this yields a system of equalities relating (θ, Q) to the observed distribution P(y, x) which, as stated above, can be inverted to obtain point identification. When x is endogenous, this reasoning breaks down because the conditional distribution R(ν|x, z) may change across realizations of x. Figure 3.2 also offers an instructive way to connect Identification Problem 3.2 with the identification problem studied in Section 2.

Of course, under these conditions one can work directly with utility differences. To try and economize on
notation, I do not explicitly do so here.

This figure is based on Figures 1-3 in Chesher, Rosen, and Smolinski (2013).

[Figure 3.2: The set E_ϑ in equation (3.10) and the corresponding admissible values for (y, x) as a function of (ε₁ − ε₃, ε₂ − ε₃), under the simplifying assumption that X = {x⁰, x¹} and Y = {1, 2, 3}. Panels (a)-(c) depict E_ϑ(y, x) for y = 1, 2, 3, with boundaries determined by the differences ḡ(x; ϑ). The admissible values for (y, x) are {(c, x⁰)} in the gray area, and {(c, x¹)} in the area with vertical lines. Because the two areas overlap, the model has set-valued predictions for (y, x).]

In the problem studied in Section 2, the model has set-valued predictions for the outcome variable given realizations of the covariates and unobserved heterogeneity terms, which overlap across realizations of the unobserved heterogeneity terms. In the problem studied here, the model has singleton-valued predictions for the outcome variable of interest y as a function of the observable explanatory variables x and unobservables ν. However, for given realization of ν, the model admits sets of values for the endogenous variables (y, x), which overlap across realizations of ν. Because the model is silent on the joint distribution of (x, ν) (except for requiring that the marginal distribution of ν does not depend on z), partial identification results.

It is possible to couple the maintained assumptions with the observed data to learn features of (θ, Q). Because the observed choice y is assumed to maximize utility, for the data generating (θ, Q) the model yields

ν ∈ E_θ(y, x) a.s.,   (3.13)

with E_θ(y, x) a random closed set as per Definition A.1. Equation (3.13) exhausts the modeling content of Identification Problem 3.2.
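Condition (3.13) lends itself to simulation: drawing (x, ν) and generating y as the utility maximizing choice, the realized ν always lies in the set E_θ(y, x). A minimal sketch, with a hypothetical payoff table g (the last alternative normalized to zero) and standard normal ν:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stylized design: Y = {0, 1, 2}, X = {0, 1}, u(c) = g[x][c] + nu_c; the table g
# is hypothetical, with g[x][2] = 0 as the normalization.
g = {0: np.array([0.5, 0.2, 0.0]), 1: np.array([-0.3, 0.4, 0.0])}

def in_E_region(c, x, nu):
    """1 if nu lies in E_theta(c, x), i.e. alternative c is optimal at x (eq. 3.10)."""
    u = g[x] + nu
    return int(np.argmax(u) == c)

# Draw (x, nu) and generate the utility maximizing choice y; condition (3.13) holds
# by construction (ties occur with probability zero for continuous nu).
for _ in range(1000):
    x = int(rng.integers(0, 2))
    nu = rng.normal(size=3)
    y = int(np.argmax(g[x] + nu))
    assert in_E_region(y, x, nu) == 1
```

The empirical content of (3.13) is then extracted through containment probabilities of the form P(E_ϑ(y, x) ⊆ F|z), which in a discrete-x design like this one reduce to weighted sums over the finitely many regions E_ϑ(c, x).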
Theorem A.1 (as expressed in (A.5)) can then be leveraged to extract its empirical content from the observed distribution P(y, x, z). As a preparation for doing so, note that for given F ∈ F (with F the collection of closed subsets of V) and ϑ ∈ Θ, we have

P(E_ϑ(y, x) ⊆ F|z) = ∫_{x ∈ X} Σ_{c ∈ Y} 1(E_ϑ(c, x) ⊆ F) P(y = c|x = x, z) dP(x|z),

so that this probability can be learned from the observed data.

Theorem SIR-3.2: Under the assumptions of Identification Problem 3.2, the sharp identification region for (θ, Q) is

H_P[θ, Q] = {ϑ ∈ Θ, Q̃ ∈ T : Q̃(F) ≥ P(E_ϑ(y, x) ⊆ F|z), ∀F ∈ F, z-a.s.}.   (3.14)

Proof.
To simplify notation, I write E_ϑ ≡ E_ϑ(y, x). Let (E_ϑ, x, z) = {(e, x, z) : e ∈ E_ϑ}. If the model is correctly specified, (ν, x, z) ∈ (E_θ, x, z) a.s. for the data generating value of (θ, Q). Using Theorem A.1 and Theorem 2.33 in Molchanov and Molinari (2018), it follows that (ϑ, Q̃) is observationally equivalent to (θ, Q) if and only if

Q̃(F|x, z) ≥ P(E_ϑ(y, x) ⊆ F|x, z), ∀F ∈ F, (x, z)-a.s.

As the distribution of ν is only restricted so that ν ⊥⊥ z, one can integrate both sides of the inequality with respect to x. The final result follows because Q̃ does not depend on z.

While Theorem SIR-3.2 relies on checking inequality (3.14) for all F ∈ F, the results in Chesher, Rosen, and Smolinski (2013, Theorem 2) and Molchanov and Molinari (2018, Chapter 2) can be used to obtain a smaller collection of sets over which to verify it. In particular, if x has a discrete distribution, it suffices to use a finite collection of sets. For example, in the case depicted in Figure 3.2 with X = {x⁰, x¹}, Chesher, Rosen, and Smolinski (2013, Section 3.3 of the 2011 CeMMAP working paper version CWP39/11) show that H_P[θ, Q] is obtained by checking at most twelve inequalities in (3.14). The left hand side of these inequalities is a linear function of six values that the distribution Q̃ assigns to each of the component regions depicted in Figure 3.2 (the one where E_ϑ(1, x⁰) ∩ E_ϑ(1, x¹) realizes; the one where E_ϑ(1, x⁰) ∩ E_ϑ(3, x¹) realizes; etc.). Hence, in this example, (ϑ, Q̃) ∈ H_P[θ, Q] if and only if Q̃ assigns to these six regions a probability mass such that for ϑ the twelve inequalities characterized by Chesher, Rosen, and Smolinski hold.

Key Insight: A conceptual contribution of Chesher, Rosen, and Smolinski (2013) is to show that one can frame models with endogenous explanatory variables as incomplete models. Incompleteness here results from the fact that the model does not specify how the endogenous variables x are determined. One can then think of these as models with set-valued predictions for the endogenous variables (y and x in this application), even though the outcome of the model (y) is uniquely predicted by the realization of the observed explanatory variables (x) and the unobserved heterogeneity terms (ν). Random set theory can again be leveraged to characterize sharp identification regions.

Chesher and Rosen (2019, Chapter XXX in this Volume) discuss related generalized instrumental variables models where random set methods are used to obtain characterizations of sharp identification regions in the presence of endogenous explanatory variables.

3.1.3 Unobserved Heterogeneity in Choice Sets and/or Consideration Sets
Compared to the general framework set forth at the beginning of Section 3.1, as pointed out in Manski (1977), often the researcher observes (y_i, x_i) but not C_i, i = 1, . . . , n. Even when C_i is observable, the researcher may be unaware of which of its elements the decision maker actually evaluates before selecting one. In what follows, to shorten expressions, I refer to both the measurement problem of unobserved choice sets and the (cognitive) problem of limited consideration as "unobserved heterogeneity in choice sets."

Learning features of preferences using discrete choice data in the presence of unobserved heterogeneity in choice sets is a formidable task. When a decision maker chooses an alternative, this may be because her choice set equals the feasible set and the chosen alternative is the one yielding the highest utility. Then observed choice reveals preferences. But it can also be that the decision maker has access to/considers only the chosen alternative (e.g., Block and Marschak, 1960, p. 99). Then observed choice is driven entirely by choice set composition, and is silent about preferences. A plethora of scenarios between these extremes is possible, but the researcher does not know which has generated the observed data. This fundamental identification problem calls either for restrictions on the random utility model and consideration set formation process, or for collection of richer data that eliminates unobserved heterogeneity in C_i or allows for enhanced modeling of it (see, e.g., Caplin, 2016).

A sizable literature spanning behavioral economics, econometrics, experimental economics, marketing, microeconomics, and psychology has put forward different models to formalize the complex process that leads to the formation of the set of alternatives that the agent considers or can choose from (see, e.g., Simon, 1959; Howard, 1963; Tversky, 1972, for early contributions).
Manski (1977) proposes both a general econometric model where decision makers draw choice sets from an unknown distribution, as well as a specific model of choice set formation, independent from preferences, and studies their implications for the distributional structure of random utility models.

The specific model in Manski (1977, Section II-A) is often used in applications. It posits that each alternative c ∈ Y enters the decision maker's choice set with probability φ_c, independently of the other alternatives. The probability φ_c may depend on observable individual characteristics, and φ_c = 1 for at least one option c ∈ Y (the "default" good). These assumptions are akin to assumptions about selection mechanisms in models with multiple equilibria. The latter are discussed further below in Section 3.2.1, along with their criticisms.

However, assumptions about the choice set formation process are often rooted in a desire to achieve point identification rather than in information contained in the model or observed data. It is then important to ask what can be learned about decision makers' preferences under minimal assumptions on the choice set formation process. Allowing for unrestricted dependence between choice sets and preferences, while challenging for identification analysis, is especially relevant. Indeed, decision makers' unobserved attributes may determine both their preferences and which items in the feasible set they pay attention to or are available to them (e.g., through unobserved liquidity constraints, unobserved characteristics such as religious preferences in the context of school choice, or behavioral phenomena such as aversion to extremes, salience, etc.). Here I use the framework put forward by Barseghyan, Coughlin, Molinari, and Teitelbaum (2019) to study identification of discrete choice models with unobserved heterogeneity in choice sets and preferences.

[Figure 3.3: Predicted value of y in Identification Problem 3.3 as a function of ν for κ = |Y| − 1. In this case, C = Y \ {c} for some c ∈ Y, and the model predicts either the first or the second best alternative in Y: between consecutive thresholds ν̄_{j,m} the model predicts y = c₁ or y = c₂, y = c₂ or y = c₃, . . . , y = c_{|Y|−1} or y = c_{|Y|}.]

Identification Problem 3.3: Let (y, x) ∼ P be observable random variables in Y × X. Assume that there exists a real valued function g, which for simplicity I posit known up to parameter δ ∈ ∆ ⊂ ℝ^m and continuous in its second argument, such that u_i(c) = g(x_{ic}, ν_i; δ), (x_{ic}, ν_i)-a.s., for all c ∈ Y, i ∈ I, where x_{ic} denotes the vectors of attributes relevant to alternative c, and includes attributes that are alternative invariant and ones that are alternative specific (respectively, x_i and x_{ic} in the general notation laid out in Section 3.1). Suppose that y = argmax_{c ∈ C} g(x_c, ν; δ), where ties are assumed to occur with probability zero and C is an unobservable choice set drawn from the subsets of Y according to some unknown probability distribution. Suppose R(|C| ≥ κ) = 1 for some known constant κ ≥
2. Let Q denote the distribution of ν, and assume that it is known up to a finite dimensional parameter γ ∈ Γ ⊂ ℝ^k. For simplicity, assume that ν ⊥⊥ x. In the absence of additional information, what can the researcher learn about θ ≡ [δ; γ]? △

This assumption can be relaxed as discussed in Matzkin (2007). The procedure proposed here can also be adapted to allow for endogenous explanatory variables as in Section 3.1.2 by combining the results in Barseghyan, Coughlin, Molinari, and Teitelbaum (2019) with those in Chesher, Rosen, and Smolinski (2013).

The model just laid out has set valued predictions for the decision maker's optimal choice, because different alternatives might be optimal depending on which choice set the decision maker draws. Figure 3.3, which is based on the analysis in Barseghyan, Coughlin, Molinari, and Teitelbaum (2019), illustrates the set valued predictions in a stylized example. In the figure ν is assumed to be a scalar; ν̄_{j,m} denotes the threshold value of ν above which c_j yields higher utility than c_m and below which c_m yields higher utility than c_j (the threshold's dependence on (x; δ) is suppressed for notational convenience). Consider the case that ν ∈ [ν̄_{2,3}, ν̄_{1,2}], so that c₂ is the option yielding the highest utility among all options in Y. When κ = |Y| − 1, the agent may draw a choice set that does not include one of the alternatives in Y. If the excluded alternative is not c₂ (or if C realizes equal to Y), the model predicts that the decision maker chooses c₂. If C realizes equal to Y \ {c₂}, the model predicts that the decision maker chooses the second best: c₁ if ν ∈ [ν̄_{1,3}, ν̄_{1,2}], and c₃ if ν ∈ [ν̄_{2,3}, ν̄_{1,3}]. Conversely, observation of y = c₁ allows one to conclude that ν ≥ ν̄_{1,3}, and y = c₂ that ν ≥ ν̄_{2,3}, with ν̄_{2,3} ≤ ν̄_{1,3}, and these regions of possible realizations of ν overlap.

Why does this set valued prediction hinder point identification? The reason is similar to the explanation given for Identification Problem 3.1: the distribution of the observable data relates to the model structure in an incomplete manner, because the distribution of the (unobserved) choice sets is left completely unspecified. Barseghyan, Coughlin, Molinari, and Teitelbaum (2019) show that one can find multiple candidate distributions for C and parameter vectors ϑ, such that together they yield a model implied distribution for y|x that matches P(y|x), x-a.s.

Barseghyan, Coughlin, Molinari, and Teitelbaum propose to work directly with the set of model implied optimal choices given (x, ν) associated with each possible realization of C, which is depicted in Figure 3.3 for a specific example. The key idea is that, according to the model, the observed choice maximizes utility among the alternatives in C. Hence, for the data generating value of θ, it belongs to the set of model implied optimal choices. With this, the authors are able to characterize H_P[θ] through Theorem A.1 as the collection of parameter vectors that satisfy a finite number of conditional moment inequalities.
Key Insight: Barseghyan, Coughlin, Molinari, and Teitelbaum (2019) show that working directly with the set of model implied optimal choices given (x, ν) allows one to dispense with considering all possible distributions of choice sets that are allowed for in Identification Problem 3.3 to complete the model. Such distributions may depend on ν even after conditioning on observables and may constitute an infinite dimensional nuisance parameter, which creates great difficulties for the computation of H_P[θ] and for inference.

Identification Problem 3.3 sets up a structure where preferences include idiosyncratic components ν that are decision maker specific and can depend on C, and where heterogeneity in C can be driven either by a measurement problem, or by the decision maker's limited attention to the options available to her. However, for computational and finite sample inference reasons, it restricts the family of utility functions to be known up to a finite dimensional parameter vector δ.

A rich literature in decision theory has analyzed a different framework, where the decision maker's choice set is observable to the researcher, but the decision maker does not consider all alternatives in it (for recent contributions see, e.g., Masatlioglu, Nakajima, and Ozbay, 2012; Manzini and Mariotti, 2014). In this literature, the utility function is left completely unspecified, so that interest focuses on identification of preference orderings of the available options. Unobserved heterogeneity in preferences is assumed away, so that heterogeneous choice is driven by randomness in consideration sets. If the consideration set formation process is left unspecified or is subject only to weak restrictions, point identification of the preference orderings is not possible even if preferences are homogeneous and the researcher observes a representative agent facing multiple distinct choice problems with varying choice sets.
Cattaneo, Ma, Masatlioglu, and Suleymanov (2019) propose a general model for the consideration set formation process where the only restriction is a weak and intuitive monotonicity condition: the probability that any particular consideration set is drawn does not decrease when the number of possible consideration sets decreases. Within this framework, they provide revealed preference theory and testable implications for observable choice probabilities.

Identification Problem 3.4: Let (y, C) ∼ P be a pair of observable random variable and random set in Y × D, where D = {D : D ⊆ Y} \ ∅. (Here I omit observable covariates x for simplicity.) Let µ : D × D → [0, 1] denote an attention rule such that µ(A|G) ≥ 0 for all A ⊆ G, µ(A|G) = 0 for all A ⊄ G, and Σ_{A⊆G} µ(A|G) = 1 for all G ∈ D. Assume that for any b ∈ G \ A,

µ(A|G) ≤ µ(A|G \ {b}), (3.15)

and that the decision maker has a strict preference ordering ≻ on Y. (Specifically, ≻ is an asymmetric, transitive and complete binary relation.) In the absence of additional information, what can the researcher learn about ≻? △

Cattaneo, Ma, Masatlioglu, and Suleymanov (2019) posit that an observed distribution of choice P(y|C) has a random attention representation, and hence they name it a random attention model, if there exists a preference ordering ≻ over Y and a monotonic attention rule µ such that

p(c|G) ≡ P(y = c | C = G) = Σ_{A⊆G} 𝟙(c is ≻-best in A) µ(A|G), ∀c ∈ G, ∀G ∈ D. (3.16)

The sharp identification region for the preference ordering, denoted H_P[≻] henceforth, is given by the collection of preference orderings for which one can find a monotonic attention rule to pair it with, so that (3.16) holds.

Of course, an observed distribution of choice can be represented by multiple preference orderings and attention rules. The authors, however, show in their Lemma 1 that if for some G ∈ D with {b, c} ⊆ G,

p(c|G) > p(c|G \ {b}), (3.17)

then c ≻ b for any ≻ for which one can find a monotonic attention rule µ such that (3.16) holds. (By transitivity, a ≻ b if in addition to the above condition one has p(a|G′) > p(a|G′ \ {c}) for some c ∈ G′ and G′ ∈ D.) The authors further show in their Theorem 1 that the collection of preference relations associated with all possible instances of (3.17), for all c ∈ G and G ∈ D, yields all information about preferences given the observed choice probabilities.
This yields a system of linear inequalities in p(c|G) that fully characterizes H_P[≻]. Let p⃗ denote the vector with elements [p(c|G) : c ∈ G, G ∈ D] and Π_≻ denote a conformable matrix collecting the constraints on P(y|C) embodied in (3.17) and its generalizations based on transitive closure. Then

H_P[≻] = {≻ : Π_≻ p⃗ ≤ 0}. (3.18)

The authors show that for any given preference ordering ≻, the matrix Π_≻ characterizing whether ≻ ∈ H_P[≻] through the system of linear inequalities in (3.18) is unique, and they provide a simple algorithm to compute it. They also show that mild additional assumptions, such as, for example, that decision makers facing binary choice sets pay attention to both alternatives frequently enough, can substantially increase the informational content of the data (i.e., substantially tighten H_P[≻]).

Key Insight: Cattaneo, Ma, Masatlioglu, and Suleymanov (2019) show that learning features of preference orderings in Identification Problem 3.4 requires the existence in the data of choice problems where the choice probabilities satisfy (3.17). The latter is a violation of the principle of "regularity" (Luce and Suppes, 1965), according to which the probability of choosing an alternative from any set is at least as large as the probability of choosing it from any of its supersets. Regularity is a monotonicity property of choice probabilities, and it is implied by a wide array of models of decision making. The monotonicity of attention rules in (3.15) can be viewed as regularity of the process that chooses a consideration set from the subsets of the choice set. Cattaneo, Ma, Masatlioglu, and Suleymanov (2019) show that it is implied by various models of limited attention.
While the violation required in (3.17) is weak in that it needs only to occur for some G, it sheds a different light on the severity of the identification problem described at the beginning of this section. Regularity of choice probabilities and (partial) identification of preference orderings can co-exist only under restrictions on the consideration set formation process that are stronger than the regularity of attention rules in (3.15).

Abaluck and Adams (2018) and Barseghyan, Molinari, and Thirkettle (2019) provide different sets of sufficient conditions for point identification of models of limited consideration. In both cases, the authors posit specific models of consideration set formation and provide sufficient conditions for point identification under exclusion and large support assumptions. Abaluck and Adams (2018) assume that unobserved heterogeneity in preferences and in consideration sets are independent. They exploit violations of Slutsky symmetry that result from inattention, assuming that for each alternative there is an observable characteristic with large support that does not affect the consideration probability of the other options. Barseghyan, Molinari, and Thirkettle (2019) provide a thorough analysis of the extent of dependency between consideration and preferences under which semi-nonparametric point identification of the distribution of preferences and consideration attains. They exploit a requirement of standard economic theory –the Spence-Mirrlees single crossing property of utility functions– coupled with a mild strengthening of the classic conditions for semi-nonparametric identification of discrete choice models with full consideration and identical choice sets (see, e.g., Matzkin, 2007), assuming that there is at least one decision maker-specific characteristic with large support that affects utility but not consideration.
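Returning to the system in (3.18): the constraints collected in Π_≻ come from instances of (3.17) and their transitive closure. A minimal sketch of that machinery (the menus and choice probabilities below are made up purely for illustration): a regularity violation reveals a preference pair, and transitivity propagates it.

```python
from itertools import product

def revealed_preferences(p):
    """Collect pairs (c, b) with c revealed preferred to b via a regularity
    violation as in (3.17): p(c | G) > p(c | G \\ {b}) for some menu G, then
    close the revealed pairs under transitivity.
    `p` maps (alternative, frozenset menu) -> choice probability."""
    menus = {G for (_, G) in p}
    revealed = set()
    for G in menus:
        for c, b in product(G, G):
            if c != b and (G - {b}) in menus \
                    and p.get((c, G), 0.0) > p.get((c, G - {b}), 0.0):
                revealed.add((c, b))   # dropping b lowered c's choice probability
    changed = True
    while changed:                     # transitive closure: c > b, b > a => c > a
        changed = False
        for (c, b), (b2, a) in product(list(revealed), list(revealed)):
            if b == b2 and (c, a) not in revealed:
                revealed.add((c, a))
                changed = True
    return revealed

# Hypothetical choice data: removing z from {x, y, z} lowers x's choice
# probability (0.5 -> 0.4), a regularity violation revealing x preferred to z.
G3, G2 = frozenset({"x", "y", "z"}), frozenset({"x", "y"})
p = {("x", G3): 0.5, ("y", G3): 0.3, ("z", G3): 0.2,
     ("x", G2): 0.4, ("y", G2): 0.6}
print(revealed_preferences(p))   # {('x', 'z')}
```

Each revealed pair contributes a row of Π_≻ for every candidate ordering it contradicts; orderings contradicted by no instance of (3.17) remain in H_P[≻].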
Building on Marschak (1960), Manski (2007b) studies a question related to but distinct from those in Identification Problems 3.3-3.4. He is concerned with prediction of choice behavior when decision makers face counterfactual choice sets. Manski frames this question as one of predicting treatment response (see Section 2.2). Here the collection of potential treatments is given by D, the nonempty subsets of the universe of feasible alternatives Y, and the response function specifies the alternative chosen by a decision maker when facing choice set G ∈ D. Manski assumes that the researcher observes realized choice sets and chosen alternatives, (y, C) ∼ P. Under the standard assumptions laid out at the beginning of Section 3.1, specifically if utility functions are (say) linear in ε_ic and the distribution of ε_ic is (say) Type I extreme value or multivariate normal, prediction of choice behavior with counterfactual choice sets is immediate (and point identified). Manski, however, leaves utility functions completely unspecified, and in fact works directly with preference orderings, which he labels decision maker's types. He places no restriction on the distribution of preference types, except requiring that they are independent of the observed choice sets. Manski shows that under these rather weak assumptions, the distribution of predicted choices from counterfactual choice sets can be partially identified, and characterized as the solution to linear programs. Specifically, let y*(G) denote the decision maker's optimal choice when facing choice set G ∈ D. Assume y*(·) ⊥⊥ C, and let y_k denote the choice function for a decision maker of type k –that is, a decision maker with a specific preference ordering labeled k. One example of such a preference ordering might be c_1 ≻ c_2 ≻ ··· ≻ c_{|Y|}. A decision maker of this type chooses, from any choice set she faces, the alternative with the smallest index.
Let K denote the set of logically possible types, and θ_k the probability that a decision maker in the population is of type k. (Here I suppress covariates for simplicity.) Suppose that the researcher posits a behavioral model specifying K, {y_k, k = 1, ..., |K|}, and restrictions that constrain θ to lie in some specified set of distributions. Let Θ denote the values of ϑ that satisfy these requirements plus the conditions ϑ_k ≥ 0 for all k ∈ K and Σ_{k∈K} ϑ_k = 1. Then for any c ∈ Y and ϑ ∈ Θ, the model predicts

Q(y*(G) = c) = Σ_{k∈K} 𝟙(y_k(G) = c) ϑ_k.

How can one partially identify this probability based on the observed data? Suppose C is observed to take realizations D_1, ..., D_m. Then the data reveal

P(y(D_j) = d_j) = Σ_{k∈K} 𝟙(y_k(D_j) = d_j) θ_k, ∀ d_j ∈ D_j, j = 1, ..., m.

This yields that the sharp identification region for θ is

H_P[θ] = {ϑ ∈ Θ : P(y(D_j) = d_j) = Σ_{k∈K} 𝟙(y_k(D_j) = d_j) ϑ_k, ∀ d_j ∈ D_j, j = 1, ..., m}.

If the behavioral model is correctly specified, H_P[θ] is non-empty. In turn, the sharp identification region for each choice probability is

H_P[Q(y*(G) = c)] = { Σ_{k∈K} 𝟙(y_k(G) = c) ϑ_k : ϑ ∈ H_P[θ] },

and its extreme points can be obtained by solving linear programs.

Kitamura and Stoye (2019) provide closely related sharp bounds on features of counterfactual choices in the nonparametric random utility model of demand, where observable choices are repeated cross-sections and one allows for unrestricted, unobserved heterogeneity. Their approach builds on the work of Kitamura and Stoye (2018), who test whether agents' behavior is consistent with the Axiom of Revealed Stochastic Preference in a random utility model in which the utility function of each consumer over commodity bundles is assumed to satisfy only the basic restriction that "more is better" with no satiation.
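These linear programs are small enough to solve directly in a toy example (the three-alternative universe, the single observed menu, and the probability 0.6 below are all hypothetical): with types given by the six strict orderings of Y, the observed choice frequencies impose linear equality constraints on ϑ, and minimizing and maximizing the counterfactual choice probability over the constrained simplex yields its sharp bounds. The sketch uses scipy.optimize.linprog.

```python
from itertools import permutations
from scipy.optimize import linprog

alts = ("c1", "c2", "c3")
types = list(permutations(alts))          # all strict preference orderings

def choice(order, menu):
    """A type picks its most-preferred alternative available in `menu`."""
    return next(a for a in order if a in menu)

# Observed: choice set D = {c1, c2}, with P(y(D) = c1) = 0.6 (hypothetical).
D = {"c1", "c2"}
A_eq = [[1.0] * len(types),                                  # probabilities sum to 1
        [1.0 if choice(t, D) == "c1" else 0.0 for t in types]]
b_eq = [1.0, 0.6]

# Sharp bounds on the counterfactual Q(y*(G) = c1) for G = {c1, c2, c3}.
G = {"c1", "c2", "c3"}
obj = [1.0 if choice(t, G) == "c1" else 0.0 for t in types]
lo = linprog(obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1)).fun
hi = -linprog([-v for v in obj], A_eq=A_eq, b_eq=b_eq, bounds=(0, 1)).fun
print(lo, hi)   # sharp bounds [0, 0.6]
```

The lower bound is 0 because all the mass consistent with the data can be placed on types that rank c1 above c2 but below c3; the upper bound is 0.6 because every type with c1 at the top also chooses c1 from {c1, c2}.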
Because the testing exercise is to be carried out using repeated cross-sections data, the authors maintain the assumption that multiple populations of consumers who face distinct choice sets have the same distribution of preferences. With this structure in place, de facto the task is to test the full implications of rationality without functional form restrictions. Kitamura and Stoye's approach is based on several novel ideas. As a first step, they leverage an earlier insight of McFadden (2005) to discretize the data without loss of information, so that they can define a large but finite set of rational preference types. As a second step, they show that this implies that rationality can be tested by checking whether observed behavior lies in a cone corresponding to positive linear combinations of preference types. While the problem is discrete, its dimension is at first sight prohibitive. Nonetheless, Kitamura and Stoye are able to develop novel computational methods that render the problem tractable. They apply their method to the U.K. Household Expenditure Survey, adapting to their framework results on nonparametric instrumental variable analysis by Imbens and Newey (2009) so that they can handle price endogeneity.

Kamat (2018) builds on Manski (2007b) to learn program effects when agents are randomly assigned to control or treatment. The treatment group is provided access to the program, while the control group is not. However, members of the control group may receive access to the program from outside the experiment, leading to noncompliance with the randomly assigned treatment. The researcher wants to learn about the average effect of program access on the decision to participate in the program and on the subsequent outcome.
While sufficiently rich data may allow the researcher to learn these effects, Kamat is concerned with the identification problem that arises when the researcher only observes the treatment assignment status, the program participation decision, and the outcome, but not the receipt of program access for every agent. Kamat formalizes this problem as one where the received treatment is selected from a choice set that depends on the assigned treatment and is unobservable to the researcher, and the agents optimally choose whether to participate in the program by maximizing their utility function over their choice set. Importantly, the utility functions are not subject to parametric restrictions, similarly to Manski (2007b). But while Manski assumed independence of choice sets and preference types, Kamat allows them to be arbitrarily dependent on each other, as in Barseghyan, Coughlin, Molinari, and Teitelbaum (2019). Kamat's (2018) approach leverages specific assumptions on random assignment of treatments and on compliance (or lack thereof) of participants to obtain nonparametric bounds on the treatment effects of interest that can be characterized using tractable linear programs.

Tamer (2003) and Ciliberto and Tamer (2009) substantially enlarge the scope of partial identification analysis of structural models by showing how to apply it to learn features of payoff functions in static, simultaneous-move finite games of complete information with multiple equilibria. Berry and Tamer (2006) extend the approach and considerations that follow to games of incomplete information. To start, here I focus on two-player entry games with complete information.
Identification Problem 3.5: Let (y_1, y_2, x_1, x_2) ∼ P be observable random variables in {0,1} × {0,1} × R^d × R^d, d < ∞. Suppose that (y_1, y_2) result from simultaneous move, pure strategy Nash play (PSNE) in a game where the payoffs are u_j(y_j, y_{−j}, x_j; β_j, δ_j) ≡ y_j(x_j β_j + δ_j y_{−j} + ε_j), j = 1, 2, and each player j chooses to "enter" (y_j = 1) or "stay out" (y_j = 0). (Completeness of information is motivated by the idea that firms in the industry have settled in a long-run equilibrium, and have detailed knowledge of both their own and their rivals' profit functions.) Here (x_1, x_2) are observable payoff shifters, (ε_1, ε_2) are payoff shifters observable to the players but not to the econometrician, δ_1 ≤ 0 and δ_2 ≤ 0 are interaction parameters, and β_1, β_2 are parameter vectors in B ⊂ R^d reflecting the effect of the observable covariates on payoffs. Each player enters the market if and only if entering yields non-negative payoff, so that y_j = 𝟙(x_j β_j + δ_j y_{−j} + ε_j ≥ 0). Assume that ε ≡ (ε_1, ε_2) is independent of x ≡ (x_1, x_2) and has a bivariate Normal distribution with mean vector zero, variances equal to one (a normalization required by the threshold crossing nature of the model), and correlation ρ ∈ [−1, 1]. In the absence of additional information, what can the researcher learn about θ = [δ_1; δ_2; β_1; β_2; ρ]? △

From the econometric perspective, this is a generalization of a standard discrete choice model to a bivariate simultaneous response model which yields a stochastic representation of equilibria in a two player, two action game. Generically, for a given value of θ and realization of the payoff shifters, the model just laid out admits multiple equilibria (existence of PSNE is guaranteed because the interaction parameters are non-positive). In other words, it yields set valued predictions, as depicted in Figure 3.4.

[Figure 3.4 (based on Figure 1 in Tamer, 2003): PSNE outcomes of the game in Identification Problem 3.5 as a function of (ε_1, ε_2). The model predicts y = (0,0), (1,0), (0,1), or (1,1) in the regions delimited by the thresholds −x_1β_1, −x_1β_1 − δ_1, −x_2β_2, −x_2β_2 − δ_2; in the central region it predicts y = (1,0) or y = (0,1).]

Why does this set valued prediction hinder point identification? Intuitively, the challenge can be traced back to the fact that for different values of θ ∈ Θ, one may find different ways to assign the probability mass in [−x_1β_1, −x_1β_1 − δ_1) × [−x_2β_2, −x_2β_2 − δ_2) to (0,1) and (1,0) so as to match the observed distribution P(y_1, y_2 | x_1, x_2). More formally, for fixed ϑ ∈ Θ and given (x, ε) and (y_1, y_2) ∈ {0,1} × {0,1}, let

E_ϑ[{(1,0),(0,1)}; x] ≡ [−x_1β_1, −x_1β_1 − δ_1) × [−x_2β_2, −x_2β_2 − δ_2),
E_ϑ[(y_1,y_2); x] ≡ {(ε_1, ε_2) : (y_1, y_2) is the unique equilibrium},

so that in Figure 3.4 E_ϑ[{(1,0),(0,1)}; x] is the gray region, E_ϑ[(0,1); x] is the dotted region, etc. Let R(y_1, y_2 | x, ε) be a selection mechanism that assigns to each possible outcome of the game (y_1, y_2) ∈ {0,1} × {0,1} the probability that it is played conditional on observable and unobservable payoff shifters. In order to be admissible, R(y_1, y_2 | x, ε) must be such that R(y_1, y_2 | x, ε) ≥ 0 for all (y_1, y_2) ∈ {0,1} × {0,1}, Σ_{(y_1,y_2)∈{0,1}×{0,1}} R(y_1, y_2 | x, ε) = 1, and

∀ε ∈ E_ϑ[{(1,0),(0,1)}; x], R(0,0 | x, ε) = R(1,1 | x, ε) = 0, (3.19)
∀ε ∈ E_ϑ[(y_1,y_2); x], R(ỹ_1, ỹ_2 | x, ε) = 0 ∀(ỹ_1, ỹ_2) ∈ {0,1} × {0,1} s.t. (ỹ_1, ỹ_2) ≠ (y_1, y_2). (3.20)

Let Φ_r denote the probability distribution of a bivariate Normal random variable with zero means, unit variances, and correlation r ∈ [−1, 1], and let M(y_1, y_2 | x) denote the model predicted probability that the outcome of the game realizes equal to (y_1, y_2). Then the model yields

M(y_1, y_2 | x) = ∫ R(y_1, y_2 | x, ε) dΦ_r
= ∫_{(ε_1,ε_2)∈E_ϑ[(y_1,y_2);x]} dΦ_r + ∫_{(ε_1,ε_2)∈E_ϑ[{(1,0),(0,1)};x]} R(y_1, y_2 | x, ε) dΦ_r. (3.21)

Because R(·|x, ε) is left completely unspecified, other than the basic restrictions listed above that render it an admissible selection mechanism, one can find multiple values for (ϑ, R(·|x, ε)) such that M(y_1, y_2 | x) = P(y_1, y_2 | x) for all (y_1, y_2) ∈ {0,1} × {0,1}, x-a.s.

Multiplicity of equilibria implies that the mapping from the model's exogenous variables (x_1, x_2, ε_1, ε_2) to outcomes (y_1, y_2) is a correspondence rather than a function. This violates the classical "principal assumptions" or "coherency conditions" for simultaneous discrete response models discussed extensively in the econometrics literature (e.g., Heckman, 1978; Gourieroux, Laffont, and Monfort, 1980; Schmidt, 1981; Maddala, 1983; Blundell and Smith, 1994). Such coherency conditions require the existence of a unique reduced form, mapping the model's exogenous variables and parameters to a unique realization of the endogenous variable; hence, they constrain the model to be recursive or triangular in nature. As pointed out by Bjorn and Vuong (1984), however, the coherency conditions shut down exactly the social interaction effect of interest by requiring, e.g., that δ_1 δ_2 = 0, so that at least one player's action has no impact on the other player's payoff.

The desire to learn about interaction effects coupled with the difficulties generated by multiplicity of equilibria prompted the earlier literature to provide at least two different ways to achieve point identification.
The first one relies on imposing simplifying assumptions that shift focus to outcome features that are common across equilibria. For example, Bresnahan and Reiss (1988, 1990, 1991) and Berry (1992) study entry games where the number, though not the identities, of entrants is uniquely predicted by the model in equilibrium. Unfortunately, however, these simplifying assumptions substantially constrain the amount of heterogeneity in players' payoffs that the model allows for. The second approach relies on explicitly modeling a selection mechanism which specifies the equilibrium played in the regions of multiplicity. For example, Bjorn and Vuong (1984) assume it to be a constant; Bajari, Hong, and Ryan (2010) assume a more flexible, covariate dependent parametrization; and Berry (1992) considers two possible selection mechanism specifications, one where the incumbent moves first, and the other where the most profitable player moves first. Unfortunately, however, the chosen selection mechanism can have non-trivial effects on inference, and the data and theory might be silent on which is more appropriate. A nice example of this appears in Berry (1992, Table VII). Berry and Tamer (2006) review and extend a number of results on the identification of entry models extensively used in the empirical literature. Jovanovic (1989) discusses the observable implications of models with multiple equilibria, and within the analysis of a model with homogeneous preferences shows that partial identification is possible (see Jovanovic, 1989, p. 1435). I refer to de Paula (2013) for a review of the literature on econometric analysis of games with multiple equilibria.

Ciliberto and Tamer (2009) show, on the other hand, that it is possible to partially identify entry models that allow for rich heterogeneity in payoffs and for any possible selection mechanism (even ones that are arbitrarily dependent on the unobservable payoff shifters after conditioning on the observed payoff shifters).
In addition, Tamer (2003) provides sufficient conditions for point identification based on exclusion restrictions and large support assumptions. Kline and Tamer (2012) analyze partial identification of nonparametric models of entry in a two-player model, drawing connections with the program evaluation literature.

Key Insight: An important conceptual contribution of Tamer (2003) is to clarify the distinction between a model which is incoherent, so that no reduced form exists, and a model which is incomplete, so that multiple reduced forms may exist. Models with multiple equilibria belong to the latter category. Whereas the earlier literature in partial identification had been motivated by measurement problems, e.g., missing or interval data, the work of Tamer (2003) and Ciliberto and Tamer (2009) is motivated by the fact that economic theory often does not specify how an equilibrium is selected in the regions of the exogenous variables which admit multiple equilibria. This is a conceptually completely distinct identification problem.
Ciliberto and Tamer (2009) propose to use simple and tractable implications of the model to learn features of the structural parameters of interest. Specifically, they point out that the probability of observing any outcome of the game cannot be smaller than the model's implied probability that such outcome is the unique equilibrium of the game, and cannot be larger than the model's implied probability that such outcome is one of the possible equilibria of the game. Looking at Figure 3.4 this means, for example, that the observed P((y_1, y_2) = (0,1) | x_1, x_2) cannot be smaller than the probability that (ε_1, ε_2) realizes in the dotted region, and cannot be larger than the probability that it realizes either in the dotted region or in the gray region. Compared to the model predicted distribution in (3.21), this means that P((y_1, y_2) = (0,1) | x_1, x_2) cannot be smaller than the expression obtained setting, for ε ∈ E_ϑ[{(1,0),(0,1)}; x], R(0,1 | x, ε) = 0, and cannot be larger than that obtained with R(0,1 | x, ε) = 1.
Denote by Φ(A_1, A_2; ρ) the probability that the bivariate normal with mean vector zero, variances equal to one, and correlation ρ assigns to the event {ε_1 ∈ A_1, ε_2 ∈ A_2}. Then Ciliberto and Tamer (2009) show that any ϑ = [d_1; d_2; b_1; b_2; r] that is observationally equivalent to the data generating value θ satisfies, (x_1, x_2)-a.s.,

P((y_1, y_2) = (0,0) | x_1, x_2) = Φ((−∞, −x_1 b_1), (−∞, −x_2 b_2); r) (3.22)
P((y_1, y_2) = (1,1) | x_1, x_2) = Φ([−x_1 b_1 − d_1, ∞), [−x_2 b_2 − d_2, ∞); r) (3.23)
P((y_1, y_2) = (0,1) | x_1, x_2) ≤ Φ((−∞, −x_1 b_1 − d_1), (−x_2 b_2, ∞); r) (3.24)
P((y_1, y_2) = (0,1) | x_1, x_2) ≥ { Φ((−∞, −x_1 b_1 − d_1), (−x_2 b_2, ∞); r) − Φ((−x_1 b_1, −x_1 b_1 − d_1), (−x_2 b_2, −x_2 b_2 − d_2); r) } (3.25)

While the approach of Ciliberto and Tamer (2009) is summarized here for a two player entry game, it extends without difficulty to any finite number of players and actions and to solution concepts other than pure strategy Nash equilibrium.

Aradillas-Lopez and Tamer (2008) build on the insights of Ciliberto and Tamer (2009) to study the identification power of equilibrium in games. To do so, they compare the set-valued model predictions and what can be learned about θ when one assumes only level-k rationality as opposed to Nash play. In static entry games of complete information, the set of predicted outcomes under level-k rationality weakly shrinks toward the Nash predictions as k grows. The inequalities in (3.22)-(3.25) yield the sharp identification region H_P[θ] in the case of two player entry games with pure strategy Nash equilibrium as solution concept, as shown by Beresteanu, Molchanov, and Molinari (2011, Supplementary Appendix D, Corollary D.4). When there are more than two players or more than two actions (or with different solution concepts, such as, e.g., mixed strategy Nash equilibrium; correlated equilibrium; or rationality of level k as in Aradillas-Lopez and Tamer, 2008), the characterization in Ciliberto and Tamer (2009) obtained by extending the reasoning just laid out yields an outer region.
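The inequalities above are straightforward to evaluate numerically. The following sketch (all parameter values, the selection rule, and the simulation design are hypothetical, with covariates suppressed so that x_j b_j = 0) simulates data from the game under one admissible selection mechanism and verifies that the true parameter value satisfies (3.22) and (3.24)-(3.25); scipy supplies the bivariate normal rectangle probabilities.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical true parameters: x'b = 0 for both players, d1 = d2 = -1, r = 0.
d1 = d2 = -1.0
r = 0.0
n = 200_000
eps = rng.multivariate_normal([0, 0], [[1, r], [r, 1]], size=n)
e1, e2 = eps[:, 0], eps[:, 1]

# PSNE regions of Figure 3.4; on [0, -d1) x [0, -d2) both (1,0) and (0,1)
# are equilibria, and the (unknown) selection rule flips a fair coin.
y = np.empty((n, 2), dtype=int)
both = (e1 >= 0) & (e1 < -d1) & (e2 >= 0) & (e2 < -d2)
y[(e1 < 0) & (e2 < 0)] = (0, 0)
y[(e1 >= -d1) & (e2 >= -d2)] = (1, 1)
y[(e1 >= 0) & (e2 < -d2) & ~both] = (1, 0)
y[(e1 < -d1) & (e2 >= 0) & ~both] = (0, 1)
coin = rng.random(n) < 0.5
y[both & coin] = (0, 1)
y[both & ~coin] = (1, 0)

p00 = np.mean((y[:, 0] == 0) & (y[:, 1] == 0))
p01 = np.mean((y[:, 0] == 0) & (y[:, 1] == 1))

# Rectangle probabilities Phi(A1, A2; r) by inclusion-exclusion on the cdf.
Phi = multivariate_normal([0, 0], [[1, r], [r, 1]]).cdf
def rect(a1, b1, a2, b2):
    return Phi([b1, b2]) - Phi([a1, b2]) - Phi([b1, a2]) + Phi([a1, a2])

inf = 30.0                                   # numerical stand-in for infinity
eq00 = rect(-inf, 0.0, -inf, 0.0)            # (3.22), here Phi(0)^2 = 0.25
hi01 = rect(-inf, -d1, 0.0, inf)             # (3.24)
lo01 = hi01 - rect(0.0, -d1, 0.0, -d2)       # (3.25)
print(abs(p00 - eq00) < 0.01, lo01 <= p01 <= hi01)   # True True
```

For candidate values of ϑ away from the truth, one or more of these (in)equalities fails, which is exactly how the moment conditions discriminate among parameter values.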
Beresteanu, Molchanov, and Molinari (2011) use elements of random set theory to provide a general and computationally tractable characterization of the identification region that is sharp, regardless of the number of players and actions, or the solution concept adopted. For the case of PSNE with any finite number of players or actions, Galichon and Henry (2011) provide a computationally tractable sharp characterization of the identification region using elements of optimal transportation theory.

3.2.2 Characterization of Sharpness through Random Set Theory

Beresteanu, Molchanov, and Molinari (2011) provide a general approach based on random set theory that delivers sharp identification regions on parameters of structural semiparametric models with set valued predictions. Here I summarize it for the case of static, simultaneous move finite games of complete information, first with PSNE as solution concept and then with mixed strategy Nash equilibrium. Then I discuss games of incomplete information.

For a given ϑ ∈ Θ, denote the set of pure strategy Nash equilibria (depicted in Figure 3.4) as Y_ϑ(x, ε). It is easy to show that Y_ϑ(x, ε) is a random closed set as in Definition A.1. Under the assumption in Identification Problem 3.5 that y results from simultaneous move, pure strategy Nash play, at the true DGP value of θ ∈ Θ, one has

y ∈ Y_θ a.s. (3.26)

Equation (3.26) exhausts the modeling content of Identification Problem 3.5. Theorem A.1 can be leveraged to extract its empirical content from the observed distribution P(y, x). For a given ϑ ∈ Θ and K ⊂ Y, let T_{Y_ϑ(x,ε)}(K; Φ_r) denote the probability of the event {Y_ϑ(x, ε) ∩ K ≠ ∅} implied when ε ∼ Φ_r, x-a.s.

Theorem SIR-3.3: Under the assumptions of Identification Problem 3.5, the sharp identification region for θ is

H_P[θ] = {ϑ ∈ Θ : P(y ∈ K | x) ≤ T_{Y_ϑ(x,ε)}(K; Φ_r) ∀ K ⊂ Y, x-a.s.}. (3.27)

Proof.
To simplify notation, let Y_ϑ ≡ Y_ϑ(x, ε). In order to establish sharpness, it suffices to show that ϑ ∈ H_P[θ] if and only if one can complete the model with an admissible selection mechanism R(y_1, y_2 | x, ε) such that R(y_1, y_2 | x, ε) ≥ 0 for all (y_1, y_2) ∈ {0,1} × {0,1}, Σ_{(y_1,y_2)∈{0,1}×{0,1}} R(y_1, y_2 | x, ε) = 1, and satisfying (3.19)-(3.20), so that M(y_1, y_2 | x) = P(y_1, y_2 | x) for all (y_1, y_2) ∈ {0,1} × {0,1}, x-a.s., with M(y_1, y_2 | x) defined in (3.21). Suppose first that ϑ is such that a selection mechanism with these properties is available. Then there exists a selection of Y_ϑ which is equal to the prediction selected by the selection mechanism and whose conditional distribution is equal to P(y | x), x-a.s., and therefore ϑ ∈ H_P[θ]. Next take ϑ ∈ H_P[θ]. Then by Theorem A.1, y and Y_ϑ can be realized on the same probability space as random elements y′ and Y′_ϑ, so that y′ and Y′_ϑ have the same distributions, respectively, as y and Y_ϑ, and y′ ∈ Sel(Y′_ϑ), where Sel(Y′_ϑ) is the set of all measurable selections from Y′_ϑ, see Definition A.3. One can then complete the model with a selection mechanism that picks y′ with probability 1, and the result follows.

The characterization provided in Theorem SIR-3.3 for games with multiple PSNE, taken from Beresteanu, Molchanov, and Molinari (2011, Supplementary Appendix D), is equivalent to the one in Galichon and Henry (2011). When J = 2 and Y = {0,1} × {0,1}, the inequalities in (3.27) reduce to (3.22)-(3.25). With more players and/or more actions, the inequalities in (3.27) are a superset of those in (3.22)-(3.25), with the latter comprised of the ones in (3.27) for K = {k} and K = Y \ {k}, for all k ∈ Y. Hence, the inequalities in (3.27) are more informative. Of course, the computational cost incurred to characterize H_P[θ] may grow with the number of inequalities involved.
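A minimal simulation sketch of the inequalities in (3.27) (parameter values and the selection rule are hypothetical, with covariates suppressed): draw ε, compute the equilibrium set Y_ϑ(ε) for each draw, generate "observed" outcomes through some selection from it, and verify that P(y ∈ K) ≤ T_{Y_ϑ}(K) for every nonempty proper subset K of the four outcomes.

```python
from itertools import chain, combinations
import numpy as np

rng = np.random.default_rng(1)
d1 = d2 = -1.0        # hypothetical candidate (= true) parameters; r = 0, x'b = 0
n = 100_000
e1, e2 = rng.standard_normal(n), rng.standard_normal(n)

OUTCOMES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def eq_set(a, b):
    """PSNE set Y_theta(eps) for one draw of eps (regions of Figure 3.4)."""
    eqs = []
    if a < 0 and b < 0:
        eqs.append((0, 0))
    if a >= -d1 and b >= -d2:
        eqs.append((1, 1))
    if a >= 0 and b < -d2:
        eqs.append((1, 0))
    if a < -d1 and b >= 0:
        eqs.append((0, 1))
    return eqs

sets = [eq_set(a, b) for a, b in zip(e1, e2)]
ys = [S[0] for S in sets]          # "data": some unknown selection from Y_theta

def prob_in(K):                    # empirical P(y in K)
    return sum(y in K for y in ys) / n

def capacity(K):                   # simulated T_{Y_theta}(K) = P(Y_theta ∩ K ≠ ∅)
    return sum(any(y in K for y in S) for S in sets) / n

Ks = chain.from_iterable(combinations(OUTCOMES, k) for k in range(1, 4))
ok = all(prob_in(K) <= capacity(K) for K in Ks)
print(ok)   # True: the candidate survives all 2^4 - 2 inequalities
```

Because the observed outcome is always a selection from the equilibrium set, these inequalities hold by construction at the data generating value; a candidate ϑ away from the truth would violate at least one of them, which is how (3.27) carves out H_P[θ].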
I discuss computational challenges in partial identification in Section 6.

Key Insight (Random set theory and partial identification – continued): In Identification Problem 3.5, lack of point identification can be traced back to the set valued predictions delivered by the model, which in turn derive from the model incompleteness defined by Tamer (2003). As stated in the Introduction, constructing the (random) set of model predictions delivered by the maintained assumptions is an exercise typically carried out in identification analysis, regardless of whether random set theory is applied. Indeed, for the problem studied in this section, Tamer (2003, Figure 1) put forward the set of admissible outcomes of the game. Beresteanu, Molchanov, and Molinari (2011) propose to work directly with this random set to characterize H_P[θ]. The fundamental advantage of this approach is that it dispenses with considering the possible selection mechanisms that may complete the model. Selection mechanisms may depend on the model's unobservables even after conditioning on observables and may constitute an infinite dimensional nuisance parameter, which creates great difficulties for the computation of H_P[θ] and for inference.

Next, I discuss the case that the outcome of the game results from simultaneous move, mixed strategy Nash play. When mixed strategies are allowed for, the model predicts multiple mixed strategy Nash equilibria (MSNE). But whereas when only pure strategies are allowed for, if the model is correctly specified, the observed outcome of the game is one of the predicted PSNE, with mixed strategies the observed outcome is only the result of a random mixing draw from one of the predicted MSNE.
Hence, the identification problem is more complex, and in order to obtain a tractable characterization of θ's sharp identification region one needs to use different tools from random set theory.

To keep the treatment simple here I continue to consider the case of two players with two strategies, as in Identification Problem 3.5, with mixed strategies allowed for, and refer to Molchanov and Molinari (2018, Section 3.4) for the general case. (The same reasoning given here applies if instead of mixed strategy Nash the solution concept is correlated equilibrium, by replacing the set of MSNE below with the set of correlated equilibria.) Fix ϑ ∈ Θ. Let σ_j : {0, 1} → [0, 1] denote the probability that player j enters the market, with 1 − σ_j the probability that she stays out. With some abuse of notation, let u_j(σ_j, σ_{−j}, x_j, ε_j, ϑ) denote the expected payoff associated with the mixed strategy profile σ = (σ_1, σ_2). For a given realization (x, e) of (x, ε) and a given value of ϑ ∈ Θ, the set of mixed strategy Nash equilibria is

S_ϑ(x, e) = { σ ∈ [0, 1]^2 : u_j(σ_j, σ_{−j}, x_j, e_j; ϑ) ≥ u_j(σ̃_j, σ_{−j}, x_j, e_j; ϑ) ∀ σ̃_j ∈ [0, 1], j = 1, 2 }.

[Figure 3.5: MSNE strategies (S_ϑ), set of multinomial distributions over outcomes of the game (Q_ϑ), and its support function (h_{Q_ϑ}), as functions of (ε_1, ε_2); σ*_1 and σ*_2 denote the entry probabilities in the mixed strategy equilibrium.]

Beresteanu, Molchanov, and Molinari (2011) show that S_ϑ ≡ S_ϑ(x, ε) is a random closed set in [0, 1]^2. Its realizations are illustrated in Panel (a) of Figure 3.5 as a function of (ε_1, ε_2). Define the set of possible multinomial distributions over outcomes of the game associated with the selections σ of each possible realization of S_ϑ as

Q_ϑ = { q(σ) ≡ [ (1 − σ_1)(1 − σ_2), σ_1(1 − σ_2), (1 − σ_1)σ_2, σ_1 σ_2 ]^⊤ : σ ∈ S_ϑ }.   (3.28)

As Q_ϑ is the image of a continuous map applied to the random compact set S_ϑ, it is a random compact set.
Its realizations are plotted in Panel (b) of Figure 3.5 as a function of (ε_1, ε_2). (This figure is based on Figure 1 in Beresteanu, Molchanov, and Molinari (2011).) The multinomial distribution over outcomes of the game determined by a given σ ∈ S_ϑ is a function of ε. To obtain the predicted distribution over outcomes of the game conditional
on observed payoff shifters only, one needs to integrate out the unobservable payoff shifters ε. Doing so requires care, as it needs to be done for each q(σ) ∈ Q_ϑ. First, observe that all the q(σ) ∈ Q_ϑ are contained in the 3-dimensional unit simplex, and are therefore integrable. Next, define the conditional selection expectation (see Definition A.4) of Q_ϑ as

E_{Φ_r}(Q_ϑ | x) = { E_{Φ_r}(q(σ) | x) : σ ∈ Sel(S_ϑ) },

where Sel(S_ϑ) is the set of all measurable selections from S_ϑ, see Definition A.3. By construction, E_{Φ_r}(Q_ϑ | x) is the set of probability distributions over action profiles conditional on x which are consistent with the maintained modeling assumptions, i.e., with all the model's implications (including the assumption that ε ∼ Φ_r). If the model is correctly specified, there exists at least one vector θ ∈ Θ such that the observed conditional distribution p(x) ≡ [P(y = y^1 | x), …, P(y = y^4 | x)]^⊤ almost surely belongs to the set E_{Φ_ρ}(Q_θ | x). Indeed, by the definition of E_{Φ_ρ}(Q_θ | x), p(x) ∈ E_{Φ_ρ}(Q_θ | x) almost surely if and only if there exists q ∈ Sel(Q_θ) such that E_{Φ_ρ}(q | x) = p(x) almost surely, with Sel(Q_θ) the set of all measurable selections from Q_θ. Hence, the collection of parameter vectors ϑ ∈ Θ that are observationally equivalent to the data generating value θ is given by the ones that satisfy p(x) ∈ E_{Φ_r}(Q_ϑ | x) almost surely. In turn, observing that by Theorem A.2 the set E_{Φ_r}(Q_ϑ | x) is convex, we have that p(x) ∈ E_{Φ_r}(Q_ϑ | x) if and only if u^⊤ p(x) ≤ h_{E_{Φ_r}(Q_ϑ|x)}(u) for all u in the unit ball (see, e.g., Rockafellar, 1970, Theorem 13.1), where h_{E_{Φ_r}(Q_ϑ|x)}(u) is the support function of E_{Φ_r}(Q_ϑ | x), see Definition A.5.
Theorem SIR-3.4: Under the assumptions in Identification Problem 3.5, allowing for mixed strategies and with the observed outcomes of the game resulting from mixed strategy Nash play, the sharp identification region for θ is

H_P[θ] = { ϑ ∈ Θ : max_{u ∈ B^{|Y|}} ( u^⊤ p(x) − E_{Φ_r}[h_{Q_ϑ}(u) | x] ) = 0, x-a.s. }   (3.29)
       = { ϑ ∈ Θ : ∫_{B^{|Y|}} ( u^⊤ p(x) − E_{Φ_r}[h_{Q_ϑ}(u) | x] )_+ dμ(u) = 0, x-a.s. },   (3.30)

where μ is any probability measure on B^{|Y|}, and |Y| = 4 in this case.

Proof. Theorem A.2 (equation (A.10)) yields (3.29), because by the arguments given before the theorem, H_P[θ] = { ϑ ∈ Θ : p(x) ∈ E_{Φ_r}(Q_ϑ | x), x-a.s. }. The result in (3.30) follows because the integrand in (3.30) is continuous in u and both conditions inside the curly brackets are satisfied if and only if u^⊤ p(x) − E_{Φ_r}[h_{Q_ϑ}(u) | x] ≤ 0 for all u ∈ B^{|Y|}, x-a.s.

For a fixed u ∈ B^{|Y|}, the possible realizations of h_{Q_ϑ}(u) are plotted in Panel (c) of Figure 3.5 as a function of (ε_1, ε_2). The expectation of h_{Q_ϑ}(u) is quite straightforward to compute, whereas calculating the set E_{Φ_r}(Q_ϑ | x) is computationally prohibitive in many cases. Hence, the characterization in (3.29) is computationally attractive, because for each ϑ ∈ Θ it requires maximizing an easy-to-compute superlinear, hence concave, function over a convex set, and checking whether the resulting objective value vanishes. Several efficient algorithms in convex programming are available to solve this problem, see for example the MatLab software for disciplined convex programming CVX (Grant and Boyd, 2010). Nonetheless, H_P[θ] itself is not necessarily convex, hence tracing out its boundary is non-trivial. I return to computational challenges in partial identification in Section 6.
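To illustrate the criterion in (3.29), the following sketch approximates E_{Φ_r}[h_{Q_ϑ}(u) | x] by averaging the support function over simulated vertex sets of Q_ϑ, and then searches for a direction u in the unit ball at which u^⊤ p(x) exceeds it. A disciplined convex programming solver (such as the CVX software mentioned above) would carry out the maximization exactly; here a crude search over candidate directions, always including u = 0, is used instead, and the inputs `Q_draws` and `p` are hypothetical.

```python
import numpy as np

def criterion(p, Q_draws, rng, n_dirs=500):
    """Approximate max over ||u|| <= 1 of  u'p - mean_r max_{q in Q_r} u'q,
    the objective in (3.29). Q_draws holds, for each simulated draw of the
    unobservables, the (finitely many) candidate distributions q as rows."""
    d = len(p)
    cands = [np.zeros(d)]                                # u = 0 gives value 0
    cands += [s * e for e in np.eye(d) for s in (1.0, -1.0)]
    U = rng.normal(size=(n_dirs, d))
    cands += list(U / np.linalg.norm(U, axis=1, keepdims=True))

    def value(u):
        Eh = np.mean([np.max(Q @ u) for Q in Q_draws])   # approximates E[h_Q(u)|x]
        return u @ p - Eh

    return max(value(u) for u in cands)

e = np.eye(4)
# Toy example: in every draw Q_theta has the two vertices e1 and e2
# (two candidate outcome distributions).
Q_draws = [np.stack([e[0], e[1]])] * 100
p_in  = 0.5 * e[0] + 0.5 * e[1]   # attainable by mixing across the two
p_out = e[2]                      # not attainable
rng = np.random.default_rng(0)
print(criterion(p_in, Q_draws, rng) <= 1e-9)   # True: criterion numerically zero
print(criterion(p_out, Q_draws, rng) > 0.5)    # True: p_out is rejected
```

Because the objective is concave in u, the crude search only gives a lower bound on the maximum; it suffices here to illustrate the accept/reject logic, but a convex solver should be used in applications.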
Key Insight : Beresteanu,Molchanov, and Molinari (2011) provide a general characterization of sharp identificationregions for models with convex moment predictions . These are models that for a given ϑ ∈ Θ and realization of observable variables, predict a set of values for a vector of variables ofinterest. This set is not necessarily convex, as exemplified by Y ϑ and Q ϑ , which are finite.No restriction is placed on the manner in which, in the DGP, a specific model predictionis selected from this set. When the researcher takes conditional expectations of the resultingelements of this set, the unrestricted process of selection yields a convex set of momentsfor the model variables (all possible mixtures). This is the model’s convex set of momentpredictions. If this set were almost surely single valued, the researcher would learn (featuresof ) θ by solving moment equality conditions involving the observed variables and predictedones. The approach reviewed in this section is a set-valued method of moments that extendsthe singleton-valued one commonly used in econometrics. I conclude this section discussing the case of static, simultaneous move finite games ofincomplete information, using the results in Beresteanu, Molchanov, and Molinari (2011,Supplementary Appendix C). For clarity, I formalize the maintained assumptions.
Identification Problem 3.6: Impose the same structure on payoffs, entry decision rule, outcome space, parameter space, and observable variables as in Identification Problem 3.5. Assume that the observed outcome of the game results from simultaneous move, pure strategy Bayesian Nash play. Both players and the researcher observe (x_1, x_2). However, ε_j is private information to player j = 1, 2, and ε_1 ⊥⊥ ε_2 | (x_1, x_2). Assume that players have a correct common prior F_γ on the distribution of (ε_1, ε_2) and the researcher knows this distribution up to γ, a finite dimensional parameter vector. Under these assumptions, multiple Bayesian Nash equilibria may result. (See Berry and Tamer (2006, Section 3) and de Paula (2013) for a thorough discussion of the literature on identification problems in games of incomplete information with multiple Bayesian Nash equilibria (BNE). Berry and Tamer (2006) explain how to extend the approach proposed by Ciliberto and Tamer (2009) to obtain outer regions on θ when no restrictions are imposed on the equilibrium selection mechanism that chooses among the multiple BNE.) In the absence of additional information, what can the researcher learn about θ = [δ_1 δ_2 β_1 β_2 γ]? △

With incomplete information, players' strategies are decision rules that map the support of (ε, x) into {0, 1}. The non-negativity condition on expected payoffs that determines each player's decision to enter the market results in equilibrium mappings (decision rules) that are step functions determined by a threshold: y_j(ε_j) = 1(ε_j ≥ t_j), j = 1, 2. As a result, player j's belief about player 3 − j's probability of entry under the common prior assumption is ∫ y_{−j}(ε_{−j}) dF_γ(ε_{−j} | x) = 1 − F_γ(t_{−j} | x), and therefore player j's best response cutoff is

t^b_j(t_{−j}, x; θ) = −x_j β_j − δ_j (1 − F_γ(t_{−j} | x)).

Hence, the set of equilibria can be defined as the set of cutoff rules:

T_θ(x) = { (t_1, t_2) : t_j = t^b_j(t_{−j}, x; θ), j = 1, 2 }.

The equilibrium thresholds are functions of x and θ only. The set T_θ(x) might contain a finite number of equilibria (e.g., if the common prior is the Normal distribution), or a continuum of equilibria. For ease of notation I suppress its dependence on x in what follows.

Given the equilibrium decision rules (the selections of the set T_θ), it is possible to determine their associated action profiles. Because in the simple two-player entry game that I consider actions and outcomes coincide, I denote the set of admissible action profiles by Y_θ:

Y_θ = { y(t) ≡ [ 1(ε_1 < t_1, ε_2 < t_2), 1(ε_1 ≥ t_1, ε_2 < t_2), 1(ε_1 < t_1, ε_2 ≥ t_2), 1(ε_1 ≥ t_1, ε_2 ≥ t_2) ]^⊤ : t ∈ Sel(T_θ) },   (3.31)

with Sel(T_θ) the set of all measurable selections from T_θ, see Definition A.3. To obtain the predicted set of multinomial distributions for the outcomes of the game, one needs to integrate out ε conditional on x. Again this can be done by using the conditional Aumann expectation:

E_{F_γ}(Y_θ | x) = { E_{F_γ}(y(t) | x) : t ∈ Sel(T_θ) }.

This set is closed and convex. Regardless of whether T_θ contains a finite number of equilibria or a continuum, Y_θ can take on only a finite number of realizations corresponding to each of the vertices of the three dimensional simplex, because the vectors y(t) in (3.31) collect threshold decision rules. This implies that E_{F_γ}(Y_θ | x) is a closed convex polytope x-a.s., fully characterized by a finite number of supporting hyperplanes. Hence, it is possible to determine
Hence, it is possible to determine Both the independence assumption and the correct common prior assumption are maintained here tosimplify exposition. Both could be relaxed with no conceptual difficulty, though computation of the set ofBayesian Nash equilibria, for example, would become more cumbersome. ϑ ∈ H P [ θ ] using efficient algorithms in linear programming. Theorem SIR- : Under the assumptions in Identification Problem 3.6,the sharp identification region for θ is H P [ θ ] = (cid:26) ϑ ∈ Θ : max u ∈ B |Y| u (cid:62) p ( x ) − E F ˜ γ [ h Y ϑ ( u ) | x ] = 0 , x -a.s. (cid:27) (3.32)= (cid:26) ϑ ∈ Θ : u (cid:62) p ( x ) ≤ E F ˜ γ [ h Y ϑ ( u ) | x ] , ∀ u ∈ D, x -a.s. (cid:27) , (3.33)= (cid:26) ϑ ∈ Θ : P ( y ∈ K | x ) ≤ T Y ϑ ( x ,ε ) ( K ; F ˜ γ ) ∀ K ⊂ Y , x -a.s. (cid:27) , (3.34) with D = { u = [ u , . . . , u |Y| ] (cid:62) : u i ∈ { , } , i = 1 , ..., |Y|} , ϑ = [ d , d , b , b , ˜ γ ] , and T Y ϑ ( x ,ε ) ( K ; F ˜ γ ) the probability that { Y ϑ ( x , ε ) ∩ K (cid:54) = ∅} implied when ε ∼ F ˜ γ , x -a.s.Proof. The result in (3.32) follows by the same argument as in the proof of Theorem SIR-3.4.Next I show equivalence of the conditions( i ) u (cid:62) p ( x ) ≤ E F ˜ γ [ h Y ϑ ( u ) | x ] ∀ u ∈ B |Y| , ( ii ) u (cid:62) p ( x ) ≤ E F ˜ γ [ h Y ϑ ( u ) | x ] ∀ u ∈ D. By the positive homogeneity of the support function, condition ( i ) is equivalent to p ( x ) ≤ E F ˜ γ [ h Y ϑ ( u ) | x ] ∀ u ∈ R |Y| , which implies condition ( ii ). Next I show that condition ( ii ) im-plies condition ( i ). As explained before, the set Y θ , and hence also its convex hull conv( Y θ ),can take on only a finite number of realizations. Let Y , . . . , Y m be convex compact setsin the simplex of dimension |Y| − Y θ ), andlet (cid:36) ( x ) , . . . , (cid:36) m ( x ) denote the probability of each of these realizations conditional on x .Then by Theorem 2.1.34 in Molchanov (2017), E F ˜ γ ( Y θ | x ) = (cid:80) mj =1 Y j (cid:36) j ( x ). 
By the properties of the support function (see, e.g., Schneider, 1993, Theorem 1.7.5), h_{E_{F_γ̃}(Y_θ|x)}(u) = ∑_{j=1}^m ϖ_j(x) h_{Y_j}(u). For each j = 1, …, m, the vertices of Y_j are a subset of the vertices of the (|Y| − 1)-dimensional simplex. Hence, the supporting hyperplanes of Y_j, j = 1, …, m, are a subset of the supporting hyperplanes of that simplex, which in turn are obtained through its support function evaluated in directions u ∈ D. Finally, I show equivalence with the result in (3.34). Because the vertices of Y_j are a subset of the vertices of the (|Y| − 1)-dimensional simplex, each direction u ∈ D determines a set K_u ⊂ Y. Given the choice of u, the value of u^⊤ y(t) equals one if y(t) ∈ K_u and zero otherwise. Hence, condition (3.33) reduces to

P(y ∈ K_u | x) = u^⊤ p(x) ≤ E_{F_γ̃}[h_{Y_ϑ}(u) | x] = E_{F_γ̃}[ sup_{y(t) ∈ Y_ϑ} u^⊤ y(t) | x ] = E_{F_γ̃}[ 1(Y_ϑ ∩ K_u ≠ ∅) | x ] = T_{Y_ϑ(x,ε)}(K_u; F_γ̃).

Observing that D comprises the 2^{|Y|} vectors with entries equal to either 1 or 0, and that these determine all possible subsets K_u of Y, yields condition (3.34).

One can use the same argument as in the proof of Theorem SIR-3.5 to show that the Aumann expectation/support function characterization of the sharp identification region in Theorem SIR-3.4 coincides with the characterization based on the capacity functional in Theorem SIR-3.3, when only pure strategies are allowed for. This shows that in this class of models, the capacity functional based characterization is a special case of the Aumann expectation/support function based one.

Aradillas-Lopez and Tamer (2008) study the identification power of equilibrium also in the case of static entry games with incomplete information.
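Returning to the computation suggested by Theorem SIR-3.5: since h_{E_{F_γ̃}(Y_θ|x)}(u) = ∑_j ϖ_j(x) h_{Y_j}(u), membership of p(x) in the polytope E_{F_γ̃}(Y_ϑ | x) can be verified by checking only the 2^{|Y|} binary directions u ∈ D. A minimal sketch in Python, with hypothetical realizations Y_j, probabilities ϖ_j, and observed distribution p:

```python
from itertools import product

def support(vertices, u):
    """Support function h_Y(u) of a polytope given by its vertex list."""
    return max(sum(ui * vi for ui, vi in zip(u, v)) for v in vertices)

def in_expectation(realizations, probs, p, tol=1e-9):
    """Check u'p <= sum_j probs[j] * h_{Y_j}(u) for every binary direction u."""
    d = len(p)
    for u in product((0, 1), repeat=d):
        h_E = sum(w * support(Y, u) for Y, w in zip(realizations, probs))
        if sum(ui * pi for ui, pi in zip(u, p)) > h_E + tol:
            return False
    return True

# |Y| = 4: vertices of the 3-dimensional simplex.
e1, e2, e3, e4 = (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
realizations = [[e1, e2], [e4]]  # two realizations of conv(Y_theta): a segment, a point
probs = [0.5, 0.5]               # their conditional probabilities
print(in_expectation(realizations, probs, (0.25, 0.25, 0.0, 0.5)))  # True
print(in_expectation(realizations, probs, (0.0, 0.0, 0.5, 0.5)))    # False
```

The restriction to binary directions is valid here because, as in the proof above, the vertices of each realization are vertices of the simplex; for general convex sets the full unit ball of directions would be needed.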
Aradillas-Lopez and Tamer (2008) show that in the presence of multiple equilibria, assuming Bayesian Nash behavior yields more informative regions for the parameter vector θ than assuming only rational behavior, but at the price of a higher computational cost.

de Paula and Tang (2012) propose a procedure to test for the sign of the interaction effects (which here I have assumed to be non-positive) in discrete simultaneous games with incomplete information and (possibly) multiple equilibria. As a by-product of this procedure, they also provide a test for the presence of multiple equilibria in the DGP. The test does not require parametric specifications of players' payoffs, the distributions of their private signals, or the equilibrium selection mechanism. Rather, the test builds on the commonly invoked assumption that players' private signals are independent conditional on observed states. Grieco (2014) introduces an important class of models with flexible information structure. Each player is assumed to have a vector of payoff shifters unobservable by the researcher, composed of elements that are private information to the player and elements that are known to all players. The results of Beresteanu, Molchanov, and Molinari (2011) reported in this section apply to this set-up as well.

Haile and Tamer (2003) study what can be learned about the distribution of valuations in an open outcry English auction where symmetric bidders have independent private values for the object being auctioned. The standard theoretical model (Milgrom and Weber, 1982), called the "button auction" model, posits that each bidder holds down a button while the object's price rises continuously and exogenously, releasing it (in the dominant strategy equilibrium) when it reaches her valuation or all her opponents have left. In this case, the distribution of bidders' valuations can be learned exactly. Haile and Tamer (2003) show that much can be learned about the distribution of valuations, even allowing for the fact that real-life auctions
Haile and Tamer (2003) show that much can belearned about the distribution of valuations, even allowing for the fact that real-life auctions65 v v ¯ v ¯ v ¯ vv v v v v v v v v ( v, v, v ) = (0 , , V = { v ∈ R : v ≤ v ≤ v ≤ v ≤ ¯ v } B ( v ) Figure 3.6: A realization of the model predicted ordered bids B ( (cid:126) v n ) in (3.35) for n = 3 , (cid:126) v n = v , δ = 0. may depart from this stylized framework, as in the following identification problem. Identification Problem : For a given auction with n < ∞ participating bidders, let v i ∼ Q , i = 1 , . . . , n, bebidder i ’s valuation for the object being auctioned and assume that v i ⊥⊥ v j for all i (cid:54) = j .Assume that the support of Q is [ v, ¯ v ] and that each bidder knows her own valuation but notthat of her opponents. Let the auctioneer set a minimum bid increment δ ∈ [0 , ¯ v ), and forsimplicity suppose there is no reserve price. Suppose the researcher observes order statisticsof the bids, (cid:126) b n ≡ ( b n , . . . , b n : n ) ∼ P in R n + , with b i : n the i -th lowest of the n bids. Assumethat: (1) Bidders do not bid more than they are willing to pay; (2) Bidders do not allow anopponent to win at a price they are willing to beat. In the absence of additional information,what can the researcher learn about Q ? (cid:52) The model in Identification Problem 3.7 delivers set valued predictions because givenvaluations ( v , . . . , v n ), the two fundamental assumptions about bidder’s behavior yield (cid:126) b n ∈ B ( (cid:126) v n ) ≡ (cid:34)(cid:40) n − (cid:89) i =1 [ v, v i : n ] (cid:41) × [ v n − n − δ, v n : n ] (cid:35) ∩ V n , (3.35)where (cid:126) v n ≡ ( v n , . . . , v n : n ) denotes the vector of order statistics of the valuations, and V n = { v ∈ R n : v ≤ v ≤ v ≤ · · · ≤ v n ≤ ¯ v } . Figure 3.6 provides a stylized depiction of arealization of this set for (cid:126) v n = v when there are three bidders ( n = 3), v = 0, and δ = 0. 
In Examples of departures from the standard model include the case where active bidding by a player’sopponents may eliminate her incentives to bid close to her valuation or at all; the econometrician does notprecisely observe the point at which each bidder drops out; there are discrete bid increments; etc. If there is a reserve price r > v , nothing can be learned about Q ( v ∈ [ v, v ]) for any v < r . In that case,one can learn features of the truncated distribution of valuations using the same insights summarized here. Using the same convention as for the bids, v i : n denotes the i -th lowest of the n valuations. B ( (cid:126) v n ) collects the model predicted values of ordered bids. The fact that b i : n ≤ v i : n for all i results from assumption (1): since each bidder bids at most an amount equal to hervaluation, the i -th highest bid cannot exceed the i -th highest valuation (Haile and Tamer,2003, Lemma 1). The fact that b n : n ≥ v n − ,n − δ follows immediately from assumption (2)(Haile and Tamer, 2003, Lemma 3). The fact that (cid:126) b n has to lie in V n follows because it is avector of ordered bids.Why does this set-valued prediction hinder point identification? The reason is that thedistribution of the observable data relates to the model structure in an incomplete manner. Define a bidding rule B ( b n , . . . , b n : n | v n , . . . , v n : n ) to be a conditional joint distribution forthe order statistics of the bids conditional on the order statistics of the valuations. Then,for a given realization of the valuations v n = v , . . . , v n : n = v n , the model requires that thesupport of B ( ·| v , . . . , v n ) is in B ( (cid:126)v ) as defined in (3.35) with v n = v , . . . , v n : n = v n , butimposes no other restriction on it. Hence, the model implied joint distribution of orderedbids is M ,...,n : n ( · ; B , Q ) ≡ (cid:90) B ( ·| v , . . . , v n ) Q ,...,n : n ( dv , . . . 
, dv n ) , (3.36)where Q ,...,n : n is the joint distribution of order statistics of the valuations implied by Q .Since the bidding rule B is left completely unspecified (other than requiring it to be a validjoint conditional probability distribution with support in B ), one can find multiple pairs( B , Q ) satisfying the assumptions of Identification Problem 3.7, such that M ,...,n : n ( · ; B , Q ) = G ,...,n : n ( · ), with G ,...,n : n the observed joint CDF of the order statistics of the bids associatedwith P .Haile and Tamer (2003) propose to use simple and tractable implications of the model tolearn features of Q . Recall that with i.i.d. valuations, the distribution of each order statisticuniquely determines Q ( v ), with Q ( v ) ≡ Q ( v ≤ v ) for any v ≥ v , through: Q ( v ) = q B ( Q i : n ( v ); i, n − i + 1) , (3.37)where Q i : n is the CDF of v i : n and q B ( · ; i, n − i +1) is the quantile function of a Beta-distributedrandom variable with parameters i and n − i + 1. Using this, their Lemmas 1 and 3 yield,respectively, Q ( v ) ≤ min n,i q B ( G i : n ( v ); i, n − i + 1) , ∀ v ∈ [ v, ¯ v ] , (3.38) Q ( v ) ≥ max n q B ( G n : n ( v − δ ); i, n − i + 1) , ∀ v ∈ [ v, ¯ v ] , (3.39) Note that b i : n needs not be the bid made by the bidder with valuation v i : n . Haile and Tamer (2003, Appendix D) provide the discussion summarized here. Additionally, in theirAppendix B, they give a simple example of a two-bidder auction satisfying all assumptions in IdentificationProblem 3.7, where two different distributions Q and ˜ Q yield the same distribution of ordered bids. v ≥ v , G i : n ( v ) ≡ P ( b i : n ≤ v ) denotes the observed CDF of b i : n for i = 1 , . . . , n . 
Key Insight: The model and analysis put forward by Haile and Tamer (2003) trade point identification of the distribution of valuations under stringent assumptions on the bidding rule, for a robust inference approach that yields informative bounds under weak and widely credible assumptions on bidding behavior. Remarkably, "nothing is lost" due to the use of their robust approach: point identification is recovered when the standard assumptions of the button auction model hold. This is because in the dominant strategy equilibrium the top losing bidder exits at her valuation, followed immediately by the winning bidder. Hence, b_{n−1:n} = v_{n−1:n} = b_{n:n} and δ = 0, so that the upper and the lower bound in (3.38)-(3.39) coincide and point identify the distribution of valuations.

Haile and Tamer (2003) also provide sharp bounds on the optimal reserve price, which I do not discuss here. However, they leave open the question of whether the collection of CDFs satisfying (3.38)-(3.39) yields the sharp identification region for Q. As discussed in Sections 2.1-2.3, pointwise bounds on the CDF deliver tubes of admissible CDFs that in general yield outer regions on the CDF of interest. But in this identification problem, the issue of sharpness is even more subtle, and it is therefore addressed in the following subsection.

Before moving on to that discussion, I note that the work of Haile and Tamer (2003) spurred a rich literature applying partial identification analysis to the study of auction models. Tang (2011) studies first price sealed bid auctions with equilibrium behavior, where affiliated valuations prevent, in the absence of parametric restrictions on the distribution of the model primitives, point identification of the model. He derives bounds on seller revenue under various counterfactual scenarios on reserve prices and auction formats.
Armstrong (2013) also studies first price sealed bid auctions with equilibrium behavior, but relaxes the independence assumption on symmetric valuations by requiring it to hold only conditional on unobserved heterogeneity. He derives bounds on various functionals of the distributions of interest, including the mean bid and mean valuation. Aradillas-López, Gandhi, and Quint (2013) analyze second price auctions with correlated private values. In this case, the distribution of valuations is not point identified even under the assumptions of the button auction model (Athey and Haile, 2002, Theorem 4). (The button auction model yields bidding behavior consistent with Identification Problem 3.7.) Nonetheless, Aradillas-López, Gandhi, and Quint (2013) show that interesting functionals of it (seller profits and bidder surplus) can be bounded, if one assumes that transaction prices are determined by the second highest valuation and imposes some restrictions on the joint distribution of the number of bidders and distribution of the valuations. Komarova (2013) studies a related model of second-price ascending auctions with arbitrary dependence in bidders' private values. She provides partial identification results for the joint distribution of values for any subset of bidders under various assumptions about what data the researcher observes. While in her framework the highest bid is never [...] ϑ for which a linear program is feasible. Related results leveraging the linear structure of correlated equilibria in the context of entry games include Yang (2006), Beresteanu, Molchanov, and Molinari (2011, Supplementary Appendix E.2), and Magnolfi and Roncoroni (2017).

Haile and Tamer's (2003) bounds exploit the information contained in the marginal
CDFs G_{i:n} for each i and n. However, in Identification Problem 3.7 additional information can be extracted from the joint distribution of ordered bids. Chesher and Rosen (2017a) obtain the sharp identification region H_P[Q] using random set methods (Artstein's characterization in Theorem A.1) applied to a quantile function representation of the order statistics. Here I provide an equivalent characterization that uses equation (3.35) directly, and which has not appeared in the literature before. Let T denote the space of probability distributions with support on [v, v̄], so that Q ∈ T. For a candidate distribution Q̃ ∈ T, let Q̃_{1,…,n:n} denote the implied distribution of order statistics of n i.i.d. random variables distributed Q̃. Let B̃ be a random closed set defined as in (3.35) with respect to order statistics of i.i.d. random variables with distribution Q̃. For a given set K ∈ K, with K the collection of compact subsets of R^n, let T_B̃(K; Q̃) denote the probability of the event {B̃ ∩ K ≠ ∅} implied by Q̃.

Theorem SIR-3.6: Under the assumptions of Identification Problem 3.7, the sharp identification region for Q is

H_P[Q] = { Q̃ ∈ T : P(b⃗_n ∈ K) ≤ T_B̃(K; Q̃) ∀ K ∈ K }.   (3.40)

Proof.
The sharp identification region for Q is given by the collection of probability distributions Q̃ ∈ T for which one can find a bidding rule B(·|·) with support in B̃ a.s. such that G_{1,…,n:n}(·) = M_{1,…,n:n}(·; B, Q̃). Here M_{1,…,n:n}(·; B, Q̃) is defined as in (3.36) with Q̃ replacing Q. Take a distribution Q̃ satisfying this definition of sharpness. Then there exists a selection of B̃, determined by the bidding rule associated with Q̃, such that its distribution matches that of b⃗_n. But then Theorem A.1 implies that the inequalities in (3.40) hold. Conversely, take Q̃ satisfying the inequalities in (3.40). Then, by Theorem A.1, b⃗_n and B̃ can be realized on the same probability space as random elements b⃗′_n and B̃′, with b⃗_n equal in distribution to b⃗′_n and B̃ equal in distribution to B̃′, such that b⃗′_n ∈ B̃′ a.s. One can then complete the auction model with a bidding rule that picks b⃗′_n with probability 1, and the result follows.

In (3.40), P(b⃗_n ∈ K) is determined by the joint distribution of the ordered bids and hence can be learned from the data. On the other side, T_B̃(K; Q̃) is a function of the model and Q̃ ∈ T. Hence, it can be computed using (3.35), with B̃ defined with respect to order statistics of i.i.d. random variables with distribution Q̃ ∈ T. To gain insights into the characterization of H_P[Q], consider for example the set K = { ∏_{i=1}^{n−1}(−∞, +∞) } × (−∞, v]. Plugging it into the inequalities in (3.40), one obtains

G_{n:n}(v) ≤ Q̃_{n−1:n}(v + δ), for all n,

which, using (3.37), yields (3.39). Similarly, plugging in the sets K_j = { ∏_{i=1}^{j−1}(−∞, +∞) } × [v, ∞) × { ∏_{i=j+1}^{n}(−∞, +∞) }, j = 1, …, n, yields (3.38). So the inequalities proposed by Haile and Tamer (2003) are a subset of the inequalities yielding the sharp identification region in Theorem SIR-3.6.
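The capacity T_B̃(K; Q̃) in (3.40) can be approximated by simulation for rectangular sets K: draw valuations from a candidate Q̃, build the box in (3.35), and check whether it contains an ordered point lying in K. The greedy feasibility check and the uniform candidate distribution in the sketch below are illustrative choices, not part of the treatment above.

```python
import random

def hits(K, v_ord, v_lo, delta):
    """Does B(v_ord), the box in (3.35), intersect the rectangle K = prod [l_i, u_i]?
    Intersect coordinate-wise, then greedily search for a nondecreasing point."""
    n = len(v_ord)
    boxes = [(v_lo, v_ord[i]) for i in range(n - 1)] + [(v_ord[-2] - delta, v_ord[-1])]
    t = float("-inf")
    for (lo, hi), (l, u) in zip(boxes, K):
        lo, hi = max(lo, l), min(hi, u)
        if lo > hi:
            return False
        t = max(t, lo)   # smallest admissible value for this ordered coordinate
        if t > hi:
            return False
    return True

def capacity(K, n, draws, delta=0.0, v_lo=0.0, seed=0):
    """Monte Carlo estimate of T_B(K; Q_tilde), here with Q_tilde = Uniform[0,1]."""
    rng = random.Random(seed)
    count = sum(hits(K, sorted(rng.random() for _ in range(n)), v_lo, delta)
                for _ in range(draws))
    return count / draws

inf = float("inf")
# K = R^{n-1} x (-inf, v]: then T_B(K) = P(v_{n-1:n} - delta <= v) = Q_{n-1:n}(v + delta).
# For three Uniform[0,1] valuations, v = 0.5 and delta = 0: Q_{2:3}(0.5) = 0.5.
est = capacity([(-inf, inf), (-inf, inf), (-inf, 0.5)], n=3, draws=20000)
print(abs(est - 0.5) < 0.02)  # True, up to simulation error
```

For non-rectangular K the intersection check would be more involved, which is one reason the collection of sets to check matters so much in practice.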
More information can be obtained by using additional sets K. For instance, the set K = [v_1, ∞) × [v_2, ∞) × { ∏_{i=3}^{n}(−∞, +∞) }, with v_2 ≥ v_1, yields P(b_{1:n} ≥ v_1, b_{2:n} ≥ v_2) ≤ Q̃_{1,2:n}([v_1, ∞) × [v_2, ∞)), which further restricts Q. Numerous examples can be given.

Characterization (3.40) is stated using inequality (A.4) for the collection of compact subsets of R^n. One can instead use the (equivalent) inequality (A.5), and show that in fact it suffices to check it for a much smaller collection of sets, as shown by Chesher and Rosen (2017a) (see also Molchanov and Molinari, 2018, Section 2.2). Nonetheless, this collection remains extremely large.

Key Insight: As stated in the Introduction, constructing the (random) set of model predictions delivered by the maintained assumptions is an exercise typically carried out in identification analysis, regardless of whether random set theory is applied. Indeed, for the problem studied in this section, Haile and Tamer (2003, equation D1) put forward the set of admissible bids in (3.35). With this set in hand, the tools of random set theory (in this case, Theorem A.1) immediately deliver the sharp identification region of interest.
Chesher and Rosen (2017b) further generalize the analysis in this section by droppingthe requirement of independent private values. This allows them, for example, to consideraffiliated private values. They show that even in this significantly more complex context, thekey behavioral restrictions imposed by Haile and Tamer (2003) to relate bids to valuationscan be coupled with the use of random set theory, to characterize sharp identification regions.
Strategic models of network formation generalize the frameworks of single agent and multiple agent discrete choice models reviewed in Sections 3.1 and 3.2. They posit that pairs of agents (nodes) form, maintain, or sever connections (links) according to an explicit equilibrium notion and utility structure. Each individual's utility depends on the links formed by others (the network) and on utility shifters that may be pair-specific.

One may conjecture that the results reported in Sections 3.1-3.2 apply in this more general context too. While of course lessons can be carried over, network formation models present challenges that, combined, cannot be overcome without the development of new tools. These include the issue of equilibrium existence and the possibility of multiple equilibria when they exist, due to the interdependence in agents' choices (this problem was already discussed in Section 3.2). Another challenge is the degree of correlation between linking decisions, which interacts with how the observable data is generated: one may observe a growing number of independent networks, or a growing number of agents on a single network. Yet another challenge, which substantially increases the difficulties associated with the previous two, is the combinatoric complexity of network formation problems. The purpose of this section is exclusively to discuss some recent papers that have made important progress to address these specific challenges and carry out partial identification analysis. For a thorough treatment of the literature on network formation, I refer to the reviews in Graham (2015), Chandrasekhar (2016), de Paula (2017), and Graham (2019, Chapter XXX in this Volume).
Depending on whether the researcher observes data from a single network or multiple independent networks, the underlying population of agents may be represented as a continuum or as a countably infinite set in the first case, or as a finite set in the second case. Henceforth, I denote generic agents as i, j, k, and m. I consider static models of undirected network formation with non-transferable utility. The collection of all links among nodes is denoted y. For any pair (i, j) with i ≠ j, y_ij = 1 if they are linked, and y_ij = 0 otherwise (y_ii = 0 for all i by convention). The notation y − {ij} denotes the network that results if a link present between nodes i and j is deleted, while y + {ij} denotes the network that results if a link absent between nodes i and j is added. Denote agent i's payoff by u_i(y, x, ε). This payoff depends on the network y and the payoff shifters (x, ε), with x observable both to the agents and to the researcher, ε only to the agents, and (x, ε) collecting (x_ij, ε_ij) for all i and j. Following much of the literature, I employ pairwise stability (Jackson and Wolinsky, 1996) as equilibrium notion: y is a pairwise stable network if all linked agents prefer not to sever their links, and all non-existing links are damaging to at least one agent.

[Footnote: Equations D1 in Haile and Tamer (2003) and (3.35) here differ in that the latter also requires bids to be ordered. This observation was besides the point in Haile and Tamer's 2003 discussion that led to equation D1.]
[Footnote: For a review of the literature on peer group effect analysis, see, e.g., Brock and Durlauf (2001), Blume, Brock, Durlauf, and Ioannides (2011), de Paula (2017), and Graham (2019).]
[Footnote: Undirected means that if a link from node i to node j exists, then the link from j to i exists. The discussion that follows can be generalized to the case of models with transferable utility.]
[Footnote: Here I consider a framework where the agents have complete information.]
Formally,

∀(i, j) : y_ij = 1,  u_i(y, x, ε) ≥ u_i(y − {ij}, x, ε)  and  u_j(y, x, ε) ≥ u_j(y − {ij}, x, ε),
∀(i, j) : y_ij = 0,  if u_i(y + {ij}, x, ε) > u_i(y, x, ε)  then  u_j(y + {ij}, x, ε) < u_j(y, x, ε).

Under this equilibrium notion, if equilibria exist multiplicity is likely; see, among others, the examples in Graham (2015, p. 475), de Paula (2017, p. 301), and Sheng (2018, Example 3.1). The model is therefore incomplete, because it does not specify how an equilibrium is selected in the region of multiplicity. For the same reasons as discussed in the context of finite games in Section 3.2, partial identification attains (unless one is willing to impose restrictions on the equilibrium selection mechanism). However, as I explain below, an immediate application of the identification analysis carried out there presents enormous practical challenges, because there are 2^{n(n−1)/2} possible network configurations to be checked for stability (and the dimensionality of the space of unobservables is also very large). In what follows I consider two distinct frameworks that make different assumptions about the utility function and how the data is generated, and discuss what can be learned about the parameters of interest in these cases.

I first consider the case that the researcher observes data from multiple independent networks. I follow the set-up put forward by Sheng (2018).
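For small n, the pairwise stability conditions above can be checked by brute force over all 2^{n(n−1)/2} candidate networks. The sketch below does so for a toy utility in the spirit of (3.41); all payoff numbers (the link benefits b, the externality parameters d2 and d3) are illustrative assumptions, not values from the text.

```python
# Enumerate pairwise stable networks for n = 3 by checking the two stability
# conditions at every candidate adjacency matrix. Toy numbers throughout.
import itertools
import numpy as np

n = 3
b = np.full((n, n), -0.1)        # hypothetical marginal link benefits f + eps
np.fill_diagonal(b, 0.0)
d2, d3 = 0.0, 0.5                # hypothetical externality parameters

def utility(i, y):
    # Linear-in-links utility in the spirit of (3.41); here n - 2 = 1.
    u = sum(y[i, j] * b[i, j] for j in range(n))
    u += d2 / (n - 2) * sum(y[i, j] * y[j, k] for j in range(n)
                            for k in range(n) if k not in (i, j))
    u += d3 / (n - 2) * sum(y[i, j] * y[i, k] * y[j, k]
                            for j in range(n) for k in range(j + 1, n))
    return u

def is_pairwise_stable(y):
    for i, j in itertools.combinations(range(n), 2):
        ya = y.copy()
        ya[i, j] = ya[j, i] = 1 - y[i, j]      # toggle the link ij
        if y[i, j] == 1:
            # neither linked agent may prefer severing the link
            if utility(i, y) < utility(i, ya) or utility(j, y) < utility(j, ya):
                return False
        else:
            # no missing link may be mutually beneficial
            gi = utility(i, ya) - utility(i, y)
            gj = utility(j, ya) - utility(j, y)
            if (gi > 0 and gj >= 0) or (gj > 0 and gi >= 0):
                return False
    return True

pairs = list(itertools.combinations(range(n), 2))
stable = []
for bits in itertools.product([0, 1], repeat=len(pairs)):
    y = np.zeros((n, n), dtype=int)
    for (i, j), v in zip(pairs, bits):
        y[i, j] = y[j, i] = v
    if is_pairwise_stable(y):
        stable.append(y)

# With these numbers both the empty and the complete network are stable,
# illustrating the multiplicity discussed in the text.
assert len(stable) == 2
```

With slightly negative stand-alone link benefits and a positive triangle externality, the empty and the complete network coexist as equilibria, which is exactly the incompleteness the text emphasizes.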
Identification Problem 3.8: Let there be n ∈ {2, 3, ...}, n < ∞ agents, and let (x, y) ∼ P be observable random variables in ×_{j=1}^n R^d × {0,1}^{n(n−1)/2}, d < ∞. Suppose that y is a pairwise stable network. For each agent i, let the utility function be known up to a finite dimensional parameter δ ∈ Δ ⊂ R^p, and given by

u_i(y, x, ε; δ) = Σ_{j=1}^n y_ij (f(x_i, x_j; δ_1) + ε_ij) + δ_2 Σ_{j=1}^n Σ_{k=1, k≠i}^n y_ij y_jk / (n − 2) + δ_3 Σ_{j=1}^n Σ_{k=j+1}^n y_ij y_ik y_jk / (n − 2),   (3.41)

with f(·, ·; ·) a continuous function of its arguments. Suppose that the ε_ij are independent for all i ≠ j and identically distributed with CDF known up to parameter vector γ ∈ Γ ⊂ R^m, denoted F_γ. Assume that the support of F_γ is R, that F_γ is absolutely continuous with respect to Lebesgue measure, and continuously differentiable with respect to γ ∈ Γ. Let Θ = Δ × Γ. Assume that the researcher observes a random sample of networks and observable payoff shifters drawn from P. In the absence of additional information, what can the researcher learn about θ ≡ [δ_1 δ_2 δ_3 γ]? △

Sheng (2018) analyzes this problem. She establishes equilibrium existence provided that δ_2 ≥ 0 and δ_3 ≥ 0. Given payoff shifters (x, ε) and parameters ϑ ≡ [δ̃_1 δ̃_2 δ̃_3 γ̃] ∈ Θ, let Y_ϑ(x, ε) denote the collection of pairwise stable networks implied by the model. It is easy to show that Y_ϑ(x, ε) is a random closed set as in Definition A.1. The networks in Y_ϑ(x, ε) are n × n symmetric adjacency matrices with diagonal elements equal to zero and off-diagonal elements in {0,1}. To ease notation, I omit Y_ϑ's dependence on (x, ε) in what follows. Under the assumption that y is a pairwise stable network, at the true data generating value of θ ∈ Θ, one has

y ∈ Y_θ  a.s.   (3.42)

Equation (3.42) exhausts the modeling content of Identification Problem 3.8. Theorem A.1 can be leveraged to extract its empirical content from the observed distribution P(y, x). Let Y be the collection of n × n symmetric matrices with diagonal elements equal to zero and all other entries in {0,1}, so that |Y| = 2^{n(n−1)/2}. For a given set K ⊂ Y, let T_{Y_ϑ}(K; F_γ) denote the probability of the event {Y_ϑ ∩ K ≠ ∅} implied when ε ∼ F_γ, x-a.s.

Theorem SIR-3.7: Under the assumptions of Identification Problem 3.8, the sharp identification region for θ is

H_P[θ] = {ϑ ∈ Θ : P(y ∈ K | x) ≤ T_{Y_ϑ}(K; F_γ̃) ∀ K ⊂ Y, x-a.s.}.   (3.43)

Proof: Follows from similar arguments as for the proof of Theorem 3.3 on p. 58.

[Footnote: The effects of having friends in common and of friends of friends in (3.41) are normalized by n − 2. This enforces that the marginal utility that i receives from linking with j is affected by j having an additional link with k to a smaller degree as n grows. This does not result in diminishing network effects.]
[Footnote: With transferable utility, Sheng (2018, Proposition 2.1) establishes existence for any δ_2, δ_3 ∈ R. See Hellmann (2013) for an earlier analysis of existence and uniqueness of pairwise stable networks.]

The characterization of H_P[θ] in Theorem SIR-3.7 is new to this chapter. While technically it entails a finite number of conditional moment inequalities, in practice their number can be prohibitive, as it can be as large as 2^{2^{n(n−1)/2}} − 2. Even using only a subset of the inequalities in (3.43) to obtain an outer region, for example applying the insights in Ciliberto and Tamer (2009), may not be practical (with n = 20, |Y| = 2^{190} ≈ 10^{57}). Moreover, computation of T_{Y_ϑ}(K; F_γ) may require (depending on the set K) evaluation of rather complex integrals. To circumvent these challenges, Sheng (2018) proposes to analyze network formation through subnetworks. A subnetwork is the restriction of a network to a subset of the agents (i.e., a subset of nodes and the links between them). For given A ⊆ {1, 2, ..., n}, let y_A = {y_ij}_{i,j ∈ A, i≠j} be the submatrix in y with rows and columns in A, and let y_{−A} be the remaining elements of y after y_A is deleted. With some abuse of notation, let (y_A, y_{−A}) denote the composition of y_A and y_{−A} that returns y. Recall that Y_ϑ ≡ Y_ϑ(x, ε), and let

Y_ϑ^A = {y_A ∈ {0,1}^{|A|} : ∃ y_{−A} ∈ {0,1}^{|−A|} such that (y_A, y_{−A}) ∈ Y_ϑ}

be the collection of subnetworks with rows and columns in A that can be part of a pairwise stable network in Y_ϑ. Let x_A denote the subset of x collecting x_ij for i, j ∈ A.
For a given y_A ∈ {0,1}^{|A|}, let C_{Y_ϑ^A}(y_A; F_γ) and T_{Y_ϑ^A}(y_A; F_γ) denote, respectively, the probability of the events {Y_ϑ^A = {y_A}} and {y_A ∈ Y_ϑ^A} implied when ε ∼ F_γ, x-a.s. The first event means that only the subnetwork y_A is part of a pairwise stable network, while the second event means that y_A is a possible subnetwork that is part of a pairwise stable network but other subnetworks may be part of it too. Sheng (2018, Proposition 4.1) provides the following outer region for θ by adapting the insight in Ciliberto and Tamer (2009) to subnetworks. In the theorem I abuse notation compared to Table 1.1 by introducing a superscript, A, to make explicit the dependence of the outer region on it.

Theorem OR-3.1: Under the assumptions of Identification Problem 3.8, for any A ⊆ {1, 2, ..., n}, an A-dependent outer region for θ is

O_P^A[θ] = {ϑ ∈ Θ : C_{Y_ϑ^A}(y_A; F_γ̃) ≤ P(y_A | x_A) ≤ T_{Y_ϑ^A}(y_A; F_γ̃) ∀ y_A ∈ Y^A, x_A-a.s.},   (3.44)

where Y^A is the collection of |A| × |A| symmetric matrices with diagonal elements equal to zero and all other elements in {0,1}, so that |Y^A| = 2^{|A|(|A|−1)/2}.

[Footnote: Gualdani (2019) has previously used Theorem D.1 in Beresteanu, Molchanov, and Molinari (2011), as I do here, to characterize sharp identification regions in unilateral and bilateral directed network formation games.]
[Footnote: This number may be reduced drastically using the notion of core determining class of sets, see Definition A.8 and the discussion on p. 117. Nonetheless, even with relatively few agents, the number of inequalities in (3.43) may remain overwhelming.]

Proof: Let u(ỹ | Y_ϑ) be a random variable in the unit simplex in R^{2^{n(n−1)/2}} which assigns to each possible pairwise stable network ỹ that may realize given (x, ε) and ϑ ∈ Θ the probability that it is selected from Y_ϑ. Given y ∈ Y, denote by M(y | x) the model predicted probability that the network realizes equal to y.
Then the model yields

M(y | x) = ∫ u(y | Y_ϑ) dF_γ = ∫_{y ∈ Y_ϑ, |Y_ϑ|=1} dF_γ + ∫_{y ∈ Y_ϑ, |Y_ϑ|≥2} u(y | Y_ϑ) dF_γ.   (3.45)

The model implied distribution for subnetwork y_A is obtained by taking the marginal of expression (3.45) with respect to y_{−A}:

M(y_A | x) = Σ_{y_{−A}} M((y_A, y_{−A}) | x) = ∫_{y_A ∈ Y_ϑ^A, |Y_ϑ^A|=1} dF_γ + ∫_{y_A ∈ Y_ϑ^A, |Y_ϑ^A|≥2} Σ_{y_{−A}} u((y_A, y_{−A}) | Y_ϑ) dF_γ.   (3.46)

Replacing u in (3.46) with zero and one yields the bounds in (3.44).

Sheng (2018, Section 4.2) further assumes that the selection mechanism u(ỹ | Y_ϑ) is invariant to permutations of the labels of the players. Under this condition and the maintained assumptions on ε, she shows that the inequalities in (3.44) are invariant under permutations of labels, so subnetworks in any two subsets A, A′ ⊆ {1, 2, ..., n} with |A| = |A′| and x_A = x_{A′} yield the same inequalities for all y_A = y_{A′}. It is therefore sufficient to consider subnetwork A and the inequalities in (3.44) associated with it. Leveraging this result, Sheng proposes an outer region obtained by looking at unlabeled subnetworks of size |A| ≤ ā and given by

O_P[θ] = ∩_{|A| ≤ ā} O_P^A[θ].

As long as the subnetworks are chosen to be small, e.g., |A| ≤ 2, 3, 4, the inequalities in (3.44) can be computed even if the network is large. Sheng (2018) shows that the inequalities in (3.44) remain informative as n grows. This fact highlights the importance of working with subnetworks. One could have applied the insight of Ciliberto and Tamer (2009) directly to the full network by setting u equal to zero and to one in (3.45). The resulting bounds, however, would vanish to zero as n grows and become uninformative for θ. The characterization in Theorem OR-3.1 can be refined to obtain a smaller region, adapting the results in Beresteanu, Molchanov, and Molinari (2011, Supplementary Appendix Theorem D.1) to subnetworks. The size of this refined region is weakly decreasing in |A|. However, the refinement does not yield H_P[θ] because it is applied only to subnetworks.

[Footnote: The idea of using random set methods on subnetworks to obtain the refined region was put forward in an earlier version of Sheng (2018). She provided a proof that the refined region's size decreases weakly in |A|.]

Key Insight: At the beginning of this section I highlighted some key challenges to inference in network formation models. Identification Problem 3.8 bypasses the concern on the dependence among linking decisions through the independence assumption on ε_ij and the presumption that the researcher observes data from multiple independent networks, which allows for identification of P(y, x). Sheng (2018) takes on the remaining challenges by formally establishing equilibrium existence and allowing for unrestricted selection among multiple equilibria. In order to overcome the computational complexity of the problem, she puts forward the important idea of inference based on subnetworks. While of course information is left on the table, the approach remains feasible even with large networks.

Miyauchi (2016) considers a framework similar to the one laid out in Identification Problem 3.8.
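The mechanics of the bounds in (3.44) can be mimicked numerically. The sketch below approximates the probabilities C and T by Monte Carlo for a single-pair subnetwork A = {0, 1} in a three-agent design; the parameter values and the Gaussian taste shocks are illustrative assumptions, not Sheng's (2018) specification.

```python
# Monte Carlo sketch of the subnetwork bounds in (3.44): simulate eps, find
# all pairwise stable networks, and record whether y_A = 1 is a possible
# (upper bound T) or the unique (lower bound C) stable subnetwork on A.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, f, d2, d3 = 3, -0.3, 0.0, 0.5          # hypothetical parameter value
pairs = list(itertools.combinations(range(n), 2))

def u(i, y, eps):
    val = sum(y[i, j] * (f + eps[i, j]) for j in range(n) if j != i)
    val += d2 / (n - 2) * sum(y[i, j] * y[j, k] for j in range(n)
                              for k in range(n) if k not in (i, j))
    val += d3 / (n - 2) * sum(y[i, j] * y[i, k] * y[j, k]
                              for j in range(n) for k in range(j + 1, n))
    return val

def is_stable(y, eps):
    for i, j in pairs:
        ya = y.copy()
        ya[i, j] = ya[j, i] = 1 - y[i, j]
        if y[i, j] == 1:
            if u(i, y, eps) < u(i, ya, eps) or u(j, y, eps) < u(j, ya, eps):
                return False
        else:
            gi = u(i, ya, eps) - u(i, y, eps)
            gj = u(j, ya, eps) - u(j, y, eps)
            if (gi > 0 and gj >= 0) or (gj > 0 and gi >= 0):
                return False
    return True

def stable_set(eps):
    nets = []
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        y = np.zeros((n, n), dtype=int)
        for (i, j), v in zip(pairs, bits):
            y[i, j] = y[j, i] = v
        if is_stable(y, eps):
            nets.append(y)
    return nets

R = 300
hits_C = hits_T = 0
for _ in range(R):
    eps = rng.normal(size=(n, n))
    YA = {net[0, 1] for net in stable_set(eps)}   # stable subnetworks on A
    hits_T += 1 in YA       # y_A = 1 can be part of some stable network
    hits_C += YA == {1}     # y_A = 1 is the only stable possibility on A
C, T = hits_C / R, hits_T / R
assert 0.0 <= C <= T <= 1.0   # the bounds bracket the observed frequency
print(round(C, 3), round(T, 3))
```

By construction the containment-type probability C never exceeds the capacity-type probability T, which is what allows (3.44) to bracket the observed conditional subnetwork frequency.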
Miyauchi assumes non-negative externalities, and shows that in this case the set of pairwise stable equilibria is a complete lattice with a smallest and a largest equilibrium. He then uses moment functions that are monotone in the pairwise stable network (so that they take their extreme values at the smallest and largest equilibria), to obtain moment conditions that restrict θ. Examples of the moment functions used include the proportion of pairs with a link, the proportion of links belonging to triangles, and many more (see Miyauchi, 2016, Table 1).

Gualdani (2019) considers unilateral and bilateral directed network formation games, still under a sampling framework where the researcher observes many independent networks. The equilibrium notion that she uses is pure strategy Nash. She assumes that the payoff that player i receives from forming link ij is allowed to depend on the number of additional players forming a link pointing to j, but rules out other spillover effects. Under this assumption and some regularity conditions, Gualdani shows that the network formation game can be decomposed into local games (i.e., games whose sets of players and strategy profiles are subsets of the network formation game's ones), so that the network formation game is in equilibrium if and only if each local game is in equilibrium. She then obtains a characterization of H_P[θ] using elements of random set theory.

When the researcher observes data from a single network, extra care has to be taken to restrict the dependence among linking decisions. This can be done in various ways (see, e.g., Chandrasekhar, 2016, for some examples). Here I consider a framework proposed by de Paula, Richards-Shubik, and Tamer (2018).
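Miyauchi's lattice insight can be sketched with a monotone "mutual consent" map: under non-negative externalities the map is monotone, so iterating it from the empty network and from the complete network reaches, respectively, the smallest and largest pairwise stable networks, and any monotone moment function is bracketed by its values at these two extremes. All parameter values and shock draws below are illustrative assumptions.

```python
# Sketch of the monotone-map construction behind the lattice result: a link
# survives iff both agents weakly gain from it given the rest of the network.
import numpy as np

rng = np.random.default_rng(1)
n, f, d2, d3 = 4, -0.4, 0.3, 0.3          # toy parameters, externalities >= 0
eps = rng.normal(size=(n, n))             # illustrative taste shocks

def gain(i, j, y):
    # marginal utility to i of the link ij, given the rest of y
    ext2 = d2 / (n - 2) * sum(y[j, k] for k in range(n) if k not in (i, j))
    ext3 = d3 / (n - 2) * sum(y[i, k] * y[j, k]
                              for k in range(n) if k not in (i, j))
    return f + eps[i, j] + ext2 + ext3

def step(y):
    out = np.zeros_like(y)
    for i in range(n):
        for j in range(i + 1, n):
            if gain(i, j, y) >= 0 and gain(j, i, y) >= 0:   # mutual consent
                out[i, j] = out[j, i] = 1
    return out

def fixed_point(y):
    # monotone iteration: increasing from the empty network, decreasing from
    # the complete one, so termination is guaranteed on a finite lattice
    while True:
        y2 = step(y)
        if (y2 == y).all():
            return y
        y = y2

y_min = fixed_point(np.zeros((n, n), dtype=int))
y_max = fixed_point(np.ones((n, n), dtype=int) - np.eye(n, dtype=int))
# A moment monotone in y (here, the number of links) is bracketed by its
# values at the two extremal equilibria.
assert y_min.sum() <= y_max.sum()
print(y_min.sum() // 2, y_max.sum() // 2)
```

Fixed points of this map satisfy the pairwise stability conditions stated earlier: present links give both agents a non-negative gain, and absent links fail to benefit at least one agent.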
Identification Problem 3.9: Let there be a continuum of agents j ∈ I = [0, µ], with µ > 0. Let y : I × I → {0,1} be such that y_jk = 1 if nodes j and k are linked, and y_jk = 0 otherwise. Assume that only connections up to distance d̄ affect utility and that preferences are such that agents never choose to form more than a total of l̄ links. To simplify exposition, let d̄ = 2. Let each agent j be endowed with characteristics x_j ∈ X, with X a finite set in R^p, that are observable to the researcher. Additionally, let each agent j be endowed with l̄ × |X| preference shocks ε_{jℓ}(x) ∈ R, ℓ = 1, ..., l̄, x ∈ X, that are unobservable to the researcher and correspond to the possible direct connections and their characteristics. Suppose that the vector of preference shocks is independent of x and has a distribution known up to parameter vector γ ∈ Γ ⊂ R^m, denoted Q_γ. Let I(j) = {k : y_jk = 1}. Assume that agents with characteristics and preference shocks (x, e) value links according to the utility function

u_j(y, x, e) = Σ_{k ∈ I(j)} (f(x_j, x_k) + e_{jℓ(k)}(x_k)) + δ_1 |∪_{k ∈ I(j)} I(k) − I(j) − {j}| + δ_2 Σ_{k ∈ I(j)} Σ_{m ∈ I(j): m > k} y_km − ∞ · 1(|I(j)| > l̄).   (3.47)

Assume that the network y formed by agents with characteristics and shocks (x, ε) is pairwise stable. Let Θ ≡ Υ × Δ × Γ, with Υ the parameter space for f ≡ {f(x, w) : x ∈ X, w ∈ X}. In the absence of additional information, what can the researcher learn about θ ≡ [f δ_1 δ_2 γ]? △

[Footnote: This approach exploits supermodularity, and is related to Jia (2008) and Echenique (2005).]
[Footnote: This is an approximation to a framework with a large but finite number of agents. The utility function can be less restrictive than the one considered here (see Assumptions 1 and 2 in de Paula, Richards-Shubik, and Tamer, 2018).]
Identification Problem 3.9 enforces dimension reduction through the restrictions on depth and degree (the bounds d̄ and l̄), so that it is applicable to frameworks with networks that have limited degree distribution (e.g., a close friendships network, but not the Facebook network). It also requires that individual identities are irrelevant. This substantially reduces the richness of unobserved heterogeneity allowed for and the dimensionality of the space of unobservables. While the latter feature narrows the domain of applicability of the model, it is very beneficial to obtain a tractable characterization of what can be learned about θ, and yields equilibria that may include isolated nodes, a feature often encountered in networks data.

de Paula, Richards-Shubik, and Tamer (2018) study Identification Problem 3.9 focusing on the payoff-relevant local subnetworks that result from the maintained assumptions. These are distinct from the subnetworks used by Sheng (2018): whereas Sheng looks at subnetworks formed by arbitrary individuals and whose size is chosen by the researcher on the base of computational tractability, de Paula, Richards-Shubik, and Tamer look at subnetworks among individuals that are within a certain distance of each other, as determined by the structure of the preferences. On the other hand, Sheng's 2018 analysis does not require that agents have a finite number of types nor bounds the number of links that they may form.

To characterize the local subnetworks relevant for identification analysis in their framework, de Paula, Richards-Shubik, and Tamer introduce the notions of network type and preference class. A network type t = (a, v) describes the local network up to distance d̄ from the reference node.

[Footnote: The distance measure used here is the shortest path between two nodes.]
[Footnote: Under this assumption, the preference shocks do not depend on the individual identities of the agents. Hence, if agents k and m have the same observable characteristics, then j is indifferent between them.]
Here a is a square matrix of size 1 + l̄ Σ_{d=1}^{d̄} (l̄ − 1)^{d−1} that describes the local subnetwork that is utility relevant for an agent of type t. It consists of the reference node, its direct potential neighbors (l̄ elements), its second order neighbors (l̄(l̄ − 1) elements), ..., and its d̄-th order neighbors (l̄(l̄ − 1)^{d̄−1} elements). The other component of the type, v, is a vector of length equal to the size of a that contains the observable characteristics of the reference node and her alters. The bounds d̄ and l̄ enforce dimension reduction by bounding the number of network types. The partial identification approach of de Paula, Richards-Shubik, and Tamer depends on this number, rather than on the number of agents. For example, the number of moment inequalities is determined by the number of network types, not by the number of agents. As such, the approach yields its highest dividends for dimension reduction in large networks.

Let T denote the collection of network types generated from a preference structure u and set of characteristics X. For given realization (x, e) of the observable characteristics and preference shocks of a reference agent, and for given ϑ ∈ Θ, define the collection of network types for which no agent wants to drop a link by

H_ϑ(x, e) = {(a, v) ∈ T : v_1 = x and u(a, v, e) ≥ u(a_{−ℓ}, v, e) ∀ ℓ = 1, ..., l̄},

where a_{−ℓ} is equal to the local adjacency matrix a but with the ℓ-th link removed (that is, it sets the (1, ℓ+1) and (ℓ+1, 1) elements of a equal to zero). Because (x, ε) are random vectors, H_ϑ ≡ H_ϑ(x, ε) is a random closed set as per Definition A.1. This random set takes on a finite number of realizations (equal to the possible subsets of T), so that its distribution is completely determined by the probability with which it takes on each of these realizations. A preference class H ⊂ T is one of the possible realizations of H_ϑ for some ϑ ∈ Θ. The model implied probability that H_ϑ = H is given by

M(H | x; ϑ) ≡ Q_γ̃(ε : H_ϑ = H | x).   (3.48)

Observation of data from one network allows the researcher, under suitable restrictions on the sampling process, to learn the distribution of network types in the data (type shares), denoted P(t). For example, in a network of best friends with l̄ = 1 and d̄ = 2, and X = {b, w} (e.g., a simplified framework with only two possible races), agents are either isolated or in a pair. Network types are pairs for the agent's race and the best friend's race (with second element equal to zero if the agent is isolated). Type shares are the fraction of isolated blacks, the fraction of isolated whites, the fraction of blacks with a black best friend, the fraction of whites with a black best friend, and the fraction of whites with a white best friend. The preference classes for a black agent are H_1(b, e) = {(b, 0)}, H_2(b, e) = {(b, 0), (b, b)}, H_3(b, e) = {(b, 0), (b, w)}, and H_4(b, e) = {(b, 0), (b, w), (b, b)} (and similarly for whites). In each case, being alone is part of the preference class, as there are no links to sever. In the second class the agent has a preference for having a black friend, in the third class for a white friend, and in the last class for a friend of either race.

[Footnote: Full observation of the network is not required (and in practice it often does not occur). Sampling uncertainty results from it because in this model there is a continuum of agents.]
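The best-friends example lends itself to a short simulation of the preference class probabilities M(H | x; ϑ) in (3.48) for a black reference agent: the class always contains (b, 0), and it contains (b, b) or (b, w) whenever the corresponding systematic utility plus shock is non-negative. The values of f and the standard normal shock distribution below are illustrative assumptions.

```python
# Simulate the probabilities of the four preference classes H_1,...,H_4 for a
# black reference agent in the best-friends example (l_bar = 1, d_bar = 2).
import numpy as np

rng = np.random.default_rng(2)
f = {('b', 'b'): 0.2, ('b', 'w'): -0.1}   # hypothetical systematic utilities
R = 50_000
counts = {}
for _ in range(R):
    e = {'b': rng.normal(), 'w': rng.normal()}
    # (b, 0) is always in the class: an isolated agent has no link to sever.
    H = [('b', '0')]
    for race in ('b', 'w'):
        if f[('b', race)] + e[race] >= 0:   # would keep a friend of this race
            H.append(('b', race))
    key = tuple(H)
    counts[key] = counts.get(key, 0) + 1

M = {H: c / R for H, c in counts.items()}   # simulated M(H | x = b; vartheta)
assert abs(sum(M.values()) - 1.0) < 1e-9    # class probabilities sum to one
for H, p in sorted(M.items()):
    print(H, round(p, 3))
```

Because the shocks have full support, all four classes arise with positive probability, which is precisely the source of the model's incompleteness discussed next.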
It is easy to see that the model is incomplete, as for a given realization of ε it makes multiple predictions on the agent's network type. de Paula, Richards-Shubik, and Tamer propose to map the distribution of preference classes into the observed distribution of network types in the data through the use of allocation parameters, denoted α_H(t) ∈ [0,1], which give the probability of observing network type t given H_ϑ = H. The model, augmented with them, implies a probability that an agent is of network type t:

M(t; ϑ, α) = (1/µ) Σ_{H ⊂ T} µ_{v(t)} M(H | v(t); ϑ) α_H(t),   (3.49)

where µ_{v(t)} is the measure of reference agents with characteristics equal to the second component of the network type t, x = v(t), and α ≡ {α_H(t) : t ∈ T, H ⊂ T}.

de Paula, Richards-Shubik, and Tamer provide a characterization of an outer region for θ based on two key implications of pairwise stability that deliver restrictions on α. They also show that under some additional assumptions, this characterization yields H_P[θ] (de Paula, Richards-Shubik, and Tamer, 2018, Appendix B). Here I focus on their more general result. The first implication that they use is that existing links should not be dropped:

t ∉ H ⇒ α_H(t) = 0.   (3.50)

The condition in (3.50) is embodied in ᾱ ≡ {α_H(t) : t ∈ H, H ⊂ T}. The second implication is that it should not be possible to establish mutually beneficial links among nodes that are far from each other. Let t′ and s′ denote the network types that are generated if one adds a link in networks of types t and s among two nodes that are at distance at least 2d̄ from each other and each have less than l̄ links.
Then the requirement is

(Σ_{H ⊂ T} µ_{v(t)} M(H | v(t); ϑ) α_H(t) 1(t′ ∈ H)) × (Σ_{H ⊂ T} µ_{v(s)} M(H | v(s); ϑ) α_H(s) 1(s′ ∈ H)) = 0.   (3.51)

In words, if a positive measure of agents of type t prefer t′ (i.e., α_H(t) > 0 for some H such that t′ ∈ H), there must be zero measure of type s individuals who prefer s′, because otherwise the network is unstable. de Paula, Richards-Shubik, and Tamer show that the conditions in (3.51) can be embodied in a square matrix q of size equal to the length of ᾱ. The entries of q are constructed as follows. Let H and H̃ be two preference classes with t ∈ H and s ∈ H̃. With some abuse of notation, let q_{α_H(t), α_H̃(s)} denote the element of q corresponding to the index of the entry in ᾱ equal to α_H(t) for the row, and to α_H̃(s) for the column. Then set q_{α_H(t), α_H̃(s)}(ϑ) = 1(t′ ∈ H) 1(s′ ∈ H̃). It follows that this element yields the term (α_H(t) 1(t′ ∈ H))(α_H̃(s) 1(s′ ∈ H̃)) in the quadratic form ᾱ′ q ᾱ. As long as µ_{v(·)} and M(· | x; ϑ) in (3.48) are strictly positive, this term is equal to zero if and only if condition (3.51) holds for types t and s.

With this background, Theorem OR-3.2 below provides an outer region for θ. The proof of this result follows from the arguments laid out above (see de Paula, Richards-Shubik, and Tamer, 2018, Theorems 1 and 2, for the full details).

Theorem OR-3.2: Under the assumptions of Identification Problem 3.9,

O_P[θ] = {ϑ ∈ Θ : min_ᾱ ᾱ′ q ᾱ = 0 s.t. M(t; ϑ, ᾱ) = P(t) ∀ t ∈ T; Σ_{t ∈ H} ᾱ_H(t) = 1 ∀ H ⊂ T; ᾱ_H(t) ≥ 0 ∀ t ∈ H, ∀ H ⊂ T}.   (3.52)

The set in (3.52) does not equal H_P[θ] in all models allowed for in Identification Problem 3.9 because condition (3.51) does not embody all implications of pairwise stability on non-existing links.
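The feasibility check behind (3.52) can be seen on a deliberately tiny instance: a candidate ϑ survives only if allocations satisfying the constraints can drive the quadratic form to zero. The classes, types, and the matrix q below are toy assumptions, and a grid search stands in for a proper quadratic-programming solver.

```python
# Toy version of the check in (3.52) with a 2-entry allocation vector.
import numpy as np

# Case 1: each class contains a single network type, so the adding-up
# constraint sum_{t in H} alpha_H(t) = 1 pins down alpha = (1, 1).
q = np.array([[0, 1], [1, 0]])     # instability flags: t' in H1 and s' in H2
alpha = np.array([1.0, 1.0])
crit1 = alpha @ q @ alpha          # strictly positive: this vartheta rejected

# Case 2: class H1 = {t, t0} contains a second type t0 that does not enter q,
# freeing alpha_H1(t). A grid search over the simplex drives the quadratic
# form to zero, so this vartheta survives and belongs to the outer region.
grid = np.linspace(0.0, 1.0, 101)
crit2 = min(np.array([a, 1.0]) @ q @ np.array([a, 1.0]) for a in grid)

assert crit1 > 0.0 and crit2 == 0.0
print(crit1, crit2)
```

The example also makes visible why the problem need not be convex: q here has eigenvalues of both signs, so in larger instances a global, rather than local, minimum of the quadratic form must be verified.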
While the optimization problem in (3.52) is quadratic, it is not necessarily convex because q may not be positive definite. Nonetheless, the simulations reported by de Paula, Richards-Shubik, and Tamer suggest that O_P[θ] can be computed rapidly, at least for the examples they considered.

Key Insight: At the beginning of this section I highlighted some key challenges to inference in network formation models. When data is observed from a single network, as in Identification Problem 3.9, de Paula, Richards-Shubik, and Tamer's 2018 proposal to base inference on local networks achieves two main benefits. First, it delivers consistently estimable features of the game, namely the probability that an agent belongs to one of a finite collection of network types. Second, it achieves dimension reduction, so that computation of outer regions on θ remains feasible even with large networks and allowing for unrestricted selection among multiple equilibria.

[Footnote: The possibility that µ_{v(·)} or M(· | x; ϑ) are equal to zero can be accommodated by setting q_{α_H(t), α_H̃(s)}(ϑ) = (µ_{v(t)} M(H | v(t); ϑ) 1(t′ ∈ H))(µ_{v(s)} M(H̃ | v(s); ϑ) 1(s′ ∈ H̃)). However, in that case q depends on ϑ and its computational cost increases.]

In order to discuss the partial identification approach to learning structural parameters of economic models in some level of detail while keeping this chapter to a manageable length, I

For example, Ho (2009) uses it to model the formation of the hospital networks offered by US health insurers, and Ho, Ho, and Mortimer (2012) and Lee (2013) use it to obtain bounds on firm fixed costs as an input to modeling product choices in the movie industry and in the US video game industry, respectively. Holmes (2011) estimates the effects of Wal-Mart's strategy of creating a high density network of stores.
While the close proximity of stores implies cannibalization in sales, Wal-Mart is willing to bear it to achieve density economies, which in turn yield savings in distribution costs. His results suggest that Wal-Mart substantially benefits from high store density. Ellickson, Houghton, and Timmins

[Footnote: Statistical inference in these papers is often carried out using the methods proposed by Chernozhukov, Hong, and Tamer (2007), Beresteanu and Molinari (2008), and Andrews and Soares (2010). Model specification tests, if carried out, are based on the method proposed by Bugni, Canay, and Shi (2015). See Sections 4.3 and 5, respectively, for a discussion of confidence sets and specification tests.]

They allow for unobserved heterogeneity in preferences (that may enter the utility function non-separably) and leave completely unspecified their distribution. The authors use revealed preference arguments to infer, for each household, a set of values for its unobserved heterogeneity terms that are consistent with the household's choices in the three lines of insurance coverage. As their core restriction, they assume that each household's preferences are stable across contexts: the household's utility function is the same when facing distinct but closely related choice problems. This allows them to use the inferred set valued data to partially identify features of the distribution of preferences, and to classify households into preference types. They apply their proposed method to analyze data on households' deductible choices across three lines of insurance coverage (home all perils, auto collision, and auto comprehensive). Their results show that between 70 and 80 percent of the households make choices that can be rationalized by a model with linear utility and monotone, quadratic, or even linear probability distortions. These probability

[Footnote: Their model is based on the one put forward by Barseghyan, Molinari, O'Donoghue, and Teitelbaum (2013).]
[Footnote: See Barseghyan, Molinari, O'Donoghue, and Teitelbaum (2018) for a review of these and other non-expected utility models in the context of estimation of risk preferences.]
[Footnote: Auto collision coverage pays for damage to the insured vehicle caused by a collision with another vehicle or object, without regard to fault. Auto comprehensive coverage pays for damage to the insured vehicle from all other causes, without regard to fault. Home all perils (or simply home) coverage pays for damage to the insured home from all causes, except those that are specifically excluded (e.g., flood, earthquake, or war).]

The characterization of H_P[θ] is based on using an unrestricted selection mechanism, as in Berry and Tamer (2006) and Ciliberto and Tamer (2009). He applies the model to study the impact of supercenters such as Wal-Mart, that sell both food and groceries, on the profitability of rural grocery stores. He finds that entry by a supercenter outside, but within 20 miles of, a local monopolist's market has a smaller impact on firm profits than entry by a local grocer. Their entrance has a small negative effect on the number of grocery stores in surrounding markets as well as on their profits. The results suggest that location and format-based differentiation partially insulate rural stores from competition with supercenters.

A larger class of information structures is considered in the analysis of static discrete games carried out by Magnolfi and Roncoroni (2017). They allow for all information structures consistent with the players knowing their own payoffs and the distribution of opponents' payoffs. As solution concept they adopt the Bayes Correlated Equilibrium recently developed by Bergemann and Morris (2016). Also with this solution concept multiple equilibria are possible. The authors leave completely unspecified the selection mechanism picking the equilibrium played in the regions of multiplicity, so that partial identification attains.
Magnolfi and Roncoroni use the random sets approach to characterize H_P[θ]. They apply the method to estimate a model of entry in the Italian supermarket industry and quantify the effect of large malls on local grocery stores. Norets and Tang (2014) provide partial identification results (and Bayesian inference methods) for semiparametric dynamic binary choice models without imposing distributional assumptions on the unobserved state variables. They carry out an empirical application using Rust (1987)'s model of bus engine replacement. Their results suggest that parametric assumptions about the distribution of the unobserved states can have a considerable effect on the estimates of per-period payoffs, but not a noticeable one on the counterfactual conditional choice probabilities. Berry and Compiani (2019) use the random sets approach to partially identify and estimate dynamic discrete choice models with serially correlated unobservables, under instrumental variables restrictions. They extend two-step dynamic estimation methods to characterize a set of structural parameters that are consistent with the dynamic model, the instrumental variables restrictions, and the data. Gualdani (2019) uses the random sets approach and a network formation model, to learn about Italian firms' incentives for having their executive directors sitting on the board of their competitors.

Barseghyan, Coughlin, Molinari, and Teitelbaum (2019) use the method described in Section 3.1.3 to partially identify the distribution of risk preferences using data on deductible choices in auto collision insurance. They posit an expected utility theory model and allow for unobserved heterogeneity in households' risk aversion and choice sets, with unrestricted dependence between them.
Motivation for why unobserved heterogeneity in choice sets might be an important factor in this empirical framework comes from the earlier analysis of Barseghyan, Molinari, and Teitelbaum (2016) and novel findings that are part of Barseghyan, Coughlin, Molinari, and Teitelbaum's 2019 contribution. They show that commonly used models that make strong assumptions about choice sets (e.g., the mixed logit model with each individual's choice set assumed equal to the feasible set, and various models of choice set formation) can be rejected in their data. With regard to risk aversion, their key finding is that their estimated lower bounds are significantly smaller than the point estimates obtained in the related literature. This suggests that the data can be explained by expected utility theory with lower and more homogeneous levels of risk aversion than had been uncovered before. This provides new evidence on the importance of developing models that differ in their specification of which alternatives agents evaluate (rather than, or in addition to, models focusing on how they evaluate them), and of data collection efforts that seek to directly measure agents' heterogeneous choice sets (Caplin, 2016). Iaryczower, Shi, and Shum (2018) study the effect of pre-vote deliberation on the decisions of US appellate courts. The question of interest is whether deliberation increases or reduces the probability of an incorrect decision. They use a model where communication equilibrium is the solution concept, and only observed heterogeneity in payoffs is allowed for. In the model, multiple equilibria are again possible, and the authors leave the selection mechanism completely unspecified. They characterize H_P[θ] through an optimization problem. Statistical inference on θ is carried out using Chernozhukov, Chetverikov, and Kato (2018)'s method. Statistical inference on projections of θ is carried out using Kaido, Molinari, and Stoye (2019a)'s method.
They find that deliberation is beneficial for values in H_P[θ] for which judges have ex-ante disagreement or imprecise prior information; otherwise deliberation leads to lower effectiveness for the court. D'Haultfoeuille, Gaillac, and Maurel (2018) propose a test for the hypothesis of rational expectations for the case that one observes only the marginal distributions of realizations and subjective beliefs, but not their joint distribution (e.g., when subjective beliefs are observed in one dataset, and realizations in a different one, and the two cannot be matched). They establish that the hypothesis of rational expectations can be expressed as testing that a continuum of moment inequalities is satisfied, and they leverage the results in Andrews and Shi (2017) to provide a simple-to-compute test for this hypothesis. They apply their method to test for and quantify deviations from rational expectations about future earnings, and examine the consequences of such departures in the context of a life-cycle model of consumption. Tebaldi, Torgovitsky, and Yang (2019) estimate the demand for health insurance under the Affordable Care Act using data from California. Methodologically, they use a discrete choice model that allows for endogeneity in insurance premiums (which enter as explanatory variables in the model) and dispenses with parametric assumptions about the unobserved components of utility, leveraging the availability of instrumental variables, similarly to the framework presented in Section 3.1.2. The authors provide a characterization of sharp bounds on the effects of changing premium subsidies on coverage choices, consumer surplus, and government spending, as solutions to linear programming problems, rendering their method computationally attractive. Another important strand of theoretical literature is concerned with partial identification of panel data models.
Honoré and Tamer (2006) consider a dynamic random effects probit model, and use partial identification analysis to obtain bounds on the model parameters that circumvent the initial conditions problem. Rosen (2012) considers a fixed effect panel data model where he imposes a conditional quantile restriction on time varying unobserved heterogeneity. Differencing out inequalities resulting from the conditional quantile restriction delivers inequalities that depend only on observable variables and parameters to be estimated, but not on the fixed effects, so that they can be used for estimation. Chernozhukov, Fernández-Val, Hahn, and Newey (2013) obtain bounds on average and quantile treatment effects in nonparametric and semiparametric nonseparable panel data models. Khan, Ponomareva, and Tamer (2016) provide partial identification results in linear panel data models with censored outcomes, with unrestricted dependence between censoring and observable and unobservable variables. Their results are derived for two classes of models, one where the unobserved heterogeneity terms satisfy a stationarity restriction, and one where they are nonstationary but satisfy a conditional independence restriction. Torgovitsky (2019a) provides a method to partially identify state dependence in panel data models where individual unobserved heterogeneity need not be time invariant. Pakes and Porter (2016) study semiparametric multinomial choice panel models with fixed effects where the random utility function is assumed additively separable in unobserved heterogeneity, fixed effects, and a linear covariate index. The key semiparametric assumption is a group stationarity condition on the disturbances which places no restrictions on either the joint distribution of the disturbances across choices or the correlation of disturbances across time.
Pakes and Porter propose a within-group comparison that delivers a collection of conditional moment inequalities that they use to provide point and partial identification results. Aristodemou (2019) proposes a related method, where partial identification relies on the observation of individuals whose outcome changes in two consecutive time periods, and leverages shape restrictions to reduce the number of between-alternative comparisons needed to determine the optimal choice. The identification analysis carried out in Sections 2-3 presumes knowledge of the joint distribution P of the observable variables. That is, it presumes that P can be learned with certainty from observation of the entire population. In practice, one observes a sample of size n drawn from P. For simplicity I assume it to be a random sample. Statistical inference on H_P[θ] needs to be conducted using knowledge of P_n, the empirical distribution of the observable outcomes and covariates. Because H_P[θ] is not a singleton, this task is particularly delicate. To start, care is required to choose a proper notion of consistency for a set estimator Ĥ_{P_n}[θ] and to obtain palatable conditions under which such consistency attains. Next, the asymptotic behavior of statistics designed to test hypotheses or build confidence sets for H_P[θ] or for ϑ ∈ H_P[θ] might change with ϑ, creating technical challenges for the construction of confidence sets that are not encountered when θ is point identified. Many of the sharp identification regions derived in Sections 2-3 can be written as collections of vectors ϑ ∈ Θ that satisfy conditional or unconditional moment (in)equalities. For simplicity, I assume that Θ is a compact and convex subset of R^d, and I use the formalization for the case of a finite number of unconditional moment (in)equalities. This assumption is often maintained in the literature. See, e.g., Andrews and Soares (2010) for a treatment of inference with dependent observations.
Epstein, Kaido, and Seo (2016) study inference in games of complete information as in Identification Problem 3.5, imposing the i.i.d. assumption on the unobserved payoff shifters {ε_{1i}, ε_{2i}}_{i=1}^n. The authors note that because the selection mechanism picking the equilibrium played in the regions of multiplicity (see Section 3.2) is left completely unspecified and may be arbitrarily correlated across markets, the resulting observed variables {w_i}_{i=1}^n may not be independent and identically distributed, and they propose an inference method to address this issue.
\[ H_P[\theta] = \{\vartheta \in \Theta : \mathbb{E}_P(m_j(w_i;\vartheta)) \le 0 \ \forall j \in J_1, \ \mathbb{E}_P(m_j(w_i;\vartheta)) = 0 \ \forall j \in J_2\}. \tag{4.1} \]
In (4.1), w_i ∈ W ⊆ R^{d_W} is a random vector collecting all observable variables, with w ∼ P; m_j : W × Θ → R, j ∈ J ≡ J_1 ∪ J_2, are known measurable functions characterizing the model; and J is a finite set equal to {1, ..., |J|}. Instances where H_P[θ] is characterized through a finite number of conditional moment (in)equalities and the conditioning variables have finite support can easily be recast as in (4.1). Consider, for example, the two player entry game model in Identification Problem 3.5 on p. 53, where w = (y_1, y_2, x_1, x_2). Using (in)equalities (3.22)-(3.25) and assuming that the distribution of (x_1, x_2) has k̄ points of support, denoted (x_{1,k}, x_{2,k}), k = 1, ..., k̄, we have |J| = 4k̄ and for k = 1, ..., k̄,
\[ m_{4k-3}(w_i;\vartheta) = \big[\mathbf{1}((y_1,y_2)=(0,0)) - \Phi((-\infty,-x_1 b_1),(-\infty,-x_2 b_2); r)\big]\,\mathbf{1}((x_1,x_2)=(x_{1,k},x_{2,k})) \]
\[ m_{4k-2}(w_i;\vartheta) = \big[\mathbf{1}((y_1,y_2)=(1,1)) - \Phi([-x_1 b_1 - d_1,\infty),[-x_2 b_2 - d_2,\infty); r)\big]\,\mathbf{1}((x_1,x_2)=(x_{1,k},x_{2,k})) \]
\[ m_{4k-1}(w_i;\vartheta) = \big[\mathbf{1}((y_1,y_2)=(0,1)) - \Phi((-\infty,-x_1 b_1 - d_1),(-x_2 b_2,\infty); r)\big]\,\mathbf{1}((x_1,x_2)=(x_{1,k},x_{2,k})) \]
\[ m_{4k}(w_i;\vartheta) = \Big[\mathbf{1}((y_1,y_2)=(0,1)) - \Big\{\Phi((-\infty,-x_1 b_1 - d_1),(-x_2 b_2,\infty); r) - \Phi((-x_1 b_1,-x_1 b_1 - d_1),(-x_2 b_2,-x_2 b_2 - d_2); r)\Big\}\Big]\,\mathbf{1}((x_1,x_2)=(x_{1,k},x_{2,k})). \]
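To make the structure in (4.1) concrete, the following sketch (illustrative only, with made-up data) checks membership of candidate values ϑ in H_P[θ] in the simplest case of an interval-observed outcome, where the two inequality moments are m_1(w; ϑ) = y_L − ϑ and m_2(w; ϑ) = ϑ − y_U:

```python
import numpy as np

# Hypothetical "population": only an interval [yL, yU] containing the
# outcome is observed, and theta = E_P[y]. The moments
#   m1(w; t) = yL - t  and  m2(w; t) = t - yU
# must satisfy E_P(m_j) <= 0, giving H_P[theta] = [E_P yL, E_P yU].
yL = np.array([0.0, 1.0, 2.0])
yU = np.array([2.0, 3.0, 4.0])

def in_identified_set(t, tol=1e-12):
    m1 = np.mean(yL - t)   # E_P(m_1(w; t)) <= 0 ?
    m2 = np.mean(t - yU)   # E_P(m_2(w; t)) <= 0 ?
    return bool(m1 <= tol and m2 <= tol)

# E_P yL = 1 and E_P yU = 3, so H_P[theta] = [1, 3]:
print([in_identified_set(t) for t in (0.5, 1.0, 3.0, 3.5)])  # [False, True, True, False]
```

The entry-game moments above have exactly the same structure, with more involved functions m_j and an additional indicator selecting the support point of the covariates.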
In point identified moment equality models it has been common to conduct estimation and inference using a criterion function that aggregates moment violations (Hansen, 1982b). Manski and Tamer (2002) adapt this idea to the partially identified case, through a criterion function q_P : Θ → R_+ such that q_P(ϑ) = 0 if and only if ϑ ∈ H_P[θ]. Many criterion functions can be used (see, e.g., Manski and Tamer, 2002; Chernozhukov, Hong, and Tamer, 2007; Romano and Shaikh, 2008; Rosen, 2008; Galichon and Henry, 2009; Andrews and Guggenberger, 2009; Andrews and Soares, 2010; Canay, 2010; Romano and Shaikh, 2010). Examples where the set J is a compact set (e.g., a unit ball) rather than a finite set include the case of best linear prediction with interval outcome and covariate data, see characterization (2.27) on p. 27, and the case of entry games with multiple mixed strategy Nash equilibria, see characterization (3.29) on p. 61. A more general continuum of inequalities is also possible, as in the case of discrete choice with endogenous explanatory variables, see characterization (3.14) on p. 45. I refer to Andrews and Shi (2017) and Beresteanu, Molchanov, and Molinari (2011, Supplementary Appendix B) for inference methods in the presence of a continuum of conditional moment (in)equalities. I refer to Khan and Tamer (2009), Andrews and Shi (2013), Chernozhukov, Lee, and Rosen (2013), Lee, Song, and Whang (2013), Armstrong (2014, 2015), Armstrong and Chan (2016), Chernozhukov, Chetverikov, and Kato (2018), and Chetverikov (2018), for inference methods in the case that the conditioning variables have a continuous distribution. In these expressions an index of the form jk not separated by a comma equals the product of j with k.
\[ q_{P,\mathrm{sum}}(\vartheta) = \sum_{j \in J_1} \left[\frac{\mathbb{E}_P(m_j(w_i;\vartheta))}{\sigma_{P,j}(\vartheta)}\right]_+^2 + \sum_{j \in J_2} \left[\frac{\mathbb{E}_P(m_j(w_i;\vartheta))}{\sigma_{P,j}(\vartheta)}\right]^2, \tag{4.2} \]
\[ q_{P,\max}(\vartheta) = \max\left\{\max_{j \in J_1} \left[\frac{\mathbb{E}_P(m_j(w_i;\vartheta))}{\sigma_{P,j}(\vartheta)}\right]_+^2, \ \max_{j \in J_2} \left|\frac{\mathbb{E}_P(m_j(w_i;\vartheta))}{\sigma_{P,j}(\vartheta)}\right|^2\right\}, \tag{4.3} \]
where [x]_+ = max{x, 0} and σ_{P,j}(ϑ) is the population standard deviation of m_j(w_i; ϑ). In (4.2)-(4.3) the moment functions are standardized, as doing so is important for statistical power (see, e.g., Andrews and Soares, 2010, p. 127). To simplify notation, I omit the label and simply use q_P(ϑ). Given the criterion function, one can rewrite (4.1) as
\[ H_P[\theta] = \{\vartheta \in \Theta : q_P(\vartheta) = 0\}. \tag{4.4} \]
To keep this chapter to a manageable length, I focus my discussion of statistical inference exclusively on consistent estimation and on different notions of coverage that a confidence set may be required to satisfy and that have proven useful in the literature. The topics of tests of hypotheses and construction of confidence sets in partially identified models are covered in Canay and Shaikh (2017), who provide a comprehensive survey devoted entirely to them in the context of moment inequality models. Molchanov and Molinari (2018, Chapters 4 and 5) provide a thorough discussion of related methods based on the use of random set theory.
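A direct transcription of the two criterion functions may help fix ideas. The sketch below assumes the squared standardized form of (4.2)-(4.3); its inputs are hypothetical standardized moments E_P(m_j)/σ_{P,j}, not the output of any estimation routine:

```python
import numpy as np

def q_sum(t_ineq, t_eq):
    """Sum criterion (4.2): t_ineq, t_eq hold standardized moments
    E_P(m_j)/sigma_{P,j} for j in J1 (inequalities) and J2 (equalities)."""
    pos = np.maximum(np.asarray(t_ineq), 0.0)   # [x]_+ applied to inequality moments
    return float(np.sum(pos ** 2) + np.sum(np.asarray(t_eq) ** 2))

def q_max(t_ineq, t_eq):
    """Max criterion (4.3); assumes both J1 and J2 are non-empty."""
    pos = np.maximum(np.asarray(t_ineq), 0.0)
    return float(max(np.max(pos ** 2), np.max(np.abs(np.asarray(t_eq)) ** 2)))

print(q_sum([-1.0, 0.5], [0.2]), q_max([-1.0, 0.5], [0.2]))  # approx 0.29 and 0.25
```

Both functions are zero exactly when all inequality moments are non-positive and all equality moments are zero, which is the defining property used in (4.4).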
When the identified object is a set, it is natural that its estimator is also a set. In order todiscuss statistical properties of a set-valued estimator ˆ H P n [ θ ] (to be defined below), and inparticular its consistency, one needs to specify how to measure the distance between ˆ H P n [ θ ]and H P [ θ ]. Several distance measures among sets exist (see, e.g., Molchanov, 2017, AppendixD). A natural generalization of the commonly used Euclidean distance is the Hausdorff dis-tance , see Definition A.6, which for given
A, B ⊂ R^d can be written as
\[ d_H(A, B) = \inf\big\{r > 0 : A \subseteq B^r, \ B \subseteq A^r\big\} = \max\Big\{\sup_{a \in A} d(a, B), \ \sup_{b \in B} d(b, A)\Big\}, \]
with d(a, B) ≡ inf_{b ∈ B} ‖a − b‖. In words, the Hausdorff distance between two sets measures the furthest distance from an arbitrary point in one of the sets to its closest neighbor in the other set. It is easy to verify that d_H metrizes the family of non-empty compact sets; in particular, given non-empty compact sets A, B ⊂ R^d, d_H(A, B) = 0 if and only if A = B. If A or B is empty, d_H(A, B) = ∞. Using the well known duality between tests of hypotheses and confidence sets, the discussion could be re-framed in terms of size of the test. The definition of the Hausdorff distance can be generalized to an arbitrary metric space by replacing the Euclidean metric by the metric specified on that space. The use of the Hausdorff distance to conceptualize consistency of set valued estimators in econometrics was proposed by Hansen, Heaton, and Luttmer (1995, Section 2.4) and Manski and Tamer (2002, Section 3.2). Definition 4.1: An estimator Ĥ_{P_n}[θ] is consistent for H_P[θ] if d_H(Ĥ_{P_n}[θ], H_P[θ]) →_p 0 as n → ∞. Molchanov (1998) establishes Hausdorff consistency of a plug-in estimator of the set {ϑ ∈ Θ : g_P(ϑ) ≤ 0}, with g_P : W × Θ → R a lower semicontinuous function of ϑ ∈ Θ that can be consistently estimated by a lower semicontinuous function g_n uniformly over Θ. The set estimator is {ϑ ∈ Θ : g_n(ϑ) ≤ 0}. The fundamental assumption in Molchanov (1998) is that {ϑ ∈ Θ : g_P(ϑ) ≤ 0} ⊆ cl({ϑ ∈ Θ : g_P(ϑ) < 0}), see Molchanov and Molinari (2018, Section 5.2) for a discussion. There are important applications where this condition holds.
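For finite point sets the sup-inf form of d_H can be computed directly; the sketch below is illustrative only and is not tied to any estimator in the text:

```python
import numpy as np

def directed(A, B):
    """sup_{a in A} d(a, B) for finite point sets stored as rows."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise distances
    return D.min(axis=1).max()

def hausdorff(A, B):
    """d_H(A, B) = max of the two directed distances."""
    return max(directed(A, B), directed(B, A))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0]])
print(hausdorff(A, B))  # 1.0: the point (1, 0) is at distance 1 from B
```

Note that the two directed distances can differ (here directed(A, B) = 1 while directed(B, A) = 0), which is why the maximum of both is needed for a metric.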
Chernozhukov, Kocatulum, and Menzel (2015) provide results related to Molchanov (1998), as well as important extensions for the construction of confidence sets, and show that these can be applied to carry out statistical inference on the Hansen-Jagannathan sets of admissible stochastic discount factors (Hansen and Jagannathan, 1991), the Markowitz-Fama mean-variance sets for asset portfolio returns (Markowitz, 1952), and the set of structural elasticities in Chetty (2012)'s analysis of demand with optimization frictions. However, these methods are not broadly applicable in the general moment (in)equalities framework of this section, as Molchanov's key condition generally fails for the set H_P[θ] in (4.4). Manski and Tamer (2002) extend the standard theory of extremum estimation of point identified parameters to partial identification, and propose to estimate H_P[θ] using the collection of values ϑ ∈ Θ that approximately minimize a sample analog of q_P:
\[ \hat H_{P_n}[\theta] = \Big\{\vartheta \in \Theta : q_n(\vartheta) \le \inf_{\tilde\vartheta \in \Theta} q_n(\tilde\vartheta) + \tau_n\Big\}, \tag{4.5} \]
with τ_n a sequence of non-negative random variables such that τ_n converges in probability to
0. In (4.5), q_n(ϑ) is a sample analog of q_P(ϑ) that replaces E_P(m_j(w_i; ϑ)) and σ_{P,j}(ϑ) in (4.2)-(4.3) with properly chosen sample analogs. It was previously used in the mathematical literature on random set theory, for example to formalize laws of large numbers and central limit theorems for random sets such as the ones in Theorems A.3 and A.4 (Artstein and Vitale, 1975; Giné, Hahn, and Zinn, 1983).
\[ \bar m_{n,j}(\vartheta) \equiv \frac{1}{n}\sum_{i=1}^n m_j(w_i,\vartheta), \quad j = 1, \dots, |J|, \]
\[ \hat\sigma_{n,j}(\vartheta) \equiv \Big(\frac{1}{n}\sum_{i=1}^n [m_j(w_i,\vartheta)]^2 - [\bar m_{n,j}(\vartheta)]^2\Big)^{1/2}, \quad j = 1, \dots, |J|. \]
It can be shown that as long as τ_n = o_p(1), under the same assumptions used to prove consistency of extremum estimators of point identified parameters (e.g., with uniform convergence of q_n to q_P and continuity of q_P on Θ),
\[ \sup_{\vartheta \in \hat H_{P_n}[\theta]} \ \inf_{\tilde\vartheta \in H_P[\theta]} \|\vartheta - \tilde\vartheta\| \to_p 0 \ \text{as} \ n \to \infty. \tag{4.6} \]
This yields that asymptotically each point in Ĥ_{P_n}[θ] is arbitrarily close to a point in H_P[θ], or more formally that P(Ĥ_{P_n}[θ] ⊆ H_P[θ]) converges to
1. I refer to (4.6) as inner consistency henceforth. Redner (1981) provides an early contribution establishing this type of inner consistency for maximum likelihood estimators when the true parameter is not point identified. However, Hausdorff consistency requires also that
\[ \sup_{\vartheta \in H_P[\theta]} \ \inf_{\tilde\vartheta \in \hat H_{P_n}[\theta]} \|\vartheta - \tilde\vartheta\| \to_p 0 \ \text{as} \ n \to \infty, \]
i.e., that each point in H_P[θ] is arbitrarily close to a point in Ĥ_{P_n}[θ], or more formally that P(H_P[θ] ⊆ Ĥ_{P_n}[θ]) converges to
1. To establish this result for the sharp identification regions in Theorem SIR-2.7 (parametric regression with interval covariate) and Theorem SIR-3.1 (semiparametric binary model with interval covariate), Manski and Tamer (2002, Propositions 3 and 5) require the rate at which τ_n →_p 0 to be slower than the rate at which q_n converges uniformly to q_P over Θ. What might go wrong in the absence of such a restriction? A simple example can help understand the issue. Consider a model with linear inequalities of the form
\[ \theta_1 \le \mathbb{E}_P(w_1), \quad -\theta_1 \le \mathbb{E}_P(w_2), \quad \theta_2 \le \mathbb{E}_P(w_3) + \mathbb{E}_P(w_4)\theta_1, \quad -\theta_2 \le \mathbb{E}_P(w_5) + \mathbb{E}_P(w_6)\theta_1. \]
Suppose w ≡ (w_1, ..., w_6) is distributed multivariate normal, with E_P(w) = [6 0 2 0 −2 0]^⊤ and Cov_P(w) equal to the identity matrix. See Blevins (2015, Theorem 1) for a pedagogically helpful proof for a semiparametric binary model. Then H_P[θ] = {ϑ = [ϑ_1 ϑ_2]^⊤ ∈ Θ : ϑ_1 ∈ [0,
6] and ϑ_2 = 2}. However, with positive probability in any finite sample q_n(ϑ) = 0 for ϑ in a random region (e.g., a triangle if q_n is the sample analog of (4.3)) that only includes points that are close to a subset of the points in H_P[θ]. Hence, with positive probability the minimizer of q_n cycles between consistent estimators of subsets of H_P[θ], but does not estimate the entire set. Enlarging the estimator to include all points that are close to minimizing q_n up to a tolerance that converges to zero sufficiently slowly removes this problem. Chernozhukov, Hong, and Tamer (2007) significantly generalize the consistency results in Manski and Tamer (2002). They work with a normalized criterion function equal to q_n(ϑ) − inf_{ϑ̃ ∈ Θ} q_n(ϑ̃), but to keep notation light I simply refer to it as q_n. Under suitable regularity conditions, they establish consistency of an estimator that can be a smaller set than the one proposed by Manski and Tamer (2002), and derive its convergence rate. Some of the key conditions required by Chernozhukov, Hong, and Tamer (2007, Conditions C1 and C2) to study convergence rates include that q_n is lower semicontinuous in ϑ, satisfies various convergence properties among which sup_{ϑ ∈ H_P[θ]} q_n = O_p(1/a_n) for a sequence of normalizing constants a_n → ∞, that τ_n ≥ sup_{ϑ ∈ H_P[θ]} q_n(ϑ) with probability approaching one, and that τ_n converges to
0. They also require that there exist positive constants (δ, κ, γ) such that for any ε ∈ (0,
1) there are (d_ε, n_ε) such that for all n ≥ n_ε, q_n(ϑ) ≥ κ [min{δ, d(ϑ, H_P[θ])}]^γ uniformly on {ϑ ∈ Θ : d(ϑ, H_P[θ]) ≥ (d_ε/a_n)^{1/γ}} with probability at least 1 − ε. In words, the assumption, referred to as polynomial minorant condition, rules out that q_n can be arbitrarily close to zero outside H_P[θ]. It posits that q_n changes as at least a polynomial of degree γ in the distance of ϑ from H_P[θ]. Under some additional regularity conditions, Chernozhukov, Hong, and Tamer (2007) establish that
\[ d_H(\hat H_{P_n}[\theta], H_P[\theta]) = O_p\big((\max\{1/a_n, \tau_n\})^{1/\gamma}\big). \tag{4.7} \]
What is the role played by the polynomial minorant condition for the result in (4.7)? Under the maintained assumptions τ_n ≥ sup_{ϑ ∈ H_P[θ]} q_n(ϑ) ≥ κ [min{δ, d(ϑ, H_P[θ])}]^γ, and the latter part of the inequality is used to obtain (4.7). When could the polynomial minorant condition be violated? In moment (in)equalities models, the criterion functions in (4.2)-(4.3) square the moment violations, so that for Chernozhukov, Hong, and Tamer's condition γ = 2. Using this normalized criterion function is especially important in light of possible model misspecification, see Section 5. Consider a simple stylized example with (in)equalities of the form
\[ -\theta_1 \le \mathbb{E}_P(w_1), \quad -\theta_2 \le \mathbb{E}_P(w_2), \quad \theta_1\theta_2 = \mathbb{E}_P(w_3), \]
with E_P(w_1) = E_P(w_2) = E_P(w_3) = 0, and note that the sample means (w̄_1, w̄_2, w̄_3) are √n-consistent estimators of (E_P(w_1), E_P(w_2), E_P(w_3)). Suppose (w_1, w_2, w_3) are distributed multivariate standard normal. Consider a sequence ϑ_n = [ϑ_{1n} ϑ_{2n}]^⊤ = [n^{−1/4} n^{−1/4}]^⊤. Then [d(ϑ_n, H_P[θ])]^γ = O_p(n^{−1/2}). On the other hand, with positive probability q_n(ϑ_n) = (w̄_3 − ϑ_{1n}ϑ_{2n})^2 = O_p(n^{−1}), so that for n large enough q_n(ϑ_n) < [d(ϑ_n, H_P[θ])]^γ, violating the assumption. This occurs because the gradient of the moment equality vanishes as ϑ approaches zero, rendering the criterion function flat in a neighborhood of H_P[θ].
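The rate comparison in this example is easy to verify numerically. The sketch below is a deterministic simplification: it sets the sample means to their population value of zero and checks that the criterion value at ϑ_n = (n^{−1/4}, n^{−1/4}) shrinks faster than the squared distance to the identified set (which here consists of the two non-negative axes, so the distance is n^{−1/4}):

```python
# Deterministic check of the rates in the stylized example
# (sample means replaced by their population value, zero).
for n in (10**2, 10**4, 10**6):
    t = n ** -0.25                 # theta_n = (n^{-1/4}, n^{-1/4})
    q = (0.0 - t * t) ** 2         # criterion contribution of the equality moment: 1/n
    d_sq = t ** 2                  # [d(theta_n, H_P[theta])]^2 = n^{-1/2}
    print(n, q, d_sq, q < d_sq)    # q shrinks strictly faster than d_sq
```

Since q is of order n^{−1} while the squared distance is of order n^{−1/2}, the polynomial minorant inequality with γ = 2 fails along this sequence, as claimed in the text.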
As intuition would suggest, rates of convergence are slower the flatter q_n is outside H_P[θ]. Kaido, Molinari, and Stoye (2019b) show that in moment inequality models with smooth moment conditions, the polynomial minorant assumption with γ = 2 implies the Abadie constraint qualification (ACQ); see, e.g., Bazaraa, Sherali, and Shetty (2006, Chapter 5) for a definition and discussion of ACQ. The example just given to discuss failures of the polynomial minorant condition is in fact a known example where ACQ fails at ϑ = [0 0]^⊤. Chernozhukov, Hong, and Tamer (2007, Condition C.3, referred to as degeneracy) also consider the case that q_n vanishes on subsets of Θ that converge in Hausdorff distance to H_P[θ] at rate a_n^{−1/γ}. While degeneracy might be difficult to verify in practice, Chernozhukov, Hong, and Tamer show that if it holds, τ_n can be set to zero. Yildiz (2012) provides conditions on the moment functions, which are closely related to constraint qualifications (as discussed in Kaido, Molinari, and Stoye, 2019b) under which it is possible to set τ_n = 0. Menzel (2014) studies estimation of H_P[θ] when the number of moment inequalities is large relative to sample size (possibly infinite). He provides a consistency result for criterion-based estimators that use a number of unconditional moment inequalities that grows with sample size. He also considers estimators based on conditional moment inequalities, and derives the fastest possible rate for estimating H_P[θ] under smoothness conditions on the conditional moment functions. He shows that the rates achieved by the procedures in Armstrong (2014, 2015) are (minimax) optimal, and cannot be improved upon. Key Insight: Manski and Tamer (2002) extend the notion of extremum estimation from point identified to partially identified models. They do so by putting forward a generalized criterion function whose zero-level set can be used to define H_P[θ] in partially identified structural semiparametric models. It is then natural to define the set valued estimator Ĥ_{P_n}[θ] Chernozhukov, Hong, and Tamer (2007, equation (4.1) and equation (4.6)) set γ = 1 because they report the assumption for a criterion function that does not square the moment violations.
It is then natural to define the set valued estimator ˆ H P n [ θ ] Chernozhukov, Hong, and Tamer (2007, equation (4.1) and equation (4.6)) set γ = 1 because they reportthe assumption for a criterion function that does not square the moment violations. s the collection of approximate minimizers of the sample analog of this criterion function.Manski and Tamer’s analysis of statistical inference focuses exclusively on providing consis-tent estimators. Chernozhukov, Hong, and Tamer (2007) substantially generalize the analysisof consistency of criterion function-based set estimators. They provide a comprehensive studyof convergence rates in partially identified models. Their work highlights the challenges a re-searcher faces in this context, and puts forward possible solutions in the form of assumptionsunder which specific rates of convergence attain. Beresteanu and Molinari (2008) introduce to the econometrics literature inference methodsfor set valued estimators based on random set theory. They study the class of models where H P [ θ ] is convex and can be written as the Aumann (or selection) expectation of a properlydefined random closed set. They propose to carry out estimation and inference leveragingthe representation of convex sets through their support function (given in Definition A.5),as it is done in random set theory; see Molchanov (2017, Chapter 3) and Molchanov andMolinari (2018, Chapter 4). Because the support function fully characterizes the boundaryof H P [ θ ], it allows for a simple sample analog estimator, and for inference procedures withdesirable properties.An example of a framework where the approach of Beresteanu and Molinari can be appliedis that of best linear prediction with interval outcome data in Identification Problem 2.4. Recall that in that case, the researcher observes random variables ( y L , y U , x ) and wishes tolearn the best linear predictor of y | x , with y unobserved and R ( y L ≤ y ≤ y U ) = 1. 
For simplicity let x be a scalar. Given a random sample {y_{Li}, y_{Ui}, x_i}_{i=1}^n from P, the researcher can construct a random segment G_i for each i and a consistent estimator Σ̂_n of the random matrix Σ_P in (2.24) as
\[ G_i = \left\{\begin{pmatrix} y_i \\ y_i x_i \end{pmatrix} : y_i \in \mathrm{Sel}(Y_i)\right\} \subset \mathbb{R}^2, \quad \text{and} \quad \hat\Sigma_n = \begin{pmatrix} 1 & \bar x \\ \bar x & \overline{x^2} \end{pmatrix}, \]
where Y_i = [y_{Li}, y_{Ui}] and x̄, \overline{x^2} are the sample means of x_i and x_i^2 respectively. Because in this problem H_P[θ] = Σ_P^{−1} E_P G (see Theorem SIR-2.5 on p. 25), a natural sample analog estimator replaces Σ_P with Σ̂_n, and E_P G with a Minkowski average of G_i (see Appendix A, By Theorem A.2, the Aumann expectation of a random closed set defined on a nonatomic probability space is convex. In this chapter I am assuming nonatomicity of the probability space. Even if I did not make this assumption, however, when working with a random sample the relevant probability space is the product space with n → ∞, hence nonatomic (Artstein and Vitale, 1975). If H_P[θ] is not convex, Beresteanu and Molinari's analysis applies to its convex hull. Kaido, Molinari, and Stoye (2019a, Supplementary Appendix F) establish that if x has finite support, H_P[θ] in Theorem SIR-2.5 can be written as the collection of ϑ ∈ Θ that satisfy a finite number of moment inequalities, as posited in this section.
p. 118 for a formal definition), yielding
\[ \hat H_{P_n}[\theta] = \hat\Sigma_n^{-1}\,\frac{1}{n}\sum_{i=1}^n G_i. \tag{4.8} \]
The support function of Ĥ_{P_n}[θ] is the sample analog of that of H_P[θ] provided in (2.26):
\[ h_{\hat H_{P_n}[\theta]}(u) = \frac{1}{n}\sum_{i=1}^n \big[\big(y_{Li}\,\mathbf{1}(f(x_i,u) < 0) + y_{Ui}\,\mathbf{1}(f(x_i,u) \ge 0)\big)\,f(x_i,u)\big], \quad u \in \mathbb{S}, \]
where f(x_i, u) = [1 x_i] Σ̂_n^{−1} u. Beresteanu and Molinari (2008) use the Law of Large Numbers for random sets reported in Theorem A.3 to show that Ĥ_{P_n}[θ] in (4.8) is √n-consistent under standard conditions on the moments of (y_{Li}, y_{Ui}, x_i). Bontemps, Magnac, and Maurin (2012) and Chandrasekhar, Chernozhukov, Molinari, and Schrimpf (2018) significantly expand the applicability of Beresteanu and Molinari's 2008 estimator. Bontemps, Magnac, and Maurin show that it can be used in a large class of partially identified linear models, including ones that allow for the availability of instrumental variables. Chandrasekhar, Chernozhukov, Molinari, and Schrimpf show that it can be used for best linear approximation of any function f(x) that is known to lie within two identified bounding functions. The lower and upper functions defining the band are allowed to be any functions, including ones carrying an index, and can be estimated parametrically or nonparametrically. The method allows for estimation of the parameters of the best linear approximations to the set identified functions in many of the identification problems described in Section 2. It can also be used to estimate the sharp identification region for the parameters of a binary choice model with interval or discrete regressors under the assumptions of Magnac and Maurin (2008), characterized in (3.9) in Section 3.1.1. Kaido and Santos (2014) develop a theory of efficiency for estimators of sets H_P[θ] as in (4.1) under the additional requirements that the inequalities E_P(m_j(w, ϑ)) are convex in ϑ ∈ Θ and smooth as functionals of the distribution of the data.
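Returning to the best linear prediction example, the sample support function can be sketched in a few lines. Everything below (sample size, data generating process, seed) is hypothetical; the direction u = (1, 0) reads off bounds on the intercept:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(0.0, 1.0, n)
yL = x + rng.normal(0.0, 1.0, n)     # made-up interval data with yU = yL + 1
yU = yL + 1.0

# Sample analog of Sigma_P = E_P[(1, x; x, x^2)]:
Sigma_hat = np.array([[1.0, x.mean()], [x.mean(), (x ** 2).mean()]])

def h_support(u):
    """Sample support function: select yU where f >= 0 and yL where f < 0."""
    f = np.column_stack([np.ones(n), x]) @ np.linalg.solve(Sigma_hat, u)
    sel = np.where(f < 0, yL, yU)
    return float(np.mean(sel * f))

u = np.array([1.0, 0.0])
print(-h_support(-u), h_support(u))  # approximate lower and upper bound on the intercept
```

The interval [−h(−u), h(u)] is the projection of the estimated set in direction u; with the simulated data above it is roughly of width one, matching the width of the outcome intervals.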
Because of the convexity ofthe moment inequalities, H P [ θ ] is convex and can be represented through its support function.Using the classic results in Bickel, Klaassen, Ritov, and Wellner (1993), Kaido and Santosshow that under suitable regularity conditions, the support function admits for √ n -consistentregular estimation. They also show that a simple plug-in estimator based on the supportfunction attains the semiparametric efficiency bound, and the corresponding estimator of H P [ θ ] minimizes a wide class of asymptotic loss functions based on the Hausdorff distance.As they establish, this efficiency result applies to the estimators proposed by Beresteanu andMolinari (2008), including that in (4.8), and by Bontemps, Magnac, and Maurin (2012).Kaido (2016) further enlarges the applicability of the support function approach by es-tablishing its duality with the criterion function approach, for the case that q P is a convexfunction and q n is a convex function almost surely. This allows one to use the support95unction approach also when a representation of H P [ θ ] as the Aumann expectation of a ran-dom closed set is not readily available. Kaido considers H P [ θ ] and its level set estimatorˆ H P n [ θ ] as defined, respectively, in (4.4) and (4.5), with Θ a convex subset of R d . Because q P and q n are convex functions, H P [ θ ] and ˆ H P n [ θ ] are convex sets. Under the same assump-tions as in Chernozhukov, Hong, and Tamer (2007), including the polynomial minorant andthe degeneracy conditions, one can set τ n = 0 and have d H ( ˆ H P n [ θ ] , H P [ θ ]) = O p ( a − /γn ).Moreover, due to its convexity, H P [ θ ] is fully characterized by its support function, whichin turn can be consistently estimated (at the same rate as H P [ θ ]) using sample analogs as h ˆ H P n [ θ ] ( u ) = max a n q n ( ϑ ) ≤ u (cid:62) ϑ . 
The latter can be computed via convex programming.Kitagawa and Giacomini (2018) consider consistent estimation of H P [ θ ] in the contextof Bayesian inference. They focus on partially identified models where H P [ θ ] depends ona “reduced form” parameter φ (e.g., a vector of moments of observable random variables).They recognize that while a prior on φ can be revised in light of the data, a prior on θ cannot, due to the lack of point identification. As such they propose to choose a singleprior for the revisable parameters, and a set of priors for the unrevisable ones. The latter isthe collection of priors such that the distribution of θ | φ places probability one on H P [ θ ]. Acrucial observation in Kitagawa and Giacomini is that once φ is viewed as a random vector,as in the Bayesian paradigm, under mild regularity conditions H P [ θ ] is a random closed set,and Bayesian inference on it can be carried out using elements of random set theory. Inparticular, they show that the set of posterior means of θ | w equals the Aumann expectationof H P [ θ ] (with the underlying probability measure of φ | w ). They also show that this Aumannexpectation converges in Hausdorff distance to the “true” identified set if the latter is convex,or otherwise to its convex hull. They apply their method to analyze impulse-response in set-identified Structural Vector Autoregressions, where standard Bayesian inference is otherwisesensitive to the choice of an unrevisable prior. Key Insight : Beresteanu and Molinari (2008) show that elements of random settheory can be employed to obtain inference methods for partially identified models that are easyto implement and have desirable statistical properties. Whereas they apply their findings to aspecific class of models based on the Aumann expectation, the ensuing literature demonstratesthat random set methods are widely applicable to obtain estimators of sharp identificationregions and establish their consistency.
Chernozhukov, Lee, and Rosen (2013) propose an alternative to the notion of consistent estimator. Rather than asking that Ĥ_{P_n}[θ] satisfies the requirement in Definition 4.1, they propose a half-median-unbiased estimator. (There is a large literature in macro-econometrics, pioneered by Faust (1998), Canova and De Nicolo (2002), and Uhlig (2005), concerned with Bayesian inference with a non-informative prior for non-identified parameters. I refer to Kilian and Lütkepohl (2017, Chapter 13) for a thorough review. Frequentist inference for impulse response functions in Structural Vector Autoregression models is carried out, e.g., in Granziera, Moon, and Schorfheide (2018) and Gafarov, Meier, and Montiel Olea (2018).) This notion is easiest to explain in the case of interval-identified scalar parameters. Take, e.g., the bound in Theorem SIR-2.1 for the conditional expectation of selectively observed data. Then an estimator of that interval is half-median-unbiased if the estimated upper bound exceeds the true upper bound, and the estimated lower bound falls below the true lower bound, each with probability at least 1/2 asymptotically. Chernozhukov, Lee, and Rosen propose the estimator

Ĥ_{P_n}[θ] = {ϑ ∈ Θ : a_n q_n(ϑ) ≤ c_{1/2}(ϑ)},  (4.9)

where c_{1/2}(ϑ) is a critical value chosen so that Ĥ_{P_n}[θ] asymptotically contains H_P[θ] (or any fixed element in H_P[θ]; see the discussion in Section 4.3.1 below) with probability at least 1/2. As discussed in the next section, c_{1/2}(ϑ) can be further chosen so that this probability is uniform over P ∈ P.

The requirement of half-median unbiasedness has the virtue that, by construction, an estimator such as (4.9) is a subset of a 1 − α confidence set as defined in (4.10) below for any α < 1/2, provided c_{1−α}(ϑ) is chosen using the same criterion for all α ∈ (0, 1/2]. In contrast, the sequence τ_n in (4.5) may be larger than the critical value used to obtain the confidence set, see equation (4.10) below, unless regularity conditions such as degeneracy or others allow one to set τ_n equal to zero. Moreover, choice of the sequence τ_n is not data driven, and hence can be viewed as arbitrary. This raises a concern for the scope of consistent estimation in general settings. However, reporting a set estimator together with a confidence set is arguably important to shed light on how much of the volume of the confidence set is due to statistical uncertainty and how much is due to a large identified set. One can do so by either using a half-median-unbiased estimator as in (4.9), or the set of minimizers of the criterion function in (4.5) with τ_n = 0 (which, as previously discussed, satisfies the inner consistency requirement in (4.6) under weak conditions, and is Hausdorff consistent in some well behaved cases).

4.3.1 Coverage of H_P[θ] vs. Coverage of θ

I first discuss confidence sets CS_n ⊂ R^d defined as level sets of a criterion function. To simplify notation, henceforth I assume a_n = n:

CS_n = {ϑ ∈ Θ : n q_n(ϑ) ≤ c_{1−α}(ϑ)}.  (4.10)

In (4.10), c_{1−α}(ϑ) may be constant or vary in ϑ ∈ Θ. It is chosen so that CS_n satisfies (asymptotically) a certain coverage property with respect to either H_P[θ] or each ϑ ∈ H_P[θ]. Correspondingly, different appearances of c_{1−α}(ϑ) may refer to different critical values associated with different coverage notions. The challenging theoretical aspect of inference in partial identification is the determination of c_{1−α} and of methods to approximate it.

A first classification of coverage notions pertains to whether the confidence set should cover H_P[θ] or each of its elements with a prespecified asymptotic probability.
Early on, within the study of interval-identified parameters, Horowitz and Manski (1998, 2000) put forward a confidence interval that expands each of the sample analogs of the extreme points of the population bounds by an amount designed so that the confidence interval asymptotically covers the population bounds with prespecified probability.

Chernozhukov, Hong, and Tamer (2007) study the general problem of inference for a set H_P[θ] defined as the zero-level set of a criterion function. The coverage notion that they propose is pointwise coverage of the set, whereby c_{1−α} is chosen so that:

lim inf_{n→∞} P(H_P[θ] ⊆ CS_n) ≥ 1 − α for all P ∈ P.  (4.11)

Chernozhukov, Hong, and Tamer (2007) provide conditions under which CS_n satisfies (4.11) with c_{1−α} constant in ϑ, yielding the so called criterion function approach to statistical inference in partial identification. Under the same coverage requirement, Bugni (2010) and Galichon and Henry (2013) introduce novel bootstrap methods for inference in moment inequality models. Henry, Mango, and Queyranne (2015) propose an inference method for finite games of complete information that exploits the structure of these models.

Beresteanu and Molinari (2008) propose a method to test hypotheses and build confidence sets satisfying (4.11) based on random set theory, the so called support function approach, which yields simple to compute confidence sets with asymptotic coverage equal to 1 − α when H_P[θ] is strictly convex. The reason for the strict convexity requirement is that in its absence, the support function of H_P[θ] is not fully differentiable, but only directionally differentiable, complicating inference. Indeed, Fang and Santos (2018) show that standard bootstrap methods are consistent if and only if full differentiability holds, and they provide modified bootstrap methods that remain valid when only directional differentiability holds.
Chandrasekhar, Chernozhukov, Molinari, and Schrimpf (2018) propose a data jittering method that enforces full differentiability at the price of a small conservative distortion. Kaido and Santos (2014) extend the applicability of the support function approach to other moment inequality models and establish efficiency results. Chernozhukov, Kocatulum, and Menzel (2015) show that a Hausdorff distance-based test statistic can be weighted to enforce either exact or first-order equivariance to transformations of parameters. Adusumilli and Otsu (2017) provide empirical likelihood based inference methods for the support function approach. The test statistics employed in the criterion function approach and in the support function approach are asymptotically equivalent in specific moment inequality models (Beresteanu and Molinari, 2008; Kaido, 2016), but the criterion function approach is more broadly applicable.

The field's interest shifted to a different notion of coverage when Imbens and Manski (2004) pointed out that often there is one "true" data generating θ, even if it is only partially identified. Hence, they proposed confidence sets that cover each ϑ ∈ H_P[θ] with a prespecified probability. For pointwise coverage, this leads to choosing c_{1−α} so that:

lim inf_{n→∞} P(ϑ ∈ CS_n) ≥ 1 − α for all P ∈ P and ϑ ∈ H_P[θ].  (4.12)

If H_P[θ] is a singleton then (4.11) and (4.12) both coincide with the pointwise coverage requirement employed for point identified parameters. However, as shown in Imbens and Manski (2004, Lemma 1), if H_P[θ] contains more than one element, the two notions differ, with confidence sets satisfying (4.12) being weakly smaller than ones satisfying (4.11).
Rosen (2008) provides confidence sets for general moment (in)equalities models that satisfy (4.12) and are easy to compute.

Although confidence sets that take each ϑ ∈ H_P[θ] as the object of interest (and which satisfy the uniform coverage requirements described in Section 4.3.2 below) have received the most attention in the literature on inference in partially identified models, this choice merits some words of caution. First, Henry and Onatski (2012) point out that if confidence sets are to be used for decision making, a policymaker concerned with robust decisions might prefer ones satisfying (4.11) (respectively, (4.13) below once uniformity is taken into account) to ones satisfying (4.12) (respectively, (4.14) below with uniformity). Second, while in many applications a "true" data generating θ exists, in others it does not. For example, Manski and Molinari (2010) and Giustinelli, Manski, and Molinari (2019a) query survey respondents (in the American Life Panel and in the Health and Retirement Study, respectively) about their subjective beliefs on the percent chance of future events. A large fraction of these respondents, when given the possibility to do so, report imprecise beliefs in the form of intervals. In this case, there is no "true" point-valued belief: the "truth" is interval-valued. If one is interested in (say) average beliefs, the sharp identification region is the (Aumann) expectation of the reported intervals, and the appropriate coverage requirement for a confidence set is that in (4.11) (respectively, (4.13) below with uniformity).

4.3.2 Pointwise vs. Uniform Coverage

In the context of interval identified parameters, such as, e.g., the mean with missing data in Theorem SIR-2.1 with θ ∈ R, Imbens and Manski (2004) pointed out that extra care should be taken in the construction of confidence sets for partially identified parameters, as otherwise they may be asymptotically valid only pointwise (in the distribution of the observed data) over relevant classes of distributions.
For example, consider a confidence interval that expands each of the sample analogs of the extreme points of the population bounds by a one-sided critical value. (This discussion draws on many conversations with Jörg Stoye, as well as on notes that he shared with me, for which I thank him.) Then for any n one can find a distribution P ∈ P and a parameter ϑ ∈ H_P[θ] such that the width of the population bounds (under P) is small relative to n^{−1/2} and the coverage probability for ϑ is below 1 − α. This happens because the proposed confidence interval does not take into account the fact that for some P ∈ P the problem has a two-sided nature.

This observation naturally leads to a more stringent requirement of uniform coverage, whereby (4.11)-(4.12) are replaced, respectively, by

lim inf_{n→∞} inf_{P∈P} P(H_P[θ] ⊆ CS_n) ≥ 1 − α,  (4.13)
lim inf_{n→∞} inf_{P∈P} inf_{ϑ∈H_P[θ]} P(ϑ ∈ CS_n) ≥ 1 − α,  (4.14)

and c_{1−α} is chosen accordingly, to obtain either (4.13) or (4.14). Sets satisfying (4.13) are referred to as confidence regions for H_P[θ] that are uniformly consistent in level (over P ∈ P). Romano and Shaikh (2010) propose such confidence regions, study their properties, and provide a step-down procedure to obtain them.

Chen, Christensen, and Tamer (2018) propose confidence sets that are contour sets of criterion functions using cutoffs that are computed via Monte Carlo simulations from the quasi-posterior distribution of the criterion and satisfy the coverage requirement in (4.13). They recommend the use of a Sequential Monte Carlo algorithm that works well also when the quasi-posterior is irregular and multi-modal. They establish exact asymptotic coverage, non-trivial local power, and validity of their procedure in point identified and partially identified regular models, and validity in irregular models (e.g., in models where the reduced form parameters are on the boundary of the parameter space).
They also establish efficiency of their procedure in regular models that happen to be point identified.

Sets satisfying (4.14) are referred to as confidence regions for points in H_P[θ] that are uniformly consistent in level (over P ∈ P). Within the framework of Imbens and Manski (2004), Stoye (2009) shows that one can obtain a confidence interval satisfying (4.14) by pre-testing whether the lower and upper population bounds are sufficiently close to each other. If so, the confidence interval expands each of the sample analogs of the extreme points of the population bounds by a two-sided critical value; otherwise, by a one-sided one. Stoye provides important insights clarifying the connection between superefficient (i.e., faster than O_p(1/√n)) estimation of the width of the population bounds when it equals zero, and certain challenges in Imbens and Manski's proposed method. (Indeed, the confidence interval proposed by Stoye (2009) can be thought of as using a Hodges-type shrinkage estimator (see, e.g., van der Vaart, 1997) for the width of the population bounds.) Bontemps, Magnac, and Maurin (2012) study related questions in set identified linear models.

In general moment inequality models, a key difficulty for uniform inference is that the limiting distribution of the test statistics depends discontinuously on the degree of slackness of the moment inequalities, √n E_P(m_j(w_i; ϑ)), j = 1, ..., |J|, which cannot be consistently estimated. Romano and Shaikh (2008); Andrews and Guggenberger (2009); Andrews and Soares (2010); Canay (2010); Andrews and Barwick (2012); Romano, Shaikh, and Wolf (2014), among others, make significant contributions to circumvent these difficulties in the context of a finite number of unconditional moment (in)equalities. Andrews and Shi (2013); Chernozhukov, Lee, and Rosen (2013); Lee, Song, and Whang (2013); Armstrong (2014, 2015); Armstrong and Chan (2016); Chetverikov (2018), among others, make significant contributions to circumvent these difficulties in the context of a finite number of conditional moment (in)equalities (with continuously distributed conditioning variables).
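Returning to the interval-identified scalar case, the Imbens and Manski (2004) construction that this literature refines can be sketched numerically: their critical value solves an equation that interpolates between the two-sided normal quantile (when the bounds have zero width) and the one-sided quantile (when the bounds are wide). The sketch below assumes, for simplicity, a common standard deviation σ for both bound estimators.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def im_critical_value(width_hat, sigma, n, alpha=0.05):
    """Solve Phi(c + sqrt(n) * width / sigma) - Phi(-c) = 1 - alpha for c.
    width_hat: estimated width of the population bounds; sigma: common
    standard deviation of the bound estimators (a simplifying assumption)."""
    shift = np.sqrt(n) * width_hat / sigma
    f = lambda c: norm.cdf(c + shift) - norm.cdf(-c) - (1 - alpha)
    return brentq(f, 1e-8, 10.0)

# Point identification (zero width): two-sided critical value, ~1.96.
c_two_sided = im_critical_value(0.0, 1.0, n=100)
# Wide bounds: one-sided critical value, ~1.645.
c_one_sided = im_critical_value(5.0, 1.0, n=100)

# The confidence interval then expands the estimated bounds:
# CI = [lower_hat - c * sigma / sqrt(n), upper_hat + c * sigma / sqrt(n)]
```

The discontinuity that motivates Stoye's pre-test is visible here: the critical value depends on the estimated width, which behaves erratically near zero width unless handled with care.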
Chernozhukov, Chetverikov, and Kato (2018) and Andrews and Shi (2017) study, respectively, the challenging frameworks where the number of moment inequalities grows with sample size and where there is a continuum of conditional moment inequalities. I refer to Canay and Shaikh (2017, Section 4) for a thorough discussion of these methods and a comparison of their relative (de)merits (see also Bugni, Canay, and Guggenberger, 2012; Bugni, 2016).

4.3.3 Coverage of the Vector θ vs. Coverage of a Component of θ

The coverage requirements in (4.13)-(4.14) refer to confidence sets in R^d for the entire θ or H_P[θ]. Often empirical researchers are interested in inference on a specific component or (smooth) function of θ (e.g., the returns to education; the effect of market size on the probability of entry; the elasticity of demand for insurance to price, etc.). For simplicity, here I focus on the case of a component of θ, which I represent as u⊤θ, with u a standard basis vector in R^d. In this case, the (sharp) identification region of interest is

H_P[u⊤θ] = {s ∈ [−h_Θ(−u), h_Θ(u)] : s = u⊤ϑ and ϑ ∈ H_P[θ]}.

One could report as confidence interval for u⊤θ the projection of CS_n in direction ±u. The resulting confidence interval is asymptotically valid but typically conservative. The extent of the conservatism increases with the dimension of θ and is easily appreciated in the case of a point identified parameter. Consider, for example, a linear regression in R^{10}, and suppose for simplicity that the limiting covariance matrix of the estimator is the identity matrix. Then a 95% confidence interval for u⊤θ is obtained by adding and subtracting 1.96 to that component's estimate. In contrast, projection of a 95% confidence ellipsoid for θ on each component amounts to adding and subtracting 4.28 to that component's estimate.

It is therefore desirable to provide confidence intervals CI_n specifically designed to cover u⊤θ rather than the entire θ. Natural counterparts to (4.13)-(4.14) are

lim inf_{n→∞} inf_{P∈P} P(H_P[u⊤θ] ⊆ CI_n) ≥ 1 − α,  (4.15)
lim inf_{n→∞} inf_{P∈P} inf_{ϑ∈H_P[θ]} P(u⊤ϑ ∈ CI_n) ≥ 1 − α.  (4.16)

As shown in Beresteanu and Molinari (2008) and Kaido (2016) for the case of pointwise coverage, obtaining asymptotically valid confidence intervals is simple if the identified set is convex and one uses the support function approach. This is because it suffices to base the test statistic on the support function in direction u, and it is often possible to easily characterize the limiting distribution of this test statistic. See Molchanov and Molinari (2018, Chapters 4 and 5) for details.

The task is significantly more complex in general moment inequality models when H_P[θ] is non-convex and one wants to satisfy the criterion in (4.15) or that in (4.16). Romano and Shaikh (2008) and Bugni, Canay, and Shi (2017) propose confidence intervals of the form

CI_n = {s ∈ [−h_Θ(−u), h_Θ(u)] : inf_{ϑ∈Θ(s)} n q_n(ϑ) ≤ c_{1−α}(s)},  (4.17)

where Θ(s) = {ϑ ∈ Θ : u⊤ϑ = s} and c_{1−α} is such that (4.16) holds. An important idea in this proposal is that of profiling the test statistic n q_n(ϑ) by minimizing it over all ϑ such that u⊤ϑ = s. One then includes in the confidence interval all values s for which the profiled test statistic's value is not too large. Romano and Shaikh (2008) propose the use of subsampling to obtain the critical value c_{1−α}(s) and provide high-level conditions ensuring that (4.16) holds. Bugni, Canay, and Shi (2017) substantially extend and improve the profiling approach by providing a bootstrap-based method to obtain c_{1−α} so that (4.16) holds. Their method is more powerful than subsampling (for reasonable choices of subsample size).
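A stylized version of the profiling in (4.17) can be sketched as follows. The interval-bound criterion q_n and the cutoff value are hypothetical stand-ins; in practice the critical value would come from subsampling or from the Bugni, Canay, and Shi bootstrap.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical sample criterion: theta_j should lie in [lb_j, ub_j], j = 1, 2.
lb, ub = np.array([0.0, 0.0]), np.array([1.0, 1.0])

def q_n(theta):
    return float(np.sum(np.maximum(lb - theta, 0.0) ** 2
                        + np.maximum(theta - ub, 0.0) ** 2))

def profiled_stat(s, n=100):
    """n * inf {q_n(theta) : theta_1 = s}; theta_2 is profiled out."""
    return n * minimize_scalar(lambda t2: q_n(np.array([s, t2]))).fun

def confidence_interval(grid, cutoff):
    """Keep every candidate value s whose profiled statistic is below cutoff."""
    return [s for s in grid if profiled_stat(s) <= cutoff]

ci = confidence_interval(np.linspace(-0.5, 1.5, 201), cutoff=2.7)
```

With this toy criterion the profiled statistic is zero for s inside the projected interval [0, 1] and grows quadratically outside, so the resulting interval is [0, 1] widened by a sampling-uncertainty margin determined by the cutoff.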
Belloni, Bugni, and Chernozhukov (2018) further enlarge the domain of applicability of the profiling approach by proposing a method based on this approach that is asymptotically uniformly valid when the number of moment conditions is large, and can grow with the sample size, possibly at exponential rates.

Kaido, Molinari, and Stoye (2019a) propose a bootstrap-based calibrated projection approach where

CI_n = [−h_{C_n(c_{1−α})}(−u), h_{C_n(c_{1−α})}(u)],  (4.18)

with

h_{C_n(c_{1−α})}(u) ≡ sup_{ϑ∈Θ} u⊤ϑ s.t. √n m̄_{n,j}(ϑ)/σ̂_{n,j}(ϑ) ≤ c_{1−α}(ϑ), j = 1, ..., |J|,  (4.19)

and c_{1−α} a critical level function calibrated so that (4.16) holds. Compared to the simple projection of CS_n mentioned at the beginning of Section 4.3.3, calibrated projection (weakly) reduces the value of c_{1−α} so that the projection of θ, rather than θ itself, is asymptotically covered with the desired probability uniformly.

Chen, Christensen, and Tamer (2018) provide methods to build confidence intervals and confidence sets on projections of H_P[θ] as contour sets of criterion functions using cutoffs that are computed via Monte Carlo simulations from the quasi-posterior distribution of the criterion, and that satisfy the coverage requirement in (4.15). One of their procedures, designed specifically for scalar projections, delivers a confidence interval as the contour set of a profiled quasi-likelihood ratio with critical value equal to a quantile of the chi-squared distribution with one degree of freedom.

4.3.4 A Brief Note on Bayesian Methods

The confidence sets discussed in this section are based on the frequentist approach to inference. It is natural to ask whether in partially identified models, as in well behaved point identified models, one can build Bayesian credible sets that at least asymptotically coincide with frequentist confidence sets. This question was first addressed by Moon and Schorfheide (2012), with a negative answer for the case that the coverage in (4.14) is sought.
In particular, they showed that the resulting Bayesian credible sets are a subset of H_P[θ], and hence too narrow from the frequentist perspective.

This discrepancy can be ameliorated when inference is sought for H_P[θ] rather than for each ϑ ∈ H_P[θ]. Norets and Tang (2014), Kline and Tamer (2016), Kitagawa and Giacomini (2018), and Liao and Simoni (2019) propose Bayesian credible regions that are valid for frequentist inference in the sense of (4.11), where the first two build on the criterion function approach and the second two on the support function approach. All these contributions rely on the model being separable, in the sense that it yields moment inequalities that can be written as the sum of a function of the data only, and a function of the model parameters only (as in, e.g., (3.22)-(3.25)). In these models, the function of the data only (the reduced form parameter) is point identified, it is related to the structural parameters θ through a known mapping, and under standard regularity conditions it can be √n-consistently estimated. The resulting estimator has an asymptotically Normal distribution. The various approaches place a prior on the reduced form parameter, and standard tools in Bayesian analysis are used to obtain a posterior. The known mapping from reduced form to structural parameters is then applied to this posterior to obtain a credible set for H_P[θ].

5 Misspecification in Partially Identified Models
Although partial identification often results from reducing the number of assumptions maintained in counterpart point identified models, care still needs to be taken in assessing the possible consequences of misspecification. This section's goal is to discuss the existing literature on the topic, and to provide some additional observations. To keep the notation light, I refer to the functional of interest as θ throughout, without explicitly distinguishing whether it belongs to an infinite dimensional parameter space (as in the nonparametric analysis in Section 2), or to a finite dimensional one (as in the semiparametric analysis in Section 3).

The original nonparametric "worst-case" bounds proposed by Manski (1989) for the analysis of selectively observed data and discussed in Section 2 are not subject to the risk of misspecification, because they are based on the empirical evidence alone. However, often researchers are willing and eager to maintain additional assumptions that can help shrink the bounds, so that one can learn more from the available data. Indeed, early on Manski (1990) proposed the use of exclusion restrictions in the form of mean independence assumptions. Section 2.2 discusses related ideas within the context of nonparametric bounds on treatment effects, and Manski (2003, Chapter 2) provides a thorough treatment of other types of exclusion restriction. The literature reviewed throughout this chapter provides many more examples of assumptions that have proven useful for empirical research.

Broadly speaking, assumptions can be classified in two types (Manski, 2003, Chapter 2). The first type is non-refutable: it may reduce the size of H_P[θ], but cannot lead to it being empty. An example in the context of selectively observed data is that of exogenous selection, or data missing at random conditional on covariates and instruments (see Section 2.1, p. 10): under this assumption H_P[θ] is a singleton, but the assumption cannot be refuted because it poses a distributional (independence) assumption on unobservables.

The second type is refutable: it may reduce the size of H_P[θ], and it may result in H_P[θ] = ∅ if it does not hold in the DGP. An example in the context of treatment effects is the assumption of mean independence between response function at treatment t and instrumental variable z, see (2.14) in Section 2.2. There the sharp bounds on E_Q(y(t)|x = x) are intersection bounds as in (2.15). If the instrument is invalid, the bounds can be empty.

Ponomareva and Tamer (2011) consider the impact of misspecification on semiparametric partially identified models. One of their examples concerns a linear regression model of the form E_Q(y|x) = θ⊤x when only interval data is available for y (as in Section 2.3). In this context, H_P[θ] = {ϑ ∈ Θ : E_P(y_L|x) ≤ ϑ⊤x ≤ E_P(y_U|x), x-a.s.}. The concern is that the conditional expectation might not be linear. Ponomareva and Tamer make two important observations. First, they argue that the set H_P[θ] is of difficult interpretation when the model is misspecified. When y is perfectly observed, if the conditional expectation is not linear, the output of ordinary least squares can be readily interpreted as the best linear approximation to E_Q(y|x). This is not the case for H_P[θ] when only the interval data [y_L, y_U] is observed. They therefore propose to work with the set of best linear predictors for y|x even in the partially identified case (rather than fully exploit the linearity assumption).
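With a discrete covariate, membership of a candidate ϑ in the set {ϑ ∈ Θ : E_P(y_L|x) ≤ ϑ⊤x ≤ E_P(y_U|x), x-a.s.} can be checked with sample analogs. A minimal sketch, with all data hypothetical:

```python
import numpy as np

def in_identified_set(theta, X, yL, yU):
    """Sample-analog check of E[yL|x] <= theta'x <= E[yU|x] at every support
    point of a discrete covariate vector (illustrative only)."""
    for x in np.unique(X, axis=0):
        mask = (X == x).all(axis=1)
        fitted = float(x @ theta)
        if not (yL[mask].mean() <= fitted <= yU[mask].mean()):
            return False
    return True

# Hypothetical data: intercept plus a binary regressor.
X = np.array([[1.0, 0.0]] * 3 + [[1.0, 1.0]] * 3)
yL = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
yU = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])
```

Misspecification shows up directly in such a check: if the true conditional expectations are nonlinear, no ϑ may pass it, in which case the sample analog of H_P[θ] comes out empty.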
The resulting set is the one derived by Beresteanu and Molinari (2008) and reported in Theorem SIR-2.5. Ponomareva and Tamer work with projections of this set, which coincide with the bounds in Stoye (2007).

Ponomareva and Tamer also point out that depending on the DGP, misspecification can cause H_P[θ] to be spuriously tight. This can happen, for example, if E_P(y_L|x) and E_P(y_U|x) are sufficiently nonlinear, even if they are relatively far from each other (e.g., Ponomareva and Tamer, 2011, Figure 1). Hence, caution should be taken when interpreting very tight partial identification results as indicative of a highly informative model and empirical evidence, as the possibility of model misspecification has to be taken into account. These observations naturally lead to the questions of how to test for model misspecification in the presence of partial identification, and of what are the consequences of misspecification for the confidence sets discussed in Section 4.3.

With partial identification, a null hypothesis of correct model specification (and its alternative) can be expressed as

H_0 : H_P[θ] ≠ ∅;  H_1 : H_P[θ] = ∅.

Tests for this hypothesis have been proposed both for the case of nonparametric as well as semiparametric partially identified models.
I refer to Santos (2012) for specification tests in a partially identified nonparametric instrumental variable model; to Kitamura and Stoye (2018) for a nonparametric test in random utility models that checks whether a repeated cross section of demand data might have been generated by a population of rational consumers (thereby testing for the Axiom of Revealed Stochastic Preference); and to Guggenberger, Hahn, and Kim (2008) and Bontemps, Magnac, and Maurin (2012) for specification tests in linear moment (in)equality models.

For the general class of moment inequality models discussed in Section 4, Romano and Shaikh (2008), Andrews and Guggenberger (2009), Galichon and Henry (2009), and Andrews and Soares (2010) propose a specification test that rejects the model if CS_n in (4.10) is empty, where CS_n is defined with c_{1−α}(ϑ) determined so as to satisfy (4.14) and approximated according to the methods proposed in the respective papers. The resulting test, commonly referred to as by-product test because it is obtained as a by-product of the construction of a confidence set, takes the form

φ = 1(CS_n = ∅) = 1(inf_{ϑ∈Θ} n q_n(ϑ) > c_{1−α}(ϑ)).

Denoting by P_0 the collection of P ∈ P such that H_P[θ] ≠ ∅, one has that the by-product test achieves uniform size control (Bugni, Canay, and Shi, 2015, Theorem C.2):

lim sup_{n→∞} sup_{P∈P_0} E_P(φ) ≤ α.  (5.1)

An important feature of the by-product test is that the critical value c_{1−α}(ϑ) is not obtained to test for model misspecification, but to ensure the coverage requirement in (4.14); hence, it is obtained by working with the asymptotic distribution of n q_n(ϑ). Bugni, Canay, and Shi (2015) propose more powerful model specification tests, using a critical value c_{1−α} that they obtain to ensure that (5.1), rather than (4.14), holds. In particular, they show that their tests dominate the by-product test in terms of power in any finite sample and in the asymptotic limit.
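A toy version of a test of correct specification based on the minimized criterion can be sketched as follows; the interval criterion and the cutoff are hypothetical placeholders for q_n and a properly calibrated critical value.

```python
from scipy.optimize import minimize_scalar

def spec_test_rejects(lo_hat, hi_hat, n, cutoff):
    """Reject correct specification iff inf_t n*q_n(t) > cutoff, with the toy
    interval criterion q_n(t) = max(lo-t,0)^2 + max(t-hi,0)^2.
    cutoff stands in for a proper critical value (e.g., Bugni-Canay-Shi)."""
    q = lambda t: max(lo_hat - t, 0.0) ** 2 + max(t - hi_hat, 0.0) ** 2
    stat = n * minimize_scalar(q).fun
    return stat > cutoff

# Coherent estimated bounds: the sample identified set [0, 1] is non-empty,
# so the minimized criterion is zero and the test does not reject.
keeps = spec_test_rejects(0.0, 1.0, n=100, cutoff=2.7)
# Crossed estimated bounds: every t violates one inequality, so the
# minimized criterion is bounded away from zero and the test rejects.
rejects = spec_test_rejects(1.0, 0.0, n=100, cutoff=2.7)
```

The key design choice, mirroring the text, is that the statistic is the criterion minimized over the whole parameter space, not the criterion at a fixed ϑ.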
Their critical value is obtained by working with the asymptotic distribution of inf_{ϑ∈Θ} n q_n(ϑ). As such, their proposal resembles the classic approach to model specification testing (the J-test) in point identified generalized method of moments models.

While it is possible to test for misspecification also in partially identified models, a word of caution is due on what might be the effects of misspecification on confidence sets constructed as in (4.10) with c_{1−α} determined to ensure (4.14), as is often done in empirical work. Bugni, Canay, and Guggenberger (2012) show that in the presence of local misspecification, confidence sets CS_n designed to satisfy (4.14) fail to do so. In practice, the concern is that when the model is misspecified CS_n might be spuriously small. Indeed, we have seen that it can be empty if the misspecification is sufficiently severe. If it is less severe but still present, it may lead to inference that is erroneously interpreted as precise.

It is natural to wonder how this compares to the effect of misspecification on inference in point identified models. In that case, the rich set of tools available for inference allows one to avoid this problem. Consider for example a point identified generalized method of moments model with moment conditions E_P(m_j(w; θ)) = 0, j = 1, ..., |J|, and |J| > d. Let m denote the vector that stacks each of the m_j functions, and let the estimator of θ be

θ̂_n = arg min_{ϑ∈Θ} n m̄_n(ϑ)⊤ Ξ̂^{−1} m̄_n(ϑ),  (5.2)

with Ξ̂ a consistent estimator of Ξ = E_P[m(w; θ) m(w; θ)⊤] and m̄_n(ϑ) the sample analog of E_P(m(w; ϑ)). As shown by Hansen (1982b) for correctly specified models, the distribution of √n(θ̂_n − θ) converges to a Normal with mean vector equal to zero and covariance matrix Σ.
Hall and Inoue (2003) show that when the model is subject to non-local misspecification, √n(θ̂_n − θ*) converges to a Normal with mean vector equal to zero and covariance matrix Σ*, where θ* is the pseudo-true vector (the probability limit of (5.2)) and where Σ* equals Σ if the model is correctly specified, and differs from it otherwise. Let Σ̂* be a consistent estimator of Σ* as in Hall and Inoue (2003). (The considerations that I report here are based on conversations with Joachim Freyberger and notes that he shared with me, for which I thank him.) Define the Wald-statistic based confidence ellipsoid

{ϑ ∈ Θ : n(θ̂_n − ϑ)⊤ Σ̂*^{−1} (θ̂_n − ϑ) ≤ c_{d,1−α}},  (5.3)

with c_{d,1−α} the 1 − α critical value of a χ²_d (chi-squared random variable with d degrees of freedom). Under standard regularity conditions (see Hall and Inoue, 2003) the confidence set in (5.3) covers with asymptotic probability 1 − α the true vector θ if the model is correctly specified, and the pseudo-true vector θ* if the model is incorrectly specified. In either case, (5.3) is never empty and its volume depends on Σ̂*.

Even in the point identified case a confidence set constructed similarly to (4.10), i.e.,

{ϑ ∈ Θ : n m̄_n(ϑ)⊤ Ξ̂^{−1} m̄_n(ϑ) ≤ c_{|J|,1−α}},  (5.4)

where c_{|J|,1−α} is the 1 − α critical value of a χ²_{|J|}, incurs the same problems as its partial identification counterpart. Under standard regularity conditions, if the model is correctly specified, the confidence set in (5.4) covers θ with asymptotic probability 1 − α, because n m̄_n(ϑ)⊤ Ξ̂^{−1} m̄_n(ϑ) ⇒ χ²_{|J|}. However, this confidence set is empty with asymptotic probability P(χ²_{|J|−d} > c_{|J|,1−α}), due to the facts that P(CS_n = ∅) = P(θ̂_n ∉ CS_n) and that n m̄_n(θ̂_n)⊤ Ξ̂^{−1} m̄_n(θ̂_n) ⇒ χ²_{|J|−d}.
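The asymptotic emptiness probability of (5.4) is a simple chi-squared tail computation; the dimensions |J| = 10 and d = 2 below are hypothetical.

```python
from scipy.stats import chi2

def emptiness_prob(J, d, alpha=0.05):
    """Asymptotic P(confidence set (5.4) is empty)
    = P(chi2_{J-d} > c_{J, 1-alpha}), the 1-alpha quantile of a chi2_J."""
    c = chi2.ppf(1 - alpha, df=J)
    return chi2.sf(c, df=J - d)

p = emptiness_prob(J=10, d=2)  # strictly between 0 and alpha
```

Note that the probability shrinks as d grows toward |J| (fewer overidentifying restrictions), and degenerates to α itself in the limiting case d = 0, since then the quantile and the tail refer to the same distribution.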
Hence, it can be arbitrarily small.

In the very special case of a linear regression model with interval outcome data studied by Ponomareva and Tamer (2011), the procedure proposed by Beresteanu and Molinari (2008) yields confidence sets that are always non-empty and whose volume depends on a covariance function that they derive (see Beresteanu and Molinari, 2008, Theorem 4.3). If the linear regression model is correctly specified, and hence {ϑ ∈ Θ : E_P(y_L|x) ≤ ϑ⊤x ≤ E_P(y_U|x), x-a.s.} ≠ ∅, these confidence sets cover {ϑ ∈ Θ : E_P(y_L|x) ≤ ϑ⊤x ≤ E_P(y_U|x), x-a.s.} with asymptotic probability at least equal to 1 − α, as in (4.11). Even if the model is misspecified and {ϑ ∈ Θ : E_P(y_L|x) ≤ ϑ⊤x ≤ E_P(y_U|x), x-a.s.} = ∅, the confidence sets cover the sharp identification region for the parameters of the best linear predictor of y|x, which can be viewed as a pseudo-true set, with probability exactly equal to 1 − α. The test statistic that Beresteanu and Molinari use is based on the Hausdorff distance between the estimator and the hypothesized set, and as such is a generalization of the standard Wald-statistic to the set-valued case. These considerations can be extended to other models. For example, Lee and Bhattacharya (2019) study empirical measurement of Hicksian consumer surplus. (The effect of misspecification for maximum likelihood, least squares, and GMM estimators in "point identified" models (by which I mean models where the population criterion function has a unique optimizer) has been studied in the literature; see, e.g., White (1982), Gallant and White (1988), Hall and Inoue (2003), Hansen and Lee (2019), and references therein.)
These estimators have been shown to converge in probabilityto pseudo-true values, and it has been established that tests of hypotheses and confidence sets based on theseestimators have correct asymptotic level with respect to the pseudo-true parameters, provided standard errorsare computed appropriately. In the specific case of GMM discussed here, the pseudo-true value θ ∗ dependson the choice of weighting matrix in (5.2): I have used ˆΞ, but other choices are possible. I do not discuss thisaspect of the problem here, but refer to Hall and Inoue (2003). H ∗ P [ θ ] that is obtainedthrough a two-step procedure. In the first step one obtains a nonparametric estimator of thefunction(s) for which the researcher wants to impose a parametric structure. In the secondstep one obtains the set H ∗ P [ θ ] as the collection of least squares projections of the set in thefirst step, on the parametric class imposed. Kaido and White show that under regularityconditions the pseudo-true set can be consistently estimated, and derive rates of convergencefor the estimator; however, they do not provide methods to obtain confidence sets. Whileconceptually valuable, their construction appears to be computationally difficult. Mastenand Poirier (2018) propose that when a model is falsified (in the sense that H P [ θ ] is empty)one should report the falsification frontier : the boundary between the set of assumptionswhich falsify the model and those which do not, obtained through continuous relaxationsof the baseline assumptions of concern. The researcher can then present the set H P [ θ ] thatresults if the true model lies somewhere on this frontier. This set can be interpreted as apseudo-true set. However, Masten and Poirier do not provide methods for inference.The implications of misspecification in partially identified models remain an open andimportant question in the literature. 
For example, it would be useful to have notions of pseudo-true set that parallel those of pseudo-true value in the point identified case. It would also be important to provide methods for the construction of confidence sets in general moment inequality models that do not exhibit spurious precision (i.e., are arbitrarily small) when the model is misspecified. Recent work by Andrews and Kwon (2019) addresses some of these questions.

Computational Challenges

As a rule of thumb, the difficulty in computing estimators of identification regions and confidence sets depends on whether a closed form expression is available for the boundary of the set. For example, often nonparametric bounds on functionals of a partially identified distribution are known functionals of observed conditional distributions, as in Section 2. Then “plug in” estimation is possible, and the computational cost is the same as for estimation and construction of confidence intervals (or confidence bands) for point-identified nonparametric regressions (incurred twice, once for the lower bound and once for the upper bound).

Similarly, support function based inference is easy to implement when H_P[θ] is convex. Sometimes the extreme points of H_P[θ] can be expressed as known functionals of observed distributions. Even if not, level sets of convex functions are easy to compute.

But as was shown in Section 3, many problems of interest yield a set H_P[θ] that is not convex. In this case, H_P[θ] is obtained as a level set of a criterion function. Because H_P[θ] (or its associated confidence set) is often a subset of R^d (rather than R), even a moderate value for d, e.g., 8 or 10, can lead to extremely challenging computational problems.
This is because if one wants to compute H_P[θ] or a set that covers it or its elements with a prespecified asymptotic probability (possibly uniformly over P ∈ 𝒫), one has to map out a level set in R^d. If one is interested in confidence intervals for scalar projections or other smooth functions of ϑ ∈ H_P[θ], one needs to solve complex nonlinear optimization problems, as for example in (4.17) and (4.19). This can be difficult to do, especially because c_{1−α}(ϑ) is typically an unknown function of ϑ for which gradients are not available in closed form.

Mirroring the fact that computation is easier when the boundary of H_P[θ] is a known function of observed conditional distributions, several portable software packages are available to carry out estimation and inference in this case. For example, Beresteanu and Manski (2000) provide STATA and MatLab packages implementing the methods proposed by Manski (1989, 1990, 1994, 1995, 1997b), Horowitz and Manski (1998, 2000), and Manski and Pepper (2000). Tauchmann (2014) provides a STATA package to implement the bounds proposed by Lee (2009). McCarthy, Millimet, and Roy (2015) provide a STATA package to implement bounds on treatment effects with endogenous and misreported treatment assignment and under the assumptions of monotone treatment selection, monotone treatment response, and monotone instrumental variables as in Manski (1997b), Manski and Pepper (2000), Kreider and Pepper (2007), Gundersen, Kreider, and Pepper (2012), and Kreider, Pepper, Gundersen, and Jolliffe (2012). The code computes the confidence intervals proposed by Imbens and Manski (2004).
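To make the “plug in” logic concrete in the simplest case, the sketch below estimates worst-case bounds [E_P(y_L), E_P(y_U)] on E_P(y) from interval outcome data and computes a confidence interval in the spirit of Imbens and Manski (2004) (using their critical value that interpolates between the one-sided and two-sided Normal quantiles). The data generating process is hypothetical and chosen only for illustration; this is a sketch, not the packaged implementations cited above.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def imbens_manski_ci(yl, yu, alpha=0.05):
    """Plug-in bounds on E[y] from interval data [yl, yu], with an
    Imbens-Manski style confidence interval for the true parameter."""
    n = len(yl)
    lo, hi = yl.mean(), yu.mean()            # plug-in bound estimates
    sl, su = yl.std(ddof=1), yu.std(ddof=1)
    delta = hi - lo                          # estimated width of the bounds
    # C solves Phi(C + sqrt(n)*delta/smax) - Phi(-C) = 1 - alpha:
    # close to z_{1-alpha} when the region is wide, z_{1-alpha/2} under point id.
    smax = max(sl, su)
    f = lambda cc: norm.cdf(cc + np.sqrt(n) * delta / smax) - norm.cdf(-cc) - (1 - alpha)
    c = brentq(f, 0.0, norm.ppf(1 - alpha / 2))
    return (lo - c * sl / np.sqrt(n), hi + c * su / np.sqrt(n)), (lo, hi), c

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=500)           # latent outcome (never observed)
yl, yu = np.floor(y), np.floor(y) + 1.0      # observed interval [y_L, y_U]
ci, bounds, c = imbens_manski_ci(yl, yu)
```

Because the estimated bounds here are far apart relative to sampling noise, the computed critical value is essentially the one-sided quantile z_{1−α}, illustrating the interpolation.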
In the more general context of inference for a one-dimensional parameter defined by intersection bounds, as for example the one in (2.15), Chernozhukov, Kim, Lee, and Rosen (2015) and Andrews, Kim, and Shi (2017) provide portable STATA code implementing, respectively, methods to test hypotheses and build confidence intervals in Chernozhukov, Lee, and Rosen (2013) and in Andrews and Shi (2013).

Beresteanu, Molinari, and Morris (2010) provide portable STATA code implementing Beresteanu and Molinari (2008)'s method for estimation and inference for best linear prediction with interval outcome data as in Identification Problem 2.4. Chandrasekhar, Chernozhukov, Molinari, and Schrimpf (2012) provide R code implementing Chandrasekhar, Chernozhukov, Molinari, and Schrimpf (2018)'s method for estimation and inference for best linear approximations of set identified functions.

On the other hand, there is a paucity of portable software implementing the theoretical methods for inference in structural partially identified models discussed in Section 4. Ciliberto and Tamer (2009) compute Chernozhukov, Hong, and Tamer (2007) confidence sets for a parameter vector in R^d in an entry game with six players, with d in the order of 20 and with tens of thousands of inequalities, through a “guess and verify” algorithm based on simulated annealing (with no cooling) that visits many candidate values ϑ ∈ Θ, evaluates q_n(ϑ), and builds CS_n by retaining the visited values ϑ that satisfy n q_n(ϑ) ≤ c_{1−α}(ϑ) with c_{1−α} defined to satisfy (4.12). Given the computational resources commonly available at this point in time, this is a tremendously hard task, due to the dimension of θ and the number of moment inequalities employed. As explained in Section 3.2.1, these inequalities, which in a game of entry with J players and discrete observable payoff shifters are 2^{J+1}|X| (with X the support of the observable payoff shifters), yield an outer region O_P[θ].
It is natural to wonder what additional challenges are faced to compute H_P[θ] as described in Section 3.2.2. A definitive answer to this question is hard to obtain. If one employs all inequalities listed in Theorem A.1, the number of inequalities jumps to (2^{2^J} − 2)|X|, increasing the computational cost. However, as suggested by Galichon and Henry (2006) and extended by other authors (e.g., Beresteanu, Molchanov, and Molinari, 2008, 2011; Chesher, Rosen, and Smolinski, 2013; Chesher and Rosen, 2017a), often many moment inequalities are redundant, substantially reducing the number of inequalities to be checked. Specifically, Galichon and Henry (2006) propose the notion of core determining sets, a collection of compact sets such that if the inequality in Theorem A.1 holds for these sets, it holds for all sets in K, see Definition A.8 and the surrounding discussion in Appendix A. This often yields a number of restrictions similar to the one incurred to obtain outer regions. For example, Beresteanu, Molchanov, and Molinari (2008, Section 4.2) analyze a four player, two type entry game with pure strategy Nash equilibrium as solution concept, originally proposed by Berry and Tamer (2006), and show that while a direct application of Theorem A.1 entails 512|X| inequality restrictions, 26|X| suffice. In this example, Ciliberto and Tamer (2009)'s outer region is based on checking 18|X| inequalities.

A related but separate question is how best to allocate the computational effort. As one moves from partial identification analysis to finite sample considerations, one may face a trade-off between sharpness of the identification region and statistical efficiency. This is because inequalities that are redundant from the perspective of identification analysis might nonetheless be estimated with high precision, and hence improve the finite sample statistical properties of a confidence set or of a test of hypothesis.
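The redundancy of Artstein-type inequalities can be checked mechanically on a finite carrier space. In the toy sketch below (the random set and its distribution are made up for illustration), an inequality µ(K) ≤ T_X(K) is flagged as redundant when a linear program shows it is implied by the remaining inequalities together with the probability-simplex constraints. This brute-force check is only a stand-in for, not a reproduction of, the core determining set constructions of Galichon and Henry (2006) and Beresteanu, Molchanov, and Molinari (2008).

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

space = (0, 1, 2)                                # finite carrier space
# hypothetical random set: realizations (as sets) with their probabilities
realizations = [({0}, 0.3), ({1, 2}, 0.4), ({0, 1, 2}, 0.3)]

def T(K):
    """Capacity functional T(K) = P{X ∩ K ≠ ∅}."""
    return sum(p for S, p in realizations if S & K)

# all nonempty proper subsets K of the carrier space
Ks = [frozenset(c) for r in range(1, len(space)) for c in combinations(space, r)]

def redundant(K):
    """K's inequality is implied by the others iff
    max µ(K) subject to {µ(K') ≤ T(K') for K' ≠ K, µ a probability} ≤ T(K)."""
    others = [Kp for Kp in Ks if Kp != K]
    A_ub = [[1.0 if j in Kp else 0.0 for j in space] for Kp in others]
    b_ub = [T(Kp) for Kp in others]
    obj = [-1.0 if j in K else 0.0 for j in space]      # maximize µ(K)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  A_eq=[[1.0] * len(space)], b_eq=[1.0],
                  bounds=(0, 1), method="highs")
    return -res.fun <= T(K) + 1e-9

redundant_Ks = {K for K in Ks if redundant(K)}
```

In this example only the inequalities for {0} and {1,2} bind; the other four are implied, mirroring in miniature the reduction from 2^{2^J}−2 type counts to much smaller core determining families. (Removing inequalities one at a time can matter in general; a proper core determining class is constructed jointly.)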
Recent contributions by Andrews and Shi (2017), Chernozhukov, Chetverikov, and Kato (2018), and Belloni, Bugni, and Chernozhukov (2018) provide methods to build confidence sets, respectively, with a continuum of conditional moment inequalities, and with a number of moment inequalities that may exceed sample size. These contributions, however, do not yet answer the question of how to optimally select inequalities to yield confidence sets with best finite sample properties according to some specified notion of “best”.

A different approach proposed by Chen, Christensen, and Tamer (2018) uses directly a quasi-likelihood criterion function. In the context, e.g., of entry games, this entails assuming that the selection mechanism depends only on observable payoff shifters, using it to obtain the exact model implied distribution as in (3.21), and partially identifying an enlarged parameter vector that includes θ and the selection mechanism. In an empirical application with discrete covariates, Chen, Christensen, and Tamer (2018) apply their method to a two player entry game with correlated errors, where θ and the selection mechanism are finite dimensional vectors, for a total of 17 parameters. In another application to the analysis of trade flows, their empirical application includes 46 parameters.

In terms of general purpose portable code that can be employed in moment inequality models, I am only aware of the MatLab package provided by Kaido, Molinari, Stoye, and Thirkettle (2017) to implement the inference method of Kaido, Molinari, and Stoye (2019a) for projections and smooth functions of parameter vectors in models defined by a finite number of unconditional moment (in)equalities. More broadly, their method can be used to compute confidence intervals for optimal values of optimization problems with estimated constraints.
Here I summarize their approach to further highlight why the computational task is challenging even in the case of projections.

The confidence interval in (4.18)-(4.19) requires solving two nonlinear programs, each with a linear objective and nonlinear constraints involving a critical value which in general is an unknown function of ϑ, with unknown gradient. (Kaido, Molinari, and Stoye (2019a) propose a linearization method whereby c_{1−α} is calibrated through repeatedly solving bootstrap linear programs, hence it is reasonably cheap to compute.) When the dimension of the parameter vector is large, directly solving optimization problems with such constraints can be expensive even if evaluating the critical value at each ϑ is cheap. Hence, Kaido, Molinari, and Stoye propose to use an algorithm (called E-A-M for Evaluation-Approximation-Maximization) to solve these nonlinear programs, which belongs to the family of expected improvement algorithms (see e.g. Jones, Schonlau, and Welch, 1998; Schonlau, Welch, and Jones, 1998; Jones, 2001, and references therein). Given a constrained optimization problem of the form

max_{ϑ∈Θ} u⊤ϑ  s.t.  g_j(ϑ) ≤ c(ϑ), j = 1, ..., J,

to which (4.19) belongs (it suffices to set g_j(ϑ) = √n m̄_{n,j}(ϑ)/σ̂_{n,j}(ϑ) and c(ϑ) = c_{1−α}(ϑ)), the algorithm attempts to solve it by cycling over three steps:

1. The true critical level function c is evaluated at an initial (uniformly randomly drawn from Θ) set of points ϑ¹, ..., ϑ^k. These values are used to compute a current guess for the optimal value, u⊤ϑ^{*,k} = max{u⊤ϑ : ϑ ∈ {ϑ¹, ..., ϑ^k} and ḡ(ϑ) ≤ c(ϑ)}, where ḡ(ϑ) = max_{j=1,...,J} g_j(ϑ). The “training data” (ϑ^ℓ, c(ϑ^ℓ))_{ℓ=1}^k is used to compute an approximating surface c_k through a Gaussian-process regression model (kriging), as described in Santner, Williams, and Notz (2013, Section 4.1.3);

2. For L ≥ k+1, with probability 1−ε the next evaluation point ϑ^L for the true critical level function c is chosen by finding the point that maximizes expected improvement with respect to the approximating surface,

EI_{L−1}(ϑ) = (u⊤ϑ − u⊤ϑ^{*,L−1})_+ {1 − Φ([ḡ(ϑ) − c_{L−1}(ϑ)]/[ς̂ s_{L−1}(ϑ)])}.

Here c_{L−1}(ϑ) and ς̂ s_{L−1}(ϑ) are estimators of the posterior mean and variance of the approximating surface. To aim for global search, with probability ε, ϑ^L is drawn uniformly from Θ. The approximating surface is then recomputed using (ϑ^ℓ, c(ϑ^ℓ))_{ℓ=1}^L. Steps 1 and 2 are repeated until a convergence criterion is met.

3. The extreme point of CI_n is reported as the value u⊤ϑ^{*,L} that maximizes u⊤ϑ among the evaluation points that satisfy the true constraints, i.e. u⊤ϑ^{*,L} = max{u⊤ϑ : ϑ ∈ {ϑ¹, ..., ϑ^L} and ḡ(ϑ) ≤ c(ϑ)}.

The only place where the approximating surface is used is in Step 2, to choose a new evaluation point. In particular, the reported extreme points of CI_n in (4.18) are the extreme values of u⊤ϑ that are consistent with the true surface where this surface was computed, not with the approximating surface. Kaido, Molinari, and Stoye (2019a) establish convergence of their algorithm and obtain a convergence rate, as the number of evaluation points increases, for constrained optimization problems in which the constraints are sufficiently smooth “black box” functions, building on an earlier contribution of Bull (2011). Bull establishes convergence of an expected improvement algorithm for unconstrained optimization problems where the objective is a “black box” function. The rate of convergence that Bull derives depends on the smoothness of the black box objective function. The rate of convergence obtained by Kaido, Molinari, and Stoye depends on the smoothness of the black box constraints, and is slightly slower than Bull's rate.
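The three steps above can be conveyed by a stripped-down sketch on a toy problem. Everything below is hypothetical: the constraint function c is a cheap stand-in for an expensive bootstrap critical value, the Gaussian-process surface is a minimal kriging fit with a fixed kernel and ς̂ = 1, and none of this reproduces the actual implementation of Kaido, Molinari, Stoye, and Thirkettle (2017).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
u = np.array([1.0, 1.0]) / np.sqrt(2)                    # projection direction
g = lambda th: th[..., 0] + th[..., 1] - 1.0             # single "moment" constraint g(θ)
c = lambda th: 0.25 + 0.1 * (np.sin(3 * th[..., 0]) + np.cos(3 * th[..., 1]))  # "black box" c(θ)

def gp_fit_predict(X, y, Xnew, ell=0.3, nugget=1e-8):
    """Posterior mean/sd of a zero-mean GP with RBF kernel (minimal kriging)."""
    d2 = lambda A, B: ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2(X, X) / (2 * ell**2)) + nugget * np.eye(len(X))
    Ks = np.exp(-d2(Xnew, X) / (2 * ell**2))
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, np.sqrt(np.clip(var, 1e-12, None))

# E: evaluate the true c at initial points in Theta = [0,1]^2 (origin is known feasible)
X = np.vstack([[0.0, 0.0], rng.uniform(size=(9, 2))])
yc = c(X)
for _ in range(40):
    feas = g(X) <= yc                                    # feasibility under the TRUE c
    best = (X[feas] @ u).max()                           # current guess u'θ*
    if rng.uniform() < 0.15:                             # ε-probability uniform draw
        nxt = rng.uniform(size=2)
    else:                                                # A-M: maximize expected improvement
        cand = rng.uniform(size=(500, 2))                # candidate points in Theta
        m, s = gp_fit_predict(X, yc, cand)
        ei = np.clip(cand @ u - best, 0, None) * norm.cdf((m - g(cand)) / s)
        nxt = cand[np.argmax(ei)]
    X = np.vstack([X, nxt]); yc = np.append(yc, c(nxt))  # E: evaluate true c at new point
feas = g(X) <= yc
best = (X[feas] @ u).max()                               # reported extreme point value
```

Note that, as in the text, the reported value uses only evaluation points that satisfy the true constraints; the fitted surface enters only through the choice of where to evaluate next.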
Kaido, Molinari, and Stoye's Monte Carlo experiments suggest that the E-A-M algorithm is fast and accurate at computing their confidence intervals. The E-A-M algorithm also allows for very rapid computation of projections of the confidence set proposed by Andrews and Soares (2010), and for a substantial improvement in the computational time of the profiling-based confidence intervals proposed by Bugni, Canay, and Shi (2017). In all cases, the speed improvement results from a reduced number of evaluation points required to approximate the optimum. In an application to a point identified setting, Freyberger and Reeves (2017, Supplement Section S.3) use Kaido, Molinari, and Stoye (2019a)'s E-A-M method to construct uniform confidence bands for an unknown function of interest under (nonparametric) shape restrictions. They benchmark it against gridding and find it to be accurate at considerably improved speed.
Conclusions

This chapter provides a discussion of the econometrics literature on partial identification. It first reviews what can be learned about (functionals of) probability distributions in the absence of parametric restrictions, under various scenarios of data incompleteness. It then reviews what can be learned about functionals characterizing semiparametric structural economic models, under various scenarios of model incompleteness. Finally, it discusses finite sample inference and computational challenges. (Bugni, Canay, and Shi (2017)'s method does not require solving a nonlinear program such as the one in (4.19). Rather it obtains CI_n as in (4.17). However, it approximates c_{1−α} by repeatedly solving bootstrap nonlinear programs, thereby incurring a very high computational cost at that stage.)

The characterizations reviewed in this chapter often take the form of sharp identification regions. Sharpness often requires many moment inequalities, the number of which can exceed the available sample size. Hence, there is a need for appropriate statistical inference methods. As briefly mentioned in Sections 4 and 6, methods designed to provide valid tests of hypotheses and confidence sets in this scenario already exist. However, I would argue that there is a need to better understand the trade-off between sharpness of the population identification region and statistical efficiency, especially in the context of conditional moment inequalities where instrument functions are needed to transform the inequalities into unconditional ones. Similarly, there is a need for more research on data driven procedures for the choice of tuning parameters for the construction of confidence sets, in particular in the case of projection inference where the question has not yet been addressed. Another open and arguably important question in the literature is how to build confidence sets for general moment inequality models that do not exhibit spurious precision (i.e., are arbitrarily small) when the model is misspecified.

Basic Definitions and Facts from Random Set Theory
This appendix provides basic definitions and results from random set theory that are used throughout this chapter. I refer to Molchanov (2017) for a textbook presentation of random set theory, and to Molchanov and Molinari (2018) for a discussion focusing on its applications in econometrics.

The theory of random closed sets generally applies to the space of closed subsets of a locally compact Hausdorff second countable topological space X, see Molchanov (2017). In this chapter I let X = R^d to simplify the exposition. Closedness is a property satisfied by random points (singleton sets), so that the theory of random closed sets includes the classical case of random points or random vectors as a special case. A random closed set is a measurable map X : Ω → F, where measurability is defined by specifying the family of functionals of X that are random variables.

Definition
A.1 (Random closed set): A map X from a probability space (Ω, F, P) to the family F of closed subsets of R^d is called a random closed set if

X⁻(K) = {ω ∈ Ω : X(ω) ∩ K ≠ ∅}  (A.1)

belongs to the σ-algebra F on Ω for each compact set K in R^d.

A random compact set is a random closed set which is compact with probability one, so that almost all values of X are compact sets. A random convex closed set is defined similarly, so that X(ω) is a convex closed set for almost all ω.

Definition A.1 means that X is explored by its hitting events, i.e., the events where X hits a compact set K. The corresponding hitting probabilities are very important in random set theory, because they uniquely determine the probability distribution of a random closed set X, see Molchanov (2017, Section 1.1.3). The formal definition of the hitting probabilities, and the closely related containment probabilities, follows.

Definition
A.2 (Capacity functional and containment functional) :
1. A functional T_X(K) : K → [0, 1] given by

T_X(K) = P{X ∩ K ≠ ∅}, K ∈ K,

is called capacity (or hitting) functional of X.

2. A functional C_X(F) : F → [0, 1] given by

C_X(F) = P{X ⊂ F}, F ∈ F,

is called the containment functional of X. (The treatment here summarizes a few of the topics presented in Molchanov and Molinari (2018).)

I write T(K) instead of T_X(K) and C(K) instead of C_X(K) where no ambiguity occurs.

Ever since the seminal work of Aumann (1965), it has been common to think of random sets as bundles of random variables – the selections of the random sets.
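For a random closed set with finitely many realizations, both functionals can be enumerated directly. The sketch below (with a made-up distribution) also checks the duality C(F) = 1 − T(Fᶜ) linking the containment and capacity functionals of an almost surely nonempty random set on a finite carrier space.

```python
from itertools import combinations

space = frozenset({0, 1, 2})
# hypothetical random closed set: realizations and their probabilities
dist = {frozenset({0}): 0.2, frozenset({1, 2}): 0.5, frozenset({0, 1, 2}): 0.3}

def T(K):
    """Capacity (hitting) functional: P{X ∩ K ≠ ∅}."""
    return sum(p for S, p in dist.items() if S & K)

def C(F):
    """Containment functional: P{X ⊆ F}."""
    return sum(p for S, p in dist.items() if S <= F)

subsets = [frozenset(c) for r in range(len(space) + 1)
           for c in combinations(space, r)]
# duality: X ⊆ F iff X misses the complement of F (X nonempty a.s.)
checks = all(abs(C(F) - (1 - T(space - F))) < 1e-12 for F in subsets)
```

Enumerating T over all compact (here: all) subsets is exactly what Definition A.2 asks for, and is what core determining classes (Definition A.8) aim to economize on.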
Definition
A.3 (Measurable selection): For any random set X, a (measurable) selection of X is a random element x with values in R^d such that x(ω) ∈ X(ω) almost surely. I denote by Sel(X) the set of all selections from X.

The space of closed sets is not linear, which causes substantial difficulties in defining the expectation of a random set. One approach, inspired by Aumann (1965) and pioneered by Artstein and Vitale (1975), relies on representing a random set using the family of its selections, and considering the set formed by their expectations. If X possesses at least one integrable selection, then X is called integrable. The family of all integrable selections of X is denoted by Sel¹(X).

Definition
A.4 (Unconditional and conditional Aumann – or selection – expectation): The (selection or) Aumann expectation of an integrable random closed set X is given by

E X = cl{∫_Ω x dP : x ∈ Sel¹(X)}.

For each sub-σ-algebra B ⊂ F, the conditional (selection or) Aumann expectation of X given B is the B-measurable random closed set Y = E(X | B) such that the family of B-measurable integrable selections of Y, denoted Sel¹_B(Y), satisfies

Sel¹_B(Y) = cl{E(x | B) : x ∈ Sel¹(X)},

where the closure in the right-hand side is taken in L¹.

If X is almost surely non-empty and its norm ‖X‖ = sup{‖x‖ : x ∈ X} is an integrable random variable, then X is said to be integrably bounded and all its selections are integrable. In this case, since X takes its realizations in R^d, the family of expectations of these integrable selections is already closed and there is no need to take an additional closure as required in Definition A.4, see Molchanov (2017, Theorem 2.1.37). The selection expectation depends on the probability space used to define X, see Molchanov (2017, Section 2.1.2) and Molchanov and Molinari (2018, Section 3.1). In particular, if the probability space is non-atomic and X is integrably bounded, the selection expectation E X is a convex set regardless of whether or not X might be non-convex itself (Molchanov and Molinari, 2018, Theorem 3.4). This convexification property of the selection expectation implies that the expectation of the closed convex hull of X equals the closed convex hull of E X, which in turn equals E X. It is then natural to describe the Aumann expectation through its support function, because this function traces out a convex set's boundary and therefore knowing the support function is equivalent to knowing the set itself, see equation (A.2) below.

Definition
A.5 (Support function): Let K be a convex set. The support function of K is

h_K(u) = sup{k⊤u : k ∈ K}, u ∈ R^d,

where k⊤u denotes the scalar product.

The support function is finite for all u if K is bounded, and is sublinear (positively homogeneous and subadditive) in u. Hence, it can be considered only for u ∈ B^d or u ∈ S^{d−1}. Moreover, one has

K = ∩_{u∈B^d} {k : k⊤u ≤ h_K(u)} = ∩_{u∈S^{d−1}} {k : k⊤u ≤ h_K(u)}.  (A.2)

Next, I define the Hausdorff metric, a distance on the family K of compact sets:

Definition
A.6 (Hausdorff metric) : Let
K, L ∈ K . The
Hausdorff distance between K and L is

d_H(K, L) = inf{r > 0 : K ⊆ L^r, L ⊆ K^r},

where K^r = {x : d(x, K) ≤ r} is the r-envelope of K.

Since K ⊆ L if and only if h_K(u) ≤ h_L(u) for all u ∈ S^{d−1}, and h_{K^r}(u) = h_K(u) + r, the uniform metric for support functions on the sphere turns into the Hausdorff distance between compact convex sets. Namely,

d_H(K, L) = sup{|h_K(u) − h_L(u)| : ‖u‖ = 1}.  (A.3)

It follows that ‖K‖ = d_H(K, {0}) = sup{|h_K(u)| : ‖u‖ = 1}.

Finally, I define independently and identically distributed random closed sets (see Molchanov, 2017, Proposition 1.1.40 and Theorem 1.3.20, respectively):
Definition
A.7 (i.i.d. random closed sets): Random closed sets X₁, ..., X_n in R^d are independent if and only if P{X₁ ∩ K₁ ≠ ∅, ..., X_n ∩ K_n ≠ ∅} = ∏_{i=1}^n T_{X_i}(K_i) for all K₁, ..., K_n ∈ K. They are identically distributed if and only if for each open set G, P{X₁ ∩ G ≠ ∅} = P{X₂ ∩ G ≠ ∅} = ··· = P{X_n ∩ G ≠ ∅}.

With these definitions in hand, I can state the theorems used throughout the chapter. The first is a dominance condition due to Artstein (1983) (and Norberg, 1992) that characterizes probability distributions of selections (see Molchanov and Molinari, 2018, Section 2.2):

Theorem A.1 (Artstein). A probability distribution µ on R^d is the distribution of a selection of a random closed set X in R^d if and only if

µ(K) ≤ T(K) = P{X ∩ K ≠ ∅}  (A.4)

for all compact sets K ⊆ R^d. Equivalently, if and only if

µ(F) ≥ C(F) = P{X ⊂ F}  (A.5)

for all closed sets F ⊂ R^d. If X is a compact random closed set, it suffices to check (A.5) for compact sets F only.

If µ from Theorem A.1 is the distribution of some random vector x, then it is not guaranteed that x ∈ X a.s., e.g. x can be independent of X. Theorem A.1 means that for each such µ, it is possible to construct x with distribution µ that belongs to X almost surely. In other words, x and X can be realized on the same probability space (coupled) as random elements x′ and X′ such that x and x′ (respectively X and X′) have the same distribution, with x′ ∈ X′ a.s.

The definition of the distribution of a random closed set (Definition A.2) and the characterization results for its selections in Theorem A.1 require working with functionals defined on the family of all compact sets, which in general is very rich. It is therefore important to reduce the family of all compact sets required to describe the distribution of the random closed set or to characterize its selections.

Definition
A.8: A family of compact sets M is said to be a core determining class for a random closed set X if any probability measure µ satisfying the inequalities

µ(K) ≤ P{X ∩ K ≠ ∅}  (A.6)

for all K ∈ M, is the distribution of a selection of X, implying that (A.6) holds for all compact sets K.

The notion of a core determining class was introduced by Galichon and Henry (2006). A simple and general, but still mostly too rich, core determining class is obtained as a subfamily of all compact sets that is dense in a certain sense in the family K. For instance, in the Euclidean space, it suffices to consider compact sets obtained as finite unions of closed balls with rational centers and radii (e.g., Galichon and Henry, 2006, Theorem 3c). For the case that X is a subset of a finite space, Beresteanu, Molchanov, and Molinari (2008, Algorithm 5.1) propose a simple algorithm to compute core determining classes. Chesher and Rosen (2012) provide a related algorithm. Throughout this chapter, several results are mentioned where the class of sets over which (A.4) is verified is reduced from the class of compact subsets of the carrier space to a (significantly) smaller collection.

The next result characterizes a dominance condition that can be used to verify the existence of selections of X with specific properties for their means (see Molchanov and Molinari, 2018, Sections 3.2-3.3):

Theorem A.2 (Convexification in R^d). Let X be an integrable random set. If X is defined on a non-atomic probability space, or if X is almost surely convex, then E X = E conv X and

E h_X(u) = h_{E X}(u), u ∈ R^d.  (A.7)

If P is atomless over B, then E(X | B) is convex and

E(h_X(u) | B) = h_{E(X|B)}(u), u ∈ R^d.  (A.8)

Hence, for any vector b ∈ R^d, it holds that

b ∈ E X ⇔ b⊤u ≤ E h_X(u) ∀ u ∈ S^{d−1},  (A.9)

b ∈ E(X | B) ⇔ b⊤u ≤ E(h_X(u) | B) ∀ u ∈ S^{d−1}.
(A.10)

An important consequence of Theorem A.2 is that it allows one to verify whether b ∈ E X without having to compute E X but only E h_X(u) (and similarly for the conditional case), a substantially easier task. (An event A′ ∈ B is called a B-atom if P{0 < P(A | B) < P(A′ | B)} = 0 for all A ⊂ A′ such that A ∈ F; Theorem A.2 requires P to have no B-atoms.)

Finally, i.i.d. random closed sets satisfy a law of large numbers and a central limit theorem that are similar to the ones for random singletons. Recall that the Minkowski sum of two sets K and L in a linear space (which in this chapter I assume to be the Euclidean space R^d) is obtained by adding each point from K to each point from L, formally,

K + L = {x + y : x ∈ K, y ∈ L}.

Below, X₁ + ··· + X_n denotes the Minkowski sum of the random closed sets X₁, ..., X_n, and (X₁ + ··· + X_n)/n denotes their Minkowski average.

Theorem A.3 (Law of large numbers for integrably bounded random sets). Let X, X₁, X₂, ... be i.i.d. integrably bounded random compact sets. Define S_n = X₁ + ··· + X_n. Then

d_H(S_n/n, E X) → 0 a.s. as n → ∞.  (A.11)

The support function of a random closed set X such that E‖X‖² < ∞ is a random continuous function h_X(u) on S^{d−1} with square integrable values. Define its covariance

Γ_X(u, v) ≡ E[(h_X(u) − h_{E X}(u))(h_X(v) − h_{E X}(v))], u, v ∈ S^{d−1}.  (A.12)

Let ζ(u) be a centered Gaussian random field on S^{d−1} with the same covariance structure as X, i.e. E[ζ(u)ζ(v)] = Γ_X(u, v), u, v ∈ S^{d−1}. Since the support function of a compact set is Lipschitz, it is easy to show that the random field ζ has a continuous modification by bounding the moments of |ζ(u) − ζ(v)|.

Theorem A.4 (Central limit theorem). Let X₁, X₂, ... be i.i.d. copies of a random closed set X in R^d such that E‖X‖² < ∞, and let S_n = X₁ + ··· + X_n.
Then as n → ∞,

√n (h_{S_n/n}(u) − h_{E X}(u)) ⇒ ζ  (A.13)

in the space of continuous functions on the unit sphere with the uniform metric. Furthermore,

√n d_H(S_n/n, E X) ⇒ ‖ζ‖_∞ = sup{|ζ(u)| : u ∈ S^{d−1}}.  (A.14)

References

Abaluck, J., and
A. Adams (2018): “What Do Consumers Consider Before They Choose? Identification from Asymmetric Demand Responses,” available at https://abiadams.com/wp-content/uploads/2018/06/DiscreteChoiceInattention_master.pdf .

Abbring, J. H., and
J. J. Heckman (2007): “Chapter 72 – Econometric Evaluation of Social Programs, Part III: Distributional Treatment Effects, Dynamic Treatment Effects, Dynamic Discrete Choice, and General Equilibrium Policy Evaluation,” in
Handbook of Econometrics , ed. by J. J. Heckman, and
E. E. Leamer, vol. 6, pp. 5145–5303. Elsevier.
Adams, A. (2019): “Mutually Consistent Revealed Preference Demand Predictions,”
American Economic Journal: Microeconomics , forthcoming.
Adusumilli, K., and
T. Otsu (2017): “Empirical Likelihood for Random Sets,”
Journal of the American Statistical Association , 112(519), 1064–1075.
Afriat, S. N. (1967): “The Construction of Utility Functions from Expenditure Data,”
International Economic Review , 8(1), 67–77.
Andrews, D. W. K., and
P. J. Barwick (2012): “Inference for parameters defined by moment inequalities: a recommended moment selection procedure,”
Econometrica , 80(6), 2805–2826.
Andrews, D. W. K., and
P. Guggenberger (2009): “Validity of Subsampling and ‘Plug-in Asymptotic’ Inference for Parameters Defined by Moment Inequalities,”
Econometric Theory , 25(3), 669–709.
Andrews, D. W. K., W. Kim, and
X. Shi (2017): “Commands for testing conditional moment inequalities and equalities,”
Stata Journal , 17(1), 56–72.
Andrews, D. W. K., and
S. Kwon (2019): “Inference in Moment Inequality Models That Is Robust to Spurious Precision under Model Misspecification,” available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3416831 .

Andrews, D. W. K., and
X. Shi (2013): “Inference based on conditional moment inequal-ities,”
Econometrica , 81(2), 609–666.(2017): “Inference based on many conditional moment inequalities,”
Journal ofEconometrics , 196(2), 275 – 287.
Andrews, D. W. K., and
G. Soares (2010): “Inference for Parameters Defined by MomentInequalities Using Generalized Moment Selection,”
Econometrica , 78(1), 119–157.120 radillas-Lopez, A., and
E. Tamer (2008): “The Identification Power of Equilibrium inSimple Games,”
Journal of Business & Economic Statistics , 26(3), 261–283.
AradillasLpez, A., A. Gandhi, and
D. Quint (2013): “Identification and Inference inAscending Auctions With Correlated Private Values,”
Econometrica , 81(2), 489–534.
Aristodemou, E. (2019): “Semiparametric Identification in Panel Data Discrete Re-sponse Models,” available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3420016 . Armstrong, T. B. (2013): “Bounds in auctions with unobserved heterogeneity,”
Quanti-tative Economics , 4(3), 377–415.(2014): “Weighted KS statistics for inference on conditional moment inequalities,”
Journal of Econometrics , 181(2), 92 – 116.(2015): “Asymptotically exact inference in conditional moment inequality models,”
Journal of Econometrics , 186(1), 51 – 65.
Armstrong, T. B., and
H. P. Chan (2016): “Multiscale adaptive inference on conditionalmoment inequalities,”
Journal of Econometrics , 194(1), 24 – 43.
Artstein, Z. (1983): “Distributions of random sets and random selections,”
Israel Journalof Mathematics , 46, 313–324.
Artstein, Z., and
R. A. Vitale (1975): “A strong law of large numbers for random compact sets,” Annals of Probability, 3, 879–882.

Athey, S., and P. A. Haile (2002): “Identification of Standard Auction Models,” Econometrica, 70(6), 2107–2140.

Athey, S., and G. W. Imbens (2006): “Identification and Inference in Nonlinear Difference-in-Differences Models,” Econometrica, 74(2), 431–497.

Aucejo, E. M., F. A. Bugni, and V. J. Hotz (2017): “Identification and inference on regressions with missing covariate data,” Econometric Theory, 33(1).

Aumann, R. J. (1965): “Integrals of set-valued functions,” Journal of Mathematical Analysis and Applications, 12(1), 1–12.

Bajari, P., H. Hong, and S. P. Ryan (2010): “Identification and estimation of a discrete game of complete information,” Econometrica, 78(5), 1529–1568.

Balke, A., and J. Pearl (1997): “Bounds on Treatment Effects From Studies With Imperfect Compliance,” Journal of the American Statistical Association, 92(439), 1171–1176.

Barseghyan, L., M. Coughlin, F. Molinari, and J. C. Teitelbaum (2019): “Heterogeneous Choice Sets and Preferences,” available at https://arxiv.org/abs/1907.02337.

Barseghyan, L., F. Molinari, T. O’Donoghue, and J. C. Teitelbaum (2013): “The Nature of Risk Preferences: Evidence from Insurance Choices,” American Economic Review, 103(6), 2499–2529.

——— (2018): “Estimating Risk Preferences in the Field,” Journal of Economic Literature, 56(2).

Barseghyan, L., F. Molinari, and J. C. Teitelbaum (2016): “Inference under stability of risk preferences,” Quantitative Economics, 7(2), 367–409.

Barseghyan, L., F. Molinari, and M. Thirkettle (2019): “Discrete Choice under Risk with Limited Consideration,” available at https://arxiv.org/abs/1902.06629.

Bazaraa, M. S., H. D. Sherali, and C. Shetty (2006): Nonlinear programming: theory and algorithms. Hoboken, N.J.: Wiley-Interscience, 3rd edn.

Belloni, A., F. A. Bugni, and V. Chernozhukov (2018): “Subvector inference in partially identified models with many moment inequalities,” available at https://arxiv.org/abs/1806.11466.

Beresteanu, A., and C. F. Manski (2000): “Bounds for STATA and Bounds for MatLab,” available at http://faculty.wcas.northwestern.edu/~cfm754/bounds_stata.pdf.

Beresteanu, A., I. Molchanov, and
F. Molinari (2008): “Sharp Identification Regions in Games,” CeMMAP working paper CWP15/08, available at .

——— (2011): “Sharp identification regions in models with convex moment predictions,” Econometrica, 79(6), 1785–1821.

——— (2012): “Partial identification using random set theory,” Journal of Econometrics, 166(1), 17–32, with errata at https://molinari.economics.cornell.edu/docs/NOTE_BMM2012_v3.pdf.

Beresteanu, A., and F. Molinari (2008): “Asymptotic Properties for a Class of Partially Identified Models,” Econometrica, 76(4), 763–814.

Beresteanu, A., F. Molinari, and D. S. Morris (2010): “Asymptotics for Partially Identified Models in STATA,” available at https://molinari.economics.cornell.edu/programs/Stata_SetBLP.zip.

Bergemann, D., and S. Morris (2016): “Bayes correlated equilibrium and the comparison of information structures in games,” Theoretical Economics, 11(2), 487–522.

Berry, S. T. (1992): “Estimation of a Model of Entry in the Airline Industry,” Econometrica, 60(4), 889–917.

Berry, S. T., and G. Compiani (2019): “An Instrumental Variable Approach to Dynamic Models,” available at https://drive.google.com/file/d/1pl1PW1w8eh3gnrTMKUBuS6T6TIKtvf9c/view.

Berry, S. T., J. Levinsohn, and A. Pakes (1995): “Automobile Prices in Market Equilibrium,” Econometrica, 63(4), 841–890.

Berry, S. T., and E. Tamer (2006): “Identification in Models of Oligopoly Entry,” in Advances in Economics and Econometrics: Theory and Applications, Ninth World Congress, ed. by R. Blundell, W. K. Newey, and T. E. Persson, vol. 2 of Econometric Society Monographs, pp. 46–85. Cambridge University Press.

Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2012): “Treatment effect bounds: An application to Swan–Ganz catheterization,” Journal of Econometrics, 168(2), 223–243.

Bickel, P. J., C. A. Klaassen, Y. Ritov, and J. A. Wellner (1993): Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York.

Bjorn, P. A., and Q. H. Vuong (1984): “Simultaneous Equations Models for Dummy Endogenous Variables: A Game Theoretic Formulation with an Application to Labor Force Participation,” CIT working paper SSWP 537, California Institute of Technology, available at http://resolver.caltech.edu/CaltechAUTHORS:20170919-140310752.

Blevins, J. R. (2015): “Non-Standard Rates of Convergence of Criterion-Function-Based Set Estimators,” Econometrics Journal, 18, 172–199.
Block, H. D., and J. Marschak (1960): “Random Orderings and Stochastic Theories of Responses,” in Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, ed. by I. Olkin, pp. 97–132. Stanford University Press.

Blume, L. E., W. A. Brock, S. N. Durlauf, and Y. M. Ioannides (2011): “Identification of Social Interactions,” in Handbook of Social Economics, ed. by J. Benhabib, A. Bisin, and M. O. Jackson, vol. 1, pp. 853–964. North-Holland.

Blundell, R. (2019): “Revealed preference,” in Handbook of Econometrics. Elsevier.

Blundell, R., M. Browning, and I. Crawford (2008): “Best Nonparametric Bounds on Demand Responses,” Econometrica, 76(6), 1227–1262.

Blundell, R., A. Gosling, H. Ichimura, and C. Meghir (2007): “Changes in the Distribution of Male and Female Wages Accounting for Employment Composition Using Bounds,” Econometrica, 75(2), 323–363.

Blundell, R., D. Kristensen, and R. Matzkin (2014): “Bounding quantile demand functions using revealed preference inequalities,” Journal of Econometrics, 179(2), 112–127.

Blundell, R., and J. R. Smith (1994): “Coherency and Estimation in Simultaneous Models with Censored or Qualitative Dependent Variables,” Journal of Econometrics, 64, 355–373.

Bontemps, C., T. Magnac, and E. Maurin (2012): “Set identified linear models,” Econometrica, 80(3), 1129–1155.

Bresnahan, T. F., and P. C. Reiss (1988): “Do Entry Conditions Vary Across Markets?,” Brookings Papers on Economic Activity, pp. 833–871.

——— (1990): “Entry in Monopoly Markets,” The Review of Economic Studies, 57(4), 531–553.

——— (1991): “Empirical models of discrete games,” Journal of Econometrics, 48(1), 57–81.
Brock, W. A., and S. N. Durlauf (2001): “Chapter 54 – Interactions-Based Models,” in Handbook of Econometrics, ed. by J. J. Heckman and E. Leamer, vol. 5, pp. 3297–3380. Elsevier.

Bugni, F. A. (2010): “Bootstrap inference in partially identified models defined by moment inequalities: coverage of the identified set,” Econometrica, 78(2), 735–753.

——— (2016): “Comparison of inferential methods in partially identified models in terms of error in coverage probability,” Econometric Theory, 32(1), 187–242.

Bugni, F. A., I. A. Canay, and P. Guggenberger (2012): “Distortions of Asymptotic Confidence Size in Locally Misspecified Moment Inequality Models,” Econometrica, 80(4), 1741–1768.

Bugni, F. A., I. A. Canay, and X. Shi (2015): “Specification tests for partially identified models defined by moment inequalities,” Journal of Econometrics, 185(1), 259–282.

——— (2017): “Inference for subvectors and other functions of partially identified parameters in moment inequality models,” Quantitative Economics, 8(1), 1–38.

Bull, A. D. (2011): “Convergence rates of efficient global optimization algorithms,” Journal of Machine Learning Research, 12(Oct), 2879–2904.

Bureau of Labor Statistics (2018): “Occupational Employment Statistics,” U.S. Department of Labor, available online at ; accessed 1/28/2018.

Campbell, J. Y. (2014): “Empirical Asset Pricing: Eugene Fama, Lars Peter Hansen, and Robert Shiller,” The Scandinavian Journal of Economics, 116(3), 593–634.

Canay, I. A. (2010): “EL inference for partially identified models: Large deviations optimality and bootstrap validity,” Journal of Econometrics, 156(2), 408–425.

Canay, I. A., and A. M. Shaikh (2017): “Practical and Theoretical Advances in Inference for Partially Identified Models,” in Advances in Economics and Econometrics: Eleventh World Congress, ed. by B. Honoré, A. Pakes, M. Piazzesi, and L. Samuelson, vol. 2 of Econometric Society Monographs, pp. 271–306. Cambridge University Press.

Canova, F., and G. De Nicolo (2002): “Monetary disturbances matter for business fluctuations in the G-7,” Journal of Monetary Economics, 49(6), 1131–1159.

Caplin, A. (2016): “Measuring and Modeling Attention,” Annual Review of Economics, 8(1), 379–403.

Cattaneo, M. D., X. Ma, Y. Masatlioglu, and E. Suleymanov (2019): “A Random Attention Model,” Journal of Political Economy, forthcoming, available at https://arxiv.org/abs/1712.03448.

Chandrasekhar, A. (2016): “Econometrics of Network Formation,” in Oxford Handbook on the Economics of Networks, ed. by Y. Bramoulle, A. Galeotti, and B. Rogers, chap. 13. Oxford University Press.
Chandrasekhar, A., V. Chernozhukov, F. Molinari, and P. Schrimpf (2012): “R code implementing best linear approximations to set identified functions,” available at https://bitbucket.org/paulschrimpf/mulligan-rubinstein-bounds.

——— (2018): “Best linear approximations to set identified functions: with an application to the gender wage gap,” CeMMAP working paper CWP09/19, available at .

Chen, X., T. M. Christensen, and E. Tamer (2018): “MCMC Confidence Sets for Identified Sets,” Econometrica, 86(6), 1965–2018.

Chernozhukov, V., D. Chetverikov, and K. Kato (2018): “Inference on causal and structural parameters using many moment inequalities,” Review of Economic Studies, forthcoming, available at https://doi.org/10.1093/restud/rdy065.

Chernozhukov, V., I. Fernández-Val, J. Hahn, and W. Newey (2013): “Average and quantile effects in nonseparable panel models,” Econometrica, 81(2), 535–580.

Chernozhukov, V., H. Hong, and E. Tamer (2007): “Estimation and Confidence Regions for Parameter Sets in Econometric Models,” Econometrica, 75(5), 1243–1284.

Chernozhukov, V., W. Kim, S. Lee, and A. M. Rosen (2015): “Implementing intersection bounds in Stata,” Stata Journal, 15(1), 21–44.

Chernozhukov, V., E. Kocatulum, and K. Menzel (2015): “Inference on sets in finance,” Quantitative Economics, 6(2), 309–358.

Chernozhukov, V., S. Lee, and A. M. Rosen (2013): “Intersection Bounds: estimation and inference,” Econometrica, 81(2), 667–737.
Chesher, A., and A. M. Rosen (2012): “Simultaneous equations for discrete outcomes: coherence, completeness, and identification,” CeMMAP working paper CWP21/12, available at .

——— (2017a): “Generalized instrumental variable models,” Econometrica, 85, 959–989.

——— (2017b): “Incomplete English auction models with heterogeneity,” CeMMAP working paper CWP27/17, available at .

——— (2019): “Generalized instrumental variable models, methods, and applications,” in Handbook of Econometrics. Elsevier.

Chesher, A., A. M. Rosen, and K. Smolinski (2013): “An instrumental variable model of multiple discrete choice,” Quantitative Economics, 4(2), 157–196.

Chetty, R. (2012): “Bounds on elasticities with optimization frictions: a synthesis of micro and macro evidence in labor supply,” Econometrica, 80(3), 969–1018.

Chetverikov, D. (2018): “Adaptive Test of Conditional Moment Inequalities,” Econometric Theory, 34(1), 186–227.

Cho, W. T., and C. F. Manski (2009): “Cross-Level/Ecological Inference,” in Oxford Handbook of Political Methodology, ed. by J. M. Box-Steffensmeier, H. E. Brady, and D. Collier, chap. 24, pp. 530–569. Oxford University Press.

Choquet, G. (1953/54): “Theory of capacities,” Annales de l’Institut Fourier (Grenoble), 5, 131–295.

Ciliberto, F., and E. Tamer (2009): “Market Structure and Multiple Equilibria in Airline Markets,” Econometrica, 77(6), 1791–1828.

Clyde, M., and E. I. George (2004): “Model Uncertainty,” Statistical Science, 19(1), 81–94.

Cochrane, J. H. (2005): Asset Pricing. Princeton University Press, Princeton, New Jersey, second edn.

Cowell, F. A. (1991): “Grouping bounds for inequality measures under alternative informational assumptions,” Journal of Econometrics, 48(1), 1–14.

Crawford, I., and B. De Rock (2014): “Empirical Revealed Preference,” Annual Review of Economics, 6(1), 503–524.

Cross, P. J., and C. F. Manski (2002): “Regressions, Short and Long,” Econometrica, 70(1), 357–368.

Debreu, G. (1967): “Integration of correspondences,” in Proceedings of the Fifth Berkeley Symposium in Mathematical Statistics and Probability, vol. 2, pp. 351–372. University of California Press.

D’Haultfoeuille, X., C. Gaillac, and A. Maurel (2018): “Rationalizing Rational Expectations? Tests and Deviations,” NBER working paper 25274, available at .

Dickstein, M. J., and
E. Morales (2018): “What do Exporters Know?,” The Quarterly Journal of Economics, 133(4), 1753–1801.

Dominitz, J., and C. F. Manski (2017): “More Data or Better Data? A Statistical Decision Problem,” The Review of Economic Studies, 84(4), 1583–1605.

Dominitz, J., and R. P. Sherman (2004): “Sharp bounds under contaminated or corrupted sampling with verification, with an application to environmental pollutant data,” Journal of Agricultural, Biological, and Environmental Statistics, 9(3), 319–338.

——— (2005): “Identification and estimation of bounds on school performance measures: a nonparametric analysis of a mixture model with verification,” Journal of Applied Econometrics, 21(8), 1295–1326.

Duncan, O. D., and B. Davis (1953): “An Alternative to Ecological Correlation,” American Sociological Review, 18(6), 665–666.

Echenique, F. (2005): “A short and constructive proof of Tarski’s fixed-point theorem,” International Journal of Game Theory, 33(2), 215–218.

Eizenberg, A. (2014): “Upstream Innovation and Product Variety in the U.S. Home PC Market,” The Review of Economic Studies, 81(3), 1003–1045.

Ellickson, P. B., S. Houghton, and C. Timmins (2013): “Estimating network economies in retail chains: a revealed preference approach,” The RAND Journal of Economics, 44(2), 169–193.

Epstein, L. G., H. Kaido, and K. Seo (2016): “Robust confidence regions for incomplete models,” Econometrica, 84, 1799–1838.

Falmagne, J. (1978): “A representation theorem for finite random scale systems,” Journal of Mathematical Psychology, 18(1), 52–72.

Fama, E. F. (1996): “Multifactor Portfolio Efficiency and Multifactor Asset Pricing,” The Journal of Financial and Quantitative Analysis, 31(4), 441–465.

Fan, Y., R. Sherman, and M. Shum (2014): “Identifying Treatment Effects Under Data Combination,” Econometrica, 82(2), 811–822.

Fang, Z., and A. Santos (2018): “Inference on Directionally Differentiable Functions,” The Review of Economic Studies, 86(1), 377–412.

Faust, J. (1998): “The robustness of identified VAR conclusions about money,” Carnegie-Rochester Conference Series on Public Policy, 49, 207–244.
Ferson, W. E. (2003): “Chapter 12 – Tests of multifactor pricing models, volatility bounds and portfolio performance,” in Financial Markets and Asset Pricing, vol. 1 of Handbook of the Economics of Finance, pp. 743–802. Elsevier.

Fisher, F. M. (1966): The Identification Problem in Econometrics. McGraw-Hill Book Company.

Frank, M. J., R. B. Nelsen, and B. Schweizer (1987): “Best-possible bounds for the distribution of a sum — a problem of Kolmogorov,” Probability Theory and Related Fields, 74(2), 199–211.

Fréchet, M. R. (1951): “Sur les tableaux de corrélation dont les marges sont données,” Annales de l’Université de Lyon A, 3, 53–77.

Freyberger, J., and J. L. Horowitz (2015): “Identification and shape restrictions in nonparametric instrumental variables estimation,” Journal of Econometrics, 189(1), 41–53.

Freyberger, J., and B. Reeves (2017): “Inference Under Shape Restrictions,” available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3011474.

Frisch, R. (1934): Statistical Confluence Analysis by Means of Complete Regression Systems, Okonomiske Institutt Oslo: Publikasjon. Universitetets Økonomiske Institutt.

Gafarov, B., M. Meier, and J. L. Montiel Olea (2018): “Delta-method inference for a class of set-identified SVARs,” Journal of Econometrics, 203(2), 316–327.

Galichon, A. (2016): Optimal Transport Methods in Economics. Princeton University Press.

Galichon, A., and M. Henry (2006): “Inference in Incomplete Models,” available at http://dx.doi.org/10.2139/ssrn.886907.

——— (2009): “A test of non-identifying restrictions and confidence regions for partially identified parameters,” Journal of Econometrics, 152(2), 186–196.

——— (2011): “Set Identification in Models with Multiple Equilibria,” The Review of Economic Studies, 78(4), 1264–1298.

——— (2013): “Dilation bootstrap,” Journal of Econometrics, 177(1), 109–115.

Gallant, A. R., and H. White (1988): A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. B. Blackwell.

Gastwirth, J. L. (1972): “The Estimation of the Lorenz Curve and Gini Index,” The Review of Economics and Statistics, 54(3), 306–316.
Gentry, M., and T. Li (2014): “Identification in auctions with selective entry,” Econometrica, 82(1), 315–344.

Gilstein, C. Z., and E. E. Leamer (1983): “Robust Sets of Regression Estimates,” Econometrica, 51(2), 321–333.

Giné, E., M. G. Hahn, and J. Zinn (1983): “Limit theorems for random sets: An application of probability in Banach space results,” in Probability in Banach Spaces IV, ed. by A. Beck and K. Jacobs, pp. 112–135. Springer, Berlin, Heidelberg.

Gini, C. (1921): “Sull’interpolazione di una retta quando i valori della variabile indipendente sono affetti da errori accidentali,” Metroeconomica, 1(3), 63–82.

Giustinelli, P., C. F. Manski, and F. Molinari (2019a): “Precise or Imprecise Probabilities? Evidence from survey response on dementia and long-term care,” NBER Working Paper 26125, available at .

——— (2019b): “Tail and Center Rounding of Probabilistic Expectations in the Health and Retirement Study,” available at http://faculty.wcas.northwestern.edu/~cfm754/gmm_rounding.pdf.

Gourieroux, C., J. J. Laffont, and A. Monfort (1980): “Coherency Conditions in Simultaneous Linear Equation Models with Endogenous Switching Regimes,” Econometrica, 48, 675–695.

Graham, B. S. (2015): “Methods of Identification in Social Networks,” Annual Review of Economics, 7(1), 465–485.

——— (2019): “The Econometric Analysis of Networks,” in Handbook of Econometrics. Elsevier.

Grant, M., and S. Boyd (2010): “CVX: Matlab Software for Disciplined Convex Programming, Version 1.21,” available at http://cvxr.com/cvx.

Granziera, E., H. R. Moon, and F. Schorfheide (2018): “Inference for VARs identified with sign restrictions,” Quantitative Economics, 9(3), 1087–1121.

Grieco, P. L. E. (2014): “Discrete games with flexible information structures: an application to local grocery markets,” The RAND Journal of Economics, 45(2), 303–340.

Gualdani, C. (2019): “An Econometric Model of Network Formation with an Application to Board Interlocks Between Firms,” available at http://docs.wixstatic.com/ugd/063589_b751c9f9c4e34d51b4da7ed7e007080a.pdf.

Guggenberger, P., J. Hahn, and K. Kim (2008): “Specification testing under moment inequalities,” Economics Letters, 99(2), 375–378.

Gundersen, C., B. Kreider, and J. Pepper (2012): “The impact of the National School Lunch Program on child health: A nonparametric bounds analysis,” Journal of Econometrics, 166(1), 79–91.

Haavelmo, T. (1944): “The Probability Approach in Econometrics,” Econometrica, 12 (Supplement, July), iii–vi+1–115.

Haile, P. A., and E. Tamer (2003): “Inference with an Incomplete Model of English Auctions,” Journal of Political Economy, 111(1), 1–51.
Hall, A. R., and A. Inoue (2003): “The large sample behaviour of the generalized method of moments estimator in misspecified models,” Journal of Econometrics, 114(2), 361–394.

Hall, R. E. (1973): “On the statistical theory of unobserved components,” MIT Working Paper 117, available at https://dspace.mit.edu/bitstream/handle/1721.1/63972/onstatisticalthe00hall.pdf?sequence=1.

Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel (2011): Robust Statistics: The Approach Based on Influence Functions. Wiley.

Hansen, B. E., and S. Lee (2019): “Inference for iterated GMM under misspecification,” available at .

Hansen, L. P. (1982a): “Consumption, asset markets, and macroeconomic fluctuations: A comment,” Carnegie-Rochester Conference Series on Public Policy, 17, 239–250.

——— (1982b): “Large Sample Properties of Generalized Method of Moments Estimators,” Econometrica, 50(4), 1029–1054.

Hansen, L. P., J. Heaton, and E. G. J. Luttmer (1995): “Econometric Evaluation of Asset Pricing Models,” The Review of Financial Studies, 8(2), 237–274.

Hansen, L. P., and R. Jagannathan (1991): “Implications of Security Market Data for Models of Dynamic Economies,” Journal of Political Economy, 99(2), 225–262.

Harrison, J., and D. M. Kreps (1979): “Martingales and arbitrage in multiperiod securities markets,” Journal of Economic Theory, 20(3), 381–408.

Hausman, J. A., and W. K. Newey (2016): “Individual Heterogeneity and Average Welfare,” Econometrica, 84(3), 1225–1248.

Heckman, J. J. (1978): “Dummy Endogenous Variables in a Simultaneous Equation System,” Econometrica, 46(4), 931–959.

Heckman, J. J., J. Smith, and N. Clements (1997): “Making the Most Out of Programme Evaluations and Social Experiments: Accounting for Heterogeneity in Programme Impacts,” The Review of Economic Studies, 64(4), 487–535.

Heckman, J. J., and E. J. Vytlacil (1999): “Local Instrumental Variables and Latent Variable Models for Identifying and Bounding Treatment Effects,” Proceedings of the National Academy of Sciences of the United States of America, 96(8), 4730–4734.

——— (2001): “Instrumental variables, selection models, and tight bounds on the average treatment effect,” in Econometric Evaluation of Labour Market Policies, ed. by M. Lechner and F. Pfeiffer, pp. 1–15. Physica-Verlag HD, Heidelberg.

——— (2005): “Structural Equations, Treatment Effects, and Econometric Policy Evaluation,” Econometrica, 73(3), 669–738.

——— (2007a): “Chapter 70 – Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation,” in Handbook of Econometrics, ed. by J. J. Heckman and E. E. Leamer, vol. 6, pp. 4779–4874. Elsevier.

——— (2007b): “Chapter 71 – Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast their Effects in New Environments,” in Handbook of Econometrics, ed. by J. J. Heckman and E. E. Leamer, vol. 6, pp. 4875–5143. Elsevier.

Hellmann, T. (2013): “On the existence and uniqueness of pairwise stable networks,” International Journal of Game Theory, 42(1), 211–237.
Henry, M., R. Méango, and M. Queyranne (2015): “Combinatorial approach to inference in partially identified incomplete structural models,” Quantitative Economics, 6(2), 499–529.

Henry, M., and A. Onatski (2012): “Set coverage and robust policy,” Economics Letters, 115(2), 256–257.

Hirano, K., and J. R. Porter (2019): “Statistical Decision Rules in Econometrics,” in Handbook of Econometrics. Elsevier.

Ho, K. (2009): “Insurer-Provider Networks in the Medical Care Market,” The American Economic Review, 99(1), 393–430.

Ho, K., J. Ho, and J. H. Mortimer (2012): “The Use of Full-Line Forcing Contracts in the Video Rental Industry,” The American Economic Review, 102(2), 686–719.

Ho, K., and A. Pakes (2014): “Hospital Choices, Hospital Prices, and Financial Incentives to Physicians,” The American Economic Review, 104(12), 3841–3884.

Ho, K., and A. M. Rosen (2017): “Partial Identification in Applied Research: Benefits and Challenges,” in Advances in Economics and Econometrics: Eleventh World Congress, ed. by B. Honoré, A. Pakes, M. Piazzesi, and L. Samuelson, vol. 1 of Econometric Society Monographs, pp. 307–359. Cambridge University Press.

Hoderlein, S., and J. Stoye (2014): “Revealed Preferences in a Heterogeneous Population,” Review of Economics and Statistics, 96(2), 197–213.

——— (2015): “Testing stochastic rationality and predicting stochastic demand: the case of two goods,” Economic Theory Bulletin, 3(2), 313–328.

Hoeffding, W. (1940): “Masstabinvariante Korrelationstheorie,” Schriften des Mathematischen Instituts und des Instituts für Angewandte Mathematik der Universität Berlin, 5(3), 179–233.

Holmes, T. J. (2011): “The diffusion of Wal-Mart and economies of density,” Econometrica, 79(1), 253–302.

Hong, H., and E. Tamer (2003a): “Endogenous binary choice model with median restrictions,” Economics Letters, 80(2), 219–225.

——— (2003b): “Inference in Censored Models with Endogenous Regressors,” Econometrica, 71(3), 905–932.

Honoré, B. E., and
A. Lleras-Muney (2006): “Bounds in Competing Risks Models and the War on Cancer,” Econometrica, 74(6), 1675–1698.

Honoré, B. E., and E. Tamer (2006): “Bounds on Parameters in Panel Dynamic Discrete Choice Models,” Econometrica, 74(3), 611–629.

Horowitz, J. L., and C. F. Manski (1995): “Identification and Robustness with Contaminated and Corrupted Data,” Econometrica, 63(2), 281–302.

——— (1998): “Censoring of outcomes and regressors due to survey nonresponse: Identification and estimation using weights and imputations,” Journal of Econometrics, 84(1), 37–58.

——— (2000): “Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data,” Journal of the American Statistical Association, 95(449), 77–84.

Horowitz, J. L., C. F. Manski, M. Ponomareva, and J. Stoye (2003): “Computation of Bounds on Population Parameters When the Data Are Incomplete,” Reliable Computing, 9(6), 419–440.

Hotz, V. J., C. H. Mullin, and S. G. Sanders (1997): “Bounding Causal Effects Using Data From a Contaminated Natural Experiment: Analysing the Effects of Teenage Childbearing,” The Review of Economic Studies, 64(4), 575–603.

Houthakker, H. S. (1950): “Revealed Preference and the Utility Function,” Economica, 17(66), 159–174.

Howard, J. A. (1963): Consumer behavior: application of theory. McGraw-Hill, New York.

Huber, P. J. (1964): “Robust Estimation of a Location Parameter,” The Annals of Mathematical Statistics, 35(1), 73–101.

——— (2004): Robust Statistics, Wiley Series in Probability and Statistics – Applied Probability and Statistics Section. Wiley.

Iaryczower, M., X. Shi, and M. Shum (2018): “Can Words Get in the Way? The Effect of Deliberation in Collective Decision Making,” Journal of Political Economy, 126(2), 688–734.

Imbens, G. W. (2003): “Sensitivity to Exogeneity Assumptions in Program Evaluation,” American Economic Review, 93(2), 126–132.

Imbens, G. W., and J. D. Angrist (1994): “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 62(2), 467–475.

Imbens, G. W., and C. F. Manski (2004): “Confidence Intervals for Partially Identified Parameters,” Econometrica, 72(6), 1845–1857.

Imbens, G. W., and W. K. Newey (2009): “Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity,” Econometrica, 77(5), 1481–1512.
Imbens, G. W., and D. B. Rubin (2015): Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.

Imbens, G. W., and J. M. Wooldridge (2009): “Recent Developments in the Econometrics of Program Evaluation,” Journal of Economic Literature, 47(1), 5–86.

Jackson, M. O., and A. Wolinsky (1996): “A Strategic Model of Social and Economic Networks,” Journal of Economic Theory, 71(1), 44–74.

Jia, P. (2008): “What Happens When Wal-Mart Comes to Town: An Empirical Analysis of the Discount Retailing Industry,” Econometrica, 76(6), 1263–1316.

Jones, D. R. (2001): “A Taxonomy of Global Optimization Methods Based on Response Surfaces,” Journal of Global Optimization, 21(4), 345–383.

Jones, D. R., M. Schonlau, and W. J. Welch (1998): “Efficient Global Optimization of Expensive Black-Box Functions,” Journal of Global Optimization, 13(4), 455–492.

Jovanovic, B. (1989): “Observable Implications of Models with Multiple Equilibria,” Econometrica, 57(6), 1431–1437.

Juster, F. T., and R. Suzman (1995): “An Overview of the Health and Retirement Study,” Journal of Human Resources, 30 (Supplement), S7–S56.

Kaido, H. (2016): “A dual approach to inference for partially identified econometric models,” Journal of Econometrics, 192(1), 269–290.

Kaido, H., F. Molinari, and J. Stoye (2019a): “Confidence Intervals for Projections of Partially Identified Parameters,” Econometrica, 87(4), 1397–1432.

——— (2019b): “Constraint Qualifications in Partial Identification,” working paper, available at https://arxiv.org/pdf/1908.09103.pdf.

Kaido, H., F. Molinari, J. Stoye, and M. Thirkettle (2017): “Calibrated Projection in MATLAB,” documentation available at https://arxiv.org/abs/1710.09707 and code available at https://github.com/MatthewThirkettle/calibrated-projection-MATLAB.

Kaido, H., and A. Santos (2014): “Asymptotically efficient estimation of models defined by convex moment inequalities,” Econometrica, 82(1), 387–413.

Kaido, H., and H. White (2013): “Estimating Misspecified Moment Inequality Models,” in Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis: Essays in Honor of Halbert L. White Jr., ed. by X. Chen and N. R. Swanson, pp. 331–361. Springer, New York, NY.

Kamat, V. (2018): “Identification with Latent Choice Sets,” available at https://arxiv.org/abs/1711.02048.

Kawai, K., and Y. Watanabe (2013): “Inferring Strategic Voting,” American Economic Review, 103(2), 624–62.

Khan, S., M. Ponomareva, and E. Tamer (2016): “Identification of panel data models with endogenous censoring,” Journal of Econometrics, 194(1), 57–75.
Khan, S., and
E. Tamer (2009): “Inference on endogenously censored regression modelsusing conditional moment inequalities,”
Journal of Econometrics , 152(2), 104 – 119.
Kilian, L., and
H. L¨utkepohl (2017):
Structural Vector Autoregressive Analysis , Themesin Modern Econometrics. Cambridge University Press.
Kitagawa, T. (2009): “Identification region of the potential outcome distributions underinstrument independence,” CeMMAP working paper CWP30/09, available at . Kitagawa, T., and
R. Giacomini (2018): “Robust Bayesian inference for set-identifiedmodels,” CeMMAP working paper CWP61/18, available at . Kitamura, Y., and
J. Stoye (2018): “Nonparametric Analysis of Random Utility Models,”
Econometrica , 86(6), 1883–1909.(2019): “Nonparametric Counterfactuals in Random Utility Models,” available at https://arxiv.org/abs/1902.08350 . Klepper, S., and
E. E. Leamer (1984): “Consistent Sets of Estimates for Regressionswith Errors in All Variables,”
Econometrica , 52(1), 163–183.
Kline, B., and
E. Tamer (2012): “Bounds for best response functions in binary games,”
Journal of Econometrics , 166(1), 92 – 105.(2016): “Bayesian inference in a class of partially identified models,”
QuantitativeEconomics , 7(2), 329–366.
Kline, P., and
M. Tartari (2016): “Bounding the Labor Supply Responses to a Random-ized Welfare Experiment: A Revealed Preference Approach,”
American Economic Review ,106(4), 972–1014. 135 olmogorov, A. N. (1950):
Foundations of the Theory of Probability . Chelsea, New York.
Komarova, T. (2013): “Partial identification in asymmetric auctions in the absence ofindependence,”
The Econometrics Journal , 16(1), S60–S92.
Koopmans, T. C., and
O. Reiersol (1950): “The Identification of Structural Character-istics,”
The Annals of Mathematical Statistics , 21(2), 165–181.
Kreider, B., and
J. Pepper (2008): “Inferring disability status from corrupt data,”
Jour-nal of Applied Econometrics , 23(3), 329–349.
Kreider, B., and
J. V. Pepper (2007): “Disability and Employment: Reevaluating theEvidence in Light of Reporting Errors,”
Journal of the American Statistical Association ,102(478), 432–441.
Kreider, B., J. V. Pepper, C. Gundersen, and
D. Jolliffe (2012): “Identifyingthe Effects of SNAP (Food Stamps) on Child Health Outcomes When Participation IsEndogenous and Misreported,”
Journal of the American Statistical Association , 107(499),958–975.
Kreps, D. M. (1981): “Arbitrage and equilibrium in economies with infinitely many com-modities,”
Journal of Mathematical Economics , 8(1), 15 – 35.
Leamer, E. E. (1981): "Is it a Demand Curve, Or Is It A Supply Curve? Partial Identification through Inequality Constraints," The Review of Economics and Statistics, 63(3), 319–327.

(1985): "Sensitivity Analyses Would Help," The American Economic Review, 75(3), 308–313.

(1987): "Errors in Variables in Linear Systems," Econometrica, 55(4), 893–909.

Lee, D. S. (2009): "Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects," The Review of Economic Studies, 76(3), 1071–1102.

Lee, R. S. (2013): "Vertical Integration and Exclusivity in Platform and Two-Sided Markets," The American Economic Review, 103(7), 2960–3000.

Lee, S., K. Song, and Y.-J. Whang (2013): "Testing functional inequalities," Journal of Econometrics, 172(1), 14–32.

Lee, Y.-Y., and D. Bhattacharya (2019): "Applied welfare analysis for discrete choice with interval-data on income," Journal of Econometrics, 211(2), 361–387.

Lewbel, A. (2000): "Semiparametric qualitative response model estimation with unknown heteroscedasticity or instrumental variables," Journal of Econometrics, 97(1), 145–177.

(2018): "The Identification Zoo - Meanings of Identification in Econometrics," Journal of Economic Literature, forthcoming.

Liao, Y., and A. Simoni (2019): "Bayesian inference for partially identified smooth convex models," Journal of Econometrics, 211(2), 338–360.

Ljungqvist, L., and T. Sargent (2004): Recursive Macroeconomic Theory, vol. 1. The MIT Press, 2 edn.

Luce, R. D., and P. Suppes (1965): "Chapter 19: Preference, Utility, and Subjective Probability," in Handbook of Mathematical Psychology, vol. 3, pp. 249–410.

Luttmer, E. G. J. (1996): "Asset Pricing in Economies with Frictions," Econometrica, 64(6), 1439–1467.

Machado, C., A. M. Shaikh, and E. J. Vytlacil (2018): "Instrumental Variables and the Sign of the Average Treatment Effect," Journal of Econometrics, forthcoming.

Maddala, G. S. (1983): Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press, New York.

Magnac, T., and E. Maurin (2008): "Partial Identification in Monotone Binary Models: Discrete Regressors and Interval Data," The Review of Economic Studies, 75(3), 835–864.

Magnolfi, L., and C. Roncoroni (2017): "Estimation of Discrete Games with Weak Assumptions on Information," available at http://lorenzomagnolfi.com/estimdiscretegames.

Makarov, G. D. (1981): "Estimates for the distribution function of a sum of two random variables when the marginal distributions are fixed," Theory of Probability and its Applications, 26(4), 803–806.
Manski, C. F. (1975): "Maximum score estimation of the stochastic utility model of choice," Journal of Econometrics, 3(3), 205–228.

(1977): "The structure of random utility models," Theory and Decision, 8(3), 229–254.

(1985): "Semiparametric analysis of discrete response: Asymptotic properties of the maximum score estimator," Journal of Econometrics, 27(3), 313–333.

(1988a): Analog Estimation Methods in Econometrics. Chapman and Hall.

(1988b): "Identification of Binary Response Models," Journal of the American Statistical Association, 83(403), 729–738.

(1989): "Anatomy of the Selection Problem," The Journal of Human Resources, 24(3), 343–360.

(1990): "Nonparametric Bounds on Treatment Effects," The American Economic Review Papers and Proceedings, 80(2), 319–323.

(1994): "The selection problem," in Advances in Econometrics: Sixth World Congress, ed. by C. A. Sims, vol. 1 of Econometric Society Monographs, pp. 143–170. Cambridge University Press.

(1995): Identification Problems in the Social Sciences. Harvard University Press.

(1997a): "The Mixing Problem in Programme Evaluation," The Review of Economic Studies, 64(4), 537–553.

(1997b): "Monotone Treatment Response," Econometrica, 65(6), 1311–1334.

(2003): Partial Identification of Probability Distributions, Springer Series in Statistics. Springer.

(2005): Social Choice with Partial Knowledge of Treatment Response. Princeton University Press.

(2007a): Identification for Prediction and Decision. Harvard University Press.

(2007b): "Partial Identification of Counterfactual Choice Probabilities," International Economic Review, 48(4), 1393–1410.

(2010): "Random Utility Models with Bounded Ambiguity," in Structural Econometrics, ed. by B. Dutta, pp. 272–284. Oxford University Press, 1 edn.

(2013a): "Identification of treatment response with social interactions," The Econometrics Journal, 16(1), S1–S23.

(2013b): Public Policy in an Uncertain World: Analysis and Decisions. Harvard University Press.

(2014): "Identification of income-leisure preferences and evaluation of income tax policy," Quantitative Economics, 5(1), 145–174.
Manski, C. F., and F. Molinari (2010): "Rounding Probabilistic Expectations in Surveys," Journal of Business and Economic Statistics, 28(2), 219–231.

Manski, C. F., and J. V. Pepper (2000): "Monotone Instrumental Variables: With an Application to the Returns to Schooling," Econometrica, 68(4), 997–1010.

(2009): "More on monotone instrumental variables," The Econometrics Journal, 12(s1), S200–S216.

(2018): "How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions," The Review of Economics and Statistics, 100(2), 232–244.

Manski, C. F., and E. Tamer (2002): "Inference on Regressions with Interval Data on a Regressor or Outcome," Econometrica, 70(2), 519–546.

Manzini, P., and M. Mariotti (2014): "Stochastic Choice and Consideration Sets," Econometrica, 82(3), 1153–1176.

Markowitz, H. (1952): "Portfolio selection," Journal of Finance, 7, 77–91.

Marschak, J. (1960): "Binary Choice Constraints on Random Utility Indicators," in Stanford Symposium on Mathematical Methods in the Social Sciences, ed. by K. Arrow. Stanford University Press.

Marschak, J., and W. H. Andrews (1944): "Random Simultaneous Equations and the Theory of Production," Econometrica, 12(3/4), 143–205.

Masatlioglu, Y., D. Nakajima, and E. Y. Ozbay (2012): "Revealed Attention," American Economic Review, 102(5), 2183–2205.

Masten, M. A., and A. Poirier (2018): "Salvaging Falsified Instrumental Variable Models," available at https://arxiv.org/abs/1812.11598.

Matheron, G. (1975): Random Sets and Integral Geometry. Wiley, New York.

Matzkin, R. L. (1993): "Nonparametric identification and estimation of polychotomous choice models," Journal of Econometrics, 58(1), 137–168.

(2007): "Chapter 73 – Nonparametric identification," in Handbook of Econometrics, ed. by J. J. Heckman, and E. E. Leamer, vol. 6, chap. 73, pp. 5307–5368. Elsevier.

(2013): "Nonparametric Identification in Structural Economic Models," Annual Review of Economics, 5(1), 457–486.

McCarthy, I., D. L. Millimet, and M. Roy (2015): "Bounding treatment effects: A command for the partial identification of the average treatment effect with endogenous and misreported treatment assignment," Stata Journal, 15(2), 411–436.
McFadden, D. L. (1974): "Conditional Logit Analysis of Qualitative Choice Behavior," in Frontiers in Econometrics, ed. by P. Zarembka. Academic Press.

(1975): "Tchebyscheff bounds for the space of agent characteristics," Journal of Mathematical Economics, 2(2), 225–242.

(2005): "Revealed Stochastic Preference: A Synthesis," Economic Theory, 26(2), 245–264.

McFadden, D. L., and M. K. Richter (1991): "Stochastic rationality and revealed stochastic preference," in Preferences, Uncertainty and Rationality, ed. by J. S. Chipman, D. L. McFadden, and M. K. Richter, pp. 161–186. Westview Press.

Menzel, K. (2014): "Consistent estimation with many moment inequalities," Journal of Econometrics, 182(2), 329–350.

Milgrom, P. R., and R. J. Weber (1982): "A Theory of Auctions and Competitive Bidding," Econometrica, 50(5), 1089–1122.

Miyauchi, Y. (2016): "Structural estimation of pairwise stable networks with nonnegative externality," Journal of Econometrics, 195(2), 224–235.

Mogstad, M., A. Santos, and A. Torgovitsky (2018): "Using Instrumental Variables for Inference About Policy Relevant Treatment Parameters," Econometrica, 86(5), 1589–1619.

Mogstad, M., and A. Torgovitsky (2018): "Identification and Extrapolation of Causal Effects with Instrumental Variables," Annual Review of Economics, 10(1), 577–613.
Molchanov, I. (1998): "A limit theorem for solutions of inequalities," Scandinavian Journal of Statistics, 25, 235–242.

Molchanov, I. (2017): Theory of Random Sets. Springer, London, 2 edn.

Molchanov, I., and F. Molinari (2014): "Applications of Random Set Theory in Econometrics," Annual Review of Economics, 6(1), 229–251.

(2018): Random Sets in Econometrics. Econometric Society Monograph Series, Cambridge University Press, Cambridge UK.

Molinari, F. (2008): "Partial identification of probability distributions with misclassified data," Journal of Econometrics, 144(1), 81–117.

(2010): "Missing Treatments," Journal of Business and Economic Statistics, 28(1), 82–95.

Molinari, F., and M. Peski (2006): "Generalization of a Result on "Regressions, short and long"," Econometric Theory, 22(1), 159–163.

Molinari, F., and A. M. Rosen (2008): "The Identification Power of Equilibrium in Games: The Supermodular Case (Comment on Aradillas-Lopez and Tamer, 2008)," Journal of Business and Economic Statistics, 26(3), 297–302.

Moon, H. R., and F. Schorfheide (2012): "Bayesian and frequentist inference in partially identified models," Econometrica, 80(2), 755–782.

Mourifié, I., M. Henry, and R. Méango (2018): "Sharp Bounds and Testability of a Roy Model of STEM Major Choices," available at https://ssrn.com/abstract=2043117.

Mullahy, J. (2018): "Individual results may vary: Inequality-probability bounds for some health-outcome treatment effects," Journal of Health Economics, 61, 151–162.

Neyman, J. S. (1923): "On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9.," Roczniki Nauk Rolniczych, X, 1–51; reprinted in Statistical Science, 5(4), 465–472, translated and edited by D. M. Dabrowska and T. P. Speed from the Polish original.

Norberg, T. (1992): "On the existence of ordered couplings of random sets – with applications," Israel Journal of Mathematics, 77, 241–264.

Norets, A., and X. Tang (2014): "Semiparametric Inference in Dynamic Binary Choice Models," The Review of Economic Studies, 81(3), 1229–1262.

Okner, B. (1972): "Constructing A New Data Base From Existing Microdata Sets: The 1966 Merge File," Annals of Economic and Social Measurement, 1(3), 325–362.

Pacini, D. (2017): "Two-sample least squares projection," Econometric Reviews, 38(1), 95–123.
Pakes, A. (2010): "Alternative models for moment inequalities," Econometrica, 78(6), 1783–1822.

Pakes, A., and J. Porter (2016): "Moment Inequalities for Multinomial Choice with Fixed Effects," Working Paper 21893, National Bureau of Economic Research.

Pakes, A., J. Porter, K. Ho, and J. Ishii (2015): "Moment Inequalities and Their Application," Econometrica, 83(1), 315–334.

de Paula, A. (2013): "Econometric Analysis of Games with Multiple Equilibria," Annual Review of Economics, 5(1), 107–131.

(2017): "Econometrics of Network Models," in Advances in Economics and Econometrics: Eleventh World Congress, ed. by B. Honoré, A. Pakes, M. Piazzesi, and L. Samuelson, vol. 1 of Econometric Society Monographs, pp. 268–323. Cambridge University Press.

de Paula, A., S. Richards-Shubik, and E. Tamer (2018): "Identifying Preferences in Networks With Bounded Degree," Econometrica, 86(1), 263–288.

de Paula, A., and X. Tang (2012): "Inference of Signs of Interaction Effects in Simultaneous Games With Incomplete Information," Econometrica, 80(1), 143–172.

Peterson, A. V. (1976): "Bounds for a Joint Distribution Function with Fixed Sub-Distribution Functions: Application to Competing Risks," Proceedings of the National Academy of Sciences of the United States of America, 73(1), 11–13.

Petrin, A., and K. Train (2010): "A Control Function Approach to Endogeneity in Consumer Choice Models," Journal of Marketing Research, 47(1), 3–13.

Phillips, P. C. B. (1989): "Partially Identified Econometric Models," Econometric Theory, 5(2), 181–240.

Piketty, T. (2005): "Top Income Shares in the Long Run: An Overview," Journal of the European Economic Association, 3, 382–392.

Ponomareva, M., and E. Tamer (2011): "Misspecification in moment inequality models: back to moment equalities?," The Econometrics Journal, 14(2), 186–203.

Redner, R. (1981): "Note on the Consistency of the Maximum Likelihood Estimate for Nonidentifiable Distributions," The Annals of Statistics, 9(1), 225–228.

Reiersol, O. (1941): "Confluence Analysis by Means of Lag Moments and Other Methods of Confluence Analysis," Econometrica, 9(1), 1–24.

Richter, M. K. (1966): "Revealed Preference Theory," Econometrica, 34(3), 635–645.

Ridder, G., and R. Moffitt (2007): "Chapter 75 – The Econometrics of Data Combination," in Handbook of Econometrics, ed. by J. J. Heckman, and E. E. Leamer, vol. 6, pp. 5469–5547. Elsevier.
Rockafellar, R. (1970): Convex Analysis, Princeton landmarks in mathematics and physics. Princeton University Press.

Romano, J. P., and A. M. Shaikh (2008): "Inference for identifiable parameters in partially identified econometric models," Journal of Statistical Planning and Inference, 138(9), 2786–2807.

(2010): "Inference for the Identified Set in Partially Identified Econometric Models," Econometrica, 78(1), 169–211.

Romano, J. P., A. M. Shaikh, and M. Wolf (2014): "A practical two-step method for testing moment inequalities," Econometrica, 82(5), 1979–2002.

Rosen, A. M. (2008): "Confidence sets for partially identified parameters that satisfy a finite number of moment inequalities," Journal of Econometrics, 146(1), 107–117.

(2012): "Set identification via quantile restrictions in short panels," Journal of Econometrics, 166(1), 127–137.

Rosenbaum, P. R. (1995): Observational Studies. Springer.

Rosenbaum, P. R., and D. B. Rubin (1983): "Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome," Journal of the Royal Statistical Society. Series B (Methodological), 45(2), 212–218.

Ross, S. A. (1976): "The arbitrage theory of capital asset pricing," Journal of Economic Theory, 13(3), 341–360.

Rubin, D. B. (1978): "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6(1), 34–58.

Rüschendorf, L. (1982): "Random Variables with Maximum Sums," Advances in Applied Probability, 14(3), 623–632.

Rust, J. (1987): "Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher," Econometrica, 55(5), 999–1033.

Samuelson, P. A. (1938): "A Note on the Pure Theory of Consumer's Behaviour," Economica, 5(17), 61–71.

Santner, T. J., B. J. Williams, and W. I. Notz (2013): The design and analysis of computer experiments. Springer Science & Business Media.

Santos, A. (2012): "Inference in nonparametric instrumental variables with partial identification," Econometrica, 80(1), 213–275.

Schennach, S. M. (2019): "Mismeasured and unobserved variables," in Handbook of Econometrics. Elsevier.

Schmidt, P. (1981): "Constraints on the Parameters in Simultaneous Tobit and Probit Models," in Structural Analysis of Discrete Data and Econometric Applications, ed. by C. F. Manski, and D. McFadden, chap. 12, pp. 422–434. MIT Press.

Schneider, R. (1993): Convex Bodies: The Brunn-Minkowski Theory, Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1 edn.

Schonlau, M., W. J. Welch, and D. R. Jones (1998): "Global versus Local Search in Constrained Optimization of Computer Models," Lecture Notes-Monograph Series, 34, 11–25.

Shaikh, A. M., and E. J. Vytlacil (2011): "Partial identification in triangular systems of equations with binary dependent variables," Econometrica, 79(3), 949–955.
Sheng, S. (2018): "A structural econometric analysis of network formation games through subnetworks," Econometrica, accepted for publication.

Shiller, R. J. (1982): "Consumption, asset markets and macroeconomic fluctuations," Carnegie-Rochester Conference Series on Public Policy, 17, 203–238.

(2003): "From Efficient Markets Theory to Behavioral Finance," Journal of Economic Perspectives, 17(1), 83–104.

Shorack, G. R., and J. A. Wellner (2009): Empirical Processes with Applications to Statistics. Society for Industrial and Applied Mathematics.

Simon, H. A. (1959): "Theories of Decision-Making in Economics and Behavioral Science," The American Economic Review, 49(3), 253–283.

Sims, C. A. (1972): "Comments and Rejoinder On Okner (1972)," Annals of Economic and Social Measurement, 1(3), 343–345 and 355–357.

Stoye, J. (2007): "Bounds on Generalized Linear Predictors with Incomplete Outcome Data," Reliable Computing, 13(3), 293–302.

(2009): "More on Confidence Intervals for Partially Identified Parameters," Econometrica, 77(4), 1299–1315.

(2010): "Partial identification of spread parameters," Quantitative Economics, 1(2), 323–357.

Syrgkanis, V., E. Tamer, and J. Ziani (2018): "Inference on auctions with weak assumptions on information," available at https://arxiv.org/abs/1710.03830.

Tamer, E. (2003): "Incomplete Simultaneous Discrete Response Model with Multiple Equilibria," The Review of Economic Studies, 70(1), 147–165.

(2010): "Partial Identification in Econometrics," Annual Review of Economics, 2, 167–195.

Tang, X. (2011): "Bounds on revenue distributions in counterfactual auctions with reserve prices," The RAND Journal of Economics, 42(1), 175–203.

Tauchmann, H. (2014): "Lee (2009) treatment-effect bounds for nonrandom sample selection," Stata Journal, 14(4), 884–894.

Tebaldi, P., A. Torgovitsky, and H. Yang (2019): "Nonparametric Estimates of Demand in the California Health Insurance Exchange," NBER Working Paper No. 25827.

Torgovitsky, A. (2019a): "Nonparametric Inference on State Dependence in Unemployment," Econometrica, forthcoming.

(2019b): "Partial identification by extending subdistributions," Quantitative Economics, 10(1), 105–144.

Tversky, A. (1972): "Elimination by aspects: A theory of choice," Psychological Review, 79(4), 281.

Uhlig, H. (2005): "What are the effects of monetary policy on output? Results from an agnostic identification procedure," Journal of Monetary Economics, 52(2), 381–419.

van der Vaart, A. (1997): "Superefficiency," in Festschrift for Lucien Le Cam, ed. by D. Pollard, E. Torgersen, and G. L. Yang, chap. 27, pp. 397–410. Springer.

Varian, H. R. (1982): "The Nonparametric Approach to Demand Analysis," Econometrica, 50(4), 945–973.

Wasserman, L. (2000): "Bayesian Model Selection and Model Averaging," Journal of Mathematical Psychology, 44(1), 92–107.

White, H. (1982): "Maximum Likelihood Estimation of Misspecified Models," Econometrica, 50(1), 1–25.

Wollmann, T. G. (2018): "Trucks without Bailouts: Equilibrium Product Characteristics for Commercial Vehicles," American Economic Review, 108(6), 1364–1406.

Yang, Z. (2006): "Correlated equilibrium and the estimation of static discrete games with complete information," available at https://ideas.repec.org/p/pra/mprapa/79395.html.

Yildiz, N. (2012): "Consistency of plug-in estimators of upper contour and level sets,"