Partial Identification in Nonseparable Binary Response Models with Endogenous Regressors
Jiaying Gu ∗ University of Toronto
Thomas M. Russell † Carleton University
January 6, 2021
Abstract
This paper considers (partial) identification of a variety of parameters, including counterfactual choice probabilities, in a general class of binary response models with possibly endogenous regressors. Importantly, our framework allows for nonseparable index functions with multi-dimensional latent variables, and does not require parametric distributional assumptions. We demonstrate how various functional form, independence, and monotonicity assumptions can be imposed as constraints in our optimization procedure to tighten the identified set, and we show how these assumptions have meaningful interpretations in terms of restrictions on latent types. In the special case when the index function is linear in the latent variables, we leverage results in computational geometry to provide a tractable means of constructing the sharp set of constraints for our optimization problems. Finally, we apply our method to study the effects of health insurance on the decision to seek medical treatment.
Keywords: Binary Choice, Counterfactual Choice Probabilities, Endogeneity, Hyperplane Arrangement, Linear Programming, Partial Identification
We are grateful to Marc Henry, Roger Koenker, and the seminar audiences at Columbia University and Michigan State University for helpful feedback. We also thank Martin Weidner and the organizers of the Chamberlain Seminar, and are grateful to Florian Gunsilius, Sukjin Han, Wayne Gao, and Takuya Ura for their questions and feedback, and to Adam Rosen for his thoughtful discussion. Jiaying Gu acknowledges financial support from the Social Sciences and Humanities Research Council of Canada. All errors are our own.

∗ Jiaying Gu, Assistant Professor, Department of Economics, University of Toronto, 150 St. George Street, Toronto, Ontario, M5S 3G7, Canada. Email: [email protected].

† Thomas M. Russell, Assistant Professor, Department of Economics, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, K1S 5B6, Canada. Email: [email protected].

1 Introduction
This paper considers partial identification of a variety of parameters in a general class of binary response models. Our main focus throughout is on counterfactual choice probabilities, as well as parameters that can be written as linear combinations of counterfactual choice probabilities. However, our framework is also applicable to parameters outside of this class. Our approach allows for flexible functional form assumptions, endogenous regressors, and the inclusion of multi-dimensional and nonseparable latent variables. Furthermore, our approach does not require any parametric distributional assumptions.

In the settings closest to the one we consider, nonparametric point identification of the distribution of latent variables occurs only under restrictive conditions, often including strong independence assumptions and large support conditions (e.g. Ichimura and Thompson (1998)). Control function approaches are often used to address the issue of endogenous regressors, but if endogenous regressors are discrete or the mechanism generating the endogenous regressors is poorly understood, then many of these approaches are not applicable. Partial identification arises as a natural alternative to methods for point identification as a result of possible endogeneity, discrete instruments, and limited variation in the covariates. However, flexible and implementable methods in partial identification for binary response models remain underdeveloped. This paper seeks to address this gap.

We begin with a theoretical analysis of the binary response model that builds on the work connecting random set theory to partial identification. In particular, we characterize observational equivalence in terms of selections from a random set. We then define our binary response model of interest, and show how to construct a sequence of definitions of various identified sets arising from the notion of selectionability from the random set defined by our model.
Finally, we define the set of counterfactuals of interest, and show how the identified sets of various structural features of the binary response model—including the structural parameters and the distribution of latent variables—are related to the identified set of counterfactual choice probabilities.

In general, we show that constructing the identified set for counterfactual choice probabilities involves an infinite-dimensional existence problem. Intuitively, this problem arises since, for each proposed counterfactual choice probability, we must verify the existence of a distribution of the latent variables that rationalizes the observed choice probabilities through our binary choice model. However, one of our main theoretical results shows that this infinite-dimensional existence problem can be reduced to an equivalent finite-dimensional existence problem when the observed random variables are discrete. This paves the way to our formulation of the bounds on counterfactual choice probabilities in terms of optimization problems.

Our insights also reveal the importance of a special partition of the latent variable space into types that have identical responses in all possible counterfactual states. Consistent with the previous literature, we call these latent types response types (definitions are provided in Appendix A.2; see Torgovitsky (2019) for similar terminology). One of our important contributions is to show that additional assumptions can be interpreted as the elimination of response types, which amounts to assigning zero probability to regions of the latent variable space corresponding to particular profiles of counterfactual responses. Furthermore, we show that certain independence assumptions imposed on a vector of latent variables are observationally equivalent to imposing independence on response types directly. This connection helps to facilitate interpretation of these assumptions in the class of models we consider.
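The finite-dimensional reduction can be sketched numerically. In the simplest illustrative case—a single binary endogenous regressor X and no instrument—the latent space collapses to the four response types t = (t_0, t_1) ∈ {0,1}^2, and bounds on a counterfactual probability such as P(Y_1 = 1) solve small linear programs over the joint masses of (type, X). The observed distribution below and the brute-force basic-solution enumeration are purely illustrative assumptions, not the paper's algorithm:

```python
from itertools import combinations, product

# Response types for a binary regressor X in {0, 1}: t = (t0, t1), where t_x is
# the choice the type would make if X were exogenously set to x.
TYPES = list(product([0, 1], repeat=2))

# Illustrative observed joint distribution P(Y = y, X = x); X may be endogenous,
# so we only constrain the joint masses q[(t, x)] = P(type = t, X = x).
P_OBS = {(1, 1): 0.30, (0, 1): 0.20, (1, 0): 0.25, (0, 0): 0.25}
VARS = [(t, x) for t in TYPES for x in (0, 1)]

# One equality constraint per observed cell (y, x): the types consistent with the
# cell (those with t[x] == y) must carry exactly the observed mass.
CELLS = list(P_OBS)
A = [[1.0 if (x == cx and t[cx] == cy) else 0.0 for (t, x) in VARS] for (cy, cx) in CELLS]
b = [P_OBS[cell] for cell in CELLS]

# Objective: counterfactual probability P(Y_1 = 1) = total mass of types with t[1] == 1.
c = [1.0 if t[1] == 1 else 0.0 for (t, x) in VARS]

def solve_square(M, rhs):
    """Gauss-Jordan solve for a small square system; returns None if singular."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lp_bounds(A, b, c):
    """Min/max of c'q over {q >= 0 : Aq = b} by enumerating basic feasible
    solutions; the feasible set is a bounded polytope, so extrema sit at vertices."""
    m, n = len(A), len(A[0])
    lo, hi = float("inf"), float("-inf")
    for basis in combinations(range(n), m):
        sol = solve_square([[A[i][j] for j in basis] for i in range(m)], b)
        if sol is None or any(s < -1e-9 for s in sol):
            continue
        val = sum(c[j] * s for j, s in zip(basis, sol))
        lo, hi = min(lo, val), max(hi, val)
    return lo, hi

lo, hi = lp_bounds(A, b, c)
print(f"bounds on P(Y_1 = 1): [{lo:.2f}, {hi:.2f}]")  # -> [0.30, 0.80]
```

With these numbers the program recovers the familiar Manski-type bounds: the X = 1 cells pin down a contribution of 0.30, while the X = 0 mass of 0.50 can be allocated freely across types.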
We are not the first to emphasize the importance of response types, and our discussion echoes the insights of Heckman and Pinto (2018) and others.

We show how these additional functional form, independence, and monotonicity assumptions can be introduced as constraints in our optimization-based bounding procedure. We thoroughly study the special case when the index function is linear in latent variables. We show linearity in this sense can be used to impose constraints on the distribution of latent variables, and we demonstrate how these constraints can be incorporated into our optimization problems to tighten the identified set. To construct the relevant set of constraints implied by the functional form restrictions, we make connections to the literature on computational geometry and utilize the hyperplane arrangement algorithm of Gu and Koenker (2020). Furthermore, when the index function is also linear in parameters, we show that—unlike many other existing procedures in partial identification—exact (i.e. not approximate) sharp bounds on counterfactual choice probabilities can be computed without the need to grid over the entire parameter space.

Finally, we apply our method to study the effects of private health insurance on the decision to seek medical treatment. Consistent with the existing literature, we treat private health insurance status as an endogenous variable, and we consider the decision to seek medical treatment as our binary outcome variable of interest. We then consider the average treatment effect of obtaining private health insurance on the decision to visit a doctor. We find that the sign of the average treatment effect is typically only identified under our strongest assumptions. However, even our strongest assumptions are much weaker than the assumptions typically maintained in the empirical literature. Interestingly, we also find non-trivial bounds on the average treatment effect even when the structural parameters are unidentified.
Overall, the strength of the conclusions from our application is proportional to the strength of the assumptions the researcher is willing to maintain.
Binary response models with possibly endogenous regressors have been studied extensively, and previous work on the subject can be separated into two broad categories: work that focuses on conditions required for point identification, and work that allows for partial identification.

From the point identification perspective, typical approaches include (i) the use of linear probability models, (ii) maximum likelihood estimation (e.g. the bivariate probit), and (iii) control function approaches. All of these approaches have well-documented limitations. In particular, linear probability models are commonly justified as approximations to the underlying conditional expectation function for the binary dependent variable, but are known to deliver very misleading results when the conditional expectation function is highly nonlinear. Methods that use maximum likelihood—such as the bivariate probit model—enjoy efficiency gains relative to other approaches when the model is correctly specified, but require strong a priori knowledge of the mechanism generating the endogenous variables, as well as knowledge of the joint distribution of the latent variables up to some finite parameter vector. Finally, control function approaches (e.g. Blundell and Smith (1989), Blundell and Powell (2004), and Imbens and Newey (2009), among many others) relax (to some extent) the assumptions required on the latent variables, but are generally restricted to cases with continuous endogenous variables and also still require a correctly specified model for the endogenous variables in nonlinear models. Unlike the control function approach, the special regressor approach of Lewbel (2000) (see also Lewbel et al. (2012) and Dong and Lewbel (2015)) does not require the correct specification of a model for endogenous variables, but instead requires the existence of an observed continuously distributed regressor with large support that satisfies certain conditional independence assumptions.
Such a special regressor is not always readily available.

Beyond these approaches, a number of papers have also considered nonparametric identification. Nonparametric identification was studied in binary choice and threshold crossing models by Matzkin (1992), and in more general nonseparable models by Matzkin (2003) and Chernozhukov and Hansen (2005), among others. Vytlacil and Yildiz (2007) study nonparametric identification of the average treatment effect in a discrete triangular system with a binary endogenous variable under a weak separability assumption in the outcome equation. Important precedents to the work presented here from the literature on point identification in random coefficient models include Ichimura and Thompson (1998), Gautier and Kitamura (2013), and Gu and Koenker (2020). However, all of these papers focus almost exclusively on the point-identified case with a linear index function and exogenous covariates with large support.

In contrast, the literature on partial identification attempts to relax the assumptions required for point identification. In a relevant series of papers, Chesher et al. (2013) and Chesher and Rosen (2014) show how to use random set theory to characterize the identified set of structures in discrete choice models. A general formulation of their approach is presented in Chesher and Rosen (2017). Similar to the current paper, these papers do not provide a model for the endogenous covariates, rendering the discrete choice model incomplete. Chesher et al. (2013) and Chesher and Rosen (2014) then use a characterization of the sharp set of constraints given by a result due to Artstein (1983) in random set theory, which we colloquially refer to as
Artstein's inequalities. (See also Norberg (1992) and Molchanov (2017), Corollary 1.4.11.) However, without additional simplification, the sharp set of constraints implied by Artstein's inequalities can be prohibitively large. Our work extends the work by Chesher et al. (2013) and Chesher and Rosen (2014) by deriving a simplified set of constraints that contains the same identifying content as the constraints in their work. We then focus on obtaining sharp bounds on counterfactual conditional choice probabilities, and show how this can be accomplished by solving a sequence of optimization problems. A detailed comparison of our approach with the approach of Chesher et al. (2013) and Chesher and Rosen (2014) is provided in Appendix C.

(A review of approaches typically used by practitioners to address the problem of endogenous regressors in models with binary outcomes is provided in Lewbel et al. (2012), who focus on the case of a threshold-crossing model with a linear index function and additively separable errors. Lewbel et al. (2012) also construct an interesting treatment effect example with a binary outcome variable where the treatment effect is positive for everyone, but the ATE under a linear probability model is negative.)

In another relevant precedent to our work, Torgovitsky (2019) demonstrates how to construct bounds on lower-dimensional functionals of the model parameters and the latent variable distribution in a class of models with discrete outcomes and regressors. In particular, he demonstrates the conditions under which the distribution of the latent variables can be restricted to a finite set when bounding various functionals, and considers a binary response model with additive separability as his motivating example. Many of the points made in Torgovitsky (2019) will also be revisited in the current paper, and in cases when the index function is additively separable in the latent variables, our approaches will be very similar.
However, the results in Torgovitsky (2019) rely heavily on the requirement that the functional of interest, and all constraints defining the identified set, can be written in terms of the distribution function of the vector of latent variables. (The distribution function of a vector of random variables (U_1, U_2, ..., U_k) ∈ R^k is the function F : R^k → [0, 1] defined by F(u) = P(U_1 ≤ u_1, U_2 ≤ u_2, ..., U_k ≤ u_k).) This requirement is easily satisfied in the additively separable case, but in models where the index function is not additively separable, this will generally not be possible. In contrast, our framework is able to accommodate a variety of flexible assumptions on the index function, including the case when the index function is nonseparable or weakly separable. (In particular, see Lemma C.1 in Appendix C. This latter point also differentiates our work from Chiong et al. (2017) and Allen and Rehbeck (2019).)

There are a number of other relevant papers in the literature on partial identification in discrete choice models. In an important paper, Manski (2007) also considers counterfactual choice probabilities in a setting with partial identification, and shows how these counterfactual choice probabilities can be bounded using optimization problems. However, the general approach used in this paper is very different. Furthermore, we focus substantially on demonstrating how to practically incorporate a flexible set of assumptions on the latent index function, and we allow for endogenous explanatory variables. In another related and recent working paper, Tebaldi et al. (2019) study the problem of computing various counterfactual quantities in a nonparametric discrete choice model with an application to consumer choice of health insurance in California. However, they focus specifically on the case where consumers have quasi-linear utility functions (equal to their valuation of the insurance option minus the premium) and use the particular structure of their setting to construct bounds on these counterfactual quantities. (See Galichon and Henry (2011) for an early discussion of incompleteness in the context of empirical entry games, and Russell (2019) for a discussion in the context of estimating treatment effects.)

Closely related to the problem of bounding counterfactual choice probabilities is the problem of bounding parameters in the literature on treatment effects with binary outcome variables. Analytic bounds in triangular systems of equations with binary dependent variables under various assumptions are considered by Chiburis (2010), Shaikh and Vytlacil (2011), and Mourifié (2015). An optimization-based approach to bounding treatment effect parameters is presented in Russell (2019) in the discrete case, and Gunsilius (2020) in the continuous case. We will attempt to make a connection to the literature on treatment effects throughout the paper when appropriate.

This paper also makes a connection to the literature on computational geometry. In the case of a linear index function, computation of our bounds requires the analysis of a partition of the latent space determined by finitely many hyperplanes. This turns out to be a well-studied subject in combinatorial geometry called hyperplane arrangement, and leads us to consideration of the enumeration algorithm proposed by Gu and Koenker (2020).
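To fix ideas, consider a toy random-coefficient specification in which the index is θ_1 + θ_2·x, so each support point x_j of the covariate induces the hyperplane {θ : θ_1 + θ_2·x_j = 0} in the latent space; the cells of the resulting arrangement correspond to the response types. The sketch below uses naive Monte Carlo sampling, not the exact enumeration algorithm of Gu and Koenker (2020), and the support points are made up for illustration:

```python
import random

# Toy linear index phi(x, theta) = theta_1 + theta_2 * x: each support point x_j
# of the covariate induces the hyperplane {theta : theta_1 + theta_2 * x_j = 0}.
SUPPORT = [-1.0, 0.0, 1.0, 2.0]  # illustrative covariate support

def sign_vector(theta):
    """Response profile of a latent draw: 1{theta_1 + theta_2 * x_j >= 0} across
    support points. Points in the same cell of the arrangement share this vector."""
    t1, t2 = theta
    return tuple(int(t1 + t2 * x >= 0) for x in SUPPORT)

# Naive Monte Carlo "enumeration": sample latent draws and collect the distinct
# sign vectors; this finds every cell of non-negligible volume, unlike an exact
# combinatorial algorithm, which also certifies that no cell is missed.
random.seed(0)
cells = {sign_vector((random.uniform(-5, 5), random.uniform(-5, 5)))
         for _ in range(100_000)}
print(f"{len(cells)} nonempty cells out of {2 ** len(SUPPORT)} conceivable sign vectors")
```

Here the four hyperplanes pass through the origin of R^2, so the arrangement has 2 × 4 = 8 cells: only 8 of the 2^4 = 16 conceivable response profiles are actually feasible, which is exactly the kind of restriction the arrangement delivers.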
The remainder of the paper proceeds as follows. Section 2 introduces the main theoretical framework and main assumptions. Section 3 studies practical implementation of the theoretical framework from Section 2 and introduces our optimization-based bounding procedure for counterfactual choice probabilities. Section 4 then demonstrates how to introduce functional form, independence, and monotonicity assumptions into our bounding procedure, and Section 5 applies our methodology to study the impact of health insurance on utilization of health care services. Section 6 concludes. All proofs can be found in Appendix A. Appendix B provides some additional discussion of the results presented in the main text, and a comparison of our approach to the approach based on Artstein's inequalities from Chesher et al. (2013) and Chesher and Rosen (2014) is presented in Appendix C.
Notation:
The following notation is relevant for both the main text and the appendices. Given a subset X of Euclidean space, we use B(X) to denote the Borel σ-algebra on X. For two measurable spaces (X, B(X)) and (X′, B(X′)), the product σ-algebra on X × X′ is denoted by B(X) ⊗ B(X′). Random variables are denoted using capital letters, and if X : (Ω, A) → (X, B(X)) is a random variable defined on the probability space (Ω, A, P), then we use P_X to denote the probability measure induced on X by X; that is, for any A ∈ B(X), P_X(A) := P(X^{-1}(A)). Furthermore, we interpret P_{X|X′}(X ∈ A | X′ = x′) as a regular conditional probability, and P_{X|X′} is used as shorthand for the collection P_{X|X′} := {P_{X|X′}( · | X′ = x′) : x′ ∈ X′}. We do not explicitly differentiate between scalars and vectors, or random variables and random vectors. To keep the notation clean, we will omit the transpose when combining column vectors; that is, if v_1 and v_2 are two column vectors, rather than write v = (v_1^⊤, v_2^⊤)^⊤ we instead write v = (v_1, v_2), where it is understood that v is a column vector unless otherwise specified.

(In particular, Tebaldi et al. (2019) consider a multinomial choice model with preferences over insurance options given by the difference between the consumer's latent valuation and the consumer's premium for each option. Endogeneity arises because of possible dependence between valuations and premiums. However, in their setting (subsidized) premiums are deterministic functions of the coverage area, age, and income. The authors then discretize age and income, and assume that a valuation distribution is fixed within a given coverage area and discretized age and income bin; the remaining variation in premiums within each coverage area and discretized age and income bin is then considered to be exogenous.)
In this section we begin the theoretical analysis by introducing some key assumptions and definitions. We will first introduce our main assumptions on the binary response model, and connect our assumptions to the definition of the identified set of (conditional) latent variable distributions. We then discuss the set of counterfactual parameters of interest in this paper, and show how the definition of the identified set of latent variable distributions is related to the identified set of counterfactual conditional choice probabilities. We will use the results in this section when we introduce our practical method of bounding counterfactual choice probabilities in the next section.
We start by introducing our main assumptions on the binary response environment under consideration.
Assumption 2.1.
There exists a complete probability space (Ω, A, P), a random variable Y : Ω → {0, 1}, and random vectors X : Ω → X ⊆ R^{d_x}, Z : Ω → Z ⊆ R^{d_z} and θ : Ω → Θ ⊆ R^{d_θ}, with Θ compact, satisfying:

Y = 1{ϕ(X, Z, θ, β) ≥ 0} a.s., (2.1)

for some function ϕ( · , β) : X × Z × Θ → R parameterized by β ∈ B ⊆ R^{d_β} such that ϕ(x, z, · , β) is continuous for each (x, z, β) and ϕ( · , · , θ, β) is measurable for each (θ, β). Furthermore, |X| = m_x < ∞ and |Z| = m_z < ∞, and the spaces X, Z and Θ are equipped with the Borel σ-algebra.

In Assumption 2.1, θ ∈ Θ is a vector of latent variables, β ∈ B is a vector of fixed coefficients, and X ∈ X ⊂ R^{d_x} and Z ∈ Z ⊂ R^{d_z} are vectors of covariates. The finite-dimensional vector of latent variables has a natural interpretation in the binary response model as unobserved types. The model in (2.1) allows for general nonseparability between covariates and latent variables, and thus allows for type-specific marginal effects. The latent variables can also be interpreted as random coefficients, in which case there is no restriction on which covariates are assigned fixed versus random coefficients by the index function ϕ. Furthermore, we do not impose any parametric or continuity assumptions on the distribution of latent variables. Finally, compactness of Θ is a technical requirement used to verify the measurability of certain random sets that will appear shortly. Otherwise compactness does not play a significant role, and it is ignored throughout much of the discussion in the main text.

The assumptions on the index function ϕ imply that it is a Caratheodory function, which is important to establish certain measurability results (see Appendix A.2 and the discussion below). The exact form of the index function may or may not be known to the researcher. For now there is no distinction between X and Z, and either may be dependent with the latent vector θ.
However, later in the paper we will distinguish X from Z by introducing independence assumptions between Z and the vector of latent variables θ. The variables X and Z can be seen as consisting of utility-relevant attributes of the set of alternatives and the set of individual decision makers. We focus throughout the paper on the case when the joint support X × Z is finite with m := m_x · m_z points of support, although m may be taken to be very large. Throughout the paper we will switch freely between indexing the points in X × Z either by {(x_1, z_1), ..., (x_1, z_{m_z}), (x_2, z_1), ..., (x_{m_x}, z_{m_z})} or by {(x_1, z_1), (x_2, z_2), ..., (x_m, z_m)}, depending on which method is more convenient for our purpose. Finally, it is important to note that, because of finiteness of the support X × Z, it is possible to construct a model satisfying Assumption 2.1 that can rationalize any observed joint distribution of Y, X and Z. In this sense, the model in (2.1) has not yet imposed any significant structure.

We assume that the researcher's objective throughout is to obtain a sharp set of constraints defining the identified set of latent variable distributions, and to use this characterization to bound various counterfactual quantities such as counterfactual conditional choice probabilities, or functionals of the distribution of latent variables. A general characterization of the identified set of latent variable distributions is provided in Chesher et al. (2013) and Chesher and Rosen (2014) using Artstein's inequalities from random set theory. A comparison of our work with the approach based on Artstein's inequalities is provided in Appendix C. While our approach does not explicitly make use of Artstein's inequalities, similar to Chesher et al. (2013), Chesher and Rosen (2014) and Chesher and Rosen (2017), we take the selectionability relation as a primitive relation on which to construct a definition of the identified set. The close connection between the selection relation from random set theory and the concept of observational equivalence from the work in econometrics on identification has been appreciated in Beresteanu and Molinari (2008), Beresteanu et al. (2011), Beresteanu et al. (2012), Chesher et al. (2013), Chesher and Rosen (2014), and Chesher and Rosen (2017), among many others.
We will continue this work here. In particular, we will define the set:

G^{-1}(y, x, z, β) := {θ : y = 1{ϕ(x, z, θ, β) ≥ 0}}. (2.2)

Intuitively, (2.2) delivers all possible values of the latent variables θ consistent with the vector (y, x, z, β) given the binary response model in (2.1). A measurable selection from the random set G^{-1}(Y, X, Z, β) is a random vector θ : Ω → Θ satisfying θ ∈ G^{-1}(Y, X, Z, β) a.s. A general definition of a selection and a random set is provided in Appendix A.2. (If (S, Σ) is a measurable space and A and B are topological spaces, then we call f : S × A → B a Caratheodory function if for each a ∈ A we have that f( · , a) is a measurable function, and if for each s ∈ S we have that f(s, · ) is continuous. See Definition 4.50 in Aliprantis and Border (2006).) Importantly, given a distribution of the observable random vectors (Y, X, Z), a structural function ϕ and a fixed coefficient β ∈ B, any two measurable selections θ and θ′ from the random set G^{-1}(Y, X, Z, β) will be observationally equivalent in the sense that both latent variable vectors θ and θ′ could have generated the observed distribution of Y, X and Z through the model (2.1). (See Manski (1977) for a discussion.) Framed in this manner, constructing the identified set of latent variable distributions then becomes a problem of verifying whether a given random vector θ : Ω → Θ is a measurable selection from the random set in (2.2), and then collecting the distributions of all such selections.

We are now prepared to present our definition of the joint identified set for the (conditional) latent variable distribution and coefficients β.

Definition 2.1 (Identified Set). Under Assumption 2.1, the (joint) identified set I*_{Y,X,Z} of conditional latent variable distributions P_{θ|Y,X,Z} and fixed coefficients β is the set of all pairs (P_{θ|Y,X,Z}, β) satisfying:

P_{θ|Y,X,Z}(θ ∈ G^{-1}(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, (y, x, z)-a.s.
(2.3)

As promised, this definition relies heavily on the idea of selectionability from the random set G^{-1}(Y, X, Z, β), and permits only the (conditional) distributions of selections from G^{-1}(Y, X, Z, β) to belong to the identified set. Note that this definition of the identified set also implicitly depends on the distribution of (Y, X, Z) through the almost-sure relation in (2.3); any value of (y, x, z) assigned zero probability by the observed distribution does not impose any restrictions on the distribution of θ. Importantly, the definition conditions on the value of the endogenous outcome variable Y. This conditioning will be carried throughout the paper, and we will see later on that it allows us to bound some interesting, albeit less-typical, counterfactual parameters that may be relevant to policy analysis.

We use Definition 2.1 as a primitive starting point with the goal of providing a definition of the identified set for counterfactual choice probabilities, as well as definitions of the identified set for other objects of interest. For example, using this definition we can also provide definitions of the identified set for various projections of the joint identified set. Under Assumption 2.1, the identified set of fixed coefficients B* is given by:

B* := {β : ∃ P_{θ|Y,X,Z} s.t. (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}}.

Furthermore, under Assumption 2.1, the identified set of conditional latent variable distributions P*_{θ|Y,X,Z} is given by:

P*_{θ|Y,X,Z} := {P_{θ|Y,X,Z} : ∃ β s.t. (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}}.

Conditioning on the value of the observed endogenous outcome variable Y may be new to many readers; however, the identified set P*_{θ|X,Z} for conditional latent variable distributions P_{θ|X,Z} can be constructed via integration of distributions P_{θ|Y,X,Z} ∈ P*_{θ|Y,X,Z} with respect to the observed conditional choice probabilities. For the sake of comparison with the previous literature, in Appendix B.1 we show how the identified sets of (conditional) latent variable distributions are related to conditional choice probabilities in our setup.

2.2 Bounding Counterfactual Quantities
In the previous subsection, we discussed the relevant set of constraints defining the identified set of conditional latent variable distributions (conditional on (Y, X, Z), or conditional on (X, Z), as in Appendix B.1). In this subsection we present definitions and results for the identified set of counterfactual conditional choice probabilities conditional on Y, X and Z. Throughout the remainder of the paper we focus most of our attention on bounding counterfactual choice probabilities, although our framework is immediately applicable to any parameter that can be written as a linear function of counterfactual choice probabilities; for example, the average treatment effect, and the average structural function with various levels of conditioning.

In this paper, we will limit ourselves to the class of so-called interventionist counterfactuals. Interventionist counterfactuals take as primitive an existing set of structural equations relating causes and effects, and each equation in the system of structural equations is autonomous in the sense that it remains unaltered under external manipulations of its inputs. In our case, the relevant structural equation is given by the binary response model in (2.1), where the random vectors X and Z might be interpreted as relevant causes of the binary random variable Y. However, differing from the typical structural equations environment, we allow for X to be endogenous without explicitly providing a structural equation for it.

In this setup, an interventionist counterfactual is represented by a process that exogenously manipulates the values of X and Z. For exogenous random variables—that is, those whose values are determined outside of the model—we simply replace the random variable by its value under consideration in the counterfactual. For endogenous random variables—that is, those whose values are determined by a function of the other exogenous and endogenous variables within a model—the function determining the value of the endogenous variable is deleted from the system, and the endogenous variable is replaced by its value under consideration in the counterfactual.
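This procedure can be sketched in a few lines. The linear index, the parameter values, and the example interventions below are all made-up illustrations (none come from the paper); the point is only that an intervention re-evaluates the autonomous outcome equation at the manipulated values while holding the latent type fixed and discarding whatever mechanism generated X:

```python
def phi(x, z, theta, beta):
    """Illustrative index with random intercept and slope, theta = (t1, t2)."""
    t1, t2 = theta
    return t1 + t2 * x + beta * z

def counterfactual_choice(gamma, x, z, theta, beta=0.1):
    """Y_gamma: evaluate the outcome equation at gamma(x, z), holding theta fixed.
    No model for X is needed; the mechanism generating X is simply deleted."""
    cx, cz = gamma(x, z)
    return int(phi(cx, cz, theta, beta) >= 0)

# Example interventions: exogenously set X to 1, or shift Z by one unit.
set_x_to_one = lambda x, z: (1, z)
shift_z      = lambda x, z: (x, z + 1)

theta = (-0.2, 1.0)                                      # one latent type
print(counterfactual_choice(set_x_to_one, 0, 0, theta))  # 1{-0.2 + 1.0 >= 0} -> 1
print(counterfactual_choice(shift_z, 0, 0, theta))       # 1{-0.2 + 0.1 >= 0} -> 0
```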
Such counterfactuals have a natural interpretation as "hypothetical experiments," and are widely attributed to Haavelmo (1943, 1944). We now introduce the following assumption on the counterfactual domain, which summarizes the discussion above.
Assumption 2.2 (Counterfactual Domain) . For a collection of functions Γ with typical element γ : X ×Z → X × Z , there exists a collection of random variables { Y ( · , γ ) : Ω → { , } | γ ∈ Γ } , abbreviated as Y γ := Y ( · , γ ) , representing counterfactual choices for each γ such that Y γ : Ω → { , } is measurable foreach γ , and: P Y γ | Y,X,Z,θ ( Y γ = { ϕ ( γ ( X, Z ) , θ, β ) ≥ } | Y = y, X = x, Z = z, θ ) = 1 , (2.4) P Y,X,Z,θ − a.s. for the same θ ∈ Θ and β ∈ B as in Assumption 2.1, and for all γ ∈ Γ . Assumption 2.2 is needed in order to assign a counterfactual interpretation to many of the results that Heckman and Pinto (2015) attribute the notion of autonomous equations to Frisch (1938). It is natural to imagine that a structural equation exists that determines X as a function of its (potential) causes, but thatwe remain ignorant as to its exact form. We refer the reader to Pearl (2009) Section 7.1 for a discussion of a similar procedure. γ ∈ Γ thatrepresent counterfactual responses or choices and that are defined on the same probability space as therandom vector (
Y, X, Z, θ). It then explicitly links these counterfactual random variables to the binary response model from Assumption 2.1 through condition (2.4). Assumption 2.2 implies that (i) counterfactual response variables exist on the common probability space, and (ii) such counterfactual response variables are equal (almost surely) to the values that would arise after an intervention on the system represented by (2.1). It also implicitly encodes an invariance assumption; other than changes in (x, z) induced by γ : X × Z → X × Z, all other aspects of the environment are held constant.

Note that Assumption 2.2 imposes that each counterfactual be represented by a function γ : X × Z → X × Z belonging to some collection Γ. Each γ ∈ Γ can be interpreted as an assignment to a state or a treatment. Taking γ as a function allows us to consider a general class of counterfactuals that allows the counterfactual under consideration to depend on the observed values of X and Z. Although each function γ is seen as a map from X × Z to itself, this does not prevent consideration of counterfactuals where γ selects values of (x, z) that have never been observed in the data. Such cases can be accommodated by simply extending the support X × Z from Assumption 2.1 to include the counterfactual pair (x, z) of interest. This approach does not affect anything we have presented thus far (or anything we will present), since we always require any relation to the observed distribution of (
Y, X, Z) to hold only almost-surely. This means that our framework can be used to study the impact of historical interventions, as well as forecast the impacts of interventions in environments never before experienced.

Remark 2.1.
The random variables Y_γ can be related to potential outcomes from the literature on treatment effects. Interpreting these variables as potential outcomes helps to clarify the invariance assumption made in Assumption 2.2. For example, suppose we have only a binary variable X ∈ {0, 1} and latent variables θ (i.e. no variables Z and no fixed coefficients β). Then the structural function from (2.1) can be written as ϕ(X, θ). In this simple case we have γ(x) ∈ {0, 1}, and we can define the random variables Y_0 and Y_1 as:

Y_0(ω) := 1{ϕ(0, θ(ω)) ≥ 0},    Y_1(ω) := 1{ϕ(1, θ(ω)) ≥ 0}.

The observed choice for an individual indexed by ω ∈ Ω is then given by:

Y(ω) := Y_0(ω)(1 − X(ω)) + Y_1(ω)X(ω).

That is, the observed choice corresponds to Y_0(ω) if X(ω) = 0 and Y_1(ω) if X(ω) = 1, where Y_0(ω) and Y_1(ω) are potential outcomes. Extending the analogy beyond the case of binary X is straightforward, and it will sometimes be useful to refer to this connection to potential outcomes when discussing the interpretations of some of our parameters.

(In terms of practical interpretation, here we do not specify how such an intervention is to be carried out, and instead suppose that the random variables Y(ω, γ) are stable in the sense that they do not depend on any mechanism that may be generating the intervention γ. This assumption is very common, and we refer to Heckman and Vytlacil (2007) pp. 4790 - 4801 for a more detailed discussion of similar assumptions. For example, consider Theorem B.1: this result is completely unaffected by arbitrarily enlarging X × Z, since the conditions in the Theorem need only hold almost surely, and since any elements added to the initial support of X and Z will necessarily be assigned probability zero by the observed distribution of (Y, X, Z). See also the three policy evaluation problems of Heckman and Vytlacil (2007) pp. 4790 - 4792.)

Assumption 2.2 on the counterfactual domain leads directly to our definition of the identified set for counterfactual conditional choice probabilities.
Definition 2.2 (Identified Set of Counterfactual Conditional Choice Probabilities). Under Assumptions 2.1 and 2.2, the identified set of counterfactual conditional choice probabilities P*_{Y_γ | Y,X,Z,θ} is the set of all conditional distributions P_{Y_γ | Y,X,Z,θ} satisfying:

P_{Y_γ | Y,X,Z,θ}( Y_γ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ ) = 1,   (2.5)

(y, x, z, θ)-a.s. for some (P_{θ | Y,X,Z}, β) ∈ I*_{Y,X,Z}.

Note that this definition makes an explicit reference to the identified set I*_{Y,X,Z} presented in Definition 2.1, which in turn is derived from a selection relation. Intuitively, this definition says that a given conditional distribution for counterfactual choices belongs to the identified set if there exists a pair (P_{θ | Y,X,Z}, β) ∈ I*_{Y,X,Z} that can rationalize such a counterfactual distribution. As was the case with Definition 2.1, this definition of the identified set can be used as a starting point to define other related identified sets. For example, the identified set of counterfactual conditional choice probabilities P*_{Y_γ | Y,X,Z} is the set of all conditional distributions P_{Y_γ | Y,X,Z} satisfying:

P_{Y_γ | Y,X,Z}( Y_γ = y′ | Y = y, X = x, Z = z ) = ∫ P_{Y_γ | Y,X,Z,θ}( Y_γ = y′ | Y = y, X = x, Z = z, θ ) dP_{θ | Y,X,Z},   (2.6)

(y, x, z)-a.s. for some triple (P_{Y_γ | Y,X,Z,θ}, P_{θ | Y,X,Z}, β) satisfying (2.5). The conditional choice probability in (2.6) then allows us to answer questions of the form: “given the observed values of (
Y, X, Z) are (y, x, z), what would be the expected response if we were to set the values of (
X, Z) to be γ(x, z)?” As we mentioned earlier, conditioning on the value of Y may seem unfamiliar to most, but it allows us to answer a new set of policy questions that condition on the current observed response when evaluating the expected counterfactual response.

In addition, the identified set for the (conditional) average structural function P*_{Y_γ | X,Z} will be the set of all probabilities satisfying:

P_{Y_γ | X,Z}( Y_γ = y′ | X = x, Z = z ) = Σ_{y ∈ {0,1}} P_{Y_γ | Y,X,Z}( Y_γ = y′ | Y = y, X = x, Z = z ) P( Y = y | X = x, Z = z ),

(x, z)-a.s. for some P_{Y_γ | Y,X,Z} ∈ P*_{Y_γ | Y,X,Z}. Finally, when γ, γ′ ∈ Γ represent two competing policies, the identified set for the (conditional) average treatment effect from moving from policy γ to policy γ′ is given by the set of all values:

P_{Y_γ′ | X,Z}( Y_γ′ = y′ | X = x, Z = z ) − P_{Y_γ | X,Z}( Y_γ = y′ | X = x, Z = z ),   (2.7)

where both terms are average structural functions. We will show how to construct sharp bounds on all of these objects using our framework.

(This conforms with a counterfactual in the three-level hierarchy of action, prediction and counterfactuals presented in Pearl (2009). See sections 1.4 and 7.2.)

Using Definition 2.1, we now present a result that provides an intuitive but important link between counterfactual choice probabilities and the conditional distribution of latent variables.

Theorem 2.1.
Suppose that Assumptions 2.1 and 2.2 hold. Then a distribution of counterfactual conditional choice probabilities P_{Y_γ | Y,X,Z} satisfies P_{Y_γ | Y,X,Z} ∈ P*_{Y_γ | Y,X,Z} if and only if there exists a pair (P_{θ | Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x, Z = z ) = P_{θ | Y,X,Z}( ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z ),   (2.8)

(y, x, z)-a.s.

Theorem 2.1 provides the theoretical link between the identified set of counterfactual conditional choice probabilities, and the identified set for the pair (P_{θ | Y,X,Z}, β). In particular, a given counterfactual conditional choice probability is a member of the identified set if and only if there exists a conditional distribution P_{θ | Y,X,Z} and parameter β satisfying (2.3) that can also rationalize the counterfactual conditional choice probabilities through (2.8). We will see that the distribution of the observed random variables is informative about the distribution of latent variables, which in turn are informative about counterfactual choices.

While the result is theoretically straightforward, it hides some important practical difficulties that arise when constructing the identified set for counterfactual conditional choice probabilities. In particular, verifying the existence of a pair (P_{θ | Y,X,Z}, β) that satisfies the conditions from Definition 2.1 is a nontrivial task. This is at least partly due to the fact that P_{θ | Y,X,Z} is an infinite dimensional object, even in the case when both X and Z have finite support. This infinite dimensional existence problem is exacerbated in practice by the fact that P_{θ | Y,X,Z} must satisfy a number of constraints to ensure it is consistent with the binary response model through (2.3), and to ensure it is a proper conditional probability measure. We consider these practical difficulties in detail in the next section.
In order to bound counterfactual probabilities using Theorem 2.1, we must verify the existence of a collection of Borel probability measures on Θ that are consistent with the binary response model through (2.3). However, solving this existence problem by explicitly constructing a probability measure on all Borel sets of Θ seems excessively difficult and naive. Instead, we would like to consider a finite collection of Borel sets that are both necessary and sufficient for this existence problem in the sense that, to solve the existence problem, it is both necessary and sufficient that we be able to construct a conditional probability measure on our finite collection of sets. To make progress towards our goal, let us define the following vector-valued function:

r(β, θ) := ( 1{ϕ(x_1, z_1, θ, β) ≥ 0}, 1{ϕ(x_1, z_2, θ, β) ≥ 0}, ..., 1{ϕ(x_1, z_{m_z}, θ, β) ≥ 0}, 1{ϕ(x_2, z_1, θ, β) ≥ 0}, 1{ϕ(x_2, z_2, θ, β) ≥ 0}, ..., 1{ϕ(x_{m_x}, z_{m_z}, θ, β) ≥ 0} )′.   (3.1)

Furthermore, for a fixed binary vector s ∈ {0, 1}^m let us define the set:

Θ(β, s) := { θ ∈ Θ : r(β, θ) = s }.   (3.2)

The sets from (3.2) partition the space Θ into at most L := 2^m sets, with each set being uniquely associated with a binary vector s ∈ {0, 1}^m. This comes from the fact that there are m points of support in X × Z (and so m rows in r(β, θ)) and each row of r(β, θ) can take values either 0 or 1. Similar objects to r(β, θ) have appeared previously in the literature (e.g. Balke and Pearl (1994), Heckman and Pinto (2018)), and so to remain consistent with the previous literature we call the functions r : B × Θ → {0, 1}^m defined in (3.1) response types. In the discrete choice setting, these response types tell us the choices that an individual with type indexed by (β, θ) would have made had they been assigned alternate pairs of (x, z).
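As a purely illustrative sketch, the response-type map in (3.1) can be computed directly once an index function and support points are specified. The index function, grid, and all names below are our own toy choices, not part of the paper's framework:

```python
# Toy sketch (our own choices, not the paper's): compute the response types
# r(beta, theta) from (3.1) for the linear index phi(x, theta) = theta0 + theta1*x
# with support x in {0, 1} (so m = 2) and no z or beta.
from itertools import product

support = [0, 1]  # the m = 2 points of support for x

def response_type(theta):
    """Vector of indicators 1{phi(x_j, theta) >= 0}, one entry per support point."""
    theta0, theta1 = theta
    return tuple(int(theta0 + theta1 * x >= 0) for x in support)

# Sweep a grid of latent types and record which binary vectors s in {0,1}^m
# are realized, i.e. which cells Theta(beta, s) of the partition are non-empty.
grid = list(product(range(-2, 3), repeat=2))
realized = sorted({response_type(th) for th in grid})

print(realized)  # [(0, 0), (0, 1), (1, 0), (1, 1)]: all 2^m = 4 types occur
```

For this flexible index every sign pattern is realized; later sections show how functional form restrictions make some of these cells empty.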
Any two individuals characterized by values of θ from the same set Θ(β, s) will make identical choices in every counterfactual, and so the values of θ define a natural equivalence class of latent types. We will see shortly that response types represent the basic building blocks of all of our counterfactual objects of interest. Indeed, they generate the coarsest partition of the unobservable space Θ needed to bound a variety of counterfactual quantities while still retaining all information from Assumptions 2.1 and 2.2.

(A similar problem is addressed in Torgovitsky (2019), although we note that his general framework is not immediately applicable here since we are dealing with probability measures rather than distribution functions. We find that for many of the models we consider, it is simply not possible to write the identified set and functional of interest in terms of the multi-dimensional distribution function for the latent variables. The collection of sets defining response types also appears to be similar to the “minimal relevant partition” (MRP) in Tebaldi et al. (2019), as well as the partition described in Chesher and Rosen (2014) Appendix B.)

After partitioning the space of latent variables using response types, various counterfactual objects of interest can be written as a disjoint union of the sets Θ(β, s) from (3.2) that comprise our partition. For the j-th point of support (x_j, z_j), define the set:

S_j := { s ∈ {0, 1}^m : s_j = 1 },   (3.3)

for j = 1, ..., m. Note that each set S_j is comprised of all binary vectors that have a j-th entry equal to 1, and thus contains exactly 2^{m−1} elements. Now note, by definition of the sets Θ(β, s) and S_j we have:

{ θ : ϕ(x_j, z_j, θ, β) ≥ 0 } = ∪_{s ∈ S_j} Θ(β, s).

Furthermore, for s′ ≠ s the definition of the sets Θ(β, s) from (3.2) ensures we have Θ(β, s′) ∩ Θ(β, s) = ∅, so that the union in the previous display is a disjoint union.
Thus, we have the following decomposition:

P_{θ | Y,X,Z}( ϕ(x_j, z_j, θ, β) ≥ 0 | Y = y, X = x, Z = z ) = Σ_{s ∈ S_j} P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x, Z = z ).

Such a decomposition holds for any j = 1, ..., m. When the conditioning values (x, z) differ from the values (x_j, z_j) in the structural function, an application of Theorem 2.1 shows that the left hand side of this display represents a counterfactual conditional choice probability, illustrating the connection between response types and counterfactual choices.

Remark 3.1.
Response types have a natural interpretation in the potential outcome framework as a collection of potential outcomes. To illustrate, let us return to the potential outcome interpretation introduced in Remark 2.1. In particular, suppose we have only a binary variable X ∈ {0, 1} and latent variables θ (i.e. no variables Z and no fixed coefficients β). Then the structural function from (2.1) can be written as ϕ(X, θ) and the binary response vector r(β, θ) can be written as r(θ). As in Remark 2.1, let us define the random variables:

Y_0(ω) := 1{ϕ(0, θ(ω)) ≥ 0},    Y_1(ω) := 1{ϕ(1, θ(ω)) ≥ 0}.

Now consider the four possible binary vectors s ∈ {0, 1}^2:

s_1 = (1, 1)′,  s_2 = (1, 0)′,  s_3 = (0, 1)′,  s_4 = (0, 0)′.

In this simple model, we can see that events of the form {ω : r(θ(ω)) = s} can be written as conjunctions of events involving the potential outcomes Y_0 and Y_1; in particular, we have:

{ω : r(θ(ω)) = s_1} = {ω : Y_0(ω) = 1, Y_1(ω) = 1},
{ω : r(θ(ω)) = s_2} = {ω : Y_0(ω) = 1, Y_1(ω) = 0},
{ω : r(θ(ω)) = s_3} = {ω : Y_0(ω) = 0, Y_1(ω) = 1},
{ω : r(θ(ω)) = s_4} = {ω : Y_0(ω) = 0, Y_1(ω) = 0}.

From here it is easy to see that, if probabilities can be assigned to the events above then—given that these events are disjoint—probabilities can also be assigned to any union of these events; the latter would include parameters like counterfactual choice probabilities, the average structural function or the average treatment effect. An individual with a given response type can thus be equivalently viewed as an individual with a fixed vector of potential outcomes.

(It is useful to note that the sets {S_j}_{j=1}^m are not disjoint; indeed, it is easy to show that S_j ∩ S_k ≠ ∅ and S_j ∩ S_k^c ≠ ∅ for every j ≠ k.)
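To make the correspondence concrete, here is a small numerical sketch with hypothetical type probabilities of our own choosing: once masses are assigned to the four disjoint type events, any union of them, such as the average structural function or the average treatment effect, is obtained by summation.

```python
# Toy sketch with made-up masses: each response type s = (Y0, Y1) is a fixed
# pair of potential outcomes, so linear functionals of the type probabilities
# give parameters like P(Y0 = 1), P(Y1 = 1), and the average treatment effect.
p = {(1, 1): 0.25, (1, 0): 0.25, (0, 1): 0.375, (0, 0): 0.125}  # hypothetical

p_y0 = sum(q for s, q in p.items() if s[0] == 1)  # P(Y0 = 1)
p_y1 = sum(q for s, q in p.items() if s[1] == 1)  # P(Y1 = 1)
ate = p_y1 - p_y0                                 # average treatment effect

print(p_y0, p_y1, ate)  # 0.5 0.625 0.125
```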
The following Theorem shows that, in order to rationalize a given counterfactual conditional choice probability under our assumptions using Theorem 2.1, for each fixed β it is both necessary and sufficient to construct a probability measure on sets of the form Θ(β, s) from (3.2) satisfying the constraints of Theorem 2.1. This result thus provides the much needed simplification from the infinite dimensional existence problem from the previous subsection to a more manageable finite dimensional existence problem. Since γ : X × Z → X × Z, it will be useful in the statement of the result to redefine γ as a map on indices, with γ(j) denoting the index of the point in {(x_1, z_1), ..., (x_m, z_m)} assigned under counterfactual γ.

Theorem 3.1.
Suppose Assumptions 2.1 and 2.2 hold. Fix some β ∈ B and consider the collection of sets:

A(β) := { Θ(β, s) : s ∈ {0, 1}^m }.   (3.4)

Then for any collection of counterfactual conditional choice probabilities P_{Y_γ | Y,X,Z}, there exists a collection of Borel conditional probability measures P_{θ | Y,X,Z} satisfying (2.8) with (P_{θ | Y,X,Z}, β) ∈ I*_{Y,X,Z} if and only if there exists a collection P_{θ | Y,X,Z} of probability measures on the sets in A(β) satisfying:

Σ_{s ∈ S_j} P_{θ | Y,X,Z}( Θ(β, s) | Y = 1, X = x_j, Z = z_j ) = 1,   (3.5)

Σ_{s ∈ S_j^c} P_{θ | Y,X,Z}( Θ(β, s) | Y = 0, X = x_j, Z = z_j ) = 1,   (3.6)

Σ_{s ∈ S_{γ(j)}} P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x_j, Z = z_j ) = P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ),   (3.7)

for all y ∈ {0, 1} and j ∈ {1, ..., m} assigned positive probability.

Theorem 3.1 reduces our infinite dimensional existence problem to a finite dimensional existence problem. Indeed, the constraints in (3.5) and (3.6) are linear constraints on a now finite dimensional probability vector with typical element P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x, Z = z ). This leads naturally to the optimization formulation of bounds on counterfactual choice probabilities considered in the next subsection. Note that this result relies crucially on the finiteness of X and Z. All of our counterfactual parameters of interest in this paper can be constructed from the basic counterfactual probability of the form:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x, Z = z ).
(3.8)

The result thus also implies that the constraints (3.5) - (3.7) are sufficient to bound any counterfactual parameter that can be written as a linear function of counterfactual probabilities of the form (3.8).

At first glance it may be surprising to note that the constraints in Theorem 3.1 depend on the observed distribution only through the condition that each constraint must hold for all values of (y, x, z) assigned positive probability by the observed distribution of (Y, X, Z). Beyond these conditions, the observed distribution plays no role in Theorem 3.1. While this may appear to be cause for alarm, we remind readers that Assumptions 2.1 and 2.2 impose minimal structure on the binary response model; here the function ϕ can be extremely flexible, and the variables X and Z can be endogenous. Theorem 3.1 shows that in this very flexible environment, the observed conditional choice probabilities do not provide any substantial information on counterfactual choice probabilities. In the next section we will use this idea to formulate an impossibility result that will be useful to motivate the need for additional assumptions in this binary response model. However, we will first take this basic environment as given and show how bounds on counterfactual choice probabilities can be formulated as optimization problems. Our result on the formulation in terms of optimization problems will then be built upon in later sections when we show how to incorporate further assumptions. Finally, although we focus on an optimization-based approach, we believe the partition of the latent variable space in terms of response types may also be useful for researchers interested in studying sufficient conditions for point identification, as in Heckman and Pinto (2018). We will not pursue this approach here.
The linear constraints defining the identified set of counterfactual conditional choice probabilities lead us naturally to consider bounding counterfactual choice probabilities by solving optimization problems. We will suppose throughout this subsection that our objective is to bound the counterfactual choice probability:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ),   (3.9)

for some j ∈ {1, ..., m}. However, all of the results in this section are immediately applicable to the case when we wish to bound some linear function of these counterfactual choice probabilities, including probabilities of the form:

P_{Y_γ | X,Z}( Y_γ = 1 | X = x_j, Z = z_j ).   (3.10)

Recall that Theorem 3.1 implies our counterfactual object of interest can be rewritten as:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ) = Σ_{s ∈ S_{γ(j)}} P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x_j, Z = z_j ),

where γ(j) is the index in {1, ..., m} assigned to j under counterfactual γ, and where the set S_{γ(j)} is given by S_{γ(j)} = { s ∈ {0, 1}^m : s_{γ(j)} = 1 } (the analog of S_j from (3.3)). To progress further, let us define the parameter:

ν(y, x, z, β, s) = P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x, Z = z ).

For the sake of notation it will also be useful to define the following parameter vectors:

ν(y, β, s) := ( ν(y, x_1, z_1, β, s), ν(y, x_1, z_2, β, s), ..., ν(y, x_1, z_{m_z}, β, s), ν(y, x_2, z_1, β, s), ν(y, x_2, z_2, β, s), ..., ν(y, x_{m_x}, z_{m_z}, β, s) )′,

ν(y, β) := ( ν(y, β, s_1)′, ν(y, β, s_2)′, ..., ν(y, β, s_L)′ )′,    ν(β) := ( ν(0, β)′, ν(1, β)′ )′.

The vector of parameters ν(β) represents the variable over which we will optimize in our result ahead. Now let d_ν = 2mL denote the dimension of the parameter vector ν(β) (recall that L := 2^m).
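As a bookkeeping sketch (the indexing conventions below are our own), the stacked vector ν(β) can be enumerated explicitly to confirm its dimension d_ν = 2mL:

```python
# Our own indexing sketch: nu(beta) stacks one entry nu(y, x_j, z_j, beta, s)
# for each y in {0, 1}, each response type s (L = 2^m of them), and each of the
# m support points; m = 2 here, mirroring the stacking order in the text.
from itertools import product

m = 2
types = list(product((0, 1), repeat=m))                  # L = 2^m response types
layout = [(y, s, j) for y in (0, 1) for s in types for j in range(m)]

assert len(layout) == 2 * m * len(types)                 # d_nu = 2 m L
print(len(layout))  # 16
```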
Without loss of generality, we will suppose that each (y, x, z) is assigned positive probability by the observed distribution. From conditions (3.5) and (3.6) in Theorem 3.1, we have the constraints:

Σ_{s ∈ S_j} ν(1, x_j, z_j, β, s) = 1,    Σ_{s ∈ S_j^c} ν(0, x_j, z_j, β, s) = 1,   (3.11)

for j = 1, ..., m. Finally, we require the nonnegativity and “adding-up” constraints:

ν(y, x_j, z_j, β, s) ∈ {0} if Θ(β, s) = ∅, and ν(y, x_j, z_j, β, s) ∈ [0, 1] otherwise,   (3.12)

for all y ∈ {0, 1} and j = 1, ..., m and s ∈ {0, 1}^m, and:

Σ_{s ∈ {0,1}^m} ν(y, x_j, z_j, β, s) = 1,   (3.13)

for all y ∈ {0, 1} and j = 1, ..., m. The constraints in (3.12) imply that positive probability can only be assigned to non-empty sets of the form Θ(β, s), and the constraints in (3.13) ensure that each conditional probability assigns probability 1 to the entire space Θ. We are now ready to state the main result for this section.
Theorem 3.2.
Under Assumptions 2.1 and 2.2, the identified set for the counterfactual conditional choice probability P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ) is given by:

∪_{β ∈ B} [ ν_ℓb(y, x_j, z_j, β), ν_ub(y, x_j, z_j, β) ],   (3.14)

where ν_ℓb(y, x_j, z_j, β) and ν_ub(y, x_j, z_j, β) are determined by the optimization problems:

ν_ℓb(y, x_j, z_j, β) := min_{ν(β) ∈ R^{d_ν}} Σ_{s ∈ S_{γ(j)}} ν(y, x_j, z_j, β, s), subject to (3.11), (3.12), and (3.13),   (3.15)

ν_ub(y, x_j, z_j, β) := max_{ν(β) ∈ R^{d_ν}} Σ_{s ∈ S_{γ(j)}} ν(y, x_j, z_j, β, s), subject to (3.11), (3.12), and (3.13).   (3.16)

(Note that constraints of the form (3.13) are not implied by constraints of the form (3.11).)

In one direction, Theorem 3.2 implies that any counterfactual conditional choice probability of the form (3.9) belonging to the identified set can be written as:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ) = Σ_{s ∈ S_{γ(j)}} ν(y, x_j, z_j, β, s),

for some β and some vector ν(β) satisfying the constraints (3.11), (3.12), and (3.13). In the opposite direction, the Theorem implies that if for some β the vector ν(β) satisfies these constraints then the conditional probability measure on Θ represented by ν(β) can be extended to a (not necessarily unique) Borel probability measure on all of B(Θ) that satisfies the conditions of Theorem 2.1. Again, there is nothing special about counterfactual choice probabilities here, and this result can be easily modified to bound any linear function of counterfactual choice probabilities by simply modifying the objective function in Theorem 3.2. We will make use of this fact in the application section.

After determining which of the sets Θ(β, s) are empty, all of the constraints in (3.15) and (3.16) can be written as linear equality/inequality constraints, so that the optimization problems in (3.15) and (3.16) are linear programming problems.
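For intuition, the LPs (3.15)-(3.16) can be solved by hand in the simplest unrestricted case. The sketch below is our own toy construction, not the paper's implementation: with binary X and no instruments, constraints (3.11) and (3.13) force all conditional mass onto types in S_j, so the feasible set is a simplex whose vertices are unit vectors, and the bounds can be obtained by enumerating those vertices rather than calling an LP solver.

```python
# Toy version of the bounding problems (3.15)-(3.16): binary X, m = 2, L = 4,
# conditioning on (Y = 1, X = x_j). Constraints (3.11) + (3.13) force
# nu(s) = 0 for s outside S_j, so the feasible set is the simplex on S_j and
# the LP optima sit at its vertices (the unit vectors e_s, s in S_j).
from itertools import product

types = list(product((0, 1), repeat=2))  # s in {0,1}^m with m = 2

def bounds(j, gamma_j):
    """Bounds on P(Y_gamma = 1 | Y = 1, X = x_j) when gamma maps j to gamma_j."""
    S_j = [s for s in types if s[j] == 1]   # types consistent with Y = 1 at x_j
    vals = [s[gamma_j] for s in S_j]        # objective evaluated at each vertex
    return min(vals), max(vals)

print(bounds(0, 1))  # (0, 1): uninformative bounds in the unrestricted model
print(bounds(0, 0))  # (1, 1): gamma(j) = j just reproduces the observed choice
```

In larger models, or once additional constraints are imposed, the feasible set is no longer a simple simplex and an off-the-shelf LP solver would be used instead.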
This is very beneficial, since linear programs can be efficiently solved even in cases with thousands of parameters and constraints. In addition, elements of ν(β) corresponding to sets Θ(β, s) that are empty can be removed from the parameter vector ν(β) without altering the optimal solutions to the linear programs in (3.15) and (3.16). This allows for further reduction of the dimension of these linear programs. Following Theorem 3.2, these linear programs are used to construct an interval for each value of β ∈ B, and then the full identified set is constructed by taking the union of these intervals over all values of β.

Some thought reveals that β has no effect on the bounding problem in Theorem 3.2 other than through its effect on determining which of the sets Θ(β, s) are non-empty. Indeed, depending on the form of ϕ, for a fixed value of β ∈ B and a fixed vector s ∈ {0, 1}^m there may be no value of θ satisfying r(θ, β) = s. In practice the identified set can be constructed using Theorem 3.2 after fixing a particular functional form for ϕ, establishing a grid over the parameter space B, and then solving the optimization problems (3.15) and (3.16) for each value of β in the grid. This last step can only be completed after determining which of the sets Θ(β, s) are non-empty at each value of β in the grid. The following proposition demonstrates that, in theory, the researcher need only repeat the procedure just described for finitely many values of β.

Proposition 3.1.
Suppose that Assumptions 2.1 and 2.2 hold. Then there exists a (not necessarily unique) finite subset B′ ⊂ B such that:

{ ν ∈ R^{d_ν} : ∃ β ∈ B s.t. ν(β) satisfies (3.11), (3.12), (3.13), and ν = ν(β) } = { ν ∈ R^{d_ν} : ∃ β ∈ B′ s.t. ν(β) satisfies (3.11), (3.12), (3.13), and ν = ν(β) }.

We will call the points in the set B′ the representative points, although it is important to keep in mind that these points are generally not unique. Assuming the representative points can be determined by the researcher, Proposition 3.1 immediately implies that the union over β ∈ B in (3.14) can be replaced with a union over β ∈ B′. That is, the linear programs in (3.15) and (3.16) need only be solved at the representative points. Proposition 3.1 also implies that the identified set for counterfactual choice probabilities in Theorem 3.2 will always be a closed (but possibly disconnected) set. Unfortunately, when the researcher cannot determine the representative points, computing the exact (i.e. not approximate) bounds from Theorem 3.2 can become computationally prohibitive. Later on we will provide a tractable way to construct B′ in the case when ϕ is linear in β.

To this point, the binary response model has been left almost entirely unrestricted. The following corollary to Theorem 3.1 states formally the unsurprising result that, when ϕ is completely unrestricted, counterfactual choice probabilities are not identified under our assumptions.

Corollary 3.1.
Suppose Assumptions 2.1 and 2.2 hold. Then there exists a function ϕ : X × Z × B × Θ → R satisfying Assumptions 2.1 and 2.2 such that the identified set for any counterfactual choice probability of the form (3.9) or of the form (3.10) with γ(j) ≠ j is the interval [0, 1].

This result is stated as a corollary since it can be proven using Theorem 3.1. Intuitively, Theorem 3.1 shows that under Assumptions 2.1 and 2.2 the observed conditional choice probabilities effectively impose no constraints on counterfactual conditional choice probabilities; indeed, this follows by simple inspection of the constraints (3.5) and (3.6) from the statement of Theorem 3.1, as well as the discussion following the statement of Theorem 3.1. Without any constraints imposed by the observed conditional choice probabilities, by choosing a sufficiently flexible function ϕ : X × Z × B × Θ → R it is possible to rationalize all counterfactual conditional choice probabilities in the interval [0, 1].

This impossibility result shows that it is necessary to entertain additional assumptions in order to obtain informative bounds on counterfactual choice probabilities. In the next section, we will explore additional assumptions that can deliver such informative bounds. (It is important to note that this does not occur because Assumptions 2.1 or 2.2 have imposed any structure, but instead is because the average treatment effect is a weighted average of unidentified counterfactual conditional choice probabilities and identified conditional choice probabilities.)

The assumptions we consider in this section fall into three classes: (i) functional form assumptions, (ii) independence assumptions, and (iii) monotonicity assumptions. A common theme throughout our discussion is that additional assumptions like these often lead to the elimination of response types; that is, the assumptions imply that certain response types must have zero probability.
One of our main contributions in this paper is to show how functional form restrictions impose constraints in the bounding problem by limiting the number of sets Θ(β, s) that can be assigned positive probability. Reducing the number of sets that can be assigned positive probability imposes additional constraints in the optimization problems of Theorem 3.2 that help to tighten the identified set. It can also reduce the computational time needed to solve the bounding problems in Theorem 3.2 by reducing the dimension of the optimizing variable ν(y, x, z, β, s). Constraining sets of the form Θ(β, s) to be assigned zero probability will be referred to as eliminating response types. Response types corresponding to sets Θ(β, s) that survive elimination will be called admissible response types. Response types corresponding to sets Θ(β, s) that are eliminated will be called inadmissible.

We will show that a number of assumptions, including functional form assumptions, correspond to the elimination of particular response types. Since each response type is characterized by a particular menu of counterfactual responses, framing functional form assumptions in terms of the elimination of particular response types helps to provide some meaning to these assumptions. In the case when ϕ is linear in parameters we provide an efficient (i.e. polynomial-time) algorithm for constructing the relevant set of constraints in the bounding problems that is based on the hyperplane arrangement algorithm of Gu and Koenker (2020). When ϕ is linear in β, we also demonstrate how to compute an exact (i.e. not approximate) solution to the optimization problems in Theorem 3.2 that does not require establishing a grid over the entire parameter space B. After studying functional form assumptions, we then turn briefly to consider independence assumptions and monotonicity assumptions.
Independence assumptions are also quite common in parametric binary response models and binary response models with endogenous regressors, although here we show how to impose various independence assumptions as a set of linear equality constraints in the optimization problems of Theorem 3.2. Finally, monotonicity assumptions appear in various forms in the literature on treatment effects, and our incorporation of monotonicity restrictions arising from choice theory makes substantial use of response types, resembling the approach of Heckman and Pinto (2018). These three classes of assumptions—functional form, independence, and monotonicity—will now be addressed in turn. Although we study these three assumptions separately, all of our results hold with minimal modification when any combination of these assumptions is imposed. We will make use of this fact in the application section.

(For additional impossibility results in partially identified discrete response models, see the discussion on pages 1396 and 1402 of Manski (2007) as well as Corollary 3 in Chesher et al. (2013).)

4.1 Functional Form Assumptions: The Linear Case

In this subsection we consider introducing assumptions on the functional form of the index function ϕ. We will consider the case when ϕ is linear in the latent variables θ. This assumption will connect the results in this paper to results on random coefficient linear index models where the distribution of random coefficients is treated nonparametrically (c.f. Ichimura and Thompson (1998), Gautier and Kitamura (2013) and Gu and Koenker (2020), among others). For reference throughout this section, let us first formally state our linearity assumption on the index function, as well as an assumption on the distribution of the latent variables.

Assumption 4.1 (Linearity in Latent Variables).
For each β ∈ B, (i) the function ϕ(·, β) : X × Z × Θ → R from Assumption 2.1 is linear in θ, and (ii) the event:

F := ⋃_{(x,z) ∈ X×Z} {θ : ϕ(x, z, θ, β) = 0},

occurs with probability zero; that is, P_θ(F) = 0.

Part (i) imposes linearity of ϕ, but still allows for general forms of nonseparability between the latent and observed variables. Part (ii) of Assumption 4.1 is not needed from a technical standpoint, but we will see that it leads to a dramatic simplification of our algorithm to enumerate response types described in the next subsection. We will return to this point later. For now, it suffices to know that part (ii) of Assumption 4.1 essentially says that the boundaries of the sets that define response types are assigned probability zero. Under part (i) of Assumption 4.1, the sets of the form {θ : ϕ(x, z, θ, β) = 0} are hyperplanes in R^{d_θ}, and the sets Θ(β, s) (intersected with Θ) have boundaries that are defined by these hyperplanes. Since these hyperplanes are (d_θ − 1)-dimensional subsets of a d_θ-dimensional space, they have Lebesgue measure zero. Thus, the most familiar condition implying part (ii) of Assumption 4.1 is absolute continuity of the distribution of θ; this latter assumption has been imposed, for example, in Chesher et al. (2013) and Chesher and Rosen (2014). As an alternative to imposing part (ii) of Assumption 4.1, we could instead condition all of the remaining analysis in this paper on the event that θ ∈ F^c. It will be useful to keep this in mind throughout. Finally, a special case of Assumption 4.1 occurs when the function ϕ is additively separable in a scalar latent variable θ. A full analysis of this well-studied case using our framework is worthwhile, and is taken up in Appendix B.6.

We will see that assumptions on the functional form of ϕ can sometimes be useful to obtain tighter bounds on counterfactual conditional choice probabilities because they implicitly eliminate certain response types.
To illustrate this point, recall that Θ(β, s) := {θ : r(β, θ) = s} and that there are exactly 2^m binary vectors s ∈ {0, 1}^m. In order for the response type r(β, θ) = s to be admissible, it must be that the set Θ(β, s) is non-empty for the pair (β, s). By extension, in order for all response types to be admissible, it must be that all the sets Θ(β, s) are non-empty. Whether this is satisfied clearly depends on the nature of ϕ as a function of θ; for example, it clearly fails if ϕ is constant in θ, but is easily satisfied if ϕ can vary arbitrarily with θ. The following simple example shows that some of the sets Θ(β, s) can be empty under Assumption 4.1.

Footnote: Specifying linearity in latent variables seems to be nonnested with the weak separability assumption of Vytlacil and Yildiz (2007). In particular, Vytlacil and Yildiz (2007) permit index functions of the form (omitting β for simplicity): ϕ(x, z, θ) = g(h(x, z), θ), where h : X × Z → R and g : R × Θ → R are real-valued functions. Note that this form cannot accommodate cases like ϕ(x, z, θ) = ⟨x, θ⟩, although it allows for nonlinearities in θ.

Example 1.
Suppose we have a variable X ∈ {0.5, 1, 2} and latent variables θ = (θ_1, θ_2) ∈ S (the two-dimensional closed unit sphere). That is, suppose there are no variables Z and no fixed coefficients β. Then the structural function from (2.1) can be written as ϕ(X, θ) and the binary response vector r(β, θ) can be written as r(θ), where:

r(θ) = ( 1{ϕ(0.5, θ) ≥ 0}, 1{ϕ(1, θ) ≥ 0}, 1{ϕ(2, θ) ≥ 0} )′.

Without any additional restrictions there is a total of 2^{|X|} = 8 possible response types. That is, r(θ) ∈ {s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8}, where:

s_1 = (0,0,0)′, s_2 = (1,0,0)′, s_3 = (0,1,0)′, s_4 = (0,0,1)′, s_5 = (1,1,0)′, s_6 = (1,0,1)′, s_7 = (0,1,1)′, s_8 = (1,1,1)′.

Conclude that without any additional restrictions, all sets of the form Θ(β, s) for s ∈ {0, 1}³ can be assigned positive probability by the optimization problems in Theorem 3.2. Now suppose we entertain a linear functional form restriction. In particular, suppose that Assumption 4.1 holds and that the structural function from (2.1) can be written as:

ϕ(X, θ) = Xθ_1 − θ_2.

Then the binary response vector r(θ) is given by:

r(θ) = ( 1{0.5θ_1 ≥ θ_2}, 1{θ_1 ≥ θ_2}, 1{2θ_1 ≥ θ_2} )′.

As is illustrated in Figure 1, under the assumption that the index function is linear in latent variables only 6 response types are possible. In particular, the response types corresponding to the binary vectors s_3 = (0,1,0)′ and s_6 = (1,0,1)′ are not possible under the linearity assumption. Thus, under Assumption 4.1 a distribution of latent variables will be admissible in this example only if it assigns probability zero to the sets:

Θ(β, s_3) = {θ : r(θ) = s_3},    Θ(β, s_6) = {θ : r(θ) = s_6}.

These additional constraints must be imposed in our optimization problems from Theorem 3.2.

Footnote: For example, if θ = (θ_1, θ_2), take ϕ(X, θ) = sin(θ_1 X + θ_2) and fix θ_2 = 0. Then it is straightforward to find eight values of the frequency parameter θ_1 to rationalize each of the 8 response types.

Figure 1: A figure corresponding to Example 1 illustrating the partition of the latent variable space according to response types in the case when the index function is linear. Without functional form restrictions, Example 1 shows 8 response types are possible; however, when the index function is linear in latent variables there are only 6 possible response types, as illustrated in the figure. In particular, the response types corresponding to the binary vectors s_3 and s_6 from Example 1 are not possible.
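Example 1 is easy to check numerically. The sketch below (a brute-force grid evaluation standing in for the geometric argument; the grid resolution is arbitrary) evaluates r(θ) over the unit disk and collects the distinct response types:

```python
# Enumerate the response types generated by phi(X, theta) = X*theta_1 - theta_2
# for X in {0.5, 1, 2}, by evaluating r(theta) on a grid over the unit disk.
# (A brute-force stand-in for the geometric argument of Example 1.)

def response_type(t1, t2):
    return tuple(int(x * t1 - t2 >= 0) for x in (0.5, 1.0, 2.0))

grid = [-1 + 0.025 * k for k in range(81)]
types = {response_type(t1, t2) for t1 in grid for t2 in grid
         if t1 ** 2 + t2 ** 2 <= 1.0}

print(sorted(types))          # 6 of the 8 possible binary vectors
assert len(types) == 6
assert (0, 1, 0) not in types and (1, 0, 1) not in types
```

Only the six patterns that are monotone in X appear; the vectors (0,1,0)′ and (1,0,1)′ never occur, matching Figure 1.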
This example shows that imposing linearity of ϕ in latent variables implies that certain sets of the form Θ(β, s) may be empty for some binary vectors s ∈ {0, 1}^m. In the general case, it can be shown that when ϕ is restricted to be linear in θ, there is an upper bound on the number of non-empty sets Θ(β, s) that grows at a rate that is polynomial in m (rather than exponential in m, which is the case when ϕ is unrestricted).

Proposition 4.1.
Suppose that Assumptions 2.1 and 4.1 are satisfied. Then for each fixed β ∈ B, there are at most ∑_{j=0}^{d_θ} (m choose j) admissible response types.

This result is implied by results in the literature on computational geometry. In particular, linearity of the function ϕ(·, θ) means that for each instance of (x, z, β) the function ϕ(x, z, θ, β) defines a hyperplane in d_θ-dimensional space. In the case when the vectors defining these hyperplanes are in general position the upper bound in Proposition 4.1 is obtained. This latter result was initially proven by Buck (1943).

Footnote: A collection of m hyperplanes in d-dimensional space is considered to be in general position when any collection of k out of the m hyperplanes intersects in a (d − k)-dimensional space for 1 < k ≤ d, and any collection of k out of the m hyperplanes has an empty intersection for k > d.
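The bound in Proposition 4.1 is easy to tabulate. The following sketch (illustrative only; the hyperplane coefficients are randomly drawn for the demonstration) computes the bound ∑_{j=0}^{d} (m choose j) and compares it with the number of sign vectors realized by an arbitrary arrangement of m hyperplanes in R², counted by sampling:

```python
import math, random

def buck_bound(m, d):
    """Maximum number of cells created by m hyperplanes in R^d (Buck, 1943)."""
    return sum(math.comb(m, j) for j in range(d + 1))

assert buck_bound(4, 2) == 11          # 1 + 4 + 6
assert buck_bound(3, 3) == 8           # unrestricted case: all 2^3 sign vectors

# Randomly drawn arrangement of m hyperplanes a*t1 + b*t2 = c in R^2; count the
# sign vectors realized at sampled points -- never more than the bound allows.
random.seed(0)
m, d = 5, 2
hyps = [(random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)) for _ in range(m)]
realized = {tuple(int(a * t1 + b * t2 - c >= 0) for (a, b, c) in hyps)
            for t1 in [x / 10 for x in range(-50, 51)]
            for t2 in [x / 10 for x in range(-50, 51)]}
assert len(realized) <= buck_bound(m, d)   # at most 16 of the 2^5 = 32 vectors
```

The sampled count can understate the true number of cells, but it can never exceed the polynomial bound, illustrating why linearity eliminates response types.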
To impose linearity in the latent variables we must determine which sets Θ(β, s) are empty, and then ensure that any distribution of the latent variables under consideration when bounding counterfactual choice probabilities assigns probability zero to these sets. Let us define the collection of binary vectors S_ϕ to be those vectors s ∈ {0, 1}^m corresponding to admissible response types under Assumption 4.1. Note that any such admissible response type under Assumption 4.1 must correspond to a set Θ(β, s) with non-empty interior. A revised definition of the joint identified set I*_{Y,X,Z} under Assumption 4.1 is provided in Appendix B.2. In Appendix B.2 we also present a corollary of Theorem 3.1 that is valid under Assumption 4.1. This allows us to reduce the infinite dimensional existence problem to a finite dimensional problem even under the assumption of linearity in latent variables. To extend the results of Theorem 3.2 we must then simply include the correct set of additional constraints in our optimization problems. The correct set of constraints is provided by Corollary B.1 in Appendix B.2, and can be written in terms of the parameter vector ν(β) as:

∑_{s ∈ S_ϕ^c} ν(y, x_j, z_j, β, s) = 0,    (4.1)

for all y ∈ {0, 1} and j = 1, . . . , m occurring with positive probability. Corollary B.2 in Appendix B.2 then demonstrates that Theorem 3.2 can be extended simply by adding the constraints (4.1) to the optimization problems (3.15) and (3.16). We will see in the next subsections that independence and monotonicity constraints can also be imposed as equality constraints on the parameter vector ν(β), and thus any combination of these assumptions can be imposed on the optimization problems in Theorem 3.2 by simply adding the correct constraints. From a practical implementation point of view, imposing (4.1) in addition to the constraints from (3.15) and (3.16) is equivalent to eliminating the elements of the parameter vector ν(β) corresponding to s ∈ S_ϕ^c.
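In terms of implementation, imposing (4.1) therefore just deletes components of ν(β). A minimal sketch (using the admissible set from Example 1 purely for illustration):

```python
from itertools import product

# All 2^m candidate response types for m = 3 ...
all_types = list(product((0, 1), repeat=3))
# ... and the inadmissible ones under the linear specification of Example 1.
inadmissible = {(0, 1, 0), (1, 0, 1)}

# Imposing (4.1) is equivalent to dropping the corresponding components of nu:
nu_support = [s for s in all_types if s not in inadmissible]
assert len(nu_support) == 6   # the optimizing variable shrinks from 8 to 6 types
```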
Finally, Corollary B.3 shows that Proposition 3.1 also applies when the constraints (4.1) are added to the constraints from (3.15) and (3.16).

In summary, functional form assumptions provide additional restrictions in the problem of bounding counterfactual choice probabilities by simply restricting the number of sets Θ(β, s) that can be assigned nonzero probability, and thus limiting the admissible response types. When it is known which of the response types to eliminate, we can constrain the corresponding sets Θ(β, s) to be assigned zero probability in the bounding problem given in Theorem 3.2, which can lead to a tightening of the identified set. However, given a certain functional form assumption and a value of β, it is generally difficult to determine which of the sets Θ(β, s) are empty, and thus which response types to eliminate. We will address this implementation problem in the next subsection, where we propose the use of the hyperplane arrangement algorithm of Gu and Koenker (2020).

Footnote: The fact that we want sets Θ(β, s) that have non-empty interior follows from the fact that part (ii) of Assumption 4.1 implies a set Θ(β, s) has positive probability if and only if int(Θ(β, s)) has positive probability.

4.1.1 Implementation and Hyperplane Arrangement

To practically implement the revised optimization problems we require a method of enumerating all admissible response types represented by the binary vectors in S_ϕ, defined in the previous subsection. The researcher must know the admissible response types in order to impose constraint (4.1) in the bounding problems, but the admissible response types will depend on the support X × Z and the function ϕ. Thus, in practice S_ϕ must be computed in each specific application.
In order to compute the collection S_ϕ we propose the use of the hyperplane arrangement algorithm of Gu and Koenker (2020).

Enumerating the admissible response types under Assumption 4.1 corresponds to determining all sets Θ(β, s) that have non-empty interior. When the index function ϕ is linear in θ, for each fixed β and s ∈ {0, 1}^m the set Θ(β, s) is a convex polyhedron formed by the intersection of halfspaces whose boundaries are hyperplanes of the form {θ : ϕ(x, z, θ, β) = 0}. Under Assumption 2.1 there are at most m such hyperplanes. The hyperplane arrangement algorithm of Gu and Koenker (2020) accepts these m hyperplanes as an input, and outputs the binary vectors s corresponding to the sets Θ(β, s) that have non-empty interior, as well as a point from each of these sets. In low dimensional space, it is relatively easy to determine the sets with non-empty interior formed by the intersection of halfspaces (see Figure 1, for instance). However, as the dimension of the space increases it becomes challenging to enumerate all of these sets. Avis and Fukuda (1996) were the first to provide an enumeration algorithm that runs in a time proportional to the maximum number of sets with non-empty interior. Improvements to this algorithm were made by Sleumer (1999) and Rada and Cerny (2018). The algorithm of Gu and Koenker (2020) is most closely related to the latter paper, and was developed for the problem of nonparametric maximum likelihood in a linear random coefficient model. It runs in a time proportional to O(m^{d_θ}).

To understand the algorithm, note that for each s ∈ {0, 1}^m and fixed β, we can verify using a linear program whether there exists a point in the space Θ that lies interior to the set Θ(β, s). Indeed, consider the following linear programming problem:

max_{θ, ε} ε    s.t.    (2s_j − 1)ϕ(x_j, z_j, θ, β) ≥ ε,    j = 1, . . . , m,    (4.2)

where s_j is the j-th element of our fixed binary vector s, and where here we have an index function ϕ(x, z, θ, β) that is linear in θ. If ε* and θ* are the optimal values of the program (4.2) (provided that it is feasible), then an optimal value ε* > 0 implies that θ* is an interior point of the polyhedron Θ(β, s). However, since the linear program (4.2) must be solved for each s ∈ {0, 1}^m, checking whether each Θ(β, s) admits an interior point requires solving 2^m linear programs, despite the fact that we know the number of non-empty sets Θ(β, s) is polynomial in m.

To address this issue, the algorithm proposed in Gu and Koenker (2020) builds upon the algorithm in Rada and Cerny (2018). The idea is to add one hyperplane at a time, and to enumerate all feasible response types after adding each new hyperplane. At step k they start with a collection of k − 1 hyperplanes and all of the response types induced by these k − 1 hyperplanes. They then introduce a new hyperplane into the arrangement of hyperplanes, and determine all newly created response types by solving a linear program. The algorithm of Rada and Cerny (2018) requires solving a linear programming problem for all of the existing cells at each iteration, which amounts to solving O(m^{d_θ+1}) such problems. When m is large, which is typically the case in practice, this can become costly. Gu and Koenker (2020) observed that when a new hyperplane is added the only new cells will be those that are created when the existing cells are crossed by the last hyperplane. By efficiently locating those crossed cells, the algorithm reduces the number of linear programming problems to be solved by a factor of m. The algorithm in Gu and Koenker (2020) is available in the R package RCBR.

Footnote: As we discussed at the beginning of this subsection, although ϕ is linear in θ it can have many different specifications.

Recall from the previous subsection that we claimed part (ii) of Assumption 4.1 led to a dramatic simplification of our algorithm to enumerate response types. To understand the reason why part (ii) of Assumption 4.1 is useful, consider the following simple example.
Example 2.
Suppose we have a variable X ∈ {−1, 1} and latent variables θ ∈ S (the two-dimensional closed unit sphere). That is, suppose there are no variables Z and no fixed coefficients β. Now suppose the structural function from (2.1) is given as ϕ(X, θ) = Xθ_1. Then the binary response vector r(β, θ) can be written as r(θ), where:

r(θ) = ( 1{θ_1 ≥ 0}, 1{−θ_1 ≥ 0} )′.

Now consider the four possible binary vectors in {0, 1}²:

s_1 = (0,0)′, s_2 = (1,0)′, s_3 = (0,1)′, s_4 = (1,1)′.

Under Assumption 4.1 there are only two sets of the form Θ(β, s) that can be assigned positive probability, given by:

Θ(β, s_2) := {θ : r(θ) = s_2},    Θ(β, s_3) := {θ : r(θ) = s_3}.

However, in the absence of part (ii) of Assumption 4.1 the set Θ(β, s_4) := {θ : r(θ) = s_4} can also be assigned positive probability. This demonstrates that part (ii) of Assumption 4.1 imposes additional restrictions on P_{θ|Y,X,Z} that may tighten the identified set for counterfactual choice probabilities and other quantities. However, note that the set Θ(β, s_4) is the lower-dimensional boundary between the sets Θ(β, s_2) and Θ(β, s_3). Imposing Assumption 4.1 allows us to avoid enumerating lower-dimensional sets like Θ(β, s_4), which can have a considerable impact on computation time, especially in higher dimensional examples.

This example illustrates why part (ii) of Assumption 4.1 contains some identifying content, and thus can narrow the identified set for counterfactual choice probabilities and other quantities. It also provides some intuition for why this assumption is helpful when it comes to computation, since it allows us to avoid enumerating all lower dimensional sets Θ(β, s).

In summary, the hyperplane arrangement algorithm can be used as a pre-processing step under Assumption 4.1 to determine which sets Θ(β, s) have non-empty interior in a given application. After completing the pre-processing of our problem using the hyperplane arrangement algorithm, Corollary B.2 becomes directly applicable.
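The incremental enumeration idea behind this pre-processing step can be illustrated with a small sketch. Sampled points stand in here for the LP interior-point check of (4.2), so this is only an illustration of the bookkeeping in Rada and Cerny (2018) and Gu and Koenker (2020), not their algorithm, and the three hyperplanes are arbitrary:

```python
# Incrementally build the cells of a line arrangement in R^2, splitting only
# the cells crossed by each newly added hyperplane.  Sample points stand in
# for the linear-programming feasibility check.

hyperplanes = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # lines a*t1 + b*t2 = 0

# Grid of sample points, offset so none lies exactly on a line.
points = [(x / 7 + 0.013, y / 7 + 0.017) for x in range(-7, 8) for y in range(-7, 8)]

cells = {(): points}          # sign-vector prefix -> sample points in that cell
for a, b in hyperplanes:
    new_cells, n_crossed = {}, 0
    for prefix, pts in cells.items():
        sides = {}
        for t1, t2 in pts:
            sides.setdefault(int(a * t1 + b * t2 >= 0), []).append((t1, t2))
        if len(sides) == 2:   # the new hyperplane crosses this cell ...
            n_crossed += 1
        for sign, sub in sides.items():
            new_cells[prefix + (sign,)] = sub
    # ... and only crossed cells create new cells:
    assert len(new_cells) == len(cells) + n_crossed
    cells = new_cells

print(len(cells))             # 3 generic lines through the origin give 6 cells
assert len(cells) == 6
```

The assertion inside the loop is exactly the observation exploited by Gu and Koenker (2020): each step adds one new cell per crossed cell, so only crossed cells need to be examined.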
In the next subsection we will also show how the assumption of linearity in the parameters β ∈ B can be combined with the hyperplane arrangement algorithm to determine the representative points from Proposition 3.1, which dramatically simplifies the bounding procedure suggested by Theorem 3.2.

Recall that computation of our bounds on counterfactual conditional choice probabilities requires the researcher to repeatedly solve linear programs. In particular, a value of β ∈ B is fixed, and the lower and upper bound of a closed interval is computed for this fixed value of β. We must then repeat this procedure for all β ∈ B and take the union of the resulting intervals over all values of β ∈ B. We refer to the process of repeating the procedure for all values of β and taking unions as “profiling β.” Profiling β can present major computational challenges. Corollary B.3 suggests that only finitely many representative points need to be considered, although it is not obvious how to find these representative points.

In this subsection we will describe an algorithm that allows us to find all representative points when ϕ is linear in both θ and β. To introduce our approach, note that under the assumption that ϕ is linear in (θ, β), for each fixed β ∈ B the sets of the form Θ(β, s) define a unique partition of the space Θ into sets whose boundaries are defined by m hyperplanes. Let us define:

S(β) := {s ∈ {0, 1}^m : int(Θ(β, s)) ≠ ∅}.

Then S(β) denotes the set of all vectors s ∈ {0, 1}^m that are inducible by our arrangement of m hyperplanes. Now recall that functional form assumptions impose restrictions in the bounding optimization problems by restricting the number of sets Θ(β, s) with non-empty interior. For any two values β, β′ ∈ B with β ≠ β′, if S(β) = S(β′) then the linear programming problems in Theorem 3.2 at β and β′ will be identical, since they will have an identical set of constraints.
The points β and β′ are thus equivalent in the sense that we only need to solve the linear programming problems for one of them. Extending this idea, we can define an equivalence class as the set of all β ∈ B delivering the same collection S(β). We then only need to solve the linear programming problems at one value of β belonging to each equivalence class. These values of β selected from each equivalence class are exactly what we call representative points.

To see how to find the representative points, let us partition θ := (θ_x, θ_z, ε), β = (β_x, β_z), x = (x^r, x^f), z = (z^r, z^f), and for a binary vector s ∈ {0, 1}^m let us define the set:

R(s) := { (θ, β) : ( 1{θ_x x_1^r + θ_z z_1^r + β_x x_1^f + β_z z_1^f ≥ ε}, . . . , 1{θ_x x_m^r + θ_z z_m^r + β_x x_m^f + β_z z_m^f ≥ ε} )′ = s }.    (4.3)

These sets form a unique partition of the space of (θ, β) defined by the m hyperplanes of the form:

θ_x x_i^r + θ_z z_i^r + β_x x_i^f + β_z z_i^f = ε.    (4.4)

The basic idea behind our strategy to find representative points is to first project the sets of the form R(s) onto the parameter space B. Note that the projection of a set R(s) onto the parameter space B will deliver the set of all β consistent with the binary vector s for some value of θ. After taking the intersection of all such projections, each set in the resulting collection corresponds exactly to an equivalence class discussed above. An arbitrary value of β taken from such a set will then be a representative point. The most challenging part of this approach will be to find a tractable characterization of the projections of R(s) onto the parameter space B.

Let us denote the collection of all binary vectors s ∈ {0, 1}^m corresponding to the sets R(s) with non-empty interior as S_p.

Footnote: This partition of B into equivalence classes is exactly what is done in the proof of Proposition 3.1 in the more general case.
The first step of our procedure to find the representative points is to determine the binary vectors in S_p. This can be done by running the hyperplane arrangement algorithm of Gu and Koenker (2020) on the collection of hyperplanes of the form (4.4) defined on Θ × B, treating β as a latent variable. Note that the assumption of linearity of ϕ in (θ, β) restricts the number of sets in the collection S_p to be polynomial in m.

Next, let us define w_i^r := (x_i^r, z_i^r, −1) and w_i^f := (x_i^f, z_i^f), where w_i^r has dimension d_r and w_i^f has dimension d_f. Then each of the hyperplanes of the form (4.4) can be written as w_i^r θ + w_i^f β = 0. Stacking these hyperplanes into matrix form we have W_r θ + W_f β = 0, where W_r is m × d_r and W_f is m × d_f. Now each set of the form (4.3) is a polyhedral cone in R^{d_x + d_z + 1} and can be uniquely identified by a sign vector 2s − 1 ∈ {−1, 1}^m. Fix any s ∈ S_p, let D(s) = diag(2s − 1) denote the m × m diagonal matrix with the sign vector 2s − 1 on its diagonal, and define W_r(s) := D(s)W_r and W_f(s) := D(s)W_f. Then the set R(s) from (4.3) can be conveniently rewritten as:

R(s) := { (θ, β) : W_r(s) θ + W_f(s) β ≥ 0 }.

Note that the row dimension of W_r(s) and W_f(s) is m, which can be large if the support X × Z contains many points. For this reason we remove redundant inequalities from the system defining R(s) before proceeding to the next step. Elimination of redundant inequalities from this system can be achieved in polynomial time with a sequence of linear programs, and the resulting set of nonredundant inequalities that define the polyhedral cone R(s) is typically much smaller than m.

Footnote: Note that in this context, all of the hyperplanes of the form (4.4) can be viewed as hyperplanes through the origin in Θ × B. In this case, the upper bound on the number of cells formed by this collection of hyperplanes is of smaller order than that presented in Proposition 4.1. Cover (1965) shows the upper bound is given by: C(m, d_θ) := 2 ∑_{j=0}^{d_θ − 1} (m − 1 choose j).

From here on we assume the matrices W_r(s) and W_f(s) only include rows corresponding to nonredundant constraints, and we will denote their row dimension as m(s). Now consider the set:

B(s) := { β ∈ B : ∃ θ ∈ Θ s.t. W_r(s) θ + W_f(s) β ≥ 0 }.    (4.5)

Then B(s) is the set of values of β for which there exists θ such that all the constraints W_r(s) θ + W_f(s) β ≥ 0 are satisfied. In other words, B(s) is precisely the projection of the polyhedral cone R(s) on the parameter space B.

The objective is now to show that the set B(s) can be defined only in terms of linear inequality constraints in β. In other words, we would like to “eliminate” the latent variables θ from the system of inequalities in (4.5). A natural method of doing so is to use Fourier-Motzkin elimination. Recall that the Fourier-Motzkin algorithm eliminates variables from a system of linear inequalities by taking linear combinations of the inequalities in the system.
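The rewriting of R(s) using D(s) can be verified numerically: a non-boundary point v = (θ, β) has sign vector s exactly when every entry of D(s)(W_r θ + W_f β) is nonnegative. A small sketch with arbitrary illustrative coefficients:

```python
# Verify that r(v) = s  iff  D(s) W v >= 0, where W stacks the hyperplane
# coefficients (w_i^r, w_i^f).  The rows of W and the point v are arbitrary
# made-up values for illustration.

W = [(1.0, -2.0, 0.5), (0.0, 1.0, -1.0), (2.0, 1.0, 1.0)]
v = (0.3, 0.7, -0.2)   # a generic (theta, beta) point, not on any hyperplane

dot = [sum(w_k * v_k for w_k, v_k in zip(w, v)) for w in W]
s = tuple(int(d >= 0) for d in dot)          # the sign (response) vector of v

# D(s) = diag(2s - 1) flips the rows with s_i = 0 ...
flipped = [(2 * si - 1) * di for si, di in zip(s, dot)]
# ... so that v satisfies every inequality of the rewritten cone R(s):
assert all(f >= 0 for f in flipped)
```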
In particular, Fourier-Motzkin elimination can be viewed as applying a sequence of matrix operators M_1, M_2, . . . , M_{d_r} to the system of inequalities in (4.5), where the product M_k M_{k−1} · · · M_1 eliminates the first k elements of the vector θ from the inequalities. Let us denote M*_r = M_{d_r} M_{d_r−1} · · · M_1. Then as a result of Fourier-Motzkin elimination we would have the equivalent system:

B(s) := { β ∈ B : M*_r W_f(s) β ≥ 0 },    (4.6)

since M*_r W_r(s) = 0 by construction of M*_r. The set in (4.6) then gives us inequality constraints only in terms of β that define the projection of R(s) on B.

While it is possible to use Fourier-Motzkin elimination to “eliminate” the latent variables θ, the number of rows in the matrix M*_r can be prohibitively large, even when the number of nonredundant inequalities defining the set (4.6) is small. To ensure feasibility of our method of projection, we must thus search for a procedure that both eliminates redundant inequalities from (4.5) and results in a simpler characterization of B(s) than the one provided by Fourier-Motzkin elimination.

Footnote: In particular, not all the hyperplanes that define R(s) are relevant, in the sense that some of them are implied by the rest of the inequalities in the system. Removing these redundant inequalities will not change the cone R(s). We can remove them before continuing to the projection step of our procedure by conducting a redundancy test. For example, suppose we have a system of j + 1 inequalities of the form Ax ≤ b and s⊤x ≤ t. Then to check whether the last inequality is binding (and thus nonredundant), we can solve the linear programming problem f* = max s⊤x s.t. Ax ≤ b, s⊤x ≤ t + 1. The inequality s⊤x ≤ t is redundant if and only if f* ≤ t. Eliminating all redundant inequalities from a system of m inequalities thus requires solving m linear programs; hence, it can be done in polynomial time. There are a few strategies to speed up the removal of redundant inequalities, as discussed in Section 2.21 in Fukuda (2014). We use the implementation in the package Rcdd with the function redundant.

Footnote: The idea of using Fourier-Motzkin elimination to determine the inequality constraints defining projected regions in partial identification was also explored in Section 8.2 of Chesher and Rosen (2019).

To this end, consider the following set:

C(s) := { c ∈ R^{m(s)} : c⊤W_r(s) = 0, c ≥ 0 },    (4.7)

where recall that m(s) is the row dimension of W_r(s) and W_f(s) after we have removed all the redundant inequalities. Since the rows of M*_r have nonnegative entries (by construction using the Fourier-Motzkin algorithm), they must belong to C(s). Thus we can conclude that:

{ β ∈ B : c⊤W_f(s) β ≥ 0, ∀ c ∈ C(s) } ⊆ B(s).

Furthermore, Kohler (1967) shows that the reverse inclusion also holds; in particular, every vector in the collection C(s) can be written as a nonnegative linear combination of the rows of M*_r. We can thus conclude:

B(s) = { β ∈ B : c⊤W_f(s) β ≥ 0, ∀ c ∈ C(s) }.

While at first glance this result is not immediately useful, the Minkowski-Weyl Theorem allows us to re-write the set C(s) as:

C(s) = { c ∈ R^{m(s)} : c = R(s) a, for some a ≥ 0 },    (4.8)

where R(s) is some matrix. That is, every element belonging to the polyhedral cone C(s) can be written as a nonnegative linear combination of the columns of the matrix R(s). It follows that if we could obtain the matrix R(s) from (4.8), we could obtain the following representation of the projected set for β:

B(s) = { β ∈ B : H(s) β ≥ 0 },    (4.9)

where H(s) = R(s)⊤W_f(s). The matrix R(s) is sometimes called the generating matrix of the polyhedral cone C(s).
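A minimal Fourier-Motzkin step for the special case of a single remaining latent variable can be written directly: pair each inequality with a positive θ-coefficient against each with a negative one, and the θ-free combinations describe the projection. This is an illustrative sketch (scalar θ and scalar β, hypothetical coefficients), not the general M*_r construction:

```python
# One Fourier-Motzkin step for inequalities a_i*theta + b_i*beta >= 0 with
# scalar theta: pairing rows with a_i > 0 against rows with a_j < 0 yields
# theta-free inequalities characterizing the projection B(s).

def eliminate_theta(rows):
    """rows: list of (a, b) encoding the constraint a*theta + b*beta >= 0."""
    pos = [(a, b) for a, b in rows if a > 0]
    neg = [(a, b) for a, b in rows if a < 0]
    zero = [b for a, b in rows if a == 0]
    # nonnegative multipliers: -a_j > 0 on the positive row, a_i > 0 on the negative
    return zero + [(-aj) * bi + ai * bj for ai, bi in pos for aj, bj in neg]

# R(s) = {(theta, beta): theta - beta >= 0, -theta + 2*beta >= 0}
proj = eliminate_theta([(1.0, -1.0), (-1.0, 2.0)])
print(proj)            # one inequality: 1.0 * beta >= 0, i.e. B(s) = {beta >= 0}
assert proj == [1.0]
```

The projection of {β ≤ θ ≤ 2β} onto β is indeed {β ≥ 0}; with many inequalities the pairing generates the large number of rows in M*_r that motivates the double description approach below.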
The Minkowski-Weyl Theorem essentially says that every polyhedral cone admits a generating matrix, and every generating matrix generates a polyhedral cone. The problem of finding the minimal generating matrix R(s) (that is, the matrix R(s) generating C(s) such that no proper submatrix of R(s) also generates C(s)) is called the extreme ray enumeration problem. Note that the minimal generating matrix is unique only up to multiplication by a positive scalar.

The characterizations of the cone C(s) in (4.7) and (4.8) are called its H-representation and its V-representation, respectively. Converting from one representation of a convex polyhedron to another is called the double description problem in computational geometry, and is one of the most important problems in the field. The double description algorithm of Fukuda and Prodon (1995) has an R implementation in the package Rcdd by Geyer (2019). There are also alternative nonincremental algorithms available for extreme ray enumeration; for instance, the reverse search algorithm by Avis and Fukuda (1996).

Footnote: Note that it is possible to first perform Fourier-Motzkin elimination, and then remove redundant inequalities from the system M*_r W_f(s) β ≥ 0.

Footnote: For a general convex polyhedron defined by Λ = {λ ∈ R^d : Aλ ≤ b}, the Minkowski-Weyl Theorem states that every vector λ ∈ Λ can be written as λ = λ_1 + λ_2, where λ_1 ∈ conv{v_1, . . . , v_k} and λ_2 ∈ cone{v_{k+1}, . . . , v_n}. Here v_1, . . . , v_k are called the vertices of Λ and v_{k+1}, . . . , v_n are the extreme rays of Λ. In the special case of b = 0, where all hyperplanes pass through the origin, Λ becomes a polyhedral cone and k = 0, so that Λ = cone{v_1, . . . , v_n}. This latter case is what is relevant for us, and the columns of the matrix R(s) are the collection of these extreme rays.
However, in general there is no known efficient (polynomial-time) algorithm for general input, although the incremental double description algorithm is known to be efficient for degenerate polyhedra (which arise very often when the hyperplanes are not in general position) and for low dimensions (up to 10). Avis et al. (1997) present a thorough comparison of these different algorithms.

After employing the double description algorithm the projection B(s) represented in (4.9) contains a minimal number of constraints defined only in terms of β. Repeating the procedure described above for all s ∈ S_p then gives us a collection of sets B(s) representing the projections of R(s) onto the parameter space B. However, for different binary vectors s ∈ S_p the projected sets B(s) may not be disjoint. Thus, to get the representative points β* we consider the intersection of these cones across s ∈ S_p. To do so, we stack all unique hyperplanes of the form H(s)β = 0 for all s ∈ S_p into a matrix H_p. The set of hyperplanes H_p β = 0 then defines the boundaries of the sets formed by the intersection of the cones B(s). From here we can easily collect the representative points from the resulting collection of sets defined by the hyperplanes H_p β = 0 by a final application of the hyperplane arrangement algorithm of Gu and Koenker (2020).

To summarize, our procedure to profile β is based on the idea that there are only a finite number of representative points from B that need to be considered in the bounding optimization problems. Our proposed procedure to find these representative points is as follows:

(i) Determine the collection S_p of binary vectors s ∈ {0, 1}^m corresponding to the non-empty sets R(s) from (4.3) by running the hyperplane arrangement algorithm of Gu and Koenker (2020) on the collection of hyperplanes of the form (4.4).

(ii) For each s ∈ S_p:

(a) Set D(s) = diag(2s − 1) and define W_r(s) := D(s)W_r and W_f(s) := D(s)W_f. Now remove any redundant inequalities from the system of inequalities in the set:

R(s) := { (θ, β) : W_r(s) θ + W_f(s) β ≥ 0 },

by solving a sequence of linear programs, as described in footnote 33.

(b) Obtain the generating matrix R(s) for the polyhedral cone C(s) using the double description algorithm of Fukuda and Prodon (1995), and set H(s) = R(s)⊤W_f(s). Then the projected set B(s) from (4.5) can be written:

B(s) = { β ∈ B : H(s) β ≥ 0 }.

(iii) Intersect the projected sets B(s) over all s ∈ S_p: by stacking the matrices H(s) over s ∈ S_p into the matrix H_p, the rows of the matrix H_p will define a set of hyperplanes that act as the boundaries of all sets defined by the intersection of the projected sets B(s).

(iv) Run the hyperplane arrangement algorithm of Gu and Koenker (2020) a final time on the collection of hyperplanes defined by the rows of H_p in order to collect representative points from each set.

Footnote: For an incremental algorithm to be polynomial-time, the size of the intermediate rays in each incremental step needs to be polynomial in the input size. The difficulty involved with all known incremental algorithms in the literature is that the intermediate representation can be very large, which leads the algorithm to be superpolynomial in the worst case. See further discussion in Bremner (1999).

The above discussion also sheds light on how we can construct the identified set for β. In particular, for some of these representative points the linear programming problems in our bounding procedure may have an empty feasible region; that is, there exists no valid conditional distribution of θ that fulfils all constraints for that particular value of β. In this case, these representative points—as well as all other values of β that belong to the same sets—cannot be included in the identified set for the fixed coefficients B*.
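Step (iv) can be sketched by sampling directions: since all hyperplanes H_p β = 0 pass through the origin, each cell of the β-arrangement is a cone, and one representative direction per distinct sign vector suffices. A toy version (with sampling on the unit circle standing in for the hyperplane arrangement algorithm, and an arbitrary H_p):

```python
import math

# Toy version of step (iv): collect one representative point per cell of the
# arrangement H_p beta = 0 in R^2, by sampling directions on the unit circle.

H_p = [(1.0, 0.0), (0.0, 1.0)]                 # two lines through the origin

reps = {}                                       # sign vector -> representative beta
for k in range(360):
    ang = 2 * math.pi * (k + 0.5) / 360         # offset avoids boundary directions
    beta = (math.cos(ang), math.sin(ang))
    sgn = tuple(int(h1 * beta[0] + h2 * beta[1] >= 0) for h1, h2 in H_p)
    reps.setdefault(sgn, beta)

print(len(reps))                                # 2 lines -> 4 conical cells
assert len(reps) == 4
```

The bounding linear programs then need to be solved only at the values in `reps`, one per equivalence class.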
Therefore, the identified set B* naturally collects all sets whose representative points yield a linear program with a non-empty feasible region. Since the arrangement involves only hyperplanes through the origin, all sets take the form of polyhedral cones; hence the identified set B* is a union of polyhedral cones. This implies that B* may not be connected, and that for any β ∈ B* we also have λβ ∈ B* for all λ ≥ 0. An appropriate normalization, for example fixing ||β|| = 1, will then lead to a bounded identified set B*.

In addition to functional form assumptions, other assumptions can also help to tighten the identified set. In some cases the researcher may have access to a variable that is believed to be independent of the distribution of latent variables. If such a variable enters as an argument in the structural function, then intuitively it will induce variation in the observed conditional choice probabilities without affecting the distribution of latent variables. We will refer to such variables as exogenous covariates. A similar intuition applies if a variable is independent of the distribution of latent variables and does not enter as an argument in the structural function, but has nontrivial dependence with the variables that do enter the structural function. We will refer to such variables as instruments. Any additional variation generated in the observed conditional choice probabilities by either exogenous covariates or instruments can be used to further pin down the distribution of latent variables.

We will now distinguish between the random variables in X and Z by allowing the variables in the random vector Z to satisfy an independence assumption with the latent variables θ. (Restricting an exogenous variable from entering the structural function is often known as the "exclusion restriction" in the terminology of simultaneous equations.)

Assumption 4.2 (Independence). For all A ∈ B(Θ) we have P_{θ|Z}(A | Z = z) = P_θ(A), z-a.s.

The independence assumption restricts the econometric model by constraining the set of admissible latent variable distributions, and provides a crucial link between the conditional distributions of θ | Z = z across values of z ∈ Z. When applied to our context, Assumption 4.2 nests the two kinds of independence constraints introduced above.
Furthermore, it is without loss of generality that we continue to write the structural function ϕ as a function of Z; this helps us avoid the unnecessary repetition of treating the two kinds of independence constraints separately. Definition B.2 in Appendix B.3 provides the extension of Definition 2.1 to the case when Assumption 4.2 also holds. Also, even though Assumption 4.2 posits full independence between Z and the vector of latent variables θ, the assumption can easily be modified for the case when a subvector of Z, say Z1, is conditionally independent of θ given some other subvector of Z, say Z2. We suppress this case for simplicity, but we note that conditional independence will not have any significant impact on the results to come, and thus can be easily accommodated.

Corollary B.4 provides the extension of Theorem 3.1 to the case when Assumption 4.2 also holds, and again allows us to reduce an infinite-dimensional existence problem to a manageable finite-dimensional existence problem. Intuitively, Corollary B.4 shows that every conditional probability measure P_{θ|Y,X,Z} defined on the sets A(β) from (3.4) satisfying the independence assumption can be extended to a probability measure on B(Θ) that satisfies Assumption 4.2. This result can be used to show that Assumption 4.2 is observationally equivalent to imposing independence between Z and the response types r(θ, β). This provides a meaningful interpretation of Assumption 4.2, which might otherwise be challenging to interpret.

To extend the linear programming result of Theorem 3.2, it is straightforward to see that we must simply include the additional constraints from Corollary B.4. Without loss of generality we again assume that all values of (y, x, z) are assigned positive probability by the observed distribution. Then these constraints can be written in terms of the parameter vector ν(β) as:

Σ_{y∈{0,1}} Σ_{x∈X} ν(y, x, z_k, β, s) P(Y = y, X = x | Z = z_k) = Σ_{y∈{0,1}} Σ_{x∈X} ν(y, x, z_{k+1}, β, s) P(Y = y, X = x | Z = z_{k+1}),   (4.10)

for k = 1, ..., m_z − 1. The formal statement of the extension of Theorem 3.2 to the case when the constraints (4.10) are also imposed is provided by Corollary B.5 in Appendix B.3.

(To illustrate what we mean, suppose that Z = {z1, z2}. The first kind of independence constraint is associated with the case in which Z enters the structural function, but as an exogenous covariate. In this case our assumption implies constraints of the form:

Σ_{y∈{0,1}} Σ_{x∈X} P_{θ|Y,X,Z}(ϕ(x, z1, θ, β) ≥ 0 | Y = y, X = x, Z = z1) P(Y = y, X = x | Z = z1) = Σ_{y∈{0,1}} Σ_{x∈X} P_{θ|Y,X,Z}(ϕ(x, z2, θ, β) ≥ 0 | Y = y, X = x, Z = z2) P(Y = y, X = x | Z = z2),

for all pairs (z1, z2). The second kind of independence constraint arises if Z does not enter the structural function, but is dependent with our endogenous covariate X. In this case ϕ does not depend on Z, and our assumption implies constraints of the form:

Σ_{y∈{0,1}} Σ_{x∈X} P_{θ|Y,X,Z}(ϕ(x, θ, β) ≥ 0 | Y = y, X = x, Z = z1) P(Y = y, X = x | Z = z1) = Σ_{y∈{0,1}} Σ_{x∈X} P_{θ|Y,X,Z}(ϕ(x, θ, β) ≥ 0 | Y = y, X = x, Z = z2) P(Y = y, X = x | Z = z2),

for all pairs (z1, z2). Note that the first set of constraints reduces to the second when ϕ is specified so that it does not depend on the random vector Z. Thus, throughout the remainder of the text we will continue to write ϕ as a function of Z while keeping in mind that either type of independence assumption (exogenous covariates or instruments) may be considered.)

The independence assumptions provide additional information by constraining the set of admissible latent variable distributions to be those that are independent of the vector Z. In the case when Z is an instrument (in the terminology of the previous section), the structural function ϕ does not depend directly on Z, but only indirectly through its effect on X. This imposes a form of the exclusion restriction, which can have a substantial effect in the bounding problem by eliminating response types, just as in the previous subsection on functional form assumptions. In the next subsection we will show how monotonicity assumptions also lead to the elimination of response types.
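Constraints of the form (4.10) are linear equalities in ν and can simply be stacked into the equality-constraint matrix of the bounding linear programs. The sketch below builds those rows for a generic problem; the vectorisation order of ν and the function name are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def independence_rows(p_yx_given_z, n_types):
    """Build the equality-constraint matrix for constraints of the form
    (4.10): for every response type s and every adjacent pair (z_k, z_{k+1}),
      sum_{y,x} nu(y,x,z_k,s) P(Y=y,X=x|Z=z_k)
        = sum_{y,x} nu(y,x,z_{k+1},s) P(Y=y,X=x|Z=z_{k+1}).
    p_yx_given_z has shape (m_z, n_y, n_x); nu is vectorised with index
    order (z, y, x, s).  Layout choices are purely illustrative."""
    m_z, n_y, n_x = p_yx_given_z.shape
    n_vars = m_z * n_y * n_x * n_types

    def idx(z, y, x, s):
        return ((z * n_y + y) * n_x + x) * n_types + s

    rows = []
    for k in range(m_z - 1):
        for s in range(n_types):
            row = np.zeros(n_vars)
            for y in range(n_y):
                for x in range(n_x):
                    row[idx(k, y, x, s)] += p_yx_given_z[k, y, x]
                    row[idx(k + 1, y, x, s)] -= p_yx_given_z[k + 1, y, x]
            rows.append(row)
    return np.vstack(rows)  # A_eq, with a zero right-hand side

A_eq = independence_rows(np.full((2, 2, 2), 0.25), n_types=4)
print(A_eq.shape)  # (4, 32): one row per (adjacent z pair, response type)
```

Any ν that does not vary with z automatically satisfies these rows when the conditional distributions of (Y, X) given Z coincide, which is the sense in which (4.10) links the conditional distributions of θ across values of the instrument.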
In this subsection we will introduce monotonicity assumptions. We will see that monotonicity assumptions also constrain the bounding problem by effectively eliminating consideration of certain response types. To introduce our monotonicity assumptions, let M ⊂ {1, ..., m} × {1, ..., m} denote any collection of pairs of integers (j, k), where 1 ≤ j, k ≤ m.

Assumption 4.3 (Monotonicity). For each β ∈ B and each pair (j, k) in some set M (as defined above) we have ϕ(x_j, z_j, β, θ) ≤ ϕ(x_k, z_k, β, θ) a.s.

This monotonicity assumption states that, when comparing two points (x_j, z_j) and (x_k, z_k), the value of the structural function can be ordered by the researcher. Definition B.3 in Appendix B.4 provides the extension of Definition 2.1 to the case when Assumption 4.3 is also imposed.

Note that if the order determined by the researcher's monotonicity assumption for the pair of points (x_j, z_j) and (x_k, z_k) is ϕ(x_j, z_j, β, θ) ≤ ϕ(x_k, z_k, β, θ) (for example), then the researcher automatically rules out response types with 1{ϕ(x_j, z_j, β, θ) ≥ 0} > 1{ϕ(x_k, z_k, β, θ) ≥ 0}. In other words, it cannot be that an individual assigned the vector (x_j, z_j) under counterfactual γ ∈ Γ would have responded with Y_γ = 1 if that same individual would have responded with Y_γ′ = 0 when assigned the vector (x_k, z_k) under counterfactual γ′ ∈ Γ. The following example illustrates how this idea leads to the elimination of response types.
Example 3.
Suppose again that we have only a binary variable X ∈ {0,1} and latent variables θ (i.e., no variables Z and no fixed coefficients β). Then the structural function from (2.1) can be written as ϕ(X, θ), and the binary response vector r(β, θ) can be written as r(θ), where:

r(θ) = (1{ϕ(0, θ) ≥ 0}, 1{ϕ(1, θ) ≥ 0})′.

Note that there are only four response types; that is, r(θ) ∈ {s1, s2, s3, s4}, where:

s1 = (0, 0)′, s2 = (1, 0)′, s3 = (0, 1)′, s4 = (1, 1)′.

Without any additional restrictions, all response types, and thus all sets of the form Θ(β, s) for s ∈ {0,1}², can be assigned positive probability by the optimization problems in Theorem 3.2. Now suppose we entertain the monotonicity assumption ϕ(0, θ) ≤ ϕ(1, θ) a.s. Imposing this constraint clearly rules out the case when r(θ) = s2 = (1, 0)′, and thus the set Θ(β, s2) = {θ : r(θ) = s2} must now be assigned probability zero in any solution to the optimization problems in Theorem 3.2. Constraining such sets to be assigned zero probability reduces the size of the feasible region and thus potentially tightens the resulting bounds on counterfactual choice probabilities.

(Recall that even though the independence assumption does not affect the number of response types, the exclusion restriction reduces response types by reducing the number of variables entering the structural function.)

Monotonicity of the type entertained here has a number of precedents in the literature on treatment effects, and can be interpreted in a few different ways. For example, when Y is interpreted as a treatment indicator, the type of monotonicity introduced here nests the monotonicity assumption from Angrist et al. (1996) required for identification of the local average treatment effect.
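The elimination of response types in Example 3 is easy to mechanise: enumerate all binary response vectors and discard those that violate the assumed ordering. A minimal sketch (the helper below is our own, for illustration):

```python
import itertools

def admissible_types(n_points, order_pairs):
    """Enumerate binary response vectors s in {0,1}^n_points and keep
    those consistent with monotonicity: a pair (j, k) in order_pairs
    means the response at point j can never exceed the response at
    point k, i.e. s_j <= s_k."""
    keep = []
    for s in itertools.product([0, 1], repeat=n_points):
        if all(s[j] <= s[k] for j, k in order_pairs):
            keep.append(s)
    return keep

# Example 3: two points, x = 0 and x = 1, with phi(0, theta) <= phi(1, theta).
print(admissible_types(2, [(0, 1)]))
# -> [(0, 0), (0, 1), (1, 1)]; the type (1, 0) is eliminated
```

With many evaluation points the same filter prunes the number of ν variables entering the linear programs, which is exactly how the constraints (4.11) operate: every eliminated type has its conditional probability pinned to zero.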
Alternatively, when Y is interpreted as the binary outcome after some (possibly endogenous) treatment X, our monotonicity assumption can be interpreted as a version of the monotone treatment response assumption introduced in Manski (1997) and also considered in Manski and Pepper (1998). Finally, similar monotonicity assumptions in triangular systems have also been extensively explored by Heckman and Pinto (2018). In particular, Heckman and Pinto (2018) explore how choice theory can be used to impose monotonicity assumptions and to eliminate response types, and many of their insights are also applicable here.

Following the insights from the example above, let us define the collection of binary vectors S_M to be those that respect the monotonicity relations from Assumption 4.3. The extension of Theorem 3.1 to the case when Assumption 4.3 is imposed is provided by Corollary B.7 in Appendix B.4. To extend the results of Theorem 3.2 we must simply include the set of constraints imposed by Assumption 4.3 in our optimization problems. These constraints are provided in Corollary B.7, and can be written in terms of the parameter vector ν(β) as:

Σ_{s ∈ S_M^c} ν(y, x_j, z_j, β, s) = 0,   (4.11)

for all y ∈ {0,1} and j = 1, ..., m occurring with positive probability. Corollary B.8 in Appendix B.4 then shows the extension of Theorem 3.2 to the case when Assumption 4.3 is imposed using the constraints (4.11). Corollary B.9 then extends Proposition 3.1.

Combining all of the results in this section, any combination of Assumption 4.1, Assumption 4.2 and Assumption 4.3 can be imposed on the optimization problems in Theorem 3.2 by simply adding the corresponding combination of constraints (4.1), (4.10) and (4.11), respectively. This shows that the optimization formulation of the bounds in Theorem 3.2 can flexibly incorporate a wide variety of modelling assumptions, as will be demonstrated in the application section ahead.
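To illustrate how these pieces fit together, the following toy linear program bounds a counterfactual choice probability for a single endogenous binary X, with and without the monotonicity restriction from Example 3. It is a deliberately stripped-down sketch of the bounding programs in Theorem 3.2 (no Z, no β, illustrative numbers only); without any restrictions the bounds are the uninformative [0, 1], and monotonicity tightens the lower bound.

```python
import itertools

import numpy as np
from scipy.optimize import linprog

def ccp_bounds(p_y1_given_x0, monotone=False):
    """Toy bounding LPs: bound the counterfactual probability
    P(Y_{x=1} = 1 | X = 0) when X is endogenous, using only
    P(Y = 1 | X = 0).  Decision variables are nu(y, s), the conditional
    probability of response type s = (s_0, s_1) given (Y = y, X = 0)."""
    types = list(itertools.product([0, 1], repeat=2))
    p_y = {1: p_y1_given_x0, 0: 1.0 - p_y1_given_x0}
    n = 2 * len(types)
    idx = lambda y, s: y * len(types) + types.index(s)
    # objective: sum_y P(Y=y|X=0) * sum_{s: s_1 = 1} nu(y, s)
    c = np.zeros(n)
    for y in (0, 1):
        for s in types:
            if s[1] == 1:
                c[idx(y, s)] = p_y[y]
    # nu(y, .) must sum to one for each observed outcome y
    A_eq = np.zeros((2, n)); b_eq = np.ones(2)
    for y in (0, 1):
        for s in types:
            A_eq[y, idx(y, s)] = 1.0
    # nu >= 0; nu(y, s) = 0 if s contradicts the observed outcome
    # (s_0 != y), or if monotonicity eliminates s = (1, 0)
    bnd = []
    for y in (0, 1):
        for s in types:
            dead = (s[0] != y) or (monotone and s == (1, 0))
            bnd.append((0.0, 0.0) if dead else (0.0, 1.0))
    lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bnd).fun
    hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bnd).fun
    return lo, hi

print(ccp_bounds(0.3))                 # roughly (0.0, 1.0): uninformative
print(ccp_bounds(0.3, monotone=True))  # roughly (0.3, 1.0)
```

The only difference between the two runs is that a block of ν variables is pinned to zero, mirroring how constraints such as (4.11) enter the full problem.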
In this section we apply our method to study the impact of private health insurance on an individual's decision to visit a doctor. In general, insurance markets are plagued by problems arising from asymmetric information between consumers and insurance providers (cf. Rothschild and Stiglitz (1978)). For example, adverse selection occurs in the health insurance market when individuals have more information about their latent health determinants than the providers of health insurance. A robust prediction of the classical theory of asymmetric information is that those who are more likely to purchase insurance are also those who are more likely to experience the insured risk. On the other hand, the empirical evidence of adverse selection in health insurance markets has been scarce and mixed (see Cardon and Hendel (2001) for a discussion). Others have suggested that those who purchase insurance may be more risk averse, and so less likely to engage in activities that might cause them to experience the insured risk. Evidence of this is found in Finkelstein and McGarry (2006), who demonstrate that wealthier and more cautious individuals are more likely to have long-term care insurance, but less likely to ever use their insurance. However, in many cases the opposite is equally plausible. For example, Bajari et al. (2014) explore the effect of moral hazard in health insurance markets, which occurs when those who purchase health insurance are more likely to experience the insured risk because they no longer bear the full cost of health care.

Here we do not attempt to disentangle the effects of adverse selection, risk aversion, or moral hazard. Instead we compute various counterfactual parameters while remaining agnostic on the exact nature of the unobservables linking the health insurance and health care utilization decisions.
We take the decision to visit a doctor as our binary outcome variable of interest, and we consider an individual's private health insurance status to be an endogenous explanatory variable. This latter point is consistent with the idea that private insurance status may be dependent with individual-specific latent factors, most importantly unobserved health determinants and attitudes towards risk, that also influence an individual's propensity to visit a doctor. We use data from the 2010 wave of the Medical Expenditure Panel Survey (MEPS). This data has been analyzed by Han and Lee (2019), and we focus on the same sub-sample they consider. In particular, we focus on the month of January 2010, consider only individuals between ages 25 and 64, and drop individuals who obtain either federal or state insurance in 2010 as well as individuals who are self-employed or unemployed. These restrictions leave us with a sample of 7,555 individuals. (The "insured risk" refers to the event for which insurance was purchased. In our context, it is any event that would typically require a visit to the doctor.)
In all specifications X is a binary endogenous variable representing an individual's private insurance status, and we consider a binary health status variable (Z1) and a binary marital status variable (Z2) as regressors. Finally, we use the number of employees working for the individual's firm (Z3) as an instrument. This variable provides a measure of the size of a firm and has discrete support in the range [1, …]. Using firm size as an instrument is consistent with the evidence that larger firms are more likely to provide health insurance benefits, but do not directly influence an individual's decision to visit a doctor. The same instrument was also used in Han and Lee (2019). However, rather than imposing full independence of Z3, throughout the application we impose that Z3 is conditionally independent of θ given the vector (Z1, Z2).

A possible concern with using firm size as an instrument is that risk-averse individuals may be more likely to select into a job with a larger firm size. In an attempt to address this issue, we also investigate a weaker conditional independence assumption (which we call relaxed conditional independence) that assumes the firm size Z3 is conditionally independent of θ given (Z1, Z2) only when Z3 lies within a certain range. The main idea is that once we condition on a particular range of firm size, the remaining variation in firm size is independent of θ conditional on (Z1, Z2). We consider four ranges, given by (1, …], (…, …], (…, …], and (…, …].

Our first parameter of interest is the average treatment effect:

μ_ate := Σ_{(y,x,z)∈{0,1}×X×Z} P_{θ|Y,X,Z}(ϕ(1, z, θ, β) ≥ 0 | Y = y, X = x, Z = z) P(Y = y, X = x, Z = z) − Σ_{(y,x,z)∈{0,1}×X×Z} P_{θ|Y,X,Z}(ϕ(0, z, θ, β) ≥ 0 | Y = y, X = x, Z = z) P(Y = y, X = x, Z = z).

This parameter provides the average causal effect of obtaining health insurance on the decision to visit a doctor. Near the end of this section, we will also consider bounds on counterfactual conditional choice probabilities.
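Since each term of μ_ate is linear in the variables ν(y, x, z, s), the objective of the bounding linear programs is just a fixed coefficient vector. The sketch below builds that vector for a toy problem with binary X and a single binary Z; the indexing of response types and of ν is our own illustrative convention, not the paper's implementation.

```python
import itertools

import numpy as np

def ate_objective(p_yxz):
    """Objective coefficients for mu_ate as a linear functional of the
    variables nu(y, x, z, s).  Response types s are binary vectors
    indexed by the counterfactual point (x', z), with x', z in {0, 1}
    and s[(x', z)] stored at position 2*x' + z.  The coefficient on
    nu(y, x, z, s) is (s[(1, z)] - s[(0, z)]) * P(Y=y, X=x, Z=z)."""
    types = list(itertools.product([0, 1], repeat=4))
    cells = [(y, x, z) for y in (0, 1) for x in (0, 1) for z in (0, 1)]
    c = np.zeros(len(cells) * len(types))
    for i, (y, x, z) in enumerate(cells):
        for j, s in enumerate(types):
            c[i * len(types) + j] = (s[2 * 1 + z] - s[2 * 0 + z]) * p_yxz[y, x, z]
    return c

c = ate_objective(np.full((2, 2, 2), 0.125))
print(c.min(), c.max())  # coefficients lie in [-0.125, 0.125]
```

Maximising and minimising this same linear functional over the feasible ν (observational-equivalence, independence, and monotonicity constraints) then delivers the upper and lower bounds on μ_ate reported below.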
We will construct our bounds under the following sets of assumptions:

(A1) Only Assumptions 2.1 and 2.2.
(A2) (A1) and monotonicity, Assumption 4.3 (the discussion below will provide further details).
(A3) (A1) and independence between (Z1, Z2) and θ, Assumption 4.2.
(A4) (A1), (A2) and (A3) together.
(A5) (A1) and independence Assumption 4.2, with independence between (Z1, Z2) and θ and conditional independence of Z3 given (Z1, Z2).
(A6) (A1), (A2) and (A5) together.
(A7) (A1) and independence between (Z1, Z2) and θ, and relaxed conditional independence of Z3.
(A8) (A1), (A2) and (A7) together.

(The MEPS data includes information on self-reported health status on a scale from 1 to 5, and we regard values less than or equal to 2 as being "unhealthy." Variable Z3 is supported on the range [1, …], although the upper end of the support of Z3 has very few observations. In order to get reliable estimates of the conditional choice probabilities, we further discretize firm size into 11 bins. From Cardon and Hendel (2001), p. 408: "Another observed symptom, consistent with the theoretical predictions, is that the uninsured tend to work for small employers. Large employers can overcome adverse selection by risk pooling.")

Note that the general index function takes the form ϕ(x, z1, z2, θ, β). When we say that the monotonicity assumption is imposed in (A2), we are in fact imposing:

ϕ(1, 1, z2, θ, β) ≥ ϕ(0, 1, z2, θ, β),

for each z2 ∈ {0, 1}. This implies that for an unhealthy individual, the propensity to visit a doctor when the person has private insurance is always weakly greater than without the insurance, regardless of marital status. Finally, we consider four different models for the binary outcome variable Y:

Y = 1{ϕ(X, Z, θ) ≥ 0},   (M1)
Y = 1{ϕ(X, Z) ≥ θ},   (M2)
Y = 1{Xθ1 + Z1β1 + Z2β2 ≥ θ2},   (M3)
Y = 1{Xβ0 + Z1β1 + Z2β2 ≥ θ}.   (M4)

Under model (M1) the index function ϕ need not even be explicitly specified, so long as we imagine that it satisfies Assumption 2.1. This makes model (M1) the most flexible. In model (M2) we start to introduce functional form restrictions on ϕ. In particular, (M2) restricts the latent variable to be scalar and additively separable from the nonparametric index function ϕ(X, Z). Additional details on how to apply our method to the model in (M2) can be found in Appendix B.6. Finally, models (M3) and (M4) impose linearity of ϕ in the latent variables and in the parameters. However, we distinguish two cases. In the first case, (M3) regards (θ1, θ2) as the latent variables in the model.
The second case in (M4) is the same as the first, except that we have replaced the random slope coefficient θ1 from (M3) with a fixed coefficient β0. Model (M4) represents the additively separable linear index model that is commonly used in the empirical literature, except that we do not assume a parametric distribution for θ and do not have a model for how the endogenous variable X is generated.

Our method is employed using simple plug-in estimators for all probabilities depending on the observed random variables Y, X, Z1, Z2 and Z3. In Appendix B.5 we present a consistency result specially designed for plug-in estimation in the kinds of problems considered in this paper. Theorem B.2 in Appendix B.5 demonstrates the conditions under which simple plug-in estimation of the constraints and objective functions in our problems leads to a consistent estimate of the identified set for our functional of interest. (Importantly, our consistency result requires a slight, but vanishing, relaxation of the constraint set in our linear programs; in particular, see the sequence "b_n" in Appendix B.5. However, the scale of this sequence can be taken to be extremely small, and so has a minimal impact on the estimated bounds.) We refer the reader to Appendix B.5 for additional discussion and details.

[Table 1 here: lower (LB) and upper (UB) bounds on μ_ate for models (M1)-(M4) under assumptions (A1)-(A8).]

Table 1:
Convex hull of the sharp identified set for the average treatment effect under different specifications and under various assumptions.

We are now prepared to present the results. The identified sets for μ_ate under assumptions (A1) - (A8) and models (M1) - (M4) are reported in Table 1. For simplicity, we report the convex hull of the estimated identified set. Unsurprisingly, the bounds on μ_ate shrink as the strength of our assumptions increases. The most flexible model is (M1) under assumption (A1). It is interesting to note that the bounds on μ_ate in this case are contained strictly within the interval [−1, 1], the trivial logical bounds for μ_ate. Also note that the identified set for μ_ate always overlaps zero for model (M1). As expected, conditional independence of Z3 is a stronger assumption than relaxed conditional independence; hence the identified sets under assumptions (A7) and (A8) always contain those under (A5) and (A6). In fact, relaxed conditional independence does not provide much identifying power (compare the results under Assumptions (A3) and (A7)). On the other hand, conditional independence of Z3 does induce a noticeable narrowing of the identified set for μ_ate (compare the results under Assumptions (A3) and (A6)). The results for this model are a useful benchmark against cases where we impose more structure on the index function.

Next, model (M2) considers the threshold-crossing case. Details on our procedure to estimate this model are provided in Appendix B.6. We notice immediately in Table 1 that this model narrows the identified set relative to the case of general nonseparability. The results for this model serve as an interesting point of comparison with previous results in the partial identification literature. Model (M2) is closely related to a model considered in Shaikh and Vytlacil (2011) and Mourifié (2015).
However, both Shaikh and Vytlacil (2011) and Mourifié (2015) also have a threshold-crossing model for the binary endogenous variable X of the form:

X = 1{ν(Z1, Z2, Z3) ≥ ε}.

Furthermore, they assume independence between (Z1, Z2, Z3) and (θ, ε). Here we differ critically from these papers by not imposing a threshold-crossing model for X, instead allowing the process determining X to be unspecified. We also do not assume full independence of the instrument Z3, but instead assume various forms of conditional independence. Thus, the bounds presented in Table 1 are valid under weaker assumptions. Note that the sign of μ_ate is still not identified under model (M2) except when conditional independence of Z3 is imposed, showing the strong identifying power of this assumption.

Finally, we see in Table 1 that the linear models from (M3) and (M4) also narrow the bounds relative to the case of general nonseparability. The bounds under (M4) are nested in the bounds produced under (M2); this is expected, since (M4) is a special case of (M2). Note that the same is not true of models (M2) and (M3). Unsurprisingly, the smallest interval for μ_ate is obtained under Assumptions (A5) and (A6) for model (M4).

For models (M3) and (M4) we make use of our method for profiling β, as described in Section 4.1.2. In model (M3) we must profile on β ∈ R². Figure 2 plots the various regions of B corresponding to the points β that deliver the same collection of sets Θ(β, s) with non-empty interior, along with the associated representative points.

[Figure 2 here.]

Figure 2: Profiling of β: there are 8 representative points, each representing one of the eight sets determined by 4 hyperplanes in R².

Interestingly, we find that under Assumptions (A1) - (A4) and (A7) - (A8), the identified set of β is the entire Euclidean space R². This illustrates that non-trivial bounds on μ_ate are possible even when the structural parameters are not identified. Figure 3 shows the intervals computed using the linear programs of the form (3.15) and (3.16) for each representative point of β under our various assumptions.
The results in Table 1 for model (M3) represent the (convex hull of the) union of the intervals in Figure 3.

[Figure 3 here.]

Figure 3: This figure shows the intervals computed using the linear programs of the form (3.15) and (3.16) for each representative point of β ∈ R² when bounding μ_ate for Model (M3) under various assumptions. The active assumptions are given at the top of each illustration. The axes labelled "profile" correspond to the various representative points.

In the second linear model (M4), all coefficients are fixed. Thus, we now need to profile on a parameter vector β ∈ R³. Our profiling procedure from Section 4.1.2 returns 96 representative points, each associated with a polyhedral cone in R³. A visual representation is provided in Figure 4.

[Figure 4 here.]

Figure 4: Profiling of (θ, β) in R³: there are 96 representative points, each representing one of the 96 sets determined by 13 hyperplanes in R³.

Under Assumptions (A1) and (A2), the identified set for β is R³, while for all other assumptions (A3) - (A8) we get an informative identified set for β. In Figure 5 we also show the intervals computed using the linear programs of the form (3.15) and (3.16) for each representative point of β under our various assumptions. The results in Table 1 for model (M4) represent the (convex hull of the) union of the intervals in Figure 5.

A few interesting patterns also emerge when we consider parameters other than the average treatment effect. In particular, consider the counterfactual choice probability:

μ_ccp(y) := Σ_{z∈Z} P_{θ|Y,X,Z}(ϕ(1, z, θ, β) ≥ 0 | Y = y, X = 0, Z = z) P(Z = z | Y = y, X = 0),

for y ∈ {0, 1}. We will focus on the parameter μ_ccp(0) for simplicity, which represents the counterfactual choice probability of visiting a doctor when given private health insurance for the set of individuals who have no insurance and who have chosen not to visit a doctor, averaged across health and marital status. Table 2 reports the convex hull of the estimated identified set for μ_ccp(0) under various model specifications and under various assumptions. Similar to the bounds for μ_ate, the bounds on counterfactual choice probabilities tend to be wide and uninformative for most assumptions. Note that under Assumption (A1) we always obtain the interval [0,
1] for the estimated identified set, providing empirical confirmation of our impossibility result from Corollary 3.1. Remarkably, the bounds for models (M2) and (M4) are very similar, showing that the additional functional form assumptions in model (M4) do not have significant identifying power relative to the threshold-crossing model in (M2) for this particular counterfactual choice probability. The narrowest bounds are found in models (M2) and (M4) under Assumptions (A5) and (A6). These bounds allow us to draw conclusions about the probability that an individual visits a doctor when given private health insurance.

[Figure 5 here.]

Figure 5: This figure shows the intervals computed using the linear programs of the form (3.15) and (3.16) for each representative point of β ∈ R³ when bounding μ_ate for Model (M4) under various assumptions. The active assumptions are given at the top of each illustration. The axes labelled "profile" correspond to the various representative points.

[Table 2 here: lower (LB) and upper (UB) bounds on μ_ccp(0) for models (M1)-(M4) under various assumptions.]

Table 2:
This table reports the convex hull of the estimated bounds on μ_ccp(0), the counterfactual choice probability of visiting a doctor when granted insurance, under different assumptions, for those who chose not to visit a doctor without insurance.

To summarize, Table 1 shows that most specifications do not identify the sign of μ_ate, and Table 2 shows that most bounds on counterfactual choice probabilities are not informative. Exceptions typically occur only under the strongest independence assumptions, given by assumptions (A5) and (A6), and the strongest functional form assumptions, given in model (M4). However, even the strongest set of assumptions considered here is much weaker than the typical assumptions employed in empirical work. For the sake of comparison with our results, we also estimate the following bivariate probit model:

Y = 1{Xβ0 + Z1β1 + Z2β2 ≥ ε1},
X = 1{Z1γ1 + Z2γ2 + Z3γ3 ≥ ε2},

where (Z1, Z2, Z3) are assumed to be independent of (ε1, ε2), which are bivariate normal with mean zero, unit variance and correlation ρ. This model was estimated with our data using maximum likelihood, and μ_ate was estimated as 0.163, with a bootstrapped confidence interval of [0.…, 0.…]. This estimate of μ_ate lies within all of our bounds in Table 1, and seems to suggest strong evidence of a positive causal effect of health insurance on the decision to visit the doctor. However, the bivariate probit model is highly parameterized, and the results from Table 1 suggest that under weaker assumptions the sign of μ_ate may not be identified. (Han and Lee (2019) also obtain a similar result in a model allowing ε1 and ε2 to have unrestricted marginals and a flexible dependence structure. However, they consider a different model from ours, and the average treatment effect in Han and Lee (2019) is different from ours; we consider the average treatment effect averaged over all values of (x, z), while they report the average treatment effect at the average value of their conditioning variables. They also report the average treatment effect at various quantiles of their conditioning variables.)

This paper considers (partial) identification of a variety of parameters in binary response models with possibly endogenous regressors. Importantly, our class of models allows for general nonseparability of the index function in latent variables, and does not require any parametric distributional assumptions. Our approach to bounding counterfactual parameters is based on framing the bounds in terms of two optimization problems: one for the lower bound, and one for the upper bound. Our specific partition of the latent variable space is key to this result, allowing us to reduce an intractable infinite-dimensional problem to two tractable optimization problems with a finite number of constraints. We then show how a variety of assumptions can be easily imposed in our framework, and that many assumptions can be interpreted as eliminating particular sets from our partition of the latent variable space.
We thoroughly studied the case of a latent index function that is linear in latent variables and linear in parameters, and showed how results from computational geometry are helpful in our problem. Finally, we applied our method to study the effects of private health insurance on the utilization of health care services.

There are a number of obvious further directions in which to expand the ideas presented in this paper. For example, the consideration of multinomial choice models, triangular systems, or general simultaneous discrete choice models all seem to be natural next steps. In addition, a major emphasis in this paper, as in other recent papers, is on the interesting computational problems that arise in models that are partially identified. We believe exploring applications of state-of-the-art algorithms in computer science to problems in econometrics, as we have attempted here, is a fruitful avenue of research.

References
Aliprantis, C. D. and Border, K. C. (2006). Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer.
Allen, R. and Rehbeck, J. (2019). Identification with additively separable heterogeneity. Econometrica, 87(3):1021-1054.
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444-455.
Artstein, Z. (1983). Distributions of random sets and random selections. Israel Journal of Mathematics, 46(4):313-324.
Avis, D., Bremner, D., and Seidel, R. (1997). How good are convex hull algorithms? Computational Geometry, 7(5-6):265-301.
Avis, D. and Fukuda, K. (1996). Reverse search for enumeration. Discrete Applied Mathematics, 65(1-3):21-46.
Bajari, P., Dalton, C., Hong, H., and Khwaja, A. (2014). Moral hazard, adverse selection, and health expenditures: A semiparametric analysis. The RAND Journal of Economics, 45(4):747-763.
Balke, A. and Pearl, J. (1994). Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings 1994, pages 46-54. Elsevier.
Bennett, J. F. (1956). Determination of the number of independent parameters of a score matrix from the examination of rank orders. Psychometrika, 21(4):383-393.
Beresteanu, A., Molchanov, I., and Molinari, F. (2011). Sharp identification regions in models with convex moment predictions. Econometrica, 79(6):1785-1821.
Beresteanu, A., Molchanov, I., and Molinari, F. (2012). Partial identification using random set theory. Journal of Econometrics, 166(1):17-32.
Beresteanu, A. and Molinari, F. (2008). Asymptotic properties for a class of partially identified models. Econometrica, 76(4):763-814.
Blundell, R. W. and Powell, J. L. (2004). Endogeneity in semiparametric binary response models. Review of Economic Studies, 71(3):655-679.
Blundell, R. W. and Smith, R. J. (1989). Estimation in a class of simultaneous equation limited dependent variable models. The Review of Economic Studies, 56(1):37-57.
Bremner, D. (1999). Incremental convex hull algorithms are not output sensitive. Discrete & Computational Geometry, 21(1):57-68.
Buck, R. (1943). Partition of space. The American Mathematical Monthly, 50:541-544.
Cardon, J. H. and Hendel, I. (2001). Asymmetric information in health insurance: evidence from the National Medical Expenditure Survey. RAND Journal of Economics, pages 408-427.
Chernozhukov, V. and Hansen, C. (2005). An IV model of quantile treatment effects. Econometrica, 73(1):245-261.
Chernozhukov, V., Hong, H., and Tamer, E. (2007). Estimation and confidence regions for parameter sets in econometric models. Econometrica, 75(5):1243-1284.
Chesher, A. (2013). Semiparametric structural models of binary response: shape restrictions and partial identification. Econometric Theory, pages 231-266.
Chesher, A. and Rosen, A. M. (2014). An instrumental variable random-coefficients model for binary outcomes. The Econometrics Journal, 17(2):S1-S19.
Chesher, A. and Rosen, A. M. (2017). Generalized instrumental variable models. Econometrica, 85(3):959-989.
Chesher, A. and Rosen, A. M. (2019). Generalized instrumental variable models, methods and applications. Technical report, cemmap working paper.
Chesher, A., Rosen, A. M., and Smolinski, K. (2013). An instrumental variable model of multiple discrete choice. Quantitative Economics, 4(2):157-196.
Chiburis, R. C. (2010). Semiparametric bounds on treatment effects. Journal of Econometrics, 159(2):267-275.
Chiong, K., Hsieh, Y.-W., and Shum, M. (2017). Counterfactual estimation in semiparametric discrete-choice models. Available at SSRN 2979446.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, (3):326-334.
Cover, T. M. (1967). The number of linearly inducible orderings of points in d-space. SIAM Journal on Applied Mathematics, 15(2):434-439.
Dong, Y. and Lewbel, A. (2015). A simple estimator for binary choice models with endogenous regressors. Econometric Reviews, 34(1-2):82-105.
Durrett, R. (2010). Probability: Theory and Examples, fourth edition. Cambridge University Press.
Finkelstein, A. and McGarry, K. (2006). Multiple dimensions of private information: evidence from the long-term care insurance market. American Economic Review, 96(4):938-958.
Frisch, R. (1938). Statistical versus theoretical relations in economic macrodynamics. Paper given at League of Nations. Reprinted in Hendry, D. F. and M. S. Morgan (1995), The Foundations of Econometric Analysis.
Fukuda, K. (2014). Frequently asked questions in polyhedral computation. https://people.inf.ethz.ch/fukudak//polyfaq/polyfaq.html.
Fukuda, K. and Prodon, A. (1995). Double description method revisited. In Franco-Japanese and Franco-Chinese Conference on Combinatorics and Computer Science, pages 91-111. Springer.
Galichon, A. and Henry, M. (2011). Set identification in models with multiple equilibria. The Review of Economic Studies, 78(4):1264-1298.
Gautier, E. and Kitamura, Y. (2013). Nonparametric estimation in random coefficients binary choice models. Econometrica, 81(2):581-607.
Geyer, C. (2019). Using the RCDD package. https://cran.r-project.org/web/packages/rcdd/vignettes/vinny.pdf.
Gu, J. and Koenker, R. (2020). Nonparametric maximum likelihood methods for binary response models with random coefficients. Journal of the American Statistical Association, pages 1-47.
Gunsilius, F. F. (2020). A path-sampling method to partially identify causal effects in instrumental variable models. Working paper.
Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica, Journal of the Econometric Society, pages 1-12.
Haavelmo, T. (1944). The probability approach in econometrics. Econometrica: Journal of the Econometric Society, pages iii-115.
Han, S. and Lee, S. (2019). Estimation in a generalization of bivariate probit models with dummy endogenous regressors. Journal of Applied Econometrics, 34(6):994-1015.
Heckman, J. J. and Pinto, R. (2015). Causal analysis after Haavelmo. Econometric Theory, 31(1):115-151.
Heckman, J. J. and Pinto, R. (2018). Unordered monotonicity. Econometrica, 86(1):1-35.
Heckman, J. J. and Vytlacil, E. J. (2007). Econometric evaluation of social programs, part I: Causal models, structural models and econometric policy evaluation. Handbook of Econometrics, 6:4779-4874.
Ichimura, H. and Thompson, T. S. (1998). Maximum likelihood estimation of a binary choice model with random coefficients of unknown distribution. Journal of Econometrics, 86(2):269-295.
Imbens, G. W. and Newey, W. K. (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481-1512.
Kohler, D. A. (1967). Projections of convex polyhedral sets. Technical report, University of California Berkeley Operations Research Center.
Lewbel, A. (2000). Semiparametric qualitative response model estimation with unknown heteroscedasticity or instrumental variables. Journal of Econometrics, 97(1):145-177.
Lewbel, A. (2007). Coherency and completeness of structural models containing a dummy endogenous variable. International Economic Review, 48(4):1379-1392.
Lewbel, A., Dong, Y., and Yang, T. T. (2012). Comparing features of convenient estimators for binary choice models with endogenous regressors. Canadian Journal of Economics/Revue canadienne d'économique, 45(3):809-829.
Manski, C. F. (1977). The structure of random utility models. Theory and Decision, 8(3):229.
Manski, C. F. (1997). Monotone treatment response. Econometrica: Journal of the Econometric Society, pages 1311-1334.
Manski, C. F. (2007). Partial identification of counterfactual choice probabilities. International Economic Review, 48(4):1393-1410.
Manski, C. F. and Pepper, J. V. (1998). Monotone instrumental variables with an application to the returns to schooling. Technical report, National Bureau of Economic Research.
Manski, C. F. and Tamer, E. (2002). Inference on regressions with interval data on a regressor or outcome. Econometrica, 70(2):519-546.
Matzkin, R. L. (1992). Nonparametric and distribution-free estimation of the binary threshold crossing and the binary choice models. Econometrica: Journal of the Econometric Society, pages 239-270.
Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Econometrica, 71(5):1339-1375.
Molchanov, I. (2017). Theory of Random Sets. Springer Science & Business Media.
Molchanov, I. S. (1998). A limit theorem for solutions of inequalities. Scandinavian Journal of Statistics, 25(1):235-242.
Motzkin, T., Raiffa, H., Thompson, G., and Thrall, R. (1953). The double description method. In Kuhn, H. and Tucker, A., editors, Contributions to the Theory of Games. Princeton University Press.
Mourifié, I. (2015). Sharp bounds on treatment effects in a binary triangular system. Journal of Econometrics, 187(1):74-81.
Norberg, T. (1992). On the existence of ordered couplings of random sets, with applications. Israel Journal of Mathematics, 77(3):241-264.
Pearl, J. (2009). Causality. Cambridge University Press.
Rada, M. and Černý, M. (2018). A new algorithm for enumeration of cells of hyperplane arrangements and a comparison with Avis and Fukuda's reverse search. SIAM Journal on Discrete Mathematics, 32(1):455-473.
Rothschild, M. and Stiglitz, J. (1978). Equilibrium in competitive insurance markets: An essay on the economics of imperfect information. In Uncertainty in Economics, pages 257-280. Elsevier.
Russell, T. M. (2019). Sharp bounds on functionals of the joint distribution in the analysis of treatment effects. Journal of Business & Economic Statistics, pages 1-15.
Sainte-Beuve, M.-F. (1974). On the extension of von Neumann-Aumann's theorem. Journal of Functional Analysis, 17(1):112-129.
Shaikh, A. M. and Vytlacil, E. J. (2011). Partial identification in triangular systems of equations with binary dependent variables. Econometrica, 79(3):949-955.
Sleumer, N. H. (1999). Output-sensitive cell enumeration in hyperplane arrangements. Nordic Journal of Computing, 6(2):137-147.
Tamer, E. (2003). Incomplete simultaneous discrete response model with multiple equilibria. The Review of Economic Studies, 70(1):147-165.
Tebaldi, P., Torgovitsky, A., and Yang, H. (2019). Nonparametric estimates of demand in the California health insurance exchange. Technical report, National Bureau of Economic Research.
Torgovitsky, A. (2019). Partial identification by extending subdistributions. Quantitative Economics, 10(1):105-144.
Vytlacil, E. and Yildiz, N. (2007). Dummy endogenous variables in weakly separable models. Econometrica, 75(3):757-779.
A Proofs
A.1 Proofs of Results in the Main Text
Proof of Theorem 2.1.
Let P**_{Yγ|Y,X,Z} denote the set of all conditional distributions P_{Yγ|Y,X,Z} such that there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (y, x, z)-a.s.

To prove the result it suffices to show P*_{Yγ|Y,X,Z} = P**_{Yγ|Y,X,Z}. To do this, we will show that P*_{Yγ|Y,X,Z} ⊂ P**_{Yγ|Y,X,Z} and P**_{Yγ|Y,X,Z} ⊂ P*_{Yγ|Y,X,Z}. To this end, begin by fixing an arbitrary P_{Yγ|Y,X,Z} ∈ P*_{Yγ|Y,X,Z}. By Definition 2.2 we have:

P_{Yγ|Y,X,Z,θ}(Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (A.1)

(y, x, z, θ)-a.s. for some (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}. For this pair (P_{θ|Y,X,Z}, β) we have:

P_{Yγ|Y,X,Z,θ}(Yγ = 1 | Y = y, X = x, Z = z, θ) = P_{Yγ|Y,X,Z,θ}(Yγ = 1, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ), (y, x, z, θ)-a.s.,

which follows from (A.1). Now note:

P_{Yγ|Y,X,Z,θ}(Yγ = 1, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1{ϕ(γ(x, z), θ, β) ≥ 0}, (y, x, z, θ)-a.s.

Thus we have:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x, Z = z) = ∫ P_{Yγ|Y,X,Z,θ}(Yγ = 1 | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z}
= ∫ 1{ϕ(γ(x, z), θ, β) ≥ 0} dP_{θ|Y,X,Z}
= P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (y, x, z)-a.s.

In other words, for our P_{Yγ|Y,X,Z} ∈ P*_{Yγ|Y,X,Z} we have shown that there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (y, x, z)-a.s.

This proves P_{Yγ|Y,X,Z} ∈ P**_{Yγ|Y,X,Z}, and since P_{Yγ|Y,X,Z} ∈ P*_{Yγ|Y,X,Z} was arbitrary we conclude that P*_{Yγ|Y,X,Z} ⊂ P**_{Yγ|Y,X,Z}.

For the reverse inclusion, fix any arbitrary P_{Yγ|Y,X,Z} ∈ P**_{Yγ|Y,X,Z}. Then by definition there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (y, x, z)-a.s.

It suffices to show that for this pair (P_{θ|Y,X,Z}, β) there exists P_{Yγ|Y,X,Z,θ} satisfying:

P_{Yγ|Y,X,Z,θ}(Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (A.2)

(y, x, z, θ)-a.s. By the Radon-Nikodym Theorem, the existence of a (version of) P_{Yγ|Y,X,Z,θ} is guaranteed by the fact that P_{Yγ,θ|Y,X,Z} ≪ P_{θ|Y,X,Z}. Since all spaces involved are Euclidean, we can choose the version to be an almost surely unique regular conditional distribution (c.f. Durrett (2010), Theorem 5.1.9). By construction this P_{Yγ|Y,X,Z,θ} satisfies:

P_{Yγ,θ|Y,X,Z}(Yγ ∈ A, θ ∈ B | Y = y, X = x, Z = z) = ∫_B P_{Yγ|Y,X,Z,θ}(Yγ ∈ A | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z}, (y, x, z)-a.s.,

for every A ⊂ {0, 1} and B ∈ B(Θ). Now note that:

P_{Yγ|Y,X,Z,θ}(Yγ = 1, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1{ϕ(γ(x, z), θ, β) ≥ 0},
P_{Yγ|Y,X,Z,θ}(Yγ = 0, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1{ϕ(γ(x, z), θ, β) < 0},

(y, x, z)-a.s. Thus:

P_{Yγ|Y,X,Z}(Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z)
= ∫_Θ P_{Yγ|Y,X,Z,θ}(Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z}
= ∫_Θ P_{Yγ|Y,X,Z,θ}(Yγ = 1, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z} + ∫_Θ P_{Yγ|Y,X,Z,θ}(Yγ = 0, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z}
= ∫_Θ 1{ϕ(γ(x, z), θ, β) ≥ 0} dP_{θ|Y,X,Z} + ∫_Θ 1{ϕ(γ(x, z), θ, β) < 0} dP_{θ|Y,X,Z}
= P_{θ|Y,X,Z}(ϕ(γ(x, z), θ, β) ≥ 0 | Y = y, X = x, Z = z) + P_{θ|Y,X,Z}(ϕ(γ(x, z), θ, β) < 0 | Y = y, X = x, Z = z)
= 1, (y, x, z)-a.s.

This proves (A.2) and thus shows P_{Yγ|Y,X,Z} ∈ P*_{Yγ|Y,X,Z}. Since P_{Yγ|Y,X,Z} ∈ P**_{Yγ|Y,X,Z} was arbitrary we can conclude that P**_{Yγ|Y,X,Z} ⊂ P*_{Yγ|Y,X,Z}. Combining the two inclusions, we have P*_{Yγ|Y,X,Z} = P**_{Yγ|Y,X,Z}. This completes the proof. □
Proof of Theorem 3.1.
Let P_{Yγ|Y,X,Z} be a collection of conditional choice probabilities, and suppose there exists (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying (2.8). Note that (3.7) is equivalent to (2.8), so we can conclude that (P_{θ|Y,X,Z}, β) satisfies (3.7). Furthermore, by definition (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} implies that:

P_{θ|Y,X,Z}(θ ∈ G⁻(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, (y, x, z)-a.s.,

which delivers (3.5) and (3.6). Thus any (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying (2.8) also satisfies (3.5) - (3.7).

For the reverse, fix any β ∈ B and any collection P_{θ|Y,X,Z} of probability measures on the sets in A(β) satisfying (3.5) - (3.7). We will show that P_{θ|Y,X,Z} can be extended to a (not necessarily unique) probability measure P̃_{θ|Y,X,Z} on B(Θ) in a manner that ensures P̃_{θ|Y,X,Z} satisfies (2.8) and such that (P̃_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}. Furthermore, by the definition of an extension, P̃_{θ|Y,X,Z} will agree with P_{θ|Y,X,Z} on all sets in A(β). To construct the extension, note that the sets in A(β) form a disjoint partition of Θ. From each set Θ(β, s) in the collection A(β), select a single point θ(β, s) (if Θ(β, s) is empty, choose θ(β, s) as an arbitrary point from Θ). Furthermore, for any set A ⊂ Θ, define the indicator:

1(A, β, s) := 1{θ(β, s) ∈ A ∩ Θ(β, s)}.

Now define the function µ_{y,x,z} : B(Θ) → R as:

µ_{y,x,z}(B) := Σ_{s ∈ {0,1}^m} 1(B, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z).

To verify that this is a proper probability measure on B(Θ), we must show that (i) µ_{y,x,z}(B) ≥ µ_{y,x,z}(∅) = 0 for every B ∈ B(Θ), (ii) µ_{y,x,z}(Θ) = 1, and (iii) for any countable sequence of disjoint sets {A_i}_{i=1}^∞ in B(Θ), we have:

µ_{y,x,z}(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ_{y,x,z}(A_i).

The first property holds since 1(∅, β, s) = 0 for all s. To verify the second property, note that 1(Θ, β, s) = 1 for all s, so that:

µ_{y,x,z}(Θ) = Σ_{s ∈ {0,1}^m} 1(Θ, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= Σ_{s ∈ {0,1}^m} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= 1,

where the last line holds since P_{θ|Y,X,Z} is a probability measure on A(β). For the third property, note that for two disjoint Borel sets A₁, A₂ ∈ B(Θ) we have:

1(A₁ ∪ A₂, β, s) = 1(A₁, β, s) + 1(A₂, β, s).

Inducting on this formula, we conclude that for countable disjoint sets {A_i}_{i=1}^∞ in B(Θ), we have:

1(∪_{i=1}^∞ A_i, β, s) = Σ_{i=1}^∞ 1(A_i, β, s),

so that:

µ_{y,x,z}(∪_{i=1}^∞ A_i) = Σ_{s ∈ {0,1}^m} 1(∪_{i=1}^∞ A_i, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= Σ_{s ∈ {0,1}^m} Σ_{i=1}^∞ 1(A_i, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= Σ_{i=1}^∞ Σ_{s ∈ {0,1}^m} 1(A_i, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= Σ_{i=1}^∞ µ_{y,x,z}(A_i).

Thus, our measure satisfies countable additivity. We conclude that µ_{y,x,z} is a proper probability measure. Note that the argument above has been completed for a single triple (y, x, z) indexing the conditioning variables. However, we can repeat the same argument as above for all (y, x, z) assigned positive probability, and thus can construct a corresponding probability measure µ_{y,x,z} satisfying all the conditions described above for each such (y, x, z).

Now we define P̃_{θ|Y,X,Z} : B(Θ) → [0, 1] by P̃_{θ|Y,X,Z}(B | Y = y, X = x, Z = z) = µ_{y,x,z}(B) for all B ∈ B(Θ) and all (y, x, z) assigned positive probability. By the above, P̃_{θ|Y,X,Z}(· | Y = y, X = x, Z = z) is a proper probability measure on B(Θ) for each (y, x, z). Also note that for any triple (1, x, z) assigned positive probability, the pair (P̃_{θ|Y,X,Z}, β) satisfies:

P̃_{θ|Y,X,Z}(G⁻(1, x, z, β) | Y = 1, X = x, Z = z)
= Σ_{s ∈ S_j} P̃_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x, Z = z)
= Σ_{s ∈ S_j} Σ_{s′ ∈ {0,1}^m} 1(Θ(β, s), β, s′) P_{θ|Y,X,Z}(Θ(β, s′) | Y = 1, X = x, Z = z)
= Σ_{s ∈ S_j} 1(Θ(β, s), β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x, Z = z)
= 1,

which follows from (B.8). Furthermore, for any triple (0, x, z) assigned positive probability, the pair (P̃_{θ|Y,X,Z}, β) also satisfies:

P̃_{θ|Y,X,Z}(G⁻(0, x, z, β) | Y = 0, X = x, Z = z)
= Σ_{s ∈ S_j^c} P̃_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x, Z = z)
= Σ_{s ∈ S_j^c} Σ_{s′ ∈ {0,1}^m} 1(Θ(β, s), β, s′) P_{θ|Y,X,Z}(Θ(β, s′) | Y = 0, X = x, Z = z)
= Σ_{s ∈ S_j^c} 1(Θ(β, s), β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x, Z = z)
= 1,

which follows from (B.9). Conclude that:

P̃_{θ|Y,X,Z}(θ ∈ G⁻(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, a.s.

This shows that (P̃_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}. Finally, setting C_j := {θ : ϕ(γ(x_j, z_j), θ, β) ≥ 0}, it is straightforward to show that:

P̃_{θ|Y,X,Z}(C_j | Y = y, X = x_j, Z = z_j) = Σ_{s ∈ {0,1}^m} 1(C_j, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j)
= Σ_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j)
= P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x_j, Z = z_j),

for all (y, x_j, z_j) assigned positive probability, which follows from (3.7). This is exactly condition (2.8). Conclude that (P̃_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} and that (P̃_{θ|Y,X,Z}, β) satisfies (2.8). This completes the proof. □
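The point-mass extension used in the proof above can be sketched numerically. The cells, representative points, and cell probabilities below are hypothetical stand-ins for Θ(β, s), θ(β, s), and P_{θ|Y,X,Z}(Θ(β, s) | ·); the sketch only illustrates that the resulting measure is a proper probability measure that agrees with the cell probabilities on unions of cells.

```python
# Hypothetical sketch of the extension in the proof of Theorem 3.1: given
# probabilities on the cells of a partition of [0, 1), build the measure mu
# that puts a point mass at one representative point per cell. All numbers
# are illustrative, not taken from the paper.
cells = {  # label s -> (interval standing in for Theta(beta, s), representative theta(beta, s))
    (0, 0): ((0.0, 0.25), 0.1),
    (0, 1): ((0.25, 0.5), 0.3),
    (1, 0): ((0.5, 0.75), 0.6),
    (1, 1): ((0.75, 1.0), 0.9),
}
cell_prob = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.1}

def mu(B):
    """Extended measure of a set B, passed as a membership function."""
    total = 0.0
    for s, ((lo, hi), theta) in cells.items():
        # indicator 1(B, beta, s) = 1{theta(beta, s) in B ∩ Theta(beta, s)}
        if B(theta) and lo <= theta < hi:
            total += cell_prob[s]
    return total

print(round(mu(lambda t: True), 10))              # mu of the whole space is 1
print(round(mu(lambda t: 0.25 <= t < 0.75), 10))  # union of two cells: 0.3 + 0.4
```

Because each cell contributes through a single point mass, countable additivity is immediate, and the extension agrees with the original cell probabilities on every union of cells; it is not unique, since any choice of representative points yields a valid extension.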
Proof of Theorem 3.2.
This result is an immediate consequence of Theorem 3.1. □
Proof of Proposition 3.1.
First note that β ∈ B enters the constraints in Theorem 3.2 only through the constraints (3.12); in particular, only through its determination of which sets Θ(β, s) are empty versus non-empty. Now define:

S(β) := {s ∈ {0,1}^m : Θ(β, s) ≠ ∅}.

Now define an equivalence relation ∼ on B as follows: β ∼ β′ if and only if S(β) = S(β′). This equivalence relation will partition B into at most 2^{2^m} equivalence classes (which is the total number of ways of choosing k vectors from {0,1}^m for k = 0, 1, ..., 2^m). Furthermore, any two values β and β′ belonging to the same equivalence class will deliver the same values for the linear programs (3.15) and (3.16) (by construction of the equivalence class). Thus, it is sufficient to consider only one β from each equivalence class in Theorem 3.2. However, there are at most 2^{2^m} such β's to consider. □

Proof of Corollary 3.1.
A counterfactual choice probability of the form in (3.9) can be written as:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x_j, Z = z_j) = Σ_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j).

Note that the result is trivial if we consider (y, x_j, z_j) assigned zero probability, since in that case Theorem 3.1 implies there are no constraints on the counterfactual choice probability above. Thus, assume that (y, x_j, z_j) is assigned positive probability. By assumption, γ(j) ≠ j. We now claim that (i) S_γ(j) ∩ S_j ≠ ∅, (ii) S_γ(j) ∩ S_j^c ≠ ∅, (iii) S_γ(j)^c ∩ S_j ≠ ∅, (iv) S_γ(j)^c ∩ S_j^c ≠ ∅. In particular, any s ∈ {0,1}^m with jth entry equal to 1 and γ(j)th entry equal to 1 belongs to S_γ(j) ∩ S_j. Denote such a vector by t1 ∈ {0,1}^m. Similarly, any s ∈ {0,1}^m with jth entry equal to 0 and γ(j)th entry equal to 1 belongs to S_γ(j) ∩ S_j^c. Denote such a vector by t2 ∈ {0,1}^m. Continuing in this way, let t3 ∈ S_γ(j)^c ∩ S_j and t4 ∈ S_γ(j)^c ∩ S_j^c. Now fix any ϕ and β such that all 2^m sets Θ(β, s) are non-empty (such a choice is always possible under Assumptions 2.1 and 2.2). For any κ ∈ [0, 1] consider the following conditional distribution on sets A ∈ A(β):

P_{θ|Y,X,Z}(A | Y = y, X = x_j, Z = z_j) =
  κ, if A = Θ(β, t1) and y = 1,
  κ, if A = Θ(β, t2) and y = 0,
  1 − κ, if A = Θ(β, t3) and y = 1,
  1 − κ, if A = Θ(β, t4) and y = 0,
  0, otherwise.

If y = 1 we have:

Σ_{s ∈ S_j} P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x_j, Z = z_j) = κ + (1 − κ) = 1,

and if y = 0 we have:

Σ_{s ∈ S_j^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x_j, Z = z_j) = κ + (1 − κ) = 1.

This shows that constraints (3.5) and (3.6) are satisfied. Finally, note that for either y = 0 or y = 1 we have:

Σ_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = κ.

Since this can be completed for any κ ∈ [0, 1], the counterfactual choice probability in (3.9) can take any value in [0, 1]. Moreover:

P_{Yγ|X,Z}(Yγ = 1 | X = x_j, Z = z_j) = Σ_{y ∈ {0,1}} P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x_j, Z = z_j) P_{Y|X,Z}(Y = y | X = x_j, Z = z_j) = κ.

Again, since this can be completed for any κ ∈ [0, 1], the identified set for P_{Yγ|X,Z}(Yγ = 1 | X = x_j, Z = z_j) is the interval [0, 1]. □

Proof of Proposition 4.1.
This follows immediately from the results of Buck (1943). □

A.2 Measurability Results

Definition A.1 (Effros-Measurability, Random Set). Let (Ω, A, P) be a probability space, let V be a Polish space, and let O_V denote the collection of all open sets on V. A multifunction V : Ω → F_V is called Effros-measurable if for every A ∈ O_V we have V⁻(A) := {ω ∈ Ω : V(ω) ∩ A ≠ ∅} ∈ A.

Definition A.2 (Selections). A random element V : Ω → V is called a (measurable) selection of V if V(ω) ∈ V(ω) for P-almost all ω ∈ Ω.

Lemma A.1.
Suppose Assumption 2.1 holds. Then for each β ∈ B, the map G⁻(·, β) : Y × X × Z → Θ is an Effros-measurable multifunction, and thus is a random set on Y × X × Z.

Proof of Lemma A.1.
Fix any β ∈ B and any open set A ⊂ Θ. We have:

{(y, x, z) : G⁻(y, x, z, β) ∩ A ≠ ∅} = G_0(A) ∪ G_1(A),

where:

G_0(A) := {(0, x, z) : G⁻(0, x, z, β) ∩ A ≠ ∅},
G_1(A) := {(1, x, z) : G⁻(1, x, z, β) ∩ A ≠ ∅}.

Since B(Y) ⊗ B(X) ⊗ B(Z) is closed under unions, it suffices to show G_0(A), G_1(A) ∈ B(Y) ⊗ B(X) ⊗ B(Z). In particular, it suffices to show Effros-measurability of the maps:

G⁻(0, x, z, β) = {θ : ϕ(x, z, θ, β) < 0},
G⁻(1, x, z, β) = {θ : ϕ(x, z, θ, β) ≥ 0}.

Effros-measurability of G⁻(0, x, z, β) follows directly from Lemma 18.7 in Aliprantis and Border (2006) after noting that ϕ(·, β) is a Caratheodory function, and (−∞, 0) is an open set. Measurability of G⁻(1, x, z, β) follows from Lemma 18.4.1 in Aliprantis and Border (2006) if we can establish Effros-measurability of the multifunctions:

G⁻_1(1, x, z, β) := {θ : ϕ(x, z, θ, β) > 0},
G⁻_2(1, x, z, β) := {θ : ϕ(x, z, θ, β) = 0}.

Effros-measurability of G⁻_1(1, x, z, β) also follows directly from Lemma 18.7 in Aliprantis and Border (2006) after noting that ϕ is a Caratheodory function, and (0, +∞) is an open set. Effros-measurability of G⁻_2(1, x, z, β) follows from Corollary 18.8 in Aliprantis and Border (2006) after noting ϕ is a Caratheodory function, and Θ is compact under Assumption 2.1. This completes the proof. □

Given a σ-algebra F on a space R, the P-completion of F is the smallest σ-algebra containing F as well as all P-null sets of R. The intersection of all P-completions of F (over all P) is called the universal σ-algebra, and functions that are measurable with respect to the universal σ-algebra are said to be universally measurable. The following lemma shows that the random set G⁻(Y, X, Z, β) admits a universally measurable selection under Assumption 2.1.
Lemma A.2.
Suppose Assumption 2.1 holds. Then the random set G⁻(Y, X, Z, β) admits a universally measurable selection for every β ∈ B ensuring it is non-empty almost surely.

Proof of Lemma A.2. Fix some β ∈ B ensuring G⁻(Y, X, Z, β) is almost surely non-empty. By Lemma A.1, G⁻(Y, X, Z, β) is an Effros-measurable multifunction, and by Theorem 1.3.3 in Molchanov (2017) this implies that the graph of G⁻(Y, X, Z, β) belongs to B(Y) ⊗ B(X) ⊗ B(Z) ⊗ B(Θ); that is, G⁻(Y, X, Z, β) is graph-measurable. The result then follows immediately from Theorem 3 of Sainte-Beuve (1974). □
B Additional Definitions and Results
B.1 Identified Set of Conditional Latent Variable Distributions
For the sake of comparison with the previous literature, we now present a result which connects the observed conditional choice probabilities to our definition of the identified set based on the selection relation. The identified set for P_{θ|X,Z} is given by:

P*_{θ|X,Z} := { P_{θ|X,Z} : ∃ (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} s.t. P_{θ|X,Z} = ∫ P_{θ|Y,X,Z} dP_{Y|X,Z} a.s. }.

We now have the following result.
Theorem B.1.
Suppose Assumption 2.1 holds. Then a collection P θ | X,Z satisfies P θ | X,Z ∈ P ∗ θ | X,Z if andonly if P θ | X,Z satisfies: P θ | X,Z ( ϕ ( x, z, θ, β ) ≥ | X = x, Z = z ) = P Y | X,Z ( Y = 1 | X = x, Z = z ) , (B.1)( x, z ) − a.s. for some β ∈ B .Proof of Theorem B.1. Let us define: P ∗∗ θ | X,Z = (cid:8) P θ | X,Z : ∃ β ∈ B s.t. P θ | X,Z ( ϕ ( x, z, θ, β ) ≥ | X = x, Z = z ) = P Y | X,Z ( Y = 1 | X = x, Z = z ) , ( x, z ) − a.s. (cid:9) . We want to show that P ∗ θ | X,Z = P ∗∗ θ | X,Z , which will be accomplished by showing both P ∗ θ | X,Z ⊂ P ∗∗ θ | X,Z and P ∗∗ θ | X,Z ⊂ P ∗ θ | X,Z . To show P ∗ θ | X,Z ⊂ P ∗∗ θ | X,Z , fix any P θ | X,Z ∈ P ∗ θ | X,Z . Then by Definition 2.1 and thedefinition of P ∗ θ | X,Z above, there exists θ : Ω → Θ with θ ∼ P θ | Y,X,Z and an element β ∈ B such that: P θ | Y,X,Z ( θ ∈ G − ( Y, X, Z, β ) | Y = y, X = x, Z = z ) = 1 , ( y, x, z ) − a.s., and: P θ | X,Z ( θ ∈ A | X = x, Z = z ) = (cid:90) P θ | Y,X,Z ( θ ∈ A | Y = y, X = x, Z = z ) dP Y | X,Z , ( x, z ) − a.s., A ∈ B (Θ). Now define the sets: B ( x, z, β ) := { θ : ϕ ( x, z, θ, β ) ≥ } ,B ( x, z, β ) := { θ : ϕ ( x, z, θ, β ) < } . By continuity of ϕ ( x, z, · , β ), we have B ( x, z, β ) , B ( x, z, β ) ∈ B (Θ) for each ( x, z, β ). Now for our pair( θ, β ) we have: P θ | X,Z ( θ ∈ B ( x, z, β ) | X = x, Z = z )= (cid:88) y ∈{ , } P θ | Y,X,Z ( θ ∈ B ( x, z, β ) | Y = y, X = x, Z = z ) P Y | X,Z ( Y = y | X = x, Z = z )= (cid:88) y ∈{ , } P θ | Y,X,Z ( θ ∈ B ( x, z, β ) ∩ G − ( y, x, z, β ) | Y = y, X = x, Z = z ) P Y | X,Z ( Y = y | X = x, Z = z )(B.2)= P θ | Y,X,Z ( θ ∈ G − (1 , x, z, β ) | Y = 1 , X = x, Z = z ) P Y | X,Z ( Y = 1 | X = x, Z = z ) (B.3)= P Y | X,Z ( Y = 1 | X = x, Z = z ) , (B.4)( x, z ) − a.s. Note that (B.2) follows from the fact that P θ | Y,X,Z ( θ ∈ G − ( y, x, z, β ) | Y = y, X = x, Z = z ) = 1a.s. 
since P θ | Y,X,Z ∈ P ∗ θ | Y,X,Z by assumption; (B.3) follows from the fact that B ( x, z, β ) ∩ G − (0 , x, z, β ) = ∅ and B ( x, z, β ) = G − (1 , x, z, β ); (B.4) follows from the fact that P θ | Y,X,Z ( θ ∈ G − (1 , x, z, β ) | Y = 1 , X = x, Z = z ) = 1 a.s. since P θ | Y,X,Z ∈ P ∗ θ | Y,X,Z by assumption. Repeating an identical derivation shows that: P θ | X,Z ( θ ∈ B ( x, z, β ) | X = x, Z = z ) = P Y | X,Z ( Y = 0 | X = x, Z = z ) , ( x, z ) − a.s. Since P θ | X,Z ∈ P ∗ θ | X,Z was arbitrary, this proves that P ∗ θ | X,Z ⊂ P ∗∗ θ | X,Z .To show P ∗∗ θ | X,Z ⊂ P ∗ θ | X,Z , fix any P θ | X,Z ∈ P ∗∗ θ | X,Z . We want to show that P θ | X,Z ∈ P ∗ θ | X,Z . To do so, wemust show that: (i) there exists P θ | Y,X,Z such that: P θ | X,Z ( θ ∈ A | X = x, Z = z )= (cid:90) { , } P θ | Y,X,Z ( θ ∈ A | Y = y, X = x, Z = z ) dP Y | X,Z , ( x, z ) − a.s., for every A ∈ B (Θ), and (ii) there is a β ∈ B such that: P θ | Y,X,Z ( θ ∈ G − ( Y, X, Z, β ) | Y = y, X = x, Z = z ) = 1 , ( y, x, z ) − a.s., (B.5)for the same P θ | Y,X,Z from part (i). First note that, by the Radon-Nikodym Theorem, the existence ofa (version of) P θ | Y,X,Z is guaranteed by the fact that P θ,Y | X,Z (cid:28) P Y | X,Z . Since all spaces involved areeuclidean, we can choose the version to be an almost surely unique regular conditional distribution (c.f.Durrett (2010) Theorem 5.1.9). By construction P θ | Y,X,Z satisfies: P θ,Y | X,Z ( θ ∈ A, Y ∈ B | X = x, Z = z ) 60 (cid:88) y ∈ B P θ | Y,X,Z ( θ ∈ A | Y = y, X = x, Z = z ) P Y | X,Z ( Y = y | X = x, Z = z ) , for every A ∈ B (Θ) and B ⊂ { , } . This verifies part (i). It thus remains only to show that any such P θ | Y,X,Z must also satisfy (B.5). Since P θ | X,Z ∈ P ∗∗ θ | X,Z , there exists a value β ∈ B such that: P θ | X,Z ( ϕ ( x, z, θ, β ) ≥ | X = x, Z = z ) = P Y | X,Z ( Y = 1 | X = x, Z = z ) , ( x, z ) − a.s. 
For this value of β, note that:

P_{θ|X,Z}(ϕ(x, z, θ, β) ≥ 0 | X = x, Z = z)
= P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 1, X = x, Z = z) P(Y = 1 | X = x, Z = z)
+ P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 0, X = x, Z = z) P(Y = 0 | X = x, Z = z).

Furthermore, by assumption we have:

P_{θ|X,Z}(ϕ(x, z, θ, β) ≥ 0 | X = x, Z = z) = P(Y = 1 | X = x, Z = z), (x, z)−a.s.

Thus:

P(Y = 1 | X = x, Z = z)
= P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 1, X = x, Z = z) P(Y = 1 | X = x, Z = z)
+ P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 0, X = x, Z = z) P(Y = 0 | X = x, Z = z), (B.6)

(x, z)−a.s. Now note by (2.1):

P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 0, X = x, Z = z) P(Y = 0 | X = x, Z = z)
= P_{θ,Y|X,Z}(ϕ(x, z, θ, β) ≥ 0, Y = 0 | X = x, Z = z)
= P_{θ|X,Z}(ϕ(x, z, θ, β) ≥ 0, ϕ(x, z, θ, β) < 0 | X = x, Z = z)
= 0.

Conclude that (B.6) is true if and only if:

P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 1, X = x, Z = z) = 1, (x, z)−a.s.

Similar logic shows:

P_{θ|Y,X,Z}(ϕ(x, z, θ, β) < 0 | Y = 0, X = x, Z = z) = 1, (x, z)−a.s.

Finally, note that by the definition of G⁻¹(·, β) : Y × X × Z →
Θ we have:

P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 1, X = x, Z = z) = P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = 1, X = x, Z = z),
P_{θ|Y,X,Z}(ϕ(x, z, θ, β) < 0 | Y = 0, X = x, Z = z) = P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = 0, X = x, Z = z).

Thus we conclude that P_{θ|Y,X,Z} satisfies (B.5). Since P_{θ|X,Z} ∈ P**_{θ|X,Z} was arbitrary, we conclude P**_{θ|X,Z} ⊂ P*_{θ|X,Z}. Combining everything, we conclude P**_{θ|X,Z} = P*_{θ|X,Z}. This completes the proof. □
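As a concrete illustration of the rationalization condition (B.1), the sketch below checks it numerically for a toy discrete model. The index function `phi`, the two latent support points, the candidate weights, and the value β = 0.5 are all hypothetical choices made for this example, not objects taken from the paper.

```python
# A minimal sketch of checking condition (B.1) on a finite latent support,
# assuming a hypothetical index phi(x, z, theta, beta) = theta[0] + beta*x - theta[1]*z.

def phi(x, z, theta, beta):
    # Hypothetical index function; any measurable phi works in Theorem B.1.
    return theta[0] + beta * x - theta[1] * z

def rationalizes(p_theta_given_xz, p_y1_given_xz, beta, support, tol=1e-9):
    """Check P_{theta|X,Z}(phi(x, z, theta, beta) >= 0 | x, z) = P(Y=1 | x, z)
    at every observed (x, z) cell."""
    for (x, z), weights in p_theta_given_xz.items():
        mass = sum(w for th, w in zip(support, weights)
                   if phi(x, z, th, beta) >= 0)
        if abs(mass - p_y1_given_xz[(x, z)]) > tol:
            return False
    return True

# Two latent types; at (x, z) = (1, 1) with beta = 0.5:
#   type (0.2, 0.4):  0.2 + 0.5 - 0.4 =  0.3 >= 0  (chooses Y = 1)
#   type (-1.0, 0.3): -1.0 + 0.5 - 0.3 = -0.8 <  0  (chooses Y = 0)
support = [(0.2, 0.4), (-1.0, 0.3)]
p_theta = {(1, 1): [0.7, 0.3]}   # candidate conditional weights
p_y1 = {(1, 1): 0.7}             # observed choice probability
print(rationalizes(p_theta, p_y1, beta=0.5, support=support))  # True
```

Any candidate distribution putting a different mass on the first type (say 0.5 instead of 0.7) fails the check, mirroring how (B.1) screens collections P_{θ|X,Z} out of the identified set.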
Theorem B.1 says that in order to verify whether a given collection of distributions P_{θ|X,Z} belongs to the identified set P*_{θ|X,Z}, it suffices to find some value of the fixed coefficient β ∈ B such that P_{θ|X,Z} rationalizes the observed conditional choice probabilities via (B.1). Note that Assumption 2.1 does not impose any assumptions on the dependence between the variables X and Z and the latent variables θ; in other words, this result holds whether X and Z are endogenous, exogenous, or any combination of the two. As discussed in Chesher and Rosen (2014), a binary response model with endogenous regressors is incomplete when the mechanism generating the endogenous regressors is left unspecified, as in our environment. In the presence of incompleteness there is no longer a unique distribution of the endogenous outcome variables given fixed primitives of the model. Chesher and Rosen (2014) propose the use of Artstein's inequalities from random set theory to characterize the distributions of selections from the incomplete binary response model in (2.1), and Theorem 3.1 in Chesher and Rosen (2014) provides a general characterization of the identified set of latent variable distributions in the case of a linear index function. The key difference between Theorem B.1 above and Theorem 3.1 in Chesher and Rosen (2014) is the fact that we condition on the value of the (possibly endogenous) variables X and Z. Conditioning on the value of the endogenous variables allows us to construct a simpler set of constraints than those imposed by Artstein's inequalities, as demonstrated in Appendix C. Intuitively, conditioning on a fixed value of any endogenous regressors resolves the issue of model incompleteness. This strategy is not applicable in all environments where the model is incomplete, but appears to be applicable whenever the endogenous regressors in the model are observable.
The identified set for the unconditional latent variable distribution (as was considered in Chesher and Rosen (2014)) can then be recovered from P*_{θ|X,Z}.

B.2 Functional Form Assumptions
Under Assumption 4.1, we will have the following definition of the identified set, which is analogous to both Definitions 2.1 and 2.2.
Definition B.1.
Under Assumptions 2.1 and 4.1, the identified set I*_{Y,X,Z} is the set of all pairs (P_{θ|Y,X,Z}, β) such that:

P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, and
P_{θ|Y,X,Z}(ϕ(X, Z, θ, β) = 0 | Y = y, X = x, Z = z) = 0,

(y, x, z)−a.s., where the function ϕ(·, β) : X × Z × Θ → R is linear in θ for every β ∈ B. Furthermore, under Assumptions 2.1, 2.2, and 4.1, the identified set of counterfactual conditional choice probabilities P*_{Y_γ|Y,X,Z,θ} is the set of all conditional distributions P_{Y_γ|Y,X,Z,θ} satisfying:

P_{Y_γ|Y,X,Z,θ}(Y_γ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (y, x, z, θ)−a.s.,

for some pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}.

There are competing definitions of incompleteness in the literature, although the definition of an incomplete model used in Chesher and Rosen (2014) is equivalent to the definition in Tamer (2003) and Lewbel (2007). The definition of an incomplete model discussed here is consistent with these papers.

Note that we have kept the notation for the identified set the same as in Definitions 2.1 and 2.2 (e.g. I*_{Y,X,Z}, P*_{Y_γ|Y,X,Z,θ}), although these identified sets will be different depending on whether Assumption 4.1 holds. We will continue using the same notation for the identified set in further subsections of this Appendix as we introduce even more assumptions, but will always distinguish the definitions by stating the assumptions that hold in each context. Here we do not consider the case when Assumptions 4.2 and 4.3 hold, but we again note that this definition (and the results to follow) are easily modified to accommodate the case when any combination of these assumptions hold. We now provide the following Corollary whose proof follows almost identically to that of Theorems 2.1 and 3.1.
Corollary B.1.
Under Assumptions 2.1, 2.2, and 4.1, a distribution of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z} satisfies P_{Y_γ|Y,X,Z} ∈ P*_{Y_γ|Y,X,Z} if and only if there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.1) satisfying:

P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (B.7)

(y, x, z)−a.s. Furthermore, for any collection of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z}, there exists a collection of Borel conditional probability measures P_{θ|Y,X,Z} satisfying (B.7) with (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.1) if and only if there exists a collection P_{θ|Y,X,Z} of probability measures on the sets in A(β) from (3.4) satisfying:

∑_{s ∈ S_j} P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x_j, Z = z_j) = 1, (B.8)
∑_{s ∈ S_j^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x_j, Z = z_j) = 1, (B.9)
∑_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j),
∑_{s ∈ S_ϕ^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = 0, (B.10)

for y ∈ {0, 1} and j ∈ {1, . . . , m} assigned positive probability, where S_ϕ denotes the collection of binary vectors s ∈ {0, 1}^m corresponding to the sets Θ(β, s) that have non-empty interior.

Proof of Corollary B.1. The first statement follows a proof identical to the proof of Theorem 2.1. For the second statement, the forward direction is identical to the proof of Theorem 3.1. The reverse direction is similar to the proof of Theorem 3.1, with the exception that the extension from a measure on A(β) to B(Θ) is slightly different. To construct the extension, note that the sets in A(β) form a disjoint partition of Θ. Now select a single point θ(β, s) from the interior of each set Θ(β, s) in the collection A(β); if Θ(β, s) has empty interior, choose θ(β, s) as an arbitrary point from Θ.
For any set A ⊂ Θ, define the indicator:

1(A, β, s) := 1{θ(β, s) ∈ A ∩ int(Θ(β, s))}.

Furthermore, define the function µ_{y,x,z} : B(Θ) → R as:

µ_{y,x,z}(B) := ∑_{s ∈ {0,1}^m} 1(B, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z).

The remainder of the proof of Theorem 3.1 now applies without modification. □
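Because the system in Corollary B.1 is linear in the unknown probabilities ν(y, x_j, z_j, β, s), bounds on any linear counterfactual functional can be computed by linear programming. The following sketch solves a deliberately tiny hypothetical instance (one observed cell, four sets Θ(β, s)) by enumerating basic feasible solutions instead of calling an LP solver; in applied work one would pass the same constraint matrices to a standard LP package. All numbers below are illustrative assumptions.

```python
from itertools import combinations

def solve_square(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination;
    return None if A is (numerically) singular."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lp_bounds(c, A_eq, b_eq):
    """Min and max of c'nu subject to A_eq nu = b_eq and nu >= 0, found by
    enumerating basic feasible solutions (a bounded LP attains its optimum
    at one of them). Only sensible for very small instances."""
    m, n = len(A_eq), len(A_eq[0])
    vals = []
    for basis in combinations(range(n), m):
        sub = [[A_eq[i][j] for j in basis] for i in range(m)]
        x_b = solve_square(sub, b_eq)
        if x_b is None or any(v < -1e-9 for v in x_b):
            continue
        nu = [0.0] * n
        for j, v in zip(basis, x_b):
            nu[j] = max(v, 0.0)
        vals.append(sum(ci * vi for ci, vi in zip(c, nu)))
    return min(vals), max(vals)

# Hypothetical instance: four sets Theta(beta, s); a (B.8)-type constraint
# pins down nu_1 + nu_2 = P(Y=1|x,z) = 0.6, total mass nu_1 + ... + nu_4 = 1,
# and the counterfactual index set S_gamma(j) is {1, 3}.
A_eq = [[1, 1, 0, 0], [1, 1, 1, 1]]
b_eq = [0.6, 1.0]
print(lp_bounds([1, 0, 1, 0], A_eq, b_eq))  # (0.0, 1.0): the sharp bounds
```

Without further restrictions the toy counterfactual probability is unbounded within [0, 1]; adding rows for constraints like (B.10), (4.10), or (4.11) to `A_eq` is exactly how the later corollaries tighten these bounds.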
Analogous to Theorem 2.1, the first part of Corollary B.1 provides the theoretical link between the identified set for counterfactual choice probabilities and the identified set for the pair (P_{θ|Y,X,Z}, β) under the additional assumption of linearity in latent variables. Analogous to Theorem 3.1, the second part of Corollary B.1 reduces an infinite-dimensional existence problem to a finite-dimensional existence problem amenable to analysis using optimization problems. Building on the intuition provided in Example 1, the second part of Corollary B.1 demonstrates that Assumption 4.1 can be imposed by considering only a finite number of equality constraints on a distribution P_{θ|Y,X,Z} defined on sets of the form Θ(β, s). By definition of the set S_ϕ, condition (B.10) simply assigns probability zero to all sets Θ(β, s) that are empty due to the linearity restriction from Assumption 4.1. We also have the following result.

Corollary B.2.
Under Assumptions 2.1, 2.2, and 4.1, the identified set for the counterfactual conditional choice probability P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j) is given by:

⋃_{β ∈ B} [ν^{ℓb}(y, x_j, z_j, β), ν^{ub}(y, x_j, z_j, β)],

where ν^{ℓb}(y, x_j, z_j, β) and ν^{ub}(y, x_j, z_j, β) are determined by the optimization problems:

ν^{ℓb}(y, x_j, z_j, β) := min_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.1), (B.11)
ν^{ub}(y, x_j, z_j, β) := max_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.1). (B.12)

Note that this Corollary is identical to Theorem 3.2 with the exception that we have now imposed Assumption 4.1, and thus have also included constraints of the form (4.1). With the exception of these additional constraints, the optimization problems that characterize the bounding problem are the same as before. This result does not consider the case when Assumptions 4.2 and 4.3 also hold, but as we remarked above it is easily modified to accommodate any combination of Assumptions 4.2 and 4.3 by including certain additional constraints (seen in the next subsections of this appendix). Similar to the comment following Theorem 3.2, alternative counterfactual quantities can also be bounded using this result by simply modifying the objective function in (B.11) and (B.12). Finally, we present the following corollary to Proposition 3.1.

Corollary B.3.
Suppose that Assumptions 2.1, 2.2 and 4.1 hold. Then there exists a (not necessarily unique) finite subset B′ ⊂ B such that:

{ν ∈ R^{d_ν} : ∃ β ∈ B s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.1) and ν = ν(β)}
= {ν ∈ R^{d_ν} : ∃ β ∈ B′ s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.1), and ν = ν(β)}.
The proof is identical to the proof of Proposition 3.1 after redefining S(β) from the proof of Proposition 3.1 to be S(β) := {s ∈ {0, 1}^m : int(Θ(β, s)) ≠ ∅}. □

B.3 Independence Assumptions
Under Assumption 4.2, we will have the following definition of the identified set, which is analogous to both Definitions 2.1 and 2.2.
Definition B.2.
Under Assumptions 2.1 and 4.2, the identified set I*_{Y,X,Z} is the set of all pairs (P_{θ|Y,X,Z}, β) such that:

(i) (P_{θ|Y,X,Z}, β) satisfies:

P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, (y, x, z)−a.s.; and

(ii) for all Borel sets A ∈ B(Θ) we have P_{θ|Z}(A | Z = z) = P_θ(A), z−a.s.

Furthermore, under Assumptions 2.1, 2.2 and 4.2, the identified set of counterfactual conditional choice probabilities P*_{Y_γ|Y,X,Z,θ} is the set of all conditional distributions P_{Y_γ|Y,X,Z,θ} satisfying:

P_{Y_γ|Y,X,Z,θ}(Y_γ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (y, x, z, θ)−a.s.,

for some pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}.

Here we do not consider the case when Assumptions 4.1 and 4.3 hold, but we again note that this definition (and the results to follow) are easily modified to accommodate the case when any combination of these assumptions hold. We now provide the following Corollary whose proof follows almost identically to that of Theorems 2.1 and 3.1, with the exception being that now we require condition (ii) of Definition B.2 to also hold.

Corollary B.4.
Under Assumptions 2.1, 2.2 and 4.2, a distribution of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z} satisfies P_{Y_γ|Y,X,Z} ∈ P*_{Y_γ|Y,X,Z} if and only if there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.2) satisfying:

P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (B.13)

(y, x, z)−a.s. Furthermore, for any collection of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z}, there exists a collection of Borel conditional probability measures P_{θ|Y,X,Z} satisfying (B.13) with (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.2) if and only if there exists a collection P_{θ|Y,X,Z} of probability measures on the sets in A(β) from (3.4) satisfying:

∑_{s ∈ S_j} P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x_j, Z = z_j) = 1,
∑_{s ∈ S_j^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x_j, Z = z_j) = 1,
∑_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j),

for y ∈ {0, 1} and j ∈ {1, . . . , m} assigned positive probability, and:

∑_y ∑_x P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_k) P(Y = y, X = x | Z = z_k)
= ∑_y ∑_x P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_{k+1}) P(Y = y, X = x | Z = z_{k+1}), (B.14)

for all s ∈ {0, 1}^m and all k = 1, . . . , m_z − 1 assigned positive probability.

Proof of Corollary B.4. The first statement follows a proof identical to the proof of Theorem 2.1. For the second statement, the forward direction is identical to the proof of Theorem 3.1. The reverse direction is similar to the proof of Theorem 3.1, with the exception that we must show that the extended measure on B(Θ) satisfies independence if the initial measure on A(β) satisfies independence. Let ˜P_{θ|Y,X,Z} be the extension of P_{θ|Y,X,Z} from the proof of Theorem 3.1.
Then for any A ∈ B(Θ):

˜P_{θ|Z}(A | Z = z_k)
= ∑_{y ∈ {0,1}} ∑_{x ∈ X} ∑_{s ∈ {0,1}^m} 1(A, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_k) P_{Y,X|Z}(Y = y, X = x | Z = z_k)
= ∑_{s ∈ {0,1}^m} 1(A, β, s) ∑_{y ∈ {0,1}} ∑_{x ∈ X} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_k) P_{Y,X|Z}(Y = y, X = x | Z = z_k)
= ∑_{s ∈ {0,1}^m} 1(A, β, s) ∑_{y ∈ {0,1}} ∑_{x ∈ X} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_{k+1}) P_{Y,X|Z}(Y = y, X = x | Z = z_{k+1})
= ∑_{y ∈ {0,1}} ∑_{x ∈ X} ∑_{s ∈ {0,1}^m} 1(A, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_{k+1}) P_{Y,X|Z}(Y = y, X = x | Z = z_{k+1})
= ˜P_{θ|Z}(A | Z = z_{k+1}),

for all z_k and z_{k+1} assigned positive probability, where the third equality follows from (B.14). Conclude that ˜P_{θ|Z} satisfies the second condition in Definition B.2. □

Analogous to Theorem 2.1, the first part of Corollary B.4 provides the theoretical link between the identified set for counterfactual choice probabilities and the identified set for the pair (P_{θ|Y,X,Z}, β) under the additional independence assumption between θ and Z. Furthermore, analogous to the result in Theorem 3.1, the second part of Corollary B.4 reduces an infinite-dimensional existence problem to a finite-dimensional existence problem. Importantly, the second part of Corollary B.4 builds on Theorem 3.1 by demonstrating that Assumption 4.2 (which requires P_{θ|Z}(A | Z = z) = P_θ(A) a.s. for all Borel sets A) can be imposed by considering only a finite number of equality constraints on a distribution P_{θ|Y,X,Z} defined on sets of the form Θ(β, s). We also have the following Corollary to Theorem 3.2:
Corollary B.5.
Under Assumptions 2.1, 2.2, and 4.2, the identified set for the counterfactual conditional choice probability P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j) is given by:

⋃_{β ∈ B} [ν^{ℓb}(y, x_j, z_j, β), ν^{ub}(y, x_j, z_j, β)],

where ν^{ℓb}(y, x_j, z_j, β) and ν^{ub}(y, x_j, z_j, β) are determined by the optimization problems:

ν^{ℓb}(y, x_j, z_j, β) := min_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.10), (B.15)
ν^{ub}(y, x_j, z_j, β) := max_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.10). (B.16)

Note that this Corollary is identical to Theorem 3.2 with the exception that we have now imposed Assumption 4.2, and thus have also included constraints of the form (4.10). With the exception of these additional constraints, the optimization problems that characterize the bounding problem are the same as before. Again, this result can be easily modified to bound any linear function of counterfactual choice probabilities by simply modifying the objective function in the optimization problems (B.15) and (B.16). Finally, we present the following corollary to Proposition 3.1.

Corollary B.6.
Suppose that Assumptions 2.1, 2.2 and 4.2 hold. Then there exists a (not necessarily unique) finite subset B′ ⊂ B such that:

{ν ∈ R^{d_ν} : ∃ β ∈ B s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.10) and ν = ν(β)}
= {ν ∈ R^{d_ν} : ∃ β ∈ B′ s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.10), and ν = ν(β)}.

The additional independence constraints in (4.10) do not affect the proof of Proposition 3.1 in any way, and so the proof of this result is identical to the proof of Proposition 3.1.

B.4 Monotonicity Assumptions
When we entertain Assumption 4.3, we will have the following definition of the identified set, which is analogous to both Definitions 2.1 and 2.2.
Definition B.3.
Under Assumptions 2.1 and 4.3, the identified set I*_{Y,X,Z} is the set of all pairs (P_{θ|Y,X,Z}, β) such that:

(i) (P_{θ|Y,X,Z}, β) satisfies:

P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, (y, x, z)−a.s.; and

(ii) for all (j, k) ∈ M from Assumption 4.3, we have:

P_{θ|Y,X,Z}(ϕ(x_j, z_j, β, θ) ≤ ϕ(x_k, z_k, β, θ) | Y = y, X = x, Z = z) = 1 a.s.

Furthermore, under Assumptions 2.1, 2.2, and 4.3, the identified set of counterfactual conditional choice probabilities P*_{Y_γ|Y,X,Z,θ} is the set of all conditional distributions P_{Y_γ|Y,X,Z,θ} satisfying:

P_{Y_γ|Y,X,Z,θ}(Y_γ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (y, x, z, θ)−a.s.,

for some pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}.

Again this definition and the results to follow are easily modified to accommodate the case when any combination of Assumptions 4.1 and 4.2 also hold. We now provide the following Corollary whose proof follows almost identically to that of Theorems 2.1 and 3.1, with the exception being that now we require condition (ii) of Definition B.3 to hold.
Corollary B.7.
Under Assumptions 2.1, 2.2, and 4.3, a distribution of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z} satisfies P_{Y_γ|Y,X,Z} ∈ P*_{Y_γ|Y,X,Z} if and only if there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.3) satisfying:

P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (B.17)

(y, x, z)−a.s. Furthermore, for any collection of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z}, there exists a collection of Borel conditional probability measures P_{θ|Y,X,Z} satisfying (B.17) with (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.3) if and only if there exists a collection P_{θ|Y,X,Z} of probability measures on the sets in A(β) from (3.4) satisfying:

∑_{s ∈ S_j} P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x_j, Z = z_j) = 1,
∑_{s ∈ S_j^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x_j, Z = z_j) = 1,
∑_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j),

for y ∈ {0, 1} and j ∈ {1, . . . , m} assigned positive probability, and:

∑_{s ∈ S_M^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z) = 0, a.s., (B.18)

for all (y, x, z) assigned positive probability, where S_M is as defined in Section 4.

The proof of this corollary is identical to the proof of Theorem 2.1 and Theorem 3.1. Analogous to Theorem 2.1, the first part of Corollary B.7 provides the theoretical link between the identified set for counterfactual choice probabilities and the identified set for the pair (P_{θ|Y,X,Z}, β) under the additional monotonicity assumption. Analogous to Theorem 3.1, the second part of Corollary B.7 reduces an infinite-dimensional existence problem to a finite-dimensional existence problem amenable to analysis using optimization problems.
Building on the intuition provided in Example 3, the second part of Corollary B.7 demonstrates that monotonicity as in Assumption 4.3 can be imposed by considering only a finite number of equality constraints on a distribution P_{θ|Y,X,Z} defined on sets of the form Θ(β, s). By definition of the set S_M, condition (B.18) simply assigns probability zero to all sets Θ(β, s) that do not satisfy the monotonicity relation from Assumption 4.3. This leads to the following result.

Corollary B.8.
Under Assumptions 2.1, 2.2, and 4.3, the identified set for the counterfactual conditional choice probability P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j) is given by:

⋃_{β ∈ B} [ν^{ℓb}(y, x_j, z_j, β), ν^{ub}(y, x_j, z_j, β)],

where ν^{ℓb}(y, x_j, z_j, β) and ν^{ub}(y, x_j, z_j, β) are determined by the optimization problems:

ν^{ℓb}(y, x_j, z_j, β) := min_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.11), (B.19)
ν^{ub}(y, x_j, z_j, β) := max_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.11). (B.20)

Note that this Corollary is identical to Theorem 3.2 with the exception that we have now imposed Assumption 4.3, and thus have also included constraints of the form (4.11). With the exception of these additional constraints, the optimization problems that characterize the bounding problem are the same as before. Moreover, alternative counterfactual quantities can be bounded in the same way by simply modifying the objective function in (B.19) and (B.20). Finally, we present the following corollary to Proposition 3.1.

Corollary B.9.
Suppose that Assumptions 2.1, 2.2 and 4.3 hold. Then there exists a (not necessarily unique) finite subset B′ ⊂ B such that:

{ν ∈ R^{d_ν} : ∃ β ∈ B s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.11) and ν = ν(β)}
= {ν ∈ R^{d_ν} : ∃ β ∈ B′ s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.11), and ν = ν(β)}.

The additional monotonicity constraints in (4.11) are easily accommodated in the proof of Proposition 3.1, and so the proof of this result follows the same argument as the proof of Proposition 3.1.
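The combinatorial content of constraint (B.18) can be sketched directly: under the convention that s_j = 1 on the set where ϕ(x_j, z_j, θ, β) ≥ 0, a monotonicity pair (j, k) ∈ M rules out every sign vector with s_j = 1 but s_k = 0, and the corresponding sets Θ(β, s) receive probability zero. The dimension m and the pairs used below are hypothetical choices for illustration.

```python
from itertools import product

def monotone_sign_vectors(m, pairs):
    """Enumerate S_M: all sign vectors s in {0,1}^m consistent with the
    monotonicity relation phi_j <= phi_k for every pair (j, k), i.e.
    vectors with s[j] = 1 and s[k] = 0 are excluded."""
    keep = []
    for s in product((0, 1), repeat=m):
        if all(not (s[j] == 1 and s[k] == 0) for j, k in pairs):
            keep.append(s)
    return keep

# Hypothetical example: three cells with phi_0 <= phi_1 <= phi_2, so only
# the "monotone chains" 000, 001, 011, 111 survive.
S_M = monotone_sign_vectors(m=3, pairs=[(0, 1), (1, 2)])
print(len(S_M))  # 4
```

In an implementation of (B.19)-(B.20), the complement of this enumeration supplies the indices whose ν-variables are constrained to zero.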
B.5 Consistency
In this subsection we will present a basic consistency result for functionals of a partially identified parameter. The result is designed to minimize the number of high-level assumptions required for consistency, and is closely related to results found in Molchanov (1998), Manski and Tamer (2002), and Chernozhukov et al. (2007). It is also presented in a form that is more general than necessary for the current paper, and so it may be of interest in other applications.

We consider an environment where the researcher wishes to compute bounds on a functional E_P[ψ(W_i, τ_1, τ_2)], where ψ : W × T → R, where W ⊂ R^{d_w} denotes the support of the observed random vector W, and T = T_1 × T_2 ⊂ R^{d_τ} denotes the parameter space with typical elements τ = (τ_1, τ_2) ∈ T. The values of (τ_1, τ_2) are constrained by J moment inequalities of the form:

E_P[m_j(W_i, τ_1, τ_2)] ≤ 0, for j = 1, . . . , J.

Note this does not rule out moment equalities, since each moment equality can be equivalently written as a combination of two moment inequalities. In this environment, the identified set for (τ_1, τ_2) ∈ T at the true P is given by:

T*(P) := {(τ_1, τ_2) ∈ T : E_P[m_j(W_i, τ_1, τ_2)] ≤ 0, j = 1, . . . , J}.

In addition, the identified set for ψ_0 := E_P[ψ(W_i, τ_1, τ_2)] is given by:

Ψ*(P) := {ψ_0 ∈ R : ∃ (τ_1, τ_2) ∈ T*(P) s.t. ψ_0 = E_P[ψ(W_i, τ_1, τ_2)]}.

Let us define the projection:

T*(τ_1, P) := {τ_2 ∈ T_2 : E_P[m_j(W_i, τ_1, τ_2)] ≤ 0, j = 1, . . . , J}.

It is then straightforward to show that Ψ*(P) can be rewritten as:

Ψ*(P) = ⋃_{τ_1 ∈ T_1} [Ψ^{ℓb}(τ_1, P), Ψ^{ub}(τ_1, P)],

where:

Ψ^{ℓb}(τ_1, P) := min_{τ_2 ∈ T*(τ_1, P)} E_P[ψ(W_i, τ_1, τ_2)], Ψ^{ub}(τ_1, P) := max_{τ_2 ∈ T*(τ_1, P)} E_P[ψ(W_i, τ_1, τ_2)].

We will study the consistency properties of the sample analog estimator for this representation of Ψ*(P). In
In70articular, define: E n [ ψ ( W i , τ , τ )] := 1 n n (cid:88) i =1 ψ ( W i , τ , τ ) , E n [ m j ( W i , τ , τ )] := 1 n n (cid:88) i =1 m j ( W i , τ , τ ) , for j = 1 , . . . , J. Then the sample analog estimator of interest is given by:Ψ ∗ ( P n ) = (cid:91) τ ∈T [Ψ (cid:96)b ( τ , P n ) , Ψ ub ( τ , P n )] , where: Ψ (cid:96)b ( τ , P n ) := min τ ∈T ∗ ( τ , P n ) E n [ ψ ( W i , τ , τ )] , Ψ ub ( τ , P n ) := max τ ∈T ∗ ( τ , P n ) E n [ ψ ( W i , τ , τ )] , and: T ∗ ( τ , P n ) := { τ ∈ T : E n [ m j ( W i , τ , τ )] ≤ j = 1 , . . . , J } . In the following, we will define the sequence { η n ( τ ) } ∞ n =1 as: η n ( τ ) := max (cid:26) max j =1 ,...,J. sup τ ∈T | E n [ m j ( W i , τ , τ )] − E P [ m j ( W i , τ , τ )] | , sup τ ∈T | E n [ ψ ( W i , τ , τ )] − E P [ ψ ( W i , τ , τ )] | (cid:27) . We now impose the following assumption.
Assumption B.1. (i) The parameter space T = T_1 × T_2 ⊂ R^{d_τ}, where T_2 is compact; (ii) for each τ_1 ∈ T_1, the function ψ(·, τ_1, ·) : W × T_2 → R is measurable in W_i ∈ W ⊂ R^{d_w} and is Lipschitz continuous in τ_2 with a (possibly data-dependent) Lipschitz constant C(τ_1) with sup_{τ_1 ∈ T_1} C(τ_1) < ∞ a.s.; (iii) for j = 1, . . . , J, and for each τ_1 ∈ T_1, the moment function m_j(·, τ_1, ·) : W × T_2 → R is measurable in W_i and lower semicontinuous in τ_2; (iv) the true data generating process is indexed by a triple (τ_1, τ_2, P) that satisfies (τ_1, τ_2) ∈ T and E_P[m_j(W_i, τ_1, τ_2)] ≤ 0 for j = 1, . . . , J; (v) the sample {W_i}_{i=1}^n is an independent and identically distributed draw from P; (vi) for each fixed τ_1 ∈ T_1, we have η_n(τ_1) = O_P(a_n^{-1}) for some sequence a_n ↑ ∞; (vii) for each fixed τ_1 ∈ T_1, there exists a sequence b_n ↓ 0 satisfying b_n ≥ η_n(τ_1) with probability approaching 1 (w.p.a. 1); (viii) there exists a finite subset T_1′ ⊂ T_1 such that:

{τ_2 ∈ T_2 : ∃ τ_1 ∈ T_1 s.t. E_P[m_j(W_i, τ_1, τ_2)] ≤ 0 for j = 1, . . . , J}
= {τ_2 ∈ T_2 : ∃ τ_1 ∈ T_1′ s.t. E_P[m_j(W_i, τ_1, τ_2)] ≤ 0 for j = 1, . . . , J}.

Part (i) of Assumption B.1 is standard in the literature on extremum estimators. Part (ii) separates the roles of τ_1 and τ_2, and restricts the objective function to be Lipschitz continuous in the parameter τ_2 for each τ_1. Part (ii) places no restrictions on how τ_1 enters the objective function. Part (iii) further separates the roles of τ_1 and τ_2 by requiring each of the moment functions to be lower semicontinuous in τ_2. Similar to part (ii), no restrictions are placed on how τ_1 enters the moment functions. Assumption (iv) is standard, and simply indicates that the true parameters satisfy the moment inequalities at the true P. Part (v) is also standard, although it rules out the case of dependent data. Part (vi) indicates that η_n(τ_1) converges in probability at a rate of 1/a_n.
This can be verified using standard assumptions; for example, if for each τ_1 ∈ T_1 the J + 1 classes of functions:

F_ψ(τ_1) := {ψ(·, τ_1, τ_2) : W → R | τ_2 ∈ T_2}, F_j(τ_1) := {m_j(·, τ_1, τ_2) : W → R | τ_2 ∈ T_2}, for j = 1, . . . , J,

are all P−Donsker classes, then part (vi) is satisfied with a_n = √n. This will be the case, for example, for all specifications considered in Section 5. After verifying part (vi), it is easy to find a sequence b_n satisfying part (vii). For example, if a_n = √n from part (vi), then we can set b_n = b/√(log(n)) for any b > 0. Finally, part (viii) essentially allows us to replace T_1 with a finite subset T_1′ without impacting the bounding problem. It is precisely because of part (viii) that all other parts of Assumption B.1 (namely parts (ii), (iii), (vi) and (vii)) are allowed to be so flexible with respect to the parameter τ_1. This last component of Assumption B.1 is verified in our basic setup in Proposition 3.1 in the main text, and is verified under our functional form, independence, and monotonicity assumptions in Corollaries B.3, B.6 and B.9, respectively. All other components of Assumption B.1 are either standard assumptions, or are easily verified for the bounding problems presented in the main text and for all specifications considered in Section 5.

Before stating the main result for this subsection, for any c ∈ R let us define:

T*(τ_1, P, c) := {τ_2 ∈ T_2 : E_P[m_j(W, τ_1, τ_2)] ≤ c for j = 1, . . . , J},

and:

Ψ*(P, c) = ⋃_{τ_1 ∈ T_1′} [Ψ^{ℓb}(τ_1, P, c), Ψ^{ub}(τ_1, P, c)],

where:

Ψ^{ℓb}(τ_1, P, c) := min_{τ_2 ∈ T*(τ_1, P, c)} E_P[ψ(W_i, τ_1, τ_2)], Ψ^{ub}(τ_1, P, c) := max_{τ_2 ∈ T*(τ_1, P, c)} E_P[ψ(W_i, τ_1, τ_2)].

Define the sets T*(τ_1, P_n, c) and Ψ*(P_n, c) analogously. The following Theorem then shows that a slight enlargement of the set Ψ*(P_n) is a consistent estimator for the set Ψ*(P), where consistency is defined using the Hausdorff metric.

Theorem B.2.
Suppose that Assumption B.1 holds. Then d_H(Ψ*(P_n, b_n), Ψ*(P)) = o_P(1), where b_n is the sequence from Assumption B.1.

Proof of Theorem B.2. We have:

d_H(Ψ*(P_n, b_n), Ψ*(P)) ≤ ∑_{τ_1 ∈ T_1′} d_H([Ψ^{ℓb}(τ_1, P_n, b_n), Ψ^{ub}(τ_1, P_n, b_n)], [Ψ^{ℓb}(τ_1, P), Ψ^{ub}(τ_1, P)]).

Since T_1′ is finite by Assumption B.1(viii), it suffices to show that:

d_H([Ψ^{ℓb}(τ_1, P_n, b_n), Ψ^{ub}(τ_1, P_n, b_n)], [Ψ^{ℓb}(τ_1, P), Ψ^{ub}(τ_1, P)]) = o_P(1),

for each τ_1 ∈ T_1′. To this end, fix any τ_1 ∈ T_1′. To show the previous display, it suffices to show consistency of the upper and lower bounds; i.e. that |Ψ^{ℓb}(τ_1, P_n, b_n) − Ψ^{ℓb}(τ_1, P)| = o_P(1) and that |Ψ^{ub}(τ_1, P_n, b_n) − Ψ^{ub}(τ_1, P)| = o_P(1). We will focus on the lower bound, since the upper bound proof is symmetric.

First recall that ψ(W_i, τ_1, τ_2) is continuous with respect to τ_2 for every τ_1 by Assumption B.1(ii), and T_2 is compact by Assumption B.1(i). Thus, we have that ψ(W_i, τ_1, τ_2) is uniformly continuous (w.r.t. τ_2) on T_2. Thus, for every ε > 0 there exists a δ(ε) > 0 such that |E_n[ψ(W_i, τ_1, τ_2)] − E_n[ψ(W_i, τ_1, τ_2′)]| < ε whenever ||τ_2 − τ_2′|| < δ(ε).
Now note that:

|Ψ^{ℓb}(τ_1, P_n, b_n) − Ψ^{ℓb}(τ_1, P)|
= | min_{τ_2 ∈ T*(τ_1, P_n, b_n)} E_n[ψ(W_i, τ_1, τ_2)] − min_{τ_2 ∈ T*(τ_1, P)} E_P[ψ(W_i, τ_1, τ_2)] |
≤ | min_{τ_2 ∈ T*(τ_1, P_n, b_n)} E_n[ψ(W_i, τ_1, τ_2)] − min_{τ_2 ∈ T*(τ_1, P)} E_n[ψ(W_i, τ_1, τ_2)] |
+ | min_{τ_2 ∈ T*(τ_1, P)} E_n[ψ(W_i, τ_1, τ_2)] − min_{τ_2 ∈ T*(τ_1, P)} E_P[ψ(W_i, τ_1, τ_2)] |
= | max_{τ_2 ∈ T*(τ_1, P)} (−E_n[ψ(W_i, τ_1, τ_2)]) − max_{τ_2 ∈ T*(τ_1, P_n, b_n)} (−E_n[ψ(W_i, τ_1, τ_2)]) |
+ | max_{τ_2 ∈ T*(τ_1, P)} (−E_P[ψ(W_i, τ_1, τ_2)]) − max_{τ_2 ∈ T*(τ_1, P)} (−E_n[ψ(W_i, τ_1, τ_2)]) |
≤ max_{{τ_2, τ_2′ ∈ T_2 : ||τ_2 − τ_2′|| ≤ d_H(T*(τ_1, P_n, b_n), T*(τ_1, P))}} | E_n[ψ(W_i, τ_1, τ_2′)] − E_n[ψ(W_i, τ_1, τ_2)] |
+ max_{τ_2 ∈ T*(τ_1, P)} | E_P[ψ(W_i, τ_1, τ_2)] − E_n[ψ(W_i, τ_1, τ_2)] |
≤ max_{{τ_2, τ_2′ ∈ T_2 : ||τ_2 − τ_2′|| ≤ d_H(T*(τ_1, P_n, b_n), T*(τ_1, P))}} C · ||τ_2 − τ_2′||
+ max_{τ_2 ∈ T*(τ_1, P)} | E_P[ψ(W_i, τ_1, τ_2)] − E_n[ψ(W_i, τ_1, τ_2)] |
≤ C · d_H(T*(τ_1, P_n, b_n), T*(τ_1, P)) + max_{τ_2 ∈ T*(τ_1, P)} | E_P[ψ(W_i, τ_1, τ_2)] − E_n[ψ(W_i, τ_1, τ_2)] |.

It suffices to show the two terms in the last line of the previous display converge to zero in probability. The second term converges in probability to zero by Assumption B.1(vi). Furthermore, since
C < ∞ w.p. 1, thefirst term converges to zero in probability if we can show that: d H ( T ∗ ( τ , P n , b n ) , T ∗ ( τ , P )) = o P (1) . The remainder of the proof will focus on proving this latter fact. Note that: d H ( T ∗ ( τ , P n , b n ) , T ∗ ( τ , P )) = inf { δ > T ∗ ( τ , P ) ⊆ T ∗ ( τ , P n , b n ) δ , and T ∗ ( τ , P n , b n ) ⊆ T ∗ ( τ , P ) δ } , where: T ∗ ( τ , P n , b n ) δ := { τ ∈ T : B δ ( τ ) ∩ T ∗ ( τ , P n , b n ) (cid:54) = ∅ } , ∗ ( τ , P ) δ := { τ ∈ T : B δ ( τ ) ∩ T ∗ ( τ , P ) (cid:54) = ∅ } , where B δ ( τ ) denotes the closed ball of radius δ > τ . The next part of the proof closely followsthe proof of Theorem 2.1 in Molchanov (1998). Define the function: ρ ( ε ) := d H ( T ∗ ( τ , P, ε ) , T ∗ ( τ , P )) . Since each of the moment functions are lower semi-continuous in τ for each τ , each of the sets T ∗ ( τ , P, ε )and T ∗ ( τ , P ) are closed and ρ is right continuous. Furthermore, ρ is non-increasing for ε < ε >
0. Now by Assumption B.1 we have with high probability: T ∗ ( τ , P n , b n ) = { τ ∈ T : E n [ m j ( W, τ , τ )] ≤ b n for j = 1 , . . . , k }⊆ { τ ∈ T : E n [ m j ( W, τ , τ )] ≤ η n ( τ ) + b n for j = 1 , . . . , k }⊆ T ∗ ( τ , P, b n ) ⊆ T ∗ ( τ , P ) ρ (2 b n ) . Furthermore, by Assumption B.1 we have with high probability for large enough n : T ∗ ( τ , P ) ⊆ T ∗ ( τ , P, b n − η n ( τ )) ⊆ T ∗ ( τ , P n , b n ) . Conclude that with high probability for large enough n : d H ( T ∗ ( τ , P n , b n ) , T ∗ ( τ , P )) ≤ ρ (2 b n ) → , where the last line follows from right-continuity of the function ρ ( · ). Since τ ∈ T (cid:48) was arbitrary, thiscompletes the proof. (cid:4) B.6 Additively Separable Case
In this subsection we will show how our method can be applied to a model that satisfies the following assumption.
Assumption B.2. (i) There exists a function $\tilde{\varphi} : \mathcal{X} \times \mathcal{Z} \times \mathcal{B} \to \mathbb{R}$ satisfying $\varphi(X, Z, \theta, \beta) = \tilde{\varphi}(X, Z, \beta) - \theta$, and (ii) the event:
$$F := \bigcup_{(x,z) \in \mathcal{X} \times \mathcal{Z}} \{\theta : \varphi(x, z, \theta, \beta) = 0\},$$
occurs with probability zero; that is, $P_\theta(F) = 0$.

This is a well-studied special case of the linear model considered in Section 4.1. In particular, much of the discussion in this section will expand upon the insights of Chesher (2013). We will consider two cases: (i) when the structural function $\varphi$ is linear in the parameter vector $\beta$, and (ii) when the structural function is unknown. To begin, let us consider the following simple example.

Example 4.
Suppose we have a scalar variable $X$ with support $\mathcal{X} = \{x_1, \ldots, x_{m_x}\}$ and latent variable $\theta \in [-1, 1]$, and suppose there are no $Z$ variables. Consider the following additively separable threshold crossing model:
$$Y = 1\{X\beta \ge \theta\},$$
where $\beta$ is a fixed scalar coefficient. The response types in this setting are characterized by the $m_x \times 1$ vectors:
$$r(\beta, \theta) := \big(1\{x_1\beta \ge \theta\},\ 1\{x_2\beta \ge \theta\},\ \ldots,\ 1\{x_{m_x}\beta \ge \theta\}\big)'.$$
However, the set of possible response types in this setting will depend on the sign of the fixed coefficient $\beta$. In particular, when $\beta \ge 0$ we have the response types $r(\beta, \theta) \in \{s_1, \ldots, s_{m_x+1}\}$, where:
$$s_1 := (0, \ldots, 0, 0)', \quad s_2 := (0, \ldots, 0, 1)', \quad \ldots, \quad s_{m_x} := (0, 1, \ldots, 1)', \quad s_{m_x+1} := (1, 1, \ldots, 1)'. \tag{B.21}$$
No other response types are possible when $\beta \ge 0$, and so all other response types must be assigned zero probability. Alternatively, when $\beta < 0$ we have the response types $r(\beta, \theta) \in \{s_1', \ldots, s_{m_x+1}'\}$, where:
$$s_1' := (0, 0, \ldots, 0)', \quad s_2' := (1, 0, \ldots, 0)', \quad \ldots, \quad s_{m_x}' := (1, \ldots, 1, 0)', \quad s_{m_x+1}' := (1, 1, \ldots, 1)'. \tag{B.22}$$
Again, no other response types are possible when $\beta < 0$, and so all others must be assigned zero probability by the distribution of $\theta$.

The reason that these particular response types arise when $\beta \ge 0$ and $\beta < 0$ is the ordering of the support of $X$ induced by the value of the scalar product $X\beta$. In particular, if we suppose $x_1 \le x_2 \le \ldots \le x_{m_x}$, then when $\beta \ge 0$ we have the ordering $x_1\beta \le x_2\beta \le \ldots \le x_{m_x}\beta$. This means, for example, that it is impossible to find a value of $\theta \in [-1, 1]$ so that:
$$r(\beta, \theta) = \big(1\{x_1\beta \ge \theta\},\ 1\{x_2\beta \ge \theta\},\ \ldots,\ 1\{x_{m_x}\beta \ge \theta\}\big)' = (1, 0, \ldots, 0)'.$$

[Figure 6: An illustration of the partition of the latent variable space according to response types in the case when the index function is additively separable in $\theta$ and when $\mathcal{X} = \{x_1, x_2, x_3\}$ with $x_1 \le x_2 \le x_3$. As indicated in the example, the feasible response types are those that correspond to a particular ordering of the points in $\mathcal{X}$ induced by the scalar product $X\beta$.]
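The feasibility restrictions in (B.21) and (B.22) are easy to verify numerically. A minimal sketch (the support points and function name below are our own, purely illustrative, choices):

```python
def response_types(support, beta, n_grid=10001):
    """Collect the distinct response-type vectors r(beta, theta) realized
    as theta ranges over a grid on [-1, 1], for Y = 1{X*beta >= theta}."""
    types = set()
    for i in range(n_grid):
        theta = -1.0 + 2.0 * i / (n_grid - 1)
        types.add(tuple(int(x * beta >= theta) for x in support))
    return types

# Hypothetical support x1 <= x2 <= x3; any such points behave the same way.
support = [-0.5, 0.2, 0.8]
pos = response_types(support, beta=1.0)   # beta >= 0: "staircase" types (B.21)
neg = response_types(support, beta=-1.0)  # beta < 0: reversed staircases (B.22)
assert len(pos) <= 4 and len(neg) <= 4    # at most m_x + 1 = 4 feasible types
assert (1, 0, 0) not in pos               # infeasible when beta >= 0...
assert (1, 0, 0) in neg                   # ...but feasible when beta < 0
```

The grid search simply traces the threshold $\theta$ across $[-1,1]$; each crossing of a point $x_j\beta$ flips one coordinate, which is why only the $m_x + 1$ "staircase" types appear.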
Indeed, the existence of such a value for $\theta$ would contradict the ordering $x_1\beta \le x_2\beta \le \ldots \le x_{m_x}\beta$. This means that when $\beta \ge 0$ certain response types are not possible, and so must be assigned probability zero by the distribution of $\theta$. An identical intuition holds in the case when $\beta < 0$. In the end, the response types that can be assigned positive probability in this example when $\beta \ge 0$ and $\beta < 0$ are exactly the ones corresponding to the vectors in (B.21) and (B.22), respectively. Figure 6 provides an illustration in the case when $\mathcal{X} = \{x_1, x_2, x_3\}$.

This example illustrates the key ideas behind the implementation of our approach when the index function is additively separable in $\theta$, as in Assumption B.2. In particular, given the function $\tilde{\varphi}$ from Assumption B.2, the key is to determine the values of $\beta$ such that the function $\tilde{\varphi}(\cdot, \beta) : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ induces a unique ordering of the points in the support $\mathcal{X} \times \mathcal{Z}$. With no $Z$ variables, a scalar $X$ variable, and $\tilde{\varphi}(X, Z, \beta) = X\beta$, Example 4 shows that only two orderings are possible, corresponding to the cases $\beta \ge 0$ and $\beta < 0$. After the order is determined, we can immediately determine the set of response types that must be assigned zero probability by the distribution of $\theta$, and then impose these restrictions as additional constraints in the bounding problems (3.15) and (3.16), as in Section 4.1. In particular, letting $S_\varphi$ denote the set of all binary vectors $s \in \{0,1\}^m$ corresponding to sets $\Theta(\beta, s)$ that can be assigned positive probability under Assumption B.2, we impose the constraint:
$$\sum_{s \in S_\varphi^c} \nu(y, x_j, z_j, \beta, s) = 0, \tag{B.23}$$
for all $y \in \{0,1\}$ and $j = 1, \ldots, m$ occurring with positive probability. Note that (B.23) is of an identical form to the constraint (4.1) in the main text. Corollary B.2 in Appendix B.2 is then immediately applicable, since Assumption 4.1 nests Assumption B.2 as a special case. Thus, Theorem 3.2 can be extended to accommodate Assumption B.2 by simply adding the constraints (4.1) to the optimization problems (3.15) and (3.16).

Similar to the discussion in Section 4.1, determining the sets $\Theta(\beta, s)$ that can be assigned positive probability under Assumption B.2 poses an interesting computational problem. Although Example 4 illustrates a case when there are only two orderings, in general many more orderings may be possible, even when $\tilde{\varphi}$ is linear in $\beta$. Clearly at most $m!$ orderings are possible, but when the index function is linear in $\beta$ it is possible to show that the maximum number of possible orderings is much smaller than $m!$. In particular, partition $\beta = (\beta_x, \beta_z)$ and consider the function $\tilde{\varphi}(X, Z, \beta) = X\beta_x + Z\beta_z$, where $X$ is a vector of dimension $d_x$ and $Z$ is a vector of dimension $d_z$. Label the support $\mathcal{X} \times \mathcal{Z}$ as $\{(x_1, z_1), (x_2, z_2), \ldots, (x_m, z_m)\}$, and let $\Delta_{jk} := (x_j, z_j) - (x_k, z_k)$ for $1 \le j < k \le m$. Setting $d = d_x + d_z$, the set $H_{jk} := \{\beta \in \mathbb{R}^d : \Delta_{jk}\beta = 0\}$ defines a hyperplane through the origin that is normal to the line connecting $(x_j, z_j)$ and $(x_k, z_k)$ in $\mathbb{R}^d$.
The set of all such hyperplanes partitions $\mathbb{R}^d$ into at most $Q(m, d)$ non-empty cones, where $Q(m, d)$ is defined recursively as:
$$Q(m, d) = Q(m-1, d) + (m-1)\,Q(m-1, d-1), \tag{B.24}$$
with $Q(m, 1) = 2$ for all $m \ge 2$ and $Q(2, d) = 2$ for all $d \ge 1$. Furthermore, each of these non-empty cones corresponds exactly to an equivalence class of vectors $\beta = (\beta_x, \beta_z)$ that induce a unique ordering of the points in $\mathcal{X} \times \mathcal{Z}$. Thus, the value $Q(m, d)$ serves as an upper bound on the number of orderings of the points in $\mathcal{X} \times \mathcal{Z}$ that are inducible by the function $\tilde{\varphi}(X, Z, \beta) = X\beta_x + Z\beta_z$. The recursive formula in (B.24) defining the upper bound $Q(m, d)$ has been independently discovered in different contexts by many authors; the earliest such account appears in Bennett (1956), although the formula was independently discovered again in Cover (1967). The upper bound $Q(m, d)$ is attained when the collection of hyperplanes of the form $H_{jk}$ is in general position. Note that $Q(m, 1) = 2$ corresponds exactly to Example 4, where it was shown that only two orderings could be induced when $\tilde{\varphi}(X, Z, \beta) = X\beta$ for scalar $X$ and $\beta$. Typically $Q(m, d) < m!$, although some inspection of the formula shows that we will always have $Q(m, d) = m!$ when $d \ge m - 1$.

By selecting one value of $\beta$ from each of the cones defined by the collection of hyperplanes of the form $H_{jk}$, we can determine the permitted orderings of the support points in $\mathcal{X} \times \mathcal{Z}$ by simply evaluating $x_j\beta_x + z_j\beta_z$ for $j = 1, \ldots, m$ at each selected value of $\beta$. This then allows us to determine which sets $\Theta(\beta, s)$ must be assigned zero probability under Assumption B.2. Note that under Assumption B.2 the latent variable $\theta$ obtains a value on a hyperplane of the form $H_{jk}$ with probability zero. Thus, it suffices to select one value of $\beta$ from each of the non-empty cones defined by the collection of hyperplanes of the form $H_{jk}$. This can be done using the hyperplane arrangement algorithm described in Section 4.1, applied to the hyperplanes of the form $H_{jk}$ for $1 \le j < k \le m$.

Our method is also applicable to cases when $\tilde{\varphi}(X, Z, \beta)$ may be non-linear in the finite-dimensional vector $\beta$. To see how this case can be accommodated, recall that in the case when $\tilde{\varphi}$ is linear in $\beta$, the ordering of the support points in $\mathcal{X} \times \mathcal{Z}$ induced by the function $\tilde{\varphi}(X, Z, \beta)$ allowed us to determine the admissible response types, which in turn allowed us to construct the additional constraints needed in programs (3.15) and (3.16). A similar strategy can be used when $\tilde{\varphi}$ is not known by the researcher. However, when $\tilde{\varphi}$ is not restricted by the researcher, all orderings of the support points in $\mathcal{X} \times \mathcal{Z}$ will be possible. The procedure to bound a counterfactual choice probability (or some other counterfactual quantity of interest) is then as follows. The researcher must first fix an ordering of the support points in $\mathcal{X} \times \mathcal{Z}$, determine the admissible response types $S_\varphi$ for the fixed ordering, and run the linear programs in (3.15) and (3.16) subject to the constraint (B.23). The researcher must then repeat the procedure for all possible orderings of the support points in $\mathcal{X} \times \mathcal{Z}$. Each iteration of this procedure yields an interval with endpoints determined by the values of the linear programs in (3.15) and (3.16). The closed convex hull of the identified set for the counterfactual choice probability is then given by the interval whose lower endpoint is the smallest value of the linear program in (3.15) obtained across all orderings, and whose upper endpoint is the largest value of the linear program in (3.16) obtained across all orderings. Admittedly, there will be $m!$ possible orderings for $\tilde{\varphi}(X, Z, \beta)$ unless additional assumptions are imposed, so considering all possible orderings may be computationally burdensome.
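The recursion in (B.24) is straightforward to compute; a minimal sketch (the function name is ours):

```python
from math import factorial

def Q(m, d):
    """Upper bound (B.24) on the number of orderings of m support points
    inducible by a linear index with d free parameters."""
    if d == 1:
        return 2  # base case: Q(m, 1) = 2 for all m >= 2
    if m == 2:
        return 2  # base case: Q(2, d) = 2 for all d >= 1
    return Q(m - 1, d) + (m - 1) * Q(m - 1, d - 1)

assert Q(3, 2) == factorial(3)  # Q(m, d) = m! whenever d >= m - 1
assert Q(4, 3) == factorial(4)
assert Q(4, 2) == 12            # far fewer than 4! = 24 orderings
```

For instance, with $m = 4$ support points and $d = 2$ parameters, only 12 of the $4! = 24$ orderings can be induced by a linear index, so only those 12 sets of constraints (B.23) need to be considered.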
C Comparison to the Approach Based on Artstein’s Inequalities
In this Appendix, we briefly review an alternative method of constructing the identified set for (conditional) latent variable distributions: the method proposed by Chesher et al. (2013) and Chesher and Rosen (2014), and exposited in general form in Chesher and Rosen (2017). Our objective here is to provide an informal comparison between our approach and theirs, and to provide a brief derivation showing how the two approaches might be connected.

Let us suppose that Assumption 2.1 holds, and consider a slightly modified version of the random set from (2.2):
$$H^-(y, x, z, \beta) := \mathrm{cl}\{\theta : y = 1\{\varphi(x, z, \theta, \beta) \ge 0\}\}.$$
That is, the random set $H^-(y, x, z, \beta)$ is equal to the closure of the random set $G^-(y, x, z, \beta)$ from (2.2). Under some conditions, if the distribution of $\theta$ is absolutely continuous (which is assumed, for example, in both Chesher et al. (2013) and Chesher and Rosen (2014)), these two random sets will be equal almost surely. Let us also define the random set:
$$H(\theta, \beta) := \mathrm{cl}\{(y, x, z) : y = 1\{\varphi(x, z, \theta, \beta) \ge 0\}\}.$$
We now apply a fundamental result due to Artstein (1983) (see also Norberg (1992) and Molchanov (2017), Corollary 1.4.11), which characterizes the set of selections of a random closed set.
Theorem C.1.
Suppose that Assumption 2.1 holds. Then for any $\beta \in \mathcal{B}$, the random vector $\theta$ can be realized as a selection of the random closed set $H^-(Y, X, Z, \beta)$ if and only if:
$$P_\theta(\theta \in K) \le P_{Y,X,Z}(H^-(Y, X, Z, \beta) \cap K \ne \emptyset), \tag{C.1}$$
for all compact sets $K \subset \Theta$. Furthermore, for any $\beta \in \mathcal{B}$, the random vector $(Y, X, Z)$ can be realized as a selection of the random closed set $H(\theta, \beta)$ if and only if:
$$P_{Y,X,Z}((Y, X, Z) \in C) \le P_\theta(H(\theta, \beta) \cap C \ne \emptyset), \tag{C.2}$$
for all compact sets $C \subset \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$.

Remark C.1.
The statement "...the random vector $\theta$ can be realized as a selection of the random closed set $H^-(Y, X, Z, \beta)$..." means that there exists a probability space and random elements $\tilde{\theta}$ and $\tilde{H}^-(Y, X, Z, \beta)$ with distributions identical to those of $\theta$ and $H^-(Y, X, Z, \beta)$ such that $\tilde{\theta} \in \tilde{H}^-(Y, X, Z, \beta)$ a.s.
Since $\Theta \subset \mathbb{R}^{d_\theta}$ is locally compact and Hausdorff, it is equivalent that (C.1) hold for all open sets $G \subset \Theta$. Furthermore, since $\mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$ is finite, all subsets of this product space are compact.
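Since the product space is finite, condition (C.2) can in principle be verified by brute force over all (automatically compact) subsets $C$. A toy sketch with a made-up two-point outcome space, a two-point discretization of the latent variable, and a hypothetical set-valued map (all names below are illustrative assumptions, not objects from the paper):

```python
from itertools import chain, combinations

def artstein_holds(p_w, p_theta, H, tol=1e-12):
    """Check Artstein's inequality P(W in C) <= P_theta(H(theta) meets C)
    for every non-empty subset C of a finite outcome space."""
    outcomes = list(p_w)
    subsets = chain.from_iterable(
        combinations(outcomes, r) for r in range(1, len(outcomes) + 1))
    for C in subsets:
        lhs = sum(p_w[w] for w in C)                           # containment side
        rhs = sum(p for t, p in p_theta.items() if H[t] & set(C))  # capacity side
        if lhs > rhs + tol:
            return False
    return True

p_w = {"w0": 0.4, "w1": 0.6}          # distribution of the observed vector
p_theta = {"t0": 0.5, "t1": 0.5}      # distribution of the latent type
H = {"t0": {"w0", "w1"}, "t1": {"w1"}}  # hypothetical random set H(theta)
assert artstein_holds(p_w, p_theta, H)
# Shrinking H("t0") to {"w0"} makes C = {"w1"} violate the inequality:
assert not artstein_holds(p_w, p_theta, {"t0": {"w0"}, "t1": {"w1"}})
```

The loop over `subsets` makes the combinatorial cost visible: with $r$ support points there are $2^r - 1$ sets $C$ to check, which motivates the core-determining-class reductions discussed below.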
The first part of this result is very similar to Theorem 1 in Chesher et al. (2013) and Theorem 3.1 in Chesher and Rosen (2014), and thus is not new. The second part of this result is a direct corollary of Theorem 1 in Chesher and Rosen (2017). In fact, Theorem 1 in Chesher and Rosen (2017) shows that (C.1) and (C.2) impose an equivalent set of constraints on the (unconditional) latent variable distributions $P_\theta$. Thus, either (C.1) or (C.2) can be used to construct the identified set of unconditional latent variable distributions, say $\mathcal{P}^*_\theta$; in practice, this is accomplished by first fixing a value of $\beta \in \mathcal{B}$, collecting all distributions $P_\theta$ satisfying either (C.1) or (C.2), and then taking a union (over all $\beta \in \mathcal{B}$) of the resulting collections of distributions. Also note that a result similar to Theorem C.1 can be stated after conditioning both sides of (C.1) and (C.2) on any combination of the variables in $(Y, X, Z)$; we will make use of this shortly. Finally, note that (C.1) (or (C.2)) provides a characterization of all distributions of random vectors $\theta$ that can be realized as a selection from a random set, and not just those selections whose distributions satisfy certain conditions, such as absolute continuity. Although attention is often focused on selections $\theta$ with absolutely continuous distributions, Artstein's inequalities are only necessary and not sufficient for the existence of such a selection.

(Footnote: In Chesher and Rosen (2014), absolute continuity combined with a linear index function ensures this statement is true. Chesher et al. (2013) consider a more general class of latent index functions than Chesher and Rosen (2014), and so also impose strict monotonicity of the latent index function in the latent variables in order to ensure their analog of the set $\{\theta : \varphi(x, z, \theta, \beta) = 0\}$ is of Lebesgue measure zero for each $(y, x, z, \beta)$.)
Thus, most efforts to reduce the computational burden of the approach based on Artstein's inequalities are directed towards reducing the number of constraints implied by Theorem C.1 (and its analogues in other contexts).

At first glance it appears that (C.1) leads to a characterization of the identified set of unconditional latent variable distributions that is intractable, given the number of possible compact subsets of $\Theta$. However, following the discussion in both Chesher et al. (2013) and Chesher and Rosen (2014), it can be shown that most of these inequalities impose redundant constraints, and that there are typically only a finite number of nonredundant inequalities. For example, note that when $\varphi$ is linear in $\theta$ and $\beta$, for each fixed $\beta$ the set $H^-(y, x, z, \beta)$ represents a closed halfspace through the origin (intersected with $\Theta$). In this special case, Chesher and Rosen (2014) demonstrate that it suffices to check the inequalities in (C.1) for all sets $K$ that can be written as the intersection of halfspaces of the form $H^-(y, x, z, \beta)$. This reduces the infinite number of inequalities implied by (C.1) to a finite number. For example, in the case with no exogenous variables and a scalar endogenous variable $X$ with $m_x$ points of support, there are at most $2^r$ such intersections to check, with $r = 2 m_x m_z$ (the number of support points of $\mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$).
However, clearly even with small values of $m_x$ and $m_z$ the resulting number of inequalities can be prohibitively large. We will now show that our approach in this paper can be considered as a simplification of the set of constraints in (C.2), where the number of constraints in our simplification is proportional to $r$ rather than $2^r$.

To see the connection to the approach in this paper, consider imposing (C.2) conditional on $(Y, X, Z)$. We obtain that, for any $\beta \in \mathcal{B}$, the random vector $\theta$ can be realized as a selection from the random closed set $H^-(y, x, z, \beta)$ if and only if:
$$1\{(y, x, z) \in C\} \le P_{\theta|Y,X,Z}(H(\theta, \beta) \cap C \ne \emptyset \mid Y = y, X = x, Z = z),$$
for all compact $C \subset \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$, and all $(y, x, z) \in \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$ assigned positive probability. Now for a fixed value of $(y, x, z)$, consider the set of all compact $C \subset \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$ containing $(y, x, z)$. For all such $C$ we must have:
$$P_{\theta|Y,X,Z}(H(\theta, \beta) \cap C \ne \emptyset \mid Y = y, X = x, Z = z) = 1.$$

(Footnote: In particular, even relatively small problems can quickly exhaust all available storage in a computer's random access memory (RAM), which can cause the computer to slow down significantly, or crash. See Galichon and Henry (2011) and Chesher and Rosen (2017) for a discussion of the idea of the "core determining class," which is any collection of sets $K \subset \Theta$ that is sufficient for (C.1) to hold for all compact sets. A careful comparison between the approach based on Artstein's inequalities and other approaches in the context of bounding treatment effects is provided in Russell (2019).)

However, this condition holds for all compact $C \subset \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$ containing $(y, x, z)$ if and only if it holds for the singleton set $\{(y, x, z)\}$. We thus have:
$$P_{\theta|Y,X,Z}(H(\theta, \beta) \cap \{(y, x, z)\} \ne \emptyset \mid Y = y, X = x, Z = z) = 1, \qquad (y, x, z)\text{-a.s.}$$
However, some basic manipulation shows this holds if and only if:
$$P_{\theta|Y,X,Z}(\theta \in H^-(Y, X, Z, \beta) \mid Y = y, X = x, Z = z) = 1, \qquad (y, x, z)\text{-a.s.}$$
This derivation can be used to prove the following Lemma:

Lemma C.1.