Partial Identification in Nonseparable Binary Response Models with Endogenous Regressors
Jiaying Gu ∗ University of Toronto
Thomas M. Russell † Carleton University
January 6, 2021
Abstract
This paper considers (partial) identification of a variety of parameters, including counterfactual choice probabilities, in a general class of binary response models with possibly endogenous regressors. Importantly, our framework allows for nonseparable index functions with multi-dimensional latent variables, and does not require parametric distributional assumptions. We demonstrate how various functional form, independence, and monotonicity assumptions can be imposed as constraints in our optimization procedure to tighten the identified set, and we show how these assumptions have meaningful interpretations in terms of restrictions on latent types. In the special case when the index function is linear in the latent variables, we leverage results in computational geometry to provide a tractable means of constructing the sharp set of constraints for our optimization problems. Finally, we apply our method to study the effects of health insurance on the decision to seek medical treatment.
Keywords: Binary Choice, Counterfactual Choice Probabilities, Endogeneity, Hyperplane Arrangement, Linear Programming, Partial Identification
We are grateful to Marc Henry, Roger Koenker, and the seminar audiences at Columbia University and Michigan State University for helpful feedback. We also thank Martin Weidner and the organizers of the Chamberlain Seminar, and are grateful to Florian Gunsilius, Sukjin Han, Wayne Gao, and Takuya Ura for their questions and feedback, and to Adam Rosen for his thoughtful discussion. Jiaying Gu acknowledges financial support from the Social Sciences and Humanities Research Council of Canada. All errors are our own.

∗ Jiaying Gu, Assistant Professor, Department of Economics, University of Toronto, 150 St. George Street, Toronto, Ontario, M5S 3G7, Canada. Email: [email protected].

† Thomas M. Russell, Assistant Professor, Department of Economics, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, K1S 5B6, Canada. Email: [email protected].

1 Introduction
This paper considers partial identification of a variety of parameters in a general class of binary response models. Our main focus throughout is on counterfactual choice probabilities, as well as parameters that can be written as linear combinations of counterfactual choice probabilities. However, our framework is also applicable to parameters outside of this class. Our approach allows for flexible functional form assumptions, endogenous regressors, and the inclusion of multi-dimensional and nonseparable latent variables. Furthermore, our approach does not require any parametric distributional assumptions.

In the settings closest to the one we consider, nonparametric point identification of the distribution of latent variables occurs only under restrictive conditions, often including strong independence assumptions and large support conditions (e.g. Ichimura and Thompson (1998)). Control function approaches are often used to address the issue of endogenous regressors, but if endogenous regressors are discrete or the mechanism generating the endogenous regressors is poorly understood, then many of these approaches are not applicable. Partial identification arises as a natural alternative to methods for point identification as a result of possible endogeneity, discrete instruments, and limited variation in the covariates. However, flexible and implementable methods in partial identification for binary response models remain underdeveloped. This paper seeks to address this gap.

We begin with a theoretical analysis of the binary response model that builds on the work connecting random set theory to partial identification. In particular, we characterize observational equivalence in terms of selections from a random set. We then define our binary response model of interest, and show how to construct a sequence of definitions of various identified sets arising from the notion of selectionability from the random set defined by our model.
Finally, we define the set of counterfactuals of interest, and show how the identified sets of various structural features of the binary response model—including the structural parameters and the distribution of latent variables—are related to the identified set of counterfactual choice probabilities.

In general, we show that constructing the identified set for counterfactual choice probabilities involves an infinite-dimensional existence problem. Intuitively, this problem arises since, for each proposed counterfactual choice probability, we must verify the existence of a distribution of the latent variables that rationalizes the observed choice probabilities through our binary choice model. However, one of our main theoretical results shows that this infinite-dimensional existence problem can be reduced to an equivalent finite-dimensional existence problem when the observed random variables are discrete. This paves the way to our formulation of the bounds on counterfactual choice probabilities in terms of optimization problems.

Our insights also reveal the importance of a special partition of the latent variable space into types that have identical responses in all possible counterfactual states. Consistent with the previous literature, we call these latent types response types (definitions are provided in Appendix A.2; see Torgovitsky (2019) for similar terminology). One of our important contributions is to show that additional assumptions can be interpreted as the elimination of response types, which amounts to assigning zero probability to regions of the latent variable space corresponding to particular profiles of counterfactual responses. Furthermore, we show that certain independence assumptions imposed on a vector of latent variables are observationally equivalent to imposing independence on response types directly. This connection helps to facilitate interpretation of these assumptions in the class of models we consider.
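The finite-dimensional reduction can be sketched numerically. In the simplest illustrative case—a single binary endogenous regressor X and no instrument—the latent space collapses to the four response types t = (t_0, t_1) ∈ {0,1}^2, and bounds on a counterfactual probability such as P(Y_1 = 1) solve small linear programs over the joint masses of (type, X). The observed distribution below and the brute-force basic-solution enumeration are purely illustrative assumptions, not the paper's algorithm:

```python
from itertools import combinations, product

# Response types for a binary regressor X in {0, 1}: t = (t0, t1), where t_x is
# the choice the type would make if X were exogenously set to x.
TYPES = list(product([0, 1], repeat=2))

# Illustrative observed joint distribution P(Y = y, X = x); X may be endogenous,
# so we only constrain the joint masses q[(t, x)] = P(type = t, X = x).
P_OBS = {(1, 1): 0.30, (0, 1): 0.20, (1, 0): 0.25, (0, 0): 0.25}
VARS = [(t, x) for t in TYPES for x in (0, 1)]

# One equality constraint per observed cell (y, x): the types consistent with the
# cell (those with t[x] == y) must carry exactly the observed mass.
CELLS = list(P_OBS)
A = [[1.0 if (x == cx and t[cx] == cy) else 0.0 for (t, x) in VARS] for (cy, cx) in CELLS]
b = [P_OBS[cell] for cell in CELLS]

# Objective: counterfactual probability P(Y_1 = 1) = total mass of types with t[1] == 1.
c = [1.0 if t[1] == 1 else 0.0 for (t, x) in VARS]

def solve_square(M, rhs):
    """Gauss-Jordan solve for a small square system; returns None if singular."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lp_bounds(A, b, c):
    """Min/max of c'q over {q >= 0 : Aq = b} by enumerating basic feasible
    solutions; the feasible set is a bounded polytope, so extrema sit at vertices."""
    m, n = len(A), len(A[0])
    lo, hi = float("inf"), float("-inf")
    for basis in combinations(range(n), m):
        sol = solve_square([[A[i][j] for j in basis] for i in range(m)], b)
        if sol is None or any(s < -1e-9 for s in sol):
            continue
        val = sum(c[j] * s for j, s in zip(basis, sol))
        lo, hi = min(lo, val), max(hi, val)
    return lo, hi

lo, hi = lp_bounds(A, b, c)
print(f"bounds on P(Y_1 = 1): [{lo:.2f}, {hi:.2f}]")  # -> [0.30, 0.80]
```

With these numbers the program recovers the familiar Manski-type bounds: the X = 1 cells pin down a contribution of 0.30, while the X = 0 mass of 0.50 can be allocated freely across types.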
We are not the first to emphasize the importance of response types, and our discussion echoes the insights of Heckman and Pinto (2018) and others.

We show how these additional functional form, independence, and monotonicity assumptions can be introduced as constraints in our optimization-based bounding procedure. We thoroughly study the special case when the index function is linear in latent variables. We show linearity in this sense can be used to impose constraints on the distribution of latent variables, and we demonstrate how these constraints can be incorporated into our optimization problems to tighten the identified set. To construct the relevant set of constraints implied by the functional form restrictions, we make connections to the literature on computational geometry and utilize the hyperplane arrangement algorithm of Gu and Koenker (2020). Furthermore, when the index function is also linear in parameters, we show that—unlike many other existing procedures in partial identification—exact (i.e. not approximate) sharp bounds on counterfactual choice probabilities can be computed without the need to grid over the entire parameter space.

Finally, we apply our method to study the effects of private health insurance on the decision to seek medical treatment. Consistent with the existing literature, we treat private health insurance status as an endogenous variable, and we consider the decision to seek medical treatment as our binary outcome variable of interest. We then consider the average treatment effect of obtaining private health insurance on the decision to visit a doctor. We find that the sign of the average treatment effect is typically only identified under our strongest assumptions. However, even our strongest assumptions are much weaker than the assumptions typically maintained in the empirical literature. Interestingly, we also find non-trivial bounds on the average treatment effect even when the structural parameters are unidentified.
Overall, the strength of the conclusions from our application is proportional to the strength of the assumptions the researcher is willing to maintain.
Binary response models with possibly endogenous regressors have been studied extensively, and previous work on the subject can be separated into two broad categories: work that focuses on conditions required for point identification, and work that allows for partial identification.

From the point identification perspective, typical approaches include (i) the use of linear probability models, (ii) maximum likelihood estimation (e.g. the bivariate probit), and (iii) control function approaches. All of these approaches have well-documented limitations. In particular, linear probability models are commonly justified as approximations to the underlying conditional expectation function for the binary dependent variable, but are known to deliver very misleading results when the conditional expectation function is highly nonlinear. Methods that use maximum likelihood—such as the bivariate probit model—enjoy efficiency gains relative to other approaches when the model is correctly specified, but require strong a priori knowledge of the mechanism generating the endogenous variables, as well as knowledge of the joint distribution of the latent variables up to some finite parameter vector. Finally, control function approaches (e.g. Blundell and Smith (1989), Blundell and Powell (2004), and Imbens and Newey (2009), among many others) relax (to some extent) the assumptions required on the latent variables, but are generally restricted to cases with continuous endogenous variables and also still require a correctly specified model for the endogenous variables in nonlinear models. Unlike the control function approach, the special regressor approach of Lewbel (2000) (see also Lewbel et al. (2012) and Dong and Lewbel (2015)) does not require the correct specification of a model for endogenous variables, but instead requires the existence of an observed continuously distributed regressor with large support that satisfies certain conditional independence assumptions.
Such a special regressor is not always readily available.

Beyond these approaches, a number of papers have also considered nonparametric identification. Nonparametric identification was studied in binary choice and threshold crossing models by Matzkin (1992), and in more general nonseparable models by Matzkin (2003) and Chernozhukov and Hansen (2005), among others. Vytlacil and Yildiz (2007) study nonparametric identification of the average treatment effect in a discrete triangular system with a binary endogenous variable under a weak separability assumption in the outcome equation. Important precedents to the work presented here from the literature on point identification in random coefficient models include Ichimura and Thompson (1998), Gautier and Kitamura (2013), and Gu and Koenker (2020). However, all of these papers focus almost exclusively on the point-identified case with a linear index function and exogenous covariates with large support.

In contrast, the literature on partial identification attempts to relax the assumptions required for point identification. In a relevant series of papers, Chesher et al. (2013) and Chesher and Rosen (2014) show how to use random set theory to characterize the identified set of structures in discrete choice models. A general formulation of their approach is presented in Chesher and Rosen (2017). Similar to the current paper, these papers do not provide a model for the endogenous covariates, rendering the discrete choice model incomplete. Chesher et al. (2013) and Chesher and Rosen (2014) then use a characterization of the sharp set of constraints given by a result due to Artstein (1983) in random set theory, which we colloquially refer to as
Artstein's inequalities. (See also Norberg (1992) and Molchanov (2017), Corollary 1.4.11.) However, without additional simplification, the sharp set of constraints implied by Artstein's inequalities can be prohibitively large. Our work extends the work by Chesher et al. (2013) and Chesher and Rosen (2014) by deriving a simplified set of constraints that contains the same identifying content as the constraints in their work. We then focus on obtaining sharp bounds on counterfactual conditional choice probabilities, and show how this can be accomplished by solving a sequence of optimization problems. A detailed comparison of our approach with the approach of Chesher et al. (2013) and Chesher and Rosen (2014) is provided in Appendix C.

(A review of approaches typically used by practitioners to address the problem of endogenous regressors in models with binary outcomes is provided in Lewbel et al. (2012), who focus on the case of a threshold-crossing model with a linear index function and additively separable errors. Lewbel et al. (2012) also construct an interesting treatment effect example with a binary outcome variable where the treatment effect is positive for everyone, but the ATE under a linear probability model is negative.)

In another relevant precedent to our work, Torgovitsky (2019) demonstrates how to construct bounds on lower-dimensional functionals of the model parameters and the latent variable distribution in a class of models with discrete outcomes and regressors. In particular, he demonstrates the conditions under which the distribution of the latent variables can be restricted to a finite set when bounding various functionals, and considers a binary response model with additive separability as his motivating example. Many of the points made in Torgovitsky (2019) will also be revisited in the current paper, and in cases when the index function is additively separable in the latent variables, our approaches will be very similar.
However, the results in Torgovitsky (2019) rely heavily on the requirement that the functional of interest, and all constraints defining the identified set, can be written in terms of the distribution function of the vector of latent variables. (The distribution function of a vector of random variables (U_1, U_2, ..., U_k) ∈ R^k is the function F : R^k → [0, 1] defined by F(u) = P(U_1 ≤ u_1, U_2 ≤ u_2, ..., U_k ≤ u_k).) This requirement is easily satisfied in the additively separable case, but in models where the index function is not additively separable, this will generally not be possible. In contrast, our framework is able to accommodate a variety of flexible assumptions on the index function, including the case when the index function is nonseparable or weakly separable. (In particular, see Lemma C.1 in Appendix C. This latter point also differentiates our work from Chiong et al. (2017) and Allen and Rehbeck (2019).)

There are a number of other relevant papers in the literature on partial identification in discrete choice models. In an important paper, Manski (2007) also considers counterfactual choice probabilities in a setting with partial identification, and shows how these counterfactual choice probabilities can be bounded using optimization problems. However, the general approach used in this paper is very different. Furthermore, we focus substantially on demonstrating how to practically incorporate a flexible set of assumptions on the latent index function, and we allow for endogenous explanatory variables. In another related and recent working paper, Tebaldi et al. (2019) study the problem of computing various counterfactual quantities in a nonparametric discrete choice model with an application to consumer choice of health insurance in California. However, they focus specifically on the case where consumers have quasi-linear utility functions (equal to their valuation of the insurance option minus the premium) and use the particular structure of their setting to construct bounds on these counterfactual quantities. (See Galichon and Henry (2011) for an early discussion of incompleteness in the context of empirical entry games, and Russell (2019) for a discussion in the context of estimating treatment effects.)

Closely related to the problem of bounding counterfactual choice probabilities is the problem of bounding parameters in the literature on treatment effects with binary outcome variables. Analytic bounds in triangular systems of equations with binary dependent variables under various assumptions are considered by Chiburis (2010), Shaikh and Vytlacil (2011), and Mourifié (2015). An optimization-based approach to bounding treatment effect parameters is presented in Russell (2019) in the discrete case, and Gunsilius (2020) in the continuous case. We will attempt to make a connection to the literature on treatment effects throughout the paper when appropriate.

This paper also makes a connection to the literature on computational geometry. In the case of a linear index function, computation of our bounds requires the analysis of a partition of the latent space determined by finitely many hyperplanes. This turns out to be a well-studied subject in combinatorial geometry called hyperplane arrangement, and leads us to consideration of the enumeration algorithm proposed by Gu and Koenker (2020).
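To fix ideas, consider a toy random-coefficient specification in which the index is θ_1 + θ_2·x, so each support point x_j of the covariate induces the hyperplane {θ : θ_1 + θ_2·x_j = 0} in the latent space; the cells of the resulting arrangement correspond to the response types. The sketch below uses naive Monte Carlo sampling, not the exact enumeration algorithm of Gu and Koenker (2020), and the support points are made up for illustration:

```python
import random

# Toy linear index phi(x, theta) = theta_1 + theta_2 * x: each support point x_j
# of the covariate induces the hyperplane {theta : theta_1 + theta_2 * x_j = 0}.
SUPPORT = [-1.0, 0.0, 1.0, 2.0]  # illustrative covariate support

def sign_vector(theta):
    """Response profile of a latent draw: 1{theta_1 + theta_2 * x_j >= 0} across
    support points. Points in the same cell of the arrangement share this vector."""
    t1, t2 = theta
    return tuple(int(t1 + t2 * x >= 0) for x in SUPPORT)

# Naive Monte Carlo "enumeration": sample latent draws and collect the distinct
# sign vectors; this finds every cell of non-negligible volume, unlike an exact
# combinatorial algorithm, which also certifies that no cell is missed.
random.seed(0)
cells = {sign_vector((random.uniform(-5, 5), random.uniform(-5, 5)))
         for _ in range(100_000)}
print(f"{len(cells)} nonempty cells out of {2 ** len(SUPPORT)} conceivable sign vectors")
```

Here the four hyperplanes pass through the origin of R^2, so the arrangement has 2 × 4 = 8 cells: only 8 of the 2^4 = 16 conceivable response profiles are actually feasible, which is exactly the kind of restriction the arrangement delivers.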
The remainder of the paper proceeds as follows. Section 2 introduces the main theoretical framework and main assumptions. Section 3 studies practical implementation of the theoretical framework from Section 2 and introduces our optimization-based bounding procedure for counterfactual choice probabilities. Section 4 then demonstrates how to introduce functional form, independence, and monotonicity assumptions into our bounding procedure, and Section 5 applies our methodology to study the impact of health insurance on utilization of health care services. Section 6 concludes. All proofs can be found in Appendix A. Appendix B provides some additional discussion of the results presented in the main text, and a comparison of our approach to the approach based on Artstein's inequalities from Chesher et al. (2013) and Chesher and Rosen (2014) is presented in Appendix C.
Notation:
The following notation is relevant for both the main text and the appendices. Given a subset X of Euclidean space, we use B(X) to denote the Borel σ-algebra on X. For two measurable spaces (X, B(X)) and (X′, B(X′)), the product σ-algebra on X × X′ is denoted by B(X) ⊗ B(X′). Random variables are denoted using capital letters, and if X : (Ω, A) → (X, B(X)) is a random variable defined on the probability space (Ω, A, P), then we use P_X to denote the probability measure induced on X by X; that is, for any A ∈ B(X), P_X(A) := P(X^{-1}(A)). Furthermore, we interpret P_{X|X′}(X ∈ A | X′ = x′) as a regular conditional probability, and P_{X|X′} is used as shorthand for the collection P_{X|X′} := {P_{X|X′}( · | X′ = x′) : x′ ∈ X′}. We do not explicitly differentiate between scalars and vectors, or random variables and random vectors. To keep the notation clean, we will omit the transpose when combining column vectors; that is, if v_1 and v_2 are two column vectors, rather than write v = (v_1^⊤, v_2^⊤)^⊤ we instead write v = (v_1, v_2), where it is understood that v is a column vector unless otherwise specified.

(In particular, Tebaldi et al. (2019) consider a multinomial choice model with preferences over insurance options given by the difference between the consumer's latent valuation and the consumer's premium for each option. Endogeneity arises because of possible dependence between valuations and premiums. However, in their setting (subsidized) premiums are deterministic functions of the coverage area, age, and income. The authors then discretize age and income, and assume that a valuation distribution is fixed within a given coverage area and discretized age and income bin; the remaining variation in premiums within each coverage area and discretized age and income bin is then considered to be exogenous.)
In this section we begin the theoretical analysis by introducing some key assumptions and definitions. We will first introduce our main assumptions on the binary response model, and connect our assumptions to the definition of the identified set of (conditional) latent variable distributions. We then discuss the set of counterfactual parameters of interest in this paper, and show how the definition of the identified set of latent variable distributions is related to the identified set of counterfactual conditional choice probabilities. We will use the results in this section when we introduce our practical method of bounding counterfactual choice probabilities in the next section.
We start by introducing our main assumptions on the binary response environment under consideration.
Assumption 2.1.
There exists a complete probability space (Ω, A, P), a random variable Y : Ω → {0, 1}, and random vectors X : Ω → X ⊆ R^{d_x}, Z : Ω → Z ⊆ R^{d_z} and θ : Ω → Θ ⊆ R^{d_θ}, with Θ compact, satisfying:

Y = 1{ϕ(X, Z, θ, β) ≥ 0} a.s., (2.1)

for some function ϕ( · , β) : X × Z × Θ → R parameterized by β ∈ B ⊆ R^{d_β} such that ϕ(x, z, · , β) is continuous for each (x, z, β) and ϕ( · , · , θ, β) is measurable for each (θ, β). Furthermore, |X| = m_x < ∞ and |Z| = m_z < ∞, and the spaces X, Z and Θ are equipped with the Borel σ-algebra.

In Assumption 2.1, θ ∈ Θ is a vector of latent variables, β ∈ B is a vector of fixed coefficients, and X ∈ X ⊂ R^{d_x} and Z ∈ Z ⊂ R^{d_z} are vectors of covariates. The finite-dimensional vector of latent variables has a natural interpretation in the binary response model as unobserved types. The model in (2.1) allows for general nonseparability between covariates and latent variables, and thus allows for type-specific marginal effects. The latent variables can also be interpreted as random coefficients, in which case there is no restriction on which covariates are assigned fixed versus random coefficients by the index function ϕ. Furthermore, we do not impose any parametric or continuity assumptions on the distribution of latent variables. Finally, compactness of Θ is a technical requirement used to verify the measurability of certain random sets that will appear shortly. Otherwise compactness does not play a significant role, and it is ignored throughout much of the discussion in the main text.

The assumptions on the index function ϕ imply that it is a Caratheodory function, which is important to establish certain measurability results (see Appendix A.2 and the discussion below). The exact form of the index function may or may not be known to the researcher. For now there is no distinction between X and Z, and either may be dependent with the latent vector θ.
However, later in the paper we will distinguish X from Z by introducing independence assumptions between Z and the vector of latent variables θ. The variables X and Z can be seen as consisting of utility-relevant attributes of the set of alternatives and the set of individual decision makers. We focus throughout the paper on the case when the joint support X × Z is finite with m := m_x · m_z points of support, although m may be taken to be very large. Throughout the paper we will switch freely between indexing the points in X × Z either by {(x_1, z_1), ..., (x_1, z_{m_z}), (x_2, z_1), ..., (x_{m_x}, z_{m_z})} or by {(x_1, z_1), (x_2, z_2), ..., (x_m, z_m)}, depending on which method is more convenient for our purpose. Finally, it is important to note that, because of finiteness of the support X × Z, it is possible to construct a model satisfying Assumption 2.1 that can rationalize any observed joint distribution of Y, X and Z. In this sense, the model in (2.1) has not yet imposed any significant structure.

We assume that the researcher's objective throughout is to obtain a sharp set of constraints defining the identified set of latent variable distributions, and to use this characterization to bound various counterfactual quantities such as counterfactual conditional choice probabilities, or functionals of the distribution of latent variables. A general characterization of the identified set of latent variable distributions is provided in Chesher et al. (2013) and Chesher and Rosen (2014) using Artstein's inequalities from random set theory. A comparison of our work with the approach based on Artstein's inequalities is provided in Appendix C. While our approach does not explicitly make use of Artstein's inequalities, similar to Chesher et al. (2013), Chesher and Rosen (2014) and Chesher and Rosen (2017), we take the selectionability relation as a primitive relation on which to construct a definition of the identified set. The close connection between the selection relation from random set theory and the concept of observational equivalence from the work in econometrics on identification has been appreciated in Beresteanu and Molinari (2008), Beresteanu et al. (2011), Beresteanu et al. (2012), Chesher et al. (2013), Chesher and Rosen (2014), and Chesher and Rosen (2017), among many others.
We will continue this work here. In particular, we will define the set:

G^{-1}(y, x, z, β) := {θ : y = 1{ϕ(x, z, θ, β) ≥ 0}}. (2.2)

Intuitively, (2.2) delivers all possible values of the latent variables θ consistent with the vector (y, x, z, β) given the binary response model in (2.1). A measurable selection from the random set G^{-1}(Y, X, Z, β) is a random vector θ : Ω → Θ satisfying θ ∈ G^{-1}(Y, X, Z, β) a.s. A general definition of a selection and a random set is provided in Appendix A.2. (If (S, Σ) is a measurable space and A and B are topological spaces, then we call f : S × A → B a Caratheodory function if for each a ∈ A we have that f( · , a) is a measurable function, and if for each s ∈ S we have that f(s, · ) is continuous. See Definition 4.50 in Aliprantis and Border (2006).) Importantly, given a distribution of the observable random vectors (Y, X, Z), a structural function ϕ and a fixed coefficient β ∈ B, any two measurable selections θ and θ′ from the random set G^{-1}(Y, X, Z, β) will be observationally equivalent in the sense that both latent variable vectors θ and θ′ could have generated the observed distribution of Y, X and Z through the model (2.1). (See Manski (1977) for a discussion.) Framed in this manner, constructing the identified set of latent variable distributions then becomes a problem of verifying whether a given random vector θ : Ω → Θ is a measurable selection from the random set in (2.2), and then collecting the distributions of all such selections.

We are now prepared to present our definition of the joint identified set for the (conditional) latent variable distribution and coefficients β.

Definition 2.1 (Identified Set). Under Assumption 2.1, the (joint) identified set I*_{Y,X,Z} of conditional latent variable distributions P_{θ|Y,X,Z} and fixed coefficients β is the set of all pairs (P_{θ|Y,X,Z}, β) satisfying:

P_{θ|Y,X,Z}(θ ∈ G^{-1}(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, (y, x, z)-a.s.
(2.3)

As promised, this definition relies heavily on the idea of selectionability from the random set G^{-1}(Y, X, Z, β), and permits only the (conditional) distributions of selections from G^{-1}(Y, X, Z, β) to belong to the identified set. Note that this definition of the identified set also implicitly depends on the distribution of (Y, X, Z) through the almost-sure relation in (2.3); any value of (y, x, z) assigned zero probability by the observed distribution does not impose any restrictions on the distribution of θ. Importantly, the definition conditions on the value of the endogenous outcome variable Y. This conditioning will be carried throughout the paper, and we will see later on that it allows us to bound some interesting, albeit less-typical, counterfactual parameters that may be relevant to policy analysis.

We use Definition 2.1 as a primitive starting point with the goal of providing a definition of the identified set for counterfactual choice probabilities, as well as definitions of the identified set for other objects of interest. For example, using this definition we can also provide definitions of the identified set for various projections of the joint identified set. Under Assumption 2.1, the identified set of fixed coefficients B* is given by:

B* := {β : ∃ P_{θ|Y,X,Z} s.t. (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}}.

Furthermore, under Assumption 2.1, the identified set of conditional latent variable distributions P*_{θ|Y,X,Z} is given by:

P*_{θ|Y,X,Z} := {P_{θ|Y,X,Z} : ∃ β s.t. (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}}.

Conditioning on the value of the observed endogenous outcome variable Y may be new to many readers; however, the identified set P*_{θ|X,Z} for conditional latent variable distributions P_{θ|X,Z} can be constructed via integration of distributions P_{θ|Y,X,Z} ∈ P*_{θ|Y,X,Z} with respect to the observed conditional choice probabilities. For the sake of comparison with the previous literature, in Appendix B.1 we show how the identified sets of (conditional) latent variable distributions are related to conditional choice probabilities in our setup.

2.2 Bounding Counterfactual Quantities
In the previous subsection, we discussed the relevant set of constraints defining the identified set of conditional latent variable distributions (conditional on (Y, X, Z), or conditional on (X, Z), as in Appendix B.1). In this subsection we present definitions and results for the identified set of counterfactual conditional choice probabilities conditional on Y, X and Z. Throughout the remainder of the paper we focus most of our attention on bounding counterfactual choice probabilities, although our framework is immediately applicable to any parameter that can be written as a linear function of counterfactual choice probabilities; for example, the average treatment effect, and the average structural function with various levels of conditioning.

In this paper, we will limit ourselves to the class of so-called interventionist counterfactuals. Interventionist counterfactuals take as primitive an existing set of structural equations relating causes and effects, and each equation in the system of structural equations is autonomous in the sense that it remains unaltered under external manipulations of its inputs. In our case, the relevant structural equation is given by the binary response model in (2.1), where the random vectors X and Z might be interpreted as relevant causes of the binary random variable Y. However, differing from the typical structural equations environment, we allow for X to be endogenous without explicitly providing a structural equation for it.

In this setup, an interventionist counterfactual is represented by a process that exogenously manipulates the values of X and Z. For exogenous random variables—that is, those whose values are determined outside of the model—we simply replace the random variable by its value under consideration in the counterfactual. For endogenous random variables—that is, those whose values are determined by a function of the other exogenous and endogenous variables within a model—the function determining the value of the endogenous variable is deleted from the system, and the endogenous variable is replaced by its value under consideration in the counterfactual.
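This procedure can be sketched in a few lines. The linear index, the parameter values, and the example interventions below are all made-up illustrations (none come from the paper); the point is only that an intervention re-evaluates the autonomous outcome equation at the manipulated values while holding the latent type fixed and discarding whatever mechanism generated X:

```python
def phi(x, z, theta, beta):
    """Illustrative index with random intercept and slope, theta = (t1, t2)."""
    t1, t2 = theta
    return t1 + t2 * x + beta * z

def counterfactual_choice(gamma, x, z, theta, beta=0.1):
    """Y_gamma: evaluate the outcome equation at gamma(x, z), holding theta fixed.
    No model for X is needed; the mechanism generating X is simply deleted."""
    cx, cz = gamma(x, z)
    return int(phi(cx, cz, theta, beta) >= 0)

# Example interventions: exogenously set X to 1, or shift Z by one unit.
set_x_to_one = lambda x, z: (1, z)
shift_z      = lambda x, z: (x, z + 1)

theta = (-0.2, 1.0)                                      # one latent type
print(counterfactual_choice(set_x_to_one, 0, 0, theta))  # 1{-0.2 + 1.0 >= 0} -> 1
print(counterfactual_choice(shift_z, 0, 0, theta))       # 1{-0.2 + 0.1 >= 0} -> 0
```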
Such counterfactuals have a natural interpretation as "hypothetical experiments," and are widely attributed to Haavelmo (1943, 1944). We now introduce the following assumption on the counterfactual domain, which summarizes the discussion above.
Assumption 2.2 (Counterfactual Domain) . For a collection of functions Γ with typical element γ : X ×Z → X × Z , there exists a collection of random variables { Y ( · , γ ) : Ω → { , } | γ ∈ Γ } , abbreviated as Y γ := Y ( · , γ ) , representing counterfactual choices for each γ such that Y γ : Ω → { , } is measurable foreach γ , and: P Y γ | Y,X,Z,θ ( Y γ = { ϕ ( γ ( X, Z ) , θ, β ) ≥ } | Y = y, X = x, Z = z, θ ) = 1 , (2.4) P Y,X,Z,θ − a.s. for the same θ ∈ Θ and β ∈ B as in Assumption 2.1, and for all γ ∈ Γ . Assumption 2.2 is needed in order to assign a counterfactual interpretation to many of the results that Heckman and Pinto (2015) attribute the notion of autonomous equations to Frisch (1938). It is natural to imagine that a structural equation exists that determines X as a function of its (potential) causes, but thatwe remain ignorant as to its exact form. We refer the reader to Pearl (2009) Section 7.1 for a discussion of a similar procedure. γ ∈ Γ thatrepresent counterfactual responses or choices and that are defined on the same probability space as therandom vector (
Y, X, Z, θ). It then explicitly links these counterfactual random variables to the binary response model from Assumption 2.1 through condition (2.4). Assumption 2.2 implies that (i) counterfactual response variables exist on the common probability space, and (ii) such counterfactual response variables are equal (almost surely) to the values that would arise after an intervention on the system represented by (2.1). It also implicitly encodes an invariance assumption; other than changes in (x, z) induced by γ : X × Z → X × Z, all other aspects of the environment are held constant.

Note that Assumption 2.2 imposes that each counterfactual be represented by a function γ : X × Z → X × Z belonging to some collection Γ. Each γ ∈ Γ can be interpreted as an assignment to a state or a treatment. Taking γ as a function allows us to consider a general class of counterfactuals that allows the counterfactual under consideration to depend on the observed values of X and Z. Although each function γ is seen as a map from X × Z to itself, this does not prevent consideration of counterfactuals where γ selects values of (x, z) that have never been observed in the data. Such cases can be accommodated by simply extending the support X × Z from Assumption 2.1 to include the counterfactual pair (x, z) of interest. This approach does not affect anything we have presented thus far (or anything we will present), since we always require any relation to the observed distribution of (
Y, X, Z) to hold only almost-surely. This means that our framework can be used to study the impact of historical interventions, as well as forecast the impacts of interventions in environments never before experienced.

Remark 2.1.
The random variables Y_γ can be related to potential outcomes from the literature on treatment effects. Interpreting these variables as potential outcomes helps to clarify the invariance assumption made in Assumption 2.2. For example, suppose we have only a binary variable X ∈ {0, 1} and latent variables θ (i.e. no variables Z and no fixed coefficients β). Then the structural function from (2.1) can be written as ϕ(X, θ). In this simple case we have γ(x) ∈ {0, 1}, and we can define the random variables Y_0 and Y_1 as:

Y_0(ω) := 1{ϕ(0, θ(ω)) ≥ 0},    Y_1(ω) := 1{ϕ(1, θ(ω)) ≥ 0}.

The observed choice for an individual indexed by ω ∈ Ω is then given by:

Y(ω) := Y_0(ω)(1 − X(ω)) + Y_1(ω)X(ω).

That is, the observed choice corresponds to Y_0(ω) if X(ω) = 0 and Y_1(ω) if X(ω) = 1, where Y_0(ω) and Y_1(ω) are potential outcomes. Extending the analogy beyond the case of binary X is straightforward, and it will sometimes be useful to refer to this connection to potential outcomes when discussing the interpretations of some of our parameters.

(In terms of practical interpretation, here we do not specify how such an intervention is to be carried out, and instead suppose that the random variables Y(ω, γ) are stable in the sense that they do not depend on any mechanism that may be generating the intervention γ. This assumption is very common, and we refer to Heckman and Vytlacil (2007) pp. 4790 - 4801 for a more detailed discussion of similar assumptions. For example, consider Theorem B.1: this result is completely unaffected by arbitrarily enlarging X × Z, since the conditions in the Theorem need only hold almost surely, and since any elements added to the initial support of X and Z will necessarily be assigned probability zero by the observed distribution of (Y, X, Z). See also the three policy evaluation problems of Heckman and Vytlacil (2007) pp. 4790 - 4792.)

Assumption 2.2 on the counterfactual domain leads directly to our definition of the identified set for counterfactual conditional choice probabilities.
Definition 2.2 (Identified Set of Counterfactual Conditional Choice Probabilities). Under Assumptions 2.1 and 2.2, the identified set of counterfactual conditional choice probabilities P*_{Y_γ | Y,X,Z,θ} is the set of all conditional distributions P_{Y_γ | Y,X,Z,θ} satisfying:

P_{Y_γ | Y,X,Z,θ}( Y_γ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ ) = 1,   (2.5)

(y, x, z, θ)-a.s. for some (P_{θ | Y,X,Z}, β) ∈ I*_{Y,X,Z}.

Note that this definition makes an explicit reference to the identified set I*_{Y,X,Z} presented in Definition 2.1, which in turn is derived from a selection relation. Intuitively, this definition says that a given conditional distribution for counterfactual choices belongs to the identified set if there exists a pair (P_{θ | Y,X,Z}, β) ∈ I*_{Y,X,Z} that can rationalize such a counterfactual distribution. As was the case with Definition 2.1, this definition of the identified set can be used as a starting point to define other related identified sets. For example, the identified set of counterfactual conditional choice probabilities P*_{Y_γ | Y,X,Z} is the set of all conditional distributions P_{Y_γ | Y,X,Z} satisfying:

P_{Y_γ | Y,X,Z}( Y_γ = y′ | Y = y, X = x, Z = z ) = ∫ P_{Y_γ | Y,X,Z,θ}( Y_γ = y′ | Y = y, X = x, Z = z, θ ) dP_{θ | Y,X,Z},   (2.6)

(y, x, z)-a.s. for some triple (P_{Y_γ | Y,X,Z,θ}, P_{θ | Y,X,Z}, β) satisfying (2.5). The conditional choice probability in (2.6) then allows us to answer questions of the form: “given the observed values of (
Y, X, Z) are (y, x, z), what would be the expected response if we were to set the values of (
X, Z) to be γ(x, z)?” As we mentioned earlier, conditioning on the value of Y may seem unfamiliar to most, but it allows us to answer a new set of policy questions that condition on the current observed response when evaluating the expected counterfactual response.

In addition, the identified set for the (conditional) average structural function P*_{Y_γ | X,Z} will be the set of all probabilities satisfying:

P_{Y_γ | X,Z}( Y_γ = y′ | X = x, Z = z ) = Σ_{y ∈ {0,1}} P_{Y_γ | Y,X,Z}( Y_γ = y′ | Y = y, X = x, Z = z ) P( Y = y | X = x, Z = z ),

(x, z)-a.s. for some P_{Y_γ | Y,X,Z} ∈ P*_{Y_γ | Y,X,Z}. Finally, when γ, γ′ ∈ Γ represent two competing policies, the identified set for the (conditional) average treatment effect from moving from policy γ to policy γ′ is given by the set of all values:

P_{Y_γ′ | X,Z}( Y_γ′ = y′ | X = x, Z = z ) − P_{Y_γ | X,Z}( Y_γ = y′ | X = x, Z = z ),   (2.7)

where both terms are average structural functions. We will show how to construct sharp bounds on all of these objects using our framework.

(This conforms with a counterfactual in the three-level hierarchy of action, prediction and counterfactuals presented in Pearl (2009). See sections 1.4 and 7.2.)

Using Definition 2.1, we now present a result that provides an intuitive but important link between counterfactual choice probabilities and the conditional distribution of latent variables.

Theorem 2.1.
Suppose that Assumptions 2.1 and 2.2 hold. Then a distribution of counterfactual conditional choice probabilities P_{Y_γ | Y,X,Z} satisfies P_{Y_γ | Y,X,Z} ∈ P*_{Y_γ | Y,X,Z} if and only if there exists a pair (P_{θ | Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x, Z = z ) = P_{θ | Y,X,Z}( ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z ),   (2.8)

(y, x, z)-a.s.

Theorem 2.1 provides the theoretical link between the identified set of counterfactual conditional choice probabilities, and the identified set for the pair (P_{θ | Y,X,Z}, β). In particular, a given counterfactual conditional choice probability is a member of the identified set if and only if there exists a conditional distribution P_{θ | Y,X,Z} and parameter β satisfying (2.3) that can also rationalize the counterfactual conditional choice probabilities through (2.8). We will see that the distribution of the observed random variables is informative about the distribution of latent variables, which in turn are informative about counterfactual choices.

While the result is theoretically straightforward, it hides some important practical difficulties that arise when constructing the identified set for counterfactual conditional choice probabilities. In particular, verifying the existence of a pair (P_{θ | Y,X,Z}, β) that satisfies the conditions from Definition 2.1 is a nontrivial task. This is at least partly due to the fact that P_{θ | Y,X,Z} is an infinite dimensional object, even in the case when both X and Z have finite support. This infinite dimensional existence problem is exacerbated in practice by the fact that P_{θ | Y,X,Z} must satisfy a number of constraints to ensure it is consistent with the binary response model through (2.3), and to ensure it is a proper conditional probability measure. We consider these practical difficulties in detail in the next section.
In order to bound counterfactual probabilities using Theorem 2.1, we must verify the existence of a collection of Borel probability measures on Θ that are consistent with the binary response model through (2.3). However, solving this existence problem by explicitly constructing a probability measure on all Borel sets of Θ seems excessively difficult and naive. Instead, we would like to consider a finite collection of Borel sets that are both necessary and sufficient for this existence problem in the sense that, to solve the existence problem, it is both necessary and sufficient that we be able to construct a conditional probability measure on our finite collection of sets. To make progress towards our goal, let us define the following vector-valued function:

r(β, θ) := ( 1{ϕ(x_1, z_1, θ, β) ≥ 0}, 1{ϕ(x_1, z_2, θ, β) ≥ 0}, ..., 1{ϕ(x_1, z_{m_z}, θ, β) ≥ 0}, 1{ϕ(x_2, z_1, θ, β) ≥ 0}, 1{ϕ(x_2, z_2, θ, β) ≥ 0}, ..., 1{ϕ(x_{m_x}, z_{m_z}, θ, β) ≥ 0} )′.   (3.1)

Furthermore, for a fixed binary vector s ∈ {0, 1}^m let us define the set:

Θ(β, s) := { θ ∈ Θ : r(β, θ) = s }.   (3.2)

The sets from (3.2) partition the space Θ into at most L := 2^m sets, with each set being uniquely associated with a binary vector s ∈ {0, 1}^m. This comes from the fact that there are m points of support in X × Z (and so m rows in r(β, θ)) and each row of r(β, θ) can take values either 0 or 1. Similar objects to r(β, θ) have appeared previously in the literature (e.g. Balke and Pearl (1994), Heckman and Pinto (2018)), and so to remain consistent with the previous literature we call the functions r : B × Θ → {0, 1}^m defined in (3.1) response types. In the discrete choice setting, these response types tell us the choices that an individual with type indexed by (β, θ) would have made had they been assigned alternate pairs of (x, z).
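As a purely illustrative sketch, the response-type map in (3.1) can be computed directly once an index function and support points are specified. The index function, grid, and all names below are our own toy choices, not part of the paper's framework:

```python
# Toy sketch (our own choices, not the paper's): compute the response types
# r(beta, theta) from (3.1) for the linear index phi(x, theta) = theta0 + theta1*x
# with support x in {0, 1} (so m = 2) and no z or beta.
from itertools import product

support = [0, 1]  # the m = 2 points of support for x

def response_type(theta):
    """Vector of indicators 1{phi(x_j, theta) >= 0}, one entry per support point."""
    theta0, theta1 = theta
    return tuple(int(theta0 + theta1 * x >= 0) for x in support)

# Sweep a grid of latent types and record which binary vectors s in {0,1}^m
# are realized, i.e. which cells Theta(beta, s) of the partition are non-empty.
grid = list(product(range(-2, 3), repeat=2))
realized = sorted({response_type(th) for th in grid})

print(realized)  # [(0, 0), (0, 1), (1, 0), (1, 1)]: all 2^m = 4 types occur
```

For this flexible index every sign pattern is realized; later sections show how functional form restrictions make some of these cells empty.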
Any two individuals characterized by values of θ from the same set Θ(β, s) will make identical choices in every counterfactual, and so the values of θ define a natural equivalence class of latent types. We will see shortly that response types represent the basic building blocks of all of our counterfactual objects of interest. Indeed, they generate the coarsest partition of the unobservable space Θ needed to bound a variety of counterfactual quantities while still retaining all information from Assumptions 2.1 and 2.2.

(A similar problem is addressed in Torgovitsky (2019), although we note that his general framework is not immediately applicable here since we are dealing with probability measures rather than distribution functions. We find that for many of the models we consider, it is simply not possible to write the identified set and functional of interest in terms of the multi-dimensional distribution function for the latent variables. The collection of sets defining response types also appears to be similar to the “minimal relevant partition” (MRP) in Tebaldi et al. (2019), as well as the partition described in Chesher and Rosen (2014) Appendix B.)

After partitioning the space of latent variables using response types, various counterfactual objects of interest can be written as a disjoint union of the sets Θ(β, s) from (3.2) that comprise our partition. For the j-th point of support (x_j, z_j), define the set:

S_j := { s ∈ {0, 1}^m : s_j = 1 },   (3.3)

for j = 1, ..., m. Note that each set S_j is comprised of all binary vectors that have a j-th entry equal to 1, and thus contains exactly 2^{m−1} elements. Now note, by definition of the sets Θ(β, s) and S_j we have:

{ θ : ϕ(x_j, z_j, θ, β) ≥ 0 } = ∪_{s ∈ S_j} Θ(β, s).

Furthermore, for s′ ≠ s the definition of the sets Θ(β, s) from (3.2) ensures we have Θ(β, s′) ∩ Θ(β, s) = ∅, so that the union in the previous display is a disjoint union.
Thus, we have the following decomposition:

P_{θ | Y,X,Z}( ϕ(x_j, z_j, θ, β) ≥ 0 | Y = y, X = x, Z = z ) = Σ_{s ∈ S_j} P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x, Z = z ).

Such a decomposition holds for any j = 1, ..., m. When the conditioning values (x, z) differ from the values (x_j, z_j) in the structural function, an application of Theorem 2.1 shows that the left hand side of this display represents a counterfactual conditional choice probability, illustrating the connection between response types and counterfactual choices.

Remark 3.1.
Response types have a natural interpretation in the potential outcome framework as a collection of potential outcomes. To illustrate, let us return to the potential outcome interpretation introduced in Remark 2.1. In particular, suppose we have only a binary variable X ∈ {0, 1} and latent variables θ (i.e. no variables Z and no fixed coefficients β). Then the structural function from (2.1) can be written as ϕ(X, θ) and the binary response vector r(β, θ) can be written as r(θ). As in Remark 2.1, let us define the random variables:

Y_0(ω) := 1{ϕ(0, θ(ω)) ≥ 0},    Y_1(ω) := 1{ϕ(1, θ(ω)) ≥ 0}.

Now consider the four possible binary vectors s ∈ {0, 1}^2:

s_1 = (1, 1)′,  s_2 = (1, 0)′,  s_3 = (0, 1)′,  s_4 = (0, 0)′.

In this simple model, we can see that events of the form {ω : r(θ(ω)) = s} can be written as conjunctions of events involving the potential outcomes Y_0 and Y_1; in particular, we have:

{ω : r(θ(ω)) = s_1} = {ω : Y_0(ω) = 1, Y_1(ω) = 1},
{ω : r(θ(ω)) = s_2} = {ω : Y_0(ω) = 1, Y_1(ω) = 0},
{ω : r(θ(ω)) = s_3} = {ω : Y_0(ω) = 0, Y_1(ω) = 1},
{ω : r(θ(ω)) = s_4} = {ω : Y_0(ω) = 0, Y_1(ω) = 0}.

From here it is easy to see that, if probabilities can be assigned to the events above then—given that these events are disjoint—probabilities can also be assigned to any union of these events; the latter would include parameters like counterfactual choice probabilities, the average structural function or the average treatment effect. An individual with a given response type can thus be equivalently viewed as an individual with a fixed vector of potential outcomes.

(It is useful to note that the sets {S_j}_{j=1}^m are not disjoint; indeed, it is easy to show that S_j ∩ S_k ≠ ∅ and S_j ∩ S_k^c ≠ ∅ for every j ≠ k.)
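To make the correspondence concrete, here is a small numerical sketch with hypothetical type probabilities of our own choosing: once masses are assigned to the four disjoint type events, any union of them, such as the average structural function or the average treatment effect, is obtained by summation.

```python
# Toy sketch with made-up masses: each response type s = (Y0, Y1) is a fixed
# pair of potential outcomes, so linear functionals of the type probabilities
# give parameters like P(Y0 = 1), P(Y1 = 1), and the average treatment effect.
p = {(1, 1): 0.25, (1, 0): 0.25, (0, 1): 0.375, (0, 0): 0.125}  # hypothetical

p_y0 = sum(q for s, q in p.items() if s[0] == 1)  # P(Y0 = 1)
p_y1 = sum(q for s, q in p.items() if s[1] == 1)  # P(Y1 = 1)
ate = p_y1 - p_y0                                 # average treatment effect

print(p_y0, p_y1, ate)  # 0.5 0.625 0.125
```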
The following Theorem shows that, in order to rationalize a given counterfactual conditional choice probability under our assumptions using Theorem 2.1, for each fixed β it is both necessary and sufficient to construct a probability measure on sets of the form Θ(β, s) from (3.2) satisfying the constraints of Theorem 2.1. This result thus provides the much needed simplification from the infinite dimensional existence problem from the previous subsection to a more manageable finite dimensional existence problem. Since γ : X × Z → X × Z, it will be useful in the statement of the result to redefine γ as a map on indices, with γ(j) denoting the index of the point in {(x_1, z_1), ..., (x_m, z_m)} assigned under counterfactual γ.

Theorem 3.1.
Suppose Assumptions 2.1 and 2.2 hold. Fix some β ∈ B and consider the collection of sets:

A(β) := { Θ(β, s) : s ∈ {0, 1}^m }.   (3.4)

Then for any collection of counterfactual conditional choice probabilities P_{Y_γ | Y,X,Z}, there exists a collection of Borel conditional probability measures P_{θ | Y,X,Z} satisfying (2.8) with (P_{θ | Y,X,Z}, β) ∈ I*_{Y,X,Z} if and only if there exists a collection P_{θ | Y,X,Z} of probability measures on the sets in A(β) satisfying:

Σ_{s ∈ S_j} P_{θ | Y,X,Z}( Θ(β, s) | Y = 1, X = x_j, Z = z_j ) = 1,   (3.5)

Σ_{s ∈ S_j^c} P_{θ | Y,X,Z}( Θ(β, s) | Y = 0, X = x_j, Z = z_j ) = 1,   (3.6)

Σ_{s ∈ S_{γ(j)}} P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x_j, Z = z_j ) = P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ),   (3.7)

for all y ∈ {0, 1} and j ∈ {1, ..., m} assigned positive probability.

Theorem 3.1 reduces our infinite dimensional existence problem to a finite dimensional existence problem. Indeed, the constraints in (3.5) and (3.6) are linear constraints on a now finite dimensional probability vector with typical element P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x, Z = z ). This leads naturally to the optimization formulation of bounds on counterfactual choice probabilities considered in the next subsection. Note that this result relies crucially on the finiteness of X and Z. All of our counterfactual parameters of interest in this paper can be constructed from the basic counterfactual probability of the form:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x, Z = z ).
(3.8)

The result thus also implies that the constraints (3.5) - (3.7) are sufficient to bound any counterfactual parameter that can be written as a linear function of counterfactual probabilities of the form (3.8).

At first glance it may be surprising to note that the constraints in Theorem 3.1 depend on the observed distribution only through the condition that each constraint must hold for all values of (y, x, z) assigned positive probability by the observed distribution of (Y, X, Z). Beyond these conditions, the observed distribution plays no role in Theorem 3.1. While this may appear to be cause for alarm, we remind readers that Assumptions 2.1 and 2.2 impose minimal structure on the binary response model; here the function ϕ can be extremely flexible, and the variables X and Z can be endogenous. Theorem 3.1 shows that in this very flexible environment, the observed conditional choice probabilities do not provide any substantial information on counterfactual choice probabilities. In the next section we will use this idea to formulate an impossibility result that will be useful to motivate the need for additional assumptions in this binary response model. However, we will first take this basic environment as given and show how bounds on counterfactual choice probabilities can be formulated as optimization problems. Our result on the formulation in terms of optimization problems will then be built upon in later sections when we show how to incorporate further assumptions. Finally, although we focus on an optimization-based approach, we believe the partition of the latent variable space in terms of response types may also be useful for researchers interested in studying sufficient conditions for point identification, as in Heckman and Pinto (2018). We will not pursue this approach here.
The linear constraints defining the identified set of counterfactual conditional choice probabilities lead us naturally to consider bounding counterfactual choice probabilities by solving optimization problems. We will suppose throughout this subsection that our objective is to bound the counterfactual choice probability:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ),   (3.9)

for some j ∈ {1, ..., m}. However, all of the results in this section are immediately applicable to the case when we wish to bound some linear function of these counterfactual choice probabilities, including probabilities of the form:

P_{Y_γ | X,Z}( Y_γ = 1 | X = x_j, Z = z_j ).   (3.10)

Recall that Theorem 3.1 implies our counterfactual object of interest can be rewritten as:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ) = Σ_{s ∈ S_{γ(j)}} P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x_j, Z = z_j ),

where γ(j) is the index in {1, ..., m} assigned to j under counterfactual γ, and where the set S_{γ(j)} is given by S_{γ(j)} = { s ∈ {0, 1}^m : s_{γ(j)} = 1 } (the analog of S_j from (3.3)). To progress further, let us define the parameter:

ν(y, x, z, β, s) = P_{θ | Y,X,Z}( Θ(β, s) | Y = y, X = x, Z = z ).

For the sake of notation it will also be useful to define the following parameter vectors:

ν(y, β, s) := ( ν(y, x_1, z_1, β, s), ν(y, x_1, z_2, β, s), ..., ν(y, x_1, z_{m_z}, β, s), ν(y, x_2, z_1, β, s), ν(y, x_2, z_2, β, s), ..., ν(y, x_{m_x}, z_{m_z}, β, s) )′,

ν(y, β) := ( ν(y, β, s_1)′, ν(y, β, s_2)′, ..., ν(y, β, s_L)′ )′,    ν(β) := ( ν(0, β)′, ν(1, β)′ )′.

The vector of parameters ν(β) represents the variable over which we will optimize in our result ahead. Now let d_ν = 2mL denote the dimension of the parameter vector ν(β) (recall that L := 2^m).
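As a bookkeeping sketch (the indexing conventions below are our own), the stacked vector ν(β) can be enumerated explicitly to confirm its dimension d_ν = 2mL:

```python
# Our own indexing sketch: nu(beta) stacks one entry nu(y, x_j, z_j, beta, s)
# for each y in {0, 1}, each response type s (L = 2^m of them), and each of the
# m support points; m = 2 here, mirroring the stacking order in the text.
from itertools import product

m = 2
types = list(product((0, 1), repeat=m))                  # L = 2^m response types
layout = [(y, s, j) for y in (0, 1) for s in types for j in range(m)]

assert len(layout) == 2 * m * len(types)                 # d_nu = 2 m L
print(len(layout))  # 16
```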
Without loss of generality, we will suppose that each (y, x, z) is assigned positive probability by the observed distribution. From conditions (3.5) and (3.6) in Theorem 3.1, we have the constraints:

Σ_{s ∈ S_j} ν(1, x_j, z_j, β, s) = 1,    Σ_{s ∈ S_j^c} ν(0, x_j, z_j, β, s) = 1,   (3.11)

for j = 1, ..., m. Finally, we require the nonnegativity and “adding-up” constraints:

ν(y, x_j, z_j, β, s) ∈ {0} if Θ(β, s) = ∅, and ν(y, x_j, z_j, β, s) ∈ [0, 1] otherwise,   (3.12)

for all y ∈ {0, 1} and j = 1, ..., m and s ∈ {0, 1}^m, and:

Σ_{s ∈ {0,1}^m} ν(y, x_j, z_j, β, s) = 1,   (3.13)

for all y ∈ {0, 1} and j = 1, ..., m. The constraints in (3.12) imply that positive probability can only be assigned to non-empty sets of the form Θ(β, s), and the constraints in (3.13) ensure that each conditional probability assigns probability 1 to the entire space Θ. We are now ready to state the main result for this section.
Theorem 3.2.
Under Assumptions 2.1 and 2.2, the identified set for the counterfactual conditional choice probability P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ) is given by:

∪_{β ∈ B} [ ν_ℓb(y, x_j, z_j, β), ν_ub(y, x_j, z_j, β) ],   (3.14)

where ν_ℓb(y, x_j, z_j, β) and ν_ub(y, x_j, z_j, β) are determined by the optimization problems:

ν_ℓb(y, x_j, z_j, β) := min_{ν(β) ∈ R^{d_ν}} Σ_{s ∈ S_{γ(j)}} ν(y, x_j, z_j, β, s), subject to (3.11), (3.12), and (3.13),   (3.15)

ν_ub(y, x_j, z_j, β) := max_{ν(β) ∈ R^{d_ν}} Σ_{s ∈ S_{γ(j)}} ν(y, x_j, z_j, β, s), subject to (3.11), (3.12), and (3.13).   (3.16)

(Note that constraints of the form (3.13) are not implied by constraints of the form (3.11).)

In one direction, Theorem 3.2 implies that any counterfactual conditional choice probability of the form (3.9) belonging to the identified set can be written as:

P_{Y_γ | Y,X,Z}( Y_γ = 1 | Y = y, X = x_j, Z = z_j ) = Σ_{s ∈ S_{γ(j)}} ν(y, x_j, z_j, β, s),

for some β and some vector ν(β) satisfying the constraints (3.11), (3.12), and (3.13). In the opposite direction, the Theorem implies that if for some β the vector ν(β) satisfies these constraints then the conditional probability measure on Θ represented by ν(β) can be extended to a (not necessarily unique) Borel probability measure on all of B(Θ) that satisfies the conditions of Theorem 2.1. Again, there is nothing special about counterfactual choice probabilities here, and this result can be easily modified to bound any linear function of counterfactual choice probabilities by simply modifying the objective function in Theorem 3.2. We will make use of this fact in the application section.

After determining which of the sets Θ(β, s) are empty, all of the constraints in (3.15) and (3.16) can be written as linear equality/inequality constraints, so that the optimization problems in (3.15) and (3.16) are linear programming problems.
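For intuition, the LPs (3.15)-(3.16) can be solved by hand in the simplest unrestricted case. The sketch below is our own toy construction, not the paper's implementation: with binary X and no instruments, constraints (3.11) and (3.13) force all conditional mass onto types in S_j, so the feasible set is a simplex whose vertices are unit vectors, and the bounds can be obtained by enumerating those vertices rather than calling an LP solver.

```python
# Toy version of the bounding problems (3.15)-(3.16): binary X, m = 2, L = 4,
# conditioning on (Y = 1, X = x_j). Constraints (3.11) + (3.13) force
# nu(s) = 0 for s outside S_j, so the feasible set is the simplex on S_j and
# the LP optima sit at its vertices (the unit vectors e_s, s in S_j).
from itertools import product

types = list(product((0, 1), repeat=2))  # s in {0,1}^m with m = 2

def bounds(j, gamma_j):
    """Bounds on P(Y_gamma = 1 | Y = 1, X = x_j) when gamma maps j to gamma_j."""
    S_j = [s for s in types if s[j] == 1]   # types consistent with Y = 1 at x_j
    vals = [s[gamma_j] for s in S_j]        # objective evaluated at each vertex
    return min(vals), max(vals)

print(bounds(0, 1))  # (0, 1): uninformative bounds in the unrestricted model
print(bounds(0, 0))  # (1, 1): gamma(j) = j just reproduces the observed choice
```

In larger models, or once additional constraints are imposed, the feasible set is no longer a simple simplex and an off-the-shelf LP solver would be used instead.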
This is very beneficial, since linear programs can be efficiently solved even in cases with thousands of parameters and constraints. In addition, elements of ν(β) corresponding to sets Θ(β, s) that are empty can be removed from the parameter vector ν(β) without altering the optimal solutions to the linear programs in (3.15) and (3.16). This allows for further reduction of the dimension of these linear programs. Following Theorem 3.2, these linear programs are used to construct an interval for each value of β ∈ B, and then the full identified set is constructed by taking the union of these intervals over all values of β.

Some thought reveals that β has no effect on the bounding problem in Theorem 3.2 other than through its effect on determining which of the sets Θ(β, s) are non-empty. Indeed, depending on the form of ϕ, for a fixed value of β ∈ B and a fixed vector s ∈ {0, 1}^m there may be no value of θ satisfying r(θ, β) = s. In practice the identified set can be constructed using Theorem 3.2 after fixing a particular functional form for ϕ, establishing a grid over the parameter space B, and then solving the optimization problems (3.15) and (3.16) for each value of β in the grid. This last step can only be completed after determining which of the sets Θ(β, s) are non-empty at each value of β in the grid. The following proposition demonstrates that, in theory, the researcher need only repeat the procedure just described for finitely many values of β.

Proposition 3.1.
Suppose that Assumptions 2.1 and 2.2 hold. Then there exists a (not necessarily unique) finite subset B′ ⊂ B such that:

{ ν ∈ R^{d_ν} : ∃ β ∈ B s.t. ν(β) satisfies (3.11), (3.12), (3.13), and ν = ν(β) } = { ν ∈ R^{d_ν} : ∃ β ∈ B′ s.t. ν(β) satisfies (3.11), (3.12), (3.13), and ν = ν(β) }.

We will call the points in the set B′ the representative points, although it is important to keep in mind that these points are generally not unique. Assuming the representative points can be determined by the researcher, Proposition 3.1 immediately implies that the union over β ∈ B in (3.14) can be replaced with a union over β ∈ B′. That is, the linear programs in (3.15) and (3.16) need only be solved at the representative points. Proposition 3.1 also implies that the identified set for counterfactual choice probabilities in Theorem 3.2 will always be a closed (but possibly disconnected) set. Unfortunately, when the researcher cannot determine the representative points, computing the exact (i.e. not approximate) bounds from Theorem 3.2 can become computationally prohibitive. Later on we will provide a tractable way to construct B′ in the case when ϕ is linear in β.

To this point, the binary response model has been left almost entirely unrestricted. The following corollary to Theorem 3.1 states formally the unsurprising result that, when ϕ is completely unrestricted, counterfactual choice probabilities are not identified under our assumptions.

Corollary 3.1.
Suppose Assumptions 2.1 and 2.2 hold. Then there exists a function ϕ : X × Z × B × Θ → R satisfying Assumptions 2.1 and 2.2 such that the identified set for any counterfactual choice probability of the form (3.9) or of the form (3.10) with γ(j) ≠ j is the interval [0, 1].

This result is stated as a corollary since it can be proven using Theorem 3.1. Intuitively, Theorem 3.1 shows that under Assumptions 2.1 and 2.2 the observed conditional choice probabilities effectively impose no constraints on counterfactual conditional choice probabilities; indeed, this follows by simple inspection of the constraints (3.5) and (3.6) from the statement of Theorem 3.1, as well as the discussion following the statement of Theorem 3.1. Without any constraints imposed by the observed conditional choice probabilities, by choosing a sufficiently flexible function ϕ : X × Z × B × Θ → R it is possible to rationalize all counterfactual conditional choice probabilities in the interval [0, 1].

This impossibility result shows that it is necessary to entertain additional assumptions in order to obtain informative bounds on counterfactual choice probabilities. In the next section, we will explore additional assumptions that can deliver such informative bounds. (It is important to note that this does not occur because Assumptions 2.1 or 2.2 have imposed any structure, but instead is because the average treatment effect is a weighted average of unidentified counterfactual conditional choice probabilities and identified conditional choice probabilities.)

The assumptions we consider in this section fall into three classes: (i) functional form assumptions, (ii) independence assumptions, and (iii) monotonicity assumptions. A common theme throughout our discussion is that additional assumptions like these often lead to the elimination of response types; that is, the assumptions imply that certain response types must have zero probability.
One of our main contributions in this paper is to show how functional form restrictions impose constraints in the bounding problem by limiting the number of sets Θ(β, s) that can be assigned positive probability. Reducing the number of sets that can be assigned positive probability imposes additional constraints in the optimization problems of Theorem 3.2 that help to tighten the identified set. It can also reduce the computational time needed to solve the bounding problems in Theorem 3.2 by reducing the dimension of the optimizing variable ν(y, x, z, β, s). Constraining sets of the form Θ(β, s) to be assigned zero probability will be referred to as eliminating response types. Response types corresponding to sets Θ(β, s) that survive elimination will be called admissible response types. Response types corresponding to sets Θ(β, s) that are eliminated will be called inadmissible.

We will show that a number of assumptions, including functional form assumptions, correspond to the elimination of particular response types. Since each response type is characterized by a particular menu of counterfactual responses, framing functional form assumptions in terms of the elimination of particular response types helps to provide some meaning to these assumptions. In the case when ϕ is linear in parameters we provide an efficient (i.e. polynomial-time) algorithm for constructing the relevant set of constraints in the bounding problems that is based on the hyperplane arrangement algorithm of Gu and Koenker (2020). When ϕ is linear in β, we also demonstrate how to compute an exact (i.e. not approximate) solution to the optimization problems in Theorem 3.2 that does not require establishing a grid over the entire parameter space B. After studying functional form assumptions, we then turn briefly to consider independence assumptions and monotonicity assumptions.
Independence assumptions are also quite common in parametric binary response models and binary response models with endogenous regressors, although here we show how to impose various independence assumptions as a set of linear equality constraints in the optimization problems of Theorem 3.2. Finally, monotonicity assumptions appear in various forms in the literature on treatment effects, and our incorporation of monotonicity restrictions arising from choice theory makes substantial use of response types, resembling the approach of Heckman and Pinto (2018). These three classes of assumptions—functional form, independence, and monotonicity—will now be addressed in turn. Although we study these three assumptions separately, all of our results hold with minimal modification when any combination of these assumptions is imposed. We will make use of this fact in the application section.

(For additional impossibility results in partially identified discrete response models, see the discussion on pages 1396 and 1402 of Manski (2007) as well as Corollary 3 in Chesher et al. (2013).)

4.1 Functional Form Assumptions: The Linear Case

In this subsection we consider introducing assumptions on the functional form of the index function ϕ. We will consider the case when ϕ is linear in the latent variables θ. This assumption will connect the results in this paper to results on random coefficient linear index models where the distribution of random coefficients is treated nonparametrically (c.f. Ichimura and Thompson (1998), Gautier and Kitamura (2013) and Gu and Koenker (2020), among others). For reference throughout this section, let us first formally state our linearity assumption on the index function, as well as an assumption on the distribution of the latent variables.

Assumption 4.1 (Linearity in Latent Variables).
For each β ∈ B, (i) the function ϕ(·, β) : X × Z × Θ → R from Assumption 2.1 is linear in θ, and (ii) the event:

F := ⋃_{(x,z) ∈ X×Z} {θ : ϕ(x, z, θ, β) = 0},

occurs with probability zero; that is, P_θ(F) = 0.

Part (i) imposes linearity of ϕ, but still allows for general forms of nonseparability between the latent and observed variables. Part (ii) of Assumption 4.1 is not needed from a technical standpoint, but we will see that it leads to a dramatic simplification of our algorithm to enumerate response types described in the next subsection. We will return to this point later. For now, it suffices to know that part (ii) of Assumption 4.1 essentially says that the boundaries of the sets that define response types are assigned probability zero. Under part (i) of Assumption 4.1, the sets of the form {θ : ϕ(x, z, θ, β) = 0} are hyperplanes in R^{d_θ}, and the sets Θ(β, s) (intersected with Θ) have boundaries that are defined by these hyperplanes. Since these hyperplanes are (d_θ − 1)-dimensional subsets of a d_θ-dimensional space, they have Lebesgue measure zero. Thus, the most familiar condition implying part (ii) of Assumption 4.1 is absolute continuity of the distribution of θ; this latter assumption has been imposed, for example, in Chesher et al. (2013) and Chesher and Rosen (2014). As an alternative to imposing part (ii) of Assumption 4.1, we could instead condition all of the remaining analysis in this paper on the event that θ ∈ F^c. It will be useful to keep this in mind throughout. Finally, a special case of Assumption 4.1 occurs when the function ϕ is additively separable in a scalar latent variable θ. A full analysis of this well-studied case using our framework is worthwhile, and is taken up in Appendix B.6.

We will see that assumptions on the functional form of ϕ can sometimes be useful to obtain tighter bounds on counterfactual conditional choice probabilities because they implicitly eliminate certain response types.
To illustrate this point, recall that Θ(β, s) := {θ : r(β, θ) = s} and that there are exactly 2^m binary vectors s ∈ {0, 1}^m. In order for the response type r(β, θ) = s to be admissible, it must be that the set Θ(β, s) is non-empty for the pair (β, s). By extension, in order for all response types to be admissible, it must be that all the sets Θ(β, s) are non-empty. Whether this is satisfied clearly depends on the nature of ϕ as a function of θ; for example, it clearly fails if ϕ is constant in θ, but is easily satisfied if ϕ can vary arbitrarily with θ. The following simple example shows that some of the sets Θ(β, s) can be empty under Assumption 4.1.

Footnote: Specifying linearity in latent variables seems to be nonnested with the weak separability assumption of Vytlacil and Yildiz (2007). In particular, Vytlacil and Yildiz (2007) permit index functions of the form (omitting β for simplicity): ϕ(x, z, θ) = g(h(x, z), θ), where h : X × Z → R and g : R × Θ → R are real-valued functions. Note that this form cannot accommodate cases like ϕ(x, z, θ) = ⟨x, θ⟩, although it allows for nonlinearities in θ.

Example 1.
Suppose we have a variable X ∈ {0.5, 1, 2} and latent variables θ = (θ_1, θ_2) ∈ S (the two-dimensional closed unit sphere). That is, suppose there are no variables Z and no fixed coefficients β. Then the structural function from (2.1) can be written as ϕ(X, θ) and the binary response vector r(β, θ) can be written as r(θ), where:

r(θ) = ( 1{ϕ(0.5, θ) ≥ 0}, 1{ϕ(1, θ) ≥ 0}, 1{ϕ(2, θ) ≥ 0} )′.

Without any additional restrictions there is a total of 2^{|X|} = 8 possible response types. That is, r(θ) ∈ {s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8}, where:

s_1 = (0,0,0)′, s_2 = (1,0,0)′, s_3 = (0,1,0)′, s_4 = (0,0,1)′, s_5 = (1,1,0)′, s_6 = (1,0,1)′, s_7 = (0,1,1)′, s_8 = (1,1,1)′.

Conclude that without any additional restrictions, all sets of the form Θ(β, s) for s ∈ {0, 1}³ can be assigned positive probability by the optimization problems in Theorem 3.2. Now suppose we entertain a linear functional form restriction. In particular, suppose that Assumption 4.1 holds and that the structural function from (2.1) can be written as:

ϕ(X, θ) = Xθ_1 − θ_2.

Then the binary response vector r(θ) is given by:

r(θ) = ( 1{0.5θ_1 ≥ θ_2}, 1{θ_1 ≥ θ_2}, 1{2θ_1 ≥ θ_2} )′.

As is illustrated in Figure 1, under the assumption that the index function is linear in latent variables only 6 response types are possible. In particular, the response types corresponding to the binary vectors s_3 = (0,1,0)′ and s_6 = (1,0,1)′ are not possible under the linearity assumption. Thus, under Assumption 4.1 a distribution of latent variables will be admissible in this example only if it assigns probability zero to the sets:

Θ(β, s_3) = {θ : r(θ) = s_3},    Θ(β, s_6) = {θ : r(θ) = s_6}.

These additional constraints must be imposed in our optimization problems from Theorem 3.2.

Footnote: For example, if θ = (θ_1, θ_2), take ϕ(X, θ) = sin(θ_1 X + θ_2) and fix θ_2 = 0. Then it is straightforward to find eight values of the frequency parameter θ_1 to rationalize each of the 8 response types.

Figure 1: A figure corresponding to Example 1 illustrating the partition of the latent variable space according to response types in the case when the index function is linear. Without functional form restrictions, Example 1 shows 8 response types are possible; however, when the index function is linear in latent variables there are only 6 possible response types, as illustrated in the figure. In particular, the response types corresponding to the binary vectors s_3 and s_6 from Example 1 are not possible.
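Example 1 is easy to check numerically. The sketch below (a brute-force grid evaluation standing in for the geometric argument; the grid resolution is arbitrary) evaluates r(θ) over the unit disk and collects the distinct response types:

```python
# Enumerate the response types generated by phi(X, theta) = X*theta_1 - theta_2
# for X in {0.5, 1, 2}, by evaluating r(theta) on a grid over the unit disk.
# (A brute-force stand-in for the geometric argument of Example 1.)

def response_type(t1, t2):
    return tuple(int(x * t1 - t2 >= 0) for x in (0.5, 1.0, 2.0))

grid = [-1 + 0.025 * k for k in range(81)]
types = {response_type(t1, t2) for t1 in grid for t2 in grid
         if t1 ** 2 + t2 ** 2 <= 1.0}

print(sorted(types))          # 6 of the 8 possible binary vectors
assert len(types) == 6
assert (0, 1, 0) not in types and (1, 0, 1) not in types
```

Only the six patterns that are monotone in X appear; the vectors (0,1,0)′ and (1,0,1)′ never occur, matching Figure 1.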
This example shows that imposing linearity of ϕ in latent variables implies that certain sets of the form Θ(β, s) may be empty for some binary vectors s ∈ {0, 1}^m. In the general case, it can be shown that when ϕ is restricted to be linear in θ, there is an upper bound on the number of non-empty sets Θ(β, s) that grows at a rate that is polynomial in m (rather than exponential in m, which is the case when ϕ is unrestricted).

Proposition 4.1.
Suppose that Assumptions 2.1 and 4.1 are satisfied. Then for each fixed β ∈ B, there are at most ∑_{j=0}^{d_θ} (m choose j) admissible response types.

This result is implied by results in the literature on computational geometry. In particular, linearity of the function ϕ(·, θ) means that for each instance of (x, z, β) the function ϕ(x, z, θ, β) defines a hyperplane in d_θ-dimensional space. In the case when the vectors defining these hyperplanes are in general position the upper bound in Proposition 4.1 is obtained. This latter result was initially proven by Buck (1943).

Footnote: A collection of m hyperplanes in d-dimensional space is considered to be in general position when any collection of k out of the m hyperplanes intersects in a (d − k)-dimensional space for 1 < k ≤ d, and any collection of k out of the m hyperplanes has an empty intersection for k > d.
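The bound in Proposition 4.1 is easy to tabulate. The following sketch (illustrative only; the hyperplane coefficients are randomly drawn for the demonstration) computes the bound ∑_{j=0}^{d} (m choose j) and compares it with the number of sign vectors realized by an arbitrary arrangement of m hyperplanes in R², counted by sampling:

```python
import math, random

def buck_bound(m, d):
    """Maximum number of cells created by m hyperplanes in R^d (Buck, 1943)."""
    return sum(math.comb(m, j) for j in range(d + 1))

assert buck_bound(4, 2) == 11          # 1 + 4 + 6
assert buck_bound(3, 3) == 8           # unrestricted case: all 2^3 sign vectors

# Randomly drawn arrangement of m hyperplanes a*t1 + b*t2 = c in R^2; count the
# sign vectors realized at sampled points -- never more than the bound allows.
random.seed(0)
m, d = 5, 2
hyps = [(random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)) for _ in range(m)]
realized = {tuple(int(a * t1 + b * t2 - c >= 0) for (a, b, c) in hyps)
            for t1 in [x / 10 for x in range(-50, 51)]
            for t2 in [x / 10 for x in range(-50, 51)]}
assert len(realized) <= buck_bound(m, d)   # at most 16 of the 2^5 = 32 vectors
```

The sampled count can understate the true number of cells, but it can never exceed the polynomial bound, illustrating why linearity eliminates response types.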
To impose linearity in the latent variables we must determine which sets Θ(β, s) are empty, and then ensure that any distribution of the latent variables under consideration when bounding counterfactual choice probabilities assigns probability zero to these sets. Let us define the collection of binary vectors S_ϕ to be those vectors s ∈ {0, 1}^m corresponding to admissible response types under Assumption 4.1. Note that any such admissible response type under Assumption 4.1 must correspond to a set Θ(β, s) with non-empty interior. A revised definition of the joint identified set I*_{Y,X,Z} under Assumption 4.1 is provided in Appendix B.2. In Appendix B.2 we also present a corollary of Theorem 3.1 that is valid under Assumption 4.1. This allows us to reduce the infinite dimensional existence problem to a finite dimensional problem even under the assumption of linearity in latent variables. To extend the results of Theorem 3.2 we must then simply include the correct set of additional constraints in our optimization problems. The correct set of constraints is provided by Corollary B.1 in Appendix B.2, and can be written in terms of the parameter vector ν(β) as:

∑_{s ∈ S_ϕ^c} ν(y, x_j, z_j, β, s) = 0,    (4.1)

for all y ∈ {0, 1} and j = 1, . . . , m occurring with positive probability. Corollary B.2 in Appendix B.2 then demonstrates that Theorem 3.2 can be extended simply by adding the constraints (4.1) to the optimization problems (3.15) and (3.16). We will see in the next subsections that independence and monotonicity constraints can also be imposed as equality constraints on the parameter vector ν(β), and thus any combination of these assumptions can be imposed on the optimization problems in Theorem 3.2 by simply adding the correct constraints. From a practical implementation point of view, imposing (4.1) in addition to the constraints from (3.15) and (3.16) is equivalent to eliminating the elements of the parameter vector ν(β) corresponding to s ∈ S_ϕ^c.
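In terms of implementation, imposing (4.1) therefore just deletes components of ν(β). A minimal sketch (using the admissible set from Example 1 purely for illustration):

```python
from itertools import product

# All 2^m candidate response types for m = 3 ...
all_types = list(product((0, 1), repeat=3))
# ... and the inadmissible ones under the linear specification of Example 1.
inadmissible = {(0, 1, 0), (1, 0, 1)}

# Imposing (4.1) is equivalent to dropping the corresponding components of nu:
nu_support = [s for s in all_types if s not in inadmissible]
assert len(nu_support) == 6   # the optimizing variable shrinks from 8 to 6 types
```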
Finally, Corollary B.3 shows that Proposition 3.1 also applies when the constraints (4.1) are added to the constraints from (3.15) and (3.16).

In summary, functional form assumptions provide additional restrictions in the problem of bounding counterfactual choice probabilities by simply restricting the number of sets Θ(β, s) that can be assigned nonzero probability, and thus limiting the admissible response types. When it is known which of the response types to eliminate, we can constrain the corresponding sets Θ(β, s) to be assigned zero probability in the bounding problem given in Theorem 3.2, which can lead to a tightening of the identified set. However, given a certain functional form assumption and a value of β, it is generally difficult to determine which of the sets Θ(β, s) are empty, and thus which response types to eliminate. We will address this implementation problem in the next subsection, where we propose the use of the hyperplane arrangement algorithm of Gu and Koenker (2020).

Footnote: The fact that we want sets Θ(β, s) that have non-empty interior follows from the fact that part (ii) of Assumption 4.1 implies a set Θ(β, s) has positive probability if and only if int(Θ(β, s)) has positive probability.

4.1.1 Implementation and Hyperplane Arrangement

To practically implement the revised optimization problems we require a method of enumerating all admissible response types represented by the binary vectors in S_ϕ, defined in the previous subsection. The researcher must know the admissible response types in order to impose constraint (4.1) in the bounding problems, but the admissible response types will depend on the support X × Z and the function ϕ. Thus, in practice S_ϕ must be computed in each specific application.
In order to compute the collection S_ϕ we propose the use of the hyperplane arrangement algorithm of Gu and Koenker (2020).

Enumerating the admissible response types under Assumption 4.1 corresponds to determining all sets Θ(β, s) that have non-empty interior. When the index function ϕ is linear in θ, for each fixed β and s ∈ {0, 1}^m the set Θ(β, s) is a convex polyhedron formed by the intersection of halfspaces whose boundaries are hyperplanes of the form {θ : ϕ(x, z, θ, β) = 0}. Under Assumption 2.1 there are at most m such hyperplanes. The hyperplane arrangement algorithm of Gu and Koenker (2020) accepts these m hyperplanes as an input, and outputs the binary vectors s corresponding to the sets Θ(β, s) that have non-empty interior, as well as a point from each of these sets. In low dimensional space, it is relatively easy to determine the sets with non-empty interior formed by the intersection of halfspaces (see Figure 1, for instance). However, as the dimension of the space increases it becomes challenging to enumerate all of these sets. Avis and Fukuda (1996) were the first to provide an enumeration algorithm that runs in a time proportional to the maximum number of sets with non-empty interior. Improvements to this algorithm were made by Sleumer (1999) and Rada and Cerny (2018). The algorithm of Gu and Koenker (2020) is most closely related to the latter paper, and was developed for the problem of nonparametric maximum likelihood in a linear random coefficient model. It runs in a time proportional to O(m^{d_θ}).

To understand the algorithm, note that for each s ∈ {0, 1}^m and fixed β, we can verify using a linear program whether there exists a point in the space Θ that lies interior to the set Θ(β, s). Indeed, consider the following linear programming problem:

max_{θ, ε} ε    s.t.    (2s_j − 1)ϕ(x_j, z_j, θ, β) ≥ ε,    j = 1, . . . , m,    (4.2)

where s_j is the j-th element of our fixed binary vector s, and where here we have an index function ϕ(x, z, θ, β) that is linear in θ. If ε* and θ* are the optimal values of the program (4.2) (provided that it is feasible), then an optimal value ε* > 0 implies that θ* is an interior point of the polyhedron Θ(β, s). However, since the linear program (4.2) must be solved for each s ∈ {0, 1}^m, checking whether each Θ(β, s) admits an interior point requires solving 2^m linear programs, despite the fact that we know the number of non-empty sets Θ(β, s) is polynomial in m.

To address this issue, the algorithm proposed in Gu and Koenker (2020) builds upon the algorithm in Rada and Cerny (2018). The idea is to add one hyperplane at a time, and to enumerate all feasible response types after adding each new hyperplane. At step k they start with a collection of k − 1 hyperplanes and all of the response types induced by these k − 1 hyperplanes. They then introduce a new hyperplane into the arrangement of hyperplanes, and determine all newly created response types by solving a linear program. The algorithm of Rada and Cerny (2018) requires solving a linear programming problem for all of the existing cells at each iteration, which amounts to solving O(m^{d_θ+1}) such problems. When m is large, which is typically the case in practice, this can become costly. Gu and Koenker (2020) observed that when a new hyperplane is added the only new cells will be those that are created when the existing cells are crossed by the last hyperplane. By efficiently locating those crossed cells, the algorithm reduces the number of linear programming problems to be solved by a factor of m. The algorithm in Gu and Koenker (2020) is available in the R package RCBR.

Footnote: As we discussed at the beginning of this subsection, although ϕ is linear in θ it can have many different specifications.

Recall from the previous subsection that we claimed part (ii) of Assumption 4.1 led to a dramatic simplification of our algorithm to enumerate response types. To understand the reason why part (ii) of Assumption 4.1 is useful, consider the following simple example.
Example 2.
Suppose we have a variable X ∈ {−1, 1} and latent variables θ ∈ S (the two-dimensional closed unit sphere). That is, suppose there are no variables Z and no fixed coefficients β. Now suppose the structural function from (2.1) is given as ϕ(X, θ) = Xθ_1. Then the binary response vector r(β, θ) can be written as r(θ), where:

r(θ) = ( 1{θ_1 ≥ 0}, 1{−θ_1 ≥ 0} )′.

Now consider the four possible binary vectors in {0, 1}²:

s_1 = (0,0)′, s_2 = (1,0)′, s_3 = (0,1)′, s_4 = (1,1)′.

Under Assumption 4.1 there are only two sets of the form Θ(β, s) that can be assigned positive probability, given by:

Θ(β, s_2) := {θ : r(θ) = s_2},    Θ(β, s_3) := {θ : r(θ) = s_3}.

However, in the absence of part (ii) of Assumption 4.1 the set Θ(β, s_4) := {θ : r(θ) = s_4} can also be assigned positive probability. This demonstrates that part (ii) of Assumption 4.1 imposes additional restrictions on P_{θ|Y,X,Z} that may tighten the identified set for counterfactual choice probabilities and other quantities. However, note that the set Θ(β, s_4) is the lower-dimensional boundary between the sets Θ(β, s_2) and Θ(β, s_3). Imposing Assumption 4.1 allows us to avoid enumerating lower-dimensional sets like Θ(β, s_4), which can have a considerable impact on computation time, especially in higher dimensional examples.

This example illustrates why part (ii) of Assumption 4.1 contains some identifying content, and thus can narrow the identified set for counterfactual choice probabilities and other quantities. It also provides some intuition for why this assumption is helpful when it comes to computation, since it allows us to avoid enumerating all lower dimensional sets Θ(β, s).

In summary, the hyperplane arrangement algorithm can be used as a pre-processing step under Assumption 4.1 to determine which sets Θ(β, s) have non-empty interior in a given application. After completing the pre-processing of our problem using the hyperplane arrangement algorithm, Corollary B.2 becomes directly applicable.
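The incremental enumeration idea behind this pre-processing step can be illustrated with a small sketch. Sampled points stand in here for the LP interior-point check of (4.2), so this is only an illustration of the bookkeeping in Rada and Cerny (2018) and Gu and Koenker (2020), not their algorithm, and the three hyperplanes are arbitrary:

```python
# Incrementally build the cells of a line arrangement in R^2, splitting only
# the cells crossed by each newly added hyperplane.  Sample points stand in
# for the linear-programming feasibility check.

hyperplanes = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # lines a*t1 + b*t2 = 0

# Grid of sample points, offset so none lies exactly on a line.
points = [(x / 7 + 0.013, y / 7 + 0.017) for x in range(-7, 8) for y in range(-7, 8)]

cells = {(): points}          # sign-vector prefix -> sample points in that cell
for a, b in hyperplanes:
    new_cells, n_crossed = {}, 0
    for prefix, pts in cells.items():
        sides = {}
        for t1, t2 in pts:
            sides.setdefault(int(a * t1 + b * t2 >= 0), []).append((t1, t2))
        if len(sides) == 2:   # the new hyperplane crosses this cell ...
            n_crossed += 1
        for sign, sub in sides.items():
            new_cells[prefix + (sign,)] = sub
    # ... and only crossed cells create new cells:
    assert len(new_cells) == len(cells) + n_crossed
    cells = new_cells

print(len(cells))             # 3 generic lines through the origin give 6 cells
assert len(cells) == 6
```

The assertion inside the loop is exactly the observation exploited by Gu and Koenker (2020): each step adds one new cell per crossed cell, so only crossed cells need to be examined.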
In the next subsection we will also show how the assumption of linearity in the parameters β ∈ B can be combined with the hyperplane arrangement algorithm to determine the representative points from Proposition 3.1, which dramatically simplifies the bounding procedure suggested by Theorem 3.2.

Recall that computation of our bounds on counterfactual conditional choice probabilities requires the researcher to repeatedly solve linear programs. In particular, a value of β ∈ B is fixed, and the lower and upper bound of a closed interval is computed for this fixed value of β. We must then repeat this procedure for all β ∈ B and take the union of the resulting intervals over all values of β ∈ B. We refer to the process of repeating the procedure for all values of β and taking unions as “profiling β.” Profiling β can present major computational challenges. Corollary B.3 suggests that only finitely many representative points need to be considered, although it is not obvious how to find these representative points.

In this subsection we will describe an algorithm that allows us to find all representative points when ϕ is linear in both θ and β. To introduce our approach, note that under the assumption that ϕ is linear in (θ, β), for each fixed β ∈ B the sets of the form Θ(β, s) define a unique partition of the space Θ into sets whose boundaries are defined by m hyperplanes. Let us define:

S(β) := {s ∈ {0, 1}^m : int(Θ(β, s)) ≠ ∅}.

Then S(β) denotes the set of all vectors s ∈ {0, 1}^m that are inducible by our arrangement of m hyperplanes. Now recall that functional form assumptions impose restrictions in the bounding optimization problems by restricting the number of sets Θ(β, s) with non-empty interior. For any two values β, β′ ∈ B with β ≠ β′, if S(β) = S(β′) then the linear programming problems in Theorem 3.2 at β and β′ will be identical, since they will have an identical set of constraints.
The points β and β′ are thus equivalent in the sense that we only need to solve the linear programming problems for one of them. Extending this idea, we can define an equivalence class as the set of all β ∈ B delivering the same collection S(β). We then only need to solve the linear programming problems at one value of β belonging to each equivalence class. These values of β selected from each equivalence class are exactly what we call representative points.

To see how to find the representative points, let us partition θ := (θ_x, θ_z, ε), β = (β_x, β_z), x = (x^r, x^f), z = (z^r, z^f), and for a binary vector s ∈ {0, 1}^m let us define the set:

R(s) := { (θ, β) : ( 1{θ_x x_1^r + θ_z z_1^r + β_x x_1^f + β_z z_1^f ≥ ε}, . . . , 1{θ_x x_m^r + θ_z z_m^r + β_x x_m^f + β_z z_m^f ≥ ε} )′ = s }.    (4.3)

These sets form a unique partition of the space of (θ, β) defined by the m hyperplanes of the form:

θ_x x_i^r + θ_z z_i^r + β_x x_i^f + β_z z_i^f = ε.    (4.4)

The basic idea behind our strategy to find representative points is to first project the sets of the form R(s) onto the parameter space B. Note that the projection of a set R(s) onto the parameter space B will deliver the set of all β consistent with the binary vector s for some value of θ. After taking the intersection of all such projections, each set in the resulting collection corresponds exactly to an equivalence class discussed above. An arbitrary value of β taken from such a set will then be a representative point. The most challenging part of this approach will be to find a tractable characterization of the projections of R(s) onto the parameter space B.

Let us denote the collection of all binary vectors s ∈ {0, 1}^m corresponding to the sets R(s) with non-empty interior as S_p.

Footnote: This partition of B into equivalence classes is exactly what is done in the proof of Proposition 3.1 in the more general case.
The first step of our procedure to find the representative points is to determine the binary vectors in S_p. This can be done by running the hyperplane arrangement algorithm of Gu and Koenker (2020) on the collection of hyperplanes of the form (4.4) defined on Θ × B, treating β as a latent variable. Note that the assumption of linearity of ϕ in (θ, β) restricts the number of sets in the collection S_p to be polynomial in m.

Next, let us define w_i^r := (x_i^r, z_i^r, −1) and w_i^f := (x_i^f, z_i^f), where w_i^r has dimension d_r and w_i^f has dimension d_f. Then each of the hyperplanes of the form (4.4) can be written as w_i^r θ + w_i^f β = 0. Stacking these hyperplanes into matrix form we have W_r θ + W_f β = 0, where W_r is m × d_r and W_f is m × d_f. Now each set of the form (4.3) is a polyhedral cone in R^{d_x + d_z + 1} and can be uniquely identified by a sign vector 2s − 1 ∈ {−1, 1}^m. Fix any s ∈ S_p, let D(s) = diag(2s − 1) denote the m × m diagonal matrix with the sign vector 2s − 1 on its diagonal, and define W_r(s) := D(s)W_r and W_f(s) := D(s)W_f. Then the set R(s) from (4.3) can be conveniently rewritten as:

R(s) := { (θ, β) : W_r(s) θ + W_f(s) β ≥ 0 }.

Note that the row dimension of W_r(s) and W_f(s) is m, which can be large if the support X × Z contains many points. For this reason we remove redundant inequalities from the system defining R(s) before proceeding to the next step. Elimination of redundant inequalities from this system can be achieved in polynomial time with a sequence of linear programs, and the resulting set of nonredundant inequalities that define the polyhedral cone R(s) is typically much smaller than m.

Footnote: Note that in this context, all of the hyperplanes of the form (4.4) can be viewed as hyperplanes through the origin in Θ × B. In this case, the upper bound on the number of cells formed by this collection of hyperplanes is of smaller order than that presented in Proposition 4.1. Cover (1965) shows the upper bound is given by: C(m, d_θ) := 2 ∑_{j=0}^{d_θ − 1} (m − 1 choose j).

From here on we assume the matrices W_r(s) and W_f(s) only include rows corresponding to nonredundant constraints, and we will denote their row dimension as m(s). Now consider the set:

B(s) := { β ∈ B : ∃ θ ∈ Θ s.t. W_r(s) θ + W_f(s) β ≥ 0 }.    (4.5)

Then B(s) is the set of values of β for which there exists θ such that all the constraints W_r(s) θ + W_f(s) β ≥ 0 are satisfied. In other words, B(s) is precisely the projection of the polyhedral cone R(s) on the parameter space B.

The objective is now to show that the set B(s) can be defined only in terms of linear inequality constraints in β. In other words, we would like to “eliminate” the latent variables θ from the system of inequalities in (4.5). A natural method of doing so is to use Fourier-Motzkin elimination. Recall that the Fourier-Motzkin algorithm eliminates variables from a system of linear inequalities by taking linear combinations of the inequalities in the system.
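The rewriting of R(s) using D(s) can be verified numerically: a non-boundary point v = (θ, β) has sign vector s exactly when every entry of D(s)(W_r θ + W_f β) is nonnegative. A small sketch with arbitrary illustrative coefficients:

```python
# Verify that r(v) = s  iff  D(s) W v >= 0, where W stacks the hyperplane
# coefficients (w_i^r, w_i^f).  The rows of W and the point v are arbitrary
# made-up values for illustration.

W = [(1.0, -2.0, 0.5), (0.0, 1.0, -1.0), (2.0, 1.0, 1.0)]
v = (0.3, 0.7, -0.2)   # a generic (theta, beta) point, not on any hyperplane

dot = [sum(w_k * v_k for w_k, v_k in zip(w, v)) for w in W]
s = tuple(int(d >= 0) for d in dot)          # the sign (response) vector of v

# D(s) = diag(2s - 1) flips the rows with s_i = 0 ...
flipped = [(2 * si - 1) * di for si, di in zip(s, dot)]
# ... so that v satisfies every inequality of the rewritten cone R(s):
assert all(f >= 0 for f in flipped)
```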
In particular, Fourier-Motzkin elimination can be viewed as applying a sequence of matrix operators M_1, M_2, . . . , M_{d_r} to the system of inequalities in (4.5), where the product M_k M_{k−1} · · · M_1 eliminates the first k elements of the vector θ from the inequalities. Let us denote M*_r = M_{d_r} M_{d_r−1} · · · M_1. Then as a result of Fourier-Motzkin elimination we would have the equivalent system:

B(s) := { β ∈ B : M*_r W_f(s) β ≥ 0 },    (4.6)

since M*_r W_r(s) = 0 by construction of M*_r. The set in (4.6) then gives us inequality constraints only in terms of β that define the projection of R(s) on B.

While it is possible to use Fourier-Motzkin elimination to “eliminate” the latent variables θ, the number of rows in the matrix M*_r can be prohibitively large, even when the number of nonredundant inequalities defining the set (4.6) is small. To ensure feasibility of our method of projection, we must thus search for a procedure that both eliminates redundant inequalities from (4.5) and results in a simpler characterization of B(s) than the one provided by Fourier-Motzkin elimination.

Footnote: In particular, not all the hyperplanes that define R(s) are relevant, in the sense that some of them are implied by the rest of the inequalities in the system. Removing these redundant inequalities will not change the cone R(s). We can remove them before continuing to the projection step of our procedure by conducting a redundancy test. For example, suppose we have a system of j + 1 inequalities of the form Ax ≤ b and s⊤x ≤ t. Then to check whether the last inequality is binding (and thus nonredundant), we can solve the linear programming problem f* = max s⊤x s.t. Ax ≤ b, s⊤x ≤ t + 1. The inequality s⊤x ≤ t is redundant if and only if f* ≤ t. Eliminating all redundant inequalities from a system of m inequalities thus requires solving m linear programs; hence, it can be done in polynomial time. There are a few strategies to speed up the removal of redundant inequalities, as discussed in Section 2.21 in Fukuda (2014). We use the implementation in the package Rcdd with the function redundant.

Footnote: The idea of using Fourier-Motzkin elimination to determine the inequality constraints defining projected regions in partial identification was also explored in Section 8.2 of Chesher and Rosen (2019).

To this end, consider the following set:

C(s) := { c ∈ R^{m(s)} : c⊤W_r(s) = 0, c ≥ 0 },    (4.7)

where recall that m(s) is the row dimension of W_r(s) and W_f(s) after we have removed all the redundant inequalities. Since the rows of M*_r have nonnegative entries (by construction using the Fourier-Motzkin algorithm), they must belong to C(s). Thus we can conclude that:

{ β ∈ B : c⊤W_f(s) β ≥ 0, ∀ c ∈ C(s) } ⊆ B(s).

Furthermore, Kohler (1967) shows that the reverse inclusion also holds; in particular, every vector in the collection C(s) can be written as a nonnegative linear combination of the rows of M*_r. We can thus conclude:

B(s) = { β ∈ B : c⊤W_f(s) β ≥ 0, ∀ c ∈ C(s) }.

While at first glance this result is not immediately useful, the Minkowski-Weyl Theorem allows us to re-write the set C(s) as:

C(s) = { c ∈ R^{m(s)} : c = R(s) a, for some a ≥ 0 },    (4.8)

where R(s) is some matrix. That is, every element belonging to the polyhedral cone C(s) can be written as a nonnegative linear combination of the columns of the matrix R(s). It follows that if we could obtain the matrix R(s) from (4.8), we could obtain the following representation of the projected set for β:

B(s) = { β ∈ B : H(s) β ≥ 0 },    (4.9)

where H(s) = R(s)⊤W_f(s). The matrix R(s) is sometimes called the generating matrix of the polyhedral cone C(s).
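A minimal Fourier-Motzkin step for the special case of a single remaining latent variable can be written directly: pair each inequality with a positive θ-coefficient against each with a negative one, and the θ-free combinations describe the projection. This is an illustrative sketch (scalar θ and scalar β, hypothetical coefficients), not the general M*_r construction:

```python
# One Fourier-Motzkin step for inequalities a_i*theta + b_i*beta >= 0 with
# scalar theta: pairing rows with a_i > 0 against rows with a_j < 0 yields
# theta-free inequalities characterizing the projection B(s).

def eliminate_theta(rows):
    """rows: list of (a, b) encoding the constraint a*theta + b*beta >= 0."""
    pos = [(a, b) for a, b in rows if a > 0]
    neg = [(a, b) for a, b in rows if a < 0]
    zero = [b for a, b in rows if a == 0]
    # nonnegative multipliers: -a_j > 0 on the positive row, a_i > 0 on the negative
    return zero + [(-aj) * bi + ai * bj for ai, bi in pos for aj, bj in neg]

# R(s) = {(theta, beta): theta - beta >= 0, -theta + 2*beta >= 0}
proj = eliminate_theta([(1.0, -1.0), (-1.0, 2.0)])
print(proj)            # one inequality: 1.0 * beta >= 0, i.e. B(s) = {beta >= 0}
assert proj == [1.0]
```

The projection of {β ≤ θ ≤ 2β} onto β is indeed {β ≥ 0}; with many inequalities the pairing generates the large number of rows in M*_r that motivates the double description approach below.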
The Minkowski-Weyl Theorem essentially says that every polyhedral cone admits a generating matrix, and every generating matrix generates a polyhedral cone. The problem of finding the minimal generating matrix R(s) (that is, the matrix R(s) generating C(s) such that no proper submatrix of R(s) also generates C(s)) is called the extreme ray enumeration problem. Note that the minimal generating matrix is unique only up to multiplication by a positive scalar.

The characterizations of the cone C(s) in (4.7) and (4.8) are called its H-representation and its V-representation, respectively. Converting from one representation of a convex polyhedron to another is called the double description problem in computational geometry, and is one of the most important problems in the field. The double description algorithm of Fukuda and Prodon (1995) has an R implementation in the package Rcdd by Geyer (2019). There are also alternative nonincremental algorithms available for extreme ray enumeration; for instance, the reverse search algorithm by Avis and Fukuda (1996).

Footnote: Note that it is possible to first perform Fourier-Motzkin elimination, and then remove redundant inequalities from the system M*_r W_f(s) β ≥ 0.

Footnote: For a general convex polyhedron defined by Λ = {λ ∈ R^d : Aλ ≤ b}, the Minkowski-Weyl Theorem states that every vector λ ∈ Λ can be written as λ = λ_1 + λ_2, where λ_1 ∈ conv{v_1, . . . , v_k} and λ_2 ∈ cone{v_{k+1}, . . . , v_n}. Here v_1, . . . , v_k are called the vertices of Λ and v_{k+1}, . . . , v_n are the extreme rays of Λ. In the special case of b = 0, where all hyperplanes pass through the origin, Λ becomes a polyhedral cone and k = 0, so that Λ = cone{v_1, . . . , v_n}. This latter case is what is relevant for us, and the columns of the matrix R(s) are the collection of these extreme rays.
However, in general there is no known efficient (polynomial-time) algorithm for general input, although the incremental double description algorithm is known to be efficient for degenerate polyhedra (which arise very often when the hyperplanes are not in general position) and for low dimensions (up to 10). Avis et al. (1997) present a thorough comparison of these different algorithms.

After employing the double description algorithm the projection B(s) represented in (4.9) contains a minimal number of constraints defined only in terms of β. Repeating the procedure described above for all s ∈ S_p then gives us a collection of sets B(s) representing the projections of R(s) onto the parameter space B. However, for different binary vectors s ∈ S_p the projected sets B(s) may not be disjoint. Thus, to get the representative points β* we consider the intersection of these cones across s ∈ S_p. To do so, we stack all unique hyperplanes of the form H(s)β = 0 for all s ∈ S_p into a matrix H_p. The set of hyperplanes H_p β = 0 then defines the boundaries of the sets formed by the intersection of the cones B(s). From here we can easily collect the representative points from the resulting collection of sets defined by the hyperplanes H_p β = 0 by a final application of the hyperplane arrangement algorithm of Gu and Koenker (2020).

To summarize, our procedure to profile β is based on the idea that there are only a finite number of representative points from B that need to be considered in the bounding optimization problems. Our proposed procedure to find these representative points is as follows:

(i) Determine the collection S_p of binary vectors s ∈ {0, 1}^m corresponding to the non-empty sets R(s) from (4.3) by running the hyperplane arrangement algorithm of Gu and Koenker (2020) on the collection of hyperplanes of the form (4.4).

(ii) For each s ∈ S_p:

(a) Set D(s) = diag(2s − 1) and define W_r(s) := D(s)W_r and W_f(s) := D(s)W_f. Now remove any redundant inequalities from the system of inequalities in the set:

R(s) := { (θ, β) : W_r(s) θ + W_f(s) β ≥ 0 },

by solving a sequence of linear programs, as described in footnote 33.

(b) Obtain the generating matrix R(s) for the polyhedral cone C(s) using the double description algorithm of Fukuda and Prodon (1995), and set H(s) = R(s)⊤W_f(s). Then the projected set B(s) from (4.5) can be written:

B(s) = { β ∈ B : H(s) β ≥ 0 }.

(iii) Intersect the projected sets B(s) over all s ∈ S_p: by stacking the matrices H(s) over s ∈ S_p into the matrix H_p, the rows of the matrix H_p will define a set of hyperplanes that act as the boundaries of all sets defined by the intersection of the projected sets B(s).

(iv) Run the hyperplane arrangement algorithm of Gu and Koenker (2020) a final time on the collection of hyperplanes defined by the rows of H_p in order to collect representative points from each set.

Footnote: For an incremental algorithm to be polynomial-time, the size of the intermediate rays in each incremental step needs to be polynomial in the input size. The difficulty involved with all known incremental algorithms in the literature is that the intermediate representation can be very large, which leads the algorithm to be superpolynomial in the worst case. See further discussion in Bremner (1999).

The above discussion also sheds light on how we can construct the identified set for β. In particular, for some of these representative points the linear programming problems in our bounding procedure may have an empty feasible region; that is, there exists no valid conditional distribution of θ that fulfils all constraints for that particular value of β. In this case, these representative points—as well as all other values of β that belong to the same sets—cannot be included in the identified set for the fixed coefficients B*.
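Step (iv) can be sketched by sampling directions: since all hyperplanes H_p β = 0 pass through the origin, each cell of the β-arrangement is a cone, and one representative direction per distinct sign vector suffices. A toy version (with sampling on the unit circle standing in for the hyperplane arrangement algorithm, and an arbitrary H_p):

```python
import math

# Toy version of step (iv): collect one representative point per cell of the
# arrangement H_p beta = 0 in R^2, by sampling directions on the unit circle.

H_p = [(1.0, 0.0), (0.0, 1.0)]                 # two lines through the origin

reps = {}                                       # sign vector -> representative beta
for k in range(360):
    ang = 2 * math.pi * (k + 0.5) / 360         # offset avoids boundary directions
    beta = (math.cos(ang), math.sin(ang))
    sgn = tuple(int(h1 * beta[0] + h2 * beta[1] >= 0) for h1, h2 in H_p)
    reps.setdefault(sgn, beta)

print(len(reps))                                # 2 lines -> 4 conical cells
assert len(reps) == 4
```

The bounding linear programs then need to be solved only at the values in `reps`, one per equivalence class.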
Therefore, the identified set B* naturally collects all sets whose representative points yield a linear program with a non-empty feasible region. Since the arrangement involves only hyperplanes through the origin, all sets take the form of polyhedral cones; hence the identified set B* is a union of polyhedral cones. This implies that B* may not be connected, and that for any β ∈ B* we also have λβ ∈ B* for all λ ≥ 0. An appropriate normalization, for example fixing ||β|| = 1, will then lead to a bounded identified set B*.

In addition to functional form assumptions, other assumptions can also help to tighten the identified set. In some cases the researcher may have access to a variable that is believed to be independent of the distribution of latent variables. If such a variable enters as an argument in the structural function, then intuitively it will induce variation in the observed conditional choice probabilities without affecting the distribution of latent variables. We will refer to such variables as exogenous covariates. A similar intuition applies if a variable is independent of the distribution of latent variables and does not enter as an argument in the structural function, but has nontrivial dependence with the variables that do enter the structural function. We will refer to such variables as instruments. Any additional variation generated in the observed conditional choice probabilities by either exogenous covariates or instruments can be used to further pin down the distribution of latent variables.

We will now distinguish between the random variables in X and Z by allowing the variables in the random vector Z to satisfy an independence assumption with the latent variables θ. (Restricting an exogenous variable from entering the structural function is often known as the "exclusion restriction" in the terminology of simultaneous equations.)

Assumption 4.2 (Independence). For all A ∈ B(Θ) we have P_{θ|Z}(A | Z = z) = P_θ(A), z-a.s.

The independence assumption restricts the econometric model by constraining the set of admissible latent variable distributions, and provides a crucial link between the conditional distributions of θ | Z = z across values of z ∈ Z. When applied to our context, Assumption 4.2 nests the two kinds of independence constraints introduced above.
Furthermore, it is without loss of generality that we continue to write the structural function ϕ as a function of Z; this helps us avoid the unnecessary repetition of treating the two kinds of independence constraints separately. Definition B.2 in Appendix B.3 provides the extension of Definition 2.1 to the case when Assumption 4.2 also holds. Also, even though Assumption 4.2 posits full independence between Z and the vector of latent variables θ, the assumption can easily be modified for the case when a subvector of Z, say Z1, is conditionally independent of θ given some other subvector of Z, say Z2. We suppress this case for simplicity, but we note that conditional independence will not have any significant impact on the results to come, and thus can be easily accommodated.

Corollary B.4 provides the extension of Theorem 3.1 to the case when Assumption 4.2 also holds, and again allows us to reduce an infinite-dimensional existence problem to a manageable finite-dimensional existence problem. Intuitively, Corollary B.4 shows that every conditional probability measure P_{θ|Y,X,Z} defined on the sets A(β) from (3.4) satisfying the independence assumption can be extended to a probability measure on B(Θ) that satisfies Assumption 4.2. This result can be used to show that Assumption 4.2 is observationally equivalent to imposing independence between Z and the response types r(θ, β). This provides a meaningful interpretation of Assumption 4.2, which might otherwise be challenging to interpret.

To extend the linear programming result of Theorem 3.2, it is straightforward to see that we must simply include the additional constraints from Corollary B.4. Without loss of generality we again assume that all values of (y, x, z) are assigned positive probability by the observed distribution. Then these constraints can be written in terms of the parameter vector ν(β) as:

Σ_{y∈{0,1}} Σ_{x∈X} ν(y, x, z_k, β, s) P(Y = y, X = x | Z = z_k) = Σ_{y∈{0,1}} Σ_{x∈X} ν(y, x, z_{k+1}, β, s) P(Y = y, X = x | Z = z_{k+1}),   (4.10)

for k = 1, ..., m_z − 1. The formal statement of the extension of Theorem 3.2 to the case when the constraints (4.10) are also imposed is provided by Corollary B.5 in Appendix B.3.

(To illustrate what we mean, suppose that Z = {z1, z2}. The first kind of independence constraint is associated with the case in which Z enters the structural function, but as an exogenous covariate. In this case our assumption implies constraints of the form:

Σ_{y∈{0,1}} Σ_{x∈X} P_{θ|Y,X,Z}(ϕ(x, z1, θ, β) ≥ 0 | Y = y, X = x, Z = z1) P(Y = y, X = x | Z = z1) = Σ_{y∈{0,1}} Σ_{x∈X} P_{θ|Y,X,Z}(ϕ(x, z2, θ, β) ≥ 0 | Y = y, X = x, Z = z2) P(Y = y, X = x | Z = z2),

for all pairs (z1, z2). The second kind of independence constraint arises if Z does not enter the structural function, but is dependent with our endogenous covariate X. In this case ϕ does not depend on Z, and our assumption implies constraints of the form:

Σ_{y∈{0,1}} Σ_{x∈X} P_{θ|Y,X,Z}(ϕ(x, θ, β) ≥ 0 | Y = y, X = x, Z = z1) P(Y = y, X = x | Z = z1) = Σ_{y∈{0,1}} Σ_{x∈X} P_{θ|Y,X,Z}(ϕ(x, θ, β) ≥ 0 | Y = y, X = x, Z = z2) P(Y = y, X = x | Z = z2),

for all pairs (z1, z2). Note that the first set of constraints reduces to the second when ϕ is specified so that it does not depend on the random vector Z. Thus, throughout the remainder of the text we will continue to write ϕ as a function of Z while keeping in mind that either type of independence assumption (exogenous covariates or instruments) may be considered.)

The independence assumptions provide additional information by constraining the set of admissible latent variable distributions to be those that are independent of the vector Z. In the case when Z is an instrument (in the terminology of the previous section), the structural function ϕ does not depend directly on Z, but only indirectly through its effect on X. This imposes a form of the exclusion restriction, which can have a substantial effect in the bounding problem by eliminating response types, just as in the previous subsection on functional form assumptions. In the next subsection we will show how monotonicity assumptions also lead to the elimination of response types.
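Constraints of the form (4.10) are linear equalities in ν and can simply be stacked into the equality-constraint matrix of the bounding linear programs. The sketch below builds those rows for a generic problem; the vectorisation order of ν and the function name are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def independence_rows(p_yx_given_z, n_types):
    """Build the equality-constraint matrix for constraints of the form
    (4.10): for every response type s and every adjacent pair (z_k, z_{k+1}),
      sum_{y,x} nu(y,x,z_k,s) P(Y=y,X=x|Z=z_k)
        = sum_{y,x} nu(y,x,z_{k+1},s) P(Y=y,X=x|Z=z_{k+1}).
    p_yx_given_z has shape (m_z, n_y, n_x); nu is vectorised with index
    order (z, y, x, s).  Layout choices are purely illustrative."""
    m_z, n_y, n_x = p_yx_given_z.shape
    n_vars = m_z * n_y * n_x * n_types

    def idx(z, y, x, s):
        return ((z * n_y + y) * n_x + x) * n_types + s

    rows = []
    for k in range(m_z - 1):
        for s in range(n_types):
            row = np.zeros(n_vars)
            for y in range(n_y):
                for x in range(n_x):
                    row[idx(k, y, x, s)] += p_yx_given_z[k, y, x]
                    row[idx(k + 1, y, x, s)] -= p_yx_given_z[k + 1, y, x]
            rows.append(row)
    return np.vstack(rows)  # A_eq, with a zero right-hand side

A_eq = independence_rows(np.full((2, 2, 2), 0.25), n_types=4)
print(A_eq.shape)  # (4, 32): one row per (adjacent z pair, response type)
```

Any ν that does not vary with z automatically satisfies these rows when the conditional distributions of (Y, X) given Z coincide, which is the sense in which (4.10) links the conditional distributions of θ across values of the instrument.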
In this subsection we will introduce monotonicity assumptions. We will see that monotonicity assumptions also constrain the bounding problem by effectively eliminating consideration of certain response types. To introduce our monotonicity assumptions, let M ⊂ {1, ..., m} × {1, ..., m} denote any collection of pairs of integers (j, k), where 1 ≤ j, k ≤ m.

Assumption 4.3 (Monotonicity). For each β ∈ B and each pair (j, k) in some set M (as defined above) we have ϕ(x_j, z_j, β, θ) ≤ ϕ(x_k, z_k, β, θ) a.s.

This monotonicity assumption states that, when comparing two points (x_j, z_j) and (x_k, z_k), the value of the structural function can be ordered by the researcher. Definition B.3 in Appendix B.4 provides the extension of Definition 2.1 to the case when Assumption 4.3 is also imposed.

Note that if the order determined by the researcher's monotonicity assumption for the pair of points (x_j, z_j) and (x_k, z_k) is ϕ(x_j, z_j, β, θ) ≤ ϕ(x_k, z_k, β, θ) (for example), then the researcher automatically rules out response types with 1{ϕ(x_j, z_j, β, θ) ≥ 0} > 1{ϕ(x_k, z_k, β, θ) ≥ 0}. In other words, it cannot be that an individual assigned the vector (x_j, z_j) under counterfactual γ ∈ Γ would have responded with Y_γ = 1 if that same individual would have responded with Y_γ′ = 0 when assigned the vector (x_k, z_k) under counterfactual γ′ ∈ Γ. The following example illustrates how this idea leads to the elimination of response types.
Example 3.
Suppose again that we have only a binary variable X ∈ {0,1} and latent variables θ (i.e., no variables Z and no fixed coefficients β). Then the structural function from (2.1) can be written as ϕ(X, θ), and the binary response vector r(β, θ) can be written as r(θ), where:

r(θ) = (1{ϕ(0, θ) ≥ 0}, 1{ϕ(1, θ) ≥ 0})′.

Note that there are only four response types; that is, r(θ) ∈ {s1, s2, s3, s4}, where:

s1 = (0, 0)′, s2 = (1, 0)′, s3 = (0, 1)′, s4 = (1, 1)′.

Without any additional restrictions, all response types, and thus all sets of the form Θ(β, s) for s ∈ {0,1}², can be assigned positive probability by the optimization problems in Theorem 3.2. Now suppose we entertain the monotonicity assumption ϕ(0, θ) ≤ ϕ(1, θ) a.s. Imposing this constraint clearly rules out the case when r(θ) = s2 = (1, 0)′, and thus the set Θ(β, s2) = {θ : r(θ) = s2} must now be assigned probability zero in any solution to the optimization problems in Theorem 3.2. Constraining such sets to be assigned zero probability reduces the size of the feasible region and thus potentially tightens the resulting bounds on counterfactual choice probabilities.

(Recall that even though the independence assumption does not affect the number of response types, the exclusion restriction reduces response types by reducing the number of variables entering the structural function.)

Monotonicity of the type entertained here has a number of precedents in the literature on treatment effects, and can be interpreted in a few different ways. For example, when Y is interpreted as a treatment indicator, the type of monotonicity introduced here nests the monotonicity assumption from Angrist et al. (1996) required for identification of the local average treatment effect.
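The elimination of response types in Example 3 is easy to mechanise: enumerate all binary response vectors and discard those that violate the assumed ordering. A minimal sketch (the helper below is our own, for illustration):

```python
import itertools

def admissible_types(n_points, order_pairs):
    """Enumerate binary response vectors s in {0,1}^n_points and keep
    those consistent with monotonicity: a pair (j, k) in order_pairs
    means the response at point j can never exceed the response at
    point k, i.e. s_j <= s_k."""
    keep = []
    for s in itertools.product([0, 1], repeat=n_points):
        if all(s[j] <= s[k] for j, k in order_pairs):
            keep.append(s)
    return keep

# Example 3: two points, x = 0 and x = 1, with phi(0, theta) <= phi(1, theta).
print(admissible_types(2, [(0, 1)]))
# -> [(0, 0), (0, 1), (1, 1)]; the type (1, 0) is eliminated
```

With many evaluation points the same filter prunes the number of ν variables entering the linear programs, which is exactly how the constraints (4.11) operate: every eliminated type has its conditional probability pinned to zero.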
Alternatively, when Y is interpreted as the binary outcome after some (possibly endogenous) treatment X, our monotonicity assumption can be interpreted as a version of the monotone treatment response assumption introduced in Manski (1997) and also considered in Manski and Pepper (1998). Finally, similar monotonicity assumptions in triangular systems have also been extensively explored by Heckman and Pinto (2018). In particular, Heckman and Pinto (2018) explore how choice theory can be used to impose monotonicity assumptions and to eliminate response types, and many of their insights are also applicable here.

Following the insights from the example above, let us define the collection of binary vectors S_M to be those that respect the monotonicity relations from Assumption 4.3. The extension of Theorem 3.1 to the case when Assumption 4.3 is imposed is provided by Corollary B.7 in Appendix B.4. To extend the results of Theorem 3.2 we must simply include the set of constraints imposed by Assumption 4.3 in our optimization problems. These constraints are provided in Corollary B.7, and can be written in terms of the parameter vector ν(β) as:

Σ_{s ∈ S_M^c} ν(y, x_j, z_j, β, s) = 0,   (4.11)

for all y ∈ {0,1} and j = 1, ..., m occurring with positive probability. Corollary B.8 in Appendix B.4 then shows the extension of Theorem 3.2 to the case when Assumption 4.3 is imposed using the constraints (4.11). Corollary B.9 then extends Proposition 3.1.

Combining all of the results in this section, any combination of Assumption 4.1, Assumption 4.2 and Assumption 4.3 can be imposed on the optimization problems in Theorem 3.2 by simply adding the corresponding combination of constraints (4.1), (4.10) and (4.11), respectively. This shows that the optimization formulation of the bounds in Theorem 3.2 can flexibly incorporate a wide variety of modelling assumptions, as will be demonstrated in the application section ahead.
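To illustrate how these pieces fit together, the following toy linear program bounds a counterfactual choice probability for a single endogenous binary X, with and without the monotonicity restriction from Example 3. It is a deliberately stripped-down sketch of the bounding programs in Theorem 3.2 (no Z, no β, illustrative numbers only); without any restrictions the bounds are the uninformative [0, 1], and monotonicity tightens the lower bound.

```python
import itertools

import numpy as np
from scipy.optimize import linprog

def ccp_bounds(p_y1_given_x0, monotone=False):
    """Toy bounding LPs: bound the counterfactual probability
    P(Y_{x=1} = 1 | X = 0) when X is endogenous, using only
    P(Y = 1 | X = 0).  Decision variables are nu(y, s), the conditional
    probability of response type s = (s_0, s_1) given (Y = y, X = 0)."""
    types = list(itertools.product([0, 1], repeat=2))
    p_y = {1: p_y1_given_x0, 0: 1.0 - p_y1_given_x0}
    n = 2 * len(types)
    idx = lambda y, s: y * len(types) + types.index(s)
    # objective: sum_y P(Y=y|X=0) * sum_{s: s_1 = 1} nu(y, s)
    c = np.zeros(n)
    for y in (0, 1):
        for s in types:
            if s[1] == 1:
                c[idx(y, s)] = p_y[y]
    # nu(y, .) must sum to one for each observed outcome y
    A_eq = np.zeros((2, n)); b_eq = np.ones(2)
    for y in (0, 1):
        for s in types:
            A_eq[y, idx(y, s)] = 1.0
    # nu >= 0; nu(y, s) = 0 if s contradicts the observed outcome
    # (s_0 != y), or if monotonicity eliminates s = (1, 0)
    bnd = []
    for y in (0, 1):
        for s in types:
            dead = (s[0] != y) or (monotone and s == (1, 0))
            bnd.append((0.0, 0.0) if dead else (0.0, 1.0))
    lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bnd).fun
    hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bnd).fun
    return lo, hi

print(ccp_bounds(0.3))                 # roughly (0.0, 1.0): uninformative
print(ccp_bounds(0.3, monotone=True))  # roughly (0.3, 1.0)
```

The only difference between the two runs is that a block of ν variables is pinned to zero, mirroring how constraints such as (4.11) enter the full problem.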
In this section we apply our method to study the impact of private health insurance on an individual's decision to visit a doctor. In general, insurance markets are plagued by problems arising from asymmetric information between consumers and insurance providers (cf. Rothschild and Stiglitz (1978)). For example, adverse selection occurs in the health insurance market when individuals have more information about their latent health determinants than the providers of health insurance. A robust prediction of the classical theory of asymmetric information is that those who are more likely to purchase insurance are also those who are more likely to experience the insured risk. On the other hand, the empirical evidence of adverse selection in health insurance markets has been scarce and mixed (see Cardon and Hendel (2001) for a discussion). Others have suggested that those who purchase insurance may be more risk averse, and so less likely to engage in activities that might cause them to experience the insured risk. Evidence of this is found in Finkelstein and McGarry (2006), who demonstrate that wealthier and more cautious individuals are more likely to have long-term care insurance, but less likely to ever use their insurance. However, in many cases the opposite is equally plausible. For example, Bajari et al. (2014) explore the effect of moral hazard in health insurance markets, which occurs when those who purchase health insurance are more likely to experience the insured risk because they no longer bear the full cost of health care.

Here we do not attempt to disentangle the effects of adverse selection, risk aversion, or moral hazard. Instead we compute various counterfactual parameters while remaining agnostic on the exact nature of the unobservables linking the health insurance and health care utilization decisions.
We take the decision to visit a doctor as our binary outcome variable of interest, and we consider an individual's private health insurance status to be an endogenous explanatory variable. This latter point is consistent with the idea that private insurance status may be dependent with individual-specific latent factors, most importantly unobserved health determinants and attitudes towards risk, that also influence an individual's propensity to visit a doctor. We use data from the 2010 wave of the Medical Expenditure Panel Survey (MEPS). This data has been analyzed by Han and Lee (2019), and we focus on the same sub-sample they consider. In particular, we focus on the month of January 2010, consider only individuals between ages 25 and 64, and drop individuals who obtain either federal or state insurance in 2010 as well as individuals who are self-employed or unemployed. These restrictions leave us with a sample of 7,555 individuals. (The "insured risk" refers to the event for which insurance was purchased. In our context, it is any event that would typically require a visit to the doctor.)
In all specifications X is a binary endogenous variable representing an individual's private insurance status, and we consider a binary health status variable (Z1) and a binary marital status variable (Z2) as regressors. Finally, we use the number of employees working for the individual's firm (Z3) as an instrument. This variable provides a measure of the size of a firm and has discrete support in the range [1, …]. Using firm size as an instrument is consistent with the evidence that larger firms are more likely to provide health insurance benefits, but do not directly influence an individual's decision to visit a doctor. The same instrument was also used in Han and Lee (2019). However, rather than imposing full independence of Z3, throughout the application we impose that Z3 is conditionally independent of θ given the vector (Z1, Z2).

A possible concern with using firm size as an instrument is that risk-averse individuals may be more likely to select into a job with a larger firm size. In an attempt to address this issue, we also investigate a weaker conditional independence assumption (which we call relaxed conditional independence) that assumes the firm size Z3 is conditionally independent of θ given (Z1, Z2) only when Z3 lies within a certain range. The main idea is that once we condition on a particular range of firm size, the remaining variation in firm size is independent of θ conditional on (Z1, Z2). We consider four ranges, given by (1, …], (…, …], (…, …], and (…, …].

Our first parameter of interest is the average treatment effect:

μ_ate := Σ_{(y,x,z)∈{0,1}×X×Z} P_{θ|Y,X,Z}(ϕ(1, z, θ, β) ≥ 0 | Y = y, X = x, Z = z) P(Y = y, X = x, Z = z) − Σ_{(y,x,z)∈{0,1}×X×Z} P_{θ|Y,X,Z}(ϕ(0, z, θ, β) ≥ 0 | Y = y, X = x, Z = z) P(Y = y, X = x, Z = z).

This parameter provides the average causal effect of obtaining health insurance on the decision to visit a doctor. Near the end of this section, we will also consider bounds on counterfactual conditional choice probabilities.
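Since each term of μ_ate is linear in the variables ν(y, x, z, s), the objective of the bounding linear programs is just a fixed coefficient vector. The sketch below builds that vector for a toy problem with binary X and a single binary Z; the indexing of response types and of ν is our own illustrative convention, not the paper's implementation.

```python
import itertools

import numpy as np

def ate_objective(p_yxz):
    """Objective coefficients for mu_ate as a linear functional of the
    variables nu(y, x, z, s).  Response types s are binary vectors
    indexed by the counterfactual point (x', z), with x', z in {0, 1}
    and s[(x', z)] stored at position 2*x' + z.  The coefficient on
    nu(y, x, z, s) is (s[(1, z)] - s[(0, z)]) * P(Y=y, X=x, Z=z)."""
    types = list(itertools.product([0, 1], repeat=4))
    cells = [(y, x, z) for y in (0, 1) for x in (0, 1) for z in (0, 1)]
    c = np.zeros(len(cells) * len(types))
    for i, (y, x, z) in enumerate(cells):
        for j, s in enumerate(types):
            c[i * len(types) + j] = (s[2 * 1 + z] - s[2 * 0 + z]) * p_yxz[y, x, z]
    return c

c = ate_objective(np.full((2, 2, 2), 0.125))
print(c.min(), c.max())  # coefficients lie in [-0.125, 0.125]
```

Maximising and minimising this same linear functional over the feasible ν (observational-equivalence, independence, and monotonicity constraints) then delivers the upper and lower bounds on μ_ate reported below.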
We will construct our bounds under the following sets of assumptions:

(A1) Only Assumptions 2.1 and 2.2.
(A2) (A1) and monotonicity, Assumption 4.3 (the discussion below will provide further details).
(A3) (A1) and independence between (Z1, Z2) and θ, Assumption 4.2.
(A4) (A1), (A2) and (A3) together.
(A5) (A1) and independence Assumption 4.2, with independence between (Z1, Z2) and θ and conditional independence of Z3 given (Z1, Z2).
(A6) (A1), (A2) and (A5) together.
(A7) (A1) and independence between (Z1, Z2) and θ, and relaxed conditional independence of Z3.
(A8) (A1), (A2) and (A7) together.

(The MEPS data includes information on self-reported health status on a scale from 1 to 5, and we regard values less than or equal to 2 as being "unhealthy." Variable Z3 is supported on the range [1, …], although the upper end of the support of Z3 has very few observations. In order to get reliable estimates of the conditional choice probabilities, we further discretize firm size into 11 bins. From Cardon and Hendel (2001), p. 408: "Another observed symptom, consistent with the theoretical predictions, is that the uninsured tend to work for small employers. Large employers can overcome adverse selection by risk pooling.")

Note that the general index function takes the form ϕ(x, z1, z2, θ, β). When we say that the monotonicity assumption is imposed in (A2), we are in fact imposing:

ϕ(1, 1, z2, θ, β) ≥ ϕ(0, 1, z2, θ, β),

for each z2 ∈ {0, 1}. This implies that for an unhealthy individual, the propensity to visit a doctor when the person has private insurance is always weakly greater than without the insurance, regardless of marital status. Finally, we consider four different models for the binary outcome variable Y:

Y = 1{ϕ(X, Z, θ) ≥ 0},   (M1)
Y = 1{ϕ(X, Z) ≥ θ},   (M2)
Y = 1{Xθ1 + Z1β1 + Z2β2 ≥ θ2},   (M3)
Y = 1{Xβ0 + Z1β1 + Z2β2 ≥ θ}.   (M4)

Under model (M1) the index function ϕ need not even be explicitly specified, so long as we imagine that it satisfies Assumption 2.1. This makes model (M1) the most flexible. In model (M2) we start to introduce functional form restrictions on ϕ. In particular, (M2) restricts the latent variable to be scalar and additively separable from the nonparametric index function ϕ(X, Z). Additional details on how to apply our method to the model in (M2) can be found in Appendix B.6. Finally, models (M3) and (M4) impose linearity of ϕ in the latent variables and in the parameters. However, we distinguish two cases. In the first case, (M3) regards (θ1, θ2) as the latent variables in the model.
The second case in (M4) is the same as the first, except that we have replaced the random slope coefficient θ1 from (M3) with a fixed coefficient β0. Model (M4) represents the additively separable linear index model that is commonly used in the empirical literature, except that we do not assume a parametric distribution for θ and do not have a model for how the endogenous variable X is generated.

Our method is employed using simple plug-in estimators for all probabilities depending on the observed random variables Y, X, Z1, Z2 and Z3. In Appendix B.5 we present a consistency result specially designed for plug-in estimation in the kinds of problems considered in this paper. Theorem B.2 in Appendix B.5 demonstrates the conditions under which simple plug-in estimation of the constraints and objective functions in our problems leads to a consistent estimate of the identified set for our functional of interest. (Importantly, our consistency result requires a slight, but vanishing, relaxation of the constraint set in our linear programs; in particular, see the sequence "b_n" in Appendix B.5. However, the scale of this sequence can be taken to be extremely small, and so has a minimal impact on the estimated bounds.) We refer the reader to Appendix B.5 for additional discussion and details.

[Table 1 here: lower (LB) and upper (UB) bounds on μ_ate for models (M1)-(M4) under assumptions (A1)-(A8).]

Table 1:
Convex hull of the sharp identified set for the average treatment effect under different specifications and under various assumptions.

We are now prepared to present the results. The identified sets for μ_ate under assumptions (A1) - (A8) and models (M1) - (M4) are reported in Table 1. For simplicity, we report the convex hull of the estimated identified set. Unsurprisingly, the bounds on μ_ate shrink as the strength of our assumptions increases. The most flexible model is (M1) under assumption (A1). It is interesting to note that the bounds on μ_ate in this case are contained strictly within the interval [−1, 1], the trivial logical bounds for μ_ate. Also note that the identified set for μ_ate always overlaps zero for model (M1). As expected, conditional independence of Z3 is a stronger assumption than relaxed conditional independence; hence the identified sets under assumptions (A7) and (A8) always contain those under (A5) and (A6). In fact, relaxed conditional independence does not provide much identifying power (compare the results under Assumptions (A3) and (A7)). On the other hand, conditional independence of Z3 does induce a noticeable narrowing of the identified set for μ_ate (compare the results under Assumptions (A3) and (A6)). The results for this model are a useful benchmark against cases where we impose more structure on the index function.

Next, model (M2) considers the threshold-crossing case. Details on our procedure to estimate this model are provided in Appendix B.6. We notice immediately in Table 1 that this model narrows the identified set relative to the case of general nonseparability. The results for this model serve as an interesting point of comparison with previous results in the partial identification literature. Model (M2) is closely related to a model considered in Shaikh and Vytlacil (2011) and Mourifié (2015).
However, both Shaikh and Vytlacil (2011) and Mourifié (2015) also have a threshold-crossing model for the binary endogenous variable X of the form:

X = 1{ν(Z1, Z2, Z3) ≥ ε}.

Furthermore, they assume independence between (Z1, Z2, Z3) and (θ, ε). Here we differ critically from these papers by not imposing a threshold-crossing model for X, instead allowing the process determining X to be unspecified. We also do not assume full independence of the instrument Z3, but instead assume various forms of conditional independence. Thus, the bounds presented in Table 1 are valid under weaker assumptions. Note that the sign of μ_ate is still not identified under model (M2) except when conditional independence of Z3 is imposed, showing the strong identifying power of this assumption.

Finally, we see in Table 1 that the linear models from (M3) and (M4) also narrow the bounds relative to the case of general nonseparability. The bounds under (M4) are nested in the bounds produced under (M2); this is expected, since (M4) is a special case of (M2). Note that the same is not true of models (M2) and (M3). Unsurprisingly, the smallest interval for μ_ate is obtained under Assumptions (A5) and (A6) for model (M4).

For models (M3) and (M4) we make use of our method for profiling β, as described in Section 4.1.2. In model (M3) we must profile on β ∈ R². Figure 2 plots the various regions of B corresponding to the points β that deliver the same collection of sets Θ(β, s) with non-empty interior, along with the associated representative points.

[Figure 2 here.]

Figure 2: Profiling of β: there are 8 representative points, each representing one of the eight sets determined by 4 hyperplanes in R².

Interestingly, we find that under Assumptions (A1) - (A4) and (A7) - (A8), the identified set of β is the entire Euclidean space R². This illustrates that non-trivial bounds on μ_ate are possible even when the structural parameters are not identified. Figure 3 shows the intervals computed using the linear programs of the form (3.15) and (3.16) for each representative point of β under our various assumptions.
The results in Table 1 for model (M3) represent the (convex hull of the) union of the intervals in Figure 3.

[Figure 3 here.]

Figure 3: This figure shows the intervals computed using the linear programs of the form (3.15) and (3.16) for each representative point of β ∈ R² when bounding μ_ate for Model (M3) under various assumptions. The active assumptions are given at the top of each illustration. The axes labelled "profile" correspond to the various representative points.

In the second linear model (M4), all coefficients are fixed. Thus, we now need to profile on a parameter vector β ∈ R³. Our profiling procedure from Section 4.1.2 returns 96 representative points, each associated with a polyhedral cone in R³. A visual representation is provided in Figure 4.

[Figure 4 here.]

Figure 4: Profiling of (θ, β) in R³: there are 96 representative points, each representing one of the 96 sets determined by 13 hyperplanes in R³.

Under Assumptions (A1) and (A2), the identified set for β is R³, while for all other assumptions (A3) - (A8) we get an informative identified set for β. In Figure 5 we also show the intervals computed using the linear programs of the form (3.15) and (3.16) for each representative point of β under our various assumptions. The results in Table 1 for model (M4) represent the (convex hull of the) union of the intervals in Figure 5.

A few interesting patterns also emerge when we consider parameters other than the average treatment effect. In particular, consider the counterfactual choice probability:

μ_ccp(y) := Σ_{z∈Z} P_{θ|Y,X,Z}(ϕ(1, z, θ, β) ≥ 0 | Y = y, X = 0, Z = z) P(Z = z | Y = y, X = 0),

for y ∈ {0, 1}. We will focus on the parameter μ_ccp(0) for simplicity, which represents the counterfactual choice probability of visiting a doctor when given private health insurance for the set of individuals who have no insurance and who have chosen not to visit a doctor, averaged across health and marital status. Table 2 reports the convex hull of the estimated identified set for μ_ccp(0) under various model specifications and under various assumptions. Similar to the bounds for μ_ate, the bounds on counterfactual choice probabilities tend to be wide and uninformative for most assumptions. Note that under Assumption (A1) we always obtain the interval [0,
1] for the estimated identified set, providing empirical confirmation of our impossibility result from Corollary 3.1. Remarkably, the bounds for models (M2) and (M4) are very similar, showing that the additional functional form assumptions in model (M4) do not have significant identifying power relative to the threshold-crossing model in (M2) for this particular counterfactual choice probability. The narrowest bounds are found in models (M2) and (M4) under Assumptions (A5) and (A6). These bounds allow us to draw conclusions about the probability that an individual visits a doctor when given private health insurance.

[Figure 5 here.]

Figure 5: This figure shows the intervals computed using the linear programs of the form (3.15) and (3.16) for each representative point of β ∈ R³ when bounding μ_ate for Model (M4) under various assumptions. The active assumptions are given at the top of each illustration. The axes labelled "profile" correspond to the various representative points.

[Table 2 here: lower (LB) and upper (UB) bounds on μ_ccp(0) for models (M1)-(M4) under various assumptions.]

Table 2:
This table reports the convex hull of the estimated bounds on μ_ccp(0), the counterfactual choice probability of visiting a doctor when granted insurance, under different assumptions, for those who chose not to visit a doctor without insurance.

To summarize, Table 1 shows that most specifications do not identify the sign of μ_ate, and Table 2 shows that most bounds on counterfactual choice probabilities are not informative. Exceptions typically occur only under the strongest independence assumptions, given by assumptions (A5) and (A6), and the strongest functional form assumptions, given in model (M4). However, even the strongest set of assumptions considered here is much weaker than the typical assumptions employed in empirical work. For the sake of comparison with our results, we also estimate the following bivariate probit model:

Y = 1{Xβ0 + Z1β1 + Z2β2 ≥ ε1},
X = 1{Z1γ1 + Z2γ2 + Z3γ3 ≥ ε2},

where (Z1, Z2, Z3) are assumed to be independent of (ε1, ε2), which are bivariate normal with mean zero, unit variance and correlation ρ. This model was estimated with our data using maximum likelihood, and μ_ate was estimated as 0.163, with a bootstrapped confidence interval of [0.…, 0.…]. This estimate of μ_ate lies within all of our bounds in Table 1, and seems to suggest strong evidence of a positive causal effect of health insurance on the decision to visit the doctor. However, the bivariate probit model is highly parameterized, and the results from Table 1 suggest that under weaker assumptions the sign of μ_ate may not be identified. (Han and Lee (2019) also obtain a similar result in a model allowing ε1 and ε2 to have unrestricted marginals and a flexible dependence structure. However, they consider a different model from ours, and the average treatment effect in Han and Lee (2019) is different from ours; we consider the average treatment effect averaged over all values of (x, z), while they report the average treatment effect at the average value of their conditioning variables. They also report the average treatment effect at various quantiles of their conditioning variables.)

This paper considers (partial) identification of a variety of parameters in binary response models with possibly endogenous regressors. Importantly, our class of models allows for general nonseparability of the index function in latent variables, and does not require any parametric distributional assumptions. Our approach to bounding counterfactual parameters is based on framing the bounds in terms of two optimization problems: one for the lower bound, and one for the upper bound. Our specific partition of the latent variable space is key to this result, allowing us to reduce an intractable infinite-dimensional problem to two tractable optimization problems with a finite number of constraints. We then show how a variety of assumptions can be easily imposed in our framework, and that many assumptions can be interpreted as eliminating particular sets from our partition of the latent variable space.
We thoroughly studied the case of a latent index function that is linear in latent variables and linear in parameters, and showed how results from computational geometry are helpful in our problem. Finally, we applied our method to study the effects of private health insurance on the utilization of health care services.

There are a number of obvious further directions in which to expand the ideas presented in this paper. For example, the consideration of multinomial choice models, triangular systems, or general simultaneous discrete choice models all seem to be natural next steps. In addition, a major emphasis in this paper, as in other recent papers, is on the interesting computational problems that arise in models that are partially identified. We believe exploring applications of state-of-the-art algorithms in computer science to problems in econometrics, as we have attempted here, is a fruitful avenue of research.

References
Aliprantis, C. D. and Border, K. C. (2006). Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer.
Allen, R. and Rehbeck, J. (2019). Identification with additively separable heterogeneity. Econometrica, 87(3):1021-1054.
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444-455.
Artstein, Z. (1983). Distributions of random sets and random selections. Israel Journal of Mathematics, 46(4):313-324.
Avis, D., Bremner, D., and Seidel, R. (1997). How good are convex hull algorithms? Computational Geometry, 7(5-6):265-301.
Avis, D. and Fukuda, K. (1996). Reverse search for enumeration. Discrete Applied Mathematics, 65(1-3):21-46.
Bajari, P., Dalton, C., Hong, H., and Khwaja, A. (2014). Moral hazard, adverse selection, and health expenditures: A semiparametric analysis. The RAND Journal of Economics, 45(4):747-763.
Balke, A. and Pearl, J. (1994). Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings 1994, pages 46-54. Elsevier.
Bennett, J. F. (1956). Determination of the number of independent parameters of a score matrix from the examination of rank orders. Psychometrika, 21(4):383-393.
Beresteanu, A., Molchanov, I., and Molinari, F. (2011). Sharp identification regions in models with convex moment predictions. Econometrica, 79(6):1785-1821.
Beresteanu, A., Molchanov, I., and Molinari, F. (2012). Partial identification using random set theory. Journal of Econometrics, 166(1):17-32.
Beresteanu, A. and Molinari, F. (2008). Asymptotic properties for a class of partially identified models. Econometrica, 76(4):763-814.
Blundell, R. W. and Powell, J. L. (2004). Endogeneity in semiparametric binary response models. Review of Economic Studies, 71(3):655-679.
Blundell, R. W. and Smith, R. J. (1989). Estimation in a class of simultaneous equation limited dependent variable models. The Review of Economic Studies, 56(1):37-57.
Bremner, D. (1999). Incremental convex hull algorithms are not output sensitive. Discrete & Computational Geometry, 21(1):57-68.
Buck, R. (1943). Partition of space. The American Mathematical Monthly, 50:541-544.
Cardon, J. H. and Hendel, I. (2001). Asymmetric information in health insurance: evidence from the National Medical Expenditure Survey. RAND Journal of Economics, pages 408-427.
Chernozhukov, V. and Hansen, C. (2005). An IV model of quantile treatment effects. Econometrica, 73(1):245-261.
Chernozhukov, V., Hong, H., and Tamer, E. (2007). Estimation and confidence regions for parameter sets in econometric models. Econometrica, 75(5):1243-1284.
Chesher, A. (2013). Semiparametric structural models of binary response: shape restrictions and partial identification. Econometric Theory, pages 231-266.
Chesher, A. and Rosen, A. M. (2014). An instrumental variable random-coefficients model for binary outcomes. The Econometrics Journal, 17(2):S1-S19.
Chesher, A. and Rosen, A. M. (2017). Generalized instrumental variable models. Econometrica, 85(3):959-989.
Chesher, A. and Rosen, A. M. (2019). Generalized instrumental variable models, methods and applications. Technical report, cemmap working paper.
Chesher, A., Rosen, A. M., and Smolinski, K. (2013). An instrumental variable model of multiple discrete choice. Quantitative Economics, 4(2):157-196.
Chiburis, R. C. (2010). Semiparametric bounds on treatment effects. Journal of Econometrics, 159(2):267-275.
Chiong, K., Hsieh, Y.-W., and Shum, M. (2017). Counterfactual estimation in semiparametric discrete-choice models. Available at SSRN 2979446.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, (3):326-334.
Cover, T. M. (1967). The number of linearly inducible orderings of points in d-space. SIAM Journal on Applied Mathematics, 15(2):434-439.
Dong, Y. and Lewbel, A. (2015). A simple estimator for binary choice models with endogenous regressors. Econometric Reviews, 34(1-2):82-105.
Durrett, R. (2010). Probability: Theory and Examples, fourth edition. Cambridge University Press.
Finkelstein, A. and McGarry, K. (2006). Multiple dimensions of private information: evidence from the long-term care insurance market. American Economic Review, 96(4):938-958.
Frisch, R. (1938). Statistical versus theoretical relations in economic macrodynamics. Paper given at League of Nations. Reprinted in Hendry, D. F. and M. S. Morgan (1995), The Foundations of Econometric Analysis.
Fukuda, K. (2014). Frequently asked questions in polyhedral computation. https://people.inf.ethz.ch/fukudak//polyfaq/polyfaq.html.
Fukuda, K. and Prodon, A. (1995). Double description method revisited. In Franco-Japanese and Franco-Chinese Conference on Combinatorics and Computer Science, pages 91-111. Springer.
Galichon, A. and Henry, M. (2011). Set identification in models with multiple equilibria. The Review of Economic Studies, 78(4):1264-1298.
Gautier, E. and Kitamura, Y. (2013). Nonparametric estimation in random coefficients binary choice models. Econometrica, 81(2):581-607.
Geyer, C. (2019). Using the RCDD package. https://cran.r-project.org/web/packages/rcdd/vignettes/vinny.pdf.
Gu, J. and Koenker, R. (2020). Nonparametric maximum likelihood methods for binary response models with random coefficients. Journal of the American Statistical Association, pages 1-47.
Gunsilius, F. F. (2020). A path-sampling method to partially identify causal effects in instrumental variable models. Working paper.
Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica, Journal of the Econometric Society, pages 1-12.
Haavelmo, T. (1944). The probability approach in econometrics. Econometrica: Journal of the Econometric Society, pages iii-115.
Han, S. and Lee, S. (2019). Estimation in a generalization of bivariate probit models with dummy endogenous regressors. Journal of Applied Econometrics, 34(6):994-1015.
Heckman, J. J. and Pinto, R. (2015). Causal analysis after Haavelmo. Econometric Theory, 31(1):115-151.
Heckman, J. J. and Pinto, R. (2018). Unordered monotonicity. Econometrica, 86(1):1-35.
Heckman, J. J. and Vytlacil, E. J. (2007). Econometric evaluation of social programs, part I: Causal models, structural models and econometric policy evaluation. Handbook of Econometrics, 6:4779-4874.
Ichimura, H. and Thompson, T. S. (1998). Maximum likelihood estimation of a binary choice model with random coefficients of unknown distribution. Journal of Econometrics, 86(2):269-295.
Imbens, G. W. and Newey, W. K. (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481-1512.
Kohler, D. A. (1967). Projections of convex polyhedral sets. Technical report, University of California Berkeley Operations Research Center.
Lewbel, A. (2000). Semiparametric qualitative response model estimation with unknown heteroscedasticity or instrumental variables. Journal of Econometrics, 97(1):145-177.
Lewbel, A. (2007). Coherency and completeness of structural models containing a dummy endogenous variable. International Economic Review, 48(4):1379-1392.
Lewbel, A., Dong, Y., and Yang, T. T. (2012). Comparing features of convenient estimators for binary choice models with endogenous regressors. Canadian Journal of Economics/Revue canadienne d'économique, 45(3):809-829.
Manski, C. F. (1977). The structure of random utility models. Theory and Decision, 8(3):229.
Manski, C. F. (1997). Monotone treatment response. Econometrica: Journal of the Econometric Society, pages 1311-1334.
Manski, C. F. (2007). Partial identification of counterfactual choice probabilities. International Economic Review, 48(4):1393-1410.
Manski, C. F. and Pepper, J. V. (1998). Monotone instrumental variables with an application to the returns to schooling. Technical report, National Bureau of Economic Research.
Manski, C. F. and Tamer, E. (2002). Inference on regressions with interval data on a regressor or outcome. Econometrica, 70(2):519-546.
Matzkin, R. L. (1992). Nonparametric and distribution-free estimation of the binary threshold crossing and the binary choice models. Econometrica: Journal of the Econometric Society, pages 239-270.
Matzkin, R. L. (2003). Nonparametric estimation of nonadditive random functions. Econometrica, 71(5):1339-1375.
Molchanov, I. (2017). Theory of Random Sets. Springer Science & Business Media.
Molchanov, I. S. (1998). A limit theorem for solutions of inequalities. Scandinavian Journal of Statistics, 25(1):235-242.
Motzkin, T., Raiffa, H., Thompson, G., and Thrall, R. (1953). The double description method. In Kuhn, H. and Tucker, A., editors, Contributions to the Theory of Games. Princeton University Press.
Mourifié, I. (2015). Sharp bounds on treatment effects in a binary triangular system. Journal of Econometrics, 187(1):74-81.
Norberg, T. (1992). On the existence of ordered couplings of random sets, with applications. Israel Journal of Mathematics, 77(3):241-264.
Pearl, J. (2009). Causality. Cambridge University Press.
Rada, M. and Černý, M. (2018). A new algorithm for enumeration of cells of hyperplane arrangements and a comparison with Avis and Fukuda's reverse search. SIAM Journal on Discrete Mathematics, 32(1):455-473.
Rothschild, M. and Stiglitz, J. (1978). Equilibrium in competitive insurance markets: An essay on the economics of imperfect information. In Uncertainty in Economics, pages 257-280. Elsevier.
Russell, T. M. (2019). Sharp bounds on functionals of the joint distribution in the analysis of treatment effects. Journal of Business & Economic Statistics, pages 1-15.
Sainte-Beuve, M.-F. (1974). On the extension of von Neumann-Aumann's theorem. Journal of Functional Analysis, 17(1):112-129.
Shaikh, A. M. and Vytlacil, E. J. (2011). Partial identification in triangular systems of equations with binary dependent variables. Econometrica, 79(3):949-955.
Sleumer, N. H. (1999). Output-sensitive cell enumeration in hyperplane arrangements. Nordic Journal of Computing, 6(2):137-147.
Tamer, E. (2003). Incomplete simultaneous discrete response model with multiple equilibria. The Review of Economic Studies, 70(1):147-165.
Tebaldi, P., Torgovitsky, A., and Yang, H. (2019). Nonparametric estimates of demand in the California health insurance exchange. Technical report, National Bureau of Economic Research.
Torgovitsky, A. (2019). Partial identification by extending subdistributions. Quantitative Economics, 10(1):105-144.
Vytlacil, E. and Yildiz, N. (2007). Dummy endogenous variables in weakly separable models. Econometrica, 75(3):757-779.
A Proofs
A.1 Proofs of Results in the Main Text
Proof of Theorem 2.1.
Let P**_{Yγ|Y,X,Z} denote the set of all conditional distributions P_{Yγ|Y,X,Z} such that there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (y, x, z)-a.s.

To prove the result it suffices to show P*_{Yγ|Y,X,Z} = P**_{Yγ|Y,X,Z}. To do this, we will show that P*_{Yγ|Y,X,Z} ⊂ P**_{Yγ|Y,X,Z} and P**_{Yγ|Y,X,Z} ⊂ P*_{Yγ|Y,X,Z}. To this end, begin by fixing an arbitrary P_{Yγ|Y,X,Z} ∈ P*_{Yγ|Y,X,Z}. By Definition 2.2 we have:

P_{Yγ|Y,X,Z,θ}(Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (A.1)

(y, x, z, θ)-a.s. for some (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}. For this pair (P_{θ|Y,X,Z}, β) we have:

P_{Yγ|Y,X,Z,θ}(Yγ = 1 | Y = y, X = x, Z = z, θ) = P_{Yγ|Y,X,Z,θ}(Yγ = 1, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ), (y, x, z, θ)-a.s.,

which follows from (A.1). Now note:

P_{Yγ|Y,X,Z,θ}(Yγ = 1, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1{ϕ(γ(x, z), θ, β) ≥ 0}, (y, x, z, θ)-a.s.

Thus we have:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x, Z = z) = ∫ P_{Yγ|Y,X,Z,θ}(Yγ = 1 | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z}
= ∫ 1{ϕ(γ(x, z), θ, β) ≥ 0} dP_{θ|Y,X,Z}
= P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (y, x, z)-a.s.

In other words, for our P_{Yγ|Y,X,Z} ∈ P*_{Yγ|Y,X,Z} we have shown that there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (y, x, z)-a.s.

This proves P_{Yγ|Y,X,Z} ∈ P**_{Yγ|Y,X,Z}, and since P_{Yγ|Y,X,Z} ∈ P*_{Yγ|Y,X,Z} was arbitrary we conclude that P*_{Yγ|Y,X,Z} ⊂ P**_{Yγ|Y,X,Z}.

For the reverse inclusion, fix any arbitrary P_{Yγ|Y,X,Z} ∈ P**_{Yγ|Y,X,Z}. Then by definition there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (y, x, z)-a.s.

It suffices to show that for this pair (P_{θ|Y,X,Z}, β) there exists P_{Yγ|Y,X,Z,θ} satisfying:

P_{Yγ|Y,X,Z,θ}(Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (A.2)

(y, x, z, θ)-a.s. By the Radon-Nikodym Theorem, the existence of a (version of) P_{Yγ|Y,X,Z,θ} is guaranteed by the fact that P_{Yγ,θ|Y,X,Z} ≪ P_{θ|Y,X,Z}. Since all spaces involved are Euclidean, we can choose the version to be an almost surely unique regular conditional distribution (c.f. Durrett (2010), Theorem 5.1.9). By construction this P_{Yγ|Y,X,Z,θ} satisfies:

P_{Yγ,θ|Y,X,Z}(Yγ ∈ A, θ ∈ B | Y = y, X = x, Z = z) = ∫_B P_{Yγ|Y,X,Z,θ}(Yγ ∈ A | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z}, (y, x, z)-a.s.,

for every A ⊂ {0, 1} and B ∈ B(Θ). Now note that:

P_{Yγ|Y,X,Z,θ}(Yγ = 1, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1{ϕ(γ(x, z), θ, β) ≥ 0},
P_{Yγ|Y,X,Z,θ}(Yγ = 0, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1{ϕ(γ(x, z), θ, β) < 0},

(y, x, z)-a.s. Thus:

P_{Yγ|Y,X,Z}(Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z)
= ∫_Θ P_{Yγ|Y,X,Z,θ}(Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z}
= ∫_Θ P_{Yγ|Y,X,Z,θ}(Yγ = 1, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z} + ∫_Θ P_{Yγ|Y,X,Z,θ}(Yγ = 0, Yγ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) dP_{θ|Y,X,Z}
= ∫_Θ 1{ϕ(γ(x, z), θ, β) ≥ 0} dP_{θ|Y,X,Z} + ∫_Θ 1{ϕ(γ(x, z), θ, β) < 0} dP_{θ|Y,X,Z}
= P_{θ|Y,X,Z}(ϕ(γ(x, z), θ, β) ≥ 0 | Y = y, X = x, Z = z) + P_{θ|Y,X,Z}(ϕ(γ(x, z), θ, β) < 0 | Y = y, X = x, Z = z)
= 1, (y, x, z)-a.s.

This proves (A.2) and thus shows P_{Yγ|Y,X,Z} ∈ P*_{Yγ|Y,X,Z}. Since P_{Yγ|Y,X,Z} ∈ P**_{Yγ|Y,X,Z} was arbitrary we can conclude that P**_{Yγ|Y,X,Z} ⊂ P*_{Yγ|Y,X,Z}. Combining the two inclusions, we have P*_{Yγ|Y,X,Z} = P**_{Yγ|Y,X,Z}. This completes the proof. □
Proof of Theorem 3.1.
Let P_{Yγ|Y,X,Z} be a collection of conditional choice probabilities, and suppose there exists (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying (2.8). Note that (3.7) is equivalent to (2.8), so we can conclude that (P_{θ|Y,X,Z}, β) satisfies (3.7). Furthermore, by definition (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} implies that:

P_{θ|Y,X,Z}(θ ∈ G⁻(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, (y, x, z)-a.s.,

which delivers (3.5) and (3.6). Thus any (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} satisfying (2.8) also satisfies (3.5) - (3.7).

For the reverse, fix any β ∈ B and any collection P_{θ|Y,X,Z} of probability measures on the sets in A(β) satisfying (3.5) - (3.7). We will show that P_{θ|Y,X,Z} can be extended to a (not necessarily unique) probability measure P̃_{θ|Y,X,Z} on B(Θ) in a manner that ensures P̃_{θ|Y,X,Z} satisfies (2.8) and such that (P̃_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}. Furthermore, by the definition of an extension, P̃_{θ|Y,X,Z} will agree with P_{θ|Y,X,Z} on all sets in A(β). To construct the extension, note that the sets in A(β) form a disjoint partition of Θ. From each set Θ(β, s) in the collection A(β), select a single point θ(β, s) (if Θ(β, s) is empty, choose θ(β, s) as an arbitrary point from Θ). Furthermore, for any set A ⊂ Θ, define the indicator:

1(A, β, s) := 1{θ(β, s) ∈ A ∩ Θ(β, s)}.

Now define the function µ_{y,x,z} : B(Θ) → R as:

µ_{y,x,z}(B) := Σ_{s ∈ {0,1}^m} 1(B, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z).

To verify that this is a proper probability measure on B(Θ), we must show that (i) µ_{y,x,z}(B) ≥ µ_{y,x,z}(∅) = 0 for every B ∈ B(Θ), (ii) µ_{y,x,z}(Θ) = 1, and (iii) for any countable sequence of disjoint sets {A_i}_{i=1}^∞ in B(Θ), we have:

µ_{y,x,z}(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ µ_{y,x,z}(A_i).

The first property holds since 1(∅, β, s) = 0 for all s. To verify the second property, note that 1(Θ, β, s) = 1 for all s, so that:

µ_{y,x,z}(Θ) = Σ_{s ∈ {0,1}^m} 1(Θ, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= Σ_{s ∈ {0,1}^m} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= 1,

where the last line holds since P_{θ|Y,X,Z} is a probability measure on A(β). For the third property, note that for two disjoint Borel sets A₁, A₂ ∈ B(Θ) we have:

1(A₁ ∪ A₂, β, s) = 1(A₁, β, s) + 1(A₂, β, s).

Inducting on this formula, we conclude that for countable disjoint sets {A_i}_{i=1}^∞ in B(Θ), we have:

1(∪_{i=1}^∞ A_i, β, s) = Σ_{i=1}^∞ 1(A_i, β, s),

so that:

µ_{y,x,z}(∪_{i=1}^∞ A_i) = Σ_{s ∈ {0,1}^m} 1(∪_{i=1}^∞ A_i, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= Σ_{s ∈ {0,1}^m} Σ_{i=1}^∞ 1(A_i, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= Σ_{i=1}^∞ Σ_{s ∈ {0,1}^m} 1(A_i, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z)
= Σ_{i=1}^∞ µ_{y,x,z}(A_i).

Thus, our measure satisfies countable additivity. We conclude that µ_{y,x,z} is a proper probability measure. Note that the argument above has been completed for a single triple (y, x, z) indexing the conditioning variables. However, we can repeat the same argument as above for all (y, x, z) assigned positive probability, and thus can construct a corresponding probability measure µ_{y,x,z} satisfying all the conditions described above for each such (y, x, z).

Now we define P̃_{θ|Y,X,Z} : B(Θ) → [0, 1] by P̃_{θ|Y,X,Z}(B | Y = y, X = x, Z = z) = µ_{y,x,z}(B) for all B ∈ B(Θ) and all (y, x, z) assigned positive probability. By the above, P̃_{θ|Y,X,Z}(· | Y = y, X = x, Z = z) is a proper probability measure on B(Θ) for each (y, x, z). Also note that for any triple (1, x, z) assigned positive probability, the pair (P̃_{θ|Y,X,Z}, β) satisfies:

P̃_{θ|Y,X,Z}(G⁻(1, x, z, β) | Y = 1, X = x, Z = z)
= Σ_{s ∈ S_j} P̃_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x, Z = z)
= Σ_{s ∈ S_j} Σ_{s′ ∈ {0,1}^m} 1(Θ(β, s), β, s′) P_{θ|Y,X,Z}(Θ(β, s′) | Y = 1, X = x, Z = z)
= Σ_{s ∈ S_j} 1(Θ(β, s), β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x, Z = z)
= 1,

which follows from (B.8). Furthermore, for any triple (0, x, z) assigned positive probability, the pair (P̃_{θ|Y,X,Z}, β) also satisfies:

P̃_{θ|Y,X,Z}(G⁻(0, x, z, β) | Y = 0, X = x, Z = z)
= Σ_{s ∈ S_j^c} P̃_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x, Z = z)
= Σ_{s ∈ S_j^c} Σ_{s′ ∈ {0,1}^m} 1(Θ(β, s), β, s′) P_{θ|Y,X,Z}(Θ(β, s′) | Y = 0, X = x, Z = z)
= Σ_{s ∈ S_j^c} 1(Θ(β, s), β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x, Z = z)
= 1,

which follows from (B.9). Conclude that:

P̃_{θ|Y,X,Z}(θ ∈ G⁻(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, a.s.

This shows that (P̃_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}. Finally, setting C_j := {θ : ϕ(γ(x_j, z_j), θ, β) ≥ 0}, it is straightforward to show that:

P̃_{θ|Y,X,Z}(C_j | Y = y, X = x_j, Z = z_j) = Σ_{s ∈ {0,1}^m} 1(C_j, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j)
= Σ_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j)
= P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x_j, Z = z_j),

for all (y, x_j, z_j) assigned positive probability, which follows from (3.7). This is exactly condition (2.8). Conclude that (P̃_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} and that (P̃_{θ|Y,X,Z}, β) satisfies (2.8). This completes the proof. □
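The point-mass extension used in the proof above can be sketched numerically. The cells, representative points, and cell probabilities below are hypothetical stand-ins for Θ(β, s), θ(β, s), and P_{θ|Y,X,Z}(Θ(β, s) | ·); the sketch only illustrates that the resulting measure is a proper probability measure that agrees with the cell probabilities on unions of cells.

```python
# Hypothetical sketch of the extension in the proof of Theorem 3.1: given
# probabilities on the cells of a partition of [0, 1), build the measure mu
# that puts a point mass at one representative point per cell. All numbers
# are illustrative, not taken from the paper.
cells = {  # label s -> (interval standing in for Theta(beta, s), representative theta(beta, s))
    (0, 0): ((0.0, 0.25), 0.1),
    (0, 1): ((0.25, 0.5), 0.3),
    (1, 0): ((0.5, 0.75), 0.6),
    (1, 1): ((0.75, 1.0), 0.9),
}
cell_prob = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.1}

def mu(B):
    """Extended measure of a set B, passed as a membership function."""
    total = 0.0
    for s, ((lo, hi), theta) in cells.items():
        # indicator 1(B, beta, s) = 1{theta(beta, s) in B ∩ Theta(beta, s)}
        if B(theta) and lo <= theta < hi:
            total += cell_prob[s]
    return total

print(round(mu(lambda t: True), 10))              # mu of the whole space is 1
print(round(mu(lambda t: 0.25 <= t < 0.75), 10))  # union of two cells: 0.3 + 0.4
```

Because each cell contributes through a single point mass, countable additivity is immediate, and the extension agrees with the original cell probabilities on every union of cells; it is not unique, since any choice of representative points yields a valid extension.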
Proof of Theorem 3.2.
This result is an immediate consequence of Theorem 3.1. □
Proof of Proposition 3.1.
First note that β ∈ B enters the constraints in Theorem 3.2 only through the constraints (3.12); in particular, only through its determination of which sets Θ(β, s) are empty versus non-empty. Now define:

S(β) := {s ∈ {0,1}^m : Θ(β, s) ≠ ∅}.

Now define an equivalence relation ∼ on B as follows: β ∼ β′ if and only if S(β) = S(β′). This equivalence relation will partition B into at most 2^{2^m} equivalence classes (which is the total number of ways of choosing k vectors from {0,1}^m for k = 0, 1, ..., 2^m). Furthermore, any two values β and β′ belonging to the same equivalence class will deliver the same values for the linear programs (3.15) and (3.16) (by construction of the equivalence class). Thus, it is sufficient to consider only one β from each equivalence class in Theorem 3.2. However, there are at most 2^{2^m} such β's to consider. □

Proof of Corollary 3.1.
A counterfactual choice probability of the form in (3.9) can be written as:

P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x_j, Z = z_j) = Σ_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j).

Note that the result is trivial if we consider (y, x_j, z_j) assigned zero probability, since in that case Theorem 3.1 implies there are no constraints on the counterfactual choice probability above. Thus, assume that (y, x_j, z_j) is assigned positive probability. By assumption, γ(j) ≠ j. We now claim that (i) S_γ(j) ∩ S_j ≠ ∅, (ii) S_γ(j) ∩ S_j^c ≠ ∅, (iii) S_γ(j)^c ∩ S_j ≠ ∅, (iv) S_γ(j)^c ∩ S_j^c ≠ ∅. In particular, any s ∈ {0,1}^m with jth entry equal to 1 and γ(j)th entry equal to 1 belongs to S_γ(j) ∩ S_j. Denote such a vector by t1 ∈ {0,1}^m. Similarly, any s ∈ {0,1}^m with jth entry equal to 0 and γ(j)th entry equal to 1 belongs to S_γ(j) ∩ S_j^c. Denote such a vector by t2 ∈ {0,1}^m. Continuing in this way, let t3 ∈ S_γ(j)^c ∩ S_j and t4 ∈ S_γ(j)^c ∩ S_j^c. Now fix any ϕ and β such that all 2^m sets Θ(β, s) are non-empty (such a choice is always possible under Assumptions 2.1 and 2.2). For any κ ∈ [0, 1] consider the following conditional distribution on sets A ∈ A(β):

P_{θ|Y,X,Z}(A | Y = y, X = x_j, Z = z_j) =
  κ, if A = Θ(β, t1) and y = 1,
  κ, if A = Θ(β, t2) and y = 0,
  1 − κ, if A = Θ(β, t3) and y = 1,
  1 − κ, if A = Θ(β, t4) and y = 0,
  0, otherwise.

If y = 1 we have:

Σ_{s ∈ S_j} P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x_j, Z = z_j) = κ + (1 − κ) = 1,

and if y = 0 we have:

Σ_{s ∈ S_j^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x_j, Z = z_j) = κ + (1 − κ) = 1.

This shows that constraints (3.5) and (3.6) are satisfied. Finally, note that for either y = 0 or y = 1 we have:

Σ_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = κ.

Since this can be completed for any κ ∈ [0, 1], the counterfactual choice probability in (3.9) can take any value in [0, 1]. Moreover:

P_{Yγ|X,Z}(Yγ = 1 | X = x_j, Z = z_j) = Σ_{y ∈ {0,1}} P_{Yγ|Y,X,Z}(Yγ = 1 | Y = y, X = x_j, Z = z_j) P_{Y|X,Z}(Y = y | X = x_j, Z = z_j) = κ.

Again, since this can be completed for any κ ∈ [0, 1], the identified set for P_{Yγ|X,Z}(Yγ = 1 | X = x_j, Z = z_j) is the interval [0, 1]. □

Proof of Proposition 4.1.
This follows immediately from the results of Buck (1943). □

A.2 Measurability Results

Definition A.1 (Effros-Measurability, Random Set). Let (Ω, A, P) be a probability space, let V be a Polish space, and let O_V denote the collection of all open sets on V. A multifunction V : Ω → F_V is called Effros-measurable if for every A ∈ O_V we have V⁻(A) := {ω ∈ Ω : V(ω) ∩ A ≠ ∅} ∈ A.

Definition A.2 (Selections). A random element V : Ω → V is called a (measurable) selection of V if V(ω) ∈ V(ω) for P-almost all ω ∈ Ω.

Lemma A.1.
Suppose Assumption 2.1 holds. Then for each β ∈ B, the map G⁻(·, β) : Y × X × Z → Θ is an Effros-measurable multifunction, and thus is a random set on Y × X × Z.

Proof of Lemma A.1.
Fix any β ∈ B and any open set A ⊂ Θ. We have:

{(y, x, z) : G⁻(y, x, z, β) ∩ A ≠ ∅} = G_0(A) ∪ G_1(A),

where:

G_0(A) := {(0, x, z) : G⁻(0, x, z, β) ∩ A ≠ ∅},
G_1(A) := {(1, x, z) : G⁻(1, x, z, β) ∩ A ≠ ∅}.

Since B(Y) ⊗ B(X) ⊗ B(Z) is closed under unions, it suffices to show G_0(A), G_1(A) ∈ B(Y) ⊗ B(X) ⊗ B(Z). In particular, it suffices to show Effros-measurability of the maps:

G⁻(0, x, z, β) = {θ : ϕ(x, z, θ, β) < 0},
G⁻(1, x, z, β) = {θ : ϕ(x, z, θ, β) ≥ 0}.

Effros-measurability of G⁻(0, x, z, β) follows directly from Lemma 18.7 in Aliprantis and Border (2006) after noting that ϕ(·, β) is a Caratheodory function, and (−∞, 0) is an open set. Measurability of G⁻(1, x, z, β) follows from Lemma 18.4.1 in Aliprantis and Border (2006) if we can establish Effros-measurability of the multifunctions:

G⁻_1(1, x, z, β) := {θ : ϕ(x, z, θ, β) > 0},
G⁻_2(1, x, z, β) := {θ : ϕ(x, z, θ, β) = 0}.

Effros-measurability of G⁻_1(1, x, z, β) also follows directly from Lemma 18.7 in Aliprantis and Border (2006) after noting that ϕ is a Caratheodory function, and (0, +∞) is an open set. Effros-measurability of G⁻_2(1, x, z, β) follows from Corollary 18.8 in Aliprantis and Border (2006) after noting ϕ is a Caratheodory function, and Θ is compact under Assumption 2.1. This completes the proof. □

Given a σ-algebra F on a space R, the P-completion of F is the smallest σ-algebra containing F as well as all P-null sets of R. The intersection of all P-completions of F (over all P) is called the universal σ-algebra, and functions that are measurable with respect to the universal σ-algebra are said to be universally measurable. The following lemma shows that the random set G⁻(Y, X, Z, β) admits a universally measurable selection under Assumption 2.1.
Lemma A.2.
Suppose Assumption 2.1 holds. Then the random set G⁻(Y, X, Z, β) admits a universally measurable selection for every β ∈ B ensuring it is non-empty almost surely.

Proof of Lemma A.2. Fix some β ∈ B ensuring G⁻(Y, X, Z, β) is almost surely non-empty. By Lemma A.1, G⁻(Y, X, Z, β) is an Effros-measurable multifunction, and by Theorem 1.3.3 in Molchanov (2017) this implies that the graph of G⁻(Y, X, Z, β) belongs to B(Y) ⊗ B(X) ⊗ B(Z) ⊗ B(Θ); that is, G⁻(Y, X, Z, β) is graph-measurable. The result then follows immediately from Theorem 3 of Sainte-Beuve (1974). □
B Additional Definitions and Results
B.1 Identified Set of Conditional Latent Variable Distributions
For the sake of comparison with the previous literature, we now present a result which connects the observed conditional choice probabilities to our definition of the identified set based on the selection relation. The identified set for P_{θ|X,Z} is given by:

P*_{θ|X,Z} := { P_{θ|X,Z} : ∃ (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} s.t. P_{θ|X,Z} = ∫ P_{θ|Y,X,Z} dP_{Y|X,Z} a.s. }.

We now have the following result.
Theorem B.1.
Suppose Assumption 2.1 holds. Then a collection P θ | X,Z satisfies P θ | X,Z ∈ P ∗ θ | X,Z if andonly if P θ | X,Z satisfies: P θ | X,Z ( ϕ ( x, z, θ, β ) ≥ | X = x, Z = z ) = P Y | X,Z ( Y = 1 | X = x, Z = z ) , (B.1)( x, z ) − a.s. for some β ∈ B .Proof of Theorem B.1. Let us define: P ∗∗ θ | X,Z = (cid:8) P θ | X,Z : ∃ β ∈ B s.t. P θ | X,Z ( ϕ ( x, z, θ, β ) ≥ | X = x, Z = z ) = P Y | X,Z ( Y = 1 | X = x, Z = z ) , ( x, z ) − a.s. (cid:9) . We want to show that P ∗ θ | X,Z = P ∗∗ θ | X,Z , which will be accomplished by showing both P ∗ θ | X,Z ⊂ P ∗∗ θ | X,Z and P ∗∗ θ | X,Z ⊂ P ∗ θ | X,Z . To show P ∗ θ | X,Z ⊂ P ∗∗ θ | X,Z , fix any P θ | X,Z ∈ P ∗ θ | X,Z . Then by Definition 2.1 and thedefinition of P ∗ θ | X,Z above, there exists θ : Ω → Θ with θ ∼ P θ | Y,X,Z and an element β ∈ B such that: P θ | Y,X,Z ( θ ∈ G − ( Y, X, Z, β ) | Y = y, X = x, Z = z ) = 1 , ( y, x, z ) − a.s., and: P θ | X,Z ( θ ∈ A | X = x, Z = z ) = (cid:90) P θ | Y,X,Z ( θ ∈ A | Y = y, X = x, Z = z ) dP Y | X,Z , ( x, z ) − a.s., A ∈ B (Θ). Now define the sets: B ( x, z, β ) := { θ : ϕ ( x, z, θ, β ) ≥ } ,B ( x, z, β ) := { θ : ϕ ( x, z, θ, β ) < } . By continuity of ϕ ( x, z, · , β ), we have B ( x, z, β ) , B ( x, z, β ) ∈ B (Θ) for each ( x, z, β ). Now for our pair( θ, β ) we have: P θ | X,Z ( θ ∈ B ( x, z, β ) | X = x, Z = z )= (cid:88) y ∈{ , } P θ | Y,X,Z ( θ ∈ B ( x, z, β ) | Y = y, X = x, Z = z ) P Y | X,Z ( Y = y | X = x, Z = z )= (cid:88) y ∈{ , } P θ | Y,X,Z ( θ ∈ B ( x, z, β ) ∩ G − ( y, x, z, β ) | Y = y, X = x, Z = z ) P Y | X,Z ( Y = y | X = x, Z = z )(B.2)= P θ | Y,X,Z ( θ ∈ G − (1 , x, z, β ) | Y = 1 , X = x, Z = z ) P Y | X,Z ( Y = 1 | X = x, Z = z ) (B.3)= P Y | X,Z ( Y = 1 | X = x, Z = z ) , (B.4)( x, z ) − a.s. Note that (B.2) follows from the fact that P θ | Y,X,Z ( θ ∈ G − ( y, x, z, β ) | Y = y, X = x, Z = z ) = 1a.s. 
since P θ | Y,X,Z ∈ P ∗ θ | Y,X,Z by assumption; (B.3) follows from the fact that B ( x, z, β ) ∩ G − (0 , x, z, β ) = ∅ and B ( x, z, β ) = G − (1 , x, z, β ); (B.4) follows from the fact that P θ | Y,X,Z ( θ ∈ G − (1 , x, z, β ) | Y = 1 , X = x, Z = z ) = 1 a.s. since P θ | Y,X,Z ∈ P ∗ θ | Y,X,Z by assumption. Repeating an identical derivation shows that: P θ | X,Z ( θ ∈ B ( x, z, β ) | X = x, Z = z ) = P Y | X,Z ( Y = 0 | X = x, Z = z ) , ( x, z ) − a.s. Since P θ | X,Z ∈ P ∗ θ | X,Z was arbitrary, this proves that P ∗ θ | X,Z ⊂ P ∗∗ θ | X,Z .To show P ∗∗ θ | X,Z ⊂ P ∗ θ | X,Z , fix any P θ | X,Z ∈ P ∗∗ θ | X,Z . We want to show that P θ | X,Z ∈ P ∗ θ | X,Z . To do so, wemust show that: (i) there exists P θ | Y,X,Z such that: P θ | X,Z ( θ ∈ A | X = x, Z = z )= (cid:90) { , } P θ | Y,X,Z ( θ ∈ A | Y = y, X = x, Z = z ) dP Y | X,Z , ( x, z ) − a.s., for every A ∈ B (Θ), and (ii) there is a β ∈ B such that: P θ | Y,X,Z ( θ ∈ G − ( Y, X, Z, β ) | Y = y, X = x, Z = z ) = 1 , ( y, x, z ) − a.s., (B.5)for the same P θ | Y,X,Z from part (i). First note that, by the Radon-Nikodym Theorem, the existence ofa (version of) P θ | Y,X,Z is guaranteed by the fact that P θ,Y | X,Z (cid:28) P Y | X,Z . Since all spaces involved areeuclidean, we can choose the version to be an almost surely unique regular conditional distribution (c.f.Durrett (2010) Theorem 5.1.9). By construction P θ | Y,X,Z satisfies: P θ,Y | X,Z ( θ ∈ A, Y ∈ B | X = x, Z = z ) 60 (cid:88) y ∈ B P θ | Y,X,Z ( θ ∈ A | Y = y, X = x, Z = z ) P Y | X,Z ( Y = y | X = x, Z = z ) , for every A ∈ B (Θ) and B ⊂ { , } . This verifies part (i). It thus remains only to show that any such P θ | Y,X,Z must also satisfy (B.5). Since P θ | X,Z ∈ P ∗∗ θ | X,Z , there exists a value β ∈ B such that: P θ | X,Z ( ϕ ( x, z, θ, β ) ≥ | X = x, Z = z ) = P Y | X,Z ( Y = 1 | X = x, Z = z ) , ( x, z ) − a.s. 
For this value of β, note that:

P_{θ|X,Z}(ϕ(x, z, θ, β) ≥ 0 | X = x, Z = z)
= P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 1, X = x, Z = z) P(Y = 1 | X = x, Z = z)
+ P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 0, X = x, Z = z) P(Y = 0 | X = x, Z = z).

Furthermore, by assumption we have:

P_{θ|X,Z}(ϕ(x, z, θ, β) ≥ 0 | X = x, Z = z) = P(Y = 1 | X = x, Z = z), (x, z)−a.s.

Thus:

P(Y = 1 | X = x, Z = z)
= P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 1, X = x, Z = z) P(Y = 1 | X = x, Z = z)
+ P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 0, X = x, Z = z) P(Y = 0 | X = x, Z = z), (B.6)

(x, z)−a.s. Now note by (2.1):

P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 0, X = x, Z = z) P(Y = 0 | X = x, Z = z)
= P_{θ,Y|X,Z}(ϕ(x, z, θ, β) ≥ 0, Y = 0 | X = x, Z = z)
= P_{θ|X,Z}(ϕ(x, z, θ, β) ≥ 0, ϕ(x, z, θ, β) < 0 | X = x, Z = z)
= 0.

Conclude that (B.6) is true if and only if:

P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 1, X = x, Z = z) = 1, (x, z)−a.s.

Similar logic shows:

P_{θ|Y,X,Z}(ϕ(x, z, θ, β) < 0 | Y = 0, X = x, Z = z) = 1, (x, z)−a.s.

Finally, note that by the definition of G⁻¹(·, β) : Y × X × Z →
Θ we have:

P_{θ|Y,X,Z}(ϕ(x, z, θ, β) ≥ 0 | Y = 1, X = x, Z = z) = P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = 1, X = x, Z = z),
P_{θ|Y,X,Z}(ϕ(x, z, θ, β) < 0 | Y = 0, X = x, Z = z) = P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = 0, X = x, Z = z).

Thus we conclude that P_{θ|Y,X,Z} satisfies (B.5). Since P_{θ|X,Z} ∈ P**_{θ|X,Z} was arbitrary, we conclude P**_{θ|X,Z} ⊂ P*_{θ|X,Z}. Combining everything, we conclude P**_{θ|X,Z} = P*_{θ|X,Z}. This completes the proof. □
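As a concrete illustration of the rationalization condition (B.1), the sketch below checks it numerically for a toy discrete model. The index function `phi`, the two latent support points, the candidate weights, and the value β = 0.5 are all hypothetical choices made for this example, not objects taken from the paper.

```python
# A minimal sketch of checking condition (B.1) on a finite latent support,
# assuming a hypothetical index phi(x, z, theta, beta) = theta[0] + beta*x - theta[1]*z.

def phi(x, z, theta, beta):
    # Hypothetical index function; any measurable phi works in Theorem B.1.
    return theta[0] + beta * x - theta[1] * z

def rationalizes(p_theta_given_xz, p_y1_given_xz, beta, support, tol=1e-9):
    """Check P_{theta|X,Z}(phi(x, z, theta, beta) >= 0 | x, z) = P(Y=1 | x, z)
    at every observed (x, z) cell."""
    for (x, z), weights in p_theta_given_xz.items():
        mass = sum(w for th, w in zip(support, weights)
                   if phi(x, z, th, beta) >= 0)
        if abs(mass - p_y1_given_xz[(x, z)]) > tol:
            return False
    return True

# Two latent types; at (x, z) = (1, 1) with beta = 0.5:
#   type (0.2, 0.4):  0.2 + 0.5 - 0.4 =  0.3 >= 0  (chooses Y = 1)
#   type (-1.0, 0.3): -1.0 + 0.5 - 0.3 = -0.8 <  0  (chooses Y = 0)
support = [(0.2, 0.4), (-1.0, 0.3)]
p_theta = {(1, 1): [0.7, 0.3]}   # candidate conditional weights
p_y1 = {(1, 1): 0.7}             # observed choice probability
print(rationalizes(p_theta, p_y1, beta=0.5, support=support))  # True
```

Any candidate distribution putting a different mass on the first type (say 0.5 instead of 0.7) fails the check, mirroring how (B.1) screens collections P_{θ|X,Z} out of the identified set.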
Theorem B.1 says that in order to verify whether a given collection of distributions P_{θ|X,Z} belongs to the identified set P*_{θ|X,Z}, it suffices to find some value of the fixed coefficient β ∈ B such that P_{θ|X,Z} rationalizes the observed conditional choice probabilities via (B.1). Note that Assumption 2.1 does not impose any assumptions on the dependence between the variables X and Z and the latent variables θ; in other words, this result holds whether X and Z are endogenous, exogenous, or any combination of the two. As discussed in Chesher and Rosen (2014), a binary response model with endogenous regressors is incomplete when the mechanism generating the endogenous regressors is left unspecified, as in our environment. In the presence of incompleteness there is no longer a unique distribution of the endogenous outcome variables given fixed primitives of the model. Chesher and Rosen (2014) propose the use of Artstein's inequalities from random set theory to characterize the distributions of selections from the incomplete binary response model in (2.1), and Theorem 3.1 in Chesher and Rosen (2014) provides a general characterization of the identified set of latent variable distributions in the case of a linear index function. The key difference between Theorem B.1 above and Theorem 3.1 in Chesher and Rosen (2014) is the fact that we condition on the value of the (possibly endogenous) variables X and Z. Conditioning on the value of the endogenous variables allows us to construct a simpler set of constraints than those imposed by Artstein's inequalities, as demonstrated in Appendix C. Intuitively, conditioning on a fixed value of any endogenous regressors resolves the issue of model incompleteness. This strategy is not applicable in all environments where the model is incomplete, but appears to be applicable whenever the endogenous regressors in the model are observable.
The identified set for the unconditional latent variable distribution (as was considered in Chesher and Rosen (2014)) can then be recovered from P*_{θ|X,Z}.

B.2 Functional Form Assumptions
Under Assumption 4.1, we will have the following definition of the identified set, which is analogous to both Definitions 2.1 and 2.2.
Definition B.1.
Under Assumptions 2.1 and 4.1, the identified set I*_{Y,X,Z} is the set of all pairs (P_{θ|Y,X,Z}, β) such that:

P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, and
P_{θ|Y,X,Z}(ϕ(X, Z, θ, β) = 0 | Y = y, X = x, Z = z) = 0,

(y, x, z)−a.s., where the function ϕ(·, β) : X × Z × Θ → R is linear in θ for every β ∈ B. Furthermore, under Assumptions 2.1, 2.2, and 4.1, the identified set of counterfactual conditional choice probabilities P*_{Y_γ|Y,X,Z,θ} is the set of all conditional distributions P_{Y_γ|Y,X,Z,θ} satisfying:

P_{Y_γ|Y,X,Z,θ}(Y_γ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (y, x, z, θ)−a.s.,

for some pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}.

There are competing definitions of incompleteness in the literature, although the definition of an incomplete model used in Chesher and Rosen (2014) is equivalent to the definition in Tamer (2003) and Lewbel (2007). The definition of an incomplete model discussed here is consistent with these papers.

Note that we have kept the notation for the identified set the same as in Definitions 2.1 and 2.2 (e.g. I*_{Y,X,Z}, P*_{Y_γ|Y,X,Z,θ}), although these identified sets will be different depending on whether Assumption 4.1 holds. We will continue using the same notation for the identified set in further subsections of this Appendix as we introduce even more assumptions, but will always distinguish the definitions by stating the assumptions that hold in each context. Here we do not consider the case when Assumptions 4.2 and 4.3 hold, but we again note that this definition (and the results to follow) are easily modified to accommodate the case when any combination of these assumptions hold. We now provide the following Corollary whose proof follows almost identically to that of Theorems 2.1 and 3.1.
Corollary B.1.
Under Assumptions 2.1, 2.2, and 4.1, a distribution of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z} satisfies P_{Y_γ|Y,X,Z} ∈ P*_{Y_γ|Y,X,Z} if and only if there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.1) satisfying:

P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (B.7)

(y, x, z)−a.s. Furthermore, for any collection of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z}, there exists a collection of Borel conditional probability measures P_{θ|Y,X,Z} satisfying (B.7) with (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.1) if and only if there exists a collection P_{θ|Y,X,Z} of probability measures on the sets in A(β) from (3.4) satisfying:

∑_{s ∈ S_j} P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x_j, Z = z_j) = 1, (B.8)
∑_{s ∈ S_j^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x_j, Z = z_j) = 1, (B.9)
∑_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j),
∑_{s ∈ S_ϕ^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = 0, (B.10)

for y ∈ {0, 1} and j ∈ {1, . . . , m} assigned positive probability, where S_ϕ denotes the collection of binary vectors s ∈ {0, 1}^m corresponding to the sets Θ(β, s) that have non-empty interior.

Proof of Corollary B.1. The first statement follows a proof identical to the proof of Theorem 2.1. For the second statement, the forward direction is identical to the proof of Theorem 3.1. The reverse direction is similar to the proof of Theorem 3.1, with the exception that the extension from a measure on A(β) to B(Θ) is slightly different. To construct the extension, note that the sets in A(β) form a disjoint partition of Θ. Now select a single point θ(β, s) from the interior of each set Θ(β, s) in the collection A(β); if Θ(β, s) has empty interior, choose θ(β, s) as an arbitrary point from Θ.
For any set A ⊂ Θ, define the indicator:

1(A, β, s) := 1{θ(β, s) ∈ A ∩ int(Θ(β, s))}.

Furthermore, define the function µ_{y,x,z} : B(Θ) → R as:

µ_{y,x,z}(B) := ∑_{s ∈ {0,1}^m} 1(B, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z).

The remainder of the proof of Theorem 3.1 now applies without modification. □
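Because the system in Corollary B.1 is linear in the unknown probabilities ν(y, x_j, z_j, β, s), bounds on any linear counterfactual functional can be computed by linear programming. The following sketch solves a deliberately tiny hypothetical instance (one observed cell, four sets Θ(β, s)) by enumerating basic feasible solutions instead of calling an LP solver; in applied work one would pass the same constraint matrices to a standard LP package. All numbers below are illustrative assumptions.

```python
from itertools import combinations

def solve_square(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination;
    return None if A is (numerically) singular."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            return None
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lp_bounds(c, A_eq, b_eq):
    """Min and max of c'nu subject to A_eq nu = b_eq and nu >= 0, found by
    enumerating basic feasible solutions (a bounded LP attains its optimum
    at one of them). Only sensible for very small instances."""
    m, n = len(A_eq), len(A_eq[0])
    vals = []
    for basis in combinations(range(n), m):
        sub = [[A_eq[i][j] for j in basis] for i in range(m)]
        x_b = solve_square(sub, b_eq)
        if x_b is None or any(v < -1e-9 for v in x_b):
            continue
        nu = [0.0] * n
        for j, v in zip(basis, x_b):
            nu[j] = max(v, 0.0)
        vals.append(sum(ci * vi for ci, vi in zip(c, nu)))
    return min(vals), max(vals)

# Hypothetical instance: four sets Theta(beta, s); a (B.8)-type constraint
# pins down nu_1 + nu_2 = P(Y=1|x,z) = 0.6, total mass nu_1 + ... + nu_4 = 1,
# and the counterfactual index set S_gamma(j) is {1, 3}.
A_eq = [[1, 1, 0, 0], [1, 1, 1, 1]]
b_eq = [0.6, 1.0]
print(lp_bounds([1, 0, 1, 0], A_eq, b_eq))  # (0.0, 1.0): the sharp bounds
```

Without further restrictions the toy counterfactual probability is unbounded within [0, 1]; adding rows for constraints like (B.10), (4.10), or (4.11) to `A_eq` is exactly how the later corollaries tighten these bounds.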
Analogous to Theorem 2.1, the first part of Corollary B.1 provides the theoretical link between the identified set for counterfactual choice probabilities and the identified set for the pair (P_{θ|Y,X,Z}, β) under the additional assumption of linearity in latent variables. Analogous to Theorem 3.1, the second part of Corollary B.1 reduces an infinite-dimensional existence problem to a finite-dimensional existence problem amenable to analysis using optimization problems. Building on the intuition provided in Example 1, the second part of Corollary B.1 demonstrates that Assumption 4.1 can be imposed by considering only a finite number of equality constraints on a distribution P_{θ|Y,X,Z} defined on sets of the form Θ(β, s). By definition of the set S_ϕ, condition (B.10) simply assigns probability zero to all sets Θ(β, s) that are empty due to the linearity restriction from Assumption 4.1. We also have the following result.

Corollary B.2.
Under Assumptions 2.1, 2.2, and 4.1, the identified set for the counterfactual conditional choice probability P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j) is given by:

⋃_{β ∈ B} [ν^{ℓb}(y, x_j, z_j, β), ν^{ub}(y, x_j, z_j, β)],

where ν^{ℓb}(y, x_j, z_j, β) and ν^{ub}(y, x_j, z_j, β) are determined by the optimization problems:

ν^{ℓb}(y, x_j, z_j, β) := min_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.1), (B.11)
ν^{ub}(y, x_j, z_j, β) := max_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.1). (B.12)

Note that this Corollary is identical to Theorem 3.2 with the exception that we have now imposed Assumption 4.1, and thus have also included constraints of the form (4.1). With the exception of these additional constraints, the optimization problems that characterize the bounding problem are the same as before. This result does not consider the case when Assumptions 4.2 and 4.3 also hold, but as we remarked above it is easily modified to accommodate any combination of Assumptions 4.2 and 4.3 by including certain additional constraints (seen in the next subsections of this appendix). Similar to the comment following Theorem 3.2, alternative counterfactual quantities can also be bounded using this result by simply modifying the objective function in (B.11) and (B.12). Finally, we present the following corollary to Proposition 3.1.

Corollary B.3.
Suppose that Assumptions 2.1, 2.2 and 4.1 hold. Then there exists a (not necessarily unique) finite subset B′ ⊂ B such that:

{ν ∈ R^{d_ν} : ∃ β ∈ B s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.1) and ν = ν(β)}
= {ν ∈ R^{d_ν} : ∃ β ∈ B′ s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.1), and ν = ν(β)}.
The proof is identical to the proof of Proposition 3.1 after redefining S(β) from the proof of Proposition 3.1 to be S(β) := {s ∈ {0, 1}^m : int(Θ(β, s)) ≠ ∅}. □

B.3 Independence Assumptions
Under Assumption 4.2, we will have the following definition of the identified set, which is analogous to both Definitions 2.1 and 2.2.
Definition B.2.
Under Assumptions 2.1 and 4.2, the identified set I*_{Y,X,Z} is the set of all pairs (P_{θ|Y,X,Z}, β) such that:

(i) (P_{θ|Y,X,Z}, β) satisfies:

P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, (y, x, z)−a.s.; and

(ii) for all Borel sets A ∈ B(Θ) we have P_{θ|Z}(A | Z = z) = P_θ(A), z−a.s.

Furthermore, under Assumptions 2.1, 2.2 and 4.2, the identified set of counterfactual conditional choice probabilities P*_{Y_γ|Y,X,Z,θ} is the set of all conditional distributions P_{Y_γ|Y,X,Z,θ} satisfying:

P_{Y_γ|Y,X,Z,θ}(Y_γ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (y, x, z, θ)−a.s.,

for some pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}.

Here we do not consider the case when Assumptions 4.1 and 4.3 hold, but we again note that this definition (and the results to follow) are easily modified to accommodate the case when any combination of these assumptions hold. We now provide the following Corollary whose proof follows almost identically to that of Theorems 2.1 and 3.1, with the exception being that now we require condition (ii) of Definition B.2 to also hold.

Corollary B.4.
Under Assumptions 2.1, 2.2 and 4.2, a distribution of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z} satisfies P_{Y_γ|Y,X,Z} ∈ P*_{Y_γ|Y,X,Z} if and only if there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.2) satisfying:

P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (B.13)

(y, x, z)−a.s. Furthermore, for any collection of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z}, there exists a collection of Borel conditional probability measures P_{θ|Y,X,Z} satisfying (B.13) with (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.2) if and only if there exists a collection P_{θ|Y,X,Z} of probability measures on the sets in A(β) from (3.4) satisfying:

∑_{s ∈ S_j} P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x_j, Z = z_j) = 1,
∑_{s ∈ S_j^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x_j, Z = z_j) = 1,
∑_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j),

for y ∈ {0, 1} and j ∈ {1, . . . , m} assigned positive probability, and:

∑_y ∑_x P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_k) P(Y = y, X = x | Z = z_k)
= ∑_y ∑_x P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_{k+1}) P(Y = y, X = x | Z = z_{k+1}), (B.14)

for all s ∈ {0, 1}^m and all k = 1, . . . , m_z − 1 assigned positive probability.

Proof of Corollary B.4. The first statement follows a proof identical to the proof of Theorem 2.1. For the second statement, the forward direction is identical to the proof of Theorem 3.1. The reverse direction is similar to the proof of Theorem 3.1, with the exception that we must show that the extended measure on B(Θ) satisfies independence if the initial measure on A(β) satisfies independence. Let ˜P_{θ|Y,X,Z} be the extension of P_{θ|Y,X,Z} from the proof of Theorem 3.1.
Then for any A ∈ B(Θ):

˜P_{θ|Z}(A | Z = z_k)
= ∑_{y ∈ {0,1}} ∑_{x ∈ X} ∑_{s ∈ {0,1}^m} 1(A, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_k) P_{Y,X|Z}(Y = y, X = x | Z = z_k)
= ∑_{s ∈ {0,1}^m} 1(A, β, s) ∑_{y ∈ {0,1}} ∑_{x ∈ X} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_k) P_{Y,X|Z}(Y = y, X = x | Z = z_k)
= ∑_{s ∈ {0,1}^m} 1(A, β, s) ∑_{y ∈ {0,1}} ∑_{x ∈ X} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_{k+1}) P_{Y,X|Z}(Y = y, X = x | Z = z_{k+1})
= ∑_{y ∈ {0,1}} ∑_{x ∈ X} ∑_{s ∈ {0,1}^m} 1(A, β, s) P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z_{k+1}) P_{Y,X|Z}(Y = y, X = x | Z = z_{k+1})
= ˜P_{θ|Z}(A | Z = z_{k+1}),

for all z_k and z_{k+1} assigned positive probability, where the third equality follows from (B.14). Conclude that ˜P_{θ|Z} satisfies the second condition in Definition B.2. □

Analogous to Theorem 2.1, the first part of Corollary B.4 provides the theoretical link between the identified set for counterfactual choice probabilities and the identified set for the pair (P_{θ|Y,X,Z}, β) under the additional independence assumption between θ and Z. Furthermore, analogous to the result in Theorem 3.1, the second part of Corollary B.4 reduces an infinite-dimensional existence problem to a finite-dimensional existence problem. Importantly, the second part of Corollary B.4 builds on Theorem 3.1 by demonstrating that Assumption 4.2 (which requires P_{θ|Z}(A | Z = z) = P_θ(A) a.s. for all Borel sets A) can be imposed by considering only a finite number of equality constraints on a distribution P_{θ|Y,X,Z} defined on sets of the form Θ(β, s). We also have the following Corollary to Theorem 3.2:
Corollary B.5.
Under Assumptions 2.1, 2.2, and 4.2, the identified set for the counterfactual conditional choice probability P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j) is given by:

⋃_{β ∈ B} [ν^{ℓb}(y, x_j, z_j, β), ν^{ub}(y, x_j, z_j, β)],

where ν^{ℓb}(y, x_j, z_j, β) and ν^{ub}(y, x_j, z_j, β) are determined by the optimization problems:

ν^{ℓb}(y, x_j, z_j, β) := min_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.10), (B.15)
ν^{ub}(y, x_j, z_j, β) := max_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.10). (B.16)

Note that this Corollary is identical to Theorem 3.2 with the exception that we have now imposed Assumption 4.2, and thus have also included constraints of the form (4.10). With the exception of these additional constraints, the optimization problems that characterize the bounding problem are the same as before. Again, this result can be easily modified to bound any linear function of counterfactual choice probabilities by simply modifying the objective function in the optimization problems (B.15) and (B.16). Finally, we present the following corollary to Proposition 3.1.

Corollary B.6.
Suppose that Assumptions 2.1, 2.2 and 4.2 hold. Then there exists a (not necessarily unique) finite subset B′ ⊂ B such that:

{ν ∈ R^{d_ν} : ∃ β ∈ B s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.10) and ν = ν(β)}
= {ν ∈ R^{d_ν} : ∃ β ∈ B′ s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.10), and ν = ν(β)}.

The additional independence constraints in (4.10) do not affect the proof of Proposition 3.1 in any way, and so the proof of this result is identical to the proof of Proposition 3.1.

B.4 Monotonicity Assumptions
When we entertain Assumption 4.3, we will have the following definition of the identified set, which is analogous to both Definitions 2.1 and 2.2.
Definition B.3.
Under Assumptions 2.1 and 4.3, the identified set I*_{Y,X,Z} is the set of all pairs (P_{θ|Y,X,Z}, β) such that:

(i) (P_{θ|Y,X,Z}, β) satisfies:

P_{θ|Y,X,Z}(θ ∈ G⁻¹(Y, X, Z, β) | Y = y, X = x, Z = z) = 1, (y, x, z)−a.s.; and

(ii) for all (j, k) ∈ M from Assumption 4.3, we have:

P_{θ|Y,X,Z}(ϕ(x_j, z_j, β, θ) ≤ ϕ(x_k, z_k, β, θ) | Y = y, X = x, Z = z) = 1 a.s.

Furthermore, under Assumptions 2.1, 2.2, and 4.3, the identified set of counterfactual conditional choice probabilities P*_{Y_γ|Y,X,Z,θ} is the set of all conditional distributions P_{Y_γ|Y,X,Z,θ} satisfying:

P_{Y_γ|Y,X,Z,θ}(Y_γ = 1{ϕ(γ(X, Z), θ, β) ≥ 0} | Y = y, X = x, Z = z, θ) = 1, (y, x, z, θ)−a.s.,

for some pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z}.

Again this definition and the results to follow are easily modified to accommodate the case when any combination of Assumptions 4.1 and 4.2 also hold. We now provide the following Corollary whose proof follows almost identically to that of Theorems 2.1 and 3.1, with the exception being that now we require condition (ii) of Definition B.3 to hold.
Corollary B.7.
Under Assumptions 2.1, 2.2, and 4.3, a distribution of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z} satisfies P_{Y_γ|Y,X,Z} ∈ P*_{Y_γ|Y,X,Z} if and only if there exists a pair (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.3) satisfying:

P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x, Z = z) = P_{θ|Y,X,Z}(ϕ(γ(X, Z), θ, β) ≥ 0 | Y = y, X = x, Z = z), (B.17)

(y, x, z)−a.s. Furthermore, for any collection of counterfactual conditional choice probabilities P_{Y_γ|Y,X,Z}, there exists a collection of Borel conditional probability measures P_{θ|Y,X,Z} satisfying (B.17) with (P_{θ|Y,X,Z}, β) ∈ I*_{Y,X,Z} (for I*_{Y,X,Z} from Definition B.3) if and only if there exists a collection P_{θ|Y,X,Z} of probability measures on the sets in A(β) from (3.4) satisfying:

∑_{s ∈ S_j} P_{θ|Y,X,Z}(Θ(β, s) | Y = 1, X = x_j, Z = z_j) = 1,
∑_{s ∈ S_j^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = 0, X = x_j, Z = z_j) = 1,
∑_{s ∈ S_γ(j)} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x_j, Z = z_j) = P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j),

for y ∈ {0, 1} and j ∈ {1, . . . , m} assigned positive probability, and:

∑_{s ∈ S_M^c} P_{θ|Y,X,Z}(Θ(β, s) | Y = y, X = x, Z = z) = 0, a.s., (B.18)

for all (y, x, z) assigned positive probability, where S_M is as defined in Section 4.

The proof of this corollary is identical to the proof of Theorem 2.1 and Theorem 3.1. Analogous to Theorem 2.1, the first part of Corollary B.7 provides the theoretical link between the identified set for counterfactual choice probabilities and the identified set for the pair (P_{θ|Y,X,Z}, β) under the additional monotonicity assumption. Analogous to Theorem 3.1, the second part of Corollary B.7 reduces an infinite-dimensional existence problem to a finite-dimensional existence problem amenable to analysis using optimization problems.
Building on the intuition provided in Example 3, the second part of Corollary B.7 demonstrates that monotonicity as in Assumption 4.3 can be imposed by considering only a finite number of equality constraints on a distribution P_{θ|Y,X,Z} defined on sets of the form Θ(β, s). By definition of the set S_M, condition (B.18) simply assigns probability zero to all sets Θ(β, s) that do not satisfy the monotonicity relation from Assumption 4.3. This leads to the following result.

Corollary B.8.
Under Assumptions 2.1, 2.2, and 4.3, the identified set for the counterfactual conditional choice probability P_{Y_γ|Y,X,Z}(Y_γ = 1 | Y = y, X = x_j, Z = z_j) is given by:

⋃_{β ∈ B} [ν^{ℓb}(y, x_j, z_j, β), ν^{ub}(y, x_j, z_j, β)],

where ν^{ℓb}(y, x_j, z_j, β) and ν^{ub}(y, x_j, z_j, β) are determined by the optimization problems:

ν^{ℓb}(y, x_j, z_j, β) := min_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.11), (B.19)
ν^{ub}(y, x_j, z_j, β) := max_{ν(β) ∈ R^{d_ν}} ∑_{s ∈ S_γ(j)} ν(y, x_j, z_j, β, s), s.t. (3.11), (3.12), (3.13), and (4.11). (B.20)

Note that this Corollary is identical to Theorem 3.2 with the exception that we have now imposed Assumption 4.3, and thus have also included constraints of the form (4.11). With the exception of these additional constraints, the optimization problems that characterize the bounding problem are the same as before. Moreover, alternative counterfactual quantities can be bounded in the same way by simply modifying the objective function in (B.19) and (B.20). Finally, we present the following corollary to Proposition 3.1.

Corollary B.9.
Suppose that Assumptions 2.1, 2.2 and 4.3 hold. Then there exists a (not necessarily unique) finite subset B′ ⊂ B such that:

{ν ∈ R^{d_ν} : ∃ β ∈ B s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.11) and ν = ν(β)}
= {ν ∈ R^{d_ν} : ∃ β ∈ B′ s.t. ν(β) satisfies (3.11), (3.12), (3.13), (4.11), and ν = ν(β)}.

The additional monotonicity constraints in (4.11) are easily accommodated in the proof of Proposition 3.1, and so the proof of this result follows the same argument as the proof of Proposition 3.1.
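The combinatorial content of constraint (B.18) can be sketched directly: under the convention that s_j = 1 on the set where ϕ(x_j, z_j, θ, β) ≥ 0, a monotonicity pair (j, k) ∈ M rules out every sign vector with s_j = 1 but s_k = 0, and the corresponding sets Θ(β, s) receive probability zero. The dimension m and the pairs used below are hypothetical choices for illustration.

```python
from itertools import product

def monotone_sign_vectors(m, pairs):
    """Enumerate S_M: all sign vectors s in {0,1}^m consistent with the
    monotonicity relation phi_j <= phi_k for every pair (j, k), i.e.
    vectors with s[j] = 1 and s[k] = 0 are excluded."""
    keep = []
    for s in product((0, 1), repeat=m):
        if all(not (s[j] == 1 and s[k] == 0) for j, k in pairs):
            keep.append(s)
    return keep

# Hypothetical example: three cells with phi_0 <= phi_1 <= phi_2, so only
# the "monotone chains" 000, 001, 011, 111 survive.
S_M = monotone_sign_vectors(m=3, pairs=[(0, 1), (1, 2)])
print(len(S_M))  # 4
```

In an implementation of (B.19)-(B.20), the complement of this enumeration supplies the indices whose ν-variables are constrained to zero.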
B.5 Consistency
In this subsection we will present a basic consistency result for functionals of a partially identified parameter. The result is designed to minimize the number of high-level assumptions required for consistency, and is closely related to results found in Molchanov (1998), Manski and Tamer (2002), and Chernozhukov et al. (2007). It is also presented in a form that is more general than necessary for the current paper, and so it may be of interest in other applications.

We consider an environment where the researcher wishes to compute bounds on a functional E_P[ψ(W_i, τ_1, τ_2)], where ψ : W × T → R, where W ⊂ R^{d_w} denotes the support of the observed random vector W, and T = T_1 × T_2 ⊂ R^{d_τ} denotes the parameter space with typical elements τ = (τ_1, τ_2) ∈ T. The values of (τ_1, τ_2) are constrained by J moment inequalities of the form:

E_P[m_j(W_i, τ_1, τ_2)] ≤ 0, for j = 1, . . . , J.

Note this does not rule out moment equalities, since each moment equality can be equivalently written as a combination of two moment inequalities. In this environment, the identified set for (τ_1, τ_2) ∈ T at the true P is given by:

T*(P) := {(τ_1, τ_2) ∈ T : E_P[m_j(W_i, τ_1, τ_2)] ≤ 0, j = 1, . . . , J}.

In addition, the identified set for ψ_0 := E_P[ψ(W_i, τ_1, τ_2)] is given by:

Ψ*(P) := {ψ_0 ∈ R : ∃ (τ_1, τ_2) ∈ T*(P) s.t. ψ_0 = E_P[ψ(W_i, τ_1, τ_2)]}.

Let us define the projection:

T*(τ_1, P) := {τ_2 ∈ T_2 : E_P[m_j(W_i, τ_1, τ_2)] ≤ 0, j = 1, . . . , J}.

It is then straightforward to show that Ψ*(P) can be rewritten as:

Ψ*(P) = ⋃_{τ_1 ∈ T_1} [Ψ^{ℓb}(τ_1, P), Ψ^{ub}(τ_1, P)],

where:

Ψ^{ℓb}(τ_1, P) := min_{τ_2 ∈ T*(τ_1, P)} E_P[ψ(W_i, τ_1, τ_2)], Ψ^{ub}(τ_1, P) := max_{τ_2 ∈ T*(τ_1, P)} E_P[ψ(W_i, τ_1, τ_2)].

We will study the consistency properties of the sample analog estimator for this representation of Ψ*(P). In
In70articular, define: E n [ ψ ( W i , τ , τ )] := 1 n n (cid:88) i =1 ψ ( W i , τ , τ ) , E n [ m j ( W i , τ , τ )] := 1 n n (cid:88) i =1 m j ( W i , τ , τ ) , for j = 1 , . . . , J. Then the sample analog estimator of interest is given by:Ψ ∗ ( P n ) = (cid:91) τ ∈T [Ψ (cid:96)b ( τ , P n ) , Ψ ub ( τ , P n )] , where: Ψ (cid:96)b ( τ , P n ) := min τ ∈T ∗ ( τ , P n ) E n [ ψ ( W i , τ , τ )] , Ψ ub ( τ , P n ) := max τ ∈T ∗ ( τ , P n ) E n [ ψ ( W i , τ , τ )] , and: T ∗ ( τ , P n ) := { τ ∈ T : E n [ m j ( W i , τ , τ )] ≤ j = 1 , . . . , J } . In the following, we will define the sequence { η n ( τ ) } ∞ n =1 as: η n ( τ ) := max (cid:26) max j =1 ,...,J. sup τ ∈T | E n [ m j ( W i , τ , τ )] − E P [ m j ( W i , τ , τ )] | , sup τ ∈T | E n [ ψ ( W i , τ , τ )] − E P [ ψ ( W i , τ , τ )] | (cid:27) . We now impose the following assumption.
Assumption B.1. (i) The parameter space T = T_1 × T_2 ⊂ R^{d_τ}, where T_2 is compact; (ii) for each τ_1 ∈ T_1, the function ψ(·, τ_1, ·) : W × T_2 → R is measurable in W_i ∈ W ⊂ R^{d_w} and is Lipschitz continuous in τ_2 with a (possibly data-dependent) Lipschitz constant C(τ_1) with sup_{τ_1 ∈ T_1} C(τ_1) < ∞ a.s.; (iii) for j = 1, . . . , J, and for each τ_1 ∈ T_1, the moment function m_j(·, τ_1, ·) : W × T_2 → R is measurable in W_i and lower semicontinuous in τ_2; (iv) the true data generating process is indexed by a triple (τ_1, τ_2, P) that satisfies (τ_1, τ_2) ∈ T and E_P[m_j(W_i, τ_1, τ_2)] ≤ 0 for j = 1, . . . , J; (v) the sample {W_i}_{i=1}^n is an independent and identically distributed draw from P; (vi) for each fixed τ_1 ∈ T_1, we have η_n(τ_1) = O_P(a_n^{-1}) for some sequence a_n ↑ ∞; (vii) for each fixed τ_1 ∈ T_1, there exists a sequence b_n ↓ 0 satisfying b_n ≥ η_n(τ_1) with probability approaching 1 (w.p.a. 1); (viii) there exists a finite subset T_1′ ⊂ T_1 such that:

{τ_2 ∈ T_2 : ∃ τ_1 ∈ T_1 s.t. E_P[m_j(W_i, τ_1, τ_2)] ≤ 0 for j = 1, . . . , J}
= {τ_2 ∈ T_2 : ∃ τ_1 ∈ T_1′ s.t. E_P[m_j(W_i, τ_1, τ_2)] ≤ 0 for j = 1, . . . , J}.

Part (i) of Assumption B.1 is standard in the literature on extremum estimators. Part (ii) separates the roles of τ_1 and τ_2, and restricts the objective function to be Lipschitz continuous in the parameter τ_2 for each τ_1. Part (ii) places no restrictions on how τ_1 enters the objective function. Part (iii) further separates the roles of τ_1 and τ_2 by requiring each of the moment functions to be lower semicontinuous in τ_2. Similar to part (ii), no restrictions are placed on how τ_1 enters the moment functions. Assumption (iv) is standard, and simply indicates that the true parameters satisfy the moment inequalities at the true P. Part (v) is also standard, although it rules out the case of dependent data. Part (vi) indicates that η_n(τ_1) converges in probability at a rate of 1/a_n.
This can be verified using standard assumptions; for example, if for each τ_1 ∈ T_1 the J + 1 classes of functions:

F_ψ(τ_1) := {ψ(·, τ_1, τ_2) : W → R | τ_2 ∈ T_2}, F_j(τ_1) := {m_j(·, τ_1, τ_2) : W → R | τ_2 ∈ T_2}, for j = 1, . . . , J,

are all P−Donsker classes, then part (vi) is satisfied with a_n = √n. This will be the case, for example, for all specifications considered in Section 5. After verifying part (vi), it is easy to find a sequence b_n satisfying part (vii). For example, if a_n = √n from part (vi), then we can set b_n = b/√(log(n)) for any b > 0. Finally, part (viii) essentially allows us to replace T_1 with a finite subset T_1′ without impacting the bounding problem. It is precisely because of part (viii) that all other parts of Assumption B.1 (namely parts (ii), (iii), (vi) and (vii)) are allowed to be so flexible with respect to the parameter τ_1. This last component of Assumption B.1 is verified in our basic setup in Proposition 3.1 in the main text, and is verified under our functional form, independence, and monotonicity assumptions in Corollaries B.3, B.6 and B.9, respectively. All other components of Assumption B.1 are either standard assumptions, or are easily verified for the bounding problems presented in the main text and for all specifications considered in Section 5.

Before stating the main result for this subsection, for any c ∈ R let us define:

T*(τ_1, P, c) := {τ_2 ∈ T_2 : E_P[m_j(W, τ_1, τ_2)] ≤ c for j = 1, . . . , J},

and:

Ψ*(P, c) = ⋃_{τ_1 ∈ T_1′} [Ψ^{ℓb}(τ_1, P, c), Ψ^{ub}(τ_1, P, c)],

where:

Ψ^{ℓb}(τ_1, P, c) := min_{τ_2 ∈ T*(τ_1, P, c)} E_P[ψ(W_i, τ_1, τ_2)], Ψ^{ub}(τ_1, P, c) := max_{τ_2 ∈ T*(τ_1, P, c)} E_P[ψ(W_i, τ_1, τ_2)].

Define the sets T*(τ_1, P_n, c) and Ψ*(P_n, c) analogously. The following Theorem then shows that a slight enlargement of the set Ψ*(P_n) is a consistent estimator for the set Ψ*(P), where consistency is defined using the Hausdorff metric.

Theorem B.2.
Suppose that Assumption B.1 holds. Then d_H(Ψ*(P_n, b_n), Ψ*(P)) = o_P(1), where b_n is the sequence from Assumption B.1.

Proof of Theorem B.2. We have:

d_H(Ψ*(P_n, b_n), Ψ*(P)) ≤ ∑_{τ_1 ∈ T_1′} d_H([Ψ^{ℓb}(τ_1, P_n, b_n), Ψ^{ub}(τ_1, P_n, b_n)], [Ψ^{ℓb}(τ_1, P), Ψ^{ub}(τ_1, P)]).

Since T_1′ is finite by Assumption B.1(viii), it suffices to show that:

d_H([Ψ^{ℓb}(τ_1, P_n, b_n), Ψ^{ub}(τ_1, P_n, b_n)], [Ψ^{ℓb}(τ_1, P), Ψ^{ub}(τ_1, P)]) = o_P(1),

for each τ_1 ∈ T_1′. To this end, fix any τ_1 ∈ T_1′. To show the previous display, it suffices to show consistency of the upper and lower bounds; i.e. that |Ψ^{ℓb}(τ_1, P_n, b_n) − Ψ^{ℓb}(τ_1, P)| = o_P(1) and that |Ψ^{ub}(τ_1, P_n, b_n) − Ψ^{ub}(τ_1, P)| = o_P(1). We will focus on the lower bound, since the upper bound proof is symmetric.

First recall that ψ(W_i, τ_1, τ_2) is continuous with respect to τ_2 for every τ_1 by Assumption B.1(ii), and T_2 is compact by Assumption B.1(i). Thus, we have that ψ(W_i, τ_1, τ_2) is uniformly continuous (w.r.t. τ_2) on T_2. Thus, for every ε > 0 there exists a δ(ε) > 0 such that |E_n[ψ(W_i, τ_1, τ_2)] − E_n[ψ(W_i, τ_1, τ_2′)]| < ε whenever ||τ_2 − τ_2′|| < δ(ε).
Now note that:

|Ψ^{ℓb}(τ_1, P_n, b_n) − Ψ^{ℓb}(τ_1, P)|
= | min_{τ_2 ∈ T*(τ_1, P_n, b_n)} E_n[ψ(W_i, τ_1, τ_2)] − min_{τ_2 ∈ T*(τ_1, P)} E_P[ψ(W_i, τ_1, τ_2)] |
≤ | min_{τ_2 ∈ T*(τ_1, P_n, b_n)} E_n[ψ(W_i, τ_1, τ_2)] − min_{τ_2 ∈ T*(τ_1, P)} E_n[ψ(W_i, τ_1, τ_2)] |
+ | min_{τ_2 ∈ T*(τ_1, P)} E_n[ψ(W_i, τ_1, τ_2)] − min_{τ_2 ∈ T*(τ_1, P)} E_P[ψ(W_i, τ_1, τ_2)] |
= | max_{τ_2 ∈ T*(τ_1, P)} (−E_n[ψ(W_i, τ_1, τ_2)]) − max_{τ_2 ∈ T*(τ_1, P_n, b_n)} (−E_n[ψ(W_i, τ_1, τ_2)]) |
+ | max_{τ_2 ∈ T*(τ_1, P)} (−E_P[ψ(W_i, τ_1, τ_2)]) − max_{τ_2 ∈ T*(τ_1, P)} (−E_n[ψ(W_i, τ_1, τ_2)]) |
≤ max_{{τ_2, τ_2′ ∈ T_2 : ||τ_2 − τ_2′|| ≤ d_H(T*(τ_1, P_n, b_n), T*(τ_1, P))}} | E_n[ψ(W_i, τ_1, τ_2′)] − E_n[ψ(W_i, τ_1, τ_2)] |
+ max_{τ_2 ∈ T*(τ_1, P)} | E_P[ψ(W_i, τ_1, τ_2)] − E_n[ψ(W_i, τ_1, τ_2)] |
≤ max_{{τ_2, τ_2′ ∈ T_2 : ||τ_2 − τ_2′|| ≤ d_H(T*(τ_1, P_n, b_n), T*(τ_1, P))}} C · ||τ_2 − τ_2′||
+ max_{τ_2 ∈ T*(τ_1, P)} | E_P[ψ(W_i, τ_1, τ_2)] − E_n[ψ(W_i, τ_1, τ_2)] |
≤ C · d_H(T*(τ_1, P_n, b_n), T*(τ_1, P)) + max_{τ_2 ∈ T*(τ_1, P)} | E_P[ψ(W_i, τ_1, τ_2)] − E_n[ψ(W_i, τ_1, τ_2)] |.

It suffices to show the two terms in the last line of the previous display converge to zero in probability. The second term converges in probability to zero by Assumption B.1(vi). Furthermore, since
C < ∞ w.p. 1, thefirst term converges to zero in probability if we can show that: d H ( T ∗ ( τ , P n , b n ) , T ∗ ( τ , P )) = o P (1) . The remainder of the proof will focus on proving this latter fact. Note that: d H ( T ∗ ( τ , P n , b n ) , T ∗ ( τ , P )) = inf { δ > T ∗ ( τ , P ) ⊆ T ∗ ( τ , P n , b n ) δ , and T ∗ ( τ , P n , b n ) ⊆ T ∗ ( τ , P ) δ } , where: T ∗ ( τ , P n , b n ) δ := { τ ∈ T : B δ ( τ ) ∩ T ∗ ( τ , P n , b n ) (cid:54) = ∅ } , ∗ ( τ , P ) δ := { τ ∈ T : B δ ( τ ) ∩ T ∗ ( τ , P ) (cid:54) = ∅ } , where B δ ( τ ) denotes the closed ball of radius δ > τ . The next part of the proof closely followsthe proof of Theorem 2.1 in Molchanov (1998). Define the function: ρ ( ε ) := d H ( T ∗ ( τ , P, ε ) , T ∗ ( τ , P )) . Since each of the moment functions are lower semi-continuous in τ for each τ , each of the sets T ∗ ( τ , P, ε )and T ∗ ( τ , P ) are closed and ρ is right continuous. Furthermore, ρ is non-increasing for ε < ε >
0. Now by Assumption B.1 we have with high probability: T ∗ ( τ , P n , b n ) = { τ ∈ T : E n [ m j ( W, τ , τ )] ≤ b n for j = 1 , . . . , k }⊆ { τ ∈ T : E n [ m j ( W, τ , τ )] ≤ η n ( τ ) + b n for j = 1 , . . . , k }⊆ T ∗ ( τ , P, b n ) ⊆ T ∗ ( τ , P ) ρ (2 b n ) . Furthermore, by Assumption B.1 we have with high probability for large enough n : T ∗ ( τ , P ) ⊆ T ∗ ( τ , P, b n − η n ( τ )) ⊆ T ∗ ( τ , P n , b n ) . Conclude that with high probability for large enough n : d H ( T ∗ ( τ , P n , b n ) , T ∗ ( τ , P )) ≤ ρ (2 b n ) → , where the last line follows from right-continuity of the function ρ ( · ). Since τ ∈ T (cid:48) was arbitrary, thiscompletes the proof. (cid:4) B.6 Additively Separable Case
In this subsection we will show how our method can be applied to a model that satisfies the following assumption.
Assumption B.2. (i) There exists a function $\tilde{\varphi} : \mathcal{X} \times \mathcal{Z} \times \mathcal{B} \to \mathbb{R}$ satisfying $\varphi(X, Z, \theta, \beta) = \tilde{\varphi}(X, Z, \beta) - \theta$, and (ii) the event:
$$F := \bigcup_{(x,z) \in \mathcal{X} \times \mathcal{Z}} \{\theta : \varphi(x, z, \theta, \beta) = 0\},$$
occurs with probability zero; that is, $P_\theta(F) = 0$.

This is a well-studied special case of the linear model considered in Section 4.1. In particular, much of the discussion in this section will expand upon the insights of Chesher (2013). We will consider two cases: (i) when the structural function $\varphi$ is linear in the parameter vector $\beta$, and (ii) when the structural function is unknown. To begin, let us consider the following simple example.

Example 4.
Suppose we have a scalar variable $X$ with support $\mathcal{X} = \{x_1, \ldots, x_{m_x}\}$ and latent variable $\theta \in [-1, 1]$, and suppose there are no $Z$ variables. Consider the following additively separable threshold crossing model:
$$Y = 1\{X\beta \ge \theta\},$$
where $\beta$ is a fixed scalar coefficient. The response types in this setting are characterized by the $m_x \times 1$ vectors:
$$r(\beta, \theta) := \big(1\{x_1\beta \ge \theta\},\ 1\{x_2\beta \ge \theta\},\ \ldots,\ 1\{x_{m_x}\beta \ge \theta\}\big)'.$$
However, the set of possible response types in this setting will depend on the sign of the fixed coefficient $\beta$. In particular, when $\beta \ge 0$ we have the response types $r(\beta, \theta) \in \{s_1, \ldots, s_{m_x+1}\}$, where:
$$s_1 := (0, \ldots, 0, 0)', \quad s_2 := (0, \ldots, 0, 1)', \quad \ldots, \quad s_{m_x} := (0, 1, \ldots, 1)', \quad s_{m_x+1} := (1, 1, \ldots, 1)'. \tag{B.21}$$
No other response types are possible when $\beta \ge 0$, and so all other response types must be assigned zero probability. Alternatively, when $\beta < 0$ we have the response types $r(\beta, \theta) \in \{s_1', \ldots, s_{m_x+1}'\}$, where:
$$s_1' := (0, 0, \ldots, 0)', \quad s_2' := (1, 0, \ldots, 0)', \quad \ldots, \quad s_{m_x}' := (1, \ldots, 1, 0)', \quad s_{m_x+1}' := (1, 1, \ldots, 1)'. \tag{B.22}$$
Again, no other response types are possible when $\beta < 0$, and so all others must be assigned zero probability by the distribution of $\theta$.

The reason that these particular response types arise when $\beta \ge 0$ and $\beta < 0$ is the ordering of the support of $X$ induced by the value of the scalar product $X\beta$. In particular, if we suppose $x_1 \le x_2 \le \ldots \le x_{m_x}$, then when $\beta \ge 0$ we have the ordering $x_1\beta \le x_2\beta \le \ldots \le x_{m_x}\beta$. This means, for example, that it is impossible to find a value of $\theta \in [-1, 1]$ so that:
$$r(\beta, \theta) = \big(1\{x_1\beta \ge \theta\},\ 1\{x_2\beta \ge \theta\},\ \ldots,\ 1\{x_{m_x}\beta \ge \theta\}\big)' = (1, 0, \ldots, 0)'.$$

[Figure 6: An illustration of the partition of the latent variable space according to response types in the case when the index function is additively separable in $\theta$ and when $\mathcal{X} = \{x_1, x_2, x_3\}$ with $x_1 \le x_2 \le x_3$. As indicated in the example, the feasible response types are those that correspond to a particular ordering of the points in $\mathcal{X}$ induced by the scalar product $X\beta$.]
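The feasibility restrictions in (B.21) and (B.22) are easy to verify numerically. A minimal sketch (the support points and function name below are our own, purely illustrative, choices):

```python
def response_types(support, beta, n_grid=10001):
    """Collect the distinct response-type vectors r(beta, theta) realized
    as theta ranges over a grid on [-1, 1], for Y = 1{X*beta >= theta}."""
    types = set()
    for i in range(n_grid):
        theta = -1.0 + 2.0 * i / (n_grid - 1)
        types.add(tuple(int(x * beta >= theta) for x in support))
    return types

# Hypothetical support x1 <= x2 <= x3; any such points behave the same way.
support = [-0.5, 0.2, 0.8]
pos = response_types(support, beta=1.0)   # beta >= 0: "staircase" types (B.21)
neg = response_types(support, beta=-1.0)  # beta < 0: reversed staircases (B.22)
assert len(pos) <= 4 and len(neg) <= 4    # at most m_x + 1 = 4 feasible types
assert (1, 0, 0) not in pos               # infeasible when beta >= 0...
assert (1, 0, 0) in neg                   # ...but feasible when beta < 0
```

The grid search simply traces the threshold $\theta$ across $[-1,1]$; each crossing of a point $x_j\beta$ flips one coordinate, which is why only the $m_x + 1$ "staircase" types appear.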
Indeed, the existence of such a value for $\theta$ would contradict the ordering $x_1\beta \le x_2\beta \le \ldots \le x_{m_x}\beta$. This means that when $\beta \ge 0$ certain response types are not possible, and so must be assigned probability zero by the distribution of $\theta$. An identical intuition holds in the case when $\beta < 0$. In the end, the response types that can be assigned positive probability in this example when $\beta \ge 0$ and $\beta < 0$ are exactly the ones corresponding to the vectors in (B.21) and (B.22), respectively. Figure 6 provides an illustration in the case when $\mathcal{X} = \{x_1, x_2, x_3\}$.

This example illustrates the key ideas behind the implementation of our approach when the index function is additively separable in $\theta$, as in Assumption B.2. In particular, given the function $\tilde{\varphi}$ from Assumption B.2, the key is to determine the values of $\beta$ such that the function $\tilde{\varphi}(\cdot, \beta) : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ induces a unique ordering of the points in the support $\mathcal{X} \times \mathcal{Z}$. With no $Z$ variables, a scalar $X$ variable, and $\tilde{\varphi}(X, Z, \beta) = X\beta$, Example 4 shows that only two orderings are possible, corresponding to the cases $\beta \ge 0$ and $\beta < 0$. After the order is determined, we can immediately determine the set of response types that must be assigned zero probability by the distribution of $\theta$, and then impose these restrictions as additional constraints in the bounding problems (3.15) and (3.16), as in Section 4.1. In particular, letting $S_\varphi$ denote the set of all binary vectors $s \in \{0,1\}^m$ corresponding to sets $\Theta(\beta, s)$ that can be assigned positive probability under Assumption B.2, we impose the constraint:
$$\sum_{s \in S_\varphi^c} \nu(y, x_j, z_j, \beta, s) = 0, \tag{B.23}$$
for all $y \in \{0,1\}$ and $j = 1, \ldots, m$ occurring with positive probability. Note that (B.23) is of an identical form to the constraint (4.1) in the main text. Corollary B.2 in Appendix B.2 is then immediately applicable, since Assumption 4.1 nests Assumption B.2 as a special case. Thus, Theorem 3.2 can be extended to accommodate Assumption B.2 by simply adding the constraints (4.1) to the optimization problems (3.15) and (3.16).

Similar to the discussion in Section 4.1, determining the sets $\Theta(\beta, s)$ that can be assigned positive probability under Assumption B.2 poses an interesting computational problem. Although Example 4 illustrates a case when there are only two orderings, in general many more orderings may be possible, even when $\tilde{\varphi}$ is linear in $\beta$. Clearly at most $m!$ orderings are possible, but when the index function is linear in $\beta$ it is possible to show that the maximum number of possible orderings is much smaller than $m!$. In particular, partition $\beta = (\beta_x, \beta_z)$ and consider the function $\tilde{\varphi}(X, Z, \beta) = X\beta_x + Z\beta_z$, where $X$ is a vector of dimension $d_x$ and $Z$ is a vector of dimension $d_z$. Label the support $\mathcal{X} \times \mathcal{Z}$ as $\{(x_1, z_1), (x_2, z_2), \ldots, (x_m, z_m)\}$, and let $\Delta_{jk} := (x_j, z_j) - (x_k, z_k)$ for $1 \le j < k \le m$. Setting $d = d_x + d_z$, the set $H_{jk} := \{\beta \in \mathbb{R}^d : \Delta_{jk}\beta = 0\}$ defines a hyperplane through the origin that is normal to the line connecting $(x_j, z_j)$ and $(x_k, z_k)$ in $\mathbb{R}^d$.
The set of all such hyperplanes partitions $\mathbb{R}^d$ into at most $Q(m, d)$ non-empty cones, where $Q(m, d)$ is defined recursively as:
$$Q(m, d) = Q(m-1, d) + (m-1)\,Q(m-1, d-1), \tag{B.24}$$
with $Q(m, 1) = 2$ for all $m \ge 2$ and $Q(2, d) = 2$ for all $d \ge 1$. Furthermore, each of these non-empty cones corresponds exactly to an equivalence class of vectors $\beta = (\beta_x, \beta_z)$ that induce a unique ordering of the points in $\mathcal{X} \times \mathcal{Z}$. Thus, the value $Q(m, d)$ serves as an upper bound on the number of orderings of the points in $\mathcal{X} \times \mathcal{Z}$ that are inducible by the function $\tilde{\varphi}(X, Z, \beta) = X\beta_x + Z\beta_z$. The recursive formula in (B.24) defining the upper bound $Q(m, d)$ has been independently discovered in different contexts by many authors; the earliest such account appears in Bennett (1956), although the formula was independently discovered again in Cover (1967). The upper bound $Q(m, d)$ is attained when the collection of hyperplanes of the form $H_{jk}$ is in general position. Note that $Q(m, 1) = 2$ corresponds exactly to Example 4, where it was shown that only two orderings could be induced when $\tilde{\varphi}(X, Z, \beta) = X\beta$ for scalar $X$ and $\beta$. Typically $Q(m, d) < m!$, although some inspection of the formula shows that we will always have $Q(m, d) = m!$ when $d \ge m - 1$.

By selecting one value of $\beta$ from each of the cones defined by the collection of hyperplanes of the form $H_{jk}$, we can determine the permitted orderings of the support points in $\mathcal{X} \times \mathcal{Z}$ by simply evaluating $x_j\beta_x + z_j\beta_z$ for $j = 1, \ldots, m$ at each selected value of $\beta$. This then allows us to determine which sets $\Theta(\beta, s)$ must be assigned zero probability under Assumption B.2. Note that under Assumption B.2 the latent variable $\theta$ obtains a value on a hyperplane of the form $H_{jk}$ with probability zero. Thus, it suffices to select one value of $\beta$ from each of the non-empty cones defined by the collection of hyperplanes of the form $H_{jk}$. This can be done using the hyperplane arrangement algorithm described in Section 4.1, applied to the hyperplanes of the form $H_{jk}$ for $1 \le j < k \le m$.

Our method is also applicable to cases when $\tilde{\varphi}(X, Z, \beta)$ may be non-linear in the finite-dimensional vector $\beta$. To see how this case can be accommodated, recall that in the case when $\tilde{\varphi}$ is linear in $\beta$, the ordering of the support points in $\mathcal{X} \times \mathcal{Z}$ induced by the function $\tilde{\varphi}(X, Z, \beta)$ allowed us to determine the admissible response types, which in turn allowed us to construct the additional constraints needed in programs (3.15) and (3.16). A similar strategy can be used when $\tilde{\varphi}$ is not known by the researcher. However, when $\tilde{\varphi}$ is not restricted by the researcher, all orderings of the support points in $\mathcal{X} \times \mathcal{Z}$ will be possible. The procedure to bound a counterfactual choice probability (or some other counterfactual quantity of interest) is then as follows. The researcher must first fix an ordering of the support points in $\mathcal{X} \times \mathcal{Z}$, determine the admissible response types $S_\varphi$ for the fixed ordering, and run the linear programs in (3.15) and (3.16) subject to the constraint (B.23). The researcher must then repeat the procedure for all possible orderings of the support points in $\mathcal{X} \times \mathcal{Z}$. Each iteration of this procedure yields an interval with endpoints determined by the values of the linear programs in (3.15) and (3.16). The closed convex hull of the identified set for the counterfactual choice probability is then given by the interval whose lower endpoint is the smallest value of the linear program in (3.15) obtained across all orderings, and whose upper endpoint is the largest value of the linear program in (3.16) obtained across all orderings. Admittedly, there will be $m!$ possible orderings for $\tilde{\varphi}(X, Z, \beta)$ unless additional assumptions are imposed, so considering all possible orderings may be computationally burdensome.
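The recursion in (B.24) is straightforward to compute; a minimal sketch (the function name is ours):

```python
from math import factorial

def Q(m, d):
    """Upper bound (B.24) on the number of orderings of m support points
    inducible by a linear index with d free parameters."""
    if d == 1:
        return 2  # base case: Q(m, 1) = 2 for all m >= 2
    if m == 2:
        return 2  # base case: Q(2, d) = 2 for all d >= 1
    return Q(m - 1, d) + (m - 1) * Q(m - 1, d - 1)

assert Q(3, 2) == factorial(3)  # Q(m, d) = m! whenever d >= m - 1
assert Q(4, 3) == factorial(4)
assert Q(4, 2) == 12            # far fewer than 4! = 24 orderings
```

For instance, with $m = 4$ support points and $d = 2$ parameters, only 12 of the $4! = 24$ orderings can be induced by a linear index, so only those 12 sets of constraints (B.23) need to be considered.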
C Comparison to the Approach Based on Artstein’s Inequalities
In this Appendix, we briefly review an alternative method of constructing the identified set for (conditional) latent variable distributions: the method proposed by Chesher et al. (2013) and Chesher and Rosen (2014), and exposited in general form in Chesher and Rosen (2017). Our objective here is to provide an informal comparison between our approach and theirs, and to provide a brief derivation showing how the two approaches might be connected.

Let us suppose that Assumption 2.1 holds, and consider a slightly modified version of the random set from (2.2):
$$H^-(y, x, z, \beta) := \mathrm{cl}\{\theta : y = 1\{\varphi(x, z, \theta, \beta) \ge 0\}\}.$$
That is, the random set $H^-(y, x, z, \beta)$ is equal to the closure of the random set $G^-(y, x, z, \beta)$ from (2.2). Under some conditions, if the distribution of $\theta$ is absolutely continuous (which is assumed, for example, in both Chesher et al. (2013) and Chesher and Rosen (2014)), these two random sets will be equal almost surely. Let us also define the random set:
$$H(\theta, \beta) := \mathrm{cl}\{(y, x, z) : y = 1\{\varphi(x, z, \theta, \beta) \ge 0\}\}.$$
We now apply a fundamental result due to Artstein (1983) (see also Norberg (1992) and Molchanov (2017), Corollary 1.4.11), which characterizes the set of selections of a random closed set.
Theorem C.1.
Suppose that Assumption 2.1 holds. Then for any $\beta \in \mathcal{B}$, the random vector $\theta$ can be realized as a selection of the random closed set $H^-(Y, X, Z, \beta)$ if and only if:
$$P_\theta(\theta \in K) \le P_{Y,X,Z}(H^-(Y, X, Z, \beta) \cap K \ne \emptyset), \tag{C.1}$$
for all compact sets $K \subset \Theta$. Furthermore, for any $\beta \in \mathcal{B}$, the random vector $(Y, X, Z)$ can be realized as a selection of the random closed set $H(\theta, \beta)$ if and only if:
$$P_{Y,X,Z}((Y, X, Z) \in C) \le P_\theta(H(\theta, \beta) \cap C \ne \emptyset), \tag{C.2}$$
for all compact sets $C \subset \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$.

Remark C.1.
The statement "...the random vector $\theta$ can be realized as a selection of the random closed set $H^-(Y, X, Z, \beta)$..." means that there exists a probability space and random elements $\tilde{\theta}$ and $\tilde{H}^-(Y, X, Z, \beta)$ with distributions identical to those of $\theta$ and $H^-(Y, X, Z, \beta)$ such that $\tilde{\theta} \in \tilde{H}^-(Y, X, Z, \beta)$ a.s.
Since $\Theta \subset \mathbb{R}^{d_\theta}$ is locally compact and Hausdorff, it is equivalent that (C.1) hold for all open sets $G \subset \Theta$. Furthermore, since $\mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$ is finite, all subsets of this product space are compact.
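Since the product space is finite, condition (C.2) can in principle be verified by brute force over all (automatically compact) subsets $C$. A toy sketch with a made-up two-point outcome space, a two-point discretization of the latent variable, and a hypothetical set-valued map (all names below are illustrative assumptions, not objects from the paper):

```python
from itertools import chain, combinations

def artstein_holds(p_w, p_theta, H, tol=1e-12):
    """Check Artstein's inequality P(W in C) <= P_theta(H(theta) meets C)
    for every non-empty subset C of a finite outcome space."""
    outcomes = list(p_w)
    subsets = chain.from_iterable(
        combinations(outcomes, r) for r in range(1, len(outcomes) + 1))
    for C in subsets:
        lhs = sum(p_w[w] for w in C)                           # containment side
        rhs = sum(p for t, p in p_theta.items() if H[t] & set(C))  # capacity side
        if lhs > rhs + tol:
            return False
    return True

p_w = {"w0": 0.4, "w1": 0.6}          # distribution of the observed vector
p_theta = {"t0": 0.5, "t1": 0.5}      # distribution of the latent type
H = {"t0": {"w0", "w1"}, "t1": {"w1"}}  # hypothetical random set H(theta)
assert artstein_holds(p_w, p_theta, H)
# Shrinking H("t0") to {"w0"} makes C = {"w1"} violate the inequality:
assert not artstein_holds(p_w, p_theta, {"t0": {"w0"}, "t1": {"w1"}})
```

The loop over `subsets` makes the combinatorial cost visible: with $r$ support points there are $2^r - 1$ sets $C$ to check, which motivates the core-determining-class reductions discussed below.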
The first part of this result is very similar to Theorem 1 in Chesher et al. (2013) and Theorem 3.1 in Chesher and Rosen (2014), and thus is not new. The second part of this result is a direct corollary of Theorem 1 in Chesher and Rosen (2017). In fact, Theorem 1 in Chesher and Rosen (2017) shows that (C.1) and (C.2) impose an equivalent set of constraints on the (unconditional) latent variable distributions $P_\theta$. Thus, either (C.1) or (C.2) can be used to construct the identified set of unconditional latent variable distributions, say $\mathcal{P}^*_\theta$; in practice, this is accomplished by first fixing a value of $\beta \in \mathcal{B}$, collecting all distributions $P_\theta$ satisfying either (C.1) or (C.2), and then taking a union (over all $\beta \in \mathcal{B}$) of the resulting collections of distributions. Also note that a result similar to Theorem C.1 can be stated after conditioning both sides of (C.1) and (C.2) on any combination of the variables in $(Y, X, Z)$; we will make use of this shortly. Finally, note that (C.1) (or (C.2)) provides a characterization of all distributions of random vectors $\theta$ that can be realized as a selection from a random set, and not just those selections whose distributions satisfy certain conditions, such as absolute continuity. Although attention is often focused on selections $\theta$ with absolutely continuous distributions, Artstein's inequalities are only necessary and not sufficient for the existence of such a selection.

(Footnote: In Chesher and Rosen (2014), absolute continuity combined with a linear index function ensures this statement is true. Chesher et al. (2013) consider a more general class of latent index functions than Chesher and Rosen (2014), and so also impose strict monotonicity of the latent index function in the latent variables in order to ensure their analog of the set $\{\theta : \varphi(x, z, \theta, \beta) = 0\}$ is of Lebesgue measure zero for each $(y, x, z, \beta)$.)
Thus, most efforts to reduce the computational burden of the approach based on Artstein's inequalities are directed towards reducing the number of constraints implied by Theorem C.1 (and its analogues in other contexts).

At first glance it appears that (C.1) leads to a characterization of the identified set of unconditional latent variable distributions that is intractable, given the number of possible compact subsets of $\Theta$. However, following the discussion in both Chesher et al. (2013) and Chesher and Rosen (2014), it can be shown that most of these inequalities impose redundant constraints, and that there are typically only a finite number of nonredundant inequalities. For example, note that when $\varphi$ is linear in $\theta$ and $\beta$, for each fixed $\beta$ the set $H^-(y, x, z, \beta)$ represents a closed halfspace through the origin (intersected with $\Theta$). In this special case, Chesher and Rosen (2014) demonstrate that it suffices to check the inequalities in (C.1) for all sets $K$ that can be written as the intersection of halfspaces of the form $H^-(y, x, z, \beta)$. This reduces the infinite number of inequalities implied by (C.1) to a finite number. For example, in the case with no exogenous variables and a scalar endogenous variable $X$ with $m_x$ points of support, there are at most $2^r$ such intersections to check, with $r = 2 m_x m_z$ (the number of support points of $\mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$).
However, clearly even with small values of $m_x$ and $m_z$ the resulting number of inequalities can be prohibitively large. We will now show that our approach in this paper can be considered as a simplification of the set of constraints in (C.2), where the number of constraints in our simplification is proportional to $r$ rather than $2^r$.

To see the connection to the approach in this paper, consider imposing (C.2) conditional on $(Y, X, Z)$. We obtain that, for any $\beta \in \mathcal{B}$, the random vector $\theta$ can be realized as a selection from the random closed set $H^-(y, x, z, \beta)$ if and only if:
$$1\{(y, x, z) \in C\} \le P_{\theta|Y,X,Z}(H(\theta, \beta) \cap C \ne \emptyset \mid Y = y, X = x, Z = z),$$
for all compact $C \subset \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$, and all $(y, x, z) \in \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$ assigned positive probability. Now for a fixed value of $(y, x, z)$, consider the set of all compact $C \subset \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$ containing $(y, x, z)$. For all such $C$ we must have:
$$P_{\theta|Y,X,Z}(H(\theta, \beta) \cap C \ne \emptyset \mid Y = y, X = x, Z = z) = 1.$$

(Footnote: In particular, even relatively small problems can quickly exhaust all available storage in a computer's random access memory (RAM), which can cause the computer to slow down significantly, or crash. See Galichon and Henry (2011) and Chesher and Rosen (2017) for a discussion of the idea of the "core determining class," which is any collection of sets $K \subset \Theta$ that is sufficient for (C.1) to hold for all compact sets. A careful comparison between the approach based on Artstein's inequalities and other approaches in the context of bounding treatment effects is provided in Russell (2019).)

However, this condition holds for all compact $C \subset \mathcal{Y} \times \mathcal{X} \times \mathcal{Z}$ containing $(y, x, z)$ if and only if it holds for the singleton set $\{(y, x, z)\}$. We thus have:
$$P_{\theta|Y,X,Z}(H(\theta, \beta) \cap \{(y, x, z)\} \ne \emptyset \mid Y = y, X = x, Z = z) = 1, \qquad (y, x, z)\text{-a.s.}$$
However, some basic manipulation shows this holds if and only if:
$$P_{\theta|Y,X,Z}(\theta \in H^-(Y, X, Z, \beta) \mid Y = y, X = x, Z = z) = 1, \qquad (y, x, z)\text{-a.s.}$$
This derivation can be used to prove the following Lemma:

Lemma C.1.