Sharp Bounds on Treatment Effects for Policy Evaluation∗

Sukjin Han
Department of Economics
University of [email protected]

Shenshen Yang
Department of Economics
University of Texas at [email protected]

Draft: September 30, 2020
Abstract
For counterfactual policy evaluation, it is important to ensure that treatment parameters are relevant to the policies in question. This is especially challenging under unobserved heterogeneity, as is well featured in the definition of the local average treatment effect (LATE). Being intrinsically local, the LATE is known to lack external validity in counterfactual environments. This paper investigates the possibility of extrapolating local treatment effects to different counterfactual settings when instrumental variables are only binary. We propose a novel framework to systematically calculate sharp nonparametric bounds on various policy-relevant treatment parameters that are defined as weighted averages of the marginal treatment effect (MTE). Our framework is flexible enough to incorporate a large menu of identifying assumptions beyond the shape restrictions on the MTE that have been considered in prior studies. We apply our method to understand the effects of medical insurance policies on the use of medical services.
JEL Numbers:
C14, C32, C33, C36
Keywords:
Heterogeneous treatment effects, local average treatment effects, marginal treatment effects, extrapolation, partial identification.

∗The authors are grateful to Jason Abrevaya, Brendan Kline, Xun Tang, Alex Torgovitsky, Ed Vytlacil, Haiqing Xu, and participants in the 2020 Texas Econometrics Camp and the workshop at UT Austin for helpful comments and discussions.

1 Introduction
For counterfactual policy evaluation, it is important to ensure that treatment parameters are relevant to the policies in question. This is especially challenging in the presence of unobserved heterogeneity. This challenge is well featured in the definition of the local average treatment effect (LATE). The LATE has been one of the most popular treatment parameters used by empirical researchers since it was introduced by Imbens and Angrist (1994). It induces a straightforward linear estimation method that requires only a binary instrumental variable (IV), and yet allows for unrestricted treatment heterogeneity. The unfortunate feature of the LATE is that, as the name suggests, the parameter is intrinsically local, recovering the average treatment effect (ATE) for a specific subgroup of the population called compliers. This feature leads to two major challenges in making the LATE a valuable parameter for counterfactual policy evaluation. First, the subpopulation for which the effect is measured may not be the population of policy interest. Second, the definition of the subpopulation depends on the IV chosen, rendering the parameter even more difficult to extrapolate to new environments.

Dealing with the lack of external validity of the LATE has been an important theme in the literature. One approach in theoretical work (Angrist and Fernandez-Val (2010); Bertanha and Imbens (2019)) and empirical research (Dehejia et al. (2019); Muralidharan et al. (2019)) has been to show the similarity between complier and non-complier groups based on observables. This approach, however, cannot attend to possible unobservable discrepancies between these groups. Heckman and Vytlacil (2005) unify well-known treatment parameters by expressing them as weighted averages of what they define as the marginal treatment effect (MTE). This MTE framework has great potential for extrapolation because a class of treatment parameters that are policy-relevant can also be generated as weighted averages of the MTE.
The only obstacle is that the MTE is identified via a method called local IV (Heckman and Vytlacil (1999)), which requires continuous variation of the IV, possibly over a large support depending on the target parameter. This in turn reflects the intrinsic difficulty of extrapolation when the available exogenous variation is only discrete. Acknowledging this nature of the challenge, previous studies in the literature have proposed imposing shape restrictions on the MTE, which is a function of the treatment-selection unobservable, while allowing for binary instruments in the framework of Heckman and Vytlacil (2005). Brinch et al. (2017) introduce shape restrictions (e.g., linearity) on the MTE functions in an attempt to identify the LATE extrapolated to different subpopulations or to test for its external validity. More recently, Mogstad et al. (2018) propose a general partial identification framework where bounds on various policy-relevant treatment parameters can be obtained from a set of "IV-like estimands" that are directly identified from the data and routinely obtained in empirical work. Kowalski (2020) applies an approach similar to these studies to extrapolate the results from one health insurance experiment to an external setting.

This paper continues this pursuit and investigates the possibility of extrapolating local treatment parameters to different policy settings in the MTE framework when IVs are only binary. In a partial identification framework similar in spirit to Mogstad et al. (2018), we show how to systematically calculate sharp nonparametric bounds on various extrapolated treatment parameters for binary (or, more generally, discrete) outcomes using instruments that are allowed to be binary. These parameters are defined as weighted averages of the MTE. Examples include the ATE, the treatment on the treated, the LATE for subgroups induced by new policies, and the policy-relevant treatment effect (PRTE).
We also show how to place in this procedure restrictions from a large menu of identifying assumptions beyond the shape restrictions considered in earlier work.

In this paper, we make four main contributions. First, we propose a novel framework for calculating bounds on policy-relevant treatment parameters. We introduce the probability of the latent state of the outcome-generating process conditional on the treatment-selection unobservable. This latent conditional probability is the key ingredient for our analysis, as both the target parameter and the distribution of the observables can be written as linear functionals of it. Therefore, with it as a decision variable, we can formulate an infinite-dimensional linear program that produces bounds on a targeted treatment parameter. This approach is reminiscent of Balke and Pearl (1997) and can be viewed as its generalization to the MTE framework. Balke and Pearl (1997) introduce a linear programming approach to characterize bounds on the ATE with a binary outcome, treatment, and instrument. The main distinction of our approach is that the latent probability is conditioned on the selection unobservable, which is important for our extrapolation purpose. To make it feasible to solve the resulting infinite-dimensional program, we use a sieve-like approximation of the program and produce a finite-dimensional linear program (LP). This approximation approach builds on Mogstad et al. (2018), although they apply the approximation directly to the MTE function. We also propose a conservative approach to choosing the sieve dimension in practice.

Second, formulating the LP based on the latent conditional probability rather than the MTE creates a flexible environment in which we can introduce identifying assumptions that have not been used in the context of the MTE framework or the LATE extrapolation. We propose assumptions that there exist exogenous variables other than IVs.
We propose two types of exogenous variables that have been used in the context of identifying the ATE in the literature: Mourifié (2015), Han and Vytlacil (2017), Vuong and Xu (2017), and Han and Lee (2019) use the first type, and Vytlacil and Yildiz (2007), Shaikh and Vytlacil (2011), and Balat and Han (2018) use the second type. We utilize these variables in the novel context of the MTE framework. Also, while the earlier papers exploit these variables in combination with rank similarity or rank invariance, we show that they independently have identifying power for treatment parameters, including the ATE. We also propose identifying assumptions such as uniformity and the direction of endogeneity in this MTE framework. The direction of endogeneity is sometimes assumed in empirical work to characterize selection bias and has been shown to have identifying power (Manski and Pepper (2000)). The uniformity assumption is related to rank similarity or rank invariance (Chernozhukov and Hansen (2005)). The shape restrictions on the MTE considered in the literature can also be nested within our framework, since the MTE is just a sum of the latent conditional probabilities. The assumptions on the existence of exogenous variables complement the identifying assumptions that rely on the researcher's prior, in that their identifying power comes from actual data. When a confidence set is constructed under one of the latter assumptions, we can conduct a specification test for that assumption.

Third, we show that our approach yields a straightforward proof of the sharpness of the resulting bounds. This feature stems from the use of the latent conditional probability in the linear programming and the convexity of the feasible set in the program.
When the MTE itself is the target parameter, we distinguish between the notions of point-wise and uniform sharpness and argue why uniform sharpness is often difficult to achieve.

Fourth, as an application, we study the effects of insurance on medical service utilization by considering various counterfactual policies related to insurance coverage. The LATE for compliers and the bounds on the LATE for always-takers and never-takers reveal that possessing private insurance has the largest effect on medical visits for never-takers, i.e., those who face higher insurance cost. This provides a policy implication that lowering the cost of private insurance is important, because the high cost might hinder the people with most need from receiving adequate medical services.

The linear programming approach to partial identification of treatment effects was pioneered by Balke and Pearl (1997) and recently gained attention in the literature; see, e.g., Mogstad et al. (2018), Torgovitsky (2019a), Machado et al. (2019), Kamat (2019), Gunsilius (2019), and Han (2020b). As these papers suggest, there are many settings, including ours, where analytical derivation of bounds is cumbersome or nearly impossible due to the complexity of the problems. (For the computational approach in contexts other than program evaluation, see Manski (2007), Kitamura and Stoye (2019), Deb et al. (2017), and Tebaldi et al. (2019).)

This paper proceeds as follows. The next section introduces the main observables, maintained assumptions, and target parameters. Section 3 defines the latent conditional probability.
2 Setup

Assume that we observe the binary outcome Y ∈ {0, 1}, binary treatment D ∈ {0, 1}, and binary instrument Z ∈ {0, 1}. We may additionally observe (possibly endogenous) discrete covariates X ∈ X. Binary Y is common in empirical work. Binary Z is also common, especially in randomized experiments, and allowing for this minimal exogenous variation is the key challenge for extrapolation that we want to address in this paper. Still, the analysis of this paper can be extended to allow for general discrete Y and Z. Let Y(d) be the counterfactual outcome given D = d, which is consistent with the observed outcome: Y = DY(1) + (1 − D)Y(0). We maintain the following assumptions:

Assumption SEL. D = 1{U ≤ P(Z, X)}, where P(Z, X) ≡ Pr[D = 1 | Z, X] and U | X = x ∼ Unif[0, 1] for x ∈ X.

Assumption EX. (Y(d), D(z)) ⊥ Z | X.

Assumption SEL imposes a selection model for D, which is important in motivating and interpreting marginal treatment effects later. This assumption is also equivalent to Imbens and Angrist (1994)'s monotonicity assumption (Vytlacil (2002)). We introduce the standard normalization that U ∼ Unif[0, 1] conditional on X = x. Assumption EX imposes the exclusion restriction and conditional independence for Z.

We focus on discrete X as it simplifies the exposition; Section B.3 in the Appendix extends the framework to incorporate continuously distributed X. Note that for any index function g(z, x) and an unobservable ε with any distribution, the selection model satisfies D = 1{ε ≤ g(Z, X)} = 1{F_{ε|X}(ε | X) ≤ F_{ε|X}(g(Z, X) | X)} = 1{U ≤ P(Z, X)}, since P(z, x) = Pr[ε ≤ g(z, x) | X = x] = Pr[U ≤ F_{ε|X}(g(z, x) | x) | X = x] = F_{ε|X}(g(z, x) | x) and F_{ε|X}(ε | X) = U is uniformly distributed conditional on X.

The MTE is defined as

MTE(u, x) ≡ E[Y(1) − Y(0) | U = u, X = x].

Following Mogstad et al. (2018), it is convenient to introduce the marginal treatment response (MTR) function

m_d(u, x) ≡ E[Y(d) | U = u, X = x] = Pr[Y(d) = 1 | U = u, X = x].

Then, the MTE can be expressed as m_1(u, x) − m_0(u, x). Now, we define the target parameter τ to be a weighted average of the MTE:

τ = E[τ_1(Z, X) − τ_0(Z, X)],    (2.1)

where

τ_d(z, x) = ∫₀¹ m_d(u, x) w_d(u, z, x) du    (2.2)

by using F_{U|X}(u | x) = u, and w_d(u, z, x) is a known weight specific to the parameter of interest. This definition agrees with the insight of Heckman and Vytlacil (2005). The target parameter includes a wide range of policy-relevant treatment parameters. With a Dirac delta function at a given value u as the weight, the MTE itself can be an example. We list a few examples of the target parameter here; other examples can be found in Table 4 in the Appendix.

Example 1.
The ATE can be a target parameter with w_d(u, z, x) = 1, ∀u, z, x:

τ_ATE = E[∫₀¹ m_1(u, X) du − ∫₀¹ m_0(u, X) du].

Example 2.
The generalized LATEs for always-takers and never-takers are also target parameters. Here, we give the expression of the LATE for always-takers as an example. Assume P(z, x) increases in z for any given x ∈ X. For the always-taker (AT) LATE, we give weight 1/P(0, x) to individuals with u ∈ [0, P(0, x)], and thus we have w_d(u, z, x) = 1(u ∈ [0, P(0, x)])/P(0, x). (Mogstad et al. (2018) define the weight in a slightly different way.)

τ_LATE-AT = E[∫₀¹ m_1(u, X) 1(u ∈ [0, P(0, X)])/P(0, X) du − ∫₀¹ m_0(u, X) 1(u ∈ [0, P(0, X)])/P(0, X) du].

Example 3.
The policy-relevant treatment effect (PRTE) is a target parameter that is particularly useful for policy evaluation. It is defined as the welfare difference between two different policies. Let Z and Z′ be two instrumental variables under two policies and P(Z, X) and P′(Z′, X) be the propensity scores under the two policies. Then

τ_PRTE = E[∫₀¹ m_1(u, X) (Pr[u ≤ P′(Z′, X)] − Pr[u ≤ P(Z, X)])/(E[P′(Z′, X)] − E[P(Z, X)]) du − ∫₀¹ m_0(u, X) (Pr[u ≤ P′(Z′, X)] − Pr[u ≤ P(Z, X)])/(E[P′(Z′, X)] − E[P(Z, X)]) du].

In these examples, the weights w_1 and w_0 can be set asymmetrically to define a broader class of parameters. All the parameters we consider in this paper can be defined conditional on X, although we omit the conditioning for succinctness.

3 Latent Conditional Probability

As a crucial first step of our analysis, we define a state variable that determines a specific mapping of d to y. Since d ∈ {0, 1} and y ∈ {0, 1}, there are four possible maps from d onto y. Define a discrete latent variable ǫ whose value e corresponds to each possible map: ǫ ∈ E, where |E| = 4 with E ≡ {1, 2, 3, 4}. That is, ǫ is a decimal transformation of the binary sequence (Y(1), Y(0)), which captures the treatment effect heterogeneity. For later purposes, it is helpful to explicitly define the map as y = g_e(d) and write

Y(d) = g_ǫ(d),    (3.1)

which implies Y = g_ǫ(D). It is important to note that no structure is imposed in introducing g_e(·), because the mapping is saturated by binary Y and D. By (3.1) and Assumption SEL, Assumption EX can be equivalently stated as (ǫ, U) ⊥ Z | X. Still, ǫ and X can be correlated, as X is allowed to be endogenous.

Now, as a key component of our LP, we define the probability mass function of ǫ conditional on (U, X): for e ∈ E,

q(e | u, x) ≡ Pr[ǫ = e | U = u, X = x]    (3.2)

with Σ_{e∈E} q(e | u, x) = 1 for any (u, x). The quantity q(e | u, x) captures endogenous treatment selection.
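Since ǫ is a decimal transformation of (Y(1), Y(0)), the four maps can be enumerated mechanically. In this sketch we assume the labeling e = 1 + 2Y(1) + Y(0), which is one convention consistent with the text; the paper's own numbering may differ.

```python
# Enumerate the four maps g_e(d) for binary Y and D, assuming the
# (hypothetical) labeling e = 1 + 2*Y(1) + Y(0).
def g(e, d):
    y1, y0 = (e - 1) >> 1 & 1, (e - 1) & 1   # decode (Y(1), Y(0)) from e
    return y1 if d == 1 else y0

# e = 1: Y(1) = Y(0) = 0; e = 2: only Y(0) = 1; e = 3: only Y(1) = 1; e = 4: both 1.
table = {e: (g(e, 1), g(e, 0)) for e in range(1, 5)}
print(table)   # -> {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}
```

Under this labeling, e = 3 and e = 4 are the maps with g_e(1) = 1, and e = 2 and e = 4 are those with g_e(0) = 1.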
It is shown below that this latent conditional probability is a building block for various treatment parameters and thus serves as the decision variable in the LP. The introduction of q(e | u, x) distinguishes our approach from those in Balke and Pearl (1997) and Mogstad et al. (2018). Since the probability is conditional on the continuously distributed U, the simple finite-dimensional linear programming approach of Balke and Pearl (1997) is no longer applicable. Instead, we use an approximation method similar to Mogstad et al. (2018). However, Mogstad et al. (2018) use the MTR function as a building block for treatment parameters and introduce the "IV-like" estimands as a means of funneling the information from the data. Unlike in Mogstad et al. (2018), q(e | u, x) can be directly related to the distribution of the data. This allows us to later incorporate identifying assumptions that are difficult to incorporate within the framework of Mogstad et al. (2018).

By (3.1) and (3.2), note that

Pr[Y(d) = 1 | U = u, X = x] = Pr[ǫ ∈ {e ∈ E : g_e(d) = 1} | U = u, X = x] = Σ_{e∈E: g_e(d)=1} q(e | u, x),

so that

m_d(u, x) = Σ_{e: g_e(d)=1} q(e | u, x).    (3.3)

Combining (3.3) and (2.2), we have τ_d(z, x) = Σ_{e: g_e(d)=1} ∫₀¹ q(e | u, x) w_d(u, z, x) du, and thus the target parameter τ = E[τ_1(Z, X)] − E[τ_0(Z, X)] in (2.1) can be written as

τ = Σ_{e: g_e(1)=1} ∫₀¹ E[q(e | u, X) w_1(u, Z, X)] du − Σ_{e: g_e(0)=1} ∫₀¹ E[q(e | u, X) w_0(u, Z, X)] du    (3.4)

for some q that satisfies the properties of a probability.

The goal of this paper is to (at least partially) infer the target parameter τ based on the data, i.e., the distribution of (Y, D, Z, X).
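As a toy numerical illustration of (3.3), the MTR is a partial sum of the latent conditional probabilities. The labeling e = 1 + 2Y(1) + Y(0) and the particular pmf below are assumptions made only for this sketch.

```python
# Toy illustration of (3.3): the MTR m_d(u) is a partial sum of q(e|u).
# The labeling e = 1 + 2*Y(1) + Y(0) and the pmf are assumed for the sketch.
g = {e: ((e - 1) >> 1 & 1, (e - 1) & 1) for e in range(1, 5)}   # e -> (Y(1), Y(0))

def q(e, u):
    # A valid conditional pmf for u in [0, 1]: nonnegative, sums to one over e.
    probs = [0.25 + 0.1 * (u - 0.5), 0.25 - 0.1 * (u - 0.5), 0.25, 0.25]
    return probs[e - 1]

def m(d, u):
    # m_d(u) = sum of q(e|u) over e with g_e(d) = 1, as in (3.3)
    pos = 0 if d == 1 else 1          # g[e][0] = Y(1), g[e][1] = Y(0)
    return sum(q(e, u) for e in range(1, 5) if g[e][pos] == 1)

# Here m_1(u) = q(3|u) + q(4|u) = 0.5, while m_0(u) = q(2|u) + q(4|u) varies in u.
```

With such a q in hand, any target parameter in (3.4) is a linear functional of q, which is what makes the linear-programming formulation below possible.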
The key insight is that there are observationally equivalent q(e | u, x)'s that are consistent with the data, which in turn produce observationally equivalent τ's that define the identified set.

Let p(y, d | z, x) ≡ Pr[Y = y, D = d | Z = z, X = x] be the observed conditional probability. This data distribution imposes restrictions on q(e | u, x). For instance, for D = 1,

p(y, 1 | z, x) = Pr[Y(1) = y, U ≤ P(z, x) | Z = z, X = x] = Pr[Y(1) = y, U ≤ P(z, x) | X = x]

by Assumption EX, and

Pr[Y(1) = y, U ≤ P(z, x) | X = x] = ∫₀^{P(z,x)} Pr[Y(1) = y | U = u, X = x] du = Σ_{e: g_e(1)=y} ∫₀^{P(z,x)} q(e | u, x) du,    (3.5)

where the second equality is by Pr[Y(d) = y | U = u, X = x] = Σ_{e: g_e(d)=y} q(e | u, x).

To define the identified set for τ, we introduce some simplifying notation. Let q(u, x) ≡ {q(e | u, x)}_{e∈E} and

Q ≡ {q(·) : Σ_{e∈E} q(e | u, x) = 1 ∀(u, x) and q(e | u, x) ≥ 0 ∀(e, u, x)}

be the class of q(u, x), and let p ≡ {p(1, d | z, x)}_{(d,z,x)∈{0,1}²×X}. Also, let R_τ : Q → R and R : Q → R^{d_p} (with d_p being the dimension of p) denote the linear operators of q(·) that satisfy

R_τ q ≡ Σ_{e: g_e(1)=1} ∫₀¹ E[q(e | u, X) w_1(u, Z, X)] du − Σ_{e: g_e(0)=1} ∫₀¹ E[q(e | u, X) w_0(u, Z, X)] du,
R q ≡ {Σ_{e: g_e(d)=1} ∫_{U^d_{z,x}} q(e | u, x) du}_{(d,z,x)},

where U^d_{z,x} denotes the intervals U¹_{z,x} ≡ [0, P(z, x)] and U⁰_{z,x} ≡ (P(z, x), 1].

Definition 3.1.
The identified set of τ is defined as

T* ≡ {τ ∈ R : τ = R_τ q for some q ∈ Q such that R q = p}.

In what follows, we formulate the infinite-dimensional LP (∞-LP) that characterizes T*. This program conceptualizes sharp bounds on τ from the data and the maintained assumptions (Assumptions SEL and EX). The upper and lower bounds on τ are defined as

τ̄ = sup_{q∈Q} R_τ q,    (∞-LP1)
τ̲ = inf_{q∈Q} R_τ q,    (∞-LP2)

subject to

R q = p.    (∞-LP3)

Observe that the set of constraints (∞-LP3) does not include

Σ_{e: g_e(d)=0} ∫_{U^d_{z,x}} q(e | u, x) du = p(0, d | z, x)  ∀(d, z, x) ∈ {0, 1}² × X.    (3.6)

This is because we know a priori that they are redundant in the sense that they do not further restrict the feasible set, i.e., the set of q(e | u, x)'s that satisfy all the constraints (q ∈ Q and (∞-LP3)).

Lemma 3.1.
In the linear program (∞-LP1)–(∞-LP3), the feasible set defined by q ∈ Q and (∞-LP3) is identical to the feasible set defined by q ∈ Q, (∞-LP3), and (3.6).

Theorem 3.1.
Under Assumptions SEL and EX, suppose T* is non-empty. Then, the bounds [τ̲, τ̄] in (∞-LP1)–(∞-LP3) are sharp for the target parameter τ, i.e., cl(T*) = [τ̲, τ̄], where cl(·) is the closure of a set.

The proof relies on the convexity of the feasible set {q : q ∈ Q} ∩ {q : R q = p} in the LP and the linearity of R_τ q in q, which implies that [τ̲, τ̄] is convex.

4 Finite-Dimensional Approximation

Although conceptually useful, the LP (∞-LP1)–(∞-LP3) is not feasible in practice because Q is an infinite-dimensional space. In this section, we approximate (∞-LP1)–(∞-LP3) with a finite-dimensional LP via a sieve approximation of the conditional probability q(e | u, x). We use Bernstein polynomials as the sieve basis. Bernstein polynomials are useful in imposing restrictions on the original function (Joy (2000); Chen et al. (2011); Chen et al. (2017)) and therefore have been introduced in the context of linear programming (Mogstad et al. (2018); Masten and Poirier (2018); Mogstad et al. (2019)).

Consider the following sieve approximation of q(e | u, x) using Bernstein polynomials of order K:

q(e | u, x) ≈ Σ_{k=0}^{K} θ^{e,x}_k b_k(u),

where b_k(u) ≡ (K choose k) u^k (1 − u)^{K−k} is a univariate Bernstein basis, θ^{e,x}_k ≡ θ^{e,x}_{k,K} ≡ q(e | k/K, x) is its coefficient, and K is finite. It is important to note that x can index θ, because q(e | u, x) is a saturated function of x. By the definition of the Bernstein coefficients, for any (e, x), q(e | u, x) ≥ 0 for all u if and only if θ^{e,x}_k ≥ 0 for all k. Also, Σ_{e∈E} q(e | u, x) = 1 for all (u, x) is approximately equivalent to Σ_{e∈E} θ^{e,x}_k = 1 for all (k, x). To see this, first, Σ_{e∈E} q(e | u, x) = 1 for all (u, x) implies Σ_{e∈E} θ^{e,x}_k = Σ_{e∈E} q(e | k/K, x) = 1 for all (k, x). Conversely, when Σ_{e∈E} θ^{e,x}_k = 1 for all (k, x),

Σ_{e∈E} q(e | u, x) ≈ Σ_{e∈E} Σ_{k=0}^{K} θ^{e,x}_k b_k(u) = Σ_{k=0}^{K} b_k(u) = 1

by the binomial theorem (Coolidge (1949)).
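The partition-of-unity step in the last display can be checked numerically; a minimal sketch (the order K = 5 is an arbitrary choice for illustration):

```python
from math import comb

def bernstein(k, K, u):
    """Bernstein basis polynomial b_k(u) = C(K, k) u^k (1 - u)^(K - k) on [0, 1]."""
    return comb(K, k) * u**k * (1 - u)**(K - k)

# By the binomial theorem, sum_{k=0}^{K} b_k(u) = (u + (1 - u))^K = 1 for any u,
# so a coefficient vector on the unit simplex yields a function summing to one.
K = 5
for u in (0.0, 0.25, 0.5, 0.75, 1.0):
    total = sum(bernstein(k, K, u) for k in range(K + 1))
    assert abs(total - 1.0) < 1e-12
```

The same nonnegativity logic applies term by term: each b_k(u) is nonnegative on [0, 1], so nonnegative coefficients deliver a nonnegative approximant.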
Motivated by this approximation, we formally define the following sieve space for Q:

Q_K ≡ {{Σ_{k=0}^{K} θ^{e,x}_k b_k(u)}_{e∈E} : Σ_{e∈E} θ^{e,x}_k = 1 ∀(k, x) and θ^{e,x}_k ≥ 0 ∀(e, k, x)} ⊆ Q.    (4.1)

Let K ≡ {0, ..., K} and p(z, x) ≡ Pr[Z = z, X = x]. For q ∈ Q_K, by (3.4) and (4.1), the target parameter τ = E[τ_1(Z, X)] − E[τ_0(Z, X)] can be expressed with

E[τ_d(Z, X)] = Σ_{e: g_e(d)=1} Σ_{(k,x)∈K×X} θ^{e,x}_k ∫₀¹ b_k(u) Σ_{z∈{0,1}} w_d(u, z, x) p(z, x) du ≡ Σ_{e: g_e(d)=1} Σ_{(k,x)∈K×X} θ^{e,x}_k γ^d_k(x),    (4.2)

where γ^d_k(x) ≡ ∫₀¹ b_k(u) Σ_{z∈{0,1}} w_d(u, z, x) p(z, x) du. Also, for q ∈ Q_K and D = 1, by (3.5), we have

p(y, 1 | z, x) = Σ_{e: g_e(1)=y} Σ_{k∈K} θ^{e,x}_k ∫₀^{P(z,x)} b_k(u) du ≡ Σ_{e: g_e(1)=y} Σ_{k∈K} θ^{e,x}_k δ¹_k(z, x),    (4.3)

where δ^d_k(z, x) ≡ ∫_{U^d_{z,x}} b_k(u) du.

From (4.2) and (4.3), we can expect that a finite-dimensional LP can be obtained with respect to θ^{e,x}_k. Let θ ≡ {θ^{e,x}_k}_{(e,k,x)∈E×K×X} and let

Θ_K ≡ {θ : Σ_{e∈E} θ^{e,x}_k = 1 ∀(k, x) and θ^{e,x}_k ≥ 0 ∀(e, k, x)}.

Then, we can formulate the following finite-dimensional LP that corresponds to the ∞-LP in (∞-LP1)–(∞-LP3):

τ̄_K = max_{θ∈Θ_K} Σ_{(k,x)∈K×X} {Σ_{e: g_e(1)=1} θ^{e,x}_k γ¹_k(x) − Σ_{e: g_e(0)=1} θ^{e,x}_k γ⁰_k(x)}    (LP1)
τ̲_K = min_{θ∈Θ_K} Σ_{(k,x)∈K×X} {Σ_{e: g_e(1)=1} θ^{e,x}_k γ¹_k(x) − Σ_{e: g_e(0)=1} θ^{e,x}_k γ⁰_k(x)}    (LP2)

subject to

Σ_{e: g_e(d)=1} Σ_{k∈K} θ^{e,x}_k δ^d_k(z, x) = p(1, d | z, x)  ∀(d, z, x) ∈ {0, 1}² × X.    (LP3)

This LP is computationally very easy to solve using standard algorithms, such as the simplex algorithm; conditional on x, when K = 50 and dim(θ) = 204, it takes only around 10 seconds to calculate τ̄_K and τ̲_K with moderate computing power. The important remaining question is how to choose K in practice. We discuss this issue in Section 7. Finally, it is worth noting that, extending Proposition 4 in Mogstad et al.
(2018), we may exactly calculate τ̄ and τ̲ (i.e., τ̄ = τ̄_K and τ̲ = τ̲_K) under the assumptions that (i) the weight function w_d(u, z, x) is piece-wise constant in u and (ii) the constant spline that provides the best mean squared error approximation of q(e | u, x) satisfies all the maintained assumptions (possibly including the identifying assumptions introduced later) that q(e | u, x) itself satisfies; see Mogstad et al. (2018) for details.

5 Incorporating Additional Exogenous Variables

Now we generalize the analysis in Sections 2–4 to incorporate additional exogenous variables, other than the instrument Z, with which researchers may be equipped. We show that these variables are fruitful for narrowing bounds on the target parameter. This is the first paper that introduces this type of variable in the MTE framework. This is also the first paper that shows the usefulness of these variables without necessarily combining them with assumptions related to rank similarity or rank invariance.

Let W ∈ W be such an exogenous variable. We assume that W is discrete. We show that even binary variation in W can be useful in improving the bounds. We modify our maintained assumptions to consider two different scenarios related to W: (a) W directly affects Y but not D, and (b) W directly affects both Y and D. Let Y(d, w) be the extended counterfactual outcome of Y given (d, w).

Assumption SEL_W. (a) Assumption SEL; (b) D = 1{U ≤ P(Z, X, W)}, where P(Z, X, W) ≡ Pr[D = 1 | Z, X, W].

Assumption EX_W. (a) (Y(d, w), D(z)) ⊥ (Z, W) | X; (b) (Y(d, w), D(z, w)) ⊥ (Z, W) | X.

Case (a) is where W is a reversely excluded exogenous variable, which we call a reverse IV. This type of exogenous variable was considered by Vytlacil and Yildiz (2007), Shaikh and Vytlacil (2011), and Balat and Han (2018). However, unlike those studies, we exploit W without rank similarity or rank invariance. In Case (b), we show that a reverse IV is not necessary, and W can be present in the selection equation.
This type of exogenous variable was considered by Mourifié (2015), Han and Vytlacil (2017), Vuong and Xu (2017), and Han and Lee (2019), but again, unlike these papers, we do not necessarily assume rank similarity or rank invariance. Below, we combine the existence of W (for both scenarios) with assumptions that are related to rank similarity. Another distinct feature of our approach in comparison to the prior studies is that we consider a broad class of generalized LATEs as our target parameter, including the ATE considered in those studies.

In what follows, we modify the linear programming framework from Sections 2 and 3 to reflect Assumptions SEL_W and EX_W. For notational simplicity, we focus on Case (a) here; it is straightforward to derive analogous results for Case (b). With the existence of W, the MTR is defined as

m_d(u, w, x) ≡ E[Y(d, w) | U = u, X = x] = Pr[Y(d, w) = 1 | U = u, X = x],

where W does not appear as a conditioning variable due to Assumption EX_W(a). Then, the target parameter can be expressed as

τ = E[τ_1(Z, W, X) − τ_0(Z, W, X)],

where

τ_d(z, w, x) = ∫₀¹ m_d(u, w, x) w_d(u, z, x) du.

Note that the weight w_d(u, z, x) is not a function of w due to Assumption SEL_W(a). (With Assumption SEL_W(b), the weight would instead be written as w*_d(u, z, w, x), since the propensity score is a function of W.) Now, consider a mapping (d, w) ↦ y, which is coded in the value e of ǫ ∈ E, where |E| = 16 (redefining the variable ǫ introduced in Section 3). Conveniently, let E ≡ {1, 2, ..., 16}; Table 1 lists all 16 maps. Equivalently, define y = g_e(d, w), which implies

Y(d, w) = g_ǫ(d, w)    (5.1)

and Y = g_ǫ(D, W). By (5.1) and Assumption SEL_W(a), Assumption EX_W(a) can be equivalently stated as (ǫ, U) ⊥ (Z, W) | X. Define the probability mass function of ǫ conditional on (U, X) as

q(e | u, x) ≡ Pr[ǫ = e | U = u, X = x] = Pr[ǫ = e | U = u, X = x, W = w].
Table 1: All Possible Maps from (d, w) to y

 e   Y(0,0)  Y(0,1)  Y(1,0)  Y(1,1)
 1     0       0       0       0
 2     1       0       0       0
 3     0       1       0       0
 4     1       1       0       0
 5     0       0       1       0
 6     1       0       1       0
 7     0       1       1       0
 8     1       1       1       0
 9     0       0       0       1
10     1       0       0       1
11     0       1       0       1
12     1       1       0       1
13     0       0       1       1
14     1       0       1       1
15     0       1       1       1
16     1       1       1       1

Note: e − 1 equals the decimal value of the binary string (Y(1,1), Y(1,0), Y(0,1), Y(0,0)).

Then, the MTR can be expressed as

m_d(u, w, x) = Σ_{e∈E: g_e(d,w)=1} q(e | u, x)    (5.2)

and the target parameter as

τ = ∫₀¹ E[Σ_{e: g_e(1,W)=1} q(e | u, X) w_1(u, Z, X)] du − ∫₀¹ E[Σ_{e: g_e(0,W)=1} q(e | u, X) w_0(u, Z, X)] du.    (5.3)

Recall that the sieve approximation of the conditional probability is given by q(e | u, x) ≈ Σ_{k∈K} θ^{e,x}_k b_k(u). Then, for q ∈ Q_K, by (5.3), we have τ = E[τ_1(Z, W, X)] − E[τ_0(Z, W, X)] with

E[τ_d(Z, W, X)] = Σ_{(w,x)∈W×X} Σ_{e: g_e(d,w)=1} Σ_{k∈K} θ^{e,x}_k γ^d_k(w, x),

where γ^d_k(w, x) ≡ Σ_{z∈{0,1}} p(z, w, x) ∫₀¹ b_k(u) w_d(u, z, x) du and p(z, w, x) ≡ Pr[Z = z, W = w, X = x]. In terms of the data distribution, we can derive, e.g.,

p(y, 1 | z, w, x) = Pr[Y(1, w) = y, U ≤ P(z, x) | X = x] = ∫₀^{P(z,x)} Pr[Y(1, w) = y | U = u, X = x] du = Σ_{e: g_e(1,w)=y} ∫₀^{P(z,x)} q(e | u, x) du = Σ_{e: g_e(1,w)=y} Σ_{k∈K} θ^{e,x}_k δ¹_k(z, x),

where δ^d_k(z, x) ≡ ∫_{U^d_{z,x}} b_k(u) du, as in the baseline case.

This modification yields a modified LP:

τ̄ = max_{θ∈Θ_K} Σ_{(k,w,x)∈K×W×X} {Σ_{e: g_e(1,w)=1} θ^{e,x}_k γ¹_k(w, x) − Σ_{e: g_e(0,w)=1} θ^{e,x}_k γ⁰_k(w, x)}    (LP_W1)
τ̲ = min_{θ∈Θ_K} Σ_{(k,w,x)∈K×W×X} {Σ_{e: g_e(1,w)=1} θ^{e,x}_k γ¹_k(w, x) − Σ_{e: g_e(0,w)=1} θ^{e,x}_k γ⁰_k(w, x)}    (LP_W2)

subject to

Σ_{e: g_e(d,w)=1} Σ_{k∈K} θ^{e,x}_k δ^d_k(z, x) = p(1, d | z, w, x)  ∀(d, z, w, x) ∈ {0, 1}² × W × X.    (LP_W3)

Analogous results hold under Assumptions SEL_W(b) and EX_W(b). Since the number of maps increases to 16 from the four of the baseline case, the dimension of the decision variable θ in (LP_W1)–(LP_W3) increases accordingly; with binary W and setting K = 50, we have dim(θ) = 816 = 4 × 204 conditional on x. The variation of W is in fact helpful in narrowing the bounds [τ̲, τ̄] as long as W is a relevant variable. For the remainder of Section 6, we assume W ∈ {0, 1} and suppress "conditional on X" for simplicity, unless otherwise noted.

Assumption R. (i)
Pr[Y(d, w) ≠ Y(d, w′)] > 0 for some d and w ≠ w′; (ii) either (a) P(z) > 0 for all z, or (b) P(z, w) > 0 for all z, w.

Theorem 5.1.
Under Assumptions SEL_W, EX_W, and R, the variation of W poses non-redundant constraints on θ ∈ Θ_K in the LP (LP_W1)–(LP_W3) (suppressing x).

Assumption R(i) requires the relevance of W in determining Y. Heuristically, the improvement occurs because, with R(i), the constraint matrix (i.e., the matrix multiplied to the vector θ in (LP_W3)) has a larger rank with the variation of W than without. See the proof of the theorem for a formal argument. Note that non-redundant constraints on θ do not always guarantee an improvement of the bounds in (LP_W1)–(LP_W2).

6 Identifying Assumptions

Bounds on the target parameter are generally uninformative in the absence of additional assumptions besides Assumptions SEL and EX. This is because a binary instrument has no extrapolative power for general non-compliers, e.g., always-takers and never-takers, but only identifies the effect for compliers. Prior studies have tried to overcome this challenge by imposing shape restrictions on the MTE (Cornelissen et al. (2016); Brinch et al. (2017); Kowalski (2020)), although these restrictions are not always empirically justified. Evidently, it would be useful to provide researchers with a larger variety of assumptions so that it is easier to find justifiable assumptions that suit their specific settings.

In Section 5, the existence of W (Assumptions SEL_W and EX_W) was shown to be one useful source for extrapolation. In this section, we propose identifying assumptions that can be incorporated within our framework and that help shrink the bounds on the target parameters. The shape restrictions employed in the literature can be used within our framework.
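To fix ideas before adding such constraints, the baseline finite-dimensional LP (LP1)–(LP3) can be sketched end to end. Everything in this sketch is an assumed toy setup: the labeling e = 1 + 2Y(1) + Y(0), the propensity scores, and the observed probabilities, which are generated from a u-constant q(e|u) = (0.1, 0.2, 0.3, 0.4) so that the program is feasible by design and the true ATE equals 0.1. The target is the ATE, for which γ^d_k = ∫₀¹ b_k(u) du = 1/(K + 1).

```python
import numpy as np
from math import comb
from scipy.integrate import quad
from scipy.optimize import linprog

K = 10                      # Bernstein order (the paper's timing example uses K = 50)
ks = range(K + 1)
# Latent types e = 1..4 encode (Y(1), Y(0)) via the assumed labeling
# e = 1 + 2*Y(1) + Y(0): y1[e] = g_e(1) and y0[e] = g_e(0).
y1 = {1: 0, 2: 0, 3: 1, 4: 1}
y0 = {1: 0, 2: 1, 3: 0, 4: 1}

def b(u, k):                # Bernstein basis b_k(u) of order K
    return comb(K, k) * u**k * (1 - u)**(K - k)

# Toy "data": propensity scores P(z) and probabilities p(1, d | z) implied by
# the u-constant q(e|u) = (0.1, 0.2, 0.3, 0.4).
P = {0: 0.3, 1: 0.7}
p = {(1, 0): 0.7 * P[0], (1, 1): 0.7 * P[1],              # (q3 + q4) * P(z)
     (0, 0): 0.6 * (1 - P[0]), (0, 1): 0.6 * (1 - P[1])}  # (q2 + q4) * (1 - P(z))

idx = {(e, k): (e - 1) * (K + 1) + k for e in y1 for k in ks}
c = np.zeros(4 * (K + 1))   # objective: ATE weights give gamma^d_k = 1/(K + 1)
for e in y1:
    for k in ks:
        c[idx[e, k]] = (y1[e] - y0[e]) / (K + 1)

A_eq, b_eq = [], []
for k in ks:                # theta in Theta_K: sum_e theta^e_k = 1 for each k
    row = np.zeros(len(c))
    for e in y1:
        row[idx[e, k]] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for d in (0, 1):            # data constraints (LP3); delta^d_k by quadrature
    for z in (0, 1):
        lo, hi = ((0.0, P[z]) if d == 1 else (P[z], 1.0))
        row = np.zeros(len(c))
        for e in y1:
            if (y1[e] if d == 1 else y0[e]) == 1:
                for k in ks:
                    row[idx[e, k]] = quad(b, lo, hi, args=(k,))[0]
        A_eq.append(row); b_eq.append(p[d, z])

res_lo = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
res_hi = linprog(-c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
lb, ub = res_lo.fun, -res_hi.fun
print(f"ATE bounds: [{lb:.3f}, {ub:.3f}]")   # an interval containing the true 0.1
```

Extra identifying assumptions then amount to appending rows: equalities (such as deactivated types) join A_eq, and inequalities are passed to linprog through A_ub, mirroring the appended constraints (∞-LP4)–(∞-LP5).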
We also propose other assumptions that have not been previously used for LATE extrapolation. These assumptions can be incorporated as additional equality and inequality restrictions in the linear programming: given the LP (∞-LP1)–(∞-LP3), identifying assumptions can be imposed by appending

R₁q = a₁,    (∞-LP4)
R₂q ≤ a₂,    (∞-LP5)

where R₁ and R₂ are linear operators on Q that correspond to equality and inequality constraints, respectively, and a₁ and a₂ are some vectors. When an assumption violates the true data-generating process, the identified set will be empty. This corresponds to the situation where the LP does not have a feasible solution. When we reflect sampling errors, this corresponds to the case where the confidence set is empty.

Researchers may be willing to restrict the degree of treatment heterogeneity to yield informative bounds. This restriction has not been used before in the context of the MTE framework. It may be combined with the assumptions related to the existence of W (Assumptions SEL^W, EX^W, and R). We suppress the conditioning on X throughout this subsection.

Assumption U.
For every w ∈ W, Pr[Y(1, w) ≥ Y(0, w)] = 1 or Pr[Y(1, w) ≤ Y(0, w)] = 1.

When W is not available at all, this assumption can be understood with W being degenerate. The following assumption is stronger than Assumption U.

Assumption U*. For every w, w′ ∈ W, Pr[Y(1, w) ≥ Y(0, w′)] = 1 or Pr[Y(1, w) ≤ Y(0, w′)] = 1.

Note that w and w′ may be the same or different, i.e., the uniformity is required for all combinations of (w, w′) ∈ {(0, 0), (1, 1), (1, 0), (0, 1)}. Therefore, Assumption U* implies Assumption U. Assumptions U and U* are weaker than the monotone treatment response assumption in Manski (1997) and Manski and Pepper (2000) in that they do not impose the direction of monotonicity. Assumptions U and U* are also closely related to the rank similarity and rank invariance assumptions in the literature (e.g., Chernozhukov and Hansen (2005)). Namely, given a structural model Y = 1[s(D, W) ≥ V_D], when Assumption U* is violated, rank similarity (F_{V₁|U} = F_{V₀|U}) cannot hold, and thus rank invariance (V₁ = V₀) cannot hold.

Assumptions U and U* can be imposed by "deactivating" relevant maps. For example, suppose Y(1, w) ≥ Y(0, w) almost surely for all w ∈ {0, 1} under Assumption U. This can be imposed as equality constraints (∞-LP4), i.e., in the form of R₁q = a₁, using the labeling of Table 1:

q(3 | u) = q(4 | u) = q(7 | u) = q(8 | u) = 0,
q(2 | u) = q(4 | u) = q(10 | u) = q(12 | u) = 0,

where the two sets of constraints deactivate the maps that are inconsistent with the monotonicity for w = 1 and w = 0, respectively. The corresponding θ_k^e equal zero. Then, the effective dimension of θ is reduced in (LP^W1)–(LP^W3), which yields narrower bounds. As another example, suppose the following holds almost surely under Assumption U*: Y(1, 1) ≥ Y(0, 0), Y(1, 0) ≤ Y(0, 1), Y(1, 1) ≥ Y(0, 1), and Y(1, 0) ≥ Y(0, 0). These inequalities can be imposed, respectively, as

q(2 | u) = q(4 | u) = q(6 | u) = q(8 | u) = 0,
q(5 | u) = q(6 | u) = q(13 | u) = q(14 | u) = 0,
q(3 | u) = q(4 | u) = q(7 | u) = q(8 | u) = 0,
q(2 | u) = q(4 | u) = q(10 | u) = q(12 | u) = 0.

In order to verify whether the identified set is empty, we need to check whether the feasible set of θ is empty. An efficient way to do this is to identify the vertices of the feasible polytope, if any. This process is no simpler than the simplex algorithm that we use to solve the LP. Therefore, we recommend that one first solve the LP and check whether infeasibility is reported.

It is worth mentioning that, in Assumption U (Assumption U*), the direction of monotonicity is allowed to differ across w (across (w, w′) pairs). This direction will be identified from the data. Specifically, the direction can be automatically determined from the LP by inspecting whether the LP has a feasible solution; when the wrong maps are removed, there is no feasible solution. Note that this result holds regardless of the existence of W. It is easy to see that the direction of the monotonicity coincides with the sign of the ATE. Previous work has discussed the role of the rank similarity assumption in determining the sign of the ATE (Bhattacharya et al. (2008); Shaikh and Vytlacil (2011); Han (2020b)), and the result above shows that Assumptions U and U* play a similar role in the linear programming approach. In the next two subsections, we suppress W for simplicity.

In some applications, researchers are relatively confident about the direction of treatment endogeneity. The idea of imposing the direction of the selection bias as an identifying assumption appears in Manski and Pepper (2000), who introduce monotone treatment selection (MTS), in addition to the monotone treatment response assumption mentioned above.
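The feasibility-based determination of the direction of monotonicity can be sketched with a small linear program. Everything below is a hypothetical toy version, not the paper's implementation: a piecewise-constant basis in place of the Bernstein polynomials, made-up map labels (not Table 1's numbering), and made-up moments. The point is only that imposing the direction consistent with the data leaves the equality constraints satisfiable, while imposing the wrong direction makes the LP infeasible.

```python
import numpy as np
from scipy.optimize import linprog

# Toy setup (NOT the paper's Table 1 numbering): four response maps e with
# potential outcomes (Y(0), Y(1)), piecewise-constant q(e|u) on u-bins, and
# made-up data moments p(1, d | z).
MAPS = {"never": (0, 0), "helped": (0, 1), "hurt": (1, 0), "always": (1, 1)}
EDGES = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # u-bins for q(e|u)
P = {0: 0.25, 1: 0.75}                          # propensity scores P(z)

def overlap(lo, hi, a, b):
    """Length of [lo, hi] ∩ [a, b]."""
    return max(0.0, min(hi, b) - max(lo, a))

def feasible(p_moments, banned_map):
    """Is the LP feasible when map `banned_map` is deactivated (q(e|u) = 0)?"""
    names, nk = list(MAPS), len(EDGES) - 1
    nvar = len(names) * nk                      # theta[e, k], flattened
    idx = lambda e, k: names.index(e) * nk + k
    A_eq, b_eq = [], []
    for k in range(nk):                         # sum_e q(e|u) = 1 on each bin
        row = np.zeros(nvar)
        for e in names:
            row[idx(e, k)] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for (d, z), pv in p_moments.items():        # data moments p(1, d | z)
        row = np.zeros(nvar)
        for e, (y0, y1) in MAPS.items():
            if (y1 if d == 1 else y0) == 1:
                for k in range(nk):
                    lo, hi = EDGES[k], EDGES[k + 1]
                    row[idx(e, k)] = (overlap(lo, hi, 0, P[z]) if d == 1
                                      else overlap(lo, hi, P[z], 1))
        A_eq.append(row); b_eq.append(pv)
    bnd = [(0.0, 0.0) if names[i // nk] == banned_map else (0.0, None)
           for i in range(nvar)]
    res = linprog(np.zeros(nvar), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=bnd, method="highs")   # zero objective: feasibility only
    return res.status == 0                      # 0 = feasible, 2 = infeasible

# Moments generated from a DGP in which Y(1) >= Y(0) a.s. (no "hurt" map):
p_moments = {(1, 0): 0.15, (1, 1): 0.45, (0, 0): 0.15, (0, 1): 0.05}
print(feasible(p_moments, banned_map="hurt"))    # correct direction: feasible
print(feasible(p_moments, banned_map="helped"))  # wrong direction: infeasible
```

Solving the LP once per candidate direction and keeping the feasible one mimics the data-driven determination of the direction of monotonicity described above.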
Assumption MTS. E[Y(d) | D = 1, X = x] ≥ E[Y(d) | D = 0, X = x] for d ∈ {0, 1} and x ∈ X.

Under our framework, this assumption can be imposed in the form of R₂q ≤ a₂. To see this, note that Assumption MTS is equivalent to

Σ_{e: g_e(d)=1} E[ (1 − P(Z, X))⁻¹ ∫_{P(Z,X)}^1 q(e | u) du − P(Z, X)⁻¹ ∫₀^{P(Z,X)} q(e | u) du | X = x ] ≤ 0

for (d, x) ∈ {0, 1} × X. As is clear from this expression, Assumption MTS imposes restrictions on the joint distribution of (ǫ, U).

It is straightforward to incorporate the shape restrictions on the MTR or MTE function introduced in the literature. They can be imposed via constraints on θ.

Assumption M.
For x ∈ X, m_d(u, x) is weakly increasing in u ∈ [0, 1].

Assumption C.
For x ∈ X, m_d(u, x) is weakly concave in u ∈ [0, 1].

Assumption M appears in Brinch et al. (2017) and Mogstad et al. (2018), and Assumption C appears in Mogstad et al. (2018). These assumptions can be imposed as inequality constraints (∞-LP5), i.e., in the form of R₂q ≤ a₂. For their implications on the finite-dimensional LP (LP1)–(LP3), recall that for q ∈ Q_K, the MTR satisfies

m_d(u, x) = Σ_{e: g_e(d)=1} q(e | u, x) = Σ_{k∈K} Σ_{e: g_e(d)=1} θ_k^{e,x} b_k(u).

By the properties of the Bernstein polynomials, Assumption M implies that Σ_{e: g_e(d)=1} θ_k^{e,x} is weakly increasing in k, i.e.,

Σ_{e: g_e(d)=1} θ_0^{e,x} ≤ Σ_{e: g_e(d)=1} θ_1^{e,x} ≤ ··· ≤ Σ_{e: g_e(d)=1} θ_K^{e,x}.

Assumption C implies that

Σ_{e: g_e(d)=1} θ_k^{e,x} − 2 Σ_{e: g_e(d)=1} θ_{k+1}^{e,x} + Σ_{e: g_e(d)=1} θ_{k+2}^{e,x} ≤ 0 for k = 0, ..., K − 2.

One can obtain analogous assumptions and their implications in the presence of W. Another shape restriction introduced in the literature is separability. Although it is not particularly appealing with binary Y, if one is willing to assume a separable model for the MTR, m_d(u, x) = Pr[Y(d) = 1 | U = u, X = x] = m_d^1(x) + m_d^2(u), then such a structure can be imposed on θ.

This section provides numerical results to illustrate our theoretical framework and to show the role of different identifying assumptions in improving bounds on the target parameters. For target parameters, we consider the ATE and the LATEs for always-takers (LATE-AT), never-takers (LATE-NT), and compliers (LATE-C). We calculate the bounds on them based only on the information from the data and then show how additional assumptions on the existence of additional exogenous variables, uniformity, and shape restrictions tighten the bounds.
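To make the mechanics concrete, the following sketch solves a miniature version of the finite-dimensional LP for worst-case ATE bounds, with and without the Bernstein-coefficient constraints implied by Assumption M. The response maps, propensity scores, moments, and degree K below are all hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.special import betainc
from scipy.optimize import linprog

# Miniature finite LP (hypothetical maps, moments, and scale): worst-case ATE
# bounds with and without Assumption M's monotone-coefficient constraints.
K = 10                                     # Bernstein degree (K + 1 basis terms)
MAPS = {"never": (0, 0), "helped": (0, 1), "hurt": (1, 0), "always": (1, 1)}
P = {0: 0.3, 1: 0.8}                       # propensity scores P(z)
# Moments p(1, d | z) implied by true MTRs m1 = 0.4 + 0.3u, m0 = 0.2 + 0.2u:
p = {(1, 0): 0.1335, (1, 1): 0.416, (0, 0): 0.231, (0, 1): 0.076}

names = list(MAPS)
nvar = len(names) * (K + 1)
idx = lambda e, k: names.index(e) * (K + 1) + k

def bern_cdf(k, t):
    """∫_0^t b_k(u) du for the degree-K Bernstein basis (reg. incomplete beta)."""
    return betainc(k + 1, K - k + 1, t) / (K + 1)

A_eq, b_eq = [], []
for k in range(K + 1):                     # sum_e q(e|u) = 1 for all u
    row = np.zeros(nvar)
    for e in names:
        row[idx(e, k)] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for (d, z), pv in p.items():               # moment constraints, as in (LP3)
    row = np.zeros(nvar)
    for e, (y0, y1) in MAPS.items():
        if (y1 if d == 1 else y0) == 1:
            for k in range(K + 1):
                row[idx(e, k)] = (bern_cdf(k, P[z]) if d == 1
                                  else bern_cdf(k, 1.0) - bern_cdf(k, P[z]))
    A_eq.append(row); b_eq.append(pv)

c = np.zeros(nvar)                         # ATE weights: ∫_0^1 b_k = 1/(K+1)
for e, (y0, y1) in MAPS.items():
    for k in range(K + 1):
        c[idx(e, k)] = (y1 - y0) / (K + 1)

A_ub, b_ub = [], []                        # Assumption M: the coefficient sums
for d in (0, 1):                           # are weakly increasing in k
    for k in range(K):
        row = np.zeros(nvar)
        for e, (y0, y1) in MAPS.items():
            if (y1 if d == 1 else y0) == 1:
                row[idx(e, k)] += 1.0
                row[idx(e, k + 1)] -= 1.0
        A_ub.append(row); b_ub.append(0.0)

def ate_bounds(use_M):
    kw = dict(A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * nvar, method="highs")
    if use_M:
        kw.update(A_ub=np.array(A_ub), b_ub=np.array(b_ub))
    return linprog(c, **kw).fun, -linprog(-c, **kw).fun

print("worst-case ATE bounds:", ate_bounds(use_M=False))
print("with Assumption M:    ", ate_bounds(use_M=True))
```

In this toy DGP the true ATE is 0.25; both intervals contain it, and adding the monotonicity constraints can only shrink the feasible set, hence weakly tighten the bounds.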
We generate the observables (Y, D, Z, X, W) from the following data-generating process (DGP). We assume that W is a reverse IV, i.e., we maintain Assumptions SEL^W(a) and EX^W(a). We allow the covariate X to be endogenous. All the variables are set to be binary, with Pr[Z = 1] = 0.5, Pr[X = 1] = 0.·, and Pr[W = 1] = 0.4. The treatment D is determined by Z and X through the threshold-crossing model specified in Assumption SEL^W(a), where the propensity scores P(z, x) are specified as follows: P(0, 0) = 0.·, P(1, 0) = 0.·, P(0, 1) = 0.·, and P(1, 1) = 0.7. The outcome Y is generated from (D, X, W) through Y = DY₁ + (1 − D)Y₀, where

Y_d = 1[m_d(U, X, W) ≥ ǫ]    (7.1)

and each of the eight MTR functions m_d(u, x, w), (d, x, w) ∈ {0, 1}³, is specified as a linear combination of the five basis functions of the degree-4 Bernstein approximation, where b_k^K stands for the k-th basis function in the Bernstein approximation of degree K. These MTR functions are chosen to be consistent with Assumptions M and C, i.e., to be positively monotone and weakly concave in u for all (d, x, w) ∈ {0, 1}³. Also, the DGP in (7.1) satisfies Assumption U* because ǫ does not depend on d and m₁(u, x, w) > m₀(u, x, w) for all (x, w) ∈ {0, 1}². Following the second example in Section 6.1, the DGP satisfies the following uniform order for the counterfactual outcomes Y(d, w): Y(1, 1) ≥ Y(0, 1) ≥ Y(1, 0) ≥ Y(0, 0) a.s. We generate a sample containing 1,000,000 observations and choose K = 50. We choose the large sample size to mimic the population. Our choice of K is discussed below. The number of unknown parameters θ in the linear programming is equal to dim(θ) = |E| × |X| × (K + 1).

Figure 1 contains the bounds on the ATE under different assumptions. The true ATE value is 0.21, depicted as the solid red line in the figure. First, the worst-case bounds on the ATE with no additional assumptions (and without using variation from W) are [−0.·, 0.·]. Without W, we have |E| = 4, and the linear programming is solved with dim(θ) = |E| × |X| × (K + 1) = 4 × 2 × 51 = 408.

For comparison, we calculate the bounds that incorporate the existence of W. We build up the target parameters with mappings involving W and use the data distribution conditional on W = 0 and W = 1 as the constraints. Using constraints conditional on different values of W allows us to fully exploit the variation from W; see (LP^W1)–(LP^W3). With W (Section 5), we have |E| = 16, which gives dim(θ) = |E| × |X| × (K + 1) = 16 × 2 × 51 = 1,632. When W is exploited, the bounds on the ATE are [−0.·, 0.·], narrower than without W. This result is consistent with our theoretical finding in Theorem 5.1 that W can help tighten the bounds as long as it is a relevant variable. Nonetheless, these worst-case bounds are not that informative; e.g., they do not determine the sign of the ATE.

Next, we impose the uniformity assumption without W (Assumption U) and with W (Assumption U*). First, under Assumption U, the bounds on the ATE are tightened, as some mappings occur with probability zero, reducing the dimension of θ. As mentioned in Section 6.1, the direction of monotonicity in Assumption U (i.e., which mapping does not occur) is determined by the LPs: we solve the LPs with different directions imposed and then choose the one with a feasible solution. This means that the chosen direction of monotonicity is consistent with the DGP. Under Assumption U, we obtain narrower bounds [0.·, 0.·], and under Assumption U*, the bounds become [0.·, 0.·]; the bounds under Assumptions U and U* are depicted as violet and green dashed lines, respectively. Both sets of bounds identify the sign of the ATE, consistent with the theoretical discussion. While their lower bounds coincide, Assumption U* yields a lower upper bound compared to Assumption U.

Figure 1: Bounds on the ATE under Different Assumptions

Next, we impose the shape restrictions (Assumptions M and C). As discussed in Section 6.3, these assumptions can be easily incorporated in the linear programming by directly imposing inequality constraints on θ. Under these assumptions (and the existence of W), the bounds on the ATE shrink to [0.·, 0.·]. The shape restrictions function differently from the uniformity assumptions in the linear programming: unlike the uniformity assumption, which maintains the ranking of individuals across counterfactual groups, the shape restrictions directly control the MTR functions. Finally, the dash-dotted black line in Figure 1 shows the bounds on the ATE under the uniformity assumption and the shape restrictions combined. These assumptions all together yield the narrowest bounds, [0.·, 0.·].

7.2.2 Generalized LATEs

Next, we construct bounds on the generalized LATEs. The original definition of the LATE is the ATE for compliers (C), but researchers may also have an interest in other local treatment effects. We consider two other parameters: the LATEs for always-takers (AT) and never-takers (NT). Figure 2 displays the bounds on the LATE-AT, LATE-C, and LATE-NT under different assumptions. This analysis is analogous to that for the ATE. Since the covariate X affects the decision of compliance, to avoid confusion in the definition of the compliance groups, we instead establish bounds on the LATEs conditional on X. We draw the conditional MTE functions with solid red lines in both panels as a reference.

It is known that there exist no defiers in the DGP. When there is no defier, the LATE-C is point identified and has an analytical expression as the two-stage least squares estimand. As a confirmation exercise, we numerically calculate the LATE-C using the linear programming, which yields point estimates, as shown in Figure 2. The true LATE-Cs conditional on X = 0 and X = 1 are equal to 0.21 and 0.22, respectively. Regardless of the assumptions imposed, the estimates remain close to the true values throughout.

The true values of the conditional LATE-AT and LATE-NT are 0.15 and 0.28 when X = 0, and 0.14 and 0.25 when X = 1. First, as before, we consider the worst-case bounds where the existence of W is ignored versus where W is taken into account. Without W, we get the bounds [−0.·, 0.
24] and [−0.·, 0.72] on the LATE-AT and the LATE-NT conditional on X = 0, and [−0.·, 0.52] and [−0.·, 0.43] conditional on X = 1; with W, we get the bounds [−0.·, 0.2] and [−0.·, 0.7] on the LATE-AT and the LATE-NT conditional on X = 0, and [−0.·, 0.5] and [−0.·, 0.41] conditional on X = 1. The upper bounds with W are lower than the ones without W, although the gain is not substantial. As for the lower bounds, the one on the LATE-AT conditional on X = 0 is significantly higher with W than without W, and all the others differ negligibly with and without W.

We then apply Assumptions U and U*. Under Assumption U, the bounds on the LATE-AT and the LATE-NT become [0, 0.24] and [0, 0.72] conditional on X = 0, and [0, 0.52] and [0, 0.43] conditional on X = 1; when W is used and Assumption U* is applied, the bounds shrink to [0, 0.18] and [0, 0.47] conditional on X = 0, and [0, 0.36] and [0, 0.35] conditional on X = 1. As before, Assumptions U and U* determine the signs of the effects.

When the shape restrictions are imposed instead, the bounds on the LATE-AT and the LATE-NT improve to [0.·, 0.17] and [0.·, 0.3] conditional on X = 0, and [0.·, 0.·] and [0.·, 0.31] conditional on X = 1. Under Assumption U* combined with the shape restrictions, we get the narrowest bounds of [0.·, 0.15] and [0.·, 0.3] conditional on X = 0, and [0.·, 0.26] and [0.·, 0.31] conditional on X = 1.

Figure 2: Bounds on the LATEs under Different Assumptions

7.3 The Choice of K

As a tuning parameter in the LP, we need to choose the order of the Bernstein polynomials, K. In general, K should be chosen based on the sample size and the smoothness of the function to be approximated, in our case q(·). The choice of the sieve dimension, or more generally of regularization parameters, is a difficult question (Chen (2007)), and developing data-driven procedures is a subject of ongoing research in various nonparametric contexts of point identification; see, e.g., Chen and Christensen (2015) and Han (2020a). In this partial identification setup, we propose the following heuristic and conservative approach, which is in spirit consistent with the very motivation of partial identification.

First, we do not want to claim any prior knowledge about the smoothness of q(·), because it is the distribution of a latent variable. Because K determines the dimension of the unknown parameter θ in the linear programming, the width of the bounds tends to increase with K. At the same time, the computational burden increases with K. One interesting numerical finding is that, when K is sufficiently large, the increase of the width slows down and the bounds become stable. This suggests that we may be able to conservatively choose a K that acknowledges our lack of knowledge of the smoothness but, at the same time, produces a reasonable computational task for the linear programming.

To illustrate this point, we consider the conditional MTE and ATE as the target parameters and show how their bounds change as we increase K. We consider the MTE because it is a fundamental parameter that generates other target parameters, and hence it is important to understand the sensitivity of its bounds to K. Figures 3 and 4 show the evolution of the bounds on the MTE and the ATE as K grows.
When K = 5, the bounds are narrow. Although it may be tempting to choose this value of K, the temptation should be resisted, as this choice may be subject to misspecification of the smoothness. When K increases beyond 30, the bounds start to converge and become stable. We choose K = 50, which is the choice made in our previous numerical exercises.

As discussed in Section B.1 in the Appendix, it is worth mentioning that the bounds on the MTE are point-wise sharp but not uniformly sharp. The graph for the MTE bounds is drawn by calculating the point-wise sharp bounds on the MTE at each point of u (after properly discretizing it) and then connecting them. Therefore, these bounds should not be viewed as uniformly sharp bounds. Nonetheless, the graph is still useful for the purpose of our illustration. Given the current DGP, we find that there are no uniformly sharp bounds for the MTE. Note that, with larger K, some LP solvers may ignore coefficients of negligible magnitude, which create a large range of magnitudes in the coefficient matrix; it may be advisable to simultaneously rescale a column and a row to achieve a smaller range of coefficients.

Figure 3: Bounds on MTE with Different K

It is widely recognized in the empirical literature that health insurance coverage can be an essential factor in the utilization of medical services (Hurd and McGarry (1997); Dunlop et al. (2002); Finkelstein et al. (2012); Taubman et al. (2014)). Prior studies on this topic typically make use of parametric econometric models for the analysis. In their application, Han and Lee (2019) relax this common approach by introducing a semiparametric bivariate probit model to measure the average effect of insurance coverage on patients' medical visits. By applying our theoretical framework of partial identification, we further relax the parametric and semiparametric structures used in these studies.
More importantly, we try to understand how much can be learned about the effects of insurance under various counterfactual policies by learning the effects for different compliance groups.

We use the 2010 wave of the Medical Expenditure Panel Survey (MEPS) and focus on all the medical visits in January 2010. The sample is restricted to individuals aged between 25 and 64 and excludes those who had any kind of federal or state insurance in 2010. The outcome Y is a binary variable indicating whether or not an individual has visited a doctor's office; the treatment D is whether an individual has private insurance. We choose an indicator for whether the individual's firm has multiple locations as the IV Z. This IV reflects the size of the firm, and larger firms are more likely to provide fringe benefits, including health insurance. On the other hand, the number of branches of a firm does not directly affect employee decisions about medical visits. To justify the IV, self-employed individuals are excluded. For potentially endogenous covariates X, we include an indicator for age 45 and older, gender, an indicator for income above the median, and health condition. Lastly, for an exogenous covariate W, we use the percentage of workers who are provided with paid sick leave benefits within each industry. Following Han and Lee (2019), we assume W satisfies Assumptions SEL^W(b) and EX^W(b) once X is controlled for. We construct a categorical variable such that W = 0 for less than 50%, W = 1 for between 50% and 80%, and W = 2 for above 80%. Table 2 summarizes the observables.

Figure 4: Bounds on ATE with Different K

Table 2: Summary Statistics

Variable  Description                    Mean  S.D.  Min  Max
Y         Whether or not visit doctors   0.18  0.39  0    1
D         Whether or not have insurance  0.66  0.47  0    1
Z         Firm has multiple locations    0.68  0.47  0    1
X         Age above 45                   0.41  0.49  0    1
          Gender                         0.50  0.50  0    1
          Income above median            0.50  0.50  0    1
          Good health                    0.36  0.48  0    1
W         Paid sick leave provision      1.25  0.73  0    2
Number of observations = 7,555
First, as a benchmark, we report that the LATE-C estimate calculated via our linear programming approach is a singleton, 0.17, which is in fact identical to the 2SLS estimate we calculate separately. In what follows, we extrapolate this LATE beyond the complier group to the ATE. The presence of covariates reduces the effective sample size and thus leads to larger sampling errors in estimating the vector p of the ∞-LP (∞-LP1)–(∞-LP3). This may create inconsistencies in the set of equality constraints (∞-LP3), resulting in no feasible solution. This is in fact what happens in this application. To resolve this estimation problem, we introduce a slackness parameter η and modify (∞-LP3) so that, with some slackness, it satisfies

‖Rq − p‖ ≤ η.    (8.1)

A similarly modified constraint can then be used in the finite-dimensional LP after approximation, as well as in combination with (∞-LP4)–(∞-LP5). The appropriate value of η should depend on the sample size, the dimension of covariates, and the dimension of the unknown parameter θ. To explain the latter: as K increases, the dimension of θ (i.e., the number of unknowns) increases, while the number of constraints (i.e., simultaneous equations for the unknowns) is fixed. Therefore, as K increases, the chance that the LP has no feasible solution decreases. Based on the method discussed in the previous section, we set K = 50 in this application.

We calculate worst-case bounds on the ATE, as well as bounds after imposing Assumptions U and M and after using the covariate W. Under Assumption U, the data rule out the possibility that Y(0) > Y(1), indicating that individuals with private insurance are more likely to visit a doctor. Assumption M imposes that the MTR function is weakly increasing in U = u. Usually, U is interpreted as the latent cost of obtaining treatment. Kowalski (2020) interprets U as eligibility in a similar setup for Medicaid insurance. The eligibility for Medicaid is related to income level and age.
In our setup, because the treatment is having private insurance, we interpret eligibility as health status, which is reflected in the premium. Interpreting U as a latent cost (e.g., the premium) of getting private insurance, Assumption M states that the chance of making a medical visit (with or without insurance) increases for those with a higher cost. This is a reasonable assumption given that sicker individuals typically face higher insurance costs and also visit doctors more often. We choose the slackness parameter η to be 0.05 under no assumption and under Assumption U, and 0.07 when Assumption M is added. When W is used, we choose η to be 0.08 under no assumption and 0.1 with Assumption M.

The bounds on the ATE are shown in Figure 5. The worst-case bounds on the ATE equal [−0.·, 0.·], which are tightened to [0.·, 0.37] under Assumption U and [0.·, 0.37] under Assumption M. It is interesting to note that the identifying power of the uniformity assumption and the shape restriction is similar in this example. When both Assumption U and Assumption M are imposed, the bounds are further tightened to [0.·, 0.·]. The bounds are narrower when W is exploited than when it is not, although the gains are not large.

Next, we consider the always-taker, complier, and never-taker LATEs. We consider these generalized LATEs conditional on X = x. Specifically, we focus on the treatment effects for males above age 45, with income below the median and bad health conditions. The results are shown in Table 3 and depicted in Figure 6. The LATE-C is analytically calculated via 2SLS. (When the alternative constraint (8.1) is used with the slackness parameter, the LATE-C is no longer a singleton.) For the LATE-AT and LATE-NT, Assumption U identifies the sign of the effects,
Table 3: Bounds on the Generalized LATEs

                       No Assumption   U          M              U+M        W              M+W
LATE-AT                [-0.93, 0.21]   [0, 0.20]  [-0.07, 0.15]  [0, 0.15]  [-0.76, 0.20]  [-0.01, 0.19]
LATE-C                 0.01            0.01       0.01           0.01       0.01           0.01
LATE-NT                [-0.24, 0.98]   [0, 0.97]  [-0.14, 0.95]  [0, 0.95]  [-0.22, 0.84]  [-0.08, 0.82]
Slackness parameter η  0.·             0.·        0.·            0.·        0.·            0.·

and Assumption M nearly identifies it. Using the variation in W mostly improves the bounds compared to the ones without it. From these results, we can conclude that possessing private insurance has the greatest effect on medical visits for never-takers, i.e., people who face higher insurance costs. This provides a policy implication: lowering the cost of private insurance is important, because high costs might hinder those with the most need from receiving enough medical services.

A Examples of the Target Parameters
Table 4 contains the list of target parameters. The table is taken from Mogstad et al. (2018).
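To illustrate how the weights in Table 4 generate target parameters as weighted averages of the MTE, the following sketch evaluates the ATE and the generalized LATEs for a hypothetical MTE function and made-up propensity scores (all numbers are for illustration only).

```python
import numpy as np

# Hypothetical MTE and propensity scores, for illustration only.
mte = lambda u: 0.2 + 0.1 * u            # m1(u) - m0(u)
p0, p1 = 0.3, 0.8                        # P(z_0) < P(z_1)

u = np.linspace(0.0, 1.0, 100_001)

def integral(y):
    """Trapezoid rule on the u grid."""
    return float(np.sum(0.5 * (y[:-1] + y[1:]) * np.diff(u)))

def weighted_avg(indicator):
    """Target parameter = ∫ MTE(u) 1[u ∈ set] du / ∫ 1[u ∈ set] du."""
    w = indicator(u).astype(float)
    return integral(mte(u) * w) / integral(w)

ate     = weighted_avg(lambda u: np.ones_like(u))        # u in [0, 1]
late_c  = weighted_avg(lambda u: (u >= p0) & (u <= p1))  # compliers
late_at = weighted_avg(lambda u: u <= p0)                # always-takers
late_nt = weighted_avg(lambda u: u >= p1)                # never-takers
print(f"ATE={ate:.3f} LATE-C={late_c:.3f} "
      f"LATE-AT={late_at:.3f} LATE-NT={late_nt:.3f}")
```

Each parameter differs only in the region of u over which the same MTE is averaged, which is exactly the structure that Table 4's weights encode.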
B More Discussions
B.1 Point-wise and Uniform Sharp Bounds on MTE
In Section 2, we provided some examples of target parameters. The building block for these parameters is the MTE, m₁(u) − m₀(u) (suppressing x). Heckman and Vytlacil (2005) show why this fundamental parameter can be of independent interest. Unlike the other target parameters proposed here, we may want to allow the MTE to be a function of u (beyond evaluating it at a fixed u). In this section, we discuss the subtle issue of point-wise and uniform sharp bounds on τ_MTE(u) ≡ m₁(u) − m₀(u) as a function of u.

Suppress X for simplicity. Recall q(u) ≡ {q(e | u)}_{e∈E} and Q ≡ {q(·): Σ_e q(e | u) = 1 ∀u and q(e | u) ≥ 0 ∀(e, u)}. Let M be the set of MTE functions, i.e.,

M ≡ { m₁(·) − m₀(·) : m_d(·) = E[Y_d | U = ·] = Σ_{e∈E: g_e(d)=1} q(e | ·) ∀d ∈ {0, 1} for q(·) ∈ Q }.

Table 4: Examples of the Target Parameters

– Average Treatment Effect (ATE): expression E[Y(1) − Y(0)]; range of u: [0, 1]; weight w_d(u, z, x) = 2d − 1.
– LATE for Compliers (LATE-C), given x ∈ X: expression E{Y(1) − Y(0) | u ∈ [P(z₀, x), P(z₁, x)]}; range [P(z₀, x), P(z₁, x)]; weight (2d − 1) · 1[u ∈ [P(z₀, x), P(z₁, x)]] / (P(z₁, x) − P(z₀, x)).
– LATE for Always-Takers (LATE-AT), given x ∈ X: expression E{Y(1) − Y(0) | u ∈ [0, P(z₀, x)]}; range [0, P(z₀, x)]; weight (2d − 1) · 1[u ∈ [0, P(z₀, x)]] / P(z₀, x).
– LATE for Never-Takers (LATE-NT), given x ∈ X: expression E{Y(1) − Y(0) | u ∈ [P(z₁, x), 1]}; range [P(z₁, x), 1]; weight (2d − 1) · 1[u ∈ [P(z₁, x), 1]] / (1 − P(z₁, x)).
– LATE for [u̲, ū]: expression E[Y(1) − Y(0) | u ∈ [u̲, ū]]; range [u̲, ū]; weight (2d − 1) · 1[u ∈ [u̲, ū]] / (ū − u̲).
– Marginal Treatment Effect (MTE)*: expression E[Y(1) − Y(0) | u′]; evaluation point u′; weight (2d − 1) · 1[u = u′].
– Policy Relevant Treatment Effect (PRTE) for a new policy (P′, Z′): expression (E(Y′) − E(Y)) / (E(D′) − E(D)); range [0, 1]; weight (2d − 1) · (Pr[u ≤ P′(z′)] − Pr[u ≤ P(z)]) / (E[P′(Z′)] − E[P(Z)]).

* The MTE uses the Dirac measure at u′, while the other target parameters use the Lebesgue measure on [0, 1].

The bounds on τ_MTE ∈ M in the ∞-LP are given by using a Dirac delta function as a weight. Therefore, given an evaluation point u ∈ [0, 1], the LP (∞-LP1)–(∞-LP3) can be simplified as follows, defining the upper and lower bounds τ̄(u) and τ̲(u) (being explicit about the evaluation point) on τ_MTE(u):

τ̄(u) = sup_{q∈Q} Σ_{e∈E: g_e(1)=1} q(e | u) − Σ_{e∈E: g_e(0)=1} q(e | u),    (B.1)
τ̲(u) = inf_{q∈Q} Σ_{e∈E: g_e(1)=1} q(e | u) − Σ_{e∈E: g_e(0)=1} q(e | u),    (B.2)

subject to

Σ_{e: g_e(d)=1} ∫_{U_{dz}} q(e | ũ) dũ = p(1, d | z)  ∀(d, z) ∈ {0, 1}².    (B.3)

Then, for any fixed u ∈ [0, 1], τ̲(u) ≤ τ_MTE(u) ≤ τ̄(u). We argue that these bounds are point-wise sharp but not necessarily uniformly sharp for τ_MTE(·). See Firpo and Ridder (2019) for related definitions of point-wise and uniform sharpness.

Definition B.1 (Point-wise Sharpness). τ̄(·) and τ̲(·) are point-wise sharp if, for any ū ∈ [0, 1], there exist τ̄_MTE,ū, τ̲_MTE,ū ∈ M such that τ̄(ū) = τ̄_MTE,ū(ū) and τ̲(ū) = τ̲_MTE,ū(ū).

Theorem B.1. τ̄(·) and τ̲(·) are point-wise sharp bounds on τ_MTE(·).

The proofs of this and other theorems appear later. Note that point-wise bounds will maintain some properties of an MTE function, but not all. For uniform sharpness, τ̄(·) and τ̲(·) themselves have to be MTE functions on [0, 1], i.e., τ̄(·) and τ̲(·) should be elements of M.

Definition B.2 (Uniform Sharpness). τ̄(·) and τ̲(·) are uniformly sharp if τ̄(·), τ̲(·) ∈ M.

The following theorem is almost immediate.
Theorem B.2. τ̄(·) is uniformly sharp if and only if there exists q*(·) ∈ Q such that q*(·) is in the feasible set and τ̄(u) = Σ_{e∈E: g_e(1)=1} q*(e | u) − Σ_{e∈E: g_e(0)=1} q*(e | u) for all u ∈ [0, 1]. Similarly, τ̲(·) is uniformly sharp if and only if there exists q†(·) ∈ Q such that q†(·) is in the feasible set and τ̲(u) = Σ_{e∈E: g_e(1)=1} q†(e | u) − Σ_{e∈E: g_e(0)=1} q†(e | u) for all u ∈ [0, 1].

The following is a more useful result that relates the point-wise bounds to uniform bounds. For each ū, let q*_ū(·) and q†_ū(·) be the point-wise maximizer and minimizer of (B.1)–(B.3), respectively.

Corollary B.1. τ̄(·) is uniformly sharp if and only if there exists q*(·) ∈ Q such that q*(·) is in the feasible set and q*_ū(ū) = q*(ū) for all ū ∈ [0, 1]. Also, τ̲(·) is uniformly sharp if and only if there exists q†(·) ∈ Q such that q†(·) is in the feasible set and q†_ū(ū) = q†(ū) for all ū ∈ [0, 1].

Based on the Bernstein approximation we introduce, this corollary implies that, for a uniform upper bound to exist, there should exist a common maximizer θ* such that θ* is in the feasible set of the LP and τ̄(u) = Σ_{k∈K} { Σ_{e∈E: g_e(1)=1} θ_k^{e*} b_k(u) − Σ_{e∈E: g_e(0)=1} θ_k^{e*} b_k(u) } for all u. In other words, if θ*_ū is the maximizer of the LP for a given ū, then there should exist θ* in the feasible set such that θ*_ū = θ* for all ū ∈ [0, 1]. In practice, this can be checked by taking a fine grid of u in [0,
1] and checking whether θ*_u is constant across all values in the grid.

B.2 Inference
It is important to construct a confidence set for our target parameter or its bounds in order to account for sampling variation in measuring treatment effectiveness. It would also be interesting to develop a procedure for specification tests of the identifying assumptions discussed in Section 6. The problem of statistical inference when the identified set is constructed via linear programming has been studied in, e.g., Deb et al. (2017), Mogstad et al. (2018), Hsieh et al. (2018), and Torgovitsky (2019b). Among these papers, Mogstad et al. (2017)'s setting is closest to ours, and their inference procedure can be directly adapted to our problem. Instead of repeating their result here, we only briefly discuss the procedure.

Recall that q(u, x) ≡ {q(e | u, x)}_{e∈E} is the latent distribution, p ≡ {p(1, d | z, x)}_{d,z,x} is the distribution of the data, and R_τ, R, R₁, and R₂ denote the linear operators of q(·) that correspond to the target parameter and the constraints. Consider the following hypotheses:

H₀: p ∈ P₀,  H₁: p ∈ P \ P₀,

where P₀ ≡ {p ∈ P : R̄q = ā for some q ∈ Q}, with the stacked operator and vector

R̄ ≡ (R_τ′, R′, R₁′, R₂′)′,  ā ≡ (τ, p′, a₁′, a₂′)′.

Suppose R̂ and â are sample counterparts of R̄ and ā. Then, a minimum distance test statistic can be constructed as

T_n(τ) ≡ inf_{q∈Q_K} √n ‖R̂q − â‖.

Similar to Mogstad et al. (2017), T_n(τ) is the solution to a convex optimization problem that can be reformulated as an LP using duality. A (1 − α)-confidence set for the target parameter τ can be constructed by inverting the test:

CS_{1−α} ≡ {τ : T_n(τ) ≤ ĉ_{1−α}},

where ĉ_{1−α} is the critical value of the test. The resulting object is of independent interest, and it can further be used to conduct specification tests. The large-sample theory for T_n(τ), as well as a bootstrap procedure to calculate ĉ_{1−α}, directly follows from Mogstad et al. (2017) and is omitted for succinctness.
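The test-inversion idea can be sketched as follows, again in a miniature model with hypothetical ingredients throughout: a piecewise-constant basis, made-up moments and sample size, the L1 norm in place of the paper's choice, and a placeholder critical value instead of the bootstrap one.

```python
import numpy as np
from scipy.optimize import linprog

# Toy version of the minimum-distance statistic T_n(tau); all ingredients are
# hypothetical placeholders, not the paper's implementation.
MAPS = {"never": (0, 0), "helped": (0, 1), "hurt": (1, 0), "always": (1, 1)}
EDGES = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
P = {0: 0.25, 1: 0.75}
p_hat = {(1, 0): 0.15, (1, 1): 0.45, (0, 0): 0.15, (0, 1): 0.05}
n = 10_000                                      # pretend sample size

names, nk = list(MAPS), len(EDGES) - 1
ntheta = len(names) * nk
idx = lambda e, k: names.index(e) * nk + k

def overlap(lo, hi, a, b):
    return max(0.0, min(hi, b) - max(lo, a))

rows, targets = [], []                          # stacked operator (R_tau', R')'
row = np.zeros(ntheta)                          # R_tau: ATE = ∫ (q_helped - q_hurt)
for k in range(nk):
    w = EDGES[k + 1] - EDGES[k]
    row[idx("helped", k)] = w
    row[idx("hurt", k)] = -w
rows.append(row); targets.append(None)          # target value tau filled per call
for (d, z), pv in p_hat.items():                # R: data moments
    row = np.zeros(ntheta)
    for e, (y0, y1) in MAPS.items():
        if (y1 if d == 1 else y0) == 1:
            for k in range(nk):
                lo, hi = EDGES[k], EDGES[k + 1]
                row[idx(e, k)] = (overlap(lo, hi, 0, P[z]) if d == 1
                                  else overlap(lo, hi, P[z], 1))
    rows.append(row); targets.append(pv)

def T_n(tau):
    """T_n(tau) = inf_{q in Q_K} sqrt(n) * ||R_hat q - a_hat||_1, via an LP."""
    m = len(rows)
    nvar = ntheta + m                           # theta, plus slacks t_j >= |resid_j|
    a_vec = [tau if t is None else t for t in targets]
    A_ub, b_ub = [], []
    for j, (r_j, a_j) in enumerate(zip(rows, a_vec)):
        for sgn in (1.0, -1.0):                 # +/-(r_j . theta - a_j) <= t_j
            r = np.zeros(nvar)
            r[:ntheta] = sgn * r_j
            r[ntheta + j] = -1.0
            A_ub.append(r); b_ub.append(sgn * a_j)
    A_eq = []                                   # q in Q_K: sum_e theta[e, k] = 1
    for k in range(nk):
        r = np.zeros(nvar)
        for e in names:
            r[idx(e, k)] = 1.0
        A_eq.append(r)
    cost = np.concatenate([np.zeros(ntheta), np.sqrt(n) * np.ones(m)])
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.ones(nk),
                  bounds=[(0, None)] * nvar, method="highs")
    return res.fun

c_hat = 1.0                                     # placeholder critical value
grid = np.linspace(-1.0, 1.0, 41)
cs = [t for t in grid if T_n(t) <= c_hat]       # inverted test: confidence set
print("confidence set ~ [", min(cs), ",", max(cs), "]")
```

With population-level moments the statistic is zero exactly on the identified set, so the inverted test returns (a grid approximation of) that set slightly inflated by the critical value, which is the shape of the actual procedure.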
B.3 Linear Programming with Continuous X

Suppose $X$ is continuously distributed and assume $\mathcal{X} = [0,1]^{d_X}$. Let $q(u,x) \equiv \{q(e|u,x)\}_{e \in \mathcal{E}}$ and $p(x) \equiv \{p(1,d|z,x)\}_{d,z}$. Recall that $R_\tau: \mathcal{Q} \to \mathbb{R}$ and $R: \mathcal{Q} \to \mathbb{R}^{d_p}$ are the linear operators of $q(\cdot)$, where $d_p$ is the dimension of $p$. Consider the following LP:
$$\overline{\tau} = \sup_{q \in \mathcal{Q}} R_\tau q, \quad (B.4)$$
$$\underline{\tau} = \inf_{q \in \mathcal{Q}} R_\tau q, \quad (B.5)$$
$$\text{s.t.} \quad (Rq)(x) = p(x) \text{ for all } x \in \mathcal{X}, \quad (B.6)$$
where $(Rq)(x) = p(x)$ emphasizes the dependence on $x$ and thus contains infinitely many constraints. Therefore, this LP is infinite-dimensional because of not only the decision variable but also the constraints. The problem with $q$ is addressed with the sieve approximation. To address the problem with continuous $X$, we proceed as follows. Note that, in general, $E|h(X)| = 0$ if and only if $h(x) = 0$ almost everywhere in $\mathcal{X}$. Therefore, each $j$-th equation in the equality restrictions (B.6) can be replaced by
$$E\big|(Rq)_j(X) - p_j(X)\big| = 0.$$
Now, for the sieve space of $\mathcal{Q}$, we consider
$$\tilde{\mathcal{Q}}_{\tilde{K}} \equiv \bigg\{ \Big\{ \sum_{k=1}^{\tilde{K}} \theta_{ek} b_k(u,x) \Big\}_{e \in \mathcal{E}} : \sum_{e \in \mathcal{E}} \theta_{ek} = 1 \ \forall k \in \tilde{\mathcal{K}} \text{ and } \theta_{ek} \ge 0 \ \forall (e,k) \bigg\} \subseteq \mathcal{Q}, \quad (B.7)$$
where $b_k(u,x)$ is a bivariate Bernstein polynomial and $\tilde{\mathcal{K}} \equiv \{1, \dots, \tilde{K}\}$. Then,
$$E[\tau_d(Z,X)] = \sum_{e: g_e(d)=1} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \int E[b_k(u,X) w_d(u,Z,X)]\,du \equiv \sum_{e: g_e(d)=1} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \tilde{\gamma}_{dk}, \quad (B.8)$$
where $\tilde{\gamma}_{dk} \equiv \int E[b_k(u,X) w_d(u,Z,X)]\,du$. Also,
$$p(y,d|z,x) = \sum_{e: g_e(d)=y} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \int_{\mathcal{U}_{dz,x}} b_k(u,x)\,du \equiv \sum_{e: g_e(d)=y} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \tilde{\delta}_{dk}(z,x), \quad (B.9)$$
where $\tilde{\delta}_{dk}(z,x) \equiv \int_{\mathcal{U}_{dz,x}} b_k(u,x)\,du$. Let $\tilde{\theta} \equiv \{\theta_{ek}\}_{(e,k) \in \mathcal{E} \times \tilde{\mathcal{K}}}$ and let
$$\tilde{\Theta}_{\tilde{K}} \equiv \bigg\{ \tilde{\theta} : \sum_{e \in \mathcal{E}} \theta_{ek} = 1 \ \forall k \in \tilde{\mathcal{K}} \text{ and } \theta_{ek} \ge 0 \ \forall (e,k) \bigg\}.$$
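The coefficients $\tilde{\delta}_{dk}(z,x)$ involve integrals of Bernstein basis functions over sub-intervals of $[0,1]$, which admit a closed form, so no numerical quadrature is needed. A minimal univariate sketch (the bivariate basis $b_k(u,x)$ is a tensor product of such terms; the function names here are our own):

```python
from math import comb

def bernstein(k, K, u):
    """Univariate Bernstein basis B_{k,K}(u) = C(K,k) u^k (1-u)^(K-k)."""
    return comb(K, k) * u**k * (1 - u)**(K - k)

def bernstein_integral(k, K, c):
    """Exact partial integral int_0^c B_{k,K}(u) du, via the identity
    int_0^c B_{k,K} = (1/(K+1)) * sum_{j=k+1}^{K+1} B_{j,K+1}(c)."""
    return sum(bernstein(j, K + 1, c) for j in range(k + 1, K + 2)) / (K + 1)
```

For instance, the full integral of any basis member is $1/(K+1)$, and evaluating the identity at the limits of $\mathcal{U}_{dz,x}$ gives each $\tilde{\delta}_{dk}$ entry exactly.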
Then, we can formulate the following finite-dimensional LP:
$$\overline{\tau}_{\tilde{K}} = \max_{\tilde{\theta} \in \tilde{\Theta}_{\tilde{K}}} \sum_{k \in \tilde{\mathcal{K}}} \bigg\{ \sum_{e: g_e(1)=1} \theta_{ek} \tilde{\gamma}_{1k} - \sum_{e: g_e(0)=1} \theta_{ek} \tilde{\gamma}_{0k} \bigg\}, \quad (B.10)$$
$$\underline{\tau}_{\tilde{K}} = \min_{\tilde{\theta} \in \tilde{\Theta}_{\tilde{K}}} \sum_{k \in \tilde{\mathcal{K}}} \bigg\{ \sum_{e: g_e(1)=1} \theta_{ek} \tilde{\gamma}_{1k} - \sum_{e: g_e(0)=1} \theta_{ek} \tilde{\gamma}_{0k} \bigg\}, \quad (B.11)$$
subject to
$$E\bigg| \sum_{e: g_e(d)=1} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \tilde{\delta}_{dk}(Z,X) - p(1,d|Z,X) \bigg| = 0. \quad (B.12)$$
In estimation, we use the sample counterparts $\hat{\tilde{\gamma}}_{dk}$ and $\hat{\tilde{\delta}}_{dk}$ for $\tilde{\gamma}_{dk}$ and $\tilde{\delta}_{dk}$, and (B.12) can be imposed with slackness via
$$\frac{1}{n} \sum_{i=1}^n \bigg| \sum_{e: g_e(d)=1} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \hat{\tilde{\delta}}_{dk}(Z_i,X_i) - \hat{p}(1,d|Z_i,X_i) \bigg| \le \eta,$$
where $\hat{p}(1,d|z,x)$ is a preliminary estimate of $p(1,d|z,x)$ and $\eta$ is the slackness parameter. We may later want to introduce additional constraints from some identifying assumptions:
$$R_1 q = a_1, \quad (B.13)$$
$$R_2 q \le a_2. \quad (B.14)$$
For the equality restrictions (B.13), we can use the same approach that transforms (B.6). For the inequality restrictions (B.14), we can allow any identifying assumptions for which $R_2$ is a matrix rather than an operator:

Assumption MAT. $R_2$ is a $\dim(a_2) \times \dim(q)$ matrix.

Assumptions M and C and the unconditional version of Assumption MTS satisfy this condition.

B.4 Equivalence with the IV-Like Estimands
We draw a connection between our approach and that of Mogstad et al. (2018). In particular, we show that the identified set of MTR functions $\mathcal{M}_{id}$ used in Mogstad et al. (2018) is equivalent to the set of MTR functions derived from the feasible set used in this paper. Therefore, the feasible set in this paper contains no less information about the data than is contained in $\mathcal{M}_{id}$ via the IV-like estimands in their paper.

The IV-like estimand is defined in Proposition 3 of Mogstad et al. (2018) and is stated below.

Proposition B.1 (IV-like Estimand from Mogstad et al. (2018)). Suppose that $s: \{0,1\} \times \mathbb{R}^{d_z} \times \mathbb{R}^{d_x} \to \mathbb{R}$ is an identified (or known) function that is measurable and has a finite second moment. We refer to such a function $s$ as an IV-like specification and to $\beta_s \equiv E[s(D,Z,X)Y]$ as an IV-like estimand. If $(Y,D)$ are generated according to Assumptions SEL and EX, then
$$\beta_s = E\Big[\int_0^1 m_0(u,X)\,\omega_{0s}(u,Z,X)\,du\Big] + E\Big[\int_0^1 m_1(u,X)\,\omega_{1s}(u,Z,X)\,du\Big], \quad (B.15)$$
where $\omega_{0s}(u,z,x) = s(0,z,x)1[u > p(z,x)]$ and $\omega_{1s}(u,z,x) = s(1,z,x)1[u \le p(z,x)]$.

For the MTR functions to be consistent with the data, the following conditions need to be satisfied:
$$E[Y|D=0,Z,X] = E[Y|U > p(Z,X),Z,X] = \frac{1}{1-P(Z,X)} \int_{p(Z,X)}^1 m_0(u,X)\,du, \quad (B.16)$$
$$E[Y|D=1,Z,X] = E[Y|U \le p(Z,X),Z,X] = \frac{1}{P(Z,X)} \int_0^{p(Z,X)} m_1(u,X)\,du. \quad (B.17)$$
Define the identified set as
$$\mathcal{M}_{id} = \Big\{ m = (m_0, m_1),\ m_0, m_1 \in L^2 : m_0, m_1 \text{ satisfy equations (B.16) and (B.17) a.s.} \Big\}.$$
This identified set is defined in Mogstad et al. (2018, Section 2.5). The definition follows from the fact that the MTR functions in $\mathcal{M}_{id}$ are compatible with the observed conditional means of $Y$. In this sense, it exhausts the information about the data contained in the conditional means. When $Y$ is binary, the conditional means of $Y$ contain the information of the complete distribution.
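To illustrate Proposition B.1, the following sketch checks (B.15) numerically in a small made-up example: a binary instrument with $\Pr[Z=1]=1/2$, hypothetical propensity scores $p(0)=0.4$ and $p(1)=0.6$, and hypothetical MTR functions $m_0(u)=u^2$, $m_1(u)=u$ (none of these come from the paper). With the Wald-type specification $s(d,z) = (z - E[Z])/\mathrm{Cov}(Z,D)$, the IV-like estimand $\beta_s$, computed either directly or via the weights in (B.15), reproduces the LATE.

```python
from scipy.integrate import quad

# hypothetical toy model (for illustration only)
p = {0: 0.4, 1: 0.6}            # propensity scores p(z)
m0 = lambda u: u**2             # MTR for d = 0
m1 = lambda u: u                # MTR for d = 1

# Wald (IV) specification: s(d,z) = (z - E[Z]) / Cov(Z,D), with Pr[Z=1]=1/2
cov_zd = 0.5 * p[1] - 0.5 * 0.5 * (p[0] + p[1])
s = lambda d, z: (z - 0.5) / cov_zd

def beta_direct():
    """beta_s = E[s(D,Z)Y], computed by conditioning on Z (s is free of d here)."""
    total = 0.0
    for z in (0, 1):
        EY_z = quad(m1, 0, p[z])[0] + quad(m0, p[z], 1)[0]
        total += 0.5 * s(None, z) * EY_z
    return total

def beta_via_B15():
    """RHS of (B.15): w0s = s(0,z)1[u > p(z)], w1s = s(1,z)1[u <= p(z)]."""
    total = 0.0
    for z in (0, 1):
        total += 0.5 * (s(1, z) * quad(m1, 0, p[z])[0]
                        + s(0, z) * quad(m0, p[z], 1)[0])
    return total

# LATE for compliers u in [p(0), p(1)]
late = quad(lambda u: m1(u) - m0(u), p[0], p[1])[0] / (p[1] - p[0])
```

In this toy model both computations agree with the LATE, consistent with the weighted-average representation of IV-like estimands.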
Define the feasible set $\mathcal{Q}_f$ as
$$\mathcal{Q}_f = \Big\{ q \in L^2 : q \in \mathcal{Q} \text{ and } q \text{ satisfies equation } (\infty\text{-LP3}) \Big\}.$$
To establish the connection with $\mathcal{M}_{id}$, we construct the set of MTR functions based on the feasible set:
$$\mathcal{M}_f = \Big\{ m = (m_0, m_1) : m_d = \sum_{e: g_e(d)=1} q(e|u,x),\ d \in \{0,1\},\ q \in \mathcal{Q}_f \Big\}.$$
Then the following holds, the proof of which appears later:
Theorem B.3.
Suppose $Y$ is discretely distributed. Under Assumptions SEL and EX, $\mathcal{M}_f = \mathcal{M}_{id}$.

Proposition 3 of Mogstad et al. (2018) shows an equivalence relationship between the identified set $\mathcal{M}_{id}$ and the set of MTR functions satisfying constraints based on selected IV-like estimands. Theorem B.3 shows that the information contained in the feasible set used in our LP is the same as that in the selected IV-like estimands that exhaust the available information. Theorem B.3 can be extended to the case where $Y$ is discrete and $X$ is continuous. When $Y$ is a non-binary discrete outcome variable, $\mathcal{M}_{id}$ and $\mathcal{M}_f$ only exhaust the information on the conditional means, but not other distributional information. Nonetheless, that missing information is captured by the set $\mathcal{Q}_f$ that we use as our constraint set, because $q(e|u)$ is defined as the conditional probability of $Y$ taking each value.

C Proofs
C.1 Proof of Lemma 3.1
Fix $(d,z,x)$. By $\sum_{e \in \mathcal{E}} q(e|u,x) = 1$ for $q \in \mathcal{Q}$, we have
$$1 = \sum_{e \in \mathcal{E}} q(e|u,x) = \sum_{e: g_e(d)=1} q(e|u,x) + \sum_{e: g_e(d)=0} q(e|u,x).$$
Then, in ($\infty$-LP3), the constraint with $p(0,d|z,x)$ can be written as
$$p(0,d|z,x) = \int_{\mathcal{U}_{dz,x}} \sum_{e: g_e(d)=0} q(e|u,x)\,du = \int_{\mathcal{U}_{dz,x}} \Big\{ 1 - \sum_{e: g_e(d)=1} q(e|u,x) \Big\}\,du = \Pr[D=d|Z=z,X=x] - \int_{\mathcal{U}_{dz,x}} \sum_{e: g_e(d)=1} q(e|u,x)\,du,$$
which is equivalent to the constraint
$$p(1,d|z,x) = \int_{\mathcal{U}_{dz,x}} \sum_{e: g_e(d)=1} q(e|u,x)\,du,$$
since $\Pr[D=d|Z=z,X=x] - p(0,d|z,x) = p(1,d|z,x)$. Therefore, the constraint with $p(0,d|z,x)$ does not contribute to the restrictions imposed by ($\infty$-LP3) and $q \in \mathcal{Q}$. $\Box$

C.2 Proof of Theorem 5.1
In proving the claim of the theorem, note that $Z$ can be fixed at a certain value, so we fix $Z = z$ here. We first prove Case (a). To simplify notation, let $q(e_1, \dots, e_J|u) \equiv \Pr[\epsilon \in \{e_1, \dots, e_J\}|u] = \sum_{j=1}^J q(e_j|u)$. Based on Table 1, we can easily derive
$$p(1,1|z,1) = \int_0^{P(z)} \sum_{e: g_e(1,1)=1} q(e|u)\,du = \int_0^{P(z)} q(9,\dots,16|u)\,du,$$
$$p(1,1|z,0) = \int_0^{P(z)} \sum_{e: g_e(1,0)=1} q(e|u)\,du = \int_0^{P(z)} q(5,\dots,8,13,\dots,16|u)\,du,$$
$$p(1,0|z,1) = \int_{P(z)}^1 \sum_{e: g_e(0,1)=1} q(e|u)\,du = \int_{P(z)}^1 q(3,4,7,8,11,12,15,16|u)\,du,$$
$$p(1,0|z,0) = \int_{P(z)}^1 \sum_{e: g_e(0,0)=1} q(e|u)\,du = \int_{P(z)}^1 q(2,4,6,8,10,12,14,16|u)\,du.$$
Define the operator
$$T_{dz} q_e \equiv \int_{\mathcal{U}_{dz}} q(e|u)\,du.$$
For the r.h.s. $(p_{11|z1}, p_{11|z0}, p_{10|z1}, p_{10|z0})'$ of the constraints in (LPW3) that correspond to $Z = z$, the corresponding l.h.s. is
$$\begin{pmatrix} \int_0^{P(z)} q(9,\dots,16|u)\,du \\ \int_0^{P(z)} q(5,\dots,8,13,\dots,16|u)\,du \\ \int_{P(z)}^1 q(3,4,7,8,11,12,15,16|u)\,du \\ \int_{P(z)}^1 q(2,4,6,8,10,12,14,16|u)\,du \end{pmatrix} = Tq,$$
where $T$ is the $4 \times 16$ matrix of operators implicitly defined (its first two rows contain $T_{1z}$ in the columns of the types appearing in the corresponding equation above and zeros elsewhere, and its last two rows analogously contain $T_{0z}$), and $q(u) \equiv (q(1|u), \dots, q(16|u))'$.

Now, for $q \in \mathcal{Q}_K$, define a $16K$-vector $\theta \equiv (\theta_1', \dots, \theta_{16}')'$, where, for each $e \in \{1, \dots, 16\}$, $\theta_e \equiv (\theta_{e1}, \dots, \theta_{eK})'$. Similarly, let $b(u) \equiv (b_1(u), \dots, b_K(u))'$. Then we have $q(e|u) = b(u)'\theta_e$. Let $H$ be a $16 \times 16$ diagonal matrix of 1's and 0's that imposes additional identifying assumptions on the outcome data-generating process. In this proof, $H$ is used to incorporate Assumption R(i). Given $H$, the constraints in (LPW3) that correspond to $Z = z$ can be written as
$$THq = \{TH \otimes b'\}\theta = (p_{11|z1}, p_{11|z0}, p_{10|z1}, p_{10|z0})'.$$

Now we prove the claim of the theorem. Suppose the claim is not true, i.e., the even rows are linearly dependent on the odd rows in $TH$. Given the form of $T$, which has full rank under Assumption R(ii)(a), this linear dependence occurs only when $H$ is such that $H_{jj} = 1$ exactly for the four types $j$ with $Y(d,1) = Y(d,0)$ for each $d$, and 0 otherwise. But, according to Table 1, this implies that $\Pr[Y(d,w) \neq Y(d,w')] = 0$ for all $d$ and $w \neq w'$, which contradicts Assumption R(i). This proves the theorem for Case (a).

Now we prove the theorem for Case (b), analogously to the previous case. For every $z$, we can derive
$$p(1,1|z,1) = \int_0^{P(z,1)} \sum_{e: g_e(1,1)=1} q(e|u)\,du = \int_0^{P(z,1)} q(9,\dots,16|u)\,du,$$
$$p(1,1|z,0) = \int_0^{P(z,0)} \sum_{e: g_e(1,0)=1} q(e|u)\,du = \int_0^{P(z,0)} q(5,\dots,8,13,\dots,16|u)\,du,$$
$$p(1,0|z,1) = \int_{P(z,1)}^1 \sum_{e: g_e(0,1)=1} q(e|u)\,du = \int_{P(z,1)}^1 q(3,4,7,8,11,12,15,16|u)\,du,$$
$$p(1,0|z,0) = \int_{P(z,0)}^1 \sum_{e: g_e(0,0)=1} q(e|u)\,du = \int_{P(z,0)}^1 q(2,4,6,8,10,12,14,16|u)\,du.$$
Define
$$T_{dz,w} q_e \equiv \int_{\mathcal{U}_{dz,w}} q(e|u)\,du,$$
where $\mathcal{U}_{dz,w}$ can be analogously defined. Then,
$$\begin{pmatrix} \int_0^{P(z,w)} q(9,\dots,16|u)\,du \\ \int_0^{P(z,w')} q(5,\dots,8,13,\dots,16|u)\,du \\ \int_{P(z,w)}^1 q(3,4,7,8,11,12,15,16|u)\,du \\ \int_{P(z,w')}^1 q(2,4,6,8,10,12,14,16|u)\,du \end{pmatrix} = \tilde{T}q,$$
where $\tilde{T}$ is a matrix of operators implicitly defined, with rows built from $T_{1z,w}$, $T_{1z,w'}$, $T_{0z,w}$, and $T_{0z,w'}$, respectively. Then, inserting $H$, the constraint becomes
$$\tilde{T}Hq = \big\{\tilde{T}H \otimes b'\big\}\theta = (p_{11|z1}, p_{11|z0}, p_{10|z1}, p_{10|z0})'.$$
The remaining argument is the same as in the previous case, which completes the proof. $\Box$
C.3 Proof of Theorem B.1
For any given $\bar{u} \in [0,1]$, $\overline{\tau}(\bar{u}) = \sum_{e \in \mathcal{E}: g_e(1)=1} q^*_{\bar{u}}(e|\bar{u}) - \sum_{e \in \mathcal{E}: g_e(0)=1} q^*_{\bar{u}}(e|\bar{u})$ for some $q^*_{\bar{u}}(\cdot) \equiv \{q^*_{\bar{u}}(e|\cdot)\}_{e \in \mathcal{E}}$ in the feasible set of the LP, (B.1) and (B.3). Therefore, $\overline{\tau}(\bar{u}) = \tau_{MTE,\bar{u}}(\bar{u})$ for $\tau_{MTE,\bar{u}}(\bar{u}) = \sum_{e \in \mathcal{E}: g_e(1)=1} q^*_{\bar{u}}(e|\bar{u}) - \sum_{e \in \mathcal{E}: g_e(0)=1} q^*_{\bar{u}}(e|\bar{u})$, which is in $\mathcal{M}$ by definition. We can have a symmetric proof for $\underline{\tau}(\cdot)$. $\Box$

C.4 Proof of Theorem B.2
Again, by the fact that $\tau_{MTE}(\cdot) = \sum_{e \in \mathcal{E}: g_e(1)=1} q(e|\cdot) - \sum_{e \in \mathcal{E}: g_e(0)=1} q(e|\cdot)$ in general, $\overline{\tau}(u) = \sum_{e \in \mathcal{E}: g_e(1)=1} q^*(e|u) - \sum_{e \in \mathcal{E}: g_e(0)=1} q^*(e|u)$ for all $u \in [0,1]$ is equivalent to $\overline{\tau}(\cdot)$ being contained in $\mathcal{M}$, and similarly for $\underline{\tau}(\cdot)$. $\Box$

C.5 Proof of Theorem B.3
From ($\infty$-LP3), we can write $E[Y|D=0,Z,X]$ in terms of $q(e|u,X)$ as below:
$$E[Y|D=0,Z,X] = \Pr[Y=1|D=0,Z,X] = \frac{\Pr[Y=1,D=0|Z,X]}{\Pr[D=0|Z,X]} = \frac{1}{1-P(Z,X)} \sum_{e: g_e(0)=1} \int_{P(Z,X)}^1 q(e|u,X)\,du = \frac{1}{1-P(Z,X)} \int_{P(Z,X)}^1 \sum_{e: g_e(0)=1} q(e|u,X)\,du. \quad (C.1)$$
Therefore, for $(m_0, m_1) \in \mathcal{M}_f$,
$$E[Y|D=0,Z,X] = \frac{1}{1-P(Z,X)} \int_{P(Z,X)}^1 m_0(u,X)\,du,$$
and, symmetrically,
$$E[Y|D=1,Z,X] = \frac{1}{P(Z,X)} \int_0^{P(Z,X)} m_1(u,X)\,du.$$
We conclude that $\mathcal{M}_f \subset \mathcal{M}_{id}$.

Now suppose $m \in \mathcal{M}_{id}$. By (B.16) and (C.1), for all $z, x$,
$$\frac{1}{1-P(z,x)} \int_{P(z,x)}^1 m_0(u,x)\,du = \frac{1}{1-P(z,x)} \sum_{e: g_e(0)=1} \int_{P(z,x)}^1 q(e|u,x)\,du,$$
and hence
$$\int_{P(z,x)}^1 \Big\{ m_0(u,x) - \sum_{e: g_e(0)=1} q(e|u,x) \Big\}\,du = 0.$$
Since this equality holds for all possible values of $P(z,x)$, we conclude by the fundamental theorem of calculus that $m_0(u,x) = \sum_{e: g_e(0)=1} q(e|u,x)$ on the support $u \in [0,1]$ for all $x$. Following the symmetric procedure, we can conclude that $m_1(u,x) = \sum_{e: g_e(1)=1} q(e|u,x)$. This shows $\mathcal{M}_{id} \subset \mathcal{M}_f$. Thus, $\mathcal{M}_f = \mathcal{M}_{id}$. $\Box$

References
Angrist, J. and I. Fernandez-Val (2010): “ExtrapoLATE-ing: External validity and overidentification in the LATE framework,” Tech. rep., National Bureau of Economic Research.

Balat, J. F. and S. Han (2018): “Multiple treatments with strategic interaction,” Available at SSRN 3182766.

Balke, A. and J. Pearl (1997): “Bounds on treatment effects from studies with imperfect compliance,” Journal of the American Statistical Association, 92, 1171–1176.

Bertanha, M. and G. W. Imbens (2019): “External validity in fuzzy regression discontinuity designs,” Journal of Business & Economic Statistics, 1–39.

Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2008): “Treatment effect bounds under monotonicity assumptions: an application to Swan-Ganz catheterization,” American Economic Review, 98, 351–356.

Brinch, C. N., M. Mogstad, and M. Wiswall (2017): “Beyond LATE with a discrete instrument,” Journal of Political Economy, 125, 985–1039.

Chen, X. (2007): “Large sample sieve estimation of semi-nonparametric models,” Handbook of Econometrics, 6, 5549–5632.

Chen, X. and T. Christensen (2015): “Optimal sup-norm rates, adaptivity and inference in nonparametric instrumental variables estimation.”

Chen, X., E. T. Tamer, and A. Torgovitsky (2011): “Sensitivity analysis in semiparametric likelihood models.”

Chen, X., J. Tan, Z. Liu, and J. Xie (2017): “Approximation of functions by a new family of generalized Bernstein operators,” Journal of Mathematical Analysis and Applications, 450, 244–261.

Chernozhukov, V. and C. Hansen (2005): “An IV model of quantile treatment effects,” Econometrica, 73, 245–261.

Coolidge, J. L. (1949): “The story of the binomial theorem,” The American Mathematical Monthly, 56, 147–157.

Cornelissen, T., C. Dustmann, A. Raute, and U. Schönberg (2016): “From LATE to MTE: Alternative methods for the evaluation of policy interventions,” Labour Economics, 41, 47–60.

Deb, R., Y. Kitamura, J. K.-H. Quah, and J. Stoye (2017): “Revealed price preference: Theory and stochastic testing.”

Dehejia, R., C. Pop-Eleches, and C. Samii (2019): “From local to global: External validity in a fertility natural experiment,” Journal of Business & Economic Statistics, 1–27.

Dunlop, D. D., L. M. Manheim, J. Song, and R. W. Chang (2002): “Gender and ethnic/racial disparities in health care utilization among older adults,” The Journals of Gerontology Series B: Psychological Sciences and Social Sciences, 57, S221–S233.

Finkelstein, A., S. Taubman, B. Wright, M. Bernstein, J. Gruber, J. P. Newhouse, H. Allen, K. Baicker, and O. H. S. Group (2012): “The Oregon health insurance experiment: evidence from the first year,” The Quarterly Journal of Economics, 127, 1057–1106.

Firpo, S. and G. Ridder (2019): “Partial identification of the treatment effect distribution and its functionals,” Journal of Econometrics, 213, 210–234.

Gunsilius, F. (2019): “Bounds in continuous instrumental variable models,” arXiv preprint arXiv:1910.09502.

Han, S. (2020a): “Nonparametric estimation of triangular simultaneous equations models under weak identification,” Quantitative Economics, 11, 161–202.

——— (2020b): “Optimal Dynamic Treatment Regimes and Partial Welfare Ordering,” arXiv preprint arXiv:1912.10014.

Han, S. and S. Lee (2019): “Estimation in a generalization of bivariate probit models with dummy endogenous regressors,” Journal of Applied Econometrics, 34, 994–1015.

Han, S. and E. J. Vytlacil (2017): “Identification in a generalization of bivariate probit models with dummy endogenous regressors,” Journal of Econometrics, 199, 63–73.

Heckman, J. J. and E. Vytlacil (2005): “Structural equations, treatment effects, and econometric policy evaluation,” Econometrica, 73, 669–738.

Heckman, J. J. and E. J. Vytlacil (1999): “Local instrumental variables and latent variable models for identifying and bounding treatment effects,” Proceedings of the National Academy of Sciences, 96, 4730–4734.

Hsieh, Y.-W., X. Shi, and M. Shum (2018): “Inference on estimators defined by mathematical programming,” Available at SSRN 3041040.

Hurd, M. D. and K. McGarry (1997): “Medical insurance and the use of health care services by the elderly,” Journal of Health Economics, 16, 129–154.

Imbens, G. W. and J. D. Angrist (1994): “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 62, 467–475.

Joy, K. I. (2000): “Bernstein polynomials,” On-Line Geometric Modeling Notes, 13.

Kamat, V. (2019): “Identification with latent choice sets: The case of the Head Start Impact Study,” arXiv preprint arXiv:1711.02048.

Kitamura, Y. and J. Stoye (2019): “Nonparametric Counterfactuals in Random Utility Models,” arXiv preprint arXiv:1902.08350.

Kowalski, A. E. (2020): “Reconciling Seemingly Contradictory Results from the Oregon Health Insurance Experiment and the Massachusetts Health Reform,” Tech. rep., National Bureau of Economic Research.

Machado, C., A. Shaikh, and E. Vytlacil (2019): “Instrumental variables and the sign of the average treatment effect,” Journal of Econometrics, 212, 522–555.

Manski, C. F. (1997): “Monotone treatment response,” Econometrica, 1311–1334.

——— (2007): “Partial identification of counterfactual choice probabilities,” International Economic Review, 48, 1393–1410.

Manski, C. F. and J. V. Pepper (2000): “Monotone instrumental variables: With an application to the returns to schooling,” Econometrica, 68, 997–1010.

Masten, M. A. and A. Poirier (2018): “Salvaging falsified instrumental variable models,” arXiv preprint arXiv:1812.11598.

Mogstad, M., A. Santos, and A. Torgovitsky (2017): “Using Instrumental Variables for Inference about Policy Relevant Treatment Effects,” Tech. rep., National Bureau of Economic Research.

——— (2018): “Using instrumental variables for inference about policy relevant treatment parameters,” Econometrica, 86, 1589–1619.

Mogstad, M., A. Torgovitsky, and C. R. Walters (2019): “Identification of causal effects with multiple instruments: Problems and some solutions,” Tech. rep., National Bureau of Economic Research.

Mourifié, I. (2015): “Sharp bounds on treatment effects in a binary triangular system,” Journal of Econometrics, 187, 74–81.

Muralidharan, K., A. Singh, and A. J. Ganimian (2019): “Disrupting education? Experimental evidence on technology-aided instruction in India,” American Economic Review, 109, 1426–1460.

Shaikh, A. M. and E. J. Vytlacil (2011): “Partial identification in triangular systems of equations with binary dependent variables,” Econometrica, 79, 949–955.

Taubman, S. L., H. L. Allen, B. J. Wright, K. Baicker, and A. N. Finkelstein (2014): “Medicaid increases emergency-department use: evidence from Oregon’s Health Insurance Experiment,” Science, 343, 263–268.

Tebaldi, P., A. Torgovitsky, and H. Yang (2019): “Nonparametric estimates of demand in the California health insurance exchange,” Tech. rep., National Bureau of Economic Research.

Torgovitsky, A. (2019a): “Nonparametric Inference on State Dependence in Unemployment,” Econometrica, 87, 1475–1505.

——— (2019b): “Nonparametric inference on state dependence in unemployment,” Econometrica, 87, 1475–1505.

Vuong, Q. and H. Xu (2017): “Counterfactual mapping and individual treatment effects in nonseparable models with binary endogeneity,” Quantitative Economics, 8, 589–610.

Vytlacil, E. (2002): “Independence, monotonicity, and latent index models: An equivalence result,” Econometrica, 70, 331–341.

Vytlacil, E. and N. Yildiz (2007): “Dummy endogenous variables in weakly separable models,”