Sharp Bounds on Treatment Effects for Policy Evaluation∗

Sukjin Han
Department of Economics
University of [email protected]

Shenshen Yang
Department of Economics
University of Texas at [email protected]

Draft: September 30, 2020
Abstract
For counterfactual policy evaluation, it is important to ensure that treatment parameters are relevant to the policies in question. This is especially challenging under unobserved heterogeneity, as is well featured in the definition of the local average treatment effect (LATE). Being intrinsically local, the LATE is known to lack external validity in counterfactual environments. This paper investigates the possibility of extrapolating local treatment effects to different counterfactual settings when instrumental variables are only binary. We propose a novel framework to systematically calculate sharp nonparametric bounds on various policy-relevant treatment parameters that are defined as weighted averages of the marginal treatment effect (MTE). Our framework is flexible enough to incorporate a large menu of identifying assumptions beyond the shape restrictions on the MTE that have been considered in prior studies. We apply our method to understand the effects of medical insurance policies on the use of medical services.
JEL Numbers:
C14, C32, C33, C36
Keywords:
Heterogeneous treatment effects, local average treatment effects, marginal treatment effects, extrapolation, partial identification.

∗The authors are grateful to Jason Abrevaya, Brendan Kline, Xun Tang, Alex Torgovitsky, Ed Vytlacil, Haiqing Xu, and participants in the 2020 Texas Econometrics Camp and the workshop at UT Austin for helpful comments and discussions.

1 Introduction
For counterfactual policy evaluation, it is important to ensure that treatment parameters are relevant to the policies in question. This is especially challenging in the presence of unobserved heterogeneity. This challenge is well featured in the definition of the local average treatment effect (LATE). The LATE has been one of the most popular treatment parameters used by empirical researchers since it was introduced by Imbens and Angrist (1994). It induces a straightforward linear estimation method that requires only a binary instrumental variable (IV), and yet allows for unrestricted treatment heterogeneity. The unfortunate feature of the LATE is that, as the name suggests, the parameter is intrinsically local, recovering the average treatment effect (ATE) for a specific subgroup of the population called compliers. This feature leads to two major challenges in making the LATE a valuable parameter for counterfactual policy evaluation. First, the subpopulation for which the effect is measured may not be the population of policy interest. Second, the definition of the subpopulation depends on the IV chosen, rendering the parameter even more difficult to extrapolate to new environments.

Dealing with the lack of external validity of the LATE has been an important theme in the literature. One approach in theoretical work (Angrist and Fernandez-Val (2010); Bertanha and Imbens (2019)) and empirical research (Dehejia et al. (2019); Muralidharan et al. (2019)) has been to show the similarity between complier and non-complier groups based on observables. This approach, however, cannot attend to possible unobservable discrepancies between these groups. Heckman and Vytlacil (2005) unify well-known treatment parameters by expressing them as weighted averages of what they define as the marginal treatment effect (MTE). This MTE framework has great potential for extrapolation because a class of treatment parameters that are policy-relevant can also be generated as weighted averages of the MTE.
The only obstacle is that the MTE is identified via a method called local IV (Heckman and Vytlacil (1999)), which requires continuous variation of the IV, possibly over a large support depending on the target parameter. This in turn reflects the intrinsic difficulty of extrapolation when the available exogenous variation is only discrete. Acknowledging this nature of the challenge, previous studies in the literature have proposed imposing shape restrictions on the MTE, which is a function of the treatment-selection unobservable, while allowing for binary instruments in the framework of Heckman and Vytlacil (2005). Brinch et al. (2017) introduce shape restrictions (e.g., linearity) on the MTE functions in an attempt to identify the LATE extrapolated to different subpopulations or to test for its external validity. More recently, Mogstad et al. (2018) propose a general partial identification framework where bounds on various policy-relevant treatment parameters can be obtained from a set of "IV-like estimands" that are directly identified from the data and routinely obtained in empirical work. Kowalski (2020) applies an approach similar to these studies to extrapolate the results from one health insurance experiment to an external setting.

This paper continues this pursuit and investigates the possibility of extrapolating local treatment parameters to different policy settings in the MTE framework when IVs are only binary. In a partial identification framework similar in spirit to Mogstad et al. (2018), we show how to systematically calculate sharp nonparametric bounds on various extrapolated treatment parameters for binary (or, more generally, discrete) outcomes using instruments that are allowed to be binary. These parameters are defined as weighted averages of the MTE. Examples include the ATE, the treatment on the treated, the LATE for subgroups induced by new policies, and the policy-relevant treatment effect (PRTE).
We also show how to place in this procedure restrictions from a large menu of identifying assumptions beyond the shape restrictions considered in earlier work.

In this paper, we make four main contributions. First, we propose a novel framework for calculating bounds on policy-relevant treatment parameters. We introduce the probability of the latent state of the outcome-generating process conditional on the treatment-selection unobservable. This latent conditional probability is the key ingredient for our analysis, as both the target parameter and the distribution of the observables can be written as linear functionals of it. Therefore, with it as a decision variable, we can formulate an infinite-dimensional linear program that produces bounds on a targeted treatment parameter. This approach is reminiscent of Balke and Pearl (1997) and can be viewed as its generalization to the MTE framework. Balke and Pearl (1997) introduce a linear programming approach to characterize bounds on the ATE with a binary outcome, treatment, and instrument. The main distinction of our approach is that the latent probability is conditioned on the selection unobservable, which is important for our extrapolation purpose. To make it feasible to solve the resulting infinite-dimensional program, we use a sieve-like approximation of the program and produce a finite-dimensional linear program (LP). This approximation approach builds on Mogstad et al. (2018), although they apply the approximation directly to the MTE function. We also propose a conservative approach to choosing the sieve dimension in practice.

Second, formulating the LP based on the latent conditional probability rather than the MTE creates a flexible environment in which we can introduce identifying assumptions that have not been used in the context of the MTE framework or the LATE extrapolation. We propose assumptions that there exist exogenous variables other than IVs.
We propose two types of exogenous variables that have been used in the context of identifying the ATE in the literature: Mourifié (2015), Han and Vytlacil (2017), Vuong and Xu (2017), and Han and Lee (2019) use the first type, and Vytlacil and Yildiz (2007), Shaikh and Vytlacil (2011), and Balat and Han (2018) use the second type. We utilize these variables in the novel context of the MTE framework. Also, while the earlier papers exploit these variables in combination with rank similarity or rank invariance, we show that they independently have identifying power for treatment parameters, including the ATE. We also propose identifying assumptions such as uniformity and the direction of endogeneity in this MTE framework. The direction of endogeneity is sometimes assumed in empirical work to characterize selection bias and has been shown to have identifying power (Manski and Pepper (2000)). The uniformity assumption is related to rank similarity or rank invariance (Chernozhukov and Hansen (2005)). The shape restrictions on the MTE considered in the literature can also be nested within our framework, since the MTE is just a sum of the latent conditional probabilities. The assumptions on the existence of exogenous variables complement the identifying assumptions that rely on the researcher's prior, in that their identifying power comes from actual data. When a confidence set is constructed under one of the latter assumptions, we can conduct a specification test for that assumption.

Third, we show that our approach yields a straightforward proof of the sharpness of the resulting bounds. This feature stems from the use of the latent conditional probability in the linear programming and the convexity of the feasible set in the program.
When the MTE itself is the target parameter, we distinguish between the notions of point-wise and uniform sharpness and argue why uniform sharpness is often difficult to achieve.

Fourth, as an application, we study the effects of insurance on medical service utilization by considering various counterfactual policies related to insurance coverage. The LATE for compliers and the bounds on the LATE for always-takers and never-takers reveal that possessing private insurance has the largest effect on medical visits for never-takers, i.e., those who face higher insurance cost. This provides a policy implication that lowering the cost of private insurance is important, because the high cost might hinder the people with most need from receiving adequate medical services.

The linear programming approach to partial identification of treatment effects was pioneered by Balke and Pearl (1997) and recently gained attention in the literature; see, e.g., Mogstad et al. (2018), Torgovitsky (2019a), Machado et al. (2019), Kamat (2019), Gunsilius (2019), and Han (2020b). As these papers suggest, there are many settings, including ours, where analytical derivation of bounds is cumbersome or nearly impossible due to the complexity of the problems. (For the computational approach in contexts other than program evaluation, see Manski (2007), Kitamura and Stoye (2019), Deb et al. (2017), and Tebaldi et al. (2019).)

This paper proceeds as follows. The next section introduces the main observables, maintained assumptions, and target parameters. Section 3 defines the latent conditional probability.
2 Setup

Assume that we observe the binary outcome Y ∈ {0, 1}, binary treatment D ∈ {0, 1}, and binary instrument Z ∈ {0, 1}. We may additionally observe (possibly endogenous) discrete covariates X ∈ X. Binary Y is common in empirical work. Binary Z is also common, especially in randomized experiments, and allowing for this minimal exogenous variation is the key challenge for extrapolation that we want to address in this paper. Still, the analysis of this paper can be extended to allow for general discrete Y and Z. Let Y(d) be the counterfactual outcome given D = d, which is consistent with the observed outcome: Y = DY(1) + (1 − D)Y(0). We maintain the following assumptions:

Assumption SEL. D = 1{U ≤ P(Z, X)}, where P(Z, X) ≡ Pr[D = 1 | Z, X] and U | X = x ∼ Unif[0, 1] for x ∈ X.

Assumption EX. (Y(d), D(z)) ⊥ Z | X.

Assumption SEL imposes a selection model for D, which is important in motivating and interpreting marginal treatment effects later. This assumption is also equivalent to Imbens and Angrist (1994)'s monotonicity assumption (Vytlacil (2002)). We introduce the standard normalization that U ∼ Unif[0, 1] conditional on X = x. Assumption EX imposes the exclusion restriction and conditional independence for Z.

We focus on discrete X as it simplifies the exposition; Section B.3 in the Appendix extends the framework to incorporate continuously distributed X. Note that for any index function g(z, x) and an unobservable ε with any distribution, the selection model satisfies D = 1{ε ≤ g(Z, X)} = 1{F_{ε|X}(ε | X) ≤ F_{ε|X}(g(Z, X) | X)} = 1{U ≤ P(Z, X)}, since P(z, x) = Pr[ε ≤ g(z, x) | X = x] = Pr[U ≤ F_{ε|X}(g(z, x) | x) | X = x] = F_{ε|X}(g(z, x) | x) and F_{ε|X}(ε | X) = U is uniformly distributed conditional on X.

The MTE is defined as

MTE(u, x) ≡ E[Y(1) − Y(0) | U = u, X = x].

Following Mogstad et al. (2018), it is convenient to introduce the marginal treatment response (MTR) function

m_d(u, x) ≡ E[Y(d) | U = u, X = x] = Pr[Y(d) = 1 | U = u, X = x].

Then, the MTE can be expressed as m_1(u, x) − m_0(u, x). Now, we define the target parameter τ to be a weighted average of the MTE:

τ = E[τ_1(Z, X) − τ_0(Z, X)],    (2.1)

where

τ_d(z, x) = ∫₀¹ m_d(u, x) w_d(u, z, x) du    (2.2)

by using F_{U|X}(u | x) = u, and w_d(u, z, x) is a known weight specific to the parameter of interest. This definition agrees with the insight of Heckman and Vytlacil (2005). The target parameter includes a wide range of policy-relevant treatment parameters. With a Dirac delta function at a given value u as the weight, the MTE itself can be an example. We list a few examples of the target parameter here; other examples can be found in Table 4 in the Appendix.

Example 1.
The ATE can be a target parameter with w_d(u, z, x) = 1, ∀u, z, x:

τ_ATE = E[∫₀¹ m_1(u, X) du − ∫₀¹ m_0(u, X) du].

Example 2.
The generalized LATEs for always-takers and never-takers are also target parameters. Here, we give the expression of the LATE for always-takers as an example. Assume P(z, x) increases in z for any given x ∈ X. For the always-taker (AT) LATE, we give weight 1/P(0, x) to individuals with u ∈ [0, P(0, x)], and thus we have w_d(u, z, x) = 1(u ∈ [0, P(0, x)])/P(0, x). (Mogstad et al. (2018) define the weight in a slightly different way.)

τ_LATE-AT = E[∫₀¹ m_1(u, X) 1(u ∈ [0, P(0, X)])/P(0, X) du − ∫₀¹ m_0(u, X) 1(u ∈ [0, P(0, X)])/P(0, X) du].

Example 3.
The policy-relevant treatment effect (PRTE) is a target parameter that is particularly useful for policy evaluation. It is defined as the welfare difference between two different policies. Let Z and Z′ be two instrumental variables under two policies and P(Z, X) and P′(Z′, X) be the propensity scores under the two policies. Then

τ_PRTE = E[∫₀¹ m_1(u, X) (Pr[u ≤ P′(Z′, X)] − Pr[u ≤ P(Z, X)])/(E[P′(Z′, X)] − E[P(Z, X)]) du − ∫₀¹ m_0(u, X) (Pr[u ≤ P′(Z′, X)] − Pr[u ≤ P(Z, X)])/(E[P′(Z′, X)] − E[P(Z, X)]) du].

In these examples, the weights w_1 and w_0 can be set asymmetrically to define a broader class of parameters. All the parameters we consider in this paper can be defined conditional on X, although we omit the conditioning for succinctness.

3 Latent Conditional Probability

As a crucial first step of our analysis, we define a state variable that determines a specific mapping of d to y. Since d ∈ {0, 1} and y ∈ {0, 1}, there are four possible maps from d onto y. Define a discrete latent variable ǫ whose value e corresponds to each possible map: ǫ ∈ E, where |E| = 4 with E ≡ {1, 2, 3, 4}. That is, ǫ is a decimal transformation of the binary sequence (Y(1), Y(0)), which captures the treatment effect heterogeneity. For later purposes, it is helpful to explicitly define the map as y = g_e(d) and write

Y(d) = g_ǫ(d),    (3.1)

which implies Y = g_ǫ(D). It is important to note that no structure is imposed in introducing g_e(·), because the mapping is saturated by binary Y and D. By (3.1) and Assumption SEL, Assumption EX can be equivalently stated as (ǫ, U) ⊥ Z | X. Still, ǫ and X can be correlated, as X is allowed to be endogenous.

Now, as a key component of our LP, we define the probability mass function of ǫ conditional on (U, X): for e ∈ E,

q(e | u, x) ≡ Pr[ǫ = e | U = u, X = x]    (3.2)

with Σ_{e∈E} q(e | u, x) = 1 for any (u, x). The quantity q(e | u, x) captures endogenous treatment selection.
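Since ǫ is a decimal transformation of (Y(1), Y(0)), the four maps can be enumerated mechanically. In this sketch we assume the labeling e = 1 + 2Y(1) + Y(0), which is one convention consistent with the text; the paper's own numbering may differ.

```python
# Enumerate the four maps g_e(d) for binary Y and D, assuming the
# (hypothetical) labeling e = 1 + 2*Y(1) + Y(0).
def g(e, d):
    y1, y0 = (e - 1) >> 1 & 1, (e - 1) & 1   # decode (Y(1), Y(0)) from e
    return y1 if d == 1 else y0

# e = 1: Y(1) = Y(0) = 0; e = 2: only Y(0) = 1; e = 3: only Y(1) = 1; e = 4: both 1.
table = {e: (g(e, 1), g(e, 0)) for e in range(1, 5)}
print(table)   # -> {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}
```

Under this labeling, e = 3 and e = 4 are the maps with g_e(1) = 1, and e = 2 and e = 4 are those with g_e(0) = 1.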
It is shown below that this latent conditional probability is a building block for various treatment parameters and thus serves as the decision variable in the LP. The introduction of q(e | u, x) distinguishes our approach from those in Balke and Pearl (1997) and Mogstad et al. (2018). Since the probability is conditional on the continuously distributed U, the simple finite-dimensional linear programming approach of Balke and Pearl (1997) is no longer applicable. Instead, we use an approximation method similar to Mogstad et al. (2018). However, Mogstad et al. (2018) use the MTR function as a building block for treatment parameters and introduce the "IV-like" estimands as a means of funneling the information from the data. Unlike in Mogstad et al. (2018), q(e | u, x) can be directly related to the distribution of the data. This allows us to later incorporate identifying assumptions that are difficult to incorporate within the framework of Mogstad et al. (2018).

By (3.1) and (3.2), note that

Pr[Y(d) = 1 | U = u, X = x] = Pr[ǫ ∈ {e ∈ E : g_e(d) = 1} | U = u, X = x] = Σ_{e∈E: g_e(d)=1} q(e | u, x),

so that

m_d(u, x) = Σ_{e: g_e(d)=1} q(e | u, x).    (3.3)

Combining (3.3) and (2.2), we have τ_d(z, x) = Σ_{e: g_e(d)=1} ∫₀¹ q(e | u, x) w_d(u, z, x) du, and thus the target parameter τ = E[τ_1(Z, X)] − E[τ_0(Z, X)] in (2.1) can be written as

τ = Σ_{e: g_e(1)=1} ∫₀¹ E[q(e | u, X) w_1(u, Z, X)] du − Σ_{e: g_e(0)=1} ∫₀¹ E[q(e | u, X) w_0(u, Z, X)] du    (3.4)

for some q that satisfies the properties of a probability.

The goal of this paper is to (at least partially) infer the target parameter τ based on the data, i.e., the distribution of (Y, D, Z, X).
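As a toy numerical illustration of (3.3), the MTR is a partial sum of the latent conditional probabilities. The labeling e = 1 + 2Y(1) + Y(0) and the particular pmf below are assumptions made only for this sketch.

```python
# Toy illustration of (3.3): the MTR m_d(u) is a partial sum of q(e|u).
# The labeling e = 1 + 2*Y(1) + Y(0) and the pmf are assumed for the sketch.
g = {e: ((e - 1) >> 1 & 1, (e - 1) & 1) for e in range(1, 5)}   # e -> (Y(1), Y(0))

def q(e, u):
    # A valid conditional pmf for u in [0, 1]: nonnegative, sums to one over e.
    probs = [0.25 + 0.1 * (u - 0.5), 0.25 - 0.1 * (u - 0.5), 0.25, 0.25]
    return probs[e - 1]

def m(d, u):
    # m_d(u) = sum of q(e|u) over e with g_e(d) = 1, as in (3.3)
    pos = 0 if d == 1 else 1          # g[e][0] = Y(1), g[e][1] = Y(0)
    return sum(q(e, u) for e in range(1, 5) if g[e][pos] == 1)

# Here m_1(u) = q(3|u) + q(4|u) = 0.5, while m_0(u) = q(2|u) + q(4|u) varies in u.
```

With such a q in hand, any target parameter in (3.4) is a linear functional of q, which is what makes the linear-programming formulation below possible.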
The key insight is that there are observationally equivalent q(e | u, x)'s that are consistent with the data, which in turn produce observationally equivalent τ's that define the identified set.

Let p(y, d | z, x) ≡ Pr[Y = y, D = d | Z = z, X = x] be the observed conditional probability. This data distribution imposes restrictions on q(e | u, x). For instance, for D = 1,

p(y, 1 | z, x) = Pr[Y(1) = y, U ≤ P(z, x) | Z = z, X = x] = Pr[Y(1) = y, U ≤ P(z, x) | X = x]

by Assumption EX, and

Pr[Y(1) = y, U ≤ P(z, x) | X = x] = ∫₀^{P(z,x)} Pr[Y(1) = y | U = u, X = x] du = Σ_{e: g_e(1)=y} ∫₀^{P(z,x)} q(e | u, x) du,    (3.5)

where the second equality is by Pr[Y(d) = y | U = u, X = x] = Σ_{e: g_e(d)=y} q(e | u, x).

To define the identified set for τ, we introduce some simplifying notation. Let q(u, x) ≡ {q(e | u, x)}_{e∈E} and

Q ≡ {q(·) : Σ_{e∈E} q(e | u, x) = 1 ∀(u, x) and q(e | u, x) ≥ 0 ∀(e, u, x)}

be the class of q(u, x), and let p ≡ {p(1, d | z, x)}_{(d,z,x)∈{0,1}²×X}. Also, let R_τ : Q → R and R : Q → R^{d_p} (with d_p being the dimension of p) denote the linear operators of q(·) that satisfy

R_τ q ≡ Σ_{e: g_e(1)=1} ∫₀¹ E[q(e | u, X) w_1(u, Z, X)] du − Σ_{e: g_e(0)=1} ∫₀¹ E[q(e | u, X) w_0(u, Z, X)] du,
R q ≡ {Σ_{e: g_e(d)=1} ∫_{U^d_{z,x}} q(e | u, x) du}_{(d,z,x)},

where U^d_{z,x} denotes the intervals U¹_{z,x} ≡ [0, P(z, x)] and U⁰_{z,x} ≡ (P(z, x), 1].

Definition 3.1.
The identified set of τ is defined as

T* ≡ {τ ∈ R : τ = R_τ q for some q ∈ Q such that R q = p}.

In what follows, we formulate the infinite-dimensional LP (∞-LP) that characterizes T*. This program conceptualizes sharp bounds on τ from the data and the maintained assumptions (Assumptions SEL and EX). The upper and lower bounds on τ are defined as

τ̄ = sup_{q∈Q} R_τ q,    (∞-LP1)
τ̲ = inf_{q∈Q} R_τ q,    (∞-LP2)

subject to

R q = p.    (∞-LP3)

Observe that the set of constraints (∞-LP3) does not include

Σ_{e: g_e(d)=0} ∫_{U^d_{z,x}} q(e | u, x) du = p(0, d | z, x)  ∀(d, z, x) ∈ {0, 1}² × X.    (3.6)

This is because we know a priori that they are redundant in the sense that they do not further restrict the feasible set, i.e., the set of q(e | u, x)'s that satisfy all the constraints (q ∈ Q and (∞-LP3)).

Lemma 3.1.
In the linear program (∞-LP1)–(∞-LP3), the feasible set defined by q ∈ Q and (∞-LP3) is identical to the feasible set defined by q ∈ Q, (∞-LP3), and (3.6).

Theorem 3.1.
Under Assumptions SEL and EX, suppose T* is non-empty. Then, the bounds [τ̲, τ̄] in (∞-LP1)–(∞-LP3) are sharp for the target parameter τ, i.e., cl(T*) = [τ̲, τ̄], where cl(·) is the closure of a set.

The proof relies on the convexity of the feasible set {q : q ∈ Q} ∩ {q : R q = p} in the LP and the linearity of R_τ q in q, which implies that [τ̲, τ̄] is convex.

4 Finite-Dimensional Approximation

Although conceptually useful, the LP (∞-LP1)–(∞-LP3) is not feasible in practice because Q is an infinite-dimensional space. In this section, we approximate (∞-LP1)–(∞-LP3) with a finite-dimensional LP via a sieve approximation of the conditional probability q(e | u, x). We use Bernstein polynomials as the sieve basis. Bernstein polynomials are useful in imposing restrictions on the original function (Joy (2000); Chen et al. (2011); Chen et al. (2017)) and therefore have been introduced in the context of linear programming (Mogstad et al. (2018); Masten and Poirier (2018); Mogstad et al. (2019)).

Consider the following sieve approximation of q(e | u, x) using Bernstein polynomials of order K:

q(e | u, x) ≈ Σ_{k=0}^{K} θ^{e,x}_k b_k(u),

where b_k(u) ≡ (K choose k) u^k (1 − u)^{K−k} is a univariate Bernstein basis, θ^{e,x}_k ≡ θ^{e,x}_{k,K} ≡ q(e | k/K, x) is its coefficient, and K is finite. It is important to note that x can index θ, because q(e | u, x) is a saturated function of x. By the definition of the Bernstein coefficients, for any (e, x), q(e | u, x) ≥ 0 for all u if and only if θ^{e,x}_k ≥ 0 for all k. Also, Σ_{e∈E} q(e | u, x) = 1 for all (u, x) is approximately equivalent to Σ_{e∈E} θ^{e,x}_k = 1 for all (k, x). To see this, first, Σ_{e∈E} q(e | u, x) = 1 for all (u, x) implies Σ_{e∈E} θ^{e,x}_k = Σ_{e∈E} q(e | k/K, x) = 1 for all (k, x). Conversely, when Σ_{e∈E} θ^{e,x}_k = 1 for all (k, x),

Σ_{e∈E} q(e | u, x) ≈ Σ_{e∈E} Σ_{k=0}^{K} θ^{e,x}_k b_k(u) = Σ_{k=0}^{K} b_k(u) = 1

by the binomial theorem (Coolidge (1949)).
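The partition-of-unity step in the last display can be checked numerically; a minimal sketch (the order K = 5 is an arbitrary choice for illustration):

```python
from math import comb

def bernstein(k, K, u):
    """Bernstein basis polynomial b_k(u) = C(K, k) u^k (1 - u)^(K - k) on [0, 1]."""
    return comb(K, k) * u**k * (1 - u)**(K - k)

# By the binomial theorem, sum_{k=0}^{K} b_k(u) = (u + (1 - u))^K = 1 for any u,
# so a coefficient vector on the unit simplex yields a function summing to one.
K = 5
for u in (0.0, 0.25, 0.5, 0.75, 1.0):
    total = sum(bernstein(k, K, u) for k in range(K + 1))
    assert abs(total - 1.0) < 1e-12
```

The same nonnegativity logic applies term by term: each b_k(u) is nonnegative on [0, 1], so nonnegative coefficients deliver a nonnegative approximant.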
Motivated by this approximation, we formally define the following sieve space for Q:

Q_K ≡ {{Σ_{k=0}^{K} θ^{e,x}_k b_k(u)}_{e∈E} : Σ_{e∈E} θ^{e,x}_k = 1 ∀(k, x) and θ^{e,x}_k ≥ 0 ∀(e, k, x)} ⊆ Q.    (4.1)

Let K ≡ {0, ..., K} and p(z, x) ≡ Pr[Z = z, X = x]. For q ∈ Q_K, by (3.4) and (4.1), the target parameter τ = E[τ_1(Z, X)] − E[τ_0(Z, X)] can be expressed with

E[τ_d(Z, X)] = Σ_{e: g_e(d)=1} Σ_{(k,x)∈K×X} θ^{e,x}_k ∫₀¹ b_k(u) Σ_{z∈{0,1}} w_d(u, z, x) p(z, x) du ≡ Σ_{e: g_e(d)=1} Σ_{(k,x)∈K×X} θ^{e,x}_k γ^d_k(x),    (4.2)

where γ^d_k(x) ≡ ∫₀¹ b_k(u) Σ_{z∈{0,1}} w_d(u, z, x) p(z, x) du. Also, for q ∈ Q_K and D = 1, by (3.5), we have

p(y, 1 | z, x) = Σ_{e: g_e(1)=y} Σ_{k∈K} θ^{e,x}_k ∫₀^{P(z,x)} b_k(u) du ≡ Σ_{e: g_e(1)=y} Σ_{k∈K} θ^{e,x}_k δ¹_k(z, x),    (4.3)

where δ^d_k(z, x) ≡ ∫_{U^d_{z,x}} b_k(u) du.

From (4.2) and (4.3), we can expect that a finite-dimensional LP can be obtained with respect to θ^{e,x}_k. Let θ ≡ {θ^{e,x}_k}_{(e,k,x)∈E×K×X} and let

Θ_K ≡ {θ : Σ_{e∈E} θ^{e,x}_k = 1 ∀(k, x) and θ^{e,x}_k ≥ 0 ∀(e, k, x)}.

Then, we can formulate the following finite-dimensional LP that corresponds to the ∞-LP in (∞-LP1)–(∞-LP3):

τ̄_K = max_{θ∈Θ_K} Σ_{(k,x)∈K×X} {Σ_{e: g_e(1)=1} θ^{e,x}_k γ¹_k(x) − Σ_{e: g_e(0)=1} θ^{e,x}_k γ⁰_k(x)}    (LP1)
τ̲_K = min_{θ∈Θ_K} Σ_{(k,x)∈K×X} {Σ_{e: g_e(1)=1} θ^{e,x}_k γ¹_k(x) − Σ_{e: g_e(0)=1} θ^{e,x}_k γ⁰_k(x)}    (LP2)

subject to

Σ_{e: g_e(d)=1} Σ_{k∈K} θ^{e,x}_k δ^d_k(z, x) = p(1, d | z, x)  ∀(d, z, x) ∈ {0, 1}² × X.    (LP3)

This LP is computationally very easy to solve using standard algorithms, such as the simplex algorithm; conditional on x, when K = 50 and dim(θ) = 204, it takes only around 10 seconds to calculate τ̄_K and τ̲_K with moderate computing power. The important remaining question is how to choose K in practice. We discuss this issue in Section 7. Finally, it is worth noting that, extending Proposition 4 in Mogstad et al.
(2018), we may exactly calculate τ̄ and τ̲ (i.e., τ̄ = τ̄_K and τ̲ = τ̲_K) under the assumptions that (i) the weight function w_d(u, z, x) is piece-wise constant in u and (ii) the constant spline that provides the best mean squared error approximation of q(e | u, x) satisfies all the maintained assumptions (possibly including the identifying assumptions introduced later) that q(e | u, x) itself satisfies; see Mogstad et al. (2018) for details.

5 Incorporating Additional Exogenous Variables

Now we generalize the analysis in Sections 2–4 to incorporate additional exogenous variables, other than the instrument Z, with which researchers may be equipped. We show that these variables are fruitful for narrowing bounds on the target parameter. This is the first paper that introduces this type of variable in the MTE framework. This is also the first paper that shows the usefulness of these variables without necessarily combining them with assumptions related to rank similarity or rank invariance.

Let W ∈ W be such an exogenous variable. We assume that W is discrete. We show that even binary variation in W can be useful in improving the bounds. We modify our maintained assumptions to consider two different scenarios related to W: (a) W directly affects Y but not D, and (b) W directly affects both Y and D. Let Y(d, w) be the extended counterfactual outcome of Y given (d, w).

Assumption SEL_W. (a) Assumption SEL; (b) D = 1{U ≤ P(Z, X, W)}, where P(Z, X, W) ≡ Pr[D = 1 | Z, X, W].

Assumption EX_W. (a) (Y(d, w), D(z)) ⊥ (Z, W) | X; (b) (Y(d, w), D(z, w)) ⊥ (Z, W) | X.

Case (a) is where W is a reversely excluded exogenous variable, which we call a reverse IV. This type of exogenous variable was considered by Vytlacil and Yildiz (2007), Shaikh and Vytlacil (2011), and Balat and Han (2018). However, unlike those studies, we exploit W without rank similarity or rank invariance. In Case (b), we show that a reverse IV is not necessary, and W can be present in the selection equation.
This type of exogenous variable was considered by Mourifié (2015), Han and Vytlacil (2017), Vuong and Xu (2017), and Han and Lee (2019), but again, unlike these papers, we do not necessarily assume rank similarity or rank invariance. Below, we combine the existence of W (for both scenarios) with assumptions that are related to rank similarity. Another distinct feature of our approach in comparison to the prior studies is that we consider a broad class of generalized LATEs as our target parameter, including the ATE considered in those studies.

In what follows, we modify the linear programming framework from Sections 2 and 3 to reflect Assumptions SEL_W and EX_W. For notational simplicity, we focus on Case (a) here; it is straightforward to derive analogous results for Case (b). With the existence of W, the MTR is defined as

m_d(u, w, x) ≡ E[Y(d, w) | U = u, X = x] = Pr[Y(d, w) = 1 | U = u, X = x],

where W does not appear as a conditioning variable due to Assumption EX_W(a). Then, the target parameter can be expressed as

τ = E[τ_1(Z, W, X) − τ_0(Z, W, X)],

where

τ_d(z, w, x) = ∫₀¹ m_d(u, w, x) w_d(u, z, x) du.

Note that the weight w_d(u, z, x) is not a function of w due to Assumption SEL_W(a). (With Assumption SEL_W(b), the weight would instead be written as w*_d(u, z, w, x), since the propensity score is a function of W.) Now, consider a mapping (d, w) ↦ y, which is coded in the value e of ǫ ∈ E, where |E| = 16 (redefining the variable ǫ introduced in Section 3). Conveniently, let E ≡ {1, 2, ..., 16}; Table 1 lists all 16 maps. Equivalently, define y = g_e(d, w), which implies

Y(d, w) = g_ǫ(d, w)    (5.1)

and Y = g_ǫ(D, W). By (5.1) and Assumption SEL_W(a), Assumption EX_W(a) can be equivalently stated as (ǫ, U) ⊥ (Z, W) | X. Define the probability mass function of ǫ conditional on (U, X) as

q(e | u, x) ≡ Pr[ǫ = e | U = u, X = x] = Pr[ǫ = e | U = u, X = x, W = w].
Table 1: All Possible Maps from (d, w) to y

 e   Y(0,0)  Y(0,1)  Y(1,0)  Y(1,1)
 1     0       0       0       0
 2     1       0       0       0
 3     0       1       0       0
 4     1       1       0       0
 5     0       0       1       0
 6     1       0       1       0
 7     0       1       1       0
 8     1       1       1       0
 9     0       0       0       1
10     1       0       0       1
11     0       1       0       1
12     1       1       0       1
13     0       0       1       1
14     1       0       1       1
15     0       1       1       1
16     1       1       1       1

Note: e − 1 equals the decimal value of the binary string (Y(1,1), Y(1,0), Y(0,1), Y(0,0)).

Then, the MTR can be expressed as

m_d(u, w, x) = Σ_{e∈E: g_e(d,w)=1} q(e | u, x)    (5.2)

and the target parameter as

τ = ∫₀¹ E[Σ_{e: g_e(1,W)=1} q(e | u, X) w_1(u, Z, X)] du − ∫₀¹ E[Σ_{e: g_e(0,W)=1} q(e | u, X) w_0(u, Z, X)] du.    (5.3)

Recall that the sieve approximation of the conditional probability is given by q(e | u, x) ≈ Σ_{k∈K} θ^{e,x}_k b_k(u). Then, for q ∈ Q_K, by (5.3), we have τ = E[τ_1(Z, W, X)] − E[τ_0(Z, W, X)] with

E[τ_d(Z, W, X)] = Σ_{(w,x)∈W×X} Σ_{e: g_e(d,w)=1} Σ_{k∈K} θ^{e,x}_k γ^d_k(w, x),

where γ^d_k(w, x) ≡ Σ_{z∈{0,1}} p(z, w, x) ∫₀¹ b_k(u) w_d(u, z, x) du and p(z, w, x) ≡ Pr[Z = z, W = w, X = x]. In terms of the data distribution, we can derive, e.g.,

p(y, 1 | z, w, x) = Pr[Y(1, w) = y, U ≤ P(z, x) | X = x] = ∫₀^{P(z,x)} Pr[Y(1, w) = y | U = u, X = x] du = Σ_{e: g_e(1,w)=y} ∫₀^{P(z,x)} q(e | u, x) du = Σ_{e: g_e(1,w)=y} Σ_{k∈K} θ^{e,x}_k δ¹_k(z, x),

where δ^d_k(z, x) ≡ ∫_{U^d_{z,x}} b_k(u) du, as in the baseline case.

This modification yields a modified LP:

τ̄ = max_{θ∈Θ_K} Σ_{(k,w,x)∈K×W×X} {Σ_{e: g_e(1,w)=1} θ^{e,x}_k γ¹_k(w, x) − Σ_{e: g_e(0,w)=1} θ^{e,x}_k γ⁰_k(w, x)}    (LP_W1)
τ̲ = min_{θ∈Θ_K} Σ_{(k,w,x)∈K×W×X} {Σ_{e: g_e(1,w)=1} θ^{e,x}_k γ¹_k(w, x) − Σ_{e: g_e(0,w)=1} θ^{e,x}_k γ⁰_k(w, x)}    (LP_W2)

subject to

Σ_{e: g_e(d,w)=1} Σ_{k∈K} θ^{e,x}_k δ^d_k(z, x) = p(1, d | z, w, x)  ∀(d, z, w, x) ∈ {0, 1}² × W × X.    (LP_W3)

Analogous results hold under Assumptions SEL_W(b) and EX_W(b). Since the number of maps increases to 16 from the four of the baseline case, the dimension of the decision variable θ in (LP_W1)–(LP_W3) increases accordingly; with binary W and setting K = 50, we have dim(θ) = 816 = 4 × 204 conditional on x. The variation of W is in fact helpful in narrowing the bounds [τ̲, τ̄] as long as W is a relevant variable. For the remainder of Section 6, we assume W ∈ {0, 1} and suppress "conditional on X" for simplicity, unless otherwise noted.

Assumption R. (i)
Pr[Y(d, w) ≠ Y(d, w′)] > 0 for some d and w ≠ w′; (ii) either (a) P(z) > 0 for all z, or (b) P(z, w) > 0 for all z, w.

Theorem 5.1.
Under Assumptions SEL_W, EX_W, and R, the variation of W poses non-redundant constraints on θ ∈ Θ_K in the LP (LP_W1)–(LP_W3) (suppressing x).

Assumption R(i) requires the relevance of W in determining Y. Heuristically, the improvement occurs because, with R(i), the constraint matrix (i.e., the matrix multiplied to the vector θ in (LP_W3)) has a larger rank with the variation of W than without. See the proof of the theorem for a formal argument. Note that non-redundant constraints on θ do not always guarantee an improvement of the bounds in (LP_W1)–(LP_W2).

6 Identifying Assumptions

Bounds on the target parameter are generally uninformative in the absence of additional assumptions besides Assumptions SEL and EX. This is because a binary instrument has no extrapolative power for general non-compliers, e.g., always-takers and never-takers, but only identifies the effect for compliers. Prior studies have tried to overcome this challenge by imposing shape restrictions on the MTE (Cornelissen et al. (2016); Brinch et al. (2017); Kowalski (2020)), although these restrictions are not always empirically justified. Evidently, it would be useful to provide researchers with a larger variety of assumptions so that it is easier to find justifiable assumptions that suit their specific settings.

In Section 5, the existence of W (Assumptions SEL_W and EX_W) was shown to be one useful source for extrapolation. In this section, we propose identifying assumptions that can be incorporated within our framework and that help shrink the bounds on the target parameters. The shape restrictions employed in the literature can be used within our framework.
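To fix ideas before adding such constraints, the baseline finite-dimensional LP (LP1)–(LP3) can be sketched end to end. Everything in this sketch is an assumed toy setup: the labeling e = 1 + 2Y(1) + Y(0), the propensity scores, and the observed probabilities, which are generated from a u-constant q(e|u) = (0.1, 0.2, 0.3, 0.4) so that the program is feasible by design and the true ATE equals 0.1. The target is the ATE, for which γ^d_k = ∫₀¹ b_k(u) du = 1/(K + 1).

```python
import numpy as np
from math import comb
from scipy.integrate import quad
from scipy.optimize import linprog

K = 10                      # Bernstein order (the paper's timing example uses K = 50)
ks = range(K + 1)
# Latent types e = 1..4 encode (Y(1), Y(0)) via the assumed labeling
# e = 1 + 2*Y(1) + Y(0): y1[e] = g_e(1) and y0[e] = g_e(0).
y1 = {1: 0, 2: 0, 3: 1, 4: 1}
y0 = {1: 0, 2: 1, 3: 0, 4: 1}

def b(u, k):                # Bernstein basis b_k(u) of order K
    return comb(K, k) * u**k * (1 - u)**(K - k)

# Toy "data": propensity scores P(z) and probabilities p(1, d | z) implied by
# the u-constant q(e|u) = (0.1, 0.2, 0.3, 0.4).
P = {0: 0.3, 1: 0.7}
p = {(1, 0): 0.7 * P[0], (1, 1): 0.7 * P[1],              # (q3 + q4) * P(z)
     (0, 0): 0.6 * (1 - P[0]), (0, 1): 0.6 * (1 - P[1])}  # (q2 + q4) * (1 - P(z))

idx = {(e, k): (e - 1) * (K + 1) + k for e in y1 for k in ks}
c = np.zeros(4 * (K + 1))   # objective: ATE weights give gamma^d_k = 1/(K + 1)
for e in y1:
    for k in ks:
        c[idx[e, k]] = (y1[e] - y0[e]) / (K + 1)

A_eq, b_eq = [], []
for k in ks:                # theta in Theta_K: sum_e theta^e_k = 1 for each k
    row = np.zeros(len(c))
    for e in y1:
        row[idx[e, k]] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for d in (0, 1):            # data constraints (LP3); delta^d_k by quadrature
    for z in (0, 1):
        lo, hi = ((0.0, P[z]) if d == 1 else (P[z], 1.0))
        row = np.zeros(len(c))
        for e in y1:
            if (y1[e] if d == 1 else y0[e]) == 1:
                for k in ks:
                    row[idx[e, k]] = quad(b, lo, hi, args=(k,))[0]
        A_eq.append(row); b_eq.append(p[d, z])

res_lo = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
res_hi = linprog(-c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1))
lb, ub = res_lo.fun, -res_hi.fun
print(f"ATE bounds: [{lb:.3f}, {ub:.3f}]")   # an interval containing the true 0.1
```

Extra identifying assumptions then amount to appending rows: equalities (such as deactivated types) join A_eq, and inequalities are passed to linprog through A_ub, mirroring the appended constraints (∞-LP4)–(∞-LP5).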
We also propose other assumptions that have not been previously used for LATE extrapolation. These assumptions can be incorporated as additional equality and inequality restrictions in the linear programming: given the LP (∞-LP1)–(∞-LP3), identifying assumptions can be imposed by appending

R₁q = a₁,    (∞-LP4)
R₂q ≤ a₂,    (∞-LP5)

where R₁ and R₂ are linear operators on Q that correspond to equality and inequality constraints, respectively, and a₁ and a₂ are some vectors. When an assumption violates the true data-generating process, the identified set will be empty. This corresponds to the situation where the LP does not have a feasible solution. When we reflect sampling errors, this corresponds to the case where the confidence set is empty.

Researchers may be willing to restrict the degree of treatment heterogeneity to yield informative bounds. This restriction has not been used before in the context of the MTE framework. It may be combined with the assumptions related to the existence of W (Assumptions SEL^W, EX^W, and R). We suppress the conditioning on X throughout this subsection.

Assumption U.
For every w ∈ W, Pr[Y(1, w) ≥ Y(0, w)] = 1 or Pr[Y(1, w) ≤ Y(0, w)] = 1.

When W is not available at all, this assumption can be understood with W being degenerate. The following assumption is stronger than Assumption U.

Assumption U*. For every w, w′ ∈ W, Pr[Y(1, w) ≥ Y(0, w′)] = 1 or Pr[Y(1, w) ≤ Y(0, w′)] = 1.

Note that w and w′ may be the same or different, i.e., the uniformity is required for all combinations of (w, w′) ∈ {(0, 0), (1, 1), (1, 0), (0, 1)}. Therefore, Assumption U* implies Assumption U. Assumptions U and U* are weaker than the monotone treatment response assumption in Manski (1997) and Manski and Pepper (2000) in that they do not impose the direction of monotonicity. Assumptions U and U* are also closely related to the rank similarity and rank invariance assumptions in the literature (e.g., Chernozhukov and Hansen (2005)). Namely, given a structural model Y = 1[s(D, W) ≥ V_D], when Assumption U* is violated, rank similarity (F_{V₁|U} = F_{V₀|U}) cannot hold, and thus rank invariance (V₁ = V₀) cannot hold.

Assumptions U and U* can be imposed by "deactivating" relevant maps. For example, suppose Y(1, w) ≥ Y(0, w) almost surely for all w ∈ {0, 1} under Assumption U. This can be imposed as equality constraints (∞-LP4), i.e., in the form of R₁q = a₁, using the labeling of Table 1:

q(3 | u) = q(4 | u) = q(7 | u) = q(8 | u) = 0,
q(2 | u) = q(4 | u) = q(10 | u) = q(12 | u) = 0,

where the two sets of constraints deactivate the maps that are inconsistent with the monotonicity for w = 1 and w = 0, respectively. The corresponding θ_k^e equal zero. Then, the effective dimension of θ is reduced in (LP^W1)–(LP^W3), which yields narrower bounds. As another example, suppose the following holds almost surely under Assumption U*: Y(1, 1) ≥ Y(0, 0), Y(1, 0) ≤ Y(0, 1), Y(1, 1) ≥ Y(0, 1), and Y(1, 0) ≥ Y(0, 0). These inequalities can be imposed, respectively, as

q(2 | u) = q(4 | u) = q(6 | u) = q(8 | u) = 0,
q(5 | u) = q(6 | u) = q(13 | u) = q(14 | u) = 0,
q(3 | u) = q(4 | u) = q(7 | u) = q(8 | u) = 0,
q(2 | u) = q(4 | u) = q(10 | u) = q(12 | u) = 0.

In order to verify whether the identified set is empty, we need to check whether the feasible set of θ is empty. An efficient way to do this is to identify the vertices of the feasible polytope, if any. This process is no simpler than the simplex algorithm that we use to solve the LP. Therefore, we recommend that one first solve the LP and check whether infeasibility is reported.

It is worth mentioning that, in Assumption U (Assumption U*), the direction of monotonicity is allowed to differ across w (across (w, w′) pairs). This direction will be identified from the data. Specifically, the direction can be automatically determined from the LP by inspecting whether the LP has a feasible solution; when the wrong maps are removed, there is no feasible solution. Note that this result holds regardless of the existence of W. It is easy to see that the direction of the monotonicity coincides with the sign of the ATE. Previous work has discussed the role of the rank similarity assumption in determining the sign of the ATE (Bhattacharya et al. (2008); Shaikh and Vytlacil (2011); Han (2020b)), and the result above shows that Assumptions U and U* play a similar role in the linear programming approach. In the next two subsections, we suppress W for simplicity.

In some applications, researchers are relatively confident about the direction of treatment endogeneity. The idea of imposing the direction of the selection bias as an identifying assumption appears in Manski and Pepper (2000), who introduce monotone treatment selection (MTS), in addition to the monotone treatment response assumption mentioned above.
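The feasibility-based determination of the direction of monotonicity can be sketched with a small linear program. Everything below is a hypothetical toy version, not the paper's implementation: a piecewise-constant basis in place of the Bernstein polynomials, made-up map labels (not Table 1's numbering), and made-up moments. The point is only that imposing the direction consistent with the data leaves the equality constraints satisfiable, while imposing the wrong direction makes the LP infeasible.

```python
import numpy as np
from scipy.optimize import linprog

# Toy setup (NOT the paper's Table 1 numbering): four response maps e with
# potential outcomes (Y(0), Y(1)), piecewise-constant q(e|u) on u-bins, and
# made-up data moments p(1, d | z).
MAPS = {"never": (0, 0), "helped": (0, 1), "hurt": (1, 0), "always": (1, 1)}
EDGES = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # u-bins for q(e|u)
P = {0: 0.25, 1: 0.75}                          # propensity scores P(z)

def overlap(lo, hi, a, b):
    """Length of [lo, hi] ∩ [a, b]."""
    return max(0.0, min(hi, b) - max(lo, a))

def feasible(p_moments, banned_map):
    """Is the LP feasible when map `banned_map` is deactivated (q(e|u) = 0)?"""
    names, nk = list(MAPS), len(EDGES) - 1
    nvar = len(names) * nk                      # theta[e, k], flattened
    idx = lambda e, k: names.index(e) * nk + k
    A_eq, b_eq = [], []
    for k in range(nk):                         # sum_e q(e|u) = 1 on each bin
        row = np.zeros(nvar)
        for e in names:
            row[idx(e, k)] = 1.0
        A_eq.append(row); b_eq.append(1.0)
    for (d, z), pv in p_moments.items():        # data moments p(1, d | z)
        row = np.zeros(nvar)
        for e, (y0, y1) in MAPS.items():
            if (y1 if d == 1 else y0) == 1:
                for k in range(nk):
                    lo, hi = EDGES[k], EDGES[k + 1]
                    row[idx(e, k)] = (overlap(lo, hi, 0, P[z]) if d == 1
                                      else overlap(lo, hi, P[z], 1))
        A_eq.append(row); b_eq.append(pv)
    bnd = [(0.0, 0.0) if names[i // nk] == banned_map else (0.0, None)
           for i in range(nvar)]
    res = linprog(np.zeros(nvar), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=bnd, method="highs")   # zero objective: feasibility only
    return res.status == 0                      # 0 = feasible, 2 = infeasible

# Moments generated from a DGP in which Y(1) >= Y(0) a.s. (no "hurt" map):
p_moments = {(1, 0): 0.15, (1, 1): 0.45, (0, 0): 0.15, (0, 1): 0.05}
print(feasible(p_moments, banned_map="hurt"))    # correct direction: feasible
print(feasible(p_moments, banned_map="helped"))  # wrong direction: infeasible
```

Solving the LP once per candidate direction and keeping the feasible one mimics the data-driven determination of the direction of monotonicity described above.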
Assumption MTS. E[Y(d) | D = 1, X = x] ≥ E[Y(d) | D = 0, X = x] for d ∈ {0, 1} and x ∈ X.

Under our framework, this assumption can be imposed in the form of R₂q ≤ a₂. To see this, note that Assumption MTS is equivalent to

Σ_{e: g_e(d)=1} E[ (1 − P(Z, X))⁻¹ ∫_{P(Z,X)}^1 q(e | u) du − P(Z, X)⁻¹ ∫₀^{P(Z,X)} q(e | u) du | X = x ] ≤ 0

for (d, x) ∈ {0, 1} × X. As is clear from this expression, Assumption MTS imposes restrictions on the joint distribution of (ǫ, U).

It is straightforward to incorporate the shape restrictions on the MTR or MTE function introduced in the literature. They can be imposed via constraints on θ.

Assumption M.
For x ∈ X, m_d(u, x) is weakly increasing in u ∈ [0, 1].

Assumption C.
For x ∈ X, m_d(u, x) is weakly concave in u ∈ [0, 1].

Assumption M appears in Brinch et al. (2017) and Mogstad et al. (2018), and Assumption C appears in Mogstad et al. (2018). These assumptions can be imposed as inequality constraints (∞-LP5), i.e., in the form of R₂q ≤ a₂. For their implications on the finite-dimensional LP (LP1)–(LP3), recall that for q ∈ Q_K, the MTR satisfies

m_d(u, x) = Σ_{e: g_e(d)=1} q(e | u, x) = Σ_{k∈K} Σ_{e: g_e(d)=1} θ_k^{e,x} b_k(u).

By the properties of the Bernstein polynomials, Assumption M implies that Σ_{e: g_e(d)=1} θ_k^{e,x} is weakly increasing in k, i.e.,

Σ_{e: g_e(d)=1} θ_0^{e,x} ≤ Σ_{e: g_e(d)=1} θ_1^{e,x} ≤ ··· ≤ Σ_{e: g_e(d)=1} θ_K^{e,x}.

Assumption C implies that

Σ_{e: g_e(d)=1} θ_k^{e,x} − 2 Σ_{e: g_e(d)=1} θ_{k+1}^{e,x} + Σ_{e: g_e(d)=1} θ_{k+2}^{e,x} ≤ 0 for k = 0, ..., K − 2.

One can obtain analogous assumptions and their implications in the presence of W. Another shape restriction introduced in the literature is separability. Although it is not particularly appealing with binary Y, if one is willing to assume a separable model for the MTR, m_d(u, x) = Pr[Y(d) = 1 | U = u, X = x] = m_d^1(x) + m_d^2(u), then such a structure can be imposed on θ.

This section provides numerical results to illustrate our theoretical framework and to show the role of different identifying assumptions in improving bounds on the target parameters. For target parameters, we consider the ATE and the LATEs for always-takers (LATE-AT), never-takers (LATE-NT), and compliers (LATE-C). We calculate the bounds on them based only on the information from the data and then show how additional assumptions on the existence of additional exogenous variables, uniformity, and shape restrictions tighten the bounds.
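To make the mechanics concrete, the following sketch solves a miniature version of the finite-dimensional LP for worst-case ATE bounds, with and without the Bernstein-coefficient constraints implied by Assumption M. The response maps, propensity scores, moments, and degree K below are all hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.special import betainc
from scipy.optimize import linprog

# Miniature finite LP (hypothetical maps, moments, and scale): worst-case ATE
# bounds with and without Assumption M's monotone-coefficient constraints.
K = 10                                     # Bernstein degree (K + 1 basis terms)
MAPS = {"never": (0, 0), "helped": (0, 1), "hurt": (1, 0), "always": (1, 1)}
P = {0: 0.3, 1: 0.8}                       # propensity scores P(z)
# Moments p(1, d | z) implied by true MTRs m1 = 0.4 + 0.3u, m0 = 0.2 + 0.2u:
p = {(1, 0): 0.1335, (1, 1): 0.416, (0, 0): 0.231, (0, 1): 0.076}

names = list(MAPS)
nvar = len(names) * (K + 1)
idx = lambda e, k: names.index(e) * (K + 1) + k

def bern_cdf(k, t):
    """∫_0^t b_k(u) du for the degree-K Bernstein basis (reg. incomplete beta)."""
    return betainc(k + 1, K - k + 1, t) / (K + 1)

A_eq, b_eq = [], []
for k in range(K + 1):                     # sum_e q(e|u) = 1 for all u
    row = np.zeros(nvar)
    for e in names:
        row[idx(e, k)] = 1.0
    A_eq.append(row); b_eq.append(1.0)
for (d, z), pv in p.items():               # moment constraints, as in (LP3)
    row = np.zeros(nvar)
    for e, (y0, y1) in MAPS.items():
        if (y1 if d == 1 else y0) == 1:
            for k in range(K + 1):
                row[idx(e, k)] = (bern_cdf(k, P[z]) if d == 1
                                  else bern_cdf(k, 1.0) - bern_cdf(k, P[z]))
    A_eq.append(row); b_eq.append(pv)

c = np.zeros(nvar)                         # ATE weights: ∫_0^1 b_k = 1/(K+1)
for e, (y0, y1) in MAPS.items():
    for k in range(K + 1):
        c[idx(e, k)] = (y1 - y0) / (K + 1)

A_ub, b_ub = [], []                        # Assumption M: the coefficient sums
for d in (0, 1):                           # are weakly increasing in k
    for k in range(K):
        row = np.zeros(nvar)
        for e, (y0, y1) in MAPS.items():
            if (y1 if d == 1 else y0) == 1:
                row[idx(e, k)] += 1.0
                row[idx(e, k + 1)] -= 1.0
        A_ub.append(row); b_ub.append(0.0)

def ate_bounds(use_M):
    kw = dict(A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * nvar, method="highs")
    if use_M:
        kw.update(A_ub=np.array(A_ub), b_ub=np.array(b_ub))
    return linprog(c, **kw).fun, -linprog(-c, **kw).fun

print("worst-case ATE bounds:", ate_bounds(use_M=False))
print("with Assumption M:    ", ate_bounds(use_M=True))
```

In this toy DGP the true ATE is 0.25; both intervals contain it, and adding the monotonicity constraints can only shrink the feasible set, hence weakly tighten the bounds.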
We generate the observables (Y, D, Z, X, W) from the following data-generating process (DGP). We assume that W is a reverse IV, i.e., we maintain Assumptions SEL^W(a) and EX^W(a). We allow the covariate X to be endogenous. All the variables are set to be binary, with Pr[Z = 1] = 0.5, Pr[X = 1] = 0.·, and Pr[W = 1] = 0.4. The treatment D is determined by Z and X through the threshold-crossing model specified in Assumption SEL^W(a), where the propensity scores P(z, x) are specified as follows: P(0, 0) = 0.·, P(1, 0) = 0.·, P(0, 1) = 0.·, and P(1, 1) = 0.7. The outcome Y is generated from (D, X, W) through Y = DY₁ + (1 − D)Y₀, where

Y_d = 1[m_d(U, X, W) ≥ ǫ]    (7.1)

and each of the eight MTR functions m_d(u, x, w), (d, x, w) ∈ {0, 1}³, is specified as a linear combination of the five basis functions of the degree-4 Bernstein approximation, where b_k^K stands for the k-th basis function in the Bernstein approximation of degree K. These MTR functions are chosen to be consistent with Assumptions M and C, i.e., to be positively monotone and weakly concave in u for all (d, x, w) ∈ {0, 1}³. Also, the DGP in (7.1) satisfies Assumption U* because ǫ does not depend on d and m₁(u, x, w) > m₀(u, x, w) for all (x, w) ∈ {0, 1}². Following the second example in Section 6.1, the DGP satisfies the following uniform order for the counterfactual outcomes Y(d, w): Y(1, 1) ≥ Y(0, 1) ≥ Y(1, 0) ≥ Y(0, 0) a.s. We generate a sample containing 1,000,000 observations and choose K = 50. We choose the large sample size to mimic the population. Our choice of K is discussed below. The number of unknown parameters θ in the linear programming is equal to dim(θ) = |E| × |X| × (K + 1).

Figure 1 contains the bounds on the ATE under different assumptions. The true ATE value is 0.21, depicted as the solid red line in the figure. First, the worst-case bounds on the ATE with no additional assumptions (and without using variation from W) are [−0.·, 0.·]. Without W, we have |E| = 4, and the linear programming is solved with dim(θ) = |E| × |X| × (K + 1) = 4 × 2 × 51 = 408.

For comparison, we calculate the bounds that incorporate the existence of W. We build up the target parameters with mappings involving W and use the data distribution conditional on W = 0 and W = 1 as the constraints. Using constraints conditional on different values of W allows us to fully exploit the variation from W; see (LP^W1)–(LP^W3). With W (Section 5), we have |E| = 16, which gives dim(θ) = |E| × |X| × (K + 1) = 16 × 2 × 51 = 1,632. When W is exploited, the bounds on the ATE are [−0.·, 0.·], narrower than without W. This result is consistent with our theoretical finding in Theorem 5.1 that W can help tighten the bounds as long as it is a relevant variable. Nonetheless, these worst-case bounds are not that informative; e.g., they do not determine the sign of the ATE.

Next, we impose the uniformity assumption without W (Assumption U) and with W (Assumption U*). First, under Assumption U, the bounds on the ATE are tightened, as some mappings occur with probability zero, reducing the dimension of θ. As mentioned in Section 6.1, the direction of monotonicity in Assumption U (i.e., which mapping does not occur) is determined by the LPs: we solve the LPs with different directions imposed and then choose the one with a feasible solution. This means that the chosen direction of monotonicity is consistent with the DGP. Under Assumption U, we obtain narrower bounds [0.·, 0.·], and under Assumption U*, the bounds become [0.·, 0.·]; the bounds under Assumptions U and U* are depicted as violet and green dashed lines, respectively. Both sets of bounds identify the sign of the ATE, consistent with the theoretical discussion. While their lower bounds coincide, Assumption U* yields a lower upper bound compared to Assumption U.

Figure 1: Bounds on the ATE under Different Assumptions

Next, we impose the shape restrictions (Assumptions M and C). As discussed in Section 6.3, these assumptions can be easily incorporated in the linear programming by directly imposing inequality constraints on θ. Under these assumptions (and the existence of W), the bounds on the ATE shrink to [0.·, 0.·]. The shape restrictions function differently from the uniformity assumptions in the linear programming: unlike the uniformity assumption, which maintains the ranking of individuals across counterfactual groups, the shape restrictions directly control the MTR functions. Finally, the dash-dotted black line in Figure 1 shows the bounds on the ATE under the uniformity assumption and the shape restrictions combined. These assumptions all together yield the narrowest bounds, [0.·, 0.·].

7.2.2 Generalized LATEs

Next, we construct bounds on the generalized LATEs. The original definition of the LATE is the ATE for compliers (C), but researchers may also have an interest in other local treatment effects. We consider two other parameters: the LATEs for always-takers (AT) and never-takers (NT). Figure 2 displays the bounds on the LATE-AT, LATE-C, and LATE-NT under different assumptions. This analysis is analogous to that for the ATE. Since the covariate X affects the decision of compliance, to avoid confusion in the definition of the compliance groups, we instead establish bounds on the LATEs conditional on X. We draw the conditional MTE functions with solid red lines in both panels as a reference.

It is known that there exist no defiers in the DGP. When there is no defier, the LATE-C is point identified and has an analytical expression as the two-stage least squares estimand. As a confirmation exercise, we numerically calculate the LATE-C using the linear programming, which yields point estimates, as shown in Figure 2. The true LATE-Cs conditional on X = 0 and X = 1 are equal to 0.21 and 0.22, respectively. Regardless of the assumptions imposed, the estimates remain close to the true values throughout.

The true values of the conditional LATE-AT and LATE-NT are 0.15 and 0.28 when X = 0, and 0.14 and 0.25 when X = 1. First, as before, we consider the worst-case bounds where the existence of W is ignored versus where W is taken into account. Without W, we get the bounds [−0.·, 0.
24] and [−0.·, 0.72] on the LATE-AT and the LATE-NT conditional on X = 0, and [−0.·, 0.52] and [−0.·, 0.43] conditional on X = 1; with W, we get the bounds [−0.·, 0.2] and [−0.·, 0.7] on the LATE-AT and the LATE-NT conditional on X = 0, and [−0.·, 0.5] and [−0.·, 0.41] conditional on X = 1. The upper bounds with W are lower than the ones without W, although the gain is not substantial. As for the lower bounds, the one on the LATE-AT conditional on X = 0 is significantly higher with W than without W, and all the others differ negligibly with and without W.

We then apply Assumptions U and U*. Under Assumption U, the bounds on the LATE-AT and the LATE-NT become [0, 0.24] and [0, 0.72] conditional on X = 0, and [0, 0.52] and [0, 0.43] conditional on X = 1; when W is used and Assumption U* is applied, the bounds shrink to [0, 0.18] and [0, 0.47] conditional on X = 0, and [0, 0.36] and [0, 0.35] conditional on X = 1. As before, Assumptions U and U* determine the signs of the effects.

When the shape restrictions are imposed instead, the bounds on the LATE-AT and the LATE-NT improve to [0.·, 0.17] and [0.·, 0.3] conditional on X = 0, and [0.·, 0.·] and [0.·, 0.31] conditional on X = 1. Under Assumption U* combined with the shape restrictions, we get the narrowest bounds of [0.·, 0.15] and [0.·, 0.3] conditional on X = 0, and [0.·, 0.26] and [0.·, 0.31] conditional on X = 1.

Figure 2: Bounds on the LATEs under Different Assumptions

7.3 The Choice of K

As a tuning parameter in the LP, we need to choose the order of the Bernstein polynomials, K. In general, K should be chosen based on the sample size and the smoothness of the function to be approximated, in our case q(·). The choice of the sieve dimension, or more generally of regularization parameters, is a difficult question (Chen (2007)), and developing data-driven procedures is a subject of ongoing research in various nonparametric contexts of point identification; see, e.g., Chen and Christensen (2015) and Han (2020a). In this partial identification setup, we propose the following heuristic and conservative approach, which is in spirit consistent with the very motivation of partial identification.

First, we do not want to claim any prior knowledge about the smoothness of q(·), because it is the distribution of a latent variable. Because K determines the dimension of the unknown parameter θ in the linear programming, the width of the bounds tends to increase with K. At the same time, the computational burden increases with K. One interesting numerical finding is that, when K is sufficiently large, the increase of the width slows down and the bounds become stable. This suggests that we may be able to conservatively choose a K that acknowledges our lack of knowledge of the smoothness but, at the same time, produces a reasonable computational task for the linear programming.

To illustrate this point, we consider the conditional MTE and ATE as the target parameters and show how their bounds change as we increase K. We consider the MTE because it is a fundamental parameter that generates other target parameters, and hence it is important to understand the sensitivity of its bounds to K. Figures 3 and 4 show the evolution of the bounds on the MTE and the ATE as K grows.
When K = 5, the bounds are narrow. Although it may be tempting to choose this value of K, the temptation should be resisted, as this choice may be subject to misspecification of the smoothness. When K increases beyond 30, the bounds start to converge and become stable. We choose K = 50, which is the choice made in our previous numerical exercises.

As discussed in Section B.1 in the Appendix, it is worth mentioning that the bounds on the MTE are point-wise sharp but not uniformly sharp. The graph for the MTE bounds is drawn by calculating the point-wise sharp bounds on the MTE at each point of u (after properly discretizing it) and then connecting them. Therefore, these bounds should not be viewed as uniformly sharp bounds. Nonetheless, the graph is still useful for the purpose of our illustration. Given the current DGP, we find that there are no uniformly sharp bounds for the MTE. Note that, with larger K, some LP solvers may ignore coefficients of negligible magnitude, which create a large range of magnitudes in the coefficient matrix; it may be advisable to simultaneously rescale a column and a row to achieve a smaller range of coefficients.

Figure 3: Bounds on MTE with Different K

It is widely recognized in the empirical literature that health insurance coverage can be an essential factor in the utilization of medical services (Hurd and McGarry (1997); Dunlop et al. (2002); Finkelstein et al. (2012); Taubman et al. (2014)). Prior studies on this topic typically make use of parametric econometric models for the analysis. In their application, Han and Lee (2019) relax this common approach by introducing a semiparametric bivariate probit model to measure the average effect of insurance coverage on patients' medical visits. By applying our theoretical framework of partial identification, we further relax the parametric and semiparametric structures used in these studies.
More importantly, we try to understand how much can be learned about the effects of insurance under various counterfactual policies by learning the effects for different compliance groups.

We use the 2010 wave of the Medical Expenditure Panel Survey (MEPS) and focus on all the medical visits in January 2010. The sample is restricted to individuals aged between 25 and 64 and excludes those who had any kind of federal or state insurance in 2010. The outcome Y is a binary variable indicating whether or not an individual has visited a doctor's office; the treatment D is whether an individual has private insurance. We choose an indicator for whether the individual's firm has multiple locations as the IV Z. This IV reflects the size of the firm, and larger firms are more likely to provide fringe benefits, including health insurance. On the other hand, the number of branches of a firm does not directly affect employee decisions about medical visits. To justify the IV, self-employed individuals are excluded. For potentially endogenous covariates X, we include an indicator for age 45 and older, gender, an indicator for income above the median, and health condition. Lastly, for an exogenous covariate W, we use the percentage of workers who are provided with paid sick leave benefits within each industry. Following Han and Lee (2019), we assume W satisfies Assumptions SEL^W(b) and EX^W(b) once X is controlled for. We construct a categorical variable such that W = 0 for less than 50%, W = 1 for between 50% and 80%, and W = 2 for above 80%. Table 2 summarizes the observables.

Figure 4: Bounds on ATE with Different K

Table 2: Summary Statistics

Variable  Description                    Mean  S.D.  Min  Max
Y         Whether or not visit doctors   0.18  0.39  0    1
D         Whether or not have insurance  0.66  0.47  0    1
Z         Firm has multiple locations    0.68  0.47  0    1
X         Age above 45                   0.41  0.49  0    1
          Gender                         0.50  0.50  0    1
          Income above median            0.50  0.50  0    1
          Good health                    0.36  0.48  0    1
W         Paid sick leave provision      1.25  0.73  0    2
Number of observations = 7,555
First, as a benchmark, we report that the LATE-C estimate calculated via our linear programming approach is a singleton, 0.17, which is in fact identical to the 2SLS estimate we calculate separately. In what follows, we extrapolate this LATE beyond the complier group to the ATE. The presence of covariates reduces the effective sample size and thus leads to larger sampling errors in estimating the vector p of the ∞-LP (∞-LP1)–(∞-LP3). This may create inconsistencies in the set of equality constraints (∞-LP3), resulting in no feasible solution. This is in fact what happens in this application. To resolve this estimation problem, we introduce a slackness parameter η and modify (∞-LP3) so that, with some slackness, it satisfies

‖Rq − p‖ ≤ η.    (8.1)

A similarly modified constraint can then be used in the finite-dimensional LP after approximation, as well as in combination with (∞-LP4)–(∞-LP5). The appropriate value of η should depend on the sample size, the dimension of covariates, and the dimension of the unknown parameter θ. To explain the latter: as K increases, the dimension of θ (i.e., the number of unknowns) increases, while the number of constraints (i.e., simultaneous equations for the unknowns) is fixed. Therefore, as K increases, the chance that the LP has no feasible solution decreases. Based on the method discussed in the previous section, we set K = 50 in this application.

We calculate worst-case bounds on the ATE, as well as bounds after imposing Assumptions U and M and after using the covariate W. Under Assumption U, the data rule out the possibility that Y(0) > Y(1), indicating that individuals with private insurance are more likely to visit a doctor. Assumption M imposes that the MTR function is weakly increasing in U = u. Usually, U is interpreted as the latent cost of obtaining treatment. Kowalski (2020) interprets U as eligibility in a similar setup for Medicaid insurance. The eligibility for Medicaid is related to income level and age.
In our setup, because the treatment is having private insurance, we interpret eligibility as health status, which is reflected in the premium. Interpreting U as a latent cost (e.g., the premium) of getting private insurance, Assumption M states that the chance of making a medical visit (with or without insurance) increases for those with a higher cost. This is a reasonable assumption given that sicker individuals typically face higher insurance costs and also visit doctors more often. We choose the slackness parameter η to be 0.05 under no assumption and under Assumption U, and 0.07 when Assumption M is added. When W is used, we choose η to be 0.08 under no assumption and 0.1 with Assumption M.

The bounds on the ATE are shown in Figure 5. The worst-case bounds on the ATE equal [−0.·, 0.·], which are tightened to [0.·, 0.37] under Assumption U and [0.·, 0.37] under Assumption M. It is interesting to note that the identifying power of the uniformity assumption and the shape restriction is similar in this example. When both Assumption U and Assumption M are imposed, the bounds are further tightened to [0.·, 0.·]. The bounds are narrower when W is exploited than when it is not, although the gains are not large.

Next, we consider the always-taker, complier, and never-taker LATEs. We consider these generalized LATEs conditional on X = x. Specifically, we focus on the treatment effects for males above age 45, with income below the median and bad health conditions. The results are shown in Table 3 and depicted in Figure 6. The LATE-C is analytically calculated via 2SLS. (When the alternative constraint (8.1) is used with the slackness parameter, the LATE-C is no longer a singleton.) For the LATE-AT and LATE-NT, Assumption U identifies the sign of the effects,
Table 3: Bounds on the Generalized LATEs

                       No Assumption   U          M              U+M        W              M+W
LATE-AT                [-0.93, 0.21]   [0, 0.20]  [-0.07, 0.15]  [0, 0.15]  [-0.76, 0.20]  [-0.01, 0.19]
LATE-C                 0.01            0.01       0.01           0.01       0.01           0.01
LATE-NT                [-0.24, 0.98]   [0, 0.97]  [-0.14, 0.95]  [0, 0.95]  [-0.22, 0.84]  [-0.08, 0.82]
Slackness parameter η  0.·             0.·        0.·            0.·        0.·            0.·

and Assumption M nearly identifies it. Using the variation in W mostly improves the bounds compared to the ones without it. From these results, we can conclude that possessing private insurance has the greatest effect on medical visits for never-takers, i.e., people who face higher insurance costs. This provides a policy implication: lowering the cost of private insurance is important, because high costs might hinder those with the most need from receiving enough medical services.

A Examples of the Target Parameters
Table 4 contains the list of target parameters. The table is taken from Mogstad et al. (2018).
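To illustrate how the weights in Table 4 generate target parameters as weighted averages of the MTE, the following sketch evaluates the ATE and the generalized LATEs for a hypothetical MTE function and made-up propensity scores (all numbers are for illustration only).

```python
import numpy as np

# Hypothetical MTE and propensity scores, for illustration only.
mte = lambda u: 0.2 + 0.1 * u            # m1(u) - m0(u)
p0, p1 = 0.3, 0.8                        # P(z_0) < P(z_1)

u = np.linspace(0.0, 1.0, 100_001)

def integral(y):
    """Trapezoid rule on the u grid."""
    return float(np.sum(0.5 * (y[:-1] + y[1:]) * np.diff(u)))

def weighted_avg(indicator):
    """Target parameter = ∫ MTE(u) 1[u ∈ set] du / ∫ 1[u ∈ set] du."""
    w = indicator(u).astype(float)
    return integral(mte(u) * w) / integral(w)

ate     = weighted_avg(lambda u: np.ones_like(u))        # u in [0, 1]
late_c  = weighted_avg(lambda u: (u >= p0) & (u <= p1))  # compliers
late_at = weighted_avg(lambda u: u <= p0)                # always-takers
late_nt = weighted_avg(lambda u: u >= p1)                # never-takers
print(f"ATE={ate:.3f} LATE-C={late_c:.3f} "
      f"LATE-AT={late_at:.3f} LATE-NT={late_nt:.3f}")
```

Each parameter differs only in the region of u over which the same MTE is averaged, which is exactly the structure that Table 4's weights encode.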
B More Discussions
B.1 Point-wise and Uniform Sharp Bounds on MTE
In Section 2, we provided some examples of target parameters. The building block for these parameters is the MTE, m₁(u) − m₀(u) (suppressing x). Heckman and Vytlacil (2005) show why this fundamental parameter can be of independent interest. Unlike the other target parameters proposed here, we may want to allow the MTE to be a function of u (beyond evaluating it at a fixed u). In this section, we discuss the subtle issue of point-wise and uniform sharp bounds on τ_MTE(u) ≡ m₁(u) − m₀(u) as a function of u.

Suppress X for simplicity. Recall q(u) ≡ {q(e | u)}_{e∈E} and Q ≡ {q(·): Σ_e q(e | u) = 1 ∀u and q(e | u) ≥ 0 ∀(e, u)}. Let M be the set of MTE functions, i.e.,

M ≡ { m₁(·) − m₀(·) : m_d(·) = E[Y_d | U = ·] = Σ_{e∈E: g_e(d)=1} q(e | ·) ∀d ∈ {0, 1} for q(·) ∈ Q }.

Table 4: Examples of the Target Parameters

– Average Treatment Effect (ATE): expression E[Y(1) − Y(0)]; range of u: [0, 1]; weight w_d(u, z, x) = 2d − 1.
– LATE for Compliers (LATE-C), given x ∈ X: expression E{Y(1) − Y(0) | u ∈ [P(z₀, x), P(z₁, x)]}; range [P(z₀, x), P(z₁, x)]; weight (2d − 1) · 1[u ∈ [P(z₀, x), P(z₁, x)]] / (P(z₁, x) − P(z₀, x)).
– LATE for Always-Takers (LATE-AT), given x ∈ X: expression E{Y(1) − Y(0) | u ∈ [0, P(z₀, x)]}; range [0, P(z₀, x)]; weight (2d − 1) · 1[u ∈ [0, P(z₀, x)]] / P(z₀, x).
– LATE for Never-Takers (LATE-NT), given x ∈ X: expression E{Y(1) − Y(0) | u ∈ [P(z₁, x), 1]}; range [P(z₁, x), 1]; weight (2d − 1) · 1[u ∈ [P(z₁, x), 1]] / (1 − P(z₁, x)).
– LATE for [u̲, ū]: expression E[Y(1) − Y(0) | u ∈ [u̲, ū]]; range [u̲, ū]; weight (2d − 1) · 1[u ∈ [u̲, ū]] / (ū − u̲).
– Marginal Treatment Effect (MTE)*: expression E[Y(1) − Y(0) | u′]; evaluation point u′; weight (2d − 1) · 1[u = u′].
– Policy Relevant Treatment Effect (PRTE) for a new policy (P′, Z′): expression (E(Y′) − E(Y)) / (E(D′) − E(D)); range [0, 1]; weight (2d − 1) · (Pr[u ≤ P′(z′)] − Pr[u ≤ P(z)]) / (E[P′(Z′)] − E[P(Z)]).

* The MTE uses the Dirac measure at u′, while the other target parameters use the Lebesgue measure on [0, 1].

The bounds on τ_MTE ∈ M in the ∞-LP are given by using a Dirac delta function as a weight. Therefore, given an evaluation point u ∈ [0, 1], the LP (∞-LP1)–(∞-LP3) can be simplified as follows, defining the upper and lower bounds τ̄(u) and τ̲(u) (being explicit about the evaluation point) on τ_MTE(u):

τ̄(u) = sup_{q∈Q} Σ_{e∈E: g_e(1)=1} q(e | u) − Σ_{e∈E: g_e(0)=1} q(e | u),    (B.1)
τ̲(u) = inf_{q∈Q} Σ_{e∈E: g_e(1)=1} q(e | u) − Σ_{e∈E: g_e(0)=1} q(e | u),    (B.2)

subject to

Σ_{e: g_e(d)=1} ∫_{U_{dz}} q(e | ũ) dũ = p(1, d | z)  ∀(d, z) ∈ {0, 1}².    (B.3)

Then, for any fixed u ∈ [0, 1], τ̲(u) ≤ τ_MTE(u) ≤ τ̄(u). We argue that these bounds are point-wise sharp but not necessarily uniformly sharp for τ_MTE(·). See Firpo and Ridder (2019) for related definitions of point-wise and uniform sharpness.

Definition B.1 (Point-wise Sharpness). τ̄(·) and τ̲(·) are point-wise sharp if, for any ū ∈ [0, 1], there exist τ̄_MTE,ū, τ̲_MTE,ū ∈ M such that τ̄(ū) = τ̄_MTE,ū(ū) and τ̲(ū) = τ̲_MTE,ū(ū).

Theorem B.1. τ̄(·) and τ̲(·) are point-wise sharp bounds on τ_MTE(·).

The proofs of this and other theorems appear later. Note that point-wise bounds will maintain some properties of an MTE function, but not all. For uniform sharpness, τ̄(·) and τ̲(·) themselves have to be MTE functions on [0, 1], i.e., τ̄(·) and τ̲(·) should be elements of M.

Definition B.2 (Uniform Sharpness). τ̄(·) and τ̲(·) are uniformly sharp if τ̄(·), τ̲(·) ∈ M.

The following theorem is almost immediate.
Theorem B.2. τ̄(·) is uniformly sharp if and only if there exists q*(·) ∈ Q such that q*(·) is in the feasible set and τ̄(u) = Σ_{e∈E: g_e(1)=1} q*(e | u) − Σ_{e∈E: g_e(0)=1} q*(e | u) for all u ∈ [0, 1]. Similarly, τ̲(·) is uniformly sharp if and only if there exists q†(·) ∈ Q such that q†(·) is in the feasible set and τ̲(u) = Σ_{e∈E: g_e(1)=1} q†(e | u) − Σ_{e∈E: g_e(0)=1} q†(e | u) for all u ∈ [0, 1].

The following is a more useful result that relates the point-wise bounds to uniform bounds. For each ū, let q*_ū(·) and q†_ū(·) be the point-wise maximizer and minimizer of (B.1)–(B.3), respectively.

Corollary B.1. τ̄(·) is uniformly sharp if and only if there exists q*(·) ∈ Q such that q*(·) is in the feasible set and q*_ū(ū) = q*(ū) for all ū ∈ [0, 1]. Also, τ̲(·) is uniformly sharp if and only if there exists q†(·) ∈ Q such that q†(·) is in the feasible set and q†_ū(ū) = q†(ū) for all ū ∈ [0, 1].

Based on the Bernstein approximation we introduce, this corollary implies that, for a uniform upper bound to exist, there should exist a common maximizer θ* such that θ* is in the feasible set of the LP and τ̄(u) = Σ_{k∈K} { Σ_{e∈E: g_e(1)=1} θ_k^{e*} b_k(u) − Σ_{e∈E: g_e(0)=1} θ_k^{e*} b_k(u) } for all u. In other words, if θ*_ū is the maximizer of the LP for a given ū, then there should exist θ* in the feasible set such that θ*_ū = θ* for all ū ∈ [0, 1]. In practice, this can be checked by taking a fine grid of u in [0,
1] and checking whether θ*_u is constant across all values in the grid.

B.2 Inference
It is important to construct a confidence set for our target parameter or its bounds in order to account for sampling variation in measuring treatment effectiveness. It would also be interesting to develop a procedure for specification tests of the identifying assumptions discussed in Section 6. The problem of statistical inference when the identified set is constructed via linear programming has been studied in, e.g., Deb et al. (2017), Mogstad et al. (2018), Hsieh et al. (2018), and Torgovitsky (2019b). Among these papers, Mogstad et al. (2017)'s setting is closest to ours, and their inference procedure can be directly adapted to our problem. Instead of repeating their result here, we only briefly discuss the procedure.

Recall that q(u, x) ≡ {q(e | u, x)}_{e∈E} is the latent distribution, p ≡ {p(1, d | z, x)}_{d,z,x} is the distribution of the data, and R_τ, R, R₁, and R₂ denote the linear operators of q(·) that correspond to the target parameter and the constraints. Consider the following hypotheses:

H₀: p ∈ P₀,  H₁: p ∈ P \ P₀,

where P₀ ≡ {p ∈ P : R̄q = ā for some q ∈ Q}, with the stacked operator and vector

R̄ ≡ (R_τ′, R′, R₁′, R₂′)′,  ā ≡ (τ, p′, a₁′, a₂′)′.

Suppose R̂ and â are sample counterparts of R̄ and ā. Then, a minimum distance test statistic can be constructed as

T_n(τ) ≡ inf_{q∈Q_K} √n ‖R̂q − â‖.

Similar to Mogstad et al. (2017), T_n(τ) is the solution to a convex optimization problem that can be reformulated as an LP using duality. A (1 − α)-confidence set for the target parameter τ can be constructed by inverting the test:

CS_{1−α} ≡ {τ : T_n(τ) ≤ ĉ_{1−α}},

where ĉ_{1−α} is the critical value of the test. The resulting object is of independent interest, and it can further be used to conduct specification tests. The large-sample theory for T_n(τ), as well as a bootstrap procedure to calculate ĉ_{1−α}, directly follows from Mogstad et al. (2017) and is omitted for succinctness.
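The test-inversion idea can be sketched as follows, again in a miniature model with hypothetical ingredients throughout: a piecewise-constant basis, made-up moments and sample size, the L1 norm in place of the paper's choice, and a placeholder critical value instead of the bootstrap one.

```python
import numpy as np
from scipy.optimize import linprog

# Toy version of the minimum-distance statistic T_n(tau); all ingredients are
# hypothetical placeholders, not the paper's implementation.
MAPS = {"never": (0, 0), "helped": (0, 1), "hurt": (1, 0), "always": (1, 1)}
EDGES = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
P = {0: 0.25, 1: 0.75}
p_hat = {(1, 0): 0.15, (1, 1): 0.45, (0, 0): 0.15, (0, 1): 0.05}
n = 10_000                                      # pretend sample size

names, nk = list(MAPS), len(EDGES) - 1
ntheta = len(names) * nk
idx = lambda e, k: names.index(e) * nk + k

def overlap(lo, hi, a, b):
    return max(0.0, min(hi, b) - max(lo, a))

rows, targets = [], []                          # stacked operator (R_tau', R')'
row = np.zeros(ntheta)                          # R_tau: ATE = ∫ (q_helped - q_hurt)
for k in range(nk):
    w = EDGES[k + 1] - EDGES[k]
    row[idx("helped", k)] = w
    row[idx("hurt", k)] = -w
rows.append(row); targets.append(None)          # target value tau filled per call
for (d, z), pv in p_hat.items():                # R: data moments
    row = np.zeros(ntheta)
    for e, (y0, y1) in MAPS.items():
        if (y1 if d == 1 else y0) == 1:
            for k in range(nk):
                lo, hi = EDGES[k], EDGES[k + 1]
                row[idx(e, k)] = (overlap(lo, hi, 0, P[z]) if d == 1
                                  else overlap(lo, hi, P[z], 1))
    rows.append(row); targets.append(pv)

def T_n(tau):
    """T_n(tau) = inf_{q in Q_K} sqrt(n) * ||R_hat q - a_hat||_1, via an LP."""
    m = len(rows)
    nvar = ntheta + m                           # theta, plus slacks t_j >= |resid_j|
    a_vec = [tau if t is None else t for t in targets]
    A_ub, b_ub = [], []
    for j, (r_j, a_j) in enumerate(zip(rows, a_vec)):
        for sgn in (1.0, -1.0):                 # +/-(r_j . theta - a_j) <= t_j
            r = np.zeros(nvar)
            r[:ntheta] = sgn * r_j
            r[ntheta + j] = -1.0
            A_ub.append(r); b_ub.append(sgn * a_j)
    A_eq = []                                   # q in Q_K: sum_e theta[e, k] = 1
    for k in range(nk):
        r = np.zeros(nvar)
        for e in names:
            r[idx(e, k)] = 1.0
        A_eq.append(r)
    cost = np.concatenate([np.zeros(ntheta), np.sqrt(n) * np.ones(m)])
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.ones(nk),
                  bounds=[(0, None)] * nvar, method="highs")
    return res.fun

c_hat = 1.0                                     # placeholder critical value
grid = np.linspace(-1.0, 1.0, 41)
cs = [t for t in grid if T_n(t) <= c_hat]       # inverted test: confidence set
print("confidence set ~ [", min(cs), ",", max(cs), "]")
```

With population-level moments the statistic is zero exactly on the identified set, so the inverted test returns (a grid approximation of) that set slightly inflated by the critical value, which is the shape of the actual procedure.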
B.3 Linear Programming with Continuous X

Suppose $X$ is continuously distributed and assume $\mathcal{X} = [0,1]^{d_X}$. Let $q(u,x) \equiv \{q(e|u,x)\}_{e \in \mathcal{E}}$ and $p(x) \equiv \{p(1,d|z,x)\}_{d,z}$. Recall that $R_\tau: \mathcal{Q} \to \mathbb{R}$ and $R: \mathcal{Q} \to \mathbb{R}^{d_p}$ are the linear operators of $q(\cdot)$, where $d_p$ is the dimension of $p$. Consider the following LP:
$$\overline{\tau} = \sup_{q \in \mathcal{Q}} R_\tau q, \quad (B.4)$$
$$\underline{\tau} = \inf_{q \in \mathcal{Q}} R_\tau q, \quad (B.5)$$
$$\text{s.t.} \quad (Rq)(x) = p(x) \text{ for all } x \in \mathcal{X}, \quad (B.6)$$
where $(Rq)(x) = p(x)$ emphasizes the dependence on $x$ and thus contains infinitely many constraints. Therefore, this LP is infinite-dimensional because of not only the decision variable but also the constraints. The problem with $q$ is addressed with the sieve approximation. To address the problem with continuous $X$, we proceed as follows. Note that, in general, $E|h(X)| = 0$ if and only if $h(x) = 0$ almost everywhere in $\mathcal{X}$. Therefore, each $j$-th equation in the equality restrictions (B.6) can be replaced by
$$E\big|(Rq)_j(X) - p_j(X)\big| = 0.$$
Now, for the sieve space of $\mathcal{Q}$, we consider
$$\tilde{\mathcal{Q}}_{\tilde{K}} \equiv \bigg\{ \Big\{ \sum_{k=1}^{\tilde{K}} \theta_{ek} b_k(u,x) \Big\}_{e \in \mathcal{E}} : \sum_{e \in \mathcal{E}} \theta_{ek} = 1 \ \forall k \in \tilde{\mathcal{K}} \text{ and } \theta_{ek} \ge 0 \ \forall (e,k) \bigg\} \subseteq \mathcal{Q}, \quad (B.7)$$
where $b_k(u,x)$ is a bivariate Bernstein polynomial and $\tilde{\mathcal{K}} \equiv \{1, \dots, \tilde{K}\}$. Then,
$$E[\tau_d(Z,X)] = \sum_{e: g_e(d)=1} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \int E[b_k(u,X) w_d(u,Z,X)]\,du \equiv \sum_{e: g_e(d)=1} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \tilde{\gamma}_{dk}, \quad (B.8)$$
where $\tilde{\gamma}_{dk} \equiv \int E[b_k(u,X) w_d(u,Z,X)]\,du$. Also,
$$p(y,d|z,x) = \sum_{e: g_e(d)=y} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \int_{\mathcal{U}_{dz,x}} b_k(u,x)\,du \equiv \sum_{e: g_e(d)=y} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \tilde{\delta}_{dk}(z,x), \quad (B.9)$$
where $\tilde{\delta}_{dk}(z,x) \equiv \int_{\mathcal{U}_{dz,x}} b_k(u,x)\,du$. Let $\tilde{\theta} \equiv \{\theta_{ek}\}_{(e,k) \in \mathcal{E} \times \tilde{\mathcal{K}}}$ and let
$$\tilde{\Theta}_{\tilde{K}} \equiv \bigg\{ \tilde{\theta} : \sum_{e \in \mathcal{E}} \theta_{ek} = 1 \ \forall k \in \tilde{\mathcal{K}} \text{ and } \theta_{ek} \ge 0 \ \forall (e,k) \bigg\}.$$
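The coefficients $\tilde{\delta}_{dk}(z,x)$ involve integrals of Bernstein basis functions over sub-intervals of $[0,1]$, which admit a closed form, so no numerical quadrature is needed. A minimal univariate sketch (the bivariate basis $b_k(u,x)$ is a tensor product of such terms; the function names here are our own):

```python
from math import comb

def bernstein(k, K, u):
    """Univariate Bernstein basis B_{k,K}(u) = C(K,k) u^k (1-u)^(K-k)."""
    return comb(K, k) * u**k * (1 - u)**(K - k)

def bernstein_integral(k, K, c):
    """Exact partial integral int_0^c B_{k,K}(u) du, via the identity
    int_0^c B_{k,K} = (1/(K+1)) * sum_{j=k+1}^{K+1} B_{j,K+1}(c)."""
    return sum(bernstein(j, K + 1, c) for j in range(k + 1, K + 2)) / (K + 1)
```

For instance, the full integral of any basis member is $1/(K+1)$, and evaluating the identity at the limits of $\mathcal{U}_{dz,x}$ gives each $\tilde{\delta}_{dk}$ entry exactly.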
Then, we can formulate the following finite-dimensional LP:
$$\overline{\tau}_{\tilde{K}} = \max_{\tilde{\theta} \in \tilde{\Theta}_{\tilde{K}}} \sum_{k \in \tilde{\mathcal{K}}} \bigg\{ \sum_{e: g_e(1)=1} \theta_{ek} \tilde{\gamma}_{1k} - \sum_{e: g_e(0)=1} \theta_{ek} \tilde{\gamma}_{0k} \bigg\}, \quad (B.10)$$
$$\underline{\tau}_{\tilde{K}} = \min_{\tilde{\theta} \in \tilde{\Theta}_{\tilde{K}}} \sum_{k \in \tilde{\mathcal{K}}} \bigg\{ \sum_{e: g_e(1)=1} \theta_{ek} \tilde{\gamma}_{1k} - \sum_{e: g_e(0)=1} \theta_{ek} \tilde{\gamma}_{0k} \bigg\}, \quad (B.11)$$
subject to
$$E\bigg| \sum_{e: g_e(d)=1} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \tilde{\delta}_{dk}(Z,X) - p(1,d|Z,X) \bigg| = 0. \quad (B.12)$$
In estimation, we use the sample counterparts $\hat{\tilde{\gamma}}_{dk}$ and $\hat{\tilde{\delta}}_{dk}$ for $\tilde{\gamma}_{dk}$ and $\tilde{\delta}_{dk}$, and (B.12) can be imposed with slackness via
$$\frac{1}{n} \sum_{i=1}^n \bigg| \sum_{e: g_e(d)=1} \sum_{k \in \tilde{\mathcal{K}}} \theta_{ek} \hat{\tilde{\delta}}_{dk}(Z_i,X_i) - \hat{p}(1,d|Z_i,X_i) \bigg| \le \eta,$$
where $\hat{p}(1,d|z,x)$ is a preliminary estimate of $p(1,d|z,x)$ and $\eta$ is the slackness parameter. We may later want to introduce additional constraints from some identifying assumptions:
$$R_1 q = a_1, \quad (B.13)$$
$$R_2 q \le a_2. \quad (B.14)$$
For the equality restrictions (B.13), we can use the same approach that transforms (B.6). For the inequality restrictions (B.14), we can allow any identifying assumptions for which $R_2$ is a matrix rather than an operator:

Assumption MAT. $R_2$ is a $\dim(a_2) \times \dim(q)$ matrix.

Assumptions M and C and the unconditional version of Assumption MTS satisfy this condition.

B.4 Equivalence with the IV-Like Estimands
We draw a connection between our approach and that of Mogstad et al. (2018). In particular, we show that the identified set of MTR functions $\mathcal{M}_{id}$ used in Mogstad et al. (2018) is equivalent to the set of MTR functions derived from the feasible set used in this paper. Therefore, the feasible set in this paper contains no less information about the data than is contained in $\mathcal{M}_{id}$ via the IV-like estimands in their paper.

The IV-like estimand is defined in Proposition 3 of Mogstad et al. (2018) and is stated below.

Proposition B.1 (IV-like Estimand from Mogstad et al. (2018)). Suppose that $s: \{0,1\} \times \mathbb{R}^{d_z} \times \mathbb{R}^{d_x} \to \mathbb{R}$ is an identified (or known) function that is measurable and has a finite second moment. We refer to such a function $s$ as an IV-like specification and to $\beta_s \equiv E[s(D,Z,X)Y]$ as an IV-like estimand. If $(Y,D)$ are generated according to Assumptions SEL and EX, then
$$\beta_s = E\Big[\int_0^1 m_0(u,X)\,\omega_{0s}(u,Z,X)\,du\Big] + E\Big[\int_0^1 m_1(u,X)\,\omega_{1s}(u,Z,X)\,du\Big], \quad (B.15)$$
where $\omega_{0s}(u,z,x) = s(0,z,x)1[u > p(z,x)]$ and $\omega_{1s}(u,z,x) = s(1,z,x)1[u \le p(z,x)]$.

For the MTR functions to be consistent with the data, the following conditions need to be satisfied:
$$E[Y|D=0,Z,X] = E[Y|U > p(Z,X),Z,X] = \frac{1}{1-P(Z,X)} \int_{p(Z,X)}^1 m_0(u,X)\,du, \quad (B.16)$$
$$E[Y|D=1,Z,X] = E[Y|U \le p(Z,X),Z,X] = \frac{1}{P(Z,X)} \int_0^{p(Z,X)} m_1(u,X)\,du. \quad (B.17)$$
Define the identified set as
$$\mathcal{M}_{id} = \Big\{ m = (m_0, m_1),\ m_0, m_1 \in L^2 : m_0, m_1 \text{ satisfy equations (B.16) and (B.17) a.s.} \Big\}.$$
This identified set is defined in Mogstad et al. (2018, Section 2.5). The definition follows from the fact that the MTR functions in $\mathcal{M}_{id}$ are compatible with the observed conditional means of $Y$. In this sense, it exhausts the information about the data contained in the conditional means. When $Y$ is binary, the conditional means of $Y$ contain the information of the complete distribution.
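To illustrate Proposition B.1, the following sketch checks (B.15) numerically in a small made-up example: a binary instrument with $\Pr[Z=1]=1/2$, hypothetical propensity scores $p(0)=0.4$ and $p(1)=0.6$, and hypothetical MTR functions $m_0(u)=u^2$, $m_1(u)=u$ (none of these come from the paper). With the Wald-type specification $s(d,z) = (z - E[Z])/\mathrm{Cov}(Z,D)$, the IV-like estimand $\beta_s$, computed either directly or via the weights in (B.15), reproduces the LATE.

```python
from scipy.integrate import quad

# hypothetical toy model (for illustration only)
p = {0: 0.4, 1: 0.6}            # propensity scores p(z)
m0 = lambda u: u**2             # MTR for d = 0
m1 = lambda u: u                # MTR for d = 1

# Wald (IV) specification: s(d,z) = (z - E[Z]) / Cov(Z,D), with Pr[Z=1]=1/2
cov_zd = 0.5 * p[1] - 0.5 * 0.5 * (p[0] + p[1])
s = lambda d, z: (z - 0.5) / cov_zd

def beta_direct():
    """beta_s = E[s(D,Z)Y], computed by conditioning on Z (s is free of d here)."""
    total = 0.0
    for z in (0, 1):
        EY_z = quad(m1, 0, p[z])[0] + quad(m0, p[z], 1)[0]
        total += 0.5 * s(None, z) * EY_z
    return total

def beta_via_B15():
    """RHS of (B.15): w0s = s(0,z)1[u > p(z)], w1s = s(1,z)1[u <= p(z)]."""
    total = 0.0
    for z in (0, 1):
        total += 0.5 * (s(1, z) * quad(m1, 0, p[z])[0]
                        + s(0, z) * quad(m0, p[z], 1)[0])
    return total

# LATE for compliers u in [p(0), p(1)]
late = quad(lambda u: m1(u) - m0(u), p[0], p[1])[0] / (p[1] - p[0])
```

In this toy model both computations agree with the LATE, consistent with the weighted-average representation of IV-like estimands.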
Define the feasible set $\mathcal{Q}_f$ as
$$\mathcal{Q}_f = \Big\{ q \in L^2 : q \in \mathcal{Q} \text{ and } q \text{ satisfies equation } (\infty\text{-LP3}) \Big\}.$$
To establish the connection with $\mathcal{M}_{id}$, we construct the set of MTR functions based on the feasible set:
$$\mathcal{M}_f = \Big\{ m = (m_0, m_1) : m_d = \sum_{e: g_e(d)=1} q(e|u,x),\ d \in \{0,1\},\ q \in \mathcal{Q}_f \Big\}.$$
Then the following holds, the proof of which appears later:
Theorem B.3.
Suppose $Y$ is discretely distributed. Under Assumptions SEL and EX, $\mathcal{M}_f = \mathcal{M}_{id}$.

Proposition 3 of Mogstad et al. (2018) shows an equivalence relationship between the identified set $\mathcal{M}_{id}$ and the set of MTR functions satisfying constraints based on selected IV-like estimands. Theorem B.3 shows that the information contained in the feasible set used in our LP is the same as that in the selected IV-like estimands that exhaust the available information. Theorem B.3 can be extended to the case where $Y$ is discrete and $X$ is continuous. When $Y$ is a non-binary discrete outcome variable, $\mathcal{M}_{id}$ and $\mathcal{M}_f$ only exhaust the information on the conditional means, but not other distributional information. Nonetheless, that missing information is captured by the set $\mathcal{Q}_f$ that we use as our constraint set, because $q(e|u)$ is defined as the conditional probability of $Y$ taking each value.

C Proofs
C.1 Proof of Lemma 3.1
Fix $(d,z,x)$. By $\sum_{e \in \mathcal{E}} q(e|u,x) = 1$ for $q \in \mathcal{Q}$, we have
$$1 = \sum_{e \in \mathcal{E}} q(e|u,x) = \sum_{e: g_e(d)=1} q(e|u,x) + \sum_{e: g_e(d)=0} q(e|u,x).$$
Then, in ($\infty$-LP3), the constraint with $p(0,d|z,x)$ can be written as
$$p(0,d|z,x) = \int_{\mathcal{U}_{dz,x}} \sum_{e: g_e(d)=0} q(e|u,x)\,du = \int_{\mathcal{U}_{dz,x}} \Big\{ 1 - \sum_{e: g_e(d)=1} q(e|u,x) \Big\}\,du = \Pr[D=d|Z=z,X=x] - \int_{\mathcal{U}_{dz,x}} \sum_{e: g_e(d)=1} q(e|u,x)\,du,$$
which is equivalent to the constraint
$$p(1,d|z,x) = \int_{\mathcal{U}_{dz,x}} \sum_{e: g_e(d)=1} q(e|u,x)\,du,$$
since $\Pr[D=d|Z=z,X=x] - p(0,d|z,x) = p(1,d|z,x)$. Therefore, the constraint with $p(0,d|z,x)$ does not contribute to the restrictions imposed by ($\infty$-LP3) and $q \in \mathcal{Q}$. $\Box$

C.2 Proof of Theorem 5.1
In proving the claim of the theorem, note that $Z$ can be fixed at a certain value, so we fix $Z = z$ here. We first prove Case (a). To simplify notation, let $q(e_1, \dots, e_J|u) \equiv \Pr[\epsilon \in \{e_1, \dots, e_J\}|u] = \sum_{j=1}^J q(e_j|u)$. Based on Table 1, we can easily derive
$$p(1,1|z,1) = \int_0^{P(z)} \sum_{e: g_e(1,1)=1} q(e|u)\,du = \int_0^{P(z)} q(9,\dots,16|u)\,du,$$
$$p(1,1|z,0) = \int_0^{P(z)} \sum_{e: g_e(1,0)=1} q(e|u)\,du = \int_0^{P(z)} q(5,\dots,8,13,\dots,16|u)\,du,$$
$$p(1,0|z,1) = \int_{P(z)}^1 \sum_{e: g_e(0,1)=1} q(e|u)\,du = \int_{P(z)}^1 q(3,4,7,8,11,12,15,16|u)\,du,$$
$$p(1,0|z,0) = \int_{P(z)}^1 \sum_{e: g_e(0,0)=1} q(e|u)\,du = \int_{P(z)}^1 q(2,4,6,8,10,12,14,16|u)\,du.$$
Define the operator
$$T_{dz} q_e \equiv \int_{\mathcal{U}_{dz}} q(e|u)\,du.$$
For the r.h.s. $(p_{11|z1}, p_{11|z0}, p_{10|z1}, p_{10|z0})'$ of the constraints in (LPW3) that correspond to $Z = z$, the corresponding l.h.s. is
$$\begin{pmatrix} \int_0^{P(z)} q(9,\dots,16|u)\,du \\ \int_0^{P(z)} q(5,\dots,8,13,\dots,16|u)\,du \\ \int_{P(z)}^1 q(3,4,7,8,11,12,15,16|u)\,du \\ \int_{P(z)}^1 q(2,4,6,8,10,12,14,16|u)\,du \end{pmatrix} = Tq,$$
where $T$ is the $4 \times 16$ matrix of operators implicitly defined (its first two rows contain $T_{1z}$ in the columns of the types appearing in the corresponding equation above and zeros elsewhere, and its last two rows analogously contain $T_{0z}$), and $q(u) \equiv (q(1|u), \dots, q(16|u))'$.

Now, for $q \in \mathcal{Q}_K$, define a $16K$-vector $\theta \equiv (\theta_1', \dots, \theta_{16}')'$, where, for each $e \in \{1, \dots, 16\}$, $\theta_e \equiv (\theta_{e1}, \dots, \theta_{eK})'$. Similarly, let $b(u) \equiv (b_1(u), \dots, b_K(u))'$. Then we have $q(e|u) = b(u)'\theta_e$. Let $H$ be a $16 \times 16$ diagonal matrix of 1's and 0's that imposes additional identifying assumptions on the outcome data-generating process. In this proof, $H$ is used to incorporate Assumption R(i). Given $H$, the constraints in (LPW3) that correspond to $Z = z$ can be written as
$$THq = \{TH \otimes b'\}\theta = (p_{11|z1}, p_{11|z0}, p_{10|z1}, p_{10|z0})'.$$

Now we prove the claim of the theorem. Suppose the claim is not true, i.e., the even rows are linearly dependent on the odd rows in $TH$. Given the form of $T$, which has full rank under Assumption R(ii)(a), this linear dependence occurs only when $H$ is such that $H_{jj} = 1$ exactly for the four types $j$ with $Y(d,1) = Y(d,0)$ for each $d$, and 0 otherwise. But, according to Table 1, this implies that $\Pr[Y(d,w) \neq Y(d,w')] = 0$ for all $d$ and $w \neq w'$, which contradicts Assumption R(i). This proves the theorem for Case (a).

Now we prove the theorem for Case (b), analogously to the previous case. For every $z$, we can derive
$$p(1,1|z,1) = \int_0^{P(z,1)} \sum_{e: g_e(1,1)=1} q(e|u)\,du = \int_0^{P(z,1)} q(9,\dots,16|u)\,du,$$
$$p(1,1|z,0) = \int_0^{P(z,0)} \sum_{e: g_e(1,0)=1} q(e|u)\,du = \int_0^{P(z,0)} q(5,\dots,8,13,\dots,16|u)\,du,$$
$$p(1,0|z,1) = \int_{P(z,1)}^1 \sum_{e: g_e(0,1)=1} q(e|u)\,du = \int_{P(z,1)}^1 q(3,4,7,8,11,12,15,16|u)\,du,$$
$$p(1,0|z,0) = \int_{P(z,0)}^1 \sum_{e: g_e(0,0)=1} q(e|u)\,du = \int_{P(z,0)}^1 q(2,4,6,8,10,12,14,16|u)\,du.$$
Define
$$T_{dz,w} q_e \equiv \int_{\mathcal{U}_{dz,w}} q(e|u)\,du,$$
where $\mathcal{U}_{dz,w}$ can be analogously defined. Then,
$$\begin{pmatrix} \int_0^{P(z,w)} q(9,\dots,16|u)\,du \\ \int_0^{P(z,w')} q(5,\dots,8,13,\dots,16|u)\,du \\ \int_{P(z,w)}^1 q(3,4,7,8,11,12,15,16|u)\,du \\ \int_{P(z,w')}^1 q(2,4,6,8,10,12,14,16|u)\,du \end{pmatrix} = \tilde{T}q,$$
where $\tilde{T}$ is a matrix of operators implicitly defined, with rows built from $T_{1z,w}$, $T_{1z,w'}$, $T_{0z,w}$, and $T_{0z,w'}$, respectively. Then, inserting $H$, the constraint becomes
$$\tilde{T}Hq = \big\{\tilde{T}H \otimes b'\big\}\theta = (p_{11|z1}, p_{11|z0}, p_{10|z1}, p_{10|z0})'.$$
The remaining argument is the same as in the previous case, which completes the proof. $\Box$
C.3 Proof of Theorem B.1
For any given $\bar{u} \in [0,1]$, $\overline{\tau}(\bar{u}) = \sum_{e \in \mathcal{E}: g_e(1)=1} q^*_{\bar{u}}(e|\bar{u}) - \sum_{e \in \mathcal{E}: g_e(0)=1} q^*_{\bar{u}}(e|\bar{u})$ for some $q^*_{\bar{u}}(\cdot) \equiv \{q^*_{\bar{u}}(e|\cdot)\}_{e \in \mathcal{E}}$ in the feasible set of the LP, (B.1) and (B.3). Therefore, $\overline{\tau}(\bar{u}) = \tau_{MTE,\bar{u}}(\bar{u})$ for $\tau_{MTE,\bar{u}}(\bar{u}) = \sum_{e \in \mathcal{E}: g_e(1)=1} q^*_{\bar{u}}(e|\bar{u}) - \sum_{e \in \mathcal{E}: g_e(0)=1} q^*_{\bar{u}}(e|\bar{u})$, which is in $\mathcal{M}$ by definition. We can have a symmetric proof for $\underline{\tau}(\cdot)$. $\Box$

C.4 Proof of Theorem B.2
Again, by the fact that $\tau_{MTE}(\cdot) = \sum_{e \in \mathcal{E}: g_e(1)=1} q(e|\cdot) - \sum_{e \in \mathcal{E}: g_e(0)=1} q(e|\cdot)$ in general, $\overline{\tau}(u) = \sum_{e \in \mathcal{E}: g_e(1)=1} q^*(e|u) - \sum_{e \in \mathcal{E}: g_e(0)=1} q^*(e|u)$ for all $u \in [0,1]$ is equivalent to $\overline{\tau}(\cdot)$ being contained in $\mathcal{M}$, and similarly for $\underline{\tau}(\cdot)$. $\Box$

C.5 Proof of Theorem B.3
From ($\infty$-LP3), we can write $E[Y|D=0,Z,X]$ in terms of $q(e|u,X)$ as below:
$$E[Y|D=0,Z,X] = \Pr[Y=1|D=0,Z,X] = \frac{\Pr[Y=1,D=0|Z,X]}{\Pr[D=0|Z,X]} = \frac{1}{1-P(Z,X)} \sum_{e: g_e(0)=1} \int_{P(Z,X)}^1 q(e|u,X)\,du = \frac{1}{1-P(Z,X)} \int_{P(Z,X)}^1 \sum_{e: g_e(0)=1} q(e|u,X)\,du. \quad (C.1)$$
Therefore, for $(m_0, m_1) \in \mathcal{M}_f$,
$$E[Y|D=0,Z,X] = \frac{1}{1-P(Z,X)} \int_{P(Z,X)}^1 m_0(u,X)\,du,$$
and, symmetrically,
$$E[Y|D=1,Z,X] = \frac{1}{P(Z,X)} \int_0^{P(Z,X)} m_1(u,X)\,du.$$
We conclude that $\mathcal{M}_f \subset \mathcal{M}_{id}$.

Now suppose $m \in \mathcal{M}_{id}$. By (B.16) and (C.1), for all $z, x$,
$$\frac{1}{1-P(z,x)} \int_{P(z,x)}^1 m_0(u,x)\,du = \frac{1}{1-P(z,x)} \sum_{e: g_e(0)=1} \int_{P(z,x)}^1 q(e|u,x)\,du,$$
and hence
$$\int_{P(z,x)}^1 \Big\{ m_0(u,x) - \sum_{e: g_e(0)=1} q(e|u,x) \Big\}\,du = 0.$$
Since this equality holds for all possible values of $P(z,x)$, we conclude by the fundamental theorem of calculus that $m_0(u,x) = \sum_{e: g_e(0)=1} q(e|u,x)$ on the support $u \in [0,1]$ for all $x$. Following the symmetric procedure, we can conclude that $m_1(u,x) = \sum_{e: g_e(1)=1} q(e|u,x)$. This shows $\mathcal{M}_{id} \subset \mathcal{M}_f$. Thus, $\mathcal{M}_f = \mathcal{M}_{id}$. $\Box$

References
Angrist, J. and I. Fernandez-Val (2010): “ExtrapoLATE-ing: External validity and overidentification in the LATE framework,” Tech. rep., National Bureau of Economic Research.

Balat, J. F. and S. Han (2018): “Multiple treatments with strategic interaction,” Available at SSRN 3182766.

Balke, A. and J. Pearl (1997): “Bounds on treatment effects from studies with imperfect compliance,” Journal of the American Statistical Association, 92, 1171–1176.

Bertanha, M. and G. W. Imbens (2019): “External validity in fuzzy regression discontinuity designs,” Journal of Business & Economic Statistics, 1–39.

Bhattacharya, J., A. M. Shaikh, and E. Vytlacil (2008): “Treatment effect bounds under monotonicity assumptions: an application to Swan-Ganz catheterization,” American Economic Review, 98, 351–356.

Brinch, C. N., M. Mogstad, and M. Wiswall (2017): “Beyond LATE with a discrete instrument,” Journal of Political Economy, 125, 985–1039.

Chen, X. (2007): “Large sample sieve estimation of semi-nonparametric models,” Handbook of Econometrics, 6, 5549–5632.

Chen, X. and T. Christensen (2015): “Optimal sup-norm rates, adaptivity and inference in nonparametric instrumental variables estimation.”

Chen, X., E. T. Tamer, and A. Torgovitsky (2011): “Sensitivity analysis in semiparametric likelihood models.”

Chen, X., J. Tan, Z. Liu, and J. Xie (2017): “Approximation of functions by a new family of generalized Bernstein operators,” Journal of Mathematical Analysis and Applications, 450, 244–261.

Chernozhukov, V. and C. Hansen (2005): “An IV model of quantile treatment effects,” Econometrica, 73, 245–261.

Coolidge, J. L. (1949): “The story of the binomial theorem,” The American Mathematical Monthly, 56, 147–157.

Cornelissen, T., C. Dustmann, A. Raute, and U. Schönberg (2016): “From LATE to MTE: Alternative methods for the evaluation of policy interventions,” Labour Economics, 41, 47–60.

Deb, R., Y. Kitamura, J. K.-H. Quah, and J. Stoye (2017): “Revealed price preference: Theory and stochastic testing.”

Dehejia, R., C. Pop-Eleches, and C. Samii (2019): “From local to global: External validity in a fertility natural experiment,” Journal of Business & Economic Statistics, 1–27.

Dunlop, D. D., L. M. Manheim, J. Song, and R. W. Chang (2002): “Gender and ethnic/racial disparities in health care utilization among older adults,” The Journals of Gerontology Series B: Psychological Sciences and Social Sciences, 57, S221–S233.

Finkelstein, A., S. Taubman, B. Wright, M. Bernstein, J. Gruber, J. P. Newhouse, H. Allen, K. Baicker, and O. H. S. Group (2012): “The Oregon health insurance experiment: evidence from the first year,” The Quarterly Journal of Economics, 127, 1057–1106.

Firpo, S. and G. Ridder (2019): “Partial identification of the treatment effect distribution and its functionals,” Journal of Econometrics, 213, 210–234.

Gunsilius, F. (2019): “Bounds in continuous instrumental variable models,” arXiv preprint arXiv:1910.09502.

Han, S. (2020a): “Nonparametric estimation of triangular simultaneous equations models under weak identification,” Quantitative Economics, 11, 161–202.

——— (2020b): “Optimal Dynamic Treatment Regimes and Partial Welfare Ordering,” arXiv preprint arXiv:1912.10014.

Han, S. and S. Lee (2019): “Estimation in a generalization of bivariate probit models with dummy endogenous regressors,” Journal of Applied Econometrics, 34, 994–1015.

Han, S. and E. J. Vytlacil (2017): “Identification in a generalization of bivariate probit models with dummy endogenous regressors,” Journal of Econometrics, 199, 63–73.

Heckman, J. J. and E. Vytlacil (2005): “Structural equations, treatment effects, and econometric policy evaluation,” Econometrica, 73, 669–738.

Heckman, J. J. and E. J. Vytlacil (1999): “Local instrumental variables and latent variable models for identifying and bounding treatment effects,” Proceedings of the National Academy of Sciences, 96, 4730–4734.

Hsieh, Y.-W., X. Shi, and M. Shum (2018): “Inference on estimators defined by mathematical programming,” Available at SSRN 3041040.

Hurd, M. D. and K. McGarry (1997): “Medical insurance and the use of health care services by the elderly,” Journal of Health Economics, 16, 129–154.

Imbens, G. W. and J. D. Angrist (1994): “Identification and Estimation of Local Average Treatment Effects,” Econometrica, 62, 467–475.

Joy, K. I. (2000): “Bernstein polynomials,” On-Line Geometric Modeling Notes, 13.

Kamat, V. (2019): “Identification with latent choice sets: The case of the Head Start Impact Study,” arXiv preprint arXiv:1711.02048.

Kitamura, Y. and J. Stoye (2019): “Nonparametric Counterfactuals in Random Utility Models,” arXiv preprint arXiv:1902.08350.

Kowalski, A. E. (2020): “Reconciling Seemingly Contradictory Results from the Oregon Health Insurance Experiment and the Massachusetts Health Reform,” Tech. rep., National Bureau of Economic Research.

Machado, C., A. Shaikh, and E. Vytlacil (2019): “Instrumental variables and the sign of the average treatment effect,” Journal of Econometrics, 212, 522–555.

Manski, C. F. (1997): “Monotone treatment response,” Econometrica, 1311–1334.

——— (2007): “Partial identification of counterfactual choice probabilities,” International Economic Review, 48, 1393–1410.

Manski, C. F. and J. V. Pepper (2000): “Monotone instrumental variables: With an application to the returns to schooling,” Econometrica, 68, 997–1010.

Masten, M. A. and A. Poirier (2018): “Salvaging falsified instrumental variable models,” arXiv preprint arXiv:1812.11598.

Mogstad, M., A. Santos, and A. Torgovitsky (2017): “Using Instrumental Variables for Inference about Policy Relevant Treatment Effects,” Tech. rep., National Bureau of Economic Research.

——— (2018): “Using instrumental variables for inference about policy relevant treatment parameters,” Econometrica, 86, 1589–1619.

Mogstad, M., A. Torgovitsky, and C. R. Walters (2019): “Identification of causal effects with multiple instruments: Problems and some solutions,” Tech. rep., National Bureau of Economic Research.

Mourifié, I. (2015): “Sharp bounds on treatment effects in a binary triangular system,” Journal of Econometrics, 187, 74–81.

Muralidharan, K., A. Singh, and A. J. Ganimian (2019): “Disrupting education? Experimental evidence on technology-aided instruction in India,” American Economic Review, 109, 1426–1460.

Shaikh, A. M. and E. J. Vytlacil (2011): “Partial identification in triangular systems of equations with binary dependent variables,” Econometrica, 79, 949–955.

Taubman, S. L., H. L. Allen, B. J. Wright, K. Baicker, and A. N. Finkelstein (2014): “Medicaid increases emergency-department use: evidence from Oregon’s Health Insurance Experiment,” Science, 343, 263–268.

Tebaldi, P., A. Torgovitsky, and H. Yang (2019): “Nonparametric estimates of demand in the California health insurance exchange,” Tech. rep., National Bureau of Economic Research.

Torgovitsky, A. (2019a): “Nonparametric Inference on State Dependence in Unemployment,” Econometrica, 87, 1475–1505.

——— (2019b): “Nonparametric inference on state dependence in unemployment,” Econometrica, 87, 1475–1505.

Vuong, Q. and H. Xu (2017): “Counterfactual mapping and individual treatment effects in nonseparable models with binary endogeneity,” Quantitative Economics, 8, 589–610.

Vytlacil, E. (2002): “Independence, monotonicity, and latent index models: An equivalence result,” Econometrica, 70, 331–341.

Vytlacil, E. and N. Yildiz (2007): “Dummy endogenous variables in weakly separable models,”