Identification and Estimation of Weakly Separable Models Without Monotonicity

Abstract

We study the identification and estimation of treatment effect parameters in weakly separable models. In their seminal work, Vytlacil and Yildiz (2007) showed how to identify and estimate the average treatment effect of a dummy endogenous variable when the outcome is weakly separable in a single index. Their identification result builds on a monotonicity condition with respect to this single index. In comparison, we consider similar weakly separable models with multiple indices, and relax the monotonicity condition for identification. Unlike Vytlacil and Yildiz (2007), we exploit the full information in the distribution of the outcome variable, instead of just its mean. Indeed, when the outcome distribution function is more informative than the mean, our method is applicable to more general settings than theirs; in particular we do not rely on their monotonicity assumption and at the same time we also allow for multiple indices. To illustrate the advantage of our approach, we provide examples of models where our approach can identify parameters of interest whereas existing methods would fail. These examples include models with multiple unobserved disturbance terms such as the Roy model and multinomial choice models with dummy endogenous variables, as well as potential outcome models with endogenous random coefficients. Our method is easy to implement and can be applied to a wide class of models. We establish standard asymptotic properties such as consistency and asymptotic normality.


arXiv [econ.EM]

Dummy Endogenous Variables in Weakly Separable Multiple Index Models without Monotonicity ∗

Songnian Chen (HKUST), Shakeeb Khan (Boston College), Xun Tang (Rice University)

April 6, 2020


JEL Classification: C14, C31, C35

Key Words

Weak Separability, Treatment Effects, Monotonicity, Endogeneity

∗ We are grateful to Jeremy Fox, Sukjin Han, Arthur Lewbel, Elie Tamer, Ed Vytlacil, Haiqing Xu, and participants at the 2018 West Indies Economic Conference, the 2019 SUFE econometrics meetings, and the 2020 Texas Camp Econometrics for helpful comments and suggestions.

Introduction

Consider a weakly separable model with a binary endogenous variable:

$$Y = g\big(v_1(X,D), v_2(X,D), \ldots, v_J(X,D), \varepsilon\big) \tag{1.1}$$

$$D = 1\{\theta(Z) - U > 0\} \tag{1.2}$$

where $(v_1(X,D), v_2(X,D), \ldots, v_J(X,D)) \equiv v(X,D)$ is a $J$-vector of unknown linear or nonlinear indices in the outcome equation (1.1) and $D$ is a binary endogenous variable defined by the selection equation (1.2). Here $X \in \mathbb{R}^{d_x}$ and $Z \in \mathbb{R}^{d_z}$ are vectors of observable exogenous variables, which may have overlapping elements. As in Vytlacil and Yildiz (2007), we require exclusion restrictions: some element of $Z$ is excluded from $X$, and $X$ can vary after conditioning on $\theta(Z)$. In the system of equations above, $U$ is an unobservable random variable normalized to follow the uniform distribution $U(0,1)$, and the error term $\varepsilon$ in the outcome equation is allowed to be a random vector. We assume $(X,Z)$ is independent of $(\varepsilon,U)$. Note that we allow $v(X,D)$ to be a vector of multiple indices, whereas the method in Vytlacil and Yildiz (2007) can only be applied when it is a single index.

Since Vytlacil and Yildiz (2007), other important work has considered identification and estimation of related models, but under alternative conditions. Examples with binary endogenous variables include Han and Vytlacil (2017), Vuong and Xu (2017), Lewbel, Jacho-Chavez, and Encarnciono (2016), and Khan, Maurel, and Zhang (2019). Work on models where the endogenous variable is continuous includes Imbens and Newey (2009), D'Haultfoeuille and Fevrier (2015) and Torgovitsky (2015). Feng (2020) shows how to identify nonseparable triangular models where the endogenous variable is discrete and has larger support than the instrumental variable. All these papers focus on point identification; for partial identification of a model with a binary outcome, see Shaikh and Vytlacil (2011) and Mourifié (2015).

As in the conventional framework, the two potential outcomes $Y_1$ and $Y_0$ satisfy $Y_D = g(v(X,D),\varepsilon)$ for $D = 0,1$. We only observe $(Y,D,X,Z)$, where $Y = DY_1 + (1-D)Y_0$. In this model, as in Vytlacil and Yildiz (2007), we do not impose a parametric distribution on the error term or a linear index structure. Vytlacil and Yildiz (2007) assume that $v(X,D) \in \mathbb{R}$ is a single index, and

$$E[g(v,\varepsilon) \mid U = u] \text{ is strictly increasing in } v \in \mathbb{R} \text{ for all } u. \tag{1.3}$$

In comparison, we allow for multiple indices, $v(X,D) = (v_1(X,D), v_2(X,D), \ldots, v_J(X,D)) \in \mathbb{R}^J$.

We consider the identification and estimation of the average treatment effect of $D$ on $Y$, namely $E(Y_1 \mid X \in A)$, $E(Y_0 \mid X \in A)$ and $E(Y_1 - Y_0 \mid X \in A)$ for some set $A$, without the aforementioned monotonicity. Indeed, for the case with multiple indices $v(X,D) \in \mathbb{R}^J$, the monotonicity condition is no longer well defined.

Vuong and Xu (2017) established nonparametric identification of individual treatment effects in a fully nonseparable model that includes a binary endogenous regressor, without the nonlinear index structure. They assume $\varepsilon$ is a scalar and $g$ is strictly increasing in $\varepsilon$. In their setting, monotonicity in the outcome equation provides the identifying restriction to extrapolate information from local treatment effects to population treatment effects.

Generally speaking, our identification strategy is based on the notion of matching. (See Ahn and Powell (1993), Chen, Khan, and Tang (2016), Vytlacil and Yildiz (2007), and more recently Auerbach (2019) for examples of papers that attain identification through matching.) Consider the identification of $E(Y_1 \mid X = x)$ for some $x \in S_1$, where $S_d$ denotes the support of $X$ given $D = d \in \{0,1\}$. Because $(\varepsilon,U) \perp (X,Z)$,

$$\begin{aligned}
E(Y_1 \mid X = x) &= E(Y_1 \mid X = x, Z = z) \\
&= E(DY_1 \mid X = x, Z = z) + E[(1-D)Y_1 \mid X = x, Z = z] \\
&= P(z)\,E(Y \mid D = 1, X = x, Z = z) + [1 - P(z)]\,E(Y_1 \mid D = 0, X = x, Z = z)
\end{aligned} \tag{2.1}$$

where $P(z) \equiv E(D \mid Z = z)$.
The only term that is not directly identifiable on the right-hand side of (2.1) is $E(Y_1 \mid D = 0, X = x, Z = z) = E[g(v(x,1),\varepsilon) \mid U \geq P(z)]$.

The main idea behind our approach follows that of Vytlacil and Yildiz (2007), which is to find some $\tilde x \in S_0$ such that

$$v(x,1) = v(\tilde x,0) \tag{2.2}$$

so that

$$E(Y \mid D = 0, X = \tilde x, Z = z) = E(Y_0 \mid D = 0, X = \tilde x, Z = z) = E(g(v(\tilde x,0),\varepsilon) \mid U \geq P(z)) = E(g(v(x,1),\varepsilon) \mid U \geq P(z)).$$

Unlike Vytlacil and Yildiz (2007), we use the full distribution of $Y$ (rather than its first moment) while searching for such pairs $(x,\tilde x)$ in (2.2). This allows us to relax the single-index and monotonicity conditions in Vytlacil and Yildiz (2007).

For any $p$ on the support of $P(Z)$ given $X = x$, and for all $y$, define

$$\begin{aligned}
h_1^*(x,y,p) &= E(D\,1\{Y \leq y\} \mid X = x, P(Z) = p) \\
&= E[1\{U < P(Z)\}\,1\{g(v(X,1),\varepsilon) \leq y\} \mid X = x, P(Z) = p] \\
&= \int_0^p F_{g|u}(y; v(x,1))\,du,
\end{aligned} \tag{2.3}$$

where

$$F_{g|u}(y; v(x,d)) \equiv E[1\{g(v(x,d),\varepsilon) \leq y\} \mid U = u],$$

with $v(x,d)$ being a realized index at $X = x$ and the expectation in the definition of $F_{g|u}$ taken with respect to the distribution of $\varepsilon$ given $U = u$. The last equality in (2.3) holds because of independence between $(\varepsilon,U)$ and $(X,Z)$. By construction, $h_1^*(x,y,p)$ is directly identified from the joint distribution of $(D,Y,X,Z)$ in the data-generating process. Furthermore, for any pair $p_1 > p_2$, define

$$h_1(x,y,p_1,p_2) \equiv h_1^*(x,y,p_1) - h_1^*(x,y,p_2) = \int_{p_2}^{p_1} F_{g|u}(y; v(x,1))\,du.$$

Likewise, define

$$\begin{aligned}
h_0^*(x,y,p) &= E((1-D)\,1\{Y \leq y\} \mid X = x, P(Z) = p) \\
&= E[1\{U \geq P(Z)\}\,1\{g(v(X,0),\varepsilon) \leq y\} \mid X = x, P(Z) = p] \\
&= \int_p^1 F_{g|u}(y; v(x,0))\,du,
\end{aligned}$$

$$h_0(x,y,p_1,p_2) \equiv h_0^*(x,y,p_2) - h_0^*(x,y,p_1) = \int_{p_2}^{p_1} F_{g|u}(y; v(x,0))\,du.$$

Let $\mathcal{P}_x$ denote the support of $P(Z)$ given $X = x$. It can be shown that for any $x \in S_1$ and $\tilde x \in S_0$, and any $y$,

$$h_1(x,y,p,p') = h_0(\tilde x,y,p,p') \text{ for all } p > p' \text{ on } \mathcal{P}_x \cap \mathcal{P}_{\tilde x} \tag{2.4}$$

if and only if

$$F_{g|p}(y; v(x,1)) = F_{g|p}(y; v(\tilde x,0)) \text{ for all } p \in \mathcal{P}_x \cap \mathcal{P}_{\tilde x}. \tag{2.5}$$

Sufficiency is immediate from the definition of $h_1$ and $h_0$.
To see necessity, note that for all $p > p'$ on $\mathcal{P}_x \cap \mathcal{P}_{\tilde x}$,

$$\frac{\partial h_1(x,y,\tilde p,p')}{\partial \tilde p}\bigg|_{\tilde p = p} = \frac{\partial h_1^*(x,y,\tilde p)}{\partial \tilde p}\bigg|_{\tilde p = p} = F_{g|p}(y; v(x,1)),$$

$$\frac{\partial h_0(\tilde x,y,\tilde p,p')}{\partial \tilde p}\bigg|_{\tilde p = p} = -\frac{\partial h_0^*(\tilde x,y,\tilde p)}{\partial \tilde p}\bigg|_{\tilde p = p} = F_{g|p}(y; v(\tilde x,0)).$$

[...]

ASSUMPTION A-1: The distribution of $U$ is absolutely continuous with respect to Lebesgue measure.

ASSUMPTION A-2: The random vectors $(U,\varepsilon)$ and $(X,Z)$ are independent.

ASSUMPTION A-3: The random variables $g(v(X,1),\varepsilon)$ and $g(v(X,0),\varepsilon)$ have finite first moments conditional on $U = u$ for all $u \in [0,1]$.

ASSUMPTION A-4: For any $(x,\tilde x) \in S_1 \times S_0$, $F_{g|p}(y; v(x,1)) = F_{g|p}(y; v(\tilde x,0))$ for all $y$ and $p \in \mathcal{P}_x \cap \mathcal{P}_{\tilde x}$ if and only if $v(x,1) = v(\tilde x,0)$.

ASSUMPTION A-5: [...] $\Pr(X \in S\,[...]) > 0$ [...] $(x,\tilde x)$ with $v(x,1) = v(\tilde x,0)$ [...]

[...] $v(X,D) \in \mathbb{R}$ is a single index and that for any $(x,\tilde x) \in (S_1 \times S_0)$, $E[g(v(x,1),\varepsilon) \mid U = u] = E[g(v(\tilde x,0),\varepsilon) \mid U = u]$ if and only if $v(x,1) = v(\tilde x,0)$. [...] $E[g(v(x,d),\varepsilon) \mid U = p]$ is a strictly monotonic function of $v(x,d)$. Second, when $v(x,d)$ is a vector of multiple indices instead of a single index, their approach breaks down. In comparison, we achieve the same purpose by matching the conditional distributions $F_{g|p}(\cdot\,; v(x,1))$ and $F_{g|p}(\cdot\,; v(\tilde x,0))$ [...] $v(\cdot)$ is vector-valued and the monotonicity condition in Vytlacil and Yildiz (2007) is not satisfied.

We now present several examples in which the latent indices are multi-dimensional. In the first and third examples, the monotonicity condition in Vytlacil and Yildiz (2007) is not satisfied; in the second example, identification requires a generalization of the monotonicity condition into an invertibility condition in higher dimensions.
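To make the matching logic of Section 2 concrete, the following Python sketch works with a toy two-index model in the spirit of Example 1. Everything specific in it is an assumption made for illustration and does not come from the paper: the index functions `v_treated` and `v_untreated`, the Gaussian dependence between the outcome error and `U` governed by `RHO`, the grids, and the use of closed-form population conditional CDFs in place of estimates. It evaluates $h_1(x,\cdot)$ and $h_0(\tilde x,\cdot)$ by numerical integration of $F_{g|u}$ and searches a grid for the $\tilde x$ that minimizes their squared distance.

```python
from statistics import NormalDist

STD = NormalDist()
RHO = 0.5  # assumed correlation between eps and the normal score of U


def v_treated(x):
    """Hypothetical indices v(x, 1) = (location, scale)."""
    return (x + 0.5, 1.0 + x * x)


def v_untreated(x):
    """Hypothetical indices v(x, 0); by construction equal to v_treated(x - 0.5)."""
    return (x, 1.0 + (x - 0.5) ** 2)


def cond_cdf(y, v, u):
    """F_{g|u}(y; v) for Y = v1 + v2 * eps with eps | U=u ~ N(RHO * z_u, 1 - RHO^2)."""
    v1, v2 = v
    z_u = STD.inv_cdf(u)
    return STD.cdf(((y - v1) / v2 - RHO * z_u) / (1.0 - RHO ** 2) ** 0.5)


def h(y, v, p_hi, p_lo, steps=40):
    """Midpoint-rule integral of F_{g|u}(y; v) over u in [p_lo, p_hi]."""
    width = (p_hi - p_lo) / steps
    return width * sum(cond_cdf(y, v, p_lo + (k + 0.5) * width) for k in range(steps))


Y_GRID = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
P_PAIRS = [(0.5, 0.2), (0.8, 0.5), (0.8, 0.2)]  # pairs with p_hi > p_lo


def distance(x, x_tilde):
    """Squared distance between h_1(x, .) and h_0(x_tilde, .) over the grids."""
    return sum(
        (h(y, v_treated(x), hi, lo) - h(y, v_untreated(x_tilde), hi, lo)) ** 2
        for y in Y_GRID
        for hi, lo in P_PAIRS
    )


def best_match(x, candidates):
    """Grid search for the x_tilde whose h_0 profile is closest to h_1(x, .)."""
    return min(candidates, key=lambda x_tilde: distance(x, x_tilde))


candidates = [round(-1.0 + 0.05 * k, 2) for k in range(61)]  # -1.00, ..., 2.00
match = best_match(0.3, candidates)
```

Under these hypothetical index functions the search should select $\tilde x = 0.8$ for $x = 0.3$, the grid point at which both the location index and the scale index coincide. In practice the $h$ functions would be replaced by nonparametric estimates, so the minimized distance would be small rather than exactly zero.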

Example 1. (Heteroskedastic shocks in outcome)

Consider a triangular system where a continuous outcome is determined by double indices $v(X,D) \equiv (v_1(X,D), v_2(X,D))$:

$$Y = g(v(X,D),\varepsilon) = v_1(X,D) + v_2(X,D)\,\varepsilon \quad \text{for } D \in \{0,1\}.$$

The selection equation determining the actual treatment is the same as (1.2). In this case the concept of monotonicity in $v \in \mathbb{R}^2$ is not well defined, so the procedure proposed in Vytlacil and Yildiz (2007) is not suitable here. (For this particular design, the approach proposed in Vuong and Xu (2017) should be valid. But it will not be for a slightly modified model, such as $Y = v_1(X,D) + (e_1 + v_2(X,D) \cdot e_2)$, whereas ours will be.) Nevertheless, we can apply the method in Section 2 to identify the average treatment effect by using the distribution of the outcome to find pairs of $x$ and $\tilde x$ such that $v(x,1) = v(\tilde x,0)$. Suppose $v_2(\cdot)$ is positive. To see the necessity in Assumption A-4, note that

$$F_{g|u}(y; v(x,d)) = E[1\{v_1(x,d) + v_2(x,d)\,\varepsilon \leq y\} \mid U = u] = F_{\varepsilon|u}\!\left(\frac{y - v_1(x,d)}{v_2(x,d)}\right)$$

for $d = 0,1$. If the CDF of $\varepsilon$ is increasing over $\mathbb{R}$, then for all $y$ and $x \in S_1$, $\tilde x \in S_0$, the equality $F_{g|u}(y; v(x,1)) = F_{g|u}(y; v(\tilde x,0))$ implies

$$\frac{y - v_1(x,1)}{v_2(x,1)} = \frac{y - v_1(\tilde x,0)}{v_2(\tilde x,0)}.$$

Differentiating with respect to $y$ yields $v_2(x,1) = v_2(\tilde x,0)$, which in turn gives $v_1(x,1) = v_1(\tilde x,0)$.

Example 2. (Multinomial potential outcome)

Consider a triangular system where the outcome is multinomial. The multinomial response model has a long and rich history in both applied and theoretical econometrics. Recent examples in the semiparametric literature include Lee (1995), Ahn, Powell, Ichimura, and Ruud (2017), Shi, Shum, and Song (2018), Pakes and Porter (2014), and Khan, Ouyang, and Tamer (2019). But unlike the work here, none of those papers allow for dummy endogenous variables or potential outcomes. The outcome is

$$Y = g(v(X,D),\varepsilon) = \arg\max_{j = 0,1,\ldots,J} y^*_{j,D},$$

where $y^*_{j,D} = v_j(X,D) + \varepsilon_j$ for $j = 1,2,\ldots,J$ and $y^*_{0,D} = 0$. In this case the index $v \equiv (v_j)_{j \leq J}$ and the errors $\varepsilon \equiv (\varepsilon_j)_{j \leq J}$ are both $J$-dimensional. The selection equation that determines $D$ is the same as (1.2). Here we can replace $1\{Y \leq y\}$ by $1\{Y = y\}$ in the definition of $h_1$, $h_0$, $h_1^*$, $h_0^*$ and $F_{g|u}(\cdot\,; v)$. Then for $d = 0,1$ and $j \leq J$,

$$F_{g|u}(j; v(x,d)) \equiv E[1\{g(v(x,d),\varepsilon) = j\} \mid U = u] = \Pr\{v_j(x,d) + \varepsilon_j \geq v_{j'}(x,d) + \varepsilon_{j'} \ \forall j' \leq J \mid U = u\}.$$

By Ruud (2000) and Ahn, Powell, Ichimura, and Ruud (2017), the mapping from $v \in \mathbb{R}^J$ to $(F_{g|u}(j; v) : j \leq J) \in \mathbb{R}^J$ is smooth and invertible provided that $\varepsilon \in \mathbb{R}^J$ has non-negative density everywhere. This implies Assumption A-4.

Example 3. (Potential outcome from the Roy model)

Consider a treatment effect model with an endogenous binary treatment $D$ and with the potential outcome determined by a latent Roy model. The Roy model has also been studied extensively from both applied and theoretical perspectives. See, for example, the literature survey in Heckman and Vytlacil (2007) and the seminal paper of Heckman and Honoré (1990).

Here the observed outcome consists of two pieces: a continuous measure $Y = DY_1 + (1-D)Y_0$ and a discrete indicator $W = DW_1 + (1-D)W_0$. For $d = 0,1$, these potential outcomes are given by

$$Y_d = \max_{j \in \{a,b\}} y^*_{j,d} \quad \text{and} \quad W_d = \arg\max_{j \in \{a,b\}} y^*_{j,d},$$

where $a$ and $b$ index potential outcomes realized in different sectors, with $y^*_{j,d} = v_j(X,d) + \varepsilon_j$. The binary endogenous treatment $D$ is determined as in the selection equation (1.2). For example, $D \in \{0,1\}$ indicates whether an individual participates in a certain professional training program, $W_d \in \{a,b\}$ indicates the potential sector in which the individual is employed, $y^*_{j,d}$ is the potential wage from sector $j$ under treatment $D = d$, and $Y_d \in \mathbb{R}$ is the potential wage if the treatment status is $D = d$. As before, we maintain that $(X,Z) \perp (\varepsilon,U)$. The parameter of interest is

$$\Pr\{Y_1 \leq y, W_1 = a \mid X\},$$

which by the independence $(X,Z) \perp (\varepsilon,U)$ and an application of the law of total probability can be decomposed into directly identifiable quantities and a counterfactual quantity

$$\Pr\{Y_1 \leq y, W_1 = a \mid X = x, Z = z, D = 0\} = \Pr\{v_b(x,1) + \varepsilon_b < v_a(x,1) + \varepsilon_a \leq y \mid U \geq P(z)\}. \tag{3.1}$$

Again, we seek to identify this counterfactual quantity by finding $\tilde x \in S_0$ such that

$$v_a(x,1) = v_a(\tilde x,0) \quad \text{and} \quad v_b(x,1) = v_b(\tilde x,0). \tag{3.2}$$

This would allow us to recover the right-hand side of (3.1) as

$$\Pr\{Y \leq y, W = a \mid X = \tilde x, Z = z, D = 0\}.$$

To find such a pair $(x,\tilde x)$, define $h_{d,W}(x,p,p')$ and $h^*_{d,W}(x,p)$ by replacing $1\{Y \leq y\}$ with $1\{W = a\}$ in the definition of $h_d$ and $h_d^*$ in Section 2. Similarly, define $h_{d,Y}(x,y,p,p')$ and $h^*_{d,Y}(x,y,p)$ by replacing $1\{Y \leq y\}$ with $1\{Y \leq y, W = a\}$ in the definition of $h_d$ and $h_d^*$ in Section 2. Then

$$h_{d,W}(x,p_1,p_2) = \int_{p_2}^{p_1} \Pr\{v_b(x,d) + \varepsilon_b < v_a(x,d) + \varepsilon_a \mid U = u\}\,du;$$

$$h_{d,Y}(x,y,p_1,p_2) = \int_{p_2}^{p_1} \Pr\{v_b(x,d) + \varepsilon_b < v_a(x,d) + \varepsilon_a \leq y \mid U = u\}\,du;$$

and $h_{d,W}(x,p_1,p_2)$ and $h_{d,Y}(x,y,p_1,p_2)$ are both identified over their respective domains by construction. Assume $(\varepsilon_a,\varepsilon_b)$ is continuously distributed with positive density over $\mathbb{R}^2$ conditional on all $u$. Then the statement

"$h_{1,W}(x,p,p') = h_{0,W}(\tilde x,p,p')$ and $h_{1,Y}(x,y,p,p') = h_{0,Y}(\tilde x,y,p,p')$ for all $y$ and $p > p'$ on $\mathcal{P}_x \cap \mathcal{P}_{\tilde x}$"

holds true if and only if (3.2) holds. Matching $h_{1,W}(x,p,p') = h_{0,W}(\tilde x,p,p')$ ensures

$$v_a(x,1) - v_b(x,1) = v_a(\tilde x,0) - v_b(\tilde x,0), \tag{3.3}$$

and matching $h_{1,Y}(x,y,p,p') = h_{0,Y}(\tilde x,y,p,p')$ at the same time ensures that, in addition to (3.3), $v_a(x,1) = v_a(\tilde x,0)$.

The identification strategy we have used so far requires matching exogenous variables $x$, $\tilde x$ on $S_1$, $S_0$. In some cases, with the outcome being continuous, we can construct a similar argument for identifying a counterfactual quantity in a treatment effect model by matching different elements on the support of the continuous outcome. This approach was not investigated in Vytlacil and Yildiz (2007), which focused on the use of the first moment of the outcome. The following example illustrates this point.

Example 4. (Potential outcome with random coefficients)

Random coefficient models are prominent in both the theoretical and applied econometrics literature. They permit a flexible way to allow for conditional heteroscedasticity and unobserved heterogeneity. See, for example, Hsiao and Pesaran (2008) for a survey. Here we consider a treatment effect model where the potential outcome is determined through random coefficients:

$$Y = DY_1 + (1-D)Y_0, \quad \text{where } Y_d = \alpha_d + X'\beta_d \text{ for } d = 0,1,$$

and $D$ is determined as in the selection equation (1.2). The random intercepts $\alpha_d \in \mathbb{R}$ and the random vectors of coefficients $\beta_d$ are given by

$$\alpha_d = \bar\alpha_d(X) + \eta_d \quad \text{and} \quad \beta_d = \bar\beta_d(X) + \varepsilon_d,$$

where for any $x$ and $d \in \{0,1\}$, $(\bar\alpha_d(x), \bar\beta_d(x)) \in \mathbb{R}^{K+1}$ is a vector of constant parameters while $\eta_d \in \mathbb{R}$ and $\varepsilon_d \in \mathbb{R}^K$ are unobservable noises.

As in Vytlacil and Yildiz (2007), assume some elements of $Z$ in the selection equation are excluded from $X$. We allow the vector of unobservable terms $(\varepsilon_1,\varepsilon_0,\eta_1,\eta_0,U)$ to be arbitrarily correlated. We also assume that

$$(X,Z) \perp (\varepsilon_1,\varepsilon_0,\eta_1,\eta_0,U), \tag{4.1}$$

with the marginal distribution of $U$ normalized to standard uniform, so that $\theta(Z)$ is directly identified as $P(z) \equiv E(D \mid Z = z)$.

In this example our goal is to identify the conditional distribution of $Y_d$ given $X = x$ for $d = 0,1$. From this result we can identify parameters of interest such as average treatment effects, quantile treatment effects, etc. Let $G_{P|x}$ denote the conditional distribution of $P \equiv P(Z)$ given $X = x$, which is directly identifiable from the data-generating process. By construction,

$$\Pr\{Y_1 \leq y \mid X = x\} = \int \Pr\{Y_1 \leq y \mid X = x, P = p\}\,dG_{P|x}(p),$$

where

$$\Pr\{Y_1 \leq y \mid X = x, P = p\} = E[D\,1\{Y_1 \leq y\} \mid X = x, P = p] + E[(1-D)\,1\{Y_1 \leq y\} \mid X = x, P = p].$$

The first term on the right-hand side is identified as $E[D\,1\{Y \leq y\} \mid X = x, P = p]$, while the second term is counterfactual and can be written as

$$\begin{aligned}
\phi(x,y,p) &\equiv E[1\{U \geq P\}\,1\{\alpha_1 + X'\beta_1 \leq y\} \mid X = x, P = p] \\
&= E[1\{U \geq p\}\,1\{\bar\alpha_1(x) + \eta_1 + x'(\bar\beta_1(x) + \varepsilon_1) \leq y\}] \\
&= \int_p^1 \Pr\{\eta_1 + x'\varepsilon_1 \leq y - \bar\alpha_1(x) - x'\bar\beta_1(x) \mid U = u\}\,du.
\end{aligned}$$

For any $p$ on the support of $P$ given $X = x$, define

$$\begin{aligned}
h_1^*(x,y,p) &\equiv E[D\,1\{Y \leq y\} \mid X = x, P = p] \\
&= E[1\{U < P\}\,1\{\alpha_1 + X'\beta_1 \leq y\} \mid X = x, P = p] = E[1\{U < p\}\,1\{\alpha_1 + x'\beta_1 \leq y\}] \\
&= \int_0^p \Pr\{\eta_1 + x'\varepsilon_1 \leq y - \bar\alpha_1(x) - x'\bar\beta_1(x) \mid U = u\}\,du,
\end{aligned}$$

where the second equality uses (4.1). Likewise, under (4.1) we have

$$h_0^*(x,y,p) \equiv E[(1-D)\,1\{Y \leq y\} \mid X = x, P = p] = \int_p^1 \Pr\{\eta_0 + x'\varepsilon_0 \leq y - \bar\alpha_0(x) - x'\bar\beta_0(x) \mid U = u\}\,du.$$

Assume

$$F_{(\eta_1,\varepsilon_1)|U=u} = F_{(\eta_0,\varepsilon_0)|U=u} \text{ for all } u \in [0,1]. \tag{4.2}$$

Then

$$\phi(x,y,p) = \int_p^1 \Pr\{\eta_0 + x'\varepsilon_0 \leq y - \bar\alpha_1(x) - x'\bar\beta_1(x) \mid U = u\}\,du. \tag{4.3}$$

Suppose for each pair $(x,y)$ we can find $t(x,y)$ such that

$$y - \bar\alpha_1(x) - x'\bar\beta_1(x) = t(x,y) - \bar\alpha_0(x) - x'\bar\beta_0(x).$$

Then by construction

$$\begin{aligned}
h_0^*(x,t(x,y),p) &\equiv \int_p^1 \Pr\{\eta_0 + x'\varepsilon_0 \leq t(x,y) - \bar\alpha_0(x) - x'\bar\beta_0(x) \mid U = u\}\,du \\
&= \int_p^1 \Pr\{\eta_0 + x'\varepsilon_0 \leq y - \bar\alpha_1(x) - x'\bar\beta_1(x) \mid U = u\}\,du = \phi(x,y,p).
\end{aligned}$$

Regarding (4.2): this type of distributional equality assumption generalizes the exact equality of $\varepsilon_1$, $\varepsilon_0$ found in, for example, Vytlacil and Yildiz (2007).
Distributional equality has been used to motivate the rank similarity condition imposed frequently in the econometrics literature; see, for example, Chernozhukov and Hansen (2005), Frandsen and Lefgren (2018), Dong and Shen (2018), and Chen and Khan (2014).

Hence $\phi(x,y,p)$ would be identified as $h_0^*(x,t(x,y),p)$.

It remains to show that for each pair $(x,y)$ we can uniquely recover $t(x,y)$ using quantities that are identifiable in the data-generating process. To do so, we define two auxiliary functions as follows: for $p_1 > p_2$ on the support of $P$ given $X = x$, let

$$h_1(x,y,p_1,p_2) \equiv h_1^*(x,y,p_1) - h_1^*(x,y,p_2) = \int_{p_2}^{p_1} \Pr\{\eta_1 + x'\varepsilon_1 < y - \bar\alpha_1(x) - x'\bar\beta_1(x) \mid U = u\}\,du;$$

and

$$h_0(x,y,p_1,p_2) \equiv h_0^*(x,y,p_2) - h_0^*(x,y,p_1) = \int_{p_2}^{p_1} \Pr\{\eta_0 + x'\varepsilon_0 < y - \bar\alpha_0(x) - x'\bar\beta_0(x) \mid U = u\}\,du.$$

Suppose $\eta_d + x'\varepsilon_d$ is continuously distributed over $\mathbb{R}$ for all values of $x$ conditional on all $u \in [0,1]$. Then for any $(x,y)$ and $p_2 < p_1$,

$$h_1(x,y,p_1,p_2) = h_0(x,t(x,y),p_1,p_2)$$

if and only if

$$t(x,y) = y - \bar\alpha_1(x) - x'\bar\beta_1(x) + \bar\alpha_0(x) + x'\bar\beta_0(x).$$

To see this, suppose $t(x,y) > y - \bar\alpha_1(x) - x'\bar\beta_1(x) + \bar\alpha_0(x) + x'\bar\beta_0(x)$; then (4.2) implies that $h_0(x,t(x,y),p_1,p_2) > h_1(x,y,p_1,p_2)$. A symmetric argument establishes a similar statement with ">" replaced by "<". This establishes our desired result.

Here we outline estimation procedures, based on a random sample of the observed variables, that are motivated by our identification results. We first describe an estimation procedure for the parameter $E[Y_1]$ in the first three examples. Let $\mathcal{P}_x$ denote the support of $P(Z)$ given $X = x$, let $f_p(\cdot \mid x)$ denote the density of $P(Z)$ given $X = x$, and let $\mathcal{P}^c_x = \{p : f_p(p \mid x) > c\}$ for a known $c > 0$. For simplicity, assume a strong overlap condition that $1 - c > P(Z) > c$ for a known $c > 0$. Define the distance between $h_1(x_1,\cdot)$ and $h_0(x_0,\cdot)$ as

$$\|h_1(x_1,\cdot) - h_0(x_0,\cdot)\| = \left(\int\!\!\int\!\!\int \left(\int_{p_2}^{p_1} \big(F_{g|u}(y; v(x_1,1)) - F_{g|u}(y; v(x_0,0))\big)\,du\right)^{2} I(p_1,p_2 \in \mathcal{P}^c_x)\,w(y)\,dy\,dp_1\,dp_2\right)^{1/2},$$

where $w(y)$ is a chosen weight function. Consider the case when $h_1(x,y,p_1,p_2)$, $h_0(x,y,p_1,p_2)$ and $P(z)$ are known. For any given $X_i$, let $\tilde X_i$ be such that

$$\big\|h_1(X_i,\cdot) - h_0(\tilde X_i,\cdot)\big\| = 0,$$

which, under Assumption A-4 in Section 2, is equivalent to $v(X_i,1) = v(\tilde X_i,0)$. Define

$$\hat Y_i = E\big(Y \mid D = 0, \|h_1(X_i,\cdot) - h_0(X,\cdot)\| = 0, P = P_i\big).$$

Note that the conditional expectation on the right-hand side is equal to $E[Y_0 \mid D = 0, v(X,0) = v(X_i,1), P = P_i]$, which in turn equals $E[Y_1 \mid D = 0, X = X_i, P = P_i]$.

Then, following the discussion above, we define the following estimator for $\Delta \equiv E[Y_1]$:

$$\hat\Delta = \frac{1}{n}\sum_{i=1}^{n}\Big(D_i Y_i + (1 - D_i)\hat Y_i\Big),$$

or a weighted version

$$\hat\Delta_w = \frac{\frac{1}{n}\sum_{i=1}^{n} 1\{X_i \in A\}\Big(D_i Y_i + (1 - D_i)\hat Y_i\Big)}{\frac{1}{n}\sum_{i=1}^{n} 1\{X_i \in A\}}.$$

Limiting distribution theory for each of these estimators follows from arguments identical to those in Vytlacil and Yildiz (2007). Here we formally state the theorem for the first estimator:

Theorem 5.1

Under Assumptions A-1 to A-5, and the additional assumption that $Y$ has positive and finite second moment, we have

$$\sqrt{n}\,(\hat\Delta - \Delta) \xrightarrow{d} N(0,V),$$

where

$$V = Var(E[Y_1 \mid X,P,D]) + E[P\,Var(Y_1 \mid X,P,D = 1)].$$

Now we describe an estimation procedure for the distributional treatment effect in Example 4, where we had a model with random coefficients. In this case, the parameter of interest is, for a chosen value of the scalar $y$,

$$\Delta(y) = \Pr\{Y_1 \leq y\}.$$

First, for fixed values of $y$ and $p_1 > p_2$, we propose to estimate $t(x,y)$ as

$$\hat t(x,y,p_1,p_2) = \arg\min_t \big(h_1(x,y,p_1,p_2) - h_0(x,t,p_1,p_2)\big)^2$$

and then average over values of $p_1$, $p_2$:

$$\hat\tau(x,y) = \frac{1}{n(n-1)}\sum_{i \neq j} I[P_i > P_j]\,\hat t(x,y,P_i,P_j).$$

An infeasible estimator for the parameter $\Delta(y)$, which assumes $t(x,y)$ is known, would be

$$\hat\Delta(y) = \frac{1}{n}\sum_{i=1}^{n}\Big(D_i\,1\{Y_i \leq y\} + (1 - D_i)\,1\{Y_i \leq t(X_i,y)\}\Big).$$

In practice, for feasible estimation, one needs to replace $t(x,y)$ by its estimator $\hat\tau(x,y)$.

This section presents simulation evidence for the performance of the estimation procedures described in Section 5, for both the Average Treatment Effect and the Distributional Treatment Effect. We report results for both our proposed estimator and that in Vytlacil and Yildiz (2007), for several designs. These include designs where the monotonicity condition fails, and designs where the disturbance terms in the outcome equation are multidimensional.

Throughout all designs we model the treatment or dummy endogenous variable as

$$D = I[Z - U > 0],$$

where $Z$, $U$ are independent standard normal. We experiment with the following designs for the outcome:

Design 1: $Y = X + 0.[...] \cdot D + \varepsilon$, where $X$ is standard normal and $(\varepsilon,U)$ are distributed bivariate normal, each with mean 0 and variance 1, with correlations of 0, 0.25, 0.5.

Design 2: $Y = X + 0.[...] \cdot D + (X + D) \cdot \varepsilon$, where $X$ is distributed standard normal and $(\varepsilon,U)$ are distributed bivariate normal, each with mean 0 and variance 1, with correlations of 0, 0.25, 0.5.

Design 3: $Y = (X + 0.[...] \cdot D + \varepsilon)^{[...]}$, where $X$ is distributed standard normal and $(\varepsilon,U)$ are distributed bivariate normal, each with mean 0 and variance 1, with correlations of 0, 0.25, 0.5.

We note that the monotonicity condition is satisfied in Design 1 but fails in the other two designs. For each of these designs, we report results for estimating the parameter $E[Y_1]$, which denotes the expected value of the potential outcome under treatment $D = 1$. The two estimators used in the simulation study were the one proposed in Section 5 and the method proposed in Vytlacil and Yildiz (2007). The summary statistics, scaled by the true parameter value, are Mean Bias, Median Bias, Root Mean Squared Error (RMSE), and Median Absolute Deviation (MAD). They were evaluated for sample sizes of 100, 200 and 400, using 401 replications. Results for each of these designs are reported in Tables 1 to 3 respectively.

In implementing our procedure we assumed the propensity score function is known, and conducted the next-stage estimation using a nonparametric kernel estimator with a normal kernel function and a bandwidth of $n^{-[...]}$. This rate reflects "undersmoothing", as there are two regressors: the propensity score and the regressor $X$.
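The designs above can be simulated in a few lines. The sketch below is a hedged illustration of the data-generating process for Design 2 only, not the authors' code: the treatment coefficient `beta = 0.5` is an assumed value (the coefficient is not legible in the text above), and all function and variable names are hypothetical. It draws $(X, Z, U, \varepsilon)$, builds $D$ and $Y$, and compares the infeasible sample mean of the potential outcome $Y_1$ with the naive mean of $Y$ among treated units.

```python
import random


def simulate_design2(n, rho, beta=0.5, seed=0):
    """Draw n observations from an assumed version of Design 2.

    Returns tuples (y, d, x, z, y1); y1 is the potential outcome under
    treatment and would not be observed for untreated units in real data.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        # (eps, u) bivariate normal with unit variances and correlation rho
        eps = rho * u + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        d = 1 if z - u > 0 else 0            # selection rule D = I[Z - U > 0]
        y1 = x + beta + (x + 1.0) * eps      # outcome when D = 1
        y0 = x + x * eps                     # outcome when D = 0
        y = d * y1 + (1 - d) * y0
        rows.append((y, d, x, z, y1))
    return rows


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


sample = simulate_design2(n=100_000, rho=0.5)
target = mean(row[4] for row in sample)              # infeasible mean of Y_1
naive = mean(row[0] for row in sample if row[1] == 1)  # mean of Y among treated
```

With `rho = 0.5` the naive treated-sample mean falls noticeably below the mean of $Y_1$, because units select into treatment on $U$, which is correlated with $\varepsilon$. That gap is the selection problem the matching estimator is meant to address.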
For the estimator in Vytlacil and Yildiz (2007), which involves the derivative of conditional expectation functions as well, estimating these functions nonparametrically gave very unstable results, so we report results for an infeasible version of their estimator, assuming such functions, as well as the propensity scores, are known.

To implement the second stage of our proposed procedure, in calculating the distance $\|h_1(x_i,\cdot) - h_0(x,\cdot)\|$ we used an evenly spaced grid of values for $y$, and selected $n/50$ grid points, with $n$ denoting the sample size.

The results indicate the desirable properties of our proposed procedure, generally agreeing with Theorem 5.1. In all designs our estimator has small values for bias and RMSE, with the value of RMSE decreasing as the sample size grows. In contrast, the procedure based on Vytlacil and Yildiz (2007) only performs well in Design 1, with values of bias and RMSE comparable to those using our method. As in our procedure, these values decrease as the sample size grows, which is expected, as the monotonicity condition they rely on is satisfied in this design. In this case, their approach has smaller standard errors, largely due to the relatively simpler structure of the infeasible version, but their biases persist even when the sample size increases.

For Designs 2 and 3, where monotonicity is violated, the procedure proposed in Vytlacil and Yildiz (2007) does not perform well. In Design 2 (Table 2), both the bias and RMSE are generally increasing with the sample size. Results for their estimator are better in Design 3, but the bias hardly shrinks with the sample size and is much larger than that of our estimator.

We also simulate data from a model with a dummy endogenous variable and potential outcomes determined by random coefficients. It is important to note that for this design, the original matching idea in Vytlacil and Yildiz (2007) does not apply. This is because different values of $x$ lead to different distributions of the composite error $\eta_d + x'\varepsilon_d$. Our contribution in Section 4 is to propose a new approach based on matching different values of the outcome $y$, rather than the regressors $x$. Based on the counterfactual framework discussed in Section 4, here the treatment variable $D$ is modeled in the same way as the dummy endogenous variable above. Similarly, the regressor $X$ is standard normal. For the two potential outcomes, the random intercepts were modeled as constants (0 and 1, respectively) and the additive error terms were each standard normal.
For the random slopes, the means were 1 and 2 respectively, and the additive error terms were also standard normal, independent of all other disturbance terms and of each other. Here we use the procedure in Section 4 to estimate the parameter $\Delta = P(Y_1 < y)$, where in the simulation we set $y = 1$. The same four summary statistics are reported for sample sizes 100, 200 and 400, based on 401 replications. Results for this random coefficients design are reported in Table 4.

The estimator proposed in Section 5 performs well: the bias and RMSE are much smaller at 400 observations than at 100 and 200 observations, consistent with convergence at the parametric rate.

Table 1 (columns: CKT, VY; rows indexed by $\rho_v$): numerical entries not recoverable from the source.

Table 4 (column: CKT; rows indexed by $\rho_v$): numerical entries not recoverable from the source.

In this paper, we considered identification and estimation of nonseparable models with an endogenous binary treatment. Existing approaches are based on a monotonicity condition, which is violated in models with multiple unobserved idiosyncratic shocks. Such models arise in many important empirical settings, including Roy models and multinomial choice models with dummy endogenous variables, as well as treatment effect models with random coefficients. We establish novel identification results for these models. The results are constructive and lead to estimation procedures that are easy to compute and whose limiting distributional properties follow from standard large-sample theorems. A simulation study indicates adequate finite-sample performance of our proposed methods.

This paper leaves open areas for future research. Our method requires the selection of the number and location of cutoff points, so a data-driven method for selecting these would be useful. Furthermore, the relative efficiency of our proposed approach needs to be explored, perhaps by deriving efficiency bounds for these new classes of models.

References

Ahn, H., and J. Powell (1993): “Semiparametric Estimation of Censored Selection Models,” Journal of Econometrics, 58, 3–29.

Ahn, H., J. Powell, H. Ichimura, and P. Ruud (2017): “Simple Estimators for Invertible Index Models,” Journal of Business and Economic Statistics, 36, 1–10.

Auerbach, E. (2019): “Identification and Estimation of a Partially Linear Regression Model using Network Data,” mimeograph, Northwestern University.

Chen, S., S. Khan, and X. Tang (2016): “On the Informational Content of Special Regressors in Heteroskedastic Binary Response Models,” Journal of Econometrics, 193, 162–182.

Chen, S. H., and S. Khan (2014): “Semiparametric Estimation of Program Impacts on Dispersion of Potential Wages,” Journal of Applied Econometrics, 29, 901–919.

Chernozhukov, V., and C. Hansen (2005): “An IV Model of Quantile Treatment Effects,” Econometrica, 73, 245–261.

D’Haultfoeuille, X., and P. Fevrier (2015): “Identification of Nonseparable Triangular Models with Discrete Instruments,” Econometrica, 3, 1199–1210.

Dong, Y., and S. Shen (2018): “The Empirical Content of the Roy Model,” The Review of Economics and Statistics, 100, 78–85.

Feng, J. (2020): “Matching Points: Supplementing Instruments with Covariates in Triangular Models,” mimeograph, Columbia University.

Frandsen, B., and L. Lefgren (2018): “Testing Rank Similarity,” The Review of Economics and Statistics, 100, 86–91.

Han, S., and E. J. Vytlacil (2017): “Identification in a generalization of bivariate probit models with endogenous regressors,” Journal of Econometrics, 199, 63–73.

Heckman, J., and E. Vytlacil (2007): “Econometric evaluation of social programs,” in Handbook of Econometrics, Vol. 6B, ed. by J. Heckman, and E. Leamer. Amsterdam: North Holland.

Heckman, J., and B. Honoré (1990): “The Empirical Content of the Roy Model,” Econometrica, 58, 1121–1149.

Hsiao, C., and M. H. Pesaran (2008): “Random Coefficient Models,” in The Econometrics of Panel Data: Advanced Studies in Theoretical and Applied Econometrics, ed. by L. Matyas, and P. Sevestre. Springer.

Imbens, G., and W. Newey (2009): “Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity,” Econometrica, 77(5), 1481–1512.

Khan, S., A. Maurel, and Y. Zhang (2019): “Informational Content of Factor Structures in Simultaneous Discrete Response Models,” Working Paper, Duke University.

Khan, S., F. Ouyang, and E. Tamer (2019): “Inference in Semiparametric Multinomial Response Models,” Boston College Working Paper.

Lee, L.-F. (1995): “Semiparametric Maximum Likelihood Estimation of Polychotomous and Sequential Choice Models,” Journal of Econometrics, 65, 381–428.

Lewbel, A., D. Jacho-Chavez, and J. Encarnciono (2016): “Identification and estimation of semiparametric two-step models,” Quantitative Economics, 7, 561–589.

Mourifié, I. (2015): “Sharp Bounds on Treatment Effects in a Binary Triangular System,” Journal of Econometrics, 187(1), 74–81.

Pakes, A., and J. Porter (2014): “Moment Inequalities for Multinomial Choice with Fixed Effects,” Harvard University Working Paper.

Ruud, P. (2000): “Semiparametric estimation of discrete choice models,” mimeograph, University of California at Berkeley.

Shaikh, A. M., and E. Vytlacil (2011): “Partial Identification in Triangular Systems of Equations with Binary Dependent Variables,” Econometrica, 79(3), 949–955.

Shi, X., M. Shum, and W. Song (2018): “Estimating Semi-Parametric Panel Multinomial Choice Models using Cyclic Monotonicity,” Econometrica, 86, 737–761.

Torgovitsky, A. (2015): “Identification of Nonseparable Models Using Instruments with Small Support,” Econometrica, 3, 1185–1197.

Vuong, Q., and H. Xu (2017): “Counterfactual mapping and individual treatment effects in nonseparable models with binary endogeneity,” Quantitative Economics, pp. 589–610.

Vytlacil, E. J., and N. Yildiz (2007): “Dummy Endogenous Variables in Weakly Separable Models,”