Identification and Estimation of Weakly Separable Models Without Monotonicity
Dummy Endogenous Variables in Weakly Separable Multiple Index Models without Monotonicity ∗
Songnian Chen (HKUST), Shakeeb Khan (Boston College), Xun Tang (Rice University)
April 6, 2020
Abstract
We study the identification and estimation of treatment effect parameters in weakly separable models. In their seminal work, Vytlacil and Yildiz (2007) showed how to identify and estimate the average treatment effect of a dummy endogenous variable when the outcome is weakly separable in a single index. Their identification result builds on a monotonicity condition with respect to this single index. In comparison, we consider similar weakly separable models with multiple indices, and relax the monotonicity condition for identification. Unlike Vytlacil and Yildiz (2007), we exploit the full information in the distribution of the outcome variable, instead of just its mean. Indeed, when the outcome distribution function is more informative than the mean, our method is applicable to more general settings than theirs; in particular, we do not rely on their monotonicity assumption and at the same time we allow for multiple indices. To illustrate the advantage of our approach, we provide examples of models where our approach can identify parameters of interest whereas existing methods would fail. These examples include models with multiple unobserved disturbance terms such as the Roy model and multinomial choice models with dummy endogenous variables, as well as potential outcome models with endogenous random coefficients. Our method is easy to implement and can be applied to a wide class of models. We establish standard asymptotic properties such as consistency and asymptotic normality.
JEL Classification: C14, C31, C35
Key Words: Weak Separability, Treatment Effects, Monotonicity, Endogeneity

∗ We are grateful to Jeremy Fox, Sukjin Han, Arthur Lewbel, Elie Tamer, Ed Vytlacil, Haiqing Xu, and participants at the 2018 West Indies Economic Conference, the 2019 SUFE econometrics meetings, and the 2020 Texas Camp Econometrics for helpful comments and suggestions.

Introduction
Consider a weakly separable model with a binary endogenous variable:

$$Y = g\big(v_1(X,D), v_2(X,D), \ldots, v_J(X,D), \varepsilon\big) \tag{1.1}$$

$$D = 1\{\theta(Z) - U > 0\} \tag{1.2}$$

where $(v_1(X,D), v_2(X,D), \ldots, v_J(X,D)) \equiv v(X,D)$ is a $J$-vector of unknown linear or nonlinear indices in the outcome equation (1.1) and $D$ is a binary endogenous variable defined by the selection equation (1.2). Here $X \in \mathbb{R}^{d_x}$ and $Z \in \mathbb{R}^{d_z}$ are vectors of observable exogenous variables, which may have overlapping elements. Similar to Vytlacil and Yildiz (2007), we require exclusion restrictions: some element of $Z$ is excluded from $X$, and $X$ can vary after conditioning on $\theta(Z)$. In the system of equations above, $U$ is an unobservable random variable normalized to follow the uniform distribution $U(0,1)$, and the error term $\varepsilon$ in the outcome equation is allowed to be a random vector. We assume $(X,Z)$ is independent of $(\varepsilon,U)$. Note that we allow $v(X,D)$ to be a vector of multiple indices, whereas the method in Vytlacil and Yildiz (2007) can only be applied when it is a single index.

Since Vytlacil and Yildiz (2007), other important work has considered identification and estimation of related models, but under alternative conditions. Examples with binary endogenous variables include Han and Vytlacil (2017), Vuong and Xu (2017), Lewbel, Jacho-Chavez, and Escanciano (2016), and Khan, Maurel, and Zhang (2019). Work on models where the endogenous variable is continuous includes Imbens and Newey (2009), D’Haultfoeuille and Fevrier (2015) and Torgovitsky (2015). Feng (2020) shows how to identify nonseparable triangular models where the endogenous variable is discrete and has larger support than the instrumental variable.

Footnote: All these papers focus on point identification. For partial identification of a model with a binary outcome, see Shaikh and Vytlacil (2011) and Mourifié (2015).

As in the conventional framework, two potential outcomes $Y_0$ and $Y_1$ satisfy

$$Y_D = g(v(X,D), \varepsilon) \quad \text{for } D = 0, 1.$$

We only observe $(Y, D, X, Z)$, where $Y = DY_1 + (1-D)Y_0$. In this model, as in Vytlacil and Yildiz (2007), we do not impose a parametric distribution on the error term or a linear index structure. Vytlacil and Yildiz (2007) assume that $v(X,D) \in \mathbb{R}$ is a single index, and

$$E[g(v,\varepsilon) \mid U = u] \text{ is strictly increasing in } v \in \mathbb{R} \text{ for all } u. \tag{1.3}$$

In contrast, we allow $v(X,D) = (v_1(X,D), v_2(X,D), \ldots, v_J(X,D)) \in \mathbb{R}^J$. We consider the identification and estimation of the average treatment effect of $D$ on $Y$, that is, $E(Y_1 \mid X \in A)$, $E(Y_0 \mid X \in A)$ and $E(Y_1 - Y_0 \mid X \in A)$ for some set $A$, without the aforementioned monotonicity. Indeed, for the case with multiple indices $v(X,D) \in \mathbb{R}^J$, the monotonicity condition is no longer well defined.

Vuong and Xu (2017) established nonparametric identification of individual treatment effects in a fully nonseparable model that includes a binary endogenous regressor, without the nonlinear index structure. They assume $\varepsilon$ is a scalar and $g$ is strictly increasing in $\varepsilon$. In their setting, monotonicity in the outcome equation provides the identifying restriction to extrapolate information from local treatment effects to population treatment effects.

Generally speaking, our identification strategy will be based on the notion of matching.

Footnote: See Ahn and Powell (1993), Chen, Khan, and Tang (2016), Vytlacil and Yildiz (2007), and more recently Auerbach (2019) for examples of papers that attain identification through matching.

Consider the identification of $E(Y_1 \mid X = x)$ for some $x \in S_1$, where $S_d$ denotes the support of $X$ given $D = d \in \{0,1\}$. Note that because $(\varepsilon,U) \perp (X,Z)$,

$$
\begin{aligned}
E(Y_1 \mid X = x) &= E(Y_1 \mid X = x, Z = z) \\
&= E(DY_1 \mid X = x, Z = z) + E[(1-D)Y_1 \mid X = x, Z = z] \\
&= P(z)\,E(Y \mid D = 1, X = x, Z = z) + [1 - P(z)]\,E(Y_1 \mid D = 0, X = x, Z = z)
\end{aligned}
\tag{2.1}
$$

where $P(z) \equiv E(D \mid Z = z)$.
The only term that is not directly identifiable on the right-hand side of (2.1) is $E(Y_1 \mid D = 0, X = x, Z = z) = E[g(v(x,1),\varepsilon) \mid U \ge P(z)]$.

The main idea behind our approach follows that of Vytlacil and Yildiz (2007), which is to find some $\tilde{x} \in S_0$ such that

$$v(x,1) = v(\tilde{x},0) \tag{2.2}$$

so that

$$
\begin{aligned}
E(Y \mid D = 0, X = \tilde{x}, Z = z) &= E(Y_0 \mid D = 0, X = \tilde{x}, Z = z) \\
&= E(g(v(\tilde{x},0),\varepsilon) \mid U \ge P(z)) = E(g(v(x,1),\varepsilon) \mid U \ge P(z)).
\end{aligned}
$$

Unlike Vytlacil and Yildiz (2007), we utilize the full distribution of $Y$ (rather than its first moment) while searching for such pairs $(x,\tilde{x})$ in (2.2). This allows us to relax the single-index and monotonicity conditions in Vytlacil and Yildiz (2007).

For any $p$ on the support of $P(Z)$ given $X = x$, and for all $y$, define

$$
\begin{aligned}
h_1^*(x,y,p) &= E(D\,1\{Y \le y\} \mid X = x, P(Z) = p) \\
&= E[1\{U < P(Z)\}\,1\{g(v(X,1),\varepsilon) \le y\} \mid X = x, P(Z) = p] \\
&= \int_0^p F_{g|u}(y; v(x,1))\,du,
\end{aligned}
\tag{2.3}
$$

where

$$F_{g|u}(y; v(x,d)) \equiv E[1\{g(v(x,d),\varepsilon) \le y\} \mid U = u]$$

with $v(x,d)$ being a realized index at $X = x$, and the expectation in the definition of $F_{g|u}$ is with respect to the distribution of $\varepsilon$ given $U = u$. The last equality in (2.3) holds because of independence between $(\varepsilon,U)$ and $(X,Z)$. By construction, $h_1^*(x,y,p)$ is directly identified from the joint distribution of $(D,Y,X,Z)$ in the data-generating process. Furthermore, for any pair $p_1 > p_2$, define:

$$h_1(x,y,p_1,p_2) \equiv h_1^*(x,y,p_1) - h_1^*(x,y,p_2) = \int_{p_2}^{p_1} F_{g|u}(y; v(x,1))\,du.$$

Likewise, define

$$
\begin{aligned}
h_0^*(x,y,p) &= E((1-D)\,1\{Y \le y\} \mid X = x, P(Z) = p) \\
&= E[1\{U \ge P(Z)\}\,1\{g(v(X,0),\varepsilon) \le y\} \mid X = x, P(Z) = p] \\
&= \int_p^1 F_{g|u}(y; v(x,0))\,du,
\end{aligned}
$$

$$h_0(x,y,p_1,p_2) \equiv h_0^*(x,y,p_2) - h_0^*(x,y,p_1) = \int_{p_2}^{p_1} F_{g|u}(y; v(x,0))\,du.$$

Let $\mathcal{P}_x$ denote the support of $P(Z)$ given $X = x$. It can be shown that for any $x \in S_1$ and $\tilde{x} \in S_0$, and any $y$,

$$h_1(x,y,p,p') = h_0(\tilde{x},y,p,p') \quad \text{for all } p > p' \text{ on } \mathcal{P}_x \cap \mathcal{P}_{\tilde{x}} \tag{2.4}$$

if and only if

$$F_{g|p}(y; v(x,1)) = F_{g|p}(y; v(\tilde{x},0)) \quad \text{for all } p \in \mathcal{P}_x \cap \mathcal{P}_{\tilde{x}}. \tag{2.5}$$

Sufficiency is immediate from the definition of $h_1$ and $h_0$.
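As a numerical illustration of the sufficiency direction, the following minimal sketch simulates a model and compares sample analogues of $h_1$ and $h_0$. Everything specific in it is an assumption made for illustration and is not taken from the paper: the double index `v(x, d)`, the discrete support of `X`, the two-point propensity score, and the Gaussian dependence between the outcome shock and the selection shock. The index is chosen so that $v(x,1) = v(x+1,0)$, hence $x = 0$ should match $\tilde{x} = 1$ and no other value.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n = 600_000
rho = 0.5                 # assumed dependence between eps and U (endogeneity)
P_LO, P_HI = 0.3, 0.7     # assumed two-point support of P(Z)

X = rng.integers(0, 3, size=n)          # X in {0, 1, 2}, independent of Z
P = rng.choice([P_LO, P_HI], size=n)    # propensity score P(Z)
W = rng.standard_normal(n)              # U = Phi(W) is Uniform(0, 1)
eps = rho * W + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# D = 1{U < P(Z)} is the same event as 1{W < Phi^{-1}(P(Z))}.
inv = NormalDist().inv_cdf
D = (W < np.where(P == P_LO, inv(P_LO), inv(P_HI))).astype(int)

def v(x, d):
    """Hypothetical double index; by construction v(x, 1) == v(x + 1, 0)."""
    s = x + d
    return s, 1.0 + 0.5 * s

v1, v2 = v(X, D)
Y = v1 + v2 * eps                       # outcome with heteroskedastic shock

def h_star(d, x, y, p):
    """Sample analogue of h*_d(x, y, p) = E[1{D = d} 1{Y <= y} | X = x, P = p]."""
    cell = (X == x) & (P == p)
    return np.mean((D[cell] == d) & (Y[cell] <= y))

def h(d, x, y, p_hi, p_lo):
    """h_d(x, y, p_hi, p_lo): integral of F_{g|u}(y; v(x, d)) over (p_lo, p_hi)."""
    diff = h_star(d, x, y, p_hi) - h_star(d, x, y, p_lo)
    return diff if d == 1 else -diff

y_grid = np.linspace(-2.0, 4.0, 9)

def distance(x, x_tilde):
    """Sup-distance over the y grid between h_1(x, .) and h_0(x_tilde, .)."""
    return max(abs(h(1, x, y, P_HI, P_LO) - h(0, x_tilde, y, P_HI, P_LO))
               for y in y_grid)

matched = distance(0, 1)                       # v(0, 1) == v(1, 0)
unmatched = [distance(0, 0), distance(0, 2)]   # indices differ
print("matched pair distance:", matched)
print("unmatched pair distances:", unmatched)
```

In this design the distance for the matched pair should be close to zero up to sampling noise, while the unmatched pairs stay clearly away from zero. The sketch only uses the observables $(D, Y, X, P(Z))$ when forming `h_star`, which mirrors the claim that $h_1^*$ and $h_0^*$ are identified from the data.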
To see necessity, note that for all $p > p'$ on $\mathcal{P}_x \cap \mathcal{P}_{\tilde{x}}$,

$$\frac{\partial h_1(x,y,\tilde{p},p')}{\partial \tilde{p}}\bigg|_{\tilde{p}=p} = \frac{\partial h_1^*(x,y,\tilde{p})}{\partial \tilde{p}}\bigg|_{\tilde{p}=p} = F_{g|p}(y; v(x,1)),$$

$$\frac{\partial h_0(\tilde{x},y,\tilde{p},p')}{\partial \tilde{p}}\bigg|_{\tilde{p}=p} = -\frac{\partial}{\partial \tilde{p}} h_0^*(\tilde{x},y,\tilde{p})\bigg|_{\tilde{p}=p} = F_{g|p}(y; v(\tilde{x},0)).$$

[...]

ASSUMPTION A-1: The distribution of $U$ is absolutely continuous with respect to Lebesgue measure.

ASSUMPTION A-2: The random vectors $(U,\varepsilon)$ and $(X,Z)$ are independent.

ASSUMPTION A-3: The random variables $g(v(X,1),\varepsilon)$ and $g(v(X,0),\varepsilon)$ have finite first moments conditional on $U = u$ for all $u \in [0,1]$.

ASSUMPTION A-4: For any $(x,\tilde{x}) \in S_1 \times S_0$, $F_{g|p}(y; v(x,1)) = F_{g|p}(y; v(\tilde{x},0))$ for all $y$ and $p \in \mathcal{P}_x \cap \mathcal{P}_{\tilde{x}}$ if and only if $v(x,1) = v(\tilde{x},0)$.

ASSUMPTION A-5: [...] $X \in S$ [...] $X \in S$ [...] $(x,\tilde{x})$ with $v(x,1) = v(\tilde{x},0)$ [...]

[...] $v(X,D) \in \mathbb{R}$ is a single index and that for any $(x,\tilde{x}) \in S_1 \times S_0$, $E[g(v(x,1),\varepsilon) \mid U = u] = E[g(v(\tilde{x},0),\varepsilon) \mid U = u]$ if and only if $v(x,1) = v(\tilde{x},0)$. [...] $E[g(v(x,d),\varepsilon) \mid U = p]$ is a strictly monotonic function of $v(x,d)$. Second, when $v(x,d)$ is a vector of multiple indices instead of a single index, their approach breaks down. In comparison, we achieve the same purpose by matching the conditional distributions $F_{g|p}(\cdot\,; v(x,1))$ and $F_{g|p}(\cdot\,; v(\tilde{x},0))$, which remains valid even when $v(\cdot)$ is vector-valued and the monotonicity condition in Vytlacil and Yildiz (2007) is not satisfied.

We now present several examples in which the latent indices are multi-dimensional. In the first and third examples, the monotonicity condition in Vytlacil and Yildiz (2007) is not satisfied; in the second example, identification requires a generalization of the monotonicity condition into an invertibility condition in higher dimensions.
Example 1. (Heteroskedastic shocks in outcome)
Consider a triangular system where a continuous outcome is determined by double indices $v(X,D) \equiv (v_1(X,D), v_2(X,D))$:

$$Y = g(v(X,D),\varepsilon) = v_1(X,D) + v_2(X,D)\,\varepsilon \quad \text{for } D \in \{0,1\}.$$

The selection equation determining the actual treatment is the same as (1.2). In this case the concept of monotonicity in $v \in \mathbb{R}^2$ is not well-defined, so the procedure proposed in Vytlacil and Yildiz (2007) is not suitable here. Nevertheless, we can apply the method in Section 2 to identify the average treatment effect by using the distribution of the outcome to find pairs of $x$ and $\tilde{x}$ such that $v(x,1) = v(\tilde{x},0)$.

Footnote: For this particular design, the approach proposed in Vuong and Xu (2017) should be valid. But it will not be for a slightly modified model, such as $Y = v_1(X,D) + (e_1 + v_2(X,D) \cdot e_2)$, whereas ours will be.

Suppose $v_2(\cdot)$ is positive. To see the necessity in Assumption A-4, note that

$$F_{g|u}(y; v(x,d)) = E[1\{v_1(x,d) + v_2(x,d)\,\varepsilon \le y\} \mid U = u] = F_{\varepsilon|u}\!\left(\frac{y - v_1(x,d)}{v_2(x,d)}\right)$$

for $d = 0,1$. If the CDF of $\varepsilon$ is increasing over $\mathbb{R}$, then for all $y$ and $x \in S_1$ and $\tilde{x} \in S_0$, $F_{g|u}(y; v(x,1)) = F_{g|u}(y; v(\tilde{x},0))$ implies

$$\frac{y - v_1(x,1)}{v_2(x,1)} = \frac{y - v_1(\tilde{x},0)}{v_2(\tilde{x},0)}.$$

Differentiating with respect to $y$ yields $v_2(x,1) = v_2(\tilde{x},0)$, which in turn implies $v_1(x,1) = v_1(\tilde{x},0)$.

Example 2. (Multinomial potential outcome)
Consider a triangular system where the outcome is multinomial. The multinomial response model has a long and rich history in both applied and theoretical econometrics. Recent examples in the semiparametric literature include Lee (1995), Ahn, Powell, Ichimura, and Ruud (2017), Shi, Shum, and Song (2018), Pakes and Porter (2014), and Khan, Ouyang, and Tamer (2019). But unlike the work here, none of those papers allow for dummy endogenous variables or potential outcomes. The outcome is

$$Y = g(v(X,D),\varepsilon) = \arg\max_{j=0,1,\ldots,J} y^*_{j,D}$$

where

$$y^*_{j,D} = v_j(X,D) + \varepsilon_j \text{ for } j = 1,2,\ldots,J; \qquad y^*_{0,D} = 0.$$

In this case the index $v \equiv (v_j)_{j \le J}$ and the errors $\varepsilon \equiv (\varepsilon_j)_{j \le J}$ are both $J$-dimensional. The selection equation that determines $D$ is the same as (1.2). In this case, we can replace $1\{Y \le y\}$ by $1\{Y = y\}$ in the definition of $h_1$, $h_0$, $h_1^*$, $h_0^*$ and $F_{g|u}(\cdot\,; v)$. Then for $d = 0,1$ and $j \le J$,

$$
\begin{aligned}
F_{g|u}(j; v(x,d)) &\equiv E[1\{g(v(x,d),\varepsilon) = j\} \mid U = u] \\
&= \Pr\{v_j(x,d) + \varepsilon_j \ge v_{j'}(x,d) + \varepsilon_{j'} \ \forall j' \le J \mid U = u\}.
\end{aligned}
$$

By Ruud (2000) and Ahn, Powell, Ichimura, and Ruud (2017), the mapping from $v \in \mathbb{R}^J$ to $(F_{g|u}(j; v) : j \le J) \in \mathbb{R}^J$ is smooth and invertible provided that $\varepsilon \in \mathbb{R}^J$ has positive density everywhere. This implies Assumption A-4.

Example 3. (Potential outcome from the Roy model)
Consider a treatment effect model with an endogenous binary treatment $D$ and with the potential outcome determined by a latent Roy model. The Roy model has also been studied extensively from both applied and theoretical perspectives. See, for example, the literature survey in Heckman and Vytlacil (2007) and the seminal paper of Heckman and Honoré (1990).

Here the observed outcome consists of two pieces: a continuous measure $Y = DY_1 + (1-D)Y_0$ and a discrete indicator $W = DW_1 + (1-D)W_0$. For $d = 0,1$, these potential outcomes are given by

$$Y_d = \max_{j \in \{a,b\}} y^*_{j,d} \quad \text{and} \quad W_d = \arg\max_{j \in \{a,b\}} y^*_{j,d}$$

where $a$ and $b$ index potential outcomes realized in different sectors, with

$$y^*_{j,d} = v_j(X,d) + \varepsilon_j.$$

The binary endogenous treatment $D$ is determined as in the selection equation (1.2). For example, $D \in \{0,1\}$ indicates whether an individual participates in a certain professional training program, $W_d \in \{a,b\}$ indicates the potential sector in which the individual is employed, $y^*_{j,d}$ is the potential wage from sector $j$ under treatment $D = d$, and $Y_d \in \mathbb{R}$ is the potential wage if the treatment status is $D = d$. As before, we maintain that $(X,Z) \perp (\varepsilon,U)$.

The parameter of interest is

$$\Pr\{Y_1 \le y, W_1 = a \mid X\}$$

which, by the independence $(X,Z) \perp (\varepsilon,U)$ and an application of the law of total probability, can be decomposed into directly identifiable quantities and a counterfactual quantity

$$\Pr\{Y_1 \le y, W_1 = a \mid X = x, Z = z, D = 0\} = \Pr\{v_b(x,1) + \varepsilon_b < v_a(x,1) + \varepsilon_a \le y \mid U \ge P(z)\}. \tag{3.1}$$

Again, we seek to identify this counterfactual quantity by finding $\tilde{x} \in S_0$ such that

$$v_a(x,1) = v_a(\tilde{x},0) \quad \text{and} \quad v_b(x,1) = v_b(\tilde{x},0). \tag{3.2}$$

This would allow us to recover the right-hand side of (3.1) as

$$\Pr\{Y \le y, W = a \mid X = \tilde{x}, Z = z, D = 0\}.$$

To find such a pair $(x,\tilde{x})$, define $h_{d,W}(x,p,p')$, $h^*_{d,W}(x,p)$ by replacing $1\{Y \le y\}$ with $1\{W = a\}$ in the definition of $h_d$, $h_d^*$ in Section 2. Similarly, define $h_{d,Y}(x,y,p,p')$, $h^*_{d,Y}(x,y,p)$ by replacing $1\{Y \le y\}$ with $1\{Y \le y, W = a\}$ in the definition of $h_d$, $h_d^*$ in Section 2. Then

$$h_{d,W}(x,p_1,p_2) = \int_{p_2}^{p_1} \Pr\{v_b(x,d) + \varepsilon_b < v_a(x,d) + \varepsilon_a \mid U = u\}\,du;$$

$$h_{d,Y}(x,y,p_1,p_2) = \int_{p_2}^{p_1} \Pr\{v_b(x,d) + \varepsilon_b < v_a(x,d) + \varepsilon_a \le y \mid U = u\}\,du;$$

and $h_{d,W}(x,p_1,p_2)$ and $h_{d,Y}(x,y,p_1,p_2)$ are both identified over their respective domains by construction. Assume $(\varepsilon_a,\varepsilon_b)$ is continuously distributed with positive density over $\mathbb{R}^2$ conditional on all $u$. Then the statement

"$h_{1,W}(x,p,p') = h_{0,W}(\tilde{x},p,p')$ and $h_{1,Y}(x,y,p,p') = h_{0,Y}(\tilde{x},y,p,p')$ for all $y$ and $p > p'$ on $\mathcal{P}_x \cap \mathcal{P}_{\tilde{x}}$"

holds true if and only if (3.2) holds. Matching $h_{1,W}(x,p,p') = h_{0,W}(\tilde{x},p,p')$ ensures

$$v_a(x,1) - v_b(x,1) = v_a(\tilde{x},0) - v_b(\tilde{x},0), \tag{3.3}$$

and matching $h_{1,Y}(x,y,p,p') = h_{0,Y}(\tilde{x},y,p,p')$ at the same time ensures that, in addition to (3.3), $v_a(x,1) = v_a(\tilde{x},0)$.

The identification strategy we have used so far requires matching exogenous variables $x, \tilde{x}$ on $S_1, S_0$. In some cases, with the outcome being continuous, we can construct a similar argument for identifying a counterfactual quantity in a treatment effect model by matching different elements on the support of the continuous outcome. This approach was not investigated in Vytlacil and Yildiz (2007), which focused on the use of the first moment of the outcome. The following example illustrates this point.

Example 4. (Potential outcome with random coefficients)
Random coefficient models are prominent in both the theoretical and applied econometrics literature. They permit a flexible way to allow for conditional heteroscedasticity and unobserved heterogeneity. See, for example, Hsiao and Pesaran (2008) for a survey. Here we consider a treatment effect model where the potential outcome is determined through random coefficients:

$$Y = DY_1 + (1-D)Y_0 \quad \text{where} \quad Y_d = \alpha_d + X'\beta_d \text{ for } d = 0,1,$$

and $D$ is determined as in the selection equation (1.2). The random intercepts $\alpha_d \in \mathbb{R}$ and the random vectors of coefficients $\beta_d$ are given by

$$\alpha_d = \bar{\alpha}_d(X) + \eta_d \quad \text{and} \quad \beta_d = \bar{\beta}_d(X) + \varepsilon_d$$

where, for any $x$ and $d \in \{0,1\}$, $(\bar{\alpha}_d(x), \bar{\beta}_d(x)) \in \mathbb{R}^{K+1}$ is a vector of constant parameters while $\eta_d \in \mathbb{R}$ and $\varepsilon_d \in \mathbb{R}^K$ are unobservable noises.

As in Vytlacil and Yildiz (2007), assume some elements of $Z$ in the selection equation are excluded from $X$. We allow the vector of unobservable terms $(\varepsilon_1, \varepsilon_0, \eta_1, \eta_0, U)$ to be arbitrarily correlated. We also assume that

$$(X,Z) \perp (\varepsilon_1, \varepsilon_0, \eta_1, \eta_0, U), \tag{4.1}$$

with the marginal distribution of $U$ normalized to standard uniform, so that $\theta(Z)$ is directly identified as $P(Z) \equiv E(D \mid Z = z)$.

In this example our goal is to identify the conditional distribution of $Y_d$ given $X = x$ for $d = 0,1$. From this result we can identify parameters of interest such as average treatment effects, quantile treatment effects, etc. Let $G_{P|x}$ denote the conditional distribution of $P \equiv P(Z)$ given $X = x$, which is directly identifiable from the data-generating process. By construction,

$$\Pr\{Y_1 \le y \mid X = x\} = \int \Pr\{Y_1 \le y \mid X = x, P = p\}\,dG_{P|x}(p),$$

where

$$\Pr\{Y_1 \le y \mid X = x, P = p\} = E[D\,1\{Y_1 \le y\} \mid X = x, P = p] + E[(1-D)\,1\{Y_1 \le y\} \mid X = x, P = p].$$

The first term on the right-hand side is identified as $E[D\,1\{Y \le y\} \mid X = x, P = p]$, while the second term is counterfactual and can be written as

$$
\begin{aligned}
\phi_1(x,y,p) &\equiv E[1\{U \ge P\}\,1\{\alpha_1 + X'\beta_1 \le y\} \mid X = x, P = p] \\
&= E[1\{U \ge p\}\,1\{\bar{\alpha}_1(x) + \eta_1 + x'(\bar{\beta}_1(x) + \varepsilon_1) \le y\}] \\
&= \int_p^1 \Pr\{\eta_1 + x'\varepsilon_1 \le y - \bar{\alpha}_1(x) - x'\bar{\beta}_1(x) \mid U = u\}\,du.
\end{aligned}
$$

For any $p$ on the support of $P$ given $X = x$, define

$$
\begin{aligned}
h_1^*(x,y,p) &\equiv E[D\,1\{Y \le y\} \mid X = x, P = p] \\
&= E[1\{U < P\}\,1\{\alpha_1 + X'\beta_1 \le y\} \mid X = x, P = p] = E[1\{U < p\}\,1\{\alpha_1 + x'\beta_1 \le y\}] \\
&= \int_0^p \Pr\{\eta_1 + x'\varepsilon_1 \le y - \bar{\alpha}_1(x) - x'\bar{\beta}_1(x) \mid U = u\}\,du,
\end{aligned}
$$

where the second equality uses (4.1). Likewise, under (4.1) we have:

$$
\begin{aligned}
h_0^*(x,y,p) &\equiv E[(1-D)\,1\{Y \le y\} \mid X = x, P = p] \\
&= \int_p^1 \Pr\{\eta_0 + x'\varepsilon_0 \le y - \bar{\alpha}_0(x) - x'\bar{\beta}_0(x) \mid U = u\}\,du.
\end{aligned}
$$

Assume

$$F_{(\eta_1,\varepsilon_1)|U=u} = F_{(\eta_0,\varepsilon_0)|U=u} \quad \text{for all } u \in [0,1]. \tag{4.2}$$

Then

$$\phi_1(x,y,p) = \int_p^1 \Pr\{\eta_0 + x'\varepsilon_0 \le y - \bar{\alpha}_1(x) - x'\bar{\beta}_1(x) \mid U = u\}\,du. \tag{4.3}$$

Suppose for each pair $(x,y)$ we can find $t(x,y)$ such that

$$y - \bar{\alpha}_1(x) - x'\bar{\beta}_1(x) = t(x,y) - \bar{\alpha}_0(x) - x'\bar{\beta}_0(x).$$

Then by construction

$$
\begin{aligned}
h_0^*(x,t(x,y),p) &\equiv \int_p^1 \Pr\{\eta_0 + x'\varepsilon_0 \le t(x,y) - \bar{\alpha}_0(x) - x'\bar{\beta}_0(x) \mid U = u\}\,du \\
&= \int_p^1 \Pr\{\eta_0 + x'\varepsilon_0 \le y - \bar{\alpha}_1(x) - x'\bar{\beta}_1(x) \mid U = u\}\,du = \phi_1(x,y,p).
\end{aligned}
$$

Footnote: This type of distributional equality assumption generalizes the exact equality of $\varepsilon_1, \varepsilon_0$ as can be found in, for example, Vytlacil and Yildiz (2007).
Distributional equality has been used to motivate the rank similarity condition imposed frequently in the econometrics literature; see, for example, Chernozhukov and Hansen (2005), Frandsen and Lefgren (2018), Dong and Shen (2018), and Chen and Khan (2014).

Hence $\phi_1(x,y,p)$ would be identified as $h_0^*(x,t(x,y),p)$.

It remains to show that for each pair $(x,y)$ we can uniquely recover $t(x,y)$ using quantities that are identifiable in the data-generating process. To do so, we define two auxiliary functions as follows: for $p_1 > p_2$ on the support of $P$ given $X = x$, let

$$
\begin{aligned}
h_1(x,y,p_1,p_2) &\equiv h_1^*(x,y,p_1) - h_1^*(x,y,p_2) \\
&= \int_{p_2}^{p_1} \Pr\{\eta_1 + x'\varepsilon_1 < y - \bar{\alpha}_1(x) - x'\bar{\beta}_1(x) \mid U = u\}\,du;
\end{aligned}
$$

and

$$
\begin{aligned}
h_0(x,y,p_1,p_2) &\equiv h_0^*(x,y,p_2) - h_0^*(x,y,p_1) \\
&= \int_{p_2}^{p_1} \Pr\{\eta_0 + x'\varepsilon_0 < y - \bar{\alpha}_0(x) - x'\bar{\beta}_0(x) \mid U = u\}\,du.
\end{aligned}
$$

Suppose $\eta_d + x'\varepsilon_d$ is continuously distributed over $\mathbb{R}$ for all values of $x$ conditional on all $u \in [0,1]$. Then for any $(x,y)$ and $p_2 < p_1$,

$$h_1(x,y,p_1,p_2) = h_0(x,t(x,y),p_1,p_2)$$

if and only if

$$t(x,y) = y - \bar{\alpha}_1(x) - x'\bar{\beta}_1(x) + \bar{\alpha}_0(x) + x'\bar{\beta}_0(x).$$

To see this, suppose $t(x,y) > y - \bar{\alpha}_1(x) - x'\bar{\beta}_1(x) + \bar{\alpha}_0(x) + x'\bar{\beta}_0(x)$; then (4.2) implies that $h_0(x,t(x,y),p_1,p_2) > h_1(x,y,p_1,p_2)$. A symmetric argument establishes a similar statement with "$>$" replaced by "$<$". This establishes our desired result.

Here we outline estimation procedures, based on a random sample of the observed variables, that are motivated by our identification results. We will first describe an estimation procedure for the parameter $E[Y_1]$ in the first three examples. Let $\mathcal{P}_x$ denote the support of $P(Z)$ given $X = x$, let $f_p(\cdot \mid x)$ denote the density of $P(Z)$ given $X = x$, and let $\mathcal{P}^c_x = \{p : f_p(p \mid x) > c\}$ for a known $c > 0$. For simplicity, assume a strong overlap condition that $1 - c > P(Z) > c$ for a known $c > 0$. Define the distance between $h_1(x_1,\cdot)$ and $h_0(x_0,\cdot)$ as

$$\|h_1(x_1,\cdot) - h_0(x_0,\cdot)\| = \left\{ \int\!\!\int\!\!\int \left( \int_{p_2}^{p_1} \big( F_{g|u}(y; v(x_1,1)) - F_{g|u}(y; v(x_0,0)) \big)\,du \right)^{2} I(p_1,p_2 \in \mathcal{P}^c_x)\,w(y)\,dy\,dp_1\,dp_2 \right\}^{1/2}$$

where $w(y)$ is a chosen weight function. Consider the case when $h_1(x,y,p_1,p_2)$, $h_0(x,y,p_1,p_2)$ and $P(z)$ are known. For any given $X_i$, let $\tilde{X}_i$ be such that

$$\|h_1(X_i,\cdot) - h_0(\tilde{X}_i,\cdot)\| = 0$$

which, under Assumption A-4 in Section 2, is equivalent to $v(X_i,1) = v(\tilde{X}_i,0)$. Define

$$\hat{Y}_i = E\big(Y \mid D = 0, \|h_1(X_i,\cdot) - h_0(X,\cdot)\| = 0, P = P_i\big).$$

Note that the conditional expectation on the right-hand side is equal to $E[Y_0 \mid D = 0, v(X,0) = v(X_i,1), P = P_i]$, which in turn equals $E[Y_1 \mid D = 0, X = X_i, P = P_i]$.

Then, following the discussion above, we define the following estimator for $\Delta \equiv E[Y_1]$:

$$\hat{\Delta} = \frac{1}{n}\sum_{i=1}^{n} \big( D_i Y_i + (1 - D_i)\hat{Y}_i \big)$$

or a weighted version

$$\hat{\Delta}_w = \frac{\frac{1}{n}\sum_{i=1}^{n} 1\{X_i \in A\}\big( D_i Y_i + (1 - D_i)\hat{Y}_i \big)}{\frac{1}{n}\sum_{i=1}^{n} 1\{X_i \in A\}}.$$

Limiting distribution theory for each of these estimators follows from arguments identical to those in Vytlacil and Yildiz (2007). Here we formally state the theorem for the first estimator:
Theorem 5.1
Under Assumptions A-1 to A-5, and the additional assumption that $Y$ has positive and finite second moment, we have

$$\sqrt{n}(\hat{\Delta} - \Delta) \xrightarrow{d} N(0,V)$$

where

$$V = Var(E[Y_1 \mid X,P,D]) + E[P\,Var(Y_1 \mid X,P,D=1)].$$

Now we describe an estimation procedure for the distributional treatment effect in Example 4, where we had a model with random coefficients. In this case, the parameter of interest is, for a chosen value of the scalar $y$,

$$\Delta(y) = \Pr\{Y_1 \le y\}.$$

First, for fixed values of $y$ and $p_1 > p_2$, we propose to estimate $t(x,y)$ as

$$\hat{t}(x,y,p_1,p_2) = \arg\min_t \big( h_1(x,y,p_1,p_2) - h_0(x,t,p_1,p_2) \big)^2$$

and then average over values of $p_1, p_2$:

$$\hat{\tau}(x,y) = \frac{1}{n(n-1)}\sum_{i \ne j} I[P_i > P_j]\,\hat{t}(x,y,P_i,P_j).$$

An infeasible estimator for the parameter $\Delta(y)$, which assumes $t(x,y)$ is known, would be

$$\hat{\Delta}(y) = \frac{1}{n}\sum_{i=1}^{n} \big( D_i\,1\{Y_i \le y\} + (1 - D_i)\,1\{Y_i \le t(X_i,y)\} \big).$$

In practice, for feasible estimation, one needs to replace $t(x,y)$ by its estimator $\hat{\tau}(x,y)$.

This section presents simulation evidence for the performance of the estimation procedures described in Section 5, for both the Average Treatment Effect and the Distributional Treatment Effect. We report results for both our proposed estimator and that in Vytlacil and Yildiz (2007), for several designs. These include designs where the monotonicity condition fails, and designs where the disturbance terms in the outcome equation are multidimensional.

Throughout all designs we model the treatment or dummy endogenous variable as

$$D = I[Z - U > 0],$$

where $Z, U$ are independent standard normal. We experiment with the following designs for the outcome:
Design 1: $Y = X + 0.[\ldots] \cdot D + \varepsilon$, where $X$ is standard normal and $(\varepsilon,U)$ are distributed bivariate normal, each with mean 0 and variance 1, with correlations of 0, 0.25, 0.5.

Design 2: $Y = X + 0.[\ldots] \cdot D + (X + D) \cdot \varepsilon$, where $X$ is distributed standard normal and $(\varepsilon,U)$ are distributed bivariate normal, each with mean 0 and variance 1, with correlations of 0, 0.25, 0.5.

Design 3: $Y = (X + 0.[\ldots] \cdot D + \varepsilon)$, where $X$ is distributed standard normal and $(\varepsilon,U)$ are distributed bivariate normal, each with mean 0 and variance 1, with correlations of 0, 0.25, 0.5.

We note that the monotonicity condition is satisfied in Design 1 but fails in the other two designs. For each of these designs, we report results for estimating the parameter $E[Y_1]$, which denotes the expected value of the potential outcome under treatment $D = 1$. The two estimators used in the simulation study were the one proposed in Section 5 and the method proposed in Vytlacil and Yildiz (2007). The summary statistics, scaled by the true parameter value, are Mean Bias, Median Bias, Root Mean Squared Error (RMSE), and Median Absolute Deviation (MAD). They were evaluated for sample sizes of 100, 200, 400 over 401 replications. Results for these designs are reported in Tables 1 to 3 respectively.

In implementing our procedure we assumed the propensity score function is known, and conducted the next-stage estimation using a nonparametric kernel estimator with a normal kernel function and a bandwidth of $n^{-[\ldots]}$. This rate reflects "undersmoothing" as there are two regressors, the propensity score and the regressor $X$.
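To make the simulation setup concrete, the following is a minimal sketch of the data-generating processes for Designs 1 to 3. It is not the authors' code. Two things in it are assumptions: the coefficient on $D$ (the name `BETA` and its value 0.5 are placeholders, since the value is not legible in the text), and the functional form of Design 3, which is taken here to be the square of the index. The sketch compares a Monte Carlo value of the target $E[Y_1]$ with the naive mean of $Y$ among the treated, which ignores selection.

```python
import numpy as np

BETA = 0.5  # placeholder assumption: the coefficient on D is not legible in the text

def simulate(design, rho, n, rng):
    """Draw one sample of (Y, D, X, Z) together with the potential outcome Y1."""
    X = rng.standard_normal(n)
    Z = rng.standard_normal(n)
    cov = [[1.0, rho], [rho, 1.0]]            # (eps, U) bivariate normal, unit variances
    eps, U = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    D = (Z - U > 0).astype(float)             # selection equation D = I[Z - U > 0]

    def outcome(d):
        index = X + BETA * d
        if design == 1:
            return index + eps
        if design == 2:
            return index + (X + d) * eps
        if design == 3:
            return (index + eps) ** 2         # assumed: square of the index
        raise ValueError("design must be 1, 2 or 3")

    Y1, Y0 = outcome(1.0), outcome(0.0)
    Y = D * Y1 + (1.0 - D) * Y0               # observed outcome
    return {"Y": Y, "D": D, "X": X, "Z": Z, "Y1": Y1}

rng = np.random.default_rng(0)
results = {}
for design in (1, 2, 3):
    s = simulate(design, rho=0.5, n=200_000, rng=rng)
    target = s["Y1"].mean()                   # Monte Carlo value of E[Y1]
    naive = s["Y"][s["D"] == 1].mean()        # treated-sample mean, ignores selection
    results[design] = (target, naive)
    print(f"Design {design}: E[Y1] ~ {target:.3f}, treated mean ~ {naive:.3f}")
```

With a nonzero correlation between $\varepsilon$ and $U$, the treated-sample mean differs from $E[Y_1]$, which is the endogeneity problem the estimators compared in this section are meant to address.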
For the estimator in Vytlacil and Yildiz (2007), which involves the derivative of conditional expectation functions as well, estimating these functions nonparametrically gave very unstable results, so we report results for an infeasible version of their estimator, assuming such functions, as well as the propensity scores, are known.

To implement the second stage of our proposed procedure, in calculating the distance $\|h_1(x_i,\cdot) - h_0(x_j,\cdot)\|$ we used an evenly spaced grid of values for $y$, and selected $n/50$ grid points, with $n$ denoting the sample size.

The results indicate the desirable properties of our proposed procedure, generally agreeing with Theorem 5.1. In all designs our estimator has small values for bias and RMSE, with the value of RMSE decreasing as the sample size grows. In contrast, the procedure based on Vytlacil and Yildiz (2007) only performs well in Design 1, with values of bias and RMSE comparable to those using our method. As in our procedure, these values decrease as the sample size grows, which is expected, as the monotonicity condition they rely on is satisfied in this design. In this case, their approach has smaller standard errors, largely due to the relatively simpler structure of the infeasible version, but their biases persist even when the sample size increases.

For Designs 2 and 3, where monotonicity is violated, the procedure proposed in Vytlacil and Yildiz (2007) does not perform well. In Design 2 (Table 2) both the bias and RMSE are generally increasing with the sample size. Results for their estimator are better in Design 3, but the bias hardly shrinks with the sample size and is much larger than that of our estimator.

We also simulate data from a model with a dummy endogenous variable and potential outcomes determined by random coefficients. It is important to note that for this design, the original matching idea in Vytlacil and Yildiz (2007) does not apply. This is because different values of $x$ lead to different distributions of the composite error $\eta_d + x'\varepsilon_d$. Our contribution in Section 4 is to propose a new approach based on matching different values of the outcome $y$, rather than the regressors $x$. Based on the counterfactual framework discussed in Section 4, here the treatment variable $D$ is modeled in the same way as the dummy endogenous variable above. Similarly, the regressor $X$ is standard normal. For both $Y_0$ and $Y_1$ the random intercepts were modeled as constants (0 and 1, respectively) and the additive error terms were each standard normal.
For the random slopes, the means were 1 and 2 respectively, and the additive error terms were also standard normal, independent of all other disturbance terms and of each other. Here we use the procedure in Section 4 to estimate the parameter $\Delta = P(Y_1 < y)$, where in the simulation we set $y = 1$. The same four summary statistics are reported for sample sizes 100, 200, 400, based on 401 replications. Results for this random coefficients design are reported in Table 4.

The estimator proposed in Section 5 performs well: the bias and RMSE are much smaller at 400 observations than at 100 and 200 observations, indicating convergence at the parametric rate.

Table 1: CKT and VY estimators (table body not recovered).
Table 4: CKT estimator (table body not recovered).

In this paper, we considered identification and estimation of nonseparable models with endogenous binary treatment. Existing approaches are based on a monotonicity condition, which is violated in models with multiple unobserved idiosyncratic shocks. Such models arise in many important empirical settings, including Roy models and multinomial choice models with dummy endogenous variables, as well as treatment effect models with random coefficients. We establish novel identification results for these models. The results are constructive and lead to estimation procedures that are easy to compute and whose limiting distributional properties follow from standard large sample theorems. A simulation study indicates adequate finite sample performance of our proposed methods.

This paper leaves open areas for future research. Our method requires the selection of the number and location of cutoff points, so a data-driven method for selecting these would be useful. Furthermore, the relative efficiency of our proposed approach needs to be explored, perhaps by deriving efficiency bounds for these new classes of models.
References
Ahn, H., and J. Powell (1993): “Semiparametric Estimation of Censored Selection Models,” Journal of Econometrics, 58, 3–29.
Ahn, H., J. Powell, H. Ichimura, and P. Ruud (2017): “Simple Estimators for Invertible Index Models,” Journal of Business Economics and Statistics, 36, 1–10.
Auerbach, E. (2019): “Identification and Estimation of a Partially Linear Regression Model using Network Data,” mimeograph, Northwestern University.
Chen, S., S. Khan, and X. Tang (2016): “On the Informational Content of Special Regressors in Heteroskedastic Binary Response Models,” Journal of Econometrics, 193, 162–182.
Chen, S. H., and S. Khan (2014): “Semiparametric Estimation of Program Impacts on Dispersion of Potential Wages,” Journal of Applied Econometrics, 29, 901–919.
Chernozhukov, V., and C. Hansen (2005): “An IV Model of Quantile Treatment Effects,” Econometrica, 73, 245–261.
D’Haultfoeuille, X., and P. Fevrier (2015): “Identification of Nonseparable Triangular Models with Discrete Instruments,” Econometrica, 3, 1199–1210.
Dong, Y., and S. Shen (2018): “The Empirical Content of the Roy Model,” The Review of Economics and Statistics, 100, 78–85.
Feng, J. (2020): “Matching Points: Supplementing Instruments with Covariates in Triangular Models,” mimeograph, Columbia University.
Frandsen, B., and L. Lefgren (2018): “Testing Rank Similarity,” The Review of Economics and Statistics, 100, 86–91.
Han, S., and E. J. Vytlacil (2017): “Identification in a generalization of bivariate probit models with endogenous regressors,” Journal of Econometrics, 199, 63–73.
Heckman, J., and E. Vytlacil (2007): “Econometric evaluation of social programs,” in Handbook of Econometrics, Vol. 6B, ed. by J. Heckman, and E. Leamer. Amsterdam: North Holland.
Heckman, J., and B. Honoré (1990): “The Empirical Content of the Roy Model,” Econometrica, 58, 1121–1149.
Hsiao, C., and M. H. Pesaran (2008): “Random Coefficient Models,” in The Econometrics of Panel Data: Advanced Studies in Theoretical and Applied Econometrics, ed. by L. Matyas, and P. Sevestre. Springer.
Imbens, G., and W. Newey (2009): “Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity,” Econometrica, 77(5), 1481–1512.
Khan, S., A. Maurel, and Y. Zhang (2019): “Informational Content of Factor Structures in Simultaneous Discrete Response Models,” Working Paper, Duke University.
Khan, S., F. Ouyang, and E. Tamer (2019): “Inference in Semiparametric Multinomial Response Models,” Boston College Working Paper.
Lee, L.-F. (1995): “Semiparametric Maximum Likelihood Estimation of Polychotomous and Sequential Choice Models,” Journal of Econometrics, 65, 381–428.
Lewbel, A., D. Jacho-Chavez, and J. Escanciano (2016): “Identification and estimation of semiparametric two-step models,” Quantitative Economics, 7, 561–589.
Mourifié, I. (2015): “Sharp Bounds on Treatment Effects in a Binary Triangular System,” Journal of Econometrics, 187(1), 74–81.
Pakes, A., and J. Porter (2014): “Moment Inequalities for Multinomial Choice with Fixed Effects,” Harvard University Working Paper.
Ruud, P. (2000): “Semiparametric estimation of discrete choice models,” mimeograph, University of California at Berkeley.
Shaikh, A. M., and E. Vytlacil (2011): “Partial Identification in Triangular Systems of Equations with Binary Dependent Variables,” Econometrica, 79(3), 949–955.
Shi, X., M. Shum, and W. Song (2018): “Estimating Semi-Parametric Panel Multinomial Choice Models using Cyclic Monotonicity,” Econometrica, 86, 737–761.
Torgovitsky, A. (2015): “Identification of Nonseparable Models Using Instruments with Small Support,” Econometrica, 3, 1185–1197.
Vuong, Q., and H. Xu (2017): “Counterfactual mapping and individual treatment effects in nonseparable models with binary endogeneity,” Quantitative Economics, pp. 589–610.
Vytlacil, E. J., and N. Yildiz (2007): “Dummy Endogenous Variables in Weakly Separable Models,”