Estimating Marginal Treatment Effects under Unobserved Group Heterogeneity
Tadao Hoshino † and Takahide Yanagi ‡
This version: April 2020. First version: January 2020.
Abstract
This paper studies endogenous treatment effect models in which individuals are classified into unobserved groups based on heterogeneous treatment choice rules. Such heterogeneity may arise, for example, when multiple treatment eligibility criteria and different preference patterns exist. Using a finite mixture approach, we propose a marginal treatment effect (MTE) framework in which the treatment choice and outcome equations can be heterogeneous across groups. Under the availability of valid instrumental variables specific to each group, we show that the MTE for each group can be separately identified using the local instrumental variable method. Based on our identification result, we propose a two-step semiparametric procedure for estimating the group-wise MTE parameters. We first estimate the finite-mixture treatment choice model by a maximum likelihood method and then estimate the MTEs using a series approximation method. We prove that the proposed MTE estimator is consistent and asymptotically normally distributed. We illustrate the usefulness of the proposed method with an application to economic returns to college education.
Keywords: endogeneity, finite mixture, instrumental variables, marginal treatment effects, unobserved heterogeneity.
JEL Classification: C14, C31, C35.

∗ The authors thank Toru Kitagawa, Yasushi Kondo, Ryo Okui, Myung Hwan Seo, Hisatoshi Tanaka, Yuta Toyama, and seminar participants at Hitotsubashi University, Seoul National University, and Waseda University for their valuable comments. This work was supported by JSPS KAKENHI Grant Numbers 15K17039 and 19H01473.
† School of Political Science and Economics, Waseda University, 1-6-1 Nishi-waseda, Shinjuku-ku, Tokyo 169-8050, Japan. Email: [email protected].
‡ Graduate School of Economics, Kyoto University, Yoshida Honmachi, Sakyo, Kyoto 606-8501, Japan. Email: [email protected].

1 Introduction
Assessing heterogeneity in treatment effects with respect to observed and unobserved characteristics is an important issue for precise treatment evaluation. The marginal treatment effect (MTE) framework has become increasingly popular in the econometrics literature as a way to explain the heterogeneous nature of treatment effects (Heckman and Vytlacil, 1999, 2005). The MTE is a treatment parameter defined as the average treatment effect conditional on both observed individual characteristics and the unobserved individual cost of the treatment. The MTE framework is useful in several respects. First, the framework can be used in non-experimental applications in which individuals may endogenously determine their own treatment status. The endogeneity of the treatment can be dealt with using the method of local instrumental variables (IVs). Moreover, once the MTE is estimated, it can be used to build other treatment parameters such as the average treatment effect (ATE) and the local average treatment effect (LATE). For a recent overview of the MTE approach and its practical implementation, see, for example, Cornelissen et al. (2016), Andresen (2018), Mogstad and Torgovitsky (2018), Zhou and Xie (2019), and Shea and Torgovitsky (2020).

While conventional treatment evaluation methods, including the MTE framework, can address heterogeneity among individuals and across observable groups of individuals, many applications may exhibit "unobserved" group-wise heterogeneity in treatment effects for various reasons. For example, the presence of multiple treatment eligibility criteria may create unobserved groups. As a typical application, consider evaluating the causal effect of college education. Since schools typically offer a variety of admission options (entrance exams, sports referrals, and so on), this process classifies individuals into several groups, and the admission criteria to which each individual has applied are typically unknown to us.
Such differences in admission requirements may result in heterogeneous treatment effects of college education. Another potential reason for the presence of unobserved group heterogeneity is that the population may be composed of groups with different preference patterns. For instance, consider estimating the causal effect of foster care for abused children, as in Doyle Jr. (2007). Here, the treatment variable of interest is whether the child is placed into foster care by the child protection investigator. The author discusses the possibility that child protection investigators may have different preference patterns that place relative emphasis on child protection. These examples suggest unobserved group patterns in the treatment choice process that may lead to heterogeneity in treatment effects.

In this paper, we study endogenous treatment effect models in which individuals are grouped into latent subpopulations, where the presence of the latent groups is accounted for by a finite mixture model. Finite mixture approaches have been successfully used in various fields, such as biology, economics, marketing, and medicine, to analyze data from heterogeneous subpopulations (McLachlan and Peel, 2004). However, the use of finite mixture models in treatment evaluation has been considered only in recent years in a few specific applications (e.g., Harris and Sosa-Rubi, 2009; Munkin and Trivedi, 2010; Deb and Gregory, 2018; Samoilenko et al., 2018). Compared with these studies, our modeling approach is more general and formally builds on Rubin's (1974) causal model by directly extending it to finite mixture models. In particular, we allow the treatment choice and outcome equations to be fully heterogeneous across groups. For this model, we develop an identification and estimation procedure for the MTE parameters that can be unique to each latent group.
The proposed group-wise MTE is a novel framework in the literature, which should be informative for understanding the heterogeneous nature of treatment effects by capturing both group-level and individual-level unobserved heterogeneity simultaneously. Importantly, as we discuss below, the presence of unobserved group heterogeneity threatens the validity of conventional IV-based causal inference methods, such as the conventional MTE approach and the two-stage least squares (2SLS) approach for estimating the LATE. Specifically, we demonstrate that the presence of unobserved heterogeneous groups may invalidate the monotonicity condition (e.g., Imbens and Angrist, 1994; Angrist et al., 1996; Vytlacil, 2002; Heckman and Pinto, 2018), which is an essential identification condition on which these approaches are based.

Our identification strategy builds on the method of local IVs by Heckman and Vytlacil (1999). Our main identification result requires two key conditions. The first condition is that there exists a valid group-specific continuous IV. As in standard IV estimation, each group-specific IV must satisfy the exclusion restriction (i.e., the IV is independent of unobserved variables) and the relevance restriction (i.e., the IV is a determinant of the treatment). The second condition is the exogeneity of group membership; that is, group membership is conditionally independent of the unobserved variables affecting treatment choices. Since the second condition may be demanding in some applications, we also provide supplementary identification results for when group membership is endogenous.

Based on our constructive identification results, we propose a two-step semiparametric estimator for the group-wise MTEs. In the first step, we estimate a finite-mixture treatment choice model using a parametric maximum likelihood (ML) method. In the second step, the MTE parameters are estimated using a series approximation method.
Under certain regularity conditions, we show that the proposed MTE estimator is consistent and asymptotically normally distributed.

As an empirical illustration, we estimate the effects of college education on annual income using labor data from Japan. We find that our observations can be classified into two latent groups: one (group 1) consists of individuals whose college enrollment decisions are affected by family characteristics such as parental education level and economic conditions, and the other (group 2) is composed of individuals who are affected by the regional educational environment. Our empirical results indicate that for group 1, the treatment effect of college education is significantly positive if the (unobserved) cost of going to college is small. In contrast, we find no such heterogeneity for group 2.
Organization of the paper.
Section 2 introduces our model and presents our main identification result for the group-wise MTE. In Section 3, we discuss the estimation procedure for the MTE parameters and prove its asymptotic properties. Section 4 provides two additional discussions: first, on the identification of the MTE when group membership is endogenous and, second, on the identification of the LATE when only binary IVs are available. Section 5 presents Monte Carlo experiments and the empirical illustration, and Section 6 concludes the paper. In Appendices A and B, we provide technical proofs for our results. Appendix C presents supplementary identification results.

2 Model
In this section, we introduce our treatment effect model that allows for the presence of an unknown mixture of multiple subpopulations. We assume that the number of groups is finite and known, denoted by $S$. Each individual belongs to exactly one of the $S$ groups, and the group the individual belongs to, which we denote by $s \in \{1, \ldots, S\}$, is a latent variable unknown to us. Our goal is to measure the causal effect of a potentially endogenous treatment variable $D \in \{0, 1\}$ on an outcome variable $Y \in \mathbb{R}$ for each group separately. Let $Y_j^{(d)}$ be the potential outcome when $s = j$ and $D = d$. Then, the observed outcome can be written as $Y = \sum_{j=1}^{S} \mathbf{1}\{s = j\}[D Y_j^{(1)} + (1 - D) Y_j^{(0)}]$, where $\mathbf{1}\{\cdot\}$ is the indicator function that takes one if the argument inside is true and zero otherwise. Suppose that the potential outcome equation is given by
\[ Y_j^{(d)} = \mu_j^{(d)}(X, \epsilon^{(d)}), \quad \text{for } j \in \{1, \ldots, S\} \text{ and } d \in \{0, 1\}, \tag{2.1} \]
where $X \in \mathbb{R}^{\dim(X)}$ is a vector of observed covariates, $\epsilon^{(d)} \in \mathbb{R}$ is an unobserved error term, and $\mu_j^{(d)}$ is an unknown structural function. This model specification is fairly general in that the functional form of $\mu_j^{(d)}$ is fully unrestricted. Our model implies that the distribution of the treatment effect $Y_j^{(1)} - Y_j^{(0)}$ is potentially heterogeneous across different groups.

Based on the latent index framework by Heckman and Vytlacil (1999, 2005), we characterize our treatment choice model as follows:
\[ D = \begin{cases} \mathbf{1}\{\mu_{D1}(Z_1) \ge \epsilon_{D1}\} & \text{if } s = 1, \\ \quad \vdots & \\ \mathbf{1}\{\mu_{DS}(Z_S) \ge \epsilon_{DS}\} & \text{if } s = S, \end{cases} \tag{2.2} \]
where for $j \in \{1, \ldots, S\}$, $Z_j \in \mathbb{R}^{\dim(Z_j)}$ is a vector of IVs that may contain elements of $X$, $\epsilon_{Dj} \in \mathbb{R}$ is an unobserved continuous random variable, and $\mu_{Dj}$ is an unknown function. We allow for arbitrary dependence between the $\epsilon_{Dj}$'s. Assume that for all $j$, the error $\epsilon_{Dj}$ is independent of the IVs $Z_j$'s conditional on $X$.
Moreover, we require that each $Z_j$ include at least one group-specific continuous variable to ensure that the function $\mu_{Dj}(Z_j)$ does not degenerate to a constant after conditioning on the values of $(X, Z_1, \ldots, Z_{j-1}, Z_{j+1}, \ldots, Z_S)$.

Let $F_j(\cdot \mid X)$ be the conditional cumulative distribution function (CDF) of $\epsilon_{Dj}$. Further, let $P_j := F_j(\mu_{Dj}(Z_j) \mid X)$ and $V_j := F_j(\epsilon_{Dj} \mid X)$ for $j \in \{1, \ldots, S\}$. By construction, each $V_j$ is distributed as Uniform$[0,1]$ conditional on $X$. Using these definitions, we can rewrite (2.2) as follows: $D = \mathbf{1}\{P_j \ge V_j\}$ if $s = j$, for $j \in \{1, \ldots, S\}$.

Remark 2.1 (Monotonicity). The presence of group heterogeneity in the treatment choice model may lead to the failure of the monotonicity condition in Imbens and Angrist (1994) and Angrist et al. (1996), which requires that shifts in the IVs determine the direction of change in the treatment choices uniformly in all individuals. To see this, for simplicity, consider a case with $S = 2$ and suppose that $\mu_{Dj}(Z_j) = Z\gamma_{zj} + \zeta_j \gamma_{\zeta j}$ for $j \in \{1, 2\}$, where $Z$ is a common IV among the groups, $\zeta_j$ is an IV specific to group $j$, and $Z_j = (Z, \zeta_j)$. Suppose that $\gamma_{z1} < 0$ and $\gamma_{z2} > 0$. Then, an increase in $Z$ makes the individuals in group 1 (group 2) less (more) likely to take the treatment, implying that the monotonicity condition does not hold. As a result, the conventional IV-based causal inference methods that rely on the monotonicity condition cannot be used as long as one focuses on the group-common IV $Z$ only. Note, however, that the monotonicity condition is still satisfied in terms of the group-specific IVs $(\zeta_1, \zeta_2)$. Thus, if we run a 2SLS regression using $(\zeta_1, \zeta_2)$ instead of $Z$ as IVs for the treatment, we would obtain some causal effects averaged over the groups, as will be demonstrated in Subsection 4.2.
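The monotonicity failure in Remark 2.1 can be checked in a short simulation. The sketch below uses purely hypothetical values (equal group shares, $\gamma_{z1} = -1$, $\gamma_{z2} = 1$, standard normal choice errors, and no group-specific IVs) and counts individuals who move into and out of treatment when the common instrument $Z$ is shifted upward:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent group membership: equal shares (hypothetical values).
s = rng.integers(1, 3, size=n)

# Opposite-signed coefficients on the common instrument Z, as in Remark 2.1.
gamma_z = {1: -1.0, 2: 1.0}
eps = rng.normal(size=n)  # choice error, held fixed across counterfactual Z values

def treat(z):
    """D = 1{mu_Dj(Z) >= eps_Dj} with mu_Dj(z) = z * gamma_zj for group j."""
    mu = np.where(s == 1, gamma_z[1] * z, gamma_z[2] * z)
    return (mu >= eps).astype(int)

d_low, d_high = treat(-1.0), treat(1.0)  # shift the instrument up

compliers = np.mean((d_low == 0) & (d_high == 1))  # moved into treatment
defiers = np.mean((d_low == 1) & (d_high == 0))    # moved out of treatment

print(compliers, defiers)  # both strictly positive
```

Because both proportions are strictly positive, the shift in $Z$ creates compliers and defiers at the same time, which is exactly the violation of uniform monotonicity discussed in the remark.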
However, this approach may overlook the possibility of heterogeneous treatment effects across groups.

For the treatment choice model in (2.2), we can interpret its meaning in several ways. The first interpretation is that there are actually multiple different treatment eligibility rules prescribed by policy makers. In the example of college enrollment, there are typically several different types of admission processes for each school, for example, paper-based entrance exams, sports referrals, and so on. Such a situation would correspond to this first type of interpretation. Another interpretation is that there are several types of treatment preference patterns. For example, consider again $D = 1$ if an individual goes to college and $D = 0$ otherwise. Suppose that a common instrumental variable $Z$ is the introduction of a physical education requirement in colleges along with mandatory augmented athletics facilities. When we specify the functional form of $\mu_{Dj}$ as in Remark 2.1, we imagine that some people dislike physical education ($\gamma_{z1} < 0$) while others like it ($\gamma_{z2} > 0$). In this situation, we can view the treatment choice model (2.2) as a binary response model with a discrete random coefficient.

Our main identification results are based on the following assumptions:
Assumption 2.1. (i) The IVs $Z = (Z_1, \ldots, Z_S)$ are independent of $(\epsilon^{(d)}, \epsilon_{Dj}, s)$ given $X$ for all $d$ and $j$. (ii) For each $j$, $Z_j$ has at least one group-specific continuous variable that is not included in $X$ or in the IVs for the other groups.

Assumption 2.2. (i) The membership variable $s$ is conditionally independent of $\epsilon_{Dj}$ given $X$ for all $j$. (ii) For each $j$, there exists a constant $\pi_j \in (0,1)$ such that $\Pr(s = j \mid X) = \pi_j$ and $\sum_{j=1}^{S} \pi_j = 1$.

Assumption 2.1(i) is an exclusion restriction requiring that the IVs are conditionally independent of all unobserved random variables, including the latent group membership. Assumption 2.1(ii) is somewhat demanding in that we require prior knowledge as to which variables may be relevant/irrelevant to the membership of each group. A similar assumption can be found in the econometrics literature on finite mixture models (e.g., Compiani and Kitamura, 2016). In practice, we can determine which IVs should belong to which group by examining their significance in each group or by some theoretical guidance. The conditional independence in Assumption 2.2(i) excludes some types of endogenous group formation, which could be restrictive in some empirical situations. Even without this assumption, we can establish some identification results for the MTE parameters, as shown in Subsection 4.1. In Assumption 2.2(ii), we assume a homogeneous membership probability for each group, which is restrictive but commonly used in the literature on finite mixture models. In fact, this assumption is made only for simplicity, and the theorem shown below still holds without modifications even when the membership probability is a function of $X$.
Moreover, we can identify the group-wise MTE even when the membership probability depends on other covariates besides $X$; see Subsection 4.1.

To identify the treatment effects of interest, we first need to identify the treatment choice model (2.2). Although there has been a long history of research on the identification of finite mixture models, only a few results are directly applicable here; in the remainder of this section, we treat the $P_j$'s and $\pi_j$'s as known objects.

(Footnote: This example scenario is borrowed from Heckman and Vytlacil (2005), Subsection 6.3.)

The MTE parameter specific to group $j$ is defined as
\[ \mathrm{MTE}_j(x, p) := m_j^{(1)}(x, p) - m_j^{(0)}(x, p), \tag{2.3} \]
where $m_j^{(d)}(x, p) := E[Y_j^{(d)} \mid X = x, s = j, V_j = p]$ is the marginal treatment response (MTR) function specific to group $j$ and $d \in \{0, 1\}$. This expression implies that the identification of the MTE follows directly from that of the MTR functions. Below, we show that the MTR functions can be identified through the partial derivatives of the following functions:
\[ \psi_1(x, p) := E[DY \mid X = x, P = p], \qquad \psi_0(x, p) := E[(1 - D)Y \mid X = x, P = p], \]
where $P = (P_1, \ldots, P_S)$ and $p = (p_1, \ldots, p_S)$. Note that $\psi_d(x, p)$ can be directly identified from the data on $\mathrm{supp}[X, P \mid D = d]$, where $\mathrm{supp}[X, P \mid D = d]$ denotes the joint support of $(X, P)$ given $D = d$.

Theorem 2.1.
Suppose that Assumptions 2.1 and 2.2 hold. If $m_j^{(1)}(x, \cdot)$ and $m_j^{(0)}(x, \cdot)$ are continuous, we have
\[ m_j^{(1)}(x, p_j) = \frac{1}{\pi_j} \frac{\partial \psi_1(x, p)}{\partial p_j}, \qquad m_j^{(0)}(x, p_j) = -\frac{1}{\pi_j} \frac{\partial \psi_0(x, p)}{\partial p_j}. \]

Once the group-wise MTEs are identified for all $p$ and $x$ based on this result, we can assess heterogeneity in treatment effects. Furthermore, we can identify many other treatment parameters. For example, $\mathrm{CATE}_j(x) = \int_0^1 \mathrm{MTE}_j(x, v)\,dv$, where $\mathrm{CATE}_j(x) = E[Y_j^{(1)} - Y_j^{(0)} \mid X = x, s = j]$ is the group-wise conditional average treatment effect. Furthermore, the group-wise ATE, $\mathrm{ATE}_j = E[Y_j^{(1)} - Y_j^{(0)} \mid s = j]$, can be obtained as $\mathrm{ATE}_j = \int \mathrm{CATE}_j(x) f_X(x)\,dx$, where $f_X$ is the marginal density function of $X$ (note that Assumption 2.2(ii) implies $f_X(\cdot \mid s = j) = f_X(\cdot)$). Then, the ATE for the whole population is simply given by $\mathrm{ATE} = \sum_{j=1}^{S} \pi_j \mathrm{ATE}_j$. We can also identify the so-called policy relevant treatment effect (PRTE), as in Heckman and Vytlacil (2005); see Appendix C.2 for the details.

3 Estimation

This section considers the estimation of the group-wise MTE parameters using an independent and identically distributed (IID) sample $\{(Y_i, D_i, X_i, Z_i) : 1 \le i \le n\}$. Throughout this section, Assumptions 2.1 and 2.2 are assumed to hold.

3.1 Two-step series estimation

First-stage: ML estimation.
To estimate the treatment choice model, we consider the following parametric model specification:
\[ D = \mathbf{1}\{Z_j^\top \gamma_j \ge \epsilon_{Dj}\} \quad \text{with probability } \pi_j > 0, \quad \text{for } j \in \{1, \ldots, S\}. \tag{3.1} \]
We assume that $\epsilon_{Dj}$ is independent of $(X, Z)$, and the CDF $F_j$ of $\epsilon_{Dj}$ is a known function (such as the standard normal or logistic distribution). Define $\gamma := (\gamma_1^\top, \ldots, \gamma_S^\top)^\top$ and $\pi := (\pi_1, \ldots, \pi_S)^\top$. Then, the conditional likelihood function for an observation $i$ when $s_i = j$ is given by
\[ L_i(\gamma \mid s_i = j) := F_j(Z_{ji}^\top \gamma_j)^{D_i} \left[1 - F_j(Z_{ji}^\top \gamma_j)\right]^{1 - D_i}. \]
Thus, the ML estimator for $\gamma$ and $\pi$ can be obtained by
\[ (\hat\gamma_n, \hat\pi_n) := \operatorname*{argmax}_{(\tilde\gamma, \tilde\pi) \in \Gamma \times \mathcal{C}^S} \sum_{i=1}^{n} \log \sum_{j=1}^{S} \tilde\pi_j L_i(\tilde\gamma \mid s_i = j), \]
where $\Gamma \subset \mathbb{R}^{\sum_{j=1}^{S} \dim(Z_j)}$ and $\mathcal{C}^S := \{\tilde\pi \in (0,1)^S : \sum_{j=1}^{S} \tilde\pi_j = 1\}$ are the parameter spaces. We then obtain the estimator of $P_j = F_j(Z_j^\top \gamma_j)$ as $\hat P_j = F_j(Z_j^\top \hat\gamma_{n,j})$. In the numerical studies below, we use the expectation-maximization (EM) algorithm to solve this ML problem, following the literature on finite mixture models (e.g., Dempster et al., 1977; McLachlan and Peel, 2004; Train, 2008).

Second-stage: series estimation.
For the potential outcome equation, we assume the following linear model for convenience, which is a popular setup in the literature: for each $d$ and $j$,
\[ Y_j^{(d)} = X^\top \beta_j^{(d)} + \epsilon^{(d)}. \tag{3.2} \]
Here, the error term $\epsilon^{(d)}$ can generally depend on the group membership $j$, but we suppress the dependence for expositional simplicity. Assume that $X$ is independent of $(\epsilon^{(d)}, \epsilon_{Dj}, s)$ for all $d$ and $j$. Then, we have
\[ \mathrm{MTE}_j(x, p) = m_j^{(1)}(x, p) - m_j^{(0)}(x, p) = x^\top(\beta_j^{(1)} - \beta_j^{(0)}) + E[\epsilon^{(1)} - \epsilon^{(0)} \mid s = j, V_j = p], \]
with $m_j^{(d)}(x, p) = x^\top \beta_j^{(d)} + E[\epsilon^{(d)} \mid s = j, V_j = p]$. By the same argument as in the proof of Theorem 2.1, we can show that there exist univariate functions $g_j^{(1)}$ and $g_j^{(0)}$ for each $j$ satisfying
\[ E[\epsilon^{(1)} \mid s = j, V_j \le p_j] = \frac{1}{p_j} \int_0^{p_j} E[\epsilon^{(1)} \mid s = j, V_j = v]\,dv =: \frac{g_j^{(1)}(p_j)}{p_j}, \]
\[ E[\epsilon^{(0)} \mid s = j, V_j > p_j] = \frac{1}{1 - p_j} \int_{p_j}^{1} E[\epsilon^{(0)} \mid s = j, V_j = v]\,dv =: \frac{g_j^{(0)}(p_j)}{1 - p_j}. \]
Then, letting $\nabla g_j^{(d)}(p) := \partial g_j^{(d)}(p)/\partial p$, we observe that $\nabla g_j^{(1)}(p) = E[\epsilon^{(1)} \mid s = j, V_j = p]$ and $\nabla g_j^{(0)}(p) = -E[\epsilon^{(0)} \mid s = j, V_j = p]$. Hence, it follows that
\[ m_j^{(1)}(x, p) = x^\top \beta_j^{(1)} + \nabla g_j^{(1)}(p), \qquad m_j^{(0)}(x, p) = x^\top \beta_j^{(0)} - \nabla g_j^{(0)}(p). \]
Moreover,
\[ \psi_1(x, p) = \sum_{j=1}^{S} E[Y_j^{(1)} \mid X = x, s = j, V_j \le p_j] \cdot \pi_j p_j = \sum_{j=1}^{S} \left( (x \cdot \pi_j p_j)^\top \beta_j^{(1)} + \pi_j g_j^{(1)}(p_j) \right), \]
\[ \psi_0(x, p) = \sum_{j=1}^{S} E[Y_j^{(0)} \mid X = x, s = j, V_j > p_j] \cdot \pi_j (1 - p_j) = \sum_{j=1}^{S} \left( (x \cdot \pi_j (1 - p_j))^\top \beta_j^{(0)} + \pi_j g_j^{(0)}(p_j) \right). \]
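The defining relations between the $g_j^{(d)}$ functions and the conditional means of the errors can be verified directly on a toy example. The sketch below assumes hypothetical linear conditional means $E[\epsilon^{(1)} \mid s = j, V_j = v] = v - 1/2$ and $E[\epsilon^{(0)} \mid s = j, V_j = v] = 0.3 - 0.6v$ (both integrating to zero over $[0,1]$), writes $g_j^{(1)}$ and $g_j^{(0)}$ in closed form, and checks $\nabla g_j^{(1)}(p) = E[\epsilon^{(1)} \mid V_j = p]$ and $\nabla g_j^{(0)}(p) = -E[\epsilon^{(0)} \mid V_j = p]$ by finite differences:

```python
import numpy as np

# Hypothetical conditional means of the outcome errors given V_j = v.
mean_eps1 = lambda v: v - 0.5          # E[eps(1) | V_j = v]
mean_eps0 = lambda v: 0.3 - 0.6 * v    # E[eps(0) | V_j = v]

# g-functions as defined in the text (closed-form integrals of the lines above):
g1 = lambda p: 0.5 * p**2 - 0.5 * p    # integral_0^p (v - 1/2) dv
g0 = lambda p: 0.3 * p**2 - 0.3 * p    # integral_p^1 (0.3 - 0.6 v) dv

# Central differences recover the conditional means, with a sign flip for d = 0:
# grad g1(p) = E[eps(1)|V=p],  grad g0(p) = -E[eps(0)|V=p].
h = 1e-5
grid = np.linspace(0.1, 0.9, 9)
err1 = max(abs((g1(p + h) - g1(p - h)) / (2 * h) - mean_eps1(p)) for p in grid)
err0 = max(abs((g0(p + h) - g0(p - h)) / (2 * h) + mean_eps0(p)) for p in grid)
print(err1, err0)  # both numerically zero
```

The sign flip for $d = 0$ comes from differentiating an integral whose lower limit is $p$, which is why $m_j^{(0)}$ above subtracts $\nabla g_j^{(0)}(p)$.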
Hence, we obtain the following partially linear additive regression models:
\[ DY = \sum_{j=1}^{S} (X \cdot \pi_j P_j)^\top \beta_j^{(1)} + \sum_{j=1}^{S} \pi_j g_j^{(1)}(P_j) + \varepsilon^{(1)}, \tag{3.3} \]
\[ (1 - D)Y = \sum_{j=1}^{S} (X \cdot \pi_j (1 - P_j))^\top \beta_j^{(0)} + \sum_{j=1}^{S} \pi_j g_j^{(0)}(P_j) + \varepsilon^{(0)}, \tag{3.4} \]
where $E[\varepsilon^{(d)} \mid X, P] = 0$ by the definition of $\psi_d$ for both $d \in \{0, 1\}$. To estimate the coefficients $\beta_j^{(d)}$ and the functions $g_j^{(d)}$, we use the series (sieve) approximation method such that $g_j^{(d)}(p) \approx b_K(p)^\top \alpha_j^{(d)}$ with a $K \times 1$ vector of basis functions $b_K(p) = (b_{1K}(p), \ldots, b_{KK}(p))^\top$ and corresponding coefficients $\alpha_j^{(d)}$.

To proceed, letting $d_{SXK} := S(\dim(X) + K)$, define the $d_{SXK} \times 1$ vectors
\[ \theta^{(d)} := (\beta_1^{(d)\top}, \ldots, \beta_S^{(d)\top}, \alpha_1^{(d)\top}, \ldots, \alpha_S^{(d)\top})^\top \quad \text{for } d \in \{0, 1\}, \]
\[ R_K^{(1)} := (\pi_1 P_1 X^\top, \ldots, \pi_S P_S X^\top, \pi_1 b_K(P_1)^\top, \ldots, \pi_S b_K(P_S)^\top)^\top, \]
\[ R_K^{(0)} := (\pi_1 (1 - P_1) X^\top, \ldots, \pi_S (1 - P_S) X^\top, \pi_1 b_K(P_1)^\top, \ldots, \pi_S b_K(P_S)^\top)^\top. \]
Then, we can approximate the regression models (3.3) and (3.4), respectively, by
\[ DY \approx R_K^{(1)\top} \theta^{(1)} + \varepsilon^{(1)}, \qquad (1 - D)Y \approx R_K^{(0)\top} \theta^{(0)} + \varepsilon^{(0)}, \tag{3.5} \]
which implies that we can estimate $\theta^{(d)}$ by
\[ \tilde\theta_n^{(1)} := \left( \sum_{i=1}^{n} R_{i,K}^{(1)} R_{i,K}^{(1)\top} \right)^{-} \sum_{i=1}^{n} R_{i,K}^{(1)} D_i Y_i, \qquad \tilde\theta_n^{(0)} := \left( \sum_{i=1}^{n} R_{i,K}^{(0)} R_{i,K}^{(0)\top} \right)^{-} \sum_{i=1}^{n} R_{i,K}^{(0)} (1 - D_i) Y_i, \]
where $A^{-}$ is a generalized inverse of a matrix $A$. Note, however, that the $\tilde\theta_n^{(d)}$'s are infeasible since the $P_j$'s and $\pi_j$'s are unknown in practice. Then, define $\hat R_K^{(d)}$ analogously as above but replacing the $P_j$'s and $\pi_j$'s with their estimators obtained in the first step.
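The first-step estimates $(\hat\pi_j, \hat\gamma_{n,j})$ plugged in here can be computed with the EM algorithm mentioned above. The following is a minimal sketch for $S = 2$ with scalar group-specific instruments, a logistic $F_j$, and one Newton step per M-step; all simulation parameters, sample size, and starting values are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, S = 40_000, 2

# --- Simulate the finite-mixture choice model (3.1); all numbers hypothetical.
pi_true = np.array([0.4, 0.6])
gamma_true = np.array([1.5, -1.5])            # gamma_1, gamma_2 (scalar IVs)
zeta = rng.normal(size=(n, S))                # group-specific instruments
s = rng.choice([0, 1], size=n, p=pi_true)     # latent membership (0-indexed)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic CDF F_j
index = gamma_true[s] * zeta[np.arange(n), s]
D = (rng.uniform(size=n) <= sigmoid(index)).astype(float)

# --- EM: E-step posterior weights, M-step weighted logit fits via Newton.
pi_hat = np.array([0.5, 0.5])
gamma_hat = np.array([1.0, -1.0])             # crude initial values
for _ in range(200):
    # E-step: w[i, j] proportional to pi_j * L_i(gamma | s_i = j)
    p = sigmoid(gamma_hat * zeta)             # n x S choice probabilities
    lik = np.where(D[:, None] == 1.0, p, 1.0 - p)
    w = pi_hat * lik
    w /= w.sum(axis=1, keepdims=True)
    # M-step: update pi by average weights, gamma_j by one Newton step
    pi_hat = w.mean(axis=0)
    for j in range(S):
        pj = sigmoid(gamma_hat[j] * zeta[:, j])
        score = np.sum(w[:, j] * (D - pj) * zeta[:, j])
        hess = -np.sum(w[:, j] * pj * (1.0 - pj) * zeta[:, j] ** 2)
        gamma_hat[j] -= score / hess

print(pi_hat, gamma_hat)  # near (0.4, 0.6) and (1.5, -1.5)
```

The E-step computes posterior membership weights $w_{ij} \propto \pi_j L_i(\gamma \mid s_i = j)$, and the M-step updates $\pi$ by the average weights and each $\gamma_j$ by a weighted logit fit, mirroring the mixture log-likelihood displayed above.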
The feasible estimators can be obtained by
\[ \hat\theta_n^{(1)} := \left( \sum_{i=1}^{n} \hat R_{i,K}^{(1)} \hat R_{i,K}^{(1)\top} \right)^{-} \sum_{i=1}^{n} \hat R_{i,K}^{(1)} D_i Y_i, \qquad \hat\theta_n^{(0)} := \left( \sum_{i=1}^{n} \hat R_{i,K}^{(0)} \hat R_{i,K}^{(0)\top} \right)^{-} \sum_{i=1}^{n} \hat R_{i,K}^{(0)} (1 - D_i) Y_i. \]
The feasible estimator of $g_j^{(d)}(p)$ is given by $\hat g_j^{(d)}(p) = b_K(p)^\top \hat\alpha_{n,j}^{(d)}$, and the infeasible estimator is given by $\tilde g_j^{(d)}(p) = b_K(p)^\top \tilde\alpha_{n,j}^{(d)}$. Letting $\nabla b_K(p) := \partial b_K(p)/\partial p$, we can also estimate $\nabla g_j^{(d)}(p)$ by $\nabla \hat g_j^{(d)}(p) := \nabla b_K(p)^\top \hat\alpha_{n,j}^{(d)}$ and $\nabla \tilde g_j^{(d)}(p) := \nabla b_K(p)^\top \tilde\alpha_{n,j}^{(d)}$. Thus, we obtain the estimators of the MTR functions as follows:
\[ \tilde m_j^{(1)}(x, p) = x^\top \tilde\beta_{n,j}^{(1)} + \nabla \tilde g_j^{(1)}(p), \qquad \hat m_j^{(1)}(x, p) = x^\top \hat\beta_{n,j}^{(1)} + \nabla \hat g_j^{(1)}(p), \]
\[ \tilde m_j^{(0)}(x, p) = x^\top \tilde\beta_{n,j}^{(0)} - \nabla \tilde g_j^{(0)}(p), \qquad \hat m_j^{(0)}(x, p) = x^\top \hat\beta_{n,j}^{(0)} - \nabla \hat g_j^{(0)}(p). \]
Finally, the feasible estimator of the MTE is given by
$\widehat{\mathrm{MTE}}_j(x, p) = \hat m_j^{(1)}(x, p) - \hat m_j^{(0)}(x, p)$, and the infeasible estimator is given by $\widetilde{\mathrm{MTE}}_j(x, p) = \tilde m_j^{(1)}(x, p) - \tilde m_j^{(0)}(x, p)$.

3.2 Asymptotic properties

This section presents the asymptotic properties of the proposed MTE estimators for the model given by equations (3.1) and (3.2). In the following, for a vector or matrix $A$, we denote its Frobenius norm as $\|A\| = \sqrt{\mathrm{tr}\{A^\top A\}}$, where $\mathrm{tr}\{\cdot\}$ is the trace. For a square matrix $A$, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and the largest eigenvalues of $A$, respectively.

Assumption 3.1.
The data $\{(Y_i, D_i, X_i, Z_i) : 1 \le i \le n\}$ are IID.

Assumption 3.2. (i) $\mathrm{supp}[Z_j]$ is a compact subset of $\mathbb{R}^{\dim(Z_j)}$ for all $j$. (ii) The random variables $\epsilon_D = (\epsilon_{D1}, \ldots, \epsilon_{DS})$ are continuously distributed on the whole $\mathbb{R}^S$ and independent of $(X, Z)$. Each $\epsilon_{Dj}$ has a twice continuously differentiable known CDF $F_j$ with bounded derivatives. (iii) $\|\hat\gamma_n - \gamma\| = O_P(n^{-1/2})$ and $\|\hat\pi_n - \pi\| = O_P(n^{-1/2})$.

Assumption 3.3. (i) $\mathrm{supp}[X]$ is a compact subset of $\mathbb{R}^{\dim(X)}$. $X$ is independent of $(\epsilon^{(d)}, \epsilon_{Dj}, s)$ for all $d$ and $j$. (ii) The random variable $\epsilon^{(d)}$ is independent of $(X, Z)$ for all $d$.

The IID sampling condition in Assumption 3.1 is standard in the literature. For Assumption 3.2(iii), if the parameters $\gamma$ and $\pi$ are identifiable, the $\sqrt{n}$-consistency of the ML estimators is straightforward; see Appendix C.1 for a special case where the $\epsilon_{Dj}$'s are distributed as the standard normal. Note that Assumptions 3.2(i)-(iii) together imply $\max_{1 \le i \le n} |\hat P_{ji} - P_{ji}| = O_P(n^{-1/2})$ for all $j$.

Let $\nabla^a g_j(p) := \partial^a g_j(p)/(\partial p)^a$ and $\nabla^a b_K(p) := (\partial^a b_{1K}(p)/(\partial p)^a, \ldots, \partial^a b_{KK}(p)/(\partial p)^a)^\top$ for a non-negative integer $a$. Further, define $\Psi_K^{(d)} := E[R_K^{(d)} R_K^{(d)\top}]$ and $\Sigma_K^{(d)} := E[(\xi_K^{(d)})^2 R_K^{(d)} R_K^{(d)\top}]$, where $\xi_K^{(d)} := e^{(d)} + B_K^{(d)}$, and $e^{(d)}$ and $B_K^{(d)}$ are unobserved error terms in the series regressions whose definitions are given in (A.1) in Appendix A.

Assumption 3.4.
(i) For each $d$ and $j$, $g_j^{(d)}(p)$ is $r$-times continuously differentiable for some $r \ge 2$, and there exist positive constants $\mu_1$ and $\mu_2$ and some $\alpha_j^{(d)} \in \mathbb{R}^K$ satisfying $\sup_{p \in [0,1]} |b_K(p)^\top \alpha_j^{(d)} - g_j^{(d)}(p)| = O(K^{-\mu_1})$ and $\sup_{p \in [0,1]} |\nabla b_K(p)^\top \alpha_j^{(d)} - \nabla g_j^{(d)}(p)| = O(K^{-\mu_2})$.
(ii) $b_K(p)$ is twice continuously differentiable and satisfies $\zeta_0(K)\sqrt{(K \log K)/n} \to 0$, $\zeta_1(K)\sqrt{K/n} \to 0$, and $\zeta_2(K)/\sqrt{n} \to 0$, where $\zeta_l(K) := \max_{0 \le a \le l} \sup_{p \in [0,1]} \|\nabla^a b_K(p)\|$.

Assumption 3.5. (i) For $d \in \{0,1\}$, there exist constants $c_\Psi$ and $\bar c_\Psi$ such that $0 < c_\Psi \le \lambda_{\min}(\Psi_K^{(d)}) \le \lambda_{\max}(\Psi_K^{(d)}) \le \bar c_\Psi < \infty$, uniformly in $K$. (ii) For $d \in \{0,1\}$, there exist constants $c_\Sigma$ and $\bar c_\Sigma$ such that $0 < c_\Sigma \le \lambda_{\min}(\Sigma_K^{(d)}) \le \lambda_{\max}(\Sigma_K^{(d)}) \le \bar c_\Sigma < \infty$, uniformly in $K$.

Assumption 3.6. $E[(e^{(d)})^4 \mid D, X, Z, s = j] < \infty$ for all $d$ and $j$.

Assumption 3.4(i) imposes smoothness conditions on the function $g_j^{(d)}$. These conditions are standard in the literature on series approximation methods. For example, Lemma 2 in Holland (2017) shows that Assumption 3.4(i) is satisfied by B-splines of order $k$ with $k - 1 \ge r$ such that $\mu_1 = r$ and $\mu_2 = r - 1$. This assumption requires $K$ to increase to infinity for unbiased estimation, while Assumption 3.4(ii) requires that $K$ should not diverge too quickly. It is well known that $\zeta_l(K) = O(K^{1/2 + l})$ for B-splines (e.g., Newey, 1997). Thus, when B-spline basis functions are employed, Assumption 3.4(ii) is satisfied if $K^5/n \to 0$. Assumption 3.5 ensures that the matrices $\Psi_K^{(d)}$ and $\Sigma_K^{(d)}$ are positive definite for all $K$ so that their inverses exist.
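The approximation rates in Assumption 3.4(i) can be illustrated numerically. The sketch below uses a polynomial basis as a simple stand-in for the B-splines discussed above and fits a smooth, purely hypothetical target function by least squares; the sup-norm approximation error shrinks as $K$ grows:

```python
import numpy as np

# Smooth target standing in for g_j^{(d)}; the series bias should shrink
# as K grows (Assumption 3.4(i)). Polynomial basis is a hypothetical
# stand-in for B-splines.
g = lambda p: np.sin(3.0 * p) + p ** 2
grid = np.linspace(0.0, 1.0, 2001)

def sup_error(K):
    """Least-squares fit with K polynomial basis terms; sup-norm error."""
    B = np.vander(grid, K, increasing=True)   # b_K(p) = (1, p, ..., p^{K-1})
    coef, *_ = np.linalg.lstsq(B, g(grid), rcond=None)
    return np.max(np.abs(B @ coef - g(grid)))

errs = [sup_error(K) for K in (3, 5, 8)]
print(errs)  # decreasing: larger K, smaller approximation bias
```

This is only the bias side of the trade-off; Assumption 3.4(ii) limits how fast $K$ may grow relative to $n$ so that the variance of the series estimator stays under control.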
Assumption 3.6 is used to derive the asymptotic normality of our estimator in a convenient way.

Finally, we introduce the selection matrices $S_{X,j}$ (of size $\dim(X) \times d_{SXK}$) and $S_{K,j}$ (of size $K \times d_{SXK}$) such that $S_{X,j}\theta^{(d)} = \beta_j^{(d)}$ and $S_{K,j}\theta^{(d)} = \alpha_j^{(d)}$ for each $d$ and $j$. The next theorem gives the asymptotic normality for the infeasible estimator.

Theorem 3.1.
Suppose that Assumptions 3.1-3.6 hold. For a given $p \in \mathrm{supp}[P_j \mid D = 1] \cap \mathrm{supp}[P_j \mid D = 0]$, if $\|\nabla b_K(p)\| \to \infty$, $\sqrt{n} K^{-\mu_1} \to 0$, and $\sqrt{n} K^{-\mu_2}/\|\nabla b_K(p)\| \to 0$ hold, then
\[ \frac{\sqrt{n}\left( \widetilde{\mathrm{MTE}}_j(x, p) - \mathrm{MTE}_j(x, p) \right)}{\sqrt{\left[\sigma_{K,j}^{(1)}(p)\right]^2 + 2\,\mathrm{cov}_K(p) + \left[\sigma_{K,j}^{(0)}(p)\right]^2}} \xrightarrow{d} N(0, 1), \]
where
\[ \sigma_{K,j}^{(d)}(p) := \sqrt{\nabla b_K(p)^\top S_{K,j} \left[\Psi_K^{(d)}\right]^{-1} \Sigma_K^{(d)} \left[\Psi_K^{(d)}\right]^{-1} S_{K,j}^\top \nabla b_K(p)}, \quad \text{for } d \in \{0, 1\}, \]
\[ \mathrm{cov}_K(p) := \nabla b_K(p)^\top S_{K,j} \left[\Psi_K^{(0)}\right]^{-1} C_K \left[\Psi_K^{(1)}\right]^{-1} S_{K,j}^\top \nabla b_K(p), \quad \text{and} \quad C_K := E\left[\xi_K^{(0)} \xi_K^{(1)} R_K^{(0)} R_K^{(1)\top}\right]. \]

As shown in Lemma B.4, $\sigma_{K,j}^{(d)}(p)$ corresponds to the asymptotic standard deviation of the MTR estimator for $D = d$. Further, $\mathrm{cov}_K(p)$ is the asymptotic covariance between the MTR estimators for $D = 1$ and $D = 0$, which is supposed to be non-zero in our case. This non-zero covariance originates from replacing the unobserved membership indicator $\mathbf{1}\{s = j\}$ with the membership probability $\pi_j$.

As a corollary of Theorem 3.1, the asymptotic properties of the feasible MTE estimator can be derived relatively easily with the following additional assumption.

Assumption 3.7. $\sup_{p \in [0,1]} |\nabla b_K(p)^\top \alpha| = O(\sqrt{K}) \cdot \sup_{p \in [0,1]} |b_K(p)^\top \alpha|$ for any $\alpha \in \mathbb{R}^K$.

Chen and Christensen (2018) show that this assumption holds true for B-splines and wavelets.

To proceed, let $C(\mathcal{D})$ be the set of uniformly bounded continuous functions defined on $\mathcal{D}$. Further, let $T$ be a generic random vector where $\mathrm{supp}[T]$ is compact and $\dim(T)$ is finite.
We define the linear operator $\mathcal{P}_{n,j}^{(d)}$ that maps a given function $q \in C(\mathrm{supp}[T])$ to the sieve space defined by $b_K$ as follows:
\[ \mathcal{P}_{n,j}^{(d)} q := b_K(\cdot)^\top S_{K,j} \left[\Psi_{nK}^{(d)}\right]^{-1} \frac{1}{n} \sum_{i=1}^{n} R_{i,K}^{(d)} q(T_i), \]
where $\Psi_{nK}^{(d)} := n^{-1} \sum_{i=1}^{n} R_{i,K}^{(d)} R_{i,K}^{(d)\top}$. The functional form of $q$ may implicitly depend on $n$. The operator norm of $\mathcal{P}_{n,j}^{(d)}$ is defined as
\[ \|\mathcal{P}_{n,j}^{(d)}\|_\infty := \sup\left\{ \sup_{p \in [0,1]} \left| \left(\mathcal{P}_{n,j}^{(d)} q\right)(p) \right| : q \in C(\mathrm{supp}[T]),\ \sup_{t \in \mathrm{supp}[T]} |q(t)| = 1 \right\}, \]
which is typically of order $O_P(1)$ for splines and wavelets under some regularity conditions.

Corollary 3.1.
Suppose that the assumptions in Theorem 3.1 hold. If, in addition, Assumption 3.7, $\zeta(K)^2/\sqrt{n} = O(1)$, and $(\|\mathcal{P}^{(d)}_{n,j}\|_\infty + 1)\sqrt{K}/\|\nabla b_K(p)\| \to 0$ for both $d \in \{0,1\}$ hold, then
$$\frac{\sqrt{n}\left(\widehat{\mathrm{MTE}}_j(x,p) - \mathrm{MTE}_j(x,p)\right)}{\sqrt{\left[\sigma^{(1)}_{K,j}(p)\right]^2 + 2\,\mathrm{cov}_K(p) + \left[\sigma^{(0)}_{K,j}(p)\right]^2}} \xrightarrow{d} N(0,1).$$

As shown above, the feasible MTE estimator has the same asymptotic distribution as the infeasible estimator. (A set of easy-to-check conditions ensuring that $\|\mathcal{P}^{(d)}_{n,j}\|_\infty = O_P(1)$ is to verify the conditions in Lemma 7.1 of Chen and Christensen (2015) and to show that $S_{K,j}[\Psi^{(d)}_{nK}]^{-1}$ is stochastically bounded in the $\ell_\infty$ norm; see also Appendix B.3 of Hoshino and Yanagi (2020).) The standard errors of the MTE estimators can be computed straightforwardly by replacing the unknown terms in $\sigma^{(d)}_{K,j}(p)$ and $\mathrm{cov}_K(p)$ with their empirical counterparts:
$$\widehat\Psi^{(d)}_{nK} := \frac{1}{n}\sum_{i=1}^n \widehat R^{(d)}_{i,K}\widehat R^{(d)\top}_{i,K}, \quad \widehat\Sigma^{(d)}_{nK} := \frac{1}{n}\sum_{i=1}^n \left(\widehat\xi^{(d)}_{i,K}\right)^2\widehat R^{(d)}_{i,K}\widehat R^{(d)\top}_{i,K}, \quad \widehat C_{nK} := \frac{1}{n}\sum_{i=1}^n \widehat\xi^{(0)}_{i,K}\widehat\xi^{(1)}_{i,K}\widehat R^{(0)}_{i,K}\widehat R^{(1)\top}_{i,K},$$
where $\widehat\xi^{(1)}_{i,K} := D_iY_i - \widehat R^{(1)\top}_{i,K}\widehat\theta^{(1)}_n$ and $\widehat\xi^{(0)}_{i,K} := (1-D_i)Y_i - \widehat R^{(0)\top}_{i,K}\widehat\theta^{(0)}_n$.

We consider relaxing Assumption 2.2 by allowing the membership variable $s$ to depend on both $\epsilon_{Dj}$ and a vector of covariates $W$. For simplicity, we focus on the case of $S = 2$ only. Suppose that $s$ is determined by the following model:
$$s = 2 - 1\{\pi(W) \ge U\}, \qquad (4.1)$$
where $U$ is an unobserved continuous random variable distributed as $\mathrm{Uniform}[0,1]$ conditional on $X$, and $\pi$ is an unknown function taking values in $[0,1]$. We assume that $W$ is conditionally independent of $U$ given $X$. Then, $\Pr(s = 1 \mid X, W) = \pi(W)$.
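As a quick sanity check, the membership rule (4.1) can be simulated directly; the logistic form of $\pi(W)$ below is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
W = rng.normal(size=n)
pi_w = 1.0 / (1.0 + np.exp(-W))          # assumed logistic form for pi(W)
U = rng.uniform(size=n)
s = 2 - (pi_w >= U).astype(int)          # membership rule (4.1): s = 1 iff U <= pi(W)

# near W = 0 we have pi(W) ~ 0.5, so the local share of group 1 should be ~ 0.5
near_zero = np.abs(W) < 0.05
share_group1 = (s[near_zero] == 1).mean()
```

The local share of group-1 members recovers $\pi(W)$, in line with $\Pr(s = 1 \mid X, W) = \pi(W)$.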
In addition, we assume that $W$ contains at least one continuous variable that is not included in $(X, Z_1, Z_2)$. In this setup, we re-define the MTE parameter and the MTR function, respectively, as follows:
$$\mathrm{MTE}_j(x,q,p) := m^{(1)}_j(x,q,p) - m^{(0)}_j(x,q,p), \qquad m^{(d)}_j(x,q,p) := E[Y^{(d)}_j \mid X = x, U = q, V_j = p]. \qquad (4.2)$$
Further, letting $Q := \pi(W)$, we define
$$\psi_1(x,q,p_1,p_2) := E[DY \mid X = x, Q = q, P_1 = p_1, P_2 = p_2],$$
$$\psi_0(x,q,p_1,p_2) := E[(1-D)Y \mid X = x, Q = q, P_1 = p_1, P_2 = p_2],$$
$$\rho_{dj}(x,q,p_1,p_2) := \Pr(D = d, s = j \mid X = x, Q = q, P_1 = p_1, P_2 = p_2)$$
for $d \in \{0,1\}$ and $j \in \{1,2\}$. For identification of these functions, we first need to establish the identification of all model parameters in (4.1). In Appendix C.1.2, we present a supplementary identification result for an endogenous finite-mixture treatment choice model where the error terms are assumed to be jointly normal. For notational simplicity, we denote the cross-partial derivative with respect to $(q, p_j)$ by $\nabla_{qp_j}$; for instance, $\nabla_{qp_1}\psi_1(x,q,p_1,p_2) = \partial^2\psi_1(x,q,p_1,p_2)/(\partial q\,\partial p_1)$.

Assumption 4.1. (i) $(W, Z_1, Z_2)$ are independent of $(\epsilon^{(d)}, \epsilon_{Dj}, U)$ given $X$ for all $d$ and $j$. (ii) Each of $W$, $Z_1$, and $Z_2$ contains at least one continuous variable that is not included in $X$ and the rest. (iii) For each $j$, $(U, V_j)$ is continuously distributed on $[0,1]^2$ with conditional density $f_{UV_j}(\cdot,\cdot \mid X)$ given $X$.

Theorem 4.1.
Suppose that Assumption 4.1 holds. If $m^{(d)}_j(x,\cdot,\cdot)$ and $f_{UV_j}(\cdot,\cdot \mid X = x)$ are continuous, we have
$$m^{(d)}_j(x,q,p_j) = \frac{\nabla_{qp_j}\psi_d(x,q,p_1,p_2)}{\nabla_{qp_j}\rho_{dj}(x,q,p_1,p_2)}, \qquad f_{UV_j}(q,p_j \mid X = x) = (-1)^{d+j}\,\nabla_{qp_j}\rho_{dj}(x,q,p_1,p_2).$$

The following theorem shows that the group-wise MTE parameter $\mathrm{MTE}_j(x,p)$ defined in (2.3) can be recovered as a weighted average of $\mathrm{MTE}_j(x,q,p)$ in (4.2).

Theorem 4.2.
Suppose that Assumption 4.1 holds. Then, we have
$$\mathrm{MTE}_j(x,p) = \int_0^1 \mathrm{MTE}_j(x,u,p)\,\omega_j(x,u,p)\,\mathrm{d}u \quad \text{for } j \in \{1,2\},$$
where
$$\omega_1(x,u,p) := \frac{\Pr(u \le Q \mid X = x)\,f_{UV_1}(u,p \mid X = x)}{\int_0^1 \Pr(u' \le Q \mid X = x)\,f_{UV_1}(u',p \mid X = x)\,\mathrm{d}u'},$$
$$\omega_2(x,u,p) := \frac{\Pr(u > Q \mid X = x)\,f_{UV_2}(u,p \mid X = x)}{\int_0^1 \Pr(u' > Q \mid X = x)\,f_{UV_2}(u',p \mid X = x)\,\mathrm{d}u'}.$$

In some empirical situations, only discrete IVs are available, and many of them are binary. In this subsection, we focus on a situation where only binary IVs are available and show that some LATE parameters are still identifiable by the Wald estimand. For expositional simplicity, the conditioning on $X = x$ is suppressed throughout this subsection. Let $Z$ be an $S \times 1$ vector of binary instruments, $Z = (Z_1, \ldots, Z_S) \in \mathcal{Z}$, where $\mathcal{Z} := \mathrm{supp}[Z]$. $Z$ may contain some overlapping elements. In an extreme case, when there is only one instrument common to all groups, we have $\mathcal{Z} = \{\mathbf{0}_S, \mathbf{1}_S\}$, where $\mathbf{0}_S$ and $\mathbf{1}_S$ are $S \times 1$ vectors of zeros and ones, respectively. Let $Y^{(d,z)}_j$ be the potential outcome when $s = j$, $D = d$, and $Z = z$. Similarly, we denote the potential treatment status when $s = j$ and $Z = z$ as $D^{(z)}_j$.

Assumption 4.2. (i) For each $d$ and $j$, $Y^{(d,z)}_j = Y^{(d)}_j$ and $D^{(z)}_j = D^{(z_j)}_j$ for any $z = (z_1, \ldots, z_S) \in \mathcal{Z}$. (ii) $Z$ is independent of $(D^{(0)}_j, D^{(1)}_j, Y^{(0)}_j, Y^{(1)}_j, s)$ for all $j$. (iii) $\Pr(D^{(1)}_j = 1, D^{(0)}_j = 0 \mid s = j) > 0$ and $\Pr(D^{(1)}_j = 0, D^{(0)}_j = 1 \mid s = j) = 0$ for all $j$.

Assumption 4.2(i) can be violated when the instruments affect the outcome directly or when the group-$j'$-specific instrument $Z_{j'}$ affects the treatment status of individuals in group $j$ ($j \ne j'$). Note that, under this condition, it holds that $D = D^{(z_j)}_j$ conditional on $s = j$ and $Z = z$. Assumption 4.2(ii) requires the randomness of the instrument $Z$, which is essentially the same as Assumption 2.1(i).
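Stepping back to Theorem 4.1, its ratio identification can be checked by finite differences on a toy design in which $(U, V_1)$ is uniform on $[0,1]^2$ (so that $\nabla_{qp_1}\rho_{11} = 1$) and the MTR is taken to be $m^{(1)}_1(x,u,v) = u + 2v$; both choices are illustrative assumptions:

```python
def psi1(q, p1):
    # closed form of the (q, p1)-dependent part of psi_1 for the toy model with
    # m_1(u, v) = u + 2*v and f_{UV_1} = 1; the group-2 term does not involve p1
    # and therefore drops out of the cross-partial, so it is omitted here
    return 0.5 * q**2 * p1 + q * p1**2

def cross_partial(f, q, p1, h=1e-4):
    # central finite-difference approximation of d^2 f / (dq dp1)
    return (f(q + h, p1 + h) - f(q + h, p1 - h)
            - f(q - h, p1 + h) + f(q - h, p1 - h)) / (4.0 * h * h)

q, p1 = 0.4, 0.6
m1_hat = cross_partial(psi1, q, p1)   # Theorem 4.1's ratio, with denominator equal to 1
```

Since the denominator $\nabla_{qp_1}\rho_{11}$ equals one in this design, the cross-partial of $\psi_1$ should return $m^{(1)}_1(q,p_1) = q + 2p_1$ directly.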
Assumption 4.2(iii) is similar to the monotonicity condition in Imbens and Angrist (1994). Following Angrist et al. (1996), individuals with $D^{(1)}_j = D^{(0)}_j = 1$ can be referred to as $Z_j$-always-takers; those with $D^{(1)}_j = D^{(0)}_j = 0$ are $Z_j$-never-takers; those with $D^{(1)}_j > D^{(0)}_j$ are $Z_j$-compliers; and those with $D^{(1)}_j < D^{(0)}_j$ are $Z_j$-defiers. Hence, condition (iii) ensures that, for all $j$, there are $Z_j$-compliers but no $Z_j$-defiers in group $j$. When some IVs are common across groups, this monotonicity condition might be violated, as mentioned in Remark 2.1.

Theorem 4.3.
Suppose that Assumption 4.2 holds. Then, the weighted average of the group-wise LATEs can be identified as follows:
$$\frac{E[Y \mid Z = \mathbf{1}_S] - E[Y \mid Z = \mathbf{0}_S]}{E[D \mid Z = \mathbf{1}_S] - E[D \mid Z = \mathbf{0}_S]} = \sum_{j=1}^S E[Y^{(1)}_j - Y^{(0)}_j \mid Z_j\text{-compliers}, s = j]\,\frac{\Pr(Z_j\text{-compliers}, s = j)}{\sum_{h=1}^S \Pr(Z_h\text{-compliers}, s = h)}.$$
Moreover, if $e_j \in \mathcal{Z}$, where $e_j$ is the $S \times 1$ unit vector whose $j$-th element equals one, we can identify the LATE specific to group $j$:
$$\frac{E[Y \mid Z = e_j] - E[Y \mid Z = \mathbf{0}_S]}{E[D \mid Z = e_j] - E[D \mid Z = \mathbf{0}_S]} = E[Y^{(1)}_j - Y^{(0)}_j \mid Z_j\text{-compliers}, s = j].$$

Numerical Analysis
To evaluate the finite-sample performance of our MTE estimators, we conduct a set of Monte Carlo experiments. Setting $S = 2$ with the membership probabilities $(\pi_1, \pi_2) = (0.\cdot, 0.\cdot)$, the treatment variable is generated by $D = 1\{Z_j^\top \gamma_j \ge \epsilon_{Dj}\}$ for $s = j$, where $Z_j = (1, X_1, \zeta_j)^\top$, $X_1 \sim N(0,1)$, $\zeta_j \sim N(0,1)$, and $\epsilon_{Dj} \sim N(0,1)$ for both $j \in \{1,2\}$. We set $\gamma_1 = (\gamma_{10}, \gamma_{11}, \gamma_{12})^\top = (0, -\cdot, \cdot)^\top$ and $\gamma_2 = (\gamma_{20}, \gamma_{21}, \gamma_{22})^\top = (0, \cdot, -\cdot)^\top$. The potential outcomes are generated by $Y^{(d)}_j = X^\top \beta^{(d)}_j + \epsilon^{(d)}$, where $X = (1, X_1)^\top$, $\epsilon^{(0)} = \sum_{j \in \{1,2\}} 1\{s = j\}V_j + \eta^{(0)}$, $\epsilon^{(1)} = \sum_{j \in \{1,2\}} 1\{s = j\}V_j + \eta^{(1)}$, and $\eta^{(d)} \sim N(0, \cdot)$ for both $d \in \{0,1\}$. Here, $V_j = \Phi(\epsilon_{Dj})$, and $\Phi$ denotes the standard normal CDF. The coefficients are set to $\beta^{(0)}_1 = (-\cdot, \cdot)^\top$, $\beta^{(0)}_2 = (1, \cdot)^\top$, $\beta^{(1)}_1 = (1, -\cdot)^\top$, and $\beta^{(1)}_2 = (2, \cdot)^\top$. For each setup, we consider two sample sizes, $n \in \{1000, 4000\}$.

For the first-stage estimation of the finite mixture Probit model, we use the EM algorithm. The second-stage MTE estimation is carried out using both the infeasible and feasible estimators. We employ third-order B-splines for the basis functions. The number of inner knots of the B-splines, say $\widetilde{K}$, is set to $\widetilde{K} = 1$ when $n = 1000$ and $\widetilde{K} \in \{1, 2\}$ when $n = 4000$. To stabilize the series regression, using ridge regression with penalty $n^{-\cdot}$ is also considered for comparison. Note that introducing a sufficiently small penalty term does not alter our asymptotic results. The simulation results reported below are based on 1,000 Monte Carlo replications.

Table 1 shows the bias and root mean squared error (RMSE) of estimating the group-wise MTE for both groups with $x = 0.\cdot$ and $v \in \{\cdot, \cdot, \cdot, \cdot\}$ (labeled MTE1.1, MTE1.2, and so on for group 1, and similarly for group 2).
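For intuition, the first-stage EM logic can be sketched in a stripped-down form in which the group-wise probit coefficients are held fixed at assumed values and only the mixing weight $\pi_1$ is updated; the actual first stage estimates all parameters jointly, so this is a simplified illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 20_000
X1 = rng.normal(size=n)
g1 = 0.0 - 1.0 * X1                      # group-1 probit index (assumed coefficients)
g2 = 0.5 + 1.0 * X1                      # group-2 probit index (assumed coefficients)
s1 = rng.uniform(size=n) < 0.6           # true mixing weight pi_1 = 0.6
D = (rng.normal(size=n) < np.where(s1, g1, g2)).astype(int)

def probit_lik(g):
    # probit likelihood contribution of D given the latent index g
    p = norm.cdf(g)
    return np.where(D == 1, p, 1.0 - p)

L1, L2 = probit_lik(g1), probit_lik(g2)
pi = 0.5                                 # initial mixing weight
for _ in range(200):
    w = pi * L1 / (pi * L1 + (1.0 - pi) * L2)   # E-step: posterior P(s = 1 | D, X)
    pi = w.mean()                                # M-step: update the mixing weight
```

The posterior weights `w` play the role of the membership probabilities used to replace the unobserved indicator $1\{s = j\}$ in the second stage.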
Overall, the performance of the feasible estimator is almost the same as that of the infeasible estimator. This is consistent with our asymptotic theory. The bias of our estimator is satisfactorily small except for the cases close to the boundary. As expected, the RMSE quickly decreases as the sample size increases, as long as the number of basis terms is unchanged. Although in theory we need to employ a larger number of basis terms as the sample size increases, using $\widetilde{K} = 2$ seems too flexible, and the increase in the variance is problematic even for $n = 4000$ (of course, this result is more or less specific to our choice of the functional form for the MTE). The RMSE for group 2 tends to be smaller than that for group 1, probably owing to the difference in their group sizes. Introducing a penalty term can improve the overall RMSE; hence, we recommend employing ridge regression in practice with moderate sample sizes.

Table 2 presents the simulation results for the ML estimation of the finite mixture Probit model. For all estimated parameters, the bias is satisfactorily small under all sample sizes. The RMSE is approximately halved when the sample size increases from 1,000 to 4,000, implying the $\sqrt{n}$-consistency of our ML estimator.

In this empirical analysis, we investigate the effects of college education on income in the Japanese labor market. We use two data sources. The primary data source is the Japanese Life Course Panel Survey 2007 (wave 1), which includes detailed information on Japanese workers aged 20 to 40, including their working conditions, annual income, education level, and family member characteristics. The outcome variable

Acknowledgement: The data for this secondary analysis, "Japanese Life Course Panel Surveys, wave 1, 2007, of the Institute of Social Science, The University of Tokyo," was provided by the Social Science Japan Data Archive, Center for Social Research and Data Archives, Institute of Social Science, The University of Tokyo.
Table 1: Bias and RMSE of the group-wise MTE estimators
The table reports, for each design ($n \in \{1000, 4000\}$, number of inner knots $\widetilde{K}$, ridge on/off), the bias and RMSE of the feasible and infeasible estimators at the four evaluation points for group 1 (MTE1.1–MTE1.4) and group 2 (MTE2.1–MTE2.4).
Note: The column labeled “ridge” indicates whether the ridge regression is used (1 for “yes” and 0 for “no”).
Table 2: Simulation results of the ML estimation of the finite mixture Probit model
The table reports the bias and RMSE of the estimates of the coefficients $\gamma_1$ and $\gamma_2$ and the membership probability $\pi_1$, for $n = 1{,}000$ and $n = 4{,}000$.

of interest is the respondent’s annual income, and the treatment variable is defined as follows: $D = 1$ if the respondent has a college degree or higher, and $D = 0$ otherwise. The second data source is the School Basic Survey conducted by the Ministry of Education, Japan. Using this dataset, we collect information on regional university enrollment statistics for each prefecture where the respondents were living at the age of 15 to create IVs for the treatment. We consider 10 IVs in total. Table 3 below shows the list of variables used in this analysis. After excluding observations with missing data, the analysis is performed on 3,347 individuals.

Table 3: Variables used
Y: Annual income (in million JPY).
D: Dummy variable: the respondent has a college degree or higher.
X: Dummy variable: the respondent is currently living in an urban area.
Dummy variable: the respondent is male.
Age.*
Working experience in years for the current job.*
Dummy variable: the respondent is a part-time worker.
Log of average working hours per day.
Dummy variable: the respondent has professional skills.
Dummy variable: the respondent is a public worker.
Dummy variable: the respondent is in a managerial position.
Dummy variable: the respondent is working in a large company.
Dummy variable: the respondent is married.
Dummy variable: the respondent’s partner (if any) is a part-time worker.
Self-reported health status (in five scales).
Z (with shorthand):
Dummy variable: the respondent is male. male
The number of elder siblings. nesibs
The number of younger siblings. nysibs
Log of the number of books in the respondent’s home when he/she was 15 years old. books
The respondent’s father’s education in years. feduc
The respondent’s mother’s education in years. meduc
Economic condition of the respondent’s household when he/she was 15 years old (in five scales). econom
The proportion of private universities.** priv
Capacity: (the number of universities)/(the number of high-school graduates).** cap
The rate of university enrollment for high school graduates.** runiv
* The squares of these variables are also included in the regressors.
** These variables are all at the prefecture level, as of when the respondent was 15 years old.
In this analysis, we set $S = 2$. As shown below in Table 4, our treatment choice model has a smaller Akaike information criterion (AIC) value than the model with $S = 1$ (no mixture). For the models with $S = 3$ or larger, the EM algorithm did not converge within tolerable parameter values. Our final treatment choice model was determined in the following manner. Assuming that there is at least one IV specific to each group, we can consider $\binom{10}{2} = 45$ potential pairs of group-specific IVs in total. We first estimated all 45 (full) models in which the IVs not used as group-specific IVs are included in both groups. Then, based on the value of the log-likelihood, we identified two group-specific IVs: namely, nysibs for group 1 and runiv for group 2.

How to determine the optimal number of mixture components in a mixture model is a long-standing issue in statistics. A conventional approach is to use information criteria, such as the AIC and the Bayesian information criterion (BIC). A more formal approach is to statistically test the number of mixture components using likelihood-ratio-type tests (see, e.g., Chen et al., 2001; Zhu and Zhang, 2004). Investigating the applicability of these tests to our situation is outside the scope of this paper but is an important issue for future research.
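The enumeration of candidate group-specific IV pairs and the information-criterion comparison can be sketched as follows; the log-likelihood values and parameter count below are placeholders for illustration, not estimates from the data:

```python
from itertools import combinations
from math import comb

ivs = ["male", "nesibs", "nysibs", "books", "feduc",
       "meduc", "econom", "priv", "cap", "runiv"]
pairs = list(combinations(ivs, 2))        # candidate pairs of group-specific IVs

# hypothetical log-likelihoods from fitting each full model; in practice these
# would come from the EM estimation of each of the 45 candidate specifications
loglik = {pair: -1000.0 - i for i, pair in enumerate(pairs)}
best = max(pairs, key=loglik.get)          # pair with the largest log-likelihood
k_params = 14                              # illustrative parameter count
aic = 2 * k_params - 2 * loglik[best]      # AIC = 2k - 2*loglik
```

With 10 candidate IVs, `combinations` yields exactly $\binom{10}{2} = 45$ pairs, matching the count in the text.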
Table 4: ML estimates of the treatment choice models
Columns: mixture model ($S = 2$), group 1 and group 2, and the standard Probit model; each cell reports the estimate and its t-value. Rows: $\pi_1$, male, nesibs, nysibs, books, feduc, meduc, econom, priv, cap, runiv. For nesibs, the estimates (t-values) are $-0.200$ ($-3.962$) in group 1 and $0.435$ ($1.253$) in group 2 of the mixture model, and $-0.119$ ($-3.548$) in the standard Probit model; for nysibs, $-0.132$ ($-2.965$) in group 1 and $-0.103$ ($-2.953$) in the standard Probit model.

Following the suggestion from the Monte Carlo results, we employ the third-order B-splines with $\widetilde{K} = 1$ and use the ridge regression with the penalty equal to $n^{-\cdot}$ for the estimation of the MTEs. Figure 1 shows our main results. We find that the effect of a college degree on wage is not significant for those who belong to group 1 and have a higher unobserved cost of going on to higher education. Recall that, for the members of group 1, family characteristics, such as parental education level and economic condition, are the main factors affecting the college enrollment status, and we can expect that their personal willingness toward higher education is somewhat heterogeneous within the group. The downward-sloping shape of the MTE for group 1 might be due to this heterogeneity. In contrast, for the members of group 2 (those who are less influenced by their family characteristics), the MTE curve is relatively flat at $\mathrm{MTE} \approx \cdot$ million JPY.

Figure 1: Estimated MTEs
Note: In each panel, the solid line indicates the point estimate of the MTE evaluated at the sample mean of $X$, and the grayed area corresponds to the 95% confidence interval.

This paper considered the identification and estimation of MTEs when the data are composed of a mixture of latent groups. We developed a general treatment effect model with unobserved group heterogeneity by extending Rubin's causal model to finite mixture models. We proved that the MTE for each latent group can be separately identified under the availability of group-specific continuous IVs. Based on our constructive identification result, we proposed a two-step semiparametric procedure for estimating the group-wise MTEs and established its asymptotic properties. The results of the Monte Carlo simulations show that our estimators perform well in finite samples. An empirical application to the estimation of economic returns to college education indicates the usefulness of the proposed model.

Several open issues and promising extensions of the proposed approach are as follows. First, it would be interesting to study the identification of treatment effects when only IVs common to all groups are available. Second, it might be worthwhile to consider the estimation of the MTE (beyond the LATE) when all group-specific IVs are discrete (cf. Brinch et al., 2017). Lastly, based on our finite-mixture framework, we could construct a relative risk measure for alternative treatments. For example, even when the global ATE of a treatment is weakly positive, it is possible that the treatment is actually harmful for most people but significantly beneficial for only a small subset. Then, we can conclude that such a treatment is risky. In addition, following the approach in Subsection 4.1, we can predict to which group each individual is likely to belong. By incorporating this information into the framework developed, for example, by Kitagawa and Tetenov (2018), we might be able to propose a new way of constructing optimal treatment assignment rules. These topics are left for future work.

Appendix: Proofs of Theorems
This appendix collects the proofs of the theorems. The lemmas used to prove Theorem 3.1 and Corollary 3.1are relegated to Appendix B. Below, we denote the conditional density of a random variable T as f T ( ·|· ) . A.1 Proof of Theorem 2.1
We provide the proof for $m^{(1)}_j(x,p)$ only, as the proof for $m^{(0)}_j(x,p)$ is analogous. First, observe that
$$\psi_1(x,p) = \sum_{j=1}^S E\left[1\{s = j\}DY^{(1)}_j \,\middle|\, X = x, P = p\right] = \sum_{j=1}^S E[Y^{(1)}_j \mid X = x, P = p, s = j, D = 1]\cdot\Pr(s = j, D = 1 \mid X = x, P = p) = \sum_{j=1}^S E[Y^{(1)}_j \mid X = x, s = j, V_j \le p_j]\cdot\pi_j p_j,$$
under Assumptions 2.1(i) and 2.2. Further, it holds that
$$E[Y^{(1)}_j \mid X = x, s = j, V_j \le p_j] = \int_0^{p_j} E[Y^{(1)}_j \mid X = x, s = j, V_j = v]\,\frac{f_{V_j}(v \mid X = x, s = j)}{\Pr(V_j \le p_j \mid X = x, s = j)}\,\mathrm{d}v = \frac{1}{p_j}\int_0^{p_j} m^{(1)}_j(x,v)\,\mathrm{d}v,$$
by Assumption 2.2(i). Therefore, $\psi_1(x,p) = \sum_{j=1}^S \pi_j \int_0^{p_j} m^{(1)}_j(x,v)\,\mathrm{d}v$, and the Leibniz integral rule leads to $\partial\psi_1(x,p)/\partial p_j = \pi_j \cdot m^{(1)}_j(x,p_j)$. This completes the proof.

Here, we introduce additional notation for the subsequent discussion. Let $\delta^{(1)}_j := 1\{s = j, V_j \le P_j\} = D \cdot 1\{s = j\}$ and $\delta^{(0)}_j := 1\{s = j, V_j > P_j\} = (1-D)\cdot 1\{s = j\}$. Note that $D = \sum_{j=1}^S \delta^{(1)}_j$ and $1 - D = \sum_{j=1}^S \delta^{(0)}_j$. By (3.2), we can write
$$DY = \sum_{j=1}^S \delta^{(1)}_j X^\top\beta^{(1)}_j + D\epsilon^{(1)} = \sum_{j=1}^S \delta^{(1)}_j\left(X^\top\beta^{(1)}_j + g^{(1)}_j(P_j)/P_j\right) + e^{(1)} = \underbrace{\sum_{j=1}^S \delta^{(1)}_j\left(X^\top\beta^{(1)}_j + b_K(P_j)^\top\alpha^{(1)}_j/P_j\right)}_{=:\,T^{(1)\top}_K\theta^{(1)}} + r^{(1)}_K + e^{(1)} = R^{(1)\top}_K\theta^{(1)} + r^{(1)}_K + \underbrace{B^{(1)}_K + e^{(1)}}_{=:\,\xi^{(1)}_K},$$
where
$$r^{(1)}_K := \sum_{j=1}^S \delta^{(1)}_j\left[g^{(1)}_j(P_j) - b_K(P_j)^\top\alpha^{(1)}_j\right]/P_j, \qquad B^{(1)}_K := \left(T^{(1)}_K - R^{(1)}_K\right)^\top\theta^{(1)} = \sum_{j=1}^S\left(\delta^{(1)}_j - \pi_j P_j\right)\left(X^\top\beta^{(1)}_j + b_K(P_j)^\top\alpha^{(1)}_j/P_j\right),$$
$$e^{(1)} := D\epsilon^{(1)} - \sum_{j=1}^S \delta^{(1)}_j g^{(1)}_j(P_j)/P_j = \sum_{j=1}^S \delta^{(1)}_j\left[\epsilon^{(1)} - g^{(1)}_j(P_j)/P_j\right]. \qquad (A.1)$$

Let $\mathbf{Y}^{(1)} = (D_1Y_1, \ldots, D_nY_n)^\top$, $\mathbf{R}^{(1)}_K = (R^{(1)}_{1,K}, \ldots, R^{(1)}_{n,K})^\top$, $\mathbf{r}^{(1)}_K = (r^{(1)}_{1,K}, \ldots, r^{(1)}_{n,K})^\top$, $\mathbf{B}^{(1)}_K = (B^{(1)}_{1,K}, \ldots, B^{(1)}_{n,K})^\top$, $\mathbf{e}^{(1)} = (e^{(1)}_1, \ldots, e^{(1)}_n)^\top$, and $\boldsymbol{\xi}^{(1)}_K = (\xi^{(1)}_{1,K}, \ldots, \xi^{(1)}_{n,K})^\top$. The infeasible estimator for $\theta^{(1)}$ can be written as
$$\widetilde\theta^{(1)}_n = \left(\mathbf{R}^{(1)\top}_K\mathbf{R}^{(1)}_K\right)^{-1}\mathbf{R}^{(1)\top}_K\mathbf{Y}^{(1)} = \theta^{(1)} + \left[\Psi^{(1)}_{nK}\right]^{-1}\mathbf{R}^{(1)\top}_K\mathbf{r}^{(1)}_K/n + \left[\Psi^{(1)}_{nK}\right]^{-1}\mathbf{R}^{(1)\top}_K\boldsymbol{\xi}^{(1)}_K/n.$$
Similarly, noting that $DY = \widehat R^{(1)\top}_K\theta^{(1)} + \widehat\Delta^{(1)}_K + r^{(1)}_K + \xi^{(1)}_K$ with $\widehat\Delta^{(1)}_K := (R^{(1)}_K - \widehat R^{(1)}_K)^\top\theta^{(1)}$, the feasible estimator $\widehat\theta^{(1)}_n$ can be written as
$$\widehat\theta^{(1)}_n = \theta^{(1)} + \left[\widehat\Psi^{(1)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(1)\top}_K\widehat{\boldsymbol{\Delta}}^{(1)}_K/n + \left[\widehat\Psi^{(1)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(1)\top}_K\mathbf{r}^{(1)}_K/n + \left[\widehat\Psi^{(1)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(1)\top}_K\boldsymbol{\xi}^{(1)}_K/n,$$
where $\widehat{\mathbf{R}}^{(1)}_K = (\widehat R^{(1)}_{1,K}, \ldots, \widehat R^{(1)}_{n,K})^\top$ and $\widehat{\boldsymbol{\Delta}}^{(1)}_K = (\widehat\Delta^{(1)}_{1,K}, \ldots, \widehat\Delta^{(1)}_{n,K})^\top$. In the same manner, for the estimators of $\theta^{(0)}$, we have
$$\widetilde\theta^{(0)}_n - \theta^{(0)} = \left[\Psi^{(0)}_{nK}\right]^{-1}\mathbf{R}^{(0)\top}_K\mathbf{r}^{(0)}_K/n + \left[\Psi^{(0)}_{nK}\right]^{-1}\mathbf{R}^{(0)\top}_K\boldsymbol{\xi}^{(0)}_K/n,$$
$$\widehat\theta^{(0)}_n - \theta^{(0)} = \left[\widehat\Psi^{(0)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(0)\top}_K\widehat{\boldsymbol{\Delta}}^{(0)}_K/n + \left[\widehat\Psi^{(0)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(0)\top}_K\mathbf{r}^{(0)}_K/n + \left[\widehat\Psi^{(0)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(0)\top}_K\boldsymbol{\xi}^{(0)}_K/n,$$
where the definitions of the newly introduced variables should be clear from the context.

A.2 Proof of Theorem 3.1
By (B.8) and (B.9), we observe that
$$\sqrt{n}\left(\widetilde{\mathrm{MTE}}_j(x,p) - \mathrm{MTE}_j(x,p)\right) = \sqrt{n}\left(\widetilde m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\right) - \sqrt{n}\left(\widetilde m^{(0)}_j(x,p) - m^{(0)}_j(x,p)\right) = \nabla b_K(p)^\top S_{K,j}\left[\Psi^{(1)}_K\right]^{-1}\mathbf{R}^{(1)\top}_K\boldsymbol{\xi}^{(1)}_K/\sqrt{n} + \nabla b_K(p)^\top S_{K,j}\left[\Psi^{(0)}_K\right]^{-1}\mathbf{R}^{(0)\top}_K\boldsymbol{\xi}^{(0)}_K/\sqrt{n} + o_P(\|\nabla b_K(p)\|).$$
Thus, as shown in Lemma B.4(i), the term on the left-hand side is approximated by the sum of two asymptotically normal random variables with mean zero. Note that, unlike standard treatment effect estimators, the covariance of these two terms is not zero:
$$\frac{1}{n}\sum_{l=1}^n\sum_{m=1}^n E\left[R^{(0)}_{l,K}R^{(1)\top}_{m,K}\xi^{(0)}_{l,K}\xi^{(1)}_{m,K}\,\middle|\,\{X_i, Z_i\}_{i=1}^n\right] = \frac{1}{n}\sum_{l=1}^n R^{(0)}_{l,K}R^{(1)\top}_{l,K}E\left[\xi^{(0)}_{l,K}\xi^{(1)}_{l,K}\,\middle|\,\{X_i, Z_i\}_{i=1}^n\right],$$
by $E[\xi^{(d)}_K \mid X, Z] = 0$ for both $d \in \{0,1\}$ and Assumption 3.1, and
$$E\left[\xi^{(0)}_K\xi^{(1)}_K \,\middle|\, X, Z\right] = E\left[\left(T^{(0)\top}_K\theta^{(0)} - R^{(0)\top}_K\theta^{(0)} + e^{(0)}\right)\left(T^{(1)\top}_K\theta^{(1)} - R^{(1)\top}_K\theta^{(1)} + e^{(1)}\right)\,\middle|\, X, Z\right] = -R^{(0)\top}_K\theta^{(0)}\cdot R^{(1)\top}_K\theta^{(1)} \ne 0$$
in general. The remainder of the proof is straightforward.

A.3 Proof of Corollary 3.1
By (B.8) and (B.10), we have
$$\sqrt{n}\left(\widehat{\mathrm{MTE}}_j(x,p) - \mathrm{MTE}_j(x,p)\right) = \nabla b_K(p)^\top S_{K,j}\left[\Psi^{(1)}_K\right]^{-1}\mathbf{R}^{(1)\top}_K\boldsymbol{\xi}^{(1)}_K/\sqrt{n} + \nabla b_K(p)^\top S_{K,j}\left[\Psi^{(0)}_K\right]^{-1}\mathbf{R}^{(0)\top}_K\boldsymbol{\xi}^{(0)}_K/\sqrt{n} + o_P(\|\nabla b_K(p)\|).$$
The same argument as in the proof of Theorem 3.1 leads to the desired result.
A.4 Proof of Theorem 4.1
To save space, we provide the proof for the case of $d = 1$ and $j = 1$ only. Under Assumption 4.1, we have
$$\rho_{11}(x,q,p_1,p_2) = \Pr(U \le Q, V_1 \le P_1 \mid X = x, Q = q, P_1 = p_1, P_2 = p_2) = \int_0^{p_1}\!\!\int_0^q f_{UV_1}(u,v \mid X = x)\,\mathrm{d}u\,\mathrm{d}v.$$
On the other hand, by the law of iterated expectations, it holds that
$$\psi_1(x,q,p_1,p_2) = E[Y^{(1)}_1 \mid D = 1, s = 1, X = x, Q = q, P_1 = p_1, P_2 = p_2]\,\rho_{11}(x,q,p_1,p_2) + E[Y^{(1)}_2 \mid D = 1, s = 2, X = x, Q = q, P_1 = p_1, P_2 = p_2]\,\rho_{12}(x,q,p_1,p_2) = E[Y^{(1)}_1 \mid U \le q, V_1 \le p_1, X = x]\Pr(U \le q, V_1 \le p_1 \mid X = x) + E[Y^{(1)}_2 \mid U > q, V_2 \le p_2, X = x]\Pr(U > q, V_2 \le p_2 \mid X = x) = \int_0^{p_1}\!\!\int_0^q m^{(1)}_1(x,u,v)f_{UV_1}(u,v \mid X = x)\,\mathrm{d}u\,\mathrm{d}v + \int_0^{p_2}\!\!\int_q^1 m^{(1)}_2(x,u,v)f_{UV_2}(u,v \mid X = x)\,\mathrm{d}u\,\mathrm{d}v.$$
Thus, the Leibniz integral rule completes the proof.

A.5 Proof of Theorem 4.2
We only prove the case of $s = 1$. By the law of iterated expectations, we observe that
$$\mathrm{MTE}_1(x,p) = E\left[E[Y^{(1)}_1 - Y^{(0)}_1 \mid X = x, V_1 = p, Q, s = 1]\,\middle|\, X = x, V_1 = p, s = 1\right] = \int_0^1 E[Y^{(1)}_1 - Y^{(0)}_1 \mid X = x, V_1 = p, Q = q, s = 1]\,f_Q(q \mid X = x, V_1 = p, s = 1)\,\mathrm{d}q.$$
Further, by Assumption 4.1(i),
$$E[Y^{(1)}_1 - Y^{(0)}_1 \mid X = x, V_1 = p, Q = q, s = 1] = E[Y^{(1)}_1 - Y^{(0)}_1 \mid X = x, V_1 = p, Q = q, U \le q] = \frac{\int 1\{u \le q\}\,\mathrm{MTE}_1(x,u,p)\,f_U(u \mid X = x, V_1 = p)\,\mathrm{d}u}{\Pr(U \le q \mid X = x, V_1 = p)} = \frac{\int 1\{u \le q\}\,\mathrm{MTE}_1(x,u,p)\,f_{UV_1}(u,p \mid X = x)\,\mathrm{d}u}{\Pr(U \le q \mid X = x, V_1 = p)},$$
where the last equality follows from the fact that $V_1 \sim \mathrm{Uniform}[0,1]$ conditional on $X = x$. On the other hand, Bayes' theorem implies that
$$f_Q(q \mid X = x, V_1 = p, s = 1) = \frac{\Pr(U \le q \mid X = x, V_1 = p)\,f_Q(q \mid X = x)}{\Pr(s = 1 \mid X = x, V_1 = p)}$$
under Assumption 4.1(i). By the law of iterated expectations and Assumption 4.1(i), we have
$$\Pr(s = 1 \mid X = x, V_1 = p) = E[\Pr(s = 1 \mid X = x, V_1 = p, Q) \mid X = x, V_1 = p] = \int \Pr(U \le q \mid X = x, V_1 = p, Q = q)\,f_Q(q \mid X = x, V_1 = p)\,\mathrm{d}q = \int\!\!\int 1\{u \le q\}\,f_U(u \mid X = x, V_1 = p)\,f_Q(q \mid X = x)\,\mathrm{d}q\,\mathrm{d}u = \int \Pr(u \le Q \mid X = x)\,f_{UV_1}(u,p \mid X = x)\,\mathrm{d}u.$$
Combining these results yields
$$\mathrm{MTE}_1(x,p) = \frac{\int\!\!\int \mathrm{MTE}_1(x,u,p)\,1\{u \le q\}\,f_{UV_1}(u,p \mid X = x)\,f_Q(q \mid X = x)\,\mathrm{d}q\,\mathrm{d}u}{\int \Pr(u' \le Q \mid X = x)\,f_{UV_1}(u',p \mid X = x)\,\mathrm{d}u'} = \int \mathrm{MTE}_1(x,u,p)\,\frac{\Pr(u \le Q \mid X = x)\,f_{UV_1}(u,p \mid X = x)}{\int \Pr(u' \le Q \mid X = x)\,f_{UV_1}(u',p \mid X = x)\,\mathrm{d}u'}\,\mathrm{d}u.$$
This completes the proof.

A.6 Proof of Theorem 4.3
We provide the proof of the first result only, since the second result can be shown analogously. First, observe that
$$E[D \mid Z = \mathbf{1}_S] = \sum_{j=1}^S E\left[1\{s = j\}D^{(1)}_j \,\middle|\, Z = \mathbf{1}_S\right] = \sum_{j=1}^S\left\{\Pr(D^{(1)}_j = 1, D^{(0)}_j = 0, s = j) + \Pr(D^{(1)}_j = 1, D^{(0)}_j = 1, s = j)\right\},$$
where the first equality follows from Assumption 4.2(i), and the second follows from Assumption 4.2(ii). Similarly, we can show that $E[D \mid Z = \mathbf{0}_S] = \sum_{j=1}^S \Pr(D^{(1)}_j = 1, D^{(0)}_j = 1, s = j)$ under Assumption 4.2(iii). Thus, we have
$$E[D \mid Z = \mathbf{1}_S] - E[D \mid Z = \mathbf{0}_S] = \sum_{j=1}^S \Pr(Z_j\text{-compliers}, s = j).$$
Next, observe that $E[Y \mid Z = \mathbf{1}_S] = \sum_{j=1}^S E[1\{s = j\}Y \mid Z = \mathbf{1}_S]$, and that, by the law of iterated expectations,
$$E[1\{s = j\}Y \mid Z = \mathbf{1}_S] = E[Y^{(1)}_j \mid D^{(1)}_j = 1, D^{(0)}_j = 1, s = j]\Pr(D^{(1)}_j = 1, D^{(0)}_j = 1, s = j) + E[Y^{(0)}_j \mid D^{(1)}_j = 0, D^{(0)}_j = 0, s = j]\Pr(D^{(1)}_j = 0, D^{(0)}_j = 0, s = j) + E[Y^{(1)}_j \mid D^{(1)}_j = 1, D^{(0)}_j = 0, s = j]\Pr(D^{(1)}_j = 1, D^{(0)}_j = 0, s = j),$$
by Assumption 4.2. In the same manner, it holds that
$$E[1\{s = j\}Y \mid Z = \mathbf{0}_S] = E[Y^{(1)}_j \mid D^{(1)}_j = 1, D^{(0)}_j = 1, s = j]\Pr(D^{(1)}_j = 1, D^{(0)}_j = 1, s = j) + E[Y^{(0)}_j \mid D^{(1)}_j = 0, D^{(0)}_j = 0, s = j]\Pr(D^{(1)}_j = 0, D^{(0)}_j = 0, s = j) + E[Y^{(0)}_j \mid D^{(1)}_j = 1, D^{(0)}_j = 0, s = j]\Pr(D^{(1)}_j = 1, D^{(0)}_j = 0, s = j).$$
Thus,
$$E[Y \mid Z = \mathbf{1}_S] - E[Y \mid Z = \mathbf{0}_S] = \sum_{j=1}^S E[Y^{(1)}_j - Y^{(0)}_j \mid Z_j\text{-compliers}, s = j]\Pr(Z_j\text{-compliers}, s = j).$$
This completes the proof.
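The identification argument just proved can also be checked on simulated data with $S = 2$, group-specific binary IVs, and no defiers; all parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
s = rng.integers(1, 3, n)                   # latent group membership, s in {1, 2}
Z = rng.integers(0, 2, (n, 2))              # two group-specific binary IVs
U = rng.uniform(size=n)
z_own = Z[np.arange(n), s - 1]              # Assumption 4.2(i): only the own IV matters
D = (U <= 0.3 + 0.4 * z_own).astype(int)    # compliers: 0.3 < U <= 0.7; no defiers
tau = np.where(s == 1, 1.0, 3.0)            # group-wise treatment effects (assumed)
Y = 0.5 + tau * D + rng.normal(0.0, 0.1, n)

def wald(z_hi, z_lo):
    # Wald estimand comparing Z = z_hi against Z = z_lo
    hi = (Z == np.array(z_hi)).all(axis=1)
    lo = (Z == np.array(z_lo)).all(axis=1)
    return (Y[hi].mean() - Y[lo].mean()) / (D[hi].mean() - D[lo].mean())

late_avg = wald((1, 1), (0, 0))   # complier-share-weighted average of the group LATEs
late_1 = wald((1, 0), (0, 0))     # group-1 LATE, identified via the unit vector e_1
```

With equal group shares and equal complier shares, `late_avg` should approach the simple average of the two group effects, while `late_1` isolates the group-1 LATE, matching the two displays of Theorem 4.3.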
B Appendix: Lemmas
This appendix collects the lemmas used in the proofs of Theorem 3.1 and Corollary 3.1. In the following, we present the results for $D = 1$ only; those for $D = 0$ are similar and thus omitted to save space. Below, we often use the notation $S$ to denote either $S_{X,j}$ or $S_{K,j}$, and we suppress the superscript $(1)$ when there is no confusion. For a matrix $A$, we denote by $\|A\| = \sqrt{\lambda_{\max}(A^\top A)}$ its spectral norm.

Lemma B.1. Suppose that Assumptions 3.1–3.6 hold.
(i) $\|\Psi^{(1)}_{nK} - \Psi^{(1)}_K\| = O_P(\zeta(K)\sqrt{(\log K)/n})$.
(ii) $\|[\Psi^{(1)}_{nK}]^{-1} - [\Psi^{(1)}_K]^{-1}\| = O_P(\zeta(K)\sqrt{(\log K)/n})$.
(iii) $\|S[\Psi^{(1)}_{nK}]^{-1}\mathbf{R}^{(1)\top}_K\mathbf{B}^{(1)}_K/n\| = O_P(\sqrt{\mathrm{tr}\{SS^\top\}/n})$.
(iv) $\|S[\Psi^{(1)}_{nK}]^{-1}\mathbf{R}^{(1)\top}_K\mathbf{r}^{(1)}_K/n\| = O_P(K^{-\mu})$.
(v) $\|S[\Psi^{(1)}_{nK}]^{-1}\mathbf{R}^{(1)\top}_K\mathbf{e}^{(1)}/n\| = O_P(\sqrt{\mathrm{tr}\{SS^\top\}/n})$.

Proof. (i), (ii) The proofs are the same as those of Lemma A.1(i) and (iii) in Hoshino and Yanagi (2020).
(iii) Since $E[\delta^{(1)}_j \mid X, Z] = \pi_j P_j$, we have $E[T^{(1)}_K \mid X, Z] = R^{(1)}_K$ and $E[B^{(1)}_K \mid X, Z] = 0$. Then, under Assumption 3.1, $E[B^{(1)}_{l,K}B^{(1)}_{k,K} \mid \{X_i, Z_i\}_{i=1}^n] = 0$ for any $l, k \in \{1, \ldots, n\}$ such that $l \ne k$. Additionally, observe that $\max_{1 \le i \le n}|B^{(1)}_{i,K}| = O(1)$ holds from Assumptions 3.2(i), (ii), and 3.3(i) and
$$\sup_{p \in [0,1]}\left|b_K(p)^\top\alpha^{(1)}_j\right| \le \sup_{p \in [0,1]}\left|b_K(p)^\top\alpha^{(1)}_j - g^{(1)}_j(p)\right| + \sup_{p \in [0,1]}\left|g^{(1)}_j(p)\right| = O(K^{-\mu}) + O(1), \qquad (B.1)$$
by Assumption 3.4(i). Thus, noting that $E[\mathbf{B}_K\mathbf{B}_K^\top \mid \{X_i, Z_i\}_{i=1}^n]$ is a diagonal matrix whose diagonal elements are $O(1)$, we have
$$E\left[\|S\Psi^{-1}_{nK}\mathbf{R}^\top_K\mathbf{B}_K/n\|^2 \,\middle|\, \{X_i, Z_i\}_{i=1}^n\right] = \mathrm{tr}\left\{S\Psi^{-1}_{nK}\mathbf{R}^\top_K E[\mathbf{B}_K\mathbf{B}_K^\top \mid \{X_i, Z_i\}_{i=1}^n]\mathbf{R}_K\Psi^{-1}_{nK}S^\top\right\}/n^2 \le O(1/n)\cdot\mathrm{tr}\left\{S\Psi^{-1}_{nK}\Psi_{nK}\Psi^{-1}_{nK}S^\top\right\} = O_P(\mathrm{tr}\{SS^\top\}/n),$$
where the last equality follows from Assumption 3.5(i) and result (ii). Then, the result follows from Markov's inequality.
(iv), (v) For (iv), since $\min_{1 \le i \le n}P_{ji} > 0$ for any $j$ under Assumptions 3.2(i) and (ii), we have
$$\max_{1 \le i \le n}|r^{(1)}_{i,K}| \le O(1)\cdot\sum_{j=1}^S\max_{1 \le i \le n}\left|g^{(1)}_j(P_{ji}) - b_K(P_{ji})^\top\alpha^{(1)}_j\right| = O(K^{-\mu})$$
by Assumption 3.4(i). For (v), note that
$$E[e^{(1)}\mid X,Z] = \sum_{j=1}^S\Big[E[\delta_j\epsilon^{(1)}\mid X,Z] - E[\delta_j\mid X,Z]\,g^{(1)}_j(P_j)/P_j\Big] = \sum_{j=1}^S\Big[E[\epsilon^{(1)}\mid X,Z,s=j,V_j\le P_j]\cdot\pi_jP_j - \pi_jg^{(1)}_j(P_j)\Big] = \sum_{j=1}^S\bigg[\pi_j\int_0^{P_j}E[\epsilon^{(1)}\mid s=j,V_j=v]\,\mathrm dv - \pi_jg^{(1)}_j(P_j)\bigg] = 0 \tag{B.2}$$
by Assumptions 2.1(i), 2.2, and 3.3(i), so that $E[e^{(1)}e^{(1)\top}\mid\{X_i,Z_i\}_{i=1}^n]$ is a diagonal matrix whose diagonal elements are $O(1)$ by Assumptions 3.1 and 3.6. Then, the rest of the proof is similar to that of Lemma A.2 in Hoshino and Yanagi (2020).

To prove the next lemma, define
$$\underset{S\dim(X)\times 1}{X_i} := (X_i^\top,\dots,X_i^\top)^\top,\quad \underset{SK\times 1}{b_{i,K}} := \big(b_K(P_{1i})^\top,\dots,b_K(P_{Si})^\top\big)^\top,\quad \underset{n\times S\dim(X)}{\mathbf X} := (X_1,\dots,X_n)^\top,\quad \underset{n\times SK}{\mathbf b_K} := (b_{1,K},\dots,b_{n,K})^\top,\quad \underset{n\times d_{SXK}}{W_K} := (\mathbf X,\mathbf b_K),$$
$$\underset{d_{SXK}\times 1}{\Upsilon^{(1)}_{i,K}} := \big(\pi_1P_{1i}\mathbf 1_{\dim(X)}^\top,\dots,\pi_SP_{Si}\mathbf 1_{\dim(X)}^\top,\pi_1\mathbf 1_K^\top,\dots,\pi_S\mathbf 1_K^\top\big)^\top,\quad \underset{n\times d_{SXK}}{\Upsilon^{(1)}_K} := \big(\Upsilon^{(1)}_{1,K},\dots,\Upsilon^{(1)}_{n,K}\big)^\top,$$
and define $\widehat{\mathbf b}_K$, $\widehat W_K := (\mathbf X,\widehat{\mathbf b}_K)$, and $\widehat\Upsilon^{(1)}_K$ analogously. Then, we can write $R^{(1)}_K = \Upsilon^{(1)}_K\circ W_K$ and $\widehat R^{(1)}_K = \widehat\Upsilon^{(1)}_K\circ\widehat W_K$, where $\circ$ denotes the Hadamard product.

Lemma B.2.
Suppose that Assumptions 3.1 – 3.6 hold. Then:
(i) $\big\|\widehat\Psi^{(1)}_{nK} - \Psi^{(1)}_{nK}\big\| = O_P(\zeta(K)/\sqrt n)$.
(ii) $\big\|[\widehat\Psi^{(1)}_{nK}]^{-1} - [\Psi^{(1)}_{nK}]^{-1}\big\| = O_P(\zeta(K)/\sqrt n)$.
(iii) $\big\|S[\widehat\Psi^{(1)}_{nK}]^{-1}\widehat R^{(1)\top}_K\widehat\Delta^{(1)}_K/n\big\| = O_P(n^{-1/2})$.
(iv) $\big\|S[\widehat\Psi^{(1)}_{nK}]^{-1}\widehat R^{(1)\top}_K B^{(1)}_K/n\big\| = O_P\big(\sqrt{\operatorname{tr}\{SS^\top\}/n}\big) + O_P\big(\zeta(K)\sqrt K/n\big) + O_P(\zeta(K)/n)$.
(v) $\big\|S[\widehat\Psi^{(1)}_{nK}]^{-1}\widehat R^{(1)\top}_K r^{(1)}_K/n\big\| = O_P(K^{-\mu})$.
(vi) $\big\|S[\widehat\Psi^{(1)}_{nK}]^{-1}\widehat R^{(1)\top}_K e^{(1)}/n\big\| = O_P\big(\sqrt{\operatorname{tr}\{SS^\top\}/n}\big)$.

Proof. (i) By the triangle inequality, we have
$$\big\|\widehat\Psi_{nK} - \Psi_{nK}\big\| \le \Big\|\big(\widehat R_K - R_K\big)^\top\big(\widehat R_K - R_K\big)/n\Big\| + 2\Big\|R_K^\top\big(\widehat R_K - R_K\big)/n\Big\|. \tag{B.3}$$
For the first term of (B.3), the mean value theorem and Assumption 3.2 lead to $b_K(\widehat P_{ji}) - b_K(P_{ji}) = \nabla b_K(\bar P_{ji})\cdot O_P(n^{-1/2})$, where $\bar P_{ji}$ lies between $\widehat P_{ji}$ and $P_{ji}$. Let $\nabla\bar b_{i,K} := \big(\nabla b_K(\bar P_{1i})^\top,\dots,\nabla b_K(\bar P_{Si})^\top\big)^\top$ and $\nabla\bar{\mathbf b}_K := (\nabla\bar b_{1,K},\dots,\nabla\bar b_{n,K})^\top$.
By the triangle inequality and Assumptions 3.2(iii) and 3.3(i), we have
$$\big\|\widehat R_K - R_K\big\| \le \big\|(\widehat\Upsilon_K - \Upsilon_K)\circ\widehat W_K\big\| + \big\|\Upsilon_K\circ(\widehat W_K - W_K)\big\| \le O_P(n^{-1/2})\cdot\big\{\|\widehat{\mathbf b}_K\| + \|\nabla\bar{\mathbf b}_K\|\big\} = O_P(\zeta(K)) + O_P(\zeta(K)) = O_P(\zeta(K)). \tag{B.4}$$
Thus, $\|(\widehat R_K - R_K)^\top(\widehat R_K - R_K)/n\| \le \|\widehat R_K - R_K\|^2/n = O_P(\zeta(K)^2/n)$. For the second term of (B.3), we have $\|R_K^\top(\widehat R_K - R_K)/n\|^2 = \operatorname{tr}\{(\widehat R_K - R_K)^\top R_KR_K^\top(\widehat R_K - R_K)\}/n^2 \le O_P(1/n)\cdot\|\widehat R_K - R_K\|^2 = O_P(\zeta(K)^2/n)$ by Lemma B.1(i) and (B.4). Thus, the second term is of order $O_P(\zeta(K)/\sqrt n)$, and we obtain the desired result.
(ii) The proof is the same as that of Lemma A.1(iii) in Hoshino and Yanagi (2020).
(iii) By Lemma B.1(ii), result (ii), and Assumption 3.5(i), we have
$$\big\|S\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K/n\big\|^2 = \operatorname{tr}\big\{\widehat\Delta_K^\top\widehat R_K\widehat\Psi_{nK}^{-1}S^\top S\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K\big\}/n^2 \le O_P(n^{-1})\cdot\big\|\widehat\Delta_K\big\|^2.$$
Note that $\widehat\Delta_K = \big[\widehat\Upsilon_K\circ(W_K - \widehat W_K)\big]\theta^{(1)} + \big[(\Upsilon_K - \widehat\Upsilon_K)\circ W_K\big]\theta^{(1)}$.
By the mean value theorem, it is easy to see that
$$\Big\|\big[\widehat\Upsilon_K\circ(W_K - \widehat W_K)\big]\theta^{(1)}\Big\|^2 = \sum_{i=1}^n\bigg(\sum_{j=1}^S\widehat\pi_{n,j}\big(P_{ji} - \widehat P_{ji}\big)\cdot\nabla b_K(\bar P_{ji})^\top\alpha^{(1)}_j\bigg)^2 = O_P(1)$$
from
$$\sup_{p\in[0,1]}\big|\nabla b_K(p)^\top\alpha^{(1)}_j\big| \le \sup_{p\in[0,1]}\big|\nabla b_K(p)^\top\alpha^{(1)}_j - \nabla g^{(1)}_j(p)\big| + \sup_{p\in[0,1]}\big|\nabla g^{(1)}_j(p)\big| = O(K^{-\mu}) + O(1).$$
The same argument shows that $\|[(\Upsilon_K - \widehat\Upsilon_K)\circ W_K]\theta^{(1)}\| = O_P(1)$. Thus, $\|\widehat\Delta_K\| = O_P(1)$, and we have the desired result.
(iv) By Lemma B.1(iii), we have
$$S\widehat\Psi_{nK}^{-1}\widehat R_K^\top B_K/n = S\Psi_{nK}^{-1}R_K^\top B_K/n + S\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top B_K/n + S\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top B_K/n = S\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top B_K/n + S\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top B_K/n + O_P\Big(\sqrt{\operatorname{tr}\{SS^\top\}/n}\Big).$$
For the first term,
$$\Big\|S\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top B_K/n\Big\| \le \big\|\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big\|\,\big\|R_K^\top B_K/n\big\| = O_P\big(\zeta(K)\sqrt K/n\big) \tag{B.5}$$
by result (ii) and Markov's inequality. For the second term, observe that $\|S\widehat\Psi_{nK}^{-1}(\widehat R_K - R_K)^\top B_K/n\| \le O_P(1/n)\cdot\|(\widehat R_K - R_K)^\top B_K\|$ and
$$\underset{d_{SXK}\times 1}{\big(\widehat R_K - R_K\big)^\top B_K} = \begin{pmatrix}\sum_{i=1}^n\big(\widehat\pi_{n,1}\widehat P_{1i} - \pi_1P_{1i}\big)X_iB_{i,K}\\ \vdots\\ \sum_{i=1}^n\big(\widehat\pi_{n,S}\widehat P_{Si} - \pi_SP_{Si}\big)X_iB_{i,K}\\ \sum_{i=1}^n\big[\widehat\pi_{n,1}b_K(\widehat P_{1i}) - \pi_1b_K(P_{1i})\big]B_{i,K}\\ \vdots\\ \sum_{i=1}^n\big[\widehat\pi_{n,S}b_K(\widehat P_{Si}) - \pi_Sb_K(P_{Si})\big]B_{i,K}\end{pmatrix}.$$
For each element of the right-hand side, applying a Taylor expansion to $\widehat P_{ji} = F_j(Z_{ji}^\top\widehat\gamma_{n,j})$ around $\gamma_j$ yields
$$\sum_{i=1}^n\big(\widehat\pi_{n,j}\widehat P_{ji} - \pi_jP_{ji}\big)X_iB_{i,K} = \sum_{i=1}^n\big(\widehat\pi_{n,j} - \pi_j\big)\widehat P_{ji}X_iB_{i,K} + \sum_{i=1}^n\pi_j\big(\widehat P_{ji} - P_{ji}\big)X_iB_{i,K} = \big(\widehat\pi_{n,j} - \pi_j\big)\sum_{i=1}^nP_{ji}X_iB_{i,K} + \sum_{h=1}^{\dim(Z_j)}\big(\widehat\gamma_{n,jh} - \gamma_{jh}\big)\sum_{i=1}^nZ_{jhi}\,\pi_jf_j(Z_{ji}^\top\gamma_j)X_iB_{i,K} + O_P(1),$$
and similarly
$$\sum_{i=1}^n\big[\widehat\pi_{n,j}b_K(\widehat P_{ji}) - \pi_jb_K(P_{ji})\big]B_{i,K} = \sum_{i=1}^n\big(\widehat\pi_{n,j} - \pi_j\big)b_K(\widehat P_{ji})B_{i,K} + \sum_{i=1}^n\pi_j\big(b_K(\widehat P_{ji}) - b_K(P_{ji})\big)B_{i,K} = \big(\widehat\pi_{n,j} - \pi_j\big)\sum_{i=1}^nb_K(P_{ji})B_{i,K} + \sum_{h=1}^{\dim(Z_j)}\big(\widehat\gamma_{n,jh} - \gamma_{jh}\big)\sum_{i=1}^nZ_{jhi}\,\pi_jf_j(Z_{ji}^\top\gamma_j)\nabla b_K(P_{ji})B_{i,K} + O_P(\zeta(K)).$$
For expositional simplicity, assume that $\dim(Z_j) = 1$ for all $j$. Let $\widehat\Pi_{nK}$ and $\widehat\Gamma_{nK}$ be appropriate $d_{SXK}\times d_{SXK}$ diagonal matrices with diagonal elements $(\widehat\pi_{n,j} - \pi_j)$ and $(\widehat\gamma_{n,j} - \gamma_j)$, respectively, so that we can write
$$\big(\widehat R_K - R_K\big)^\top B_K = \sqrt n\,\widehat\Pi_{nK}\sum_{i=1}^nM_{i,K} + \sqrt n\,\widehat\Gamma_{nK}\sum_{i=1}^nN_{i,K} + O_P(\zeta(K)), \tag{B.6}$$
where
$$M_{i,K} := n^{-1/2}\begin{pmatrix}P_{1i}X_iB_{i,K}\\ \vdots\\ P_{Si}X_iB_{i,K}\\ b_K(P_{1i})B_{i,K}\\ \vdots\\ b_K(P_{Si})B_{i,K}\end{pmatrix},\qquad N_{i,K} := n^{-1/2}\begin{pmatrix}Z_{1i}\pi_1f_1(Z_{1i}\gamma_1)X_iB_{i,K}\\ \vdots\\ Z_{Si}\pi_Sf_S(Z_{Si}\gamma_S)X_iB_{i,K}\\ Z_{1i}\pi_1f_1(Z_{1i}\gamma_1)\nabla b_K(P_{1i})B_{i,K}\\ \vdots\\ Z_{Si}\pi_Sf_S(Z_{Si}\gamma_S)\nabla b_K(P_{Si})B_{i,K}\end{pmatrix}.$$
Note that $E[N_{i,K}] = 0_{d_{SXK}}$ by $E[B_{i,K}\mid X_i,Z_i] = 0$, and $\bar N_K := \max_{1\le i\le n}\|N_{i,K}\| = O(\zeta(K)/\sqrt n)$ under Assumptions 3.2(i), (ii), 3.3(i), and 3.4.
Further, let $\sigma_{nK} := \max\big\{\big\|\sum_{i=1}^nE\big(N_{i,K}N_{i,K}^\top\big)\big\|,\ \big\|\sum_{i=1}^nE\big(N_{i,K}^\top N_{i,K}\big)\big\|\big\}$. It is easy to see that $\sigma_{nK} = O(\zeta(K)^2)$. Observe that $\bar N_K\sqrt{\log(d_{SXK}+1)} = O\big(\zeta(K)\sqrt{(\log K)/n}\big) = o(\sigma_{nK})$. Then, by Corollary 4.1 in Chen and Christensen (2015), we obtain $\|\sum_{i=1}^nN_{i,K}\| = O_P(\zeta(K)\sqrt{\log K})$. In a similar manner, we can show that $\|\sum_{i=1}^nM_{i,K}\| = O_P(\zeta(K)\sqrt{\log K})$. Noting that $\|\widehat\Pi_{nK}\| = O_P(n^{-1/2})$ and $\|\widehat\Gamma_{nK}\| = O_P(n^{-1/2})$, we can show that the first and second terms on the right-hand side of (B.6) are both $O_P(\zeta(K)\sqrt{\log K})$. Thus,
$$S\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top B_K/n = O_P\big(\zeta(K)\sqrt{\log K}/n\big) + O_P(\zeta(K)/n). \tag{B.7}$$
Summarizing these results, we obtain
$$S\widehat\Psi_{nK}^{-1}\widehat R_K^\top B_K/n = O_P\big(\zeta(K)\sqrt K/n\big) + O_P\big(\zeta(K)\sqrt{\log K}/n\big) + O_P(\zeta(K)/n) + O_P\Big(\sqrt{\operatorname{tr}\{SS^\top\}/n}\Big).$$
(v), (vi) The proofs are similar to the proof of Lemma A.2 in Hoshino and Yanagi (2020). For (vi), note that $E[e^{(1)}\mid D,X,Z] = E\big[E[e^{(1)}\mid D,X,Z,s]\mid D,X,Z\big] = 0$ since
$$E\big[e^{(1)}\mid D,X,Z,s=j\big] = E\bigg[\sum_{h=1}^S\delta^{(1)}_h\big(\epsilon^{(1)} - g_h(P_h)/P_h\big)\,\bigg|\,D,X,Z,s=j\bigg] = E\Big[D\big(\epsilon^{(1)} - g_j(P_j)/P_j\big)\,\Big|\,D,X,Z,s=j\Big] = 0\quad\text{for all }j.$$
Here, for a generic random variable $T$ and $q\in C(\operatorname{supp}[T])$, we define
$$\widehat{\mathcal P}^{(d)}_{n,j}q := b_K(\cdot)^\top S_{K,j}\big[\widehat\Psi^{(d)}_{nK}\big]^{-1}\frac 1 n\sum_{i=1}^n\widehat R^{(d)}_{i,K}q(T_i).$$

Lemma B.3.
Suppose that Assumptions 3.1 – 3.6 hold. If $\zeta(K)^2/\sqrt n = O(1)$ holds, then $\|\widehat{\mathcal P}^{(1)}_{n,j}\|_\infty = \|\mathcal P^{(1)}_{n,j}\|_\infty + O_P(1)$, where
$$\big\|\widehat{\mathcal P}^{(d)}_{n,j}\big\|_\infty := \sup\bigg\{\sup_{p\in[0,1]}\Big|\big(\widehat{\mathcal P}^{(d)}_{n,j}q\big)(p)\Big| : q\in C(\operatorname{supp}[T]),\ \sup_{t\in\operatorname{supp}[T]}|q(t)| = 1\bigg\}.$$

Proof.
The proof is the same as that of Lemma A.3 in Hoshino and Yanagi (2020).
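The operator sup-norm used in Lemma B.3 has a concrete finite-sample meaning for a linear smoother: writing $(\mathcal Pq)(p) = \sum_i w_i(p)q(T_i)$, the supremum over functions $q$ with $\sup_t|q(t)| \le 1$ at a fixed $p$ equals $\sum_i|w_i(p)|$. A minimal numerical sketch (the weights below are made up for illustration and are not the estimator's actual weights):

```python
import numpy as np

# Toy smoothing weights w_i(p) at one fixed evaluation point p
# (illustrative values only).
rng = np.random.default_rng(0)
w = rng.normal(size=10)

# The sup over {q : sup|q| <= 1} of |sum_i w_i(p) q(T_i)| is attained by
# q(T_i) = sign(w_i(p)), giving sum_i |w_i(p)|.
q_star = np.sign(w)
sup_value = w @ q_star

# Any other feasible q (values in [-1, 1]) cannot exceed this supremum.
q_other = np.tanh(rng.normal(size=10))
assert abs(w @ q_other) <= sup_value
assert np.isclose(sup_value, np.abs(w).sum())
```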
Lemma B.4.
Suppose that Assumptions 3.1 – 3.6 hold. For a given $p\in\operatorname{supp}[P_j\mid D=1]$, if $\|\nabla b_K(p)\|\to\infty$, $\sqrt nK^{-\mu}\to 0$, and $\sqrt nK^{-\mu}/\|\nabla b_K(p)\|\to 0$ hold, then
(i) $\sqrt n\big(\widetilde m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\big)\big/\sigma^{(1)}_{K,j}(p)\xrightarrow{d}N(0,1)$.
If, additionally, Assumption 3.7, $\zeta(K)^2/\sqrt n = O(1)$, and $\big(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1\big)\sqrt K/\|\nabla b_K(p)\|\to 0$ hold, then
(ii) $\sqrt n\big(\widehat m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\big)\big/\sigma^{(1)}_{K,j}(p)\xrightarrow{d}N(0,1)$.

Proof. (i) First, by Assumption 3.5, we have
$$\big[\sigma^{(1)}_{K,j}(p)\big]^2 = \nabla b_K(p)^\top S_{K,j}\Psi_K^{-1}\Sigma_K\Psi_K^{-1}S_{K,j}^\top\nabla b_K(p) \ge \frac{c_\Sigma}{\bar c}\cdot\|\nabla b_K(p)\|^2 > 0. \tag{B.8}$$
Next, by Lemmas B.1(iii)–(v), we have
$$\big\|\widetilde\beta^{(1)}_{n,j} - \beta^{(1)}_j\big\| \le \Big\|S_{X,j}\big[\Psi^{(1)}_{nK}\big]^{-1}R^{(1)\top}_KB^{(1)}_K/n\Big\| + \Big\|S_{X,j}\big[\Psi^{(1)}_{nK}\big]^{-1}R^{(1)\top}_Kr^{(1)}_K/n\Big\| + \Big\|S_{X,j}\big[\Psi^{(1)}_{nK}\big]^{-1}R^{(1)\top}_Ke^{(1)}/n\Big\| = O_P(n^{-1/2}) + O_P(K^{-\mu}).$$
Thus, by the definition of the infeasible estimator $\widetilde m^{(1)}_j(x,p)$ and Assumption 3.4(i),
$$\widetilde m^{(1)}_j(x,p) - m^{(1)}_j(x,p) = x^\top\big(\widetilde\beta^{(1)}_{n,j} - \beta^{(1)}_j\big) + \nabla b_K(p)^\top\widetilde\alpha^{(1)}_{n,j} - \nabla g^{(1)}_j(p) = \nabla b_K(p)^\top\big(\widetilde\alpha^{(1)}_{n,j} - \alpha^{(1)}_j\big) + O_P(n^{-1/2}) + O_P(K^{-\mu}) + O(K^{-\mu}) = A_{1n,j} + A_{2n,j} + O_P(n^{-1/2}) + O_P(K^{-\mu}) + O(K^{-\mu}),$$
where $A_{1n,j} := \nabla b_K(p)^\top S_{K,j}\Psi_{nK}^{-1}R_K^\top\xi_K/n$ and $A_{2n,j} := \nabla b_K(p)^\top S_{K,j}\Psi_{nK}^{-1}R_K^\top r_K/n$.
For $A_{2n,j}$, by Lemma B.1(iv) and $\sqrt nK^{-\mu}\to 0$, we have
$$|A_{2n,j}| \le \|\nabla b_K(p)\|\cdot\big\|S_{K,j}\Psi_{nK}^{-1}R_K^\top r_K/n\big\| = \|\nabla b_K(p)\|\cdot O_P(K^{-\mu}) = \|\nabla b_K(p)\|\cdot o_P(n^{-1/2}).$$
Define $A'_{1n,j} := \nabla b_K(p)^\top S_{K,j}\Psi_K^{-1}R_K^\top\xi_K/n$. It is easy to see that
$$|A_{1n,j} - A'_{1n,j}| \le \|\nabla b_K(p)\|\cdot\big\|S_{K,j}\big(\Psi_{nK}^{-1} - \Psi_K^{-1}\big)R_K^\top\xi_K/n\big\| = \|\nabla b_K(p)\|\cdot O_P\big(\zeta(K)\sqrt{K\log K}/n\big) = \|\nabla b_K(p)\|\cdot o_P(n^{-1/2})$$
by Lemma B.1(ii), Markov's inequality, and Assumption 3.4(ii). Thus, by (B.8), we obtain
$$\frac{\sqrt n\big(\widetilde m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\big)}{\sigma^{(1)}_{K,j}(p)} = \frac{\sqrt n(A_{1n,j} + A_{2n,j})}{\sigma^{(1)}_{K,j}(p)} + o_P(1) = \frac{\sqrt nA'_{1n,j}}{\sigma^{(1)}_{K,j}(p)} + o_P(1), \tag{B.9}$$
since we have assumed $\|\nabla b_K(p)\|\to\infty$, $\sqrt nK^{-\mu}\to 0$, and $\sqrt nK^{-\mu}/\|\nabla b_K(p)\|\to 0$.

We now show the asymptotic normality of $\sqrt nA'_{1n,j}/\sigma^{(1)}_{K,j}(p)$. Let $\phi_{ji} := \Pi_{K,j}(p)R^{(1)}_{i,K}\xi^{(1)}_{i,K}/\sqrt n$, where $\Pi_{K,j}(p) := \nabla b_K(p)^\top S_{K,j}\Psi_K^{-1}/\sigma^{(1)}_{K,j}(p)$, so that $\sum_{i=1}^n\phi_{ji} = \sqrt nA'_{1n,j}/\sigma^{(1)}_{K,j}(p)$. Since $E[B^{(1)}_K\mid X,Z] = 0$ and $E[e^{(1)}\mid X,Z] = 0$ as shown in (B.2), we have $E[\phi_{ji}] = 0$ and $\operatorname{Var}[\phi_{ji}] = n^{-1}$. Moreover, note that $E\big[(\xi^{(1)}_{i,K})^4\mid X_i,Z_i\big] = O(1)$ holds by the $c_r$-inequality with Assumption 3.6 and the uniform boundedness of $B^{(1)}_{i,K}$. Then, by the same argument as in the proof of Theorem 4.2 in Hoshino and Yanagi (2020), we obtain $\sum_{i=1}^nE[\phi_{ji}^4] = O(\zeta(K)^2K/n) = o(1)$ under Assumption 3.4(ii).
Hence, result (i) follows from Lyapunov's central limit theorem.
(ii) By Lemmas B.2(iii)–(vi), Assumption 3.4(ii), and $\sqrt nK^{-\mu}\to 0$, we have
$$\big\|\widehat\beta^{(1)}_{n,j} - \beta^{(1)}_j\big\| \le \Big\|S_{X,j}\big[\widehat\Psi^{(1)}_{nK}\big]^{-1}\widehat R^{(1)\top}_K\widehat\Delta^{(1)}_K/n\Big\| + \Big\|S_{X,j}\big[\widehat\Psi^{(1)}_{nK}\big]^{-1}\widehat R^{(1)\top}_Kr^{(1)}_K/n\Big\| + \Big\|S_{X,j}\big[\widehat\Psi^{(1)}_{nK}\big]^{-1}\widehat R^{(1)\top}_K\xi^{(1)}_K/n\Big\| = O_P(n^{-1/2}) + \underbrace{O_P\big(\zeta(K)\sqrt K/n\big) + O_P(\zeta(K)/n) + O_P(K^{-\mu})}_{=o_P(n^{-1/2})}.$$
Thus, by the definition of the feasible estimator $\widehat m^{(1)}_j(x,p)$ and Assumption 3.4(i),
$$\widehat m^{(1)}_j(x,p) - m^{(1)}_j(x,p) = x^\top\big(\widehat\beta^{(1)}_{n,j} - \beta^{(1)}_j\big) + \nabla b_K(p)^\top\widehat\alpha^{(1)}_{n,j} - \nabla g^{(1)}_j(p) = \nabla b_K(p)^\top\big(\widehat\alpha^{(1)}_{n,j} - \alpha^{(1)}_j\big) + O_P(n^{-1/2}) + O_P(K^{-\mu}) = \widehat A_{1n,j} + \widehat A_{2n,j} + \widehat A_{3n,j} + O_P(n^{-1/2}) + O_P(K^{-\mu}),$$
where
$$\widehat A_{1n,j} := \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top\xi_K/n,\quad \widehat A_{2n,j} := \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top r_K/n,\quad \widehat A_{3n,j} := \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K/n.$$
For $\widehat A_{1n,j}$, observe that
$$\widehat A_{1n,j} = A_{1n,j} + \nabla b_K(p)^\top S_{K,j}\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top\xi_K/n + \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top B_K/n + \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top e/n.$$
By the same argument as in (B.5), the second term on the right-hand side satisfies
$$\Big\|\nabla b_K(p)^\top S_{K,j}\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top\xi_K/n\Big\| \le \|\nabla b_K(p)\|\cdot\Big\|S_{K,j}\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top\xi_K/n\Big\| = \|\nabla b_K(p)\|\cdot O_P\big(\zeta(K)\sqrt K/n\big).$$
Similarly, we can show that the third term is of order $\|\nabla b_K(p)\|\cdot\big\{O_P\big(\zeta(K)\sqrt{\log K}/n\big) + O_P(\zeta(K)/n)\big\}$ by (B.7). For the fourth term, recalling that $E[e^{(1)}\mid D,X,Z] = 0$, we have
$$E\bigg[\Big\|S_{K,j}\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top e/n\Big\|^2\,\bigg|\,\{D_i,X_i,Z_i\}_{i=1}^n\bigg] = \operatorname{tr}\Big\{S_{K,j}\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top E\big[ee^\top\mid\{D_i,X_i,Z_i\}_{i=1}^n\big]\big(\widehat R_K - R_K\big)\widehat\Psi_{nK}^{-1}S_{K,j}^\top\Big\}/n^2 \le O(1/n^2)\cdot\big\|\widehat R_K - R_K\big\|^2\cdot\operatorname{tr}\big\{S_{K,j}\widehat\Psi_{nK}^{-1}\widehat\Psi_{nK}^{-1}S_{K,j}^\top\big\} = O_P\big(\zeta(K)^2K/n^2\big)$$
by Assumption 3.6 and (B.4). Thus, by Markov's inequality, we find that the fourth term is of order $\|\nabla b_K(p)\|\cdot O_P\big(\zeta(K)\sqrt K/n\big)$.
Combining these results yields
$$\widehat A_{1n,j} = A_{1n,j} + \|\nabla b_K(p)\|\cdot\big\{O_P\big(\zeta(K)\sqrt K/n\big) + O_P(\zeta(K)/n)\big\} = A_{1n,j} + \|\nabla b_K(p)\|\cdot o_P(n^{-1/2})$$
under Assumption 3.4(ii). For $\widehat A_{2n,j}$, by Lemma B.2(v) and $\sqrt nK^{-\mu}\to 0$, we have $|\widehat A_{2n,j}| \le \|\nabla b_K(p)\|\cdot\|S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top r_K/n\| = \|\nabla b_K(p)\|\cdot o_P(n^{-1/2})$. For $\widehat A_{3n,j}$, observe that $|\widehat A_{3n,j}| \le O(\sqrt K)\cdot\sup_{p\in[0,1]}\big|b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K/n\big|$ by Assumption 3.7. Further, noting that
$$\widehat\Delta_K = \big(R_K - \widehat R_K\big)\theta^{(1)} = \sum_{h=1}^S\Big[\pi_h\big(P_h - \widehat P_h\big)X^\top\beta^{(1)}_h + \big(\pi_h - \widehat\pi_{n,h}\big)\widehat P_hX^\top\beta^{(1)}_h + \pi_h\big(b_K(P_h) - b_K(\widehat P_h)\big)^\top\alpha^{(1)}_h + \big(\pi_h - \widehat\pi_{n,h}\big)b_K(\widehat P_h)^\top\alpha^{(1)}_h\Big],$$
we write
$$b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K/n = b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\bigg[\frac 1 n\sum_{i=1}^n\widehat R_{i,K}\bigg(\sum_{h=1}^S\pi_h\big(P_{hi} - \widehat P_{hi}\big)X_i^\top\beta^{(1)}_h\bigg)\bigg] + b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\bigg[\frac 1 n\sum_{i=1}^n\widehat R_{i,K}\bigg(\sum_{h=1}^S\big(\pi_h - \widehat\pi_{n,h}\big)\widehat P_{hi}X_i^\top\beta^{(1)}_h\bigg)\bigg] + b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\bigg[\frac 1 n\sum_{i=1}^n\widehat R_{i,K}\bigg(\sum_{h=1}^S\pi_h\big(b_K(P_{hi}) - b_K(\widehat P_{hi})\big)^\top\alpha^{(1)}_h\bigg)\bigg] + b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\bigg[\frac 1 n\sum_{i=1}^n\widehat R_{i,K}\bigg(\sum_{h=1}^S\big(\pi_h - \widehat\pi_{n,h}\big)b_K(\widehat P_{hi})^\top\alpha^{(1)}_h\bigg)\bigg] =: B_{1n,j}(p) + B_{2n,j}(p) + B_{3n,j}(p) + B_{4n,j}(p),\ \text{say}.$$
By Lemma B.3, if $\zeta(K)^2/\sqrt n = O(1)$,
$$|B_{1n,j}(p)| = \bigg|b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\frac 1 n\sum_{i=1}^n\widehat R_{i,K}q_n(Z_i,X_i)\bigg| = \Big|\big(\widehat{\mathcal P}^{(1)}_{n,j}q_n\big)(p)\Big| \le \big\|\widehat{\mathcal P}^{(1)}_{n,j}\big\|_\infty\cdot O_P(n^{-1/2}) = \big(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1\big)\cdot O_P(n^{-1/2})$$
for any $p\in[0,1]$, where the definition of $q_n(Z_i,X_i)$ should be clear from the context. Similarly, we can easily show that $|B_{2n,j}(p)|$, $|B_{3n,j}(p)|$, and $|B_{4n,j}(p)|$ are also of order $(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1)\cdot O_P(n^{-1/2})$ uniformly in $p\in[0,1]$. Consequently, we have
$$\widehat A_{3n,j} = \big(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1\big)\cdot O_P\big(\sqrt{K/n}\big) = \|\nabla b_K(p)\|\cdot o_P(n^{-1/2}),$$
since we have assumed $(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1)\sqrt K/\|\nabla b_K(p)\|\to 0$. Therefore, we have
$$\frac{\sqrt n\big(\widehat m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\big)}{\sigma^{(1)}_{K,j}(p)} = \frac{\sqrt nA_{1n,j}}{\sigma^{(1)}_{K,j}(p)} + o_P(1), \tag{B.10}$$
and the result follows from the proof of result (i).

C Appendix: Supplementary Identification Results
C.1 Identification of the Finite Mixture Probit Models
C.1.1 Exogenous membership with constant membership probability
Consider the following finite mixture Probit model:
$$D = \mathbf 1\big\{Z^\top\gamma_{zj} + \zeta_j\gamma_{\zeta j}\ge\epsilon_{Dj}\big\}\quad\text{with probability }\pi_j > 0,\ \text{for }j\in\{1,\dots,S\},$$
where $Z\in\mathbb R^{\dim(Z)}$ is a vector of common IVs among all groups, $\zeta_j\in\mathbb R$ is a group-specific continuous IV, and $\epsilon_{Dj}\sim N(0,1)$ independently of $(Z,\zeta,s)$ for all $j$, with $\zeta = (\zeta_1,\dots,\zeta_S)^\top$. Additionally, we assume $\gamma_{\zeta j}\neq 0$ for all $j$.

For this setup, we show that the coefficients $\gamma_\zeta = (\gamma_{\zeta 1},\dots,\gamma_{\zeta S})^\top$ and $\gamma_z = (\gamma_{z1}^\top,\dots,\gamma_{zS}^\top)^\top$ and the membership probabilities $\pi = (\pi_1,\dots,\pi_S)^\top$ can be identified. Letting $\mathrm x = (\mathrm x_1,\dots,\mathrm x_S)^\top$ be a given realization of $\zeta$, it holds that $\Pr(D = 1\mid Z = z,\zeta = \mathrm x) = \sum_{j=1}^S\pi_j\Phi(z^\top\gamma_{zj} + \mathrm x_j\gamma_{\zeta j})$. Thus, for each $j$,
$$\frac{\partial}{\partial\mathrm x_j}\Pr(D = 1\mid Z = z,\zeta = \mathrm x) = \pi_j\phi\big(z^\top\gamma_{zj} + \mathrm x_j\gamma_{\zeta j}\big)\gamma_{\zeta j}, \tag{C.1}$$
where $\phi$ denotes the standard normal density. Hence, for another realization $\mathrm x' = (\mathrm x'_1,\dots,\mathrm x'_S)^\top$ of $\zeta$ such that $|\mathrm x_j|\neq|\mathrm x'_j|$, we have
$$\frac{\partial\Pr(D = 1\mid Z = z,\zeta = \mathrm x)/\partial\mathrm x_j}{\partial\Pr(D = 1\mid Z = z,\zeta = \mathrm x')/\partial\mathrm x'_j} = \frac{\phi\big(z^\top\gamma_{zj} + \mathrm x_j\gamma_{\zeta j}\big)}{\phi\big(z^\top\gamma_{zj} + \mathrm x'_j\gamma_{\zeta j}\big)} = \exp\bigg(\frac 1 2\Big[\big(z^\top\gamma_{zj} + \mathrm x'_j\gamma_{\zeta j}\big)^2 - \big(z^\top\gamma_{zj} + \mathrm x_j\gamma_{\zeta j}\big)^2\Big]\bigg) = \exp\bigg(\frac 1 2\Big[\big[(\mathrm x'_j)^2 - \mathrm x_j^2\big]\gamma_{\zeta j}^2 + 2\big(\mathrm x'_j - \mathrm x_j\big)z^\top\gamma_{zj}\gamma_{\zeta j}\Big]\bigg).$$
This implies that we can obtain the following linear equation in the parameters $\gamma_{\zeta j}^2$ and $\gamma_{zj}\gamma_{\zeta j}$:
$$2\log\bigg[\frac{\partial\Pr(D = 1\mid Z = z,\zeta = \mathrm x)/\partial\mathrm x_j}{\partial\Pr(D = 1\mid Z = z,\zeta = \mathrm x')/\partial\mathrm x'_j}\bigg] = \big[(\mathrm x'_j)^2 - \mathrm x_j^2\big]\gamma_{\zeta j}^2 + 2\big(\mathrm x'_j - \mathrm x_j\big)z^\top\gamma_{zj}\gamma_{\zeta j}.$$
Note that the left-hand side can be identified from the data.
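The displayed linear equation can be checked numerically for a single group $j$; the parameter values below are purely illustrative (not estimates from the paper):

```python
import numpy as np

def phi(e):
    # Standard normal density.
    return np.exp(-0.5 * e**2) / np.sqrt(2.0 * np.pi)

# Illustrative true parameters for one group j (made up for this sketch).
gamma_z = np.array([0.5, -1.0])
gamma_zeta = 0.8
pi_j = 0.3
z = np.array([1.0, 2.0])

def deriv(xj):
    # (C.1): d/dx_j Pr(D=1 | Z=z, zeta=x) = pi_j * phi(z'gamma_zj + x_j*gamma_zeta_j) * gamma_zeta_j
    return pi_j * phi(z @ gamma_z + xj * gamma_zeta) * gamma_zeta

# Two realizations of zeta_j with |x| != |x'|.
x, xp = 0.4, 1.1
lhs = 2.0 * np.log(deriv(x) / deriv(xp))
rhs = (xp**2 - x**2) * gamma_zeta**2 + 2.0 * (xp - x) * (z @ gamma_z) * gamma_zeta
assert np.isclose(lhs, rhs)
```

Collecting such equations for several pairs $(\mathrm x, \mathrm x')$ gives the linear system in $\gamma_{\zeta j}^2$ and $\gamma_{zj}\gamma_{\zeta j}$ that the next paragraph solves.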
Then, if there are $1 + \dim(Z)$ distinct pairs $(\mathrm x, \mathrm x')$ of realizations of $\zeta$ conditional on $Z = z$ for some $z\neq 0_{\dim(Z)}$, $\gamma_{\zeta j}^2$ and $\gamma_{zj}\gamma_{\zeta j}$ can be identified by solving the system of linear equations constructed from such pairs. To identify $\gamma_{\zeta j}$ and $\gamma_{zj}$ separately, note that the sign of $\gamma_{\zeta j}$ is identified by (C.1) since $\pi_j\phi$ is positive. Hence, $\gamma_{\zeta j}$ is identified, and so is $\gamma_{zj}$. Finally, $\pi_j$ is also identified from (C.1). The above argument holds for any $j$, implying the identification of all of $\gamma_\zeta$, $\gamma_z$, and $\pi$.

C.1.2 Endogenous membership with covariate-dependent membership probability
In line with the setup in Subsection 4.1, we consider the following finite mixture Probit model with potentially endogenous group membership:
$$s = 2 - \mathbf 1\big\{Z^\top\alpha_z + W\alpha_w\ge\epsilon_s\big\},\qquad D = \mathbf 1\big\{Z^\top\gamma_{zj} + \zeta_j\gamma_{\zeta j}\ge\epsilon_{Dj}\big\}\quad\text{if }s = j,$$
where $Z\in\mathbb R^{\dim(Z)}$ is a vector of common IVs among the groups, $\zeta = (\zeta_1,\zeta_2)\in\mathbb R^2$ are group-specific continuous IVs, and $W\in\mathbb R$ is a continuous covariate that affects group membership only. The error terms $(\epsilon_s,\epsilon_{Dj})$ are independent of $(Z,\zeta,W)$ and follow the standard bivariate normal distribution with correlation parameter $\rho_j$. Suppose that the sign of $\alpha_w$ is known to be positive and that $\gamma_{\zeta j}\neq 0$ for $j\in\{1,2\}$. Finally, we assume that $W$ is distributed on the whole of $\mathbb R$.

We first show that $(\gamma_{z1},\gamma_{\zeta 1})$ can be identified (the identification of $(\gamma_{z2},\gamma_{\zeta 2})$ is symmetric and thus omitted). Letting $t(z,w) := z^\top\alpha_z + w\alpha_w$, observe that
$$\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w) = \Pr(D = 1\mid s = 1,Z = z,\zeta = \mathrm x,W = w)\Pr(s = 1\mid Z = z,\zeta = \mathrm x,W = w) + \Pr(D = 1\mid s = 2,Z = z,\zeta = \mathrm x,W = w)\Pr(s = 2\mid Z = z,\zeta = \mathrm x,W = w) = \Pr\big(\epsilon_{D1}\le z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\mid\epsilon_s\le t(z,w)\big)\Pr\big(\epsilon_s\le t(z,w)\big) + \Pr\big(\epsilon_{D2}\le z^\top\gamma_{z2} + \mathrm x_2\gamma_{\zeta 2}\mid\epsilon_s > t(z,w)\big)\Pr\big(\epsilon_s > t(z,w)\big).$$
It is easy to see that the conditional density of $\epsilon_{D1}$ given $\epsilon_s\le t(z,w)$ is
$$f_{\epsilon_{D1}}\big(e\mid\epsilon_s\le t(z,w)\big) = \frac{f_{\epsilon_{D1}}(e)}{\Pr(\epsilon_s\le t(z,w))}\Pr\big(\epsilon_s\le t(z,w)\mid\epsilon_{D1} = e\big) = \frac{\phi(e)}{\Phi(t(z,w))}\Phi\bigg(\frac{t(z,w) - \rho_1e}{\sqrt{1-\rho_1^2}}\bigg).$$
Similarly, the conditional density of $\epsilon_{D2}$ given $\epsilon_s > t(z,w)$ is
$$f_{\epsilon_{D2}}\big(e\mid\epsilon_s > t(z,w)\big) = \frac{\phi(e)}{1 - \Phi(t(z,w))}\bigg(1 - \Phi\bigg(\frac{t(z,w) - \rho_2e}{\sqrt{1-\rho_2^2}}\bigg)\bigg).$$
Thus, we have
$$\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w) = \int_{-\infty}^{z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}}\phi(e)\,\Phi\bigg(\frac{t(z,w) - \rho_1e}{\sqrt{1-\rho_1^2}}\bigg)\mathrm de + \int_{-\infty}^{z^\top\gamma_{z2} + \mathrm x_2\gamma_{\zeta 2}}\phi(e)\bigg(1 - \Phi\bigg(\frac{t(z,w) - \rho_2e}{\sqrt{1-\rho_2^2}}\bigg)\bigg)\mathrm de.$$
Taking the partial derivative with respect to $\mathrm x_1$ leads to
$$\frac{\partial}{\partial\mathrm x_1}\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w) = \gamma_{\zeta 1}\cdot\phi\big(z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big)\Phi\bigg(\frac{t(z,w) - \rho_1\big[z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big]}{\sqrt{1-\rho_1^2}}\bigg). \tag{C.2}$$
Then, since $\alpha_w > 0$, we have
$$\lim_{w\to\infty}\frac{\partial}{\partial\mathrm x_1}\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w) = \gamma_{\zeta 1}\cdot\phi\big(z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big).$$
For another realization $\mathrm x'_1$ of $\zeta_1$ such that $|\mathrm x_1|\neq|\mathrm x'_1|$, we have
$$\frac{\lim_{w\to\infty}\partial\Pr(D = 1\mid Z = z,\zeta_1 = \mathrm x_1,W = w)/\partial\mathrm x_1}{\lim_{w\to\infty}\partial\Pr(D = 1\mid Z = z,\zeta_1 = \mathrm x'_1,W = w)/\partial\mathrm x'_1} = \exp\bigg(\frac 1 2\Big[\big[(\mathrm x'_1)^2 - \mathrm x_1^2\big]\gamma_{\zeta 1}^2 + 2\big(\mathrm x'_1 - \mathrm x_1\big)z^\top\gamma_{z1}\gamma_{\zeta 1}\Big]\bigg)$$
$$\Longrightarrow\quad 2\log\bigg[\frac{\lim_{w\to\infty}\partial\Pr(D = 1\mid Z = z,\zeta_1 = \mathrm x_1,W = w)/\partial\mathrm x_1}{\lim_{w\to\infty}\partial\Pr(D = 1\mid Z = z,\zeta_1 = \mathrm x'_1,W = w)/\partial\mathrm x'_1}\bigg] = \big[(\mathrm x'_1)^2 - \mathrm x_1^2\big]\gamma_{\zeta 1}^2 + 2\big(\mathrm x'_1 - \mathrm x_1\big)z^\top\gamma_{z1}\gamma_{\zeta 1}.$$
Thus, the same argument as in the previous subsection gives the identification of $\gamma_{z1}$ and $\gamma_{\zeta 1}$.

To examine the identification of $\rho_1$, we rearrange (C.2) as follows:
$$\frac{1}{\gamma_{\zeta 1}\cdot\phi\big(z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big)}\bigg[\frac{\partial}{\partial\mathrm x_1}\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w)\bigg] = \Phi\bigg(\frac{t(z,w) - \rho_1\big[z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big]}{\sqrt{1-\rho_1^2}}\bigg)$$
$$\Longrightarrow\quad \underbrace{\Phi^{-1}\bigg(\frac{1}{\gamma_{\zeta 1}\cdot\phi\big(z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big)}\bigg[\frac{\partial}{\partial\mathrm x_1}\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w)\bigg]\bigg)}_{=:L(z,\mathrm x_1,w)} = \frac{t(z,w) - \rho_1\big[z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big]}{\sqrt{1-\rho_1^2}}. \tag{C.3}$$
Then, we have
$$L(z,\mathrm x_1,w) - L(z,\mathrm x'_1,w) = \frac{\rho_1\big(\mathrm x'_1 - \mathrm x_1\big)\gamma_{\zeta 1}}{\sqrt{1-\rho_1^2}}\quad\Longrightarrow\quad \frac{L(z,\mathrm x_1,w) - L(z,\mathrm x'_1,w)}{(\mathrm x'_1 - \mathrm x_1)\gamma_{\zeta 1}} = \frac{\rho_1}{\sqrt{1-\rho_1^2}}.$$
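The last display identifies $c := \rho_1/\sqrt{1-\rho_1^2}$, which is strictly increasing in $\rho_1$ and inverts in closed form as $\rho_1 = c/\sqrt{1+c^2}$. A toy numerical check with an illustrative value of $\rho_1$ (not from the paper):

```python
import numpy as np

rho1 = -0.35  # illustrative true correlation

# Forward map delivered by the identification argument: c = rho / sqrt(1 - rho^2).
c = rho1 / np.sqrt(1.0 - rho1**2)

# Closed-form inversion: rho = c / sqrt(1 + c^2), which preserves the sign of c.
rho1_recovered = c / np.sqrt(1.0 + c**2)

assert np.isclose(rho1_recovered, rho1)
```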
Noting that $L(z,\mathrm x_1,w)$ and $L(z,\mathrm x'_1,w)$ are already identified, by solving the above equation, we can identify $\rho_1$. The identification of $\rho_2$ can be established in the same manner.

Finally, to identify $\alpha_w$ and $\alpha_z$, we further rearrange (C.3) as follows:
$$\sqrt{1-\rho_1^2}\,L(z,\mathrm x_1,w) + \rho_1\big[z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big] = z^\top\alpha_z + w\alpha_w.$$
Noting that the left-hand side is identified, we can see that $\alpha_z$ and $\alpha_w$ can be identified under appropriate rank conditions for $(Z,W)$.

C.2 Identification of PRTE
Consider a counterfactual policy that changes $P$ but does not affect $Y^{(d)}_j$, $X$, $\epsilon_{Dj}$, and $s$. Let $P^\star = (P^\star_1,\dots,P^\star_S)$ be a counterfactual version of $P$ whose distribution is known, and let $D^\star$ be the treatment status under $P^\star$. As in Assumption 2.1(i), we assume that $P^\star$ is independent of $(\epsilon^{(d)},\epsilon_{Dj},s)$ given $X$. Denote the outcome after the policy as $Y^\star$. The PRTE is defined as
$$\text{PRTE: } E[Y^\star\mid X = x] - E[Y\mid X = x].$$
Since $E[Y\mid X = x]$ is directly identified from data, we focus on the identification of $E[Y^\star\mid X = x]$. By Assumptions 2.1(i) and 2.2(i),
$$E\big[D^\star Y^{(1)}_j\mid X = x,P^\star = p^\star,s = j\big] = \int\mathbf 1\{p^\star_j\ge v_j\}\,m^{(1)}_j(x,v_j)\,\mathrm dv_j.$$
Similarly, we can show that
$$E\big[(1 - D^\star)Y^{(0)}_j\mid X = x,P^\star = p^\star,s = j\big] = \int\mathbf 1\{p^\star_j < v_j\}\,m^{(0)}_j(x,v_j)\,\mathrm dv_j.$$
As a result, by the law of iterated expectations, we obtain
$$E[Y^\star\mid X = x] = \sum_{j=1}^S\pi_jE\Big[E\big[D^\star Y^{(1)}_j + (1 - D^\star)Y^{(0)}_j\mid X = x,P^\star,s = j\big]\,\Big|\,X = x,s = j\Big] = \sum_{j=1}^S\pi_j\int\Big(\Pr\big(P^\star_j\ge v_j\mid X = x\big)m^{(1)}_j(x,v_j) + \Pr\big(P^\star_j < v_j\mid X = x\big)m^{(0)}_j(x,v_j)\Big)\mathrm dv_j.$$
In this way, we can identify the PRTE through the MTR functions.

References
Andresen, M.E., 2018. Exploring marginal treatment effects: Flexible estimation using Stata, The Stata Journal, 18(1), 118–158.
Angrist, J.D., Imbens, G.W., and Rubin, D.B., 1996. Identification of causal effects using instrumental variables, Journal of the American Statistical Association, 91(434), 444–455.
Brinch, C.N., Mogstad, M., and Wiswall, M., 2017. Beyond LATE with a discrete instrument, Journal of Political Economy, 125(4), 985–1039.
Butler, S.M. and Louis, T.A., 1997. Consistency of maximum likelihood estimators in general random effects models for binary data, The Annals of Statistics, 25(1), 351–377.
Chen, H., Chen, J., and Kalbfleisch, J.D., 2001. A modified likelihood ratio test for homogeneity in finite mixture models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1), 19–29.
Chen, X. and Christensen, T.M., 2015. Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions, Journal of Econometrics, 188(2), 447–465.
Chen, X. and Christensen, T.M., 2018. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression, Quantitative Economics, 9(1), 39–84.
Compiani, G. and Kitamura, Y., 2016. Using mixtures in econometric models: a brief review and some new results, The Econometrics Journal, 19, C95–C127.
Cornelissen, T., Dustmann, C., Raute, A., and Schönberg, U., 2016. From LATE to MTE: Alternative methods for the evaluation of policy interventions, Labour Economics, 41, 47–60.
Deb, P. and Gregory, C.A., 2018. Heterogeneous impacts of the supplemental nutrition assistance program on food insecurity, Economics Letters, 173, 55–60.
Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B, 39(1), 1–22.
Doyle Jr., J.J., 2007. Child protection and child outcomes: Measuring the effects of foster care, American Economic Review, 97(5), 1583–1610.
Follmann, D.A. and Lambert, D., 1991. Identifiability of finite mixtures of logistic regression models, Journal of Statistical Planning and Inference, 27(3), 375–381.
Harris, J.E. and Sosa-Rubi, S.G., 2009. Impact of "Seguro Popular" on prenatal visits in Mexico, 2002–2005: Latent class model of count data with a discrete endogenous variable, NBER Working Paper 14995.
Heckman, J.J. and Pinto, R., 2018. Unordered monotonicity, Econometrica, 86(1), 1–35.
Heckman, J.J. and Vytlacil, E.J., 1999. Local instrumental variables and latent variable models for identifying and bounding treatment effects, Proceedings of the National Academy of Sciences, 96(8), 4730–4734.
Heckman, J.J. and Vytlacil, E.J., 2005. Structural equations, treatment effects, and econometric policy evaluation, Econometrica, 73(3), 669–738.
Holland, A.D., 2017. Penalized spline estimation in the partially linear model, Journal of Multivariate Analysis, 153, 211–235.
Hoshino, T. and Yanagi, T., 2020. Treatment effect models with strategic interaction in treatment decisions, arXiv preprint arXiv:1810.08350.
Huang, J.Z., 2003. Local asymptotics for polynomial spline regression, The Annals of Statistics, 31(5), 1600–1635.
Imbens, G.W. and Angrist, J.D., 1994. Identification and estimation of local average treatment effects, Econometrica, 62(2), 467–475.
Kitagawa, T. and Tetenov, A., 2018. Who should be treated? Empirical welfare maximization methods for treatment choice, Econometrica, 86(2), 591–616.
McLachlan, G. and Peel, D., 2004. Finite Mixture Models, Wiley, New York.
Mogstad, M. and Torgovitsky, A., 2018. Identification and extrapolation of causal effects with instrumental variables, Annual Review of Economics, 10, 577–613.
Munkin, M.K. and Trivedi, P.K., 2010. Disentangling incentives effects of insurance coverage from adverse selection in the case of drug expenditure: a finite mixture approach, Health Economics, 19(9), 1093–1108.
Newey, W.K., 1997. Convergence rates and asymptotic normality for series estimators, Journal of Econometrics, 79(1), 147–168.
Rubin, D.B., 1974. Estimating causal effects of treatments in randomized and nonrandomized studies, Journal of Educational Psychology, 66(5), 688.
Samoilenko, M., Blais, L., Boucoiran, I., and Lefebvre, G., 2018. Using a mixture-of-bivariate-regressions model to explore heterogeneity of effects of the use of inhaled corticosteroids on gestational age and birth weight among pregnant women with asthma, American Journal of Epidemiology, 187(9), 2046–2059.
Shea, J. and Torgovitsky, A., 2020. ivmte: An R package for implementing marginal treatment effect methods, Working Paper, Becker Friedman Institute, 2020-1.
Train, K.E., 2008. EM algorithms for nonparametric estimation of mixing distributions, Journal of Choice Modelling, 1(1), 40–69.
Tropp, J.A., 2012. User-friendly tail bounds for sums of random matrices, Foundations of Computational Mathematics, 12(4), 389–434.
Foundations of Computational Mathematics ,12 (4), 389–434.Vytlacil, E.J., 2002. Independence, monotonicity, and latent index models: An equivalence result,
Econometrica , 70 (1),331–341.Zhou, X. and Xie, Y., 2019. Marginal treatment effects from a propensity score perspective,
Journal of Political Economy ,127 (6), 3070–3084.Zhu, H.T. and Zhang, H., 2004. Hypothesis testing in mixture regression models,
Journal of the Royal Statistical Society:Series B (Statistical Methodology) , 66 (1), 3–16., 66 (1), 3–16.