Estimating Marginal Treatment Effects under Unobserved Group Heterogeneity
Tadao Hoshino † and Takahide Yanagi ‡
This version: April 2020. First version: January 2020.
Abstract
This paper studies endogenous treatment effect models in which individuals are classified into unobserved groups based on heterogeneous treatment choice rules. Such heterogeneity may arise, for example, when multiple treatment eligibility criteria and different preference patterns exist. Using a finite mixture approach, we propose a marginal treatment effect (MTE) framework in which the treatment choice and outcome equations can be heterogeneous across groups. Under the availability of valid instrumental variables specific to each group, we show that the MTE for each group can be separately identified using the local instrumental variable method. Based on our identification result, we propose a two-step semiparametric procedure for estimating the group-wise MTE parameters. We first estimate the finite-mixture treatment choice model by a maximum likelihood method and then estimate the MTEs using a series approximation method. We prove that the proposed MTE estimator is consistent and asymptotically normally distributed. We illustrate the usefulness of the proposed method with an application to economic returns to college education.
Keywords: endogeneity, finite mixture, instrumental variables, marginal treatment effects, unobserved heterogeneity.
JEL Classification: C14, C31, C35.

∗ The authors thank Toru Kitagawa, Yasushi Kondo, Ryo Okui, Myung Hwan Seo, Hisatoshi Tanaka, Yuta Toyama, and seminar participants at Hitotsubashi University, Seoul National University, and Waseda University for their valuable comments. This work was supported by JSPS KAKENHI Grant Numbers 15K17039 and 19H01473.
† School of Political Science and Economics, Waseda University, 1-6-1 Nishi-waseda, Shinjuku-ku, Tokyo 169-8050, Japan. Email: [email protected].
‡ Graduate School of Economics, Kyoto University, Yoshida Honmachi, Sakyo, Kyoto 606-8501, Japan. Email: [email protected].

1 Introduction
Assessing heterogeneity in treatment effects with respect to observed and unobserved characteristics is an important issue for precise treatment evaluation. The marginal treatment effect (MTE) framework has become increasingly popular in the econometrics literature as a way to explain the heterogeneous nature of treatment effects (Heckman and Vytlacil, 1999, 2005). The MTE is a treatment parameter defined as the average treatment effect conditional on both observed individual characteristics and the unobserved individual cost of the treatment. The MTE framework is useful in several respects. First, the framework can be used in non-experimental applications in which individuals may endogenously determine their own treatment status. The endogeneity of the treatment can be dealt with using the method of local instrumental variables (IVs). Moreover, once the MTE is estimated, it can be used to build other treatment parameters such as the average treatment effect (ATE) and the local average treatment effect (LATE). For a recent overview of the MTE approach and its practical implementation, see, for example, Cornelissen et al. (2016), Andresen (2018), Mogstad and Torgovitsky (2018), Zhou and Xie (2019), and Shea and Torgovitsky (2020).

While conventional treatment evaluation methods, including the MTE framework, can address heterogeneity among individuals and across observable groups of individuals, many applications may exhibit "unobserved" group-wise heterogeneity in treatment effects for various reasons. For example, the presence of multiple treatment eligibility criteria may create unobserved groups. As a typical application, consider evaluating the causal effect of college education. Since schools typically offer a variety of admission options (entrance exams, sports referrals, and so on), this process classifies individuals into several groups, and the admission criteria to which each individual has applied are typically unknown to us.
Such differences in admission requirements may result in heterogeneous treatment effects of college education. Another potential reason for the presence of unobserved group heterogeneity is that the population may be composed of groups with different preference patterns. For instance, consider estimating the causal effect of foster care for abused children, as in Doyle Jr. (2007). Here, the treatment variable of interest is whether the child is placed into foster care by the child protection investigator. The author discusses the possibility that child protection investigators may have different preference patterns that place relative emphasis on child protection. These examples suggest unobserved group patterns in the treatment choice process that may lead to heterogeneity in treatment effects.

In this paper, we study endogenous treatment effect models in which individuals are grouped into latent subpopulations, where the presence of the latent groups is accounted for by a finite mixture model. Finite mixture approaches have been successfully used in various fields, such as biology, economics, marketing, and medicine, to analyze data from heterogeneous subpopulations (McLachlan and Peel, 2004). However, the use of finite mixture models in treatment evaluation has been considered only in recent years in a few specific applications (e.g., Harris and Sosa-Rubi, 2009; Munkin and Trivedi, 2010; Deb and Gregory, 2018; Samoilenko et al., 2018). Compared with these studies, our modeling approach is more general and formally builds on Rubin's (1974) causal model by directly extending it to finite mixture models. In particular, we allow the treatment choice and outcome equations to be fully heterogeneous across groups. For this model, we develop an identification and estimation procedure for the MTE parameters that can be unique to each latent group.
The proposed group-wise MTE is a novel framework in the literature, which should be informative for understanding the heterogeneous nature of treatment effects by capturing both group-level and individual-level unobserved heterogeneity simultaneously. Importantly, as we discuss below, the presence of unobserved group heterogeneity threatens the validity of conventional IV-based causal inference methods, such as the conventional MTE approach and the two-stage least squares (2SLS) approach for estimating the LATE. Specifically, we demonstrate that the presence of unobserved heterogeneous groups may invalidate the monotonicity condition (e.g., Imbens and Angrist, 1994; Angrist et al., 1996; Vytlacil, 2002; Heckman and Pinto, 2018), which is an essential identification condition on which these approaches are based.

Our identification strategy builds on the method of local IVs by Heckman and Vytlacil (1999). Our main identification result requires two key conditions. The first condition is that there exists a valid group-specific continuous IV. As in standard IV estimation, each group-specific IV must satisfy the exclusion restriction (i.e., the IV is independent of unobserved variables) and the relevance restriction (i.e., the IV is a determinant of the treatment). The second condition is the exogeneity of group membership; that is, group membership is conditionally independent of the unobserved variables affecting treatment choices. Since the second condition may be demanding in some applications, we also provide supplementary identification results for when group membership is endogenous.

Based on our constructive identification results, we propose a two-step semiparametric estimator for the group-wise MTEs. In the first step, we estimate a finite-mixture treatment choice model using a parametric maximum likelihood (ML) method. In the second step, the MTE parameters are estimated using a series approximation method.
Under certain regularity conditions, we show that the proposed MTE estimator is consistent and asymptotically normally distributed.

As an empirical illustration, we estimate the effects of college education on annual income using labor data from Japan. We find that our observations can be classified into two latent groups: one (group 1) consists of individuals whose college enrollment decisions are affected by family characteristics such as parental education level and economic conditions, and the other (group 2) is composed of individuals who are affected by the regional educational environment. Our empirical results indicate that for group 1, the treatment effect of college education is significantly positive if the (unobserved) cost of going to college is small. In contrast, we find no such heterogeneity for group 2.
Organization of the paper.
Section 2 introduces our model and presents our main identification result for the group-wise MTE. In Section 3, we discuss the estimation procedure for the MTE parameters and prove its asymptotic properties. Section 4 provides two additional discussions: first, on the identification of the MTE when group membership is endogenous and, second, on the identification of the LATE when only binary IVs are available. Section 5 presents Monte Carlo experiments and the empirical illustration, and Section 6 concludes the paper. In Appendices A and B, we provide technical proofs for our results. Appendix C presents supplementary identification results.

2 Model
In this section, we introduce our treatment effect model that allows for the presence of an unknown mixture of multiple subpopulations. We assume that the number of groups is finite and known, denoted by $S$. Each individual belongs to exactly one of the $S$ groups, and the group the individual belongs to, which we denote by $s \in \{1, \ldots, S\}$, is a latent variable unknown to us. Our goal is to measure the causal effect of a potentially endogenous treatment variable $D \in \{0, 1\}$ on an outcome variable $Y \in \mathbb{R}$ for each group separately. Let $Y_j^{(d)}$ be the potential outcome when $s = j$ and $D = d$. Then, the observed outcome can be written as $Y = \sum_{j=1}^{S} \mathbf{1}\{s = j\}[D Y_j^{(1)} + (1 - D) Y_j^{(0)}]$, where $\mathbf{1}\{\cdot\}$ is the indicator function that takes one if the argument inside is true and zero otherwise. Suppose that the potential outcome equation is given by
\[ Y_j^{(d)} = \mu_j^{(d)}(X, \epsilon^{(d)}), \quad \text{for } j \in \{1, \ldots, S\} \text{ and } d \in \{0, 1\}, \tag{2.1} \]
where $X \in \mathbb{R}^{\dim(X)}$ is a vector of observed covariates, $\epsilon^{(d)} \in \mathbb{R}$ is an unobserved error term, and $\mu_j^{(d)}$ is an unknown structural function. This model specification is fairly general in that the functional form of $\mu_j^{(d)}$ is fully unrestricted. Our model implies that the distribution of the treatment effect $Y_j^{(1)} - Y_j^{(0)}$ is potentially heterogeneous across different groups.

Based on the latent index framework by Heckman and Vytlacil (1999, 2005), we characterize our treatment choice model as follows:
\[ D = \begin{cases} \mathbf{1}\{\mu_{D1}(Z_1) \ge \epsilon_{D1}\} & \text{if } s = 1, \\ \quad \vdots & \\ \mathbf{1}\{\mu_{DS}(Z_S) \ge \epsilon_{DS}\} & \text{if } s = S, \end{cases} \tag{2.2} \]
where for $j \in \{1, \ldots, S\}$, $Z_j \in \mathbb{R}^{\dim(Z_j)}$ is a vector of IVs that may contain elements of $X$, $\epsilon_{Dj} \in \mathbb{R}$ is an unobserved continuous random variable, and $\mu_{Dj}$ is an unknown function. We allow for arbitrary dependence between the $\epsilon_{Dj}$'s. Assume that for all $j$, the error $\epsilon_{Dj}$ is independent of the IVs $Z_j$'s conditional on $X$.
Moreover, we require that each $Z_j$ include at least one group-specific continuous variable to ensure that the function $\mu_{Dj}(Z_j)$ does not degenerate to a constant after conditioning on the values of $(X, Z_1, \ldots, Z_{j-1}, Z_{j+1}, \ldots, Z_S)$.

Let $F_j(\cdot \mid X)$ be the conditional cumulative distribution function (CDF) of $\epsilon_{Dj}$. Further, let $P_j := F_j(\mu_{Dj}(Z_j) \mid X)$ and $V_j := F_j(\epsilon_{Dj} \mid X)$ for $j \in \{1, \ldots, S\}$. By construction, each $V_j$ is distributed as Uniform$[0,1]$ conditional on $X$. Using these definitions, we can rewrite (2.2) as follows: $D = \mathbf{1}\{P_j \ge V_j\}$ if $s = j$, for $j \in \{1, \ldots, S\}$.

Remark 2.1 (Monotonicity). The presence of group heterogeneity in the treatment choice model may lead to the failure of the monotonicity condition in Imbens and Angrist (1994) and Angrist et al. (1996), which requires that shifts in the IVs determine the direction of change in the treatment choices uniformly in all individuals. To see this, for simplicity, consider a case with $S = 2$ and suppose that $\mu_{Dj}(Z_j) = Z\gamma_{zj} + \zeta_j \gamma_{\zeta j}$ for $j \in \{1, 2\}$, where $Z$ is a common IV among the groups, $\zeta_j$ is an IV specific to group $j$, and $Z_j = (Z, \zeta_j)$. Suppose that $\gamma_{z1} < 0$ and $\gamma_{z2} > 0$. Then, an increase in $Z$ makes the individuals in group 1 (group 2) less (more) likely to take the treatment, implying that the monotonicity condition does not hold. As a result, the conventional IV-based causal inference methods that rely on the monotonicity condition cannot be used as long as one focuses on the group-common IV $Z$ only. Note, however, that the monotonicity condition is still satisfied in terms of the group-specific IVs $(\zeta_1, \zeta_2)$. Thus, if we run a 2SLS regression using $(\zeta_1, \zeta_2)$ instead of $Z$ as IVs for the treatment, we would obtain some causal effects averaged over the groups, as will be demonstrated in Subsection 4.2.
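The monotonicity failure in Remark 2.1 can be checked in a short simulation. The sketch below uses purely hypothetical values (equal group shares, $\gamma_{z1} = -1$, $\gamma_{z2} = 1$, standard normal choice errors, and no group-specific IVs) and counts individuals who move into and out of treatment when the common instrument $Z$ is shifted upward:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent group membership: equal shares (hypothetical values).
s = rng.integers(1, 3, size=n)

# Opposite-signed coefficients on the common instrument Z, as in Remark 2.1.
gamma_z = {1: -1.0, 2: 1.0}
eps = rng.normal(size=n)  # choice error, held fixed across counterfactual Z values

def treat(z):
    """D = 1{mu_Dj(Z) >= eps_Dj} with mu_Dj(z) = z * gamma_zj for group j."""
    mu = np.where(s == 1, gamma_z[1] * z, gamma_z[2] * z)
    return (mu >= eps).astype(int)

d_low, d_high = treat(-1.0), treat(1.0)  # shift the instrument up

compliers = np.mean((d_low == 0) & (d_high == 1))  # moved into treatment
defiers = np.mean((d_low == 1) & (d_high == 0))    # moved out of treatment

print(compliers, defiers)  # both strictly positive
```

Because both proportions are strictly positive, the shift in $Z$ creates compliers and defiers at the same time, which is exactly the violation of uniform monotonicity discussed in the remark.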
However, this approach may overlook the possibility of heterogeneous treatment effects across groups.

For the treatment choice model in (2.2), we can interpret its meaning in several ways. The first interpretation is that there are actually multiple different treatment eligibility rules prescribed by policy makers. In the example of college enrollment, there are typically several different types of admission processes for each school, for example, paper-based entrance exams, sports referrals, and so on. Such a situation would correspond to this first type of interpretation. Another interpretation is that there are several types of treatment preference patterns. For example, consider again $D = 1$ if an individual goes to college and $D = 0$ otherwise. Suppose that a common instrumental variable $Z$ is the introduction of a physical education requirement in colleges along with mandatory augmented athletics facilities. When we specify the functional form of $\mu_{Dj}$ as in Remark 2.1, we imagine that some people dislike physical education ($\gamma_{z1} < 0$) while others like it ($\gamma_{z2} > 0$). In this situation, we can view the treatment choice model (2.2) as a binary response model with a discrete random coefficient.

Our main identification results are based on the following assumptions:
Assumption 2.1. (i) The IVs $Z = (Z_1, \ldots, Z_S)$ are independent of $(\epsilon^{(d)}, \epsilon_{Dj}, s)$ given $X$ for all $d$ and $j$. (ii) For each $j$, $Z_j$ has at least one group-specific continuous variable that is not included in $X$ or in the IVs for the other groups.

Assumption 2.2. (i) The membership variable $s$ is conditionally independent of $\epsilon_{Dj}$ given $X$ for all $j$. (ii) For each $j$, there exists a constant $\pi_j \in (0,1)$ such that $\Pr(s = j \mid X) = \pi_j$ and $\sum_{j=1}^{S} \pi_j = 1$.

Assumption 2.1(i) is an exclusion restriction requiring that the IVs are conditionally independent of all unobserved random variables, including the latent group membership. Assumption 2.1(ii) is somewhat demanding in that we require prior knowledge as to which variables may be relevant/irrelevant to the membership of each group. A similar assumption can be found in the econometrics literature on finite mixture models (e.g., Compiani and Kitamura, 2016). In practice, we can determine which IVs should belong to which group by examining their significance in each group or by some theoretical guidance. The conditional independence in Assumption 2.2(i) excludes some types of endogenous group formation, which could be restrictive in some empirical situations. Even without this assumption, we can establish some identification results for the MTE parameters, as shown in Subsection 4.1. In Assumption 2.2(ii), we assume a homogeneous membership probability for each group, which is restrictive but commonly used in the literature on finite mixture models. In fact, this assumption is made only for simplicity, and the theorem shown below still holds without modifications even when the membership probability is a function of $X$.
Moreover, we can identify the group-wise MTE even when the membership probability depends on other covariates besides $X$; see Subsection 4.1.

To identify the treatment effects of interest, we first need to identify the treatment choice model (2.2). Although there has been a long history of research on the identification of finite mixture models, only a few results are directly applicable here; in the remainder of this section, we treat the $P_j$'s and $\pi_j$'s as known objects.

(Footnote: This example scenario is borrowed from Heckman and Vytlacil (2005), Subsection 6.3.)

The MTE parameter specific to group $j$ is defined as
\[ \mathrm{MTE}_j(x, p) := m_j^{(1)}(x, p) - m_j^{(0)}(x, p), \tag{2.3} \]
where $m_j^{(d)}(x, p) := E[Y_j^{(d)} \mid X = x, s = j, V_j = p]$ is the marginal treatment response (MTR) function specific to group $j$ and $d \in \{0, 1\}$. This expression implies that the identification of the MTE follows directly from that of the MTR functions. Below, we show that the MTR functions can be identified through the partial derivatives of the following functions:
\[ \psi_1(x, p) := E[DY \mid X = x, P = p], \qquad \psi_0(x, p) := E[(1 - D)Y \mid X = x, P = p], \]
where $P = (P_1, \ldots, P_S)$ and $p = (p_1, \ldots, p_S)$. Note that $\psi_d(x, p)$ can be directly identified from the data on $\mathrm{supp}[X, P \mid D = d]$, where $\mathrm{supp}[X, P \mid D = d]$ denotes the joint support of $(X, P)$ given $D = d$.

Theorem 2.1.
Suppose that Assumptions 2.1 and 2.2 hold. If $m_j^{(1)}(x, \cdot)$ and $m_j^{(0)}(x, \cdot)$ are continuous, we have
\[ m_j^{(1)}(x, p_j) = \frac{1}{\pi_j} \frac{\partial \psi_1(x, p)}{\partial p_j}, \qquad m_j^{(0)}(x, p_j) = -\frac{1}{\pi_j} \frac{\partial \psi_0(x, p)}{\partial p_j}. \]

Once the group-wise MTEs are identified for all $p$ and $x$ based on this result, we can assess heterogeneity in treatment effects. Furthermore, we can identify many other treatment parameters. For example, $\mathrm{CATE}_j(x) = \int_0^1 \mathrm{MTE}_j(x, v)\,dv$, where $\mathrm{CATE}_j(x) = E[Y_j^{(1)} - Y_j^{(0)} \mid X = x, s = j]$ is the group-wise conditional average treatment effect. Furthermore, the group-wise ATE, $\mathrm{ATE}_j = E[Y_j^{(1)} - Y_j^{(0)} \mid s = j]$, can be obtained as $\mathrm{ATE}_j = \int \mathrm{CATE}_j(x) f_X(x)\,dx$, where $f_X$ is the marginal density function of $X$ (note that Assumption 2.2(ii) implies $f_X(\cdot \mid s = j) = f_X(\cdot)$). Then, the ATE for the whole population is simply given by $\mathrm{ATE} = \sum_{j=1}^{S} \pi_j \mathrm{ATE}_j$. We can also identify the so-called policy relevant treatment effect (PRTE), as in Heckman and Vytlacil (2005); see Appendix C.2 for the details.

3 Estimation

This section considers the estimation of the group-wise MTE parameters using an independent and identically distributed (IID) sample $\{(Y_i, D_i, X_i, Z_i) : 1 \le i \le n\}$. Throughout this section, Assumptions 2.1 and 2.2 are assumed to hold.

3.1 Two-step series estimation

First-stage: ML estimation.
To estimate the treatment choice model, we consider the following parametric model specification:
\[ D = \mathbf{1}\{Z_j^\top \gamma_j \ge \epsilon_{Dj}\} \quad \text{with probability } \pi_j > 0, \quad \text{for } j \in \{1, \ldots, S\}. \tag{3.1} \]
We assume that $\epsilon_{Dj}$ is independent of $(X, Z)$, and the CDF $F_j$ of $\epsilon_{Dj}$ is a known function (such as the standard normal or logistic distribution). Define $\gamma := (\gamma_1^\top, \ldots, \gamma_S^\top)^\top$ and $\pi := (\pi_1, \ldots, \pi_S)^\top$. Then, the conditional likelihood function for an observation $i$ when $s_i = j$ is given by
\[ L_i(\gamma \mid s_i = j) := F_j(Z_{ji}^\top \gamma_j)^{D_i} \left[1 - F_j(Z_{ji}^\top \gamma_j)\right]^{1 - D_i}. \]
Thus, the ML estimator for $\gamma$ and $\pi$ can be obtained by
\[ (\hat\gamma_n, \hat\pi_n) := \operatorname*{argmax}_{(\tilde\gamma, \tilde\pi) \in \Gamma \times \mathcal{C}^S} \sum_{i=1}^{n} \log \sum_{j=1}^{S} \tilde\pi_j L_i(\tilde\gamma \mid s_i = j), \]
where $\Gamma \subset \mathbb{R}^{\sum_{j=1}^{S} \dim(Z_j)}$ and $\mathcal{C}^S := \{\tilde\pi \in (0,1)^S : \sum_{j=1}^{S} \tilde\pi_j = 1\}$ are the parameter spaces. We then obtain the estimator of $P_j = F_j(Z_j^\top \gamma_j)$ as $\hat P_j = F_j(Z_j^\top \hat\gamma_{n,j})$. In the numerical studies below, we use the expectation-maximization (EM) algorithm to solve this ML problem, following the literature on finite mixture models (e.g., Dempster et al., 1977; McLachlan and Peel, 2004; Train, 2008).

Second-stage: series estimation.
For the potential outcome equation, we assume the following linear model for convenience, which is a popular setup in the literature: for each $d$ and $j$,
\[ Y_j^{(d)} = X^\top \beta_j^{(d)} + \epsilon^{(d)}. \tag{3.2} \]
Here, the error term $\epsilon^{(d)}$ can generally depend on the group membership $j$, but we suppress the dependence for expositional simplicity. Assume that $X$ is independent of $(\epsilon^{(d)}, \epsilon_{Dj}, s)$ for all $d$ and $j$. Then, we have
\[ \mathrm{MTE}_j(x, p) = m_j^{(1)}(x, p) - m_j^{(0)}(x, p) = x^\top(\beta_j^{(1)} - \beta_j^{(0)}) + E[\epsilon^{(1)} - \epsilon^{(0)} \mid s = j, V_j = p], \]
with $m_j^{(d)}(x, p) = x^\top \beta_j^{(d)} + E[\epsilon^{(d)} \mid s = j, V_j = p]$. By the same argument as in the proof of Theorem 2.1, we can show that there exist univariate functions $g_j^{(1)}$ and $g_j^{(0)}$ for each $j$ satisfying
\[ E[\epsilon^{(1)} \mid s = j, V_j \le p_j] = \frac{1}{p_j} \int_0^{p_j} E[\epsilon^{(1)} \mid s = j, V_j = v]\,dv =: \frac{g_j^{(1)}(p_j)}{p_j}, \]
\[ E[\epsilon^{(0)} \mid s = j, V_j > p_j] = \frac{1}{1 - p_j} \int_{p_j}^{1} E[\epsilon^{(0)} \mid s = j, V_j = v]\,dv =: \frac{g_j^{(0)}(p_j)}{1 - p_j}. \]
Then, letting $\nabla g_j^{(d)}(p) := \partial g_j^{(d)}(p)/\partial p$, we observe that $\nabla g_j^{(1)}(p) = E[\epsilon^{(1)} \mid s = j, V_j = p]$ and $\nabla g_j^{(0)}(p) = -E[\epsilon^{(0)} \mid s = j, V_j = p]$. Hence, it follows that
\[ m_j^{(1)}(x, p) = x^\top \beta_j^{(1)} + \nabla g_j^{(1)}(p), \qquad m_j^{(0)}(x, p) = x^\top \beta_j^{(0)} - \nabla g_j^{(0)}(p). \]
Moreover,
\[ \psi_1(x, p) = \sum_{j=1}^{S} E[Y_j^{(1)} \mid X = x, s = j, V_j \le p_j] \cdot \pi_j p_j = \sum_{j=1}^{S} \left( (x \cdot \pi_j p_j)^\top \beta_j^{(1)} + \pi_j g_j^{(1)}(p_j) \right), \]
\[ \psi_0(x, p) = \sum_{j=1}^{S} E[Y_j^{(0)} \mid X = x, s = j, V_j > p_j] \cdot \pi_j (1 - p_j) = \sum_{j=1}^{S} \left( (x \cdot \pi_j (1 - p_j))^\top \beta_j^{(0)} + \pi_j g_j^{(0)}(p_j) \right). \]
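The defining relations between the $g_j^{(d)}$ functions and the conditional means of the errors can be verified directly on a toy example. The sketch below assumes hypothetical linear conditional means $E[\epsilon^{(1)} \mid s = j, V_j = v] = v - 1/2$ and $E[\epsilon^{(0)} \mid s = j, V_j = v] = 0.3 - 0.6v$ (both integrating to zero over $[0,1]$), writes $g_j^{(1)}$ and $g_j^{(0)}$ in closed form, and checks $\nabla g_j^{(1)}(p) = E[\epsilon^{(1)} \mid V_j = p]$ and $\nabla g_j^{(0)}(p) = -E[\epsilon^{(0)} \mid V_j = p]$ by finite differences:

```python
import numpy as np

# Hypothetical conditional means of the outcome errors given V_j = v.
mean_eps1 = lambda v: v - 0.5          # E[eps(1) | V_j = v]
mean_eps0 = lambda v: 0.3 - 0.6 * v    # E[eps(0) | V_j = v]

# g-functions as defined in the text (closed-form integrals of the lines above):
g1 = lambda p: 0.5 * p**2 - 0.5 * p    # integral_0^p (v - 1/2) dv
g0 = lambda p: 0.3 * p**2 - 0.3 * p    # integral_p^1 (0.3 - 0.6 v) dv

# Central differences recover the conditional means, with a sign flip for d = 0:
# grad g1(p) = E[eps(1)|V=p],  grad g0(p) = -E[eps(0)|V=p].
h = 1e-5
grid = np.linspace(0.1, 0.9, 9)
err1 = max(abs((g1(p + h) - g1(p - h)) / (2 * h) - mean_eps1(p)) for p in grid)
err0 = max(abs((g0(p + h) - g0(p - h)) / (2 * h) + mean_eps0(p)) for p in grid)
print(err1, err0)  # both numerically zero
```

The sign flip for $d = 0$ comes from differentiating an integral whose lower limit is $p$, which is why $m_j^{(0)}$ above subtracts $\nabla g_j^{(0)}(p)$.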
Hence, we obtain the following partially linear additive regression models:
\[ DY = \sum_{j=1}^{S} (X \cdot \pi_j P_j)^\top \beta_j^{(1)} + \sum_{j=1}^{S} \pi_j g_j^{(1)}(P_j) + \varepsilon^{(1)}, \tag{3.3} \]
\[ (1 - D)Y = \sum_{j=1}^{S} (X \cdot \pi_j (1 - P_j))^\top \beta_j^{(0)} + \sum_{j=1}^{S} \pi_j g_j^{(0)}(P_j) + \varepsilon^{(0)}, \tag{3.4} \]
where $E[\varepsilon^{(d)} \mid X, P] = 0$ by the definition of $\psi_d$ for both $d \in \{0, 1\}$. To estimate the coefficients $\beta_j^{(d)}$ and the functions $g_j^{(d)}$, we use the series (sieve) approximation method such that $g_j^{(d)}(p) \approx b_K(p)^\top \alpha_j^{(d)}$ with a $K \times 1$ vector of basis functions $b_K(p) = (b_{1K}(p), \ldots, b_{KK}(p))^\top$ and corresponding coefficients $\alpha_j^{(d)}$.

To proceed, letting $d_{SXK} := S(\dim(X) + K)$, define the $d_{SXK} \times 1$ vectors
\[ \theta^{(d)} := (\beta_1^{(d)\top}, \ldots, \beta_S^{(d)\top}, \alpha_1^{(d)\top}, \ldots, \alpha_S^{(d)\top})^\top \quad \text{for } d \in \{0, 1\}, \]
\[ R_K^{(1)} := (\pi_1 P_1 X^\top, \ldots, \pi_S P_S X^\top, \pi_1 b_K(P_1)^\top, \ldots, \pi_S b_K(P_S)^\top)^\top, \]
\[ R_K^{(0)} := (\pi_1 (1 - P_1) X^\top, \ldots, \pi_S (1 - P_S) X^\top, \pi_1 b_K(P_1)^\top, \ldots, \pi_S b_K(P_S)^\top)^\top. \]
Then, we can approximate the regression models (3.3) and (3.4), respectively, by
\[ DY \approx R_K^{(1)\top} \theta^{(1)} + \varepsilon^{(1)}, \qquad (1 - D)Y \approx R_K^{(0)\top} \theta^{(0)} + \varepsilon^{(0)}, \tag{3.5} \]
which implies that we can estimate $\theta^{(d)}$ by
\[ \tilde\theta_n^{(1)} := \left( \sum_{i=1}^{n} R_{i,K}^{(1)} R_{i,K}^{(1)\top} \right)^{-} \sum_{i=1}^{n} R_{i,K}^{(1)} D_i Y_i, \qquad \tilde\theta_n^{(0)} := \left( \sum_{i=1}^{n} R_{i,K}^{(0)} R_{i,K}^{(0)\top} \right)^{-} \sum_{i=1}^{n} R_{i,K}^{(0)} (1 - D_i) Y_i, \]
where $A^{-}$ is a generalized inverse of a matrix $A$. Note, however, that the $\tilde\theta_n^{(d)}$'s are infeasible since the $P_j$'s and $\pi_j$'s are unknown in practice. Then, define $\hat R_K^{(d)}$ analogously as above but replacing the $P_j$'s and $\pi_j$'s with their estimators obtained in the first step.
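The first-step estimates $(\hat\pi_j, \hat\gamma_{n,j})$ plugged in here can be computed with the EM algorithm mentioned above. The following is a minimal sketch for $S = 2$ with scalar group-specific instruments, a logistic $F_j$, and one Newton step per M-step; all simulation parameters, sample size, and starting values are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, S = 40_000, 2

# --- Simulate the finite-mixture choice model (3.1); all numbers hypothetical.
pi_true = np.array([0.4, 0.6])
gamma_true = np.array([1.5, -1.5])            # gamma_1, gamma_2 (scalar IVs)
zeta = rng.normal(size=(n, S))                # group-specific instruments
s = rng.choice([0, 1], size=n, p=pi_true)     # latent membership (0-indexed)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic CDF F_j
index = gamma_true[s] * zeta[np.arange(n), s]
D = (rng.uniform(size=n) <= sigmoid(index)).astype(float)

# --- EM: E-step posterior weights, M-step weighted logit fits via Newton.
pi_hat = np.array([0.5, 0.5])
gamma_hat = np.array([1.0, -1.0])             # crude initial values
for _ in range(200):
    # E-step: w[i, j] proportional to pi_j * L_i(gamma | s_i = j)
    p = sigmoid(gamma_hat * zeta)             # n x S choice probabilities
    lik = np.where(D[:, None] == 1.0, p, 1.0 - p)
    w = pi_hat * lik
    w /= w.sum(axis=1, keepdims=True)
    # M-step: update pi by average weights, gamma_j by one Newton step
    pi_hat = w.mean(axis=0)
    for j in range(S):
        pj = sigmoid(gamma_hat[j] * zeta[:, j])
        score = np.sum(w[:, j] * (D - pj) * zeta[:, j])
        hess = -np.sum(w[:, j] * pj * (1.0 - pj) * zeta[:, j] ** 2)
        gamma_hat[j] -= score / hess

print(pi_hat, gamma_hat)  # near (0.4, 0.6) and (1.5, -1.5)
```

The E-step computes posterior membership weights $w_{ij} \propto \pi_j L_i(\gamma \mid s_i = j)$, and the M-step updates $\pi$ by the average weights and each $\gamma_j$ by a weighted logit fit, mirroring the mixture log-likelihood displayed above.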
The feasible estimators can be obtained by
\[ \hat\theta_n^{(1)} := \left( \sum_{i=1}^{n} \hat R_{i,K}^{(1)} \hat R_{i,K}^{(1)\top} \right)^{-} \sum_{i=1}^{n} \hat R_{i,K}^{(1)} D_i Y_i, \qquad \hat\theta_n^{(0)} := \left( \sum_{i=1}^{n} \hat R_{i,K}^{(0)} \hat R_{i,K}^{(0)\top} \right)^{-} \sum_{i=1}^{n} \hat R_{i,K}^{(0)} (1 - D_i) Y_i. \]
The feasible estimator of $g_j^{(d)}(p)$ is given by $\hat g_j^{(d)}(p) = b_K(p)^\top \hat\alpha_{n,j}^{(d)}$, and the infeasible estimator is given by $\tilde g_j^{(d)}(p) = b_K(p)^\top \tilde\alpha_{n,j}^{(d)}$. Letting $\nabla b_K(p) := \partial b_K(p)/\partial p$, we can also estimate $\nabla g_j^{(d)}(p)$ by $\nabla \hat g_j^{(d)}(p) := \nabla b_K(p)^\top \hat\alpha_{n,j}^{(d)}$ and $\nabla \tilde g_j^{(d)}(p) := \nabla b_K(p)^\top \tilde\alpha_{n,j}^{(d)}$. Thus, we obtain the estimators of the MTR functions as follows:
\[ \tilde m_j^{(1)}(x, p) = x^\top \tilde\beta_{n,j}^{(1)} + \nabla \tilde g_j^{(1)}(p), \qquad \hat m_j^{(1)}(x, p) = x^\top \hat\beta_{n,j}^{(1)} + \nabla \hat g_j^{(1)}(p), \]
\[ \tilde m_j^{(0)}(x, p) = x^\top \tilde\beta_{n,j}^{(0)} - \nabla \tilde g_j^{(0)}(p), \qquad \hat m_j^{(0)}(x, p) = x^\top \hat\beta_{n,j}^{(0)} - \nabla \hat g_j^{(0)}(p). \]
Finally, the feasible estimator of the MTE is given by
$\widehat{\mathrm{MTE}}_j(x, p) = \hat m_j^{(1)}(x, p) - \hat m_j^{(0)}(x, p)$, and the infeasible estimator is given by $\widetilde{\mathrm{MTE}}_j(x, p) = \tilde m_j^{(1)}(x, p) - \tilde m_j^{(0)}(x, p)$.

3.2 Asymptotic properties

This section presents the asymptotic properties of the proposed MTE estimators for the model given by equations (3.1) and (3.2). In the following, for a vector or matrix $A$, we denote its Frobenius norm as $\|A\| = \sqrt{\mathrm{tr}\{A^\top A\}}$, where $\mathrm{tr}\{\cdot\}$ is the trace. For a square matrix $A$, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and the largest eigenvalues of $A$, respectively.

Assumption 3.1.
The data $\{(Y_i, D_i, X_i, Z_i) : 1 \le i \le n\}$ are IID.

Assumption 3.2. (i) $\mathrm{supp}[Z_j]$ is a compact subset of $\mathbb{R}^{\dim(Z_j)}$ for all $j$. (ii) The random variables $\epsilon_D = (\epsilon_{D1}, \ldots, \epsilon_{DS})$ are continuously distributed on the whole $\mathbb{R}^S$ and independent of $(X, Z)$. Each $\epsilon_{Dj}$ has a twice continuously differentiable known CDF $F_j$ with bounded derivatives. (iii) $\|\hat\gamma_n - \gamma\| = O_P(n^{-1/2})$ and $\|\hat\pi_n - \pi\| = O_P(n^{-1/2})$.

Assumption 3.3. (i) $\mathrm{supp}[X]$ is a compact subset of $\mathbb{R}^{\dim(X)}$. $X$ is independent of $(\epsilon^{(d)}, \epsilon_{Dj}, s)$ for all $d$ and $j$. (ii) The random variable $\epsilon^{(d)}$ is independent of $(X, Z)$ for all $d$.

The IID sampling condition in Assumption 3.1 is standard in the literature. For Assumption 3.2(iii), if the parameters $\gamma$ and $\pi$ are identifiable, the $\sqrt{n}$-consistency of the ML estimators is straightforward; see Appendix C.1 for a special case where the $\epsilon_{Dj}$'s are distributed as the standard normal. Note that Assumptions 3.2(i)-(iii) together imply $\max_{1 \le i \le n} |\hat P_{ji} - P_{ji}| = O_P(n^{-1/2})$ for all $j$.

Let $\nabla^a g_j(p) := \partial^a g_j(p)/(\partial p)^a$ and $\nabla^a b_K(p) := (\partial^a b_{1K}(p)/(\partial p)^a, \ldots, \partial^a b_{KK}(p)/(\partial p)^a)^\top$ for a non-negative integer $a$. Further, define $\Psi_K^{(d)} := E[R_K^{(d)} R_K^{(d)\top}]$ and $\Sigma_K^{(d)} := E[(\xi_K^{(d)})^2 R_K^{(d)} R_K^{(d)\top}]$, where $\xi_K^{(d)} := e^{(d)} + B_K^{(d)}$, and $e^{(d)}$ and $B_K^{(d)}$ are unobserved error terms in the series regressions whose definitions are given in (A.1) in Appendix A.

Assumption 3.4.
(i) For each $d$ and $j$, $g_j^{(d)}(p)$ is $r$-times continuously differentiable for some $r \ge 2$, and there exist positive constants $\mu_1$ and $\mu_2$ and some $\alpha_j^{(d)} \in \mathbb{R}^K$ satisfying $\sup_{p \in [0,1]} |b_K(p)^\top \alpha_j^{(d)} - g_j^{(d)}(p)| = O(K^{-\mu_1})$ and $\sup_{p \in [0,1]} |\nabla b_K(p)^\top \alpha_j^{(d)} - \nabla g_j^{(d)}(p)| = O(K^{-\mu_2})$.
(ii) $b_K(p)$ is twice continuously differentiable and satisfies $\zeta_0(K)\sqrt{(K \log K)/n} \to 0$, $\zeta_1(K)\sqrt{K/n} \to 0$, and $\zeta_2(K)/\sqrt{n} \to 0$, where $\zeta_l(K) := \max_{0 \le a \le l} \sup_{p \in [0,1]} \|\nabla^a b_K(p)\|$.

Assumption 3.5. (i) For $d \in \{0,1\}$, there exist constants $c_\Psi$ and $\bar c_\Psi$ such that $0 < c_\Psi \le \lambda_{\min}(\Psi_K^{(d)}) \le \lambda_{\max}(\Psi_K^{(d)}) \le \bar c_\Psi < \infty$, uniformly in $K$. (ii) For $d \in \{0,1\}$, there exist constants $c_\Sigma$ and $\bar c_\Sigma$ such that $0 < c_\Sigma \le \lambda_{\min}(\Sigma_K^{(d)}) \le \lambda_{\max}(\Sigma_K^{(d)}) \le \bar c_\Sigma < \infty$, uniformly in $K$.

Assumption 3.6. $E[(e^{(d)})^4 \mid D, X, Z, s = j] < \infty$ for all $d$ and $j$.

Assumption 3.4(i) imposes smoothness conditions on the function $g_j^{(d)}$. These conditions are standard in the literature on series approximation methods. For example, Lemma 2 in Holland (2017) shows that Assumption 3.4(i) is satisfied by B-splines of order $k$ with $k - 1 \ge r$ such that $\mu_1 = r$ and $\mu_2 = r - 1$. This assumption requires $K$ to increase to infinity for unbiased estimation, while Assumption 3.4(ii) requires that $K$ should not diverge too quickly. It is well known that $\zeta_l(K) = O(K^{1/2 + l})$ for B-splines (e.g., Newey, 1997). Thus, when B-spline basis functions are employed, Assumption 3.4(ii) is satisfied if $K^5/n \to 0$. Assumption 3.5 ensures that the matrices $\Psi_K^{(d)}$ and $\Sigma_K^{(d)}$ are positive definite for all $K$ so that their inverses exist.
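The approximation rates in Assumption 3.4(i) can be illustrated numerically. The sketch below uses a polynomial basis as a simple stand-in for the B-splines discussed above and fits a smooth, purely hypothetical target function by least squares; the sup-norm approximation error shrinks as $K$ grows:

```python
import numpy as np

# Smooth target standing in for g_j^{(d)}; the series bias should shrink
# as K grows (Assumption 3.4(i)). Polynomial basis is a hypothetical
# stand-in for B-splines.
g = lambda p: np.sin(3.0 * p) + p ** 2
grid = np.linspace(0.0, 1.0, 2001)

def sup_error(K):
    """Least-squares fit with K polynomial basis terms; sup-norm error."""
    B = np.vander(grid, K, increasing=True)   # b_K(p) = (1, p, ..., p^{K-1})
    coef, *_ = np.linalg.lstsq(B, g(grid), rcond=None)
    return np.max(np.abs(B @ coef - g(grid)))

errs = [sup_error(K) for K in (3, 5, 8)]
print(errs)  # decreasing: larger K, smaller approximation bias
```

This is only the bias side of the trade-off; Assumption 3.4(ii) limits how fast $K$ may grow relative to $n$ so that the variance of the series estimator stays under control.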
Assumption 3.6 is used to derive the asymptotic normality of our estimator in a convenient way.

Finally, we introduce the selection matrices $S_{X,j}$ (of size $\dim(X) \times d_{SXK}$) and $S_{K,j}$ (of size $K \times d_{SXK}$) such that $S_{X,j}\theta^{(d)} = \beta_j^{(d)}$ and $S_{K,j}\theta^{(d)} = \alpha_j^{(d)}$ for each $d$ and $j$. The next theorem gives the asymptotic normality for the infeasible estimator.

Theorem 3.1.
Suppose that Assumptions 3.1-3.6 hold. For a given $p \in \mathrm{supp}[P_j \mid D = 1] \cap \mathrm{supp}[P_j \mid D = 0]$, if $\|\nabla b_K(p)\| \to \infty$, $\sqrt{n} K^{-\mu_1} \to 0$, and $\sqrt{n} K^{-\mu_2}/\|\nabla b_K(p)\| \to 0$ hold, then
\[ \frac{\sqrt{n}\left( \widetilde{\mathrm{MTE}}_j(x, p) - \mathrm{MTE}_j(x, p) \right)}{\sqrt{\left[\sigma_{K,j}^{(1)}(p)\right]^2 + 2\,\mathrm{cov}_K(p) + \left[\sigma_{K,j}^{(0)}(p)\right]^2}} \xrightarrow{d} N(0, 1), \]
where
\[ \sigma_{K,j}^{(d)}(p) := \sqrt{\nabla b_K(p)^\top S_{K,j} \left[\Psi_K^{(d)}\right]^{-1} \Sigma_K^{(d)} \left[\Psi_K^{(d)}\right]^{-1} S_{K,j}^\top \nabla b_K(p)}, \quad \text{for } d \in \{0, 1\}, \]
\[ \mathrm{cov}_K(p) := \nabla b_K(p)^\top S_{K,j} \left[\Psi_K^{(0)}\right]^{-1} C_K \left[\Psi_K^{(1)}\right]^{-1} S_{K,j}^\top \nabla b_K(p), \quad \text{and} \quad C_K := E\left[\xi_K^{(0)} \xi_K^{(1)} R_K^{(0)} R_K^{(1)\top}\right]. \]

As shown in Lemma B.4, $\sigma_{K,j}^{(d)}(p)$ corresponds to the asymptotic standard deviation of the MTR estimator for $D = d$. Further, $\mathrm{cov}_K(p)$ is the asymptotic covariance between the MTR estimators for $D = 1$ and $D = 0$, which is supposed to be non-zero in our case. This non-zero covariance originates from replacing the unobserved membership indicator $\mathbf{1}\{s = j\}$ with the membership probability $\pi_j$.

As a corollary of Theorem 3.1, the asymptotic properties of the feasible MTE estimator can be derived relatively easily with the following additional assumption.

Assumption 3.7. $\sup_{p \in [0,1]} |\nabla b_K(p)^\top \alpha| = O(\sqrt{K}) \cdot \sup_{p \in [0,1]} |b_K(p)^\top \alpha|$ for any $\alpha \in \mathbb{R}^K$.

Chen and Christensen (2018) show that this assumption holds true for B-splines and wavelets.

To proceed, let $C(\mathcal{D})$ be the set of uniformly bounded continuous functions defined on $\mathcal{D}$. Further, let $T$ be a generic random vector where $\mathrm{supp}[T]$ is compact and $\dim(T)$ is finite.
We define the linear operator $\mathcal{P}_{n,j}^{(d)}$ that maps a given function $q \in C(\mathrm{supp}[T])$ to the sieve space defined by $b_K$ as follows:
\[ \mathcal{P}_{n,j}^{(d)} q := b_K(\cdot)^\top S_{K,j} \left[\Psi_{nK}^{(d)}\right]^{-1} \frac{1}{n} \sum_{i=1}^{n} R_{i,K}^{(d)} q(T_i), \]
where $\Psi_{nK}^{(d)} := n^{-1} \sum_{i=1}^{n} R_{i,K}^{(d)} R_{i,K}^{(d)\top}$. The functional form of $q$ may implicitly depend on $n$. The operator norm of $\mathcal{P}_{n,j}^{(d)}$ is defined as
\[ \|\mathcal{P}_{n,j}^{(d)}\|_\infty := \sup\left\{ \sup_{p \in [0,1]} \left| \left(\mathcal{P}_{n,j}^{(d)} q\right)(p) \right| : q \in C(\mathrm{supp}[T]),\ \sup_{t \in \mathrm{supp}[T]} |q(t)| = 1 \right\}, \]
which is typically of order $O_P(1)$ for splines and wavelets under some regularity conditions.

Corollary 3.1.
Suppose that the assumptions in Theorem 3.1 hold. If, in addition, Assumption 3.7, $\zeta(K)^2/\sqrt{n} = O(1)$, and $(\|\mathcal{P}^{(d)}_{n,j}\|_\infty + 1)\sqrt{K}/\|\nabla b_K(p)\| \to 0$ for both $d \in \{0,1\}$ hold, then
$$\frac{\sqrt{n}\left(\widehat{\mathrm{MTE}}_j(x,p) - \mathrm{MTE}_j(x,p)\right)}{\sqrt{\left[\sigma^{(1)}_{K,j}(p)\right]^2 + 2\,\mathrm{cov}_K(p) + \left[\sigma^{(0)}_{K,j}(p)\right]^2}} \xrightarrow{d} N(0,1).$$

As shown above, the feasible MTE estimator has the same asymptotic distribution as the infeasible estimator. (A set of easy-to-check conditions ensuring that $\|\mathcal{P}^{(d)}_{n,j}\|_\infty = O_P(1)$ is to verify the conditions in Lemma 7.1 of Chen and Christensen (2015) and to show that $S_{K,j}[\Psi^{(d)}_{nK}]^{-1}$ is stochastically bounded in the $\ell_\infty$ norm; see also Appendix B.3 of Hoshino and Yanagi (2020).) The standard errors of the MTE estimators can be computed straightforwardly by replacing the unknown terms in $\sigma^{(d)}_{K,j}(p)$ and $\mathrm{cov}_K(p)$ with their empirical counterparts:
$$\widehat\Psi^{(d)}_{nK} := \frac{1}{n}\sum_{i=1}^n \widehat R^{(d)}_{i,K}\widehat R^{(d)\top}_{i,K}, \quad \widehat\Sigma^{(d)}_{nK} := \frac{1}{n}\sum_{i=1}^n \left(\widehat\xi^{(d)}_{i,K}\right)^2\widehat R^{(d)}_{i,K}\widehat R^{(d)\top}_{i,K}, \quad \widehat C_{nK} := \frac{1}{n}\sum_{i=1}^n \widehat\xi^{(0)}_{i,K}\widehat\xi^{(1)}_{i,K}\widehat R^{(0)}_{i,K}\widehat R^{(1)\top}_{i,K},$$
where $\widehat\xi^{(1)}_{i,K} := D_iY_i - \widehat R^{(1)\top}_{i,K}\widehat\theta^{(1)}_n$ and $\widehat\xi^{(0)}_{i,K} := (1-D_i)Y_i - \widehat R^{(0)\top}_{i,K}\widehat\theta^{(0)}_n$.

We consider relaxing Assumption 2.2 by allowing the membership variable $s$ to depend on both $\epsilon_{Dj}$ and a vector of covariates $W$. For simplicity, we focus on the case of $S = 2$ only. Suppose that $s$ is determined by the following model:
$$s = 2 - 1\{\pi(W) \ge U\}, \qquad (4.1)$$
where $U$ is an unobserved continuous random variable distributed as $\mathrm{Uniform}[0,1]$ conditional on $X$, and $\pi$ is an unknown function taking values in $[0,1]$. We assume that $W$ is conditionally independent of $U$ given $X$. Then, $\Pr(s = 1 \mid X, W) = \pi(W)$.
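As a quick sanity check, the membership rule (4.1) can be simulated directly; the logistic form of $\pi(W)$ below is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
W = rng.normal(size=n)
pi_w = 1.0 / (1.0 + np.exp(-W))          # assumed logistic form for pi(W)
U = rng.uniform(size=n)
s = 2 - (pi_w >= U).astype(int)          # membership rule (4.1): s = 1 iff U <= pi(W)

# near W = 0 we have pi(W) ~ 0.5, so the local share of group 1 should be ~ 0.5
near_zero = np.abs(W) < 0.05
share_group1 = (s[near_zero] == 1).mean()
```

The local share of group-1 members recovers $\pi(W)$, in line with $\Pr(s = 1 \mid X, W) = \pi(W)$.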
In addition, we assume that $W$ contains at least one continuous variable that is not included in $(X, Z_1, Z_2)$. In this setup, we re-define the MTE parameter and the MTR function, respectively, as follows:
$$\mathrm{MTE}_j(x,q,p) := m^{(1)}_j(x,q,p) - m^{(0)}_j(x,q,p), \qquad m^{(d)}_j(x,q,p) := E[Y^{(d)}_j \mid X = x, U = q, V_j = p]. \qquad (4.2)$$
Further, letting $Q := \pi(W)$, we define
$$\psi_1(x,q,p_1,p_2) := E[DY \mid X = x, Q = q, P_1 = p_1, P_2 = p_2],$$
$$\psi_0(x,q,p_1,p_2) := E[(1-D)Y \mid X = x, Q = q, P_1 = p_1, P_2 = p_2],$$
$$\rho_{dj}(x,q,p_1,p_2) := \Pr(D = d, s = j \mid X = x, Q = q, P_1 = p_1, P_2 = p_2)$$
for $d \in \{0,1\}$ and $j \in \{1,2\}$. For identification of these functions, we first need to establish the identification of all model parameters in (4.1). In Appendix C.1.2, we present a supplementary identification result for an endogenous finite-mixture treatment choice model where the error terms are assumed to be jointly normal. For notational simplicity, we denote the cross-partial derivative with respect to $(q, p_j)$ by $\nabla_{qp_j}$; for instance, $\nabla_{qp_1}\psi_1(x,q,p_1,p_2) = \partial^2\psi_1(x,q,p_1,p_2)/(\partial q\,\partial p_1)$.

Assumption 4.1. (i) $(W, Z_1, Z_2)$ are independent of $(\epsilon^{(d)}, \epsilon_{Dj}, U)$ given $X$ for all $d$ and $j$. (ii) Each of $W$, $Z_1$, and $Z_2$ contains at least one continuous variable that is not included in $X$ and the rest. (iii) For each $j$, $(U, V_j)$ is continuously distributed on $[0,1]^2$ with conditional density $f_{UV_j}(\cdot,\cdot \mid X)$ given $X$.

Theorem 4.1.
Suppose that Assumption 4.1 holds. If $m^{(d)}_j(x,\cdot,\cdot)$ and $f_{UV_j}(\cdot,\cdot \mid X = x)$ are continuous, we have
$$m^{(d)}_j(x,q,p_j) = \frac{\nabla_{qp_j}\psi_d(x,q,p_1,p_2)}{\nabla_{qp_j}\rho_{dj}(x,q,p_1,p_2)}, \qquad f_{UV_j}(q,p_j \mid X = x) = (-1)^{d+j}\,\nabla_{qp_j}\rho_{dj}(x,q,p_1,p_2).$$

The following theorem shows that the group-wise MTE parameter $\mathrm{MTE}_j(x,p)$ defined in (2.3) can be recovered as a weighted average of $\mathrm{MTE}_j(x,q,p)$ in (4.2).

Theorem 4.2.
Suppose that Assumption 4.1 holds. Then, we have
$$\mathrm{MTE}_j(x,p) = \int_0^1 \mathrm{MTE}_j(x,u,p)\,\omega_j(x,u,p)\,\mathrm{d}u \quad \text{for } j \in \{1,2\},$$
where
$$\omega_1(x,u,p) := \frac{\Pr(u \le Q \mid X = x)\,f_{UV_1}(u,p \mid X = x)}{\int_0^1 \Pr(u' \le Q \mid X = x)\,f_{UV_1}(u',p \mid X = x)\,\mathrm{d}u'},$$
$$\omega_2(x,u,p) := \frac{\Pr(u > Q \mid X = x)\,f_{UV_2}(u,p \mid X = x)}{\int_0^1 \Pr(u' > Q \mid X = x)\,f_{UV_2}(u',p \mid X = x)\,\mathrm{d}u'}.$$

In some empirical situations, only discrete IVs are available, and many of them are binary. In this subsection, we focus on a situation where only binary IVs are available and show that some LATE parameters are still identifiable by the Wald estimand. For expositional simplicity, the conditioning on $X = x$ is suppressed throughout this subsection. Let $Z$ be an $S \times 1$ vector of binary instruments, $Z = (Z_1, \ldots, Z_S) \in \mathcal{Z}$, where $\mathcal{Z} := \mathrm{supp}[Z]$. $Z$ may contain some overlapping elements. In an extreme case, when there is only one instrument common to all groups, we have $\mathcal{Z} = \{\mathbf{0}_S, \mathbf{1}_S\}$, where $\mathbf{0}_S$ and $\mathbf{1}_S$ are $S \times 1$ vectors of zeros and ones, respectively. Let $Y^{(d,z)}_j$ be the potential outcome when $s = j$, $D = d$, and $Z = z$. Similarly, we denote the potential treatment status when $s = j$ and $Z = z$ as $D^{(z)}_j$.

Assumption 4.2. (i) For each $d$ and $j$, $Y^{(d,z)}_j = Y^{(d)}_j$ and $D^{(z)}_j = D^{(z_j)}_j$ for any $z = (z_1, \ldots, z_S) \in \mathcal{Z}$. (ii) $Z$ is independent of $(D^{(0)}_j, D^{(1)}_j, Y^{(0)}_j, Y^{(1)}_j, s)$ for all $j$. (iii) $\Pr(D^{(1)}_j = 1, D^{(0)}_j = 0 \mid s = j) > 0$ and $\Pr(D^{(1)}_j = 0, D^{(0)}_j = 1 \mid s = j) = 0$ for all $j$.

Assumption 4.2(i) can be violated when the instruments affect the outcome directly or when the group-$j'$-specific instrument $Z_{j'}$ affects the treatment status of individuals in group $j$ ($j \ne j'$). Note that, under this condition, it holds that $D = D^{(z_j)}_j$ conditional on $s = j$ and $Z = z$. Assumption 4.2(ii) requires the randomness of the instrument $Z$, which is essentially the same as Assumption 2.1(i).
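Stepping back to Theorem 4.1, its ratio identification can be checked by finite differences on a toy design in which $(U, V_1)$ is uniform on $[0,1]^2$ (so that $\nabla_{qp_1}\rho_{11} = 1$) and the MTR is taken to be $m^{(1)}_1(x,u,v) = u + 2v$; both choices are illustrative assumptions:

```python
def psi1(q, p1):
    # closed form of the (q, p1)-dependent part of psi_1 for the toy model with
    # m_1(u, v) = u + 2*v and f_{UV_1} = 1; the group-2 term does not involve p1
    # and therefore drops out of the cross-partial, so it is omitted here
    return 0.5 * q**2 * p1 + q * p1**2

def cross_partial(f, q, p1, h=1e-4):
    # central finite-difference approximation of d^2 f / (dq dp1)
    return (f(q + h, p1 + h) - f(q + h, p1 - h)
            - f(q - h, p1 + h) + f(q - h, p1 - h)) / (4.0 * h * h)

q, p1 = 0.4, 0.6
m1_hat = cross_partial(psi1, q, p1)   # Theorem 4.1's ratio, with denominator equal to 1
```

Since the denominator $\nabla_{qp_1}\rho_{11}$ equals one in this design, the cross-partial of $\psi_1$ should return $m^{(1)}_1(q,p_1) = q + 2p_1$ directly.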
Assumption 4.2(iii) is similar to the monotonicity condition in Imbens and Angrist (1994). Following Angrist et al. (1996), individuals with $D^{(1)}_j = D^{(0)}_j = 1$ can be referred to as $Z_j$-always-takers; those with $D^{(1)}_j = D^{(0)}_j = 0$ are $Z_j$-never-takers; those with $D^{(1)}_j > D^{(0)}_j$ are $Z_j$-compliers; and those with $D^{(1)}_j < D^{(0)}_j$ are $Z_j$-defiers. Hence, condition (iii) ensures that, for all $j$, there are $Z_j$-compliers but no $Z_j$-defiers in group $j$. When some IVs are common across groups, this monotonicity condition might be violated, as mentioned in Remark 2.1.

Theorem 4.3.
Suppose that Assumption 4.2 holds. Then, the weighted average of the group-wise LATEs can be identified as follows:
$$\frac{E[Y \mid Z = \mathbf{1}_S] - E[Y \mid Z = \mathbf{0}_S]}{E[D \mid Z = \mathbf{1}_S] - E[D \mid Z = \mathbf{0}_S]} = \sum_{j=1}^S E[Y^{(1)}_j - Y^{(0)}_j \mid Z_j\text{-compliers}, s = j]\,\frac{\Pr(Z_j\text{-compliers}, s = j)}{\sum_{h=1}^S \Pr(Z_h\text{-compliers}, s = h)}.$$
Moreover, if $e_j \in \mathcal{Z}$, where $e_j$ is the $S \times 1$ unit vector whose $j$-th element equals one, we can identify the LATE specific to group $j$:
$$\frac{E[Y \mid Z = e_j] - E[Y \mid Z = \mathbf{0}_S]}{E[D \mid Z = e_j] - E[D \mid Z = \mathbf{0}_S]} = E[Y^{(1)}_j - Y^{(0)}_j \mid Z_j\text{-compliers}, s = j].$$

Numerical Analysis
To evaluate the finite-sample performance of our MTE estimators, we conduct a set of Monte Carlo experiments. Setting $S = 2$ with the membership probabilities $(\pi_1, \pi_2) = (0.\cdot, 0.\cdot)$, the treatment variable is generated by $D = 1\{Z_j^\top \gamma_j \ge \epsilon_{Dj}\}$ for $s = j$, where $Z_j = (1, X_1, \zeta_j)^\top$, $X_1 \sim N(0,1)$, $\zeta_j \sim N(0,1)$, and $\epsilon_{Dj} \sim N(0,1)$ for both $j \in \{1,2\}$. We set $\gamma_1 = (\gamma_{10}, \gamma_{11}, \gamma_{12})^\top = (0, -\cdot, \cdot)^\top$ and $\gamma_2 = (\gamma_{20}, \gamma_{21}, \gamma_{22})^\top = (0, \cdot, -\cdot)^\top$. The potential outcomes are generated by $Y^{(d)}_j = X^\top \beta^{(d)}_j + \epsilon^{(d)}$, where $X = (1, X_1)^\top$, $\epsilon^{(0)} = \sum_{j \in \{1,2\}} 1\{s = j\}V_j + \eta^{(0)}$, $\epsilon^{(1)} = \sum_{j \in \{1,2\}} 1\{s = j\}V_j + \eta^{(1)}$, and $\eta^{(d)} \sim N(0, \cdot)$ for both $d \in \{0,1\}$. Here, $V_j = \Phi(\epsilon_{Dj})$, and $\Phi$ denotes the standard normal CDF. The coefficients are set to $\beta^{(0)}_1 = (-\cdot, \cdot)^\top$, $\beta^{(0)}_2 = (1, \cdot)^\top$, $\beta^{(1)}_1 = (1, -\cdot)^\top$, and $\beta^{(1)}_2 = (2, \cdot)^\top$. For each setup, we consider two sample sizes, $n \in \{1000, 4000\}$.

For the first-stage estimation of the finite mixture Probit model, we use the EM algorithm. The second-stage MTE estimation is carried out using both the infeasible and feasible estimators. We employ third-order B-splines for the basis functions. The number of inner knots of the B-splines, say $\widetilde{K}$, is set to $\widetilde{K} = 1$ when $n = 1000$ and $\widetilde{K} \in \{1, 2\}$ when $n = 4000$. To stabilize the series regression, using ridge regression with penalty $n^{-\cdot}$ is also considered for comparison. Note that introducing a sufficiently small penalty term does not alter our asymptotic results. The simulation results reported below are based on 1,000 Monte Carlo replications.

Table 1 shows the bias and root mean squared error (RMSE) of estimating the group-wise MTE for both groups with $x = 0.\cdot$ and $v \in \{\cdot, \cdot, \cdot, \cdot\}$ (labeled MTE1.1, MTE1.2, and so on for group 1, and similarly for group 2).
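For intuition, the first-stage EM logic can be sketched in a stripped-down form in which the group-wise probit coefficients are held fixed at assumed values and only the mixing weight $\pi_1$ is updated; the actual first stage estimates all parameters jointly, so this is a simplified illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 20_000
X1 = rng.normal(size=n)
g1 = 0.0 - 1.0 * X1                      # group-1 probit index (assumed coefficients)
g2 = 0.5 + 1.0 * X1                      # group-2 probit index (assumed coefficients)
s1 = rng.uniform(size=n) < 0.6           # true mixing weight pi_1 = 0.6
D = (rng.normal(size=n) < np.where(s1, g1, g2)).astype(int)

def probit_lik(g):
    # probit likelihood contribution of D given the latent index g
    p = norm.cdf(g)
    return np.where(D == 1, p, 1.0 - p)

L1, L2 = probit_lik(g1), probit_lik(g2)
pi = 0.5                                 # initial mixing weight
for _ in range(200):
    w = pi * L1 / (pi * L1 + (1.0 - pi) * L2)   # E-step: posterior P(s = 1 | D, X)
    pi = w.mean()                                # M-step: update the mixing weight
```

The posterior weights `w` play the role of the membership probabilities used to replace the unobserved indicator $1\{s = j\}$ in the second stage.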
Overall, the performance of the feasible estimator is almost the same as that of the infeasible estimator. This is consistent with our asymptotic theory. The bias of our estimator is satisfactorily small except for the cases close to the boundary. As expected, the RMSE quickly decreases as the sample size increases, as long as the number of basis terms is unchanged. Although in theory we need to employ a larger number of basis terms as the sample size increases, using $\widetilde{K} = 2$ seems too flexible, and the increase in the variance is problematic even for $n = 4000$ (of course, this result is more or less specific to our choice of the functional form for the MTE). The RMSE for group 2 tends to be smaller than that for group 1, probably owing to the difference in their group sizes. Introducing a penalty term can improve the overall RMSE; hence, we recommend employing ridge regression in practice with moderate sample sizes.

Table 2 presents the simulation results for the ML estimation of the finite mixture Probit model. For all estimated parameters, the bias is satisfactorily small under all sample sizes. The RMSE is approximately halved when the sample size increases from 1,000 to 4,000, implying the $\sqrt{n}$-consistency of our ML estimator.

In this empirical analysis, we investigate the effects of college education on income in the Japanese labor market. We use two data sources. The primary data source is the Japanese Life Course Panel Survey 2007 (wave 1), which includes detailed information on Japanese workers aged 20 to 40, including their working conditions, annual income, education level, and family member characteristics. The outcome variable

Acknowledgement: The data for this secondary analysis, "Japanese Life Course Panel Surveys, wave 1, 2007, of the Institute of Social Science, The University of Tokyo," was provided by the Social Science Japan Data Archive, Center for Social Research and Data Archives, Institute of Social Science, The University of Tokyo.
Table 1: Bias and RMSE of the group-wise MTE estimators
The table reports, for each design ($n \in \{1000, 4000\}$, number of inner knots $\widetilde{K}$, ridge on/off), the bias and RMSE of the feasible and infeasible estimators at the four evaluation points for group 1 (MTE1.1–MTE1.4) and group 2 (MTE2.1–MTE2.4).
Note: The column labeled “ridge” indicates whether the ridge regression is used (1 for “yes” and 0 for “no”).
Table 2: Simulation results of the ML estimation of the finite mixture Probit model
The table reports the bias and RMSE of the estimates of the coefficients $\gamma_1$ and $\gamma_2$ and the membership probability $\pi_1$, for $n = 1{,}000$ and $n = 4{,}000$.

of interest is the respondent’s annual income, and the treatment variable is defined as follows: $D = 1$ if the respondent has a college degree or higher, and $D = 0$ otherwise. The second data source is the School Basic Survey conducted by the Ministry of Education, Japan. Using this dataset, we collect information on regional university enrollment statistics for each prefecture where the respondents were living at the age of 15 to create IVs for the treatment. We consider 10 IVs in total. Table 3 below shows the list of variables used in this analysis. After excluding observations with missing data, the analysis is performed on 3,347 individuals.

Table 3: Variables used
Y: Annual income (in million JPY).
D: Dummy variable: the respondent has a college degree or higher.
X: Dummy variable: the respondent is currently living in an urban area.
Dummy variable: the respondent is male.
Age.*
Working experience in years for the current job.*
Dummy variable: the respondent is a part-time worker.
Log of average working hours per day.
Dummy variable: the respondent has professional skills.
Dummy variable: the respondent is a public worker.
Dummy variable: the respondent is in a managerial position.
Dummy variable: the respondent is working in a large company.
Dummy variable: the respondent is married.
Dummy variable: the respondent’s partner (if any) is a part-time worker.
Self-reported health status (in five scales).
Z (with shorthand):
Dummy variable: the respondent is male. male
The number of elder siblings. nesibs
The number of younger siblings. nysibs
Log of the number of books in the respondent’s home when he/she was 15 years old. books
The respondent’s father’s education in years. feduc
The respondent’s mother’s education in years. meduc
Economic condition of the respondent’s household when he/she was 15 years old (in five scales). econom
The proportion of private universities.** priv
Capacity: (the number of universities)/(the number of high-school graduates).** cap
The rate of university enrollment for high school graduates.** runiv
* The squares of these variables are also included in the regressors.
** These variables are all at the prefecture level, as of when the respondent was 15 years old.
In this analysis, we set $S = 2$. As shown below in Table 4, our treatment choice model has a smaller Akaike information criterion (AIC) value than the model with $S = 1$ (no mixture). For the models with $S = 3$ or larger, the EM algorithm did not converge within tolerable parameter values. Our final treatment choice model was determined in the following manner. Assuming that there is at least one IV specific to each group, we can consider $\binom{10}{2} = 45$ potential pairs of group-specific IVs in total. We first estimated all 45 (full) models in which the IVs not used as group-specific IVs are included in both groups. Then, based on the value of the log-likelihood, we identified two group-specific IVs: namely, nysibs for group 1 and runiv for group 2.

How to determine the optimal number of mixture components in a mixture model is a long-standing issue in statistics. A conventional approach is to use information criteria, such as the AIC and the Bayesian information criterion (BIC). A more formal approach is to statistically test the number of mixture components using likelihood-ratio-type tests (see, e.g., Chen et al., 2001; Zhu and Zhang, 2004). Investigating the applicability of these tests to our situation is outside the scope of this paper but is an important issue for future research.
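The enumeration of candidate group-specific IV pairs and the information-criterion comparison can be sketched as follows; the log-likelihood values and parameter count below are placeholders for illustration, not estimates from the data:

```python
from itertools import combinations
from math import comb

ivs = ["male", "nesibs", "nysibs", "books", "feduc",
       "meduc", "econom", "priv", "cap", "runiv"]
pairs = list(combinations(ivs, 2))        # candidate pairs of group-specific IVs

# hypothetical log-likelihoods from fitting each full model; in practice these
# would come from the EM estimation of each of the 45 candidate specifications
loglik = {pair: -1000.0 - i for i, pair in enumerate(pairs)}
best = max(pairs, key=loglik.get)          # pair with the largest log-likelihood
k_params = 14                              # illustrative parameter count
aic = 2 * k_params - 2 * loglik[best]      # AIC = 2k - 2*loglik
```

With 10 candidate IVs, `combinations` yields exactly $\binom{10}{2} = 45$ pairs, matching the count in the text.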
Table 4: ML estimates of the treatment choice models
Columns: mixture model ($S = 2$), group 1 and group 2, and the standard Probit model; each cell reports the estimate and its t-value. Rows: $\pi_1$, male, nesibs, nysibs, books, feduc, meduc, econom, priv, cap, runiv. For nesibs, the estimates (t-values) are $-0.200$ ($-3.962$) in group 1 and $0.435$ ($1.253$) in group 2 of the mixture model, and $-0.119$ ($-3.548$) in the standard Probit model; for nysibs, $-0.132$ ($-2.965$) in group 1 and $-0.103$ ($-2.953$) in the standard Probit model.

Following the suggestion from the Monte Carlo results, we employ the third-order B-splines with $\widetilde{K} = 1$ and use the ridge regression with the penalty equal to $n^{-\cdot}$ for the estimation of the MTEs. Figure 1 shows our main results. We find that the effect of a college degree on wage is not significant for those who belong to group 1 and have a higher unobserved cost of going on to higher education. Recall that, for the members of group 1, family characteristics, such as parental education level and economic condition, are the main factors affecting the college enrollment status, and we can expect that their personal willingness toward higher education is somewhat heterogeneous within the group. The downward-sloping shape of the MTE for group 1 might be due to this heterogeneity. In contrast, for the members of group 2 (those who are less influenced by their family characteristics), the MTE curve is relatively flat at $\mathrm{MTE} \approx \cdot$ million JPY.

Figure 1: Estimated MTEs
Note: In each panel, the solid line indicates the point estimate of the MTE evaluated at the sample mean of $X$, and the grayed area corresponds to the 95% confidence interval.

This paper considered the identification and estimation of MTEs when the data are composed of a mixture of latent groups. We developed a general treatment effect model with unobserved group heterogeneity by extending Rubin's causal model to finite mixture models. We proved that the MTE for each latent group can be separately identified under the availability of group-specific continuous IVs. Based on our constructive identification result, we proposed a two-step semiparametric procedure for estimating the group-wise MTEs and established its asymptotic properties. The results of the Monte Carlo simulations show that our estimators perform well in finite samples. An empirical application to the estimation of economic returns to college education indicates the usefulness of the proposed model.

Several open issues and promising extensions of the proposed approach are as follows. First, it would be interesting to study the identification of treatment effects when only IVs common to all groups are available. Second, it might be worthwhile to consider the estimation of the MTE (beyond the LATE) when all group-specific IVs are discrete (cf. Brinch et al., 2017). Lastly, based on our finite-mixture framework, we could construct a relative risk measure for alternative treatments. For example, even when the global ATE of a treatment is weakly positive, it is possible that the treatment is actually harmful for most people but significantly beneficial for only a small subset. Then, we can conclude that such a treatment is risky. In addition, following the approach in Subsection 4.1, we can predict to which group each individual is likely to belong. By incorporating this information into the framework developed, for example, by Kitagawa and Tetenov (2018), we might be able to propose a new way of constructing optimal treatment assignment rules. These topics are left for future work.

Appendix: Proofs of Theorems
This appendix collects the proofs of the theorems. The lemmas used to prove Theorem 3.1 and Corollary 3.1are relegated to Appendix B. Below, we denote the conditional density of a random variable T as f T ( ·|· ) . A.1 Proof of Theorem 2.1
We provide the proof for $m^{(1)}_j(x,p)$ only, as the proof for $m^{(0)}_j(x,p)$ is analogous. First, observe that
$$\psi_1(x,p) = \sum_{j=1}^S E\left[1\{s = j\}DY^{(1)}_j \,\middle|\, X = x, P = p\right] = \sum_{j=1}^S E[Y^{(1)}_j \mid X = x, P = p, s = j, D = 1]\cdot\Pr(s = j, D = 1 \mid X = x, P = p) = \sum_{j=1}^S E[Y^{(1)}_j \mid X = x, s = j, V_j \le p_j]\cdot\pi_j p_j,$$
under Assumptions 2.1(i) and 2.2. Further, it holds that
$$E[Y^{(1)}_j \mid X = x, s = j, V_j \le p_j] = \int_0^{p_j} E[Y^{(1)}_j \mid X = x, s = j, V_j = v]\,\frac{f_{V_j}(v \mid X = x, s = j)}{\Pr(V_j \le p_j \mid X = x, s = j)}\,\mathrm{d}v = \frac{1}{p_j}\int_0^{p_j} m^{(1)}_j(x,v)\,\mathrm{d}v,$$
by Assumption 2.2(i). Therefore, $\psi_1(x,p) = \sum_{j=1}^S \pi_j \int_0^{p_j} m^{(1)}_j(x,v)\,\mathrm{d}v$, and the Leibniz integral rule leads to $\partial\psi_1(x,p)/\partial p_j = \pi_j \cdot m^{(1)}_j(x,p_j)$. This completes the proof.

Here, we introduce additional notation for the subsequent discussion. Let $\delta^{(1)}_j := 1\{s = j, V_j \le P_j\} = D \cdot 1\{s = j\}$ and $\delta^{(0)}_j := 1\{s = j, V_j > P_j\} = (1-D)\cdot 1\{s = j\}$. Note that $D = \sum_{j=1}^S \delta^{(1)}_j$ and $1 - D = \sum_{j=1}^S \delta^{(0)}_j$. By (3.2), we can write
$$DY = \sum_{j=1}^S \delta^{(1)}_j X^\top\beta^{(1)}_j + D\epsilon^{(1)} = \sum_{j=1}^S \delta^{(1)}_j\left(X^\top\beta^{(1)}_j + g^{(1)}_j(P_j)/P_j\right) + e^{(1)} = \underbrace{\sum_{j=1}^S \delta^{(1)}_j\left(X^\top\beta^{(1)}_j + b_K(P_j)^\top\alpha^{(1)}_j/P_j\right)}_{=:\,T^{(1)\top}_K\theta^{(1)}} + r^{(1)}_K + e^{(1)} = R^{(1)\top}_K\theta^{(1)} + r^{(1)}_K + \underbrace{B^{(1)}_K + e^{(1)}}_{=:\,\xi^{(1)}_K},$$
where
$$r^{(1)}_K := \sum_{j=1}^S \delta^{(1)}_j\left[g^{(1)}_j(P_j) - b_K(P_j)^\top\alpha^{(1)}_j\right]/P_j, \qquad B^{(1)}_K := \left(T^{(1)}_K - R^{(1)}_K\right)^\top\theta^{(1)} = \sum_{j=1}^S\left(\delta^{(1)}_j - \pi_j P_j\right)\left(X^\top\beta^{(1)}_j + b_K(P_j)^\top\alpha^{(1)}_j/P_j\right),$$
$$e^{(1)} := D\epsilon^{(1)} - \sum_{j=1}^S \delta^{(1)}_j g^{(1)}_j(P_j)/P_j = \sum_{j=1}^S \delta^{(1)}_j\left[\epsilon^{(1)} - g^{(1)}_j(P_j)/P_j\right]. \qquad (A.1)$$

Let $\mathbf{Y}^{(1)} = (D_1Y_1, \ldots, D_nY_n)^\top$, $\mathbf{R}^{(1)}_K = (R^{(1)}_{1,K}, \ldots, R^{(1)}_{n,K})^\top$, $\mathbf{r}^{(1)}_K = (r^{(1)}_{1,K}, \ldots, r^{(1)}_{n,K})^\top$, $\mathbf{B}^{(1)}_K = (B^{(1)}_{1,K}, \ldots, B^{(1)}_{n,K})^\top$, $\mathbf{e}^{(1)} = (e^{(1)}_1, \ldots, e^{(1)}_n)^\top$, and $\boldsymbol{\xi}^{(1)}_K = (\xi^{(1)}_{1,K}, \ldots, \xi^{(1)}_{n,K})^\top$. The infeasible estimator for $\theta^{(1)}$ can be written as
$$\widetilde\theta^{(1)}_n = \left(\mathbf{R}^{(1)\top}_K\mathbf{R}^{(1)}_K\right)^{-1}\mathbf{R}^{(1)\top}_K\mathbf{Y}^{(1)} = \theta^{(1)} + \left[\Psi^{(1)}_{nK}\right]^{-1}\mathbf{R}^{(1)\top}_K\mathbf{r}^{(1)}_K/n + \left[\Psi^{(1)}_{nK}\right]^{-1}\mathbf{R}^{(1)\top}_K\boldsymbol{\xi}^{(1)}_K/n.$$
Similarly, noting that $DY = \widehat R^{(1)\top}_K\theta^{(1)} + \widehat\Delta^{(1)}_K + r^{(1)}_K + \xi^{(1)}_K$ with $\widehat\Delta^{(1)}_K := (R^{(1)}_K - \widehat R^{(1)}_K)^\top\theta^{(1)}$, the feasible estimator $\widehat\theta^{(1)}_n$ can be written as
$$\widehat\theta^{(1)}_n = \theta^{(1)} + \left[\widehat\Psi^{(1)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(1)\top}_K\widehat{\boldsymbol{\Delta}}^{(1)}_K/n + \left[\widehat\Psi^{(1)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(1)\top}_K\mathbf{r}^{(1)}_K/n + \left[\widehat\Psi^{(1)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(1)\top}_K\boldsymbol{\xi}^{(1)}_K/n,$$
where $\widehat{\mathbf{R}}^{(1)}_K = (\widehat R^{(1)}_{1,K}, \ldots, \widehat R^{(1)}_{n,K})^\top$ and $\widehat{\boldsymbol{\Delta}}^{(1)}_K = (\widehat\Delta^{(1)}_{1,K}, \ldots, \widehat\Delta^{(1)}_{n,K})^\top$. In the same manner, for the estimators of $\theta^{(0)}$, we have
$$\widetilde\theta^{(0)}_n - \theta^{(0)} = \left[\Psi^{(0)}_{nK}\right]^{-1}\mathbf{R}^{(0)\top}_K\mathbf{r}^{(0)}_K/n + \left[\Psi^{(0)}_{nK}\right]^{-1}\mathbf{R}^{(0)\top}_K\boldsymbol{\xi}^{(0)}_K/n,$$
$$\widehat\theta^{(0)}_n - \theta^{(0)} = \left[\widehat\Psi^{(0)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(0)\top}_K\widehat{\boldsymbol{\Delta}}^{(0)}_K/n + \left[\widehat\Psi^{(0)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(0)\top}_K\mathbf{r}^{(0)}_K/n + \left[\widehat\Psi^{(0)}_{nK}\right]^{-1}\widehat{\mathbf{R}}^{(0)\top}_K\boldsymbol{\xi}^{(0)}_K/n,$$
where the definitions of the newly introduced variables should be clear from the context.

A.2 Proof of Theorem 3.1
By (B.8) and (B.9), we observe that
$$\sqrt{n}\left(\widetilde{\mathrm{MTE}}_j(x,p) - \mathrm{MTE}_j(x,p)\right) = \sqrt{n}\left(\widetilde m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\right) - \sqrt{n}\left(\widetilde m^{(0)}_j(x,p) - m^{(0)}_j(x,p)\right) = \nabla b_K(p)^\top S_{K,j}\left[\Psi^{(1)}_K\right]^{-1}\mathbf{R}^{(1)\top}_K\boldsymbol{\xi}^{(1)}_K/\sqrt{n} + \nabla b_K(p)^\top S_{K,j}\left[\Psi^{(0)}_K\right]^{-1}\mathbf{R}^{(0)\top}_K\boldsymbol{\xi}^{(0)}_K/\sqrt{n} + o_P(\|\nabla b_K(p)\|).$$
Thus, as shown in Lemma B.4(i), the term on the left-hand side is approximated by the sum of two asymptotically normal random variables with mean zero. Note that, unlike standard treatment effect estimators, the covariance of these two terms is not zero:
$$\frac{1}{n}\sum_{l=1}^n\sum_{m=1}^n E\left[R^{(0)}_{l,K}R^{(1)\top}_{m,K}\xi^{(0)}_{l,K}\xi^{(1)}_{m,K}\,\middle|\,\{X_i, Z_i\}_{i=1}^n\right] = \frac{1}{n}\sum_{l=1}^n R^{(0)}_{l,K}R^{(1)\top}_{l,K}E\left[\xi^{(0)}_{l,K}\xi^{(1)}_{l,K}\,\middle|\,\{X_i, Z_i\}_{i=1}^n\right],$$
by $E[\xi^{(d)}_K \mid X, Z] = 0$ for both $d \in \{0,1\}$ and Assumption 3.1, and
$$E\left[\xi^{(0)}_K\xi^{(1)}_K \,\middle|\, X, Z\right] = E\left[\left(T^{(0)\top}_K\theta^{(0)} - R^{(0)\top}_K\theta^{(0)} + e^{(0)}\right)\left(T^{(1)\top}_K\theta^{(1)} - R^{(1)\top}_K\theta^{(1)} + e^{(1)}\right)\,\middle|\, X, Z\right] = -R^{(0)\top}_K\theta^{(0)}\cdot R^{(1)\top}_K\theta^{(1)} \ne 0$$
in general. The remainder of the proof is straightforward.

A.3 Proof of Corollary 3.1
By (B.8) and (B.10), we have
$$\sqrt{n}\left(\widehat{\mathrm{MTE}}_j(x,p) - \mathrm{MTE}_j(x,p)\right) = \nabla b_K(p)^\top S_{K,j}\left[\Psi^{(1)}_K\right]^{-1}\mathbf{R}^{(1)\top}_K\boldsymbol{\xi}^{(1)}_K/\sqrt{n} + \nabla b_K(p)^\top S_{K,j}\left[\Psi^{(0)}_K\right]^{-1}\mathbf{R}^{(0)\top}_K\boldsymbol{\xi}^{(0)}_K/\sqrt{n} + o_P(\|\nabla b_K(p)\|).$$
The same argument as in the proof of Theorem 3.1 leads to the desired result.
A.4 Proof of Theorem 4.1
To save space, we provide the proof for the case of $d = 1$ and $j = 1$ only. Under Assumption 4.1, we have
$$\rho_{11}(x,q,p_1,p_2) = \Pr(U \le Q, V_1 \le P_1 \mid X = x, Q = q, P_1 = p_1, P_2 = p_2) = \int_0^{p_1}\!\!\int_0^q f_{UV_1}(u,v \mid X = x)\,\mathrm{d}u\,\mathrm{d}v.$$
On the other hand, by the law of iterated expectations, it holds that
$$\psi_1(x,q,p_1,p_2) = E[Y^{(1)}_1 \mid D = 1, s = 1, X = x, Q = q, P_1 = p_1, P_2 = p_2]\,\rho_{11}(x,q,p_1,p_2) + E[Y^{(1)}_2 \mid D = 1, s = 2, X = x, Q = q, P_1 = p_1, P_2 = p_2]\,\rho_{12}(x,q,p_1,p_2) = E[Y^{(1)}_1 \mid U \le q, V_1 \le p_1, X = x]\Pr(U \le q, V_1 \le p_1 \mid X = x) + E[Y^{(1)}_2 \mid U > q, V_2 \le p_2, X = x]\Pr(U > q, V_2 \le p_2 \mid X = x) = \int_0^{p_1}\!\!\int_0^q m^{(1)}_1(x,u,v)f_{UV_1}(u,v \mid X = x)\,\mathrm{d}u\,\mathrm{d}v + \int_0^{p_2}\!\!\int_q^1 m^{(1)}_2(x,u,v)f_{UV_2}(u,v \mid X = x)\,\mathrm{d}u\,\mathrm{d}v.$$
Thus, the Leibniz integral rule completes the proof.

A.5 Proof of Theorem 4.2
We only prove the case of $s = 1$. By the law of iterated expectations, we observe that
$$\mathrm{MTE}_1(x,p) = E\left[E[Y^{(1)}_1 - Y^{(0)}_1 \mid X = x, V_1 = p, Q, s = 1]\,\middle|\, X = x, V_1 = p, s = 1\right] = \int_0^1 E[Y^{(1)}_1 - Y^{(0)}_1 \mid X = x, V_1 = p, Q = q, s = 1]\,f_Q(q \mid X = x, V_1 = p, s = 1)\,\mathrm{d}q.$$
Further, by Assumption 4.1(i),
$$E[Y^{(1)}_1 - Y^{(0)}_1 \mid X = x, V_1 = p, Q = q, s = 1] = E[Y^{(1)}_1 - Y^{(0)}_1 \mid X = x, V_1 = p, Q = q, U \le q] = \frac{\int 1\{u \le q\}\,\mathrm{MTE}_1(x,u,p)\,f_U(u \mid X = x, V_1 = p)\,\mathrm{d}u}{\Pr(U \le q \mid X = x, V_1 = p)} = \frac{\int 1\{u \le q\}\,\mathrm{MTE}_1(x,u,p)\,f_{UV_1}(u,p \mid X = x)\,\mathrm{d}u}{\Pr(U \le q \mid X = x, V_1 = p)},$$
where the last equality follows from the fact that $V_1 \sim \mathrm{Uniform}[0,1]$ conditional on $X = x$. On the other hand, Bayes' theorem implies that
$$f_Q(q \mid X = x, V_1 = p, s = 1) = \frac{\Pr(U \le q \mid X = x, V_1 = p)\,f_Q(q \mid X = x)}{\Pr(s = 1 \mid X = x, V_1 = p)}$$
under Assumption 4.1(i). By the law of iterated expectations and Assumption 4.1(i), we have
$$\Pr(s = 1 \mid X = x, V_1 = p) = E[\Pr(s = 1 \mid X = x, V_1 = p, Q) \mid X = x, V_1 = p] = \int \Pr(U \le q \mid X = x, V_1 = p, Q = q)\,f_Q(q \mid X = x, V_1 = p)\,\mathrm{d}q = \int\!\!\int 1\{u \le q\}\,f_U(u \mid X = x, V_1 = p)\,f_Q(q \mid X = x)\,\mathrm{d}q\,\mathrm{d}u = \int \Pr(u \le Q \mid X = x)\,f_{UV_1}(u,p \mid X = x)\,\mathrm{d}u.$$
Combining these results yields
$$\mathrm{MTE}_1(x,p) = \frac{\int\!\!\int \mathrm{MTE}_1(x,u,p)\,1\{u \le q\}\,f_{UV_1}(u,p \mid X = x)\,f_Q(q \mid X = x)\,\mathrm{d}q\,\mathrm{d}u}{\int \Pr(u' \le Q \mid X = x)\,f_{UV_1}(u',p \mid X = x)\,\mathrm{d}u'} = \int \mathrm{MTE}_1(x,u,p)\,\frac{\Pr(u \le Q \mid X = x)\,f_{UV_1}(u,p \mid X = x)}{\int \Pr(u' \le Q \mid X = x)\,f_{UV_1}(u',p \mid X = x)\,\mathrm{d}u'}\,\mathrm{d}u.$$
This completes the proof.

A.6 Proof of Theorem 4.3
We provide the proof of the first result only, since the second result can be shown analogously. First, observe that
$$E[D \mid Z = \mathbf{1}_S] = \sum_{j=1}^S E\left[1\{s = j\}D^{(1)}_j \,\middle|\, Z = \mathbf{1}_S\right] = \sum_{j=1}^S\left\{\Pr(D^{(1)}_j = 1, D^{(0)}_j = 0, s = j) + \Pr(D^{(1)}_j = 1, D^{(0)}_j = 1, s = j)\right\},$$
where the first equality follows from Assumption 4.2(i), and the second follows from Assumption 4.2(ii). Similarly, we can show that $E[D \mid Z = \mathbf{0}_S] = \sum_{j=1}^S \Pr(D^{(1)}_j = 1, D^{(0)}_j = 1, s = j)$ under Assumption 4.2(iii). Thus, we have
$$E[D \mid Z = \mathbf{1}_S] - E[D \mid Z = \mathbf{0}_S] = \sum_{j=1}^S \Pr(Z_j\text{-compliers}, s = j).$$
Next, observe that $E[Y \mid Z = \mathbf{1}_S] = \sum_{j=1}^S E[1\{s = j\}Y \mid Z = \mathbf{1}_S]$, and that, by the law of iterated expectations,
$$E[1\{s = j\}Y \mid Z = \mathbf{1}_S] = E[Y^{(1)}_j \mid D^{(1)}_j = 1, D^{(0)}_j = 1, s = j]\Pr(D^{(1)}_j = 1, D^{(0)}_j = 1, s = j) + E[Y^{(0)}_j \mid D^{(1)}_j = 0, D^{(0)}_j = 0, s = j]\Pr(D^{(1)}_j = 0, D^{(0)}_j = 0, s = j) + E[Y^{(1)}_j \mid D^{(1)}_j = 1, D^{(0)}_j = 0, s = j]\Pr(D^{(1)}_j = 1, D^{(0)}_j = 0, s = j),$$
by Assumption 4.2. In the same manner, it holds that
$$E[1\{s = j\}Y \mid Z = \mathbf{0}_S] = E[Y^{(1)}_j \mid D^{(1)}_j = 1, D^{(0)}_j = 1, s = j]\Pr(D^{(1)}_j = 1, D^{(0)}_j = 1, s = j) + E[Y^{(0)}_j \mid D^{(1)}_j = 0, D^{(0)}_j = 0, s = j]\Pr(D^{(1)}_j = 0, D^{(0)}_j = 0, s = j) + E[Y^{(0)}_j \mid D^{(1)}_j = 1, D^{(0)}_j = 0, s = j]\Pr(D^{(1)}_j = 1, D^{(0)}_j = 0, s = j).$$
Thus,
$$E[Y \mid Z = \mathbf{1}_S] - E[Y \mid Z = \mathbf{0}_S] = \sum_{j=1}^S E[Y^{(1)}_j - Y^{(0)}_j \mid Z_j\text{-compliers}, s = j]\Pr(Z_j\text{-compliers}, s = j).$$
This completes the proof.
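The identification argument just proved can also be checked on simulated data with $S = 2$, group-specific binary IVs, and no defiers; all parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
s = rng.integers(1, 3, n)                   # latent group membership, s in {1, 2}
Z = rng.integers(0, 2, (n, 2))              # two group-specific binary IVs
U = rng.uniform(size=n)
z_own = Z[np.arange(n), s - 1]              # Assumption 4.2(i): only the own IV matters
D = (U <= 0.3 + 0.4 * z_own).astype(int)    # compliers: 0.3 < U <= 0.7; no defiers
tau = np.where(s == 1, 1.0, 3.0)            # group-wise treatment effects (assumed)
Y = 0.5 + tau * D + rng.normal(0.0, 0.1, n)

def wald(z_hi, z_lo):
    # Wald estimand comparing Z = z_hi against Z = z_lo
    hi = (Z == np.array(z_hi)).all(axis=1)
    lo = (Z == np.array(z_lo)).all(axis=1)
    return (Y[hi].mean() - Y[lo].mean()) / (D[hi].mean() - D[lo].mean())

late_avg = wald((1, 1), (0, 0))   # complier-share-weighted average of the group LATEs
late_1 = wald((1, 0), (0, 0))     # group-1 LATE, identified via the unit vector e_1
```

With equal group shares and equal complier shares, `late_avg` should approach the simple average of the two group effects, while `late_1` isolates the group-1 LATE, matching the two displays of Theorem 4.3.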
B Appendix: Lemmas
This appendix collects the lemmas used in the proofs of Theorem 3.1 and Corollary 3.1. In the following, we present the results for $D = 1$ only; those for $D = 0$ are similar and thus omitted to save space. Below, we often use the notation $S$ to denote either $S_{X,j}$ or $S_{K,j}$, and we suppress the superscript $(1)$ when there is no confusion. For a matrix $A$, we denote by $\|A\| = \sqrt{\lambda_{\max}(A^\top A)}$ its spectral norm.

Lemma B.1. Suppose that Assumptions 3.1–3.6 hold.
(i) $\|\Psi^{(1)}_{nK} - \Psi^{(1)}_K\| = O_P(\zeta(K)\sqrt{(\log K)/n})$.
(ii) $\|[\Psi^{(1)}_{nK}]^{-1} - [\Psi^{(1)}_K]^{-1}\| = O_P(\zeta(K)\sqrt{(\log K)/n})$.
(iii) $\|S[\Psi^{(1)}_{nK}]^{-1}\mathbf{R}^{(1)\top}_K\mathbf{B}^{(1)}_K/n\| = O_P(\sqrt{\mathrm{tr}\{SS^\top\}/n})$.
(iv) $\|S[\Psi^{(1)}_{nK}]^{-1}\mathbf{R}^{(1)\top}_K\mathbf{r}^{(1)}_K/n\| = O_P(K^{-\mu})$.
(v) $\|S[\Psi^{(1)}_{nK}]^{-1}\mathbf{R}^{(1)\top}_K\mathbf{e}^{(1)}/n\| = O_P(\sqrt{\mathrm{tr}\{SS^\top\}/n})$.

Proof. (i), (ii) The proofs are the same as those of Lemma A.1(i) and (iii) in Hoshino and Yanagi (2020).
(iii) Since $E[\delta^{(1)}_j \mid X, Z] = \pi_j P_j$, we have $E[T^{(1)}_K \mid X, Z] = R^{(1)}_K$ and $E[B^{(1)}_K \mid X, Z] = 0$. Then, under Assumption 3.1, $E[B^{(1)}_{l,K}B^{(1)}_{k,K} \mid \{X_i, Z_i\}_{i=1}^n] = 0$ for any $l, k \in \{1, \ldots, n\}$ such that $l \ne k$. Additionally, observe that $\max_{1 \le i \le n}|B^{(1)}_{i,K}| = O(1)$ holds from Assumptions 3.2(i), (ii), and 3.3(i) and
$$\sup_{p \in [0,1]}\left|b_K(p)^\top\alpha^{(1)}_j\right| \le \sup_{p \in [0,1]}\left|b_K(p)^\top\alpha^{(1)}_j - g^{(1)}_j(p)\right| + \sup_{p \in [0,1]}\left|g^{(1)}_j(p)\right| = O(K^{-\mu}) + O(1), \qquad (B.1)$$
by Assumption 3.4(i). Thus, noting that $E[\mathbf{B}_K\mathbf{B}_K^\top \mid \{X_i, Z_i\}_{i=1}^n]$ is a diagonal matrix whose diagonal elements are $O(1)$, we have
$$E\left[\|S\Psi^{-1}_{nK}\mathbf{R}^\top_K\mathbf{B}_K/n\|^2 \,\middle|\, \{X_i, Z_i\}_{i=1}^n\right] = \mathrm{tr}\left\{S\Psi^{-1}_{nK}\mathbf{R}^\top_K E[\mathbf{B}_K\mathbf{B}_K^\top \mid \{X_i, Z_i\}_{i=1}^n]\mathbf{R}_K\Psi^{-1}_{nK}S^\top\right\}/n^2 \le O(1/n)\cdot\mathrm{tr}\left\{S\Psi^{-1}_{nK}\Psi_{nK}\Psi^{-1}_{nK}S^\top\right\} = O_P(\mathrm{tr}\{SS^\top\}/n),$$
where the last equality follows from Assumption 3.5(i) and result (ii). Then, the result follows from Markov's inequality.
(iv), (v) For (iv), since $\min_{1 \le i \le n}P_{ji} > 0$ for any $j$ under Assumptions 3.2(i) and (ii), we have
$$\max_{1 \le i \le n}|r^{(1)}_{i,K}| \le O(1)\cdot\sum_{j=1}^S\max_{1 \le i \le n}\left|g^{(1)}_j(P_{ji}) - b_K(P_{ji})^\top\alpha^{(1)}_j\right| = O(K^{-\mu})$$
by Assumption 3.4(i). For (v), note that
$$E[e^{(1)}\mid X,Z] = \sum_{j=1}^S\Big[E[\delta_j\epsilon^{(1)}\mid X,Z] - E[\delta_j\mid X,Z]\,g^{(1)}_j(P_j)/P_j\Big] = \sum_{j=1}^S\Big[E[\epsilon^{(1)}\mid X,Z,s=j,V_j\le P_j]\cdot\pi_jP_j - \pi_jg^{(1)}_j(P_j)\Big] = \sum_{j=1}^S\bigg[\pi_j\int_0^{P_j}E[\epsilon^{(1)}\mid s=j,V_j=v]\,\mathrm dv - \pi_jg^{(1)}_j(P_j)\bigg] = 0 \tag{B.2}$$
by Assumptions 2.1(i), 2.2, and 3.3(i), so that $E[e^{(1)}e^{(1)\top}\mid\{X_i,Z_i\}_{i=1}^n]$ is a diagonal matrix whose diagonal elements are $O(1)$ by Assumptions 3.1 and 3.6. Then, the rest of the proof is similar to that of Lemma A.2 in Hoshino and Yanagi (2020).

To prove the next lemma, define
$$\underset{S\dim(X)\times 1}{X_i} := (X_i^\top,\dots,X_i^\top)^\top,\quad \underset{SK\times 1}{b_{i,K}} := \big(b_K(P_{1i})^\top,\dots,b_K(P_{Si})^\top\big)^\top,\quad \underset{n\times S\dim(X)}{\mathbf X} := (X_1,\dots,X_n)^\top,\quad \underset{n\times SK}{\mathbf b_K} := (b_{1,K},\dots,b_{n,K})^\top,\quad \underset{n\times d_{SXK}}{W_K} := (\mathbf X,\mathbf b_K),$$
$$\underset{d_{SXK}\times 1}{\Upsilon^{(1)}_{i,K}} := \big(\pi_1P_{1i}\mathbf 1_{\dim(X)}^\top,\dots,\pi_SP_{Si}\mathbf 1_{\dim(X)}^\top,\pi_1\mathbf 1_K^\top,\dots,\pi_S\mathbf 1_K^\top\big)^\top,\quad \underset{n\times d_{SXK}}{\Upsilon^{(1)}_K} := \big(\Upsilon^{(1)}_{1,K},\dots,\Upsilon^{(1)}_{n,K}\big)^\top,$$
and define $\widehat{\mathbf b}_K$, $\widehat W_K := (\mathbf X,\widehat{\mathbf b}_K)$, and $\widehat\Upsilon^{(1)}_K$ analogously. Then, we can write $R^{(1)}_K = \Upsilon^{(1)}_K\circ W_K$ and $\widehat R^{(1)}_K = \widehat\Upsilon^{(1)}_K\circ\widehat W_K$, where $\circ$ denotes the Hadamard product.

Lemma B.2.
Suppose that Assumptions 3.1 – 3.6 hold. Then:
(i) $\big\|\widehat\Psi^{(1)}_{nK} - \Psi^{(1)}_{nK}\big\| = O_P(\zeta(K)/\sqrt n)$.
(ii) $\big\|[\widehat\Psi^{(1)}_{nK}]^{-1} - [\Psi^{(1)}_{nK}]^{-1}\big\| = O_P(\zeta(K)/\sqrt n)$.
(iii) $\big\|S[\widehat\Psi^{(1)}_{nK}]^{-1}\widehat R^{(1)\top}_K\widehat\Delta^{(1)}_K/n\big\| = O_P(n^{-1/2})$.
(iv) $\big\|S[\widehat\Psi^{(1)}_{nK}]^{-1}\widehat R^{(1)\top}_K B^{(1)}_K/n\big\| = O_P\big(\sqrt{\operatorname{tr}\{SS^\top\}/n}\big) + O_P\big(\zeta(K)\sqrt K/n\big) + O_P(\zeta(K)/n)$.
(v) $\big\|S[\widehat\Psi^{(1)}_{nK}]^{-1}\widehat R^{(1)\top}_K r^{(1)}_K/n\big\| = O_P(K^{-\mu})$.
(vi) $\big\|S[\widehat\Psi^{(1)}_{nK}]^{-1}\widehat R^{(1)\top}_K e^{(1)}/n\big\| = O_P\big(\sqrt{\operatorname{tr}\{SS^\top\}/n}\big)$.

Proof. (i) By the triangle inequality, we have
$$\big\|\widehat\Psi_{nK} - \Psi_{nK}\big\| \le \Big\|\big(\widehat R_K - R_K\big)^\top\big(\widehat R_K - R_K\big)/n\Big\| + 2\Big\|R_K^\top\big(\widehat R_K - R_K\big)/n\Big\|. \tag{B.3}$$
For the first term of (B.3), the mean value theorem and Assumption 3.2 lead to $b_K(\widehat P_{ji}) - b_K(P_{ji}) = \nabla b_K(\bar P_{ji})\cdot O_P(n^{-1/2})$, where $\bar P_{ji}$ lies between $\widehat P_{ji}$ and $P_{ji}$. Let $\nabla\bar b_{i,K} := \big(\nabla b_K(\bar P_{1i})^\top,\dots,\nabla b_K(\bar P_{Si})^\top\big)^\top$ and $\nabla\bar{\mathbf b}_K := (\nabla\bar b_{1,K},\dots,\nabla\bar b_{n,K})^\top$.
By the triangle inequality and Assumptions 3.2(iii) and 3.3(i), we have
$$\big\|\widehat R_K - R_K\big\| \le \big\|(\widehat\Upsilon_K - \Upsilon_K)\circ\widehat W_K\big\| + \big\|\Upsilon_K\circ(\widehat W_K - W_K)\big\| \le O_P(n^{-1/2})\cdot\big\{\|\widehat{\mathbf b}_K\| + \|\nabla\bar{\mathbf b}_K\|\big\} = O_P(\zeta(K)) + O_P(\zeta(K)) = O_P(\zeta(K)). \tag{B.4}$$
Thus, $\|(\widehat R_K - R_K)^\top(\widehat R_K - R_K)/n\| \le \|\widehat R_K - R_K\|^2/n = O_P(\zeta(K)^2/n)$. For the second term of (B.3), we have $\|R_K^\top(\widehat R_K - R_K)/n\|^2 = \operatorname{tr}\{(\widehat R_K - R_K)^\top R_KR_K^\top(\widehat R_K - R_K)\}/n^2 \le O_P(1/n)\cdot\|\widehat R_K - R_K\|^2 = O_P(\zeta(K)^2/n)$ by Lemma B.1(i) and (B.4). Thus, the second term is of order $O_P(\zeta(K)/\sqrt n)$, and we obtain the desired result.
(ii) The proof is the same as that of Lemma A.1(iii) in Hoshino and Yanagi (2020).
(iii) By Lemma B.1(ii), result (ii), and Assumption 3.5(i), we have
$$\big\|S\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K/n\big\|^2 = \operatorname{tr}\big\{\widehat\Delta_K^\top\widehat R_K\widehat\Psi_{nK}^{-1}S^\top S\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K\big\}/n^2 \le O_P(n^{-1})\cdot\big\|\widehat\Delta_K\big\|^2.$$
Note that $\widehat\Delta_K = \big[\widehat\Upsilon_K\circ(W_K - \widehat W_K)\big]\theta^{(1)} + \big[(\Upsilon_K - \widehat\Upsilon_K)\circ W_K\big]\theta^{(1)}$.
By the mean value theorem, it is easy to see that
$$\Big\|\big[\widehat\Upsilon_K\circ(W_K - \widehat W_K)\big]\theta^{(1)}\Big\|^2 = \sum_{i=1}^n\bigg(\sum_{j=1}^S\widehat\pi_{n,j}\big(P_{ji} - \widehat P_{ji}\big)\cdot\nabla b_K(\bar P_{ji})^\top\alpha^{(1)}_j\bigg)^2 = O_P(1)$$
from
$$\sup_{p\in[0,1]}\big|\nabla b_K(p)^\top\alpha^{(1)}_j\big| \le \sup_{p\in[0,1]}\big|\nabla b_K(p)^\top\alpha^{(1)}_j - \nabla g^{(1)}_j(p)\big| + \sup_{p\in[0,1]}\big|\nabla g^{(1)}_j(p)\big| = O(K^{-\mu}) + O(1).$$
The same argument shows that $\|[(\Upsilon_K - \widehat\Upsilon_K)\circ W_K]\theta^{(1)}\| = O_P(1)$. Thus, $\|\widehat\Delta_K\| = O_P(1)$, and we have the desired result.
(iv) By Lemma B.1(iii), we have
$$S\widehat\Psi_{nK}^{-1}\widehat R_K^\top B_K/n = S\Psi_{nK}^{-1}R_K^\top B_K/n + S\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top B_K/n + S\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top B_K/n = S\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top B_K/n + S\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top B_K/n + O_P\Big(\sqrt{\operatorname{tr}\{SS^\top\}/n}\Big).$$
For the first term,
$$\Big\|S\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top B_K/n\Big\| \le \big\|\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big\|\,\big\|R_K^\top B_K/n\big\| = O_P\big(\zeta(K)\sqrt K/n\big) \tag{B.5}$$
by result (ii) and Markov's inequality. For the second term, observe that $\|S\widehat\Psi_{nK}^{-1}(\widehat R_K - R_K)^\top B_K/n\| \le O_P(1/n)\cdot\|(\widehat R_K - R_K)^\top B_K\|$ and
$$\underset{d_{SXK}\times 1}{\big(\widehat R_K - R_K\big)^\top B_K} = \begin{pmatrix}\sum_{i=1}^n\big(\widehat\pi_{n,1}\widehat P_{1i} - \pi_1P_{1i}\big)X_iB_{i,K}\\ \vdots\\ \sum_{i=1}^n\big(\widehat\pi_{n,S}\widehat P_{Si} - \pi_SP_{Si}\big)X_iB_{i,K}\\ \sum_{i=1}^n\big[\widehat\pi_{n,1}b_K(\widehat P_{1i}) - \pi_1b_K(P_{1i})\big]B_{i,K}\\ \vdots\\ \sum_{i=1}^n\big[\widehat\pi_{n,S}b_K(\widehat P_{Si}) - \pi_Sb_K(P_{Si})\big]B_{i,K}\end{pmatrix}.$$
For each element of the right-hand side, applying a Taylor expansion to $\widehat P_{ji} = F_j(Z_{ji}^\top\widehat\gamma_{n,j})$ around $\gamma_j$ yields
$$\sum_{i=1}^n\big(\widehat\pi_{n,j}\widehat P_{ji} - \pi_jP_{ji}\big)X_iB_{i,K} = \sum_{i=1}^n\big(\widehat\pi_{n,j} - \pi_j\big)\widehat P_{ji}X_iB_{i,K} + \sum_{i=1}^n\pi_j\big(\widehat P_{ji} - P_{ji}\big)X_iB_{i,K} = \big(\widehat\pi_{n,j} - \pi_j\big)\sum_{i=1}^nP_{ji}X_iB_{i,K} + \sum_{h=1}^{\dim(Z_j)}\big(\widehat\gamma_{n,jh} - \gamma_{jh}\big)\sum_{i=1}^nZ_{jhi}\,\pi_jf_j(Z_{ji}^\top\gamma_j)X_iB_{i,K} + O_P(1),$$
and similarly
$$\sum_{i=1}^n\big[\widehat\pi_{n,j}b_K(\widehat P_{ji}) - \pi_jb_K(P_{ji})\big]B_{i,K} = \sum_{i=1}^n\big(\widehat\pi_{n,j} - \pi_j\big)b_K(\widehat P_{ji})B_{i,K} + \sum_{i=1}^n\pi_j\big(b_K(\widehat P_{ji}) - b_K(P_{ji})\big)B_{i,K} = \big(\widehat\pi_{n,j} - \pi_j\big)\sum_{i=1}^nb_K(P_{ji})B_{i,K} + \sum_{h=1}^{\dim(Z_j)}\big(\widehat\gamma_{n,jh} - \gamma_{jh}\big)\sum_{i=1}^nZ_{jhi}\,\pi_jf_j(Z_{ji}^\top\gamma_j)\nabla b_K(P_{ji})B_{i,K} + O_P(\zeta(K)).$$
For expositional simplicity, assume that $\dim(Z_j) = 1$ for all $j$. Let $\widehat\Pi_{nK}$ and $\widehat\Gamma_{nK}$ be appropriate $d_{SXK}\times d_{SXK}$ diagonal matrices with diagonal elements $(\widehat\pi_{n,j} - \pi_j)$ and $(\widehat\gamma_{n,j} - \gamma_j)$, respectively, so that we can write
$$\big(\widehat R_K - R_K\big)^\top B_K = \sqrt n\,\widehat\Pi_{nK}\sum_{i=1}^nM_{i,K} + \sqrt n\,\widehat\Gamma_{nK}\sum_{i=1}^nN_{i,K} + O_P(\zeta(K)), \tag{B.6}$$
where
$$M_{i,K} := n^{-1/2}\begin{pmatrix}P_{1i}X_iB_{i,K}\\ \vdots\\ P_{Si}X_iB_{i,K}\\ b_K(P_{1i})B_{i,K}\\ \vdots\\ b_K(P_{Si})B_{i,K}\end{pmatrix},\qquad N_{i,K} := n^{-1/2}\begin{pmatrix}Z_{1i}\pi_1f_1(Z_{1i}\gamma_1)X_iB_{i,K}\\ \vdots\\ Z_{Si}\pi_Sf_S(Z_{Si}\gamma_S)X_iB_{i,K}\\ Z_{1i}\pi_1f_1(Z_{1i}\gamma_1)\nabla b_K(P_{1i})B_{i,K}\\ \vdots\\ Z_{Si}\pi_Sf_S(Z_{Si}\gamma_S)\nabla b_K(P_{Si})B_{i,K}\end{pmatrix}.$$
Note that $E[N_{i,K}] = 0_{d_{SXK}}$ by $E[B_{i,K}\mid X_i,Z_i] = 0$, and $\bar N_K := \max_{1\le i\le n}\|N_{i,K}\| = O(\zeta(K)/\sqrt n)$ under Assumptions 3.2(i), (ii), 3.3(i), and 3.4.
Further, let $\sigma_{nK} := \max\big\{\big\|\sum_{i=1}^nE\big(N_{i,K}N_{i,K}^\top\big)\big\|,\ \big\|\sum_{i=1}^nE\big(N_{i,K}^\top N_{i,K}\big)\big\|\big\}$. It is easy to see that $\sigma_{nK} = O(\zeta(K)^2)$. Observe that $\bar N_K\sqrt{\log(d_{SXK}+1)} = O\big(\zeta(K)\sqrt{(\log K)/n}\big) = o(\sigma_{nK})$. Then, by Corollary 4.1 in Chen and Christensen (2015), we obtain $\|\sum_{i=1}^nN_{i,K}\| = O_P(\zeta(K)\sqrt{\log K})$. In a similar manner, we can show that $\|\sum_{i=1}^nM_{i,K}\| = O_P(\zeta(K)\sqrt{\log K})$. Noting that $\|\widehat\Pi_{nK}\| = O_P(n^{-1/2})$ and $\|\widehat\Gamma_{nK}\| = O_P(n^{-1/2})$, we can show that the first and second terms on the right-hand side of (B.6) are both $O_P(\zeta(K)\sqrt{\log K})$. Thus,
$$S\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top B_K/n = O_P\big(\zeta(K)\sqrt{\log K}/n\big) + O_P(\zeta(K)/n). \tag{B.7}$$
Summarizing these results, we obtain
$$S\widehat\Psi_{nK}^{-1}\widehat R_K^\top B_K/n = O_P\big(\zeta(K)\sqrt K/n\big) + O_P\big(\zeta(K)\sqrt{\log K}/n\big) + O_P(\zeta(K)/n) + O_P\Big(\sqrt{\operatorname{tr}\{SS^\top\}/n}\Big).$$
(v), (vi) The proofs are similar to the proof of Lemma A.2 in Hoshino and Yanagi (2020). For (vi), note that $E[e^{(1)}\mid D,X,Z] = E\big[E[e^{(1)}\mid D,X,Z,s]\mid D,X,Z\big] = 0$ since
$$E\big[e^{(1)}\mid D,X,Z,s=j\big] = E\bigg[\sum_{h=1}^S\delta^{(1)}_h\big(\epsilon^{(1)} - g_h(P_h)/P_h\big)\,\bigg|\,D,X,Z,s=j\bigg] = E\Big[D\big(\epsilon^{(1)} - g_j(P_j)/P_j\big)\,\Big|\,D,X,Z,s=j\Big] = 0\quad\text{for all }j.$$
Here, for a generic random variable $T$ and $q\in C(\operatorname{supp}[T])$, we define
$$\widehat{\mathcal P}^{(d)}_{n,j}q := b_K(\cdot)^\top S_{K,j}\big[\widehat\Psi^{(d)}_{nK}\big]^{-1}\frac 1 n\sum_{i=1}^n\widehat R^{(d)}_{i,K}q(T_i).$$

Lemma B.3.
Suppose that Assumptions 3.1 – 3.6 hold. If $\zeta(K)^2/\sqrt n = O(1)$ holds, then $\|\widehat{\mathcal P}^{(1)}_{n,j}\|_\infty = \|\mathcal P^{(1)}_{n,j}\|_\infty + O_P(1)$, where
$$\big\|\widehat{\mathcal P}^{(d)}_{n,j}\big\|_\infty := \sup\bigg\{\sup_{p\in[0,1]}\Big|\big(\widehat{\mathcal P}^{(d)}_{n,j}q\big)(p)\Big| : q\in C(\operatorname{supp}[T]),\ \sup_{t\in\operatorname{supp}[T]}|q(t)| = 1\bigg\}.$$

Proof.
The proof is the same as that of Lemma A.3 in Hoshino and Yanagi (2020).
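The operator sup-norm used in Lemma B.3 has a concrete finite-sample meaning for a linear smoother: writing $(\mathcal Pq)(p) = \sum_i w_i(p)q(T_i)$, the supremum over functions $q$ with $\sup_t|q(t)| \le 1$ at a fixed $p$ equals $\sum_i|w_i(p)|$. A minimal numerical sketch (the weights below are made up for illustration and are not the estimator's actual weights):

```python
import numpy as np

# Toy smoothing weights w_i(p) at one fixed evaluation point p
# (illustrative values only).
rng = np.random.default_rng(0)
w = rng.normal(size=10)

# The sup over {q : sup|q| <= 1} of |sum_i w_i(p) q(T_i)| is attained by
# q(T_i) = sign(w_i(p)), giving sum_i |w_i(p)|.
q_star = np.sign(w)
sup_value = w @ q_star

# Any other feasible q (values in [-1, 1]) cannot exceed this supremum.
q_other = np.tanh(rng.normal(size=10))
assert abs(w @ q_other) <= sup_value
assert np.isclose(sup_value, np.abs(w).sum())
```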
Lemma B.4.
Suppose that Assumptions 3.1 – 3.6 hold. For a given $p\in\operatorname{supp}[P_j\mid D=1]$, if $\|\nabla b_K(p)\|\to\infty$, $\sqrt nK^{-\mu}\to 0$, and $\sqrt nK^{-\mu}/\|\nabla b_K(p)\|\to 0$ hold, then
(i) $\sqrt n\big(\widetilde m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\big)\big/\sigma^{(1)}_{K,j}(p)\xrightarrow{d}N(0,1)$.
If, additionally, Assumption 3.7, $\zeta(K)^2/\sqrt n = O(1)$, and $\big(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1\big)\sqrt K/\|\nabla b_K(p)\|\to 0$ hold, then
(ii) $\sqrt n\big(\widehat m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\big)\big/\sigma^{(1)}_{K,j}(p)\xrightarrow{d}N(0,1)$.

Proof. (i) First, by Assumption 3.5, we have
$$\big[\sigma^{(1)}_{K,j}(p)\big]^2 = \nabla b_K(p)^\top S_{K,j}\Psi_K^{-1}\Sigma_K\Psi_K^{-1}S_{K,j}^\top\nabla b_K(p) \ge \frac{c_\Sigma}{\bar c}\cdot\|\nabla b_K(p)\|^2 > 0. \tag{B.8}$$
Next, by Lemmas B.1(iii)–(v), we have
$$\big\|\widetilde\beta^{(1)}_{n,j} - \beta^{(1)}_j\big\| \le \Big\|S_{X,j}\big[\Psi^{(1)}_{nK}\big]^{-1}R^{(1)\top}_KB^{(1)}_K/n\Big\| + \Big\|S_{X,j}\big[\Psi^{(1)}_{nK}\big]^{-1}R^{(1)\top}_Kr^{(1)}_K/n\Big\| + \Big\|S_{X,j}\big[\Psi^{(1)}_{nK}\big]^{-1}R^{(1)\top}_Ke^{(1)}/n\Big\| = O_P(n^{-1/2}) + O_P(K^{-\mu}).$$
Thus, by the definition of the infeasible estimator $\widetilde m^{(1)}_j(x,p)$ and Assumption 3.4(i),
$$\widetilde m^{(1)}_j(x,p) - m^{(1)}_j(x,p) = x^\top\big(\widetilde\beta^{(1)}_{n,j} - \beta^{(1)}_j\big) + \nabla b_K(p)^\top\widetilde\alpha^{(1)}_{n,j} - \nabla g^{(1)}_j(p) = \nabla b_K(p)^\top\big(\widetilde\alpha^{(1)}_{n,j} - \alpha^{(1)}_j\big) + O_P(n^{-1/2}) + O_P(K^{-\mu}) + O(K^{-\mu}) = A_{1n,j} + A_{2n,j} + O_P(n^{-1/2}) + O_P(K^{-\mu}) + O(K^{-\mu}),$$
where $A_{1n,j} := \nabla b_K(p)^\top S_{K,j}\Psi_{nK}^{-1}R_K^\top\xi_K/n$ and $A_{2n,j} := \nabla b_K(p)^\top S_{K,j}\Psi_{nK}^{-1}R_K^\top r_K/n$.
For $A_{2n,j}$, by Lemma B.1(iv) and $\sqrt nK^{-\mu}\to 0$, we have
$$|A_{2n,j}| \le \|\nabla b_K(p)\|\cdot\big\|S_{K,j}\Psi_{nK}^{-1}R_K^\top r_K/n\big\| = \|\nabla b_K(p)\|\cdot O_P(K^{-\mu}) = \|\nabla b_K(p)\|\cdot o_P(n^{-1/2}).$$
Define $A'_{1n,j} := \nabla b_K(p)^\top S_{K,j}\Psi_K^{-1}R_K^\top\xi_K/n$. It is easy to see that
$$|A_{1n,j} - A'_{1n,j}| \le \|\nabla b_K(p)\|\cdot\big\|S_{K,j}\big(\Psi_{nK}^{-1} - \Psi_K^{-1}\big)R_K^\top\xi_K/n\big\| = \|\nabla b_K(p)\|\cdot O_P\big(\zeta(K)\sqrt{K\log K}/n\big) = \|\nabla b_K(p)\|\cdot o_P(n^{-1/2})$$
by Lemma B.1(ii), Markov's inequality, and Assumption 3.4(ii). Thus, by (B.8), we obtain
$$\frac{\sqrt n\big(\widetilde m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\big)}{\sigma^{(1)}_{K,j}(p)} = \frac{\sqrt n(A_{1n,j} + A_{2n,j})}{\sigma^{(1)}_{K,j}(p)} + o_P(1) = \frac{\sqrt nA'_{1n,j}}{\sigma^{(1)}_{K,j}(p)} + o_P(1), \tag{B.9}$$
since we have assumed $\|\nabla b_K(p)\|\to\infty$, $\sqrt nK^{-\mu}\to 0$, and $\sqrt nK^{-\mu}/\|\nabla b_K(p)\|\to 0$.

We now show the asymptotic normality of $\sqrt nA'_{1n,j}/\sigma^{(1)}_{K,j}(p)$. Let $\phi_{ji} := \Pi_{K,j}(p)R^{(1)}_{i,K}\xi^{(1)}_{i,K}/\sqrt n$, where $\Pi_{K,j}(p) := \nabla b_K(p)^\top S_{K,j}\Psi_K^{-1}/\sigma^{(1)}_{K,j}(p)$, so that $\sum_{i=1}^n\phi_{ji} = \sqrt nA'_{1n,j}/\sigma^{(1)}_{K,j}(p)$. Since $E[B^{(1)}_K\mid X,Z] = 0$ and $E[e^{(1)}\mid X,Z] = 0$ as shown in (B.2), we have $E[\phi_{ji}] = 0$ and $\operatorname{Var}[\phi_{ji}] = n^{-1}$. Moreover, note that $E\big[(\xi^{(1)}_{i,K})^4\mid X_i,Z_i\big] = O(1)$ holds by the $c_r$-inequality with Assumption 3.6 and the uniform boundedness of $B^{(1)}_{i,K}$. Then, by the same argument as in the proof of Theorem 4.2 in Hoshino and Yanagi (2020), we obtain $\sum_{i=1}^nE[\phi_{ji}^4] = O(\zeta(K)^2K/n) = o(1)$ under Assumption 3.4(ii).
Hence, result (i) follows from Lyapunov's central limit theorem.
(ii) By Lemmas B.2(iii)–(vi), Assumption 3.4(ii), and $\sqrt nK^{-\mu}\to 0$, we have
$$\big\|\widehat\beta^{(1)}_{n,j} - \beta^{(1)}_j\big\| \le \Big\|S_{X,j}\big[\widehat\Psi^{(1)}_{nK}\big]^{-1}\widehat R^{(1)\top}_K\widehat\Delta^{(1)}_K/n\Big\| + \Big\|S_{X,j}\big[\widehat\Psi^{(1)}_{nK}\big]^{-1}\widehat R^{(1)\top}_Kr^{(1)}_K/n\Big\| + \Big\|S_{X,j}\big[\widehat\Psi^{(1)}_{nK}\big]^{-1}\widehat R^{(1)\top}_K\xi^{(1)}_K/n\Big\| = O_P(n^{-1/2}) + \underbrace{O_P\big(\zeta(K)\sqrt K/n\big) + O_P(\zeta(K)/n) + O_P(K^{-\mu})}_{=o_P(n^{-1/2})}.$$
Thus, by the definition of the feasible estimator $\widehat m^{(1)}_j(x,p)$ and Assumption 3.4(i),
$$\widehat m^{(1)}_j(x,p) - m^{(1)}_j(x,p) = x^\top\big(\widehat\beta^{(1)}_{n,j} - \beta^{(1)}_j\big) + \nabla b_K(p)^\top\widehat\alpha^{(1)}_{n,j} - \nabla g^{(1)}_j(p) = \nabla b_K(p)^\top\big(\widehat\alpha^{(1)}_{n,j} - \alpha^{(1)}_j\big) + O_P(n^{-1/2}) + O_P(K^{-\mu}) = \widehat A_{1n,j} + \widehat A_{2n,j} + \widehat A_{3n,j} + O_P(n^{-1/2}) + O_P(K^{-\mu}),$$
where
$$\widehat A_{1n,j} := \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top\xi_K/n,\quad \widehat A_{2n,j} := \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top r_K/n,\quad \widehat A_{3n,j} := \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K/n.$$
For $\widehat A_{1n,j}$, observe that
$$\widehat A_{1n,j} = A_{1n,j} + \nabla b_K(p)^\top S_{K,j}\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top\xi_K/n + \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top B_K/n + \nabla b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top e/n.$$
By the same argument as in (B.5), the second term on the right-hand side satisfies
$$\Big\|\nabla b_K(p)^\top S_{K,j}\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top\xi_K/n\Big\| \le \|\nabla b_K(p)\|\cdot\Big\|S_{K,j}\big(\widehat\Psi_{nK}^{-1} - \Psi_{nK}^{-1}\big)R_K^\top\xi_K/n\Big\| = \|\nabla b_K(p)\|\cdot O_P\big(\zeta(K)\sqrt K/n\big).$$
Similarly, we can show that the third term is of order $\|\nabla b_K(p)\|\cdot\big\{O_P\big(\zeta(K)\sqrt{\log K}/n\big) + O_P(\zeta(K)/n)\big\}$ by (B.7). For the fourth term, recalling that $E[e^{(1)}\mid D,X,Z] = 0$, we have
$$E\bigg[\Big\|S_{K,j}\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top e/n\Big\|^2\,\bigg|\,\{D_i,X_i,Z_i\}_{i=1}^n\bigg] = \operatorname{tr}\Big\{S_{K,j}\widehat\Psi_{nK}^{-1}\big(\widehat R_K - R_K\big)^\top E\big[ee^\top\mid\{D_i,X_i,Z_i\}_{i=1}^n\big]\big(\widehat R_K - R_K\big)\widehat\Psi_{nK}^{-1}S_{K,j}^\top\Big\}/n^2 \le O(1/n^2)\cdot\big\|\widehat R_K - R_K\big\|^2\cdot\operatorname{tr}\big\{S_{K,j}\widehat\Psi_{nK}^{-1}\widehat\Psi_{nK}^{-1}S_{K,j}^\top\big\} = O_P\big(\zeta(K)^2K/n^2\big)$$
by Assumption 3.6 and (B.4). Thus, by Markov's inequality, we find that the fourth term is of order $\|\nabla b_K(p)\|\cdot O_P\big(\zeta(K)\sqrt K/n\big)$.
Combining these results yields
$$\widehat A_{1n,j} = A_{1n,j} + \|\nabla b_K(p)\|\cdot\big\{O_P\big(\zeta(K)\sqrt K/n\big) + O_P(\zeta(K)/n)\big\} = A_{1n,j} + \|\nabla b_K(p)\|\cdot o_P(n^{-1/2})$$
under Assumption 3.4(ii). For $\widehat A_{2n,j}$, by Lemma B.2(v) and $\sqrt nK^{-\mu}\to 0$, we have $|\widehat A_{2n,j}| \le \|\nabla b_K(p)\|\cdot\|S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top r_K/n\| = \|\nabla b_K(p)\|\cdot o_P(n^{-1/2})$. For $\widehat A_{3n,j}$, observe that $|\widehat A_{3n,j}| \le O(\sqrt K)\cdot\sup_{p\in[0,1]}\big|b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K/n\big|$ by Assumption 3.7. Further, noting that
$$\widehat\Delta_K = \big(R_K - \widehat R_K\big)\theta^{(1)} = \sum_{h=1}^S\Big[\pi_h\big(P_h - \widehat P_h\big)X^\top\beta^{(1)}_h + \big(\pi_h - \widehat\pi_{n,h}\big)\widehat P_hX^\top\beta^{(1)}_h + \pi_h\big(b_K(P_h) - b_K(\widehat P_h)\big)^\top\alpha^{(1)}_h + \big(\pi_h - \widehat\pi_{n,h}\big)b_K(\widehat P_h)^\top\alpha^{(1)}_h\Big],$$
we write
$$b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\widehat R_K^\top\widehat\Delta_K/n = b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\bigg[\frac 1 n\sum_{i=1}^n\widehat R_{i,K}\bigg(\sum_{h=1}^S\pi_h\big(P_{hi} - \widehat P_{hi}\big)X_i^\top\beta^{(1)}_h\bigg)\bigg] + b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\bigg[\frac 1 n\sum_{i=1}^n\widehat R_{i,K}\bigg(\sum_{h=1}^S\big(\pi_h - \widehat\pi_{n,h}\big)\widehat P_{hi}X_i^\top\beta^{(1)}_h\bigg)\bigg] + b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\bigg[\frac 1 n\sum_{i=1}^n\widehat R_{i,K}\bigg(\sum_{h=1}^S\pi_h\big(b_K(P_{hi}) - b_K(\widehat P_{hi})\big)^\top\alpha^{(1)}_h\bigg)\bigg] + b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\bigg[\frac 1 n\sum_{i=1}^n\widehat R_{i,K}\bigg(\sum_{h=1}^S\big(\pi_h - \widehat\pi_{n,h}\big)b_K(\widehat P_{hi})^\top\alpha^{(1)}_h\bigg)\bigg] =: B_{1n,j}(p) + B_{2n,j}(p) + B_{3n,j}(p) + B_{4n,j}(p),\ \text{say}.$$
By Lemma B.3, if $\zeta(K)^2/\sqrt n = O(1)$,
$$|B_{1n,j}(p)| = \bigg|b_K(p)^\top S_{K,j}\widehat\Psi_{nK}^{-1}\frac 1 n\sum_{i=1}^n\widehat R_{i,K}q_n(Z_i,X_i)\bigg| = \Big|\big(\widehat{\mathcal P}^{(1)}_{n,j}q_n\big)(p)\Big| \le \big\|\widehat{\mathcal P}^{(1)}_{n,j}\big\|_\infty\cdot O_P(n^{-1/2}) = \big(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1\big)\cdot O_P(n^{-1/2})$$
for any $p\in[0,1]$, where the definition of $q_n(Z_i,X_i)$ should be clear from the context. Similarly, we can easily show that $|B_{2n,j}(p)|$, $|B_{3n,j}(p)|$, and $|B_{4n,j}(p)|$ are also of order $(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1)\cdot O_P(n^{-1/2})$ uniformly in $p\in[0,1]$. Consequently, we have
$$\widehat A_{3n,j} = \big(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1\big)\cdot O_P\big(\sqrt{K/n}\big) = \|\nabla b_K(p)\|\cdot o_P(n^{-1/2}),$$
since we have assumed $(\|\mathcal P^{(1)}_{n,j}\|_\infty + 1)\sqrt K/\|\nabla b_K(p)\|\to 0$. Therefore, we have
$$\frac{\sqrt n\big(\widehat m^{(1)}_j(x,p) - m^{(1)}_j(x,p)\big)}{\sigma^{(1)}_{K,j}(p)} = \frac{\sqrt nA_{1n,j}}{\sigma^{(1)}_{K,j}(p)} + o_P(1), \tag{B.10}$$
and the result follows from the proof of result (i).

C Appendix: Supplementary Identification Results
C.1 Identification of the Finite Mixture Probit Models
C.1.1 Exogenous membership with constant membership probability
Consider the following finite mixture Probit model:
$$D = \mathbf 1\big\{Z^\top\gamma_{zj} + \zeta_j\gamma_{\zeta j}\ge\epsilon_{Dj}\big\}\quad\text{with probability }\pi_j > 0,\ \text{for }j\in\{1,\dots,S\},$$
where $Z\in\mathbb R^{\dim(Z)}$ is a vector of common IVs among all groups, $\zeta_j\in\mathbb R$ is a group-specific continuous IV, and $\epsilon_{Dj}\sim N(0,1)$ independently of $(Z,\zeta,s)$ for all $j$, with $\zeta = (\zeta_1,\dots,\zeta_S)^\top$. Additionally, we assume $\gamma_{\zeta j}\neq 0$ for all $j$.

For this setup, we show that the coefficients $\gamma_\zeta = (\gamma_{\zeta 1},\dots,\gamma_{\zeta S})^\top$ and $\gamma_z = (\gamma_{z1}^\top,\dots,\gamma_{zS}^\top)^\top$ and the membership probabilities $\pi = (\pi_1,\dots,\pi_S)^\top$ can be identified. Letting $\mathrm x = (\mathrm x_1,\dots,\mathrm x_S)^\top$ be a given realization of $\zeta$, it holds that $\Pr(D = 1\mid Z = z,\zeta = \mathrm x) = \sum_{j=1}^S\pi_j\Phi(z^\top\gamma_{zj} + \mathrm x_j\gamma_{\zeta j})$. Thus, for each $j$,
$$\frac{\partial}{\partial\mathrm x_j}\Pr(D = 1\mid Z = z,\zeta = \mathrm x) = \pi_j\phi\big(z^\top\gamma_{zj} + \mathrm x_j\gamma_{\zeta j}\big)\gamma_{\zeta j}, \tag{C.1}$$
where $\phi$ denotes the standard normal density. Hence, for another realization $\mathrm x' = (\mathrm x'_1,\dots,\mathrm x'_S)^\top$ of $\zeta$ such that $|\mathrm x_j|\neq|\mathrm x'_j|$, we have
$$\frac{\partial\Pr(D = 1\mid Z = z,\zeta = \mathrm x)/\partial\mathrm x_j}{\partial\Pr(D = 1\mid Z = z,\zeta = \mathrm x')/\partial\mathrm x'_j} = \frac{\phi\big(z^\top\gamma_{zj} + \mathrm x_j\gamma_{\zeta j}\big)}{\phi\big(z^\top\gamma_{zj} + \mathrm x'_j\gamma_{\zeta j}\big)} = \exp\bigg(\frac 1 2\Big[\big(z^\top\gamma_{zj} + \mathrm x'_j\gamma_{\zeta j}\big)^2 - \big(z^\top\gamma_{zj} + \mathrm x_j\gamma_{\zeta j}\big)^2\Big]\bigg) = \exp\bigg(\frac 1 2\Big[\big[(\mathrm x'_j)^2 - \mathrm x_j^2\big]\gamma_{\zeta j}^2 + 2\big(\mathrm x'_j - \mathrm x_j\big)z^\top\gamma_{zj}\gamma_{\zeta j}\Big]\bigg).$$
This implies that we can obtain the following linear equation in the parameters $\gamma_{\zeta j}^2$ and $\gamma_{zj}\gamma_{\zeta j}$:
$$2\log\bigg[\frac{\partial\Pr(D = 1\mid Z = z,\zeta = \mathrm x)/\partial\mathrm x_j}{\partial\Pr(D = 1\mid Z = z,\zeta = \mathrm x')/\partial\mathrm x'_j}\bigg] = \big[(\mathrm x'_j)^2 - \mathrm x_j^2\big]\gamma_{\zeta j}^2 + 2\big(\mathrm x'_j - \mathrm x_j\big)z^\top\gamma_{zj}\gamma_{\zeta j}.$$
Note that the left-hand side can be identified from the data.
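The displayed linear equation can be checked numerically for a single group $j$; the parameter values below are purely illustrative (not estimates from the paper):

```python
import numpy as np

def phi(e):
    # Standard normal density.
    return np.exp(-0.5 * e**2) / np.sqrt(2.0 * np.pi)

# Illustrative true parameters for one group j (made up for this sketch).
gamma_z = np.array([0.5, -1.0])
gamma_zeta = 0.8
pi_j = 0.3
z = np.array([1.0, 2.0])

def deriv(xj):
    # (C.1): d/dx_j Pr(D=1 | Z=z, zeta=x) = pi_j * phi(z'gamma_zj + x_j*gamma_zeta_j) * gamma_zeta_j
    return pi_j * phi(z @ gamma_z + xj * gamma_zeta) * gamma_zeta

# Two realizations of zeta_j with |x| != |x'|.
x, xp = 0.4, 1.1
lhs = 2.0 * np.log(deriv(x) / deriv(xp))
rhs = (xp**2 - x**2) * gamma_zeta**2 + 2.0 * (xp - x) * (z @ gamma_z) * gamma_zeta
assert np.isclose(lhs, rhs)
```

Collecting such equations for several pairs $(\mathrm x, \mathrm x')$ gives the linear system in $\gamma_{\zeta j}^2$ and $\gamma_{zj}\gamma_{\zeta j}$ that the next paragraph solves.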
Then, if there are $1 + \dim(Z)$ distinct pairs $(\mathrm x, \mathrm x')$ of realizations of $\zeta$ conditional on $Z = z$ for some $z\neq 0_{\dim(Z)}$, $\gamma_{\zeta j}^2$ and $\gamma_{zj}\gamma_{\zeta j}$ can be identified by solving the system of linear equations constructed from such pairs. To identify $\gamma_{\zeta j}$ and $\gamma_{zj}$ separately, note that the sign of $\gamma_{\zeta j}$ is identified by (C.1) since $\pi_j\phi$ is positive. Hence, $\gamma_{\zeta j}$ is identified, and so is $\gamma_{zj}$. Finally, $\pi_j$ is also identified from (C.1). The above argument holds for any $j$, implying the identification of all of $\gamma_\zeta$, $\gamma_z$, and $\pi$.

C.1.2 Endogenous membership with covariate-dependent membership probability
In line with the setup in Subsection 4.1, we consider the following finite mixture Probit model with potentially endogenous group membership:
$$s = 2 - \mathbf 1\big\{Z^\top\alpha_z + W\alpha_w\ge\epsilon_s\big\},\qquad D = \mathbf 1\big\{Z^\top\gamma_{zj} + \zeta_j\gamma_{\zeta j}\ge\epsilon_{Dj}\big\}\quad\text{if }s = j,$$
where $Z\in\mathbb R^{\dim(Z)}$ is a vector of common IVs among the groups, $\zeta = (\zeta_1,\zeta_2)\in\mathbb R^2$ are group-specific continuous IVs, and $W\in\mathbb R$ is a continuous covariate that affects group membership only. The error terms $(\epsilon_s,\epsilon_{Dj})$ are independent of $(Z,\zeta,W)$ and follow the standard bivariate normal distribution with correlation parameter $\rho_j$. Suppose that the sign of $\alpha_w$ is known to be positive and that $\gamma_{\zeta j}\neq 0$ for $j\in\{1,2\}$. Finally, we assume that $W$ is distributed on the whole of $\mathbb R$.

We first show that $(\gamma_{z1},\gamma_{\zeta 1})$ can be identified (the identification of $(\gamma_{z2},\gamma_{\zeta 2})$ is symmetric and thus omitted). Letting $t(z,w) := z^\top\alpha_z + w\alpha_w$, observe that
$$\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w) = \Pr(D = 1\mid s = 1,Z = z,\zeta = \mathrm x,W = w)\Pr(s = 1\mid Z = z,\zeta = \mathrm x,W = w) + \Pr(D = 1\mid s = 2,Z = z,\zeta = \mathrm x,W = w)\Pr(s = 2\mid Z = z,\zeta = \mathrm x,W = w) = \Pr\big(\epsilon_{D1}\le z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\mid\epsilon_s\le t(z,w)\big)\Pr\big(\epsilon_s\le t(z,w)\big) + \Pr\big(\epsilon_{D2}\le z^\top\gamma_{z2} + \mathrm x_2\gamma_{\zeta 2}\mid\epsilon_s > t(z,w)\big)\Pr\big(\epsilon_s > t(z,w)\big).$$
It is easy to see that the conditional density of $\epsilon_{D1}$ given $\epsilon_s\le t(z,w)$ is
$$f_{\epsilon_{D1}}\big(e\mid\epsilon_s\le t(z,w)\big) = \frac{f_{\epsilon_{D1}}(e)}{\Pr(\epsilon_s\le t(z,w))}\Pr\big(\epsilon_s\le t(z,w)\mid\epsilon_{D1} = e\big) = \frac{\phi(e)}{\Phi(t(z,w))}\Phi\bigg(\frac{t(z,w) - \rho_1e}{\sqrt{1-\rho_1^2}}\bigg).$$
Similarly, the conditional density of $\epsilon_{D2}$ given $\epsilon_s > t(z,w)$ is
$$f_{\epsilon_{D2}}\big(e\mid\epsilon_s > t(z,w)\big) = \frac{\phi(e)}{1 - \Phi(t(z,w))}\bigg(1 - \Phi\bigg(\frac{t(z,w) - \rho_2e}{\sqrt{1-\rho_2^2}}\bigg)\bigg).$$
Thus, we have
$$\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w) = \int_{-\infty}^{z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}}\phi(e)\,\Phi\bigg(\frac{t(z,w) - \rho_1e}{\sqrt{1-\rho_1^2}}\bigg)\mathrm de + \int_{-\infty}^{z^\top\gamma_{z2} + \mathrm x_2\gamma_{\zeta 2}}\phi(e)\bigg(1 - \Phi\bigg(\frac{t(z,w) - \rho_2e}{\sqrt{1-\rho_2^2}}\bigg)\bigg)\mathrm de.$$
Taking the partial derivative with respect to $\mathrm x_1$ leads to
$$\frac{\partial}{\partial\mathrm x_1}\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w) = \gamma_{\zeta 1}\cdot\phi\big(z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big)\Phi\bigg(\frac{t(z,w) - \rho_1\big[z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big]}{\sqrt{1-\rho_1^2}}\bigg). \tag{C.2}$$
Then, since $\alpha_w > 0$, we have
$$\lim_{w\to\infty}\frac{\partial}{\partial\mathrm x_1}\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w) = \gamma_{\zeta 1}\cdot\phi\big(z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big).$$
For another realization $\mathrm x'_1$ of $\zeta_1$ such that $|\mathrm x_1|\neq|\mathrm x'_1|$, we have
$$\frac{\lim_{w\to\infty}\partial\Pr(D = 1\mid Z = z,\zeta_1 = \mathrm x_1,W = w)/\partial\mathrm x_1}{\lim_{w\to\infty}\partial\Pr(D = 1\mid Z = z,\zeta_1 = \mathrm x'_1,W = w)/\partial\mathrm x'_1} = \exp\bigg(\frac 1 2\Big[\big[(\mathrm x'_1)^2 - \mathrm x_1^2\big]\gamma_{\zeta 1}^2 + 2\big(\mathrm x'_1 - \mathrm x_1\big)z^\top\gamma_{z1}\gamma_{\zeta 1}\Big]\bigg)$$
$$\Longrightarrow\quad 2\log\bigg[\frac{\lim_{w\to\infty}\partial\Pr(D = 1\mid Z = z,\zeta_1 = \mathrm x_1,W = w)/\partial\mathrm x_1}{\lim_{w\to\infty}\partial\Pr(D = 1\mid Z = z,\zeta_1 = \mathrm x'_1,W = w)/\partial\mathrm x'_1}\bigg] = \big[(\mathrm x'_1)^2 - \mathrm x_1^2\big]\gamma_{\zeta 1}^2 + 2\big(\mathrm x'_1 - \mathrm x_1\big)z^\top\gamma_{z1}\gamma_{\zeta 1}.$$
Thus, the same argument as in the previous subsection gives the identification of $\gamma_{z1}$ and $\gamma_{\zeta 1}$.

To examine the identification of $\rho_1$, we rearrange (C.2) as follows:
$$\frac{1}{\gamma_{\zeta 1}\cdot\phi\big(z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big)}\bigg[\frac{\partial}{\partial\mathrm x_1}\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w)\bigg] = \Phi\bigg(\frac{t(z,w) - \rho_1\big[z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big]}{\sqrt{1-\rho_1^2}}\bigg)$$
$$\Longrightarrow\quad \underbrace{\Phi^{-1}\bigg(\frac{1}{\gamma_{\zeta 1}\cdot\phi\big(z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big)}\bigg[\frac{\partial}{\partial\mathrm x_1}\Pr(D = 1\mid Z = z,\zeta = \mathrm x,W = w)\bigg]\bigg)}_{=:L(z,\mathrm x_1,w)} = \frac{t(z,w) - \rho_1\big[z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big]}{\sqrt{1-\rho_1^2}}. \tag{C.3}$$
Then, we have
$$L(z,\mathrm x_1,w) - L(z,\mathrm x'_1,w) = \frac{\rho_1\big(\mathrm x'_1 - \mathrm x_1\big)\gamma_{\zeta 1}}{\sqrt{1-\rho_1^2}}\quad\Longrightarrow\quad \frac{L(z,\mathrm x_1,w) - L(z,\mathrm x'_1,w)}{(\mathrm x'_1 - \mathrm x_1)\gamma_{\zeta 1}} = \frac{\rho_1}{\sqrt{1-\rho_1^2}}.$$
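The last display identifies $c := \rho_1/\sqrt{1-\rho_1^2}$, which is strictly increasing in $\rho_1$ and inverts in closed form as $\rho_1 = c/\sqrt{1+c^2}$. A toy numerical check with an illustrative value of $\rho_1$ (not from the paper):

```python
import numpy as np

rho1 = -0.35  # illustrative true correlation

# Forward map delivered by the identification argument: c = rho / sqrt(1 - rho^2).
c = rho1 / np.sqrt(1.0 - rho1**2)

# Closed-form inversion: rho = c / sqrt(1 + c^2), which preserves the sign of c.
rho1_recovered = c / np.sqrt(1.0 + c**2)

assert np.isclose(rho1_recovered, rho1)
```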
Noting that $L(z,\mathrm x_1,w)$ and $L(z,\mathrm x'_1,w)$ are already identified, by solving the above equation, we can identify $\rho_1$. The identification of $\rho_2$ can be established in the same manner.

Finally, to identify $\alpha_w$ and $\alpha_z$, we further rearrange (C.3) as follows:
$$\sqrt{1-\rho_1^2}\,L(z,\mathrm x_1,w) + \rho_1\big[z^\top\gamma_{z1} + \mathrm x_1\gamma_{\zeta 1}\big] = z^\top\alpha_z + w\alpha_w.$$
Noting that the left-hand side is identified, we can see that $\alpha_z$ and $\alpha_w$ can be identified under appropriate rank conditions for $(Z,W)$.

C.2 Identification of PRTE
Consider a counterfactual policy that changes $P$ but does not affect $Y^{(d)}_j$, $X$, $\epsilon_{Dj}$, and $s$. Let $P^\star = (P^\star_1,\dots,P^\star_S)$ be a counterfactual version of $P$ whose distribution is known, and let $D^\star$ be the treatment status under $P^\star$. As in Assumption 2.1(i), we assume that $P^\star$ is independent of $(\epsilon^{(d)},\epsilon_{Dj},s)$ given $X$. Denote the outcome after the policy as $Y^\star$. The PRTE is defined as
$$\text{PRTE: } E[Y^\star\mid X = x] - E[Y\mid X = x].$$
Since $E[Y\mid X = x]$ is directly identified from data, we focus on the identification of $E[Y^\star\mid X = x]$. By Assumptions 2.1(i) and 2.2(i),
$$E\big[D^\star Y^{(1)}_j\mid X = x,P^\star = p^\star,s = j\big] = \int\mathbf 1\{p^\star_j\ge v_j\}\,m^{(1)}_j(x,v_j)\,\mathrm dv_j.$$
Similarly, we can show that
$$E\big[(1 - D^\star)Y^{(0)}_j\mid X = x,P^\star = p^\star,s = j\big] = \int\mathbf 1\{p^\star_j < v_j\}\,m^{(0)}_j(x,v_j)\,\mathrm dv_j.$$
As a result, by the law of iterated expectations, we obtain
$$E[Y^\star\mid X = x] = \sum_{j=1}^S\pi_jE\Big[E\big[D^\star Y^{(1)}_j + (1 - D^\star)Y^{(0)}_j\mid X = x,P^\star,s = j\big]\,\Big|\,X = x,s = j\Big] = \sum_{j=1}^S\pi_j\int\Big(\Pr\big(P^\star_j\ge v_j\mid X = x\big)m^{(1)}_j(x,v_j) + \Pr\big(P^\star_j < v_j\mid X = x\big)m^{(0)}_j(x,v_j)\Big)\mathrm dv_j.$$
In this way, we can identify the PRTE through the MTR functions.

References
Andresen, M.E., 2018. Exploring marginal treatment effects: Flexible estimation using Stata, The Stata Journal, 18(1), 118–158.
Angrist, J.D., Imbens, G.W., and Rubin, D.B., 1996. Identification of causal effects using instrumental variables, Journal of the American Statistical Association, 91(434), 444–455.
Brinch, C.N., Mogstad, M., and Wiswall, M., 2017. Beyond LATE with a discrete instrument, Journal of Political Economy, 125(4), 985–1039.
Butler, S.M. and Louis, T.A., 1997. Consistency of maximum likelihood estimators in general random effects models for binary data, The Annals of Statistics, 25(1), 351–377.
Chen, H., Chen, J., and Kalbfleisch, J.D., 2001. A modified likelihood ratio test for homogeneity in finite mixture models, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1), 19–29.
Chen, X. and Christensen, T.M., 2015. Optimal uniform convergence rates and asymptotic normality for series estimators under weak dependence and weak conditions, Journal of Econometrics, 188(2), 447–465.
Chen, X. and Christensen, T.M., 2018. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression, Quantitative Economics, 9(1), 39–84.
Compiani, G. and Kitamura, Y., 2016. Using mixtures in econometric models: a brief review and some new results, The Econometrics Journal, 19, C95–C127.
Cornelissen, T., Dustmann, C., Raute, A., and Schönberg, U., 2016. From LATE to MTE: Alternative methods for the evaluation of policy interventions, Labour Economics, 41, 47–60.
Deb, P. and Gregory, C.A., 2018. Heterogeneous impacts of the supplemental nutrition assistance program on food insecurity, Economics Letters, 173, 55–60.
Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B, 39(1), 1–22.
Doyle Jr., J.J., 2007. Child protection and child outcomes: Measuring the effects of foster care, American Economic Review, 97(5), 1583–1610.
Follmann, D.A. and Lambert, D., 1991. Identifiability of finite mixtures of logistic regression models, Journal of Statistical Planning and Inference, 27(3), 375–381.
Harris, J.E. and Sosa-Rubi, S.G., 2009. Impact of "Seguro Popular" on prenatal visits in Mexico, 2002–2005: Latent class model of count data with a discrete endogenous variable, NBER Working Paper 14995.
Heckman, J.J. and Pinto, R., 2018. Unordered monotonicity, Econometrica, 86(1), 1–35.
Heckman, J.J. and Vytlacil, E.J., 1999. Local instrumental variables and latent variable models for identifying and bounding treatment effects, Proceedings of the National Academy of Sciences, 96(8), 4730–4734.
Heckman, J.J. and Vytlacil, E.J., 2005. Structural equations, treatment effects, and econometric policy evaluation, Econometrica, 73(3), 669–738.
Holland, A.D., 2017. Penalized spline estimation in the partially linear model, Journal of Multivariate Analysis, 153, 211–235.
Hoshino, T. and Yanagi, T., 2020. Treatment effect models with strategic interaction in treatment decisions, arXiv preprint arXiv:1810.08350.
Huang, J.Z., 2003. Local asymptotics for polynomial spline regression, The Annals of Statistics, 31(5), 1600–1635.
Imbens, G.W. and Angrist, J.D., 1994. Identification and estimation of local average treatment effects, Econometrica, 62(2), 467–475.
Kitagawa, T. and Tetenov, A., 2018. Who should be treated? Empirical welfare maximization methods for treatment choice, Econometrica, 86(2), 591–616.
McLachlan, G. and Peel, D., 2004. Finite Mixture Models, Wiley, New York.
Mogstad, M. and Torgovitsky, A., 2018. Identification and extrapolation of causal effects with instrumental variables, Annual Review of Economics, 10, 577–613.
Munkin, M.K. and Trivedi, P.K., 2010. Disentangling incentives effects of insurance coverage from adverse selection in the case of drug expenditure: a finite mixture approach, Health Economics, 19(9), 1093–1108.
Newey, W.K., 1997. Convergence rates and asymptotic normality for series estimators, Journal of Econometrics, 79(1), 147–168.
Rubin, D.B., 1974. Estimating causal effects of treatments in randomized and nonrandomized studies, Journal of Educational Psychology, 66(5), 688.
Samoilenko, M., Blais, L., Boucoiran, I., and Lefebvre, G., 2018. Using a mixture-of-bivariate-regressions model to explore heterogeneity of effects of the use of inhaled corticosteroids on gestational age and birth weight among pregnant women with asthma, American Journal of Epidemiology, 187(9), 2046–2059.
Shea, J. and Torgovitsky, A., 2020. ivmte: An R package for implementing marginal treatment effect methods, Working Paper, Becker Friedman Institute, 2020-1.
Train, K.E., 2008. EM algorithms for nonparametric estimation of mixing distributions, Journal of Choice Modelling, 1(1), 40–69.
Tropp, J.A., 2012. User-friendly tail bounds for sums of random matrices, Foundations of Computational Mathematics, 12(4), 389–434.
Foundations of Computational Mathematics ,12 (4), 389–434.Vytlacil, E.J., 2002. Independence, monotonicity, and latent index models: An equivalence result,
Econometrica , 70 (1),331–341.Zhou, X. and Xie, Y., 2019. Marginal treatment effects from a propensity score perspective,
Journal of Political Economy ,127 (6), 3070–3084.Zhu, H.T. and Zhang, H., 2004. Hypothesis testing in mixture regression models,
Journal of the Royal Statistical Society:Series B (Statistical Methodology) , 66 (1), 3–16., 66 (1), 3–16.