Generalized Local IV with Unordered Multiple Treatment Levels: Identification, Efficient Estimation, and Testable Implication
arXiv [econ.EM]
Identification, Efficient Estimation, and Testable Implication*
Haitian Xie†
January 2020
Abstract
This paper studies the econometric aspects of the generalized local IV framework defined using the unordered monotonicity condition, which accommodates multiple levels of treatment and instrument in program evaluations. The framework is explicitly developed to allow for conditioning covariates. Nonparametric identification results are obtained for a wide range of policy-relevant parameters. Semiparametric efficiency bounds are computed for these identified structural parameters, including the local average structural function and the local average structural function on the treated. Two semiparametric estimators are introduced that achieve efficiency. One is the conditional expectation projection estimator defined through the nonparametric identification equation. The other is the double/debiased machine learning estimator defined through the efficient influence function, which is suitable for high-dimensional settings. More generally, for parameters implicitly defined by possibly non-smooth and overidentifying moment conditions, this study provides the calculation of the corresponding semiparametric efficiency bounds and proposes efficient semiparametric GMM estimators, again using the efficient influence functions. Then an optimal set of testable implications of the model assumption is proposed. Previous results developed for the binary local IV model and the multivalued treatment model under unconfoundedness are encompassed as special cases in this more general framework. The theoretical results are illustrated by an empirical application investigating the return to schooling across different fields of study, and a Monte Carlo experiment.
Keywords: Generalized Local IV, Multi-valued Treatment, Unordered Monotonicity, Semiparametric Efficiency, Efficient Estimation, Optimal Testable Implication, Return to Schooling.

* The author is grateful to Kaspar Wuthrich, Yixiao Sun, Graham Elliott, and Yu-Chang Chen for comments and suggestions.
† Department of Economics, University of California San Diego. Email: [email protected].

Introduction
Since the seminal works of Imbens and Angrist (1994) and Angrist et al. (1996), the local instrumental variable has become a popular method for causal inference in economics. Instead of imposing homogeneity of the treatment effects among individuals, as in the classical IV regression model, the local IV framework allows for heterogeneous treatment effects. To achieve identification, however, the treatment in practice has to be collapsed into one binary indicator. But oftentimes, treatments in economically relevant programs are multi-leveled in nature. They can be ordered, such as tax rates, years of schooling, and numbers of cigarettes smoked; or unordered, such as different job training programs, fields of study in college, and vouchers to various housing opportunities. The unordered case is more general than the ordered one, since ordered treatment levels can also be considered as unordered. The question then becomes how to evaluate programs at a finer level within the local IV framework, incorporating the multiplicity in treatment levels. One possible solution is given in Heckman and Pinto (2018a) and Pinto (2019), who use their unordered monotonicity condition to generalize the identification results of the binary local IV model to situations with multiple unordered levels of treatment. The extension from binary treatments to multi-valued ones further demonstrates the source of identification in the local IV model making use of the monotonicity conditions. In the current study, this broader framework is referred to as the generalized local IV model.

The marginal benefits of introducing multiple levels of treatment into the local IV model are twofold. First, as mentioned above, there are many empirical cases where treatments are explicitly multi-valued. Collapsing these levels together is not very useful for a detailed analysis of the program effects, which covers estimation and inference for parameters of finer subpopulations defined by the way the treatment choice varies with the instrument.
Second, when the binary treatment is further divided into multiple values, efficiency gains are possible, provided that overidentifying restrictions are justified by the underlying economic theory. Conversely, such theories can be tested through these restrictions, which are defined by parameters available only when the multiplicity of treatment is modeled.

This paper is concerned with the econometric aspects of the generalized local IV model, which turns the identification results into applicable methods in empirical research. The framework is first extended to allow for conditioning covariates, which is important because, often in observational studies, the instrument is only valid conditional on other factors. The conditioning issue here suffers the same problem as in the binary local IV case, since the subpopulations are not identified. Heckman and Pinto (2018a) and Pinto (2019) focused, among other things, on the identification of the conditional local average structural function (LASF) and type probabilities. Additional identification results, in the unconditional sense, are obtained in this paper for more policy-relevant parameters, including the local average structural function on the treated (LASF-T). Semiparametric efficiency bounds are computed for a wide range of structural parameters, including the LASFs and LASF-Ts. For these parameters with explicit definitions, conditional expectation projection (CEP) estimators, defined as semiparametric two-step estimators through the identification equations, are shown to achieve the efficiency bound, which is analogous to results in the literature (Frölich, 2007; Hahn, 1998; Hong and Nekipelov, 2010a). Efficiency is also proven for the double/debiased machine learning (DML) estimators (Chernozhukov et al., 2018) defined through the efficient influence functions. These estimators are well suited to modern high-dimensional settings since their moment conditions satisfy the Neyman orthogonality condition with respect to the nonparametric nuisance parameters.
More generally, for parameters implicitly defined by possibly non-smooth and overidentifying moment conditions, this study provides the calculation of the semiparametric efficiency bounds and proposes efficient semiparametric GMM estimators defined through the efficient influence functions. Two important cases can be incorporated: one is quantile estimation (Melly and Wüthrich, 2017; Firpo, 2007), and the other is the aforementioned case where the underlying economic theory provides overidentification. Optimal joint inferences can be conducted across different treatment levels, based on the semiparametric efficient estimations. The assumption of the generalized local IV model is refutable but not verifiable. An optimal set of testable implications of the model assumptions is proposed, in the sense that the refutation of this particular set of implications necessarily leads to the rejection of all implications of the model assumptions.

The literature on semiparametric efficiency in program evaluation starts with the seminal work of Hahn (1998), which studies the benchmark case of estimating the average treatment effect (ATE) under unconfoundedness. When endogeneity is present, that is, in the framework of local

Footnote: The problem of joint inference is not salient in the binary local IV model since there is usually only a single parameter of interest.
The general framework is presented in this section. We have a treatment variable $T$ taking values in the unordered set $\mathcal{T} = \{t_1, \cdots, t_{N_T}\}$. The instrument $Z$ takes values in the unordered set $\mathcal{Z} = \{z_1, \cdots, z_{N_Z}\}$. The random variables $(Y_{t_1}, \cdots, Y_{t_{N_T}})$, with $Y_t \in \mathcal{Y} \subset \mathbb{R}$, $t \in \mathcal{T}$, represent the potential outcomes under each treatment level. These are assumed to have finite second moments. The random variables $(T_{z_1}, \cdots, T_{z_{N_Z}})$, with $T_z \in \mathcal{T}$, $z \in \mathcal{Z}$, represent the potential treatment status under each instrument level. The random vector $X$ is a set of covariates, which takes values in $\mathcal{X} \subset \mathbb{R}^{d_X}$. Also define the random vector $S = (T_{z_1}, \cdots, T_{z_{N_Z}})$, which denotes the type of the individual. Let $\mathcal{S} \subset \mathcal{T}^{N_Z}$ be the support of $S$. The observed treatment and observed outcome are defined as $T = \sum_{z \in \mathcal{Z}} 1\{Z = z\} T_z$ and $Y = \sum_{t \in \mathcal{T}} 1\{T = t\} Y_t$, respectively. An equivalent formulation uses structural equations, as in Heckman and Pinto (2018a) and Pinto (2019). Denoting the function and the determined random variable by the same notation, the treatment and outcome can be defined as $T = T(Z, X, V)$ and $Y = Y(T, X, V, \epsilon)$, where $Z, V, \epsilon$ are mutually independent conditional on $X$. Under this formulation, $T_z = T(z, X, V)$, $Y_t = Y(t, X, V, \epsilon)$, and $S = (T(z_1, X, V), \cdots, T(z_{N_Z}, X, V))$.

The following notations are used throughout the paper. Let $\pi(X) = (\pi_{z_1}(X), \cdots, \pi_{z_{N_Z}}(X))$, where $\pi_z(X) = P(Z = z \mid X)$. For any $t \in \mathcal{T}$, let $P_{t,z}(X) = P(T = t \mid Z = z, X)$ and $P_{t,Z}(X) = (P_{t,z_1}, \cdots, P_{t,z_{N_Z}})$. For any measurable $g: \mathcal{Y} \to \mathbb{R}$, let $g_{t,z} = E[g(Y) 1\{T = t\} \mid Z = z, X]$ and $g_{t,Z} = (g_{t,z_1}, \cdots, g_{t,z_{N_Z}})$. Let $B_t$ be the $N_Z \times N_S$ binary matrix whose $(i,j)$th element is $1\{s_j[i] = t\}$, and denote its Moore-Penrose inverse by $B_t^+$.

Footnote: It is convenient to assume this in the beginning, since the main focus of the paper is on efficiency.
Let $\Sigma_{t,k} \subset \mathcal{S}$, $k = 0, \cdots, N_Z$, be the set of types in which treatment $t$ appears exactly $k$ times, i.e., $\Sigma_{t,k} = \{ s \in \mathcal{S} : \sum_{i=1}^{N_Z} 1\{s[i] = t\} = k \}$. Note that the $\Sigma_{t,k}$'s form a partition of the type configuration, i.e., $\mathcal{S} = \bigsqcup_{k=0}^{N_Z} \Sigma_{t,k}$. Let $\tilde{b}_{t,k} = b_{t,k} B_t^+$, where $b_{t,k} = (1\{s_1 \in \Sigma_{t,k}\}, \cdots, 1\{s_{N_S} \in \Sigma_{t,k}\})$. The main assumption of the generalized local IV model is presented below.

Assumption 1.
Generalized Local IV:
(i) Conditional independence: $(\{Y_t : t \in \mathcal{T}\}, \{T_z : z \in \mathcal{Z}\}) \perp Z \mid X$.
(ii) Type constraint: the type support $\mathcal{S}$ is known and satisfies, for any $t \in \mathcal{T}$ and $z, z' \in \mathcal{Z}$, either $1\{T_z = t\} \ge 1\{T_{z'} = t\}$ or $1\{T_z = t\} \le 1\{T_{z'} = t\}$.
(iii) First stage: for all $z, z' \in \mathcal{Z}$ and $t \in \mathcal{T}$, it holds that $\pi_z(X) \ge \underline{\pi} > 0$ and $P(T_z = t \mid X) \ne P(T_{z'} = t \mid X)$.

These assumptions are essentially the multi-valued analog of those used in Abadie (2003). Assumption 1(i) concerns the validity of the instrument conditional on $X$. Assumption 1(ii) is the unordered monotonicity constraint on the type configuration $\mathcal{S}$. It means that a shift in the instrument moves all agents uniformly toward or against each possible treatment choice (Heckman and Pinto, 2018a). As pointed out by Vytlacil (2002), the LATE-type monotonicity condition is a restriction across individuals on the relationship between different hypothetical treatment choices defined in terms of an instrument. Assumption 1(iii) requires that the instrument has some effect on the selection of each treatment level, and also implies that the support of $X$ does not change with the value of $Z$. The exclusion restrictions of the instrument from the outcomes are already imposed in the definition of the potential outcomes. The observed data is assumed to be an IID sample $(Y_i, T_i, Z_i, X_i)_{i=1}^n$.

Footnote: The concept of $B_t$ is defined using $\mathcal{S}$, and hence is nonrandom and does not depend on $X$. The original definition of $B_t$ in Heckman and Pinto (2018a) involves random variables, and is difficult to state unambiguously in the presence of $X$. I thank Yixiao Sun for helpful suggestions on this.
Footnote: An equivalent condition for unordered monotonicity, shown by Heckman and Pinto (2018a), is that $B_t$ is lonesum for any $t \in \mathcal{T}$.
Footnote: The strong overlapping assumption is imposed here for estimation. For identification, it suffices to impose the weaker condition $\pi_z(X) \in (0, 1)$.

The main example of this paper has treatment levels $\mathcal{T} = \{t_1, t_2, t_3\}$ and instrument levels $\mathcal{Z} = \{z_1, z_2\}$.
Note that the indexing is only for convenience; the levels are intrinsically unordered. The type configuration $\mathcal{S}$ is specified below.

              s_1   s_2   s_3   s_4   s_5
   T_{z_1}:   t_1   t_2   t_3   t_1   t_2
   T_{z_2}:   t_1   t_2   t_3   t_3   t_3

The unordered monotonicity is satisfied: $1\{T_{z_1} = t_1\} \ge 1\{T_{z_2} = t_1\}$, $1\{T_{z_1} = t_2\} \ge 1\{T_{z_2} = t_2\}$, and $1\{T_{z_1} = t_3\} \le 1\{T_{z_2} = t_3\}$. The type partitions are, for $t_1$: $\Sigma_{t_1,0} = \{s_2, s_3, s_5\}$, $\Sigma_{t_1,1} = \{s_4\}$, $\Sigma_{t_1,2} = \{s_1\}$; for $t_2$: $\Sigma_{t_2,0} = \{s_1, s_3, s_4\}$, $\Sigma_{t_2,1} = \{s_5\}$, $\Sigma_{t_2,2} = \{s_2\}$; and for $t_3$: $\Sigma_{t_3,0} = \{s_1, s_2\}$, $\Sigma_{t_3,1} = \{s_4, s_5\}$, $\Sigma_{t_3,2} = \{s_3\}$. The $B_t$'s and their generalized inverses are
\[
B_{t_1} = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad
B_{t_2} = \begin{pmatrix} 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \end{pmatrix}, \quad
B_{t_3} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix},
\]
\[
B_{t_1}^+ = \begin{pmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \\ 1 & -1 \\ 0 & 0 \end{pmatrix}, \quad
B_{t_2}^+ = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \\ 1 & -1 \end{pmatrix}, \quad
B_{t_3}^+ = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ -0.5 & 0.5 \\ -0.5 & 0.5 \end{pmatrix}.
\]
The $b_{t,k}$'s, and hence the $\tilde{b}_{t,k}$'s, are displayed in the following table.

               b_{t,1}        b_{t,2}        b~_{t,1}   b~_{t,2}
   t = t_1    (0,0,0,1,0)    (1,0,0,0,0)    (1,-1)     (0,1)
   t = t_2    (0,0,0,0,1)    (0,1,0,0,0)    (1,-1)     (0,1)
   t = t_3    (0,0,0,1,1)    (0,0,1,0,0)    (-1,1)     (1,0)

Footnote: These five groups are similarly defined in Kline and Walters (2016) for their analysis of the Head Start program.

In this example, $\pi_{z_1}(X) = P(Z = z_1 \mid X)$, $\pi_{z_2}(X) = P(Z = z_2 \mid X)$, and
\[ P_{t_j,Z}(X) = \big( P(T = t_j \mid Z = z_1, X),\ P(T = t_j \mid Z = z_2, X) \big), \quad j = 1, 2, 3. \]
For any measurable $g$, we have
\[ g_{t_j,Z}(X) = \big( E[g(Y) 1\{T = t_j\} \mid Z = z_1, X],\ E[g(Y) 1\{T = t_j\} \mid Z = z_2, X] \big), \quad j = 1, 2, 3. \]

Conditioning on $X$, Heckman and Pinto (2018a) establish the identification of the probabilities of $S$ lying in any of the $\Sigma_{t,k}$'s and the conditional distribution of $Y_t$ given $S \in \Sigma_{t,k}$. This result is presented in Appendix B as Lemma 2.
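As a numerical check on this algebra, the matrices $B_t$, their Moore-Penrose inverses, and the weights $\tilde{b}_{t,k}$ can be computed directly from a type configuration. A minimal sketch in Python (the encoding of types as matrix columns, with treatments coded 1, 2, 3, is my own choice):

```python
import numpy as np

# Type configuration of the main example: column j gives (T_{z1}, T_{z2}) for type s_j.
S_types = np.array([[1, 2, 3, 1, 2],   # row 1: T_{z1} for s1..s5
                    [1, 2, 3, 3, 3]])  # row 2: T_{z2} for s1..s5

def B(t):
    """N_Z x N_S binary matrix with (i, j) entry 1{s_j[i] = t}."""
    return (S_types == t).astype(float)

def b(t, k):
    """Row indicator of the types in which treatment t appears exactly k times."""
    return ((S_types == t).sum(axis=0) == k).astype(float)

def b_tilde(t, k):
    """The weight vector b_{t,k} B_t^+ appearing in the identification results."""
    return b(t, k) @ np.linalg.pinv(B(t))

# b_tilde(1, 1) equals (1, -1): it picks out P(T=t1|z1) - P(T=t1|z2);
# b_tilde(3, 1) equals (-1, 1); b_tilde(3, 2) equals (1, 0).
```

Running `b_tilde` over all pairs $(t, k)$ with $k \ge 1$ reproduces the table of $\tilde{b}_{t,k}$'s above.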
Bayes' rule is applied to turn these conditional identification results into unconditional ones. In particular, the difficulty here, that the conditional distribution of $X \mid S \in \Sigma_{t,k}$ is unidentified, is similar to that in the classical local IV model. However, using Bayes' rule, this unidentified distribution can be represented as
\[ f_{X \mid S \in \Sigma_{t,k}}(x) = \frac{P(S \in \Sigma_{t,k} \mid X = x)}{P(S \in \Sigma_{t,k})} f_X(x), \]
which is identified.

Theorem 1.
For $t \in \mathcal{T}$, $k = 1, \cdots, N_Z$, and $g$ measurable, the following quantities are identified.
(i) Type probabilities:
\[ p_{t,k} \equiv P(S \in \Sigma_{t,k}) = \tilde{b}_{t,k} E[P_{t,Z}(X)]. \quad (1) \]
(ii) Mean potential outcome conditioning on type:
\[ E[g(Y_t) \mid S \in \Sigma_{t,k}] = \frac{1}{p_{t,k}} \tilde{b}_{t,k} E[g_{t,Z}(X)]. \quad (2) \]

The above theorem provides identification for quantities solely related to the type $S$ of the individual. The next results concern quantities related to both $S$ and the actual treatment received $T$. In particular, the conditional distribution of $Y_t$ given $T = t$ and $S \in \Sigma_{t,k}$ might be of more interest in the current setting. This is because the main source of identification lies in the structural functions, the $Y_t$'s, instead of the treatment effects. Hence it is more attractive to study the expectation of $Y_t$ inside the subpopulation whose treatment is actually $t$. The following theorem deals with the identification of distributions relevant to this idea.

Footnote: The term LASF is reserved for the more specific case of $g$ being the identity map.

Theorem 2.
For $t \in \mathcal{T}$, $k = 1, \cdots, N_Z$, and $g$ measurable, if for some $t' \in \mathcal{T}$ there exists $W_{t',t,k} \subset \mathcal{Z}$ such that
\[ S \in \Sigma_{t,k},\ T_z = t' \iff S \in \Sigma_{t,k},\ z \in W_{t',t,k}, \]
and denoting $\pi_W(X) = \sum_{z \in W} \pi_z(X)$ for $W \subset \mathcal{Z}$, then the following quantities are identified.
(i) Treatment status and type probability:
\[ q_{t',t,k} \equiv P(T = t', S \in \Sigma_{t,k}) = \tilde{b}_{t,k} E\big[ P_{t,Z}(X)\, \pi_{W_{t',t,k}}(X) \big]. \quad (3) \]
(ii) Mean potential outcome conditioning on treatment status and type:
\[ E[g(Y_t) \mid T = t', S \in \Sigma_{t,k}] = \frac{1}{q_{t',t,k}} \tilde{b}_{t,k} E\big[ g_{t,Z}(X)\, \pi_{W_{t',t,k}}(X) \big]. \quad (4) \]
In particular, the probability
\[ q_{t,k} \equiv P(T = t, S \in \Sigma_{t,k}), \quad (5) \]
and the mean potential outcome
\[ E[g(Y_t) \mid T = t, S \in \Sigma_{t,k}] = \frac{1}{q_{t,k}} \tilde{b}_{t,k} E\big[ g_{t,Z}(X)\, \pi_{W_{t,k}}(X) \big], \quad (6) \]
for the $t$-treated subpopulation with type $S \in \Sigma_{t,k}$, are always identified, where $q_{t,k} = q_{t,t,k}$ and $W_{t,k} \equiv W_{t,t,k}$.

To the best of my knowledge, this result is new in the literature. The set $W_{t',t,k}$ defined in the theorem contains the $z$'s such that, inside the subpopulation with type $S \in \Sigma_{t,k}$, individuals assigned $z$ take, and only take, the treatment $t'$. The unordered monotonicity condition guarantees that such a $W_{t,k}$ always exists for any pair of $t, k$. In the binary local IV case, it is shown in Hong and Nekipelov (2010a), as a special case, that inside the subpopulation of treated compliers, both of the structural functions are identified.

In most cases, the parameters of interest are the average structural functions identifiable in certain subpopulations. They are defined by taking $g = I$, the identity map on $\mathcal{Y}$, in equations (2) and (4), which gives the LASFs and LASF-Ts displayed below:
\[ \beta_{t,k} \equiv E[Y_t \mid S \in \Sigma_{t,k}] = \frac{1}{p_{t,k}} \tilde{b}_{t,k} E[I_{t,Z}(X)], \qquad \gamma_{t',t,k} \equiv E[Y_t \mid T = t', S \in \Sigma_{t,k}] = \frac{1}{q_{t',t,k}} \tilde{b}_{t,k} E\big[ I_{t,Z}(X)\, \pi_{W_{t',t,k}}(X) \big]. \quad (7) \]
As before, let $\gamma_{t,k} = \gamma_{t,t,k}$.
To make the ideas concrete, the identification in the main example is computed for the case of $\Sigma_{t_1,1} = \{s_4\}$; the other parameters can be computed in the same way. By Theorem 1,
\[ p_{t_1,1} = P(S = s_4) = E\big[ P(T = t_1 \mid Z = z_1, X) - P(T = t_1 \mid Z = z_2, X) \big], \]
\[ \beta_{t_1,1} = E[Y_{t_1} \mid S = s_4] = \frac{1}{p_{t_1,1}} E\big[ E[Y 1\{T = t_1\} \mid Z = z_1, X] - E[Y 1\{T = t_1\} \mid Z = z_2, X] \big]. \]
For $\Sigma_{t_1,1} = \{s_4\}$, $W_{t_1,t_1,1} = \{z_1\}$, thus by Theorem 2,
\[ q_{t_1,1} = P(T = t_1, S = s_4) = E\big[ \big( P(T = t_1 \mid Z = z_1, X) - P(T = t_1 \mid Z = z_2, X) \big) P(Z = z_1 \mid X) \big], \]
\[ \gamma_{t_1,1} = E[Y_{t_1} \mid T = t_1, S = s_4] = \frac{1}{q_{t_1,1}} E\big[ \big( E[Y 1\{T = t_1\} \mid Z = z_1, X] - E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) P(Z = z_1 \mid X) \big]. \]

A subtle point here is that the parameters $p_{t,k}$ and $q_{t',t,k}$ can potentially be overidentified. For example, $\Sigma_{t_3,1} = \{s_4, s_5\} = \Sigma_{t_1,1} \cup \Sigma_{t_2,1}$, leading to two ways to calculate the probability $P(S \in \{s_4, s_5\})$. In this case, these ways give rise to identical expressions for $P(S \in \{s_4, s_5\})$, namely,
\[
\begin{aligned}
P(S = s_4) + P(S = s_5) &= E\big[ P(T = t_1 \mid Z = z_1, X) - P(T = t_1 \mid Z = z_2, X) \big] + E\big[ P(T = t_2 \mid Z = z_1, X) - P(T = t_2 \mid Z = z_2, X) \big] \\
&= E\big[ P(T \in \{t_1, t_2\} \mid Z = z_1, X) - P(T \in \{t_1, t_2\} \mid Z = z_2, X) \big] \\
&= E\big[ P(T = t_3 \mid Z = z_2, X) - P(T = t_3 \mid Z = z_1, X) \big] \\
&= P(S \in \{s_4, s_5\}).
\end{aligned}
\]
Thus, it can be shown that there is no overidentifying restriction on the observable distribution in the main example. However, there are cases satisfying Assumption 1 that impose overidentifying restrictions. Consider removing $s_5$ from the configuration $\mathcal{S}$, so that $S$ can only take on the four values shown in the following table.

              s_1   s_2   s_3   s_4
   T_{z_1}:   t_1   t_2   t_3   t_1
   T_{z_2}:   t_1   t_2   t_3   t_3

The unordered monotonicity is still satisfied. But it is automatically imposed that
\[ P(S = s_4) = E\big[ P(T = t_1 \mid Z = z_1, X) - P(T = t_1 \mid Z = z_2, X) \big] = E\big[ P(T = t_3 \mid Z = z_2, X) - P(T = t_3 \mid Z = z_1, X) \big], \quad (8) \]
which is an overidentifying restriction.
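The identification formulas of Theorems 1 and 2 can be checked by simulation on the main example. The sketch below uses a hypothetical DGP of my own (all numeric probabilities are illustrative): a binary covariate $X$, an instrument that is valid only conditional on $X$, and type probabilities that depend on $X$.

```python
import numpy as np

# Hypothetical DGP: Z and the type S both depend on X, but are independent
# given X, so Assumption 1(i) holds conditionally.
rng = np.random.default_rng(0)
n = 200_000
X = rng.integers(0, 2, n)
pz2 = np.where(X == 1, 0.6, 0.4)             # P(Z = z2 | X)
Z = 1 + (rng.random(n) < pz2).astype(int)    # Z in {1, 2}
probs = np.column_stack([np.where(X == 1, p1, p0)
                         for p0, p1 in [(.2, .1), (.2, .2), (.2, .3), (.2, .2), (.2, .2)]])
S_i = 1 + (rng.random(n)[:, None] > probs.cumsum(axis=1)).sum(axis=1)  # type 1..5
T = np.where(Z == 1, np.array([1, 2, 3, 1, 2])[S_i - 1],
                     np.array([1, 2, 3, 3, 3])[S_i - 1])

def cond_prob(t, z):
    """P(T = t | Z = z, X = x_i) by cell means (X is binary, so this is exact)."""
    out = np.empty(n)
    for x in (0, 1):
        out[X == x] = np.mean(T[(Z == z) & (X == x)] == t)
    return out

# Theorem 1: p_{t1,1} = P(S = s4) = E[P(T=t1|Z=z1,X) - P(T=t1|Z=z2,X)].
p_t1_1 = np.mean(cond_prob(1, 1) - cond_prob(1, 2))
# Theorem 2 with W_{t1,t1,1} = {z1}: q_{t1,1} = P(T = t1, S = s4); the weight
# pi_{z1}(X) = 1 - pz2 is known here because the DGP is simulated.
q_t1_1 = np.mean((cond_prob(1, 1) - cond_prob(1, 2)) * (1 - pz2))
```

Up to simulation error, `p_t1_1` and `q_t1_1` agree with the direct (infeasible) frequencies `np.mean(S_i == 4)` and `np.mean((S_i == 4) & (T == 1))`.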
These types of overidentifying restrictions are arguably unnecessary, since the benefit from removing impossible configurations is small compared to the cost of falsely eliminating a type that exists in the true DGP. In the next section on efficient estimation of local average structural functions, Assumption 1 is strengthened so that the parameters in Theorem 1 and Theorem 2 are exactly identified. This is not restrictive, since it is satisfied by the binary local IV model, the main example in this paper, and the examples in Heckman and Pinto (2018a). Moreover, when overidentification is present, the consistency and asymptotic normality of the proposed estimators, and their efficiency for the exactly-identified target parameter, remain valid regardless. Further discussion of this issue is beyond the scope of this paper.

This section calculates the semiparametric efficiency bounds for $\beta_{t,k}$, $\gamma_{t',t,k}$, $p_{t,k}$, and $q_{t',t,k}$, and proposes semiparametric efficient estimators. Other policy-relevant parameters that can be derived from the above quantities are also discussed afterwards. The more general case of non-smooth parameters with overidentifying constraints is considered in the next section. In this section and the next one, for simplicity, the notation for parameters, including the nuisance ones, is used to represent both the true value and a general value in the parameter space. When necessary, a superscript "$o$" is placed to signify the true value. The following theorem presents the efficiency bounds using efficient influence functions.

Theorem 3.
Consider any $t \in \mathcal{T}$, $k = 1, \cdots, N_Z$, and $t'$ satisfying the condition in Theorem 2.
(i) The semiparametric efficiency bound for $\beta_{t,k}$ is given by the variance of the efficient influence function
\[
\begin{aligned}
\Psi_{\beta_{t,k}}(Y, T, Z, X, \beta_{t,k}, p_{t,k}, I_{t,Z}, P_{t,Z}, \pi)
&= \frac{1}{p_{t,k}} \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota Y 1\{T = t\} - I_{t,Z}(X)) + I_{t,Z}(X) \big) \\
&\quad - \frac{\beta_{t,k}}{p_{t,k}} \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota 1\{T = t\} - P_{t,Z}(X)) + P_{t,Z}(X) \big),
\end{aligned} \quad (9)
\]
where $\iota$ denotes a column vector of ones, and $\zeta(Z, X, \pi)$ is a diagonal matrix with diagonal elements
\[ \left( \frac{1\{Z = z_1\}}{\pi_{z_1}(X)}, \cdots, \frac{1\{Z = z_{N_Z}\}}{\pi_{z_{N_Z}}(X)} \right). \]
(ii) The semiparametric efficiency bound for $\gamma_{t',t,k}$ is given by the variance of the efficient influence function
\[
\begin{aligned}
\Psi_{\gamma_{t',t,k}}(Y, T, Z, X, \gamma_{t',t,k}, q_{t',t,k}, I_{t,Z}, P_{t,Z}, \pi)
&= \frac{1}{q_{t',t,k}} \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota Y 1\{T = t\} - I_{t,Z}(X)) \pi_{W_{t',t,k}}(X) + I_{t,Z}(X) 1\{Z \in W_{t',t,k}\} \big) \\
&\quad - \frac{\gamma_{t',t,k}}{q_{t',t,k}} \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota 1\{T = t\} - P_{t,Z}(X)) \pi_{W_{t',t,k}}(X) + P_{t,Z}(X) 1\{Z \in W_{t',t,k}\} \big).
\end{aligned} \quad (10)
\]
(iii) The semiparametric efficiency bound for $p_{t,k}$ is given by the variance of the efficient influence function
\[ \Psi_{p_{t,k}}(T, Z, X, p_{t,k}, P_{t,Z}, \pi) = \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota 1\{T = t\} - P_{t,Z}(X)) + P_{t,Z}(X) \big) - p_{t,k}. \quad (11) \]
(iv) The semiparametric efficiency bound for $q_{t',t,k}$ is given by the variance of the efficient influence function
\[ \Psi_{q_{t',t,k}}(T, Z, X, q_{t',t,k}, P_{t,Z}, \pi) = \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota 1\{T = t\} - P_{t,Z}(X)) \pi_{W_{t',t,k}}(X) + P_{t,Z}(X) 1\{Z \in W_{t',t,k}\} \big) - q_{t',t,k}. \quad (12) \]

Notice that for the case of binary treatment and binary instrument, the first two parts of Theorem 3 reduce to Theorem 2 of Hong and Nekipelov (2010a).
The structure of the efficient influence functions is interpretable in the view of Newey (1994). In $\Psi_{\beta_{t,k}}$, the terms $\tilde{b}_{t,k} \zeta(Z, X, \pi)(\iota Y 1\{T = t\} - I_{t,Z}(X))$ and $\tilde{b}_{t,k} \zeta(Z, X, \pi)(\iota 1\{T = t\} - P_{t,Z}(X))$ serve as the correction terms due to the presence of the unknown infinite-dimensional nuisance parameters $I_{t,Z}$ and $P_{t,Z}$, respectively. In $\Psi_{\gamma_{t',t,k}}$, the correction term also contains $\pi_{W_{t',t,k}}$, which accounts for the fact that $\pi$ is unknown. The derivation of this decomposition is more apparent in the proof of Theorem 4.

The program evaluation literature has been concerned with the role of the propensity score in efficient estimation. In the current context, $\pi$ represents the proper concept of the propensity score. Observations in the proof of Theorem 3 indicate that the efficiency bound of $\beta_{t,k}$ is not affected by knowledge of the propensity score. However, the $\gamma_{t',t,k}$'s can be estimated more efficiently if the propensity score is known. To be more specific, the part of the score function that corresponds to $\gamma_{t',t,k}$ explicitly involves the propensity score, and hence so does its pathwise derivative. Similarly, the efficiency in the estimation of the $q_{t',t,k}$'s, but not the $p_{t,k}$'s, is affected by knowledge of the propensity score.

For efficient estimation, one possible way is to use the CEP estimators common in the literature (Hahn, 1998; Frölich, 2007; Chen et al., 2008; Hong and Nekipelov, 2010a). The methodology is to first estimate the conditional expectations $\pi$, $P_{t,Z}$, and $I_{t,Z}$, and then use the identification results directly as moment conditions. The asymptotic linear representation of these estimators can be computed using the method developed in Newey (1994). More specifically, for $z \in \mathcal{Z}$, define two sets of conditional expectations by
\[ h_{Y,t,z}(X) = E[1\{Z = z\} Y 1\{T = t\} \mid X], \qquad h_{t,z}(X) = E[1\{Z = z\} 1\{T = t\} \mid X]. \]

Footnote: See for example Hahn (1998); Frölich (2007); Hong and Nekipelov (2010a); Chen et al. (2008).
Let $\hat{h}_{Y,t,z}$, $\hat{h}_{t,z}$, $\hat{\pi}_z$ denote nonparametric estimators, such as kernel estimators or series estimators. Notice that the conditional expectations are related by $I_{t,z} = h_{Y,t,z} / \pi_z$ and $P_{t,z} = h_{t,z} / \pi_z$ for $z \in \mathcal{Z}$; hence define $\hat{I}_{t,z} = \hat{h}_{Y,t,z} / \hat{\pi}_z$ and $\hat{P}_{t,z} = \hat{h}_{t,z} / \hat{\pi}_z$. The vector estimators $\hat{I}_{t,Z}$, $\hat{P}_{t,Z}$, $\hat{\pi}$, and the vector functions $h_{Y,t,Z}$, $h_{t,Z}$ are stacked in the obvious way. Also, let $\hat{\pi}_{W_{t',t,k}} = \sum_{z \in W_{t',t,k}} \hat{\pi}_z$. Define the estimators $\hat{p}_{t,k}$, $\hat{q}_{t',t,k}$, $\hat{\beta}_{t,k}$, and $\hat{\gamma}_{t',t,k}$ by
\[ \hat{p}_{t,k} = \frac{1}{n} \sum_{i=1}^n \tilde{b}_{t,k} \hat{P}_{t,Z}(X_i), \quad (13) \]
\[ \hat{q}_{t',t,k} = \frac{1}{n} \sum_{i=1}^n \tilde{b}_{t,k} \hat{P}_{t,Z}(X_i)\, \hat{\pi}_{W_{t',t,k}}(X_i), \quad (14) \]
\[ \hat{\beta}_{t,k} = \frac{1}{\hat{p}_{t,k}\, n} \sum_{i=1}^n \tilde{b}_{t,k} \hat{I}_{t,Z}(X_i), \quad (15) \]
\[ \hat{\gamma}_{t',t,k} = \frac{1}{\hat{q}_{t',t,k}\, n} \sum_{i=1}^n \tilde{b}_{t,k} \hat{I}_{t,Z}(X_i)\, \hat{\pi}_{W_{t',t,k}}(X_i). \quad (16) \]

The efficient influence functions also provide a way of conducting optimal joint inferences. The efficient influence function for a vector of parameters is the collection of the efficient influence functions corresponding to the parameters. The variance-covariance matrix for efficient estimators can be calculated accordingly. More concretely, let $\kappa = (\kappa_1, \cdots, \kappa_K)$ be a vector of parameters selected from the identified ones $\{\beta_{t,k}, \gamma_{t',t,k}, p_{t,k}, q_{t',t,k}\}$. Then the efficient influence function of $\kappa$ is $\Psi_\kappa = (\Psi_{\kappa_1}, \cdots, \Psi_{\kappa_K})$, and the efficiency bound is $E[\Psi_\kappa \Psi_\kappa']$. A natural plug-in estimator is
\[ \hat{V}_\kappa = \frac{1}{n} \sum_{i=1}^n \hat{\Psi}_{\kappa,i} \hat{\Psi}_{\kappa,i}', \quad (17) \]
where $\hat{\Psi}_{\kappa,i}$ is defined by plugging in the data observation of index $i$, the CEP estimates $\hat{\kappa}$, and the nonparametric estimates $\hat{I}_{t,z}$, $\hat{P}_{t,z}$, and $\hat{\pi}_z$. For example, when $\kappa = \beta_{t,k}$, then
\[ \hat{\Psi}_{\beta_{t,k},i} = \Psi_{\beta_{t,k}}(Y_i, T_i, Z_i, X_i, \hat{\beta}_{t,k}, \hat{p}_{t,k}, \hat{I}_{t,Z}, \hat{P}_{t,Z}, \hat{\pi}). \]

Footnote: The reason for using $h_{Y,t,z}$ and $h_{t,z}$ as primitive estimators, instead of $I_{t,z}$ and $P_{t,z}$, is that they are simple conditional expectations with theoretical appeal.
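Once the first-stage fits are in hand, equations (13)-(16) are simple sample averages. A minimal sketch (the function name, array layout, and the toy constant-in-$X$ numbers in the demo are my own):

```python
import numpy as np

def cep_estimates(b_tilde, I_hat, P_hat, pi_hat, W):
    """CEP estimators (13)-(16).
    b_tilde: (N_Z,) weight vector; I_hat, P_hat, pi_hat: (n, N_Z) arrays of
    first-stage fitted values at each X_i; W: column indices of W_{t',t,k}."""
    pi_W = pi_hat[:, W].sum(axis=1)                        # pi_hat_W(X_i)
    p_hat = np.mean(P_hat @ b_tilde)                       # (13)
    q_hat = np.mean((P_hat @ b_tilde) * pi_W)              # (14)
    beta_hat = np.mean(I_hat @ b_tilde) / p_hat            # (15)
    gamma_hat = np.mean((I_hat @ b_tilde) * pi_W) / q_hat  # (16)
    return p_hat, q_hat, beta_hat, gamma_hat

# Demo with toy first-stage fits that are constant in X (binary instrument,
# b_tilde = (1, -1), W = {z1}): p = 0.2, q = 0.1, beta = gamma = 2.
p_hat, q_hat, beta_hat, gamma_hat = cep_estimates(
    np.array([1.0, -1.0]),
    I_hat=np.tile([1.0, 0.6], (4, 1)),
    P_hat=np.tile([0.5, 0.3], (4, 1)),
    pi_hat=np.tile([0.5, 0.5], (4, 1)),
    W=[0])
```

In practice `I_hat`, `P_hat`, and `pi_hat` would come from kernel, series, or other nonparametric regressions satisfying the rate conditions of Theorem 4 below.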
Since the CEP estimators are efficient, their asymptotic covariance matrix can be estimated by the plug-in estimators of the efficiency bounds. This leads to optimal joint inferences, where the optimality follows from Section 25.6 of van der Vaart (1998), where it is stated that semiparametric efficiency of the estimators leads to (locally) asymptotically uniformly most powerful tests.

The following theorem summarizes the properties of the CEP estimation procedure defined above. For each $t$ and $z$, let $\mathcal{H}_{Y,t,z}$, $\mathcal{H}_{t,z}$, and $\Pi_z$ be the spaces of functions containing the true nuisance parameters $h^o_{Y,t,z}$, $h^o_{t,z}$, and $\pi^o_z$, respectively. For any small enough $\delta > 0$, let
\[ \mathcal{H}^\delta_{Y,t,z} = \big\{ h_{Y,t,z} \in \mathcal{H}_{Y,t,z} : \| h_{Y,t,z} - h^o_{Y,t,z} \| \le \delta \big\}, \]
and let $\mathcal{H}^\delta_{t,z}$ and $\Pi^\delta_z$ be defined analogously.

Theorem 4.
Consider $t \in \mathcal{T}$. For any $z \in \mathcal{Z}$, assume the following conditions hold.
(i) The convergence rates of the nonparametric estimators satisfy $\sqrt{n}\, \| \hat{h}_{Y,t,z} - h^o_{Y,t,z} \|^2 = o_p(1)$, $\sqrt{n}\, \| \hat{h}_{t,z} - h^o_{t,z} \|^2 = o_p(1)$, and $\sqrt{n}\, \| \hat{\pi}_z - \pi^o_z \|^2 = o_p(1)$.
(ii) There exists some $\delta > 0$ such that the classes $\mathcal{H}^\delta_{Y,t,z}$, $\mathcal{H}^\delta_{t,z}$, and $\Pi^\delta_z$ are Donsker, with $E\big[ \sup_{h_{Y,t,z} \in \mathcal{H}^\delta_{Y,t,z}} | h_{Y,t,z}(X) |^2 \big] < \infty$.
Then, for $k = 1, \cdots, N_Z$, the estimators $\hat{\beta}_{t,k}$, $\hat{\gamma}_{t',t,k}$, $\hat{p}_{t,k}$, and $\hat{q}_{t',t,k}$ are semiparametric efficient for $\beta_{t,k}$, $\gamma_{t',t,k}$, $p_{t,k}$, and $q_{t',t,k}$, respectively. Moreover, the plug-in estimator $\hat{V}_\kappa$ for the efficiency bound, defined in equation (17), is consistent.

Condition (i) is a standard requirement on the convergence rate of nonparametric estimators in the semiparametric two-step estimation literature (Newey, 1994; Newey and McFadden, 1994). Condition (ii) is also standard; it requires that the functional spaces containing the infinite-dimensional nuisance parameters are not too complex, so that the stochastic equicontinuity condition holds. The reason this type of "limited information" estimator works is well explained in Ackerberg et al. (2014). The estimation problem here falls into their general semiparametric model where the parameter of interest is defined by possibly overidentifying unconditional moment restrictions and the nuisance functions are defined by exactly identifying conditional moment restrictions. They showed that semiparametric two-step optimally weighted GMM estimators achieve the efficiency bound; these are the CEP estimators in this case, since the parameters of interest are exactly identified.
Discussions related to this phenomenon can also be found in Chen and Santos (2018).

Back to the main example, for $\beta_{t_1,1}$, the efficient influence function is
\[
\begin{aligned}
\Psi_{\beta_{t_1,1}} &= \frac{1}{p_{t_1,1}} \left( \frac{1\{Z = z_1\}}{P(Z = z_1 \mid X)} \big( Y 1\{T = t_1\} - E[Y 1\{T = t_1\} \mid Z = z_1, X] \big) + E[Y 1\{T = t_1\} \mid Z = z_1, X] \right) \\
&\quad - \frac{1}{p_{t_1,1}} \left( \frac{1\{Z = z_2\}}{P(Z = z_2 \mid X)} \big( Y 1\{T = t_1\} - E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) + E[Y 1\{T = t_1\} \mid Z = z_2, X] \right) \\
&\quad - \frac{\beta_{t_1,1}}{p_{t_1,1}} \left( \frac{1\{Z = z_1\}}{P(Z = z_1 \mid X)} \big( 1\{T = t_1\} - E[1\{T = t_1\} \mid Z = z_1, X] \big) + E[1\{T = t_1\} \mid Z = z_1, X] \right) \\
&\quad + \frac{\beta_{t_1,1}}{p_{t_1,1}} \left( \frac{1\{Z = z_2\}}{P(Z = z_2 \mid X)} \big( 1\{T = t_1\} - E[1\{T = t_1\} \mid Z = z_2, X] \big) + E[1\{T = t_1\} \mid Z = z_2, X] \right),
\end{aligned}
\]
and the CEP estimator is
\[ \hat{\beta}_{t_1,1} = \frac{ \sum_{i=1}^n \big( \hat{h}_{Y,t_1,z_1}(X_i) / \hat{\pi}_{z_1}(X_i) \big) - \big( \hat{h}_{Y,t_1,z_2}(X_i) / \hat{\pi}_{z_2}(X_i) \big) }{ \sum_{i=1}^n \big( \hat{h}_{t_1,z_1}(X_i) / \hat{\pi}_{z_1}(X_i) \big) - \big( \hat{h}_{t_1,z_2}(X_i) / \hat{\pi}_{z_2}(X_i) \big) }. \]

Besides the efficient estimation results shown above, it might also be of interest to efficiently estimate other policy-relevant parameters whose identification can be derived from the aforementioned parameters $(\beta_{t,k}, \gamma_{t',t,k}, p_{t,k}, q_{t',t,k})$. Some examples are discussed here. The ratio $q_{t',t,k} / p_{t,k} = P(T = t' \mid S \in \Sigma_{t,k})$ can be understood as the conditional probability of taking treatment $t'$ given that the type $S$ belongs to $\Sigma_{t,k}$. The average structural function local to the subpopulation whose type $S$ belongs to any of the $\Sigma_{t,k}$, $k = 1, \cdots, N_Z - 1$, is
\[ \beta_t \equiv E[Y_t \mid S \in \Sigma_t] = \frac{ \sum_{k=1}^{N_Z - 1} \beta_{t,k}\, p_{t,k} }{ \sum_{k=1}^{N_Z - 1} p_{t,k} }, \quad (18) \]
where $\Sigma_t = \bigcup_{k=1}^{N_Z - 1} \Sigma_{t,k}$ is referred to as the set of $t$-switchers in Heckman and Pinto (2018a), meaning that individuals in this subpopulation switch between $t$ and other treatments when given different levels of the instrument.
It is a generalization of the concept of compliers in the binary local IV framework. Similarly, one can also define
\[ \gamma_t \equiv E[Y_t \mid T = t, S \in \Sigma_t] = \frac{ \sum_{k=1}^{N_Z - 1} \gamma_{t,k}\, q_{t,k} }{ \sum_{k=1}^{N_Z - 1} q_{t,k} }, \quad (19) \]
which represents the average structural function local to the subpopulation of $t$-treated $t$-switchers. By Theorem 2, this parameter is always identified. Some treatment effects can also be identified and estimated through the parameters discussed in this section. This point is illustrated using the main example, which also appears in Heckman and Pinto (2018a). Consider the quantity
\[ \beta_{t_3,1} - \frac{ \beta_{t_1,1}\, p_{t_1,1} + \beta_{t_2,1}\, p_{t_2,1} }{ p_{t_1,1} + p_{t_2,1} } = \frac{ E[Y_{t_3} - Y_{t_1} \mid S = s_4] P(S = s_4) + E[Y_{t_3} - Y_{t_2} \mid S = s_5] P(S = s_5) }{ P(S \in \{s_4, s_5\}) }, \quad (20) \]
which represents the average treatment effect of $t_3$ against the other treatments within the subpopulation of $t_3$-switchers. Analogously, the quantity
\[ \gamma_{t_3,1} - \frac{ \gamma_{t_3,t_1,1}\, q_{t_3,t_1,1} + \gamma_{t_3,t_2,1}\, q_{t_3,t_2,1} }{ q_{t_3,t_1,1} + q_{t_3,t_2,1} } = \frac{ E[Y_{t_3} - Y_{t_1} \mid T = t_3, S = s_4] P(T = t_3, S = s_4) + E[Y_{t_3} - Y_{t_2} \mid T = t_3, S = s_5] P(T = t_3, S = s_5) }{ P(T = t_3, S \in \{s_4, s_5\}) } \quad (21) \]
can be understood as the average treatment effect of $t_3$ against the other treatments within the subpopulation of $t_3$-treated $t_3$-switchers.

More generally, let $\phi = \phi(p, q, \beta, \gamma)$ be a finite-dimensional parameter, where $\phi(\cdot)$ is a known continuously differentiable function, $p$ is the vector containing all identifiable $p_{t,k}$'s, and $q, \beta, \gamma$ are defined analogously. A natural estimator can be defined through the CEP estimates, $\phi(\hat{p}, \hat{q}, \hat{\beta}, \hat{\gamma})$. A delta method argument helps calculate the efficiency bound of $\phi$ and show the efficiency of $\phi(\hat{p}, \hat{q}, \hat{\beta}, \hat{\gamma})$. In fact, following Theorem 25.47 of van der Vaart (1998), and Theorems 3 and 4, the corollary below is immediate, which, in particular, solves the issue of efficient estimation for the several examples illustrated above.

Corollary 1.
The semiparametric efficiency bound for $\varphi$ is given by the variance of the efficient influence function
$$
\Psi_\varphi=\sum_{p\in\mathbf p}\frac{\partial\varphi}{\partial p}\Psi_p+\sum_{q\in\mathbf q}\frac{\partial\varphi}{\partial q}\Psi_q+\sum_{\beta\in\boldsymbol\beta}\frac{\partial\varphi}{\partial\beta}\Psi_\beta+\sum_{\gamma\in\boldsymbol\gamma}\frac{\partial\varphi}{\partial\gamma}\Psi_\gamma,\tag{22}
$$
where the partial derivatives are evaluated at the true parameter values. Moreover, the plug-in estimator $\varphi(\hat p,\hat q,\hat\beta,\hat\gamma)$, based on the CEP estimators $\hat p,\hat q,\hat\beta,\hat\gamma$, achieves the efficiency bound.

The role of the efficient influence functions discussed above is mainly in calculating the efficiency bounds. They can also be used to generate a collection of moment conditions that achieve efficient estimation directly. These moment conditions possess the feature that the first-step estimation of the nuisance functions does not affect the asymptotic variance, which is straightforward to verify using Proposition 3 in Newey (1994). This feature can be further exploited in the DML methodology, which is suitable in high-dimensional settings where the Donsker property in Condition 4(ii) can no longer be satisfied. More formally, the efficient influence function satisfies the Neyman orthogonality condition, which means reduced sensitivity with respect to the nuisance parameters, the $I_{t,Z}$'s, the $P_{t,Z}$'s, and $\pi$. Together with appropriate data-splitting methods, moment estimators constructed from Neyman-orthogonal moment conditions are often employed in data-rich environments where the nuisance parameters are "highly complex", e.g., where the dimension of the covariates $X$ grows with the sample size $n$. Here I explain how to implement, in this specific setting, the DML method introduced in Chernozhukov et al. (2018) to efficiently estimate $\beta_{t,k}$ when the dimension of $X$ is larger than the sample size. The cross-fitting method starts by taking an $L$-fold random partition of the data such that the size of each fold is $n/L$. Then, for $l=1,\cdots,L$, let $I_l$ denote the observation indices in the $l$-th fold and $I_l^c=\bigcup_{l'\neq l}I_{l'}$.
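The cross-fitting scheme just described can be sketched end-to-end in a simplified binary-instrument case. Ordinary least squares stands in for the machine-learning first stage, the DGP and fold count are illustrative only, and the orthogonal moment used is the standard doubly robust moment for a local average effect, structurally analogous to the construction in the text rather than a transcription of it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 6000, 5
X = rng.normal(size=(n, 3))                       # covariates
Z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # instrument propensity depends on X
S = rng.choice([0, 1, 2], n, p=[0.2, 0.5, 0.3])   # always / complier / never types
T = np.where(S == 0, 1, np.where(S == 1, Z, 0))
Y = 1.0 + 1.0 * T + X[:, 0] + rng.normal(size=n)  # true local effect = 1

def fit_lin(W, V):
    """Least-squares learner, a stand-in for any ML first-stage estimator."""
    A = np.column_stack([np.ones(len(W)), W])
    coef, *_ = np.linalg.lstsq(A, V, rcond=None)
    return lambda Wn: np.column_stack([np.ones(len(Wn)), Wn]) @ coef

folds = rng.permutation(n) % L                    # random L-fold partition
num = den = 0.0
for l in range(L):
    tr, te = folds != l, folds == l               # nuisances fit on I_l^c, used on I_l
    piz = np.clip(fit_lin(X[tr], Z[tr])(X[te]), 0.05, 0.95)
    hY = {z: fit_lin(X[tr & (Z == z)], Y[tr & (Z == z)])(X[te]) for z in (0, 1)}
    hT = {z: fit_lin(X[tr & (Z == z)], T[tr & (Z == z)])(X[te]) for z in (0, 1)}
    w1, w0 = Z[te] / piz, (1 - Z[te]) / (1 - piz)
    num += np.sum(w1 * (Y[te] - hY[1]) + hY[1] - (w0 * (Y[te] - hY[0]) + hY[0]))
    den += np.sum(w1 * (T[te] - hT[1]) + hT[1] - (w0 * (T[te] - hT[0]) + hT[0]))
beta_dml = num / den                              # DML2: one ratio over all folds
```

Note the DML2 feature: the orthogonalized numerator and denominator are accumulated across all folds first, and a single ratio is taken at the end.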
Also, define $\check I^l_{t,Z}$, $\check P^l_{t,Z}$, and $\check\pi^l$ to be the nonparametric machine learning estimates using the data from $i\in I_l^c$. The associated moment condition is based on equation (9), namely
$$
E\Big[\tilde b_{t,k}\Big(\zeta(Z,X,\pi)\big(\iota\,(Y\,1\{T=t\})-I_{t,Z}(X)\big)+I_{t,Z}(X)\Big)-\beta_{t,k}\,\tilde b_{t,k}\Big(\zeta(Z,X,\pi)\big(\iota\,1\{T=t\}-P_{t,Z}(X)\big)+P_{t,Z}(X)\Big)\Big]=0.\tag{23}
$$
(I am grateful to Kaspar Wuthrich for suggesting this. Works on this topic include Belloni et al. (2014, 2017) and Chernozhukov et al. (2018). The cases of estimating $\gamma_{t',t,k}$, $p_{t,k}$, and $q_{t',t,k}$ are essentially the same and are thus omitted for brevity.) The DML estimator $\check\beta_{t,k}$ is defined by
$$
\check\beta_{t,k}=\frac{\sum_{l=1}^{L}\sum_{i\in I_l}\tilde b_{t,k}\Big(\zeta(Z_i,X_i,\check\pi^l)\big(\iota\,(Y_i\,1\{T_i=t\})-\check I^l_{t,Z}(X_i)\big)+\check I^l_{t,Z}(X_i)\Big)}{\sum_{l=1}^{L}\sum_{i\in I_l}\tilde b_{t,k}\Big(\zeta(Z_i,X_i,\check\pi^l)\big(\iota\,1\{T_i=t\}-\check P^l_{t,Z}(X_i)\big)+\check P^l_{t,Z}(X_i)\Big)}.\tag{24}
$$
This is the DML2 estimator defined in Chernozhukov et al. (2018) with $L$-fold cross-fitting. There is another estimation procedure, called the DML1 estimator, in Chernozhukov et al. (2018); it is not discussed here since DML1 and DML2 are asymptotically equivalent and DML2 is generally recommended by the authors. The variance estimator is given by
$$
\check V_{\beta_{t,k}}=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_l}\Big(\Psi_{\beta_{t,k}}\big(Y_i,T_i,Z_i,X_i,\check\beta_{t,k},\check p_{t,k},\check I^l_{t,Z},\check P^l_{t,Z},\check\pi^l\big)\Big)^2,\tag{25}
$$
where
$$
\check p_{t,k}=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_l}\tilde b_{t,k}\Big(\zeta(Z_i,X_i,\check\pi^l)\big(\iota\,1\{T_i=t\}-\check P^l_{t,Z}(X_i)\big)+\check P^l_{t,Z}(X_i)\Big).\tag{26}
$$

Theorem 5.
Let $\delta_n\geq n^{-1/2}$ and $\Delta_n$ be sequences of positive constants approaching zero. Also, let $C>0$ and $q>2$ be fixed constants, and let $L\geq 2$ be a fixed integer. Assume the following conditions hold for any joint distribution $P\in\mathcal P$ of the quadruple $(Y,T,Z,X)$.
(i) The variance bound for $\beta_{t,k}$ calculated in Theorem 3 is strictly positive.
(ii) $\max\big\{\|I^o_{t,Z}\|_q,\ \|\iota\,Y\,1\{T=t\}-I^o_{t,Z}\|_q\big\}\leq C$.
(iii) With probability no less than $1-\Delta_n$,
$$
\max\big\{\|\check I_{t,Z}-I^o_{t,Z}\|_q,\ \|\check P_{t,Z}-P^o_{t,Z}\|_q,\ \|\check\pi-\pi^o\|_q\big\}\leq C,\qquad \max\big\{\|\check I_{t,Z}-I^o_{t,Z}\|_2,\ \|\check P_{t,Z}-P^o_{t,Z}\|_2,\ \|\check\pi-\pi^o\|_2\big\}\leq\delta_n,
$$
and, for any $z\in\mathcal Z$, $\check\pi_z$ is bounded below by a positive constant and
$$
\|\check\pi_z-\pi^o_z\|_2\times\big(\|\check I_{t,z}-I^o_{t,z}\|_2+\|\check P_{t,z}-P^o_{t,z}\|_2\big)\leq n^{-1/2}\delta_n.
$$
Then the estimator $\check\beta_{t,k}$ obeys
$$
V^{-1/2}_{\beta_{t,k}}\,\sqrt n\,\big(\check\beta_{t,k}-\beta_{t,k}\big)\Rightarrow N(0,1)\tag{27}
$$
uniformly over $P\in\mathcal P$, where $V_{\beta_{t,k}}=E\big[\Psi^2_{\beta_{t,k}}(Y,T,Z,X,\beta^o_{t,k},p^o_{t,k},I^o_{t,Z},P^o_{t,Z},\pi^o)\big]$. Moreover, the results continue to hold when $V_{\beta_{t,k}}$ is replaced by $\check V_{\beta_{t,k}}$.

Since the convergence in (27) is uniform over $\mathcal P$, it can be used for the standard construction of uniformly valid confidence regions.

The previous section is on the efficient estimation of average structural functions within certain identifiable subpopulations. More generally, the parameter of interest could be defined through non-smooth and overidentifying moment conditions.
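As a toy preview of what such moment conditions look like, the following sketch estimates a scalar parameter from an overidentified pair of non-smooth moments (a mean condition and a median condition), using an identity-weighted first step, an optimally weighted second step, and a numerically differentiated Jacobian, in the spirit of the procedure detailed later in this section. All data here are simulated and illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(2.0, 1.0, 4000)            # true mean = true median = 2

def m(eta):
    """Stacked moments m(Y*, eta): a mean and a (non-smooth) median condition."""
    return np.column_stack([Y - eta, (Y <= eta) - 0.5])

def G(eta):                                # sample moment vector G_n(eta)
    return m(eta).mean(axis=0)

grid = np.linspace(0.0, 4.0, 2001)
# step 1: identity-weighted GMM by grid search (criterion may be discontinuous)
eta1 = grid[np.argmin([G(e) @ G(e) for e in grid])]
Vhat = np.cov(m(eta1).T)                   # covariance of the moment vector
Vinv = np.linalg.inv(Vhat)
# step 2: optimally weighted GMM
eta_hat = grid[np.argmin([G(e) @ Vinv @ G(e) for e in grid])]
# Jacobian by central numerical differences (step eps_n -> 0, eps_n * sqrt(n) -> inf)
eps = 0.05
Gamma = ((G(eta_hat + eps) - G(eta_hat - eps)) / (2 * eps)).reshape(2, 1)
se = np.sqrt(np.linalg.inv(Gamma.T @ Vinv @ Gamma)[0, 0] / len(Y))
```

Because the median moment is a step function, grid search replaces derivative-based optimization; the numerical Jacobian avoids differentiating the non-smooth moment directly.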
One example is quantile estimation. Another case of interest is when the underlying economic theory provides overidentifying constraints on the quantities of interest, which is very possible in the current framework with multiple levels of treatment and instrument.

Define a set of random variables $\mathcal Y^*=\{Y^*_{t,k}:t\in\mathcal T,\ k=1,\cdots,N_Z\}$ such that each $Y^*_{t,k}$ has the same marginal distribution as $Y_t\mid S\in\Sigma_{t,k}$; their joint distribution is irrelevant. Let $Y^*=(Y^*_1,\cdots,Y^*_J)'$ for $Y^*_j\in\mathcal Y^*$. Also, for notational convenience, use $j$ rather than $(t,k)$ to label $t_j$, $p_j$, and $\tilde b_j$, according to the ordering of the $Y^*_{t,k}$'s within the random vector $Y^*$. Let the parameter of interest be $\eta$, in the interior of $\Lambda\subset\mathbb R^{d_\eta}$, where $d_\eta\leq J$. The true value $\eta^o$ of the parameter satisfies the moment condition $E[m(Y^*,\eta^o)]=0$, where $m$ is of the form $m(Y^*,\eta)\equiv(m_1(Y^*_1,\eta),\cdots,m_J(Y^*_J,\eta))'$. In general, $m$ is allowed to be non-differentiable with respect to $\eta$. Since the vector $\eta$ appears in each $m_j$, restrictions are allowed both within and across the different subpopulation-conditional distributions. Another interesting feature of this specification is that the moment conditions are defined for random variables that are not directly observed. The following theorem provides the semiparametric efficiency bound for estimating $\eta$. Note that Assumption 1 suffices for deriving the following results, since the marginal distributions of the $Y^*_{t,k}$'s are always exactly identified. Let $m_Z=(m_{1,t_1,Z},\cdots,m_{J,t_J,Z})'$, where $m_{j,t_j,Z}(X,\eta)$ stacks $m_{j,t_j,z}(X,\eta)$ over $z\in\mathcal Z$ and $m_{j,t_j,z}(X,\eta)=E[m_j(Y,\eta)\,1\{T=t_j\}\mid Z=z,X]$.

Theorem 6.
Assume the following conditions hold.
(i) For any $1\leq j_1\leq\cdots\leq j_{d_\eta}\leq J$, the subvector of moments $E\big[(m_{j_1}(Y^*_{j_1},\eta),\cdots,m_{j_{d_\eta}}(Y^*_{j_{d_\eta}},\eta))'\big]$ is zero if and only if $\eta=\eta^o$.
(ii) $E\big[\|m(Y^*,\eta)\|^2\big]<\infty$ for all $\eta\in\Lambda$.
(iii) For each $j$ and $z$, $E[m^o_{j,t_j,z}(X,\eta)]$ is differentiable in some neighborhood of $\eta^o$, with the derivative continuous at $\eta^o$. Let $\Gamma$ be the $J\times d_\eta$ matrix whose $j$-th row is $\tilde b_j\,\frac{\partial}{\partial\eta}E[m^o_{j,t_j,Z}(X,\eta)]\big|_{\eta=\eta^o}$, and assume that $\Gamma$ has full column rank.
Then, for the estimation of $\eta$, the efficient influence function is
$$
\Psi_\eta(Y,T,Z,X,\eta^o,m^o_Z,\pi^o)=-\big(\Gamma'V^{-1}\Gamma\big)^{-1}\Gamma'V^{-1}\,\Psi_m(Y,T,Z,X,\eta^o,m^o_Z,\pi^o),\tag{29}
$$
where $V=\operatorname{Var}\big(\Psi_m(Y,T,Z,X,\eta^o,m^o_Z,\pi^o)\big)$ and $\Psi_m(Y,T,Z,X,\eta^o,m^o_Z,\pi^o)$ is a $J\times 1$ random vector whose $j$-th element is
$$
\tilde b_j\Big(\zeta(Z,X,\pi^o)\big(\iota\,(m_j(Y,\eta^o)\,1\{T=t_j\})-m^o_{j,t_j,Z}(X,\eta^o)\big)+m^o_{j,t_j,Z}(X,\eta^o)\Big).\tag{30}
$$
Thus the semiparametric efficiency bound is $(\Gamma'V^{-1}\Gamma)^{-1}$.

Note that if, for example, $Y^*=Y^*_{t,k}$ and $m(Y^*_{t,k},\eta)=Y^*_{t,k}-\eta$, then $\eta=\beta_{t,k}$, and the efficiency bound shown above reduces to the one computed in Theorem 3(i). If $T=Z$, that is, under unconfoundedness, the above result reduces to Theorem 1 in Cattaneo (2010). The efficiency bound can be achieved by estimators in the same spirit as the EIFE proposed in Cattaneo (2010); essentially, this is the optimally-weighted GMM estimator based on the moment conditions obtained from the efficient influence function $\Psi_m$. Let the criterion function be
$$
G_n(\eta,\pi,m_Z)=\frac{1}{n}\sum_{i=1}^{n}\Psi_m(Y_i,T_i,Z_i,X_i,\eta,\pi,m_Z),\tag{31}
$$
whose probability limit is
$$
G(\eta,\pi,m_Z)=E[\Psi_m(Y,T,Z,X,\eta,\pi,m_Z)].\tag{32}
$$
The main difficulty is that $G_n(\cdot,\pi,m_Z)$ could potentially be discontinuous, since we allow $m(Y^*,\cdot)$ to be discontinuous. To deal with this issue, the method developed in Chen et al.
(2003) is employed, which allows the criterion function to violate standard smoothness conditions while simultaneously depending on nonparametric estimators. (Cattaneo (2010) instead uses the theory of Pakes and Pollard (1989). However, the general theory of Chen et al. (2003) is more straightforward to apply in this case, since it explicitly allows for infinite-dimensional nuisance parameters that can depend on the parameters to be estimated.)

The implementation procedure is as follows. Let $\hat\pi$ and $\hat m_Z$ be nonparametric estimators. One first finds a consistent GMM estimator $\tilde\eta$ using the identity matrix as the (non-optimal) weighting matrix, i.e.,
$$
\|G_n(\tilde\eta,\hat\pi,\hat m_Z)\|\leq\inf_{\eta\in\Lambda}\|G_n(\eta,\hat\pi,\hat m_Z)\|+o_p(1).\tag{33}
$$
Next, this estimate is used to form a consistent estimator $\hat V$ of the covariance matrix of $\Psi_m$:
$$
\hat V=\frac{1}{n}\sum_{i=1}^{n}\Psi_m(Y_i,T_i,Z_i,X_i,\tilde\eta,\hat\pi,\hat m_Z)\,\Psi_m(Y_i,T_i,Z_i,X_i,\tilde\eta,\hat\pi,\hat m_Z)'.\tag{34}
$$
Then $\hat\eta$ is defined as the optimally-weighted GMM estimator
$$
\hat\eta=\arg\min_{\eta\in\Lambda}\ G_n(\eta,\hat\pi,\hat m_Z)'\,\hat V^{-1}\,G_n(\eta,\hat\pi,\hat m_Z).\tag{35}
$$
Lastly, for estimating the asymptotic variance of $\hat\eta$, one can estimate $\Gamma$ using numerical derivatives as in Newey and McFadden (1994). Let $\varepsilon_n$ be a positive sequence such that $\varepsilon_n\to 0$ and $\varepsilon_n\sqrt n\to\infty$. Define the $J\times d_\eta$ matrix estimator by
$$
\hat\Gamma_{jl}=\frac{1}{2\varepsilon_n}\,\tilde b_j\left(\frac{1}{n}\sum_{i=1}^{n}\hat m_{j,t_j,Z}(X_i,\hat\eta+\varepsilon_n e_l)-\frac{1}{n}\sum_{i=1}^{n}\hat m_{j,t_j,Z}(X_i,\hat\eta-\varepsilon_n e_l)\right),\tag{36}
$$
where $e_l\in\mathbb R^{d_\eta}$ is the vector whose $l$-th element is 1 and whose other entries are 0. The following theorem summarizes the asymptotic properties of this estimation procedure. For each $j$ and $z$, let $\mathcal M_{j,z}$ be a vector space of real-valued functions on $\mathcal X\times\Lambda$, endowed with the sup-norm, containing $m^o_{j,t_j,z}$. For any small enough $\delta>0$, let $\mathcal M^\delta_{j,z}=\big\{m_{j,t_j,z}\in\mathcal M_{j,z}:\|m_{j,t_j,z}-m^o_{j,t_j,z}\|_\infty\leq\delta\big\}$.

Theorem 7.
Let the conditions in Theorem 6 hold. Further assume that, for each $j$ and $z$:
(i) $\Lambda$ is compact and $\eta^o\in\operatorname{int}(\Lambda)$.
(ii) The convergence rates of the nonparametric estimators satisfy $\|\hat\pi_z-\pi^o_z\|_\infty=o_p(n^{-1/4})$ and $\|\hat m_{j,t_j,z}-m^o_{j,t_j,z}\|_\infty=o_p(n^{-1/4})$.
(iii) The classes $\{m_{j,t_j,z}(\cdot,\eta):\eta\in\Lambda,\ m_{j,t_j,z}\in\mathcal M^\delta_{j,z}\}$ and $\Pi_z$ are Glivenko-Cantelli.
(iv) For some $\delta>0$, the classes $\{m_{j,t_j,z}(\cdot,\eta):\eta\in\Lambda,\ \|\eta-\eta^o\|\leq\delta,\ m_{j,t_j,z}\in\mathcal M_{j,z},\ \|m_{j,t_j,z}-m^o_{j,t_j,z}\|_\infty\leq\delta\}$ and $\Pi^\delta_z$ are Donsker.
(v) $E\big[\sup_{\eta\in\Lambda,\,m_{j,t_j,z}\in\mathcal M_{j,z}}|m_{j,t_j,z}(X,\eta)|^2\big]<\infty$.
(vi) $E[m_{j,t_j,z}(X,\cdot)]$ is continuous for any $m_{j,t_j,z}\in\mathcal M^\delta_{j,z}$.
Then $\hat V$, $\hat\eta$, and $\hat\Gamma$ are consistent, and $\sqrt n(\hat\eta-\eta^o)\Rightarrow N\big(0,(\Gamma'V^{-1}\Gamma)^{-1}\big)$.

As explained in the previous section, asymptotically optimal inference on joint hypotheses about $\eta$ can be conducted based on this result. A possible application of the non-smooth GMM methodology developed here is illustrated below using the main example. The set $\mathcal Y^*$ consists of the six random variables $Y^*_{t_j,k}$, $j=1,2,3$, $k=1,2$, where each $Y^*_{t_j,k}$ is equal in distribution to $Y_{t_j}\mid S\in\Sigma_{t_j,k}$; here $\Sigma_{t_3,1}$ contains the two $t_3$-switcher types, while the remaining partition sets are singletons. The parameter of interest $\eta$ could be defined by, say, the following moment conditions:
$$
E[Y^*_{t_1,1}-\eta]=E[Y^*_{t_2,1}-\eta]=E\big[1\{Y^*_{t_1,1}\leq\eta\}-0.5\big]=E\big[1\{Y^*_{t_2,1}\leq\eta\}-0.5\big]=0.
$$
This means that $Y^*_{t_1,1}$ and $Y^*_{t_2,1}$ have the same mean and the same median, both of which equal $\eta$. Note that both within- and cross-type restrictions are contained in this example.

Before ending this section, it is worth mentioning that the set $\mathcal Y^*$ can be extended to include more random variables whose marginal distributions are identified; $Y_t\mid T=t,S\in\Sigma_{t,k}$ and $Y_t\mid S\in\Sigma_t$ are such examples. Similar arguments go through for efficient estimation, and the details are not repeated here.

Optimal Testable Implication of Model Assumption
In practice, one should check the validity of the model assumptions before proceeding to estimation. As discussed previously, overidentifying restrictions may exist, depending on the type configuration $\mathcal S$. This section discusses a systematic approach to generating testable implications of Assumption 1, even when no overidentifying type restriction exists. Based on the work of Kitagawa (2015) and Sun (2018), an obvious generalization would be the following set of conditions: for any $t\in\mathcal T$, $k=1,\cdots,N_Z$, and measurable $B\subset\mathcal Y$,
$$
\tilde b_{t,k}\,I^B_{t,Z}(X)=P\big(Y_t\in B,\ S\in\Sigma_{t,k}\mid X\big)\in[0,1]\quad\text{a.s.},\tag{37}
$$
where $I^B_{t,Z}(X)$ is defined as $I_{t,Z}(X)$ with $Y$ replaced by $1_B(Y)$, the indicator function of $B$. This means that the identifiable parts of the joint distribution of the potential outcomes and the type should form a proper probability. However, more can be tested. For instance, both $P(S=s')$ and $P(S\in\{s',s''\})$ are identified in the main example, and the former should be no greater than the latter. In fact, this intuition can be developed into a set of high-level conditions optimal for testing Assumption 1, including but not limited to the implications (37) and the type-configuration overidentifying conditions discussed in Section 3. Let $\boldsymbol\Sigma_t=\{\Sigma_{t,k}:k=1,\cdots,N_Z\}$. For any $t\in\mathcal T$, define a function $Q_t:\mathcal X\times(\mathcal B_{\mathcal Y}\times\boldsymbol\Sigma_t)\to\mathbb R$ by
$$
Q_t\big(X,(B,\Sigma_{t,k})\big)=\tilde b_{t,k}\,I^B_{t,Z}(X).\tag{38}
$$

Assumption 2.
There exist $N_T$ functions $\tilde Q_t:\mathcal X\times\big(\mathcal B_{\mathcal Y}\times 2^{\mathcal S}\big)\to\mathbb R$ such that, for all $t,t'\in\mathcal T$:
(i) $\tilde Q_t$ is a probability kernel;
(ii) $\tilde Q_t(\cdot,(\mathcal Y,\Sigma))=\tilde Q_{t'}(\cdot,(\mathcal Y,\Sigma))$ for all $\Sigma\subset\mathcal S$;
(iii) $\tilde Q_t(\cdot,(B,\Sigma))=Q_t(\cdot,(B,\Sigma))$ for all measurable $B\subset\mathcal Y$ and all $\Sigma\in\boldsymbol\Sigma_t$.

The probability kernel $\tilde Q_t$ represents the unidentified joint distribution of $Y_t$ and $S$ given $X$, i.e., $\tilde Q_t(X,(B,\Sigma))=P(Y_t\in B,S\in\Sigma\mid X)$. The second condition in Assumption 2 ensures that $P(S\in\Sigma\mid X)$ is well-defined, while the third condition assigns $\tilde Q_t$ its identified value whenever possible. The overidentification constraints on the type configuration $\mathcal S$, e.g., equation (8), result from conditions (ii) and (iii); the constraints defined by equation (37) follow from conditions (i) and (iii). The optimality of Assumption 2 for testing is explained by the following theorem. Let $\bar L$ be the underlying, unidentified joint probability of $\big(\{Y_t:t\in\mathcal T\},\{T_z:z\in\mathcal Z\},Z,X\big)$, and let $L$ be the observed joint probability of $(Y,T,Z,X)$. Each $\bar L$ induces an $L$, but not the other way around. Denote the mapping from $\bar L$ to $L$ by $O$.

Theorem 8.
The following relationships hold between Assumption 1 on $\bar L$ and Assumption 2 on $L$.
(i) If $\bar L$ satisfies Assumption 1, then $O(\bar L)$ satisfies Assumption 2. If $L$ satisfies Assumption 2, then there exists an $\bar L$ that violates Assumption 1 with $O(\bar L)=L$.
(ii) If $L$ satisfies Assumption 2, then there exists an $\bar L$ that satisfies Assumption 1 with $O(\bar L)=L$. Therefore, if $C$ is another condition on $L$ such that $\bar L$ satisfying Assumption 1 implies that $O(\bar L)$ satisfies condition $C$, then $L$ satisfying Assumption 2 implies that $L$ satisfies condition $C$.

Part (i) of the theorem establishes that Assumption 2 is a necessary but not sufficient condition for Assumption 1. Part (ii) establishes that Assumption 2 is the optimal testable implication of Assumption 1, in the sense that any testable implication of Assumption 1 is implied by Assumption 2. Also note that Assumption 1 requires not only unordered monotonicity but also the full specification of $\mathcal S$. From this theorem and its proof, the idea of optimal testable implications of a nonverifiable hypothesis becomes clear, which is another contribution of the paper. The set of conditions presented in Kitagawa (2015) is indeed a special case, whose simplicity is specific to the binary setting.

This section again ends with an illustration using the main example, with instrument values $z_0,z_1$ and with $s'$ and $s''$ denoting the two switcher types. Define
$$
Q_{t_1}\big(X,(B,\Sigma)\big)=\begin{cases}P(Y\in B,T=t_1\mid Z=z_1,X)-P(Y\in B,T=t_1\mid Z=z_0,X),&\text{if }\Sigma=\{s'\},\\[2pt]P(Y\in B,T=t_1\mid Z=z_0,X),&\text{if }\Sigma=\Sigma_{t_1,2},\end{cases}
$$
$$
Q_{t_2}\big(X,(B,\Sigma)\big)=\begin{cases}P(Y\in B,T=t_2\mid Z=z_1,X)-P(Y\in B,T=t_2\mid Z=z_0,X),&\text{if }\Sigma=\{s''\},\\[2pt]P(Y\in B,T=t_2\mid Z=z_0,X),&\text{if }\Sigma=\Sigma_{t_2,2},\end{cases}
$$
$$
Q_{t_3}\big(X,(B,\Sigma)\big)=\begin{cases}P(Y\in B,T=t_3\mid Z=z_0,X)-P(Y\in B,T=t_3\mid Z=z_1,X),&\text{if }\Sigma=\{s',s''\},\\[2pt]P(Y\in B,T=t_3\mid Z=z_1,X),&\text{if }\Sigma=\Sigma_{t_3,2}.\end{cases}
$$
By equation (8),
$$
Q_{t_1}\big(\cdot,(\mathcal Y,\{s'\})\big)+Q_{t_2}\big(\cdot,(\mathcal Y,\{s''\})\big)=Q_{t_3}\big(\cdot,(\mathcal Y,\{s',s''\})\big).
$$
Also note that $Q_{t_1}$ on $\Sigma_{t_1,2}$, $Q_{t_2}$ on $\Sigma_{t_2,2}$, and $Q_{t_3}$ on $\Sigma_{t_3,2}$ are already between 0 and 1.
And $Q_{t_1}$ on $\{s'\}$, $Q_{t_2}$ on $\{s''\}$, and $Q_{t_3}$ on $\{s',s''\}$ are already below 1. The remaining restrictions in this example hence reduce to
$$
P(Y\in B,T=t_1\mid Z=z_1,X)\geq P(Y\in B,T=t_1\mid Z=z_0,X),
$$
$$
P(Y\in B,T=t_2\mid Z=z_1,X)\geq P(Y\in B,T=t_2\mid Z=z_0,X),
$$
$$
P(Y\in B,T=t_3\mid Z=z_0,X)\geq P(Y\in B,T=t_3\mid Z=z_1,X).
$$
These inequalities are very similar in form to the conditions in Kitagawa (2015). This simplicity is due to the fact that $2^{\mathcal S}$ equals the algebra generated by the $\boldsymbol\Sigma_t$'s and no overidentifying restriction exists. It is possible to generalize the variance-weighted Kolmogorov-Smirnov test proposed in Kitagawa (2015) and the power-improved test proposed in Sun (2018); their implementation is beyond the scope of this study.

Empirical Application

In this section, the estimation methods discussed in the paper are applied to study the return to schooling, using proximity to college as an instrument (Card, 1993). The data come from the National Longitudinal Survey of the original cohort of young men, observed in 1966 and 1976. It is shown in Kitagawa (2015) that the instrument is only valid after conditioning on covariates such as race and region of residence, making this example appropriate for illustrating the usefulness of including conditioning covariates in the framework, which is a step forward of the current paper relative to Heckman and Pinto (2018a) and Pinto (2019).

The model is the same as the main example, and the variables are as follows. The outcome $Y$ is the log of weekly earnings in 1976. The instrument is a binary $Z$ indicating whether a four-year local college was present for the individual in 1966. It is relevant because the presence of a college nearby exposes the individual to college in early life and reduces the cost of receiving some college education; it is valid because the individual's ability is presumed to be independent of the place of residence, given race and the extent to which the region is developed.
The treatment $T$ describes the education received by the individual up to 1976. Instead of being binary, $T$ takes on three unordered values $\{t_1,t_2,t_3\}$: $t_1$ means the student receives college education in fields including engineering, mathematics, law, and the social sciences, among others; $t_2$ means the student receives college education in other fields, including business, education, and public services, among others; and $t_3$ means the student does not receive college education. In the binary local IV case, $t_1$ and $t_2$ would be collapsed into a single treatment indicating college-level education, for the study of the return to college schooling. The unobserved type $S$ is defined by the way education decisions vary with proximity to college. The conditioning covariates $X$ include race, whether the individual resides in the South, and whether the individual resides in a standard metropolitan area. The available sample size is 2930, of which 381 individuals are treated with $t_1$ and 487 with $t_2$.

For estimation, no advanced nonparametric method is needed since $X$ is discrete. The $P_{t,Z}$'s are estimated using the linear probability model with five dummies, as in Kitagawa (2015), and the $\pi_z$'s are estimated with sample means. The CEP estimators of Theorem 4 are then used to evaluate the parameters of interest. The estimation results for the LASFs and LASF-Ts are displayed in Table 1. The asymptotic covariance matrix can easily be estimated using the efficient influence functions, as in Table 2, permitting joint inference on the parameters, which is another benefit of the methodology developed in this paper. Table 3 shows the results of selected statistical tests comparing LASFs and LASF-Ts both between and within treatments. The differences between the pairs $(\beta_{t_1,1},\beta_{t_2,1})$ and $(\gamma_{t_1,1},\gamma_{t_2,1})$ are both insignificant, indicating that incomes after receiving college education in the two categories of fields are similar for the respective switcher subpopulations.
As mentioned before, these facts can be turned into overidentifying restrictions ($\beta_{t_1,1}=\beta_{t_2,1}$, $\gamma_{t_1,1}=\gamma_{t_2,1}$) to improve efficiency, if supported by the underlying economic theory. College education is generally perceived as a causal factor in increasing income. Thus the LASF-T should be higher than the LASF when the treatment belongs to $\{t_1,t_2\}$, and lower when the treatment is $t_3$, which is consistent with the testing results. (This classification of the fields of study is chosen to balance the sample sizes in the dataset.)

Table 1: Estimated LASFs and LASF-Ts with confidence intervals.

Table 2: Empirically Estimated Asymptotic Covariance Matrix of the LASF and LASF-T estimates.

Also, both $\beta_{t_1,1}$ and $\beta_{t_2,1}$ are higher than $\beta_{t_3,1}$, where the statistical insignificance is due to the fact that these three parameters are averages over different subpopulations. Comparisons among $\beta_{t_1,2}$, $\beta_{t_2,2}$, and $\beta_{t_3,2}$ show a similar intuition about the effect of schooling, with a tendency for the outcomes of always-takers to be less spread out across treatments. The ratios $P(T=t_1\mid S\in\Sigma_{t_1,1})$ and $P(T=t_2\mid S\in\Sigma_{t_2,1})$ are estimated to be 0.68 and 0.92, revealing that a significant share of individuals in the switcher subpopulations receive college education. At this point, one might question whether the parameter estimates are of policy interest. The argument here is that, by the identification results, the parameters discussed here are at least as informative as the LATE (and LATT) parameters in the binary local IV case.
Indeed, one of the themes of local IV and other causal inference models is to trade off informativeness against the removal of incredible assumptions.

Table 3: Differences between parameter values, with two-sided and one-sided tests (standard deviations of the estimates in parentheses; "*" indicates significance at the 5% level).

Monte Carlo Experiment

Monte Carlo simulations are conducted to further the understanding of the relationships between the random variables in the model and of the finite-sample performance of the estimators. Two data generating processes (DGPs), differing only in the distribution of the covariate $X$, are specified as follows. In the first DGP, $X$ is drawn from a uniform distribution on a sub-interval of $(0,1)$; in the second DGP, $X$ is a discrete random variable taking five values in $(0,1)$ with equal probabilities.
The two DGPs then generate $S$, $Z$, $T$, and $Y$ from $X$ in the same way. The type $S$ is drawn from $\mathrm{Binomial}(4,X)$, with the values $(0,1,2,3,4)$ matched to $(s_1,s_2,s_3,s_4,s_5)$, respectively. The instrument $Z$ is generated according to the distribution $\mathrm{Bernoulli}(X)$. The treatment $T$ is then determined by the realizations of $S$ and $Z$. The potential outcomes $(Y_{t_1},Y_{t_2},Y_{t_3})$ are constructed type by type: on each type indicator $1\{S=s_j\}$, every $Y_t$ is a sum of one or two of the mutually independent shocks $\xi_1,\xi_2,\xi_3,\xi_4$, where $\xi_1\sim N(c_1,1)$, $\xi_2\sim N(X,1)$, $\xi_3\sim N(X+c_3,1)$, and $\xi_4\sim N(X+c_4,1)$ for positive constants $c_1,c_3,c_4$. Normality is assumed to match the bell-shaped empirical distribution of log-income. By construction, $S$ and $Z$ are independent conditional on $X$, and the $Y_t$'s depend on $S$ and $X$ but not on $Z$. In the subpopulations corresponding to the last two types, the $Y_t$'s are mutually independent, while in the other cases they are correlated through shared $\xi$'s; this feature resembles the data generating process in Hong and Nekipelov (2010a). With a moderate number of observations $N$, the Monte Carlo standard deviations of the $\hat\beta_{t,k}$'s are very close to the $\hat\sigma_{\beta_{t,k}}/\sqrt N$'s, confirming the efficiency of the estimation and the optimality of the associated tests.

Table 4: Monte Carlo Results

Parameter                  $P_X$        Value   Mean Bias   Median Bias   Std Deviation   Root MSE
$\beta^{(1)}$              Continuous   1.00     0.0007      0.0010        0.0488          0.0488
                           Discrete     1.00    -0.0080     -0.0078        0.0502          0.0508
$\beta^{(2)}$              Continuous   1.00     0.0160      0.0149        0.0756          0.0773
                           Discrete     1.00     0.0124      0.0126        0.0744          0.0754
$\beta^{(3)}$              Continuous   0.60    -0.0109     -0.0105        0.0458          0.0471
                           Discrete     0.60     0.0024      0.0025        0.0475          0.0476
$\sigma_{\beta^{(1)}}$     Discrete     2.75    -0.0183     -0.0265        0.1316          0.1329
$\sigma_{\beta^{(2)}}$     Discrete     4.12    -0.0441     -0.0555        0.2409          0.2449
$\sigma_{\beta^{(3)}}$     Discrete     2.59    -0.0095     -0.0108        0.0843          0.0849

(Here $\beta^{(1)},\beta^{(2)},\beta^{(3)}$ index three of the $\beta_{t,k}$ parameters, and the $\sigma$'s are the corresponding efficiency-bound standard deviations.)
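The structure of this design (conditional independence of $S$ and $Z$ given $X$, a deterministic treatment response to $(S,Z)$ satisfying unordered monotonicity, and potential outcomes depending on $S$ and $X$ but not on $Z$) can be sketched as follows. The support of `X`, the response table `resp`, and all numeric constants below are placeholders rather than the paper's values; only the structural features carry over.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000
X = rng.uniform(0.2, 0.8, n)            # placeholder support inside (0, 1)
S = rng.binomial(4, X)                  # five types s1..s5, drawn given X
Z = rng.binomial(1, X)                  # instrument depends only on X, so S and Z
                                        # are independent conditional on X
# deterministic response to (S, Z); placeholder rows chosen so that the set of
# types taking each treatment is nested across instrument values:
# t0-always, (t0 -> t1) switcher, t1-always, (t0 -> t2) switcher, t2-always
resp = np.array([[0, 0], [0, 1], [1, 1], [0, 2], [2, 2]])
T = resp[S, Z]
# potential-outcome shocks whose means depend on X (placeholder constants)
xi = rng.normal(loc=np.column_stack([0.5 + 0.0 * X, X, X + 0.5]), scale=1.0)
Y = xi[np.arange(n), T] + 0.3 * (S >= 3)   # Y_t depends on S and X, never on Z
# sanity check: partial association of S and Z given X is zero by construction
rho = np.corrcoef(S - 4 * X, Z - X)[0, 1]
```

The residual correlation `rho` strips the common dependence on $X$ and should be near zero, mirroring the conditional-independence requirement on the instrument.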
The $\hat\sigma_{\beta_{t,k}}$'s are estimated using the plug-in estimators, where the $\beta_{t,k}$'s are estimated by CEP.

Conclusion

This paper has studied semiparametric efficient estimation in the generalized local IV framework, where the treatment is allowed to take multiple values. A large class of parameters implicitly defined by a possibly overidentifying collection of non-smooth moment conditions is considered, with a special focus on parameters derived from type probabilities and local average structural functions. The calculated efficient influence functions lead to easy implementation of optimal joint inference and to the construction of estimators suitable for high-dimensional settings. The model assumptions of local IV in general are further understood through the optimal testable implications. The applicability of the methodology is demonstrated with examples for empirical research with a finite amount of sample data. For future studies, one could consider using the efficient estimation methods for, say, the LASFs to extract information on the (non-local) average structural functions, as in Mogstad et al. (2018).

References
Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231–263.

Ackerberg, D., Chen, X., Hahn, J., and Liao, Z. (2014). Asymptotic efficiency of semiparametric two-step GMM. Review of Economic Studies, 81(3):919–943.

Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association, 90(430):431–442.

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455.

Bajari, P., Chernozhukov, V., Hong, H., and Nekipelov, D. (2015). Identification and efficient semiparametric estimation of a dynamic discrete game. Technical report, National Bureau of Economic Research.

Belloni, A., Chernozhukov, V., Fernández-Val, I., and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650.

Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. Technical report, National Bureau of Economic Research.

Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155(2):138–154.

Chen, X., Hong, H., and Tarozzi, A. (2004). Semiparametric efficiency in GMM models of nonclassical measurement errors, missing data and treatment effects. Working paper.

Chen, X., Hong, H., and Tarozzi, A. (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36(2):808–843.

Chen, X., Linton, O., and Van Keilegom, I. (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71(5):1591–1608.

Chen, X. and Santos, A. (2018). Overidentification in regular models. Econometrica, 86(5):1771–1817.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.

Firpo, S. (2007). Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75(1):259–276.

Frölich, M. (2007). Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics, 139(1):35–75.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–331.

Heckman, J. J. and Pinto, R. (2018a). Unordered monotonicity. Econometrica, 86(1):1–35.

Heckman, J. J. and Pinto, R. (2018b). Web appendix for "Unordered monotonicity". Econometrica, 86(1):1–35.

Hong, H. and Nekipelov, D. (2010a). Semiparametric efficiency in nonlinear LATE models. Quantitative Economics, 1(2):279–304.

Hong, H. and Nekipelov, D. (2010b). Supplement to "Semiparametric efficiency in nonlinear LATE models". Quantitative Economics, 1(2):279–304.

Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475.

Kitagawa, T. (2015). A test for instrument validity. Econometrica, 83(5):2043–2063.

Kline, P. and Walters, C. R. (2016). Evaluating public programs with close substitutes: The case of Head Start. The Quarterly Journal of Economics, 131(4):1795–1848.

Melly, B. and Wüthrich, K. (2017). Local quantile treatment effects. In Handbook of Quantile Regression, pages 145–164. Chapman and Hall/CRC.

Mogstad, M., Santos, A., and Torgovitsky, A. (2018). Using instrumental variables for inference about policy relevant treatment parameters. Econometrica, 86(5):1589–1619.

Nekipelov, D. (2011). Identification and semiparametric efficiency in a model with a multi-valued discrete regressor. Technical report, Mimeo, UC Berkeley.

Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2):99–135.

Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica, 62(6):1349–1382.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245.

Pakes, A. and Pollard, D. (1989). Simulation and the asymptotics of optimization estimators. Econometrica, 57(5):1027–1057.

Pinto, R. (2019). Noncompliance as a rational choice: A framework that exploits compromises in social experiments to identify causal effects. UCLA working paper.

Sun, Z. (2018). Essays on Non-parametric and High-dimensional Econometrics. PhD thesis, UC San Diego.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

van der Vaart, A. W. and Wellner, J. A. (2000). Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In High Dimensional Probability II, pages 115–133. Springer.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak Convergence and Empirical Processes, pages 16–28. Springer.

Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica, 70(1):331–341.
A Binary Local IV with Conditioning Covariates
It may help to present the notation and results of the main text in the familiar context of binary local IV. The identification of the classical LATE, without conditioning covariates, is discussed in Section 4.1 of Heckman and Pinto (2018a) using binary matrices. The efficiency bound and efficient estimation are discussed in Frölich (2007) and Hong and Nekipelov (2010a). The specification test is discussed in Kitagawa (2015). Here both the treatment and the instrument are binary: $\mathcal{Z} = \{z_1, z_2\}$ and $\mathcal{T} = \{t_1, t_2\}$. The type constraint is

\[
\begin{array}{c|ccc}
 & s_1 & s_2 & s_3 \\ \hline
T_{z_1} & t_1 & t_2 & t_2 \\
T_{z_2} & t_1 & t_1 & t_2
\end{array}
\]

where $s_1$ is the always-taker, $s_2$ is the complier, and $s_3$ is the never-taker. The unordered monotonicity condition is satisfied: $1\{T_{z_2} = t_1\} \ge 1\{T_{z_1} = t_1\}$ and $1\{T_{z_2} = t_2\} \le 1\{T_{z_1} = t_2\}$. The type partitions are, for $t_1$, $\Sigma_{t_1,1} = \{s_1\}$, $\Sigma_{t_1,2} = \{s_2\}$; and for $t_2$, $\Sigma_{t_2,1} = \{s_3\}$, $\Sigma_{t_2,2} = \{s_2\}$. The $B_t$'s and their inverses are

\[
B_{t_1} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix} \;\Longrightarrow\; B_{t_1}^{+} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & 0 \end{pmatrix}; \qquad B_{t_2} = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \;\Longrightarrow\; B_{t_2}^{+} = \begin{pmatrix} 0 & 0 \\ 1 & -1 \\ 0 & 1 \end{pmatrix},
\]

and the $b_{t,k}$'s are $b_{t_1,2} = b_{t_2,2} = (0,1,0)$, $b_{t_1,1} = (1,0,0)$, and $b_{t_2,1} = (0,0,1)$. Thus $\tilde{b}_{t_1,2} = (-1,1)$, $\tilde{b}_{t_2,2} = (1,-1)$, $\tilde{b}_{t_1,1} = (1,0)$, and $\tilde{b}_{t_2,1} = (0,1)$. The $I_{t,Z}(X)$'s and $P_{t,Z}(X)$'s are

\[
\begin{aligned}
I_{t_1,Z}(X) &= \big( E[Y 1\{T = t_1\} \mid Z = z_1, X],\; E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) \\
I_{t_2,Z}(X) &= \big( E[Y 1\{T = t_2\} \mid Z = z_1, X],\; E[Y 1\{T = t_2\} \mid Z = z_2, X] \big) \\
P_{t_1,Z}(X) &= \big( P(T = t_1 \mid Z = z_1, X),\; P(T = t_1 \mid Z = z_2, X) \big) \\
P_{t_2,Z}(X) &= \big( P(T = t_2 \mid Z = z_1, X),\; P(T = t_2 \mid Z = z_2, X) \big)
\end{aligned}
\]
By Theorem 1,

\[
\begin{aligned}
p_{t_1,1} &= P(S = s_1) = \tilde{b}_{t_1,1} E[P_{t_1,Z}(X)] = E[P(T = t_1 \mid Z = z_1, X)] \\
p_{t_1,2} &= P(S = s_2) = p_{t_2,2} = \tilde{b}_{t_1,2} E[P_{t_1,Z}(X)] = E[-P(T = t_1 \mid Z = z_1, X) + P(T = t_1 \mid Z = z_2, X)] \\
p_{t_2,1} &= P(S = s_3) = \tilde{b}_{t_2,1} E[P_{t_2,Z}(X)] = E[P(T = t_2 \mid Z = z_2, X)]
\end{aligned}
\]

and

\[
\begin{aligned}
\beta_{t_1,2} &= E[Y_{t_1} \mid S = s_2] = \frac{1}{p_{t_1,2}} \tilde{b}_{t_1,2} E[I_{t_1,Z}(X)] = \frac{E\big[ E[Y 1\{T = t_1\} \mid Z = z_2, X] - E[Y 1\{T = t_1\} \mid Z = z_1, X] \big]}{E\big[ P(T = t_1 \mid Z = z_2, X) - P(T = t_1 \mid Z = z_1, X) \big]} \\
\beta_{t_1,1} &= E[Y_{t_1} \mid S = s_1] = \frac{1}{p_{t_1,1}} \tilde{b}_{t_1,1} E[I_{t_1,Z}(X)] = \frac{E\big[ E[Y 1\{T = t_1\} \mid Z = z_1, X] \big]}{E\big[ P(T = t_1 \mid Z = z_1, X) \big]} \\
\beta_{t_2,2} &= E[Y_{t_2} \mid S = s_2] = \frac{1}{p_{t_2,2}} \tilde{b}_{t_2,2} E[I_{t_2,Z}(X)] = \frac{E\big[ E[Y 1\{T = t_2\} \mid Z = z_1, X] - E[Y 1\{T = t_2\} \mid Z = z_2, X] \big]}{E\big[ P(T = t_2 \mid Z = z_1, X) - P(T = t_2 \mid Z = z_2, X) \big]} \\
\beta_{t_2,1} &= E[Y_{t_2} \mid S = s_3] = \frac{1}{p_{t_2,1}} \tilde{b}_{t_2,1} E[I_{t_2,Z}(X)] = \frac{E\big[ E[Y 1\{T = t_2\} \mid Z = z_2, X] \big]}{E\big[ P(T = t_2 \mid Z = z_2, X) \big]}
\end{aligned}
\]

Thus we have the usual expression for the LATE:

\[
E[Y_{t_1} - Y_{t_2} \mid S = s_2] = \beta_{t_1,2} - \beta_{t_2,2} = \frac{E\big[ E[Y \mid Z = z_2, X] - E[Y \mid Z = z_1, X] \big]}{E\big[ P(T = t_1 \mid Z = z_2, X) - P(T = t_1 \mid Z = z_1, X) \big]}.
\]

Focusing on $\Sigma_{t_1,2} = \Sigma_{t_2,2} = \{s_2\}$, we can derive the LASF-Ts $E[Y_{t_1} \mid T = t_1, S = s_2]$ and $E[Y_{t_2} \mid T = t_1, S = s_2]$. In fact, $W_{t_1,t_1,2} = \{z_2\}$.
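Before turning to the treated-complier parameters, the covariate-aggregated LATE formula above can be sanity-checked by simulation. The sketch below uses a hypothetical data-generating process (binary instrument, binary treatment, and a binary covariate shifting both the instrument propensity and the type shares; none of these choices come from the paper) and forms the Wald ratio of Theorem 1 from cell means:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
tau = 2.0                               # true complier (LATE) effect in this toy DGP

X = rng.integers(0, 2, n)               # binary covariate
pz = np.where(X == 1, 0.7, 0.4)         # instrument propensity depends on X
Z = rng.random(n) < pz

# compliance types: 0 = never-taker, 1 = complier, 2 = always-taker
type_probs = np.where(X[:, None] == 1, [0.2, 0.5, 0.3], [0.4, 0.4, 0.2])
U = rng.random(n)
S = (U > type_probs[:, 0]).astype(int) + (U > type_probs[:, :2].sum(1)).astype(int)

T = (S == 2) | ((S == 1) & Z)           # take-up under unordered monotonicity
Y0 = X + rng.normal(0, 1, n)
Y1 = Y0 + np.where(S == 1, tau, 1.0)    # compliers get tau, other types get 1.0
Y = np.where(T, Y1, Y0)

# conditional Wald / local IV estimand, aggregated over X
num = den = 0.0
for x in (0, 1):
    w = (X == x).mean()
    num += w * (Y[(X == x) & Z].mean() - Y[(X == x) & ~Z].mean())
    den += w * (T[(X == x) & Z].mean() - T[(X == x) & ~Z].mean())
late = num / den
print(round(late, 2))                   # should be close to tau = 2.0
```

Because the complier effect is constant across covariate cells in this design, the aggregated ratio and the cell-by-cell Wald ratios estimate the same LATE.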
Thus, by Theorem 2,

\[
q_{t_1,t_1,2} = P(T = t_1, S = s_2) = E\big[ \big( -P(T = t_1 \mid Z = z_1, X) + P(T = t_1 \mid Z = z_2, X) \big) P(Z = z_2 \mid X) \big]
\]

and

\[
\begin{aligned}
\gamma_{t_1,t_1,2} &= E[Y_{t_1} \mid T = t_1, S = s_2] = \frac{E\big[ \tilde{b}_{t_1,2} I_{t_1,Z}(X)\, \pi_{W_{t_1,t_1,2}}(X) \big]}{E\big[ \tilde{b}_{t_1,2} P_{t_1,Z}(X)\, \pi_{W_{t_1,t_1,2}}(X) \big]} = \frac{E\big[ \big( -E[Y 1\{T = t_1\} \mid Z = z_1, X] + E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) P(Z = z_2 \mid X) \big]}{E\big[ \big( -P(T = t_1 \mid Z = z_1, X) + P(T = t_1 \mid Z = z_2, X) \big) P(Z = z_2 \mid X) \big]} \\
\gamma_{t_1,t_2,2} &= E[Y_{t_2} \mid T = t_1, S = s_2] = \frac{E\big[ \tilde{b}_{t_2,2} I_{t_2,Z}(X)\, \pi_{W_{t_1,t_1,2}}(X) \big]}{E\big[ \tilde{b}_{t_2,2} P_{t_2,Z}(X)\, \pi_{W_{t_1,t_1,2}}(X) \big]} = \frac{E\big[ \big( E[Y 1\{T = t_2\} \mid Z = z_1, X] - E[Y 1\{T = t_2\} \mid Z = z_2, X] \big) P(Z = z_2 \mid X) \big]}{E\big[ \big( P(T = t_2 \mid Z = z_1, X) - P(T = t_2 \mid Z = z_2, X) \big) P(Z = z_2 \mid X) \big]}
\end{aligned}
\]

so that

\[
E[Y_{t_1} - Y_{t_2} \mid T = t_1, S = s_2] = \gamma_{t_1,t_1,2} - \gamma_{t_1,t_2,2} = \frac{E\big[ \big( E[Y \mid Z = z_2, X] - E[Y \mid Z = z_1, X] \big) P(Z = z_2 \mid X) \big]}{E\big[ \big( P(T = t_1 \mid Z = z_2, X) - P(T = t_1 \mid Z = z_1, X) \big) P(Z = z_2 \mid X) \big]}.
\]

The notation $\zeta(Z, X)$ in Theorem 3 means $\big( 1\{Z = z_1\}/P(Z = z_1 \mid X),\; 1\{Z = z_2\}/P(Z = z_2 \mid X) \big)$. The semiparametric efficiency calculations give the efficient influence function for, say, $\beta_{t_1,2} = E[Y_{t_1} \mid S = s_2]$, which is

\[
\begin{aligned}
\Psi_{\beta_{t_1,2}} = {} & \frac{1}{p_{t_1,2}} \left( \frac{1\{Z = z_2\}}{P(Z = z_2 \mid X)} \big( Y 1\{T = t_1\} - E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) + E[Y 1\{T = t_1\} \mid Z = z_2, X] \right) \\
& - \frac{1}{p_{t_1,2}} \left( \frac{1\{Z = z_1\}}{P(Z = z_1 \mid X)} \big( Y 1\{T = t_1\} - E[Y 1\{T = t_1\} \mid Z = z_1, X] \big) + E[Y 1\{T = t_1\} \mid Z = z_1, X] \right) \\
& - \frac{\beta_{t_1,2}}{p_{t_1,2}} \left( \frac{1\{Z = z_2\}}{P(Z = z_2 \mid X)} \big( 1\{T = t_1\} - E[1\{T = t_1\} \mid Z = z_2, X] \big) + E[1\{T = t_1\} \mid Z = z_2, X] \right) \\
& + \frac{\beta_{t_1,2}}{p_{t_1,2}} \left( \frac{1\{Z = z_1\}}{P(Z = z_1 \mid X)} \big( 1\{T = t_1\} - E[1\{T = t_1\} \mid Z = z_1, X] \big) + E[1\{T = t_1\} \mid Z = z_1, X] \right)
\end{aligned}
\]

The estimators are skipped for brevity.
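Two properties of $\Psi_{\beta_{t_1,2}}$ can be checked numerically: the influence function averages to zero at the plug-in parameter values, and the plug-in ratio recovers the complier mean. The sketch below is a hypothetical design (not from the paper), with the conditional-mean nuisances replaced by full-sample cell means over a discrete covariate, so the inverse-propensity residual terms average to zero exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300_000
X = rng.integers(0, 2, n)
piz = np.where(X == 1, 0.6, 0.4)          # P(Z = z2 | X), known here
Z = rng.random(n) < piz
U = rng.random(n)
S = (U > 0.3).astype(int) + (U > 0.7).astype(int)   # 0 never, 1 complier, 2 always
T = (S == 2) | ((S == 1) & Z)
Y = rng.normal(1.0 + (S == 1), 1.0) * T   # only Y*1{T=t1} enters the formulas

def h(V, z):
    """cell means standing in for E[V | Z = z, X]."""
    return np.array([V[(X == x) & (Z == z)].mean() for x in (0, 1)])[X]

hY1, hY0 = h(Y * T, True), h(Y * T, False)
hT1, hT0 = h(T.astype(float), True), h(T.astype(float), False)
p = (hT1 - hT0).mean()                    # P(S = complier)
beta = (hY1 - hY0).mean() / p             # plug-in E[Y_{t1} | complier]

# efficient influence function evaluated at the plug-in values
psi = (Z / piz * (Y * T - hY1) + hY1
       - (~Z) / (1 - piz) * (Y * T - hY0) - hY0) / p \
    - beta / p * (Z / piz * (T - hT1) + hT1
                  - (~Z) / (1 - piz) * (T - hT0) - hT0)
# psi.mean() is numerically zero, and beta is close to the true value 2.0
```

With estimated rather than known nuisances entering a one-step update, this mean-zero structure is what delivers the bias correction discussed in the proof of Theorem 4.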
The optimal set of testable implications reduces to: for any measurable set $B$, almost surely,

\[
\begin{aligned}
E[1_B(Y) 1\{T = t_1\} \mid Z = z_2, X] - E[1_B(Y) 1\{T = t_1\} \mid Z = z_1, X] &= \tilde{b}_{t_1,2}\, g_{t_1,Z}(X) \in [0,1] \\
E[1_B(Y) 1\{T = t_2\} \mid Z = z_1, X] - E[1_B(Y) 1\{T = t_2\} \mid Z = z_2, X] &= \tilde{b}_{t_2,2}\, g_{t_2,Z}(X) \in [0,1]
\end{aligned}
\]

with $g = 1_B$, which is equation (3.3) of Kitagawa (2015).

B Proofs of Theorems and Regularity Conditions
Lemma 1.
The following conditional independence relationships hold: $S \perp Z \mid X$; and for any $t \in \mathcal{T}$, $Y_t \perp T \mid S, X$.

Proof.
The first statement follows from the definition of $S$ and the fact that $Z$ is independent of $(T_{z_1}, \cdots, T_{z_{N_Z}})$ conditional on $X$. For the second statement, $T$ is a function of $(S, Z, X)$. Hence, given $S$ and $X$, $T$ is independent of $Y_t$, since $Z$ is independent of $(Y_{t_1}, \cdots, Y_{t_{N_T}})$ conditional on $X$. ∎

Lemma 2. For each $t \in \mathcal{T}$, $k = 1, \cdots, N_Z$, and $g$ measurable, the following identification results hold:
(i) $P(S \in \Sigma_{t,k} \mid X) = \tilde{b}_{t,k} P_{t,Z}(X)$ a.s.
(ii) $E[g(Y_t) \mid S \in \Sigma_{t,k}, X] = \tilde{b}_{t,k}\, g_{t,Z}(X) \big/ \tilde{b}_{t,k} P_{t,Z}(X)$ a.s.

Proof. Conditional on $X$, we have $B_t[i, j] = 1\{T = t \mid Z = z_i, S = s_j\}$, which is the definition of $B_t$ in Heckman and Pinto (2018a). This means that the quantity $b_{t,k} B_t^{+}$ defined in their paper is the constant (across different values of $X$) $\tilde{b}_{t,k}$. Hence the result is equivalent to Theorem 6 in Heckman and Pinto (2018a). ∎

Proof of Theorem 1. (i) We obtain the result by applying the law of iterated expectations to (i) of Lemma 2.
(ii) Using Bayes' rule, we have

\[
E[g(Y_t) \mid S \in \Sigma_{t,k}] = \int E[g(Y_t) \mid S \in \Sigma_{t,k}, X = x]\, f_{X \mid S \in \Sigma_{t,k}}(x)\, dx = \int E[g(Y_t) \mid S \in \Sigma_{t,k}, X = x]\, \frac{P(S \in \Sigma_{t,k} \mid X = x)}{P(S \in \Sigma_{t,k})}\, f_X(x)\, dx = \frac{1}{p_{t,k}} E\big[ \tilde{b}_{t,k}\, g_{t,Z}(X) \big]. \;∎
\]

Proof of Theorem 2.
(i) By the definition of $W_{t_0,t,k}$, we have

\[
P(T = t_0, S \in \Sigma_{t,k}) = P(Z \in W_{t_0,t,k}, S \in \Sigma_{t,k}) = E\big[ P(Z \in W_{t_0,t,k}, S \in \Sigma_{t,k} \mid X) \big] = E\big[ P(Z \in W_{t_0,t,k} \mid X)\, P(S \in \Sigma_{t,k} \mid X) \big] = E\big[ \tilde{b}_{t,k} P_{t,Z}(X)\, \pi_{W_{t_0,t,k}}(X) \big]
\]

where the third equality follows from $Z \perp S \mid X$.

(ii) Using Bayes' rule, we have

\[
\begin{aligned}
E[g(Y_t) \mid T = t_0, S \in \Sigma_{t,k}] &= \int E[g(Y_t) \mid T = t_0, S \in \Sigma_{t,k}, X = x]\, f_{X \mid T = t_0, S \in \Sigma_{t,k}}(x)\, dx \\
&= \int E[g(Y_t) \mid T = t_0, S \in \Sigma_{t,k}, X = x]\, \frac{P(T = t_0, S \in \Sigma_{t,k} \mid X = x)}{P(T = t_0, S \in \Sigma_{t,k})}\, f_X(x)\, dx \\
&= \int E[g(Y_t) \mid S \in \Sigma_{t,k}, X = x]\, \frac{P(T = t_0, S \in \Sigma_{t,k} \mid X = x)}{P(T = t_0, S \in \Sigma_{t,k})}\, f_X(x)\, dx = \frac{1}{q_{t_0,t,k}} E\big[ \tilde{b}_{t,k}\, g_{t,Z}(X)\, \pi_{W_{t_0,t,k}}(X) \big]
\end{aligned}
\]

where the second equality follows from Bayes' rule and the third equality follows from $Y_t \perp T \mid S, X$. By Lemma L-16 of Heckman and Pinto (2018b), we know that under the unordered monotonicity assumption on $S$, $B_t[\cdot, i_1] = B_t[\cdot, i_2]$ for all $s_{i_1}, s_{i_2} \in \Sigma_{t,k}$. Thus $E[g(Y_t) \mid T = t_0, S \in \Sigma_{t,k}]$ is always identified. ∎

The calculations for the semiparametric efficiency bounds follow Newey (1990). The likelihood of the statistical model can be specified as

\[
L(Y, T, Z, X) = \left( \prod_{z \in \mathcal{Z}} \big( f_z(Y, T \mid X)\, \pi_z(X) \big)^{1\{Z = z\}} \right) f_X(X)
\]

where $f_z(\cdot, \cdot \mid X)$ denotes the conditional density of $(Y, T)$ given $Z = z$ and $X$, and $f_X(\cdot)$ denotes the marginal density of $X$. In a regular parametric submodel, in which the true underlying probability measure $P$ is indexed by $\theta_o$, we use the following notation:

\[
\begin{aligned}
s_z(Y, T \mid X; \theta_o) &= \frac{\partial}{\partial \theta} \log f_z(Y, T \mid X; \theta) \Big|_{\theta = \theta_o}, \quad z \in \mathcal{Z} \\
s_\pi(Z \mid X; \theta_o) &= \sum_{z \in \mathcal{Z}} 1\{Z = z\} \frac{\partial}{\partial \theta} \log \pi_z(X; \theta) \Big|_{\theta = \theta_o} \\
s_X(X; \theta_o) &= \frac{\partial}{\partial \theta} \log f_X(X; \theta) \Big|_{\theta = \theta_o}
\end{aligned}
\]

Then the following lemma is immediate. It computes the score and the tangent space, and is invoked repeatedly below in the calculation of the semiparametric efficiency bounds.

Lemma 3.
The score in a regular parametric submodel is

\[
s_{\theta_o}(Y, T, Z, X) = \sum_{z \in \mathcal{Z}} 1\{Z = z\}\, s_z(Y, T \mid X; \theta_o) + s_\pi(Z \mid X; \theta_o) + s_X(X; \theta_o).
\]

Hence the tangent space of the original model is

\[
\mathcal{S}(P) = \Big\{ s \in L_2(P) : s(Y, T, Z, X) = \sum_{z \in \mathcal{Z}} 1\{Z = z\}\, s_z(Y, T \mid X) + s_\pi(Z \mid X) + s_X(X) \text{ for some } s_z, s_\pi, s_X \text{ such that } \int s_z(y, t \mid X) f_z(y, t \mid X)\, dy\, dt \equiv 0 \;\; \forall z; \; \sum_{z \in \mathcal{Z}} s_\pi(z \mid X)\, \pi_z(X) \equiv 0; \; \text{and } \int s_X(x) f_X(x)\, dx = 0 \Big\}
\]

Proof of Theorem 3.
We only prove (i) and (ii); (iii) and (iv) are easier cases that can be proved along the way.

(i) For the pathwise differentiability of $\beta_{t,k}$, in any parametric submodel,

\[
\frac{\partial}{\partial \theta} \beta_{t,k}(\theta) \Big|_{\theta = \theta_o} = \frac{\partial}{\partial \theta} \left( \frac{\tilde{b}_{t,k} E_\theta[I_{t,Z}(X)]}{p_{t,k}(\theta)} \right) \Bigg|_{\theta = \theta_o} = \frac{1}{p_{t,k}} \left( \frac{\partial\, \tilde{b}_{t,k} E_\theta[I_{t,Z}(X)]}{\partial \theta} \Bigg|_{\theta = \theta_o} - \frac{\tilde{b}_{t,k} E_\theta[I_{t,Z}(X)]}{p_{t,k}} \frac{\partial p_{t,k}}{\partial \theta} \Bigg|_{\theta = \theta_o} \right) = \frac{1}{p_{t,k}} \tilde{b}_{t,k} \left( \frac{\partial}{\partial \theta} E_\theta[I_{t,Z}(X)] \Big|_{\theta = \theta_o} - \frac{\partial}{\partial \theta} E_\theta[P_{t,Z}(X)] \Big|_{\theta = \theta_o} \beta_{t,k} \right)
\]

where $\frac{\partial}{\partial \theta} E_\theta[I_{t,Z}(X)]|_{\theta = \theta_o}$ and $\frac{\partial}{\partial \theta} E_\theta[P_{t,Z}(X)]|_{\theta = \theta_o}$ are $N_Z \times 1$ vectors whose $z$-th entries are, respectively,

\[
\int y\, 1\{\tau = t\}\, s_z(y, \tau \mid x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx + \int y\, 1\{\tau = t\}\, s_X(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx
\]

and

\[
\int 1\{\tau = t\}\, s_z(y, \tau \mid x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx + \int 1\{\tau = t\}\, s_X(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx
\]

for $z \in \mathcal{Z}$. The efficient influence function has to satisfy

\[
\frac{\partial}{\partial \theta} \beta_{t,k}(\theta) \Big|_{\theta = \theta_o} = E\big[ \Psi_{\beta_{t,k}} s_{\theta_o} \big], \qquad \text{and} \qquad \Psi_{\beta_{t,k}} \in \mathcal{S}(P).
\]

The expression for $\Psi_{\beta_{t,k}}$ presented in the theorem meets both requirements. In particular, the correspondence between terms in the efficient influence function and the pathwise derivative appears exactly as in Lemma 1 of Hong and Nekipelov (2010b).

(ii) The pathwise derivative of $\gamma_{t_0,t,k}$ can be computed in a similar way:
\[
\frac{\partial}{\partial \theta} \gamma_{t_0,t,k}(\theta) \Big|_{\theta = \theta_o} = \frac{1}{q_{t_0,t,k}} \tilde{b}_{t,k} \frac{\partial}{\partial \theta} E_\theta\big[ I_{t,Z}(X)\, \pi_{W_{t_0,t,k}}(X) \big] \Big|_{\theta = \theta_o} - \frac{\gamma_{t_0,t,k}}{q_{t_0,t,k}} \tilde{b}_{t,k} \frac{\partial}{\partial \theta} E_\theta\big[ P_{t,Z}(X)\, \pi_{W_{t_0,t,k}}(X) \big] \Big|_{\theta = \theta_o}
\]

where the two derivative vectors are $N_Z \times 1$ with $z$-th entries

\[
\begin{aligned}
& \int y\, 1\{\tau = t\}\, s_z(y, \tau \mid x; \theta_o)\, \pi_{W_{t_0,t,k}}(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx + \int y\, 1\{\tau = t\}\, s_X(x; \theta_o)\, \pi_{W_{t_0,t,k}}(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx \\
& \qquad + \int y\, 1\{\tau = t\} \left( \frac{\partial}{\partial \theta} \pi_{W_{t_0,t,k}}(x; \theta) \Big|_{\theta = \theta_o} \right) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx
\end{aligned}
\]

and the same expression with the factor $y$ removed, respectively, for $z \in \mathcal{Z}$. The main difference from part (i) appears when dealing with the last terms in the above two expressions, which can be matched with terms in the efficient influence function of the following two forms:

\[
E[Y 1\{T = t\} \mid Z = z, X] \big( 1\{Z \in W_{t_0,t,k}\} - \pi_{W_{t_0,t,k}}(X) \big), \qquad E[1\{T = t\} \mid Z = z, X] \big( 1\{Z \in W_{t_0,t,k}\} - \pi_{W_{t_0,t,k}}(X) \big).
\]

To explain further, take the latter as an example.
Notice that

\[
1\{Z \in W_{t_0,t,k}\} - \pi_{W_{t_0,t,k}}(X) = \sum_{z \in W_{t_0,t,k}} \big( 1\{Z = z\} - \pi_z(X) \big)
\]

and

\[
\big( 1\{Z = z\} - \pi_z(X) \big)\, s_\pi(Z \mid X; \theta_o) = \frac{1\{Z = z\}}{\pi_z(X)} \frac{\partial}{\partial \theta} \pi_z(X; \theta) \Big|_{\theta = \theta_o} - \pi_z(X)\, s_\pi(Z \mid X; \theta_o).
\]

Using the law of iterated expectations,

\[
\begin{aligned}
E\Big[ E[1\{T = t\} \mid Z = z, X] \big( 1\{Z = z\} - \pi_z(X) \big) s_\pi(Z \mid X; \theta_o) \Big] &= E\left[ E[1\{T = t\} \mid Z = z, X]\, E\left[ \frac{1\{Z = z\}}{\pi_z(X)} \,\Big|\, X \right] \frac{\partial}{\partial \theta} \pi_z(X; \theta) \Big|_{\theta = \theta_o} \right] - E\Big[ E[1\{T = t\} \mid Z = z, X]\, \pi_z(X)\, E\big[ s_\pi(Z \mid X; \theta_o) \mid X \big] \Big] \\
&= E\left[ E[1\{T = t\} \mid Z = z, X]\, \frac{\partial}{\partial \theta} \pi_z(X; \theta) \Big|_{\theta = \theta_o} \right] = \int 1\{\tau = t\} \left( \frac{\partial}{\partial \theta} \pi_z(x; \theta) \Big|_{\theta = \theta_o} \right) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx. \;∎
\end{aligned}
\]

Proof of Theorem 4.
This proof is based on Section 5 of Newey (1994). We first focus on the case of $\beta_{t,k}$; some calculations are done in preparation. For brevity, let $h_t = (h_{Y,t,Z}, h_{t,Z}, \pi)$. Notice that the estimator $\hat{\beta}_{t,k}$ is defined by the following moment condition:

\[
M_{\beta_{t,k}}(X, \beta_{t,k}, h_t) \equiv \tilde{b}_{t,k} \left( \frac{h_{Y,t,z_1}(X)}{\pi_{z_1}(X)}, \cdots, \frac{h_{Y,t,z_{N_Z}}(X)}{\pi_{z_{N_Z}}(X)} \right)' - \beta_{t,k}\, \tilde{b}_{t,k} \left( \frac{h_{t,z_1}(X)}{\pi_{z_1}(X)}, \cdots, \frac{h_{t,z_{N_Z}}(X)}{\pi_{z_{N_Z}}(X)} \right)'
\]

The relevant derivatives are

\[
\begin{aligned}
E\left[ \frac{\partial M_{\beta_{t,k}}}{\partial \beta_{t,k}} \right] &= -\tilde{b}_{t,k} E[P_{t,Z}(X)] = -p^o_{t,k} \\
\frac{\partial M_{\beta_{t,k}}}{\partial h_{Y,t,z_i}} \Big|_{h^o_t} &= \frac{\tilde{b}_{t,k}[i]}{\pi^o_{z_i}(X)} \equiv \delta_{Y,t,z_i}(X) \\
\frac{\partial M_{\beta_{t,k}}}{\partial h_{t,z_i}} \Big|_{h^o_t} &= -\beta_{t,k} \frac{\tilde{b}_{t,k}[i]}{\pi^o_{z_i}(X)} \equiv \delta_{t,z_i}(X) \\
\frac{\partial M_{\beta_{t,k}}}{\partial \pi_{z_i}} \Big|_{h^o_t} &= -\tilde{b}_{t,k}[i] \frac{I^o_{t,z_i}(X)}{\pi^o_{z_i}(X)} + \beta_{t,k}\, \tilde{b}_{t,k}[i] \frac{P^o_{t,z_i}(X)}{\pi^o_{z_i}(X)} \equiv \delta_{\pi,z_i}(X)
\end{aligned}
\]

where $\tilde{b}_{t,k}[i]$ denotes the $i$-th element of the vector $\tilde{b}_{t,k}$. Let

\[
D_{\beta_{t,k}}(X, h_t) = \sum_{z \in \mathcal{Z}} \delta_{Y,t,z}(X) h_{Y,t,z}(X) + \sum_{z \in \mathcal{Z}} \delta_{t,z}(X) h_{t,z}(X) + \sum_{z \in \mathcal{Z}} \delta_{\pi,z}(X) \pi_z(X) = \sum_{j=1}^{N_Z} \frac{\tilde{b}_{t,k}[j]}{\pi^o_{z_j}(X)} \Big[ h_{Y,t,z_j}(X) - \beta^o_{t,k} h_{t,z_j}(X) - \big( I^o_{t,z_j}(X) - \beta^o_{t,k} P^o_{t,z_j}(X) \big) \pi_{z_j}(X) \Big]
\]

and

\[
\begin{aligned}
\alpha_{\beta_{t,k}}(Y, T, Z, X) &\equiv \sum_{z \in \mathcal{Z}} \delta_{Y,t,z}(X) \big( 1\{Z = z\} Y 1\{T = t\} - h^o_{Y,t,z}(X) \big) + \sum_{z \in \mathcal{Z}} \delta_{t,z}(X) \big( 1\{Z = z\} 1\{T = t\} - h^o_{t,z}(X) \big) + \sum_{z \in \mathcal{Z}} \delta_{\pi,z}(X) \big( 1\{Z = z\} - \pi^o_z(X) \big) \\
&= \tilde{b}_{t,k} \Big[ \zeta(Z, X, \pi^o) \odot \big( \iota\, Y 1\{T = t\} - I^o_{t,Z}(X) \big) \Big] - \beta^o_{t,k}\, \tilde{b}_{t,k} \Big[ \zeta(Z, X, \pi^o) \odot \big( \iota\, 1\{T = t\} - P^o_{t,Z}(X) \big) \Big]
\end{aligned}
\]

(with $\iota$ the $N_Z \times 1$ vector of ones and $\odot$ the elementwise product). Then we check Assumptions 5.1 to 5.3 in Newey (1994) in turn. For Assumption 5.1(i), the linearization $D$ can be taken to be $D_{\beta_{t,k}}$ by equation (4.2) of that paper, since $M_{\beta_{t,k}}$ depends on $h_t$ only through its value $h_t(X)$. Assumption 5.1(ii) is satisfied by our condition 4(i) on the convergence rate of $\hat{h}_t$. Assumption 5.2 is the stochastic equicontinuity condition on $D_{\beta_{t,k}}$, which can be verified by our condition 4(ii), since
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big[ D_{\beta_{t,k}}(X_i, \hat{h}_t - h^o_t) - E\big[ D_{\beta_{t,k}}(X, \hat{h}_t - h^o_t) \big] \Big] = \sum_{j=1}^{N_Z} \Big[ \nu_n\big( \delta_{Y,t,z_j} (\hat{h}_{Y,t,z_j} - h^o_{Y,t,z_j}) \big) + \nu_n\big( \delta_{t,z_j} (\hat{h}_{t,z_j} - h^o_{t,z_j}) \big) + \nu_n\big( \delta_{\pi,z_j} (\hat{\pi}_{z_j} - \pi^o_{z_j}) \big) \Big] \xrightarrow{p} 0,
\]

where, for $h : \mathcal{X} \to \mathbb{R}$, $\nu_n(h) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big[ h(X_i) - E[h(X)] \big]$ denotes the empirical process. The $\alpha$ in Assumption 5.3 is constructed to be $\alpha_{\beta_{t,k}}(Y, T, Z, X)$ using Proposition 4 of that paper. From Lemma 5.1 there, we can establish the asymptotically linear representation

\[
\sqrt{n}\, \bar{M}_{\beta_{t,k}}(\beta^o_{t,k}, \hat{h}_t) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big[ M_{\beta_{t,k}}(X_i, \beta^o_{t,k}, h^o_t) + \alpha_{\beta_{t,k}}(Y_i, T_i, Z_i, X_i) \Big] + o_p(1),
\]

where $\bar{M}_{\beta_{t,k}}$ denotes the sample average of $M_{\beta_{t,k}}$. Also, the consistency of $\hat{p}_{t,k}$ follows from $\| \hat{h}_t - h^o_t \| \xrightarrow{p} 0$ and the fact that the $\pi_z$'s are bounded away from zero and one. Then, using Slutsky's theorem, the above results can be combined to obtain the asymptotic normality of $\hat{\beta}_{t,k}$, since $\sqrt{n}(\hat{\beta}_{t,k} - \beta^o_{t,k}) = \sqrt{n}\, \bar{M}_{\beta_{t,k}}(\beta^o_{t,k}, \hat{h}_t) / \hat{p}_{t,k}$. Hence the influence function of $\hat{\beta}_{t,k}$ is $\big( M_{\beta_{t,k}}(X, \beta^o_{t,k}, h^o_t) + \alpha_{\beta_{t,k}} \big) / p^o_{t,k}$, which equals $\Psi_{\beta_{t,k}}$ evaluated at the true parameter values. The term $\alpha_{\beta_{t,k}}$ corrects the bias in estimation due to the presence of the unknown infinite-dimensional nuisance parameter $(h_{Y,t,Z}, h_{t,Z}, \pi)$. The proofs for $\hat{\gamma}_{t_0,t,k}$, $\hat{p}_{t,k}$, and $\hat{q}_{t_0,t,k}$ are essentially the same. For estimating the efficiency bound, consistency of the plug-in estimators follows directly from the consistency of both the nonparametric estimates and the CEP estimators, the continuity of the efficient influence functions in the parameters, and the fact that the propensity scores are bounded away from zero and one.

Lastly, the consistency of $\hat{V}_\kappa$ follows from Lemma 8.3 of Newey and McFadden (1994), where the required "mean-square differentiability" condition (see Newey and McFadden (1994) for more discussion) is checked for $M_{\beta_{t,k}}$ and $\alpha_{\beta_{t,k}}$.
(There is no remainder $o_p(1)$ term because $M_{\beta_{t,k}}$ is linear in $\beta_{t,k}$, and hence it is unnecessary to check Assumptions 5.4 to 5.6 in Newey (1994).) ∎

Proof of Theorem 5.
The proof is similar to those of Theorems 5.1 and 5.2 in Chernozhukov et al. (2018), which verify the assumptions of their Theorem 3.1. First observe that the moment condition (23) is linear in $\beta_{t,k}$. Since $\tilde{b}_{t,k}$ is a finite vector, it suffices in our case to verify the conditions of their Theorem 3.1 for the following score function:

\[
\psi(W, \beta_{t,k}, \upsilon) = \psi^a(W, P_{t,z}, \pi_z)\, \beta_{t,k} + \psi^b(W, I_{t,z}, \pi_z) \equiv \left( \frac{1\{Z = z\}}{\pi_z(X)} \big( 1\{T = t\} - P_{t,z}(X) \big) + P_{t,z}(X) \right) \beta_{t,k} - \frac{1\{Z = z\}}{\pi_z(X)} \big( Y 1\{T = t\} - I_{t,z}(X) \big) - I_{t,z}(X),
\]

where $W = (Y, T, Z, X)$ and $\upsilon = (I_{t,z}, P_{t,z}, \pi_z)$. To check the Neyman orthogonality condition, we compute the Gateaux derivative

\[
\begin{aligned}
\frac{\partial}{\partial r} E\big[ \psi(W, \beta_{t,k}, \upsilon^o + r(\upsilon - \upsilon^o)) \big] \Big|_{r=0} = E\bigg[ & -\frac{1\{Z = z\}}{\pi^o_z(X)^2} \big( 1\{T = t\} - P^o_{t,z}(X) \big) \big( \pi_z(X) - \pi^o_z(X) \big) \beta_{t,k} + \left( P_{t,z}(X) - P^o_{t,z}(X) - \frac{1\{Z = z\}}{\pi^o_z(X)} \big( P_{t,z}(X) - P^o_{t,z}(X) \big) \right) \beta_{t,k} \\
& + \frac{1\{Z = z\}}{\pi^o_z(X)^2} \big( Y 1\{T = t\} - I^o_{t,z}(X) \big) \big( \pi_z(X) - \pi^o_z(X) \big) - \left( I_{t,z}(X) - I^o_{t,z}(X) - \frac{1\{Z = z\}}{\pi^o_z(X)} \big( I_{t,z}(X) - I^o_{t,z}(X) \big) \right) \bigg].
\end{aligned}
\]

It is equal to zero since

\[
E\left[ \frac{1\{Z = z\}}{\pi^o_z(X)} \big( 1\{T = t\} - P^o_{t,z}(X) \big) \,\Big|\, X \right] = E\left[ \frac{1\{Z = z\}}{\pi^o_z(X)} \big( Y 1\{T = t\} - I^o_{t,z}(X) \big) \,\Big|\, X \right] = 0, \quad (39)
\]

and $E\big[ 1\{Z = z\} / \pi^o_z(X) \,\big|\, X \big] =$
1. Inside the nuisance realization set (in which the estimated nuisance parameters take values with probability at least $1 - \Delta_n$), we verify their Assumption 3.2 as follows. Let $\underline{\pi} > 0$ denote the lower bound on the propensity scores and $C$ the constant bounding the relevant moments of $Y 1\{T = t\}$. First,

\[
\| \psi^a(W, P_{t,z}, \pi_z) \|_q = \left\| \frac{1\{Z = z\}}{\pi_z(X)} \big( 1\{T = t\} - P_{t,z}(X) \big) + P_{t,z}(X) \right\|_q \le \left\| \frac{1\{Z = z\}}{\pi_z(X)} \right\|_\infty \big\| 1\{T = t\} - P_{t,z}(X) \big\|_q + \| P_{t,z}(X) \|_q \le 1/\underline{\pi} + 1,
\]

and, by the same argument applied to $\psi^b$,

\[
\| \psi(W, \beta_{t,k}, \upsilon) \|_q \le (1/\underline{\pi} + 1) |\beta_{t,k}| + \| I_{t,z}(X) - I^o_{t,z}(X) \|_q + \| I^o_{t,z}(X) \|_q + \frac{1}{\underline{\pi}} \Big( \| Y 1\{T = t\} - I^o_{t,z}(X) \|_q + \| I_{t,z}(X) - I^o_{t,z}(X) \|_q \Big),
\]

which is bounded by a constant multiple of $1/\underline{\pi} + 1$. Next,

\[
\big| E[\psi^a(W, P_{t,z}, \pi_z)] - E[\psi^a(W, P^o_{t,z}, \pi^o_z)] \big| = \left| E\left[ \frac{1\{Z = z\}}{\pi_z(X)} \big( P^o_{t,z}(X) - P_{t,z}(X) \big) + P_{t,z}(X) - P^o_{t,z}(X) \right] \right| \le 2 \| P_{t,z}(X) - P^o_{t,z}(X) \| / \underline{\pi} \le 2\delta_n / \underline{\pi}.
\]

To bound $\| \psi(W, \beta_{t,k}, \upsilon) - \psi(W, \beta_{t,k}, \upsilon^o) \|$, note that

\[
\left\| \frac{1\{Z = z\}}{\pi_z(X)} \big( 1\{T = t\} - P_{t,z}(X) \big) + P_{t,z}(X) - \frac{1\{Z = z\}}{\pi^o_z(X)} \big( 1\{T = t\} - P^o_{t,z}(X) \big) - P^o_{t,z}(X) \right\| \le \| P_{t,z}(X) - P^o_{t,z}(X) \| +
\]
\[
\left\| \left( \frac{1}{\pi_z(X)} - \frac{1}{\pi^o_z(X)} \right) 1\{Z = z\} \big( 1\{T = t\} - P_{t,z}(X) \big) \right\| + \left\| \frac{1\{Z = z\}}{\pi^o_z(X)} \big( P_{t,z}(X) - P^o_{t,z}(X) \big) \right\| \le (1 + 1/\underline{\pi}) \| P_{t,z}(X) - P^o_{t,z}(X) \| + \| \pi_z(X) - \pi^o_z(X) \| / \underline{\pi}^2 \le \big( 1 + 1/\underline{\pi} + 1/\underline{\pi}^2 \big) \delta_n,
\]

and analogously

\[
\begin{aligned}
& \left\| \frac{1\{Z = z\}}{\pi_z(X)} \big( Y 1\{T = t\} - I_{t,z}(X) \big) + I_{t,z}(X) - \frac{1\{Z = z\}}{\pi^o_z(X)} \big( Y 1\{T = t\} - I^o_{t,z}(X) \big) - I^o_{t,z}(X) \right\| \\
& \quad \le (1 + 1/\underline{\pi}) \| I_{t,z}(X) - I^o_{t,z}(X) \| + \frac{\| \pi_z(X) - \pi^o_z(X) \|}{\underline{\pi}^2} \Big( \| Y 1\{T = t\} - I^o_{t,z} \|_q + \| I_{t,z} - I^o_{t,z} \|_q \Big) \le \big( 1 + 1/\underline{\pi} + C/\underline{\pi}^2 \big) \delta_n.
\end{aligned}
\]

Thus

\[
\| \psi(W, \beta_{t,k}, \upsilon) - \psi(W, \beta_{t,k}, \upsilon^o) \| \le \Big( \big( 1 + 1/\underline{\pi} + 1/\underline{\pi}^2 \big) |\beta_{t,k}| + 1 + 1/\underline{\pi} + C/\underline{\pi}^2 \Big) \delta_n.
\]
Lastly, for any $r \in (0, 1)$, based on equation (39), $\frac{\partial^2}{\partial r^2} E[\psi(W, \beta_{t,k}, \upsilon^o + r(\upsilon - \upsilon^o))]$ is a sum of terms of the form

\[
E\left[ \frac{1\{Z = z\}}{\big( \pi^o_z(X) + r(\pi_z(X) - \pi^o_z(X)) \big)^{j}} \big( P_{t,z}(X) - P^o_{t,z}(X) \big) \big( \pi_z(X) - \pi^o_z(X) \big) \right] \beta_{t,k}, \quad j \in \{2, 3\},
\]

together with the analogous terms in which $I_{t,z}(X) - I^o_{t,z}(X)$ replaces $\big( P_{t,z}(X) - P^o_{t,z}(X) \big) \beta_{t,k}$. Since the denominators are bounded away from zero, the Cauchy–Schwarz inequality yields

\[
\left| \frac{\partial^2}{\partial r^2} E\big[ \psi(W, \beta_{t,k}, \upsilon^o + r(\upsilon - \upsilon^o)) \big] \right| \le \text{Const.} \times \| \pi_z - \pi^o_z \| \Big( \| I_{t,z} - I^o_{t,z} \| + \| P_{t,z} - P^o_{t,z} \| \Big) \le \text{Const.} \times n^{-1/2} \delta_n,
\]

which completes the proof. ∎

Proof of Theorem 6.
First note that the moment conditions can be equivalently represented by $\tilde{b}_j E[m_{j,t_j,Z}(X, \eta)] = 0$, $1 \le j \le J$. The rest of the proof is based mainly on the approach described in Section 3.6 of Hong and Nekipelov (2010a) and the proof of Theorem 1 in Cattaneo (2010). We use a constant $d_\eta \times d_m$ matrix $A$ to transform the overidentified vector of moments into an exactly identified system of equations $A \big( \tilde{b}_j E[m_{j,t_j,Z}(X, \eta)] \big)_{j=1}^{J} =$
0, and then find the $A$-dependent efficient influence function for the exactly identified parameter, choosing the optimal $A$ afterwards. In a parametric submodel, by the implicit function theorem, we have

\[
\frac{\partial \eta}{\partial \theta} \Big|_{\theta = \theta_o} = -(A \Gamma)^{-1} A\, \frac{\partial}{\partial \theta} \big( \tilde{b}_j E_\theta[m_{j,t_j,Z}(X, \eta^o)] \big)_{j=1}^{J} \Big|_{\theta = \theta_o}
\]

where $\frac{\partial}{\partial \theta} E_\theta[m_{j,t_j,Z}(X, \eta^o)]|_{\theta = \theta_o}$ is an $N_Z \times 1$ vector whose $z$-th entry is

\[
\int m_j(y, \eta^o)\, 1\{\tau = t_j\}\, s_z(y, \tau \mid x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx + \int m_j(y, \eta^o)\, 1\{\tau = t_j\}\, s_X(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx
\]

for $z \in \mathcal{Z}$. So the efficient influence function for this exactly identified parameter is

\[
\Psi_A(Y, T, Z, X, \eta^o, \pi^o, m^o_Z) = -(A \Gamma)^{-1} A\, \Psi_m(Y, T, Z, X, \eta^o, \pi^o, m^o_Z)
\]

where $\Psi_m$ is defined by equation (30). It is straightforward to verify that $\Psi_A$ satisfies $\frac{\partial \eta}{\partial \theta}|_{\theta = \theta_o} = E[\Psi_A s_{\theta_o}]$ and $\Psi_A \in \mathcal{S}(P)$. The optimal $A$ is chosen by minimizing the sandwich matrix $E[\Psi_A \Psi_A'] = (A \Gamma)^{-1} A\, E[\Psi_m \Psi_m']\, A' (\Gamma' A')^{-1}$. Thus the efficient influence function for the generally overidentified parameter is obtained when $A = \Gamma' V^{-1}$, with $V = E[\Psi_m \Psi_m']$. Plugging into $\Psi_A$, we get equation (29). ∎

Proof of Theorem 7.
We follow the large-sample theory in Chen et al. (2003) (hereafter CLK), setting $\theta = \eta$, $h = (\pi, m_Z)$, $M(\theta, h) = G(\eta, \pi, m_Z)$, and $M_n(\theta, h) = G_n(\eta, \pi, m_Z)$. Their Theorem 1 is applied first, to show the consistency of $\tilde{\eta}$. Their condition (1.2) is satisfied since $\Lambda$ is compact and $G(\eta, \pi^o, m^o_Z) = \big( \tilde{b}_j E[m^o_{j,t_j,Z}(X, \eta)] \big)_{j=1}^{J}$, which has a unique zero $\eta^o$ by our condition 7(i) and is continuous by our condition 7(iii). As for condition (1.3) of CLK, the continuity of $G$ in $m_{j,t_j,z}$ and $\pi_z$ is verified by the way they enter (linearly, or through taking reciprocals with $\pi_z$ bounded away from 0 and 1), and the uniformity in $\eta$ follows from the fact that $E[m(Y^*, \eta)]$ is bounded as a function of $\eta$ (by its continuity and the compactness of $\Lambda$). Condition (1.4) of CLK is satisfied by our condition 7(ii). The uniform stochastic equicontinuity condition (1.5) of CLK is implied by the fact that, for any $j$ and $z$, the class

\[
\left\{ \frac{1\{Z = z\}}{\pi_z(X)} \big( m_j(Y, \eta) 1\{T = t_j\} - m_{j,t_j,z}(X, \eta) \big) + m_{j,t_j,z}(X, \eta) : \eta \in \Lambda,\; m_{j,t_j,z} \in \mathcal{M}^\delta_{j,z},\; \pi_z \in \Pi^\delta_z \right\}
\]

is Glivenko–Cantelli, which follows from our condition 7(iii) and the results in Van Der Vaart and Wellner (2000) stating that Glivenko–Cantelli classes with integrable envelopes are preserved under continuous transformations. Thus $\tilde{\eta} - \eta^o = o_p(1)$.

We then use Corollary 1 of CLK to show the consistency of $\hat{V}$ and the asymptotic normality of $\hat{\eta}$. Condition (2.2) of CLK is verified by our condition 6(iii). As in the proof of Theorem 5, it is straightforward to show that the moment condition $G$, being based on the efficient influence functions, satisfies the Neyman orthogonality condition with respect to the nuisance parameters $\pi$ and $m_Z$.
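The orthogonality property just invoked can also be seen numerically: perturbing the nuisances $(\pi, m_Z)$ in an arbitrary fixed direction moves the efficient-influence-function moment only at second order in the perturbation size, whereas a naive plug-in moment moves at first order. A minimal sketch under a hypothetical design, with $m_j$ taken to be the identity and nuisances replaced by cell means over a binary covariate:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
X = rng.integers(0, 2, n)
piz = np.where(X == 1, 0.6, 0.4)           # true P(Z = z | X)
Z = rng.random(n) < piz
T = Z & (rng.random(n) < 0.5)
Y = T + rng.normal(0, 1, n)

m = Y * T                                   # m_j(Y, eta) 1{T = t_j}, m_j the identity
m_z = np.array([m[(X == x) & Z].mean() for x in (0, 1)])[X]   # stand-in for m_{j,t_j,z}(X)

def G(pi_fun, m_fun):
    """sample analogue of the orthogonal moment E[(1{Z=z}/pi)(m 1{T} - m_z) + m_z]."""
    return (Z / pi_fun * (m - m_fun) + m_fun).mean()

def G_naive(m_fun):
    """non-orthogonal plug-in moment E[m_z], for contrast."""
    return m_fun.mean()

# perturb both nuisances in a fixed direction and look at the first-order effect
dpi, dm = 0.05 * (2 * X - 1), 0.1 * np.ones(n)
r = 0.1
slope = (G(piz + r * dpi, m_z + r * dm) - G(piz, m_z)) / r       # ~ 0
naive_slope = (G_naive(m_z + r * dm) - G_naive(m_z)) / r         # = 0.1 by construction
```

The orthogonal moment's finite-difference slope is close to zero, while the plug-in moment inherits the full first-order bias of the perturbation.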
For any $j$ and $z$, denote $\pi^r_z(X) = \pi^o_z(X) + r(\pi_z(X) - \pi^o_z(X))$ and $m^r_{j,t_j,z}(X, \eta) = m^o_{j,t_j,z}(X, \eta) + r\big( m_{j,t_j,z}(X, \eta) - m^o_{j,t_j,z}(X, \eta) \big)$. We have

\[
\begin{aligned}
& \frac{\partial}{\partial r} E\left[ \frac{1\{Z = z\}}{\pi^r_z(X)} \big( m_j(Y, \eta) 1\{T = t_j\} - m^r_{j,t_j,z}(X, \eta) \big) + m^r_{j,t_j,z}(X, \eta) \right] \Bigg|_{r=0} \\
& \quad = E\left[ -\frac{1\{Z = z\}}{\pi^o_z(X)^2} \big( \pi_z(X) - \pi^o_z(X) \big) \big( m_j(Y, \eta) 1\{T = t_j\} - m^o_{j,t_j,z}(X, \eta) \big) + \big( m^o_{j,t_j,z}(X, \eta) - m_{j,t_j,z}(X, \eta) \big) \left( \frac{1\{Z = z\}}{\pi^o_z(X)} - 1 \right) \right] = 0,
\end{aligned}
\]

since $E\big[ (1\{Z = z\}/\pi^o_z(X)) \big( m_j(Y, \eta) 1\{T = t_j\} - m^o_{j,t_j,z}(X, \eta) \big) \,\big|\, X \big] = 0$ and $E[1\{Z = z\}/\pi^o_z(X) \mid X] = 1$. Hence the pathwise derivative of $G$ with respect to $(\pi, m_Z)$ is zero in any direction, and condition (2.3) of CLK is verified. Their condition (2.4) is satisfied by our condition 7(ii). To show the stochastic equicontinuity condition (2.5), it suffices to show that the class

\[
\left\{ \frac{1\{Z = z\}}{\pi_z(X)} \big( m_j(Y, \eta) 1\{T = t_j\} - m_{j,t_j,z}(X, \eta) \big) + m_{j,t_j,z}(X, \eta) : \eta \in \Lambda^\delta,\; m_{j,t_j,z} \in \mathcal{M}^\delta_{j,z},\; \pi_z \in \Pi^\delta_z \right\}
\]

is Donsker. This follows from our condition 7(iv) and Theorem 2.10.6 (as well as Examples 2.10.7–2.10.9) in Van Der Vaart and Wellner (1996). Condition (2.6) of CLK is trivially verified using the central limit theorem. For the condition in Corollary 1 of CLK, let

\[
\Omega(\eta, \pi, m_Z) = E\big[ \Psi_m(Y, T, Z, X, \eta, \pi, m_Z)\, \Psi_m(Y, T, Z, X, \eta, \pi, m_Z)' \big]
\]

and

\[
\Omega_n(\eta, \pi, m_Z) = \frac{1}{n} \sum_{i=1}^{n} \Psi_m(Y_i, T_i, Z_i, X_i, \eta, \pi, m_Z)\, \Psi_m(Y_i, T_i, Z_i, X_i, \eta, \pi, m_Z)'.
\]

Then $V = \Omega(\eta^o, \pi^o, m^o_Z)$ and $\hat{V} = \Omega_n(\tilde{\eta}, \hat{\pi}, \hat{m}_Z)$. For any $\delta_n \downarrow$
0, in the shrinking neighborhoods $\Lambda^{\delta_n}$, $\Pi^{\delta_n}_z$, and $\mathcal{M}^{\delta_n}_{j,z}$, we have

\[
\sup \| \Omega_n(\eta, \pi, m_Z) - V \| \le \sup \big\| \Omega_n(\eta, \pi, m_Z) - \Omega(\eta, \pi, m_Z) - \big( \Omega_n(\eta^o, \pi^o, m^o_Z) - \Omega(\eta^o, \pi^o, m^o_Z) \big) \big\| + \sup \| \Omega(\eta, \pi, m_Z) - \Omega(\eta^o, \pi^o, m^o_Z) \| + \| \Omega_n(\eta^o, \pi^o, m^o_Z) - \Omega(\eta^o, \pi^o, m^o_Z) \|.
\]

That the first term on the right-hand side is $o_p(1)$ follows from the stochastic equicontinuity of $\Omega_n - \Omega$, which results from the (element-wise) Donsker property of the matrix $\Psi_m \Psi_m'$. The second term is $o_p(1)$ since $\Omega$ is continuous in its arguments (equation (30) and condition 6(iii)), while the third term is $o_p(1)$ by the standard central limit theorem. Hence we have shown that $\hat{V} - V = o_p(1)$ and $\hat{\eta} - \eta^o = o_p(1)$.

Lastly, using the arguments in Theorem 7.4 of Newey and McFadden (1994), the numerical derivative $\hat{\Gamma}$ is consistent. ∎

Proof of Theorem 8. (ii) The second part of the theorem is proved first. Suppose $L$ satisfies Assumption 2; we want to find an $\bar{L}$ that induces $L$ and satisfies Assumption 1. The strategy is to make the $Y_t$'s mutually independent, and to set their conditional (on $S \in \Sigma_{t,k}$) distributions to the identified values when identifiable, and to arbitrary values when unidentifiable.

Let $\tilde{P}(\cdot \mid X)$ be an arbitrary conditional distribution on the support of $Y$. Define the joint distribution of $(Z, X)$ as identified from $L$. The goal is to construct the conditional distribution of $(Y_t : t \in \mathcal{T}, S) \mid Z, X$ so that it does not depend on $Z$. For any measurable sequence of sets $\{B_1, \cdots, B_{N_T}\}$ on the support of $Y$, $\Sigma \subset \mathcal{S}$, and $z \in \mathcal{Z}$, let

\[
P\big( Y_{t_1} \in B_1, \cdots, Y_{t_{N_T}} \in B_{N_T}, S \in \Sigma \mid Z = z, X \big) = \left( \prod_{t \in \mathcal{T}} \frac{\tilde{Q}_t(X, B_t \times \Sigma)}{\tilde{Q}_t(X, \mathcal{Y} \times \Sigma)} \right) \tilde{Q}_{t_1}(X, \mathcal{Y} \times \Sigma).
\]

For $s \notin \mathcal{S}$, let $P\big( Y_{t_1} \in E_1, \cdots, Y_{t_{N_T}} \in E_{N_T}, S = s \mid Z = z, X \big) =$
0. We have then fully specified a joint distribution of $\big( \{Y_t : t \in \mathcal{T}\}, \{T_z : z \in \mathcal{Z}\}, Z, X \big)$, denoted $\bar{L}$, that is consistent with $L$ and satisfies Assumption 1. Now let $C$ be a condition on the observed law $L$ such that, whenever $\bar{L}$ satisfies Assumption 1, $O(\bar{L})$ satisfies condition $C$. The contrapositive statement is that if $L$ violates $C$, then any $\bar{L}$ with $O(\bar{L}) = L$ has to violate Assumption 1. Therefore, in the current case, where some $\bar{L}$ inducing $L$ satisfies Assumption 1, $L$ has to satisfy $C$.

(i) The first statement is trivial. For the second statement, suppose $L$ satisfies Assumption 2; we want to find an $\bar{L}$ that induces $L$ and violates Assumption 1. The strategy is to define the structural functions so that they depend on $Z$. In particular, specify $(Y_t : t \in \mathcal{T}, S) \mid Z, X$ to be the same as before when conditioning on $Z = z_1$. When $Z \ne z_1$, let

\[
P\big( Y_{t_1} \in B_1, \cdots, Y_{t_{N_T}} \in B_{N_T}, S \in \Sigma \mid Z = z, X \big) = L_{Y \mid S}(B_1, \cdots, B_{N_T})\, \tilde{Q}_{t_1}(X, \mathcal{Y} \times \Sigma)
\]

where $L_{Y \mid S}$ denotes a joint law of $N_T$ not mutually independent random variables whose marginal distributions are equal to $\tilde{Q}_t(X, B_t \times \Sigma) / \tilde{Q}_t(X, \mathcal{Y} \times \Sigma)$. Clearly, the $Y_t$'s are $Z$-dependent. ∎
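To close, the optimal-weighting step in the proof of Theorem 6 ($A = \Gamma' V^{-1}$) can be illustrated on a deliberately simple overidentified toy problem, unrelated to the paper's model: a scalar $\eta$ observed through two moment conditions with different noise levels. Two-step GMM with the inverse moment covariance as weight downweights the noisier moment, while identity weighting does not:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
eta = 1.5                                 # true scalar parameter in this toy example
# two overidentifying moment conditions E[g_j - eta] = 0 with different noise
g1 = eta + rng.normal(0, 1.0, n)          # precise measurement of eta
g2 = eta + rng.normal(0, 3.0, n)          # noisy measurement of eta

def gmm(W):
    """minimize gbar(e)' W gbar(e) for the linear moments (g1 - e, g2 - e)."""
    G = np.array([-1.0, -1.0])            # Jacobian (Gamma) of the moments in e
    gbar0 = np.array([g1.mean(), g2.mean()])       # moments evaluated at e = 0
    # gbar(e) = gbar0 + G e  =>  e = -(G'WG)^{-1} G'W gbar0  (closed form)
    return -gbar0 @ W @ G / (G @ W @ G)

eta_id = gmm(np.eye(2))                   # identity weighting: simple average
S = np.cov(np.vstack([g1 - eta_id, g2 - eta_id]))  # estimated moment covariance V
eta_eff = gmm(np.linalg.inv(S))           # optimal weighting, A ~ Gamma' V^{-1}
```

The efficient estimate is the precision-weighted average of the two measurements, the scalar analogue of choosing $A = \Gamma' V^{-1}$ in the proof; both estimates are consistent, but the optimally weighted one has the smaller asymptotic variance.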