Generalized Local IV with Unordered Multiple Treatment Levels: Identification, Efficient Estimation, and Testable Implication
arXiv [econ.EM]
Identification, Efficient Estimation, and Testable Implication*
Haitian Xie†
January 2020
Abstract
This paper studies the econometric aspects of the generalized local IV framework defined using the unordered monotonicity condition, which accommodates multiple levels of treatment and instrument in program evaluations. The framework is explicitly developed to allow for conditioning covariates. Nonparametric identification results are obtained for a wide range of policy-relevant parameters. Semiparametric efficiency bounds are computed for these identified structural parameters, including the local average structural function and the local average structural function on the treated. Two semiparametric estimators are introduced that achieve efficiency. One is the conditional expectation projection estimator defined through the nonparametric identification equation. The other is the double/debiased machine learning estimator defined through the efficient influence function, which is suitable for high-dimensional settings. More generally, for parameters implicitly defined by possibly non-smooth and overidentifying moment conditions, this study provides the calculation of the corresponding semiparametric efficiency bounds and proposes efficient semiparametric GMM estimators, again using the efficient influence functions. Then an optimal set of testable implications of the model assumption is proposed. Previous results developed for the binary local IV model and the multivalued treatment model under unconfoundedness are encompassed as special cases in this more general framework. The theoretical results are illustrated by an empirical application investigating the return to schooling across different fields of study, and a Monte Carlo experiment.
Keywords: Generalized Local IV, Multi-valued Treatment, Unordered Monotonicity, Semiparametric Efficiency, Efficient Estimation, Optimal Testable Implication, Return to Schooling.

* The author is grateful to Kaspar Wuthrich, Yixiao Sun, Graham Elliott, and Yu-Chang Chen for comments and suggestions.
† Department of Economics, University of California San Diego. Email: [email protected].

Introduction
Since the seminal works of Imbens and Angrist (1994) and Angrist et al. (1996), the local instrumental variable has become a popular method for causal inference in economics. Instead of imposing homogeneity of the treatment effects among individuals, as in the classical IV regression model, the local IV framework allows for heterogeneous treatment effects. To achieve identification, however, the treatment in practice has to be collapsed into one binary indicator. But oftentimes, treatments in economically relevant programs are multi-leveled in nature. They can be ordered, such as tax rates, years of schooling, and numbers of cigarettes smoked; or unordered, such as different job training programs, fields of study in college, and vouchers to various housing opportunities. The unordered case is more general than the ordered one, since ordered treatment levels can also be considered as unordered. The question then becomes how to evaluate programs at a finer level within the local IV framework, incorporating the multiplicity in treatment levels. One possible solution is given in Heckman and Pinto (2018a) and Pinto (2019), who use their unordered monotonicity condition to generalize the identification results of the binary local IV model to situations with multiple unordered levels of treatment. The extension from binary treatments to multi-valued ones further demonstrates the source of identification in the local IV model making use of the monotonicity conditions. In the current study, this broader framework is referred to as the generalized local IV model.

The marginal benefits of introducing multiple levels of treatment into the local IV model are twofold. First, as mentioned above, there are many empirical cases where treatments are explicitly multi-valued. Collapsing these levels together is not very useful for a detailed analysis of the program effects, which covers estimation and inference for parameters of finer subpopulations defined by the way the treatment choice varies with the instrument.
Second, when the binary treatment is further divided into multiple values, efficiency gains are possible, provided that overidentifying restrictions are justified by the underlying economic theory. Conversely, such theories can be tested through these restrictions, which are defined by parameters available only when the multiplicity of treatment is modeled.

This paper is concerned with the econometric aspects of the generalized local IV model, which turns the identification results into applicable methods in empirical research. The framework is first extended to allow for conditioning covariates, which is important because, often in observational studies, the instrument is only valid conditional on other factors. The conditioning issue here suffers the same problem as in the binary local IV case, since the subpopulations are not identified. Heckman and Pinto (2018a) and Pinto (2019) focused, among other things, on the identification of the conditional local average structural function (LASF) and type probabilities. Additional identification results, in the unconditional sense, are obtained in this paper for more policy-relevant parameters, including the local average structural function on the treated (LASF-T). Semiparametric efficiency bounds are computed for a wide range of structural parameters, including the LASFs and LASF-Ts. For these parameters with explicit definitions, conditional expectation projection (CEP) estimators, defined as semiparametric two-step estimators through the identification equations, are shown to achieve the efficiency bound, which is analogous to results in the literature (Frölich, 2007; Hahn, 1998; Hong and Nekipelov, 2010a). Efficiency is also proven for the double/debiased machine learning (DML) estimators (Chernozhukov et al., 2018) defined through the efficient influence functions. These estimators are well suited to modern high-dimensional settings since their moment conditions satisfy the Neyman orthogonality condition with respect to the nonparametric nuisance parameters.
More generally, for parameters implicitly defined by possibly non-smooth and overidentifying moment conditions, this study provides the calculation of the semiparametric efficiency bounds and proposes efficient semiparametric GMM estimators defined through the efficient influence functions. Two important cases can be incorporated: one is quantile estimation (Melly and Wüthrich, 2017; Firpo, 2007), and the other is the aforementioned case where the underlying economic theory provides overidentification. Optimal joint inferences can be conducted across different treatment levels, based on the semiparametric efficient estimations. The assumption of the generalized local IV model is refutable but not verifiable. An optimal set of testable implications of the model assumptions is proposed, in the sense that the refutation of this particular set of implications necessarily leads to the rejection of all implications of the model assumptions.

The literature on semiparametric efficiency in program evaluation starts with the seminal work of Hahn (1998), which studies the benchmark case of estimating the average treatment effect (ATE) under unconfoundedness. When endogeneity is present, that is, in the framework of local

Footnote: The problem of joint inference is not salient in the binary local IV model since there is usually only a single parameter of interest.
The general framework is presented in this section. We have a treatment variable $T$ taking values in the unordered set $\mathcal{T} = \{t_1, \cdots, t_{N_T}\}$. The instrument $Z$ takes values in the unordered set $\mathcal{Z} = \{z_1, \cdots, z_{N_Z}\}$. The random variables $(Y_{t_1}, \cdots, Y_{t_{N_T}})$, with $Y_t \in \mathcal{Y} \subset \mathbb{R}$, $t \in \mathcal{T}$, represent the potential outcomes under each treatment level. These are assumed to have finite second moments. The random variables $(T_{z_1}, \cdots, T_{z_{N_Z}})$, with $T_z \in \mathcal{T}$, $z \in \mathcal{Z}$, represent the potential treatment status under each instrument level. The random vector $X$ is a set of covariates, which takes values in $\mathcal{X} \subset \mathbb{R}^{d_X}$. Also define the random vector $S = (T_{z_1}, \cdots, T_{z_{N_Z}})$, which denotes the type of the individual. Let $\mathcal{S} \subset \mathcal{T}^{N_Z}$ be the support of $S$. The observed treatment and observed outcome are defined as $T = \sum_{z \in \mathcal{Z}} 1\{Z = z\} T_z$ and $Y = \sum_{t \in \mathcal{T}} 1\{T = t\} Y_t$, respectively. An equivalent formulation uses structural equations, as in Heckman and Pinto (2018a) and Pinto (2019). Denoting the function and the determined random variable by the same notation, the treatment and outcome can be defined as $T = T(Z, X, V)$ and $Y = Y(T, X, V, \epsilon)$, where $Z, V, \epsilon$ are mutually independent conditional on $X$. Under this formulation, $T_z = T(z, X, V)$, $Y_t = Y(t, X, V, \epsilon)$, and $S = (T(z_1, X, V), \cdots, T(z_{N_Z}, X, V))$.

The following notations are used throughout the paper. Let $\pi(X) = (\pi_{z_1}(X), \cdots, \pi_{z_{N_Z}}(X))$, where $\pi_z(X) = P(Z = z \mid X)$. For any $t \in \mathcal{T}$, let $P_{t,z}(X) = P(T = t \mid Z = z, X)$ and $P_{t,Z}(X) = (P_{t,z_1}, \cdots, P_{t,z_{N_Z}})$. For any measurable $g: \mathcal{Y} \to \mathbb{R}$, let $g_{t,z} = E[g(Y) 1\{T = t\} \mid Z = z, X]$ and $g_{t,Z} = (g_{t,z_1}, \cdots, g_{t,z_{N_Z}})$. Let $B_t$ be the $N_Z \times N_S$ binary matrix whose $(i,j)$th element is $1\{s_j[i] = t\}$, and denote its Moore-Penrose inverse by $B_t^+$.

Footnote: It is convenient to assume this in the beginning, since the main focus of the paper is on efficiency.
Let $\Sigma_{t,k} \subset \mathcal{S}$, $k = 0, \cdots, N_Z$, be the set of types in which treatment $t$ appears exactly $k$ times, i.e., $\Sigma_{t,k} = \{ s \in \mathcal{S} : \sum_{i=1}^{N_Z} 1\{s[i] = t\} = k \}$. Note that the $\Sigma_{t,k}$'s form a partition of the type configuration, i.e., $\mathcal{S} = \bigsqcup_{k=0}^{N_Z} \Sigma_{t,k}$. Let $\tilde{b}_{t,k} = b_{t,k} B_t^+$, where $b_{t,k} = (1\{s_1 \in \Sigma_{t,k}\}, \cdots, 1\{s_{N_S} \in \Sigma_{t,k}\})$. The main assumption of the generalized local IV model is presented below.

Assumption 1.
Generalized Local IV:
(i) Conditional independence: $(\{Y_t : t \in \mathcal{T}\}, \{T_z : z \in \mathcal{Z}\}) \perp Z \mid X$.
(ii) Type constraint: the type support $\mathcal{S}$ is known and satisfies, for any $t \in \mathcal{T}$ and $z, z' \in \mathcal{Z}$, either $1\{T_z = t\} \ge 1\{T_{z'} = t\}$ or $1\{T_z = t\} \le 1\{T_{z'} = t\}$.
(iii) First stage: for all $z, z' \in \mathcal{Z}$ and $t \in \mathcal{T}$, it holds that $\pi_z(X) \ge \underline{\pi} > 0$ and $P(T_z = t \mid X) \ne P(T_{z'} = t \mid X)$.

These assumptions are essentially the multi-valued analog of those used in Abadie (2003). Assumption 1(i) concerns the validity of the instrument conditional on $X$. Assumption 1(ii) is the unordered monotonicity constraint on the type configuration $\mathcal{S}$. It means that a shift in the instrument moves all agents uniformly toward or against each possible treatment choice (Heckman and Pinto, 2018a). As pointed out by Vytlacil (2002), the LATE-type monotonicity condition is a restriction across individuals on the relationship between different hypothetical treatment choices defined in terms of an instrument. Assumption 1(iii) requires that the instrument has some effect on the selection of each treatment level, and also implies that the support of $X$ does not change with the value of $Z$. The exclusion restrictions of the instrument from the outcomes are already imposed in the definition of the potential outcomes. The observed data is assumed to be an IID sample $(Y_i, T_i, Z_i, X_i)_{i=1}^n$.

Footnote: The concept of $B_t$ is defined using $\mathcal{S}$, and hence is nonrandom and does not depend on $X$. The original definition of $B_t$ in Heckman and Pinto (2018a) involves random variables, and is difficult to state unambiguously in the presence of $X$. I thank Yixiao Sun for helpful suggestions on this.
Footnote: An equivalent condition for unordered monotonicity, shown by Heckman and Pinto (2018a), is that $B_t$ is lonesum for any $t \in \mathcal{T}$.
Footnote: The strong overlapping assumption is imposed here for estimation. For identification, it suffices to impose the weaker condition $\pi_z(X) \in (0, 1)$.

The main example of this paper has treatment levels $\mathcal{T} = \{t_1, t_2, t_3\}$ and instrument levels $\mathcal{Z} = \{z_1, z_2\}$.
Note that the indexing is only for convenience; the levels are intrinsically unordered. The type configuration $\mathcal{S}$ is specified below.

              s_1   s_2   s_3   s_4   s_5
   T_{z_1}:   t_1   t_2   t_3   t_1   t_2
   T_{z_2}:   t_1   t_2   t_3   t_3   t_3

The unordered monotonicity is satisfied: $1\{T_{z_1} = t_1\} \ge 1\{T_{z_2} = t_1\}$, $1\{T_{z_1} = t_2\} \ge 1\{T_{z_2} = t_2\}$, and $1\{T_{z_1} = t_3\} \le 1\{T_{z_2} = t_3\}$. The type partitions are, for $t_1$: $\Sigma_{t_1,0} = \{s_2, s_3, s_5\}$, $\Sigma_{t_1,1} = \{s_4\}$, $\Sigma_{t_1,2} = \{s_1\}$; for $t_2$: $\Sigma_{t_2,0} = \{s_1, s_3, s_4\}$, $\Sigma_{t_2,1} = \{s_5\}$, $\Sigma_{t_2,2} = \{s_2\}$; and for $t_3$: $\Sigma_{t_3,0} = \{s_1, s_2\}$, $\Sigma_{t_3,1} = \{s_4, s_5\}$, $\Sigma_{t_3,2} = \{s_3\}$. The $B_t$'s and their generalized inverses are
\[
B_{t_1} = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad
B_{t_2} = \begin{pmatrix} 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \end{pmatrix}, \quad
B_{t_3} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix},
\]
\[
B_{t_1}^+ = \begin{pmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \\ 1 & -1 \\ 0 & 0 \end{pmatrix}, \quad
B_{t_2}^+ = \begin{pmatrix} 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \\ 1 & -1 \end{pmatrix}, \quad
B_{t_3}^+ = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ -0.5 & 0.5 \\ -0.5 & 0.5 \end{pmatrix}.
\]
The $b_{t,k}$'s, and hence the $\tilde{b}_{t,k}$'s, are displayed in the following table.

               b_{t,1}        b_{t,2}        b~_{t,1}   b~_{t,2}
   t = t_1    (0,0,0,1,0)    (1,0,0,0,0)    (1,-1)     (0,1)
   t = t_2    (0,0,0,0,1)    (0,1,0,0,0)    (1,-1)     (0,1)
   t = t_3    (0,0,0,1,1)    (0,0,1,0,0)    (-1,1)     (1,0)

Footnote: These five groups are similarly defined in Kline and Walters (2016) for their analysis of the Head Start program.

In this example, $\pi_{z_1}(X) = P(Z = z_1 \mid X)$, $\pi_{z_2}(X) = P(Z = z_2 \mid X)$, and
\[ P_{t_j,Z}(X) = \big( P(T = t_j \mid Z = z_1, X),\ P(T = t_j \mid Z = z_2, X) \big), \quad j = 1, 2, 3. \]
For any measurable $g$, we have
\[ g_{t_j,Z}(X) = \big( E[g(Y) 1\{T = t_j\} \mid Z = z_1, X],\ E[g(Y) 1\{T = t_j\} \mid Z = z_2, X] \big), \quad j = 1, 2, 3. \]

Conditioning on $X$, Heckman and Pinto (2018a) establish the identification of the probabilities of $S$ lying in any of the $\Sigma_{t,k}$'s and the conditional distribution of $Y_t$ given $S \in \Sigma_{t,k}$. This result is presented in Appendix B as Lemma 2.
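As a numerical check on this algebra, the matrices $B_t$, their Moore-Penrose inverses, and the weights $\tilde{b}_{t,k}$ can be computed directly from a type configuration. A minimal sketch in Python (the encoding of types as matrix columns, with treatments coded 1, 2, 3, is my own choice):

```python
import numpy as np

# Type configuration of the main example: column j gives (T_{z1}, T_{z2}) for type s_j.
S_types = np.array([[1, 2, 3, 1, 2],   # row 1: T_{z1} for s1..s5
                    [1, 2, 3, 3, 3]])  # row 2: T_{z2} for s1..s5

def B(t):
    """N_Z x N_S binary matrix with (i, j) entry 1{s_j[i] = t}."""
    return (S_types == t).astype(float)

def b(t, k):
    """Row indicator of the types in which treatment t appears exactly k times."""
    return ((S_types == t).sum(axis=0) == k).astype(float)

def b_tilde(t, k):
    """The weight vector b_{t,k} B_t^+ appearing in the identification results."""
    return b(t, k) @ np.linalg.pinv(B(t))

# b_tilde(1, 1) equals (1, -1): it picks out P(T=t1|z1) - P(T=t1|z2);
# b_tilde(3, 1) equals (-1, 1); b_tilde(3, 2) equals (1, 0).
```

Running `b_tilde` over all pairs $(t, k)$ with $k \ge 1$ reproduces the table of $\tilde{b}_{t,k}$'s above.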
Bayes' rule is applied to turn these conditional identification results into unconditional ones. In particular, the difficulty here, that the conditional distribution of $X \mid S \in \Sigma_{t,k}$ is unidentified, is similar to that in the classical local IV model. However, using Bayes' rule, this unidentified distribution can be represented as
\[ f_{X \mid S \in \Sigma_{t,k}}(x) = \frac{P(S \in \Sigma_{t,k} \mid X = x)}{P(S \in \Sigma_{t,k})} f_X(x), \]
which is identified.

Theorem 1.
For $t \in \mathcal{T}$, $k = 1, \cdots, N_Z$, and $g$ measurable, the following quantities are identified.
(i) Type probabilities:
\[ p_{t,k} \equiv P(S \in \Sigma_{t,k}) = \tilde{b}_{t,k} E[P_{t,Z}(X)]. \quad (1) \]
(ii) Mean potential outcome conditioning on type:
\[ E[g(Y_t) \mid S \in \Sigma_{t,k}] = \frac{1}{p_{t,k}} \tilde{b}_{t,k} E[g_{t,Z}(X)]. \quad (2) \]

The above theorem provides identification for quantities solely related to the type $S$ of the individual. The next results concern quantities related to both $S$ and the actual treatment received $T$. In particular, the conditional distribution of $Y_t$ given $T = t$ and $S \in \Sigma_{t,k}$ might be of more interest in the current setting. This is because the main source of identification lies in the structural functions, the $Y_t$'s, instead of the treatment effects. Hence it is more attractive to study the expectation of $Y_t$ inside the subpopulation whose treatment is actually $t$. The following theorem deals with the identification of distributions relevant to this idea.

Footnote: The term LASF is reserved for the more specific case of $g$ being the identity map.

Theorem 2.
For $t \in \mathcal{T}$, $k = 1, \cdots, N_Z$, and $g$ measurable, if for some $t' \in \mathcal{T}$ there exists $W_{t',t,k} \subset \mathcal{Z}$ such that
\[ S \in \Sigma_{t,k},\ T_z = t' \iff S \in \Sigma_{t,k},\ z \in W_{t',t,k}, \]
and denoting $\pi_W(X) = \sum_{z \in W} \pi_z(X)$ for $W \subset \mathcal{Z}$, then the following quantities are identified.
(i) Treatment status and type probability:
\[ q_{t',t,k} \equiv P(T = t', S \in \Sigma_{t,k}) = \tilde{b}_{t,k} E\big[ P_{t,Z}(X)\, \pi_{W_{t',t,k}}(X) \big]. \quad (3) \]
(ii) Mean potential outcome conditioning on treatment status and type:
\[ E[g(Y_t) \mid T = t', S \in \Sigma_{t,k}] = \frac{1}{q_{t',t,k}} \tilde{b}_{t,k} E\big[ g_{t,Z}(X)\, \pi_{W_{t',t,k}}(X) \big]. \quad (4) \]
In particular, the probability
\[ q_{t,k} \equiv P(T = t, S \in \Sigma_{t,k}), \quad (5) \]
and the mean potential outcome
\[ E[g(Y_t) \mid T = t, S \in \Sigma_{t,k}] = \frac{1}{q_{t,k}} \tilde{b}_{t,k} E\big[ g_{t,Z}(X)\, \pi_{W_{t,k}}(X) \big], \quad (6) \]
for the $t$-treated subpopulation with type $S \in \Sigma_{t,k}$, are always identified, where $q_{t,k} = q_{t,t,k}$ and $W_{t,k} \equiv W_{t,t,k}$.

To the best of my knowledge, this result is new in the literature. The set $W_{t',t,k}$ defined in the theorem contains the $z$'s such that, inside the subpopulation with type $S \in \Sigma_{t,k}$, individuals assigned $z$ take, and only take, the treatment $t'$. The unordered monotonicity condition guarantees that such a $W_{t,k}$ always exists for any pair of $t, k$. In the binary local IV case, it is shown in Hong and Nekipelov (2010a), as a special case, that inside the subpopulation of treated compliers, both of the structural functions are identified.

In most cases, the parameters of interest are the average structural functions identifiable in certain subpopulations. They are defined by taking $g = I$, the identity map on $\mathcal{Y}$, in equations (2) and (4), which gives the LASFs and LASF-Ts displayed below:
\[ \beta_{t,k} \equiv E[Y_t \mid S \in \Sigma_{t,k}] = \frac{1}{p_{t,k}} \tilde{b}_{t,k} E[I_{t,Z}(X)], \qquad \gamma_{t',t,k} \equiv E[Y_t \mid T = t', S \in \Sigma_{t,k}] = \frac{1}{q_{t',t,k}} \tilde{b}_{t,k} E\big[ I_{t,Z}(X)\, \pi_{W_{t',t,k}}(X) \big]. \quad (7) \]
As before, let $\gamma_{t,k} = \gamma_{t,t,k}$.
To make the ideas concrete, the identification in the main example is computed for the case of $\Sigma_{t_1,1} = \{s_4\}$; the other parameters can be computed in the same way. By Theorem 1,
\[ p_{t_1,1} = P(S = s_4) = E\big[ P(T = t_1 \mid Z = z_1, X) - P(T = t_1 \mid Z = z_2, X) \big], \]
\[ \beta_{t_1,1} = E[Y_{t_1} \mid S = s_4] = \frac{1}{p_{t_1,1}} E\big[ E[Y 1\{T = t_1\} \mid Z = z_1, X] - E[Y 1\{T = t_1\} \mid Z = z_2, X] \big]. \]
For $\Sigma_{t_1,1} = \{s_4\}$, $W_{t_1,t_1,1} = \{z_1\}$, thus by Theorem 2,
\[ q_{t_1,1} = P(T = t_1, S = s_4) = E\big[ \big( P(T = t_1 \mid Z = z_1, X) - P(T = t_1 \mid Z = z_2, X) \big) P(Z = z_1 \mid X) \big], \]
\[ \gamma_{t_1,1} = E[Y_{t_1} \mid T = t_1, S = s_4] = \frac{1}{q_{t_1,1}} E\big[ \big( E[Y 1\{T = t_1\} \mid Z = z_1, X] - E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) P(Z = z_1 \mid X) \big]. \]

A subtle point here is that the parameters $p_{t,k}$ and $q_{t',t,k}$ can potentially be overidentified. For example, $\Sigma_{t_3,1} = \{s_4, s_5\} = \Sigma_{t_1,1} \cup \Sigma_{t_2,1}$, leading to two ways to calculate the probability $P(S \in \{s_4, s_5\})$. In this case, these ways give rise to identical expressions for $P(S \in \{s_4, s_5\})$, namely,
\[
\begin{aligned}
P(S = s_4) + P(S = s_5) &= E\big[ P(T = t_1 \mid Z = z_1, X) - P(T = t_1 \mid Z = z_2, X) \big] + E\big[ P(T = t_2 \mid Z = z_1, X) - P(T = t_2 \mid Z = z_2, X) \big] \\
&= E\big[ P(T \in \{t_1, t_2\} \mid Z = z_1, X) - P(T \in \{t_1, t_2\} \mid Z = z_2, X) \big] \\
&= E\big[ P(T = t_3 \mid Z = z_2, X) - P(T = t_3 \mid Z = z_1, X) \big] \\
&= P(S \in \{s_4, s_5\}).
\end{aligned}
\]
Thus, it can be shown that there is no overidentifying restriction on the observable distribution in the main example. However, there are cases satisfying Assumption 1 that impose overidentifying restrictions. Consider removing $s_5$ from the configuration $\mathcal{S}$, so that $S$ can only take on the four values shown in the following table.

              s_1   s_2   s_3   s_4
   T_{z_1}:   t_1   t_2   t_3   t_1
   T_{z_2}:   t_1   t_2   t_3   t_3

The unordered monotonicity is still satisfied. But it is automatically imposed that
\[ P(S = s_4) = E\big[ P(T = t_1 \mid Z = z_1, X) - P(T = t_1 \mid Z = z_2, X) \big] = E\big[ P(T = t_3 \mid Z = z_2, X) - P(T = t_3 \mid Z = z_1, X) \big], \quad (8) \]
which is an overidentifying restriction.
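The identification formulas of Theorems 1 and 2 can be checked by simulation on the main example. The sketch below uses a hypothetical DGP of my own (all numeric probabilities are illustrative): a binary covariate $X$, an instrument that is valid only conditional on $X$, and type probabilities that depend on $X$.

```python
import numpy as np

# Hypothetical DGP: Z and the type S both depend on X, but are independent
# given X, so Assumption 1(i) holds conditionally.
rng = np.random.default_rng(0)
n = 200_000
X = rng.integers(0, 2, n)
pz2 = np.where(X == 1, 0.6, 0.4)             # P(Z = z2 | X)
Z = 1 + (rng.random(n) < pz2).astype(int)    # Z in {1, 2}
probs = np.column_stack([np.where(X == 1, p1, p0)
                         for p0, p1 in [(.2, .1), (.2, .2), (.2, .3), (.2, .2), (.2, .2)]])
S_i = 1 + (rng.random(n)[:, None] > probs.cumsum(axis=1)).sum(axis=1)  # type 1..5
T = np.where(Z == 1, np.array([1, 2, 3, 1, 2])[S_i - 1],
                     np.array([1, 2, 3, 3, 3])[S_i - 1])

def cond_prob(t, z):
    """P(T = t | Z = z, X = x_i) by cell means (X is binary, so this is exact)."""
    out = np.empty(n)
    for x in (0, 1):
        out[X == x] = np.mean(T[(Z == z) & (X == x)] == t)
    return out

# Theorem 1: p_{t1,1} = P(S = s4) = E[P(T=t1|Z=z1,X) - P(T=t1|Z=z2,X)].
p_t1_1 = np.mean(cond_prob(1, 1) - cond_prob(1, 2))
# Theorem 2 with W_{t1,t1,1} = {z1}: q_{t1,1} = P(T = t1, S = s4); the weight
# pi_{z1}(X) = 1 - pz2 is known here because the DGP is simulated.
q_t1_1 = np.mean((cond_prob(1, 1) - cond_prob(1, 2)) * (1 - pz2))
```

Up to simulation error, `p_t1_1` and `q_t1_1` agree with the direct (infeasible) frequencies `np.mean(S_i == 4)` and `np.mean((S_i == 4) & (T == 1))`.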
These types of overidentifying restrictions are arguably unnecessary, since the benefit from removing impossible configurations is small compared to the cost of falsely eliminating a type that exists in the true DGP. In the next section on efficient estimation of local average structural functions, Assumption 1 is strengthened so that the parameters in Theorem 1 and Theorem 2 are exactly identified. This is not restrictive, since it is satisfied by the binary local IV model, the main example in this paper, and the examples in Heckman and Pinto (2018a). Moreover, when overidentification is present, the consistency and asymptotic normality of the proposed estimators, and their efficiency for the exactly-identified target parameter, remain valid regardless. Further discussion of this issue is beyond the scope of this paper.

This section calculates the semiparametric efficiency bounds for $\beta_{t,k}$, $\gamma_{t',t,k}$, $p_{t,k}$, and $q_{t',t,k}$, and proposes semiparametric efficient estimators. Other policy-relevant parameters that can be derived from the above quantities are also discussed afterwards. The more general case of non-smooth parameters with overidentifying constraints is considered in the next section. In this section and the next one, for simplicity, the notation for parameters, including the nuisance ones, is used to represent both the true value and a general value in the parameter space. When necessary, a superscript "$o$" is placed to signify the true value. The following theorem presents the efficiency bounds using efficient influence functions.

Theorem 3.
Consider any $t \in \mathcal{T}$, $k = 1, \cdots, N_Z$, and $t'$ satisfying the condition in Theorem 2.
(i) The semiparametric efficiency bound for $\beta_{t,k}$ is given by the variance of the efficient influence function
\[
\begin{aligned}
\Psi_{\beta_{t,k}}(Y, T, Z, X, \beta_{t,k}, p_{t,k}, I_{t,Z}, P_{t,Z}, \pi)
&= \frac{1}{p_{t,k}} \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota Y 1\{T = t\} - I_{t,Z}(X)) + I_{t,Z}(X) \big) \\
&\quad - \frac{\beta_{t,k}}{p_{t,k}} \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota 1\{T = t\} - P_{t,Z}(X)) + P_{t,Z}(X) \big),
\end{aligned} \quad (9)
\]
where $\iota$ denotes a column vector of ones, and $\zeta(Z, X, \pi)$ is a diagonal matrix with diagonal elements
\[ \left( \frac{1\{Z = z_1\}}{\pi_{z_1}(X)}, \cdots, \frac{1\{Z = z_{N_Z}\}}{\pi_{z_{N_Z}}(X)} \right). \]
(ii) The semiparametric efficiency bound for $\gamma_{t',t,k}$ is given by the variance of the efficient influence function
\[
\begin{aligned}
\Psi_{\gamma_{t',t,k}}(Y, T, Z, X, \gamma_{t',t,k}, q_{t',t,k}, I_{t,Z}, P_{t,Z}, \pi)
&= \frac{1}{q_{t',t,k}} \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota Y 1\{T = t\} - I_{t,Z}(X)) \pi_{W_{t',t,k}}(X) + I_{t,Z}(X) 1\{Z \in W_{t',t,k}\} \big) \\
&\quad - \frac{\gamma_{t',t,k}}{q_{t',t,k}} \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota 1\{T = t\} - P_{t,Z}(X)) \pi_{W_{t',t,k}}(X) + P_{t,Z}(X) 1\{Z \in W_{t',t,k}\} \big).
\end{aligned} \quad (10)
\]
(iii) The semiparametric efficiency bound for $p_{t,k}$ is given by the variance of the efficient influence function
\[ \Psi_{p_{t,k}}(T, Z, X, p_{t,k}, P_{t,Z}, \pi) = \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota 1\{T = t\} - P_{t,Z}(X)) + P_{t,Z}(X) \big) - p_{t,k}. \quad (11) \]
(iv) The semiparametric efficiency bound for $q_{t',t,k}$ is given by the variance of the efficient influence function
\[ \Psi_{q_{t',t,k}}(T, Z, X, q_{t',t,k}, P_{t,Z}, \pi) = \tilde{b}_{t,k} \big( \zeta(Z, X, \pi) (\iota 1\{T = t\} - P_{t,Z}(X)) \pi_{W_{t',t,k}}(X) + P_{t,Z}(X) 1\{Z \in W_{t',t,k}\} \big) - q_{t',t,k}. \quad (12) \]

Notice that for the case of binary treatment and binary instrument, the first two parts of Theorem 3 reduce to Theorem 2 of Hong and Nekipelov (2010a).
The structure of the efficient influence functions is interpretable in the view of Newey (1994). In $\Psi_{\beta_{t,k}}$, the terms $\tilde{b}_{t,k} \zeta(Z, X, \pi)(\iota Y 1\{T = t\} - I_{t,Z}(X))$ and $\tilde{b}_{t,k} \zeta(Z, X, \pi)(\iota 1\{T = t\} - P_{t,Z}(X))$ serve as the correction terms due to the presence of the unknown infinite-dimensional nuisance parameters $I_{t,Z}$ and $P_{t,Z}$, respectively. In $\Psi_{\gamma_{t',t,k}}$, the correction term also contains $\pi_{W_{t',t,k}}$, which accounts for the fact that $\pi$ is unknown. The derivation of this decomposition is more apparent in the proof of Theorem 4.

The program evaluation literature has been concerned with the role of the propensity score in efficient estimation. In the current context, $\pi$ represents the proper concept of the propensity score. Observations in the proof of Theorem 3 indicate that the efficiency bound of $\beta_{t,k}$ is not affected by knowledge of the propensity score. However, the $\gamma_{t',t,k}$'s can be estimated more efficiently if the propensity score is known. To be more specific, the part of the score function that corresponds to $\gamma_{t',t,k}$ explicitly involves the propensity score, and hence so does its pathwise derivative. Similarly, the efficiency in the estimation of the $q_{t',t,k}$'s, but not the $p_{t,k}$'s, is affected by knowledge of the propensity score.

For efficient estimation, one possible way is to use the CEP estimators common in the literature (Hahn, 1998; Frölich, 2007; Chen et al., 2008; Hong and Nekipelov, 2010a). The methodology is to first estimate the conditional expectations $\pi$, $P_{t,Z}$, and $I_{t,Z}$, and then use the identification results directly as moment conditions. The asymptotic linear representation of these estimators can be computed using the method developed in Newey (1994). More specifically, for $z \in \mathcal{Z}$, define two sets of conditional expectations by
\[ h_{Y,t,z}(X) = E[1\{Z = z\} Y 1\{T = t\} \mid X], \qquad h_{t,z}(X) = E[1\{Z = z\} 1\{T = t\} \mid X]. \]

Footnote: See for example Hahn (1998); Frölich (2007); Hong and Nekipelov (2010a); Chen et al. (2008).
Let $\hat{h}_{Y,t,z}$, $\hat{h}_{t,z}$, $\hat{\pi}_z$ denote nonparametric estimators, such as kernel estimators or series estimators. Notice that the conditional expectations are related by $I_{t,z} = h_{Y,t,z} / \pi_z$ and $P_{t,z} = h_{t,z} / \pi_z$ for $z \in \mathcal{Z}$; hence define $\hat{I}_{t,z} = \hat{h}_{Y,t,z} / \hat{\pi}_z$ and $\hat{P}_{t,z} = \hat{h}_{t,z} / \hat{\pi}_z$. The vector estimators $\hat{I}_{t,Z}$, $\hat{P}_{t,Z}$, $\hat{\pi}$, and the vector functions $h_{Y,t,Z}$, $h_{t,Z}$ are stacked in the obvious way. Also, let $\hat{\pi}_{W_{t',t,k}} = \sum_{z \in W_{t',t,k}} \hat{\pi}_z$. Define the estimators $\hat{p}_{t,k}$, $\hat{q}_{t',t,k}$, $\hat{\beta}_{t,k}$, and $\hat{\gamma}_{t',t,k}$ by
\[ \hat{p}_{t,k} = \frac{1}{n} \sum_{i=1}^n \tilde{b}_{t,k} \hat{P}_{t,Z}(X_i), \quad (13) \]
\[ \hat{q}_{t',t,k} = \frac{1}{n} \sum_{i=1}^n \tilde{b}_{t,k} \hat{P}_{t,Z}(X_i)\, \hat{\pi}_{W_{t',t,k}}(X_i), \quad (14) \]
\[ \hat{\beta}_{t,k} = \frac{1}{\hat{p}_{t,k}\, n} \sum_{i=1}^n \tilde{b}_{t,k} \hat{I}_{t,Z}(X_i), \quad (15) \]
\[ \hat{\gamma}_{t',t,k} = \frac{1}{\hat{q}_{t',t,k}\, n} \sum_{i=1}^n \tilde{b}_{t,k} \hat{I}_{t,Z}(X_i)\, \hat{\pi}_{W_{t',t,k}}(X_i). \quad (16) \]

The efficient influence functions also provide a way of conducting optimal joint inferences. The efficient influence function for a vector of parameters is the collection of the efficient influence functions corresponding to the parameters. The variance-covariance matrix for efficient estimators can be calculated accordingly. More concretely, let $\kappa = (\kappa_1, \cdots, \kappa_K)$ be a vector of parameters selected from the identified ones $\{\beta_{t,k}, \gamma_{t',t,k}, p_{t,k}, q_{t',t,k}\}$. Then the efficient influence function of $\kappa$ is $\Psi_\kappa = (\Psi_{\kappa_1}, \cdots, \Psi_{\kappa_K})$, and the efficiency bound is $E[\Psi_\kappa \Psi_\kappa']$. A natural plug-in estimator is
\[ \hat{V}_\kappa = \frac{1}{n} \sum_{i=1}^n \hat{\Psi}_{\kappa,i} \hat{\Psi}_{\kappa,i}', \quad (17) \]
where $\hat{\Psi}_{\kappa,i}$ is defined by plugging in the data observation of index $i$, the CEP estimates $\hat{\kappa}$, and the nonparametric estimates $\hat{I}_{t,z}$, $\hat{P}_{t,z}$, and $\hat{\pi}_z$. For example, when $\kappa = \beta_{t,k}$, then
\[ \hat{\Psi}_{\beta_{t,k},i} = \Psi_{\beta_{t,k}}(Y_i, T_i, Z_i, X_i, \hat{\beta}_{t,k}, \hat{p}_{t,k}, \hat{I}_{t,Z}, \hat{P}_{t,Z}, \hat{\pi}). \]

Footnote: The reason for using $h_{Y,t,z}$ and $h_{t,z}$ as primitive estimators, instead of $I_{t,z}$ and $P_{t,z}$, is that they are simple conditional expectations with theoretical appeal.
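Once the first-stage fits are in hand, equations (13)-(16) are simple sample averages. A minimal sketch (the function name, array layout, and the toy constant-in-$X$ numbers in the demo are my own):

```python
import numpy as np

def cep_estimates(b_tilde, I_hat, P_hat, pi_hat, W):
    """CEP estimators (13)-(16).
    b_tilde: (N_Z,) weight vector; I_hat, P_hat, pi_hat: (n, N_Z) arrays of
    first-stage fitted values at each X_i; W: column indices of W_{t',t,k}."""
    pi_W = pi_hat[:, W].sum(axis=1)                        # pi_hat_W(X_i)
    p_hat = np.mean(P_hat @ b_tilde)                       # (13)
    q_hat = np.mean((P_hat @ b_tilde) * pi_W)              # (14)
    beta_hat = np.mean(I_hat @ b_tilde) / p_hat            # (15)
    gamma_hat = np.mean((I_hat @ b_tilde) * pi_W) / q_hat  # (16)
    return p_hat, q_hat, beta_hat, gamma_hat

# Demo with toy first-stage fits that are constant in X (binary instrument,
# b_tilde = (1, -1), W = {z1}): p = 0.2, q = 0.1, beta = gamma = 2.
p_hat, q_hat, beta_hat, gamma_hat = cep_estimates(
    np.array([1.0, -1.0]),
    I_hat=np.tile([1.0, 0.6], (4, 1)),
    P_hat=np.tile([0.5, 0.3], (4, 1)),
    pi_hat=np.tile([0.5, 0.5], (4, 1)),
    W=[0])
```

In practice `I_hat`, `P_hat`, and `pi_hat` would come from kernel, series, or other nonparametric regressions satisfying the rate conditions of Theorem 4 below.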
Since the CEP estimators are efficient, their asymptotic covariance matrix can be estimated by the plug-in estimators of the efficiency bounds. This leads to optimal joint inferences, where the optimality follows from Section 25.6 of van der Vaart (1998), where it is stated that semiparametric efficiency of the estimators leads to (locally) asymptotically uniformly most powerful tests.

The following theorem summarizes the properties of the CEP estimation procedure defined above. For each $t$ and $z$, let $\mathcal{H}_{Y,t,z}$, $\mathcal{H}_{t,z}$, and $\Pi_z$ be the spaces of functions containing the true nuisance parameters $h^o_{Y,t,z}$, $h^o_{t,z}$, and $\pi^o_z$, respectively. For any small enough $\delta > 0$, let
\[ \mathcal{H}^\delta_{Y,t,z} = \big\{ h_{Y,t,z} \in \mathcal{H}_{Y,t,z} : \| h_{Y,t,z} - h^o_{Y,t,z} \| \le \delta \big\}, \]
and let $\mathcal{H}^\delta_{t,z}$ and $\Pi^\delta_z$ be defined analogously.

Theorem 4.
Consider $t \in \mathcal{T}$. For any $z \in \mathcal{Z}$, assume the following conditions hold.
(i) The convergence rates of the nonparametric estimators satisfy $\sqrt{n}\, \| \hat{h}_{Y,t,z} - h^o_{Y,t,z} \|^2 = o_p(1)$, $\sqrt{n}\, \| \hat{h}_{t,z} - h^o_{t,z} \|^2 = o_p(1)$, and $\sqrt{n}\, \| \hat{\pi}_z - \pi^o_z \|^2 = o_p(1)$.
(ii) There exists some $\delta > 0$ such that the classes $\mathcal{H}^\delta_{Y,t,z}$, $\mathcal{H}^\delta_{t,z}$, and $\Pi^\delta_z$ are Donsker, with $E\big[ \sup_{h_{Y,t,z} \in \mathcal{H}^\delta_{Y,t,z}} | h_{Y,t,z}(X) |^2 \big] < \infty$.
Then, for $k = 1, \cdots, N_Z$, the estimators $\hat{\beta}_{t,k}$, $\hat{\gamma}_{t',t,k}$, $\hat{p}_{t,k}$, and $\hat{q}_{t',t,k}$ are semiparametric efficient for $\beta_{t,k}$, $\gamma_{t',t,k}$, $p_{t,k}$, and $q_{t',t,k}$, respectively. Moreover, the plug-in estimator $\hat{V}_\kappa$ for the efficiency bound, defined in equation (17), is consistent.

Condition (i) is a standard requirement on the convergence rate of nonparametric estimators in the semiparametric two-step estimation literature (Newey, 1994; Newey and McFadden, 1994). Condition (ii) is also standard; it requires that the functional spaces containing the infinite-dimensional nuisance parameters are not too complex, so that the stochastic equicontinuity condition holds. The reason this type of "limited information" estimator works is well explained in Ackerberg et al. (2014). The estimation problem here falls into their general semiparametric model where the parameter of interest is defined by possibly overidentifying unconditional moment restrictions and the nuisance functions are defined by exactly identifying conditional moment restrictions. They showed that semiparametric two-step optimally weighted GMM estimators achieve the efficiency bound; these are the CEP estimators in this case, since the parameters of interest are exactly identified.
Discussions related to this phenomenon can also be found in Chen and Santos (2018).

Back to the main example, for $\beta_{t_1,1}$, the efficient influence function is
\[
\begin{aligned}
\Psi_{\beta_{t_1,1}} &= \frac{1}{p_{t_1,1}} \left( \frac{1\{Z = z_1\}}{P(Z = z_1 \mid X)} \big( Y 1\{T = t_1\} - E[Y 1\{T = t_1\} \mid Z = z_1, X] \big) + E[Y 1\{T = t_1\} \mid Z = z_1, X] \right) \\
&\quad - \frac{1}{p_{t_1,1}} \left( \frac{1\{Z = z_2\}}{P(Z = z_2 \mid X)} \big( Y 1\{T = t_1\} - E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) + E[Y 1\{T = t_1\} \mid Z = z_2, X] \right) \\
&\quad - \frac{\beta_{t_1,1}}{p_{t_1,1}} \left( \frac{1\{Z = z_1\}}{P(Z = z_1 \mid X)} \big( 1\{T = t_1\} - E[1\{T = t_1\} \mid Z = z_1, X] \big) + E[1\{T = t_1\} \mid Z = z_1, X] \right) \\
&\quad + \frac{\beta_{t_1,1}}{p_{t_1,1}} \left( \frac{1\{Z = z_2\}}{P(Z = z_2 \mid X)} \big( 1\{T = t_1\} - E[1\{T = t_1\} \mid Z = z_2, X] \big) + E[1\{T = t_1\} \mid Z = z_2, X] \right),
\end{aligned}
\]
and the CEP estimator is
\[ \hat{\beta}_{t_1,1} = \frac{ \sum_{i=1}^n \big( \hat{h}_{Y,t_1,z_1}(X_i) / \hat{\pi}_{z_1}(X_i) \big) - \big( \hat{h}_{Y,t_1,z_2}(X_i) / \hat{\pi}_{z_2}(X_i) \big) }{ \sum_{i=1}^n \big( \hat{h}_{t_1,z_1}(X_i) / \hat{\pi}_{z_1}(X_i) \big) - \big( \hat{h}_{t_1,z_2}(X_i) / \hat{\pi}_{z_2}(X_i) \big) }. \]

Besides the efficient estimation results shown above, it might also be of interest to efficiently estimate other policy-relevant parameters whose identification can be derived from the aforementioned parameters $(\beta_{t,k}, \gamma_{t',t,k}, p_{t,k}, q_{t',t,k})$. Some examples are discussed here. The ratio $q_{t',t,k} / p_{t,k} = P(T = t' \mid S \in \Sigma_{t,k})$ can be understood as the conditional probability of taking treatment $t'$ given that the type $S$ belongs to $\Sigma_{t,k}$. The average structural function local to the subpopulation whose type $S$ belongs to any of the $\Sigma_{t,k}$, $k = 1, \cdots, N_Z - 1$, is
\[ \beta_t \equiv E[Y_t \mid S \in \Sigma_t] = \frac{ \sum_{k=1}^{N_Z - 1} \beta_{t,k}\, p_{t,k} }{ \sum_{k=1}^{N_Z - 1} p_{t,k} }, \quad (18) \]
where $\Sigma_t = \bigcup_{k=1}^{N_Z - 1} \Sigma_{t,k}$ is referred to as the set of $t$-switchers in Heckman and Pinto (2018a), meaning that individuals in this subpopulation switch between $t$ and other treatments when given different levels of the instrument.
It is a generalization of the concept of compliers in the binary local IV framework. Similarly, one can also define
\[ \gamma_t \equiv E[Y_t \mid T = t, S \in \Sigma_t] = \frac{ \sum_{k=1}^{N_Z - 1} \gamma_{t,k}\, q_{t,k} }{ \sum_{k=1}^{N_Z - 1} q_{t,k} }, \quad (19) \]
which represents the average structural function local to the subpopulation of $t$-treated $t$-switchers. By Theorem 2, this parameter is always identified. Some treatment effects can also be identified and estimated through the parameters discussed in this section. This point is illustrated using the main example, which also appears in Heckman and Pinto (2018a). Consider the quantity
\[ \beta_{t_3,1} - \frac{ \beta_{t_1,1}\, p_{t_1,1} + \beta_{t_2,1}\, p_{t_2,1} }{ p_{t_1,1} + p_{t_2,1} } = \frac{ E[Y_{t_3} - Y_{t_1} \mid S = s_4] P(S = s_4) + E[Y_{t_3} - Y_{t_2} \mid S = s_5] P(S = s_5) }{ P(S \in \{s_4, s_5\}) }, \quad (20) \]
which represents the average treatment effect of $t_3$ against the other treatments within the subpopulation of $t_3$-switchers. Analogously, the quantity
\[ \gamma_{t_3,1} - \frac{ \gamma_{t_3,t_1,1}\, q_{t_3,t_1,1} + \gamma_{t_3,t_2,1}\, q_{t_3,t_2,1} }{ q_{t_3,t_1,1} + q_{t_3,t_2,1} } = \frac{ E[Y_{t_3} - Y_{t_1} \mid T = t_3, S = s_4] P(T = t_3, S = s_4) + E[Y_{t_3} - Y_{t_2} \mid T = t_3, S = s_5] P(T = t_3, S = s_5) }{ P(T = t_3, S \in \{s_4, s_5\}) } \quad (21) \]
can be understood as the average treatment effect of $t_3$ against the other treatments within the subpopulation of $t_3$-treated $t_3$-switchers.

More generally, let $\phi = \phi(p, q, \beta, \gamma)$ be a finite-dimensional parameter, where $\phi(\cdot)$ is a known continuously differentiable function, $p$ is the vector containing all identifiable $p_{t,k}$'s, and $q, \beta, \gamma$ are defined analogously. A natural estimator can be defined through the CEP estimates, $\phi(\hat{p}, \hat{q}, \hat{\beta}, \hat{\gamma})$. A delta method argument helps calculate the efficiency bound of $\phi$ and show the efficiency of $\phi(\hat{p}, \hat{q}, \hat{\beta}, \hat{\gamma})$. In fact, following Theorem 25.47 of van der Vaart (1998), and Theorems 3 and 4, the corollary below is immediate, which, in particular, solves the issue of efficient estimation for the several examples illustrated above.

Corollary 1.
The semiparametric efficiency bound for $\varphi$ is given by the variance of the efficient influence function
$$
\Psi_\varphi=\sum_{p\in\mathbf p}\frac{\partial\varphi}{\partial p}\Psi_p+\sum_{q\in\mathbf q}\frac{\partial\varphi}{\partial q}\Psi_q+\sum_{\beta\in\boldsymbol\beta}\frac{\partial\varphi}{\partial\beta}\Psi_\beta+\sum_{\gamma\in\boldsymbol\gamma}\frac{\partial\varphi}{\partial\gamma}\Psi_\gamma,\tag{22}
$$
where the partial derivatives are evaluated at the true parameter values. Moreover, the plug-in estimator $\varphi(\hat p,\hat q,\hat\beta,\hat\gamma)$, based on the CEP estimators $\hat p,\hat q,\hat\beta,\hat\gamma$, achieves the efficiency bound.

The role of the efficient influence functions discussed above is mainly in calculating the efficiency bounds. They can also be used to generate a collection of moment conditions that achieve efficient estimation directly. These moment conditions possess the feature that the first-step estimation of the nuisance functions does not affect the asymptotic variance, which is straightforward to verify using Proposition 3 in Newey (1994). This feature can be further exploited in the DML methodology, which is suitable in high-dimensional settings where the Donsker property in Condition 4(ii) can no longer be satisfied. More formally, the efficient influence function satisfies the Neyman orthogonality condition, which means reduced sensitivity with respect to the nuisance parameters, the $I_{t,Z}$'s, the $P_{t,Z}$'s, and $\pi$. Together with appropriate data-splitting methods, moment estimators constructed from Neyman-orthogonal moment conditions are often employed in data-rich environments where the nuisance parameters are "highly complex", e.g., where the dimension of the covariates $X$ grows with the sample size $n$. Here I explain how to implement, in this specific setting, the DML method introduced in Chernozhukov et al. (2018) to efficiently estimate $\beta_{t,k}$ when the dimension of $X$ is larger than the sample size. The cross-fitting method starts by taking an $L$-fold random partition of the data such that the size of each fold is $n/L$. Then, for $l=1,\cdots,L$, let $I_l$ denote the observation indices in the $l$-th fold and $I_l^c=\bigcup_{l'\neq l}I_{l'}$.
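The cross-fitting scheme just described can be sketched end-to-end in a simplified binary-instrument case. Ordinary least squares stands in for the machine-learning first stage, the DGP and fold count are illustrative only, and the orthogonal moment used is the standard doubly robust moment for a local average effect, structurally analogous to the construction in the text rather than a transcription of it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 6000, 5
X = rng.normal(size=(n, 3))                       # covariates
Z = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # instrument propensity depends on X
S = rng.choice([0, 1, 2], n, p=[0.2, 0.5, 0.3])   # always / complier / never types
T = np.where(S == 0, 1, np.where(S == 1, Z, 0))
Y = 1.0 + 1.0 * T + X[:, 0] + rng.normal(size=n)  # true local effect = 1

def fit_lin(W, V):
    """Least-squares learner, a stand-in for any ML first-stage estimator."""
    A = np.column_stack([np.ones(len(W)), W])
    coef, *_ = np.linalg.lstsq(A, V, rcond=None)
    return lambda Wn: np.column_stack([np.ones(len(Wn)), Wn]) @ coef

folds = rng.permutation(n) % L                    # random L-fold partition
num = den = 0.0
for l in range(L):
    tr, te = folds != l, folds == l               # nuisances fit on I_l^c, used on I_l
    piz = np.clip(fit_lin(X[tr], Z[tr])(X[te]), 0.05, 0.95)
    hY = {z: fit_lin(X[tr & (Z == z)], Y[tr & (Z == z)])(X[te]) for z in (0, 1)}
    hT = {z: fit_lin(X[tr & (Z == z)], T[tr & (Z == z)])(X[te]) for z in (0, 1)}
    w1, w0 = Z[te] / piz, (1 - Z[te]) / (1 - piz)
    num += np.sum(w1 * (Y[te] - hY[1]) + hY[1] - (w0 * (Y[te] - hY[0]) + hY[0]))
    den += np.sum(w1 * (T[te] - hT[1]) + hT[1] - (w0 * (T[te] - hT[0]) + hT[0]))
beta_dml = num / den                              # DML2: one ratio over all folds
```

Note the DML2 feature: the orthogonalized numerator and denominator are accumulated across all folds first, and a single ratio is taken at the end.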
Also, define $\check I^l_{t,Z}$, $\check P^l_{t,Z}$, and $\check\pi^l$ to be the nonparametric machine learning estimates using the data from $i\in I_l^c$. The associated moment condition is based on equation (9), namely
$$
E\Big[\tilde b_{t,k}\Big(\zeta(Z,X,\pi)\big(\iota\,(Y\,1\{T=t\})-I_{t,Z}(X)\big)+I_{t,Z}(X)\Big)-\beta_{t,k}\,\tilde b_{t,k}\Big(\zeta(Z,X,\pi)\big(\iota\,1\{T=t\}-P_{t,Z}(X)\big)+P_{t,Z}(X)\Big)\Big]=0.\tag{23}
$$
(I am grateful to Kaspar Wuthrich for suggesting this. Works on this topic include Belloni et al. (2014, 2017) and Chernozhukov et al. (2018). The cases of estimating $\gamma_{t',t,k}$, $p_{t,k}$, and $q_{t',t,k}$ are essentially the same and are thus omitted for brevity.) The DML estimator $\check\beta_{t,k}$ is defined by
$$
\check\beta_{t,k}=\frac{\sum_{l=1}^{L}\sum_{i\in I_l}\tilde b_{t,k}\Big(\zeta(Z_i,X_i,\check\pi^l)\big(\iota\,(Y_i\,1\{T_i=t\})-\check I^l_{t,Z}(X_i)\big)+\check I^l_{t,Z}(X_i)\Big)}{\sum_{l=1}^{L}\sum_{i\in I_l}\tilde b_{t,k}\Big(\zeta(Z_i,X_i,\check\pi^l)\big(\iota\,1\{T_i=t\}-\check P^l_{t,Z}(X_i)\big)+\check P^l_{t,Z}(X_i)\Big)}.\tag{24}
$$
This is the DML2 estimator defined in Chernozhukov et al. (2018) with $L$-fold cross-fitting. There is another estimation procedure, called the DML1 estimator, in Chernozhukov et al. (2018); it is not discussed here since DML1 and DML2 are asymptotically equivalent and DML2 is generally recommended by the authors. The variance estimator is given by
$$
\check V_{\beta_{t,k}}=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_l}\Big(\Psi_{\beta_{t,k}}\big(Y_i,T_i,Z_i,X_i,\check\beta_{t,k},\check p_{t,k},\check I^l_{t,Z},\check P^l_{t,Z},\check\pi^l\big)\Big)^2,\tag{25}
$$
where
$$
\check p_{t,k}=\frac{1}{n}\sum_{l=1}^{L}\sum_{i\in I_l}\tilde b_{t,k}\Big(\zeta(Z_i,X_i,\check\pi^l)\big(\iota\,1\{T_i=t\}-\check P^l_{t,Z}(X_i)\big)+\check P^l_{t,Z}(X_i)\Big).\tag{26}
$$

Theorem 5.
Let $\delta_n\geq n^{-1/2}$ and $\Delta_n$ be sequences of positive constants approaching zero. Also, let $C>0$ and $q>2$ be fixed constants, and let $L\geq 2$ be a fixed integer. Assume the following conditions hold for any joint distribution $P\in\mathcal P$ of the quadruple $(Y,T,Z,X)$.
(i) The variance bound for $\beta_{t,k}$ calculated in Theorem 3 is strictly positive.
(ii) $\max\big\{\|I^o_{t,Z}\|_q,\ \|\iota\,Y\,1\{T=t\}-I^o_{t,Z}\|_q\big\}\leq C$.
(iii) With probability no less than $1-\Delta_n$,
$$
\max\big\{\|\check I_{t,Z}-I^o_{t,Z}\|_q,\ \|\check P_{t,Z}-P^o_{t,Z}\|_q,\ \|\check\pi-\pi^o\|_q\big\}\leq C,\qquad \max\big\{\|\check I_{t,Z}-I^o_{t,Z}\|_2,\ \|\check P_{t,Z}-P^o_{t,Z}\|_2,\ \|\check\pi-\pi^o\|_2\big\}\leq\delta_n,
$$
and, for any $z\in\mathcal Z$, $\check\pi_z$ is bounded below by a positive constant and
$$
\|\check\pi_z-\pi^o_z\|_2\times\big(\|\check I_{t,z}-I^o_{t,z}\|_2+\|\check P_{t,z}-P^o_{t,z}\|_2\big)\leq n^{-1/2}\delta_n.
$$
Then the estimator $\check\beta_{t,k}$ obeys
$$
V^{-1/2}_{\beta_{t,k}}\,\sqrt n\,\big(\check\beta_{t,k}-\beta_{t,k}\big)\Rightarrow N(0,1)\tag{27}
$$
uniformly over $P\in\mathcal P$, where $V_{\beta_{t,k}}=E\big[\Psi^2_{\beta_{t,k}}(Y,T,Z,X,\beta^o_{t,k},p^o_{t,k},I^o_{t,Z},P^o_{t,Z},\pi^o)\big]$. Moreover, the results continue to hold when $V_{\beta_{t,k}}$ is replaced by $\check V_{\beta_{t,k}}$.

Since the convergence in (27) is uniform over $\mathcal P$, it can be used for the standard construction of uniformly valid confidence regions.

The previous section is on the efficient estimation of average structural functions within certain identifiable subpopulations. More generally, the parameter of interest could be defined through non-smooth and overidentifying moment conditions.
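As a toy preview of what such moment conditions look like, the following sketch estimates a scalar parameter from an overidentified pair of non-smooth moments (a mean condition and a median condition), using an identity-weighted first step, an optimally weighted second step, and a numerically differentiated Jacobian, in the spirit of the procedure detailed later in this section. All data here are simulated and illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(2.0, 1.0, 4000)            # true mean = true median = 2

def m(eta):
    """Stacked moments m(Y*, eta): a mean and a (non-smooth) median condition."""
    return np.column_stack([Y - eta, (Y <= eta) - 0.5])

def G(eta):                                # sample moment vector G_n(eta)
    return m(eta).mean(axis=0)

grid = np.linspace(0.0, 4.0, 2001)
# step 1: identity-weighted GMM by grid search (criterion may be discontinuous)
eta1 = grid[np.argmin([G(e) @ G(e) for e in grid])]
Vhat = np.cov(m(eta1).T)                   # covariance of the moment vector
Vinv = np.linalg.inv(Vhat)
# step 2: optimally weighted GMM
eta_hat = grid[np.argmin([G(e) @ Vinv @ G(e) for e in grid])]
# Jacobian by central numerical differences (step eps_n -> 0, eps_n * sqrt(n) -> inf)
eps = 0.05
Gamma = ((G(eta_hat + eps) - G(eta_hat - eps)) / (2 * eps)).reshape(2, 1)
se = np.sqrt(np.linalg.inv(Gamma.T @ Vinv @ Gamma)[0, 0] / len(Y))
```

Because the median moment is a step function, grid search replaces derivative-based optimization; the numerical Jacobian avoids differentiating the non-smooth moment directly.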
One example is quantile estimation. Another case of interest is when the underlying economic theory provides overidentifying constraints on the quantities of interest, which is very possible in the current framework with multiple levels of treatment and instrument.

Define a set of random variables $\mathcal Y^*=\{Y^*_{t,k}:t\in\mathcal T,\ k=1,\cdots,N_Z\}$ such that each $Y^*_{t,k}$ has the same marginal distribution as $Y_t\mid S\in\Sigma_{t,k}$; their joint distribution is irrelevant. Let $Y^*=(Y^*_1,\cdots,Y^*_J)'$ for $Y^*_j\in\mathcal Y^*$. Also, for notational convenience, use $j$ rather than $(t,k)$ to label $t_j$, $p_j$, and $\tilde b_j$, according to the ordering of the $Y^*_{t,k}$'s within the random vector $Y^*$. Let the parameter of interest be $\eta$, in the interior of $\Lambda\subset\mathbb R^{d_\eta}$, where $d_\eta\leq J$. The true value $\eta^o$ of the parameter satisfies the moment condition $E[m(Y^*,\eta^o)]=0$, where $m$ is of the form $m(Y^*,\eta)\equiv(m_1(Y^*_1,\eta),\cdots,m_J(Y^*_J,\eta))'$. In general, $m$ is allowed to be non-differentiable with respect to $\eta$. Since the vector $\eta$ appears in each $m_j$, restrictions are allowed both within and across the different subpopulation-conditional distributions. Another interesting feature of this specification is that the moment conditions are defined for random variables that are not directly observed. The following theorem provides the semiparametric efficiency bound for estimating $\eta$. Note that Assumption 1 suffices for deriving the following results, since the marginal distributions of the $Y^*_{t,k}$'s are always exactly identified. Let $m_Z=(m_{1,t_1,Z},\cdots,m_{J,t_J,Z})'$, where $m_{j,t_j,Z}(X,\eta)$ stacks $m_{j,t_j,z}(X,\eta)$ over $z\in\mathcal Z$ and $m_{j,t_j,z}(X,\eta)=E[m_j(Y,\eta)\,1\{T=t_j\}\mid Z=z,X]$.

Theorem 6.
Assume the following conditions hold.
(i) For any $1\leq j_1\leq\cdots\leq j_{d_\eta}\leq J$, the subvector of moments $E\big[(m_{j_1}(Y^*_{j_1},\eta),\cdots,m_{j_{d_\eta}}(Y^*_{j_{d_\eta}},\eta))'\big]$ is zero if and only if $\eta=\eta^o$.
(ii) $E\big[\|m(Y^*,\eta)\|^2\big]<\infty$ for all $\eta\in\Lambda$.
(iii) For each $j$ and $z$, $E[m^o_{j,t_j,z}(X,\eta)]$ is differentiable in some neighborhood of $\eta^o$, with the derivative continuous at $\eta^o$. Let $\Gamma$ be the $J\times d_\eta$ matrix whose $j$-th row is $\tilde b_j\,\frac{\partial}{\partial\eta}E[m^o_{j,t_j,Z}(X,\eta)]\big|_{\eta=\eta^o}$, and assume that $\Gamma$ has full column rank.
Then, for the estimation of $\eta$, the efficient influence function is
$$
\Psi_\eta(Y,T,Z,X,\eta^o,m^o_Z,\pi^o)=-\big(\Gamma'V^{-1}\Gamma\big)^{-1}\Gamma'V^{-1}\,\Psi_m(Y,T,Z,X,\eta^o,m^o_Z,\pi^o),\tag{29}
$$
where $V=\operatorname{Var}\big(\Psi_m(Y,T,Z,X,\eta^o,m^o_Z,\pi^o)\big)$ and $\Psi_m(Y,T,Z,X,\eta^o,m^o_Z,\pi^o)$ is a $J\times 1$ random vector whose $j$-th element is
$$
\tilde b_j\Big(\zeta(Z,X,\pi^o)\big(\iota\,(m_j(Y,\eta^o)\,1\{T=t_j\})-m^o_{j,t_j,Z}(X,\eta^o)\big)+m^o_{j,t_j,Z}(X,\eta^o)\Big).\tag{30}
$$
Thus the semiparametric efficiency bound is $(\Gamma'V^{-1}\Gamma)^{-1}$.

Note that if, for example, $Y^*=Y^*_{t,k}$ and $m(Y^*_{t,k},\eta)=Y^*_{t,k}-\eta$, then $\eta=\beta_{t,k}$, and the efficiency bound shown above reduces to the one computed in Theorem 3(i). If $T=Z$, that is, under unconfoundedness, the above result reduces to Theorem 1 in Cattaneo (2010). The efficiency bound can be achieved by estimators in the same spirit as the EIFE proposed in Cattaneo (2010); essentially, this is the optimally-weighted GMM estimator based on the moment conditions obtained from the efficient influence function $\Psi_m$. Let the criterion function be
$$
G_n(\eta,\pi,m_Z)=\frac{1}{n}\sum_{i=1}^{n}\Psi_m(Y_i,T_i,Z_i,X_i,\eta,\pi,m_Z),\tag{31}
$$
whose probability limit is
$$
G(\eta,\pi,m_Z)=E[\Psi_m(Y,T,Z,X,\eta,\pi,m_Z)].\tag{32}
$$
The main difficulty is that $G_n(\cdot,\pi,m_Z)$ could potentially be discontinuous, since we allow $m(Y^*,\cdot)$ to be discontinuous. To deal with this issue, the method developed in Chen et al.
(2003) is employed, which allows the criterion function to violate standard smoothness conditions while simultaneously depending on nonparametric estimators. (Cattaneo (2010) instead uses the theory of Pakes and Pollard (1989). However, the general theory of Chen et al. (2003) is more straightforward to apply in this case, since it explicitly allows for infinite-dimensional nuisance parameters that can depend on the parameters to be estimated.)

The implementation procedure is as follows. Let $\hat\pi$ and $\hat m_Z$ be nonparametric estimators. One first finds a consistent GMM estimator $\tilde\eta$ using the identity matrix as the (non-optimal) weighting matrix, i.e.,
$$
\|G_n(\tilde\eta,\hat\pi,\hat m_Z)\|\leq\inf_{\eta\in\Lambda}\|G_n(\eta,\hat\pi,\hat m_Z)\|+o_p(1).\tag{33}
$$
Next, this estimate is used to form a consistent estimator $\hat V$ of the covariance matrix of $\Psi_m$:
$$
\hat V=\frac{1}{n}\sum_{i=1}^{n}\Psi_m(Y_i,T_i,Z_i,X_i,\tilde\eta,\hat\pi,\hat m_Z)\,\Psi_m(Y_i,T_i,Z_i,X_i,\tilde\eta,\hat\pi,\hat m_Z)'.\tag{34}
$$
Then $\hat\eta$ is defined as the optimally-weighted GMM estimator
$$
\hat\eta=\arg\min_{\eta\in\Lambda}\ G_n(\eta,\hat\pi,\hat m_Z)'\,\hat V^{-1}\,G_n(\eta,\hat\pi,\hat m_Z).\tag{35}
$$
Lastly, for estimating the asymptotic variance of $\hat\eta$, one can estimate $\Gamma$ using numerical derivatives as in Newey and McFadden (1994). Let $\varepsilon_n$ be a positive sequence such that $\varepsilon_n\to 0$ and $\varepsilon_n\sqrt n\to\infty$. Define the $J\times d_\eta$ matrix estimator by
$$
\hat\Gamma_{jl}=\frac{1}{2\varepsilon_n}\,\tilde b_j\left(\frac{1}{n}\sum_{i=1}^{n}\hat m_{j,t_j,Z}(X_i,\hat\eta+\varepsilon_n e_l)-\frac{1}{n}\sum_{i=1}^{n}\hat m_{j,t_j,Z}(X_i,\hat\eta-\varepsilon_n e_l)\right),\tag{36}
$$
where $e_l\in\mathbb R^{d_\eta}$ is the vector whose $l$-th element is 1 and whose other entries are 0. The following theorem summarizes the asymptotic properties of this estimation procedure. For each $j$ and $z$, let $\mathcal M_{j,z}$ be a vector space of real-valued functions on $\mathcal X\times\Lambda$, endowed with the sup-norm, containing $m^o_{j,t_j,z}$. For any small enough $\delta>0$, let $\mathcal M^\delta_{j,z}=\big\{m_{j,t_j,z}\in\mathcal M_{j,z}:\|m_{j,t_j,z}-m^o_{j,t_j,z}\|_\infty\leq\delta\big\}$.

Theorem 7.
Let the conditions in Theorem 6 hold. Further assume that, for each $j$ and $z$:
(i) $\Lambda$ is compact and $\eta^o\in\operatorname{int}(\Lambda)$.
(ii) The convergence rates of the nonparametric estimators satisfy $\|\hat\pi_z-\pi^o_z\|_\infty=o_p(n^{-1/4})$ and $\|\hat m_{j,t_j,z}-m^o_{j,t_j,z}\|_\infty=o_p(n^{-1/4})$.
(iii) The classes $\{m_{j,t_j,z}(\cdot,\eta):\eta\in\Lambda,\ m_{j,t_j,z}\in\mathcal M^\delta_{j,z}\}$ and $\Pi_z$ are Glivenko-Cantelli.
(iv) For some $\delta>0$, the classes $\{m_{j,t_j,z}(\cdot,\eta):\eta\in\Lambda,\ \|\eta-\eta^o\|\leq\delta,\ m_{j,t_j,z}\in\mathcal M_{j,z},\ \|m_{j,t_j,z}-m^o_{j,t_j,z}\|_\infty\leq\delta\}$ and $\Pi^\delta_z$ are Donsker.
(v) $E\big[\sup_{\eta\in\Lambda,\,m_{j,t_j,z}\in\mathcal M_{j,z}}|m_{j,t_j,z}(X,\eta)|^2\big]<\infty$.
(vi) $E[m_{j,t_j,z}(X,\cdot)]$ is continuous for any $m_{j,t_j,z}\in\mathcal M^\delta_{j,z}$.
Then $\hat V$, $\hat\eta$, and $\hat\Gamma$ are consistent, and $\sqrt n(\hat\eta-\eta^o)\Rightarrow N\big(0,(\Gamma'V^{-1}\Gamma)^{-1}\big)$.

As explained in the previous section, asymptotically optimal inference on joint hypotheses about $\eta$ can be conducted based on this result. A possible application of the non-smooth GMM methodology developed here is illustrated below using the main example. The set $\mathcal Y^*$ consists of the six random variables $Y^*_{t_j,k}$, $j=1,2,3$, $k=1,2$, where each $Y^*_{t_j,k}$ is equal in distribution to $Y_{t_j}\mid S\in\Sigma_{t_j,k}$; here $\Sigma_{t_3,1}$ contains the two $t_3$-switcher types, while the remaining partition sets are singletons. The parameter of interest $\eta$ could be defined by, say, the following moment conditions:
$$
E[Y^*_{t_1,1}-\eta]=E[Y^*_{t_2,1}-\eta]=E\big[1\{Y^*_{t_1,1}\leq\eta\}-0.5\big]=E\big[1\{Y^*_{t_2,1}\leq\eta\}-0.5\big]=0.
$$
This means that $Y^*_{t_1,1}$ and $Y^*_{t_2,1}$ have the same mean and the same median, both of which equal $\eta$. Note that both within- and cross-type restrictions are contained in this example.

Before ending this section, it is worth mentioning that the set $\mathcal Y^*$ can be extended to include more random variables whose marginal distributions are identified; $Y_t\mid T=t,S\in\Sigma_{t,k}$ and $Y_t\mid S\in\Sigma_t$ are such examples. Similar arguments go through for efficient estimation, and the details are not repeated here.

Optimal Testable Implication of Model Assumption
In practice, one should check the validity of the model assumptions before proceeding to estimation. As discussed previously, overidentifying restrictions may exist, depending on the type configuration $\mathcal S$. This section discusses a systematic approach to generating testable implications of Assumption 1, even when no overidentifying type restriction exists. Based on the work of Kitagawa (2015) and Sun (2018), an obvious generalization would be the following set of conditions: for any $t\in\mathcal T$, $k=1,\cdots,N_Z$, and measurable $B\subset\mathcal Y$,
$$
\tilde b_{t,k}\,I^B_{t,Z}(X)=P\big(Y_t\in B,\ S\in\Sigma_{t,k}\mid X\big)\in[0,1]\quad\text{a.s.},\tag{37}
$$
where $I^B_{t,Z}(X)$ is defined as $I_{t,Z}(X)$ with $Y$ replaced by $1_B(Y)$, the indicator function of $B$. This means that the identifiable parts of the joint distribution of the potential outcomes and the type should form a proper probability. However, more can be tested. For instance, both $P(S=s')$ and $P(S\in\{s',s''\})$ are identified in the main example, and the former should be no greater than the latter. In fact, this intuition can be developed into a set of high-level conditions optimal for testing Assumption 1, including but not limited to the implications (37) and the type-configuration overidentifying conditions discussed in Section 3. Let $\boldsymbol\Sigma_t=\{\Sigma_{t,k}:k=1,\cdots,N_Z\}$. For any $t\in\mathcal T$, define a function $Q_t:\mathcal X\times(\mathcal B_{\mathcal Y}\times\boldsymbol\Sigma_t)\to\mathbb R$ by
$$
Q_t\big(X,(B,\Sigma_{t,k})\big)=\tilde b_{t,k}\,I^B_{t,Z}(X).\tag{38}
$$

Assumption 2.
There exist $N_T$ functions $\tilde Q_t:\mathcal X\times\big(\mathcal B_{\mathcal Y}\times 2^{\mathcal S}\big)\to\mathbb R$ such that, for all $t,t'\in\mathcal T$:
(i) $\tilde Q_t$ is a probability kernel;
(ii) $\tilde Q_t(\cdot,(\mathcal Y,\Sigma))=\tilde Q_{t'}(\cdot,(\mathcal Y,\Sigma))$ for all $\Sigma\subset\mathcal S$;
(iii) $\tilde Q_t(\cdot,(B,\Sigma))=Q_t(\cdot,(B,\Sigma))$ for all measurable $B\subset\mathcal Y$ and all $\Sigma\in\boldsymbol\Sigma_t$.

The probability kernel $\tilde Q_t$ represents the unidentified joint distribution of $Y_t$ and $S$ given $X$, i.e., $\tilde Q_t(X,(B,\Sigma))=P(Y_t\in B,S\in\Sigma\mid X)$. The second condition in Assumption 2 ensures that $P(S\in\Sigma\mid X)$ is well-defined, while the third condition assigns $\tilde Q_t$ its identified value whenever possible. The overidentification constraints on the type configuration $\mathcal S$, e.g., equation (8), result from conditions (ii) and (iii); the constraints defined by equation (37) follow from conditions (i) and (iii). The optimality of Assumption 2 for testing is explained by the following theorem. Let $\bar L$ be the underlying, unidentified joint probability of $\big(\{Y_t:t\in\mathcal T\},\{T_z:z\in\mathcal Z\},Z,X\big)$, and let $L$ be the observed joint probability of $(Y,T,Z,X)$. Each $\bar L$ induces an $L$, but not the other way around. Denote the mapping from $\bar L$ to $L$ by $O$.

Theorem 8.
The following relationships hold between Assumption 1 on $\bar L$ and Assumption 2 on $L$.
(i) If $\bar L$ satisfies Assumption 1, then $O(\bar L)$ satisfies Assumption 2. If $L$ satisfies Assumption 2, then there exists an $\bar L$ that violates Assumption 1 with $O(\bar L)=L$.
(ii) If $L$ satisfies Assumption 2, then there exists an $\bar L$ that satisfies Assumption 1 with $O(\bar L)=L$. Therefore, if $C$ is another condition on $L$ such that $\bar L$ satisfying Assumption 1 implies that $O(\bar L)$ satisfies condition $C$, then $L$ satisfying Assumption 2 implies that $L$ satisfies condition $C$.

Part (i) of the theorem establishes that Assumption 2 is a necessary but not sufficient condition for Assumption 1. Part (ii) establishes that Assumption 2 is the optimal testable implication of Assumption 1, in the sense that any testable implication of Assumption 1 is implied by Assumption 2. Also note that Assumption 1 requires not only unordered monotonicity but also the full specification of $\mathcal S$. From this theorem and its proof, the idea of optimal testable implications of a nonverifiable hypothesis becomes clear, which is another contribution of the paper. The set of conditions presented in Kitagawa (2015) is indeed a special case, whose simplicity is specific to the binary setting.

This section again ends with an illustration using the main example, with instrument values $z_0,z_1$ and with $s'$ and $s''$ denoting the two switcher types. Define
$$
Q_{t_1}\big(X,(B,\Sigma)\big)=\begin{cases}P(Y\in B,T=t_1\mid Z=z_1,X)-P(Y\in B,T=t_1\mid Z=z_0,X),&\text{if }\Sigma=\{s'\},\\[2pt]P(Y\in B,T=t_1\mid Z=z_0,X),&\text{if }\Sigma=\Sigma_{t_1,2},\end{cases}
$$
$$
Q_{t_2}\big(X,(B,\Sigma)\big)=\begin{cases}P(Y\in B,T=t_2\mid Z=z_1,X)-P(Y\in B,T=t_2\mid Z=z_0,X),&\text{if }\Sigma=\{s''\},\\[2pt]P(Y\in B,T=t_2\mid Z=z_0,X),&\text{if }\Sigma=\Sigma_{t_2,2},\end{cases}
$$
$$
Q_{t_3}\big(X,(B,\Sigma)\big)=\begin{cases}P(Y\in B,T=t_3\mid Z=z_0,X)-P(Y\in B,T=t_3\mid Z=z_1,X),&\text{if }\Sigma=\{s',s''\},\\[2pt]P(Y\in B,T=t_3\mid Z=z_1,X),&\text{if }\Sigma=\Sigma_{t_3,2}.\end{cases}
$$
By equation (8),
$$
Q_{t_1}\big(\cdot,(\mathcal Y,\{s'\})\big)+Q_{t_2}\big(\cdot,(\mathcal Y,\{s''\})\big)=Q_{t_3}\big(\cdot,(\mathcal Y,\{s',s''\})\big).
$$
Also note that $Q_{t_1}$ on $\Sigma_{t_1,2}$, $Q_{t_2}$ on $\Sigma_{t_2,2}$, and $Q_{t_3}$ on $\Sigma_{t_3,2}$ are already between 0 and 1.
And $Q_{t_1}$ on $\{s'\}$, $Q_{t_2}$ on $\{s''\}$, and $Q_{t_3}$ on $\{s',s''\}$ are already below 1. The remaining restrictions in this example hence reduce to
$$
P(Y\in B,T=t_1\mid Z=z_1,X)\geq P(Y\in B,T=t_1\mid Z=z_0,X),
$$
$$
P(Y\in B,T=t_2\mid Z=z_1,X)\geq P(Y\in B,T=t_2\mid Z=z_0,X),
$$
$$
P(Y\in B,T=t_3\mid Z=z_0,X)\geq P(Y\in B,T=t_3\mid Z=z_1,X).
$$
These inequalities are very similar in form to the conditions in Kitagawa (2015). This simplicity is due to the fact that $2^{\mathcal S}$ equals the algebra generated by the $\boldsymbol\Sigma_t$'s and no overidentifying restriction exists. It is possible to generalize the variance-weighted Kolmogorov-Smirnov test proposed in Kitagawa (2015) and the power-improved test proposed in Sun (2018); their implementation is beyond the scope of this study.

Empirical Application

In this section, the estimation methods discussed in the paper are applied to study the return to schooling, using proximity to college as an instrument (Card, 1993). The data come from the National Longitudinal Survey of the original cohort of young men, observed in 1966 and 1976. It is shown in Kitagawa (2015) that the instrument is only valid after conditioning on covariates such as race and region of residence, making this example appropriate for illustrating the usefulness of including conditioning covariates in the framework, which is a step forward of the current paper relative to Heckman and Pinto (2018a) and Pinto (2019).

The model is the same as the main example, and the variables are as follows. The outcome $Y$ is the log of weekly earnings in 1976. The instrument is a binary $Z$ indicating whether a four-year local college was present for the individual in 1966. It is relevant because the presence of a college nearby exposes the individual to college in early life and reduces the cost of receiving some college education; it is valid because the individual's ability is presumed to be independent of the place of residence, given race and the extent to which the region is developed.
The treatment $T$ describes the education received by the individual up to 1976. Instead of being binary, $T$ takes on three unordered values $\{t_1,t_2,t_3\}$: $t_1$ means the student receives college education in fields including engineering, mathematics, law, and the social sciences, among others; $t_2$ means the student receives college education in other fields, including business, education, and public services, among others; and $t_3$ means the student does not receive college education. In the binary local IV case, $t_1$ and $t_2$ would be collapsed into a single treatment indicating college-level education, for the study of the return to college schooling. The unobserved type $S$ is defined by the way education decisions vary with proximity to college. The conditioning covariates $X$ include race, whether the individual resides in the South, and whether the individual resides in a standard metropolitan area. The available sample size is 2930, of which 381 individuals are treated with $t_1$ and 487 with $t_2$.

For estimation, no advanced nonparametric method is needed since $X$ is discrete. The $P_{t,Z}$'s are estimated using the linear probability model with five dummies, as in Kitagawa (2015), and the $\pi_z$'s are estimated with sample means. The CEP estimators of Theorem 4 are then used to evaluate the parameters of interest. The estimation results for the LASFs and LASF-Ts are displayed in Table 1. The asymptotic covariance matrix can easily be estimated using the efficient influence functions, as in Table 2, permitting joint inference on the parameters, which is another benefit of the methodology developed in this paper. Table 3 shows the results of selected statistical tests comparing LASFs and LASF-Ts both between and within treatments. The differences between the pairs $(\beta_{t_1,1},\beta_{t_2,1})$ and $(\gamma_{t_1,1},\gamma_{t_2,1})$ are both insignificant, indicating that incomes after receiving college education in the two categories of fields are similar for the respective switcher subpopulations.
As mentioned before, these facts can be turned into overidentifying restrictions ($\beta_{t_1,1}=\beta_{t_2,1}$, $\gamma_{t_1,1}=\gamma_{t_2,1}$) to improve efficiency, if supported by the underlying economic theory. College education is generally perceived as a causal factor in increasing income. Thus the LASF-T should be higher than the LASF when the treatment belongs to $\{t_1,t_2\}$, and lower when the treatment is $t_3$, which is consistent with the testing results. (This classification of the fields of study is chosen to balance the sample sizes in the dataset.)

Table 1: Estimated LASFs and LASF-Ts with confidence intervals.

Table 2: Empirically Estimated Asymptotic Covariance Matrix of the LASF and LASF-T estimates.

Also, both $\beta_{t_1,1}$ and $\beta_{t_2,1}$ are higher than $\beta_{t_3,1}$, where the statistical insignificance is due to the fact that these three parameters are averages over different subpopulations. Comparisons among $\beta_{t_1,2}$, $\beta_{t_2,2}$, and $\beta_{t_3,2}$ show a similar intuition about the effect of schooling, with a tendency for the outcomes of always-takers to be less spread out across treatments. The ratios $P(T=t_1\mid S\in\Sigma_{t_1,1})$ and $P(T=t_2\mid S\in\Sigma_{t_2,1})$ are estimated to be 0.68 and 0.92, revealing that a significant share of individuals in the switcher subpopulations receive college education. At this point, one might question whether the parameter estimates are of policy interest. The argument here is that, by the identification results, the parameters discussed here are at least as informative as the LATE (and LATT) parameters in the binary local IV case.
Indeed, one of the themes of local IV and other causal inference models is to trade off informativeness against the removal of incredible assumptions.

Table 3: Differences between parameter values, with two-sided and one-sided tests (standard deviations of the estimates in parentheses; "*" indicates significance at the 5% level).

Monte Carlo Experiment

Monte Carlo simulations are conducted to further the understanding of the relationships between the random variables in the model and of the finite-sample performance of the estimators. Two data generating processes (DGPs), differing only in the distribution of the covariate $X$, are specified as follows. In the first DGP, $X$ is drawn from a uniform distribution on a sub-interval of $(0,1)$; in the second DGP, $X$ is a discrete random variable taking five values in $(0,1)$ with equal probabilities.
The two DGPs then generate $S$, $Z$, $T$, and $Y$ from $X$ in the same way. The type $S$ is drawn from $\mathrm{Binomial}(4,X)$, with the values $(0,1,2,3,4)$ matched to $(s_1,s_2,s_3,s_4,s_5)$, respectively. The instrument $Z$ is generated according to the distribution $\mathrm{Bernoulli}(X)$. The treatment $T$ is then determined by the realizations of $S$ and $Z$. The potential outcomes $(Y_{t_1},Y_{t_2},Y_{t_3})$ are constructed type by type: on each type indicator $1\{S=s_j\}$, every $Y_t$ is a sum of one or two of the mutually independent shocks $\xi_1,\xi_2,\xi_3,\xi_4$, where $\xi_1\sim N(c_1,1)$, $\xi_2\sim N(X,1)$, $\xi_3\sim N(X+c_3,1)$, and $\xi_4\sim N(X+c_4,1)$ for positive constants $c_1,c_3,c_4$. Normality is assumed to match the bell-shaped empirical distribution of log-income. By construction, $S$ and $Z$ are independent conditional on $X$, and the $Y_t$'s depend on $S$ and $X$ but not on $Z$. In the subpopulations corresponding to the last two types, the $Y_t$'s are mutually independent, while in the other cases they are correlated through shared $\xi$'s; this feature resembles the data generating process in Hong and Nekipelov (2010a). With a moderate number of observations $N$, the Monte Carlo standard deviations of the $\hat\beta_{t,k}$'s are very close to the $\hat\sigma_{\beta_{t,k}}/\sqrt N$'s, confirming the efficiency of the estimation and the optimality of the associated tests.

Table 4: Monte Carlo Results

Parameter                  $P_X$        Value   Mean Bias   Median Bias   Std Deviation   Root MSE
$\beta^{(1)}$              Continuous   1.00     0.0007      0.0010        0.0488          0.0488
                           Discrete     1.00    -0.0080     -0.0078        0.0502          0.0508
$\beta^{(2)}$              Continuous   1.00     0.0160      0.0149        0.0756          0.0773
                           Discrete     1.00     0.0124      0.0126        0.0744          0.0754
$\beta^{(3)}$              Continuous   0.60    -0.0109     -0.0105        0.0458          0.0471
                           Discrete     0.60     0.0024      0.0025        0.0475          0.0476
$\sigma_{\beta^{(1)}}$     Discrete     2.75    -0.0183     -0.0265        0.1316          0.1329
$\sigma_{\beta^{(2)}}$     Discrete     4.12    -0.0441     -0.0555        0.2409          0.2449
$\sigma_{\beta^{(3)}}$     Discrete     2.59    -0.0095     -0.0108        0.0843          0.0849

(Here $\beta^{(1)},\beta^{(2)},\beta^{(3)}$ index three of the $\beta_{t,k}$ parameters, and the $\sigma$'s are the corresponding efficiency-bound standard deviations.)
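The structure of this design (conditional independence of $S$ and $Z$ given $X$, a deterministic treatment response to $(S,Z)$ satisfying unordered monotonicity, and potential outcomes depending on $S$ and $X$ but not on $Z$) can be sketched as follows. The support of `X`, the response table `resp`, and all numeric constants below are placeholders rather than the paper's values; only the structural features carry over.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000
X = rng.uniform(0.2, 0.8, n)            # placeholder support inside (0, 1)
S = rng.binomial(4, X)                  # five types s1..s5, drawn given X
Z = rng.binomial(1, X)                  # instrument depends only on X, so S and Z
                                        # are independent conditional on X
# deterministic response to (S, Z); placeholder rows chosen so that the set of
# types taking each treatment is nested across instrument values:
# t0-always, (t0 -> t1) switcher, t1-always, (t0 -> t2) switcher, t2-always
resp = np.array([[0, 0], [0, 1], [1, 1], [0, 2], [2, 2]])
T = resp[S, Z]
# potential-outcome shocks whose means depend on X (placeholder constants)
xi = rng.normal(loc=np.column_stack([0.5 + 0.0 * X, X, X + 0.5]), scale=1.0)
Y = xi[np.arange(n), T] + 0.3 * (S >= 3)   # Y_t depends on S and X, never on Z
# sanity check: partial association of S and Z given X is zero by construction
rho = np.corrcoef(S - 4 * X, Z - X)[0, 1]
```

The residual correlation `rho` strips the common dependence on $X$ and should be near zero, mirroring the conditional-independence requirement on the instrument.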
The $\hat\sigma_{\beta_{t,k}}$'s are estimated using the plug-in estimators, where the $\beta_{t,k}$'s are estimated by CEP.

Conclusion

This paper has studied semiparametric efficient estimation in the generalized local IV framework, where the treatment is allowed to take multiple values. A large class of parameters implicitly defined by a possibly overidentifying collection of non-smooth moment conditions is considered, with a special focus on parameters derived from type probabilities and local average structural functions. The calculated efficient influence functions lead to easy implementation of optimal joint inference and to the construction of estimators suitable for high-dimensional settings. The model assumptions of local IV in general are further understood through the optimal testable implications. The applicability of the methodology is demonstrated with examples for empirical research with a finite amount of sample data. For future studies, one could consider using the efficient estimation methods for, say, the LASFs to extract information on the (non-local) average structural functions, as in Mogstad et al. (2018).

References
Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231–263.

Ackerberg, D., Chen, X., Hahn, J., and Liao, Z. (2014). Asymptotic efficiency of semiparametric two-step GMM. Review of Economic Studies, 81(3):919–943.

Angrist, J. D. and Imbens, G. W. (1995). Two-stage least squares estimation of average causal effects in models with variable treatment intensity. Journal of the American Statistical Association, 90(430):431–442.

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455.

Bajari, P., Chernozhukov, V., Hong, H., and Nekipelov, D. (2015). Identification and efficient semiparametric estimation of a dynamic discrete game. Technical report, National Bureau of Economic Research.

Belloni, A., Chernozhukov, V., Fernández-Val, I., and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298.

Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650.

Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling. Technical report, National Bureau of Economic Research.

Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155(2):138–154.

Chen, X., Hong, H., and Tarozzi, A. (2004). Semiparametric efficiency in GMM models of nonclassical measurement errors, missing data and treatment effects. Working paper.

Chen, X., Hong, H., and Tarozzi, A. (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36(2):808–843.

Chen, X., Linton, O., and Van Keilegom, I. (2003). Estimation of semiparametric models when the criterion function is not smooth. Econometrica, 71(5):1591–1608.

Chen, X. and Santos, A. (2018). Overidentification in regular models. Econometrica, 86(5):1771–1817.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.

Firpo, S. (2007). Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75(1):259–276.

Frölich, M. (2007). Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics, 139(1):35–75.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2):315–331.

Heckman, J. J. and Pinto, R. (2018a). Unordered monotonicity. Econometrica, 86(1):1–35.

Heckman, J. J. and Pinto, R. (2018b). Web appendix for "Unordered monotonicity". Econometrica, 86(1):1–35.

Hong, H. and Nekipelov, D. (2010a). Semiparametric efficiency in nonlinear LATE models. Quantitative Economics, 1(2):279–304.

Hong, H. and Nekipelov, D. (2010b). Supplement to "Semiparametric efficiency in nonlinear LATE models". Quantitative Economics, 1(2):279–304.

Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475.

Kitagawa, T. (2015). A test for instrument validity. Econometrica, 83(5):2043–2063.

Kline, P. and Walters, C. R. (2016). Evaluating public programs with close substitutes: The case of Head Start. The Quarterly Journal of Economics, 131(4):1795–1848.

Melly, B. and Wüthrich, K. (2017). Local quantile treatment effects. In Handbook of Quantile Regression, pages 145–164. Chapman and Hall/CRC.

Mogstad, M., Santos, A., and Torgovitsky, A. (2018). Using instrumental variables for inference about policy relevant treatment parameters. Econometrica, 86(5):1589–1619.

Nekipelov, D. (2011). Identification and semiparametric efficiency in a model with a multi-valued discrete regressor. Technical report, Mimeo, UC Berkeley.

Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2):99–135.

Newey, W. K. (1994). The asymptotic variance of semiparametric estimators. Econometrica, 62(6):1349–1382.

Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245.

Pakes, A. and Pollard, D. (1989). Simulation and the asymptotics of optimization estimators. Econometrica, 57(5):1027–1057.

Pinto, R. (2019). Noncompliance as a rational choice: A framework that exploits compromises in social experiments to identify causal effects. UCLA working paper.

Sun, Z. (2018). Essays on Non-parametric and High-dimensional Econometrics. PhD thesis, UC San Diego.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

van der Vaart, A. W. and Wellner, J. A. (2000). Preservation theorems for Glivenko-Cantelli and uniform Glivenko-Cantelli classes. In High Dimensional Probability II, pages 115–133. Springer.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak Convergence and Empirical Processes, pages 16–28. Springer.

Vytlacil, E. (2002). Independence, monotonicity, and latent index models: An equivalence result. Econometrica, 70(1):331–341.
A Binary Local IV with Conditioning Covariates
It may help to present the notation and results of the main text in the familiar context of binary local IV. The identification of the classical LATE, without conditioning covariates, is discussed in Section 4.1 of Heckman and Pinto (2018a) using binary matrices. The efficiency bound and efficient estimation are discussed in Frölich (2007) and Hong and Nekipelov (2010a). The specification test is discussed in Kitagawa (2015). Here both the treatment and the instrument are binary: $\mathcal{Z} = \{z_1, z_2\}$ and $\mathcal{T} = \{t_1, t_2\}$. The type constraint is

\[
\begin{array}{c|ccc}
 & s_1 & s_2 & s_3 \\ \hline
T_{z_1} & t_1 & t_2 & t_2 \\
T_{z_2} & t_1 & t_1 & t_2
\end{array}
\]

where $s_1$ is the always-taker, $s_2$ is the complier, and $s_3$ is the never-taker. The unordered monotonicity condition is satisfied: $1\{T_{z_2} = t_1\} \ge 1\{T_{z_1} = t_1\}$ and $1\{T_{z_2} = t_2\} \le 1\{T_{z_1} = t_2\}$. The type partitions are, for $t_1$, $\Sigma_{t_1,1} = \{s_1\}$, $\Sigma_{t_1,2} = \{s_2\}$; and for $t_2$, $\Sigma_{t_2,1} = \{s_3\}$, $\Sigma_{t_2,2} = \{s_2\}$. The $B_t$'s and their inverses are

\[
B_{t_1} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix} \;\Longrightarrow\; B_{t_1}^{+} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \\ 0 & 0 \end{pmatrix}; \qquad B_{t_2} = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix} \;\Longrightarrow\; B_{t_2}^{+} = \begin{pmatrix} 0 & 0 \\ 1 & -1 \\ 0 & 1 \end{pmatrix},
\]

and the $b_{t,k}$'s are $b_{t_1,2} = b_{t_2,2} = (0,1,0)$, $b_{t_1,1} = (1,0,0)$, and $b_{t_2,1} = (0,0,1)$. Thus $\tilde{b}_{t_1,2} = (-1,1)$, $\tilde{b}_{t_2,2} = (1,-1)$, $\tilde{b}_{t_1,1} = (1,0)$, and $\tilde{b}_{t_2,1} = (0,1)$. The $I_{t,Z}(X)$'s and $P_{t,Z}(X)$'s are

\[
\begin{aligned}
I_{t_1,Z}(X) &= \big( E[Y 1\{T = t_1\} \mid Z = z_1, X],\; E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) \\
I_{t_2,Z}(X) &= \big( E[Y 1\{T = t_2\} \mid Z = z_1, X],\; E[Y 1\{T = t_2\} \mid Z = z_2, X] \big) \\
P_{t_1,Z}(X) &= \big( P(T = t_1 \mid Z = z_1, X),\; P(T = t_1 \mid Z = z_2, X) \big) \\
P_{t_2,Z}(X) &= \big( P(T = t_2 \mid Z = z_1, X),\; P(T = t_2 \mid Z = z_2, X) \big)
\end{aligned}
\]
By Theorem 1,

\[
\begin{aligned}
p_{t_1,1} &= P(S = s_1) = \tilde{b}_{t_1,1} E[P_{t_1,Z}(X)] = E[P(T = t_1 \mid Z = z_1, X)] \\
p_{t_1,2} &= P(S = s_2) = p_{t_2,2} = \tilde{b}_{t_1,2} E[P_{t_1,Z}(X)] = E[-P(T = t_1 \mid Z = z_1, X) + P(T = t_1 \mid Z = z_2, X)] \\
p_{t_2,1} &= P(S = s_3) = \tilde{b}_{t_2,1} E[P_{t_2,Z}(X)] = E[P(T = t_2 \mid Z = z_2, X)]
\end{aligned}
\]

and

\[
\begin{aligned}
\beta_{t_1,2} &= E[Y_{t_1} \mid S = s_2] = \frac{1}{p_{t_1,2}} \tilde{b}_{t_1,2} E[I_{t_1,Z}(X)] = \frac{E\big[ E[Y 1\{T = t_1\} \mid Z = z_2, X] - E[Y 1\{T = t_1\} \mid Z = z_1, X] \big]}{E\big[ P(T = t_1 \mid Z = z_2, X) - P(T = t_1 \mid Z = z_1, X) \big]} \\
\beta_{t_1,1} &= E[Y_{t_1} \mid S = s_1] = \frac{1}{p_{t_1,1}} \tilde{b}_{t_1,1} E[I_{t_1,Z}(X)] = \frac{E\big[ E[Y 1\{T = t_1\} \mid Z = z_1, X] \big]}{E\big[ P(T = t_1 \mid Z = z_1, X) \big]} \\
\beta_{t_2,2} &= E[Y_{t_2} \mid S = s_2] = \frac{1}{p_{t_2,2}} \tilde{b}_{t_2,2} E[I_{t_2,Z}(X)] = \frac{E\big[ E[Y 1\{T = t_2\} \mid Z = z_1, X] - E[Y 1\{T = t_2\} \mid Z = z_2, X] \big]}{E\big[ P(T = t_2 \mid Z = z_1, X) - P(T = t_2 \mid Z = z_2, X) \big]} \\
\beta_{t_2,1} &= E[Y_{t_2} \mid S = s_3] = \frac{1}{p_{t_2,1}} \tilde{b}_{t_2,1} E[I_{t_2,Z}(X)] = \frac{E\big[ E[Y 1\{T = t_2\} \mid Z = z_2, X] \big]}{E\big[ P(T = t_2 \mid Z = z_2, X) \big]}
\end{aligned}
\]

Thus we have the usual expression for the LATE:

\[
E[Y_{t_1} - Y_{t_2} \mid S = s_2] = \beta_{t_1,2} - \beta_{t_2,2} = \frac{E\big[ E[Y \mid Z = z_2, X] - E[Y \mid Z = z_1, X] \big]}{E\big[ P(T = t_1 \mid Z = z_2, X) - P(T = t_1 \mid Z = z_1, X) \big]}.
\]

Focusing on $\Sigma_{t_1,2} = \Sigma_{t_2,2} = \{s_2\}$, we can derive the LASF-Ts $E[Y_{t_1} \mid T = t_1, S = s_2]$ and $E[Y_{t_2} \mid T = t_1, S = s_2]$. In fact, $W_{t_1,t_1,2} = \{z_2\}$.
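Before turning to the treated-complier parameters, the covariate-aggregated LATE formula above can be sanity-checked by simulation. The sketch below uses a hypothetical data-generating process (binary instrument, binary treatment, and a binary covariate shifting both the instrument propensity and the type shares; none of these choices come from the paper) and forms the Wald ratio of Theorem 1 from cell means:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
tau = 2.0                               # true complier (LATE) effect in this toy DGP

X = rng.integers(0, 2, n)               # binary covariate
pz = np.where(X == 1, 0.7, 0.4)         # instrument propensity depends on X
Z = rng.random(n) < pz

# compliance types: 0 = never-taker, 1 = complier, 2 = always-taker
type_probs = np.where(X[:, None] == 1, [0.2, 0.5, 0.3], [0.4, 0.4, 0.2])
U = rng.random(n)
S = (U > type_probs[:, 0]).astype(int) + (U > type_probs[:, :2].sum(1)).astype(int)

T = (S == 2) | ((S == 1) & Z)           # take-up under unordered monotonicity
Y0 = X + rng.normal(0, 1, n)
Y1 = Y0 + np.where(S == 1, tau, 1.0)    # compliers get tau, other types get 1.0
Y = np.where(T, Y1, Y0)

# conditional Wald / local IV estimand, aggregated over X
num = den = 0.0
for x in (0, 1):
    w = (X == x).mean()
    num += w * (Y[(X == x) & Z].mean() - Y[(X == x) & ~Z].mean())
    den += w * (T[(X == x) & Z].mean() - T[(X == x) & ~Z].mean())
late = num / den
print(round(late, 2))                   # should be close to tau = 2.0
```

Because the complier effect is constant across covariate cells in this design, the aggregated ratio and the cell-by-cell Wald ratios estimate the same LATE.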
Thus, by Theorem 2,

\[
q_{t_1,t_1,2} = P(T = t_1, S = s_2) = E\big[ \big( -P(T = t_1 \mid Z = z_1, X) + P(T = t_1 \mid Z = z_2, X) \big) P(Z = z_2 \mid X) \big]
\]

and

\[
\begin{aligned}
\gamma_{t_1,t_1,2} &= E[Y_{t_1} \mid T = t_1, S = s_2] = \frac{E\big[ \tilde{b}_{t_1,2} I_{t_1,Z}(X)\, \pi_{W_{t_1,t_1,2}}(X) \big]}{E\big[ \tilde{b}_{t_1,2} P_{t_1,Z}(X)\, \pi_{W_{t_1,t_1,2}}(X) \big]} = \frac{E\big[ \big( -E[Y 1\{T = t_1\} \mid Z = z_1, X] + E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) P(Z = z_2 \mid X) \big]}{E\big[ \big( -P(T = t_1 \mid Z = z_1, X) + P(T = t_1 \mid Z = z_2, X) \big) P(Z = z_2 \mid X) \big]} \\
\gamma_{t_1,t_2,2} &= E[Y_{t_2} \mid T = t_1, S = s_2] = \frac{E\big[ \tilde{b}_{t_2,2} I_{t_2,Z}(X)\, \pi_{W_{t_1,t_1,2}}(X) \big]}{E\big[ \tilde{b}_{t_2,2} P_{t_2,Z}(X)\, \pi_{W_{t_1,t_1,2}}(X) \big]} = \frac{E\big[ \big( E[Y 1\{T = t_2\} \mid Z = z_1, X] - E[Y 1\{T = t_2\} \mid Z = z_2, X] \big) P(Z = z_2 \mid X) \big]}{E\big[ \big( P(T = t_2 \mid Z = z_1, X) - P(T = t_2 \mid Z = z_2, X) \big) P(Z = z_2 \mid X) \big]}
\end{aligned}
\]

so that

\[
E[Y_{t_1} - Y_{t_2} \mid T = t_1, S = s_2] = \gamma_{t_1,t_1,2} - \gamma_{t_1,t_2,2} = \frac{E\big[ \big( E[Y \mid Z = z_2, X] - E[Y \mid Z = z_1, X] \big) P(Z = z_2 \mid X) \big]}{E\big[ \big( P(T = t_1 \mid Z = z_2, X) - P(T = t_1 \mid Z = z_1, X) \big) P(Z = z_2 \mid X) \big]}.
\]

The notation $\zeta(Z, X)$ in Theorem 3 means $\big( 1\{Z = z_1\}/P(Z = z_1 \mid X),\; 1\{Z = z_2\}/P(Z = z_2 \mid X) \big)$. The semiparametric efficiency calculations give the efficient influence function for, say, $\beta_{t_1,2} = E[Y_{t_1} \mid S = s_2]$, which is

\[
\begin{aligned}
\Psi_{\beta_{t_1,2}} = {} & \frac{1}{p_{t_1,2}} \left( \frac{1\{Z = z_2\}}{P(Z = z_2 \mid X)} \big( Y 1\{T = t_1\} - E[Y 1\{T = t_1\} \mid Z = z_2, X] \big) + E[Y 1\{T = t_1\} \mid Z = z_2, X] \right) \\
& - \frac{1}{p_{t_1,2}} \left( \frac{1\{Z = z_1\}}{P(Z = z_1 \mid X)} \big( Y 1\{T = t_1\} - E[Y 1\{T = t_1\} \mid Z = z_1, X] \big) + E[Y 1\{T = t_1\} \mid Z = z_1, X] \right) \\
& - \frac{\beta_{t_1,2}}{p_{t_1,2}} \left( \frac{1\{Z = z_2\}}{P(Z = z_2 \mid X)} \big( 1\{T = t_1\} - E[1\{T = t_1\} \mid Z = z_2, X] \big) + E[1\{T = t_1\} \mid Z = z_2, X] \right) \\
& + \frac{\beta_{t_1,2}}{p_{t_1,2}} \left( \frac{1\{Z = z_1\}}{P(Z = z_1 \mid X)} \big( 1\{T = t_1\} - E[1\{T = t_1\} \mid Z = z_1, X] \big) + E[1\{T = t_1\} \mid Z = z_1, X] \right)
\end{aligned}
\]

The estimators are skipped for brevity.
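Two properties of $\Psi_{\beta_{t_1,2}}$ can be checked numerically: the influence function averages to zero at the plug-in parameter values, and the plug-in ratio recovers the complier mean. The sketch below is a hypothetical design (not from the paper), with the conditional-mean nuisances replaced by full-sample cell means over a discrete covariate, so the inverse-propensity residual terms average to zero exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300_000
X = rng.integers(0, 2, n)
piz = np.where(X == 1, 0.6, 0.4)          # P(Z = z2 | X), known here
Z = rng.random(n) < piz
U = rng.random(n)
S = (U > 0.3).astype(int) + (U > 0.7).astype(int)   # 0 never, 1 complier, 2 always
T = (S == 2) | ((S == 1) & Z)
Y = rng.normal(1.0 + (S == 1), 1.0) * T   # only Y*1{T=t1} enters the formulas

def h(V, z):
    """cell means standing in for E[V | Z = z, X]."""
    return np.array([V[(X == x) & (Z == z)].mean() for x in (0, 1)])[X]

hY1, hY0 = h(Y * T, True), h(Y * T, False)
hT1, hT0 = h(T.astype(float), True), h(T.astype(float), False)
p = (hT1 - hT0).mean()                    # P(S = complier)
beta = (hY1 - hY0).mean() / p             # plug-in E[Y_{t1} | complier]

# efficient influence function evaluated at the plug-in values
psi = (Z / piz * (Y * T - hY1) + hY1
       - (~Z) / (1 - piz) * (Y * T - hY0) - hY0) / p \
    - beta / p * (Z / piz * (T - hT1) + hT1
                  - (~Z) / (1 - piz) * (T - hT0) - hT0)
# psi.mean() is numerically zero, and beta is close to the true value 2.0
```

With estimated rather than known nuisances entering a one-step update, this mean-zero structure is what delivers the bias correction discussed in the proof of Theorem 4.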
The optimal set of testable implications reduces to: for any measurable set $B$, almost surely,

\[
\begin{aligned}
E[1_B(Y) 1\{T = t_1\} \mid Z = z_2, X] - E[1_B(Y) 1\{T = t_1\} \mid Z = z_1, X] &= \tilde{b}_{t_1,2}\, g_{t_1,Z}(X) \in [0,1] \\
E[1_B(Y) 1\{T = t_2\} \mid Z = z_1, X] - E[1_B(Y) 1\{T = t_2\} \mid Z = z_2, X] &= \tilde{b}_{t_2,2}\, g_{t_2,Z}(X) \in [0,1]
\end{aligned}
\]

with $g = 1_B$, which is equation (3.3) of Kitagawa (2015).

B Proofs of Theorems and Regularity Conditions
Lemma 1.
The following conditional independence relationships hold: $S \perp Z \mid X$; and for any $t \in \mathcal{T}$, $Y_t \perp T \mid S, X$.

Proof.
The first statement follows from the definition of $S$ and the fact that $Z$ is independent of $(T_{z_1}, \cdots, T_{z_{N_Z}})$ conditional on $X$. For the second statement, $T$ is a function of $(S, Z, X)$. Hence, given $S$ and $X$, $T$ is independent of $Y_t$, since $Z$ is independent of $(Y_{t_1}, \cdots, Y_{t_{N_T}})$ conditional on $X$. ∎

Lemma 2. For each $t \in \mathcal{T}$, $k = 1, \cdots, N_Z$, and $g$ measurable, the following identification results hold:
(i) $P(S \in \Sigma_{t,k} \mid X) = \tilde{b}_{t,k} P_{t,Z}(X)$ a.s.
(ii) $E[g(Y_t) \mid S \in \Sigma_{t,k}, X] = \tilde{b}_{t,k}\, g_{t,Z}(X) \big/ \tilde{b}_{t,k} P_{t,Z}(X)$ a.s.

Proof. Conditional on $X$, we have $B_t[i, j] = 1\{T = t \mid Z = z_i, S = s_j\}$, which is the definition of $B_t$ in Heckman and Pinto (2018a). This means that the quantity $b_{t,k} B_t^{+}$ defined in their paper is the constant (across different values of $X$) $\tilde{b}_{t,k}$. Hence the result is equivalent to Theorem 6 in Heckman and Pinto (2018a). ∎

Proof of Theorem 1. (i) We obtain the result by applying the law of iterated expectations to (i) of Lemma 2.
(ii) Using Bayes' rule, we have

\[
E[g(Y_t) \mid S \in \Sigma_{t,k}] = \int E[g(Y_t) \mid S \in \Sigma_{t,k}, X = x]\, f_{X \mid S \in \Sigma_{t,k}}(x)\, dx = \int E[g(Y_t) \mid S \in \Sigma_{t,k}, X = x]\, \frac{P(S \in \Sigma_{t,k} \mid X = x)}{P(S \in \Sigma_{t,k})}\, f_X(x)\, dx = \frac{1}{p_{t,k}} E\big[ \tilde{b}_{t,k}\, g_{t,Z}(X) \big]. \;∎
\]

Proof of Theorem 2.
(i) By the definition of $W_{t_0,t,k}$, we have

\[
P(T = t_0, S \in \Sigma_{t,k}) = P(Z \in W_{t_0,t,k}, S \in \Sigma_{t,k}) = E\big[ P(Z \in W_{t_0,t,k}, S \in \Sigma_{t,k} \mid X) \big] = E\big[ P(Z \in W_{t_0,t,k} \mid X)\, P(S \in \Sigma_{t,k} \mid X) \big] = E\big[ \tilde{b}_{t,k} P_{t,Z}(X)\, \pi_{W_{t_0,t,k}}(X) \big]
\]

where the third equality follows from $Z \perp S \mid X$.

(ii) Using Bayes' rule, we have

\[
\begin{aligned}
E[g(Y_t) \mid T = t_0, S \in \Sigma_{t,k}] &= \int E[g(Y_t) \mid T = t_0, S \in \Sigma_{t,k}, X = x]\, f_{X \mid T = t_0, S \in \Sigma_{t,k}}(x)\, dx \\
&= \int E[g(Y_t) \mid T = t_0, S \in \Sigma_{t,k}, X = x]\, \frac{P(T = t_0, S \in \Sigma_{t,k} \mid X = x)}{P(T = t_0, S \in \Sigma_{t,k})}\, f_X(x)\, dx \\
&= \int E[g(Y_t) \mid S \in \Sigma_{t,k}, X = x]\, \frac{P(T = t_0, S \in \Sigma_{t,k} \mid X = x)}{P(T = t_0, S \in \Sigma_{t,k})}\, f_X(x)\, dx = \frac{1}{q_{t_0,t,k}} E\big[ \tilde{b}_{t,k}\, g_{t,Z}(X)\, \pi_{W_{t_0,t,k}}(X) \big]
\end{aligned}
\]

where the second equality follows from Bayes' rule and the third equality follows from $Y_t \perp T \mid S, X$. By Lemma L-16 of Heckman and Pinto (2018b), we know that under the unordered monotonicity assumption on $S$, $B_t[\cdot, i_1] = B_t[\cdot, i_2]$ for all $s_{i_1}, s_{i_2} \in \Sigma_{t,k}$. Thus $E[g(Y_t) \mid T = t_0, S \in \Sigma_{t,k}]$ is always identified. ∎

The calculations for the semiparametric efficiency bounds follow Newey (1990). The likelihood of the statistical model can be specified as

\[
L(Y, T, Z, X) = \left( \prod_{z \in \mathcal{Z}} \big( f_z(Y, T \mid X)\, \pi_z(X) \big)^{1\{Z = z\}} \right) f_X(X)
\]

where $f_z(\cdot, \cdot \mid X)$ denotes the conditional density of $(Y, T)$ given $Z = z$ and $X$, and $f_X(\cdot)$ denotes the marginal density of $X$. In a regular parametric submodel, in which the true underlying probability measure $P$ is indexed by $\theta_o$, we use the following notation:

\[
\begin{aligned}
s_z(Y, T \mid X; \theta_o) &= \frac{\partial}{\partial \theta} \log f_z(Y, T \mid X; \theta) \Big|_{\theta = \theta_o}, \quad z \in \mathcal{Z} \\
s_\pi(Z \mid X; \theta_o) &= \sum_{z \in \mathcal{Z}} 1\{Z = z\} \frac{\partial}{\partial \theta} \log \pi_z(X; \theta) \Big|_{\theta = \theta_o} \\
s_X(X; \theta_o) &= \frac{\partial}{\partial \theta} \log f_X(X; \theta) \Big|_{\theta = \theta_o}
\end{aligned}
\]

Then the following lemma is immediate. It computes the score and the tangent space, and is invoked repeatedly below in the calculation of the semiparametric efficiency bounds.

Lemma 3.
The score in a regular parametric submodel is

\[
s_{\theta_o}(Y, T, Z, X) = \sum_{z \in \mathcal{Z}} 1\{Z = z\}\, s_z(Y, T \mid X; \theta_o) + s_\pi(Z \mid X; \theta_o) + s_X(X; \theta_o).
\]

Hence the tangent space of the original model is

\[
\mathcal{S}(P) = \Big\{ s \in L_2(P) : s(Y, T, Z, X) = \sum_{z \in \mathcal{Z}} 1\{Z = z\}\, s_z(Y, T \mid X) + s_\pi(Z \mid X) + s_X(X) \text{ for some } s_z, s_\pi, s_X \text{ such that } \int s_z(y, t \mid X) f_z(y, t \mid X)\, dy\, dt \equiv 0 \;\; \forall z; \; \sum_{z \in \mathcal{Z}} s_\pi(z \mid X)\, \pi_z(X) \equiv 0; \; \text{and } \int s_X(x) f_X(x)\, dx = 0 \Big\}
\]

Proof of Theorem 3.
We only prove (i) and (ii); (iii) and (iv) are easier cases that can be proved along the way.

(i) For the pathwise differentiability of $\beta_{t,k}$, in any parametric submodel,

\[
\frac{\partial}{\partial \theta} \beta_{t,k}(\theta) \Big|_{\theta = \theta_o} = \frac{\partial}{\partial \theta} \left( \frac{\tilde{b}_{t,k} E_\theta[I_{t,Z}(X)]}{p_{t,k}(\theta)} \right) \Bigg|_{\theta = \theta_o} = \frac{1}{p_{t,k}} \left( \frac{\partial\, \tilde{b}_{t,k} E_\theta[I_{t,Z}(X)]}{\partial \theta} \Bigg|_{\theta = \theta_o} - \frac{\tilde{b}_{t,k} E_\theta[I_{t,Z}(X)]}{p_{t,k}} \frac{\partial p_{t,k}}{\partial \theta} \Bigg|_{\theta = \theta_o} \right) = \frac{1}{p_{t,k}} \tilde{b}_{t,k} \left( \frac{\partial}{\partial \theta} E_\theta[I_{t,Z}(X)] \Big|_{\theta = \theta_o} - \frac{\partial}{\partial \theta} E_\theta[P_{t,Z}(X)] \Big|_{\theta = \theta_o} \beta_{t,k} \right)
\]

where $\frac{\partial}{\partial \theta} E_\theta[I_{t,Z}(X)]|_{\theta = \theta_o}$ and $\frac{\partial}{\partial \theta} E_\theta[P_{t,Z}(X)]|_{\theta = \theta_o}$ are $N_Z \times 1$ vectors whose $z$-th entries are, respectively,

\[
\int y\, 1\{\tau = t\}\, s_z(y, \tau \mid x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx + \int y\, 1\{\tau = t\}\, s_X(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx
\]

and

\[
\int 1\{\tau = t\}\, s_z(y, \tau \mid x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx + \int 1\{\tau = t\}\, s_X(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx
\]

for $z \in \mathcal{Z}$. The efficient influence function has to satisfy

\[
\frac{\partial}{\partial \theta} \beta_{t,k}(\theta) \Big|_{\theta = \theta_o} = E\big[ \Psi_{\beta_{t,k}} s_{\theta_o} \big], \qquad \text{and} \qquad \Psi_{\beta_{t,k}} \in \mathcal{S}(P).
\]

The expression for $\Psi_{\beta_{t,k}}$ presented in the theorem meets both requirements. In particular, the correspondence between terms in the efficient influence function and the pathwise derivative appears exactly as in Lemma 1 of Hong and Nekipelov (2010b).

(ii) The pathwise derivative of $\gamma_{t_0,t,k}$ can be computed in a similar way:
\[
\frac{\partial}{\partial \theta} \gamma_{t_0,t,k}(\theta) \Big|_{\theta = \theta_o} = \frac{1}{q_{t_0,t,k}} \tilde{b}_{t,k} \frac{\partial}{\partial \theta} E_\theta\big[ I_{t,Z}(X)\, \pi_{W_{t_0,t,k}}(X) \big] \Big|_{\theta = \theta_o} - \frac{\gamma_{t_0,t,k}}{q_{t_0,t,k}} \tilde{b}_{t,k} \frac{\partial}{\partial \theta} E_\theta\big[ P_{t,Z}(X)\, \pi_{W_{t_0,t,k}}(X) \big] \Big|_{\theta = \theta_o}
\]

where the two derivative vectors are $N_Z \times 1$ with $z$-th entries

\[
\begin{aligned}
& \int y\, 1\{\tau = t\}\, s_z(y, \tau \mid x; \theta_o)\, \pi_{W_{t_0,t,k}}(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx + \int y\, 1\{\tau = t\}\, s_X(x; \theta_o)\, \pi_{W_{t_0,t,k}}(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx \\
& \qquad + \int y\, 1\{\tau = t\} \left( \frac{\partial}{\partial \theta} \pi_{W_{t_0,t,k}}(x; \theta) \Big|_{\theta = \theta_o} \right) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx
\end{aligned}
\]

and the same expression with the factor $y$ removed, respectively, for $z \in \mathcal{Z}$. The main difference from part (i) appears when dealing with the last terms in the above two expressions, which can be matched with terms in the efficient influence function of the following two forms:

\[
E[Y 1\{T = t\} \mid Z = z, X] \big( 1\{Z \in W_{t_0,t,k}\} - \pi_{W_{t_0,t,k}}(X) \big), \qquad E[1\{T = t\} \mid Z = z, X] \big( 1\{Z \in W_{t_0,t,k}\} - \pi_{W_{t_0,t,k}}(X) \big).
\]

To explain further, take the latter as an example.
Notice that

\[
1\{Z \in W_{t_0,t,k}\} - \pi_{W_{t_0,t,k}}(X) = \sum_{z \in W_{t_0,t,k}} \big( 1\{Z = z\} - \pi_z(X) \big)
\]

and

\[
\big( 1\{Z = z\} - \pi_z(X) \big)\, s_\pi(Z \mid X; \theta_o) = \frac{1\{Z = z\}}{\pi_z(X)} \frac{\partial}{\partial \theta} \pi_z(X; \theta) \Big|_{\theta = \theta_o} - \pi_z(X)\, s_\pi(Z \mid X; \theta_o).
\]

Using the law of iterated expectations,

\[
\begin{aligned}
E\Big[ E[1\{T = t\} \mid Z = z, X] \big( 1\{Z = z\} - \pi_z(X) \big) s_\pi(Z \mid X; \theta_o) \Big] &= E\left[ E[1\{T = t\} \mid Z = z, X]\, E\left[ \frac{1\{Z = z\}}{\pi_z(X)} \,\Big|\, X \right] \frac{\partial}{\partial \theta} \pi_z(X; \theta) \Big|_{\theta = \theta_o} \right] - E\Big[ E[1\{T = t\} \mid Z = z, X]\, \pi_z(X)\, E\big[ s_\pi(Z \mid X; \theta_o) \mid X \big] \Big] \\
&= E\left[ E[1\{T = t\} \mid Z = z, X]\, \frac{\partial}{\partial \theta} \pi_z(X; \theta) \Big|_{\theta = \theta_o} \right] = \int 1\{\tau = t\} \left( \frac{\partial}{\partial \theta} \pi_z(x; \theta) \Big|_{\theta = \theta_o} \right) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx. \;∎
\end{aligned}
\]

Proof of Theorem 4.
This proof is based on Section 5 of Newey (1994). We first focus on the case of $\beta_{t,k}$; some calculations are done in preparation. For brevity, let $h_t = (h_{Y,t,Z}, h_{t,Z}, \pi)$. Notice that the estimator $\hat{\beta}_{t,k}$ is defined by the following moment condition:

\[
M_{\beta_{t,k}}(X, \beta_{t,k}, h_t) \equiv \tilde{b}_{t,k} \left( \frac{h_{Y,t,z_1}(X)}{\pi_{z_1}(X)}, \cdots, \frac{h_{Y,t,z_{N_Z}}(X)}{\pi_{z_{N_Z}}(X)} \right)' - \beta_{t,k}\, \tilde{b}_{t,k} \left( \frac{h_{t,z_1}(X)}{\pi_{z_1}(X)}, \cdots, \frac{h_{t,z_{N_Z}}(X)}{\pi_{z_{N_Z}}(X)} \right)'
\]

The relevant derivatives are

\[
\begin{aligned}
E\left[ \frac{\partial M_{\beta_{t,k}}}{\partial \beta_{t,k}} \right] &= -\tilde{b}_{t,k} E[P_{t,Z}(X)] = -p^o_{t,k} \\
\frac{\partial M_{\beta_{t,k}}}{\partial h_{Y,t,z_i}} \Big|_{h^o_t} &= \frac{\tilde{b}_{t,k}[i]}{\pi^o_{z_i}(X)} \equiv \delta_{Y,t,z_i}(X) \\
\frac{\partial M_{\beta_{t,k}}}{\partial h_{t,z_i}} \Big|_{h^o_t} &= -\beta_{t,k} \frac{\tilde{b}_{t,k}[i]}{\pi^o_{z_i}(X)} \equiv \delta_{t,z_i}(X) \\
\frac{\partial M_{\beta_{t,k}}}{\partial \pi_{z_i}} \Big|_{h^o_t} &= -\tilde{b}_{t,k}[i] \frac{I^o_{t,z_i}(X)}{\pi^o_{z_i}(X)} + \beta_{t,k}\, \tilde{b}_{t,k}[i] \frac{P^o_{t,z_i}(X)}{\pi^o_{z_i}(X)} \equiv \delta_{\pi,z_i}(X)
\end{aligned}
\]

where $\tilde{b}_{t,k}[i]$ denotes the $i$-th element of the vector $\tilde{b}_{t,k}$. Let

\[
D_{\beta_{t,k}}(X, h_t) = \sum_{z \in \mathcal{Z}} \delta_{Y,t,z}(X) h_{Y,t,z}(X) + \sum_{z \in \mathcal{Z}} \delta_{t,z}(X) h_{t,z}(X) + \sum_{z \in \mathcal{Z}} \delta_{\pi,z}(X) \pi_z(X) = \sum_{j=1}^{N_Z} \frac{\tilde{b}_{t,k}[j]}{\pi^o_{z_j}(X)} \Big[ h_{Y,t,z_j}(X) - \beta^o_{t,k} h_{t,z_j}(X) - \big( I^o_{t,z_j}(X) - \beta^o_{t,k} P^o_{t,z_j}(X) \big) \pi_{z_j}(X) \Big]
\]

and

\[
\begin{aligned}
\alpha_{\beta_{t,k}}(Y, T, Z, X) &\equiv \sum_{z \in \mathcal{Z}} \delta_{Y,t,z}(X) \big( 1\{Z = z\} Y 1\{T = t\} - h^o_{Y,t,z}(X) \big) + \sum_{z \in \mathcal{Z}} \delta_{t,z}(X) \big( 1\{Z = z\} 1\{T = t\} - h^o_{t,z}(X) \big) + \sum_{z \in \mathcal{Z}} \delta_{\pi,z}(X) \big( 1\{Z = z\} - \pi^o_z(X) \big) \\
&= \tilde{b}_{t,k} \Big[ \zeta(Z, X, \pi^o) \odot \big( \iota\, Y 1\{T = t\} - I^o_{t,Z}(X) \big) \Big] - \beta^o_{t,k}\, \tilde{b}_{t,k} \Big[ \zeta(Z, X, \pi^o) \odot \big( \iota\, 1\{T = t\} - P^o_{t,Z}(X) \big) \Big]
\end{aligned}
\]

(with $\iota$ the $N_Z \times 1$ vector of ones and $\odot$ the elementwise product). Then we check Assumptions 5.1 to 5.3 in Newey (1994) in turn. For Assumption 5.1(i), the linearization $D$ can be taken to be $D_{\beta_{t,k}}$ by equation (4.2) of that paper, since $M_{\beta_{t,k}}$ depends on $h_t$ only through its value $h_t(X)$. Assumption 5.1(ii) is satisfied by our condition 4(i) on the convergence rate of $\hat{h}_t$. Assumption 5.2 is the stochastic equicontinuity condition on $D_{\beta_{t,k}}$, which can be verified by our condition 4(ii), since
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big[ D_{\beta_{t,k}}(X_i, \hat{h}_t - h^o_t) - E\big[ D_{\beta_{t,k}}(X, \hat{h}_t - h^o_t) \big] \Big] = \sum_{j=1}^{N_Z} \Big[ \nu_n\big( \delta_{Y,t,z_j} (\hat{h}_{Y,t,z_j} - h^o_{Y,t,z_j}) \big) + \nu_n\big( \delta_{t,z_j} (\hat{h}_{t,z_j} - h^o_{t,z_j}) \big) + \nu_n\big( \delta_{\pi,z_j} (\hat{\pi}_{z_j} - \pi^o_{z_j}) \big) \Big] \xrightarrow{p} 0,
\]

where, for $h : \mathcal{X} \to \mathbb{R}$, $\nu_n(h) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \big[ h(X_i) - E[h(X)] \big]$ denotes the empirical process. The $\alpha$ in Assumption 5.3 is constructed to be $\alpha_{\beta_{t,k}}(Y, T, Z, X)$ using Proposition 4 of that paper. From Lemma 5.1 there, we can establish the asymptotically linear representation

\[
\sqrt{n}\, \bar{M}_{\beta_{t,k}}(\beta^o_{t,k}, \hat{h}_t) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Big[ M_{\beta_{t,k}}(X_i, \beta^o_{t,k}, h^o_t) + \alpha_{\beta_{t,k}}(Y_i, T_i, Z_i, X_i) \Big] + o_p(1),
\]

where $\bar{M}_{\beta_{t,k}}$ denotes the sample average of $M_{\beta_{t,k}}$. Also, the consistency of $\hat{p}_{t,k}$ follows from $\| \hat{h}_t - h^o_t \| \xrightarrow{p} 0$ and the fact that the $\pi_z$'s are bounded away from zero and one. Then, using Slutsky's theorem, the above results can be combined to obtain the asymptotic normality of $\hat{\beta}_{t,k}$, since $\sqrt{n}(\hat{\beta}_{t,k} - \beta^o_{t,k}) = \sqrt{n}\, \bar{M}_{\beta_{t,k}}(\beta^o_{t,k}, \hat{h}_t) / \hat{p}_{t,k}$. Hence the influence function of $\hat{\beta}_{t,k}$ is $\big( M_{\beta_{t,k}}(X, \beta^o_{t,k}, h^o_t) + \alpha_{\beta_{t,k}} \big) / p^o_{t,k}$, which equals $\Psi_{\beta_{t,k}}$ evaluated at the true parameter values. The term $\alpha_{\beta_{t,k}}$ corrects the bias in estimation due to the presence of the unknown infinite-dimensional nuisance parameter $(h_{Y,t,Z}, h_{t,Z}, \pi)$. The proofs for $\hat{\gamma}_{t_0,t,k}$, $\hat{p}_{t,k}$, and $\hat{q}_{t_0,t,k}$ are essentially the same. For estimating the efficiency bound, consistency of the plug-in estimators follows directly from the consistency of both the nonparametric estimates and the CEP estimators, the continuity of the efficient influence functions in the parameters, and the fact that the propensity scores are bounded away from zero and one.

Lastly, the consistency of $\hat{V}_\kappa$ follows from Lemma 8.3 of Newey and McFadden (1994), where the required "mean-square differentiability" condition (see Newey and McFadden (1994) for more discussion) is checked for $M_{\beta_{t,k}}$ and $\alpha_{\beta_{t,k}}$.
(There is no remainder $o_p(1)$ term because $M_{\beta_{t,k}}$ is linear in $\beta_{t,k}$, and hence it is unnecessary to check Assumptions 5.4 to 5.6 in Newey (1994).) ∎

Proof of Theorem 5.
The proof is similar to those of Theorems 5.1 and 5.2 in Chernozhukov et al. (2018), which verify the assumptions of their Theorem 3.1. First observe that the moment condition (23) is linear in $\beta_{t,k}$. Since $\tilde{b}_{t,k}$ is a finite vector, it suffices in our case to verify the conditions of their Theorem 3.1 for the following score function:

\[
\psi(W, \beta_{t,k}, \upsilon) = \psi^a(W, P_{t,z}, \pi_z)\, \beta_{t,k} + \psi^b(W, I_{t,z}, \pi_z) \equiv \left( \frac{1\{Z = z\}}{\pi_z(X)} \big( 1\{T = t\} - P_{t,z}(X) \big) + P_{t,z}(X) \right) \beta_{t,k} - \frac{1\{Z = z\}}{\pi_z(X)} \big( Y 1\{T = t\} - I_{t,z}(X) \big) - I_{t,z}(X),
\]

where $W = (Y, T, Z, X)$ and $\upsilon = (I_{t,z}, P_{t,z}, \pi_z)$. To check the Neyman orthogonality condition, we compute the Gateaux derivative

\[
\begin{aligned}
\frac{\partial}{\partial r} E\big[ \psi(W, \beta_{t,k}, \upsilon^o + r(\upsilon - \upsilon^o)) \big] \Big|_{r=0} = E\bigg[ & -\frac{1\{Z = z\}}{\pi^o_z(X)^2} \big( 1\{T = t\} - P^o_{t,z}(X) \big) \big( \pi_z(X) - \pi^o_z(X) \big) \beta_{t,k} + \left( P_{t,z}(X) - P^o_{t,z}(X) - \frac{1\{Z = z\}}{\pi^o_z(X)} \big( P_{t,z}(X) - P^o_{t,z}(X) \big) \right) \beta_{t,k} \\
& + \frac{1\{Z = z\}}{\pi^o_z(X)^2} \big( Y 1\{T = t\} - I^o_{t,z}(X) \big) \big( \pi_z(X) - \pi^o_z(X) \big) - \left( I_{t,z}(X) - I^o_{t,z}(X) - \frac{1\{Z = z\}}{\pi^o_z(X)} \big( I_{t,z}(X) - I^o_{t,z}(X) \big) \right) \bigg].
\end{aligned}
\]

It is equal to zero since

\[
E\left[ \frac{1\{Z = z\}}{\pi^o_z(X)} \big( 1\{T = t\} - P^o_{t,z}(X) \big) \,\Big|\, X \right] = E\left[ \frac{1\{Z = z\}}{\pi^o_z(X)} \big( Y 1\{T = t\} - I^o_{t,z}(X) \big) \,\Big|\, X \right] = 0, \quad (39)
\]

and $E\big[ 1\{Z = z\} / \pi^o_z(X) \,\big|\, X \big] =$
1. Inside the nuisance realization set (in which the estimated nuisance parameters take values with probability at least $1 - \Delta_n$), we verify their Assumption 3.2 as follows. Let $\underline{\pi} > 0$ denote the lower bound on the propensity scores and $C$ the constant bounding the relevant moments of $Y 1\{T = t\}$. First,

\[
\| \psi^a(W, P_{t,z}, \pi_z) \|_q = \left\| \frac{1\{Z = z\}}{\pi_z(X)} \big( 1\{T = t\} - P_{t,z}(X) \big) + P_{t,z}(X) \right\|_q \le \left\| \frac{1\{Z = z\}}{\pi_z(X)} \right\|_\infty \big\| 1\{T = t\} - P_{t,z}(X) \big\|_q + \| P_{t,z}(X) \|_q \le 1/\underline{\pi} + 1,
\]

and, by the same argument applied to $\psi^b$,

\[
\| \psi(W, \beta_{t,k}, \upsilon) \|_q \le (1/\underline{\pi} + 1) |\beta_{t,k}| + \| I_{t,z}(X) - I^o_{t,z}(X) \|_q + \| I^o_{t,z}(X) \|_q + \frac{1}{\underline{\pi}} \Big( \| Y 1\{T = t\} - I^o_{t,z}(X) \|_q + \| I_{t,z}(X) - I^o_{t,z}(X) \|_q \Big),
\]

which is bounded by a constant multiple of $1/\underline{\pi} + 1$. Next,

\[
\big| E[\psi^a(W, P_{t,z}, \pi_z)] - E[\psi^a(W, P^o_{t,z}, \pi^o_z)] \big| = \left| E\left[ \frac{1\{Z = z\}}{\pi_z(X)} \big( P^o_{t,z}(X) - P_{t,z}(X) \big) + P_{t,z}(X) - P^o_{t,z}(X) \right] \right| \le 2 \| P_{t,z}(X) - P^o_{t,z}(X) \| / \underline{\pi} \le 2\delta_n / \underline{\pi}.
\]

To bound $\| \psi(W, \beta_{t,k}, \upsilon) - \psi(W, \beta_{t,k}, \upsilon^o) \|$, note that

\[
\left\| \frac{1\{Z = z\}}{\pi_z(X)} \big( 1\{T = t\} - P_{t,z}(X) \big) + P_{t,z}(X) - \frac{1\{Z = z\}}{\pi^o_z(X)} \big( 1\{T = t\} - P^o_{t,z}(X) \big) - P^o_{t,z}(X) \right\| \le \| P_{t,z}(X) - P^o_{t,z}(X) \| +
\]
\[
\left\| \left( \frac{1}{\pi_z(X)} - \frac{1}{\pi^o_z(X)} \right) 1\{Z = z\} \big( 1\{T = t\} - P_{t,z}(X) \big) \right\| + \left\| \frac{1\{Z = z\}}{\pi^o_z(X)} \big( P_{t,z}(X) - P^o_{t,z}(X) \big) \right\| \le (1 + 1/\underline{\pi}) \| P_{t,z}(X) - P^o_{t,z}(X) \| + \| \pi_z(X) - \pi^o_z(X) \| / \underline{\pi}^2 \le \big( 1 + 1/\underline{\pi} + 1/\underline{\pi}^2 \big) \delta_n,
\]

and analogously

\[
\begin{aligned}
& \left\| \frac{1\{Z = z\}}{\pi_z(X)} \big( Y 1\{T = t\} - I_{t,z}(X) \big) + I_{t,z}(X) - \frac{1\{Z = z\}}{\pi^o_z(X)} \big( Y 1\{T = t\} - I^o_{t,z}(X) \big) - I^o_{t,z}(X) \right\| \\
& \quad \le (1 + 1/\underline{\pi}) \| I_{t,z}(X) - I^o_{t,z}(X) \| + \frac{\| \pi_z(X) - \pi^o_z(X) \|}{\underline{\pi}^2} \Big( \| Y 1\{T = t\} - I^o_{t,z} \|_q + \| I_{t,z} - I^o_{t,z} \|_q \Big) \le \big( 1 + 1/\underline{\pi} + C/\underline{\pi}^2 \big) \delta_n.
\end{aligned}
\]

Thus

\[
\| \psi(W, \beta_{t,k}, \upsilon) - \psi(W, \beta_{t,k}, \upsilon^o) \| \le \Big( \big( 1 + 1/\underline{\pi} + 1/\underline{\pi}^2 \big) |\beta_{t,k}| + 1 + 1/\underline{\pi} + C/\underline{\pi}^2 \Big) \delta_n.
\]
Lastly, for any $r \in (0, 1)$, based on equation (39), $\frac{\partial^2}{\partial r^2} E[\psi(W, \beta_{t,k}, \upsilon^o + r(\upsilon - \upsilon^o))]$ is a sum of terms of the form

\[
E\left[ \frac{1\{Z = z\}}{\big( \pi^o_z(X) + r(\pi_z(X) - \pi^o_z(X)) \big)^{j}} \big( P_{t,z}(X) - P^o_{t,z}(X) \big) \big( \pi_z(X) - \pi^o_z(X) \big) \right] \beta_{t,k}, \quad j \in \{2, 3\},
\]

together with the analogous terms in which $I_{t,z}(X) - I^o_{t,z}(X)$ replaces $\big( P_{t,z}(X) - P^o_{t,z}(X) \big) \beta_{t,k}$. Since the denominators are bounded away from zero, the Cauchy–Schwarz inequality yields

\[
\left| \frac{\partial^2}{\partial r^2} E\big[ \psi(W, \beta_{t,k}, \upsilon^o + r(\upsilon - \upsilon^o)) \big] \right| \le \text{Const.} \times \| \pi_z - \pi^o_z \| \Big( \| I_{t,z} - I^o_{t,z} \| + \| P_{t,z} - P^o_{t,z} \| \Big) \le \text{Const.} \times n^{-1/2} \delta_n,
\]

which completes the proof. ∎

Proof of Theorem 6.
First note that the moment conditions can be equivalently represented by $\tilde{b}_j E[m_{j,t_j,Z}(X, \eta)] = 0$, $1 \le j \le J$. The rest of the proof is based mainly on the approach described in Section 3.6 of Hong and Nekipelov (2010a) and the proof of Theorem 1 in Cattaneo (2010). We use a constant $d_\eta \times d_m$ matrix $A$ to transform the overidentified vector of moments into an exactly identified system of equations $A \big( \tilde{b}_j E[m_{j,t_j,Z}(X, \eta)] \big)_{j=1}^{J} =$
0, and then find the $A$-dependent efficient influence function for the exactly identified parameter, choosing the optimal $A$ afterwards. In a parametric submodel, by the implicit function theorem, we have

\[
\frac{\partial \eta}{\partial \theta} \Big|_{\theta = \theta_o} = -(A \Gamma)^{-1} A\, \frac{\partial}{\partial \theta} \big( \tilde{b}_j E_\theta[m_{j,t_j,Z}(X, \eta^o)] \big)_{j=1}^{J} \Big|_{\theta = \theta_o}
\]

where $\frac{\partial}{\partial \theta} E_\theta[m_{j,t_j,Z}(X, \eta^o)]|_{\theta = \theta_o}$ is an $N_Z \times 1$ vector whose $z$-th entry is

\[
\int m_j(y, \eta^o)\, 1\{\tau = t_j\}\, s_z(y, \tau \mid x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx + \int m_j(y, \eta^o)\, 1\{\tau = t_j\}\, s_X(x; \theta_o) f_z(y, \tau \mid x; \theta_o) f_X(x; \theta_o)\, dy\, d\tau\, dx
\]

for $z \in \mathcal{Z}$. So the efficient influence function for this exactly identified parameter is

\[
\Psi_A(Y, T, Z, X, \eta^o, \pi^o, m^o_Z) = -(A \Gamma)^{-1} A\, \Psi_m(Y, T, Z, X, \eta^o, \pi^o, m^o_Z)
\]

where $\Psi_m$ is defined by equation (30). It is straightforward to verify that $\Psi_A$ satisfies $\frac{\partial \eta}{\partial \theta}|_{\theta = \theta_o} = E[\Psi_A s_{\theta_o}]$ and $\Psi_A \in \mathcal{S}(P)$. The optimal $A$ is chosen by minimizing the sandwich matrix $E[\Psi_A \Psi_A'] = (A \Gamma)^{-1} A\, E[\Psi_m \Psi_m']\, A' (\Gamma' A')^{-1}$. Thus the efficient influence function for the generally overidentified parameter is obtained when $A = \Gamma' V^{-1}$, with $V = E[\Psi_m \Psi_m']$. Plugging into $\Psi_A$, we get equation (29). ∎

Proof of Theorem 7.
We follow the large-sample theory in Chen et al. (2003) (hereafter CLK), setting $\theta = \eta$, $h = (\pi, m_Z)$, $M(\theta, h) = G(\eta, \pi, m_Z)$, and $M_n(\theta, h) = G_n(\eta, \pi, m_Z)$. Their Theorem 1 is applied first, to show the consistency of $\tilde{\eta}$. Their condition (1.2) is satisfied since $\Lambda$ is compact and $G(\eta, \pi^o, m^o_Z) = \big( \tilde{b}_j E[m^o_{j,t_j,Z}(X, \eta)] \big)_{j=1}^{J}$, which has a unique zero $\eta^o$ by our condition 7(i) and is continuous by our condition 7(iii). As for condition (1.3) of CLK, the continuity of $G$ in $m_{j,t_j,z}$ and $\pi_z$ is verified by the way they enter (linearly, or through taking reciprocals with $\pi_z$ bounded away from 0 and 1), and the uniformity in $\eta$ follows from the fact that $E[m(Y^*, \eta)]$ is bounded as a function of $\eta$ (by its continuity and the compactness of $\Lambda$). Condition (1.4) of CLK is satisfied by our condition 7(ii). The uniform stochastic equicontinuity condition (1.5) of CLK is implied by the fact that, for any $j$ and $z$, the class

\[
\left\{ \frac{1\{Z = z\}}{\pi_z(X)} \big( m_j(Y, \eta) 1\{T = t_j\} - m_{j,t_j,z}(X, \eta) \big) + m_{j,t_j,z}(X, \eta) : \eta \in \Lambda,\; m_{j,t_j,z} \in \mathcal{M}^\delta_{j,z},\; \pi_z \in \Pi^\delta_z \right\}
\]

is Glivenko–Cantelli, which follows from our condition 7(iii) and the results in Van Der Vaart and Wellner (2000) stating that Glivenko–Cantelli classes with integrable envelopes are preserved under continuous transformations. Thus $\tilde{\eta} - \eta^o = o_p(1)$.

We then use Corollary 1 of CLK to show the consistency of $\hat{V}$ and the asymptotic normality of $\hat{\eta}$. Condition (2.2) of CLK is verified by our condition 6(iii). As in the proof of Theorem 5, it is straightforward to show that the moment condition $G$, being based on the efficient influence functions, satisfies the Neyman orthogonality condition with respect to the nuisance parameters $\pi$ and $m_Z$.
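The orthogonality property just invoked can also be seen numerically: perturbing the nuisances $(\pi, m_Z)$ in an arbitrary fixed direction moves the efficient-influence-function moment only at second order in the perturbation size, whereas a naive plug-in moment moves at first order. A minimal sketch under a hypothetical design, with $m_j$ taken to be the identity and nuisances replaced by cell means over a binary covariate:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
X = rng.integers(0, 2, n)
piz = np.where(X == 1, 0.6, 0.4)           # true P(Z = z | X)
Z = rng.random(n) < piz
T = Z & (rng.random(n) < 0.5)
Y = T + rng.normal(0, 1, n)

m = Y * T                                   # m_j(Y, eta) 1{T = t_j}, m_j the identity
m_z = np.array([m[(X == x) & Z].mean() for x in (0, 1)])[X]   # stand-in for m_{j,t_j,z}(X)

def G(pi_fun, m_fun):
    """sample analogue of the orthogonal moment E[(1{Z=z}/pi)(m 1{T} - m_z) + m_z]."""
    return (Z / pi_fun * (m - m_fun) + m_fun).mean()

def G_naive(m_fun):
    """non-orthogonal plug-in moment E[m_z], for contrast."""
    return m_fun.mean()

# perturb both nuisances in a fixed direction and look at the first-order effect
dpi, dm = 0.05 * (2 * X - 1), 0.1 * np.ones(n)
r = 0.1
slope = (G(piz + r * dpi, m_z + r * dm) - G(piz, m_z)) / r       # ~ 0
naive_slope = (G_naive(m_z + r * dm) - G_naive(m_z)) / r         # = 0.1 by construction
```

The orthogonal moment's finite-difference slope is close to zero, while the plug-in moment inherits the full first-order bias of the perturbation.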
For any $j$ and $z$, denote $\pi^r_z(X) = \pi^o_z(X) + r(\pi_z(X) - \pi^o_z(X))$ and $m^r_{j,t_j,z}(X, \eta) = m^o_{j,t_j,z}(X, \eta) + r\big( m_{j,t_j,z}(X, \eta) - m^o_{j,t_j,z}(X, \eta) \big)$. We have

\[
\begin{aligned}
& \frac{\partial}{\partial r} E\left[ \frac{1\{Z = z\}}{\pi^r_z(X)} \big( m_j(Y, \eta) 1\{T = t_j\} - m^r_{j,t_j,z}(X, \eta) \big) + m^r_{j,t_j,z}(X, \eta) \right] \Bigg|_{r=0} \\
& \quad = E\left[ -\frac{1\{Z = z\}}{\pi^o_z(X)^2} \big( \pi_z(X) - \pi^o_z(X) \big) \big( m_j(Y, \eta) 1\{T = t_j\} - m^o_{j,t_j,z}(X, \eta) \big) + \big( m^o_{j,t_j,z}(X, \eta) - m_{j,t_j,z}(X, \eta) \big) \left( \frac{1\{Z = z\}}{\pi^o_z(X)} - 1 \right) \right] = 0,
\end{aligned}
\]

since $E\big[ (1\{Z = z\}/\pi^o_z(X)) \big( m_j(Y, \eta) 1\{T = t_j\} - m^o_{j,t_j,z}(X, \eta) \big) \,\big|\, X \big] = 0$ and $E[1\{Z = z\}/\pi^o_z(X) \mid X] = 1$. Hence the pathwise derivative of $G$ with respect to $(\pi, m_Z)$ is zero in any direction, and condition (2.3) of CLK is verified. Their condition (2.4) is satisfied by our condition 7(ii). To show the stochastic equicontinuity condition (2.5), it suffices to show that the class

\[
\left\{ \frac{1\{Z = z\}}{\pi_z(X)} \big( m_j(Y, \eta) 1\{T = t_j\} - m_{j,t_j,z}(X, \eta) \big) + m_{j,t_j,z}(X, \eta) : \eta \in \Lambda^\delta,\; m_{j,t_j,z} \in \mathcal{M}^\delta_{j,z},\; \pi_z \in \Pi^\delta_z \right\}
\]

is Donsker. This follows from our condition 7(iv) and Theorem 2.10.6 (as well as Examples 2.10.7–2.10.9) in Van Der Vaart and Wellner (1996). Condition (2.6) of CLK is trivially verified using the central limit theorem. For the condition in Corollary 1 of CLK, let

\[
\Omega(\eta, \pi, m_Z) = E\big[ \Psi_m(Y, T, Z, X, \eta, \pi, m_Z)\, \Psi_m(Y, T, Z, X, \eta, \pi, m_Z)' \big]
\]

and

\[
\Omega_n(\eta, \pi, m_Z) = \frac{1}{n} \sum_{i=1}^{n} \Psi_m(Y_i, T_i, Z_i, X_i, \eta, \pi, m_Z)\, \Psi_m(Y_i, T_i, Z_i, X_i, \eta, \pi, m_Z)'.
\]

Then $V = \Omega(\eta^o, \pi^o, m^o_Z)$ and $\hat{V} = \Omega_n(\tilde{\eta}, \hat{\pi}, \hat{m}_Z)$. For any $\delta_n \downarrow$
0, in the shrinking neighborhoods $\Lambda^{\delta_n}$, $\Pi^{\delta_n}_z$, and $\mathcal{M}^{\delta_n}_{j,z}$, we have

\[
\sup \| \Omega_n(\eta, \pi, m_Z) - V \| \le \sup \big\| \Omega_n(\eta, \pi, m_Z) - \Omega(\eta, \pi, m_Z) - \big( \Omega_n(\eta^o, \pi^o, m^o_Z) - \Omega(\eta^o, \pi^o, m^o_Z) \big) \big\| + \sup \| \Omega(\eta, \pi, m_Z) - \Omega(\eta^o, \pi^o, m^o_Z) \| + \| \Omega_n(\eta^o, \pi^o, m^o_Z) - \Omega(\eta^o, \pi^o, m^o_Z) \|.
\]

That the first term on the right-hand side is $o_p(1)$ follows from the stochastic equicontinuity of $\Omega_n - \Omega$, which results from the (element-wise) Donsker property of the matrix $\Psi_m \Psi_m'$. The second term is $o_p(1)$ since $\Omega$ is continuous in its arguments (equation (30) and condition 6(iii)), while the third term is $o_p(1)$ by the standard central limit theorem. Hence we have shown that $\hat{V} - V = o_p(1)$ and $\hat{\eta} - \eta^o = o_p(1)$.

Lastly, using the arguments in Theorem 7.4 of Newey and McFadden (1994), the numerical derivative $\hat{\Gamma}$ is consistent. ∎

Proof of Theorem 8. (ii) The second part of the theorem is proved first. Suppose $L$ satisfies Assumption 2; we want to find an $\bar{L}$ that induces $L$ and satisfies Assumption 1. The strategy is to make the $Y_t$'s mutually independent, and to set their conditional (on $S \in \Sigma_{t,k}$) distributions to the identified values when identifiable, and to arbitrary values when unidentifiable.

Let $\tilde{P}(\cdot \mid X)$ be an arbitrary conditional distribution on the support of $Y$. Define the joint distribution of $(Z, X)$ as identified from $L$. The goal is to construct the conditional distribution of $(Y_t : t \in \mathcal{T}, S) \mid Z, X$ so that it does not depend on $Z$. For any measurable sequence of sets $\{B_1, \cdots, B_{N_T}\}$ on the support of $Y$, $\Sigma \subset \mathcal{S}$, and $z \in \mathcal{Z}$, let

\[
P\big( Y_{t_1} \in B_1, \cdots, Y_{t_{N_T}} \in B_{N_T}, S \in \Sigma \mid Z = z, X \big) = \left( \prod_{t \in \mathcal{T}} \frac{\tilde{Q}_t(X, B_t \times \Sigma)}{\tilde{Q}_t(X, \mathcal{Y} \times \Sigma)} \right) \tilde{Q}_{t_1}(X, \mathcal{Y} \times \Sigma).
\]

For $s \notin \mathcal{S}$, let $P\big( Y_{t_1} \in E_1, \cdots, Y_{t_{N_T}} \in E_{N_T}, S = s \mid Z = z, X \big) =$
0. We have then fully specified a joint distribution of $\big( \{Y_t : t \in \mathcal{T}\}, \{T_z : z \in \mathcal{Z}\}, Z, X \big)$, denoted $\bar{L}$, that is consistent with $L$ and satisfies Assumption 1. Now let $C$ be a condition on the observed law $L$ such that, whenever $\bar{L}$ satisfies Assumption 1, $O(\bar{L})$ satisfies condition $C$. The contrapositive statement is that if $L$ violates $C$, then any $\bar{L}$ with $O(\bar{L}) = L$ has to violate Assumption 1. Therefore, in the current case, where some $\bar{L}$ inducing $L$ satisfies Assumption 1, $L$ has to satisfy $C$.

(i) The first statement is trivial. For the second statement, suppose $L$ satisfies Assumption 2; we want to find an $\bar{L}$ that induces $L$ and violates Assumption 1. The strategy is to define the structural functions so that they depend on $Z$. In particular, specify $(Y_t : t \in \mathcal{T}, S) \mid Z, X$ to be the same as before when conditioning on $Z = z_1$. When $Z \ne z_1$, let

\[
P\big( Y_{t_1} \in B_1, \cdots, Y_{t_{N_T}} \in B_{N_T}, S \in \Sigma \mid Z = z, X \big) = L_{Y \mid S}(B_1, \cdots, B_{N_T})\, \tilde{Q}_{t_1}(X, \mathcal{Y} \times \Sigma)
\]

where $L_{Y \mid S}$ denotes a joint law of $N_T$ not mutually independent random variables whose marginal distributions are equal to $\tilde{Q}_t(X, B_t \times \Sigma) / \tilde{Q}_t(X, \mathcal{Y} \times \Sigma)$. Clearly, the $Y_t$'s are $Z$-dependent. ∎
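To close, the optimal-weighting step in the proof of Theorem 6 ($A = \Gamma' V^{-1}$) can be illustrated on a deliberately simple overidentified toy problem, unrelated to the paper's model: a scalar $\eta$ observed through two moment conditions with different noise levels. Two-step GMM with the inverse moment covariance as weight downweights the noisier moment, while identity weighting does not:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
eta = 1.5                                 # true scalar parameter in this toy example
# two overidentifying moment conditions E[g_j - eta] = 0 with different noise
g1 = eta + rng.normal(0, 1.0, n)          # precise measurement of eta
g2 = eta + rng.normal(0, 3.0, n)          # noisy measurement of eta

def gmm(W):
    """minimize gbar(e)' W gbar(e) for the linear moments (g1 - e, g2 - e)."""
    G = np.array([-1.0, -1.0])            # Jacobian (Gamma) of the moments in e
    gbar0 = np.array([g1.mean(), g2.mean()])       # moments evaluated at e = 0
    # gbar(e) = gbar0 + G e  =>  e = -(G'WG)^{-1} G'W gbar0  (closed form)
    return -gbar0 @ W @ G / (G @ W @ G)

eta_id = gmm(np.eye(2))                   # identity weighting: simple average
S = np.cov(np.vstack([g1 - eta_id, g2 - eta_id]))  # estimated moment covariance V
eta_eff = gmm(np.linalg.inv(S))           # optimal weighting, A ~ Gamma' V^{-1}
```

The efficient estimate is the precision-weighted average of the two measurements, the scalar analogue of choosing $A = \Gamma' V^{-1}$ in the proof; both estimates are consistent, but the optimally weighted one has the smaller asymptotic variance.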