Mode Treatment Effect
Neng-Chieh Chang*

Abstract
Mean, median, and mode are three essential measures of the centrality of probability distributions. In program evaluation, the average treatment effect (mean) and the quantile treatment effect (median) have been studied intensively in the past decades. The mode treatment effect, however, has long been neglected in program evaluation. This paper fills the gap by discussing both the estimation and the inference of the mode treatment effect. I propose both traditional kernel and machine learning (ML) methods to estimate the mode treatment effect. I also derive the asymptotic properties of the proposed estimators and find that both estimators are asymptotically normal, but with a rate of convergence slower than the regular rate $\sqrt{N}$, which differs from the rates of the classical average and quantile treatment effect estimators.

* Department of Economics, University of California Los Angeles, 315 Portola Plaza, Los Angeles, CA 90095, USA. Email: [email protected]

1 Introduction

The effects of policies on the distribution of outcomes have long been of central interest in many areas of empirical economics. A policy maker might be interested in the difference between the distribution of the outcome under treatment and the distribution of the outcome in the absence of treatment. Empirical studies of distributional effects include, but are not limited to, Freeman (1980), Card (1996), DiNardo, Fortin, & Lemieux (1995), and Bitler, Gelbach, & Hoynes (2006). Most studies use the difference of the averages or quantiles of the treated and untreated distributions, known as the average treatment effect and the quantile treatment effect, as a summary of the effect of treatment on the distribution. The mode of a distribution, which is also an important summary statistic of the data, has long been ignored in this literature. This paper fills the gap by studying the mode treatment effect: the difference of the modes of the treated and untreated distributions. Compared to the average and the quantile treatment effect, the mode treatment effect has two advantages: (1) the mode captures the most probable value of the distribution under treatment and in the absence of treatment, and it provides a better summary of centrality than the average or a quantile when the distributions are highly skewed; (2) the mode is robust to heavy-tailed distributions in which outliers do not follow the same behavior as the majority of the sample. In economic studies, skewed and heavy-tailed distributions arise especially often when the outcome of interest is income or wage.
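As a concrete illustration (a standard fact about the log-normal distribution, added here only for exposition, not taken from the paper): if $\log Y \sim N(\mu, \sigma^2)$, then
\[
\mathrm{mode}(Y) = e^{\mu - \sigma^2} \;<\; \mathrm{median}(Y) = e^{\mu} \;<\; E[Y] = e^{\mu + \sigma^2/2},
\]
so for a right-skewed wage distribution the mean is pulled into the upper tail by a few large observations, while the mode tracks the most common wage.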
The proposed estimators converge at the rate $\sqrt{Nh^3}$, where $N$ is the sample size and $h$ is the bandwidth of the chosen kernel, which is slower than the traditional rate of convergence $\sqrt{N}$ in the estimation of means and quantiles. In fact, this rate matches the intuition: while the estimators of the mean and of quantiles are weighted averages of all available observations, only the small portion of observations near the mode provides information for the estimator of the mode. This explains the slower rate of convergence of the proposed estimators.

This paper contributes to the program evaluation literature, which includes the studies of the average treatment effect: Rosenbaum & Rubin (1983), Heckman & Robb (1985), Heckman, Ichimura, & Todd (1997), Hahn (1998), and Hirano, Imbens, & Ridder (2003); the studies of the quantile treatment effect: Abadie, Angrist, & Imbens (2002), Chernozhukov & Hansen (2005), and Firpo (2007); the studies of mode estimation and mode regression: Parzen (1962), Eddy (1980), Lee (1989), Yao & Li (2014), and Chen, Genovese, Tibshirani, & Wasserman (2016); as well as the causal inference of ML methods: Belloni et al. (2012), Belloni et al. (2014), Chernozhukov et al. (2015), Belloni et al. (2017), Chernozhukov et al. (2018), and Athey et al. (2019). This paper is also closely related to the robustness of average treatment effect estimation discussed in Robins & Rotnitzky (1995) and the general discussion in Chernozhukov, Escanciano, Ichimura, & Newey (2016). The asymptotic properties of the robust estimators discussed in those papers remain unaffected if only one of the first-step estimators based on classical nonparametric methods is inconsistent.

Plan of the paper.
Section 2 sets up the notation and framework for the discussion of the mode treatment effect. Section 3 discusses the kernel method and derives the asymptotic properties of the resulting estimator. Section 4 presents the ML estimator for density estimation and the corresponding Neyman-orthogonal score; I combine the Neyman-orthogonal score with the cross-fitting algorithm to propose the ML estimator of the mode treatment effect and derive its asymptotic properties. Section 5 concludes.
2 Framework

Let $Y$ be a continuous outcome variable of interest, $D$ the binary treatment indicator, and $X$ a $d \times 1$ vector of control variables. Denote by $Y_1$ an individual's potential outcome when $D = 1$ and by $Y_0$ the potential outcome when $D = 0$. Let $f_{Y_1}(y)$ and $f_{Y_0}(y)$ be the marginal probability density functions (p.d.f.) of $Y_1$ and $Y_0$, respectively. The modes of $Y_1$ and $Y_0$ are the values that appear with the highest probability. That is,
\[
\theta_1^* \equiv \arg\max_{y \in \mathcal{Y}_1} f_{Y_1}(y) \quad \text{and} \quad \theta_0^* \equiv \arg\max_{y \in \mathcal{Y}_0} f_{Y_0}(y),
\]
where $\mathcal{Y}_1$ and $\mathcal{Y}_0$ are the supports of $Y_1$ and $Y_0$. Here I assume that $\theta_1^*$ and $\theta_0^*$ are unique, meaning that both $Y_1$ and $Y_0$ are unimodal. I also assume that the modes $\theta_1^*$ and $\theta_0^*$ are in the interior of the common support of $Y_1$ and $Y_0$. These conditions are formally stated in the following assumption:

Assumption 1 (Uni-mode)
• For all $\varepsilon > 0$, $\sup_{y: |y - \theta_1^*| > \varepsilon} f_{Y_1}(y) < f_{Y_1}(\theta_1^*)$ for $y \in \mathcal{Y}_1$, and $\sup_{y: |y - \theta_0^*| > \varepsilon} f_{Y_0}(y) < f_{Y_0}(\theta_0^*)$ for $y \in \mathcal{Y}_0$.
• $\theta_1^*, \theta_0^* \in \mathrm{Int}(\mathcal{Y}_1 \cap \mathcal{Y}_0)$.

Assumption 1 has been widely adopted in many studies (Parzen, 1962; Eddy, 1980; Lee, 1989; Yao & Li, 2014). Under Assumption 1, the mode treatment effect is uniquely defined as $\Delta^* \equiv \theta_1^* - \theta_0^*$. The following states the strong ignorability assumption (Rosenbaum & Rubin, 1983):

Assumption 2 (Strong ignorability)
• $(Y_1, Y_0) \perp D \mid X$.
• $0 < P(D = 1 \mid X) < 1$.

The first part of Assumption 2 assumes that the potential outcomes are independent of treatment after conditioning on the observable covariates $X$. The second part states that, for all values of $X$, both treatment statuses occur with positive probability. Under strong ignorability, both $f_{Y_1}$ and $f_{Y_0}$ can be identified from the observable variables $(Y, D, X)$, since
\[
f_{Y|D=1,X}(y|x) = f_{Y_1|D=1,X}(y|x) = f_{Y_1|X}(y|x),
\]
and thus
\[
f_{Y_1}(y) = E\big[f_{Y_1|X}(y|X)\big] = E\big[f_{Y|D=1,X}(y|X)\big]. \tag{2.1}
\]
Similarly, we have
\[
f_{Y_0}(y) = E\big[f_{Y|D=0,X}(y|X)\big]. \tag{2.2}
\]
Equations (2.1) and (2.2) show the identification of the density functions $f_{Y_1}$ and $f_{Y_0}$. It is then straightforward to identify their modes $\theta_1^*$ and $\theta_0^*$:
\[
\theta_1^* = \arg\max_{y \in \mathcal{Y}_1} E\big[f_{Y|D=1,X}(y|X)\big] \quad \text{and} \quad \theta_0^* = \arg\max_{y \in \mathcal{Y}_0} E\big[f_{Y|D=0,X}(y|X)\big]. \tag{2.3}
\]
If both $f_{Y|D=1,X}(y|x)$ and $f_{Y|D=0,X}(y|x)$ are differentiable with respect to $y$, we can further characterize the modes through the first-order conditions under Assumption 1:
\[
E\big[f^{(1)}_{Y|D=1,X}(\theta_1^*|X)\big] = 0 \quad \text{and} \quad E\big[f^{(1)}_{Y|D=0,X}(\theta_0^*|X)\big] = 0, \tag{2.4}
\]
where $m^{(s)}(y, x) \equiv \partial^s m(y, x)/\partial y^s$ denotes the $s$-th partial derivative with respect to $y$. Equations (2.1)-(2.4) provide a direct way to estimate the modes $\theta_1^*$ and $\theta_0^*$: we estimate the density functions $f_{Y_1}(y)$ and $f_{Y_0}(y)$ in a first step and use the maximizers of the estimated density functions as the estimators of the modes. Sections 3 and 4 present the kernel and ML estimation methods, respectively.
3 Kernel Estimation

In this section, I propose kernel estimators for $\theta_1^*$, $\theta_0^*$, and the mode treatment effect $\Delta^* = \theta_1^* - \theta_0^*$. Let $K(\cdot)$ be a kernel function with bandwidth $h$. Define the estimators of the density functions $f_{Y_1}(y)$ and $f_{Y_0}(y)$ as
\[
\hat f_{Y_1}(y) = \frac{1}{n}\sum_{i=1}^n \hat f_{Y|D=1,X}(y|X_i), \qquad \hat f_{Y_0}(y) = \frac{1}{n}\sum_{i=1}^n \hat f_{Y|D=0,X}(y|X_i),
\]
where
\[
\hat f_{Y|D=1,X}(y|x) = \frac{\sum_{j=1}^n D_j K_h(y - Y_j) K_h(x - X_j)}{\sum_{j=1}^n D_j K_h(x - X_j)}, \qquad \hat f_{Y|D=0,X}(y|x) = \frac{\sum_{j=1}^n (1 - D_j) K_h(y - Y_j) K_h(x - X_j)}{\sum_{j=1}^n (1 - D_j) K_h(x - X_j)},
\]
with
\[
K_h(y - Y_j) = h^{-1} K\Big(\frac{y - Y_j}{h}\Big) \quad \text{and} \quad K_h(x - X_j) = h^{-d} K\Big(\frac{x_1 - X_{j1}}{h}\Big) \times \cdots \times K\Big(\frac{x_d - X_{jd}}{h}\Big).
\]
It is then straightforward to define the estimators of the modes $\theta_1^*$ and $\theta_0^*$:
\[
\hat\theta_1 \equiv \arg\max_y \hat f_{Y_1}(y), \qquad \hat\theta_0 \equiv \arg\max_y \hat f_{Y_0}(y).
\]
The estimator of the mode treatment effect $\Delta^*$ is $\hat\Delta \equiv \hat\theta_1 - \hat\theta_0$. Throughout the paper, I impose the following conditions on the kernel $K(\cdot)$:

Assumption 3
• $|K(u)| \le \bar K < \infty$.
• $\int K(u)\,du = 1$, $\int u K(u)\,du = 0$, and $\int u^2 K(u)\,du < \infty$.
• $K(u)$ is differentiable.

The first part of Assumption 3 requires that $K(u)$ is bounded. Although the second part implies that $K(u)$ is a first-order kernel, the arguments in this paper can easily be extended to higher-order kernels; the first-order kernel is assumed only for simplicity. The third part imposes enough smoothness on $K(u)$.
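To fix ideas, here is a minimal numpy sketch of the estimators above (my own illustration, not code from the paper); it assumes a Gaussian kernel, a scalar control variable, a single common bandwidth $h$, and a grid search for the maximizer:

```python
import numpy as np

def gauss(u):
    """Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def marginal_density(grid, Y, X, D, h, treated=True):
    """f_hat_{Y_d}(y) = (1/n) sum_i f_hat_{Y|D=d,X}(y | X_i) on a grid of y values."""
    w = D if treated else 1.0 - D
    Kx = gauss((X[:, None] - X[None, :]) / h) / h      # Kx[i, j] = K_h(X_i - X_j)
    denom = (Kx * w).sum(axis=1)                        # sum_j w_j K_h(X_i - X_j)
    Ky = gauss((grid[:, None] - Y[None, :]) / h) / h    # Ky[g, j] = K_h(y_g - Y_j)
    num = (Ky * w) @ Kx.T                               # sum_j w_j K_h(y_g - Y_j) K_h(X_i - X_j)
    return (num / denom).mean(axis=1)

# simulated example: treatment shifts the outcome distribution
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
D = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X))).astype(float)
Y = 1.0 * D + 0.5 * X + rng.normal(size=n)
grid = np.linspace(Y.min(), Y.max(), 200)
h = 1.06 * Y.std() * n ** (-0.2)                        # rule-of-thumb bandwidth

f1 = marginal_density(grid, Y, X, D, h, treated=True)
f0 = marginal_density(grid, Y, X, D, h, treated=False)
theta1_hat, theta0_hat = grid[f1.argmax()], grid[f0.argmax()]
delta_hat = theta1_hat - theta0_hat                     # mode treatment effect
```

In practice the bandwidths for $Y$ and $X$ need not coincide, and the rule-of-thumb choice here is only a placeholder.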
Theorem 1 (Consistency). Suppose Assumptions 1-3 hold. Assume that the density functions $f_{Y|D=1,X}(y|x)$ and $f_{Y|D=0,X}(y|x)$ are (i) continuous in $y$, (ii) bounded by some function $d(x)$ with $E[d(X)] < \infty$ for all $y \in \mathcal{Y}$, and (iii) defined for $y \in \mathcal{Y}$ and $x \in \mathcal{X}$ with compact $\mathcal{Y}$ and $\mathcal{X}$. Assume also that the density functions $f_{X|D=1}(x)$ and $f_{X|D=0}(x)$ are bounded away from zero. If $n \to \infty$, $h \to 0$, and $\ln n\,\big(nh^{d+1}\big)^{-1} \to 0$, then $\hat\theta_1 \xrightarrow{p} \theta_1^*$ and $\hat\theta_0 \xrightarrow{p} \theta_0^*$.
Theorem 2 (Asymptotic Normality). Suppose that the assumptions of Theorem 1 hold. Assume that $f^{(2)}_{Y|X,D=1}(y|x)$ and $f^{(2)}_{Y|X,D=0}(y|x)$ are continuous at $y = \theta_1^*$ and $y = \theta_0^*$ for all $x$, respectively. If $n \to \infty$, $h \to 0$, $\sqrt{nh^3}\,(\ln n)\big(nh^{d+3}\big)^{-1} \to 0$, $(\ln n)\big(nh^{d+5}\big)^{-1} \to 0$, and $\sqrt{nh^3}\,h^2 \to 0$, then
\[
\sqrt{nh^3}\,\big(\hat\theta_1 - \theta_1^*\big) \xrightarrow{d} N\big(0,\, M_1^{-1}V_1M_1^{-1}\big) \quad\text{and}\quad \sqrt{nh^3}\,\big(\hat\theta_0 - \theta_0^*\big) \xrightarrow{d} N\big(0,\, M_0^{-1}V_0M_0^{-1}\big),
\]
where
\[
M_1 \equiv E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big], \qquad M_0 \equiv E\big[f^{(2)}_{Y|X,D=0}(\theta_0^*|X)\big],
\]
\[
V_1 = \kappa_0^{(1)}\,E\bigg[\frac{f_{Y|X,D=1}(\theta_1^*|X)}{P(D=1|X)}\bigg], \qquad V_0 = \kappa_0^{(1)}\,E\bigg[\frac{f_{Y|X,D=0}(\theta_0^*|X)}{P(D=0|X)}\bigg],
\]
and $\kappa_0^{(1)} = \int K^{(1)}(u)^2\,du$. Further, we have
\[
\sqrt{nh^3}\,\big(\hat\Delta - \Delta^*\big) \xrightarrow{d} N\big(0,\; M_1^{-1}V_1M_1^{-1} + M_0^{-1}V_0M_0^{-1}\big).
\]

Theorems 1 and 2 establish the asymptotic properties of the estimator of the mode treatment effect. The proposed estimators are asymptotically normal, but with a rate of convergence slower than the regular rate $\sqrt{N}$. The intuition is that, unlike the estimation of the average and the quantile treatment effect, the estimation of modes uses only the small portion of observations that lie near the modes; this limited use of the data makes the rate of convergence slower than the regular rate $\sqrt{N}$.

To estimate the asymptotic variances, define $\pi(X) \equiv P(D=1|X)$ to be the propensity score. The consistent variance estimators are
\[
\hat M_1 = \frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}\big(\hat\theta_1|X_i\big), \qquad \hat M_0 = \frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=0}\big(\hat\theta_0|X_i\big),
\]
\[
\hat V_1 = \frac{\kappa_0^{(1)}}{n}\sum_{i=1}^n \frac{\hat f_{Y|X,D=1}\big(\hat\theta_1|X_i\big)}{\hat\pi(X_i)}, \qquad \hat V_0 = \frac{\kappa_0^{(1)}}{n}\sum_{i=1}^n \frac{\hat f_{Y|X,D=0}\big(\hat\theta_0|X_i\big)}{1 - \hat\pi(X_i)}.
\]

Theorem 3 (Variance Estimation). Suppose that the assumptions of Theorem 2 hold. Let $\hat\pi(x)$ be a uniformly consistent estimator of $\pi(x)$. If $n \to \infty$, $h \to 0$, and $\ln n\,\big(nh^{d+5}\big)^{-1} \to 0$, then $\hat M_1 \xrightarrow{p} M_1$, $\hat M_0 \xrightarrow{p} M_0$, $\hat V_1 \xrightarrow{p} V_1$, and $\hat V_0 \xrightarrow{p} V_0$. Thus we have $\hat M_1^{-1}\hat V_1\hat M_1^{-1} \xrightarrow{p} M_1^{-1}V_1M_1^{-1}$ and $\hat M_0^{-1}\hat V_0\hat M_0^{-1} \xrightarrow{p} M_0^{-1}V_0M_0^{-1}$.
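As an illustration of how Theorems 2 and 3 can be used in practice, the sketch below (my own code, continuing the earlier numpy snippet) forms a 95% confidence interval for $\hat\Delta$; it assumes the Gaussian kernel, for which $\kappa_0^{(1)} = \int K^{(1)}(u)^2\,du = 1/(4\sqrt{\pi})$, and approximates $\hat M$ by a finite difference of the estimated marginal density:

```python
kappa = 1.0 / (4.0 * np.sqrt(np.pi))       # int K'(u)^2 du for the Gaussian kernel

def cond_dens(y, Y, X, D, h, treated=True):
    """Vector of f_hat_{Y|D=d,X}(y | X_i) over i = 1..n."""
    w = D if treated else 1.0 - D
    Kx = gauss((X[:, None] - X[None, :]) / h) / h
    Ky = gauss((y - Y) / h) / h
    return (Kx * (w * Ky)).sum(axis=1) / (Kx * w).sum(axis=1)

def prop_score(X, D, h):
    """Kernel (Nadaraya-Watson) estimate of pi(X_i) = P(D = 1 | X_i)."""
    Kx = gauss((X[:, None] - X[None, :]) / h) / h
    return (Kx * D).sum(axis=1) / Kx.sum(axis=1)

def avar(theta, Y, X, D, h, treated=True, eps=1e-2):
    """Plug-in estimate of M^{-1} V M^{-1} from Theorem 3."""
    f = lambda y: cond_dens(y, Y, X, D, h, treated).mean()   # f_hat_{Y_d}(y)
    M = (f(theta + eps) - 2.0 * f(theta) + f(theta - eps)) / eps**2
    p = prop_score(X, D, h)
    p = p if treated else 1.0 - p
    V = kappa * np.mean(cond_dens(theta, Y, X, D, h, treated) / p)
    return V / M**2

se = np.sqrt((avar(theta1_hat, Y, X, D, h, True)
              + avar(theta0_hat, Y, X, D, h, False)) / (n * h**3))
ci = (delta_hat - 1.96 * se, delta_hat + 1.96 * se)      # 95% CI for Delta
```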
4 Machine Learning Estimation

In this section, I propose the ML estimator of the mode treatment effect. The ML estimator can accommodate a large number of control variables, potentially more than the sample size. This flexibility enables researchers to include as many control variables as they consider important, making their identification assumptions more plausible. The key to implementing ML methods is to replace the estimation of the conditional density function with the estimation of a conditional expectation. To begin, the estimator of the conditional density function in the traditional kernel estimation is
\[
\hat f_{Y|D=1,X}(y|x) = \frac{\sum_{j=1}^n D_j K_h(y - Y_j) K_h(x - X_j)}{\sum_{j=1}^n D_j K_h(x - X_j)}.
\]
Dividing both the numerator and the denominator by $\sum_{j=1}^n K_h(x - X_j)$ gives
\[
\hat f_{Y|D=1,X}(y|x) = \frac{\sum_{j=1}^n D_j K_h(y - Y_j) K_h(x - X_j)\big/\sum_{j=1}^n K_h(x - X_j)}{\sum_{j=1}^n D_j K_h(x - X_j)\big/\sum_{j=1}^n K_h(x - X_j)}.
\]
The numerator is a kernel estimator of $E[D K_h(y - Y)|X]$ and the denominator is a kernel estimator of the propensity score $E[D|X] = \pi(X)$. Hence, $\hat f_{Y|D=1,X}(y|x)$ is an estimator of $E[D K_h(y - Y)|X]/\pi(X)$. The marginal density estimator
\[
\hat f_{Y_1}(y) = \frac{1}{n}\sum_{i=1}^n \hat f_{Y|D=1,X}(y|X_i)
\]
defined in the previous section can therefore be interpreted as an estimator of
\[
E\bigg[\frac{E[D K_h(y - Y)|X]}{\pi(X)}\bigg] = E\bigg[\frac{D K_h(y - Y)}{\pi(X)}\bigg].
\]
Therefore, we can use a machine learning estimator of $E\big[D K_h(y - Y)/\pi(X)\big]$ as an estimator of $f_{Y_1}(y)$. We have translated the estimation of the conditional density function into the estimation of a conditional expectation, namely the propensity score $\pi(X)$. We go one step further and construct the Neyman-orthogonal score (Chernozhukov et al., 2018) to gain robustness to the first-step estimation:
\[
m_1(Z, y, \eta_1) = \frac{D K_h(y - Y)}{\pi(X)} - \frac{D - \pi(X)}{\pi(X)}\,E\big[K_h(y - Y)\,\big|\,X, D=1\big], \tag{4.1}
\]
where $Z = (Y, D, X)$ and $\eta_1 = (\pi, g_1)$ with $g_1(X) \equiv E[K_h(y - Y)|X, D=1]$. Similarly, the Neyman-orthogonal score for $f_{Y_0}(y)$ is
\[
m_0(Z, y, \eta_0) = \frac{(1 - D) K_h(y - Y)}{1 - \pi(X)} - \frac{\pi(X) - D}{1 - \pi(X)}\,E\big[K_h(y - Y)\,\big|\,X, D=0\big], \tag{4.2}
\]
where $\eta_0 = (\pi, g_0)$ with $g_0(X) \equiv E[K_h(y - Y)|X, D=0]$. Equations (4.1) and (4.2) are, to the best of my knowledge, new results for density estimation. The Neyman orthogonality makes the estimation of the density functions robust to the first-step estimation: at the true nuisance functions, $E[m_1(Z, y, \eta_1)] = E\big[E[K_h(y - Y)|X, D=1]\big]$, a kernel-smoothed version of $f_{Y_1}(y)$, and small perturbations of $\pi$ and $g_1$ have no first-order effect on this expectation.
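As a small sketch (my own illustration), the two scores can be written as plain functions; `pi_hat`, `g1_hat`, and `g0_hat` stand for fitted nuisance functions and are assumptions of this sketch, not objects defined in the paper:

```python
def score_m1(y, Yi, Di, Xi, pi_hat, g1_hat, h):
    """Neyman-orthogonal score (4.1) for f_{Y_1}(y) at one observation Z_i."""
    Kh = gauss((y - Yi) / h) / h          # K_h(y - Y_i)
    p = pi_hat(Xi)
    return Di * Kh / p - (Di - p) / p * g1_hat(Xi, y)

def score_m0(y, Yi, Di, Xi, pi_hat, g0_hat, h):
    """Neyman-orthogonal score (4.2) for f_{Y_0}(y)."""
    Kh = gauss((y - Yi) / h) / h
    p = pi_hat(Xi)
    return (1.0 - Di) * Kh / (1.0 - p) - (p - Di) / (1.0 - p) * g0_hat(Xi, y)
```

Averaging `score_m1` over a sample with consistent nuisance estimates gives an estimate of (a smoothed) $f_{Y_1}(y)$ whose first-order bias is insensitive to small errors in $\hat\pi$ and $\hat g_1$.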
Now I combine (4.1) and (4.2) with the cross-fitting algorithm of Chernozhukov et al. (2018) to propose the new estimator; a code sketch follows the algorithm below.

Definition (cross-fitted estimator of the mode treatment effect).
1. Take a $K$-fold random partition $(I_k)_{k=1}^K$ of $[N] = \{1, \dots, N\}$ such that the size of each $I_k$ is $n = N/K$. For each $k \in [K] = \{1, \dots, K\}$, define the auxiliary sample $I_k^c \equiv \{1, \dots, N\}\setminus I_k$.

2. For each $k \in [K]$, use the auxiliary sample $I_k^c$ to construct machine learning estimators $\hat\pi_k(x)$, $\hat g_{1,k}(x)$, and $\hat g_{0,k}(x)$ of $\pi(x)$, $g_1(x)$, and $g_0(x)$.

3. Construct the estimators of $f_{Y_1}(y)$ and $f_{Y_0}(y)$:
\[
\hat f_{Y_1}(y) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1(Z, y, \hat\eta_{1,k})\big] \quad\text{and}\quad \hat f_{Y_0}(y) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_0(Z, y, \hat\eta_{0,k})\big],
\]
where $\hat\eta_{d,k} = (\hat\pi_k, \hat g_{d,k})$ and $\mathbb{E}_{n,k}[m(Z)] = n^{-1}\sum_{i\in I_k} m(Z_i)$.

4. Construct the estimators of $\theta_1^*$ and $\theta_0^*$:
\[
\hat\theta_1 = \arg\max_y \hat f_{Y_1}(y) \quad\text{and}\quad \hat\theta_0 = \arg\max_y \hat f_{Y_0}(y).
\]
5. Construct the estimator of the mode treatment effect: $\hat\Delta = \hat\theta_1 - \hat\theta_0$.
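A compact sketch of the whole algorithm (my own illustration under the setup of the earlier snippets; scikit-learn's LogisticRegression and RandomForestRegressor are arbitrary stand-ins for the ML learners of $\pi$ and $g_d$):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def dml_density(grid, Y, X, D, h, treated=True, n_folds=5, seed=0):
    """Cross-fitted orthogonal estimate of f_{Y_d}(y) over a grid (steps 1-3)."""
    X2 = X.reshape(len(Y), -1)
    fhat = np.zeros(len(grid))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X2):
        pi = LogisticRegression().fit(X2[train], D[train])       # step 2: pi_hat_k
        p = np.clip(pi.predict_proba(X2[test])[:, 1], 0.01, 0.99)
        sub = train[D[train] == (1.0 if treated else 0.0)]
        for g, y in enumerate(grid):
            Kh = gauss((y - Y) / h) / h                          # K_h(y - Y_i)
            reg = RandomForestRegressor(n_estimators=50, random_state=seed)
            reg.fit(X2[sub], Kh[sub])                            # g_d(x) = E[K_h(y-Y)|X, D=d]
            gx = reg.predict(X2[test])
            Dk, Kk = D[test], Kh[test]
            if treated:                                           # score (4.1)
                s = Dk * Kk / p - (Dk - p) / p * gx
            else:                                                 # score (4.2)
                s = (1.0 - Dk) * Kk / (1.0 - p) - (p - Dk) / (1.0 - p) * gx
            fhat[g] += s.sum()
    return fhat / len(Y)

# steps 4-5: maximize each density estimate, then difference the modes
f1_dml = dml_density(grid, Y, X, D, h, treated=True)
f0_dml = dml_density(grid, Y, X, D, h, treated=False)
delta_hat_dml = grid[f1_dml.argmax()] - grid[f0_dml.argmax()]
```

Refitting $g_d$ at every grid point is computationally naive but keeps the sketch faithful to the definition; in practice one would coarsen the grid or fit a single conditional-density learner.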
Theorem 4. Suppose that, with probability $1 - o(1)$, $\|\hat\eta_k - \eta_0\|_{P,2} \le \varepsilon_N$, $\|\hat\pi_k - 1/2\|_{P,\infty} \le 1/2 - \kappa$, and
\[
\|\hat\pi_k - \pi_0\|_{P,2}^2 + \|\hat\pi_k - \pi_0\|_{P,2}\times\|\hat g_k - g_0\|_{P,2} \le (\varepsilon_N)^2.
\]
If $\varepsilon_N = o\big((Nh^3)^{-1/4}\big)$ and $Nh^7 \to 0$, then
\[
\sqrt{Nh^3}\,\big(\hat\theta_1 - \theta_1^*\big) \xrightarrow{d} N\big(0,\, M_1^{-1}V_1M_1^{-1}\big) \quad\text{and}\quad \sqrt{Nh^3}\,\big(\hat\theta_0 - \theta_0^*\big) \xrightarrow{d} N\big(0,\, M_0^{-1}V_0M_0^{-1}\big).
\]

As for the variance estimation, recall that the kernel estimator of $M_1$ in the previous section is $\hat M_1 = N^{-1}\sum_{i=1}^N \hat f^{(2)}_{Y|D=1,X}(\hat\theta_1|X_i)$, where
\[
\hat f^{(2)}_{Y|D=1,X}(y|x) = \frac{\sum_{j=1}^n D_j K^{(2)}_h(y - Y_j) K_h(x - X_j)}{\sum_{j=1}^n D_j K_h(x - X_j)}.
\]
Dividing both the numerator and the denominator by $\sum_{j=1}^n K_h(x - X_j)$, the numerator becomes a kernel estimator of $E[D K^{(2)}_h(y - Y)|X]$ and the denominator a kernel estimator of the propensity score $E[D|X] = \pi(X)$. Hence $\hat f^{(2)}_{Y|D=1,X}(y|x)$ is an estimator of $E[D K^{(2)}_h(y - Y)|X]/\pi(X)$, and we can use a machine learning estimator of $E\big[D K^{(2)}_h(y - Y)/\pi(X)\big]$ as an estimator of $M_1$. We can also construct a DML estimator using the Neyman-orthogonal functional form
\[
\frac{D K^{(2)}_h(y - Y)}{\pi(X)} - \frac{D - \pi(X)}{\pi(X)}\,E\big[K^{(2)}_h(y - Y)\,\big|\,X, D=1\big].
\]
In step 1, we use machine learning methods to estimate $\pi(X)$ and $E[K^{(2)}_h(\hat\theta_1 - Y)|X, D=1]$ on the auxiliary sample $I_k^c$. In step 2, we construct the DML estimator of $M_1$:
\[
\hat M_1 = \frac{1}{K}\sum_{k=1}^K \frac{1}{n}\sum_{i\in I_k}\Bigg[\frac{D_i K^{(2)}_h(\hat\theta_1 - Y_i)}{\hat\pi(X_i)} - \frac{D_i - \hat\pi(X_i)}{\hat\pi(X_i)}\,\hat E\big[K^{(2)}_h(\hat\theta_1 - Y)\,\big|\,X_i, D=1\big]\Bigg].
\]
By the general DML theory (Chernozhukov et al., 2018), $\hat M_1$ is a consistent estimator of $M_1$. Similarly, we can construct the DML estimators for $V_1$, $M_0$, and $V_0$ using the following correspondences:

Original form and equivalent (orthogonal) form:
• $M_1 = E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big]$: use $E\Big[\frac{D K^{(2)}_h(\theta_1^* - Y)}{\pi(X)} - \frac{D - \pi(X)}{\pi(X)}\,E\big[K^{(2)}_h(\theta_1^* - Y)|X, D=1\big]\Big]$.
• $V_1 = \kappa_0^{(1)}\,E\Big[\frac{f_{Y|X,D=1}(\theta_1^*|X)}{P(D=1|X)}\Big]$: use $\kappa_0^{(1)}\,E\Big[\frac{D K_h(\theta_1^* - Y)}{\pi(X)^2} - \frac{D - \pi(X)}{\pi(X)^2}\,E\big[K_h(\theta_1^* - Y)|X, D=1\big]\Big]$.
• $M_0 = E\big[f^{(2)}_{Y|X,D=0}(\theta_0^*|X)\big]$: use $E\Big[\frac{(1-D) K^{(2)}_h(\theta_0^* - Y)}{1 - \pi(X)} - \frac{\pi(X) - D}{1 - \pi(X)}\,E\big[K^{(2)}_h(\theta_0^* - Y)|X, D=0\big]\Big]$.
• $V_0 = \kappa_0^{(1)}\,E\Big[\frac{f_{Y|X,D=0}(\theta_0^*|X)}{P(D=0|X)}\Big]$: use $\kappa_0^{(1)}\,E\Big[\frac{(1-D) K_h(\theta_0^* - Y)}{(1 - \pi(X))^2} - \frac{\pi(X) - D}{(1 - \pi(X))^2}\,E\big[K_h(\theta_0^* - Y)|X, D=0\big]\Big]$.
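Mirroring the first row of the list above, here is a hedged sketch of the cross-fitted $\hat M_1$ (my own code, continuing the previous snippets; for the Gaussian kernel, $K^{(2)}(u) = (u^2 - 1)K(u)$):

```python
def gauss2(u):
    """Second derivative of the Gaussian kernel: K''(u) = (u^2 - 1) K(u)."""
    return (u**2 - 1.0) * gauss(u)

def dml_M1(theta1, Y, X, D, h, n_folds=5, seed=0):
    """Cross-fitted orthogonal estimate of M_1 = E[f^(2)_{Y|X,D=1}(theta1 | X)]."""
    X2 = X.reshape(len(Y), -1)
    K2h = gauss2((theta1 - Y) / h) / h**3                 # K^(2)_h(theta1 - Y_i)
    total = 0.0
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X2):
        pi = LogisticRegression().fit(X2[train], D[train])
        p = np.clip(pi.predict_proba(X2[test])[:, 1], 0.01, 0.99)
        t1 = train[D[train] == 1.0]
        reg = RandomForestRegressor(n_estimators=50, random_state=seed)
        reg.fit(X2[t1], K2h[t1])                          # E[K^(2)_h(theta1 - Y) | X, D = 1]
        gx = reg.predict(X2[test])
        Dk = D[test]
        total += (Dk * K2h[test] / p - (Dk - p) / p * gx).sum()
    return total / len(Y)

M1_hat = dml_M1(grid[f1_dml.argmax()], Y, X, D, h)
```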
5 Conclusion

This paper studies the estimation and inference of the mode treatment effect, which, compared with the average and the quantile treatment effect, has been ignored in the treatment effect literature. I propose both kernel and ML estimators to accommodate the variety of data sets faced by researchers, and I derive the asymptotic properties of the proposed estimators. I show that both estimators are consistent and asymptotically normal, with rate of convergence $\sqrt{Nh^3}$.

References
Abadie, A., Angrist, J., & Imbens, G. (2002). Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70(1), 91–117.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148–1178.
Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6), 2369–2429.
Belloni, A., Chernozhukov, V., Fernández-Val, I., & Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1), 233–298.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2), 608–650.
Bitler, M. P., Gelbach, J. B., & Hoynes, H. W. (2006). What mean impacts miss: Distributional effects of welfare reform experiments. American Economic Review, 96(4), 988–1012.
Card, D. (1996). The effect of unions on the structure of wages: A longitudinal analysis. Econometrica, 64(4), 957–979.
Chen, Y.-C., Genovese, C. R., Tibshirani, R. J., & Wasserman, L. (2016). Nonparametric modal regression. The Annals of Statistics, 44(2), 489–514.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
Chernozhukov, V., Escanciano, J. C., Ichimura, H., & Newey, W. K. (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.
Chernozhukov, V., & Hansen, C. (2005). An IV model of quantile treatment effects. Econometrica, 73(1), 245–261.
Chernozhukov, V., Hansen, C., & Spindler, M. (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics, 7(1), 649–688.
DiNardo, J., Fortin, N. M., & Lemieux, T. (1995). Labor market institutions and the distribution of wages, 1973-1992: A semiparametric approach (Tech. Rep.). National Bureau of Economic Research.
Eddy, W. F. (1980). Optimum kernel estimators of the mode. The Annals of Statistics, 8(4), 870–882.
Firpo, S. (2007). Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75(1), 259–276.
Freeman, R. B. (1980). Unionism and the dispersion of wages. ILR Review, 34(1), 3–23.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2), 315–331.
Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24(3), 726–748.
Heckman, J., Ichimura, H., & Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. The Review of Economic Studies, 64(4), 605–654.
Heckman, J., & Robb, R. (1985). Alternative methods for evaluating the impact of interventions: An overview. Journal of Econometrics, 30(1-2), 239–267.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161–1189.
Lee, M.-J. (1989). Mode regression. Journal of Econometrics, 42(3), 337–349.
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3), 1065–1076.
Robins, J. M., & Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429), 122–129.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
Tauchen, G. (1985). Diagnostic testing and evaluation of maximum likelihood models. Journal of Econometrics, 30(1-2), 415–443.
Van der Vaart, A. W. (2000). Asymptotic statistics (Vol. 3). Cambridge University Press.
Yao, W., & Li, L. (2014). A new regression model: Modal linear regression. Scandinavian Journal of Statistics, 41(3), 656–671.

Appendix
Proof of Theorem 1.
We only present the proof of the first claim, $\hat\theta_1 \xrightarrow{p} \theta_1^*$, since the second claim follows from the same arguments. The proof proceeds in two steps. In Step 1, we show that the uniform law of large numbers holds:
\[
\sup_y \big|\hat f_{Y_1}(y) - f_{Y_1}(y)\big| = o_p(1).
\]
In Step 2, we establish the consistency $\hat\theta_1 \xrightarrow{p} \theta_1^*$ using the same argument as Theorem 5.7 in Van der Vaart (2000).

Step 1. Notice that we have the decomposition
\[
\hat f_{Y_1}(y) - f_{Y_1}(y) = \frac{1}{n}\sum_{i=1}^n \hat f_{Y|D=1,X}(y|X_i) - E\big[f_{Y|D=1,X}(y|X)\big] = A(y) + B(y),
\]
where
\[
A(y) = \frac{1}{n}\sum_{i=1}^n \Big(\hat f_{Y|D=1,X}(y|X_i) - f_{Y|D=1,X}(y|X_i)\Big), \qquad B(y) = \frac{1}{n}\sum_{i=1}^n f_{Y|D=1,X}(y|X_i) - E\big[f_{Y|D=1,X}(y|X)\big].
\]
Hence,
\[
\sup_y \big|\hat f_{Y_1}(y) - f_{Y_1}(y)\big| \le \sup_y |A(y)| + \sup_y |B(y)|.
\]
By Theorem 6 in Hansen (2008) (uniform rates of convergence of kernel estimators), the first term is bounded as
\[
\sup_y |A(y)| \le \sup_y \frac{1}{n}\sum_{i=1}^n \Big|\hat f_{Y|D=1,X}(y|X_i) - f_{Y|D=1,X}(y|X_i)\Big| \le \sup_{x,y}\Big|\hat f_{Y|D=1,X}(y|x) - f_{Y|D=1,X}(y|x)\Big| = O_p\Bigg(\sqrt{\frac{\ln n}{nh^{d+1}}} + h^2\Bigg) = o_p(1).
\]
On the other hand, by Lemma 1 of Tauchen (1985) (uniform law of large numbers), we have
\[
\sup_{y\in\mathcal{Y}} |B(y)| = \sup_{y\in\mathcal{Y}} \Bigg|\frac{1}{n}\sum_{i=1}^n f_{Y|D=1,X}(y|X_i) - E\big[f_{Y|D=1,X}(y|X)\big]\Bigg| \xrightarrow{p} 0.
\]
Combining the bounds on $\sup_y|A(y)|$ and $\sup_y|B(y)|$ gives $\sup_y |\hat f_{Y_1}(y) - f_{Y_1}(y)| = o_p(1)$.

Step 2. The definition of $\hat\theta_1$ implies that $\hat f_{Y_1}(\hat\theta_1) \ge \hat f_{Y_1}(\theta_1^*)$. Therefore, we have
\[
f_{Y_1}(\theta_1^*) - f_{Y_1}(\hat\theta_1) = f_{Y_1}(\theta_1^*) - \hat f_{Y_1}(\theta_1^*) + \hat f_{Y_1}(\theta_1^*) - f_{Y_1}(\hat\theta_1) \le f_{Y_1}(\theta_1^*) - \hat f_{Y_1}(\theta_1^*) + \hat f_{Y_1}(\hat\theta_1) - f_{Y_1}(\hat\theta_1) \le 2\sup_y \big|\hat f_{Y_1}(y) - f_{Y_1}(y)\big|.
\]
By Step 1, for any $\delta > 0$,
\[
P\Big(f_{Y_1}(\theta_1^*) - f_{Y_1}(\hat\theta_1) > \delta\Big) \le P\Big(\sup_y \big|\hat f_{Y_1}(y) - f_{Y_1}(y)\big| > \delta/2\Big) \to 0.
\]
Further, Assumption 1 implies that for any $\varepsilon > 0$ there exists $\delta > 0$ such that
\[
\sup_{y: |y - \theta_1^*| > \varepsilon} f_{Y_1}(y) < f_{Y_1}(\theta_1^*) - \delta.
\]
Then the following inequality holds:
\[
P\big(|\hat\theta_1 - \theta_1^*| > \varepsilon\big) \le P\Big(f_{Y_1}(\hat\theta_1) < f_{Y_1}(\theta_1^*) - \delta\Big) \le P\Big(f_{Y_1}(\theta_1^*) - f_{Y_1}(\hat\theta_1) > \delta\Big) \to 0.
\]
Thus, we prove the consistency $\hat\theta_1 \xrightarrow{p} \theta_1^*$.

Proof of Theorem 2.
Here we focus on the result for $\hat\theta_1$ only. Notice that the first-order condition for $\hat\theta_1$ gives
\[
0 = \hat f^{(1)}_{Y_1}(\hat\theta_1) = \frac{1}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\hat\theta_1|X_i) = \frac{1}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i)\,\big(\hat\theta_1 - \theta_1^*\big),
\]
where $\tilde\theta_1$ lies between $\hat\theta_1$ and $\theta_1^*$. Then we have
\[
\sqrt{nh^3}\,\big(\hat\theta_1 - \theta_1^*\big) = -\Bigg[\frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i)\Bigg]^{-1}\,\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i).
\]
In Step 1, we show that the first factor converges to $M_1 = E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big]$ in probability. In Steps 2-5, we show the asymptotic normality of the second term. Then, by Slutsky's theorem, the asymptotic normality of $\hat\theta_1$ follows. In Step 6, we show the asymptotic normality of $\hat\Delta$.

For convenience, we define $\gamma_1(x) \equiv f^{(1)}_{Y,X|D=1}(\theta_1^*, x)$, $\gamma_2(x) \equiv f_{X|D=1}(x)$, and
\[
\hat\gamma_1(x) \equiv \frac{1}{n}\sum_{j=1}^n \frac{D_j K^{(1)}_h(\theta_1^* - Y_j)K_h(x - X_j)}{P(D=1)}, \qquad \hat\gamma_2(x) \equiv \frac{1}{n}\sum_{j=1}^n \frac{D_j K_h(x - X_j)}{P(D=1)}.
\]
In this notation, we can express $\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|x)$ and $f^{(1)}_{Y|X,D=1}(\theta_1^*|x)$ as $\hat\gamma_1(x)/\hat\gamma_2(x)$ and $\gamma_1(x)/\gamma_2(x)$, respectively. Also, let $\gamma = (\gamma_1, \gamma_2)'$ and $\hat\gamma = (\hat\gamma_1, \hat\gamma_2)'$.

Step 1. In this step, we show that $n^{-1}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) \xrightarrow{p} E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big]$. Notice that
\[
\frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) = \frac{1}{n}\sum_{i=1}^n f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i) + A_1 + A_2,
\]
where
\[
A_1 = \frac{1}{n}\sum_{i=1}^n \Big(\hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i)\Big) \quad\text{and}\quad A_2 = \frac{1}{n}\sum_{i=1}^n \Big(f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\Big).
\]
Since $\frac{1}{n}\sum_{i=1}^n f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i) \xrightarrow{p} E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big]$ by the law of large numbers, we only have to show that $A_1 = o_p(1)$ and $A_2 = o_p(1)$. Note that
\[
|A_1| \le \frac{1}{n}\sum_{i=1}^n \Big|\hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i)\Big| \le \sup_{y,x}\Big|\hat f^{(2)}_{Y|X,D=1}(y|x) - f^{(2)}_{Y|X,D=1}(y|x)\Big| = O_p\Bigg(\sqrt{\frac{\ln n}{nh^{d+5}}} + h^2\Bigg) = o_p(1),
\]
where the last rate follows from the uniform rates of convergence of kernel estimators (Hansen, 2008). For $A_2$, we use the argument in Lemma 4.3 of Newey & McFadden (1994). By the consistency of $\hat\theta_1$, and thus of $\tilde\theta_1$, there is $\delta_n \to 0$ such that $|\tilde\theta_1 - \theta_1^*| \le \delta_n$ with probability approaching one. Define
\[
\Delta_n(X_i) = \sup_{|y - \theta_1^*| \le \delta_n}\Big|f^{(2)}_{Y|X,D=1}(y|X_i) - f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\Big|.
\]
By the continuity of $f^{(2)}_{Y|X,D=1}(y|X_i)$ at $\theta_1^*$, $\Delta_n(X_i) \xrightarrow{p} 0$. Hence, by the dominated convergence theorem, $E[\Delta_n(X_i)] \to 0$. Then, by Markov's inequality,
\[
P\Bigg(\frac{1}{n}\sum_{i=1}^n \Delta_n(X_i) > \epsilon\Bigg) \le E[\Delta_n(X_i)]/\epsilon \to 0.
\]
Therefore, we have $|A_2| \le \frac{1}{n}\sum_{i=1}^n \Delta_n(X_i) + o_p(1) = o_p(1)$.

Step 2.
In this step, we show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \hat\gamma - \gamma) + o_p(1),
\]
where $G(z, \gamma) = \gamma_2(x)^{-1}\big[1, -\gamma_1(x)/\gamma_2(x)\big]\,\gamma(x)$, the coefficients being evaluated at the true $\gamma$, and $z = (y, x, d)$ denotes a data observation. To do this, it suffices to show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \Big[\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) - f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) - G(Z_i, \hat\gamma - \gamma)\Big] = o_p(1).
\]
Using the notation for $\gamma$, we have
\[
\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|x) - f^{(1)}_{Y|X,D=1}(\theta_1^*|x) = \frac{\hat\gamma_1(x)}{\hat\gamma_2(x)} - \frac{\gamma_1(x)}{\gamma_2(x)}.
\]
The following argument follows Newey & McFadden (1994). Consider the algebraic relation
\[
\tilde a/\tilde b - a/b = b^{-1}\Big[1 - \tilde b^{-1}\big(\tilde b - b\big)\Big]\Big[\tilde a - a - (a/b)\big(\tilde b - b\big)\Big].
\]
The linear part of the right-hand side is $b^{-1}\big[\tilde a - a - (a/b)(\tilde b - b)\big]$, and the remaining term is of higher order. Letting $a = \gamma_1$, $\tilde a = \hat\gamma_1$, $b = \gamma_2$, and $\tilde b = \hat\gamma_2$, this linear part corresponds to the linear functional $G(Z_i, \hat\gamma - \gamma)$. The remaining higher-order term satisfies
\[
\bigg|\frac{\hat\gamma_1(x)}{\hat\gamma_2(x)} - \frac{\gamma_1(x)}{\gamma_2(x)} - G(z, \hat\gamma - \gamma)\bigg| \le |\hat\gamma_2(x)|^{-1}\,\gamma_2(x)^{-1}\bigg[1 + \bigg|\frac{\gamma_1(x)}{\gamma_2(x)}\bigg|\bigg]\Big[\big(\hat\gamma_1(x) - \gamma_1(x)\big)^2 + \big(\hat\gamma_2(x) - \gamma_2(x)\big)^2\Big] \le C\,\sup_{x\in\mathcal{X}}\big\|\hat\gamma(x) - \gamma(x)\big\|^2
\]
for some constant $C$ if $\gamma_2$ and $\hat\gamma_2$ are bounded away from zero. Hence the claim holds if $\sqrt{nh^3}\,\sup_{x\in\mathcal{X}}\|\hat\gamma(x) - \gamma(x)\|^2 \xrightarrow{p} 0$. By the uniform rates of convergence of kernel estimators (Hansen, 2008), we have
\[
\sup_{x\in\mathcal{X}}\big\|\hat\gamma(x) - \gamma(x)\big\|^2 \le \sup_{x\in\mathcal{X}}\big(\hat\gamma_1(x) - \gamma_1(x)\big)^2 + \sup_{x\in\mathcal{X}}\big(\hat\gamma_2(x) - \gamma_2(x)\big)^2 = O_p\Big[(\ln n)\big(nh^{d+3}\big)^{-1} + h^4\Big] + O_p\Big[(\ln n)\big(nh^{d}\big)^{-1} + h^4\Big] = O_p\Big[(\ln n)\big(nh^{d+3}\big)^{-1} + h^4\Big].
\]
The rates of $h$ and $n$ imply that $\sqrt{nh^3}\,\sup_{x\in\mathcal{X}}\|\hat\gamma(x) - \gamma(x)\|^2 \xrightarrow{p} 0$.

Step 3.
In this step, we show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \sqrt{nh^3}\int G(z, \hat\gamma - \gamma)\,dF(z) + o_p(1),
\]
where $F$ is the c.d.f. of $z$. To do this, it suffices to show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \hat\gamma - \gamma) - \sqrt{nh^3}\int G(z, \hat\gamma - \gamma)\,dF(z) = o_p(1).
\]
Let $\bar\gamma \equiv E[\hat\gamma]$. By the linearity of $G(z, \gamma)$ in $\gamma$, we have the decomposition $G(z, \hat\gamma - \gamma) = G(z, \hat\gamma - \bar\gamma) + G(z, \bar\gamma - \gamma)$. Therefore we just need to show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \hat\gamma - \bar\gamma) - \sqrt{nh^3}\int G(z, \hat\gamma - \bar\gamma)\,dF(z) = o_p(1)
\]
and
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \bar\gamma - \gamma) - \sqrt{nh^3}\int G(z, \bar\gamma - \gamma)\,dF(z) = o_p(1).
\]
The second condition holds by the central limit theorem, since
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \bar\gamma - \gamma) - \sqrt{nh^3}\int G(z, \bar\gamma - \gamma)\,dF(z) = \sqrt{nh^3}\,O_p\big(n^{-1/2}\big) = O_p\big(h^{3/2}\big) = o_p(1).
\]
It remains to show the first condition. We follow the arguments in Newey & McFadden (1994). Define
\[
q_j \equiv \bigg(\frac{D_j K^{(1)}_h(\theta_1^* - Y_j)}{P(D=1)},\; \frac{D_j}{P(D=1)}\bigg)',
\]
so that we can rewrite
\[
\hat\gamma(x) = \begin{bmatrix}\hat\gamma_1(x)\\ \hat\gamma_2(x)\end{bmatrix} = \frac{1}{n}\sum_{j=1}^n q_j K_h(x - X_j).
\]
We also define
\[
m(Z_i, Z_j) = G\big[Z_i,\, q_j K_h(\cdot - X_j)\big], \qquad m_1(z) = \int m(z, \tilde z)\,dF(\tilde z) = G(z, \bar\gamma), \qquad m_2(z) = \int m(\tilde z, z)\,dF(\tilde z) = \int G\big[\tilde z,\, q K_h(\cdot - x)\big]\,dF(\tilde z).
\]
Then the left-hand side of the first condition equals
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \hat\gamma - \bar\gamma) - \sqrt{nh^3}\int G(z, \hat\gamma - \bar\gamma)\,dF(z) = \sqrt{nh^3}\Bigg[\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n m(Z_i, Z_j) - \frac{1}{n}\sum_{i=1}^n m_1(Z_i) - \frac{1}{n}\sum_{i=1}^n m_2(Z_i) + E\big[m(Z_1, Z_2)\big]\Bigg]
\]
\[
= \sqrt{nh^3}\times O_p\Bigg(\frac{E\big[|m(Z_1, Z_1)|\big]}{n} + \frac{\Big(E\big[|m(Z_1, Z_2)|^2\big]\Big)^{1/2}}{n}\Bigg),
\]
where the last equality follows from Lemma 8.4 of Newey & McFadden (1994). The last term converges to zero in probability if we can control the rates of $E[|m(Z_1, Z_1)|]$ and $E[|m(Z_1, Z_2)|^2]$. Notice that $|G(z, \gamma)| \le b(z)\|\gamma\|$ with
\[
b(z) = \Big\|f_{X|D=1}(x)^{-1}\Big[1,\, -f^{(1)}_{Y|X,D=1}(\theta_1^*|x)\Big]\Big\|,
\]
where $\|\cdot\|$ denotes the $\ell_2$ norm. Then $\big|G\big(z, qK_h(\cdot - x)\big)\big| \le b(z)\,h^{-d}\bar K^d\,\|q\|$ by the boundedness of $K(u)$. Because $f_{X|D=1}(x)$ is bounded away from zero and $f^{(1)}_{Y|X,D=1}(\theta_1^*|x)$ is bounded from above, we have $E[b(Z)^2] < \infty$. Therefore,
\[
\sqrt{nh^3}\times O_p\Bigg(\frac{E\big[|m(Z_1, Z_1)|\big]}{n} + \frac{\Big(E\big[|m(Z_1, Z_2)|^2\big]\Big)^{1/2}}{n}\Bigg) = \sqrt{nh^3}\times O_p\big(n^{-1}h^{-d-2}\big) = o_p(1)
\]
by the assumptions on $n$ and $h$. The additional factor $h^{-2}$ in the rate of convergence comes from the fact that $q$ contains $K^{(1)}_h(u) = h^{-2}K^{(1)}(u/h)$ with bounded $K^{(1)}(u)$.

Step 4.
In this step, we show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n v(X_i)q_i + o_p(1),
\]
where $v(X_i) = \frac{P(D=1)}{P(D=1|X_i)}\big[1, -\gamma_1(X_i)/\gamma_2(X_i)\big]$ and $q_i = \big(D_i K^{(1)}_h(\theta_1^* - Y_i)/P(D=1),\; D_i/P(D=1)\big)'$. To do this, it suffices to show that
\[
\sqrt{nh^3}\int G(z, \hat\gamma - \gamma)\,dF(z) - \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n v(X_i)q_i = o_p(1).
\]
Notice that
\[
\int G(z, \gamma)\,dF(z) = \int \gamma_2(x)^{-1}\Big[1, -\frac{\gamma_1(x)}{\gamma_2(x)}\Big]\gamma(x)\,f_X(x)\,dx = \int \frac{P(D=1)}{P(D=1|X=x)}\Big[1, -\frac{\gamma_1(x)}{\gamma_2(x)}\Big]\gamma(x)\,dx = \int v(x)\gamma(x)\,dx,
\]
where $f_X(x)$ is the density function of $X$ and we use $f_{X|D=1}(x) = P(D=1|X=x)f_X(x)/P(D=1)$. Also, we have
\[
v(x)\gamma(x) = \frac{P(D=1)}{P(D=1|X=x)}\Big[1, -\frac{\gamma_1(x)}{\gamma_2(x)}\Big]\begin{bmatrix}\gamma_1(x)\\ \gamma_2(x)\end{bmatrix} = \frac{P(D=1)}{P(D=1|X=x)}\big(\gamma_1(x) - \gamma_1(x)\big) = 0.
\]
Therefore,
\[
\int G(z, \hat\gamma - \gamma)\,dF(z) = \int v(x)\hat\gamma(x)\,dx - \int v(x)\gamma(x)\,dx = \int v(x)\hat\gamma(x)\,dx = \frac{1}{n}\sum_{i=1}^n \int v(x)\,q_i K_h(x - X_i)\,dx = \frac{1}{n}\sum_{i=1}^n v(X_i)q_i + \frac{1}{n}\sum_{i=1}^n \Big[\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big]q_i.
\]
By Chebyshev's inequality, sufficient conditions for $\sqrt{nh^3}$ times the second term in the last line to converge to zero in probability are
\[
\sqrt{nh^3}\,\Big\|E\Big[\Big(\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big)q_i\Big]\Big\| \to 0
\]
and
\[
h^3\,E\Big[\|q_i\|^2\,\Big\|\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big\|^2\Big] \to 0.
\]
The expectation in the first condition is the difference of $E\big[\big(\int v(x)K_h(x - X_i)\,dx\big)q_i\big]$ and $E[v(X_i)q_i]$. We begin with $E[v(X_i)q_i]$. By the law of iterated expectations,
\[
E[v(X_i)q_i] = E\big[v(X_i)E[q_i|X_i]\big] = E\Bigg[v(X_i)\frac{P(D=1|X_i)}{P(D=1)}\begin{pmatrix} E\big[K^{(1)}_h(\theta_1^* - Y_i)\,|\,X_i, D=1\big] \\ 1 \end{pmatrix}\Bigg].
\]
The inner conditional expectation satisfies
\[
E\big[K^{(1)}_h(\theta_1^* - Y_i)\,\big|\,X_i, D=1\big] = \frac{1}{h^2}\int K^{(1)}\Big(\frac{\theta_1^* - y}{h}\Big) f_{Y|X,D=1}(y|X_i)\,dy = \frac{1}{h}\int K\Big(\frac{\theta_1^* - y}{h}\Big) f^{(1)}_{Y|X,D=1}(y|X_i)\,dy = \int K(u)\,f^{(1)}_{Y|X,D=1}(\theta_1^* + hu|X_i)\,du
\]
\[
= f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \int huK(u)\,f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\,du + \int \frac{h^2u^2}{2}K(u)\,f^{(3)}_{Y|X,D=1}(\tilde\theta_1|X_i)\,du = f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{h^2\kappa_2}{2}\,f^{(3)}_{Y|X,D=1}(\tilde\theta_1|X_i),
\]
with $\tilde\theta_1$ between $\theta_1^*$ and $\theta_1^* + hu$ and $\kappa_2 = \int u^2 K(u)\,du$. The second equality follows from integration by parts and the third from a change of variables. Hence,
\[
E[v(X_i)q_i] = E\Bigg[v(X_i)\frac{P(D=1|X_i)}{P(D=1)}\begin{pmatrix} f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) \\ 1 \end{pmatrix}\Bigg] + O(h^2) = \int v(x)\begin{pmatrix} f^{(1)}_{Y|X,D=1}(\theta_1^*|x) \\ 1 \end{pmatrix} f_{X|D=1}(x)\,dx + O(h^2) = \int v(x)\begin{pmatrix} f^{(1)}_{Y,X|D=1}(\theta_1^*, x) \\ f_{X|D=1}(x) \end{pmatrix} dx + O(h^2) = \int v(x)\gamma(x)\,dx + O(h^2).
\]
Next,
\[
E\Big[\Big(\int v(x)K_h(x - X_i)\,dx\Big)q_i\Big] = E\Big[\Big(\int v(X_i + hu)K(u)\,du\Big)q_i\Big] = \int\Big(\int v(x + hu)K(u)\,du\Big)\gamma(x)\,dx + O(h^2).
\]
Then the first condition equals
\[
\sqrt{nh^3}\,\Big\|E\Big[\Big(\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big)q_i\Big]\Big\| = \sqrt{nh^3}\,\Big\|\int\Big(\int v(x + hu)K(u)\,du\Big)\gamma(x)\,dx - \int v(x)\gamma(x)\,dx + O(h^2)\Big\|.
\]
Following the argument in Theorem 8.11 of Newey & McFadden (1994), the last line satisfies
\[
\sqrt{nh^3}\,\Big\|\int\int v(x)K(u)\gamma(x - hu)\,du\,dx - \int v(x)\gamma(x)\,dx + O(h^2)\Big\| = \sqrt{nh^3}\,\Big\|\int v(x)\Big\{\int K(u)\big[\gamma(x - hu) - \gamma(x)\big]\,du\Big\}dx + O(h^2)\Big\| \le \sqrt{nh^3}\,Ch^2\int\|v(x)\|\,dx + O\big(\sqrt{nh^3}\,h^2\big) = O\big(\sqrt{nh^3}\,h^2\big).
\]
Therefore the first condition holds if $\sqrt{nh^3}\,h^2 \to 0$.

For the second condition, note that $h^3 E[\|q_i\|^2\,|\,X_i]$ is bounded uniformly in $X_i$ (by the same calculation as in Step 5 below), so it suffices to show that $E\big[\|\int v(x)K_h(x - X_i)\,dx - v(X_i)\|^2\big] \to 0$. By the continuity of $v(x)$, $v(x + hu) \to v(x)$ for all $x$ and $u$ as $h \to 0$. By the dominated convergence theorem, $\int v(x)K_h(x - x_i)\,dx = \int v(x_i + hu)K(u)\,du \to v(x_i)$ for all $x_i$. Therefore,
\[
E\Big[\Big\|\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big\|^2\Big] = E\Big[\Big\|\int v(X_i + hu)K(u)\,du - v(X_i)\Big\|^2\Big] \to 0.
\]

Step 5. By Step 4 and the definitions of $v(X_i)$ and $q_i$, we have
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n v(X_i)q_i + o_p(1)
\]
\[
= \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\Big[1 - \frac{D_i}{P(D=1|X_i)}\Big] + \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}_h(\theta_1^* - Y_i) + o_p(1).
\]
Since $E\big[f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\big(1 - \frac{D_i}{P(D=1|X_i)}\big)\big] = 0$ by the law of iterated expectations, the central limit theorem applies to the first term on the right-hand side.
Hence,
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = O_p\big(\sqrt{nh^3}\,n^{-1/2}\big) + \frac{1}{\sqrt{nh}}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big) + o_p(1) = \frac{1}{\sqrt{nh}}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big) + o_p(1),
\]
since $O_p(\sqrt{nh^3}\,n^{-1/2}) = O_p(h^{3/2}) = o_p(1)$. In the remainder of this step, we show that
\[
\frac{1}{\sqrt{nh}}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big) \xrightarrow{d} N(0, V_1),
\]
where $V_1 = \kappa_0^{(1)}\,E\big[f_{Y|X,D=1}(\theta_1^*|X)/P(D=1|X)\big]$ and $\kappa_0^{(1)} = \int K^{(1)}(u)^2\,du$. For convenience, define
\[
\hat g(\theta_1^*) \equiv \frac{1}{nh^2}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big).
\]
Then it is equivalent to show that $\sqrt{nh^3}\,\big(\hat g(\theta_1^*) - 0\big) \xrightarrow{d} N(0, V_1)$.
To use the central limit theorem, we calculate $E[\hat g(\theta_1^*)]$ and $\mathrm{Var}(\hat g(\theta_1^*))$. First,
\[
E[\hat g(\theta_1^*)] = \frac{1}{h^2}E\Big[\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big] = \frac{1}{h^2}E\Big[\frac{1}{P(D=1|X_i)}\,E\Big[D_i K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big|\,X_i\Big]\Big] = \frac{1}{h^2}E\Big[E\Big[K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big|\,X_i, D=1\Big]\Big].
\]
Since $h^{-2}E\big[K^{(1)}\big(\frac{\theta_1^* - Y_i}{h}\big)\big|\,X_i, D=1\big] = f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{h^2\kappa_2}{2}\,f^{(3)}_{Y|X,D=1}(\tilde\theta_1|X_i)$ from the calculation in Step 4, we obtain
\[
E[\hat g(\theta_1^*)] = E\big[f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\big] + O(h^2) = 0 + O(h^2).
\]
For the variance,
\[
\mathrm{Var}\big(\hat g(\theta_1^*)\big) = \frac{1}{nh^4}\,\mathrm{Var}\Big(\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big) = \frac{1}{nh^4}\,E\Big[\frac{D_i}{P(D=1|X_i)^2}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)^2\Big] - \frac{1}{nh^4}\,E\Big[\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big]^2
\]
\[
= \frac{1}{nh^4}\,E\Big[\frac{1}{P(D=1|X_i)}\,E\Big[K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)^2\Big|\,X_i, D=1\Big]\Big] + \frac{1}{nh^4}\,O(h^8),
\]
where the last step uses the law of iterated expectations and the fact that the mean of each term is $h^2 E[\hat g(\theta_1^*)] = O(h^4)$. The inner expectation equals
\[
E\Big[K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)^2\Big|\,X, D=1\Big] = \int K^{(1)}\Big(\frac{\theta_1^* - y}{h}\Big)^2 f_{Y|X,D=1}(y|X)\,dy = h\int K^{(1)}(u)^2 f_{Y|X,D=1}(\theta_1^* + hu|X)\,du = h\,f_{Y|X,D=1}(\theta_1^*|X)\int K^{(1)}(u)^2\,du + h^2 f^{(1)}_{Y|X,D=1}(\tilde\theta_1|X)\int u K^{(1)}(u)^2\,du,
\]
where $\tilde\theta_1$ lies between $\theta_1^*$ and $\theta_1^* + hu$. Defining $\kappa_0^{(1)} = \int K^{(1)}(u)^2\,du$ and $\kappa_1^{(1)} = \int u K^{(1)}(u)^2\,du$, the variance equals
\[
\mathrm{Var}\big(\hat g(\theta_1^*)\big) = \frac{1}{nh^3}\,\kappa_0^{(1)}\,E\Big[\frac{f_{Y|X,D=1}(\theta_1^*|X)}{P(D=1|X)}\Big] + \frac{1}{nh^3}\,O(h) = \frac{1}{nh^3}\,\Big(V_1 + O(h)\Big).
\]
We are now ready to apply the central limit theorem. Let
\[
Z_{n,i} \equiv (nh)^{-1/2}\Big[\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big) - E\Big[\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big]\Big];
\]
then $E[Z_{n,i}] = 0$ and $\mathrm{Var}(Z_{n,i}) = h^3\,\mathrm{Var}(\hat g(\theta_1^*)) = n^{-1}V_1 + o(n^{-1})$. Then
\[
\sqrt{nh^3}\,\big(\hat g(\theta_1^*) - 0\big) = \sqrt{nh^3}\,\big(\hat g(\theta_1^*) - E[\hat g(\theta_1^*)]\big) + \sqrt{nh^3}\,\big(E[\hat g(\theta_1^*)] - 0\big) = \sum_{i=1}^n Z_{n,i} + \sqrt{nh^3}\,O(h^2) \xrightarrow{d} N(0, V_1)
\]
by the Lyapunov central limit theorem and $\sqrt{nh^3}\,h^2 \to 0$.

Step 6. In this step, we show that
\[
\sqrt{nh^3}\begin{bmatrix}\hat\theta_1 - \theta_1^*\\ \hat\theta_0 - \theta_0^*\end{bmatrix} \xrightarrow{d} N\Bigg(\begin{bmatrix}0\\0\end{bmatrix},\, \begin{bmatrix}M_1^{-1}V_1M_1^{-1} & 0\\ 0 & M_0^{-1}V_0M_0^{-1}\end{bmatrix}\Bigg),
\]
and thus, by the delta method,
\[
\sqrt{nh^3}\,\big(\hat\Delta - \Delta^*\big) \xrightarrow{d} N\big(0,\; M_1^{-1}V_1M_1^{-1} + M_0^{-1}V_0M_0^{-1}\big).
\]
To show the joint distribution, we adopt vector notation. The first-order conditions for $\hat\theta_1$ and $\hat\theta_0$ give
\[
\begin{bmatrix}0\\0\end{bmatrix} = \begin{bmatrix}\hat f^{(1)}_{Y_1}(\hat\theta_1)\\ \hat f^{(1)}_{Y_0}(\hat\theta_0)\end{bmatrix} = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix}\hat f^{(1)}_{Y|X,D=1}(\hat\theta_1|X_i)\\ \hat f^{(1)}_{Y|X,D=0}(\hat\theta_0|X_i)\end{bmatrix} = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix}\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\\ \hat f^{(1)}_{Y|X,D=0}(\theta_0^*|X_i)\end{bmatrix} + J_n\begin{bmatrix}\hat\theta_1 - \theta_1^*\\ \hat\theta_0 - \theta_0^*\end{bmatrix},
\]
where
\[
J_n = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix}\hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) & 0\\ 0 & \hat f^{(2)}_{Y|X,D=0}(\tilde\theta_0|X_i)\end{bmatrix}.
\]
Hence we have
\[
\sqrt{nh^3}\begin{bmatrix}\hat\theta_1 - \theta_1^*\\ \hat\theta_0 - \theta_0^*\end{bmatrix} = -J_n^{-1}\,\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \begin{bmatrix}\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\\ \hat f^{(1)}_{Y|X,D=0}(\theta_0^*|X_i)\end{bmatrix} = -J_n^{-1}\Bigg(\frac{1}{\sqrt{nh}}\sum_{i=1}^n \begin{bmatrix}\frac{D_i}{P(D=1|X_i)}K^{(1)}\big(\frac{\theta_1^* - Y_i}{h}\big)\\ \frac{1-D_i}{P(D=0|X_i)}K^{(1)}\big(\frac{\theta_0^* - Y_i}{h}\big)\end{bmatrix} + \begin{bmatrix}o_p(1)\\ o_p(1)\end{bmatrix}\Bigg),
\]
where the last equality follows from Step 5. Since
\[
J_n \xrightarrow{p} \begin{bmatrix}M_1 & 0\\ 0 & M_0\end{bmatrix} \quad\text{and}\quad \frac{1}{\sqrt{nh}}\sum_{i=1}^n \begin{bmatrix}\frac{D_i}{P(D=1|X_i)}K^{(1)}\big(\frac{\theta_1^* - Y_i}{h}\big)\\ \frac{1-D_i}{P(D=0|X_i)}K^{(1)}\big(\frac{\theta_0^* - Y_i}{h}\big)\end{bmatrix} \xrightarrow{d} N\Bigg(\begin{bmatrix}0\\0\end{bmatrix},\, \begin{bmatrix}V_1 & 0\\ 0 & V_0\end{bmatrix}\Bigg),
\]
we conclude that
\[
\sqrt{nh^3}\begin{bmatrix}\hat\theta_1 - \theta_1^*\\ \hat\theta_0 - \theta_0^*\end{bmatrix} \xrightarrow{d} N\Bigg(\begin{bmatrix}0\\0\end{bmatrix},\, \begin{bmatrix}M_1^{-1}V_1M_1^{-1} & 0\\ 0 & M_0^{-1}V_0M_0^{-1}\end{bmatrix}\Bigg).
\]

Proof of Theorem 3.
It is enough to show the results for $\hat M_1$ and $\hat V_1$. We first show that $\hat M_1 \xrightarrow{p} M_1$. By adding and subtracting terms, we have
\[
\hat M_1 = \frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}\big(\hat\theta_1|X_i\big) = \frac{1}{n}\sum_{i=1}^n f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i) + A_1 + A_2,
\]
where
\[
A_1 = \frac{1}{n}\sum_{i=1}^n \Big(\hat f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i)\Big) \quad\text{and}\quad A_2 = \frac{1}{n}\sum_{i=1}^n \Big(f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\Big).
\]
If we can show that $A_1 = o_p(1)$ and $A_2 = o_p(1)$, then $\hat M_1 \xrightarrow{p} M_1$ by the law of large numbers. Note that
\[
|A_1| \le \frac{1}{n}\sum_{i=1}^n \Big|\hat f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i)\Big| \le \sup_{y,x}\Big|\hat f^{(2)}_{Y|X,D=1}(y|x) - f^{(2)}_{Y|X,D=1}(y|x)\Big| = O_p\Bigg(\sqrt{\frac{\ln n}{nh^{d+5}}} + h^2\Bigg) = o_p(1),
\]
where the last rate follows from the uniform rates of convergence of kernel estimators (Hansen, 2008). For $A_2$, we use the argument in Lemma 4.3 of Newey & McFadden (1994). By the consistency of $\hat\theta_1$, there is $\delta_n \to 0$ such that $|\hat\theta_1 - \theta_1^*| \le \delta_n$ with probability approaching one. Define $\Delta_n(Z_i) = \sup_{|y - \theta_1^*| \le \delta_n}\big|f^{(2)}_{Y|X,D=1}(y|X_i) - f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\big|$. By the continuity of $f^{(2)}_{Y|X,D=1}(y|X_i)$ at $\theta_1^*$, $\Delta_n(Z_i) \xrightarrow{p} 0$. By the dominated convergence theorem, $E[\Delta_n(Z_i)] \to 0$, and by Markov's inequality $P\big(n^{-1}\sum_{i=1}^n \Delta_n(Z_i) > \epsilon\big) \le E[\Delta_n(Z_i)]/\epsilon \to 0$. Therefore, $|A_2| \le \frac{1}{n}\sum_{i=1}^n \Delta_n(Z_i) = o_p(1)$.

Next we show that $\hat V_1 \xrightarrow{p} V_1$. We can rewrite
\[
\hat V_1/\kappa_0^{(1)} = \frac{1}{n}\sum_{i=1}^n \frac{\hat f_{Y|X,D=1}(\hat\theta_1|X_i)}{\hat\pi(X_i)} = \frac{1}{n}\sum_{i=1}^n \frac{f_{Y|X,D=1}(\theta_1^*|X_i)}{\pi(X_i)} + B_1 + B_2,
\]
with
\[
B_1 = \frac{1}{n}\sum_{i=1}^n \Bigg(\frac{\hat f_{Y|X,D=1}(\hat\theta_1|X_i)}{\hat\pi(X_i)} - \frac{f_{Y|X,D=1}(\hat\theta_1|X_i)}{\pi(X_i)}\Bigg) \quad\text{and}\quad B_2 = \frac{1}{n}\sum_{i=1}^n \Bigg(\frac{f_{Y|X,D=1}(\hat\theta_1|X_i)}{\pi(X_i)} - \frac{f_{Y|X,D=1}(\theta_1^*|X_i)}{\pi(X_i)}\Bigg).
\]
It remains to show that $B_1 = o_p(1)$ and $B_2 = o_p(1)$. The result for $B_2$ follows from the same arguments as for $A_2$, given that $f_{Y|X,D=1}(y|X_i)$ is continuous at $\theta_1^*$. Thus, we focus on $B_1$. For convenience, write $f(y|x) = f_{Y|X,D=1}(y|x)$. For $\pi$ bounded away from zero, we have
\[
\Bigg|\frac{\hat f(y|x)}{\hat\pi(x)} - \frac{f(y|x)}{\pi(x)}\Bigg| = \Bigg|\frac{\pi(x)\hat f(y|x) - \hat\pi(x)f(y|x)}{\hat\pi(x)\pi(x)}\Bigg| = \Bigg|\frac{\hat f(y|x) - f(y|x)}{\hat\pi(x)} + \frac{f(y|x)}{\hat\pi(x)\pi(x)}\big(\pi(x) - \hat\pi(x)\big)\Bigg| \le C\Big(\big|\hat f(y|x) - f(y|x)\big| + \big|\hat\pi(x) - \pi(x)\big|\Big)
\]
for some $C > 0$.
By the uniform rates of convergence of kernel estimators (Hansen, 2008), we have
\[
|B_1| \le \frac{1}{n}\sum_{i=1}^n \Bigg|\frac{\hat f_{Y|X,D=1}(\hat\theta_1|X_i)}{\hat\pi(X_i)} - \frac{f_{Y|X,D=1}(\hat\theta_1|X_i)}{\pi(X_i)}\Bigg| \le C\,\sup_{y,x}\Big(\big|\hat f(y|x) - f(y|x)\big| + \big|\hat\pi(x) - \pi(x)\big|\Big) = O_p\Bigg(\sqrt{\frac{\ln n}{nh^{d+1}}} + h^2\Bigg) + C\,\sup_x\big|\hat\pi(x) - \pi(x)\big| = o_p(1)
\]
by the rates of $n$ and $h$ and the uniform consistency of $\hat\pi(x)$.

Proof of Theorem 4. Suppose that
\[
\hat f_{Y_1}(y) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1(Z, y, \hat\eta_k)\big]
\]
is differentiable with respect to $y$. Define
\[
\hat f^{(1)}_{Y_1}(y) \equiv \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, y, \hat\eta_k)\big],
\]
where $m_1^{(1)}(Z, y, \hat\eta_k) \equiv \partial m_1(Z, y, \hat\eta_k)/\partial y$. By the definition of $\hat\theta_1$, we have
\[
0 = \hat f^{(1)}_{Y_1}(\hat\theta_1) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \hat\theta_1, \hat\eta_k)\big] = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] + \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(2)}(Z, \tilde\theta_1, \hat\eta_k)\big]\big(\hat\theta_1 - \theta_1^*\big),
\]
and hence
\[
\sqrt{Nh^3}\,\big(\hat\theta_1 - \theta_1^*\big) = -\Bigg[\frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(2)}(Z, \tilde\theta_1, \hat\eta_k)\big]\Bigg]^{-1}\sqrt{Nh^3}\,\frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big].
\]
In Steps 1 and 2 below, we show that
\[
\frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(2)}(Z, \tilde\theta_1, \hat\eta_k)\big] \xrightarrow{p} M_1 \quad\text{and}\quad \sqrt{Nh^3}\,\frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] \xrightarrow{d} N(0, V_1),
\]
respectively. Together they yield the final result $\sqrt{Nh^3}\,(\hat\theta_1 - \theta_1^*) \xrightarrow{d} N\big(0, M_1^{-1}V_1M_1^{-1}\big)$.

Step 1.
Since $K$ is a fixed integer, independent of $N$, it suffices to show that for each $k \in [K]$,
\[
\mathbb{E}_{n,k}\big[m_1^{(2)}(Z, \tilde\theta_1, \hat\eta_k)\big] \xrightarrow{p} M_1.
\]
This convergence follows from the same argument as Step 1 in the proof of Theorem 2.
Step 2.
Since $K$ is a fixed integer, independent of $N$, it is enough to consider the convergence of $\mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big]$ for each $k$. Notice that
\[
\mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] = \frac{1}{n}\sum_{i\in I_k} m_1^{(1)}(Z_i, \theta_1^*, \eta_0) + R_{1,k}, \qquad R_{1,k} = \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] - \frac{1}{n}\sum_{i\in I_k} m_1^{(1)}(Z_i, \theta_1^*, \eta_0).
\]
By the triangle inequality, $\|R_{1,k}\| \le (I_{1,k} + I_{2,k})/\sqrt{n}$, where
\[
I_{1,k} \equiv \Big\|\mathbb{G}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] - \mathbb{G}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \eta_0)\big]\Big\|, \qquad I_{2,k} \equiv \sqrt{n}\,\Big\|E_P\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\,\big|\,(W_i)_{i\in I_k^c}\big] - E_P\big[m_1^{(1)}(Z, \theta_1^*, \eta_0)\big]\Big\|.
\]
Two auxiliary results will be used to bound $I_{1,k}$ and $I_{2,k}$:
\[
\sup_{\eta\in\mathcal{T}_N}\Big(E\big[\|m_1^{(1)}(Z, \theta_1^*, \eta) - m_1^{(1)}(Z, \theta_1^*, \eta_0)\|^2\big]\Big)^{1/2} \le \varepsilon_N, \tag{A.1}
\]
\[
\sup_{r\in(0,1),\,\eta\in\mathcal{T}_N}\Big\|\partial_r^2\,E\big[m_1^{(1)}\big(Z, \theta_1^*, \eta_0 + r(\eta - \eta_0)\big)\big]\Big\| \le (\varepsilon_N)^2, \tag{A.2}
\]
where $\mathcal{T}_N$ is the set of all $\eta = (\pi, g_1)$ consisting of square-integrable functions $\pi$ and $g_1$ such that
\[
\|\eta - \eta_0\|_{P,2} \le \varepsilon_N, \qquad \|\pi - 1/2\|_{P,\infty} \le 1/2 - \kappa, \qquad \|\pi - \pi_0\|_{P,2}^2 + \|\pi - \pi_0\|_{P,2}\times\|g_1 - g_{1,0}\|_{P,2} \le (\varepsilon_N)^2.
\]
By assumption, $\hat\eta_k \in \mathcal{T}_N$ with probability $1 - o(1)$.

To bound $I_{1,k}$, note that conditional on $(W_i)_{i\in I_k^c}$ the estimator $\hat\eta_k$ is nonstochastic. On the event $\hat\eta_k \in \mathcal{T}_N$,
\[
E_P\big[I_{1,k}^2\,\big|\,(W_i)_{i\in I_k^c}\big] = E_P\big[\|m_1^{(1)}(Z, \theta_1^*, \hat\eta_k) - m_1^{(1)}(Z, \theta_1^*, \eta_0)\|^2\,\big|\,(W_i)_{i\in I_k^c}\big] \le \sup_{\eta\in\mathcal{T}_N}E_P\big[\|m_1^{(1)}(Z, \theta_1^*, \eta) - m_1^{(1)}(Z, \theta_1^*, \eta_0)\|^2\big] \le (\varepsilon_N)^2
\]
by (A.1). Hence $I_{1,k} = O_P(\varepsilon_N)$. To bound $I_{2,k}$, define the function
\[
f_k(r) = E_P\big[m_1^{(1)}\big(Z, \theta_1^*, \eta_0 + r(\hat\eta_k - \eta_0)\big)\,\big|\,(W_i)_{i\in I_k^c}\big] - E_P\big[m_1^{(1)}(Z, \theta_1^*, \eta_0)\big], \qquad r\in[0,1].
\]
By a Taylor expansion, $f_k(1) = f_k(0) + f_k'(0) + f_k''(\tilde r)/2$ for some $\tilde r \in (0, 1)$. Note that $f_k(0) = 0$, while $E_P[m_1^{(1)}(Z, \theta_1^*, \eta_0)] = O(h^2)$ by the calculation in Step 4 of the proof of Theorem 2. Further, on the event $\hat\eta_k \in \mathcal{T}_N$, $f_k'(0) = \partial_\eta E\big[m_1^{(1)}(Z, \theta_1^*, \eta_0)\big]\big[\hat\eta_k - \eta_0\big] = 0$ by the Neyman orthogonality, and $\|f_k''(\tilde r)\| \le \sup_{r\in(0,1)}\|f_k''(r)\| \le (\varepsilon_N)^2$ by (A.2). Thus $I_{2,k} = \sqrt{n}\,\|f_k(1)\| = O_P\big(\sqrt{n}\,(\varepsilon_N)^2\big)$. Together with the bound on $I_{1,k}$, we have
\[
\|R_{1,k}\| \le \frac{I_{1,k} + I_{2,k}}{\sqrt{n}} = O_P\big(n^{-1/2}\varepsilon_N + (\varepsilon_N)^2\big),
\]
and hence $\sqrt{Nh^3}\,\|R_{1,k}\| = O_P\big(\sqrt{h^3}\,\varepsilon_N + \sqrt{Nh^3}\,(\varepsilon_N)^2\big) = o_P(1)$ by the assumption $\varepsilon_N = o\big((Nh^3)^{-1/4}\big)$. Therefore,
\[
\mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] = \frac{1}{n}\sum_{i\in I_k} m_1^{(1)}(Z_i, \theta_1^*, \eta_0) + o_P\big((Nh^3)^{-1/2}\big),
\]
and, after scaling by $\sqrt{Nh^3}$, the leading term converges in distribution to $N(0, V_1)$ by the argument in Step 5 of the proof of Theorem 2, where the bias $E_P[m_1^{(1)}(Z, \theta_1^*, \eta_0)] = O(h^2)$ is controlled by $\sqrt{Nh^3}\,h^2 = \sqrt{Nh^7} \to 0$.