Mode Treatment Effect
Neng-Chieh Chang*

Abstract
Mean, median, and mode are three essential measures of the centrality of probability distributions. In program evaluation, the average treatment effect (mean) and the quantile treatment effect (median) have been studied intensively in the past decades. The mode treatment effect, however, has long been neglected in program evaluation. This paper fills the gap by discussing both the estimation and the inference of the mode treatment effect. I propose both traditional kernel and machine learning (ML) methods to estimate the mode treatment effect. I also derive the asymptotic properties of the proposed estimators and find that both estimators are asymptotically normal, but with a rate of convergence slower than the regular rate $\sqrt{N}$, which differs from the rates of the classical average and quantile treatment effect estimators.

* Department of Economics, University of California Los Angeles, 315 Portola Plaza, Los Angeles, CA 90095, USA. Email: [email protected]

1 Introduction

The effects of policies on the distribution of outcomes have long been of central interest in many areas of empirical economics. A policy maker might be interested in the difference between the distribution of the outcome under treatment and the distribution of the outcome in the absence of treatment. Empirical studies of distributional effects include, but are not limited to, Freeman (1980), Card (1996), DiNardo, Fortin, & Lemieux (1995), and Bitler, Gelbach, & Hoynes (2006). Most studies use the difference of the averages or quantiles of the treated and untreated distributions, known as the average treatment effect and the quantile treatment effect, as a summary of the effect of treatment on the distribution. The mode of a distribution, which is also an important summary statistic of the data, has long been ignored in this literature. This paper fills the gap by studying the mode treatment effect: the difference of the modes of the treated and untreated distributions. Compared to the average and the quantile treatment effect, the mode treatment effect has two advantages: (1) the mode captures the most probable value of the distribution under treatment and in the absence of treatment, and it provides a better summary of centrality than the average or a quantile when the distributions are highly skewed; (2) the mode is robust to heavy-tailed distributions in which outliers do not follow the same behavior as the majority of the sample. In economic studies, skewed and heavy-tailed distributions arise especially often when the outcome of interest is income or wage.
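As a concrete illustration (a standard fact about the log-normal distribution, added here only for exposition, not taken from the paper): if $\log Y \sim N(\mu, \sigma^2)$, then
\[
\mathrm{mode}(Y) = e^{\mu - \sigma^2} \;<\; \mathrm{median}(Y) = e^{\mu} \;<\; E[Y] = e^{\mu + \sigma^2/2},
\]
so for a right-skewed wage distribution the mean is pulled into the upper tail by a few large observations, while the mode tracks the most common wage.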
The proposed estimators converge at the rate $\sqrt{Nh^3}$, where $N$ is the sample size and $h$ is the bandwidth of the chosen kernel, which is slower than the traditional rate of convergence $\sqrt{N}$ in the estimation of means and quantiles. In fact, this rate matches the intuition: while the estimators of the mean and of quantiles are weighted averages of all available observations, only the small portion of observations near the mode provides information for the estimator of the mode. This explains the slower rate of convergence of the proposed estimators.

This paper contributes to the program evaluation literature, which includes the studies of the average treatment effect: Rosenbaum & Rubin (1983), Heckman & Robb (1985), Heckman, Ichimura, & Todd (1997), Hahn (1998), and Hirano, Imbens, & Ridder (2003); the studies of the quantile treatment effect: Abadie, Angrist, & Imbens (2002), Chernozhukov & Hansen (2005), and Firpo (2007); the studies of mode estimation and mode regression: Parzen (1962), Eddy (1980), Lee (1989), Yao & Li (2014), and Chen, Genovese, Tibshirani, & Wasserman (2016); as well as the causal inference of ML methods: Belloni et al. (2012), Belloni et al. (2014), Chernozhukov et al. (2015), Belloni et al. (2017), Chernozhukov et al. (2018), and Athey et al. (2019). This paper is also closely related to the robustness of average treatment effect estimation discussed in Robins & Rotnitzky (1995) and the general discussion in Chernozhukov, Escanciano, Ichimura, & Newey (2016). The asymptotic properties of the robust estimators discussed in those papers remain unaffected if only one of the first-step estimators based on classical nonparametric methods is inconsistent.

Plan of the paper.
Section 2 sets up the notation and framework for the discussion of the mode treatment effect. Section 3 discusses the kernel method and derives the asymptotic properties of the resulting estimator. Section 4 presents the ML estimator for density estimation and the corresponding Neyman-orthogonal score; I combine the Neyman-orthogonal score with the cross-fitting algorithm to propose the ML estimator of the mode treatment effect and derive its asymptotic properties. Section 5 concludes.
2 Framework

Let $Y$ be a continuous outcome variable of interest, $D$ the binary treatment indicator, and $X$ a $d \times 1$ vector of control variables. Denote by $Y_1$ an individual's potential outcome when $D = 1$ and by $Y_0$ the potential outcome when $D = 0$. Let $f_{Y_1}(y)$ and $f_{Y_0}(y)$ be the marginal probability density functions (p.d.f.) of $Y_1$ and $Y_0$, respectively. The modes of $Y_1$ and $Y_0$ are the values that appear with the highest probability. That is,
\[
\theta_1^* \equiv \arg\max_{y \in \mathcal{Y}_1} f_{Y_1}(y) \quad \text{and} \quad \theta_0^* \equiv \arg\max_{y \in \mathcal{Y}_0} f_{Y_0}(y),
\]
where $\mathcal{Y}_1$ and $\mathcal{Y}_0$ are the supports of $Y_1$ and $Y_0$. Here I assume that $\theta_1^*$ and $\theta_0^*$ are unique, meaning that both $Y_1$ and $Y_0$ are unimodal. I also assume that the modes $\theta_1^*$ and $\theta_0^*$ are in the interior of the common support of $Y_1$ and $Y_0$. These conditions are formally stated in the following assumption:

Assumption 1 (Uni-mode)
• For all $\varepsilon > 0$, $\sup_{y: |y - \theta_1^*| > \varepsilon} f_{Y_1}(y) < f_{Y_1}(\theta_1^*)$ for $y \in \mathcal{Y}_1$, and $\sup_{y: |y - \theta_0^*| > \varepsilon} f_{Y_0}(y) < f_{Y_0}(\theta_0^*)$ for $y \in \mathcal{Y}_0$.
• $\theta_1^*, \theta_0^* \in \mathrm{Int}(\mathcal{Y}_1 \cap \mathcal{Y}_0)$.

Assumption 1 has been widely adopted in many studies (Parzen, 1962; Eddy, 1980; Lee, 1989; Yao & Li, 2014). Under Assumption 1, the mode treatment effect is uniquely defined as $\Delta^* \equiv \theta_1^* - \theta_0^*$. The following states the strong ignorability assumption (Rosenbaum & Rubin, 1983):

Assumption 2 (Strong ignorability)
• $(Y_1, Y_0) \perp D \mid X$.
• $0 < P(D = 1 \mid X) < 1$.

The first part of Assumption 2 assumes that the potential outcomes are independent of treatment after conditioning on the observable covariates $X$. The second part states that, for all values of $X$, both treatment statuses occur with positive probability. Under strong ignorability, both $f_{Y_1}$ and $f_{Y_0}$ can be identified from the observable variables $(Y, D, X)$, since
\[
f_{Y|D=1,X}(y|x) = f_{Y_1|D=1,X}(y|x) = f_{Y_1|X}(y|x),
\]
and thus
\[
f_{Y_1}(y) = E\big[f_{Y_1|X}(y|X)\big] = E\big[f_{Y|D=1,X}(y|X)\big]. \tag{2.1}
\]
Similarly, we have
\[
f_{Y_0}(y) = E\big[f_{Y|D=0,X}(y|X)\big]. \tag{2.2}
\]
Equations (2.1) and (2.2) show the identification of the density functions $f_{Y_1}$ and $f_{Y_0}$. It is then straightforward to identify their modes $\theta_1^*$ and $\theta_0^*$:
\[
\theta_1^* = \arg\max_{y \in \mathcal{Y}_1} E\big[f_{Y|D=1,X}(y|X)\big] \quad \text{and} \quad \theta_0^* = \arg\max_{y \in \mathcal{Y}_0} E\big[f_{Y|D=0,X}(y|X)\big]. \tag{2.3}
\]
If both $f_{Y|D=1,X}(y|x)$ and $f_{Y|D=0,X}(y|x)$ are differentiable with respect to $y$, we can further characterize the modes through the first-order conditions under Assumption 1:
\[
E\big[f^{(1)}_{Y|D=1,X}(\theta_1^*|X)\big] = 0 \quad \text{and} \quad E\big[f^{(1)}_{Y|D=0,X}(\theta_0^*|X)\big] = 0, \tag{2.4}
\]
where $m^{(s)}(y, x) \equiv \partial^s m(y, x)/\partial y^s$ denotes the $s$-th partial derivative with respect to $y$. Equations (2.1)-(2.4) provide a direct way to estimate the modes $\theta_1^*$ and $\theta_0^*$: we estimate the density functions $f_{Y_1}(y)$ and $f_{Y_0}(y)$ in a first step and use the maximizers of the estimated density functions as the estimators of the modes. Sections 3 and 4 present the kernel and ML estimation methods, respectively.
3 Kernel Estimation

In this section, I propose kernel estimators for $\theta_1^*$, $\theta_0^*$, and the mode treatment effect $\Delta^* = \theta_1^* - \theta_0^*$. Let $K(\cdot)$ be a kernel function with bandwidth $h$. Define the estimators of the density functions $f_{Y_1}(y)$ and $f_{Y_0}(y)$ as
\[
\hat f_{Y_1}(y) = \frac{1}{n}\sum_{i=1}^n \hat f_{Y|D=1,X}(y|X_i), \qquad \hat f_{Y_0}(y) = \frac{1}{n}\sum_{i=1}^n \hat f_{Y|D=0,X}(y|X_i),
\]
where
\[
\hat f_{Y|D=1,X}(y|x) = \frac{\sum_{j=1}^n D_j K_h(y - Y_j) K_h(x - X_j)}{\sum_{j=1}^n D_j K_h(x - X_j)}, \qquad \hat f_{Y|D=0,X}(y|x) = \frac{\sum_{j=1}^n (1 - D_j) K_h(y - Y_j) K_h(x - X_j)}{\sum_{j=1}^n (1 - D_j) K_h(x - X_j)},
\]
with
\[
K_h(y - Y_j) = h^{-1} K\Big(\frac{y - Y_j}{h}\Big) \quad \text{and} \quad K_h(x - X_j) = h^{-d} K\Big(\frac{x_1 - X_{j1}}{h}\Big) \times \cdots \times K\Big(\frac{x_d - X_{jd}}{h}\Big).
\]
It is then straightforward to define the estimators of the modes $\theta_1^*$ and $\theta_0^*$:
\[
\hat\theta_1 \equiv \arg\max_y \hat f_{Y_1}(y), \qquad \hat\theta_0 \equiv \arg\max_y \hat f_{Y_0}(y).
\]
The estimator of the mode treatment effect $\Delta^*$ is $\hat\Delta \equiv \hat\theta_1 - \hat\theta_0$. Throughout the paper, I impose the following conditions on the kernel $K(\cdot)$:

Assumption 3
• $|K(u)| \le \bar K < \infty$.
• $\int K(u)\,du = 1$, $\int u K(u)\,du = 0$, and $\int u^2 K(u)\,du < \infty$.
• $K(u)$ is differentiable.

The first part of Assumption 3 requires that $K(u)$ is bounded. Although the second part implies that $K(u)$ is a first-order kernel, the arguments in this paper can easily be extended to higher-order kernels; the first-order kernel is assumed only for simplicity. The third part imposes enough smoothness on $K(u)$.
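To fix ideas, here is a minimal numpy sketch of the estimators above (my own illustration, not code from the paper); it assumes a Gaussian kernel, a scalar control variable, a single common bandwidth $h$, and a grid search for the maximizer:

```python
import numpy as np

def gauss(u):
    """Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def marginal_density(grid, Y, X, D, h, treated=True):
    """f_hat_{Y_d}(y) = (1/n) sum_i f_hat_{Y|D=d,X}(y | X_i) on a grid of y values."""
    w = D if treated else 1.0 - D
    Kx = gauss((X[:, None] - X[None, :]) / h) / h      # Kx[i, j] = K_h(X_i - X_j)
    denom = (Kx * w).sum(axis=1)                        # sum_j w_j K_h(X_i - X_j)
    Ky = gauss((grid[:, None] - Y[None, :]) / h) / h    # Ky[g, j] = K_h(y_g - Y_j)
    num = (Ky * w) @ Kx.T                               # sum_j w_j K_h(y_g - Y_j) K_h(X_i - X_j)
    return (num / denom).mean(axis=1)

# simulated example: treatment shifts the outcome distribution
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=n)
D = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X))).astype(float)
Y = 1.0 * D + 0.5 * X + rng.normal(size=n)
grid = np.linspace(Y.min(), Y.max(), 200)
h = 1.06 * Y.std() * n ** (-0.2)                        # rule-of-thumb bandwidth

f1 = marginal_density(grid, Y, X, D, h, treated=True)
f0 = marginal_density(grid, Y, X, D, h, treated=False)
theta1_hat, theta0_hat = grid[f1.argmax()], grid[f0.argmax()]
delta_hat = theta1_hat - theta0_hat                     # mode treatment effect
```

In practice the bandwidths for $Y$ and $X$ need not coincide, and the rule-of-thumb choice here is only a placeholder.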
Theorem 1 (Consistency). Suppose Assumptions 1-3 hold. Assume that the density functions $f_{Y|D=1,X}(y|x)$ and $f_{Y|D=0,X}(y|x)$ are (i) continuous in $y$, (ii) bounded by some function $d(x)$ with $E[d(X)] < \infty$ for all $y \in \mathcal{Y}$, and (iii) defined for $y \in \mathcal{Y}$ and $x \in \mathcal{X}$ with compact $\mathcal{Y}$ and $\mathcal{X}$. Assume also that the density functions $f_{X|D=1}(x)$ and $f_{X|D=0}(x)$ are bounded away from zero. If $n \to \infty$, $h \to 0$, and $\ln n\,\big(nh^{d+1}\big)^{-1} \to 0$, then $\hat\theta_1 \xrightarrow{p} \theta_1^*$ and $\hat\theta_0 \xrightarrow{p} \theta_0^*$.
Theorem 2 (Asymptotic Normality). Suppose that the assumptions of Theorem 1 hold. Assume that $f^{(2)}_{Y|X,D=1}(y|x)$ and $f^{(2)}_{Y|X,D=0}(y|x)$ are continuous at $y = \theta_1^*$ and $y = \theta_0^*$ for all $x$, respectively. If $n \to \infty$, $h \to 0$, $\sqrt{nh^3}\,(\ln n)\big(nh^{d+3}\big)^{-1} \to 0$, $(\ln n)\big(nh^{d+5}\big)^{-1} \to 0$, and $\sqrt{nh^3}\,h^2 \to 0$, then
\[
\sqrt{nh^3}\,\big(\hat\theta_1 - \theta_1^*\big) \xrightarrow{d} N\big(0,\, M_1^{-1}V_1M_1^{-1}\big) \quad\text{and}\quad \sqrt{nh^3}\,\big(\hat\theta_0 - \theta_0^*\big) \xrightarrow{d} N\big(0,\, M_0^{-1}V_0M_0^{-1}\big),
\]
where
\[
M_1 \equiv E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big], \qquad M_0 \equiv E\big[f^{(2)}_{Y|X,D=0}(\theta_0^*|X)\big],
\]
\[
V_1 = \kappa_0^{(1)}\,E\bigg[\frac{f_{Y|X,D=1}(\theta_1^*|X)}{P(D=1|X)}\bigg], \qquad V_0 = \kappa_0^{(1)}\,E\bigg[\frac{f_{Y|X,D=0}(\theta_0^*|X)}{P(D=0|X)}\bigg],
\]
and $\kappa_0^{(1)} = \int K^{(1)}(u)^2\,du$. Further, we have
\[
\sqrt{nh^3}\,\big(\hat\Delta - \Delta^*\big) \xrightarrow{d} N\big(0,\; M_1^{-1}V_1M_1^{-1} + M_0^{-1}V_0M_0^{-1}\big).
\]

Theorems 1 and 2 establish the asymptotic properties of the estimator of the mode treatment effect. The proposed estimators are asymptotically normal, but with a rate of convergence slower than the regular rate $\sqrt{N}$. The intuition is that, unlike the estimation of the average and the quantile treatment effect, the estimation of modes uses only the small portion of observations that lie near the modes; this limited use of the data makes the rate of convergence slower than the regular rate $\sqrt{N}$.

To estimate the asymptotic variances, define $\pi(X) \equiv P(D=1|X)$ to be the propensity score. The consistent variance estimators are
\[
\hat M_1 = \frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}\big(\hat\theta_1|X_i\big), \qquad \hat M_0 = \frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=0}\big(\hat\theta_0|X_i\big),
\]
\[
\hat V_1 = \frac{\kappa_0^{(1)}}{n}\sum_{i=1}^n \frac{\hat f_{Y|X,D=1}\big(\hat\theta_1|X_i\big)}{\hat\pi(X_i)}, \qquad \hat V_0 = \frac{\kappa_0^{(1)}}{n}\sum_{i=1}^n \frac{\hat f_{Y|X,D=0}\big(\hat\theta_0|X_i\big)}{1 - \hat\pi(X_i)}.
\]

Theorem 3 (Variance Estimation). Suppose that the assumptions of Theorem 2 hold. Let $\hat\pi(x)$ be a uniformly consistent estimator of $\pi(x)$. If $n \to \infty$, $h \to 0$, and $\ln n\,\big(nh^{d+5}\big)^{-1} \to 0$, then $\hat M_1 \xrightarrow{p} M_1$, $\hat M_0 \xrightarrow{p} M_0$, $\hat V_1 \xrightarrow{p} V_1$, and $\hat V_0 \xrightarrow{p} V_0$. Thus we have $\hat M_1^{-1}\hat V_1\hat M_1^{-1} \xrightarrow{p} M_1^{-1}V_1M_1^{-1}$ and $\hat M_0^{-1}\hat V_0\hat M_0^{-1} \xrightarrow{p} M_0^{-1}V_0M_0^{-1}$.
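As an illustration of how Theorems 2 and 3 can be used in practice, the sketch below (my own code, continuing the earlier numpy snippet) forms a 95% confidence interval for $\hat\Delta$; it assumes the Gaussian kernel, for which $\kappa_0^{(1)} = \int K^{(1)}(u)^2\,du = 1/(4\sqrt{\pi})$, and approximates $\hat M$ by a finite difference of the estimated marginal density:

```python
kappa = 1.0 / (4.0 * np.sqrt(np.pi))       # int K'(u)^2 du for the Gaussian kernel

def cond_dens(y, Y, X, D, h, treated=True):
    """Vector of f_hat_{Y|D=d,X}(y | X_i) over i = 1..n."""
    w = D if treated else 1.0 - D
    Kx = gauss((X[:, None] - X[None, :]) / h) / h
    Ky = gauss((y - Y) / h) / h
    return (Kx * (w * Ky)).sum(axis=1) / (Kx * w).sum(axis=1)

def prop_score(X, D, h):
    """Kernel (Nadaraya-Watson) estimate of pi(X_i) = P(D = 1 | X_i)."""
    Kx = gauss((X[:, None] - X[None, :]) / h) / h
    return (Kx * D).sum(axis=1) / Kx.sum(axis=1)

def avar(theta, Y, X, D, h, treated=True, eps=1e-2):
    """Plug-in estimate of M^{-1} V M^{-1} from Theorem 3."""
    f = lambda y: cond_dens(y, Y, X, D, h, treated).mean()   # f_hat_{Y_d}(y)
    M = (f(theta + eps) - 2.0 * f(theta) + f(theta - eps)) / eps**2
    p = prop_score(X, D, h)
    p = p if treated else 1.0 - p
    V = kappa * np.mean(cond_dens(theta, Y, X, D, h, treated) / p)
    return V / M**2

se = np.sqrt((avar(theta1_hat, Y, X, D, h, True)
              + avar(theta0_hat, Y, X, D, h, False)) / (n * h**3))
ci = (delta_hat - 1.96 * se, delta_hat + 1.96 * se)      # 95% CI for Delta
```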
4 Machine Learning Estimation

In this section, I propose the ML estimator of the mode treatment effect. The ML estimator can accommodate a large number of control variables, potentially more than the sample size. This flexibility enables researchers to include as many control variables as they consider important, making their identification assumptions more plausible. The key to implementing ML methods is to replace the estimation of the conditional density function with the estimation of a conditional expectation. To begin, the estimator of the conditional density function in the traditional kernel estimation is
\[
\hat f_{Y|D=1,X}(y|x) = \frac{\sum_{j=1}^n D_j K_h(y - Y_j) K_h(x - X_j)}{\sum_{j=1}^n D_j K_h(x - X_j)}.
\]
Dividing both the numerator and the denominator by $\sum_{j=1}^n K_h(x - X_j)$ gives
\[
\hat f_{Y|D=1,X}(y|x) = \frac{\sum_{j=1}^n D_j K_h(y - Y_j) K_h(x - X_j)\big/\sum_{j=1}^n K_h(x - X_j)}{\sum_{j=1}^n D_j K_h(x - X_j)\big/\sum_{j=1}^n K_h(x - X_j)}.
\]
The numerator is a kernel estimator of $E[D K_h(y - Y)|X]$ and the denominator is a kernel estimator of the propensity score $E[D|X] = \pi(X)$. Hence, $\hat f_{Y|D=1,X}(y|x)$ is an estimator of $E[D K_h(y - Y)|X]/\pi(X)$. The marginal density estimator
\[
\hat f_{Y_1}(y) = \frac{1}{n}\sum_{i=1}^n \hat f_{Y|D=1,X}(y|X_i)
\]
defined in the previous section can therefore be interpreted as an estimator of
\[
E\bigg[\frac{E[D K_h(y - Y)|X]}{\pi(X)}\bigg] = E\bigg[\frac{D K_h(y - Y)}{\pi(X)}\bigg].
\]
Therefore, we can use a machine learning estimator of $E\big[D K_h(y - Y)/\pi(X)\big]$ as an estimator of $f_{Y_1}(y)$. We have translated the estimation of the conditional density function into the estimation of a conditional expectation, namely the propensity score $\pi(X)$. We go one step further and construct the Neyman-orthogonal score (Chernozhukov et al., 2018) to gain robustness to the first-step estimation:
\[
m_1(Z, y, \eta_1) = \frac{D K_h(y - Y)}{\pi(X)} - \frac{D - \pi(X)}{\pi(X)}\,E\big[K_h(y - Y)\,\big|\,X, D=1\big], \tag{4.1}
\]
where $Z = (Y, D, X)$ and $\eta_1 = (\pi, g_1)$ with $g_1(X) \equiv E[K_h(y - Y)|X, D=1]$. Similarly, the Neyman-orthogonal score for $f_{Y_0}(y)$ is
\[
m_0(Z, y, \eta_0) = \frac{(1 - D) K_h(y - Y)}{1 - \pi(X)} - \frac{\pi(X) - D}{1 - \pi(X)}\,E\big[K_h(y - Y)\,\big|\,X, D=0\big], \tag{4.2}
\]
where $\eta_0 = (\pi, g_0)$ with $g_0(X) \equiv E[K_h(y - Y)|X, D=0]$. Equations (4.1) and (4.2) are, to the best of my knowledge, new results for density estimation. The Neyman orthogonality makes the estimation of the density functions robust to the first-step estimation: at the true nuisance functions, $E[m_1(Z, y, \eta_1)] = E\big[E[K_h(y - Y)|X, D=1]\big]$, a kernel-smoothed version of $f_{Y_1}(y)$, and small perturbations of $\pi$ and $g_1$ have no first-order effect on this expectation.
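As a small sketch (my own illustration), the two scores can be written as plain functions; `pi_hat`, `g1_hat`, and `g0_hat` stand for fitted nuisance functions and are assumptions of this sketch, not objects defined in the paper:

```python
def score_m1(y, Yi, Di, Xi, pi_hat, g1_hat, h):
    """Neyman-orthogonal score (4.1) for f_{Y_1}(y) at one observation Z_i."""
    Kh = gauss((y - Yi) / h) / h          # K_h(y - Y_i)
    p = pi_hat(Xi)
    return Di * Kh / p - (Di - p) / p * g1_hat(Xi, y)

def score_m0(y, Yi, Di, Xi, pi_hat, g0_hat, h):
    """Neyman-orthogonal score (4.2) for f_{Y_0}(y)."""
    Kh = gauss((y - Yi) / h) / h
    p = pi_hat(Xi)
    return (1.0 - Di) * Kh / (1.0 - p) - (p - Di) / (1.0 - p) * g0_hat(Xi, y)
```

Averaging `score_m1` over a sample with consistent nuisance estimates gives an estimate of (a smoothed) $f_{Y_1}(y)$ whose first-order bias is insensitive to small errors in $\hat\pi$ and $\hat g_1$.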
Now I combine (4.1) and (4.2) with the cross-fitting algorithm of Chernozhukov et al. (2018) to propose the new estimator; a code sketch follows the algorithm below.

Definition (cross-fitted estimator of the mode treatment effect).
1. Take a $K$-fold random partition $(I_k)_{k=1}^K$ of $[N] = \{1, \dots, N\}$ such that the size of each $I_k$ is $n = N/K$. For each $k \in [K] = \{1, \dots, K\}$, define the auxiliary sample $I_k^c \equiv \{1, \dots, N\}\setminus I_k$.

2. For each $k \in [K]$, use the auxiliary sample $I_k^c$ to construct machine learning estimators $\hat\pi_k(x)$, $\hat g_{1,k}(x)$, and $\hat g_{0,k}(x)$ of $\pi(x)$, $g_1(x)$, and $g_0(x)$.

3. Construct the estimators of $f_{Y_1}(y)$ and $f_{Y_0}(y)$:
\[
\hat f_{Y_1}(y) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1(Z, y, \hat\eta_{1,k})\big] \quad\text{and}\quad \hat f_{Y_0}(y) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_0(Z, y, \hat\eta_{0,k})\big],
\]
where $\hat\eta_{d,k} = (\hat\pi_k, \hat g_{d,k})$ and $\mathbb{E}_{n,k}[m(Z)] = n^{-1}\sum_{i\in I_k} m(Z_i)$.

4. Construct the estimators of $\theta_1^*$ and $\theta_0^*$:
\[
\hat\theta_1 = \arg\max_y \hat f_{Y_1}(y) \quad\text{and}\quad \hat\theta_0 = \arg\max_y \hat f_{Y_0}(y).
\]
5. Construct the estimator of the mode treatment effect: $\hat\Delta = \hat\theta_1 - \hat\theta_0$.
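A compact sketch of the whole algorithm (my own illustration under the setup of the earlier snippets; scikit-learn's LogisticRegression and RandomForestRegressor are arbitrary stand-ins for the ML learners of $\pi$ and $g_d$):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def dml_density(grid, Y, X, D, h, treated=True, n_folds=5, seed=0):
    """Cross-fitted orthogonal estimate of f_{Y_d}(y) over a grid (steps 1-3)."""
    X2 = X.reshape(len(Y), -1)
    fhat = np.zeros(len(grid))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X2):
        pi = LogisticRegression().fit(X2[train], D[train])       # step 2: pi_hat_k
        p = np.clip(pi.predict_proba(X2[test])[:, 1], 0.01, 0.99)
        sub = train[D[train] == (1.0 if treated else 0.0)]
        for g, y in enumerate(grid):
            Kh = gauss((y - Y) / h) / h                          # K_h(y - Y_i)
            reg = RandomForestRegressor(n_estimators=50, random_state=seed)
            reg.fit(X2[sub], Kh[sub])                            # g_d(x) = E[K_h(y-Y)|X, D=d]
            gx = reg.predict(X2[test])
            Dk, Kk = D[test], Kh[test]
            if treated:                                           # score (4.1)
                s = Dk * Kk / p - (Dk - p) / p * gx
            else:                                                 # score (4.2)
                s = (1.0 - Dk) * Kk / (1.0 - p) - (p - Dk) / (1.0 - p) * gx
            fhat[g] += s.sum()
    return fhat / len(Y)

# steps 4-5: maximize each density estimate, then difference the modes
f1_dml = dml_density(grid, Y, X, D, h, treated=True)
f0_dml = dml_density(grid, Y, X, D, h, treated=False)
delta_hat_dml = grid[f1_dml.argmax()] - grid[f0_dml.argmax()]
```

Refitting $g_d$ at every grid point is computationally naive but keeps the sketch faithful to the definition; in practice one would coarsen the grid or fit a single conditional-density learner.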
Theorem 4. Suppose that, with probability $1 - o(1)$, $\|\hat\eta_k - \eta_0\|_{P,2} \le \varepsilon_N$, $\|\hat\pi_k - 1/2\|_{P,\infty} \le 1/2 - \kappa$, and
\[
\|\hat\pi_k - \pi_0\|_{P,2}^2 + \|\hat\pi_k - \pi_0\|_{P,2}\times\|\hat g_k - g_0\|_{P,2} \le (\varepsilon_N)^2.
\]
If $\varepsilon_N = o\big((Nh^3)^{-1/4}\big)$ and $Nh^7 \to 0$, then
\[
\sqrt{Nh^3}\,\big(\hat\theta_1 - \theta_1^*\big) \xrightarrow{d} N\big(0,\, M_1^{-1}V_1M_1^{-1}\big) \quad\text{and}\quad \sqrt{Nh^3}\,\big(\hat\theta_0 - \theta_0^*\big) \xrightarrow{d} N\big(0,\, M_0^{-1}V_0M_0^{-1}\big).
\]

As for the variance estimation, recall that the kernel estimator of $M_1$ in the previous section is $\hat M_1 = N^{-1}\sum_{i=1}^N \hat f^{(2)}_{Y|D=1,X}(\hat\theta_1|X_i)$, where
\[
\hat f^{(2)}_{Y|D=1,X}(y|x) = \frac{\sum_{j=1}^n D_j K^{(2)}_h(y - Y_j) K_h(x - X_j)}{\sum_{j=1}^n D_j K_h(x - X_j)}.
\]
Dividing both the numerator and the denominator by $\sum_{j=1}^n K_h(x - X_j)$, the numerator becomes a kernel estimator of $E[D K^{(2)}_h(y - Y)|X]$ and the denominator a kernel estimator of the propensity score $E[D|X] = \pi(X)$. Hence $\hat f^{(2)}_{Y|D=1,X}(y|x)$ is an estimator of $E[D K^{(2)}_h(y - Y)|X]/\pi(X)$, and we can use a machine learning estimator of $E\big[D K^{(2)}_h(y - Y)/\pi(X)\big]$ as an estimator of $M_1$. We can also construct a DML estimator using the Neyman-orthogonal functional form
\[
\frac{D K^{(2)}_h(y - Y)}{\pi(X)} - \frac{D - \pi(X)}{\pi(X)}\,E\big[K^{(2)}_h(y - Y)\,\big|\,X, D=1\big].
\]
In step 1, we use machine learning methods to estimate $\pi(X)$ and $E[K^{(2)}_h(\hat\theta_1 - Y)|X, D=1]$ on the auxiliary sample $I_k^c$. In step 2, we construct the DML estimator of $M_1$:
\[
\hat M_1 = \frac{1}{K}\sum_{k=1}^K \frac{1}{n}\sum_{i\in I_k}\Bigg[\frac{D_i K^{(2)}_h(\hat\theta_1 - Y_i)}{\hat\pi(X_i)} - \frac{D_i - \hat\pi(X_i)}{\hat\pi(X_i)}\,\hat E\big[K^{(2)}_h(\hat\theta_1 - Y)\,\big|\,X_i, D=1\big]\Bigg].
\]
By the general DML theory (Chernozhukov et al., 2018), $\hat M_1$ is a consistent estimator of $M_1$. Similarly, we can construct the DML estimators for $V_1$, $M_0$, and $V_0$ using the following correspondences:

Original form and equivalent (orthogonal) form:
• $M_1 = E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big]$: use $E\Big[\frac{D K^{(2)}_h(\theta_1^* - Y)}{\pi(X)} - \frac{D - \pi(X)}{\pi(X)}\,E\big[K^{(2)}_h(\theta_1^* - Y)|X, D=1\big]\Big]$.
• $V_1 = \kappa_0^{(1)}\,E\Big[\frac{f_{Y|X,D=1}(\theta_1^*|X)}{P(D=1|X)}\Big]$: use $\kappa_0^{(1)}\,E\Big[\frac{D K_h(\theta_1^* - Y)}{\pi(X)^2} - \frac{D - \pi(X)}{\pi(X)^2}\,E\big[K_h(\theta_1^* - Y)|X, D=1\big]\Big]$.
• $M_0 = E\big[f^{(2)}_{Y|X,D=0}(\theta_0^*|X)\big]$: use $E\Big[\frac{(1-D) K^{(2)}_h(\theta_0^* - Y)}{1 - \pi(X)} - \frac{\pi(X) - D}{1 - \pi(X)}\,E\big[K^{(2)}_h(\theta_0^* - Y)|X, D=0\big]\Big]$.
• $V_0 = \kappa_0^{(1)}\,E\Big[\frac{f_{Y|X,D=0}(\theta_0^*|X)}{P(D=0|X)}\Big]$: use $\kappa_0^{(1)}\,E\Big[\frac{(1-D) K_h(\theta_0^* - Y)}{(1 - \pi(X))^2} - \frac{\pi(X) - D}{(1 - \pi(X))^2}\,E\big[K_h(\theta_0^* - Y)|X, D=0\big]\Big]$.
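Mirroring the first row of the list above, here is a hedged sketch of the cross-fitted $\hat M_1$ (my own code, continuing the previous snippets; for the Gaussian kernel, $K^{(2)}(u) = (u^2 - 1)K(u)$):

```python
def gauss2(u):
    """Second derivative of the Gaussian kernel: K''(u) = (u^2 - 1) K(u)."""
    return (u**2 - 1.0) * gauss(u)

def dml_M1(theta1, Y, X, D, h, n_folds=5, seed=0):
    """Cross-fitted orthogonal estimate of M_1 = E[f^(2)_{Y|X,D=1}(theta1 | X)]."""
    X2 = X.reshape(len(Y), -1)
    K2h = gauss2((theta1 - Y) / h) / h**3                 # K^(2)_h(theta1 - Y_i)
    total = 0.0
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X2):
        pi = LogisticRegression().fit(X2[train], D[train])
        p = np.clip(pi.predict_proba(X2[test])[:, 1], 0.01, 0.99)
        t1 = train[D[train] == 1.0]
        reg = RandomForestRegressor(n_estimators=50, random_state=seed)
        reg.fit(X2[t1], K2h[t1])                          # E[K^(2)_h(theta1 - Y) | X, D = 1]
        gx = reg.predict(X2[test])
        Dk = D[test]
        total += (Dk * K2h[test] / p - (Dk - p) / p * gx).sum()
    return total / len(Y)

M1_hat = dml_M1(grid[f1_dml.argmax()], Y, X, D, h)
```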
5 Conclusion

This paper studies the estimation and inference of the mode treatment effect, which, compared with the average and the quantile treatment effect, has been ignored in the treatment effect literature. I propose both kernel and ML estimators to accommodate the variety of data sets faced by researchers, and I derive the asymptotic properties of the proposed estimators. I show that both estimators are consistent and asymptotically normal, with rate of convergence $\sqrt{Nh^3}$.

References
Abadie, A., Angrist, J., & Imbens, G. (2002). Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70(1), 91–117.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2), 1148–1178.
Belloni, A., Chen, D., Chernozhukov, V., & Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6), 2369–2429.
Belloni, A., Chernozhukov, V., Fernández-Val, I., & Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1), 233–298.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2), 608–650.
Bitler, M. P., Gelbach, J. B., & Hoynes, H. W. (2006). What mean impacts miss: Distributional effects of welfare reform experiments. American Economic Review, 96(4), 988–1012.
Card, D. (1996). The effect of unions on the structure of wages: A longitudinal analysis. Econometrica, 64(4), 957–979.
Chen, Y.-C., Genovese, C. R., Tibshirani, R. J., & Wasserman, L. (2016). Nonparametric modal regression. The Annals of Statistics, 44(2), 489–514.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.
Chernozhukov, V., Escanciano, J. C., Ichimura, H., & Newey, W. K. (2016). Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.
Chernozhukov, V., & Hansen, C. (2005). An IV model of quantile treatment effects. Econometrica, 73(1), 245–261.
Chernozhukov, V., Hansen, C., & Spindler, M. (2015). Valid post-selection and post-regularization inference: An elementary, general approach. Annual Review of Economics, 7(1), 649–688.
DiNardo, J., Fortin, N. M., & Lemieux, T. (1995). Labor market institutions and the distribution of wages, 1973-1992: A semiparametric approach (Tech. Rep.). National Bureau of Economic Research.
Eddy, W. F. (1980). Optimum kernel estimators of the mode. The Annals of Statistics, 8(4), 870–882.
Firpo, S. (2007). Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75(1), 259–276.
Freeman, R. B. (1980). Unionism and the dispersion of wages. ILR Review, 34(1), 3–23.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2), 315–331.
Hansen, B. E. (2008). Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24(3), 726–748.
Heckman, J., Ichimura, H., & Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. The Review of Economic Studies, 64(4), 605–654.
Heckman, J., & Robb, R. (1985). Alternative methods for evaluating the impact of interventions: An overview. Journal of Econometrics, 30(1-2), 239–267.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161–1189.
Lee, M.-J. (1989). Mode regression. Journal of Econometrics, 42(3), 337–349.
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245.
Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3), 1065–1076.
Robins, J. M., & Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429), 122–129.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
Tauchen, G. (1985). Diagnostic testing and evaluation of maximum likelihood models. Journal of Econometrics, 30(1-2), 415–443.
Van der Vaart, A. W. (2000). Asymptotic statistics (Vol. 3). Cambridge University Press.
Yao, W., & Li, L. (2014). A new regression model: Modal linear regression. Scandinavian Journal of Statistics, 41(3), 656–671.

Appendix
Proof of Theorem 1.
We only present the proof of the first claim, $\hat\theta_1 \xrightarrow{p} \theta_1^*$, since the second claim follows from the same arguments. The proof proceeds in two steps. In Step 1, we show that the uniform law of large numbers holds:
\[
\sup_y \big|\hat f_{Y_1}(y) - f_{Y_1}(y)\big| = o_p(1).
\]
In Step 2, we establish the consistency $\hat\theta_1 \xrightarrow{p} \theta_1^*$ using the same argument as Theorem 5.7 in Van der Vaart (2000).

Step 1. Notice that we have the decomposition
\[
\hat f_{Y_1}(y) - f_{Y_1}(y) = \frac{1}{n}\sum_{i=1}^n \hat f_{Y|D=1,X}(y|X_i) - E\big[f_{Y|D=1,X}(y|X)\big] = A(y) + B(y),
\]
where
\[
A(y) = \frac{1}{n}\sum_{i=1}^n \Big(\hat f_{Y|D=1,X}(y|X_i) - f_{Y|D=1,X}(y|X_i)\Big), \qquad B(y) = \frac{1}{n}\sum_{i=1}^n f_{Y|D=1,X}(y|X_i) - E\big[f_{Y|D=1,X}(y|X)\big].
\]
Hence,
\[
\sup_y \big|\hat f_{Y_1}(y) - f_{Y_1}(y)\big| \le \sup_y |A(y)| + \sup_y |B(y)|.
\]
By Theorem 6 in Hansen (2008) (uniform rates of convergence of kernel estimators), the first term is bounded as
\[
\sup_y |A(y)| \le \sup_y \frac{1}{n}\sum_{i=1}^n \Big|\hat f_{Y|D=1,X}(y|X_i) - f_{Y|D=1,X}(y|X_i)\Big| \le \sup_{x,y}\Big|\hat f_{Y|D=1,X}(y|x) - f_{Y|D=1,X}(y|x)\Big| = O_p\Bigg(\sqrt{\frac{\ln n}{nh^{d+1}}} + h^2\Bigg) = o_p(1).
\]
On the other hand, by Lemma 1 of Tauchen (1985) (uniform law of large numbers), we have
\[
\sup_{y\in\mathcal{Y}} |B(y)| = \sup_{y\in\mathcal{Y}} \Bigg|\frac{1}{n}\sum_{i=1}^n f_{Y|D=1,X}(y|X_i) - E\big[f_{Y|D=1,X}(y|X)\big]\Bigg| \xrightarrow{p} 0.
\]
Combining the bounds on $\sup_y|A(y)|$ and $\sup_y|B(y)|$ gives $\sup_y |\hat f_{Y_1}(y) - f_{Y_1}(y)| = o_p(1)$.

Step 2. The definition of $\hat\theta_1$ implies that $\hat f_{Y_1}(\hat\theta_1) \ge \hat f_{Y_1}(\theta_1^*)$. Therefore, we have
\[
f_{Y_1}(\theta_1^*) - f_{Y_1}(\hat\theta_1) = f_{Y_1}(\theta_1^*) - \hat f_{Y_1}(\theta_1^*) + \hat f_{Y_1}(\theta_1^*) - f_{Y_1}(\hat\theta_1) \le f_{Y_1}(\theta_1^*) - \hat f_{Y_1}(\theta_1^*) + \hat f_{Y_1}(\hat\theta_1) - f_{Y_1}(\hat\theta_1) \le 2\sup_y \big|\hat f_{Y_1}(y) - f_{Y_1}(y)\big|.
\]
By Step 1, for any $\delta > 0$,
\[
P\Big(f_{Y_1}(\theta_1^*) - f_{Y_1}(\hat\theta_1) > \delta\Big) \le P\Big(\sup_y \big|\hat f_{Y_1}(y) - f_{Y_1}(y)\big| > \delta/2\Big) \to 0.
\]
Further, Assumption 1 implies that for any $\varepsilon > 0$ there exists $\delta > 0$ such that
\[
\sup_{y: |y - \theta_1^*| > \varepsilon} f_{Y_1}(y) < f_{Y_1}(\theta_1^*) - \delta.
\]
Then the following inequality holds:
\[
P\big(|\hat\theta_1 - \theta_1^*| > \varepsilon\big) \le P\Big(f_{Y_1}(\hat\theta_1) < f_{Y_1}(\theta_1^*) - \delta\Big) \le P\Big(f_{Y_1}(\theta_1^*) - f_{Y_1}(\hat\theta_1) > \delta\Big) \to 0.
\]
Thus, we prove the consistency $\hat\theta_1 \xrightarrow{p} \theta_1^*$.

Proof of Theorem 2.
Here we focus on the result for $\hat\theta_1$ only. Notice that the first-order condition for $\hat\theta_1$ gives
\[
0 = \hat f^{(1)}_{Y_1}(\hat\theta_1) = \frac{1}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\hat\theta_1|X_i) = \frac{1}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i)\,\big(\hat\theta_1 - \theta_1^*\big),
\]
where $\tilde\theta_1$ lies between $\hat\theta_1$ and $\theta_1^*$. Then we have
\[
\sqrt{nh^3}\,\big(\hat\theta_1 - \theta_1^*\big) = -\Bigg[\frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i)\Bigg]^{-1}\,\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i).
\]
In Step 1, we show that the first factor converges to $M_1 = E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big]$ in probability. In Steps 2-5, we show the asymptotic normality of the second term. Then, by Slutsky's theorem, the asymptotic normality of $\hat\theta_1$ follows. In Step 6, we show the asymptotic normality of $\hat\Delta$.

For convenience, we define $\gamma_1(x) \equiv f^{(1)}_{Y,X|D=1}(\theta_1^*, x)$, $\gamma_2(x) \equiv f_{X|D=1}(x)$, and
\[
\hat\gamma_1(x) \equiv \frac{1}{n}\sum_{j=1}^n \frac{D_j K^{(1)}_h(\theta_1^* - Y_j)K_h(x - X_j)}{P(D=1)}, \qquad \hat\gamma_2(x) \equiv \frac{1}{n}\sum_{j=1}^n \frac{D_j K_h(x - X_j)}{P(D=1)}.
\]
In this notation, we can express $\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|x)$ and $f^{(1)}_{Y|X,D=1}(\theta_1^*|x)$ as $\hat\gamma_1(x)/\hat\gamma_2(x)$ and $\gamma_1(x)/\gamma_2(x)$, respectively. Also, let $\gamma = (\gamma_1, \gamma_2)'$ and $\hat\gamma = (\hat\gamma_1, \hat\gamma_2)'$.

Step 1. In this step, we show that $n^{-1}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) \xrightarrow{p} E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big]$. Notice that
\[
\frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) = \frac{1}{n}\sum_{i=1}^n f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i) + A_1 + A_2,
\]
where
\[
A_1 = \frac{1}{n}\sum_{i=1}^n \Big(\hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i)\Big) \quad\text{and}\quad A_2 = \frac{1}{n}\sum_{i=1}^n \Big(f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\Big).
\]
Since $\frac{1}{n}\sum_{i=1}^n f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i) \xrightarrow{p} E\big[f^{(2)}_{Y|X,D=1}(\theta_1^*|X)\big]$ by the law of large numbers, we only have to show that $A_1 = o_p(1)$ and $A_2 = o_p(1)$. Note that
\[
|A_1| \le \frac{1}{n}\sum_{i=1}^n \Big|\hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i)\Big| \le \sup_{y,x}\Big|\hat f^{(2)}_{Y|X,D=1}(y|x) - f^{(2)}_{Y|X,D=1}(y|x)\Big| = O_p\Bigg(\sqrt{\frac{\ln n}{nh^{d+5}}} + h^2\Bigg) = o_p(1),
\]
where the last rate follows from the uniform rates of convergence of kernel estimators (Hansen, 2008). For $A_2$, we use the argument in Lemma 4.3 of Newey & McFadden (1994). By the consistency of $\hat\theta_1$, and thus of $\tilde\theta_1$, there is $\delta_n \to 0$ such that $|\tilde\theta_1 - \theta_1^*| \le \delta_n$ with probability approaching one. Define
\[
\Delta_n(X_i) = \sup_{|y - \theta_1^*| \le \delta_n}\Big|f^{(2)}_{Y|X,D=1}(y|X_i) - f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\Big|.
\]
By the continuity of $f^{(2)}_{Y|X,D=1}(y|X_i)$ at $\theta_1^*$, $\Delta_n(X_i) \xrightarrow{p} 0$. Hence, by the dominated convergence theorem, $E[\Delta_n(X_i)] \to 0$. Then, by Markov's inequality,
\[
P\Bigg(\frac{1}{n}\sum_{i=1}^n \Delta_n(X_i) > \epsilon\Bigg) \le E[\Delta_n(X_i)]/\epsilon \to 0.
\]
Therefore, we have $|A_2| \le \frac{1}{n}\sum_{i=1}^n \Delta_n(X_i) + o_p(1) = o_p(1)$.

Step 2.
In this step, we show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \hat\gamma - \gamma) + o_p(1),
\]
where $G(z, \gamma) = \gamma_2(x)^{-1}\big[1, -\gamma_1(x)/\gamma_2(x)\big]\,\gamma(x)$, the coefficients being evaluated at the true $\gamma$, and $z = (y, x, d)$ denotes a data observation. To do this, it suffices to show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \Big[\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) - f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) - G(Z_i, \hat\gamma - \gamma)\Big] = o_p(1).
\]
Using the notation for $\gamma$, we have
\[
\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|x) - f^{(1)}_{Y|X,D=1}(\theta_1^*|x) = \frac{\hat\gamma_1(x)}{\hat\gamma_2(x)} - \frac{\gamma_1(x)}{\gamma_2(x)}.
\]
The following argument follows Newey & McFadden (1994). Consider the algebraic relation
\[
\tilde a/\tilde b - a/b = b^{-1}\Big[1 - \tilde b^{-1}\big(\tilde b - b\big)\Big]\Big[\tilde a - a - (a/b)\big(\tilde b - b\big)\Big].
\]
The linear part of the right-hand side is $b^{-1}\big[\tilde a - a - (a/b)(\tilde b - b)\big]$, and the remaining term is of higher order. Letting $a = \gamma_1$, $\tilde a = \hat\gamma_1$, $b = \gamma_2$, and $\tilde b = \hat\gamma_2$, this linear part corresponds to the linear functional $G(Z_i, \hat\gamma - \gamma)$. The remaining higher-order term satisfies
\[
\bigg|\frac{\hat\gamma_1(x)}{\hat\gamma_2(x)} - \frac{\gamma_1(x)}{\gamma_2(x)} - G(z, \hat\gamma - \gamma)\bigg| \le |\hat\gamma_2(x)|^{-1}\,\gamma_2(x)^{-1}\bigg[1 + \bigg|\frac{\gamma_1(x)}{\gamma_2(x)}\bigg|\bigg]\Big[\big(\hat\gamma_1(x) - \gamma_1(x)\big)^2 + \big(\hat\gamma_2(x) - \gamma_2(x)\big)^2\Big] \le C\,\sup_{x\in\mathcal{X}}\big\|\hat\gamma(x) - \gamma(x)\big\|^2
\]
for some constant $C$ if $\gamma_2$ and $\hat\gamma_2$ are bounded away from zero. Hence the claim holds if $\sqrt{nh^3}\,\sup_{x\in\mathcal{X}}\|\hat\gamma(x) - \gamma(x)\|^2 \xrightarrow{p} 0$. By the uniform rates of convergence of kernel estimators (Hansen, 2008), we have
\[
\sup_{x\in\mathcal{X}}\big\|\hat\gamma(x) - \gamma(x)\big\|^2 \le \sup_{x\in\mathcal{X}}\big(\hat\gamma_1(x) - \gamma_1(x)\big)^2 + \sup_{x\in\mathcal{X}}\big(\hat\gamma_2(x) - \gamma_2(x)\big)^2 = O_p\Big[(\ln n)\big(nh^{d+3}\big)^{-1} + h^4\Big] + O_p\Big[(\ln n)\big(nh^{d}\big)^{-1} + h^4\Big] = O_p\Big[(\ln n)\big(nh^{d+3}\big)^{-1} + h^4\Big].
\]
The rates of $h$ and $n$ imply that $\sqrt{nh^3}\,\sup_{x\in\mathcal{X}}\|\hat\gamma(x) - \gamma(x)\|^2 \xrightarrow{p} 0$.

Step 3.
In this step, we show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \sqrt{nh^3}\int G(z, \hat\gamma - \gamma)\,dF(z) + o_p(1),
\]
where $F$ is the c.d.f. of $z$. To do this, it suffices to show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \hat\gamma - \gamma) - \sqrt{nh^3}\int G(z, \hat\gamma - \gamma)\,dF(z) = o_p(1).
\]
Let $\bar\gamma \equiv E[\hat\gamma]$. By the linearity of $G(z, \gamma)$ in $\gamma$, we have the decomposition $G(z, \hat\gamma - \gamma) = G(z, \hat\gamma - \bar\gamma) + G(z, \bar\gamma - \gamma)$. Therefore we just need to show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \hat\gamma - \bar\gamma) - \sqrt{nh^3}\int G(z, \hat\gamma - \bar\gamma)\,dF(z) = o_p(1)
\]
and
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \bar\gamma - \gamma) - \sqrt{nh^3}\int G(z, \bar\gamma - \gamma)\,dF(z) = o_p(1).
\]
The second condition holds by the central limit theorem, since
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \bar\gamma - \gamma) - \sqrt{nh^3}\int G(z, \bar\gamma - \gamma)\,dF(z) = \sqrt{nh^3}\,O_p\big(n^{-1/2}\big) = O_p\big(h^{3/2}\big) = o_p(1).
\]
It remains to show the first condition. We follow the arguments in Newey & McFadden (1994). Define
\[
q_j \equiv \bigg(\frac{D_j K^{(1)}_h(\theta_1^* - Y_j)}{P(D=1)},\; \frac{D_j}{P(D=1)}\bigg)',
\]
so that we can rewrite
\[
\hat\gamma(x) = \begin{bmatrix}\hat\gamma_1(x)\\ \hat\gamma_2(x)\end{bmatrix} = \frac{1}{n}\sum_{j=1}^n q_j K_h(x - X_j).
\]
We also define
\[
m(Z_i, Z_j) = G\big[Z_i,\, q_j K_h(\cdot - X_j)\big], \qquad m_1(z) = \int m(z, \tilde z)\,dF(\tilde z) = G(z, \bar\gamma), \qquad m_2(z) = \int m(\tilde z, z)\,dF(\tilde z) = \int G\big[\tilde z,\, q K_h(\cdot - x)\big]\,dF(\tilde z).
\]
Then the left-hand side of the first condition equals
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n G(Z_i, \hat\gamma - \bar\gamma) - \sqrt{nh^3}\int G(z, \hat\gamma - \bar\gamma)\,dF(z) = \sqrt{nh^3}\Bigg[\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n m(Z_i, Z_j) - \frac{1}{n}\sum_{i=1}^n m_1(Z_i) - \frac{1}{n}\sum_{i=1}^n m_2(Z_i) + E\big[m(Z_1, Z_2)\big]\Bigg]
\]
\[
= \sqrt{nh^3}\times O_p\Bigg(\frac{E\big[|m(Z_1, Z_1)|\big]}{n} + \frac{\Big(E\big[|m(Z_1, Z_2)|^2\big]\Big)^{1/2}}{n}\Bigg),
\]
where the last equality follows from Lemma 8.4 of Newey & McFadden (1994). The last term converges to zero in probability if we can control the rates of $E[|m(Z_1, Z_1)|]$ and $E[|m(Z_1, Z_2)|^2]$. Notice that $|G(z, \gamma)| \le b(z)\|\gamma\|$ with
\[
b(z) = \Big\|f_{X|D=1}(x)^{-1}\Big[1,\, -f^{(1)}_{Y|X,D=1}(\theta_1^*|x)\Big]\Big\|,
\]
where $\|\cdot\|$ denotes the $\ell_2$ norm. Then $\big|G\big(z, qK_h(\cdot - x)\big)\big| \le b(z)\,h^{-d}\bar K^d\,\|q\|$ by the boundedness of $K(u)$. Because $f_{X|D=1}(x)$ is bounded away from zero and $f^{(1)}_{Y|X,D=1}(\theta_1^*|x)$ is bounded from above, we have $E[b(Z)^2] < \infty$. Therefore,
\[
\sqrt{nh^3}\times O_p\Bigg(\frac{E\big[|m(Z_1, Z_1)|\big]}{n} + \frac{\Big(E\big[|m(Z_1, Z_2)|^2\big]\Big)^{1/2}}{n}\Bigg) = \sqrt{nh^3}\times O_p\big(n^{-1}h^{-d-2}\big) = o_p(1)
\]
by the assumptions on $n$ and $h$. The additional factor $h^{-2}$ in the rate of convergence comes from the fact that $q$ contains $K^{(1)}_h(u) = h^{-2}K^{(1)}(u/h)$ with bounded $K^{(1)}(u)$.

Step 4.
In this step, we show that
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n v(X_i)q_i + o_p(1),
\]
where $v(X_i) = \frac{P(D=1)}{P(D=1|X_i)}\big[1, -\gamma_1(X_i)/\gamma_2(X_i)\big]$ and $q_i = \big(D_i K^{(1)}_h(\theta_1^* - Y_i)/P(D=1),\; D_i/P(D=1)\big)'$. To do this, it suffices to show that
\[
\sqrt{nh^3}\int G(z, \hat\gamma - \gamma)\,dF(z) - \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n v(X_i)q_i = o_p(1).
\]
Notice that
\[
\int G(z, \gamma)\,dF(z) = \int \gamma_2(x)^{-1}\Big[1, -\frac{\gamma_1(x)}{\gamma_2(x)}\Big]\gamma(x)\,f_X(x)\,dx = \int \frac{P(D=1)}{P(D=1|X=x)}\Big[1, -\frac{\gamma_1(x)}{\gamma_2(x)}\Big]\gamma(x)\,dx = \int v(x)\gamma(x)\,dx,
\]
where $f_X(x)$ is the density function of $X$ and we use $f_{X|D=1}(x) = P(D=1|X=x)f_X(x)/P(D=1)$. Also, we have
\[
v(x)\gamma(x) = \frac{P(D=1)}{P(D=1|X=x)}\Big[1, -\frac{\gamma_1(x)}{\gamma_2(x)}\Big]\begin{bmatrix}\gamma_1(x)\\ \gamma_2(x)\end{bmatrix} = \frac{P(D=1)}{P(D=1|X=x)}\big(\gamma_1(x) - \gamma_1(x)\big) = 0.
\]
Therefore,
\[
\int G(z, \hat\gamma - \gamma)\,dF(z) = \int v(x)\hat\gamma(x)\,dx - \int v(x)\gamma(x)\,dx = \int v(x)\hat\gamma(x)\,dx = \frac{1}{n}\sum_{i=1}^n \int v(x)\,q_i K_h(x - X_i)\,dx = \frac{1}{n}\sum_{i=1}^n v(X_i)q_i + \frac{1}{n}\sum_{i=1}^n \Big[\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big]q_i.
\]
By Chebyshev's inequality, sufficient conditions for $\sqrt{nh^3}$ times the second term in the last line to converge to zero in probability are
\[
\sqrt{nh^3}\,\Big\|E\Big[\Big(\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big)q_i\Big]\Big\| \to 0
\]
and
\[
h^3\,E\Big[\|q_i\|^2\,\Big\|\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big\|^2\Big] \to 0.
\]
The expectation in the first condition is the difference of $E\big[\big(\int v(x)K_h(x - X_i)\,dx\big)q_i\big]$ and $E[v(X_i)q_i]$. We begin with $E[v(X_i)q_i]$. By the law of iterated expectations,
\[
E[v(X_i)q_i] = E\big[v(X_i)E[q_i|X_i]\big] = E\Bigg[v(X_i)\frac{P(D=1|X_i)}{P(D=1)}\begin{pmatrix} E\big[K^{(1)}_h(\theta_1^* - Y_i)\,|\,X_i, D=1\big] \\ 1 \end{pmatrix}\Bigg].
\]
The inner conditional expectation satisfies
\[
E\big[K^{(1)}_h(\theta_1^* - Y_i)\,\big|\,X_i, D=1\big] = \frac{1}{h^2}\int K^{(1)}\Big(\frac{\theta_1^* - y}{h}\Big) f_{Y|X,D=1}(y|X_i)\,dy = \frac{1}{h}\int K\Big(\frac{\theta_1^* - y}{h}\Big) f^{(1)}_{Y|X,D=1}(y|X_i)\,dy = \int K(u)\,f^{(1)}_{Y|X,D=1}(\theta_1^* + hu|X_i)\,du
\]
\[
= f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \int huK(u)\,f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\,du + \int \frac{h^2u^2}{2}K(u)\,f^{(3)}_{Y|X,D=1}(\tilde\theta_1|X_i)\,du = f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{h^2\kappa_2}{2}\,f^{(3)}_{Y|X,D=1}(\tilde\theta_1|X_i),
\]
with $\tilde\theta_1$ between $\theta_1^*$ and $\theta_1^* + hu$ and $\kappa_2 = \int u^2 K(u)\,du$. The second equality follows from integration by parts and the third from a change of variables. Hence,
\[
E[v(X_i)q_i] = E\Bigg[v(X_i)\frac{P(D=1|X_i)}{P(D=1)}\begin{pmatrix} f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) \\ 1 \end{pmatrix}\Bigg] + O(h^2) = \int v(x)\begin{pmatrix} f^{(1)}_{Y|X,D=1}(\theta_1^*|x) \\ 1 \end{pmatrix} f_{X|D=1}(x)\,dx + O(h^2) = \int v(x)\begin{pmatrix} f^{(1)}_{Y,X|D=1}(\theta_1^*, x) \\ f_{X|D=1}(x) \end{pmatrix} dx + O(h^2) = \int v(x)\gamma(x)\,dx + O(h^2).
\]
Next,
\[
E\Big[\Big(\int v(x)K_h(x - X_i)\,dx\Big)q_i\Big] = E\Big[\Big(\int v(X_i + hu)K(u)\,du\Big)q_i\Big] = \int\Big(\int v(x + hu)K(u)\,du\Big)\gamma(x)\,dx + O(h^2).
\]
Then the first condition equals
\[
\sqrt{nh^3}\,\Big\|E\Big[\Big(\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big)q_i\Big]\Big\| = \sqrt{nh^3}\,\Big\|\int\Big(\int v(x + hu)K(u)\,du\Big)\gamma(x)\,dx - \int v(x)\gamma(x)\,dx + O(h^2)\Big\|.
\]
Following the argument in Theorem 8.11 of Newey & McFadden (1994), the last line satisfies
\[
\sqrt{nh^3}\,\Big\|\int\int v(x)K(u)\gamma(x - hu)\,du\,dx - \int v(x)\gamma(x)\,dx + O(h^2)\Big\| = \sqrt{nh^3}\,\Big\|\int v(x)\Big\{\int K(u)\big[\gamma(x - hu) - \gamma(x)\big]\,du\Big\}dx + O(h^2)\Big\| \le \sqrt{nh^3}\,Ch^2\int\|v(x)\|\,dx + O\big(\sqrt{nh^3}\,h^2\big) = O\big(\sqrt{nh^3}\,h^2\big).
\]
Therefore the first condition holds if $\sqrt{nh^3}\,h^2 \to 0$.

For the second condition, note that $h^3 E[\|q_i\|^2\,|\,X_i]$ is bounded uniformly in $X_i$ (by the same calculation as in Step 5 below), so it suffices to show that $E\big[\|\int v(x)K_h(x - X_i)\,dx - v(X_i)\|^2\big] \to 0$. By the continuity of $v(x)$, $v(x + hu) \to v(x)$ for all $x$ and $u$ as $h \to 0$. By the dominated convergence theorem, $\int v(x)K_h(x - x_i)\,dx = \int v(x_i + hu)K(u)\,du \to v(x_i)$ for all $x_i$. Therefore,
\[
E\Big[\Big\|\int v(x)K_h(x - X_i)\,dx - v(X_i)\Big\|^2\Big] = E\Big[\Big\|\int v(X_i + hu)K(u)\,du - v(X_i)\Big\|^2\Big] \to 0.
\]

Step 5. By Step 4 and the definitions of $v(X_i)$ and $q_i$, we have
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n v(X_i)q_i + o_p(1)
\]
\[
= \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\Big[1 - \frac{D_i}{P(D=1|X_i)}\Big] + \frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}_h(\theta_1^* - Y_i) + o_p(1).
\]
Since $E\big[f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\big(1 - \frac{D_i}{P(D=1|X_i)}\big)\big] = 0$ by the law of iterated expectations, the central limit theorem applies to the first term on the right-hand side.
Hence,
\[
\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) = O_p\big(\sqrt{nh^3}\,n^{-1/2}\big) + \frac{1}{\sqrt{nh}}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big) + o_p(1) = \frac{1}{\sqrt{nh}}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big) + o_p(1),
\]
since $O_p(\sqrt{nh^3}\,n^{-1/2}) = O_p(h^{3/2}) = o_p(1)$. In the remainder of this step, we show that
\[
\frac{1}{\sqrt{nh}}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big) \xrightarrow{d} N(0, V_1),
\]
where $V_1 = \kappa_0^{(1)}\,E\big[f_{Y|X,D=1}(\theta_1^*|X)/P(D=1|X)\big]$ and $\kappa_0^{(1)} = \int K^{(1)}(u)^2\,du$. For convenience, define
\[
\hat g(\theta_1^*) \equiv \frac{1}{nh^2}\sum_{i=1}^n \frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big).
\]
Then it is equivalent to show that $\sqrt{nh^3}\,\big(\hat g(\theta_1^*) - 0\big) \xrightarrow{d} N(0, V_1)$.
To use the central limit theorem, we calculate $E[\hat g(\theta_1^*)]$ and $\mathrm{Var}(\hat g(\theta_1^*))$. First,
\[
E[\hat g(\theta_1^*)] = \frac{1}{h^2}E\Big[\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big] = \frac{1}{h^2}E\Big[\frac{1}{P(D=1|X_i)}\,E\Big[D_i K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big|\,X_i\Big]\Big] = \frac{1}{h^2}E\Big[E\Big[K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big|\,X_i, D=1\Big]\Big].
\]
Since $h^{-2}E\big[K^{(1)}\big(\frac{\theta_1^* - Y_i}{h}\big)\big|\,X_i, D=1\big] = f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i) + \frac{h^2\kappa_2}{2}\,f^{(3)}_{Y|X,D=1}(\tilde\theta_1|X_i)$ from the calculation in Step 4, we obtain
\[
E[\hat g(\theta_1^*)] = E\big[f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\big] + O(h^2) = 0 + O(h^2).
\]
For the variance,
\[
\mathrm{Var}\big(\hat g(\theta_1^*)\big) = \frac{1}{nh^4}\,\mathrm{Var}\Big(\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big) = \frac{1}{nh^4}\,E\Big[\frac{D_i}{P(D=1|X_i)^2}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)^2\Big] - \frac{1}{nh^4}\,E\Big[\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big]^2
\]
\[
= \frac{1}{nh^4}\,E\Big[\frac{1}{P(D=1|X_i)}\,E\Big[K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)^2\Big|\,X_i, D=1\Big]\Big] + \frac{1}{nh^4}\,O(h^8),
\]
where the last step uses the law of iterated expectations and the fact that the mean of each term is $h^2 E[\hat g(\theta_1^*)] = O(h^4)$. The inner expectation equals
\[
E\Big[K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)^2\Big|\,X, D=1\Big] = \int K^{(1)}\Big(\frac{\theta_1^* - y}{h}\Big)^2 f_{Y|X,D=1}(y|X)\,dy = h\int K^{(1)}(u)^2 f_{Y|X,D=1}(\theta_1^* + hu|X)\,du = h\,f_{Y|X,D=1}(\theta_1^*|X)\int K^{(1)}(u)^2\,du + h^2 f^{(1)}_{Y|X,D=1}(\tilde\theta_1|X)\int u K^{(1)}(u)^2\,du,
\]
where $\tilde\theta_1$ lies between $\theta_1^*$ and $\theta_1^* + hu$. Defining $\kappa_0^{(1)} = \int K^{(1)}(u)^2\,du$ and $\kappa_1^{(1)} = \int u K^{(1)}(u)^2\,du$, the variance equals
\[
\mathrm{Var}\big(\hat g(\theta_1^*)\big) = \frac{1}{nh^3}\,\kappa_0^{(1)}\,E\Big[\frac{f_{Y|X,D=1}(\theta_1^*|X)}{P(D=1|X)}\Big] + \frac{1}{nh^3}\,O(h) = \frac{1}{nh^3}\,\Big(V_1 + O(h)\Big).
\]
We are now ready to apply the central limit theorem. Let
\[
Z_{n,i} \equiv (nh)^{-1/2}\Big[\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big) - E\Big[\frac{D_i}{P(D=1|X_i)}\,K^{(1)}\Big(\frac{\theta_1^* - Y_i}{h}\Big)\Big]\Big];
\]
then $E[Z_{n,i}] = 0$ and $\mathrm{Var}(Z_{n,i}) = h^3\,\mathrm{Var}(\hat g(\theta_1^*)) = n^{-1}V_1 + o(n^{-1})$. Then
\[
\sqrt{nh^3}\,\big(\hat g(\theta_1^*) - 0\big) = \sqrt{nh^3}\,\big(\hat g(\theta_1^*) - E[\hat g(\theta_1^*)]\big) + \sqrt{nh^3}\,\big(E[\hat g(\theta_1^*)] - 0\big) = \sum_{i=1}^n Z_{n,i} + \sqrt{nh^3}\,O(h^2) \xrightarrow{d} N(0, V_1)
\]
by the Lyapunov central limit theorem and $\sqrt{nh^3}\,h^2 \to 0$.

Step 6. In this step, we show that
\[
\sqrt{nh^3}\begin{bmatrix}\hat\theta_1 - \theta_1^*\\ \hat\theta_0 - \theta_0^*\end{bmatrix} \xrightarrow{d} N\Bigg(\begin{bmatrix}0\\0\end{bmatrix},\, \begin{bmatrix}M_1^{-1}V_1M_1^{-1} & 0\\ 0 & M_0^{-1}V_0M_0^{-1}\end{bmatrix}\Bigg),
\]
and thus, by the delta method,
\[
\sqrt{nh^3}\,\big(\hat\Delta - \Delta^*\big) \xrightarrow{d} N\big(0,\; M_1^{-1}V_1M_1^{-1} + M_0^{-1}V_0M_0^{-1}\big).
\]
To show the joint distribution, we adopt vector notation. The first-order conditions for $\hat\theta_1$ and $\hat\theta_0$ give
\[
\begin{bmatrix}0\\0\end{bmatrix} = \begin{bmatrix}\hat f^{(1)}_{Y_1}(\hat\theta_1)\\ \hat f^{(1)}_{Y_0}(\hat\theta_0)\end{bmatrix} = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix}\hat f^{(1)}_{Y|X,D=1}(\hat\theta_1|X_i)\\ \hat f^{(1)}_{Y|X,D=0}(\hat\theta_0|X_i)\end{bmatrix} = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix}\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\\ \hat f^{(1)}_{Y|X,D=0}(\theta_0^*|X_i)\end{bmatrix} + J_n\begin{bmatrix}\hat\theta_1 - \theta_1^*\\ \hat\theta_0 - \theta_0^*\end{bmatrix},
\]
where
\[
J_n = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix}\hat f^{(2)}_{Y|X,D=1}(\tilde\theta_1|X_i) & 0\\ 0 & \hat f^{(2)}_{Y|X,D=0}(\tilde\theta_0|X_i)\end{bmatrix}.
\]
Hence we have
\[
\sqrt{nh^3}\begin{bmatrix}\hat\theta_1 - \theta_1^*\\ \hat\theta_0 - \theta_0^*\end{bmatrix} = -J_n^{-1}\,\frac{\sqrt{nh^3}}{n}\sum_{i=1}^n \begin{bmatrix}\hat f^{(1)}_{Y|X,D=1}(\theta_1^*|X_i)\\ \hat f^{(1)}_{Y|X,D=0}(\theta_0^*|X_i)\end{bmatrix} = -J_n^{-1}\Bigg(\frac{1}{\sqrt{nh}}\sum_{i=1}^n \begin{bmatrix}\frac{D_i}{P(D=1|X_i)}K^{(1)}\big(\frac{\theta_1^* - Y_i}{h}\big)\\ \frac{1-D_i}{P(D=0|X_i)}K^{(1)}\big(\frac{\theta_0^* - Y_i}{h}\big)\end{bmatrix} + \begin{bmatrix}o_p(1)\\ o_p(1)\end{bmatrix}\Bigg),
\]
where the last equality follows from Step 5. Since
\[
J_n \xrightarrow{p} \begin{bmatrix}M_1 & 0\\ 0 & M_0\end{bmatrix} \quad\text{and}\quad \frac{1}{\sqrt{nh}}\sum_{i=1}^n \begin{bmatrix}\frac{D_i}{P(D=1|X_i)}K^{(1)}\big(\frac{\theta_1^* - Y_i}{h}\big)\\ \frac{1-D_i}{P(D=0|X_i)}K^{(1)}\big(\frac{\theta_0^* - Y_i}{h}\big)\end{bmatrix} \xrightarrow{d} N\Bigg(\begin{bmatrix}0\\0\end{bmatrix},\, \begin{bmatrix}V_1 & 0\\ 0 & V_0\end{bmatrix}\Bigg),
\]
we conclude that
\[
\sqrt{nh^3}\begin{bmatrix}\hat\theta_1 - \theta_1^*\\ \hat\theta_0 - \theta_0^*\end{bmatrix} \xrightarrow{d} N\Bigg(\begin{bmatrix}0\\0\end{bmatrix},\, \begin{bmatrix}M_1^{-1}V_1M_1^{-1} & 0\\ 0 & M_0^{-1}V_0M_0^{-1}\end{bmatrix}\Bigg).
\]

Proof of Theorem 3.
It is enough to show the results for $\hat M_1$ and $\hat V_1$. We first show that $\hat M_1 \xrightarrow{p} M_1$. By adding and subtracting terms, we have
\[
\hat M_1 = \frac{1}{n}\sum_{i=1}^n \hat f^{(2)}_{Y|X,D=1}\big(\hat\theta_1|X_i\big) = \frac{1}{n}\sum_{i=1}^n f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i) + A_1 + A_2,
\]
where
\[
A_1 = \frac{1}{n}\sum_{i=1}^n \Big(\hat f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i)\Big) \quad\text{and}\quad A_2 = \frac{1}{n}\sum_{i=1}^n \Big(f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\Big).
\]
If we can show that $A_1 = o_p(1)$ and $A_2 = o_p(1)$, then $\hat M_1 \xrightarrow{p} M_1$ by the law of large numbers. Note that
\[
|A_1| \le \frac{1}{n}\sum_{i=1}^n \Big|\hat f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i) - f^{(2)}_{Y|X,D=1}(\hat\theta_1|X_i)\Big| \le \sup_{y,x}\Big|\hat f^{(2)}_{Y|X,D=1}(y|x) - f^{(2)}_{Y|X,D=1}(y|x)\Big| = O_p\Bigg(\sqrt{\frac{\ln n}{nh^{d+5}}} + h^2\Bigg) = o_p(1),
\]
where the last rate follows from the uniform rates of convergence of kernel estimators (Hansen, 2008). For $A_2$, we use the argument in Lemma 4.3 of Newey & McFadden (1994). By the consistency of $\hat\theta_1$, there is $\delta_n \to 0$ such that $|\hat\theta_1 - \theta_1^*| \le \delta_n$ with probability approaching one. Define $\Delta_n(Z_i) = \sup_{|y - \theta_1^*| \le \delta_n}\big|f^{(2)}_{Y|X,D=1}(y|X_i) - f^{(2)}_{Y|X,D=1}(\theta_1^*|X_i)\big|$. By the continuity of $f^{(2)}_{Y|X,D=1}(y|X_i)$ at $\theta_1^*$, $\Delta_n(Z_i) \xrightarrow{p} 0$. By the dominated convergence theorem, $E[\Delta_n(Z_i)] \to 0$, and by Markov's inequality $P\big(n^{-1}\sum_{i=1}^n \Delta_n(Z_i) > \epsilon\big) \le E[\Delta_n(Z_i)]/\epsilon \to 0$. Therefore, $|A_2| \le \frac{1}{n}\sum_{i=1}^n \Delta_n(Z_i) = o_p(1)$.

Next we show that $\hat V_1 \xrightarrow{p} V_1$. We can rewrite
\[
\hat V_1/\kappa_0^{(1)} = \frac{1}{n}\sum_{i=1}^n \frac{\hat f_{Y|X,D=1}(\hat\theta_1|X_i)}{\hat\pi(X_i)} = \frac{1}{n}\sum_{i=1}^n \frac{f_{Y|X,D=1}(\theta_1^*|X_i)}{\pi(X_i)} + B_1 + B_2,
\]
with
\[
B_1 = \frac{1}{n}\sum_{i=1}^n \Bigg(\frac{\hat f_{Y|X,D=1}(\hat\theta_1|X_i)}{\hat\pi(X_i)} - \frac{f_{Y|X,D=1}(\hat\theta_1|X_i)}{\pi(X_i)}\Bigg) \quad\text{and}\quad B_2 = \frac{1}{n}\sum_{i=1}^n \Bigg(\frac{f_{Y|X,D=1}(\hat\theta_1|X_i)}{\pi(X_i)} - \frac{f_{Y|X,D=1}(\theta_1^*|X_i)}{\pi(X_i)}\Bigg).
\]
It remains to show that $B_1 = o_p(1)$ and $B_2 = o_p(1)$. The result for $B_2$ follows from the same arguments as for $A_2$, given that $f_{Y|X,D=1}(y|X_i)$ is continuous at $\theta_1^*$. Thus, we focus on $B_1$. For convenience, write $f(y|x) = f_{Y|X,D=1}(y|x)$. For $\pi$ bounded away from zero, we have
\[
\Bigg|\frac{\hat f(y|x)}{\hat\pi(x)} - \frac{f(y|x)}{\pi(x)}\Bigg| = \Bigg|\frac{\pi(x)\hat f(y|x) - \hat\pi(x)f(y|x)}{\hat\pi(x)\pi(x)}\Bigg| = \Bigg|\frac{\hat f(y|x) - f(y|x)}{\hat\pi(x)} + \frac{f(y|x)}{\hat\pi(x)\pi(x)}\big(\pi(x) - \hat\pi(x)\big)\Bigg| \le C\Big(\big|\hat f(y|x) - f(y|x)\big| + \big|\hat\pi(x) - \pi(x)\big|\Big)
\]
for some $C > 0$.
By the uniform rates of convergence of kernel estimators (Hansen, 2008), we have
\[
|B_1| \le \frac{1}{n}\sum_{i=1}^n \Bigg|\frac{\hat f_{Y|X,D=1}(\hat\theta_1|X_i)}{\hat\pi(X_i)} - \frac{f_{Y|X,D=1}(\hat\theta_1|X_i)}{\pi(X_i)}\Bigg| \le C\,\sup_{y,x}\Big(\big|\hat f(y|x) - f(y|x)\big| + \big|\hat\pi(x) - \pi(x)\big|\Big) = O_p\Bigg(\sqrt{\frac{\ln n}{nh^{d+1}}} + h^2\Bigg) + C\,\sup_x\big|\hat\pi(x) - \pi(x)\big| = o_p(1)
\]
by the rates of $n$ and $h$ and the uniform consistency of $\hat\pi(x)$.

Proof of Theorem 4. Suppose that
\[
\hat f_{Y_1}(y) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1(Z, y, \hat\eta_k)\big]
\]
is differentiable with respect to $y$. Define
\[
\hat f^{(1)}_{Y_1}(y) \equiv \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, y, \hat\eta_k)\big],
\]
where $m_1^{(1)}(Z, y, \hat\eta_k) \equiv \partial m_1(Z, y, \hat\eta_k)/\partial y$. By the definition of $\hat\theta_1$, we have
\[
0 = \hat f^{(1)}_{Y_1}(\hat\theta_1) = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \hat\theta_1, \hat\eta_k)\big] = \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] + \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(2)}(Z, \tilde\theta_1, \hat\eta_k)\big]\big(\hat\theta_1 - \theta_1^*\big),
\]
and hence
\[
\sqrt{Nh^3}\,\big(\hat\theta_1 - \theta_1^*\big) = -\Bigg[\frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(2)}(Z, \tilde\theta_1, \hat\eta_k)\big]\Bigg]^{-1}\sqrt{Nh^3}\,\frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big].
\]
In Steps 1 and 2 below, we show that
\[
\frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(2)}(Z, \tilde\theta_1, \hat\eta_k)\big] \xrightarrow{p} M_1 \quad\text{and}\quad \sqrt{Nh^3}\,\frac{1}{K}\sum_{k=1}^K \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] \xrightarrow{d} N(0, V_1),
\]
respectively. Together they yield the final result $\sqrt{Nh^3}\,(\hat\theta_1 - \theta_1^*) \xrightarrow{d} N\big(0, M_1^{-1}V_1M_1^{-1}\big)$.

Step 1.
Since $K$ is a fixed integer, independent of $N$, it suffices to show that for each $k \in [K]$,
\[
\mathbb{E}_{n,k}\big[m_1^{(2)}(Z, \tilde\theta_1, \hat\eta_k)\big] \xrightarrow{p} M_1.
\]
This convergence follows from the same argument as Step 1 in the proof of Theorem 2.
Step 2.
Since $K$ is a fixed integer, independent of $N$, it is enough to consider the convergence of $\mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big]$ for each $k$. Notice that
\[
\mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] = \frac{1}{n}\sum_{i\in I_k} m_1^{(1)}(Z_i, \theta_1^*, \eta_0) + R_{1,k}, \qquad R_{1,k} = \mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] - \frac{1}{n}\sum_{i\in I_k} m_1^{(1)}(Z_i, \theta_1^*, \eta_0).
\]
By the triangle inequality, $\|R_{1,k}\| \le (I_{1,k} + I_{2,k})/\sqrt{n}$, where
\[
I_{1,k} \equiv \Big\|\mathbb{G}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] - \mathbb{G}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \eta_0)\big]\Big\|, \qquad I_{2,k} \equiv \sqrt{n}\,\Big\|E_P\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\,\big|\,(W_i)_{i\in I_k^c}\big] - E_P\big[m_1^{(1)}(Z, \theta_1^*, \eta_0)\big]\Big\|.
\]
Two auxiliary results will be used to bound $I_{1,k}$ and $I_{2,k}$:
\[
\sup_{\eta\in\mathcal{T}_N}\Big(E\big[\|m_1^{(1)}(Z, \theta_1^*, \eta) - m_1^{(1)}(Z, \theta_1^*, \eta_0)\|^2\big]\Big)^{1/2} \le \varepsilon_N, \tag{A.1}
\]
\[
\sup_{r\in(0,1),\,\eta\in\mathcal{T}_N}\Big\|\partial_r^2\,E\big[m_1^{(1)}\big(Z, \theta_1^*, \eta_0 + r(\eta - \eta_0)\big)\big]\Big\| \le (\varepsilon_N)^2, \tag{A.2}
\]
where $\mathcal{T}_N$ is the set of all $\eta = (\pi, g_1)$ consisting of square-integrable functions $\pi$ and $g_1$ such that
\[
\|\eta - \eta_0\|_{P,2} \le \varepsilon_N, \qquad \|\pi - 1/2\|_{P,\infty} \le 1/2 - \kappa, \qquad \|\pi - \pi_0\|_{P,2}^2 + \|\pi - \pi_0\|_{P,2}\times\|g_1 - g_{1,0}\|_{P,2} \le (\varepsilon_N)^2.
\]
By assumption, $\hat\eta_k \in \mathcal{T}_N$ with probability $1 - o(1)$.

To bound $I_{1,k}$, note that conditional on $(W_i)_{i\in I_k^c}$ the estimator $\hat\eta_k$ is nonstochastic. On the event $\hat\eta_k \in \mathcal{T}_N$,
\[
E_P\big[I_{1,k}^2\,\big|\,(W_i)_{i\in I_k^c}\big] = E_P\big[\|m_1^{(1)}(Z, \theta_1^*, \hat\eta_k) - m_1^{(1)}(Z, \theta_1^*, \eta_0)\|^2\,\big|\,(W_i)_{i\in I_k^c}\big] \le \sup_{\eta\in\mathcal{T}_N}E_P\big[\|m_1^{(1)}(Z, \theta_1^*, \eta) - m_1^{(1)}(Z, \theta_1^*, \eta_0)\|^2\big] \le (\varepsilon_N)^2
\]
by (A.1). Hence $I_{1,k} = O_P(\varepsilon_N)$. To bound $I_{2,k}$, define the function
\[
f_k(r) = E_P\big[m_1^{(1)}\big(Z, \theta_1^*, \eta_0 + r(\hat\eta_k - \eta_0)\big)\,\big|\,(W_i)_{i\in I_k^c}\big] - E_P\big[m_1^{(1)}(Z, \theta_1^*, \eta_0)\big], \qquad r\in[0,1].
\]
By a Taylor expansion, $f_k(1) = f_k(0) + f_k'(0) + f_k''(\tilde r)/2$ for some $\tilde r \in (0, 1)$. Note that $f_k(0) = 0$, while $E_P[m_1^{(1)}(Z, \theta_1^*, \eta_0)] = O(h^2)$ by the calculation in Step 4 of the proof of Theorem 2. Further, on the event $\hat\eta_k \in \mathcal{T}_N$, $f_k'(0) = \partial_\eta E\big[m_1^{(1)}(Z, \theta_1^*, \eta_0)\big]\big[\hat\eta_k - \eta_0\big] = 0$ by the Neyman orthogonality, and $\|f_k''(\tilde r)\| \le \sup_{r\in(0,1)}\|f_k''(r)\| \le (\varepsilon_N)^2$ by (A.2). Thus $I_{2,k} = \sqrt{n}\,\|f_k(1)\| = O_P\big(\sqrt{n}\,(\varepsilon_N)^2\big)$. Together with the bound on $I_{1,k}$, we have
\[
\|R_{1,k}\| \le \frac{I_{1,k} + I_{2,k}}{\sqrt{n}} = O_P\big(n^{-1/2}\varepsilon_N + (\varepsilon_N)^2\big),
\]
and hence $\sqrt{Nh^3}\,\|R_{1,k}\| = O_P\big(\sqrt{h^3}\,\varepsilon_N + \sqrt{Nh^3}\,(\varepsilon_N)^2\big) = o_P(1)$ by the assumption $\varepsilon_N = o\big((Nh^3)^{-1/4}\big)$. Therefore,
\[
\mathbb{E}_{n,k}\big[m_1^{(1)}(Z, \theta_1^*, \hat\eta_k)\big] = \frac{1}{n}\sum_{i\in I_k} m_1^{(1)}(Z_i, \theta_1^*, \eta_0) + o_P\big((Nh^3)^{-1/2}\big),
\]
and, after scaling by $\sqrt{Nh^3}$, the leading term converges in distribution to $N(0, V_1)$ by the argument in Step 5 of the proof of Theorem 2, where the bias $E_P[m_1^{(1)}(Z, \theta_1^*, \eta_0)] = O(h^2)$ is controlled by $\sqrt{Nh^3}\,h^2 = \sqrt{Nh^7} \to 0$.