Local Regression Distribution Estimators
Matias D. Cattaneo†   Michael Jansson‡   Xinwei Ma§

October 1, 2020
Abstract
This paper investigates the large sample properties of local regression distribution estimators, which include a class of boundary adaptive density estimators as a prime example. First, we establish a pointwise Gaussian large sample distributional approximation in a unified way, allowing for both boundary and interior evaluation points simultaneously. Using this result, we study the asymptotic efficiency of the estimators, and show that a carefully crafted minimum distance implementation based on "redundant" regressors can lead to efficiency gains. Second, we establish uniform linearizations and strong approximations for the estimators, and employ these results to construct valid confidence bands. Third, we develop extensions to weighted distributions with estimated weights and to local L² least squares estimation. Finally, we illustrate our methods with two applications in program evaluation: counterfactual density testing, and IV specification and heterogeneity density analysis. Companion software packages in Stata and R are available.

Keywords: distribution and density estimation, local polynomial methods, uniform approximation, efficiency, optimal kernel, program evaluation.

∗ Prepared for the "Celebrating Whitney Newey's Contributions to Econometrics" Conference at MIT, May 17-18, 2019. We thank the conference participants for comments, and Guido Imbens and Yingjie Feng for very useful discussions. We are also thankful to the handling co-Editor, Xiaohong Chen, an Associate Editor, and two reviewers for their input. Cattaneo gratefully acknowledges financial support from the National Science Foundation through grant SES-1947805, and Jansson gratefully acknowledges financial support from the National Science Foundation through grant SES-1947662 and the research support of CREATES.
† Department of Operations Research and Financial Engineering, Princeton University.
‡ Department of Economics, UC Berkeley and CREATES.
§ Department of Economics, UC San Diego.

1 Introduction
Kernel-based nonparametric estimation of distribution and density functions, as well as of higher-order derivatives thereof, plays an important role in econometrics. These nonparametric estimators often feature both as the main object of interest and as preliminary ingredients in multi-step semiparametric procedures (Newey and McFadden, 1994; Ichimura and Todd, 2007). Whitney Newey's path-breaking contributions to non/semiparametric econometrics employing kernel smoothing are numerous.¹ This paper hopes to honor his influential work in this area by studying the main large sample properties of a new class of local regression distribution estimators, which can be used for non/semiparametric estimation and inference.

The class of local regression distribution estimators is constructed using a local least squares approximation to the empirical distribution function of a random variable x_i ∈ X ⊆ R, where the localization at the evaluation point x ∈ X is implemented via a kernel function and a bandwidth parameter. The local functional form approximation is done using a finite-dimensional basis function, not necessarily of polynomial form. When the basis function contains polynomials up to order p ∈ N, the associated least squares coefficients give estimators of the distribution function, density function, and higher-order derivatives (up to order p − 1) at the evaluation point x ∈ X. If only a polynomial basis is used, then the estimator reduces to the one recently proposed in Cattaneo, Jansson, and Ma (2020a).

We present two main large sample distributional results for the local regression distribution estimators. First, in Section 3, we establish a pointwise (in x ∈ X) Gaussian distributional approximation with consistent standard errors. Because these estimators have a U-statistic structure with an n-varying kernel, where n denotes the sample size, we construct a fully automatic Studentization given a choice of basis, kernel, and bandwidth. Furthermore, we show that when the basis function includes polynomials, the associated density and higher-order derivative estimators are boundary adaptive without further modifications. This result generalizes Cattaneo, Jansson, and Ma (2020a) by allowing for arbitrary local basis functions, which is particularly useful for efficiency considerations.

¹ See, for example, Newey and Stoker (1993), Newey (1994a), Newey (1994b), Hausman and Newey (1995), Robins, Hsieh, and Newey (1995), Newey, Hsieh, and Robins (2004), Newey and Ruud (2005), Ichimura and Newey (2020), and Chernozhukov, Escanciano, Ichimura, Newey, and Robins (2020).
To be more precise, for the special case of local polynomial density estimation, Cattaneo, Jansson, and Ma (2020a) established boundary adaptivity of the basic estimator; here we show in addition that a carefully crafted minimum distance implementation based on "redundant" regressors can lead to asymptotic efficiency gains (Section 3.3).

Second, in Section 4, we establish uniform (over x ∈ I, for some region I ⊆ X) distributional approximations, based on either the basic local regression distribution estimators or the associated more efficient estimators obtained via our proposed minimum distance procedure. More precisely, we establish a strong approximation to the boundary adaptive Studentized statistic, uniformly over x ∈ I, relying on a "coupling" result in Giné, Koltchinskii, and Sakhanenko (2004); see also Rio (1994) and Giné and Nickl (2010) for closely related results, and Zaitsev (2013) for a review on strong approximation methods. This approach allows us to deduce a distributional approximation for many functionals of the Studentized statistic, including its supremum, following ideas in Chernozhukov, Chetverikov, and Kato (2014b). For further discussion and references on strong approximations and their applications to non/semiparametric econometrics, see Chernozhukov, Chetverikov, and Kato (2014a), Belloni, Chernozhukov, Chetverikov, and Kato (2015), Belloni, Chernozhukov, Chetverikov, and Fernandez-Val (2019), Cattaneo, Farrell, and Feng (2020), Cattaneo, Crump, Farrell, and Feng (2020), and references therein.

We employ our strong approximation results for local regression distribution estimators to construct asymptotically valid confidence bands for the density function and derivatives thereof. Other applications of our results, not discussed here to conserve space, include parametric specification and shape restriction testing. As a by-product, we also establish a linear approximation to the boundary adaptive Studentized statistic, uniformly over x ∈ I, which gives uniform convergence rates and can be used for further theoretical developments. See the supplemental appendix for more details.

In addition to our main large sample results for local regression distribution and related estimators, we briefly discuss several extensions in Section 5. First, we allow for a weighted empirical distribution function entering our estimators, where the weights themselves may be estimated. Our results continue to hold in this more general case, which is practically relevant, as illustrated in our empirical applications. Second, we present and study an alternative class of estimators that employ a non-random L² loss function, instead of the more standard least squares approximation underlying our local regression distribution estimators. These alternative estimators enjoy certain theoretical advantages, but require ex-ante knowledge of the boundary location of X. In particular, we show in the supplemental appendix how these alternative estimators can be implemented to achieve maximum asymptotic efficiency in estimating the density function and its derivatives. Third, we also discuss incorporating shape restrictions using the general local basis function entering the local regression distribution estimators.

Finally, in Section 6, we illustrate our methods with two applications in program evaluation (for a review, see Abadie and Cattaneo, 2018). First, we discuss counterfactual density analysis following DiNardo, Fortin, and Lemieux (1996); see also Chernozhukov, Fernandez-Val, and Melly (2013) for related discussion based on distribution functions. Second, we discuss specification testing and heterogeneity analysis in the context of instrumental variables following Kitagawa (2015) and Abadie (2003), respectively; see also Imbens and Rubin (2015) for background and other applications of nonparametric density estimation to causal inference and program evaluation.
In all these applications, we develop formal estimation and inference methods based on nonparametric density estimation using local regression distribution estimators implemented with weighted distribution functions. We showcase our new methods using a subsample of the data in Abadie, Angrist, and Imbens (2002), corresponding to the Job Training Partnership Act (JTPA).

From both methodological and technical perspectives, our proposed class of local regression distribution estimators is different from, and exhibits demonstrable advantages over, other related estimators available in the literature. For the special case of density estimation (i.e., when the basis function is taken to be polynomial), our resulting kernel-based density estimator enjoys boundary carpentry over the possibly unknown boundary of X, does not require preliminary smoothing of the data and hence avoids preliminary tuning parameter choices, and is easy to implement and interpret. Cattaneo, Jansson, and Ma (2020a) gave a detailed discussion of that density estimator and related approaches in the literature, which include the influential local polynomial estimator of Cheng, Fan, and Marron (1997) and related estimators (Zhang and Karunamuni, 1998; Karunamuni and Zhang, 2008, and references therein). The class of estimators we consider here can be more efficient by employing minimum distance estimation ideas (Section 3), easily delivers intuitive estimators of density-weighted averages (Section 5.1), and allows for incorporating shape and other restrictions (Section 5.3), among other features that we discuss below. Last but not least, some of the technical results presented herein for the general class of estimators, such as asymptotic efficiency (Section 3.3) and uniform inference (Section 4), are new even for the special case of density estimation in Cattaneo, Jansson, and Ma (2020a).

The rest of the paper proceeds as follows. Section 2 introduces the class of local regression distribution estimators. Section 3 establishes a pointwise distributional approximation, along with a consistent standard error estimator, and discusses efficiency, focusing in particular on the leading special case of density estimation. Section 4 establishes uniform results, including valid linearizations and strong approximations, which are then used to construct confidence bands. Section 5 discusses extensions of our methodology, while Section 6 illustrates our new methods with two distinct program evaluation applications. Section 7 concludes. The supplemental appendix (SA) includes all proofs of our theoretical results as well as other technical, methodological, and numerical results that may be of independent interest. Software packages for Stata and R implementing the main results in this paper are discussed in Cattaneo, Jansson, and Ma (2020b).

2 Local Regression Distribution Estimators

Suppose x_1, x_2, . . . , x_n is a random sample from a univariate random variable x with absolutely continuous cumulative distribution function F(·), and associated Lebesgue density f(·), over its support X ⊆ R, which may be compact and is not necessarily known.
We propose, and study the large sample properties of, a new class of nonparametric estimators of F(·), f(·), and derivatives thereof, both pointwise at x ∈ X and uniformly over some region I ⊆ X.

Our proposed estimators are applicable whenever F(·) is suitably smooth near x and admits a sufficiently accurate linear-in-parameters local approximation of the form:

    ϱ(h, x) = sup_{|u−x|≤h} | F(u) − R(u − x)′θ(x) |  is small for h small,    (1)

where R(·) is a known local basis function and θ(x) is a parameter vector to be estimated. As an estimator of θ(x) in (1), we consider the local regression estimator

    θ̂(x) = argmin_θ Σ_{i=1}^n W_i ( F̂_i − R_i′θ )²,    (2)

where W_i = K((x_i − x)/h)/h for some kernel K(·) and some bandwidth h, R_i = R(x_i − x), and

    F̂_i = (1/n) Σ_{j=1}^n 1(x_j ≤ x_i)    (3)

is the empirical distribution function evaluated at x_i, with 1(·) denoting the indicator function.

The generic formulation (1) is motivated in part by the important special case where F(·) is sufficiently smooth, in which case

    F(u) ≈ F(x) + f(x)(u − x) + · · · + f^{(p−1)}(x) (u − x)^p / p!  for u ≈ x,    (4)

where f^{(s)}(x) = d^s f(u)/du^s |_{u=x} denotes a higher-order density derivative. Of course, the approximation (4) is of the form (1) with R(u) = (1, u, · · · , u^p/p!)′, and hence θ(x) = (F(x), f(x), · · · , f^{(p−1)}(x))′. In this special case, the estimator θ̂(x) corresponds to one of the estimators introduced in Cattaneo, Jansson, and Ma (2020a). But, as further discussed below, other choices of R(·) and/or θ(·) can be attractive, and as a consequence we take (1) as the starting point for our analysis. Section 5 discusses other extensions and generalizations of the basic local regression distribution estimator θ̂(x) in (2).

The class of estimators defined in (2) is motivated by standard local polynomial regression methods (Fan and Gijbels, 1996). However, well-known results for local polynomial regression are not applicable to the local regression distribution estimator θ̂(x), because the empirical distribution function estimator F̂_i, which plays the role of the "dependent" variable in the construction, depends not only on x_i but also on all of the "independent" observations x_1, x_2, . . . , x_n. This implies that, unlike the case of standard local polynomial regression, θ̂(x) cannot be studied by conditioning on the "covariates" x_1, x_2, . . . , x_n. Instead, we employ U-statistic methods for analyzing the statistical properties of θ̂(x). This observation explains the quite different asymptotic variance of our estimator; see Section 3.3 for details. Furthermore, as discussed in Section 5.1, when a weighted distribution function is used in place of F̂_i in (2), the resulting (weighted) local regression distribution estimators are consistent for a density-weighted regression function, as opposed to being consistent for the regression function itself (as is the case for standard local polynomial regression methods). Finally, the SA highlights other technical differences between the two types of local regression estimators.
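To fix ideas, here is a minimal sketch of (2)-(3) for the polynomial basis R(u) = (1, u, . . . , u^p/p!)′, assuming a triangular kernel; the function lrd_fit and all names are illustrative, not the companion Stata/R implementation.

    import numpy as np
    from math import factorial

    def lrd_fit(x_data, x, h, p=2):
        """Local regression distribution estimator, eqs. (2)-(3): regress the
        empirical distribution function on R(x_i - x) with kernel weights."""
        n = x_data.shape[0]
        # Eq. (3): F_hat_i = (1/n) * #{j : x_j <= x_i}
        F_hat = np.searchsorted(np.sort(x_data), x_data, side="right") / n
        # Kernel weights W_i = K((x_i - x)/h)/h, here a triangular kernel
        u = (x_data - x) / h
        w = np.maximum(1.0 - np.abs(u), 0.0) / h
        # Polynomial basis R_i = (1, x_i - x, ..., (x_i - x)^p / p!)'
        powers = np.arange(p + 1)
        fact = np.array([factorial(s) for s in powers], dtype=float)
        R = (x_data[:, None] - x) ** powers / fact
        # Weighted least squares solution of eq. (2)
        Rw = R * w[:, None]
        theta_hat = np.linalg.solve(Rw.T @ R, Rw.T @ F_hat)
        return theta_hat, R, w, F_hat

The second entry of theta_hat estimates f(x), and no modification is needed when x sits at or near a boundary of the support: the estimator is boundary adaptive by construction.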
3 Pointwise Estimation and Inference

This section discusses the large sample properties of the estimator θ̂(x), pointwise in x ∈ X. We first establish asymptotic normality, and then discuss asymptotic efficiency. Other results are reported in the SA to conserve space. We drop the dependence on the evaluation point x whenever possible.

3.1 Assumptions

We impose the following assumption throughout this section. We do not restrict the support X, which can be a compact set or unbounded, because our estimator automatically adapts to boundary evaluation points.

Assumption 1 x_1, . . . , x_n is a random sample from a distribution F(·) supported on X ⊆ R, and x ∈ X.

(i) For some δ > 0, F(·) is absolutely continuous on [x − δ, x + δ] with a density f(·) admitting constants f(x−), ḟ(x−), f(x+), and ḟ(x+) such that

    sup_{u∈[−δ,0)} |f(x + u) − f(x−) − ḟ(x−)u| / |u|²  +  sup_{u∈(0,δ]} |f(x + u) − f(x+) − ḟ(x+)u| / |u|²  <  ∞.

(ii) K(·) is nonnegative, symmetric, and continuous on its support [−1, 1], and integrates to 1.

(iii) R(·) is locally bounded, and there exists a positive-definite diagonal matrix Υ_h for each h > 0 such that Υ_h R(u) = R(u/h).

(iv) Let X_{h,x} = (X − x)/h. For all h sufficiently small, the minimum eigenvalues of Γ_{h,x} and h^{−1}Σ_{h,x} are bounded away from zero, where

    Γ_{h,x} = ∫_{X_{h,x}} R(u)R(u)′ K(u) f(x + hu) du,

    Σ_{h,x} = ∫_{X_{h,x}} ∫_{X_{h,x}} R(u)R(v)′ [ F(x + h min{u, v}) − F(x + hu)F(x + hv) ] K(u)K(v) f(x + hu)f(x + hv) du dv.

Part (i) imposes smoothness conditions on the distribution function F(·), separately for the two regions to the left and to the right of the evaluation point x. In most applications, the distribution function will also be smooth at the evaluation point, in which case f(x−) = f(x+) and ḟ(x−) = ḟ(x+). However, there are important situations where F(·) only has one-sided derivatives, such as at boundary or kink evaluation points. Part (ii) imposes standard restrictions on the kernel function, which allow for all commonly used (compactly supported) second-order kernel functions. Part (iii) requires that the local basis R(·) can be stabilized by a suitable normalization. Part (iv) gives assumptions on two (non-random) matrices which will feature in the asymptotic distribution.

The error of the approximation in (1) depends on the choice of R(·) and θ, and is quantified by ϱ(h), where we suppress the dependence on the evaluation point x to save notation. The approximation error will be required to be "small" in the sense that nϱ(h)²/h → 0.
In the cases of main interest (i.e., when R(·) is polynomial), we have either ϱ(h) = O(h^{p+1}) or ϱ(h) = o(h^p) for some p. The condition can therefore be stated as nh^{2p+1} → 0 or nh^{2p−1} = O(1), respectively, in those cases.

We do not discuss how to choose the bandwidth h, or the order p if R(·) contains polynomials, as both choices can be developed following standard ideas in the local polynomial literature. We focus instead on the distributional approximation (Section 3.2) and asymptotic variance minimization (Section 3.3), given a choice of bandwidth sequence and polynomial order. Bandwidth selection can be developed by extending the results in Cattaneo, Jansson, and Ma (2020a), and polynomial order selection can be developed following Fan and Gijbels (1996, Section 3.3). In particular, a larger p can lead to more bias reduction whenever the target population function is smooth enough, at the expense of a larger asymptotic variance. We discuss this trade-off explicitly in our efficiency calculations (Section 3.3).

3.2 Asymptotic Normality

We show that, under regularity conditions and if h vanishes at a suitable rate as n → ∞, then

    Ω̂^{−1/2}(θ̂ − θ) ⇝ N(0, I),   Ω̂ = Γ̂^{−1}Σ̂Γ̂^{−1},    (5)

where

    Γ̂ = (1/n) Σ_{i=1}^n W_i R_i R_i′,   Σ̂ = (1/n²) Σ_{i=1}^n ψ̂_i ψ̂_i′,   ψ̂_i = (1/n) Σ_{j=1}^n W_j R_j ( 1(x_i ≤ x_j) − F̂_j ).

It follows from this result that inference on θ can be based on θ̂ by employing the (pointwise) distributional approximation θ̂ ∼ₐ N(θ, Ω̂). The three matrices Γ̂, Σ̂, and Ω̂ depend on the evaluation point x, but such dependence is again suppressed for simplicity. This distributional result relies on the "small" bias condition nϱ(h)²/h → 0, which renders the bias of θ̂ negligible relative to its standard error. From an inference perspective, such a bias condition can be achieved by employing undersmoothing or robust bias correction; see Calonico, Cattaneo, and Farrell (2018, 2020) for discussion and background references. The SA includes more details on the bias of the estimator.

To provide some insight into the distributional approximation (5), and to see why it cannot be established using standard results for local polynomial regression, first observe that

    θ̂ − θ = Γ̂^{−1} S,   S = (1/n) Σ_{i=1}^n W_i R_i ( F̂_i − R_i′θ ),

assuming Γ̂ is invertible with probability approaching one. The statistic S can be written as S = U + B, where

    U = (1/(n(n − 1))) Σ_{i≠j} W_j R_j ( 1(x_i ≤ x_j) − F(x_j) ),    (6)

and B consists of a leave-in bias term and a smoothing bias term. Since S is approximately a second-order U-statistic, result (5) should follow from a central limit theorem for (n-varying) U-statistics under suitable regularity conditions, including conditions ensuring that the approximation errors are negligible. More specifically, result (5) follows if U is asymptotically mean-zero Gaussian in the sense that V[U]^{−1/2}U ⇝ N(0, I), where V[U] denotes the variance of U, if V[U]^{−1/2}B →_P 0,
and if the variance estimator Σ̂ is consistent in the sense that V[U]^{−1}(Σ̂ − V[U]) →_P 0.
Moreover, the projection theorem for U-statistics implies that, under appropriate regularity conditions,

    V[U] ≈ n^{−1} E[ψ_i ψ_i′],   ψ_i = E[ W_j R_j ( 1(x_i ≤ x_j) − F(x_j) ) | x_i ],

which motivates the functional form of the variance estimator Σ̂ used to form Ω̂.
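The Studentization can be assembled from the same ingredients as the earlier estimation sketch; the following hedged sketch mirrors Γ̂, ψ̂_i, Σ̂, and Ω̂ above (O(n²) memory, favoring clarity over speed).

    import numpy as np

    def lrd_vcov(x_data, R, w, F_hat):
        """Variance estimator Omega_hat = Gamma_hat^{-1} Sigma_hat Gamma_hat^{-1},
        with psi_hat_i = (1/n) sum_j W_j R_j (1(x_i <= x_j) - F_hat_j)."""
        n = x_data.shape[0]
        Rw = R * w[:, None]
        Gamma = Rw.T @ R / n
        ind = (x_data[:, None] <= x_data[None, :]).astype(float)  # 1(x_i <= x_j)
        psi = (ind - F_hat[None, :]) @ Rw / n                     # rows are psi_hat_i'
        Sigma = psi.T @ psi / n**2
        Gamma_inv = np.linalg.inv(Gamma)
        return Gamma_inv @ Sigma @ Gamma_inv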
The following theorem formalizes the above intuition with precise sufficient conditions.

Theorem 1 (Pointwise Asymptotic Normality) Suppose Assumption 1 holds. If nϱ(h)²/h → 0 and nh² → ∞, then (5) holds.

This theorem establishes a (pointwise) Gaussian distributional approximation for the Studentized statistic Ω̂^{−1/2}(θ̂ − θ), which is valid for each evaluation point x ∈ X. For example, letting c be a vector of conformable dimension and α ∈ (0, 1), a 100(1 − α)% confidence interval is

    CI_α(x) = [ c′θ̂(x) − q_{1−α/2} √(c′Ω̂(x)c),  c′θ̂(x) − q_{α/2} √(c′Ω̂(x)c) ],

where q_a = inf{ u ∈ R : P[N(0, 1) ≤ u] ≥ a }. The above confidence interval is asymptotically valid for each evaluation point x, which is reflected by the notation CI_α(x). That is,

    lim_{n→∞} P[ c′θ(x) ∈ CI_α(x) ] = 1 − α,  for all x ∈ X.

Section 4 develops asymptotically valid confidence bands, which will be denoted by CI_α(I) for some region I ⊆ X.
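For instance, a hedged implementation of CI_α(x) for the density (c = e_1), reusing the two sketches above; the scipy normal quantile is an assumed convenience.

    import numpy as np
    from scipy.stats import norm

    def lrd_ci(x_data, x, h, p=2, alpha=0.05):
        """Pointwise 100(1 - alpha)% confidence interval for f(x), per Theorem 1."""
        theta, R, w, F_hat = lrd_fit(x_data, x, h, p)
        Omega = lrd_vcov(x_data, R, w, F_hat)
        c = np.zeros(p + 1)
        c[1] = 1.0                                   # e_1 picks out the density
        se = np.sqrt(c @ Omega @ c)
        q = norm.ppf(1.0 - alpha / 2.0)
        return c @ theta - q * se, c @ theta + q * se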
3.3 Efficiency

As is well known in the literature (Fan and Gijbels, 1996), the standard local polynomial regression estimator of E[y|x = x̄], for dependent variable y and independent variable x, has a limiting asymptotic variance of the "sandwich form" e_ℓ′Γ^{−1}AΓ^{−1}e_ℓ, where e_ℓ denotes the (ℓ + 1)th standard basis vector, and

    Γ = f(x) ∫_{−1}^{1} R(u)R(u)′ K(u) du,   A = V[y|x = x̄] f(x) ∫_{−1}^{1} R(u)R(u)′ K(u)² du.

This variance structure implies that setting K(·) to be the uniform kernel, K(u) = (1/2)1(|u| ≤ 1), makes A proportional to Γ (because then K(u)² ∝ K(u)), and this choice can be shown to minimize the asymptotic variance e_ℓ′Γ^{−1}AΓ^{−1}e_ℓ. See also Granovsky and Müller (1991) for a more general discussion of the optimality of the uniform kernel for kernel-based estimation.

Unlike the case of the asymptotic variance of local polynomial regression, however, our local regression distribution estimators exhibit a more complex and uneven asymptotic variance formula due to their construction. As a result, employing the uniform kernel may not exhaust the potential efficiency gains. For example, in the case of local polynomial density estimation (Cattaneo, Jansson, and Ma, 2020a), where R(u) is a polynomial basis of order p ≥ 1, the asymptotic variance of f̂(x) = e_1′θ̂(x) takes the form e_1′Γ^{−1}ΣΓ^{−1}e_1 with

    Σ = f(x) ∫_{−1}^{1} ∫_{−1}^{1} min{u, v} R(u)R(v)′ K(u)K(v) du dv,

which implies that Γ is no longer proportional to Σ even when the kernel function is uniform. (To show this result, one first recognizes that the asymptotic variance of f̂(x) is h^{−1}e_1′Γ_{h,x}^{−1}Σ_{h,x}Γ_{h,x}^{−1}e_1, where the matrices are defined in Assumption 1. Then the expression reduces to e_1′Γ^{−1}ΣΓ^{−1}e_1 after taking the limit h → 0, provided that x is an interior evaluation point. See the SA for omitted details.) This observation applies to the general case where the local basis function R(·) need not be of polynomial form, or when higher-order derivatives are of interest. See the SA for further discussion and detailed formulas.

In this section we employ a minimum distance approach to develop a lower bound on the asymptotic variance of the local regression distribution estimators, and we also propose more efficient estimators, based on the observation that their asymptotic variance is of the sandwich form Γ^{−1}ΣΓ^{−1} but with Γ not proportional to Σ even when the uniform kernel is used.

To motivate our approach, notice that in many cases it is possible to specify R(·) in such a way that θ can be partitioned as θ = (θ_1′, θ_2′)′, where θ_2 = 0. In such cases several distinct estimators of θ_1 are available. To describe some leading candidates and their salient properties, partition θ̂, Γ̂, Σ̂, and Ω̂ conformably with θ as θ̂ = (θ̂_1′, θ̂_2′)′ and

    Γ̂ = [ Γ̂_{11}  Γ̂_{12} ; Γ̂_{21}  Γ̂_{22} ],   Σ̂ = [ Σ̂_{11}  Σ̂_{12} ; Σ̂_{21}  Σ̂_{22} ],   Ω̂ = [ Ω̂_{11}  Ω̂_{12} ; Ω̂_{21}  Ω̂_{22} ].

The "short" regression counterpart of θ̂_1, obtained by dropping R_2(·) from R(·) = (R_1(·)′, R_2(·)′)′, is given by θ̂_{R,1} = θ̂_1 + Γ̂_{11}^{−1}Γ̂_{12}θ̂_2, while an optimal minimum distance estimator of θ_1 is given by

    θ̂_{MD,1} = argmin_{θ_1} ( θ̂_1 − θ_1 ; θ̂_2 )′ Ω̂^{−1} ( θ̂_1 − θ_1 ; θ̂_2 ) = θ̂_1 − Ω̂_{12} Ω̂_{22}^{−1} θ̂_2.    (7)

As a by-product of results obtained when establishing (5), it follows that

    Ω̂_{11}^{−1/2}(θ̂_1 − θ_1) ⇝ N(0, I),   Ω̂_{R,1}^{−1/2}(θ̂_{R,1} − θ_1) ⇝ N(0, I),   Ω̂_{R,1} = Γ̂_{11}^{−1}Σ̂_{11}Γ̂_{11}^{−1},

and

    Ω̂_{MD,1}^{−1/2}(θ̂_{MD,1} − θ_1) ⇝ N(0, I),   Ω̂_{MD,1} = Ω̂_{11} − Ω̂_{12}Ω̂_{22}^{−1}Ω̂_{21},

under regularity conditions. Since Ω̂ is of "sandwich" form, the estimators θ̂_1 and θ̂_{R,1} cannot be ranked in terms of (asymptotic) efficiency in general. On the other hand, θ̂_{MD,1} will always be (weakly) superior to both θ̂_1 and θ̂_{R,1}. In fact, because

    θ̂_1 = argmin_{θ_1} ( θ̂_1 − θ_1 ; θ̂_2 )′ [ Ω̂_{11}^{−1}  0 ; 0  Ω̂_{22}^{−1} ] ( θ̂_1 − θ_1 ; θ̂_2 ),   and   θ̂_{R,1} = argmin_{θ_1} ( θ̂_1 − θ_1 ; θ̂_2 )′ Γ̂ ( θ̂_1 − θ_1 ; θ̂_2 ),

each estimator admits a minimum distance interpretation, but only θ̂_{MD,1} can be interpreted as an optimal minimum distance estimator based on θ̂. See Newey and McFadden (1994) for more discussion of minimum distance estimation.
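In code, the optimal minimum distance step (7) is a one-line correction given the joint variance estimate; a hedged sketch, where i1 and i2 index the retained and redundant coefficient blocks:

    import numpy as np

    def md_combine(theta_hat, Omega, i1, i2):
        """Eq. (7): theta_MD,1 = theta_1 - Omega_12 Omega_22^{-1} theta_2,
        with variance Omega_MD,1 = Omega_11 - Omega_12 Omega_22^{-1} Omega_21."""
        O12 = Omega[np.ix_(i1, i2)]
        O22 = Omega[np.ix_(i2, i2)]
        theta_md = theta_hat[i1] - O12 @ np.linalg.solve(O22, theta_hat[i2])
        Omega_md = Omega[np.ix_(i1, i1)] - O12 @ np.linalg.solve(O22, O12.T)
        return theta_md, Omega_md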
As a consequence, we investigate whether an appropriately implemented θ̂_{MD,1} can lead to asymptotic efficiency gains relative to θ̂_1 and θ̂_{R,1}. More generally, as a by-product, we obtain an efficiency bound among minimum distance estimators and show that this bound coincides with those known in the literature for kernel-based density estimation at interior points (Granovsky and Müller, 1991; Cheng, Fan, and Marron, 1997).

In the remainder of this section we focus on the case of local polynomial density estimation at an interior point for concreteness; the SA presents more general results. Consequently, we assume that F(·) is p-times continuously differentiable in a neighborhood of x. Then (4) is satisfied, and a natural choice of R(·) is

    R(u) = ( R_1(u)′, R_2(u)′ )′ = ( 1, P(u)′, Q(u)′ )′,    (8)

where P(u) = (u, u²/2, · · · , u^p/p!)′ is a polynomial basis and Q(·) represents redundant regressors. Therefore, in our minimum distance construction, the parameters are

    θ = ( F(x), f(x), · · · , f^{(p−1)}(x), 0, · · · , 0 )′,    (9)

where the intercept F(x) corresponds to the constant regressor, the slope coefficients (f(x), · · · , f^{(p−1)}(x)) correspond to P(·), and the zero coefficients correspond to the redundant regressors Q(·); the smoothing error is of order ϱ(h) = o(h^p).

With (8) and (9), we define the minimum distance density estimator as f̂_MD(x) = e_1′θ̂_{MD,1}. Similarly, we have f̂(x) = e_1′θ̂_1 and f̂_R(x) = e_1′θ̂_{R,1}. Of course, if it is known a priori that the distribution function is p + q times continuously differentiable, then one can specify Q(·) to include higher-order polynomials: Q(u) = (u^{p+1}/(p+1)!, · · · , u^{p+q}/(p+q)!)′. By redefining the parameters as θ = (F(x), f(x), · · · , f^{(p+q−1)}(x))′, the smoothing error will be of order ϱ(h) = o(h^{p+q}). Notice that, in this case, f̂(x) and f̂_R(x) correspond to the density estimator introduced in Cattaneo, Jansson, and Ma (2020a) implemented with R(u) = (1, u, · · · , u^{p+q}/(p+q)!)′ and R(u) = (1, u, · · · , u^p/p!)′, respectively. Since the purpose of this section is to investigate the efficiency gains from incorporating additional redundant regressors, we do not exploit the extra smoothness condition, and we will treat Q(·) as redundant regressors even if Q(·) contains higher-order polynomials.

As both f̂(x) and f̂_R(x) are (weakly) asymptotically inefficient relative to f̂_MD(x) for any choice of Q(·), we consider the asymptotic variance of the minimum distance estimator, which can be obtained by establishing asymptotic counterparts of Γ̂ and Σ̂ after suitable scaling. Under regularity conditions (e.g., lack of perfect collinearity between P and Q), the asymptotic variance of the minimum distance ℓ-th derivative density estimator, f̂^{(ℓ)}_MD(x) = e_ℓ′θ̂_{MD,1} with 0 ≤ ℓ ≤ p − 1, is
    AsyVar[ f̂^{(ℓ)}_MD(x) ] = e_ℓ′ [ Ω_{PP} − Ω_{PQ} Ω_{QQ}^{−1} Ω_{QP} ] e_ℓ,

where

    [ Ω_{00}  Ω_{0P}  Ω_{0Q} ; Ω_{P0}  Ω_{PP}  Ω_{PQ} ; Ω_{Q0}  Ω_{QP}  Ω_{QQ} ] = Γ^{−1}ΣΓ^{−1}.

Therefore, the objective is to find a function Q(·) that minimizes the asymptotic variance AsyVar[f̂^{(ℓ)}_MD(x)]. Taking Q(·) scalar and properly orthogonalized, without loss of generality, we have ∫_{−1}^{1} P(u)K(u) du = 0 and ∫_{−1}^{1} (1, P(u)′)′ Q(u)K(u) du = 0. It follows that the problem of selecting an optimal Q(·) to minimize AsyVar[f̂^{(ℓ)}_MD(x)] is equivalent to the following variational problem:

    sup_{Q∈Q}  [ ∫_{−1}^{1} ∫_{−1}^{1} P_ℓ(u)Q(v) min{u, v} K(u)K(v) du dv ]² / ∫_{−1}^{1} ∫_{−1}^{1} Q(u)Q(v) min{u, v} K(u)K(v) du dv,    (10)

where

    Q = { Q(·) : ∫_{−1}^{1} Q(u)K(u) du = 0,  ∫_{−1}^{1} P(u)Q(u)K(u) du = 0 },

with P_ℓ(u) = e_ℓ′ ( ∫_{−1}^{1} P(u)P(u)′K(u) du )^{−1} P(u) and ℓ = 0, 1, . . . , p − 1.
The objective function in (10) is obtained from the fact that, after proper orthogonalization, the matrix Γ becomes block diagonal. See the SA for all other omitted details.

The following theorem characterizes a lower bound on the asymptotic variance of the minimum distance density estimator among all possible choices of redundant regressors.
Theorem 2 (Efficiency: Local Polynomial Density Estimator at Interior Points)
Suppose the conditions of Theorem 1 hold. If x ∈ X is an interior point, then

    inf_{Q∈Q} AsyVar[ f̂^{(ℓ)}_MD(x) ] ≥ ν_ℓ,   ν_ℓ = f(x) e_ℓ′ ( ∫_{−1}^{1} Ṗ(u)Ṗ(u)′ du )^{−1} e_ℓ,   0 ≤ ℓ ≤ p − 1,

where Ṗ(u) = (1, u, · · · , u^{p−1}/(p−1)!)′ is the derivative of P(u).

This theorem establishes a lower bound among minimum distance estimators. Importantly, it is shown in the SA that this bound coincides with the variance bound of all kernel-type density (and derivatives thereof) estimators employing the same order of the (induced) kernel function (Granovsky and Müller, 1991). Therefore, our minimum distance approach sheds new light on minimum variance results for nonparametric kernel-based estimators of the density function and its derivatives.

This lower bound can be (approximately) achieved by setting the redundant regressor Q(·) to include a certain higher-order polynomial function. By direct calculation, for each p = 1, 2, . . . ,

    lim_{j→∞} AsyVar[ f̂^{(ℓ)}_{MD,j}(x) ] = ν_ℓ,

where the minimum distance estimator f̂^{(ℓ)}_{MD,j}(x) = e_ℓ′θ̂_{MD,1} is constructed with

    Q(u) = u^{j+1} − P(u)′ ( ∫_{−1}^{1} P(u)P(u)′ du )^{−1} ∫_{−1}^{1} P(u) u^{j+1} du,  for ℓ = 0, 2, 4, · · · ,

or

    Q(u) = u^{j+2} − P(u)′ ( ∫_{−1}^{1} P(u)P(u)′ du )^{−1} ∫_{−1}^{1} P(u) u^{j+2} du,  for ℓ = 1, 3, 5, · · · ,

and K(·) being the uniform kernel. While we found that other kernel shapes can also be used, we chose the uniform kernel in this construction for three reasons. First, this choice is intuitive and coincides with the optimal choice in standard local polynomial regression settings. Second, it yields a simple closed form for the redundant regressor Q(·), so that the corresponding minimum distance estimator approximately achieves the variance bound for j large enough. Third, Q(·) is scalar and known, and the larger j is, the closer the asymptotic variance of the minimum distance density estimator will be to the efficiency bound. We assume Q(·) is orthogonal to P(·) for theoretical convenience. To implement this estimator, one only needs to run a local polynomial regression of the empirical distribution function on a constant, the polynomial basis P(·), and one additional regressor, u^{j+1} or u^{j+2} (depending on the choice of ℓ), and then apply (7) with the corresponding estimated variance-covariance matrix.
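Under the uniform kernel all required moments are ∫_{−1}^{1} u^s K(u) du = 1/(s + 1) for even s (and zero for odd s), so the orthogonalized redundant regressor has closed-form coefficients. A hedged sketch follows (q_column is an illustrative name); because (7) is invariant to invertible rescalings of the redundant block, the column may be evaluated on the kernel scale u = (x_i − x)/h.

    import numpy as np
    from math import factorial

    def q_column(u, p, j):
        """Q(u) = u^(j+1) orthogonalized against (1, u, ..., u^p/p!) in the
        L2 inner product of the uniform kernel on [-1, 1]; use j+2 for odd ell."""
        def mom(s):                        # int_{-1}^{1} u^s K(u) du, K uniform
            return 1.0 / (s + 1) if s % 2 == 0 else 0.0
        a = np.arange(p + 1)
        fact = np.array([factorial(s) for s in a], dtype=float)
        G = np.array([[mom(r + s) for s in a] for r in a]) / np.outer(fact, fact)
        g = np.array([mom(r + j + 1) for r in a]) / fact
        coef = np.linalg.solve(G, g)
        return u ** (j + 1) - (u[:, None] ** a / fact) @ coef

Appending this column to the design in the earlier estimation sketch and then calling md_combine reproduces, under these assumptions, the construction just described.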
In Figure 1, we consider the local linear/quadratic density estimator (ℓ = 1) with the redundant regressor being a higher-order polynomial (i.e., P(u) = u or P(u) = (u, u²/2)′, and Q(u) = u^{j+1}), and plot the corresponding equivalent kernel of our minimum distance density estimator for j = 1, 2, . . . , 30. As j increases, the equivalent kernel converges to the uniform kernel, which is well known to minimize the (asymptotic) variance among all density estimators employing second-order kernels (Granovsky and Müller, 1991). The asymptotic variance of our proposed minimum distance density estimator converges to the optimal asymptotic variance as j → ∞.

Figure 1: Equivalent Kernels of the Minimum Distance Estimators. Notes: We set P(u) = u or P(u) = (u, u²/2)′, and K uniform. The redundant regressor is Q(u) = u^{j+1} for j = 1, 2, · · · , 30. The initial equivalent kernel is quadratic (black solid line), and the minimum variance kernel is uniform (red solid line).

Finally, in this paper we focus on minimizing the asymptotic variance of the estimator θ̂ and its variants because our main goal is inference. However, our results could be modified and extended to optimize the asymptotic mean squared error (MSE). We do not pursue point estimation optimality further for brevity, but we do note that in the case of local polynomial density estimation (Cattaneo, Jansson, and Ma, 2020a), the resulting estimator is automatically MSE-optimal at interior points when p ≤ 2, because the induced equivalent kernel coincides with the Epanechnikov kernel (Granovsky and Müller, 1991; Cheng, Fan, and Marron, 1997).
4 Uniform Estimation and Inference

The distributional result presented in Theorem 1 is valid pointwise for x ∈ X. We now develop a uniform distributional approximation for the Studentized process

    T(x) = ( c′θ̂(x) − c′θ(x) ) / √( c′Ω̂(x)c ),   x ∈ I,

using the notation in (5), where c is a conformable vector and I ⊆ X is some prespecified region. This stochastic process is not tight, and hence does not converge in distribution. Our approximation proceeds in two steps. First, for a positive (vanishing) sequence r_{L,n}, we establish a uniform "linearization" of the process T(·) of the form:

    sup_{x∈I} | T(x) − T̄(x) | = O_P(r_{L,n}),    (11)

where

    T̄(x) = (1/√n) Σ_{i=1}^n K_{h,x}(x_i),   x ∈ I,

with

    K_{h,x}(x_i) = c′Υ_h Γ_{h,x}^{−1} ∫_{X_{h,x}} R(u) [ 1(x_i ≤ x + hu) − F(x + hu) ] K(u) f(x + hu) du / √( c′Υ_h Ω_{h,x} Υ_h c ),

and Ω_{h,x} = Γ_{h,x}^{−1} Σ_{h,x} Γ_{h,x}^{−1}. In words, we show that the Studentized process T(·), which involves various pre-asymptotic estimated quantities, is uniformly close to the linearized process T̄(·), which is a sample average of independent observations. To obtain (11), we develop new uniform approximations with precise convergence rates, which may be of independent interest in semiparametric estimation and inference settings. See the SA for more details.

Second, in a possibly enlarged probability space, we show that there exists a copy of T̄(·), denoted by T̃(·), and a Gaussian process {B(x) : x ∈ I} with a suitable variance-covariance structure, such that

    sup_{x∈I} | T̃(x) − B(x) | = O_P(r_{G,n}),    (12)

where r_{G,n} is another positive (vanishing) sequence. This type of strong approximation result, when established with a suitably fast rate r_{G,n} → 0,
can be used to deduce distributional approximations for statistics such as sup_{x∈I} |T(x)|, which are useful for constructing confidence bands or for conducting hypothesis tests about shape or other restrictions on the function of interest. To obtain (12), we employ a result established by Rio (1994), and later extended by Giné, Koltchinskii, and Sakhanenko (2004); see also Giné and Nickl (2010, Proof of their Proposition 5).

In this section we consider a fixed linear combination c for ease of exposition, but in the SA we discuss the more general case where c can depend on both the evaluation point x and the tuning parameter h, which is necessary to establish uniform distributional approximations for the minimum distance estimators introduced in Section 3.3. All the results reported in this section apply to the latter class of estimators as well.

4.1 Assumptions

In addition to Assumption 1, we impose the following conditions on the data generating process. In the sequel, continuity and differentiability conditions at boundary or kink points should be interpreted as one-sided statements (i.e., as in part (i) of Assumption 1).
Assumption 2
Let I ⊆ X be a compact interval.

(i) The density function f(x) is twice continuously differentiable and bounded away from zero on I.

(ii) There exist some δ > 0 and compactly supported kernel functions K†(·) and {K‡,d(·)}_{d≤δ} such that (ii.1) sup_{u∈R} |K†(u)| + sup_{d≤δ, u∈R} |K‡,d(u)| < ∞; (ii.2) the support of K‡,d(·) has Lebesgue measure bounded by Cd, where C is independent of d; and (ii.3) for all u and v such that |u − v| ≤ δ,

    | K(u) − K(v) | ≤ | u − v | · K†(u) + K‡,|u−v|(u).

(iii) The basis function R(·) is Lipschitz continuous on [−1, 1].

(iv) For all h sufficiently small, the minimum eigenvalues of Γ_{h,x} and h^{−1}Σ_{h,x} are bounded away from zero uniformly for x ∈ I.

Part (i) strengthens the smoothness requirements of Assumption 1(i) to hold uniformly over I. Part (ii) imposes additional requirements on the kernel function. Although seemingly technical, it permits a decomposition of the difference |K(u) − K(v)| into two parts. The first part, |u − v| · K†(u), is a kernel function which vanishes uniformly as |u − v| becomes small. Note that this will be the case for all piecewise smooth kernel functions, such as the triangular or the Epanechnikov kernel. However, differences of discontinuous kernels, such as the uniform kernel, cannot be made uniformly close to zero, which motivates the second term in the above decomposition. Part (iii) requires the basis function to be reasonably smooth. Together, parts (i)-(iii) imply that the estimator θ̂(x) will be "smooth" in x, which is important to control the complexity of the process T(·). Finally, part (iv) implies that the matrices Γ_{h,x} and Σ_{h,x} are well behaved uniformly for x ∈ I.

4.2 Strong Approximation

We first discuss the covariance of the process T̄(·). It is straightforward to show that

    Cov[ T̄(x), T̄(y) ] = c′Υ_h Ω_{h,x,y} Υ_h c / ( √(c′Υ_h Ω_{h,x} Υ_h c) √(c′Υ_h Ω_{h,y} Υ_h c) ),   Ω_{h,x,y} = Γ_{h,x}^{−1} Σ_{h,x,y} Γ_{h,y}^{−1},

where

    Σ_{h,x,y} = ∫_{X_{h,y}} ∫_{X_{h,x}} R(u)R(v)′ [ F(min{x + hu, y + hv}) − F(x + hu)F(y + hv) ] K(u)K(v) f(x + hu)f(y + hv) du dv,

and Σ_{h,x,x} = Σ_{h,x}.

Now we state the second main distributional result of this paper in the following theorem.
Suppose Assumptions 1 and 2 hold, and that h → 0 and nh²/log n → ∞.

1. (11) holds with

    r_{L,n} = √(n/h) sup_{x∈I} ϱ(h, x) + log n / √(nh).
2. On a possibly enlarged probability space, there exist a copy T̃(·) of T̄(·) and a centered Gaussian process {B(x) : x ∈ I}, defined with the same covariance as T̄(·), such that (12) holds with r_{G,n} = log n / √(nh).

The first part of this theorem gives conditions such that the feasible Studentized process T(·) is well approximated by the infeasible (linear) process T̄(·), uniformly for x ∈ I. The latter process is mean zero and takes a kernel-based form. However, standard strong approximation results for kernel-type estimators do not apply directly to the process T̄(·), as the implied (equivalent, Studentized) kernel K_{h,x}(·) depends not only on the bandwidth but also on the evaluation point in a non-standard way. That is, due to the boundary adaptive feature of the local regression distribution estimators, the shape of the implied kernel automatically changes for different evaluation points depending on whether they are interior or boundary points.

Putting the two results together, it follows that the distribution of T(·) is approximated by that of B(·), provided the following condition holds:

    √(n/h) sup_{x∈I} ϱ(h, x) + log n / √(nh) → 0.

To facilitate understanding of this rate restriction, we consider the local polynomial density estimation setting of Cattaneo, Jansson, and Ma (2020a), where the basis function takes the form R(u) = (1, u, u²/2, · · · , u^p/p!)′ for some p ≥ 1, and the second element of θ̂(x) estimates the density f(x). That is, e_1′θ̂(x) = f̂(x) →_P f(x) under Assumption 1, where c = e_1. By a Taylor expansion argument, it is easy to see that the smoothing bias has order h^{p+1} as long as the distribution function F(·) is suitably smooth. Then the above rate restriction reduces to √(nh^{2p+1}) + log n/√(nh) → 0. If one is interested in functionals such as sup_{x∈I} |T(x)|, then an extra √(log n) factor is needed in the rate restriction, as discussed in Chernozhukov, Chetverikov, and Kato (2014a). A formal statement of such a result is given below, after we discuss how to further approximate the infeasible Gaussian process B(·).
4.3 Confidence Bands

Feasible inference cannot be based on the Gaussian process B(·), as its covariance structure is unknown and has to be estimated in practice. For estimation, first recall from Sections 2 and 3 that W_i(x) = K((x_i − x)/h)/h, R_i(x) = R(x_i − x), and Γ̂(x) = (1/n) Σ_{i=1}^n W_i(x)R_i(x)R_i(x)′. Then, we construct the plug-in estimator of Ω_{h,x,y} as follows:

    Ω̂_{h,x,y} = n Υ_h^{−1} Γ̂(x)^{−1} Σ̂(x, y) Γ̂(y)^{−1} Υ_h^{−1},   Σ̂(x, y) = (1/n²) Σ_{i=1}^n ψ̂_i(x) ψ̂_i(y)′,

where

    ψ̂_i(x) = (1/n) Σ_{j=1}^n W_j(x) R_j(x) ( 1(x_i ≤ x_j) − F̂_j ).

The following lemma characterizes the approximation error from replacing the infeasible process B(·) by its plug-in feasible version. Conditioning on the data means conditioning on x_1, x_2, · · · , x_n.

Lemma 1
Suppose Assumptions 1 and 2 hold, and that h → 0 and nh²/log n → ∞. Then, there exist two centered Gaussian processes, {B̃(x) : x ∈ I} and {B̂(x) : x ∈ I}, such that (i) B̃(·) has the same distribution as B(·) and is independent of the data; (ii) conditional on the data, B̂(·) has covariance

    Cov[ B̂(x), B̂(y) | Data ] = c′Υ_h Ω̂_{h,x,y} Υ_h c / ( √(c′Υ_h Ω̂_{h,x} Υ_h c) √(c′Υ_h Ω̂_{h,y} Υ_h c) );

and (iii)

    E[ sup_{x∈I} | B̃(x) − B̂(x) | | Data ] = O_P( ( log n / √(nh) )^{1/2} ).

The following theorem combines the previous results, and justifies the uniform confidence band constructed using critical values from sup_{x∈I} |B̂(x)|.

Theorem 4 (Kolmogorov-Smirnov Distance)
Suppose Assumptions 1 and 2 hold, and that

    √(n log n / h) sup_{x∈I} ϱ(h, x) + (log n)⁴ / (nh) = o(1).

Then

    sup_{u∈R} | P[ sup_{x∈I} |T(x)| ≤ u ] − P[ sup_{x∈I} |B̂(x)| ≤ u | Data ] | = o_P(1),

where B̂(·) is the centered Gaussian process defined in Lemma 1.

From Theorem 4, an asymptotically valid 100(1 − α)% confidence band for {c′θ(x) : x ∈ I} is given by

    CI_α(I) = { [ c′θ̂(x) − q_{1−α} √(c′Ω̂(x)c),  c′θ̂(x) + q_{1−α} √(c′Ω̂(x)c) ] : x ∈ I },

where q_{1−α} is the 1 − α quantile of sup_{x∈I} |B̂(x)| conditional on the data. That is,

    q_a = inf{ u ∈ R : P[ sup_{x∈I} |B̂(x)| ≤ u | Data ] ≥ a },

which can be obtained by simulating the process B̂(·) on a dense grid.

As an alternative to analytic estimation of the covariance kernel, it is possible to consider resampling methods as in Chernozhukov, Chetverikov, and Kato (2014a), Cheng and Chen (2019), Cattaneo, Farrell, and Feng (2020), and references therein. We leave resampling-based inference for future research.
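One convenient way to simulate B̂(·) on a grid is with Gaussian multipliers: conditional on the data, Σ_i ξ_i c′Γ̂(x_g)^{−1}ψ̂_i(x_g), after Studentization, has exactly the plug-in covariance of Lemma 1 (the normalizing constants cancel). A hedged sketch, where the input S stacks these influence terms over grid points:

    import numpy as np

    def band_quantile(S, alpha=0.05, n_draws=2000, seed=0):
        """S[i, g] = c' Gamma_hat(x_g)^{-1} psi_hat_i(x_g) over grid points x_g.
        Returns the (1 - alpha) conditional quantile of sup_g |B_hat(x_g)|."""
        rng = np.random.default_rng(seed)
        sd = np.sqrt((S ** 2).sum(axis=0))            # proportional to the plug-in se
        xi = rng.standard_normal((n_draws, S.shape[0]))
        sups = np.abs(xi @ S / sd).max(axis=1)
        return np.quantile(sups, 1.0 - alpha)

The band then collects c′θ̂(x_g) ± q̂_{1−α}√(c′Ω̂(x_g)c) across the grid.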
5 Extensions

We briefly outline some extensions of our main results. First, we introduce a re-weighted version of θ̂, which is useful in applications, as illustrated in Section 6. Second, we discuss a new class of local regression estimators based on a non-random least squares loss function, which has some interesting theoretical properties and may be of interest in some semiparametric settings. Finally, we discuss how to incorporate restrictions in the estimation procedure, employing the generic structure of the local basis R(·).

5.1 Estimated Weights

Suppose (x_1, w_1), (x_2, w_2), · · · , (x_n, w_n) is a random sample, where x_i is a continuous random variable with a smooth cumulative distribution function, but now w_i is an additional "weighting" variable, possibly random and involving unknown parameters. We consider the generic weighted distribution parameter

    H(x) = E[ w_i 1(x_i ≤ x) ],

whose practical interpretation depends on the specific choice of w_i.

We discuss some examples. If w_i = 1, H(·) becomes the distribution function F(·), and hence the results above apply. If w_i is set to be a certain ratio of propensity scores for subpopulation membership, then the derivative dH(x)/dx becomes a counterfactual density function, as in DiNardo, Fortin, and Lemieux (1996); see Section 6.1 below. If w_i is set to be a combination of the treatment assignment and treatment status variables, then the resulting derivative can be used to conduct specification testing in IV models; or, if w_i is set to be a certain ratio of propensity scores for a binary instrument, then the derivative can be used to identify distributions of compliers, as in Imbens and Rubin (1997), Abadie (2003), and Kitagawa (2015); see Section 6.2 below. Other examples of applicability of this extension include bunching, missing data, measurement error, data combination, and treatment effect settings.

More generally, when weights are allowed for, there is another potentially interesting connection between the estimand dH(x)/dx and classical weighted averages featuring prominently in econometrics, because dH(x)/dx = E[w_i | x_i = x] f(x), which is useful in the context of partial means and related problems as in Newey (1994b).

Our main results extend immediately to allow for √n-consistent estimated weights w_i or, more generally, to estimated weights that converge sufficiently fast. Specifically, we let F̂_{w,i} = (1/n) Σ_{j=1}^n w_j 1(x_j ≤ x_i) in place of F̂_i, and investigate the large sample properties of our proposed estimator in (2) when w_i is replaced by ŵ_i = w_i(β̂), with β̂ an a_n-consistent estimator, for some a_n → ∞, and w_i(·) a known function of the data. That is, estimated weights are used to construct the weighted empirical distribution function F̂_{w,i}. Provided that a_n^{−1} vanishes sufficiently fast (e.g., a_n = √n under regularity conditions), all the results reported in the previous sections apply to this extension, which we illustrate empirically below.
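Implementation-wise, the only change relative to the unweighted estimator is the weighted empirical distribution function; a hedged sketch (O(n²) for clarity):

    import numpy as np

    def weighted_edf(x_data, w_hat):
        """F_w,i = (1/n) * sum_j w_hat_j * 1(x_j <= x_i)."""
        n = x_data.shape[0]
        return ((x_data[None, :] <= x_data[:, None]) @ w_hat) / n

Passing weighted_edf(x_data, w_hat) in place of F_hat in the earlier estimation sketch yields the re-weighted estimator.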
5.2 Local L² Distribution Estimators

The local regression distribution estimator is obtained from a least squares projection of the empirical distribution function onto a local basis, where the projection puts equal weight on all observations. That is, (2) employs an L²(F̂)-projection:

    θ̂(x) = argmin_θ ∫ ( F̂(u) − R(u − x)′θ )² K((u − x)/h) dF̂(u).

This suggests a class of local L² distribution estimators given by

    θ̂_G(x) = argmin_θ ∫ ( F̂(u) − R(u − x)′θ )² K((u − x)/h) dG(u)

for some measure G. We show in the SA that all our theoretical results continue to hold for θ̂_G, provided that G is absolutely continuous with respect to the Lebesgue measure and the Radon-Nikodym derivative is reasonably smooth. (Note that G does not need to be a proper distribution function.)

The estimator θ̂_G involves only one average, while the local regression estimator θ̂ has two layers of averages (one from the construction of the empirical distribution function, and the other from the L²(F̂)-projection/regression). As a result, with suitable centering and scaling, the local L² distribution estimator θ̂_G can be written as the sum of a mean-zero influence function and a smoothing bias term. Since θ̂_G no longer involves a second-order U-statistic (cf. (6)) or a leave-in bias, pointwise asymptotic normality can be established under weaker conditions: it is no longer necessary to assume nh² → ∞ (Theorem 1); nh → ∞ will suffice. Similarly, for the strong approximation results we only need nh/log n → ∞, as opposed to nh²/log n → ∞ (part 1 of Theorem 3).

In addition, the local L² distribution estimator θ̂_G is robust to "low" density. To see this, recall that the local regression estimator θ̂ involves regressing the empirical distribution function on a local basis, which means that this estimator can be numerically unstable if there are only a few observations near the evaluation point. More precisely, the matrix Γ̂ will be close to singular if the effective sample size is small.

Although the local L² distribution estimator θ̂_G takes a simpler form, is robust to low density, and its large sample properties can be established under weaker bandwidth conditions, it does have one drawback: it requires knowledge of the support X. To be more precise, let G be the Lebesgue measure; then the local L² distribution estimator may be biased at or near the boundaries of X if it is compact. In contrast, the local regression distribution estimator is fully boundary adaptive, even in cases where the location of the boundary is unknown. See Cattaneo, Jansson, and Ma (2020a) for further discussion in the case of density estimation.
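A hedged sketch of θ̂_G with G the Lebesgue measure, discretizing the integral on a grid over [x − h, x + h] (grid size and quadrature weights are illustrative; as noted above, this version is not boundary adaptive):

    import numpy as np
    from math import factorial

    def lrd_l2_fit(x_data, x, h, p=2, grid_size=201):
        """Local L2 distribution estimator theta_G(x) with dG(u) = du."""
        n = x_data.shape[0]
        u = np.linspace(x - h, x + h, grid_size)
        F_emp = np.searchsorted(np.sort(x_data), u, side="right") / n
        k = np.maximum(1.0 - np.abs((u - x) / h), 0.0) / h   # triangular kernel
        powers = np.arange(p + 1)
        fact = np.array([factorial(s) for s in powers], dtype=float)
        R = (u[:, None] - x) ** powers / fact
        Rk = R * (k * np.gradient(u))[:, None]               # kernel times du
        return np.linalg.solve(Rk.T @ R, Rk.T @ F_emp)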
5.3 Incorporating Restrictions

The formulation (2) is general enough to allow for some interesting extensions in the definition of the local regression distribution estimator. The key observation is that the estimator has a weighted least squares representation with a generic local basis function R(·), which allows for deploying well-known results from linear regression models. We briefly illustrate this idea with three examples.

First, consider the case where the local basis R(u) incorporates specific restrictions, such as continuity or lack thereof, on the distribution function, density function, or higher-order derivatives at the evaluation point x. To give a concrete example, suppose that F(x) and f(x) are known to be continuous at some interior point x ∈ X, while no information is available for the higher-order derivatives. Then, these restrictions can be effortlessly incorporated into the local regression distribution estimator by considering the local basis function

    R(u) = ( 1, u, (u²/2!)1(u < 0), (u²/2!)1(u ≥ 0), . . . , (u^p/p!)1(u < 0), (u^p/p!)1(u ≥ 0) )′

(see the sketch at the end of this subsection). It follows that f̂(x) = e_1′θ̂(x) consistently estimates the density f(x) at the kink point x, while e_2′θ̂(x) and e_3′θ̂(x) are consistent estimators of the left and right derivatives of the density function, respectively (and similarly for other higher-order one-sided derivatives). In this example, the generalized formulation not only reduces the bias of f̂(x) = e_1′θ̂(x) even in the absence of continuity of higher-order derivatives, but also provides the basis for testing procedures for continuity of higher-order derivatives, e.g., by considering a statistic based on (e_2 − e_3)′θ̂(x). This provides a concrete illustration of the advantages of allowing for a generic local basis. A distinct example was developed in Cattaneo, Jansson, and Ma (2018, 2020a) for density discontinuity testing in regression discontinuity designs.

As a second example, consider imposing shape constraints, such as positivity or monotonicity, in the construction of the local regression distribution estimator. Such constraints amount to specific restrictions on the parameter space of θ, which naturally leads to restricted weighted least squares estimation in the context of our estimator. To be concrete, consider constructing a local polynomial density estimator which is non-negative, in which case R(u) is a polynomial basis of order p ≥ 1 and the estimator solves

    θ̂(x) = argmin_θ Σ_{i=1}^n W_i ( F̂_i − R_i′θ )²  subject to  Tθ ≥ 0,

where T denotes a matrix of restrictions; in this example, T = e_1′ to ensure that f̂(x) = e_1′θ̂(x) ≥ 0. As a third example, one can consider

    θ̂(x) = argmin_θ Σ_{i=1}^n W_i ( F̂_i − Λ(R_i′θ) )²

for some known link function Λ(·). For instance, such an extension may be useful to model distributions with large support or to impose specific local shape constraints.

All of the examples above, as well as many others, can be analyzed using the large sample results developed in this paper and proper extensions thereof. We plan to investigate these and other extensions in future research.
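Returning to the first example, the restricted basis is straightforward to build; a hedged sketch with v = x_i − x (kink_basis is an illustrative name):

    import numpy as np
    from math import factorial

    def kink_basis(v, p):
        """R(v) = (1, v, (v^2/2!)1(v<0), (v^2/2!)1(v>=0), ...,
        (v^p/p!)1(v<0), (v^p/p!)1(v>=0))'."""
        cols = [np.ones_like(v), v]
        for s in range(2, p + 1):
            vs = v ** s / factorial(s)
            cols += [vs * (v < 0), vs * (v >= 0)]
        return np.column_stack(cols)

Regressing F̂_i on kink_basis(x_data − x, p) with kernel weights delivers θ̂(x), and a Wald statistic for (e_2 − e_3)′θ̂(x) = 0, with variance (e_2 − e_3)′Ω̂(x)(e_2 − e_3), tests continuity of the first derivative of the density.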
6 Applications to Program Evaluation

We discuss two applications of our main results in the context of program evaluation (see Abadie and Cattaneo, 2018, and references therein).

6.1 Counterfactual Densities

In this first application, the objects of interest are density functions over their entire support, including boundaries and near-boundary regions, estimated using estimated weighting schemes, as this is a key feature needed for counterfactual analysis (and many other applications). Our general estimation strategy is specialized to the counterfactual density approach originally proposed by DiNardo, Fortin, and Lemieux (1996). We focus on density estimation, and we refer readers to Chernozhukov, Fernandez-Val, and Melly (2013) for related methods based on distribution functions as well as for an overview of the literature on counterfactual analysis.

To construct a counterfactual density or, more generally, re-weighted density estimators, we simply need to set the weights (w_1, w_2, · · · , w_n) appropriately. In most applications, this also requires constructing preliminary consistent estimators of these weights, as we illustrate in this section. Suppose the observed data are (x_i, t_i, z_i)′, i = 1, 2, . . . , n, where x_i continues to be the main outcome variable, z_i collects other covariates, and t_i is a binary variable indicating to which group unit i belongs. For concreteness, we call these two groups control and treatment, though our discussion need not bear any causal interpretation.

The marginal distribution of the outcome variable x_i for the full sample can be easily estimated without weights (that is, with w_i = 1). In addition, two conditional densities, one for each group, can be estimated using w_i = t_i/P[t_i = 1] for the treatment group and w_i = (1 − t_i)/P[t_i = 0] for the control group; the resulting estimators are denoted by f̂_1(x) and f̂_0(x), respectively. For example, in the context of randomized controlled trials, these density estimators can be useful to depict the distribution of the outcome variable for control and treatment units.

A more challenging question is: what would the outcome distribution have been, had the treated units had the same covariates distribution as the control units? The resulting density is called the counterfactual density for the treated, denoted by f□(x). Knowledge of this distribution is important for understanding differences between f_0(x) and f_1(x), as the outcome distribution is affected by both group status and the covariates distribution. Furthermore, the counterfactual distribution has another useful interpretation. Assume the outcome variable is generated from potential outcomes, x_i = t_i x_i(1) + (1 − t_i)x_i(0); then, under unconfoundedness, that is, assuming t_i is independent of (x_i(0), x_i(1)) conditional on the covariates z_i, f□(x) is the counterfactual distribution for the control group: it is the density function associated with the distribution of x_i(1) conditional on t_i = 0.

Regardless of the interpretation taken, f□(x) is of interest and can be estimated using our generic density estimator f̂(x) with the following weights:

    w□_i = t_i · ( P[t_i = 0 | z_i] / P[t_i = 1 | z_i] ) · ( P[t_i = 1] / P[t_i = 0] ).

In practice, this weighting scheme is unknown because the conditional probability P[t_i = 1 | z_i], a.k.a. the propensity score, is not observed. Thus, researchers estimate this quantity using a flexible parametric model, such as Probit or Logit.
Our technical results allow for these estimated weights: counterfactual density estimators can be formed after replacing the theoretical weights by their estimated counterparts, provided the estimated weights converge sufficiently fast to their population counterparts.

To be more precise, we can model P[t_i = 1 | z_i] = G(b(z_i)′β) for some known link function G(·), such as Logit or Probit, and a K-dimensional basis expansion b(z_i), such as power series or B-splines. If the model is correctly specified for some fixed K and basis function b(·), then max_{1≤i≤n} |w_i − ŵ_i| = O_P(a_n^{−1}) with a_n = √n under mild regularity conditions, and all our results carry over to the setting with estimated weights mentioned in Section 5.1. Alternatively, from a nonparametric perspective, if K → ∞ as n → ∞, then, for appropriate basis functions b(·) and regularity conditions, max_{1≤i≤n} |w_i − ŵ_i| = O_P(a_n^{−1}) with a_n depending on both K and n. Then, as in the parametric case, our main results carry over if a_n^{−1} vanishes sufficiently fast.
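As a hedged illustration, the estimated weights can be formed with any flexible propensity score fit; scikit-learn's logistic regression is an assumed convenience here, not the paper's implementation, and Z stands for the basis expansion b(z_i):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def counterfactual_weights(t, Z):
        """Plug-in version of the weights w_box above, with P[t=1|z] replaced
        by an estimated Logit propensity score."""
        ps = LogisticRegression().fit(Z, t).predict_proba(Z)[:, 1]
        p1 = t.mean()                                  # sample analog of P[t = 1]
        return t * ((1.0 - ps) / ps) * (p1 / (1.0 - p1))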
927 being either high school graduates or GED.Summary statistics are available as the fourth column in Table 1. We leave further details on theJTPA program to Section 6.2, where we utilize a larger sample and conduct distribution estimationin a randomized controlled (intention-to-treat) and instrumental variables (imperfect compliance)setting.It is well-known that education has significant impact on labor income, and we first plot earningdistributions separately for subsamples with and without high school degree or GED. The twoestimates, ˆ f ( x ) and ˆ f ( x ), are plotted in panel (a) of Figure 2. There, it is apparent that theearning distribution for high school graduates is very different compared to those without high27able 1: Summary Statistics for the JTPA data. Full JTPA Offer JTPA Enrollment
N Y N Y
Income .
20 17191 .
13 18321 .
59 17015 .
58 19098 . HS or GED .
72 0 .
71 0 .
72 0 .
70 0 . Male .
46 0 .
47 0 .
46 0 .
48 0 . Nonwhite .
36 0 .
36 0 .
36 0 .
36 0 . Married .
28 0 .
27 0 .
29 0 .
27 0 . Work ≤ .
44 0 .
43 0 .
44 0 .
44 0 . AFDC .
17 0 .
17 0 .
17 0 .
16 0 . Age .
24 0 .
25 0 .
24 0 .
24 0 . .
21 0 .
20 0 .
21 0 .
21 0 . .
24 0 .
25 0 .
24 0 .
24 0 . .
19 0 .
19 0 .
19 0 .
20 0 . .
08 0 .
08 0 .
08 0 .
08 0 . Columns : (i) Full: full sample; (ii) JTPA Offer: whether offered JTPA services; (iii) JTPA Enrollment: whetherenrolled in JTPA.
Rows : (i) Income: cumulative income over 30-month period post random selection; (ii) HS orGED: whether has high school degree or GED; (iii) Male: gender being male; (iv) Nonwhite: black or Hispanic; (v)Married: whether married; (vi) Work ≤
12: worked less than 12 weeks during one year period prior to randomassignment; (vii) Age: age groups. school degree. More specifically, both the mean and median of ˆ f ( x ) are higher than ˆ f ( x ), andˆ f ( x ) seems to have much thinner left tail and thicker right tail.As mentioned earlier, direct comparison between ˆ f ( x ) and ˆ f ( x ) does not reveal the impact ofhaving high school degree on earning, since the difference is confounded by the fact that individualswith high school degree can have very different characteristics (measured by covariates) comparedto those without. We employ covariates adjustments, and ask the following question: what wouldthe earning distribution have been for high school graduates, had they had the same characteristicsas those without such degree?We estimate the counterfactual distribution f (cid:3) ( x ) by our proposed method, and is shown inpanel (b) of Figure 2. The difference between ˆ f (cid:3) ( x ) and ˆ f ( x ) is not very profound, althoughit seems ˆ f (cid:3) ( x ) has smaller mean and median. On the other hand, difference between ˆ f ( x ) andˆ f (cid:3) ( x ) remains highly nontrivial. Our empirical finding is compatible with existing literature onreturn to education: it is generally believed that education leads to significant accumulation ofhuman capital, hence increase in labor income. As a result, educational attainment is usually oneof the most important “explanatory variables” for differences in income.28 (a) Marginal Distributions (b) Counterfactual Distribution Figure 2: Earning Distributions by Education, JTPA.
Notes : (i) Full: earning distribution for the full sample ( n = 5 , n = 1 ,
520 and 3 , R (and Stata ) package described in Cattaneo, Jansson, and Ma (2020b).
Self-selection and treatment effect heterogeneity are important concerns in causal inference andstudies of socioeconomic programs. It is now well understood that classical treatment parameters,such as the average treatment effect or the treatment effect on the treated, are not identifiable evenwhen treatment assignment is fully randomized due to imperfect compliance. Indeed, what canbe recovered is either an intention-to-treat parameter or, using the instrumental variables method,some other more local treatment effect, specific to a subpopulation: the “compliers.” See Imbensand Rubin (2015) and references therein for further discussion. Practically, this poses two issuesfor empirical work employing instrumental variables methods focusing on local average treatmenteffects. First, since compliers are usually not identified, it is crucial to understand how differenttheir characteristics are compared to the population as a whole. Second, it is often desirable tohave a thorough estimate of the distribution of potential outcomes, which provides information not29nly on the mean or median, but also its dispersion, overall shape, or local curvatures.Motivated by these observations, and to illustrate the applicability of our density estimationmethods, we now consider two related problems. First, we investigate specification testing inthe context of local average treatment effects based on comparison of two (rescaled) densities asdiscussed by Kitagawa (2015). This method requires estimating two densities nonparametrically.Second, we consider estimating the density of potential outcomes for compliers in the IV settingof Abadie (2003), which allows for conditioning on covariates. The resulting density plots not onlyprovide visual guides on treatment effects, but also can be used for further analysis to construct arich set of summary statistics or as inputs for semiparametric procedures. Both methods requireestimated weights.We first introduce the notation and the potential outcomes framework. For each individual thereis a binary indicator of treatment assignment (a.k.a. the instrument), denoted by d i . The actualtreatment (takeup), however, can be different, due to imperfect compliance. More specifically, let t i (0) and t i (1) be the two potential treatments, corresponding to d i = 0 and 1, then the observedbinary treatment indicator is t i = d i t i (1) + (1 − d i ) t i (0). We also have a pair of potential outcomes, x i (0) and x i (1), associated with t i = 0 and 1, and what is observed is x i = t i x i (1) + (1 − t i ) x i (0).Finally, also available are some covariates, collected in z i . We assume that the observed data is arandom sample { ( x i , t i , d i , z i ) : 1 ≤ i ≤ n } .There are three important assumptions for identification. First, the instrument has to beexogenous, meaning that conditional on covariates, it is independent of the potential treatmentsand outcomes. Second, the instrument has to be relevant, meaning that conditional on covariates,the instrument should be able to induce changes in treatment takeups. Third, there are no defiers(a.k.a. the monotonicity assumption). We do not reproduce the exact details of those assumptionsand other technical requirements for identification; see the references given for more details.Building on Imbens and Rubin (1997), Kitagawa (2015) discusses interesting testable implica-tions in this IV setting, which can be easily adapted to test instrument validity using our densityestimator. 
In the current context, the testable implications take the following form: for any (mea-30urable) set I ⊂ R , P [ x i ∈ I , t i = 1 | d i = 1] ≥ P [ x i ∈ I , t i = 1 | d i = 0] , and P [ x i ∈ I , t i = 0 | d i = 0] ≥ P [ x i ∈ I , t i = 0 | d i = 1] . The first requirement holds trivially in the JTPA context, since the program does not allow enroll-ment without being offered (that is, P [ t i = 1 | d i = 0] = 0). Therefore we demonstrate the secondwith our density estimator. Let f d =0 ,t =0 ( x ) be the earning density for the subsample d i = 0 and t i = 0, that is, for individuals without JTPA offer and not enrolled. Similarly let f d =1 ,t =0 ( x ) bethe earning density for individuals offered JTPA but not enrolled. Then the second inequality inthe above display is equivalent to P [ t i = 0 | d i = 0] · f d =0 ,t =0 ( x ) ≥ P [ t i = 0 | d i = 1] · f d =1 ,t =0 ( x ) , for all x ∈ I . Thus, our density estimator can be used directly, where f d =0 ,t =0 ( x ) is consistently estimated withweights w d =0 ,t =0 i = (1 − d i )(1 − t i ) / P [ d i = 0 , t i = 0], and f d =1 ,t =0 ( x ) is consistently estimated with w d =1 ,t =0 i = d i (1 − t i ) / P [ d i = 1 , t i = 0].Abadie (2003) showed that the distributional characteristics of compliers are identified, and canbe expressed as re-weighted marginal quantities. We focus on three distributional parameters here.The first one is the distribution of the observed outcome variable, x i , for compliers, which is denotedby f c . This parameter is important for understanding the overall characteristics of compliers, andhow different it is from the populations. The other two parameters are distributions of the potentialoutcomes, x i (0) and x i (1), for compliers, since the difference thereof reveals the effect of treatmentfor this subsample. They are denoted by f c, and f c, , respectively. The three density functions canalso be estimated using our proposed local polynomial density estimator ˆ f ( x ) with, respectively,the following weights: w ci = 1 P [ t i (1) > t i (0)] · (cid:18) − t i (1 − d i ) P [ d i = 0 | z i ] − (1 − t i ) d i P [ d i = 1 | z i ] (cid:19) ,w c, i = 1 P [ t i (1) > t i (0)] · (1 − t i ) · − d i − P [ d i = 0 | z i ] P [ d i = 0 | z i ] P [ d i = 1 | z i ] ,w c, i = 1 P [ t i (1) > t i (0)] · t i · d i − P [ d i = 1 | z i ] P [ d i = 0 | z i ] P [ d i = 1 | z i ] . P [ d i = 1 | z i ] so long as they converge sufficiently fast to their population counterparts. The JTPA is a large publicly funded job training program targeting at individuals who are econom-ically disadvantaged and/or facing significant barriers to employment. Individuals were randomlyoffered JTPA training, the treatment take-up, however, was only about 67% among those who wereoffered. Therefore the JTPA offer provides valid instrument to study the impact of the job trainingprogram. We continue to use the same data as Abadie, Angrist, and Imbens (2002), who analyzedquantile treatment effects on earning distributions.Besides the main outcome variable and covariates already introduced in Section 6.1, also avail-able are the treatment take-up (JTPA enrollment) and the instrument (JTPA Offer). See Table1 for summary statistics for the full sample and separately for subgroups. As the JTPA offerswere randomly assigned, it is possible to estimate the intent-to-treat effect by mean comparison.Indeed, individuals who are offered JTPA services earned, on average, $1 ,
130 more than those notoffered. On the other hand, due to imperfect compliance, it is in general not possible to estimatethe effect of job training (i.e. the effect of JTPA enrollment), unless one is willing to impose strongassumptions such as constant treatment effect.We first implement the IV specification test, which is straightforward using our density estimatorˆ f ( x ). We plot the two estimated (rescaled) densities in Figure 3. A simple eyeball test suggestsno evidence against instrumental variable validity. A formal hypothesis test, justified using ourtheoretical results, confirms this finding.Second, we estimate the density of the potential outcomes for compliers. In panel (a) of Figure4, we plot earning distributions for the full sample and that for the compliers, where the secondis estimated using the weights w ci , introduced earlier. The two distributions seem quite similar,while compliers tend to have higher mean and thinner left tail in the earning distribution. Next weconsider the intent-to-treat effect, as the difference in earning distributions for subgroups with andwithout JTPA offer (a.k.a. the reduced form estimate in the 2SLS context). This is given in panel32 Figure 3: Testing Validity of Instruments, JTPA.
Notes : (i) JTPA: Not Offered & Not Enrolled: the scaled density estimate P i ( t i =0 ,d i =0) P i ( d i =0) ˆ f d =0 ,t =0 ( x ); (ii) JTPA:Offered & Not Enrolled: the scaled density estimate P i ( t i =0 ,d i =1) P i ( d i =1) ˆ f d =1 ,t =0 ( x ). Point estimates are obtained byusing local polynomial regression with order 2, and robust confidence bands are obtained with local polynomial oforder 3. Bandwidths are chosen by minimizing integrated mean squared errors. All estimates are obtained usingcompanion R (and Stata ) package described in Cattaneo, Jansson, and Ma (2020b). (b) of Figure 4. The effect is significant, albeit not very large. We also plot earning distributionsfor individuals enrolled (and not) in JTPA in panel (c). Not surprisingly, the difference is muchlarger. Simple mean comparison implies that enrolling in JTPA is associated with $2 ,
083 moreincome.Unfortunately, neither panel (b) nor (c) reveals information on distribution of potential out-comes. To see the reason, note that in panel (b) earning distributions are estimated accordingto treatment assignment, but potential outcomes are defined according to treatment takeup. Andpanel (c) does not give potential outcome distributions since treatment takeup is not randomlyassigned. In panel (d) of Figure 4, we use weighting schemes w c, i and w c, i to construct potentialearning distributions for compliers, which estimates the identified distributional treatment effect inthis IV setting. Indeed, treatment effect on compliers is larger than the intent-to-treat effect, but is33 (a) Marginal Distributions (b) Marginal Distributions (c) Marginal Distributions (d) Potential Outcome Distributions Figure 4: Earning Distributions, JTPA.
Notes : (a) earning distributions in the full sample and for compliers; (b) earning distributions by JTPA offer; (c)earning distributions by JTPA enrollment; (d) distributions of potential outcomes for compliers. Point estimates areobtained by using local polynomial regression with order 2, and robust confidence bands are obtained with localpolynomial of order 3. Bandwidths are chosen by minimizing integrated mean squared errors. All estimates areobtained using companion R (and Stata ) package described in Cattaneo, Jansson, and Ma (2020b).
We introduced a new class of local regression distribution estimator, which can be used to constructdistribution, density, and higher-order derivatives estimators. We established valid large sampledistributional approximations, both pointwise and uniform over their support. Pointwise on theevaluation point, we characterized a minimum distance implementation based on redundant re-gressors leading to asymptotic efficiency improvements, and gave precise results in terms of (tight)lower bounds for interior points. Uniformly over the evaluation points, we obtained valid lineariza-tions and strong approximations, and constructed confidence bands. Finally, we discussed severalextensions of our work.Although beyond the scope of this paper, it would be useful to generalized our results to the caseof multivariate regressors x i ∈ R d . Boundary adaptation is substantially more difficult in multipledimensions, and hence our proposed methods are potentially very useful in such setting. In addition,multidimensional density estimation can be used to construct new conditional distribution, densityand higher derivative estimators in a straightforward way. These new estimators would be usefulin several areas of economics, including for instance estimation of auction models. References
Abadie, A. (2003): “Semiparametric Instrumental Variable Estimation of Treatment ResponseModels,”
Journal of Econometrics , 113(2), 231–263.
Abadie, A., J. Angrist, and
G. Imbens (2002): “Instrumental Variables Estimates of the Effectof Subsidized Training on the Quantiles of Trainee Earnings,”
Econometrica , 70(1), 91–117.
Abadie, A., and
M. D. Cattaneo (2018): “Econometric Methods for Program Evaluation,”
Annual Review of Economics , 10, 465–503. 35 elloni, A., V. Chernozhukov, D. Chetverikov, and
I. Fernandez-Val (2019): “Condi-tional Quantile Processes based on Series or Many Regressors,”
Journal of Econometrics , 213(1),4–29.
Belloni, A., V. Chernozhukov, D. Chetverikov, and
K. Kato (2015): “Some New Asymp-totic Theory for Least Squares Series: Pointwise and Uniform Results,”
Journal of Econometrics ,186(2), 345–366.
Calonico, S., M. D. Cattaneo, and
M. H. Farrell (2018): “On the Effect of Bias Esti-mation on Coverage Accuracy in Nonparametric Inference,”
Journal of the American StatisticalAssociation , 113(522), 767–779.(2020): “Coverage Error Optimal Confidence Intervals for Local Polynomial Regression,”arXiv:1808.01398.
Cattaneo, M. D., R. K. Crump, M. H. Farrell, and
Y. Feng (2020): “On Binscatter,”arXiv:1902.09608.
Cattaneo, M. D., M. H. Farrell, and
Y. Feng (2020): “Large Sample Properties ofPartitioning-Based Estimators,”
Annals of Statistics , 48(3), 1718–1741.
Cattaneo, M. D., M. Jansson, and
X. Ma (2018): “Manipulation Testing based on DensityDiscontinuity,”
Stata Journal , 18(1), 234–261.(2020a): “Simple Local Polynomial Density Estimators,”
Journal of the American Statis-tical Association , 115(531), 1449–1455.(2020b): “ lpdensity : Local Polynomial Density Estimation and Inference,”arXiv:1906.06529.
Cheng, G., and
Y.-C. Chen (2019): “Nonparametric Inference via Bootstrapping the DebiasedEstimator,”
Electronic Journal of Statistics , 13(1), 2194–2256.
Cheng, M.-Y., J. Fan, and
J. S. Marron (1997): “On Automatic Boundary Corrections,”
Annals of Statistics , 25(4), 1691–1708. 36 hernozhukov, V., D. Chetverikov, and
K. Kato (2014a): “Anti-Concentration and HonestAdaptive Confidence Bands,”
Annals of Statistics , 42(5), 1787–1818.(2014b): “Gaussian Approximation of Suprema of Empirical Processes,”
Annals of Statis-tics , 42(4), 1564–1597.
Chernozhukov, V., J. C. Escanciano, H. Ichimura, W. K. Newey, and
J. M. Robins (2020): “Locally Robust Semiparametric Estimation,” arXiv:1608.00033.
Chernozhukov, V., I. Fernandez-Val, and
B. Melly (2013): “Inference on CounterfactualDistributions,”
Econometrica , 81(6), 2205–2268.
DiNardo, J., N. M. Fortin, and
T. Lemieux (1996): “Labor Market Institutions and theDistribution of Wages, 1973-1992: A Semiparametric Approach,”
Econometrica , 64(5), 1001–1044.
Fan, J., and
I. Gijbels (1996):
Local Polynomial Modelling and Its Applications . Chapman &Hall/CRC, New York.
Gin´e, E., V. Koltchinskii, and
L. Sakhanenko (2004): “Kernel Density Estimators: Conver-gence in Distribution for Weighted Sup-Norms,”
Probability Theory and Related Fields , 130(2),167–198.
Gin´e, E., and
R. Nickl (2010): “Confidence Bands in Density Estimation,”
Annals of Statistics ,38(2), 1122–1170.
Granovsky, B. L., and
H.-G. M¨uller (1991): “Optimizing Kernel Methods: A Unifying Vari-ational Principle,”
International Statistical Review/Revue Internationale de Statistique , 59(3),373–388.
Hausman, J. A., and
W. K. Newey (1995): “Nonparametric Estimation of Exact ConsumersSurplus and Deadweight Loss,”
Econometrica , 63(6), 1445–1476.
Ichimura, H., and
W. K. Newey (2020): “The Influence Function of Semiparametric Estima-tors,” arXiv:1508.01378. 37 chimura, H., and
P. E. Todd (2007): “Implementing Nonparametric and SemiparametricEstimators,” in
Handbook of Econometrics, Volume 6B , ed. by J. J. Heckman, and
E. E. Leamer,pp. 5369–5468. Elsevier Science B.V., New York.
Imbens, G. W., and
D. B. Rubin (1997): “Estimating Outcome Distributions for Compliers inInstrumental Variables Models,”
Review of Economic Studies , 64(4), 555–574.(2015):
Causal Inference in Statistics, Social, and Biomedical Sciences . Cambridge Uni-versity Press, New York.
Karunamuni, R. J., and
S. Zhang (2008): “Some Improvements on a Boundary CorrectedKernel Density Estimator,”
Statistics & Probability Letters , 78(5), 499–507.
Kitagawa, T. (2015): “A Test for Instrument Validity,”
Econometrica , 83(5), 2043–2063.
Newey, W. K. (1994a): “The Asymptotic Variance of Semiparametric Estimators,”
Econometrica ,62(6), 1349–1382.(1994b): “Kernel Estimation of Partial Means and a General Variance Estimator,”
Econo-metric Theory , 10(2), 233–253.
Newey, W. K., F. Hsieh, and
J. M. Robins (2004): “Twicing Kernels and a Small Bias Propertyof Semiparametric Estimators,”
Econometrica , 72(3), 947–962.
Newey, W. K., and
D. L. McFadden (1994): “Large Sample Estimation and Hypothesis Test-ing,” in
Handbook of Econometrics, Volume 5 , ed. by R. F. Engle, and
D. L. McFadden, pp.2111–2245. Elsevier Science B.V., New York.
Newey, W. K., and
P. A. Ruud (2005): “Density Weighted Linear Least Squares,” in
Identi-fication and Inference in Econometric Models: Essays in Honor of Thomas Rothenberg , ed. byD. Andrews, and
J. Stock, pp. 554–573. Cambridge University Press, Cambridge.
Newey, W. K., and
T. M. Stoker (1993): “Efficiency of Weighted Average Derivative Estima-tors and Index Models,”
Econometrica , 61(5), 1199–1223.
Rio, E. (1994): “Local Invariance Principles and Their Application to Density Estimation,”
Prob-ability Theory and Related Fields , 98(1), 21–45.38 obins, J. M., F. Hsieh, and
W. K. Newey (1995): “Semiparametric Efficient Estimation of aConditional Density with Missing or Mismeasured Covariates,”
Journal of the Royal StatisticalSociety: Series B (Methodological) , 57(2), 409–424.
Zaitsev, A. Y. (2013): “The Accuracy of Strong Gaussian Approximation for Sums of IndependentRandom Vectors,”
Russian Mathematical Surveys , 68(4), 721–761.
Zhang, S., and
R. J. Karunamuni (1998): “On Kernel Density Estimation Near Endpoints,”
Journal of Statistical Planning and Inference , 70(1), 301–316.39 ocal Regression Distribution EstimatorsSupplemental Appendix
Matias D. Cattaneo ∗ Michael Jansson † Xinwei Ma ‡ October 1, 2020
Abstract
This Supplemental Appendix contains general theoretical results encompassing those discussedin the main paper, includes proofs of those general results, and discusses additional method-ological and technical results. ∗ Department of Operations Research and Financial Engineering, Princeton University. † Department of Economics, UC Berkeley and CREATES. ‡ Department of Economics, UC San Diego. a r X i v : . [ ec on . E M ] S e p ontents L Distribution Estimation ................................................................................. 52.2 Local Regression Distribution Estimation..................................................................... 73 Efficiency ............................................................................................................................... 93.1 Effect of Orthogonalization ........................................................................................... 103.2 Optimal Q ..................................................................................................................... 124 Uniform Distribution Theory ................................................................................................ 194.1 Local L Distribution Estimation ................................................................................. 224.2 Local Regression Distribution Estimation..................................................................... 245 Proofs .................................................................................................................................... 275.1 Additional Preliminary Lemmas ................................................................................... 275.2 Proof of Theorem 1 ....................................................................................................... 285.3 Proof of Theorem 2 ....................................................................................................... 295.4 Proof of Theorem 3 ....................................................................................................... 305.5 Proof of Theorem 4 ....................................................................................................... 315.6 Proof of Corollary 5 ...................................................................................................... 345.7 Proof of Corollary 6 ...................................................................................................... 345.8 Proof of Lemma 7 ........................................................................................................ 365.9 Proof of Theorem 8 ....................................................................................................... 365.10 Proof of Theorem 9 ....................................................................................................... 375.11 Proof of Lemma 10 ....................................................................................................... 395.12 Proof of Lemma 11 ....................................................................................................... 415.13 Proof of Lemma 12 ....................................................................................................... 415.14 Proof of Lemma 13 ....................................................................................................... 415.15 Proof of Theorem 14 ..................................................................................................... 
425.16 Proof of Theorem 15 ..................................................................................................... 425.17 Proof of Lemma 16 ....................................................................................................... 435.18 Proof of Lemma 17 ....................................................................................................... 445.19 Proof of Lemma 18 ....................................................................................................... 465.20 Proof of Lemma 19 ....................................................................................................... 475.21 Proof of Lemma 20 ....................................................................................................... 475.22 Proof of Lemma 21 ....................................................................................................... 485.23 Proof of Theorem 22 ..................................................................................................... 485.24 Proof of Theorem 23 ..................................................................................................... 482
Setup
Suppose x , x , · · · , x n is a random sample from a univariate distribution with cumulative dis-tribution function F ( · ). Also assume the distribution function admits a (sufficiently accurate)linear-in-parameters local approximation near an evaluation point x : % ( h, x ) := sup | x − x |≤ h (cid:12)(cid:12) F ( x ) − R ( x − x ) θ ( x ) (cid:12)(cid:12) is small for h small , where R ( · ) is a known basis function. The parameter θ ( x ) can be estimated by the following local L method:ˆ θ G = argmin θ Z X (cid:16) ˆ F ( u ) − R ( u − x ) θ (cid:17) h K (cid:18) u − x h (cid:19) d G ( u ) , ˆ F ( u ) = 1 n n X i =1 ( x i ≤ u ) , (1)where K ( · ) is a kernel function, X is the support of F ( · ), and G ( · ) is a known weighting function tobe specified later. The local projection estimator (1) is closely related to another estimator, whichis constructed by local regression:ˆ θ = argmin θ n X i =1 (cid:16) ˆ F ( x i ) − R ( x i − x ) θ (cid:17) h K (cid:18) x i − x h (cid:19) . (2)The local regression estimator can be equivalently expressed as ˆ θ ˆ F , meaning that it can be viewed asa special case of the local projection estimator, with G ( · ) in (1) replaced by the empirical measure(empirical distribution function) ˆ F ( · ).For future reference, we first discuss some of the notation we use in the main paper and thisSupplemental Appendix (SA). For a function g ( · ), we denote its j -th derivative as g ( j ) ( · ). Forsimplicity, we also use the “dot” notation to denote the first derivative: ˙ g ( · ) = g (1) ( · ). Assume g ( · ) is suitably smooth on [ x − δ, x ) ∪ ( x , x + δ ] for some δ >
0, but not necessarily continuous ordifferentiable at x . If g ( · ) and its one-sided derivatives can be continuously extended to x from thetwo sides, we adopt the following special notation: g ( j ) u = ( u < g ( j ) ( x − ) + ( u ≥ g ( j ) ( x +) . With j = 0, the above is simply g u = ( u < g ( x − ) + ( u ≥ g ( x +). Also for j = 1, we use˙ g u = g (1) u . Convergence in probability and in distribution are denoted by P → and , respectively,and limits are taken with respect to the sample size n going to infinity unless otherwise specified.We use | · | to denote the Euclidean norm. 3he following matrices will feature in asymptotic expansions of our estimators:Γ h, x = Z X− x h R ( u ) R ( u ) K ( u ) g ( x + hv )d u = Z X− x h R ( u ) R ( u ) K ( u ) g u d u + O ( h ) = Γ h, x + O ( h )andΣ h, x = Z Z X− x h R ( u ) R ( v ) h F ( x + h ( u ∧ v )) − F ( x + hu ) F ( x + hv ) i K ( u ) K ( v ) g ( x + hu ) g ( x + hv )d u d v = F ( x )(1 − F ( x )) Z X− x h R ( u ) K ( u ) g u d u ! Z X− x h R ( u ) K ( u ) g u d u ! + h Z Z X− x h R ( u ) R ( v ) K ( u ) K ( v ) h − F ( x )( uf u + vf v ) g u g v + F ( x )(1 − F ( x ))( u ˙ g u g v + v ˙ g v g u ) i d u d v + h Z Z X− x h R ( u ) R ( v ) K ( u ) K ( v )( u ∧ v ) f u ∧ v g u g v d u d v + O ( h )= Σ h, x + h Σ h, x + O ( h ) . Now we list the main assumptions.
Assumption 1. x , . . . , x n is a random sample from a distribution F ( · ) supported on X ⊆ R , and x ∈ X .(i) For some δ > , F ( · ) is absolutely continuous on [ x − δ, x + δ ] with a density f ( · ) admittingconstants f ( x − ) , f ( x +) , ˙ f ( x − ) , and ˙ f ( x +) , such that sup u ∈ [ − δ, f ( x + u ) − f ( x − ) − u ˙ f ( x − ) u + sup u ∈ (0 ,δ ] f ( x + u ) − f ( x +) − u ˙ f ( x +) u < ∞ . (ii) K ( · ) is nonnegative, symmetric, and continuous on its support [ − , , and integrates to 1.(iii) R ( · ) is locally bounded, and there exists a positive-definite diagonal matrix Υ h for each h > , such that Υ h R ( u ) = R ( u/h ) (iv) For all h sufficiently small, the minimum eigenvalues of Γ h, x and h − Σ h, x are boundedaway from zero. (cid:4) Assumption 2.
For some δ > , G ( · ) is absolutely continuous on [ x − δ, x + δ ] with a derivative g ( · ) ≥ admitting constants g ( x − ) , g ( x +) , ˙ g ( x − ) , and ˙ g ( x +) , such that sup u ∈ [ − δ, g ( x + u ) − g ( x − ) − u ˙ g ( x − ) u + sup u ∈ (0 ,δ ] g ( x + u ) − g ( x +) − u ˙ g ( x +) u < ∞ . (cid:4) Example 1 (Local Polynomial Estimator).
Before closing this section, we briefly introducethe local polynomial estimator of Cattaneo, Jansson, and Ma (2020), which is a special case of4ur local regression distribution estimator. The local polynomial estimator employs the followingpolynomial basis: R ( u ) = (cid:16) , u, u , · · · , p ! u p (cid:17) , for some p ∈ N . As a result, it estimates the distribution function, the density function, andderivatives thereof. To be precise, θ ( x ) = (cid:16) F ( x ) , f ( x ) , f (1) ( x ) , · · · , f ( p − ( x ) (cid:17) . With R ( · ) being a polynomial basis, it is straightforward to characterize the approximation bias % ( h, x ). Assuming the distribution function F ( · ) is at least p + 1 times continuously differentiable ina neighborhood of x , one can employ a Taylor expansion argument and show that % ( h, x ) = O ( h p +1 ).We will revisit this local polynomial estimator below as a leading example when we discuss pointwiseand uniform asymptotic properties of our local distribution estimator. (cid:4) We discuss pointwise (i.e., for a fixed evaluation point x ∈ X ) large-sample properties of the local L distribution estimator (1), and that of the local regression estimator (2). For ease of exposition,we suppress the dependence on the evaluation point x whenever possible. L Distribution Estimation
With simple algebra, the local L ( G ) estimator in (1) takes the following formˆ θ G = (cid:18)Z X R ( u − x ) R ( u − x ) h K (cid:18) u − x h (cid:19) d G ( u ) (cid:19) − (cid:18)Z X R ( u − x ) ˆ F ( u ) 1 h K (cid:18) u − x h (cid:19) d G ( u ) (cid:19) . We can further simplify the above. First note that the “denominator” can be rewritten as Z X R ( u − x ) R ( u − x ) h K (cid:18) u − x h (cid:19) d G ( u )= Υ − h (cid:18)Z X Υ h R ( u − x ) R ( u − x ) Υ h h K (cid:18) u − x h (cid:19) g ( u )d u (cid:19) Υ − h = Υ − h Γ h Υ − h . The same technique can be applied to the “numerator”, which leads toˆ θ G − θ = Υ h Γ − h Z X− x h R ( u ) ˆ F ( x + hu ) K ( u ) g ( x + hu )d u ! − θ =Υ h Γ − h Z X− x h R ( u ) h F ( x + hu ) − θ R ( u )Υ − h i K ( u ) g ( x + hu )d u (3)+ Υ h n n X i =1 Γ − h Z X− x h R ( u ) h ( x i ≤ x + hu ) − F ( x + hu ) i K ( u ) g ( x + hu )d u. (4)5he above provides further expansion of the local L estimator into a term that contributes as bias,and another term that contributes asymptotically to the variance.The large-sample properties of the local L distribution estimator (1) are as follows. Theorem 1 (Local L Distribution Estimation: Asymptotic Normality).
Assume As-sumptions 1 and 2 hold, and that h → , nh → ∞ and n% ( h ) /h → . Then(i) (3) satisfies (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)Z X− x h R ( u ) h F ( x + hu ) − θ R ( u )Υ − h i K ( u ) d u (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O ( % ( h )) . (ii) (4) satisfies V "Z X− x h R ( u ) h ( x i ≤ x + hu ) − F ( x + hu ) i K ( u ) g ( x + hu )d u = Σ h , and Σ − / h √ n n X i =1 Z X− x h R ( u ) h ( x i ≤ x + hu ) − F ( x + hu ) i K ( u ) g ( x + hu )d u ! N (0 , I ) . (iii) The local L distribution estimator is asymptotically normally distributed √ n (cid:0) Γ − h Σ h Γ − h (cid:1) − / Υ − h (ˆ θ G − θ ) N (0 , I ) . (cid:4) For valid inference, one needs to construct standard errors. To start, note that Γ h is known,and hence we only need to estimate Σ h . Consider the following:ˆΣ h = 1 n n X i =1 Z Z X− x h R ( u ) R ( v ) h ( x i ≤ x + hu ) − ˆ F ( x + hu ) ih ( x i ≤ x + hv ) − ˆ F ( x + hv ) i K ( u ) K ( v ) g ( x + hu ) g ( x + hv )d u d v, (5)where ˆ F ( · ) is the empirical distribution function. The following theorem shows that standard errorsconstructed using ˆΣ h are consistent. Theorem 2 (Local L Distribution Estimation: Standard Errors).
Assume Assumptions1 and 2 hold, and that h → and nh → ∞ . Let c be a nonzero vector of suitable dimension, then (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) c ˆΣ h cc Σ h c − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O P r nh ! . f, in addition that n% ( h ) /h → , then c (ˆ θ G − θ ) q c Υ h Γ − h ˆΣ h Γ − h Υ h c/n N (0 , . (cid:4) The local regression distribution estimator (2) can be understood as a special case of the local L based estimator, by setting G = ˆ F (i.e., using the empirical distribution as the design). However,the empirical measure ˆ F is not smooth, so that large-sample properties of the local regressionestimator cannot be deduced directly from Theorem 1. In this subsection, we will show thatestimates obtained by the two approaches, (1) and (2), are asymptotically equivalent under suitableregularity conditions. To be precise, we establish the equivalence of the local regression distributionestimator, ˆ θ , and the (infeasible) local L distribution estimator, ˆ θ F (i.e., using F as the designweighting in (1)). As before, we suppress the dependence on the evaluation point x .First, the local regression estimator can be written asˆ θ − θ = n n X i =1 R ( x i − x ) R ( x i − x ) h K (cid:18) x i − x h (cid:19)! − n n X i =1 R ( x i − x ) h ˆ F ( x i ) − R ( x i − x ) θ i h K (cid:18) x i − x h (cid:19)! = Υ h ˆΓ − h Γ h Γ − h n n X i =1 Υ h R ( x i − x ) h ˆ F ( x i ) − R ( x i − x ) θ i h K (cid:18) x i − x h (cid:19)! , where ˆΓ h = 1 n n X i =1 Υ h R ( x i − x ) R ( x i − x ) Υ h h K (cid:18) x i − x h (cid:19) , and Γ h is defined as before with G = F .To proceed, we further expand as follows1 n n X i =1 Υ h R ( x i − x ) h ˆ F ( x i ) − R ( x i − x ) θ i h K (cid:18) x i − x h (cid:19) = 1 n n X i,j =1 ,i = j Υ h R ( x j − x ) h ( x i ≤ x j ) − F ( x j ) i h K (cid:18) x j − x h (cid:19) + 1 n n X j =1 Υ h R ( x j − x ) h − F ( x j ) i h K (cid:18) x j − x h (cid:19) (6)+ 1 n n X j =1 Υ h R ( x j − x ) h F ( x j ) − R ( x j − x ) θ i h K (cid:18) x j − x h (cid:19) . (7)7he last two terms correspond to the leave-in bias and the approximation bias, respectively. Wefurther decompose the first term with conditional expectation:1 n n X i,j =1 ,i = j Υ h R ( x j − x ) h ( x i ≤ x j ) − F ( x j ) i h K (cid:18) x j − x h (cid:19) = 1 n n X i,j =1 ,i = j E (cid:20) Υ h R ( x j − x ) h ( x i ≤ x j ) − F ( x j ) i h K (cid:18) x j − x h (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) x i (cid:21) + 1 n n X i,j =1 ,i = j Υ h R ( x j − x ) h ( x i ≤ x j ) − F ( x j ) i h K (cid:18) x j − x h (cid:19) − E (cid:20) Υ h R ( x j − x ) h ( x i ≤ x j ) − F ( x j ) i h K (cid:18) x j − x h (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) x i (cid:21) = n − n n X i =1 Z X− x h R ( u ) h ( x i ≤ x + hu ) − F ( x + hu ) i K ( u ) f ( x + hu )d u (8)+ 1 n n X i,j =1 ,i = j Υ h R ( x j − x ) h ( x i ≤ x j ) − F ( x j ) i h K (cid:18) x j − x h (cid:19) − Z X− x h R ( u ) h ( x i ≤ x + hu ) − F ( x + hu ) i K ( u ) f ( x + hu )d u. (9)The following theorem studies the large-sample properties of each term in the above decom-position, and shows that the local regression distribution estimator is asymptotically equivalent tothe local L based estimator by setting G = F , and hence it is asymptotically normally distributed. Theorem 3 (Local Regression Distribution Estimation: Asymptotic Normality).
As-sume Assumption 1 holds, and that h → , nh → ∞ and n% ( h ) /h → . Then(i) ˆΓ h satisfies (cid:12)(cid:12)(cid:12) ˆΓ h − Γ h (cid:12)(cid:12)(cid:12) = O P r nh ! . (ii) (6) and (7) satisfy (6) = O P (cid:18) n (cid:19) , (7) = O P ( % ( h )) . (iii) (9) satisfies (9) = O P r n h ! . iv) The local regression distribution estimator (2) satisfies √ n (cid:0) Γ − h Σ h Γ − h (cid:1) − / Υ − h (ˆ θ − θ ) = √ n (cid:0) Γ − h Σ h Γ − h (cid:1) − / Υ − h (ˆ θ F − θ ) + o P (1) N (0 , I ) . (cid:4) We now discuss how to construct standard errors in the local regression framework. Note thatΓ h can be estimated by ˆΓ h , whose properties have already been studied in Theorem 3. For Σ h , wepropose the followingˆΣ h = 1 n n X i =1 n n X j =1 Υ h R ( x j − x ) h ( x i ≤ x j ) − ˆ F ( x j ) i h K (cid:18) x j − x h (cid:19) n n X j =1 Υ h R ( x j − x ) h ( x i ≤ x j ) − ˆ F ( x j ) i h K (cid:18) x j − x h (cid:19) . where ˆ F ( · ) is the empirical distribution function. The following theorem shows that standard errorsconstructed using ˆΣ h are consistent. Theorem 4 (Local Regression Distribution Estimation: Standard Errors).
Assume As-sumption 1 holds. In addition, assume h → and nh → ∞ . Let c be a nonzero vector of suitabledimension. Then (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) c ˆΓ − h ˆΣ h ˆΓ − h cc Γ − h Σ h Γ − h c − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O P r nh ! . If, in addition that n% ( h ) /h → , one has c (ˆ θ − θ ) q c Υ h ˆΓ − h ˆΣ h ˆΓ − h Υ h c/n N (0 , . (cid:4) We focus on the local L distribution estimator ˆ θ G , but all the results in this section are applicableto the local regression distribution estimator ˆ θ , as we showed earlier that it is asymptoticallyequivalent to ˆ θ F . In addition, we consider a specific basis: R ( u ) = (cid:0) , P ( u ) , Q ( u ) (cid:1) , (10)where P ( u ) is a polynomial basis of order p : P ( u ) = (cid:18) u, u , · · · , p ! u p (cid:19) , Q ( u ) is a scalar function, and hence is a “redundant regressor.” Without Q ( · ), the abovereduces to the local polynomial estimator of Cattaneo, Jansson, and Ma (2020). See Section 1 andExample 1 for an introduction.We consider additional regressors because they may help improve efficiency (i.e., reduce theasymptotic variance). Following Assumption 1, we assume there exists a scalar υ h (depending on h )such that υ h Q ( u ) = Q ( u/h ). Therefore, Υ h is a diagonal matrix containing 1 , h − , h − , · · · , h − p , υ h .As we consider a (local) polynomial basis, it is natural to impose smoothness assumptions on F ( · ).In particular, Assumption 3 (Smoothness).
For some δ > , F ( · ) is ( p + 1) -times continuously differentiablein X ∩ [ x − δ, x + δ ] for some p ≥ , and G ( · ) is twice continuously differentiable in X ∩ [ x − δ, x + δ ] . (cid:4) Under the above assumption, the approximation error satisfies % ( h ) = O ( h p +1 ), and the pa-rameter θ can be partitioned into the following: θ = (cid:16) θ , θ P , θ Q (cid:17) = (cid:16) F ( x ) , f ( x ) , · · · , f ( p − ( x ) , (cid:17) . We first state a corollary, which specializes Theorem 1 to the polynomial basis (10). That is,we study the (infeasible) estimatorˆ θ F = argmin θ Z X (cid:16) ˆ F ( u ) − R ( u − x ) θ (cid:17) h K (cid:18) u − x h (cid:19) d F ( u ) . (11) Corollary 5 (Local Polynomial L Distribution Estimation: Asymptotic Normality).
Assume Assumptions 1, 2 and 3 hold, and that h → , nh → ∞ , and n% ( h ) /h → . Then thelocal polynomial L distribution estimator in (11) satisfies √ n (cid:0) Γ − h Σ h Γ − h (cid:1) − / Υ − h (ˆ θ F − θ ) N (0 , I ) . (cid:4) For ease of presentation, consider the following (sequentially) orthogonalized basis: R ⊥ ( u ) = (cid:16) , P ⊥ ( u ) , Q ⊥ ( u ) (cid:17) , (12)where P ⊥ ( u ) = P ⊥ ( u ) − Z X− x h K ( u ) P ( u )d u,Q ⊥ ( u ) = Q ( u ) − (cid:16) , P ( v ) (cid:17) Z X− x h K ( v ) (cid:16) , P ( v ) (cid:17) (cid:16) , P ( v ) (cid:17) d v ! − Z X− x h K ( v ) (cid:16) , P ( v ) (cid:17) Q ( v )d v ! . R ⊥ ( u ) = Λ h R ( u ) , where Λ h is a nonsingular upper triangular matrix. (Note that the matrix Λ h depends on thebandwidth only because we would like to handle both interior and boundary evaluation points.If, for example, we fix the evaluation point to be in the interior of the support of the data, thenΛ h is a fixed matrix and no longer depends on h . Alternatively, one could also use the notation“Λ x ” to denote such dependence.) Now consider the following orthogonalized local polynomial L distribution estimatorˆ θ ⊥ F = argmin θ Z X (cid:16) ˆ F ( u ) − Λ h R ( u − x ) θ (cid:17) h K (cid:18) u − x h (cid:19) d F ( u ) . (13)To discuss its properties, we partition the estimator and the target parameter asˆ θ ⊥ F = (cid:16) ˆ θ ⊥ ,F , (ˆ θ ⊥ P,F ) , ˆ θ ⊥ Q,F (cid:17) , where ˆ θ ⊥ ,F is the first element of ˆ θ ⊥ F and ˆ θ ⊥ Q,F is the last element of ˆ θ ⊥ F . Similarly, we can partitionthe target parameter, θ ⊥ = Λ − h θ = (cid:16) θ ⊥ , ( θ ⊥ P ) , θ ⊥ Q (cid:17) , so that θ ⊥ is the first element of Λ − h θ and θ ⊥ Q is the last element of Λ − h θ . As θ Q = 0, simple leastsquares algebra implies θ ⊥ = (cid:16) θ ⊥ , θ P , (cid:17) = (cid:16) θ ⊥ , f ( x ) , f (1) ( x ) , · · · , f ( p − ( x ) , (cid:17) . Note that, in general, θ ⊥ = θ , meaning that after orthogonalization, the intercept of the localpolynomial estimator no longer estimates the distribution function F ( x ).The following corollary gives the large-sample properties of the orthogonalized local polynomialestimator, excluding the intercept. Corollary 6 (Orthogonalized Local Polynomial L Distribution Estimation: AsymptoticNormality).
Assume Assumptions 1, 2 and 3 hold, and that h → , nh → ∞ , and n% ( h ) /h → .Then the orthogonalized local polynomial L distribution estimator in (13) satisfies " (Γ ⊥ P,h ) − Σ ⊥ P P,h (Γ ⊥ P,h ) − (Γ ⊥ P,h ) − Σ ⊥ P Q,h (Γ ⊥ Q,h ) − (Γ ⊥ Q,h ) − Σ ⊥ QP,h (Γ ⊥ P,h ) − (Γ ⊥ Q,h ) − Σ ⊥ QQ,h (Γ ⊥ Q,h ) − − / r nhf ( x ) Υ − − ,h " ˆ θ ⊥ P,F − θ P ˆ θ ⊥ Q,F N (0 , I ) , here Γ ⊥ P,h = Z X− x h P ⊥ ( u ) P ⊥ ( u ) K ( u )d u, Γ ⊥ Q,h = Z X− x h Q ⊥ ( u ) K ( u )d u, Σ ⊥ P P,h = Z Z X− x h K ( u ) K ( v ) P ⊥ ( u ) P ⊥ ( v ) ( u ∧ v )d u d v, Σ ⊥ QQ,h = Z Z X− x h K ( u ) K ( v ) Q ⊥ ( u ) Q ⊥ ( v )( u ∧ v )d u d v, Σ ⊥ P Q,h = (Σ ⊥ QP,h ) = Z Z X− x h K ( u ) K ( v ) P ⊥ ( u ) Q ⊥ ( v )( u ∧ v )d u d v, and Υ − ,h is a diagonal matrix containing h − , h − , · · · , h − p , υ h . (cid:4) Q Now we discuss the optimal choice of Q , which minimizes the asymptotic variance of the minimumdistance estimator. Recall from the main paper that, with orthogonalized basis, the minimumdistance estimator of f ( ‘ ) ( x ), for 0 ≤ ‘ ≤ p −
1, has an asymptotic variance f ( x ) h e ‘ (Γ ⊥ P,h ) − Σ ⊥ P P,h (Γ ⊥ P,h ) − e ‘ − e ‘ (Γ ⊥ P,h ) − Σ ⊥ P Q,h (Σ ⊥ QQ,h ) − Σ ⊥ QP,h (Γ ⊥ P,h ) − e ‘ i , where e ‘ is the ( ‘ + 1)-th standard basis vector. In subsequent analysis, we drop the multiplicativefactor f ( x ).Let p ‘ ( u ) be defined as p ‘ ( u ) = e ‘ (Γ ⊥ P,h ) − P ⊥ ( u ) , then the objective is to maximize Z Z X− x h K ( u ) K ( v ) Q ⊥ ( u ) Q ⊥ ( v )( u ∧ v )d u d v ! − Z Z X− x h K ( u ) K ( v ) p ‘ ( u ) Q ⊥ ( v )( u ∧ v )d u d v ! . Alternatively, we would like to solve (recall that Q ( u ) is a scaler function)maximize (cid:16)RR X− x h K ( u ) K ( v ) p ‘ ( u ) q ( v )( u ∧ v )d u d v (cid:17) RR X− x h K ( u ) K ( v ) q ( u ) q ( v )( u ∧ v )d u d v , subject to Z X− x h K ( u ) q ( u )(1 , P ( u ) )d u = 0 . To proceed, define the following transformation for a function g ( · ): H ( g )( u ) = Z X− x h ( v ≥ u ) K ( v ) g ( v )d v. This transformation satisfies two important properties, which are summarized in the following12emma.
Lemma 7 ( H ( · ) Transformation). (i) If g ( · ) and g ( · ) are bounded, and that either R X− x h K ( u ) g ( u )d u or R X− x h K ( u ) g ( u )d u iszero, then Z X− x h ∩ [ − , H ( g )( u ) H ( g )( u )d u = Z Z X− x h K ( u ) K ( v ) g ( u ) g ( v )( u ∧ v )d u d v. (ii) If g ( · ) and g ( · ) are bounded, g ( · ) is continuously differentiable with a bounded derivative,and that either R X− x h K ( u ) g ( u )d u or R X− x h K ( u ) g ( u )d u is zero, then Z X− x h ∩ [ − , H ( g )( u ) ˙ g ( u )d u = Z X− x h K ( u ) g ( u ) g ( u )d u. (cid:4) With the previous lemma, we can rewrite the maximization problem asmaximize (cid:16)R X− x h ∩ [ − , H ( p ‘ )( u ) H ( q )( u )d u (cid:17) R X− x h ∩ [ − , H ( q )( u ) d u (14)subject to Z X− x h ∩ [ − , ˙ P ( u ) H ( q )( u )d u = 0 , H ( q ) (cid:18) inf X − x h ∨ ( − (cid:19) = 0 . (15) Theorem 8 (Variance Bound of the Minimum Distance Estimator).
An upper bound of the maximization problem (14) and (15) is e ‘ (Γ ⊥ P,h ) − Σ ⊥ P P,h (Γ ⊥ P,h ) − e ‘ − e ‘ Z X− x h ∩ [ − , ˙ P ( u ) ˙ P ( u ) d u ! − e ‘ . Therefore, the asymptotic variance of the minimum distance estimator is bounded below by f ( x ) e ‘ Z X− x h ∩ [ − , ˙ P ( u ) ˙ P ( u ) d u ! − e ‘ , where ˙ P ( u ) = (1 , u, u / , u / , · · · , u p − / ( p − . (cid:4) Example 2 (Local Linear/Quadratic Minimum Distance Density Estimation).
Considera simple example where ‘ = 0 and P ( u ) = u , which means we focus on the asymptotic varianceof the estimated density in a local linear regression. Also assume we employ a uniform kernel: K ( u ) = ( | u | ≤ X − x h = R (i.e., x is an interior evaluationpoint). Note that this example also applies to local quadratic regressions, as u and u are orthogonalfor interior evaluation points. 13 − − . − . − . . . . Figure 1. Equivalent Kernel of the Local Linear Minimum Distance Density Estimator.
Notes : The basis function R ( u ) consists of an intercept, a linear term u (i.e., local linear regression), and an oddhigher-order polynomial term u j +1 for j = 1 , , · · · ,
30. Without the higher-order polynomial regressor, the locallinear density estimator using the uniform kernel is equivalent to the kernel density estimator using theEpanechnikov kernel (black line). Including a higher-order redundant regressor leads to an equivalent kernel thatapproaches the uniform kernel as j tends to infinity (red). Taking P ( u ) = u , the variance bound in Theorem 8 is easily found to be f ( x ) (cid:18)Z − ˙ P ( u ) ˙ P ( u ) d u (cid:19) − = f ( x ) 12 . We now calculate the asymptotic variance of the minimum distance estimator. To be specific,we choose Q ( u ) = u j +1 , which is a higher-order polynomial function. With tedious calculation,one can show that the minimum distance estimator has the following asymptotic varianceAsy V [ ˆ f MD ( x )] = f ( x ) 11 + 4 j
20 + 8 j , which asymptotes to f ( x ) / j → ∞ . As a result, it is possible to achieve the maximum amount ofefficiency gain by including one higher-order polynomial and using our minimum distance estimator.In Figure 1, we plot the equivalent kernel of the local linear minimum distance density estimatorusing a uniform kernel. Without the redundant regressor, it is equivalent to the kernel densityestimator using the Epanechnikov kernel. As j gets larger, however, the equivalent kernel of theminimum distance estimator becomes closer to the uniform kernel, which is why, as j → ∞ , theminimum distance estimator has an asymptotic variance the same as the kernel density estimatorusing the uniform kernel. (cid:4) xample 3 (Local Cubic Minimum Distance Estimation). We adopt the same setting inExample 2, i.e., local polynomial density estimation with the uniform kernel at an interior evaluationpoint. The difference is that we now consider a local cubic regression: P ( u ) = ( u, u , u ) .As before, the variance bound in Theorem 8 is easily found to be f ( x ) (cid:18)Z − ˙ P ( u ) ˙ P ( u ) d u (cid:19) − = f ( x ) − − . Again, we compute the asymptotic variance of our minimum distance estimator. Note, however,that now we have both odd and even order polynomials in our basis P ( u ), therefore we include twohigher-order polynomials. That is, we set Q ( u ) = ( u j , u j +1 ) . The asymptotic variance of ourminimum distance estimator isAsy V ˆ f MD ( x )ˆ f (1) MD ( x )ˆ f (2) MD ( x ) = f ( x ) j +15)16(2 j +7) − j +17)8(2 j +7) j +398 j +20 − j +17)8(2 j +7) j +19)8 j +28 , which, again, asymptotes to the variance bound as j → ∞ . See also Table 1 for the efficiency gainof employing the minimum distance technique. (cid:4) Example 4 (Local p = 5 Minimum Distance Estimation).
We consider the same setting inExample 2 and 3, but with p = 5: P ( u ) = ( u, u , · · · , u ) .The variance bound in Theorem 8 is f ( x ) (cid:18)Z − ˙ P ( u ) ˙ P ( u ) d u (cid:19) − = f ( x ) − − − − − − . Again, we include two higher order polynomials: Q ( u ) = ( u j , u j +1 ) . The asymptotic varianceof our minimum distance estimator isAsy V ˆ f MD ( x )ˆ f (1) MD ( x )ˆ f (2) MD ( x )ˆ f (3) MD ( x )ˆ f (4) MD ( x ) = f ( x ) j +19)256(2 j +9) − j +21)64(2 j +9) j +23)32(2 j +9) j +17)16(2 j +7) − j +19)8(2 j +7) − j +21)64(2 j +9) j +23)16(2 j +9) − j +25)8(2 j +9) − j +19)8(2 j +7) j +21)8 j +28 j +23)32(2 j +9) − j +25)8(2 j +9) j +27)8 j +36 , which converges to the variance bound as j → ∞ . See also Table 1 for the efficiency gain ofemploying the minimum distance technique. (cid:4) (a) Density f ( x ) p = 1 p = 2 p = 3 p = 4 Kernel Function
Uniform 0 .
600 0 .
600 1 .
250 1 . .
743 0 .
743 1 .
452 1 . .
714 0 .
714 1 .
407 1 . MD Variance Bound .
500 0 .
500 1 .
125 1 . (b) Density Derivative f (1) ( x ) p = 2 p = 3 p = 4 p = 5 Kernel Function
Uniform 2 .
143 2 .
143 11 .
932 11 . .
498 3 .
498 17 .
353 17 . .
182 3 .
182 15 .
970 15 . MD Variance Bound .
500 1 .
500 9 .
375 9 . Notes : Panel (a) compares asymptotic variance of the local polynomial density estimator of Cattaneo, Jansson, andMa (2020) for different polynomial orders ( p = 1, 2, 3, and 4) and different kernel functions (uniform, triangular andEpanechnikov). Also shown are the variance bound of the minimum distance estimator (MD Variance Bound),calculated according to Theorem 8. Panel(b) provides the same information for the estimated density derivative.All comparisons assume an interior evaluation point x . Before closing this section, we make several remarks on the variance bound derived in Theorem8, as well as to what extent it is achievable.
Remark 1 (Achievability of the Variance Bound).
The previous two examples suggest thatthe variance bound derived in Theorem 8 can be achieved by employing a minimum distanceestimator with two additional regressors, one higher-order even polynomial and one higher-orderodd polynomial. With analytic calculation, we verify that this is indeed the case for p ≤
10 whena uniform kernel is used. (cid:4)
Remark 2 (Optimality of the Variance Bound).
Granovsky and M¨uller (1991) discuss theproblem of finding the optimal kernel function for kernel-type estimators. To be precise, considerthe following 1 nh ‘ +1 n X i =1 φ ‘,k (cid:18) x i − x h (cid:19) , φ ‘,k ( u ) is a function satisfying Z − u j φ ‘,k ( u )d u = ≤ j < k, j = ‘‘ ! j = ‘ , Z − u k φ ‘,k ( u )d u = 0 . Then it is easy to see that, with a Taylor expansion argument, E " nh ‘ +1 n X i =1 φ ‘,k (cid:18) x i − x h (cid:19) = 1 h ‘ +1 Z − φ ‘,k (cid:18) u − x h (cid:19) f ( u )d u = 1 h ‘ Z − φ ‘,k ( u ) f ( x + hu )d u = 1 h ‘ Z − φ ‘,k ( u ) k − X j =0 ( hu ) j j ! f ( j ) ( x ) + u k O ( h k ) d u = f ( ‘ ) ( x ) + O ( h k − ‘ ) . That is, the kernel φ ‘,k ( u ) facilitates estimating the ‘ -th derivative of the density function with aleading bias of order h k − ‘ . Asymptotic variance of this kernel-type estimator is easily found to beAsy V " nh ‘ +1 n X i =1 φ ‘,k (cid:18) x i − x h (cid:19) = f ( x ) Z − φ ‘,k ( u ) d u. Granovsky and M¨uller (1991) provide the exact form of the kernel function φ ‘,k ( u ) that minimizesthe asymptotic variance subject to the order of the leading bias.Take ‘ = 0 and k = 2, φ ‘,k ( u ) takes the following form: φ ‘,k ( u ) = 12 ( | u | ≤ , which is the uniform kernel and minimizes variance among all second order kernels for densityestimation. As illustrated in Example 2, our variance bound matches f ( x ) R − φ ‘,k ( u ) d u .Now take ‘ = 1 and k = 3. This will give an estimator for the density derivative f (1) ( x ) witha leading bias of order O ( h ). The optimal choice of φ ‘,k ( u ) is φ ‘,k ( u ) = 32 u ( | u | ≤ . to match the order of bias, we consider the minimum distance estimator with p = 3. Again, thevariance bound in Theorem 8 matches f ( x ) R − φ ‘,k ( u ) d u .As a final illustration, take ‘ = 1 and k = 5, which gives an estimator for the density derivative f (1) ( x ) with a leading bias of order O ( h ). The optimal choice of φ ‘,k ( u ) is φ ‘,k ( u ) = (cid:18) u − u (cid:19) ( | u | ≤ .
17t is easy to see that f ( x ) R − φ ‘,k ( u ) d u = 75 f ( x ) /
8. To match the bias order, we take p =5 for our minimum distance estimator. The variance bound is 75 f ( x ) /
8, which is the same as f ( x ) R − φ ‘,k ( u ) d u .With analytic calculations, we verify that the variance bound stated in Theorem 8 is the same asthe minimum variance found in Granovsky and M¨uller (1991). Together with the previous remark,we reach a much stronger conclusion: including two higher-order polynomials in our minimumdistance estimator can help achieve the variance bound in Theorem 8, which, in turn, is the smallestvariance any kernel-type estimator can achieve (given a specific leading bias order). (cid:4) Remark 3 (Another Density Estimator Which Achieves the Variance Bound).
Thefollowing estimator achieves the bound of Theorem 8, although it does not belong to the class ofestimators we consider in this paper.ˆ θ ND = (cid:18)Z X ˙ P ( u − x ) ˙ P ( u − x ) h K (cid:18) u − x h (cid:19) d u (cid:19) − n n X i =1 ˙ P ( x i − x ) 1 h K (cid:18) x i − x h (cid:19)! , where ˙ P ( u ) = (1 , u, u / , · · · , u p − / ( p − is the ( p − θ ND = (cid:18)Z X ˙ P ( u − x ) ˙ P ( u − x ) h K (cid:18) u − x h (cid:19) d u (cid:19) − Z X ˙ P ( u − x ) 1 h K (cid:18) u − x h (cid:19) d ˆ F ( u )d u d u ! = argmin θ Z X d ˆ F ( u )d u − ˙ P ( u − x ) θ ! h K (cid:18) u − x h (cid:19) d u, where the derivative d ˆ F ( u ) / d u is interpreted in the sense of generalized functions. From the above,it is clear that this estimator requires the knowledge of the boundary position (that is, the knowledgeof X ).With straightforward calculations, this estimator has a leading bias E [ˆ θ ND ] = (cid:18)Z X ˙ P ( u − x ) ˙ P ( u − x ) h K (cid:18) u − x h (cid:19) d u (cid:19) − E (cid:20) ˙ P ( x i − x ) 1 h K (cid:18) x i − x h (cid:19)(cid:21) = θ + h p Υ h f ( p ) ( x ) Z X− x h ˙ P ( u ) ˙ P ( u ) K ( u ) d u ! − Z X− x h ˙ P ( u ) u p K ( u ) d u + o ( h p Υ h ) , where Υ h is a diagonal matrix containing 1, h − , · · · , h − ( p − . Its leading variance is also easy to18stablish: V [ˆ θ ND ] = 1 nh Υ h f ( x ) Z X− x h ˙ P ( u ) ˙ P ( u ) K ( u ) d u ! − Z X− x h ˙ P ( u ) ˙ P ( u ) K ( u ) d u ! Z X− x h ˙ P ( u ) ˙ P ( u ) K ( u ) d u ! − Υ h + o (cid:18) nh Υ h (cid:19) . To reach the efficiency bound in Theorem 8, it suffices to set K ( · ) to be the uniform kernel. Loader(2006), Section 5.1.1 also discussed this estimator, although it seems its efficiency property has notbeen realized in the literature. (cid:4) We establish distribution approximation for { ˆ θ G ( x ) , x ∈ I} and { ˆ θ ( x ) , x ∈ I} , which can be viewedas processes indexed by the evaluation point x in some set I ⊆ X . Recall the definition of Γ h, x andΣ h, x from Section 1, and we define Ω h, x = Γ − h, x Σ h, x Γ − h, x .We first study the following (infeasible) centered and Studentized process: T G ( x ) = 1 √ n n X i =1 c h, x Υ h Γ − h, x R X− x h R ( u ) h ( x i ≤ x + hu ) − F ( x + hu ) i K ( u ) g ( x + hu )d u q c h, x Υ h Ω h, x Υ h c h, x , x ∈ I , (16)where we consider linear combinations through a (known) vector c h, x , which can depend on thesample size through the bandwidth h , and can depend on the evaluation point. Again, we use thesubscript G to denote the local L approach with G being the design distribution. To economizenotation, let K h, x ( x ) = c h, x Υ h Γ − h, x R X− x h R ( u ) h ( x ≤ x + hu ) − F ( x + hu ) i K ( u ) g ( x + hu )d u q c h, x Υ h Ω h, x Υ h c h, x , then we can rewrite (16) as T G ( x ) = 1 √ n n X i =1 K h, x ( x i ) , and hence the centered and Studentized process T G ( · ) takes a kernel form. The difference com-pared to standard kernel density estimators, however, is that the (equivalent) kernel in our casechanges with the evaluation point, which is why our estimator is able to adapt to boundary points19utomatically. From the pointwise distribution theory developed in Section 2, the process T G ( x )has variance V [ T G ( x )] = E (cid:2) K h, x ( x i ) (cid:3) = 1 . 
We can also compute the covariance:
$$\mathrm{Cov}[T_G(x),T_G(y)] = \mathbb E[\mathcal K_{h,x}(x_i)\mathcal K_{h,y}(x_i)] = \frac{c_{h,x}'\Upsilon_h\Omega_{h,x,y}\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}} + O(h),$$
where $\Omega_{h,x,y} = \Gamma_{h,x}^{-1}\Sigma_{h,x,y}\Gamma_{h,y}^{-1}$, and
$$\Sigma_{h,x,y} = \int_{\frac{\mathcal X-y}h}\int_{\frac{\mathcal X-x}h}R(u)R(v)'\big[F((x+hu)\wedge(y+hv)) - F(x+hu)F(y+hv)\big]K(u)K(v)g(x+hu)g(y+hv)\,\mathrm du\,\mathrm dv.$$
Of course we can further expand the above, but this is unnecessary for our purpose. For future reference, let
$$r_1(\varepsilon,h) = \sup_{x,y\in\mathcal I,\,|x-y|\le\varepsilon}\big|c_{h,x}'\Upsilon_h - c_{h,y}'\Upsilon_h\big|,\qquad r_2(h) = \sup_{x\in\mathcal I}\big|c_{h,x}'\Upsilon_h\big|.$$

Remark 4 (On the Orders of $r_1(\varepsilon,h)$, $r_2(h)$ and $\sup_{x\in\mathcal I}\varrho(h,x)$). In general, it is not possible to give precise orders of the quantities introduced above. In this remark, we consider the local polynomial estimator of Cattaneo, Jansson, and Ma (2020) (see Section 1 for an introduction). The local polynomial estimator employs a polynomial basis, and hence estimates the density function and higher-order derivatives by (it also estimates the distribution function)
$$\hat F^{(\ell)}(x) = e_\ell'\hat\theta(x),$$
where $e_\ell$ is the $(\ell+1)$-th standard basis vector. As a result, $c_{h,x} = e_\ell$, which does not depend on the evaluation point. For the scaling matrix $\Upsilon_h$, we note that it is diagonal with elements $1, h^{-1}, \cdots, h^{-p}$, and hence it does not depend on the evaluation point either. Therefore, we conclude that, for density (and higher-order derivative) estimation using the local polynomial estimator, $r_1(\varepsilon,h)$ is identically zero. Similarly, we have $r_2(h) = h^{-\ell}$. Finally, given the discussion in Section 1, the bias term generally has order $\sup_{x\in\mathcal I}\varrho(h,x) = h^{p+1}$ for the local polynomial density estimator.

The above discussion is restricted to the local polynomial density estimator, but more can be said about $r_2(h)$. We will argue that, in general, one should expect $r_2(h)^{-1} = O(1)$. Recall that the leading variance of $c_{h,x}'\hat\theta(x)$ and $c_{h,x}'\hat\theta_G(x)$ is $\frac1nc_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}$, and that the maximum eigenvalue of $\Omega_{h,x}$ is bounded. Therefore, the variance has order $O(r_2(h)^2/n)$. In general, we do not expect the variance to shrink faster than $1/n$, which is why $r_2(h)^{-1}$ is usually bounded. In fact, for most interesting cases, $c_{h,x}'\hat\theta(x)$ and $c_{h,x}'\hat\theta_G(x)$ will be "nonparametric" estimators in the sense that they estimate local features of the distribution function. If this is the case, we may even argue that $r_2(h)^{-1}$ will be vanishing as the bandwidth shrinks. ∎

We also make some additional assumptions.
Assumption 4.
Let $\mathcal I$ be a compact interval.

(i) The density function $f$ is twice continuously differentiable and bounded away from zero on $\mathcal I$.

(ii) There exist some $\delta > 0$ and compactly supported kernel functions $K^\dagger(\cdot)$ and $\{K^{\ddagger,d}(\cdot)\}_{d\le\delta}$ such that (ii.1) $\sup_{u\in\mathbb R}|K^\dagger(u)| < \infty$ and $\sup_{d\le\delta,\,u\in\mathbb R}|K^{\ddagger,d}(u)| < \infty$; (ii.2) the support of $K^{\ddagger,d}(\cdot)$ has Lebesgue measure bounded by $Cd$, where $C$ is independent of $d$; and (ii.3) for all $u$ and $v$ such that $|u-v|\le\delta$,
$$|K(u)-K(v)| \le |u-v|\cdot K^\dagger(u) + K^{\ddagger,|u-v|}(u).$$

(iii) The basis function $R(\cdot)$ is Lipschitz continuous on $[-1,1]$.

(iv) For all $h$ sufficiently small, the minimum eigenvalues of $\Gamma_{h,x}$ and $h^{-1}\Sigma_{h,x}$ are bounded away from zero uniformly for $x\in\mathcal I$.

(v) $h\to0$ and $nh/\log n\to\infty$ as $n\to\infty$.

(vi) For some $C_1 > 0$ and $C_2, C_3 \ge 0$,
$$r_1(\varepsilon,h) = O\big(\varepsilon^{C_1}h^{-C_2}\big),\qquad r_2(h) = O\big(h^{-C_3}\big).$$
In addition,
$$\frac{\sup_{x\in\mathcal I}|c_{h,x}'\Upsilon_h|}{\inf_{x\in\mathcal I}|c_{h,x}'\Upsilon_h|} = O(1). \quad ∎$$

Assumption 5.
The design density function $g(\cdot)$ is twice continuously differentiable and bounded away from zero on $\mathcal I$. ∎

For any $h > 0$ (possibly depending on $n$), we can define a centered Gaussian process $\{B_G(x): x\in\mathcal I\}$ which has the same variance-covariance structure as the process $T_G(\cdot)$. The following theorem shows that it is possible to construct such a process, and that $T_G(\cdot)$ and $B_G(\cdot)$ are "close in distribution."

Theorem 9 (Strong Approximation).
Assume Assumptions 1, 2, 4 and 5 hold. Then on a possibly enlarged probability space there exist two processes, $\{\tilde T_G(x): x\in\mathcal I\}$ and $\{B_G(x): x\in\mathcal I\}$, such that (i) $\tilde T_G(\cdot)$ has the same distribution as $T_G(\cdot)$; (ii) $B_G(\cdot)$ is a Gaussian process with the same covariance structure as $T_G(\cdot)$; and (iii)
$$\mathbb P\left[\sup_{x\in\mathcal I}\big|\tilde T_G(x)-B_G(x)\big| > \frac{C_1(u+C_2\log n)}{\sqrt{nh}}\right] \le C_3e^{-C_4u},$$
where $C_1$, $C_2$, $C_3$ and $C_4$ are constants that do not depend on $h$ or $n$. ∎

Next we characterize the continuity of $T_G(\cdot)$, which will help control the complexity of the Gaussian process $B_G(\cdot)$. To be precise, define the pseudo-metric
$$\sigma_G(x,y) = \sqrt{\mathbb V[T_G(x)-T_G(y)]} = \sqrt{\mathbb E\big[(\mathcal K_{h,x}(x_i)-\mathcal K_{h,y}(x_i))^2\big]};$$
we would like to provide an upper bound on $\sigma_G(x,y)$ in terms of $|x-y|$ (at least for all $x$ and $y$ such that $|x-y|$ is small enough).

Lemma 10 (VC-type Property).
Assume Assumptions 1, 2, 4 and 5 hold. Then for all $x, y\in\mathcal I$ such that $|x-y| = \varepsilon \le h$,
$$\sigma_G(x,y) = O\left(\frac1{\sqrt h}\frac\varepsilon h + \frac1{\sqrt h}\frac{r_1(\varepsilon,h)}{r_2(h)} + \frac1h\frac{r_1(\varepsilon,h)^2}{r_2(h)^2}\right).$$
Therefore,
$$\mathbb E\left[\sup_{x\in\mathcal I}|B_G(x)|\right] = O\big(\sqrt{\log n}\big),\qquad\text{and}\qquad \mathbb E\left[\sup_{x\in\mathcal I}|T_G(x)|\right] = O\big(\sqrt{\log n}\big). \quad ∎$$
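The $\sqrt{\log n}$ envelope in Lemma 10 comes from a standard Gaussian maximal inequality. As a quick sanity check (entirely our own toy computation, not from the paper), the expected maximum of $M$ standard normal variables grows like $\sqrt{2\log M}$, which is the same entropy mechanism that drives the bound on $\mathbb E[\sup_{x\in\mathcal I}|B_G(x)|]$:

```python
import numpy as np

# Toy check (ours): E[max of M standard normals] tracks sqrt(2 log M),
# mirroring the sqrt(log n) envelope for the supremum of B_G over a grid.
rng = np.random.default_rng(1)
for M in [10, 100, 1000, 10000]:
    sims = rng.standard_normal((500, M)).max(axis=1)   # 500 Monte Carlo draws
    print(M, round(sims.mean(), 3), round(np.sqrt(2 * np.log(M)), 3))
```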
Local $L_2$ Distribution Estimation

We first discuss the covariance estimator. For the local $L_2$ distribution estimator, let $\hat\Omega_{h,x,y} = \Gamma_{h,x}^{-1}\hat\Sigma_{h,x,y}\Gamma_{h,y}^{-1}$ with $\hat\Sigma_{h,x,y}$ given by
$$\hat\Sigma_{h,x,y} = \frac1n\sum_{i=1}^n\int_{\frac{\mathcal X-y}h}\int_{\frac{\mathcal X-x}h}R(u)R(v)'\big[\mathbb 1(x_i\le x+hu)-\hat F(x+hu)\big]\big[\mathbb 1(x_i\le y+hv)-\hat F(y+hv)\big]K(u)K(v)g(x+hu)g(y+hv)\,\mathrm du\,\mathrm dv.$$
The next lemma characterizes the convergence rate of $\hat\Omega_{h,x,y}$.

Lemma 11 (Local $L_2$ Distribution Estimation: Covariance Estimation).
Assume Assumptions 1, 2, 4 and 5 hold, and that $nh^2/\log n\to\infty$. Then
$$\sup_{x,y\in\mathcal I}\left|\frac{c_{h,x}'\Upsilon_h(\hat\Omega_{h,x,y}-\Omega_{h,x,y})\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}\right| = O_P\left(\sqrt{\frac{\log n}{nh}}\right). \quad ∎$$
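For intuition on the covariance estimator entering Lemma 11, the following is a small numerical sketch for the simplest case $R(u)\equiv1$ with a triangular kernel and uniform design density $g\equiv1$; the double integral is evaluated on a grid. The setup, discretization, and names are our own assumptions, not the paper's software.

```python
import numpy as np

# Sketch (ours): Sigma_hat_{h,x,y} for scalar basis R(u) = 1, triangular kernel,
# design density g = 1 on [0, 1]; the double integral is computed by quadrature.
rng = np.random.default_rng(2)
n, h = 2000, 0.1
data = rng.uniform(0, 1, size=n)
Fhat = lambda t: np.mean(data[:, None] <= t, axis=0)   # empirical CDF, vectorized
K = lambda u: np.maximum(1 - np.abs(u), 0)             # triangular kernel on [-1, 1]

def Sigma_hat(x, y, m=80):
    # integration regions (X - x)/h and (X - y)/h intersected with [-1, 1]
    u = np.linspace(max(-x / h, -1), min((1 - x) / h, 1), m)
    v = np.linspace(max(-y / h, -1), min((1 - y) / h, 1), m)
    Iu = (data[:, None] <= x + h * u) - Fhat(x + h * u)    # n x m
    Iv = (data[:, None] <= y + h * v) - Fhat(y + h * v)
    wu = K(u) * (u[1] - u[0])                              # g = 1 absorbed
    wv = K(v) * (v[1] - v[0])
    return np.mean((Iu @ wu) * (Iv @ wv))                  # average over i

print(Sigma_hat(0.5, 0.5), Sigma_hat(0.5, 0.55), Sigma_hat(0.5, 0.8))
```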
We now consider the estimator $c_{h,x}'\hat\theta_G(x)$. From (3) and (4), one has
$$\hat T_G(x) = \frac{\sqrt n\,c_{h,x}'\big(\hat\theta_G(x)-\theta(x)\big)}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}} = \sqrt n\,\frac{c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}\int_{\frac{\mathcal X-x}h}R(u)\big[F(x+hu)-R(u)'\Upsilon_h^{-1}\theta(x)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\tag{17}$$
$$\qquad + \frac1{\sqrt n}\sum_{i=1}^n\frac{c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}\int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}.\tag{18}$$
In the following two lemmas, we analyze the two terms in the above decomposition separately.

Lemma 12.
Assume Assumptions 1, 2, 4 and 5 hold, and that $nh^2/\log n\to\infty$. Then
$$\sup_{x\in\mathcal I}\big|\text{(17)}\big| = O_P\left(\sqrt{\frac nh}\,\sup_{x\in\mathcal I}\varrho(h,x)\right). \quad ∎$$
Assume Assumptions 1, 2, 4 and 5 hold, and that $nh^2/\log n\to\infty$. Then
$$\sup_{x\in\mathcal I}\big|\text{(18)}-T_G(x)\big| = O_P\left(\frac{\log n}{\sqrt{nh}}\right). \quad ∎$$

Now we state the main result on uniform distributional approximation.
Theorem 14 (Local $L_2$ Distribution Estimation: Uniform Distributional Approximation).
Assume Assumptions 1, 2, 4 and 5 hold, and that $nh^2/\log n\to\infty$. Then on a possibly enlarged probability space there exist two processes, $\{\tilde T_G(x): x\in\mathcal I\}$ and $\{B_G(x): x\in\mathcal I\}$, such that (i) $\tilde T_G(\cdot)$ has the same distribution as $T_G(\cdot)$; (ii) $B_G(\cdot)$ is a Gaussian process with the same covariance structure as $T_G(\cdot)$; and (iii)
$$\sup_{x\in\mathcal I}\big|\hat T_G(x)-T_G(x)\big| + \sup_{x\in\mathcal I}\big|\tilde T_G(x)-B_G(x)\big| = O_P\left(\frac{\log n}{\sqrt{nh}} + \sqrt{\frac nh}\,\sup_{x\in\mathcal I}\varrho(h,x)\right). \quad ∎$$

The following theorem shows that a feasible approximation to the process $B_G(\cdot)$ can be achieved by simulating a Gaussian process with covariance estimated from the data. In the following, we use $\mathbb P^\star$, $\mathbb E^\star$ and $\mathrm{Cov}^\star$ to denote the probability, expectation and covariance operators conditional on the data.

Theorem 15 (Local $L_2$ Distribution Estimation: Feasible Distributional Approximation).
Assume Assumptions 1, 2, 4 and 5 hold, and that $nh^2/\log n\to\infty$. Then on a possibly enlarged probability space there exist two centered Gaussian processes, $\{\tilde B_G(x): x\in\mathcal I\}$ and $\{\hat B_G(x): x\in\mathcal I\}$, satisfying: (i) $\tilde B_G(\cdot)$ is independent of the data, and has the same distribution as $B_G(\cdot)$; (ii) $\hat B_G(\cdot)$ has covariance
$$\mathrm{Cov}^\star\big[\hat B_G(x),\hat B_G(y)\big] = \frac{c_{h,x}'\Upsilon_h\hat\Omega_{h,x,y}\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\hat\Omega_{h,y}\Upsilon_hc_{h,y}}};$$
and (iii)
$$\mathbb E^\star\left[\sup_{x\in\mathcal I}\big|\tilde B_G(x)-\hat B_G(x)\big|\right] = O_P\left(\sqrt{\log n}\,\inf_{\varepsilon\le h}\left[\frac1{\sqrt h}\frac\varepsilon h + \frac1{\sqrt h}\frac{r_1(\varepsilon,h)}{r_2(h)} + \frac1h\frac{r_1(\varepsilon,h)^2}{r_2(h)^2} + \sqrt{\frac{\sqrt h}\varepsilon\sqrt{\frac{\log n}n}}\right]\right). \quad ∎$$
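Operationally, Theorem 15 suggests simulating $\hat B_G(\cdot)$ on a grid from the estimated covariance and taking suprema to obtain critical values for uniform confidence bands. The sketch below is our own illustration of that recipe; it reuses `Sigma_hat` and the toy setup from the previous snippet, and with a scalar basis the studentized covariance reduces to a correlation matrix.

```python
import numpy as np

# Sketch (ours): feasible sup-norm critical value via the estimated covariance.
# Assumes Sigma_hat, data, h from the previous sketch are in scope.
grid = np.linspace(0.2, 0.8, 31)
S = np.array([[Sigma_hat(a, b) for b in grid] for a in grid])
d = np.sqrt(np.diag(S))
Corr = S / np.outer(d, d)            # estimated Cov*[B_hat(x), B_hat(y)] on the grid

rng = np.random.default_rng(3)
L = np.linalg.cholesky(Corr + 1e-10 * np.eye(len(grid)))   # small jitter for stability
draws = np.abs(L @ rng.standard_normal((len(grid), 5000))).max(axis=0)
q95 = np.quantile(draws, 0.95)       # simulated 95% critical value for the sup statistic
print("95% sup-norm critical value:", q95)
```

A uniform band is then the pointwise estimate plus or minus `q95` times the pointwise standard error over the grid.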
Remark 5 (On the Remainders in Theorems 14 and 15). Both errors involve numerous quantities which are difficult to simplify further. To understand the magnitude of these two error bounds, we again consider the local polynomial estimator of Cattaneo, Jansson, and Ma (2020) (see Section 1 for a brief introduction, and also Remark 4).

Recall that the local polynomial density estimator employs a polynomial basis, which implies that $\sup_{x\in\mathcal I}\varrho(h,x) = h^{p+1}$, where $p$ is the highest polynomial order. The error in Theorem 14 then reduces to
$$\sqrt{nh^{2p+1}} + \frac{\log n}{\sqrt{nh}}.$$
As discussed in Remark 4, $r_1(\varepsilon,h) = 0$ for density (or higher-order derivative) estimation, which implies that the error in Theorem 15 becomes
$$\sqrt{\log n}\,\inf_{\varepsilon\le h}\left[\frac1{\sqrt h}\frac\varepsilon h + \sqrt{\frac{\sqrt h}\varepsilon\sqrt{\frac{\log n}n}}\right] = O\left(\left(\frac{\log^2n}{\sqrt n\,h}\right)^{1/3}\right).$$
A sufficient set of conditions for both errors to be negligible is $nh^{2p+1}\to0$ and $nh^2/\log^4n\to\infty$. ∎

Now we consider the local regression estimator $\{\hat\theta(x): x\in\mathcal I\}$. As before, we first discuss the construction of the covariance $\Omega_{h,x,y}$. Let $\hat\Omega_{h,x,y} = \hat\Gamma_{h,x}^{-1}\hat\Sigma_{h,x,y}\hat\Gamma_{h,y}^{-1}$. The construction of $\hat\Gamma_{h,x}$ is given in Section 2.2. The following lemma shows that $\hat\Gamma_{h,x}$ is uniformly consistent.

Lemma 16 (Uniform Consistency of $\hat\Gamma_{h,x}$). Assume Assumptions 1 and 4 hold. Then
$$\sup_{x\in\mathcal I}\big|\hat\Gamma_{h,x}-\Gamma_{h,x}\big| = O_P\left(\sqrt{\frac{\log n}{nh}}\right). \quad ∎$$

The construction of $\hat\Sigma_{h,x,y}$ also mimics that in Section 2.2. To be precise, we let
$$\hat\Sigma_{h,x,y} = \frac1n\sum_{i=1}^n\left(\frac1n\sum_{j=1}^n\Upsilon_hR(x_j-x)\big[\mathbb 1(x_i\le x_j)-\hat F(x_j)\big]\frac1hK\Big(\frac{x_j-x}h\Big)\right)\left(\frac1n\sum_{j=1}^n\Upsilon_hR(x_j-y)\big[\mathbb 1(x_i\le x_j)-\hat F(x_j)\big]\frac1hK\Big(\frac{x_j-y}h\Big)\right)',$$
where $\hat F(\cdot)$ remains the empirical distribution function. The following result justifies consistency of $\hat\Omega_{h,x,y}$.

Lemma 17 (Local Regression Distribution Estimation: Covariance Estimation).
Assume Assumptions 1 and 4 hold, and that $nh^2/\log n\to\infty$. Then
$$\sup_{x,y\in\mathcal I}\left|\frac{c_{h,x}'\Upsilon_h(\hat\Omega_{h,x,y}-\Omega_{h,x,y})\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}\right| = O_P\left(\sqrt{\frac{\log n}{nh}}\right). \quad ∎$$

The following is an expansion of $\hat T(\cdot)$:
$$\hat T(x) = \frac{\sqrt n\,c_{h,x}'\big(\hat\theta(x)-\theta(x)\big)}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}} = \frac1{n\sqrt n}\sum_{i=1}^n\frac{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}\Upsilon_hR(x_i-x)\big[1-F(x_i)\big]\frac1hK\big(\frac{x_i-x}h\big)}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\tag{19}$$
$$+ \frac1{\sqrt n}\sum_{i=1}^n\frac{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}\Upsilon_hR(x_i-x)\big[F(x_i)-\theta(x)'R(x_i-x)\big]\frac1hK\big(\frac{x_i-x}h\big)}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\tag{20}$$
$$+ \frac1{n\sqrt n}\sum_{i,j=1,\,i\ne j}^n\frac1{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\left\{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}\Upsilon_hR(x_j-x)\big[\mathbb 1(x_i\le x_j)-F(x_j)\big]\frac1hK\Big(\frac{x_j-x}h\Big) - \int_{\frac{\mathcal X-x}h}c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)f(x+hu)\mathrm du\right\}\tag{21}$$
$$+ \frac{n-1}{n\sqrt n}\sum_{i=1}^n\frac{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}\int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)f(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}.\tag{22}$$

Lemma 18.
Assume Assumptions 1 and 4 hold, and that $nh^2/\log n\to\infty$. Then
$$\sup_{x\in\mathcal I}\big|\text{(19)}\big| = O_P\left(\frac1{\sqrt{nh}}\right). \quad ∎$$

Lemma 19. Assume Assumptions 1 and 4 hold, and that $nh^2/\log n\to\infty$. Then
$$\sup_{x\in\mathcal I}\big|\text{(20)}\big| = O_P\left(\sqrt{\frac nh}\,\sup_{x\in\mathcal I}\varrho(h,x)\right). \quad ∎$$

Lemma 20.
Assume Assumptions 1 and 4 hold, and that $nh^2/\log n\to\infty$. Then
$$\sup_{x\in\mathcal I}\big|\text{(21)}\big| = O_P\left(\frac{\log n}{\sqrt n\,h}\right). \quad ∎$$

Lemma 21.
Assume Assumptions 1 and 4 hold, and that $nh^2/\log n\to\infty$. Then
$$\sup_{x\in\mathcal I}\big|\text{(22)}-T_F(x)\big| = O_P\left(\frac{\log n}{\sqrt{nh}}\right). \quad ∎$$

Finally, we have the following result on uniform distributional approximation for the local regression distribution estimator, as well as a feasible approximation obtained by simulating from a Gaussian process with estimated covariance.
Theorem 22 (Local Regression Distribution Estimation: Uniform Distributional Approximation).
Assume Assumptions 1 and 4 hold, and that $nh^2/\log n\to\infty$. Then on a possibly enlarged probability space there exist two processes, $\{\tilde T_F(x): x\in\mathcal I\}$ and $\{B_F(x): x\in\mathcal I\}$, such that (i) $\tilde T_F(\cdot)$ has the same distribution as $T_F(\cdot)$; (ii) $B_F(\cdot)$ is a Gaussian process with the same covariance structure as $T_F(\cdot)$; and (iii)
$$\sup_{x\in\mathcal I}\big|\hat T(x)-T_F(x)\big| + \sup_{x\in\mathcal I}\big|\tilde T_F(x)-B_F(x)\big| = O_P\left(\frac{\log n}{\sqrt n\,h} + \sqrt{\frac nh}\,\sup_{x\in\mathcal I}\varrho(h,x)\right). \quad ∎$$

Theorem 23 (Local Regression Distribution Estimation: Feasible Distributional Approximation).
Assume Assumptions 1 and 4 hold, and that $nh^2/\log n\to\infty$. Then on a possibly enlarged probability space there exist two centered Gaussian processes, $\{\tilde B_F(x): x\in\mathcal I\}$ and $\{\hat B_F(x): x\in\mathcal I\}$, satisfying: (i) $\tilde B_F(\cdot)$ is independent of the data, and has the same distribution as $B_F(\cdot)$; (ii) $\hat B_F(\cdot)$ has covariance
$$\mathrm{Cov}^\star\big[\hat B_F(x),\hat B_F(y)\big] = \frac{c_{h,x}'\Upsilon_h\hat\Omega_{h,x,y}\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\hat\Omega_{h,y}\Upsilon_hc_{h,y}}};$$
and (iii)
$$\mathbb E^\star\left[\sup_{x\in\mathcal I}\big|\tilde B_F(x)-\hat B_F(x)\big|\right] = O_P\left(\sqrt{\log n}\,\inf_{\varepsilon\le h}\left[\frac1{\sqrt h}\frac\varepsilon h + \frac1{\sqrt h}\frac{r_1(\varepsilon,h)}{r_2(h)} + \frac1h\frac{r_1(\varepsilon,h)^2}{r_2(h)^2} + \sqrt{\frac{\sqrt h}\varepsilon\sqrt{\frac{\log n}n}}\right]\right). \quad ∎$$
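Before turning to the proofs, a compact sketch of the estimator $\hat\theta(x)$ itself may help fix ideas: it is the weighted least squares projection of the empirical distribution function on the local basis, matching the expansion (19)-(22) above. The polynomial basis with factorial scaling, the triangular kernel, and the exponential design below are our own illustrative choices, not the companion software.

```python
import math
import numpy as np

# Sketch (ours): local regression distribution estimator
#   theta_hat(x) = argmin_theta sum_i (Fhat(x_i) - R(x_i - x)' theta)^2 K_h(x_i - x),
# with R(u) = (1, u, u^2/2!, ..., u^p/p!)', so the (l+1)-th coefficient
# estimates F^(l)(x); in particular the second coefficient estimates f(x),
# including at boundary points.
rng = np.random.default_rng(4)
n, h, p = 5000, 0.4, 2
data = np.sort(rng.exponential(1.0, size=n))
Fhat = np.arange(1, n + 1) / n               # empirical CDF at the (sorted) data points

def theta_hat(x):
    u = data - x
    w = np.maximum(1 - np.abs(u / h), 0) / h     # triangular kernel weights
    R = np.column_stack([u**k / math.factorial(k) for k in range(p + 1)])
    A = R.T @ (w[:, None] * R)                   # weighted Gram matrix
    b = R.T @ (w * Fhat)                         # weighted cross moment
    return np.linalg.solve(A, b)

for x in [0.0, 1.0, 2.0]:                        # x = 0 is a boundary point
    print(x, "f_hat =", theta_hat(x)[1], " true f =", np.exp(-x))
```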
Proofs

Lemma 24.
Assume $\{u_{i,h}(a): a\in A\subset\mathbb R^d\}$ are independent across $i$, and $\mathbb E[u_{i,h}(a)] = 0$ for all $a\in A$ and all $h>0$. In addition, assume that for each $\varepsilon>0$ there exists $\{u_{i,h,\varepsilon}(a): a\in A\}$ such that
$$|a-b|\le\varepsilon\ \Rightarrow\ |u_{i,h}(a)-u_{i,h}(b)| \le u_{i,h,\varepsilon}(a).$$
Define
$$C_1 = \sup_{a\in A}\max_{1\le i\le n}\mathbb V[u_{i,h}(a)],\qquad C_2 = \sup_{a\in A}\max_{1\le i\le n}|u_{i,h}(a)|,$$
$$C_{3,\varepsilon} = \sup_{a\in A}\max_{1\le i\le n}\mathbb V[u_{i,h,\varepsilon}(a)],\qquad C_{4,\varepsilon} = \sup_{a\in A}\max_{1\le i\le n}\big|u_{i,h,\varepsilon}(a)-\mathbb E[u_{i,h,\varepsilon}(a)]\big|,\qquad C_{5,\varepsilon} = \sup_{a\in A}\max_{1\le i\le n}\mathbb E\big[|u_{i,h,\varepsilon}(a)|\big].$$
Then
$$\sup_{a\in A}\left|\frac1n\sum_{i=1}^nu_{i,h}(a)\right| = O_P(\gamma+\gamma_\varepsilon+C_{5,\varepsilon}),$$
where $\gamma$ and $\gamma_\varepsilon$ are any sequences such that
$$\frac{\gamma^2n}{(C_1+\gamma C_2)\log N(\varepsilon,A,|\cdot|)}\qquad\text{and}\qquad\frac{\gamma_\varepsilon^2n}{(C_{3,\varepsilon}+\gamma_\varepsilon C_{4,\varepsilon})\log N(\varepsilon,A,|\cdot|)}$$
are bounded from below, and $N(\varepsilon,A,|\cdot|)$ is the covering number of $A$. ∎

Remark 6.
Provided that $u_{i,h}(\cdot)$ is reasonably smooth, one can always choose $\varepsilon$ (as a function of $n$ and $h$) small enough that the leading order is given by $\gamma$ (and hence is determined by $C_1$ and $C_2$). ∎
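To illustrate what Lemma 24 delivers in a concrete case, the toy simulation below (our own construction, not from the paper) tracks the maximal deviation of a kernel average over a grid and compares it with the $\sqrt{\log n/(nh)}$ rate that the lemma yields in the proof of Lemma 16:

```python
import numpy as np

# Toy check (ours): for u_{i,h}(x) = K_h(x_i - x) - E[K_h(x_i - x)], uniform data
# and triangular kernel, the maximal deviation over a grid should track
# sqrt(log n / (n h)), the rate obtained via Lemma 24 in the proof of Lemma 16.
rng = np.random.default_rng(5)
Kh = lambda u, h: np.maximum(1 - np.abs(u / h), 0) / h   # integrates to 1
grid = np.linspace(0.3, 0.7, 100)                        # interior points, so E = f = 1
for n in [1000, 4000, 16000]:
    h = n ** (-1 / 5)
    data = rng.uniform(0, 1, size=n)
    dev = np.abs(Kh(data[:, None] - grid, h).mean(axis=0) - 1.0).max()
    print(n, round(dev, 4), round(np.sqrt(np.log(n) / (n * h)), 4))
```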
Proof. Let $A_\varepsilon$ be an $\varepsilon$-covering of $A$; then
$$\sup_{a\in A}\left|\frac1n\sum_{i=1}^nu_{i,h}(a)\right| \le \sup_{a\in A_\varepsilon}\left|\frac1n\sum_{i=1}^nu_{i,h}(a)\right| + \sup_{a\in A_\varepsilon,\,b\in A,\,|a-b|\le\varepsilon}\left|\frac1n\sum_{i=1}^n\big(u_{i,h}(a)-u_{i,h}(b)\big)\right|.$$
Next we apply the union bound and Bernstein's inequality:
$$\mathbb P\left[\sup_{a\in A_\varepsilon}\left|\frac1n\sum_{i=1}^nu_{i,h}(a)\right|\ge\gamma u\right] \le N(\varepsilon,A,|\cdot|)\sup_{a\in A}\mathbb P\left[\left|\frac1n\sum_{i=1}^nu_{i,h}(a)\right|\ge\gamma u\right] \le 2N(\varepsilon,A,|\cdot|)\exp\left\{-\frac{\gamma^2nu^2}{C_1+\gamma C_2u}\right\} = 2\exp\left\{-\frac{\gamma^2nu^2}{C_1+\gamma C_2u}+\log N(\varepsilon,A,|\cdot|)\right\}.$$
Now take $u$ sufficiently large; then the above is further bounded by
$$\mathbb P\left[\sup_{a\in A_\varepsilon}\left|\frac1n\sum_{i=1}^nu_{i,h}(a)\right|\ge\gamma u\right] \le 2\exp\left\{-\log N(\varepsilon,A,|\cdot|)\left[\frac12\frac1{\log N(\varepsilon,A,|\cdot|)}\frac{\gamma^2nu^2}{C_1+\gamma C_2u}-1\right]\right\},$$
which tends to zero if $\log N(\varepsilon,A,|\cdot|)\to\infty$ and $\frac{\gamma^2n}{(C_1+\gamma C_2)\log N(\varepsilon,A,|\cdot|)}$ is bounded from below, in which case we have
$$\sup_{a\in A_\varepsilon}\left|\frac1n\sum_{i=1}^nu_{i,h}(a)\right| = O_P(\gamma).$$
We can apply the same technique to the other term, and obtain
$$\sup_{a\in A_\varepsilon,\,b\in A,\,|a-b|\le\varepsilon}\left|\frac1n\sum_{i=1}^n\big(u_{i,h}(a)-u_{i,h}(b)\big)\right| = O_P(\gamma_\varepsilon),$$
where $\gamma_\varepsilon$ is any sequence such that $\frac{\gamma_\varepsilon^2n}{(C_{3,\varepsilon}+\gamma_\varepsilon C_{4,\varepsilon})\log N(\varepsilon,A,|\cdot|)}$ is bounded from below. ∎

Lemma 25 (Equation (3.5) of Giné, Latała, and Zinn (2000)).
For a degenerate second-order U-statistic, $\sum_{i,j=1,\,i\ne j}^nu_{ij}(x_i,x_j)$, the following holds:
$$\mathbb P\left[\left|\sum_{i,j=1,\,i\ne j}^nu_{ij}(x_i,x_j)\right|>t\right] \le C_1\exp\left\{-C_2\min\left[\frac{t^2}{D^2},\ \Big(\frac tB\Big)^{2/3},\ \Big(\frac tA\Big)^{1/2}\right]\right\},$$
where $C_1$ and $C_2$ are universal constants, and $A$, $B$ and $D$ are any constants satisfying
$$A \ge \max_{1\le i,j\le n}\sup_{u,v}|u_{ij}(u,v)|,\qquad B^2 \ge \max\left[\sup_v\sum_{i=1}^n\mathbb E\big[u_{ij}(x_i,v)^2\big],\ \sup_u\sum_{j=1}^n\mathbb E\big[u_{ij}(u,x_j)^2\big]\right],\qquad D^2 \ge \sum_{i,j=1,\,i\ne j}^n\mathbb E\big[u_{ij}(x_i,x_j)^2\big]. \quad ∎$$

Part (i)
The bias term can be bounded by
$$\left|\int_{\frac{\mathcal X-x}h}R(u)\big[F(x+hu)-R(u)'\Upsilon_h^{-1}\theta\big]K(u)\mathrm du\right| \le \left[\sup_{u\in[-1,1]}\big|F(x+hu)-R(u)'\Upsilon_h^{-1}\theta\big|\right]\int_{\frac{\mathcal X-x}h}|R(u)|K(u)\mathrm du = \varrho(h)\int_{\frac{\mathcal X-x}h}|R(u)|K(u)\mathrm du.$$

Part (ii)
The variance can be found by
$$\mathbb V\left[\frac1{\sqrt n}\sum_{i=1}^n\int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)g(x+hu)\mathrm du\right] = \int\!\!\int_{\frac{\mathcal X-x}h}R(u)R(v)'K(u)K(v)\big[F(x+h(u\wedge v))-F(x+hu)F(x+hv)\big]g(x+hu)g(x+hv)\,\mathrm du\,\mathrm dv.$$
To establish asymptotic normality, we verify a Lyapunov-type condition with a fourth-moment calculation. Take $c$ to be a nonzero vector of conformable dimension, and employ the Cramér-Wold device:
$$\frac1n\big(c'\Sigma_hc\big)^{-2}\,\mathbb E\left[\left(\int_{\frac{\mathcal X-x}h}c'R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)g(x+hu)\mathrm du\right)^4\right].$$
If $c'\Sigma_hc$ is bounded away from zero as the bandwidth decreases, the above will have order $n^{-1}$, as $K(\cdot)$ is bounded and compactly supported and $R(\cdot)$ is locally bounded. Therefore, the Lyapunov condition holds in this case. The more challenging case is when $c'\Sigma_hc$ is of order $h$. In this case, it implies
$$F(x)(1-F(x))\left|\int_{\frac{\mathcal X-x}h}c'R(u)K(u)g(x+hu)\mathrm du\right|^2 = O(h).$$
Now consider the fourth moment. The leading term is
$$F(x)(1-F(x))\big(3F(x)^2-3F(x)+1\big)\left|\int_{\frac{\mathcal X-x}h}c'R(u)K(u)g(x+hu)\mathrm du\right|^4 = O(h),$$
meaning that for the Lyapunov condition to hold, we need the requirement that $nh\to\infty$.

Part (iii)
This follows immediately from Parts (i) and (ii).
To study the property of $\hat\Sigma_h$, we make the following decomposition:
$$\hat\Sigma_h = \underbrace{\frac1n\sum_{i=1}^n\int\!\!\int_{\frac{\mathcal X-x}h}R(u)R(v)'\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]\big[\mathbb 1(x_i\le x+hv)-F(x+hv)\big]K(u)K(v)g(x+hu)g(x+hv)\,\mathrm du\,\mathrm dv}_{\text{(I)}}$$
$$\qquad - \underbrace{\int\!\!\int_{\frac{\mathcal X-x}h}R(u)R(v)'\big[\hat F(x+hu)-F(x+hu)\big]\big[\hat F(x+hv)-F(x+hv)\big]K(u)K(v)g(x+hu)g(x+hv)\,\mathrm du\,\mathrm dv}_{\text{(II)}}.$$
First, it is obvious that term (II) is of order $O_P(1/n)$. Term (I) requires more delicate analysis. Let $c$ be a vector of unit length and suitable dimension, and define
$$c_i = \int\!\!\int_{\frac{\mathcal X-x}h}c'R(u)R(v)'c\,\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]\big[\mathbb 1(x_i\le x+hv)-F(x+hv)\big]K(u)K(v)g(x+hu)g(x+hv)\,\mathrm du\,\mathrm dv.$$
Then
$$c'\text{(I)}c = \mathbb E[c'\text{(I)}c] + O_P\Big(\sqrt{\mathbb V[c'\text{(I)}c]}\Big) = \mathbb E[c_i] + O_P\left(\sqrt{\frac1n\big(\mathbb E[c_i^2]-(\mathbb E[c_i])^2\big)}\right),$$
which implies that
$$\frac{c'\text{(I)}c}{\mathbb E[c'\text{(I)}c]} - 1 = O_P\left(\sqrt{\frac1n\left(\frac{\mathbb E[c_i^2]}{(\mathbb E[c_i])^2}-1\right)}\right).$$
With the same argument used in the proof of Theorem 1, one can show that
$$\frac{\mathbb E[c_i^2]}{(\mathbb E[c_i])^2} = O\left(\frac1h\right),$$
which implies
$$\frac{c'\text{(I)}c}{c'\Sigma_hc} - 1 = O_P\left(\sqrt{\frac1{nh}}\right).$$

Part (i)
For the "denominator," its variance is bounded by
$$\left|\mathbb V\left[\frac1n\sum_{i=1}^n\Upsilon_hR(x_i-x)R(x_i-x)'\Upsilon_h\frac1hK\Big(\frac{x_i-x}h\Big)\right]\right| \le \frac1n\mathbb E\left[\big|\Upsilon_hR(x_i-x)R(x_i-x)'\Upsilon_h\big|^2\frac1{h^2}K\Big(\frac{x_i-x}h\Big)^2\right] = \frac1{nh}\int_{\frac{\mathcal X-x}h}\big|R(u)R(u)'\big|^2K(u)^2f(x+hu)\mathrm du = O\left(\frac1{nh}\right).$$
Therefore, under the assumptions that $h\to0$ and $nh\to\infty$, we have
$$\left|\frac1n\sum_{i=1}^n\Upsilon_hR(x_i-x)R(x_i-x)'\Upsilon_h\frac1hK\Big(\frac{x_i-x}h\Big) - \Gamma_h\right| = O_P\left(\sqrt{\frac1{nh}}\right).$$
As a result, we have
$$\hat\theta-\theta = \Upsilon_h\Gamma_h^{-1}\left(\frac1n\sum_{i=1}^n\Upsilon_hR(x_i-x)\big[\hat F(x_i)-R(x_i-x)'\theta\big]\frac1hK\Big(\frac{x_i-x}h\Big)\right)(1+o_P(1)).$$

Part (ii)
The order of the leave-in bias is clearly $1/n$. For the approximation bias (7), we obtained its mean in the proof of Theorem 1 by setting $G = F$, which has order $\varrho(h)$. The approximation bias has a variance of order
$$\left|\mathbb V\left[\frac1n\sum_{j=1}^n\Upsilon_hR(x_j-x)\big[F(x_j)-R(x_j-x)'\theta\big]\frac1hK\Big(\frac{x_j-x}h\Big)\right]\right| \le \frac1{nh}\int_{\frac{\mathcal X-x}h}\Big|R(u)\big[F(x+hu)-R(u)'\Upsilon_h^{-1}\theta\big]\Big|^2K(u)^2f(x+hu)\mathrm du \le \frac{\varrho(h)^2}{nh}\int_{\frac{\mathcal X-x}h}|R(u)|^2K(u)^2f(x+hu)\mathrm du = O\left(\frac{\varrho(h)^2}{nh}\right).$$
Therefore,
$$\text{(7)} = O_P\left(\varrho(h)+\varrho(h)\sqrt{\frac1{nh}}\right) = O_P(\varrho(h)),$$
under the assumption that $nh\to\infty$.

Part (iii). We compute the variance of the U-statistic (9). For simplicity, define
$$u_{ij} = \Upsilon_hR(x_j-x)\big[\mathbb 1(x_i\le x_j)-F(x_j)\big]\frac1hK\Big(\frac{x_j-x}h\Big) - \int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)f(x+hu)\mathrm du,$$
which satisfies $\mathbb E[u_{ij}] = \mathbb E[u_{ij}|x_i] = \mathbb E[u_{ij}|x_j] = 0$. Therefore,
$$\mathbb V[\text{(9)}] = \frac1{n^3}\sum_{i,j=1,\,i\ne j}^n\ \sum_{i',j'=1,\,i'\ne j'}^n\mathbb E\big[u_{ij}u_{i'j'}'\big] = \frac1{n^3}\sum_{i,j=1,\,i\ne j}^n\Big(\mathbb E\big[u_{ij}u_{ij}'\big]+\mathbb E\big[u_{ij}u_{ji}'\big]\Big),$$
meaning that
$$\text{(9)} = O_P\left(\frac1{\sqrt{nh}}\right).$$

Part (iv)
This follows immediately from Parts (i)-(iii) and Theorem 1.
We first split $\hat\Sigma_h$ into two terms:
$$\text{(I)} = \frac1{n^3}\sum_{i,j,k=1}^n\Upsilon_hR_jR_k'\Upsilon_hW_jW_k\big(\mathbb 1(x_i\le x_j)-F(x_j)\big)\big(\mathbb 1(x_i\le x_k)-F(x_k)\big),$$
$$\text{(II)} = -\frac1{n^2}\sum_{j,k=1}^n\Upsilon_hR_jR_k'\Upsilon_hW_jW_k\big(\hat F(x_j)-F(x_j)\big)\big(\hat F(x_k)-F(x_k)\big),$$
where $R_i = R(x_i-x)$ and $W_i = K((x_i-x)/h)/h$. (II) satisfies
$$|\text{(II)}| \le \left(\sup_x|\hat F(x)-F(x)|^2\right)\frac1{n^2}\sum_{j,k=1}^n\big|\Upsilon_hR_jR_k'\Upsilon_hW_jW_k\big|.$$
It is obvious that $\sup_x|\hat F(x)-F(x)|^2 = O_P(1/n)$. As for the second part, we have
$$\frac1{n^2}\sum_{j,k=1}^n\mathbb E\big[\big|\Upsilon_hR_jR_k'\Upsilon_hW_jW_k\big|\big] = \frac{n-1}n\mathbb E\big[\big|\Upsilon_hR_jR_k'\Upsilon_hW_jW_k\big|\,\big|\,j\ne k\big] + \frac1n\mathbb E\big[\big|\Upsilon_hR_kR_k'\Upsilon_hW_kW_k\big|\big] = O_P(1)+O_P\left(\frac1{nh}\right) = O_P(1),$$
implying that $\text{(II)} = O_P(1/n)$, provided that $nh\to\infty$.

For (I), we further expand into "diagonal" and "non-diagonal" sums:
$$\text{(I)} = \underbrace{\frac1{n^3}\sum_{\substack{i,j,k=1\\ \text{distinct}}}^n\Upsilon_hR_jR_k'\Upsilon_hW_jW_k\big(\mathbb 1(x_i\le x_j)-F(x_j)\big)\big(\mathbb 1(x_i\le x_k)-F(x_k)\big)}_{\text{(I.1)}} + \underbrace{\frac1{n^3}\sum_{\substack{i,k=1\\ \text{distinct}}}^n\Upsilon_hR_iR_k'\Upsilon_hW_iW_k\big(\mathbb 1(x_i\le x_i)-F(x_i)\big)\big(\mathbb 1(x_i\le x_k)-F(x_k)\big)}_{\text{(I.2)}}$$
$$+ \underbrace{\frac1{n^3}\sum_{\substack{i,j=1\\ \text{distinct}}}^n\Upsilon_hR_jR_i'\Upsilon_hW_jW_i\big(\mathbb 1(x_i\le x_j)-F(x_j)\big)\big(\mathbb 1(x_i\le x_i)-F(x_i)\big)}_{\text{(I.3)}} + \underbrace{\frac1{n^3}\sum_{\substack{i,j=1\\ \text{distinct}}}^n\Upsilon_hR_jR_j'\Upsilon_hW_jW_j\big(\mathbb 1(x_i\le x_j)-F(x_j)\big)^2}_{\text{(I.4)}} + \underbrace{\frac1{n^3}\sum_i\Upsilon_hR_iR_i'\Upsilon_hW_iW_i\big(\mathbb 1(x_i\le x_i)-F(x_i)\big)^2}_{\text{(I.5)}}.$$
By calculating the expectation of the absolute value of the summands above, it is straightforward to show
$$\text{(I.2)} = O_P\left(\frac1n\right),\quad \text{(I.3)} = O_P\left(\frac1n\right),\quad \text{(I.4)} = O_P\left(\frac1{nh}\right),\quad \text{(I.5)} = O_P\left(\frac1{n^2h}\right).$$
Therefore, we have
$$\hat\Sigma_h = \text{(I.1)} + O_P\left(\frac1{nh}\right).$$
To proceed, define
$$u_{ij} = \Upsilon_hR_jW_j\big(\mathbb 1(x_i\le x_j)-F(x_j)\big)\qquad\text{and}\qquad\bar u_i = \mathbb E[u_{ij}|x_i;\ i\ne j].$$
Then we can further decompose (I.1) into
$$\text{(I.1)} = \frac1{n^3}\sum_{\substack{i,j,k\\ \text{distinct}}}u_{ij}u_{ik}' = \underbrace{\frac{(n-1)(n-2)}{n^3}\sum_{i=1}^n\bar u_i\bar u_i'}_{\text{(I.1.1)}} + \underbrace{\frac1{n^3}\sum_{\substack{i,j,k\\ \text{distinct}}}\big(u_{ij}u_{ik}'-\bar u_i\bar u_i'\big)}_{\text{(I.1.2)}}.$$
We have already analyzed (I.1.1) in Theorem 2, which satisfies
$$\text{(I.1.1)} = \Sigma_h + O_P\left(\sqrt{\frac hn}\right).$$
Now we study (I.1.2), which satisfies
$$\text{(I.1.2)} = \underbrace{\frac{n-2}{n^3}\sum_{\substack{i,j\\ \text{distinct}}}\big(u_{ij}-\bar u_i\big)\bar u_i'}_{\text{(I.1.2.1)}} + \underbrace{\frac{n-2}{n^3}\sum_{\substack{i,j\\ \text{distinct}}}\bar u_i\big(u_{ij}-\bar u_i\big)'}_{\text{(I.1.2.2)}} + \underbrace{\frac1{n^3}\sum_{\substack{i,j,k\\ \text{distinct}}}\big(u_{ij}-\bar u_i\big)\big(u_{ik}-\bar u_i\big)'}_{\text{(I.1.2.3)}}.$$
By variance calculation, it is easy to see that $\text{(I.1.2.3)} = O_P\big(\frac1{nh}\big)$. Therefore we have
$$\frac{c'(\hat\Sigma_h-\Sigma_h)c}{c'\Sigma_hc} = O_P\left(\frac1{\sqrt{nh}}\right) + 2\frac{c'\text{(I.1.2.1)}c}{c'\Sigma_hc},$$
since (I.1.2.1) and (I.1.2.2) are transposes of each other. To close the proof, we calculate the variance of the last term in the above.
$$\mathbb V\left[\frac{c'\text{(I.1.2.1)}c}{c'\Sigma_hc}\right] = \frac1{(c'\Sigma_hc)^2}\frac{(n-2)^2}{n^6}\,\mathbb E\left[\sum_{\substack{i,j\\ \text{distinct}}}\ \sum_{\substack{i',j'\\ \text{distinct}}}c'\big(u_{ij}-\bar u_i\big)\bar u_i'c\,c'\big(u_{i'j'}-\bar u_{i'}\big)\bar u_{i'}'c\right] = \frac1{(c'\Sigma_hc)^2}\frac{(n-2)^2}{n^6}\,\mathbb E\left[\sum_{\substack{i,j,i'\\ \text{distinct}}}c'u_{ij}\bar u_i'c\,c'u_{i'j}\bar u_{i'}'c\right] + \text{higher-order terms}.$$
The expectation is further given by (note that $i$, $j$ and $i'$ are assumed to be distinct indices)
$$\mathbb E\big[c'u_{ij}\bar u_i'c\,c'u_{i'j}\bar u_{i'}'c\big] = \mathbb E\left[\int\!\!\int_{\frac{\mathcal X-x}h}W_j^2\big[c'\Upsilon_hR_jR(u)'c\big]\big[c'\Upsilon_hR_jR(v)'c\big]K(u)K(v)\big[F(x_j\wedge(x+hu))-F(x_j)F(x+hu)\big]\big[F(x_j\wedge(x+hv))-F(x_j)F(x+hv)\big]f(x+hu)f(x+hv)\,\mathrm du\,\mathrm dv\right]$$
$$= \frac1h\int\!\!\int\!\!\int_{\frac{\mathcal X-x}h}\big[c'R(w)R(u)'c\big]\big[c'R(w)R(v)'c\big]K(u)K(v)K(w)^2\big[F(x+h(w\wedge u))-F(x+hw)F(x+hu)\big]\big[F(x+h(w\wedge v))-F(x+hw)F(x+hv)\big]f(x+hw)f(x+hu)f(x+hv)\,\mathrm dw\,\mathrm du\,\mathrm dv$$
$$= \frac1hF(x)^2(1-F(x))^2\int\!\!\int\!\!\int_{\frac{\mathcal X-x}h}\big[c'R(w)R(u)'c\big]\big[c'R(w)R(v)'c\big]K(u)K(v)K(w)^2f(x+hw)f(x+hu)f(x+hv)\,\mathrm dw\,\mathrm du\,\mathrm dv + \text{higher-order terms}.$$
If $c'\Sigma_hc = O(1)$, then the above will have order $h$, which means
$$\mathbb V\left[\frac{c'\text{(I.1.2.1)}c}{c'\Sigma_hc}\right] = O\left(\frac hn\right).$$
If $c'\Sigma_hc = O(h)$, however, $\mathbb E[c'u_{ij}\bar u_i'c\,c'u_{i'j}\bar u_{i'}'c]$ will be $O(1)$, which will imply that
$$\mathbb V\left[\frac{c'\text{(I.1.2.1)}c}{c'\Sigma_hc}\right] = O\left(\frac1{nh^2}\right).$$
As a result, we have
$$\frac{c'(\hat\Sigma_h-\Sigma_h)c}{c'\Sigma_hc} = O_P\left(\frac1{\sqrt{nh^2}}\right).$$
Now consider
$$\frac{c'\hat\Gamma_h^{-1}\hat\Sigma_h\hat\Gamma_h^{-1}c}{c'\Gamma_h^{-1}\Sigma_h\Gamma_h^{-1}c} - 1 = \frac{c'\hat\Gamma_h^{-1}(\hat\Sigma_h-\Sigma_h)\hat\Gamma_h^{-1}c}{c'\Gamma_h^{-1}\Sigma_h\Gamma_h^{-1}c} + 2\frac{c'(\hat\Gamma_h^{-1}-\Gamma_h^{-1})\Sigma_h\Gamma_h^{-1}c}{c'\Gamma_h^{-1}\Sigma_h\Gamma_h^{-1}c} + \frac{c'(\hat\Gamma_h^{-1}-\Gamma_h^{-1})\Sigma_h(\hat\Gamma_h^{-1}-\Gamma_h^{-1})c}{c'\Gamma_h^{-1}\Sigma_h\Gamma_h^{-1}c}.$$
From the analysis of $\hat\Sigma_h$, we have
$$\frac{c'\hat\Gamma_h^{-1}(\hat\Sigma_h-\Sigma_h)\hat\Gamma_h^{-1}c}{c'\Gamma_h^{-1}\Sigma_h\Gamma_h^{-1}c} = O_P\left(\frac1{\sqrt{nh^2}}\right).$$
For the second term, we have
$$\left|\frac{c'(\hat\Gamma_h^{-1}-\Gamma_h^{-1})\Sigma_h\hat\Gamma_h^{-1}c}{c'\Gamma_h^{-1}\Sigma_h\Gamma_h^{-1}c}\right| \le \frac{\big|c'(\hat\Gamma_h^{-1}-\Gamma_h^{-1})\Sigma_h^{1/2}\big|\cdot\big|c'\Gamma_h^{-1}\Sigma_h^{1/2}\big|}{\big|c'\Gamma_h^{-1}\Sigma_h^{1/2}\big|^2} = \frac{\big|c'(\hat\Gamma_h^{-1}-\Gamma_h^{-1})\Sigma_h^{1/2}\big|}{\big|c'\Gamma_h^{-1}\Sigma_h^{1/2}\big|} = O_P\left(\sqrt{\frac1{nh}}\right).$$
The third term has order
$$\frac{c'(\hat\Gamma_h^{-1}-\Gamma_h^{-1})\Sigma_h(\hat\Gamma_h^{-1}-\Gamma_h^{-1})c}{c'\Gamma_h^{-1}\Sigma_h\Gamma_h^{-1}c} = O_P\left(\frac1{nh}\right).$$
This follows directly from Theorem 1.
To understand (13), note that
$$\hat\theta_F^\perp = \left(\int_{\mathcal X}\Lambda_hR(u-x)R(u-x)'\Lambda_h'\frac1hK\Big(\frac{u-x}h\Big)\mathrm dF(u)\right)^{-1}\left(\int_{\mathcal X}\Lambda_hR(u-x)\hat F(u)\frac1hK\Big(\frac{u-x}h\Big)\mathrm dF(u)\right) = \Lambda_h^{-1}\left(\int_{\mathcal X}R(u-x)R(u-x)'\frac1hK\Big(\frac{u-x}h\Big)\mathrm dF(u)\right)^{-1}\left(\int_{\mathcal X}R(u-x)\hat F(u)\frac1hK\Big(\frac{u-x}h\Big)\mathrm dF(u)\right),$$
which means $\hat\theta_F^\perp = \Lambda_h^{-1}\hat\theta_F$. Then we have (up to an approximation bias term)
$$\hat\theta_F^\perp - \Lambda_h^{-1}\theta = \Lambda_h^{-1}(\hat\theta_F-\theta) = \Lambda_h^{-1}\Upsilon_h\left(\int_{\frac{\mathcal X-x}h}R(u)R(u)'K(u)f(x+hu)\mathrm du\right)^{-1}\left(\int_{\frac{\mathcal X-x}h}R(u)\big(\hat F(x+hu)-F(x+hu)\big)K(u)f(x+hu)\mathrm du\right)$$
$$= \Lambda_h^{-1}\Upsilon_h\Lambda_h\left(\int_{\frac{\mathcal X-x}h}R^\perp(u)R^\perp(u)'K(u)f(x+hu)\mathrm du\right)^{-1}\left(\int_{\frac{\mathcal X-x}h}R^\perp(u)\big(\hat F(x+hu)-F(x+hu)\big)K(u)f(x+hu)\mathrm du\right).$$
As a result, we have the following. We first discuss the transformed parameter vector $\Lambda_h^{-1}\theta$. By construction, the matrix $\Lambda_h$ takes the form
$$\Lambda_h = \begin{pmatrix} c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,p+2}\\ 0 & 1 & 0 & \cdots & c_{2,p+2}\\ 0 & 0 & 1 & \cdots & c_{3,p+2}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & c_{p+2,p+2}\end{pmatrix},$$
where the $c_{i,j}$ are constants (possibly depending on $h$). Therefore, the above matrix differs from the identity matrix only in its first row and last column. This observation also holds for $\Lambda_h^{-1}$. Since the last component of $\theta$ is zero (because the extra regressor $Q_h(\cdot)$ is redundant), we conclude that, except for the first element, $\Lambda_h^{-1}\theta$ and $\theta$ are identical. More specifically, let $I_-$ be the identity matrix excluding the first row:
$$I_- = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0\\ 0 & 0 & 1 & \cdots & 0\\ \vdots & & & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & 1\end{pmatrix},$$
which is used to extract all elements of a vector except for the first one. Then, by Theorem 1,
$$\sqrt n\,\Big(I_-(\Lambda_h^{-1}\Upsilon_h\Lambda_h)(\Gamma_h^\perp)^{-1}\Sigma_h^\perp(\Gamma_h^\perp)^{-1}(\Lambda_h^{-1}\Upsilon_h\Lambda_h)'I_-'\Big)^{-1/2}\begin{pmatrix}\hat\theta_{P,F}^\perp-\theta_P\\ \hat\theta_{Q,F}^\perp\end{pmatrix} \rightsquigarrow \mathcal N(0,I),$$
where $\hat\theta_{P,F}^\perp$ contains the second to the $(p+1)$-th elements of $\hat\theta_F^\perp$, and $\hat\theta_{Q,F}^\perp$ is the last element.

Now we discuss the covariance matrix in the above display. Due to orthogonalization, $\Gamma_h^\perp$ is block diagonal. To be precise,
$$\Gamma_h^\perp = f(x)\begin{pmatrix}\Gamma_{1,h}^\perp & 0 & 0\\ 0 & \Gamma_{P,h}^\perp & 0\\ 0 & 0 & \Gamma_{Q,h}^\perp\end{pmatrix},\qquad \Gamma_{1,h}^\perp = \int_{\frac{\mathcal X-x}h}K(u)\mathrm du,\qquad \Gamma_{P,h}^\perp = \int_{\frac{\mathcal X-x}h}P^\perp(u)P^\perp(u)'K(u)\mathrm du,\qquad \Gamma_{Q,h}^\perp = \int_{\frac{\mathcal X-x}h}Q^\perp(u)^2K(u)\mathrm du.$$
Finally, using the structure of $\Lambda_h$ and $\Upsilon_h$, we have $I_-(\Lambda_h^{-1}\Upsilon_h\Lambda_h)(\Gamma_h^\perp)^{-1} = I_-\Upsilon_h(\Gamma_h^\perp)^{-1}$. The form of $\Sigma_h^\perp$ is quite involved, but with some algebra, and using the fact that the basis $R(\cdot)$ (or $R^\perp(\cdot)$) includes a constant and polynomials, one can show the following:
$$(\Lambda_h^{-1}\Upsilon_h\Lambda_h)(\Gamma_h^\perp)^{-1}\Sigma_h^\perp(\Gamma_h^\perp)^{-1}(\Lambda_h^{-1}\Upsilon_h\Lambda_h)' = h\,f(x)\,\Upsilon_{-,h}(\Gamma_{-,h}^\perp)^{-1}\Sigma_{-,h}^\perp(\Gamma_{-,h}^\perp)^{-1}\Upsilon_{-,h},$$
where $\Upsilon_{-,h}$, $\Gamma_{-,h}^\perp$ and $\Sigma_{-,h}^\perp$ are obtained by excluding the first row and the first column of $\Upsilon_h$, $\Gamma_h^\perp$ and $\Sigma_h^\perp$, respectively:
$$\Upsilon_{-,h} = \begin{pmatrix} h^{-1} & & & \\ & h^{-2} & & \\ & & \ddots & \\ & & & \upsilon_h\end{pmatrix},\qquad \Gamma_{-,h}^\perp = f(x)\begin{pmatrix}\Gamma_{P,h}^\perp & 0\\ 0 & \Gamma_{Q,h}^\perp\end{pmatrix},\qquad \Sigma_{-,h}^\perp = f(x)^2\begin{pmatrix}\Sigma_{PP,h}^\perp & \Sigma_{PQ,h}^\perp\\ \Sigma_{QP,h}^\perp & \Sigma_{QQ,h}^\perp\end{pmatrix},$$
and
$$\Sigma_{PP,h}^\perp = \int\!\!\int_{\frac{\mathcal X-x}h}K(u)K(v)P^\perp(u)P^\perp(v)'(u\wedge v)\,\mathrm du\,\mathrm dv,\qquad \Sigma_{QQ,h}^\perp = \int\!\!\int_{\frac{\mathcal X-x}h}K(u)K(v)Q^\perp(u)Q^\perp(v)(u\wedge v)\,\mathrm du\,\mathrm dv,$$
$$\Sigma_{PQ,h}^\perp = (\Sigma_{QP,h}^\perp)' = \int\!\!\int_{\frac{\mathcal X-x}h}K(u)K(v)P^\perp(u)Q^\perp(v)(u\wedge v)\,\mathrm du\,\mathrm dv.$$
With the above discussion, we have the desired result.
Part (i)
To start,
$$\int_{\frac{\mathcal X-x}h\cap[-1,1]}H(g)(u)H(g)(u)\,\mathrm du = \int_{\frac{\mathcal X-x}h\cap[-1,1]}\left(\int_{\frac{\mathcal X-x}h}\mathbb 1(v_1\ge u)K(v_1)g(v_1)\mathrm dv_1\right)\left(\int_{\frac{\mathcal X-x}h}\mathbb 1(v_2\ge u)K(v_2)g(v_2)\mathrm dv_2\right)\mathrm du$$
$$= \int\!\!\int_{\frac{\mathcal X-x}h}K(v_1)K(v_2)g(v_1)g(v_2)\left(\int_{\frac{\mathcal X-x}h\cap[-1,1]}\mathbb 1(v_1\ge u)\mathbb 1(v_2\ge u)\mathrm du\right)\mathrm dv_1\,\mathrm dv_2$$
$$= \int\!\!\int_{\frac{\mathcal X-x}h}K(v_1)K(v_2)g(v_1)g(v_2)\left[(v_1\wedge v_2)\wedge\Big(\frac{\bar x-x}h\wedge1\Big) - \Big(\frac{\underline x-x}h\vee(-1)\Big)\right]\mathrm dv_1\,\mathrm dv_2 = \int\!\!\int_{\frac{\mathcal X-x}h}K(v_1)K(v_2)g(v_1)g(v_2)(v_1\wedge v_2)\,\mathrm dv_1\,\mathrm dv_2,$$
where for the last equality we used the facts that $v_1\le\frac{\bar x-x}h\wedge1$ and $v_2\le\frac{\bar x-x}h\wedge1$.

Part (ii)
For this part,
$$\int_{\frac{\mathcal X-x}h\cap[-1,1]}H(g)(u)\dot g(u)\,\mathrm du = \int_{\frac{\mathcal X-x}h\cap[-1,1]}\left(\int_{\frac{\mathcal X-x}h}\mathbb 1(v\ge u)K(v)g(v)\mathrm dv\right)\dot g(u)\,\mathrm du = \int_{\frac{\mathcal X-x}h}K(v)g(v)\left(\int_{\frac{\mathcal X-x}h\cap[-1,1]}\mathbb 1(v\ge u)\dot g(u)\mathrm du\right)\mathrm dv$$
$$= \int_{\frac{\mathcal X-x}h}K(v)g(v)\left[g\Big(v\wedge\frac{\bar x-x}h\wedge1\Big) - g\Big(\frac{\underline x-x}h\vee(-1)\Big)\right]\mathrm dv = \int_{\frac{\mathcal X-x}h}K(v)g(v)g(v)\,\mathrm dv.$$
To find a bound for the maximization problem, we note that for any $c\in\mathbb R^{p-1}$, one has
$$\int_{\frac{\mathcal X-x}h\cap[-1,1]}H(p_\ell)(u)H(q)(u)\,\mathrm du = \int_{\frac{\mathcal X-x}h\cap[-1,1]}\big[H(p_\ell)(u)+c'\dot P(u)\big]H(q)(u)\,\mathrm du,$$
due to the constraint (15). Hence, subject to this constraint, an upper bound for the objective function is (due to the Cauchy-Schwarz inequality)
$$\inf_c\int_{\frac{\mathcal X-x}h\cap[-1,1]}\big[H(p_\ell)(u)+c'\dot P(u)\big]^2\mathrm du = \int_{\frac{\mathcal X-x}h\cap[-1,1]}H(p_\ell)(u)^2\mathrm du + \inf_c\left[2c'\left(\int_{\frac{\mathcal X-x}h}K(u)P(u)p_\ell(u)\mathrm du\right) + c'\left(\int_{\frac{\mathcal X-x}h\cap[-1,1]}\dot P(u)\dot P(u)'\mathrm du\right)c\right],$$
which is minimized by setting
$$c = -\left(\int_{\frac{\mathcal X-x}h\cap[-1,1]}\dot P(u)\dot P(u)'\mathrm du\right)^{-1}\left(\int_{\frac{\mathcal X-x}h}K(u)P(u)p_\ell(u)\mathrm du\right).$$
As a result, an upper bound for (14) is
$$\int_{\frac{\mathcal X-x}h\cap[-1,1]}H(p_\ell)(u)^2\mathrm du - \left(\int_{\frac{\mathcal X-x}h}K(u)P(u)p_\ell(u)\mathrm du\right)'\left(\int_{\frac{\mathcal X-x}h\cap[-1,1]}\dot P(u)\dot P(u)'\mathrm du\right)^{-1}\left(\int_{\frac{\mathcal X-x}h}K(u)P(u)p_\ell(u)\mathrm du\right).$$
We may further simplify the above. First,
$$\int_{\frac{\mathcal X-x}h\cap[-1,1]}H(p_\ell)(u)^2\mathrm du = e_\ell'(\Gamma_{P,h}^\perp)^{-1}\Sigma_{PP,h}^\perp(\Gamma_{P,h}^\perp)^{-1}e_\ell.$$
Second, note that
$$\int_{\frac{\mathcal X-x}h}K(u)P(u)p_\ell(u)\mathrm du = \left(\int_{\frac{\mathcal X-x}h}K(u)P(u)P^\perp(u)'\mathrm du\right)(\Gamma_{P,h}^\perp)^{-1}e_\ell = \left(\int_{\frac{\mathcal X-x}h}K(u)P^\perp(u)P^\perp(u)'\mathrm du\right)(\Gamma_{P,h}^\perp)^{-1}e_\ell = e_\ell.$$
As a result, an upper bound for (14) is
$$e_\ell'(\Gamma_{P,h}^\perp)^{-1}\Sigma_{PP,h}^\perp(\Gamma_{P,h}^\perp)^{-1}e_\ell - e_\ell'\left(\int_{\frac{\mathcal X-x}h\cap[-1,1]}\dot P(u)\dot P(u)'\mathrm du\right)^{-1}e_\ell = e_\ell'\left[(\Gamma_{P,h}^\perp)^{-1}\Sigma_{PP,h}^\perp(\Gamma_{P,h}^\perp)^{-1} - \left(\int_{\frac{\mathcal X-x}h\cap[-1,1]}\dot P(u)\dot P(u)'\mathrm du\right)^{-1}\right]e_\ell.$$

Proof of Theorem 9. To show that the two processes $T_G(\cdot)$ and $B_G(\cdot)$ are "close in distribution," we employ the proof strategy of Giné, Koltchinskii, and Sakhanenko (2004). Recall that $F$ denotes the distribution of $x_i$, and define $\mathcal K_{h,x}\circ F^{-1}(v) = \mathcal K_{h,x}(F^{-1}(v))$. Take $v_1<v_2$ in $[0,1]$:
$$\big|\mathcal K_{h,x}\circ F^{-1}(v_1) - \mathcal K_{h,x}\circ F^{-1}(v_2)\big| = \left|\frac{\int_{\frac{\mathcal X-x}h}c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}R(u)\big[\mathbb 1(F^{-1}(v_1)\le x+hu)-\mathbb 1(F^{-1}(v_2)\le x+hu)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}}\right|$$
$$\le \frac{\int_{\frac{\mathcal X-x}h}\big|c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}R(u)\big|\big[\mathbb 1(F^{-1}(v_1)\le x+hu)-\mathbb 1(F^{-1}(v_2)\le x+hu)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}}.$$
Therefore, the function $\mathcal K_{h,x}\circ F^{-1}(\cdot)$ has total variation bounded by
$$\frac{\int_{\frac{\mathcal X-x}h}\big|c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}R(u)\big|\big[\mathbb 1(F^{-1}(0)\le x+hu)-\mathbb 1(F^{-1}(1)\le x+hu)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}} = \frac{\int_{-1}^1\big|c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}R(u)\big|K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}} \le \frac C{\sqrt h}.$$
It is well known that functions of bounded variation can be approximated (pointwise) by convex combinations of indicator functions of half intervals. To be more precise,
$$\Big\{\mathcal K_{h,x}\circ F^{-1}(\cdot): x\in\mathcal I\Big\} \subset \frac C{\sqrt h}\,\mathrm{conv}\Big\{\pm\mathbb 1(\cdot\le t),\ \pm\mathbb 1(\cdot\ge t)\Big\}.$$
Following (2.3) and (2.4) of Giné, Koltchinskii, and Sakhanenko (2004), we have
$$\mathbb P\left[\sup_{x\in\mathcal I}\big|\tilde T_G(x)-B_G(x)\big| > \frac{C_1(u+C_2\log n)}{\sqrt{nh}}\right] \le C_3e^{-C_4u},$$
where $C_1$, $C_2$, $C_3$ and $C_4$ are some universal constants.

Proof of Lemma 10. Take $|x-y|\le\varepsilon$ to be some small number; then
$$\mathcal K_{h,x}(t)-\mathcal K_{h,y}(t) = \underbrace{\frac{\big(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h\big)\Gamma_{h,x}^{-1}\int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(t\le x+hu)-F(x+hu)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}}_{\text{(I)}}$$
$$+ \underbrace{\frac{c_{h,y}'\Upsilon_h\big(\Gamma_{h,x}^{-1}-\Gamma_{h,y}^{-1}\big)\int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(t\le x+hu)-F(x+hu)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}}_{\text{(II)}}$$
$$+ \underbrace{\left(\frac1{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}}-\frac1{\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}\right)c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}\int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(t\le x+hu)-F(x+hu)\big]K(u)g(x+hu)\mathrm du}_{\text{(III)}}$$
$$+ \underbrace{\frac{c_{h,y}'\Upsilon_h\Gamma_{h,y}^{-1}}{\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}\left(\frac1h\int_{\mathcal X}\Big[R\Big(\frac{u-x}h\Big)K\Big(\frac{u-x}h\Big)-R\Big(\frac{u-y}h\Big)K\Big(\frac{u-y}h\Big)\Big]\big[\mathbb 1(t\le u)-F(u)\big]g(u)\,\mathrm du\right)}_{\text{(IV)}}.$$
For term (I), its variance (replacing the placeholder $t$ by $x_i$) is
$$\mathbb V[\text{(I)}] = \frac{\big(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h\big)\Omega_{h,x}\big(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h\big)'}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}} = O\left(\frac1h\frac{r_1(\varepsilon,h)^2}{r_2(h)^2}\right).$$
Term (II) has variance
$$\mathbb V[\text{(II)}] = \frac{c_{h,y}'\Upsilon_h\big(\Gamma_{h,x}^{-1}-\Gamma_{h,y}^{-1}\big)\Sigma_{h,x}\big(\Gamma_{h,x}^{-1}-\Gamma_{h,y}^{-1}\big)'\Upsilon_hc_{h,y}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}} = O\left(\frac1h\Big(\frac\varepsilon h\wedge1\Big)^2\right),$$
where the order $\frac\varepsilon h\wedge1$ comes from $\Gamma_{h,x}^{-1}-\Gamma_{h,y}^{-1}$. Next, for term (III), we have
$$\mathbb V[\text{(III)}] = \left(1-\sqrt{\frac{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}\right)^2 \lesssim \left(\frac{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}-c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}\right)^2,$$
and we can expand
$$\frac{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}-c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}} = \frac{c_{h,x}'\Upsilon_h(\Omega_{h,x}-\Omega_{h,y})\Upsilon_hc_{h,x}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}} + \frac{(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h)\Omega_{h,y}\Upsilon_hc_{h,x} + (c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h)\Omega_{h,y}\Upsilon_hc_{h,y}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}.$$
The first term has bound
$$\frac{c_{h,x}'\Upsilon_h(\Omega_{h,x}-\Omega_{h,y})\Upsilon_hc_{h,x}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}} = O\left(\frac\varepsilon h\right).$$
The third term has bound
$$\frac{(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h)\Omega_{h,y}\Upsilon_hc_{h,y}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}} \lesssim \frac{\big|(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h)\Omega_{h,y}^{1/2}\big|}{\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}} = O\left(\frac1{\sqrt h}\frac{r_1(\varepsilon,h)}{r_2(h)}\right).$$
Finally, the second term can be bounded as
$$\frac{(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h)\Omega_{h,y}\Upsilon_hc_{h,x}}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}} = \frac{(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h)\Omega_{h,y}\Upsilon_hc_{h,y} + (c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h)\Omega_{h,y}(c_{h,x}'\Upsilon_h-c_{h,y}'\Upsilon_h)'}{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}} = O\left(\frac1{\sqrt h}\frac{r_1(\varepsilon,h)}{r_2(h)} + \frac1h\frac{r_1(\varepsilon,h)^2}{r_2(h)^2}\right).$$
Overall, we have that
$$\mathbb V[\text{(III)}] = O\left(\frac{\varepsilon^2}{h^2} + \frac1h\frac{r_1(\varepsilon,h)^2}{r_2(h)^2} + \frac1{h^2}\frac{r_1(\varepsilon,h)^4}{r_2(h)^4}\right).$$
Given our assumptions on the basis function and on the kernel function, it is obvious that term (IV) has variance
$$\mathbb V[\text{(IV)}] = O\left(\frac1h\Big(\frac\varepsilon h\wedge1\Big)^2\right).$$
The bound on $\mathbb E[\sup_{x\in\mathcal I}|B_G(x)|]$ can be found by standard entropy calculations, and the bound on $\mathbb E[\sup_{x\in\mathcal I}|T_G(x)|]$ is obtained from the facts that
$$\mathbb E\left[\sup_{x\in\mathcal I}|T_G(x)|\right] \le \mathbb E\left[\sup_{x\in\mathcal I}|B_G(x)|\right] + \mathbb E\left[\sup_{x\in\mathcal I}|\tilde T_G(x)-B_G(x)|\right]$$
and
$$\mathbb E\left[\sup_{x\in\mathcal I}|\tilde T_G(x)-B_G(x)|\right] = \int_0^\infty\mathbb P\left[\sup_{x\in\mathcal I}|\tilde T_G(x)-B_G(x)|>u\right]\mathrm du = O\left(\frac{\log n}{\sqrt{nh}}\right) = o\big(\sqrt{\log n}\big),$$
which follows from Theorem 9 and our assumption that $\log n/(nh)\to0$.

Proof of Lemma 11. We adopt the following decomposition (the integration is always over $\frac{\mathcal X-y}h\times\frac{\mathcal X-x}h$, unless otherwise specified):
$$\text{(I)} = \frac1n\sum_{i=1}^n\int\!\!\int R(u)R(v)'\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]\big[\mathbb 1(x_i\le y+hv)-F(y+hv)\big]K(u)K(v)g(x+hu)g(y+hv)\,\mathrm du\,\mathrm dv,$$
$$\text{(II)} = -\int\!\!\int R(u)R(v)'\big[\hat F(x+hu)-F(x+hu)\big]\big[\hat F(y+hv)-F(y+hv)\big]K(u)K(v)g(x+hu)g(y+hv)\,\mathrm du\,\mathrm dv.$$
By the uniform convergence of the empirical distribution function, we have
$$\sup_{x,y\in\mathcal I}|\text{(II)}| = O_P\left(\frac1n\right).$$
From the definition of $\Sigma_{h,x,y}$, we know that $\mathbb E[\text{(I)}] = \Sigma_{h,x,y}$. As (I) is a sum of bounded terms, we can apply Lemma 24 and easily show that
$$\sup_{x,y\in\mathcal I}\big|\text{(I)}-\Sigma_{h,x,y}\big| = O_P\left(\sqrt{\frac{\log n}n}\right).$$

Proof of Lemma 12. We rewrite (17) as
$$|\text{(17)}| = \left|\sqrt n\sqrt{\frac{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\,\frac{c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}\int_{\frac{\mathcal X-x}h}R(u)\big[F(x+hu)-R(u)'\Upsilon_h^{-1}\theta(x)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}}\right|$$
$$\le \sqrt{\frac nh}\left[\sup_{x\in\mathcal I}\sqrt{\frac{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\right]\cdot C\sup_{x\in\mathcal I}\left|\int_{\frac{\mathcal X-x}h}R(u)\big[F(x+hu)-R(u)'\Upsilon_h^{-1}\theta(x)\big]K(u)g(x+hu)\mathrm du\right| = O_P\left(\sqrt{\frac nh}\,\sup_{x\in\mathcal I}\varrho(h,x)\right),$$
where the final bound holds uniformly for $x\in\mathcal I$.

Proof of Lemma 13. To start, we expand term (18) as
$$\text{(18)} = \frac1{\sqrt n}\sum_{i=1}^n\frac{c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}\int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}} + \frac1{\sqrt n}\sum_{i=1}^n\left[\sqrt{\frac{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}-1\right]\frac{c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}\int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)g(x+hu)\mathrm du}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}}$$
$$= T_G(x) + \underbrace{\left[\sqrt{\frac{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}-1\right]T_G(x)}_{\text{(I)}}.$$
Term (I) can be easily bounded:
$$\sup_{x\in\mathcal I}|\text{(I)}| = O_P\left(\sqrt{\frac{\log n}{nh}}\right)\cdot O_P\big(\sqrt{\log n}\big) = O_P\left(\frac{\log n}{\sqrt{nh}}\right).$$

Proof of Theorem 14. The claim follows from Theorem 9 and the previous lemmas.
Proof of Theorem 15. Let $\mathcal I_\varepsilon$ be an $\varepsilon$-covering (with respect to the Euclidean metric) of $\mathcal I$, and assume $\varepsilon\le h$. Then the process $\tilde B_G(\cdot)$ can be decomposed as
$$\tilde B_G(x) = \tilde B_G(\Pi_{\mathcal I_\varepsilon}(x)) + \big[\tilde B_G(x)-\tilde B_G(\Pi_{\mathcal I_\varepsilon}(x))\big],$$
where $\Pi_{\mathcal I_\varepsilon}:\mathcal I\to\mathcal I_\varepsilon$ is a mapping satisfying $\Pi_{\mathcal I_\varepsilon}(x) = \operatorname{argmin}_{y\in\mathcal I_\varepsilon}|y-x|$. We first study the properties of $\tilde B_G(x)-\tilde B_G(\Pi_{\mathcal I_\varepsilon}(x))$. With standard entropy calculations, one has
$$\mathbb E\left[\sup_{x\in\mathcal I}\big|\tilde B_G(x)-\tilde B_G(\Pi_{\mathcal I_\varepsilon}(x))\big|\right] \le \mathbb E\left[\sup_{x,y\in\mathcal I,\,|x-y|\le\varepsilon}\big|\tilde B_G(x)-\tilde B_G(y)\big|\right] \le \mathbb E\left[\sup_{x,y\in\mathcal I,\,\sigma_G(x,y)\le\delta(\varepsilon)}\big|\tilde B_G(x)-\tilde B_G(y)\big|\right] \lesssim \int_0^{\delta(\varepsilon)}\sqrt{\log N(\lambda,\mathcal I,\sigma_G)}\,\mathrm d\lambda,$$
where
$$\delta(\varepsilon) = C\left(\frac1{\sqrt h}\frac\varepsilon h + \frac1{\sqrt h}\frac{r_1(\varepsilon,h)}{r_2(h)} + \frac1h\frac{r_1(\varepsilon,h)^2}{r_2(h)^2}\right)$$
for some $C>0$ that does not depend on $\varepsilon$ and $h$, and $N(\lambda,\mathcal I,\sigma_G)$ is the covering number of $\mathcal I$ measured by the pseudo-metric $\sigma_G(\cdot,\cdot)$, which satisfies $N(\lambda,\mathcal I,\sigma_G)\lesssim\delta^{-1}(\lambda)$. Therefore, we have
$$\mathbb E\left[\sup_{x\in\mathcal I}\big|\tilde B_G(x)-\tilde B_G(\Pi_{\mathcal I_\varepsilon}(x))\big|\right] \lesssim \left(\frac1{\sqrt h}\frac\varepsilon h + \frac1{\sqrt h}\frac{r_1(\varepsilon,h)}{r_2(h)} + \frac1h\frac{r_1(\varepsilon,h)^2}{r_2(h)^2}\right)\sqrt{\log n}.$$
Now consider the discretized version of $\tilde B_G(x)$. To be precise, we denote $\mathcal I_\varepsilon$ by $\{x_1,x_2,\cdots,x_{M(\varepsilon)}\}$, where $M(\varepsilon)$ depends on $\varepsilon$ and $M(\varepsilon) = O(\varepsilon^{-1})$. Also let $\Xi_{\mathcal I_\varepsilon}$ be the covariance matrix of $(\tilde B_G(x_1),\tilde B_G(x_2),\cdots,\tilde B_G(x_{M(\varepsilon)}))'$; then the following representation holds:
$$(\tilde B_G(x_1),\tilde B_G(x_2),\cdots,\tilde B_G(x_{M(\varepsilon)}))' = \Xi_{\mathcal I_\varepsilon}^{1/2}\mathcal N_{M(\varepsilon)},$$
where $\mathcal N_{M(\varepsilon)}$ is a standard Gaussian random vector (i.e., with zero mean, unit variance and zero correlation) of length $M(\varepsilon)$. In addition, $\mathcal N_{M(\varepsilon)}$ is independent of the data by construction. Similarly, we can define the Gaussian vector
$$(\hat B_G(x_1),\hat B_G(x_2),\cdots,\hat B_G(x_{M(\varepsilon)}))' = \hat\Xi_{\mathcal I_\varepsilon}^{1/2}\mathcal N_{M(\varepsilon)},$$
where $\hat\Xi_{\mathcal I_\varepsilon}$ is the estimated covariance matrix. From Lemma 11, we have
$$\big\|\hat\Xi_{\mathcal I_\varepsilon}-\Xi_{\mathcal I_\varepsilon}\big\|_\infty = O_P\left(\sqrt{\frac{\log n}{nh}}\right),$$
where $\|\cdot\|_\infty$ denotes the entrywise supremum norm (i.e., the maximum discrepancy between the entries of $\Xi_{\mathcal I_\varepsilon}$ and $\hat\Xi_{\mathcal I_\varepsilon}$). Also, we note that in any row/column of $\hat\Xi_{\mathcal I_\varepsilon}$ or $\Xi_{\mathcal I_\varepsilon}$, at most $O(hM(\varepsilon)) = O(h/\varepsilon)$ entries are nonzero. Therefore we have
$$\big\|\hat\Xi_{\mathcal I_\varepsilon}-\Xi_{\mathcal I_\varepsilon}\big\|_{\mathrm{op}} = O_P\left(\frac h\varepsilon\sqrt{\frac{\log n}{nh}}\right) = O_P\left(\frac{\sqrt h}\varepsilon\sqrt{\frac{\log n}n}\right),$$
with $\|\cdot\|_{\mathrm{op}}$ being the operator norm. We also have
$$\big\|\hat\Xi_{\mathcal I_\varepsilon}^{1/2}-\Xi_{\mathcal I_\varepsilon}^{1/2}\big\|_{\mathrm{op}}^2 \le \big\|\hat\Xi_{\mathcal I_\varepsilon}-\Xi_{\mathcal I_\varepsilon}\big\|_{\mathrm{op}}.$$
Now we are ready to bound the difference between the two discretized processes:
$$\mathbb E^\star\left[\sup_{x\in\mathcal I_\varepsilon}\big|\tilde B_G(x)-\hat B_G(x)\big|\right] = O\left(\sqrt{\log\frac1\varepsilon}\right)\sqrt{\sup_{x\in\mathcal I_\varepsilon}\mathbb V^\star\big[\tilde B_G(x)-\hat B_G(x)\big]} = O_P\left(\sqrt{\log n}\sqrt{\frac{\sqrt h}\varepsilon\sqrt{\frac{\log n}n}}\right).$$

Proof of Lemma 16. We apply Lemma 24. For simplicity, assume $R(\cdot)$ is scalar, and let
$$u_{i,h}(x) = R\Big(\frac{x_i-x}h\Big)\frac1hK\Big(\frac{x_i-x}h\Big) - \Gamma_{h,x}.$$
Then it is easy to see that
$$\sup_{x\in\mathcal I}\max_{1\le i\le n}\mathbb V[u_{i,h}(x)] = O(h^{-1}),\qquad \sup_{x\in\mathcal I}\max_{1\le i\le n}|u_{i,h}(x)| = O(h^{-1}).$$
Let $|x-y|\le\varepsilon\le h$; we also have
$$|u_{i,h}(x)-u_{i,h}(y)| \le \left|R\Big(\frac{x_i-x}h\Big)\frac1hK\Big(\frac{x_i-x}h\Big)-R\Big(\frac{x_i-y}h\Big)\frac1hK\Big(\frac{x_i-y}h\Big)\right| + |\Gamma_{h,x}-\Gamma_{h,y}|$$
$$\le \left|R\Big(\frac{x_i-x}h\Big)-R\Big(\frac{x_i-y}h\Big)\right|\frac1hK\Big(\frac{x_i-x}h\Big) + \left|R\Big(\frac{x_i-y}h\Big)\right|\frac1h\left|K\Big(\frac{x_i-x}h\Big)-K\Big(\frac{x_i-y}h\Big)\right| + |\Gamma_{h,x}-\Gamma_{h,y}|$$
$$\le M\left[\frac\varepsilon h\frac1hK\Big(\frac{x_i-x}h\Big) + \frac\varepsilon h\frac1hK^\dagger\Big(\frac{x_i-x}h\Big) + \frac1hK^{\ddagger,\varepsilon/h}\Big(\frac{x_i-x}h\Big) + \frac\varepsilon h\right],$$
where $M$ is some constant that does not depend on $n$, $h$ or $\varepsilon$. Then it is easy to see that
$$\sup_{x\in\mathcal I}\max_{1\le i\le n}\mathbb V[u_{i,h,\varepsilon}(x)] = O\left(\frac\varepsilon{h^2}\right),\qquad \sup_{x\in\mathcal I}\max_{1\le i\le n}\big|u_{i,h,\varepsilon}(x)-\mathbb E[u_{i,h,\varepsilon}(x)]\big| = O(h^{-1}),\qquad \sup_{x\in\mathcal I}\max_{1\le i\le n}\mathbb E\big[|u_{i,h,\varepsilon}(x)|\big] = O\left(\frac\varepsilon h\right).$$
Now take $\varepsilon = \sqrt{h\log n/n}$; then $\log N(\varepsilon,\mathcal I,|\cdot|) = O(\log n)$. Lemma 24 implies that
$$\sup_{x\in\mathcal I}\left|\frac1n\sum_{i=1}^nR\Big(\frac{x_i-x}h\Big)\frac1hK\Big(\frac{x_i-x}h\Big)-\Gamma_{h,x}\right| = O_P\left(\sqrt{\frac{\log n}{nh}}\right).$$

Proof of Lemma 17. Let $R_i(x) = R(x_i-x)$ and $W_i(x) = K((x_i-x)/h)/h$; then we split $\hat\Sigma_{h,x,y}$ into two terms:
$$\text{(I)} = \frac1{n^3}\sum_{i,j,k}\Upsilon_hR_j(x)R_k(y)'\Upsilon_hW_j(x)W_k(y)\big(\mathbb 1(x_i\le x_j)-F(x_j)\big)\big(\mathbb 1(x_i\le x_k)-F(x_k)\big),$$
$$\text{(II)} = -\frac1{n^2}\sum_{j,k}\Upsilon_hR_j(x)R_k(y)'\Upsilon_hW_j(x)W_k(y)\big(\hat F(x_j)-F(x_j)\big)\big(\hat F(x_k)-F(x_k)\big).$$
(II) satisfies
$$\sup_{x,y\in\mathcal I}|\text{(II)}| \le \left(\sup_x|\hat F(x)-F(x)|^2\right)\left(\sup_{x\in\mathcal I}\frac1n\sum_j\big|\Upsilon_hR_j(x)W_j(x)\big|\right)^2.$$
It is obvious that $\sup_x|\hat F(x)-F(x)|^2 = O_P(1/n)$. As for the second part, one can employ the same technique used to prove Lemma 16 and show that
$$\sup_{x\in\mathcal I}\frac1n\sum_j\big|\Upsilon_hR_j(x)W_j(x)\big| = O_P(1),$$
implying that $\sup_{x,y\in\mathcal I}|\text{(II)}| = O_P(1/n)$. For (I), we first define
$$u_{ij}(x) = \Upsilon_hR_j(x)W_j(x)\big(\mathbb 1(x_i\le x_j)-F(x_j)\big),\qquad \bar u_i(x) = \mathbb E[u_{ij}(x)|x_i;\ i\ne j],\qquad \hat u_i(x) = \frac1n\sum_ju_{ij}(x).$$
Then
$$\text{(I)} = \frac1n\sum_i\hat u_i(x)\hat u_i(y)' = \underbrace{\frac1n\sum_i\bar u_i(x)\bar u_i(y)'}_{\text{(I.1)}} + \underbrace{\frac1n\sum_i\big(\hat u_i(x)-\bar u_i(x)\big)\bar u_i(y)'}_{\text{(I.2)}} + \underbrace{\frac1n\sum_i\bar u_i(x)\big(\hat u_i(y)-\bar u_i(y)\big)'}_{\text{(I.3)}} + \underbrace{\frac1n\sum_i\big(\hat u_i(x)-\bar u_i(x)\big)\big(\hat u_i(y)-\bar u_i(y)\big)'}_{\text{(I.4)}}.$$
Term (I.1) has been analyzed in Lemma 11, and satisfies
$$\sup_{x,y\in\mathcal I}\big|\text{(I.1)}-\Sigma_{h,x,y}\big| = O_P\left(\sqrt{\frac{\log n}n}\right).$$
Term (I.2) has the expansion
$$\text{(I.2)} = \frac1{n^2}\sum_{i,j}\big(u_{ij}(x)-\bar u_i(x)\big)\bar u_i(y)' = \underbrace{\frac1{n^2}\sum_{\substack{i,j\\ \text{distinct}}}\big(u_{ij}(x)-\bar u_i(x)\big)\bar u_i(y)'}_{\text{(I.2.1)}} + \underbrace{\frac1{n^2}\sum_i\big(u_{ii}(x)-\bar u_i(x)\big)\bar u_i(y)'}_{\text{(I.2.2)}}.$$
By the same technique of Lemma 16, one can show that
$$\sup_{x,y\in\mathcal I}|\text{(I.2.2)}| = O_P\left(\frac1n\right).$$
We need a further decomposition to make (I.2.1) a degenerate U-statistic:
$$\text{(I.2.1)} = \underbrace{\frac{n-1}{n^2}\sum_j\mathbb E\big[\big(u_{ij}(x)-\bar u_i(x)\big)\bar u_i(y)'\,\big|\,x_j\big]}_{\text{(I.2.1.1)}} + \underbrace{\frac1{n^2}\sum_{\substack{i,j\\ \text{distinct}}}\Big\{\big(u_{ij}(x)-\bar u_i(x)\big)\bar u_i(y)' - \mathbb E\big[\big(u_{ij}(x)-\bar u_i(x)\big)\bar u_i(y)'\,\big|\,x_j\big]\Big\}}_{\text{(I.2.1.2)}}.$$
(I.2.1) has zero mean. By discretizing $\mathcal I$ and applying Bernstein's inequality, one can show that (I.2.1.1) has order $O_P\big(\sqrt{\log n/n}\big)$. For (I.2.1.2), we first discretize $\mathcal I$ and then apply a Bernstein-type inequality (Lemma 25) for degenerate U-statistics, which gives
$$\sup_{x,y\in\mathcal I}|\text{(I.2.1.2)}| = O_P\left(\frac{\log n}{n\sqrt h}\right).$$
Overall, we have
$$\sup_{x,y\in\mathcal I}|\text{(I.2)}| = O_P\left(\frac1n + \sqrt{\frac{\log n}n} + \frac{\log n}{n\sqrt h}\right) = O_P\left(\sqrt{\frac{\log n}n}\right),$$
and the same bound applies to (I.3). For (I.4), one can show that
$$\sup_{x\in\mathcal I}\sup_{\tilde x\in\mathcal X}\left|\frac1n\sum_j\Upsilon_hR_j(x)W_j(x)\big(\mathbb 1(\tilde x\le x_j)-F(x_j)\big) - \mathbb E\Big[\Upsilon_hR_j(x)W_j(x)\big(\mathbb 1(\tilde x\le x_j)-F(x_j)\big)\Big]\right| = O_P\left(\sqrt{\frac{\log n}{nh}}\right),$$
which means
$$\sup_{x,y\in\mathcal I}|\text{(I.4)}| = O_P\left(\frac{\log n}{nh}\right) = O_P\left(\sqrt{\frac{\log n}n}\right),$$
under our assumption that $nh^2/\log n\to\infty$. Therefore,
$$\sup_{x,y\in\mathcal I}\big|\hat\Sigma_{h,x,y}-\Sigma_{h,x,y}\big| = O_P\left(\sqrt{\frac{\log n}n}\right).$$
Now take $c$ to be a generic vector. Then we have
$$\frac{c_{h,x}'\Upsilon_h(\hat\Omega_{h,x,y}-\Omega_{h,x,y})\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}} = \frac{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}(\hat\Sigma_{h,x,y}-\Sigma_{h,x,y})\hat\Gamma_{h,y}^{-1}\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}} + \frac{c_{h,x}'\Upsilon_h(\hat\Gamma_{h,x}^{-1}-\Gamma_{h,x}^{-1})\Sigma_{h,x,y}\hat\Gamma_{h,y}^{-1}\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}} + \frac{c_{h,x}'\Upsilon_h\Gamma_{h,x}^{-1}\Sigma_{h,x,y}(\hat\Gamma_{h,y}^{-1}-\Gamma_{h,y}^{-1})\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}.$$
From the analysis of $\hat\Sigma_{h,x,y}$, we have
$$\sup_{x,y\in\mathcal I}\left|\frac{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}(\hat\Sigma_{h,x,y}-\Sigma_{h,x,y})\hat\Gamma_{h,y}^{-1}\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}\right| = O_P\left(\sqrt{\frac{\log n}{nh}}\right).$$
For the second term, we have
$$\left|\frac{c_{h,x}'\Upsilon_h(\hat\Gamma_{h,x}^{-1}-\Gamma_{h,x}^{-1})\Sigma_{h,x,y}\hat\Gamma_{h,y}^{-1}\Upsilon_hc_{h,y}}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}}\right| \le \frac{\big|c_{h,x}'\Upsilon_h(\hat\Gamma_{h,x}^{-1}-\Gamma_{h,x}^{-1})\Sigma_{h,x}^{1/2}\big|\cdot\big|c_{h,y}'\Upsilon_h\hat\Gamma_{h,y}^{-1}\Sigma_{h,y}^{1/2}\big|}{\sqrt{c_{h,x}'\Upsilon_h\Omega_{h,x}\Upsilon_hc_{h,x}}\sqrt{c_{h,y}'\Upsilon_h\Omega_{h,y}\Upsilon_hc_{h,y}}} = O_P\left(\sqrt{\frac{\log n}{nh}}\right).$$
The same bound holds for the third term.
Proof of Lemma 18. We decompose (19) as
$$\sup_{x\in\mathcal I}|\text{(19)}| \le \frac1{\sqrt n}\underbrace{\sup_{x\in\mathcal I}\left|\frac{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\right|}_{\text{(I)}}\cdot\underbrace{\sup_{x\in\mathcal I}\left|\frac1n\sum_{i=1}^nR\Big(\frac{x_i-x}h\Big)\big[1-F(x_i)\big]\frac1hK\Big(\frac{x_i-x}h\Big)\right|}_{\text{(II)}}.$$
As both $\hat\Gamma_{h,x}$ and $c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}$ are uniformly consistent, term (I) has order
$$\text{(I)} = O_P\left(\sqrt{\frac1h}\right).$$
For (II), we can employ the same technique used to prove Lemma 16 and show that
$$\text{(II)} = O_P\left(1+\sqrt{\frac{\log n}{nh}}\right) = O_P(1),$$
where the leading order in the above represents the mean of $R((x_i-x)/h)[1-F(x_i)]\frac1hK(\frac{x_i-x}h)$.

Proof of Lemma 19. To start, term (20) is bounded by
$$\sup_{x\in\mathcal I}|\text{(20)}| \le \sqrt n\underbrace{\sup_{x\in\mathcal I}\left|\frac{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\right|}_{\text{(I)}}\cdot\underbrace{\sup_{x\in\mathcal I}\left|\frac1n\sum_{i=1}^nR\Big(\frac{x_i-x}h\Big)\big[F(x_i)-\theta(x)'R(x_i-x)\big]\frac1hK\Big(\frac{x_i-x}h\Big)\right|}_{\text{(II)}}.$$
Employing the same argument used to prove Lemma 18, we have $\text{(I)} = O_P(\sqrt{1/h})$. To bound term (II), recall that $K(\cdot)$ is supported on $[-1,1]$, so that
$$\sup_{x\in\mathcal I}\left|\frac1n\sum_{i=1}^nR\Big(\frac{x_i-x}h\Big)\big[F(x_i)-\theta(x)'R(x_i-x)\big]\mathbb 1(|x_i-x|\le h)\frac1hK\Big(\frac{x_i-x}h\Big)\right| \le \underbrace{\left[\sup_{x\in\mathcal I}\frac1n\sum_{i=1}^n\left|R\Big(\frac{x_i-x}h\Big)\frac1hK\Big(\frac{x_i-x}h\Big)\right|\right]}_{\text{(II.1)}}\cdot\underbrace{\left[\sup_{x\in\mathcal I}\sup_{u\in[x-h,x+h]}\Big|F(u)-\theta(x)'R(u-x)\Big|\right]}_{\text{(II.2)}}.$$
Term (II.2) is bounded by $\sup_{x\in\mathcal I}\varrho(h,x)$. Term (II.1) can be bounded by mean and variance calculations, adapting the proof of Lemma 16, which leads to
$$\text{(II.1)} = O_P\left(1+\sqrt{\frac{\log n}{nh}}\right) = O_P(1).$$

Proof of Lemma 20. For ease of exposition, define
$$u_{ij}(x) = \Upsilon_hR(x_j-x)\big[\mathbb 1(x_i\le x_j)-F(x_j)\big]\frac1hK\Big(\frac{x_j-x}h\Big) - \int_{\frac{\mathcal X-x}h}R(u)\big[\mathbb 1(x_i\le x+hu)-F(x+hu)\big]K(u)f(x+hu)\mathrm du;$$
then $n^{-2}\sum_{i,j=1,\,i\ne j}^nu_{ij}(x)$ is a degenerate U-statistic. We rewrite (21) as
$$\sup_{x\in\mathcal I}|\text{(21)}| \le \sqrt n\underbrace{\sup_{x\in\mathcal I}\left|\frac{c_{h,x}'\Upsilon_h\hat\Gamma_{h,x}^{-1}}{\sqrt{c_{h,x}'\Upsilon_h\hat\Omega_{h,x}\Upsilon_hc_{h,x}}}\right|}_{\text{(I)}}\cdot\underbrace{\sup_{x\in\mathcal I}\left|\frac1{n^2}\sum_{i,j=1,\,i\ne j}^nu_{ij}(x)\right|}_{\text{(II)}}.$$
As before, we have $\text{(I)} = O_P(\sqrt{1/h})$.
Now we consider (II). Let $\mathcal I_\varepsilon$ be an $\varepsilon$-covering of $\mathcal I$; we have
$$\sup_{x\in\mathcal I}\left|\frac1{n^2}\sum_{i,j=1,\,i\ne j}^nu_{ij}(x)\right| \le \underbrace{\max_{x\in\mathcal I_\varepsilon}\left|\frac1{n^2}\sum_{i,j=1,\,i\ne j}^nu_{ij}(x)\right|}_{\text{(II.1)}} + \underbrace{\max_{x\in\mathcal I_\varepsilon,\,y\in\mathcal I,\,|x-y|\le\varepsilon}\left|\frac1{n^2}\sum_{i,j=1,\,i\ne j}^n\big(u_{ij}(x)-u_{ij}(y)\big)\right|}_{\text{(II.2)}}.$$
We rely on the concentration inequality in Lemma 25 for degenerate second-order U-statistics. By our assumptions, $A$ can be chosen to be $C_1h^{-1}$, where $C_1$ is some constant that is independent of $x$. Similarly, $B$ can be chosen to be $C_2\sqrt nh^{-1}$ for some constant $C_2$ which is independent of $x$, and $D$ can be chosen as $C_3nh^{-1/2}$ for some $C_3$ independent of $x$. Therefore, by setting $\eta = K\log n/(n\sqrt h)$ for some large constant $K$, we have
$$\mathbb P[\text{(II.1)}\ge\eta] \le \frac{C_4}\varepsilon\max_{x\in\mathcal I_\varepsilon}\mathbb P\left[\left|\sum_{i,j=1,\,i\ne j}^nu_{ij}(x)\right|\ge n^2\eta\right] \le \frac{C_4}\varepsilon\exp\left\{-C_5\min\left[\frac{n^4\eta^2}{D^2},\ \Big(\frac{n^2\eta}B\Big)^{2/3},\ \Big(\frac{n^2\eta}A\Big)^{1/2}\right]\right\} = \frac{C_4}\varepsilon\exp\left\{-C_5\min\left[\frac{K^2\log^2n}{c_1},\ \Big(\frac{K\sqrt{nh}\log n}{c_2}\Big)^{2/3},\ \Big(\frac{Kn\sqrt h\log n}{c_3}\Big)^{1/2}\right]\right\}.$$
As $1/\varepsilon$ is at most polynomial in $n$, the above tends to zero for all $K$ large enough, which implies
$$\text{(II.1)} = O_P\left(\frac{\log n}{n\sqrt h}\right).$$
With tedious but still straightforward calculations, it can be shown that
$$\text{(II.2)} = O_P\left(\frac\varepsilon{h^2} + \frac{\log n}{n\sqrt h} + \frac\varepsilon{h^2}\frac{\log n}{n\sqrt h}\right),$$
and to match the rates, let $\varepsilon = h^{3/2}\log n/n$.

Proof of Lemma 21. The proof resembles that of Lemma 13.
Proof of Theorem 22. The proof resembles that of Theorem 14.
Proof of Theorem 23. The proof resembles that of Theorem 15.
References
Cattaneo, M. D., M. Jansson, and X. Ma (2020): "Simple Local Polynomial Density Estimators," Journal of the American Statistical Association, 115(531), 1449-1455.

Giné, E., V. Koltchinskii, and L. Sakhanenko (2004): "Kernel Density Estimators: Convergence in Distribution for Weighted Sup-Norms," Probability Theory and Related Fields, 130(2), 167-198.

Giné, E., R. Latała, and J. Zinn (2000): "Exponential and Moment Inequalities for U-statistics," in High Dimensional Probability II. Springer.

Granovsky, B. L., and H.-G. Müller (1991): "Optimizing Kernel Methods: A Unifying Variational Principle," International Statistical Review, 59(3), 373-388.

Loader, C. (2006): Local Regression and Likelihood. New York: Springer.