Kernel Density Estimation with Berkson Error
James P. Long, Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX ([email protected])
Noureddine El Karoui, Department of Statistics, University of California, Berkeley, 367 Evans Hall
Abstract
Given a sample {X_i}_{i=1}^n from f_X, we construct kernel density estimators for f_Y, the convolution of f_X with a known error density f_ε. This problem is known as density estimation with Berkson error and has applications in epidemiology and astronomy. Little is understood about bandwidth selection for Berkson density estimation. We compare three approaches to selecting the bandwidth, both asymptotically, using large sample approximations to the MISE, and at finite samples, using simulations. Our results highlight the relationship between the structure of the error f_ε and the optimal bandwidth. In particular, the results demonstrate the importance of smoothing when the error term f_ε is concentrated near 0. We propose a data-driven bandwidth estimator and test its performance on NO2 exposure data.

Keywords: Berkson Error; Measurement Error; Bandwidth Selection; Kernel Density Estimation; Multivariate Density Estimation
Short title: Kernel Density Estimation with Berkson Error

1 Introduction
We consider smoothing a density estimate when an error-free sample is observed and one is interested in the convolution of the population density with an error term. This is known as density estimation with Berkson error and has been studied in Delaigle [2007], where
NO2 exposure in children is estimated using known kitchen and bedroom concentrations. The exposure level in children is modeled as a function of kitchen and bedroom concentrations plus independent, additive random error.

Density estimation with Berkson error is one example of a class of problems where low-error or error-free data is used to construct an estimate for noisy data. Bovy et al. [2011] considers this problem in the context of constructing a classifier for noisy astronomical data. Here, each object belongs to the class quasar or star. For each object a telescope records a vector of flux ratios. For a training set of observations of known class, the authors observe low-error flux ratios. However, for the data of unknown class, flux ratios contain significant measurement error. The goal is to construct an accurate classifier for the noisy data. Carroll et al. [2009] considers a similar problem in a regression context with data arising from nutritional epidemiology. Long et al. [2012] studies this problem in the context of classification of periodic variable stars. In each of these works, tuning parameters are selected to optimize some risk function. There is extensive literature on selecting tuning parameters for problems where all data is observed without measurement error. For example, with kernel density estimation, asymptotic rates for the mean integrated squared error (MISE) as a function of bandwidth, as well as finite-sample procedures for selecting the bandwidth, are known [Jones et al., 1996, Silverman, 1986]. In contrast, much less is understood about selection of tuning parameters when one set of data is error-free and there is measurement error in the variable of interest. In this work, we derive asymptotic results and present finite sample simulations illustrating the relationship between measurement error and smoothing for kernel density estimators.
While our results are most directly applicable to density estimation with Berkson error, they have implications for the classification and regression problems above.

We now formalize the density estimation problem. Suppose we observe independent {X_i}_{i=1}^n ∼ f_X. We seek to use this data to estimate the density, denoted f_Y, of Y = X + ε. Here ε ∼ f_ε, X ∼ f_X, and ε and X are independent. All random variables are in R^p. In the literature, ε is known as Berkson error and was introduced in a regression context by Berkson [1950]. It differs from classical measurement error, where one observes the variable of interest contaminated with noise. (See Section 2 (Equations 1, 2, and 3) and Section 5 of Bovy et al. [2011] for more information.)

For estimating f_Y, Delaigle [2007] proposed using

f̃_Y(y) = (1/n) Σ_{i=1}^n f_ε(y − X_i).   (1)

Delaigle [2007] showed that when f_ε is square-integrable and f_X is bounded, this estimator is unbiased with a mean integrated squared error (MISE) that converges to 0 at rate n^{−1}. The convergence result contrasts with standard density estimation, where the MISE is generally of order n^{−4/(4+p)}. Effectively, knowledge of f_ε provides valuable information about the structure of f_Y unavailable in the standard kernel density estimation case. In addition to a fast convergence rate, f̃_Y in Equation 1 has no tuning parameters that require estimation. A potential drawback of f̃_Y in Equation 1 is that in cases where ε is concentrated around 0, the estimator will have large spikes at the sample points {X_i}_{i=1}^n and thus high variance. We illustrate this problem through simulations in Section 4.

In this work we study the impact of smoothing estimates of f_Y using kernels. We focus on comparing three approaches to kernel bandwidth selection:

• Approach 1:
Select a bandwidth specifically to optimize estimation of f_Y.

• Approach 2: Since f_Y(y) = ∫ f_X(y − ε) f_ε(ε) dε, use a kernel density estimator to estimate f_X. Select the bandwidth to optimize estimation of f_X. Then convolve this estimate of f_X with f_ε in order to estimate f_Y.

• Approach 3: Set the bandwidth to 0, i.e., use Delaigle's estimator in Equation (1).

Each of these approaches has some attractive properties. Approach 1 may provide the optimal performance because the bandwidth is chosen specifically to estimate f_Y. Approach 2 is attractive because there is extensive literature on selecting a bandwidth which optimizes estimation of f_X. Approach 3 is attractive because there are no tuning parameters to estimate and the resulting estimator has parametric first order convergence rates, as shown by Delaigle [2007].

In addition to shedding light on the Berkson density estimation problem, a comparison of the performance of these approaches may be useful when considering how to regularize classification or regression methods when training data is error-free (or low-error) and data of unknown class has measurement error. See Carroll et al. [2006] for a review of Berkson and classical measurement error. The n^{−4/(4+p)} order for the MISE requires regularity conditions on f_Y. For example, Wand and Jones [1995] (Section 4.3, p. 95) assumes each entry of the Hessian of f_Y is piecewise continuous and square integrable. See p. 100 of Wand and Jones [1995] for the MISE convergence rate.

1.2 Outline of Work and Summary of Findings

In Section 2 we present a kernel density estimator for f_Y that allows simultaneous comparison of all three smoothing approaches. As a guide towards comparing the approaches, in Section 3 we derive a second order expansion of the MISE of the estimator as a function of bandwidth. The asymptotic expansion reveals interesting properties:

• Asymptotically, the optimal bandwidth for estimating f_Y (approach 1) is of order n^{−1/2}. Notably this rate does not depend on the dimension of the problem, unlike for standard kernel density estimation. Approach 1 provides a second order (n^{−2}) reduction in MISE over approach 3 (no smoothing).

• The asymptotically optimal bandwidth for estimating f_Y depends on f_X, which is unknown.
The expression we derive for the optimal bandwidth is used as the basis for a plug-in estimator in Section 5.

• Using approach 2 (smoothing to optimize estimation of f_X) results in a MISE of order n^{−4/(4+p)}, slower than the n^{−1} order of approaches 1 and 3. As the dimension increases, the discrepancy in these rates grows.

Based purely on asymptotic considerations, approaches 1 and 3 appear superior to approach 2. In Section 4 we study the finite sample properties of these three approaches by adapting a result from Wand and Jones [1993] which allows for exact computation of MISE when f_X is a mixture of normals and the error and kernel are normal. We find:

• Approach 3 (no smoothing) can result in drastically undersmoothing the density, particularly when the error term is concentrated around 0.

• Approach 2 (smoothing to optimize estimation of f_X) can result in oversmoothing the density, particularly when the error term is smooth.

• In the cases considered in the simulation, the qualitative impacts of undersmoothing by using approach 3 appear worse than the qualitative impacts of oversmoothing by using approach 2.

• Approach 1 outperforms approaches 2 and 3. In the simulations considered, this performance advantage increased in three dimensions versus one dimension.

In Section 5 we propose a data-based bandwidth estimator for approach 1 and apply our methodology to the NO2 data studied by Delaigle [2007]. In Section 6 we suggest some directions for future work. Proofs of all theorems are given in Section S.1 and some technical issues are addressed in Section S.2.

2 Estimating f_Y

We observe independent random variables X_1, . . . , X_n ∼ f_X. We aim to estimate f_Y, the density of Y = X + ε. Here X ∼ f_X, ε ∼ f_ε, and X and ε are independent. f_ε is assumed known. All random variables are in R^p. In all that follows, let f̂_V represent the characteristic function of the random variable V and let f̃ represent an estimator of f.
2.1 An Estimator for f_Y

We construct an estimator for f_Y by first estimating f̂_Y, the characteristic function of f_Y. Let K be a mean 0 density function called the kernel, and K̂ its characteristic function. Let

Σ_K = ∫ x x^T K(x) dx.

Let H = H_n ⪰ 0 be a sequence of p × p matrices called the bandwidth. The empirical characteristic function

f̃̂_X(ω) = (1/n) Σ_{j=1}^n e^{iω^T X_j}

is an estimate of f̂_X. Consider estimating f̂_Y (the characteristic function of f_Y) using

f̃̂_{Y,H}(ω) = K̂(Hω) f̂_ε(ω) f̃̂_X(ω).   (2)

Note that f̃̂_{Y,H} is a characteristic function because it is the product of characteristic functions. Assuming f̃̂_{Y,H} ∈ L¹, we may estimate f_Y using

f̃_{Y,H}(y) = (2π)^{−p} ∫ e^{−iω^T y} f̃̂_{Y,H}(ω) dω.   (3)

The assumption that f̃̂_{Y,H} ∈ L¹ implies f̃_{Y,H} is a bounded density (see Theorem 3.3 in Durrett [2005]). Throughout this work, we require f̂_ε ∈ L¹, thus guaranteeing that f̃̂_{Y,H} ∈ L¹ and ensuring that f̃_{Y,H} in Equation (3) is a valid density.

2.2 f̃_{Y,H} and Approaches to Smoothing

f̃_{Y,H} allows for simultaneous comparison of all three bandwidth selection approaches discussed in the introduction. With approach 1, we optimize H in f̃_{Y,H} specifically for estimating f_Y. Recall that with approach 2, we construct a kernel density estimator for f_X with some bandwidth H_X chosen to optimize estimation of f_X. Denote this estimator

f̃_{X,H_X}(x) = (1/n) Σ_{i=1}^n K_{H_X}(x − X_i).

Since f_Y(y) = ∫ f_X(y − ε) f_ε(ε) dε, f̃_{X,H_X}(x) is convolved with f_ε to estimate f_Y. However, this procedure is equivalent to using H_X in f̃_{Y,H}. In other words,

f̃_{Y,H_X}(y) = ∫ f̃_{X,H_X}(y − ε) f_ε(ε) dε.
Finally, f̃_{Y,H} includes approach 3, no smoothing, as a subcase. By setting H = 0 we have

f̃_{Y,0}(y) = (1/n) Σ_{i=1}^n f_ε(y − X_i).

This is the kernel-free estimator of Delaigle [2007] presented in Equation (1).
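In the all-Gaussian case the inverse Fourier transform in Equation (3) is available in closed form, since the kernel-error convolution K_h * f_ε is itself normal. The one-dimensional sketch below is ours, not the paper's R code; it assumes f_ε = N(0, σ_ε²) and a standard normal kernel with scalar bandwidth h, so each data point contributes a Gaussian bump of variance h² + σ_ε². Setting h = 0 recovers Delaigle's estimator in Equation (1).

```python
import numpy as np
from scipy.stats import norm

def f_Y_tilde(y, X, h, sigma_eps):
    """Estimate f_Y at points y from an error-free sample X.

    Gaussian kernel with bandwidth h and N(0, sigma_eps^2) Berkson error:
    the kernel-times-error convolution is N(0, h^2 + sigma_eps^2).
    h = 0 gives the kernel-free estimator of Equation (1)."""
    s = np.sqrt(h**2 + sigma_eps**2)
    # average a Gaussian bump of scale s centered at each observation
    return norm.pdf(y[:, None], loc=X[None, :], scale=s).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=200)          # sample from f_X = N(0, 1)
y = np.linspace(-4, 4, 9)
est = f_Y_tilde(y, X, h=0.3, sigma_eps=0.5)
```

Because the estimator is an average of densities, it integrates to one for any h ≥ 0, and spikes at the sample points only appear when both h and σ_ε are small.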
We use mean integrated squared error as a guide toward understanding the behavior of the three approaches to selecting the bandwidth in f̃_{Y,H}. Let P_n be the product measure on (X_1, . . . , X_n). The mean integrated squared error is defined as

MISE(H) ≡ E_{P_n} ∫ ( f̃_{Y,H}(y) − f_Y(y) )² dy.

MISE is a popular measure of risk used in many density estimation studies (see for example Delaigle [2008], Jones et al. [1996], Marron and Wand [1992], Tsybakov [2009], Wand and Jones [1995]). The MISE allows for relatively straightforward asymptotic analysis (Theorems 1 and 2) as well as admitting, under certain conditions, a computationally efficient representation which we exploit in order to obtain our finite sample results (Theorem 3). We explore some of the qualitative impacts of the different smoothing approaches, not directly captured by MISE, in Subsection 4.3. While other measures, such as K-L divergence, Hellinger distance, or integrated absolute error, could be used, we feel MISE captures the essential properties of the three approaches to smoothing.

The optimal H, in terms of minimizing MISE for estimating f_Y, is

H_Y = argmin_{H : H ⪰ 0} MISE(H).

Similarly, H_X is the bandwidth minimizing the MISE for estimating f_X. In other words,

H_X = argmin_{H : H ≻ 0} E_{P_n} ∫ ( f̃_{X,H}(x) − f_X(x) )² dx.

Unfortunately the MISE expression is complicated and exact expressions for H_Y and H_X are not generally possible. In Section 3 we form asymptotic approximations to the MISE and determine the rates at which ||H_Y||_∞ → 0 and MISE(H_Y) → 0 as n → ∞. We refer to standard kernel density theory for the rates associated with H_X. Our asymptotic expansion of the MISE reveals rates associated with MISE(H_X). The asymptotic approximation also shows the error for approach 3, MISE(0), a quantity that was derived by Delaigle [2007].

In Section 4 we specialize to the case where f_X is a Gaussian mixture, K is a Gaussian kernel, and f_ε is a Gaussian error density.
In this setting, the MISE can be evaluated at a particular H without numerically approximating integrals (see Theorem 3). Using this result we study the finite sample properties of H_Y, H_X, MISE(H_Y), MISE(H_X), and MISE(0).

In this work, H_Y refers to the exact optimal bandwidth, H*_Y refers to an asymptotically optimal bandwidth, and H̃_Y refers to a data based estimator of H_Y (used mostly in Section 5). While we derive asymptotic results for bandwidth matrices, we often specialize to the case of a scalar bandwidth. In such cases, lowercase letters, h_Y, h*_Y, and h̃_Y, are used. The same notation applies for H_X.

3 Asymptotic Results

For the purposes of forming asymptotic expansions, we represent the MISE in terms of characteristic functions.
Theorem 1.
Assume f̂_Y, f̃̂_{Y,H} ∈ L². Then

(2π)^p MISE(H) = ∫ |1 − K̂(Hω)|² dµ(ω) + (1/n) ∫ |K̂(Hω)|² dν(ω)   (4)

where

dµ(ω) = |f̂_ε(ω)|² |f̂_X(ω)|² dω,
dν(ω) = |f̂_ε(ω)|² (1 − |f̂_X(ω)|²) dω

are positive measures. ∫ |1 − K̂(Hω)|² dµ(ω) is the integrated squared bias of f̃_{Y,H} and n^{−1} ∫ |K̂(Hω)|² dν(ω) is the integrated variance. Notice that for fixed H, the variance decreases at rate n^{−1} while the bias is constant. When H = 0, K̂(Hω) = 1, so the integrated squared bias term vanishes.

We require assumptions on the kernel K and the bandwidth matrix H.

Assumptions A.

K is a symmetric density.   (5)
K̂ is four times continuously differentiable.   (6)
H = H_n ⪰ 0 (i.e. the sequence is positive semidefinite).   (7)
||H||_∞ → 0.   (8)

Standard kernels, such as the mean 0 normal density on R^p, satisfy Assumptions 5 and 6. In Assumption 7, notice that the bandwidth does not have to be strictly positive definite, unlike in the standard kernel density estimation case. We require the following assumptions on the characteristic functions of f_X and f_ε.

Assumptions B.

∫ ||ω||_∞^4 |f̂_ε(ω)|² dω < ∞.   (9)
∫ |f̂_ε(ω)|² dω < ∞.   (10)

Assumptions 9 and 10 are satisfied as long as the error term has a density that is smooth, such as multivariate normal or Student's t (see Sutradhar [1986] for the characteristic function of the multivariate Student's t).

Theorem 2.
Under Assumptions A and B, and with the notation of Theorem 1,

(2π)^p MISE(H) = (1/n) ∫ dν(ω) + ( (1/4) ∫ (ω^T H^T Σ_K H ω)² dµ(ω) − (1/n) ∫ (ω^T H^T Σ_K H ω) dν(ω) ) (1 + O(||H||²_∞)).   (11)

See Subsection S.1.2 on p. S.2 for a proof. The n^{−1} ∫ dν(ω) term is variance in the estimator that does not depend on the bandwidth. If the bandwidth is set to 0 (see approach 3 below), all other terms vanish and this is the MISE. The (1/4) ∫ (ω^T H^T Σ_K H ω)² dµ(ω) term is bias caused by using a kernel with bandwidth H, while −n^{−1} ∫ (ω^T H^T Σ_K H ω) dν(ω) is the corresponding reduction in variance.

While the full bandwidth matrix offers the most flexibility and greatest potential for reduction in MISE, this expression is difficult to optimize; see Subsection S.2.1. More simply, one could use a diagonal bandwidth matrix and optimize p bandwidths. We study this case in Subsection S.2.2. Here we focus on using a scalar bandwidth. This allows for the most direct and straightforward comparisons of the different approaches to smoothing. We reparameterize the bandwidth as H = hI. The general MISE expression in Equation (11) becomes

(2π)^p MISE(h) = (1/n) ∫ dν(ω) + ( (h⁴/4) ∫ (ω^T Σ_K ω)² dµ(ω) − (h²/n) ∫ (ω^T Σ_K ω) dν(ω) ) (1 + O(h²)).   (12)

We now discuss the three smoothing approaches.

Approach 1:
It is straightforward to find the bandwidth that minimizes the main terms in this MISE expression. Specifically,

h*_Y = argmin_{h ≥ 0} ( (h⁴/4) ∫ (ω^T Σ_K ω)² dµ(ω) − (h²/n) ∫ (ω^T Σ_K ω) dν(ω) )
     = ( 2 ∫ (ω^T Σ_K ω) dν(ω) / ( n ∫ (ω^T Σ_K ω)² dµ(ω) ) )^{1/2}.   (13)

h*_Y is of order n^{−1/2}. This order does not depend on p, unlike in the error free case where, under some regularity conditions, the order of the asymptotically optimal amount of smoothing is n^{−1/(4+p)}. However, the dimension p will affect the constant on h*_Y in Equation (13). We explore the relationship between the dimension p and the optimal amount of smoothing for finite samples in Section 4. Using h*_Y, the MISE is

(2π)^p MISE(h*_Y) = (1/n) ∫ dν(ω) − (1/n²) ( ∫ (ω^T Σ_K ω) dν(ω) )² / ∫ (ω^T Σ_K ω)² dµ(ω) + O(n^{−3}).
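Equation (13) can be evaluated numerically when f_X and f_ε are known. The one-dimensional sketch below (function names are ours) assumes f_X standard normal, N(0, σ_ε²) error, and a standard normal kernel (Σ_K = 1), so that |f̂_X(ω)|² = e^{−ω²} and |f̂_ε(ω)|² = e^{−σ_ε²ω²}; the constant factors follow the expansion as written above.

```python
import numpy as np
from scipy.integrate import quad

def h_star_Y(n, sigma_eps):
    """Asymptotically optimal scalar bandwidth (Equation 13), 1-D case:
    f_X = N(0,1), error N(0, sigma_eps^2), standard normal kernel.
    Here d(mu) and d(nu) have densities built from
    |f_eps_hat|^2 = exp(-sigma_eps^2 w^2), |f_X_hat|^2 = exp(-w^2)."""
    dmu = lambda w: np.exp(-sigma_eps**2 * w**2) * np.exp(-w**2)
    dnu = lambda w: np.exp(-sigma_eps**2 * w**2) * (1 - np.exp(-w**2))
    B = quad(lambda w: w**2 * dnu(w), -np.inf, np.inf)[0]   # int w^2 d(nu)
    A = quad(lambda w: w**4 * dmu(w), -np.inf, np.inf)[0]   # int w^4 d(mu)
    return np.sqrt(2 * B / (n * A))
```

Since B and A do not depend on n, the returned bandwidth scales exactly as n^{−1/2}, illustrating the dimension-free rate discussed above.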
Set h to minimize the MISE in estimating f_X. Under certain regularity conditions on f_X, this bandwidth is of order n^{−1/(4+p)} (e.g. see Wand and Jones [1995], page 100). Specifically, suppose h*_X = D(n) n^{−1/(4+p)}, with D: Z⁺ → R⁺ such that lim sup_n D(n) < ∞ and lim inf_n D(n) > 0. The MISE for estimating f_Y using h*_X (obtained from Equation (12)) is

(2π)^p MISE(h*_X) = (1/n) ∫ dν(ω) + ( (D(n)⁴/4) n^{−4/(4+p)} ∫ (ω^T Σ_K ω)² dµ(ω) − D(n)² n^{−(6+p)/(4+p)} ∫ (ω^T Σ_K ω) dν(ω) ) (1 + O(n^{−2/(4+p)}))
 = ( (D(n)⁴/4) n^{−4/(4+p)} ∫ (ω^T Σ_K ω)² dµ(ω) ) (1 + o(1)).   (14)

The n^{−4/(4+p)} order for the MISE when using h*_X is strictly worse than the n^{−1} order that can be achieved by optimizing the bandwidth specifically for the error distribution, i.e. using h*_Y. Essentially, using h*_X oversmooths f̃_{Y,H}. The first order term in MISE(h*_X), Equation (14), is caused entirely by bias.
Set h = 0. Here we have

(2π)^p MISE(0) = (1/n) ∫ dν(ω) = (1/n) ( ∫ |f̂_ε(ω)|² dω − ∫ |f̂_ε(ω)|² |f̂_X(ω)|² dω ).

Asymptotically, this approach is better than approach 2, since MISE(0) is of order n^{−1}. The ratio of using asymptotically optimal smoothing (h*_Y) to no smoothing is

MISE(h*_Y)/MISE(0) = 1 − (1/n) ( ∫ (ω^T Σ_K ω) dν(ω) )² / ( ∫ (ω^T Σ_K ω)² dµ(ω) · ∫ dν(ω) ) + O(n^{−2}).

These asymptotic results suggest that using the bandwidth that minimizes error in estimating f_X is a poor choice for estimating f_Y. This procedure results in a convergence rate of higher order than either not smoothing (approach 3) or smoothing specifically for f_Y (approach 1), i.e. using h*_Y. The asymptotic results show an improvement only at the n^{−2} level from using h = h*_Y rather than h = 0. Based strictly on asymptotic analysis, this small improvement in error rate may not appear to justify the extra effort required to estimate h_Y. However, simulation results in Section 4 indicate that the effects of using h_Y are more important than the asymptotic analysis suggests, both in terms of minimizing MISE and preserving important qualitative features of the densities.

4 Finite Sample Results

The asymptotic results from Section 3 illustrate the large sample behavior of the MISE under different approaches to choosing the bandwidth parameter. However, it is important to understand the finite sample behavior of these quantities and what sample sizes are needed for asymptotics to be informative.

Calculation of the exact MISE in Equation 4 requires numerically approximating integrals. Therefore it is computationally challenging, given f_X, f_ε, and kernel K, to determine the bandwidth H which minimizes the MISE.
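To see what this numerical approximation involves, Equation (4) can be evaluated by direct quadrature when the characteristic functions are known. A one-dimensional sketch (function names are ours) with f_X standard normal, N(0, σ_ε²) error, and a standard normal kernel:

```python
import numpy as np
from scipy.integrate import quad

def mise_quad(h, n, sigma_eps):
    """MISE(h) from Equation (4) by quadrature, 1-D case with
    f_X = N(0,1), N(0, sigma_eps^2) error, standard normal kernel,
    so K_hat(h w) = exp(-h^2 w^2 / 2)."""
    feps2 = lambda w: np.exp(-sigma_eps**2 * w**2)   # |f_eps_hat|^2
    fx2 = lambda w: np.exp(-w**2)                    # |f_X_hat|^2
    Khat = lambda w: np.exp(-h**2 * w**2 / 2)
    # integrated squared bias: int |1 - K_hat|^2 d(mu)
    bias = quad(lambda w: (1 - Khat(w))**2 * feps2(w) * fx2(w),
                -np.inf, np.inf)[0]
    # integrated variance: (1/n) int |K_hat|^2 d(nu)
    var = quad(lambda w: Khat(w)**2 * feps2(w) * (1 - fx2(w)),
               -np.inf, np.inf)[0]
    return (bias + var / n) / (2 * np.pi)
```

Every candidate bandwidth triggers fresh numerical integration, which is the cost that the closed-form representation developed next avoids.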
For the error-free kernel density estimation case, Wand and Jones [1993] showed that when f_X is a normal mixture and K is a normal kernel, the exact MISE has a simple representation that does not require numerically approximating integrals. Using this result, the authors compared bandwidth parameterizations for bivariate density estimation. Here we generalize this result to the case with Berkson measurement error. We assume f_X is a normal mixture and K and f_ε are normal. We use this result for studying the finite sample properties of various bandwidth selection approaches. Theorem 3 is a generalization of Theorem 1 in Wand and Jones [1993] to the case with Berkson error.

Theorem 3.
Let φ_Σ be the mean 0, normal density with covariance Σ. Assume the kernel K = φ_{Σ_K} and the error density f_ε = φ_{Σ_ε}. Assume that f_X is a mixture of normal densities parameterized by {(α_j, µ_j, Σ_j)}_{j=1}^m, where Σ_{j=1}^m α_j = 1 and (α_j, µ_j, Σ_j) is the mixing proportion, mean, and variance of the jth component of f_X. In other words,

f_X(x) = Σ_{j=1}^m α_j φ_{Σ_j}(x − µ_j).

Let S = H Σ_K H^T. Let Ω_a for a ∈ {0, 1, 2} be an m × m matrix with (j, j′) entry equal to φ_{aS + 2Σ_ε + Σ_j + Σ_{j′}}(µ_j − µ_{j′}). Finally let α = (α_1, . . . , α_m)^T. Then

MISE(H) = (1/n) φ_{2S + 2Σ_ε}(0) + α^T ( (1 − n^{−1}) Ω_2 − 2Ω_1 + Ω_0 ) α.   (15)

See Subsection S.1.3 for a proof. Equation (15) can be evaluated at a particular bandwidth H without numerically approximating integrals.

In Subsection 4.1 we compare the MISE for the three bandwidth selection approaches at finite sample sizes in one dimension. In Subsection 4.2 we repeat this analysis for several 3-dimensional densities. In Subsection 4.3 we show the visual impacts of different MISEs by plotting pointwise quantiles for density estimates using different bandwidth selection approaches. Finally, in Subsection 4.4 we explore how quickly the asymptotic approximations from Section 3 for the optimal smoothing parameter for f_Y take hold. All of the results presented in this section can be reproduced using publicly available code. R-code and data for generating results in Sections 4 and 5 are available at http://stat.tamu.edu/~jlong/berkson.zip.

4.1 Relative Error in One Dimension

In this section the f_X densities are 1-dimensional normal mixtures which satisfy the assumptions of Theorem 3. The densities we consider, and associated names, are plotted in Figure 1. The exact parameter values for these densities are given in Table 1.
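Equation (15) lends itself to direct implementation. Below is a one-dimensional sketch (our own function names, with the combination of constants as written in Theorem 3 above; the paper's reproduction code is in R), using a scalar bandwidth h and Σ_K = 1 so that S = h²:

```python
import numpy as np

def phi0(var):
    """Mean 0 normal density with variance `var`, evaluated at 0."""
    return 1.0 / np.sqrt(2 * np.pi * var)

def phi(var, x):
    """Mean 0 normal density with variance `var` (elementwise in x)."""
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def exact_mise(h, n, sig_eps2, alpha, mu, sig2):
    """Exact MISE of Equation (15): 1-D normal mixture f_X with weights
    alpha, means mu, variances sig2; Gaussian kernel (Sigma_K = 1) and
    N(0, sig_eps2) Berkson error."""
    S = h**2
    a_ = np.asarray(alpha); mu_ = np.asarray(mu); s_ = np.asarray(sig2)
    D = mu_[:, None] - mu_[None, :]                # mu_j - mu_j'
    V = s_[:, None] + s_[None, :] + 2 * sig_eps2   # Sigma_j + Sigma_j' + 2 Sigma_eps
    Omega = lambda a: phi(a * S + V, D)            # entries of Omega_a
    quad_form = (1 - 1 / n) * Omega(2) - 2 * Omega(1) + Omega(0)
    return phi0(2 * S + 2 * sig_eps2) / n + a_ @ quad_form @ a_
```

Minimizing `exact_mise` over a grid of h values is one way to obtain the exact optimal bandwidth h_Y used in the comparisons below, with no numerical integration at any step.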
Figure 1: The four 1-dimensional Gaussian mixture densities we consider (Normal, Bimodal 1, Bimodal 2, Trimodal).

We now compare the MISE for the densities in Figure 1 using the three approaches for selecting the bandwidth parameter. Recall that the approaches are: 1) optimize the bandwidth for estimating f_Y, 2) optimize the bandwidth for estimating f_X, and 3) set the bandwidth equal to 0. In Section 3 we showed that optimizing the bandwidth for f_Y (approach 1) and setting the bandwidth equal to 0 (approach 3) resulted in the same first order asymptotic performance for the MISE. In contrast, optimizing the bandwidth for f_X (approach 2) resulted in slower, nonparametric convergence rates.

Table 1: Parameters for the four 1-dimensional densities (Normal, Bimodal 1, Bimodal 2, Trimodal). φ_Σ is the normal, mean 0 density with covariance Σ.

Let h_Y be the optimal bandwidth for estimating f_Y and h_X be the optimal bandwidth for estimating f_X (h_Y and h_X are sequences implicitly indexed by the sample size n). We now compare MISE(h_Y), MISE(h_X), and MISE(0) at finite n for the four densities in Figure 1 and a variety of error variances σ_ε². Clearly MISE(h_Y) ≤ MISE(h_X) and MISE(h_Y) ≤ MISE(0), since h_Y is the minimizer of the MISE. We seek to understand the level of reduction in MISE one can achieve by using h_Y, the parameter settings where these reductions occur, and how MISE(h_X) compares to MISE(0). We note that h_X and h_Y are exact minimizers of the MISE for f_X and f_Y, not asymptotic approximations.

In Table 2 we present the pair (MISE(0)/MISE(h_Y), MISE(h_X)/MISE(h_Y)) for four densities and five error variances (the error is normal, mean 0) for n = 50. We note some general trends. As σ_ε² decreases, MISE(h_X)/MISE(h_Y) decreases. With small σ_ε², f_Y is close to f_X and thus h_X and h_Y are close. In contrast, as σ_ε² decreases, MISE(0)/MISE(h_Y) increases. With no smoothing and small σ_ε², approach 3 undersmooths the density estimate.

σ_ε²     Normal         Bimodal 1      Bimodal 2      Trimodal
2        (1.02, 1.18)   (1.08, 1.01)   (1.03, 1.02)   (1.18, 1.05)
1        (1.05, 1.17)   (1.15, 1.01)   (1.07, 1.03)   (1.24, 1.04)
0.5      (1.13, 1.11)   (1.26, 1.01)   (1.16, 1.03)   (1.30, 1.01)
0.25     (1.32, 1.05)   (1.50, 1.00)   (1.37, 1.01)   (1.46, 1.00)
0.125    (1.70, 1.02)   (1.92, 1.00)   (1.76, 1.01)   (1.77, 1.00)

Table 2: Each entry is (MISE(0)/MISE(h_Y), MISE(h_X)/MISE(h_Y)) for n = 50. These ratios are always greater than 1 because h_Y is the minimizer of the MISE. As expected, MISE(0) performs well when σ_ε² (the error variance) is large but poorly when σ_ε² is small. MISE(h_X) performs well when σ_ε² is small but poorly when σ_ε² is large.

For the densities and error variances considered, using h_X (approach 2) is generally better than no smoothing (approach 3). Only for the normal distribution with σ_ε² = 2 or 1 does no smoothing outperform smoothing with h_X. This is surprising, given that the asymptotic results showed the convergence rate using h_X is slower than the rate using no smoothing. For the densities and error distributions considered, a sample size of 50 is not large enough for these asymptotics to take hold. An important caveat to this conclusion is that the no smoothing estimator is simpler than h_X because it has no smoothing parameters.

σ_ε²     Normal         Bimodal 1      Bimodal 2      Trimodal
2        (1.01, 1.24)   (1.04, 1.03)   (1.02, 1.04)   (1.09, 1.02)
1        (1.03, 1.24)   (1.08, 1.03)   (1.04, 1.06)   (1.12, 1.01)
0.5      (1.07, 1.18)   (1.15, 1.03)   (1.09, 1.06)   (1.16, 1.00)
0.25     (1.19, 1.09)   (1.31, 1.02)   (1.24, 1.03)   (1.27, 1.00)
0.125    (1.46, 1.04)   (1.62, 1.01)   (1.53, 1.01)   (1.50, 1.00)

Table 3: The entries here are the same as in Table 2 but for n = 100. This larger n generally improves the performance of MISE(0) and worsens the performance of MISE(h_X) (relative to MISE(h_Y)). This is predicted by our asymptotic theory, since as n → ∞, MISE(0)/MISE(h_Y) → 1 and MISE(h_X)/MISE(h_Y) → ∞. However, at n = 100, using h_X still generally outperforms no smoothing.

In Table 3 we plot the same quantities as Table 2 but for n = 100. The same general trends apply here as in the n = 50 case: as σ_ε² decreases, MISE(h_X) decreases while MISE(0) increases (relative to MISE(h_Y)). Note that the performance of no smoothing relative to h_X is generally better for n = 100 than n = 50. For example, with Bimodal 2 and σ_ε² = 2, MISE(h_X) < MISE(0) for n = 50, but MISE(h_X) > MISE(0) for n = 100. In fact, for every case considered, MISE(0)/MISE(h_Y) is lower for n = 100 than n = 50. With the exception of Trimodal, MISE(h_X)/MISE(h_Y) is always greater for n = 100 than n = 50.

The asymptotic results in Section 3.1 predict this behavior. As n → ∞, MISE(0)/MISE(h_Y) → 1 and MISE(h_X)/MISE(h_Y) → ∞. So for large enough n, MISE(0) < MISE(h_X). However, for most densities and error variances considered here, n = 100 is not large enough for these asymptotics to take hold. These results suggest that at sample sizes of potential interest, using no smoothing can undersmooth the density estimate. This effect appears most significant when the error density is concentrated near 0.

4.2 Relative Error in Three Dimensions

We now explore relative error rates for 3-dimensional densities. The Gaussian mixture densities we study are defined in Table 4 using the notation
Σ₊ and Σ₋, two 3 × 3 covariance matrices with unit variances and off-diagonal entries of magnitude 0.64 (positive for Σ₊ and negative for Σ₋).   (16)

The four densities in Table 4 are meant to be 3-dimensional generalizations of the 1-dimensional densities from Table 1. In particular, Multi. Normal is a 3-dimensional normal, a direct generalization of the 1-dimensional normal. Multi. 2-Comp 2 is a mixture of two well separated, identity covariance normals. This is a close analogue to Bimodal 2 from Table 1.

Name               Parameters
Multi. Normal      φ_I(x)
Multi. 2-Comp 1    .φ_{Σ₊}(x) + .φ_{Σ₋}(x − (1, , )^T)
Multi. 2-Comp 2    .φ_I(x − (6, , )^T) + .φ_I(x + (6, , )^T)
Multi. 3-Comp      .φ_{Σ₊}(x) + .φ_{Σ₋}(x − (1, , )^T) + .φ_{Σ₋}(x)

Table 4: Parameters for the four 3-dimensional densities studied. φ_Σ is the normal, mean 0 density in three dimensions with covariance Σ. Here I is the 3 × 3 identity matrix and Σ₊, Σ₋ are the covariance matrices defined in Equation 16.

We consider 5 error densities for ε. Each error density is normal, mean 0, with all diagonal elements of the covariance equal and all off-diagonal covariances equal to 0. The normality of ε is required by Theorem 3. The other choices were made to keep these simulations a reasonable size. For the diagonal terms of the covariance, we consider the same values as for the 1-dimensional case: 2, 1, 0.5, 0.25, and 0.125. Tables 5 and 6 present the pair (MISE(0)/
MISE(h_Y), MISE(h_X)/MISE(h_Y)) for all 20 (f_X, f_ε) pairs for n = 100 and n = 500, respectively. The first column in each table, σ_ε², refers to the diagonal elements of the covariance matrix. The n = 100 case, Table 5, allows for direct comparison with the 1-dimensional case in Table 3. The n = 500 case, Table 6, provides results for what is perhaps a more realistic sample size when attempting to estimate a 3-dimensional density non-parametrically.

We discuss the n = 100 results, Table 5. In general, h_Y performs better relative to no smoothing and h_X smoothing in three dimensions than in one dimension. In particular, all ratios are larger for Multi. Normal and Multi. 2-Comp 2 in Table 5 than for Normal and Bimodal 2 in Table 3. As before, with small error variance, approach 3 undersmooths the density estimate. As in the 1-dimensional case, h_X oversmooths the density estimates when the error variance is large. This effect appears worse in three dimensions than in one dimension. For example, for the standard normal with σ_ε² = 2 and n = 100, MISE(h_X)/MISE(h_Y) is 1.24 in one dimension (Table 3) and 1.76 in three dimensions (Table 5).

σ_ε²     Multi Normal   Multi 2-Comp 1   Multi 2-Comp 2   Multi 3-Comp
2        (1.02, 1.76)   (1.02, 1.13)     (1.04, 1.28)     (1.02, 1.20)
1        (1.07, 1.63)   (1.06, 1.12)     (1.12, 1.28)     (1.07, 1.15)
0.5      (1.24, 1.35)   (1.16, 1.08)     (1.39, 1.17)     (1.21, 1.07)
0.25     (1.80, 1.14)   (1.40, 1.05)     (2.18, 1.07)     (1.55, 1.02)
0.125    (3.38, 1.05)   (2.00, 1.02)     (4.34, 1.02)     (2.32, 1.01)

Table 5: Three dimensional finite sample results for n = 100. Generally, h_X and no smoothing perform worse relative to h_Y here than for n = 100 in one dimension (see Table 3).

We now discuss the results for n = 500, Table 6. Note that with the larger sample size, all of the MISE(0)/MISE(h_Y) ratios have decreased while all of the MISE(h_X)/MISE(h_Y) ratios have increased relative to the n = 100 case in Table 5. The asymptotic results from Section 3 predict this behavior.
As n → ∞, MISE(0)/MISE(h_Y) → 1 while MISE(h_X)/MISE(h_Y) → ∞. As expected, for all densities MISE(0)/MISE(h_Y) increases as σ_ε decreases.

σ_ε    Multi. Normal  Multi. 2-Comp 1  Multi. 2-Comp 2  Multi. 3-Comp
2      (1.00, 2.66)   (1.00, 1.27)     (1.01, 1.72)     (1.01, 1.37)
1      (1.01, 2.54)   (1.01, 1.30)     (1.03, 1.82)     (1.02, 1.34)
0.5    (1.06, 2.00)   (1.03, 1.29)     (1.10, 1.57)     (1.05, 1.25)
0.25   (1.25, 1.45)   (1.10, 1.23)     (1.41, 1.26)     (1.14, 1.16)
0.125  (1.94, 1.16)   (1.30, 1.14)     (2.37, 1.09)     (1.41, 1.09)

Table 6: Three dimensional finite sample results for n = 500. MISE(h_X)/MISE(h_Y) is larger and MISE(0)/MISE(h_Y) is smaller here relative to Table 5, where the sample size was 100.

The three dimensional finite sample results reinforce the conclusion from the one dimensional results that not smoothing when the error variance is small produces undersmoothed estimates with large MISE relative to using the optimal smoothing parameter. Additionally, in three dimensions, using h_X oversmooths the density estimates in many cases where the error variance is large.

The 3-dimensional simulations here were limited to a very narrow set of error distributions. In particular, the error variance was equal in all directions. Another interesting setting to consider is when the error variance is very small along certain directions and sizable along other directions. The limiting case of this setting has no error along certain directions.

When there is no error in certain directions but error in other directions, one can obtain convergence rates that depend on the dimension of the subspace on which there is error (Long [2013], Theorem 2.3). Specifically, suppose one estimates a p-dimensional density f_Y and ε is 0 with probability 1 on a p₁-dimensional subspace of ℝ^p. Suppose ε has a density on the other p₂ dimensions (p = p₁ + p₂).
Then for second order kernels, under some regularity conditions, the optimal smoothing using a scalar bandwidth is of order n^{−1/(4+p₂)} and results in an MISE of order n^{−4/(4+p₂)} (see Equation 2.27 in Long [2013]). Note that this rate is between the error-free rate of n^{−4/(4+p)} and the error-in-all-directions case, where the MISE is of order n^{−1}.

We now visualize some of the results in Table 2 by plotting pointwise quantiles for density estimates under different choices of smoothing parameter. In order to obtain an understanding of the impact of using h_X (approach 2) or no smoothing (approach 3), we examine the 1-dimensional cases where these methods perform worst relative to using h_Y (approach 1). This shows some of the qualitative impacts of suboptimal smoothing on the density estimate.

We first study σ_ε = 2, Normal. Here h_X performed worst (relative to h_Y) out of all the densities and error variances considered (see Table 2). We generate 100 samples of size 50 from Normal. Using these 100 samples, we construct 100 density estimates using h_Y and 100 using h_X. In Figure 2 a) we plot the .1 and .9 pointwise quantiles for these density estimates (orange-dashed for h_Y and blue-dotted for h_X) along with the true underlying density f_Y (i.e. the Normal density convolved with φ₂) in black-solid. The quantiles for h_X have a lower peak and heavier tails than the quantiles for the h_Y density estimates. Using h_X oversmooths the density estimates (h_X = 0.52 versus the smaller h_Y). In Figure 2 b) and c) we plot 10 density estimates using h_Y and h_X respectively. We see that the individual density estimates using h_X are negatively biased near Y = 0 and positively biased for large |Y|. Since all the density estimates are unimodal, with a mode near 0, and approximately normal, the qualitative conclusions that one is likely to draw from these density estimates are likely to be similar regardless of whether one is using h_X or h_Y.

Figure 2: Comparison of using h_Y (optimal smoothing for f_Y) to h_X (optimal smoothing for f_X) for the Normal density with σ_ε = 2. In a) we plot f_Y and the .9 and .1 quantiles for density estimates using h_Y (orange-dash) and h_X (blue-dot). h_X oversmooths the estimate, so the peak at Y = 0 is biased low while the tails are biased high. In b) and c) we plot 10 density estimates using h_Y and h_X respectively. The qualitative conclusions that one is likely to draw from the density estimates are similar, regardless of whether h_X or h_Y is used.

We now study the Bimodal 1 density case with σ_ε = 0.125 and n = 50. Here MISE(0)/MISE(h_Y) was highest out of all conditions tested in Table 2. Following the procedure for generating Figure 2, we generate 100 samples of size 50 from Bimodal 1. Using these 100 samples, we construct 100 density estimates using h_Y and 100 using no smoothing. In Figure 3 a) we plot the .1 and .9 pointwise quantiles for these density estimates (orange-dashed for h_Y and blue-dotted for no smoothing). We plot the true underlying density, f_Y (i.e. the Bimodal 1 density convolved with φ₀.₁₂₅), in black-solid.

No smoothing greatly overestimates the height of the mode at Y = 0. The quantiles for the h_Y density estimates are nearly contained within the quantiles for no smoothing across all values of Y. In Figure 3 b) and c) we plot 10 density estimates using h_Y and no smoothing respectively. The density estimates using h_Y (in b)) typically identify two modes close to the correct Y values. In contrast, the density estimates using no smoothing (in c)) appear very undersmoothed. Several estimates have three or more modes, and the mode heights are often far from the true values. In this case, not smoothing could have a significant impact on qualitative conclusions drawn from the density estimate.
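When both the kernel and the error density are normal, the Berkson estimator has the simple closed form f̃_{Y,h}(y) = (1/n) Σᵢ φ(y − Xᵢ; h² + σ_ε), where φ(·; v) is the mean-0 normal density with variance v, and "no smoothing" is the special case h = 0 (see Equation S.20 in the supplement). A minimal sketch in Python; the sample, bandwidth, and error variance below are illustrative placeholders, not values from the simulations:

```python
import numpy as np

def berkson_kde(y, X, h, var_eps):
    """Berkson density estimate with a Gaussian kernel:
    f~_{Y,h}(y) = (1/n) sum_i phi(y - X_i; h^2 + var_eps),
    where phi(.; v) is the mean-0 normal density with variance v.
    Setting h = 0 corresponds to 'no smoothing' (approach 3)."""
    v = h**2 + var_eps
    y = np.atleast_1d(np.asarray(y, dtype=float))
    # n x m matrix of normal densities for each (X_i, y_j) pair, then average over i
    dens = np.exp(-(y[None, :] - X[:, None])**2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
    return dens.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=50)                    # error-free sample from f_X
grid = np.linspace(-4.0, 4.0, 9)
f_h0 = berkson_kde(grid, X, h=0.0, var_eps=0.125)    # no smoothing
f_h  = berkson_kde(grid, X, h=0.5, var_eps=0.125)    # a positive bandwidth
```

Both estimates integrate to one; the choice of h only redistributes mass, which is what Figures 2 and 3 visualize.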
The results from Figures 2 and 3 suggest that the qualitative impact of not smoothing may be worse than that of smoothing using h_X.

Figure 3: Comparison of using h_Y (optimal smoothing for f_Y) to no smoothing for the Bimodal 1 density with σ_ε = 0.125. In a) we plot f_Y (black-solid) and the .9 and .1 quantiles for density estimates using h_Y (orange-dash) and no smoothing (blue-dot). The quantiles for no smoothing are wider than for h_Y for most values of Y. In b) and c) we plot 10 density estimates using h_Y and no smoothing respectively. The density estimates using no smoothing often have 3 modes, and these modes are often not close to the modes of the true f_Y.

We now study the rate of convergence of the asymptotically optimal bandwidth for estimating f_Y, h*_Y (see Equation 13), to the exact optimal bandwidth h_Y. Fast convergence rates suggest that plug-in estimators could be effective for estimating h_Y. We pay particular attention to the relationship between the convergence rate of h*_Y and the error variance σ_ε.

For the four densities in Figure 1, using error variances σ_ε = 2, 1, 0.5, 0.25, 0.125, we compute the ratio between the exact optimal bandwidth (h_Y) and the asymptotically optimal bandwidth (h*_Y). The exact optimal bandwidth is determined by finding the h which minimizes Equation 15. The asymptotically optimal bandwidth is computed using Equation 13. We plot these ratios as a function of n for the Normal and Trimodal densities in Figure 4 a) and b) respectively.

For the Normal density, the larger σ_ε, the faster the convergence of the asymptotically optimal bandwidth to the exact optimal bandwidth. For example, with σ_ε = 2, 1, or 0.5 and n = 100, the exact optimal bandwidth is within 10% of the asymptotic expression. This suggests that for a normal density, with moderate n and error variance not too small, plug-in estimators for the asymptotic bandwidth may provide a good approximation to the bandwidth which minimizes the exact MISE. The plots for Bimodal 1 and Bimodal 2 (not shown) closely resemble the Normal density.

For the Trimodal density, the relationship between σ_ε and the convergence rate of the asymptotically optimal bandwidth (h*_Y) to the exact optimal bandwidth (h_Y) is more complicated than for the Normal, Bimodal 1, and Bimodal 2 densities. Broadly, the asymptotics for σ_ε = 0.25, 0.125 take hold at larger n than for σ_ε = 2, 1, 0.5. Unlike in the normal case, the asymptotically optimal bandwidth is not always greater than the exact optimal bandwidth.
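Because Theorem 3 gives the exact MISE in closed form for normal mixtures, the exact optimal bandwidth can be found by one-dimensional numerical minimization and compared with the asymptotic approximation. The sketch below does this for a single standard-normal component; the closed-form expressions are our own specialization of Theorem 3 and of the asymptotically optimal bandwidth (with the true variance of f_X plugged in) and should be checked against Equations 13, 15, and 19:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def phi0(v):
    """Mean-0 normal density with variance v, evaluated at 0."""
    return 1.0 / np.sqrt(2.0 * np.pi * v)

def exact_mise(h, n, var_x, var_eps):
    """Exact MISE(h): our single-component, 1-d specialization of Theorem 3
    (normal f_X with variance var_x, normal error, normal kernel)."""
    return (phi0(2*h**2 + 2*var_eps) / n
            + (1 - 1/n) * phi0(2*h**2 + 2*var_eps + 2*var_x)
            - 2 * phi0(h**2 + 2*var_eps + 2*var_x)
            + phi0(2*var_eps + 2*var_x))

def h_star(n, var_x, var_eps):
    """Asymptotically optimal bandwidth for this case (our derivation)."""
    s = var_x + var_eps
    return np.sqrt(4.0 / (3.0 * n) * (s**2.5 / var_eps**1.5 - s))

n, var_x, var_eps = 100, 1.0, 1.0
res = minimize_scalar(exact_mise, bounds=(1e-6, 5.0), method="bounded",
                      args=(n, var_x, var_eps))
ratio = res.x / h_star(n, var_x, var_eps)   # h_Y / h*_Y, as plotted in Figure 4
```

For moderate n and error variance not too small, this ratio is close to 1, in line with the Normal panel of Figure 4.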
Figure 4: Ratio of the exact optimal bandwidth to the asymptotically optimal bandwidth (h_Y/h*_Y) as a function of n for the Normal (a) and Trimodal (b) densities, for error variances 2, 1, 0.5, 0.25, and 0.125. The convergence of this ratio to 1 varies with σ_ε (the error variance). For the Normal density the relationship between σ_ε and the convergence is fairly simple, while for the Trimodal the behavior is more complex.

Certain aspects of the convergence rate behavior in Figure 4 may be explained by considering asymptotics in σ_ε. At constant n, as σ_ε → 0, h_Y/h*_Y → 0. This can be seen by considering the limiting values (in σ_ε) of h_Y and h*_Y. Note that as σ_ε → 0, f_Y → f_X. Therefore h_Y → h_X, a positive constant. In contrast, as σ_ε → 0,

h*_Y = ( 2 ∫ (ωᵀΣ_Kω) dν(ω) / ( n ∫ (ωᵀΣ_Kω)² dµ(ω) ) )^{1/2} → ∞,

since without the damping factor |f̂_ε(ω)|² the numerator integral diverges. This behavior is seen in Figure 4 a): at fixed n, the smaller σ_ε, the smaller h_Y/h*_Y. For larger n the same ordering between σ_ε and h_Y/h*_Y holds for the Trimodal density (Figure 4 b)) as well. Consideration of asymptotic regimes in which n → ∞ and σ_ε → 0 simultaneously may explain this behavior more fully. The practical point is that when ε is concentrated around 0, h*_Y is far from h_Y. Thus plug-in estimators, which attempt to estimate h*_Y (rather than h_Y itself), may be suboptimal in terms of minimizing MISE. In Section 5 we observe this behavior with a "rule-of-thumb" estimator for h_Y when ε has small variance.

5 Estimator for h_Y and Real Data Example

Jones et al. [1996] define "rule-of-thumb" bandwidth selection procedures as any method which replaces unknown quantities in the asymptotically optimal bandwidth with estimated values based on a parametric family for the unknown density. We now propose a rule-of-thumb estimation method for h_Y. Recall from Equation (13) that the asymptotically optimal bandwidth is

h*_Y = ( 2 ∫ (ωᵀΣ_Kω) dν(ω) / ( n ∫ (ωᵀΣ_Kω)² dµ(ω) ) )^{1/2}
     = ( 2 ∫ (ωᵀΣ_Kω) |f̂_ε(ω)|² (1 − |f̂_X(ω)|²) dω / ( n ∫ (ωᵀΣ_Kω)² |f̂_X(ω)|² |f̂_ε(ω)|² dω ) )^{1/2}.   (17)

We specialize to one dimension, so Σ_K is a scalar and can be set to 1 without loss of generality. h*_Y depends on |f̂_X(ω)|², which is unknown. We replace this quantity by assuming (solely for the purposes of bandwidth estimation) that f_X is mean 0 normal. In this case |f̂_X(ω)|² = e^{−σ_X ω²}, where σ_X is the variance of f_X. σ_X is estimated with σ̃_X, the variance of the observations X₁, …, X_n.
Thus our rule-of-thumb bandwidth estimator for Berkson kernel density estimation is

h̃_Y = ( 2 ∫ ω² |f̂_ε(ω)|² (1 − e^{−σ̃_X ω²}) dω / ( n ∫ ω⁴ e^{−σ̃_X ω²} |f̂_ε(ω)|² dω ) )^{1/2}.   (18)

For the case where ε is mean 0 normal with variance σ_ε, |f̂_ε(ω)|² = e^{−σ_ε ω²} and Equation 18 simplifies to

h̃_Y = ( (4 / (3n)) [ (σ̃_X + σ_ε)^{5/2} / σ_ε^{3/2} − (σ_ε + σ̃_X) ] )^{1/2}.   (19)

We analyze data collected by Ferris Jr et al. [1979] concerning childhood exposure to NO₂, a known cause of respiratory illness. The goal is to determine the density of exposure to NO₂ for children living in Watertown, Massachusetts. Ferris Jr et al. [1979] collected kitchen and bedroom concentrations of NO₂ for 231 homes in Watertown. In this study, direct personal exposure to NO₂ was not observed.

Using data collected in Portage, Wisconsin and the Netherlands, Tosteson et al. [1989] modeled log personal exposure to NO₂ (Y) as a linear function of log kitchen (ln(W_k)) and log bedroom (ln(W_b)) concentrations plus random error. Specifically (see Table 1 of Tosteson et al. [1989]),

Y = 1.22 + 0.· ln(W_k) + 0.33 ln(W_b) + ε,

where ε ∼ N(0, 0.6) is independent of W_k and W_b. Let X = 1.22 + 0.· ln(W_k) + 0.33 ln(W_b). By assuming the same error model holds in Watertown as in Portage and the Netherlands (i.e. assuming portability of the error model), we can estimate the log personal exposure density in Watertown as the density of X plus independent noise, where we have 231 observed values of X.

We estimate the density of Y using the three smoothing approaches described in earlier sections: smoothing to optimize estimation of f_Y, smoothing to optimize estimation of f_X, and no smoothing. For no smoothing we simply convolve the observations with the error density. For smoothing to optimize estimation of f_X and f_Y we must select a kernel K and bandwidth estimation methods for h_X and h_Y. We use a Gaussian kernel. For estimating h_X we use "Silverman's rule of thumb," developed in Deheuvels [1977] and Silverman [1986],

h̃_X = 0.9 min( ĨQR_X / 1.34, σ̃_X ) n^{−1/5}.

Here ĨQR_X and σ̃_X are the estimated inter-quartile range and standard deviation of X. (This is the default bandwidth selection for the density function in the software package R, Version 3.0.1.) For estimating h_Y we use the rule-of-thumb estimator h̃_Y proposed in Equation (19).

In addition to studying the case ε ∼ N(0, 0.6), we study cases where ε is normal with variance 0.06 and 0.006 in order to study the robustness of the smoothing methods to different levels of Berkson error. In Figure 5 we plot the three estimators for a) ε ∼ N(0, 0.6), b) ε ∼ N(0, 0.06), and c) ε ∼ N(0, 0.006). In a) and b), where σ_ε = 0.6 and σ_ε = 0.06, the three estimates are broadly similar, although in b) no smoothing results in a somewhat higher mode around y = 3 than smoothing to optimize estimation of f_X or f_Y. It appears unlikely that the choice of smoothing would affect qualitative conclusions in these cases. In c), where σ_ε = 0.006, no smoothing severely under-regularizes the density estimate; in particular the estimate has four modes. h̃_Y oversmooths the estimate of f_Y. This is likely due to the fact that for fixed n, as σ_ε → 0, h*_Y → ∞ while h_Y → h_X (see Subsection 4.4). Since h̃_Y is an estimate of h*_Y, it oversmooths the density estimate in this case. Overall, with ε ∼ N(0, 0.006), h̃_X produces the best density estimate (blue dashed line).

Figure 5: Density estimates of log NO₂ exposure for children living in Watertown using three different error variances (h̃_X = 0.13 in all panels; h̃_Y = 0.05, 0.11, and 0.47 in a), b), and c) respectively). With error variances σ_ε = 0.6 and 0.06, plots a) and b) respectively, all three smoothing methods produce similar density estimates. In c), where σ_ε = 0.006, the methods produce very different estimates of f_Y.

In this work we compared different approaches to smoothing a density estimate subject to Berkson error. No smoothing (approach 3) achieved suboptimal asymptotic (at second order) and finite sample MISE. This was especially evident when the error term ε was concentrated near 0. Smoothing to optimize estimation of f_X (approach 2) resulted in suboptimal asymptotic (at first order) MISE rates. At finite samples, h_X oversmoothed density estimates when the error variance was large. More work is needed to develop estimators of h_Y. The asymptotically optimal bandwidth h*_Y derived in Equation (13) suggests one form for rule-of-thumb and plug-in type estimators. The simulations in Subsection 4.4 suggest that when the error is concentrated around 0, using a bandwidth which estimates h*_Y (the asymptotic approximation to h_Y) may oversmooth the density estimate. The rule-of-thumb estimator for h_Y we developed in Equation (19) displayed this behavior. Estimators for h_Y based on cross-validation or the bootstrap may perform better under a wider range of possible error distributions and should be developed. The development of bandwidth estimators for standard kernel density estimation (see Jones et al. [1996] for a review) suggests forms for such procedures.
Support from NSF grant 0941742 (Cyber-Enabled Discovery and Innovation), NSF grant DMS-0847647 (CAREER), and a fellowship from Citadel LLC is gratefully acknowledged. We thank Raymond Carroll for providing helpful comments on the manuscript and Len Stefanski for supplying the data set on children's NO₂ exposure.

References
J. Berkson. Are there two regressions? Journal of the American Statistical Association, 45(250):164-180, 1950.

J. Bovy, J. F. Hennawi, D. W. Hogg, A. D. Myers, J. A. Kirkpatrick, D. J. Schlegel, N. P. Ross, E. S. Sheldon, I. D. McGreer, D. P. Schneider, et al. Think outside the color box: Probabilistic target selection and the SDSS-XDQSO quasar targeting catalog. The Astrophysical Journal, 729(2):141, 2011.

R. Carroll, D. Ruppert, L. Stefanski, and C. M. Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, 2006.

R. Carroll, A. Delaigle, and P. Hall. Nonparametric prediction in measurement error models. Journal of the American Statistical Association, 104(487):993-1003, 2009.

P. Deheuvels. Estimation non paramétrique de la densité par histogrammes généralisés. Revue de Statistique Appliquée, 25(3):5-42, 1977.

A. Delaigle. Nonparametric density estimation from data with a mixture of Berkson and classical errors. Canadian Journal of Statistics, 35(1):89-104, 2007.

A. Delaigle. An alternative view of the deconvolution problem. Statistica Sinica, 18(3):1025-1045, 2008.

R. Durrett. Probability: Theory and Examples. Duxbury Advanced Series. Brooks/Cole, Belmont, USA, 2005.

C. H. Edwards Jr. Advanced Calculus of Several Variables. Dover Publications, 1973.

B. Ferris Jr, F. Speizer, J. Spengler, D. Dockery, Y. Bishop, M. Wolfson, C. Humble, et al. Effects of sulfur oxides and respirable particles on human health: Methodology and demography of populations in study. The American Review of Respiratory Disease, 120(4):767, 1979.

H. Henderson and S. Searle. Vec and vech operators for matrices, with some uses in Jacobians and multivariate statistics. Canadian Journal of Statistics, 7(1):65-81, 1979.

M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433):401-407, 1996.

J. P. Long. Prediction Methods for Astronomical Data Observed with Measurement Error. PhD thesis, University of California, Berkeley, 2013.

J. P. Long, N. El Karoui, J. A. Rice, J. W. Richards, and J. S. Bloom. Optimizing automated classification of variable stars in new synoptic surveys. Publications of the Astronomical Society of the Pacific, 124(913):280-295, 2012.

J. S. Marron and M. P. Wand. Exact mean integrated squared error. The Annals of Statistics, 20(2):712-736, 1992.

B. W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.

B. C. Sutradhar. On the characteristic function of multivariate Student t-distribution. Canadian Journal of Statistics, 14(4):329-337, 1986.

T. D. Tosteson, L. A. Stefanski, and D. W. Schafer. A measurement-error model for binary and ordinal regression. Statistics in Medicine, 8(9):1139-1147, 1989.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

N. G. Ushakov. Selected Topics in Characteristic Functions. De Gruyter Mouton, 1999.

M. P. Wand and M. C. Jones. Comparison of smoothing parameterizations in bivariate kernel density estimation. Journal of the American Statistical Association, pages 520-528, 1993.

M. P. Wand and M. C. Jones. Kernel Smoothing, volume 60. Chapman & Hall/CRC, 1995.
Supplementary Material to
Kernel Density Estimation with Berkson Error
S.1 Proofs of the Theorems
S.1.1 Proof of Theorem 1
We must show

(2π)^p MISE(H) = ∫ |1 − K̂(Hω)|² dµ(ω) + (1/n) ∫ |K̂(Hω)|² dν(ω)

where

dµ(ω) = |f̂_ε(ω)|² |f̂_X(ω)|² dω,   dν(ω) = |f̂_ε(ω)|² (1 − |f̂_X(ω)|²) dω.

Substituting for dµ(ω) and dν(ω), it suffices to show that

(2π)^p MISE(H) = ∫ |f̂_ε(ω)|² ( |1 − K̂(Hω)|² |f̂_X(ω)|² + (1/n) |K̂(Hω)|² (1 − |f̂_X(ω)|²) ) dω.   (S.1)

The functions f̂_Y and the Fourier transform of f̃_{Y,H} are in L¹ by assumption. They are also in L² because they are (products of) characteristic functions and thus bounded. Under these conditions, the Plancherel theorem (see Theorem 1.8.8 on page 57 in Ushakov [1999]) states

∫ (f_Y(y) − f̃_{Y,H}(y))² dy = (2π)^{−p} ∫ |f̂_Y(ω) − f̂̃_{Y,H}(ω)|² dω.   (S.2)

Let Pⁿ be the product measure on (X₁, …, X_n). Using the definition of MISE(H), Equation (S.2), and the facts f̂_Y(ω) = f̂_X(ω) f̂_ε(ω) and f̂̃_{Y,H}(ω) = K̂(Hω) f̂_ε(ω) f̂̃_X(ω), we have

MISE(H) = E_{Pⁿ} ∫ (f_Y(y) − f̃_{Y,H}(y))² dy
        = (2π)^{−p} E_{Pⁿ} ∫ |f̂_Y(ω) − f̂̃_{Y,H}(ω)|² dω
        = (2π)^{−p} E_{Pⁿ} ∫ |K̂(Hω) f̂_ε(ω) f̂̃_X(ω) − f̂_X(ω) f̂_ε(ω)|² dω
        = (2π)^{−p} E_{Pⁿ} ∫ |f̂_ε(ω)|² |f̂̃_X(ω) K̂(Hω) − f̂_X(ω)|² dω.

Note that the integrand is a non-negative function, so we may move the expectation inside the integral using Fubini's theorem:

(2π)^p MISE(H) = ∫ |f̂_ε(ω)|² E_{Pⁿ} |f̂̃_X(ω) K̂(Hω) − f̂_X(ω)|² dω.

Since it suffices to show Equation (S.1), all that is left is to show

E_{Pⁿ} |f̂̃_X(ω) K̂(Hω) − f̂_X(ω)|² = |1 − K̂(Hω)|² |f̂_X(ω)|² + (1/n) |K̂(Hω)|² (1 − |f̂_X(ω)|²).

This identity is shown in the proof of Theorem 1.4 on page 22 in Tsybakov [2009].
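In the Gaussian case, the frequency-domain expression of Theorem 1 can be evaluated by direct numerical integration and compared against the closed-form MISE that Theorem 3 yields. A quick consistency check in Python; the specialization to one dimension and a single normal component is our own:

```python
import numpy as np

n, h, var_x, var_eps = 50, 0.4, 1.0, 0.25   # sample size, bandwidth, Var(X), Var(eps)

def phi0(v):
    """Mean-0 normal density with variance v, evaluated at 0."""
    return 1.0 / np.sqrt(2.0 * np.pi * v)

# Right hand side of Theorem 1 (p = 1), evaluated by quadrature.
w = np.linspace(-60.0, 60.0, 400001)
dw = w[1] - w[0]
Khat = np.exp(-h**2 * w**2 / 2.0)     # Gaussian kernel characteristic function
fX2  = np.exp(-var_x * w**2)          # |f_X hat|^2 for a normal f_X
fe2  = np.exp(-var_eps * w**2)        # |f_eps hat|^2 for a normal error
mise_thm1 = (np.sum((1 - Khat)**2 * fe2 * fX2)
             + np.sum(Khat**2 * fe2 * (1 - fX2)) / n) * dw / (2.0 * np.pi)

# The same quantity from the Gaussian closed form (Theorem 3, one component).
mise_closed = (phi0(2*h**2 + 2*var_eps) / n
               + (1 - 1/n) * phi0(2*h**2 + 2*var_eps + 2*var_x)
               - 2 * phi0(h**2 + 2*var_eps + 2*var_x)
               + phi0(2*var_eps + 2*var_x))
```

The two numbers agree up to quadrature error, which is one way to sanity-check a reconstruction of either formula.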
S.1.2 Proof of Theorem 2
Recall that we are working under Assumptions A and B. This proof is divided into three parts. In Part 1 we show f̂_Y, f̂̃_{Y,H} ∈ L¹, which satisfies the conditions of Theorem 1 and implies

(2π)^p MISE(H) = ∫ |1 − K̂(Hω)|² dµ(ω) + (1/n) ∫ |K̂(Hω)|² dν(ω).   (S.3)

In Part 2 we expand the first term on the right hand side of Equation (S.3) to show

∫ |1 − K̂(Hω)|² dµ(ω) = ( (1/4) ∫ (ωᵀHᵀΣ_KHω)² dµ(ω) ) (1 + O(‖H‖²_∞)).   (S.4)

In Part 3 we expand the second term on the right hand side of Equation (S.3) to show

(1/n) ∫ |K̂(Hω)|² dν(ω) = (1/n) ∫ dν(ω) − ( (1/n) ∫ (ωᵀHᵀΣ_KHω) dν(ω) ) (1 + O(‖H‖²_∞)).   (S.5)

Summing Equations (S.4) and (S.5) we have the result

(2π)^p MISE(H) = (1/n) ∫ dν(ω) + ( (1/4) ∫ (ωᵀHᵀΣ_KHω)² dµ(ω) − (1/n) ∫ (ωᵀHᵀΣ_KHω) dν(ω) ) (1 + O(‖H‖²_∞)).

Part 1: f̂_Y, f̂̃_{Y,H} ∈ L¹. Since the modulus of a characteristic function is bounded by 1,

|f̂_Y(ω)| = |f̂_X(ω) f̂_ε(ω)| ≤ |f̂_ε(ω)|,   |f̂̃_{Y,H}(ω)| = |K̂(Hω) f̂_ε(ω) f̂̃_X(ω)| ≤ |f̂_ε(ω)|.

f̂_ε ∈ L¹ by Assumption (10), implying f̂_Y, f̂̃_{Y,H} ∈ L¹.

Part 2: Bias. By Lemma 1 on p. S.6 there exists R satisfying

|R(ω)| ≤ C ‖ω‖⁴_∞   (S.6)

such that

K̂(ω) = 1 − (1/2) ωᵀΣ_Kω + R(ω).   (S.7)

Note that the kernel K is symmetric, so K̂ and R are real valued functions. Then

∫ |1 − K̂(Hω)|² dµ(ω) = ∫ ( (1/2) ωᵀHᵀΣ_KHω − R(Hω) )² dµ(ω)
  = (1/4) ∫ (ωᵀHᵀΣ_KHω)² dµ(ω) − ∫ R(Hω)(ωᵀHᵀΣ_KHω) dµ(ω)   (S.8)
  + ∫ R(Hω)² dµ(ω).   (S.9)

We have split the integrals formally. We now show that Expressions (S.8) and (S.9) are O(‖H‖⁶_∞) and O(‖H‖⁸_∞) by bounding their integrands. Using the bound |R(ω)| ≤ C‖ω‖⁴_∞ (Equation (S.6)), for some E we have

|R(Hω)(ωᵀHᵀΣ_KHω)| ≤ C ‖Hω‖⁴_∞ |ωᵀHᵀΣ_KHω| ≤ E ‖H‖⁶_∞ ‖ω‖⁶_∞,   |R(Hω)²| ≤ C² ‖Hω‖⁸_∞ ≤ E ‖H‖⁸_∞ ‖ω‖⁸_∞.

Using the definition of dµ(ω) and the fact ∫ ‖ω‖⁸_∞ |f̂_ε(ω)|² dω < ∞ (Assumption (9)), we have

∫ ‖ω‖⁸_∞ dµ(ω) = ∫ ‖ω‖⁸_∞ |f̂_X(ω)|² |f̂_ε(ω)|² dω ≤ ∫ ‖ω‖⁸_∞ |f̂_ε(ω)|² dω < ∞.

So Expressions (S.8) and (S.9) are O(‖H‖⁶_∞) and O(‖H‖⁸_∞) respectively. Thus

∫ |1 − K̂(Hω)|² dµ(ω) = ( (1/4) ∫ (ωᵀHᵀΣ_KHω)² dµ(ω) ) (1 + O(‖H‖²_∞)).

Part 3: Variance. Using the expansion of K̂ in Equation (S.7) we have

(1/n) ∫ |K̂(Hω)|² dν(ω) = (1/n) ∫ ( 1 − (1/2) ωᵀHᵀΣ_KHω + R(Hω) )² dν(ω).

Expanding the right hand side we have

(1/n) ∫ ( 1 − (1/2) ωᵀHᵀΣ_KHω + R(Hω) )² dν(ω) = (1/n) ( ∫ dν(ω)   (S.10)
  − ∫ (ωᵀHᵀΣ_KHω) dν(ω)   (S.11)
  + (1/4) ∫ (ωᵀHᵀΣ_KHω)² dν(ω)   (S.12)
  − ∫ R(Hω)(ωᵀHᵀΣ_KHω) dν(ω)   (S.13)
  + 2 ∫ R(Hω) dν(ω)   (S.14)
  + ∫ R(Hω)² dν(ω) ).   (S.15)

We have split the integral formally. Using the bound |R(ω)| ≤ C‖ω‖⁴_∞ (Equation (S.6)), we bound the integrands of Expressions (S.12), (S.13), (S.14), and (S.15). For some F we have

|(ωᵀHᵀΣ_KHω)²| ≤ F ‖ω‖⁴_∞ ‖H‖⁴_∞,   |R(Hω)(ωᵀHᵀΣ_KHω)| ≤ F ‖ω‖⁶_∞ ‖H‖⁶_∞,
|R(Hω)| ≤ F ‖ω‖⁴_∞ ‖H‖⁴_∞,   |R(Hω)²| ≤ F ‖ω‖⁸_∞ ‖H‖⁸_∞.

Note that by the definition of dν(ω) and the fact ∫ ‖ω‖⁸_∞ |f̂_ε(ω)|² dω < ∞ (Assumption (9)) we have

∫ ‖ω‖⁸_∞ dν(ω) = ∫ ‖ω‖⁸_∞ |f̂_ε(ω)|² dω − ∫ ‖ω‖⁸_∞ |f̂_ε(ω)|² |f̂_X(ω)|² dω < ∞.

So Expressions (S.12), (S.13), (S.14), and (S.15) are all integrable and O(‖H‖⁴_∞). Thus

(1/n) ∫ |K̂(Hω)|² dν(ω) = (1/n) ∫ dν(ω) − ( (1/n) ∫ (ωᵀHᵀΣ_KHω) dν(ω) ) (1 + O(‖H‖²_∞)).

S.1.3 Proof of Theorem 3
Recall that K = φ_{Σ_K}, f_ε = φ_{Σ_ε}, and

f_X(x) = Σ_{j=1}^m α_j φ_{Σ_j}(x − μ_j).   (S.16)

Let α = (α₁, …, α_m)ᵀ, S = HᵀΣ_KH, and for a ∈ {0, 1, 2} let Ω_a be the m × m matrix with (j, j′) entry equal to

φ_{aS + 2Σ_ε + Σ_j + Σ_{j′}}(μ_j − μ_{j′}).   (S.17)

In Variance we show

∫ Var(f̃_{Y,H}(y)) dy = (1/n) ( φ_{2S+2Σ_ε}(0) − αᵀΩ₂α ).   (S.18)

In Bias we show

∫ ( E[f̃_{Y,H}(y)] − f_Y(y) )² dy = αᵀ(Ω₂ − 2Ω₁ + Ω₀)α.   (S.19)

By the bias-variance decomposition of the MISE, we can sum Equations (S.18) and (S.19) to obtain the result

MISE(H) = (1/n) φ_{2S+2Σ_ε}(0) + αᵀ( (1 − n⁻¹)Ω₂ − 2Ω₁ + Ω₀ )α.

First note that since ε and K_H are both normal, the estimator f̃_{Y,H} has a simple form, specifically

f̃_{Y,H}(y) = (1/n) Σ_{i=1}^n φ_{S+Σ_ε}(y − X_i).   (S.20)

Second note that (see e.g. A.2 on p. 527 in Wand and Jones [1993]) for any two covariance matrices Σ and Σ′ and mean vectors μ and μ′ we have

∫ φ_Σ(x − μ) φ_{Σ′}(x − μ′) dx = φ_{Σ+Σ′}(μ − μ′).   (S.21)

Variance: Using Equation (S.20), we have

∫ Var(f̃_{Y,H}(y)) dy = (1/n) ( ∫ E[φ_{S+Σ_ε}(y − X)²] dy − ∫ E[φ_{S+Σ_ε}(y − X)]² dy ).   (S.22)

We now simplify each term in the parentheses on the right hand side of this equation. Using Equation (S.21) for the last equality, for the first term on the right hand side of Equation (S.22) we have

∫ E[φ_{S+Σ_ε}(y − X)²] dy = ∫∫ φ_{S+Σ_ε}(y − x)² f_X(x) dx dy = ∫ φ_{S+Σ_ε}(y)² dy = φ_{2S+2Σ_ε}(0).

Recalling the representation of f_X in Equation (S.16), the identity in Equation (S.21), and the definition of Ω₂ in Equation (S.17), for the second term on the right hand side of Equation (S.22) we have

∫ E[φ_{S+Σ_ε}(y − X)]² dy = ∫ ( ∫ φ_{S+Σ_ε}(y − x) f_X(x) dx )² dy
  = ∫ ( ∫ φ_{S+Σ_ε}(y − x) Σ_{j=1}^m α_j φ_{Σ_j}(x − μ_j) dx )² dy
  = ∫ ( Σ_{j=1}^m α_j φ_{S+Σ_ε+Σ_j}(y − μ_j) )² dy
  = Σ_{j=1}^m Σ_{j′=1}^m α_j α_{j′} ∫ φ_{S+Σ_ε+Σ_j}(y − μ_j) φ_{S+Σ_ε+Σ_{j′}}(y − μ_{j′}) dy
  = Σ_{j=1}^m Σ_{j′=1}^m α_j α_{j′} φ_{2S+2Σ_ε+Σ_j+Σ_{j′}}(μ_j − μ_{j′})
  = αᵀΩ₂α.

Hence

∫ Var(f̃_{Y,H}(y)) dy = (1/n) ( φ_{2S+2Σ_ε}(0) − αᵀΩ₂α ).

Bias: Recalling f_ε = φ_{Σ_ε} and the representations of f_X and f̃_{Y,H} in Equations (S.16) and (S.20), note

f_Y(y) = ∫ f_X(y − ε) f_ε(ε) dε = Σ_{j=1}^m α_j ∫ φ_{Σ_j}(y − ε − μ_j) φ_{Σ_ε}(ε) dε = Σ_{j=1}^m α_j φ_{Σ_j+Σ_ε}(y − μ_j),

E[f̃_{Y,H}(y)] = ∫ φ_{S+Σ_ε}(y − x) Σ_{j=1}^m α_j φ_{Σ_j}(x − μ_j) dx = Σ_{j=1}^m α_j φ_{S+Σ_j+Σ_ε}(y − μ_j).

Using these identities and the definition of Ω_a (Equation (S.17)), for the integrated squared bias we have

∫ ( E[f̃_{Y,H}(y)] − f_Y(y) )² dy = ∫ ( Σ_{j=1}^m α_j ( φ_{S+Σ_j+Σ_ε}(y − μ_j) − φ_{Σ_j+Σ_ε}(y − μ_j) ) )² dy
  = Σ_j Σ_{j′} α_j α_{j′} ∫ ( φ_{S+Σ_j+Σ_ε}(y − μ_j) φ_{S+Σ_{j′}+Σ_ε}(y − μ_{j′})
      − φ_{S+Σ_j+Σ_ε}(y − μ_j) φ_{Σ_{j′}+Σ_ε}(y − μ_{j′})
      − φ_{S+Σ_{j′}+Σ_ε}(y − μ_{j′}) φ_{Σ_j+Σ_ε}(y − μ_j)
      + φ_{Σ_j+Σ_ε}(y − μ_j) φ_{Σ_{j′}+Σ_ε}(y − μ_{j′}) ) dy
  = Σ_j Σ_{j′} α_j α_{j′} ( φ_{2S+2Σ_ε+Σ_j+Σ_{j′}}(μ_j − μ_{j′}) − 2 φ_{S+2Σ_ε+Σ_j+Σ_{j′}}(μ_j − μ_{j′}) + φ_{2Σ_ε+Σ_j+Σ_{j′}}(μ_j − μ_{j′}) )
  = αᵀ(Ω₂ − 2Ω₁ + Ω₀)α.

S.1.4 Lemmas
Lemma 1.
Under Assumptions A, K is a symmetric density function in ℝ^p with a characteristic function K̂ that is four times continuously differentiable. Let Σ_K be the variance of K. We Taylor expand K̂ around 0, obtaining

K̂(ω) = 1 − (1/2) ωᵀΣ_Kω + R(ω).

There exists C such that for any ω,

|R(ω)| ≤ C ‖ω‖⁴_∞.

Proof. We bound the remainder term R(ω) by considering two cases.

1. {ω : ‖ω‖_∞ ≤ 1}: Since K̂ is four times continuously differentiable, there exists D such that for any multi-index j with Σ_{k=1}^p j_k = 4 and all ‖ω‖_∞ ≤ 1,

| ∂⁴K̂ / (∂ω₁^{j₁} ⋯ ∂ω_p^{j_p}) (ω) | < D.   (S.23)

Using the mean value form of the Taylor remainder (see e.g. Theorem 7.1 on page 131 in Edwards Jr [1973]), we have

R(ω) = Σ_{ {j : Σ_k j_k = 4} } ( ∂⁴K̂ / (∂ω₁^{j₁} ⋯ ∂ω_p^{j_p}) )(ξ) Π_{k=1}^p ω_k^{j_k} / j_k!

for some ξ = tω with t ∈ [0, 1]. Since |Π_{k=1}^p ω_k^{j_k}| ≤ ‖ω‖⁴_∞, for some C we have |R(ω)| ≤ C ‖ω‖⁴_∞.

2. {ω : ‖ω‖_∞ > 1}: Note that for some D, ωᵀΣ_Kω ≤ D ‖ω‖²_∞. Also note that on the set ‖ω‖_∞ > 1 we have ‖ω‖²_∞ ≤ ‖ω‖⁴_∞. We have

|R(ω)| = | K̂(ω) − 1 + (1/2) ωᵀΣ_Kω | ≤ |K̂(ω)| + 1 + (1/2) |ωᵀΣ_Kω| ≤ 2 + (D/2) ‖ω‖²_∞ ≤ (2 + D) ‖ω‖⁴_∞.

S.2 Technical Notes
S.2.1 Full Bandwidth Matrix Optimization
In Theorem 2 on p. 7, the MISE (using a full bandwidth matrix) is
\[
(2\pi)^p\, \mathrm{MISE} = \frac{1}{n} \int d\nu(\omega) + \left( \frac{1}{4} \int (\omega^T S \omega)^2\, d\mu(\omega) - \frac{1}{n} \int \omega^T S \omega\, d\nu(\omega) \right) \big( 1 + O(\|H\|_\infty^2) \big),
\]
where $S = H^T \Sigma_K H$. Using vec notation, the identity $\mathrm{vec}(EFG) = (G^T \otimes E)\, \mathrm{vec}(F)$, where $\otimes$ denotes the Kronecker product (see Equation 5 on page 67 in Henderson and Searle [1979]), and the resulting fact that $(\omega^T S \omega)^2 = \mathrm{vec}(S)^T (\omega \otimes \omega)(\omega \otimes \omega)^T\, \mathrm{vec}(S)$, we write the optimization problem for $S$ as
\[
S^*_Y = \operatorname*{argmin}_{S \succeq 0}\; \mathrm{vec}(S)^T B\, \mathrm{vec}(S) - \frac{1}{n}\, \mathrm{vec}(S)^T V \tag{S.24}
\]
where
\[
B = \frac{1}{4} \int (\omega \otimes \omega)(\omega \otimes \omega)^T\, d\mu(\omega), \qquad V = \int (\omega \otimes \omega)\, d\nu(\omega).
\]
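As a quick numerical sketch (not part of the original derivation), the two linear-algebra facts used here can be checked directly: the vec/Kronecker identity of Henderson and Searle [1979], and the rewriting of $(\omega^T S \omega)^2$ as a quadratic form in $\mathrm{vec}(S)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Identity vec(EFG) = (G^T kron E) vec(F), with vec stacking columns.
vec = lambda M: M.reshape(-1, order="F")
E, F, G = (rng.standard_normal((3, 3)) for _ in range(3))
assert np.allclose(vec(E @ F @ G), np.kron(G.T, E) @ vec(F))

# Hence (w^T S w)^2 = vec(S)^T (w kron w)(w kron w)^T vec(S),
# which is what turns the squared-bias term into the quadratic form in (S.24).
w = rng.standard_normal(3)
S = F + F.T                                   # an arbitrary symmetric S
lhs = (w @ S @ w) ** 2
rhs = vec(S) @ np.outer(np.kron(w, w), np.kron(w, w)) @ vec(S)
assert np.allclose(lhs, rhs)
```

Both checks hold for any conformable matrices; the random draws above are purely illustrative.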
It is important to note that $B$ and $V$ cannot be computed from the data because they depend on the unknown function $\widehat{f}_X(\omega)$. In practice we could use plug-in estimators to approximate these integrals.

The unconstrained solution to optimization problem (S.24) may not be positive semidefinite, so we cannot omit the $S \succeq 0$ constraint and simply check afterward whether the unconstrained minimizer $S^*_Y$ is positive semidefinite. In other words, the following procedure is not valid: define
\[
g(\mathrm{vec}(S)) \equiv \mathrm{vec}(S)^T B\, \mathrm{vec}(S) - \frac{1}{n}\, \mathrm{vec}(S)^T V
\;\Longrightarrow\;
\nabla g(\mathrm{vec}(S)) = 2 B\, \mathrm{vec}(S) - \frac{1}{n} V.
\]
Setting the gradient equal to $0$ and solving, we have
\[
\mathrm{vec}(S^*_Y) = \frac{1}{2n} B^{-1} V.
\]
One could then check whether $S^*_Y \succeq 0$. This procedure is not valid because $B$ is not invertible. To see that $B$ is not invertible, note that the vector $(\omega \otimes \omega)$ has $p^2$ elements, but not $p^2$ unique elements. For example, when $p = 2$, $(\omega \otimes \omega) = (\omega_1^2,\, \omega_1\omega_2,\, \omega_2\omega_1,\, \omega_2^2)^T$. When the $j$th and $k$th elements of $(\omega \otimes \omega)$ are equal, the $j$th and $k$th rows of $(\omega \otimes \omega)(\omega \otimes \omega)^T$ are equal. Thus at least two rows of $B \equiv \frac{1}{4}\int (\omega \otimes \omega)(\omega \otimes \omega)^T\, d\mu(\omega)$ are equal, implying that $B$ cannot be inverted.

S.2.2 Diagonal Bandwidth and $\Sigma_K = I$

In Equation (11) the full bandwidth matrix asymptotic expansion of the MISE was presented. By restricting the kernel to have $\Sigma_K = I$ and the bandwidth matrix to be diagonal we achieve considerable simplification of the MISE. Let $h_i = H_{ii}$ and $s = (h_1^2, \ldots, h_p^2)$. The MISE (Equation (11)) becomes
\[
(2\pi)^p\, \mathrm{MISE}(s) = \frac{1}{n} \int d\nu(\omega) + \left( s^T B s - \frac{1}{n}\, s^T V \right) \big( 1 + O(\|s\|_\infty) \big),
\]
where
\[
B_{i,j} = \frac{1}{4} \int \omega_i^2 \omega_j^2\, d\mu(\omega), \qquad V_i = \int \omega_i^2\, d\nu(\omega).
\]
We seek the $s$ which minimizes the larger order terms in the MISE expression. In other words we seek
\[
s^*_Y = \operatorname*{argmin}_{s \ge 0}\; \left( s^T B s - \frac{1}{n}\, s^T V \right). \tag{S.25}
\]
$B$ is positive definite, so the expression is strictly convex and there is a unique solution. Enforcing the domain restriction $s \ge 0$ is necessary because the unconstrained minimizer $\frac{1}{2n} B^{-1} V$ may have elements less than $0$. In the following paragraphs we work through an example where $f_X$ and $f_\epsilon$ are bivariate independent normals, with $\epsilon$ having small variance along one direction. The kernel is normal with identity covariance. The normality is not essential for this example, but makes the computations simpler.
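Before turning to the example, the non-invertibility of $B$ noted in Section S.2.1 is easy to exhibit numerically. In the sketch below the measure $\mu$ is replaced by a standard-normal Monte Carlo average purely for illustration; the rank defect holds for any $\mu$.

```python
import numpy as np

rng = np.random.default_rng(1)

# For p = 2, the 2nd and 3rd coordinates of (w kron w) are both w1*w2, so the
# 2nd and 3rd rows of B = (1/4) int (w kron w)(w kron w)^T dmu coincide for any
# measure mu.  Here dmu is approximated by a standard-normal Monte Carlo average.
W = rng.standard_normal((50000, 2))
ww = (W[:, :, None] * W[:, None, :]).reshape(-1, 4)   # each row is w kron w
B = 0.25 * (ww.T @ ww) / len(W)

assert np.allclose(B[1], B[2])           # duplicated rows ...
assert np.linalg.matrix_rank(B) == 3     # ... so B is singular (rank 3, not 4)
```

This is why the diagonal parameterization of the next subsection, which drops the duplicated coordinate, leads to an invertible system.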
We begin by showing that the optimal bandwidth matrix is diagonal, implying that optimizing over the full bandwidth matrix and over the diagonal matrix are equivalent. We then show that when optimizing over the unconstrained diagonal matrix, the direction in which $\epsilon$ has larger variance yields a "negative squared bandwidth". Consider:
\[
f_X \sim N(0,\, I_{2\times 2}), \qquad f_\epsilon \sim N\!\left(0,\, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}\right), \qquad \Sigma_K \equiv \int x x^T K(x)\, dx = I_{2\times 2}.
\]
We parameterize the symmetric bandwidth matrix using
\[
H = \begin{pmatrix} h_{11} & h_{12} \\ h_{12} & h_{22} \end{pmatrix}.
\]
First consider optimizing over the entire bandwidth matrix, Equation (S.24). In our case $S \equiv H^T \Sigma_K H = H^T H$,
\[
B = \frac{1}{4} \int (\omega \otimes \omega)(\omega \otimes \omega)^T\, d\mu(\omega)
  = \frac{1}{4} \int \begin{pmatrix}
  \omega_1^4 & \omega_1^3\omega_2 & \omega_1^3\omega_2 & \omega_1^2\omega_2^2 \\
  \omega_1^3\omega_2 & \omega_1^2\omega_2^2 & \omega_1^2\omega_2^2 & \omega_1\omega_2^3 \\
  \omega_1^3\omega_2 & \omega_1^2\omega_2^2 & \omega_1^2\omega_2^2 & \omega_1\omega_2^3 \\
  \omega_1^2\omega_2^2 & \omega_1\omega_2^3 & \omega_1\omega_2^3 & \omega_2^4
  \end{pmatrix} d\mu(\omega),
\]
\[
V = \int (\omega \otimes \omega)\, d\nu(\omega) = \int \begin{pmatrix} \omega_1^2 \\ \omega_1\omega_2 \\ \omega_1\omega_2 \\ \omega_2^2 \end{pmatrix} d\nu(\omega).
\]
So the objective in Equation (S.24) becomes
\[
\mathrm{vec}(H^T H)^T\, B\, \mathrm{vec}(H^T H) - \frac{1}{n}\, \mathrm{vec}(H^T H)^T\, V.
\]
The integration causes those terms involving odd powers of $\omega_i$ to be $0$ by independence and symmetry of $d\nu(\omega)$ and $d\mu(\omega)$. Additionally, the center $\omega_1^2\omega_2^2$ terms, which multiply the off-diagonal element of $H^T H$, are moved outside the main expression and into the third term. We have
\[
\mathrm{vec}(H^T H)^T\, \frac{1}{4} \int \begin{pmatrix}
\omega_1^4 & 0 & 0 & \omega_1^2\omega_2^2 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \omega_1^2\omega_2^2 & 0 & 0 & \omega_2^4
\end{pmatrix} d\mu(\omega)\, \mathrm{vec}(H^T H)
- \frac{1}{n}\, \mathrm{vec}(H^T H)^T \int \begin{pmatrix} \omega_1^2 \\ 0 \\ 0 \\ \omega_2^2 \end{pmatrix} d\nu(\omega)
+ \big( h_{12}(h_{11} + h_{22}) \big)^2 \int \omega_1^2 \omega_2^2\, d\mu(\omega).
\]
Since
\[
H^T H = \begin{pmatrix} h_{11}^2 + h_{12}^2 & h_{12}(h_{11} + h_{22}) \\ h_{12}(h_{11} + h_{22}) & h_{12}^2 + h_{22}^2 \end{pmatrix},
\]
minimization of the first two terms depends only on $(h_{11}^2 + h_{12}^2,\; h_{12}^2 + h_{22}^2)$. So by setting $h_{12} = 0$ we make the third term in the expression $0$, without restricting minimization of the first two terms. Thus for the general bandwidth matrix the minimum occurs when the off-diagonal elements are $0$.

Now let $s = (h_{11}^2,\, h_{22}^2)$.
We study the diagonal optimization problem (S.25),
\[
s^*_Y = \operatorname*{argmin}_{s}\; s^T B' s - \frac{1}{n}\, s^T V',
\]
where
\begin{align*}
B'_{i,j} &= \frac{1}{4} \int \omega_i^2 \omega_j^2\, d\mu(\omega) = \frac{1}{4} \int \omega_i^2 \omega_j^2\, |\widehat{f}_X(\omega)|^2\, |\widehat{f}_\epsilon(\omega)|^2\, d\omega, \\
V'_i &= \int \omega_i^2\, d\nu(\omega) = \int \omega_i^2\, |\widehat{f}_\epsilon(\omega)|^2\, d\omega - \int \omega_i^2\, |\widehat{f}_X(\omega)|^2\, |\widehat{f}_\epsilon(\omega)|^2\, d\omega.
\end{align*}
With no restrictions on $s$, the optimum is $s^*_Y = \frac{1}{2n} B'^{-1} V'$. We now compute this quantity for the given densities. First compute $B'$:
\begin{align*}
4 B'_{11} &= \int \omega_1^4\, |\widehat{f}_X(\omega)|^2 |\widehat{f}_\epsilon(\omega)|^2\, d\omega = \left( \frac{3}{4} \sqrt{\frac{\pi}{(1 + \sigma_1^2)^5}} \right) \left( \sqrt{\frac{\pi}{1 + \sigma_2^2}} \right), \\
4 B'_{22} &= \int \omega_2^4\, |\widehat{f}_X(\omega)|^2 |\widehat{f}_\epsilon(\omega)|^2\, d\omega = \left( \frac{3}{4} \sqrt{\frac{\pi}{(1 + \sigma_2^2)^5}} \right) \left( \sqrt{\frac{\pi}{1 + \sigma_1^2}} \right), \\
4 B'_{12} &= \int \omega_1^2 \omega_2^2\, |\widehat{f}_X(\omega)|^2 |\widehat{f}_\epsilon(\omega)|^2\, d\omega = \left( \frac{1}{2} \sqrt{\frac{\pi}{(1 + \sigma_1^2)^3}} \right) \left( \frac{1}{2} \sqrt{\frac{\pi}{(1 + \sigma_2^2)^3}} \right).
\end{align*}
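The computation can also be finished numerically. The sketch below (with illustrative constants $\sigma_1 = 0.1$, $\sigma_2 = 2$, $n = 100$, not taken from the paper) evaluates $B'$ and $V'$ by one-dimensional quadrature and confirms that the unconstrained solution $\frac{1}{2n} B'^{-1} V'$ has a negative component in the direction where $\epsilon$ has the larger variance.

```python
import numpy as np

# f_X ~ N(0, I_2), f_eps ~ N(0, diag(sig1^2, sig2^2)), so the squared moduli of
# the characteristic functions factor across coordinates:
# |fhat_X(w)|^2 |fhat_eps(w)|^2 = exp(-(1 + sig_i^2) w_i^2) in coordinate i.
sig1, sig2, n = 0.1, 2.0, 100                 # illustrative values only

w = np.linspace(-60, 60, 240001)              # 1-d quadrature grid
dw = w[1] - w[0]
g = lambda a, k: np.sum(w**k * np.exp(-a * w**2)) * dw   # int w^k e^{-a w^2} dw

a1, a2 = 1 + sig1**2, 1 + sig2**2
# B'_{ij} = (1/4) int w_i^2 w_j^2 dmu,  dmu = |fhat_X|^2 |fhat_eps|^2 dw
Bp = 0.25 * np.array([[g(a1, 4) * g(a2, 0), g(a1, 2) * g(a2, 2)],
                      [g(a1, 2) * g(a2, 2), g(a1, 0) * g(a2, 4)]])
# V'_i = int w_i^2 dnu,  dnu = (|fhat_eps|^2 - |fhat_X|^2 |fhat_eps|^2) dw
Vp = np.array([g(sig1**2, 2) * g(sig2**2, 0) - g(a1, 2) * g(a2, 0),
               g(sig1**2, 0) * g(sig2**2, 2) - g(a1, 0) * g(a2, 2)])

s_star = np.linalg.solve(Bp, Vp) / (2 * n)
# Direction 2 has the larger error variance and gets a negative "squared bandwidth".
assert s_star[0] > 0 and s_star[1] < 0
```

Because the error in direction 2 already smooths $f_X$ heavily, the unconstrained optimum asks for "negative" smoothing there, which is exactly why the constraint $s \ge 0$ in (S.25) binds.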