Density Deconvolution with Non-Standard Error Distributions: Rates of Convergence and Adaptive Estimation
Alexander Goldenshluger & Taeho Kim
Department of Statistics, University of Haifa, Haifa 3498838, Israel
Abstract.
It is a standard assumption in the density deconvolution problem that the characteristic function of the measurement error distribution is non-zero on the real line. While this condition is assumed in the majority of existing works on the topic, there are many problem instances of interest where it is violated. In this paper we focus on non-standard settings where the characteristic function of the measurement errors has zeros, and study how the multiplicity of the zeros affects the estimation accuracy. For a prototypical problem of this type we demonstrate that the best achievable estimation accuracy is determined by the multiplicity of the zeros, the rate of decay of the error characteristic function, and the smoothness and tail behavior of the estimated density. We derive lower bounds on the minimax risk and develop estimators that are optimal in the minimax sense. In addition, we consider the problem of adaptive estimation and propose a data-driven estimator that automatically adapts to the unknown smoothness and tail behavior of the density to be estimated.
Keywords and phrases:
Density Deconvolution, Minimax Risk, Characteristic Function,Laplace Transform, Non-standard Measurement Error, Zero Multiplicity
E-mail address: [email protected]; [email protected]

∗ The research was supported by the Israel Science Foundation (ISF) research grant.

Contents
1. Introduction
2. Estimator Construction
3. Minimax Results
4. Adaptive Procedure
5. Concluding Remarks
A.1 Proof of Theorem 1
A.2 Proof of Theorem 2
A.3 Proof of Corollary 1
A.4 Proof of Theorem 3
A.5 Proof of Corollary 2
A.6 Auxiliary Results
1. Introduction
Density deconvolution is the problem of estimating a probability density from observations with additive measurement errors. Specifically, assume that we observe a random sample $Y_1,\dots,Y_n$ generated by the model
$$ Y_i = X_i + \epsilon_i, \qquad i = 1,2,\dots,n, $$
where the $X_i$'s are i.i.d. random variables with unknown density $f$ with respect to the Lebesgue measure on $\mathbb{R}$, the $\epsilon_i$'s are i.i.d. measurement errors with distribution function $G$, and the $X_i$'s are independent of the $\epsilon_i$'s. The objective is to estimate $f$ on the basis of the sample $\mathcal{Y}_n := \{Y_1,\dots,Y_n\}$. Since $Y_i$ is the sum of two independent random variables, $X_i$ and $\epsilon_i$, the density $f_Y$ of $Y_i$ is given by the convolution
$$ f_Y(y) = (f \star \mathrm{d}G)(y) = \int_{-\infty}^{\infty} f(y-x)\,\mathrm{d}G(x). \tag{1.1} $$
An estimator of the value $f(x)$ is a measurable function of $\mathcal{Y}_n$, $\hat f(x) = \hat f(x;\mathcal{Y}_n)$, and the risk of $\hat f(x)$ is
$$ \mathcal{R}_n[\hat f, f] := \big[\mathbb{E}_f |\hat f(x) - f(x)|^2\big]^{1/2}, $$
where $\mathbb{E}_f$ stands for the expectation with respect to the probability measure $\mathbb{P}_f$ generated by the observations $\mathcal{Y}_n$ when the unknown density of the $X_i$'s is $f$. For a particular functional class $\mathscr{F}$, the accuracy of $\hat f(x)$ is measured by the maximal risk
$$ \mathcal{R}_n[\hat f; \mathscr{F}] := \sup_{f\in\mathscr{F}} \mathcal{R}_n[\hat f, f], $$
and an estimator $\hat f_*(x)$ is called rate-optimal (optimal in order) on $\mathscr{F}$ if
$$ \mathcal{R}_n[\hat f_*; \mathscr{F}] \asymp \mathcal{R}^*_n[\mathscr{F}] := \inf_{\hat f} \mathcal{R}_n[\hat f; \mathscr{F}], \quad n\to\infty. $$
Here $\mathcal{R}^*_n[\mathscr{F}]$ is the minimax risk, and the infimum in its definition is taken over all possible estimators of $f(x)$. The objective in the density deconvolution problem is to construct an optimal-in-order estimator and to study the rate at which the minimax risk $\mathcal{R}^*_n[\mathscr{F}]$ converges to zero as $n\to\infty$.
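To fix ideas, the following small simulation sketch (not from the paper; all names and constants are illustrative) generates data from the model with $X \sim N(0,1)$ and uniform errors, and checks numerically that the density of $Y = X + \epsilon$ at a point agrees with the convolution integral (1.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: Y = X + eps, X ~ N(0,1), eps ~ Uniform[-theta, theta].
theta, n = 1.0, 200_000
X = rng.standard_normal(n)
eps = rng.uniform(-theta, theta, n)
Y = X + eps

def f(x):
    # Standard normal density: the (in practice unknown) target density.
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def trap(y, x):
    # Simple trapezoid rule (avoids version-dependent numpy integrators).
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

# f_Y(y0) = (f * dG)(y0): for uniform errors, the average of f over
# [y0 - theta, y0 + theta].
y0 = 0.5
grid = np.linspace(y0 - theta, y0 + theta, 20_001)
fY_numeric = trap(f(grid), grid) / (2 * theta)

# Empirical density of Y at y0 via a crude histogram-type estimate.
h = 0.05
fY_empirical = np.mean(np.abs(Y - y0) < h) / (2 * h)
assert abs(fY_numeric - fY_empirical) < 0.02
```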
In what follows we refer to the latter as the minimax rate of convergence.

The outlined problem is the subject of a vast literature under various assumptions on the functional class $\mathscr{F}$ and the distribution of measurement errors $G$; see, e.g., Carroll and Hall [5], Stefanski and Carroll [21], Zhang [22], Fan [8], Butucea and Tsybakov [3, 4], Meister [19], and Lounici and Nickl [17] for representative publications, where further references can be found. Typically $\mathscr{F}$ is a class of functions satisfying smoothness conditions (e.g., Hölder or Sobolev functional classes). Assumptions on the measurement error distribution are usually stated in terms of the characteristic function of $G$ and read as follows.

Assumption (E0). Let $\phi_g(i\omega) := \mathcal{F}[\mathrm{d}G;\omega] := \int_{-\infty}^{\infty} e^{-i\omega x}\,\mathrm{d}G(x)$ be the characteristic function (the Fourier transform) of the measurement error distribution $G$. Then:

I. $|\phi_g(i\omega)| \neq 0$ for all $\omega\in\mathbb{R}$.

II. $|\phi_g(i\omega)|$ decreases at a polynomial or exponential rate as $|\omega|\to\infty$: ordinary smooth errors, $|\phi_g(i\omega)| \asymp |\omega|^{-\gamma}$ as $|\omega|\to\infty$ for some $\gamma>0$; or super-smooth errors, $|\phi_g(i\omega)| \asymp \exp\{-c|\omega|^{\gamma}\}$ as $|\omega|\to\infty$ for some $c>0$, $\gamma>0$.

Under Assumption (E0) the accuracy of estimating $f$ is determined by the rate at which $\phi_g$ tends to zero and by the smoothness of $f$ as characterized in terms of the functional class $\mathscr{F}$. Condition (E0-I) ensures that the statistical model is identifiable (it is well known that if $\phi_g$ vanishes on a set of non-zero Lebesgue measure then $f$ is not identifiable). It underlies the applicability of the standard Fourier-transform-based techniques for constructing estimators of $f$. Note, however, that (E0-I) does not hold if $\phi_g$ has isolated zeros, which is the case in many interesting situations, e.g., for continuous distributions with compactly supported densities or for general discrete distributions. For example, if $G$ is the uniform distribution on $[-1,1]$ then $\phi_g(i\omega) = \sin(\omega)/\omega$ has zeros at $\omega = \pm\pi k$, $k\in\mathbb{N}$, and (E0-I) is not fulfilled.

The settings in which the error characteristic function $\phi_g$ may have isolated zeros have been studied to a considerably lesser extent; the available results in this area are fragmentary and disparate. Devroye [7] pointed out that the density $f$ can be estimated consistently in the $L_1$-norm when the characteristic function $\phi_g$ of the error distribution is non-zero almost everywhere. Although this is a quite general result, the convergence is not uniform, and the evaluation is not based on the minimax criterion. Several previous studies investigated the problem with the uniform error distribution. In particular, Groeneboom and Jongbloed [13] and Feuerverger et al. [9] demonstrate that zeros of the characteristic function $\phi_g$ do not influence the minimax rate of convergence: it remains the same as under condition (E0-I) when the estimated density $f$ is supported on the positive real line [13], or has a bounded second moment [9]. Considering a more general class of so-called Fourier-oscillating error distributions, Delaigle and Meister [6] derive a similar result for densities $f$ having a finite left endpoint. In contrast to the aforementioned results, Hall and Meister [14] demonstrate that for the class of Fourier-oscillating error distributions zeros of the error characteristic function lead to a slower minimax convergence rate than the one under condition (E0-I). Hall and Meister [14] suggest a "ridge" modification of the kernel density deconvolution estimator in which the characteristic function of the error distribution is regularized to avoid singularities due to the zeros. For another closely related work we refer to Meister [18].

Recently a principled method for solving density deconvolution problems under general assumptions on the error characteristic function has been proposed in Belomestny and Goldenshluger [2].
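Returning to the uniform-error example above, the location of the zeros of $\sin(\omega)/\omega$ is easy to verify numerically (an illustrative check; `np.sinc` is NumPy's normalized sinc):

```python
import numpy as np

# phi_g(i w) = sin(w)/w for G = Uniform[-1, 1]; np.sinc(x) = sin(pi x)/(pi x).
phi = lambda w: np.sinc(w / np.pi)

ks = np.arange(1, 6)
assert np.allclose(phi(np.pi * ks), 0.0, atol=1e-12)   # zeros exactly at pi*k
assert np.all(np.abs(phi(np.pi * (ks + 0.5))) > 0.05)  # non-zero between zeros
```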
This method uses the Laplace transform (the Fourier transform in the complex domain) in conjunction with the linear functional strategy for constructing rate-optimal kernel deconvolution estimators. The results show that zeros of the error characteristic function have no influence on the achievable estimation accuracy when, in addition to the usual smoothness conditions, the estimated density $f$ has sufficiently light tails. On the other hand, if $f$ is heavy-tailed then zeros of the error characteristic function affect the minimax rates of convergence, which become slower. Belomestny and Goldenshluger [2] provide an explicit condition on the tail behavior of $f$ and the geometry of the zeros of $\phi_g$ under which the minimax rates of convergence are not influenced by the zeros of $\phi_g$.

In this paper we focus on the setting where $\phi_g$ has zeros and $f$ is heavy-tailed relative to the multiplicity $m$ of the zeros of $\phi_g$ on the imaginary axis. Prototypical settings of this type arise when the measurement error distribution is the binomial distribution $\mathrm{Bin}(m, 1/2)$ or the $m$-fold convolution of uniform distributions on $[-\theta,\theta]$. Utilizing the methodology proposed in [2], we develop rate-optimal estimators of $f$ and investigate their properties. It is shown that, in contrast to the well-known results under Assumption (E0), in the considered regime the minimax rate of convergence is determined not only by the smoothness of $f$ and the rate at which $\phi_g$ tends to zero, but also by the tail behavior of $f$ and the zero multiplicity of $\phi_g$. The derived lower bounds on the minimax risk demonstrate that the dependence of the estimation accuracy on these factors is essential.

The construction of the proposed rate-optimal estimator of $f$ depends on tuning parameters whose specification requires prior information on the smoothness and tail behavior of $f$. In practice such information is rarely available. To overcome this difficulty we propose and study an adaptive estimator of $f$ based on the methodology developed in Goldenshluger and Lepski [11, 12]. An interesting feature of the proposed estimator is that it involves two tuning parameters, and the adaptation here is not only with respect to the unknown smoothness but also with respect to the unknown tail behavior of $f$. We derive an oracle inequality for the developed adaptive estimator and show that it achieves the minimax rate of convergence up to a logarithmic factor, which is an unavoidable price for adaptation in pointwise estimation.

The rest of the paper is organized as follows. In Section 2 we present the general idea of the estimator construction and introduce our estimator. Section 3 deals with minimax estimation of $f(x)$ over appropriate functional classes. In Section 4 we introduce the corresponding adaptive procedure and investigate its properties. Lastly, Section 5 is reserved for discussion and concluding remarks. All proofs are deferred to the Appendix.
2. Estimator Construction
We start by presenting the key idea of the estimator construction in our density deconvolution problem. The construction uses the Laplace transform (the Fourier transform in the complex domain), which allows us to handle the situation where the first condition of Assumption (E0) is not satisfied. Our goal here is to convey the main idea of the construction; for further details we refer to Belomestny and Goldenshluger [2].

The following definitions will be used throughout the study. For a generic function $w$ the bilateral Laplace transform of $w$ is defined as
$$ \mathcal{L}[w; z] := \phi_w(z) = \int_{-\infty}^{\infty} w(x) e^{-zx}\,\mathrm{d}x. \tag{2.1} $$
The region of convergence $\Sigma_w$ of this integral (if it exists) is a vertical strip in the complex plane, $\Sigma_w = \{z\in\mathbb{C} : \mathrm{Re}(z) \in (\sigma_w^-, \sigma_w^+)\}$ for some $\sigma_w^-, \sigma_w^+ \in \mathbb{R}$, and $\phi_w(z)$ is analytic in $\Sigma_w$. The inverse Laplace transform is
$$ w(x) = \frac{1}{2\pi i}\int_{s-i\infty}^{s+i\infty} \phi_w(z) e^{zx}\,\mathrm{d}z = \frac{1}{2\pi}\int_{-\infty}^{\infty} \phi_w(s+i\omega) e^{(s+i\omega)x}\,\mathrm{d}\omega, \quad s \in (\sigma_w^-, \sigma_w^+). $$
For the error distribution function $G$ we write $\phi_g(z) := \int_{-\infty}^{\infty} e^{-zx}\,\mathrm{d}G(x)$, and note that the region of convergence necessarily includes the imaginary axis $\{z\in\mathbb{C} : \mathrm{Re}(z)=0\}$, with $\phi_g(i\omega)$ being the characteristic function of $G$. In what follows we assume that $\Sigma_g$ is a vertical strip in the complex plane, $\Sigma_g := \{z\in\mathbb{C} : \mathrm{Re}(z) \in (\sigma_g^-, \sigma_g^+)\}$ for some $\sigma_g^- < 0 < \sigma_g^+$.

Our estimator uses a kernel whose construction relies upon the linear functional strategy for the solution of ill-posed problems (see, e.g., [10]). Let $K \in C^\infty(\mathbb{R})$ be a kernel supported on $[-1,1]$ satisfying the standard conditions: for a fixed $k\in\mathbb{Z}_+$,
$$ \int_{-1}^{1} K(t)\,\mathrm{d}t = 1, \qquad \int_{-1}^{1} t^j K(t)\,\mathrm{d}t = 0, \quad \forall j = 1,\dots,k. \tag{2.2} $$
Note that $\phi_K(z)$ is an entire function, i.e., $\Sigma_K = \mathbb{C}$. We would like to find a function $L : \mathbb{R}\to\mathbb{R}$ such that for any given $x\in\mathbb{R}$
$$ \int_{-\infty}^{\infty} L(y-x) f_Y(y)\,\mathrm{d}y = \frac{1}{h}\int_{-\infty}^{\infty} K\Big(\frac{x'-x}{h}\Big) f(x')\,\mathrm{d}x', \tag{2.3} $$
where we recall that $f_Y$ and $f$ are related to each other by the convolution integral (1.1). If a function $L$ satisfying (2.3) is found, then a reasonable estimator of $f(x)$ is given by the empirical estimator of the integral on the left-hand side of (2.3) based on the sample $\mathcal{Y}_n$. In our deconvolution problem this strategy is realized as follows.

In addition to the analyticity of $\phi_g$ in $\Sigma_g$, we suppose that $\phi_g(z)$ does not vanish on the set $\{z : \mathrm{Re}(z) \in (\kappa_g^-, \kappa_g^+) \setminus \{0\}\}$ for some $\kappa_g^-, \kappa_g^+$ such that $\sigma_g^- \le \kappa_g^- < 0 < \kappa_g^+ \le \sigma_g^+$. Note that $\phi_g$ may have zeros on the imaginary axis $\{z : \mathrm{Re}(z) = 0\}$, so that the conventional Fourier transform technique would not work in this situation. Let $\mathcal{S}_g := \{z : \mathrm{Re}(z) \in (-\kappa_g^+, -\kappa_g^-) \setminus \{0\}\}$; in fact, $\mathcal{S}_g$ is the union of two open vertical strips in the complex plane having the imaginary axis as a boundary. Note that $\phi_g(-z) \neq 0$ on $\mathcal{S}_g$, and for $h > 0$ define
$$ \phi_L(z) := \frac{\phi_K(zh)}{\phi_g(-z)}, \quad z\in\mathcal{S}_g. $$
Obviously $\phi_L$ is analytic on $\mathcal{S}_g$, and we define the kernel $L_h^s$ as the inverse Laplace transform of $\phi_L$:
$$ L_h^s(x) := \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{\phi_K((s+i\omega)h)}{\phi_g(-s-i\omega)}\, e^{(s+i\omega)x}\,\mathrm{d}\omega, \quad s \in (-\kappa_g^+, -\kappa_g^-) \setminus \{0\}. \tag{2.4} $$
Depending on the sign of the parameter $s$, formula (2.4) defines two different kernels, which in the sequel are denoted $L_h^+(\cdot)$ for $s > 0$ and $L_h^-(\cdot)$ for $s < 0$. If the integral on the right-hand side of (2.4) is absolutely convergent and
$$ \int_{-\infty}^{\infty} |L_h^s(y-x)|\, f_Y(y)\,\mathrm{d}y < \infty, $$
then by Lemma 1 in [2] the kernels $L_h^s$ and $K$ are related to each other via (2.3). We then define the resulting density deconvolution estimator by
$$ \hat f_h^s(x) = \frac{1}{n}\sum_{i=1}^{n} L_h^s(Y_i - x), \quad s \in (-\kappa_g^+, -\kappa_g^-) \setminus \{0\}. $$
While the general form of the kernel $L_h^s$ is given in (2.4), it is beneficial to specialize it for particular error distributions. We do this in the next subsection for error characteristic functions $\phi_g$ having zeros on the imaginary axis. The following assumption on the characteristic function of the measurement errors has been introduced in [2].
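As a sanity check on definition (2.1), the bilateral Laplace transform of the $U(-\theta,\theta)$ density can be evaluated by quadrature and compared with the closed form $\sinh(\theta z)/(\theta z)$ that appears below in (2.6). This is an illustrative numerical sketch, not part of the paper's development.

```python
import numpy as np

# Bilateral Laplace transform (2.1), computed by quadrature for the
# Uniform[-theta, theta] density; its strip of convergence is all of C.
theta = 1.0
x = np.linspace(-theta, theta, 200_001)
w = np.full_like(x, 1.0 / (2 * theta))  # uniform density on [-theta, theta]

def trap(y, t):
    # Simple trapezoid rule (avoids version-dependent numpy integrators).
    return float(np.sum((y[1:] + y[:-1]) * np.diff(t)) / 2.0)

def phi(z):
    return trap(w * np.exp(-z * x), x)

z = 0.7  # an arbitrary real point in the strip
exact = np.sinh(theta * z) / (theta * z)
assert abs(phi(z) - exact) < 1e-6
```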
Assumption (E1). $\phi_g$ is analytic in $\Sigma_g := \{z : \mathrm{Re}(z) \in (\sigma_g^-, \sigma_g^+)\}$ with $\sigma_g^- < 0 < \sigma_g^+$ and admits the representation
$$ \phi_g(z) = \frac{1}{\psi(z)} \prod_{k=1}^{q} \big(1 - e^{a_k z - i b_k}\big)^{m_k}, \tag{2.5} $$
where $\{a_k\}_{k=1}^q$ and $\{b_k\}_{k=1}^q$ are real numbers with $a_k > 0$ and $b_k \in [0, 2\pi)$ for all $k$, $\{m_k\}_{k=1}^q$ are non-negative integers, and the pairs $\{(a_k, b_k)\}_{k=1}^q$ are distinct. The function $\psi(z)$ has the representation
$$ \psi(z) = \psi_0(z) \prod_{k : b_k = 0} (-a_k z)^{m_k} \prod_{k : b_k \neq 0} (1 - e^{-i b_k})^{m_k}, $$
where $\psi_0(z)$ is analytic and has no zeros in a vertical strip $\Sigma_\psi$, $\{z : \mathrm{Re}(z) = 0\} \subset \Sigma_\psi \subseteq \Sigma_g$.

Assumption (E1) postulates that the characteristic function $\phi_g(z)$ is analytic in a vertical strip and can be factorized into a product of two functions: the first has zeros on the imaginary axis while the second does not vanish in the strip. Under (2.5), the zeros of $\phi_g(z)$ are $z_{k,j} = i(b_k + 2\pi j)/a_k$, $j = 0, \pm 1, \pm 2, \dots$, $z_{k,j} \neq 0$, and the multiplicity of $z_{k,j}$ equals $m_k$ for any $j$.

Assumption (E1) is rather general: it holds for a wide class of discrete and continuous distributions; for specific examples we refer to [2, Section 3.2]. Since the main focus of this study is to investigate the effect of the zero multiplicity of $\phi_g(z)$ on the estimation accuracy, we concentrate on the following prototypical examples.

(a) [$m$-fold convolution of the $U(-\theta,\theta)$ distribution]. Let $G$ be the distribution function of the $m$-fold convolution of the uniform distribution on $[-\theta,\theta]$, $\theta > 0$. In this case
$$ \phi_g(z) = \bigg[\frac{\sinh(\theta z)}{\theta z}\bigg]^{m} = \frac{e^{-m\theta z}}{(-2\theta z)^{m}}\,\big(1 - e^{2\theta z}\big)^{m}, \tag{2.6} $$
so that Assumption (E1) holds with $q = 1$, $a_1 = 2\theta$, $b_1 = 0$, $m_1 = m$ and $\psi(z) = (-2\theta z)^m e^{m\theta z}$.

(b) [Binomial distribution]. Let $G$ be the distribution function of a binomial random variable with parameters $m$ and $p = 1/2$; then
$$ \phi_g(z) = 2^{-m}\big(1 + e^{z}\big)^{m}, \tag{2.7} $$
so that Assumption (E1) holds with $q = 1$, $a_1 = 1$, $b_1 = \pi$, $m_1 = m$ and $\psi(z) = 2^m$.

Under Assumption (E1) the kernel in (2.4) takes the following particular form:
$$ L_h^s(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{\phi_K((s+i\omega)h)\,\psi(-s-i\omega)}{\prod_{k=1}^{q}\big(1 - e^{-a_k(s+i\omega) - i b_k}\big)^{m_k}}\, e^{(s+i\omega)t}\,\mathrm{d}\omega, \quad s+i\omega \in \mathcal{S}_g. \tag{2.8} $$
The denominator does not vanish for $s \in (-\kappa_g^+, -\kappa_g^-) \setminus \{0\}$, and the kernel is either $L_h^+$ or $L_h^-$, depending on the sign of $s$. For examples (a) and (b) above we can substitute the expressions for $\phi_g(z)$ given by (2.6) and (2.7) into (2.4). Expanding the integrand formally in a series (for details see [2, Section 4.1]), we arrive at the following infinite series representations for the kernels:

(a) $m$-fold convolution of the $U(-\theta,\theta)$ distribution:
$$ L_h^{\pm}(t) = \frac{(\pm 2\theta)^m}{h^{m+1}} \sum_{j=0}^{\infty} C_{j,m}\, K^{(m)}\Big(\frac{t \mp \theta(2j+m)}{h}\Big); $$

(b) binomial distribution:
$$ L_h^{\pm}(t) = \frac{(\pm 2)^m}{h} \sum_{j=0}^{\infty} C_{j,m}\, K\Big(\frac{t \mp j}{h}\Big), $$
where
$$ C_{j,m} := \binom{j+m-1}{m-1} $$
is the number of weak compositions of $j$ into $m$ parts (see, e.g., [20]). Note that the derived kernels $L_h^{\pm}$ are not integrable, and, in general, condition (2.3) is fulfilled only if $f$ has sufficiently light tails. That is why in the estimator construction we truncate the infinite series at a cut-off parameter $N$, arriving at the kernels
$$ L_{h,N}^{\pm}(t) := \frac{(\pm 2\theta)^m}{h^{m+1}} \sum_{j=0}^{N} C_{j,m}\, K^{(m)}\Big(\frac{t \mp \theta(2j+m)}{h}\Big), \tag{2.9} $$
$$ L_{h,N}^{\pm}(t) := \frac{(\pm 2)^m}{h} \sum_{j=0}^{N} C_{j,m}\, K\Big(\frac{t \mp j}{h}\Big) \tag{2.10} $$
for examples (a) and (b), respectively.

The multiplicity of zeros clearly manifests itself in the construction of the kernels $L_{h,N}^{\pm}$: in setting (a) the multiplicity $m$ determines the degree of ill-posedness of the deconvolution problem, and in both settings the coefficients $C_{j,m}$ in (2.9) and (2.10) grow with $m$, affecting the variance of the corresponding estimators in the case of heavy-tailed densities $f$.
Intuitively, the larger the multiplicity $m$, the flatter the characteristic function $\phi_g(z)$ in the vicinity of its zeros, and the harder the deconvolution problem.

Based on the derived kernels, we define the estimators of $f(x)$ in examples (a) and (b) by
$$ \text{(a)}\quad \hat f_{h,N}^{\pm}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{(\pm 2\theta)^m}{h^{m+1}} \sum_{j=0}^{N} C_{j,m}\, K^{(m)}\Big(\frac{Y_i - x \mp \theta(2j+m)}{h}\Big), \tag{2.11} $$
$$ \text{(b)}\quad \hat f_{h,N}^{\pm}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{(\pm 2)^m}{h} \sum_{j=0}^{N} C_{j,m}\, K\Big(\frac{Y_i - x \mp j}{h}\Big), \tag{2.12} $$
where $h$ and $N$ are two tuning parameters to be specified.
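The coefficients $C_{j,m}$ and the truncated series are straightforward to code. The sketch below, in the spirit of the binomial-error estimator (2.12), is illustrative only: the bump kernel is a crude stand-in (the paper requires a $C^\infty$ kernel satisfying the moment conditions (2.2)), and the constant $(\pm 2)^m/h$ is read off the reconstructed formulas, which is an assumption.

```python
import numpy as np
from math import comb

def C(j, m):
    # Number of weak compositions of j into m parts: binom(j + m - 1, m - 1).
    return comb(j + m - 1, m - 1)

def K(t):
    # Crude smooth bump supported on [-1, 1]; an illustrative stand-in for
    # the C^infty kernel with the vanishing-moment conditions (2.2).
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = np.abs(t) < 1
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out / 0.443994  # approximate normalizing constant of the bump

def fhat_plus(x, Y, h, N, m):
    # Truncated-series estimator in the spirit of (2.12) for Bin(m, 1/2)
    # errors; the constant (2^m)/h is an assumption of this sketch.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    acc = np.zeros_like(x)
    for j in range(N + 1):
        acc += C(j, m) * K((Y[None, :] - x[:, None] - j) / h).sum(axis=1)
    return (2.0 ** m / h) * acc / len(Y)

assert [C(j, 1) for j in range(4)] == [1, 1, 1, 1]
assert C(2, 3) == 6  # compositions of 2 into 3 parts: binom(4, 2)
```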
3. Minimax Results
In this section we derive upper bounds on the risk of the estimators constructed in the previous section and show that they are rate-optimal over functional classes characterized by smoothness and tail conditions. The analysis of the risk for both estimators in cases (a) and (b) coincides in almost every detail; therefore, in the sequel we concentrate on example (a), and the corresponding results for the binomial error distribution are discussed in Section 5.
The following assumption introduces the functional class over which the accuracy of $\hat f_{h,N}^{\pm}(x)$ will be assessed.

Assumption (F). Let $A$ and $B$ be positive real numbers.

(I) For $\alpha > 0$, a probability density $f$ belongs to the functional class $\mathbb{H}_\alpha(A)$ if $f$ is $\lfloor\alpha\rfloor := \max\{n\in\mathbb{N}\cup\{0\} : n < \alpha\}$ times continuously differentiable, and
$$ \big| f^{(\lfloor\alpha\rfloor)}(t) - f^{(\lfloor\alpha\rfloor)}(t')\big| \le A\,|t - t'|^{\alpha - \lfloor\alpha\rfloor}, \quad \forall t, t' \in \mathbb{R}. \tag{3.1} $$

(II) Let $q$ be a positive real number. We say that a probability density $f$ belongs to the functional class $\mathbb{N}_q(B)$ if
$$ f(t) \le B|t|^{-q}, \quad \forall t\in\mathbb{R}. \tag{3.2} $$

Combining the two conditions in Assumption (F), we define the functional class
$$ \mathbb{W}_{\alpha,q}(A,B) := \mathbb{H}_\alpha(A) \cap \mathbb{N}_q(B). $$
Remark.
While the first condition defines the usual Hölder class $\mathbb{H}_\alpha(A)$, the second condition imposes a uniform upper bound on the decay of the tails of the estimated density. Note that this tail condition is comparable to the moment condition in [2, Definition 3].

Now we are in a position to establish upper bounds on the maximal risk of the estimator $\hat f_{h,N}^{\pm}(x)$ defined in (2.11). Let
$$ \hat f_{h,N}(x) := \begin{cases} \hat f_{h,N}^{+}(x), & x \ge 0, \\ \hat f_{h,N}^{-}(x), & x < 0, \end{cases} \tag{3.3} $$
$$ r := \begin{cases} (\alpha/q)(2m-1-q), & q < 2m-1, \\ 0, & q \ge 2m-1, \end{cases} \qquad \nu := \frac{\alpha}{2\alpha + 2m + 1 + r}, \tag{3.4} $$
and define
$$ \varphi(n) := \begin{cases} \big(B^{1/\alpha} A^{(2m+1)/\alpha}\big)^{\nu}\, n^{-\nu}, & q > 2m-1, \\[2pt] \big(B^{1/\alpha} A^{(2m+1)/\alpha}\big)^{\nu}\, \big(\tfrac{\log n}{n}\big)^{\nu}, & q = 2m-1, \\[2pt] \big(B^{(2m-1-q)/(\alpha q)} A^{(2m+1)/\alpha}\big)^{\nu}\, n^{-\nu}, & q < 2m-1. \end{cases} \tag{3.5} $$

Theorem 1.
Let $f \in \mathbb{W}_{\alpha,q}(A,B)$ with $q > 1$, and let $\phi_g(z) = [\sinh(\theta z)/(\theta z)]^m$, $m\in\mathbb{N}$. Let $\hat f_{h,N}(x)$ be the estimator defined in (3.3) and (2.11), associated with a kernel $K$ satisfying condition (2.2) with parameter $k \ge \alpha + 1$. Then, with $h = h_*$ and $N = N_*$ defined in (A.6)-(A.8) in the proof of the theorem, one has
$$ \limsup_{n\to\infty} \Big\{ [\varphi(n)]^{-1}\, \mathcal{R}_n\big[\hat f_{h_*,N_*}; \mathbb{W}_{\alpha,q}(A,B)\big] \Big\} \le C_1, $$
where $C_1$ is a constant independent of $A$ and $B$.

Remark.
(a) The result of Theorem 1 shows how the tail behavior of $f$ and the zero multiplicity $m$ affect the estimation accuracy. If the tail of $f$ is sufficiently light, i.e., $q > 2m-1$, then the risk of $\hat f_{h_*,N_*}(x)$ converges to zero at the rate $n^{-\alpha/(2\alpha+2m+1)}$, which is the rate obtained in the ordinary smooth case with $\gamma = m$ and non-vanishing characteristic function $\phi_g$ [see Assumption (E0)]. On the other hand, for heavy-tailed densities $f$ with $q < 2m-1$, $\hat f_{h_*,N_*}(x)$ converges at a slower rate, and the parameter $r$ in (3.4) characterizes the deterioration in the convergence rate.
(b) The existence of different regimes depending on the tail behavior of $f$ and the zero multiplicity $m$ has been noticed in [2]; however, the case of heavy-tailed densities was not studied there.

The next theorem provides a lower bound on the minimax risk of estimation over the functional class $\mathbb{W}_{\alpha,q}(A,B)$.

Theorem 2.
Let $f \in \mathbb{W}_{\alpha,q}(A,B)$ for $q > 1$ and $\phi_g(z) = [\sinh(\theta z)/(\theta z)]^m$, $m\in\mathbb{N}$. Then
$$ \liminf_{n\to\infty} \Big\{ \big(A^{-(2m+1)/\alpha}\, n\big)^{\nu}\, \mathcal{R}^*_n\big[\mathbb{W}_{\alpha,q}(A,B)\big] \Big\} \ge C_2, $$
where $\nu$ is defined in (3.4), and $C_2$ is a positive constant independent of $A$.

Remark.
(a) Theorems 1 and 2 show that there are two regimes in the behavior of the minimax risk. These regimes are characterized by the tail behavior of the estimated density $f$ and the multiplicity of the zeros of the error characteristic function $\phi_g$. In the light tail regime, $q > 2m-1$, the zeros of $\phi_g$ have no influence on the minimax rate of convergence: it is fully determined by the tail behavior of $\phi_g$. On the other hand, if $q < 2m-1$ (the heavy tail regime), then the zeros of $\phi_g$ have a significant influence on the minimax rate: it becomes much slower than in the case of non-vanishing $\phi_g$.
(b) Theorems 1 and 2 demonstrate that the proposed estimator $\hat f_{h_*,N_*}(x)$ is rate-optimal in both the light tail and heavy tail regimes. We note that on the boundary $q = 2m-1$ the upper and lower bounds match only up to a logarithmic factor.

Theorem 1 establishes the accuracy of estimation over the class $\mathbb{W}_{\alpha,q}(A,B)$ defined in Assumption (F). Although these conditions are quite reasonable in the context of density deconvolution, they involve an extra assumption on the tail behavior of $f$, and it is natural to ask what happens when the tail condition does not hold. The next result provides an answer to this question.

Corollary 1.
Let $\phi_g(z) = [\sinh(\theta z)/(\theta z)]^m$, $m\in\mathbb{N}$; then the following results hold:
$$ \liminf_{n\to\infty} \big\{ \psi_n^{-1}\, \mathcal{R}^*_n[\mathbb{H}_\alpha(A)] \big\} \ge C_1, \tag{3.6} $$
$$ \limsup_{n\to\infty} \big\{ \psi_n^{-1}\, \mathcal{R}^*_n\big[\mathbb{H}_\alpha(A) \cap \mathbb{N}_1(B)\big] \big\} \le C_2, \tag{3.7} $$
where $\psi_n := \big(A^{(2m+1)/\alpha}/n\big)^{\frac{\alpha}{2m\alpha + 2m + 1}}$, and $C_1$ and $C_2$ do not depend on $A$.

Remark. In view of (3.6), the rate of convergence $\psi_n$ on the functional class $\mathbb{H}_\alpha(A)$ is significantly slower than the one achieved on $\mathbb{H}_\alpha(A)$ in the setting with non-vanishing characteristic function $\phi_g$. Note that the upper bound in (3.7) is achieved on a slightly smaller functional class. The assumption $f \in \mathbb{N}_1(B)$ is very mild and is fulfilled for virtually any probability density; however, it does not hold uniformly over all densities. We were not able to derive the upper bound (3.7) without this additional condition.
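The exponent in $\psi_n$ is consistent with (3.4): substituting the tail index $q = 1$ of the class $\mathbb{N}_1(B)$ into the heavy-tail formulas gives

```latex
r\big|_{q=1} = \frac{\alpha}{q}(2m-1-q)\Big|_{q=1} = 2\alpha(m-1),
\qquad
\nu = \frac{\alpha}{2\alpha + 2m + 1 + r}
    = \frac{\alpha}{2\alpha + 2m + 1 + 2\alpha(m-1)}
    = \frac{\alpha}{2m\alpha + 2m + 1},
```

which is exactly the exponent appearing in $\psi_n$ in Corollary 1.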
4. Adaptive Procedure
The minimax results of the previous section can only be achieved when information about the functional class is available in advance: the optimal choice of the tuning parameters $h_*$ and $N_*$ requires knowledge of the functional class parameters. In most applications such advance information about the class containing the target density $f$ is rarely available. It is therefore natural to ask whether one can construct an estimator with equivalent or comparable accuracy guarantees without knowing the functional class parameters.

In this section we develop an adaptive estimator of $f(x)$ whose construction is based on a data-driven selection from a family of estimators $\{\hat f_{h,N}(x) : (h,N)\in\mathcal{H}\times\mathcal{N}\}$, where $\hat f_{h,N}(x)$ is defined in the previous section, and $\mathcal{H}$ and $\mathcal{N}$ are fixed sets of bandwidths and cut-off parameters. Since the estimators $\hat f_{h,N}(x)$ depend on two tuning parameters, we adopt the general method of adaptive estimation proposed in [11].

Let $\mathcal{H}$ and $\mathcal{N}$ be the discrete sets defined as follows: for real numbers $0 < h_{\min} < h_{\max} = \theta$ and an integer $N_{\max}$ to be specified later,
$$ \mathcal{H} := \big\{ h \in [h_{\min}, h_{\max}] : h = 2^{-j} h_{\max},\ j = 0,\dots,M_h \big\}, \qquad \mathcal{N} := \big\{ j : j = 1,\dots,N_{\max} =: M_N \big\}, $$
where $M_h := \lfloor \log_2(h_{\max}/h_{\min}) \rfloor$ and $M_N := N_{\max}$ denote the cardinalities of $\mathcal{H}$ and $\mathcal{N}$, respectively.

Let $\mathcal{T} := \mathcal{H}\times\mathcal{N}$, write $\tau := (h,N)$, and consider the family of estimators $F(\mathcal{T}) = \{\hat f_\tau^{\pm}(x), \tau\in\mathcal{T}\}$, where $\hat f_\tau^{\pm}(x) = \hat f_{h,N}^{\pm}(x)$ is defined in (2.11) and (3.3). The adaptive estimator is based on a data-driven selection from the family $F(\mathcal{T})$. For the sake of definiteness, in the sequel we assume that $x \ge 0$ and consider $\hat f_\tau^{+}(x)$ only; the case $x < 0$ and $\hat f_\tau^{-}(x)$ is handled in exactly the same way. The selection rule uses auxiliary estimators that are constructed as follows.
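The grids $\mathcal{H}$ and $\mathcal{N}$ above are easy to materialize; the following is a minimal sketch with illustrative values for $h_{\min}$, $h_{\max}$, $N_{\max}$ (their rate-optimal choices are given later in (4.13)):

```python
import math

def grids(h_min, h_max, N_max):
    # Dyadic bandwidth grid H = {2^{-j} h_max : j = 0..M_h}, contained in
    # [h_min, h_max], and cut-off grid N = {1, ..., N_max}.
    M_h = int(math.floor(math.log2(h_max / h_min)))
    H = [h_max * 2.0 ** (-j) for j in range(M_h + 1)]
    N = list(range(1, N_max + 1))
    return H, N

H, N = grids(h_min=0.01, h_max=1.0, N_max=5)
assert len(H) == 7           # M_h + 1 = floor(log2(100)) + 1
assert min(H) >= 0.01 and max(H) == 1.0
assert N == [1, 2, 3, 4, 5]
```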
For $\tau, \tau' \in \mathcal{T}$ let $\tau\vee\wedge\tau' := (h\vee h', N\wedge N')$ denote the operation of coordinate-wise maximum and minimum. With any pair $\tau, \tau' \in \mathcal{T}$ we associate the estimator [cf. (2.11)]
$$ \hat f^{+}_{\tau\vee\wedge\tau'}(x) := \frac{1}{n}\sum_{i=1}^{n} \frac{(2\theta)^m}{(h\vee h')^{m+1}} \sum_{j=0}^{N\wedge N'} C_{j,m}\, K^{(m)}\Big(\frac{Y_i - x - \theta(2j+m)}{h\vee h'}\Big). $$
Observe that $\hat f^{+}_{\tau\vee\wedge\tau'}(x) = \hat f^{+}_{\tau'\vee\wedge\tau}(x)$ for all $\tau, \tau'\in\mathcal{T}$. Selection rules based on convolution-type auxiliary kernel estimators are developed in [11, 12], while Lepski [16] uses auxiliary estimators based on the operation of point-wise maximum over multiple bandwidths. Our construction is close in spirit to the latter; it is dictated by the structure of the estimators $\hat f_{h,N}^{\pm}(x)$ in the deconvolution problem.

An important ingredient in the construction of the proposed selection rule is a uniform upper bound on the stochastic error of the estimator $\hat f_\tau^{+}(x)$, $\tau\in\mathcal{T}$. For $\tau\in\mathcal{T}$ the stochastic error of $\hat f_\tau^{+}(x)$ is
$$ \xi_\tau(x) := \frac{1}{n}\sum_{i=1}^{n} L_\tau^{+}(Y_i - x) - \mathbb{E}_f\big[L_\tau^{+}(Y - x)\big], \tag{4.1} $$
where
$$ L_\tau^{+}(y) := \frac{(2\theta)^m}{h^{m+1}} \sum_{j=0}^{N} C_{j,m}\, K^{(m)}\Big(\frac{y - \theta(2j+m)}{h}\Big); $$
see (2.9). Define
$$ \sigma_\tau^2 := \frac{(2\theta)^{2m}}{h^{2m+2}} \sum_{j=0}^{N} C_{j,m}^2 \int_{-\infty}^{\infty} \bigg| K^{(m)}\Big(\frac{y - x - \theta(2j+m)}{h}\Big)\bigg|^2 f_Y(y)\,\mathrm{d}y. \tag{4.2} $$
The proof of Theorem 1 shows that $\mathrm{var}_f\{\xi_\tau(x)\} \le \sigma_\tau^2/n$. Let
$$ u_\tau := 2^{m+1}\theta^m\, C_{N,m}\, \|K^{(m)}\|_\infty\, h^{-m-1}, \tag{4.3} $$
and for a real number $\kappa > 0$ let
$$ \Lambda_\tau(\kappa) := \sigma_\tau \sqrt{\frac{\kappa}{n}} + \frac{2 u_\tau \kappa}{n}. \tag{4.4} $$
In Lemma 1 in the Appendix we demonstrate that $\Lambda_\tau(\kappa)$ is a uniform upper bound on $|\xi_\tau(x)|$ in the sense that all moments of the random variable $\sup_{\tau\in\mathcal{T}}\big[|\xi_\tau(x)| - \Lambda_\tau(\kappa)\big]_+$ are suitably small as $\kappa$ increases. Note, however, that $\Lambda_\tau(\kappa)$ cannot be used in the selection rule because it depends on the unknown density. To overcome this problem we consider a data-driven uniform upper bound on $\xi_\tau(x)$ that is constructed as follows.
For $\tau\in\mathcal{T}$ let
$$ \hat\sigma_\tau^2 := \frac{1}{n}\sum_{i=1}^{n} \frac{(2\theta)^{2m}}{h^{2m+2}} \sum_{j=0}^{N} C_{j,m}^2 \bigg| K^{(m)}\Big(\frac{Y_i - x - \theta(2j+m)}{h}\Big)\bigg|^2. $$
Note that $\hat\sigma_\tau^2$ is the empirical estimator of $\sigma_\tau^2$. Let
$$ \hat\Lambda_\tau(\kappa) := 7\bigg( \hat\sigma_\tau \sqrt{\frac{\kappa}{n}} + \frac{2 u_\tau \kappa}{n} \bigg). \tag{4.5} $$
With the introduced notation the selection rule is the following. For any $\tau\in\mathcal{T}$ define
$$ \hat R_\tau(x) := \sup_{\tau'\in\mathcal{T}} \Big[ \big|\hat f^{+}_{\tau\vee\wedge\tau'}(x) - \hat f^{+}_{\tau'}(x)\big| - \hat\Lambda_{\tau\vee\wedge\tau'}(\kappa) - \hat\Lambda_{\tau'}(\kappa) \Big]_+ + \hat\Lambda_\tau(\kappa) + \sup_{\tau'\in\mathcal{T}} \hat\Lambda_{\tau\vee\wedge\tau'}(\kappa). \tag{4.6} $$
Then the adaptive estimator $\hat f^{*}(x)$ is defined by
$$ \hat f^{*}(x) := \hat f^{+}_{\hat\tau}(x), \qquad \hat\tau = (\hat h, \hat N) := \arg\min_{\tau\in\mathcal{T}} \hat R_\tau(x). \tag{4.7} $$

Remark.
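The arg-min structure of the rule (4.6)-(4.7) can be sketched generically. In the code below, `est`, `est_pair`, `lam` and `lam_pair` are caller-supplied stand-ins for $\hat f_\tau^{+}(x)$, $\hat f^{+}_{\tau\vee\wedge\tau'}(x)$, $\hat\Lambda_\tau(\kappa)$ and $\hat\Lambda_{\tau\vee\wedge\tau'}(\kappa)$; this is a schematic sketch of the selection principle, not the paper's implementation.

```python
def R_hat(t, T, est, est_pair, lam, lam_pair):
    # \hat R_t(x) as in (4.6): positive part of the pairwise comparisons,
    # plus the threshold at t, plus the worst-case pairwise threshold.
    sup1 = max(
        max(abs(est_pair[(t, t2)] - est[t2]) - lam_pair[(t, t2)] - lam[t2], 0.0)
        for t2 in T
    )
    sup2 = max(lam_pair[(t, t2)] for t2 in T)
    return sup1 + lam[t] + sup2

def select(T, est, est_pair, lam, lam_pair):
    # Adaptive choice (4.7): minimize \hat R over the grid of tuning pairs.
    return min(T, key=lambda t: R_hat(t, T, est, est_pair, lam, lam_pair))

# Toy example with two hypothetical tuning pairs: the rule picks the one
# with the smaller combined comparison/threshold value.
T = [0, 1]
est = {0: 1.0, 1: 1.2}
est_pair = {(0, 0): 1.0, (0, 1): 1.05, (1, 0): 1.05, (1, 1): 1.2}
lam = {0: 0.5, 1: 0.1}
lam_pair = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.1}
assert select(T, est, est_pair, lam, lam_pair) == 1
```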
The defined selection rule is fully data-driven; it only requires specification of the parameter $\kappa$ in (4.5). This parameter provides a uniform control of the stochastic errors for the family of estimators $F(\mathcal{T})$ and has no relation to the properties of the density to be estimated. In addition, the parameters $h_{\min}$ and $N_{\max}$ should be chosen; they determine the sets of admissible bandwidths $\mathcal{H}$ and cut-off parameters $\mathcal{N}$.

For $h, h'\in\mathcal{H}$ and $N, N'\in\mathcal{N}$ define
$$ \bar B_h(f) := \sup_{h' \le h}\ \sup_{x\in\mathbb{R}} \bigg| \frac{1}{h'}\int_{-\infty}^{\infty} K\Big(\frac{t-x}{h'}\Big)\big[f(t) - f(x)\big]\,\mathrm{d}t \bigg|, \tag{4.8} $$
$$ \bar B_N(x; f) := \max_{0\le j\le m}\ \sup_{|t|\le\theta}\ \sup_{N'\ge N} \big[ f\big(t + x + 2\theta(N'+1)j\big)\big], \tag{4.9} $$
and let
$$ \bar B_\tau(x; f) := 2^{m+1}\Big[ \bar B_h(f) + \big(1 + \|K\|\big)\bar B_N(x; f) \Big]. \tag{4.10} $$

Theorem 3.
Let $\hat f^{*}(x)$ be the estimator defined in (4.6)-(4.7) and associated with a parameter $\kappa > 0$; then
$$ |\hat f^{*}(x) - f(x)| \le C_1 \inf_{\tau\in\mathcal{T}} \big\{ \bar B_\tau(x; f) + \Lambda_\tau(\kappa) \big\} + C_2\Big( \delta(x) + \frac{\kappa}{n} \Big), \tag{4.11} $$
where $C_1$ is an absolute constant, $C_2$ depends only on $m$ and $\theta$, and $\delta(x)$ is a non-negative random variable that admits the following bound: for any $p \ge 1$,
$$ \mathbb{E}_f\big[\delta(x)\big]^p \le C_3\, M_h M_N \big[\bar\Lambda(\kappa)\big]^p \kappa^{-p} e^{-\kappa}, \tag{4.12} $$
where $\bar\Lambda(\kappa) := \sup_{\tau\in\mathcal{T}}\{(1 + u_\tau)\Lambda_\tau(\kappa)\}$, and the constant $C_3$ depends on $p$ only.

Remark. Explicit expressions for the constants $C_1$, $C_2$ and $C_3$ appear in the proofs of Theorem 3 and Lemma 2. Note that the oracle inequality holds for any probability density $f$, without any functional class assumptions.

The oracle inequality (4.11) allows us to derive the following result on the accuracy of the adaptive estimator $\hat f^{*}(x)$ on the class $\mathbb{W}_{\alpha,q}(A,B)$.

Corollary 2.
Suppose that $f \in \mathbb{W}_{\alpha,q}(A,B)$ with $q \ge 1$. Let $F(\mathcal{T})$ be the family of estimators $\{\hat f_{h,N}^{+}(x), (h,N)\in\mathcal{H}\times\mathcal{N}\}$ with
$$ h_{\min} := \Big(\frac{\log n}{n}\Big)^{1/(2m+1)}, \qquad h_{\max} = \theta, \qquad N_{\max} := \Big(\frac{n}{\log n}\Big)^{1/(2m)}. \tag{4.13} $$
Let $\hat f^{*}(x)$ be the estimator defined by the selection rule (4.6)-(4.7) and associated with the parameter $\kappa = \kappa_* := 5\log n$; then
$$ \limsup_{n\to\infty} \bigg\{ \Big[\varphi\Big(\frac{n}{\log n}\Big)\Big]^{-1}\, \mathcal{R}_n\big[\hat f^{*}; \mathbb{W}_{\alpha,q}(A,B)\big] \bigg\} \le C, $$
where $\varphi(\cdot)$ is defined in (3.5), and $C$ does not depend on $A$ and $B$.

Remark. Note that the resulting rate coincides with the rate of convergence in Theorem 1 up to an extra $\log n$ factor. It is well known from Lepski [15] that this factor cannot be avoided in adaptive nonparametric estimation of a function at a single point.
5. Concluding Remarks
We close this paper with a few concluding remarks.

In this paper we concentrated on the setting where the error distribution is the $m$-fold convolution of the uniform distribution on $[-\theta, \theta]$. Here the error characteristic function has an infinite number of isolated zeros on the imaginary axis, each of the same multiplicity $m$. Note that the results of Theorems 1 and 2 and Corollary 2 also hold for the binomial error distribution $\mathrm{Bin}(m, 1/2)$ with the following minor changes in notation: in (3.4) the parameter $\nu$ should be redefined as $\nu = 1/(2\alpha + 1 + r)$, and in (3.5) and in the statement of Theorem 2 the expression $A^{(2m+1)/\alpha}$ should be replaced by $A^{1/\alpha}$. The specific form of the error characteristic functions used in this paper facilitates the derivation of lower bounds on the minimax risk. However, in general, the proposed technique is applicable to other error distributions whose characteristic function has zeros on the imaginary axis.

We developed rate-optimal estimators with respect to the point-wise risk. It is worth noting that there is a significant difference between settings with point-wise and $L_2$-risks when the error characteristic function has zeros on the imaginary axis. This fact has already been noted in [2]. Some results on density deconvolution with $L_2$-risk for non-standard error distributions appeared in [18] and [14]. In general, deconvolution problems under global losses with non-standard error distributions deserve a thorough study.

References

[1] Jean-Pierre Aubin.
Applied Functional Analysis. John Wiley & Sons, 2nd edition, 2000.
[2] Denis Belomestny and Alexander Goldenshluger. Density deconvolution under general assumptions on the distribution of measurement errors. arXiv preprint arXiv:1907.11024, 2019.
[3] Cristina Butucea and Alexandre B. Tsybakov. Sharp optimality in density deconvolution with dominating bias. I. Teoriya Veroyatnostei i ee Primeneniya, 52(1):111-128, 2007.
[4] Cristina Butucea and Alexandre B. Tsybakov. Sharp optimality in density deconvolution with dominating bias. II. Teoriya Veroyatnostei i ee Primeneniya, 52(2):336-349, 2007.
[5] Raymond J. Carroll and Peter Hall. Optimal rates of convergence for deconvolving a density. Journal of the American Statistical Association, 83(404):1184-1186, 1988.
[6] Aurore Delaigle and Alexander Meister. Nonparametric function estimation under Fourier-oscillating noise. Statistica Sinica, 21(3):1065-1092, 2011.
[7] Luc Devroye. Consistent deconvolution in density estimation. The Canadian Journal of Statistics, 17(2):235-239, 1989.
[8] Jianqing Fan. On the optimal rates of convergence for nonparametric deconvolution problems. The Annals of Statistics, 19(3):1257-1272, 1991.
[9] Andrey Feuerverger, Peter T. Kim, and Jiayang Sun. On optimal uniform deconvolution. Journal of Statistical Theory and Practice, 2(3):433-451, 2008.
[10] Michael A. Golberg. A method of adjoints for solving some ill-posed equations of the first kind. Applied Mathematics and Computation, 5(2):123-129, 1979.
[11] Alexander Goldenshluger and Oleg Lepski. Bandwidth selection in kernel density estimation: oracle inequalities and adaptive minimax optimality. The Annals of Statistics, 39(3):1608-1632, 2011.
[12] Alexander Goldenshluger and Oleg Lepski. On adaptive minimax density estimation on $\mathbb{R}^d$. Probability Theory and Related Fields, 159(3-4):479-543, 2014.
[13] Piet Groeneboom and Geurt Jongbloed. Density estimation in the uniform deconvolution model. Statistica Neerlandica, 57(1):136-157, 2003.
[14] Peter Hall and Alexander Meister. A ridge-parameter approach to deconvolution. The Annals of Statistics, 35(4):1535-1558, 2007.
[15] Oleg Lepski. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability & Its Applications, 35(3):454-466, 1991.
[16] Oleg Lepski. Adaptive estimation over anisotropic functional classes via oracle approach. The Annals of Statistics, 43(3):1178-1242, 2015.
[17] Karim Lounici and Richard Nickl. Global uniform risk bounds for wavelet deconvolution estimators. The Annals of Statistics, 39(1):201-231, 2011.
[18] Alexander Meister. Deconvolution from Fourier-oscillating error densities under decay and smoothness restrictions. Inverse Problems, 24(1):015003, 2008.
[19] Alexander Meister. Deconvolution Problems in Nonparametric Statistics, volume 193 of Lecture Notes in Statistics. Springer-Verlag, Berlin, 2009.
[20] Richard P. Stanley. Enumerative Combinatorics, volume 1 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2nd edition, 2011.
[21] Leonard A. Stefanski and Raymond J. Carroll. Deconvolving kernel density estimators. Statistics, 21(2):169-184, 1990.
[22] Cun-Hui Zhang. Fourier methods for estimating mixing densities and distributions. The Annals of Statistics, 18(2):806-831, 1990.

Appendix A. Proofs

A.1. Proof of Theorem 1
Proof.
In the subsequent proof $c_1, c_2, \ldots$ stand for positive constants independent of $A$ and $B$. Without loss of generality we assume that $x \ge 0$; the proof for the case $x < 0$ is analogous.

(a). First we bound the variance of $\hat f^+_{h,N}(x)$. It is shown in [2] that the variance of $\hat f^+_{h,N}(x)$ is bounded from above as follows:
\[
\mathrm{var}_f\big[\hat f^+_{h,N}(x)\big] \le \frac{(2\theta)^{2m}}{n h^{2m+2}} \sum_{j=0}^{N} C_{j,m}^2 \int_{-\infty}^{\infty} \Big| K^{(m)}\Big(\frac{y - x - \theta(2j+m)}{h}\Big) \Big|^2 f_Y(y)\,\mathrm{d}y \le \frac{c_1 \theta^{2m}}{n h^{2m+1}} \sum_{j=0}^{N} C_{j,m}^2\, \frac{1}{h} \int_{I_j(x)} f_Y(t)\,\mathrm{d}t, \tag{A.1}
\]
where $I_j(x) := [x + \theta(2j+m) - h,\ x + \theta(2j+m) + h]$. Furthermore, by (A.16) in [2],
\[
\frac{1}{h} \int_{I_j(x)} f_Y(y)\,\mathrm{d}y \le \frac{c_2}{\theta} \int_{-h}^{h} f(t + x + 2(j+m)\theta)\,\mathrm{d}t + \frac{c_2}{\theta} \int_{-h}^{h} f(t + x + 2j\theta)\,\mathrm{d}t + \frac{c_2}{\theta} \int_{-m\theta}^{m\theta} f(t + x + (2j+m)\theta)\,\mathrm{d}t =: S_{1,j} + S_{2,j} + S_{3,j}.
\]
We have
\[
\sum_{j=0}^{N} C_{j,m}^2 S_{1,j} = \frac{c_2}{\theta} \sum_{j=0}^{N} C_{j,m}^2 \int_{-h}^{h} f(t + x + 2(j+m)\theta)\,\mathrm{d}t \le c_3 \sum_{j=0}^{N} \frac{j^{2m-2}}{\theta} \int_{x+2(j+m)\theta-h}^{x+2(j+m)\theta+h} \frac{t^q f(t)}{(x + 2\theta j)^q}\,\mathrm{d}t \le \frac{c_4 B h}{\theta^{q+1}} \sum_{j=0}^{N} j^{2m-q-2}, \tag{A.2}
\]
where we have used that $C_{j,m} = \binom{j+m-1}{m-1} \le c_5 j^{m-1}$, $f \in \mathcal{N}_q(B)$, and $\theta > h$ for large $n$. The term $\sum_{j=0}^N C_{j,m}^2 S_{2,j}$ is bounded from above by the same expression as on the right-hand side of (A.2). Furthermore,
\[
\sum_{j=0}^{N} C_{j,m}^2 S_{3,j} = \frac{c_2}{\theta} \sum_{j=0}^{N} C_{j,m}^2 \int_{-m\theta}^{m\theta} f(t + x + (2j+m)\theta)\,\mathrm{d}t \le \frac{c_6}{\theta} \sum_{j=0}^{N} j^{2m-2} \int_{x+2j\theta}^{x+2(j+m)\theta} \frac{t^q f(t)}{(x + 2\theta j)^q}\,\mathrm{d}t \le \frac{c_7 B}{\theta^{q}} \sum_{j=0}^{N} j^{2m-q-2}. \tag{A.3}
\]
Combining (A.3), (A.2) and (A.1) we conclude that
\[
\mathrm{var}_f\big[\hat f^+_{h,N}(x)\big] \le \frac{c_8\, \theta^{2m-q} B\, \psi_N}{n h^{2m+1}}, \qquad \psi_N := \begin{cases} 1, & q > 2m-1, \\ \log N, & q = 2m-1, \\ N^{2m-q-1}, & q < 2m-1. \end{cases} \tag{A.4}
\]

(b). Now we bound the bias of $\hat f^+_{h,N}(x)$. It is shown in [2] that
\[
\mathbb{E}_f\big[\hat f^+_{h,N}(x)\big] = \frac{1}{h} \int_{-\infty}^{\infty} K\Big(\frac{t-x}{h}\Big) f(t)\,\mathrm{d}t + T_N(f; x),
\]
where
\[
T_N(f; x) = \sum_{j=1}^{m} \binom{m}{j} (-1)^j \int_{-1}^{1} K(y)\, f(yh + x + 2\theta(N+1)j)\,\mathrm{d}y.
\]
Taking into account that $f \in \mathcal{N}_q(B)$, we obtain for any $j = 1, \ldots, m$:
\[
\int_{-1}^{1} |K(y)|\, f(yh + x + 2\theta(N+1)j)\,\mathrm{d}y \le \frac{c_9}{h} \int_{x+2\theta(N+1)j-h}^{x+2\theta(N+1)j+h} f(y)\,\mathrm{d}y \le \frac{c_{10} B h}{h\, (x + 2\theta N)^q} \le \frac{c_{11} B}{(\theta N)^q}.
\]
This leads to the following upper bound on the bias of $\hat f^+_{h,N}(x)$:
\[
\Big| \mathbb{E}_f\big[\hat f^+_{h,N}(x)\big] - f(x) \Big| \le c_{12} \Big( A h^\alpha + \frac{B}{\theta^q N^q} \Big). \tag{A.5}
\]

(c). We complete the proof by combining the bounds in (A.4) and (A.5) in the cases $q > 2m-1$, $q = 2m-1$ and $q < 2m-1$. Straightforward algebra shows that the following choice of $h = h_*$ and $N = N_*$ yields the theorem's result:

(i) if $q > 2m-1$,
\[
h_* = c_{13} \Big( \frac{B}{A^2 n} \Big)^{\frac{1}{2\alpha+2m+1}}, \qquad N_* \ge c_{14} \Big( \frac{B^{\alpha+2m+1} n^{\alpha}}{A^{2m+1}} \Big)^{\frac{1}{q(2\alpha+2m+1)}}; \tag{A.6}
\]

(ii) if $q = 2m-1$,
\[
h_* = c_{15} \Big( \frac{B \log n}{A^2 n} \Big)^{\frac{1}{2\alpha+2m+1}}, \qquad N_* = c_{16} \Big\{ \frac{B^{\alpha+2m+1}}{A^{2m+1}} \Big( \frac{n}{\log n} \Big)^{\alpha} \Big\}^{\frac{1}{q(2\alpha+2m+1)}}; \tag{A.7}
\]

(iii) if $q < 2m-1$,
\[
h_* = c_{17} \Big( \frac{B^{(2m-1)/q}}{A^{(2m+q-1)/q}\, n} \Big)^{\frac{1}{2\alpha+2m+1+r}}, \qquad N_* = c_{18}\, (B/A)^{1/q}\, h_*^{-\alpha/q}, \tag{A.8}
\]
where the constants $c_{13}, \ldots, c_{18}$ do not depend on $A$ and $B$. □

A.2. Proof of Theorem 2

Proof.
Without loss of generality we fix $x = 0$. The proof is split into several steps: (i) defines two functions in $\mathcal{W}_{\alpha,q}(A,B)$ and evaluates their point-wise distance; (ii) bounds the $\chi^2$-divergence between the densities of the observations; (iii) specifies the proper tuning parameters and provides the rate for the lower bound; and (iv) deals with the derivation of the lower bound in the light-tail regime.

(i). For $s > 1/2$ define
\[
f_0(x) := \frac{C_0(s)}{(1+x^2)^s}, \qquad x \in \mathbb{R}, \tag{A.9}
\]
where $C_0(s)$ is a normalizing constant depending on $s$. Then $f_0 \in \mathcal{N}_q(B)$ for $1 < q \le 2s$, since $f_0(x) \le C_0(s)/x^{2s} \le B/x^q$ for $x > B > 0$. In addition, since $f_0$ is infinitely differentiable, $f_0 \in \mathcal{H}_\alpha(A)$ for any $\alpha$ with properly chosen $A$.

Define the function $\eta$ on $\mathbb{R}$ via its Fourier transform $\phi_\eta(\omega) = \int_{-\infty}^{\infty} \eta(x) e^{-i\omega x}\,\mathrm{d}x$ as follows. Let $\phi_{\eta_0}$ be an infinitely differentiable function on $\mathbb{R}$ with the following properties:

(a) $\phi_{\eta_0}$ is supported on $[-1, 1]$;
(b) $\phi_{\eta_0}$ is symmetric: $\phi_{\eta_0}(\omega) = \phi_{\eta_0}(-\omega)$, $\forall \omega \in \mathbb{R}$;
(c) given some fixed $\delta \in (0, 1/2)$, $\phi_{\eta_0}(\omega) = 1$ for $\omega \in [0, 1-\delta)$, $\phi_{\eta_0}(\omega) = 0$ for $\omega \ge 1$, and $\phi_{\eta_0}$ is monotone decreasing on $[1-\delta, 1]$.

For $h$ with $h < \pi/\theta$ and $N \in \mathbb{N}$, define
\[
\phi_\eta(\omega) := \sum_{k=N+1}^{2N} \Big\{ \phi_{\eta_0}\Big(\frac{\omega - \pi k/\theta}{h}\Big) + \phi_{\eta_0}\Big(\frac{\omega + \pi k/\theta}{h}\Big) \Big\}. \tag{A.10}
\]
Note that $\phi_\eta$ is supported on
\[
\bigcup_{k=N+1}^{2N} A_k(h), \qquad A_k(h) := \Big[ -\frac{\pi k}{\theta} - h,\ -\frac{\pi k}{\theta} + h \Big] \cup \Big[ \frac{\pi k}{\theta} - h,\ \frac{\pi k}{\theta} + h \Big]. \tag{A.11}
\]
Then define the function $\eta$ through the inverse Fourier transform:
\[
\eta(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_\eta(\omega) e^{i\omega x}\,\mathrm{d}\omega = 2h\, \eta_0(hx) \sum_{k=N+1}^{2N} \cos\Big(\frac{\pi k x}{\theta}\Big), \qquad x \in \mathbb{R}. \tag{A.12}
\]
In the subsequent proof the parameters $h$ and $N$ are specified so that $h \to 0$ and $N \to \infty$ as $n \to \infty$; thus we tacitly assume that $N$ is large and $h$ is small for a large enough sample size $n$.

Given real numbers $M > 0$ and $c_0 > 0$, define
\[
f_1(x) := f_0(x) + c_0 M \eta(x). \tag{A.13}
\]
We demonstrate that, under an appropriate choice of $c_0$ and $M$, $f_1$ is a probability density from $\mathcal{W}_{\alpha,q}(A,B)$ for any $h$ and $N$. Observe that $\phi_\eta(0) = 0$ implies $\int_{-\infty}^{\infty} \eta(x)\,\mathrm{d}x = 0$, so that $f_1$ integrates to one. Moreover, since $\phi_{\eta_0}$ is infinitely differentiable and compactly supported, $\eta_0$ is a rapidly decreasing function, i.e., $|\eta_0^{(j)}(x)\, x^\ell| \le c_{j,\ell}$ for any $j, \ell = 0, 1, 2, \ldots$. In particular, for some constant $c(s)$ depending on $s$ only, one has $|\eta_0(x)| \le c(s) |x|^{-2s}$ for all $x \in \mathbb{R}$. It follows from (A.12) that $|\eta(x)| \le c_1 h^{-2s+1} |x|^{-2s} N$ for $x \in \mathbb{R}$. Therefore, choosing $M = h^{2s-1} N^{-1}$, we obtain $c_0 M |\eta(x)| \le f_0(x)$ for $c_0$ small enough. Hence $f_1$ is non-negative, and it is a probability density. Moreover, $f_1 \in \mathcal{N}_q(B)$ for $q \le 2s$. If $\alpha$ is a positive integer then it follows from (A.12) that
\[
\big| \eta^{(\alpha)}(x) \big| = \Big| 2h \sum_{i=0}^{\alpha} \binom{\alpha}{i} h^i \eta_0^{(i)}(hx) \sum_{k=N+1}^{2N} \cos^{(\alpha-i)}\Big(\frac{\pi k x}{\theta}\Big) \Big| \le c_2 h \sum_{i=0}^{\alpha} h^i N^{\alpha-i+1} \le c_3 h N^{\alpha+1}.
\]
Therefore, we can ensure $f_1 \in \mathcal{H}_\alpha(A)$ by selecting $h$ and $N$ so that
\[
M h N^{\alpha+1} = h^{2s} N^{\alpha} \le A. \tag{A.14}
\]
Thus, under (A.14), we have $f_0, f_1 \in \mathcal{W}_{\alpha,q}(A,B)$. In addition,
\[
|f_1(0) - f_0(0)| = c_0 M \eta(0) = 2 c_0 M h\, \eta_0(0) N = c_4 h^{2s}. \tag{A.15}
\]

(ii). Now we derive an upper bound on the $\chi^2$-divergence between the densities of the observations, $f_{Y,0} = g \star f_0$ and $f_{Y,1} = g \star f_1$, that correspond to $f_0$ and $f_1$. Observe that
\[
\chi^2(f_{Y,0}, f_{Y,1}) := \int_{-\infty}^{\infty} \frac{(f_{Y,1}(x) - f_{Y,0}(x))^2}{f_{Y,0}(x)}\,\mathrm{d}x \overset{(\mathrm{A.13})}{=} c_0^2 M^2 \int_{-\infty}^{\infty} \frac{|(g \star \eta)(x)|^2}{(g \star f_0)(x)}\,\mathrm{d}x.
\]
Consider the denominator, $g \star f_0$, of the integrand. We have
\[
(g \star f_0)(x) = C_0(s) \int_{-\infty}^{\infty} \frac{g(y)}{[1 + (x-y)^2]^s}\,\mathrm{d}y \ge C_0(s) \int_{-\infty}^{\infty} \frac{g(y)}{2^s (1+y^2)^s (1+x^2)^s}\,\mathrm{d}y \ge \frac{c_5}{(1+x^2)^s},
\]
where we have used the elementary inequality $1 + |x - y|^2 \le 2(1 + |x|^2)(1 + |y|^2)$, $\forall x, y$.
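The elementary inequality is easy to verify directly, since $1 + (x-y)^2 \le 1 + 2x^2 + 2y^2 \le 2(1+x^2)(1+y^2)$; a quick numerical spot check (ours, purely for illustration) confirms it over a wide range of magnitudes:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=(2, 100_000)) * 10   # wide range of magnitudes
lhs = 1 + (x - y) ** 2
rhs = 2 * (1 + x ** 2) * (1 + y ** 2)
ok = bool(np.all(lhs <= rhs))               # True: the inequality holds
```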
Then the $\chi^2$-divergence can be bounded as
\[
\chi^2(f_{Y,0}; f_{Y,1}) \le c_6 M^2 \int_{-\infty}^{\infty} |(g \star \eta)(x)|^2\,\mathrm{d}x + c_6 M^2 \int_{-\infty}^{\infty} x^{2s} |(g \star \eta)(x)|^2\,\mathrm{d}x. \tag{A.16}
\]
Let us handle the second integral on the right-hand side. For any positive integer $s$ we have
\[
\int_{-\infty}^{\infty} x^{2s} |(g \star \eta)(x)|^2\,\mathrm{d}x = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Big| \frac{\mathrm{d}^s}{\mathrm{d}\omega^s} \big[ \phi_g(\omega) \phi_\eta(\omega) \big] \Big|^2\,\mathrm{d}\omega. \tag{A.17}
\]
Note that
\[
\frac{\mathrm{d}^s}{\mathrm{d}\omega^s} \big[\phi_g(\omega)\phi_\eta(\omega)\big] = \sum_{j=0}^{s} \binom{s}{j} \phi_g^{(j)}(\omega) \phi_\eta^{(s-j)}(\omega) = \sum_{j=0}^{s} \binom{s}{j} \phi_g^{(j)}(\omega)\, h^{j-s} \sum_{k=N+1}^{2N} \Big\{ \phi_{\eta_0}^{(s-j)}\Big(\frac{\omega - \pi k/\theta}{h}\Big) + \phi_{\eta_0}^{(s-j)}\Big(\frac{\omega + \pi k/\theta}{h}\Big) \Big\}.
\]
Furthermore, $\phi_g^{(j)}$ can be expanded by the Faà di Bruno formula for $j \in \mathbb{N}$: if $\phi_{g_0}(\omega) := \sin(\theta\omega)/(\theta\omega)$ then $\phi_g(\omega) = [\phi_{g_0}(\omega)]^m$ and
\[
\phi_g^{(j)}(\omega) = \frac{\mathrm{d}^j}{\mathrm{d}\omega^j} \Big( \frac{\sin\theta\omega}{\theta\omega} \Big)^m = \sum_{l=1}^{j} m(m-1)\cdots(m-l+1) \Big( \frac{\sin\theta\omega}{\theta\omega} \Big)^{m-l} B_{j,l}\Big( \phi_{g_0}'(\omega), \ldots, \phi_{g_0}^{(j-l+1)}(\omega) \Big),
\]
where $B_{j,l}$ denotes the Bell polynomials. Recall that $B_{j,l}$ is a homogeneous polynomial of degree $l$ in its $j-l+1$ variables, and note that $|\phi_{g_0}^{(j)}(\omega)| \le c_7 (|\omega|^{-1} \wedge 1)$, $\forall j$. Then
\[
\big| \phi_g^{(j)}(\omega) \big| \le c_8 \sum_{l=1}^{j} \Big| \frac{\sin\theta\omega}{\theta\omega} \Big|^{m-l} |\theta\omega|^{-l} = \frac{c_8}{|\theta\omega|^m} \sum_{l=1}^{j} |\sin\theta\omega|^{m-l}. \tag{A.18}
\]
Combining the above results with the fact that the sets $A_k(h)$ in (A.11) are disjoint for $k = N+1, \ldots, 2N$, we bound the integral in (A.17) as follows:
\[
\begin{aligned}
\int_{-\infty}^{\infty} \Big| \sum_{j=0}^{s} \binom{s}{j} \phi_g^{(j)}(\omega)\, h^{j-s} &\sum_{k=N+1}^{2N} \Big\{ \phi_{\eta_0}^{(s-j)}\Big(\frac{\omega - \pi k/\theta}{h}\Big) + \phi_{\eta_0}^{(s-j)}\Big(\frac{\omega + \pi k/\theta}{h}\Big) \Big\} \Big|^2\,\mathrm{d}\omega \\
&\le c_9 h^{-2s} \sum_{k=N+1}^{2N} \int_{A_k(h)} \Big| \sum_{j=0}^{s} h^j \phi_g^{(j)}(\omega) \Big|^2\,\mathrm{d}\omega \\
&\le c_{10} h^{-2s} \sum_{k=N+1}^{2N} \int_{A_k(h)} \Big[ \Big| \frac{\sin\theta\omega}{\theta\omega} \Big|^m + \frac{1}{|\theta\omega|^m} \sum_{j=1}^{s} h^j \sum_{l=1}^{j} |\sin\theta\omega|^{m-l} \Big]^2\,\mathrm{d}\omega \\
&\le c_{11} h^{2m+1-2s} \sum_{k=N+1}^{2N} k^{-2m} = c_{12} h^{2m-2s+1} N^{-2m+1}.
\end{aligned}
\]
In addition, the first integral on the right-hand side of (A.16) can be bounded by the same argument with $s = 0$, so that
\[
\int_{-\infty}^{\infty} |(g \star \eta)(x)|^2\,\mathrm{d}x \le c_{13} h^{2m+1} N^{-2m+1}.
\]
Therefore, for positive integer $s$,
\[
\chi^2(f_{Y,0}; f_{Y,1}) \le c_{14} M^2 h^{2m+1} N^{-2m+1} + c_{14} M^2 h^{2m-2s+1} N^{-2m+1} \le c_{15} h^{2m+2s-1} N^{-2m-1}. \tag{A.19}
\]
The same upper bound holds for any non-integer $s \ge 0$; this fact is due to the interpolation inequality for Sobolev spaces; see, e.g., Aubin [1] for details.

(iii). Now, based on (A.14) and (A.19), we specify the parameters $h = h_*$ and $N = N_*$ as follows:
\[
N_* := \Big( \frac{A}{h_*^{2s}} \Big)^{1/\alpha}, \qquad h_* := \Big( \frac{A^{(2m+1)/\alpha}}{n} \Big)^{\frac{\alpha}{(2m+2s-1)\alpha + 2s(2m+1)}}.
\]
Under this choice (A.14) holds, and $\chi^2(f_{Y,0}, f_{Y,1}) \le c_{16}/n$. The lower bound on the minimax risk is then obtained by plugging these expressions into (A.15) and letting $2s = q > 1$:
\[
\mathcal{R}^*_n[\mathcal{W}_{\alpha,q}(A,B)] \ge c_{17} \Big( \frac{A^{(2m+1)/\alpha}}{n} \Big)^{\frac{\alpha}{2\alpha+2m+1+(\alpha/q)(2m-1-q)}}. \tag{A.20}
\]

(iv). To complete the proof of the theorem, it remains to observe that in the considered problem the following standard lower bound on the minimax risk can also be established:
\[
\mathcal{R}^*_n[\mathcal{W}_{\alpha,q}(A,B)] \ge c_{18} \Big( \frac{A^{(2m+1)/\alpha}}{n} \Big)^{\frac{\alpha}{2\alpha+2m+1}}. \tag{A.21}
\]
For completeness, we provide a proof sketch. Let $f_0$ be given by (A.9), and let $\eta$ be the function defined via its Fourier transform $\phi_\eta$ as follows:
\[
\phi_\eta(\omega) = \phi_{\eta_0}(2\omega h - 3) + \phi_{\eta_0}(2\omega h + 3),
\]
where $\phi_{\eta_0}$ is a function with properties (a)-(c). Obviously, $\phi_\eta$ is symmetric, supported on $[-2/h, -1/h] \cup [1/h, 2/h]$, and
\[
\eta(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \big[ \phi_{\eta_0}(2\omega h - 3) + \phi_{\eta_0}(2\omega h + 3) \big] e^{i\omega x}\,\mathrm{d}\omega = \frac{1}{h}\, \eta_0\Big(\frac{x}{2h}\Big) \cos\Big(\frac{3x}{2h}\Big).
\]
The function $f_1$ is defined by (A.13), and the choice $M = A h^{\alpha+1}$ together with the properties of the function $\eta$ guarantees that $f_1$ is a density from the class $\mathcal{W}_{\alpha,q}(A,B)$ with $q \le 2s$. With this construction,
\[
|f_1(0) - f_0(0)| = c_0 M \eta(0) = c_{19} A h^{\alpha}.
\]
The upper bound on the $\chi^2$-divergence between $f_{Y,0}$ and $f_{Y,1}$ is computed along the same lines as above, with the following modifications. Now we apply (A.18) to get
\[
\Big| \frac{\mathrm{d}^s}{\mathrm{d}\omega^s} \big[ \phi_g(\omega) \phi_\eta(\omega) \big] \Big| \le \sum_{j=0}^{s} \binom{s}{j} \big| \phi_g^{(j)}(\omega)\, \phi_\eta^{(s-j)}(\omega) \big| \le \frac{c_{20}}{|\theta\omega|^m} \sum_{j=0}^{s} |\phi_\eta^{(s-j)}(\omega)|,
\]
and, by the properties of the function $\phi_\eta$,
\[
\int_{-\infty}^{\infty} x^{2s} |(g \star \eta)(x)|^2\,\mathrm{d}x \le c_{21} \int_{1/h}^{2/h} |\omega|^{-2m}\,\mathrm{d}\omega = c_{22} h^{2m-1}.
\]
The same upper bound holds for the integral $\int_{-\infty}^{\infty} |(g \star \eta)(x)|^2\,\mathrm{d}x$, which leads to
\[
\chi^2(f_{Y,0}; f_{Y,1}) \le c_{23} M^2 h^{2m-1} = c_{23} A^2 h^{2\alpha+2m+1}.
\]
Then (A.21) follows from the choice $h_* = (A^2 n)^{-1/(2\alpha+2m+1)}$.

Combining (A.20) and (A.21), and noting that for $1 < q < 2m-1$
\[
\frac{\alpha}{2\alpha+2m+1+(\alpha/q)(2m-1-q)} \le \frac{\alpha}{2\alpha+2m+1},
\]
we complete the proof. □

A.3. Proof of Corollary 1
Proof.
The upper bound (3.7) is obtained directly from Theorem 1 applied with $q = 1$, so we only need to establish (3.6). The proof goes along the lines of the proof of Theorem 2, with the minor modifications indicated below.

Define
\[
f_0(x) := \frac{h}{\pi(1 + h^2 x^2)}, \qquad x \in \mathbb{R},
\]
where $h > 0$; then $f_0 \in \mathcal{H}_\alpha(A)$ for small enough $h$. Using the function $\eta$ defined in (A.10), (A.11) and (A.12), let $f_1(x) := f_0(x) + c_0 M \eta(x)$ for $x \in \mathbb{R}$. Similarly to the proof of Theorem 2, $|\eta(x)| \le c_1 h^{-1} N |x|^{-2}$. Set $M := N^{-1}$, so that $c_0 M |\eta(x)| \le c_0 c_1 / (h |x|^2) \le f_0(x)$ holds for sufficiently small $c_0$. Since we use the same function $\eta$ as in Theorem 2, we can ensure $f_1 \in \mathcal{H}_\alpha(A)$ by setting
\[
M h N^{\alpha+1} = h N^{\alpha} \le A. \tag{A.22}
\]
Therefore, for $x = 0$, we have the point-wise distance
\[
|f_1(0) - f_0(0)| = c_0 M \eta(0) = 2 c_0 M h\, \eta_0(0) N = c_2 h.
\]
The bound on the $\chi^2$-divergence takes the form
\[
\chi^2(f_{Y,0}; f_{Y,1}) \le \frac{c_3 M^2}{h} \int_{-\infty}^{\infty} |(g \star \eta)(x)|^2\,\mathrm{d}x + c_3 M^2 h \int_{-\infty}^{\infty} x^2 |(g \star \eta)(x)|^2\,\mathrm{d}x \le c_4 \frac{M^2}{h} h^{2m+1} N^{-2m+1} + c_4 M^2 h\, h^{2m-1} N^{-2m+1} \le c_5 h^{2m} N^{-2m-1}. \tag{A.23}
\]
Based on (A.22) and (A.23), we choose $h = h_*$ and $N = N_*$ as follows:
\[
N_* := (A/h_*)^{1/\alpha}, \qquad h_* := \big( A^{2m+1}/n^{\alpha} \big)^{\frac{1}{2m\alpha+2m+1}},
\]
which leads to the announced result. □

A.4. Proof of Theorem 3

Proof. (I). The error of the estimator $\hat f^+_\tau(x)$ satisfies
\[
|\hat f^+_\tau(x) - f(x)| \le |B_\tau(x; f)| + |\xi_\tau(x)|,
\]
where $B_\tau(x; f)$ is the bias term and $\xi_\tau(x)$ is the stochastic error given by (4.1). The bias term is expressed as follows (see the proof of Theorem 1):
\[
\begin{aligned}
B_\tau(x; f) &:= \mathbb{E}_f\big[\hat f^+_{h,N}(x)\big] - f(x) \\
&= \frac{1}{h} \int_{-\infty}^{\infty} K\Big(\frac{t-x}{h}\Big) [f(t) - f(x)]\,\mathrm{d}t + \sum_{j=1}^{m} \binom{m}{j} (-1)^j \int_{-1}^{1} K(y)\, f(yh + x + 2\theta(N+1)j)\,\mathrm{d}y \\
&= \sum_{j=0}^{m} \binom{m}{j} (-1)^j \int_{-1}^{1} K(y) \big[ f(yh + x + 2\theta(N+1)j) - f(x + 2\theta(N+1)j) \big]\,\mathrm{d}y + \sum_{j=1}^{m} \binom{m}{j} (-1)^j f(x + 2\theta(N+1)j).
\end{aligned}
\]
Therefore, by the definitions of $\bar B_h(f)$ and $\bar B_N(x; f)$ [see (4.8), (4.9)], we have
\[
|B_\tau(x; f)| \le 2^m \bar B_h(f) + 2^m \bar B_N(x; f) \le \bar B_\tau(x; f),
\]
where $\bar B_\tau(x; f)$ is defined in (4.10).

(II). Now we demonstrate that
\[
|B_{\tau \vee\wedge \tau'}(x; f) - B_{\tau'}(x; f)| \le \bar B_\tau(x; f), \qquad \forall \tau, \tau' \in \mathcal{T}.
\]
For this purpose denote
\[
S_h(x) := \frac{1}{h} \int_{-\infty}^{\infty} K\Big(\frac{t-x}{h}\Big) [f(t) - f(x)]\,\mathrm{d}t, \qquad T_N(x) := \sum_{j=1}^{m} \binom{m}{j} (-1)^j f(x + 2\theta(N+1)j),
\]
and write
\[
B_\tau(x; f) = S_h(x) + T_N(x) + \sum_{j=1}^{m} \binom{m}{j} (-1)^j S_h(x + 2\theta(N+1)j). \tag{A.24}
\]
In view of (A.24), for any pair $\tau = (h, N)$, $\tau' = (h', N')$ we have
\[
B_{\tau \vee\wedge \tau'}(x; f) - B_{\tau'}(x; f) = \big[ S_{h \vee h'}(x) - S_{h'}(x) \big] + \big[ T_{N \wedge N'}(x) - T_{N'}(x) \big] + \sum_{j=1}^{m} \binom{m}{j} (-1)^j \big[ S_{h \vee h'}(x + 2\theta(N \wedge N' + 1)j) - S_{h'}(x + 2\theta(N' + 1)j) \big]. \tag{A.25}
\]
We consider the three terms on the right-hand side of (A.25):
\[
\sup_{h' \in \mathcal{H}} \big| S_{h \vee h'}(x) - S_{h'}(x) \big| = \sup_{h' \le h} \big| S_{h \vee h'}(x) - S_{h'}(x) \big| \le \big| S_h(x) \big| + \sup_{h' \le h} \big| S_{h'}(x) \big| \le 2 \sup_{h' \le h} \big| S_{h'}(x) \big|, \tag{A.26}
\]
and similarly
\[
\sup_{N' \in \mathcal{N}} \big| T_{N \wedge N'}(x) - T_{N'}(x) \big| \le 2 \sup_{N' \ge N} \sum_{j=1}^{m} \binom{m}{j} f(x + 2\theta(N'+1)j). \tag{A.27}
\]
Furthermore,
\[
\begin{aligned}
\sup_{h', N'} \big| S_{h \vee h'}(x + 2\theta(N \wedge N' + 1)j) &- S_{h'}(x + 2\theta(N'+1)j) \big| \\
&\le \sup_{h', N'} \big| S_{h \vee h'}(x + 2\theta(N \wedge N' + 1)j) - S_{h'}(x + 2\theta(N \wedge N' + 1)j) \big| + \sup_{h', N'} \big| S_{h'}(x + 2\theta(N \wedge N' + 1)j) - S_{h'}(x + 2\theta(N'+1)j) \big| \\
&\le 2 \sup_{h' \le h} \big\| S_{h'} \big\|_\infty + 2 \sup_{h' \in \mathcal{H}} \sup_{N' \ge N} \big| S_{h'}(x + 2\theta(N'+1)j) \big| \\
&\le 2 \sup_{h' \le h} \big\| S_{h'} \big\|_\infty + 2 \sup_{h' \in \mathcal{H}} \sup_{N' \ge N} \Big| \int_{-1}^{1} K(y)\, f(y h' + x + 2\theta(N'+1)j)\,\mathrm{d}y \Big| + 2 \sup_{N' \ge N} f(x + 2\theta(N'+1)j) \\
&\le 2 \sup_{h' \le h} \big\| S_{h'} \big\|_\infty + 2(1 + \|K\|_1) \sup_{|t| \le \theta} \sup_{N' \ge N} f(t + x + 2\theta(N'+1)j). \tag{A.28}
\end{aligned}
\]
Combining (A.26)-(A.28) with (A.25), we obtain
\[
\sup_{\tau' \in \mathcal{T}} \big| B_{\tau \vee\wedge \tau'}(x; f) - B_{\tau'}(x; f) \big| \le 2^{m+1} \sup_{h' \le h} \big\| S_{h'} \big\|_\infty + 2^{m+1} (1 + \|K\|_1) \max_{1 \le j \le m} \sup_{|t| \le \theta} \sup_{N' \ge N} f(t + x + 2\theta(N'+1)j) = 2^{m+1} \bar B_h(f) + 2^{m+1} (1 + \|K\|_1) \bar B_N(x; f) \le \bar B_\tau(x; f), \tag{A.29}
\]
where $\bar B_h(f)$, $\bar B_N(x; f)$ and $\bar B_\tau(x; f)$ are defined in (4.8), (4.9) and (4.10), respectively.

(III). Let $\hat\tau = (\hat h, \hat N)$ be the parameter selected by the rule (4.6)-(4.7). For any $\tau \in \mathcal{T}$, the triangle inequality gives
\[
|\hat f^+_{\hat\tau}(x) - f(x)| \le |\hat f^+_{\hat\tau}(x) - \hat f^+_{\hat\tau \vee\wedge \tau}(x)| + |\hat f^+_{\tau \vee\wedge \hat\tau}(x) - \hat f^+_{\tau}(x)| + |\hat f^+_{\tau}(x) - f(x)|.
\]
(A.30)

Now we bound the three terms on the right-hand side of (A.30) separately.

We begin with the following simple observation: it follows from (4.6) that
\[
\hat R_\tau(x) - \hat\Lambda_\tau(\kappa) - \sup_{\tau' \in \mathcal{T}} \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa) = \sup_{\tau' \in \mathcal{T}} \Big[ \big| \hat f^+_{\tau \vee\wedge \tau'}(x) - \hat f^+_{\tau'}(x) \big| - \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa) - \hat\Lambda_{\tau'}(\kappa) \Big]_+ \le \sup_{\tau' \in \mathcal{T}} \big| B_{\tau \vee\wedge \tau'}(x; f) - B_{\tau'}(x; f) \big| + \sup_{\tau' \in \mathcal{T}} \Big[ \big| \xi_{\tau \vee\wedge \tau'}(x) - \xi_{\tau'}(x) \big| - \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa) - \hat\Lambda_{\tau'}(\kappa) \Big]_+.
\]
Hence, by (A.29),
\[
\hat R_\tau(x) \le \bar B_\tau(x; f) + 2\hat\zeta(x) + \hat\Lambda_\tau(\kappa) + \sup_{\tau' \in \mathcal{T}} \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa), \tag{A.31}
\]
where $\hat\zeta(x) := \sup_{\tau \in \mathcal{T}} \big[ |\xi_\tau(x)| - \hat\Lambda_\tau(\kappa) \big]_+$. Therefore, for any $\tau, \tau' \in \mathcal{T}$,
\[
\big| \hat f^+_{\tau \vee\wedge \tau'}(x) - \hat f^+_{\tau'}(x) \big| \le \big| B_{\tau \vee\wedge \tau'}(x; f) - B_{\tau'}(x; f) \big| + \big| \xi_{\tau \vee\wedge \tau'}(x) - \xi_{\tau'}(x) \big| \le \bar B_\tau(x; f) + 2\hat\zeta(x) + \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa) + \hat\Lambda_{\tau'}(\kappa) \le \bar B_\tau(x; f) + 2\hat\zeta(x) + \hat R_{\tau'}(x),
\]
where the last inequality follows from the definition of $\hat R_\tau(x)$. This inequality together with (A.31) implies the following bound on the first term on the right-hand side of (A.30):
\[
|\hat f^+_{\hat\tau \vee\wedge \tau}(x) - \hat f^+_{\hat\tau}(x)| \le \bar B_\tau(x; f) + 2\hat\zeta(x) + \hat R_{\hat\tau}(x) \le \bar B_\tau(x; f) + 2\hat\zeta(x) + \hat R_\tau(x) \le 2\bar B_\tau(x; f) + 4\hat\zeta(x) + \hat\Lambda_\tau(\kappa) + \sup_{\tau' \in \mathcal{T}} \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa), \tag{A.32}
\]
where in the penultimate inequality we have used that $\hat R_{\hat\tau}(x) \le \hat R_\tau(x)$ for any $\tau \in \mathcal{T}$.

We proceed with bounding the second term on the right-hand side of (A.30): by the definition of $\hat R_{\hat\tau}(x)$ we have
\[
|\hat f^+_{\tau \vee\wedge \hat\tau}(x) - \hat f^+_{\tau}(x)| \le \hat R_{\hat\tau}(x) + \hat\Lambda_{\tau \vee\wedge \hat\tau}(\kappa) + \hat\Lambda_\tau(\kappa) \le \hat R_{\hat\tau}(x) + \sup_{\tau' \in \mathcal{T}} \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa) + \hat\Lambda_\tau(\kappa) \le \hat R_\tau(x) + \sup_{\tau' \in \mathcal{T}} \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa) + \hat\Lambda_\tau(\kappa) \le \bar B_\tau(x; f) + 2\hat\zeta(x) + 2 \sup_{\tau' \in \mathcal{T}} \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa) + 2\hat\Lambda_\tau(\kappa). \tag{A.33}
\]
Finally,
\[
|\hat f^+_{\tau}(x) - f(x)| \le |B_\tau(x; f)| + |\xi_\tau(x)| \le \bar B_\tau(x; f) + \Lambda_\tau(\kappa) + \zeta(x), \tag{A.34}
\]
where we recall that $\zeta(x) := \sup_{\tau \in \mathcal{T}} \big[ |\xi_\tau(x)| - \Lambda_\tau(\kappa) \big]_+$. Combining (A.32), (A.33), (A.34) and (A.30), we obtain
\[
\big| \hat f^+_{\hat\tau}(x) - f(x) \big| \le \inf_{\tau \in \mathcal{T}} \Big\{ 4\bar B_\tau(x; f) + 3\hat\Lambda_\tau(\kappa) + 3 \sup_{\tau' \in \mathcal{T}} \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa) + \Lambda_\tau(\kappa) \Big\} + 6\hat\zeta(x) + \zeta(x).
\]

(IV). We complete the proof using Lemmas 2 and 1 in the Appendix. Observing that $\hat\Lambda_\tau(\kappa) = 7\tilde\Lambda_\tau(\kappa)$ and applying the first inequality in (A.41), we have
\[
\hat\zeta(x) \le \zeta(x) + \sup_{\tau \in \mathcal{T}} \big[ \Lambda_\tau(\kappa) - 7\tilde\Lambda_\tau(\kappa) \big]_+ \le \zeta(x) + 2 c_0 \eta(x),
\]
where $c_0 = 2^{-m-1}\theta \|K^{(m)}\|_\infty^{-1}$ [cf. Lemma 2]. Then, using the second inequality in (A.41) in order to bound $\hat\Lambda_\tau(\kappa)$ and $\sup_{\tau' \in \mathcal{T}} \hat\Lambda_{\tau \vee\wedge \tau'}(\kappa)$ in terms of $\Lambda_\tau(\kappa)$, we obtain
\[
\big| \hat f^+_{\hat\tau}(x) - f(x) \big| \le \inf_{\tau \in \mathcal{T}} \Big\{ 4\bar B_\tau(x; f) + 127 \Lambda_\tau(\kappa) + 126 \sup_{\tau' \in \mathcal{T}} \Lambda_{\tau \vee\wedge \tau'}(\kappa) \Big\} + 7\zeta(x) + (42 + 12 c_0)\eta(x) + \frac{42\kappa}{n}.
\]
By the definition of the operation $\vee\wedge$ and by the definitions of $\sigma_\tau$ and $u_\tau$ [see (4.2) and (4.3)], we have $\sigma_{\tau \vee\wedge \tau'} \le \sigma_\tau$ and $u_{\tau \vee\wedge \tau'} \le u_\tau$ for any $\tau, \tau' \in \mathcal{T}$; therefore $\sup_{\tau' \in \mathcal{T}} \Lambda_{\tau \vee\wedge \tau'}(\kappa) \le \Lambda_\tau(\kappa)$ for all $\tau \in \mathcal{T}$. We complete the proof by setting $\delta(x) = \zeta(x) + \eta(x)$ and using Lemma 1. □

A.5. Proof of Corollary 2
Proof.
Below $c_1, c_2, \ldots$ stand for positive constants independent of $n$, $A$ and $B$. The proof goes along the following lines: we select values of $h$ and $N$ from $\mathcal{H} \times \mathcal{N}$ and apply the oracle inequality of Theorem 3.

The proof of Theorem 1 shows that if $f \in \mathcal{W}_{\alpha,q}(A,B)$ then
\[
\bar B_h(f) \le c_1 A h^{\alpha}, \qquad \bar B_N(x; f) \le c_2 B \theta^{-q} N^{-q}.
\]
Furthermore, by (A.4),
\[
\sigma_\tau^2 \le \frac{c_3\, \theta^{2m-q} B\, \psi_N}{h^{2m+1}}, \qquad \psi_N := \begin{cases} 1, & q > 2m-1, \\ \log N, & q = 2m-1, \\ N^{2m-q-1}, & q < 2m-1. \end{cases}
\]
In addition, with $\kappa^* = 5 \log n$ we have
\[
\Lambda_\tau(\kappa^*) \le c_4 \Big( B^{1/2} \psi_N^{1/2} h^{-m-1/2} \sqrt{\frac{\log n}{n}} + N^{m-1} h^{-m-1}\, \frac{\log n}{n} \Big). \tag{A.35}
\]
First we note that, for all $h_{\min} \le h \le h_{\max}$, $N \le N_{\max}$ and all sufficiently large $n$,
\[
\Lambda_\tau(\kappa^*) \le c_5 B^{1/2} \psi_N^{1/2} h^{-m-1/2} \sqrt{\frac{\log n}{n}}.
\]
Indeed, this inequality follows from (A.35) because, by the choice of $h_{\min}$ and $N_{\max}$, for large $n$ one has
\[
\Big( \frac{n}{\log n} \Big)^{m/(2m+1)} \ge \Big( \frac{n}{\log n} \Big)^{(m-1)/(2m)} = N_{\max}^{m-1}.
\]
Thus, using (4.11), we have
\[
|\hat f^*(x) - f(x)| \le c_6 \inf_{(h,N) \in \mathcal{H} \times \mathcal{N}} \Big\{ A h^{\alpha} + \frac{B}{\theta^q N^q} + B^{1/2} \psi_N^{1/2} h^{-m-1/2} \sqrt{\frac{\log n}{n}} \Big\} + c_7 \Big( \delta(x) + \frac{\log n}{n} \Big).
\]
Now we set $h_*$ and $N_*$ to be defined by formulas (A.8), (A.7) and (A.6) with $n$ replaced by $n/\log n$. Note that these values of $h$ and $N$ balance the bias and stochastic error bounds on the right-hand side of the previous display [for details see the proof of Theorem 1]. We need to verify that $h_*$ and $N_*$ satisfy $h_* \ge h_{\min}$ and $N_* \le N_{\max}$ for large $n$. The first inequality is evident because $1/(2\alpha + 2m + 1) \le 1/(2m+1)$ for all $\alpha > 0$. To check the inequality $N_* \le N_{\max}$, we note that $N_* = O\big( (n/\log n)^{\frac{\alpha}{q(2m+2\alpha+1+r)}} \big)$ in the case $1 \le q < 2m-1$, and
\[
\frac{\alpha}{q(2m+2\alpha+1+r)} = \frac{\alpha}{\alpha(2m-1-q) + q(2m+2\alpha+1)} \le \frac{1}{2m} \qquad \forall \alpha > 0.
\]
If $q \ge 2m-1$ then $N_* = O\big( (n/\log n)^{\frac{\alpha}{q(2m+2\alpha+1)}} \big)$, and
\[
\frac{\alpha}{q(2m+2\alpha+1)} \le \frac{\alpha}{(2m-1)(2m+2\alpha+1)} \le \frac{1}{2(2m-1)} \le \frac{1}{2m}, \qquad \forall \alpha > 0.
\]
Thus we always have $N_* \le N_{\max}$ for large $n$. The inequalities $h_* \ge h_{\min}$ and $N_* \le N_{\max}$ imply that the sets $\mathcal{H}$ and $\mathcal{N}$ contain elements that bound $h_*$ and $N_*$ from below and from above within constant factors. This yields
\[
|\hat f^*(x) - f(x)| \le c_8\, \varphi(n/\log n) + c_9 \Big( \delta(x) + \frac{\log n}{n} \Big),
\]
where the function $\varphi(\cdot)$ is defined in (3.5).

To complete the proof we note that $M_h = O(\log n)$, $M_N = O(n^{1/(2m)})$, and
\[
\bar\Lambda(\kappa^*) \le c_{10}\, \frac{N_{\max}^{m-1}}{h_{\min}^{m+1}} \sqrt{\frac{\log n}{n}} \Big( 1 + \frac{N_{\max}^{m-1}}{h_{\min}^{m+1}} \Big) \le c_{11} \Big( \frac{n}{\log n} \Big)^{3},
\]
so that, with $\kappa^* = 5 \log n$,
\[
\mathbb{E}_f[\delta(x)] \le c_{12} (\log n)\, n^{1/(2m)} \Big( \frac{n}{\log n} \Big)^{3} e^{-5 \log n} \le c_{13}\, n^{-1}.
\]
This completes the proof. □
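For readers who want to experiment, the series-type estimator analyzed above can be prototyped in a few lines. The sketch below is our own illustrative implementation for the case $m = 1$ (a single uniform error on $[-\theta, \theta]$); the polynomial kernel, its normalization, and all tuning values are assumptions, not the paper's choices. It mimics the structure $\hat f^+_{h,N}(x) = n^{-1} \sum_i L^+_\tau(Y_i)$ with weights $C_{j,m} = \binom{j+m-1}{m-1}$ and shifted evaluations of $K^{(m)}$ (without the positive-part correction):

```python
import numpy as np
from math import comb

def make_kernel(m, smooth=3):
    """Polynomial kernel K(u) proportional to (1-u^2)^(m+smooth) on [-1,1],
    normalized to integrate to one; returns K and its m-th derivative."""
    base = np.polynomial.Polynomial([1, 0, -1]) ** (m + smooth)   # (1-u^2)^(m+smooth)
    antider = base.integ()
    K = base / (antider(1.0) - antider(-1.0))
    return K, K.deriv(m)

def fhat_plus(x, Y, h, N, m, theta):
    """Sketch of the series estimator for x >= 0: average of shifted,
    rescaled evaluations of K^(m) with negative-binomial weights C_{j,m}."""
    _, Km = make_kernel(m)
    total = 0.0
    for j in range(N + 1):
        u = (Y - x - theta * (2 * j + m)) / h
        vals = np.where(np.abs(u) <= 1.0, Km(u), 0.0)   # K^(m) supported on [-1,1]
        total += comb(j + m - 1, m - 1) * vals.sum()
    return (2 * theta) ** m / (len(Y) * h ** (m + 1)) * total

# toy data: X ~ N(0,1) contaminated by uniform noise on [-theta, theta]
rng = np.random.default_rng(3)
n, theta = 20_000, 0.5
Y = rng.standard_normal(n) + rng.uniform(-theta, theta, size=n)
est = fhat_plus(0.0, Y, h=0.3, N=20, m=1, theta=theta)
```

At these (assumed) tuning values one expects `est` to land near $f(0) = 1/\sqrt{2\pi} \approx 0.40$, up to a stochastic error of order $0.1$; the $h^{-(m+1)}$ factor in the normalization makes the variance inflation of the inverse problem directly visible.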
A.6. Auxiliary Results
Denote
\[
L^+_\tau(y) := \frac{(2\theta)^m}{h^{m+1}} \sum_{j=0}^{N} C_{j,m} K^{(m)}\Big( \frac{y - x - \theta(2j+m)}{h} \Big).
\]
Then
\[
\mathrm{var}_f[\hat f^+_\tau(x)] = \mathbb{E}_f[\xi_\tau(x)]^2, \qquad \xi_\tau(x) := \frac{1}{n} \sum_{i=1}^{n} \big[ L^+_\tau(Y_i) - \mathbb{E}_f L^+_\tau(Y_i) \big].
\]
Let
\[
\zeta(x) := \sup_{\tau \in \mathcal{T}} \big[ |\xi_\tau(x)| - \Lambda_\tau(\kappa) \big]_+, \tag{A.36}
\]
\[
\eta(x) := \sup_{\tau \in \mathcal{T}} \big[ |\hat\sigma^2_\tau - \sigma^2_\tau| - u_\tau \Lambda_\tau(\kappa) \big]_+. \tag{A.37}
\]

Lemma 1.
For any $p \ge 1$ and $\kappa > 0$ one has
\[
\mathbb{E}_f[\zeta(x)]^p \le 2\Gamma(p+1)\, M_h M_N \sup_{\tau \in \mathcal{T}} \big[ \Lambda_\tau(\kappa) \big]^p \kappa^{-p} e^{-\kappa}, \qquad \mathbb{E}_f[\eta(x)]^p \le 2\Gamma(p+1)\, M_h M_N \sup_{\tau \in \mathcal{T}} \big[ u_\tau \Lambda_\tau(\kappa) \big]^p \kappa^{-p} e^{-\kappa}.
\]

Proof. (i). Observe that $|L^+_\tau(Y_i)| \le u_\tau/2$, where $u_\tau$ is defined in (4.3); hence $|\xi_\tau(x)| \le u_\tau$. In addition, it follows from (A.1) that
\[
\mathrm{var}_f\big[ L^+_\tau(Y_1) \big] \le \sigma^2_\tau := \frac{(2\theta)^{2m}}{h^{2m+2}} \sum_{j=0}^{N} C^2_{j,m} \int_{-\infty}^{\infty} \Big| K^{(m)}\Big( \frac{y - x - \theta(2j+m)}{h} \Big) \Big|^2 f_Y(y)\,\mathrm{d}y.
\]
By Bernstein's inequality, for any $z > 0$,
\[
\mathbb{P}_f\big\{ |\xi_\tau(x)| \ge z \big\} \le 2 \exp\Big\{ - \frac{n z^2}{2\sigma^2_\tau + \tfrac{1}{3} u_\tau z} \Big\}.
\]
Therefore, for $\Lambda_\tau(\kappa)$ defined in (4.4), we obtain
\[
\mathbb{P}_f\big\{ |\xi_\tau(x)| \ge \Lambda_\tau(\kappa) \big\} \le 2 \exp\bigg\{ - \frac{ \big( \sigma_\tau \sqrt{2\kappa/n} + 2 u_\tau \kappa/n \big)^2 }{ 2\sigma^2_\tau/n + \tfrac{1}{3}(u_\tau/n) \big( \sigma_\tau \sqrt{2\kappa/n} + 2 u_\tau \kappa/n \big) } \bigg\} \le 2 e^{-\kappa}, \tag{A.38}
\]
where we have used the following elementary inequality: for any $a > 0$, $b > 0$ and $\kappa > 0$,
\[
\frac{ (\sqrt{2\kappa}\, a + \kappa b)^2 }{ 2a^2 + b(\sqrt{2\kappa}\, a + \kappa b) } \ge \kappa. \tag{A.39}
\]
Therefore, for any $p \ge 1$,
\[
\begin{aligned}
\mathbb{E}_f\big[ |\xi_\tau(x)| - \Lambda_\tau(\kappa) \big]^p_+ &= p \int_0^\infty t^{p-1}\, \mathbb{P}_f\big\{ |\xi_\tau(x)| \ge \Lambda_\tau(\kappa) + t \big\}\,\mathrm{d}t \\
&\le p \big[ \Lambda_\tau(\kappa) \big]^p \int_0^\infty y^{p-1}\, \mathbb{P}_f\big\{ |\xi_\tau(x)| \ge \Lambda_\tau(\kappa(1+y)) \big\}\,\mathrm{d}y \\
&\le 2p \big[ \Lambda_\tau(\kappa) \big]^p \int_0^\infty y^{p-1} e^{-\kappa(1+y)}\,\mathrm{d}y = 2\Gamma(p+1) \big[ \Lambda_\tau(\kappa) \big]^p \kappa^{-p} e^{-\kappa}, \tag{A.40}
\end{aligned}
\]
where the second line follows from the change of variables $t = \Lambda_\tau(\kappa) y$ and the fact that $\Lambda_\tau(a\kappa) \le a \Lambda_\tau(\kappa)$ for $a \ge 1$, and the third line is a consequence of (A.38). The first bound of the lemma follows by summing over the $M_h M_N$ elements of $\mathcal{T}$.

(ii). Let $\hat\sigma^2_\tau$ be the empirical estimator of $\sigma^2_\tau$ based on the sample $Y_1, Y_2, \ldots, Y_n$:
\[
\hat\sigma^2_\tau := \frac{(2\theta)^{2m}}{n h^{2m+2}} \sum_{i=1}^{n} \sum_{j=0}^{N} C^2_{j,m} \Big| K^{(m)}\Big( \frac{Y_i - x - \theta(2j+m)}{h} \Big) \Big|^2.
\]
Then
\[
\hat\sigma^2_\tau - \sigma^2_\tau = \frac{1}{n} \sum_{i=1}^{n} \big( \psi_\tau(Y_i) - \mathbb{E}_f[\psi_\tau(Y_i)] \big),
\]
where we put
\[
\psi_\tau(y) := \frac{(2\theta)^{2m}}{h^{2m+2}} \sum_{j=0}^{N} C^2_{j,m} \Big| K^{(m)}\Big( \frac{y - x - \theta(2j+m)}{h} \Big) \Big|^2.
\]
It is evident that
\[
|\psi_\tau(y)| \le \frac{(2\theta)^{2m}}{h^{2m+2}} C^2_{N,m} \|K^{(m)}\|^2_\infty = \frac{u^2_\tau}{4}, \qquad \forall y;
\]
hence $|\psi_\tau(Y_i) - \mathbb{E}_f[\psi_\tau(Y_i)]| \le u^2_\tau/4$, and
\[
\mathrm{var}_f\{\psi_\tau(Y_i)\} \le \mathbb{E}_f\big[ \psi^2_\tau(Y_i) \big] \le \frac{\sigma^2_\tau u^2_\tau}{4}.
\]
Therefore, by Bernstein's inequality, for any $z \ge 0$,
\[
\mathbb{P}_f\big\{ |\hat\sigma^2_\tau - \sigma^2_\tau| \ge z \big\} \le 2 \exp\Big\{ - \frac{n z^2}{\tfrac{1}{2}\sigma^2_\tau u^2_\tau + \tfrac{1}{6} u^2_\tau z} \Big\}.
\]
This inequality together with (A.39) implies that
\[
\mathbb{P}_f\big\{ |\hat\sigma^2_\tau - \sigma^2_\tau| \ge u_\tau \Lambda_\tau(\kappa) \big\} \le \mathbb{P}_f\Big\{ |\hat\sigma^2_\tau - \sigma^2_\tau| \ge u_\tau \Big( \sigma_\tau \sqrt{\frac{2\kappa}{n}} + \frac{2 u_\tau \kappa}{n} \Big) \Big\} \le 2 e^{-\kappa}.
\]
Similarly to the derivation in (A.40), we have for any $p \ge 1$:
\[
\begin{aligned}
\mathbb{E}_f\big[ |\hat\sigma^2_\tau - \sigma^2_\tau| - u_\tau \Lambda_\tau(\kappa) \big]^p_+ &= p \int_0^\infty t^{p-1}\, \mathbb{P}_f\big\{ |\hat\sigma^2_\tau - \sigma^2_\tau| \ge u_\tau \Lambda_\tau(\kappa) + t \big\}\,\mathrm{d}t \\
&\le p \big[ u_\tau \Lambda_\tau(\kappa) \big]^p \int_0^\infty y^{p-1}\, \mathbb{P}_f\big\{ |\hat\sigma^2_\tau - \sigma^2_\tau| \ge u_\tau \Lambda_\tau(\kappa(1+y)) \big\}\,\mathrm{d}y \\
&\le 2p \big[ u_\tau \Lambda_\tau(\kappa) \big]^p \int_0^\infty y^{p-1} e^{-\kappa(1+y)}\,\mathrm{d}y = 2\Gamma(p+1) \big[ u_\tau \Lambda_\tau(\kappa) \big]^p \kappa^{-p} e^{-\kappa}.
\end{aligned}
\]
This completes the proof. □
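Lemma 1 rests on Bernstein's inequality for sums of bounded random variables. As a quick empirical sanity check (a toy setup of ours, not the quantities $\xi_\tau$, $\sigma_\tau$, $u_\tau$ of the lemma), one can compare the Bernstein tail bound $2\exp\{-nz^2/(2\sigma^2 + \tfrac{2}{3} b z)\}$ with Monte Carlo frequencies for a bounded sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 20_000
p, z = 0.3, 0.1
b, var = 1.0 - p, p * (1 - p)           # |X_i - p| <= b, Var(X_i) = var

# Monte Carlo frequency of a deviation of the empirical mean by at least z
means = rng.binomial(n, p, size=reps) / n
emp = np.mean(np.abs(means - p) >= z)

# Bernstein tail bound for the centered empirical mean
bound = 2 * np.exp(-n * z**2 / (2 * var + 2 * b * z / 3))
```

The empirical frequency sits well below the bound, as expected: the exponential inequality is conservative but of the correct order, which is exactly the slack exploited by the majorant $\Lambda_\tau(\kappa)$.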
Denote
\[
\tilde\Lambda_\tau(\kappa) := \hat\sigma_\tau \sqrt{\frac{2\kappa}{n}} + \frac{2 u_\tau \kappa}{n},
\]
and observe that $7\tilde\Lambda_\tau(\kappa) = \hat\Lambda_\tau(\kappa)$, where $\hat\Lambda_\tau(\kappa)$ is defined in (4.5).

Lemma 2.
For any $\tau \in \mathcal{T}$ one has
\[
\big[ \Lambda_\tau(\kappa) - 3\tilde\Lambda_\tau(\kappa) \big]_+ \le 2 c_0 \eta(x), \qquad \big[ \tilde\Lambda_\tau(\kappa) - 6\Lambda_\tau(\kappa) \big]_+ \le \eta(x) + \frac{\kappa}{n}, \tag{A.41}
\]
where $\eta(x)$ is defined in (A.37) and $c_0 := 2^{-m-1} \theta \|K^{(m)}\|_\infty^{-1}$.

Proof. We have $\tilde\Lambda_\tau(\kappa) - \Lambda_\tau(\kappa) = (\hat\sigma_\tau - \sigma_\tau)\sqrt{2\kappa/n}$. Define
\[
\mathcal{T}_1 := \Big\{ \tau \in \mathcal{T} : \sigma_\tau \sqrt{\frac{2\kappa}{n}} \ge \frac{4 u_\tau \kappa}{n} \Big\}.
\]
If $\tau \in \mathcal{T}_1$ then $\sigma_\tau \ge 2 u_\tau \sqrt{2\kappa/n}$, and
\[
|\hat\sigma_\tau - \sigma_\tau| = \frac{|\hat\sigma^2_\tau - \sigma^2_\tau|}{\hat\sigma_\tau + \sigma_\tau} \le \frac{|\hat\sigma^2_\tau - \sigma^2_\tau|}{\sigma_\tau} \le \frac{1}{2 u_\tau} \sqrt{\frac{n}{2\kappa}} \big[ \eta(x) + u_\tau \Lambda_\tau(\kappa) \big];
\]
hence, for any $\tau \in \mathcal{T}_1$,
\[
|\tilde\Lambda_\tau(\kappa) - \Lambda_\tau(\kappa)| \le \frac{1}{2}\Lambda_\tau(\kappa) + \frac{\eta(x)}{2 u_\tau} \le \frac{1}{2}\Lambda_\tau(\kappa) + c_0 \eta(x), \tag{A.42}
\]
where we have used that $u_\tau \ge 2^{m+1} \theta^{-1} \|K^{(m)}\|_\infty$ for all $\tau \in \mathcal{T}$, and denoted for brevity $c_0 := 2^{-m-1} \theta \|K^{(m)}\|_\infty^{-1}$. Thus (A.42) implies that
\[
\big[ \tilde\Lambda_\tau(\kappa) - 2\Lambda_\tau(\kappa) \big]_+ \le c_0 \eta(x) \quad \text{and} \quad \big[ \Lambda_\tau(\kappa) - 2\tilde\Lambda_\tau(\kappa) \big]_+ \le 2 c_0 \eta(x), \qquad \forall \tau \in \mathcal{T}_1. \tag{A.43}
\]
Now assume that $\tau \in \mathcal{T}_2 := \mathcal{T} \setminus \mathcal{T}_1$; for such $\tau$, $\Lambda_\tau(\kappa) \le 6 u_\tau \kappa/n$. Note also that, by definition, $\tilde\Lambda_\tau(\kappa) \ge 2 u_\tau \kappa/n$; therefore
\[
\big[ \Lambda_\tau(\kappa) - 3\tilde\Lambda_\tau(\kappa) \big]_+ = 0, \qquad \forall \tau \in \mathcal{T}_2. \tag{A.44}
\]
Furthermore, we bound $|\hat\sigma_\tau - \sigma_\tau|$ as follows:
\[
|\hat\sigma_\tau - \sigma_\tau| \le |\hat\sigma^2_\tau - \sigma^2_\tau|^{1/2} \le \sqrt{\eta(x)} + \sqrt{u_\tau \Lambda_\tau(\kappa)} \le \sqrt{\eta(x)} + \sqrt{6}\, u_\tau \sqrt{\frac{\kappa}{n}}.
\]
Therefore, for any $\tau \in \mathcal{T}_2$,
\[
\big| \tilde\Lambda_\tau(\kappa) - \Lambda_\tau(\kappa) \big| \le \sqrt{\frac{2\kappa}{n}\, \eta(x)} + 2\sqrt{3}\, \frac{u_\tau \kappa}{n} \le \frac{\kappa}{n} + \eta(x) + 5\Lambda_\tau(\kappa),
\]
where the last bound follows from the elementary inequality $\sqrt{ab} \le a + b$ for $a, b \ge 0$. This implies that
\[
\big[ \tilde\Lambda_\tau(\kappa) - 6\Lambda_\tau(\kappa) \big]_+ \le \frac{\kappa}{n} + \eta(x), \qquad \forall \tau \in \mathcal{T}_2. \tag{A.45}
\]
Combining (A.43), (A.44) and (A.45), we complete the proof. □