Minimax estimation of norms of a probability density: II. Rate-optimal estimation procedures
A. Goldenshluger* and O. V. Lepski†

Department of Statistics, University of Haifa, Mount Carmel, Haifa 31905, Israel
e-mail: [email protected]

Institut de Mathématique de Marseille, Aix-Marseille Université, 39, rue F. Joliot-Curie, 13453 Marseille, France
e-mail: [email protected]
Abstract:
In this paper we develop rate-optimal estimation procedures for the problem of estimating the $L_p$-norm, $p\in(0,\infty)$, of a probability density from independent observations. The density is assumed to be defined on $\mathbb R^d$, $d\ge 1$, and the rate-optimal procedures are constructed for integer values $p\ge 2$. We demonstrate that, depending on the parameters of Nikolskii's class and the norm index $p$, the risk asymptotics ranges from inconsistency to $\sqrt n$-estimation. The results in this paper complement the minimax lower bounds derived in the companion paper Goldenshluger and Lepski (2020).

AMS 2000 subject classifications:

Keywords and phrases: density estimation, minimax risk, $L_p$-norm, $U$-statistics, anisotropic Nikol'skii class.
1. Introduction
Suppose that we observe i.i.d. random vectors $X_i\in\mathbb R^d$, $i=1,\dots,n$, with common probability density $f$. Let $p>0$; our goal is to estimate the $L_p$-norm of $f$,
$$\|f\|_p:=\Big[\int_{\mathbb R^d}|f(x)|^p\,\mathrm dx\Big]^{1/p},$$
from the observation $X^{(n)}=(X_1,\dots,X_n)$. By an estimator of $\|f\|_p$ we mean any $X^{(n)}$-measurable map $\widetilde N:(\mathbb R^d)^n\to\mathbb R$. The accuracy of an estimator $\widetilde N$ is measured by the quadratic risk
$$\mathcal R_n\big[\widetilde N,f\big]:=\Big(\mathbb E_f\big[\widetilde N-\|f\|_p\big]^2\Big)^{1/2},$$
where $\mathbb E_f$ denotes expectation with respect to the probability measure $\mathbb P_f$ of the observations $X^{(n)}=(X_1,\dots,X_n)$.

We adopt the minimax approach to measuring estimation accuracy. Let $\mathcal F$ denote the set of all probability densities defined on $\mathbb R^d$. The maximal risk of an estimator $\widetilde N$ on the set $\mathbb F\subset\mathcal F$ is defined by $\mathcal R_n[\widetilde N;\mathbb F]:=\sup_{f\in\mathbb F}\mathcal R_n[\widetilde N,f]$, and the minimax risk is
$$\mathcal R_n[\mathbb F]:=\inf_{\widetilde N}\mathcal R_n\big[\widetilde N;\mathbb F\big],$$
where $\inf$ is taken over all possible estimators.

---
* Supported by the ISF grant No. 361/15.
† This work has been carried out in the framework of the Labex Archimède (ANR-11-LABX-0033) and of the A*MIDEX project (ANR-11-IDEX-0001-02), funded by the "Investissements d'Avenir" French Government program managed by the French National Research Agency (ANR).
---

In the companion paper Goldenshluger and Lepski (2020) (referred to hereafter as Part I) we derived lower bounds on the minimax risk over the functional classes $\mathbb F=\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)$, where $\mathcal N_{\vec r,d}(\vec\beta,\vec L)$ denotes the anisotropic Nikolskii class [see Definition 1 below], and $\mathcal B_q(Q):=\{f:\|f\|_q\le Q\}$ is the ball of radius $Q$ in $L_q(\mathbb R^d)$. Specifically, we found a sequence $\phi_n$, completely determined by $\vec\beta,\vec L,\vec r,q,p$ and $n$, such that
$$\mathcal R_n\big[\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)\big]\gtrsim\phi_n,\qquad n\to\infty.\tag{1}$$
The goal of the present paper is to develop a rate-optimal estimator, say $\hat N$, such that for any given $\vec\beta,\vec r,\vec L,q,Q$
$$\limsup_{n\to\infty}\phi_n^{-1}\mathcal R_n\big[\hat N,\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)\big]<\infty.$$
We provide an explicit construction of such an estimator for integer values of $p\ge2$.

Let $(e_1,\dots,e_d)$ denote the canonical basis of $\mathbb R^d$. For a function $G:\mathbb R^d\to\mathbb R$ and a real number $u\in\mathbb R$, the first-order difference operator with step size $u$ in the direction of the variable $x_j$ is defined by
$$\Delta_{u,j}G(x)=G(x+ue_j)-G(x),\qquad j=1,\dots,d.$$
By induction, the $k$-th order difference operator is
$$\Delta^k_{u,j}G(x)=\Delta_{u,j}\Delta^{k-1}_{u,j}G(x)=\sum_{l=1}^{k}(-1)^{l+k}\binom{k}{l}\Delta_{ul,j}G(x).$$

Definition 1.
For given vectors $\vec\beta=(\beta_1,\dots,\beta_d)\in(0,\infty)^d$, $\vec r=(r_1,\dots,r_d)\in[1,\infty]^d$, and $\vec L=(L_1,\dots,L_d)\in(0,\infty)^d$, a function $G:\mathbb R^d\to\mathbb R$ is said to belong to the anisotropic Nikolskii class $\mathcal N_{\vec r,d}(\vec\beta,\vec L)$ if $\|G\|_{r_j}\le L_j$ for all $j=1,\dots,d$, and there exist natural numbers $k_j>\beta_j$ such that
$$\big\|\Delta^{k_j}_{u,j}G\big\|_{r_j}\le L_j|u|^{\beta_j},\qquad\forall u\in\mathbb R,\ \forall j=1,\dots,d.$$

Important quantities related to Nikolskii classes, which determine the asymptotics of the minimax risk, are the following:
$$\frac1\beta:=\sum_{j=1}^d\frac{1}{\beta_j},\qquad \frac1\omega:=\sum_{j=1}^d\frac{1}{\beta_jr_j},\qquad L_\beta:=\prod_{j=1}^dL_j^{\beta/\beta_j},\qquad \tau(s):=1-\frac1\omega+\frac{1}{\beta s},\ \ s\in[1,\infty].$$
It is worth mentioning that $\tau(\cdot)$ appears in embedding theorems for Nikolskii spaces; see Section 5.1.3 for details.

In Part I we established the lower bound (1) on the minimax risk with the sequence $\phi_n$ defined by
$$\theta=\begin{cases}\dfrac{1}{2\tau(1)}, & \tau(p)\ge1,\\[2mm] \dfrac{1/p-1/q}{1-1/q-(1-1/p)\tau(q)}, & \tau(p)<1,\ \tau(q)<\tau(p),\\[2mm] \dfrac{\tau(p)}{2\tau(1)}, & \tau(p)<1,\ \tau(q)\ge\tau(p),\end{cases}\qquad \phi_n=L_\beta^{\frac{1}{p\tau(1)}}\,n^{-\theta_*},\quad \theta_*=\tfrac12\wedge\theta;\tag{1.1}$$
see Theorem 1 of Part I. Our goal is to develop a rate-optimal estimator whose risk converges to zero at the rate $\phi_n$.
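As a quick numerical illustration of the difference operators entering this definition, here is a small sketch (the helper name `diff_op` is ours, not the paper's). It implements $\Delta^k_{u,j}$ through the standard binomial expansion, so it vanishes on polynomials of degree smaller than $k$, which is exactly why the seminorm $\|\Delta^{k_j}_{u,j}G\|_{r_j}$ can decay like $|u|^{\beta_j}$:

```python
import math

def diff_op(G, x, u, j, k):
    # k-th order difference of G at point x, step u, in coordinate j (0-based):
    # Delta^k_{u,j} G(x) = sum_{l=0}^{k} (-1)^(k-l) C(k,l) G(x + l*u*e_j)
    total = 0.0
    for l in range(k + 1):
        y = list(x)
        y[j] += l * u
        total += (-1) ** (k - l) * math.comb(k, l) * G(tuple(y))
    return total

# second-order difference of t -> t^2 with step u = 1 equals 2;
# any second-order difference of an affine function is exactly 0
second = diff_op(lambda t: t[0] ** 2, (0.0,), 1.0, 0, 2)
affine = diff_op(lambda t: 3 * t[0] + 1, (5.0,), 0.5, 0, 2)
```

In the Nikolskii condition above, it is the $r_j$-norm of $u\mapsto\Delta^{k_j}_{u,j}G$, uniformly over the step $u$, that is controlled.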
2. Estimator construction
Assuming that $p$ is an integer, let us first discuss the problem of estimating the closely related functional $\|f\|_p^p$. Let $K:[-1,1]^d\to\mathbb R$ be a given function (kernel), and let $h=(h_1,\dots,h_d)\in(0,1]^d$ be a given vector (bandwidth). Let
$$K_h(x):=\frac{1}{V_h}K\big(x/h\big),\qquad V_h:=\prod_{k=1}^dh_k,$$
where here and in all that follows $y/x$ denotes coordinate-wise division for $x,y\in\mathbb R^d$. Define
$$S_h(x):=\int K_h(x-y)f(y)\,\mathrm dy,\qquad B_h(x):=S_h(x)-f(x).\tag{2.1}$$
Obviously, $B_h(x)$ is the bias of the kernel density estimator of $f(x)$ associated with kernel $K$ and bandwidth $h$. The construction of our estimator for $\|f\|_p^p$ is based on a simple observation formulated below as Lemma 1.

Lemma 1.
For any $p\in\mathbb N^*$, $p\ge2$, $f\in\mathcal F$ and $h\in(0,\infty)^d$ one has
$$\|f\|_p^p=(1-p)\int S_h^p(x)\,\mathrm dx+p\int S_h^{p-1}(x)f(x)\,\mathrm dx+\sum_{j=2}^p\binom pj(-1)^j\int[S_h(x)]^{p-j}B_h^j(x)\,\mathrm dx.\tag{2.2}$$

The proof is elementary and given in the Appendix. It does not require any assumption on the kernel $K$ except for the existence of the integrals on the right-hand side of (2.2).

Let $\hat T_{1,h}$ and $\hat T_{2,h}$ be estimators of
$$T^{(1)}_p(f):=\int S_h^p(x)\,\mathrm dx,\qquad T^{(2)}_p(f):=\int S_h^{p-1}(x)f(x)\,\mathrm dx,$$
respectively. Then we estimate $\|f\|_p^p$ by
$$\hat T_h=(1-p)\hat T_{1,h}+p\hat T_{2,h}.$$
Note that if $\hat T_{1,h}$ and $\hat T_{2,h}$ are unbiased estimators of $T^{(1)}_p(f)$ and $T^{(2)}_p(f)$ then, in view of Lemma 1, the bias of $\hat T_h$ in estimating $\|f\|_p^p$ is
$$\sum_{j=2}^p\binom pj(-1)^j\int[S_h(x)]^{p-j}B_h^j(x)\,\mathrm dx.$$
The last quantity can be efficiently bounded from above via norms of the bias $B_h(\cdot)$ and of the underlying density $f$.

The natural unbiased estimators of $T^{(1)}_p(f)$ and $T^{(2)}_p(f)$ are based on $U$-statistics:
$$\hat T_{1,h}:=\binom np^{-1}\sum_{\{i_1,\dots,i_p\}}U^{(1)}_h(X_{i_1},\dots,X_{i_p}),\qquad \hat T_{2,h}:=\binom np^{-1}\sum_{\{i_1,\dots,i_p\}}U^{(2)}_h(X_{i_1},\dots,X_{i_p}),$$
where the summations are taken over all combinations of $p$ distinct elements $\{i_1,\dots,i_p\}$ of $\{1,\dots,n\}$, and
$$U^{(1)}_h(x_1,\dots,x_p):=\int K_h(y-x_1)\cdots K_h(y-x_p)\,\mathrm dy,\qquad U^{(2)}_h(x_1,\dots,x_p):=\frac1p\sum_{i=1}^p\prod_{\substack{j=1\\ j\ne i}}^pK_h(x_j-x_i).$$
It is worth mentioning that not only is there an explicit formula for the bias of $\hat T_h$, but its variance also admits a rather simple analytical bound. The following result states an upper bound on the variance of $\hat T_h$; the proof is given in the Appendix.

Lemma 2.
Let $K$ be a symmetric bounded function supported on $[-1,1]^d$. Then for all $p\in\mathbb N^*$, $p\ge2$, $f\in\mathcal F$ and $h\in(0,\infty)^d$ one has
$$\mathrm{var}_f\big[\hat T_h\big]\le C\|K\|_\infty^{2p}\sum_{k=1}^p\|f\|_{2p-k}^{2p-k}\big(n^kV_h^{k-1}\big)^{-1},$$
where $C$ is a constant depending on $p$ and $d$ only.

If $\hat T_h$ is a "reasonable" estimator of $\|f\|_p^p$ then it seems natural to define
$$\hat N_h:=\big|\hat T_h\big|^{1/p}\tag{2.3}$$
as an estimator for $\|f\|_p$.
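To make the construction concrete, here is a minimal sketch of the resulting plug-in estimator in the simplest case $p=2$, $d=1$, with a box kernel (our choice for illustration only; the kernel actually used in Section 3 is different). For $p=2$ and a symmetric kernel, $U^{(1)}_h(x_1,x_2)=(K_h\star K_h)(x_1-x_2)$ and $U^{(2)}_h(x_1,x_2)=K_h(x_1-x_2)$, so $\hat T_h=-\hat T_{1,h}+2\hat T_{2,h}$ and $\hat N_h=|\hat T_h|^{1/2}$:

```python
import itertools, random

def box_Kh(t, h):
    # scaled box kernel K_h(t) = (1/h) * (1/2) * 1{|t/h| <= 1}
    return 0.5 / h if abs(t) <= h else 0.0

def box_Kh_conv(t, h):
    # (K_h * K_h)(t) for the box kernel: a triangle supported on [-2h, 2h]
    return (2.0 - abs(t) / h) / (4.0 * h) if abs(t) <= 2.0 * h else 0.0

def norm2_estimator(sample, h):
    """Sketch of \\hat N_h = |\\hat T_h|^{1/2} for p = 2, d = 1."""
    n = len(sample)
    t1 = t2 = 0.0
    for xi, xj in itertools.combinations(sample, 2):
        t1 += box_Kh_conv(xi - xj, h)   # U-statistic kernel for T^(1): int S_h^2
        t2 += box_Kh(xi - xj, h)        # U-statistic kernel for T^(2): int S_h f
    m = n * (n - 1) / 2.0
    t_hat = -t1 / m + 2.0 * t2 / m      # \hat T_h = (1-p)\hat T_{1,h} + p\hat T_{2,h}
    return abs(t_hat) ** 0.5

random.seed(0)
data = [random.uniform(0.0, 1.0) for _ in range(400)]
est = norm2_estimator(data, h=0.1)      # true ||f||_2 = 1 for Uniform(0,1)
```

The estimate is close to $1$ up to the $O(h)$ boundary bias of $S_h$ and the $U$-statistic fluctuation; the point of the sketch is only the algebraic structure of $\hat T_h$, not the rate-optimal tuning.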
3. Main result
In this section we demonstrate that the estimator $\hat N_h$ with properly chosen bandwidth $h$ is a rate-optimal estimator for $\|f\|_p$, provided that
$$\vec r\in[1,p]^d\cup[p,\infty]^d,\qquad p\in\mathbb N^*,\ p\ge2,$$
i.e., $r_j\le p$ for all $j=1,\dots,d$, or $r_j\ge p$ for all $j=1,\dots,d$. The proof is based on the derivation of tight uniform upper bounds on the bias and the variance of $\hat T_h$ over anisotropic Nikolskii classes. To get such bounds for the bias term we use a special construction of the kernel $K$ [see, e.g., Goldenshluger and Lepski (2014)].

For a given positive integer $\ell\in\mathbb N^*$ and a function $\mathcal K:\mathbb R\to\mathbb R$ supported on $[-1,1]$, define
$$\mathcal K_\ell(y)=\sum_{i=1}^{\ell}\binom{\ell}{i}(-1)^{i+1}\frac{1}{i}\,\mathcal K\Big(\frac{y}{i}\Big).$$

Assumption 1. $\mathcal K$ is symmetric, $\int_{\mathbb R}\mathcal K(y)\,\mathrm dy=1$, $\|\mathcal K\|_\infty<\infty$, and
$$K(x)=\prod_{j=1}^d\mathcal K_\ell(x_j),\qquad\forall x\in\mathbb R^d.$$

Let us introduce the following notation. For $j=1,\dots,d$ let
$$\varkappa_j:=\begin{cases}\beta_j\tau(p)/\tau(r_j), & r_j\le p,\ \tau(q)>0,\\ \beta_j, & \text{otherwise},\end{cases}\qquad p_j:=\begin{cases}\dfrac{2(1-1/p)}{1-1/r_j}, & r_j\ge p,\\[1mm] 2, & r_j\le p,\ \tau(q)>0,\\[1mm] \dfrac{2(1/p-1/q)}{1/r_j-1/q}, & r_j<p,\ \tau(q)\le0,\end{cases}\tag{3.1}$$
and let
$$\frac{1}{\upsilon}:=\sum_{j=1}^d\frac{1}{p_j\varkappa_j}.$$
Define $h=(h_1,\dots,h_d)$ by
$$h_j:=L_j^{-1/\varkappa_j}\big(\bar Ln^{-1}\big)^{\frac{2}{\varkappa_jp_j[1+2(1-1/p)/\upsilon]}},\tag{3.2}$$
where $\bar L$ is a constant that is completely determined by the class parameters $\vec\beta,\vec L,\vec r$ and $p$ (a cumbersome but explicit expression for $\bar L$ is given in Section 5.3.3). The main result of this paper is given in the next theorem.

Theorem.
Let $p\in\mathbb N^*$, $p\ge2$, $Q>0$, $\vec\beta\in(0,\infty)^d$, $\vec L\in(0,\infty)^d$, $\vec r\in[1,p]^d\cup[p,\infty]^d$ and $q\ge2p-1$ be fixed. Let Assumption 1 hold with $\ell>\max_{j=1,\dots,d}\beta_j$. Let $\hat N_h$ be the estimator (2.3) associated with the bandwidth $h$ defined in (3.2); then
$$\limsup_{n\to\infty}\phi_n^{-1}\mathcal R_n\big[\hat N_h,\ \mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)\big]<\infty,$$
where $\phi_n$ is given in (1.1).

Remark 1. (i)
Combining the result of the theorem with that of Theorem 1 in Part I, we conclude that the suggested estimator $\hat N_h$ is rate-optimal. Thus, the problem of constructing a rate-optimal estimator is solved completely if $r_j\le p$ for all $j=1,\dots,d$, or $r_j\ge p$ for all $j=1,\dots,d$. The general setting, in which the coordinates of $\vec r$ may be arbitrary with respect to $p$, remains an open problem. We conjecture that in the general setting a rate-optimal estimator does not belong to the family $\{\hat N_h,\ h\in(0,\infty)^d\}$, and a different estimation procedure has to be developed. Another intriguing question is whether the condition $q\ge2p-1$ is necessary in order to guarantee the obtained estimation accuracy.

(ii) It is surprising that the $L_p$-norm can be estimated with the parametric rate $n^{-1/2}$. To the best of our knowledge this phenomenon has not been observed in the literature. Note that the parametric regime is possible only if $\tau(q)>0$. Indeed,
$$\frac{1/p-1/q}{1-1/q-(1-1/p)\tau(q)}\ge\frac12\ \Longleftrightarrow\ \frac2p-1-\frac1q\ge-\Big(1-\frac1p\Big)\tau(q),$$
and the last inequality is impossible if $\tau(q)\le0$ because $p\ge2$.

(iii) A particularly simple description of the minimax rate of convergence $\phi_n$ is obtained in the specific case $p=2$ and $q=\infty$. Here $\theta_*=\big(2\max\{\tau(1),1\}\big)^{-1}$ if $\vec r\in[2,\infty]^d$, and
$$\theta_*=\begin{cases}\tfrac12, & \omega>1,\\[1mm] \dfrac{\omega}{1+\omega}, & \omega\le1,\end{cases}$$
if $\vec r\in[1,2)^d$. As we see, the regime corresponding to the exponent $\tau(p)/(2\tau(1))$ does not appear if $p=2$, $q=\infty$.
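The role of the construction $\mathcal K_\ell$ in Section 3 is to cancel moments: substituting $y=iu$ gives $\int y^m\mathcal K_\ell(y)\,\mathrm dy=\big(\int u^m\mathcal K(u)\,\mathrm du\big)\sum_{i=1}^\ell\binom\ell i(-1)^{i+1}i^m$, and the coefficient sum equals $1$ for $m=0$ and $0$ for $1\le m\le\ell-1$. This can be checked in a few lines (the helper names below are ours):

```python
import math

def kernel_coeffs(ell):
    # coefficients c_i in K_ell(y) = sum_i c_i * (1/i) * K(y/i), c_i = C(ell,i)(-1)^(i+1)
    return {i: math.comb(ell, i) * (-1) ** (i + 1) for i in range(1, ell + 1)}

def moment_factor(ell, m):
    # int y^m K_ell(y) dy = (int u^m K(u) du) * sum_i c_i * i^m  (substitute y = i*u)
    return sum(ci * i ** m for i, ci in kernel_coeffs(ell).items())
```

So $\int\mathcal K_\ell=1$ whenever $\int\mathcal K=1$, while the moments of order $1,\dots,\ell-1$ vanish regardless of the moments of $\mathcal K$; this is what yields the bias bounds of Lemma 3 for smoothness indices up to $\ell$.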
4. Discussion
In this section we compare and contrast our results with other results in the literature. Our discussion is restricted to the problem of estimating the $L_p$-norm with integer $p$ only. The fundamental differences between our results and other results are mainly due to the following two reasons.

• Statistical model. We argue that when estimating the $L_p$-norm in the density model, much better accuracy can be achieved than in the regression/Gaussian white noise model.

• Domain of observations.
We also explain why estimation accuracy is completely different for compactly supported densities $f$ and for densities $f$ supported on the entire space $\mathbb R^d$.

To the best of our knowledge, in the context of the Gaussian white noise model, estimation of the $L_p$-norm was studied only in the one-dimensional case, $d=1$. So we focus on the one-dimensional Nikolskii class, denoted $\mathcal N_{r_1,1}(\beta_1,L_1)$, $\beta_1>0$, $L_1>0$, and $r_1\in[1,\infty]$.

Let $W$ be the standard Wiener process, and assume that we observe the trajectory of the process
$$X_n(t)=\int_0^tf(u)\,\mathrm du+n^{-1/2}W(t),\qquad t\in[0,1].\tag{4.1}$$
The estimation of $\|f\|_p$, $1\le p<\infty$, in the Gaussian white noise model (4.1) was initiated in Lepski et al. (1999) under the assumption that $f$ belongs to Hölder's functional class, i.e., $f\in\mathcal N_{\infty,1}(\beta_1,L_1)$. The recent paper Han et al. (2019) considers the same problem for Nikolskii classes with $r_1\in[p,\infty)$. In the Gaussian white noise model (4.1), even in the case of integer $p$, there is a big difference between the minimax rates of convergence obtained for odd and even values of $p$. Such a phenomenon does not occur in the density model, because a density is a positive function. Thus, in the subsequent discussion of the results in the Gaussian white noise model we restrict ourselves to the case of even $p$.

According to the results obtained in Lepski et al. (1999) and Han et al. (2019), in the case of even $p$ the minimax rate of convergence (expressed in our notation) is given by
$$\varphi_n=n^{-\frac{\tau(p)}{2[1+\tau(1)]}}.$$
Note that in both papers $r_1\ge p$ and, therefore,
$$\varphi_n\gg n^{-\theta_*}=\phi_n,$$
where $\phi_n$ is defined in (1.1). Also, since $\tau(p)<1+\tau(1)$ for any $r_1\ge1$, we conclude that the parametric regime is impossible in the model (4.1). Before explaining why the results discussed above are so different, let us discuss one more result.

Lepski and Spokoiny (1999) studied a hypothesis testing problem in the model (4.1) when the set of alternatives consists of functions from $\mathcal N_{r_1,1}(\beta_1,L_1)$, $r_1\in[1,2)$, separated away from zero in the $L_2$-norm. It is well known that this problem is equivalent to the problem of estimating the $L_2$-norm, and the minimax rate of testing coincides with the rate of estimation. The minimax rate found in Lepski and Spokoiny (1999) is
$$\varphi_n=n^{-\frac{\tau(2)}{2[1+\tau(1)]}}$$
under the assumption $\tau(\infty)>0$. Noting that $\tau(\infty)>0$ corresponds to $\tau(q)>0$ with $q=\infty$, and comparing this result with the result obtained in the present paper for the case $p=2$, $r_1\in[1,2)$ and $\tau(\infty)>0$, we find
$$\varphi_n=n^{-\frac{\tau(2)}{2[1+\tau(1)]}}\gg n^{-1/2}=\phi_n.$$
The last inequality follows from the fact that $\tau(1)>\tau(2)$. Thus we again conclude that the parametric regime is impossible in the model (4.1). Another interesting feature should be mentioned: the approach used in Lepski and Spokoiny (1999) is based on a rather sophisticated pointwise bandwidth selection scheme, while in our estimation procedure the bandwidth is a fixed vector.

The explanation of why estimation accuracy in the density model is better than in the Gaussian white noise model (4.1) is rather simple. In fact, the maximal value of the risk in the density model is attained on densities with small $L_p$-norm. Analysis of our estimation strategy shows that in the density model the variance of the corresponding $U$-statistics is proportional to the $L_p$-norm of the density (see Lemma 2 and Proposition 2 for details). The smaller this norm, the smaller the stochastic error of the estimation procedure; the careful bandwidth selection employed in our procedure takes this fact into account and improves the estimation accuracy considerably. In contrast, in the Gaussian white noise model (4.1) the stochastic error of the estimation procedure is independent of the signal $f$ and of the value of its $L_p$-norm.

We start with the following simple observation. If a probability density $f$ is defined on a compact set $\mathcal I\subset\mathbb R^d$ then necessarily $\|f\|_p\ge|\mathcal I|^{-(1-1/p)}>0$ for $p\in(1,\infty]$. This fact reduces estimation of $\|f\|_p$ for compactly supported densities to the problem of estimating $\|f\|_p^p$, which is a much smoother functional. Indeed, let $\widetilde N$ be an estimator of $\|f\|_p^p$. Then, for any density $f$ and any $p\in\mathbb N^*$, $p\ge2$,
$$\Big||\widetilde N|^{1/p}-\|f\|_p\Big|\le\|f\|_p^{-(p-1)}\Big||\widetilde N|-\|f\|_p^p\Big|\le|\mathcal I|^{\frac{(p-1)^2}{p}}\Big|\widetilde N-\|f\|_p^p\Big|.$$
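The first inequality in the display above rests on the elementary bound $|a-b|\le b^{-(p-1)}|a^p-b^p|$ for $a,b>0$, which follows from $a^p-b^p=(a-b)\sum_{i=0}^{p-1}a^ib^{p-1-i}$ and the fact that the sum is at least $b^{p-1}$. A quick numerical check (the helper names are ours, for illustration only):

```python
def norm_gap(T, F, p):
    # |T^(1/p) - F| for T >= 0, F > 0
    return abs(T ** (1.0 / p) - F)

def reduction_bound(T, F, p):
    # F^{-(p-1)} * |T - F^p|: the smoothing bound used to pass
    # from estimating ||f||_p^p to estimating ||f||_p
    return abs(T - F ** p) / F ** (p - 1)

grid_ok = all(
    norm_gap(T, F, p) <= reduction_bound(T, F, p) + 1e-12
    for T in (0.01, 0.3, 1.0, 2.5, 10.0)
    for F in (0.2, 0.7, 1.0, 1.8)
    for p in (2, 3, 4, 5)
)
```

Note how the bound degrades as $F=\|f\|_p\to0$: this is exactly why the lower bound $\|f\|_p\ge|\mathcal I|^{-(1-1/p)}$ for compactly supported densities matters.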
The problem of estimating $\|f\|_p^p$ has been considered by many authors, starting from the seminal paper Bickel and Ritov (1988); see, for instance, Birgé and Massart (1995), Kerkyacharian and Picard (1996), Laurent (1997), Cai and Low (2005), Tchetgen et al. (2008), among others. It is worth noting that the majority of these papers deals with compactly supported densities belonging to a semi-isotropic functional class, that is, $r_l=r_1$ for all $l=1,\dots,d$. Below we discuss these results and present them in a unified way.

Let $p\in\mathbb N^*$, $p\ge2$, and let $\mathcal N_{r_1,d}(\vec\beta,\vec L,\mathcal I)$ denote a semi-isotropic Nikolskii class of densities supported on $\mathcal I$. Assume that $r_1\ge p$. Then the minimax rate of convergence in estimating $\|f\|_p^p$ on this functional class, and, therefore, of $\|f\|_p$ as well, is given by
$$\varphi_n=n^{-\big(\frac{4\beta}{4\beta+1}\wedge\frac12\big)}.$$
Hence, taking into account that $\tau(1)=1-\frac{1}{r_1\beta}+\frac1\beta\ge1+\frac{1}{2\beta}$ for any $r_1\ge p\ge2$, we obtain
$$\varphi_n=n^{-\big(\frac{4\beta}{4\beta+1}\wedge\frac12\big)}\ll n^{-\big(\frac{\beta}{2\beta+1}\wedge\frac12\big)}\le\phi_n.$$
In particular, the parametric regime in estimating compactly supported densities is possible if and only if $\beta\ge\frac14$, whereas in our setting it requires $\tau(1)\le1$.
5. Proofs
In this section we collect known facts from functional analysis as well as some recent resultsrelated to anisotropic Nikolskii’s spaces. These results will be used in the subsequent proofs.
For a locally integrable function $f$ on $\mathbb R^d$ the strong Hardy–Littlewood maximal operator of $f$ is defined by
$$M[f](x)=\sup\Big\{\frac{1}{|I|}\int_If(y)\,\mathrm dy:\ x\in I\Big\},$$
where the supremum is taken over all rectangles $I$ with edges parallel to the coordinate axes and containing the point $x$; here $|\cdot|$ stands for the Lebesgue measure. In view of the Lebesgue differentiation theorem,
$$f(x)\le M[f](x)\quad\text{a.e.}\tag{5.1}$$
Moreover, if $f\in L_p(\mathbb R^d)$ then
$$\big\|M[f]\big\|_p\le c\|f\|_p,\qquad 1<p\le\infty,\tag{5.2}$$
with a constant $c$ depending on $p$ and $d$ only; see, e.g., Guzman (1975).

For ease of reference we recall some well-known inequalities that are routinely used in the sequel. These results can be found, e.g., in Folland (1999).

• Interpolation inequality.
Let $1\le s_0<s<s_1\le\infty$. If $f\in L_{s_0}(\mathbb R^d)\cap L_{s_1}(\mathbb R^d)$ then $f\in L_s(\mathbb R^d)$ and
$$\|f\|_s\le\big(\|f\|_{s_0}\big)^{\frac{s_0(s_1-s)}{s(s_1-s_0)}}\big(\|f\|_{s_1}\big)^{\frac{s_1(s-s_0)}{s(s_1-s_0)}}.$$

• Young's inequality (general form).
Let $1\le p,q,r\le\infty$ and $1+1/r=1/p+1/q$. If $f\in L_p(\mathbb R^d)$ and $g\in L_q(\mathbb R^d)$ then $f*g\in L_r(\mathbb R^d)$ and $\|f*g\|_r\le\|f\|_p\|g\|_q$.

Let $s>1$, and let the functional class $\mathcal N_{\vec r,d}(\vec\beta,\vec L)$ be fixed. Define for any $j=1,\dots,d$
$$\gamma_j(s):=\begin{cases}\beta_j\tau(s)/\tau(r_j), & r_j<s,\\ \beta_j, & r_j\ge s,\end{cases}$$
and let
$$s^*:=s\vee\Big[\max_{j=1,\dots,d}r_j\Big],\qquad \vec s:=\big(r_1\vee s,\dots,r_d\vee s\big).$$
Recall that the bias $B_h(\cdot)=B_h(\cdot,f)$ of a kernel density estimator associated with kernel $K$ is defined in (2.1). The following result has been proved in Goldenshluger and Lepski (2014).

Lemma 3.
Let Assumption 1 hold with $\ell>\max_{j=1,\dots,d}\beta_j$. Then $B_h(x)$ admits for any $x\in\mathbb R^d$ the representation $B_h(x)=\sum_{j=1}^dB^{(j)}_h(x)$ with functions $B^{(j)}_h$ satisfying the following inequalities. There exist $C_1>0$ and $C_2>0$ independent of $\vec L$ such that for any $f\in\mathcal N_{\vec r,d}(\vec\beta,\vec L)$ and $h\in(0,\infty)^d$
$$\big\|B^{(j)}_h\big\|_{r_j}\le C_1L_jh_j^{\beta_j},\qquad\forall j=1,\dots,d.\tag{5.3}$$
Moreover, for any $s>1$ satisfying $\tau(s^*)>0$ one has
$$\big\|B^{(j)}_h\big\|_{s_j}\le C_2L_jh_j^{\gamma_j(s)},\qquad\forall j=1,\dots,d.\tag{5.4}$$
Finally, for any $r\in[1,\infty]$ and $R>0$ there exists $C_3>0$ such that for any $f\in\mathcal B_r(R)$ and any $h\in(0,\infty)^d$
$$\big\|B^{(j)}_h\big\|_r\le C_3,\qquad\forall j=1,\dots,d.\tag{5.5}$$

The next lemma presents an inequality between different norms of a function belonging to the anisotropic Nikolskii class. The proof of this lemma is given in the Appendix.

Lemma 4.
Let $2\le p<\infty$ be fixed. For any $s\in(1,\infty]$ satisfying $s\ge p\vee[\max_{j=1,\dots,d}r_j]$ and $\tau(s)>0$ there exists a constant $C>0$ independent of $\vec L$ such that for any $f\in\mathcal N_{\vec r,d}(\vec\beta,\vec L)$
$$\|f\|_s\le C\big(L_{\gamma(s)}\big)^{(1/p-1/s)\frac{\tau(s)}{\tau(p)}}\big(\|f\|_p\big)^{\frac{\tau(s)}{\tau(p)}},\qquad L_{\gamma(s)}:=\prod_{j=1}^dL_j^{\gamma(s)/\gamma_j(s)}.$$

We will analyze the risk of the proposed estimator $\hat N_h$ using two different upper bounds that relate the risk of $\hat N_h$ to the risk of $\hat T_h$ via elementary inequalities:
$$\mathbb E_f\big|\hat N_h-\|f\|_p\big|=\mathbb E_f\Big||\hat T_h|^{1/p}-\|f\|_p\Big|\le\mathbb E_f\big|\hat T_h-\|f\|_p^p\big|^{1/p}\le\Big[\mathbb E_f\big(\hat T_h-\|f\|_p^p\big)^2\Big]^{\frac{1}{2p}};$$
$$\mathbb E_f\big|\hat N_h-\|f\|_p\big|=\mathbb E_f\Bigg|\frac{\hat N_h^p-\|f\|_p^p}{\sum_{i=0}^{p-1}\hat N_h^i\|f\|_p^{p-1-i}}\Bigg|\le\|f\|_p^{-(p-1)}\Big[\mathbb E_f\big(\hat T_h-\|f\|_p^p\big)^2\Big]^{\frac12}.$$
Thus, for any underlying density $f$ and any $p\in\mathbb N^*$, $p\ge2$,
$$\mathbb E_f\big|\hat N_h-\|f\|_p\big|\le\min\bigg\{\Big[\mathbb E_f\big(\hat T_h-\|f\|_p^p\big)^2\Big]^{\frac{1}{2p}},\ \|f\|_p^{-(p-1)}\Big[\mathbb E_f\big(\hat T_h-\|f\|_p^p\big)^2\Big]^{\frac12}\bigg\}.\tag{5.6}$$

The proof is divided into three steps. First we establish upper bounds (uniform over $\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)$) on the bias and the variance of the estimator $\hat T_h$ for any value of $h$. Then, using the derived upper bounds on the risk of $\hat T_h$ and the risk-reduction argument given in Section 5.2, we complete the proof. The next proposition states the upper bound on the bias of the estimator $\hat T_h$.

Proposition 1.
Let $\vec r\in[1,p]^d\cup[p,\infty]^d$. Then there exists $C>0$ independent of $\vec L$ such that for any $f\in\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)$ and any $h\in(0,\infty)^d$
$$\big|\mathbb E_f\hat T_h-\|f\|_p^p\big|\le C\|f\|_p^{p-2}\sum_{j=1}^dL_j^{p_j}h_j^{p_j\varkappa_j},$$
where $p_j$ and $\varkappa_j$ are defined in (3.1).

Proof
Our first objective is to prove that
$$\big|\mathbb E_f\hat T_h-\|f\|_p^p\big|\le c\|f\|_p^{p-2}\|B_h\|_p^2,\qquad\forall h\in(0,\infty)^d.\tag{5.7}$$
If $p=2$ then, by Lemma 1, (5.7) holds as an equality with $c=1$. If $p\ge3$ then, again by Lemma 1,
$$\big|\mathbb E_f\hat T_h-\|f\|_p^p\big|\le c\sum_{k=2}^p\int|B_h(x)|^k|S_h(x)|^{p-k}\,\mathrm dx.\tag{5.8}$$
Note that for any $h\in(0,\infty)^d$, in view of (5.1),
$$|B_h(x)|\le|S_h(x)|+f(x)\le c_1\|K\|_\infty M[f](x)+f(x)\le cM[f](x)\quad\text{a.e.}$$
Together with (5.8) this yields
$$\big|\mathbb E_f\hat T_h-\|f\|_p^p\big|\le c\int|B_h(x)|^2\big(M[f](x)\big)^{p-2}\,\mathrm dx.$$
Hence, applying Hölder's inequality with exponents $p/2$ and $p/(p-2)$, and (5.2), we obtain
$$\big|\mathbb E_f\hat T_h-\|f\|_p^p\big|\le c\big\|M[f]\big\|_p^{p-2}\|B_h\|_p^2\le c\|f\|_p^{p-2}\|B_h\|_p^2.$$
This completes the proof of (5.7). From (5.7) and Lemma 3 we deduce
$$\big|\mathbb E_f\hat T_h-\|f\|_p^p\big|\le c\|f\|_p^{p-2}\sum_{j=1}^d\big\|B^{(j)}_h\big\|_p^2,\qquad\forall h\in(0,\infty)^d.\tag{5.9}$$
Consider now separately three cases. Let $\vec r\in[p,\infty]^d$. Then, for any $j=1,\dots,d$, applying the interpolation inequality with $s_0=1$, $s=p$ and $s_1=r_j$, and (5.3) and (5.5) of Lemma 3, we obtain
$$\big\|B^{(j)}_h\big\|_p^2\le c\big\|B^{(j)}_h\big\|_{r_j}^{\frac{2(1-1/p)}{1-1/r_j}}\le c\big(L_jh_j^{\beta_j}\big)^{p_j}=cL_j^{p_j}h_j^{\varkappa_jp_j}\tag{5.10}$$
for any $f\in\mathcal N_{\vec r,d}(\vec\beta,\vec L)$.

Let now $\vec r\in[1,p]^d$ and $\tau(q)>0$, and recall that in this case $\tau(p)>0$ because $q>p$. Applying (5.4) of Lemma 3 with $s=p$ and $s_j=p$, $j=1,\dots,d$, we obtain
$$\big\|B^{(j)}_h\big\|_p^2\le c\big(L_jh_j^{\gamma_j(p)}\big)^2=cL_j^{p_j}h_j^{\varkappa_jp_j}\tag{5.11}$$
for any $f\in\mathcal N_{\vec r,d}(\vec\beta,\vec L)$.

It remains to consider the case $\vec r\in[1,p)^d$ and $\tau(q)\le0$. Applying the interpolation inequality with $s_0=r_j$, $s=p$ and $s_1=q$ we get for any $j=1,\dots,d$ and any $f\in\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)$
$$\big\|B^{(j)}_h\big\|_p\le\Big(\big\|B^{(j)}_h\big\|_{r_j}\Big)^{\frac{1/p-1/q}{1/r_j-1/q}}\Big(\big\|B^{(j)}_h\big\|_q\Big)^{\frac{1/r_j-1/p}{1/r_j-1/q}}\le c\big\|B^{(j)}_h\big\|_{r_j}^{p_j/2}.$$
To get the last inequality we have used (5.5) of Lemma 3 with $r=q$ and $R=Q$. Together with (5.3) of Lemma 3 this yields
$$\big\|B^{(j)}_h\big\|_p^2\le cL_j^{p_j}h_j^{\varkappa_jp_j}.\tag{5.12}$$
The required result follows from (5.9), (5.10), (5.11) and (5.12).

Now we derive the upper bound on the variance of $\hat T_h$. Recall that $q\ge2p-1$, and define
$$\mathrm L:=\begin{cases}\max_{k=1,\dots,p}\big(L_{\gamma(2p-k)}\big)^{(1-k/p)\frac{\tau(2p-k)}{\tau(p)}}, & \tau(q)>0,\\ 1, & \tau(q)\le0.\end{cases}$$

Proposition 2.
Let $\vec\beta,\vec L,\vec r$, $q\ge2p-1$ and $Q>0$ be fixed. Then there exists $C>0$ independent of $\vec L$ such that for any $f\in\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)$
$$\mathrm{var}_f\big[\hat T_h\big]\le C\begin{cases}\mathrm L\sum_{k=1}^p\big(\|f\|_p\big)^{(2p-k)\frac{\tau(2p-k)}{\tau(p)}}\big(n^kV_h^{k-1}\big)^{-1}, & \tau(q)>0,\\[1mm] \sum_{k=1}^p\big(\|f\|_p\big)^{\frac{(q-2p+k)p}{q-p}}\big(n^kV_h^{k-1}\big)^{-1}, & \tau(q)\le0.\end{cases}$$

Proof
In view of Lemma 2,
$$\mathrm{var}_f\big[\hat T_h\big]\le c\sum_{k=1}^p\|f\|_{2p-k}^{2p-k}\big(n^kV_h^{k-1}\big)^{-1}.$$
Assume first that $\tau(q)>0$; this also implies that $\tau(r_j)>0$ because $r_j\le q$. Applying Lemma 4 with $s=2p-k$ we get for any $k\in\{1,\dots,p\}$
$$\big\|f\big\|_{2p-k}^{2p-k}\le\big(L_{\gamma(2p-k)}\big)^{(1-k/p)\frac{\tau(2p-k)}{\tau(p)}}\big(\|f\|_p\big)^{(2p-k)\frac{\tau(2p-k)}{\tau(p)}}\le\mathrm L\big(\|f\|_p\big)^{(2p-k)\frac{\tau(2p-k)}{\tau(p)}}.$$
Now assume that $\tau(q)\le0$. Applying the interpolation inequality with $s_0=p$, $s=2p-k$ and $s_1=q$ we get for any $k\in\{1,\dots,p\}$
$$\big\|f\big\|_{2p-k}^{2p-k}\le\big(\|f\|_p\big)^{\frac{(q-2p+k)p}{q-p}}\big(\|f\|_q\big)^{\frac{(p-k)q}{q-p}}\le c\big(\|f\|_p\big)^{\frac{(q-2p+k)p}{q-p}}.$$
This completes the proof.

Now we are in a position to complete the proof of the theorem.

1°. Define
$$N:=\big(\bar Ln^{-1}\big)^{\frac{1}{1+2(1-1/p)/\upsilon}},\qquad \bar L:=\mathrm L^{1/p}L_\kappa^{1-1/p},\qquad L_\kappa:=\prod_{j=1}^dL_j^{1/\varkappa_j},$$
and recall that $h=(h_1,\dots,h_d)$ is defined as follows:
$$h_j:=L_j^{-1/\varkappa_j}\big(\bar Ln^{-1}\big)^{\frac{2}{\varkappa_jp_j[1+2(1-1/p)/\upsilon]}}=L_j^{-1/\varkappa_j}N^{\frac{2}{\varkappa_jp_j}}.$$

2°. Several remarks are in order. Direct computations show that
$$N^{p-2}\sum_{j=1}^dL_j^{p_j}h_j^{p_j\varkappa_j}=dN^p;\tag{5.13}$$
$$\mathrm LN^p\big[n^pV_h^{p-1}\big]^{-1}=N^{2p}.\tag{5.14}$$
The following useful equality is deduced from (5.14):
$$\big[nV_h\big]^{-1}=\big[\bar L^{-1}nN^p\big]^{\frac{1}{p-1}}=\big[\mathrm L^{-1}L_\kappa\big]^{1/p}N^{1-\frac{2}{p\upsilon}}.\tag{5.15}$$
Let $R_n(f)$ denote the quadratic risk of the estimator $\hat T_h$, i.e. $R_n(f):=\mathbb E_f\big(\hat T_h-\|f\|_p^p\big)^2$, and let
$$\mathbb F:=\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)\cap\mathcal B_p(N),\qquad \bar{\mathbb F}:=\big[\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)\big]\setminus\mathbb F.$$
We obviously have
$$\mathcal R_n:=\mathcal R_n\big[\hat N_h,\mathcal N_{\vec r,d}(\vec\beta,\vec L)\cap\mathcal B_q(Q)\big]=\max\Big(\sup_{f\in\mathbb F}\mathcal R_n\big[\hat N_h,f\big],\ \sup_{f\in\bar{\mathbb F}}\mathcal R_n\big[\hat N_h,f\big]\Big),$$
and we deduce from (5.6) that
$$\mathcal R_n\le\max\Big(\sup_{f\in\mathbb F}\big[R_n(f)\big]^{\frac{1}{2p}},\ \sup_{f\in\bar{\mathbb F}}\|f\|_p^{-(p-1)}\big[R_n(f)\big]^{\frac12}\Big).$$
In view of Propositions 1 and 2, and by (5.13) and (5.15), we have for any $f\in\mathbb F$
$$\big|\mathbb E_f\hat T_h-\|f\|_p^p\big|\le cN^p;\tag{5.16}$$
$$\mathrm{var}_f\big[\hat T_h\big]\le C\big(\vec L\big)\,n^{-1}\begin{cases}\sum_{k=1}^pN^{(2p-k)\frac{\tau(2p-k)}{\tau(p)}+(1-\frac{2}{p\upsilon})(k-1)}, & \tau(q)>0,\\[1mm] \sum_{k=1}^pN^{\frac{(q-2p+k)p}{q-p}+(1-\frac{2}{p\upsilon})(k-1)}, & \tau(q)\le0.\end{cases}$$
Setting θ = 1 − pυ + b , a = (2 p − τ (2 p − τ ( p ) , τ ( q ) > ( q − p +1) pq − p , τ ( q ) ≤ , b = ( − τ ( ∞ ) τ ( p ) , τ ( q ) > pq − p , τ ( q ) ≤ , . Goldenshluger and O. V. Lepski/Estimation of norms of a probability density we obtain for any f ∈ F var f (cid:2) ˆ T h (cid:3) ≤ C (cid:0) ~L (cid:1) N a n p − X m =0 N (1 − pυ + b ) m = C (cid:0) ~L (cid:1) N a +[ θ ( p − − n = C (cid:0) ~L (cid:1) N z , where here and later y − := min( y, z := a + 1 + p − pυ + [ θ ( p − − . It yields togetherwith (5.16) sup f ∈ F (cid:2) R n ( f ) (cid:3) /p ≤ C (cid:0) ~L (cid:1)h N + N z /p i . (5.17)3 . Let us compute the quantity 1 /υ . If ~r ∈ [ p, ∞ ] d we have1 υ = p p − d X j =1 − /r j β j = p (1 /β − /ω )2( p − . (5.18)If ~r ∈ [1 , p ] d and τ ( q ) > υ = 12 τ ( p ) d X j =1 τ ( r j ) β j = 12 τ ( p ) (cid:18) τ ( ∞ ) β + 1 ωβ (cid:19) = 12 βτ ( p ) . (5.19)Finally, if ~r ∈ [1 , p ] d and τ ( q ) ≤ υ = pq q − p ) d X j =1 /r j − /qβ j = pq q − p ) (cid:18) ω − βq (cid:19) . (5.20)It yields, in particular,( p − θ = p − pβτ ( p ) − β + ω , ~r ∈ [ p, ∞ ] d ;0 , ~r ∈ [1 , p ] d , τ ( q ) > ( p − qτ ( q ) q − p , ~r ∈ [1 , p ] d , τ ( q ) ≤ . Noting that ~r ∈ [ p, ∞ ] d implies τ ( p ) ≥ p − pβτ ( p ) − β + 1 ω ≤ p − pβ − β + 1 ω = 1 ω − pβ = 1 − τ ( p ) ≤ . Hence in all cases ( p − θ ≤ z = a + 1 + 2( p − pυ + θ ( p −
1) = a + b ( p −
1) + p = p + (2 p − τ ( ∞ ) + 1 /βτ ( p ) − ( p − τ ( ∞ ) τ ( p ) = 2 p. Hence we deduce from (5.17) thatsup f ∈ F (cid:2) R n ( f ) (cid:3) /p ≤ C (cid:0) ~L (cid:1) N . (5.21) . Goldenshluger and O. V. Lepski/Estimation of norms of a probability density . Note that for any f ∈ ¯ F one has in view of Proposition 1 and (5.13) k f k − pp (cid:0) E f (cid:12)(cid:12) ˆ T h − k f k pp (cid:12)(cid:12) ) (cid:1) ≤ c k f k − p (cid:18) d X j =1 L p j j h p j κ j j (cid:19) ≤ c N . (5.22)Here we have used that N is small then n is large.(a). If τ ( q ) > k ∈ { , . . . , p } (2 p − k ) τ (2 p − k ) − (2 p − τ ( p ) = − [( k − τ ( ∞ ) + ( p − / ( pβ )]= − ( k − (cid:2) τ ( ∞ ) + 1 / ( pβ ) (cid:3) − ( p − k ) / ( pβ ) . = − ( k − τ ( p ) − ( p − k ) / ( pβ ) . Since τ ( p ) >
0, for any k ∈ { , . . . , p } (2 p − k ) τ (2 p − k ) τ ( p ) − (2 p −
$2) = -(k-1)-\frac{p-k}{p\beta\tau(p)}\le 0$. Putting $\alpha=1-\frac{p-1}{\tau(p)p\beta}$ we deduce from Proposition 2 for any $f\in\bar{\mathcal F}$

\begin{align*}
\operatorname{var}_f\big[\hat T_h\big] &\le C_1L\,n^{-1}\bigg[N^{\alpha-1}+\sum_{k=2}^{p}N^{-(k-1)-\frac{p-k}{p\beta\tau(p)}}\big(nV_h\big)^{1-k}\bigg]\\
&\le C_2(\vec L)\,n^{-1}\bigg[N^{\alpha-1}+\sum_{k=2}^{p}N^{-(k-1)-\frac{p-k}{p\beta\tau(p)}+(1-p\upsilon)(k-1)}\bigg]\\
&\le C_2(\vec L)\,n^{-1}\big[N^{\alpha-1}+N^{\gamma+(p-2)\theta}\big],
\end{align*}

where we have put $\gamma=1-p\upsilon-\frac{p-2}{p\beta\tau(p)}$ and used that $\theta\le 0$. The last bound can be rewritten as

\[
\operatorname{var}_f\big[\hat T_h\big]\le C(\vec L)\big[n^{-1}N^{\alpha-1}+N^{z}\big],\qquad (5.23)
\]

where $z=\gamma+1+2(1-1/p)/\upsilon+(p-2)\theta$. A direct computation based on the identity $\tau(\infty)+\frac{1}{p\beta}=\tau(p)$ shows that $z=2$, and (5.23) becomes

\[
\operatorname{var}_f\big[\hat T_h\big]\le C(\vec L)\big[n^{-1}N^{\alpha-1}+N^{2}\big],\qquad \forall f\in\bar{\mathcal F}.\qquad (5.24)
\]

Thus, if $\tau(q)>0$,

\[
R_n^2\le C(\vec L)\big[n^{-1}N^{\alpha-1}+N^{2}\big].\qquad (5.25)
\]

Additionally we remark that if $\alpha<0$ then $n^{-1}N^{\alpha-1}=C(\vec L)\,N^{\alpha+1+2(1-1/p)/\upsilon}$. If $\vec r\in[p,\infty]^d$ then $\tau(p)\ge 1$, and we get from (5.18)

\[
\alpha+1+2(1-1/p)/\upsilon=2-\frac{p-1}{\tau(p)p\beta}+\frac1\beta-\frac1\omega\ \ge\ 2+\frac{1}{p\beta}-\frac1\omega\ \ge\ 2.
\]

If $\vec r\in[1,p]^d$ then in view of (5.19)

\[
\alpha+1+2(1-1/p)/\upsilon=2-\frac{p-1}{\tau(p)p\beta}+\frac{p-1}{p\beta\tau(p)}=2.
\]

Thus, (5.25) is equivalent to

\[
R_n^2\le C(\vec L)\begin{cases} n^{-1}\vee N^{2}, & \alpha\ge 0,\\ N^{2}, & \alpha<0.\end{cases}\qquad (5.26)
\]

The definition of $N$ implies that

\[
N^{2}\le n^{-1}\ \Longleftrightarrow\ (1-1/p)/\upsilon\le 0\ \Longleftrightarrow\ 0\ge\begin{cases} 1/\beta-1/\omega, & \vec r\in[p,\infty]^d;\\[2pt] \dfrac{p-1}{p\beta\tau(p)}, & \vec r\in[1,p]^d.\end{cases}
\]

In the case $\vec r\in[1,p]^d$ it yields immediately that $N^{2}\le n^{-1}\Rightarrow\alpha\ge 0$. If $\vec r\in[p,\infty]^d$ then

\[
\alpha\ge 0\ \Longleftrightarrow\ 1-\frac{p-1}{p\beta\tau(p)}=\Big[1-\frac1\beta+\frac1\omega\Big]+\Big[\frac{p-1}{p\beta}-\frac{p-1}{p\beta\tau(p)}\Big]+\Big[\frac{1}{p\beta}-\frac1\omega\Big]\ge 0.
\]

Since in the considered case $\tau(p)\ge 1\Leftrightarrow 1/(p\beta)\ge 1/\omega$, we assert that $1\ge 1/\beta-1/\omega\Rightarrow\alpha\ge 0$. Thus, we deduce from (5.26)

\[
R_n^2\le C(\vec L)\max\Big[n^{-1},\ n^{-\frac{1}{1+(1-1/p)/\upsilon}}\Big],
\]

and the assertion of the theorem in the case $\tau(q)>0$ follows.

It remains to consider the case $\tau(q)\le 0$. For any $k\in\{1,\dots,p\}$ one has

\[
\frac{(q-p+k)pq}{q-p}-p-2(k-1)\le 0.
\]

Hence,

\begin{align*}
\sup_{f\in\bar{\mathcal F}}\operatorname{var}_f\big[\hat T_h\big]&\le C_1L\,N^{-2p}\sum_{k=1}^{p}N^{\frac{(k-p)pq}{q-p}}\big(n^{k}V_h^{k-1}\big)^{-1}\le C_1L\,N^{\frac{q-p-pq}{q-p}}\,n^{-1}\sum_{k=1}^{p}\Big(N^{\frac{pq}{q-p}}\big(nV_h\big)^{-1}\Big)^{k-1}\\
&=C(\vec L)\,N^{-\frac{q(p-1)\tau(q)}{q-p}}\sum_{k=1}^{p}N^{\frac{q\tau(q)(k-1)}{q-p}}=C(\vec L)\,N^{2}.
\end{align*}

To get the penultimate equality we have used (5.20). Together with (5.22) it yields

\[
\sup_{f\in\bar{\mathcal F}}\|f\|_p^{2-2p}\,R_n^2(f)\le C(\vec L)\,N^{2},
\]

which together with (5.21) leads to

\[
R_n^2\le C(\vec L)\,N^{2}=C(\vec L)\,n^{-\frac{1}{1+(1-1/p)/\upsilon}}=C(\vec L)\,n^{-\frac{2(1/p-1/q)}{1-1/q-(1-1/p)\tau(q)}}.
\]

This completes the proof of the theorem.
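The variance calculations in this proof, and again in the proof of Lemma 2 in the Appendix, rest on the classical Hoeffding formula for the variance of a $U$-statistic of order $p$ [Serfling (1980)]. As a purely illustrative sanity check, not the estimator of the paper, the sketch below evaluates that formula for an order-2 $U$-statistic with the hypothetical kernel $h(x,y)=xy$ and Uniform$(0,1)$ observations, for which the projection variances $\zeta_1=1/48$ and $\zeta_2=7/144$ are available in closed form, and compares it with a Monte Carlo estimate.

```python
import random
from math import comb

def u_stat(sample):
    # Order-2 U-statistic with the (hypothetical) symmetric kernel h(x, y) = x * y.
    n = len(sample)
    s = sum(sample[i] * sample[j] for i in range(n) for j in range(i + 1, n))
    return s / comb(n, 2)

n = 8
# Hoeffding projections for h(x, y) = x*y with X ~ Uniform(0, 1):
#   g1(x) = E[h(x, X2)] = x/2,  so zeta_1 = var(X/2)   = 1/48
#   g2(x, y) = x*y,             so zeta_2 = var(X1*X2) = 1/9 - 1/16 = 7/144
zeta = {1: 1.0 / 48, 2: 7.0 / 144}
# Hoeffding's formula: var(U) = C(n,p)^{-1} * sum_k C(p,k) C(n-p,p-k) zeta_k, here p = 2.
var_formula = sum(comb(2, k) * comb(n - 2, 2 - k) * zeta[k] for k in (1, 2)) / comb(n, 2)

random.seed(1)
reps = 100_000
vals = [u_stat([random.random() for _ in range(n)]) for _ in range(reps)]
mean = sum(vals) / reps
var_mc = sum((v - mean) ** 2 for v in vals) / reps
assert abs(var_mc - var_formula) / var_formula < 0.05  # Monte Carlo agreement within 5%
```

For $p=2$ the formula reduces to $\operatorname{var}(U)=\binom{n}{2}^{-1}\big[2(n-2)\zeta_1+\zeta_2\big]$, which is exactly the quantity `var_formula` computed above.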
6. Appendix

Proof of Lemma 1
Using the identity $a^p=\sum_{j=0}^{p}\binom{p}{j}b^{p-j}(a-b)^{j}$ with $a=f(x)$ and $b=S_h(x)$, we obtain for all $h\in(0,\infty)^d$ and $x\in\mathbb R^d$

\begin{align*}
f^p(x)&=\sum_{j=0}^{p}\binom{p}{j}[S_h(x)]^{p-j}\big[f(x)-S_h(x)\big]^{j}\\
&=S_h^p(x)(1-p)+p\,[S_h(x)]^{p-1}f(x)+\sum_{j=2}^{p}\binom{p}{j}(-1)^{j}[S_h(x)]^{p-j}B_h^{j}(x).
\end{align*}

Integrating the last equality, we come to the statement of the lemma.
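For integer $p$ the identity above is nothing but the binomial theorem applied to $f=S_h+(f-S_h)$. The following quick numerical check of the rearranged form is illustrative only: the scalars `f_x` and `S_x` stand in for the values $f(x)$ and $S_h(x)$, and `B = S_x - f_x` plays the role of the bias term $B_h(x)=S_h(x)-f(x)$.

```python
import random
from math import comb

def lemma1_rhs(f_x, S_x, p):
    # S^p (1-p) + p S^(p-1) f + sum_{j=2}^p C(p,j) (-1)^j S^(p-j) B^j,  with B = S - f.
    B = S_x - f_x
    tail = sum(comb(p, j) * (-1) ** j * S_x ** (p - j) * B ** j for j in range(2, p + 1))
    return S_x ** p * (1 - p) + p * S_x ** (p - 1) * f_x + tail

random.seed(0)
for _ in range(1000):
    f_x = random.uniform(0.0, 2.0)   # stands in for f(x) >= 0
    S_x = random.uniform(0.0, 2.0)   # stands in for the smoothed version S_h(x)
    p = random.choice([2, 3, 4, 7])
    assert abs(lemma1_rhs(f_x, S_x, p) - f_x ** p) < 1e-8
```

Since the identity holds for every pair of scalars, it holds pointwise in $x$, which is what allows it to be integrated to yield the lemma.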
Proof of Lemma 2
For any $y\in\mathbb R^d$ let $I_h(y):=\otimes_{j=1}^{d}\{x_j\in\mathbb R:\ |x_j-y_j|\le h_j\}$.

$1^0$. We start with bounding the variance of $\hat T_{1,h}$. Define

\begin{align*}
g_k(x_1,\dots,x_k)&:=\mathbb E_f\big[U^{(1)}_h(x_1,\dots,x_k,X_{k+1},\dots,X_p)\big],\quad k=1,\dots,p-1,\\
g_p(x_1,\dots,x_p)&:=U^{(1)}_h(x_1,\dots,x_p).
\end{align*}

Then the variance of $\hat T_{1,h}$ is given by the following well-known formula [see, e.g., Serfling (1980)]:

\[
\operatorname{var}_f\big[\hat T_{1,h}\big]=\binom{n}{p}^{-1}\sum_{k=1}^{p}\binom{p}{k}\binom{n-p}{p-k}\zeta_k,\qquad \zeta_k:=\operatorname{var}_f\big[g_k(X_1,\dots,X_k)\big].
\]

We note that $g_k(x_1,\dots,x_k)$, $k=1,\dots,p$, are symmetric functions. Observe that for $k=1,\dots,p-1$

\begin{align*}
|g_k(x_1,\dots,x_k)|&=\big|\mathbb E_f\big[g_p(x_1,\dots,x_k,X_{k+1},\dots,X_p)\big]\big|\\
&\le\int\prod_{i=1}^{k}\big|K_h(y-x_i)\big|\,\prod_{i=k+1}^{p}\Big[\int\big|K_h(y-x_i)\big|f(x_i)\,\mathrm dx_i\Big]\mathrm dy\\
&\le\|K\|_\infty^{p-k}\int\prod_{i=1}^{k}\big|K_h(y-x_i)\big|\,\big(M[f](y)\big)^{p-k}\,\mathrm dy.
\end{align*}

Then

\begin{align*}
\zeta_k&\le\mathbb E_f\big[g_k^2(X_1,\dots,X_k)\big]\\
&\le\|K\|_\infty^{2(p-k)}\int\!\!\int\big(M[f](y)\big)^{p-k}\big(M[f](z)\big)^{p-k}\prod_{i=1}^{k}\Big[\int\big|K_h(y-x_i)K_h(z-x_i)\big|f(x_i)\,\mathrm dx_i\Big]\mathrm dy\,\mathrm dz\\
&\le\|K\|_\infty^{2p}V_h^{-k}\int\!\!\int\big(M[f](y)\big)^{p-k}\big(M[f](z)\big)^{p-k}\mathbf 1\{y-z\in I_h(0)\}\prod_{i=1}^{k}\Big[\int\big|K_h(z-x_i)\big|f(x_i)\,\mathrm dx_i\Big]\mathrm dy\,\mathrm dz\\
&\le\|K\|_\infty^{2p}V_h^{-k}\int\!\!\int\big(M[f](y)\big)^{p-k}\big(M[f](z)\big)^{p}\mathbf 1\{y-z\in I_h(0)\}\,\mathrm dz\,\mathrm dy\\
&\le c_1\|K\|_\infty^{2p}V_h^{-k+1}\int\big(M[f](y)\big)^{p-k}M\big[M^{p}[f]\big](y)\,\mathrm dy.
\end{align*}

Furthermore, by the Hölder inequality and (5.2),

\begin{align*}
\int\big(M[f](y)\big)^{p-k}M\big[M^{p}[f]\big](y)\,\mathrm dy&\le\big\|M[f]\big\|_{2p-k}^{p-k}\big\|M\big[M^{p}[f]\big]\big\|_{(2p-k)/p}\le c_2\|f\|_{2p-k}^{p-k}\big\|M^{p}[f]\big\|_{(2p-k)/p}\\
&=c_2\|f\|_{2p-k}^{p-k}\big\|M[f]\big\|_{2p-k}^{p}\le c_3\|f\|_{2p-k}^{2p-k}.\qquad (6.1)
\end{align*}

Therefore we obtain for any $k=1,\dots,p-1$

\[
\zeta_k\le c_4\|K\|_\infty^{2p}V_h^{-k+1}\int[f(x)]^{2p-k}\,\mathrm dx.
\]

In addition,

\begin{align*}
\mathbb E_f\big[g_p^2(X_1,\dots,X_p)\big]&=\int\!\!\int\prod_{i=1}^{p}\Big[\int K_h(y-x_i)K_h(z-x_i)f(x_i)\,\mathrm dx_i\Big]\mathrm dy\,\mathrm dz\\
&\le\|K\|_\infty^{2p}V_h^{-p}\int\!\!\int\mathbf 1\{y-z\in I_h(0)\}\big(M[f](y)\big)^{p}\,\mathrm dy\,\mathrm dz\le c_5\|K\|_\infty^{2p}V_h^{-p+1}\int[f(x)]^{p}\,\mathrm dx.
\end{align*}

Thus we obtain

\[
\operatorname{var}_f\big[\hat T_{1,h}\big]\le C_1\|K\|_\infty^{2p}\sum_{k=1}^{p}n^{-k}V_h^{1-k}\int[f(x)]^{2p-k}\,\mathrm dx.\qquad (6.2)
\]

$2^0$. Bounding the variance of $\hat T_{2,h}$ goes along the same lines. Define

\begin{align*}
g_k(x_1,\dots,x_k)&:=\mathbb E_f\big[U^{(2)}_h(x_1,\dots,x_k,X_{k+1},\dots,X_p)\big],\quad k=1,\dots,p-1,\\
g_p(x_1,\dots,x_p)&:=U^{(2)}_h(x_1,\dots,x_p).
\end{align*}

We have for $k=1,\dots,p-1$

\begin{align*}
g_k(x_1,\dots,x_k)&=\frac1p\sum_{i=1}^{k}\prod_{\substack{j=1\\ j\ne i}}^{k}K_h(x_j-x_i)\bigg[\prod_{j=k+1}^{p}\int K_h(x_j-x_i)f(x_j)\,\mathrm dx_j\bigg]\\
&\quad+\frac1p\sum_{i=k+1}^{p}\int\prod_{j=1}^{k}K_h(x_j-x_i)\bigg[\prod_{\substack{j=k+1\\ j\ne i}}^{p}\int K_h(x_j-x_i)f(x_j)\,\mathrm dx_j\bigg]f(x_i)\,\mathrm dx_i.
\end{align*}

Therefore

\begin{align*}
|g_k(x_1,\dots,x_k)|&\le\frac{\|K\|_\infty^{p-k}}{p}\sum_{i=1}^{k}\big(M[f](x_i)\big)^{p-k}\prod_{\substack{j=1\\ j\ne i}}^{k}\big|K_h(x_j-x_i)\big|\\
&\quad+\frac{\|K\|_\infty^{p-k-1}}{p}\sum_{i=k+1}^{p}\int\prod_{j=1}^{k}\big|K_h(x_j-x_i)\big|\,\big(M[f](x_i)\big)^{p-k-1}f(x_i)\,\mathrm dx_i\\
&=:g^{(1)}_k(x_1,\dots,x_k)+g^{(2)}_k(x_1,\dots,x_k).
\end{align*}

For the first term on the right-hand side we obtain

\begin{align*}
\mathbb E_f\big|g^{(1)}_k(X_1,\dots,X_k)\big|^2&\le c_1\|K\|_\infty^{2(p-k)}\sum_{i=1}^{k}\int\big(M[f](x_i)\big)^{2(p-k)}\bigg[\prod_{\substack{j=1\\ j\ne i}}^{k}\int K_h^2(x_j-x_i)f(x_j)\,\mathrm dx_j\bigg]f(x_i)\,\mathrm dx_i\\
&\le c_2\|K\|_\infty^{2p-2}V_h^{-(k-1)}\int\big(M[f](x)\big)^{2p-k-1}f(x)\,\mathrm dx\\
&\le c_2\|K\|_\infty^{2p-2}V_h^{1-k}\int\big(M[f](x)\big)^{2p-k}\,\mathrm dx\le c_3\|K\|_\infty^{2p-2}V_h^{1-k}\int[f(x)]^{2p-k}\,\mathrm dx.
\end{align*}

The expectation of the squared second term is bounded as follows:

\begin{align*}
\mathbb E_f\big|g^{(2)}_k(X_1,\dots,X_k)\big|^2&\le c_4\|K\|_\infty^{2(p-k-1)}\int\!\!\int\bigg\{\prod_{j=1}^{k}\int K_h(x_j-y)K_h(x_j-z)f(x_j)\,\mathrm dx_j\bigg\}\big(M[f](y)\big)^{p-k}\big(M[f](z)\big)^{p-k}\,\mathrm dy\,\mathrm dz\\
&\le c_5\|K\|_\infty^{2p-2}V_h^{-k}\int\!\!\int\mathbf 1\{z-y\in I_h(0)\}\big(M[f](y)\big)^{p}\big(M[f](z)\big)^{p-k}\,\mathrm dz\,\mathrm dy\\
&\le c_6\|K\|_\infty^{2p-2}V_h^{1-k}\int\big(M[f](z)\big)^{p-k}M\big[M^{p}[f]\big](z)\,\mathrm dz\le c_7\|K\|_\infty^{2p-2}V_h^{1-k}\int[f(z)]^{2p-k}\,\mathrm dz,
\end{align*}

where we have used (6.1). Finally,

\begin{align*}
\mathbb E_f\big|g_p(X_1,\dots,X_p)\big|^2&\le c_8\int\bigg[\prod_{j=2}^{p}\int K_h^2(x_j-x_1)f(x_j)\,\mathrm dx_j\bigg]f(x_1)\,\mathrm dx_1\le c_9\|K\|_\infty^{2p-2}V_h^{-p+1}\int\big(M[f](x)\big)^{p-1}f(x)\,\mathrm dx\\
&\le c_9\|K\|_\infty^{2p-2}V_h^{-p+1}\int\big(M[f](x)\big)^{p}\,\mathrm dx\le c_{10}\|K\|_\infty^{2p-2}V_h^{-p+1}\int[f(x)]^{p}\,\mathrm dx.
\end{align*}

Thus, we obtain

\[
\operatorname{var}_f\big[\hat T_{2,h}\big]\le C_2\|K\|_\infty^{2p-2}\sum_{k=1}^{p}n^{-k}V_h^{1-k}\int[f(x)]^{2p-k}\,\mathrm dx.\qquad (6.3)
\]

The assertion of the lemma follows now from (6.2) and (6.3).

Proof of Lemma 4
Let $K$ be a kernel satisfying Assumption 1 with $\ell\ge\max_{j=1,\dots,d}\beta_j$. For any $\eta=(\eta_1,\dots,\eta_d)$ with $\eta_j>0$, $j=1,\dots,d$, we have in view of Lemma 3

\[
\|f\|_s\le\big\|B_\eta\big\|_s+\big\|S_\eta\big\|_s\le\sum_{j=1}^{d}\big\|B^{(j)}_\eta\big\|_s+\big\|K_\eta\star f\big\|_s.
\]

By the general form of Young's inequality with $1/q=1+1/s-1/p$,

\[
\|K_\eta\star f\|_s\le\|f\|_p\|K_\eta\|_q\le c_1\big(V_\eta\big)^{1/q-1}\|f\|_p=c_1\big(V_\eta\big)^{1/s-1/p}\|f\|_p.
\]

Furthermore, it follows from (5.4) of Lemma 3 with $s^*=s$ and $s_j=s$ that

\[
\sum_{j=1}^{d}\big\|B^{(j)}_\eta\big\|_s\le c_2\sum_{j=1}^{d}L_j\eta_j^{\gamma_j(s)},\qquad \gamma_j(s)=\frac{\beta_j\tau(s)}{\tau(r_j)}.
\]

Therefore, for any $\eta_j>0$, $j=1,\dots,d$,

\[
\|f\|_s\le c_1\big(V_\eta\big)^{1/s-1/p}\|f\|_p+c_2\sum_{j=1}^{d}L_j\eta_j^{\gamma_j(s)}.
\]

Putting $1/\gamma=\sum_{j=1}^{d}1/\gamma_j(s)$ and choosing $\eta_1,\dots,\eta_d$ from the equality

\[
\big(V_\eta\big)^{1/s-1/p}\|f\|_p=\sum_{j=1}^{d}L_j\eta_j^{\gamma_j(s)},
\]

we come to the following bound:

\[
\|f\|_s\le c_3\bigg(\prod_{j=1}^{d}L_j^{\gamma/\gamma_j(s)}\bigg)^{\frac{(1/p-1/s)/\gamma}{1+(1/p-1/s)/\gamma}}\|f\|_p^{\frac{1}{1+(1/p-1/s)/\gamma}}.
\]

Noting that

\[
1+\frac{1/p-1/s}{\gamma}=1+\Big[\frac1p-\frac1s\Big]\Big[\frac{\tau(\infty)}{\beta\tau(s)}+\frac{1}{\beta\omega\tau(s)}\Big]=1+\frac{1/p-1/s}{\beta\tau(s)}=\frac{\tau(p)}{\tau(s)},
\]

we complete the proof.

References
Bickel, P.J. and Ritov, Y. (1988). Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, 381–393.

Birgé, L. and Massart, P. (1995). Estimation of integral functionals of a density. Ann. Statist., 11–29.

Cai, T.T. and Low, M.G. (2005). Nonquadratic estimators of a quadratic functional. Ann. Statist., 2930–2956.

Folland, G. (1999). Real Analysis: Modern Techniques and Their Applications. Second edition. John Wiley & Sons, New York.

Goldenshluger, A. and Lepski, O.V. (2014). On adaptive minimax density estimation on R^d. Probab. Theory Related Fields, 479–543.

Goldenshluger, A. and Lepski, O.V. (2020). Minimax estimation of norms of a probability density: I. Lower bounds. Manuscript.

de Guzmán, M. (1975). Differentiation of Integrals in R^n. Lecture Notes in Mathematics, Vol. 481. Springer-Verlag, Berlin–New York.

Han, Y., Jiao, J. and Mukherjee, R. (2019). On estimation of L_r-norms in Gaussian white noise models. arXiv:1710.03863 [math.ST].

Kerkyacharian, G. and Picard, D. (1996). Estimating nonquadratic functionals of a density using Haar wavelets. Ann. Statist., 485–507.

Laurent, B. (1997). Estimation of integral functionals of a density and its derivatives. Bernoulli, 181–211.

Lepski, O.V., Nemirovski, A. and Spokoiny, V. (1999). On estimation of the L_r-norm of a regression function. Probab. Theory Related Fields, 221–253.

Lepski, O.V. and Spokoiny, V. (1999). Minimax nonparametric hypothesis testing: the case of an inhomogeneous alternative. Bernoulli, 333–358.

Nikol'skii, S.M. (1977). Priblizhenie Funktsii Mnogikh Peremennykh i Teoremy Vlozheniya (in Russian). [Approximation of Functions of Several Variables and Embedding Theorems.] 2nd ed., revised and supplemented. Nauka, Moscow.

Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, Inc.

Tchetgen, E., Li, L., Robins, J. and van der Vaart, A. (2008). Minimax estimation of the integral of a power of a density. Statist. Probab. Lett. 78.