Adaptive nonparametric estimation of a component density in a two-class mixture model
Gaëlle Chagny, Antoine Channarond, Van Hà Hoang, Angelina Roche
July 31, 2020
Abstract
A two-class mixture model, where the density of one of the components is known, is considered. We address the issue of the nonparametric adaptive estimation of the unknown probability density of the second component. We propose a randomly weighted kernel estimator with a fully data-driven bandwidth selection method, in the spirit of the Goldenshluger and Lepski method. An oracle-type inequality for the pointwise quadratic risk is derived, as well as convergence rates over Hölder smoothness classes. The theoretical results are illustrated by numerical simulations.
The following mixture model with two components is considered in this article:
$$g(x) = \theta + (1-\theta) f(x), \quad \forall x \in [0,1], \qquad (1)$$
where the mixing proportion $\theta \in (0,1)$ and the probability density function $f$ on $[0,1]$ are unknown. It is assumed that $n$ independent and identically distributed (i.i.d. in the sequel) random variables $X_1, \dots, X_n$ drawn from the density $g$ are observed. The main goal is to construct an adaptive estimator of the nonparametric component $f$ and to provide non-asymptotic upper bounds for the pointwise risk. As an intermediate step, the estimation of the parametric component $\theta$ is addressed as well.

Model (1) appears in several statistical settings, robust estimation and multiple testing among others. The setting chosen in the present article, as described above, comes from the multiple testing framework, where a large number $n$ of independent hypothesis tests are performed simultaneously. The $p$-values $X_1, \dots, X_n$ generated by these tests can be modeled by (1). Indeed, they are uniformly distributed on $[0,1]$ under the null hypotheses, while their distribution under the alternative hypotheses, corresponding to $f$, is unknown. The unknown parameter $\theta$ is the asymptotic proportion of true null hypotheses. Estimating $f$ can be needed, especially to evaluate and control different types of expected errors of the testing procedure, which is a major issue in this context. See for instance Genovese and Wasserman [15], Storey [28], Langaas et al. [20], Robin et al. [26], Strimmer [29], Nguyen and Matias [23], and, more fundamentally, Benjamini et al. [1] and Efron et al. [14].

In the setting of robust estimation, different from the multiple testing one, model (1) can be thought of as a contamination model, where the unknown distribution of interest $f$ is contaminated by the uniform distribution on $[0,1]$ with proportion $\theta$. This is a very specific case of the Huber contamination model [18].
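For intuition, the data-generating mechanism of model (1) can be sketched in a few lines. This is a minimal simulation, not the paper's code: the alternative density $f(x) = 2(1-x)$ matches one of the shapes used in the simulation study of Section 5, the value of $\theta$ is arbitrary, and all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, theta, sample_f):
    """Draw n i.i.d. observations from g(x) = theta + (1 - theta) f(x) on [0, 1]:
    with probability theta a point is uniform on [0, 1] (a p-value under the
    null hypothesis), otherwise it is drawn from the alternative density f."""
    is_null = rng.random(n) < theta
    x = np.empty(n)
    x[is_null] = rng.random(is_null.sum())       # uniform component
    x[~is_null] = sample_f((~is_null).sum())     # nonparametric component f
    return x

# Illustrative alternative: f(x) = 2(1 - x) on [0, 1], sampled by inverting
# its CDF F(x) = 2x - x^2, i.e. F^{-1}(u) = 1 - sqrt(1 - u).
sample_f = lambda m: 1.0 - np.sqrt(1.0 - rng.random(m))
X = sample_mixture(10_000, theta=0.6, sample_f=sample_f)
```

Only the mixture sample `X` is observed; neither the labels `is_null` nor $f$ itself are available to the statistician, which is exactly the difficulty addressed below.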
The statistical task considered consists in robustly estimating $f$ from the contaminated observations $X_1, \dots, X_n$. But unlike our setting, the contamination distribution is not necessarily known, while the contamination proportion $\theta$ is assumed to be known, and the theoretical investigations aim at providing minimax rates as functions of both $n$ and $\theta$. See for instance the preprint of Liu and Gao [22], which addresses pointwise estimation in this framework.

∗ LMRS, UMR CNRS 6085, Université de Rouen Normandie, [email protected]
† LMRS, UMR CNRS 6085, Université de Rouen Normandie, [email protected]
‡ LMRS, UMR CNRS 6085, Université de Rouen Normandie, [email protected]
§ CEREMADE, UMR CNRS 7534, Université Paris Dauphine, [email protected]

Back to the setting of multiple testing, the estimation of $f$ in model (1) has been addressed in several works. Langaas et al. [20] proposed a Grenander density estimator for $f$, based on a nonparametric maximum likelihood approach, under the assumption that $f$ belongs to the set of decreasing densities on $[0,1]$. Following a similar approach, Strimmer [29] also proposed a modified Grenander strategy to estimate $f$. However, the two aforementioned papers do not investigate theoretical features of the proposed estimators. Robin et al. [26] and Nguyen and Matias [23] proposed a randomly weighted kernel estimator of $f$, where the weights are estimators of the posterior probabilities of the mixture model, that is, the probabilities of each individual $i$ being in the nonparametric component given the observation $X_i$. [26] proposes an EM-like algorithm and proves the convergence of the iterative procedure to a unique solution, but does not provide any asymptotic property of the estimator. Note that their model $g(x) = \theta \phi(x) + (1-\theta) f(x)$, where $\phi$ is a known density, is slightly more general, but our procedure is also suitable for this model under some assumptions on $\phi$.
Besides, [23] achieves a nonparametric rate of convergence $n^{-2\beta/(2\beta+1)}$ for their estimator, where $\beta$ is the smoothness of the unknown density $f$. However, their estimation procedure is not adaptive, since the choice of their optimal bandwidth still depends on $\beta$.

In the present work, a new randomly weighted kernel estimator is proposed. Unlike the usual approach in mixture models, the weights of the estimator are not estimates of the posterior probabilities. A function $w$ is derived instead, such that $f(x) = w(\theta, g(x))\, g(x)$, for all $x \in [0,1]$ and all $\theta \in (0,1)$. This kind of equation, linking the target distribution (one of the conditional distributions given the hidden variables) to the distribution of the observed variables, is remarkable in the framework of mixture models. It is a key idea of our approach, since it implies a crucial equation for controlling the bias term of the risk; see Subsection 2.1 for more details. Thus oracle weights are defined by $w(\theta, g(X_i))$, $i = 1, \dots, n$, but $g$ and $\theta$ are unknown. These oracle weights are estimated by plug-in, using preliminary estimators of $g$ and $\theta$ based on an additional sample $X_{n+1}, \dots, X_{2n}$. Note that the procedures of [23] and [26] actually require preliminary estimates of $g$ and $\theta$ as well, but they do not deal with possible biases caused by the multiple use of the same observations in the estimates of $\theta$, $g$ and $f$.

Furthermore, a data-driven bandwidth selection rule is also constructed in this paper, using the Goldenshluger and Lepski (GL) approach [17], which has been applied in various contexts; see for instance Comte et al. [11], Comte and Lacour [9], Doumic et al. [13], Reynaud-Bouret et al. [25], who apply the GL method in kernel density estimation, and Bertin et al. [3], Chagny [6], Chichignoud et al. [7] or Comte and Rebafka [12]. Our selection rule is then adaptive to the unknown smoothness of the target function, which is new in this context.
The main original results derived in this paper are the oracle-type inequality in Theorem 1, and the rates of convergence over Hölder classes, which are adapted to the control of the pointwise risk of kernel estimators, in Corollary 1.

Some assumptions on the preliminary estimators of $g$ and $\theta$ are needed to prove the results on the estimator of $f$; this paper also provides estimators of $g$ and $\theta$ which satisfy these assumptions. The choice of a suitable estimator for $\theta$ requires identifiability of model (1). Given $g$, the couple $(\theta, f)$ such that $g = \theta + (1-\theta) f$ is uniquely determined under additional assumptions on $f$ (in particular monotonicity and the zero set of $f$); see a review about this issue in Section 1.1 of Nguyen and Matias [24]. Nonetheless, note that these additional assumptions on $f$ are not needed to obtain the results on the nonparametric estimation procedure of $f$.

The paper is organized as follows. Our randomly weighted estimator of $f$ is constructed in Section 2.1. The assumptions on $f$ and on the preliminary estimators of $g$ and $\theta$ required for proving the theoretical results are in this section too. In Section 2, a bias-variance decomposition for the pointwise risk of the estimator of $f$ is given, as well as the convergence rate of the kernel estimator with a fixed bandwidth. In Section 3, an oracle inequality which justifies our adaptive estimation procedure is established. The construction of the preliminary estimators of $g$ and $\theta$ is to be found in Section 4. Numerical results illustrate the theoretical results in Section 5. Proofs of theorems, propositions and technical lemmas are postponed to Section 6.

In this section, a family of kernel estimators for the density function $f$, based on a sample $(X_i)_{i=1,\dots,n}$ of i.i.d. variables with distribution $g$, is defined.
It is assumed that two preliminary estimators $\tilde\theta_n$ of the mixing proportion $\theta$ and $\hat g$ of the mixture density $g$ are available, defined from an additional sample $(X_i)_{i=n+1,\dots,2n}$ of independent variables also drawn from $g$, but independent of the first sample $(X_i)_{i=1,\dots,n}$. The definition of these preliminary estimates is the subject of Section 4.

To define estimators for $f$, the challenge is that the observations $X_1, \dots, X_n$ are not drawn from $f$ but from the mixture density $g$. Hence the density $f$ cannot be estimated directly by a classical kernel density estimator. We will thus build weighted kernel estimates, using a methodology inspired, for example, by Comte and Rebafka [12]. The starting point is the following lemma, whose proof is straightforward.

Lemma 1.
Let $X$ be a random variable with the mixture density $g$ defined by (1) and $Y$ be an (unobservable) random variable with the component density $f$. Then for any measurable bounded function $\varphi$ we have
$$\mathbb E[\varphi(Y)] = \mathbb E[w(\theta, g(X))\, \varphi(X)], \qquad (2)$$
with
$$w(\theta, g(x)) := \frac{1}{1-\theta}\left(1 - \frac{\theta}{g(x)}\right), \quad x \in [0,1].$$

This result will be used as follows. Let $K : \mathbb R \to \mathbb R$ be a kernel function, that is, an integrable function such that $\int_{\mathbb R} K(x)\,dx = 1$ and $\int_{\mathbb R} K^2(x)\,dx < +\infty$. For any $h > 0$, let $K_h(\cdot) = K(\cdot/h)/h$. Then the choice $\varphi(\cdot) = K_h(x - \cdot)$ in Lemma 1 leads to $\mathbb E[K_h(x-Y)] = \mathbb E[w(\theta, g(X)) K_h(x-X)]$, where $Y$ is drawn from $f$. Thus,
$$\hat f_h(x) = \frac1n \sum_{i=1}^n w(\tilde\theta_n, \hat g(X_i))\, K_h(x - X_i), \quad x \in [0,1], \qquad (3)$$
is well-suited to estimate $f$, with
$$w(\tilde\theta_n, \hat g(X_i)) = \frac{1}{1-\tilde\theta_n}\left(1 - \frac{\tilde\theta_n}{\hat g(X_i)}\right), \quad i = 1, \dots, n.$$
Therefore, $\hat f_h$ is a randomly weighted kernel estimator of $f$. However, the total sum of the weights may not equal 1, in contrast with the estimators proposed in Nguyen and Matias [23] and Robin et al. [26]. The main advantage of our estimate is that, if we replace $\hat g$ and $\tilde\theta_n$ by their theoretical unknown counterparts $g$ and $\theta$ in (3), we obtain $\mathbb E[\hat f_h(x)] = K_h \star f(x)$, where $\star$ stands for the convolution product. This relation, classical in nonparametric kernel estimation, is crucial for the study of the bias term in the risk of the estimator.

Here, we establish upper bounds for the pointwise mean-squared error of the estimator $\hat f_h$ defined in (3), with a fixed bandwidth $h > 0$.
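Before turning to the risk bounds, the estimator (3) can be implemented directly. The sketch below assumes the triangular kernel (the one used in the simulations of Section 5) and takes the preliminary estimates $\tilde\theta_n$ and $\hat g$ as inputs; all names are ours and purely illustrative.

```python
import numpy as np

def triangular_kernel(u):
    # K(u) = (1 - |u|) on [-1, 1]; integrates to 1
    return np.maximum(1.0 - np.abs(u), 0.0)

def mixture_weights(theta_tilde, g_hat_values):
    # w(theta, g(x)) = (1 - theta / g(x)) / (1 - theta), as in Lemma 1
    return (1.0 - theta_tilde / g_hat_values) / (1.0 - theta_tilde)

def f_hat(x, X, h, theta_tilde, g_hat):
    """Randomly weighted kernel estimator (3):
    f_hat_h(x) = (1/n) sum_i w(theta_tilde, g_hat(X_i)) K_h(x - X_i)."""
    w = mixture_weights(theta_tilde, g_hat(X))
    return np.mean(w * triangular_kernel((x - X) / h)) / h
```

With the true $\theta$ and $g$ plugged in, the expectation of `f_hat` is exactly $(K_h \star f)(x)$, which is the property used below to control the bias term.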
Our objective is to study the pointwise risk for the estimation of the density $f$ at a point $x \in [0,1]$. Throughout the paper, the kernel $K$ is chosen compactly supported on an interval $[-A, A]$, with $A$ a positive real number. We denote by $V_n(x)$ the neighbourhood of $x$ used in the sequel, defined by
$$V_n(x) = \left[x - \frac{A}{\alpha_n},\, x + \frac{A}{\alpha_n}\right],$$
where $(\alpha_n)_n$ is a positive sequence of numbers larger than 1, only depending on $n$, such that $\alpha_n \to +\infty$ as $n \to +\infty$. For any function $u$ on $\mathbb R$ and any interval $I \subset \mathbb R$, let $\|u\|_{\infty,I} = \sup_{t \in I} |u(t)|$. The following assumptions will be required for our theoretical results.

(A1) The density $f$ is uniformly bounded on $V_n(x)$ for some $n$: $\|f\|_{\infty,V_n(x)} < \infty$.

(A2) The preliminary estimator $\hat g$ is bounded away from 0 on $V_n(x)$:
$$\hat\gamma := \inf_{t \in V_n(x)} |\hat g(t)| > 0. \qquad (4)$$

(A3) The preliminary estimate $\hat g$ of $g$ satisfies, for all $\nu > 0$,
$$\mathbb P\left( \sup_{t \in V_n(x)} \left| \frac{\hat g(t) - g(t)}{\hat g(t)} \right| > \nu \right) \le C_{g,\nu} \exp\left\{ -(\log n)^{3/2} \right\}, \qquad (5)$$
with $C_{g,\nu}$ a constant only depending on $g$ and $\nu$.

(A4) The preliminary estimator $\tilde\theta_n$ is constructed such that $\tilde\theta_n \in [\delta/2, 1-\delta/2]$, for a fixed $\delta \in (0,1)$.

(A5) For any bandwidth $h > 0$, we assume that $\alpha_n \le 1/h$ and $1/h \le \hat\gamma\, n/\log(n)$.

(A6) $f$ belongs to the Hölder class of smoothness $\beta$ and radius $L$ on $[0,1]$, defined by
$$\Sigma(\beta, L) = \left\{ \varphi : \varphi \text{ has } \ell = \lfloor\beta\rfloor \text{ derivatives and } \forall x, y \in [0,1],\ |\varphi^{(\ell)}(x) - \varphi^{(\ell)}(y)| \le L |x-y|^{\beta-\ell} \right\}.$$

(A7) $K$ is a kernel of order $\ell = \lfloor\beta\rfloor$: $\int_{\mathbb R} x^j K(x)\,dx = 0$ for $1 \le j \le \ell$, and $\int_{\mathbb R} |x|^{\ell} |K(x)|\,dx < \infty$.

Since $g = \theta + (1-\theta) f$, Assumption (A1) implies that $\|g\|_{\infty,V_n(x)} < \infty$. The latter condition is needed to control, among others, the variance term of the bias-variance decomposition of the risk. Notice that the density $g$ is automatically bounded from below by a positive constant in our model (1). Assumption (A2) is required to bound the term $1/\hat g(\cdot)$ that appears in the weight $w(\tilde\theta_n, \hat g(\cdot))$. Assumption (A3) means that the preliminary estimator $\hat g$ has to be rather accurate. Assumptions (A2) and (A3) are also introduced by Bertin et al. [2] for conditional density estimation purposes: see (3.2) and (3.3), p.946. The methodology used in our proofs is inspired from their work: the role played by $g$ here corresponds to the role played by the marginal density in their paper. They have also shown that an estimator of $g$ satisfying these properties can be built, see Theorem 4, p.14 of [2], and some details in Section 4.1. We also build an estimator $\tilde\theta_n$ that satisfies Assumption (A4) in Section 4.2. Assumption (A5) deals with the order of magnitude of the bandwidths and is also borrowed from [2] (see Assumption (CK), p.947). Assumptions (A6) and (A7) are classical for kernel density estimation, see [30] or [10]. The index $\beta$ in Assumption (A6) is a measure of the smoothness of the target function. It permits to control the bias term of the bias-variance decomposition of the risk, and thus to derive convergence rates.

We first state an upper bound for the pointwise risk of the estimator $\hat f_h$. The proof can be found in Section 6.1.

Proposition 1.
Assume that Assumptions (A1) to (A5) are satisfied. Then, for any $x \in [0,1]$ and $\delta \in (0,1)$, the estimator $\hat f_h$ defined by (3) satisfies
$$\mathbb E\left[\big(\hat f_h(x) - f(x)\big)^2\right] \le C_1^* \left\{ \|K_h \star f - f\|^2_{\infty,V_n(x)} + \frac{1}{\delta^2 \hat\gamma^2 n h} \right\} + \frac{C_2^*}{\delta^4}\, \mathbb E\left[\big|\tilde\theta_n - \theta\big|^2\right] + \frac{C_3^*}{\delta^2 \hat\gamma^4}\, \mathbb E\left[\|\hat g - g\|^2_{\infty,V_n(x)}\right] + \frac{C_4^*}{\delta^2 n}, \qquad (6)$$
where $C_\ell^*$, $\ell = 1, \dots, 4$, are positive constants such that: $C_1^*$ depends on the kernel $K$ and on $\|g\|_{\infty,V_n(x)}$, $C_2^*$ depends on $\|g\|_{\infty,V_n(x)}$ and $K$, $C_3^*$ depends on $K$, and $C_4^*$ depends on $\|f\|_{\infty,V_n(x)}$, $g$ and $K$.

Proposition 1 is a bias-variance decomposition of the risk. The first term in the right-hand side (r.h.s. in the sequel) of (6) is a bias term, which decreases when the bandwidth $h$ goes to 0, whereas the second one corresponds to the variance term and increases when $h$ goes to 0. There are two additional terms, $\mathbb E[\|\hat g - g\|^2_{\infty,V_n(x)}]$ and $\mathbb E[|\tilde\theta_n - \theta|^2]$, in the r.h.s. of (6). They are unavoidable, since the estimator $\hat f_h$ depends on the plug-in estimators $\hat g$ and $\tilde\theta_n$. The term $C_4^*/(\delta^2 n)$ is a remainder term and is negligible. The convergence rate that we derive in Corollary 1 below will prove that these last three terms in (6) are negligible if $g$ and $\theta$ are estimated accurately: Section 4 proves that it is possible.

3 Adaptive pointwise estimation
Let $\mathcal H_n$ be a finite family of possible bandwidths $h > 0$, whose cardinality is bounded by the sample size $n$. The best estimator in the collection $(\hat f_h)_{h \in \mathcal H_n}$ defined in (3) at the point $x$ is the one that has the smallest risk, or equivalently, the smallest bias-variance decomposition. But since $f$ is unknown, in practice it is impossible to minimize the r.h.s. of inequality (6) over $\mathcal H_n$ in order to select the best estimate. Thus, we propose a data-driven selection, with a rule in the spirit of Goldenshluger and Lepski (GL in the sequel) [17]. The idea is to mimic the bias-variance trade-off for the risk, with empirical counterparts for the unknown quantities. We first estimate the variance term of the trade-off by setting, for any $h \in \mathcal H_n$,
$$V(x,h) = \kappa\, \|K\|_1 \|K\|_2^2\, \frac{\|g\|_{\infty,V_n(x)}}{\hat\gamma^2}\, \frac{\log(n)}{nh}, \qquad (7)$$
with $\kappa > 0$ a numerical constant. Then we estimate the bias term $\|K_h \star f - f\|^2_{\infty,V_n(x)}$ of $\hat f_h(x)$, for any $h \in \mathcal H_n$, with
$$A(x,h) := \max_{h' \in \mathcal H_n} \left\{ \big(\hat f_{h,h'}(x) - \hat f_{h'}(x)\big)^2 - V(x,h') \right\}_+,$$
where, for any $h, h' \in \mathcal H_n$,
$$\hat f_{h,h'}(x) = \frac1n \sum_{i=1}^n w(\tilde\theta_n, \hat g(X_i))\, (K_h \star K_{h'})(x - X_i) = (K_h \star \hat f_{h'})(x).$$
Heuristically, since $\hat f_{h'}$ is an estimator of $f$, $\hat f_{h,h'} = K_h \star \hat f_{h'}$ can be considered as an estimator of $K_h \star f$. The proof of Theorem 1 below, in Section 6.4, then justifies that $A(x,h)$ is a good approximation of the bias term of the pointwise risk. Finally, our estimate at the point $x$ is
$$\hat f(x) := \hat f_{\hat h(x)}(x), \qquad (8)$$
where the bandwidth $\hat h(x)$ minimizes the empirical bias-variance decomposition:
$$\hat h(x) := \operatorname*{argmin}_{h \in \mathcal H_n} \left\{ A(x,h) + V(x,h) \right\}.$$
The constants that appear in the estimated variance $V(x,h)$ are known, except $\kappa$, which is a numerical constant calibrated by simulation (see the practical tuning in Section 5), and except $\|g\|_{\infty,V_n(x)}$, which is replaced by an empirical counterpart in practice (see also Section 5). It is also possible to justify this substitution from a theoretical point of view, but it adds cumbersome technicalities.
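The selection rule can be prototyped as follows. This is a simplified sketch, not the authors' code: the kernel-norm constants of (7) are folded into `kappa`, the sup-norm $\|g\|_{\infty,V_n(x)}$ is passed as a plug-in value `g_sup` (Section 5 uses a percentile of preliminary estimates), the $\hat\gamma$ factor is omitted, and the convolution $K_h \star K_{h'}$ is evaluated by quadrature.

```python
import numpy as np

def tri(u):  # triangular kernel
    return np.maximum(1.0 - np.abs(u), 0.0)

def kernel_conv(t, h, hp, m=201):
    # (K_h * K_{h'})(t) evaluated by Riemann quadrature over the support of K_{h'}
    u = np.linspace(-hp, hp, m)
    du = u[1] - u[0]
    return np.sum(tri((t[:, None] - u) / h) / h * tri(u / hp) / hp, axis=1) * du

def gl_select(x, X, H, theta_tilde, g_hat, g_sup, kappa=1.0):
    """Goldenshluger-Lepski bandwidth choice:
    hat_h = argmin_h A(x, h) + V(x, h), with
    A(x, h) = max_{h'} [ (f_{h,h'}(x) - f_{h'}(x))^2 - V(x, h') ]_+ ."""
    n = len(X)
    w = (1.0 - theta_tilde / g_hat(X)) / (1.0 - theta_tilde)
    f_h = {h: np.mean(w * tri((x - X) / h)) / h for h in H}
    V = {h: kappa * g_sup * np.log(n) / (n * h) for h in H}
    A = {}
    for h in H:
        A[h] = max(
            max((np.mean(w * kernel_conv(x - X, h, hp)) - f_h[hp]) ** 2 - V[hp], 0.0)
            for hp in H
        )
    h_star = min(H, key=lambda h: A[h] + V[h])
    return h_star, f_h[h_star]
```

The rule needs no knowledge of the smoothness of $f$: the proxy $A(x,h)$ plays the role of the unknown bias and $V(x,h)$ that of the variance.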
Moreover, the replacement does not change the result of Theorem 1 below. We thus refer to Section 3.3, p.1178, of [8], for example, for the details of a similar substitution. The risk of this estimator is controlled in the following result.

Theorem 1.
Assume that Assumptions (A1) to (A5) are fulfilled, and that the sample size $n$ is larger than a constant that only depends on the kernel $K$. For any $\delta \in (0,1)$, the estimator $\hat f(x)$ defined in (8) satisfies
$$\mathbb E\left[\big(\hat f(x) - f(x)\big)^2\right] \le C_1^* \min_{h \in \mathcal H_n} \left\{ \|K_h \star f - f\|^2_{\infty,V_n(x)} + \frac{\log(n)}{\delta^2 \hat\gamma^2 n h} \right\} + \frac{C_2^*}{\delta^4} \sup_{\theta \in [\delta, 1-\delta]} \mathbb E\left[\big|\tilde\theta_n - \theta\big|^2\right] + \frac{C_3^*}{\delta^2 \hat\gamma^4}\, \mathbb E\left[\|\hat g - g\|^2_{\infty,V_n(x)}\right] + \frac{C_4^*}{\delta^2 \hat\gamma^2 n}, \qquad (9)$$
where $C_\ell^*$, $\ell = 1, \dots, 4$, are positive constants such that: $C_1^*$ depends on $\|g\|_{\infty,V_n(x)}$ and on the kernel $K$, $C_2^*$ depends on $K$, $C_3^*$ depends on $\|g\|_{\infty,V_n(x)}$ and $K$, and $C_4^*$ depends on $\|f\|_{\infty,V_n(x)}$, $g$ and $K$.

Theorem 1 is an oracle-type inequality. It holds whatever the sample size, provided it is larger than a fixed constant. It shows that the optimal bias-variance trade-off is automatically achieved: the selection rule permits to select, in a data-driven way, the best estimator in the collection $(\hat f_h)_{h \in \mathcal H_n}$, up to a multiplicative constant $C_1^*$. The last three remainder terms in the r.h.s. of (9) are the same as the ones in Proposition 1, and are unavoidable, as aforementioned. Compared to the analogous term in (6), we have an additional logarithmic factor in the second term of the r.h.s. It is classical in adaptive pointwise estimation (see for example [12] or [4]). In our framework, it does not deteriorate the adaptive convergence rate. The risk of the estimator $\hat f(x)$ with data-driven bandwidth decreases at the optimal minimax rate of convergence (up to a logarithmic term) if the bandwidth is well chosen: the upper bound of Corollary 1 matches the lower bound for the minimax risk established by Ibragimov and Hasminskii [19].

Corollary 1.
Assume that (A6) and (A7) are satisfied, for $\beta > 0$ and $L > 0$, and for an index $\ell > 0$ such that $\ell \ge \lfloor\beta\rfloor$. Suppose also that the assumptions of Theorem 1 are satisfied, and that the preliminary estimates $\tilde\theta_n$ and $\hat g$ are such that
$$\mathbb E\left[|\tilde\theta_n - \theta|^2\right] \le C \left(\frac{\log n}{n}\right)^{\frac{2\beta}{2\beta+1}}, \qquad \mathbb E\left[\|\hat g - g\|^2_{\infty,V_n(x)}\right] \le C \left(\frac{\log n}{n}\right)^{\frac{2\beta}{2\beta+1}}. \qquad (10)$$
Then,
$$\mathbb E\left[\big(\hat f(x) - f(x)\big)^2\right] \le C^* \left(\frac{\log n}{n}\right)^{\frac{2\beta}{2\beta+1}}, \qquad (11)$$
where $C^*$ is a constant depending on $\|g\|_{\infty,V_n(x)}$, the kernel $K$, $L$ and $\|f\|_{\infty,V_n(x)}$.

The estimator $\hat f$ thus achieves the convergence rate $(\log n/n)^{2\beta/(2\beta+1)}$ over the class $\Sigma(\beta, L)$ as soon as $\beta \le \ell$. It automatically adapts to the unknown smoothness of the function to estimate: the bandwidth $\hat h(x)$ is computed in a fully data-driven way, without using the knowledge of the regularity index $\beta$, contrary to the estimator $\hat f_{\mathrm{rwkn}}$ of Nguyen and Matias [23] (Corollary 3.4). Section 4 below also permits to build $\hat g$ and $\tilde\theta_n$ without any knowledge of $\beta$, to obtain a fully automatic adaptive estimation procedure.

Remark 1.
In the present work, we focus on model (1). However, the estimation procedure we develop can easily be extended to the model
$$g(x) = \theta \phi(x) + (1-\theta) f(x), \quad x \in \mathbb R, \qquad (12)$$
where the function $\phi$ is a known density, but not necessarily equal to the uniform one. In this case, a family of kernel estimates can be defined as in (3), replacing the weights $w(\tilde\theta_n, \hat g(\cdot))$ by
$$w(\tilde\theta_n, \hat g(\cdot), \phi(\cdot)) = \frac{1}{1-\tilde\theta_n}\left(1 - \tilde\theta_n \frac{\phi(\cdot)}{\hat g(\cdot)}\right).$$
If the density function $\phi$ is uniformly bounded on $\mathbb R$, it is then possible to obtain results analogous to those established for model (1): a bias-variance trade-off for the pointwise risk, and an adaptive bandwidth selection rule leading to an oracle-type inequality and to the optimal convergence rate.

4 Estimation of the mixture density g and the mixing proportion θ

This section is devoted to the construction of the preliminary estimators $\hat g$ and $\tilde\theta_n$ required to build (3). To define them, we assume that we observe an additional sample $(X_i)_{i=n+1,\dots,2n}$ distributed with density $g$, but independent of the sample $(X_i)_{i=1,\dots,n}$. We explain how the estimators $\hat g$ and $\tilde\theta_n$ can be defined so as to satisfy the assumptions described at the beginning of Section 2.2, and also how we compute them in practice. The reader should bear in mind that other constructions are possible, but our main objective is the adaptive estimation of the density $f$. Thus, further theoretical studies are beyond the scope of this paper.

4.1 Preliminary estimator for the mixture density g

As already noticed, the role played by $g$ in the estimation of $f$ in our framework finds an analogue in the work of Bertin et al. [3]: the authors propose a conditional density estimation method that involves a preliminary estimator of the marginal density of a couple of real random variables. Assumptions (A2) and (A3) are borrowed from their paper. From a theoretical point of view, we thus also draw inspiration from them to build $\hat g$. Since we focus on kernel methods to recover $f$, we also use kernels for the estimation of $g$.
Let $L : \mathbb R \to \mathbb R$ be a function such that $\int_{\mathbb R} L(x)\,dx = 1$ and $\int_{\mathbb R} L^2(x)\,dx < \infty$. Let $L_b(\cdot) = b^{-1} L(\cdot/b)$, for any $b > 0$. The function $L$ is a kernel, but can be chosen differently from the kernel $K$ used to estimate the density $f$. The classical kernel density estimate for $g$ is
$$\hat g_b(x) = \frac1n \sum_{i=n+1}^{2n} L_b(x - X_i). \qquad (13)$$
Theorem 4, p.14 of [2], proves that it is possible to select an adaptive bandwidth $\hat b$ for $\hat g_b$ in such a way that Assumptions (A2) and (A3) are fulfilled, and that the resulting estimate $\hat g_{\hat b}$ satisfies
$$\mathbb E\left[\|\hat g_{\hat b} - g\|^2_{\infty,V_n(x)}\right] \le C \left(\frac{\log n}{n}\right)^{\frac{2\beta}{2\beta+1}}, \quad \text{if } g \in \Sigma(\beta, L'),$$
where $C, L' > 0$, provided the kernel $L$ has order $\ell = \lfloor\beta\rfloor$. The idea of the result of Theorem 4 in [2] is to select the bandwidth $\hat b$ with a classical Lepski method, and to apply results from Giné and Nickl [16]. Notice that, in our model, Assumption (A6) directly yields the required smoothness assumption $g \in \Sigma(\beta, L')$, since $g = \theta + (1-\theta) f$. This guarantees that both Assumptions (A2) and (A3) on $\hat g$ can be satisfied and that the additional term $\mathbb E[\|\hat g - g\|^2_{\infty,V_n(x)}]$ can be bounded as required in the statement of Corollary 1.

For the simulation study below, we start from the kernel estimators $(\hat g_b)_{b>0}$ defined in (13) and rather use a procedure in the spirit of the pointwise GL method to automatically select a bandwidth $b$. First, this choice permits to be coherent with the selection method chosen for the main estimators $(\hat f_h)_{h \in \mathcal H_n}$, see Section 3. Then, the construction also provides an accurate estimate of $g$, see for example [10]. Let $\mathcal B$ be a finite family of bandwidths. For any $b, b' \in \mathcal B$, we introduce the auxiliary functions $\hat g_{b,b'}(x) = \frac1n \sum_{i=n+1}^{2n} (L_b \star L_{b'})(x - X_i)$. Next, for any $b \in \mathcal B$, we set
$$A_g(b, x) = \max_{b' \in \mathcal B} \left\{ \big(\hat g_{b,b'}(x) - \hat g_{b'}(x)\big)^2 - \Gamma(b') \right\}_+,$$
where $\Gamma(b) = \epsilon \|L\|_1 \|L\|_2^2 \|g\|_\infty \log(n)/(nb)$, with $\epsilon > 0$ a tuning constant. The selected estimate of $g$ is given by $\hat g(x) := \hat g_{\hat b_g(x)}(x)$, with $\hat b_g(x) := \operatorname*{argmin}_{b \in \mathcal B} \{A_g(b,x) + \Gamma(b)\}$. The tuning of the constant $\epsilon$ is presented in Section 5.
4.2 Preliminary estimator for the mixing proportion θ

A huge variety of methods have been investigated for the estimation of the mixing proportion $\theta$ of model (1): see, for instance, [28], [20], [26], [5], [24] and references therein. A common and efficient estimator is the one proposed by Storey [28]: $\theta$ is estimated by $\hat\theta_{\tau,n} = \#\{X_i > \tau;\ i = n+1, \dots, 2n\}/(n(1-\tau))$, with $\tau$ a threshold to be chosen. The optimal value of $\tau$ is calculated with a bootstrap algorithm. However, it seems difficult to obtain theoretical guarantees on $\hat\theta_{\tau,n}$. To our knowledge, most of the other methods in the literature rely on different identifiability constraints for the parameters $(\theta, f)$. We refer to Celisse and Robin [5] or Nguyen and Matias [24] for a detailed discussion about possible identifiability conditions for model (1). In the sequel we focus on a particular case of model (1) that permits to obtain the identifiability of the parameters $(\theta, f)$ (see for example Assumption A in [5], or Section 1.1 in [24]). The density $f$ is assumed to belong to the family
$$\mathcal F_\delta = \left\{ f : [0,1] \to \mathbb R_+,\ f \text{ is a continuous non-increasing density, positive on } [0, 1-\delta) \text{ and such that } f_{|[1-\delta,1]} = 0 \right\}, \qquad (14)$$
where $\delta \in (0,1)$. Starting from this set, the main idea to recover $\theta$ is that it is the lower bound of the density $g$ in model (1): $\theta = \inf_{x \in [0,1]} g(x) = g(1)$. Celisse and Robin [5] or Nguyen and Matias [24] then define a histogram-based estimator $\hat g$ of $g$, and estimate $\theta$ with the lower bound of $\hat g$, or with $\hat g(1)$. The procedure we choose is in the spirit of this one, but, to be coherent with the other estimates, we use kernels to recover $g$ instead of histograms.

Nevertheless, we cannot directly use the kernel estimates of $g$ defined in (13): it is well known that kernel density estimation suffers from boundary effects, which lead to an inaccurate estimate of $g(1)$. To avoid this problem, we apply a simple reflection method inspired by Schuster [27]. From the random sample $X_{n+1}, \dots, X_{2n}$ from density $g$, we introduce, for $k = i - n$, $i = n+1, \dots, 2n$,
$$Y_k = \begin{cases} X_i & \text{if } \epsilon_i = 1, \\ 2 - X_i & \text{if } \epsilon_i = -1, \end{cases}$$
where $\epsilon_{n+1}, \dots, \epsilon_{2n}$ are $n$ i.i.d. random variables drawn from the Rademacher distribution with parameter $1/2$, and independent of the $X_i$'s. The random variables $Y_1, \dots, Y_n$ can be regarded as a symmetrized version of the $X_i$'s, with support $[0,2]$ (see the first point of Lemma 2 below). Now, suppose that $L$ is a symmetric kernel. For any $b > 0$, define
$$\hat g_b^{\mathrm{sym}}(x) = \frac1n \sum_{k=1}^n \left[ L_b(x - Y_k) + L_b\big((2-x) - Y_k\big) \right], \quad x \in [0,1]. \qquad (15)$$
The graph of $\hat g_b^{\mathrm{sym}}$ is symmetric with respect to the straight line $x = 1$. Then, instead of evaluating $\hat g_b^{\mathrm{sym}}$ at the single point $x = 1$, we compute the average of all the values of the estimator $\hat g_b^{\mathrm{sym}}$ on the interval $[1-\delta, 1]$, relying on the fact that $\theta = g(x)$ for all $x \in [1-\delta, 1]$ (under the assumption $f \in \mathcal F_\delta$), to increase the accuracy of the resulting estimate. Thus, we set
$$\hat\theta_{n,b} = \frac1\delta \int_{1-\delta}^{1} \hat g_b^{\mathrm{sym}}(x)\,dx. \qquad (16)$$
Finally, for the estimation of $f$, we use a truncated estimator $\tilde\theta_{n,b}$ defined as
$$\tilde\theta_{n,b} := \max\big( \min(\hat\theta_{n,b},\, 1-\delta/2),\ \delta/2 \big). \qquad (17)$$
The definition of $\tilde\theta_{n,b}$ ensures that $\tilde\theta_{n,b} \in [\delta/2, 1-\delta/2]$: this is Assumption (A4). This permits to avoid possible difficulties in the estimation of $f$ when $\hat\theta_{n,b}$ is close to 0 or 1, see (3). The following lemma establishes some properties of all these estimates. Its proof can be found in Section 6.2.

Lemma 2.
• The estimator $\hat g_b^{\mathrm{sym}}$ defined in (15) has the same distribution as a classical kernel estimator built from the symmetrized density of $g$. More precisely, let $(R_i)_{i \in \{1,\dots,n\}}$ be i.i.d. random variables with density
$$r : x \longmapsto \begin{cases} g(x)/2 & \text{if } x \in [0,1], \\ g(2-x)/2 & \text{if } x \in (1,2], \end{cases}$$
and $\hat r_b(x) = \frac1n \sum_{i=1}^n L_b(x - R_i)$. Then, the estimators $\hat g_b^{\mathrm{sym}}$ and $\hat r_b$ have the same distribution.
• We have
$$|\hat\theta_{n,b} - \theta| \le \|\hat g_b^{\mathrm{sym}} - g\|_{\infty, [1-\delta,1]}. \qquad (18)$$
• Moreover,
$$\mathbb P\big( \tilde\theta_{n,b} \ne \hat\theta_{n,b} \big) \le \frac2\delta\, \mathbb E\big[ |\hat\theta_{n,b} - \theta| \big]. \qquad (19)$$

The first property of Lemma 2 permits to deal with $\hat g_b^{\mathrm{sym}}$ as with a classical kernel density estimate. The second property, (18), allows us to control the estimation risk of $\hat\theta_{n,b}$, while the third one, (19), justifies that the introduction of $\tilde\theta_{n,b}$ is reasonable.

To obtain a fully data-driven estimate $\tilde\theta_{n,b}$, it remains to define a bandwidth selection rule for the kernel estimator $\hat g_b^{\mathrm{sym}}$. In view of (18), we introduce a data-driven procedure under sup-norm loss, inspired from Lepski [21]. For any $x \in [0,1]$ and any bandwidths $b, b'$ in a collection $\mathcal B'$, we set $\hat g_{b,b'}^{\mathrm{sym}}(x) = (L_b \star \hat g_{b'}^{\mathrm{sym}})(x)$, and $\Gamma'(b) = \lambda \|L\|_2^2 \log(n)/(nb)$, with $\lambda$ a tuning parameter.
As for the other bandwidth selection device, we now define
$$\Delta(b) = \max_{b' \in \mathcal B'} \left\{ \sup_{x \in [1-\delta, 1]} \big(\hat g_{b,b'}^{\mathrm{sym}}(x) - \hat g_{b'}^{\mathrm{sym}}(x)\big)^2 - \Gamma'(b') \right\}_+.$$
Finally, we choose $\tilde b = \operatorname*{argmin}_{b \in \mathcal B'} \{\Delta(b) + \Gamma'(b)\}$, which leads to $\hat g^{\mathrm{sym}} := \hat g_{\tilde b}^{\mathrm{sym}}$ and $\tilde\theta_n := \tilde\theta_{n,\tilde b}$. The results of [21], combined with Lemma 2, ensure that $\tilde\theta_n$ satisfies (10) if $g$ is smooth enough. Numerical simulations in Section 5 show that our estimator has good performance from the practical point of view, in comparison with those proposed in [24] and [28].

5 Numerical results

We briefly illustrate the performance of the estimation method over simulated data, according to the following framework. We simulate observations with density $g$ defined by model (1) for several sample sizes $n$. Three different cases of $(\theta, f)$ are considered:

• $f_1(x) = 2(1-x)\,\mathbf 1_{[0,1]}(x)$;
• $f_2(x) = \frac{s}{1-\delta}\left(1 - \frac{x}{1-\delta}\right)^{s-1}\mathbf 1_{[0,1-\delta]}(x)$, for fixed values of $(\delta, s)$;
• $f_3(x) = \lambda e^{-\lambda x}\left(1 - e^{-\lambda b}\right)^{-1}\mathbf 1_{[0,b]}(x)$, the density of the truncated exponential distribution on $[0,b]$, for fixed values of $(\lambda, b)$.

One of these densities is borrowed from [23], while the shape of another one is used both by [5] and [24]. Figure 1 represents the three cases with respect to each design density and associated proportion $\theta$.

Figure 1: Representation of $f_j$ and the corresponding $g_j$ in model (1) for $(\theta, f_1)$ (left), $(\theta, f_2)$ (middle) and $(\theta, f_3)$ (right).

To compute our estimates, we choose $K(x) = L(x) = (1 - |x|)\,\mathbf 1_{|x| \le 1}$, the triangular kernel. In the variance term (7) of the GL method used to select the bandwidth of the kernel estimator of $f$, we replace $\|g\|_{\infty,V_n(x)}$ by the 95th percentile of $\{\max_{t \in V_n(x)} \hat g_h(t),\ h \in \mathcal H_n\}$. Similarly, in the variance term $\Gamma$ used to select the bandwidth of the kernel estimate of $g$, we use the 95th percentile of $\{\max_{t \in [0,1]} \hat g_h(t),\ h \in \mathcal H_n\}$ instead of $\|g\|_\infty$.
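The construction of $\tilde\theta_n$ described in Section 4.2 (reflection of the sample around $x = 1$, kernel smoothing, averaging over $[1-\delta, 1]$, then truncation) can be sketched for a fixed bandwidth $b$ as follows; in practice the bandwidth would be selected by the Lepski-type rule above, and all names are illustrative.

```python
import numpy as np

def tri(u):  # triangular (symmetric) kernel L
    return np.maximum(1.0 - np.abs(u), 0.0)

def theta_tilde(X2, b, delta, rng):
    """Estimate theta = g(1): symmetrize the sample around x = 1 as in (15),
    average the smoothed estimate over [1 - delta, 1] as in (16), and
    truncate to [delta/2, 1 - delta/2] as in (17)."""
    eps = rng.choice([-1.0, 1.0], size=len(X2))          # Rademacher signs
    Y = np.where(eps > 0, X2, 2.0 - X2)                  # reflected sample on [0, 2]
    grid = np.linspace(1.0 - delta, 1.0, 50)
    g_sym = np.array([
        np.mean(tri((x - Y) / b) + tri((2.0 - x - Y) / b)) / b
        for x in grid
    ])
    theta_hat = g_sym.mean()                 # ~ (1/delta) * integral in (16)
    return min(max(theta_hat, delta / 2.0), 1.0 - delta / 2.0)
```

Because the reflected sample has a density that is smooth across $x = 1$, the kernel estimate has no boundary bias there, which is the whole point of the symmetrization.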
The collection of bandwidths H n , B , B are equal to (cid:32) { k, k “ , . . . , t ? n u ( where t x u denotes a smallest integer which is strictly smaller than the real number x .9 . . . . k M SE f^ (0.2)f^ (0.4)f^ (0.6)f^ (0.9) . . . . . e M SE g^ (0.2)g^ (0.2)g^ (0.6)g^ (0.4)g^ (0.9) . . . . . l MAE of q ^ , q =0.65MAE of q ^ , q =0.45MAE of q ^ , q =0.35 (a) (b) (c)Figure 2: values of the mean-squared error for (a) ˆ f p x q with respect to κ , (b) ˆ g p x q with respect to (cid:15) .(c) : Values of the mean-absolute error for ˆ θ n with respect to λ . The sample size is n “ κ (figure (a)), (cid:15) (figure (b)) and λ (figure (c)).We shall settle the values of the constants κ , (cid:15) and λ involved in the penalty terms V p x , h q , Γ p h q andΓ p b q respectively, to compute the selected bandwidths. Since the calibrations of these tuning parametersare carried out in the same fashion, we only describe the calibration for κ . Denote by ˆ f κ the estimatorof f depending on the constant κ to be calibrated. We approximate the mean-squared error for theestimator ˆ f κ , defined by MSE p ˆ f κ p x qq “ E rp ˆ f κ p x q ´ f p x qq s , over 100 Monte-Carlo runs, for differentpossible values t κ , . . . , κ K u of κ , for the three densities f , f , f calculated at several test points x . Wechoose a value for κ that leads to small risks in all investigated cases. Figure 2(a) shows that κ “ . (cid:15) “ .
5.2 Estimation of the proportion θ

We compare our estimator $\tilde{\theta}_n$ with the histogram-based estimator $\hat{\theta}^{\text{Ng-M}}_n$ proposed in [24] and the estimator $\hat{\theta}^{\text{S}}_n$ introduced in [28]. Boxplots in Figure 3 represent the absolute errors of $\tilde{\theta}_n$, $\hat{\theta}^{\text{Ng-M}}_n$ and $\hat{\theta}^{\text{S}}_n$, labeled respectively "Sym-Ker", "Histogram" and "Bootstrap". The estimators $\tilde{\theta}_n$ and $\hat{\theta}^{\text{Ng-M}}_n$ have comparable performances, and both outperform $\hat{\theta}^{\text{S}}_n$.

Figure 3: Absolute errors for the estimation of $\theta$ in the three simulated settings ($\theta = 0.65$, $\theta = 0.45$ and $\theta = 0.35$).

5.3 Estimation of the target density f

We present in Tables 1, 2 and 3 the mean-squared error (MSE) for the estimation of $f$ in the three different models and for the different sample sizes introduced in Section 5.1. The MSEs are approximated over 100 Monte-Carlo replications. As estimation points for the pointwise risk, we choose $x_0 \in \{0.1, 0.4, 0.6, 0.9\}$; the points $x_0 = 0.1$ and $x_0 = 0.9$ permit us to evaluate the estimation of $f$ close to the boundaries of the domain of definition of $f$ and $g$. We compare our estimator $\hat{f}$ with the randomly weighted estimator proposed by Nguyen and Matias [23]. In the sequel, the label "AWKE" (Adaptive Weighted Kernel Estimator) refers to our estimator $\hat{f}$, whose bandwidth is selected by the Goldenshluger-Lepski method, and "Ng-M" refers to the estimator of [23]. The resulting boxplots are displayed in Figure 4 for the largest sample size.

Sample size   Estimator   $x_0=0.1$   $x_0=0.4$   $x_0=0.6$   $x_0=0.9$
$n = 500$     AWKE        0.1848      0.0121      0.0286      0.0057
              Ng-M        0.2869      0.0450      0.1046      0.0433

Table 1: MSE for the estimation of $f_1$, for our estimator $\hat{f}$ (AWKE) and for the estimator of Nguyen and Matias [23] (Ng-M).

Sample size   Estimator   $x_0=0.1$   $x_0=0.4$   $x_0=0.6$   $x_0=0.9$
$n = 500$     AWKE        0.0453      0.0136      0.0297      0.0024
              Ng-M        0.0560      0.0540      0.0306      0.0138

Table 2: MSE for the estimation of $f_2$, for our estimator $\hat{f}$ (AWKE) and for the estimator of Nguyen and Matias [23] (Ng-M).

Sample size   Estimator   $x_0=0.1$   $x_0=0.4$   $x_0=0.6$   $x_0=0.9$
$n = 500$     AWKE        0.0806      0.0096      0.0045      0.0016
              Ng-M        0.1308      0.0247      0.0207      0.0096

Table 3: MSE for the estimation of $f_3$, for our estimator $\hat{f}$ (AWKE) and for the estimator of Nguyen and Matias [23] (Ng-M).

Tables 1, 2, 3 and the boxplots show that our estimator outperforms the one of [23]. Notice that the errors are relatively large at the point $x_0 = 0.1$ for both estimators, which was expected (boundary effect).

Figure 4: Errors for the estimation of $f_1$, $f_2$ and $f_3$ at $x_0 \in \{0.1, 0.4, 0.6, 0.9\}$, for AWKE and Ng-M.

6 Proofs

6.1 Proof of Proposition 1

In the sequel, the notations $\tilde{\mathbb{P}}$, $\tilde{\mathbb{E}}$ and $\widetilde{\mathrm{Var}}$ respectively denote the probability, the expectation and the variance associated with $X_1,\ldots,X_n$, conditionally on the additional random sample $X_{n+1},\ldots,X_{2n}$. Let $\rho > 1$ and introduce the event
$$\Omega_\rho = \big\{\rho^{-1}\gamma \le \hat{\gamma} \le \rho\gamma\big\},$$
so that
$$\hat{f}_h(x_0) - f(x_0) = \big(\hat{f}_h(x_0)-f(x_0)\big)\mathbf{1}_{\Omega_\rho} + \big(\hat{f}_h(x_0)-f(x_0)\big)\mathbf{1}_{\Omega_\rho^c}. \tag{20}$$
We first evaluate the term $(\hat{f}_h(x_0)-f(x_0))^2\mathbf{1}_{\Omega_\rho}$. On $\Omega_\rho$, for any $x_0 \in [0,1]$, we have
$$\big(\hat{f}_h(x_0)-f(x_0)\big)^2 \le 3\Big(\big(\hat{f}_h(x_0)-K_h\star\check{f}(x_0)\big)^2 + \big(K_h\star\check{f}(x_0)-\check{f}(x_0)\big)^2 + \big(\check{f}(x_0)-f(x_0)\big)^2\Big), \tag{21}$$
where we define
$$\check{f}(x) := w\big(\tilde{\theta}_n,\hat{g}(x)\big)\,g(x) = \frac{1}{1-\tilde{\theta}_n}\Big(1-\frac{\tilde{\theta}_n}{\hat{g}(x)}\Big)\,g(x).$$
Note that, by definition of $\check{f}$, we have $K_h\star\check{f}(x_0) = \tilde{\mathbb{E}}[\hat{f}_h(x_0)]$. Hence
$$\big(\hat{f}_h(x_0)-K_h\star\check{f}(x_0)\big)^2 = \big(\hat{f}_h(x_0)-\tilde{\mathbb{E}}[\hat{f}_h(x_0)]\big)^2.$$
It follows that
$$\tilde{\mathbb{E}}\Big[\big(\hat{f}_h(x_0)-\tilde{\mathbb{E}}[\hat{f}_h(x_0)]\big)^2\Big] = \widetilde{\mathrm{Var}}\big(\hat{f}_h(x_0)\big) = \widetilde{\mathrm{Var}}\Big(\frac{1}{n}\sum_{i=1}^n w\big(\tilde{\theta}_n,\hat{g}(X_i)\big)K_h(x_0-X_i)\Big) = \frac{1}{n}\,\widetilde{\mathrm{Var}}\big(w(\tilde{\theta}_n,\hat{g}(X_1))K_h(x_0-X_1)\big) \le \frac{1}{n}\,\tilde{\mathbb{E}}\Big[\big(w(\tilde{\theta}_n,\hat{g}(X_1))K_h(x_0-X_1)\big)^2\Big].$$
On the other hand, for all $i\in\{1,\ldots,n\}$, thanks to (A4) and (A2),
$$\big|w\big(\tilde{\theta}_n,\hat{g}(X_i)\big)K_h(x_0-X_i)\big| = \frac{1}{1-\tilde{\theta}_n}\Big|1-\frac{\tilde{\theta}_n}{\hat{g}(X_i)}\Big|\,\big|K_h(x_0-X_i)\big| \le \frac{1}{\delta}\Big(1+\frac{\tilde{\theta}_n}{|\hat{g}(X_i)|}\Big)\big|K_h(x_0-X_i)\big| \le \frac{1}{\delta}\Big(1+\frac{1}{\hat{\gamma}}\Big)\big|K_h(x_0-X_i)\big| \le \frac{2}{\delta\hat{\gamma}}\,\big|K_h(x_0-X_i)\big|. \tag{22}$$
Indeed, as we use a compactly supported kernel to construct the estimator $\hat{f}_h$, the condition $\alpha_n \le h^{-1}$ in (A5) ensures that $(\hat{g}(X_i))^{-1}K_h(x_0-X_i)$ is upper bounded by $\hat{\gamma}^{-1}K_h(x_0-X_i)$ even if there is no observation in the neighbourhood of $x_0$.
Moreover, since $\hat{\gamma}\ge\rho^{-1}\gamma$ on $\Omega_\rho$, we have $|w(\tilde{\theta}_n,\hat{g}(X_i))|\le 2\rho\,\delta^{-1}\gamma^{-1}$. Thus we obtain
$$\tilde{\mathbb{E}}\Big[\big(\hat{f}_h(x_0)-\tilde{\mathbb{E}}[\hat{f}_h(x_0)]\big)^2\Big] \le \frac{4\rho^2}{\delta^2\gamma^2 n}\,\tilde{\mathbb{E}}\big[K_h^2(x_0-X_1)\big] \le \frac{4\rho^2\,\|K\|_2^2\,\|g\|_{\infty,V_n(x_0)}}{\delta^2\gamma^2\,nh}. \tag{23}$$
For the last two terms of (21), we apply the following proposition.

Proposition 2.
Assume (A1) and (A3). On the set $\Omega_\rho$, the following holds for any $x_0\in[0,1]$:
$$\big(\check{f}(x_0)-f(x_0)\big)^2 \le C_1\,\delta^{-2}\gamma^{-2}\,\|\hat{g}-g\|^2_{\infty,V_n(x_0)} + C_2\,\delta^{-4}\,\big|\tilde{\theta}_n-\theta\big|^2, \tag{24}$$
$$\big(K_h\star\check{f}(x_0)-\check{f}(x_0)\big)^2 \le 6\,\|K_h\star f-f\|^2_{\infty,V_n(x_0)} + C_3\,\delta^{-2}\gamma^{-2}\,\|\hat{g}-g\|^2_{\infty,V_n(x_0)} + C_4\,\delta^{-4}\,\big|\tilde{\theta}_n-\theta\big|^2, \tag{25}$$
where $C_1$ depends on $\rho$, $C_2$ on $\|g\|_{\infty,V_n(x_0)}$ and $\gamma$, $C_3$ on $\rho$ and $\|K\|_1$, and $C_4$ on $\|g\|_{\infty,V_n(x_0)}$, $\gamma$ and $\|K\|_1$.

Combining (21) with (23), (24) and (25), we obtain
$$\mathbb{E}\Big[\big(\hat{f}_h(x_0)-f(x_0)\big)^2\,\mathbf{1}_{\Omega_\rho}\Big] \le 18\,\|K_h\star f-f\|^2_{\infty,V_n(x_0)} + 3(C_1+C_3)\,\delta^{-2}\gamma^{-2}\,\mathbb{E}\big[\|\hat{g}-g\|^2_{\infty,V_n(x_0)}\big] + 3(C_2+C_4)\,\delta^{-4}\,\mathbb{E}\big[\big|\tilde{\theta}_n-\theta\big|^2\big] + \frac{12\rho^2\|K\|_2^2\|g\|_{\infty,V_n(x_0)}}{\delta^2\gamma^2\,nh}.$$
It remains to study the risk bound on $\Omega_\rho^c$. To do so, we successively apply the following lemmas, whose proofs are postponed to the end of the proof of Theorem 1.

Lemma 3.
Suppose that Assumption (A3) is satisfied. Then, for any $\rho>1$,
$$\mathbb{P}\big(\Omega_\rho^c\big) \le C_{g,\rho}\,\exp\big\{-(\log n)^{3/2}\big\},$$
with $C_{g,\rho}$ a positive constant depending on $g$ and $\rho$.

Lemma 4.
Assume (A2) and (A5). For any $h\in\mathcal{H}_n$, we have
$$\mathbb{E}\Big[\big(\hat{f}_h(x_0)-f(x_0)\big)^2\,\mathbf{1}_{\Omega_\rho^c}\Big] \le \frac{C^*}{\delta^2\,n},$$
with $C^*$ a positive constant depending on $\|f\|_{\infty,V_n(x_0)}$, $\|K\|_\infty$, $g$ and $\rho$. This concludes the proof of Proposition 1.

6.2 Proof of Lemma 2
First, we prove that $\hat{g}^{sym}_b$ has the same distribution as $\hat{r}_b$. Since $Y_i$ has the same distribution as $2-Y_i$, $\hat{g}^{sym}_b$ has the same distribution as $x\mapsto n^{-1}\sum_{i=1}^n L_b(x-Y_i)$. It is thus sufficient to show that $Y_i$ has the same distribution as $R_i$. To this aim, let $\varphi$ be a measurable bounded function defined on $\mathbb{R}$. We compute
$$\mathbb{E}[\varphi(Y_i)] = \mathbb{E}\big[\mathbb{E}[\varphi(X_i)\,|\,\varepsilon_i]\,\mathbf{1}_{\{\varepsilon_i=1\}}\big] + \mathbb{E}\big[\mathbb{E}[\varphi(-X_i)\,|\,\varepsilon_i]\,\mathbf{1}_{\{\varepsilon_i=-1\}}\big] = \tfrac12\big(\mathbb{E}[\varphi(X_i)]+\mathbb{E}[\varphi(-X_i)]\big) = \tfrac12\Big(\int\varphi(x)g(x)\,dx + \int\varphi(-x)g(x)\,dx\Big) = \tfrac12\Big(\int\varphi(x)g(x)\,dx + \int\varphi(x)g(-x)\,dx\Big) = \int\varphi(x)r(x)\,dx = \mathbb{E}[\varphi(R_i)].$$
This gives the first assertion of the lemma.

We now prove (18). Under the identifiability condition, we have $\theta = g(x)$ for all $x\in[1-\delta,1]$. Hence
$$\big|\hat{\theta}_{n,b}-\theta\big| = \Big|\frac1\delta\int_{1-\delta}^1\hat{g}^{sym}_b(x)\,dx - \frac1\delta\int_{1-\delta}^1 g(x)\,dx\Big| = \frac1\delta\Big|\int_{1-\delta}^1\big(\hat{g}^{sym}_b(x)-g(x)\big)\,dx\Big| \le \frac1\delta\int_{1-\delta}^1\big|\hat{g}^{sym}_b(x)-g(x)\big|\,dx \le \big\|\hat{g}^{sym}_b-g\big\|_{\infty,[1-\delta,1]}.$$
Moreover, thanks to the Markov inequality,
$$\mathbb{P}\big(\tilde{\theta}_{n,b}\ne\hat{\theta}_{n,b}\big) = \mathbb{P}\Big(\hat{\theta}_{n,b}\notin\Big[\frac\delta2,\,1-\frac\delta2\Big]\Big) \le \mathbb{P}\Big(\big|\hat{\theta}_{n,b}-\theta\big| > \frac\delta2\Big) \le \frac2\delta\,\mathbb{E}\big[\big|\hat{\theta}_{n,b}-\theta\big|\big],$$
which is (19). This concludes the proof of Lemma 2.

6.3 Proof of Proposition 2

Let us introduce the function
$$\tilde{f}(x) := w\big(\tilde{\theta}_n,g(x)\big)\,g(x) = \frac{1}{1-\tilde{\theta}_n}\Big(1-\frac{\tilde{\theta}_n}{g(x)}\Big)\,g(x). \tag{26}$$
Then, for $x_0\in[0,1]$,
$$\big(\check{f}(x_0)-f(x_0)\big)^2 \le 2\Big(\big(\check{f}(x_0)-\tilde{f}(x_0)\big)^2 + \big(\tilde{f}(x_0)-f(x_0)\big)^2\Big).$$
On $\Omega_\rho = \{\rho^{-1}\gamma\le\hat{\gamma}\le\rho\gamma\}$ we have, by using (A4),
$$\big(\check{f}(x_0)-\tilde{f}(x_0)\big)^2 = \Big(w\big(\tilde{\theta}_n,\hat{g}(x_0)\big)g(x_0) - w\big(\tilde{\theta}_n,g(x_0)\big)g(x_0)\Big)^2 = \Big(\frac{\tilde{\theta}_n}{1-\tilde{\theta}_n}\Big)^2\Big(\frac{1}{g(x_0)}-\frac{1}{\hat{g}(x_0)}\Big)^2|g(x_0)|^2 \le \frac{1}{\delta^2}\Big(\frac{\hat{g}(x_0)-g(x_0)}{\hat{g}(x_0)\,g(x_0)}\Big)^2|g(x_0)|^2 \le \rho^2\,\delta^{-2}\gamma^{-2}\,\|\hat{g}-g\|^2_{\infty,V_n(x_0)}. \tag{27}$$
Moreover, thanks to (A1),
$$\big(\tilde{f}(x_0)-f(x_0)\big)^2 = \Big(w\big(\tilde{\theta}_n,g(x_0)\big)g(x_0) - w\big(\theta,g(x_0)\big)g(x_0)\Big)^2 = \frac{|g(x_0)|^2}{(1-\theta)^2(1-\tilde{\theta}_n)^2}\Big(\tilde{\theta}_n-\theta + \frac{\theta-\tilde{\theta}_n}{g(x_0)}\Big)^2 \le \frac{\|g\|^2_{\infty,V_n(x_0)}}{\delta^4}\Big(1+\frac1\gamma\Big)^2\big|\tilde{\theta}_n-\theta\big|^2. \tag{28}$$
Gathering (27) and (28), we obtain (24), with $C_1 = 2\rho^2$ and $C_2 = 2\|g\|^2_{\infty,V_n(x_0)}(1+\gamma^{-1})^2$.

Next, the term $(K_h\star\check{f}(x_0)-\check{f}(x_0))^2$ can be treated through the decomposition
$$\big(K_h\star\check{f}(x_0)-\check{f}(x_0)\big)^2 \le 3\Big(\big(K_h\star\check{f}(x_0)-K_h\star\tilde{f}(x_0)\big)^2 + \big(K_h\star\tilde{f}(x_0)-K_h\star f(x_0)\big)^2 + \big(K_h\star f(x_0)-\check{f}(x_0)\big)^2\Big) =: 3(A_1+A_2+A_3).$$
For the term $A_1$, we have, by using (27),
$$A_1 = \big(K_h\star(\check{f}-\tilde{f})(x_0)\big)^2 = \Big(\int K_h(x_0-u)\big(\check{f}(u)-\tilde{f}(u)\big)\,du\Big)^2 \le \Big(\int\big|K_h(x_0-u)\big|\,\big|\check{f}(u)-\tilde{f}(u)\big|\,du\Big)^2 \le \rho^2\,\delta^{-2}\gamma^{-2}\,\|K\|_1^2\,\|\hat{g}-g\|^2_{\infty,V_n(x_0)}.$$
By using (28) and following the same lines as for $A_1$, we obtain
$$A_2 = \big(K_h\star(\tilde{f}-f)(x_0)\big)^2 \le \|K\|_1^2\,\frac{\|g\|^2_{\infty,V_n(x_0)}}{\delta^4}\Big(1+\frac1\gamma\Big)^2\big|\tilde{\theta}_n-\theta\big|^2.$$
For $A_3$, using the upper bound (24) obtained above for $(\check{f}(x_0)-f(x_0))^2$, we have
$$A_3 \le 2\big(K_h\star f(x_0)-f(x_0)\big)^2 + 2\big(f(x_0)-\check{f}(x_0)\big)^2 \le 2\,\|K_h\star f-f\|^2_{\infty,V_n(x_0)} + 2C_1\,\delta^{-2}\gamma^{-2}\,\|\hat{g}-g\|^2_{\infty,V_n(x_0)} + 2C_2\,\delta^{-4}\,\big|\tilde{\theta}_n-\theta\big|^2.$$
Finally, combining the bounds on $A_1$, $A_2$ and $A_3$, we obtain (25). This ends the proof of Proposition 2.

6.4 Proof of Theorem 1

Suppose that we are on $\Omega_\rho$, and let $\hat{f}$ be the adaptive estimator defined in (8). For any $x_0\in[0,1]$,
$$\big(\hat{f}(x_0)-f(x_0)\big)^2 \le 2\big(\hat{f}(x_0)-\check{f}(x_0)\big)^2 + 2\big(\check{f}(x_0)-f(x_0)\big)^2.$$
The second term is controlled by (24) of Proposition 2; hence it remains to handle the first term. For any $h\in\mathcal{H}_n$, we have
$$\big(\hat{f}(x_0)-\check{f}(x_0)\big)^2 \le 3\Big(\big(\hat{f}_{\hat{h}}(x_0)-\hat{f}_{\hat{h},h}(x_0)\big)^2 + \big(\hat{f}_{\hat{h},h}(x_0)-\hat{f}_h(x_0)\big)^2 + \big(\hat{f}_h(x_0)-\check{f}(x_0)\big)^2\Big) \le 3\Big(A(x_0,h) + V(x_0,\hat{h}) + A(x_0,\hat{h}) + V(x_0,h) + \big(\hat{f}_h(x_0)-\check{f}(x_0)\big)^2\Big),$$
and, by definition of $\hat{h}$, $A(x_0,\hat{h})+V(x_0,\hat{h}) \le A(x_0,h)+V(x_0,h)$, so that
$$\big(\hat{f}(x_0)-\check{f}(x_0)\big)^2 \le 6A(x_0,h) + 6V(x_0,h) + 6\big(\hat{f}_h(x_0)-K_h\star\check{f}(x_0)\big)^2 + 6\big(K_h\star\check{f}(x_0)-\check{f}(x_0)\big)^2. \tag{29}$$
Next,
$$A(x_0,h) = \max_{h'\in\mathcal{H}_n}\Big\{\big(\hat{f}_{h,h'}(x_0)-\hat{f}_{h'}(x_0)\big)^2 - V(x_0,h')\Big\}_+ \le \max_{h'\in\mathcal{H}_n}\Big\{3\big(\hat{f}_{h,h'}(x_0)-K_h\star(K_{h'}\star\check{f})(x_0)\big)^2 + 3\big(\hat{f}_{h'}(x_0)-K_{h'}\star\check{f}(x_0)\big)^2 + 3\big(K_h\star(K_{h'}\star\check{f})(x_0)-K_{h'}\star\check{f}(x_0)\big)^2 - V(x_0,h')\Big\}_+ \le 3\big(B(h)+D_1+D_2\big),$$
where
$$B(h) = \max_{h'\in\mathcal{H}_n}\big(K_h\star(K_{h'}\star\check{f})(x_0)-K_{h'}\star\check{f}(x_0)\big)^2,$$
$$D_1 = \max_{h'\in\mathcal{H}_n}\Big\{\big(\hat{f}_{h'}(x_0)-K_{h'}\star\check{f}(x_0)\big)^2 - \frac{V(x_0,h')}{9}\Big\}_+, \qquad D_2 = \max_{h'\in\mathcal{H}_n}\Big\{\big(\hat{f}_{h,h'}(x_0)-K_h\star(K_{h'}\star\check{f})(x_0)\big)^2 - \frac{V(x_0,h')}{9}\Big\}_+.$$
Since
$$B(h) = \max_{h'\in\mathcal{H}_n}\big(K_{h'}\star(K_h\star\check{f}-\check{f})(x_0)\big)^2 \le \|K\|_1^2\,\sup_{t\in V_n(x_0)}\big|K_h\star\check{f}(t)-\check{f}(t)\big|^2,$$
we deduce from (29) and the bound on $A(x_0,h)$ that
$$\big(\hat{f}(x_0)-\check{f}(x_0)\big)^2 \le 18(D_1+D_2) + 6V(x_0,h) + 6\big(\hat{f}_h(x_0)-K_h\star\check{f}(x_0)\big)^2 + \big(6+18\|K\|_1^2\big)\sup_{t\in V_n(x_0)}\big|K_h\star\check{f}(t)-\check{f}(t)\big|^2. \tag{30}$$
The last two terms of (30) are controlled by (23) and (25) of Proposition 2. Hence it remains to deal with the terms $D_1$ and $D_2$.

For $D_1$, we recall that $K_{h'}\star\check{f}(x_0) = \tilde{\mathbb{E}}[\hat{f}_{h'}(x_0)]$, and
$$\tilde{\mathbb{E}}[D_1] \le \sum_{h'\in\mathcal{H}_n}\tilde{\mathbb{E}}\Big[\Big\{\big(\hat{f}_{h'}(x_0)-\tilde{\mathbb{E}}[\hat{f}_{h'}(x_0)]\big)^2 - \frac{V(x_0,h')}{9}\Big\}_+\Big] = \sum_{h'\in\mathcal{H}_n}\int_0^{+\infty}\tilde{\mathbb{P}}\Big(\big|\hat{f}_{h'}(x_0)-\tilde{\mathbb{E}}[\hat{f}_{h'}(x_0)]\big| > \sqrt{\tfrac{V(x_0,h')}{9}+u}\Big)\,du. \tag{31}$$
Now let us introduce the random variables $Z_1,\ldots,Z_n$, i.i.d. conditionally on the second sample, with $Z_i = w(\tilde{\theta}_n,\hat{g}(X_i))\,K_{h'}(x_0-X_i)$, so that $\hat{f}_{h'}(x_0)-\tilde{\mathbb{E}}[\hat{f}_{h'}(x_0)] = n^{-1}\sum_{i=1}^n(Z_i-\tilde{\mathbb{E}}[Z_i])$. Moreover, by (22),
$$|Z_i| \le \frac{2\|K\|_\infty}{h'\,\delta\,\hat{\gamma}} =: b, \qquad \tilde{\mathbb{E}}[Z_1^2] \le \frac{4\|K\|_2^2\,\|g\|_{\infty,V_n(x_0)}}{h'\,\delta^2\,\hat{\gamma}^2} =: v.$$
Applying the Bernstein inequality (cf. Lemma 2 of Comte and Lacour [9]), we get, for any $u>0$,
$$\tilde{\mathbb{P}}\Big(\Big|\frac1n\sum_{i=1}^n\big(Z_i-\tilde{\mathbb{E}}[Z_i]\big)\Big| > \sqrt{\tfrac{V(x_0,h')}{9}+u}\Big) \le 2\max\Big\{\exp\Big(-\frac{n}{4v}\Big(\frac{V(x_0,h')}{9}+u\Big)\Big),\ \exp\Big(-\frac{n}{4b}\sqrt{\frac{V(x_0,h')}{9}+u}\Big)\Big\} \le 2\max\Big\{\exp\Big(-\frac{nV(x_0,h')}{36v}\Big)\exp\Big(-\frac{nu}{4v}\Big),\ \exp\Big(-\frac{n}{8b}\sqrt{\frac{V(x_0,h')}{9}}\Big)\exp\Big(-\frac{n\sqrt{u}}{8b}\Big)\Big\}.$$
On the one hand, by definition of $V(x_0,h')$,
$$\frac{n}{4v}\cdot\frac{V(x_0,h')}{9} = \frac{nh'\,\delta^2\hat{\gamma}^2}{16\,\|K\|_2^2\,\|g\|_{\infty,V_n(x_0)}}\cdot\frac{\kappa\,\|K\|_2^2\,\|g\|_{\infty,V_n(x_0)}\,\log n}{9\,\delta^2\hat{\gamma}^2\,nh'} = \frac{\kappa\log n}{144} \ge 2\log n$$
as soon as $\kappa \ge 288$. On the other hand, since $h'\ge1/\sqrt{n}$,
$$\frac{n}{8b}\sqrt{\frac{V(x_0,h')}{9}} = \frac{\sqrt{\kappa}\,\|K\|_2\,\sqrt{\|g\|_{\infty,V_n(x_0)}}}{48\,\|K\|_\infty}\,\sqrt{nh'\log n} \ge 2\log n$$
for $n$ large enough. We then obtain
$$\tilde{\mathbb{E}}[D_1] \le 2n^{-2}\sum_{h'\in\mathcal{H}_n}\int_0^{+\infty}\max\Big\{\exp\Big(-\frac{nu}{4v}\Big),\ \exp\Big(-\frac{n\sqrt{u}}{8b}\Big)\Big\}\,du \le 2n^{-2}\sum_{h'\in\mathcal{H}_n}\Big(\frac{4v}{n} + \frac{128\,b^2}{n^2}\Big).$$
Since $\hat{\gamma}\ge\rho^{-1}\gamma$, $h'\ge1/\sqrt{n}$ and $\mathrm{card}(\mathcal{H}_n)\le\sqrt{n}$, we finally obtain
$$\tilde{\mathbb{E}}[D_1] \le C_5\,\delta^{-2}\gamma^{-2}\,n^{-2}, \tag{32}$$
where $C_5$ is a positive constant depending on $\|g\|_{\infty,V_n(x_0)}$, $\|K\|_2$, $\|K\|_\infty$ and $\rho$.

Similarly, we introduce $U_i = w(\tilde{\theta}_n,\hat{g}(X_i))\,K_h\star K_{h'}(x_0-X_i)$ for $i=1,\ldots,n$. Then
$$\hat{f}_{h,h'}(x_0)-K_h\star(K_{h'}\star\check{f})(x_0) = \hat{f}_{h,h'}(x_0)-\tilde{\mathbb{E}}[\hat{f}_{h,h'}(x_0)] = \frac1n\sum_{i=1}^n\big(U_i-\tilde{\mathbb{E}}[U_i]\big),$$
with
$$|U_i| \le \frac{2\|K\|_1\|K\|_\infty}{h'\,\delta\,\hat{\gamma}} =: \bar{b}, \qquad \tilde{\mathbb{E}}[U_1^2] \le \frac{4\|K\|_1^2\,\|K\|_2^2\,\|g\|_{\infty,V_n(x_0)}}{h'\,\delta^2\,\hat{\gamma}^2} =: \bar{v}.$$
Following the same lines as for (32), the Bernstein inequality yields
$$\tilde{\mathbb{E}}[D_2] \le C_6\,\delta^{-2}\gamma^{-2}\,n^{-2}, \tag{33}$$
with $C_6$ a positive constant depending on $\|g\|_{\infty,V_n(x_0)}$, $\|K\|_1$, $\|K\|_2$, $\|K\|_\infty$ and $\rho$. Finally, combining (30), (32) and (33), and successively applying Lemma 3 and Lemma 4, leads to the result stated in Theorem 1.

6.5 Proof of Corollary 1

Assume that Assumptions (A6) and (A7) are fulfilled. According to Proposition 1.2 of Tsybakov [30], we have, for all $x_0\in[0,1]$,
$$\big|K_h\star f(x_0)-f(x_0)\big| \le C\,L\,h^\beta,$$
where $C$ is a constant depending on $K$ and $\beta$.
Taking
$$h^* = L^{-2/(2\beta+1)}\Big(\frac{\delta^2\gamma^2 n}{\log n}\Big)^{-1/(2\beta+1)},$$
we get
$$\frac{\log n}{\delta^2\gamma^2\,n h^*} = L^{2/(2\beta+1)}\Big(\frac{\delta^2\gamma^2 n}{\log n}\Big)^{-2\beta/(2\beta+1)},$$
and $L^2(h^*)^{2\beta}$ is of the same order. Hence we obtain
$$\min_{h\in\mathcal{H}_n}\Big\{\|K_h\star f-f\|^2_{\infty,V_n(x_0)} + \frac{\log n}{\delta^2\gamma^2\,nh}\Big\} \le (C^2+1)\,L^{2/(2\beta+1)}\Big(\frac{\delta^2\gamma^2 n}{\log n}\Big)^{-2\beta/(2\beta+1)}. \tag{34}$$
Finally, since we also assume (10), gathering (9) and (34), we obtain
$$\mathbb{E}\Big[\big(\hat{f}(x_0)-f(x_0)\big)^2\Big] \le C'\Big(\frac{\log n}{n}\Big)^{\frac{2\beta}{2\beta+1}},$$
where $C'$ is a constant depending on $K$, $\|f\|_{\infty,V_n(x_0)}$, $g$, $\delta$, $\gamma$, $\rho$, $L$ and $\beta$. This ends the proof of Corollary 1.

6.6 Proof of Lemma 3

Lemma 3 is a consequence of (5). Indeed, assume that condition (A3) is satisfied; then, for all $t\in V_n(x_0)$, $|\hat{g}(t)-g(t)| \le \nu\,|\hat{g}(t)|$ with probability $1 - C_{g,\nu}\exp(-(\log n)^{3/2})$. This implies
$$(1+\nu)^{-1}\,|g(t)| \le |\hat{g}(t)| \le (1-\nu)^{-1}\,|g(t)|.$$
Since $\gamma = \inf_{t\in V_n(x_0)}|g(t)|$ and $\hat{\gamma} = \inf_{t\in V_n(x_0)}|\hat{g}(t)|$, by using (5) and taking $\nu = 1-\rho^{-1}$, so that $(1-\nu)^{-1} = \rho$ and $(1+\nu)^{-1} \ge \rho^{-1}$, we obtain, with probability $1 - C_{g,\nu}\exp(-(\log n)^{3/2})$, that $\rho^{-1}\gamma \le \hat{\gamma} \le \rho\gamma$, which completes the proof of Lemma 3.

6.7 Proof of Lemma 4

We have, for any $x_0\in[0,1]$,
$$\mathbb{E}\Big[\big(\hat{f}_h(x_0)-f(x_0)\big)^2\,\mathbf{1}_{\Omega_\rho^c}\Big] \le 2\,\mathbb{E}\big[|\hat{f}_h(x_0)|^2\,\mathbf{1}_{\Omega_\rho^c}\big] + 2\,\|f\|^2_{\infty,V_n(x_0)}\,\mathbb{P}\big(\Omega_\rho^c\big).$$
Using Assumption (A5) and (22), we obtain
$$\mathbb{E}\big[|\hat{f}_h(x_0)|^2\,\mathbf{1}_{\Omega_\rho^c}\big] = \mathbb{E}\bigg[\Big|\frac{1}{nh}\sum_{i=1}^n w\big(\tilde{\theta}_n,\hat{g}(X_i)\big)K\Big(\frac{x_0-X_i}{h}\Big)\Big|^2\,\mathbf{1}_{\Omega_\rho^c}\bigg] \le \frac{4}{\delta^2}\,\mathbb{E}\bigg[\Big|\frac{1}{\hat{\gamma}\,nh}\sum_{i=1}^n K\Big(\frac{x_0-X_i}{h}\Big)\Big|^2\,\mathbf{1}_{\Omega_\rho^c}\bigg] \le \frac{4\,\|K\|_\infty^2}{\delta^2}\,n^2(\log n)^2\,\mathbb{P}\big(\Omega_\rho^c\big).$$
Hence, by Lemma 3,
$$\mathbb{E}\Big[\big(\hat{f}_h(x_0)-f(x_0)\big)^2\,\mathbf{1}_{\Omega_\rho^c}\Big] \le C_{g,\rho}\Big(\frac{8\,\|K\|_\infty^2}{\delta^2}\,n^2(\log n)^2 + 2\,\|f\|^2_{\infty,V_n(x_0)}\Big)\exp\big\{-(\log n)^{3/2}\big\} \le \frac{C^*}{\delta^2\,n},$$
since $n^2(\log n)^2\exp\{-(\log n)^{3/2}\} = o(n^{-1})$, with $C^*$ depending on $\|K\|_\infty$, $\|f\|_{\infty,V_n(x_0)}$, $g$ and $\rho$. This ends the proof of Lemma 4.

Acknowledgment
We are very grateful to Catherine Matias for interesting discussions on mixture models. The research of the authors is partly supported by the French Agence Nationale de la Recherche (ANR-18-CE40-0014, project SMILES) and by the French Région Normandie (project RIN AStERiCs 17B01101GR).
References

[1] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
[2] Karine Bertin, Claire Lacour, and Vincent Rivoirard. Adaptive pointwise estimation of conditional density function. Preprint arXiv:1312.7402, 2013.
[3] Karine Bertin, Claire Lacour, and Vincent Rivoirard. Adaptive pointwise estimation of conditional density function. Ann. Inst. H. Poincaré Probab. Statist., 52(2):939–980, 2016.
[4] C. Butucea. Two adaptive rates of convergence in pointwise density estimation. Math. Methods Statist., 9(1):39–64, 2000.
[5] Alain Celisse and Stéphane Robin. A cross-validation based estimation of the proportion of true null hypotheses. Journal of Statistical Planning and Inference, 140(11):3132–3147, 2010.
[6] Gaëlle Chagny. Penalization versus Goldenshluger-Lepski strategies in warped bases regression. ESAIM: Probability and Statistics, 17:328–358, 2013.
[7] Michaël Chichignoud, Van Ha Hoang, Thanh Mai Pham Ngoc, and Vincent Rivoirard. Adaptive wavelet multivariate regression with errors in variables. Electronic Journal of Statistics, 11(1):682–724, 2017.
[8] F. Comte, S. Gaïffas, and A. Guilloux. Adaptive estimation of the conditional intensity of marker-dependent counting processes. Ann. Inst. Henri Poincaré Probab. Stat., 47(4):1171–1196, 2011.
[9] F. Comte and C. Lacour. Anisotropic adaptive kernel deconvolution. Ann. Inst. Henri Poincaré Probab. Stat., 49(2):569–609, 2013.
[10] Fabienne Comte. Estimation non-paramétrique. Spartacus-IDH, 2015.
[11] Fabienne Comte, Valentine Genon-Catalot, and Adeline Samson. Nonparametric estimation for stochastic differential equations with random effects. Stochastic Processes and their Applications, 123(7):2522–2551, 2013.
[12] Fabienne Comte and Tabea Rebafka. Nonparametric weighted estimators for biased data. Journal of Statistical Planning and Inference, 174:104–128, 2016.
[13] Marie Doumic, Marc Hoffmann, Patricia Reynaud-Bouret, and Vincent Rivoirard. Nonparametric estimation of the division rate of a size-structured population. SIAM Journal on Numerical Analysis, 50(2):925–950, 2012.
[14] Bradley Efron, Robert Tibshirani, John D. Storey, and Virginia Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1160, 2001.
[15] Christopher Genovese and Larry Wasserman. Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):499–517, 2002.
[16] Evarist Giné and Richard Nickl. An exponential inequality for the distribution function of the kernel density estimator, with applications to adaptive estimation. Probab. Theory Related Fields, 143(3-4):569–596, 2009.
[17] Alexander Goldenshluger and Oleg Lepski. Bandwidth selection in kernel density estimation: oracle inequalities and adaptive minimax optimality. The Annals of Statistics, 39(3):1608–1632, 2011.
[18] Peter J. Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, pages 1753–1758, 1965.
[19] I. A. Ibragimov and R. Z. Has'minskii. An estimate of the density of a distribution. Zap. Nauchn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI), 98:61–85, 161–162, 166, 1980. Studies in mathematical statistics, IV.
[20] Mette Langaas, Bo Henry Lindqvist, and Egil Ferkingstad. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(4):555–572, 2005.
[21] Oleg Lepski. Multivariate density estimation under sup-norm loss: oracle approach, adaptation and independence structure. The Annals of Statistics, 41(2):1005–1034, 2013.
[22] Haoyang Liu and Chao Gao. Density estimation with contaminated data: minimax rates and theory of adaptation. Preprint arXiv:1712.07801, 2017.
[23] Van Hanh Nguyen and Catherine Matias. Nonparametric estimation of the density of the alternative hypothesis in a multiple testing setup. Application to local false discovery rate estimation. ESAIM: Probability and Statistics, 18:584–612, 2014.
[24] Van Hanh Nguyen and Catherine Matias. On efficient estimators of the proportion of true null hypotheses in a multiple testing setup. Scandinavian Journal of Statistics, 41(4):1167–1194, 2014.
[25] Patricia Reynaud-Bouret, Vincent Rivoirard, Franck Grammont, and Christine Tuleau-Malot. Goodness-of-fit tests and nonparametric adaptive estimation for spike train analysis. The Journal of Mathematical Neuroscience, 4(1):1, 2014.
[26] Stéphane Robin, Avner Bar-Hen, Jean-Jacques Daudin, and Laurent Pierre. A semi-parametric approach for mixture models: application to local false discovery rate estimation. Computational Statistics & Data Analysis, 51(12):5483–5493, 2007.
[27] Eugene F. Schuster. Incorporating support constraints into nonparametric estimators of densities. Communications in Statistics - Theory and Methods, 14(5):1123–1136, 1985.
[28] John D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002.
[29] Korbinian Strimmer. A unified approach to false discovery rate estimation. BMC Bioinformatics, 9(1):303, 2008.
[30] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.