Exponential confidence interval based on the recursive Wolverton-Wagner density estimation
M. R. Formica, E. Ostrovsky, and L. Sirota.
Università degli Studi di Napoli Parthenope, via Generale Parisi 13, Palazzo Pacanowsky, 80132, Napoli, Italy.
Department of Mathematics and Statistics, Bar-Ilan University, 59200, Ramat Gan, Israel.
Department of Mathematics and Statistics, Bar-Ilan University, 59200, Ramat Gan, Israel.
Abstract.
We derive exponentially decreasing, non-improvable Grand Lebesgue Space norm estimates for the tail of the distribution of the exactly normed deviation of the famous recursive Wolverton-Wagner multivariate statistical density estimate. We consider the pointwise as well as the Lebesgue-Riesz norm error of the statistical density estimate.
Key words and phrases.
Probability, random variable and vector (r.v.), density of distribution, Hölder's and other functional classes of functions, Tchernov's inequality, Young-Fenchel transform, weight, regression problem, mixed and ordinary Lebesgue-Riesz and Grand Lebesgue Space norms and spaces, kernel, bandwidth, condition of orthogonality, bias and variation, convergence, uniform norm, convergence almost everywhere, consistency, recursive Wolverton-Wagner multivariate statistical density estimation, optimization.

1 Statement of problem. Notations and definitions. Previous results.
Let $(\Omega, M, P)$ be a probability space with expectation $E$ and variance $\mathrm{Var}$. Let also $\{\xi_k\}$, $k = 1, 2, \ldots, n$, be a sequence of independent, identically distributed (i.i.d.) random vectors (r.v.) taking values in the ordinary Euclidean space $R^d$, $d = 1, 2, \ldots$, and having a certain unknown density of distribution $f = f(x)$, $x \in R^d$. C. Wolverton and T. J. Wagner in [31] offered the following famous statistical estimate $f_n^{WW}(x) = f_n(x)$ for $f(\cdot)$.

Let $\{h_k\}$, $k = 1, 2, \ldots$, be some positive sequence of real numbers such that $\lim_{k \to \infty} h_k = 0$. Let also $K = K(x)$, $x \in R^d$, be a certain kernel, i.e. a measurable even function for which

$$\int_{R^d} K(x)\, dx = 1. \tag{1}$$

Then by definition

$$f_n^{WW}(x) = f_n(x) \stackrel{def}{=} \frac{1}{n} \sum_{k=1}^{n} \frac{1}{h_k^d}\, K\left( \frac{x - \xi_k}{h_k} \right). \tag{2}$$

Recall that the classical kernel, or Parzen-Rosenblatt, estimate $f_n^{PR}(x)$ has the form

$$f_n^{PR}(x) := \frac{1}{n h_n^d} \sum_{k=1}^{n} K\left( \frac{x - \xi_k}{h_n} \right), \tag{3}$$

see [24], [25].

Note that the Wolverton-Wagner estimate obeys a very important recursion property:

$$f_n^{WW}(x) = \frac{n-1}{n}\, f_{n-1}^{WW}(x) + \frac{1}{n h_n^d}\, K\left( \frac{x - \xi_n}{h_n} \right).$$

"The recurrent definition of probability density estimates" $f_n^{WW}(x)$ "has two obvious advantages: 1) there is no need to memorize data, i.e. if the estimate $f_{n-1}^{WW}(x)$ is known, then $f_n^{WW}(x)$ can be calculated by means of the last observation $\xi_n$ only, without using the sampling $\xi_1, \xi_2, \ldots, \xi_{n-1}$; 2) the asymptotic dispersion of the estimate $f_n^{WW}(x)$ does not exceed the dispersion of the estimate" $f_n^{PR}(x)$, see [22].

Our aim in this report is to deduce the exact exponentially decreasing estimate for the tail of the deviation probability

$$P_n^{WW}(u) \stackrel{def}{=} \sup_{x \in R^d} P\left( B_n\, |f_n^{WW}(x) - f(x)| > u \right), \quad u \ge 0, \tag{4}$$

i.e. under the exact optimal deterministic numerical sequence $B_n$ such that $\lim_{n \to \infty} B_n = \infty$.

For the Parzen-Rosenblatt estimate $f_n^{PR}(x)$ such estimates were obtained e.g. in [23], chapter 5, sections 5.2 - 5.6.

We will use some facts from the theory of the so-called Grand Lebesgue Spaces (GLS), which include in particular the Banach spaces of random variables having exponentially decreasing tails of distributions, see e.g. [2], [7], [8], [17], [18], [20], [23] etc. Note that the normed deviation $B_n (f_n^{WW} - f)$ in different Lebesgue-Riesz spaces $L_p(R^d \otimes \Omega)$, measured by

$$\Delta_{n,p} := E \int_{R^d} B_n^p\, |f_n^{WW}(x) - f(x)|^p\, dx,$$

was investigated in many works, see e.g. [10], [22], [24], [25], [27], [28], [29], [30], [31] etc. The optimal choice of $\{h_k\}$ and of the kernel $K(x)$ is the subject of the works [9], [16], [19], [21]. The case when the r.v.-s are dependent is investigated in [19], [26]. Let us reproduce the notations and conditions from this theory that we use.
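The recursion property stated above can be illustrated numerically. The following is only a minimal sketch under assumptions that are not part of the text: a Gaussian kernel, dimension $d = 1$, standard normal data, and the illustrative bandwidths $h_k = k^{-1/5}$. The names `gaussian_kernel`, `ww_batch`, `RecursiveWW` and `demo` are hypothetical helpers introduced here. The sketch compares the recursive update, which stores only the current value and the counter $n$, with the direct evaluation of definition (2).

```python
import math
import random


def gaussian_kernel(z):
    """Product Gaussian kernel on R^d (an assumed choice): even, integrates to one."""
    d = len(z)
    return math.exp(-0.5 * sum(t * t for t in z)) / (2.0 * math.pi) ** (d / 2.0)


def ww_batch(x, sample, bandwidths, kernel=gaussian_kernel):
    """Direct evaluation of definition (2) at the point x, using the whole sample."""
    d = len(x)
    total = 0.0
    for xi, h in zip(sample, bandwidths):
        z = [(a - b) / h for a, b in zip(x, xi)]
        total += kernel(z) / h ** d
    return total / len(sample)


class RecursiveWW:
    """Stores only the current value f_n(x) and the number n of observations seen."""

    def __init__(self, x, kernel=gaussian_kernel):
        self.x = list(x)
        self.kernel = kernel
        self.n = 0
        self.value = 0.0

    def update(self, xi, h):
        """One step: f_n = ((n-1)/n) * f_{n-1} + K((x - xi)/h) / (n * h^d)."""
        self.n += 1
        d = len(self.x)
        z = [(a - b) / h for a, b in zip(self.x, xi)]
        self.value = ((self.n - 1) / self.n) * self.value \
            + self.kernel(z) / (self.n * h ** d)
        return self.value


def demo(n=2000, seed=0):
    """Return (recursive estimate, batch estimate) of the density at x = 0."""
    rng = random.Random(seed)
    sample = [[rng.gauss(0.0, 1.0)] for _ in range(n)]   # assumed data: N(0, 1), d = 1
    bandwidths = [k ** (-0.2) for k in range(1, n + 1)]  # assumed h_k = k^(-1/5)
    x = [0.0]
    est = RecursiveWW(x)
    for xi, h in zip(sample, bandwidths):
        est.update(xi, h)  # each observation is used once and then discarded
    return est.value, ww_batch(x, sample, bandwidths)
```

Calling `demo()` returns the two values, which should agree up to floating-point rounding; both approximate the true density value $f(0) = 1/\sqrt{2\pi}$ of the assumed standard normal data.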
Let $(\beta, L)$ be certain positive numbers. Denote by $l = l(\beta) = [\beta]$ the integer part of $\beta$, i.e. the maximal integer not exceeding $\beta$:

$$l(\beta) = [\beta] = \max\{ j = 0, 1, 2, \ldots : j \le \beta \}.$$

For instance, $l(0.3) = 0$, $l(\pi) = 3$. Correspondingly, the fractional part $\{\beta\}$ of $\beta$ is equal to $\{\beta\} := \beta - [\beta]$.

Introduce as usual the functional class $\Sigma(\beta, L)$ as follows:

$$\Sigma(\beta, L) = \left\{ f : R^d \to R;\ \forall m = \vec{m} : |m| \le [\beta] \Rightarrow \frac{\partial^m f}{\partial x^m} \in H(\{\beta\}, L) \right\}, \tag{5}$$

where $H(\alpha, L)$ denotes the Hölder class of functions

$$H(\alpha, L) = \{ g : R^d \to R,\ |g(x) - g(y)| \le L \cdot |x - y|^{\alpha} \}, \quad \alpha \in (0, 1],$$

and

$$|z| = \sqrt{(z, z)}, \quad m = \vec{m} = \{ m_1, m_2, \ldots, m_d \}, \quad |m| = \sum_{i=1}^{d} m_i.$$

In the case when $\beta$ is an integer, the derivative in (5) is assumed to be continuous and bounded:

$$\forall m = \vec{m} : |m| = \beta = [\beta] \Rightarrow \sup_{x \in R^d} \left| \frac{\partial^{\vec{m}} f}{\partial x^{\vec{m}}} \right| \le L.$$

We suppose henceforth that the density function belongs to some set $\Sigma(\beta, L)$ for some non-trivial value $\beta \in (0, \infty)$:

$$f(\cdot) \in \Sigma(\beta) \stackrel{def}{=} \cup_{L \in (0, \infty)} \Sigma(\beta, L), \quad \beta > 0. \tag{6}$$

As for the kernel $K$, we impose on it the following conditions:

$$K(-x) = K(x); \quad \int_{R^d} K(x)\, dx = 1; \quad \int_{R^d} K^2(x)\, dx < \infty; \tag{7}$$

$$K(\cdot) \in C(R^d), \quad \int_{R^d} |K(x)|\, dx < \infty. \tag{8}$$

The following conditions may be named conditions of orthogonality:

$$\forall \vec{m} : 1 \le |\vec{m}| \le [\beta] \Rightarrow \int_{R^d} \vec{x}^{\,\vec{m}} K(x)\, dx = 0. \tag{9}$$

The conditions (9) are used only for the investigation of the bias $\delta_n(x)$ of these statistics:

$$\delta_n = \delta_n(x) \stackrel{def}{=} E f_n^{WW}(x) - f(x). \tag{10}$$

In detail, as long as $f \in \Sigma(\beta)$ and by virtue of (9),

$$\delta(k) \stackrel{def}{=} h_k^{-d} \int_{R^d} K\left( \frac{x - y}{h_k} \right) f(y)\, dy - f(x) \sim h_k^{\beta}, \quad h_k \to 0,$$

therefore

$$|\delta_n| \sim C(\beta, L)\, n^{-1} \left[ \sum_{k=1}^{n} h_k^{\beta} \right], \tag{11}$$

see [22], [10], [19].

Let us now investigate the variance of the considered Wolverton-Wagner statistic $f_n^{WW}(x)$, of course under the restrictions formulated above. We have

$$\mathrm{Var}\left\{ f_n^{WW}(x) \right\} = \frac{1}{n^2} \sum_{k=1}^{n} h_k^{-2d}\, \mathrm{Var}\left\{ K\left( \frac{x - \xi_k}{h_k} \right) \right\} \asymp \frac{1}{n^2} \sum_{k=1}^{n} \frac{1}{h_k^d}.$$

Let us form the classical target functional

$$Z = Z_n(h_1, h_2, \ldots, h_n) \stackrel{def}{=} E \left[ f_n^{WW}(x) - f(x) \right]^2;$$

then

$$Z_n(h_1, h_2, \ldots, h_n) \asymp \frac{1}{n^2} \sum_{k=1}^{n} \frac{1}{h_k^d} + \left[ \frac{1}{n} \sum_{k=1}^{n} h_k^{\beta} \right]^2. \tag{12}$$

The (asymptotic) minimal value of the functional $Z_n(h_1, h_2, \ldots, h_n)$ relative to the variables $\{h_k\}$, subject to our limitations, is attained at the values

$$h_k \sim C(\beta, d, L)\, k^{-1/(2\beta + d)}, \tag{13}$$

and then

$$\min Z_n \asymp n^{-2\beta/(2\beta + d)}. \tag{14}$$

So, the speed of convergence $f_n^{WW}(x) \to f(x)$ as $n \to \infty$ is equal to $n^{-\beta/(2\beta + d)}$:

$$\left[ E \left( f_n^{WW}(x) - f(x) \right)^2 \right]^{1/2} \asymp n^{-\beta/(2\beta + d)}, \tag{15}$$

as for the Parzen-Rosenblatt estimates. In other words, the value $B_n$ in (4) must be chosen as follows:

$$B_n = n^{\beta/(2\beta + d)}. \tag{16}$$

Note that the one-dimensional case $d = 1$ was considered in [9].

We suppose henceforth that the values $\{h_k\}$, $B_n$ are chosen optimally in accordance with (13) and (16).

Define the following tail probability:

$$Q_n^{WW}(u) \stackrel{def}{=} \sup_{x \in R^d} P\left( B_n\, |f_n^{WW}(x) - E f_n^{WW}(x)| > u \right), \quad u \ge 0. \tag{17}$$

Note that the following value is bounded:

$$\sup_x \sup_n B_n\, |f(x) - E f_n^{WW}(x)| = C_1 = C_1(d, \beta, L) < \infty.$$

Therefore

$$P_n^{WW}(u) \le Q_n^{WW}(u - C_1).$$

Evidently, the r.v. $f_n^{WW}(x) - E f_n^{WW}(x)$ is centered (mean zero).

Theorem 2.1.
Under the conditions formulated above,

$$\sup_n Q_n^{WW}(u) \le \exp\left[ - C(d, \beta, L)\, u^{\frac{2\beta + d}{\beta + d}} \right], \quad u \ge 1. \tag{18}$$

Proof.
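In order to apply the theory of Grand Lebesgue Spaces (GLS), we first need an estimate of the exponential moment

$$E_n[Q, \lambda, \beta] \stackrel{def}{=} E \exp\left[ \lambda B_n \left( f_n^{WW}(x) - E f_n^{WW}(x) \right) \right], \quad \lambda \in R. \tag{19}$$

Denote for this purpose

$$\Theta_n = B_n \left( f_n^{WW}(x) - E f_n^{WW}(x) \right) = n^{-\frac{\beta + d}{2\beta + d}} \sum_{k=1}^{n} h_k^{-d}\, K^o\left( \frac{x - \xi_k}{h_k} \right) = \sum_{k=1}^{n} \theta_{k,n},$$

$$\theta_{k,n} := n^{-\frac{\beta + d}{2\beta + d}}\, h_k^{-d}\, K^o\left( \frac{x - \xi_k}{h_k} \right),$$

where as usual, for an arbitrary r.v. $\eta$, $\eta^o \stackrel{def}{=} \eta - E\eta$. By independence we have

$$E_n[Q, \lambda, \beta] = E e^{\lambda \Theta_n} = \prod_{k=1}^{n} E e^{\lambda \theta_{k,n}}.$$

Let us consider two possibilities.

A. $0 < \lambda \cdot 2 \sup_x |K(x)| \cdot n^{-\frac{\beta + d}{2\beta + d}}\, h_n^{-d} \le 1$, or equally, since $h_n^{-d} \asymp n^{d/(2\beta + d)}$,

$$\lambda \in \left( 0,\ C_2\, n^{\frac{\beta}{2\beta + d}} \right].$$

In this case $|\lambda \theta_{k,n}| \le 1$ for all $k \le n$. We use the following elementary inequality:

$$|y| \le 1 \Rightarrow e^y \le 1 + y + y^2.$$

Since the variables $\theta_{k,n}$ are centered, it follows that

$$E_{k,n}(\lambda) := E \exp(\lambda \theta_{k,n}) \le 1 + C \lambda^2\, \mathrm{Var}(\theta_{k,n}) \le \exp\left( C \lambda^2\, \mathrm{Var}(\theta_{k,n}) \right);$$

$$E_n[Q, \lambda, \beta] \le \exp\left( C \lambda^2 \sum_{k=1}^{n} \mathrm{Var}(\theta_{k,n}) \right) \le \exp(C \lambda^2).$$

B. Let us investigate the opposite possibility

$$\lambda \ge C_2\, n^{\frac{\beta}{2\beta + d}}.$$

But then

$$n \le C_3\, \lambda^{\frac{2\beta + d}{\beta}},$$

and we deduce

$$\lambda \Theta_n = \lambda\, n^{-\frac{\beta + d}{2\beta + d}} \sum_{k=1}^{n} h_k^{-d}\, K^o\left( \frac{x - \xi_k}{h_k} \right),$$

and consequently

$$\lambda |\Theta_n| \le C \lambda\, n^{\frac{\beta + d}{2\beta + d}} \le C \lambda^{\frac{2\beta + d}{\beta}}, \qquad E \exp(\lambda \Theta_n) \le \exp\left( C \lambda^{\frac{2\beta + d}{\beta}} \right).$$

The case $\lambda < 0$ is treated analogously. Denote $m = m(n) = n^{\beta/(2\beta + d)}$.

Let us introduce the following function:

$$\phi(\lambda) = \phi_m(\lambda) = \phi_{\beta,d,n}(\lambda) = \lambda^2\, I(|\lambda| \le m) + |\lambda|^{(2\beta + d)/\beta}\, I(|\lambda| > m),$$

where as usual $I(A)$ denotes the indicator function of the set $A$. We have actually obtained

$$\ln E \exp(\lambda \Theta_n) \le \phi_m(C_4 \lambda), \quad \lambda \in R. \tag{20}$$

It follows from the theory of Grand Lebesgue Spaces (GLS), see e.g. [7], [8], [15], [20], that

$$P(|\Theta_n| > C_4 u) \le \exp\left( - \phi_m^*(u) \right), \quad u \ge 0,$$

which is a modified Tchernov's inequality. Here as usual $\phi^*(\cdot)$ denotes the classical Young-Fenchel transform

$$\phi^*(u) \stackrel{def}{=} \sup_{\lambda \in R} \left( \lambda u - \phi(\lambda) \right).$$

We deduce after simple calculations, with a possibly different constant $C$,

$$P(|\Theta_n| > C u) \le \exp\left( - u^2 \right), \quad u \in \left( 0,\ n^{\beta/(2\beta + d)} \right); \tag{21}$$

$$P(|\Theta_n| > C u) \le \exp\left( - u^{(2\beta + d)/(\beta + d)} \right), \quad u \ge n^{\beta/(2\beta + d)}. \tag{22}$$

The announced result (18) follows immediately from (21) and (22); it is easy to verify that the obtained estimate for $P(|\Theta_n| > C u)$ reaches its maximum relative to the variable $n$ only at the value $n = 1$.

Remark 2.1.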
Note that an inequality of the form (18) of Theorem 2.1 is also true for the classical Parzen-Rosenblatt estimate, see [23], chapter 5, sections 1 - 2.

Remark 2.2.
It is known, see [23], chapter 5, section 3, that the result (18) is essentially non-improvable. Indeed, under our conditions the following lower estimate holds for an arbitrary density statistic $\hat{f}$:

$$\sup_n \sup_x P\left( B_n\, |\hat{f}(x) - f(x)| > u \right) \ge \exp\left[ - C(d, \beta, L)\, u^{\frac{2\beta + d}{\beta + d}} \right], \quad u \ge 1. \tag{23}$$

Let $\mu$ be an arbitrary Borel probability measure, $\mu(R^d) = 1$, on the whole space $X := R^d$. For instance,

$$\mu(A) = \frac{\nu(A \cap D)}{\nu(D)},$$

where $\nu$ is the ordinary Lebesgue measure and $D$ is a fixed measurable non-trivial set: $0 < \nu(D) < \infty$; or $\mu = \delta_{x_0}$, the unit Dirac delta measure concentrated at the point $x_0 \in X$.

Introduce as usual the classical Lebesgue-Riesz space $L_p = L_p(R^d, \mu)$ as the set of all (measurable) functions $g : R^d \to R$ having a finite norm

$$\|g\|_p := \left[ \int_{R^d} |g(x)|^p\, \mu(dx) \right]^{1/p}, \quad p \in [1, \infty).$$

We intend in this section to evaluate the error of the Wolverton-Wagner statistic in the $L_p$ norm:

$$R_{n,p}(u) \stackrel{def}{=} P\left( B_n\, \|f_n^{WW} - f\|_p > u \right), \quad u \ge 0. \tag{24}$$

Note that the case $p = 2$ (Hilbert space) was considered in many works, e.g. [9], [10], [22] etc. The $L_1$ approach was investigated in the monograph [11].

Theorem 3.1.
Again under the conditions formulated above,

$$\sup_n R_{n,p}(u) \le \exp\left[ - C(d, \beta, L, p)\, (u - C_5)^{\frac{2\beta + d}{\beta + d}} \right], \quad u \ge C_5. \tag{25}$$

Proof.
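For this purpose we need to apply the theory of the so-called mixed Lebesgue-Riesz spaces, see e.g. [3], [4]. Indeed, let us introduce the following two mixed Lebesgue-Riesz spaces, consisting of all bi-measurable numerical valued random processes (fields) $\eta = \eta(x, \omega)$, $x \in R^d$, $\omega \in \Omega$, having a finite norm

$$\|\eta\|_{p,X,r,\Omega} \stackrel{def}{=} \left\| \, \|\eta\|_{p,X} \right\|_{r,\Omega} = \tag{26}$$

$$\left\{ E \left[ \int_X |\eta(x, \omega)|^p\, \mu(dx) \right]^{r/p} \right\}^{1/r}, \quad X = R^d, \quad p, r \ge 1, \tag{27}$$

and correspondingly

$$\|\eta\|_{r,\Omega,p,X} \stackrel{def}{=} \left\| \, \|\eta\|_{r,\Omega} \right\|_{p,X} = \tag{28}$$

$$\left\{ \int_X \left[ E |\eta(x)|^r \right]^{p/r} \mu(dx) \right\}^{1/p}, \quad X = R^d, \quad p, r \ge 1. \tag{29}$$

Evidently, in the general case $\|\eta\|_{p,X,r,\Omega} \ne \|\eta\|_{r,\Omega,p,X}$, but always

$$\|\eta\|_{p,X,pr,\Omega} \le \|\eta\|_{pr,\Omega,p,X}. \tag{30}$$

It follows from the theory of Grand Lebesgue Spaces, [20], [23], chapter 1, section 1.5, that if the r.v. $\zeta$ satisfies the inequality

$$P(|\zeta| > u) \le \exp\left( - C u^{(2\beta + d)/(\beta + d)} \right),$$

then

$$\sup_{r \ge 1} \left[ r^{-(\beta + d)/(2\beta + d)}\, \|\zeta\|_{r,\Omega} \right] < \infty,$$

and the inverse proposition is also true. Here the exponent of $r$ is the reciprocal of the exponent of $u$ in the tail estimate. Therefore it follows from the estimate (22) that, uniformly relative to the parameter $n$,

$$\|\Theta_n\|_{r,\Omega} \le C(\beta, L, d)\, r^{(\beta + d)/(2\beta + d)}, \quad r \ge 1.$$

We obtain from the relation (30), denoting $\kappa = \|\Theta_n\|_{p,X}$:

$$\|\kappa\|_{pr,\Omega} \le C\, (pr)^{(\beta + d)/(2\beta + d)},$$

or equally, taking into account the boundedness of the measure $\mu$ and applying Lyapunov's inequality,

$$\|\kappa\|_{s} \le C(\beta, d, L; p)\, s^{(\beta + d)/(2\beta + d)}, \quad s \ge 1.$$

By the inverse proposition mentioned above, this moment estimate yields the exponential tail estimate for $\kappa$ with the exponent $(2\beta + d)/(\beta + d)$; the shift by the constant in (25) accounts for the bounded normed bias.

Corollary 3.1.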
It follows from Theorem 3.1 that

$$P\left( \|f_n^{WW} - f\|_p > C_6 (v + C_7) \right) \le \Delta_n(v), \tag{31}$$

where

$$\Delta_n(v) := \exp\left( - n^{\beta/(\beta + d)} \times v^{(2\beta + d)/(\beta + d)} \right), \quad v = \mathrm{const} \ge 0. \tag{32}$$

Since

$$\forall v > 0 \Rightarrow \sum_{n=1}^{\infty} \Delta_n(v) < \infty,$$

we conclude that for all values $p \in [1, \infty)$ the Wolverton-Wagner statistic converges in the $L_p$ norm with probability one:

$$P\left( \|f_n^{WW} - f\|_p \to 0 \right) = 1. \tag{33}$$

Corollary 3.2.
Moreover, define for each value $p \in [1, \infty)$ the variables

$$\tau_n \stackrel{def}{=} \|f_n^{WW} - f\|_p, \qquad \tau \stackrel{def}{=} \sup_n \|f_n^{WW} - f\|_p = \sup_n \tau_n.$$

As long as

$$P(\tau > v) \le \sum_{n=1}^{\infty} P(\tau_n > v), \quad v \ge 0,$$

we get after some calculations

$$P\left( \tau > C (v + C) \right) \le \sum_{n=1}^{\infty} \exp\left( - v^{(2\beta + d)/(\beta + d)} \cdot n^{\beta/(\beta + d)} \right) \le C\, v^{-(2\beta + d)/\beta}, \quad v \ge 1.$$

A. It is of interest, in our opinion, to deduce the optimal density estimate as well as a confidence region in the uniform norm $L_\infty$:

$$\|g\|_\infty := \sup_{x \in R^d} |g(x)|.$$

Perhaps one can set for this purpose

$$h_k := (\ln k)^{\gamma}\, k^{-1/(2\beta + d)}.$$

B. The method offered here may be generalized to the so-called regression problem, i.e. when

$$\eta_i = f(x_i) + \epsilon_i, \quad i = 1, 2, \ldots, n.$$

C. For practical use it may be of interest to investigate the weighted approximation of the density, e.g. the quantity

$$\sup_n \sup_x \left[ w(x)\, B_n\, |f_n^{WW}(x) - f(x)| \right],$$

where $w = w(x)$ is a certain weight, i.e. a non-negative numerical valued measurable function.

Acknowledgement.
The first author has been partially supported by the Gruppo Nazionale per l'Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM) and by Università degli Studi di Napoli Parthenope through the project "sostegno alla Ricerca individuale" (triennio 2015 - 2017).

The second author is grateful to Yousri Slaoui for sending him the very interesting article [19].
References

[1] G. Anatriello and A. Fiorenza. Fully measurable grand Lebesgue spaces. J. Math. Anal. Appl. (2015), no. 2, 783–797.
[2] G. Anatriello and M. R. Formica. Weighted fully measurable grand Lebesgue spaces and the maximal theorem. Ric. Mat. (2016), no. 1, 221–233.
[3] Benedek A. and Panzone R. The space Lp with mixed norm. Duke Math. J. (1961), 301 - 324.
[4] Besov O.V., Ilin V.P., Nikolskii S.M. Integral representation of functions and imbedding theorems. Vol. 1; Scripta Series in Math., V.H. Winston and Sons, (1979), New York, Toronto, Ontario, London.
[5] Bertin K. (2004). Asymptotically exact minimax estimation in sup-norm for anisotropic Hölder classes. Bernoulli, 873 - 888. MR2093615
[6] Bertin K. (2004). Estimation asymptotiquement exacte en norme sup de fonctions multidimensionnelles. Ph.D. thesis, Université Paris 6.
[7] V. V. Buldygin, D. I. Mushtary, E. I. Ostrovsky and M. I. Pushalsky. New Trends in Probability Theory and Statistics. Mokslas (1992), V.1, 78–92; Amsterdam, Utrecht, New York, Tokyo.
[8] C. Capone, M. R. Formica and R. Giova. Grand Lebesgue spaces with respect to measurable functions. Nonlinear Anal. (2013), 125–131.
[9] Fabienne Comte and Nicolas Marie. Bandwidth selections for the Wolverton-Wagner estimator. arXiv:1902.00734v2 [math.ST] 12 Oct 2019
[10] Devroye L. (1979). On the pointwise and integral convergence of recursive kernel estimates of probability densities. Util. Math., 113 - 128.
[11] Devroye L., Györfi L. Nonparametric density estimation; the L1 view. New York: John Wiley, 1985.
[12] A. Fiorenza. Duality and reflexivity in grand Lebesgue spaces. Collectanea Mathematica, (electronic version), 2, (2000), 131-148.
[13] A. Fiorenza and G. E. Karadzhov. Grand and small Lebesgue spaces and their analogs.
[14] A. Fiorenza, M. R. Formica and A. Gogatishvili. On grand and small Lebesgue and Sobolev spaces and some applications to PDE's. Differ. Equ. Appl. (2018), no. 1, 21–46.
[15] Maria Rosaria Formica, Yuriy Vasilovich Kozachenko, Eugeny Ostrovsky, Leonid Sirota. Exponential tail estimates in the law of ordinary logarithm (LOL) for triangular arrays of random variables. Lithuanian Mathematical Journal, 2020, DOI: 10.1007/s10986-020-09481-
[16] A. Goldenshluger and O. Lepski. Bandwidth selection in kernel density estimation: oracle inequalities and adaptive minimax optimality. The Annals of Statistics.
[17] T. Iwaniec and C. Sbordone. On the integrability of the Jacobian under minimal hypotheses. Arch. Rat. Mech. Anal., (1992), 129-143.
[18] T. Iwaniec, P. Koskela and J. Onninen. Mapping of finite distortion: Monotonicity and Continuity. Invent. Math., 144, (2001), 507-531.
[19] Salah Khardani, Yousri Slaoui. Recursive Kernel Density Estimation and Optimal Bandwidth Selection Under α Mixing Data. Journal of Statistical Theory and Practice, (2019), https://doi.org/10.1007/s42519-018-0031-61
[20] Kozachenko Yu. V., Ostrovsky E.I. (1985). The Banach Spaces of random variables of subgaussian Type. Theory of Probab. and Math. Stat. (in Russian). Kiev, KSU, 32, 43-57.
[21] M. Lerasle, N. Magalhaes and P. Reynaud-Bouret. Optimal Kernel Selection for Density Estimation. High-Dimensional Probability VII: The Cargese Volume, Prog. Probab. Birkhäuser, 425 - 460, 2016.
[22] Elizbar Nadaraya, Petre Babilua. On the Wolverton-Wagner Estimate of a Distribution Density. Bulletin of the Georgian National Academy of Sciences, 175,
[23] Ostrovsky E.I. (1999). Exponential Estimations for Random Fields and its Applications, (in Russian). Moscow-Obninsk, OINPE.
[24] Parzen E. (1962). On estimation of a probability density and mode. Ann. Math. Stat.
[25] Rosenblatt M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Stat., 832 - 837.
[26] Andrea De Simone, Alessandro Morandini. Nonparametric Density Estimation from Markov Chains. arXiv:2009.03937v1 [stat.ME] 8 Sep 2020
[27] Slaoui Y. (2013). Large and moderate principles for recursive kernel density estimators defined by stochastic approximation method. Serdica Math. J. 39:53–82
[28] Tsybakov A.B. (1990). Recurrent estimation of the mode of a multidimensional distribution. Probl. of Inf. Transm., 119 - 126.
[29] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[30] E. J. Wegman and H. I. Davies. Remarks on Some Recursive Estimators of a Probability Density. The Annals of Statistics, 316 - 327, 1979.
[31] Wolverton C., Wagner T.J. (1969). Asymptotically optimal discriminant functions for pattern classification. IEEE Trans. Inform. Theory, 15,