On Construction of Higher Order Kernels Using Fourier Transforms and Covariance Functions
aa r X i v : . [ m a t h . S T ] J a n O N C ONSTRUCTION OF H IGHER O R DER K ER NELS U SING F OUR IER T R ANSFORMS AND C OVARIANCE F UNC TIONS S OUMYA D AS CEMSE Division (Statistics)King Abdullah University of Science and [email protected] S UBHAJIT D UTTA ∗ Department of Mathematics and StatisticsIndian Institute of Technology [email protected] R ADHENDUHSKA S RIVASTAVA † Department of MathematicsIndian Institute of Technology [email protected]
January 22, 2020
Abstract
In this paper, we show that a suitably chosen covariance function of a continuous time, secondorder stationary stochastic process can be viewed as a symmetric higher order kernel. This leads to theconstruction of a higher order kernel by choosing an appropriate covariance function. An optimal choiceof the constructed higher order kernel that partially minimizes the mean integrated square error of thekernel density estimator is also discussed.
Keywords :
Corrected density estimator, Inverse Fourier transform, Shannon’s formula, Sinc kerneland function. ∗ Research has been partially supported by ECR/2017/000374, Science & Engineering Research Board (SERB), Department ofScience and Technology, Government of India. † Research supported by INSPIRE fellowship, Department of Science and Technology, Government of India and a seed grantfrom IIT Bombay. Introduction
Given univariate independent and identically distributed (i.i.d.) random variables X , . . . , X n with a com-mon probability density function (pdf) f ( x ) at a fixed point x ∈ R , a commonly used nonparametric esti-mator of f ( x ) is the kernel density estimator given by (see, e.g., Silverman (1986)) f n ( x, h ) = 1 n n X i =1 K h ( x − X i ) , (1)where K h ( u ) = h − K ( u/h ) , K is an even real-valued function referred as the kernel function, and h > is the smoothing parameter. It is well known that under fairly general conditions on f , the kernel densityestimator f n is a consistent estimator of f when h → together with nh → ∞ as n → ∞ (see, e.g.,Silverman (1986)). Moreover, the rate of convergence of the bias of f n to zero can be sharpened if theunderlying density function f is smooth. In particular, if f is ( p + 1) times continuously differentiable withbounded f ( p +1) , then by using Taylor’s series expansion, we obtain E [ f n ( x, h )] = Z ∞−∞ K ( u ) f ( x − hu ) du = f ( x ) + p X l =1 ( − h ) l f ( l ) ( x ) l ! Z ∞−∞ u l K ( u ) du + o ( h p ) , (2)where f ( l ) denotes the l th derivative of f for l = 1 , . . . , p . Further, the fastest possible rate of the bias of f n could be O ( h p ) , if the chosen kernel function K satisfies Z ∞−∞ u j K ( u ) du = 0 for j = 1 , . . . , ( p − .We now discuss a closely related notion. Throughout the paper, L p ( R ) denotes the class of functions g with Z ∞−∞ | g ( u ) | p du < ∞ for p ≥ . Definition 1.
A real-valued function K ∈ L ( R ) is said to be a kernel of order p , if it satisfies thefollowing conditions.(a) Z ∞−∞ K ( u ) du = 1 ,(b) Z ∞−∞ u j K ( u ) du = 0 for j = 1 , , . . . , ( p − and Z ∞−∞ u p K ( u ) du = 0 .Several researchers have explored the construction of higher order kernels to improve the rate of con-vergence of the density estimator (see Parzen (1962); Rosenblatt (1971)). Wand and Schucany (1990) con-sidered an approach composed of polynomials and a second order kernel function to construct higher order2ernels. The twicing technique of Abdous (1995) when applied to a kernel K of order p leads to the kernelfunction K − K ∗ K of order p , where g ∗ g ( t ) = Z ∞−∞ g ( u ) g ( t − u ) du is the convolution of thefunctions g , g ∈ L ( R ) . Generalizations for higher order kernels based on twicing approaches are usuallyquite complicated, and may not yield explicit expressions. Construction of such higher order kernels involvean iterative twicing procedure as well as generalized jackknifing. Glad et al. (2003) provides a good reviewof the existing procedures of constructing higher order kernels (also see Scott (2015)). Hall and Marron(1988) and Ushakov and Ushakova (2012) studied higher order kernel based on Fourier transforms.The asymptotic order of the bias of f n is limited by the order of the kernel when the underlying pdf f is infinitely differentiable. Davis (1975) studied the behavior of f n by using the sinc kernel defined as K ( u ) = ( πu ) − sin( u ) which is a kernel of ‘infinite’ order. Devroye (1992) constructed super-kernelswhich are ‘infinite’ order kernels (see also Politis and Romano (1999)). If the underlying density function f is p times continuously differentiable with bounded p th derivative, then the rate of the bias of a super-kernel density estimator f n is o ( h p ) . Glad et al. (2002) concluded that the sinc kernel density estimator ispreferable compared to super-kernel density estimators in several situations. Recently, Chac´on et al. (2014)have used the sinc kernel for smooth distribution function estimation.In this article, we aim to construct an explicit family of higher order kernel for a given even order. InSection 2, we provide a general methodology to construct higher order kernel using Fourier transforms. Thismethod requires us to compute the Fourier transformation of an appropriately chosen function, which maynot be easy to obtain explicitly. We also view this construction as a covariance function of a continuoustime, second order stationary stochastic process. Using this view point, we construct a family of kernelsmotivated from Shannon’s formula (Shannon, 1949) based on the sinc function (defined in Section 2.1). InSection 3, we obtain an expression for the MISE of the kernel density estimator for the chosen kernel andprovide a guideline to choose the kernel from the constructed family of kernels. In Section 4, we comparethe mean integrated square error (MISE) of the kernel density estimator obtained from our chosen higherorder kernel with some regular kernels and the sinc kernel using some numerical examples. All theoreticalproofs are given in the Appendix, and additional results are stated in a Supplementary file.3 Construction of a Higher Order Kernel Using Fourier Transforms
The following theorem gives a general method to construct symmetric higher order kernels.
Theorem 1.
Let G be even, q ( ≥ times continuously differentiable and compactly supported functionsuch that G (0) = 1 and G ( j ) (0) = 0 for j = 1 , , . . . , (2 q − and G (2 q ) (0) = 0 . Let K be the Fouriertransform of G defined as K ( u ) = Z ∞−∞ G ( t ) e − πiut dt, where i = √− . Then, K is a symmetric kernel of order q . Hall and Marron (1988) also considered kernels constructed using Fourier transform of functions, but theyconsidered a specific choice of G . The proof of Theorem 1 can be established easily. For the sake ofcompleteness, we have provided the proof in the Supplementary material.If the function G is positive, then the constructed higher order kernel K can be viewed as covariancefunction of a continuous time second order stationary stochastic process with power spectral density G .For example, the covariance functions corresponding to the following q degree polynomial power spectraldensities, are the higher order kernels of order q . Consider the following examples:1. G ( t ) = (1 − t ) q I [ − / , / ( t ) ,2. G ( t ) = (1 − (2 t ) q ) I [ − / , / ( t ) ,where I A is the indicator function associated with the set A . The higher order kernels (equivalently, thecovariance functions) corresponding to the spectral density G is listed in Table 1 for different orders.Table 1: Higher order kernels (covariance function) based on the spectral density G p K (0) K ( u ) ∗ {
16 sin( u/ − u cos( u/ } /u {− u cos( u/
2) + 768 sin( u/ − u sin( u/ } /u { u/ − u cos( u/
2) + 768 u cos( u/ − u sin( u/ } /u { u/ − u cos( u/
2) + 245760 u cos( u/ − u sin( u/
2) + 12288 u sin( u/ } /u * The function K is obtained by using the function fourier in MATLAB. .1 Higher Order Kernel Using Truncated Covariance Function Given a general G satisfying conditions of Theorem 1, the expression of K may not be easily accessible(e.g., the spectral density G ). In this section, we construct a higher order kernel which has a closed formexpression. Suppose that the support of spectral density G is [ − / , / , then using Shannon’s formula(Shannon, 1949), the covariance function K can be expressed as K ( u ) = ∞ X j = −∞ K ( j ) sinc ( π ( u − j )) , (3)where sinc ( u ) = sin u/u if u = 0 , and if u = 0 . The representation (3) shows that the function K ( u ) canbe reconstructed from the sequence { K ( j ) , j = . . . , − , , , . . . } . Further, if P ∞ j = −∞ | K ( j ) | < ∞ , then G ( t ) = Z ∞−∞ K ( u ) e πitu du = ∞ X j = −∞ K ( j ) e − πijt [ − , ] ( t ) . (4)The equality in equation (3) holds in L ( R ) sense, i.e., Z ∞−∞ ( K N ( u ) − K ( u )) du → as N → ∞ , where K N ( u ) = N X j = − N K ( j ) sinc ( π ( u − j )) (see Shannon (1949)). Additionally, if ∞ X j =1 | K ( j ) | j < ∞ , then theequation in (3) holds pointwise as well (see Lemma 1 in the Supplementary material).In view of equations (3) and (4), the condition G (0) = 1 is equivalent to P ∞ j = −∞ K ( j ) = 1 . Similarly,conditions G (2 r ) (0) = 0 corresponds to P ∞ j = −∞ j r K ( j ) = 0 for r = 1 , , . . . , q − and G (2 q ) (0) = 0 corresponds to P ∞ j = −∞ j q K ( j ) = 0 . If we truncate the series in equation (3) by choosing K ( j ) = 0 for all | j | > q , then the conditions on G stated in Theorem 1 in terms of sequence { K ( j ) , j = . . . , − , , , . . . } reduces to the following q X j = − q K ( j ) = 1 , q X j = − q j r K ( j ) = 0 for r = 1 , . . . , ( q − , (5)and q X j = − q j q K ( j ) = 0 . (6)A solution to the system of linear equation (5) that satisfies (6) leads to a higher order kernel of order q byusing equation (3). Let K (0) = α and by using the symmetry of K , i.e., K ( − u ) = K ( u ) , the system of5inear equation (5) reduces to the following: q X j =1 K ( j ) = (1 − α ) / , q X j =1 j K ( j ) = 0 , . . . , q X j =1 j q − K ( j ) = 0 . (7) Theorem 2. If α = 1 , the system of linear equations stated in (7) has the unique solution K (0) = α and K ( j ) = 1 − α (cid:18) q ! j (cid:19) q Q l =1 l = j ( l − j ) , ∀ j = 1 , . . . , q, Further, q X j = − q j q K ( j ) = (1 − α )( q !) ( − q +1 = 0 . In view of Theorem 2, the kernel function K obtained by using equation (3), i.e., K ( u ) = q X j = − q K ( j ) sinc ( π ( u − j )) , (8)where { K ( j ) , j = 0 , ± , . . . , ± q } is as stated in Theorem 2, is a higher order symmetric kernel of order q . The expression for the kernel density estimator is f n ( x, h ) = n P ni =1 K h ( x − X i ) , where K h ( u ) = h − K ( u/h ) and h > . The mean integrated square error (MISE) of f n ( · , h ) is defined as M ISE ( f n ( · , h )) = E f Z ∞−∞ (cid:8) f n ( x, h ) − f ( x ) (cid:9) dx. Since the kernel function K is the Fourier transform of the function G (see Theorem 1), we obtain a simpli-fied expression of the MISE as follows: M ISE ( f n ( · , h )) = 1 nh Z ∞−∞ G ( t ) dt + Z ∞−∞ [ G ( ht ) − | φ f ( t ) | dt − n Z ∞−∞ G ( ht ) | φ f ( t ) | dt, (9)where φ f ( t ) = Z ∞−∞ f ( u ) e πitu du and G ( t ) = Z ∞−∞ K ( u ) e πitu du (see Ushakov and Ushakova (2012)).6 heorem 3. Suppose that the density function f is p times differentiable, where p is an even number. Then,the MISE of the density estimator corresponding to the kernel function K of order p , as constructed in (8) ,is given by lim n →∞ n p p +1 M ISE ( f n ( · , h )) = Z ∞−∞ G ( t ) dt + ( G ( p ) (0)) ( p !) Z ∞−∞ t p | φ f ( t ) | dt. If we minimize the first term of the MISE expression stated above over α , then the choice of α is C p C p ,where C p = 12 p X l =1 (cid:18) ( p !) l p Q j =1 ,j = l ( j − l ) (cid:19) . Although the kernel function K obtained by choosing the function G as per Theorem 1 is a higher orderkernel of order p , the explicit form of this kernel as in Theorem 2 holds in L ( R ) sense and pointwise butnot in L ( R ) sense as the sinc function is not integrable. Thus, the kernel function constructed in Theorem2 is not an integrable function but square integrable, and the resulting density estimator is not a valid densityestimator. Further, the density estimator corresponding to the higher order kernel constructed in Theorem 2is neither non-negative nor integrable.To rectify this problem, we use the proposal of Glad et al. (2003) and define ˜ f n ( x, h ) = max { , f n ( x, h ) − ξ } , where the constant ξ is chosen such that Z ∞−∞ ˜ f n ( x, h ) dx = 1 . This correction ensures that ˜ f n ( x, h ) isnon-negative, integrates to one and it also follows that the MISE of this new version (say, M ISE ( ˜ f n ( · , h )) )is lower than M ISE ( f n ( · , h )) (see Theorem 1 in Glad et al. (2003)). In this section, we compare the MISEs of the conventional (with the Gaussian kernel) density estimator,the usual sinc density estimator, and the two density estimators proposed in Section 2 for finite values ofthe sample size (say, n ). We have considered three values for the sample size n = 50 (small), n = 250 (medium) and n = 500 (large) over simulated samples. The samples are drawn from the N (0 , . , the l p -symmetric (i.e., f ( x ) = ( p/ p )) exp( −| x | p ) ) with p = 3 distribution, and the Fej´er-de la Vall´ee Poussin (FVP) density(i.e., f ( x ) = (2 /π )( x − sin( x/ ). For numerical experiments, we have used the corrected versions ofthe sinc and the Fourier based density estimators. The bandwidth selection approaches for the competingmethods were different. We have used the function bw.nrd in the R package stats for the Gaussiankernel, and (log( n + 1)) − / for the sinc kernel (Glad et al., 2002). For the proposed methods, we havetaken n − / (2 p +1) when the kernel is of order p . The grid over which an estimator evaluated is equi-spaced points in the interval [ − , . Results of the average MISEs are reported in Table 2. The minimumMISE is marked in bold , while the second best is marked in italics .It is clear from Table 2 that the density estimator with the Gaussian kernel performs quite well for thefirst two examples, while the estimator based on G (a higher order kernel of order ) stated in Table 1yields the best performance in the next two examples. In the third and fourth examples, the l -symmetricdensity is not differentiable at the point , while the FVP density is quite wriggly, respectively. The overallperformance of the kernel based on the truncated sinc function (say, tsinc) of order (see equation (8) ofSection 2.1) is quite competitive.Table 2: Estimated MISE (with standard error in brackets) for varying sample sizesDistribution n Gaussian sinc G tsinc N (0 , .
1) 250 G (2 ,
2) 250 l (0 ,
1) 250 Appendix
Proof of Theorem 2.
The system of linear equation given in (7) is expressed in the matrix form as follows. . . . . . . q . . . q ... ... ... ... ... q − q − q − . . . q q − K (1) K (2) K (3) ... K ( q ) = (1 − α ) / ... . (10)The ( i, j ) th element of a q × q Vandermonde matrix is defined as b i − j , where b j , j = 1 , , . . . , q are non-zeroreal numbers and the determinant of this matrix is Q ≤ i 1) = ( − q +1 . This completes the proof. (cid:3) Proof of Theorem 3. In view of (9), the MISE of the density estimator f n ( · , h ) is given by M ISE ( f n ( · , h )) = 1 nh Z ∞−∞ G ( t ) dt + Z ∞−∞ [ G ( ht ) − | φ f ( t ) | dt − n Z ∞−∞ G ( ht ) | φ f ( t ) | dt = A + A + A ( say ) , φ f ( t ) = Z ∞−∞ f ( u ) e πitu du . Clearly, A = O (( nh ) − ) .By using Taylor series and the conditions on G stated in Theorem 1, we have G ( ht ) = 1 + 1 p ! G ( p ) ( ξ ) h p t p , (12)where ξ ∈ (0 , th ) . Now, the second term A reduces to A = h p ( p !) Z ∞−∞ ( G ( p ) ( ξ )) t p | φ f ( t ) | dt. By using the inverse Fourier transformation of φ f and p times differentiability of f , we have d p dx p f ( x ) = d p dx p Z ∞−∞ e − πitx φ f ( t ) dt = Z ∞−∞ ( − πi ) p t p e − πitx φ f ( t ) dt, where the interchange of the differentiation with the integral follows by Lebesgue’s DCT. Thus, t p φ f ( t ) is integrable. Since | φ f ( t ) | ≤ and by using DCT, the function t p | φ f ( t ) | is also integrable. Again byapplying Lebesgue’s DCT, we get lim h → A h p = ( G ( p ) (0)) ( p !) Z ∞−∞ t p | φ f ( t ) | dt. We now turn to the third term A . A = 1 n Z ∞−∞ | φ f ( t ) | dt + 1 n h p ( p !) Z ∞−∞ ( G ( p ) ( ξ )) t p | φ f ( t ) | dt + 2 h p np ! Z ∞−∞ G ( p ) ( ξ ) t p | φ f ( t ) | dt = A + A + A ( say ) , Again, by using Lebesgue’s DCT, we obtain lim h → nh p A = ( G ( p ) (0)) ( p !) Z ∞−∞ t p | φ f ( t ) | dt and lim h → nh p A = G ( p ) (0) p ! Z ∞−∞ t p | φ f ( t ) | dt. Thus, we have A = O ( n − ) . A is faster than that of the terms A and A . Now, by combining therates of A , A and A , we obtain M ISE ( f n ( · , h )) = O (( nh ) − ) + O ( h p ) . We equate both the rates to get the optimal rate of convergence of MISE and this leads to the choice of h = n − p +1 . Subsequently, MISE is given by lim n →∞ n p p +1 M ISE ( f n ( · , h )) = Z ∞−∞ G ( t ) dt + ( G ( p ) (0)) ( p !) Z ∞−∞ t p | φ f ( t ) | dt. The MISE of the density estimator depends on the kernel function through Z ∞−∞ G ( t ) dt and ( G ( p ) (0)) .For the kernel constructed in (8), by using Lemma 2, we have ( G ( p ) (0)) = (( p/ (1 − α ) / . Ideally,one should choose a kernel which minimizes n p/ (2 p +1) M ISE ( f n ( · , h )) expressed as Z ∞−∞ G ( t ) dt + (( p/ (1 − α ) p !) Z ∞−∞ t p | φ f ( t ) | dt. Clearly, the second term depends on the unknown density function f . Thus, we choose the function G that minimizes the first term Z ∞−∞ G ( t ) dt . By using Plancherel’s theorem, we have Z ∞−∞ G ( t ) dt = ∞ X j = −∞ K ( j ) . For the kernel function K constructed in (8), we now have Z ∞−∞ G ( t ) dt = K (0) + 2 p X l =1 K ( l ) = α + 2 (1 − α ) p X l =1 (cid:18) ( p !) l p Q j =1 ,j = l ( j − l ) (cid:19) = α + (1 − α ) C p , where C p = 12 p X l =1 (cid:18) ( p !) l p Q j =1 ,j = l ( j − l ) (cid:19) . If we minimize this expression over α , the optimal solution turns out to be C p C p , which also satisfies Z ∞−∞ G ( t ) dt = C p C p . This completes the proof. (cid:3) eferences B. Abdous. Computationally efficient classes of higher-order kernel functions. Canad. J. Statist. , 23(1):21–27, 1995.J. E. Chac´on, P. Monfort, and C. Tenreiro. Fourier methods for smooth distribution function estimation. Statistics &Probability Letters , 84:223–230, 2014.K. B. Davis. Mean square error properties of density estimates. Ann. Statist. , 3(4):1025–1030, 1975.L. Devroye. A note on the usefulness of superkernels in density estimation. Ann. Statist. , 20(4):2037–2056, 1992.I. K. Glad, N. L. Hjort, and N. G. Ushakov. Density estimation using the sinc kernel. Manuscript , pages 1–20, 2002.URL .I. K. Glad, N. L. Hjort, and N. G. Ushakov. Correction of density estimators that are not densities. ScandinavianJournal of Statistics , 30(2):415–427, 2003.P. Hall and J. S. Marron. Choice of kernel order in density estimation. Ann. Statist. , 16(1):161–173, 1988.E. Parzen. On estimation of a probability density function and mode. Ann. Math. Statist. , 33:1065–1076, 1962.D. N. Politis and J. P. Romano. Multivariate density estimation with general flat-top kernels of infinite order. J.Multivariate Anal. , 68(1):1–25, 1999.A. R. Rao and P. Bhimasankaram. Linear Algebra , volume 19 of Texts and Readings in Mathematics . Hindustan BookAgency, New Delhi, second edition, 2000.M. Rosenblatt. Curve estimates. Ann. Math. Statist. , 42:1815–1842, 1971.D. W. Scott. Multivariate Density Estimation . Wiley Series in Probability and Statistics. John Wiley & Sons, Inc.,Hoboken, NJ, second edition, 2015. Theory, practice, and visualization.C. E. Shannon. Communication in the presence of noise. Proc. I.R.E. , 37:10–21, 1949.B. W. Silverman. Density Estimation for Statistics and Data Analysis . Monographs on Statistics and Applied Proba-bility. Chapman & Hall, London, 1986.N. Ushakov and A. Ushakova. On density estimation with superkernels. Journal of Nonparametric Statistics , 24(3):613–627, 2012.M. P. Wand and W. R. Schucany. Gaussian-based kernels. Canad. J. Statist. , 18(3):197–204, 1990., 18(3):197–204, 1990.