Wavelet methods in statistics: Some recent developments and their applications
Statistics Surveys
Vol. 1 (2007) 16–55
ISSN: 1935-7516

Anestis Antoniadis†
Laboratoire Jean Kuntzmann, BP 53, 38041 Grenoble Cedex 9, France
Abstract:
The development of wavelet theory has in recent years spawned applications in signal processing, in fast algorithms for integral transforms, and in image and function representation methods. This last application has stimulated interest in wavelet applications to statistics and to the analysis of experimental data, with many successes in the efficient analysis, processing, and compression of noisy signals and images.

This is a selective review article that attempts to synthesize some recent work on "nonlinear" wavelet methods in nonparametric curve estimation and their role in a variety of applications. After a short introduction to wavelet theory, we discuss in detail several wavelet shrinkage and wavelet thresholding estimators, scattered in the literature and developed, under more or less standard settings, for density estimation from i.i.d. observations or to denoise data modeled as observations of a signal with additive noise. Most of these methods are fitted into the general concept of regularization with appropriately chosen penalty functions. A narrow range of applications in major areas of statistics is also discussed, such as partial linear regression models and functional index models. The usefulness of all these methods is illustrated by means of simulations and practical examples.
AMS 2000 subject classifications:
Primary 60K35, 60K35; secondary 60K35.
Keywords and phrases: curve smoothing, density estimation, wavelet thresholding, penalized least-squares, robust regression, partial linear models, mixed effects models, inverse regression, time series prediction.

Received January 2007.
∗ This paper was accepted by T. Tony Cai, Associate Editor for the IMS.
† Financial support from the IAP research network Nr. P5/24 and P6/03 of the Belgian government (Belgian Science Policy) is gratefully acknowledged. The author would also like to thank the Associate Editor and the two referees for their useful comments and suggestions.
1. Introduction
Nonparametric regression has been a fundamental tool in data analysis over the past two decades and is still an expanding area of ongoing research. The goal is to recover an unknown function, say $g$, based on sampled data that are contaminated with noise. Denoising techniques provide a very effective and simple way of finding structure in data sets without the imposition of a parametric regression model (as in linear or polynomial regression, for example). Only very general assumptions about $g$ are made, such as that it belongs to a certain class of functions.

During the 1990s, the nonparametric regression literature was arguably dominated by (nonlinear) wavelet shrinkage and wavelet thresholding estimators. These estimators are a new subset of an old class of nonparametric regression estimators, namely orthogonal series methods. Moreover, these estimators are easily implemented through fast algorithms, so they are very appealing in practical situations. Donoho and Johnstone (1994) and Donoho et al. (1995) introduced nonlinear wavelet estimators in nonparametric regression through thresholding, which typically amounts to term-by-term assessment of estimates of coefficients in the empirical wavelet expansion of the unknown function. If an estimate of a coefficient is sufficiently large in absolute value – that is, if it exceeds a predetermined threshold – then the corresponding term in the empirical wavelet expansion is retained (or shrunk toward zero by an amount equal to the threshold); otherwise it is omitted.

Extensive reviews and descriptions of the various classical and Bayesian wavelet shrinkage and wavelet thresholding estimators are given in the books by Ogden (1997), Vidakovic (1999) and Percival and Walden (2000), in the papers appearing in the edited volume by Müller and Vidakovic (1999), and in the review papers by Antoniadis (1997), Vidakovic (1998b) and Abramovich et al. (2000). With the increased applicability of these estimators in nonparametric regression, several new wavelet-based curve smoothing procedures have been proposed in the recent literature, and one of the purposes of this review is to present a few of them under the general concept of penalized least squares regression. It should be noted that most of these methods are usually implemented under the assumptions of dyadic sample size, equally spaced and fixed sample points, and i.i.d. normal errors. When the data do not meet one or more of these requirements, various modifications have been proposed, and we will also mention a few of them in the forthcoming sections. To keep the length of this article reasonable we have not included in our discussion important developments in Bayesian wavelet denoising methods, which deserve a survey of their own.

Finally, we provide a brief overview of a spectrum of wavelet applications in some real data problems. The reader should be cautioned, however, that wavelets are such a large research area that truly comprehensive surveys are almost impossible, and thus our overview may be a little restricted. At first we will be concerned with a semiparametric partially linear regression model with unknown regression coefficients, and we will present a wavelet thresholding based estimation procedure to estimate the components of the partial linear model. We will also discuss a wavelet-based variant of a popular dimension reduction method for functional data analysis that is traditionally used in practice, namely MAVE.

The rest of the paper is organized as follows.
Section 2 recalls some known results about wavelet series, function spaces and the discrete wavelet transform. Section 3 briefly discusses the nonlinear wavelet approach to nonparametric regression and introduces several recent thresholding techniques that can be formulated as penalized least squares problems. We also discuss some block-wise thresholding procedures that have been shown to enjoy a high level of adaptation for wavelet series estimates, and end with a short discussion on block thresholding methods for density estimation. A few wavelet-based applications to complex problems are given in Section 4. Whenever necessary, the practical performance of the methods that are discussed is examined by appropriate simulations.
2. A short background on wavelets
In this section we give a brief overview of some relevant material on the wavelet series expansion and on the fast wavelet transform that we shall need later.
The term wavelets is used to refer to a set of orthonormal basis functions generated by dilation and translation of a compactly supported scaling function (or father wavelet), $\phi$, and a mother wavelet, $\psi$, associated with an $r$-regular multiresolution analysis of $L^2(\mathbb{R})$. A variety of different wavelet families now exist that combine compact support with various degrees of smoothness and numbers of vanishing moments (see Daubechies (1992)), and these are now the most intensively used wavelet families in practical applications in statistics. Hence, many types of functions encountered in practice can be sparsely (i.e. parsimoniously) and uniquely represented in terms of a wavelet series. Wavelet bases are therefore not only useful by virtue of their special structure, but they may also be (and have been!) applied in a wide variety of contexts.

For simplicity in exposition, we shall assume that we are working with periodized wavelet bases on $[0,1]$ (see, for example, Mallat (1999), Section 7.5.1), letting
$$\phi^p_{jk}(t) = \sum_{l \in \mathbb{Z}} \phi_{jk}(t - l) \quad \text{and} \quad \psi^p_{jk}(t) = \sum_{l \in \mathbb{Z}} \psi_{jk}(t - l), \qquad \text{for } t \in [0,1],$$
where
$$\phi_{jk}(t) = 2^{j/2}\phi(2^j t - k), \qquad \psi_{jk}(t) = 2^{j/2}\psi(2^j t - k).$$
For any $j_0 \ge 0$, the collection
$$\{\phi^p_{j_0 k},\ k = 0, 1, \dots, 2^{j_0} - 1;\ \psi^p_{jk},\ j \ge j_0,\ k = 0, 1, \dots, 2^j - 1\}$$
is then an orthonormal basis of $L^2([0,1])$. Any $g \in L^2([0,1])$ is expanded into a wavelet series as
$$g(t) = \sum_{k=0}^{2^{j_0}-1} \alpha_{j_0 k}\,\phi_{j_0 k}(t) + \sum_{j=j_0}^{\infty}\sum_{k=0}^{2^j-1} \beta_{jk}\,\psi_{jk}(t), \qquad t \in [0,1],$$
where
$$\alpha_{j_0 k} = \langle g, \phi_{j_0 k}\rangle = \int_0^1 g(t)\phi_{j_0 k}(t)\,dt, \qquad k = 0, 1, \dots, 2^{j_0}-1,$$
$$\beta_{jk} = \langle g, \psi_{jk}\rangle = \int_0^1 g(t)\psi_{jk}(t)\,dt, \qquad j \ge j_0,\ k = 0, 1, \dots, 2^j-1.$$

A usual assumption underlying the use of periodic wavelets is that the function to be expanded is periodic. However, such an assumption is not always realistic, and periodic wavelets exhibit a poor behaviour near the boundaries (they create high amplitude wavelet coefficients in the neighborhood of the boundaries when the analysed function is not periodic). Still, periodic wavelets are commonly used because the numerical implementation is particularly simple. While, as Johnstone (1994) has pointed out, this computational simplification affects only a fixed number of wavelet coefficients at each resolution level, we will also present later on an effective method, developed recently by Oh and Lee (2005), combining wavelet decompositions with local polynomial regression, for correcting the boundary bias introduced by the inappropriateness of the periodic assumption.

The (inhomogeneous) Besov spaces on the unit interval, $B^s_{\rho_1,\rho_2}[0,1]$, consist of functions with a specific degree of smoothness. To define them, let the $r$th difference of a function $f$ be
$$\Delta^{(r)}_h f(t) = \sum_{k=0}^{r} \binom{r}{k} (-1)^k f(t + kh),$$
and let the $r$th modulus of smoothness of $f \in L^{\rho_1}[0,1]$ be
$$\nu_{r,\rho_1}(f; t) = \sup_{h \le t} \left( \|\Delta^{(r)}_h f\|_{L^{\rho_1}[0,\,1-rh]} \right).$$
Then the Besov seminorm of index $(s, \rho_1, \rho_2)$ is defined for $r > s$, where $1 \le \rho_1, \rho_2 \le \infty$, by
$$|f|_{B^s_{\rho_1,\rho_2}} = \left[ \int_0^1 \left\{ \frac{\nu_{r,\rho_1}(f; h)}{h^s} \right\}^{\rho_2} \frac{dh}{h} \right]^{1/\rho_2}, \qquad \text{if } 1 \le \rho_2 < \infty,$$
and by
$$|f|_{B^s_{\rho_1,\infty}} = \sup_{0 < h < 1} \frac{\nu_{r,\rho_1}(f; h)}{h^s}.$$
The Besov space $B^s_{\rho_1,\rho_2}[0,1]$ is the class of functions $f \in L^{\rho_1}[0,1]$ with $|f|_{B^s_{\rho_1,\rho_2}} < \infty$, i.e. satisfying $\|f\|_{B^s_{\rho_1,\rho_2}} = \|f\|_{L^{\rho_1}} + |f|_{B^s_{\rho_1,\rho_2}} < \infty$. The parameter $\rho_1$ can be viewed as a degree of the function's inhomogeneity while $s$ is a measure of its smoothness. Roughly speaking, the (not necessarily integer) parameter $s$ indicates the number of the function's derivatives, where their existence is required in an $L^{\rho_1}$-sense; the additional parameter $\rho_2$ is secondary in its role, allowing for additional fine tuning of the definition of the space.

The Besov classes include, in particular, the well-known Hilbert–Sobolev ($H^s[0,1]$, $s = 1, 2, \dots$) and Hölder ($C^s[0,1]$, $s > 0$) spaces of smooth functions ($B^s_{2,2}[0,1]$ and $B^s_{\infty,\infty}[0,1]$ respectively), but in addition less-traditional spaces, like the space of functions of bounded variation, sandwiched between $B^1_{1,1}[0,1]$ and $B^1_{1,\infty}[0,1]$.

The Besov norm of a function $f$ is related to a sequence space norm on the wavelet coefficients of the function. Confining attention to the resolution and spatial indices $j \ge j_0$ and $k = 0, 1, \dots, 2^j - 1$, and setting $s' = s + 1/2 - 1/\rho_1$, the sequence space norm is given by
$$\|\theta\|_{b^s_{\rho_1,\rho_2}} = \|\alpha_{j_0}\|_{\rho_1} + \left( \sum_{j=j_0}^{\infty} 2^{js'\rho_2} \|\beta_j\|^{\rho_2}_{\rho_1} \right)^{1/\rho_2}, \qquad \text{if } 1 \le \rho_2 < \infty,$$
$$\|\theta\|_{b^s_{\rho_1,\infty}} = \|\alpha_{j_0}\|_{\rho_1} + \sup_{j \ge j_0} \left\{ 2^{js'} \|\beta_j\|_{\rho_1} \right\},$$
where
$$\|\alpha_{j_0}\|^{\rho_1}_{\rho_1} = \sum_{k=0}^{2^{j_0}-1} |\alpha_{j_0 k}|^{\rho_1} \qquad \text{and} \qquad \|\beta_j\|^{\rho_1}_{\rho_1} = \sum_{k=0}^{2^j-1} |\beta_{jk}|^{\rho_1}.$$
If the mother wavelet $\psi$ is of regularity $r > 0$, it can be shown that the corresponding orthonormal periodic wavelet basis defined in Section 2.1 is an unconditional basis for the Besov spaces $B^s_{\rho_1,\rho_2}[0,1]$ for $0 < s < r$, $1 \le \rho_1, \rho_2 \le \infty$. In other words, we have
$$K_1 \|f\|_{B^s_{\rho_1,\rho_2}} \le \|\theta\|_{b^s_{\rho_1,\rho_2}} \le K_2 \|f\|_{B^s_{\rho_1,\rho_2}},$$
where $K_1$ and $K_2$ are constants, not depending on $f$. Therefore the Besov norm of the function $f$ is equivalent to the corresponding sequence space norm defined above; this allows one to characterize Besov spaces in terms of wavelet coefficients (see, e.g., Meyer (1992); Donoho and Johnstone (1998)). For a more detailed study of (inhomogeneous) Besov spaces we refer to Meyer (1992).

In statistical settings we are more usually concerned with discretely sampled, rather than continuous, functions. It is then the wavelet analogue of the discrete Fourier transform which is of primary interest, and this is referred to as the discrete wavelet transform (DWT). Given a vector of function values $\mathbf{g} = (g(t_1), \dots, g(t_n))'$ at equally spaced points $t_i$, the discrete wavelet transform of $\mathbf{g}$ is given by $\mathbf{d} = W\mathbf{g}$, where $\mathbf{d}$ is an $n \times 1$ vector comprising both discrete scaling coefficients, $c_{j_0 k}$, and discrete wavelet coefficients, $d_{jk}$, and $W$ is an orthogonal $n \times n$ matrix associated with the orthonormal wavelet basis chosen. The $c_{j_0 k}$ and $d_{jk}$ are related to their continuous counterparts $\alpha_{j_0 k}$ and $\beta_{jk}$ (with an approximation error of order $n^{-1}$) via the relationships
$$c_{j_0 k} \approx \sqrt{n}\,\alpha_{j_0 k} \qquad \text{and} \qquad d_{jk} \approx \sqrt{n}\,\beta_{jk}.$$
The factor $\sqrt{n}$ arises because of the difference between the continuous and discrete orthonormality conditions. This root factor is unfortunate, but both the definition of the DWT and the wavelet coefficients are now fixed by convention; hence the different notation used to distinguish between the discrete wavelet coefficients and their continuous counterparts. Note that, because of the orthogonality of $W$, the inverse DWT (IDWT) is simply given by $\mathbf{g} = W^T \mathbf{d}$, where $W^T$ denotes the transpose of $W$.

If $n = 2^J$ for some positive integer $J$, the DWT and IDWT may be performed through a computationally fast algorithm developed by Mallat (1999) that requires only order $n$ operations. In this case, for a given $j_0$ and under periodic boundary conditions, the DWT of $\mathbf{g}$ results in an $n$-dimensional vector $\mathbf{d}$ comprising both discrete scaling coefficients $c_{j_0 k}$, $k = 0, \dots, 2^{j_0}-1$, and discrete wavelet coefficients $d_{jk}$, $j = j_0, \dots, J-1$, $k = 0, \dots, 2^j - 1$, computed via the fast $O(n)$ DWT algorithm mentioned above. Essentially the algorithm is a fast hierarchical scheme for deriving the required inner products which at each step involves the action of low and high pass filters, followed by a decimation (selection of every even member of a sequence). The IDWT may be similarly obtained in terms of related filtering operations. For excellent accounts of the DWT and IDWT in terms of filter operators we refer to Nason and Silverman (1995), Strang and Nguyen (1996), or Burrus et al. (1998).
3. Denoising by wavelet thresholding

The problem of estimating a signal that is corrupted by additive noise is a standard problem in statistics and signal processing. It can be described as follows. Consider the standard univariate nonparametric regression setting
$$y_i = g(t_i) + \sigma \epsilon_i, \qquad i = 1, \dots, n, \tag{3.1}$$
where the $\epsilon_i$ are independent $N(0,1)$ random variables and the noise level $\sigma$ may be known or unknown. We suppose, without loss of generality, that the $t_i$ are within the unit interval $[0,1]$. The goal is to recover the underlying function $g$ from the noisy data, $\mathbf{y} = (y_1, \dots, y_n)^T$, without assuming any particular parametric structure for $g$.

One of the basic approaches to handle this regression problem is to consider the unknown function $g$ expanded as a generalised Fourier series and then to estimate the generalised Fourier coefficients from the data. The original (nonparametric) problem is thus transformed to a parametric one, although the potential number of parameters is infinite. An appropriate choice of basis for the expansion is therefore a key point in relation to the efficiency of such an approach. A 'good' basis should be parsimonious in the sense that a large set of possible response functions can be approximated well by only a few terms of the generalized Fourier expansion employed. Wavelet series allow a parsimonious expansion for a wide variety of functions, including inhomogeneous cases. It is therefore natural to consider applying the generalized Fourier series approach using a wavelet series.

In what follows we assume that the sample points are equally spaced, i.e. $t_i = i/n$, and that the sample size $n$ is a power of two: $n = 2^J$ for some positive integer $J$. These assumptions allow us to perform both the DWT and the IDWT using Mallat's fast algorithm. Note, however, that for non-equispaced or random designs, or sample sizes which are not a power of two, or data contaminated with correlated noise, modifications are needed to the standard wavelet-based estimation procedures, and these will be discussed in subsection 3.4.

Due to the orthogonality of the matrix $W$, the DWT of white noise is also an array of independent $N(0,1)$ random variables, so from (3.1) it follows that
$$\hat{c}_{j_0 k} = c_{j_0 k} + \sigma \epsilon_{j_0 k}, \qquad k = 0, 1, \dots, 2^{j_0} - 1, \tag{3.2}$$
$$\hat{d}_{jk} = d_{jk} + \sigma \epsilon_{jk}, \qquad j = j_0, \dots, J-1, \ k = 0, \dots, 2^j - 1, \tag{3.3}$$
where $\hat{c}_{j_0 k}$ and $\hat{d}_{jk}$ are respectively the empirical scaling and the empirical wavelet coefficients of the noisy data $\mathbf{y}$, and the $\epsilon_{jk}$ are independent $N(0,1)$ random variables. 'Large' $d_{jk}$ contain information about the underlying function $g$, while 'small' $d_{jk}$ can be attributed to the noise which uniformly contaminates all wavelet coefficients. Thus, simple denoising algorithms that use the wavelet transform consist of three steps:

1) Calculate the wavelet transform of the noisy signal.
2) Modify the noisy wavelet coefficients according to some rule.
3) Compute the inverse transform using the modified coefficients.

Traditionally, for the second step of the above approach there are two kinds of denoising methods; namely, linear and nonlinear techniques. A wavelet based linear approach, extending simple spline smoothing estimation methods as described by Wahba (1990), is the one suggested by Antoniadis (1996) and independently by Amato and Vuza (1997). This method is appropriate for estimating relatively regular functions. Assuming that the smoothness index $s$ of the function $g$ to be recovered is known, the resulting estimator is obtained by estimating the scaling coefficients $c_{j_0 k}$ by their empirical counterparts $\hat{c}_{j_0 k}$ and by estimating the wavelet coefficients $d_{jk}$ via a linear shrinkage
$$\tilde{d}_{jk} = \frac{\hat{d}_{jk}}{1 + \lambda 2^{2js}},$$
where $\lambda > 0$ is a smoothing parameter. The parameter $\lambda$ is chosen by cross-validation in Amato and Vuza (1997), while the choice of $\lambda$ in Antoniadis (1996) is based on risk minimization and depends on a preliminary consistent estimator of the noise level $\sigma$. While simple and cheap to implement, the above linear method is not designed to handle spatially inhomogeneous functions with low regularity. For such functions one usually relies upon nonlinear thresholding or shrinkage methods.

One of the earliest papers in the field of wavelet denoising may be that of Weaver et al. (1991), proposing a hard-thresholding scheme for filtering noise from magnetic resonance images. While Weaver et al. (1991) demonstrated the advantages of the wavelet thresholding scheme mainly based on experimental results, the first thorough mathematical treatment of wavelet shrinkage and wavelet thresholding was done by Donoho et al. in a series of technical reports in the early 1990s and published in Donoho and Johnstone (1994), Donoho (1995), Donoho et al. (1995) and Donoho and Johnstone (1998). Donoho and his coworkers analyzed wavelet thresholding and shrinkage methods in the context of minimax estimation and showed that wavelet shrinkage generates asymptotically optimal estimates for noisy data that outperform any linear estimator.

Mathematically, wavelet coefficients are estimated using either the hard or the soft thresholding rule, given respectively by
$$\delta^H_\lambda(\hat{d}_{jk}) = \begin{cases} 0 & \text{if } |\hat{d}_{jk}| \le \lambda \\ \hat{d}_{jk} & \text{if } |\hat{d}_{jk}| > \lambda \end{cases} \tag{3.4}$$
and
$$\delta^S_\lambda(\hat{d}_{jk}) = \begin{cases} 0 & \text{if } |\hat{d}_{jk}| \le \lambda \\ \hat{d}_{jk} - \lambda & \text{if } \hat{d}_{jk} > \lambda \\ \hat{d}_{jk} + \lambda & \text{if } \hat{d}_{jk} < -\lambda. \end{cases} \tag{3.5}$$
Thresholding allows the data itself to decide which wavelet coefficients are significant; hard thresholding (a discontinuous function) is a 'keep' or 'kill' rule, while soft thresholding (a continuous function) is a 'shrink' or 'kill' rule. Beside these two possibilities there are many others (semi-soft shrinkage, firm shrinkage, ...) and as long as the shrinkage function preserves the sign ($\mathrm{sign}(\delta_\lambda(x)) = \mathrm{sign}(x)$) and shrinks the magnitude ($|\delta_\lambda(x)| \le |x|$) one can expect a denoising effect.
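As an illustration, here is a minimal Python sketch of the three-step procedure with the hard and soft rules (3.4)-(3.5). The universal threshold appears later in this section; the median-absolute-deviation estimate of $\sigma$ from the finest-level coefficients is a common companion choice, not prescribed by the text, and PyWavelets is assumed as in the earlier sketch.

import numpy as np
import pywt

def delta_hard(d, lam):
    """Hard thresholding rule (3.4): keep or kill."""
    return d * (np.abs(d) > lam)

def delta_soft(d, lam):
    """Soft thresholding rule (3.5): shrink or kill."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def wavelet_denoise(y, rule=delta_soft, wavelet='db4', level=5):
    # Step 1: DWT of the noisy signal.
    coeffs = pywt.wavedec(y, wavelet, mode='periodization', level=level)
    # Estimate sigma from the finest-level coefficients (MAD; a common choice).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(len(y)))   # universal threshold
    # Step 2: threshold the wavelet coefficients, leave scaling coeffs intact.
    coeffs[1:] = [rule(d, lam) for d in coeffs[1:]]
    # Step 3: invert the transform.
    return pywt.waverec(coeffs, wavelet, mode='periodization')

rng = np.random.default_rng(0)
t = np.arange(512) / 512
y = np.sin(4 * np.pi * t) + (t > 0.5) + 0.3 * rng.standard_normal(512)
g_hat = wavelet_denoise(y, rule=delta_hard)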
Gao and Bruce (1997) and Marron et al. (1998) have shown that simple threshold values with hard thresholding result in larger variance in the function estimate, while the same threshold values with soft thresholding shift the estimated coefficients by an amount of $\lambda$ even when $|\hat{d}_{jk}|$ stands way out of the noise level, creating unnecessary bias when the true coefficients are large. Also, due to its discontinuity, hard thresholding can be unstable – that is, sensitive to small changes in the data.

To remedy the drawbacks of both hard and soft thresholding rules, Gao and Bruce (1997) considered the firm thresholding rule
$$\delta^F_{\lambda_1,\lambda_2}(\hat{d}_{jk}) = \begin{cases} 0 & \text{if } |\hat{d}_{jk}| \le \lambda_1 \\ \mathrm{sign}(\hat{d}_{jk}) \dfrac{\lambda_2 (|\hat{d}_{jk}| - \lambda_1)}{\lambda_2 - \lambda_1} & \text{if } \lambda_1 < |\hat{d}_{jk}| \le \lambda_2 \\ \hat{d}_{jk} & \text{if } |\hat{d}_{jk}| > \lambda_2 \end{cases} \tag{3.6}$$
which is a "shrink" or "kill" rule (a continuous function). The resulting wavelet thresholding estimators offer, in small samples, advantages over both the hard thresholding rule (generally smaller mean squared error and less sensitivity to small perturbations in the data) and the soft thresholding rule (generally smaller bias and overall mean squared error). For values of $|\hat{d}_{jk}|$ near the lower threshold $\lambda_1$, $\delta^F_{\lambda_1,\lambda_2}(\hat{d}_{jk})$ behaves like $\delta^S_{\lambda_1}(\hat{d}_{jk})$. For values of $|\hat{d}_{jk}|$ above the upper threshold $\lambda_2$, $\delta^F_{\lambda_1,\lambda_2}(\hat{d}_{jk})$ behaves like $\delta^H_{\lambda_2}(\hat{d}_{jk})$. Note that the hard thresholding and soft thresholding rules are limiting cases of (3.6) with $\lambda_2 = \lambda_1$ and $\lambda_2 = \infty$ respectively.

Note that firm thresholding has a drawback in that it requires two threshold values (one for 'keep' or 'shrink' and another for 'shrink' or 'kill'), thus making the estimation procedure for the threshold values more computationally expensive. To overcome this drawback, Gao (1998) considered the nonnegative garrote thresholding rule
$$\delta^G_\lambda(\hat{d}_{jk}) = \begin{cases} 0 & \text{if } |\hat{d}_{jk}| \le \lambda \\ \hat{d}_{jk} - \dfrac{\lambda^2}{\hat{d}_{jk}} & \text{if } |\hat{d}_{jk}| > \lambda \end{cases} \tag{3.7}$$
which also is a "shrink" or "kill" rule (a continuous function). The resulting wavelet thresholding estimators offer, in small samples, an advantage over both the hard and soft thresholding rules that is comparable to that of the firm thresholding rule, while the latter requires two threshold values.

In the same spirit as in Gao (1998), Antoniadis and Fan (2001) suggested the SCAD thresholding rule
$$\delta^{SCAD}_\lambda(\hat{d}_{jk}) = \begin{cases} \mathrm{sign}(\hat{d}_{jk}) \max\left(0, |\hat{d}_{jk}| - \lambda\right) & \text{if } |\hat{d}_{jk}| \le 2\lambda \\ \dfrac{(a-1)\hat{d}_{jk} - a\lambda\,\mathrm{sign}(\hat{d}_{jk})}{a - 2} & \text{if } 2\lambda < |\hat{d}_{jk}| \le a\lambda \\ \hat{d}_{jk} & \text{if } |\hat{d}_{jk}| > a\lambda \end{cases} \tag{3.8}$$
which is a "shrink" or "kill" rule (a piecewise linear function). It does not over-penalize large values of $|\hat{d}_{jk}|$ and hence does not create excessive bias when the wavelet coefficients are large. Antoniadis and Fan (2001), based on a Bayesian argument, have recommended using the value $a = 3.7$.
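A compact numerical rendering of the three rules (3.6)-(3.8) may help; the following Python functions are a direct, vectorized transcription and are only a sketch, with the default $a = 3.7$ taken from the recommendation above.

import numpy as np

def delta_firm(d, lam1, lam2):
    """Firm thresholding (3.6): soft-like near lam1, hard-like beyond lam2."""
    d = np.asarray(d, dtype=float)
    mid = np.sign(d) * lam2 * (np.abs(d) - lam1) / (lam2 - lam1)
    return np.where(np.abs(d) <= lam1, 0.0,
                    np.where(np.abs(d) <= lam2, mid, d))

def delta_garrote(d, lam):
    """Nonnegative garrote thresholding (3.7)."""
    d = np.asarray(d, dtype=float)
    with np.errstate(divide='ignore', invalid='ignore'):
        kept = d - lam**2 / d
    return np.where(np.abs(d) <= lam, 0.0, kept)

def delta_scad(d, lam, a=3.7):
    """SCAD thresholding (3.8): piecewise linear, nearly unbiased for large d."""
    d = np.asarray(d, dtype=float)
    soft = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)
    mid = ((a - 1) * d - a * lam * np.sign(d)) / (a - 2)
    return np.where(np.abs(d) <= 2 * lam, soft,
                    np.where(np.abs(d) <= a * lam, mid, d))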
Nonlinear diffusion filtering and wavelet shrinkage are two methods that serve the same purpose, namely discontinuity-preserving denoising. In this subsection we give a survey of relations between the two paradigms when fully discrete versions of nonlinear diffusion filters with an explicit time discretization are considered. In particular we present the results of Mrázek et al. (2003) connecting shift-invariant Haar wavelet shrinkage and the diffusivity of a nonlinear diffusion filter. This allows us to exhibit the diffusivity functions corresponding to some known and widely used shrinkage functions, as well as new shrinkage functions with competitive performance which are induced by famous diffusivities. Due to lack of space we can only present the main ideas and refer the reader to the paper of Mrázek et al. (2003) for more details. Before proceeding, we would first like to recall some facts about translation-invariant denoising.

One drawback of the discrete wavelet transform is that the coefficients of the discretized signal are not circularly shift equivariant, so that circularly shifting the observed series by some amount will not circularly shift the discrete wavelet transform coefficients by the same amount. This seriously degrades the quality of the denoising achieved. To try to alleviate this problem Coifman and Donoho (1995) introduced the technique of 'cycle spinning'. The idea of denoising via cycle spinning is to apply denoising not only to $\mathbf{y}$, but also to all possible unique circularly shifted versions of $\mathbf{y}$, and to average the results. As pointed out by Percival and Walden (2000) (see p. 429), another way to perform translation invariant shrinkage is by applying standard thresholding to the wavelet coefficients of the maximal overlap discrete wavelet transform, a transform we more briefly refer to as the stationary wavelet transform, and we refer to the above monograph for details. We are now able to introduce and analyze a general connection between translation invariant Haar wavelet shrinkage and a discretized version of a nonlinear diffusion.

The scaling and wavelet filters $h$ and $\tilde{h}$ corresponding to the Haar wavelet transform are
$$h = \frac{1}{\sqrt{2}}(\dots, 0, 1, 1, 0, \dots), \qquad \tilde{h} = \frac{1}{\sqrt{2}}(\dots, 0, -1, 1, 0, \dots).$$
Given a discrete signal $f = (f_k)_{k \in \mathbb{Z}}$, it is easy to see that a shift-invariant soft wavelet shrinkage of $f$ on a single level decomposition with the Haar wavelet creates a filtered signal $u = (u_k)_{k \in \mathbb{Z}}$ given by
$$u_k = \frac{1}{4}(f_{k-1} + 2f_k + f_{k+1}) + \frac{1}{2\sqrt{2}}\left( -\delta^S_\lambda\!\left(\frac{f_{k+1} - f_k}{\sqrt{2}}\right) + \delta^S_\lambda\!\left(\frac{f_k - f_{k-1}}{\sqrt{2}}\right) \right),$$
where $\delta^S_\lambda$ denotes the soft shrinkage operator with threshold $\lambda$. Because the filters of the Haar wavelet are simple difference filters (a finite difference approximation of derivatives) the above shrinkage rule looks a little like a discretized version of a differential equation. Indeed, rewriting the above equation as
$$u_k = f_k + \frac{(f_{k+1} - f_k) - (f_k - f_{k-1})}{4} + \frac{1}{2\sqrt{2}}\left( -\delta^S_\lambda\!\left(\frac{f_{k+1} - f_k}{\sqrt{2}}\right) + \delta^S_\lambda\!\left(\frac{f_k - f_{k-1}}{\sqrt{2}}\right) \right)$$
$$= f_k + \left( \frac{f_{k+1} - f_k}{4} - \frac{1}{2\sqrt{2}}\,\delta^S_\lambda\!\left(\frac{f_{k+1} - f_k}{\sqrt{2}}\right) \right) - \left( \frac{f_k - f_{k-1}}{4} - \frac{1}{2\sqrt{2}}\,\delta^S_\lambda\!\left(\frac{f_k - f_{k-1}}{\sqrt{2}}\right) \right),$$
we obtain
$$\frac{u_k - f_k}{\Delta t} = (f_{k+1} - f_k)\, g(|f_{k+1} - f_k|) - (f_k - f_{k-1})\, g(|f_k - f_{k-1}|), \tag{3.9}$$
with a function $g$ and a time step size $\Delta t$ defined by
$$\Delta t\; g(|s|) = \frac{1}{4} - \frac{1}{2\sqrt{2}\,|s|}\,\delta^S_\lambda\!\left(\frac{|s|}{\sqrt{2}}\right).$$
Eq. (3.9) appears as a first iteration of an explicit (Euler forward) scheme for a nonlinear diffusion filter with initial state $f$, time step size $\Delta t$ and spatial step size 1, and therefore the shrinkage rule corresponds to a discretization of the following differential equation
$$\partial_t u = \partial_x\left( (\partial_x u)\, g(|\partial_x u|) \right), \tag{3.10}$$
with initial condition $u(0) = f$. This equation is a 1-D variant of the Perona–Malik diffusion equation well known in image processing, and the function $g$ is called the diffusivity. The basic idea behind nonlinear diffusion filtering in the 1-D case (see Droske and Rumpf (2004)) is to obtain a family $u(x,t)$ of filtered versions of a continuous signal $f$ as the solution of the diffusion process stated in Eq. (3.10) with $f$ as initial condition, $u(x,0) = f(x)$, and reflecting boundary conditions.
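The identity behind (3.9) is easy to check numerically. The sketch below (pure NumPy, with periodic boundary handling chosen purely for convenience) computes one step of shift-invariant single-level Haar soft shrinkage both directly and through the diffusivity form, and confirms the two agree.

import numpy as np

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(1)
f = np.cumsum(rng.standard_normal(64))         # a rough test signal
lam = 0.5
fp = np.roll(f, -1)                            # f_{k+1} (periodic wrap-around)
fm = np.roll(f, 1)                             # f_{k-1}

# Direct shift-invariant Haar soft shrinkage on one level.
u_direct = (fm + 2 * f + fp) / 4.0 + (
    -soft((fp - f) / np.sqrt(2), lam) + soft((f - fm) / np.sqrt(2), lam)
) / (2 * np.sqrt(2))

# Diffusion form (3.9): u_k = f_k + (f_{k+1}-f_k) dt*g(.) - (f_k-f_{k-1}) dt*g(.)
def dt_g(s, lam):
    """Delta_t * g(|s|); for |s| <= sqrt(2)*lam the soft rule kills the
    coefficient, so the value is exactly 1/4 (this also avoids 0/0)."""
    s = np.abs(s)
    out = np.full_like(s, 0.25)
    big = s > np.sqrt(2) * lam
    out[big] = 0.25 - soft(s[big] / np.sqrt(2), lam) / (2 * np.sqrt(2) * s[big])
    return out

u_diff = f + (fp - f) * dt_g(fp - f, lam) - (f - fm) * dt_g(f - fm, lam)
assert np.allclose(u_direct, u_diff)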
The diffusivity $g$ controls the speed of diffusion depending on the magnitude of the gradient. Usually, $g$ is chosen such that it is equal to one for small magnitudes of the gradient and goes down to zero for large gradients. Hence the diffusion stops at positions where the gradient is large. These areas are considered as singularities of the signal. Since the Perona–Malik equation is nonlinear, the existence of a solution is not obvious.

We can now give a proposition which relates some properties of shrinkage functions and diffusivities, and whose proof is an easy consequence of the relation between $g$ and $\delta_\lambda$. A detailed analysis of this connection in terms of extremum principles, monotonicity preservation and sign stability can be found in Mrázek et al. (2003) (see also Lorenz (2006)). We formulate these relations for the case $\Delta t = 1/4$.

Proposition 3.1. Let $\Delta t = 1/4$. Then the diffusivity and the shrinkage function are related through
$$g(|x|) = 1 - \frac{\sqrt{2}}{|x|}\,\delta_\lambda\!\left(\frac{|x|}{\sqrt{2}}\right). \tag{3.11}$$
The following properties hold:
1. If $\delta_\lambda$ performs shrinkage then the diffusion is always forward, i.e. $\delta_\lambda(|x|) \le |x| \Leftrightarrow g(x) \ge 0$.
2. If $\delta_\lambda$ is differentiable at 0 then, as $x \to 0$, $g(x) \to 1 \Leftrightarrow \delta_\lambda(0) = 0$ and $\delta'_\lambda(0) = 0$.
3. The diffusion stops for large gradients if and only if the shrinkage function has linear growth at infinity, i.e. $g(x) \to 0$ as $x \to \infty \Leftrightarrow \delta_\lambda(x)/x \to 1$ as $x \to \infty$.

Some examples will make the correspondence clearer. As suggested by Proposition 3.1 we choose $\Delta t = 1/4$.

Linear shrinkage A linear shrinkage rule, producing linear wavelet denoising, is given by $\delta_\lambda(x) = \frac{x}{1+\lambda}$. The corresponding diffusivity is the constant $g(|x|) = \frac{\lambda}{1+\lambda}$, and the diffusion is linear.

Soft shrinkage The soft shrinkage function $\delta_\lambda(x) = \mathrm{sign}(x)(|x| - \lambda)_+$ gives
$$g(|x|) = 1 - \frac{(|x| - \sqrt{2}\lambda)_+}{|x|},$$
which is a stabilized total variation diffusivity (see Steidl and Weickert (2002)).

Hard shrinkage The hard shrinkage function $\delta_\lambda(x) = x\,(1 - I_{\{|x| \le \lambda\}}(x))$ leads to $g(|x|) = I_{\{|x| \le \sqrt{2}\lambda\}}(|x|)$, which is a piecewise linear diffusion that degenerates for large gradients.

Garrote shrinkage The nonnegative garrote shrinkage $\delta_\lambda(x) = \left(x - \frac{\lambda^2}{x}\right)(1 - I_{\{|x| \le \lambda\}}(x))$ leads to a stabilized unbounded BFB diffusivity given by $g(|x|) = I_{\{|x| \le \sqrt{2}\lambda\}}(|x|) + \frac{2\lambda^2}{x^2}\, I_{\{|x| > \sqrt{2}\lambda\}}(|x|)$.

Firm shrinkage Firm shrinkage, defined by eq. (3.6), yields a diffusivity that degenerates to 0 for sufficiently large gradients:
$$g(|x|) = \begin{cases} 1 & \text{if } |x| \le \sqrt{2}\lambda_1 \\ \dfrac{\lambda_1}{\lambda_2 - \lambda_1}\left(\dfrac{\sqrt{2}\lambda_2}{|x|} - 1\right) & \text{if } \sqrt{2}\lambda_1 < |x| \le \sqrt{2}\lambda_2 \\ 0 & \text{if } |x| > \sqrt{2}\lambda_2. \end{cases}$$

SCAD shrinkage SCAD shrinkage, defined by eq. (3.8), also gives a diffusivity that degenerates to 0:
$$g(|x|) = \begin{cases} 1 & \text{if } |x| \le \sqrt{2}\lambda \\ \dfrac{\sqrt{2}\lambda}{|x|} & \text{if } \sqrt{2}\lambda < |x| \le 2\sqrt{2}\lambda \\ \dfrac{1}{a-2}\left(\dfrac{a\sqrt{2}\lambda}{|x|} - 1\right) & \text{if } 2\sqrt{2}\lambda < |x| \le a\sqrt{2}\lambda \\ 0 & \text{if } |x| > a\sqrt{2}\lambda. \end{cases}$$

Fig 1. Shrinkage functions (top) and corresponding diffusivities (bottom). The functions are plotted for $\lambda = 1$, $\lambda_1 = 1$, $\lambda_2 = 2$ (Firm) and $a = 3.7$ (SCAD). The dashed line represents the diagonal.

All these examples are depicted in Figure 1. The other way round, one can ask how the shrinkage functions for famous diffusivities look. The function $\delta_\lambda$ expressed in terms of $g$ reads
$$\delta_\lambda(|x|) = |x|\left(1 - g(\sqrt{2}\,|x|)\right),$$
and the dependence of the shrinkage function on the threshold parameter $\lambda$ is naturally fulfilled because diffusivities usually involve a parameter too.
Using this remark we obtain the following new shrinkage functions:

Charbonnier diffusivity The Charbonnier diffusivity (Charbonnier et al. (1994)) is given by $g(|x|) = \left(1 + \frac{x^2}{\lambda^2}\right)^{-1/2}$ and corresponds to the shrinkage function $\delta_\lambda(x) = x\left(1 - \sqrt{\frac{\lambda^2}{\lambda^2 + 2x^2}}\right)$.

Perona–Malik diffusivity The Perona–Malik diffusivity (Perona and Malik (1990)) is defined by $g(|x|) = \left(1 + \frac{x^2}{\lambda^2}\right)^{-1}$ and leads to the shrinkage function $\delta_\lambda(x) = \frac{2x^3}{2x^2 + \lambda^2}$.

Weickert diffusivity Weickert (1998) introduced the diffusivity $g(|x|) = I_{\{|x| > 0\}}(x)\left(1 - \exp\left(-\frac{3.31488}{(|x|/\lambda)^8}\right)\right)$, which leads to the shrinkage function $\delta_\lambda(x) = x \exp\left(-\frac{3.31488\,\lambda^8}{16\,x^8}\right)$.

Tukey diffusivity Tukey's diffusivity, defined by Black et al. (1998) as $g(|x|) = \left(1 - (x/\lambda)^2\right)^2 I_{\{|x| \le \lambda\}}(|x|)$, leads to the shrinkage function
$$\delta_\lambda(x) = \begin{cases} \dfrac{4x^3}{\lambda^2} - \dfrac{4x^5}{\lambda^4} & \text{if } |x| \le \lambda/\sqrt{2} \\ x & \text{if } |x| > \lambda/\sqrt{2}. \end{cases}$$

Fig 2. "Classical" diffusivities (top) and corresponding shrinkage functions (bottom).

Figure 2 illustrates these diffusivities and their corresponding shrinkage functions.
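These correspondences are easy to tabulate numerically through $\delta_\lambda(x) = x(1 - g(\sqrt{2}\,x))$. The sketch below derives the Charbonnier, Perona–Malik and Tukey shrinkage functions from their diffusivities and checks them against the closed forms stated above.

import numpy as np

def shrink_from_diffusivity(g, x):
    """delta_lambda(x) = x * (1 - g(sqrt(2) * x)) for x >= 0 (odd extension)."""
    return x * (1.0 - g(np.sqrt(2) * x))

lam = 1.0
x = np.linspace(0.0, 4.0, 401)

g_char = lambda s: (1.0 + s**2 / lam**2) ** (-0.5)        # Charbonnier
g_pm   = lambda s: (1.0 + s**2 / lam**2) ** (-1.0)        # Perona-Malik
g_tuk  = lambda s: np.where(np.abs(s) <= lam,
                            (1.0 - (s / lam)**2)**2, 0.0)  # Tukey

# Closed forms quoted in the text.
d_char = x * (1.0 - np.sqrt(lam**2 / (lam**2 + 2 * x**2)))
d_pm = 2 * x**3 / (2 * x**2 + lam**2)
d_tuk = np.where(x <= lam / np.sqrt(2), 4 * x**3 / lam**2 - 4 * x**5 / lam**4, x)

assert np.allclose(shrink_from_diffusivity(g_char, x), d_char)
assert np.allclose(shrink_from_diffusivity(g_pm, x), d_pm)
assert np.allclose(shrink_from_diffusivity(g_tuk, x), d_tuk)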
Having developed the connection between diffusivities and shrinkage functions, we will further exploit it in the subsection that follows in order to show that they can all be interpreted as cases of a broad class of penalized least squares estimators. This unified treatment and the general results of Antoniadis and Fan (2001) on penalized wavelet estimators allow a systematic derivation of oracle inequalities and minimax properties for a large class of wavelet estimators.

The various thresholding rules presented in the previous subsection hold no monopoly in choosing an ideal wavelet sub-basis to efficiently estimate an unknown function observed with noise. It turns out that the corresponding nonlinear estimators can be seen as optimal estimators in a regularization setting for specific penalty functions, i.e. most of the shrinkage rules can be linked to a regularization process under a corresponding and reasonable penalty function. Exploring the nature of these penalties and using results from Antoniadis and Fan (2001), it is then easy to show that the corresponding thresholding estimators have good sampling properties and are adaptively minimax. It is the aim of this subsection to integrate the diverse shrinkage rules from a regularization point of view.

When estimating a signal that is corrupted by additive noise by wavelet based methods, the traditional regularization problem can be formulated in the wavelet domain by finding the minimum in $\theta$ of the penalized least-squares functional $\ell(\theta)$ defined by
$$\ell(\theta) = \|W\mathbf{y} - \theta\|_n^2 + 2\lambda \sum_{i > i_0} p(|\theta_i|), \tag{3.12}$$
where $\theta$ is the vector of the wavelet coefficients of the unknown regression function $g$ and $p$ is a given penalty function, while $i_0$ is a given integer corresponding to penalizing wavelet coefficients above a certain resolution level $j_0$. Here, to facilitate the presentation, we have changed the notation $d_{j,k}$ from a double array sequence into a single array sequence $\theta_i$. Similarly, denote by $\mathbf{z}$ the vector of empirical wavelet coefficients $W\mathbf{y}$. With a choice of an additive penalty $\sum_{i > i_0} p(|\theta_i|)$, the minimization problem becomes separable. Minimizing (3.12) is equivalent to minimizing
$$\ell(\theta_i) = (z_i - \theta_i)^2 + 2\lambda p(|\theta_i|), \tag{3.13}$$
for each coordinate $i$. The resulting penalized least-squares estimator is in this case separable, that is, the estimate of any coordinate $\theta_i$ depends solely on the empirical wavelet coefficient $z_i$. While separable estimators have their drawbacks, this is the case that we address here. To reduce abuse of notation, and because $p(|\theta|)$ is allowed to depend on $\lambda$, we will use $p_\lambda$ to denote the penalty function $\lambda p$ in the following discussion.

The performance of the resulting wavelet estimator depends on the penalty and the regularization parameter $\lambda$. To select a good penalty function, Antoniadis and Fan (2001) and Fan and Li (2001) proposed three principles that a good penalty function should satisfy: unbiasedness, in which there is no over-penalization of large coefficients, to avoid unnecessary modeling biases; sparsity, as the resulting penalized least-squares estimators should follow a thresholding rule such that insignificant coefficients are set to zero, to reduce model complexity; and continuity, to avoid instability and large variability in model prediction. The interested reader is referred to Theorem 1 of Antoniadis and Fan (2001), which gives the necessary and sufficient conditions on a penalty function for the solution of the penalized least-squares problem to be thresholding, continuous and approximately unbiased when $|z|$ is large. Our purpose here is to show how to derive the penalties corresponding to the thresholding rules defined in the previous subsection, and to show that almost all of them satisfy these conditions. More precisely, we have

Proposition 3.2. Let $\delta_\lambda : \mathbb{R} \to \mathbb{R}$ be a thresholding function that is increasing and antisymmetric, such that $0 \le \delta_\lambda(x) \le x$ for $x \ge 0$ and $\delta_\lambda(x) \to \infty$ as $x \to \infty$. Then there exists a continuous positive penalty function $p_\lambda$, with $p_\lambda(x) \le p_\lambda(y)$ whenever $|x| \le |y|$, such that $\delta_\lambda(z)$ is the unique solution of the minimization problem $\min_\theta \left\{ (z - \theta)^2 + 2p_\lambda(|\theta|) \right\}$ for every $z$ at which $\delta_\lambda$ is continuous.

Proof. Let $r_\lambda : \mathbb{R} \to \mathbb{R}$ be defined by $r_\lambda(x) = \sup\{z \,:\, \delta_\lambda(z) \le x\}$. The function $r_\lambda$ is well defined, since the set over which the supremum is taken is non empty (recall that $\delta_\lambda(z) \to -\infty$ as $z \to -\infty$) and bounded above (since $\delta_\lambda(z) \to \infty$ as $z \to \infty$). For $\theta \ge 0$, let
$$p_\lambda(\theta) = \int_0^\theta \left(r_\lambda(u) - u\right)du,$$
and suppose that $\delta_\lambda$ is continuous at $\theta$. Let
$$k(\theta) = (\theta - z)^2 + 2p_\lambda(\theta) - z^2 = 2\int_0^\theta \left(r_\lambda(u) - z\right)du,$$
so that
$$k(\theta) - k(\delta_\lambda(z)) = 2\int_{\delta_\lambda(z)}^\theta \left(r_\lambda(u) - z\right)du.$$
Using the assumptions, it is easy to show that for $\theta \ne \delta_\lambda(z)$ we have $k(\theta) > k(\delta_\lambda(z))$, so $\theta = \delta_\lambda(z)$ is the only solution to the minimization problem. The contracting property of the shrinkage function leads directly to the contracting property of the corresponding penalty.

It is however interesting to recall here from the above proof the almost analytical expression for $p_\lambda$. Denote by $r_\lambda$ the generalized inverse function of $\delta_\lambda$ defined by $r_\lambda(x) = \sup\{z \,:\, \delta_\lambda(z) \le x\}$. Then, for any $z > 0$, $p_\lambda$ is defined by
$$p_\lambda(z) = \int_0^z \left(r_\lambda(u) - u\right)du. \tag{3.14}$$
Note that all the thresholding functions studied in the previous subsection satisfy the conditions of Proposition 3.2.
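The recipe (3.14) is straightforward to evaluate numerically. The following sketch recovers the penalty induced by soft thresholding on a grid: it inverts $\delta_\lambda$ via the generalized inverse $r_\lambda$ and integrates by the trapezoidal rule, then checks agreement with the $L_1$ penalty $\lambda|\theta|$ quoted in the text that follows. Function names are illustrative only.

import numpy as np

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def penalty_from_threshold(delta, theta_grid, z_max=50.0, m=20001):
    """Evaluate (3.14): p(theta) = int_0^theta (r(u) - u) du, with
    r(u) = sup{z : delta(z) <= u} computed from a dense grid of z values."""
    z = np.linspace(0.0, z_max, m)
    dz = delta(z)                                 # nondecreasing by assumption
    # generalized inverse: largest z with delta(z) <= u
    r = np.array([z[dz <= u].max() for u in theta_grid])
    integrand = r - theta_grid
    # cumulative trapezoidal integration over the theta grid
    dtheta = np.diff(theta_grid)
    cum = np.concatenate(
        ([0.0], np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * dtheta)))
    return cum

lam = 1.0
theta = np.linspace(0.0, 5.0, 501)
p = penalty_from_threshold(lambda z: soft(z, lam), theta)
assert np.allclose(p, lam * theta, atol=1e-2)   # soft thresholding <-> L1 penalty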
Applying expression (3.14) one finds, in particular, the well-known ridge regression $L_2$-penalty $p_\lambda(|\theta|) = \lambda|\theta|^2$ corresponding to the linear shrinkage function, the $L_1$-penalty $p_\lambda(|\theta|) = \lambda|\theta|$ corresponding to the soft thresholding rule, and the hard thresholding penalty function
$$p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2\, I_{\{|\theta| < \lambda\}}(|\theta|)$$
that results in the hard-thresholding rule (see Antoniadis (1997)). Figure 3 displays all penalty functions corresponding to the various shrinkage functions studied in this review.

Fig 3. Penalties corresponding to the shrinkage and thresholding functions with the same name.

Among the penalties displayed in Figure 3, the quadratic penalty, while continuous, is not singular at zero, and the resulting estimator is not thresholded. All other penalties are singular at zero, thus resulting in thresholding rules that enforce sparseness of the solution. The estimated wavelet coefficient obtained using the hard-thresholding penalty is not continuous at the threshold, so it may induce oscillation of the reconstructed signal (lack of stability). In the soft-thresholding case, the resulting estimator of large coefficients is shifted by an amount of $\lambda$, which creates unnecessary bias when the coefficients are large. The same is true for the Charbonnier and Perona–Malik penalties. All other penalties have similar behaviour. They are singular at zero (thus encouraging sparse solutions), continuous (thus stable) and do not create excessive bias when the wavelet coefficients are large. Most importantly, all the above thresholding penalties satisfy the conditions of Theorem 1 in Antoniadis and Fan (2001). The implication of this fact is that it leads to a systematic derivation of oracle inequalities and minimax properties for the resulting wavelet estimators via Theorem 2 of Antoniadis and Fan (2001). In particular, the optimal hard and soft universal threshold $\lambda = \sigma\sqrt{2\log n}$ given by Donoho and Johnstone (1994) leads to a sharp asymptotic risk upper bound, and the resulting penalized estimators are adaptively minimax within a factor of logarithmic order over a wide range of Besov spaces.

In this subsection, we illustrate the performance of the penalized least-squares methods introduced in the previous subsections by using some simulated data sets. For simulated data, we use the functions "heavisine", "blip", "corner" and "wave" as testing functions. The first three contain either jump points or cusps, while the fourth one is regular enough to see how the methods perform for regular functions. These, together with typical Gaussian noisy data with a signal-to-noise ratio (SNR) of 3, are displayed in Figure 4.

Fig 4. The four signals used in the simulations (solid lines) together with a typical noisy version (points).

For each signal, we have used two noise levels corresponding to signal-to-noise ratios 3 and 7 (low and high). For each simulation and each function, we have used an equispaced design of size 512 within the interval $[0,1]$ and Gaussian noise was added to obtain the observed data. The experiment was repeated 100 times. Figure 5 displays a typical fit of the Heavisine function from corrupted data (SNR = 3) and Table 1 reports the average mean-squared error over all Monte Carlo experiments for each method on each signal and signal-to-noise ratio combination.

As one can see, most methods perform similarly when a universal threshold is used.
Note also the bias of the soft-thresholding rule and the similarity between the firm and the SCAD shrinkage. From Table 1, we can see that the less classical nonlinear regularization methods are often superior to the hard-thresholding and the soft-thresholding methods and always better than ridge regression.

Fig 5. A typical fit of the penalized least squares methods for the Blip function corrupted with an additive Gaussian noise at a signal-to-noise ratio of 3.

Table 1
Mean Squared Errors over 100 Simulations for each signal and SNR setting.

                    Low SNR                       High SNR
Method     Heavi   Blip  Corner   Wave    Heavi   Blip  Corner   Wave
Weick         80     95      63    212       46     64      27    147
Hard          65     76      60    152       48     57      30    147
Soft         113    181      67    472       46    108      26    249
Garrote       87    103      63    256       45     71      26    160
Firm          70     78      55    185       43     58      27    133
Lin          168   1849      69   4344       45    354      27    812
Char         136    354     656   9056      446    176      26    420
Perona       105    149      60    392       43    102      27    234
Tukey         84     90      57    247       42     65      28    129
Scad          82     90      58    241       43     65      27    140

In this whole section, we have tried to present a study of different denoising methods for signals observed on a regularly spaced design and corrupted by additive Gaussian noise, and have shown that many of them lead to the idea of shrinkage in a general sense. However, as stated already in the introduction, when the data do not meet these requirements (equally spaced and fixed sample points, periodic boundary conditions and i.i.d. normal errors) various modifications have been proposed in the recent literature, and we would like to briefly mention a few of them in this subsection.

A usual assumption underlying the use of wavelet shrinkage, that we have also adopted in the previous sections of this review, is that the regression function is periodic, in order to use a wavelet transform with periodic boundary conditions. However, such an assumption is not always realistic. Oh and Lee (2005) propose an effective hybrid automatic method for correcting the boundary bias in wavelet shrinkage introduced by the inappropriateness of such a periodic assumption. Their idea is to combine wavelet shrinkage with local polynomial regression, where the latter regression technique is known to possess excellent boundary properties. Results from their numerical experiments strongly support this approach. The interested reader is referred to their paper for further details. Note that it is quite easy to implement the various thresholding strategies presented here within their hybrid approach.

For data that are not equispaced, there exist numerous wavelet methods in this setting. Among them let us cite Hall and Turlach (1997), Antoniadis et al. (1997), Cai and Brown (1998), Kovac and Silverman (2000), Antoniadis and Fan (2001), Maxim (2003), Chicken (2003), Kerkyacharian and Picard (2004) and Amato et al. (2006b). Compared to standard algorithms, the thresholding is notably more complicated because it has to incorporate the variations of the density of the design. In many of these constructions, the first step consists in determining a function $Y(x)$ of the form
$$Y(x) = \sum_m w_m(x)\,Y_m,$$
where $w_m(x)$ is a sequence of suitably chosen functions. For instance, in Hall and Turlach (1997) the $w_m$'s correspond to a polynomial which depends on the design. In Cai and Brown (1998) (and in Maxim (2003)), the $w_m$'s correspond to scale wavelets warped with the (known) design cumulative distribution function.
In Antoniadis et al. (1997), the random design is transformed into equispaced data via a binning method and the weights $w_m$ are defined by scale wavelets. In a second step, the function $Y(x)$ is expanded on a standard wavelet basis and a hard thresholding algorithm is performed. In Antoniadis and Fan (2001) the $w_m$'s are Sobolev interpolation weights. Kovac and Silverman (2000) first apply a linear transformation to the data to map it to a set of dyadic, equispaced points. Since the transformation induces a band-limited correlation on the resulting wavelet coefficients, they adopt a term-by-term thresholding method similar to VisuShrink, but taking into account the varying levels of noise. In Chicken (2003) linear transformations and block thresholding are applied to nondyadic, non-i.i.d., nonequispaced data with various band-limited covariance structures. His method makes use of Kovac and Silverman's fast algorithm for estimating the covariances of the transformed data. In all the techniques described above, the thresholds have similar forms and depend on the quantity $\sup_t s(t)$, where $s$ is the density of the design points, which corresponds to an upper bound for the variance of the estimated wavelet coefficients. The algorithm of Amato et al. (2006b) is quite different and relies upon wavelet kernel reproducing Hilbert spaces. The regularization method suggested in their paper requires neither pre-processing of the data by interpolation or a similar technique, nor knowledge of the distribution $s$ of the design points. For this reason, the method works really well even in the case when the distribution of the design points deviates far from the uniform. When the estimation error is calculated at the design points only, the method achieves optimal convergence rates in Besov spaces no matter how irregular the design is. In order to obtain asymptotic optimality in the $L^2$ metric, an extra assumption on the design points should be imposed, namely, the density $s$ of the design points should be bounded away from zero. To end this paragraph we mention the algorithm developed in Kerkyacharian and Picard (2004). Their procedure stays very close to the equispaced Donoho and Johnstone VisuShrink procedure, and thus is very simple in its form (preliminary estimators are no longer needed) and in its implementation (the standard uniform threshold suffices). On the other side, the projection is done on an unusual non-orthonormal basis, called a warped wavelet basis, so its analytic properties need to be studied to derive the performance of the estimator. Some theoretical results, including maxiset properties, are established in their paper.

Concerning the noise, we would like to mention a recent work by Brown et al. (2006), who develop a nonparametric regression method that is simultaneously adaptive over a wide range of function classes for the regression function and robust over a large collection of error distributions, including those that are heavy-tailed and may not even possess variances or means. Their approach is to first use local medians to turn the problem of nonparametric regression with unknown noise distribution into a standard Gaussian regression problem and then apply a wavelet block thresholding procedure to construct an estimator of the regression function.
It is shown that the estimator simultaneously attains the optimal rate of convergence over a wide range of Besov classes, without prior knowledge of the smoothness of the underlying functions or prior knowledge of the error distribution. The estimator also automatically adapts to the local smoothness of the underlying function, and attains the local adaptive minimax rate for estimating functions at a point. A key technical result in their development is a quantile coupling theorem which gives a tight bound for the quantile coupling between the sample medians and a normal variable.

We would like to add some comments concerning the choice of the penalty parameter $\lambda$ in finite sample situations. From a practical point of view its optimal choice is important. Given the basic framework of function estimation using wavelet thresholding and its relation with the regularization approach with penalties non-smooth at 0, there are a variety of methods to choose the regularization parameter $\lambda$. Solo (2001), in his discussion of the paper by Antoniadis and Fan (2001), suggests a data-based estimator of $\lambda$ similar in spirit to the SURE selection criterion used by Donoho and Johnstone (1994), and provides an appropriate simple SURE formula for the general penalties studied in this review. Another way to address the optimal choice of the regularization parameter is Generalized Cross Validation (GCV). Cross-validation has been widely used as an automatic procedure to choose the smoothing parameter in many statistical settings. The classical cross-validation method is performed by systematically expelling a data point from the construction of an estimate, predicting what the removed value would be and, then, comparing the prediction with the value of the expelled point. One way to proceed is to use the approach adopted by Jansen et al. (1997). It is clear that this is an area where further careful theoretical and practical work is needed.
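For the soft thresholding rule, Stein's unbiased risk estimate has a simple closed form, and minimizing it over $\lambda$ is the essence of the SURE approach mentioned above. The sketch below follows the SureShrink-style criterion of Donoho and Johnstone; it assumes a known noise standard deviation and is an illustration, not the specific estimator of Solo (2001).

import numpy as np

def sure_soft(d, t, sigma=1.0):
    """Stein's unbiased risk estimate of the l2 risk of soft thresholding
    at threshold t, for coefficients d ~ N(theta, sigma^2)."""
    return (len(d) * sigma**2
            - 2 * sigma**2 * np.sum(np.abs(d) <= t)
            + np.sum(np.minimum(d**2, t**2)))

def sure_threshold(d, sigma=1.0):
    """Choose the threshold minimizing SURE over the candidate values |d_i|."""
    candidates = np.sort(np.abs(d))
    risks = np.array([sure_soft(d, t, sigma) for t in candidates])
    return candidates[np.argmin(risks)]

rng = np.random.default_rng(2)
theta = np.zeros(256); theta[:8] = 5.0          # sparse true coefficients
d = theta + rng.standard_normal(256)
t_star = sure_threshold(d)                       # data-driven threshold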
In the previous subsections we have used separable penalties, which achieve minimax optimality through term-by-term thresholding of the empirical wavelet coefficients by realizing a degree of trade-off between the variance and bias contributions to the mean squared error. However this trade-off is not optimal. While separable rules have desirable minimax properties, they are necessarily not rate adaptive over Besov spaces under integrated mean squared error because they remove too many terms from the empirical wavelet expansion, with the result that the estimator is too biased and has a sub-optimal $L^2$-risk convergence rate (and also in $L^p$, $1 \le p \le \infty$). One way to increase estimation precision is by utilizing information about neighboring empirical wavelet coefficients. In other words, empirical wavelet coefficients could be thresholded in blocks (or groups) rather than individually. As a result, the amount of information available from the data for estimating the 'average' empirical wavelet coefficient within a block, and making a decision about retaining or discarding it, would be an order of magnitude larger than in the case of a term-by-term threshold rule. This would allow threshold decisions to be made more accurately and permit convergence rates to be improved. This is the spirit of the block thresholding rules and blockwise additive penalty functions that have been and are currently being theoretically explored by many researchers (see e.g. Hall et al. (1999), Cai and Brown (1999), Cai (2002), Cai and Silverman (2001), Chicken (2003) and Chicken and Cai (2005), Cai and Zhou (2005)). For completeness we present below some more details on such block-wise thresholding procedures.

A nonoverlapping block thresholding estimator was proposed by Cai and Brown (1999) via the approach of ideal adaptation with the help of an oracle. At each resolution level $j = j_0, \dots, J-1$, the empirical wavelet coefficients $\hat{d}_{jk}$ are grouped into nonoverlapping blocks of length $L$. In each case, the first few empirical wavelet coefficients might be re-used to fill the last block (which is called the augmented case) or the last few remaining empirical wavelet coefficients might not be used in the inference (which is called the truncated case), should $L$ not divide $2^j$ exactly. Let $(jb)$ denote the set of indices of the coefficients in the $b$th block at level $j$, that is,
$$(jb) = \{(j,k) : (b-1)L + 1 \le k \le bL\},$$
and let $S^2_{(jb)}$ denote the $L^2$-energy of the noisy signal in the block $(jb)$. Within each block $(jb)$, estimate the wavelet coefficients $d_{jk}$ simultaneously via the James–Stein thresholding rule
$$\tilde{d}^{(jb)}_{jk} = \max\left(0,\; \frac{S^2_{(jb)} - \lambda L \sigma^2}{S^2_{(jb)}}\right)\hat{d}_{jk}. \tag{3.15}$$
Then an estimate of the unknown function $g$ is obtained by applying the IDWT to the vector consisting of both the empirical scaling coefficients $\hat{c}_{j_0 k}$ ($k = 0, 1, \dots, 2^{j_0}-1$) and the thresholded empirical wavelet coefficients $\tilde{d}^{(jb)}_{jk}$ ($j = j_0, \dots, J-1$, $k = 0, 1, \dots, 2^j - 1$). Cai and Brown (1999) suggested using $L = \log n$ and setting $\lambda = 4.50524$, the solution of $\lambda - \log \lambda = 3$; the resulting estimator is called BlockJS.

Remark 3.1. Hall et al. (1997), Hall et al. (1998) and Hall et al. (1999) considered wavelet block thresholding estimators obtained by first computing a near unbiased estimate of the $L^2$-energy of the true coefficients within a block and then keeping or killing all the empirical wavelet coefficients within the block based on the magnitude of the estimate. Although it would be interesting to numerically compare their estimators, these require the selection of smoothing parameters – block length and threshold level – and it seems that no specific criterion is provided for choosing these parameters in finite sample situations.
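A compact sketch of the nonoverlapping rule (3.15), in the truncated variant and with $\sigma$ assumed known, may clarify the bookkeeping. The level-by-level coefficient layout follows the PyWavelets convention used earlier; this is an illustration, not Cai and Brown's reference implementation.

import numpy as np
import pywt

def block_js(y, sigma, wavelet='db4', level=5):
    """Nonoverlapping block thresholding (BlockJS-style), truncated case."""
    n = len(y)
    L = max(1, int(np.log(n)))            # block length L = log n
    lam = 4.50524                         # solution of lam - log(lam) = 3
    coeffs = pywt.wavedec(y, wavelet, mode='periodization', level=level)
    for j in range(1, len(coeffs)):       # leave scaling coefficients intact
        d = coeffs[j]
        for b in range(len(d) // L):      # incomplete tail block left as-is here
            blk = slice(b * L, (b + 1) * L)
            S2 = np.sum(d[blk] ** 2)      # L2-energy of the block
            shrink = max(0.0, (S2 - lam * L * sigma**2) / S2) if S2 > 0 else 0.0
            d[blk] = shrink * d[blk]
    return pywt.waverec(coeffs, wavelet, mode='periodization')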
3.5.2. An overlapping block thresholding estimator

Cai and Silverman (2001) considered an overlapping block thresholding estimator by modifying the nonoverlapping block thresholding estimator of Cai (1999). The effect is that the treatment of empirical wavelet coefficients in the middle of each block depends on the data in the whole block. At each resolution level $j = j_0, \dots, J-1$, one groups the empirical wavelet coefficients $\hat{d}_{jk}$ into nonoverlapping blocks $(jb)$ of length $L_0$. Each block is then extended by an amount $L_1 = \max(1, [L_0/2])$ in each direction to form overlapping larger blocks $(jB)$ of length $L_2 = L_0 + 2L_1$. Let $S^2_{(jB)}$ denote the $L^2$-energy of the noisy signal in the larger block $(jB)$. Within each block $(jb)$, estimate the wavelet coefficients simultaneously via the following James–Stein thresholding rule
$$\breve{d}^{(jb)}_{jk} = \max\left(0,\; \frac{S^2_{(jB)} - \lambda L_0 \hat{\sigma}^2}{S^2_{(jB)}}\right)\hat{d}_{jk}. \tag{3.16}$$
Then an estimate of the unknown function $g$ is obtained by applying the IDWT to the vector consisting of both the empirical scaling coefficients $\hat{c}_{j_0 k}$ ($k = 0, 1, \dots, 2^{j_0}-1$) and the thresholded empirical wavelet coefficients $\breve{d}^{(jb)}_{jk}$ ($j = j_0, \dots, J-1$, $k = 0, 1, \dots, 2^j - 1$). Cai and Silverman (2001) suggested using $L_0 = [\log n/2]$ and taking $\lambda = 4.50524$ (which results in the NeighBlock estimator), or $L_0 = L_1 = 1$ (i.e., $L_2 = 3$) and taking $\lambda = \frac{2}{3}\log n$ (which results in the NeighCoeff estimator). NeighBlock uses neighbouring coefficients outside the block of current interest in fixing the threshold, whilst NeighCoeff chooses a threshold for each coefficient by reference not only to that coefficient but also to its neighbours.

Remark 3.2. The above thresholding rule (3.16) is different from the one given in (3.15), since the empirical wavelet coefficients $\hat{d}_{jk}$ are thresholded with reference to the coefficients in the larger block $(jB)$. One can envision $(jB)$ as a sliding window which moves $L_0$ positions each time and, for each window, only half of the coefficients in the center of the window are estimated.

As noticed above, the block size and threshold level play important roles in the performance of a block thresholding estimator. The local block thresholding methods mentioned above all have fixed block size and threshold, and the same thresholding rule is applied at all resolution levels regardless of the distribution of the wavelet coefficients. In a recent paper, Cai and Zhou (2005) propose a data-driven approach to empirically select both the block size and threshold at individual resolution levels. At each resolution level, their procedure, named SureBlock, chooses the block size and threshold empirically by minimizing Stein's Unbiased Risk Estimate (SURE). By empirically selecting both the block size and threshold and allowing them to vary from resolution level to resolution level, the SureBlock estimator has significant advantages over the more conventional wavelet thresholding estimators with fixed block sizes. For more details the reader is referred to the above paper.

We would like to end this subsection with some remarks on the use of block thresholding for density estimation by Chicken and Cai (2005). The reader interested in other wavelet based methods for density estimation from i.i.d. observations is referred to the review paper by Antoniadis (1997), where some linear and nonlinear wavelet estimators in univariate density estimation are discussed in detail. Chicken and Cai's block thresholding procedure first divides the empirical coefficients at each resolution level into nonoverlapping blocks and then simultaneously keeps or kills all the coefficients within a block, based on the magnitude of the sum of the squared empirical coefficients within that block. Motivated by the analysis of block thresholding rules for nonparametric regression, the block size is chosen to be $\log n$. It is shown that the block thresholding estimator adaptively achieves not only the optimal global rate over Besov spaces, but simultaneously attains the adaptive local convergence rate as well. These results are obtained in part through the determination of the optimal block length.

4. Some applications

In this section we present two recent applications of wavelets in statistics. Each application begins with a description of the problem under study and points to specific properties and techniques which were used to determine a solution. Of course many other excellent applied works on wavelets have been presented in the literature, and we find it impossible to list them all in this review. We will mainly concentrate on some important uses of wavelet methods in statistics, developed recently by the author and his collaborators.
4.1. Wavelet thresholding in partially linear models

This subsection is concerned with a semiparametric partially linear regression model with unknown regression coefficients, an unknown nonparametric function for the non-linear component, and unobservable Gaussian distributed random errors. We present a wavelet thresholding based estimation procedure for the components of the partial linear model, built on a connection between an $\ell^1$-penalty based wavelet estimator of the nonparametric component and Huber's M-estimation of a standard linear model with outliers.

Assume that responses $y_1, \ldots, y_n$ are observed at deterministic equidistant points $t_i = i/n$ of a univariate variable such as time, and for fixed values $X_i$, $i = 1, \ldots, n$, of some $p$-dimensional explanatory variable, and that the relation between the response and predictor values is modeled by a Partially Linear Model (PLM):
$$y_i = X_i^T \beta + f(t_i) + u_i, \quad i = 1, \ldots, n, \tag{4.1}$$
where $\beta$ is an unknown $p$-dimensional real parameter vector and $f$ is an unknown real-valued function; the $u_i$'s are i.i.d. normal errors with mean 0 and variance $\sigma^2$, and the superscript "T" denotes the transpose of a vector or matrix. We will assume hereafter that the sample size is $n = 2^J$ for some positive integer $J$. Given the observed data $(y_i, X_i)_{i=1,\ldots,n}$, the aim is to estimate from the data the vector $\beta$ and the function $f$.

Partially linear models are more flexible than the standard linear model because they combine both parametric and nonparametric components; they are appropriate when it is believed that the response variable $Y$ depends on the variable $X$ in a linear way but is nonlinearly related to the other independent variable $t$. As is well known, model (4.1) is often used when the researcher knows more about the dependence between $y$ and $X$ than about the relationship between $y$ and the predictor $t$, which establishes an unevenness in prior knowledge. Several methods have been proposed in the literature for partially linear models. One approach to the estimation of the nonparametric component in these models is based on smoothing splines regression techniques and has been employed in particular by Green and Yandell (1985), Engle et al. (1986) and Rice (1986), among others. Kernel regression (see e.g. Speckman (1988)) and local polynomial fitting techniques (see e.g. Hamilton and Truong (1997)) have also been used to study partially linear models. An important assumption made by all these methods is the high smoothness of the unknown nonparametric component $f$. In reality, such a strong assumption may not be satisfied. To deal with cases of a less-smooth nonparametric component, a wavelet based estimation procedure, developed recently by Gannaz (2006) (a PhD student of the author), is presented in what follows; it can handle nonparametric estimation for curves lying in Besov spaces instead of the more classical Sobolev spaces. To capture key characteristics of variations in $f$ and to exploit the sparsity of its wavelet coefficients representation, we assume that $f$ belongs to $\mathcal{B}^s_{\pi,r}([0,1])$ with $s + 1/\pi - 1/2 > 0$; the latter condition ensures that the evaluation of $f$ at a given point makes sense.

In matrix notation, the PLM specified by (4.1) can be written as
$$Y = X\beta + F + U, \tag{4.2}$$
where $Y = (y_1, \ldots, y_n)^T$, $X^T = (X_1, \ldots, X_n)$ is the $p \times n$ design matrix, and $F = (f(t_1), \ldots, f(t_n))^T$. The noise vector $U = (u_1, \ldots, u_n)^T$ is a Gaussian vector with mean 0 and variance matrix $\sigma^2 I_n$.
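As a running illustration, the following short sketch simulates data from the PLM (4.1)-(4.2); the particular design, function $f$, coefficients and noise level are illustrative choices of ours, not those used by Gannaz (2006):

```python
import numpy as np

rng = np.random.default_rng(0)
J = 9
n, p = 2 ** J, 3                        # dyadic sample size n = 2^J
t = np.arange(1, n + 1) / n             # equidistant design points t_i = i/n
X = rng.normal(size=(n, p))             # fixed p-dimensional covariates
beta = np.array([1.0, -0.5, 2.0])       # true linear coefficients
f = np.sin(4 * np.pi * t) * np.exp(-3 * t)    # nonparametric component
sigma = 0.3
Y = X @ beta + f + sigma * rng.normal(size=n)  # (4.2): Y = X beta + F + U
```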
Let now $Z = WY$, $A = WX$, $\theta = WF$ and $\epsilon = WU$, where $W$ is the discrete wavelet transform operator. Pre-multiplying (4.2) by $W$, we obtain the transformed model
$$Z = A\beta + \theta + \epsilon. \tag{4.3}$$
The orthogonality of the DWT matrix $W$ ensures that the transformed noise vector $\epsilon$ is still distributed as a Gaussian white noise with variance matrix $\sigma^2 I_n$. Hence, the representation of the model in the wavelet domain not only retains the partly linear structure of the model but also allows one to exploit in an efficient way the sparsity of the wavelet coefficients in the representation of the nonparametric component.

With the basic understanding of wavelet based penalized least squares procedures covered in depth in Section 3, we propose estimating the parameters $\beta$ and $\theta$ in model (4.3) by penalized least squares. To be specific, our wavelet based estimators are defined as follows:
$$(\hat{\beta}_n, \hat{\theta}_n) = \operatorname*{argmin}_{(\beta, \theta)} J_n(\beta, \theta), \qquad J_n(\beta, \theta) = \sum_{i=1}^n \frac{1}{2}\left(z_i - A_i^T\beta - \theta_i\right)^2 + \lambda \sum_{i=i_0}^n |\theta_i|, \tag{4.4}$$
for a given penalty parameter $\lambda$, where $i_0 = 2^{j_0} + 1$. The penalty term in the above expression penalizes only the empirical wavelet coefficients of the nonparametric part of the model and not its scaling coefficients. Remember that the $\ell^1$-penalty is associated with the soft thresholding rule.

The regularization method proposed above is closely related to the method proposed recently by Chang and Qu (2004), who essentially concentrate on the backfitting algorithms involved in the optimization, without any theoretical study of the resulting estimates. The method also relates to the recent one developed by Fadili and Bullmore (2005), where a variety of penalties is discussed. Note, however, that their asymptotic study is limited to quadratic penalties, which essentially amounts to assuming that the underlying function $f$ belongs to some Sobolev space.

Let us now have a closer look at the minimization of the criterion $J_n$ stated in (4.4). For a fixed value of $\beta$, the criterion $J_n(\beta, \cdot)$ is minimized at
$$\tilde{\theta}_i(\beta) = \begin{cases} z_i - A_i^T\beta & \text{if } i < i_0, \\ \operatorname{sign}\left(z_i - A_i^T\beta\right)\left(|z_i - A_i^T\beta| - \lambda\right)_+ & \text{if } i \ge i_0. \end{cases} \tag{4.5}$$
Therefore, finding $\hat{\beta}_n$, a solution to problem (4.4), amounts to finding $\hat{\beta}_n$ minimizing the criterion $J_n(\beta, \tilde{\theta}(\beta))$. However, note that
$$J_n(\beta, \tilde{\theta}(\beta)) = \sum_{i=i_0}^n \rho_\lambda\left(z_i - A_i^T\beta\right), \tag{4.6}$$
where $\rho_\lambda$ is Huber's cost functional with threshold $\lambda$, defined by
$$\rho_\lambda(u) = \begin{cases} u^2/2 & \text{if } |u| \le \lambda, \\ \lambda |u| - \lambda^2/2 & \text{if } |u| > \lambda. \end{cases} \tag{4.7}$$
The above facts can be derived as follows. Let $i \ge i_0$. Minimizing expression (4.4) with respect to $\theta_i$ is equivalent to minimizing $j(\theta_i) := \frac{1}{2}(z_i - A_i^T\beta - \theta_i)^2 + \lambda|\theta_i|$. The first order condition for this is $j'(\theta_i) = \theta_i - (z_i - A_i^T\beta) + \operatorname{sign}(\theta_i)\,\lambda = 0$, where $j'$ denotes the derivative of $j$. Now,
• if $\theta_i \ge 0$, then $j'(\theta_i) = 0$ if and only if $\theta_i = z_i - A_i^T\beta - \lambda$. Hence, if $z_i - A_i^T\beta \le \lambda$, then $\theta_i = 0$, and otherwise $\theta_i = z_i - A_i^T\beta - \lambda$;
• if $\theta_i \le 0$, then $j'(\theta_i)$ is zero if and only if $\theta_i = z_i - A_i^T\beta + \lambda$; therefore, if $z_i - A_i^T\beta \ge -\lambda$, then $\theta_i = 0$, and otherwise $\theta_i = z_i - A_i^T\beta + \lambda$.
This proves that, for a fixed value of $\beta$, the criterion (4.4) is minimal at $\tilde{\theta}(\beta)$ given by expression (4.5). If we now replace $\theta$ in the objective function $J_n$ we obtain
$$J_n(\beta, \tilde{\theta}(\beta)) = \sum_{i=i_0}^n \left( \frac{1}{2}\left(z_i - A_i^T\beta - \tilde{\theta}_i\right)^2 + \lambda |\tilde{\theta}_i| \right),$$
since $\tilde{\theta}_i = z_i - A_i^T\beta$ for $i < i_0$. Now, denoting by $I$ the set $I := \{ i = i_0, \ldots, n : |z_i - A_i^T\beta| < \lambda \}$, and replacing $\tilde{\theta}_i$ with (4.5), we find that
$$J_n(\beta, \tilde{\theta}(\beta)) = \sum_{i \in I} \frac{1}{2}\left(z_i - A_i^T\beta\right)^2 + \sum_{i \in I^C} \frac{\lambda^2}{2} + \lambda \sum_{i \in I^C} \left( |z_i - A_i^T\beta| - \lambda \right),$$
which is exactly Huber's functional.
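The identity between (4.4) evaluated at the soft-thresholded coefficients (4.5) and the Huber functional (4.6)-(4.7) is easy to check numerically; here is a minimal sketch in plain NumPy (the helper names are ours):

```python
import numpy as np

def soft_threshold(r, lam):
    """Componentwise minimizer (4.5): argmin_t 0.5*(r - t)**2 + lam*|t|."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

def huber(u, lam):
    """Huber's cost functional (4.7)."""
    a = np.abs(u)
    return np.where(a <= lam, 0.5 * a ** 2, lam * a - 0.5 * lam ** 2)

rng = np.random.default_rng(1)
r = rng.normal(size=1000)       # residuals z_i - A_i^T beta for i >= i_0
lam = 0.7
th = soft_threshold(r, lam)
lhs = np.sum(0.5 * (r - th) ** 2 + lam * np.abs(th))   # J_n at theta~(beta)
assert np.isclose(lhs, np.sum(huber(r, lam)))          # equals (4.6)
```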
This result allows the computation of the estimators $\hat{\beta}_n$ and $\hat{\theta}_n$ in a non-iterative fashion. The resulting form of the estimators allows us to study their asymptotic properties (see Gannaz (2006)). Another benefit is that one can design estimation algorithms much faster than those based on backfitting.

The estimation procedure may be summarized as follows (a schematic implementation is sketched just after the list). Using the observed data $(Y, X)$:

1. Apply the DWT of order $J = \log_2(n)$ to $X$ and $Y$ to get their corresponding representations $A$ and $Z$ in the wavelet domain.
2. The parameter $\beta$ is then Huber's robust estimator, obtained without taking care of the nonparametric component in the PLM:
$$\hat{\beta}_n = \operatorname*{argmin}_{\beta} \sum_{i=i_0}^n \rho_\lambda\left(z_i - A_i^T\beta\right).$$
In other words, this amounts to considering the linear model $z_i = A_i^T\beta + e_i$ with noise $e_i = \theta_{0,i} + \epsilon_i$.
3. The vector $\theta$ of wavelet coefficients of the function $f$ is estimated by soft thresholding of $Z - A\hat{\beta}_n$:
$$\hat{\theta}_{i,n} = \begin{cases} z_i - A_i^T\hat{\beta}_n & \text{if } i < i_0, \\ \operatorname{sign}\left(z_i - A_i^T\hat{\beta}_n\right)\left(|z_i - A_i^T\hat{\beta}_n| - \lambda\right)_+ & \text{if } i \ge i_0. \end{cases}$$
4. The estimate of $f$ is then obtained by applying the inverse discrete wavelet transform. Note that this last step corresponds to a standard soft-thresholding nonparametric estimation of $f$ in the model $Y_i - X_i^T\hat{\beta}_n = f(t_i) + v_i$, $i = 1, \ldots, n$, where $v_i = X_i^T(\beta - \hat{\beta}_n) + u_i$.
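A compact sketch of the four steps, again with PyWavelets, and with the Huber minimization in step 2 carried out by iteratively reweighted least squares (one of several possible choices; all helper names are ours):

```python
import numpy as np
import pywt

def coeff_vector(v, wavelet, j0):
    """DWT of v (length 2^J), stacked as one vector: scaling
    coefficients at level j0 first, then detail coefficients."""
    level = int(np.log2(len(v))) - j0
    c = pywt.wavedec(v, wavelet, level=level, mode="periodization")
    return np.concatenate(c), [len(x) for x in c]

def huber_irls(z, A, lam, n_iter=100):
    """Huber M-estimate of beta via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(A, z, rcond=None)[0]
    for _ in range(n_iter):
        r = z - A @ beta
        w = np.minimum(1.0, lam / np.maximum(np.abs(r), 1e-12))
        Aw = A * w[:, None]
        beta = np.linalg.solve(A.T @ Aw, Aw.T @ z)
    return beta

def plm_fit(Y, X, lam, wavelet="db4", j0=3):
    """Steps 1-4: DWT, Huber regression on the detail coefficients,
    soft thresholding of Z - A beta_hat, inverse DWT."""
    Z, lens = coeff_vector(Y, wavelet, j0)                        # step 1
    A = np.column_stack([coeff_vector(X[:, j], wavelet, j0)[0]
                         for j in range(X.shape[1])])
    i0 = lens[0]                        # scaling coefficients come first
    beta = huber_irls(Z[i0:], A[i0:], lam)                        # step 2
    theta = Z - A @ beta                                          # step 3
    theta[i0:] = np.sign(theta[i0:]) * np.maximum(
        np.abs(theta[i0:]) - lam, 0.0)
    parts, s = [], 0
    for ln in lens:                     # unstack into pywt's coefficient list
        parts.append(theta[s:s + ln]); s += ln
    return beta, pywt.waverec(parts, wavelet, mode="periodization")  # step 4
```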
Remark 4.1. The wavelet soft-thresholding procedure proposed above is derived by exploiting the connection between an $\ell^1$-based penalization of the wavelet coefficients of $f$ and Huber's M-estimation in a linear model. Other penalties, leading to different thresholding procedures, can also be seen as M-estimation procedures. For example, if $\delta_\lambda$ denotes the resulting thresholding function, we can show in a similar way that the estimators satisfy
$$\hat{\beta}_n = \operatorname*{argmin}_{\beta} \sum_{i=i_0}^n \rho_\lambda\left(z_i - A_i^T\beta\right), \qquad \hat{\theta}_{i,n} = \begin{cases} z_i - A_i^T\hat{\beta}_n & \text{if } i < i_0, \\ \delta_\lambda\left(z_i - A_i^T\hat{\beta}_n\right) & \text{if } i \ge i_0, \end{cases} \quad i = 1, \ldots, n,$$
with $\rho_\lambda$ being the primitive of $u \mapsto u - \delta_\lambda(u)$. From the above, one sees that hard thresholding corresponds to mean truncation, while SCAD thresholding is associated with Hampel's M-estimation. However, in this subsection we only concentrate on the properties of estimators obtained by soft thresholding.

Existing results for semi-parametric partially linear models establish parametric rates of convergence for the linear part and minimax rates for the nonparametric part, showing in particular that the existence of a linear component does not change the rates of convergence of the nonparametric component. Within the framework adopted in this paper, the rates of convergence are similar, but an extra logarithmic term appears in the rates of the parametric part, mainly due to the fact that the smoothness assumptions on the nonparametric part are weaker. Indeed, under appropriate assumptions, one has:

Proposition 4.1. Let $\hat{\beta}_n$ and $\hat{\theta}_n$ be the estimators defined above. Assume that the penalty parameter $\lambda$ is the universal threshold $\lambda = \sigma\sqrt{2\log(n)/n}$. Then it holds that
$$\hat{\beta}_n - \beta = O_P\left( \sqrt{\frac{\log(n)}{n}} \right).$$
If in addition we assume that the scaling function $\phi$ and the mother wavelet $\psi$ belong to $\mathcal{C}^R$ and that $\psi$ has $N$ vanishing moments, then, for $f$ belonging to the Besov space $\mathcal{B}^s_{\pi,r}$ with $0 < s < \min(R, N)$, one has
$$\| \hat{f}_n - f \|^2 = O_P\left( \left( \frac{\log(n)}{n} \right)^{\frac{2s}{1+2s}} \right),$$
where $\| \hat{f}_n - f \|^2 = \int (\hat{f}_n - f)^2$.

A proof of this proposition, together with the relevant assumptions, may be found in Gannaz (2006). The presence of a logarithmic loss lies in the choice of the threshold $\lambda$: taking a $\lambda$ which tends to 0 faster, as suggested by Fadili and Bullmore (2005), would lead to a minimax rate in the estimation of $\beta$. The drawback is that the quality of the estimation of the nonparametric part of the PLM would no longer be quasi-minimax. This phenomenon was put in evidence by Rice (1986): a compromise must be made between the optimality of the linear part estimation, with an oversmoothing of the functional estimation, and a loss in the convergence rate of the linear regression parameter, with a correct smoothing of the functional part.

The estimation procedure relies upon knowledge of the variance $\sigma^2$ of the noise, which appears in the expression of the threshold $\lambda$. In practice, this variance is unknown and needs to be estimated. In wavelet approaches for standard nonparametric regression, a popular and well behaved estimator for the unknown standard deviation of the noise is the median absolute deviation (MAD) of the finest detail coefficients of the response, divided by 0.6745 (see Donoho et al. (1995)). The use of the MAD makes sense provided that the wavelet representation of the signal to be denoised is sparse. However, such an estimation procedure cannot be applied without some pretreatment of the data in a partially linear model, because the wavelet representation of the linear part of a PLM may not be sparse.

A QR decomposition of the regression matrix of the PLM allows one to eliminate this bias. Since the wavelet coefficients of the function at coarse resolutions are often not sparse, we only consider the wavelet representation at the finest level $J = \log_2(n)$. Let $A_J$ be the wavelet representation of the design matrix $X$ at level $J$. The QR decomposition ensures that there exist an orthogonal matrix $Q$ and an upper triangular matrix $R$ such that
$$A_J = Q \begin{pmatrix} R \\ 0 \end{pmatrix}.$$
If $z_J$, $\theta_{0,J}$ and $\epsilon_J$ denote respectively the vectors of the wavelet coefficients at resolution $J$ of $Y$, $F$ and $U$, model (4.3) gives
$$Q^T z_J = \begin{pmatrix} R \\ 0 \end{pmatrix} \beta + Q^T \theta_{0,J} + Q^T \epsilon_J.$$
It is easy to see that applying the MAD estimation to the last components of $Q^T z_J$, rather than to $z_J$, leads to a satisfactory estimation of $\sigma$. Indeed, thanks to the QR decomposition, the linear part no longer appears in the estimation, and thus the framework is similar to the one used in nonparametric regression. Following Donoho and Johnstone (1998), the sparsity of the representation of the functional part ensures good properties of the resulting estimator. The interested reader will find in Gannaz (2006) two particular optimization algorithms that may be used for estimating the linear part of the model in a computationally efficient way.
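A minimal sketch of this variance estimator (the helper name is ours; the finest-level detail coefficients are obtained from a one-level DWT, and the QR factorization projects the linear part out before the MAD rule is applied):

```python
import numpy as np
import pywt

def sigma_mad_qr(Y, X, wavelet="db4"):
    """MAD estimate of sigma in a PLM, after a QR pretreatment."""
    _, zJ = pywt.dwt(Y, wavelet, mode="periodization")   # finest details of Y
    AJ = np.column_stack(
        [pywt.dwt(X[:, j], wavelet, mode="periodization")[1]
         for j in range(X.shape[1])])                    # finest details of X
    Q, _ = np.linalg.qr(AJ, mode="complete")             # full orthogonal Q
    resid = (Q.T @ zJ)[X.shape[1]:]    # last components: linear part removed
    return np.median(np.abs(resid)) / 0.6745
```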
4.2. Dimension reduction in functional regression

In functional regression problems, one has a response variable $Y$ to be predicted by a set of variables $X_1, \ldots, X_p$ that are discretizations of a same curve $X$ at points $t_1, \ldots, t_p$, that is, $X_j = X(t_j)$, $j = 1, \ldots, p$, where the discretization time points $t_j$ lie in $[0, 1]$ without loss of generality. A typical set-up includes near infra-red spectroscopy ($Y$ represents the proportion of a chemical constituent and $X$ is the spectrum of a sample, discretized at a sequence of wavelengths).

In practice, before applying any nonparametric regression technique to model real data, and in order to avoid the curse of dimensionality, a dimension reduction or model selection technique is crucial for appropriate smoothing. A possible approach is to find a functional basis, decompose the covariate curve $X(t)$ accordingly, and work with the coefficients, in a spirit similar to that adopted by Martens and Naes (1989, chapter 3), Alsberg (1993) and Denham and Brown (1993). The aim is to explain or predict the response through an expansion of the explanatory process in a relatively low dimensional basis in the space spanned by the measurements of $X$, thus revealing relationships that may not be otherwise apparent. Dimension reduction without loss of information is a dominant theme in such cases: one tries to reduce the dimension of $X$ without losing information on $Y|X$, and without requiring a model for $Y|X$. Borrowing terminology from classical statistics, Cook (2000) calls this a sufficient dimension reduction, which leads to the pursuit of sufficient summaries containing all of the information on $Y|X$ that is available from the sample.

In this subsection we describe a wavelet based approach for regression problems with high dimensional $X$ variables, developed recently by the author and his co-authors (see Amato et al. (2006a)). The methodology relies upon the above notion of sufficient dimension reduction and is based on developments in a recent paper by Xia et al. (2002), where an adaptive approach for effective dimension reduction, called (conditional) minimum average variance estimation (MAVE), is proposed within a quite general setting.

We describe below the application of the MAVE method via a wavelet decomposition of the explanatory covariate process. We suppose that each realization $X(t)$ of the covariate process is modelled as $X(t) = f(t) + s(t)$, $t \in [0, 1]$, where $f(t)$ represents the mean at time $t$ and $s(t)$ is the observed residual variation, which will be regarded as a realization of a second order weakly stationary process. Since a large number of signal compression algorithms are based on second order statistical information, we will concentrate on covariance modelling, and the mean function will be removed by filtering or simple averaging. Thus, we assume hereafter that the covariate process has been centered, so that $E(X(t)) = 0$ for all $t$.

Consider then the following model
$$Y = g\left( \langle \beta_1, X \rangle, \ldots, \langle \beta_K, X \rangle \right) + \varepsilon, \tag{4.8}$$
where $\varepsilon$ is a scalar random variable independent of $X(t)$, $\{\beta_s(t), s = 1, \ldots, K\}$ are $K$ orthonormal functions belonging to $L^2([0,1])$, and $g$ is a smooth link function of $\mathbb{R}^K$ into $\mathbb{R}$. For $D = K$, we obtain a standard regression model with all explanatory variables $\langle \beta_s, X \rangle$, $s = 1, \ldots, K$, entering independently. Provided that $D < K$, the regression function depends on $X$ only through $D$ linear functionals of the explanatory process $X$. Hence, to explain the dependent variable $Y$, the space of $K$ explanatory variables can be reduced to a space with a smaller dimension $D$.
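As an illustration of the kind of data targeted by model (4.8), the following sketch simulates a toy sample: $X$ is a standard Brownian motion discretized at $p = 128$ points, and the link, index functions and noise level are illustrative choices of ours (they mimic, but are not claimed to reproduce, the simulation study discussed at the end of this section):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 128
t = np.arange(1, p + 1) / p
# Brownian motion paths: cumulative sums of N(0, 1/p) increments
X = np.cumsum(rng.normal(scale=np.sqrt(1.0 / p), size=(n, p)), axis=1)
b1, b2 = np.sin(3 * np.pi * t / 2), np.sin(5 * np.pi * t / 2)
proj = lambda b: X @ b / p              # discretized <b, X> on [0, 1]
Y = np.exp(proj(b1)) + np.exp(np.abs(proj(b2))) + 0.1 * rng.normal(size=n)
```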
The dimension reduction methods aim to find the dimension $D$ of the reduction space and a basis defining this space.

Given a multiresolution analysis of $L^2([0,1])$ with coarsest level $j_0$, as seen in Section 2, both $X(t)$ and $\beta_s(t)$ can be decomposed as
$$X(t) = \sum_{k=0}^{2^{j_0}-1} \xi_{j_0,k}\, \phi_{j_0,k} + \sum_{j \ge j_0} \sum_{k=0}^{2^{j}-1} \eta_{j,k}\, \psi_{j,k},$$
$$\beta_s(t) = \sum_{k=0}^{2^{j_0}-1} c^s_{j_0,k}\, \phi_{j_0,k} + \sum_{j \ge j_0} \sum_{k=0}^{2^{j}-1} d^s_{j,k}\, \psi_{j,k},$$
with $\xi_{j_0,k} = \langle X, \phi_{j_0,k} \rangle$, $\eta_{j,k} = \langle X, \psi_{j,k} \rangle$, $c^s_{j_0,k} = \langle \beta_s, \phi_{j_0,k} \rangle$ and $d^s_{j,k} = \langle \beta_s, \psi_{j,k} \rangle$, $s = 1, \ldots, K$, and with $\{\xi_{j_0,k}, \eta_{j,k}\}_{j,k}$ sequences of random variables. By the isometry between $L^2([0,1])$ and $\ell^2(\mathbb{R})$, model (4.8) can also be written as
$$Y = g\left( \langle \beta_1, \gamma \rangle, \ldots, \langle \beta_K, \gamma \rangle \right) + \varepsilon. \tag{4.9}$$
We have indicated by $\beta_s$ the $\ell^2$-sequence formed by the wavelet and scaling coefficients of $\beta_s(t)$, $s = 1, \ldots, K$, and by $\gamma$ the $\ell^2$-sequence formed by the wavelet and scaling coefficients of $X(t)$.

Usually, the sample paths of the process $X$ are discretized. If we observe $p = 2^J$ values of $X(t)$, $(X_1, \ldots, X_p) = (X(t_1), \ldots, X(t_p))$, then, given the previous notation, $X(t)$ can be approximated by its 'empirical' projection onto the approximation space $V_J$, defined as
$$X^J(t) = \sum_{k=0}^{2^{j_0}-1} \tilde{\xi}_{j_0,k}\, \phi_{j_0,k} + \sum_{j=j_0}^{J-1} \sum_{k=0}^{2^{j}-1} \tilde{\eta}_{j,k}\, \psi_{j,k},$$
where $\tilde{\xi}_{j_0,k}$ and $\tilde{\eta}_{j,k}$ are the empirical scaling and wavelet coefficients. We will collect them into a vector $\tilde{\gamma}^J \in \mathbb{R}^p$. Let $\beta^J_s$ be the projection of $\beta_s$ onto $V_J$ and let us denote by $\beta^J_s \in \mathbb{R}^p$ the vector collecting its scaling and wavelet coefficients, $s = 1, \ldots, K$. The original model (4.9) is then replaced by its discrete counterpart
$$Y = g\left( \langle \beta^J_1, \tilde{\gamma}^J \rangle, \ldots, \langle \beta^J_K, \tilde{\gamma}^J \rangle \right) + \varepsilon. \tag{4.10}$$
Provided $K$ and $J$ are large enough, and thanks to the isometries between $L^2$ and $\ell^2$ and the compression properties of the wavelet transform, the original functional regression model (4.8) may be replaced by the above model (4.10), which is the candidate for further dimension reduction by MAVE. A compact way to write down model (4.10) is
$$Y = g\left( B^T \tilde{\gamma}^J \right) + \varepsilon, \tag{4.11}$$
with $B \in \mathbb{R}^{p \times K}$, $p = 2^J$. The MAVE method is then applied to Eq. (4.11). For the sake of completeness, we briefly describe hereafter the MAVE method, as it is applied to data obeying model (4.11).

Let $d$ represent now the working dimension, $1 \le d \le K$. For an assumed number $d$ of directions in model (4.10) and known directions $B$, one would typically minimize
$$\min E\left\{ Y - E\left( Y \mid B^T\tilde{\gamma}^J \right) \right\}^2$$
to obtain a nonparametric estimate of the regression function $E(Y \mid B^T\tilde{\gamma}^J)$. The MAVE method is based on local linear regression, which hinges, at a point $\tilde{\gamma}^J_0$, on the linear approximation
$$E\left( Y \mid B^T\tilde{\gamma}^J \right) \approx a_0 + b_0^T B^T \left( \tilde{\gamma}^J - \tilde{\gamma}^J_0 \right). \tag{4.12}$$
Now, if the directions $B$ are not known, we have to search for their approximation $\hat{B}$. Xia et al. (2002) propose to plug in the unknown directions $B$ in the local linear approximation of the regression function and to optimize simultaneously with respect to $B$ and the local parameters $a$ and $b$ of local linear smoothing. Hence, given a sample $(\tilde{\gamma}^J_i, Y_i)_{i=1}^n$ from $(\tilde{\gamma}^J, Y)$, they perform local linear regression at every $\tilde{\gamma}^J_0 = \tilde{\gamma}^J_l$, $l = 1, \ldots, n$, and end up minimizing
$$\min_{\substack{B :\, B^T B = I_K \\ a_l,\, b_l,\ l = 1, \ldots, n}} \ \sum_{l=1}^n \sum_{i=1}^n \left\{ Y_i - \left[ a_l + b_l^T B^T \left( \tilde{\gamma}^J_i - \tilde{\gamma}^J_l \right) \right] \right\}^2 w_{il}, \tag{4.13}$$
where $I_K$ represents the $K \times K$ identity matrix and the $w_{il}$ are weights describing the local character of the linear approximation (4.12) (i.e., $w_{il}$ should depend on the distance between the points $\tilde{\gamma}^J_i$ and $\tilde{\gamma}^J_l$).
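In practice the coefficient vectors $\tilde{\gamma}^J_i$ entering (4.10)-(4.11) are simply the stacked DWT coefficients of each discretized curve. A short sketch, continuing the toy sample X from the previous code block (the helper name is ours):

```python
import numpy as np
import pywt

def curve_coeffs(x, wavelet="db4", j0=2):
    """Stacked empirical scaling and wavelet coefficients of a
    discretized curve x of length p = 2^J: a vector in R^p."""
    level = int(np.log2(len(x))) - j0
    return np.concatenate(
        pywt.wavedec(x, wavelet, level=level, mode="periodization"))

# One coefficient vector per curve: gamma[i] plays the role of
# \tilde{gamma}^J_i; for candidate directions B (a p x d matrix),
# the reduced covariates in (4.11) are simply gamma @ B.
gamma = np.apply_along_axis(curve_coeffs, 1, X)   # shape (n, p)
```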
Xia et al. (2002) call the resulting estimator of $B$ the MAVE estimator, and show that the simultaneous minimisation with respect to the local linear approximation, given by the $a_l$, $b_l$, and to the directions $B$ results in a convergence rate superior to other dimension-reduction methods. Initially, a natural choice of weights is given by a multidimensional kernel function $K_h$. At a given $\tilde{\gamma}^J_0$,
$$w_{i0} = K_h\left( \tilde{\gamma}^J_i - \tilde{\gamma}^J_0 \right) \Big/ \sum_{i=1}^n K_h\left( \tilde{\gamma}^J_i - \tilde{\gamma}^J_0 \right),$$
for $i = 1, \ldots, n$ and a kernel function $K_h(\cdot)$, where $h$ refers to a bandwidth. Additionally, when we already have an initial estimate $\hat{B}$ of the dimension reduction space, it is possible to iterate and search for an improved estimate of the reduction space. Xia et al. (2002) do so by using the initial estimator $\hat{B}$ to measure distances between the points $\tilde{\gamma}^J_i$ and $\tilde{\gamma}^J_0$ in the reduced space. More precisely, they propose to choose in the iterative step the weights
$$w_{i0} = K_h\left( \hat{B}^T(\tilde{\gamma}^J_i - \tilde{\gamma}^J_0) \right) \Big/ \sum_{i=1}^n K_h\left( \hat{B}^T(\tilde{\gamma}^J_i - \tilde{\gamma}^J_0) \right).$$
Repeating such iteration steps until convergence results in a refined MAVE (rMAVE) estimator. As one sees from the above equations, the initial estimate $\hat{B}$ depends on local linear smoothing performed with weights computed via a multidimensional kernel on $\mathbb{R}^p$. Since, by Theorem 1 in Xia et al. (2002), the optimal kernel bandwidth $h$ must be such that $h = O(\log n / n^{1/p})$, in order to avoid the curse of dimensionality and to stabilize the computations it is therefore advisable in practice to reduce the initial resolution $\log_2 p$ to some resolution $J < \log_2 p$, in such a way that the approximation of $\tilde{\gamma}$ by its projection $\tilde{\gamma}^J$ does not affect the asymptotics. When assuming that the process $X$ is $\alpha$-regular, such a condition holds if $2^{\alpha J} = O(n^{1/p})$ (see Amato et al. (2006a)). From now on, whenever we refer to MAVE, we mean its refined version rMAVE with such a choice of smoothing parameters. One may show that, under appropriate assumptions on $X$ and $g$ and with such a choice of smoothing parameters, one gets optimal asymptotic rates. The interested reader is referred to the paper of Amato et al. (2006a) for more details.

The described methods are capable of estimating the dimension reduction space provided we can specify its dimension. To determine the dimension $d$, Xia et al. (2002) extend the cross-validation approach of Yao and Tong (1994). The cross-validation criterion is defined as
$$CV(d) = \frac{1}{n} \sum_{j=1}^n \left( Y_j - \frac{\sum_{i=1, i \neq j}^n Y_i\, K_h\left( \hat{B}^T(\tilde{\gamma}^J_i - \tilde{\gamma}^J_j) \right)}{\sum_{i=1, i \neq j}^n K_h\left( \hat{B}^T(\tilde{\gamma}^J_i - \tilde{\gamma}^J_j) \right)} \right)^2$$
for $d > 0$ and, to cover the special case where $Y$ and $X$ are independent, as
$$CV(0) = \frac{1}{n} \sum_{i=1}^n \left( Y_i - \bar{Y} \right)^2.$$
Consequently, the dimension is then determined as
$$\hat{d} = \operatorname*{argmin}_{0 \le d \le K} CV(d),$$
where $K$ represents the initial number of basis functions in model (4.8).
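A direct transcription of the criterion, with a Gaussian product kernel as an illustrative choice (the function names are ours; gamma is the coefficient matrix built above and B_hat a candidate p x d direction matrix):

```python
import numpy as np

def cv_criterion(Y, gamma, B_hat, h):
    """Leave-one-out criterion CV(d) for d = B_hat.shape[1]."""
    Z = gamma @ B_hat                              # reduced covariates (n, d)
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-D2 / (2.0 * h ** 2))               # Gaussian kernel weights
    np.fill_diagonal(K, 0.0)                       # leave one out
    Yhat = (K @ Y) / K.sum(axis=1)                 # Nadaraya-Watson fits
    return np.mean((Y - Yhat) ** 2)

def cv0(Y):
    """CV(0): the no-covariate benchmark."""
    return np.mean((Y - Y.mean()) ** 2)
```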
To illustrate the performance of the dimension reduction regression method proposed in this subsection, we report here some of the results of an extensive Monte Carlo simulation study realized in Amato et al. (2006a) for a particular model. More precisely, let $H = L^2([0,1])$, let $X = (X(t))_{t \in [0,1]}$ be a standard Brownian motion, and let $\epsilon$ be a mean zero Gaussian variable with variance $\sigma^2$, independent of $X$. All curves $\beta_i(t)$ and $X(t)$ are discretized on the same equispaced grid of $p = 128$ points $t \in [0,1]$. The observations $Y$ are generated from i.i.d. observations of $X$ and $\epsilon$ according to the following model:

Model 1.
$$Y = \exp\left( \langle \beta_1, X \rangle \right) + \exp\left( |\langle \beta_2, X \rangle| \right) + \epsilon, \quad \beta_1(t) = \sin(3\pi t/2), \ \beta_2(t) = \sin(5\pi t/2), \ \sigma = 0.1.$$
The functions $\beta_1$ and $\beta_2$ belong to the eigen-subspace of the covariance operator of $X$, and therefore this represents the ideal situation where the EDR space is included in the central subspace. On several simulation runs, the number of directions chosen by cross-validation MAVE was 2, which is optimal since the functions $\beta_1$ and $\beta_2$ are respectively the second and third eigenfunctions of the covariance operator of $X$. Whatever the estimation of the directions $\beta_i$, $i = 1, 2$, what matters is how close the estimated projections $\langle \hat{\beta}_i, X \rangle$ are to the true projections $\langle \beta_i, X \rangle$. In order to check this, Figure 6 displays the indexes $\langle \hat{\beta}_i, X \rangle$ versus $\langle \beta_i, X \rangle$ for a typical simulation run, showing a quite satisfactory estimation.

Fig 6. True projections versus estimated projections for the simulated example with the wavelet MAVE method (WM).

5. Conclusion

This survey paper has investigated several aspects of wavelet thresholding and has considered two recent applications of wavelets to solve some interesting statistical problems. We would like, however, to mention here a few more types of signal processing problems where these methods have been used in practice. Wavelet analysis and denoising have been found particularly useful in machinery fault detection. Typical examples of signals encountered in this field are vibration signals generated in defective bearings and gears rotating at constant speeds (see e.g. Lada et al. (2002)). Wavelet denoising procedures in conjunction with hypothesis testing have also been used for detecting change points in several biomedical applications. Typical examples are the detection of life-threatening cardiac arrhythmias in electrocardiographic (ECG) signals recorded during the monitoring of patients, or the detection of venous air embolism in Doppler heart sound signals recorded during surgery when the incision wounds lie above the heart. The same is true for functional analysis of variance problems and for nonparametric mixed-effects models used in proteomics (e.g. Morris and Carroll (2006), Antoniadis and Sapatinas (2007)). A number of interesting applications of wavelets may also be found in economic and financial applications (e.g. Ramsey and Lampart (1998)) and in time series analysis and prediction (e.g. Fryzlewicz et al. (2003), Antoniadis and Sapatinas (2003)). In conclusion, it is apparent that wavelets are particularly well adapted to the statistical analysis of several types of data.

References

Abramovich, F., Bailey, T. C. and Sapatinas, T. (2000). Wavelet analysis and its statistical applications. The Statistician.
Alsberg, B. K. (1993). Representation of spectra by continuous functions. Journal of Chemometrics.
Amato, U., Antoniadis, A. and De Feis, I. (2006a). Dimension reduction in functional regression with applications. Comput. Statist. Data Anal.
Amato, U., Antoniadis, A. and Pensky, M. (2006b). Wavelet kernel penalized estimation for non-equispaced design regression. Statistics and Computing.
Amato, U. and Vuza, D. (1997). Wavelet approximation of a function from samples affected by noise. Rev. Roumaine Math. Pure Appl.
Antoniadis, A. (1996). Smoothing noisy data with tapered coiflets series. Scand. J. Statist.
Antoniadis, A. (1997). Wavelets in statistics: a review (with discussion). J. Ital. Statist. Soc.
Antoniadis, A. and Fan, J. (2001). Regularization of wavelets approximations. J. Amer. Statist. Assoc.
Antoniadis, A., Grégoire, G. and Vial, P. (1997). Random design wavelet curve smoothing. Statist. Probab. Lett.
Antoniadis, A. and Sapatinas, T. (2003). Wavelet methods for continuous-time prediction using representations of autoregressive processes in Hilbert spaces. J. of Multivariate Anal.
Antoniadis, A. and Sapatinas, T. (2007). Estimation and inference in functional mixed-effects models. Comput. Statist. Data Anal.
Black, M., Sapiro, G., Marimont, D. and Heeger, D. (1998). Robust anisotropic diffusion. IEEE Transactions on Image Processing.
Brown, L. D., Cai, T. and Zhou, H. (2006). Robust nonparametric estimation via wavelet median regression. Technical report, University of Pennsylvania.
Burrus, C., Gopinath, R. and Guo, H. (1998). Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice Hall, Englewood Cliffs.
Cai, T. (1999). Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Ann. Statist.
Cai, T. (2001). Invited discussion on "Regularization of wavelets approximations" by A. Antoniadis and J. Fan. J. Amer. Statist. Assoc.
Cai, T. (2002). On block thresholding in wavelet regression: Adaptivity, block size, and threshold level. Statist. Sinica.
Cai, T. and Brown, L. (1998). Wavelet shrinkage for nonequispaced samples. Ann. Statist.
Cai, T. and Brown, L. (1999). Wavelet estimation for samples with random uniform design. Statist. Probab. Lett.
Cai, T. and Silverman, B. W. (2001). Incorporating information on neighboring coefficients into wavelet estimation. Sankhya, Series B.
Cai, T. and Zhou, H. (2005). A data-driven block thresholding approach to wavelet estimation. Technical report, Department of Statistics, University of Pennsylvania, http://stat.wharton.upenn.edu/~tcai/paper/.
Chang, X. and Qu, L. (2004). Wavelet estimation of partially linear models. Comput. Statist. Data Anal.
Charbonnier, P., Blanc-Féraud, L., Aubert, G. and Barlaud, M. (1994). Two deterministic half-quadratic regularization algorithms for computed imaging. In Proc. 1994 IEEE International Conference on Image Processing, vol. 2. IEEE Computer Society Press, Austin, TX, 168–172.
Chicken, E. (2003). Block thresholding and wavelet estimation for nonequispaced samples. J. Statist. Plann. Inference.
Chicken, E. and Cai, T. (2005). Block thresholding for density estimation: Local and global adaptivity. J. Multivariate Analysis.
Coifman, R. and Donoho, D. (1995). Translation-invariant de-noising. In Wavelets and Statistics (A. Antoniadis and G. Oppenheim, eds.), vol. 103. Springer-Verlag, New York, 125–150.
Cook, D. (2000). SAVE: A method for dimension reduction and graphics in regression. Communications in Statistics: Theory and Methods.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia. MR1162107
Denham, M. C. and Brown, P. J. (1993). Calibration with many variables. Applied Statistics.
Donoho, D., Johnstone, I., Kerkyacharian, G. and Picard, D. (1995). Wavelet shrinkage: asymptopia? (with discussion). J. Roy. Statist. Soc. Ser. B.
Donoho, D. L. (1995). De-noising by soft thresholding. IEEE Transactions on Information Theory.
Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika.
Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statist.
Droske, M. and Rumpf, M. (2004). A variational approach to non-rigid morphological registration. SIAM Appl. Math.
Engle, R., Granger, C., Rice, J. and Weiss, A. (1986). Semiparametric estimates of the relation between weather and electricity sales. J. Amer. Statist. Assoc.
Eubank, R. (1999). Nonparametric Regression and Spline Smoothing. 2nd ed. Marcel Dekker, New York. MR1680784
Fadili, J. and Bullmore, E. (2005). Penalized partially linear models using sparse representation with an application to fMRI time series. IEEE Transactions on Signal Processing.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and its Applications. Chapman & Hall, London. MR1383587
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc.
Foster, G. (1996). Wavelets for period analysis of unevenly sampled time series. The Astronomical Journal.
Fryzlewicz, P., von Sachs, R. and van Bellegem, S. (2003). Forecasting non-stationary time series by wavelet process modelling. Ann. Inst. Statist. Math.
Gannaz, I. (2006). Robust estimation and wavelet thresholding in partial linear models. Technical report, University Joseph Fourier, Grenoble, France.
Gao, H.-Y. (1998). Wavelet shrinkage denoising using the non-negative garrote. J. Comput. Graph. Statist.
Gao, H.-Y. and Bruce, A. (1997). WaveShrink with firm shrinkage. Statist. Sinica.
Green, P. and Yandell, B. (1985). Semi-parametric generalized linear models. Technical Report 2847, University of Wisconsin-Madison.
Hall, P., Kerkyacharian, G. and Picard, D. (1998). Block threshold rules for curve estimation using kernel and wavelet methods. Ann. Statist.
Hall, P., Kerkyacharian, G. and Picard, D. (1999). On the minimax optimality of block thresholded wavelet estimators. Statist. Sinica.
Hall, P., Penev, S., Kerkyacharian, G. and Picard, D. (1997). Numerical performance of block thresholded wavelet estimators. Statistics and Computing.
Hall, P. and Turlach, B. (1997). Interpolation methods for nonlinear wavelet regression with irregularly spaced design. Ann. Statist.
Hamilton, S. and Truong, Y. (1997). Local estimation in partly linear models. J. Multivariate Analysis.
Jansen, M., Malfait, M. and Bultheel, A. (1997). Generalized cross validation for wavelet thresholding. Signal Processing.
Johnstone, I. (1994). Minimax Bayes, asymptotic minimax and sparse wavelet priors. In Statistical Decision Theory and Related Topics, V (S. Gupta and J. Berger, eds.). Springer-Verlag, New York, 303–326. MR1286310
Kerkyacharian, G. and Picard, D. (2004). Regression in random design and warped wavelets. Bernoulli.
Kovac, A. and Silverman, B. W. (2000). Extending the scope of wavelet regression methods by coefficient-dependent thresholding. J. Amer. Statist. Assoc.
Lada, E. K., Lu, J.-C. and Wilson, J. R. (2002). A wavelet based procedure for process fault detection. IEEE Transactions on Semiconductor Manufacturing.
Lorenz, D. (2006). Non-convex variational denoising of images: Interpolation between hard and soft wavelet shrinkage. Current Development in Theory and Applications of Wavelets, to appear.
Mallat, S. (1999). A Wavelet Tour of Signal Processing. 2nd ed. Academic Press, San Diego. MR1614527
Marron, J., Adak, S., Johnstone, I., Neumann, M. and Patil, P. (1998). Exact risk analysis of wavelet regression. J. Comput. Graph. Statist.
Martens, H. and Naes, T. (1989). Multivariate Calibration. John Wiley & Sons, New York. MR1029523
Maxim, V. (2003). Restauration de signaux bruités observés sur des plans d'expérience aléatoires. Ph.D. thesis, University Joseph Fourier, France.
Meyer, Y. (1992). Wavelets and Operators. Cambridge University Press, Cambridge. MR1228209
Morris, J. S. and Carroll, R. J. (2006). Wavelet-based functional mixed models. J. Roy. Statist. Soc. Ser. B.
Mrázek, P., Weickert, J. and Steidl, G. (2003). Correspondences between wavelet shrinkage and nonlinear diffusion. In Scale Space Methods in Computer Vision (L. D. Griffin and M. Lillholm, eds.), vol. 2695 of Lecture Notes in Computer Science. Springer, Berlin, 101–116.
Müller, P. and Vidakovic, B. (1999). Bayesian inference in wavelet based models. Vol. 141 of Lect. Notes Statist. Springer-Verlag, New York. MR1699864
Nason, G. (1996). Wavelet shrinkage using cross-validation. J. Roy. Statist. Soc. Ser. B.
Nason, G. and Silverman, B. (1995). The stationary wavelet transform and some statistical applications. In Wavelets and Statistics (A. Antoniadis and G. Oppenheim, eds.), vol. 103 of Lect. Notes Statist. Springer-Verlag, New York, 281–300. MR1364669
Ogden, R. (1997). Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, Boston. MR1420193
Oh, H.-S. and Lee, T. C. M. (2005). Hybrid local polynomial wavelet shrinkage: wavelet regression with automatic boundary adjustment. Comput. Statist. Data Anal.
Percival, D. and Walden, A. (2000). Wavelet Methods for Time Series Analysis. Cambridge University Press, Cambridge. MR1770693
Perona, P. and Malik, J. (1990). Scale space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Ramsey, J. B. and Lampart, C. (1998). The decomposition of economic relationships by time scale using wavelets: expenditure and income. Studies in NonLinear Dynamics and Econometrics.
Rice, J. (1986). Convergence rates for partially splined models. Statist. Probab. Lett.
Solo, V. (2001). Invited discussion on "Regularization of wavelets approximations" by A. Antoniadis and J. Fan. J. Amer. Statist. Assoc.
Speckman, P. (1988). Kernel smoothing in partial linear models. J. Roy. Statist. Soc. Ser. B.
Steidl, G. and Weickert, J. (2002). Relations between soft wavelet shrinkage and total variation denoising. In Pattern Recognition (L. Van Gool, ed.), vol. 2449 of Lecture Notes in Computer Science. Springer, Berlin, 198–205.
Strang, G. and Nguyen, T. (1996). Wavelets and Filter Banks. Wellesley-Cambridge Press. MR1411910
Vidakovic, B. (1998b). Wavelet based nonparametric Bayes methods. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. Dey, P. Müller and D. Sinha, eds.), vol. 133 of Lect. Notes Statist. Springer-Verlag, New York, 133–155. MR1630079
Vidakovic, B. (1999). Statistical Modeling by Wavelets. John Wiley & Sons, New York. MR1681904
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia. MR1045442
Weaver, J. B., Xu, J., Healy, D. M. and Cromwell, L. D. (1991). Filtering noise from images with wavelet transforms. Magnetic Resonance in Medicine.
Weickert, J. (1998). Anisotropic Diffusion in Image Processing. Teubner, Stuttgart. MR1666943
Xia, Y., Tong, H., Li, W. K. and Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. J. Roy. Statist. Soc. Ser. B.
Yao, Q. and Tong, H. (1994).
On subset selection in nonparametric stochastic regression. Statist. Sinica, 4.