arXiv [stat.ME]
Kurt S. Riedel
Signal Processing and Piecewise Convex Estimation
Many problems in signal processing reduce to nonparametric function estimation. We propose a new methodology, piecewise convex fitting (PCF), and give a two-stage adaptive estimate. In the first stage, the number and location of the change points is estimated using strong smoothing. In the second stage, a constrained smoothing spline fit is performed with the smoothing level chosen to minimize the MSE. The imposed constraint is that a single change point occurs in a region about each empirical change point of the first-stage estimate. This constraint is equivalent to requiring that the third derivative of the second-stage estimate has a single sign in a small neighborhood about each first-stage change point. We sketch how PCF may be applied to signal recovery, instantaneous frequency estimation, surface reconstruction, image segmentation, spectral estimation and multivariate adaptive regression.
1. Signal Recovery
Techniques such as splines and wavelets are optimal for estimating an arbitrary function in a ball in function space. We claim that this "uniform" prior is unrealistic and that naturally occurring functions have very few inflection points. Our basic tenet is that the fitted curve should preserve the geometric fidelity of the unknown function by having the same number of inflections. Consider a signal, y_i = g(t_i) + ε_i, measured at t_i = iδ, i = 1…N, where {ε_i} are independent Gaussian random variables: ε_i ~ N(0, σ²). We say g(t) has K change points of ℓ-convexity, x_1 ≤ x_2 ≤ … ≤ x_K, if (−1)^{k−1} g^{(ℓ)}(t) ≥ 0 for x_k ≤ t ≤ x_{k+1}. In practice, we take ℓ = 2. Our goal is to estimate g(t) while preserving the geometry of g. A standard technique, kernel smoothing [10], estimates g(t) by a weighted local average: ĝ_h(t) = (1/(Nh)) Σ_{i=1}^N y_i κ((t − t_i)/h), where h is the kernel halfwidth that determines the amount of smoothing. As h increases, the random error (variance) in ĝ decreases, while the systematic error (bias) increases. In [8], the geometric faithfulness of kernel smoothers is examined in the limit N → ∞, h → 0, with Nδ = 1. The halfwidth that minimizes the mean square error scales as N^{−1/(2m+1)} for g ∈ C^m[0,1]. For ℓ = m or m − 1, this halfwidth scaling, N^{−1/(2m+1)}, produces extra (artificial) ℓ-change points with high probability. To eliminate the artificial inflection points with probability 1, the smoothing must be increased so that h ≫ N^{−1/(2ℓ+3)} ln[N]. Thus, the level of smoothing required for geometric fidelity is large enough to degrade the MSE. In [8], we propose a two-stage estimator.

Stage 1: Strongly smooth with h_N N^{1/(2ℓ+3)} / ln N → ∞. Denote the empirical ℓ-change points by x̂_k, k = 1…K̂.

Stage 2: Perform a constrained smoothing spline fit by minimizing the penalized likelihood subject to the constraints that ĝ^{(ℓ+1)}(t) does not change sign in the intervals [x̂_k − z_α σ̂_k, x̂_k + z_α σ̂_k]. The k-th empirical change point variance is σ̂_k² ≡ c σ² / (|ĝ^{(ℓ+1)}_{stage 1}(x̂_k)|² N h^{2ℓ+3}), where ĝ^{(ℓ+1)}_{stage 1}(x̂_k) is the first-stage estimate of g^{(ℓ+1)} at x̂_k, and c is a constant that depends on the kernel shape. The confidence interval parameter, z_α, is the α-quantile of the normal distribution.

To motivate this two-stage estimator, consider the smoothing dilemma: if the smoothing level is optimized for MSE, there tend to be too many ℓ-change points (wiggles); if the smoothing is chosen to eliminate the artificial change points (suppress wiggles), then the MSE suffers. Our two-stage estimate provides the best of both worlds. In [8], we show that this two-stage estimator achieves the optimal rate of convergence for functions in the Sobolev space W_{m,2}, while suppressing artificial inflection points in the neighborhood where they are likely to occur. In practice, we choose the second-stage smoothing parameter, λ_{stage 2}, by generalized cross-validation, while we scale the first-stage smoothing as h_{stage 1} = ι(N) h_GCV, where ι(N) ≡ log(N) N^α with α = 1/(2m+1) − 1/(2ℓ+3).
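As a minimal illustration of the first stage, the sketch below strongly smooths a noisy signal and reports the empirical ℓ-change points as the sign changes of the ℓ-th finite difference of the smoothed fit. The Gaussian kernel, the halfwidth, and the test function are illustrative choices for this sketch, not the tuned estimator of [8]:

```python
import math
import random

def kernel_smooth(t, y, x, h):
    """Nadaraya-Watson estimate of g(x) with a Gaussian kernel of halfwidth h."""
    w = [math.exp(-0.5 * ((x - ti) / h) ** 2) for ti in t]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def empirical_change_points(t, y, h, grid, ell=2):
    """Sign changes of the ell-th finite difference of the strongly smoothed fit."""
    ghat = [kernel_smooth(t, y, x, h) for x in grid]
    d = ghat
    for _ in range(ell):                     # ell-th divided difference
        d = [b - a for a, b in zip(d, d[1:])]
    return [grid[i] for i in range(1, len(d)) if d[i - 1] * d[i] < 0]

random.seed(0)
N = 400
t = [i / N for i in range(1, N + 1)]
# g(t) = sin(2*pi*t) has a single 2-convexity change point, at t = 0.5
y = [math.sin(2 * math.pi * ti) + random.gauss(0.0, 0.1) for ti in t]

grid = [0.2 + 0.02 * j for j in range(31)]   # interior grid, away from edge bias
cps = empirical_change_points(t, y, h=0.08, grid=grid)
print(cps)   # expect a single empirical change point near t = 0.5
```

With strong smoothing the noise-induced wiggles in the second difference are suppressed and only the true inflection survives; with a much smaller h the same code typically reports spurious change points.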
Unconstrained smoothing splines can be used in the first stage, with the smoothing parameter, λ, scaled via the correspondence λ = h_eff^{2m}, with h_eff ∼ (log N) N^{−1/(2ℓ+3)}. When ℓ = m − 1, the stage 2 minimization reduces to a finite dimensional convex minimization in the dual formulation. A simple implementation is that of Villalobos and Wahba (VW) [14], who add pointwise constraints on the sign of ĝ^{(m)}(z_j), with the z_j chosen in the constraint regions. The goal is to select {z_j, j = 1…M} such that the constraints are satisfied everywhere even though they are imposed only at a finite number of points. An important advantage of our two-stage estimator is that we impose constraints only in small regions about x̂_k. Since the constraint regions are small, the number of z_j necessary in the VW scheme is small. Thus, our two-stage estimator can be interpreted as a pilot estimator that determines where to place the {z_j} in the VW scheme.

The VW implementation of the second-stage estimator reduces to a quadratic programming problem with linear inequality constraints, the number of which is bounded by the total number of constraint points. The B-spline representation gives a banded structure to the programming problem. The minimization may be solved exactly using "active set" methods in quadratic programming, or approximately using projected successive over-relaxation (PSOR) iterations. The PSOR method uses the band structure of the convexity constraints, while active set programs require modification to take advantage of the tridiagonal structure. In both cases, the dual formulation is easier to implement since positivity constraints are substituted for the inequality constraints.
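The PSOR iteration is easy to sketch for the generic dual problem, minimize (1/2) xᵀAx − bᵀx subject to x ≥ 0, with A symmetric positive definite. The matrix and right-hand side below are illustrative, not taken from an actual spline fit:

```python
def psor(A, b, omega=1.5, iters=200):
    """Projected SOR for: minimize 0.5 x^T A x - b^T x  subject to x >= 0.
    A must be symmetric positive definite; 0 < omega < 2 gives convergence."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # Gauss-Seidel step on coordinate i, then projection onto x_i >= 0
            r = b[i] - sum(A[i][j] * x[j] for j in range(n))
            x[i] = max(0.0, x[i] + omega * r / A[i][i])
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]     # banded, as in the B-spline dual
b = [1.0, -2.0, 3.0]
x = psor(A, b)
print(x)   # KKT solution of this toy problem is (0.25, 0, 0.75)
```

The projection is coordinatewise, which is why positivity constraints in the dual are so much easier to handle than general inequality constraints in the primal.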
2. Adaptive regression splines
An alternative to smoothing splines is adaptive regression splines [3]. At each step, a new knot is added to the fit. The number and locations of the new knots are chosen by minimizing a Loss of Fit (LoF) function, d_N(f̂, {y_i}). Let σ̂² be a measure of the average residual error: σ̂²(f̂, {y_i}) = (1/N) Σ_{i=1}^N [y_i − f̂(t_i)]², or its L_1 analog. Typical discrepancy functions are

d_FN(f̂, {y_i}) = σ̂² / [1 − (γ_1 p + m)/N]²  and  d_BN(f̂, {y_i}) = σ̂² [1 + (γ_2 p + m) ln(N)/N],   (1)

where p is the number of knots in the fit and m is the spline degree. Friedman proposed d_FN with a default value of γ_1 = 3, while d_BN is the Bayesian/Schwarz information criterion when γ_2 = 1. For a nested family of models, γ_2 = 1 is appropriate, while γ_2 = 2 corresponds to a nonnested family with 2^K (N choose K) candidate models at the K-th level [2]. In very specialized settings in regression theory and time series, it has been shown that LoF functions like d_FN are asymptotically efficient while those like d_BN are asymptotically consistent. In other words, using d_FN-like criteria will asymptotically minimize the expected error at the cost of not always yielding the correct model. In contrast, the Bayesian criteria will asymptotically yield the correct model at the cost of a larger expected error. Our goal is to consistently select the number of convexity change points and efficiently estimate the model subject to the change point restrictions. Therefore, we propose the following new LoF function:
PCIC = σ̂²(ĝ, {y_i}) [1 + γ_2 K ln(N)/N] / [1 − (γ_1 p + m)/N],   (2)

where K is the number of convexity change points and p is the number of knots. PCIC stands for Piecewise Convex Information Criterion. In selecting the positions of the K change points, there are essentially 2^K (N choose K) possible combinations of change point locations if we categorize the change points by the nearest measurement location. Thus, our default values are γ_1 = 3 and γ_2 = 2. We motivate PCIC as follows: to add a change point requires an improvement in the residual square error of O(σ² ln(N)), which corresponds to an asymptotically consistent estimate. If an additional knot does not increase the number of change points, it will be added if the residual error decreases by γ_1 σ². Presently, PCIC is purely a heuristic principle. We conjecture that it consistently selects the number of change points and is asymptotically efficient within the class of methods which are asymptotically consistent with regard to convexity change points.
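Eq. (2) is a one-line computation; the sketch below compares two hypothetical candidate fits during knot selection (all residual errors and counts are illustrative numbers, not real data):

```python
import math

def pcic(sigma2_hat, K, p, m, N, gamma1=3.0, gamma2=2.0):
    """Piecewise Convex Information Criterion of Eq. (2).
    sigma2_hat: average residual square error of the candidate fit
    K: number of convexity change points, p: number of knots, m: spline degree."""
    numer = 1.0 + gamma2 * K * math.log(N) / N
    denom = 1.0 - (gamma1 * p + m) / N
    return sigma2_hat * numer / denom

N, m = 1000, 3
# Adding a change point must buy a residual improvement of O(sigma^2 ln N / N);
# here the wigglier fit gains too little to justify its extra change point.
flat   = pcic(sigma2_hat=1.000, K=1, p=5, m=m, N=N)
wiggly = pcic(sigma2_hat=0.995, K=2, p=6, m=m, N=N)
print(flat < wiggly)   # True: PCIC prefers the fit with fewer change points
```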
3. Instantaneous Frequency and Time-frequency representations
Much effort has been devoted to finding joint time-frequency representations of a signal whose frequency is slowly modulated in time. The canonical example is a "chirp": y_i = cos(a t_i + b t_i²), where the instantaneous frequency is ω(t) = a + 2bt. In [7], we proposed using the WKB (eikonal/geometric optics) representation for signals whose amplitude and frequency are slowly modulated in time:

y_t = A(δt) cos[∫_0^t ω(δs) ds] + ε_t,   (3)

where ε_t is white noise and δ is a small parameter. The characteristic time scale for amplitude and frequency modulation is 1/δ. In [7], we propose estimating A(δt) and ω(δt) using data-adaptive kernel smoothers. The instantaneous frequency corresponds to a first derivative estimate. We represent the signal in the time-frequency plane as the curve (t, ω̂(t)) with amplitude Â(t). Since we expect the phase to be piecewise convex, we replace the adaptive kernel estimate of cos[φ(t)] with PC fitting. The circular statistics require a penalized likelihood functional of the form:

Σ_{i=1}^N | y_i/Â(t_i) − cos[∫_0^{t_i} ω̂(s) ds] |² + λ ∫ (|Â'''(t)|² + |ω̂''(t)|²) dt.   (4)

Eq. (4) may also be used for a smoothing spline fit without PC constraints. The first term in Eq. (4) differs from Katkovnik [4] by placing Â(t) in the denominator instead of the numerator.
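For a noiseless, slowly modulated chirp, a toy local estimator of the instantaneous frequency follows from the identity y_{s+1} + y_{s−1} ≈ 2 cos(ω(s)) y_s, least-squares averaged over a short window. This is only a sketch of local frequency estimation, not the data-adaptive kernel smoother of [7]; the chirp parameters and window are illustrative:

```python
import math

def local_freq(y, t, half=50):
    """Least-squares fit of y[s+1] + y[s-1] = 2*cos(omega)*y[s] over a window."""
    win = range(t - half, t + half + 1)
    num = sum(y[s] * (y[s + 1] + y[s - 1]) for s in win)
    den = 2.0 * sum(y[s] ** 2 for s in win)
    c = max(-1.0, min(1.0, num / den))      # clamp before acos
    return math.acos(c)

a, b = 0.5, 0.0005                           # chirp phase a*t + b*t^2
y = [math.cos(a * s + b * s * s) for s in range(2001)]
w1, w2 = local_freq(y, 200), local_freq(y, 1800)
print(w1, w2)   # instantaneous frequencies near 0.7 and 2.3
```

The estimate tracks ω(t) = a + 2bt; with noise one would smooth the windowed estimates, and the text above replaces that smoothing step with PC fitting.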
4. Spectral Estimation
Consider a stationary time series, {x_t}, with an unknown spectral density, S(f). A standard method to estimate S(f) is to multiply the data by a spectral window, compute the windowed periodogram, and then smooth the windowed periodogram or its logarithm with a data-adaptive kernel smoother or smoothing splines. In [11-13], we show that a better estimate replaces the single spectral window with a family of orthonormal spectral windows. The multiple window estimate reduces the broad band bias while making a variance stabilizing transformation of the log-periodogram. When we use the sinusoidal tapers of [11], v_t^{(k)} = sqrt(2/(N+1)) sin(πkt/(N+1)), the multi-window estimate reduces to

Ŝ_MW(f) = (∆/K) Σ_{k=1}^K |y(f + k∆) − y(f − k∆)|²,  where ∆ = 1/(2N+2)   (5)

and y(f) is the discrete Fourier transform of the data. From [12], we recommend K = (N/2)^{4/5}. Instead of kernel smoothing ln[Ŝ_MW(f)], we now advocate using the two-stage piecewise convex fitting procedure. Piecewise convex fitting should prove particularly advantageous for spectral estimation because it will suppress the 1/N oscillations that arise from discrete time sampling.
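A direct implementation of the sine-taper multitaper estimate can be sketched as follows, using a naive O(N²) transform for clarity (a real implementation would use an FFT); the number of tapers K and the test signal are illustrative:

```python
import cmath
import math
import random

def multitaper_sine(x, K, freqs):
    """Multitaper spectral estimate with K sinusoidal tapers
    v_t^(k) = sqrt(2/(N+1)) * sin(pi*k*(t+1)/(N+1)), t = 0..N-1."""
    N = len(x)
    scale = math.sqrt(2.0 / (N + 1))
    tapers = [[scale * math.sin(math.pi * k * (t + 1) / (N + 1)) for t in range(N)]
              for k in range(1, K + 1)]
    S = []
    for f in freqs:
        # tapered transforms J_k(f), averaged over the K tapers
        xe = [xt * cmath.exp(-2j * math.pi * f * t) for t, xt in enumerate(x)]
        total = sum(abs(sum(v * w for v, w in zip(taper, xe))) ** 2
                    for taper in tapers)
        S.append(total / K)
    return S

random.seed(1)
N, f0 = 256, 0.2
x = [math.cos(2 * math.pi * f0 * t) + random.gauss(0.0, 0.3) for t in range(N)]
freqs = [j / 512 for j in range(257)]        # grid on [0, 1/2]
S = multitaper_sine(x, K=8, freqs=freqs)
fmax = freqs[max(range(len(S)), key=S.__getitem__)]
print(fmax)   # spectral peak near f0 = 0.2
```

Averaging over the orthonormal tapers is what stabilizes the variance of the log-spectrum before the smoothing (or PC fitting) step.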
5. Additive Models, Projection Pursuit and MARS
Many classes of models attempt to fit multi-dimensional functions as sums of one-dimensional functions. In each case, we can replace the standard nonparametric estimation methods with piecewise convex fitting. Additive growth curve models [9] fit models of the form g(t, x_1 … x_m) = f_0(t) + Σ_{j=1}^m f_j(t) x_j, where the f_j are determined by smoothing splines (old method) or PC fitting (new method). The backfitting algorithm (corresponding to the Gauss-Seidel iteration) may be used to PC fit the f_j iteratively. The same remarks apply to projection pursuit [3]. In MARS (multivariate adaptive regression splines), a sum of products form, f(x_1, x_2 … x_m) = Σ_{j,j'=1}^M g_j(x_j) h_{j'}(x_{j'}), is assumed. The knots are placed adaptively, in our case using PCIC (2).
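Backfitting for an additive model ŷ = f̂_1(x_1) + f̂_2(x_2) can be sketched with any one-dimensional smoother in the inner loop; here a crude kernel smoother stands in where PC fitting would be used, and the test function, bandwidth, and sweep count are illustrative:

```python
import math
import random

def smooth(xs, r, h=0.15):
    """Gaussian kernel smoother of residuals r against covariate xs."""
    out = []
    for x0 in xs:
        w = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
        out.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return out

def backfit(x1, x2, y, sweeps=8):
    """Gauss-Seidel backfitting: cycle through the components, smoothing the
    partial residuals of each against its own covariate."""
    n = len(y)
    f1, f2 = [0.0] * n, [0.0] * n
    for _ in range(sweeps):
        f1 = smooth(x1, [yi - f2i for yi, f2i in zip(y, f2)])
        f2 = smooth(x2, [yi - f1i for yi, f1i in zip(y, f1)])
    return f1, f2

random.seed(2)
n = 200
x1 = [random.random() for _ in range(n)]
x2 = [random.random() for _ in range(n)]
truth = [x1[i] ** 2 + math.sin(math.pi * x2[i]) for i in range(n)]
y = [truth[i] + random.gauss(0.0, 0.1) for i in range(n)]

f1, f2 = backfit(x1, x2, y)
fit = [a + b for a, b in zip(f1, f2)]
rmse = math.sqrt(sum((fi - ti) ** 2 for fi, ti in zip(fit, truth)) / n)
print(rmse)   # small compared with the spread of the true surface
```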
6. Robust Estimation
At present, our understanding of the statistics of false inflection points is limited to Gaussian errors and linear estimators. In practice, it is often advantageous to replace both the residual errors and the penalty function with more robust analogs: Σ_{i=1}^N |y_i − ĝ(t_i)|^{q_1} + λ ∫ |ĝ^{(m)}(t)|^{q_2} dt, where 1 ≤ q_i ≤ 2. Representation and duality theorems are given in [8] for q_i > 1. A heuristic scaling argument shows that the effective halfwidth of the robustified fit satisfies h_eff ∼ [λ |f^{(m)}(t)|^{q_2−2}]^{1/(2m)}. The bias error scales as g^{(m)}(t) h_eff^m while the "variance" is proportional to 1/(N h_eff). The halfwidth that minimizes the MISE scales as h_eff ∼ N^{−1/(2m+1)}, while the halfwidth that eliminates false change points of g^{(ℓ)} with asymptotic probability one satisfies h_cr N^{1/(2ℓ+3)} → ∞. The optimal variable halfwidth kernel smoother has a kernel halfwidth proportional to |f^{(m)}|^{−2/(2m+1)}. For 1 ≤ q_2 ≤ 2, the effective halfwidth of the robustified fit automatically shrinks in regions of large |f^{(m)}(t)|, just like a variable halfwidth smoother. When q_1 = q_2 = 1, the problem reduces to a linear programming problem for each predetermined set of constraints.
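The effect of q_1 = 1 on gross outliers is visible even with the simplest local estimators: a running median (the q_1 = 1 analog of a running mean). This toy comparison is not the penalized fit above; the signal, noise level, and window are illustrative:

```python
import random

def running_mean(y, half):
    """Local q1 = 2 estimate: window average."""
    return [sum(y[i - half:i + half + 1]) / (2 * half + 1)
            for i in range(half, len(y) - half)]

def running_median(y, half):
    """Local q1 = 1 estimate: window median."""
    return [sorted(y[i - half:i + half + 1])[half]
            for i in range(half, len(y) - half)]

random.seed(3)
n, half = 200, 5
y = [1.0 + random.gauss(0.0, 0.05) for _ in range(n)]   # flat signal g = 1
for i in range(10, n, 20):
    y[i] += 5.0                                          # gross outliers

err_mean = max(abs(v - 1.0) for v in running_mean(y, half))
err_med = max(abs(v - 1.0) for v in running_median(y, half))
print(err_mean, err_med)   # the median fit is far less affected by the outliers
```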
7. Image Segmentation
Image segmentation divides a digital picture into similar regions for further processing. The Mumford-Shah (MS) algorithm assumes that the image is piecewise constant and that the boundaries of the regions are unknown [5]. The region boundaries are determined by minimizing the sum of the residual square fit error plus a penalty term proportional to the length of the boundary. If the boundary is parameterized as x(s), the penalty term is the total variation of x(s). We suggest modifying the MS algorithm by replacing the total variation penalty with a piecewise convex constrained fit using a robustified penalty function such as the L_1 integral of the boundary curvature. When using the PCIC (2), we use the arclength divided by the grid spacing as a proxy for N, the number of data points.
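The flavor of penalized piecewise-constant fitting is easiest to see in one dimension, where an MS-style objective (residual square error plus a fixed penalty per boundary) can be minimized exactly by dynamic programming. The signal and penalty below are illustrative:

```python
def segment(y, penalty):
    """Optimal piecewise-constant fit of y, minimizing
    sum of squared residuals + penalty * (number of boundaries)."""
    n = len(y)
    ps, ps2 = [0.0], [0.0]                  # prefix sums of y and y^2
    for v in y:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)
    def sse(i, j):                          # squared error of fitting y[i:j+1] by its mean
        s, s2, m = ps[j + 1] - ps[i], ps2[j + 1] - ps2[i], j - i + 1
        return s2 - s * s / m
    best = [0.0] * (n + 1)                  # best[j]: optimal cost of y[0:j]
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        best[j], cut[j] = min(
            (best[i] + sse(i, j - 1) + (penalty if i > 0 else 0.0), i)
            for i in range(j))
    bounds, j = [], n                       # backtrack the boundary locations
    while j > 0:
        if cut[j] > 0:
            bounds.append(cut[j])
        j = cut[j]
    return sorted(bounds)

y = [0.0] * 40 + [2.0] * 60                 # one step, at index 40
bounds = segment(y, penalty=1.0)
print(bounds)   # recovers the single boundary: [40]
```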
8. Response Surface Estimation
We seek to estimate a smooth function, g(x_1, x_2), given N noisy measurements. The two dimensional analog of PC fitting is to divide the plane into regions where the Gaussian curvature (or, more simply, ∆g) has a single sign. The boundaries between regions of positive and negative Gaussian curvature are free boundaries, which we require to be PC. In the first stage of the fit, we use a penalty function of the form λ ∫ |∆^{m/2} g|², where we require that λ^{1/(2m)} ≫ log(N) N^{−1/(2ℓ+4)}. This scaling is heuristic since the statistics of false zeros of ∆^{ℓ/2} ĝ are unknown, as is the critical scaling of the smoothing parameter that avoids extra regions of incorrectly specified curvature. To answer these issues, a two-dimensional analog of the Cramer-Leadbetter formula is necessary.

In the second stage, we suggest imposing constraints on the sign of ∂_normal ∆^{ℓ/2} ĝ near the first-stage convexity boundaries. The stage 2 convexity boundaries, (x(s), y(s)), are free and need to be fit using a penalty function plus PC constraints on the curve shape. The numerical implementation appears tricky, with a need for some elliptic analog of front tracking. Overall, the two-dimensional problem appears very challenging from both theoretical and numerical perspectives. Basic issues such as replacing ∂_normal ∆^{ℓ/2} ĝ with geometric invariants have not been addressed.

9. Evolutionary Spectra

A common model of nonstationary stochastic processes is x_t = ∫_{−π}^{π} A(ω, δt) dZ(ω), where dZ(ω) is a stochastic process with independent spectral increments: E[dZ(ω) dZ̄(ω')] = δ(ω − ω') dω. The representation is nonunique for Gaussian processes since A(ω, t) corresponds to a square root of the covariance matrix. To resolve the nonuniqueness, we require that A(ω, δt) e^{iωt} correspond to the Fourier transform of the positive definite square root of the covariance matrix.
The evolutionary spectrum is S(ω, t) = |A(ω, t)|². Let λ_f be the characteristic frequency scale and τ be the characteristic time scale of A: A(f/λ_f, t/τ), with the sampling rate normalized to 1. In [6], we present an asymptotic expansion of the mean square error in estimating S(f/λ_f, t/τ). We begin by evaluating the multi-window analog of the log-spectrogram on a two-dimensional time-frequency lattice. The bias error is minimized by using a window length of N_w ∼ sqrt(τ/λ_f). In [6], we estimate Ŝ(ω, t) using a two dimensional cross-product kernel smoother on the log multi-window spectrogram. The optimal halfwidths scale as h_t/h_f ∼ sqrt(τ/λ_f) and h_t h_f ∼ (τ λ_f)^{−1/3}, where h_t and h_f are the halfwidths in the t and f directions. We now advocate replacing kernel smoothing with PC fitting of the log multi-windowed spectrogram. This method should eliminate the spurious 1/N_w oscillations which occur due to the discrete sampling.
10. Wavelet thresholding: a wiggle enhancer
Wavelet algorithms for function estimation offer two advantages: 1) speed, since the algorithms are often O(N log(N)); 2) asymptotic minimax optimality in a number of decision theoretic settings [2]. The speed arises from separability: each wavelet coefficient is estimated separately, without regard to geometric fidelity. The asymptotic optimality theory assumes that the unknown function is an arbitrary member of a function space, which makes function fits with ten or twenty inflection points as reasonable as fits with no inflection points. Essentially, function spaces contain too many "unphysical" functions. We prefer the "common sense prior" that the function has only a few inflection points with high probability. Both of these advantages of wavelets disappear when the more realistic assumption is made that the unknown function has only a small number of convexity change points. Wavelet thresholding has another intrinsic disadvantage when it comes to geometric fidelity: wavelets possess the complete oscillation property [1]. In contrast, B-splines have the antithetical and valuable property of total positivity.
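The coefficientwise character of thresholding is easy to exhibit with the simplest wavelet, the Haar basis: each coefficient is shrunk separately, with no coupling between scales or locations. This is a minimal sketch; the threshold value and test signal are illustrative:

```python
import random

def haar(v):
    """Forward orthonormal Haar transform (length a power of two)."""
    out, n = list(v), len(v)
    while n > 1:
        half = n // 2
        a = [(out[2 * i] + out[2 * i + 1]) / 2 ** 0.5 for i in range(half)]
        d = [(out[2 * i] - out[2 * i + 1]) / 2 ** 0.5 for i in range(half)]
        out[:n] = a + d
        n = half
    return out

def ihaar(w):
    """Inverse of haar()."""
    out, n = list(w), 1
    while n < len(out):
        a, d = out[:n], out[n:2 * n]
        out[:2 * n] = [x for s, t in zip(a, d)
                       for x in ((s + t) / 2 ** 0.5, (s - t) / 2 ** 0.5)]
        n *= 2
    return out

def soft(w, lam):
    """Each coefficient is shrunk independently; this is the separability in question."""
    return [max(abs(c) - lam, 0.0) * (1 if c > 0 else -1) for c in w]

random.seed(4)
g = [0.0] * 32 + [1.0] * 32                  # piecewise constant test signal
y = [gi + random.gauss(0.0, 0.2) for gi in g]
denoised = ihaar(soft(haar(y), lam=0.5))
mse_noisy = sum((a - b) ** 2 for a, b in zip(y, g)) / len(g)
mse_denoised = sum((a - b) ** 2 for a, b in zip(denoised, g)) / len(g)
print(mse_noisy, mse_denoised)   # thresholding reduces the error
```

Nothing in `soft` looks at neighboring coefficients, which is precisely why thresholding is fast and why it cannot enforce a bound on the number of inflection points.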
11. Summary
Sections 1 & 2 describe nonparametric estimation methods that seek to exclude spurious oscillations with negligiblesacrifice of fit quality. We have outlined a number of applications of our two-stage piecewise convex fitting method.The PCF methodology can be used to solve inverse problems.
12. References

[1] Chui, C.K., An Introduction to Wavelets, Academic Press, New York, 1992.
[2] Donoho, D.L., Johnstone, I.M., Ideal spatial adaptation by wavelet shrinkage, Biometrika (1995).
[3] Friedman, J., Multivariate adaptive regression splines, Ann. Stat.
[4] Katkovnik, V., Local polynomial approximation of time varying frequency, Univ. S. Africa Stat. Report 94/1 (1994).
[5] Mumford, D., Shah, J., Optimal approximation by piecewise smooth functions and associated variational problems, Comm. Pure & Appl. Math.
[6] Riedel, K.S., Optimal data-based kernel estimation for evolutionary spectra, IEEE Trans. Signal Processing, 2439-2447 (1993).
[7] Riedel, K.S., Optimal kernel estimation of the instantaneous frequency, IEEE Trans. Signal Proc.
[8] Riedel, K.S., Two-stage estimation of piecewise convex functions, this issue; Piecewise convex function estimation and model selection, Proc. of Approximation Theory VIII, C.K. Chui and L.L. Schumaker (eds.), World Scientific Pub. (1995).
[9] Riedel, K.S., Imre, K., Smoothing spline growth curves with covariates, Comm. in Statistics.
[10] Riedel, K.S., Sidorenko, A., Function estimation using data adaptive kernel smoothers - how much smoothing?, Computers in Physics.
[11] Riedel, K.S., Sidorenko, A., Minimum bias multiple taper spectral estimation, IEEE Trans. Signal Processing.
[12] Riedel, K.S., Sidorenko, A., Adaptive smoothing of the log-spectrum with multi-tapering, submitted.
[13] Riedel, K.S., Sidorenko, A., Thomson, D.J., Spectral estimation of plasma fluctuations I: Comparison of methods, Physics of Plasmas.
[14] Villalobos, M., Wahba, G., Inequality-constrained multivariate smoothing splines with application to the estimation of posterior probabilities.