arXiv [stat.ME]
Kurt S. Riedel
Signal Processing and Piecewise Convex Estimation
Many problems in signal processing reduce to nonparametric function estimation. We propose a new methodology, piecewise convex fitting (PCF), and give a two-stage adaptive estimate. In the first stage, the number and location of the change points is estimated using strong smoothing. In the second stage, a constrained smoothing spline fit is performed with the smoothing level chosen to minimize the MSE. The imposed constraint is that a single change point occurs in a region about each empirical change point of the first-stage estimate. This constraint is equivalent to requiring that the third derivative of the second-stage estimate has a single sign in a small neighborhood about each first-stage change point. We sketch how PCF may be applied to signal recovery, instantaneous frequency estimation, surface reconstruction, image segmentation, spectral estimation and multivariate adaptive regression.
1. Signal Recovery
Techniques such as splines and wavelets are optimal for estimating an arbitrary function in a ball in function space. We claim that this "uniform" prior is unrealistic and that naturally occurring functions have very few inflection points. Our basic tenet is that the fitted curve should preserve the geometric fidelity of the unknown function by having the same number of inflections. Consider a signal, y_i = g(t_i) + ε_i, measured at t_i = iδ, i = 1…N, where {ε_i} are independent Gaussian random variables: ε_i ~ N(0, σ²). We say g(t) has K change points of ℓ-convexity, x_1 ≤ x_2 ≤ … ≤ x_K, if (−1)^{k−1} g^{(ℓ)}(t) ≥ 0 for x_k ≤ t ≤ x_{k+1}. In practice, we take ℓ = 2. Our goal is to estimate g(t) while preserving the geometry of g. A standard technique, kernel smoothing [10], estimates g(t) by a weighted local average: ĝ_h(t) = (1/(Nh)) Σ_{i=1}^N y_i κ((t − t_i)/h), where h is the kernel halfwidth that determines the amount of smoothing. As h increases, the random error (variance) in ĝ decreases, while the systematic error (bias) increases. In [8], the geometric faithfulness of kernel smoothers is examined in the limit N → ∞, h → 0, with Nδ = 1. The halfwidth that minimizes the mean square error scales as N^{−1/(2m+1)} for g ∈ C^m[0,1]. For ℓ = m or m − 1, this halfwidth scaling, N^{−1/(2m+1)}, produces extra (artificial) ℓ-change points with high probability. To eliminate the artificial inflection points with probability 1, the smoothing must be increased so that h ≫ N^{−1/(2ℓ+3)} ln[N]. Thus, the level of smoothing required for geometric fidelity is large enough to degrade the MSE. In [8], we propose a two-stage estimator.

Stage 1: Strongly smooth with h_N N^{1/(2ℓ+3)} / ln N → ∞. Denote the empirical ℓ-change points by x̂_k, k = 1…K̂.

Stage 2: Perform a constrained smoothing spline fit by minimizing the penalized likelihood subject to the constraints that ĝ^{(ℓ+1)}(t) does not change sign in the intervals [x̂_k − z_α σ̂_k, x̂_k + z_α σ̂_k]. The k-th empirical change point variance is σ̂_k² ≡ c σ² / (|ĝ^{(ℓ+1)}_{stage 1}(x̂_k)|² N h^{2ℓ+3}), where ĝ^{(ℓ+1)}_{stage 1}(x̂_k) is the first-stage estimate of g^{(ℓ+1)} at x̂_k, and c is a constant that depends on the kernel shape. The confidence interval parameter, z_α, is the α-quantile of the normal distribution.

To motivate this two-stage estimator, consider the smoothing dilemma: if the smoothing level is optimized for MSE, there tend to be too many ℓ-change points (wiggles); if the smoothing is chosen to eliminate the artificial change points (suppress wiggles), then the MSE suffers. Our two-stage estimate provides the best of both worlds. In [8], we show that this two-stage estimator achieves the optimal rate of convergence for functions in the Sobolev space W_{m,2}, while suppressing artificial inflection points in the neighborhood where they are likely to occur. In practice, we choose the second-stage smoothing parameter, λ_{stage 2}, by generalized cross-validation, while we scale the first-stage smoothing as h_{stage 1} = ι(N) h_GCV, where ι(N) ≡ log(N) N^α with α = 1/(2m+1) − 1/(2ℓ+3).
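As a minimal illustration of the first stage, the sketch below strongly smooths a noisy signal and reports the empirical ℓ-change points as the sign changes of the ℓ-th finite difference of the smoothed fit. The Gaussian kernel, the halfwidth, and the test function are illustrative choices for this sketch, not the tuned estimator of [8]:

```python
import math
import random

def kernel_smooth(t, y, x, h):
    """Nadaraya-Watson estimate of g(x) with a Gaussian kernel of halfwidth h."""
    w = [math.exp(-0.5 * ((x - ti) / h) ** 2) for ti in t]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def empirical_change_points(t, y, h, grid, ell=2):
    """Sign changes of the ell-th finite difference of the strongly smoothed fit."""
    ghat = [kernel_smooth(t, y, x, h) for x in grid]
    d = ghat
    for _ in range(ell):                     # ell-th divided difference
        d = [b - a for a, b in zip(d, d[1:])]
    return [grid[i] for i in range(1, len(d)) if d[i - 1] * d[i] < 0]

random.seed(0)
N = 400
t = [i / N for i in range(1, N + 1)]
# g(t) = sin(2*pi*t) has a single 2-convexity change point, at t = 0.5
y = [math.sin(2 * math.pi * ti) + random.gauss(0.0, 0.1) for ti in t]

grid = [0.2 + 0.02 * j for j in range(31)]   # interior grid, away from edge bias
cps = empirical_change_points(t, y, h=0.08, grid=grid)
print(cps)   # expect a single empirical change point near t = 0.5
```

With strong smoothing the noise-induced wiggles in the second difference are suppressed and only the true inflection survives; with a much smaller h the same code typically reports spurious change points.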
Unconstrained smoothing splines can be used in the first stage, with the smoothing parameter, λ, scaled via the correspondence λ = h_eff^{2m}, with h_eff ∼ (log N) N^{−1/(2ℓ+3)}. When ℓ = m − 1, the stage 2 minimization reduces to a finite dimensional convex minimization in the dual formulation. A simple implementation is that of Villalobos and Wahba (VW) [14], who add pointwise constraints on the sign of ĝ^{(m)}(z_j), with the z_j chosen in the constraint regions. The goal is to select {z_j, j = 1…M} such that the constraints are satisfied everywhere even though they are imposed only at a finite number of points. An important advantage of our two-stage estimator is that we impose constraints only in small regions about x̂_k. Since the constraint regions are small, the number of z_j necessary in the VW scheme is small. Thus, our two-stage estimator can be interpreted as a pilot estimator that determines where to place the {z_j} in the VW scheme.

The VW implementation of the second-stage estimator reduces to a quadratic programming problem with linear inequality constraints, the number of which is bounded by the total number of constraint points. The B-spline representation gives a banded structure to the programming problem. The minimization may be solved exactly using "active set" methods in quadratic programming, or approximately using projected successive over-relaxation (PSOR) iterations. The PSOR method uses the band structure of the convexity constraints, while active set programs require modification to take advantage of the tridiagonal structure. In both cases, the dual formulation is easier to implement since positivity constraints are substituted for the inequality constraints.
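The PSOR iteration is easy to sketch for the generic dual problem, minimize (1/2) xᵀAx − bᵀx subject to x ≥ 0, with A symmetric positive definite. The matrix and right-hand side below are illustrative, not taken from an actual spline fit:

```python
def psor(A, b, omega=1.5, iters=200):
    """Projected SOR for: minimize 0.5 x^T A x - b^T x  subject to x >= 0.
    A must be symmetric positive definite; 0 < omega < 2 gives convergence."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            # Gauss-Seidel step on coordinate i, then projection onto x_i >= 0
            r = b[i] - sum(A[i][j] * x[j] for j in range(n))
            x[i] = max(0.0, x[i] + omega * r / A[i][i])
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]     # banded, as in the B-spline dual
b = [1.0, -2.0, 3.0]
x = psor(A, b)
print(x)   # KKT solution of this toy problem is (0.25, 0, 0.75)
```

The projection is coordinatewise, which is why positivity constraints in the dual are so much easier to handle than general inequality constraints in the primal.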
2. Adaptive regression splines
An alternative to smoothing splines is adaptive regression splines [3]. At each step, a new knot is added to the fit. The number and locations of the new knots are chosen by minimizing a Loss of Fit (LoF) function, d_N(f̂, {y_i}). Let σ̂² be a measure of the average residual error: σ̂²(f̂, {y_i}) = (1/N) Σ_{i=1}^N [y_i − f̂(t_i)]², or its L_1 analog. Typical discrepancy functions are

d_FN(f̂, {y_i}) = σ̂² / [1 − (γ_1 p + m)/N]²  and  d_BN(f̂, {y_i}) = σ̂² [1 + (γ_2 p + m) ln(N)/N],   (1)

where p is the number of knots in the fit and m is the spline degree. Friedman proposed d_FN with a default value of γ_1 = 3, while d_BN is the Bayesian/Schwarz information criterion when γ_2 = 1. For a nested family of models, γ_2 = 1 is appropriate, while γ_2 = 2 corresponds to a nonnested family with 2^K (N choose K) candidate models at the K-th level [2]. In very specialized settings in regression theory and time series, it has been shown that LoF functions like d_FN are asymptotically efficient while those like d_BN are asymptotically consistent. In other words, using d_FN-like criteria will asymptotically minimize the expected error at the cost of not always yielding the correct model. In contrast, the Bayesian criteria will asymptotically yield the correct model at the cost of a larger expected error. Our goal is to consistently select the number of convexity change points and efficiently estimate the model subject to the change point restrictions. Therefore, we propose the following new LoF function:
PCIC = σ̂²(ĝ, {y_i}) [1 + γ_2 K ln(N)/N] / [1 − (γ_1 p + m)/N],   (2)

where K is the number of convexity change points and p is the number of knots. PCIC stands for Piecewise Convex Information Criterion. In selecting the positions of the K change points, there are essentially 2^K (N choose K) possible combinations of change point locations if we categorize the change points by the nearest measurement location. Thus, our default values are γ_1 = 3 and γ_2 = 2. We motivate PCIC as follows: to add a change point requires an improvement in the residual square error of O(σ² ln(N)), which corresponds to an asymptotically consistent estimate. If an additional knot does not increase the number of change points, it will be added if the residual error decreases by γ_1 σ². Presently, PCIC is purely a heuristic principle. We conjecture that it consistently selects the number of change points and is asymptotically efficient within the class of methods which are asymptotically consistent with regard to convexity change points.
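Eq. (2) is a one-line computation; the sketch below compares two hypothetical candidate fits during knot selection (all residual errors and counts are illustrative numbers, not real data):

```python
import math

def pcic(sigma2_hat, K, p, m, N, gamma1=3.0, gamma2=2.0):
    """Piecewise Convex Information Criterion of Eq. (2).
    sigma2_hat: average residual square error of the candidate fit
    K: number of convexity change points, p: number of knots, m: spline degree."""
    numer = 1.0 + gamma2 * K * math.log(N) / N
    denom = 1.0 - (gamma1 * p + m) / N
    return sigma2_hat * numer / denom

N, m = 1000, 3
# Adding a change point must buy a residual improvement of O(sigma^2 ln N / N);
# here the wigglier fit gains too little to justify its extra change point.
flat   = pcic(sigma2_hat=1.000, K=1, p=5, m=m, N=N)
wiggly = pcic(sigma2_hat=0.995, K=2, p=6, m=m, N=N)
print(flat < wiggly)   # True: PCIC prefers the fit with fewer change points
```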
3. Instantaneous Frequency and Time-frequency representations
Much effort has been devoted to finding joint time-frequency representations of a signal whose frequency is slowly modulated in time. The canonical example is a "chirp": y_i = cos(a t_i + b t_i²), where the instantaneous frequency is ω(t) = a + 2bt. In [7], we proposed using the WKB (eikonal/geometric optics) representation for signals whose amplitude and frequency are slowly modulated in time:

y_t = A(δt) cos[∫_0^t ω(δs) ds] + ε_t,   (3)

where ε_t is white noise and δ is a small parameter. The characteristic time scale for amplitude and frequency modulation is 1/δ. In [7], we propose estimating A(δt) and ω(δt) using data-adaptive kernel smoothers. The instantaneous frequency corresponds to a first derivative estimate. We represent the signal in the time-frequency plane as the curve (t, ω̂(t)) with amplitude Â(t). Since we expect the phase to be piecewise convex, we replace the adaptive kernel estimate of cos[φ(t)] with PC fitting. The circular statistics require a penalized likelihood functional of the form:

Σ_{i=1}^N | y_i/Â(t_i) − cos[∫_0^{t_i} ω̂(s) ds] |² + λ ∫ (|Â'''(t)|² + |ω̂''(t)|²) dt.   (4)

Eq. (4) may also be used for a smoothing spline fit without PC constraints. The first term in Eq. (4) differs from Katkovnik [4] by placing Â(t) in the denominator instead of the numerator.
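For a noiseless, slowly modulated chirp, a toy local estimator of the instantaneous frequency follows from the identity y_{s+1} + y_{s−1} ≈ 2 cos(ω(s)) y_s, least-squares averaged over a short window. This is only a sketch of local frequency estimation, not the data-adaptive kernel smoother of [7]; the chirp parameters and window are illustrative:

```python
import math

def local_freq(y, t, half=50):
    """Least-squares fit of y[s+1] + y[s-1] = 2*cos(omega)*y[s] over a window."""
    win = range(t - half, t + half + 1)
    num = sum(y[s] * (y[s + 1] + y[s - 1]) for s in win)
    den = 2.0 * sum(y[s] ** 2 for s in win)
    c = max(-1.0, min(1.0, num / den))      # clamp before acos
    return math.acos(c)

a, b = 0.5, 0.0005                           # chirp phase a*t + b*t^2
y = [math.cos(a * s + b * s * s) for s in range(2001)]
w1, w2 = local_freq(y, 200), local_freq(y, 1800)
print(w1, w2)   # instantaneous frequencies near 0.7 and 2.3
```

The estimate tracks ω(t) = a + 2bt; with noise one would smooth the windowed estimates, and the text above replaces that smoothing step with PC fitting.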
4. Spectral Estimation
Consider a stationary time series, {x_t}, with an unknown spectral density, S(f). A standard method to estimate S(f) is to multiply the data by a spectral window, compute the windowed periodogram, and then smooth the windowed periodogram or its logarithm with a data-adaptive kernel smoother or smoothing splines. In [11-13], we show that a better estimate replaces the single spectral window with a family of orthonormal spectral windows. The multiple window estimate reduces the broad band bias while making a variance stabilizing transformation of the log-periodogram. When we use the sinusoidal tapers of [11], v_t^{(k)} = sqrt(2/(N+1)) sin(πkt/(N+1)), the multi-window estimate reduces to

Ŝ_MW(f) = (∆/K) Σ_{k=1}^K |y(f + k∆) − y(f − k∆)|²,  where ∆ = 1/(2N+2)   (5)

and y(f) is the discrete Fourier transform of the data. From [12], we recommend K = (N/2)^{4/5}. Instead of kernel smoothing ln[Ŝ_MW(f)], we now advocate using the two-stage piecewise convex fitting procedure. Piecewise convex fitting should prove particularly advantageous for spectral estimation because it will suppress the 1/N oscillations that arise from discrete time sampling.
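A direct implementation of the sine-taper multitaper estimate can be sketched as follows, using a naive O(N²) transform for clarity (a real implementation would use an FFT); the number of tapers K and the test signal are illustrative:

```python
import cmath
import math
import random

def multitaper_sine(x, K, freqs):
    """Multitaper spectral estimate with K sinusoidal tapers
    v_t^(k) = sqrt(2/(N+1)) * sin(pi*k*(t+1)/(N+1)), t = 0..N-1."""
    N = len(x)
    scale = math.sqrt(2.0 / (N + 1))
    tapers = [[scale * math.sin(math.pi * k * (t + 1) / (N + 1)) for t in range(N)]
              for k in range(1, K + 1)]
    S = []
    for f in freqs:
        # tapered transforms J_k(f), averaged over the K tapers
        xe = [xt * cmath.exp(-2j * math.pi * f * t) for t, xt in enumerate(x)]
        total = sum(abs(sum(v * w for v, w in zip(taper, xe))) ** 2
                    for taper in tapers)
        S.append(total / K)
    return S

random.seed(1)
N, f0 = 256, 0.2
x = [math.cos(2 * math.pi * f0 * t) + random.gauss(0.0, 0.3) for t in range(N)]
freqs = [j / 512 for j in range(257)]        # grid on [0, 1/2]
S = multitaper_sine(x, K=8, freqs=freqs)
fmax = freqs[max(range(len(S)), key=S.__getitem__)]
print(fmax)   # spectral peak near f0 = 0.2
```

Averaging over the orthonormal tapers is what stabilizes the variance of the log-spectrum before the smoothing (or PC fitting) step.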
5. Additive Models, Projection Pursuit and MARS
Many classes of models attempt to fit multi-dimensional functions as sums of one-dimensional functions. In each case, we can replace the standard nonparametric estimation methods with piecewise convex fitting. Additive growth curve models [9] fit models of the form g(t, x_1 … x_m) = f_0(t) + Σ_{j=1}^m f_j(t) x_j, where the f_j are determined by smoothing splines (old method) or PC fitting (new method). The backfitting algorithm (corresponding to the Gauss-Seidel iteration) may be used to PC fit the f_j iteratively. The same remarks apply to projection pursuit [3]. In MARS (multivariate adaptive regression splines), a sum of products form, f(x_1, x_2 … x_m) = Σ_{j,j'=1}^M g_j(x_j) h_{j'}(x_{j'}), is assumed. The knots are placed adaptively, in our case using PCIC (2).
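Backfitting for an additive model ŷ = f̂_1(x_1) + f̂_2(x_2) can be sketched with any one-dimensional smoother in the inner loop; here a crude kernel smoother stands in where PC fitting would be used, and the test function, bandwidth, and sweep count are illustrative:

```python
import math
import random

def smooth(xs, r, h=0.15):
    """Gaussian kernel smoother of residuals r against covariate xs."""
    out = []
    for x0 in xs:
        w = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
        out.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return out

def backfit(x1, x2, y, sweeps=8):
    """Gauss-Seidel backfitting: cycle through the components, smoothing the
    partial residuals of each against its own covariate."""
    n = len(y)
    f1, f2 = [0.0] * n, [0.0] * n
    for _ in range(sweeps):
        f1 = smooth(x1, [yi - f2i for yi, f2i in zip(y, f2)])
        f2 = smooth(x2, [yi - f1i for yi, f1i in zip(y, f1)])
    return f1, f2

random.seed(2)
n = 200
x1 = [random.random() for _ in range(n)]
x2 = [random.random() for _ in range(n)]
truth = [x1[i] ** 2 + math.sin(math.pi * x2[i]) for i in range(n)]
y = [truth[i] + random.gauss(0.0, 0.1) for i in range(n)]

f1, f2 = backfit(x1, x2, y)
fit = [a + b for a, b in zip(f1, f2)]
rmse = math.sqrt(sum((fi - ti) ** 2 for fi, ti in zip(fit, truth)) / n)
print(rmse)   # small compared with the spread of the true surface
```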
6. Robust Estimation
At present, our understanding of the statistics of false inflection points is limited to Gaussian errors and linear estimators. In practice, it is often advantageous to replace both the residual errors and the penalty function with more robust analogs: Σ_{i=1}^N |y_i − ĝ(t_i)|^{q_1} + λ ∫ |ĝ^{(m)}(t)|^{q_2} dt, where 1 ≤ q_i ≤ 2. Representation and duality theorems are given in [8] for q_i > 1. A heuristic scaling argument shows that the effective halfwidth of the robustified fit satisfies h_eff ∼ [λ |f^{(m)}(t)|^{q_2−2}]^{1/(2m)}. The bias error scales as g^{(m)}(t) h_eff^m while the "variance" is proportional to 1/(N h_eff). The halfwidth that minimizes the MISE scales as h_eff ∼ N^{−1/(2m+1)}, while the halfwidth that eliminates false change points of g^{(ℓ)} with asymptotic probability one satisfies h_cr N^{1/(2ℓ+3)} → ∞. The optimal variable halfwidth kernel smoother has a kernel halfwidth proportional to |f^{(m)}|^{−2/(2m+1)}. For 1 ≤ q_2 ≤ 2, the effective halfwidth of the robustified fit automatically shrinks in regions of large |f^{(m)}(t)|, just like a variable halfwidth smoother. When q_1 = q_2 = 1, the problem reduces to a linear programming problem for each predetermined set of constraints.
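The effect of q_1 = 1 on gross outliers is visible even with the simplest local estimators: a running median (the q_1 = 1 analog of a running mean). This toy comparison is not the penalized fit above; the signal, noise level, and window are illustrative:

```python
import random

def running_mean(y, half):
    """Local q1 = 2 estimate: window average."""
    return [sum(y[i - half:i + half + 1]) / (2 * half + 1)
            for i in range(half, len(y) - half)]

def running_median(y, half):
    """Local q1 = 1 estimate: window median."""
    return [sorted(y[i - half:i + half + 1])[half]
            for i in range(half, len(y) - half)]

random.seed(3)
n, half = 200, 5
y = [1.0 + random.gauss(0.0, 0.05) for _ in range(n)]   # flat signal g = 1
for i in range(10, n, 20):
    y[i] += 5.0                                          # gross outliers

err_mean = max(abs(v - 1.0) for v in running_mean(y, half))
err_med = max(abs(v - 1.0) for v in running_median(y, half))
print(err_mean, err_med)   # the median fit is far less affected by the outliers
```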
7. Image Segmentation
Image segmentation divides a digital picture into similar regions for further processing. The Mumford-Shah (MS) algorithm assumes that the image is piecewise constant and that the boundaries of the regions are unknown [5]. The region boundaries are determined by minimizing the sum of the residual square fit error plus a penalty term proportional to the length of the boundary. If the boundary is parameterized as x(s), the penalty term is the total variation of x(s). We suggest modifying the MS algorithm by replacing the total variation penalty with a piecewise convex constrained fit using a robustified penalty function such as the L_1 integral of the boundary curvature. When using the PCIC (2), we use the arclength divided by the grid spacing as a proxy for N, the number of data points.
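The flavor of penalized piecewise-constant fitting is easiest to see in one dimension, where an MS-style objective (residual square error plus a fixed penalty per boundary) can be minimized exactly by dynamic programming. The signal and penalty below are illustrative:

```python
def segment(y, penalty):
    """Optimal piecewise-constant fit of y, minimizing
    sum of squared residuals + penalty * (number of boundaries)."""
    n = len(y)
    ps, ps2 = [0.0], [0.0]                  # prefix sums of y and y^2
    for v in y:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)
    def sse(i, j):                          # squared error of fitting y[i:j+1] by its mean
        s, s2, m = ps[j + 1] - ps[i], ps2[j + 1] - ps2[i], j - i + 1
        return s2 - s * s / m
    best = [0.0] * (n + 1)                  # best[j]: optimal cost of y[0:j]
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        best[j], cut[j] = min(
            (best[i] + sse(i, j - 1) + (penalty if i > 0 else 0.0), i)
            for i in range(j))
    bounds, j = [], n                       # backtrack the boundary locations
    while j > 0:
        if cut[j] > 0:
            bounds.append(cut[j])
        j = cut[j]
    return sorted(bounds)

y = [0.0] * 40 + [2.0] * 60                 # one step, at index 40
bounds = segment(y, penalty=1.0)
print(bounds)   # recovers the single boundary: [40]
```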
8. Response Surface Estimation
We seek to estimate a smooth function, g(x_1, x_2), given N noisy measurements. The two dimensional analog of PC fitting is to divide the plane into regions where the Gaussian curvature (or, more simply, ∆g) has a single sign. The boundaries between regions of positive and negative Gaussian curvature are free boundaries, which we require to be PC. In the first stage of the fit, we use a penalty function of the form λ ∫ |∆^{m/2} g|², where we require that λ^{1/(2m)} ≫ log(N) N^{−1/(2ℓ+4)}. This scaling is heuristic since the statistics of false zeros of ∆^{ℓ/2} ĝ are unknown, as is the critical scaling of the smoothing parameter that avoids extra regions of incorrectly specified curvature. To answer these issues, a two-dimensional analog of the Cramer-Leadbetter formula is necessary.

In the second stage, we suggest imposing constraints on the sign of ∂_normal ∆^{ℓ/2} ĝ near the first-stage convexity boundaries. The stage 2 convexity boundaries, (x(s), y(s)), are free and need to be fit using a penalty function plus PC constraints on the curve shape. The numerical implementation appears tricky, with a need for some elliptic analog of front tracking. Overall, the two-dimensional problem appears very challenging from both theoretical and numerical perspectives. Basic issues such as replacing ∂_normal ∆^{ℓ/2} ĝ with geometric invariants have not been addressed.

9. Evolutionary Spectra

A common model of nonstationary stochastic processes is x_t = ∫_{−π}^{π} A(ω, δt) dZ(ω), where dZ(ω) is a stochastic process with independent spectral increments: E[dZ(ω) dZ̄(ω')] = δ(ω − ω') dω. The representation is nonunique for Gaussian processes since A(ω, t) corresponds to a square root of the covariance matrix. To resolve the nonuniqueness, we require that A(ω, δt) e^{iωt} correspond to the Fourier transform of the positive definite square root of the covariance matrix.
The evolutionary spectrum is S(ω, t) = |A(ω, t)|². Let λ_f be the characteristic frequency scale and τ be the characteristic time scale of A: A(f/λ_f, t/τ), with the sampling rate normalized to 1. In [6], we present an asymptotic expansion of the mean square error in estimating S(f/λ_f, t/τ). We begin by evaluating the multi-window analog of the log-spectrogram on a two-dimensional time-frequency lattice. The bias error is minimized by using a window length of N_w ∼ sqrt(τ/λ_f). In [6], we estimate Ŝ(ω, t) using a two dimensional cross-product kernel smoother on the log multi-window spectrogram. The optimal halfwidths scale as h_t/h_f ∼ sqrt(τ/λ_f) and h_t h_f ∼ (τ λ_f)^{−1/3}, where h_t and h_f are the halfwidths in the t and f directions. We now advocate replacing kernel smoothing with PC fitting of the log multi-windowed spectrogram. This method should eliminate the spurious 1/N_w oscillations which occur due to the discrete sampling.
10. Wavelet thresholding: a wiggle enhancer
Wavelet algorithms for function estimation offer two advantages: 1) speed, since the algorithms are often O(N log(N)); 2) asymptotic minimax optimality in a number of decision theoretic settings [2]. The speed arises from separability: each wavelet coefficient is estimated separately, without regard to geometric fidelity. The asymptotic optimality theory assumes that the unknown function is an arbitrary member of a function space, which makes function fits with ten or twenty inflection points as reasonable as fits with no inflection points. Essentially, function spaces contain too many "unphysical" functions. We prefer the "common sense prior" that the function has only a few inflection points with high probability. Both of these advantages of wavelets disappear when the more realistic assumption is made that the unknown function has only a small number of convexity change points. Wavelet thresholding has another intrinsic disadvantage when it comes to geometric fidelity: wavelets possess the complete oscillation property [1]. In contrast, B-splines have the antithetical and valuable property of total positivity.
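The coefficientwise character of thresholding is easy to exhibit with the simplest wavelet, the Haar basis: each coefficient is shrunk separately, with no coupling between scales or locations. This is a minimal sketch; the threshold value and test signal are illustrative:

```python
import random

def haar(v):
    """Forward orthonormal Haar transform (length a power of two)."""
    out, n = list(v), len(v)
    while n > 1:
        half = n // 2
        a = [(out[2 * i] + out[2 * i + 1]) / 2 ** 0.5 for i in range(half)]
        d = [(out[2 * i] - out[2 * i + 1]) / 2 ** 0.5 for i in range(half)]
        out[:n] = a + d
        n = half
    return out

def ihaar(w):
    """Inverse of haar()."""
    out, n = list(w), 1
    while n < len(out):
        a, d = out[:n], out[n:2 * n]
        out[:2 * n] = [x for s, t in zip(a, d)
                       for x in ((s + t) / 2 ** 0.5, (s - t) / 2 ** 0.5)]
        n *= 2
    return out

def soft(w, lam):
    """Each coefficient is shrunk independently; this is the separability in question."""
    return [max(abs(c) - lam, 0.0) * (1 if c > 0 else -1) for c in w]

random.seed(4)
g = [0.0] * 32 + [1.0] * 32                  # piecewise constant test signal
y = [gi + random.gauss(0.0, 0.2) for gi in g]
denoised = ihaar(soft(haar(y), lam=0.5))
mse_noisy = sum((a - b) ** 2 for a, b in zip(y, g)) / len(g)
mse_denoised = sum((a - b) ** 2 for a, b in zip(denoised, g)) / len(g)
print(mse_noisy, mse_denoised)   # thresholding reduces the error
```

Nothing in `soft` looks at neighboring coefficients, which is precisely why thresholding is fast and why it cannot enforce a bound on the number of inflection points.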
11. Summary
Sections 1 & 2 describe nonparametric estimation methods that seek to exclude spurious oscillations with negligiblesacrifice of fit quality. We have outlined a number of applications of our two-stage piecewise convex fitting method.The PCF methodology can be used to solve inverse problems.
12. References

[1] Chui, C.K., An Introduction to Wavelets, Academic Press, New York, 1992.
[2] Donoho, D.L., Johnstone, I.M., Ideal spatial adaptation by wavelet shrinkage, Biometrika (1995).
[3] Friedman, J., Multivariate adaptive regression splines, Ann. Stat.
[4] Katkovnik, V., Local polynomial approximation of time varying frequency, Univ. S. Africa Stat. Report 94/1 (1994).
[5] Mumford, D., Shah, J., Optimal approximation by piecewise smooth functions and associated variational problems, Comm. Pure & Appl. Math.
[6] Riedel, K.S., Optimal data-based kernel estimation for evolutionary spectra, IEEE Trans. Signal Processing, 2439-2447 (1993).
[7] Riedel, K.S., Optimal kernel estimation of the instantaneous frequency, IEEE Trans. Signal Proc.
[8] Riedel, K.S., Two-stage estimation of piecewise convex functions, this issue; Piecewise convex function estimation and model selection, Proc. of Approximation Theory VIII, C.K. Chui and L.L. Schumaker (eds.), World Scientific Pub. (1995).
[9] Riedel, K.S., Imre, K., Smoothing spline growth curves with covariates, Comm. in Statistics.
[10] Riedel, K.S., Sidorenko, A., Function estimation using data adaptive kernel smoothers - how much smoothing?, Computers in Physics.
[11] Riedel, K.S., Sidorenko, A., Minimum bias multiple taper spectral estimation, IEEE Trans. Signal Processing.
[12] Riedel, K.S., Sidorenko, A., Adaptive smoothing of the log-spectrum with multi-tapering, submitted.
[13] Riedel, K.S., Sidorenko, A., Thomson, D.J., Spectral estimation of plasma fluctuations I: Comparison of methods, Physics of Plasmas.
[14] Villalobos, M., Wahba, G., Inequality-constrained multivariate smoothing splines with application to the estimation of posterior probabilities.