Projected nonlinear least squares for exponential fitting
aa r X i v : . [ m a t h . NA ] J un PROJECTED NONLINEAR LEAST SQUARESFOR EXPONENTIAL FITTING ∗ JEFFREY M. HOKANSON † Abstract.
The modern ability to collect vast quantities of data presents a challenge for param-eter estimation problems. Posed as a nonlinear least squares problem fitting a model to the data,the cost of each iteration grows linearly with the amount of data; with large data, it can easilybecome too expensive to perform many iterations. Here we develop an approach that projects thedata onto a low-dimensional subspace that preserves the quality of the resulting parameter estimates.We provide results from both an optimization and a statistical perspective that shows that accurateparameter estimates are obtained when the subspace angles between this projection and the Jacobianof the model at the current iterate remain small. However, for this approach to reduce computationalcomplexity, both the projected model and projected Jacobian must be computed inexpensively. Thisplaces a constraint on the pairs of models and subspaces for which this approach provides a com-putational speedup. Here we consider the exponential fitting problem projected onto the range ofa Vandermonde matrix, for which the projected model and projected Jacobian can be computed inclosed form using a generalized geometric sum formula. We further provide an inexpensive heuristicthat picks this Vandermonde matrix so that the subspace angles with the Jacobian remain small anduse this heuristic to update the subspace during optimization. Although the asymptotic cost stilldepends on the data dimension, the overall cost of solving this sequence of projected nonlinear leastsquares problems is less expensive than the original. Applied to the exponential fitting problem, thisyields an algorithm that is not only faster in the limit of large data than the conventional nonlinearleast squares approach, but is also faster than subspace based approaches such as HSVD.
Key words. exponential fitting, harmonic estimation, modal analysis, spectral analysis, param-eter estimation, nonlinear least squares, dimension reduction, experimental design
AMS subject classifications.
1. Introduction.
With the increasing prowess of data acquisition hardware andstorage, collecting vast amounts of data has become trivial. This poses a challenge forparameter estimation problems where the sheer scale of data makes these problemsexpensive. Here we consider a nonlinear least squares parameter estimation prob-lem [14] that seeks to fit a model f with q parameters θ ∈ C q to (noisy) measurements e y yielding a (noisy) parameter estimate e θ that minimizes the -norm mismatch(1) e θ := argmin θ ∈ C q k f ( θ ) − e y k , where f : C q → C n , e y ∈ C n , q ≪ n. With vast quantities of data, the asymptotic cost of solving this problem is dom-inated by the n -dependent steps in the optimization. For example, using eitherGauss-Newton or Levenberg-Marquardt, each optimization step solves a least squaresproblem involving the Jacobian of f , J : C q → C n × q , at a cost of O ( nq ) operations [4,Ch. 9]. To reduce this cost, we propose replacing the full least squares problem (1)with a sequence of low-dimensional surrogate problems by projecting measurementsonto e y onto a sequence of subspaces W ℓ ⊂ C n with m ℓ := dim W ℓ ≪ n (2) e θ W ℓ := argmin θ ∈ C q k P W ℓ [ f ( θ ) − e y ] k = argmin θ ∈ C q k W ∗ ℓ f ( θ ) − W ∗ ℓ e y k , P W ℓ = W ℓ W ∗ ℓ , ∗ Submitted to the editors 14 July 2016.
Funding:
This work was supported by NSF VIGRE grants DMS-0240058 and DMS-0739420 atRice University and the Department of Defense, Defense Advanced Research Project AgencyâĂŹsprogram Enabling Quantification of Uncertainty in Physical Systems. † Department of Applied Mathematics and Statistics, Colorado School of Mines, 1500 Illinois St,Golden CO, 80401, ([email protected], http://inside.mines.edu/~hokanson/).1
JEFFREY M. HOKANSON where P W ℓ is an orthogonal projector onto W ℓ and W ℓ ∈ C n × m ℓ is an orthonormalbasis for W ℓ . Although the total cost is still asymptotically n -dependent due to themultiplication W ∗ ℓ e y , each optimization step is cheaper since the projected Jacobian W ∗ ℓ J ( θ ) is smaller. However, for this computational speedup to be fully realized, theproducts W ∗ ℓ f ( θ ) and W ∗ ℓ J ( θ ) need to be formed without the expensive, n -dependentmultiplication. Additionally, we must ensure that the final projected parameter esti-mate e θ W ℓ remains a good estimate of the full parameter estimate, e θ . Here we do so byrequiring that the subspace angles between W ℓ and the Jacobian at the current iterateremain small. This requirement is justified by perspectives from both optimizationand statistics. From an optimization perspective described in section 2, the accuracyof each optimization step depends on these subspace angles and the projected pa-rameter estimate e θ W ℓ is equal to the full parameter estimate e θ when these subspaceangles go to zero. From a statistical perspective described in section 3, when mea-surements e y are contaminated by additive Gaussian noise, the covariance of projectedparameter estimate e θ W ℓ is larger than the covariance of full parameter estimate e θ byan amount that scales with these subspace angles as measured by efficiency . Hencethe subspace angles between W ℓ and the Jacobian at the current iterate determinethe quality of our projected parameter estimate e θ W ℓ . The challenge in applying thisprojected nonlinear least squares approach to a specific problem is satisfying bothcriteria simultaneously: finding a sequence of subspaces {W ℓ } ℓ with orthogonal bases { W ℓ } ℓ where W ∗ ℓ f ( θ ) and W ∗ ℓ J ( θ ) can be formed inexpensively independently of n and where the subspace angles between W ℓ and the range of J ( θ k ) for each iterate θ k remain small.Here we consider the exponential fitting problem [31], also known as modal analy-sis [7], harmonic estimation [22], and spectral analysis [38], that seeks to approximatedata e y as a sum of p complex exponentials with frequencies ω and amplitudes a where(3) [ f ([ ω , a ])] j = p X k =1 a k e jω k , ω , a ∈ C p ; θ = [ ω , a ] , q = 2 p. There is an extensive body of literature on this problem, with a wide array of methodsfor recovering the frequencies ω : from classical approaches such as Prony’s method [32]its extensions for overdetermined problems [15, §9.4], to subspace methods [16] suchas HSVD [1], HTLS [20], and the matrix-pencil method [19], to parameter estimationapproaches using optimization (such as ours) [14, 43], to more recent approaches basedon ideas from sparse recovery [39], and many others described in reviews [21, 24, 31,41]. We choose this problem due to the exploitable structure of the model function f ([ ω , a ]) that allows us to obtain inexpensive inner products with subspaces thatapproximately contain the range of the Jacobian. Specifically, as the model functionis the product of a Vandermonde matrix V ( ω ) and the amplitudes a ,(4) f ([ ω , a ]) = V ( ω ) a , [ V ( ω )] j,k = e jω k , by projecting measurements onto the subspace W ( µ ) ,(5) W ( µ ) := Range V ( µ ) = Range W ( µ ) , W ( µ ) ∗ W ( µ ) = I , µ ∈ C m , we can inexpensively obtain the inner products W ( µ ) ∗ f ([ ω , a ]) and W ( µ ) ∗ J ([ ω , a ]) as described in section 4 using the geometric sum formula and its generalization givenin Appendix B. 
Further, using a heuristic described in section 5, we can ensure thesubspace angles between W ( µ ) and Range J ([ ω , a ]) remain small by a careful selection ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING − . − . − . . . [ e θ ] = e ω [ e θ ] = e a All measurements of data efficiency − . − . − . [ e θ ] = e ω Every tenth measurement of data . efficiency − . − . − . . . [ e θ ] = e ω Projected measurements . of data . efficiency Fig. 1 . Parameter estimates from a toy exponential fitting problem with true parameters b ω = − . and b a = 1 using n = 1000 measurements contaminated with zero-mean Gaussian noise g with Cov g = 0 . I , e y = f ([ − . , g . The parameter estimates on the left are computed by solving (1) with all measurements e y . In the center, the parameter estimates are computed by solving (2) using a subspace that selects every 10th measurement. On the right, the parameter estimates are alsocomputed by solving (2) , but instead using the subspace W ( µ ) where µ = [ − . ± . i ] . In eachcase, the resulting nonlinear least squares problem was solved using Matlab’s lsqnonlin . Efficiency,defined in section , quantifies how close the covariance of the projected problem resembles the fullproblem (left). As this example shows, the projected parameter estimate with well-chosen subspaceyields almost identical parameter estimates to the full problem. of µ . The net result is a faster solution to the exponential fitting problem in the limitof large data as illustrated by a magnetic resonance spectroscopy test case in section 7.Further, due to careful selection of the subspaces, the projected parameter estimate e θ W remains close to the full parameter estimate e θ as seen in Figure 1 for a toy problemand in Figure 4 for the magnetic resonance spectroscopy test case.Projection is a recurring theme in applied mathematics, appearing in a varietyof contexts from Galerkin projections for solving partial differential equations to therandomized projections that form the foundation of randomized numerical linear al-gebra. Our projection approach for solving a nonlinear least squares problem fitsinto this theme and it is not without precedent. Incremental methods, such as incre-mental gradient [3] and the extended Kalman filter [2], when applied to a nonlinearleast squares problem can be interpreted as projecting onto a row (or set of rows) ateach iteration, choosing the basis W ℓ = [ I ] · , I ℓ where I ℓ the set of rows at the ℓ thstep. With this perspective, we note that both our method and incremental methodsrequire the projected model W ∗ ℓ f ( θ ) and projected Jacobian W ∗ ℓ J ( θ ) to be formedinexpensively. Satisfying this requirement is straightforward for incremental methodswhen f ( θ ) is defined entry-wise, whereas in our case we must be careful choose theorthonormal basis W ℓ such that these products can be, for example, evaluated inclosed-form. However these methods differ is in how the basis W ℓ is chosen. Forincremental methods, the set of rows is typically chosen either deterministically bycycling through rows or by randomly selecting rows [10]; whereas in our case, wecarefully choose the basis W ℓ such that our steps are accurate, and, when the datais contaminated by noise, our parameter estimates are precise.
2. An optimization perspective.
In this section we provide three differentresults that inform the choice of subspace W ℓ from the perspective of optimization.Each of these results points to the key role played by the canonical subspace angles JEFFREY M. HOKANSON between the subspace W ℓ and the range of the Jacobian at the current iterate. Wedefine these canonical subspace angles following Björck and Golub [5, Thm. 1]: if A and B are two subspaces of C n and if A ∈ C n × m a and B ∈ C n × m b are orthonormalbases for A and B , then the canonical subspace angles φ k ( A , B ) between A and B are(6) cos φ k ( A , B ) := σ k ( A ∗ B ) , ≤ φ ( A , B ) ≤ φ ( A , B ) ≤· · · ≤ φ min { m a ,m b } ( A , B ) ≤ π/ where σ k ( X ) is the k th singular value of X in descending order. Our first result insubsection 2.1 uses the first order necessary conditions to observe that the projectedproblem will have the same stationary points as the full problem when the subspaceangles between the W and the range of the Jacobian at the stationary point are zero.Our second result in subsection 2.2 shows that the difference between the Gauss-Newton steps of the full and projected problems depends on the subspace anglesbetween W ℓ and the range of the Jacobian at the current iterate. Our third result insubsection 2.3 interprets the Levenberg-Marquardt method applied to the projectedproblem as computing inexact steps of the Levenberg-Marquardt method applied tothe full problem. We show that by making the subspace angles between W ℓ and therange of the Jacobian at the current iterate small, we can satisfy one of the conditionsfor the convergence of inexact Newton. All these results suggest that the subspaceangles between W ℓ and the range of the Jacobian at the current iterate should besmall. The first order necessary conditions for a point q θ to be a local optimum require that the gradient of the objective function at this pointbe zero [30, Thm. 2.2]. In the context of nonlinear least squares, where the gradientof the full problem (1) is(7) ∇ θ k f ( θ ) − e y k = 2 J ( θ ) ∗ r ( θ ) , r ( θ ) := f ( θ ) − e y , [ J ( θ )] · ,k := ∂ r ( θ ) ∂ [ θ ] k a point q θ satisfies the first order necessary conditions if [4, §9.1.2](8) J ( q θ ) ∗ r ( q θ ) = . Similarly for the projected problem (2), a point q θ W will satisfy the first order necessaryconditions for the projected problem if(9) J ( q θ W ) ∗ P W r ( q θ W ) = . To assess the quality of the projected problem, we ask: under what conditions will q θ W also satisfy the first order necessary conditions for the full problem (8)? There aretwo conditions under which this happens. The zero-residual case, where r ( q θ W ) = ,implies that measurements e y exactly fit the model f . This situation makes the problemeasy to solve, as any subspace W that yields a well-posed optimization problem canbe used. The other, more general situation occurs when W contains the range ofthe Jacobian, as then P W J ( q θ W ) = J ( q θ W ) . This is equivalent to requiring all thesubspace angles between W and Range J ( q θ W ) to be zero. The challenge with thiscondition is that it is black or white: either the W contains the range of the Jacobianor it does not. In the next two subsections we suggest other requirements on W thatallow more shades of grey. 
ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING Another result that provides insight into the choiceof subspace W comes from considering the Gauss-Newton step [30, §10.3] for the fulland projected problems at θ :(10) s = − J ( θ ) + r ( θ ) (full) , s W = − [ P W J ( θ )] + P W r ( θ ) (projected)where A + denotes the pseudoinverse of A [13, §5.5.4]. We bound the differencebetween these two steps using Lemma 6 from Appendix A. Theorem
Let s and s W be the Gauss-Newtonsteps for the full and projected problems at θ as given in (10) , then their mismatch isbounded by (11) k s − s W k ≤ k J ( θ ) + k k r ( θ ) k (cid:2) sin φ q ( W , J ( θ )) + tan φ q ( W , J ( θ )) (cid:3) where J ( θ ) := Range J ( θ ) . Using this theorem, we can provide a heuristic for estimating the mismatchbetween the full and projected parameter estimates. Applying the Gauss-Newtonmethod to the full problem starting at a stationary point of the projected problem q θ W , we note the first step cannot move further than(12) k s k ≤ k J ( q θ W ) + k k r ( q θ W ) k h sin φ q ( W , J ( q θ W )) + tan φ q ( W , J ( q θ W )) i . Although multiple iterations of the Gauss-Newton method might be required to reacha stationary point of the full problem, if q θ W is sufficiently close to a stationary point q θ of the full problem, then we expect this first step to yield a good estimate; i.e., q θ W + s ≈ q θ . This suggests choosing subspaces W to minimize the largest subspaceangle between W and the range of Jacobian at the stationary point of the projectedproblem J ( q θ W ) to ensure the full and projected parameter estimates are nearby. A third and final result that providesinsight into the choice of subspace W comes from considering steps of the Levenberg-Marquardt method [30, §10.3] applied to the projected problem (2) as inexact stepsof the Levenberg-Marquardt method applied to the full problem (1). For the fullproblem, the Levenberg-Marquardt method generates a sequence of iterates { θ k } k starting from a given θ using the rule(13) θ k +1 = θ k + s k , s k := argmin s ∈ C q (cid:13)(cid:13)(cid:13)(cid:13)(cid:20) J ( θ k ) λ k I (cid:21) s + (cid:20) r ( θ k ) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) where λ k has been chosen to enforce a trust region; see, e.g., [25, §3.3.5]. Iterates ofthe projected problem { e θ k } k ≥ follow a similar update rule:(14) e θ k +1 = e θ k + e s k , e s k := argmin s ∈ C q (cid:13)(cid:13)(cid:13)(cid:13)(cid:20) W ∗ k J ( θ k ) λ k I (cid:21) s + (cid:20) W ∗ k r ( θ k ) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) , where W k is the orthonormal basis for the subspace applied at the k th step. Herewe ask, under what conditions on W k does the sequence { e θ k } k converge to the samepoint as { θ k } k ? Although we are unable to prove the convergence of the projectediterates unless the subspace angles between J ( e θ k ) and W k go to zero, we invoke the JEFFREY M. HOKANSON convergence analysis of inexact Newton [6], and specific results for inexact Levenberg-Marquardt [44], to suggest a choice of W k . These convergence results require that theerror in the step e s k be bounded by a forcing sequence { α k } k :(15) k ( J ( e θ k ) ∗ J ( e θ k ) + λ k I ) e s k + J ( e θ k ) ∗ r ( e θ k ) k k J ( e θ k ) ∗ r ( e θ k ) k ≤ α k < α < . Here we show that the quantity on the left can be bounded above in terms of thesubspace angles and that this quantity can be made arbitrarily small.To bound (15) we use a result from Appendix A applied to the augmented sub-space, Jacobian, and residual in the least squares problem for e s k , (14), c W k := Range (cid:20) W k
00 I (cid:21) , b J k := (cid:20) J ( e θ k ) λ k I (cid:21) , b r k := (cid:20) r ( θ k ) (cid:21) . Then applying Lemma 7 with subspace c W k to (14) yields,(16) k ( J ( e θ k ) ∗ J ( e θ k ) + λ k I ) e s k + J ( e θ k ) ∗ r ( e θ k ) k k J ( e θ k ) ∗ r ( e θ k ) k ≤ sin φ q ( c W k , b J k )cos φ q ( c W k , b J k ) k b J k k k P ⊥ b J k b r k k k J ( e θ k ) ∗ r ( e θ k ) k . where b J k := Range b J k , b R k := Range b r k , and P ⊥ b J k denotes the orthogonal projectoronto the subspace perpendicular to b J k . To obtain an expression in terms of subspaceangles, we note the numerator can be written in terms of sines k P ⊥ b J k b r k k = sin φ ( b J k , b R k ) k b r k k = sin φ ( b J k , b R k ) k r ( e θ k ) k , and similarly we can bound the denominator in terms of subspace angles, k J ( e θ k ) ∗ r ( e θ k ) k = k b J ∗ k b r k k = k b J ∗ k P b J k b r k k ≥ σ q ( b J k ) cos φ ( b J k , b R k ) k r ( e θ k ) k . Combining these two results yields the upper bound(17) k ( J ( e θ k ) ∗ J ( e θ k )+ λ k I ) e s k + J ( e θ k ) ∗ r ( e θ k ) k k J ( e θ k ) ∗ r ( e θ k ) k ≤ sin φ q ( c W k , b J k ) tan φ ( b J k , b R k )cos φ q ( c W k , b J k ) σ ( b J k ) σ q ( b J k ) . This result again confirms the centrality of the subspace angles between W andthe range of the Jacobian, although in this result, it is the augmented subspace c W k and augmented Jacobian b J k . By controlling the subspace W k , we can ensure thatbound in (17) is smaller than one so that step e s k obeys the bound required by inexactNewton (15). This suggests that the Levenberg-Marquardt method applied to theprojected problem makes progress towards solving the full problem. However, thisresult cannot be used online as it requires evaluating the full residual to compute φ ( b J k , b R k ) . Nor can we use this result to guarantee convergence since Wright andHolt’s convergence result for inexact Levenberg-Marquardt [44, Thm. 5] requires anadditional sufficient decrease condition. The projected problem is unlikely to satisfythis additional constraint since projected problem converges different stationary pointsunless the subspace angles between W k and J ( e θ k ) go to zero as k → ∞ . This promptsthe statistical approach we use in the next section to answer the question: how closeare the projected parameter estimates to the full parameter estimates? ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING
3. A statistical perspective.
One setting in which the nonlinear least squaresproblem (1) can arise is when measurements e y are the sum of f evaluated at some trueparameters b θ ∈ C q plus Gaussian random noise g with zero mean and covariance ǫ I ; e y = f ( b θ ) + g . Then the nonlinear least squares problem,(18) e θ ( g ) := argmin θ k f ( θ ) − ( f ( b θ ) + g ) k , yields the maximum likelihood estimate e θ of b θ [36, §2.1]. This estimate has a numberof beneficial features. In the limit of large data or small noise, the estimator e θ isunbiased and obtains the Cramér-Rao lower bound , namely, e θ has the smallest pos-sible covariance of any unbiased estimator of b θ [34, §6.3]. Hence, the correspondingprojected parameter estimate(19) e θ W ( g ) := argmin θ k P W [ f ( θ ) − ( f ( b θ ) + g ) ] k must have a larger covariance. By using the inexpensive projected parameter estimate e θ W as an alternative to full parameter estimate e θ , we are following in a tradition thatdates back to Fisher [9, §8]. Fisher quantified the loss of precision incurred by aparticular scalar estimator by the efficiency : the ratio of the minimum covarianceto the estimator’s covariance. For our vector valued estimates, the covariance is apositive definite matrix in C q × q and so to obtain a scalar value for the efficiencyratio, we follow the lead of experimental design [37, §2.1],[28, §1.4] and consider thedeterminant of the covariance matrix, which leads to the D-efficiency (20) b η ( W ) := det Cov e θ det Cov e θ W ∈ [0 , . In choosing our subspace W , our goal will be to make the efficiency as large as possibleso that the covariance of the estimates for e θ W and e θ are similar. However, as e θ and e θ W are both nonlinear functions of the noise g , we cannot compute a closed formexpression for the efficiency. Instead, following a standard approach for nonlinearexperimental design [8, §1.4], we linearize the parameter estimates e θ and e θ W about b θ yielding the linearized D-efficiency as derived in subsection 3.1. The main result ofthis section will be to connect linearized efficiency to the subspace angles between W and the range of the Jacobian at b θ , J ( b θ ) :(21) η ( W , J ( b θ )) := det[ J ( b θ ) ∗ J ( b θ )] − det[ J ( b θ ) ∗ P W J ( b θ )] − = q Y k =1 cos φ q ( W , J ( b θ )) ≈ det Cov e θ det Cov e θ W . We will refer to η as simply the efficiency, and following Fisher we will say a subspace W is efficient for b θ if η ( W , J ( b θ )) = 0 . . Later in section 5 we will design sub-spaces for the exponential fitting problem with the goal of obtaining a target efficiency.Further, note that maximizing efficiency corresponds to minimizing the covariance of e θ W and hence selecting the subspace W is similar to experimental design [8, 37, 28],albeit where the design is happening after the data has been collected. Here we briefly derive the linearized estimate of e θ W and the corresponding linearized covariance; cf. [36, §12.2.6]. In the limit of smallnoise g , we expand e θ W about the true parameters b θ (22) e θ W ( g ) = b θ + [ W ∗ J ( b θ )] + W ∗ g + O ( k g k ) . JEFFREY M. HOKANSON
Applying this first order estimate in the covariance, we have
Cov e θ W = E g [( b θ − e θ W ( g ))( b θ − e θ W ( g )) ∗ ] ≈ E g h [ W ∗ J ( b θ )] + ∗ W ∗ gg ∗ W [ W ∗ J ( b θ )] + i = ǫ [ J ( b θ ) ∗ WW ∗ J ( b θ )] − (23)when Cov g = ǫ I . To obtain the D-linearized efficiency (21), we replace Cov e θ and Cov e θ W with this the estimate above. As with the optimization per-spective, a good subspace from a statistical perspective will have small subspaceangles between W and the Jacobian J ( b θ ) . The following theorem establishes thisconnection. Theorem If W is an m -dimensional subspace of C n with orthonormal basis W and J ∈ C n × q where m ≥ q with J := Range J , then (24) η ( W , J ) := det([ J ∗ J ] − )det([ J ∗ WW ∗ J ] − ) = q Y k =1 cos φ k ( W , J ) where φ k ( A , B ) is the k th principle angle between A , B ⊂ C n as defined in (6) .Proof. Let J = QT be the short-form QR factorization of J where Q ∗ Q = I .Then using the multiplicative property of the determinant [17, §0.3.5], η ( W , J ) = det( J ∗ WW ∗ J )det( J ∗ J ) = det( Q ∗ WW ∗ Q ) det( T ) det( T ∗ )det( Q ∗ Q ) det( T ) det( T ∗ )= det( Q ∗ WW ∗ Q ) = q Y k =1 σ k ( W ∗ Q ) = q Y k =1 cos φ k ( W , J ) . We conclude withthree results about the linearized D-efficiency that aid in our construction of subspacesfor exponential fitting in section 5. There, our approach will be to precompute a finitenumber of subspaces for a single exponential and combine these to produce subspacesfor multiple exponentials.The first result proves an intuitive fact: by enlarging the subspace W , the effi-ciency will not decrease. The following theorem establishes this result, making use ofthe partial ordering of positive definite matrices; namely, A (cid:23) B if A − B is positivedefinite [17, §7.7]. Theorem If W ⊆ W and J are subspaces of C n then η ( W , J ) ≤ W , J ) .Proof. Let W and W be orthonormal bases for W and W and let Q be anorthonormal basis for J . Then W W ∗ (cid:22) W W and by [17, Cor. 7.7.4], η ( W , J ) = det( Q ∗ W W ∗ Q ) ≤ det( Q ∗ W W ∗ Q ) = η ( W , J ) . Since our subspaces for multiple exponentials will be built from a union of sub-spaces for each exponential, this second result shows that the union satisfies a nec-essary (but not sufficient) condition for the combined subspace to have the sameefficiency as each component subspace had for a single exponential.
Theorem If W , {J k } k are subspaces of C n , J = S k J k , and the dimensionof W exceeds J , then η ( W , J ) ≤ min k η ( W , J k ) . ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING Proof.
Let Q = [ Q , Q ] , where Q ∈ C n × q and Q ∈ C n × q , be an orthonormalbasis for J such that Q ∈ C n × q is a basis for J ℓ and let W be an orthonormal basisfor W . Then as σ k ( W ∗ Q ) ≤ , η ( W , J ) = q + q Y k =1 σ k ( W ∗ Q ) ≤ q Y k =1 σ q + k ( W ∗ Q ) ≤ q Y k =1 σ k ( W ∗ Q ) = η ( W , J ℓ ) , where the second inequality follows from deleting the last q columns of W ∗ Q andapplying [18, Cor. 3.1.3]. The result follows by repeating this process for each J ℓ .The final result provides a lower bound on the efficiency for a nearby Jacobian.This bound is used in subsection 5.1 to convert a check for efficiency over a continuousset of subspaces J ( θ ) into a check over a discrete set. Theorem If W , J , and J are subspaces of C n and J and J have the samedimension, then η ( W , J ) η ( J , J ) ≤ η ( W , J ) .Proof. Let Q and Q be orthonormal bases for J and J . As P W (cid:23) P J P W P J we obtain the lower bound after application of [17, Cor. 7.7.4] η ( W , J ) = det( Q ∗ P W Q ) ≥ det( Q ∗ Q Q ∗ P W Q Q ∗ Q )= det( Q ∗ P W Q ) det( Q ∗ Q Q ∗ Q ) = η ( W , J ) η ( J , J ) . With these general results from the two preceding sections complete, we now turnto the specifics of the exponential fitting problem.
4. Fast inner-products for exponential fitting.
In the two preceding sec-tions, we have argued that subspaces W should be chosen such that the subspaceangles between W and the range of the Jacobian are small. Now, in this section weturn to the specific problem of selecting a family of subspaces W for the exponentialfitting problem that not only satisfy this requirement, but also have orthonormal bases W such that projected model W ∗ f ([ ω , a ]) and projected Jacobian W ∗ J ([ ω , a ]) canbe inexpensively computed in fewer than O ( n ) operations. For the exponential fittingproblem we chose the subspace W ( µ ) parameterized by µ ∈ C m with correspondingorthonormal basis W ( µ ) ∈ C n × m (25) W ( µ ) := Range V ( µ ) , W ( µ ) := V ( µ ) R ( µ ) − , R ( µ ) ∈ C m × m where V ( µ ) ∈ C n × m is the Vandermonde matrix [ V ( µ )] j,k = e jµ k and R ( µ ) isconstructed as described in subsection 4.3 such that W ( µ ) has orthonormal columns.We call the parameters µ interpolation points since if the entries of ω are a subset ofthe entries of µ , then the projected model interpolates the full model:(26) ω ⊂ µ = ⇒ P W ( µ ) f ([ ω , a ]) = P Range V ( µ ) V ( ω ) a = V ( ω ) a = f ([ ω , a ]) . In this section we show how to inexpensively compute the product of W ( µ ) with theexponential fitting model f ([ ω , a ]) and Jacobian J ([ ω , a ]) ,(27) J ([ ω , a ]) := (cid:2) V ′ ( ω ) diag( a ) V ( ω ) (cid:3) , [ V ′ ( ω )] j,k = ∂∂ω k [ V ( ω )] j,k = je jω k . Examining the products W ( µ ) ∗ f ([ ω , a ]) and W ( µ ) ∗ J ([ ω , a ]) , W ( µ ) ∗ f ([ ω , a ]) = R ( µ ) −∗ V ( µ ) ∗ V ( ω ) a (28) W ( µ ) ∗ J ([ ω , a ]) = (cid:2) R ( µ ) −∗ V ( µ ) ∗ V ′ ( ω ) R ( µ ) −∗ V ( µ ) ∗ V ( ω ) diag a (cid:3) , (29)0 JEFFREY M. HOKANSON reveals two matrix multiplications of size n , V ( µ ) ∗ V ( ω ) and V ( µ ) ∗ V ′ ( ω ) , that needto be inexpensively computed. Here we use the geometric sum formula and gen-eralization provided by Theorem 9 in Appendix B, to compute the entries of theseproducts in closed form. Unfortunately, these formulas exhibit catastrophic cancel-lation in finite precision arithmetic necessitating careful modifications to obtain highrelative accuracy as described in subsection 4.1 for V ( µ ) ∗ V ( ω ) and in subsection 4.2for V ( µ ) ∗ V ′ ( ω ) . Additionally, we discuss how to compute R ( µ ) inexpensively from V ( µ ) ∗ V ( µ ) in subsection 4.3. The choice of interpolation points µ is later discussedin section 5 and combined with these results yields our algorithm for exponentialfitting described in section 6. Each entry in the product of two Vandermonde matri-ces V ( µ ) ∗ V ( ω ) is a geometric sum and hence has a closed form expression via the geometric sum formula :(30) [ V ( µ ) ∗ V ( ω )] j,k = n − X ℓ =0 e µ j ℓ e ω k ℓ = − e n ( µ j + ω k ) − e µ j + ω k , e µ j + ω k = 1; n, e µ j + ω k = 1 . In finite precision arithmetic this formula exhibits catastrophic cancellation when e µ j + ω k ≈ . Fortunately, many standard libraries provide the special function expm1 that evaluates e x − to high relative accuracy. However, even with this specialfunction, there is still a removable discontinuity at e µ j + ω k = 1 . Hence in floating pointwe patch this function using a two-term Taylor series expansion around e µ j + ω k = 1 :(31) [ V ( µ ) ∗ V ( ω )] j,k = expm1 ( n ( µ j + ω k )) expm1 ( µ j + ω k ) , | expm1 ( µ j + ω k ) | > − ; n (1 + ( n − µ j + ω k ) / , | expm1 ( µ j + ω k ) | ≤ − . 
In our numerical experiments, this expression has a relative accuracy of ∼ − whencompared to a 500-digit reference evaluation of (30) using mpmath [23]. Entries of the product V ′ ( µ ) ∗ V ( ω ) are nolonger a geometric sum, but a generalized geometric sum that has a closed formexpression given by Theorem 9 in Appendix B: [ V ( µ ) ∗ V ′ ( ω )] j,k = n − X ℓ =0 ℓe µ j ℓ e ω k ℓ = − ne n ( µ j + ω k ) − e µ j + ω k + e ω k + µ j (1 − e n ( ω k + µ j ) )(1 − e ω k + µ j ) , e ω k + µ j = 1; n ( n − / , e ω k + µ j = 1 . (32)As with the geometric sum formula, this expression also exhibits catastrophic can-cellation but this can no longer be fixed using standard special functions. Instead,we derive a more accurate expression in floating point arithmetic by rearranging theexpression in the first case and using a Taylor series about e µ j + ω k = 1 . Defining δ j,k := µ j + ω k ∈ R × [ − π/ , π/ i (removing periodicity in the imaginary part) thefirst case of (32) can be rearranged to yield(33) − ne nδ j,k − e δ j,k + e δ j,k (1 − e nδ j,k )(1 − e δ j,k ) = 1 − e nδ j,k − e δ j,k (cid:20) e δ j,k − e δ j,k − ne nδ j,k − e nδ j,k (cid:21) . ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING expm1 and theexpression inside the brackets has a rapidly converging Taylor series: e δ − e δ − ne nδ − e nδ = n −
12 + ( n − δ − ( n − δ
720 + ( n − δ − ( n − δ n − δ − n − δ O ( n δ ) . Calling the first seven terms of this expansion the special function expdiff ( n, δ ) , wethen evaluate the product V ( µ ) ∗ V ′ ( ω ) in finite precision arithmetic using(34) [ V ( µ ) ∗ V ′ ( ω )] j,k = ne nδ j,k expm1 ( δ j,k ) − e δ j,k expm1 ( δ j,k n )[ expm1 ( δ j,k )] , | δ j,k | > . /n ; expm1 ( nδ j,k ) expm1 ( δ j,k ) expdiff ( n, δ j,k ) , < | δ j,k | ≤ . /n ; n ( n − / , δ j,k = 0 . In our numerical experiments, this expression has a relative accuracy of ∼ − whencompared to a 500-digit reference evaluation of (33) using mpmath . Finally we need to compute the matrix R ( µ ) suchthat V ( µ ) R ( µ ) − has orthonormal columns inexpensively. One approach would beto simply take the QR-factorization of V ( µ ) , but this is has an O ( n ) -dependent cost.Instead, our approach is to form V ( ω ) ∗ V ( ω ) using (30) and either take its Choleskydecomposition or its eigendecomposition to compute R ( µ ) V ( µ ) ∗ V ( µ ) = R ( µ ) R ( µ ) ∗ , (Cholesky) , V ( µ ) ∗ V ( µ ) = U ( µ ) Λ ( µ ) U ( µ ) ∗ , (eigendecomposition) , R ( µ ) = U ( µ ) Λ ( µ ) / . Although the Cholesky decomposition should be preferred since V ( µ ) ∗ V ( µ ) is pos-itive definite provided each of the { e µ j } j are distinct, in finite precision arithmeticthis product can have small negative eigenvalues. Instead we compute R ( µ ) − usingthe eigendecomposition, truncating the small ( < − ) eigenvalues.
5. A subspace for exponential fitting.
With the results of the previous sec-tion, we can now inexpensively project the model and Jacobian onto the subspace W ( µ ) . However, this leaves one question: how do we choose the interpolation points µ such that the subspace angles between W ( µ ) and the exponential fitting Jacobian(35) J ([ ω , a ]) = (cid:2) V ′ ( ω ) diag( a ) V ( ω ) (cid:3) , [ V ′ ( ω )] j,k = ∂∂ω k [ V ( ω )] j,k = je jω k . are small? Immediately we note that these subspace angles do not depend on a (ifany entry of a was zero, we would instead fit fewer exponentials); hence we define(36) J ( ω ) := Range (cid:2) V ′ ( ω ) V ( ω ) (cid:3) . Structurally, it may seem impossible to have small subspace angles between W ( µ ) and J ( ω ) since W ( µ ) = Range V ( µ ) does not contain any columns from V ′ . However,since columns of V ′ are the derivatives of the columns of V , we can approximate therange of V ′ using finite differences(37) V ( ω ) ≈ V ( ω + δ + ) + V ( ω + δ − )2 V ′ ( ω ) ≈ V ( ω + δ + ) − V ( ω + δ − ) δ + − δ − . JEFFREY M. HOKANSON α α α α − π − π/ π/ π Re ω I m ω x e x Fig. 2 . An illustration of the box partition of the parameter space ω ∈ ( −∞ , × [ − π, π ) i for exponential fitting. For a given exponential with frequency ω denoted by × , we select the boxcontaining it, denoted in red, and the corresponding four interpolation points at the corners, denotedby • . The blue shaded region shows the set of ω where the efficiency at ω using this subspace is atleast . This region includes the entire box and extends outward. The figure on the right showsthe same features under the exponential map, exposing the periodicity of the parameter space. Hence, for an appropriate choice of interpolation points, the subspace angles be-tween W ( µ ) and J ( ω ) are small; for example, the interpolation points δ ± ( ω ) =0 . ω ± max {− .
52 Re ω, . /n } yields a subspace with efficiency for any ω ∈ ( −∞ , × [ − π, π ) i . Here our approach for selecting interpolation points is to di-vide the parameter space for a single exponential with frequency ω ∈ ( −∞ , × [ − π, π ) into a series of boxes as shown in Figure 2. These boxes have been constructed suchthat when the corners of the box containing ω are taking as interpolation points,the efficiency of this subspace is at least and hence the subspace angles between W ( µ ) and J ( ω ) are small. Then, for multiple exponentials, we simply combine thesubspaces generated by this heuristic, justified by Theorem 4 that this combination isa necessary condition for the combined subspace to also have at least efficiency.There are several advantages to this box partition approach. By limiting ourselvesto a finite number of interpolation points, we can frequently reuse the multiplication V ( µ j ) ∗ e y as the subspace updates. Moreover, constructing this subspace is inexpen-sive, allowing frequent updates during optimization. In the remainder of this sectionwe first discuss a practical algorithm for building this box partition for any targetefficiency and then give the coordinates for box partition with efficiency. Our goal in constructing the box partition is toguarantee that the target efficiency η target is obtained for every single exponentialwith frequency ω inside the box. This is an expensive task, so we build our boxesto simplify the verification process. As shown in Figure 2, our box partition consistsof a series of stacks where each stack has twice as many boxes as the one on its left,evenly dividing the imaginary component of the parameter space. Then, as efficiencydepends on δ j,k = µ j + ω k , cf. (30) and (32), simultaneously shifting the imaginaryparts of µ j and ω k does not change δ j,k and hence verifying that one box in the stackobtains the target efficiency for each ω inside establishes the same for the remainingboxes in the stack. With this construction, there is only one set of free parameters:the real coordinates of each box { α ℓ } ℓ ≥ . We choose these α ℓ , starting from α = −∞ ,by making α ℓ as large as possible while still obtaining the target efficiency inside each ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING α ℓ = maximize α>α ℓ − α such that η min ≤ minimize ω ∈ [ α ℓ − ,α ] × [0 , π/ ℓ ] i η ( W ( µ ) , J ( ω )) where µ = (cid:2) α ℓ − α ℓ − + 2 π/ ℓ i α α + 2 π/ ℓ i (cid:3) . (38)This is a challenging nested optimization problem over a continuous set of ω , so weinvoke Theorem 5 to construct an auxiliary grid of exponentials where obtaining aslightly higher efficiency at these discrete points guarantees the target efficiency isreached for any ω inside the box.To construct this auxiliary grid, we specify a series of real parts a j ∈ R andimaginary spacings b j ∈ R + that define the grid points z j,k := a j + ikb j . To specify a j and b j with a grid efficiency of η grid , starting from a = α ℓ we solve the singlevariable finding root problem that yields a j and b j ,(39) η ( J ( a j ) , J ( a + (1 + 1 i ) c )) = η grid , ⇒ a j +1 := a j + c, b j +1 := c, setting b = b . Then, invoking Theorem 5, we have the bound(40) minimize ω ∈ [ α ℓ − ,α ] × [0 , π/ ℓ ] i η ( W ( µ ) , J ( ω )) ≤ minimize j,k ∈ Z + z j,k ∈ [ α ℓ − ,α ] × [0 , π/ ℓ ] i η grid · η ( W ( µ ) , J ( z j,k )) . Substituting this bound in (38) replaces the inner optimization with finding the min-imum over a discrete set, simplifying the problem. 
Further, since the accuracy of theefficiency computation is limited by the grid, we restrict the maximization over α tothe discrete set of grid points a j . Here we provide thecoordinates for a box partition with a target efficiency of constructed using η grid = 0 . in Table 1 for multiple values of n . In practice, we restrict our inter-polation points to the closed left half plane and hence set the first α ℓ greater thanzero to zero. Although this choice of target efficiency was arbitrary, it does make theright-most interpolation points correspond to the n th roots of unity that appear inthe discrete Fourier transform (DFT).Although Table 1 only displays the coordinates for several values of n , two pat-terns emerge that allow us to estimate the box partition for any n . First note thatthe values for α ℓ when n = ∞ match those for n = ∞ for all but the last two, whichare always larger. Hence the values of α ℓ for n = ∞ are an lower bound on those forarbitrary n . The other pattern is that after the first five, the α ℓ for n = ∞ shrinkexponentially with(41) α ℓ ≈ − . · − ℓ ; ℓ ≥ . These two patterns allow us to pick the box partition using α ℓ for n = ∞ , extendingthis sequence using the approximation above for larger values of ℓ .When n is not a power of two, the box partition will no longer have the n throots of unity available as interpolation points. Due to their connection with thediscrete Fourier transform, we prefer to keep n th roots of unity available and thusmodify the construction of the box partition. For an n that is not a power of two,everything remains the same except when ω is in the rightmost stack, Re ω ∈ ( α b ℓ , .In this case we no longer use boxes, but pick the two closest interpolation pointswith Re µ = α b ℓ from the leftward stack and the two closest n th roots of unity where Re µ = 0 . Although we are no longer able to guarantee efficiency for exponentialsin this range, this heuristic still provides a high efficiency subspace in practice.4 JEFFREY M. HOKANSON
Table 1
The real coordinates of the box partition α ℓ as determined by solving (38) with η grid = 0 . .Due to the limitation of this discretization, these values are only accurate to approximately threedigits. ℓ n = 16 n = 256 n = 1024 n = 2 n = ∞ − . · − . · − . · − . · − . · − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − . · − − . · − − . · − − . · − − . · − − . · − − . · −
10 8 . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · −
20 8 . · − − . · −
6. A projected exponential fitting algorithm.
Equipped with subspace W ( µ ) for which the projected model W ( µ ) ∗ f ([ ω , a ]) and Jacobian W ( µ ) ∗ J ([ ω , a ]) can be inexpensively computed as described in section 4 and combined with the heuris-tic from section 5 to pick interpolation points µ so that the subspace angles between W ( µ ) and the range of the Jacobian J ( ω ) are small, we now construct an algorithmto solve exponential fitting problem using projected nonlinear least squares. Our basicapproach is to solve a sequence of projected nonlinear least squares problems usingLevenberg-Marquardt and infrequently update the subspace during the course of op-timization. In this section we describe several important details for this algorithm.First, in subsection 6.1 we show how variable projection [11, 12] can be used to im-plicitly solve for the amplitudes a revealing optimization problem over frequencies ω alone; then in subsection 6.2 we discuss the details of how to update the subspace; andfinally in subsection 6.3 we show how to obtain initial estimates of the frequencies ω to enable a fair comparison with subspace based methods which do not require initialestimates. The key insight behind variable projection origi-nated in a PhD thesis by Scolnik on the exponential fitting problem [35]. Recog-nizing the optimal linear parameters a are given by the pseudoinverse for a fixed ω , a = V ( ω ) + e y , allows the residual to be stated as a function of ω alone:(42) r ([ ω , a ]) = f ([ ω , a ]) − e y = V ( ω ) a − e y ⇒ (cid:2) V ( ω ) V ( ω ) + − I (cid:3) e y = P ⊥ V ( ω ) e y , where P ⊥ V ( ω ) is the orthogonal projector onto the subspace perpendicular to the rangeof V ( ω ) . This allows us to define an equivalent optimization problem over ω alone(43) minimize ω ∈ C p k b r ( ω ) k , b r ( ω ) := P ⊥ V ( ω ) e y . ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING b r ,(44) [ b J ( ω )] · ,k := − (cid:20) P ⊥ V ( ω ) ∂ V ( ω ) ∂ω k V ( ω ) − + V ( ω ) −∗ (cid:18) ∂ V ( ω ) ∂ω k (cid:19) ∗ P ⊥ V ( ω ) (cid:21) e y , and we can further exploit the structure of the exponential fitting problem to reveala simple expression for this Jacobian. Defining the short form QR-factorization of V ( ω ) , V ( ω ) = Q ( ω ) T ( ω ) , this Jacobian becomes(45) b J ( ω ) = [ I − Q ( ω ) Q ( ω ) ∗ ] V ′ ( ω ) diag( a ) − Q ( ω ) T ( ω ) −∗ diag( V ′ ( ω ) ∗ b r ( ω )) . With the linear parameters removed, the Levenberg-Marquardt method can then beapplied to the variable projection residual b r ( ω ) and Jacobian b J ( ω ) . The same expres-sions also apply to the projected problem upon making the substitutions: V ( ω ) → W ( µ ) ∗ V ( ω ) , V ′ ( ω ) → W ( µ ) ∗ V ′ ( ω ) , and e y → W ( µ ) ∗ e y . As the analysis in section 2 suggests that the sub-space angles between W ( µ ) and J ( ω ) need to remain small, we repeatedly updatethe interpolation points during the course of optimization. Here we use the efficiencybased heuristic described in section 5 to pick interpolation points, as a high efficiencyensures small subspace angles between W ( µ ) and J ( ω ) . However, rather than dis-carding interpolation points no longer required by this heuristic, we preserve them,continually expanding the subspace. This is necessary to prevent the optimizationalgorithm from entering a cycle. A final issue concerns how we provide the initial valuesthe optimization algorithm. 
Subspace based methods do not require these initialvalues and so to provide a fair comparison we use a simple initialization heuristic.It is well known that peaks in the discrete Fourier transform (DFT) of a signal, F ∗ n e y where [ F n ] j,k = n − / e πijk/n , correspond to the frequencies present [38]. Thisforms the foundation of many initialization approaches. For example, in magneticresonance spectroscopy these peaks can be identified manually [42, §3.3] to initializean optimization algorithm. Here, we pick the initial estimate iteratively. Startingwith the first exponential, we set ω = 2 πi b k/n where b k is the largest entry in F ∗ n e y .Then after the optimization algorithm has terminated, we initialize ω based on thelargest entry of the residual F ∗ n b r ( ω ) . This process repeats until the desired number ofexponentials have been recovered. This approach is similar to that of Macleod [27],but we optimize all the frequencies ω at each step.
7. A numerical example.
To demonstrate the effectiveness of projected non-linear least squares for exponential fitting, we apply the algorithm described in sec-tion 6 to a magnetic resonance spectroscopy test problem from [43, Table 1]; see,e.g., [14, §12.4] for a discussion of the underlying physics. This example describes acontinuous complex signal y ( t ) consisting of eleven exponentials: y ( t ) = X k =1 a k e iπ/ e (2 iπf k − d k ) t where a = [ 75 150 75 150 150 150 150 150 1400 60 500 ] f = [ − − −
54 152 168 292 308 360 440 490 530 ] d = [ 50 50 50 50 50 50 50 25 285 . (46)from which we construct measurements e y ∈ C n by sampling y ( t ) uniformly in time andcontaminating these with independent and identically distributed additive Gaussian6 JEFFREY M. HOKANSON p r o j e c t e d N L S f u ll N L S d e n s e H S V D f a s t H S V D V ( µ ) ∗ e y × × O ( n ) O ( n ) − − data dimension n w a ll t i m e ( s ec o nd s ) Wall clock performance comparison
Fig. 3 . The median wall clock time from from ten runs for four different exponential fittingalgorithm implemented in Matlab 2016b, applied to data from (46) , and running on a 2013 Mac Prowith a 3.5 GHz 6-Core Intel Xeon E5 and 16 GB of RAM clocked at 1866 MHz. We also show thetime taken to form V ( µ ) ∗ e y for µ ∈ C which is a lower bound on the time taken by our projectednonlinear least squares algorithm; this is approximately the time required to check the first ordernecessary conditions. noise g with Cov g = I according to the formula(47) [ e y ] k = y ( δ ( n ) k ) + 15[ g ] k , δ ( n ) := 2563 n · − . This allows us to scale the original problem which took n = 256 by increasing thesample rate δ . In this section we consider three algorithms applied to this expo-nential fitting problem: conventional nonlinear least squares, our projected nonlinearleast squares, and HSVD [1] as a representative of subspace based methods due to itssimple implementation. We present our results using two different implementationsof HSVD: an implementation using dense linear algebra and fast implementationusing an O ( n log n ) Hankel matrix-vector product and an iterative SVD algorithmfollowing [26]. Our goal is to compare these algorithms on two metrics: the wallclock time taken to solve the exponential fitting problem and the precision of theresulting parameter estimates. A Matlab implementation of our projected nonlin-ear least squares algorithm for exponential fitting, the two HSVD implementationsdescribed, and code to construct these examples are provided at https://github.com/jeffrey-hokanson/ExpFit . In these implementations we use tight convergencetolerances: − for both residual norm and solution change in Matlab’s nonlinearleast squares solver lsqnonlin and − for the Ritz residual in Matlab’s eigs usedin the fast HSVD implementation. As these three algorithms use different paradigms for solving theexponential fitting problem, we compare their performance using total wall clocktime. Figure 3 shows the time taken by each algorithm when applied to the magneticresonance spectroscopy test problem given in (46). Asymptotically, the time takenby the dense HSVD implementation scales like O ( n ) due to the dense SVD, whereasthe fast HSVD implementation scales like O ( n log n ) due to the use of the fast Fouriertransform (FFT) to compute the Hankel matrix-vector product. Similarly, the timetaken both nonlinear least squares based approaches scales like O ( n log n ) due to their ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING P r o j ec t e d N L S p r o b a b ili t y d e n s i t y n = 256 n = 1024 n = 4096 k Γ / ( e θ − b θ ) k F a s t H S V D p r o b a b ili t y d e n s i t y k Γ / ( e θ − b θ ) k k Γ / ( e θ − b θ ) k Fig. 4 . The density of standardized error in the parameter estimate e θ , Γ / ( e θ − b θ ) as describedin subsection . Thick curves show the expected distribution of the 2-norm of the standardized errorand the filled regions show the empirically determined density from realizations of each methodfor each n . The dotted lines shows the density of the standardized error in the full nonlinear leastsquares parameter estimate. use of the FFT in the initialization heuristic. Although these three algorithms eachhave the same asymptotic rate, their constants are different. In the limit of largedata, the projected nonlinear least squares implementation is fastest, but for smalldata, the repeated initialization of the optimization algorithm dominates the cost. 
Itis possible that a more careful implementation could avoid this cost and bring thewall clock time for projected nonlinear least squares to closer, or perhaps faster than,the fast HSVD implementation in the limit of small data. In addition to providing faster performance than fast HSVD forlarge data, the projected nonlinear least squares approach also yields more precise pa-rameter estimates. Considering the same magnetic resonance spectroscopy example,we seek to quantify the precision of our parameter estimates. In the limit of smallnoise, the error in the parameter estimate e θ = [ e ω , e a ] relative to the true parameters b θ = [ b ω , b a ] is normally distributed with covariance:(48) Γ := J ( b θ ) ∗ E g [ gg ∗ ] J ( b θ ) = J ( b θ ) ∗ J ( b θ ) · E g [ k g k ] ≈ Cov e θ . If Γ is actually the covariance of e θ , then Γ − / ( e θ − b θ ) is normally distributed withzero mean an unit variance. Hence, the norm of the mismatch k Γ − / ( e θ − b θ ) k followsa χ distribution with degrees of freedom. As seen in Figure 4, the distributionof error of the projected nonlinear least squares problem approximately matches thatof the full problem and approaches the desired χ distribution as n becomes large.However, HSVD provides less precise parameter estimates, a result that follows fromthe analysis of Rao [33].
8. Discussion.
In this paper we have shown that by solving a sequence of pro-jected nonlinear least squares problems we can substantially improve the run timeperformance with a negligible loss of accuracy for solving exponential fitting problemwhen compared to both conventional nonlinear least squares and HSVD, a typical8
JEFFREY M. HOKANSON subspace based approach. For the exponential fitting problem, there are still severalopen questions. Is there a better choice for selecting interpolation points than ourbox partition? Are there better subspaces, such as one that includes not only V ( µ ) ,but V ′ ( µ ) as well? More generally: can we provide conditions that guarantee theconvergence of the series of projected problems? Can we bound the error incurred bythe projection in a deterministic sense? Finally, we ask, what were the key featuresthat allowed the projected approach to work for the exponential fitting problem andcould this approach be applied to other problems? One key feature was that projectedmodel W ∗ ℓ f ( θ ) and Jacobian W ∗ ℓ J ( θ ) could be computed inexpensively. The otherkey feature was that we were able to generate subspaces W ℓ such that the subspaceangles between W ℓ and the range of the Jacobian J ( θ ) remained small. These tworequirements limit the applicability of these results to specific pairs of models f ( θ ) andsubspaces W ℓ , but for those problems that satisfy these requirements the projectednonlinear least squares approach presents a way to improve performance. Acknowledgements.
I would like to thank Mark Embree, Caleb Magruder, andPaul Constantine for their help in preparing this manuscript; Christopher Beattie,the Einstein Foundation of Berlin, Volker Mehrmann, and TU Berlin for hosting meduring the writing of this manuscript; and my PhD committee: Steve Cox, ThanosAntoulas, and Matthias Heinkenschloss.
Appendix A. Projected least squares error bounds.
Here we provide twolemmas used in section 2 related to the accuracy of a projected least squares problem.
Lemma Let A ∈ C n × q have full column rank and let b ∈ C n with respectiverange A and span B . Let W be an m -dimensional subspace of C n where m ≥ q andlet P W be the orthogonal projector onto W where P W A has full column rank. If x isthe minimizer of k Ax − b k and y is the minimizer of k P W ( Ay − b ) k then (49) k x − y k ≤ k A + k k b k (cid:2) sin φ q ( W , A ) sin φ ( W , B )+tan φ q ( W , A ) cos φ ( W , B ) (cid:3) . Proof.
Using the pseudoinverse, we write x and y as x = ( A ∗ A ) − A ∗ b , (50) y = ( A ∗ P W A ) − A ∗ P W b . (51)Inserting the decomposition of the identity I = P W + P ⊥W before b in x ,(52) x = ( A ∗ A ) − A ∗ P W b + ( A ∗ A ) − A ∗ P ⊥W b , we then note the difference between x and y is(53) x − y = ( A ∗ A ) − A ∗ P ⊥W b + (cid:2) ( A ∗ A ) − − ( A ∗ P W A ) − (cid:3) A ∗ P W b . Replacing A with its short form SVD, A = UΣV ∗ , and P W with WW ∗ where W is an orthonormal basis for W , we have x − y = VΣ − U ∗ P ⊥W b − VΣ − (cid:2) ( U ∗ WW ∗ U ) − − I (cid:3) Σ − U ∗ WW ∗ b , k x − y k ≤ k Σ − k (cid:16) k P ⊥W U k k P ⊥W b k + k ( U ∗ WW ∗ U ) − − I k k U ∗ W k k W ∗ b k (cid:17) . Then invoking the subspace angle identities: k P ⊥W b k = sin φ ( W , B ) k b k , k W ∗ b k =cos φ ( W , B ) k b k , k U ∗ W k = cos φ ( W , A ) , and k P ⊥W U k = sin φ q ( A , W ) ,(54) k x − y k ≤ k Σ − k k b k (cid:0) sin φ q ( W , A ) sin φ ( W , B )+ k ( U ∗ WW ∗ U ) − − I k cos φ ( W , A ) cos φ ( W , B ) (cid:1) . ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING k ( U ∗ WW ∗ U ) − − I k , we note that as U ∗ WW ∗ U is positive semidefinite,there exists an α ≥ such that U ∗ WW ∗ U (cid:23) α I . This implies λ k ( U ∗ WW ∗ U ) − α ≥ , ⇒ σ k ( W ∗ U ) − α ≥ , ∀ k ∈ , . . . , q (55)where λ k ( X ) is the k eigenvalue in descending order of X . The largest α satisfying thisinequality is α = σ q ( W ∗ U ) = cos φ q ( W , A ) . Invoking [17, Cor. 7.7.4], U ∗ WW ∗ U (cid:23) α I implies ( U ∗ WW ∗ U ) − (cid:22) α − I and hence ( U ∗ WW ∗ U ) − − I (cid:22) α − I − I = ( α − − I . (56)Upon taking the norm, we have k ( U ∗ WW ∗ U ) − − I k ≤ ( α − − . (57)Thus α − − φ q ( W , A ) and invoking trigonometric identities, sec φ q ( W , A ) − φ q ( W , A ) ; hence(58) k x − y k ≤ k Σ − k k b k (cid:0) sin φ q ( W , A ) sin φ ( W , B )+ tan φ q ( W , A ) cos φ ( W , A ) cos φ ( W , B ) (cid:1) . By applying the upper bound cos φ ( W , A ) ≤ , and noting k Σ − k = k A + k weobtain the desired bound. Lemma In the same setting as Lemma , k A ∗ Ay − A ∗ b k ≤ cos φ ( A , W ) sin φ q ( A , W )cos φ q ( A , W ) k A k k P ⊥A b k . (59) Proof.
Using the pseudoinverse, A ∗ Ay − A ∗ b = A ∗ A ( A ∗ P W A ) − A ∗ P W b − A ∗ b . (60)Then, inserting the decomposition of the identity I = P A + P ⊥A between P W and b , A ∗ Ay − b = A ∗ A ( A ∗ P W A ) − A ∗ P W ( P A + P ⊥A ) b − A ∗ b . (61)After expanding the first term on the right, the P A component is A ∗ b , A ∗ A ( A ∗ P W A ) − A ∗ P W P A b = A ∗ A ( A ∗ P W A ) − A ∗ P W A ( A ∗ A ) − A ∗ b = A ∗ b , (62)and hence cancels A ∗ b leaving one term: A ∗ Ay − A ∗ b = A ∗ A ( A ∗ P W A ) − A ∗ P W P ⊥A b . (63)Next, we define the oblique projector above X := A ( A ∗ P W A ) − A ∗ P W in terms ofthe SVD of A . If A has a full and reduced SVD(64) A = UΣV ∗ = (cid:2) U U (cid:3) (cid:20) Σ (cid:21) V ∗ = U Σ V ∗ , then this oblique projector is X = UΣV ∗ ( V ∗ Σ ∗ UWW ∗ UΣV ∗ ) − VΣ ∗ U ∗ WW ∗ = U Σ V ∗ V ( Σ U ∗ WW ∗ U Σ ) − V ∗ VΣ ∗ U ∗ WW ∗ = U Σ Σ − ( U ∗ WW ∗ U ) − Σ − Σ ∗ U ∗ WW ∗ = U ( U ∗ WW ∗ U ) − U ∗ WW ∗ . JEFFREY M. HOKANSON
Inserting this result into the expression for $A^* A y - A^* b$, we obtain the bound
$\|A^* A y - A^* b\| = \|A^* U_1 (U_1^* W W^* U_1)^{-1} U_1^* W W^* U_2 U_2^* b\|$
$\le \|A\| \, \|(U_1^* W W^* U_1)^{-1}\| \, \|U_1^* W\| \, \|W^* U_2\| \, \|U_2^* b\|$
$= \sigma_q(U_1^* W)^{-2} \, \sigma_1(U_1^* W) \, \sigma_1(W^* U_2) \, \|A\| \, \|P_{\mathcal{A}}^{\perp} b\|$
$= \dfrac{\cos\phi_1(\mathcal{A},\mathcal{W}) \sin\phi_q(\mathcal{A},\mathcal{W})}{\cos^2\phi_q(\mathcal{A},\mathcal{W})} \, \|A\| \, \|P_{\mathcal{A}}^{\perp} b\|.$

Appendix B. Generalized geometric sum formula.
A critical component of our algorithm is the ability to compute in closed form the generalized geometric sum
(65) $\sum_{k=n_0}^{n_1-1} k^p e^{\delta k},$
where $\delta \in \mathbb{C}$ and $p$, $n_0$, $n_1$ are non-negative integers. The standard geometric sum formula provides a closed-form expression when $p = 0$, and when $e^{\delta} = 1$ this sum can be written in terms of Bernoulli polynomials [29, eq. (24.4.9)]. The following lemma establishes the remaining case, when $e^{\delta} \ne 1$ and $p > 0$.
Lemma. Let $\delta \in \mathbb{C}$ with $e^{\delta} \ne 1$ and $p, n_0, n_1 \in \mathbb{Z}_+$ with $0 \le n_0 \le n_1$; then
(66) $\sum_{k=n_0}^{n_1-1} k^p e^{\delta k} = \sum_{\ell=0}^{p} \frac{\chi_{n_0}(p,\ell)\, e^{\delta(n_0+\ell)} - \chi_{n_1}(p,\ell)\, e^{\delta(n_1+\ell)}}{(1 - e^{\delta})^{\ell+1}},$
where $\chi_n(p,\ell)$ is given by the recurrence
(67) $\chi_n(p+1,\ell) = (n+\ell)\, \chi_n(p,\ell) + \ell\, \chi_n(p,\ell-1), \qquad \chi_n(0,\ell) = \delta_{\ell,0}, \qquad p, \ell \ge 0,$
with the convention $\chi_n(p,\ell) = 0$ for $\ell < 0$.
Proof.
Multiplying each term of the geometric sum by $k^p$ corresponds to taking the $p$th derivative of each term with respect to $\delta$. Since this is a finite sum, we pull the derivative outside the sum, yielding
(68) $\sum_{k=n_0}^{n_1-1} k^p e^{\delta k} = \sum_{k=n_0}^{n_1-1} \frac{\partial^p}{\partial \delta^p} e^{\delta k} = \frac{\partial^p}{\partial \delta^p} \sum_{k=n_0}^{n_1-1} e^{\delta k} = \frac{\partial^p}{\partial \delta^p} \frac{e^{\delta n_0} - e^{\delta n_1}}{1 - e^{\delta}}.$
To obtain an explicit formula for the derivative on the right, we show by induction that
(69) $\frac{\partial^p}{\partial \delta^p} \frac{e^{n\delta}}{1 - e^{\delta}} = \sum_{\ell=0}^{p} \frac{\chi_n(p,\ell)\, e^{(n+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}}.$
The base case $p = 0$ holds as $\chi_n(0,0) = 1$.
The inductive step follows by taking the derivative of each side,
$\frac{\partial}{\partial \delta} \sum_{\ell=0}^{p} \frac{\chi_n(p,\ell)\, e^{(n+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}} = \sum_{\ell=0}^{p+1} \frac{\left[ \chi_n(p,\ell)(n+\ell) + \chi_n(p,\ell-1)\,\ell \right] e^{(n+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}} = \sum_{\ell=0}^{p+1} \frac{\chi_n(p+1,\ell)\, e^{(n+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}}.$
Subtracting (69) evaluated at $n = n_1$ from (69) evaluated at $n = n_0$ yields (66).
Theorem. Given $\delta \in \mathbb{C}$, $p \in \mathbb{Z}_+$, and integers $n_0, n_1 \in \mathbb{Z}$ with $0 \le n_0 < n_1$, then
(70) $\sum_{k=n_0}^{n_1-1} k^p e^{\delta k} = \begin{cases} \displaystyle\sum_{\ell=0}^{p} \frac{\chi_{n_0}(p,\ell)\, e^{(n_0+\ell)\delta} - \chi_{n_1}(p,\ell)\, e^{(n_1+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}}, & e^{\delta} \ne 1; \\[1.5ex] \displaystyle\frac{B_{p+1}(n_1) - B_{p+1}(n_0)}{p+1}, & e^{\delta} = 1; \end{cases}$
where $B_p$ is the $p$th Bernoulli polynomial.
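To illustrate how (67) and (70) can be evaluated, the sketch below compares the closed form against direct summation. It is an illustration only, not part of the original text: the helper names `chi` and `generalized_geometric_sum`, the tolerance used to detect $e^{\delta} = 1$, and the use of sympy's Bernoulli polynomial for that branch are all choices made for this example.

```python
import cmath
from sympy import bernoulli   # bernoulli(n, x) evaluates the Bernoulli polynomial B_n(x)

def chi(p, ell, n):
    """Coefficients chi_n(p, ell) from the recurrence (67)."""
    if ell < 0 or ell > p:
        return 0
    if p == 0:
        return 1 if ell == 0 else 0
    return (n + ell) * chi(p - 1, ell, n) + ell * chi(p - 1, ell - 1, n)

def generalized_geometric_sum(delta, p, n0, n1):
    """Evaluate sum_{k=n0}^{n1-1} k^p e^{delta k} via the closed form (70)."""
    r = cmath.exp(delta)
    if abs(r - 1) > 1e-12:                       # e^delta != 1: use the recurrence (67)
        return sum((chi(p, l, n0) * cmath.exp(delta * (n0 + l))
                    - chi(p, l, n1) * cmath.exp(delta * (n1 + l))) / (1 - r) ** (l + 1)
                   for l in range(p + 1))
    # e^delta = 1: Bernoulli polynomial branch of (70)
    return float(bernoulli(p + 1, n1) - bernoulli(p + 1, n0)) / (p + 1)

for delta, p in [(-0.05 + 0.3j, 3), (0.0, 3)]:
    n0, n1 = 2, 50
    direct = sum(k ** p * cmath.exp(delta * k) for k in range(n0, n1))
    closed = generalized_geometric_sum(delta, p, n0, n1)
    print(f"delta = {delta}, p = {p}: |direct - closed| = {abs(direct - closed):.2e}")
```

For these parameter values both branches should agree with the direct sum up to round-off; in a performance-sensitive implementation one would tabulate the $\chi_n(p,\ell)$ coefficients once rather than recompute them recursively.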
REFERENCES

[1] H. Barkhuijsen, R. de Beer, and D. van Ormondt, Improved algorithm for noniterative time-domain model fitting to exponentially damped magnetic resonance signals, J. Magn. Reson., 73 (1987), pp. 553–557, https://doi.org/10.1016/0022-2364(87)90023-0.
[2] D. P. Bertsekas, Incremental least squares methods and the extended Kalman filter, SIAM J. Optim., 6 (1996), pp. 807–822, https://doi.org/10.1137/S1052623494268522.
[3] D. P. Bertsekas, A new class of incremental gradient methods for least squares problems, SIAM J. Optim., 7 (1997), pp. 913–926, https://doi.org/10.1137/S1052623495287022.
[4] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, PA, 1996.
[5] Å. Björck and G. H. Golub, Numerical methods for computing angles between linear subspaces, Math. Comput., 27 (1973), pp. 579–594, https://doi.org/10.1090/S0025-5718-1973-0348991-3.
[6] R. S. Dembo, S. C. Eisenstat, and T. Steihaug, Inexact Newton methods, SIAM J. Numer. Anal., 19 (1982), pp. 400–408, https://doi.org/10.1137/0719025.
[7] D. J. Ewins, Modal Testing: Theory and Practice, Research Studies Press Ltd., Letchworth, Hertfordshire, England, 1984.
[8] V. V. Fedorov, Theory of Optimal Experiments, Academic Press, New York, 1972.
[9] R. A. Fisher, On the mathematical foundations of theoretical statistics, Philos. Trans. R. Soc. London, Ser. A, 222 (1922), pp. 309–368.
[10] M. P. Friedlander and M. Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM J. Sci. Comput., 34 (2012), pp. A1380–A1405, https://doi.org/10.1137/110830629.
[11] G. Golub and V. Pereyra, Separable nonlinear least squares: the variable projection method and its applications, Inverse Prob., 19 (2003), pp. R1–R26, https://doi.org/10.1088/0266-5611/19/2/201.
[12] G. H. Golub and V. Pereyra, The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM J. Numer. Anal., 10 (1973), pp. 413–432, https://doi.org/10.1137/0710036.
[13] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD, third ed., 1996.
[14] P. C. Hansen, V. Pereyra, and G. Scherer, Least Squares Data Fitting with Applications, Johns Hopkins University Press, 2013.
[15] F. B. Hildebrand, Introduction to Numerical Analysis, McGraw-Hill, New York, 1956.
[16] B. L. Ho and R. E. Kalman, Effective construction of linear state-variable models from input/output functions, Regelungstechnik, 14 (1966), pp. 545–592, https://doi.org/10.1524/auto.1966.14.112.545.
[17] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, 1985.
[18] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, Cambridge, 1991.
[19] Y. Hua and T. K. Sarkar, Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise, IEEE Trans. Acoust. Speech Signal Process., 38 (1990), pp. 814–824, https://doi.org/10.1109/29.56027.
[20] S. V. Huffel, H. Chen, C. Decanniere, and P. V. Hecke, Algorithm for time-domain NMR data fitting based on total least squares, J. Magn. Reson., Ser. A, 110 (1994), pp. 228–237, https://doi.org/10.1006/jmra.1994.1209.
[21] A. A. Istratov and O. F. Vyvenko, Exponential analysis in physical phenomena, Rev. Sci. Instrum., 70 (1999), pp. 1233–1257, https://doi.org/10.1063/1.1149581.
[22] S. K. Jain and S. N. Singh, Harmonics estimation in emerging power system: key issues and challenges, Electr. Power Syst. Res., 81 (2011), pp. 1754–1766, https://doi.org/10.1016/j.epsr.2011.05.004.
[23] F. Johansson et al., mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.19), June 2014, http://mpmath.org.
[24] D. W. Kammler and R. J. McGlinn, A bibliography for approximation with exponential sums, J. Comput. Appl. Math., 4 (1978), pp. 167–173, https://doi.org/10.1016/0771-050X(78)90042-6.
[25] C. T. Kelley, Iterative Methods for Optimization, SIAM, 1999.
[26] T. Laudadio, N. Mastronardi, L. Vanhamme, P. V. Hecke, and S. V. Huffel, Improved Lanczos algorithms for blackbox MRS data quantitation, J. Magn. Reson., (2002), pp. 292–297, https://doi.org/10.1006/jmre.2002.2593.
[27] M. D. Macleod, Fast nearly ML estimation of the parameters of real or complex single tones or resolved multiple tones, IEEE Trans. Signal Process., 46 (1998), pp. 141–148, https://doi.org/10.1109/78.651200.
[28] V. B. Melas, Functional Approach to Optimal Experimental Design, Springer, 2006.
[29] National Institute of Standards and Technology, Digital Library of Mathematical Functions, http://dlmf.nist.gov/.
[30] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, New York, 2006.
[31] V. Pereyra and G. Scherer, Exponential Data Fitting and Its Applications, Bentham Science Publishers, 2010.
[32] R. Prony, Essai expérimental et analytique sur les lois de la dilatabilité et sur celles de la force expansive de la vapeur de l'eau et de la vapeur de l'alkool, à différentes températures, J. de l'Ecole Polytechnique, 1 (1795), pp. 24–76. English translation in [40, App. A].
[33] B. D. Rao, Perturbation analysis of an SVD-based linear prediction method for estimating the frequencies of multiple sinusoids, IEEE Trans. Acoust. Speech Signal Process., 36 (1988), pp. 1026–1035, https://doi.org/10.1109/29.1626.
[34] P. J. Schreier and L. L. Scharf, Statistical Signal Processing of Complex-Valued Data: Theory of Improper and Noncircular Signals, Cambridge University Press, Cambridge, 2010.
[35] H. D. Scolnik, On the solution of nonlinear least squares problems, PhD thesis, University of Zurich, 1970.
[36] G. A. F. Seber and C. J. Wild, Nonlinear Regression, Wiley-Interscience, 1989.
[37] S. D. Silvey, Optimal Design, Chapman and Hall, London, 1980.
[38] P. Stoica and R. Moses, Introduction to Spectral Analysis, Prentice Hall, Upper Saddle River, NJ, 1997.
[39] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht, Compressed sensing off the grid, IEEE Trans. Inform. Theory, 59 (2013), pp. 7465–7490, https://doi.org/10.1109/TIT.2013.2277451.
[40] D. Vandevoorde, A fast exponential decomposition algorithm and its applications to structured matrices, PhD thesis, Rensselaer Polytechnic Institute, 1996.
[41] L. Vanhamme, T. Sudin, P. Van Hecke, and S. Van Huffel, MR spectroscopy quantitation: a review of time-domain methods, NMR Biomed., 14 (2001), pp. 233–246, https://doi.org/10.1002/nbm.695.
[42] L. Vanhamme, A. van den Boogaart, and S. V. Huffel, Fast and accurate parameter estimation of noisy complex exponentials with use of prior knowledge, in Proceedings EUSIPCO-96, 1996.
[43] L. Vanhamme, A. van den Boogaart, and S. Van Huffel, Improved method for accurate and efficient quantification of MRS data with use of prior knowledge, J. Magn. Reson., 129 (1997), pp. 35–43, https://doi.org/10.1006/jmre.1997.1244.
[44] S. J. Wright and J. N. Holt, An inexact Levenberg-Marquardt method for large sparse nonlinear least squares, J. Austral. Math. Soc. Ser. B, 26 (1985), pp. 387–403, https://doi.org/10.1017/S0334270000004604.