Projected nonlinear least squares for exponential fitting
aa r X i v : . [ m a t h . NA ] J un PROJECTED NONLINEAR LEAST SQUARESFOR EXPONENTIAL FITTING ∗ JEFFREY M. HOKANSON † Abstract.
The modern ability to collect vast quantities of data presents a challenge for param-eter estimation problems. Posed as a nonlinear least squares problem fitting a model to the data,the cost of each iteration grows linearly with the amount of data; with large data, it can easilybecome too expensive to perform many iterations. Here we develop an approach that projects thedata onto a low-dimensional subspace that preserves the quality of the resulting parameter estimates.We provide results from both an optimization and a statistical perspective that shows that accurateparameter estimates are obtained when the subspace angles between this projection and the Jacobianof the model at the current iterate remain small. However, for this approach to reduce computationalcomplexity, both the projected model and projected Jacobian must be computed inexpensively. Thisplaces a constraint on the pairs of models and subspaces for which this approach provides a com-putational speedup. Here we consider the exponential fitting problem projected onto the range ofa Vandermonde matrix, for which the projected model and projected Jacobian can be computed inclosed form using a generalized geometric sum formula. We further provide an inexpensive heuristicthat picks this Vandermonde matrix so that the subspace angles with the Jacobian remain small anduse this heuristic to update the subspace during optimization. Although the asymptotic cost stilldepends on the data dimension, the overall cost of solving this sequence of projected nonlinear leastsquares problems is less expensive than the original. Applied to the exponential fitting problem, thisyields an algorithm that is not only faster in the limit of large data than the conventional nonlinearleast squares approach, but is also faster than subspace based approaches such as HSVD.
Key words. exponential fitting, harmonic estimation, modal analysis, spectral analysis, param-eter estimation, nonlinear least squares, dimension reduction, experimental design
AMS subject classifications.
1. Introduction.
With the increasing prowess of data acquisition hardware andstorage, collecting vast amounts of data has become trivial. This poses a challenge forparameter estimation problems where the sheer scale of data makes these problemsexpensive. Here we consider a nonlinear least squares parameter estimation prob-lem [14] that seeks to fit a model f with q parameters θ ∈ C q to (noisy) measurements e y yielding a (noisy) parameter estimate e θ that minimizes the -norm mismatch(1) e θ := argmin θ ∈ C q k f ( θ ) − e y k , where f : C q → C n , e y ∈ C n , q ≪ n. With vast quantities of data, the asymptotic cost of solving this problem is dom-inated by the n -dependent steps in the optimization. For example, using eitherGauss-Newton or Levenberg-Marquardt, each optimization step solves a least squaresproblem involving the Jacobian of f , J : C q → C n × q , at a cost of O ( nq ) operations [4,Ch. 9]. To reduce this cost, we propose replacing the full least squares problem (1)with a sequence of low-dimensional surrogate problems by projecting measurementsonto e y onto a sequence of subspaces W ℓ ⊂ C n with m ℓ := dim W ℓ ≪ n (2) e θ W ℓ := argmin θ ∈ C q k P W ℓ [ f ( θ ) − e y ] k = argmin θ ∈ C q k W ∗ ℓ f ( θ ) − W ∗ ℓ e y k , P W ℓ = W ℓ W ∗ ℓ , ∗ Submitted to the editors 14 July 2016.
Funding:
This work was supported by NSF VIGRE grants DMS-0240058 and DMS-0739420 atRice University and the Department of Defense, Defense Advanced Research Project AgencyâĂŹsprogram Enabling Quantification of Uncertainty in Physical Systems. † Department of Applied Mathematics and Statistics, Colorado School of Mines, 1500 Illinois St,Golden CO, 80401, ([email protected], http://inside.mines.edu/~hokanson/).1
JEFFREY M. HOKANSON where P W ℓ is an orthogonal projector onto W ℓ and W ℓ ∈ C n × m ℓ is an orthonormalbasis for W ℓ . Although the total cost is still asymptotically n -dependent due to themultiplication W ∗ ℓ e y , each optimization step is cheaper since the projected Jacobian W ∗ ℓ J ( θ ) is smaller. However, for this computational speedup to be fully realized, theproducts W ∗ ℓ f ( θ ) and W ∗ ℓ J ( θ ) need to be formed without the expensive, n -dependentmultiplication. Additionally, we must ensure that the final projected parameter esti-mate e θ W ℓ remains a good estimate of the full parameter estimate, e θ . Here we do so byrequiring that the subspace angles between W ℓ and the Jacobian at the current iterateremain small. This requirement is justified by perspectives from both optimizationand statistics. From an optimization perspective described in section 2, the accuracyof each optimization step depends on these subspace angles and the projected pa-rameter estimate e θ W ℓ is equal to the full parameter estimate e θ when these subspaceangles go to zero. From a statistical perspective described in section 3, when mea-surements e y are contaminated by additive Gaussian noise, the covariance of projectedparameter estimate e θ W ℓ is larger than the covariance of full parameter estimate e θ byan amount that scales with these subspace angles as measured by efficiency . Hencethe subspace angles between W ℓ and the Jacobian at the current iterate determinethe quality of our projected parameter estimate e θ W ℓ . The challenge in applying thisprojected nonlinear least squares approach to a specific problem is satisfying bothcriteria simultaneously: finding a sequence of subspaces {W ℓ } ℓ with orthogonal bases { W ℓ } ℓ where W ∗ ℓ f ( θ ) and W ∗ ℓ J ( θ ) can be formed inexpensively independently of n and where the subspace angles between W ℓ and the range of J ( θ k ) for each iterate θ k remain small.Here we consider the exponential fitting problem [31], also known as modal analy-sis [7], harmonic estimation [22], and spectral analysis [38], that seeks to approximatedata e y as a sum of p complex exponentials with frequencies ω and amplitudes a where(3) [ f ([ ω , a ])] j = p X k =1 a k e jω k , ω , a ∈ C p ; θ = [ ω , a ] , q = 2 p. There is an extensive body of literature on this problem, with a wide array of methodsfor recovering the frequencies ω : from classical approaches such as Prony’s method [32]its extensions for overdetermined problems [15, §9.4], to subspace methods [16] suchas HSVD [1], HTLS [20], and the matrix-pencil method [19], to parameter estimationapproaches using optimization (such as ours) [14, 43], to more recent approaches basedon ideas from sparse recovery [39], and many others described in reviews [21, 24, 31,41]. We choose this problem due to the exploitable structure of the model function f ([ ω , a ]) that allows us to obtain inexpensive inner products with subspaces thatapproximately contain the range of the Jacobian. Specifically, as the model functionis the product of a Vandermonde matrix V ( ω ) and the amplitudes a ,(4) f ([ ω , a ]) = V ( ω ) a , [ V ( ω )] j,k = e jω k , by projecting measurements onto the subspace W ( µ ) ,(5) W ( µ ) := Range V ( µ ) = Range W ( µ ) , W ( µ ) ∗ W ( µ ) = I , µ ∈ C m , we can inexpensively obtain the inner products W ( µ ) ∗ f ([ ω , a ]) and W ( µ ) ∗ J ([ ω , a ]) as described in section 4 using the geometric sum formula and its generalization givenin Appendix B. 
Further, using a heuristic described in section 5, we can ensure thesubspace angles between W ( µ ) and Range J ([ ω , a ]) remain small by a careful selection ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING − . − . − . . . [ e θ ] = e ω [ e θ ] = e a All measurements of data efficiency − . − . − . [ e θ ] = e ω Every tenth measurement of data . efficiency − . − . − . . . [ e θ ] = e ω Projected measurements . of data . efficiency Fig. 1 . Parameter estimates from a toy exponential fitting problem with true parameters b ω = − . and b a = 1 using n = 1000 measurements contaminated with zero-mean Gaussian noise g with Cov g = 0 . I , e y = f ([ − . , g . The parameter estimates on the left are computed by solving (1) with all measurements e y . In the center, the parameter estimates are computed by solving (2) using a subspace that selects every 10th measurement. On the right, the parameter estimates are alsocomputed by solving (2) , but instead using the subspace W ( µ ) where µ = [ − . ± . i ] . In eachcase, the resulting nonlinear least squares problem was solved using Matlab’s lsqnonlin . Efficiency,defined in section , quantifies how close the covariance of the projected problem resembles the fullproblem (left). As this example shows, the projected parameter estimate with well-chosen subspaceyields almost identical parameter estimates to the full problem. of µ . The net result is a faster solution to the exponential fitting problem in the limitof large data as illustrated by a magnetic resonance spectroscopy test case in section 7.Further, due to careful selection of the subspaces, the projected parameter estimate e θ W remains close to the full parameter estimate e θ as seen in Figure 1 for a toy problemand in Figure 4 for the magnetic resonance spectroscopy test case.Projection is a recurring theme in applied mathematics, appearing in a varietyof contexts from Galerkin projections for solving partial differential equations to therandomized projections that form the foundation of randomized numerical linear al-gebra. Our projection approach for solving a nonlinear least squares problem fitsinto this theme and it is not without precedent. Incremental methods, such as incre-mental gradient [3] and the extended Kalman filter [2], when applied to a nonlinearleast squares problem can be interpreted as projecting onto a row (or set of rows) ateach iteration, choosing the basis W ℓ = [ I ] · , I ℓ where I ℓ the set of rows at the ℓ thstep. With this perspective, we note that both our method and incremental methodsrequire the projected model W ∗ ℓ f ( θ ) and projected Jacobian W ∗ ℓ J ( θ ) to be formedinexpensively. Satisfying this requirement is straightforward for incremental methodswhen f ( θ ) is defined entry-wise, whereas in our case we must be careful choose theorthonormal basis W ℓ such that these products can be, for example, evaluated inclosed-form. However these methods differ is in how the basis W ℓ is chosen. Forincremental methods, the set of rows is typically chosen either deterministically bycycling through rows or by randomly selecting rows [10]; whereas in our case, wecarefully choose the basis W ℓ such that our steps are accurate, and, when the datais contaminated by noise, our parameter estimates are precise.
2. An optimization perspective.
In this section we provide three differentresults that inform the choice of subspace W ℓ from the perspective of optimization.Each of these results points to the key role played by the canonical subspace angles JEFFREY M. HOKANSON between the subspace W ℓ and the range of the Jacobian at the current iterate. Wedefine these canonical subspace angles following Björck and Golub [5, Thm. 1]: if A and B are two subspaces of C n and if A ∈ C n × m a and B ∈ C n × m b are orthonormalbases for A and B , then the canonical subspace angles φ k ( A , B ) between A and B are(6) cos φ k ( A , B ) := σ k ( A ∗ B ) , ≤ φ ( A , B ) ≤ φ ( A , B ) ≤· · · ≤ φ min { m a ,m b } ( A , B ) ≤ π/ where σ k ( X ) is the k th singular value of X in descending order. Our first result insubsection 2.1 uses the first order necessary conditions to observe that the projectedproblem will have the same stationary points as the full problem when the subspaceangles between the W and the range of the Jacobian at the stationary point are zero.Our second result in subsection 2.2 shows that the difference between the Gauss-Newton steps of the full and projected problems depends on the subspace anglesbetween W ℓ and the range of the Jacobian at the current iterate. Our third result insubsection 2.3 interprets the Levenberg-Marquardt method applied to the projectedproblem as computing inexact steps of the Levenberg-Marquardt method applied tothe full problem. We show that by making the subspace angles between W ℓ and therange of the Jacobian at the current iterate small, we can satisfy one of the conditionsfor the convergence of inexact Newton. All these results suggest that the subspaceangles between W ℓ and the range of the Jacobian at the current iterate should besmall. The first order necessary conditions for a point q θ to be a local optimum require that the gradient of the objective function at this pointbe zero [30, Thm. 2.2]. In the context of nonlinear least squares, where the gradientof the full problem (1) is(7) ∇ θ k f ( θ ) − e y k = 2 J ( θ ) ∗ r ( θ ) , r ( θ ) := f ( θ ) − e y , [ J ( θ )] · ,k := ∂ r ( θ ) ∂ [ θ ] k a point q θ satisfies the first order necessary conditions if [4, §9.1.2](8) J ( q θ ) ∗ r ( q θ ) = . Similarly for the projected problem (2), a point q θ W will satisfy the first order necessaryconditions for the projected problem if(9) J ( q θ W ) ∗ P W r ( q θ W ) = . To assess the quality of the projected problem, we ask: under what conditions will q θ W also satisfy the first order necessary conditions for the full problem (8)? There aretwo conditions under which this happens. The zero-residual case, where r ( q θ W ) = ,implies that measurements e y exactly fit the model f . This situation makes the problemeasy to solve, as any subspace W that yields a well-posed optimization problem canbe used. The other, more general situation occurs when W contains the range ofthe Jacobian, as then P W J ( q θ W ) = J ( q θ W ) . This is equivalent to requiring all thesubspace angles between W and Range J ( q θ W ) to be zero. The challenge with thiscondition is that it is black or white: either the W contains the range of the Jacobianor it does not. In the next two subsections we suggest other requirements on W thatallow more shades of grey. 
ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING Another result that provides insight into the choiceof subspace W comes from considering the Gauss-Newton step [30, §10.3] for the fulland projected problems at θ :(10) s = − J ( θ ) + r ( θ ) (full) , s W = − [ P W J ( θ )] + P W r ( θ ) (projected)where A + denotes the pseudoinverse of A [13, §5.5.4]. We bound the differencebetween these two steps using Lemma 6 from Appendix A. Theorem
Let s and s W be the Gauss-Newtonsteps for the full and projected problems at θ as given in (10) , then their mismatch isbounded by (11) k s − s W k ≤ k J ( θ ) + k k r ( θ ) k (cid:2) sin φ q ( W , J ( θ )) + tan φ q ( W , J ( θ )) (cid:3) where J ( θ ) := Range J ( θ ) . Using this theorem, we can provide a heuristic for estimating the mismatchbetween the full and projected parameter estimates. Applying the Gauss-Newtonmethod to the full problem starting at a stationary point of the projected problem q θ W , we note the first step cannot move further than(12) k s k ≤ k J ( q θ W ) + k k r ( q θ W ) k h sin φ q ( W , J ( q θ W )) + tan φ q ( W , J ( q θ W )) i . Although multiple iterations of the Gauss-Newton method might be required to reacha stationary point of the full problem, if q θ W is sufficiently close to a stationary point q θ of the full problem, then we expect this first step to yield a good estimate; i.e., q θ W + s ≈ q θ . This suggests choosing subspaces W to minimize the largest subspaceangle between W and the range of Jacobian at the stationary point of the projectedproblem J ( q θ W ) to ensure the full and projected parameter estimates are nearby. A third and final result that providesinsight into the choice of subspace W comes from considering steps of the Levenberg-Marquardt method [30, §10.3] applied to the projected problem (2) as inexact stepsof the Levenberg-Marquardt method applied to the full problem (1). For the fullproblem, the Levenberg-Marquardt method generates a sequence of iterates { θ k } k starting from a given θ using the rule(13) θ k +1 = θ k + s k , s k := argmin s ∈ C q (cid:13)(cid:13)(cid:13)(cid:13)(cid:20) J ( θ k ) λ k I (cid:21) s + (cid:20) r ( θ k ) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) where λ k has been chosen to enforce a trust region; see, e.g., [25, §3.3.5]. Iterates ofthe projected problem { e θ k } k ≥ follow a similar update rule:(14) e θ k +1 = e θ k + e s k , e s k := argmin s ∈ C q (cid:13)(cid:13)(cid:13)(cid:13)(cid:20) W ∗ k J ( θ k ) λ k I (cid:21) s + (cid:20) W ∗ k r ( θ k ) (cid:21)(cid:13)(cid:13)(cid:13)(cid:13) , where W k is the orthonormal basis for the subspace applied at the k th step. Herewe ask, under what conditions on W k does the sequence { e θ k } k converge to the samepoint as { θ k } k ? Although we are unable to prove the convergence of the projectediterates unless the subspace angles between J ( e θ k ) and W k go to zero, we invoke the JEFFREY M. HOKANSON convergence analysis of inexact Newton [6], and specific results for inexact Levenberg-Marquardt [44], to suggest a choice of W k . These convergence results require that theerror in the step e s k be bounded by a forcing sequence { α k } k :(15) k ( J ( e θ k ) ∗ J ( e θ k ) + λ k I ) e s k + J ( e θ k ) ∗ r ( e θ k ) k k J ( e θ k ) ∗ r ( e θ k ) k ≤ α k < α < . Here we show that the quantity on the left can be bounded above in terms of thesubspace angles and that this quantity can be made arbitrarily small.To bound (15) we use a result from Appendix A applied to the augmented sub-space, Jacobian, and residual in the least squares problem for e s k , (14), c W k := Range (cid:20) W k
00 I (cid:21) , b J k := (cid:20) J ( e θ k ) λ k I (cid:21) , b r k := (cid:20) r ( θ k ) (cid:21) . Then applying Lemma 7 with subspace c W k to (14) yields,(16) k ( J ( e θ k ) ∗ J ( e θ k ) + λ k I ) e s k + J ( e θ k ) ∗ r ( e θ k ) k k J ( e θ k ) ∗ r ( e θ k ) k ≤ sin φ q ( c W k , b J k )cos φ q ( c W k , b J k ) k b J k k k P ⊥ b J k b r k k k J ( e θ k ) ∗ r ( e θ k ) k . where b J k := Range b J k , b R k := Range b r k , and P ⊥ b J k denotes the orthogonal projectoronto the subspace perpendicular to b J k . To obtain an expression in terms of subspaceangles, we note the numerator can be written in terms of sines k P ⊥ b J k b r k k = sin φ ( b J k , b R k ) k b r k k = sin φ ( b J k , b R k ) k r ( e θ k ) k , and similarly we can bound the denominator in terms of subspace angles, k J ( e θ k ) ∗ r ( e θ k ) k = k b J ∗ k b r k k = k b J ∗ k P b J k b r k k ≥ σ q ( b J k ) cos φ ( b J k , b R k ) k r ( e θ k ) k . Combining these two results yields the upper bound(17) k ( J ( e θ k ) ∗ J ( e θ k )+ λ k I ) e s k + J ( e θ k ) ∗ r ( e θ k ) k k J ( e θ k ) ∗ r ( e θ k ) k ≤ sin φ q ( c W k , b J k ) tan φ ( b J k , b R k )cos φ q ( c W k , b J k ) σ ( b J k ) σ q ( b J k ) . This result again confirms the centrality of the subspace angles between W andthe range of the Jacobian, although in this result, it is the augmented subspace c W k and augmented Jacobian b J k . By controlling the subspace W k , we can ensure thatbound in (17) is smaller than one so that step e s k obeys the bound required by inexactNewton (15). This suggests that the Levenberg-Marquardt method applied to theprojected problem makes progress towards solving the full problem. However, thisresult cannot be used online as it requires evaluating the full residual to compute φ ( b J k , b R k ) . Nor can we use this result to guarantee convergence since Wright andHolt’s convergence result for inexact Levenberg-Marquardt [44, Thm. 5] requires anadditional sufficient decrease condition. The projected problem is unlikely to satisfythis additional constraint since projected problem converges different stationary pointsunless the subspace angles between W k and J ( e θ k ) go to zero as k → ∞ . This promptsthe statistical approach we use in the next section to answer the question: how closeare the projected parameter estimates to the full parameter estimates? ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING
3. A statistical perspective.
One setting in which the nonlinear least squaresproblem (1) can arise is when measurements e y are the sum of f evaluated at some trueparameters b θ ∈ C q plus Gaussian random noise g with zero mean and covariance ǫ I ; e y = f ( b θ ) + g . Then the nonlinear least squares problem,(18) e θ ( g ) := argmin θ k f ( θ ) − ( f ( b θ ) + g ) k , yields the maximum likelihood estimate e θ of b θ [36, §2.1]. This estimate has a numberof beneficial features. In the limit of large data or small noise, the estimator e θ isunbiased and obtains the Cramér-Rao lower bound , namely, e θ has the smallest pos-sible covariance of any unbiased estimator of b θ [34, §6.3]. Hence, the correspondingprojected parameter estimate(19) e θ W ( g ) := argmin θ k P W [ f ( θ ) − ( f ( b θ ) + g ) ] k must have a larger covariance. By using the inexpensive projected parameter estimate e θ W as an alternative to full parameter estimate e θ , we are following in a tradition thatdates back to Fisher [9, §8]. Fisher quantified the loss of precision incurred by aparticular scalar estimator by the efficiency : the ratio of the minimum covarianceto the estimator’s covariance. For our vector valued estimates, the covariance is apositive definite matrix in C q × q and so to obtain a scalar value for the efficiencyratio, we follow the lead of experimental design [37, §2.1],[28, §1.4] and consider thedeterminant of the covariance matrix, which leads to the D-efficiency (20) b η ( W ) := det Cov e θ det Cov e θ W ∈ [0 , . In choosing our subspace W , our goal will be to make the efficiency as large as possibleso that the covariance of the estimates for e θ W and e θ are similar. However, as e θ and e θ W are both nonlinear functions of the noise g , we cannot compute a closed formexpression for the efficiency. Instead, following a standard approach for nonlinearexperimental design [8, §1.4], we linearize the parameter estimates e θ and e θ W about b θ yielding the linearized D-efficiency as derived in subsection 3.1. The main result ofthis section will be to connect linearized efficiency to the subspace angles between W and the range of the Jacobian at b θ , J ( b θ ) :(21) η ( W , J ( b θ )) := det[ J ( b θ ) ∗ J ( b θ )] − det[ J ( b θ ) ∗ P W J ( b θ )] − = q Y k =1 cos φ q ( W , J ( b θ )) ≈ det Cov e θ det Cov e θ W . We will refer to η as simply the efficiency, and following Fisher we will say a subspace W is efficient for b θ if η ( W , J ( b θ )) = 0 . . Later in section 5 we will design sub-spaces for the exponential fitting problem with the goal of obtaining a target efficiency.Further, note that maximizing efficiency corresponds to minimizing the covariance of e θ W and hence selecting the subspace W is similar to experimental design [8, 37, 28],albeit where the design is happening after the data has been collected. Here we briefly derive the linearized estimate of e θ W and the corresponding linearized covariance; cf. [36, §12.2.6]. In the limit of smallnoise g , we expand e θ W about the true parameters b θ (22) e θ W ( g ) = b θ + [ W ∗ J ( b θ )] + W ∗ g + O ( k g k ) . JEFFREY M. HOKANSON
Applying this first order estimate in the covariance, we have
Cov e θ W = E g [( b θ − e θ W ( g ))( b θ − e θ W ( g )) ∗ ] ≈ E g h [ W ∗ J ( b θ )] + ∗ W ∗ gg ∗ W [ W ∗ J ( b θ )] + i = ǫ [ J ( b θ ) ∗ WW ∗ J ( b θ )] − (23)when Cov g = ǫ I . To obtain the D-linearized efficiency (21), we replace Cov e θ and Cov e θ W with this the estimate above. As with the optimization per-spective, a good subspace from a statistical perspective will have small subspaceangles between W and the Jacobian J ( b θ ) . The following theorem establishes thisconnection. Theorem If W is an m -dimensional subspace of C n with orthonormal basis W and J ∈ C n × q where m ≥ q with J := Range J , then (24) η ( W , J ) := det([ J ∗ J ] − )det([ J ∗ WW ∗ J ] − ) = q Y k =1 cos φ k ( W , J ) where φ k ( A , B ) is the k th principle angle between A , B ⊂ C n as defined in (6) .Proof. Let J = QT be the short-form QR factorization of J where Q ∗ Q = I .Then using the multiplicative property of the determinant [17, §0.3.5], η ( W , J ) = det( J ∗ WW ∗ J )det( J ∗ J ) = det( Q ∗ WW ∗ Q ) det( T ) det( T ∗ )det( Q ∗ Q ) det( T ) det( T ∗ )= det( Q ∗ WW ∗ Q ) = q Y k =1 σ k ( W ∗ Q ) = q Y k =1 cos φ k ( W , J ) . We conclude withthree results about the linearized D-efficiency that aid in our construction of subspacesfor exponential fitting in section 5. There, our approach will be to precompute a finitenumber of subspaces for a single exponential and combine these to produce subspacesfor multiple exponentials.The first result proves an intuitive fact: by enlarging the subspace W , the effi-ciency will not decrease. The following theorem establishes this result, making use ofthe partial ordering of positive definite matrices; namely, A (cid:23) B if A − B is positivedefinite [17, §7.7]. Theorem If W ⊆ W and J are subspaces of C n then η ( W , J ) ≤ W , J ) .Proof. Let W and W be orthonormal bases for W and W and let Q be anorthonormal basis for J . Then W W ∗ (cid:22) W W and by [17, Cor. 7.7.4], η ( W , J ) = det( Q ∗ W W ∗ Q ) ≤ det( Q ∗ W W ∗ Q ) = η ( W , J ) . Since our subspaces for multiple exponentials will be built from a union of sub-spaces for each exponential, this second result shows that the union satisfies a nec-essary (but not sufficient) condition for the combined subspace to have the sameefficiency as each component subspace had for a single exponential.
Theorem If W , {J k } k are subspaces of C n , J = S k J k , and the dimensionof W exceeds J , then η ( W , J ) ≤ min k η ( W , J k ) . ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING Proof.
Let Q = [ Q , Q ] , where Q ∈ C n × q and Q ∈ C n × q , be an orthonormalbasis for J such that Q ∈ C n × q is a basis for J ℓ and let W be an orthonormal basisfor W . Then as σ k ( W ∗ Q ) ≤ , η ( W , J ) = q + q Y k =1 σ k ( W ∗ Q ) ≤ q Y k =1 σ q + k ( W ∗ Q ) ≤ q Y k =1 σ k ( W ∗ Q ) = η ( W , J ℓ ) , where the second inequality follows from deleting the last q columns of W ∗ Q andapplying [18, Cor. 3.1.3]. The result follows by repeating this process for each J ℓ .The final result provides a lower bound on the efficiency for a nearby Jacobian.This bound is used in subsection 5.1 to convert a check for efficiency over a continuousset of subspaces J ( θ ) into a check over a discrete set. Theorem If W , J , and J are subspaces of C n and J and J have the samedimension, then η ( W , J ) η ( J , J ) ≤ η ( W , J ) .Proof. Let Q and Q be orthonormal bases for J and J . As P W (cid:23) P J P W P J we obtain the lower bound after application of [17, Cor. 7.7.4] η ( W , J ) = det( Q ∗ P W Q ) ≥ det( Q ∗ Q Q ∗ P W Q Q ∗ Q )= det( Q ∗ P W Q ) det( Q ∗ Q Q ∗ Q ) = η ( W , J ) η ( J , J ) . With these general results from the two preceding sections complete, we now turnto the specifics of the exponential fitting problem.
4. Fast inner-products for exponential fitting.
In the two preceding sec-tions, we have argued that subspaces W should be chosen such that the subspaceangles between W and the range of the Jacobian are small. Now, in this section weturn to the specific problem of selecting a family of subspaces W for the exponentialfitting problem that not only satisfy this requirement, but also have orthonormal bases W such that projected model W ∗ f ([ ω , a ]) and projected Jacobian W ∗ J ([ ω , a ]) canbe inexpensively computed in fewer than O ( n ) operations. For the exponential fittingproblem we chose the subspace W ( µ ) parameterized by µ ∈ C m with correspondingorthonormal basis W ( µ ) ∈ C n × m (25) W ( µ ) := Range V ( µ ) , W ( µ ) := V ( µ ) R ( µ ) − , R ( µ ) ∈ C m × m where V ( µ ) ∈ C n × m is the Vandermonde matrix [ V ( µ )] j,k = e jµ k and R ( µ ) isconstructed as described in subsection 4.3 such that W ( µ ) has orthonormal columns.We call the parameters µ interpolation points since if the entries of ω are a subset ofthe entries of µ , then the projected model interpolates the full model:(26) ω ⊂ µ = ⇒ P W ( µ ) f ([ ω , a ]) = P Range V ( µ ) V ( ω ) a = V ( ω ) a = f ([ ω , a ]) . In this section we show how to inexpensively compute the product of W ( µ ) with theexponential fitting model f ([ ω , a ]) and Jacobian J ([ ω , a ]) ,(27) J ([ ω , a ]) := (cid:2) V ′ ( ω ) diag( a ) V ( ω ) (cid:3) , [ V ′ ( ω )] j,k = ∂∂ω k [ V ( ω )] j,k = je jω k . Examining the products W ( µ ) ∗ f ([ ω , a ]) and W ( µ ) ∗ J ([ ω , a ]) , W ( µ ) ∗ f ([ ω , a ]) = R ( µ ) −∗ V ( µ ) ∗ V ( ω ) a (28) W ( µ ) ∗ J ([ ω , a ]) = (cid:2) R ( µ ) −∗ V ( µ ) ∗ V ′ ( ω ) R ( µ ) −∗ V ( µ ) ∗ V ( ω ) diag a (cid:3) , (29)0 JEFFREY M. HOKANSON reveals two matrix multiplications of size n , V ( µ ) ∗ V ( ω ) and V ( µ ) ∗ V ′ ( ω ) , that needto be inexpensively computed. Here we use the geometric sum formula and gen-eralization provided by Theorem 9 in Appendix B, to compute the entries of theseproducts in closed form. Unfortunately, these formulas exhibit catastrophic cancel-lation in finite precision arithmetic necessitating careful modifications to obtain highrelative accuracy as described in subsection 4.1 for V ( µ ) ∗ V ( ω ) and in subsection 4.2for V ( µ ) ∗ V ′ ( ω ) . Additionally, we discuss how to compute R ( µ ) inexpensively from V ( µ ) ∗ V ( µ ) in subsection 4.3. The choice of interpolation points µ is later discussedin section 5 and combined with these results yields our algorithm for exponentialfitting described in section 6. Each entry in the product of two Vandermonde matri-ces V ( µ ) ∗ V ( ω ) is a geometric sum and hence has a closed form expression via the geometric sum formula :(30) [ V ( µ ) ∗ V ( ω )] j,k = n − X ℓ =0 e µ j ℓ e ω k ℓ = − e n ( µ j + ω k ) − e µ j + ω k , e µ j + ω k = 1; n, e µ j + ω k = 1 . In finite precision arithmetic this formula exhibits catastrophic cancellation when e µ j + ω k ≈ . Fortunately, many standard libraries provide the special function expm1 that evaluates e x − to high relative accuracy. However, even with this specialfunction, there is still a removable discontinuity at e µ j + ω k = 1 . Hence in floating pointwe patch this function using a two-term Taylor series expansion around e µ j + ω k = 1 :(31) [ V ( µ ) ∗ V ( ω )] j,k = expm1 ( n ( µ j + ω k )) expm1 ( µ j + ω k ) , | expm1 ( µ j + ω k ) | > − ; n (1 + ( n − µ j + ω k ) / , | expm1 ( µ j + ω k ) | ≤ − . 
In our numerical experiments, this expression has a relative accuracy of ∼ − whencompared to a 500-digit reference evaluation of (30) using mpmath [23]. Entries of the product V ′ ( µ ) ∗ V ( ω ) are nolonger a geometric sum, but a generalized geometric sum that has a closed formexpression given by Theorem 9 in Appendix B: [ V ( µ ) ∗ V ′ ( ω )] j,k = n − X ℓ =0 ℓe µ j ℓ e ω k ℓ = − ne n ( µ j + ω k ) − e µ j + ω k + e ω k + µ j (1 − e n ( ω k + µ j ) )(1 − e ω k + µ j ) , e ω k + µ j = 1; n ( n − / , e ω k + µ j = 1 . (32)As with the geometric sum formula, this expression also exhibits catastrophic can-cellation but this can no longer be fixed using standard special functions. Instead,we derive a more accurate expression in floating point arithmetic by rearranging theexpression in the first case and using a Taylor series about e µ j + ω k = 1 . Defining δ j,k := µ j + ω k ∈ R × [ − π/ , π/ i (removing periodicity in the imaginary part) thefirst case of (32) can be rearranged to yield(33) − ne nδ j,k − e δ j,k + e δ j,k (1 − e nδ j,k )(1 − e δ j,k ) = 1 − e nδ j,k − e δ j,k (cid:20) e δ j,k − e δ j,k − ne nδ j,k − e nδ j,k (cid:21) . ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING expm1 and theexpression inside the brackets has a rapidly converging Taylor series: e δ − e δ − ne nδ − e nδ = n −
12 + ( n − δ − ( n − δ
720 + ( n − δ − ( n − δ n − δ − n − δ O ( n δ ) . Calling the first seven terms of this expansion the special function expdiff ( n, δ ) , wethen evaluate the product V ( µ ) ∗ V ′ ( ω ) in finite precision arithmetic using(34) [ V ( µ ) ∗ V ′ ( ω )] j,k = ne nδ j,k expm1 ( δ j,k ) − e δ j,k expm1 ( δ j,k n )[ expm1 ( δ j,k )] , | δ j,k | > . /n ; expm1 ( nδ j,k ) expm1 ( δ j,k ) expdiff ( n, δ j,k ) , < | δ j,k | ≤ . /n ; n ( n − / , δ j,k = 0 . In our numerical experiments, this expression has a relative accuracy of ∼ − whencompared to a 500-digit reference evaluation of (33) using mpmath . Finally we need to compute the matrix R ( µ ) suchthat V ( µ ) R ( µ ) − has orthonormal columns inexpensively. One approach would beto simply take the QR-factorization of V ( µ ) , but this is has an O ( n ) -dependent cost.Instead, our approach is to form V ( ω ) ∗ V ( ω ) using (30) and either take its Choleskydecomposition or its eigendecomposition to compute R ( µ ) V ( µ ) ∗ V ( µ ) = R ( µ ) R ( µ ) ∗ , (Cholesky) , V ( µ ) ∗ V ( µ ) = U ( µ ) Λ ( µ ) U ( µ ) ∗ , (eigendecomposition) , R ( µ ) = U ( µ ) Λ ( µ ) / . Although the Cholesky decomposition should be preferred since V ( µ ) ∗ V ( µ ) is pos-itive definite provided each of the { e µ j } j are distinct, in finite precision arithmeticthis product can have small negative eigenvalues. Instead we compute R ( µ ) − usingthe eigendecomposition, truncating the small ( < − ) eigenvalues.
5. A subspace for exponential fitting.
With the results of the previous sec-tion, we can now inexpensively project the model and Jacobian onto the subspace W ( µ ) . However, this leaves one question: how do we choose the interpolation points µ such that the subspace angles between W ( µ ) and the exponential fitting Jacobian(35) J ([ ω , a ]) = (cid:2) V ′ ( ω ) diag( a ) V ( ω ) (cid:3) , [ V ′ ( ω )] j,k = ∂∂ω k [ V ( ω )] j,k = je jω k . are small? Immediately we note that these subspace angles do not depend on a (ifany entry of a was zero, we would instead fit fewer exponentials); hence we define(36) J ( ω ) := Range (cid:2) V ′ ( ω ) V ( ω ) (cid:3) . Structurally, it may seem impossible to have small subspace angles between W ( µ ) and J ( ω ) since W ( µ ) = Range V ( µ ) does not contain any columns from V ′ . However,since columns of V ′ are the derivatives of the columns of V , we can approximate therange of V ′ using finite differences(37) V ( ω ) ≈ V ( ω + δ + ) + V ( ω + δ − )2 V ′ ( ω ) ≈ V ( ω + δ + ) − V ( ω + δ − ) δ + − δ − . JEFFREY M. HOKANSON α α α α − π − π/ π/ π Re ω I m ω x e x Fig. 2 . An illustration of the box partition of the parameter space ω ∈ ( −∞ , × [ − π, π ) i for exponential fitting. For a given exponential with frequency ω denoted by × , we select the boxcontaining it, denoted in red, and the corresponding four interpolation points at the corners, denotedby • . The blue shaded region shows the set of ω where the efficiency at ω using this subspace is atleast . This region includes the entire box and extends outward. The figure on the right showsthe same features under the exponential map, exposing the periodicity of the parameter space. Hence, for an appropriate choice of interpolation points, the subspace angles be-tween W ( µ ) and J ( ω ) are small; for example, the interpolation points δ ± ( ω ) =0 . ω ± max {− .
52 Re ω, . /n } yields a subspace with efficiency for any ω ∈ ( −∞ , × [ − π, π ) i . Here our approach for selecting interpolation points is to di-vide the parameter space for a single exponential with frequency ω ∈ ( −∞ , × [ − π, π ) into a series of boxes as shown in Figure 2. These boxes have been constructed suchthat when the corners of the box containing ω are taking as interpolation points,the efficiency of this subspace is at least and hence the subspace angles between W ( µ ) and J ( ω ) are small. Then, for multiple exponentials, we simply combine thesubspaces generated by this heuristic, justified by Theorem 4 that this combination isa necessary condition for the combined subspace to also have at least efficiency.There are several advantages to this box partition approach. By limiting ourselvesto a finite number of interpolation points, we can frequently reuse the multiplication V ( µ j ) ∗ e y as the subspace updates. Moreover, constructing this subspace is inexpen-sive, allowing frequent updates during optimization. In the remainder of this sectionwe first discuss a practical algorithm for building this box partition for any targetefficiency and then give the coordinates for box partition with efficiency. Our goal in constructing the box partition is toguarantee that the target efficiency η target is obtained for every single exponentialwith frequency ω inside the box. This is an expensive task, so we build our boxesto simplify the verification process. As shown in Figure 2, our box partition consistsof a series of stacks where each stack has twice as many boxes as the one on its left,evenly dividing the imaginary component of the parameter space. Then, as efficiencydepends on δ j,k = µ j + ω k , cf. (30) and (32), simultaneously shifting the imaginaryparts of µ j and ω k does not change δ j,k and hence verifying that one box in the stackobtains the target efficiency for each ω inside establishes the same for the remainingboxes in the stack. With this construction, there is only one set of free parameters:the real coordinates of each box { α ℓ } ℓ ≥ . We choose these α ℓ , starting from α = −∞ ,by making α ℓ as large as possible while still obtaining the target efficiency inside each ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING α ℓ = maximize α>α ℓ − α such that η min ≤ minimize ω ∈ [ α ℓ − ,α ] × [0 , π/ ℓ ] i η ( W ( µ ) , J ( ω )) where µ = (cid:2) α ℓ − α ℓ − + 2 π/ ℓ i α α + 2 π/ ℓ i (cid:3) . (38)This is a challenging nested optimization problem over a continuous set of ω , so weinvoke Theorem 5 to construct an auxiliary grid of exponentials where obtaining aslightly higher efficiency at these discrete points guarantees the target efficiency isreached for any ω inside the box.To construct this auxiliary grid, we specify a series of real parts a j ∈ R andimaginary spacings b j ∈ R + that define the grid points z j,k := a j + ikb j . To specify a j and b j with a grid efficiency of η grid , starting from a = α ℓ we solve the singlevariable finding root problem that yields a j and b j ,(39) η ( J ( a j ) , J ( a + (1 + 1 i ) c )) = η grid , ⇒ a j +1 := a j + c, b j +1 := c, setting b = b . Then, invoking Theorem 5, we have the bound(40) minimize ω ∈ [ α ℓ − ,α ] × [0 , π/ ℓ ] i η ( W ( µ ) , J ( ω )) ≤ minimize j,k ∈ Z + z j,k ∈ [ α ℓ − ,α ] × [0 , π/ ℓ ] i η grid · η ( W ( µ ) , J ( z j,k )) . Substituting this bound in (38) replaces the inner optimization with finding the min-imum over a discrete set, simplifying the problem. 
Further, since the accuracy of theefficiency computation is limited by the grid, we restrict the maximization over α tothe discrete set of grid points a j . Here we provide thecoordinates for a box partition with a target efficiency of constructed using η grid = 0 . in Table 1 for multiple values of n . In practice, we restrict our inter-polation points to the closed left half plane and hence set the first α ℓ greater thanzero to zero. Although this choice of target efficiency was arbitrary, it does make theright-most interpolation points correspond to the n th roots of unity that appear inthe discrete Fourier transform (DFT).Although Table 1 only displays the coordinates for several values of n , two pat-terns emerge that allow us to estimate the box partition for any n . First note thatthe values for α ℓ when n = ∞ match those for n = ∞ for all but the last two, whichare always larger. Hence the values of α ℓ for n = ∞ are an lower bound on those forarbitrary n . The other pattern is that after the first five, the α ℓ for n = ∞ shrinkexponentially with(41) α ℓ ≈ − . · − ℓ ; ℓ ≥ . These two patterns allow us to pick the box partition using α ℓ for n = ∞ , extendingthis sequence using the approximation above for larger values of ℓ .When n is not a power of two, the box partition will no longer have the n throots of unity available as interpolation points. Due to their connection with thediscrete Fourier transform, we prefer to keep n th roots of unity available and thusmodify the construction of the box partition. For an n that is not a power of two,everything remains the same except when ω is in the rightmost stack, Re ω ∈ ( α b ℓ , .In this case we no longer use boxes, but pick the two closest interpolation pointswith Re µ = α b ℓ from the leftward stack and the two closest n th roots of unity where Re µ = 0 . Although we are no longer able to guarantee efficiency for exponentialsin this range, this heuristic still provides a high efficiency subspace in practice.4 JEFFREY M. HOKANSON
Table 1
The real coordinates of the box partition α ℓ as determined by solving (38) with η grid = 0 . .Due to the limitation of this discretization, these values are only accurate to approximately threedigits. ℓ n = 16 n = 256 n = 1024 n = 2 n = ∞ − . · − . · − . · − . · − . · − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − . · − − . · − − . · − − . · − − . · − − . · − − . · −
10 8 . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · − − . · −
20 8 . · − − . · −
6. A projected exponential fitting algorithm.
Equipped with subspace W ( µ ) for which the projected model W ( µ ) ∗ f ([ ω , a ]) and Jacobian W ( µ ) ∗ J ([ ω , a ]) can be inexpensively computed as described in section 4 and combined with the heuris-tic from section 5 to pick interpolation points µ so that the subspace angles between W ( µ ) and the range of the Jacobian J ( ω ) are small, we now construct an algorithmto solve exponential fitting problem using projected nonlinear least squares. Our basicapproach is to solve a sequence of projected nonlinear least squares problems usingLevenberg-Marquardt and infrequently update the subspace during the course of op-timization. In this section we describe several important details for this algorithm.First, in subsection 6.1 we show how variable projection [11, 12] can be used to im-plicitly solve for the amplitudes a revealing optimization problem over frequencies ω alone; then in subsection 6.2 we discuss the details of how to update the subspace; andfinally in subsection 6.3 we show how to obtain initial estimates of the frequencies ω to enable a fair comparison with subspace based methods which do not require initialestimates. The key insight behind variable projection origi-nated in a PhD thesis by Scolnik on the exponential fitting problem [35]. Recog-nizing the optimal linear parameters a are given by the pseudoinverse for a fixed ω , a = V ( ω ) + e y , allows the residual to be stated as a function of ω alone:(42) r ([ ω , a ]) = f ([ ω , a ]) − e y = V ( ω ) a − e y ⇒ (cid:2) V ( ω ) V ( ω ) + − I (cid:3) e y = P ⊥ V ( ω ) e y , where P ⊥ V ( ω ) is the orthogonal projector onto the subspace perpendicular to the rangeof V ( ω ) . This allows us to define an equivalent optimization problem over ω alone(43) minimize ω ∈ C p k b r ( ω ) k , b r ( ω ) := P ⊥ V ( ω ) e y . ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING b r ,(44) [ b J ( ω )] · ,k := − (cid:20) P ⊥ V ( ω ) ∂ V ( ω ) ∂ω k V ( ω ) − + V ( ω ) −∗ (cid:18) ∂ V ( ω ) ∂ω k (cid:19) ∗ P ⊥ V ( ω ) (cid:21) e y , and we can further exploit the structure of the exponential fitting problem to reveala simple expression for this Jacobian. Defining the short form QR-factorization of V ( ω ) , V ( ω ) = Q ( ω ) T ( ω ) , this Jacobian becomes(45) b J ( ω ) = [ I − Q ( ω ) Q ( ω ) ∗ ] V ′ ( ω ) diag( a ) − Q ( ω ) T ( ω ) −∗ diag( V ′ ( ω ) ∗ b r ( ω )) . With the linear parameters removed, the Levenberg-Marquardt method can then beapplied to the variable projection residual b r ( ω ) and Jacobian b J ( ω ) . The same expres-sions also apply to the projected problem upon making the substitutions: V ( ω ) → W ( µ ) ∗ V ( ω ) , V ′ ( ω ) → W ( µ ) ∗ V ′ ( ω ) , and e y → W ( µ ) ∗ e y . As the analysis in section 2 suggests that the sub-space angles between W ( µ ) and J ( ω ) need to remain small, we repeatedly updatethe interpolation points during the course of optimization. Here we use the efficiencybased heuristic described in section 5 to pick interpolation points, as a high efficiencyensures small subspace angles between W ( µ ) and J ( ω ) . However, rather than dis-carding interpolation points no longer required by this heuristic, we preserve them,continually expanding the subspace. This is necessary to prevent the optimizationalgorithm from entering a cycle. A final issue concerns how we provide the initial valuesthe optimization algorithm. 
Subspace based methods do not require these initialvalues and so to provide a fair comparison we use a simple initialization heuristic.It is well known that peaks in the discrete Fourier transform (DFT) of a signal, F ∗ n e y where [ F n ] j,k = n − / e πijk/n , correspond to the frequencies present [38]. Thisforms the foundation of many initialization approaches. For example, in magneticresonance spectroscopy these peaks can be identified manually [42, §3.3] to initializean optimization algorithm. Here, we pick the initial estimate iteratively. Startingwith the first exponential, we set ω = 2 πi b k/n where b k is the largest entry in F ∗ n e y .Then after the optimization algorithm has terminated, we initialize ω based on thelargest entry of the residual F ∗ n b r ( ω ) . This process repeats until the desired number ofexponentials have been recovered. This approach is similar to that of Macleod [27],but we optimize all the frequencies ω at each step.
7. A numerical example.
To demonstrate the effectiveness of projected non-linear least squares for exponential fitting, we apply the algorithm described in sec-tion 6 to a magnetic resonance spectroscopy test problem from [43, Table 1]; see,e.g., [14, §12.4] for a discussion of the underlying physics. This example describes acontinuous complex signal y ( t ) consisting of eleven exponentials: y ( t ) = X k =1 a k e iπ/ e (2 iπf k − d k ) t where a = [ 75 150 75 150 150 150 150 150 1400 60 500 ] f = [ − − −
54 152 168 292 308 360 440 490 530 ] d = [ 50 50 50 50 50 50 50 25 285 . (46)from which we construct measurements e y ∈ C n by sampling y ( t ) uniformly in time andcontaminating these with independent and identically distributed additive Gaussian6 JEFFREY M. HOKANSON p r o j e c t e d N L S f u ll N L S d e n s e H S V D f a s t H S V D V ( µ ) ∗ e y × × O ( n ) O ( n ) − − data dimension n w a ll t i m e ( s ec o nd s ) Wall clock performance comparison
Fig. 3 . The median wall clock time from from ten runs for four different exponential fittingalgorithm implemented in Matlab 2016b, applied to data from (46) , and running on a 2013 Mac Prowith a 3.5 GHz 6-Core Intel Xeon E5 and 16 GB of RAM clocked at 1866 MHz. We also show thetime taken to form V ( µ ) ∗ e y for µ ∈ C which is a lower bound on the time taken by our projectednonlinear least squares algorithm; this is approximately the time required to check the first ordernecessary conditions. noise g with Cov g = I according to the formula(47) [ e y ] k = y ( δ ( n ) k ) + 15[ g ] k , δ ( n ) := 2563 n · − . This allows us to scale the original problem which took n = 256 by increasing thesample rate δ . In this section we consider three algorithms applied to this expo-nential fitting problem: conventional nonlinear least squares, our projected nonlinearleast squares, and HSVD [1] as a representative of subspace based methods due to itssimple implementation. We present our results using two different implementationsof HSVD: an implementation using dense linear algebra and fast implementationusing an O ( n log n ) Hankel matrix-vector product and an iterative SVD algorithmfollowing [26]. Our goal is to compare these algorithms on two metrics: the wallclock time taken to solve the exponential fitting problem and the precision of theresulting parameter estimates. A Matlab implementation of our projected nonlin-ear least squares algorithm for exponential fitting, the two HSVD implementationsdescribed, and code to construct these examples are provided at https://github.com/jeffrey-hokanson/ExpFit . In these implementations we use tight convergencetolerances: − for both residual norm and solution change in Matlab’s nonlinearleast squares solver lsqnonlin and − for the Ritz residual in Matlab’s eigs usedin the fast HSVD implementation. As these three algorithms use different paradigms for solving theexponential fitting problem, we compare their performance using total wall clocktime. Figure 3 shows the time taken by each algorithm when applied to the magneticresonance spectroscopy test problem given in (46). Asymptotically, the time takenby the dense HSVD implementation scales like O ( n ) due to the dense SVD, whereasthe fast HSVD implementation scales like O ( n log n ) due to the use of the fast Fouriertransform (FFT) to compute the Hankel matrix-vector product. Similarly, the timetaken both nonlinear least squares based approaches scales like O ( n log n ) due to their ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING P r o j ec t e d N L S p r o b a b ili t y d e n s i t y n = 256 n = 1024 n = 4096 k Γ / ( e θ − b θ ) k F a s t H S V D p r o b a b ili t y d e n s i t y k Γ / ( e θ − b θ ) k k Γ / ( e θ − b θ ) k Fig. 4 . The density of standardized error in the parameter estimate e θ , Γ / ( e θ − b θ ) as describedin subsection . Thick curves show the expected distribution of the 2-norm of the standardized errorand the filled regions show the empirically determined density from realizations of each methodfor each n . The dotted lines shows the density of the standardized error in the full nonlinear leastsquares parameter estimate. use of the FFT in the initialization heuristic. Although these three algorithms eachhave the same asymptotic rate, their constants are different. In the limit of largedata, the projected nonlinear least squares implementation is fastest, but for smalldata, the repeated initialization of the optimization algorithm dominates the cost. 
Itis possible that a more careful implementation could avoid this cost and bring thewall clock time for projected nonlinear least squares to closer, or perhaps faster than,the fast HSVD implementation in the limit of small data. In addition to providing faster performance than fast HSVD forlarge data, the projected nonlinear least squares approach also yields more precise pa-rameter estimates. Considering the same magnetic resonance spectroscopy example,we seek to quantify the precision of our parameter estimates. In the limit of smallnoise, the error in the parameter estimate e θ = [ e ω , e a ] relative to the true parameters b θ = [ b ω , b a ] is normally distributed with covariance:(48) Γ := J ( b θ ) ∗ E g [ gg ∗ ] J ( b θ ) = J ( b θ ) ∗ J ( b θ ) · E g [ k g k ] ≈ Cov e θ . If Γ is actually the covariance of e θ , then Γ − / ( e θ − b θ ) is normally distributed withzero mean an unit variance. Hence, the norm of the mismatch k Γ − / ( e θ − b θ ) k followsa χ distribution with degrees of freedom. As seen in Figure 4, the distributionof error of the projected nonlinear least squares problem approximately matches thatof the full problem and approaches the desired χ distribution as n becomes large.However, HSVD provides less precise parameter estimates, a result that follows fromthe analysis of Rao [33].
8. Discussion.
In this paper we have shown that by solving a sequence of pro-jected nonlinear least squares problems we can substantially improve the run timeperformance with a negligible loss of accuracy for solving exponential fitting problemwhen compared to both conventional nonlinear least squares and HSVD, a typical8
JEFFREY M. HOKANSON subspace based approach. For the exponential fitting problem, there are still severalopen questions. Is there a better choice for selecting interpolation points than ourbox partition? Are there better subspaces, such as one that includes not only V ( µ ) ,but V ′ ( µ ) as well? More generally: can we provide conditions that guarantee theconvergence of the series of projected problems? Can we bound the error incurred bythe projection in a deterministic sense? Finally, we ask, what were the key featuresthat allowed the projected approach to work for the exponential fitting problem andcould this approach be applied to other problems? One key feature was that projectedmodel W ∗ ℓ f ( θ ) and Jacobian W ∗ ℓ J ( θ ) could be computed inexpensively. The otherkey feature was that we were able to generate subspaces W ℓ such that the subspaceangles between W ℓ and the range of the Jacobian J ( θ ) remained small. These tworequirements limit the applicability of these results to specific pairs of models f ( θ ) andsubspaces W ℓ , but for those problems that satisfy these requirements the projectednonlinear least squares approach presents a way to improve performance. Acknowledgements.
I would like to thank Mark Embree, Caleb Magruder, andPaul Constantine for their help in preparing this manuscript; Christopher Beattie,the Einstein Foundation of Berlin, Volker Mehrmann, and TU Berlin for hosting meduring the writing of this manuscript; and my PhD committee: Steve Cox, ThanosAntoulas, and Matthias Heinkenschloss.
Appendix A. Projected least squares error bounds.
Here we provide twolemmas used in section 2 related to the accuracy of a projected least squares problem.
Lemma Let A ∈ C n × q have full column rank and let b ∈ C n with respectiverange A and span B . Let W be an m -dimensional subspace of C n where m ≥ q andlet P W be the orthogonal projector onto W where P W A has full column rank. If x isthe minimizer of k Ax − b k and y is the minimizer of k P W ( Ay − b ) k then (49) k x − y k ≤ k A + k k b k (cid:2) sin φ q ( W , A ) sin φ ( W , B )+tan φ q ( W , A ) cos φ ( W , B ) (cid:3) . Proof.
Using the pseudoinverse, we write x and y as x = ( A ∗ A ) − A ∗ b , (50) y = ( A ∗ P W A ) − A ∗ P W b . (51)Inserting the decomposition of the identity I = P W + P ⊥W before b in x ,(52) x = ( A ∗ A ) − A ∗ P W b + ( A ∗ A ) − A ∗ P ⊥W b , we then note the difference between x and y is(53) x − y = ( A ∗ A ) − A ∗ P ⊥W b + (cid:2) ( A ∗ A ) − − ( A ∗ P W A ) − (cid:3) A ∗ P W b . Replacing A with its short form SVD, A = UΣV ∗ , and P W with WW ∗ where W is an orthonormal basis for W , we have x − y = VΣ − U ∗ P ⊥W b − VΣ − (cid:2) ( U ∗ WW ∗ U ) − − I (cid:3) Σ − U ∗ WW ∗ b , k x − y k ≤ k Σ − k (cid:16) k P ⊥W U k k P ⊥W b k + k ( U ∗ WW ∗ U ) − − I k k U ∗ W k k W ∗ b k (cid:17) . Then invoking the subspace angle identities: k P ⊥W b k = sin φ ( W , B ) k b k , k W ∗ b k =cos φ ( W , B ) k b k , k U ∗ W k = cos φ ( W , A ) , and k P ⊥W U k = sin φ q ( A , W ) ,(54) k x − y k ≤ k Σ − k k b k (cid:0) sin φ q ( W , A ) sin φ ( W , B )+ k ( U ∗ WW ∗ U ) − − I k cos φ ( W , A ) cos φ ( W , B ) (cid:1) . ROJECTED NONLINEAR LEAST SQUARES FOR EXPONENTIAL FITTING k ( U ∗ WW ∗ U ) − − I k , we note that as U ∗ WW ∗ U is positive semidefinite,there exists an α ≥ such that U ∗ WW ∗ U (cid:23) α I . This implies λ k ( U ∗ WW ∗ U ) − α ≥ , ⇒ σ k ( W ∗ U ) − α ≥ , ∀ k ∈ , . . . , q (55)where λ k ( X ) is the k eigenvalue in descending order of X . The largest α satisfying thisinequality is α = σ q ( W ∗ U ) = cos φ q ( W , A ) . Invoking [17, Cor. 7.7.4], U ∗ WW ∗ U (cid:23) α I implies ( U ∗ WW ∗ U ) − (cid:22) α − I and hence ( U ∗ WW ∗ U ) − − I (cid:22) α − I − I = ( α − − I . (56)Upon taking the norm, we have k ( U ∗ WW ∗ U ) − − I k ≤ ( α − − . (57)Thus α − − φ q ( W , A ) and invoking trigonometric identities, sec φ q ( W , A ) − φ q ( W , A ) ; hence(58) k x − y k ≤ k Σ − k k b k (cid:0) sin φ q ( W , A ) sin φ ( W , B )+ tan φ q ( W , A ) cos φ ( W , A ) cos φ ( W , B ) (cid:1) . By applying the upper bound cos φ ( W , A ) ≤ , and noting k Σ − k = k A + k weobtain the desired bound. Lemma In the same setting as Lemma , k A ∗ Ay − A ∗ b k ≤ cos φ ( A , W ) sin φ q ( A , W )cos φ q ( A , W ) k A k k P ⊥A b k . (59) Proof.
Using the pseudoinverse, A ∗ Ay − A ∗ b = A ∗ A ( A ∗ P W A ) − A ∗ P W b − A ∗ b . (60)Then, inserting the decomposition of the identity I = P A + P ⊥A between P W and b , A ∗ Ay − b = A ∗ A ( A ∗ P W A ) − A ∗ P W ( P A + P ⊥A ) b − A ∗ b . (61)After expanding the first term on the right, the P A component is A ∗ b , A ∗ A ( A ∗ P W A ) − A ∗ P W P A b = A ∗ A ( A ∗ P W A ) − A ∗ P W A ( A ∗ A ) − A ∗ b = A ∗ b , (62)and hence cancels A ∗ b leaving one term: A ∗ Ay − A ∗ b = A ∗ A ( A ∗ P W A ) − A ∗ P W P ⊥A b . (63)Next, we define the oblique projector above X := A ( A ∗ P W A ) − A ∗ P W in terms ofthe SVD of A . If A has a full and reduced SVD(64) A = UΣV ∗ = (cid:2) U U (cid:3) (cid:20) Σ (cid:21) V ∗ = U Σ V ∗ , then this oblique projector is X = UΣV ∗ ( V ∗ Σ ∗ UWW ∗ UΣV ∗ ) − VΣ ∗ U ∗ WW ∗ = U Σ V ∗ V ( Σ U ∗ WW ∗ U Σ ) − V ∗ VΣ ∗ U ∗ WW ∗ = U Σ Σ − ( U ∗ WW ∗ U ) − Σ − Σ ∗ U ∗ WW ∗ = U ( U ∗ WW ∗ U ) − U ∗ WW ∗ . JEFFREY M. HOKANSON
Inserting this result into the expression for $A^* A y - A^* b$, we obtain the bound
$\|A^* A y - A^* b\| = \|A^* U_1 (U_1^* W W^* U_1)^{-1} U_1^* W W^* U_2 U_2^* b\|$
$\le \|A\| \, \|(U_1^* W W^* U_1)^{-1}\| \, \|U_1^* W\| \, \|W^* U_2\| \, \|U_2^* b\|$
$= \sigma_q(U_1^* W)^{-2} \, \sigma_1(U_1^* W) \, \sigma_1(W^* U_2) \, \|A\| \, \|P_{\mathcal{A}}^{\perp} b\|$
$= \dfrac{\cos\phi_1(\mathcal{A},\mathcal{W}) \sin\phi_q(\mathcal{A},\mathcal{W})}{\cos^2\phi_q(\mathcal{A},\mathcal{W})} \, \|A\| \, \|P_{\mathcal{A}}^{\perp} b\|.$

Appendix B. Generalized geometric sum formula.
A critical component of our algorithm is the ability to compute in closed form the generalized geometric sum
(65) $\sum_{k=n_0}^{n_1-1} k^p e^{\delta k},$
where $\delta \in \mathbb{C}$ and $p$, $n_0$, $n_1$ are non-negative integers. The standard geometric sum formula provides a closed-form expression when $p = 0$, and when $e^{\delta} = 1$ this sum can be written in terms of Bernoulli polynomials [29, eq. (24.4.9)]. The following lemma establishes the remaining case, when $e^{\delta} \ne 1$ and $p > 0$.
Lemma. Let $\delta \in \mathbb{C}$ with $e^{\delta} \ne 1$ and $p, n_0, n_1 \in \mathbb{Z}_+$ with $0 \le n_0 \le n_1$; then
(66) $\sum_{k=n_0}^{n_1-1} k^p e^{\delta k} = \sum_{\ell=0}^{p} \frac{\chi_{n_0}(p,\ell)\, e^{\delta(n_0+\ell)} - \chi_{n_1}(p,\ell)\, e^{\delta(n_1+\ell)}}{(1 - e^{\delta})^{\ell+1}},$
where $\chi_n(p,\ell)$ is given by the recurrence
(67) $\chi_n(p+1,\ell) = (n+\ell)\, \chi_n(p,\ell) + \ell\, \chi_n(p,\ell-1), \qquad \chi_n(0,\ell) = \delta_{\ell,0}, \qquad p, \ell \ge 0,$
with the convention $\chi_n(p,\ell) = 0$ for $\ell < 0$.
Proof.
Multiplying each term of the geometric sum by $k^p$ corresponds to taking the $p$th derivative of each term with respect to $\delta$. Since this is a finite sum, we pull the derivative outside the sum, yielding
(68) $\sum_{k=n_0}^{n_1-1} k^p e^{\delta k} = \sum_{k=n_0}^{n_1-1} \frac{\partial^p}{\partial \delta^p} e^{\delta k} = \frac{\partial^p}{\partial \delta^p} \sum_{k=n_0}^{n_1-1} e^{\delta k} = \frac{\partial^p}{\partial \delta^p} \frac{e^{\delta n_0} - e^{\delta n_1}}{1 - e^{\delta}}.$
To obtain an explicit formula for the derivative on the right, we show by induction that
(69) $\frac{\partial^p}{\partial \delta^p} \frac{e^{n\delta}}{1 - e^{\delta}} = \sum_{\ell=0}^{p} \frac{\chi_n(p,\ell)\, e^{(n+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}}.$
The base case $p = 0$ holds as $\chi_n(0,0) = 1$.
The inductive step follows by taking the derivative of each side,
$\frac{\partial}{\partial \delta} \sum_{\ell=0}^{p} \frac{\chi_n(p,\ell)\, e^{(n+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}} = \sum_{\ell=0}^{p+1} \frac{\left[ \chi_n(p,\ell)(n+\ell) + \chi_n(p,\ell-1)\,\ell \right] e^{(n+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}} = \sum_{\ell=0}^{p+1} \frac{\chi_n(p+1,\ell)\, e^{(n+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}}.$
Subtracting (69) evaluated at $n = n_1$ from (69) evaluated at $n = n_0$ yields (66).
Theorem. Given $\delta \in \mathbb{C}$, $p \in \mathbb{Z}_+$, and integers $n_0, n_1 \in \mathbb{Z}$ with $0 \le n_0 < n_1$, then
(70) $\sum_{k=n_0}^{n_1-1} k^p e^{\delta k} = \begin{cases} \displaystyle\sum_{\ell=0}^{p} \frac{\chi_{n_0}(p,\ell)\, e^{(n_0+\ell)\delta} - \chi_{n_1}(p,\ell)\, e^{(n_1+\ell)\delta}}{(1 - e^{\delta})^{\ell+1}}, & e^{\delta} \ne 1; \\[1.5ex] \displaystyle\frac{B_{p+1}(n_1) - B_{p+1}(n_0)}{p+1}, & e^{\delta} = 1; \end{cases}$
where $B_p$ is the $p$th Bernoulli polynomial.
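To illustrate how (67) and (70) can be evaluated, the sketch below compares the closed form against direct summation. It is an illustration only, not part of the original text: the helper names `chi` and `generalized_geometric_sum`, the tolerance used to detect $e^{\delta} = 1$, and the use of sympy's Bernoulli polynomial for that branch are all choices made for this example.

```python
import cmath
from sympy import bernoulli   # bernoulli(n, x) evaluates the Bernoulli polynomial B_n(x)

def chi(p, ell, n):
    """Coefficients chi_n(p, ell) from the recurrence (67)."""
    if ell < 0 or ell > p:
        return 0
    if p == 0:
        return 1 if ell == 0 else 0
    return (n + ell) * chi(p - 1, ell, n) + ell * chi(p - 1, ell - 1, n)

def generalized_geometric_sum(delta, p, n0, n1):
    """Evaluate sum_{k=n0}^{n1-1} k^p e^{delta k} via the closed form (70)."""
    r = cmath.exp(delta)
    if abs(r - 1) > 1e-12:                       # e^delta != 1: use the recurrence (67)
        return sum((chi(p, l, n0) * cmath.exp(delta * (n0 + l))
                    - chi(p, l, n1) * cmath.exp(delta * (n1 + l))) / (1 - r) ** (l + 1)
                   for l in range(p + 1))
    # e^delta = 1: Bernoulli polynomial branch of (70)
    return float(bernoulli(p + 1, n1) - bernoulli(p + 1, n0)) / (p + 1)

for delta, p in [(-0.05 + 0.3j, 3), (0.0, 3)]:
    n0, n1 = 2, 50
    direct = sum(k ** p * cmath.exp(delta * k) for k in range(n0, n1))
    closed = generalized_geometric_sum(delta, p, n0, n1)
    print(f"delta = {delta}, p = {p}: |direct - closed| = {abs(direct - closed):.2e}")
```

For these parameter values both branches should agree with the direct sum up to round-off; in a performance-sensitive implementation one would tabulate the $\chi_n(p,\ell)$ coefficients once rather than recompute them recursively.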
REFERENCES

[1] H. Barkhuijsen, R. de Beer, and D. van Ormondt, Improved algorithm for noniterative time-domain model fitting to exponentially damped magnetic resonance signals, J. Magn. Reson., 73 (1987), pp. 553–557, https://doi.org/10.1016/0022-2364(87)90023-0.
[2] D. P. Bertsekas, Incremental least squares methods and the extended Kalman filter, SIAM J. Optim., 6 (1996), pp. 807–822, https://doi.org/10.1137/S1052623494268522.
[3] D. P. Bertsekas, A new class of incremental gradient methods for least squares problems, SIAM J. Optim., 7 (1997), pp. 913–926, https://doi.org/10.1137/S1052623495287022.
[4] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, PA, 1996.
[5] Å. Björck and G. H. Golub, Numerical methods for computing angles between linear subspaces, Math. Comput., 27 (1973), pp. 579–594, https://doi.org/10.1090/S0025-5718-1973-0348991-3.
[6] R. S. Dembo, S. C. Eisenstat, and T. Steihaug, Inexact Newton methods, SIAM J. Numer. Anal., 19 (1982), pp. 400–408, https://doi.org/10.1137/0719025.
[7] D. J. Ewins, Modal Testing: Theory and Practice, Research Studies Press Ltd., Letchworth, Hertfordshire, England, 1984.
[8] V. V. Fedorov, Theory of Optimal Experiments, Academic Press, New York, 1972.
[9] R. A. Fisher, On the mathematical foundations of theoretical statistics, Philos. Trans. R. Soc. London, Ser. A, 222 (1922), pp. 309–368.
[10] M. P. Friedlander and M. Schmidt, Hybrid deterministic-stochastic methods for data fitting, SIAM J. Sci. Comput., 34 (2012), pp. A1380–A1405, https://doi.org/10.1137/110830629.
[11] G. Golub and V. Pereyra, Separable nonlinear least squares: the variable projection method and its applications, Inverse Prob., 19 (2003), pp. R1–R26, https://doi.org/10.1088/0266-5611/19/2/201.
[12] G. H. Golub and V. Pereyra, The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM J. Numer. Anal., 10 (1973), pp. 413–432, https://doi.org/10.1137/0710036.
[13] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD, third ed., 1996.
[14] P. C. Hansen, V. Pereyra, and G. Scherer, Least Squares Data Fitting with Applications, Johns Hopkins University Press, 2013.
[15] F. B. Hildebrand, Introduction to Numerical Analysis, McGraw-Hill, New York, 1956.
[16] B. L. Ho and R. E. Kalman, Effective construction of linear state-variable models from input/output functions, Regelungstechnik, 14 (1966), pp. 545–592, https://doi.org/10.1524/auto.1966.14.112.545.
[17] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, 1985.
[18] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, Cambridge, 1991.
[19] Y. Hua and T. K. Sarkar, Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise, IEEE Trans. Acoust. Speech Signal Process., 38 (1990), pp. 814–824, https://doi.org/10.1109/29.56027.
[20] S. V. Huffel, H. Chen, C. Decanniere, and P. V. Hecke, Algorithm for time-domain NMR data fitting based on total least squares, J. Magn. Reson., Ser. A, 110 (1994), pp. 228–237, https://doi.org/10.1006/jmra.1994.1209.
[21] A. A. Istratov and O. F. Vyvenko, Exponential analysis in physical phenomena, Rev. Sci. Instrum., 70 (1999), pp. 1233–1257, https://doi.org/10.1063/1.1149581.
[22] S. K. Jain and S. N. Singh, Harmonics estimation in emerging power system: key issues and challenges, Electr. Power Syst. Res., 81 (2011), pp. 1754–1766, https://doi.org/10.1016/j.epsr.2011.05.004.
[23] F. Johansson et al., mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.19), June 2014, http://mpmath.org.
[24] D. W. Kammler and R. J. McGlinn, A bibliography for approximation with exponential sums, J. Comput. Appl. Math., 4 (1978), pp. 167–173, https://doi.org/10.1016/0771-050X(78)90042-6.
[25] C. T. Kelley, Iterative Methods for Optimization, SIAM, 1999.
[26] T. Laudadio, N. Mastronardi, L. Vanhamme, P. V. Hecke, and S. V. Huffel, Improved Lanczos algorithms for blackbox MRS data quantitation, J. Magn. Reson., (2002), pp. 292–297, https://doi.org/10.1006/jmre.2002.2593.
[27] M. D. Macleod, Fast nearly ML estimation of the parameters of real or complex single tones or resolved multiple tones, IEEE Trans. Signal Process., 46 (1998), pp. 141–148, https://doi.org/10.1109/78.651200.
[28] V. B. Melas, Functional Approach to Optimal Experimental Design, Springer, 2006.
[29] National Institute of Standards and Technology, Digital Library of Mathematical Functions, http://dlmf.nist.gov/.
[30] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, New York, 2006.
[31] V. Pereyra and G. Scherer, Exponential Data Fitting and Its Applications, Bentham Science Publishers, 2010.
[32] R. Prony, Essai expérimental et analytique sur les lois de la dilatabilité et sur celles de la force expansive de la vapeur de l'eau et de la vapeur de l'alkool, à différentes températures, J. de l'Ecole Polytechnique, 1 (1795), pp. 24–76. English translation in [40, App. A].
[33] B. D. Rao, Perturbation analysis of an SVD-based linear prediction method for estimating the frequencies of multiple sinusoids, IEEE Trans. Acoust. Speech Signal Process., 36 (1988), pp. 1026–1035, https://doi.org/10.1109/29.1626.
[34] P. J. Schreier and L. L. Scharf, Statistical Signal Processing of Complex-Valued Data: Theory of Improper and Noncircular Signals, Cambridge University Press, Cambridge, 2010.
[35] H. D. Scolnik, On the solution of nonlinear least squares problems, PhD thesis, University of Zurich, 1970.
[36] G. A. F. Seber and C. J. Wild, Nonlinear Regression, Wiley-Interscience, 1989.
[37] S. D. Silvey, Optimal Design, Chapman and Hall, London, 1980.
[38] P. Stoica and R. Moses, Introduction to Spectral Analysis, Prentice Hall, Upper Saddle River, NJ, 1997.
[39] G. Tang, B. N. Bhaskar, P. Shah, and B. Recht, Compressed sensing off the grid, IEEE Trans. Inform. Theory, 59 (2013), pp. 7465–7490, https://doi.org/10.1109/TIT.2013.2277451.
[40] D. Vandevoorde, A fast exponential decomposition algorithm and its applications to structured matrices, PhD thesis, Rensselaer Polytechnic Institute, 1996.
[41] L. Vanhamme, T. Sudin, P. Van Hecke, and S. Van Huffel, MR spectroscopy quantitation: a review of time-domain methods, NMR Biomed., 14 (2001), pp. 233–246, https://doi.org/10.1002/nbm.695.
[42] L. Vanhamme, A. van den Boogaart, and S. V. Huffel, Fast and accurate parameter estimation of noisy complex exponentials with use of prior knowledge, in Proceedings EUSIPCO-96, 1996.
[43] L. Vanhamme, A. van den Boogaart, and S. Van Huffel, Improved method for accurate and efficient quantification of MRS data with use of prior knowledge, J. Magn. Reson., 129 (1997), pp. 35–43, https://doi.org/10.1006/jmre.1997.1244.
[44] S. J. Wright and J. N. Holt, An inexact Levenberg-Marquardt method for large sparse nonlinear least squares, J. Austral. Math. Soc. Ser. B, 26 (1985), pp. 387–403, https://doi.org/10.1017/S0334270000004604.