[PDF] Likelihood Analysis of Power Spectra and Generalized Moment Problems

Abstract

We develop an approach to spectral estimation that has been advocated by Ferrante, Masiero and Pavon and, in the context of the scalar-valued covariance extension problem, by Enqvist and Karlsson. The aim is to determine the power spectrum that is consistent with given moments and minimizes the relative entropy between the probability law of the underlying Gaussian stochastic process to that of a prior. The approach is analogous to the framework of earlier work by Byrnes, Georgiou and Lindquist and can also be viewed as a generalization of the classical work by Burg and Jaynes on the maximum entropy method. In the present paper we present a new fast algorithm in the general case (i.e., for general Gaussian priors) and show that for priors with a specific structure the solution can be given in closed form.

Full PDF

11 Likelihood Analysis of Power Spectraand Generalized Moment Problems

Tryphon T. Georgiou,

Fellow, IEEE and Anders Lindquist,

Life Fellow, IEEE

Abstract

We develop an approach to spectral estimation that has been advocated by Ferrante, Masiero and Pavon [2] and, in the contextof the scalar-valued covariance extension problem, by Enqvist and Karlsson [3]. The aim is to determine the power spectrumthat is consistent with given moments and minimizes the relative entropy between the probability law of the underlying Gaussianstochastic process to that of a prior. The approach is analogous to the framework of earlier work by Byrnes, Georgiou andLindquist and can also be viewed as a generalization of the classical work by Burg and Jaynes on the maximum entropy method.In the present paper we present a new fast algorithm in the general case (i.e., for general Gaussian priors) and show that for priorswith a speciﬁc structure the solution can be given in closed form.

I. I

NTRODUCTION

Consider a stationary, vector-valued, discrete-time, zero-mean, Gaussian stochastic process { y ( t ) | t ∈ Z } , where y ( t ) ∈ R m ,and Z , R are the sets of integers and reals, respectively. We denote the corresponding probability law (on sample paths ofthe process) by P [1, Chapter 1] and the power spectral density, which we assume exists, by Φ( e iθ ) , θ ∈ [0 , π ) . Further, weassume that the stochastic process is nondeterministic in that the entropy rate is ﬁnite, (cid:90) π − π log det Φ( e iθ ) dθ < ∞ . We study the basic problem to estimate Φ from sample-statistics of { y ( t ) } . Following [2], we view this problem in a large-deviations framework where a prior law Q is available, and where this law corresponds to a power spectral density Ψ withﬁnite entropy rate. We postulate that available sample-statistics of the process are not consistent with the prior law Q , andtherefore we seek the law P that is consistent with the available statistics and is the closest such law to the prior in the sense oflarge deviations, that is, P is such that the Kullback-Leibler (KL) divergence between P and Q is minimal. The KL divergencebetween the two laws is precisely the Itakura-Saito distance between the corresponding power spectra, which was consideredin [3] for the special case of covariance extension for scalar-valued time series.The theme of the approach, namely, to obtain power spectra that are consistent with given statistics and optimal with respectto a suitable criterion, is a standard recurring theme in works going back to Burg [4]. Statistics amount to (generalized) momentconstraints and, in the past thirty or so years, a rich theory emerged that made contact with analytic function theory and theclassical moment problem, see [5]–[31] and the references therein. A detailed and rigorous exposition of related topics andideas in Signal Processing is given in [36].Initially, following Burg, early researchers focused on the entropy rate as such a suitable functional. The entropy rate relatesto the variance of one-step-ahead linear prediction and the problem reduces to solving a linear set of equations–the normalequations [37]. In the context of autoregressive modeling these are solved by the Levinson algorithm. To a large degree,research in the 80’s was driven by problems in antenna arrays and, in particular, by interest in ﬁnding “directions of arrival”[38], [39], [40], [41], [42] and the analysis of multidimensional processes [8]. Two related themes emerged. First, powerspectra were sought that consist of “spectral lines” (Pisarenko decomposition ) and then, estimates of power over narrowfrequency range was considered as a surrogate for power spectral density (Capon method ). Subsequent developments viewedspectral estimation as an inverse problem to achieve consistency with estimated statistics. Initial motivation was providedby a question of R.E. Kalman to identify spectra of low complexity [47]. Early results were obtained using topological andhomotopy methods and the complete parametrization of solutions with generic minimal degree was formulated in steps in [6],[7] and [10]. Subsequently, it was discovered that optimizers of weighted entropy-like functionals (KL-divergence betweenpower spectra as well as various types of distance to priors [48]) had a particularly nice structure; they were rational and hadsmall dimension [13]–[31], [49], [50], [51]. In fact, it turned out that suitably speciﬁed weighted entropy functionals contained Supported in part by the NSF under Grant ECCS-1509387, the AFOSR under Grants FA9550-12-1-0319 and FA9550-15-1-0045, the Vincentine Hermes-LuhChair, the Swedish Research Council (VR) and the Foundation for Strategic Research (SSF).T.T. Georgiou is with the Department of Electrical & Computer Engineering, University of Minnesota, Minneapolis, Minnesota, email: [email protected] ;and A. Lindquist is with the Department of Automation and the School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China, and withthe Center for Industrial and Applied Mathematics and the Department of Mathematics, KTH Royal Institute of Technology, Stockholm, Sweden, email: [email protected] [36, Chapter 4.5]; see [43], [44], [45] for multivariable generalizations. [36, Chapter 5.4]; see [46] for points of contact to the problem of moments. a r X i v : . [ c s . S Y ] M a y the precise degrees of freedom that were needed to efﬁciently parametrize and construct these generic minimal degree solutions[13], [15], [16], [17], [25]. The mathematical underpinnings of this latter theory were largely based on optimization and duality.The present work is similar in spirit and technique but differs substantially in the choice of functional and interpretation.Following [2], [3], we consider the KL-divergence between Gaussian probability laws of stochastic processes or, equivalently,the Itakura-Saito distance between their power spectra. The interpretation as well as the structure of optimizers have subtledifferences from earlier constructions. For one thing, the use of the KL-divergence in this way has a very natural and appealinginterpretation: the sought power spectra represent the most likely statistical signature on the path space of a time series that isin agreement with the estimated sample statistics (see Section II-A). The structure of solutions retains many of the attractivefeatures of earlier works. In particular, it ensures reasonably good bounds on the dimensionality of modeling ﬁlters (see Remark3).Below, we begin in Section II by discussing in some detail the motivation for choosing the particular functional to guideidentifying suitable power spectra that reproduce sample statistics. We then explain how sample statistics impose momentconstraints on sought power spectra. Finally we expand on a versatile structure for underlying dynamics (input-to-state ﬁlters)that allows an effective solution of the moment problems that arise in spectral analysis. In Section III we present a geometricframework for input-to-state ﬁlters that provides basic tools for building a fast algorithm to solve the basic estimation problem.Section IV gives the problem formulation and presents the main results. The fast algorithm is presented. For certain priors,we provide closed-form solutions, which correspond to autoregressive models in the case of trigonometric moment problemsand all-pole priors. The results are presented for multivariable time series and the moment problems for the correspondingmatricial power spectra. Section V provides a simple example and connections with earlier literature. Proofs of the main resultsare given in Sections VI-VIII. In particular, Section VII is devoted to deriving the fast algorithm and Section VIII to derivingthe closed-form solution, respectively. In the concluding Section IX we provide some ﬁnal thoughts.II. P

RELIMINARIES

A. Likelihood framework

The rationale for the framework adopted herein has been used to justify maximum likelihood methods [53], [54], [55] andcomplements the original reasoning by E.T. Jaynes [56], [57], [58]. It can be presented as follows. If sample paths of a timeseries are drawn out of the given prior Q , they have a small probability of giving rise to sample statistics that are not consistentwith Q . If that were to happen, and thereby the sample paths represent a rare event, i.e., a departure from what is expected,one is motivated to seek out of the many possible sample-path distributions that are consistent with the observed statistics theone that is most likely. It is known that, asymptotically, the probability of rare events that suggest an (empirical) distribution P depends exponentially on the KL divergence between the prior Q and P .The KL divergence between two laws P and Q is D ( P(cid:107)Q ) = lim N →∞ N + 1 D ( P| [ − N,N ] (cid:107)Q| [ − N,N ] ) , (1)where P| [ − N,N ] denotes the restriction of P to the subset of random variables { y ( − N ) , . . . , y ( − , y (0) , y (1) , . . . , y ( N ) } and similarly for Q| [ − N,N ] . In turn, the KL divergence between the ﬁnite-dimensional probability densities p ( y ( − N ) , . . . , y ( N )) and q ( y ( − N ) , . . . , y ( N )) , corresponding to P| [ − N,N ] and Q| [ − N,N ] , is (cid:90) R N +1 p log( q/p ) dy ( − N ) · · · dy ( N ) . Provided both laws represent purely nondeterministic processes, as is assumed herein, the limit in (1) exists. Using Szeg¨o-Wiener-Masani’s formula (see e.g., [59, Lemma 5.1], [60, formula (E.12)], [61, Theorem 11.3.5]), D ( P(cid:107)Q ) can be expressedin terms of the corresponding power spectral densities as follows D ( P(cid:107)Q ) = 14 π (cid:90) π − π tr (cid:0) ΦΨ − − log ΦΨ − − I (cid:1) dθ =: D (Φ (cid:107) Ψ) , (2)where tr( · ) denotes trace. Since P is completely speciﬁed by Φ we only need to determine Φ , based of course on empiricalstatistics. Thus, we are interested in determining a power spectral density Φ that is consistent with given statistics and minimizes D (Φ (cid:107) Ψ) for a given power spectrum Ψ . The precise formulation of the problem requires expressing statistics in terms of powerspectra which is done next. The problem is stated precisely in Section IV. B. Filter banks and statistics

Time-series represent samples of a stochastic process, and available statistics consist of sample covariances. We now explainthe setting and nature of the covariance data.In time-series analysis as well as in antenna array processing it is customary to assume that recorded data is scaled bya frequency-dependent vector/matrix-valued gain G ( e iθ ) where the frequency θ corresponds to time, space, angle, or even avector-valued combination, see e.g., [36, Chapter 6], [62], [63], [24, Section II-B]. For instance, a window of observations { y ( k ) , y ( k − , . . . , y ( k − n ) } of a time series can be thought as the vectorial output of a “tapped delay line” represented bythe vector-valued Fourier vector (cid:2) e − iθ . . . e − inθ (cid:3) (cid:48) , i.e., the Fourier vector is the transfer function of the tapped delayline. Likewise, in the array processing literature, a model of an equispaced array of n + 1 omnidirectional sensors registeringsignals that are emitted from afar is again the same Fourier vector [36, Section 6]. Such a vector-valued gain G , for generalarrays, is often referred to as the array manifold and can be thought as a bank of ﬁlters that capture the relative dependence ofthe sensor outputs to signals from afar (see Fig. 1). Often, for a large equispaced array of sensors, a smaller output is selectedthat corresponds to G being a linear combination of Fourier components (beamspace techniques) G ( e iθ ) = M  e − iθ ... e − i ( n − θ  , (3)for a suitable matrix M . Other times, processing of time series or sensor-array data involves a suitably designed bank of ﬁlters G k ( e iθ ) , k = 1 , , . . . , n , G G G n (cid:45)(cid:45)(cid:45) (cid:45)(cid:45)(cid:45) (cid:96)(cid:96)(cid:96) x x x n y Figure 1: Bank of ﬁltersin which case G ( e iθ ) = (cid:2) G ( e iθ ) G ( e iθ ) . . . G n ( e iθ ) (cid:3) (cid:48) with { y ( t ) } the common input and general dynamics, see e.g. [64], [18]. The ﬁlters may also encapsulate attenuation from thecoordinate θ of “sources” generating { y ( t ) } to the respective outputs of sensor array (cf. [24, Section II]). In all these cases,it is natural to estimate covariance of the vectorial time series x ( t ) = (cid:2) x ( t ) x ( t ) . . . x n ( t ) (cid:3) (cid:48) . This is typically the form of available statistics that we consider henceforth.We assume that G is a square-integrable, stable n × m transfer function. Then, the n -dimensional output process { x ( t ) | t ∈ Z } assumes a representation as a stochastic integral x ( t ) = (cid:90) π − π e − itθ G ( e iθ ) W ( e iθ ) d ˆ w ( θ ) (cid:124) (cid:123)(cid:122) (cid:125) d ˆ y ( θ ) , where ˆ w is a Wiener process such that E { d ˆ wd ˆ w ∗ } = Idθ/ π . Here, I is the identity matrix, E { } the expectation operator,and W is a (minimum-phase) spectral factor of Φ , i.e., W ( e iθ ) W ( e iθ ) ∗ = Φ( e iθ ) , and therefore d ˆ y is the stochastic Fouriertransform of y ; see e.g. [61, Chapter 3]. It follows that the covariance of the (zero-mean) vectorial output x ( t ) is Σ := E { x ( t ) x ( t ) } = (cid:90) G Φ G ∗ (4)where, for economy of notation, we have suppressed the limits of integration and the normalized Lebesgue measure dθ/ π ,i.e., (cid:82) denotes (cid:82) π − π dθ π . The value Σ represents a matricial moment constraint on Φ . The problem that we consider below is,given G and Σ , to determine suitable Φ satisfying (4). C. Input-to-state ﬁlters

A special case of a ﬁlter bank of great interest is when this represents an input-to-state (stable) linear system x ( t + 1) = Ax ( t ) + By ( t ) , t ∈ Z , (5)where A ∈ R n × n and B ∈ R n × m . In that case, the transfer function of the ﬁlter bank is G ( z ) = z ( zI − A ) − B. (6)Throughout we assume that all the eigenvalues of A are located in the open unit disc. Then G ( z ) = ( I − z − A ) − B = B + ABz − + A Bz − + A Bz − + . . . for all z such that | z | > . Throughout, to insure that the complete state space is being reached and to avoid trivialities weassume that ( A, B ) is a reachable pair, i.e., rank [ B, AB, · · · , A n − B ] = n, and that B is full column rank. The use of such ﬁlter banks is the basis of a tunable method of spectral analysis that wasintroduced in [18] and is referred to as THREE.The input-to-state structure in (6) encompasses Fourier vectors where G k ( z ) := z − ( k − , k = 1 , , . . . , n , in which case A =  · · · · · · · · · ... ... . . . ... ... · · ·  , B =  ...  (7)and the state covariance is Toeplitz, i.e., Σ := E { x ( t ) x ( t ) (cid:48) } = (cid:2) c k − (cid:96) (cid:3) nk,(cid:96) =1 (8)where c k := E { y ( t + k ) y ( t ) } . Identifying a power spectral density Φ that is consistent with Σ and the process model isprecisely the problem that underlies subspace identiﬁcation [61] and coincides with the classical “covariance extension” or“trigonometric moment” problem.On the other hand, ﬁrst-order ﬁlters G k ( z ) := zz − p k , k = 1 , , . . . , n , (with p k (cid:54) = p (cid:96) for k (cid:54) = (cid:96) ) lead to A =  p p . . . p n  , B =  ...  (9)and a state covariance matrix Σ that has the structure of a Pick matrix; see [19].Finally, it is also seen that (6) is of the form (3) where M = [ B, AB, . . . ] . This matrix is ﬁnite when A is nilpotentcorresponding to “moving average” dynamics.III. G EOMETRY OF INPUT - TO - STATE FILTERS

The (rational) input-to-state structure of G ( z ) in (6) imposes structural algebraic constraints on the covariance of x ( t ) . Inaddition to positive deﬁniteness, Σ is completely characterized by belonging to the range of the integral operator Γ : Φ (cid:55)→

Σ = (cid:90) G Φ G ∗ . (10)This is a linear operator that takes m × m integrable matrix-valued functions Φ on the unit circle to symmetric matrices Σ .The range of Γ admits an algebraic characterization. In fact, it is shown in [65] that a symmetric n × n matrix Σ belongsto range(Γ) if and only if Σ − A Σ A (cid:48) = BH + H (cid:48) B (cid:48) (11)for some m × n matrix H . Equivalently, rank (cid:20) Σ − A Σ A (cid:48) BB (cid:48) (cid:21) = rank (cid:20) BB (cid:48) (cid:21) , (12) where denotes a zero-matrix of appropriate size, is necessary and sufﬁcient for solvability of (11). Moreover, there is acoercive, continuous spectral density Φ satisfying the generalized moment condition (4) if and only if Σ is positive deﬁnite and satisﬁes (11) or, the equivalent condition (12).The adjoint operator Γ ∗ maps symmetric matrices into m × m integrable Hermitian matrix-valued functions on the unitcircle, namely Γ ∗ : Λ (cid:55)→ G ∗ Λ G. The inner product in these two spaces, symmetric matrices and integrable Hermitian matrix-valued functions on the unit circle,relate as (cid:104) Λ , Σ (cid:105) := tr(ΛΣ)= tr (cid:90) G ∗ Λ G Φ=: (cid:104) G ∗ Λ G, Φ (cid:105) . We also consider the operator

Θ : H (cid:55)→ ∆ = BH + H (cid:48) B (cid:48) which maps R m × n to symmetric n × n matrices and its adjoint Θ ∗ : ∆ (cid:55)→ B (cid:48) ∆ . (13)We are interested in non-redundant representations of range(Γ) and range(Γ ∗ ) by identifying the minimal degrees of freedomin suitable matrix representations. The ﬁrst proposition deals with range(Γ) . Proposition 1: The map Σ (cid:55)→ H = ( B (cid:48) B ) − [ B (cid:48) (Σ − A Σ A (cid:48) ) − Y B (cid:48) ] (14a) where Y is the symmetric solution of the Lyapunov equation ( B (cid:48) B ) Y + Y ( B (cid:48) B ) = B (cid:48) (Σ − A Σ A (cid:48) ) B (14b) establishes a bijective correspondence between Σ ∈ range(Γ) and H ∈ range(Θ ∗ ) .Proof: Set ∆ := Σ − A Σ A (cid:48) . Since we have Σ ∈ range(Γ) , ∆ = BH + H (cid:48) B (cid:48) (15)can be solved for H ∈ R m × n and ∆ = Θ( H ) . We seek a particular solution of minimal Frobenius norm (cid:107) H (cid:107) F := (cid:112) tr( HH (cid:48) ) . Then, this solution will be in range(Θ ∗ ) , a fact that will be veriﬁed below. The Lagrangian of the problem is tr( HH (cid:48) ) + 2 tr(Λ BH ) − tr(Λ∆) where Λ = Λ (cid:48) is the symmetric matrix-valued Lagrange multiplier. It follows that the unique optimal solution is of the form H = B (cid:48) Λ , (16)and therefore H ∈ range(Θ ∗ ) in view (13). Then, HB = B (cid:48) Λ B =: Y is symmetric. Further, it satisﬁes the Lyapunov equation ( B (cid:48) B ) Y + Y ( B (cid:48) B ) = B (cid:48) ∆ B. (17)as can be seen by pre-multiplying (15) by B (cid:48) and post-multiplying by B . Since B has full column rank by assumption, theeigenvalues of B (cid:48) B are positive and (17) has a unique solution Y . By premultiplying (15) by B (cid:48) we can now solve for H = ( B (cid:48) B ) − ( B (cid:48) ∆ − Y B (cid:48) ) . (18)Finally, suppose that (11) has two solutions H and H in range(Θ ∗ ) . Then H − H ∈ ker Θ = (range(Θ ∗ )) ⊥ , and hence H = H , proving uniqueness. The case where Σ is only nonnegative deﬁnite is discussed fully in [44]. In that case the spectral content may correspond to a singular spectral measure. The essence is that (11) has many solutions in general when m > . In that case, Θ has a non-trivial null space, andProposition 1 provides the solution to (11) of minimal Frobenius norm.The next proposition deals with range(Γ ∗ ) . Since the orthogonal complement of the range of Γ is the null space of Γ ∗ ,elements in range(Γ ∗ ) can always be written in the form G ∗ Λ G where Λ ∈ range(Γ) . Proposition 2: The map G ∗ Λ G (cid:55)→ X = M B, (19a) where M is the unique solution of the Lyapunov equation M = A (cid:48) M A + Λ (19b) establishes a bijective correspondence between G ∗ Λ G ∈ range(Γ ∗ ) and X ∈ range(Θ ∗ ) .Proof: We ﬁrst note that the dimension of range(Γ ∗ ) , which coincides with the dimension of range(Γ) , is equal to thedimension of range(Θ ∗ ) by Proposition 1. Since M is symmetric, it also follows that X (cid:48) = B (cid:48) M ∈ range(Θ ∗ ) . Thus, inorder to establish that the correspondence G ∗ Λ G (cid:55)→ X (cid:48) is a bijection, it sufﬁces to prove that X (cid:48) = 0 only when Λ = 0 . Tosee this note that, since AG ( z ) = z (cid:0) G ( z ) − B (cid:1) , (11) yields G ∗ Λ G = G ∗ M G − G ∗ A (cid:48) M AG = G ∗ M G − [ G − B ] ∗ M [ G − B ]= G ∗ X + X (cid:48) G (20)with G given G ( z ) := G ( z ) − B = B + A ( zI − A ) − B = B + ABz − + A Bz − + A Bz − + . . . . (21)But, since Λ ∈ range(Γ) , G ∗ Λ G = 0 only when Λ = 0 . Thus, X = 0 implies that Λ = 0 and this completes the proof.IV. M

AIN RESULTS

We are now in a position to formulate the main problem that we consider. As noted earlier this problem was ﬁrst formulatedand studied in [2].

Problem 1:

Given an m × m matrix-valued power spectral density Ψ , and given the parameters A, B of the input-to-stateﬁlter (ﬁlter bank) in (5) and the covariance Σ of the state process x ( t ) , determine ˆΦ ∈ argmin { D (Φ (cid:107) Ψ) | such that (4) holds } . Next we discuss the structure of solutions. The expressions we give provide and alternative to those in [2] and require fewervariables in general. This non-redundant structure of solutions is analogous to the reduction in the number of variables enablingthe fast algorithms for Kalman ﬁltering in [66].As in our previous work on the moment problem, e.g., [15], [17], [18], solving Problem 1 reduces to convex optimization.With G given by (21), the optimization criterion is the strictly convex functional J ( X ) = tr (cid:26) ( HX + X (cid:48) H (cid:48) ) − (cid:90) log (cid:0) Ψ − + G ∗ X + X (cid:48) G (cid:1)(cid:27) , (22)deﬁned on the open set X + of matrices X ∈ R n × m such that X (cid:48) ∈ range(Θ ∗ ) , i.e., B (cid:48) X is symmetric, and Q ( z ) := Ψ( z ) − + G ( z ) ∗ X + X (cid:48) G ( z ) (23)is positive deﬁnite at each point z = e iθ on the unit circle. Theorem 3: Let Σ be a symmetric, positive deﬁnite n × n matrix in the range of Γ , and let H be given by (14) . Suppose thatthe prior spectral density Ψ is coercive on the unit circle and that the components of θ (cid:55)→ Ψ( e iθ ) − are Lipschitz continuous.Then Problem 1 has the unique solution ˆΦ = ˆ Q − , (24a) where ˆ Q = Ψ − + G ∗ ˆ X + ˆ X (cid:48) G (24b) for some ˆ X ∈ X + . The matrix ˆ X is the unique minimizer of the functional J ( X ) , and it is also the unique solution of thestationarity condition (cid:90) (cid:0) Ψ − + G ∗ X + X (cid:48) G (cid:1) − G ∗ = H. (25)The solution in the theorem can be obtained numerically by a Newton method to compute the minimizer of J . To this end,we compute the gradient ∂ J ∂X = H − (cid:90) Q − G ∗ , (26)where Q is given by (23), and the Hessian H ( X ) = (cid:90) G Q − G ∗ > . (27)The positivity of the Hessian indeed shows that the functional J is strictly convex. A possible starting point is X = 0 .In the next theorem we consider the special case where the prior Ψ has the form Ψ = ( G ∗ Λ G ) − . Then the solution toProblem 1 can be given in closed form. Theorem 4: Let Σ be a positive deﬁnite n × n matrix in the range of Γ , and suppose that the prior Ψ is given by Ψ = ( G ∗ Λ G ) − . (28) Then Problem 1 has the unique solution ˆΦ = (cid:0) G ∗ ˆΛ G (cid:1) − , (29) where ˆΛ := Σ − B ( B (cid:48) Σ − B ) − B (cid:48) Σ − and does not depend on Λ . The proofs of Theorems 3 and 4 are given in Sections VII and VIII, respectively.

Remark 1:

It is interesting to point out that the solution ˆΦ to Problem 1 shares the same zeros as the prior Ψ . To seethis note that at any value on the complex plane where Ψ becomes singular, ˆ Q becomes inﬁnity along suitable direction, andtherefore ˆΦ becomes singular as well. This property is also present in solutions to moment problems that minimize alternativeentropy functionals and has been explored in our earlier work. It is quite instructive to consider Problem 1 in the scalar case( m = 1 ). Then the optimal solution takes the form ˆΦ( e iθ ) = Ψ( e iθ )1 + 2Ψ( e iθ ) Re { G ( e iθ ) ∗ ˆ X } . (30)Any zeros of the prior Ψ will therefore be zeros also of ˆΦ . However, in the special case of rational Ψ , the dimension ofmodeling ﬁlters corresponding to ˆΦ is enlarged as compared to alternative formulations in our earlier works, e.g., [17], [18]. Remark 2:

Since the closed-form solution (29) does not depend on Λ , we may in particular choose Λ = I . Then, in theimportant case when G k ( z ) := z − ( k − , k = 1 , , . . . , n , we have Ψ = I , leading to an autoregressive (maximum-entropy)model. Remark 3:

Going back to [47] the original motivation was to identify and characterize solutions to moment problems havinglow degree. It is instructive to consider the scalar trigonometric moment problem with data (7) and (8), that is, the problem tomatch the n covariance samples { c , c , . . . , c n − } with a rational power spectrum Φ , in the sense that c k = (cid:90) e ikθ Φ for k = 0 , , . . . , n − holds, or, equivalently, (4) holds for the n × n covariance matrix Σ . There is a generic set of covariance samples (i.e., a setwith an open interior) for which the minimal degree solution has spectral factors of degree n − [6]. (For a more general resultof this type; see [12, Theorem 2.2].) The family of all power spectra with the same dimensionality can be parametrized by aset of arbitrarily selected n − spectral zeros (i.e., zeros of the corresponding minumum-phase spectral factor) – existence ofpower spectra corresponding to each such choice was shown in [6], [7] and uniqueness was shown in [10]. Likewise, in thecase of m -vector valued time series where an n × n covariance matrix Σ is available, the family of generically minimal degreesolutions has spectral factors of degree n − m , parametrized accordingly for a choice of spectral zeros [52, Section IV andCorollary 2]). On the other hand, a direct approach of constructing solutions based on the THREE framework gives a familyof solutions with spectral factors of degree n [24, Section IV-B] (instead of the generic minimum n − m in [52]) likewise parametrized by a suitable choice of spectral zeros. The current framework allows constructing solutions (29) with spectralfactors of degree n only when the zero-structure is trivial (i.e., identical to the eigenvalues of the matrix A ), while in generalthe best bound one can provide from (24) for the dimension of spectral factors is n + × (degree of Ψ ); cf. [2, Section IV].V. A SIMPLE EXAMPLE

Consider the case where Σ is a Toeplitz matrix as given in (8) with covariance lags c k := E { y ( t + k ) y ( t ) } of a scalarstationary process y . Then G is given by (3) with M = I . Moreover, A and B are given by (7), and hence B (cid:48) (Σ − A Σ A ) =( c , c , . . . , c n − ) and B (cid:48) B = 1 . Consequently it follows from (17) that Y = c and from (18) that H = ( c , c , . . . , c n − ) .Then, setting X (cid:48) := ( q , q , . . . , q n − ) , we have HX + X (cid:48) H (cid:48) = (cid:104) c, q (cid:105) := n − (cid:88) k = − ( n − c k q k and Q ( e iθ ) = Ψ( e iθ ) − + n − (cid:88) k = − ( n − q k e ikθ . (31a)Problem 1 then amounts to minimizing J ( q ) = (cid:104) c, q (cid:105) − (cid:90) log Q (31b)over all q := ( q , q , . . . , q n − ) such that Q ( e iθ ) > for all θ . In this notation the stationarity condition (25) becomes (cid:90) π − π e ikθ Q − dθ π = c k , k = 0 , , . . . , n − . (31c) Remark 4:

It is interesting to compare the functional J ( q ) and the form of solution above to those in the framework of,e.g., [15], [17], [18]. There, the corresponding functional is J ( q ) = (cid:104) c, q (cid:105) − (cid:90) Ψ log Q instead of (31b), with Q ( e iθ ) = n − (cid:88) k = − ( n − q k e ikθ and moment conditions (cid:90) π − π e ikθ Ψ Q dθ π = c k , k = 0 , , . . . , n − , instead of (31). We see that the present framework is analogous to the maximum-entropy solution in these earlier works exceptfor the absence of Ψ( e iθ ) − in the corresponding expression for Q which is traded off with the direct presence of Ψ infunctional and the stationarity conditions. The optimal solution is ˆΦ( e iθ ) = Ψ( e iθ ) (cid:80) q k e ikθ in this case instead of ˆΦ( e iθ ) = Ψ( e iθ )1 + Ψ( e iθ ) (cid:80) q k e ikθ in our present framework.We proceed with our numerical example. To this end, we select Φ( z ) = | ( z + 1)( z − + 1) | (32) = z − + 6 z − + 15 z − + 20 + 15 z + 6 z + z that corresponds to a moving average ﬁlter with transfer function W ( z ) = 1+3 z − +3 z − + z − . In Fig. 2 we ﬁrst compare thetrue power spectral density Φ in (32), evaluated at z = e iθ for θ ∈ [0 , π ] , with a prior Ψ = 10(1 + 0 . θ )(1 + 0 . θ )) ) that is selected to have a low pass characteristic. We seek to match moments, namely, c = (20 , , , , , , , . Next,in Fig. 3, we compare Φ with the optimal solution ˆΦ to Problem 1 for the given Ψ . Finally, in Fig. 4, we compare Φ withthe solution corresponding to the choice Ψ = 1 . The power spectral density obtained in this way, using either the Newton algorithm based on Theorem 3 or the closed-form expression in Theorem 4, is an all-pole power spectrum that agrees withthe given moments.

Fig. 2. True spectrum (solid line) vs. prior (dashed)Fig. 3. True spectrum (solid line) vs. optimal (dashed)Fig. 4. True spectrum (solid line) vs. ME spectrum (dashed) It is interesting to observe the oscillatory character of the all-pole power spectral density, which, of course, coincides withthe spectrum obtained following the classical maximum entropy (ME) method [67]. In contrast, the use of a prior whichcorresponds to a low pass ﬁlter aliviates the oscillations (Fig. 2).VI. D

UAL PROBLEM AND THE FORM OF THE MINIMIZER

Suppose that Σ belongs to the range of the operator Γ , deﬁned by (10). Then, Problem 1 amounts to minimizing D (Φ (cid:107) Ψ) = 12 (cid:90) tr (cid:0) ΦΨ − − log Φ + log Ψ − I (cid:1) (33a)over all spectral densities Φ satisfying the moment condition Σ = (cid:90) G Φ G ∗ . (33b)Proceeding along the lines of [19], it was shown in [2] that the dual of (33) is the problem to minimize J (Λ) = tr (cid:26) ΛΣ − (cid:90) log Q (cid:27) (34a)over all real, symmetric n × n matrices Λ in the range of Γ such that Q ( z ) := Ψ( z ) − + G ( z ) ∗ Λ G ( z ) (34b)is positive on the unit circle. For the convenience of the reader, we also review some steps in the proof in our present notation.We denote the class of feasible Λ by L + , i.e., L + = { Λ ∈ range (Γ) | Λ (cid:48) = Λ; Q ( e iθ ) > , ∀ θ } . We note in passing that the rationality of G is not needed at this point; in fact, an interesting example with G ( e iθ ) =[1 , e iθ , e i √ θ ] (cid:48) is motivated in the context of sensor array processing in [24].The Lagrangian for the problem above becomes L (Φ , Λ) = D (Φ (cid:107) Ψ) + tr (cid:26) Λ (cid:18)(cid:90) G Φ G ∗ − Σ (cid:19)(cid:27) , = − tr(ΛΣ) + (cid:90) tr (cid:8) Φ(Ψ − + G ∗ Λ G ) − log Φ + log Ψ − I } , where Λ is a symmetric n × n matrix of Lagrange multipliers. Since tr (cid:8) Λ (cid:0)(cid:82) G Φ G ∗ − Σ (cid:1)(cid:9) is simply the inner product of Λ with elements in the range of Γ , we can restrict Λ to the same space and therefore assume that Λ ∈ range (Γ) . The function Φ (cid:55)→ L (Φ , Λ) is strictly convex for each Λ such that Q , deﬁned by (34b), is positive semideﬁnite on the unitcircle. If Q fails to be positive semideﬁnite, L (Φ , Λ) can be made arbitrarily small for some Φ , and hence such a Λ is not acandidate in the dual problem. Hence we may restrict Λ to the class L + . Setting the directional derivative δL (Φ , Λ; δ Φ) = (cid:90) tr (cid:8) (Ψ − + G ∗ Λ G − Φ − ) δ Φ (cid:9) equal to zero, we obtain Φ = (Ψ − + G ∗ Λ G ) − , (35)which inserted into the Lagrangian yields the dual functional ϕ (Λ) = − tr(ΛΣ)+ (cid:90) tr (cid:8) log(Ψ − + G ∗ Λ G ) + log Ψ (cid:9) = − J (Λ) + (cid:90) tr log Ψ . Since this dual functional should be maximized, the dual problem is equivalent to minimizing J over all Λ ∈ L + . It wasshown in [2] that this problem has a unique solution. This problem differs from the one in [19] in that the prior Ψ in [19] doesnot occur in Q but instead multiples log Q . Unlike the situation in [19], tr(ΛΣ) might negative in the present setting whichcomplicates the analysis somewhat.We shall need the following lemma in Section VII. Lemma 5: If Σ belongs to the range of Γ , then the functional J is bounded from below.Proof: The condition that Σ belongs to the range of Γ ensures the existence of a spectral density Φ satisfying (4). Then,in view of the construction above, ϕ (Λ) ≤ L (Φ , Λ) = D (Φ (cid:107) Ψ) or equivalently J (Λ) ≥ (cid:90) tr log Ψ − D (Φ (cid:107) Ψ) , which establishes the required bound.VII. A FAST ALGORITHM FOR THE DUAL PROBLEM

One of the difﬁculties dealing with the dual problem in Section VI is the redundancy introduced by the integral operator Γ , which has the consequence that only the part of Λ belonging to the range of Γ affects the value of J (Λ) . To remove thisredundancy we reformulate the problem by deﬁning R n × m matrix-valued variable X = M B, (36)where M is the unique solution of the Lyapunov equation M = A (cid:48) M A + Λ (37)and Λ ∈ range(Γ) is the is the matrix-valued variable in the dual problem in Section VI. By Proposition 2, there is a one-onecorrespondence between Λ and X . In view of (20), G ( z ) ∗ Λ G ( z ) = G ( z ) ∗ X + X (cid:48) G ( z ) , where G is given by (21). Therefore (34b) takes the form Q ( z ) = Ψ( z ) − + G ( z ) ∗ X + X (cid:48) G ( z ) . (38)Moreover, in view of (37) and (11), tr(ΛΣ) = tr( M Σ) − tr( M A Σ A (cid:48) )= tr( M BH ) + tr( B (cid:48) M H (cid:48) )= tr( HX + X (cid:48) H (cid:48) ) . Consequently the dual functional can be expressed in terms of X to obtain the functional J ( X ) : X + → R deﬁned by (22),where X + is a convex set.To prove that the functional (22) has a unique minimizer in X + we could now appeal to the proof in [2] that the dualproblem in Section VI has a unique solution. However, since now the redundancy in the dual problem has been removed, wecan offer a more straight-forward alternative proof. We denote by ¯ X + the closure of X + . Lemma 6: Suppose that Σ belongs to the range of Γ . Then any nonempty sublevel set { X ∈ ¯ X + | J ( X ) ≤ r } (39) is bounded.Proof: Let X ∈ ¯ X + be arbitrary, and deﬁne λ := (cid:107) X (cid:107) . We want to show that X cannot remain in the level set (39) as λ → ∞ . To this end, it is no restriction to assume that λ ≥ λ > . Next set ˜ X := λ − X and ˜ Q λ := ( λ Ψ) − + G ∗ ˜ X + ˜ X (cid:48) G .Then J ( X ) = γλ − log λ − tr (cid:90) log ˜ Q λ , where γ := tr( H ˜ X + ˜ X (cid:48) H (cid:48) ) , and where ˜ Q λ depends on λ but is bounded for λ ≥ λ . First suppose γ > . Then comparinglinear and logarithmic growth, J ( X ) → ∞ as λ → ∞ , which contradicts J ( X ) ≤ r . Next, suppose that γ ≤ . Then J ( X ) → −∞ as λ → ∞ , which contradicts Lemma 5, since J ( X ) = J ( L (Λ) B ) = J (Λ) , where L (Λ) is the unique solutionof the Lyapunov equation (37). Hence the sublevel set (39) is bounded as claimed. Lemma 7: The functional J : ¯ X + → R ∪ {∞} has a unique minimizer ˆ X in X + .Proof: We ﬁrst prove that J , which is continuous on X + , can be extended as a lower semicontinuous function J : ¯ X + → R ∪ {∞} . To this end, let ( X k ) be a sequence converging to X in L ∞ norm, and let ( Q k ) and Q be the corresponding functions (23), which are continuous on the compact interval [ − π, π ] , and hence uniformly continuous. Consequently there isa bound κ such that, for θ ∈ [ − π, π ] , Q ( e iθ ) ≤ κ and Q k ( e iθ ) ≤ κ for all k , and hence, by Fatou’s lemma, − (cid:90) log (cid:18) Qκ (cid:19) ≤ lim inf k →∞ − (cid:90) log (cid:18) Q k κ (cid:19) since Q k → Q pointwise. Consequently, J ( X ) ≤ lim inf k →∞ J ( X )) , which shows that that J , extended to the boundary ¯ X + , is lower semicontinuous. Therefore it follows from Lemma 6 that the sublevel set (39) is closed and hence bounded.Consequently, by Weierstrass’ Theorem, J has a minimum ˆ X in X , which must be unique by convexity.It remains to prove that ˆ X is not the boundary ∂ X . To this end, following [13], [15], consider the directional derivative δ J ( X, δX ) = tr (cid:26) ( HδX + δX (cid:48) H (cid:48) ) − (cid:90) Q − ( G ∗ δX + δX (cid:48) G ) (cid:27) = tr (cid:26) ( HδX + δX (cid:48) H (cid:48) ) − (cid:90) Q − δQ (cid:27) Now, for any X ∈ X + and ¯ X ∈ ∂ X , take δX = X − ¯ X and X λ = ¯ X + λδX and, correspondently, form δQ = Q ( z ) − ¯ Q ( z ) and Q λ ( z ) = ¯ Q ( z ) + λδQ λ ( z ) , where det ¯ Q ( e iθ ) for some θ ∈ [ − π, π ] . Then δ J ( X λ , − δX ) = − tr( HδX + δX (cid:48) H (cid:48) ) + (cid:90) f λ , where f λ is the scalar function f λ ( e iθ ) = tr { Q λ ( e iθ ) − δQ ( e iθ ) } . Taking the derivative with respect to λ we have ddλ f λ ( e iθ ) = tr { δQ ( e iθ ) ∗ Q λ ( e iθ ) − δQ ( e iθ ) } ≥ , and consequently f λ ( e iθ ) is a monotonically nondecreasing function of λ for all θ ∈ [ − π, π ] . Therefore, as λ → , f λ tendspointwise to f = tr { ¯ Q − ( Q − ¯ Q ) } = tr { ¯ Q − Q − I } = tr { ¯ Q − Q } − n. If (cid:82) f λ would tend to a ﬁnite value as λ → , ( f λ ) would be a Cauchy sequence in L ( − π, π ) and hence have a limit in L ( − π, π ) equal almost everywhere to f . However, since there is a δ > such that Q ( e iθ ) > δ , (cid:90) f ≥ δ (cid:90) tr (cid:0) ¯ Q − (cid:1) − n which is inﬁnite by Proposition 10. Consequently δ J ( X λ , ¯ X − X ) → ∞ as λ → , so there could be no minimum in ¯ X . This concludes the proof.Since the unique minimizer ˆ X belongs to the interior X + , the gradient (26) is zero there. This proves (25). Then, by (35),the optimal solution of Problem 1 is given by (24). This concludes the proof of Theorem 3.VIII. CLOSED - FORM SOLUTION FOR A SPECIAL CASE OF PRIOR

We now consider the special case where the prior power spectral density is of the particular form

Ψ = ( G ( z ) ∗ Λ G ( z )) − . Then the matrix function Q deﬁned by (34b) is given by Q ( z ) = G ( z ) ∗ (Λ + Λ) G ( z ) , which must be positive on the unit circle and hence, by Lemma 12 there exists a constant matrix C such that Q ( z ) = G ( z ) ∗ CC (cid:48) G ( z ) . (40) We ﬁrst change the dual functional (34a) by adding the constant tr(Λ Σ) , and compute tr((Λ + Λ )Σ) = tr (cid:90) (Λ + Λ ) G Φ G ∗ = tr (cid:90) Q Φ= tr (cid:90) C (cid:48) G Φ G ∗ C = tr C (cid:48) Σ C, (41)where Φ satisﬁes (4). In view of (40), the modiﬁed functional becomes ˜ J ( C ) := J (Λ) + tr Λ Σ= tr (cid:18) C (cid:48) Σ C − (cid:90) log G ∗ CC (cid:48) G (cid:19) (42)which is now a function of C . Recall the following result from Wiener-Masani-Helson-Lowdenslager. Proposition 8: If F ( z ) is a square outer matrix-valued function, then (cid:90) log det F F ∗ = log det F (0) F (0) ∗ . Proof:

The result follows by Jensen’s formula after noting that f = det F is outer ([68, p. 184]).We now consider once again the functional ˜ J ( C ) and determine stationarity conditions that provide a form of the optimal C . First, ˜ J ( C ) = tr ( C (cid:48) Σ C ) − log det( B (cid:48) CC (cid:48) B )= tr (cid:0) C (cid:48) Σ C − log( B (cid:48) CC (cid:48) B ) (cid:1) . The gradient with respect to C is ∂ ˜ J ∂C = 2 C (cid:48) Σ − B (cid:48) C ) − B (cid:48) , and hence the stationary point is given by C (cid:48) Σ = ( B (cid:48) C ) − B (cid:48) . This yield the equation B (cid:48) CC (cid:48) = B (cid:48) Σ − (43)for the optimal C , and we readily see that C = Σ − B ( B (cid:48) Σ − B ) − / satisﬁes (43). Thus, the optimal Q is ˆ Q ( z ) = G ( z ) ∗ Σ − B ( B (cid:48) Σ − B ) − B (cid:48) Σ − G ( z ) , and therefore ˆΦ( z ) = ( G ( z ) ∗ Σ − B ( B (cid:48) Σ − B ) − B (cid:48) Σ − G ( z )) − . This concludes the proof of Theorem 4. IX. C

ONCLUSIONS

The topic of the paper is to construct power spectral densities that are consistent with speciﬁed moments and are closestto a prior in a suitable sense. The spirit of the work is similar to a long line of contributions going back to [4], includinga series of papers [13]–[31] where the emphasis was in identifying and parametrizing power spectra of minimal complexity(i.e., dimensionality of modeling ﬁlters). A key tool in these earlier works was a choice of entropy functional that allowedparametrizing solutions via selection of a suitable prior power spectrum. The moment constraints were cast in the form of thestate covariance of an input-to-state ﬁlter.In departure from this early work, Ferrante, Masiero and Pavon [2] proposed to use the KL divergence between Gaussianprobability laws – a formulation which is quite natural from a probabilistic standpoint. The KL divergence between Gaussianprobability laws coincides with the Itakura Saito distance between their respective power spectral densities, and thus, theproblem turns out to be equivalent to one studied by Enqvist and Karlsson [3] in the context of scalar processes. The purposeof the current work is to present a simpliﬁed alternative optimization procedure which is based on a detailed analysis of thegeometry of input-to-state ﬁlters and related moment problems. Indeed, the power spectral densities are now parametrizedmore conveniently by a non-redundant coefﬁcient matrix ( X in Theorem 3) containing minimal number of parameters that arenecessary. Sections III and IV, as well as the proofs later in the paper contain the main contributions. X. A

PPENDIX

A. Behavior of J on the boundaryLemma 9: Let θ (cid:55)→ M ( e iθ ) be a matrix-valued function with Lipschitz-continuous components, and suppose that M ( e iθ ) ispositive semideﬁnite for all θ and identically zero for θ = θ . Then (cid:90) π − π tr { M − ( e iθ ) } dθ π = ∞ , where M − is deﬁned to have inﬁnite value on any subset of [ − π, π ] where it is identically zero.Proof: Without loss of generality we can assume that M ( e iθ ) = 0 in an isolated point θ . By assumption, we can choosea common Lipschitz constant K and an ε > such that the components m k(cid:96) ( e iθ ) of M have the bounds (cid:12)(cid:12) m k(cid:96) ( e iθ ) (cid:12)(cid:12) ≤ K | θ − θ | for | θ − θ | < ε . If N ( e iθ ) := M − ( e iθ ) , its components satisfy (cid:88) (cid:96) m k(cid:96) ( e iθ ) n (cid:96)k ( e iθ ) = 1 for all θ and k, which then implies that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) (cid:96) n (cid:96)k ( e iθ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K | θ − θ | ≥ for all θ ∈ ( θ − ε, θ + ε ) and k. Consequently, since N ( e iθ ) ≥ , there must be a k and an L > such that n kk ( e iθ ) ≥ L | θ − θ | , for all θ ∈ ( θ − ε, θ + ε ) , and therefore (cid:90) π − π tr { M − ( e iθ ) } dθ π ≥ L (cid:90) θ + εθ − ε | θ − θ | dθ π = ∞ , as claimed. Proposition 10: Let θ (cid:55)→ Q ( e iθ ) be a matrix-valued function with Lipschitz-continuous components, and suppose that Q ( e iθ ) is positive semideﬁnite for all θ and singular for θ = θ . Then (cid:90) π − π tr { Q − ( e iθ ) } dθ π = ∞ . Proof:

After applying a constant unitary transformation we can write Q on the form Q = (cid:20) Q Q Q ∗ Q (cid:21) , where Q ( e iθ ) = Q ( e iθ ) = 0 and Q ( e iθ ) > . Then Q − = (cid:20)(cid:2) Q − Q Q − Q ∗ (cid:3) − ∗∗ ∗ (cid:21) , where the Schur complement M := Q − Q Q − Q ∗ is positive semideﬁnite and has Lipschitz-continuous components. Then the statement of the proposition follows from Lemma 9. B. Co-invariant subspaces

Let H m represent row vector-valued functions in the Hardy space of square integrable functions on the circle which havean analytic continuation in the interior of the unit disc – a standard notation H or H ( D ) . The forward shift S amounts tomultiplication by z . The backward shift is precisely its adjoint, S ∗ : H (cid:96) → H (cid:96) : x ( z ) (cid:55)→ Π H z − x ( z ) . Subspaces which are invariant under S ∗ are those that are orthogonal to invariant subspaces of the forward shift S , i.e., of theform K := H × m (cid:9) H × m V ( z ) with V ( z ) an inner (matrix-valued) function, and they are often referred to simply as “co-invariant subspaces”. The orthogonalprojection onto K is Π K : H × m → K x ( z ) (cid:55)→ (cid:16) Π ( H × m ) ⊥ x ( z ) V ( z ) ∗ (cid:17) V ( z ) . To see this, note that since V ( z ) is inner, Π K deﬁned above is idempotant and Hermitian—hence a projection. It is easy toverify that its kernel is precisely H × m V ( z ) and therefore Π K is the orthogonal projection onto K as claimed.Let A ∈ C n × n with eigenvalues in D , B ∈ C n × m with ( A, B ) controllable. Without loss in generality we can alwaysnormalize ( A, B ) so that the corresponding controllability Grammian is the identity I ; when this is true AA ∗ + BB ∗ = I and [ A, B ] can be competed to a unitary matrix (cid:20) A BC D (cid:21) . It follows that V ( z ) = D + zC ( I − zA ) − B is an inner matrix-valued function, i.e., it is analytic in D and V V ∗ = V ∗ V = I , where V ∗ := V ( z ) := V ∗ ( z − ) . Now,consider G ( z ) := ( I − zA ) − B and the co-invariant subspace K as noted above. The following statement is known (see [65, Proposition 4]). Proposition 11: The rows of G ( z ) form a basis for K .Proof: The proof is again from [65, Proposition 4]. We ﬁrst claim that any element in K is of the form v ( zI − A ∗ ) − C ∗ V ( z ) (44)where v ∈ C × n . To see this note that Π K : x + x z + . . . (cid:55)→ (cid:2) Π ( H × m ) ⊥ ( x + x z + . . . ) × ( D ∗ + z − B ∗ C ∗ + . . . ) (cid:3) V ( z )= v ( z − C ∗ + z − A ∗ C ∗ + . . . ) V ( z ) where v = x B ∗ + x B ∗ A ∗ + . . . . Next, it can be shown [65, Eq. (36)] that G ( z ) = ( zI − A ∗ ) − C ∗ V ( z ) . (45)In view of (44), the rows of G ( z ) span K . Finally, if vG ( z ) = 0 for some v ∈ C × n , then necessarily v = 0 because ( A, B ) is controllable. Hence the rows of G ( z ) are linearly independent and form a basis for K as claimed. Lemma 12: Let Λ be a Hermitian n × n -matrix such that Q ( z ) := G ( z ) ∗ Λ G ( z ) > for z = e iθ , and θ ∈ [0 , π ) . There exists Λ o = C ∗ o C o with C o ∈ C m × n such that G ( z ) ∗ Λ G ( z ) = G ( z ) ∗ Λ o G ( z ) and C o G ( z ) is outer (i.e., minimum phase, that is, stable and stably invertible).Proof: Since Q ( z ) is Hermitian and positive deﬁnite on the unit circle of the complex plane, it can be factored as Q ( z ) = a ( z ) ∗ a ( z ) with a ( z ) outer. But V ( z ) G ( z ) ∗ Λ G ( z ) = V ( z ) a ( z ) ∗ a ( z ) has all its elements in H , since already V ( z ) G ( z ) ∗ does. Since a ( z ) is outer, V ( z ) a ( z ) ∗ is in H as well. Now, note that G ( z ) V ( z ) ∗ is orthogonal to H . Therefore, the zeroth term of G ( z ) V ( z ) ∗ vanishes. It follows that V ( z ) a ( z ) ∗ a ( z ) which has only positive power of z has no th term either. Therefore, V ( z ) a ( z ) ∗ hasonly positive powers of z and no th term. So, ﬁnally, we conclude that all elements of a ( z ) V ( z ) ∗ are orthogonal to H and,therefore, the rows of a ( z ) are in K . Thus, there exists a C ∈ C m × n such that a ( z ) = CG ( z ) . This completes the proof. R

EFERENCES[1] Pinsker, Mark S.“Information and information stability of random variables and processes”, Holden-Day, Inc. (1964).[2] A. Ferrante, C. Masiero and M. Pavon, “Time and spectral domain relative entropy: A new approach to multivariable spectral estimation”, IEEE Transactionson Automatic Control. vol. 57(10), 2012, pp. 2561–2575.[3] P. Enqvist and J. Karlsson, “Minimal Itakura-Saito distance and covariance interpolation,” in 47th IEEE Conference on Decision and Control, pp. 137-142(2008).[4] Burg, John Parker. “The relationship between maximum entropy spectra and maximum likelihood spectra.” Geophysics 37.2 (1972): 375-376.[5] H.J. Landau. ”Maximum entropy and the moment problem.” Bulletin of the American Mathematical Society 16.1 (1987): 47-77.[6] T.T. Georgiou Partial realization of covariance sequences, 1983 PhD Thesis, University of Florida.[7] T.T. Georgiou. “Realization of power spectra from partial covariance sequences,” IEEE Transactions on Acoustics, Speech and Signal Processing 35.4(1987): 438-449.[8] S.W. Lang and J.H. McClellan. “Multidimensional MEM spectral estimation.” IEEE Transactions on Acoustics, Speech and Signal Processing, 30(6):880–887, Dec. 1982.[9] T.T. Georgiou. “A topological approach to Nevanlinna-Pick interpolation.” SIAM Journal on Mathematical Analysis, 18:1248–1260, 1987.[10] C.I. Byrnes, A. Lindquist, S.V. Gusev, and A.S. Matveev. “A complete parameterization of all positive rational extensions of a covariance sequence.”IEEE Transactions on Automatic Control, 40(11):1841–1857, 1995.[11] C. I. Byrnes,H. J. Landau and A. Lindquist. “On the well-posedness of the rational covariance extension problem.” In Current and Future Directions inApplied Mathematics, M. Alber, B. Hu and J Rosenthal (editors), Birkhauser Boston, 1997, 81 –108.[12] C. I. Byrnes and A. Lindquist. “On the partial stochastic realization problem.” IEEE Transactions on Automatic Control (1997), 1049–1069.[13] C. I. Byrnes, S. V. Gusev, and A. Lindquist, “A convex optimization approach to the rational covariance extension problem,” SIAM J. Control andOptimization 37 (1998) 211–229.[14] T.T. Georgiou. “The interpolation problem with a degree constraint.” IEEE Transactions on Automatic Control, 44(3):631–635, 1999.[15] C. I. Byrnes, S. V. Gusev, and A. Lindquist, From ﬁnite covariance windows to modeling ﬁlters: A convex optimization approach, SIAM Review (2001), 645–675.[16] C.I. Byrnes, P. Enqvist, and A. Lindquist. Identiﬁability and well-posedness of shaping-ﬁlter parameterizations: A global analysis approach. SIAMJournal on Control and Optimization , 41(1):23–59, 2002.[17] C.I. Byrnes, T.T. Georgiou and A. Lindquist. ”A generalized entropy criterion for Nevanlinna-Pick interpolation with degree constraint.” AutomaticControl, IEEE Transactions on 46.6 (2001): 822-839.[18] C.I. Byrnes, T.T. Georgiou, and A. Lindquist, “A new approach to spectral estimation: A tunable high-resolution spectral estimator,” IEEE Trans. SignalProc. SP-49 (Nov. 2000), 3189–3205.[19] T. T. Georgiou and A. Lindquist, “Kullback-Leibler Approximation of spectral density functions”, IEEE Transactions on Information Theory, vol. 49(11),2003, pp. 2910–2917.[20] C.I. Byrnes and A. Lindquist. ”A convex optimization approach to generalized moment problems.” In Y. Oishi K. Hashimoto and Y. Yamamoto, editors,

Control and Modeling of Complex Systems: Cybernetics in the 21st Century: Festschrift in Honor of Hidenori Kimura on the Occasion of his 60th Birthday ,pages 3–21. Birkh¨auser, 2003.[21] A. Blomqvist, A. Lindquist and R. Nagamune. “Matrix-valued Nevanlinna-Pick interpolation with complexity constraint: An optimization approach.IEEE Transactions on Automatic Control 48 (Dec. 2003), 2172–2190.[22] T.T. Georgiou. “Solution of the general moment problem via a one-parameter imbedding.” IEEE Transactions on Automatic Control, 50(6):811–826,2005.[23] C.I. Byrnes and A. Lindquist. “The moment problem for rational measures: convexity in the spirit of Krein.” In

Modern Analysis and Application: MarkKrein Centenary Conference, Vol. I: Operator Theory and Related Topics , pages 157–169. Birkh¨auser, 2009.[24] T. T. Georgiou, “Relative entropy and the multivariable moment problem”, IEEE Transactions on Information Theory, vol. 52(3), 2006, pp. 1052–1066.[25] C.I. Byrnes, T. T. Georgiou, A. Lindquist, and A. Megretski. ”Generalized interpolation in ∞ with a complexity constraint.” Transactions of the AmericanMathematical Society 358.3 (2006): 965-987.[26] C.I. Byrnes and A. Lindquist. ”The generalized moment problem with complexity constraint.” Integral Equations and Operator Theory, 56(2):163–180,2006.[27] M. Pavon and A. Ferrante. “On the Georgiou-Lindquist approach to constrained Kullback-Leibler approximation of spectral densities.” IEEE Transactionson Automatic Control, 51(4):639–644, 2006.[28] H.I. Nurdin. “New results on the rational covariance extension problem with degree constraint.” Systems & Control Letters, 55(7):530 – 537, 2006.[29] C.I. Byrnes and A. Lindquist. ”Important moments in systems and control.” SIAM Journal on Control and Optimization 47: 5 (2008): 2458-2469.[30] T. T. Georgiou and A. Lindquist “A convex optimization approach to ARMA modeling.” IEEE Trans. Automatic Control (2008), 1108–1119.[31] J. Karlsson, A. Lindquist and A. Ringh, ”Multidimensional moment problem with complexity constraint,” Integral Equations and Operator Theory 84(2016), 395-418.[32] A. Ferrante, M. Pavon, and F. Ramponi. “Hellinger versus Kullback-Leibler multivariable spectrum approximation.” IEEE Transactions on AutomaticContro, 53(4):954–967, 2008.[33] Ramponi, Federico, Augusto Ferrante, and Michele Pavon. ”A globally convergent matricial algorithm for multivariate spectral estimation.” AutomaticControl, IEEE Transactions on 54.10 (2009): 2376-2388. [34] Ferrante, Augusto, Federico Ramponi, and Francesco Ticozzi. ”On the convergence of an efﬁcient algorithm for KullbackLeibler approximation ofspectral densities.” Automatic Control, IEEE Transactions on 56.3 (2011): 506-515.[35] M. Pavon and A. Ferrante. “ On the geometry of maximum entropy problems.” SIAM Review, 55(3):415–439, 2013.[36] P. Stoica and R. L. Moses. Spectral analysis of signals. Pearson/Prentice Hall, 2005.[37] Makhoul, John. ”Linear prediction: A tutorial review.” Proceedings of the IEEE 63.4 (1975): 561-580.[38] S.W. Lang and J.H. McClellan. “Spectral estimation for sensor arrays.” IEEE Transactions on Acoustics, Speech and Signal Processing, 31(2):349–358,Apr. 1983.[39] Li, Jian, and Petre Stoica. ”MIMO radar with colocated antennas.” Signal Processing Magazine, IEEE 24.5 (2007): 106-114.[40] R. Krummenauer et al. ”Improving the threshold performance of maximum likelihood estimation of direction of arrival.” Signal Processing 90.5 (2010):1582-1590.[41] A. Shapoury and E. Serpedin. ”Incoherent DOA estimation in uniform antenna arrays with inordinate spacing using a subband hopping approach.”Circuits, Systems & Signal Processing 28.1 (2009): 17-39.[42] Georgiou, Tryphon T. ”Signal estimation via selective harmonic ampliﬁcation: MUSIC, Redux.” Signal Processing, IEEE Transactions on 48.3 (2000):780-790.[43] TT. Georgiou. ”Spectral analysis based on the state covariance: the maximum entropy spectrum and linear fractional parametrization.” IEEE transactionson Automatic Control 47.11 (2002): 1811-1823.[44] TT. Georgiou. ”The Carathodory F´ejer Pisarenko decomposition and its multivariable counterpart.” Automatic Control, IEEE Transactions on 52.2 (2007):212-228.[45] Z. Yang, L. Xie, and P. Stoica. ”Vandermonde Decomposition of Multilevel Toeplitz Matrices with Application to Multidimensional Super-Resolution.”arXiv preprint arXiv:1505.02510 (2015).[46] TT. Georgiou. ”Spectral estimation via selective harmonic ampliﬁcation.” Automatic Control, IEEE Transactions on 46.1 (2001): 29-42.[47] Kalman, R. E. (1982). Realization of covariance sequences (pp. 331-342). Birkhuser Basel.[48] Basseville, Mich`ele. ”Divergence measures for statistical data processing: An annotated bibliography.” Signal Processing 93.4 (2013): 621-633.[49] J. Karlsson and P. Enqvist. ”Input-to-state covariances for spectral analysis: The biased estimate.” Proc. of Int. Symp. Mathematical Theory of Networkand Systems, MTNS. 2012.[50] Mahata, Kaushik, and Minyue Fu. ”Modeling continuous-time processes via input-to-state ﬁlters.” Automatica 42.7 (2006): 1073-1084.[51] Ying, Li, Wang Jian, and Song Zhanjie. ”Multivariate spectral estimation based on THREE.” The Journal of China Universities of Posts andTelecommunications 22.4 (2015): 26-32.[52] M.S. Takyar and T.T. Georgiou. ”Analytic Interpolation With a Degree Constraint for Matrix-Valued Functions.” Automatic Control, IEEE Transactionson 55.5 (2010): 1075-1088.[53] R.W. Johnson and J.E. Shore. ”Minimum cross-entropy spectral analysis of multiple signals.” Acoustics, Speech and Signal Processing, IEEE Transactionson 31.3 (1983): 574-582.[54] Ellis, Richard S. ”The theory of large deviations: from Boltzmann’s 1877 calculation to equilibrium macrostates in 2D turbulence.” Physica D: NonlinearPhenomena 133.1 (1999): 106-136.[55] Georgii, Hans-Otto. ”Large deviations and maximum entropy principle for interacting random ﬁelds on Zd.” The Annals of Probability (1993): 1845-1875.[56] E.T. Jaynes, “Information Theory and Statistical Mechanics,” in In Ford, K. (ed.). Statistical Physics . New York: Benjamin. p. 181, 1963.[57] E.T. Jaynes, E. T. “The Relation of Bayesian and Maximum Entropy Methods,” in Maximum-Entropy and Bayesian Methods in Science and Engineering(Vol. 1), Kluwer Academic Publishers, p. 25-29, 1988.[58] E.T. Jaynes,

Probability Theory: The Logic of Science , Cambridge University Press, 2003.[59] B.D.O. Anderson, J.B. Moore and R.M. Hawkes, “Model approximation via prediction error identiﬁcation”, Automatica, vol. 14, 1978, pp. 615–620.[60] A.A. Stoorvogel and J.H. van Schuppen, “System identiﬁcation with information theoretic criteria”, in

Identiﬁcation, Adaptation, Learning: The Scienceof Learning Models from Data , S. Bittanti and G. Picci (eds), Series F: Computer and System Sciences 153, Springer-Verlag, Berlin, 1996, pp. 289–338.[61] A. Lindquist and G. Picci,

Linear Stochastic Systems: A Geometric approach to Modeling, Estimation and Identiﬁcation , Series in ContemporaryMathematics 1, Springer-Verlag, Berlin Heidelberg, 2015.[62] Li, Jian, and Petre Stoica. Robust adaptive beam forming. Vol. 88. John Wiley & Sons, 2005.[63] Li, Jian, and Petre Stoica, eds. MIMO radar signal processing. J. Wiley & Sons, 2009.[64] Vaidyanathan, Parishwad P. “Multirate digital ﬁlters, ﬁlter banks, polyphase networks, and applications: a tutorial.” Proceedings of the IEEE 78.1 (1990):56-93.[65] T.T. Georgiou, “The structure of state covariances and its relation to the power spectrum of the input,” IEEE Transactions on Automatic Control, vol.47, pp. 1056–1066, 2002.[66] Lindquist, Anders. ”A new algorithm for optimal ﬁltering of discrete-time stationary processes.” SIAM Journal on Control 12.4 (1974): 736-746.[67] Haykin, Simon. ”Nonlinear methods of spectral analysis.” Springer-Verlag (1979).[68] L. V. Ahlfors,