On kernel design for regularized LTI system identification
aa r X i v : . [ c s . S Y ] J u l On kernel design for regularized LTI system identification ⋆ Tianshi Chen
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 518172, China, (e-mail:[email protected]).
Abstract
There are two key issues for the kernel-based regularization method: one is how to design a suitable kernel to embed in thekernel the prior knowledge of the LTI system to be identified, and the other one is how to tune the kernel such that theresulting regularized impulse response estimator can achieve a good bias-variance tradeoff. In this paper, we focus on the issueof kernel design. Depending on the type of the prior knowledge, we propose two methods to design kernels: one is from amachine learning perspective and the other one is from a system theory perspective. We also provide analysis results for bothmethods, which not only enhances our understanding for the existing kernels but also directs the design of new kernels.
Key words:
System identification, regularization methods, kernel methods, kernel design, prior knowledge.
Among diverse system identification problems, thelinear time-invariant (LTI) system identification is aclassical and fundamental problem. As well-known, anLTI system is uniquely characterized by its impulseresponse, and thus the LTI system identification isequivalent to its impulse response estimation that couldbe ill-conditioned in practice. So to tackle the LTI sys-tem identification, one needs to first find a way to makethe ill-conditioned problem well-conditioned. Then onefaces the core issue of system identification, i.e., how toconstruct a model estimator able to achieve a good bias-variance tradeoff, or equivalently, to suitably balancethe adherence to the data and the model complexity.There are different routes to handle these two issues. The ⋆ Abridged and preliminary results of this paper was pre-sented in the 19th IFAC World Congress, Cape Town, SouthAfrica, 2014, the European Research Network on SystemIdentification (ERNSI) workshop, Ostend, Belgium, 2014,and the 17th IFAC Symposium on System Identification,Beijing, China, 2015.This work is supported by the Thousand Youth Talents Planfunded by the central government of China, the ShenzhenProjects Ji-20170189 and Ji-20160207 funded by ShenzhenScience and Technology Innovation Council, the President’sgrant under contract No. PF. 01.000249 and the Start-upgrant under contract No. 2014.0003.23 funded by the ChineseUniversity of Hong Kong, Shenzhen, as well as by a researchgrant for junior researchers under contract No. 2014-5894,funded by Swedish Research Council. most widely used route is to adopt instead a parametricmodel structure with a finite number of parameters. Theso-called maximum likelihood/prediction error method(ML/PEM) is one method along this route. It has opti-mal asymptotic properties and is the current standardmethod to LTI system identification. It first postulatesa parametric model structure, e.g., a rational transferfunction, then forms the prediction error criterion, andfinally obtains the model estimator by minimizing theprediction error criterion; see e.g., [1,2]. The model com-plexity of the parametric model structure is governed bythe number of parameters and tuned in a discrete man-ner. The bias-variance tradeoff issue is handled by crossvalidation or model structure selection criteria such asAkaike’s information criterion (AIC), Bayesian informa-tion criterion (BIC), and etc, which correspond to com-binatorial optimization problems; see e.g., [1, Chapter16]. However, these techniques are not as reliable as ex-pected when the data is short and/or has low signal-to-noise ratio; see e.g., [3,4].Another route is to adopt regularization by adding an ex-tra regularization term in the estimation criterion. Thekernel-based regularization method (KRM) proposed inthe seminal paper [5] and further developed in [6,3,7]is one method along this route. It first proposes a ker-nel parameterized by some hyper-parameters for the im-pulse response, then estimates these hyper-parameterswith certain methods, and finally obtains the estimateof the impulse response by minimizing a regularizedleast squares criterion with the regularization term inquadratic form and the regularization matrix defined
Preprint submitted to Automatica 17 September 2018 hrough the kernel; see [4] for a recent survey. While theimpulse response model is used, the underlying modelstructure is determined by the kernel. The model com-plexity is governed by the hyper-parameters of the ker-nel, and tuned in a continuous manner by e.g., the em-pirical Bayes method, Stein’s unbiased risk estimator[4,8,9] and etc., which correspond to non-convex butnon-combinatorial optimization problems.The route to adopt regularization is by no means new,see e.g., [10], [11], [1, p. 504-505] and also [12] for a his-toric review, but no important progress along this routehas been reported until [5]. The major obstacle is thatit was unclear whether or not it is possible to designthe regularization to embed the prior knowledge of theLTI system to be identified. The intriguing finding dis-closed by the KRM is that when considering impulseresponse estimation problem, it is possible to design aregularization in quadratic form or equivalently a kernelto embed the prior knowledge of the impulse response of the LTI system to be identified. So far several kernelshave been invented to embed various prior knowledge,e.g., [5,6,3,7,13,14,15,16,17,18,19] and some analysis re-sults of the corresponding kernels have been derived,e.g., [5,3,7,13,14,15,16,17,18]. Still, there lack systematicways to design kernels to embed in the kernel the priorknowledge of the impulse response of the LTI system tobe identified.In this paper, we try to develop systematic ways to de-sign kernels. Clearly, this issue relates to the type of theprior knowledge. Interestingly, even for the same impulseresponse estimation problem, users with different back-ground may come up with different types of prior knowl-edge because they may treat the impulse response in dif-ferent ways. For instance, users from machine learningmay treat the impulse response as a function, whose am-plitude varies with a certain rate and decays exponen-tially to zero in the end; users from signals and systemsmay associate the impulse response with an LTI systemthat is stable and may be overdamped, underdamped,has multiple distinct time constants and resonant fre-quencies, and etc. Accordingly, we provide two meth-ods to design kernels to embed the corresponding typeof prior knowledge from a machine learning perspectiveand from a system theory perspective, respectively.To develop the machine learning method, we first pointout a common feature of the two most widely used singlekernels, i.e., the stable spline (SS) kernel in [5] and thediagonal-correlated (DC) kernel in [3], that they belongto the class of the so-called amplitude modulated locallystationary (AMLS) kernel. The AMLS kernel is a multi-plication of two kernels: a rank-1 kernel and a stationarykernel. It has a neat implication: the zero mean Gaussianprocess with the AMLS kernel as the covariance functioncan be seen as an amplitude modulated (by the rank-1kernel) zero mean Gaussian processes with the station-ary kernel as covariance function; see Section 3 for de- tails. This finding leads to the machine learning methodto design general kernels: design the rank-1 kernel andthe stationary kernel to account for the decay and thevarying rate of the impulse response, respectively.To develop the system theory method, we recall the gen-eral guideline to design a kernel in [3] that is to let thekernel mimic the behavior of the optimal kernel [3, Thm.1] and [4, Prop. 19]. 
Moreover, the prior knowledge forthe impulse response, or equivalently, the LTI system tobe identified, shall be made use of, since the optimal ker-nel depends on the true impulse response. Following thisguideline and employing the multiplicative uncertaintyconfiguration in robust control, see e.g. [20], we designthe simulation induced (SI) kernel. In particular, theprior knowledge is embedded in the nominal model, theuncertainty is assumed to be stable and finally, the mul-tiplicative uncertainty configuration is used to take intoaccount both the nominal model and the uncertaintyand is simulated with an impulsive input to get the SIkernel. Then we make analysis for SI kernels in terms ofstability, maximum entropy property, and Markov prop-erty, with the SS and DC kernels as examples. In par-ticular, we give conditions under which the SI kernel isstable, solves a suitably defined entropy maximizationproblem, and induces a Gaussian Markov process. Themaximum entropy interpretation enhances our under-standing about a kernel, and the Markov property of akernel ensures that the inverse of the kernel matrix isbanded and this special structure can be exploited toderive more stable and efficient implementation for theKRM, see e.g., [21,18].The remaining part of this paper is organized as follows.In Section 2, the KRM is recapped. The machine learn-ing and system theory methods to design kernels are in-troduced in Section 3 and 4, respectively. Then a casestudy is provided in Section 5 to demonstrate how todesign kernels from the proposed two methods to modelimpulse responses with damped oscillation. Finally, thispaper is concluded in Section 6 with a take-home mes-sage. The proofs of all propositions and theorems aregiven in the Appendix.
Consider a single-input-single-output, causal and LTIsystem described by y ( t ) = ( g ∗ u )( t ) + v ( t ) , t = T s , T s , · · · , N T s , (1)where t is the time index, y ( t ) , u ( t ) , v ( t ) ∈ R and g ( t ) ∈ R are the measured output, input, disturbance, and im-pulse response of the LTI system at time t , respectively,2 g ∗ u )( t ) is the convolution between the impulse response g ( · ) and the input u ( · ) evaluated at time t , and T s is thesampling period and for convenience chosen to be 1.The system (1) represents both discrete-time (DT) sys-tem with t = 1 , , · · · , and continuous-time system (CT)with t ≥
0. In particular, we have( g ∗ u )( t ) = ( P ∞ τ =1 g ( τ ) u ( t − τ ) , DT , R ∞ g ( τ ) u ( t − τ ) dτ, CT , (2)where the unmeasured portions of u ( t ) are set to zero.Moreover, the system (1) is assumed to be stable, i.e., g ∈ ℓ for the DT case and g ∈ L for the CT case, where ℓ denotes the space of absolutely convergent sequences and L denotes the space of absolutely Lebesgue integrablefunctions on t ≥
0, and the disturbance v ( t ) is assumedto be a white Gaussian noise with zero mean and variance σ and independent of u ( t ).The goal of LTI system identification is to estimate theimpulse response g ( t ) as well as possible given the data y ( t ) , u ( t ) with t = 1 , , · · · , N for the DT case and y ( t )with t = 1 , , · · · , N and u ( t ) with t ≥ As well-known [10], [1, p. 504-505], straightforward im-pulse response estimation, i.e.,minimize g N X t =1 ( y ( t ) − ( g ∗ u )( t )) could be an ill-conditioned problem in practice and oneway to overcome this problem is to adopt regularization.Moreover, the recent progresses for KRM [5,6,3,7] showthat if the regularization is well designed and tuned, theresulting regularized impulse response estimator can alsoachieve a good bias-variance tradeoff.To introduce the KRM, we first recall the definitionof the positive semidefinite kernel and its associatedreproducing kernel Hilbert space (RKHS). Let ( X, d )be a metric space with d being its metric. A function k : X × X → R is called a positive semidefinite kernel,if it is symmetric and satisfies P mi,j =1 a i a j k ( x i , x j ) ≥ { x , · · · , x m } ⊂ X and { a , ..., a m } ⊂ R . As well-known from e.g., [22], to ev-ery positive semidefinite kernel k there corresponds toone and only one class of functions with a unique deter-mined inner product in it, leading to the so-called repro-ducing kernel Hilbert space (RKHS) H k with k as thereproducing kernel.The KRM first introduces a suitable positive semidef- inite kernel k ( t, s ; θ ) with t, s ∈ X , where for the DTcase, X = N and for the CT case, X = { t | t ≥ } , and θ ∈ R m is a hyper-parameter vector that contains theparameters used to parameterize the kernel, and thensolves the following regularized least squares problem:ˆ g ( t ) = arg min g ∈H k N X t =1 ( y ( t ) − ( g ∗ u )( t )) + γ k g k H k , (3)where k g k H k is the norm of g in H k , and γ > P Nt =1 ( y ( t ) − ( g ∗ u )( t )) and the regulariza-tion term k g k H k .We will discuss the issue of kernels in the next subsec-tions. For the time being, we assume that a suitable ker-nel k ( t, s ; θ ) has been designed, but its hyper-parameter θ is left to be determined. The current most widelyused method to determine θ is the so-called empiricalBayes method [23]. It first embeds the regularization in aBayesian framework, and then estimates θ and possiblyalso the noise variance σ by maximizing the marginallikelihood p ( Y | η ), where Y ∈ R N with y ( t ) being its t thelement, and η could be θ or the concatenation of θ and σ . Specifically, we define A ( θ ) ∈ R N × N with its ( t, s )thelement A t,s ( θ ) defined by a ( t, s ; θ ) = ( k ( t, · ; θ ) ∗ u )( s ) , A t,s ( θ ) = ( a ( · , s ; θ ) ∗ u )( t )and moreover, we let Σ( η ) = A ( θ ) + σ I N , where I N isthe N -dimensional identity matrix. Then we getminimize η ∈ Γ Y T Σ( η ) − Y + log det Σ( η ) , where Γ is a constraint set where we search for η . Whenan estimate of η is obtained, the solution to (3) is givenby the representer theorem [4, Theorem 3]:ˆ g ( t ) = N X s =1 ˆ c s a ( t, s ; θ ) , (4)where ˆ c s is the s th element of ˆ c = ( A ( θ ) + σ I N ) − Y . So far the two mostly widely used single kernels are thestable spline (SS) kernel [5] and the diagonal/correlated(DC) kernel [3]. The SS kernel is defined as k SS ( t, s ; θ ) = c
12 exp( − β ( t + s ) − β max { t, s } ) − c
16 exp( − β max { t, s } ) , θ = [ c β ] T , c, β ≥ , (5) Sometimes the dependence of k ( t, s ; θ ) on θ is ignored. k DC ( t, s ; θ ) = cλ ( t + s ) / ρ | t − s | , θ = [ c λ ] T (6) k TC ( t, s ; θ ) = c min( λ t , λ s ) , θ = [ c λ ρ ] T (7) c ≥ , ≤ λ < , ( | ρ | ≤ , DT0 ≤ ρ ≤ , CTwhere the TC kernel is a special case of the DC kernelwith ρ = √ λ and is also called the first order stablespline kernel. Remark 2.1
It is worth to note that for ρ < and | t − s | 6∈ N , ρ | t − s | is complex and thus k DC ( t, s ; θ ) with ρ < is not well defined for the CT case.2.4 Optimal Kernel and Stable Kernel For the KRM, the optimal kernel in the sense of minimiz-ing the mean square error (MSE) exists [3,4] and moti-vates a general guideline to design a kernel. To state thisresult, we assume that the data has been generated by(1) for a true impulse response g ( t ) and we let ¯ g and¯ˆ g to represent any finite dimensional vector obtained bysampling g ( t ) and its estimate ˆ g ( t ) at the same arbi-trary time instants. Then the following result holds. Lemma 2.1 (Optimal kernel, [3,4])
Letting γ = σ .Then for the KRM, the MSE matrixMSE ( k ) , E (cid:2) (¯ˆ g − ¯ g )(¯ˆ g − ¯ g ) T (cid:3) , (8) where E is the mathematical expectation, is minimized bythe kernel k opt ( t, s ) = g ( t ) g ( s ) , t, s ∈ X (9) in the sense that MSE ( k ) − MSE ( k opt ) is positive semidef-inite for any positive semidefinite kernel k . The optimal kernel k opt cannot be applied in practiceas it depends on the true impulse response g ( t ). How-ever, it motivates a general guideline to design a kernel: let the kernel mimic the behavior of the optimal kernel,and moreover, the prior knowledge of the true impulseresponse should be used in the design of the kernel. For instance, if the LTI system is known to be stable,i.e., g ∈ ℓ or L , then the designed kernel k should re-flect this, and a necessary condition to satisfy is that k ( t, t ) should tend to 0 as t goes to infinity. This obser-vation basically rules out the possibility to model theimpulse response of stable LTI systems with stationarykernels and a rigorous proof for this has been given Recall that a kernel k ( t, s ) with t, s ∈ X is said to bestationary if k ( t, s ) is a function of t − s for any t, s ∈ X . in [13, Lemma 8]. More generally, the designed kernelshould guarantee that its associated RKHS H k is a sub-space of ℓ or L and a kernel that has such property issaid to be stable [13,4]. Sufficient and necessary condi-tions for a kernel to be stable exist and are given below. Lemma 2.2 ([24,13])
A positive semidefinite kernel k is stable if and only ifDT: P ∞ s =1 | P ∞ t =1 k ( t, s ) l ( t ) | < ∞ , ∀ l ( t ) ∈ ℓ ∞ CT: R ∞ (cid:12)(cid:12)R ∞ k ( t, s ) l ( t ) dt (cid:12)(cid:12) ds < ∞ , ∀ l ( t ) ∈ L ∞ (10) where L ∞ and ℓ ∞ denote the space of bounded functionson R ≥ and bounded sequences, respectively. Corollary 2.1 ([13,4])
A positive semidefinite kernel k is stable ifDT: P ∞ s =1 | P ∞ t =1 k ( t, s ) | < ∞ , CT: R ∞ (cid:12)(cid:12)R ∞ k ( t, s ) dt (cid:12)(cid:12) ds < ∞ . (11) Hereafter, we focus on the problem of kernel design. Theproblem is stated as follows: for the given prior knowl-edge of the impulse response to be identified, our goal isto design a kernel such that the prior knowledge is em-bedded in the kernel. The answer should depend on thetype of the prior knowledge. We will consider two typesof prior knowledge and discuss how to design kernelsaccordingly from two perspectives: a machine learningperspective and a system theory perspective.It should be noted that the LTI system (1) is assumed tobe stable, meaning that g ∈ ℓ or L becomes our firstprior knowledge. In this regard, Theorem 2.2 becomes arule that the designed kernel should obey. Remark 2.2
Other than stability, the cases where theimpulse response or the corresponding LTI system isknown to have relative degree, to be monotonic, smoothand to have delay, have been discussed in [13] and in par-ticular, sufficient and/or necessary conditions for a ker-nel to have such properties are given. For instance, if aCT kernel k is m -times continuously differentiable, thenevery function h ∈ H k is m -times continuously differen-tiable [25, Corollary 4.36, p. 131], [13, Lemma 5]. We first show that both the SS kernel and the DC kernelbelong to the class of the so-called amplitude modulatedlocally stationary (AMLS) kernel. Accordingly, we pro-pose to treat the impulse response as a function, whose4mplitude varies with a certain rate and decays expo-nentially to zero in the end, and moreover, design AMLSkernels to model the impulse response.
Recall that a kernel k ( t, s ) is said to be a locally station-ary (LS) kernel in the sense of Silverman [26] if k ( t, s ) = k d (cid:18) t + s (cid:19) k c ( t − s ) , t, s ∈ X (12)where k d ≥ k c is a stationary kernel . Motivatedby the LS kernel, we introduce a kernel suitable for mod-eling impulse response, which is called the amplitudemodulated locally stationary (AMLS) kernel. Definition 3.1
A kernel k ( t, s ) is said to be an ampli-tude modulated locally stationary kernel if k ( t, s ) = k d ( t, s ) k c ( t − s ) , k d ( t, s ) = b ( t ) b ( s ) , (13) k c (0) = 1 , t, s ∈ X, where b ( t ) > is bounded and k c is a stationary kernel. It can be proved that the AMLS kernels have the follow-ing properties (proof given in the Appendix).
Proposition 3.1
Consider the AMLS kernel (13).Then the following results hold:(a) k d ( t, s ) is a rank-1 kernel and moreover satisfy k d ( t, s ) = q k d ( t, t ) k d ( s, s ) . (14) (b) Let g ( t ) be a stochastic process with zero mean andthe AMLS kernel as the covariance function. Then k c ( t − s ) is the correlation coefficient between g ( t ) and g ( s ) .(c) Let h ( t ) be a stochastic process with zero meanand k c ( t − s ) as the covariance function. Then thestochastic process g ( t ) , b ( t ) h ( t ) has zero meanand the AMLS kernel as its covariance function. Remark 3.1
Since k c ( t − s ) is the correlation coefficientand k c (0) = 1 , then k c ( t − s ) is bounded and moreover, | k c ( t − s ) | ≤ for any t, s ∈ X . For a function estimation problem, if a zero mean Gaus-sian processes with the AMLS kernel as the covariancefunction is chosen to model the function, then Proposi-tion 3.1 has the following implications: Clearly, if k d is a positive constant, then the LS kernelreduces to a stationary kernel. A kernel k ( t, s ) with t, s ∈ X is said to be a rank-1 kernel iffor any t i , s i ∈ X , i = 1 , . . . , n and for any n ∈ N , the kernelmatrix K , defined by K i,j = k ( t i , t j ), is a rank-1 matrix. • The stationary kernel k c ( t − s ) in (13) accounts forthe varying rate of the function.Note that part (b) shows that k c ( t − s ) is the cor-relation coefficient for g ( t ), which implies that E ( g ( t ) − g ( s )) = ( b ( t )) + ( b ( s )) − b ( t ) b ( s ) k c ( t − s ) . The above equation shows that for any fixed t, s ∈ X ,as k c ( t − s ) varies from 1 to −
1, ( g ( t ) − g ( s )) tends tobecome larger, that is, g ( s ) tends to vary more quicklyaway from g ( t ). This observation also holds for h ( t ) as k c ( t − s ) is also the correlation coefficient for h ( t ). • The rank-1 kernel k d ( t, s ) in (13) account for thechange of the amplitude of the function.Note that part (c) shows that the amplitude of g ( t )is modulated by the factor b ( t ). If h ( t ) does not con-verge to 0 as t goes to ∞ , then a suitable b ( t ) can bechosen such that g ( t ) does.Interestingly, both the SS and DC kernels can be put inthe form of (13). Setting λ = exp( − β/
2) and using theequality max { t, s } = ( t + s + | t − s | ) / k SS ( t, s ; c, λ ) = cλ t + s ) (cid:18) λ | t − s | − λ | t − s | (cid:19) . (15)Then we identify, for the SS kernel (15), k d ( t, s ) = c λ t + s ) , k c ( t − s ) = 32 λ | t − s | − λ | t − s | , (16)and for the DC kernel (6), k d ( t, s ) = cλ ( t + s ) , k c ( t − s ) = ρ | t − s | . (17)Moreover, one can prove the following result (proof givenin the Appendix). Proposition 3.2
The SS kernel (15) and the DC kernel(6) are AMLS kernels.
Now we demonstrate for the SS and DC kernels the roleof the rank-1 kernel k d and the stationary kernel k c . Example 3.1
As seen from the left panel of Fig. 1,as λ changes from . to . , the realizations of theDT zero mean Gaussian process with k c ( t − s ; [ c, λ ] T ) in (16) as the covariance function varies more quickly,because k c (1; [1 , . ] T ) > k c (1; [1 , . ] T ) . Similarly,as ρ changes from . , to and to − . , the realiza-tions of the DT zero mean Gaussian process with k ( t − s ; [1 , . , ρ ] T ) in (17) varies more quickly, and especiallyfor the case with ρ = − . , the realization tends tochange its sign at the next time instant. The above obser-vations carry over to the right panel of Fig. 1. Finally,
20 40 60 80 100-0.200.2 0 20 40 60 80 10000.020.040 20 40 60 80 100-0.500.5 0 20 40 60 80 10001020 × -3 Fig. 1. Realizations of the DT zero mean Gaussian processeswith the SS kernel (15) and the DC kernel (6) as covariancefunction. For each row, the left panel shows the realizationsof the DT zero mean Gaussian process with the IS kernel k c ( t − s ) as covariance function and the right panel showsthe realization of the DT zero mean Gaussian process withthe SS or DC kernel as covariance function. The correspond-ing realizations for the SS kernel (15) are shown in the toptwo rows, which correspond to c = 1 and λ = 0 . , . , re-spectively. The corresponding realizations for the DC kernel(6) are shown in the bottom three rows, which correspondto c = 1, λ = 0 . ρ = 0 . , , − .
99, respectively. the realizations on the left panel of Fig. 1 do not go tozero for large t but the realizations on the right panel do,because of k d ( t, s ) . Remark 3.2
Recall from Remark 2.1 that the DC kernel(6) with ρ < is only defined for the DT case but notfor the CT case. Now we see that the DC kernel is notgood to model quickly varying impulse responses for theCT case, which is also true for the SS kernel (15).3.2 Construct AMLS Kernels for Regularized ImpulseResponse Estimation Proposition 3.2 and its implications on the role of therank-1 kernel k d and the stationary kernel k c suggesta machine learning method to design kernels for regu-larized impulse response estimation. If the impulse re-sponse is treated as a function and the prior knowledgeis about its decay and varying rate, then AMLS kernelscan be designed by choosing suitable rank-1 kernel k d and stationary kernel k c to account for the decay andvarying rate of the impulse response, respectively. k c ( t − s )An important class of stationary kernels is the isotropicstationary (IS) kernel. Recall that a stationary kernel k c ( t − s ) is said to be an IS kernel if it depends on | t − s | .IS kernels have been studied extensively in the literaturein statistics and machine learning, see e.g. [27, Section4.2.1]. There are many choices of IS kernels that couldbe used instead of the ones in (16) and (17).Most of the IS kernels introduced in [27, pages 83 -88]decay monotonically w.r.t. | t − s | and are always positive.For example, the squared exponential (SE) kernel andthe Mat`ern class of kernels: k c ( r ) = e − βr , β > , “SE” ,k c ( r ) = 2 − ν Γ( ν ) (cid:16) β √ νr (cid:17) ν K ν (cid:16) β √ νr (cid:17) ,β > , ν > , where r = | t − s | , Γ( · ) is the Gamma function and K ν isa modified Bessel function with order ν .There are IS kernels that do not decay monotonically,can take negative values, and can have the form ofdamped oscillation w.r.t. | t − s | . For example, the kernelin [27, p. 89], [28]: k c ( r ) = c ( αr ) − ν J ν ( αr ) , α > , ν ≥ − / , (18a)where r = | t − s | , c is a scalar such that k c (0) = 1, and J ν ( · ) is the Bessel function of the first kind with order ν .The kernel (18a) is defined for any r ≥ ν = − / ν = 1 /
2, the kernel(18a) takes the following form k c ( r ) = cos( αr ) , α > , (18b) k c ( r ) = sin( αr ) αr , α > . (18c) k d ( t, s )The design of k d ( t, s ) is equivalent to that of the strictlypositive function b ( t ) in (13). The bottom line is that b ( t ) should ensure that the designed AMLS kernel (13)is stable, i.e, H k is a subspace of ℓ or L . Our follow-ing result gives a characterization of the stability of theAMLS kernel (proof given in the Appendix). Proposition 3.3
Consider the AMLS kernel (13).Then the following results hold.(a) If b ( t ) ∈ ℓ for the DT case, and b ( t ) ∈ L for the CTcase, then the AMLS kernel (13) is stable. b) Assume that there exists a sequence of positive numbers λ i and linearly independent functions φ i ( t ) defined on X such that k c ( t − s ) = ∞ X i =1 λ i φ i ( t ) φ i ( s ) ,t, s ∈ X, λ i > , i = 1 , · · · , ∞ , (19) where the convergence is absolute and uniform on Y × Y with Y , Y being any compact subsets of X . If theAMLS kernel (13) is stable, then there exists no ǫ > such that b ( t ) ≥ ǫ for all t ∈ X . Since the b ( t ) in (16) and (17) are exponential decayfunctions and clearly satisfy b ( t ) ∈ ℓ or L , the SS andDC kernels are stable by Proposition 3.3. Remark 3.3
It is reasonable to check wether the con-dition (19) is too strong to be satisfied. In fact, the se-ries expansion (19) can be obtained by Mercer’s Theorem[29,30,31]. For instance, a sufficient condition for (19)to hold is given in [31]. To state this result, we define thekernel section of k c at a fixed s ∈ X as k cs , k c ( t − s ) andwe let L ( X, µ ) denote the space of functions f : X → R such that R X | f ( t ) | dµ ( t ) < ∞ , where µ is a nondegen-erate Boreal measure on X . Then (19) holds if k c ( t − s ) is continuous, k cs ∈ L ( X, µ ) and Z X Z X ( k c ( t − s )) dµ ( t ) dµ ( s ) < ∞ . (20) It is easy to check that many stationary kernels satisfythe above sufficient condition, e.g., the SE kernel and the k c in (16) and (17). Remark 3.4
It follows from (A.6) that if there exists an ǫ > such that | h ( t ) | ≥ ǫ for all t ≥ , then b ( t ) ∈ L ,implying that b ( t ) ∈ L is also necessary for the stabilityof the AMLS kernel (13). The above observations showthat under the assumption that the AMLS kernel (13) isstable, the result we can draw on the properties of b ( t ) isdetermined by the property of H k c . Remark 3.5
The reason why we force b ( t ) > for t ∈ X is because for a kernel in the form of k ( t, s ) = k d ( t, s ) k c ( t − s ) , we expect the two kernels k d and k c have somewhat independent role. As shownabove, this idea eases the kernel design and the corre-sponding analysis. If this idea is not taken, then we candesign more general b ( t ) and even more general k d . Forinstance, we could allow b ( t ) to be arbitrary real-valuedfunction and we could also allow k d to be the moregeneral exponentially convex (EC) kernel and design For the LS kernel (12), if k d is also a kernel, then k d iscalled an exponentially convex (EC) kernel [32] and in thiscase k is called an exponentially convex locally stationary(ECLS) kernel [26]. G ( q ) G ( q ) G ( q ) g ( t ) δ ( t ) ¯ u ( t ) Fig. 2. The block diagram for the multiplicative uncertaintyparadigm in robust control. The single-input-single-outputsystem G ( q ) = G ( q )(1 + G △ ( q )) consists of two parts: thenominal part G ( q ) and the uncertainty part G △ ( q ), where q is the forward shift operator and the differential operatorfor the DT and CT case, respectively. The real-valued signals δ ( t ) , ¯ u ( t ) and g ( t ) are the impulsive input, the input of G ( q )and the output of G ( q ), respectively. accordingly the ECLS kernel [15] instead of the AMLSkernel. The price to pay is that the role of k d and k c would become obscure and in particular, k d would not de-scribe the decay rate and k c would not be the correlationcoefficient of the underlying Gaussian process. Instead of simply treating the impulse response as afunction, we now associate the impulse response with anLTI system which is stable and may be overdamped, un-derdamped, have multiple distinct time constants andresonant frequencies, and etc., and our goal is to designkernels to embed this kind of prior knowledge.
Suppose that the prior knowledge of an LTI system isembedded in a stochastic process g ( t ) that is used tomodel the corresponding impulse response. Then follow-ing the guideline to design kernels in Section 2.4 thatis to mimic the behavior of the optimal kernel (9), weshould design the kernel as k ( t, s ) = C ov ( g ( t ) , g ( s )) , (21)where C ov ( g ( t ) , g ( s )) is the covariance between g ( t ) and g ( s ). Now the problem of kernel design becomes “howto embed the prior knowledge of an LTI system in astochastic process g ( t ) that is used to model the corre-sponding impulse response”?A natural way to tackle the above question is by us-ing simulation. To this goal, it is useful to employ themultiplicative uncertainty configuration in robust con-trol, see e.g., [20], as shown in Fig. 2. Here, the nominalmodel G ( q ) is used to embed the prior knowledge onthe LTI system to be identified, the uncertainty G △ ( q ) isassumed to be stable and finally, the multiplicative un-certainty configuration is used to take into account boththe nominal model and the uncertainty, and is simulatedwith an impulsive input to get a zero-mean Gaussian7 ( t + 1) = Az ( t ) + B δ ( t ) + Bl ( t )_ z ( t ) = Az ( t ) + B δ ( t ) + Bl ( t ) g ( t ) = Cz ( t ) + D δ ( t ) + Dl ( t ) l ( t ) b ( t ) w ( t ) g ( t ) δ ( t ) z (0) ∼ N (0 ; Q ) Fig. 3. The block diagram for the SI kernel (24). process to model the impulse response g ( t ). This idealeads to the system theory method to design kernels andwe call such kernels the simulation induced (SI) kernels. Remark 4.1
It is worth to note that the problem of mar-rying system identification and robust control is not newand has been studied over two decades in the system iden-tification community, see e.g., [33,34]. Both the addi-tive uncertainty configuration and the multiplicative un-certainty configuration have been used and many resultshave been obtained, see e.g., [35,33,34,36,37,38]. In par-ticular, [35,33,36,37,38] used the nominal model G ( q ) and the uncertainty G △ ( q ) to represent a model estimateand the corresponding model error, respectively. Here thenominal model G ( q ) and the uncertainty G △ ( q ) are usedto embed the prior knowledge of the LTI system and de-scribe the corresponding uncertainty, respectively.4.2 Simulation Induced Kernel More specifically, the prior knowledge is embedded in G ( q ) or equivalently, in its state-space model z ( t + 1) = Az ( t ) + B ¯ u ( t ) (22a) G ( q ) : ˙ z ( t ) = Az ( t ) + B ¯ u ( t ) (22b) z (0) ∼ N (0 , Q ) (22c) g ( t ) = Cz ( t ) + D ¯ u ( t ) (22d)where A, B, C, D , and Q have compatible dimensions,¯ u ( t ) and g ( t ) are the input and output of G ( q ), re-spectively. In what follows, we will use the quintuple( A, B, C, D, Q ) to represent the state-space model (22).Since the uncertainty G △ ( q ) is stable, the simplest wayto model the impulse response of G △ ( q ) with a stochasticprocess l ( t ), is to have G △ ( q ) : l ( t ) = b ( t ) w ( t ) , t ∈ X, (23)where w ( t ) is a white Gaussian noise with zero mean andunit variance (independent of z (0)), and b ( t ) > b ( t ) ∈ ℓ or L for the DT and CT case, respectively.Clearly, l ( t ) is a zero mean Gaussian process with theAMLS kernel b ( t ) b ( s ) δ ( t − s ) as the covariance function,where δ ( · ) is the Dirac Delta function and the AMLSkernel b ( t ) b ( s ) δ ( t − s ) is stable by Proposition 3.3.Finally, we simulate the system G ( q ) in Fig. 2 with theimpulsive input δ ( t ) and noting (22), (23) and ¯ u ( t ) = δ ( t )+ l ( t ), we obtain the simulation induced (SI) kernel: z ( t + 1) = Az ( t ) + Bδ ( t ) + Bb ( t ) w ( t ) , (24a)˙ z ( t ) = Az ( t ) + Bδ ( t ) + Bb ( t ) w ( t ) , (24b) g ( t ) = Cz ( t ) + Dδ ( t ) + Db ( t ) w ( t ) , (24c) z (0) ∼ N (0 , Q ) (24d) k SI ( t, s ) = C ov ( g ( t ) , g ( s )) , (24e)whose block diagram is shown in Fig. 3. From linearsystem theory, e.g., [39], the formal expression of the SIkernel (24) is available. For the DT case, it is k SI ( t, s ) = CA t Q ( A s ) T C T + D b ( t ) b ( s ) δ ( t − s )+ Db ( t ) s − X k =0 δ ( t − k ) b ( k ) B T ( A s − − k ) T C T + Db ( s ) t − X k =0 δ ( s − k ) b ( k ) B T ( A t − − k ) T C T + min { t,s }− X k =0 b ( k ) CA t − − k BB T ( A s − − k ) T C T (25a)For the CT case with D = 0, it is k SI ( t, s ) = Ce At Q ( e As ) T C T + C Z min { t,s } b ( τ ) e A ( t − τ ) BB T ( e A ( s − τ ) ) T dτ C T (25b) Remark 4.2
It should be noted that both DT and CTcases are considered in (22) and (24). For the CT case,(24b) could be more rigorously written as an Itˆo stochas-tic differential equation (SDE) as in [14, eq. (9)]. How-ever, we decide to take the same point of view as in [40] touse (24b) instead in order to save the space for the intro-duction of SDE stuff, which is only used to derive (25b).
Remark 4.3
The SI kernels (25) may or may not haveclosed form expressions. The related computational dif-ficulty and cost depend on whether or not the SI kernels(25) have closed form expressions. If they do, then thecomputation would be easier and similar to the SS andDC kernels. If they do not, the computation of the hyper-parameter estimate and the regularized impulse responseestimate would become more demanding. In this regard,the particle filtering based technique for nonlinear state-space model identification in [41] could be adopted. How-ever, the technique in [41] cannot be applied trivially asthe measurement output (1) for the state space model (24)does not depend on the current state z ( t ) solely but alsothe past state due to the presence of convolution. Moredetails will be reported in an independent paper. Note that g ( t ) in (24) is a Gaussian process with theimpulse response of the LTI system ( A, B, C, D, Q ) asits mean and the SI kernel k SI ( t, s ) as its covariance8unction. If δ ( t ) is set to 0 in (24), g ( t ) in (24) is a zeromean Gaussian process with the SI kernel k SI ( t, s ) as itscovariance function. In what follows, when consideringkernel design, we will set δ ( t ) to zero for convenience.Interestingly, the AMLS kernel (13) is closely relatedwith the SI kernel (24), and in fact many AMLS kernelssuch as the SS and DC kernels can be put in the form of(24) and are thus SI kernels. To state this result, recallthat the power spectral density of k c ( t − s ), denoted byΨ( ω ), is defined asΨ( ω ) = ( P + ∞ τ = −∞ k c ( τ ) e − iωτ , DT , R + ∞−∞ k c ( τ ) e − iωτ dτ, CT . (26) Proposition 4.1
Consider the AMLS kernel (13). As-sume that k d ( t, s ) = cλ t + s with c ≥ and ≤ λ < and Ψ( ω ) is a proper rational function of e iω or cos( ω ) forthe DT case, and ω for the CT case, respectively. Thenthe AMLS kernel (13) can be put in the form of (24) andthus is SI kernel. Proposition 4.2
Consider the SS kernel (15) and theDC kernel (6). Then the following results hold: • For the DT case, let ¯ a = q λ − p λ + λ , ¯ b = q λ + p λ + λ . Then the SS kernel is in the form of (24) with A = λ " λ λ , B = " λ λ , Q = c " − λ − λ − λ − λ C = r (1 − λ ) h ¯ a +¯ bλ − λ − λ (¯ a +¯ bλ )1 − λ i D = ¯ b r (1 − λ ) , b ( t ) = (cid:16) c (cid:17) λ t . (27) The DC kernel is in the form of (24) with A = λ ρ, B = λ , C = ρ (1 − ρ ) , Q = c − ρ ,D = (1 − ρ ) , b ( t ) = c λ t . (28) • For the CT case, let λ = exp( − β/ and ρ = exp( − α ) .Then the SS kernel is in the form of (24) with A = " − β − β − β , B = " , Q = c " β β C = h β i , D = 0 , b ( t ) = (cid:16) c (cid:17) e − βt . (29) The DC kernel is in the form of (24) with A = − α − β, B = 1 , C = (2 α ) , Q = c α ,D = 0 , b ( t ) = c e − β t . (30)Proposition 4.2 enhances our understanding about theSS and DC kernels and in particular, the prior knowl-edge embedded in the two kernels from a system theoryperspective is as follows:1) For the SS kernel, the corresponding nominal model G ( q ) in Fig. 2 is a second order system. For the DCkernel, the corresponding nominal model G ( q ) is afirst order system. For both the SS and DC kernelsand for the CT case, G ( q ) has negative real polescorresponding to overdamped impulse responses. Forthe DT case, the two poles for the SS kernel are realpositive and correspond to impulse response withoutoscillation, and the pole for the DC kernel can be pos-itive or negative (depending on ρ ) and corresponds toimpulse response without or with oscillation. The polefor the TC kernel is positive (equal to λ ) and corre-sponds to impulse response without oscillation.2) For both the SS and DC kernels, the decay rate of theimpulse response of the uncertainty G △ ( q ) in Fig. 2are all described by exponential decay functions. Remark 4.4
It is worth to note that to study the linkbetween a kernel and its state-space model realization hasa long history, see e.g., [42,40] . This link is importantbecause it opens the body of a kernel from a system the-ory perspective and accordingly helps to understand theunderlying behavior of a kernel.4.3 Stability of SI kernels
For the SI kernel (25), we provide the following sufficientcondition to guarantee its stability.
Proposition 4.3
Consider the SI kernel (25). • For the DT case, assume that A has distinct eigenval-ues and moreover, A is stable, i.e., A has all eigenval-ues strictly inside the unit circle. Assume further that b ( t ) ≤ ¯ c | ¯ λ | t (31a) for some ¯ c > , where ¯ λ is the eigenvalue of A with thelargest modulus. Then the SI kernel (25a) is stable. • For the CT case, assume that A has distinct eigenval-ues and moreover A is stable, i.e., A has all eigenvalueswith strictly negative real parts. Assume further that b ( t ) ≤ ¯ ce − | Re(¯ λ ) | t (31b)9 or some ¯ c > , where ¯ λ is the eigenvalue of A with thelargest real part and Re(¯ λ ) is the real part of ¯ λ . Thenthe SI kernel (25b) is stable. Remark 4.5
The conditions (31) are sufficient but notnecessary. For instance, consider the DT case. The SSkernel (27) satisfies (31a) but the DC kernel (28) doesnot. It is possible to derive a slower upper bound for b ( t ) .Another intension to have Proposition 4.3 is to give someguidelines about the relations between b ( t ) , A and thestability of SI kernel: in order to guarantee stability of theSI kernel, the impulse response of the uncertainty G △ ( q ) (23), i.e., b ( t ) should decay a bit faster than the slowestmode of the nominal model G ( q ) .4.4 Maximum Entropy Property of SI Kernels Since the prior knowledge is never complete, it is worthto note Jaynes’s maximum entropy (MaxEnt) rationale[43] to derive from incomplete prior knowledge the opti-mal statistical prior distribution. Jaynes’s idea is to for-mulate a MaxEnt problem with respect to the prior, andthen solve the problem to obtain the optimal prior. Theconstraints of the problem describes the prior knowl-edge (in the MaxEnt sense) of the underlying stochasticprocess (the system) to be identified. Interestingly, theSI kernel lends itself easily to a MaxEnt interpretation,leading to a new facet to understand the underlying be-havior of the SI kernel.Recall that the differential entropy H ( X ) of a real-valuedcontinuous random variable X is defined as H ( X ) = − R S p ( x ) log p ( x ) dx , where p ( x ) is the probability den-sity function of X and S is the support set of X . Thenwe prove the next MaxEnt interpretation of SI kernels . Proposition 4.4
Consider the SI kernel (24) with(25a). • For the case D = 0 , define f ( t ) = ¯ g ( t ) − CA t ¯ z (0) − P t − k =0 CA t − − k Bb ( k ) f ( k ) Db ( t )¯ g ( t ) ∈ R , ¯ z (0) ∈ R n , t = 0 , , · · · , s (32) Then for any s ∈ N , the Gaussian process g ( t ) in (24)is the solution to the MaxEnt problem maximize ¯ z (0) , ¯ g ( t ) H (¯ z (0) , ¯ g (0) , ¯ g (1) , · · · , ¯ g ( s )) (33a) subject to E (¯ z (0)) = 0 , E (¯ g ( t )) = 0 , C ov (¯ z (0)) = Q, V ( f ( t )) = 1 , t = 0 , · · · , s (33b) Let X , X be two jointly distributed random variables.Then the joint differential entropy of H ([ X , X ] T ) is simplywritten as H ( X , X ) below. where C ov (¯ z (0)) is the covariance matrix of z (0) and V ( f ( t )) is the variance of f ( t ) . • For the case D = 0 , assume that CB = 0 and define f ( t −
1) = ¯ g ( t ) − CA t ¯ z (0) − P t − k =0 CA t − − k Bb ( k ) f ( k ) CBb ( t − g ( t ) ∈ R , ¯ z (0) ∈ R n , t = 0 , · · · , s − Then for any s ∈ N , the Gaussian process g ( t ) in (24)is the solution to the MaxEnt problem maximize ¯ z (0) , ¯ g ( t ) H (¯ z (0) , ¯ g (1) , · · · , ¯ g ( s )) (35a) subject to E (¯ z (0)) = 0 , E (¯ g ( t )) = 0 , C ov (¯ z (0)) = Q, V ( f ( t )) = 1 , t = 0 , · · · , s − Corollary 4.1
For any s ∈ N , the zero mean Gaussianprocess g ( t ) with the DC kernel (6) defined on N × N asits covariance is the solution to the MaxEnt problem maximize ¯ g ( t ) H (¯ g (0) , ¯ g (1) , · · · , ¯ g ( s )) subject to E (¯ g ( t )) = 0 , t = 0 , · · · , s, V (¯ g (0)) = c, V (¯ g ( t ) − λ / ρ ¯ g ( t − c (1 − ρ ) λ t , t = 1 , · · · , s (36) Remark 4.6
When ρ = λ / , the DC kernel (6) be-comes the TC kernel (7). For the TC kernel defined on N × N , (36) becomes maximize ¯ g ( t ) H (¯ g (0) , ¯ g (1) , · · · , ¯ g ( s )) subject to E (¯ g ( t )) = 0 , t = 0 , · · · , s, V (¯ g (0)) = c, V (¯ g ( t ) − λ ¯ g ( t − c (1 − λ ) λ t , t = 1 , · · · , s (37) which is different from the MaxEnt interpretation in [17].It shall be noted that both interpretation are correct butderived in different ways: the MaxEnt problems are dif-ferent but have the same optimal solution. Remark 4.7
By using Corollary 4.1 and the trick in[17, Theorem 2], it is possible to derive a more conciseproof for [18, Proposition IV.1] which shows that, the factthat the DC kernel has tridiagonal inverse can be givena MaxEnt covariance completion interpretation.
Remark 4.8
The CT case is not discussed here becauseaccording to our best knowledge the entropy is not welldefined for CT stochastic processes.4.5 Markov Property of SI Kernels
As shown in [44,19,18,17], the kernel matrix of DC kernel(6) has tridiagonal inverse. Here, we further show that10he Gaussian process with the DC kernel as its covari-ance function is also a Markov process with order 1 andmoreover, we are able to design SI kernels which corre-spond to more general Gaussian Markov processes andhave banded inverses of their kernel matrices.First, recall from e.g., [27, Appendix B] that a GaussianMarkov process is a stochastic process that is both aGaussian process and a Markov process. A well-knowninstance is the DT autoregressive process of order p : x ( t + 1) = p − X k =0 a ( t, k ) x ( t − k ) + b ( t + 1) w ( t + 1) (38a)or, x ( t + 1) = p − X k =0 a ( t, k ) x ( t − k ) + b ( t ) w ( t ) (38b)where t ∈ N , x ( t ) , a ( t, k ) , b ( t ) ∈ R , x (0) is Gaussiandistributed and assumed to be independent of w ( t ), azero mean white Gaussian noise with unit variance. Thestochastic process (38) is a Markov process with order p since x ( t + p + 1) only depends on x ( t + p ) , · · · , x ( t + 1)given the history x ( s ) with s ≤ t + p . Proposition 4.5
Consider a zero mean DT Gaussianprocess g ( t ) with the DC kernel (6) as its covariance.Then g ( t ) with t ∈ N can be put in the form of g ( t + 1) = λ ρg ( t ) + (1 − ρ ) c λ t +12 w ( t + 1) (39) and thus a Markov process with order 1, and moreover,its kernel matrix has a -band matrix inverse. Remark 4.9
Interestingly, (39) can also be derived from(36), i.e., from the MaxEnt property of the DC kernel.
It is possible to construct more general SI kernels withMarkov property.
Proposition 4.6
Consider the DT SI kernel (24) with ( A, B, C, D ) being a realization of G ( q ) which is an n thorder DT system G ( q ) = q n − ¯ bq n + a q n − + · · · + a n , (40) where ¯ b, a , . . . , a n ∈ R . Then the Gaussian process g ( t ) in (24) with any b ( t ) > and positive semidefinite Q isalso a Markov process with order n and thus the SI kernelhas an n -band matrix inverse. For illustration, we consider an example. A real symmetric matrix A with dimension n > m + 1 iscalled an m − band matrix if A i,j = 0 for | i − j | > m . Fig. 4. Scaled image of K − in Example 4.1. When generating K , we choose ¯ b = 1, a = 0 . a = 0 . b ( t ) = 0 . t , and Q = I . The image is drawn by using imagesc in MATLAB,where the colder the color the smaller the element of K − . Example 4.1
Consider the DT SI kernel (24) with ( A, B, C, D ) being a realization of G ( q ) , which is a ndorder DT system having two stable real poles, i.e., G ( q ) = ¯ bq ( q + a )( q + a ) , (41) where ¯ b ∈ R and | a i | < , i = 1 , . We consider theinverse of the kernel matrix K defined by K i,j = k SI ( i, j ) with i, j = 1 , . . . , . By Proposition 4.6, K − should bea 2-band matrix, which is confirmed by Fig. 4. Remark 4.10
The Markov property of a kernel and itsassociated special structure (the banded inverse of thekernel matrix) can be used to develop numerically morestable and efficient implementations for this kernel basedregularization method, see e.g., [18, Section 5].
Remark 4.11
The CT case is not discussed here be-cause according to [27, p. 217], regular sampling of aCT Gaussian Markov process entropy in general wouldnot lead to a DT Gaussian Markov process. That is tosay, even if a CT SI kernel with Markov property is con-structed, its corresponding kernel matrix evaluated on thesampling instants may not have banded matrix inverse.
In this section, we consider the estimation of impulseresponses with damped oscillation and demonstrate howto design kernels from the proposed two perspectives.
As shown in Section 3, the machine learning perspectivetreats the impulse response as a function, and designsAMLS kernels with the rank-1 kernel and the stationarykernel to account for the decay and varying rate of theimpulse response, respectively. Now we show that byfurther exploiting the rank-1 kernel or the stationarykernel, we can design AMLS kernels capable to embed11
20 40 60 80 100-5050 20 40 60 80 100-5050 20 40 60 80 100-5050 20 40 60 80 100-2020 20 40 60 80 100-202 0 20 40 60 80 100-2020 20 40 60 80 100-1010 20 40 60 80 100-2020 20 40 60 80 100-2020 20 40 60 80 100-202
Fig. 5. Realizations of DT zero mean Gaussian process withAMLS kernels (43). For each row, the left panel shows therealizations of the DT zero mean Gaussian process with theIS kernel k c ( t − s ) as covariance function and the right panelshows the realization of the DT zero mean Gaussian pro-cess with the AMLS kernel as covariance function. The topthree rows correspond to the AMLS kernel (43b) with c = 1, λ = 0 . , ω = 0 . π and ρ = 0 . , , − .
99, respectively. Thebottom two rows correspond to the AMLS kernel (43a) with c = 1, λ = 0 . and α = π, . π , respectively. the extra prior knowledge that the impulse response hasdamped oscillation.To have oscillation behavior in the AMLS kernel, we canchoose to have it either in the stationary kernel or in therank-1 kernel, i.e., b ( t ). For example, we can choose b ( t )as an exponential decay function, i.e., b ( t ) = cλ t andthen choose (18b) as the stationary kernel, because (18b)has oscillation behavior. Or we can choose the stationarykernel in (17) and b ( t ) as an exponential decay functionwith oscillation, i.e., b ( t ) = c λ t [cos( ωt ) + 1 + ǫ ] , (42)where c >
0, 0 ≤ λ < ω ≥
0, and ǫ > b ( t ) > t ∈ X . The aboveidea leads to the following two kernels, respectively: k AMLS-2Os ( t, s ) = cλ t + s cos( α | t − s | ) , (43a) k AMLS-2Od ( t, s ) = cλ t + s [cos( ωt ) + 1 + ǫ ] × [cos( ωs ) + 1 + ǫ ] ρ | t − s | , (43b)Fig. 5 illustrates that the AMLS kernels (43) are capableto describe functions with damped oscillation behavior. As shown in Section 4, the system theory perspectiveassociates the impulse response with an LTI system, anddesigns SI kernels with the nominal model to embed theprior knowledge on the LTI system. Now we show thatby choosing the nominal model to be a second orderLTI system with a pair of complex conjugate poles, wedesign a SI kernel capable to model impulse responseswith damped oscillation.More specifically, we choose the transfer function of thenominal model G ( q ) in Fig. 2 to be G ( s ) = 1 s + 2 w ξs + w (44)where ω > ≤ ξ <
1. By setting α = w ξ and β = w p − ξ , a state space model for (44) is describedby ( A, B, C, D ) with A = " − α − β − α , B = " , C = h i , D = 0 . Finally, setting Q = I and b ( t ) = e − γt with γ > k SI-2Od ( t, s ) = e − α ( t + s ) [cos( βt ) cos( βs )+ αβ sin( β ( t + s )) + α + 1 β sin( βt ) sin( βs )]+ e − α ( t + s ) cos( β ( t − s ))( e α − γ ) min { t,s } − β ( α − γ )+ e − α ( t + s ) β p β + 4( α − γ ) × [cos( φ + β ( t + s )) − e α − γ ) min { t,s } cos(2 β min { t, s } − φ − β ( t + s ))] ,φ = arccos (cid:18) α − γ )4 β + 4( α − γ ) (cid:19) . (45) To illustrate that the AMLS kernls (43) and the SI ker-nel (45) are capable to model LTI systems with strongoscillation, we consider the following numerical example.
The way in [45] is used to generate the test systemsand data bank. In particular, we first generate 200 testsystems with strong oscillation: G ( q ) = q + 0 . q N r +1 X i =1 G i ( q ) (46)12here G i ( q ) = K i q + 0 . q − p i )( q − p ∗ i ) , i = 1 , . . . , N r , and G N r +1 ( q ) is a 4th order system randomly generatedby the function drmodel in MATLAB with its poles in-side the disk of radius 0.95. The parameters N r , p i , K i are randomly generated as follows: N r ∼ U [3 , K i ∼U [2 , p i = ρ i e j [ φ + π Nr ( i − with φ ∼ U [0 , π ] and ρ i ∼ U [0 . , . idinput in MATLABis used to generate a random Gaussian input u ( t ) withnormalized band [0 , .
95] and length 210. The chosenDT system is simulated with u ( t ) to get the noise-freeoutput G ( q ) u ( t ), and G ( q ) u ( t ) is then perturbed by awhite Gaussian noise, whose standard deviation is equalto U [0 . ,
1] times the standard deviation of G ( q ) u ( t ).In this way, we generate 200 test DT systems and foreach system a data set with 210 data points. The following model fit is used to measure how good theestimated impulse response is: fit = 100 − " P k =1 | g k − ˆ g k | P k =1 | g k − ¯ g | , ¯ g = 1100 X k =1 g k where g k and ˆ g k are the true impulse response and itsestimate at the k th time instant, respectively.The AMLS kernels (43) and the SI kernel (45) are com-pared with the TC kernel (7), the DC kernel (6) and theSS kernel (15) enriched with a second order parametricpart proposed in [6] and denoted by SSp below.The simulation result is summarized below. The averagemodel fits for the tested kernels are shown below: TC DC SSp AMLS-2Os AMLS-2Od SI-2OdAvg. Fit 47.5 50.0 50.9 48.4 49.7 53.3
The distribution of the model fits are shown in Fig. 6.
For the test systems and data bank, we have the fol-lowing findings. The SI kernel (45) is best among allthe tested kernels in both the average accuracy and ro-bustness and thus is best to model systems with strongoscillation. The AMLS kernel (43b) works well as it is
Fig. 6. Boxplot of the model fits for the TC, DC, SSp,AMLS-2Os, AMLS-2Od and SI-2Od kernels [from left toright]. The six kernels have 5, 3, 3, 3, 4, and 3 fits smallerthan zero, respectively, which are not shown in the boxplotfor better display. better than the TC kernel in both the average accuracyand robustness and it is just a bit worse than the DCkernel in the average accuracy. The AMLS kernel (43a)works not very well, though it has better average accu-racy than the TC kernel. One possible explanation whythe SI kernel (45) is best to model impulse response withstrong oscillation is that the prior knowledge embeddedin the SI kernel (45) is closest to the truth. This may betrue because it makes use of the prior knowledge from asystem theory perspective directly by choosing the nom-inal model to be a second order system with a pair ofcomplex conjugate poles.
Remark 5.1
To embed the prior knowledge such as mul-tiple distinct time constants and multiple resonant fre-quencies, we proposed to use the multiple kernel intro-duced in [7] where the multiple kernel is a conic combi-nation of some fixed kernels. The fixed kernels can be in-stances of the SS, TC, DC and SI kernels (24) evaluatedon a grid in their hyper-parameter space. The interestedreaders are referred to [7] for more details.
Kernel methods or regularization methods to estimatethe impulse response of a linear time invariant sys-tems is a most useful relatively recent addition to thetechniques of system identification. Earlier results havedemonstrated that important improvements in estima-tion quality can be achieved with kernels devised in amore or less ad hoc way. In this contribution, we havefocused on systematic mechanisms and design conceptsfor how to construct kernels that are capable of adjustingits hyperparameters to capture the unknown system’sproperties in useful ways. The two main avenues tosuch thinking have been a machine learning perspective focusing on function properties of the impulse responseand a system theory perspective focusing on the LTI sys-tem that produces the impulse response. This has lead13o understanding and insights into general propertiesof the earlier suggested methods, e.g. that the so calledDC and SS kernels (derived from more ad hoc thinking)belong to the class of amplitude modulated locally sta-tionary kernels (Proposition 3.2) and that they are sim-ulation induced from certain LTI systems (Proposition4.2). They are also related to maximum entropy optimalchoices (Proposition 4.4) which is a valuable feature.The take-home message of this contribution is as follows.The issue of kernel design should relate to the type of theprior knowledge and different types of prior knowledgeshould lead to different ways to design the correspond-ing kernels. Here, a machine learning perspective and asystem theory perspective are introduced accordingly: • Machine Learning perspective: If the impulse responseis treated as a function and the prior knowledge is onits decay and varying rate, then we can design the am-plitude modulated locally stationary (AMLS) kernel.In particular, we design a rank-1 kernel and a station-ary kernel to account for the decay and varying rate ofthe impulse response, respectively. Moreover, by fur-ther exploiting the rank-1 kernel or the stationary ker-nel, it is possible to design AMLS kernels capable toembed more general prior knowledge. • System Theory perspective: If the impulse is associ-ated with an LTI system and the prior knowledge isthat the LTI system is stable and may be overdamped,underdamped, have multiple distinct time constantsand resonant frequencies and etc., then we can designthe simulation induced (SI) kernel. In particular, thenominal model is used to embed the prior knowledge,the uncertainty is assumed to be stable and finally,the multiplicative uncertainty configuration is used totake into account both the nominal model and the un-certainty, and is simulated with an impulsive input toget the SI kernel or equivalently the zero-mean Gaus-sian process to model the impulse response.Finally, finding a suitable kernel structure is only oneleg of the kernel-based regularization method. Tuningits hyperparameters regardless of structure is the othermain topic. This has been discussed in some detail e.g.in [4] and [8]. Some related new asymptotic results arerecently presented in [9].
Acknowledgements
The author would like to thank Prof. Lennart Ljung for his suggestions and comments at the early stage of this project, and Prof. Gianluigi Pillonetto, Prof. Alessandro Chiuso, Prof. Håkan Hjalmarsson, Dr. John Lataire and Dr. Giulio Bottegal for their helpful comments.
References

[1] L. Ljung. System Identification - Theory for the User. Prentice-Hall, Upper Saddle River, N.J., 2nd edition, 1999.
[2] T. Söderström and P. Stoica. System Identification. Prentice-Hall Int., London, 1989.
[3] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - Revisited. Automatica, 48:1525–1535, 2012.
[4] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: A survey. Automatica, 50(3):657–682, 2014.
[5] G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, 2010.
[6] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica, 47(2):291–305, 2011.
[7] T. Chen, M. S. Andersen, L. Ljung, A. Chiuso, and G. Pillonetto. System identification via sparse multiple kernel-based regularization using sequential convex optimization techniques. IEEE Transactions on Automatic Control, 59(11):2933–2945, 2014.
[8] G. Pillonetto and A. Chiuso. Tuning complexity in regularized kernel-based regression and linear system identification: The robustness of the marginal likelihood estimator. Automatica, 58:106–117, 2015.
[9] B. Mu, T. Chen, and L. Ljung. Tuning of hyperparameters for FIR models – an asymptotic theory. In Proceedings of the IFAC 2017 World Congress, Toulouse, France, 2017. Under review.
[10] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. Winston/Wiley, Washington, D.C., 1977.
[11] J. Sjöberg, T. McKelvey, and L. Ljung. On the use of regularization in system identification. In Proceedings of the 12th IFAC World Congress, pages 381–386, Sydney, Australia, 1993.
[12] A. Chiuso. Regularization and Bayesian learning in dynamical systems: Past, present and future. Annual Reviews in Control, 41:24–38, 2016.
[13] F. Dinuzzo. Kernels for linear time invariant system identification. SIAM Journal on Control and Optimization, 53(5):3299–3317, 2015.
[14] T. Chen and L. Ljung. Constructive state-space model induced kernels for regularized system identification. In Proceedings of the 19th IFAC World Congress, pages 1047–1052, Cape Town, South Africa, 2014.
[15] T. Chen and L. Ljung. On kernel structure for regularized system identification (I): a machine learning perspective. In Proceedings of the IFAC Symposium on System Identification, pages 1035–1040, Beijing, China, 2015.
[16] T. Chen and L. Ljung. On kernel structure for regularized system identification (II): a system theory perspective. In Proceedings of the IFAC Symposium on System Identification, pages 1041–1046, Beijing, China, 2015.
[17] T. Chen, T. Ardeshiri, F. P. Carli, A. Chiuso, L. Ljung, and G. Pillonetto. Maximum entropy properties of discrete-time first-order stable spline kernel. Automatica, 66:34–38, 2016.
[18] F. P. Carli, T. Chen, and L. Ljung. Maximum entropy kernels for system identification. IEEE Transactions on Automatic Control, 2017.
[19] A. Marconato, M. Schoukens, and J. Schoukens. Filter-based regularisation for impulse response modelling. IET Control Theory & Applications, to appear, 2016.
[20] K. Zhou, J. C. Doyle, and K. Glover. Robust and Optimal Control. Prentice-Hall, Englewood Cliffs, NJ, 1996.
[21] T. Chen and L. Ljung. Implementation of algorithms for tuning parameters in regularized least squares problems in system identification. Automatica, 49:2213–2220, 2013.
[22] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, pages 337–404, 1950.
[23] G. Pillonetto and A. Chiuso. Tuning complexity in regularized kernel-based regression and linear system identification: The robustness of the marginal likelihood estimator. Automatica, 58:106–117, 2015.
[24] C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(4):377–408, 2006.
[25] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, New York, 2008.
[26] R. Silverman. Locally stationary random processes. IRE Transactions on Information Theory, 3(3):182–187, 1957.
[27] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[28] A. M. Yaglom. Correlation Theory of Stationary and Related Random Functions. Springer, 1987.
[29] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, Series A, pages 415–446, 1909.
[30] H. Hochstadt. Integral Equations. John Wiley & Sons, 1973.
[31] H. Sun. Mercer theorem for RKHS on noncompact sets. Journal of Complexity, 21(3):337–349, 2005.
[32] M. Loève. Probability Theory, Vol. II. Graduate Texts in Mathematics, Springer, 1978.
[33] B. Ninness and G. C. Goodwin. Estimation of model quality. Automatica, 31(12):1771–1797, 1995.
[34] L. Ljung. Model validation and model error modeling. In B. Wittenmark and A. Rantzer, editors, The Åström Symposium on Control, pages 15–42, Lund, Sweden, August 1999. Studentlitteratur.
[35] G. C. Goodwin, M. Gevers, and B. Ninness. Quantifying the error in estimated transfer functions with application to model order selection. IEEE Transactions on Automatic Control, 37(7):913–929, 1992.
[36] G. C. Goodwin, J. H. Braslavsky, and M. M. Seron. Non-stationary stochastic embedding for transfer function estimation. Automatica, 38:47–62, 2002.
[37] W. Reinelt, A. Garulli, and L. Ljung. Comparing different approaches to model error modeling in robust identification. Automatica, 38(5):787–803, 2002.
[38] Model Error Modeling and Stochastic Embedding. In Proceedings of the IFAC Symposium on System Identification, volume 48, Beijing, China, 2015.
[39] C. Chen. Linear System Theory and Design. Oxford University Press, New York, 3rd edition, 1999.
[40] T. Glad and L. Ljung. Control Theory: Multivariable and Nonlinear Methods. Taylor & Francis, 2000.
[41] T. B. Schön, A. Wills, and B. Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39–49, January 2011.
[42] K. J. Åström. Introduction to Stochastic Control Theory. Academic Press, New York and London, 1970.
[43] E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939–952, 1982.
[44] F. P. Carli. On the maximum entropy property of the first-order stable spline kernel and its implications. In IEEE Multi-Conference on Systems and Control, pages 409–414, Nice/Antibes, France, 2014.
[45] A. Chiuso, T. Chen, L. Ljung, and G. Pillonetto. On the design of multiple kernels for nonparametric linear system identification. In Proceedings of the IEEE Conference on Decision and Control, pages 3346–3351, Los Angeles, CA, 2014.
[46] M. G. Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2:299–312, 2002.
[47] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.
[48] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.
[49] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[50] A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972.
Proof of Proposition 3.1
Part (a). For any $n \in \mathbb{N}$ and any $t_i \in X$, $i = 1, \ldots, n$, let $\bar{b}$ be the column vector containing $b(t_1), \ldots, b(t_n)$ in order. Then the kernel matrix $K_d$ defined by $(K_d)_{i,j} = k_d(t_i, t_j)$ can be rewritten as $K_d = \bar{b}\bar{b}^T$. Clearly, $K_d$ is positive semidefinite and, moreover, $\mathrm{rank}(K_d) = 1$. So $k_d(t,s)$ is a rank-1 kernel by definition. The identity (14), Part (b) and Part (c) can be verified in a straightforward way.
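As a quick numerical companion to Part (a) (not part of the original proof), the sketch below forms $K_d = \bar{b}\bar{b}^T$ for an illustrative decay function $b(t) = \lambda^t$ and confirms positive semidefiniteness and rank one.

```python
import numpy as np

lam = 0.8
t = np.arange(1, 21)
b_bar = lam**t                       # column vector (b(t_1), ..., b(t_n))
K_d = np.outer(b_bar, b_bar)         # K_d = b_bar b_bar^T

eigs = np.linalg.eigvalsh(K_d)
print(eigs.min() >= -1e-12)          # positive semidefinite
print(np.linalg.matrix_rank(K_d))    # 1
```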
Proof of Proposition 3.2
We give the proof for the CT case with $X = \{t \mid t \geq 0\}$. The results for the DT case with $X = \mathbb{N}$ hold by noting that the DT kernel is the CT kernel restricted to $\mathbb{N}$.

From (16) and (17), we see that the proof will be done if it can be shown that the kernels $k_c(t-s)$ in (16) and (17) are stationary kernels. To show this, note from e.g. [27, page 85] that $e^{-\beta|t-s|}$ with $\beta > 0$ is a stationary kernel, which takes care of $k_c(t-s)$ in (17). So it remains to show that $k_c(t-s)$ in (16) is also a kernel. Note that the exponential kernel is an isotropic stationary (IS) kernel (see Section 3.2.1 for its definition). Below we show that $k_c(t-s)$ in (16) is actually also an IS kernel. As shown in [46], the spectral representation of an IS kernel $k_c(t,s)$ takes the form
$$k_c(t,s) = \int_0^{\infty} \cos(\omega|t-s|)\, F(d\omega), \quad \text{(A.1)}$$
where $F$ is any nondecreasing bounded function. For the exponential kernel $e^{-\beta|t-s|}$ with $\beta > 0$, (A.1) is satisfied with the spectral density $f(\omega) = \frac{\beta}{\pi(\beta^2+\omega^2)}$. So $k_c(t-s)$ in (16) with $\lambda = \exp(-\beta/2)$ satisfies (A.1) with the well-defined spectral density
$$\frac{3}{2}\,\frac{\beta/2}{\pi\big((\beta/2)^2+\omega^2\big)} - \frac{1}{2}\,\frac{3\beta/2}{\pi\big((3\beta/2)^2+\omega^2\big)} = \frac{3\beta}{4\pi}\Big(\frac{1}{(\beta/2)^2+\omega^2} - \frac{1}{(3\beta/2)^2+\omega^2}\Big) \geq 0, \quad \forall\, \omega.$$
Therefore, $k_c(t-s)$ in (16) is an IS kernel. This completes the proof.
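The following numerical sanity check (an illustration, assuming the stationary part of the SS kernel takes the reconstructed form $k_c(\tau) = \frac{3}{2}e^{-\frac{\beta}{2}|\tau|} - \frac{1}{2}e^{-\frac{3\beta}{2}|\tau|}$) verifies that the spectral density above is nonnegative and that a sampled kernel matrix of $k_c$ is positive semidefinite.

```python
import numpy as np

beta = 0.5
kc = lambda tau: 1.5*np.exp(-beta/2*np.abs(tau)) - 0.5*np.exp(-3*beta/2*np.abs(tau))
f  = lambda w: 3*beta/(4*np.pi) * (1/((beta/2)**2 + w**2) - 1/((3*beta/2)**2 + w**2))

w = np.linspace(0, 50, 2000)
print(np.all(f(w) >= 0))                       # spectral density is nonnegative

t = np.linspace(0, 20, 200)
T, S = np.meshgrid(t, t, indexing="ij")
K = kc(T - S)
print(np.linalg.eigvalsh(K).min() >= -1e-8)    # sampled kernel matrix is PSD
```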
Proof of Proposition 3.3

We give the proof for the CT case with $X = \{t \mid t \geq 0\}$. The proof for the DT case with $X = \mathbb{N}$ can be derived in a similar way.

Part (a). Assume that $b(t) \in \mathcal{L}^1$, i.e., $\int_0^{\infty} b(t)\, dt < \infty$. Since $|k_c(t-s)| \leq 1$ for $t, s \geq 0$, we have
$$\int_0^{\infty} \Big| \int_0^{\infty} k(t,s)\, dt \Big|\, ds \leq \int_0^{\infty} b(s)\, ds \int_0^{\infty} b(t)\, dt < \infty,$$
and by Corollary 2.1, the AMLS kernel (13) is stable.

Part (b). Assume that the AMLS kernel (13) is stable, i.e., $\mathcal{H}_k \subset \mathcal{L}^1$. We need a lemma to prove the result.

Lemma A.1 [47, p. 16], [48, p. 37] Assume that $\bar{k}(t,s)$ with $t, s \in X$ is a positive semidefinite kernel and, moreover, there exists a sequence of positive numbers $\nu_i$ and linearly independent functions $\psi_i(t)$ such that $\bar{k}(t,s) = \sum_{i=1}^{\infty} \nu_i \psi_i(t)\psi_i(s)$, where the convergence is absolute and uniform on $Y_1 \times Y_2$ with $Y_1, Y_2$ being any compact subsets of $X$. Then
$$\mathcal{H}_{\bar{k}} = \Big\{ g \;\Big|\; g = \sum_{i=1}^{\infty} \mu_i \psi_i,\; \sum_{i=1}^{\infty} \frac{\mu_i^2}{\nu_i} < \infty \Big\},$$
and for any $f_1, f_2 \in \mathcal{H}_{\bar{k}}$ with $f_1 = \sum_{i=1}^{\infty} c_i \psi_i$ and $f_2 = \sum_{i=1}^{\infty} d_i \psi_i$, the inner product over $\mathcal{H}_{\bar{k}}$ is defined as $\langle f_1, f_2 \rangle_{\mathcal{H}_{\bar{k}}} = \sum_{i=1}^{\infty} \frac{c_i d_i}{\nu_i}$.

Noting (19) and by Lemma A.1, we have
$$\mathcal{H}_{k_c} = \Big\{ h \;\Big|\; h = \sum_{i=1}^{\infty} \mu_i \phi_i,\; \sum_{i=1}^{\infty} \frac{\mu_i^2}{\lambda_i} < \infty \Big\}. \quad \text{(A.2)}$$
Then from (19), the AMLS kernel (13) is rewritten as
$$k(t,s) = \sum_{i=1}^{\infty} \lambda_i \rho_i(t)\rho_i(s), \quad \rho_i(t) = b(t)\phi_i(t), \quad \text{(A.3)}$$
and by Lemma A.1, we have
$$\mathcal{H}_k = \Big\{ g \;\Big|\; g = \sum_{i=1}^{\infty} \mu_i \rho_i,\; \sum_{i=1}^{\infty} \frac{\mu_i^2}{\lambda_i} < \infty \Big\}. \quad \text{(A.4)}$$
By (A.4) and (A.2), we have
$$\mathcal{H}_k = \{ g \;|\; g(t) = b(t)h(t),\; t \geq 0,\; h \in \mathcal{H}_{k_c} \}. \quad \text{(A.5)}$$
Since $\mathcal{H}_k \subset \mathcal{L}^1$, for any $g \in \mathcal{H}_k$ we have $g \in \mathcal{L}^1$, i.e.,
$$\int_0^{\infty} |g(t)|\, dt = \int_0^{\infty} b(t)|h(t)|\, dt < \infty, \quad \text{(A.6)}$$
where the equality follows from (A.5) and $h \in \mathcal{H}_{k_c}$ is the function associated with $g \in \mathcal{H}_k$. Since $k_c$ is not stable, $\mathcal{H}_{k_c} \not\subset \mathcal{L}^1$ and there exists $h \in \mathcal{H}_{k_c}$ but $h \notin \mathcal{L}^1$. For such $h$, if there exists an $\epsilon > 0$ such that $b(t) \geq \epsilon$ for all $t \geq 0$, then
$$\int_0^{\infty} |g(t)|\, dt = \int_0^{\infty} b(t)|h(t)|\, dt \geq \epsilon \int_0^{\infty} |h(t)|\, dt = \infty,$$
which contradicts (A.6). This completes the proof.
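A finite-horizon illustration of the stability test used in Part (a), with the illustrative assumptions $b(t) = e^{-t}$ (integrable) and $k_c(\tau) = e^{-|\tau|}$ (so $|k_c| \leq 1$): the truncated double integral of $|k(t,s)|$ is bounded by $(\int_0^\infty b(t)\,dt)^2$.

```python
import numpy as np

# Truncated check of the bound used in Part (a), with illustrative choices
# b(t) = exp(-t) (integrable) and kc(tau) = exp(-|tau|) (so |kc| <= 1).
t = np.linspace(0, 30, 1500)
dt = t[1] - t[0]
b = np.exp(-t)
T, S = np.meshgrid(t, t, indexing="ij")
k = np.outer(b, b) * np.exp(-np.abs(T - S))    # k(t,s) = b(t) b(s) kc(t-s)

lhs = np.abs(k).sum() * dt**2                  # ~ double integral of |k(t,s)|
rhs = (b.sum() * dt)**2                        # (integral of b)^2
print(lhs <= rhs + 1e-9)
```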
Proof of Proposition 4.1

To prove the result, we need a lemma.
Lemma A.2 [42,40] Consider a zero mean stationary Gaussian process $h(t)$ with covariance function $k_c(t-s)$. Assume that $k_c(t-s)$ has rational power spectral density $\Psi(\omega)$ (see the statement of Theorem 4.1 for the definition of rational power spectral density). Then the following results hold:

• For the DT case, there exists a rational function $G$ which has all poles inside the unit circle and all zeros inside or on the unit circle such that
$$\Psi(\omega) = |G(e^{i\omega})|^2, \quad \text{(A.7)}$$
where $|\cdot|$ denotes the modulus of a complex number. Moreover, let the quintuple $(\bar{A}, \bar{B}, \bar{C}, \bar{D}, \bar{Q})$ be a state space realization of $G(e^{i\omega})$, i.e., $G(e^{i\omega}) = \bar{C}(e^{i\omega} I_{\dim(\bar{A})} - \bar{A})^{-1}\bar{B} + \bar{D}$, where $\dim(\bar{A})$ is the dimension of $\bar{A}$ and $I_{\dim(\bar{A})}$ is the identity matrix with dimension $\dim(\bar{A})$. Then the stationary Gaussian process $h(t)$ has the following state space representation
$$x(t+1) = \bar{A}x(t) + \bar{B}w(t), \quad x(0) \sim \mathcal{N}(0, \bar{Q}), \quad \text{(A.8a)}$$
$$h(t) = \bar{C}x(t) + \bar{D}w(t), \quad \text{(A.8b)}$$
where $w(t)$ is zero mean white Gaussian noise with unit variance and $\bar{Q}$ is the solution of the Lyapunov equation
$$\bar{Q} = \bar{A}\bar{Q}\bar{A}^T + \bar{B}\bar{B}^T. \quad \text{(A.8c)}$$

• For the CT case, there exists a rational function $G$ which has all poles in the left half plane and all zeros in the left half plane or on the imaginary axis such that
$$\Psi(\omega) = |G(i\omega)|^2. \quad \text{(A.9)}$$
Moreover, let the quintuple $(\bar{A}, \bar{B}, \bar{C}, \bar{D}, \bar{Q})$ be a state space realization of $G(i\omega)$, i.e., $G(i\omega) = \bar{C}(i\omega I_{\dim \bar{A}} - \bar{A})^{-1}\bar{B} + \bar{D}$. Then the stationary Gaussian process $h(t)$ has the following state space representation
$$\dot{x}(t) = \bar{A}x(t) + \bar{B}w(t), \quad x(0) \sim \mathcal{N}(0, \bar{Q}), \quad \text{(A.10a)}$$
$$h(t) = \bar{C}x(t) + \bar{D}w(t), \quad \text{(A.10b)}$$
where $w(t)$ is zero mean white Gaussian noise with unit power spectral density and $\bar{Q}$ is the solution of the Lyapunov equation
$$\bar{A}\bar{Q} + \bar{Q}\bar{A}^T + \bar{B}\bar{B}^T = 0. \quad \text{(A.10c)}$$
Proof: For the DT case, the first part is a result of the Spectral Factorization Theorem [42, Theorem 3.1 in Chapter 4], and the second part is a result of the Representation Theorem [42, Theorem 3.2 in Chapter 4] and [40, Theorem 5.3 and Equation (5.92)]. For the CT case, the first part is a result of the Spectral Factorization Theorem [42, Theorem 5.1 in Chapter 4], and the second part is a result of the Representation Theorem [42, Theorem 5.2 in Chapter 4] and [40, Theorem 5.3]. ♦

Now we prove the proposition. For the DT case, let $z(t) = \sqrt{c}\,\lambda^{t} x(t)$. Then a simple calculation shows that the AMLS kernel (13) with $k_d(t,s) = c\lambda^{t+s}$ is in the form of (24) with $A = \lambda\bar{A}$, $B = \lambda\bar{B}$, $C = \bar{C}$, $D = \bar{D}$, $Q = c\bar{Q}$ and $b(t) = \sqrt{c}\,\lambda^{t}$. For the CT case, let $z(t) = \sqrt{c}\,e^{-\beta t} x(t)$ with $\lambda = e^{-\beta}$. Then a simple calculation shows that the AMLS kernel (13) with $k_d(t,s) = c\lambda^{t+s}$ is in the form of (24) with $A = \bar{A} - \beta I_{\dim\bar{A}}$, $B = \bar{B}$, $C = \bar{C}$, $D = \bar{D}$, $Q = c\bar{Q}$ and $b(t) = \sqrt{c}\,e^{-\beta t}$.
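A small DT instance of Lemma A.2 (illustrative, not from the paper): for the AR(1) spectral factor with $\bar{A} = \rho$, $\bar{B} = 1$, $\bar{C} = \rho\sqrt{1-\rho^2}$, $\bar{D} = \sqrt{1-\rho^2}$, the Lyapunov solution (A.8c) renders $h(t)$ stationary with covariance $k_c(\tau) = \rho^{|\tau|}$.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Illustrative DT instance of Lemma A.2: spectral factor of kc(tau) = rho**|tau|.
rho = 0.7
A_, B_ = np.array([[rho]]), np.array([[1.0]])
C_, D_ = rho * np.sqrt(1 - rho**2), np.sqrt(1 - rho**2)

Q_ = solve_discrete_lyapunov(A_, B_ @ B_.T)    # Q = A Q A^T + B B^T  (A.8c)
assert np.isclose(Q_[0, 0], 1 / (1 - rho**2))

# Covariance of h(t) = C x(t) + D w(t): should equal rho**tau for tau >= 0.
def cov_h(tau):
    if tau == 0:
        return C_ * Q_[0, 0] * C_ + D_**2
    return C_ * rho**tau * Q_[0, 0] * C_ + C_ * rho**(tau - 1) * B_[0, 0] * D_

print([np.isclose(cov_h(tau), rho**tau) for tau in range(4)])   # all True
```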
Proof of Proposition 4.2

We only sketch the proof for the SS kernel; the proof for the DC kernel can be derived in a similar way.

For the DT case, the IS kernel $k_c(t-s)$ in (16) has the power spectral density
$$\Psi(\omega) = \sum_{\tau=-\infty}^{+\infty} k_c(\tau)e^{-i\omega\tau} = \frac{3}{2}\,\frac{1-\lambda^2}{(1-\lambda e^{-i\omega})(1-\lambda e^{i\omega})} - \frac{1}{2}\,\frac{1-\lambda^6}{(1-\lambda^3 e^{-i\omega})(1-\lambda^3 e^{i\omega})}$$
$$= \frac{(1-\lambda^2)^3\big[(1+\lambda^2) + \frac{\lambda}{2}(e^{-i\omega} + e^{i\omega})\big]}{(1-\lambda e^{-i\omega})(1-\lambda e^{i\omega})(1-\lambda^3 e^{-i\omega})(1-\lambda^3 e^{i\omega})} = \frac{(1-\lambda^2)^3(\bar{a}e^{-i\omega} + \bar{b})(\bar{a}e^{i\omega} + \bar{b})}{(1-\lambda e^{-i\omega})(1-\lambda e^{i\omega})(1-\lambda^3 e^{-i\omega})(1-\lambda^3 e^{i\omega})},$$
where $\bar{a}, \bar{b}$ are suitable constants determined by $\lambda$. By the spectral factorization technique, we have $\Psi(\omega) = |G(e^{i\omega})|^2$ with
$$G(e^{i\omega}) = \sqrt{(1-\lambda^2)^3}\,\frac{\bar{a}e^{-i\omega} + \bar{b}}{(1-\lambda e^{-i\omega})(1-\lambda^3 e^{-i\omega})}.$$

For the CT case, the IS kernel $k_c(t-s)$ has the power spectral density
$$\Psi(\omega) = \int_{-\infty}^{+\infty}\Big(\frac{3}{2}e^{-\frac{\beta}{2}|\tau|} - \frac{1}{2}e^{-\frac{3\beta}{2}|\tau|}\Big)e^{-i\omega\tau}\, d\tau = \Big|\frac{\sqrt{3\beta^3}}{(i\omega+\beta/2)(i\omega+3\beta/2)}\Big|^2 =: |G(i\omega)|^2.$$
From the realization theory of linear systems, see e.g. [39], we can derive the corresponding state-space model representation of the IS kernel $k_c(t-s)$, based on which, and by using the argument at the end of the proof of Proposition 4.1, we derive (27) and (29), respectively.
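A numerical check of the CT factorization above (under the reconstructed form of $k_c$; the constants are assumptions carried over from the previous display): $\Psi(\omega)$ computed from the two exponentials coincides with $|G(i\omega)|^2$.

```python
import numpy as np

beta = 0.6
w = np.linspace(-40, 40, 4001)

# Psi(w): Fourier transform of kc(tau) = 1.5 e^{-(beta/2)|tau|} - 0.5 e^{-(3beta/2)|tau|},
# using FT of e^{-a|tau|} = 2a / (a^2 + w^2).
Psi = 1.5 * beta / ((beta/2)**2 + w**2) - 0.5 * 3*beta / ((3*beta/2)**2 + w**2)

G = np.sqrt(3 * beta**3) / ((1j*w + beta/2) * (1j*w + 3*beta/2))
print(np.allclose(Psi, np.abs(G)**2))   # True: Psi(w) = |G(iw)|^2
```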
Proof of Proposition 4.3

For the DT case, note that there exists $l > 0$ such that
$$|CA^t Q (A^s)^T C^T| \leq l\,|\bar{\lambda}|^{t+s},$$
$$|D^2 b(t) b(s) \delta(t-s)| \leq l\,|\bar{\lambda}|^{t+s},$$
$$\Big|Db(t)\sum_{k=0}^{s-1}\delta(t-k)\, b(k)\, B^T (A^{s-1-k})^T C^T\Big| \leq l\,|\bar{\lambda}|^{t+s},$$
$$\Big|\sum_{k=0}^{\min\{t,s\}-1} b(k)^2\, C A^{t-1-k} B B^T (A^{s-1-k})^T C^T\Big| \leq l\big(|\bar{\lambda}|^{t+s} + |\bar{\lambda}|^{t+s-\min\{t,s\}}\big).$$
These bounds imply that $\sum_{t=1}^{\infty}\sum_{s=1}^{\infty} |k_{SI}(t,s)| < \infty$, and thus the SI kernel (25a) is stable by Corollary 2.1.

For the CT case, note that there exists $l > 0$ such that
$$|Ce^{At} Q (e^{As})^T C^T| \leq l\, e^{-|\mathrm{Re}(\bar{\lambda})|(t+s)},$$
$$\Big|C\int_0^{\min\{t,s\}} b(\tau)^2\, e^{A(t-\tau)} B B^T (e^{A(s-\tau)})^T d\tau\, C^T\Big| \leq l\big(e^{-|\mathrm{Re}(\bar{\lambda})|(t+s)} + e^{-|\mathrm{Re}(\bar{\lambda})|(t+s-\min\{t,s\})}\big).$$
Then it is easy to show that $\int_0^{\infty}\int_0^{\infty} |k_{SI}(t,s)|\, dt\, ds < \infty$, and thus the SI kernel (25b) is stable by Corollary 2.1.
Proof of Proposition 4.4

We only give the proof for case (a) where $D \neq 0$; the proof for case (b) is similar and thus omitted.

By the chain rule, the differential entropy in (33) becomes
$$H(\bar{z}(0), \bar{g}(0), \bar{g}(1), \cdots, \bar{g}(s)) = H(\bar{z}(0)) + H(\bar{g}(0)\,|\,\bar{z}(0)) + \sum_{t=1}^{s} H(\bar{g}(t)\,|\,\bar{z}(0), \bar{g}(0), \cdots, \bar{g}(t-1)). \quad \text{(A.11)}$$
Note that
$$H(\bar{g}(0)\,|\,\bar{z}(0)) = H(f(0)Db(0) + C\bar{z}(0)\,|\,\bar{z}(0)) = H(f(0)Db(0)\,|\,\bar{z}(0)) = H(f(0)\,|\,\bar{z}(0)) + \log|Db(0)|,$$
where we used that translation does not change the differential entropy and the scaling property of differential entropy. Analogously, we have
$$H(\bar{g}(t)\,|\,\bar{z}(0), \bar{g}(0), \cdots, \bar{g}(t-1)) = H(\bar{g}(t)\,|\,\bar{z}(0), f(0), \cdots, f(t-1)) = H(f(t)Db(t)\,|\,\bar{z}(0), f(0), \cdots, f(t-1))$$
$$= H(f(t)\,|\,\bar{z}(0), f(0), \cdots, f(t-1)) + \log|Db(t)|.$$
Therefore, (A.11) is rewritten as
$$H(\bar{z}(0), \bar{g}(0), \cdots, \bar{g}(s)) = H(\bar{z}(0)) + H(f(0)\,|\,\bar{z}(0)) + \sum_{t=1}^{s} H(f(t)\,|\,\bar{z}(0), f(0), \cdots, f(t-1)) + \sum_{t=0}^{s}\log|Db(t)|$$
$$= H(\bar{z}(0), f(0), f(1), \cdots, f(s)) + \sum_{t=0}^{s}\log|Db(t)|.$$
Since the second term in the above equation is independent of $\bar{z}(0)$ and $f(t)$, $t = 0, \cdots, s$, the MaxEnt problem (33) is equivalently rewritten as
$$\underset{\bar{z}(0),\, f(t)}{\text{maximize}}\quad H(\bar{z}(0), f(0), f(1), \cdots, f(s)) \quad \text{(A.12a)}$$
$$\text{subject to}\quad E(\bar{z}(0)) = 0,\; E(f(t)) = 0,\; \mathrm{Cov}(\bar{z}(0)) = Q,\; \mathrm{Var}(f(t)) = 1,\; t = 0, \cdots, s. \quad \text{(A.12b)}$$
Note that the constraints in (A.12) are separable with respect to $\bar{z}(0), f(t)$, $t = 0, \cdots, s$, and that $H(\bar{z}(0), f(0), f(1), \cdots, f(s))$ is maximized if $\bar{z}(0), f(0), f(1), \cdots, f(s)$ are independent of each other. Therefore, (A.12) is equivalent to
$$\underset{\bar{z}(0)}{\text{maximize}}\; H(\bar{z}(0)) \quad \text{subject to}\; E(\bar{z}(0)) = 0,\; \mathrm{Cov}(\bar{z}(0)) = Q,$$
and, for $t = 0, \cdots, s$,
$$\underset{f(t)}{\text{maximize}}\; H(f(t)) \quad \text{subject to}\; E(f(t)) = 0,\; \mathrm{Var}(f(t)) = 1.$$
It is well known that the multivariate normal distribution maximizes the differential entropy over all distributions with the same covariance, see e.g. [49, Theorem 8.6.5]. Hence the optimal solution to (A.12) is that $\bar{z}(0) \sim \mathcal{N}(0, Q)$ and $f(t) \sim \mathcal{N}(0, 1)$, $t = 0, \cdots, s$, and moreover, $\bar{z}(0), f(0), f(1), \cdots, f(s)$ are independent of each other. Finally, (32) implies $\bar{g}(t) = CA^t\bar{z}(0) + \sum_{k=0}^{t-1} CA^{t-1-k}Bb(k)f(k) + Db(t)f(t)$, the same as $g(t)$ in (25a). This completes the proof.
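A small numerical illustration of the key fact used above ([49, Theorem 8.6.5]): among zero-mean, unit-variance distributions, the Gaussian attains the largest differential entropy. The two comparison densities are illustrative choices.

```python
import numpy as np

# Differential entropies of zero-mean, unit-variance densities (closed forms):
h_gauss   = 0.5 * np.log(2 * np.pi * np.e)          # N(0, 1)
h_uniform = np.log(np.sqrt(12.0))                   # Uniform(-sqrt(3), sqrt(3))
h_laplace = 1.0 + np.log(2.0 / np.sqrt(2.0))        # Laplace with b = 1/sqrt(2)

print(h_gauss > h_uniform and h_gauss > h_laplace)  # True: Gaussian maximizes
```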
Proof of Corollary 4.1

From Proposition 4.2, the DC kernel is an SI kernel with $A, B, C, D, Q, b(t)$ given in (28), which implies that Proposition 4.4 holds for the DC kernel. Comparing (32) with (36) shows that the proof is completed if we can show
$$\mathrm{Cov}(\bar{z}(0)) = Q \;\Longrightarrow\; \mathrm{Var}(\bar{g}(0)) = c,$$
$$\mathrm{Var}(f(t)) = 1 \;\Longrightarrow\; \mathrm{Var}\big(\bar{g}(t) - \lambda^{1/2}\rho\,\bar{g}(t-1)\big) = c(1-\rho^2)\lambda^{t}, \quad t = 1, \cdots, s. \quad \text{(A.13)}$$
The first line of (A.13) is straightforward. In what follows, we consider the second line of (A.13). Clearly, the result holds when either $\lambda = 0$ or $\rho = 0$, so we further consider below the case where $\lambda \neq 0$ and $\rho \neq 0$. It is easy to see that for $t = 0, \ldots, s-1$,
$$\bar{g}(t+1) = ACA^{t}\bar{z}(0) + A\sum_{k=0}^{t-1} CA^{t-1-k}Bb(k)f(k) + ACA^{-1}Bb(t)f(t) + Db(t+1)f(t+1)$$
$$= ACA^{t}\bar{z}(0) + A\sum_{k=0}^{t-1} CA^{t-1-k}Bb(k)f(k) + ADb(t)f(t) + Db(t+1)f(t+1)$$
$$= A\bar{g}(t) + Db(t+1)f(t+1),$$
where the second equality holds because $CA^{-1}B = D$. Since $f(t+1)$ is independent of $\bar{g}(t)$, it follows that $\mathrm{Var}(\bar{g}(t+1) - A\bar{g}(t)) = D^2 b(t+1)^2$, which together with (28) gives the second line of (A.13). Therefore (A.13) holds, which completes the proof.
Proof of Proposition 4.5

We need a lemma to prove this result.
Lemma A.3
Consider the $p$th order Gaussian Markov process (38). Then the following results hold:
(a) For any $t \in \mathbb{N}$ and $k > p$, $x(t)$ and $x(t+k)$ are conditionally independent given $x(s)$ with $s \neq t$, $s \neq t+k$ and $s \in \mathbb{N}$.
(b) Let $i_1 < i_2 < \cdots < i_n$ with $n > p+1$ be any strictly increasing subsequence taken from $\mathbb{N}$. Then the inverse of the covariance matrix of $[x(i_1)\; \cdots\; x(i_n)]^T$ is a $p$-band matrix.

Proof: Part (a) follows from the $p$th order Markov property of $x(t)$ and the fact that $w(t)$ is white Gaussian noise. To prove Part (b), recall from e.g. [50] that, for Gaussian random variables, the zeros in the inverse of the covariance matrix indicate conditional independence of the two corresponding elements conditioned on the remaining ones. To be specific, consider a Gaussian random variable with mean $m$ and covariance matrix $K$:
$$[x\;\; y\;\; z^T]^T \sim \mathcal{N}(m, K), \quad \text{(A.14)}$$
where $x, y \in \mathbb{R}$, $z \in \mathbb{R}^n$ and $K \in \mathbb{R}^{(n+2)\times(n+2)}$. Then for a given $z$, $x$ and $y$ are conditionally independent if and only if $(K^{-1})_{1,2} = (K^{-1})_{2,1} = 0$, where $(K^{-1})_{i,j}$ denotes the $(i,j)$th element of $K^{-1}$. Therefore, Part (b) follows from the above observation and Part (a). ♦

Note from Proposition 4.2 that $g(t)$ can be written in the following form:
$$z(t+1) = \lambda^{1/2}\rho\, z(t) + \lambda^{1/2}\sqrt{c}\,\lambda^{t/2}\, w(t), \quad z(0) \sim \mathcal{N}\Big(0, \frac{c}{1-\rho^2}\Big),$$
$$g(t) = \rho\sqrt{1-\rho^2}\, z(t) + \sqrt{1-\rho^2}\,\sqrt{c}\,\lambda^{t/2}\, w(t),$$
which implies that (39) holds. Clearly, (39) is in the form of (38a) with $p = 1$ and thus $g(t)$ is a Markov process of order 1. Finally, the result that the kernel matrix of the DC kernel has a 1-band matrix inverse follows from Lemma A.3.
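A quick numerical companion to this result (an illustration, assuming the standard DC form $k(t,s) = c\lambda^{(t+s)/2}\rho^{|t-s|}$): the inverse of a sampled DC kernel matrix is tridiagonal, i.e., a 1-band matrix.

```python
import numpy as np

c, lam, rho = 1.0, 0.8, 0.5
t = np.arange(1, 11)
T, S = np.meshgrid(t, t, indexing="ij")
K = c * lam**((T + S) / 2) * rho**np.abs(T - S)    # DC kernel matrix

Kinv = np.linalg.inv(K)
# All entries beyond the first off-diagonal vanish: 1-band (tridiagonal) inverse.
print(np.max(np.abs(np.triu(Kinv, k=2))) < 1e-8)
```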
Proof of Proposition 4.6

By using the realization of $G(q)$ in controllable canonical form, the DT SI kernel (24) is written as follows:
$$z(t+1) = \begin{bmatrix} -a_1 & \cdots & -a_{n-1} & -a_n \\ I_{n-1} & & & 0_{(n-1)\times 1} \end{bmatrix} z(t) + \begin{bmatrix} 1 \\ 0_{(n-1)\times 1} \end{bmatrix} b(t) w(t),$$
$$g(t) = \begin{bmatrix} \bar{b} & 0 & \cdots & 0 \end{bmatrix} z(t), \quad z(0) \sim \mathcal{N}(0, Q).$$
From the above equation, we have
$$g(t+1) = -\bar{b}\sum_{i=1}^{n} a_i z_i(t) + \bar{b}\,b(t)w(t), \quad \text{(A.15)}$$
where $z_i(t)$ is the $i$th element of $z(t)$. Note that $\bar{b}\,z_i(t) = g(t-i+1)$, $i = 1, \cdots, n$. Therefore, (A.15) becomes
$$g(t+1) = -\sum_{i=1}^{n} a_i\, g(t-i+1) + \bar{b}\,b(t)w(t),$$
which is in the form of (38b) with order $n$. Thus the Gaussian process $g(t)$ in (24) is a Markov process of order $n$, and the fact that its kernel matrix has an $n$-band matrix inverse follows from Lemma A.3.