The Generalized Cross Validation Filter
Giulio Bottegal (a) and Gianluigi Pillonetto (b)
(a) Department of Electrical Engineering, TU Eindhoven, Eindhoven, The Netherlands (e-mail: [email protected])
(b) Department of Information Engineering, University of Padova, Padova, Italy (e-mail: [email protected])
Abstract
Generalized cross validation (GCV) is one of the most important approaches used to estimate parameters in the context of inverse problems and regularization techniques. A notable example is the determination of the smoothness parameter in splines. When the data are generated by a state space model, as in the spline case, efficient algorithms are available to evaluate the GCV score with complexity that scales linearly in the data set size. However, these methods are not amenable to on-line applications since they rely on forward and backward recursions. Hence, if the objective has been evaluated at time t − 1, then O(t) operations are needed to update the GCV score. In this paper we instead show that the update cost is O(1), thus paving the way to the on-line use of GCV. This result is obtained by deriving the novel GCV filter, which extends the classical Kalman filter equations to efficiently propagate the GCV score over time. We also illustrate applications of the new filter in the context of state estimation and on-line regularized linear system identification.
Key words:
Kalman filtering; generalized cross-validation; on-line system identification; inverse problems; regularization; smoothness parameter; splines
★ This research has been partially supported by the MIUR FIRB project RBFR12M3AC (Learning meets time: a new computational approach to learning in dynamic systems) and by the Progetto di Ateneo CPDA147754/14 (New statistical learning approach for multi-agents adaptive estimation and coverage control). This paper was not presented at any IFAC meeting. Corresponding author Gianluigi Pillonetto, Ph. +390498277607.

1 Introduction

Linear state space models assume the form
\[
x_{k+1} = A_k x_k + \omega_k, \qquad y_k = C_k x_k + e_k, \qquad (1)
\]
where x_k is the state at instant k, y_k is the output, while ω_k and e_k are random noises. The matrices A_k and C_k regulate the state transition and the observation model at instant k. This kind of model plays a central role in the analysis and design of discrete-time systems [17]. Applications abound and include tracking, navigation and biomedicine.

In on-line state estimation, the problem is the reconstruction of the values of x_k from measurements of y_k collected over time. When the matrices A_k and C_k and the noise covariances are known, the optimal linear estimates are efficiently returned by the classical Kalman filter [1]. However, in many circumstances there can be unknown model parameters that also need to be inferred from data in an on-line manner, e.g. variance components or entries of the transition/observation matrices. One can interpret such parameters as additional states. Then, the extended Kalman filter [16] or more sophisticated stochastic techniques, such as particle filters and Markov chain Monte Carlo [10,22,3,9], can be used to track the filtered posterior. Another technique consists of propagating the marginal likelihood of the unknown parameters via a bank of filters [1, Ch. 10]. In this paper, we will show that another viable alternative is the use of an approach known in the literature as generalized cross validation (GCV) [12].

In the literature on statistics and inverse problems, GCV is widely used in off-line contexts to estimate unknown parameters entering regularized estimators [5,37,40]. This approach was initially used to tune the smoothness parameter in ridge regression and smoothing splines [14,12,33]. GCV is now also popular in machine learning, where it is used to improve the generalization capability of regularized kernel-based approaches [34,8], such as regularization networks, which contain spline regression as a special case [31,11].

To introduce GCV in our state space context, we first recall that smoothing splines are closely linked to state space models of m-fold integrated Wiener processes [25]; it then appears natural to extend GCV to general state space models. To this end, assume that measurements y_k have been collected up to instant t and stacked in the vector Y_t. Denote by Ŷ_t the vector containing the optimal linear output estimate and use H_t to denote the influence matrix satisfying Ŷ_t = H_t Y_t. Then, the parameter estimates achieved by GCV minimize
\[
\mathrm{GCV}_t = \frac{S_t}{t\,(1 - \delta_t/t)^2}, \qquad (2)
\]
where S_t is the sum of squared residuals, i.e. S_t = ‖Ŷ_t − Y_t‖², and δ_t are the degrees of freedom given by the trace of H_t, i.e. δ_t = Tr(H_t). In the objective (2), the term S_t accounts for the goodness of fit, while δ_t assumes values in [0, t] and measures model complexity.
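To make (2) concrete, the following minimal Python sketch (our illustration, not part of the paper; the function name is ours) evaluates the GCV score from its batch ingredients, the stacked measurements Y_t and the influence matrix H_t:

```python
import numpy as np

def gcv_score(H_t: np.ndarray, Y_t: np.ndarray) -> float:
    """Batch GCV objective (2): S_t / (t * (1 - delta_t / t)**2)."""
    t = Y_t.size
    Y_hat = H_t @ Y_t                        # smoothed output estimate
    S_t = float(np.sum((Y_hat - Y_t) ** 2))  # sum of squared residuals
    delta_t = float(np.trace(H_t))           # degrees of freedom
    return S_t / (t * (1.0 - delta_t / t) ** 2)
```

Evaluated this way, the score costs at least O(t²) per time step once H_t is formed; the contribution of the paper is a recursion delivering the same value with O(1) work per new sample.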
In fact, in nonparametric regularized estimation, the degrees of freedom δ_t can be seen as the counterpart of the number of parameters entering a parametric model [20,13,26].

GCV is supported by important asymptotic results. Also, for finite data set size it often turns out to be a good approximation of the output mean squared error [7]. It is worth stressing that such properties have been derived without postulating the correctness of the prior models describing the output data [38,39]. In control, this means that GCV can compensate for possible modeling mismatch affecting the state space description.

Despite these nice features, the use of GCV within the control community appears limited, in particular in on-line contexts. One important reason is the following. For state space models, there exist efficient algorithms which, for a given parameter vector, return the GCV score with O(t) operations [18,4]; see also [15,35,21] for procedures dedicated to smoothing splines. But all of these techniques are not suited to on-line computations, since they involve forward and backward recursions. Hence, if GCV_{t−1} is available and new data arrive at time t, another O(t) operations are needed to achieve GCV_t. In this paper, we will instead show that the update cost is O(1), thus paving the way to a more pervasive on-line use of GCV. This result is obtained by deriving the novel GCV filter, which consists of an extension of the classical Kalman equations. Thanks to it, one can run a bank of filters (possibly in parallel) to efficiently propagate GCV over a grid of parameter values. This makes the proposed GCV filter particularly suitable for applications where a measurement model admits a state space description with dynamics depending on few parameters; see e.g. the next section for an application in numerical differentiation. (The components of Ŷ_t are given by C x̂_{k|t}, where the smoothed state x̂_{k|t} can be obtained for any t with O(t) operations by a fixed-interval Kalman smoothing filter [32,19].) In this framework, an implementation of the GCV filter via a bank of parallel filters turns out to be computationally attractive.

The paper is organized as follows. In Section 2, first some additional notation is introduced; then the GCV filter is presented. Its asymptotic properties are discussed in Section 3. In Section 4 we illustrate some applications, including smoothing splines and on-line regularized linear system identification with the stable spline kernel used as stochastic model for the impulse response [28,29]. Conclusions end the paper, while the correctness of the GCV filter is shown in the Appendix.

2.1 The measurement model

First, we provide full details about our measurement model. We use x ∼ (a, b) to denote a random vector x with mean a and covariance matrix b. Then, our state space model is defined by
\[
\begin{aligned}
x_{k+1} &= A_k x_k + \omega_k && (3a)\\
y_k &= C_k x_k + e_k, \quad k = 1, 2, \ldots && (3b)\\
x_1 &\sim (\mu, P_1) && (3c)\\
\omega_k &\sim (0, Q_k) && (3d)\\
e_k &\sim (0, \gamma) && (3e)
\end{aligned}
\]
where the initial condition x_1 and all the noises {ω_k, e_k}_{k=1,2,...} are mutually uncorrelated. We do not specify any particular distribution for these variables, since the GCV score does not depend on the particular noise distribution. (Of course, GCV may turn out not to be effective if the noises are highly non-Gaussian. Different approaches, like particle filters, should instead be used if linear estimators perform poorly due e.g. to multimodal noise distributions.) If x_1, ω_k, e_k are Gaussian, then the Kalman filter provides the optimal state estimate in the mean-square sense. In the other cases, the Kalman filter corresponds to the best linear state estimator [1].
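As a small aside (our illustration; all names are ours), model (3) is straightforward to simulate, e.g. with Gaussian noises and the scalar output assumed below:

```python
import numpy as np

def simulate(A, C, Q, gamma, mu, P1, t, rng=None):
    """Draw states x_1..x_t and scalar outputs y_1..y_t from model (3),
    here with Gaussian noises (the GCV score itself is distribution-free)."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    x = rng.multivariate_normal(mu, P1)
    xs, ys = [], []
    for _ in range(t):
        xs.append(x)
        ys.append(float(C @ x) + np.sqrt(gamma) * rng.standard_normal())
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    return np.array(xs), np.array(ys)
```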
In addition, just to simplify notation, the measurements y_k are assumed scalar, so that γ represents the noise variance. We assume that some of the parameters in (3) may be unknown; they may enter A_k, C_k, Q_k and P_1, but we do not stress this possible dependence, to make the formulas more readable. The matrix P_1 is assumed to be independent of γ. The parameter γ is typically unknown, being connected to the ratio between the measurement noise variance and the variance of the driving noise. It corresponds to the regularization parameter in the smoothing-spline context described in the example below.

Example 1 (Smoothing splines [30])
Function estimation and numerical differentiation are often required in various applications. These include input reconstruction in nonlinear dynamic systems, as described e.g. in [30]. Assume that one is interested in determining the first m derivatives of a continuous-time signal measured with non-uniform sampling periods T_k. Modeling the signal as an m-fold integrated Wiener process, one obtains the stochastic interpretation of the m-th order smoothing splines [40]. In particular, one can use (3) to represent the signal dynamics as follows:
\[
A_k = \begin{pmatrix}
1 & 0 & \cdots & 0\\
T_k & 1 & \cdots & 0\\
\frac{T_k^2}{2} & T_k & \ddots & \vdots\\
\vdots & \vdots & \ddots & 0\\
\frac{T_k^m}{m!} & \frac{T_k^{m-1}}{(m-1)!} & \cdots & 1
\end{pmatrix},
\qquad
C_k = \begin{pmatrix} 0 & \cdots & 0 & 1 \end{pmatrix},
\]
\[
[Q_k]_{ij} = \frac{T_k^{\,i+j-1}}{(i-1)!\,(j-1)!\,(i+j-1)}.
\]
Such a model depends on the measurement noise variance γ, making this application particularly suited for the GCV filter.
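A direct transcription of Example 1 into code (our sketch; the helper name is ours) builds the matrices for a given order m and sampling period T_k:

```python
import numpy as np
from math import factorial

def spline_model(m: int, T: float):
    """State space matrices of the m-fold integrated Wiener process
    sampled with period T (Example 1). State dimension is m + 1;
    the last state component is the measured signal."""
    n = m + 1
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            A[i, j] = T ** (i - j) / factorial(i - j)
    C = np.zeros((1, n)); C[0, -1] = 1.0
    Q = np.array([[T ** (i + j - 1) / (factorial(i - 1) * factorial(j - 1) * (i + j - 1))
                   for j in range(1, n + 1)] for i in range(1, n + 1)])
    return A, C, Q
```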
2.2 The GCV filter

The GCV filter equations are now reported. Below, x̂_k denotes the optimal linear one-step-ahead state prediction, having covariance P_k. Its dynamics are regulated by the classical Kalman filter via (5a), (5c) and the Riccati equation (5e).

GCV filter

Initialization:
\[
\hat{x}_1 = \mu, \qquad \hat{\zeta}_1 = 0, \qquad \Sigma_1 = 0, \qquad (4a\text{--}4b)
\]
with P_1 given by (3c), and
\[
\delta_1 = 1 - \gamma\,(C_1 P_1 C_1^T + \gamma)^{-1}, \qquad (4c)
\]
\[
S_1 = \gamma^2\,(y_1 - C_1\mu)^2\,(C_1 P_1 C_1^T + \gamma)^{-2}, \qquad (4d)
\]
\[
\mathrm{GCV}_1 = S_1\,(1 - \delta_1)^{-2}. \qquad (4e)
\]

Update:
\[
\begin{aligned}
K_k &= A_k P_k C_k^T\,(C_k P_k C_k^T + \gamma)^{-1} && (5a)\\
G_k &= \frac{A_k \Sigma_k C_k^T - K_k\,(C_k \Sigma_k C_k^T + 1)}{C_k P_k C_k^T + \gamma} && (5b)\\
\hat{x}_{k+1} &= A_k \hat{x}_k + K_k\,(y_k - C_k \hat{x}_k) && (5c)\\
\hat{\zeta}_{k+1} &= (A_k - K_k C_k)\,\hat{\zeta}_k + G_k\,(y_k - C_k \hat{x}_k) && (5d)\\
P_{k+1} &= (A_k - K_k C_k)\,P_k\,(A_k - K_k C_k)^T + \gamma K_k K_k^T + Q_k && (5e)\\
\Sigma_{k+1} &= (A_k - K_k C_k)\,\Sigma_k\,(A_k - K_k C_k)^T + K_k K_k^T && (5f)\\
\delta_{k+1} &= \delta_k + 1 - \gamma\,\frac{C_{k+1}\Sigma_{k+1}C_{k+1}^T + 1}{C_{k+1}P_{k+1}C_{k+1}^T + \gamma} && (5g)\\
S_{k+1} &= S_k + \gamma^2\,\frac{(C_{k+1}\Sigma_{k+1}C_{k+1}^T + 1)\,(y_{k+1} - C_{k+1}\hat{x}_{k+1})^2}{(C_{k+1}P_{k+1}C_{k+1}^T + \gamma)^2} + 2\gamma^2\,\frac{C_{k+1}\hat{\zeta}_{k+1}\,(y_{k+1} - C_{k+1}\hat{x}_{k+1})}{C_{k+1}P_{k+1}C_{k+1}^T + \gamma} && (5h)\\
\mathrm{GCV}_{k+1} &= \frac{(k+1)\,S_{k+1}}{(k+1-\delta_{k+1})^2} && (5i)
\end{aligned}
\]
Fig. 1. GCV filter: in the bottom, the nonlinear blocks f and g are defined, respectively, by (5h) and (5i), while δ_{k+1} can be recursively computed by (5g).

It is apparent that the difference w.r.t. the classical Kalman filter is the presence of the additional state ζ̂_k, of the same dimension as x̂_k. Comparing (5c) and (5d), one can see that A_k is replaced by A_k − K_kC_k. In addition, the dynamics of ζ̂_k are still driven by the innovation y_k − C_kx̂_k, but the Kalman gain K_k given by (5a) is substituted by the G_k defined by (5b). In turn, such gain depends on Σ_k, which is propagated over time through a modified version of the Riccati equation given by (5f). The GCV filter is graphically depicted in Fig. 1.
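The recursion is compact enough to transcribe directly. Below is a Python sketch of the whole filter for the time-invariant, scalar-output case (our illustration; the paper links a MATLAB implementation, and all names here are ours):

```python
import numpy as np

def gcv_filter(A, C, Q, gamma, mu, P1, y):
    """GCV filter, eqs. (4a)-(5i): propagates the Kalman prediction x_hat,
    zeta = d x_hat / d gamma, P, Sigma = dP / d gamma, the degrees of
    freedom delta and the residual sum S, returning GCV_1, GCV_2, ...
    C has shape (1, n); y is a 1-D array of scalar measurements."""
    n = A.shape[0]
    x = np.asarray(mu, dtype=float).reshape(n, 1)
    P = np.asarray(P1, dtype=float).copy()
    zeta, Sigma = np.zeros((n, 1)), np.zeros((n, n))

    s = float(C @ P @ C.T) + gamma          # innovation variance at k = 1
    eps = float(y[0] - C @ x)               # innovation at k = 1
    dof = 1.0 - gamma / s                   # (4c)
    ssr = gamma**2 * eps**2 / s**2          # (4d)
    gcv = [ssr / (1.0 - dof)**2]            # (4e)

    for k in range(len(y) - 1):
        K = A @ P @ C.T / s                                             # (5a)
        G = (A @ Sigma @ C.T - K * (float(C @ Sigma @ C.T) + 1.0)) / s  # (5b)
        x = A @ x + K * eps                                             # (5c)
        zeta = (A - K @ C) @ zeta + G * eps                             # (5d)
        P = (A - K @ C) @ P @ (A - K @ C).T + gamma * K @ K.T + Q       # (5e)
        Sigma = (A - K @ C) @ Sigma @ (A - K @ C).T + K @ K.T           # (5f)
        s = float(C @ P @ C.T) + gamma      # innovation variance at k + 1
        r = float(C @ Sigma @ C.T) + 1.0
        eps = float(y[k + 1] - C @ x)       # innovation at k + 1
        dof += 1.0 - gamma * r / s                                      # (5g)
        ssr += gamma**2 * r * eps**2 / s**2 \
             + 2.0 * gamma**2 * float(C @ zeta) * eps / s               # (5h)
        t = k + 2                           # samples processed so far
        gcv.append(t * ssr / (t - dof)**2)                              # (5i)
    return np.array(gcv)
```

Each pass through the loop touches only fixed-size matrices, which is exactly the O(1) per-sample cost claimed in the Introduction.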
3 Asymptotic properties of the GCV filter

We first consider the case where the state-space model (3) is time-invariant, i.e. the matrices A_k, C_k and Q_k are constant in k (and the subscript k is dropped). The structure of the equations governing the GCV filter permits one to easily understand its asymptotic behaviour. In particular, exploiting well-known properties of the Kalman filter [1], the following result is obtained (see the Appendix for a proof).

Proposition 1 Assume that the system (3) is time-invariant, stabilizable and detectable. Then, for any P_1, we have
\[
\lim_{k \to \infty} P_k = \bar{P} \qquad \text{and} \qquad \lim_{k \to \infty} \Sigma_k = \bar{\Sigma},
\]
where P̄ and Σ̄ are the unique symmetric and positive semidefinite matrices solving, respectively, the algebraic Riccati equation
\[
\bar{P} = A\bar{P}A^T + Q - A\bar{P}C^T(C\bar{P}C^T + \gamma)^{-1}C\bar{P}A^T \qquad (6)
\]
and the Lyapunov equation
\[
\bar{\Sigma} = (A - \bar{K}C)\,\bar{\Sigma}\,(A - \bar{K}C)^T + \bar{K}\bar{K}^T, \qquad (7)
\]
where K̄ = A P̄ C^T (C P̄ C^T + γ)^{−1}. In addition, all the eigenvalues of the matrix A − K̄C are inside the unit circle, so that the (asymptotic) GCV filter is asymptotically stable.
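Proposition 1 makes the steady-state quantities computable with off-the-shelf solvers. A sketch (ours, assuming SciPy; the usual duality A → Aᵀ, C → Cᵀ casts the filtering Riccati equation (6) in the form accepted by solve_discrete_are):

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def asymptotic_gcv_gains(A, C, Q, gamma):
    """Solve (6) for P_bar and (7) for Sigma_bar, then return the
    steady-state gains K_bar and G_bar of the asymptotic GCV filter."""
    R = np.array([[gamma]])
    P = solve_discrete_are(A.T, C.T, Q, R)   # filtering ARE via duality
    s = float(C @ P @ C.T) + gamma
    K = A @ P @ C.T / s                      # steady-state Kalman gain
    Sigma = solve_discrete_lyapunov(A - K @ C, K @ K.T)   # eq. (7)
    G = (A @ Sigma @ C.T - K * (float(C @ Sigma @ C.T) + 1.0)) / s
    return P, Sigma, K, G
```

With P̄ and Σ̄ in hand, the asymptotic smoothing ratio discussed below follows at the cost of two scalar products.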
Properties of the GCV filter can also be characterized in the time-varying case. In particular, following Section 2 of [2], one can first replace stabilizability and detectability with the assumptions of uniform stabilizability and detectability. Then, following the same reasoning contained in the proof of Proposition 1, Theorem 5.3 in [2] ensures the uniform exponential stability of the GCV filter.
Proposition 1 also leads to a new computationally appealing approach to tune the regularization parameter γ, e.g. in smoothing splines. In particular, consider the scenario described in [21], where an unknown function has to be reconstructed by spline regression from equally spaced noisy samples. When the assumptions in Proposition 1 hold true, it is possible to compute off-line the gains
\[
\bar{K} = A\bar{P}C^T(C\bar{P}C^T + \gamma)^{-1}, \qquad \bar{G} = \frac{A\bar{\Sigma}C^T - \bar{K}\,(C\bar{\Sigma}C^T + 1)}{C\bar{P}C^T + \gamma}.
\]
Then, one can exploit the asymptotic (suboptimal) GCV filter, with the guarantee that the objective values will converge to the exact GCV scores as k increases. Moreover, in off-line contexts this approach appears computationally appealing even when compared to the many GCV-based spline algorithms developed in the last decades [41,40,18,15,35]. Furthermore, [21] defined the asymptotic smoothing ratio as
\[
\lim_{k \to \infty} \frac{\delta_k}{k},
\]
also providing an interesting closed-form expression for the cubic spline case, useful to further speed up the tuning of γ. For the general case, we notice that Proposition 1 also gives a numerical procedure to compute the asymptotic smoothing ratio (for different values of γ). In fact, if (3) is stabilizable and detectable, combining (5g) and Proposition 1 we obtain
\[
\lim_{k \to \infty} \frac{\delta_k}{k} = 1 - \gamma\,\frac{C\bar{\Sigma}C^T + 1}{C\bar{P}C^T + \gamma},
\]
with P̄ and Σ̄ defined, respectively, in (6) and (7).

4 Applications

4.1 Cubic spline example

We consider the reconstruction of the function exp(sin 8t), taken from [29], from samples collected at 400 instants t_i randomly generated from a uniform distribution on [0, 1]. The measurement noise is Gaussian with standard deviation equal to 0.3. We model the signal as the two-fold integral of white noise, using cubic smoothing splines [40]. We use Z_t to denote the vector containing the noiseless outputs (corresponding to the second entries of {x_k}_{k=1}^t). We denote the average of Z_t by the scalar quantity Z̄_t. Then, the performance measure is the percentage fit
\[
F_t = 100\left(1 - \frac{\|Z_t - \hat{Z}_t\|}{\|Z_t - \mathbb{1}\bar{Z}_t\|}\right), \qquad (8)
\]
where Ẑ_t is the estimate of Z_t obtained through the Kalman smoother [1], and 𝟙 is a column vector with all entries equal to 1. The following two estimators Ẑ_t are tested:

• GCV: this approach estimates γ exploiting the GCV filter. More specifically, the GCV score is propagated over a grid containing 100 values of γ, logarithmically spaced. Then, at any t the estimate Ẑ_t is computed by a Kalman smoothing filter which exploits the γ_t that minimizes GCV_t.

• Oracle: the same as GCV, except that γ_t maximizes the fit F_t. Note that this approach is not implementable in practice, since it uses an oracle that knows the noiseless (unavailable) output Z_t.

The left panel of Fig. 2 displays the noiseless output (solid line), the measurements (◦) and the function estimate returned by GCV (dashed line), which turns out to be close to the true function. The right panel also shows that the GCV filter is able to track well, in an on-line manner, the time course of the γ returned by Oracle.
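The GCV estimator above is simply a bank of GCV filters over a γ grid. A sketch of that loop, reusing the gcv_filter function from the earlier snippet (our illustration; the grid bounds below are arbitrary placeholders):

```python
import numpy as np

def gcv_over_grid(A, C, Q, mu, P1, y, gammas):
    """Run one GCV filter per candidate gamma (a bank of filters) and
    return, for each time t, the gamma minimizing the GCV score."""
    scores = np.stack([gcv_filter(A, C, Q, g, mu, P1, y) for g in gammas])
    best = np.argmin(scores, axis=0)   # index of the minimizer at each t
    return np.asarray(gammas)[best]

# example grid: 100 logarithmically spaced candidates (bounds arbitrary)
gammas = np.logspace(-3, 3, 100)
```

Since the filters are mutually independent, the bank parallelizes trivially, which is the implementation suggested in the paper.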
Fig. 2. Cubic spline example - Section 4.1. Left: noiseless output (solid line), measurements (◦) and cubic spline estimate obtained by GCV (dashed line). Right: estimated regularization parameter γ_t, as a function of time, obtained by Oracle maximizing the fit F_t in (8) (solid line) and by GCV minimizing the score GCV_t computed by (5i) (dashed line).
Fig. 3. Model mismatch example - Section 4.2. Left: noiseless output (solid line), measurements (◦) and smoothed output obtained by GCV (dashed line). Right: estimated noise variance γ_t, as a function of time, obtained by Oracle (solid line) and by GCV (dashed line).

4.2 Model mismatch example

We consider the following discrete-time model (see also [23, Section 6]):
\[
x_{k+1} = A\,x_k + \omega_k, \qquad (9a)
\]
\[
y_k = C\,x_k + e_k, \qquad (9b)
\]
with zero-mean Gaussian noises ω_k ∼ (0, Q) and e_k ∼ (0, γ), for fixed numerical values of A, C, Q and γ. We will use data generated by this model to test the capability of the GCV filter to compensate for mismatches between the true system and the model used to track the data, by tuning γ in an on-line manner. As in the previous example, Z_t is the vector containing the first t noiseless outputs (which are the second entries of {x_k}_{k=1}^t) and the performance measure is (8). The following three estimators Ẑ_t are tested:

• GCV: this approach uses a wrong transition covariance Q̃, given by Q plus a fixed perturbation matrix, and then estimates γ exploiting the GCV filter over a grid with 100 logarithmically spaced values.
Fig. 4. Model mismatch example - Section 4.2. Boxplots of the 100 fits F_t, as defined in (8), obtained after a Monte Carlo study by the estimators Oracle, GCV and Nominal.

Then, at any t the estimate Ẑ_t is computed by a Kalman smoothing filter which exploits the γ_t minimizing GCV_t.

• Oracle: the same as GCV, except that γ_t maximizes the fit F_t in (8).

• Nominal: the estimate Ẑ_t is returned by a Kalman smoothing filter defined by the nominal wrong covariance Q̃ and γ = 30. This approach does not try to compensate for model mismatch, since it does not tune γ from data.
The left panel of Fig. 3 displays the noiseless output (solid line), the measurements (◦) and the smoothed output obtained by GCV (dashed line), which appears close to the truth. In the right panel, one can also see the trajectory in time of the γ_t returned by Oracle and by GCV. One can appreciate the capability of the GCV filter to compensate for the modelling mismatch by tracking a regularization parameter leading to a high fit F_t.

To further support these findings, we have also performed a Monte Carlo study of 100 runs. During each run, 200 output measurements are generated using (9) and Z_t is reconstructed by GCV, Oracle and Nominal. From the MATLAB boxplots of the 100 fits (8) reported in Fig. 4, the robustness of GCV emerges clearly. Its performance is in fact very close to that of the oracle-based procedure.

4.3 On-line system identification

Now, we consider a linear system identification problem where the aim is to estimate an unknown impulse response from input-output measurements. Assuming a high order FIR model, the outputs collected up to instant t, stacked in the (column) vector Y_t, satisfy
\[
Y_t = \Phi_t\,g + E_t, \qquad (10)
\]
where g denotes the m-dimensional vector whose components are the impulse response coefficients, the regression matrix Φ_t is defined by the input samples, and E_t is the measurement noise vector, which we assume white and Gaussian.

To solve this problem, we use the kernel-based approach originally proposed in [28,27,6]. The impulse response estimate is given by
\[
\arg\min_{g \in \mathbb{R}^m} \; \|Y_t - \Phi_t g\|^2 + \gamma\, g^T P^{-1} g. \qquad (11)
\]
It makes use of the regularization matrix P induced by the so-called first-order spline kernel, i.e. its (i, j) entry is
\[
[P]_{ij} = \alpha^{\max(i,j)}, \qquad 0 \le \alpha < 1,
\]
where α is a hyperparameter which regulates the rate of decay to zero of the components of g. We refer the reader also to [29] for further details on the advantages of (11) over classical parametric approaches.

In real applications, both γ and α are unknown. Since we consider a situation where g has to be estimated on-line, we estimate these two hyperparameters by the GCV filter. To do that, we first notice that (11) corresponds to the maximum a posteriori (MAP) estimator of g and, under the stated Gaussian assumptions, also to its minimum mean-square error (MMSE) estimator. The estimate of g can then be computed using the Kalman filter. In fact, the state space model is
\[
x_{k+1} = x_k, \qquad y_k = C_k x_k + e_k, \quad k = 1, 2, \ldots, \qquad (12)
\]
\[
x_1 \sim (0, P), \qquad e_k \sim (0, \gamma),
\]
where the state vector is the stochastic model for g (with x_k = g for any k), y_k and e_k are the k-th entries of Y_t and E_t, respectively, and C_k is the k-th row of Φ_t.

We define a grid in the plane (γ, α) by taking 7 values of α together with 20 logarithmically spaced values of γ; in this way, the grid consists of 140 points. We run 140 GCV filters in parallel, each corresponding to one of the points; when a new measurement y_k (and C_k) is available, we update the GCV score of each pair (γ, α), selecting the one giving the minimum score.
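For reference, the kernel matrix and the batch solution of (11) can be written down in a few lines (our sketch; helper names are ours, and the solve uses the standard matrix-inversion-lemma form to avoid inverting P):

```python
import numpy as np

def first_order_spline_P(m: int, alpha: float) -> np.ndarray:
    """First-order spline kernel matrix: [P]_ij = alpha**max(i, j),
    with 1-based indices i, j as in the paper."""
    idx = np.arange(1, m + 1)
    return alpha ** np.maximum.outer(idx, idx)

def regularized_fir(Phi, Y, gamma, alpha):
    """Batch solution of (11): g = (Phi'Phi + gamma P^{-1})^{-1} Phi'Y,
    computed in the equivalent form g = P Phi'(Phi P Phi' + gamma I)^{-1} Y."""
    t, m = Phi.shape
    P = first_order_spline_P(m, alpha)
    return P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + gamma * np.eye(t), Y)
```

On-line, the same estimate is produced by the Kalman filter for model (12), while the GCV filter supplies the score used to pick (γ, α).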
We test the obtained GCV filter for on-line regularized system identification on a set of 100 Monte Carlo runs. At any run, a random impulse response of length m = 200 is generated using the same mechanism described in [24, Section 7.4]. The generated system is fed with a white noise sequence of unit variance. Note that this type of input is persistently exciting and guarantees the observability of the system (12) (see e.g. [2]), avoiding the covariance windup phenomenon [36]. The standard deviation of the measurement noise is that of the 200 noiseless outputs divided by 10. We assume the system is at rest (the input is equal to zero) prior to the data collection. The performance of the estimator is evaluated by means of the fit (as a function of time)
\[
F_t = 100\left(1 - \frac{\|g^i - \hat{g}^i_t\|}{\|g^i - \mathbb{1}\bar{g}^i\|}\right), \qquad (13)
\]
where g^i is the impulse response generated at the i-th Monte Carlo run, ḡ^i its mean, and ĝ^i_t its estimate (the impulse response estimate is a function of the time instant t).

An example of one of the Monte Carlo runs is given in Fig. 5, which shows the evolution in time of the impulse response estimate and its fit. Fifty measurements suffice for the GCV filter to achieve an appreciable fit. The overall results of the Monte Carlo experiment are summarized in Fig. 6, which depicts the average fit of the impulse responses as a function of time. It can be seen that, after a short transient phase, the fit increases monotonically and achieves a high average value.

5 Conclusions

The novel filter presented here allows one to propagate the GCV score efficiently over time. Hence, unknown parameters entering a state space model can now be estimated in an on-line manner, resorting to one of the most important techniques used for parameter estimation. The asymptotic properties of the GCV filter also provide a new, very efficient way to estimate the regularization parameter, e.g. in smoothing splines. Applications of the new filter have been illustrated using artificial data regarding function estimation and on-line regularized linear system identification. A MATLAB implementation of the GCV filter is available at the web page.
Appendix: derivation of the GCV filter

Without loss of generality, we set the initial system condition to zero, i.e. µ = 0. We also use X_t, Y_t and E_t to denote the column vectors containing the states, the outputs and the measurement noises up to instant t, i.e.
\[
X_t = [x_1^T \ \ldots \ x_t^T]^T, \qquad Y_t = [y_1 \ \ldots \ y_t]^T, \qquad E_t = [e_1 \ \ldots \ e_t]^T.
\]
Then, it holds that Y_t = O_t X_t + E_t, where O_t = diag{C_1, ..., C_t} is the regression matrix built using the measurement matrices C_k, k = 1, ..., t. We also use W_t and V_t to denote the state and output covariance matrices, i.e.
\[
W_t := \mathrm{Var}(X_t), \qquad (14)
\]
\[
V_t := \mathrm{Var}(Y_t) = O_t W_t O_t^T + \gamma I_t, \qquad (15)
\]
where I_t is the t × t identity matrix. Note that, using the above notation, the smoothed estimate of Y_t, already encountered in Section 1, is
\[
\hat{Y}_t = O_t W_t O_t^T V_t^{-1} Y_t, \qquad (16)
\]
so that the degrees of freedom at instant t turn out to be
\[
\delta_t = \mathrm{Tr}(O_t W_t O_t^T V_t^{-1}). \qquad (17)
\]
The following simple lemma is useful for our future developments.

Lemma 2
One has
\[
\delta_t = t - \gamma\,\frac{\partial \log\det V_t}{\partial \gamma}, \qquad (18)
\]
\[
S_t = -\gamma^2\,\frac{\partial\, Y_t^T V_t^{-1} Y_t}{\partial \gamma}. \qquad (19)
\]

Proof:
In view of (15), we start by noticing that
\[
\gamma V_t^{-1} = I_t - O_t W_t O_t^T V_t^{-1}. \qquad (20)
\]
Then, (18) is obtained from the following equalities:
\[
\gamma\,\frac{\partial \log\det V_t}{\partial \gamma} = \gamma\,\mathrm{Tr}\!\left(V_t^{-1}\,\frac{\partial V_t}{\partial \gamma}\right) = \gamma\,\mathrm{Tr}(V_t^{-1}) = \mathrm{Tr}(I_t - O_t W_t O_t^T V_t^{-1}) = t - \delta_t, \qquad (21)
\]
where the last two passages exploit (20) and (17), respectively. Eq. (19) is instead proved as follows:
\[
-\gamma^2\,\frac{\partial\, Y_t^T V_t^{-1} Y_t}{\partial \gamma} = \gamma^2\, Y_t^T V_t^{-2} Y_t = Y_t^T (I_t - O_t W_t O_t^T V_t^{-1})^T (I_t - O_t W_t O_t^T V_t^{-1})\, Y_t = \|Y_t - \hat{Y}_t\|^2 = S_t,
\]
where the second and third equalities exploit (20) and (16), respectively. ∎
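Lemma 2 is easy to sanity-check numerically: the identities (18)-(19) hold for any O_t, any positive semidefinite W_t and any γ > 0, so a random small instance suffices (our sketch; finite differences with step h):

```python
import numpy as np

rng = np.random.default_rng(0)
t, gamma, h = 5, 0.7, 1e-6
O = np.diag(rng.standard_normal(t))                   # diagonal O_t (scalar C_k's)
M = rng.standard_normal((t, t)); W = M @ M.T          # random W_t >= 0
Y = rng.standard_normal((t, 1))

def V(g):                                             # eq. (15)
    return O @ W @ O.T + g * np.eye(t)

# eq. (18): dof via the trace (17) vs. the derivative of log det V_t
dof = np.trace(O @ W @ O.T @ np.linalg.inv(V(gamma)))
dlogdet = (np.linalg.slogdet(V(gamma + h))[1]
           - np.linalg.slogdet(V(gamma - h))[1]) / (2 * h)
assert np.isclose(dof, t - gamma * dlogdet)

# eq. (19): squared residuals vs. the derivative of Y' V^{-1} Y
Yhat = O @ W @ O.T @ np.linalg.solve(V(gamma), Y)     # eq. (16)
quad = lambda g: float(Y.T @ np.linalg.solve(V(g), Y))
ssr = float(((Y - Yhat) ** 2).sum())
assert np.isclose(ssr, -gamma**2 * (quad(gamma + h) - quad(gamma - h)) / (2 * h))
```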
Fig. 5. On-line system identification - Section 4.3. Left: true impulse response (solid line) and GCV estimates obtained at time instants k = 10, 50, 200. Right: fit obtained by GCV as a function of time.
Fig. 6. On-line system identification - Section 4.3. Average of the GCV fits (solid line) ± one standard deviation (dashed line), as a function of time, obtained after a Monte Carlo study. At any of the 100 runs, a new impulse response was randomly generated as detailed in [24, Section 7.4].

The dynamics of the matrix P_k in the GCV filter are regulated by the discrete-time Riccati equation, which can also be rewritten as
\[
P_{k+1} = A_k P_k A_k^T + Q_k - A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} C_k P_k A_k^T.
\]
It is now easy to see that the matrix Σ_k entering the GCV filter is the partial derivative of P_k w.r.t. γ. In fact, differentiating the above recursion, and adopting the notation Σ_k := ∂P_k/∂γ, one has
\[
\begin{aligned}
\Sigma_{k+1} = A_k \Sigma_k A_k^T &- A_k \Sigma_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} C_k P_k A_k^T \\
&- A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} C_k \Sigma_k A_k^T \\
&+ A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-2} C_k P_k A_k^T\,(C_k \Sigma_k C_k^T + 1).
\end{aligned}
\]
Exploiting the definition of K_k and rearranging the terms, the recursive formula (5f) is obtained.

Now, consider the dynamics of the predicted state
\[
\hat{x}_{k+1} = A_k \hat{x}_k + A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} (y_k - C_k \hat{x}_k).
\]
We now show that ζ̂_k is the partial derivative of x̂_k w.r.t. γ. In fact, differentiating the above equation using the correspondence ζ̂_k := ∂x̂_k/∂γ, one obtains
\[
\begin{aligned}
\hat{\zeta}_{k+1} = A_k \hat{\zeta}_k &+ A_k \Sigma_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} (y_k - C_k \hat{x}_k) \\
&- A_k P_k C_k^T (C_k \Sigma_k C_k^T + 1)(C_k P_k C_k^T + \gamma)^{-2} (y_k - C_k \hat{x}_k) \\
&- A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} C_k \hat{\zeta}_k.
\end{aligned}
\]
This, combined with the definition of K_k, leads to the recursive formula (5d).

Now, exploiting well-known properties of the innovation sequence {y_k − C_k x̂_k}_{k=1}^t, whose variances are {C_k P_k C_k^T + γ}_{k=1}^t, and recalling that Σ_k := ∂P_k/∂γ, we have
\[
\frac{\partial \log\det V_t}{\partial \gamma} = \sum_{k=1}^t \frac{\partial \log(C_k P_k C_k^T + \gamma)}{\partial \gamma} = \sum_{k=1}^t \frac{C_k \Sigma_k C_k^T + 1}{C_k P_k C_k^T + \gamma}.
\]
Then, the recursive formula (5g) for the degrees of freedom δ_k is obtained by combining the above equation and (18).

Still using properties of the innovation sequence, and recalling that ζ̂_k := ∂x̂_k/∂γ, one also has
\[
-\frac{\partial\, Y_t^T V_t^{-1} Y_t}{\partial \gamma} = \sum_{k=1}^t \left[ \frac{(C_k \Sigma_k C_k^T + 1)\,(y_k - C_k \hat{x}_k)^2}{(C_k P_k C_k^T + \gamma)^2} + \frac{2\, C_k \hat{\zeta}_k\,(y_k - C_k \hat{x}_k)}{C_k P_k C_k^T + \gamma} \right].
\]
This equation, in combination with (19), proves the correctness of the update rule (5h) for the sum of squared residuals S_k and completes the derivation.

Proof of Proposition 1: If the system (3) is stabilizable and detectable, then standard properties of the algebraic Riccati equation (6) ensure that P̄ is symmetric and positive semidefinite and that the Kalman filter, corresponding to (5a), (5c) and (5e), is asymptotically stable (see [1, p. 77]). Because the Kalman filter is asymptotically stable, the matrix A − K̄C has all its eigenvalues inside the unit circle, ensuring that (7) admits a unique positive semidefinite solution [1, p. 67]. The filter state transition matrix which regulates the dynamics of x̂_k and ζ̂_k is
\[
\begin{pmatrix} A - K_k C & 0 \\ -G_k C & A - K_k C \end{pmatrix},
\]
and so it also has all eigenvalues inside the unit circle, at least for sufficiently large k. ∎
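The two sensitivity relations at the heart of the derivation, Σ_k = ∂P_k/∂γ and ζ̂_k = ∂x̂_k/∂γ, can likewise be checked by finite differences (our sketch; the test model, data and step h are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, h = 2, 30, 1e-6
A = np.array([[1.0, 0.0], [0.5, 1.0]])   # arbitrary observable test model
C = np.array([[0.0, 1.0]])
Q = 0.1 * np.eye(n)
gamma, P1 = 0.5, np.eye(n)
y = rng.standard_normal(T)

def run(g):
    """Kalman recursions (5a), (5c), (5e) plus the GCV-filter sensitivity
    recursions (5b), (5d), (5f), all evaluated at parameter g."""
    x, P = np.zeros((n, 1)), P1.copy()
    zeta, Sigma = np.zeros((n, 1)), np.zeros((n, n))
    for yk in y:
        s = float(C @ P @ C.T) + g
        K = A @ P @ C.T / s
        G = (A @ Sigma @ C.T - K * (float(C @ Sigma @ C.T) + 1.0)) / s
        eps = yk - float(C @ x)
        x, zeta = A @ x + K * eps, (A - K @ C) @ zeta + G * eps
        P = (A - K @ C) @ P @ (A - K @ C).T + g * K @ K.T + Q
        Sigma = (A - K @ C) @ Sigma @ (A - K @ C).T + K @ K.T
    return x, P, zeta, Sigma

_, _, zeta, Sigma = run(gamma)
xp, Pp, _, _ = run(gamma + h)
xm, Pm, _, _ = run(gamma - h)
assert np.allclose(Sigma, (Pp - Pm) / (2 * h), atol=1e-4)  # Sigma = dP/dgamma
assert np.allclose(zeta, (xp - xm) / (2 * h), atol=1e-4)   # zeta = dx_hat/dgamma
```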
References

[1] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, N.J., USA, 1979.
[2] B. D. O. Anderson and J. B. Moore. Detectability and stabilizability of time-varying discrete-time linear systems. SIAM Journal on Control and Optimization, 19(1):20–32, 1981.
[3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
[4] C. F. Ansley and R. Kohn. Efficient generalized cross-validation for state space models. Biometrika, 74(1):139–148, 1987.
[5] M. Bertero. Linear inverse and ill-posed problems. Advances in Electronics and Electron Physics, 75:1–120, 1989.
[6] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - revisited. Automatica, 48(8):1525–1535, 2012.
[7] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31:377–403, 1979.
[8] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–50, 2000.
[9] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems (NIPS), 2013.
[10] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. London: Chapman and Hall, 1996.
[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995.
[12] G. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
[13] T. J. Hastie, R. J. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, Canada, 2001.
[14] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
[15] M. G. Hutchinson and F. R. De Hoog. Smoothing data with spline functions. Numerische Mathematik, 47:99–106, 1985.
[16] A. Jazwinski. Stochastic Processes and Filtering Theory. Dover, 1970.
[17] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82:35–45, 1960.
[18] R. Kohn and C. F. Ansley. A fast algorithm for signal extraction, influence and cross validation in state space models. Biometrika, 76(1):65–79, 1989.
[19] L. Ljung and T. Kailath. A unified approach to smoothing formulas. Automatica, 12(2):147–157, 1976.
[20] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4:415–447, 1992.
[21] G. De Nicolao, G. Ferrari Trecate, and G. Sparacino. Fast spline smoothing via spectral factorization concepts. Automatica, 36:1733–1739, 2000.
[22] B. Ninness and S. Henriksen. Bayesian system identification via MCMC techniques. Automatica, 46(1):40–51, 2010.
[23] H. Ohlsson, F. Gustafsson, L. Ljung, and S. Boyd. State smoothing by sum-of-norms regularization. Pages 2880–2885. IEEE, 2010.
[24] G. Pillonetto, T. Chen, A. Chiuso, G. De Nicolao, and L. Ljung. Regularized linear system identification using atomic, nuclear and kernel-based norms: the role of the stability constraint. Automatica, 69:137–149, 2016.
[25] G. Pillonetto and A. Chiuso. Fast computation of smoothing splines subject to equality constraints. Automatica, 45(12):2842–2849, 2009.
[26] G. Pillonetto and A. Chiuso. Tuning complexity in regularized kernel-based regression and linear system identification. Automatica, 58(2):106–117, 2015.
[27] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica, 47(2):291–305, 2011.
[28] G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, 2010.
[29] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: a survey. Automatica, 50(3):657–682, 2014.
[30] G. Pillonetto and M. P. Saccomani. Input estimation in nonlinear dynamic systems using differential algebra concepts. Automatica, 42:2117–2129, 2006.
[31] T. Poggio and F. Girosi. Networks for approximation and learning. In Proceedings of the IEEE, volume 78, pages 1481–1497, 1990.
[32] H. E. Rauch, F. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1145–1150, 1965.
[33] J. Rice. Choice of smoothing parameter in deconvolution problems. Contemporary Mathematics, 59:137–151, 1986.
[34] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, 2001.
[35] B. W. Silverman. Some aspects of the spline approach to nonparametric regression curve fitting. Journal of the Royal Statistical Society, 47:1–52, 1985.
[36] B. Stenlund and F. Gustafsson. Avoiding windup in recursive parameter estimation. Pages 148–153, 2002.
[37] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, Philadelphia, 2005.
[38] G. Wahba. Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B (Methodological), 45(1):133–150, 1983.
[39] G. Wahba. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. The Annals of Statistics, 13(4):1378–1402, 1985.
[40] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
[41] H. L. Weinert. Fast Compact Algorithms and Software for Spline Smoothing. Springer, 2013.