The Generalized Cross Validation Filter
Giulio Bottegal (a) and Gianluigi Pillonetto (b)
(a) Department of Electrical Engineering, TU Eindhoven, Eindhoven, The Netherlands (e-mail: [email protected])
(b) Department of Information Engineering, University of Padova, Padova, Italy (e-mail: [email protected])
Abstract
Generalized cross validation (GCV) is one of the most important approaches used to estimate parameters in the context of inverse problems and regularization techniques. A notable example is the determination of the smoothness parameter in splines. When the data are generated by a state space model, as in the spline case, efficient algorithms are available to evaluate the GCV score with complexity that scales linearly in the data set size. However, these methods are not amenable to on-line applications since they rely on forward and backward recursions. Hence, if the objective has been evaluated at time t − 1, then O(t) operations are needed to update the GCV score. In this paper we instead show that the update cost is O(1), thus paving the way to the on-line use of GCV. This result is obtained by deriving the novel GCV filter, which extends the classical Kalman filter equations to efficiently propagate the GCV score over time. We also illustrate applications of the new filter in the context of state estimation and on-line regularized linear system identification.
Key words:
Kalman filtering; generalized cross-validation; on-line system identification; inverse problems; regularization; smoothness parameter; splines
★ This research has been partially supported by the MIUR FIRB project RBFR12M3AC (Learning meets time: a new computational approach to learning in dynamic systems) and by the Progetto di Ateneo CPDA147754/14 (New statistical learning approach for multi-agents adaptive estimation and coverage control). This paper was not presented at any IFAC meeting. Corresponding author Gianluigi Pillonetto, Ph. +390498277607.

1 Introduction

Linear state space models assume the form
\[
x_{k+1} = A_k x_k + \omega_k, \qquad y_k = C_k x_k + e_k, \qquad (1)
\]
where x_k is the state at instant k, y_k is the output, while ω_k and e_k are random noises. The matrices A_k and C_k regulate the state transition and the observation model at instant k. This kind of model plays a central role in the analysis and design of discrete-time systems [17]. Applications abound and include tracking, navigation and biomedicine.

In on-line state estimation, the problem is the reconstruction of the values of x_k from measurements of y_k collected over time. When the matrices A_k and C_k and the noise covariances are known, the optimal linear estimates are efficiently returned by the classical Kalman filter [1]. However, in many circumstances there can be unknown model parameters that also need to be inferred from data in an on-line manner, e.g. variance components or entries of the transition/observation matrices. One can interpret such parameters as additional states. Then, the extended Kalman filter [16] or more sophisticated stochastic techniques, such as particle filters and Markov chain Monte Carlo [10,22,3,9], can be used to track the filtered posterior. Another technique consists of propagating the marginal likelihood of the unknown parameters via a bank of filters [1, Ch. 10]. In this paper, we will show that another viable alternative is the use of an approach known in the literature as generalized cross validation (GCV) [12].

In the literature on statistics and inverse problems, GCV is widely used in off-line contexts to estimate unknown parameters entering regularized estimators [5,37,40]. This approach was initially used to tune the smoothness parameter in ridge regression and smoothing splines [14,12,33]. GCV is now also popular in machine learning, where it is used to improve the generalization capability of regularized kernel-based approaches [34,8], such as regularization networks, which contain spline regression as a special case [31,11].

To introduce GCV in our state space context, we first recall that smoothing splines are closely linked to state space models of m-fold integrated Wiener processes [25]; it then appears natural to extend GCV to general state space models. To this end, assume that measurements y_k have been collected up to instant t and stacked in the vector Y_t. Denote by Ŷ_t the vector containing the optimal linear output estimate and use H_t to denote the influence matrix satisfying Ŷ_t = H_t Y_t. Then, the parameter estimates achieved by GCV minimize
\[
\mathrm{GCV}_t = \frac{S_t}{t\,(1 - \delta_t/t)^2}, \qquad (2)
\]
where S_t is the sum of squared residuals, i.e. S_t = ‖Ŷ_t − Y_t‖², and δ_t are the degrees of freedom given by the trace of H_t, i.e. δ_t = Tr(H_t). In the objective (2), the term S_t accounts for the goodness of fit, while δ_t assumes values in [0, t] and measures model complexity.
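To make (2) concrete, the following minimal Python sketch (our illustration, not part of the paper; the function name is ours) evaluates the GCV score from its batch ingredients, the stacked measurements Y_t and the influence matrix H_t:

```python
import numpy as np

def gcv_score(H_t: np.ndarray, Y_t: np.ndarray) -> float:
    """Batch GCV objective (2): S_t / (t * (1 - delta_t / t)**2)."""
    t = Y_t.size
    Y_hat = H_t @ Y_t                        # smoothed output estimate
    S_t = float(np.sum((Y_hat - Y_t) ** 2))  # sum of squared residuals
    delta_t = float(np.trace(H_t))           # degrees of freedom
    return S_t / (t * (1.0 - delta_t / t) ** 2)
```

Evaluated this way, the score costs at least O(t²) per time step once H_t is formed; the contribution of the paper is a recursion delivering the same value with O(1) work per new sample.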
In fact, in nonparametric regularized estimation, the degrees of freedom δ_t can be seen as the counterpart of the number of parameters entering a parametric model [20,13,26].

GCV is supported by important asymptotic results. Also, for finite data set size it often turns out to be a good approximation of the output mean squared error [7]. It is worth stressing that such properties have been derived without postulating the correctness of the prior models describing the output data [38,39]. In control, this means that GCV can compensate for possible modeling mismatch affecting the state space description.

Despite these nice features, the use of GCV within the control community appears limited, in particular in on-line contexts. One important reason is the following. For state space models, there exist efficient algorithms which, for a given parameter vector, return the GCV score with O(t) operations [18,4]; see also [15,35,21] for procedures dedicated to smoothing splines. But all of these techniques are not suited to on-line computations, since they involve forward and backward recursions. Hence, if GCV_{t−1} is available and new data arrive at time t, another O(t) operations are needed to achieve GCV_t. In this paper, we will instead show that the update cost is O(1), thus paving the way to a more pervasive on-line use of GCV. This result is obtained by deriving the novel GCV filter, which consists of an extension of the classical Kalman equations. Thanks to it, one can run a bank of filters (possibly in parallel) to efficiently propagate GCV over a grid of parameter values. This makes the proposed GCV filter particularly suitable for applications where a measurement model admits a state space description with dynamics depending on few parameters; see e.g. the next section for an application in numerical differentiation. (The components of Ŷ_t are given by C x̂_{k|t}, where the smoothed state x̂_{k|t} can be obtained for any t with O(t) operations by a fixed-interval Kalman smoothing filter [32,19].) In this framework, an implementation of the GCV filter via a bank of parallel filters turns out to be computationally attractive.

The paper is organized as follows. In Section 2, first some additional notation is introduced; then the GCV filter is presented. Its asymptotic properties are discussed in Section 3. In Section 4 we illustrate some applications, including smoothing splines and on-line regularized linear system identification with the stable spline kernel used as stochastic model for the impulse response [28,29]. Conclusions end the paper, while the correctness of the GCV filter is shown in the Appendix.

2.1 The measurement model

First, we provide full details about our measurement model. We use x ∼ (a, b) to denote a random vector x with mean a and covariance matrix b. Then, our state space model is defined by
\[
\begin{aligned}
x_{k+1} &= A_k x_k + \omega_k && (3a)\\
y_k &= C_k x_k + e_k, \quad k = 1, 2, \ldots && (3b)\\
x_1 &\sim (\mu, P_1) && (3c)\\
\omega_k &\sim (0, Q_k) && (3d)\\
e_k &\sim (0, \gamma) && (3e)
\end{aligned}
\]
where the initial condition x_1 and all the noises {ω_k, e_k}_{k=1,2,...} are mutually uncorrelated. We do not specify any particular distribution for these variables, since the GCV score does not depend on the particular noise distribution. (Of course, GCV may turn out not to be effective if the noises are highly non-Gaussian. Different approaches, like particle filters, should instead be used if linear estimators perform poorly due e.g. to multimodal noise distributions.) If x_1, ω_k, e_k are Gaussian, then the Kalman filter provides the optimal state estimate in the mean-square sense. In the other cases, the Kalman filter corresponds to the best linear state estimator [1].
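As a small aside (our illustration; all names are ours), model (3) is straightforward to simulate, e.g. with Gaussian noises and the scalar output assumed below:

```python
import numpy as np

def simulate(A, C, Q, gamma, mu, P1, t, rng=None):
    """Draw states x_1..x_t and scalar outputs y_1..y_t from model (3),
    here with Gaussian noises (the GCV score itself is distribution-free)."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    x = rng.multivariate_normal(mu, P1)
    xs, ys = [], []
    for _ in range(t):
        xs.append(x)
        ys.append(float(C @ x) + np.sqrt(gamma) * rng.standard_normal())
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    return np.array(xs), np.array(ys)
```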
In addition, just to simplify notation, the measurements y_k are assumed scalar, so that γ represents the noise variance. We assume that some of the parameters in (3) may be unknown; they may enter A_k, C_k, Q_k and P_1, but we do not stress this possible dependence, to make the formulas more readable. The matrix P_1 is assumed to be independent of γ. The parameter γ is typically unknown, being connected to the ratio between the measurement noise variance and the variance of the driving noise. It corresponds to the regularization parameter in the smoothing-spline context described in the example below.

Example 1 (Smoothing splines [30])
Function estimation and numerical differentiation are often required in various applications. These include input reconstruction in nonlinear dynamic systems, as described e.g. in [30]. Assume that one is interested in determining the first m derivatives of a continuous-time signal measured with non-uniform sampling periods T_k. Modeling the signal as an m-fold integrated Wiener process, one obtains the stochastic interpretation of the m-th order smoothing splines [40]. In particular, one can use (3) to represent the signal dynamics as follows:
\[
A_k = \begin{pmatrix}
1 & 0 & \cdots & 0\\
T_k & 1 & \cdots & 0\\
\frac{T_k^2}{2} & T_k & \ddots & \vdots\\
\vdots & \vdots & \ddots & 0\\
\frac{T_k^m}{m!} & \frac{T_k^{m-1}}{(m-1)!} & \cdots & 1
\end{pmatrix},
\qquad
C_k = \begin{pmatrix} 0 & \cdots & 0 & 1 \end{pmatrix},
\]
\[
[Q_k]_{ij} = \frac{T_k^{\,i+j-1}}{(i-1)!\,(j-1)!\,(i+j-1)}.
\]
Such a model depends on the measurement noise variance γ, making this application particularly suited for the GCV filter.
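A direct transcription of Example 1 into code (our sketch; the helper name is ours) builds the matrices for a given order m and sampling period T_k:

```python
import numpy as np
from math import factorial

def spline_model(m: int, T: float):
    """State space matrices of the m-fold integrated Wiener process
    sampled with period T (Example 1). State dimension is m + 1;
    the last state component is the measured signal."""
    n = m + 1
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            A[i, j] = T ** (i - j) / factorial(i - j)
    C = np.zeros((1, n)); C[0, -1] = 1.0
    Q = np.array([[T ** (i + j - 1) / (factorial(i - 1) * factorial(j - 1) * (i + j - 1))
                   for j in range(1, n + 1)] for i in range(1, n + 1)])
    return A, C, Q
```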
2.2 The GCV filter

The GCV filter equations are now reported. Below, x̂_k denotes the optimal linear one-step-ahead state prediction, having covariance P_k. Its dynamics are regulated by the classical Kalman filter via (5a), (5c) and the Riccati equation (5e).

GCV filter

Initialization:
\[
\hat{x}_1 = \mu, \qquad \hat{\zeta}_1 = 0, \qquad \Sigma_1 = 0, \qquad (4a\text{--}4b)
\]
with P_1 given by (3c), and
\[
\delta_1 = 1 - \gamma\,(C_1 P_1 C_1^T + \gamma)^{-1}, \qquad (4c)
\]
\[
S_1 = \gamma^2\,(y_1 - C_1\mu)^2\,(C_1 P_1 C_1^T + \gamma)^{-2}, \qquad (4d)
\]
\[
\mathrm{GCV}_1 = S_1\,(1 - \delta_1)^{-2}. \qquad (4e)
\]

Update:
\[
\begin{aligned}
K_k &= A_k P_k C_k^T\,(C_k P_k C_k^T + \gamma)^{-1} && (5a)\\
G_k &= \frac{A_k \Sigma_k C_k^T - K_k\,(C_k \Sigma_k C_k^T + 1)}{C_k P_k C_k^T + \gamma} && (5b)\\
\hat{x}_{k+1} &= A_k \hat{x}_k + K_k\,(y_k - C_k \hat{x}_k) && (5c)\\
\hat{\zeta}_{k+1} &= (A_k - K_k C_k)\,\hat{\zeta}_k + G_k\,(y_k - C_k \hat{x}_k) && (5d)\\
P_{k+1} &= (A_k - K_k C_k)\,P_k\,(A_k - K_k C_k)^T + \gamma K_k K_k^T + Q_k && (5e)\\
\Sigma_{k+1} &= (A_k - K_k C_k)\,\Sigma_k\,(A_k - K_k C_k)^T + K_k K_k^T && (5f)\\
\delta_{k+1} &= \delta_k + 1 - \gamma\,\frac{C_{k+1}\Sigma_{k+1}C_{k+1}^T + 1}{C_{k+1}P_{k+1}C_{k+1}^T + \gamma} && (5g)\\
S_{k+1} &= S_k + \gamma^2\,\frac{(C_{k+1}\Sigma_{k+1}C_{k+1}^T + 1)\,(y_{k+1} - C_{k+1}\hat{x}_{k+1})^2}{(C_{k+1}P_{k+1}C_{k+1}^T + \gamma)^2} + 2\gamma^2\,\frac{C_{k+1}\hat{\zeta}_{k+1}\,(y_{k+1} - C_{k+1}\hat{x}_{k+1})}{C_{k+1}P_{k+1}C_{k+1}^T + \gamma} && (5h)\\
\mathrm{GCV}_{k+1} &= \frac{(k+1)\,S_{k+1}}{(k+1-\delta_{k+1})^2} && (5i)
\end{aligned}
\]
Fig. 1. GCV filter: in the bottom, the nonlinear blocks f and g are defined, respectively, by (5h) and (5i), while δ_{k+1} can be recursively computed by (5g).

It is apparent that the difference w.r.t. the classical Kalman filter is the presence of the additional state ζ̂_k, of the same dimension as x̂_k. Comparing (5c) and (5d), one can see that A_k is replaced by A_k − K_kC_k. In addition, the dynamics of ζ̂_k are still driven by the innovation y_k − C_kx̂_k, but the Kalman gain K_k given by (5a) is substituted by the G_k defined by (5b). In turn, such gain depends on Σ_k, which is propagated over time through a modified version of the Riccati equation given by (5f). The GCV filter is graphically depicted in Fig. 1.
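The recursion is compact enough to transcribe directly. Below is a Python sketch of the whole filter for the time-invariant, scalar-output case (our illustration; the paper links a MATLAB implementation, and all names here are ours):

```python
import numpy as np

def gcv_filter(A, C, Q, gamma, mu, P1, y):
    """GCV filter, eqs. (4a)-(5i): propagates the Kalman prediction x_hat,
    zeta = d x_hat / d gamma, P, Sigma = dP / d gamma, the degrees of
    freedom delta and the residual sum S, returning GCV_1, GCV_2, ...
    C has shape (1, n); y is a 1-D array of scalar measurements."""
    n = A.shape[0]
    x = np.asarray(mu, dtype=float).reshape(n, 1)
    P = np.asarray(P1, dtype=float).copy()
    zeta, Sigma = np.zeros((n, 1)), np.zeros((n, n))

    s = float(C @ P @ C.T) + gamma          # innovation variance at k = 1
    eps = float(y[0] - C @ x)               # innovation at k = 1
    dof = 1.0 - gamma / s                   # (4c)
    ssr = gamma**2 * eps**2 / s**2          # (4d)
    gcv = [ssr / (1.0 - dof)**2]            # (4e)

    for k in range(len(y) - 1):
        K = A @ P @ C.T / s                                             # (5a)
        G = (A @ Sigma @ C.T - K * (float(C @ Sigma @ C.T) + 1.0)) / s  # (5b)
        x = A @ x + K * eps                                             # (5c)
        zeta = (A - K @ C) @ zeta + G * eps                             # (5d)
        P = (A - K @ C) @ P @ (A - K @ C).T + gamma * K @ K.T + Q       # (5e)
        Sigma = (A - K @ C) @ Sigma @ (A - K @ C).T + K @ K.T           # (5f)
        s = float(C @ P @ C.T) + gamma      # innovation variance at k + 1
        r = float(C @ Sigma @ C.T) + 1.0
        eps = float(y[k + 1] - C @ x)       # innovation at k + 1
        dof += 1.0 - gamma * r / s                                      # (5g)
        ssr += gamma**2 * r * eps**2 / s**2 \
             + 2.0 * gamma**2 * float(C @ zeta) * eps / s               # (5h)
        t = k + 2                           # samples processed so far
        gcv.append(t * ssr / (t - dof)**2)                              # (5i)
    return np.array(gcv)
```

Each pass through the loop touches only fixed-size matrices, which is exactly the O(1) per-sample cost claimed in the Introduction.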
3 Asymptotic properties of the GCV filter

We first consider the case where the state-space model (3) is time-invariant, i.e. the matrices A_k, C_k and Q_k are constant in k (and the subscript k is dropped). The structure of the equations governing the GCV filter permits one to easily understand its asymptotic behaviour. In particular, exploiting well-known properties of the Kalman filter [1], the following result is obtained (see the Appendix for a proof).

Proposition 1 Assume that the system (3) is time-invariant, stabilizable and detectable. Then, for any P_1, we have
\[
\lim_{k \to \infty} P_k = \bar{P} \qquad \text{and} \qquad \lim_{k \to \infty} \Sigma_k = \bar{\Sigma},
\]
where P̄ and Σ̄ are the unique symmetric and positive semidefinite matrices solving, respectively, the algebraic Riccati equation
\[
\bar{P} = A\bar{P}A^T + Q - A\bar{P}C^T(C\bar{P}C^T + \gamma)^{-1}C\bar{P}A^T \qquad (6)
\]
and the Lyapunov equation
\[
\bar{\Sigma} = (A - \bar{K}C)\,\bar{\Sigma}\,(A - \bar{K}C)^T + \bar{K}\bar{K}^T, \qquad (7)
\]
where K̄ = A P̄ C^T (C P̄ C^T + γ)^{−1}. In addition, all the eigenvalues of the matrix A − K̄C are inside the unit circle, so that the (asymptotic) GCV filter is asymptotically stable.
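Proposition 1 makes the steady-state quantities computable with off-the-shelf solvers. A sketch (ours, assuming SciPy; the usual duality A → Aᵀ, C → Cᵀ casts the filtering Riccati equation (6) in the form accepted by solve_discrete_are):

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def asymptotic_gcv_gains(A, C, Q, gamma):
    """Solve (6) for P_bar and (7) for Sigma_bar, then return the
    steady-state gains K_bar and G_bar of the asymptotic GCV filter."""
    R = np.array([[gamma]])
    P = solve_discrete_are(A.T, C.T, Q, R)   # filtering ARE via duality
    s = float(C @ P @ C.T) + gamma
    K = A @ P @ C.T / s                      # steady-state Kalman gain
    Sigma = solve_discrete_lyapunov(A - K @ C, K @ K.T)   # eq. (7)
    G = (A @ Sigma @ C.T - K * (float(C @ Sigma @ C.T) + 1.0)) / s
    return P, Sigma, K, G
```

With P̄ and Σ̄ in hand, the asymptotic smoothing ratio discussed below follows at the cost of two scalar products.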
Properties of the GCV filter can also be characterized in the time-varying case. In particular, following Section 2 of [2], one can first replace stabilizability and detectability with the assumptions of uniform stabilizability and detectability. Then, following the same reasoning contained in the proof of Proposition 1, Theorem 5.3 in [2] ensures the uniform exponential stability of the GCV filter.
Proposition 1 also leads to a new computationally appealing approach to tune the regularization parameter γ, e.g. in smoothing splines. In particular, consider the scenario described in [21], where an unknown function has to be reconstructed by spline regression from equally spaced noisy samples. When the assumptions in Proposition 1 hold true, it is possible to compute off-line the gains
\[
\bar{K} = A\bar{P}C^T(C\bar{P}C^T + \gamma)^{-1}, \qquad \bar{G} = \frac{A\bar{\Sigma}C^T - \bar{K}\,(C\bar{\Sigma}C^T + 1)}{C\bar{P}C^T + \gamma}.
\]
Then, one can exploit the asymptotic (suboptimal) GCV filter, with the guarantee that the objective values will converge to the exact GCV scores as k increases. Moreover, in off-line contexts this approach appears computationally appealing even when compared to the many GCV-based spline algorithms developed in the last decades [41,40,18,15,35]. Furthermore, [21] defined the asymptotic smoothing ratio as
\[
\lim_{k \to \infty} \frac{\delta_k}{k},
\]
also providing an interesting closed-form expression for the cubic spline case, useful to further speed up the tuning of γ. For the general case, we notice that Proposition 1 also gives a numerical procedure to compute the asymptotic smoothing ratio (for different values of γ). In fact, if (3) is stabilizable and detectable, combining (5g) and Proposition 1 we obtain
\[
\lim_{k \to \infty} \frac{\delta_k}{k} = 1 - \gamma\,\frac{C\bar{\Sigma}C^T + 1}{C\bar{P}C^T + \gamma},
\]
with P̄ and Σ̄ defined, respectively, in (6) and (7).

4 Applications

4.1 Cubic spline example

We consider the reconstruction of the function exp(sin 8t), taken from [29], from samples collected at 400 instants t_i randomly generated from a uniform distribution on [0, 1]. The measurement noise is Gaussian with standard deviation equal to 0.3. We model the signal as the two-fold integral of white noise, using cubic smoothing splines [40]. We use Z_t to denote the vector containing the noiseless outputs (corresponding to the second entries of {x_k}_{k=1}^t). We denote the average of Z_t by the scalar quantity Z̄_t. Then, the performance measure is the percentage fit
\[
F_t = 100\left(1 - \frac{\|Z_t - \hat{Z}_t\|}{\|Z_t - \mathbb{1}\bar{Z}_t\|}\right), \qquad (8)
\]
where Ẑ_t is the estimate of Z_t obtained through the Kalman smoother [1], and 𝟙 is a column vector with all entries equal to 1. The following two estimators Ẑ_t are tested:

• GCV: this approach estimates γ exploiting the GCV filter. More specifically, the GCV score is propagated over a grid containing 100 values of γ, logarithmically spaced. Then, at any t the estimate Ẑ_t is computed by a Kalman smoothing filter which exploits the γ_t that minimizes GCV_t.

• Oracle: the same as GCV, except that γ_t maximizes the fit F_t. Note that this approach is not implementable in practice, since it uses an oracle that knows the noiseless (unavailable) output Z_t.

The left panel of Fig. 2 displays the noiseless output (solid line), the measurements (◦) and the function estimate returned by GCV (dashed line), which turns out to be close to the true function. The right panel also shows that the GCV filter is able to track well, in an on-line manner, the time course of the γ returned by Oracle.
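The GCV estimator above is simply a bank of GCV filters over a γ grid. A sketch of that loop, reusing the gcv_filter function from the earlier snippet (our illustration; the grid bounds below are arbitrary placeholders):

```python
import numpy as np

def gcv_over_grid(A, C, Q, mu, P1, y, gammas):
    """Run one GCV filter per candidate gamma (a bank of filters) and
    return, for each time t, the gamma minimizing the GCV score."""
    scores = np.stack([gcv_filter(A, C, Q, g, mu, P1, y) for g in gammas])
    best = np.argmin(scores, axis=0)   # index of the minimizer at each t
    return np.asarray(gammas)[best]

# example grid: 100 logarithmically spaced candidates (bounds arbitrary)
gammas = np.logspace(-3, 3, 100)
```

Since the filters are mutually independent, the bank parallelizes trivially, which is the implementation suggested in the paper.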
Fig. 2. Cubic spline example - Section 4.1. Left: noiseless output (solid line), measurements (◦) and cubic spline estimate obtained by GCV (dashed line). Right: estimated regularization parameter γ_t, as a function of time, obtained by Oracle maximizing the fit F_t in (8) (solid line) and by GCV minimizing the score GCV_t computed by (5i) (dashed line).
Fig. 3. Model mismatch example - Section 4.2. Left: noiseless output (solid line), measurements (◦) and smoothed output obtained by GCV (dashed line). Right: estimated noise variance γ_t, as a function of time, obtained by Oracle (solid line) and by GCV (dashed line).

4.2 Model mismatch example

We consider the following discrete-time model (see also [23, Section 6]):
\[
x_{k+1} = A\,x_k + \omega_k, \qquad (9a)
\]
\[
y_k = C\,x_k + e_k, \qquad (9b)
\]
with zero-mean Gaussian noises ω_k ∼ (0, Q) and e_k ∼ (0, γ), for fixed numerical values of A, C, Q and γ. We will use data generated by this model to test the capability of the GCV filter to compensate for mismatches between the true system and the model used to track the data, by tuning γ in an on-line manner. As in the previous example, Z_t is the vector containing the first t noiseless outputs (which are the second entries of {x_k}_{k=1}^t) and the performance measure is (8). The following three estimators Ẑ_t are tested:

• GCV: this approach uses a wrong transition covariance Q̃, given by Q plus a fixed perturbation matrix, and then estimates γ exploiting the GCV filter over a grid with 100 logarithmically spaced values.
Fig. 4. Model mismatch example - Section 4.2. Boxplots of the 100 fits F_t, as defined in (8), obtained after a Monte Carlo study by the estimators Oracle, GCV and Nominal.

Then, at any t the estimate Ẑ_t is computed by a Kalman smoothing filter which exploits the γ_t minimizing GCV_t.

• Oracle: the same as GCV, except that γ_t maximizes the fit F_t in (8).

• Nominal: the estimate Ẑ_t is returned by a Kalman smoothing filter defined by the nominal wrong covariance Q̃ and γ = 30. This approach does not try to compensate for model mismatch, since it does not tune γ from data.
The left panel of Fig. 3 displays the noiseless output (solid line), the measurements (◦) and the smoothed output obtained by GCV (dashed line), which appears close to the truth. In the right panel, one can also see the trajectory in time of the γ_t returned by Oracle and by GCV. One can appreciate the capability of the GCV filter to compensate for the modelling mismatch by tracking a regularization parameter leading to a high fit F_t.

To further support these findings, we have also performed a Monte Carlo study of 100 runs. During each run, 200 output measurements are generated using (9) and Z_t is reconstructed by GCV, Oracle and Nominal. From the MATLAB boxplots of the 100 fits (8) reported in Fig. 4, the robustness of GCV emerges clearly. Its performance is in fact very close to that of the oracle-based procedure.

4.3 On-line system identification

Now, we consider a linear system identification problem where the aim is to estimate an unknown impulse response from input-output measurements. Assuming a high order FIR model, the outputs collected up to instant t, stacked in the (column) vector Y_t, satisfy
\[
Y_t = \Phi_t\,g + E_t, \qquad (10)
\]
where g denotes the m-dimensional vector whose components are the impulse response coefficients, the regression matrix Φ_t is defined by the input samples, and E_t is the measurement noise vector, which we assume white and Gaussian.

To solve this problem, we use the kernel-based approach originally proposed in [28,27,6]. The impulse response estimate is given by
\[
\arg\min_{g \in \mathbb{R}^m} \; \|Y_t - \Phi_t g\|^2 + \gamma\, g^T P^{-1} g. \qquad (11)
\]
It makes use of the regularization matrix P induced by the so-called first-order spline kernel, i.e. its (i, j) entry is
\[
[P]_{ij} = \alpha^{\max(i,j)}, \qquad 0 \le \alpha < 1,
\]
where α is a hyperparameter which regulates the rate of decay to zero of the components of g. We refer the reader also to [29] for further details on the advantages of (11) over classical parametric approaches.

In real applications, both γ and α are unknown. Since we consider a situation where g has to be estimated on-line, we estimate these two hyperparameters by the GCV filter. To do that, we first notice that (11) corresponds to the maximum a posteriori (MAP) estimator of g and, under the stated Gaussian assumptions, also to its minimum mean-square error (MMSE) estimator. The estimate of g can then be computed using the Kalman filter. In fact, the state space model is
\[
x_{k+1} = x_k, \qquad y_k = C_k x_k + e_k, \quad k = 1, 2, \ldots, \qquad (12)
\]
\[
x_1 \sim (0, P), \qquad e_k \sim (0, \gamma),
\]
where the state vector is the stochastic model for g (with x_k = g for any k), y_k and e_k are the k-th entries of Y_t and E_t, respectively, and C_k is the k-th row of Φ_t.

We define a grid in the plane (γ, α) by taking 7 values of α together with 20 logarithmically spaced values of γ; in this way, the grid consists of 140 points. We run 140 GCV filters in parallel, each corresponding to one of the points; when a new measurement y_k (and C_k) is available, we update the GCV score of each pair (γ, α), selecting the one giving the minimum score.
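For reference, the kernel matrix and the batch solution of (11) can be written down in a few lines (our sketch; helper names are ours, and the solve uses the standard matrix-inversion-lemma form to avoid inverting P):

```python
import numpy as np

def first_order_spline_P(m: int, alpha: float) -> np.ndarray:
    """First-order spline kernel matrix: [P]_ij = alpha**max(i, j),
    with 1-based indices i, j as in the paper."""
    idx = np.arange(1, m + 1)
    return alpha ** np.maximum.outer(idx, idx)

def regularized_fir(Phi, Y, gamma, alpha):
    """Batch solution of (11): g = (Phi'Phi + gamma P^{-1})^{-1} Phi'Y,
    computed in the equivalent form g = P Phi'(Phi P Phi' + gamma I)^{-1} Y."""
    t, m = Phi.shape
    P = first_order_spline_P(m, alpha)
    return P @ Phi.T @ np.linalg.solve(Phi @ P @ Phi.T + gamma * np.eye(t), Y)
```

On-line, the same estimate is produced by the Kalman filter for model (12), while the GCV filter supplies the score used to pick (γ, α).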
We test the obtained GCV filter for on-line regularized system identification on a set of 100 Monte Carlo runs. At any run, a random impulse response of length m = 200 is generated using the same mechanism described in [24, Section 7.4]. The generated system is fed with a white noise sequence of unit variance. Note that this type of input is persistently exciting and guarantees the observability of the system (12) (see e.g. [2]), avoiding the covariance windup phenomenon [36]. The standard deviation of the measurement noise is that of the 200 noiseless outputs divided by 10. We assume the system is at rest (the input is equal to zero) prior to the data collection. The performance of the estimator is evaluated by means of the fit (as a function of time)
\[
F_t = 100\left(1 - \frac{\|g^i - \hat{g}^i_t\|}{\|g^i - \mathbb{1}\bar{g}^i\|}\right), \qquad (13)
\]
where g^i is the impulse response generated at the i-th Monte Carlo run, ḡ^i its mean, and ĝ^i_t its estimate (the impulse response estimate is a function of the time instant t).

An example of one of the Monte Carlo runs is given in Fig. 5, which shows the evolution in time of the impulse response estimate and its fit. Fifty measurements suffice for the GCV filter to achieve an appreciable fit. The overall results of the Monte Carlo experiment are summarized in Fig. 6, which depicts the average fit of the impulse responses as a function of time. It can be seen that, after a short transient phase, the fit increases monotonically and achieves a high average value.

5 Conclusions

The novel filter presented here allows one to propagate the GCV score efficiently over time. Hence, unknown parameters entering a state space model can now be estimated in an on-line manner, resorting to one of the most important techniques used for parameter estimation. The asymptotic properties of the GCV filter also provide a new, very efficient way to estimate the regularization parameter, e.g. in smoothing splines. Applications of the new filter have been illustrated using artificial data regarding function estimation and on-line regularized linear system identification. A MATLAB implementation of the GCV filter is available at the web page.
Appendix: derivation of the GCV filter

Without loss of generality, we set the initial system condition to zero, i.e. µ = 0. We also use X_t, Y_t and E_t to denote the column vectors containing the states, the outputs and the measurement noises up to instant t, i.e.
\[
X_t = [x_1^T \ \ldots \ x_t^T]^T, \qquad Y_t = [y_1 \ \ldots \ y_t]^T, \qquad E_t = [e_1 \ \ldots \ e_t]^T.
\]
Then, it holds that Y_t = O_t X_t + E_t, where O_t = diag{C_1, ..., C_t} is the regression matrix built using the measurement matrices C_k, k = 1, ..., t. We also use W_t and V_t to denote the state and output covariance matrices, i.e.
\[
W_t := \mathrm{Var}(X_t), \qquad (14)
\]
\[
V_t := \mathrm{Var}(Y_t) = O_t W_t O_t^T + \gamma I_t, \qquad (15)
\]
where I_t is the t × t identity matrix. Note that, using the above notation, the smoothed estimate of Y_t, already encountered in Section 1, is
\[
\hat{Y}_t = O_t W_t O_t^T V_t^{-1} Y_t, \qquad (16)
\]
so that the degrees of freedom at instant t turn out to be
\[
\delta_t = \mathrm{Tr}(O_t W_t O_t^T V_t^{-1}). \qquad (17)
\]
The following simple lemma is useful for our future developments.

Lemma 2
One has
\[
\delta_t = t - \gamma\,\frac{\partial \log\det V_t}{\partial \gamma}, \qquad (18)
\]
\[
S_t = -\gamma^2\,\frac{\partial\, Y_t^T V_t^{-1} Y_t}{\partial \gamma}. \qquad (19)
\]

Proof:
In view of (15), we start by noticing that
\[
\gamma V_t^{-1} = I_t - O_t W_t O_t^T V_t^{-1}. \qquad (20)
\]
Then, (18) is obtained from the following equalities:
\[
\gamma\,\frac{\partial \log\det V_t}{\partial \gamma} = \gamma\,\mathrm{Tr}\!\left(V_t^{-1}\,\frac{\partial V_t}{\partial \gamma}\right) = \gamma\,\mathrm{Tr}(V_t^{-1}) = \mathrm{Tr}(I_t - O_t W_t O_t^T V_t^{-1}) = t - \delta_t, \qquad (21)
\]
where the last two passages exploit (20) and (17), respectively. Eq. (19) is instead proved as follows:
\[
-\gamma^2\,\frac{\partial\, Y_t^T V_t^{-1} Y_t}{\partial \gamma} = \gamma^2\, Y_t^T V_t^{-2} Y_t = Y_t^T (I_t - O_t W_t O_t^T V_t^{-1})^T (I_t - O_t W_t O_t^T V_t^{-1})\, Y_t = \|Y_t - \hat{Y}_t\|^2 = S_t,
\]
where the second and third equalities exploit (20) and (16), respectively. ∎
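Lemma 2 is easy to sanity-check numerically: the identities (18)-(19) hold for any O_t, any positive semidefinite W_t and any γ > 0, so a random small instance suffices (our sketch; finite differences with step h):

```python
import numpy as np

rng = np.random.default_rng(0)
t, gamma, h = 5, 0.7, 1e-6
O = np.diag(rng.standard_normal(t))                   # diagonal O_t (scalar C_k's)
M = rng.standard_normal((t, t)); W = M @ M.T          # random W_t >= 0
Y = rng.standard_normal((t, 1))

def V(g):                                             # eq. (15)
    return O @ W @ O.T + g * np.eye(t)

# eq. (18): dof via the trace (17) vs. the derivative of log det V_t
dof = np.trace(O @ W @ O.T @ np.linalg.inv(V(gamma)))
dlogdet = (np.linalg.slogdet(V(gamma + h))[1]
           - np.linalg.slogdet(V(gamma - h))[1]) / (2 * h)
assert np.isclose(dof, t - gamma * dlogdet)

# eq. (19): squared residuals vs. the derivative of Y' V^{-1} Y
Yhat = O @ W @ O.T @ np.linalg.solve(V(gamma), Y)     # eq. (16)
quad = lambda g: float(Y.T @ np.linalg.solve(V(g), Y))
ssr = float(((Y - Yhat) ** 2).sum())
assert np.isclose(ssr, -gamma**2 * (quad(gamma + h) - quad(gamma - h)) / (2 * h))
```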
Fig. 5. On-line system identification - Section 4.3. Left: true impulse response (solid line) and GCV estimates obtained at time instants k = 10, 50, 200. Right: fit obtained by GCV as a function of time.
Fig. 6. On-line system identification - Section 4.3. Average of the GCV fits (solid line) ± one standard deviation (dashed line), as a function of time, obtained after a Monte Carlo study. At any of the 100 runs, a new impulse response was randomly generated as detailed in [24, Section 7.4].

The dynamics of the matrix P_k in the GCV filter are regulated by the discrete-time Riccati equation, which can also be rewritten as
\[
P_{k+1} = A_k P_k A_k^T + Q_k - A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} C_k P_k A_k^T.
\]
It is now easy to see that the matrix Σ_k entering the GCV filter is the partial derivative of P_k w.r.t. γ. In fact, differentiating the above recursion, and adopting the notation Σ_k := ∂P_k/∂γ, one has
\[
\begin{aligned}
\Sigma_{k+1} = A_k \Sigma_k A_k^T &- A_k \Sigma_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} C_k P_k A_k^T \\
&- A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} C_k \Sigma_k A_k^T \\
&+ A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-2} C_k P_k A_k^T\,(C_k \Sigma_k C_k^T + 1).
\end{aligned}
\]
Exploiting the definition of K_k and rearranging the terms, the recursive formula (5f) is obtained.

Now, consider the dynamics of the predicted state
\[
\hat{x}_{k+1} = A_k \hat{x}_k + A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} (y_k - C_k \hat{x}_k).
\]
We now show that ζ̂_k is the partial derivative of x̂_k w.r.t. γ. In fact, differentiating the above equation using the correspondence ζ̂_k := ∂x̂_k/∂γ, one obtains
\[
\begin{aligned}
\hat{\zeta}_{k+1} = A_k \hat{\zeta}_k &+ A_k \Sigma_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} (y_k - C_k \hat{x}_k) \\
&- A_k P_k C_k^T (C_k \Sigma_k C_k^T + 1)(C_k P_k C_k^T + \gamma)^{-2} (y_k - C_k \hat{x}_k) \\
&- A_k P_k C_k^T (C_k P_k C_k^T + \gamma)^{-1} C_k \hat{\zeta}_k.
\end{aligned}
\]
This, combined with the definition of K_k, leads to the recursive formula (5d).

Now, exploiting well-known properties of the innovation sequence {y_k − C_k x̂_k}_{k=1}^t, whose variances are {C_k P_k C_k^T + γ}_{k=1}^t, and recalling that Σ_k := ∂P_k/∂γ, we have
\[
\frac{\partial \log\det V_t}{\partial \gamma} = \sum_{k=1}^t \frac{\partial \log(C_k P_k C_k^T + \gamma)}{\partial \gamma} = \sum_{k=1}^t \frac{C_k \Sigma_k C_k^T + 1}{C_k P_k C_k^T + \gamma}.
\]
Then, the recursive formula (5g) for the degrees of freedom δ_k is obtained by combining the above equation and (18).

Still using properties of the innovation sequence, and recalling that ζ̂_k := ∂x̂_k/∂γ, one also has
\[
-\frac{\partial\, Y_t^T V_t^{-1} Y_t}{\partial \gamma} = \sum_{k=1}^t \left[ \frac{(C_k \Sigma_k C_k^T + 1)\,(y_k - C_k \hat{x}_k)^2}{(C_k P_k C_k^T + \gamma)^2} + \frac{2\, C_k \hat{\zeta}_k\,(y_k - C_k \hat{x}_k)}{C_k P_k C_k^T + \gamma} \right].
\]
This equation, in combination with (19), proves the correctness of the update rule (5h) for the sum of squared residuals S_k and completes the derivation.

Proof of Proposition 1: If the system (3) is stabilizable and detectable, then standard properties of the algebraic Riccati equation (6) ensure that P̄ is symmetric and positive semidefinite and that the Kalman filter, corresponding to (5a), (5c) and (5e), is asymptotically stable (see [1, p. 77]). Because the Kalman filter is asymptotically stable, the matrix A − K̄C has all its eigenvalues inside the unit circle, ensuring that (7) admits a unique positive semidefinite solution [1, p. 67]. The filter state transition matrix which regulates the dynamics of x̂_k and ζ̂_k is
\[
\begin{pmatrix} A - K_k C & 0 \\ -G_k C & A - K_k C \end{pmatrix},
\]
and so it also has all eigenvalues inside the unit circle, at least for sufficiently large k. ∎
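The two sensitivity relations at the heart of the derivation, Σ_k = ∂P_k/∂γ and ζ̂_k = ∂x̂_k/∂γ, can likewise be checked by finite differences (our sketch; the test model, data and step h are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, h = 2, 30, 1e-6
A = np.array([[1.0, 0.0], [0.5, 1.0]])   # arbitrary observable test model
C = np.array([[0.0, 1.0]])
Q = 0.1 * np.eye(n)
gamma, P1 = 0.5, np.eye(n)
y = rng.standard_normal(T)

def run(g):
    """Kalman recursions (5a), (5c), (5e) plus the GCV-filter sensitivity
    recursions (5b), (5d), (5f), all evaluated at parameter g."""
    x, P = np.zeros((n, 1)), P1.copy()
    zeta, Sigma = np.zeros((n, 1)), np.zeros((n, n))
    for yk in y:
        s = float(C @ P @ C.T) + g
        K = A @ P @ C.T / s
        G = (A @ Sigma @ C.T - K * (float(C @ Sigma @ C.T) + 1.0)) / s
        eps = yk - float(C @ x)
        x, zeta = A @ x + K * eps, (A - K @ C) @ zeta + G * eps
        P = (A - K @ C) @ P @ (A - K @ C).T + g * K @ K.T + Q
        Sigma = (A - K @ C) @ Sigma @ (A - K @ C).T + K @ K.T
    return x, P, zeta, Sigma

_, _, zeta, Sigma = run(gamma)
xp, Pp, _, _ = run(gamma + h)
xm, Pm, _, _ = run(gamma - h)
assert np.allclose(Sigma, (Pp - Pm) / (2 * h), atol=1e-4)  # Sigma = dP/dgamma
assert np.allclose(zeta, (xp - xm) / (2 * h), atol=1e-4)   # zeta = dx_hat/dgamma
```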
References

[1] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, N.J., USA, 1979.
[2] B. D. O. Anderson and J. B. Moore. Detectability and stabilizability of time-varying discrete-time linear systems. SIAM Journal on Control and Optimization, 19(1):20–32, 1981.
[3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
[4] C. F. Ansley and R. Kohn. Efficient generalized cross-validation for state space models. Biometrika, 74(1):139–148, 1987.
[5] M. Bertero. Linear inverse and ill-posed problems. Advances in Electronics and Electron Physics, 75:1–120, 1989.
[6] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - revisited. Automatica, 48(8):1525–1535, 2012.
[7] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31:377–403, 1979.
[8] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–50, 2000.
[9] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems (NIPS), 2013.
[10] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. London: Chapman and Hall, 1996.
[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995.
[12] G. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
[13] T. J. Hastie, R. J. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, Canada, 2001.
[14] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
[15] M. G. Hutchinson and F. R. De Hoog. Smoothing data with spline functions. Numerische Mathematik, 47:99–106, 1985.
[16] A. Jazwinski. Stochastic Processes and Filtering Theory. Dover, 1970.
[17] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82:35–45, 1960.
[18] R. Kohn and C. F. Ansley. A fast algorithm for signal extraction, influence and cross validation in state space models. Biometrika, 76(1):65–79, 1989.
[19] L. Ljung and T. Kailath. A unified approach to smoothing formulas. Automatica, 12(2):147–157, 1976.
[20] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4:415–447, 1992.
[21] G. De Nicolao, G. Ferrari Trecate, and G. Sparacino. Fast spline smoothing via spectral factorization concepts. Automatica, 36:1733–1739, 2000.
[22] B. Ninness and S. Henriksen. Bayesian system identification via MCMC techniques. Automatica, 46(1):40–51, 2010.
[23] H. Ohlsson, F. Gustafsson, L. Ljung, and S. Boyd. State smoothing by sum-of-norms regularization. Pages 2880–2885. IEEE, 2010.
[24] G. Pillonetto, T. Chen, A. Chiuso, G. De Nicolao, and L. Ljung. Regularized linear system identification using atomic, nuclear and kernel-based norms: the role of the stability constraint. Automatica, 69:137–149, 2016.
[25] G. Pillonetto and A. Chiuso. Fast computation of smoothing splines subject to equality constraints. Automatica, 45(12):2842–2849, 2009.
[26] G. Pillonetto and A. Chiuso. Tuning complexity in regularized kernel-based regression and linear system identification. Automatica, 58(2):106–117, 2015.
[27] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica, 47(2):291–305, 2011.
[28] G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, 2010.
[29] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: a survey. Automatica, 50(3):657–682, 2014.
[30] G. Pillonetto and M. P. Saccomani. Input estimation in nonlinear dynamic systems using differential algebra concepts. Automatica, 42:2117–2129, 2006.
[31] T. Poggio and F. Girosi. Networks for approximation and learning. In Proceedings of the IEEE, volume 78, pages 1481–1497, 1990.
[32] H. E. Rauch, F. Tung, and C. T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1145–1150, 1965.
[33] J. Rice. Choice of smoothing parameter in deconvolution problems. Contemporary Mathematics, 59:137–151, 1986.
[34] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, 2001.
[35] B. W. Silverman. Some aspects of the spline approach to nonparametric regression curve fitting. Journal of the Royal Statistical Society, 47:1–52, 1985.
[36] B. Stenlund and F. Gustafsson. Avoiding windup in recursive parameter estimation. Pages 148–153, 2002.
[37] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, Philadelphia, 2005.
[38] G. Wahba. Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B (Methodological), 45(1):133–150, 1983.
[39] G. Wahba. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. The Annals of Statistics, 13(4):1378–1402, 1985.
[40] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
[41] H. L. Weinert. Fast Compact Algorithms and Software for Spline Smoothing. Springer, 2013.