Recursive Least Squares with Variable-Direction Forgetting -- Compensating for the loss of persistency
Ankit Goel, Adam L. Bruce, and Dennis S. Bernstein

POC: A. Goel ([email protected])
The ability to estimate parameters depends on two things, namely, identifiability [1], which is the ability to distinguish distinct parameters, and persistent excitation, which refers to the spectral content of the signals needed to ensure convergence of the parameter estimates to the true parameter values [2]–[4]. Roughly speaking, the level of persistency must be commensurate with the number of unknown parameters. For example, a harmonic input has two-dimensional persistency and thus can be used to identify two parameters, whereas white noise is sufficiently persistent for identifying an arbitrary number of parameters. Within the context of adaptive control, persistent excitation is needed to avoid bursting [5]; recent research has focused on relaxing these requirements [6]–[8].

Under persistent excitation, a key issue in practice is the rate of convergence, especially under changing conditions. For example, the parameters of a system may change abruptly, and the goal is to ensure fast convergence to the modified parameter values. In this case, it turns out that the rate of convergence depends on the ability to forget past parameters and incorporate new information. As discussed in "Summary," the ability to accommodate new information depends on the ability to forget; the ability to forget is thus crucial to the ability to learn. This paradox is widely recognized, and effective forgetting is of intense interest in machine learning [9]–[12].

In the first half of the present article, classical forgetting within the context of recursive least squares (RLS) is considered. In the classical RLS formulation [13]–[16], a constant forgetting factor $\lambda \in (0,1]$ can be set by the user. However, it often occurs in practice that the performance of RLS is extremely sensitive to the choice of $\lambda$, and suitable values, which are typically close to 1, are found by trial-and-error testing. This difficulty has motivated extensions of classical RLS in the form of variable-rate forgetting [17]–[23], constant-trace adjustment, covariance resetting, and covariance modification [24], [25].

In the second half of this article, variable-direction forgetting (VDF), a technique that compensates for the loss of persistency, is considered. During periods of lost excitation, new information is confined to a limited number of directions. The goal of VDF is thus to determine these directions and thereby constrain forgetting to the directions in which new information is available. VDF allows RLS to operate without divergence during periods of loss of persistency.

The goal of this tutorial article is to investigate the effect of forgetting within the context of RLS in order to motivate the need for VDF. With this motivation in mind, the article develops and illustrates RLS with VDF. The presentation is intended for graduate students who may wish to understand and apply this technique to system identification for modeling and adaptive control. Tables 1 and 2 summarize the results and examples in this article. Some of the content in this article appeared in preliminary form in [33].

Although, in practical applications, all sensor measurements are corrupted by noise, the effect of sensor noise is not considered in this article in order to focus on the loss of persistency. Alternative interpretations of RLS in the special case of zero-mean, white sensor noise are presented in "RLS as a One-Step Optimal Predictor" and "RLS as a Maximum Likelihood Estimator."
Recursive Least Squares

Consider the model

$$y_k = \phi_k \theta, \qquad (1)$$

where, for all $k \ge 0$, $y_k \in \mathbb{R}^p$ is the measurement, $\phi_k \in \mathbb{R}^{p \times n}$ is the regressor matrix, and $\theta \in \mathbb{R}^n$ is the vector of unknown parameters. The goal is to estimate $\theta$ as new data become available. One approach to this problem is to minimize the quadratic cost function

$$J_k(\hat\theta) \triangleq \sum_{i=0}^{k} \lambda^{k-i} (y_i - \phi_i\hat\theta)^T (y_i - \phi_i\hat\theta) + \lambda^{k+1} (\hat\theta - \theta_0)^T R (\hat\theta - \theta_0), \qquad (2)$$

where $\lambda \in (0,1]$ is the forgetting factor, $R \in \mathbb{R}^{n \times n}$ is positive definite, and $\theta_0 \in \mathbb{R}^n$ is the initial estimate of $\theta$. The forgetting factor applies higher weighting to more recent data, thereby enhancing the ability of RLS to use incoming data to estimate time-varying parameters. The following result is recursive least squares.
TABLE 1: Summary of definitions and results in this article.

Definition 1: Persistently exciting regressor
Definition 2: Lyapunov stable equilibrium
Definition 3: Uniformly Lyapunov stable equilibrium
Definition 4: Globally asymptotically stable equilibrium
Definition 5: Uniformly globally geometrically stable equilibrium
Theorems 1-2: Recursive least squares (RLS)
Theorems 3-5: Lyapunov stability theorems
Theorem 6: Lyapunov analysis of RLS for $\lambda \in (0,1]$
Theorem 7: Stability analysis of RLS for $\lambda \in (0,1]$ based on $\tilde\theta_k$
Theorem S1: A quadratic cost function for variable-direction RLS
Proposition 1: Recursive update of $P_k^{-1}$ with uniform-direction forgetting
Proposition 2: Data-dependent subspace constraint on $\theta_k$
Proposition 3: Bounds on $P_k$ for $\lambda = 1$
Proposition 4: Bounds on $P_k$ for $\lambda \in (0,1)$
Proposition 5: Converse of Proposition 4
Proposition 6: Convergence of $z_k$ with uniform-direction forgetting
Proposition 7: Persistent excitation and $A_k$
Proposition 8: Recursive update of $P_k^{-1}$ with variable-direction forgetting
Proposition 9: Convergence of $z_k$ with variable-direction forgetting
Proposition 10: Bounds on $P_k$ with variable-direction forgetting

Theorem 1: For all $k \ge 0$, let $\phi_k \in \mathbb{R}^{p \times n}$ and $y_k \in \mathbb{R}^p$, let $R \in \mathbb{R}^{n \times n}$ be positive definite, and define $P_0 \triangleq R^{-1}$, $\theta_0 \in \mathbb{R}^n$, and $\lambda \in (0,1]$. Furthermore, for all $k \ge 0$, denote the minimizer of (2) by

$$\theta_{k+1} = \operatorname*{argmin}_{\hat\theta \in \mathbb{R}^n} J_k(\hat\theta). \qquad (3)$$

Then, for all $k \ge 0$, $\theta_{k+1}$ is given by

$$P_{k+1} = \frac{1}{\lambda} P_k - \frac{1}{\lambda} P_k \phi_k^T (\lambda I_p + \phi_k P_k \phi_k^T)^{-1} \phi_k P_k, \qquad (4)$$

$$\theta_{k+1} = \theta_k + P_{k+1} \phi_k^T (y_k - \phi_k \theta_k). \qquad (5)$$

Proof:
See [13]. $\square$
The following result is a variation of Theorem 1, in which the order of the updates of $P_k$ and $\theta_k$ is reversed.
TABLE 2: Summary of examples in this article.

Example 1: $P_k$ converges to zero without persistent excitation
Example 2: Persistent excitation and bounds on $P_k^{-1}$
Example 3: Lack of persistent excitation and bounds on $P_k^{-1}$
Example 4: Convergence of $z_k$ and $\theta_k$
Example 5: Using $\kappa(P_k)$ to determine whether $(\phi_k)_{k=0}^\infty$ is persistently exciting
Example 6: Effect of $\lambda$ on the rate of convergence of $\theta_k$
Example 7: Lack of persistent excitation in scalar estimation
Example 8: Subspace-constrained regressor
Example 9: Effect of lack of persistent excitation on $\theta_k$
Example 10: Lack of persistent excitation and the information-rich subspace
Example 11: Variable-direction forgetting for a regressor lacking persistent excitation
Example 12: Effect of variable-direction forgetting on $\theta_k$

Theorem 2: For all $k \ge 0$, let $\phi_k \in \mathbb{R}^{p \times n}$ and $y_k \in \mathbb{R}^p$, let $R \in \mathbb{R}^{n \times n}$ be positive definite, and define $P_0 \triangleq R^{-1}$, $\theta_0 \in \mathbb{R}^n$, and $\lambda \in (0,1]$. Furthermore, for all $k \ge 0$, denote the minimizer of (2) by (3). Then, for all $k \ge 0$, $\theta_{k+1}$ is given by

$$\theta_{k+1} = \theta_k + P_k\phi_k^T(\lambda I_p + \phi_k P_k \phi_k^T)^{-1}(y_k - \phi_k\theta_k), \qquad (6)$$

$$P_{k+1} = \frac{1}{\lambda} P_k - \frac{1}{\lambda} P_k\phi_k^T(\lambda I_p + \phi_k P_k\phi_k^T)^{-1}\phi_k P_k. \qquad (7)$$

Proof:
See [13]. $\square$
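To make the recursions in Theorems 1 and 2 concrete, the following Python/NumPy sketch implements one RLS step with forgetting. This is illustrative code written for this tutorial rather than an implementation from [13]; the names are arbitrary, and the Theorem 2 ordering (6), (7) is used so that the gain is formed before the covariance update.

```python
import numpy as np

def rls_step(theta, P, phi, y, lam=1.0):
    """One RLS step with forgetting, following (6) and (7) of Theorem 2.

    theta : (n,) current estimate; P : (n, n) covariance;
    phi : (p, n) regressor matrix; y : (p,) measurement; lam : forgetting factor.
    """
    p = phi.shape[0]
    S = lam * np.eye(p) + phi @ P @ phi.T       # lambda*I_p + phi_k P_k phi_k^T
    K = P @ phi.T @ np.linalg.inv(S)            # gain appearing in (6)
    theta = theta + K @ (y - phi @ theta)       # (6)
    P = (P - K @ phi @ P) / lam                 # (7)
    return theta, P
```

Starting from $\theta_0$ and $P_0 = R^{-1}$ and calling this function on the data $(\phi_k, y_k)$ reproduces the minimizer (3) at every step.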
Proposition 1:
Let $\lambda \in (0,\infty)$, and let $(P_k)_{k=0}^\infty$ be a sequence of $n \times n$ positive-definite matrices. Then, for all $k \ge 0$, $(P_k)_{k=0}^\infty$ satisfies (4) if and only if, for all $k \ge 0$, $(P_k)_{k=0}^\infty$ satisfies

$$P_{k+1}^{-1} = \lambda P_k^{-1} + \phi_k^T \phi_k. \qquad (8)$$

Proof:
To prove necessity, it follows from (8) and the matrix-inversion lemma that

$$P_{k+1} = (\lambda P_k^{-1} + \phi_k^T\phi_k)^{-1} = (\lambda P_k^{-1})^{-1} - (\lambda P_k^{-1})^{-1}\phi_k^T\big(I_p + \phi_k(\lambda P_k^{-1})^{-1}\phi_k^T\big)^{-1}\phi_k(\lambda P_k^{-1})^{-1} = \frac{1}{\lambda}P_k - \frac{1}{\lambda}P_k\phi_k^T(\lambda I_p + \phi_k P_k\phi_k^T)^{-1}\phi_k P_k.$$

Reversing these steps proves sufficiency. $\square$
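Proposition 1 states that the covariance form (4) and the information form (8) generate the same sequence. A quick numerical check of this equivalence on random data (a sketch, with arbitrary dimensions and forgetting factor):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 4, 2, 0.95
P = np.eye(n)                       # P_0 = R^{-1} with R = I_n
Pinv = np.eye(n)                    # information matrix P_0^{-1}
for k in range(50):
    phi = rng.standard_normal((p, n))
    S = lam * np.eye(p) + phi @ P @ phi.T
    P = (P - P @ phi.T @ np.linalg.inv(S) @ phi @ P) / lam   # (4)
    Pinv = lam * Pinv + phi.T @ phi                          # (8)
    assert np.allclose(np.linalg.inv(P), Pinv)               # Proposition 1
```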
Let $k \ge 0$. By defining the parameter error

$$\tilde\theta_k \triangleq \theta_k - \theta, \qquad (9)$$

it follows that

$$\phi_i\theta_k - y_i = \phi_i\tilde\theta_k. \qquad (10)$$

Using (10) with $k$ replaced by $k+1$, it follows that the minimum value of $J_k$ is given by

$$J_k(\theta_{k+1}) = \sum_{i=0}^k \lambda^{k-i}\tilde\theta_{k+1}^T\phi_i^T\phi_i\tilde\theta_{k+1} + \lambda^{k+1}(\tilde\theta_{k+1} - \tilde\theta_0)^T R(\tilde\theta_{k+1} - \tilde\theta_0). \qquad (11)$$

Furthermore, (5) and (9) imply that $\tilde\theta_k$ satisfies

$$\tilde\theta_{k+1} = (I_n - P_{k+1}\phi_k^T\phi_k)\tilde\theta_k \qquad (12)$$
$$\phantom{\tilde\theta_{k+1}} = \lambda P_{k+1}P_k^{-1}\tilde\theta_k. \qquad (13)$$

Finally, it follows from (13) that, for all $k, l \ge 0$,

$$\tilde\theta_k = \lambda^{k-l} P_k P_l^{-1}\tilde\theta_l. \qquad (14)$$

The following result shows that the estimate $\theta_k$ of $\theta$ is constrained to a data-dependent subspace. Let $\mathcal{R}(A)$ denote the range of the matrix $A$.
For all $k \ge 0$, let $\phi_k \in \mathbb{R}^{p\times n}$ and $y_k \in \mathbb{R}^p$, let $R \in \mathbb{R}^{n\times n}$ be positive definite, let $\theta_0 \in \mathbb{R}^n$, let $\lambda \in (0,1]$, and define $\theta_{k+1}$ by (3). Then, $\theta_{k+1}$ satisfies

$$\Big(\sum_{i=0}^k \lambda^{k-i}\phi_i^T\phi_i + \lambda^{k+1}R\Big)\theta_{k+1} = \sum_{i=0}^k \lambda^{k-i}\phi_i^T y_i + \lambda^{k+1}R\theta_0. \qquad (15)$$

Furthermore,

$$\theta_{k+1} \in \mathcal{R}\big(\Phi_k^T\Phi_k + R^{-1}\Phi_k^T\Phi_k R^{-1} + \theta_0\theta_0^T\big), \qquad (16)$$

where

$$\Phi_k \triangleq [\phi_0^T \ \cdots \ \phi_k^T]^T \in \mathbb{R}^{(k+1)p\times n}. \qquad (17)$$

Proof:
Note that $J_k(\hat\theta) = \hat\theta^T A_k\hat\theta + \hat\theta^T b_k + c_k$, where

$$A_k \triangleq \sum_{i=0}^k \lambda^{k-i}\phi_i^T\phi_i + \lambda^{k+1}R, \qquad b_k \triangleq -2\sum_{i=0}^k \lambda^{k-i}\phi_i^T y_i - 2\lambda^{k+1}R\theta_0, \qquad c_k \triangleq \sum_{i=0}^k \lambda^{k-i}y_i^T y_i + \lambda^{k+1}\theta_0^T R\theta_0.$$

Since $A_k$ is positive definite, it follows from Lemma 1 in [13] that the minimizer $\theta_{k+1}$ of $J_k$ satisfies (15). Next, define $W_k \triangleq \operatorname{diag}(\lambda^{-1}I_p, \ldots, \lambda^{-1-k}I_p) \in \mathbb{R}^{(k+1)p\times(k+1)p}$. Using (15) and Lemma 1 from "Three Useful Lemmas," it follows that

$$\theta_{k+1} = \big(I_n + R^{-1}\Phi_k^T W_k\Phi_k\big)^{-1}\Big(\sum_{i=0}^k \lambda^{-i-1}R^{-1}\phi_i^T y_i + \theta_0\Big) \in \sum_{i=0}^k \mathcal{R}\big([\Phi_k^T \ \ R^{-1}\phi_i^T]\big) + \mathcal{R}\big([\Phi_k^T \ \ \theta_0]\big) = \mathcal{R}\big([\Phi_k^T \ \ R^{-1}\Phi_k^T \ \ \theta_0]\big) = \mathcal{R}\big(\Phi_k^T\Phi_k + R^{-1}\Phi_k^T\Phi_kR^{-1} + \theta_0\theta_0^T\big). \ \square$$

Table 3 summarizes various expressions for the RLS variables.

TABLE 3: Alternative expressions for the RLS variables.

$P_k$:
• $P_{k+1} = \frac{1}{\lambda}P_k - \frac{1}{\lambda}P_k\phi_k^T(\lambda I_p + \phi_kP_k\phi_k^T)^{-1}\phi_kP_k$ (4)
• $P_{k+1}^{-1} = \lambda P_k^{-1} + \phi_k^T\phi_k$ (8)
• $P_{k+1}^{-1} = \lambda^{k+1}P_0^{-1} + \sum_{i=0}^k \lambda^{k-i}\phi_i^T\phi_i$ (8)

$\theta_k$:
• $\theta_{k+1} = \theta_k + P_{k+1}\phi_k^T(y_k - \phi_k\theta_k)$ (5)
• $\theta_{k+1} = \theta_k + P_k\phi_k^T(\lambda I_p + \phi_kP_k\phi_k^T)^{-1}(y_k - \phi_k\theta_k)$ (6)
• $\theta_{k+1} = P_{k+1}\big(\sum_{i=0}^k \lambda^{k-i}\phi_i^Ty_i + \lambda^{k+1}P_0^{-1}\theta_0\big)$ (15)

$\tilde\theta_k$:
• $\tilde\theta_k = \theta_k - \theta$ (9)
• $\tilde\theta_{k+1} = (I_n - P_{k+1}\phi_k^T\phi_k)\tilde\theta_k$ (12)
• $\tilde\theta_{k+1} = \lambda P_{k+1}P_k^{-1}\tilde\theta_k$ (13)
• $\tilde\theta_k = \lambda^{k-l}P_kP_l^{-1}\tilde\theta_l$ (14)
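Equation (15) characterizes the RLS estimate in batch form: $\theta_{k+1}$ solves regularized normal equations whose coefficients can themselves be accumulated recursively. The sketch below (hypothetical code written for this tutorial) verifies at every step that the recursive estimate from (6), (7) agrees with the solution of (15).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 3, 1, 0.9
theta_true = np.array([1.0, -2.0, 0.5])
theta, P = np.zeros(n), np.eye(n)   # theta_0 = 0, P_0 = R^{-1} with R = I_n
A, b = np.eye(n), np.zeros(n)       # running A_k and right-hand side of (15)
for k in range(30):
    phi = rng.standard_normal((p, n))
    y = phi @ theta_true
    S = lam * np.eye(p) + phi @ P @ phi.T          # recursive form, (6) and (7)
    K = P @ phi.T @ np.linalg.inv(S)
    theta = theta + K @ (y - phi @ theta)
    P = (P - K @ phi @ P) / lam
    A = lam * A + phi.T @ phi                      # sum lam^{k-i} phi_i^T phi_i + lam^{k+1} R
    b = lam * b + phi.T @ y                        # sum lam^{k-i} phi_i^T y_i + lam^{k+1} R theta_0
    assert np.allclose(theta, np.linalg.solve(A, b))   # (15) holds at every step
```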
Persistent Excitation and Forgetting

This section defines persistent excitation of the regressor sequence and investigates the effect of persistent excitation and forgetting on $P_k$. For all $j \ge 0$ and $k \ge j$, define

$$F_{j,k} \triangleq \sum_{i=j}^k \phi_i^T\phi_i. \qquad (18)$$

Definition 1: The sequence $(\phi_k)_{k=0}^\infty \subset \mathbb{R}^{p\times n}$ is persistently exciting if there exist $N \ge n/p$ and $\alpha, \beta \in (0,\infty)$ such that, for all $j \ge 0$,

$$\alpha I_n \le F_{j,j+N} \le \beta I_n. \qquad (19)$$

Suppose that $(\phi_k)_{k=0}^\infty$ is persistently exciting and (19) is satisfied for given values of $N, \alpha, \beta$.
Then, with suitably modified values of $\alpha$ and $\beta$, (19) is satisfied for all larger values of $N$. For example, if $N$ is replaced by $2N+1$, then (19) is satisfied with $\alpha$ replaced by $2\alpha$ and $\beta$ replaced by $2\beta$.
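Over a finite data record, Definition 1 can be tested by scanning the extreme eigenvalues of the windows $F_{j,j+N}$. The following sketch is one way to do this; the function name and the single-harmonic test signal are choices made here, not constructions from the article.

```python
import numpy as np

def pe_bounds(phis, N):
    """Scan min/max eigenvalues of F_{j,j+N} over all windows in the record.

    phis : list of (p, n) regressor matrices.
    Returns the tightest constants (alpha, beta) in (19) for this record.
    """
    lo, hi = np.inf, 0.0
    for j in range(len(phis) - N):
        F = sum(phi.T @ phi for phi in phis[j:j + N + 1])   # F_{j,j+N} in (18)
        w = np.linalg.eigvalsh(F)
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return lo, hi

# two-parameter regressor driven by a single harmonic: persistently exciting
phis = [np.array([[np.sin(2*np.pi*k/17), np.sin(2*np.pi*(k-1)/17)]]) for k in range(200)]
alpha, beta = pe_bounds(phis, N=4)
print(alpha, beta)   # alpha > 0 indicates persistency over this record
```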
The following result expresses (8) in terms of $F_{0,k-1}$ in the case where $\lambda = 1$.

Lemma 1: Let $\lambda = 1$ and, for all $k \ge 0$, define $P_k$ as in Theorem 1. Then,

$$P_k^{-1} = F_{0,k-1} + P_0^{-1}. \qquad (20)$$

The following result shows that, if $(\phi_k)_{k=0}^\infty$ is persistently exciting and $\lambda = 1$, then $P_k$ converges to zero.

Proposition 3:
Assume that $(\phi_k)_{k=0}^\infty \subset \mathbb{R}^{p\times n}$ is persistently exciting, let $N, \alpha, \beta$ be given by Definition 1, let $R \in \mathbb{R}^{n\times n}$ be positive definite, define $P_0 \triangleq R^{-1}$, let $\lambda = 1$, and, for all $k \ge 0$, let $P_k$ be given by (4). Then, for all $k \ge N+1$,

$$\left\lfloor \tfrac{k}{N+1} \right\rfloor \alpha I_n + P_0^{-1} \le P_k^{-1} \le \left\lceil \tfrac{k}{N+1} \right\rceil \beta I_n + P_0^{-1}. \qquad (21)$$

Furthermore,

$$\lim_{k\to\infty} P_k = 0. \qquad (22)$$

Proof:
First, note that, for all $k \ge 0$,

$$F_{0,k} = \sum_{i=1}^{\lfloor k/(N+1)\rfloor} F_{(i-1)(N+1),\, i(N+1)-1} + F_{\lfloor k/(N+1)\rfloor (N+1),\, k} \le \sum_{i=1}^{\lceil k/(N+1)\rceil} F_{(i-1)(N+1),\, i(N+1)-1},$$

and thus

$$\left\lfloor \tfrac{k}{N+1}\right\rfloor \alpha I_n \le \sum_{i=1}^{\lfloor k/(N+1)\rfloor} F_{(i-1)(N+1),\, i(N+1)-1} \le \sum_{i=1}^{\lceil k/(N+1)\rceil} F_{(i-1)(N+1),\, i(N+1)-1} \le \left\lceil \tfrac{k}{N+1}\right\rceil \beta I_n. \qquad (23)$$

It follows from Lemma 1 and (23) that, for all $k \ge N+1$,

$$\left\lfloor\tfrac{k}{N+1}\right\rfloor\alpha I_n + P_0^{-1} \le F_{0,\lfloor k/(N+1)\rfloor(N+1)-1} + P_0^{-1} \le F_{0,k-1} + P_0^{-1} = P_k^{-1} \le F_{0,\lceil k/(N+1)\rceil(N+1)-1} + P_0^{-1} \le \left\lceil\tfrac{k}{N+1}\right\rceil\beta I_n + P_0^{-1}.$$

Finally, (22) follows from (21). $\square$

The following example shows that $\lim_{k\to\infty}P_k = 0$ does not imply that $(\phi_k)_{k=0}^\infty$ is persistently exciting.

Example 1: $P_k$ converges to zero without persistent excitation. For all $k \ge 0$, let $\phi_k = \frac{1}{\sqrt{k+1}}$, and let $\lambda = 1$. For all $N \ge 0$, note that $F_{j,j+N} \le \frac{N+1}{j+1}$, and thus there does not exist $\alpha$ satisfying (19). Hence, $(\phi_k)_{k=0}^\infty$ is not persistently exciting. However, it follows from (8) that, for all $k \ge 1$,

$$P_k^{-1} = \sum_{i=0}^{k-1}\frac{1}{i+1} + P_0^{-1}. \qquad (24)$$

Thus, $\lim_{k\to\infty}P_k = 0$. $\diamond$
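Example 1 is easy to reproduce numerically (a sketch written for this tutorial, using the scalar covariance update that appears later as (52)): the harmonic series in (24) diverges, so $P_k \to 0$ even though the excitation fades.

```python
import numpy as np

P, lam = 1.0, 1.0                 # scalar case: P_0 = 1, no forgetting
for k in range(100000):
    phi = 1.0 / np.sqrt(k + 1)    # regressor of Example 1
    P = P / (lam + P * phi**2)    # scalar form of (4); see also (52)
print(P)   # decays like 1/log(k): slowly, but to zero
```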
The following result, given in [34], shows that, if $(\phi_k)_{k=0}^\infty$ is persistently exciting and $\lambda \in (0,1)$, then $P_k$ is bounded.

Proposition 4: Assume that $(\phi_k)_{k=0}^\infty \subset \mathbb{R}^{p\times n}$ is persistently exciting, let $N, \alpha, \beta$ be given by Definition 1, let $R \in \mathbb{R}^{n\times n}$ be positive definite, define $P_0 \triangleq R^{-1}$, let $\lambda \in (0,1)$, and, for all $k \ge 0$, let $P_k$ be given by (4). Then, for all $k \ge N+1$,

$$\frac{\lambda^N(1-\lambda)\alpha}{1-\lambda^{N+1}}\, I_n \le P_k^{-1} \le \frac{\beta}{1-\lambda^{N+1}}\, I_n + P_N^{-1}. \qquad (25)$$

Proof: It follows from (8) that, for all $i \ge 0$, $\lambda P_i^{-1} \le P_{i+1}^{-1}$ and $\phi_i^T\phi_i \le P_{i+1}^{-1}$, and thus, for all $i, j \ge 0$, $\lambda^j P_i^{-1} \le P_{i+j}^{-1}$. Hence, for all $k \ge N+1$,

$$\alpha I_n \le \sum_{i=k-N-1}^{k-1}\phi_i^T\phi_i \le \sum_{i=k-N}^{k} P_i^{-1} \le (\lambda^{-N}+\cdots+1)P_k^{-1} = \frac{1-\lambda^{N+1}}{\lambda^N(1-\lambda)}P_k^{-1},$$

which proves the first inequality in (25). To prove the second inequality in (25), note that iterating (8) yields, for all $k \ge N+1$,

$$P_k^{-1} = \lambda^{k-N}P_N^{-1} + \sum_{i=N}^{k-1}\lambda^{k-1-i}\phi_i^T\phi_i \le P_N^{-1} + \big(1 + \lambda^{N+1} + \lambda^{2(N+1)} + \cdots\big)\beta I_n = P_N^{-1} + \frac{\beta}{1-\lambda^{N+1}}\, I_n,$$

where the sum is bounded by grouping the terms into windows of length $N+1$ and applying (19) to each window. $\square$

The next result, which is an immediate consequence of (8), is a converse of Proposition 4.
Proposition 5: Define $\phi_k$, $y_k$, $R$, and $P_0$ as in Theorem 1, let $\lambda \in (0,1)$, and let $P_k$ be given by (4). Furthermore, assume there exist $\alpha, \beta \in (0,\infty)$ such that, for all $k \ge 0$, $\alpha I_n \le P_k^{-1} \le \beta I_n$. Let $N \ge \frac{\lambda\beta - \alpha}{(1-\lambda)\alpha}$. Then, for all $j \ge 0$,

$$\big[(1+(1-\lambda)N)\alpha - \lambda\beta\big] I_n \le \sum_{i=j}^{j+N}\phi_i^T\phi_i \le \frac{1-\lambda^{N+1}}{\lambda^N(1-\lambda)}\beta I_n. \qquad (26)$$

Consequently, $(\phi_k)_{k=0}^\infty$ is persistently exciting.

Proof: Note that, for all $j \ge 0$,

$$\big[(1+(1-\lambda)N)\alpha - \lambda\beta\big]I_n = \alpha I_n + (1-\lambda)N\alpha I_n - \lambda\beta I_n \le P_{j+N+1}^{-1} + (1-\lambda)\sum_{i=j+1}^{j+N}P_i^{-1} - \lambda P_j^{-1} = \sum_{i=j}^{j+N}\big(P_{i+1}^{-1} - \lambda P_i^{-1}\big) = \sum_{i=j}^{j+N}\phi_i^T\phi_i,$$

which proves the first inequality in (26). To prove the second inequality in (26), note that (8) implies that, for all $i \ge 0$, $\lambda P_i^{-1} \le P_{i+1}^{-1}$ and $\phi_i^T\phi_i \le P_{i+1}^{-1}$, and thus, for all $i,j \ge 0$, $\lambda^jP_i^{-1} \le P_{i+j}^{-1}$. Hence, for all $j \ge 0$,

$$\sum_{i=j}^{j+N}\phi_i^T\phi_i \le \sum_{i=j}^{j+N}P_{i+1}^{-1} \le (\lambda^{-N}+\cdots+1)P_{j+N+1}^{-1} \le \frac{1-\lambda^{N+1}}{\lambda^N(1-\lambda)}\beta I_n.$$

Finally, it follows from Definition 1 with $N \ge \frac{\lambda\beta-\alpha}{(1-\lambda)\alpha}$, $\alpha$ replaced by $(1+(1-\lambda)N)\alpha - \lambda\beta$, and $\beta$ replaced by $\frac{1-\lambda^{N+1}}{\lambda^N(1-\lambda)}\beta$ that $(\phi_k)_{k=0}^\infty$ is persistently exciting. $\square$

The proof of Proposition 5 shows that the condition $N \ge \frac{\lambda\beta-\alpha}{(1-\lambda)\alpha}$ is needed to satisfy the lower bound in Definition 1; the upper bound in Definition 1 is satisfied for all $N \ge 0$.

Example 2: Persistent excitation and bounds on $P_k^{-1}$. Let $\phi_k = [u_k \ \ u_{k-1}]$, where $u_k$ is the periodic signal

$$u_k = \sin\frac{2\pi k}{17} + \sin\frac{2\pi k}{23} + \sin\frac{2\pi k}{\ast}. \qquad (27)$$

Figure 1 shows the singular values of $F_{j,j+N}$ for $N = 2$ and $N = 10$, as well as the singular values of $P_k^{-1}$ with the corresponding upper and lower bounds given by (25) for $N = 2$ and $N = 10$. $\diamond$

Example 3: Lack of persistent excitation and bounds on $P_k^{-1}$. Let $\phi_k = [u_k \ \ u_{k-1}]$, where, for some switching step $k_0$, $u_k$ is given by (27) for all $k < k_0$ and $u_k = 1$ for all $k \ge k_0$. Figure 2 shows the singular values of $F_{j,j+2}$ and the singular values of $P_k^{-1}$ for $\lambda = 1$ and $\lambda = 0.\ast$, respectively. Note that, for $\lambda = 1$, one of the singular values of $P_k^{-1}$ diverges, whereas, for $\lambda \in (0,1)$, one of the singular values of $P_k^{-1}$ converges to zero. $\diamond$
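The switch from persistency to non-persistency in Example 3 can be reproduced as follows. This sketch is written for this tutorial, and since the period of the third harmonic in (27) is not legible in the source, a stand-in value of 29 is used there.

```python
import numpy as np

def u(k, k0=300):
    if k >= k0:                        # input goes constant: excitation is lost
        return 1.0
    # harmonics of (27); the third period is a stand-in for the illegible value
    return np.sin(2*np.pi*k/17) + np.sin(2*np.pi*k/23) + np.sin(2*np.pi*k/29)

phis = [np.array([[u(k), u(k-1)]]) for k in range(1, 600)]
for j in (0, 500):                     # one window before and one after the switch
    F = sum(phi.T @ phi for phi in phis[j:j+3])   # F_{j,j+2}
    print(j, np.linalg.eigvalsh(F))    # after the switch the smallest eigenvalue is ~0
```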
The following result shows that the predicted error $z_k \triangleq \phi_k\theta_k - y_k$ converges to zero whether or not $(\phi_k)_{k=0}^\infty$ is persistent.

Proposition 6: For all $k \ge 0$, let $\phi_k \in \mathbb{R}^{p\times n}$ and $y_k \in \mathbb{R}^p$, let $R \in \mathbb{R}^{n\times n}$ be positive definite, and let $P_0 = R^{-1}$, $\theta_0 \in \mathbb{R}^n$, and $\lambda \in (0,1]$. Furthermore, for all $k \ge 0$, let $P_k$ and $\theta_k$ be given by (4) and (5), respectively, and define the predicted error $z_k \triangleq \phi_k\theta_k - y_k$. Then,

$$\lim_{k\to\infty} z_k = 0. \qquad (28)$$

Proof:
For all $k \ge 0$, note that $z_k = \phi_k\tilde\theta_k$, and define $V_k \triangleq \tilde\theta_k^T P_k^{-1}\tilde\theta_k$. Note that, for all $k \ge 0$ and $\tilde\theta_k \in \mathbb{R}^n$, $V_k \ge 0$. Furthermore, for all $k \ge 0$,

$$\begin{aligned}
V_{k+1} - V_k &= \tilde\theta_{k+1}^T P_{k+1}^{-1}\tilde\theta_{k+1} - \tilde\theta_k^T P_k^{-1}\tilde\theta_k = \lambda^2\tilde\theta_k^T P_k^{-1}P_{k+1}P_k^{-1}\tilde\theta_k - \tilde\theta_k^T P_k^{-1}\tilde\theta_k \\
&= (\lambda\tilde\theta_{k+1} - \tilde\theta_k)^T P_k^{-1}\tilde\theta_k = -\big[(1-\lambda)\tilde\theta_k^T + \lambda\tilde\theta_k^T\phi_k^T\phi_kP_{k+1}\big]P_k^{-1}\tilde\theta_k \\
&= -\big[(1-\lambda)\tilde\theta_k^TP_k^{-1}\tilde\theta_k + \lambda\tilde\theta_k^T\phi_k^T\phi_kP_{k+1}P_k^{-1}\tilde\theta_k\big] \\
&= -\big[(1-\lambda)\tilde\theta_k^TP_k^{-1}\tilde\theta_k + \tilde\theta_k^T\phi_k^T\big[I_p - \phi_kP_k\phi_k^T(\lambda I_p + \phi_kP_k\phi_k^T)^{-1}\big]\phi_k\tilde\theta_k\big] \\
&= -\big[(1-\lambda)V_k + z_k^T\big[I_p - \phi_kP_k\phi_k^T(\lambda I_p + \phi_kP_k\phi_k^T)^{-1}\big]z_k\big] \le 0.
\end{aligned}$$

Note that, since $(V_k)_{k=1}^\infty$ is a nonnegative, nonincreasing sequence, it converges to a nonnegative number. Hence, $\lim_{k\to\infty}(V_{k+1}-V_k) = 0$, which implies that $\lim_{k\to\infty}[(1-\lambda)V_k + z_k^TR_kz_k] = 0$, where $R_k \triangleq I_p - \phi_kP_k\phi_k^T(\lambda I_p + \phi_kP_k\phi_k^T)^{-1}$. Lemma 2 from "Three Useful Lemmas" implies that $R_k$ is positive definite. Since $V_k \ge 0$, it follows that $\lim_{k\to\infty}z_k = 0$. $\square$

The following example shows that $\theta_k$ may converge despite the fact that $(\phi_k)_{k=0}^\infty$ is not persistent.

Example 4: Convergence of $z_k$ and $\theta_k$. Consider the first-order system

$$y_k = \frac{0.\ast}{q - 0.\ast}\, u_k, \qquad (29)$$

where $q$ is the forward-shift operator. Define $\phi_k \triangleq [y_{k-1}\ \ u_{k-1}]$, so that $y_k = \phi_k\theta$, where $\theta$ consists of the coefficients in (29). To apply RLS, let $P_0 = I_2$, $\theta_0 = 0$, and $\lambda = 0.\ast$. Figure 3 shows the singular values of $F_{j,j+10}$, the predicted error $z_k$, and the parameter estimate $\theta_k$ for two choices of the input $u_k$. In the first case, for all $k \ge 0$, $u_k = 1$, whereas, in the second case, $u_k$ is a different constant. For both choices of $u_k$, the predicted error $z_k$ converges to zero, which confirms Proposition 6, and $\theta_k$ converges. Note that, in these two cases, $\theta_k$ converges to different parameter values, neither of which is the true value. $\diamond$

Table 4 summarizes the results in this section.

TABLE 4: Behavior of $P_k$ with and without persistent excitation.

Persistent excitation, $\lambda = 1$: $P_k$ converges to zero (Proposition 3; Example 2).
Persistent excitation, $\lambda \in (0,1)$: $P_k$ is bounded (Propositions 4, 5; Example 2).
No persistent excitation, $\lambda = 1$: all singular values of $P_k$ are bounded, and some of these converge to zero (Example 3).
No persistent excitation, $\lambda \in (0,1)$: some singular values of $P_k$ diverge, whereas the remaining singular values are bounded (Example 3).
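As a numerical illustration of Proposition 6 and Example 4 (a sketch with stand-in coefficients, since the coefficients of (29) are not legible in the source): with a constant, non-persistent input, the predicted error $z_k$ converges to zero while $\theta_k$ settles on a value other than the true parameters.

```python
import numpy as np

lam, a, b = 0.95, 0.6, 1.0          # stand-in coefficients for a system like (29)
theta_true = np.array([a, b])
theta, P = np.zeros(2), np.eye(2)   # theta_0 = 0, P_0 = I_2
y_prev, u_prev = 0.0, 0.0
for k in range(500):
    phi = np.array([[y_prev, u_prev]])
    y = phi @ theta_true             # y_k = a*y_{k-1} + b*u_{k-1}
    z = phi @ theta - y              # predicted error z_k
    S = lam + phi @ P @ phi.T        # 1x1 since p = 1
    K = (P @ phi.T) / S
    theta = theta + K @ (y - phi @ theta)
    P = (P - K @ phi @ P) / lam
    y_prev, u_prev = y.item(), 1.0   # constant input: not persistently exciting
print(z, theta)   # z_k -> 0 while theta has settled away from theta_true
```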
Persistent Excitation and the Condition Number

For nonsingular $A \in \mathbb{R}^{n\times n}$, the condition number of $A$ is defined by

$$\kappa(A) \triangleq \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}. \qquad (30)$$

For $B \in \mathbb{R}^{n\times m}$, let $\|B\|$ denote the maximum singular value of $B$. If $A$ is positive definite, then

$$\|A^{-1}\|^{-1} I_n = \sigma_{\min}(A) I_n \le A \le \sigma_{\max}(A) I_n = \|A\| I_n. \qquad (31)$$

Therefore, if $\alpha, \beta \in (0,\infty)$ satisfy $\alpha \le \sigma_{\min}(A)$ and $\sigma_{\max}(A) \le \beta$, then $\kappa(A) \le \beta/\alpha$. Thus, if $\lambda = 1$ and $(\phi_k)_{k=0}^\infty$ is persistently exciting with $N, \alpha, \beta$ given by Definition 1, then (21) implies that

$$\kappa(P_k) \le \frac{\beta}{\alpha}. \qquad (32)$$

Similarly, if $\lambda \in (0,1)$ and $(\phi_k)_{k=0}^\infty$ is persistently exciting with $N, \alpha, \beta$ given by Definition 1, then (25) implies that

$$\kappa(P_k) \le \frac{\beta + (1-\lambda^{N+1})\|P_N^{-1}\|}{\lambda^N(1-\lambda)\alpha}. \qquad (33)$$

However, as shown by Example 3, in the case where $(\phi_k)_{k=0}^\infty$ is not persistently exciting, there might not exist $\alpha > 0$ satisfying (19), and thus $\kappa(P_k)$ may be unbounded. Hence $\kappa(P_k)$ can be used to determine whether $(\phi_k)_{k=0}^\infty$ is persistently exciting, where a bounded condition number implies that $(\phi_k)_{k=0}^\infty$ is persistently exciting, and a diverging condition number implies that $(\phi_k)_{k=0}^\infty$ is not persistently exciting, as illustrated by the following example. [35] provides a recursive algorithm for computing $\kappa(P_k)$.

Example 5: Using the condition number of $P_k$ to determine whether $(\phi_k)_{k=0}^\infty$ is persistently exciting. Consider the 5th-order system

$$y_k = \frac{0.\ast\, q^4 - 0.\ast\, q^3 - 0.\ast\, q^2 - 0.\ast\, q + 0.\ast}{q^5 - q^4 + 0.\ast\, q^3 - 0.\ast\, q^2 - 0.\ast\, q + 0.\ast}\, u_k, \qquad (34)$$

where $u_k$ is given by (27). To apply RLS, let $\theta$ consist of the coefficients in (34) and let

$$\phi_k = [u_{k-1}\ \cdots\ u_{k-5}\ \ y_{k-1}\ \cdots\ y_{k-5}], \qquad (35)$$

so that $y_k = \phi_k\theta$. Letting $P_0 = I_{10}$, Figure 4 shows the singular values of $F_{j,j+20}$ and the singular values and condition number of $P_k$ for $\lambda = 1$ and $\lambda = 0.\ast$. In particular, the smallest singular value of $F_{j,j+20}$ is essentially zero, which indicates that $(\phi_k)_{k=0}^\infty$ is not persistently exciting. Consequently, in the case where $\lambda = 0.\ast$, $P_k$ becomes ill-conditioned. $\diamond$

In Example 5, the regressor $(\phi_k)_{k=0}^\infty$ is not persistently exciting. Consequently, in the case where $\lambda = 1$, it follows from (20) that $P_k$ is bounded by $P_0$, and thus all of the singular values of $P_k$ are bounded; this property is illustrated by Figure 4. However, Figure 4 also shows that not all of the singular values of $P_k$ converge to zero. On the other hand, in the case where $\lambda = 0.\ast$, Figure 4 shows that some of the singular values of $P_k$ are bounded, whereas the remaining singular values diverge. This example thus shows that singular values can diverge due to the lack of persistent excitation with $\lambda \in (0,1)$.
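Monitoring $\kappa(P_k)$ online is inexpensive. The sketch below is a direct computation written for this tutorial (simpler than, and not to be confused with, the recursive algorithm of [35]); the threshold is an arbitrary choice.

```python
import numpy as np

def update_and_check(P, phi, lam, kappa_max=1e8):
    """One covariance update (4) plus a conditioning check on P_k."""
    p = phi.shape[0]
    S = lam * np.eye(p) + phi @ P @ phi.T
    P = (P - P @ phi.T @ np.linalg.inv(S) @ phi @ P) / lam
    kappa = np.linalg.cond(P)
    return P, kappa, kappa > kappa_max   # True flags a likely loss of persistency
```

If the flag trips repeatedly, the examples that follow suggest either pausing forgetting or switching to the variable-direction forgetting developed later in this article.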
Lyapunov Analysis of the Parameter Error

Let $k_0 \ge 0$, and consider the system

$$x_{k+1} = f(k, x_k), \qquad (36)$$

where $x_k \in \mathbb{R}^n$, $f\colon \{0,1,2,\ldots\} \times \mathbb{R}^n \to \mathbb{R}^n$ is continuous, and, for all $k \ge 0$, $f(k,0) = 0$. Let $\mathcal{D} \subset \mathbb{R}^n$ be an open set such that $0 \in \mathcal{D}$.
Definition 2: The zero solution of (36) is Lyapunov stable if, for all $\varepsilon > 0$ and $k_0 \ge 0$, there exists $\delta(\varepsilon, k_0) > 0$ such that, for all $x_{k_0} \in \mathbb{R}^n$ satisfying $\|x_{k_0}\| < \delta(\varepsilon, k_0)$, it follows that, for all $k \ge k_0$, $\|x_k\| < \varepsilon$.

Definition 3: The zero solution of (36) is uniformly Lyapunov stable if, for all $\varepsilon > 0$, there exists $\delta(\varepsilon) > 0$ such that, for all $k_0 \ge 0$ and all $x_{k_0} \in \mathbb{R}^n$ satisfying $\|x_{k_0}\| < \delta(\varepsilon)$, it follows that, for all $k \ge k_0$, $\|x_k\| < \varepsilon$.

Definition 4: The zero solution of (36) is globally asymptotically stable if it is Lyapunov stable and, for all $k_0 \ge 0$ and all $x_{k_0} \in \mathbb{R}^n$, it follows that $\lim_{k\to\infty} x_k = 0$.

Definition 5: The zero solution of (36) is uniformly globally geometrically stable if there exist $\alpha > 0$ and $\beta > 1$ such that, for all $k_0 \ge 0$ and all $x_{k_0} \in \mathbb{R}^n$, it follows that, for all $k \ge k_0$, $\|x_k\| \le \alpha\|x_{k_0}\|\beta^{-(k-k_0)}$.

Note that, if the zero solution of (36) is uniformly globally geometrically stable, then it is uniformly globally asymptotically stable as well as uniformly Lyapunov stable. The following three results are specializations of Theorem 13.11 given in [36, pp. 784, 785].
Theorem 3: Consider (36), and assume there exist a continuous function $V\colon \{0,1,\ldots\}\times\mathcal{D} \to \mathbb{R}$ and $\alpha > 0$ such that, for all $k \ge 0$ and $x \in \mathcal{D}$,

$$V(k,0) = 0, \qquad (37)$$
$$\alpha\|x\|^2 \le V(k,x), \qquad (38)$$
$$V(k+1, f(k,x)) - V(k,x) \le 0. \qquad (39)$$

Then, the zero solution of (36) is Lyapunov stable.

Theorem 4: Consider (36), and assume there exist a continuous function $V\colon\{0,1,\ldots\}\times\mathcal{D}\to\mathbb{R}$ and $\alpha_1, \beta_1 > 0$ such that, for all $k \ge 0$ and $x \in \mathcal{D}$,

$$V(k,0) = 0, \qquad (40)$$
$$\alpha_1\|x\|^2 \le V(k,x) \le \beta_1\|x\|^2, \qquad (41)$$
$$V(k+1, f(k,x)) - V(k,x) \le 0. \qquad (42)$$

Then, the zero solution of (36) is uniformly Lyapunov stable.

Theorem 5:
Consider (36), and assume there exist a continuous function $V\colon\{0,1,\ldots\}\times\mathbb{R}^n\to\mathbb{R}$ and $\alpha_2, \beta_2, \gamma > 0$ such that, for all $k \ge 0$ and $x \in \mathbb{R}^n$,

$$\alpha_2\|x\|^2 \le V(k,x) \le \beta_2\|x\|^2, \qquad (43)$$
$$V(k+1, f(k,x)) - V(k,x) \le -\gamma\|x\|^2. \qquad (44)$$

Then, the zero solution of (36) is uniformly globally geometrically stable.

The following result uses Theorems 3-5 to prove that, if $(\phi_k)_{k=0}^\infty$ is persistently exciting, then the RLS estimate $\theta_k$ with $\lambda \in (0,1)$ converges to $\theta$ in the sense of Definition 5. A related result is given in [34].

Theorem 6: Assume that $(\phi_k)_{k=0}^\infty$ is persistently exciting, let $N, \alpha, \beta$ be given by Definition 1, let $R \in \mathbb{R}^{n\times n}$ be positive definite, define $P_0 \triangleq R^{-1}$, let $\lambda \in (0,1]$, and, for all $k \ge 0$, let $P_k$ be given by (4). Then the zero solution of (12) is Lyapunov stable. In addition, if $\lambda \in (0,1)$, then the zero solution of (12) is uniformly Lyapunov stable and uniformly globally geometrically stable.

Proof:
Define the Lyapunov candidate $V(k,x) \triangleq x^T P_k^{-1} x$, where $x \in \mathbb{R}^n$. Note that, for all $k \ge 0$, $V(k,0) = 0$, which confirms (37). Next, defining $f(k,x) \triangleq (I_n - P_{k+1}\phi_k^T\phi_k)x$, it follows that

$$\begin{aligned}
V(k+1, f(k,x)) - V(k,x) &= f(k,x)^T P_{k+1}^{-1} f(k,x) - x^T P_k^{-1} x \\
&= x^T\big[(I_n - \phi_k^T\phi_k P_{k+1})P_{k+1}^{-1}(I_n - P_{k+1}\phi_k^T\phi_k) - P_k^{-1}\big]x \\
&= x^T\big[(P_{k+1}^{-1} - \phi_k^T\phi_k)(I_n - P_{k+1}\phi_k^T\phi_k) - P_k^{-1}\big]x \\
&= x^T\big[P_{k+1}^{-1} - 2\phi_k^T\phi_k + \phi_k^T\phi_k P_{k+1}\phi_k^T\phi_k - P_k^{-1}\big]x \\
&= x^T\big[(\lambda - 1)P_k^{-1} - \phi_k^T(I_p - \phi_k P_{k+1}\phi_k^T)\phi_k\big]x. \qquad (45)
\end{aligned}$$

First, consider the case where $\lambda = 1$. It follows from (8) with $\lambda = 1$ that $P_0^{-1} \le P_k^{-1}$, and thus, for all $k \ge 0$, $\sigma_{\min}(P_0^{-1})\|x\|^2 \le V(k,x)$, which confirms (38) with $\alpha = \sigma_{\min}(P_0^{-1})$. Next, note that

$$I_p - \phi_k P_{k+1}\phi_k^T = I_p - \big[\phi_k P_k\phi_k^T - \phi_k P_k\phi_k^T(I_p + \phi_k P_k\phi_k^T)^{-1}\phi_k P_k\phi_k^T\big]. \qquad (46)$$

Using (45), (46), and Lemma 3 from "Three Useful Lemmas" yields (39). It thus follows from Theorem 3 that the zero solution of (12) is Lyapunov stable.

Next, consider the case where $\lambda \in (0,1)$. It follows from Proposition 4 that, for all $k \ge N+1$,

$$\frac{\lambda^N(1-\lambda)\alpha}{1-\lambda^{N+1}}\|x\|^2 \le V(k,x) \le \frac{\beta}{1-\lambda^{N+1}}\|x\|^2 + x^TP_N^{-1}x \le \Big(\frac{\beta}{1-\lambda^{N+1}} + \|P_N^{-1}\|\Big)\|x\|^2,$$

which confirms (41) with $\alpha_1 = \frac{\lambda^N(1-\lambda)\alpha}{1-\lambda^{N+1}}$ and $\beta_1 = \frac{\beta}{1-\lambda^{N+1}} + \|P_N^{-1}\|$. Using (45), (46), and Lemma 3 from "Three Useful Lemmas," (42) is confirmed. It thus follows from Theorem 4 that the zero solution of (12) is uniformly Lyapunov stable. Furthermore, (43) is confirmed with $\alpha_2 = \alpha_1$ and $\beta_2 = \beta_1$. Finally, if $\lambda \in (0,1)$, then

$$V(k+1, f(k,x)) - V(k,x) \le (\lambda - 1)x^TP_k^{-1}x \le -(1-\lambda)\frac{\lambda^N(1-\lambda)\alpha}{1-\lambda^{N+1}}\|x\|^2,$$

which confirms (44) with $\gamma = \frac{\lambda^N(1-\lambda)^2\alpha}{1-\lambda^{N+1}}$. It thus follows from Theorem 5 that the zero solution of (12) is uniformly globally geometrically stable. $\square$

The following result provides an alternative proof of Theorem 6 that does not depend on Theorems 3-5. In addition, this result considers the case $\lambda = 1$, where the RLS estimate $\theta_k$ converges to $\theta$ in the sense of Definition 4.

Theorem 7:
Assume that $(\phi_k)_{k=0}^\infty$ is persistently exciting, let $N, \alpha, \beta$ be given by Definition 1, let $R \in \mathbb{R}^{n\times n}$ be positive definite, define $P_0 \triangleq R^{-1}$, let $\lambda \in (0,1]$, and, for all $k \ge 0$, let $P_k$ be given by (4). Then the zero solution of (12) is globally asymptotically stable. Furthermore, if $\lambda \in (0,1)$, then the zero solution of (12) is uniformly globally geometrically stable.

Proof:
Let $k_0 \ge 0$ and $\tilde\theta_{k_0} \in \mathbb{R}^n$. Then, it follows from (14) that, for all $k \ge k_0$,

$$\|\tilde\theta_k\| = \lambda^{k-k_0}\|P_kP_{k_0}^{-1}\tilde\theta_{k_0}\| \le \|P_kP_{k_0}^{-1}\tilde\theta_{k_0}\| \le \|P_k\|\,\|P_{k_0}^{-1}\|\,\|\tilde\theta_{k_0}\|. \qquad (47)$$

First, consider the case where $\lambda = 1$. Let $\delta > 0$, and suppose that $\tilde\theta_{k_0} \in \mathbb{R}^n$ satisfies $\|\tilde\theta_{k_0}\| < \delta$. It follows from (8) with $\lambda = 1$ that $\|P_k\| \le \|P_0\|$, and thus from (47) that, for all $k \ge k_0$, $\|\tilde\theta_k\| < \|P_0\|\,\|P_{k_0}^{-1}\|\,\delta$. It thus follows from Definition 2 with $\varepsilon = \|P_0\|\,\|P_{k_0}^{-1}\|\,\delta$ that the zero solution of (12) is Lyapunov stable. Next, let $\tilde\theta_0 \in \mathbb{R}^n$. Then, Proposition 3 implies that $\lim_{k\to\infty}\tilde\theta_k = \lim_{k\to\infty}P_kP_0^{-1}\tilde\theta_0 = 0$. It thus follows from Definition 4 that the zero solution of (12) is globally asymptotically stable.

Next, consider the case where $\lambda \in (0,1)$. Let $k_0 \ge 0$ and $\delta > 0$, and let $\tilde\theta_{k_0} \in \mathbb{R}^n$ satisfy $\|\tilde\theta_{k_0}\| < \delta$. It follows from Proposition 4 and (47) that, for all $k \ge \max(N+1, k_0)$, $\|\tilde\theta_k\| < \varepsilon$, where

$$\varepsilon \triangleq \frac{\beta + (1-\lambda^{N+1})\|P_N^{-1}\|}{\lambda^N(1-\lambda)\alpha}\,\delta.$$

It thus follows from Definition 3 that the zero solution of (12) is uniformly Lyapunov stable. Next, let $\tilde\theta_{k_0} \in \mathbb{R}^n$. Then, it follows from (14) and Proposition 4 that, for all $\tilde\theta_{k_0} \in \mathbb{R}^n$ and $k \ge \max(N+1, k_0)$, $\|\tilde\theta_k\| \le \bar\alpha\|\tilde\theta_{k_0}\|\bar\beta^{-(k-k_0)}$, where $\bar\beta \triangleq 1/\lambda$ and

$$\bar\alpha \triangleq \frac{\beta + (1-\lambda^{N+1})\|P_N^{-1}\|}{\lambda^N(1-\lambda)\alpha}.$$

It thus follows from Definition 5 that the zero solution of (12) is uniformly globally geometrically stable, and thus globally asymptotically stable. $\square$

The following result shows that persistent excitation produces an infinite sequence of matrices whose product converges to zero.
Proposition 7: Let $P_0 \in \mathbb{R}^{n\times n}$ be positive definite, let $\lambda \in (0,1]$, and, for all $k \ge 0$, let $P_k$ be given by (4). Then, for all $k \ge 0$, all of the eigenvalues of $P_{k+1}\phi_k^T\phi_k$ are contained in $[0,1]$. If, in addition, $(\phi_k)_{k=0}^\infty$ is persistently exciting, then

$$\lim_{k\to\infty} A_k = 0, \qquad (48)$$

where

$$A_k \triangleq (I_n - P_{k+1}\phi_k^T\phi_k)\cdots(I_n - P_1\phi_0^T\phi_0). \qquad (49)$$

Proof:
It follows from (8) that, for all $k \ge 0$, $\phi_k^T\phi_k \le P_{k+1}^{-1}$, and thus, for all $k \ge 0$, $P_{k+1}^{1/2}\phi_k^T\phi_kP_{k+1}^{1/2} \le I_n$. Hence, for all $k \ge 0$,

$$0 \le \lambda_{\max}(P_{k+1}\phi_k^T\phi_k) = \lambda_{\max}(P_{k+1}^{1/2}\phi_k^T\phi_kP_{k+1}^{1/2}) \le 1.$$

To prove (48), suppose that $(\phi_k)_{k=0}^\infty$ is persistently exciting, let $i \in \{1,\ldots,n\}$, and define $\theta_0 \triangleq e_i + \theta$, where $e_i$ is the $i$th column of $I_n$. Note that $\tilde\theta_0 \triangleq \theta_0 - \theta = e_i$. Then, (14) implies that, for all $k \ge 0$,

$$\tilde\theta_{k+1} = A_ke_i = \lambda^{k+1}P_{k+1}P_0^{-1}e_i. \qquad (50)$$

It follows from Theorem 7 that $\tilde\theta_k$ converges to zero. Hence, (50) implies that the $i$th column of $A_k$ converges to zero as $k \to \infty$. It thus follows that every column of $A_k$ converges to zero as $k \to \infty$, which implies (48). $\square$

It follows from Theorem 7 that, if $(\phi_k)_{k=0}^\infty$ is persistently exciting, then, for all $\lambda \in (0,1]$, $\tilde\theta_k$ converges to zero. In addition, if $\lambda \in (0,1)$, then $\tilde\theta_k$ converges to zero geometrically, and thus the rate of convergence of $\|\tilde\theta_k\|$ is $O(\lambda^k)$. However, in the case $\lambda = 1$, as shown in [34] and the next example, $\tilde\theta_k$ converges to zero as $O(1/k)$, and thus the convergence is not geometric.

Example 6: Effect of $\lambda$ on the rate of convergence of $\theta_k$. Consider the 3rd-order FIR system

$$y_k = \frac{q^2 + 0.\ast\, q + 0.\ast}{q^3}\, u_k. \qquad (51)$$

To apply RLS, let $\theta = [1\ \ 0.\ast\ \ 0.\ast]^T$, $\theta_0 = 0$, and $\phi_k = [u_{k-1}\ \ u_{k-2}\ \ u_{k-3}]$, where the input $u_k$ is zero-mean Gaussian white noise with standard deviation 1. Note that $(\phi_k)_{k=0}^\infty$ is persistently exciting. It thus follows from Theorem 7 that $\tilde\theta_k$ converges to zero. Figure 5 shows the parameter-error norm $\|\tilde\theta_k\|$ for several values of $P_0$ and $\lambda$ as well as the condition number of the corresponding $P_k$. Note that the convergence rate of $\|\tilde\theta_k\|$ is $O(1/k)$ for $\lambda = 1$ and geometric for all $\lambda \in (0,1)$. Furthermore, as $\lambda$ is decreased, the convergence rate of $\theta_k$ increases; however, the condition number of $P_k$ degrades, and the effect of $P_0$ is reduced. $\diamond$

Lack of Persistent Excitation

This section presents numerical examples to investigate the effect of lack of persistent excitation. As shown in Example 3 and Example 5, if $(\phi_k)_{k=0}^\infty$ is not persistently exciting and $\lambda = 1$, then some of the singular values of $P_k$ converge to zero, whereas the remaining singular values remain bounded. On the other hand, if $(\phi_k)_{k=0}^\infty$ is not persistently exciting and $\lambda \in (0,1)$, then some of the singular values of $P_k$ remain bounded, whereas the remaining singular values diverge. Furthermore, Proposition 6 implies that the predicted error $z_k$ converges to zero whether or not $(\phi_k)_{k=0}^\infty$ is persistent.

Example 7: Lack of persistent excitation in scalar estimation. Let $n = 1$, so that (4), (5) are given by

$$P_{k+1} = \frac{P_k}{\lambda + P_k\phi_k^2}, \qquad (52)$$
$$\tilde\theta_{k+1} = \frac{\lambda\tilde\theta_k}{\lambda + P_k\phi_k^2}. \qquad (53)$$

Now, let $k_0 \ge 0$ and assume that, for all $k \ge k_0$, $\phi_k = 0$. Therefore, $F_{j,j+N}$ cannot be lower bounded as in (19) for all $j \ge 0$ and $N \ge 0$, and thus $(\phi_k)_{k=0}^\infty$ is not persistently exciting. Furthermore, in the case where $\lambda = 1$, it follows from the fact that $\phi_k = 0$ for all $k \ge k_0$ that $P_k$ and $\tilde\theta_k$ converge in $k_0$ steps to $P_{k_0}$ and $\tilde\theta_{k_0}$, respectively. Furthermore, if $\theta_{k_0} = \theta$, then $\tilde\theta_{k_0} = 0$. However, in the case where $\lambda \in (0,1)$, it follows that $P_k$ diverges geometrically, whereas, as in the case where $\lambda = 1$, $\tilde\theta_k$ converges in $k_0$ steps. Therefore, for all $\lambda \in (0,1]$, since $\phi_k = 0$ for all $k \ge k_0$, it follows from (52) and (53) that, for all $k \ge k_0$, the minimum value of (2) is achieved in a finite number of steps.
Consequently, RLS provides no further refinement of the estimate $\theta_k$ of $\theta$, and thus $\tilde\theta_{k_0} \ne 0$ implies that $\theta_k$ does not converge to $\theta$.

Alternatively, assume that, for all $k \ge 0$, $\phi_k = \phi$, where $\phi \ne 0$. Then it follows from Definition 1 with $N = 1$, $\alpha = \phi^2$, and $\beta = 3\phi^2$ that $(\phi_k)_{k=0}^\infty$ is persistently exciting. If $\lambda = 1$, then both $P_k$ and $\tilde\theta_k$ converge to zero. However, if $\lambda \in (0,1)$, then $P_k$ converges to $\frac{1-\lambda}{\phi^2}$ and $\tilde\theta_k$ converges geometrically to zero. Table 5 shows the asymptotic behavior of $\tilde\theta_k$ and $P_k$ for both of these cases. $\diamond$

TABLE 5: Asymptotic behavior of RLS in Example 7. In the case of persistent excitation with $\lambda < 1$, the convergence of $\tilde\theta_k$ is geometric.

Not persistently exciting, $\lambda = 1$: $\tilde\theta_k \to \tilde\theta_{k_0}$, $P_k \to P_{k_0}$.
Not persistently exciting, $\lambda \in (0,1)$: $\tilde\theta_k \to \tilde\theta_{k_0}$, $P_k$ diverges.
Persistently exciting, $\lambda = 1$: $\tilde\theta_k \to 0$, $P_k \to 0$.
Persistently exciting, $\lambda \in (0,1)$: $\tilde\theta_k \to 0$, $P_k \to \frac{1-\lambda}{\phi^2}$.
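The scalar recursions (52), (53) make the failure mode of Example 7 visible in a few lines (a sketch written for this tutorial): once the regressor vanishes, $\tilde\theta_k$ freezes while $P_k$ grows like $\lambda^{-k}$.

```python
import numpy as np

lam, k0 = 0.9, 50
P, theta_err = 1.0, 1.0            # P_0 = 1, initial parameter error
for k in range(200):
    phi = 1.0 if k < k0 else 0.0   # excitation is lost at step k0
    d = lam + P * phi**2
    P, theta_err = P / d, lam * theta_err / d    # (52), (53)
print(P, theta_err)   # P has grown by lam**-(200 - k0); theta_err is frozen at its k0 value
```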
Example 8: Subspace-constrained regressor. Consider (1), where $\phi_k = \big(\sin\frac{\pi k}{\ast}\big)[1\ \ 1]$ and $\theta = [0.\ast\ \ 0.\ast]^T$. To estimate $\theta$ using RLS, let $P_0 = I_2$ and $\theta_0 = 0$. Figure 6 shows the estimate $\theta_k$ of $\theta$ with $\lambda = 1$ and $\lambda = 0.\ast$. Note that all regressors $\phi_k$ lie along the same one-dimensional subspace, and thus $(\phi_k)_{k=0}^\infty$ is not persistently exciting. It follows from (16) that the estimate $\theta_k$ of $\theta$ lies in this subspace. For $\lambda = 1$, note that one singular value of $P_k$ decreases to zero, whereas the other singular value is bounded. Note that $\tilde\theta_k$ converges along the singular vector corresponding to the bounded singular value. For $\lambda = 0.\ast$, one singular value is bounded, whereas the other singular value diverges. Note that $\tilde\theta_k$ converges along the singular vector corresponding to the diverging singular value. $\diamond$
Example 9: Lack of persistent excitation and finite-precision arithmetic. Consider the problem of fitting a 5th-order model to measured input-output data from the system (34), where the input $u_k$ is given by (27). Note that $\phi_k$ is given by (35) and, as shown in Example 5, is not persistently exciting. Let $P_0 = I_{10}$, $\theta_0 = 0$, and $\lambda = 0.\ast$. Figure 7 shows the predicted error $z_k$, the norm of the parameter error $\tilde\theta_k$, and the singular values and the condition number of $P_k$. Note that $\tilde\theta_k$ does not converge to zero and that six singular values of $P_k$ remain bounded due to the presence of three harmonics in the regressor. Due to finite-precision arithmetic, the computation becomes erroneous as $P_k$ becomes numerically ill-conditioned, and thus the estimate $\theta_k$ diverges. $\diamond$

The numerical examples in this section show that, if $\lambda \in (0,1)$ and $(\phi_k)_{k=0}^\infty$ is not persistently exciting, then $\tilde\theta_k$ does not necessarily converge to zero. Furthermore, if $\lambda \in (0,1)$ and $(\phi_k)_{k=0}^\infty$ is not persistently exciting, then some of the singular values of $P_k$ diverge, and $\theta_k$ diverges due to finite-precision arithmetic when $P_k$ becomes numerically ill-conditioned.
Information Subspace

Using the singular value decomposition, (8) can be written as

$$P_{k+1}^{-1} = \lambda U_k\Sigma_kU_k^T + U_k\psi_k^T\psi_kU_k^T, \qquad (54)$$

where $U_k \in \mathbb{R}^{n\times n}$ is an orthonormal matrix whose columns are the singular vectors of $P_k^{-1}$, $\Sigma_k \in \mathbb{R}^{n\times n}$ is a diagonal matrix whose diagonal entries are the corresponding singular values, and

$$\psi_k \triangleq \phi_kU_k. \qquad (55)$$

The columns of $U_k$ are the information directions at step $k$, and each row of $\psi_k$ is the projection of the corresponding row of $\phi_k$ onto the information directions. The norm of each column of $\psi_k$ thus indicates the information content present in $\phi_k$ along the corresponding information direction. The smallest subspace that is spanned by a subset of the information directions and that contains all rows of $\phi_k$ is the information-rich subspace $\mathcal{I}_k$ at step $k$. Figure 8 illustrates the information-rich subspace.

Now, consider the case where

$$\psi_k = \big[\psi_{k,1}\ \ 0_{p\times(n-n_1)}\big], \qquad (56)$$

where $\psi_{k,1} \in \mathbb{R}^{p\times n_1}$. It follows from (56) that $\phi_k$ provides new information along the first $n_1$ columns of $U_k$; these directions constitute the information-rich subspace. It thus follows from (54) and (56) that $P_{k+1}^{-1}$ is given by

$$P_{k+1}^{-1} = U_k\begin{bmatrix}\lambda\Sigma_{k,1} + \psi_{k,1}^T\psi_{k,1} & 0\\ 0 & \lambda\Sigma_{k,2}\end{bmatrix}U_k^T, \qquad (57)$$

where $\Sigma_{k,1} \in \mathbb{R}^{n_1\times n_1}$ is the diagonal matrix whose diagonal entries are the first $n_1$ singular values of $P_k^{-1}$, and $\Sigma_{k,2}$ is the diagonal matrix whose diagonal entries are the remaining $n - n_1$ singular values of $P_k^{-1}$. In particular, writing

$$U_k = [U_{k,1}\ \ U_{k,2}], \qquad (58)$$

where $U_{k,1} \in \mathbb{R}^{n\times n_1}$ contains the first $n_1$ columns of $U_k$, and $U_{k,2} \in \mathbb{R}^{n\times(n-n_1)}$ contains the remaining $n - n_1$ columns of $U_k$, it follows that

$$P_{k+1}^{-1} = [U_{k+1,1}\ \ U_{k+1,2}]\begin{bmatrix}\Sigma_{k+1,1} & 0\\ 0 & \Sigma_{k+1,2}\end{bmatrix}\begin{bmatrix}U_{k+1,1}^T\\ U_{k+1,2}^T\end{bmatrix}, \qquad (59)$$

where

$$U_{k+1,1} = U_{k,1}V_k, \qquad (60)$$
$$\Sigma_{k+1,1} = D_k, \qquad (61)$$
$$U_{k+1,2} = U_{k,2}, \qquad (62)$$
$$\Sigma_{k+1,2} = \lambda\Sigma_{k,2}, \qquad (63)$$

and where $V_k \in \mathbb{R}^{n_1\times n_1}$ contains the singular vectors of $\lambda\Sigma_{k,1} + \psi_{k,1}^T\psi_{k,1}$ and $D_k \in \mathbb{R}^{n_1\times n_1}$ is the diagonal matrix containing the corresponding singular values. It follows from (62), (63) that if, for all $k \ge 0$, $\psi_k$ is given by (56) and $\lambda \in (0,1)$, then the last $n - n_1$ singular vectors of $P_k^{-1}$ do not change and the corresponding singular values of $P_k^{-1}$ decrease to zero geometrically. It thus follows from Proposition 4 that $(\phi_k)_{k=0}^\infty$ is not persistently exciting. Furthermore, since $P_k$ and $P_k^{-1}$ have the same singular vectors and the singular values of $P_k$ are the reciprocals of the singular values of $P_k^{-1}$, it follows that the last $n - n_1$ singular values of $P_k$ diverge.

The next example considers the case where there exists a proper subspace
$\mathcal{S} \subset \mathbb{R}^n$ such that, for all $k \ge 0$, $\mathcal{R}(\phi_k^T) \subseteq \mathcal{S}$. Hence, $(\phi_k)_{k=0}^\infty$ is not persistently exciting. In this case, for all $k \ge 0$, the information-rich subspace $\mathcal{I}_k$ is a proper subspace of $\mathbb{R}^n$, and the singular values of $P_k^{-1}$ corresponding to the singular vectors in the orthogonal complement of $\mathcal{I}_k$ converge to zero.
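The information directions and the column-norm test used in (67) below can be computed directly from an eigendecomposition of $P_k^{-1}$. The following sketch (hypothetical code consistent with (55)) returns $U_k$, $\psi_k$, and a mask marking the information-rich directions.

```python
import numpy as np

def information_directions(Pinv, phi, eps):
    """Return U_k, psi_k = phi_k U_k of (55), and a mask of information-rich directions.

    A direction is flagged information-rich when the norm of the corresponding
    column of psi_k exceeds eps, as in (67).
    """
    sigma, U = np.linalg.eigh(Pinv)   # P_k^{-1} = U diag(sigma) U^T
    psi = phi @ U                     # (55)
    rich = np.linalg.norm(psi, axis=0) > eps
    return U, psi, rich
```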
Example 10: Lack of persistent excitation and the information-rich subspace. Consider the regressor $\phi_k$ given by (35) and used in Example 5. Recall that $(\phi_k)_{k=0}^\infty$ is not persistently exciting. Let $P_0 = I_{10}$. Figure 9 shows the information content $\|\mathrm{col}_i(\psi_k)\|$ for several values of $\lambda$ along with the singular values of the corresponding $P_k^{-1}$. Note that the information-rich subspace is six-dimensional due to the presence of three harmonics in $u_k$, as shown by six relatively large components of $\psi_k$, and, in the case where $\lambda < 1$, the singular values that correspond to the singular vectors not in the information-rich subspace converge to zero in machine precision. $\diamond$

Variable-Direction Forgetting

Examples 3, 5, 7, 8, and 9 show that some of the singular values of $P_k^{-1}$ converge to zero in the case where $\phi_k$ is not persistently exciting. To address this situation, (8) is modified by replacing the scalar forgetting factor $\lambda$ by a data-dependent forgetting matrix $\Lambda_k$. Similar modifications are discussed in "Toward Matrix Forgetting." In particular, $P_{k+1}^{-1}$ is redefined as

$$P_{k+1}^{-1} = \Lambda_kP_k^{-1}\Lambda_k + \phi_k^T\phi_k, \qquad (64)$$

where $\Lambda_k$ is a positive-definite (and thus symmetric) matrix constructed below. Note that, for all $k \ge 0$, $P_{k+1}^{-1}$ given by (64) is positive definite. Using the singular value decomposition, (64) can be written as

$$P_{k+1}^{-1} = \Lambda_kU_k\Sigma_kU_k^T\Lambda_k + U_k\psi_k^T\psi_kU_k^T, \qquad (65)$$

where $U_k$, $\Sigma_k$, and $\psi_k$ are as defined in the previous section.

The objective is to apply forgetting to only those singular values of $P_k^{-1}$ that correspond to the singular vectors in the information-rich subspace; that is, forgetting is restricted to the subspace of $P_k^{-1}$ where sufficient new information is provided by $\phi_k$. Specifically, forgetting is applied to those information directions where the information content is greater than $\varepsilon > 0$, where $\varepsilon$ should be selected to be larger than the noise-to-signal ratio or, if no noise is present, larger than machine zero. To do so, (65) is written as

$$P_{k+1}^{-1} = U_k\bar\Lambda_k\Sigma_k\bar\Lambda_kU_k^T + U_k\psi_k^T\psi_kU_k^T, \qquad (66)$$

where $\bar\Lambda_k$ is a diagonal matrix whose diagonal entries are either $\sqrt\lambda$ or $1$. In particular,

$$\bar\Lambda_k(i,i) \triangleq \begin{cases}\sqrt\lambda, & \|\mathrm{col}_i(\psi_k)\| > \varepsilon,\\ 1, & \text{otherwise},\end{cases} \qquad (67)$$

where $\mathrm{col}_i(\psi_k)$ is the $i$th column of $\psi_k$ and $\lambda \in (0,1)$. Note that it follows from (66) and (67) that $P_{k+1}^{-1}$ is positive definite. Next, it follows from (65) and (66) that

$$\Lambda_k = U_k\bar\Lambda_kU_k^T, \qquad (68)$$

which is positive definite. Note that

$$\Lambda_k^{-1} = U_k\bar\Lambda_k^{-1}U_k^T. \qquad (69)$$
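Putting (64)-(69) together gives a complete variable-direction forgetting update of the information matrix. The sketch below is hypothetical code consistent with the construction above, not the authors' implementation.

```python
import numpy as np

def vdf_step(Pinv, phi, lam, eps):
    """Variable-direction forgetting update of the information matrix, per (64)-(69)."""
    sigma, U = np.linalg.eigh(Pinv)                  # P_k^{-1} = U_k Sigma_k U_k^T
    psi = phi @ U                                    # (55)
    d = np.where(np.linalg.norm(psi, axis=0) > eps,  # (67): forget only where
                 np.sqrt(lam), 1.0)                  # new information is present
    Lam = U @ np.diag(d) @ U.T                       # (68)
    return Lam @ Pinv @ Lam + phi.T @ phi            # (64)
```

When every column of $\psi_k$ exceeds $\varepsilon$, all diagonal entries of $\bar\Lambda_k$ equal $\sqrt\lambda$, so $\Lambda_kP_k^{-1}\Lambda_k = \lambda P_k^{-1}$ and (64) reduces to the uniform-direction update (8).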
The next result provides a recursive formula to update $P_{k+1}$ given by (64).

Proposition 8: Let $\lambda \in (0,1)$ and $\varepsilon > 0$, let $(P_k)_{k=0}^\infty$ be a sequence of $n\times n$ positive-definite matrices, and let $U_k \in \mathbb{R}^{n\times n}$ be an orthonormal matrix whose columns are the singular vectors of $P_k$. Furthermore, let $\psi_k \in \mathbb{R}^{p\times n}$ be given by (55), let $\bar\Lambda_k$ be given by (67), and let $\Lambda_k$ be given by (68). Then, for all $k \ge 0$, $(P_k)_{k=0}^\infty$ satisfies (64) if and only if, for all $k \ge 0$, $(P_k)_{k=0}^\infty$ satisfies

$$P_{k+1} = \bar P_k - \bar P_k\phi_k^T(I_p + \phi_k\bar P_k\phi_k^T)^{-1}\phi_k\bar P_k, \qquad (70)$$

where

$$\bar P_k = \Lambda_k^{-1}P_k\Lambda_k^{-1}. \qquad (71)$$

Proof:
To prove necessity, it follows from (64) and the matrix-inversion lemma that

$$P_{k+1} = (\Lambda_kP_k^{-1}\Lambda_k + \phi_k^T\phi_k)^{-1} = (\Lambda_kP_k^{-1}\Lambda_k)^{-1} - (\Lambda_kP_k^{-1}\Lambda_k)^{-1}\phi_k^T\big[I_p + \phi_k(\Lambda_kP_k^{-1}\Lambda_k)^{-1}\phi_k^T\big]^{-1}\phi_k(\Lambda_kP_k^{-1}\Lambda_k)^{-1} = \bar P_k - \bar P_k\phi_k^T(I_p + \phi_k\bar P_k\phi_k^T)^{-1}\phi_k\bar P_k,$$

where $\bar P_k$ is given by (71). Reversing these steps proves sufficiency. $\square$

The modified update (64) is shown to be optimal for a specific cost function in "A Modified Quadratic Cost Function Supporting Variable-Direction RLS."

Next, the matrix-forgetting scheme (64) is shown to prevent the singular values of $P_k$ from diverging. Consider the case where, for all $k \ge 0$,

$$\psi_k = [\psi_{k,1}\ \ 0], \qquad (72)$$

where $\psi_{k,1} \in \mathbb{R}^{p\times n_1}$, that is, the information-rich subspace is spanned by the first $n_1$ columns of $U_k$. It thus follows from (66) and (72) that $P_{k+1}^{-1}$ is given by

$$P_{k+1}^{-1} = U_k\begin{bmatrix}\lambda\Sigma_{k,1} + \psi_{k,1}^T\psi_{k,1} & 0\\ 0 & \Sigma_{k,2}\end{bmatrix}U_k^T. \qquad (73)$$

It follows from the $(2,2)$ block of (73) that the last $n - n_1$ information directions and the corresponding singular values are not affected by $\phi_k$. Furthermore, if $n_1 = n$, that is, new information is present in $\phi_k$ along every information direction, then forgetting is applied to all of the singular values of $P_k^{-1}$, and thus variable-direction forgetting specializes to uniform-direction forgetting, that is, RLS with the update for $P_k$ given by (8).

The next result shows that, as in the case of uniform-direction forgetting, $z_k$ converges to zero with variable-direction forgetting for every choice of $\varepsilon > 0$, whether or not $(\phi_k)_{k=0}^\infty$ is persistently exciting.

Proposition 9: For all $k \ge 0$, let $\phi_k \in \mathbb{R}^{p\times n}$ and $y_k \in \mathbb{R}^p$, let $R \in \mathbb{R}^{n\times n}$ be positive definite, and let $P_0 = R^{-1}$, $\theta_0 \in \mathbb{R}^n$, and $\lambda \in (0,1)$. Furthermore, for all $k \ge 0$, let $P_k$ and $\theta_k$ be given by (64) and (5), respectively. Then,

$$\lim_{k\to\infty} z_k = 0. \qquad (74)$$

Proof:
Using (67), (68), and $P_k^{-1} = U_k\Sigma_kU_k^T$, it follows that, for all $k \ge 0$,

$$\Lambda_kP_k^{-1}\Lambda_k = U_k\bar\Lambda_k\Sigma_k\bar\Lambda_kU_k^T \le U_k\Sigma_kU_k^T = P_k^{-1}. \qquad (75)$$

For all $k \ge 0$, note that $z_k = \phi_k\tilde\theta_k$, and define $V_k \triangleq \tilde\theta_k^TP_k^{-1}\tilde\theta_k$. Note that, for all $k \ge 0$ and $\tilde\theta_k \in \mathbb{R}^n$, $V_k \ge 0$. Furthermore, since $P_{k+1}^{-1}\tilde\theta_{k+1} = (P_{k+1}^{-1} - \phi_k^T\phi_k)\tilde\theta_k = \Lambda_kP_k^{-1}\Lambda_k\tilde\theta_k$, it follows that, for all $k \ge 0$,

$$\begin{aligned}
V_{k+1} - V_k &= \tilde\theta_{k+1}^TP_{k+1}^{-1}\tilde\theta_{k+1} - \tilde\theta_k^TP_k^{-1}\tilde\theta_k = \tilde\theta_k^T\big[\Lambda_kP_k^{-1}\Lambda_kP_{k+1}\Lambda_kP_k^{-1}\Lambda_k - P_k^{-1}\big]\tilde\theta_k \\
&= \tilde\theta_k^T\big[\Lambda_kP_k^{-1}\Lambda_k\big(\bar P_k - \bar P_k\phi_k^T(I_p + \phi_k\bar P_k\phi_k^T)^{-1}\phi_k\bar P_k\big)\Lambda_kP_k^{-1}\Lambda_k - P_k^{-1}\big]\tilde\theta_k \\
&= \tilde\theta_k^T\big[\Lambda_kP_k^{-1}\Lambda_k - \phi_k^T(I_p + \phi_k\bar P_k\phi_k^T)^{-1}\phi_k - P_k^{-1}\big]\tilde\theta_k \\
&= -\big[\tilde\theta_k^T(P_k^{-1} - \Lambda_kP_k^{-1}\Lambda_k)\tilde\theta_k + z_k^T(I_p + \phi_k\bar P_k\phi_k^T)^{-1}z_k\big] \le 0.
\end{aligned}$$

Note that, since $(V_k)_{k=1}^\infty$ is a nonnegative, nonincreasing sequence, it converges to a nonnegative number. Hence, $\lim_{k\to\infty}(V_{k+1}-V_k) = 0$, which implies that

$$\lim_{k\to\infty}\big[\tilde\theta_k^T(P_k^{-1} - \Lambda_kP_k^{-1}\Lambda_k)\tilde\theta_k + z_k^T(I_p + \phi_k\bar P_k\phi_k^T)^{-1}z_k\big] = 0.$$

Since, for all $k \ge 0$, $P_k^{-1} - \Lambda_kP_k^{-1}\Lambda_k \ge 0$ and $(I_p + \phi_k\bar P_k\phi_k^T)^{-1} > 0$, it follows that $\lim_{k\to\infty}z_k = 0$. $\square$
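The contrast between uniform-direction forgetting (8) and variable-direction forgetting (64) can be seen with a two-dimensional, rank-deficient regressor in the spirit of Examples 8 and 11 (a sketch written for this tutorial): under (8) the unexcited eigenvalue of $P_k^{-1}$ decays like $\lambda^k$, whereas under (64) it is held fixed.

```python
import numpy as np

lam, eps = 0.9, 1e-8
Pinv_u = np.eye(2)                     # uniform-direction forgetting, (8)
Pinv_v = np.eye(2)                     # variable-direction forgetting, (64)
phi = np.array([[1.0, 1.0]])           # all regressors confined to span{[1, 1]}
for k in range(200):
    Pinv_u = lam * Pinv_u + phi.T @ phi
    sigma, U = np.linalg.eigh(Pinv_v)
    d = np.where(np.linalg.norm(phi @ U, axis=0) > eps, np.sqrt(lam), 1.0)   # (67)
    Lam = U @ np.diag(d) @ U.T                                               # (68)
    Pinv_v = Lam @ Pinv_v @ Lam + phi.T @ phi                                # (64)
print(np.linalg.eigvalsh(Pinv_u))   # smallest eigenvalue ~ lam**200: P_k blows up
print(np.linalg.eigvalsh(Pinv_v))   # smallest eigenvalue stays at 1
```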
The next result shows that $P_k$ is bounded from above with variable-direction forgetting for every choice of $\varepsilon > 0$ in the case where $(\phi_k)_{k=0}^\infty$ is persistently exciting.

Proposition 10: Assume that $(\phi_k)_{k=0}^\infty$ is persistently exciting, let $N, \alpha, \beta$ be given by Definition 1, let $R \in \mathbb{R}^{n\times n}$ be positive definite, define $P_0 \triangleq R^{-1}$, let $\lambda \in (0,1)$, and, for all $k \ge 0$, let $P_k$ be given by (64). Then, for all $k \ge N+1$,

$$\frac{\lambda^N(1-\lambda)\alpha}{1-\lambda^{N+1}}\, I_n \le P_k^{-1}. \qquad (76)$$

Proof:
It follows from (64) that, for all $k \ge 0$, $\Lambda_kP_k^{-1}\Lambda_k \le P_{k+1}^{-1}$ and $\phi_k^T\phi_k \le P_{k+1}^{-1}$. Next, using (68) and $P_k^{-1} = U_k\Sigma_kU_k^T$, it follows that, for all $k \ge 0$,

$$\lambda P_k^{-1} = \lambda U_k\Sigma_kU_k^T \le U_k\bar\Lambda_k\Sigma_k\bar\Lambda_kU_k^T = \Lambda_kP_k^{-1}\Lambda_k \le P_{k+1}^{-1},$$

and thus, for all $i, j \ge 0$, $\lambda^jP_i^{-1} \le P_{i+j}^{-1}$. Hence, for all $k \ge N+1$,

$$\alpha I_n \le \sum_{i=k-N-1}^{k-1}\phi_i^T\phi_i \le \sum_{i=k-N}^{k}P_i^{-1} \le (\lambda^{-N}+\cdots+1)P_k^{-1} = \frac{1-\lambda^{N+1}}{\lambda^N(1-\lambda)}P_k^{-1},$$

which proves (76). $\square$

The next two examples consider variable-direction forgetting in the case where $(\phi_k)_{k=0}^\infty$ is not persistently exciting. In these examples, $P_k$ is bounded, $z_k$ converges to zero, and $\theta_k$ converges, although not to the true value $\theta$.

Example 11: Variable-direction forgetting for a regressor lacking persistent excitation. Reconsider Example 10. Let $P_0 = I_{10}$, and let $P_k$ be given by (70), where $\varepsilon = 10^{-\ast}$. Figure 10 shows the information content $\|\mathrm{col}_i(\psi_k)\|$ and the singular values of $P_k^{-1}$ for several values of $\lambda$. Note that the information-rich subspace is six-dimensional due to the presence of three harmonics in $u_k$, as shown by six relatively large components of $\psi_k$, and that the singular values that correspond to the singular vectors not in the information-rich subspace do not converge to zero. $\diamond$

Example 12: Effect of variable-direction forgetting on $\theta_k$. Reconsider Example 9. Let $P_0 = I_{10}$, and let $P_k$ be given by (70), where $\varepsilon = 10^{-\ast}$. Figure 11 shows the predicted error $z_k$, the norm of the parameter error $\tilde\theta_k$, and the singular values and the condition number of $P_k$. Note that $\tilde\theta_k$ does not converge to zero and that, unlike uniform-direction forgetting, all of the singular values of $P_k$ remain bounded and $\theta_k$ is bounded. $\diamond$

Concluding Remarks

This tutorial article presented a self-contained exposition of uniform-direction and variable-direction forgetting within the context of RLS. It was shown that, in the case of persistent excitation without forgetting, the parameter estimates converge asymptotically, whereas, with forgetting, the parameter estimates converge geometrically. Numerical examples were presented to illustrate this behavior. In the case where forgetting is used but the excitation is not persistent, it was shown that $P_k$ diverges, leading to numerical instability. This phenomenon was traced to the divergence of the singular values of $P_k$ corresponding to singular vectors that are orthogonal to the information-rich subspace. In order to address this problem, a data-dependent forgetting matrix was constructed to restrict forgetting to the information-rich subspace. The RLS cost function that corresponds to this extension of RLS was presented. Numerical examples showed that this variable-direction forgetting technique prevents $P_k$ from diverging under lack of persistent excitation.

Since RLS is fundamentally least squares optimization, its estimates are not consistent in the case of sensor noise [37]. An open problem is thus to develop extensions of RLS that provide consistent parameter estimates in the presence of errors-in-variables noise arising in system identification problems [38].

Acknowledgments
This research was partially supported by AFOSR under DDDAS grant FA9550-16-1-0071.

References

[1] M. Grewal and K. Glover, "Identifiability of linear and nonlinear dynamical systems," IEEE Trans. Autom. Contr., vol. 21, no. 6, pp. 833-837, 1976.
[2] I. Y. Mareels and M. Gevers, "Persistence of excitation criteria," in Proc. Conf. Dec. Contr., 1986, pp. 1933-1935.
[3] I. M. Mareels, R. R. Bitmead, M. Gevers, C. R. Johnson, and R. L. Kosut, "How exciting can a signal really be?" Sys. Contr. Lett., vol. 8, no. 3, pp. 197-204, 1987.
[4] I. M. Mareels and M. Gevers, "Persistency of excitation criteria for linear, multivariable, time-varying systems," Mathematics of Control, Signals, and Systems, vol. 1, no. 3, 1988.
[5] B. D. O. Anderson, "Adaptive systems, lack of persistency of excitation and bursting phenomena," Automatica, vol. 21, no. 3, pp. 247-258, 1985.
[6] G. Chowdhary and E. Johnson, "Concurrent learning for convergence in adaptive control without persistency of excitation," in Proc. Conf. Dec. Contr., 2010, pp. 3674-3679.
[7] G. Chowdhary, M. Mühlegg, and E. Johnson, "Exponential parameter and tracking error convergence guarantees for adaptive controllers without persistency of excitation," Int. J. Contr., vol. 87, no. 8, pp. 1583-1603, 2014.
[8] S. Aranovskiy, A. Bobtsov, R. Ortega, and A. Pyrkin, "Performance enhancement of parameter estimators via dynamic regressor extension and mixing," IEEE Trans. Autom. Contr., vol. 62, no. 7, pp. 3546-3550, 2017.
[9] P. Panda, J. M. Allred, S. Ramanathan, and K. Roy, "Learning to forget with adaptive synaptic plasticity in spiking neural networks," J. Emerg. Selec. Top. Circ. Syst., vol. 8, no. 1, pp. 51-64, 2018.
[10] J. M. Allred and K. Roy, "Unsupervised incremental STDP learning using forced firing of dormant or idle neurons," in Proc. Int. Joint Conf. Neural Networks, July 2016.
[11] N. Frémaux and W. Gerstner, "Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules," Frontiers in Neural Circuits, vol. 9, pp. 85-103, 2016.
[12] B. Han, A. Ankit, A. Sengupta, and K. Roy, "Cross-layer design exploration for energy-quality tradeoffs in spiking and non-spiking deep artificial neural networks," IEEE Transactions on Multi-Scale Computing Systems, 2017.
[13] S. A. U. Islam and D. S. Bernstein, "Recursive least squares for real-time implementation," IEEE Contr. Sys. Mag., vol. 39, pp. 82-85, June 2019.
[14] L. Ljung, System Identification: Theory for the User, 2nd ed. Prentice Hall, 1999.
[15] A. H. Sayed, Fundamentals of Adaptive Filtering. Wiley, 2003.
[16] K. J. Åström and B. Wittenmark, Computer-Controlled Systems: Theory and Design, 3rd ed. Prentice-Hall, 1996.
[17] T. R. Fortescue, L. S. Kershenbaum, and B. E. Ydstie, "Implementation of self-tuning regulators with variable forgetting factors," Automatica, vol. 17, no. 6, pp. 831-835, 1981.
[18] C. Paleologu, J. Benesty, and C. Silviu, "A robust variable forgetting factor recursive least-squares algorithm for system identification," IEEE Sig. Proc. Lett., vol. 15, 2008.
[19] S. Leung and C. F. So, "Gradient-based variable forgetting factor RLS algorithm in time-varying environments," IEEE Trans. Sig. Proc., vol. 53, no. 8, pp. 3141-3150, 2005.
[20] S. Song, J.-S. Lim, S. J. Baek, and K.-M. Sung, "Gauss Newton variable forgetting factor recursive least squares for time varying parameter tracking," Electron. Lett., vol. 36, no. 11, pp. 988-990, 2000.
[21] D. J. Park, B. E. Jun, and J. H. Kim, "Fast tracking RLS algorithm using novel variable forgetting factor with unity zone," Electron. Lett., vol. 27, no. 23, pp. 2150-2151, 1991.
[22] A. A. Ali, J. B. Hoagg, M. Mossberg, and D. S. Bernstein, "On the stability and convergence of a sliding-window variable-regularization recursive-least-squares algorithm," Int. J. Adapt. Contr. Sig. Proc., vol. 30, pp. 715-735, 2016.
[23] R. M. Canetti and M. D. España, "Convergence analysis of the least-squares identification algorithm with a variable forgetting factor for time-varying linear systems," Automatica, vol. 25, no. 4, pp. 609-612, 1989.
[24] M. E. Salgado, G. C. Goodwin, and R. H. Middleton, "Modified least squares algorithm incorporating exponential resetting and forgetting," Int. J. Contr., vol. 47, no. 2, pp. 477-491, 1988.
[25] "... with covariance resetting," in IEE Proceedings D - Control Theory and Applications, vol. 130, no. 1, 1983, pp. 6-8.
[26] R. Kulhavý, "Restricted exponential forgetting in real-time identification," Automatica, vol. 23, no. 5, pp. 589-600, 1987.
[27] R. Kulhavý and M. Kárný, "Tracking of slowly varying parameters by directional forgetting," IFAC Proc. Vol., vol. 17, no. 2, pp. 687-692, 1984.
[28] G. Kreisselmeier, "Stabilized least-squares type adaptive identifiers," IEEE Trans. Autom. Contr., vol. 35, no. 3, pp. 306-310, 1990.
[29] L. Cao and H. Schwartz, "Directional forgetting algorithm based on the decomposition of the information matrix," Automatica, vol. 36, no. 11, pp. 1725-1731, 2000.
[30] G. Kubin, "Stabilization of the RLS algorithm in the absence of persistent excitation," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, 1988, pp. 1369-1372.
[31] S. Bittanti, P. Bolzern, and M. Campi, "Convergence and exponential convergence of identification algorithms with directional forgetting factor," Automatica, vol. 26, no. 5, pp. 929-932, 1990.
[32] ——, "Exponential convergence of a modified directional forgetting identification algorithm," Sys. Contr. Lett., vol. 14, no. 2, pp. 131-137, 1990.
[33] A. Goel and D. S. Bernstein, "A targeted forgetting factor for recursive least squares," in Proc. Conf. Dec. Contr., 2018, pp. 3899-3903.
[34] R. M. Johnstone, C. R. Johnson, R. R. Bitmead, and B. D. O. Anderson, "Exponential convergence of recursive least squares with exponential forgetting factor," in Proc. Conf. Dec. Contr., 1982, pp. 994-997.
[35] J. Benesty and T. Gänsler, "New insights into the RLS algorithm," Eurasip Jour. App. Sig. Proc., no. 3, pp. 331-339, 2004.
[36] W. M. Haddad and V. Chellaboina, Nonlinear Dynamical Systems and Control: A Lyapunov-Based Approach. Princeton University Press, 2008.
[37] P. Eykhoff, System Identification: Parameter and State Estimation. Wiley-Interscience, 1974.
[38] T. Söderström, Errors-in-Variables Methods in System Identification. Springer, 2018.

Sidebar: Summary

Learning depends on the ability to acquire and assimilate new information. This ability depends, somewhat counterintuitively, on the ability to forget. In particular, effective forgetting requires the ability to recognize and utilize new information in order to update a system model. This article is a tutorial on forgetting within the context of recursive least squares (RLS).
To do this, RLS is first presented in its classical form, which employs uniform-direction forgetting. Next, examples are given to motivate the need for variable-direction forgetting, especially in cases where the excitation is not persistent. Some of these results are well known, whereas others complement the prior literature. The goal is to provide a self-contained tutorial of the main ideas and techniques for students and researchers whose research may benefit from variable-direction forgetting.

Sidebar: Three Useful Lemmas

Lemma 1:
Let $X \in \mathbb{R}^{n\times p}$ and $y \in \mathbb{R}^n$, and let $W \in \mathbb{R}^{p\times p}$ be positive definite. Then,

$$(I_n + XWX^T)^{-1}y \in \mathcal{R}([X\ \ y]). \qquad (S1)$$

Proof:
Note that

$$\mathcal{R}([X\ \ y]) = \mathcal{R}\left([X\ \ y]\begin{bmatrix}I_p + WX^TX & WX^Ty\\ 0 & 1\end{bmatrix}\right) = \mathcal{R}\big([X(I_p + WX^TX)\ \ (I_n + XWX^T)y]\big) = \mathcal{R}\big([(I_n + XWX^T)X\ \ (I_n + XWX^T)y]\big) = (I_n + XWX^T)\,\mathcal{R}([X\ \ y]).$$

Since $y \in \mathcal{R}([X\ \ y])$, it follows that $(I_n + XWX^T)^{-1}y \in \mathcal{R}([X\ \ y])$, which implies (S1). $\square$

Lemma 2:
Let $A \in \mathbb{R}^{n\times n}$ be positive semidefinite, and let $\lambda > 0$. Then,

$$I_n - A(\lambda I_n + A)^{-1} > 0. \qquad (S2)$$

Proof:
Write $A = SDS^T$, where $D = \mathrm{diag}(d_1,\ldots,d_n)$ is diagonal and $S$ is unitary. For all $i \in \{1,\ldots,n\}$, $d_i \ge 0$, and thus $\frac{d_i}{\lambda + d_i} < 1$. Hence,

$$D(\lambda I_n + D)^{-1} = \mathrm{diag}\Big(\frac{d_1}{\lambda+d_1},\ldots,\frac{d_n}{\lambda+d_n}\Big) < I_n. \qquad (S3)$$

Pre-multiplying and post-multiplying (S3) by $S$ and $S^T$, respectively, yields (S2). $\square$

Lemma 3:
Let $A \in \mathbb{R}^{n\times n}$ be positive semidefinite, and let $\lambda > 0$. Then,

$$I_n - \tfrac{1}{\lambda}\big(A - A(\lambda I_n + A)^{-1}A\big) > 0. \qquad (S4)$$

Proof:
Write $A = SDS^T$, where $D = \mathrm{diag}(d_1,\ldots,d_n)$ is diagonal and $S$ is unitary. For all $i \in \{1,\ldots,n\}$, $d_i \ge 0$, and thus $\frac{d_i}{\lambda+d_i} < 1$. Hence,

$$\tfrac{1}{\lambda}\big(D - D(\lambda I_n + D)^{-1}D\big) = \mathrm{diag}\Big(\frac{d_1}{\lambda+d_1},\ldots,\frac{d_n}{\lambda+d_n}\Big) < I_n. \qquad (S5)$$

Pre-multiplying and post-multiplying (S5) by $S$ and $S^T$, respectively, yields (S4). $\square$

Sidebar: RLS as a One-Step Optimal Predictor

Consider the linear system

$$x_{k+1} = A_kx_k + B_ku_k + w_{1,k}, \qquad (S1)$$
$$y_k = C_kx_k + w_{2,k}, \qquad (S2)$$

where, for all $k \ge 0$, $x_k \in \mathbb{R}^n$, $u_k \in \mathbb{R}^m$, $y_k \in \mathbb{R}^p$, and $A_k$, $B_k$, $C_k$ are real matrices of appropriate sizes. The input $u_k$ and output $y_k$ are assumed to be measured. The process noise $w_{1,k} \in \mathbb{R}^n$ and the sensor noise $w_{2,k} \in \mathbb{R}^p$ are zero-mean white noise processes with variances $E[w_{1,k}w_{1,k}^T] = Q_k$ and $E[w_{2,k}w_{2,k}^T] = R_k$, respectively. The expected value of the initial state is assumed to be $\bar x_0$, and the variance of the initial state is $P_0$, that is, $E[x_0] = \bar x_0$ and $E[(x_0 - \bar x_0)(x_0 - \bar x_0)^T] = P_0$. The objective is to estimate the state $x_k$ given the measurements of $u_k$ and $y_k$. To estimate $x_k$, consider the estimator

$$\hat x_{k+1} = A_k\hat x_k + B_ku_k + K_k(y_k - C_k\hat x_k), \qquad (S3)$$

where $\hat x_k$ is the estimate of $x_k$ at step $k$ and $\hat x_0 = \bar x_0$. The matrix $K_k$ is constructed as follows. Define the state-estimate error $e_k \triangleq x_k - \hat x_k$ and the state-error covariance $P_k \triangleq E[e_ke_k^T] \in \mathbb{R}^{n\times n}$. Then, $e_k$ and $P_k$ satisfy

$$e_{k+1} = (A_k - K_kC_k)e_k + w_{1,k} - K_kw_{2,k}, \qquad (S4)$$
$$P_{k+1} = A_kP_kA_k^T + Q_k + K_k(R_k + C_kP_kC_k^T)K_k^T - A_kP_kC_k^TK_k^T - K_kC_kP_kA_k^T. \qquad (S5)$$

Proposition S1:
Sidebar: RLS as a One-Step Optimal Predictor

Consider the linear system
$$x_{k+1} = A_k x_k + B_k u_k + w_{1,k}, \qquad {\rm (S1)}$$
$$y_k = C_k x_k + w_{2,k}, \qquad {\rm (S2)}$$
where, for all $k \ge 0$, $x_k \in \mathbb{R}^n$, $u_k \in \mathbb{R}^m$, $y_k \in \mathbb{R}^p$, and $A_k$, $B_k$, $C_k$ are real matrices of appropriate sizes. The input $u_k$ and output $y_k$ are assumed to be measured. The process noise $w_{1,k} \in \mathbb{R}^n$ and the sensor noise $w_{2,k} \in \mathbb{R}^p$ are zero-mean white noise processes with variances $E[w_{1,k}w_{1,k}^{\rm T}] = Q_k$ and $E[w_{2,k}w_{2,k}^{\rm T}] = R_k$, respectively. The expected value of the initial state is assumed to be $\bar{x}_0$, and the variance of the initial state is $P_0$, that is, $E[x_0] = \bar{x}_0$ and $E[(x_0 - \bar{x}_0)(x_0 - \bar{x}_0)^{\rm T}] = P_0$. The objective is to estimate the state $x_k$ given the measurements of $u_k$ and $y_k$.

To estimate $x_k$, consider the estimator
$$\hat{x}_{k+1} = A_k\hat{x}_k + B_k u_k + K_k(y_k - C_k\hat{x}_k), \qquad {\rm (S3)}$$
where $\hat{x}_k$ is the estimate of $x_k$ at step $k$ and $\hat{x}_0 = \bar{x}_0$. The matrix $K_k$ is constructed as follows. Define the state-estimate error $e_k \triangleq x_k - \hat{x}_k$ and the state-error covariance $P_k \triangleq E[e_k e_k^{\rm T}] \in \mathbb{R}^{n \times n}$. Then, $e_k$ and $P_k$ satisfy
$$e_{k+1} = (A_k - K_kC_k)e_k + w_{1,k} - K_kw_{2,k}, \qquad {\rm (S4)}$$
$$P_{k+1} = A_kP_kA_k^{\rm T} + Q_k + K_k\left(R_k + C_kP_kC_k^{\rm T}\right)K_k^{\rm T} - A_kP_kC_k^{\rm T}K_k^{\rm T} - K_kC_kP_kA_k^{\rm T}. \qquad {\rm (S5)}$$

Proposition S1: Let $P_{k+1}$ be given by (S5). The matrix $K_k$ that minimizes ${\rm tr}\,P_{k+1}$ is given by
$$K_k = A_kP_kC_k^{\rm T}\left(R_k + C_kP_kC_k^{\rm T}\right)^{-1}, \qquad {\rm (S6)}$$
and the minimized state-error covariance $P_k$ is updated as
$$P_{k+1} = A_kP_kA_k^{\rm T} + Q_k - A_kP_kC_k^{\rm T}\left(R_k + C_kP_kC_k^{\rm T}\right)^{-1}C_kP_kA_k^{\rm T}. \qquad {\rm (S7)}$$

Proof: See [S1]. $\square$
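Proposition S1 translates directly into code. The following sketch is illustrative (the function name and calling convention are not from the article); it performs one step of the estimator (S3) with the optimal gain (S6) and the covariance update (S7).

```python
import numpy as np

def predictor_step(xhat, P, u, y, A, B, C, Q, R):
    """One step of the optimal predictor: (S3) with gain (S6) and covariance (S7)."""
    S = R + C @ P @ C.T                                # innovation covariance
    K = A @ P @ C.T @ np.linalg.inv(S)                 # optimal gain (S6)
    xhat_next = A @ xhat + B @ u + K @ (y - C @ xhat)  # estimator update (S3)
    P_next = A @ P @ A.T + Q - K @ C @ P @ A.T         # covariance update (S7), using (S6)
    return xhat_next, P_next
```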
Let $A_k = I_n$, $B_k = 0$, $C_k = \phi_k$, $Q_k = 0$, and $R_k = I_p$. Then,
$$\hat{x}_{k+1} = \hat{x}_k + P_k\phi_k^{\rm T}\left(I_p + \phi_kP_k\phi_k^{\rm T}\right)^{-1}(y_k - \phi_k\hat{x}_k), \qquad {\rm (S8)}$$
$$P_{k+1} = P_k - P_k\phi_k^{\rm T}\left(I_p + \phi_kP_k\phi_k^{\rm T}\right)^{-1}\phi_kP_k. \qquad {\rm (S9)}$$
Note that (6), (7) with $\lambda = 1$ have the same form as (S8), (S9). In particular, RLS without forgetting is the state estimator for the linear time-varying system with $A_k = I_n$, $B_k = 0$, $C_k = \phi_k$, $Q_k = 0$, and $R_k = I_p$.

References
[S1] S. A. U. Islam, A. Goel, and D. S. Bernstein, “Real-Time Implementation of the Optimal Predictor and Optimal Filter: Accuracy versus Latency,” IEEE Contr. Sys. Mag., to appear.

Sidebar: RLS as a Maximum Likelihood Estimator

Let $k \ge 0$ and, for all $i \in \{0, 1, \ldots, k\}$, consider the process
$$y_i = \phi_i\theta_{\rm true} + v_i, \qquad {\rm (S1)}$$
where $\theta_{\rm true} \in \mathbb{R}^n$ is the unknown parameter, $\phi_i \in \mathbb{R}^{p \times n}$ is the regressor matrix, $v_i \in \mathbb{R}^p$ is the measurement noise, and $y_i \in \mathbb{R}^p$ is the measurement. The goal is to estimate $\theta_{\rm true}$ using the data $(\phi_i)_{i=0}^k$ and $(y_i)_{i=0}^k$.

Let $\theta_{\rm true}$ be modeled by the $n$-dimensional, real-valued normal random variable $\Theta$ with mean $\theta_0 \in \mathbb{R}^n$ and covariance $(\lambda^{k+1}R)^{-1}$, where $\lambda \in (0,1]$ and $R \in \mathbb{R}^{n \times n}$ is positive definite. For $\theta \in \mathbb{R}^n$, the density of $\Theta$ is thus given by
$$f_\Theta(\theta) = \frac{1}{\sqrt{(2\pi)^n \det(\lambda^{k+1}R)^{-1}}}\, \exp\!\left[-\tfrac{1}{2}(\theta - \theta_0)^{\rm T}\lambda^{k+1}R\,(\theta - \theta_0)\right]. \qquad {\rm (S2)}$$
For all $i \in \{0, 1, \ldots, k\}$, assume that $v_i$ is a sample of the zero-mean, $p$-dimensional, real-valued normal random variable $V_i$ with covariance $\lambda^{i-k}I_p$. For $v_i \in \mathbb{R}^p$, the density of $V_i$ is thus given by
$$f_{V_i}(v_i) = \frac{1}{\sqrt{(2\pi)^p \det(\lambda^{i-k}I_p)}}\, \exp\!\left(-\tfrac{1}{2}v_i^{\rm T}\lambda^{k-i}I_p\,v_i\right). \qquad {\rm (S3)}$$
Assume that $V_0, V_1, \ldots, V_k$ are independent.

Since $\theta_{\rm true}$ and $v_i$ are modeled as normal random variables, it follows from (S1) that $y_i$ is a sample of the $p$-dimensional, real-valued normal random variable $Y_i = \phi_i\theta_{\rm true} + V_i$. Note that, since $V_0, V_1, \ldots, V_k$ are independent, it follows that $Y_0, Y_1, \ldots, Y_k$ are independent. Using (S1) and (S3), it thus follows that
$$f_{Y_i|\theta}(y_i) = \frac{1}{\sqrt{(2\pi)^p \det(\lambda^{i-k}I_p)}}\, \exp\!\left[-\tfrac{1}{2}(y_i - \phi_i\theta)^{\rm T}\lambda^{k-i}I_p\,(y_i - \phi_i\theta)\right], \qquad {\rm (S4)}$$
where $f_{Y_i|\theta}(y_i)$ is the density of the random variable $Y_i$ conditioned on $\Theta$ taking the value $\theta$. It follows from Bayes’ rule [S1, p. 413] that
$$f_{\Theta|y_0,\ldots,y_k}(\theta) = \alpha^{-1} f_\Theta(\theta)\prod_{i=0}^k f_{Y_i|\theta}(y_i), \qquad {\rm (S5)}$$
where
$$\alpha \triangleq \int_{\mathbb{R}^n} f_\Theta(\theta)\prod_{i=0}^k f_{Y_i|\theta}(y_i)\, d\theta. \qquad {\rm (S6)}$$
Substituting (S2) and (S4) into (S5), it follows that
$$f_{\Theta|y_0,\ldots,y_k}(\theta) = \beta \exp\!\left[-\tfrac{1}{2}\sum_{i=0}^k \lambda^{k-i}(y_i - \phi_i\theta)^{\rm T}(y_i - \phi_i\theta) - \tfrac{1}{2}\lambda^{k+1}(\theta - \theta_0)^{\rm T}R(\theta - \theta_0)\right], \qquad {\rm (S7)}$$
where
$$\beta \triangleq \frac{1}{\alpha\,\sqrt{(2\pi)^n \det(\lambda^{k+1}R)^{-1}}\,\prod_{i=0}^k\sqrt{(2\pi)^p \det(\lambda^{i-k}I_p)}}. \qquad {\rm (S8)}$$
Finally, the maximum likelihood estimate of $\theta_{\rm true}$ is given by the maximizer of (S7), that is,
$$\theta_{\rm ML} = \arg\max_{\theta\in\mathbb{R}^n} f_{\Theta|y_0,\ldots,y_k}(\theta). \qquad {\rm (S9)}$$
In fact, $\theta_{\rm ML} = \arg\min_{\theta\in\mathbb{R}^n} J_k(\theta)$, where $J_k(\theta)$ is given by (2). Therefore, RLS with forgetting can be interpreted as the maximum likelihood estimator of the random variable $\Theta$.

References
[S1] D. P. Bertsekas and J. N. Tsitsiklis, Introduction to Probability, 2nd ed. Athena Scientific, 2008.
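The equivalence stated at the end of this sidebar can be checked numerically. The following sketch is illustrative (synthetic noise-free data, arbitrary dimensions and seed): it runs the standard constant-forgetting RLS recursion, which the article denotes (6), (7), while also accumulating the exponentially weighted normal equations of the cost that $\theta_{\rm ML}$ minimizes, and confirms that the two estimates coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 3, 1, 0.9
R = np.eye(n)
theta0 = np.zeros(n)
theta_true = rng.standard_normal(n)

theta, P = theta0.copy(), np.linalg.inv(R)
A, b = R.copy(), R @ theta0  # weighted normal equations: A_{-1} = R, b_{-1} = R theta0
for i in range(50):
    phi = rng.standard_normal((p, n))
    y = phi @ theta_true
    # standard RLS with constant forgetting factor lam
    G = P @ phi.T @ np.linalg.inv(lam * np.eye(p) + phi @ P @ phi.T)
    theta = theta + G @ (y - phi @ theta)
    P = (P - G @ phi @ P) / lam
    # exponentially weighted batch normal equations of the same cost
    A = lam * A + phi.T @ phi
    b = lam * b + phi.T @ y
print(np.linalg.norm(theta - np.linalg.solve(A, b)))  # ~1e-15: same minimizer
```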
Sidebar: Toward Matrix Forgetting

In [S1], $P_k^{-1}$ is updated by
$$P_{k+1}^{-1} = (I_n + M_kP_k)P_k^{-1} + \phi_k^{\rm T}\phi_k, \qquad {\rm (S1)}$$
where $M_k \in \mathbb{R}^{n \times n}$ is chosen to guarantee asymptotic stability and boundedness. Two choices of the matrix $M_k$ are considered. In the first case,
$$M_k \triangleq -(1-\lambda)(I_n - \alpha P_k)^NP_k^{-1}, \qquad {\rm (S2)}$$
where $\lambda \in (0,1]$, $\alpha > 0$, and $N$ is an odd, positive integer. In the second case,
$$M_k = -(1-\lambda)(P_k^{-1} - \alpha I_n)^N(P_k^{-1} + \beta I_n)^{-N}P_k^{-1}, \qquad {\rm (S3)}$$
where $\lambda \in (0,1]$, $\alpha > 0$, $\beta \ge 0$, and $N$ is an odd, positive integer. Note that RLS with constant forgetting is obtained by setting $M_k = (\lambda - 1)P_k^{-1}$ in (S1).
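To make the update concrete, the following sketch implements one step of (S1) with the first choice (S2) of $M_k$; the function name and the default values of lam, alpha, and N are illustrative assumptions, not part of [S1].

```python
import numpy as np

def kreisselmeier_step(P, phi, lam=0.95, alpha=1.0, N=1):
    """One step of (S1) with M_k given by (S2)."""
    n = P.shape[0]
    Pinv = np.linalg.inv(P)
    M = -(1.0 - lam) * np.linalg.matrix_power(np.eye(n) - alpha * P, N) @ Pinv  # (S2)
    Pinv_next = (np.eye(n) + M @ P) @ Pinv + phi.T @ phi                        # (S1)
    return np.linalg.inv(Pinv_next)
```

For $N = 1$, the update simplifies to $P_{k+1}^{-1} = \lambda P_k^{-1} + (1-\lambda)\alpha I_n + \phi_k^{\rm T}\phi_k$, which makes the pull of $P_k^{-1}$ toward $\alpha I_n$ explicit.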
Proposition S1: Consider (S1) with (S2) or (S3). Let $P_0$ be symmetric and nonsingular. Then, the following statements hold:
i) For all $k \ge 0$, $P_k$ is symmetric and nonsingular.
ii) If $P_0^{-1} \ge \alpha I_n$, then $P_k^{-1} = \alpha I_n$ is an asymptotically stable equilibrium of (S1).
iii) If $P_0^{-1} \ge \alpha I_n$, then, for all $k \ge 0$, $P_k^{-1} \ge \alpha I_n$.
iv) If $P_0^{-1} \ge \alpha I_n$ and, for all $k \ge 0$, $\phi_k$ is bounded, then $P_k^{-1}$ is bounded.
v) If $P_0^{-1} \ge \alpha I_n$ and $\phi_k$ is persistently exciting, then there exists $k_0 > 0$ such that, for all $k \ge k_0$, $P_k^{-1} > \alpha I_n$.
Proof: See [28]. $\square$
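The stabilization described in statements ii)-iv) can be seen in a few lines. The following sketch is illustrative (arbitrary dimensions and parameter values, $N = 1$, and $\phi_k = 0$ to model a total loss of excitation): under (S1)-(S2), $P_k^{-1}$ converges to $\alpha I_n$ instead of decaying to zero, so $P_k$ remains bounded.

```python
import numpy as np

n, lam, alpha = 2, 0.9, 1.0
Pinv = 5.0 * np.eye(n)  # P_0^{-1} >= alpha * I_n
for k in range(200):
    P = np.linalg.inv(Pinv)
    M = -(1.0 - lam) * (np.eye(n) - alpha * P) @ Pinv  # (S2) with N = 1
    Pinv = (np.eye(n) + M @ P) @ Pinv                  # (S1) with phi_k = 0
print(Pinv)  # approaches alpha * I_n; constant forgetting would instead decay to zero
```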
The main goal of (S1) is stabilization of $P_k$ in the case where $(\phi_k)_{k=0}^\infty$ is not persistently exciting. Proposition S1 implies that $P_k$ remains bounded whether or not $(\phi_k)_{k=0}^\infty$ is persistent. However, (S1) is not designed to implement forgetting. Furthermore, note that (S1) requires the computation of the inverse of an $n \times n$ matrix at each step.

An alternative directional forgetting scheme given in [S2] considers the update
$$P_{k+1}^{-1} = M_kP_k^{-1} + \phi_k^{\rm T}\phi_k, \qquad {\rm (S4)}$$
where $M_k \in \mathbb{R}^{n \times n}$ is designed to apply forgetting to a specific subspace. In the case of a scalar measurement, that is, $p = 1$, $P_k^{-1}$ is decomposed as
$$P_k^{-1} = P_{1,k}^{-1} + P_{2,k}^{-1}, \qquad {\rm (S5)}$$
where $P_{1,k}^{-1}$ is chosen such that $P_{1,k}^{-1}\phi_k^{\rm T} = 0$, that is, $\phi_k^{\rm T}$ is in the null space of $P_{1,k}^{-1}$. Next, forgetting is restricted to $P_{2,k}^{-1}$, that is,
$$P_{k+1}^{-1} = P_{1,k}^{-1} + \lambda P_{2,k}^{-1} + \phi_k^{\rm T}\phi_k. \qquad {\rm (S6)}$$
The matrix $P_{2,k}^{-1}$ is chosen to be positive semidefinite with rank 1 by using
$$P_{2,k}^{-1} \triangleq P_k^{-1}\phi_k^{\rm T}\left(\phi_kP_k^{-1}\phi_k^{\rm T}\right)^{-1}\phi_kP_k^{-1}, \qquad {\rm (S7)}$$
and thus $P_{1,k}^{-1} = P_k^{-1} - P_{2,k}^{-1}$. Finally, it follows from (S4), (S6), and (S7) that
$$M_k = I_n - (1-\lambda)\left(\phi_kP_k^{-1}\phi_k^{\rm T}\right)^{-1}P_k^{-1}\phi_k^{\rm T}\phi_k \qquad {\rm (S8)}$$
and $P_{k+1}$ is computed as
$$\bar{P}_k = \begin{cases} P_k + \dfrac{1-\lambda}{\lambda}\left(\phi_kP_k^{-1}\phi_k^{\rm T}\right)^{-1}\phi_k^{\rm T}\phi_k, & \phi_k \ne 0, \\ P_k, & \phi_k = 0, \end{cases} \qquad {\rm (S9)}$$
$$P_{k+1} = \bar{P}_k - \bar{P}_k\phi_k^{\rm T}\left(1 + \phi_k\bar{P}_k\phi_k^{\rm T}\right)^{-1}\phi_k\bar{P}_k. \qquad {\rm (S10)}$$
It is shown in [S2] that, if $P_k^{-1}$ is positive definite, then, for all $\lambda \in (0,1]$, $M_kP_k^{-1}$ is positive definite. Furthermore, if, for all $k \ge 0$, $\phi_k$ is bounded, then there exists $\beta > 0$ such that, for all $k \ge 0$, $P_k < \beta I_n$.
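A minimal implementation of (S5)-(S10) may clarify the structure of the scheme. The following sketch is illustrative: the function name is an assumption, and the parameter update in the last line is the usual RLS estimate update, which is not reproduced in this sidebar.

```python
import numpy as np

def directional_forgetting_step(P, phi, y, theta, lam=0.95):
    """One step of (S9)-(S10) for p = 1: forget only along the excited direction."""
    if np.allclose(phi, 0.0):
        Pbar = P  # phi_k = 0: no new information, so nothing is forgotten, (S9)
    else:
        Pinv = np.linalg.inv(P)
        s = (phi @ Pinv @ phi.T).item()  # phi_k P_k^{-1} phi_k^T, a scalar since p = 1
        Pbar = P + (1.0 - lam) / (lam * s) * (phi.T @ phi)  # (S9)
    denom = 1.0 + (phi @ Pbar @ phi.T).item()
    P_next = Pbar - (Pbar @ phi.T) @ (phi @ Pbar) / denom    # (S10)
    theta_next = theta + P_next @ phi.T @ (y - phi @ theta)  # usual RLS estimate update
    return P_next, theta_next
```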
References
[S1] G. Kreisselmeier, “Stabilized least-squares type adaptive identifiers,” IEEE Trans. Autom. Contr., vol. 35, no. 3, pp. 306–310, 1990.
[S2] L. Cao and H. Schwartz, “Directional forgetting algorithm based on the decomposition of the information matrix,” Automatica, vol. 36, no. 11, pp. 1725–1731, 2000.
Sidebar: A Cost Function for Variable-Direction RLS

Theorem S1: For all $k \ge 0$, let $\phi_k \in \mathbb{R}^{p \times n}$ and $y_k \in \mathbb{R}^p$. Furthermore, let $R \in \mathbb{R}^{n \times n}$ be positive definite, let $\lambda \in (0,1]$, and, for all $k \ge 0$, let $P_k$ be given by
$$P_{k+1}^{-1} = \Lambda_kP_k^{-1}\Lambda_k + \phi_k^{\rm T}\phi_k, \qquad {\rm (S1)}$$
where $P_0 \triangleq R^{-1}$ and $\Lambda_k$ is given by (68). In addition, let $\theta_0 \in \mathbb{R}^n$, and define
$$J_k(\hat\theta) \triangleq \sum_{i=0}^k (y_i - \phi_i\hat\theta)^{\rm T}(y_i - \phi_i\hat\theta) + (\hat\theta - \theta_0)^{\rm T}R_k(\hat\theta - \theta_0), \qquad {\rm (S2)}$$
where, for all $k \ge 0$,
$$R_k = R_{k-1} + \Lambda_kP_k^{-1}\Lambda_k - P_k^{-1}, \qquad {\rm (S3)}$$
where $R_{-1} \triangleq R$. Then, for all $k \ge 0$, (S2) has a unique global minimizer
$$\theta_{k+1} = \arg\min_{\hat\theta\in\mathbb{R}^n} J_k(\hat\theta), \qquad {\rm (S4)}$$
which is given by
$$\theta_{k+1} = \theta_k + P_{k+1}\phi_k^{\rm T}(y_k - \phi_k\theta_k) + P_{k+1}(R_k - R_{k-1})(\theta_0 - \theta_k). \qquad {\rm (S5)}$$
Proof: Note that, for all $k \ge 0$, $J_k(\hat\theta) = \hat\theta^{\rm T}A_k\hat\theta + 2\hat\theta^{\rm T}b_k + c_k$, where
$$A_k \triangleq \sum_{i=0}^k \phi_i^{\rm T}\phi_i + R_k, \qquad {\rm (S6)}$$
$$b_k \triangleq -\sum_{i=0}^k \phi_i^{\rm T}y_i - R_k\theta_0, \qquad {\rm (S7)}$$
$$c_k \triangleq \sum_{i=0}^k y_i^{\rm T}y_i + \theta_0^{\rm T}R_k\theta_0.$$
Using (S3), (S6), and (S7), it follows that, for all $k \ge 0$,
$$A_k = A_{k-1} + \Lambda_kP_k^{-1}\Lambda_k - P_k^{-1} + \phi_k^{\rm T}\phi_k, \qquad {\rm (S8)}$$
$$b_k = b_{k-1} - \phi_k^{\rm T}y_k - (R_k - R_{k-1})\theta_0, \qquad {\rm (S9)}$$
where $A_{-1} \triangleq R$ and $b_{-1} \triangleq -R\theta_0$. Using (S1) and (S8), it follows that, for all $k \ge 0$,
$$A_k - P_{k+1}^{-1} = A_{k-1} - P_k^{-1} = \cdots = A_{-1} - P_0^{-1} = 0.$$
It follows from (65) that, for all $k \ge 0$, $P_{k+1}^{-1}$ is positive definite, and thus $A_k$ is positive definite. Furthermore, for all $k \ge 0$, $A_k$ is given by $A_k = \Lambda_kA_{k-1}\Lambda_k + \phi_k^{\rm T}\phi_k$. Finally, since $A_k$ is positive definite, it follows from Lemma 1 in [S1] that
$$\begin{aligned}
\theta_{k+1} &= -A_k^{-1}b_k \\
&= -A_k^{-1}\left(b_{k-1} - \phi_k^{\rm T}y_k - (R_k - R_{k-1})\theta_0\right) \\
&= -A_k^{-1}\left(-A_{k-1}\theta_k - \phi_k^{\rm T}y_k - (R_k - R_{k-1})\theta_0\right) \\
&= A_k^{-1}\left((A_k - R_k + R_{k-1} - \phi_k^{\rm T}\phi_k)\theta_k + \phi_k^{\rm T}y_k + (R_k - R_{k-1})\theta_0\right) \\
&= A_k^{-1}\left(A_k\theta_k + \phi_k^{\rm T}(y_k - \phi_k\theta_k) + (R_k - R_{k-1})(\theta_0 - \theta_k)\right) \\
&= \theta_k + A_k^{-1}\phi_k^{\rm T}(y_k - \phi_k\theta_k) + A_k^{-1}(R_k - R_{k-1})(\theta_0 - \theta_k) \\
&= \theta_k + P_{k+1}\phi_k^{\rm T}(y_k - \phi_k\theta_k) + P_{k+1}(R_k - R_{k-1})(\theta_0 - \theta_k),
\end{aligned}$$
where the third equality uses $b_{k-1} = -A_{k-1}\theta_k$ and the fourth uses (S8). Hence, (S5) is satisfied. $\square$
Using $R_k - R_{k-1} = \Lambda_kA_{k-1}\Lambda_k - A_{k-1}$, it follows that (S5) can be implemented without computing $P_k^{-1}$.

References
[S1] S. A. U. Islam and D. S. Bernstein, “Recursive Least Squares for Real-Time Implementation,” IEEE Contr. Sys. Mag., vol. 39, pp. 82–85, June 2019.
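Following the remark above, the update can be implemented by propagating $A_k = P_{k+1}^{-1}$ directly. The following sketch is illustrative: $\Lambda_k$ is passed in as an argument because its construction (68) lies outside this sidebar, and the function name is an assumption.

```python
import numpy as np

def vdf_step(A_prev, theta, theta0, phi, y, Lam):
    """One step of Theorem S1 using A_k = Lambda_k A_{k-1} Lambda_k + phi_k^T phi_k."""
    dR = Lam @ A_prev @ Lam - A_prev              # R_k - R_{k-1}
    A = Lam @ A_prev @ Lam + phi.T @ phi          # A_k, which equals P_{k+1}^{-1}
    rhs = phi.T @ (y - phi @ theta) + dR @ (theta0 - theta)
    theta_next = theta + np.linalg.solve(A, rhs)  # estimate update (S5)
    return A, theta_next
```

Note that, for $\Lambda_k = I_n$, the term $R_k - R_{k-1}$ vanishes and the recursion reduces to RLS without forgetting.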
Figure 1: Example 2. Persistent excitation and bounds on $P_k^{-1}$. a) and b) show the singular values of $F_{j,j+N}$ for $N = 2$ and $N = 10$, where $\alpha$ and $\beta$ are chosen to satisfy (19). Since $u_k$ is periodic, it follows that, for all $j \ge 0$, the lower and upper bounds (19) for $F_{j,j+N}$ are satisfied. Hence, $(\phi_k)_{k=0}^\infty$ is persistently exciting. c) shows the singular values of $P_k^{-1}$, with corresponding bounds given by (25) for a fixed $\lambda \in (0,1)$. Note that $\alpha$ and $\beta$ are larger for $N = 10$ than for $N = 2$, as expected.

Figure 2: Example 3. Lack of persistent excitation and bounds on $P_k^{-1}$. a) shows the singular values of $F_{j,j+2}$. Note that the smaller singular value of $F_{j,j+2}$ reaches zero in machine precision, and thus $\alpha > 0$ satisfying (19) does not exist. Hence, $\phi_k$ is not persistently exciting. The upper bound $\beta$ shown by the dashed line is chosen to satisfy (19). b) and c) show the singular values of $P_k^{-1}$ for $\lambda = 1$ and for a value of $\lambda \in (0,1)$, respectively. Note that, if $\lambda = 1$, then one of the singular values of $P_k^{-1}$ diverges, whereas, if $\lambda \in (0,1)$, then one of the singular values of $P_k^{-1}$ converges to zero.

Figure 3: Example 4. Convergence of $z_k$ and $\theta_k$. a) and b) show the singular values of $F_{j,j+10}$ for two choices of $u_k$. Note that the singular value of $F_{j,j+10}$ that is close to machine precision ($\approx 10^{-16}$) is essentially zero. Definition 1 thus implies that $(\phi_k)_{k=0}^\infty$ is not persistently exciting. c) and d) show the predicted error $z_k$ for both cases. Note that $z_k$ converges to zero in both cases. Finally, e) and f) show the parameter estimate $\theta_k$ for both cases. Note that, for both choices of input $u_k$, $\theta_k$ converges, but to different parameter values.

Figure 4: Example 5. Using the condition number of $P_k$ to evaluate persistency. a) shows the singular values of $F_{j,j+20}$, where the singular values of $F_{j,j+20}$ close to machine precision ($\approx 10^{-16}$) are essentially zero, thus implying that $(\phi_k)_{k=0}^\infty$ is not persistently exciting. b) and c) show the singular values and the condition number of $P_k$ for $\lambda = 1$. Note that the six singular values of $P_k$ decrease due to the presence of three harmonics in $u_k$. d) and e) show the singular values and the condition number of $P_k$ for a value of $\lambda \in (0,1)$. Note that the six singular values of $P_k$ remain bounded due to the presence of three harmonics in $u_k$. However, $P_k$ becomes ill-conditioned due to the lack of persistent excitation.

Figure 5: Example 6. Effect of $\lambda$ on the rate of convergence of $\theta_k$. a)-f) show the parameter error norm $\|\tilde\theta_k\|$ for several values of $P_0$ and $\lambda$. Note that the slope of $-1$ between $\log\|\tilde\theta_k\|$ and $\log k$ in d) is consistent with the fact that the rate of convergence of $\|\tilde\theta_k\|$ is $O(1/k)$ for $\lambda = 1$. Similarly, the slope of $\log\lambda$ between $\log\|\tilde\theta_k\|$ and $k$ in b) and c) is consistent with the fact that the rate of convergence of $\|\tilde\theta_k\|$ is $O(\lambda^k)$ for $\lambda \in (0,1)$. g), h), and i) show the condition number of the corresponding $P_k$ for several values of $P_0$ and $\lambda$. Note that, as $\lambda$ is decreased, the convergence rate of $\theta_k$ increases; however, the condition number of $P_k$ degrades, and the effect of $P_0$ is reduced.

Figure 6: Example 8. Subspace-constrained regressor. The first component of each vector is plotted along the horizontal axis, and the second component is plotted along the vertical axis. The singular values $\sigma_i(P_k)$ are shown with the corresponding singular vectors $u_{P,i}$. All regressors $\phi_k$ lie along the same one-dimensional subspace, and thus $(\phi_k)_{k=0}^\infty$ is not persistently exciting. Consequently, each estimate $\theta_k$ of $\theta$ lies in this subspace. The color gradient from yellow to blue of $\theta_k$ and $\tilde\theta_k$ shows the evolution from $k = 1$ to $k = 1000$. In a), where $\lambda = 1$, the singular value corresponding to the cyan singular vector decreases to zero, whereas the singular value corresponding to the magenta singular vector is bounded; note that $\tilde\theta_k$ converges along the singular vector corresponding to the bounded singular value. In b), where $\lambda \in (0,1)$, the singular value corresponding to the cyan singular vector is bounded, whereas the singular value corresponding to the magenta singular vector diverges; note that $\tilde\theta_k$ converges along the singular vector corresponding to the diverging singular value.

Figure 7: Example 9. Effect of lack of persistent excitation on $\theta_k$. a) shows the predicted error $z_k$, b) shows the norm of the parameter error $\tilde\theta_k$, c) shows the singular values of $P_k$, and d) shows the condition number of $P_k$. Note that six singular values of $P_k$ remain bounded due to the presence of three harmonics in the regressor. Due to finite-precision arithmetic, the computation becomes erroneous as $P_k$ becomes numerically ill-conditioned, and thus the estimate $\theta_k$ diverges.

Figure 8: Illustrative example of the information-rich subspace. Let $u_1$, $u_2$, and $u_3$ be the information directions (shown in blue). The regressor $\phi_1$ (shown in red) has new information along all three information directions, as shown by the nonzero values $\psi_{1,1}$, $\psi_{2,1}$, and $\psi_{3,1}$; the information-rich subspace is thus $\mathcal{R}([u_1 \; u_2 \; u_3])$. On the other hand, the regressor $\phi_2$ (shown in green) has new information only along $u_1$ and $u_2$, as shown by the nonzero values $\psi_{1,2}$ and $\psi_{2,2}$; the information-rich subspace is thus $\mathcal{R}([u_1 \; u_2])$.

Figure 9: Example 10. Relation between $P_k$ and the information content $\psi_k$. a), b), and c) show the information content ${\rm col}_i(\psi_k)$ for several values of $\lambda$. Note that, in each case, the information-rich subspace is six-dimensional due to the presence of three harmonics in $u_k$. d), e), and f) show the singular values of $P_k^{-1}$ for several values of $\lambda$. The inverse of the condition number of $P_k$ is shown in black. Note that, for $\lambda < 1$, the singular values of $P_k^{-1}$ corresponding to the singular vectors in the orthogonal complement of the information-rich subspace converge to zero.

Figure 10: Example 11. Variable-direction forgetting for a regressor lacking persistent excitation. a) and b) show the information content $\|\psi_k\|$ for two values of $\lambda \in (0,1)$. c) and d) show the singular values of $P_k^{-1}$ for the same two values of $\lambda$. The inverse of the condition number of $P_k$ is shown in black. Note that, for $\lambda < 1$, the singular values that correspond to the singular vectors not in the information-rich subspace do not converge to zero.

Figure 11: Example 12. Effect of variable-direction forgetting on $\theta_k$. a) shows the predicted error $z_k$, b) shows the norm of the parameter error $\tilde\theta_k$, c) shows the singular values of $P_k$, and d) shows the condition number of $P_k$. Note that all of the singular values of $P_k$ remain bounded.