Consistency Results for Stationary Autoregressive Processes with Constrained Coefficients
Alessio Sancetta ∗ June 9, 2017
Abstract
We consider stationary autoregressive processes with coefficients restricted to an ellipsoid, which includes autoregressive processes with absolutely summable coefficients. We provide consistency results under different norms for the estimation of such processes using constrained and penalized estimators. As an application we show some weak form of universal consistency. Simulations show that directly including the constraint in the estimation can lead to more robust results.
Key Words: consistency, empirical process, ridge regression, reproducing kernel Hilbert space, universal consistency.
1 Introduction

It is common to impose constraints on the decay rate of the autoregressive coefficients in order to derive results amenable to estimation for the purpose of prediction. At a minimum, these constraints tend to require that the AR coefficients are absolutely summable. A natural approach when dealing with high order autoregressive models is then to consider sieve estimation. Sieve estimation of infinite AR models has been considered by various authors. For universal consistency, Schäfer (2002) derived perhaps the strongest result possible. Györfi and Sancetta (2015) review some of these results. For convergence in probability, various authors have considered infinite AR models and imposed restrictions on the $\ell_1$ norm of the coefficients in order to derive asymptotic results. The conditions essentially require the autoregressive coefficients to be absolutely summable.

We shall see that the vector of autoregressive coefficients can be seen as an element in a Reproducing Kernel Hilbert Space (RKHS) when $\ell_2$ is equipped with a suitable inner product. This allows us to exploit all the existing machinery for estimation in RKHS and build on it (see Steinwart and Christmann, 2008, for a comprehensive review). The main ingredient is penalized least squares estimation. We also consider the constrained least squares problem. Penalized and constrained estimation are dual problems for specific values of the penalty coefficient. Our result establishes the relation between the two problems and the consistency rates. In general, they can lead to different consistency results under different norms. One norm is the usual Euclidean norm of the vector of coefficients, while the other is the norm of the RKHS. We show that consistency under the latter has important implications for prediction problems.

In general, unlike existing results, we are able to establish consistency as both the autoregressive order and the sample size go to infinity with no constraint on the rates. Existing results use the machinery of the method of sieves, hence they require the autoregressive order to go to infinity in a controlled way. As already mentioned, we are able to avoid this restriction because the ellipsoid is compact under the Euclidean norm.

The plan for the paper is as follows. Section 2 reviews the estimation method and presents the consistency results. A numerical example is provided in Section 3. Section 4 mentions extensions to other processes such as vector autoregressive processes (VAR). The proof of the consistency results is long and is given in Section 5.

2 The Estimation Method and Consistency Results

We restrict attention to the infinite order autoregressive process
$$Y_t = \sum_{k=1}^{\infty}\varphi_k Y_{t-k} + \varepsilon_t \quad (1)$$
for some mean zero independent identically distributed (i.i.d.) sequence $(\varepsilon_t)_{t\in\mathbb{Z}}$ and unknown coefficients $\varphi_k$. This paper considers estimators of the above under the condition that $\sum_{k=1}^{\infty}|\varphi_k| \le \bar\varphi < \infty$.

In a finite sample, the above model can only be approximated by the finite dimensional model
$$Y_t = \sum_{k=1}^{K} b_k Y_{t-k} + \varepsilon_t$$
with $K\to\infty$. While this is essentially a sieve, we do not necessarily require $K$ to be of smaller order than the sample size. Here, we restrict the coefficients to an ellipsoid, defined as follows. Let the $\lambda_k$'s be positive constants such that $\lambda_k \asymp k^{2\lambda}$ for $\lambda > 0$, where $\asymp$ means that the left hand side (l.h.s.) and the right hand side (r.h.s.) are proportional. Define the ellipsoid as
$$\mathcal{E}_K(B) := \Big\{ b \in \mathbb{R}^{\infty} :\ \sum_{k=1}^{\infty} b_k^2 \lambda_k \le B^2,\ b_k = 0 \text{ for } k > K \Big\}. \quad (2)$$
Given that the $\lambda_k$'s are increasing, the $b_k$'s need to be smaller in absolute value as $k$ increases.
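To make the definition concrete, here is a quick example we add for illustration: with $\lambda_k \asymp k^{2\lambda}$, any sequence decaying slightly faster than $k^{-(2\lambda+1)/2}$ lies in the ellipsoid, e.g.
$$b_k = k^{-(\lambda+1)} \;\Longrightarrow\; \sum_{k=1}^{\infty} b_k^2\lambda_k \asymp \sum_{k=1}^{\infty} k^{2\lambda}k^{-2(\lambda+1)} = \sum_{k=1}^{\infty} k^{-2} < \infty.$$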
Write $\mathcal{E}(B) = \bigcup_{K>0}\mathcal{E}_K(B)$ for the ellipsoid where all coefficients can be non-zero, $\mathcal{E}_K = \bigcup_{B<\infty}\mathcal{E}_K(B)$ and $\mathcal{E} = \bigcup_{B<\infty}\mathcal{E}(B)$, so that, for example, $\mathcal{E} = \{b\in\mathbb{R}^{\infty} : \sum_{k=1}^{\infty} b_k^2\lambda_k < \infty\}$ is the ellipsoid that is restricted to have finite but decreasing principal axes. The following condition will be imposed on the ellipsoid.

Condition 1.
The sequence $(Y_t)_{t\in\mathbb{Z}}$ follows the process (1) with $\varphi\in\mathcal{E}$ and $\lambda_k\asymp k^{2\lambda}$, where $\lambda > 1/2$. Moreover, $1 - \sum_{k=1}^{\infty}\varphi_k z^k = 0$ only for $z$ outside the unit circle. The innovations $(\varepsilon_t)_{t\in\mathbb{Z}}$ are independent identically distributed with finite fourth moment.

Throughout, when writing $\mathcal{E}_K(B)$ and similar quantities, it is understood that the $\lambda_k$'s are as in Condition 1. The following is stated for convenience.

Lemma 1. If $b\in\mathcal{E}(B)$ then $|b_k| \lesssim k^{-(2\lambda+1)/2}/\ln^{\epsilon}(1+k)$ for some $\epsilon > 0$, where $\lesssim$ is inequality up to a fixed absolute multiplicative constant.

In consequence, Condition 1 implies absolutely summable autoregressive coefficients. Note that absolute summability would just require $\lambda \ge 1/2$ in Condition 1 rather than $\lambda > 1/2$, hence the condition we use is a bit more restrictive. The following states additional properties of the model.

Lemma 2. Under Condition 1, $(Y_t)_{t\in\mathbb{Z}}$ is stationary and ergodic with absolutely summable autocovariance function and $EY_t^4 < \infty$.

It is well known that for the AR process, $1 - \sum_{k=1}^{\infty}\varphi_k z^k = 0$ only for $z$ outside the unit circle if the autocovariance function is absolutely summable and the spectral density is strictly positive and continuous (Kreiss et al., 2011, Corollary 2.1). Note that there are processes (even Gaussian) that satisfy Condition 1, but fail to be beta mixing (Doukhan, 1995, Theorem 3, p. 59). The beta mixing assumption is often conveniently used when proving convergence using methods from empirical process theory. Alas, it cannot be used here.

The goal is to find an estimator for $\varphi$. We consider two approaches: constrained least squares and penalized least squares. By duality, the two can be made equivalent by a suitable choice of the penalty parameter. However, in the constrained case, the penalty turns out to be sample dependent, while in penalized estimation this is not necessarily the case.

To avoid notational trivialities, suppose that the sample size is $N = n + K$. This will be assumed without further notice throughout the paper. In particular, our sample is $Y_{-(K-1)}, Y_{-(K-2)}, \dots, Y_0, Y_1, \dots, Y_n$. This also stresses the fact that $n$ and $K$ can go to infinity at different rates.

In the constrained problem, we estimate $b\in\mathcal{E}_K(B)$. The constrained estimator is defined as
$$b_n = \arg\inf_{b\in\mathcal{E}_K(B)} \frac{1}{n}\sum_{t=1}^{n}\Big(Y_t - \sum_{k=1}^{\infty} b_k Y_{t-k}\Big)^2. \quad (3)$$
Of course, in the above, $\sum_{k=1}^{\infty} b_kY_{t-k} = \sum_{k=1}^{K} b_kY_{t-k}$ if $b\in\mathcal{E}_K(B)$.

In the penalized problem, we estimate $b\in\mathcal{E}_K$, but introduce the penalty parameter $\tau > 0$. The penalized estimator is defined as
$$b_{n,\tau} := \arg\inf_{b\in\mathcal{E}_K} \frac{1}{n}\sum_{t=1}^{n}\Big(Y_t - \sum_{k=1}^{\infty} b_kY_{t-k}\Big)^2 + \tau\sum_{k=1}^{\infty}\lambda_k b_k^2, \quad (4)$$
where the $\lambda_k$'s are from the definition of $\mathcal{E}$. By use of the Lagrangian, we can always rewrite (3) as (4) for a suitable choice of $\tau$, i.e. there is a $\tau = \tau_{B,n}$ ($\tau = 0$ if the constraint is not binding) such that $b_{n,\tau} = b_n$.

Both problems can be reformulated in matrix form using the Lagrangian. Let $X$ be the $n\times K$ dimensional matrix with $(t,k)$th entry equal to $Y_{t-k}$ and $Y$ be the $n$-dimensional vector with $t$th entry $Y_t$. Also, let $\Lambda$ be the $K\times K$ diagonal matrix with $k$th diagonal entry equal to $\lambda_k$. The estimator for either (3) or (4) is found by minimizing the penalized least squares criterion with respect to (w.r.t.) $\tilde b\in\mathbb{R}^K$,
$$\frac{1}{n}\big(Y - X\tilde b\big)^T\big(Y - X\tilde b\big) + \tau\,\tilde b^T\Lambda\tilde b, \quad (5)$$
where for (3) $\tau$ is chosen so that the constraint $\tilde b^T\Lambda\tilde b \le B^2$ is satisfied.
In this latter case, $\tau$ is necessarily random because the constraint needs to be satisfied in sample. Here the tilde in $\tilde b$ is used to remind us that, in the matrix formulation, $b$ is truncated to a $K$ dimensional vector, as all entries beyond the $K$th are zero by definition of $\mathcal{E}_K$. The solution is the usual ridge regression estimator
$$\tilde b_{n,\tau} := \big(X^TX + \tau n\Lambda\big)^{-1}X^TY.$$
For problem (4), $\tau = \tau_n$ can go to zero in a controlled way. For problem (3), $\tau = \tau_{B,n} \ge 0$ must be chosen so that the constraint is satisfied. Such $\tau_{B,n}$ is nonzero if the constraint is binding, and zero otherwise. This is equivalent to replacing $\tau\tilde b^T\Lambda\tilde b$ with $\tau\big(\tilde b^T\Lambda\tilde b - B^2\big)$ in (5), and minimizing the so modified objective function (5) w.r.t. $\tilde b$ and $\tau \ge 0$. The minimizer w.r.t. $\tau$ is $\tau_{B,n}$.

All vectors are in $\mathbb{R}^{\infty}$, though only the first $K$ elements might be non-zero. The exception is when we use a tilde, as in (5). For $b_n$ in (3), the Euclidean norm of $b_n - \varphi$ becomes $|b_n - \varphi| = \big(\sum_{k=1}^{K}|b_{n,k} - \varphi_k|^2 + \sum_{k>K}|\varphi_k|^2\big)^{1/2}$.
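The following is a minimal numerical sketch of (3)-(5), not the author's code: `ridge_ar` solves the penalized problem (5), and `constrained_ar` recovers the constrained estimator (3) by bisecting on the dual penalty $\tau$ until the ellipsoid constraint is active. The function names, the bisection tolerance and the use of numpy are our own choices.

```python
import numpy as np

def lag_matrix(series, K):
    """Design matrix X with (t, k) entry Y_{t-k}, using the first K
    observations as the presample; returns X and the response vector."""
    y = np.asarray(series, dtype=float)
    n = len(y) - K
    X = np.column_stack([y[K - k: K - k + n] for k in range(1, K + 1)])
    return X, y[K:]

def ridge_ar(X, y, lam, tau):
    """Penalized estimator (5): minimize (1/n)|y - Xb|^2 + tau * b' Lambda b.
    The first order conditions give b = (X'X + n tau Lambda)^{-1} X'y."""
    n = X.shape[0]
    return np.linalg.solve(X.T @ X + n * tau * np.diag(lam), X.T @ y)

def constrained_ar(X, y, lam, B, tol=1e-10):
    """Constrained estimator (3) via duality: find the smallest tau >= 0 such
    that the ellipsoid constraint b' Lambda b <= B^2 holds (tau = 0 if slack)."""
    norm2 = lambda b: float(b @ (lam * b))       # squared RKHS norm |b|_E^2
    b = ridge_ar(X, y, lam, 0.0)
    if norm2(b) <= B ** 2:
        return b, 0.0                            # constraint not binding
    lo, hi = 0.0, 1.0
    while norm2(ridge_ar(X, y, lam, hi)) > B ** 2:
        hi *= 2.0                                # grow tau until feasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm2(ridge_ar(X, y, lam, mid)) > B ** 2 else (lo, mid)
    return ridge_ar(X, y, lam, hi), hi

# lam encodes the diagonal of Lambda, lambda_k = k^{2 lambda} as in Condition 1:
# K = 50; lam = np.arange(1.0, K + 1) ** (2 * 0.75)
```

The bisection exploits the fact, also used in the proof of Lemma 8 below, that $|b_{n,\tau}|_E$ is monotone in $\tau$.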
It is worth noting that the ellipsoid $\mathcal{E}\subset\ell_2$ is a RKHS generated by the kernel $C(k,l) = \sum_{v=1}^{\infty}\lambda_v^{-1}\delta_{v,k}\delta_{v,l}$, where $\delta_{v,l}$ is Kronecker's delta, i.e. $\delta_{v,l} = 1$ if $v = l$ and zero otherwise. The inner product $\langle\cdot,\cdot\rangle_E$ is defined to satisfy the reproducing kernel property $\langle C(\cdot,l), C(\cdot,k)\rangle_E = C(k,l)$. Hence for $a,b\in\mathcal{E}$, $b_k = \langle b, C(\cdot,k)\rangle_E$ and $\langle a,b\rangle_E = \sum_{v=1}^{\infty}\lambda_v a_v b_v$; indeed, $\langle b, C(\cdot,k)\rangle_E = \sum_{v=1}^{\infty}\lambda_v b_v \lambda_v^{-1}\delta_{v,k} = b_k$. The norm induced by the inner product is $|\cdot|_E$, such that for any vector $b\in\mathbb{R}^{\infty}$, $|b|_E^2 = \sum_{k=1}^{\infty}\lambda_k b_k^2$. This norm strictly dominates the Euclidean norm. The fact that $\mathcal{E}(1)$ is compact under the Euclidean norm is a consequence of the fact that $\mathcal{E}$ is a RKHS (Li and Linde, 1999), and sharp asymptotics can be derived by related means (Graf and Luschgy, 2004).

Once we realize such compactness, it becomes clear that it might be possible to estimate infinite AR processes under no restriction on the number of estimated coefficients. We show that this conjecture is true. We also establish convergence rates. Moreover, we want to clearly address the relation between constrained and penalized estimation.

The best approximation $\varphi_K\in\mathcal{E}_K$ to $\varphi$ minimizes the population mean square error:
$$\varphi_K = \arg\inf_{b\in\mathcal{E}_K} E\Big(Y_0 - \sum_{k=1}^{\infty} b_k Y_{-k}\Big)^2. \quad (6)$$
Despite the abuse of notation, do not confuse $\varphi_K$ with the $K$th entry in $\varphi$.
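Since (6) is a quadratic problem, its solution can be written via the population normal equations; this is the representation used later in the proof of Lemma 9, where $\Gamma$ is the $K\times K$ matrix with $(k,l)$ entry $\gamma(k-l)$ (the autocovariance function) and $\Gamma_0$ the vector with $k$th entry $\gamma(k)$:
$$E\Big(Y_0 - \sum_{k=1}^{K}\tilde\varphi_{K,k}Y_{-k}\Big)Y_{-l} = 0,\ \ l=1,\dots,K \quad\Longleftrightarrow\quad \Gamma\tilde\varphi_K = \Gamma_0 \quad\Longleftrightarrow\quad \tilde\varphi_K = \Gamma^{-1}\Gamma_0,$$
where $\tilde\varphi_K$ collects the first $K$ entries of $\varphi_K$ (a sketch valid when the ellipsoid constraint is slack).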
Theorem 1. Suppose that Condition 1 holds and $n, K\to\infty$.

1. (Consistency of Constrained Estimator) If $\varphi\in int(\mathcal{E}(B))$, there is a random $\tau = \tau_{B,n}$ such that $\tau = O_p(n^{-1/2})$ and $b_{n,\tau} = b_n$; moreover, if $\varphi\in\mathcal{E}(B)$,
$$|b_n - \varphi|^2 = O_p\Big(n^{-\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}} + K^{-2\lambda}\Big)$$
for any $\epsilon\in(0,\lambda - 1/2)$.

2. (Consistency of Penalized Estimator) Consider a possibly random $\tau = \tau_n$ such that $\tau\to 0$ and $\tau n^{1/2}\to\infty$ in probability. There is a finite $B$ such that $\varphi\in int(\mathcal{E}(B))$, $|b_{n,\tau}|_E < B$ eventually in probability, and $|b_{n,\tau} - \varphi|_E\to 0$ in probability.

3. (Approximation Error in $\mathcal{E}$) There is an $\epsilon > 0$ such that $|\varphi - \varphi_K|_E = O\big((\ln K)^{-(1+\epsilon)}\big)$. Suppose the $k$th entry $\varphi_k$ in $\varphi$ satisfies $|\varphi_k| \lesssim k^{-\nu}$ with $\nu > (2\lambda+1)/2$ for all $k$ large enough. Then $|\varphi - \varphi_K|_E = O\big(K^{(2\lambda+1-2\nu)/2}\big)$.

4. (Estimation Error in $\mathcal{E}$) If $(\tau + n^{-1/2}) = O_p(K^{-2\lambda})$, then $|b_{n,\tau} - \varphi_K|_E = O_p(n^{-1/2}K^{2\lambda})$.
5. (Difference Between Norms) There are $K\to\infty$ and $\tau = O_p(n^{-1/2})$ such that $|b_{n,\tau} - \varphi|\to 0$ in probability, but $|b_{n,\tau} - \varphi|_E$ does not converge to zero in probability.

Point 1 in the theorem establishes the link between constrained and penalized estimation by finding the rate of decay of the ridge penalty such that (3) and (4) coincide. It also establishes the convergence rate of (3) towards the true $\varphi$ in terms of $\lambda$ (recall $\lambda_k\asymp k^{2\lambda}$ in Condition 1). This rate does not constrain the number of lags used once we constrain $\varphi\in\mathcal{E}(B)$. For the finite dimensional case we trivially recover root-$n$ convergence by letting $\lambda\to\infty$.

Point 2 says that if we use penalized estimation and the penalty does not go to zero too fast (i.e. strictly slower than in Point 1), we can expect (4) to be contained in a ball in $\mathcal{E}$ that contains the true parameter with probability going to one. Moreover, (4) is consistent under the norm $|\cdot|_E$.

Point 3 is concerned with the approximation error of (6) in the RKHS norm. This error might go to zero at a logarithmic rate only. However, if the true coefficients decay fast, then we obtain a polynomial convergence rate.

Point 4 restricts the way we let $K\to\infty$ in order to derive convergence rates for the estimation error under the norm $|\cdot|_E$.

Point 5 establishes an additional insight into convergence under the Euclidean norm versus the RKHS norm in terms of the penalty. A "slowly convergent" penalty is necessary for convergence under $|\cdot|_E$. Hence, this also shows that the constrained estimator (whose penalty is $\tau = \tau_{B,n} = O_p(n^{-1/2})$ when $\varphi\in\mathcal{E}(B)$) cannot be consistent in the norm $|\cdot|_E$ in general. This happens when choosing a rather large $K$ that leads to a binding constraint for (3).

As a corollary to Points 3 and 4 in Theorem 1, we have the following.
Corollary 1. Suppose Condition 1 holds, $K\to\infty$ and $\tau = O_p(K^{-2\lambda})$.

1. Choose $K\asymp n^{\kappa}$ for some $\kappa\in(0, 1/(4\lambda))$. Then, there is an $\epsilon > 0$ such that $|b_{n,\tau} - \varphi|_E = O_p\big((\ln K)^{-(1+\epsilon)}\big)$.

2. Suppose the $k$th entry $\varphi_k$ in $\varphi$ satisfies $|\varphi_k| \lesssim k^{-\nu}$ with $\nu > (2\lambda+1)/2$ for all $k$ large enough. Choose $K\asymp n^{1/(2(2\nu-1))}$. Then, $|b_{n,\tau} - \varphi|_E = O_p\big(n^{-\frac{2\nu-(2\lambda+1)}{4(2\nu-1)}}\big)$.

Corollary 1 imposes additional restrictions in order to improve on the statement of Point 2 in Theorem 1 by giving rates of convergence. These rates are not tight, as they require $K = o(n)$, unlike Point 2 in Theorem 1. However, they are useful in applications (e.g. Section 2.1.1).

Sieve estimators are often consistent under the sole condition that the number of components (here $K$) is of smaller order of magnitude than the sample size $n$. In Point 1 of Theorem 1, we have shown that this is not required. Recall that $N = n + K$ is the sample size. We can have $K = O(N)$ as long as $n\to\infty$. Of course, we require knowledge concerning the magnitude of the coefficients. Such knowledge is usually assumed in the literature in order to bound the approximation error.

In practice, the fact that we allow $K = O(N)$ might sound irrelevant. However, the asymptotic results can be seen as suggesting that, once we set the constraint, the procedure used here can be more robust to lag choice. We show this in the simulations in Section 3.

2.1.1 Application to Optimal Forecasting and Universal Consistency

Define $X_t(a) = \sum_{k=1}^{\infty} a_k Y_{t-k}$ for any $a\in\mathbb{R}^{\infty}$. The expectation of $Y_t$ conditioning on the infinite past $(Y_{t-s})_{s>0}$ is $X_t(\varphi)$. As an application of Theorem 1 consider the following problem. Show that
$$\sup_{t\in\mathcal{T}}|X_t(\varphi) - X_t(b_{n,\tau})| \to 0$$
in probability, where $\mathcal{T} = (0,\infty)$ or $(0,n)$ ($b_{n,\tau}$ in (4)). Hence, we want $X_t(b_{n,\tau})$ to be close to the conditional expectation of $Y_t$ uniformly in $t\in\mathcal{T}$, which is even more general than considering a moving target. The norm $|\cdot|_E$ is useful because the previous display can be bounded as
$$\sup_{t\in\mathcal{T}}|X_t(\varphi - b_{n,\tau})| \lesssim |\varphi - b_{n,\tau}|_E \sup_{t\in\mathcal{T}}\Big(\sum_{k=1}^{\infty}\Big(\frac{Y_{t-k}}{k^{\lambda}}\Big)^2\Big)^{1/2}. \quad (7)$$
To obtain the inequality, we have multiplied and divided each term in the sum (on the l.h.s.) by $\lambda_k^{1/2}$ and then used the Cauchy-Schwarz inequality and Condition 1 to set $\lambda_k\asymp k^{2\lambda}$.

We have that $|\varphi - b_{n,\tau}|_E = O_p(\epsilon_n)$, where $\epsilon_n\to 0$ at a rate which depends on Theorem 1. Then, if
$$\sup_{t\in\mathcal{T}}\Big(\sum_{k=1}^{\infty}\Big(\frac{Y_{t-k}}{k^{\lambda}}\Big)^2\Big)^{1/2} = o_p\big(\epsilon_n^{-1}\big), \quad (8)$$
we have shown that (7) goes to zero in probability. This is a weak form of universal consistency because the convergence is in probability rather than almost surely. On the positive side, the convergence holds for a variety of processes and circumstances.

If $\mathcal{T} = (0,\infty)$ then the l.h.s. of (8) is almost surely finite if the random variables are bounded, and (7) goes to zero in probability using Point 2 in Theorem 1. If $\mathcal{T} = (0,n)$, we can use the bound
$$E\sup_{t\in(0,n)}\Big(\sum_{k=1}^{\infty}\frac{Y_{t-k}^2}{k^{2\lambda}}\Big)^{1/2} \le n^{1/(2p)}\sup_{t\in(0,n)}\Big(E\sum_{k=1}^{\infty}\frac{Y_{t-k}^{2p}}{k^{2\lambda p}}\Big)^{1/(2p)}$$
when the variables are $2p$-integrable. If $p$ is such that $n^{1/(2p)} = o(\epsilon_n^{-1})$, then the r.h.s. of (7) goes to zero in probability. If $Y_t$ has a moment generating function, the r.h.s. of the above display is $O(\ln n)$. Either way, to find $\epsilon_n$ we can use Corollary 1. Note that the argument is unchanged if $\mathcal{T} = (0, c_n)$ for any $c_n\asymp n$.

Theorem 1 can also be applied to the less ambitious problem: show that
$$\sup_{t\in\mathcal{T}}|X_t(\varphi_K) - X_t(b_{n,\tau})| \to 0$$
in probability as $K\to\infty$. A small computational illustration of the bound (7) is given below.
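As an illustrative companion to (7) (our own sketch, not part of the paper), the forecast $X_t(b)$ and the Cauchy-Schwarz bound can be computed directly; `lam` is the vector $(\lambda_1,\dots,\lambda_K)$ and `y_past` holds $Y_{t-1}, Y_{t-2}, \dots$:

```python
import numpy as np

def forecast(b, y_past):
    """X_t(b) = sum_k b_k Y_{t-k}; y_past[k-1] holds Y_{t-k}."""
    K = len(b)
    return float(b @ y_past[:K])

def bound_7(err, y_past, lam):
    """Right-hand side of (7) for the coefficient error err = phi - b:
    |X_t(err)| <= |err|_E * sqrt(sum_k Y_{t-k}^2 / lambda_k)."""
    K = len(err)
    err_norm = np.sqrt(err @ (lam[:K] * err))             # |err|_E
    scale = np.sqrt(y_past[:K] @ (y_past[:K] / lam[:K]))  # sum_k Y^2 / lambda_k
    return err_norm * scale
```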
In this case we want to forecast as well as the increasingly good best approximation of the conditional expectation of $Y_t$, uniformly in $t\in\mathcal{T}$. Point 4 in Theorem 1 is suited for this problem.

2.2 Choice of B in Practice

The parameter $B$ can be chosen to minimize some cross-validated prediction error estimate (beware of cross-validation in a time series context; see Györfi et al., 1990, Burman and Nolan, 1992, and Burman et al., 1994, for discussions and applicability). Alternatively, one can choose $B$ to minimize some penalized loss function such as
$$\ln\hat\sigma_B^2 + \frac{2\,df(B)}{n}, \quad (9)$$
where $df(B) = \mathrm{Trace}\big((X^TX + \tau_{B,n}n\Lambda)^{-1}X^TX\big)$ and $\tau_{B,n}$ is the value of $\tau$ for which the constraint $\tilde b_{n}^T\Lambda\tilde b_{n} \le B^2$ is satisfied, using the notation in (5). Here, $\hat\sigma_B^2$ is the sample variance of the residuals from the estimation. If the constraint is binding, $\tau_{B,n}$ solves
$$Y^TX\big(X^TX + \tau_{B,n}n\Lambda\big)^{-1}\Lambda\big(X^TX + \tau_{B,n}n\Lambda\big)^{-1}X^TY = B^2. \quad (10)$$
This $\tau_{B,n}$ is then used to compute $df(B)$, which is the effective number of degrees of freedom implied by $B$ (Hastie et al., 2009).
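A sketch of the selection rule (9)-(10), again our own illustration (it reuses `constrained_ar` and the `lam` vector from the earlier snippet; `B_grid` is a user-chosen grid):

```python
import numpy as np

def df_and_criterion(X, y, lam, B):
    """tau_{B,n} from the constrained fit, df(B) = Trace((X'X + tau n L)^{-1} X'X),
    and the penalized criterion (9): log(sigma_B^2) + 2 df(B) / n."""
    n = X.shape[0]
    b, tau = constrained_ar(X, y, lam, B)
    df = np.trace(np.linalg.solve(X.T @ X + n * tau * np.diag(lam), X.T @ X))
    sigma2 = float(np.mean((y - X @ b) ** 2))  # residual variance
    return df, np.log(sigma2) + 2.0 * df / n

# Grid search over B then amounts to:
# B_hat = min(B_grid, key=lambda B: df_and_criterion(X, y, lam, B)[1])
```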
3 Simulations

Asymptotic results are of interest on their own, but it is also of interest to understand the scope of applicability in practice. As a benchmark, we use predictions based on an AR model where the lag length is chosen by Akaike's Information Criterion (AIC).

3.1 Simulated True Models

One thousand data samples are simulated from (1). The sample size is $N = 1000$. A warm-up sample of 1000 observations is used to reduce any dependence on the starting value. We also simulate a testing sample to approximate the mean square error (MSE). We consider different specifications for $\varphi$ in (1), including long memory, in order to see how the procedure works when the true model is not in $\mathcal{E}$. In this case, an approximation error is incurred.

Short Memory.
In (1), the errors are i.i.d. standard normal and the $\varphi_k$'s are chosen to be $\varphi_k = \bar\varphi\,k^{-r}\big/\sum_{k=1}^{K}k^{-r}$ for a fixed decay exponent $r > 0$, where $\bar\varphi = 0.\ldots,\ 0.\ldots$ (two values). A higher value for $\bar\varphi$ leads to more persistent behaviour. By construction, for both values of $\bar\varphi$, the model appears to generate cycles because the roots of $1 - \sum_{k=1}^{K}\varphi_k z^k = 0$ are outside the unit circle, but complex. We consider different values for $K\in\{100, 1000\}$. Given the finite number of lags, the coefficients are automatically in $\mathcal{E}$.
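A sketch of the short memory design follows; the paper's decay exponent and $\bar\varphi$ values are not recoverable from this copy, so `decay=1.5` and `phi_bar=0.9` below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_phi(K, phi_bar, decay=1.5):
    """phi_k proportional to k^{-decay}, normalized so that sum_k phi_k = phi_bar."""
    raw = np.arange(1.0, K + 1) ** (-decay)
    return phi_bar * raw / raw.sum()

def simulate_ar(phi, n, burn=1000):
    """Simulate (1) with i.i.d. standard normal innovations and a warm-up
    sample of `burn` observations to reduce dependence on starting values."""
    K = len(phi)
    y = np.zeros(K + burn + n)
    for t in range(K, len(y)):
        y[t] = phi @ y[t - K:t][::-1] + rng.standard_normal()
    return y[K + burn:]

# Example: y = simulate_ar(make_phi(100, phi_bar=0.9), n=1000)
```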
Long Memory Model. The model is an ARFIMA:
$$Y_t = \sum_{k=1}^{K}\varphi_k Y_{t-k} + (1 - L)^{-d}\Big(\sum_{l=0}^{L_0}\theta_l\varepsilon_{t-l}\Big), \quad (11)$$
where $L$ is the lag operator and the $\varphi_k$'s are as in the previous paragraph. The MA coefficients are $\theta_l = (1 - 0.\ldots\,l)$ with $L_0 = 5$ (we write $L_0$ for the MA order to avoid a clash with the lag operator). The coefficient of fractional integration is $d = 0.\ldots \in (0, 1/2)$. Hence, the model is stationary, but exhibits long memory.

The parameters' estimates are obtained from (5) with $\lambda_k = k^{\ldots}$. The benchmark is an AR model with lag length chosen to minimize AIC. Denote the number of lags chosen using AIC by $K_{AIC}$. We compare this to a model estimated using more lags, but with coefficients constrained in $\mathcal{E}_K(B)$: in particular, $K = 2K_{AIC}$ and $4K_{AIC}$, with $B$ chosen as outlined in Section 2.2. The goal is to verify whether the procedure is robust to lag choice. AIC is known to choose large models. We use even larger models, and verify whether we are able to obtain sensible results.

The results in Table 1 show the improvement in MSE of the constrained procedure over AIC. Table 1 shows that the procedure is robust against lag choice. This becomes evident in the long memory case. The larger model ($4K_{AIC}$) leads to relatively better performance when the true model exhibits persistency, as in (11).

Table 1: Simulation Results. For Short Memory the process is as in (1) with number of true AR coefficients equal to $K$ and AR coefficients satisfying $\varphi_k = \bar\varphi\,k^{-r}/\sum_{k=1}^{K}k^{-r}$, where $\bar\varphi = 0.\ldots,\ 0.\ldots$. For Long Memory, the process is as in (11). Entries denote the MSE improvement relative to the MSE of a model with lag length $K_{AIC}$ chosen using AIC. The MSE in the numerator in the calculation of the relative improvement is computed using lag lengths $2K_{AIC}$ and $4K_{AIC}$ and constraining the coefficients in $\mathcal{E}(B)$, where $B$ is chosen as described in Section 2.2.
                              K = 100                  K = 1000
                        2K_AIC      4K_AIC       2K_AIC      4K_AIC
  Short Memory, ϕ̄ = 0.…    …           …            …           …
  Short Memory, ϕ̄ = 0.…    …           …            …           …
  Long Memory               …           …            …           …

4 Extensions

It is simple to impose linear restrictions on the coefficients of either the constrained or the penalized estimator. A natural example is positivity. This is the case if we wish to estimate ARCH models of large orders: under ARCH restrictions, the squared returns follow an AR process. The estimator no longer has a closed form expression, but it is just the solution of a quadratic programming problem (see the sketch at the end of this section). Another extension pertains to vector autoregressive processes,
$$Y_t = \sum_{k=1}^{\infty}\Phi_k Y_{t-k} + \varepsilon_t, \quad (12)$$
where now the variables and innovations are $L$-dimensional vectors and we use the capital $\Phi_k$ to stress the multivariate framework, $\Phi_k$ being an $L\times L$ matrix. Again, we can restrict $\mathcal{E}$ in a suitable way. For example, we can impose that $\Phi_k$ is lower triangular. This restriction has a variety of implications, ranging from Granger causality to exogeneity, and it is of much interest in econometrics (e.g., Sims, 1980). For fixed $L$, all the results in this paper apply to this problem as well, with obvious changes if we modify the constraint to $\sum_{k=1}^{\infty}|\Phi_k|^2\lambda_k \le B^2$, where $|\Phi_k|$ is any matrix norm, e.g., the Frobenius norm $|\Phi_k| = \sqrt{\mathrm{Trace}(\Phi_k^T\Phi_k)}$, where $\Phi_k^T$ is the transpose of $\Phi_k$.

An extension, which does not follow directly from the results derived here, is to consider the case where $L\to\infty$. This is the problem where we have a large cross-section ($L$ is the dimension of the vector $Y_t$ in (12)). In this case, the constraint cannot use an arbitrary matrix norm (norms are not equivalent in infinite dimensional spaces). Results in Lutz and Bühlmann (2006), together with the ones derived here, can provide initial guidance on how to tackle this problem in the future.
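As a sketch of the positivity-restricted case mentioned above (our own illustration, not the author's implementation): ridge plus a positivity constraint reduces to non-negative least squares on an augmented system, which is one concrete way to solve the quadratic program; `scipy.optimize.nnls` is used as the solver.

```python
import numpy as np
from scipy.optimize import nnls

def positive_ridge_ar(X, y, lam, tau):
    """min (1/n)|y - Xb|^2 + tau * b' Lambda b  subject to  b >= 0
    (e.g. an ARCH model of large order, fit as an AR on squared returns).
    Stacking sqrt(n tau) Lambda^{1/2} under X turns the penalty into extra
    rows of a least squares problem, so NNLS solves the whole program."""
    n, K = X.shape
    X_aug = np.vstack([X, np.sqrt(n * tau) * np.diag(np.sqrt(lam))])
    y_aug = np.concatenate([y, np.zeros(K)])
    b, _ = nnls(X_aug, y_aug)
    return b
```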
5 Proofs

At first we include the short proof of Lemma 2.

Proof. [Lemma 2] A stationary infinite AR process with absolutely summable AR coefficients has an infinite MA representation with absolutely summable coefficients and it is invertible (Lemma 2.1 in Bühlmann, 1995). Hence, there are coefficients $\psi_s$ such that $Y_t = \sum_{s=0}^{\infty}\psi_s\varepsilon_{t-s}$ and
$$\sum_{k=1}^{\infty}|EY_tY_{t-k}| \le \sigma^2\sum_{k=1}^{\infty}\sum_{s=0}^{\infty}|\psi_{s+k}||\psi_s| < \infty,$$
which means that the autocovariance function is absolutely summable. The moment bound follows from the infinite MA representation and the bound on the fourth moment of the innovations.

5.1 Proof of Theorem 1

We divide the proof into two parts. One only concerns results under the Euclidean norm. The other is concerned with convergence results under the RKHS norm.
5.1.1 Results under the Euclidean Norm

A few lemmas are needed for the proof. Throughout, we shall use the notation $X_t(a) = \sum_{k=1}^{\infty} a_kY_{t-k}$ for any $a\in\mathbb{R}^{\infty}$.
Lemma 3. For $\rho := (2\lambda+1)/2$ (with $\lambda > 1/2$ as in Condition 1) and real constants $w_k$,
$$\sup_{b\in\mathcal{E}_K(B)}\Big|\sum_{k=1}^{K} b_kw_k\Big| \lesssim \sum_{k=1}^{K} k^{-\rho}|w_k|,$$
and similarly, for real constants $w_{k,l}$,
$$\sup_{b\in\mathcal{E}_K(B)}\Big|\sum_{k,l=1}^{K} b_kb_lw_{k,l}\Big| \lesssim \sum_{k,l=1}^{K} k^{-\rho}l^{-\rho}|w_{k,l}|.$$

Proof.
Note that $\big|\sum_{k=1}^{K} b_kw_k\big| \le \sum_{k=1}^{K}|b_k|k^{\rho}k^{-\rho}|w_k|$. Given that $b\in\mathcal{E}_K(B)$, then $|b_k| \lesssim k^{-\rho}$ uniformly in $b\in\mathcal{E}_K(B)$, by Lemma 1. This implies that the previous quantity is bounded by a constant multiple of $\sum_{k=1}^{K}k^{-\rho}|w_k|$. The same argument proves the second statement in the lemma.

The $w_{k,l}$'s in the lemma above will be partial sums of cross products of the $Y_t$'s, which we bound using the following. For arbitrary $\tau > 0$, the first order conditions that define (4) imply that
$$b_{n,\tau,k} = \frac{1}{\tau\lambda_k}\frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))Y_{t-k}, \quad (13)$$
where $b_{n,\tau,k}$ is the $k$th element in $b_{n,\tau}$. Multiplying both sides by $\tau\lambda_ka_k$ and summing over $k$,
$$\Big|\frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))X_t(a)\Big| = \tau\Big|\sum_{k=1}^{K}\lambda_kb_{n,\tau,k}a_k\Big| \le \tau\sqrt{\sum_{k=1}^{K}\lambda_kb_{n,\tau,k}^2}\sqrt{\sum_{k=1}^{K}\lambda_ka_k^2}, \quad (14)$$
recalling the definition of $X_t(a)$ and using the Cauchy-Schwarz inequality. If $a\in\mathcal{E}_K(1)$, $\sqrt{\sum_{k=1}^{K}\lambda_ka_k^2} \le 1$ and the above display clearly holds uniformly in $a$. We shall show in the proof of Lemma 8 that there is a $\tau = \tau_n = O_p(n^{-1/2})$ such that $\sqrt{\sum_{k=1}^{K}\lambda_kb_{n,\tau,k}^2} < B$; combined with (14), this will imply the first display in the statement of that lemma.

Lemma 4. Under Condition 1,
$$\sup_{n,k,l>0} E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|^2 < \infty.$$

Proof.
From the proof of Lemma 2, there are absolutely summable coefficients $\psi_u$ such that $Y_t = \sum_{u=0}^{\infty}\psi_u\varepsilon_{t-u}$. For ease of notation suppose that the i.i.d. innovations have variance one and the MA coefficients are non-negative. By stationarity,
$$E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|^2 \le 2\sum_{s=0}^{n}\big|E[(1-E)Y_{t-k}Y_{t-l}][(1-E)Y_{t-s-k}Y_{t-s-l}]\big|,$$
where the r.h.s. holds for any $t$. If we showed that
$$\big|E[(1-E)Y_{t-k}Y_{t-l}][(1-E)Y_{t-s-k}Y_{t-s-l}]\big| \lesssim \psi_s,$$
the result would follow by summability of the coefficients. To show the above, with no loss of generality, by symmetry, consider only the case $l\ge k$. This implies that
$$E[(1-E)Y_{t-k}Y_{t-l}][(1-E)Y_{t-s-k}Y_{t-s-l}] = \mathrm{Cov}(Y_{t-k}Y_{t-l},\ Y_{t-s-k}Y_{t-s-l})$$
$$= E\Big(\sum_{u_1=0}^{\infty}\sum_{u_2=0}^{\infty}\psi_{u_1}\psi_{u_2}[(1-E)\varepsilon_{t-k-u_1}\varepsilon_{t-l-u_2}]\Big)\Big(\sum_{u_3=0}^{\infty}\sum_{u_4=0}^{\infty}\psi_{u_3}\psi_{u_4}[(1-E)\varepsilon_{t-s-k-u_3}\varepsilon_{t-s-l-u_4}]\Big).$$
The above is equal to
$$\sum_{u_1=0}^{\infty}\sum_{u_2=0}^{\infty}\sum_{u_3=0}^{\infty}\sum_{u_4=0}^{\infty}\psi_{u_1}\psi_{u_2}\psi_{u_3}\psi_{u_4}\,\mathrm{Cov}\big(\varepsilon_{t-k-u_1}\varepsilon_{t-l-u_2},\ \varepsilon_{t-s-k-u_3}\varepsilon_{t-s-l-u_4}\big).$$
By the i.i.d. condition on the innovations, the covariance is zero if the indexes are not constrained in the following sets: $\{k+u_1 = l+u_2,\ k+u_3 = l+u_4\}$, $\{u_1 = u_3+s,\ u_2 = u_4+s\}$, $\{k+u_1 = l+u_4+s,\ l+u_2 = k+u_3+s\}$. Hence, we can consider summation with indexes in these sets only. Splitting the sum according to the above index sets, we have, respectively,
$$I = \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+l-k}\psi_u\psi_{v+l-k}\psi_v\,\mathrm{Cov}\big(\varepsilon_0^2,\varepsilon_{u-(s+v)}^2\big),$$
$$II = \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+s}\psi_{v+s}\psi_u\psi_v\,E\varepsilon_0^2\varepsilon_{(u-v)+(k-l)}^2,$$
$$III = \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+s+(l-k)}\psi_{v+s+(k-l)}\psi_u\psi_v\,E\varepsilon_0^2\varepsilon_{(u-v-s)+(k-l)}^2.$$
By elementary change of indexes,
$$I \le \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+l-k}\psi_u\psi_{v+l-k}\psi_v 1_{\{u-v=s\}} \le \sum_{v=0}^{\infty}\psi_{v+s+l-k}\psi_{v+s}\psi_{v+(l-k)}\psi_v \le \sum_{v=0}^{\infty}\psi_{v+s}\psi_v \lesssim \psi_s.$$
Similarly, deduce that
$$II \lesssim \Big(\sum_{u=0}^{\infty}\psi_u\psi_{u+s}\Big)^2 \le \psi_s^2\Big(\sum_{u=0}^{\infty}\psi_u\Big)^2 \lesssim \psi_s.$$
Finally,
$$III \lesssim \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_u\psi_v\psi_{u+s+(l-k)}\psi_{v+s+(k-l)} \le \psi_s\Big(\sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_v\psi_u\Big) \lesssim \psi_s.$$
The bounds do not depend on $k,l$ beyond the fact that $l\ge k$. Repeating the argument for $k > l$, the result follows.

Lemma 4 will be used to bound quantities such as the following:
$$E\Big|\sum_{k,l=1}^{\infty}k^{-(2\lambda+1)/2}l^{-(2\lambda+1)/2}\frac{1}{n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big| \le \sum_{k,l=1}^{\infty}k^{-(2\lambda+1)/2}l^{-(2\lambda+1)/2}E\Big|\frac{1}{n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|$$
$$\lesssim \frac{1}{\sqrt n}\max_{k,l>0}E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|,$$
where the sum of the coefficients converges because $(2\lambda+1)/2 > 1$. Then, by Lemma 4 the expectation is finite because $E|\cdot| \le (E|\cdot|^2)^{1/2}$, and it is independent of $k,l$ by stationarity. In consequence the display is $O_p(n^{-1/2})$ because convergence in $L_1$ implies convergence in probability.

To establish convergence rates we need two stochastic equicontinuity results.
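As an aside, the boundedness claimed in Lemma 4 is easy to probe numerically; the following is an illustrative check we add, for an AR(1) with unit-variance innovations, where $\gamma(h) = \varphi_1^{|h|}/(1-\varphi_1^2)$ (it reuses `simulate_ar` from the simulation sketch in Section 3):

```python
import numpy as np

def lemma4_check(phi1=0.6, n=5000, k=1, l=3, reps=200):
    """Second moment of n^{-1/2} sum_t (Y_{t-k} Y_{t-l} - gamma(|k-l|)) for an
    AR(1); Lemma 4 says this stays bounded as n grows."""
    gamma = phi1 ** abs(k - l) / (1.0 - phi1 ** 2)
    m = max(k, l)
    stats = []
    for _ in range(reps):
        y = simulate_ar(np.array([phi1]), n + m)
        prods = y[m - k: len(y) - k] * y[m - l: len(y) - l]
        stats.append(np.sqrt(len(prods)) * (prods.mean() - gamma))
    return float(np.mean(np.square(stats)))  # roughly constant in n
```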
Lemma 5. Under Condition 1, for any $\epsilon > 0$,
$$E\sup_{a,b\in\mathcal{E}(2B),\,|b|\le\delta}\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)X_t(b)X_t(a)\Big| \lesssim \delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}}. \quad (15)$$

Proof.
By the triangle inequality, the l.h.s. of (15) is bounded by
$$E\sup_{a,b\in\mathcal{E}(2B),\,|b|\le\delta}\sum_{l=1}^{\infty}|a_l|\sum_{k=1}^{\infty}|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|.$$
By Lemma 3, there is a $\rho > 1$ such that the above is bounded by a constant multiple of
$$\sum_{l=1}^{\infty}l^{-\rho}\,E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\sum_{k=1}^{\infty}|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big| \lesssim \sup_{l>0}E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\sum_{k=1}^{\infty}|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|,$$
by summability of $l^{-\rho}$. For any positive $V$, the above display can be written as
$$\sup_{l>0}E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\Big(\sum_{k\le V} + \sum_{k>V}\Big)|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|.$$
We shall bound the two sums separately. By the Cauchy-Schwarz inequality, the first sum is bounded by
$$\sqrt{\sup_{l>0}\sup_{|b|\le\delta}\sum_{k\le V}b_k^2\sum_{k\le V}E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|^2} \lesssim \delta\sqrt V, \quad (16)$$
where the inequality uses Lemma 4 and $|b|\le\delta$. Next, by the Cauchy-Schwarz inequality, the second sum is bounded by
$$\sqrt{\Big(\sup_{b\in\mathcal{E}(2B)}\sum_{k>V}b_k^2k^{1+\epsilon}\Big)\Big(\sup_{l>0}\sum_{k>V}k^{-(1+\epsilon)}E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|^2\Big)} \lesssim \sqrt{V^{(1+\epsilon)-2\lambda}\sup_{b\in\mathcal{E}(2B)}\sum_{k>V}b_k^2\lambda_k}$$
for any $\epsilon\in(0, 2\lambda-1)$, using again Lemma 4, and the fact that $k^{-(1+\epsilon)}$ is summable and $k^{(1+\epsilon)}\lambda_k^{-1}$ is decreasing. The r.h.s. is then bounded by a constant multiple of $V^{(1+\epsilon-2\lambda)/2}$. Equating $\delta\sqrt V$ with $V^{(1+\epsilon-2\lambda)/2}$ we choose $V = \delta^{-2/(2\lambda-\epsilon)}$, implying that
$$\delta\sqrt V + V^{(1+\epsilon-2\lambda)/2} \lesssim \delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}},$$
and the lemma is proved.
Lemma 6. Under Condition 1, for any $\epsilon > 0$,
$$E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}\varepsilon_tX_t(b)\Big| \lesssim \delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}}.$$

Proof.
By linearity and the triangle inequality,
$$E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}\varepsilon_tX_t(b)\Big| \le E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\sum_{k=1}^{\infty}|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}\varepsilon_tY_{t-k}\Big|.$$
Note that
$$\sup_{k>0}E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}\varepsilon_tY_{t-k}\Big|^2 \le \sigma^2\gamma(0).$$
Hence, we can proceed exactly as in the proof of Lemma 5 to deduce the result.

The first part of Point 1 in the theorem will be proved in Lemma 8 (Section 5.1.2). Hence, here we shall only derive the convergence rate. Define the empirical loss function
$$L_n(b) := \frac{1}{n}\sum_{t=1}^{n}\Big(Y_t - \sum_{k=1}^{\infty}b_kY_{t-k}\Big)^2, \qquad b\in\mathcal{E}.$$
When $b\in\mathcal{E}_K$ the sum inside the parenthesis only runs from $1$ to $K$. The population loss is $L(b) := EX_t^2(\varphi - b)$. Define $\beta = \beta_K\in\mathbb{R}^{\infty}$ such that its first $K$ entries are as in $\varphi$ and the remaining are all zero. The consistency proof is standard (van der Vaart and Wellner, 2000, Theorem 3.2.5) once we show the following:
$$L(b) - L(\beta) \gtrsim |b - \beta|^2, \quad (17)$$
$$E\sup_{b\in\mathcal{E}_K(B):\,|b-\beta|\le\delta}\big|[L_n(b) - L(b)] - [L_n(\beta) - L(\beta)]\big| \lesssim \frac{\delta^{\alpha}}{\sqrt n}, \quad (18)$$
for some $\alpha\in(0,1]$. Then, for any sequence $r_n\to\infty$ satisfying $r_n^{2-\alpha}\lesssim\sqrt n$, $L_n(b_n) \le L_n(\beta) + O_p(r_n^{-2})$ and $|\varphi - \beta| \lesssim r_n^{-1}$, we have that $|b_n - \varphi|^2 = O_p(r_n^{-2})$.

At first we verify (17). Note that
$$L(b) - L(\beta) = \sum_{k,l=1}^{\infty}(b_k - \beta_k)(b_l - \beta_l)\gamma(k-l),$$
where $\gamma(k)$ is the autocovariance function (ACF) of the $Y_t$'s. The estimator is uniquely identified if the matrix, say $\Gamma$, with $(k,l)$ entry equal to $\gamma(k-l)$, is strictly positive definite with smallest eigenvalue $\theta_{\min} > 0$ (see the remarks after Lemma 2.2 in Kreiss et al., 2011). This is the case if the spectral density of $(Y_t)_{t\in\mathbb{Z}}$, say $g(\omega)$, is bounded away from zero. The spectral density of the AR model (1) is given by $g(\omega) = (2\pi)^{-1}\sigma^2/\phi(\omega)$, where $\phi(\omega) = \big|1 - \sum_{k=1}^{\infty}\varphi_ke^{-ik\omega}\big|^2$. Noting that by Condition 1, $\phi(\omega) \le \big(1 + \sum_{k=1}^{\infty}|\varphi_k|\big)^2 < \infty$, deduce that the eigenvalues of $\Gamma$ are bounded away from zero. Hence,
$$L(b) - L(\beta) \ge \theta_{\min}\sum_{k=1}^{\infty}(b_k - \beta_k)^2 = \theta_{\min}|b - \beta|^2, \quad (19)$$
and (17) holds.

Using the notation $Y_t = X_t(\varphi) + \varepsilon_t$, the empirical loss is equal to
$$L_n(b) = \frac{1}{n}\sum_{t=1}^{n}\big[\varepsilon_t^2 + X_t^2(\varphi - b) + 2\varepsilon_tX_t(\varphi - b)\big].$$
Hence,
$$(L_n(b) - L(b)) - (L_n(\beta) - L(\beta)) = \frac{1}{n}\sum_{t=1}^{n}\big[2\varepsilon_tX_t(\beta - b) + (1-E)\big(X_t^2(b - \varphi) - X_t^2(\beta - \varphi)\big)\big].$$
To verify (18), we need to bound the above uniformly in $b\in\mathcal{E}(B)$ such that $|b - \beta| \le \delta$. To this end, apply Lemma 6 to the first term on the r.h.s. to find that the uniform bound is a constant multiple of $n^{-1/2}\delta^{(2\lambda-\epsilon-1)/(2\lambda-\epsilon)}$ for any $\epsilon > 0$. By basic algebraic manipulations, the second term on the r.h.s. of the display satisfies
$$\frac{1}{n}\sum_{t=1}^{n}(1-E)\big(X_t^2(b - \varphi) - X_t^2(\beta - \varphi)\big) = \frac{1}{n}\sum_{t=1}^{n}(1-E)X_t(b - \beta)X_t(b - \varphi) + \frac{1}{n}\sum_{t=1}^{n}(1-E)X_t(b - \beta)X_t(\beta - \varphi).$$
Note that both $\varphi - b$ and $\beta - \varphi$ are in $\mathcal{E}(2B)$. We apply Lemma 5 to deduce that each term on the r.h.s. of the above display is uniformly bounded in $L_1$ by a constant multiple of $n^{-1/2}\delta^{(2\lambda-\epsilon-1)/(2\lambda-\epsilon)}$ for any $\epsilon > 0$ when $|b - \beta| \le \delta$. Hence (18) is verified with $\alpha = \frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}$.
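Spelling out the rate calculus behind the last step (a sketch of the arithmetic implied by van der Vaart and Wellner, 2000, Theorem 3.2.5, with $\alpha$ as just derived):
$$r_n^{2-\alpha}\lesssim\sqrt n,\qquad \alpha = \frac{2\lambda-\epsilon-1}{2\lambda-\epsilon} \;\Longrightarrow\; 2-\alpha = \frac{2\lambda-\epsilon+1}{2\lambda-\epsilon},\qquad r_n \asymp n^{\frac{2\lambda-\epsilon}{2(2\lambda-\epsilon+1)}},$$
so that $r_n^{-2} = n^{-\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}}$, which is the first term in the rate of Point 1.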
When we are only interested in a finite dimensional model, we can take $\lambda\to\infty$ to deduce that $\alpha = 1$, which is the parametric case. To find $r_n$ note that
$$L_n(b_n) - L_n(\beta) \le L_n(b_n) - \inf_{b\in\mathcal{E}_K(B)}L_n(b) = 0.$$
Also, $|\varphi - \beta| = \big(\sum_{k>K}|\varphi_k|^2\big)^{1/2} \lesssim K^{-\lambda}/\ln^{\epsilon}(K)$ for some $\epsilon > 0$, using Lemma 1, bounding the sum with an integral and using the fact that $\ln^{\epsilon}(\cdot)$ is slowly varying at infinity. Hence we deduce that
$$r_n^{-2} \asymp \big(K^{-\lambda}/\ln^{\epsilon}(K)\big)^2 + n^{-\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}}$$
as stated in Point 1 of the theorem.

5.1.2 Results under the RKHS Norm

The proof depends on a few preliminary lemmas. Let $\varphi_{\tau} = \varphi_{K,\tau}\in\mathcal{E}_K$ be the penalized population estimator
$$\varphi_{\tau} = \arg\inf_{b\in\mathcal{E}_K} EX_t^2(b - \varphi) + \tau|b|_E^2. \quad (20)$$
The following can be deduced from Theorem 5.9 in Steinwart and Christmann (2008, eq. 5.14). The proof is given, as the context might seem different at first sight.
Lemma 7. Suppose Condition 1 holds. For arbitrary but fixed $\tau > 0$, consider $b_{n,\tau}$ and $\varphi_{\tau}$ in (4) and (20) with $K$ possibly diverging to infinity. Then,
$$\sqrt{\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2} \le \sqrt{\sum_{k=1}^{K}\frac{4}{\tau^2\lambda_k}\Big(\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi_{\tau}))Y_{t-k}\Big)^2},$$
where $b_{n,\tau,k}$ is the $k$th entry in the $K$ dimensional vector $b_{n,\tau}$, and similarly for $\varphi_{\tau,k}$.

Proof.
By convexity of the squared error loss,
$$-\frac{2}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))(X_t(b_{n,\tau}) - X_t(\varphi_{\tau})) \le \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))^2 - \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))^2.$$
Note the following algebraic equality,
$$2\tau\sum_{k=1}^{\infty}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})\varphi_{\tau,k} + \tau\sum_{k=1}^{\infty}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2 = \tau\sum_{k=1}^{\infty}\lambda_kb_{n,\tau,k}^2 - \tau\sum_{k=1}^{\infty}\lambda_k\varphi_{\tau,k}^2.$$
The above two displays imply
$$-\frac{2}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))(X_t(b_{n,\tau}) - X_t(\varphi_{\tau})) + 2\tau\sum_{k=1}^{\infty}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})\varphi_{\tau,k} + \tau\sum_{k=1}^{\infty}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2$$
$$\le \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))^2 + \tau\sum_{k=1}^{\infty}\lambda_kb_{n,\tau,k}^2 - \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))^2 - \tau\sum_{k=1}^{\infty}\lambda_k\varphi_{\tau,k}^2 \le 0,$$
where the last inequality follows because $b_{n,\tau}$ minimizes the empirical penalized risk. The first order conditions for $\varphi_{\tau}$ read
$$\varphi_{\tau,k} = \frac{1}{\tau\lambda_k}E(Y_t - X_t(\varphi_{\tau}))Y_{t-k} \quad (21)$$
for $k \ge 1$. Substituting this in the previous display,
$$-\frac{2}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))\sum_{k=1}^{K}(b_{n,\tau,k} - \varphi_{\tau,k})Y_{t-k} + 2E(Y_t - X_t(\varphi_{\tau}))\sum_{k=1}^{K}(b_{n,\tau,k} - \varphi_{\tau,k})Y_{t-k} + \tau\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2 \le 0.$$
Rearranging and using the definition of $X_t(b_{n,\tau} - \varphi_{\tau})$, deduce that
$$\tau\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2 \le \frac{2}{n}\sum_{t=1}^{n}(E - 1)(Y_t - X_t(\varphi_{\tau}))\sum_{k=1}^{K}(b_{n,\tau,k} - \varphi_{\tau,k})Y_{t-k}$$
$$\le 2\sqrt{\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big(\frac{1}{n}\sum_{t=1}^{n}(E - 1)(Y_t - X_t(\varphi_{\tau}))Y_{t-k}\Big)^2}\sqrt{\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2},$$
using the Cauchy-Schwarz inequality in the last step. This implies the result of the lemma after simple rearrangement.

The next lemma establishes the relation between the constrained and penalized estimator and states a bound for the distance between the sample and population penalized estimator under the RKHS norm.
Lemma 8. Suppose that $\varphi\in int(\mathcal{E}(B))$. Under Condition 1, if $a\in\mathcal{E}_K(1)$ and $b_{n,\tau}$ is as in (4), there is $\tau = \tau_n = O_p(n^{-1/2})$ such that $|b_{n,\tau}|_E < B$ and
$$\frac{1}{\sqrt n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))X_t(a) = O_p\Big(B\sqrt{\sum_{k=1}^{K}\lambda_ka_k^2}\Big),$$
where the above bound holds uniformly in $a\in\mathcal{E}_K(1)$. In consequence, there is a $\tau = O_p(n^{-1/2})$ such that $b_{n,\tau} = b_n$. Moreover, for any $\tau > 0$,
$$\sqrt{\sum_{k=1}^{K}\frac{4}{\tau^2\lambda_k}\Big(\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi_{\tau}))Y_{t-k}\Big)^2} = O_p\big(\tau^{-1}n^{-1/2}\big).$$

Proof. Suppose that $\tau > 0$, as otherwise, by the first order conditions, the r.h.s. in the first display in the statement of the lemma is exactly zero and there is nothing to prove. By the triangle inequality,
$$\sqrt{\sum_{k=1}^{K}\lambda_kb_{n,\tau,k}^2} \le \sqrt{\sum_{k=1}^{K}\lambda_k\varphi_{\tau,k}^2} + \sqrt{\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2}. \quad (22)$$
For $\tau \ge 0$, $\sqrt{\sum_{k}\lambda_k\varphi_{\tau,k}^2} \le \sqrt{\sum_{k}\lambda_k\varphi_k^2}$, as the penalized population estimator must have norm no larger than that of $\varphi$. By this remark and the fact that $\varphi\in int(\mathcal{E}(B))$, there is an $\epsilon > 0$ such that the first term on the r.h.s. is at most $B - 3\epsilon$. Lemma 7 gives
$$\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2 \le \sum_{k=1}^{K}\frac{4}{\tau^2\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))Y_{t-k} - E(Y_t - X_t(\varphi_{\tau}))Y_{t-k}\Big]^2. \quad (23)$$
Adding and subtracting $(1-E)X_t(\varphi)Y_{t-k}$, and then using the basic inequality $(x+y)^2 \le 2x^2 + 2y^2$ for any real $x,y$, the r.h.s. is
$$\le \sum_{k=1}^{K}\frac{8}{\tau^2\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi))Y_{t-k}\Big]^2 + \sum_{k=1}^{K}\frac{8}{\tau^2\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(X_t(\varphi) - X_t(\varphi_{\tau}))Y_{t-k}\Big]^2.$$
Recalling that our goal is to bound the second term on the r.h.s. of (22), the above two displays imply that
$$\sqrt{\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2} \le \frac{2\sqrt 2}{\tau}\sqrt{\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi))Y_{t-k}\Big]^2} + \frac{2\sqrt 2}{\tau}\sqrt{\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(X_t(\varphi) - X_t(\varphi_{\tau}))Y_{t-k}\Big]^2} =: I + II. \quad (24)$$
To bound $I$ on the r.h.s. note that for $k > 0$,
$$E\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi))Y_{t-k}\Big]^2 = E\Big[\frac{1}{n}\sum_{t=1}^{n}\varepsilon_tY_{t-k}\Big]^2 = \frac{\sigma^2\gamma(0)}{n}$$
(recall $\gamma(k)$ is the ACF) so that
$$\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi))Y_{t-k}\Big]^2 = O_p\Big(\frac{\sigma^2\gamma(0)}{n}\Big)$$
because the coefficients $\lambda_k^{-1}$ are summable. Hence, it is possible to find a $\tau = O_p(n^{-1/2})$ such that $I \le \epsilon$. To bound $II$, recall that $\varphi_{\tau},\varphi\in\mathcal{E}(B)$ for any $\tau \ge 0$, and write
$$W_{k,l} := \frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-l}Y_{t-k}.$$
For $\rho = (2\lambda+1)/2 > 1$,
$$III := E\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(X_t(\varphi) - X_t(\varphi_{\tau}))Y_{t-k}\Big]^2 \le \sum_{k=1}^{K}\frac{1}{\lambda_k}E\sup_{b\in\mathcal{E}(2B)}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)\sum_{l=1}^{\infty}b_lY_{t-l}Y_{t-k}\Big]^2$$
$$\lesssim \frac{1}{n}\sum_{k=1}^{K}\frac{1}{\lambda_k}\sum_{l,j=1}^{\infty}l^{-\rho}j^{-\rho}EW_{k,l}W_{k,j} \lesssim \frac{1}{n}\sup_{k,l,j>0}EW_{k,l}W_{k,j} \le \frac{1}{n}\sup_{k,l>0}EW_{k,l}^2, \quad (25)$$
using Lemma 3 in the second inequality and summability of the coefficients in the last step. By Lemma 4, $EW_{k,l}^2 \le c$ for some finite absolute constant $c$. Hence, $III = O(n^{-1})$ and, by the Markov inequality, $II = O_p(\tau^{-1}n^{-1/2})$. Hence, there is a $\tau = O_p(n^{-1/2})$ such that $II \le \epsilon$. The control of $I + II$ implies that (24) is not greater than $2\epsilon$ for suitable $\tau$. Hence, we have shown that there is a $\tau = O_p(n^{-1/2})$ such that (22) is not greater than $B - \epsilon$. This bound for (22) together with (14) proves the first display in the lemma. To see that this also implies that there is a $\tau = O_p(n^{-1/2})$ such that $b_{n,\tau} = b_n$, note that $|b_{n,\tau}|_E$ is non-decreasing as $\tau\to 0$.
Hence, $b_{n,\tau} = b_n$ for the smallest $\tau \ge 0$ such that $|b_{n,\tau}|_E \le B$. The last statement in the lemma follows from (23) and the just derived bound for (24).

We now estimate the approximation error.
Lemma 9
For any $K\to\infty$, we have that $|\varphi_K - \varphi_{\tau}|_E\to 0$ as $\tau\to 0$, where $\varphi_K$ is as in (6). Moreover, if $\tau = O_p(K^{-2\lambda})$, then $|\varphi_K - \varphi_{\tau}|_E = O_p(\tau K^{2\lambda})$.

Proof.
The first part of the lemma is just Theorem 5.17 in Steinwart and Christmann (2008). Hence, we only need to prove the second statement. Let $\Gamma$ be the $K\times K$ matrix with $(k,l)$ entry $\gamma(k-l)$ and let $\Gamma_0$ be the $K$-dimensional vector with $k$th entry $\gamma(k)$. Let $\tilde\varphi_K, \tilde\varphi_{\tau}\in\mathbb{R}^K$ be the first $K$ entries in $\varphi_K, \varphi_{\tau}\in\mathcal{E}_K$. Recall that in both $\varphi_K$ and $\varphi_{\tau}$ all entries $k > K$ are zero. Then, $\tilde\varphi_K = \Gamma^{-1}\Gamma_0$, and writing $D := \tau^{1/2}\Lambda^{1/2}$ for $\Lambda$ as in (5),
$$\tilde\varphi_{\tau} = (DD + \Gamma)^{-1}\Gamma_0.$$
By the Woodbury identity (Petersen and Pedersen, 2012, eq. 159),
$$(DD + \Gamma)^{-1} = \Gamma^{-1} - \Gamma^{-1}D\big(I + D\Gamma^{-1}D\big)^{-1}D\Gamma^{-1},$$
we have that
$$\tilde\varphi_K - \tilde\varphi_{\tau} = \big[\Gamma^{-1}D\big(I + D\Gamma^{-1}D\big)^{-1}D\Gamma^{-1}\big]\Gamma_0.$$
Hence,
$$|\varphi_K - \varphi_{\tau}|_E = \big|\Lambda^{1/2}\Gamma^{-1}D\big(I + D\Gamma^{-1}D\big)^{-1}D\Gamma^{-1}\Gamma_0\big| = \big|D\Gamma^{-1}D\big(I + D\Gamma^{-1}D\big)^{-1}\Lambda^{1/2}\tilde\varphi_K\big|,$$
using the definitions of $\tilde\varphi_K$ and $D$. For any square matrix $W$ and compatible vector $a$, $|Wa| \le \sigma_1(W)|a|$, where $\sigma_1(W)$ is the maximum eigenvalue of $W$. Define $W = D\Gamma^{-1}D(I + D\Gamma^{-1}D)^{-1}$. Given that $\varphi_K\in\mathcal{E}_K(B)$, then $|\Lambda^{1/2}\tilde\varphi_K| \le B$. Hence, we only need to find the maximum eigenvalue of $W$ to bound the above display. The following inequalities hold for the eigenvalues of the product of two positive definite matrices $A$ and $C$:
$$\sigma_K(A)\sigma_K(C) \le \sigma_K(AC) \le \sigma_1(AC) \le \sigma_1(A)\sigma_1(C),$$
where $\sigma_1(\cdot)$ and $\sigma_K(\cdot)$ are the maximum and minimum eigenvalue of the matrix argument (Bhatia, 1997, problem III.6.14, p. 78). In order to derive (19), we argued that $\Gamma$ has minimum eigenvalue $\theta_{\min}$ bounded away from zero. Hence, $D\Gamma^{-1}D$ has eigenvalues in $\big[\theta_{\max}^{-1}\tau\lambda_1,\ \theta_{\min}^{-1}\tau\lambda_K\big]$, where $\theta_{\max}$ is the largest eigenvalue of $\Gamma$ (finite, since the ACF is absolutely summable). The matrix $(I + D\Gamma^{-1}D)$ has eigenvalues equal to $1$ plus the eigenvalues of $D\Gamma^{-1}D$. Hence deduce that $|\varphi_K - \varphi_{\tau}|_E \lesssim \theta_{\min}^{-1}\tau\lambda_K\big(1 + \theta_{\max}^{-1}\tau\lambda_1\big)^{-1}$. This is just $O(\tau\lambda_K) = O(\tau K^{2\lambda})$ as required.

We need a final approximation result.
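For intuition on the polynomial bound in the next lemma, note that the truncation $\beta_K$ of $\varphi$ (zero beyond lag $K$) already achieves this order under the stated decay, by a direct tail computation (our added sketch):
$$|\varphi - \beta_K|_E^2 = \sum_{k>K}\lambda_k\varphi_k^2 \lesssim \sum_{k>K}k^{2\lambda}k^{-2\nu} \asymp K^{2\lambda+1-2\nu};$$
the lemma states that the best approximation $\varphi_K$ attains the same order.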
Lemma 10. Recall (6). If $\varphi\in\mathcal{E}$, then $|\varphi - \varphi_K|_E = O\big(1/\ln^{(1+\epsilon)}(K)\big)$ for some $\epsilon > 0$ as $K\to\infty$. If also $|\varphi_k| \lesssim k^{-\nu}$ with $\nu > (2\lambda+1)/2$, then $|\varphi - \varphi_K|_E = O\big(K^{(2\lambda+1-2\nu)/2}\big)$.

Proof.