Consistency Results for Stationary Autoregressive Processes with Constrained Coefficients
Alessio Sancetta ∗ June 9, 2017
Abstract
We consider stationary autoregressive processes with coefficients restricted to an ellipsoid, which includes autoregressive processes with absolutely summable coefficients. We provide consistency results under different norms for the estimation of such processes using constrained and penalized estimators. As an application we show some weak form of universal consistency. Simulations show that directly including the constraint in the estimation can lead to more robust results.
Key Words: consistency, empirical process, ridge regression, reproducing kernel Hilbert space, universal consistency.
1 Introduction

It is common to impose constraints on the decay rate of the autoregressive coefficients in order to derive results amenable to estimation for the purpose of prediction. At a minimum, these constraints tend to require that the AR coefficients are absolutely summable. A natural approach when dealing with high order autoregressive models is then to consider sieve estimation. Sieve estimation of infinite AR models has been considered by various authors. For universal consistency, Schäfer (2002) derived perhaps the strongest result possible. Györfi and Sancetta (2015) review some of these results. For convergence in probability, various authors have considered infinite AR models and imposed restrictions on the $\ell_1$ norm of the coefficients in order to derive asymptotic results. The conditions essentially require the autoregressive coefficients to be absolutely summable.

We shall see that the vector of autoregressive coefficients can be seen as an element in a Reproducing Kernel Hilbert Space (RKHS) when $\ell_2$ is equipped with a suitable inner product. This allows us to exploit all the existing machinery for estimation in RKHS and build on it (see Steinwart and Christmann, 2008, for a comprehensive review). The main ingredient is penalized least squares estimation. We also consider the constrained least squares problem. Penalized and constrained estimation are dual problems for specific values of the penalty coefficient. Our result establishes the relation between the two problems and the consistency rates. In general, they can lead to different consistency results under different norms. One norm is the usual Euclidean norm of the vector of coefficients, while the other is the norm of the RKHS. We show that consistency under the latter has important implications for prediction problems.

In general, unlike existing results, we are able to establish consistency as both the autoregressive order and the sample size go to infinity with no constraint on the rates. Existing results use the machinery of the method of sieves, hence they require the autoregressive order to go to infinity in a controlled way. As already mentioned, we are able to avoid this restriction because the ellipsoid is compact under the Euclidean norm.

The plan for the paper is as follows. Section 2 reviews the estimation method and presents the consistency results. A numerical example is provided in Section 3. Section 4 mentions extensions to other processes such as vector autoregressive processes (VAR). The proof of the consistency results is long and is given in Section 5.

2 The Estimation Method and Consistency Results

We restrict attention to the infinite order autoregressive process
$$Y_t = \sum_{k=1}^{\infty}\varphi_k Y_{t-k} + \varepsilon_t \quad (1)$$
for some mean zero independent identically distributed (i.i.d.) sequence $(\varepsilon_t)_{t\in\mathbb{Z}}$ and unknown coefficients $\varphi_k$. This paper considers estimators of the above under the condition that $\sum_{k=1}^{\infty}|\varphi_k| \le \bar\varphi < \infty$.

In a finite sample, the above model can only be approximated by the finite dimensional model
$$Y_t = \sum_{k=1}^{K} b_k Y_{t-k} + \varepsilon_t$$
with $K\to\infty$. While this is essentially a sieve, we do not necessarily require $K$ to be of smaller order than the sample size. Here, we restrict the coefficients to an ellipsoid, defined as follows. Let the $\lambda_k$'s be positive constants such that $\lambda_k \asymp k^{2\lambda}$ for $\lambda > 0$, where $\asymp$ means that the left hand side (l.h.s.) and the right hand side (r.h.s.) are proportional. Define the ellipsoid as
$$\mathcal{E}_K(B) := \Big\{ b \in \mathbb{R}^{\infty} :\ \sum_{k=1}^{\infty} b_k^2 \lambda_k \le B^2,\ b_k = 0 \text{ for } k > K \Big\}. \quad (2)$$
Given that the $\lambda_k$'s are increasing, the $b_k$'s need to be smaller in absolute value as $k$ increases.
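To make the definition concrete, here is a quick example we add for illustration: with $\lambda_k \asymp k^{2\lambda}$, any sequence decaying slightly faster than $k^{-(2\lambda+1)/2}$ lies in the ellipsoid, e.g.
$$b_k = k^{-(\lambda+1)} \;\Longrightarrow\; \sum_{k=1}^{\infty} b_k^2\lambda_k \asymp \sum_{k=1}^{\infty} k^{2\lambda}k^{-2(\lambda+1)} = \sum_{k=1}^{\infty} k^{-2} < \infty.$$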
Write $\mathcal{E}(B) = \bigcup_{K>0}\mathcal{E}_K(B)$ for the ellipsoid where all coefficients can be non-zero, $\mathcal{E}_K = \bigcup_{B<\infty}\mathcal{E}_K(B)$ and $\mathcal{E} = \bigcup_{B<\infty}\mathcal{E}(B)$, so that, for example, $\mathcal{E} = \{b\in\mathbb{R}^{\infty} : \sum_{k=1}^{\infty} b_k^2\lambda_k < \infty\}$ is the ellipsoid that is restricted to have finite but decreasing principal axes. The following condition will be imposed on the ellipsoid.

Condition 1.
The sequence $(Y_t)_{t\in\mathbb{Z}}$ follows the process (1) with $\varphi\in\mathcal{E}$ and $\lambda_k\asymp k^{2\lambda}$, where $\lambda > 1/2$. Moreover, $1 - \sum_{k=1}^{\infty}\varphi_k z^k = 0$ only for $z$ outside the unit circle. The innovations $(\varepsilon_t)_{t\in\mathbb{Z}}$ are independent identically distributed with finite fourth moment.

Throughout, when writing $\mathcal{E}_K(B)$ and similar quantities, it is understood that the $\lambda_k$'s are as in Condition 1. The following is stated for convenience.

Lemma 1. If $b\in\mathcal{E}(B)$ then $|b_k| \lesssim k^{-(2\lambda+1)/2}/\ln^{\epsilon}(1+k)$ for some $\epsilon > 0$, where $\lesssim$ is inequality up to a fixed absolute multiplicative constant.

In consequence, Condition 1 implies absolutely summable autoregressive coefficients. Note that absolute summability would just require $\lambda \ge 1/2$ in Condition 1 rather than $\lambda > 1/2$, hence the condition we use is a bit more restrictive. The following states additional properties of the model.

Lemma 2. Under Condition 1, $(Y_t)_{t\in\mathbb{Z}}$ is stationary and ergodic with absolutely summable autocovariance function and $EY_t^4 < \infty$.

It is well known that for the AR process, $1 - \sum_{k=1}^{\infty}\varphi_k z^k = 0$ only for $z$ outside the unit circle if the autocovariance function is absolutely summable and the spectral density is strictly positive and continuous (Kreiss et al., 2011, Corollary 2.1). Note that there are processes (even Gaussian) that satisfy Condition 1, but fail to be beta mixing (Doukhan, 1995, Theorem 3, p. 59). The beta mixing assumption is often conveniently used when proving convergence using methods from empirical process theory. Alas, it cannot be used here.

The goal is to find an estimator for $\varphi$. We consider two approaches: constrained least squares and penalized least squares. By duality, the two can be made equivalent by a suitable choice of the penalty parameter. However, in the constrained case, the penalty turns out to be sample dependent, while in penalized estimation this is not necessarily the case.

To avoid notational trivialities, suppose that the sample size is $N = n + K$. This will be assumed without further notice throughout the paper. In particular, our sample is $Y_{-(K-1)}, Y_{-(K-2)}, \dots, Y_0, Y_1, \dots, Y_n$. This also stresses the fact that $n$ and $K$ can go to infinity at different rates.

In the constrained problem, we estimate $b\in\mathcal{E}_K(B)$. The constrained estimator is defined as
$$b_n = \arg\inf_{b\in\mathcal{E}_K(B)} \frac{1}{n}\sum_{t=1}^{n}\Big(Y_t - \sum_{k=1}^{\infty} b_k Y_{t-k}\Big)^2. \quad (3)$$
Of course, in the above, $\sum_{k=1}^{\infty} b_kY_{t-k} = \sum_{k=1}^{K} b_kY_{t-k}$ if $b\in\mathcal{E}_K(B)$.

In the penalized problem, we estimate $b\in\mathcal{E}_K$, but introduce the penalty parameter $\tau > 0$. The penalized estimator is defined as
$$b_{n,\tau} := \arg\inf_{b\in\mathcal{E}_K} \frac{1}{n}\sum_{t=1}^{n}\Big(Y_t - \sum_{k=1}^{\infty} b_kY_{t-k}\Big)^2 + \tau\sum_{k=1}^{\infty}\lambda_k b_k^2, \quad (4)$$
where the $\lambda_k$'s are from the definition of $\mathcal{E}$. By use of the Lagrangian, we can always rewrite (3) as (4) for a suitable choice of $\tau$, i.e. there is a $\tau = \tau_{B,n}$ ($\tau = 0$ if the constraint is not binding) such that $b_{n,\tau} = b_n$.

Both problems can be reformulated in matrix form using the Lagrangian. Let $X$ be the $n\times K$ dimensional matrix with $(t,k)$th entry equal to $Y_{t-k}$ and $Y$ be the $n$-dimensional vector with $t$th entry $Y_t$. Also, let $\Lambda$ be the $K\times K$ diagonal matrix with $k$th diagonal entry equal to $\lambda_k$. The estimator for either (3) or (4) is found by minimizing the penalized least squares criterion with respect to (w.r.t.) $\tilde b\in\mathbb{R}^K$,
$$\frac{1}{n}\big(Y - X\tilde b\big)^T\big(Y - X\tilde b\big) + \tau\,\tilde b^T\Lambda\tilde b, \quad (5)$$
where for (3) $\tau$ is chosen so that the constraint $\tilde b^T\Lambda\tilde b \le B^2$ is satisfied.
In this latter case, $\tau$ is necessarily random because the constraint needs to be satisfied in sample. Here the tilde in $\tilde b$ is used to remind us that, in the matrix formulation, $b$ is truncated to a $K$ dimensional vector, as all entries beyond the $K$th are zero by definition of $\mathcal{E}_K$. The solution is the usual ridge regression estimator
$$\tilde b_{n,\tau} := \big(X^TX + \tau n\Lambda\big)^{-1}X^TY.$$
For problem (4), $\tau = \tau_n$ can go to zero in a controlled way. For problem (3), $\tau = \tau_{B,n} \ge 0$ must be chosen so that the constraint is satisfied. Such $\tau_{B,n}$ is nonzero if the constraint is binding, and zero otherwise. This is equivalent to replacing $\tau\tilde b^T\Lambda\tilde b$ with $\tau\big(\tilde b^T\Lambda\tilde b - B^2\big)$ in (5), and minimizing the so modified objective function (5) w.r.t. $\tilde b$ and $\tau \ge 0$. The minimizer w.r.t. $\tau$ is $\tau_{B,n}$.

All vectors are in $\mathbb{R}^{\infty}$, though only the first $K$ elements might be non-zero. The exception is when we use a tilde, as in (5). For $b_n$ in (3), the Euclidean norm of $b_n - \varphi$ becomes $|b_n - \varphi| = \big(\sum_{k=1}^{K}|b_{n,k} - \varphi_k|^2 + \sum_{k>K}|\varphi_k|^2\big)^{1/2}$.
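The following is a minimal numerical sketch of (3)-(5), not the author's code: `ridge_ar` solves the penalized problem (5), and `constrained_ar` recovers the constrained estimator (3) by bisecting on the dual penalty $\tau$ until the ellipsoid constraint is active. The function names, the bisection tolerance and the use of numpy are our own choices.

```python
import numpy as np

def lag_matrix(series, K):
    """Design matrix X with (t, k) entry Y_{t-k}, using the first K
    observations as the presample; returns X and the response vector."""
    y = np.asarray(series, dtype=float)
    n = len(y) - K
    X = np.column_stack([y[K - k: K - k + n] for k in range(1, K + 1)])
    return X, y[K:]

def ridge_ar(X, y, lam, tau):
    """Penalized estimator (5): minimize (1/n)|y - Xb|^2 + tau * b' Lambda b.
    The first order conditions give b = (X'X + n tau Lambda)^{-1} X'y."""
    n = X.shape[0]
    return np.linalg.solve(X.T @ X + n * tau * np.diag(lam), X.T @ y)

def constrained_ar(X, y, lam, B, tol=1e-10):
    """Constrained estimator (3) via duality: find the smallest tau >= 0 such
    that the ellipsoid constraint b' Lambda b <= B^2 holds (tau = 0 if slack)."""
    norm2 = lambda b: float(b @ (lam * b))       # squared RKHS norm |b|_E^2
    b = ridge_ar(X, y, lam, 0.0)
    if norm2(b) <= B ** 2:
        return b, 0.0                            # constraint not binding
    lo, hi = 0.0, 1.0
    while norm2(ridge_ar(X, y, lam, hi)) > B ** 2:
        hi *= 2.0                                # grow tau until feasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm2(ridge_ar(X, y, lam, mid)) > B ** 2 else (lo, mid)
    return ridge_ar(X, y, lam, hi), hi

# lam encodes the diagonal of Lambda, lambda_k = k^{2 lambda} as in Condition 1:
# K = 50; lam = np.arange(1.0, K + 1) ** (2 * 0.75)
```

The bisection exploits the fact, also used in the proof of Lemma 8 below, that $|b_{n,\tau}|_E$ is monotone in $\tau$.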
It is worth noting that the ellipsoid $\mathcal{E}\subset\ell_2$ is a RKHS generated by the kernel $C(k,l) = \sum_{v=1}^{\infty}\lambda_v^{-1}\delta_{v,k}\delta_{v,l}$, where $\delta_{v,l}$ is Kronecker's delta, i.e. $\delta_{v,l} = 1$ if $v = l$ and zero otherwise. The inner product $\langle\cdot,\cdot\rangle_E$ is defined to satisfy the reproducing kernel property $\langle C(\cdot,l), C(\cdot,k)\rangle_E = C(k,l)$. Hence for $a,b\in\mathcal{E}$, $b_k = \langle b, C(\cdot,k)\rangle_E$ and $\langle a,b\rangle_E = \sum_{v=1}^{\infty}\lambda_v a_v b_v$; indeed, $\langle b, C(\cdot,k)\rangle_E = \sum_{v=1}^{\infty}\lambda_v b_v \lambda_v^{-1}\delta_{v,k} = b_k$. The norm induced by the inner product is $|\cdot|_E$, such that for any vector $b\in\mathbb{R}^{\infty}$, $|b|_E^2 = \sum_{k=1}^{\infty}\lambda_k b_k^2$. This norm strictly dominates the Euclidean norm. The fact that $\mathcal{E}(1)$ is compact under the Euclidean norm is a consequence of the fact that $\mathcal{E}$ is a RKHS (Li and Linde, 1999), and sharp asymptotics can be derived by related means (Graf and Luschgy, 2004).

Once we realize such compactness, it becomes clear that it might be possible to estimate infinite AR processes under no restriction on the number of estimated coefficients. We show that this conjecture is true. We also establish convergence rates. Moreover, we want to clearly address the relation between constrained and penalized estimation.

The best approximation $\varphi_K\in\mathcal{E}_K$ to $\varphi$ minimizes the population mean square error:
$$\varphi_K = \arg\inf_{b\in\mathcal{E}_K} E\Big(Y_0 - \sum_{k=1}^{\infty} b_k Y_{-k}\Big)^2. \quad (6)$$
Despite the abuse of notation, do not confuse $\varphi_K$ with the $K$th entry in $\varphi$.
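Since (6) is a quadratic problem, its solution can be written via the population normal equations; this is the representation used later in the proof of Lemma 9, where $\Gamma$ is the $K\times K$ matrix with $(k,l)$ entry $\gamma(k-l)$ (the autocovariance function) and $\Gamma_0$ the vector with $k$th entry $\gamma(k)$:
$$E\Big(Y_0 - \sum_{k=1}^{K}\tilde\varphi_{K,k}Y_{-k}\Big)Y_{-l} = 0,\ \ l=1,\dots,K \quad\Longleftrightarrow\quad \Gamma\tilde\varphi_K = \Gamma_0 \quad\Longleftrightarrow\quad \tilde\varphi_K = \Gamma^{-1}\Gamma_0,$$
where $\tilde\varphi_K$ collects the first $K$ entries of $\varphi_K$ (a sketch valid when the ellipsoid constraint is slack).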
Theorem 1. Suppose that Condition 1 holds and $n, K\to\infty$.

1. (Consistency of Constrained Estimator) If $\varphi\in int(\mathcal{E}(B))$, there is a random $\tau = \tau_{B,n}$ such that $\tau = O_p(n^{-1/2})$ and $b_{n,\tau} = b_n$; moreover, if $\varphi\in\mathcal{E}(B)$,
$$|b_n - \varphi|^2 = O_p\Big(n^{-\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}} + K^{-2\lambda}\Big)$$
for any $\epsilon\in(0,\lambda - 1/2)$.

2. (Consistency of Penalized Estimator) Consider a possibly random $\tau = \tau_n$ such that $\tau\to 0$ and $\tau n^{1/2}\to\infty$ in probability. There is a finite $B$ such that $\varphi\in int(\mathcal{E}(B))$, $|b_{n,\tau}|_E < B$ eventually in probability, and $|b_{n,\tau} - \varphi|_E\to 0$ in probability.

3. (Approximation Error in $\mathcal{E}$) There is an $\epsilon > 0$ such that $|\varphi - \varphi_K|_E = O\big((\ln K)^{-(1+\epsilon)}\big)$. Suppose the $k$th entry $\varphi_k$ in $\varphi$ satisfies $|\varphi_k| \lesssim k^{-\nu}$ with $\nu > (2\lambda+1)/2$ for all $k$ large enough. Then $|\varphi - \varphi_K|_E = O\big(K^{(2\lambda+1-2\nu)/2}\big)$.

4. (Estimation Error in $\mathcal{E}$) If $(\tau + n^{-1/2}) = O_p(K^{-2\lambda})$, then $|b_{n,\tau} - \varphi_K|_E = O_p(n^{-1/2}K^{2\lambda})$.
5. (Difference Between Norms) There are $K\to\infty$ and $\tau = O_p(n^{-1/2})$ such that $|b_{n,\tau} - \varphi|\to 0$ in probability, but $|b_{n,\tau} - \varphi|_E$ does not converge to zero in probability.

Point 1 in the theorem establishes the link between constrained and penalized estimation by finding the rate of decay of the ridge penalty such that (3) and (4) coincide. It also establishes the convergence rate of (3) towards the true $\varphi$ in terms of $\lambda$ (recall $\lambda_k\asymp k^{2\lambda}$ in Condition 1). This rate does not constrain the number of lags used once we constrain $\varphi\in\mathcal{E}(B)$. For the finite dimensional case we trivially recover root-$n$ convergence by letting $\lambda\to\infty$.

Point 2 says that if we use penalized estimation and the penalty does not go to zero too fast (i.e. strictly slower than in Point 1), we can expect (4) to be contained in a ball in $\mathcal{E}$ that contains the true parameter with probability going to one. Moreover, (4) is consistent under the norm $|\cdot|_E$.

Point 3 is concerned with the approximation error of (6) in the RKHS norm. This error might go to zero at a logarithmic rate only. However, if the true coefficients decay fast, then we obtain a polynomial convergence rate.

Point 4 restricts the way we let $K\to\infty$ in order to derive convergence rates for the estimation error under the norm $|\cdot|_E$.

Point 5 establishes an additional insight into convergence under the Euclidean norm versus the RKHS norm in terms of the penalty. A "slowly convergent" penalty is necessary for convergence under $|\cdot|_E$. Hence, this also shows that the constrained estimator (whose penalty is $\tau = \tau_{B,n} = O_p(n^{-1/2})$ when $\varphi\in\mathcal{E}(B)$) cannot be consistent in the norm $|\cdot|_E$ in general. This happens when choosing a rather large $K$ that leads to a binding constraint for (3).

As a corollary to Points 3 and 4 in Theorem 1, we have the following.
Corollary 1. Suppose Condition 1 holds, $K\to\infty$ and $\tau = O_p(K^{-2\lambda})$.

1. Choose $K\asymp n^{\kappa}$ for some $\kappa\in(0, 1/(4\lambda))$. Then, there is an $\epsilon > 0$ such that $|b_{n,\tau} - \varphi|_E = O_p\big((\ln K)^{-(1+\epsilon)}\big)$.

2. Suppose the $k$th entry $\varphi_k$ in $\varphi$ satisfies $|\varphi_k| \lesssim k^{-\nu}$ with $\nu > (2\lambda+1)/2$ for all $k$ large enough. Choose $K\asymp n^{1/(2(2\nu-1))}$. Then, $|b_{n,\tau} - \varphi|_E = O_p\big(n^{-\frac{2\nu-(2\lambda+1)}{4(2\nu-1)}}\big)$.

Corollary 1 imposes additional restrictions in order to improve on the statement of Point 2 in Theorem 1 by giving rates of convergence. These rates are not tight, as they require $K = o(n)$, unlike Point 2 in Theorem 1. However, they are useful in applications (e.g. Section 2.1.1).

Sieve estimators are often consistent under the sole condition that the number of components (here $K$) is of smaller order of magnitude than the sample size $n$. In Point 1 of Theorem 1, we have shown that this is not required. Recall that $N = n + K$ is the sample size. We can have $K = O(N)$ as long as $n\to\infty$. Of course, we require knowledge concerning the magnitude of the coefficients. Such knowledge is usually assumed in the literature in order to bound the approximation error.

In practice, the fact that we allow $K = O(N)$ might sound irrelevant. However, the asymptotic results can be seen as suggesting that, once we set the constraint, the procedure used here can be more robust to lag choice. We show this in the simulations in Section 3.

2.1.1 Application to Optimal Forecasting and Universal Consistency

Define $X_t(a) = \sum_{k=1}^{\infty} a_k Y_{t-k}$ for any $a\in\mathbb{R}^{\infty}$. The expectation of $Y_t$ conditioning on the infinite past $(Y_{t-s})_{s>0}$ is $X_t(\varphi)$. As an application of Theorem 1 consider the following problem. Show that
$$\sup_{t\in\mathcal{T}}|X_t(\varphi) - X_t(b_{n,\tau})| \to 0$$
in probability, where $\mathcal{T} = (0,\infty)$ or $(0,n)$ ($b_{n,\tau}$ in (4)). Hence, we want $X_t(b_{n,\tau})$ to be close to the conditional expectation of $Y_t$ uniformly in $t\in\mathcal{T}$, which is even more general than considering a moving target. The norm $|\cdot|_E$ is useful because the previous display can be bounded as
$$\sup_{t\in\mathcal{T}}|X_t(\varphi - b_{n,\tau})| \lesssim |\varphi - b_{n,\tau}|_E \sup_{t\in\mathcal{T}}\Big(\sum_{k=1}^{\infty}\Big(\frac{Y_{t-k}}{k^{\lambda}}\Big)^2\Big)^{1/2}. \quad (7)$$
To obtain the inequality, we have multiplied and divided each term in the sum (on the l.h.s.) by $\lambda_k^{1/2}$ and then used the Cauchy-Schwarz inequality and Condition 1 to set $\lambda_k\asymp k^{2\lambda}$.

We have that $|\varphi - b_{n,\tau}|_E = O_p(\epsilon_n)$, where $\epsilon_n\to 0$ at a rate which depends on Theorem 1. Then, if
$$\sup_{t\in\mathcal{T}}\Big(\sum_{k=1}^{\infty}\Big(\frac{Y_{t-k}}{k^{\lambda}}\Big)^2\Big)^{1/2} = o_p\big(\epsilon_n^{-1}\big), \quad (8)$$
we have shown that (7) goes to zero in probability. This is a weak form of universal consistency because the convergence is in probability rather than almost surely. On the positive side, the convergence holds for a variety of processes and circumstances.

If $\mathcal{T} = (0,\infty)$ then the l.h.s. of (8) is almost surely finite if the random variables are bounded, and (7) goes to zero in probability using Point 2 in Theorem 1. If $\mathcal{T} = (0,n)$, we can use the bound
$$E\sup_{t\in(0,n)}\Big(\sum_{k=1}^{\infty}\frac{Y_{t-k}^2}{k^{2\lambda}}\Big)^{1/2} \le n^{1/(2p)}\sup_{t\in(0,n)}\Big(E\sum_{k=1}^{\infty}\frac{Y_{t-k}^{2p}}{k^{2\lambda p}}\Big)^{1/(2p)}$$
when the variables are $2p$-integrable. If $p$ is such that $n^{1/(2p)} = o(\epsilon_n^{-1})$, then the r.h.s. of (7) goes to zero in probability. If $Y_t$ has a moment generating function, the r.h.s. of the above display is $O(\ln n)$. Either way, to find $\epsilon_n$ we can use Corollary 1. Note that the argument is unchanged if $\mathcal{T} = (0, c_n)$ for any $c_n\asymp n$.

Theorem 1 can also be applied to the less ambitious problem: show that
$$\sup_{t\in\mathcal{T}}|X_t(\varphi_K) - X_t(b_{n,\tau})| \to 0$$
in probability as $K\to\infty$. A small computational illustration of the bound (7) is given below.
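As an illustrative companion to (7) (our own sketch, not part of the paper), the forecast $X_t(b)$ and the Cauchy-Schwarz bound can be computed directly; `lam` is the vector $(\lambda_1,\dots,\lambda_K)$ and `y_past` holds $Y_{t-1}, Y_{t-2}, \dots$:

```python
import numpy as np

def forecast(b, y_past):
    """X_t(b) = sum_k b_k Y_{t-k}; y_past[k-1] holds Y_{t-k}."""
    K = len(b)
    return float(b @ y_past[:K])

def bound_7(err, y_past, lam):
    """Right-hand side of (7) for the coefficient error err = phi - b:
    |X_t(err)| <= |err|_E * sqrt(sum_k Y_{t-k}^2 / lambda_k)."""
    K = len(err)
    err_norm = np.sqrt(err @ (lam[:K] * err))             # |err|_E
    scale = np.sqrt(y_past[:K] @ (y_past[:K] / lam[:K]))  # sum_k Y^2 / lambda_k
    return err_norm * scale
```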
In this case we want to forecast as well as the increasingly good best approximation of the conditional expectation of $Y_t$, uniformly in $t\in\mathcal{T}$. Point 4 in Theorem 1 is suited for this problem.

2.2 Choice of B in Practice

The parameter $B$ can be chosen to minimize some cross-validated prediction error estimate (beware of cross-validation in a time series context; see Györfi et al., 1990, Burman and Nolan, 1992, and Burman et al., 1994, for discussions and applicability). Alternatively, one can choose $B$ to minimize some penalized loss function such as
$$\ln\hat\sigma_B^2 + \frac{2\,df(B)}{n}, \quad (9)$$
where $df(B) = \mathrm{Trace}\big((X^TX + \tau_{B,n}n\Lambda)^{-1}X^TX\big)$ and $\tau_{B,n}$ is the value of $\tau$ for which the constraint $\tilde b_{n}^T\Lambda\tilde b_{n} \le B^2$ is satisfied, using the notation in (5). Here, $\hat\sigma_B^2$ is the sample variance of the residuals from the estimation. If the constraint is binding, $\tau_{B,n}$ solves
$$Y^TX\big(X^TX + \tau_{B,n}n\Lambda\big)^{-1}\Lambda\big(X^TX + \tau_{B,n}n\Lambda\big)^{-1}X^TY = B^2. \quad (10)$$
This $\tau_{B,n}$ is then used to compute $df(B)$, which is the effective number of degrees of freedom implied by $B$ (Hastie et al., 2009).
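A sketch of the selection rule (9)-(10), again our own illustration (it reuses `constrained_ar` and the `lam` vector from the earlier snippet; `B_grid` is a user-chosen grid):

```python
import numpy as np

def df_and_criterion(X, y, lam, B):
    """tau_{B,n} from the constrained fit, df(B) = Trace((X'X + tau n L)^{-1} X'X),
    and the penalized criterion (9): log(sigma_B^2) + 2 df(B) / n."""
    n = X.shape[0]
    b, tau = constrained_ar(X, y, lam, B)
    df = np.trace(np.linalg.solve(X.T @ X + n * tau * np.diag(lam), X.T @ X))
    sigma2 = float(np.mean((y - X @ b) ** 2))  # residual variance
    return df, np.log(sigma2) + 2.0 * df / n

# Grid search over B then amounts to:
# B_hat = min(B_grid, key=lambda B: df_and_criterion(X, y, lam, B)[1])
```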
3 Simulations

Asymptotic results are of interest on their own, but it is also of interest to understand the scope of applicability in practice. As a benchmark, we use predictions based on an AR model where the lag length is chosen by Akaike's Information Criterion (AIC).

3.1 Simulated True Models

One thousand data samples are simulated from (1). The sample size is $N = 1000$. A warm-up sample of 1000 observations is used to reduce any dependence on the starting value. We also simulate a testing sample to approximate the mean square error (MSE). We consider different specifications for $\varphi$ in (1), including long memory, in order to see how the procedure works when the true model is not in $\mathcal{E}$. In this case, an approximation error is incurred.

Short Memory.
In (1), the errors are i.i.d. standard normal and the $\varphi_k$'s are chosen to be $\varphi_k = \bar\varphi\,k^{-r}\big/\sum_{k=1}^{K}k^{-r}$ for a fixed decay exponent $r > 0$, where $\bar\varphi = 0.\ldots,\ 0.\ldots$ (two values). A higher value for $\bar\varphi$ leads to more persistent behaviour. By construction, for both values of $\bar\varphi$, the model appears to generate cycles because the roots of $1 - \sum_{k=1}^{K}\varphi_k z^k = 0$ are outside the unit circle, but complex. We consider different values for $K\in\{100, 1000\}$. Given the finite number of lags, the coefficients are automatically in $\mathcal{E}$.
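A sketch of the short memory design follows; the paper's decay exponent and $\bar\varphi$ values are not recoverable from this copy, so `decay=1.5` and `phi_bar=0.9` below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_phi(K, phi_bar, decay=1.5):
    """phi_k proportional to k^{-decay}, normalized so that sum_k phi_k = phi_bar."""
    raw = np.arange(1.0, K + 1) ** (-decay)
    return phi_bar * raw / raw.sum()

def simulate_ar(phi, n, burn=1000):
    """Simulate (1) with i.i.d. standard normal innovations and a warm-up
    sample of `burn` observations to reduce dependence on starting values."""
    K = len(phi)
    y = np.zeros(K + burn + n)
    for t in range(K, len(y)):
        y[t] = phi @ y[t - K:t][::-1] + rng.standard_normal()
    return y[K + burn:]

# Example: y = simulate_ar(make_phi(100, phi_bar=0.9), n=1000)
```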
Long Memory Model. The model is an ARFIMA:
$$Y_t = \sum_{k=1}^{K}\varphi_k Y_{t-k} + (1 - L)^{-d}\Big(\sum_{l=0}^{L_0}\theta_l\varepsilon_{t-l}\Big), \quad (11)$$
where $L$ is the lag operator and the $\varphi_k$'s are as in the previous paragraph. The MA coefficients are $\theta_l = (1 - 0.\ldots\,l)$ with $L_0 = 5$ (we write $L_0$ for the MA order to avoid a clash with the lag operator). The coefficient of fractional integration is $d = 0.\ldots \in (0, 1/2)$. Hence, the model is stationary, but exhibits long memory.

The parameters' estimates are obtained from (5) with $\lambda_k = k^{\ldots}$. The benchmark is an AR model with lag length chosen to minimize AIC. Denote the number of lags chosen using AIC by $K_{AIC}$. We compare this to a model estimated using more lags, but with coefficients constrained in $\mathcal{E}_K(B)$: in particular, $K = 2K_{AIC}$ and $4K_{AIC}$, with $B$ chosen as outlined in Section 2.2. The goal is to verify whether the procedure is robust to lag choice. AIC is known to choose large models. We use even larger models, and verify whether we are able to obtain sensible results.

The results in Table 1 show the improvement in MSE of the constrained procedure over AIC. Table 1 shows that the procedure is robust against lag choice. This becomes evident in the long memory case. The larger model ($4K_{AIC}$) leads to relatively better performance when the true model exhibits persistency, as in (11).

Table 1: Simulation Results. For Short Memory the process is as in (1) with number of true AR coefficients equal to $K$ and AR coefficients satisfying $\varphi_k = \bar\varphi\,k^{-r}/\sum_{k=1}^{K}k^{-r}$, where $\bar\varphi = 0.\ldots,\ 0.\ldots$. For Long Memory, the process is as in (11). Entries denote the MSE improvement relative to the MSE of a model with lag length $K_{AIC}$ chosen using AIC. The MSE in the numerator in the calculation of the relative improvement is computed using lag lengths $2K_{AIC}$ and $4K_{AIC}$ and constraining the coefficients in $\mathcal{E}(B)$, where $B$ is chosen as described in Section 2.2.
                              K = 100                  K = 1000
                        2K_AIC      4K_AIC       2K_AIC      4K_AIC
  Short Memory, ϕ̄ = 0.…    …           …            …           …
  Short Memory, ϕ̄ = 0.…    …           …            …           …
  Long Memory               …           …            …           …

4 Extensions

It is simple to impose linear restrictions on the coefficients of either the constrained or the penalized estimator. A natural example is positivity. This is the case if we wish to estimate ARCH models of large orders: under ARCH restrictions, the squared returns follow an AR process. The estimator no longer has a closed form expression, but it is just the solution of a quadratic programming problem (see the sketch at the end of this section). Another extension pertains to vector autoregressive processes,
$$Y_t = \sum_{k=1}^{\infty}\Phi_k Y_{t-k} + \varepsilon_t, \quad (12)$$
where now the variables and innovations are $L$-dimensional vectors and we use the capital $\Phi_k$ to stress the multivariate framework, $\Phi_k$ being an $L\times L$ matrix. Again, we can restrict $\mathcal{E}$ in a suitable way. For example, we can impose that $\Phi_k$ is lower triangular. This restriction has a variety of implications, ranging from Granger causality to exogeneity, and it is of much interest in econometrics (e.g., Sims, 1980). For fixed $L$, all the results in this paper apply to this problem as well, with obvious changes if we modify the constraint to $\sum_{k=1}^{\infty}|\Phi_k|^2\lambda_k \le B^2$, where $|\Phi_k|$ is any matrix norm, e.g., the Frobenius norm $|\Phi_k| = \sqrt{\mathrm{Trace}(\Phi_k^T\Phi_k)}$, where $\Phi_k^T$ is the transpose of $\Phi_k$.

An extension, which does not follow directly from the results derived here, is to consider the case where $L\to\infty$. This is the problem where we have a large cross-section ($L$ is the dimension of the vector $Y_t$ in (12)). In this case, the constraint cannot use an arbitrary matrix norm (norms are not equivalent in infinite dimensional spaces). Results in Lutz and Bühlmann (2006), together with the ones derived here, can provide initial guidance on how to tackle this problem in the future.
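As a sketch of the positivity-restricted case mentioned above (our own illustration, not the author's implementation): ridge plus a positivity constraint reduces to non-negative least squares on an augmented system, which is one concrete way to solve the quadratic program; `scipy.optimize.nnls` is used as the solver.

```python
import numpy as np
from scipy.optimize import nnls

def positive_ridge_ar(X, y, lam, tau):
    """min (1/n)|y - Xb|^2 + tau * b' Lambda b  subject to  b >= 0
    (e.g. an ARCH model of large order, fit as an AR on squared returns).
    Stacking sqrt(n tau) Lambda^{1/2} under X turns the penalty into extra
    rows of a least squares problem, so NNLS solves the whole program."""
    n, K = X.shape
    X_aug = np.vstack([X, np.sqrt(n * tau) * np.diag(np.sqrt(lam))])
    y_aug = np.concatenate([y, np.zeros(K)])
    b, _ = nnls(X_aug, y_aug)
    return b
```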
5 Proofs

At first we include the short proof of Lemma 2.

Proof. [Lemma 2] A stationary infinite AR process with absolutely summable AR coefficients has an infinite MA representation with absolutely summable coefficients and it is invertible (Lemma 2.1 in Bühlmann, 1995). Hence, there are coefficients $\psi_s$ such that $Y_t = \sum_{s=0}^{\infty}\psi_s\varepsilon_{t-s}$ and
$$\sum_{k=1}^{\infty}|EY_tY_{t-k}| \le \sigma^2\sum_{k=1}^{\infty}\sum_{s=0}^{\infty}|\psi_{s+k}||\psi_s| < \infty,$$
which means that the autocovariance function is absolutely summable. The moment bound follows from the infinite MA representation and the bound on the fourth moment of the innovations.

5.1 Proof of Theorem 1

We divide the proof into two parts. One only concerns results under the Euclidean norm. The other is concerned with convergence results under the RKHS norm.
5.1.1 Results under the Euclidean Norm

A few lemmas are needed for the proof. Throughout, we shall use the notation $X_t(a) = \sum_{k=1}^{\infty} a_kY_{t-k}$ for any $a\in\mathbb{R}^{\infty}$.
Lemma 3. For $\rho := (2\lambda+1)/2$ (with $\lambda > 1/2$ as in Condition 1) and real constants $w_k$,
$$\sup_{b\in\mathcal{E}_K(B)}\Big|\sum_{k=1}^{K} b_kw_k\Big| \lesssim \sum_{k=1}^{K} k^{-\rho}|w_k|,$$
and similarly, for real constants $w_{k,l}$,
$$\sup_{b\in\mathcal{E}_K(B)}\Big|\sum_{k,l=1}^{K} b_kb_lw_{k,l}\Big| \lesssim \sum_{k,l=1}^{K} k^{-\rho}l^{-\rho}|w_{k,l}|.$$

Proof.
Note that $\big|\sum_{k=1}^{K} b_kw_k\big| \le \sum_{k=1}^{K}|b_k|k^{\rho}k^{-\rho}|w_k|$. Given that $b\in\mathcal{E}_K(B)$, then $|b_k| \lesssim k^{-\rho}$ uniformly in $b\in\mathcal{E}_K(B)$, by Lemma 1. This implies that the previous quantity is bounded by a constant multiple of $\sum_{k=1}^{K}k^{-\rho}|w_k|$. The same argument proves the second statement in the lemma.

The $w_{k,l}$'s in the lemma above will be partial sums of cross products of the $Y_t$'s, which we bound using the following. For arbitrary $\tau > 0$, the first order conditions that define (4) imply that
$$b_{n,\tau,k} = \frac{1}{\tau\lambda_k}\frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))Y_{t-k}, \quad (13)$$
where $b_{n,\tau,k}$ is the $k$th element in $b_{n,\tau}$. Multiplying both sides by $\tau\lambda_ka_k$ and summing over $k$,
$$\Big|\frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))X_t(a)\Big| = \tau\Big|\sum_{k=1}^{K}\lambda_kb_{n,\tau,k}a_k\Big| \le \tau\sqrt{\sum_{k=1}^{K}\lambda_kb_{n,\tau,k}^2}\sqrt{\sum_{k=1}^{K}\lambda_ka_k^2}, \quad (14)$$
recalling the definition of $X_t(a)$ and using the Cauchy-Schwarz inequality. If $a\in\mathcal{E}_K(1)$, $\sqrt{\sum_{k=1}^{K}\lambda_ka_k^2} \le 1$ and the above display clearly holds uniformly in $a$. We shall show in the proof of Lemma 8 that there is a $\tau = \tau_n = O_p(n^{-1/2})$ such that $\sqrt{\sum_{k=1}^{K}\lambda_kb_{n,\tau,k}^2} < B$; combined with (14), this will imply the first display in the statement of that lemma.

Lemma 4. Under Condition 1,
$$\sup_{n,k,l>0} E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|^2 < \infty.$$

Proof.
From the proof of Lemma 2, there are absolutely summable coefficients $\psi_u$ such that $Y_t = \sum_{u=0}^{\infty}\psi_u\varepsilon_{t-u}$. For ease of notation suppose that the i.i.d. innovations have variance one and the MA coefficients are non-negative. By stationarity,
$$E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|^2 \le 2\sum_{s=0}^{n}\big|E[(1-E)Y_{t-k}Y_{t-l}][(1-E)Y_{t-s-k}Y_{t-s-l}]\big|,$$
where the r.h.s. holds for any $t$. If we showed that
$$\big|E[(1-E)Y_{t-k}Y_{t-l}][(1-E)Y_{t-s-k}Y_{t-s-l}]\big| \lesssim \psi_s,$$
the result would follow by summability of the coefficients. To show the above, with no loss of generality, by symmetry, consider only the case $l\ge k$. This implies that
$$E[(1-E)Y_{t-k}Y_{t-l}][(1-E)Y_{t-s-k}Y_{t-s-l}] = \mathrm{Cov}(Y_{t-k}Y_{t-l},\ Y_{t-s-k}Y_{t-s-l})$$
$$= E\Big(\sum_{u_1=0}^{\infty}\sum_{u_2=0}^{\infty}\psi_{u_1}\psi_{u_2}[(1-E)\varepsilon_{t-k-u_1}\varepsilon_{t-l-u_2}]\Big)\Big(\sum_{u_3=0}^{\infty}\sum_{u_4=0}^{\infty}\psi_{u_3}\psi_{u_4}[(1-E)\varepsilon_{t-s-k-u_3}\varepsilon_{t-s-l-u_4}]\Big).$$
The above is equal to
$$\sum_{u_1=0}^{\infty}\sum_{u_2=0}^{\infty}\sum_{u_3=0}^{\infty}\sum_{u_4=0}^{\infty}\psi_{u_1}\psi_{u_2}\psi_{u_3}\psi_{u_4}\,\mathrm{Cov}\big(\varepsilon_{t-k-u_1}\varepsilon_{t-l-u_2},\ \varepsilon_{t-s-k-u_3}\varepsilon_{t-s-l-u_4}\big).$$
By the i.i.d. condition on the innovations, the covariance is zero if the indexes are not constrained in the following sets: $\{k+u_1 = l+u_2,\ k+u_3 = l+u_4\}$, $\{u_1 = u_3+s,\ u_2 = u_4+s\}$, $\{k+u_1 = l+u_4+s,\ l+u_2 = k+u_3+s\}$. Hence, we can consider summation with indexes in these sets only. Splitting the sum according to the above index sets, we have, respectively,
$$I = \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+l-k}\psi_u\psi_{v+l-k}\psi_v\,\mathrm{Cov}\big(\varepsilon_0^2,\varepsilon_{u-(s+v)}^2\big),$$
$$II = \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+s}\psi_{v+s}\psi_u\psi_v\,E\varepsilon_0^2\varepsilon_{(u-v)+(k-l)}^2,$$
$$III = \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+s+(l-k)}\psi_{v+s+(k-l)}\psi_u\psi_v\,E\varepsilon_0^2\varepsilon_{(u-v-s)+(k-l)}^2.$$
By elementary change of indexes,
$$I \le \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_{u+l-k}\psi_u\psi_{v+l-k}\psi_v 1_{\{u-v=s\}} \le \sum_{v=0}^{\infty}\psi_{v+s+l-k}\psi_{v+s}\psi_{v+(l-k)}\psi_v \le \sum_{v=0}^{\infty}\psi_{v+s}\psi_v \lesssim \psi_s.$$
Similarly, deduce that
$$II \lesssim \Big(\sum_{u=0}^{\infty}\psi_u\psi_{u+s}\Big)^2 \le \psi_s^2\Big(\sum_{u=0}^{\infty}\psi_u\Big)^2 \lesssim \psi_s.$$
Finally,
$$III \lesssim \sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_u\psi_v\psi_{u+s+(l-k)}\psi_{v+s+(k-l)} \le \psi_s\Big(\sum_{u=0}^{\infty}\sum_{v=0}^{\infty}\psi_v\psi_u\Big) \lesssim \psi_s.$$
The bounds do not depend on $k,l$ beyond the fact that $l\ge k$. Repeating the argument for $k > l$, the result follows.

Lemma 4 will be used to bound quantities such as the following:
$$E\Big|\sum_{k,l=1}^{\infty}k^{-(2\lambda+1)/2}l^{-(2\lambda+1)/2}\frac{1}{n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big| \le \sum_{k,l=1}^{\infty}k^{-(2\lambda+1)/2}l^{-(2\lambda+1)/2}E\Big|\frac{1}{n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|$$
$$\lesssim \frac{1}{\sqrt n}\max_{k,l>0}E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|,$$
where the sum of the coefficients converges because $(2\lambda+1)/2 > 1$. Then, by Lemma 4 the expectation is finite because $E|\cdot| \le (E|\cdot|^2)^{1/2}$, and it is independent of $k,l$ by stationarity. In consequence the display is $O_p(n^{-1/2})$ because convergence in $L_1$ implies convergence in probability.

To establish convergence rates we need two stochastic equicontinuity results.
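As an aside, the boundedness claimed in Lemma 4 is easy to probe numerically; the following is an illustrative check we add, for an AR(1) with unit-variance innovations, where $\gamma(h) = \varphi_1^{|h|}/(1-\varphi_1^2)$ (it reuses `simulate_ar` from the simulation sketch in Section 3):

```python
import numpy as np

def lemma4_check(phi1=0.6, n=5000, k=1, l=3, reps=200):
    """Second moment of n^{-1/2} sum_t (Y_{t-k} Y_{t-l} - gamma(|k-l|)) for an
    AR(1); Lemma 4 says this stays bounded as n grows."""
    gamma = phi1 ** abs(k - l) / (1.0 - phi1 ** 2)
    m = max(k, l)
    stats = []
    for _ in range(reps):
        y = simulate_ar(np.array([phi1]), n + m)
        prods = y[m - k: len(y) - k] * y[m - l: len(y) - l]
        stats.append(np.sqrt(len(prods)) * (prods.mean() - gamma))
    return float(np.mean(np.square(stats)))  # roughly constant in n
```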
Lemma 5. Under Condition 1, for any $\epsilon > 0$,
$$E\sup_{a,b\in\mathcal{E}(2B),\,|b|\le\delta}\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)X_t(b)X_t(a)\Big| \lesssim \delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}}. \quad (15)$$

Proof.
By the triangle inequality, the l.h.s. of (15) is bounded by
$$E\sup_{a,b\in\mathcal{E}(2B),\,|b|\le\delta}\sum_{l=1}^{\infty}|a_l|\sum_{k=1}^{\infty}|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|.$$
By Lemma 3, there is a $\rho > 1$ such that the above is bounded by a constant multiple of
$$\sum_{l=1}^{\infty}l^{-\rho}\,E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\sum_{k=1}^{\infty}|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big| \lesssim \sup_{l>0}E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\sum_{k=1}^{\infty}|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|,$$
by summability of $l^{-\rho}$. For any positive $V$, the above display can be written as
$$\sup_{l>0}E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\Big(\sum_{k\le V} + \sum_{k>V}\Big)|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|.$$
We shall bound the two sums separately. By the Cauchy-Schwarz inequality, the first sum is bounded by
$$\sqrt{\sup_{l>0}\sup_{|b|\le\delta}\sum_{k\le V}b_k^2\sum_{k\le V}E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|^2} \lesssim \delta\sqrt V, \quad (16)$$
where the inequality uses Lemma 4 and $|b|\le\delta$. Next, by the Cauchy-Schwarz inequality, the second sum is bounded by
$$\sqrt{\Big(\sup_{b\in\mathcal{E}(2B)}\sum_{k>V}b_k^2k^{1+\epsilon}\Big)\Big(\sup_{l>0}\sum_{k>V}k^{-(1+\epsilon)}E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-k}Y_{t-l}\Big|^2\Big)} \lesssim \sqrt{V^{(1+\epsilon)-2\lambda}\sup_{b\in\mathcal{E}(2B)}\sum_{k>V}b_k^2\lambda_k}$$
for any $\epsilon\in(0, 2\lambda-1)$, using again Lemma 4, and the fact that $k^{-(1+\epsilon)}$ is summable and $k^{(1+\epsilon)}\lambda_k^{-1}$ is decreasing. The r.h.s. is then bounded by a constant multiple of $V^{(1+\epsilon-2\lambda)/2}$. Equating $\delta\sqrt V$ with $V^{(1+\epsilon-2\lambda)/2}$ we choose $V = \delta^{-2/(2\lambda-\epsilon)}$, implying that
$$\delta\sqrt V + V^{(1+\epsilon-2\lambda)/2} \lesssim \delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}},$$
and the lemma is proved.
Lemma 6. Under Condition 1, for any $\epsilon > 0$,
$$E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}\varepsilon_tX_t(b)\Big| \lesssim \delta^{\frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}}.$$

Proof.
By linearity and the triangle inequality,
$$E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}\varepsilon_tX_t(b)\Big| \le E\sup_{b\in\mathcal{E}(2B),\,|b|\le\delta}\sum_{k=1}^{\infty}|b_k|\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}\varepsilon_tY_{t-k}\Big|.$$
Note that
$$\sup_{k>0}E\Big|\frac{1}{\sqrt n}\sum_{t=1}^{n}\varepsilon_tY_{t-k}\Big|^2 \le \sigma^2\gamma(0).$$
Hence, we can proceed exactly as in the proof of Lemma 5 to deduce the result.

The first part of Point 1 in the theorem will be proved in Lemma 8 (Section 5.1.2). Hence, here we shall only derive the convergence rate. Define the empirical loss function
$$L_n(b) := \frac{1}{n}\sum_{t=1}^{n}\Big(Y_t - \sum_{k=1}^{\infty}b_kY_{t-k}\Big)^2, \qquad b\in\mathcal{E}.$$
When $b\in\mathcal{E}_K$ the sum inside the parenthesis only runs from $1$ to $K$. The population loss is $L(b) := EX_t^2(\varphi - b)$. Define $\beta = \beta_K\in\mathbb{R}^{\infty}$ such that its first $K$ entries are as in $\varphi$ and the remaining are all zero. The consistency proof is standard (van der Vaart and Wellner, 2000, Theorem 3.2.5) once we show the following:
$$L(b) - L(\beta) \gtrsim |b - \beta|^2, \quad (17)$$
$$E\sup_{b\in\mathcal{E}_K(B):\,|b-\beta|\le\delta}\big|[L_n(b) - L(b)] - [L_n(\beta) - L(\beta)]\big| \lesssim \frac{\delta^{\alpha}}{\sqrt n}, \quad (18)$$
for some $\alpha\in(0,1]$. Then, for any sequence $r_n\to\infty$ satisfying $r_n^{2-\alpha}\lesssim\sqrt n$, $L_n(b_n) \le L_n(\beta) + O_p(r_n^{-2})$ and $|\varphi - \beta| \lesssim r_n^{-1}$, we have that $|b_n - \varphi|^2 = O_p(r_n^{-2})$.

At first we verify (17). Note that
$$L(b) - L(\beta) = \sum_{k,l=1}^{\infty}(b_k - \beta_k)(b_l - \beta_l)\gamma(k-l),$$
where $\gamma(k)$ is the autocovariance function (ACF) of the $Y_t$'s. The estimator is uniquely identified if the matrix, say $\Gamma$, with $(k,l)$ entry equal to $\gamma(k-l)$, is strictly positive definite with smallest eigenvalue $\theta_{\min} > 0$ (see the remarks after Lemma 2.2 in Kreiss et al., 2011). This is the case if the spectral density of $(Y_t)_{t\in\mathbb{Z}}$, say $g(\omega)$, is bounded away from zero. The spectral density of the AR model (1) is given by $g(\omega) = (2\pi)^{-1}\sigma^2/\phi(\omega)$, where $\phi(\omega) = \big|1 - \sum_{k=1}^{\infty}\varphi_ke^{-ik\omega}\big|^2$. Noting that by Condition 1, $\phi(\omega) \le \big(1 + \sum_{k=1}^{\infty}|\varphi_k|\big)^2 < \infty$, deduce that the eigenvalues of $\Gamma$ are bounded away from zero. Hence,
$$L(b) - L(\beta) \ge \theta_{\min}\sum_{k=1}^{\infty}(b_k - \beta_k)^2 = \theta_{\min}|b - \beta|^2, \quad (19)$$
and (17) holds.

Using the notation $Y_t = X_t(\varphi) + \varepsilon_t$, the empirical loss is equal to
$$L_n(b) = \frac{1}{n}\sum_{t=1}^{n}\big[\varepsilon_t^2 + X_t^2(\varphi - b) + 2\varepsilon_tX_t(\varphi - b)\big].$$
Hence,
$$(L_n(b) - L(b)) - (L_n(\beta) - L(\beta)) = \frac{1}{n}\sum_{t=1}^{n}\big[2\varepsilon_tX_t(\beta - b) + (1-E)\big(X_t^2(b - \varphi) - X_t^2(\beta - \varphi)\big)\big].$$
To verify (18), we need to bound the above uniformly in $b\in\mathcal{E}(B)$ such that $|b - \beta| \le \delta$. To this end, apply Lemma 6 to the first term on the r.h.s. to find that the uniform bound is a constant multiple of $n^{-1/2}\delta^{(2\lambda-\epsilon-1)/(2\lambda-\epsilon)}$ for any $\epsilon > 0$. By basic algebraic manipulations, the second term on the r.h.s. of the display satisfies
$$\frac{1}{n}\sum_{t=1}^{n}(1-E)\big(X_t^2(b - \varphi) - X_t^2(\beta - \varphi)\big) = \frac{1}{n}\sum_{t=1}^{n}(1-E)X_t(b - \beta)X_t(b - \varphi) + \frac{1}{n}\sum_{t=1}^{n}(1-E)X_t(b - \beta)X_t(\beta - \varphi).$$
Note that both $\varphi - b$ and $\beta - \varphi$ are in $\mathcal{E}(2B)$. We apply Lemma 5 to deduce that each term on the r.h.s. of the above display is uniformly bounded in $L_1$ by a constant multiple of $n^{-1/2}\delta^{(2\lambda-\epsilon-1)/(2\lambda-\epsilon)}$ for any $\epsilon > 0$ when $|b - \beta| \le \delta$. Hence (18) is verified with $\alpha = \frac{2\lambda-\epsilon-1}{2\lambda-\epsilon}$.
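Spelling out the rate calculus behind the last step (a sketch of the arithmetic implied by van der Vaart and Wellner, 2000, Theorem 3.2.5, with $\alpha$ as just derived):
$$r_n^{2-\alpha}\lesssim\sqrt n,\qquad \alpha = \frac{2\lambda-\epsilon-1}{2\lambda-\epsilon} \;\Longrightarrow\; 2-\alpha = \frac{2\lambda-\epsilon+1}{2\lambda-\epsilon},\qquad r_n \asymp n^{\frac{2\lambda-\epsilon}{2(2\lambda-\epsilon+1)}},$$
so that $r_n^{-2} = n^{-\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}}$, which is the first term in the rate of Point 1.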
When we are only interested in a finite dimensional model, we can take $\lambda\to\infty$ to deduce that $\alpha = 1$, which is the parametric case. To find $r_n$ note that
$$L_n(b_n) - L_n(\beta) \le L_n(b_n) - \inf_{b\in\mathcal{E}_K(B)}L_n(b) = 0.$$
Also, $|\varphi - \beta| = \big(\sum_{k>K}|\varphi_k|^2\big)^{1/2} \lesssim K^{-\lambda}/\ln^{\epsilon}(K)$ for some $\epsilon > 0$, using Lemma 1, bounding the sum with an integral and using the fact that $\ln^{\epsilon}(\cdot)$ is slowly varying at infinity. Hence we deduce that
$$r_n^{-2} \asymp \big(K^{-\lambda}/\ln^{\epsilon}(K)\big)^2 + n^{-\frac{2\lambda-\epsilon}{2\lambda-\epsilon+1}}$$
as stated in Point 1 of the theorem.

5.1.2 Results under the RKHS Norm

The proof depends on a few preliminary lemmas. Let $\varphi_{\tau} = \varphi_{K,\tau}\in\mathcal{E}_K$ be the penalized population estimator
$$\varphi_{\tau} = \arg\inf_{b\in\mathcal{E}_K} EX_t^2(b - \varphi) + \tau|b|_E^2. \quad (20)$$
The following can be deduced from Theorem 5.9 in Steinwart and Christmann (2008, eq. 5.14). The proof is given, as the context might seem different at first sight.
Lemma 7. Suppose Condition 1 holds. For arbitrary but fixed $\tau > 0$, consider $b_{n,\tau}$ and $\varphi_{\tau}$ in (4) and (20) with $K$ possibly diverging to infinity. Then,
$$\sqrt{\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2} \le \sqrt{\sum_{k=1}^{K}\frac{4}{\tau^2\lambda_k}\Big(\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi_{\tau}))Y_{t-k}\Big)^2},$$
where $b_{n,\tau,k}$ is the $k$th entry in the $K$ dimensional vector $b_{n,\tau}$, and similarly for $\varphi_{\tau,k}$.

Proof.
By convexity of the squared error loss,
$$-\frac{2}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))(X_t(b_{n,\tau}) - X_t(\varphi_{\tau})) \le \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))^2 - \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))^2.$$
Note the following algebraic equality,
$$2\tau\sum_{k=1}^{\infty}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})\varphi_{\tau,k} + \tau\sum_{k=1}^{\infty}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2 = \tau\sum_{k=1}^{\infty}\lambda_kb_{n,\tau,k}^2 - \tau\sum_{k=1}^{\infty}\lambda_k\varphi_{\tau,k}^2.$$
The above two displays imply
$$-\frac{2}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))(X_t(b_{n,\tau}) - X_t(\varphi_{\tau})) + 2\tau\sum_{k=1}^{\infty}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})\varphi_{\tau,k} + \tau\sum_{k=1}^{\infty}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2$$
$$\le \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))^2 + \tau\sum_{k=1}^{\infty}\lambda_kb_{n,\tau,k}^2 - \frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))^2 - \tau\sum_{k=1}^{\infty}\lambda_k\varphi_{\tau,k}^2 \le 0,$$
where the last inequality follows because $b_{n,\tau}$ minimizes the empirical penalized risk. The first order conditions for $\varphi_{\tau}$ read
$$\varphi_{\tau,k} = \frac{1}{\tau\lambda_k}E(Y_t - X_t(\varphi_{\tau}))Y_{t-k} \quad (21)$$
for $k \ge 1$. Substituting this in the previous display,
$$-\frac{2}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))\sum_{k=1}^{K}(b_{n,\tau,k} - \varphi_{\tau,k})Y_{t-k} + 2E(Y_t - X_t(\varphi_{\tau}))\sum_{k=1}^{K}(b_{n,\tau,k} - \varphi_{\tau,k})Y_{t-k} + \tau\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2 \le 0.$$
Rearranging and using the definition of $X_t(b_{n,\tau} - \varphi_{\tau})$, deduce that
$$\tau\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2 \le \frac{2}{n}\sum_{t=1}^{n}(E - 1)(Y_t - X_t(\varphi_{\tau}))\sum_{k=1}^{K}(b_{n,\tau,k} - \varphi_{\tau,k})Y_{t-k}$$
$$\le 2\sqrt{\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big(\frac{1}{n}\sum_{t=1}^{n}(E - 1)(Y_t - X_t(\varphi_{\tau}))Y_{t-k}\Big)^2}\sqrt{\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2},$$
using the Cauchy-Schwarz inequality in the last step. This implies the result of the lemma after simple rearrangement.

The next lemma establishes the relation between the constrained and penalized estimator and states a bound for the distance between the sample and population penalized estimator under the RKHS norm.
Lemma 8. Suppose that $\varphi\in int(\mathcal{E}(B))$. Under Condition 1, if $a\in\mathcal{E}_K(1)$ and $b_{n,\tau}$ is as in (4), there is $\tau = \tau_n = O_p(n^{-1/2})$ such that $|b_{n,\tau}|_E < B$ and
$$\frac{1}{\sqrt n}\sum_{t=1}^{n}(Y_t - X_t(b_{n,\tau}))X_t(a) = O_p\Big(B\sqrt{\sum_{k=1}^{K}\lambda_ka_k^2}\Big),$$
where the above bound holds uniformly in $a\in\mathcal{E}_K(1)$. In consequence, there is a $\tau = O_p(n^{-1/2})$ such that $b_{n,\tau} = b_n$. Moreover, for any $\tau > 0$,
$$\sqrt{\sum_{k=1}^{K}\frac{4}{\tau^2\lambda_k}\Big(\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi_{\tau}))Y_{t-k}\Big)^2} = O_p\big(\tau^{-1}n^{-1/2}\big).$$

Proof. Suppose that $\tau > 0$, as otherwise, by the first order conditions, the r.h.s. in the first display in the statement of the lemma is exactly zero and there is nothing to prove. By the triangle inequality,
$$\sqrt{\sum_{k=1}^{K}\lambda_kb_{n,\tau,k}^2} \le \sqrt{\sum_{k=1}^{K}\lambda_k\varphi_{\tau,k}^2} + \sqrt{\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2}. \quad (22)$$
For $\tau \ge 0$, $\sqrt{\sum_{k}\lambda_k\varphi_{\tau,k}^2} \le \sqrt{\sum_{k}\lambda_k\varphi_k^2}$, as the penalized population estimator must have norm no larger than that of $\varphi$. By this remark and the fact that $\varphi\in int(\mathcal{E}(B))$, there is an $\epsilon > 0$ such that the first term on the r.h.s. is at most $B - 3\epsilon$. Lemma 7 gives
$$\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2 \le \sum_{k=1}^{K}\frac{4}{\tau^2\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(Y_t - X_t(\varphi_{\tau}))Y_{t-k} - E(Y_t - X_t(\varphi_{\tau}))Y_{t-k}\Big]^2. \quad (23)$$
Adding and subtracting $(1-E)X_t(\varphi)Y_{t-k}$, and then using the basic inequality $(x+y)^2 \le 2x^2 + 2y^2$ for any real $x,y$, the r.h.s. is
$$\le \sum_{k=1}^{K}\frac{8}{\tau^2\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi))Y_{t-k}\Big]^2 + \sum_{k=1}^{K}\frac{8}{\tau^2\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(X_t(\varphi) - X_t(\varphi_{\tau}))Y_{t-k}\Big]^2.$$
Recalling that our goal is to bound the second term on the r.h.s. of (22), the above two displays imply that
$$\sqrt{\sum_{k=1}^{K}\lambda_k(b_{n,\tau,k} - \varphi_{\tau,k})^2} \le \frac{2\sqrt 2}{\tau}\sqrt{\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi))Y_{t-k}\Big]^2} + \frac{2\sqrt 2}{\tau}\sqrt{\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(X_t(\varphi) - X_t(\varphi_{\tau}))Y_{t-k}\Big]^2} =: I + II. \quad (24)$$
To bound $I$ on the r.h.s. note that for $k > 0$,
$$E\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi))Y_{t-k}\Big]^2 = E\Big[\frac{1}{n}\sum_{t=1}^{n}\varepsilon_tY_{t-k}\Big]^2 = \frac{\sigma^2\gamma(0)}{n}$$
(recall $\gamma(k)$ is the ACF) so that
$$\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(Y_t - X_t(\varphi))Y_{t-k}\Big]^2 = O_p\Big(\frac{\sigma^2\gamma(0)}{n}\Big)$$
because the coefficients $\lambda_k^{-1}$ are summable. Hence, it is possible to find a $\tau = O_p(n^{-1/2})$ such that $I \le \epsilon$. To bound $II$, recall that $\varphi_{\tau},\varphi\in\mathcal{E}(B)$ for any $\tau \ge 0$, and write
$$W_{k,l} := \frac{1}{\sqrt n}\sum_{t=1}^{n}(1-E)Y_{t-l}Y_{t-k}.$$
For $\rho = (2\lambda+1)/2 > 1$,
$$III := E\sum_{k=1}^{K}\frac{1}{\lambda_k}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)(X_t(\varphi) - X_t(\varphi_{\tau}))Y_{t-k}\Big]^2 \le \sum_{k=1}^{K}\frac{1}{\lambda_k}E\sup_{b\in\mathcal{E}(2B)}\Big[\frac{1}{n}\sum_{t=1}^{n}(1-E)\sum_{l=1}^{\infty}b_lY_{t-l}Y_{t-k}\Big]^2$$
$$\lesssim \frac{1}{n}\sum_{k=1}^{K}\frac{1}{\lambda_k}\sum_{l,j=1}^{\infty}l^{-\rho}j^{-\rho}EW_{k,l}W_{k,j} \lesssim \frac{1}{n}\sup_{k,l,j>0}EW_{k,l}W_{k,j} \le \frac{1}{n}\sup_{k,l>0}EW_{k,l}^2, \quad (25)$$
using Lemma 3 in the second inequality and summability of the coefficients in the last step. By Lemma 4, $EW_{k,l}^2 \le c$ for some finite absolute constant $c$. Hence, $III = O(n^{-1})$ and, by the Markov inequality, $II = O_p(\tau^{-1}n^{-1/2})$. Hence, there is a $\tau = O_p(n^{-1/2})$ such that $II \le \epsilon$. The control of $I + II$ implies that (24) is not greater than $2\epsilon$ for suitable $\tau$. Hence, we have shown that there is a $\tau = O_p(n^{-1/2})$ such that (22) is not greater than $B - \epsilon$. This bound for (22) together with (14) proves the first display in the lemma. To see that this also implies that there is a $\tau = O_p(n^{-1/2})$ such that $b_{n,\tau} = b_n$, note that $|b_{n,\tau}|_E$ is non-decreasing as $\tau\to 0$.
Hence, $b_{n,\tau} = b_n$ for the smallest $\tau \ge 0$ such that $|b_{n,\tau}|_E \le B$. The last statement in the lemma follows from (23) and the just derived bound for (24).

We now estimate the approximation error.
Lemma 9
For any $K\to\infty$, we have that $|\varphi_K - \varphi_{\tau}|_E\to 0$ as $\tau\to 0$, where $\varphi_K$ is as in (6). Moreover, if $\tau = O_p(K^{-2\lambda})$, then $|\varphi_K - \varphi_{\tau}|_E = O_p(\tau K^{2\lambda})$.

Proof.
The first part of the lemma is just Theorem 5.17 in Steinwart and Christmann (2008). Hence, we only need to prove the second statement. Let $\Gamma$ be the $K\times K$ matrix with $(k,l)$ entry $\gamma(k-l)$ and let $\Gamma_0$ be the $K$-dimensional vector with $k$th entry $\gamma(k)$. Let $\tilde\varphi_K, \tilde\varphi_{\tau}\in\mathbb{R}^K$ be the first $K$ entries in $\varphi_K, \varphi_{\tau}\in\mathcal{E}_K$. Recall that in both $\varphi_K$ and $\varphi_{\tau}$ all entries $k > K$ are zero. Then, $\tilde\varphi_K = \Gamma^{-1}\Gamma_0$, and writing $D := \tau^{1/2}\Lambda^{1/2}$ for $\Lambda$ as in (5),
$$\tilde\varphi_{\tau} = (DD + \Gamma)^{-1}\Gamma_0.$$
By the Woodbury identity (Petersen and Pedersen, 2012, eq. 159),
$$(DD + \Gamma)^{-1} = \Gamma^{-1} - \Gamma^{-1}D\big(I + D\Gamma^{-1}D\big)^{-1}D\Gamma^{-1},$$
we have that
$$\tilde\varphi_K - \tilde\varphi_{\tau} = \big[\Gamma^{-1}D\big(I + D\Gamma^{-1}D\big)^{-1}D\Gamma^{-1}\big]\Gamma_0.$$
Hence,
$$|\varphi_K - \varphi_{\tau}|_E = \big|\Lambda^{1/2}\Gamma^{-1}D\big(I + D\Gamma^{-1}D\big)^{-1}D\Gamma^{-1}\Gamma_0\big| = \big|D\Gamma^{-1}D\big(I + D\Gamma^{-1}D\big)^{-1}\Lambda^{1/2}\tilde\varphi_K\big|,$$
using the definitions of $\tilde\varphi_K$ and $D$. For any square matrix $W$ and compatible vector $a$, $|Wa| \le \sigma_1(W)|a|$, where $\sigma_1(W)$ is the maximum eigenvalue of $W$. Define $W = D\Gamma^{-1}D(I + D\Gamma^{-1}D)^{-1}$. Given that $\varphi_K\in\mathcal{E}_K(B)$, then $|\Lambda^{1/2}\tilde\varphi_K| \le B$. Hence, we only need to find the maximum eigenvalue of $W$ to bound the above display. The following inequalities hold for the eigenvalues of the product of two positive definite matrices $A$ and $C$:
$$\sigma_K(A)\sigma_K(C) \le \sigma_K(AC) \le \sigma_1(AC) \le \sigma_1(A)\sigma_1(C),$$
where $\sigma_1(\cdot)$ and $\sigma_K(\cdot)$ are the maximum and minimum eigenvalue of the matrix argument (Bhatia, 1997, problem III.6.14, p. 78). In order to derive (19), we argued that $\Gamma$ has minimum eigenvalue $\theta_{\min}$ bounded away from zero. Hence, $D\Gamma^{-1}D$ has eigenvalues in $\big[\theta_{\max}^{-1}\tau\lambda_1,\ \theta_{\min}^{-1}\tau\lambda_K\big]$, where $\theta_{\max}$ is the largest eigenvalue of $\Gamma$ (finite, since the ACF is absolutely summable). The matrix $(I + D\Gamma^{-1}D)$ has eigenvalues equal to $1$ plus the eigenvalues of $D\Gamma^{-1}D$. Hence deduce that $|\varphi_K - \varphi_{\tau}|_E \lesssim \theta_{\min}^{-1}\tau\lambda_K\big(1 + \theta_{\max}^{-1}\tau\lambda_1\big)^{-1}$. This is just $O(\tau\lambda_K) = O(\tau K^{2\lambda})$ as required.

We need a final approximation result.
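For intuition on the polynomial bound in the next lemma, note that the truncation $\beta_K$ of $\varphi$ (zero beyond lag $K$) already achieves this order under the stated decay, by a direct tail computation (our added sketch):
$$|\varphi - \beta_K|_E^2 = \sum_{k>K}\lambda_k\varphi_k^2 \lesssim \sum_{k>K}k^{2\lambda}k^{-2\nu} \asymp K^{2\lambda+1-2\nu};$$
the lemma states that the best approximation $\varphi_K$ attains the same order.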
Lemma 10. Recall (6). If $\varphi\in\mathcal{E}$, then $|\varphi - \varphi_K|_E = O\big(1/\ln^{(1+\epsilon)}(K)\big)$ for some $\epsilon > 0$ as $K\to\infty$. If also $|\varphi_k| \lesssim k^{-\nu}$ with $\nu > (2\lambda+1)/2$, then $|\varphi - \varphi_K|_E = O\big(K^{(2\lambda+1-2\nu)/2}\big)$.

Proof.