Online nonparametric regression with Sobolev kernels
Oleksandr Zadorozhnyi, Pierre Gaillard, Sébastien Gerchinovitz, Alessandro Rudi
Oleksandr Zadorozhnyi (oleksandr.zadorozhnyi@uni-potsdam.de)
Institute of Mathematics, University of Potsdam, 14471 Potsdam, Germany

Pierre Gaillard (pierre.gaillard@inria.fr)
Centre de Recherche INRIA de Paris, 2 rue Simone Iff, 75012 Paris, France

Sébastien Gerchinovitz (sebastien.gerchinovitz@irt-saintexupery.com)
IRT Saint-Exupéry, 3 rue Tarfaya, 31400 Toulouse, France

Alessandro Rudi (alessandro.rudi@inria.fr)
Centre de Recherche INRIA de Paris, 2 rue Simone Iff, 75012 Paris, France
Abstract
In this work we investigate a variation of the online kernelized ridge regression algorithm in the setting of d-dimensional adversarial nonparametric regression. We derive regret upper bounds on the classes of Sobolev spaces W_p^β(X), p ≥ 2, β > d/p. The upper bounds are supported by a minimax regret analysis, which reveals that in the cases β > d/2 or p = ∞ these rates are (essentially) optimal. Finally, we compare the performance of the kernelized ridge regression forecaster to known nonparametric forecasters, in terms of both regret rates and computational complexity, as well as to the excess risk rates in the setting of statistical (i.i.d.) nonparametric regression.
1. Introduction
We consider the online least-squares regression framework (Cesa-Bianchi and Lugosi, 2006) as a game between the environment and the learner, where the task is to sequentially predict the environment's output y_t given the current input x_t and the observed history {(x_i, y_i)}_{i=1}^{t−1}. Specifically, let X ⊂ R^d be an input space, Y ⊂ R a label space, and Ŷ ⊂ R a target space. Before the game starts, the environment secretly produces a sequence of input-output pairs (x_1, y_1), (x_2, y_2), ... in X × Y over some (possibly infinite) time horizon. At each round t ≥ 1, the environment first reveals an input x_t ∈ X; the learner forms the prediction ŷ_t ∈ Ŷ of the true label y_t ∈ Y based on the past information (x_1, y_1), ..., (x_{t−1}, y_{t−1}) ∈ X × Y and on the current input x_t. The true label y_t is then revealed, the learner suffers the squared loss (y_t − ŷ_t)², and round t + 1 starts. The problem is to design an algorithm which minimizes the learner's cumulative regret R_n(F) := sup_{f∈F} R_n(f), where

    R_n(f) := Σ_{t=1}^n (y_t − ŷ_t)² − Σ_{t=1}^n (y_t − f(x_t))²,    (1)

over n ≥ 1 rounds, with respect to the best prediction rule from some reference functional class F ⊂ R^X.

In the setting of adversarial online learning the nature of the data can be completely arbitrary, unlike in the standard statistical learning framework where the data stream is assumed to be generated by some underlying stochastic process, usually with an independent noise component. The problem of online learning with arbitrary (adversarial) data goes back to the work of Foster (1991). A lot of theoretical research has been done since then for parametric models (see e.g. Azoury and Warmuth, 2001; Cesa-Bianchi, 1999; Vovk, 1998). However, the amount of data and the complexity of current machine learning problems have led the community to explore the more general problem of online learning with methods based on nonparametric decision rules and with reference classes being bounded functional sets (see e.g. Vovk (2006a); Rakhlin and Sridharan (2014)). Much effort has been devoted to regret analysis with respect to functional classes that include Sobolev spaces (Rakhlin and Sridharan, 2014; Rakhlin et al., 2014; Vovk, 2006a, 2007). Surprisingly, only a few explicit algorithms have been designed to address the regression problem (Vovk, 2006a,b, 2007; Gaillard and Gerchinovitz, 2015). While having optimal (or close to optimal) regret rates, they have the disadvantage of either being computationally intractable or of providing suboptimal regret upper bounds (see Table 1 for the computational complexities of some known algorithms^1). For more details on previous work, we refer the reader to Section 5.

In this work we consider the framework of online adversarial regression where the benchmark class F, against which the algorithm competes, is a ball in a Sobolev space (see e.g., Adams and Fournier, 2003), denoted by W_p^β(X), where β > 0 and p ≥ 2. In other words, F is a space of functions with p-integrable weak derivatives up to order β (see Section 2 for more details). The problem is of interest since, to date, the optimal regret is achieved by a computationally efficient algorithm only in the smooth regime when β > d/2 and p = 2.
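For concreteness, the protocol and the regret (1) can be summarized by the following short sketch. It is our schematic illustration, not part of the paper; the forecaster `predict` and the reference rule `f_ref` are placeholders to be supplied by the reader.

import numpy as np

def play_game(xs, ys, predict, f_ref):
    """Run the online least-squares protocol on a fixed (oblivious) sequence.

    xs, ys  : the sequence (x_t, y_t) chosen by the environment in advance
    predict : any admissible forecaster, mapping the history and x_t to a prediction
    f_ref   : a fixed reference function f from the benchmark class F
    Returns the regret R_n(f) of equation (1).
    """
    learner_loss, reference_loss = 0.0, 0.0
    history = []
    for x_t, y_t in zip(xs, ys):
        y_hat = predict(history, x_t)               # uses only the past and x_t
        learner_loss += (y_t - y_hat) ** 2          # squared loss of the learner
        reference_loss += (y_t - f_ref(x_t)) ** 2   # loss of the reference rule
        history.append((x_t, y_t))                  # y_t is revealed after the forecast
    return learner_loss - reference_loss            # regret R_n(f)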
Overview of the main results and outline of the paper. The aim of this paper is to provide a deeper analysis of the regret achieved by a version of the online kernel ridge regression algorithm, the Kernel Aggregating Algorithm for Regression (KAAR). Our key contributions, the regret upper bounds of KAAR and its comparison to known procedures (both in regret rates and in computational efficiency), are summarized in Table 1. Furthermore, by proving lower bounds, we show that KAAR reaches optimal or close to optimal regret rates on bounded balls of Sobolev spaces W_p^β(X) with p ≥ 2 and β > d/p (which, up to a multiplicative constant depending on the diameter of the set, implies the result for all bounded subsets of continuous functions in W_p^β(X)).

More precisely, the result is threefold. On the one hand, our analysis recovers the classical result for Sobolev spaces, i.e. when β > d/2 and p ≥ 2. In particular, we show in Theorem 4 that, by choosing the regularization parameter appropriately, on the classes of continuous functions which belong to the Sobolev RKHS of smoothness β, KAAR achieves the optimal regret upper bound

    R_n(F)/n ≲^3 n^{−2β/(2β+d)} log n.
                 | KAAR (4)                              | Rakhlin and Sridharan (2014)       | Gaillard and Gerchinovitz (2015)                     | EWA by Vovk (2006a)
                 | Regret^1 / Cost                       | Regret / Cost                      | Regret / Cost / Cost (d = 1, p = ∞)^2                | Regret / Cost
β > d/2          | n^{−2β/(2β+d)+ε} / n³ + n²d           | n^{−2β/(2β+d)} / non-constructive  | n^{−2β/(2β+d)} / exp(n) / poly(n)                    | n^{−β/(β+d)} / exp(n) + nd
d/p < β ≤ d/2    | n^{−(β/d)(p−d/β)/(p−1)+ε} / n³ + n²d  | n^{−β/d} / non-constructive        | n^{−β/d} / exp(n) / n^{⌈β⌉(β+2)/(2β+1)}              | n^{−β/(β+d)} / exp(n) + nd
p = ∞, β ≤ d/2   | n^{−β/d+ε} / n³ + n²d                 | n^{−β/d} / non-constructive        | n^{−β/d} / exp(n) / n^{⌈β⌉(β+2)/(2β+1)}              | n^{−β/(β+d)} / exp(n) + nd

Table 1: Regret rates and time complexity of KAAR (4) (new upper bounds from this paper are highlighted in blue) and of the existing algorithms for online nonparametric regression.

On the other hand, we consider the more challenging scenario where d/2 > β ≥ d/p, which corresponds to less smooth benchmark functional classes that cannot be embedded into an RKHS. We will refer to this case as the hard-learning scenario. In Theorem 6 we prove that in such a scenario, when F = B_{W_p^β(X)}(0, R), the regret of KAAR with well-chosen parameters is upper-bounded as

    R_n(F)/n ≲ n^{−(β/d)(p−d/β)/(p−1)} log n.

In particular, when p = ∞, the regret upper bound is of order O(n^{−β/d+ε} log n). The latter bound is then proven to be optimal (up to a constant ε in the exponent that can be made arbitrarily small) by the lower bound on the minimax regret established in Section 4. Optimal regret upper bounds on classes of bounded Hölder balls were previously derived with polynomial-time algorithms for d = 1 by Gaillard and Gerchinovitz (2015). The case d ≥ 1 and β = 1 was also analyzed for Lipschitz and semi-Lipschitz losses in Cesa-Bianchi et al. (2017). Notice that throughout the paper we do not consider the case of Sobolev spaces with β ≤ d/p. In the latter case the existence of continuous representatives for the equivalence classes in W_p^β(X) is not guaranteed, and the regret of any forecaster will be linear.

In Figure 1, we plot the regions of the (1/p, β/d)-plane corresponding to the different regret cases, where we obtain either the optimal rate or a suboptimal rate which still improves on the classical aggregation algorithms in the nonparametric framework (Vovk, 2006a). Note that the smaller β/d and p are, the harder the problem is. Additional graphs comparing the regret of KAAR with the lower bound are available in Appendix G.

To complete the analysis of online nonparametric regression over Sobolev spaces, we make use of the general results of Rakhlin and Sridharan (2014), derive upper and lower bounds on the fat-shattering dimension (see Rakhlin and Sridharan (2014) and Appendix F for an exact definition), and establish the corresponding lower bounds. We prove that any admissible algorithm (the exact definition of which is given in Section 4) suffers a minimax regret of order at least n^{1−2β/(2β+d)} in the smooth case β > d/2, and n^{1−β/d} when β ≤ d/2. The latter implies that KAAR (with the proper choice of parameters) achieves optimal regret rates when β > d/2 or p = ∞.
1. In terms of its upper bound.
2. Gaillard and Gerchinovitz (2015) only provide an efficient version of their algorithm for Sobolev spaces with p = ∞, d = 1 and β ≥ 1/2. Their efficient algorithm can however be extended to any β ∈ (0, 1/2] with a polynomial time complexity.
3. The notation ≲ denotes an approximate inequality which includes multiplicative constants that depend on F and X.
Figure 1: (Left) Different regions in the (1/p, β/d)-plane for which our new regret bound for KAAR: [light green] is optimal (i.e., β > d/2 or p = ∞); [dark green] improves the bound of EWA by Vovk (2006b); [blue] is worse than the bound of EWA; [red] is linear in n (i.e., β ≤ d/p). (Right) Hardness of the problem in the (1/p, β/d)-plane.

The regret analysis of KAAR on the classes of compact subsets of Sobolev spaces W_p^β(X), p ≥ 2, as well as the lower bounds on the minimax regret over bounded balls in W_p^β(X), are summarized in Table 2.

                 | Upper bound of KAAR                      | Lower bound for minimax regret
β > d/2          | n^{−2β/(2β+d)+ε} log(n)                  | n^{−2β/(2β+d)}
d/p < β ≤ d/2    | n^{−(β/d)(p−d/β)/(p−1)+ε} log(n)         | n^{−β/d}
p = ∞, β ≤ d/2   | n^{−β/d+ε} log(n)                        | n^{−β/d}

Table 2: Normalized regret upper bounds of KAAR and the corresponding lower bounds on the classes of bounded subsets of W_p^β(X), β ∈ R_+, p ≥ 2. Here ε > 0 is an arbitrarily small number.

The outline of the rest of the paper is as follows. In Section 2, we fix the notation and recall the definitions of Sobolev spaces, reproducing kernel Hilbert spaces (RKHS), and their effective dimension. Furthermore, we describe KAAR therein. In Section 3, we provide our regret upper bounds for KAAR, and in Section 4 we present the corresponding lower bounds. Finally, in Section 5, we make more detailed comparisons with existing work, both in the adversarial online regression setting studied in this paper and in the more standard statistical framework with i.i.d. observations. We discuss the optimality of the rates and comment on computational complexity, showing that KAAR is superior to the known nonparametric schemes in terms of runtime and storage. All the proofs, as well as technical details on Sobolev spaces and kernels, are given in the Appendices.
2. Notation and background
We recall below some notation on reproducing kernel Hilbert spaces (RKHS). An in-depth survey of this topic can be found in Smola and Schölkopf (2002) and Steinwart and Christmann (2008). A Hilbert space of functions H := {f : X → R} equipped with an inner product ⟨·,·⟩_H is called an RKHS if for every x ∈ X the evaluation functional δ_x(f) := f(x) is continuous in f. Furthermore, we say that a function k(·,·) : X × X → R over a domain X is a real-valued kernel if every kernel matrix K_n := (k(x_i, x_j))_{i,j=1}^n is positive semi-definite. It is known that the value of a kernel can be represented as an inner product in some Hilbert space H, namely k(x, x′) = ⟨φ(x), φ(x′)⟩_H, where we call H a feature space and φ(·) a feature map of the kernel k. Lastly, we say that an RKHS H is generated by the kernel k(·,·) if for every x ∈ X it holds that k_x := k(x, ·) ∈ H and f(x) = ⟨f, k_x⟩_H for every f ∈ H (i.e. the so-called reproducing property holds). In this case we write H_k to denote the RKHS generated by the kernel k(·,·) and say that k(·,·) is a reproducing kernel of H_k. We also denote by λ_j(K_n) the j-th largest eigenvalue of the matrix K_n. We give below the definition of the effective dimension, which measures the complexity of the underlying RKHS based on a given data sample. It plays a key role in our regret analysis of KAAR.

Definition 1 (Effective dimension)
Let k : X × X → R be a kernel function, D_n = {x_i}_{1≤i≤n} ∈ X^n a sequence of inputs, and τ > 0. The effective dimension associated with the sample D_n, the kernel k, and the scale parameter τ is defined by

    d_eff^n(τ) := Tr((K_n + τI)^{−1} K_n) = Σ_{j=1}^n λ_j(K_n)/(λ_j(K_n) + τ),    (2)

where I : R^n → R^n is the identity matrix.

In statistical learning, it has been shown (Zhang (2005), Rudi et al. (2015), and Blanchard and Muecke (2017)) that the effective dimension characterizes the generalization error of kernel-based algorithms. It is a decreasing function of the scale parameter τ, and d_eff^n(τ) → 0 as τ → ∞. On the other hand, as τ → 0, it converges to the rank of K_n, which can be interpreted as the "physical" dimension of the points (k_{x_i})_{1≤i≤n}. Additional definitions and related notation on kernels (used in the proofs) are given in Appendix A.

Let β ∈ N_*, 1 ≤ p < ∞ and X := [−1, 1]^d, where we use the standard notation N_* := {1, 2, ...}. We denote by L_p(X) the space of equivalence classes of p-integrable functions with respect to the Lebesgue measure λ on the Borel σ-algebra B(X), and by [f]_λ the λ-equivalence class of a function f : X → R. We denote by C^m(X) the space of all m-times differentiable functions f whose multidimensional derivatives D^γ f (|γ| ≤ m) are continuous on X, and we let C(X) be the standard space of continuous functions equipped with the norm ‖f‖_{C(X)} = max_{x∈X} |f(x)| (written simply ‖f‖_∞ when no confusion can arise). For a normed space (G, ‖·‖), we use B_G(x, R) and B̄_G(x, R) to denote respectively the open and the closed ball of radius R centered at the point x. We write |γ| := Σ_{i=1}^d |γ_i| for γ ∈ N_*^d, and D^γ f for the multidimensional weak derivative (see Section 5.2.1, page 242 in Evans (1998)) of a function f : X → R of order γ ∈ N_*^d. We recall that the Sobolev space (see Adams and Fournier (2003)) W_p^β(X) is the space of all equivalence classes of functions [f]_λ ∈ L_p(X) such that

    ‖f‖_{W_p^β(X)} := (Σ_{|γ|≤β} ‖D^γ f‖_{L_p(X)}^p)^{1/p}   if p < ∞,
    ‖f‖_{W_∞^β(X)} := sup_{|γ|≤β} ‖D^γ f‖_{L_∞(X)}           if p = ∞,

is finite. The notion of Sobolev spaces is then extended to any real β > 0 (see Appendix B for the details) by means of the Gagliardo (semi)norms. In the case p = 2 this can be shown to be equivalent to the standard definition of fractional Sobolev spaces via the Fourier transform.
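As an illustration (ours, not part of the paper), the effective dimension (2) of Definition 1 can be evaluated directly from a kernel matrix via its eigenvalues; a minimal Python sketch:

import numpy as np

def effective_dimension(K, tau):
    """Effective dimension d_eff^n(tau) = Tr((K + tau*I)^{-1} K) of Definition 1.

    K   : (n, n) positive semi-definite kernel matrix K_n
    tau : scale parameter tau > 0
    """
    eigvals = np.linalg.eigvalsh(K)                   # eigenvalues lambda_j(K_n) >= 0
    return float(np.sum(eigvals / (eigvals + tau)))   # sum_j lambda_j / (lambda_j + tau)

# Example: d_eff decreases in tau and tends to rank(K) as tau -> 0. The kernel
# exp(-|x - x'|) used here generates, up to constants, a Sobolev space of
# smoothness 1 in dimension d = 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
K = np.exp(-np.abs(X - X.T))
print(effective_dimension(K, 1e-6), effective_dimension(K, 10.0))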
Sobolev Reproducing Kernel Hilbert Spaces. We recall here known results on the embedding characteristics of fractional Sobolev spaces, which are essential in our analysis. Let s ∈ R_+ and consider the Sobolev space W_2^s(X) with X ⊂ R^d. It is a separable Hilbert space (see Chapter 7 in Schaback (2007)) with the inner product ⟨f, g⟩ = Σ_{|γ|≤s} ⟨D^γ f, D^γ g⟩_{L_2(X)}. By the Sobolev embedding theorem (see Theorem 7.34 in Adams and Fournier (2003) for the case s ∈ R_+, s > d/2), we have that W_2^s(X) ↪ C(X). The latter embedding is to be understood in the sense that there exists C > 0 such that each λ-equivalence class has a unique element f ∈ C(X) with ‖f‖_{C(X)} ≤ C‖f‖_{W_2^s(X)}. We refer to the set of continuous representatives of all equivalence classes in W_2^s(X) as the Sobolev RKHS, and denote it again by W_2^s(X) when no confusion can arise. It can be shown (see Paragraph 7.5 and Theorem 7.13 in Schaback (2007)) that the Sobolev RKHS W_2^s(X) is indeed an RKHS. Furthermore (see part (c) of Theorem 7.34 in Adams and Fournier (2003)), when p ≥ 2, W_p^s(X) is embedded into the space of continuous functions C(X) if s > d/p, while for s < d/2 it is not (and cannot be embedded into) an RKHS.

Furthermore (see Chapter 7 in Schaback (2007)), the Sobolev RKHS W_2^s(X) is generated by a translation-invariant kernel, which is the restriction to X of the kernel k_s of W_2^s(R^d) (see also Corollary 10.48 on page 170 in Wendland (2005)). It is a continuous, bounded and measurable kernel (see the general Lemmas 4.28 and 4.25 in Steinwart and Christmann (2008)), defined for all x, x′ ∈ X by

    k(x, x′) := (2^{1−s}/Γ(s)) ‖x − x′‖^{s−d/2} K_{d/2−s}(‖x − x′‖),    (3)

where K_{d/2−s}(·) is a modified Bessel function of the second kind (see Chapter 5.1 in Wendland (2005) for more details on Bessel functions). Alternatively, the kernel function k(·,·) of the Sobolev RKHS W_2^s(X) can be described by its Fourier transform, which equals F(k)(ω) = (1 + ‖ω‖²)^{−s}. We refer the reader to Chapters 10-11 in Wendland (2005) as well as to Novak et al. (2017) for more details on the kernel functions of Sobolev RKHS.

In this work, we analyse the regret achieved by KAAR (Gammerman et al. (2004)) over the (Sobolev) RKHS W_2^s(X). The regret is measured with respect to benchmark classes of bounded Sobolev balls in W_p^β(X), which may have a different regularity, i.e. we consider the case β ≠ s. KAAR (see Algorithm 1) was first introduced in the case of adversarial sequential linear regression by Vovk (2001) and Azoury and Warmuth (2001); it was further analyzed in Cesa-Bianchi and Lugosi (2006), Rakhlin and Sridharan (2014), and Gaillard et al. (2019), and applied to concrete forecasting problems including electricity (Devaine et al., 2013), air quality (Mallet et al., 2009),
and exchange rate (Amat et al., 2018) forecasting. It was extended to the case of general reproducing kernel Hilbert spaces in Gammerman et al. (2004), while Jézéquel et al. (2019) provide a variation of the algorithm with the same regret and reduced computational complexity. In the case of Sobolev spaces, KAAR (Alg. 1) reads as follows. Let τ > 0 and s > d/2; at round t ≥ 1, KAAR predicts ŷ_t := f̂_{τ,t}(x_t), where

    f̂_{τ,t} := argmin_{f ∈ W_2^s(X)} { Σ_{j=1}^{t−1} (y_j − f(x_j))² + τ‖f‖²_{W_2^s(X)} + f(x_t)² }.    (4)

Parameters: d ≥ 1, s > d/2, and τ > 0
Initialization: define k(·,·) as in (3);
while t ≥ 1 do
    observe x_t ∈ X;
    ỹ_t := (y_1, ..., y_{t−1}, 0)^T;
    k̃(x_t) := (k(x_1, x_t), ..., k(x_{t−1}, x_t), k(x_t, x_t))^T;
    K_t := (k(x_i, x_j))_{1≤i,j≤t};
    forecast ŷ_t := ỹ_t^T (K_t + τI_t)^{−1} k̃(x_t);
    observe y_t;
end
Algorithm 1: KAAR (Gammerman et al., 2004) on the Sobolev RKHS

The prediction ŷ_t = f̂_{τ,t}(x_t) can be computed in closed form by Algorithm 1 in O(n³ + n²d) operations (see Section 5.3 for details on the computational complexity). This improves the computational complexity over other known nonparametric online regression algorithms which achieve optimal regret with respect to Sobolev spaces in dimension d.
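For concreteness, here is a minimal (and deliberately naive, O(t³)-per-round) Python sketch of Algorithm 1. It is our illustration under the stated assumptions, implementing the Sobolev kernel (3) via scipy's modified Bessel function kv; it is not the authors' reference implementation, and the kernel's multiplicative constant depends on the Fourier convention.

import numpy as np
from scipy.special import kv, gamma

def sobolev_kernel(x, xp, s, d):
    """Kernel (3) of the Sobolev RKHS W_2^s, up to a convention-dependent constant."""
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(xp))
    if r == 0.0:
        # k(x, x): limit of r^{s - d/2} K_{d/2 - s}(r) as r -> 0, finite for s > d/2,
        # using r^{nu} K_{nu}(r) -> 2^{nu - 1} Gamma(nu) with nu = s - d/2.
        return 2.0 ** (1 - s) / gamma(s) * 2.0 ** (s - d / 2 - 1) * gamma(s - d / 2)
    return 2.0 ** (1 - s) / gamma(s) * r ** (s - d / 2) * kv(d / 2 - s, r)

def kaar_predict(xs_past, ys_past, x_t, s, d, tau):
    """One round of KAAR (Algorithm 1): forecast y_hat_t from the history and x_t."""
    pts = list(xs_past) + [x_t]
    t = len(pts)
    K_t = np.array([[sobolev_kernel(pts[i], pts[j], s, d) for j in range(t)]
                    for i in range(t)])
    y_tilde = np.append(np.asarray(ys_past, dtype=float), 0.0)  # past labels padded with 0
    k_tilde = K_t[:, -1]                                        # (k(x_1,x_t), ..., k(x_t,x_t))
    coef = np.linalg.solve(K_t + tau * np.eye(t), k_tilde)      # (K_t + tau I)^{-1} k_tilde
    return float(y_tilde @ coef)

In practice one would also clip the forecast to [−M, M] (see Remark 2 below) and reuse computations across rounds (see Section 5.3).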
Remark 2
We remark that the right-hand side of (4) depends on the input x_t; hence, while f̂_{τ,t} ∈ W_2^s(X), the prediction function x_t ↦ f̂_{τ,t}(x_t) is a measurable function which in general does not belong to the space W_2^s(X). Thus the prediction map does not necessarily belong to the benchmark class against which the algorithm competes. This corresponds to the so-called case of improper learning (see more details in Rakhlin et al. (2015); Hazan et al. (2018)). Furthermore, a sequential version of kernel ridge regression was considered by Zhdanov and Kalnishkan (2010). It removes the term f(x_t)² on the right-hand side of (4) and clips the prediction, forecasting min(max(−M, f̃_{τ,t}(x_t)), M), where f̃_{τ,t} is the solution to problem (4) without the f(x_t)² term. Similarly, one may clip the KAAR forecast itself, ŷ_t^M := min(max(−M, ŷ_t), M); since for every y_t ∈ [−M, M] we have (y_t − ŷ_t^M)² ≤ (y_t − ŷ_t)², the regret upper bound analysis for KAAR directly applies to its clipped version. We emphasize that throughout the paper β and p refer to the parameters of the benchmark Sobolev space, while s > d/2 refers to the smoothness parameter of the RKHS W_2^s(X) used in KAAR.
3. Main results: upper bounds on the regret of KAAR on the classes of Sobolev balls
In this section, we present regret upper bounds for KAAR on reference classes of bounded balls in W_p^β(X), β > d/p. By the Sobolev embedding theorem (see Adams and Fournier (2003), Theorem 7.34, or Equation 10 on page 60 in Edmunds and Triebel (1996)), the condition β > d/p implies that every equivalence class in W_p^β(X) has a continuous representative. In our analysis, by R_n(B_{W_p^β(X)}(0, R)) we always understand the regret with respect to the corresponding ball of continuous representatives bounded in the norm of the space W_p^β(X) (Adams and Fournier, 2003). We consider the framework of online adversarial regression with label space Y := [−M, M], target space Ŷ ⊂ R, input space X = [−1, 1]^d, and reference class F := B_{W_p^β(X)}(0, R), an open ball in the Sobolev space W_p^β(X) of radius R > 0, with d ∈ N_*, β ∈ R_+ and p ≥ 2, where we use the standard notation N_* = {1, 2, ...}. We remark that the assumption on the input space is made for simplicity and can be weakened to any bounded domain in R^d with Lipschitz boundary (see Chapter 4 in Adams and Fournier (2003) for more details on Lipschitz boundaries).

We start by recalling a general upper bound on the regret of KAAR over bounded balls of a general separable RKHS in terms of the effective dimension. It is a direct extension of the upper bound for KAAR in Vovk (2001); Azoury and Warmuth (2001) from finite-dimensional linear regression to kernel regression, and can be retrieved from Theorem 2 in Gammerman et al. (2004) (see also Propositions 1 and 2 in Jézéquel et al. (2019)) for the case of the Sobolev RKHS, as the underlying kernel function is continuous. The regret of KAAR with respect to any f ∈ W_2^s(X) is upper-bounded as

    R_n(f) ≤ τ‖f‖²_{W_2^s(X)} + M² log(1 + nκ²/τ) d_eff^n(τ),    (5)

where κ > 0 is such that sup_x k(x, x) ≤ κ², and d_eff^n(τ) is the effective dimension of Definition 1. The regret bound (5) will be used as the starting point to prove the different upper bounds in the next subsection. The next theorem provides an upper bound on the effective dimension of the Sobolev RKHS W_2^s(X).

Theorem 3 (Upper bound on the effective dimension of the Sobolev RKHS)
Let ε₀ ∈ (0, 1/2], d ≥ 1, n ≥ 1, and s > d/2. Consider the Sobolev RKHS W_2^s(X) with X := [−1, 1]^d. For any sequence of inputs D := {x_1, ..., x_n} and any τ > 0, the effective dimension d_eff^n(τ) is upper-bounded as

    d_eff^n(τ) ≤ C((n/τ)^{d/(2s)+ε} + 1),

where ε = dε₀/s, and C is a constant which depends on d, s, R, K, M, X, ε₀, but is independent of n. Furthermore, if s ∈ N_*, then ε = 0.

The proof of this statement is presented in Appendix C. It is based on some known properties of low-rank projections in Sobolev spaces, which are recalled in Appendix B.
4. Throughout the paper, we refer to constants C, C₁, etc. which may depend on the properties of the domain X, the functional class F, or other quantities (such as ε), but are always independent of n. We also refer to ε, ε₀ as infinitesimal numbers (possibly zero). Their exact values are omitted and may differ from one statement to another, but we specify the dependency whenever it is necessary for the analysis.

The easy scenario (β > d/2). Notice that when p ≥ 2 and β ≥ d/2 we have W_p^β(X) ⊆ W_2^β(X) and (by the Sobolev embedding theorem) W_p^β(X) ↪ C(X). Since β > d/2, the space of continuous representatives of the equivalence classes in W_2^β(X) is the Sobolev RKHS W_2^β(X). Using KAAR with s = β and plugging the upper bound on the effective dimension of W_2^β(X) into the regret upper bound (5), with a proper choice of the parameter τ := τ_n, we obtain the following result.

Theorem 4
Let X := [−1, 1]^d, β ∈ (d/2, +∞), p ≥ 2, M > 0, and n > 1, n ∈ N. Then for any data sample {x_t, y_t}_{t=1}^n ∈ (X × Y)^n and any ε > 0, the regret of KAAR with s = β and τ_n := n^{d/(2β+d)} on the benchmark class F := B_{W_p^β(X)}(0, R) satisfies the upper bound

    R_n(F)/n ≤ C n^{−2β/(2β+d)+ε} log(n),

where the constant C depends on d, s, R, K, M, X, and ε, but not on n. The proof of Theorem 4 is given in Appendix E.
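To see where the choice τ_n = n^{d/(2β+d)} comes from, one can balance the two terms of (5) (a sketch of the reasoning, not a replacement for the proof in Appendix E): with s = β, Theorem 3 gives d_eff^n(τ) ≲ (n/τ)^{d/(2β)}, so, up to logarithmic factors,

    R_n(f) ≲ τR² + M²(n/τ)^{d/(2β)}.

Equating the two terms gives τ^{1+d/(2β)} = n^{d/(2β)}, i.e. τ_n = n^{d/(2β+d)}, and both terms are then of order n^{d/(2β+d)}; dividing by n yields the normalized rate n^{d/(2β+d)−1} = n^{−2β/(2β+d)} of Theorem 4.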
Remark 5
In the lower-bound section we prove that the upper bound of Theorem 4 matches the minimax optimal rate for β > d/2 on the class of bounded Sobolev balls (modulo a constant ε in the exponent that can be made arbitrarily small, and a logarithmic factor in the number of observations). This rate was achieved by Rakhlin and Sridharan (2014) with a non-constructive procedure. An explicit forecaster has been proposed in Gaillard and Gerchinovitz (2015); it can be computed efficiently when p = ∞ and d = 1, and in general has exponential time and storage complexity. We believe that Theorem 4 provides the first (essentially) optimal regret upper bound for the classes of bounded balls in Sobolev spaces W_p^β(X) with d ≥ 1, β > d/2, and p ≥ 2 that is achieved by a computationally efficient procedure.

The hard-learning scenario (d/p < β ≤ d/2, p ≥ 2). In this part we consider KAAR over the benchmark classes of bounded balls in W_p^β(X) when d/p < β ≤ d/2, and refer to this case as the "hard learning" scenario. When β/d ≤ 1/2 the Sobolev space W_p^β(X) is not (and is not embedded into) a Sobolev RKHS, so when using KAAR we need to control the error due to using an element f̂_{τ,t} ∈ W_2^s(X) when competing against functions from W_p^β(X). In this case the regret analysis can be decomposed into two parts: the approximation of any function f ∈ W_p^β(X) by some element f_ε ∈ W_2^s(X), and the regret of KAAR with respect to bounded balls in W_2^s(X). Intuitively, the smaller the approximation error between f and f_ε, the larger the norm of the approximating function f_ε has to be, which in turn implies a larger regret upper bound for KAAR with respect to f_ε (see bound (5)). Therefore, one has to control a trade-off between the approximation error of f ∈ W_p^β(X) by some f_ε ∈ W_2^s(X) and the regret suffered with respect to f_ε. We have the following result.

Theorem 6 Let X = [−1, 1]^d, p ≥ 2, β ∈ R_+, d/p < β ≤ d/2, M > 0, ε > 0, n ≥ 1, and let {(x_t, y_t)}_{t=1}^n ∈ (X × [−M, M])^n be an arbitrary sequence of observations. Then, choosing s = d/2 + ε₀ and

    τ_n = n^{(d(1−p^{−1}) − β′)/(d(1−p^{−1}))},

where β′ = β − ε₁ is sufficiently close to β, the decision rule of Algorithm 1 (KAAR) satisfies the regret upper bound

    R_n(F)/n ≤ C n^{−(β/d)(p−d/β)/(p−1) + εθ} log(n),

where F = B_{W_p^β(X)}(0, R) and R > 0. The constant C depends on d, s, R, β, M, X, and ε, but not on n, and θ = p/((p−1)d).

The proof of Theorem 6 is given in Appendix E. The theorem and its implications are discussed in Section 5. Here we just provide two remarks that help to interpret the result.
Remark 7
In the proof we provide a regret upper bound for any choice s > d/2; however, the rate for d/p < β ≤ d/2 is minimized by taking s > d/2 as small as possible. Therefore, in this situation we choose s := d/2 + ε₀ with an arbitrarily small ε₀ > 0. Furthermore, in the result of Theorem 6 the constant C depends exponentially on the underlying dimension d. To the best of our knowledge this dependence is unavoidable with the techniques used in this work.

Remark 8
Notice that in the interesting particular case of Theorem 6 where p = ∞ and β ∈ R_+, the space W_∞^β(X) corresponds to functions whose derivatives up to order ⌊β⌋ are bounded in supremum norm and whose ⌊β⌋-th derivatives are Hölder continuous of order α = β − ⌊β⌋ ∈ (0, 1] (see Adams and Fournier (2003)). Theorem 6 then leads to a regret upper bound of order O(n^{−β/d+ε} log n). This upper bound is optimal on the class W_∞^β(X), up to a negligible factor ε that can be made arbitrarily small (see Section 4 for a lower bound on the minimax regret).
4. Lower bounds
In this section, we present lower bounds on the regret of any algorithm, with respect to a worst-case data sequence, on bounded closed balls in Sobolev spaces W_p^β(X) with β > d/p, p ≥ 2. We define the minimax regret for the problem of online nonparametric regression on the functional class F as

    R̃_n(F) := inf_A sup_{(x_s, y_s)_{s≤n} ∈ (X×Y)^n} R_n(F),    (6)

where A = (A_s)_{s≥1} is any admissible forecasting rule, i.e. one that at time t ∈ N outputs a prediction ŷ_t ∈ Ŷ based on the past predictions (ŷ_s)_{s≤t−1} and the data sample ({x_s, y_s}_{s≤t−1} ∪ {x_t}). More formally, we assume (A_s)_{s≥1} is such that for every t ∈ N the map A_t : Ŷ^{t−1} × (X × Y)^{t−1} × X → Ŷ is measurable, and we call such an algorithm admissible. The most important element of this assumption is that the forecaster cannot use future outcomes for making its decision at round t. Notice that in this setting we consider an oblivious adversary, meaning that all outcomes (x_t, y_t)_{t≥1} are fixed in advance. With this notation set, we have the following result.
Theorem 9
Let M > 0, p ≥ 2 and β > d/p. Consider the problem of online adversarial nonparametric regression with y_t ∈ [−M, M] and x_t ∈ X = [−1, 1]^d, over the benchmark class F := B̄_{W_p^β(X)}(0, M). Then the minimax regret (6) is lower-bounded as

    R̃_n(F) ≥ C₁ n^{1−β/d}         if β ≤ d/2,
    R̃_n(F) ≥ C₂ n^{1−2β/(2β+d)}   if β > d/2,

where C₁ and C₂ are constants which depend on M, X, d, β, and p, but are independent of n.

The proof is based on the general minimax lower bounds of Rakhlin and Sridharan (2014) and on an estimate of the fat-shattering dimension of the class B̄_{W_p^β(X)}(0, M); it is given in Appendix F.

                 | Statistical i.i.d. regression                       | Adversarial online nonparametric regression
                 | Best known excess risk upper bound / Lower bound    | Best known upper bound for R_n(F)/n / Lower bound
β > d/2          | n^{−2β/(2β+d)} / n^{−2β/(2β+d)}                     | n^{−2β/(2β+d)} / n^{−2β/(2β+d)}
d/p < β ≤ d/2    | n^{−2β/(2β+d)} / n^{−1}                             | n^{−β/d} / n^{−β/d}
p = ∞, β ≤ d/2   | n^{−2β/(2β+d)} / n^{−2β/(2β+d)}                     | n^{−β/d} / n^{−β/d}

Table 3: Best known regret and excess risk upper and lower bounds on the classes of Sobolev balls. Results achieved by KAAR are highlighted in blue.
Remark 10
In Table 3 we compare the best known lower and upper regret bounds on the classes of Sobolev balls in the setting of adversarial online regression to the corresponding bounds for the excess risk in the statistical i.i.d. scenario. Interestingly, on the classes of Sobolev balls in the spaces W_p^β(X), β > d/2, and on Hölder balls in W_∞^β(X), the rates for the (normalized) regret and for the excess risk are optimal and are achieved by regularized empirical risk minimization procedures (for example, by regularized least-squares estimators in the statistical learning scenario, see Fischer and Steinwart (2017), and by KAAR in adversarial regression, as shown in this work).
5. Discussion
In this part we compare the regret rates of KAAR with those of existing algorithms for (adversarial) online nonparametric regression, in terms of both regret bounds and computational complexity. Furthermore, we compare the regret analysis to the excess risk bounds of known algorithmic schemes in the statistical least-squares regression scenario. We point out interesting consequences for the gap in the rates which arises due to the adversarial nature of the data.
Comparison with statistical (i.i.d.) nonparametric regression. To unify the settings, we always consider the normalized regret R_n(F)/n of the class F. In the statistical setting we assume a sample Z_n = (z_i = (x_i, y_i))_{i=1}^n to be generated independently from the distribution ν_{x,y} of a pair of random variables (X, Y) over a probability space (X × Y, B(X × Y), P), and we let f_{Z_n} : X → R be a (data-dependent) estimator produced by a measurable learning method L_n. Let ν_{xy} := P ∘ (X, Y)^{−1} be the joint distribution of (X, Y); denote by ν(y|x) a regular conditional probability distribution of Y given X, and by µ the X-marginal of ν. In the setting of nonparametric regression, for a given class H ⊂ Y^X, the goal is to find a function f ∈ H which minimizes the expected squared risk E(f) = E_ν[(Y − f(X))²]. The performance measure of an algorithm which outputs the decision rule f_{Z_n} is in this case the excess risk E(f_{Z_n}) − inf_{f∈H} E(f). If H is dense in L_2(X, µ), it is well known that the latter is equivalent to minimizing ‖f_ν − f_{Z_n}‖²_{L_2(X,µ)}, where, for µ-almost all x, f_ν(x) is a version of the conditional expectation of Y under the measure ν(·|x). Notice that, in order to compare the statistical learning setting to the results of our work, we do not necessarily assume that f_ν ∈ H. Furthermore, since f_ν(·) is defined only for µ-almost all x, we write f_ν for both a version of this conditional expectation and the corresponding equivalence class with respect to the measure µ. We denote by W_p^β(X, µ) the Sobolev space on the probability space (X, B(X), µ). To avoid technical difficulties with treating weak derivatives with respect to an arbitrary Borel measure, we restrict the class of X-marginal probability measures to those which have a Radon-Nikodym derivative with respect to the Lebesgue measure on X. The latter means that the underlying Sobolev space is equivalent to W_p^β(X). As before, we consider X = [−1, 1]^d; however, all the subsequent results in the statistical regression scenario can be reformulated for any bounded subset of R^d with Lipschitz boundary. Unless stated otherwise, we compare to excess risk upper bounds in high probability, meaning that by ‖f_{Z_n} − f_ν‖²_{L_2(X)} ≤ ψ(δ, n) we understand an inequality which holds with probability at least 1 − exp(−δ) for some δ > 0 and some function ψ(·,·) : [0, ∞) × R_+ → R_+.

We start with the easy case in which f_ν ∈ H and H is a Sobolev RKHS. Theorem 1 in Caponnetto and De Vito (2006) implies (by taking b = 2β/d and c = 1 therein) that for H = W_2^β(X), β > d/2, f_ν ∈ W_2^β(X, µ) and f_z ∈ H the regularized least-squares estimator, it holds (dropping the constants) that ‖f_z − f_ν‖²_{L_2(X,ν)} ≤ C n^{−2β/(2β+d)}, which is known to be optimal in the setting of nonparametric regression (see Tsybakov (2009) and Györfi (2002) for matching lower bounds). Under the same conditions, optimal excess risk rates on W_2^β(X, µ) can be deduced from Corollary 6 in Lin and Cevher (2018), using decision rules based on spectral kernel algorithms or stochastic gradient descent. Notice that in the latter works no assumption on the probability measure ν is imposed (apart from the Bernstein condition on ν(y|x), standard in the setting of statistical learning (see Blanchard and Muecke (2017), Rudi et al.
(2015)) and the variance bound for the random variable Y). It follows that the regret rates of KAAR on the classes W_2^β(X) match (disregarding the logarithmic terms and an arbitrarily small polynomial factor) the optimal known rates for the excess risk in the i.i.d. scenario on the classes W_2^β(X, µ). The setting where the underlying RKHS is a subspace of a reference class of regular functions has been studied in several works. In the particular case of H := H_γ(X) being a Gaussian RKHS over X (which is known to be included in the space of C^∞(X) functions and is generated by the kernel k_γ(x, x′) = exp(−‖x − x′‖²/γ²)) and f_ν(·) ∈ W_2^β(X) ∩ L_∞(X), β ∈ R_+, Corollary 2 in Eberts and Steinwart (2011) implies that the (Gaussian) kernel ridge regression estimator, with proper choices of both the regularization parameter λ and the bandwidth γ, achieves (almost) optimal excess risk rates of order n^{−2β/(2β+d)+ε} with β > d/2, ε > 0. The same rates hold when β ≤ d/2 under the additional condition Y = [−M, M] (which implies ν-a.s. boundedness of f_ν, which is not ensured unless β > d/2). In the case f_ν ∈ W_∞^β(X) (i.e. its ⌊β⌋-th derivative is (β − ⌊β⌋)-Hölder continuous) and β ≤ d/2, the excess risk rates in the statistical i.i.d. scenario are optimal (see Chapter 3.2 in Györfi (2002) for matching lower bounds and Theorem 14.5 therein). They are better than the normalized regret rates of KAAR in the setting of adversarial regression (which are of order n^{−β/d}), and the latter are (up to a negligible polynomial factor) optimal (see Theorem 9 for a matching lower bound). This uncovers an interesting consequence, namely that the gap between the (optimal) rates for the regret and for the excess risk on classes of bounded balls in W_∞^β(X) is purely due to the adversarial nature of the data.

When f_ν ∈ W_2^β(X, µ) and the algorithm is a kernelized ridge regression estimator generated by a kernel of finite smoothness, Corollary 6 in Steinwart et al. (2009) and the discussion thereafter imply that the excess risk upper bounds of the least-squares regression estimator in the Sobolev RKHS W_2^s(X) with s ≥ β > d/2 are optimal. Notice that in the latter case one does not need to know the smoothness parameter β, but only a (possibly crude) upper bound s. Similarly, from the Theorem and Example in Pillaud-Vivien et al. (2018), the excess risk rates (in expectation) for the stochastic gradient descent decision rule in the Sobolev space W_2^s(X) on the class W_2^β(X, µ), for d/2 < β < s, can be deduced. They are optimal under the additional assumption s − β ≥ d. Corollary 4.4 in Lin et al. (2020) implies a risk upper bound of order n^{−2ζ/((2ζ+γ)∨1)}, where the parameter ζ is the power of the so-called source condition (see Engl et al. (2000) for more details on the source condition, and Blanchard et al. (2007) for the statistical perspective) and γ is the decay rate of the effective dimension. In the case of the Sobolev RKHS W_2^s(X) and a ball in W_p^β(X) we have ζ = β/(2s) and γ = d/(2s) (with β ≤ d/2, s > d/2). If s > d, we obtain an excess risk upper bound of order n^{−β/s}, which is worse than n^{−2β/(2β+d)}. If d/2 < s ≤ d, the excess risk upper rate is n^{−2β/(2β+d)} when s − d/2 < β ≤ d/2, and n^{−β/s} when 0 < β < s − d/2. In the latter case it is better than the lower bound on the minimax regret, but worse than the rate n^{−2β/(2β+d)} achieved, as stated above, by, for example, the regularized least-squares estimator with Gaussian kernels.
In the worst-case scenario (β < d/2, β + d/2 < s) one also observes a gap between the upper rates for the excess risk in the statistical learning scenario achieved by general spectral regularization methods (n^{−β/s}) and the lower bounds for the minimax regret (n^{−β/d}) in the online regression setting. A broader analysis of the difference f_{Z_n} − f_ν in the norms of the interpolation Hilbert spaces (which can be represented as ranges of fractional powers of the kernel integral operator) lying between H and L_2(X) is provided in Fischer and Steinwart (2017), where the regularized kernel least-squares estimator is considered. Corollary 4.1 therein and the inclusion between Sobolev spaces imply excess risk upper bounds of order n^{−2β/(2β+d)+ε} for f_ν ∈ W_p^β(X), β > 0, p ≥ 2. If β/d ∈ (1/p, 1/2] and p ≥ 2, then the aforementioned excess risk rates are better than the regret upper bounds obtained by KAAR (Theorem 6). To the best of our knowledge, the best known lower bounds in probability on the excess risk on classes of balls in Sobolev spaces are of order n^{−1} (see Corollary 4.2 in Fischer and Steinwart (2017) with t = 0 and f_ν ∈ W_p^β(X) ⊂ W_2^β(X), and notice that f_ν is bounded on X by the Sobolev embedding and the Bolzano-Weierstrass theorem).

Comparison with existing online nonparametric regression procedures. The setting of online regression when competing against a benchmark of nonparametric functional classes is definitely not new. The standard idea is to use an ε-net of the bounded functional space and to exploit the exponentially weighted average (EWA) forecaster over the finite class of experts given by the elements of the ε-net (see Chapter 1 in the monograph Cesa-Bianchi and Lugosi (2006) for the finite EWA, and Vovk (2006a) for its application in the nonparametric case). Vovk (2006b) analyzes the regret when competing against a general reproducing kernel Hilbert space defined on an arbitrary set X ⊂ R^d, and proves in this case the existence of an algorithm (based on the so-called idea of defensive forecasting, which requires knowledge of the kernel feature map) with regret of order O(√n) over unit balls of the underlying reproducing kernel Hilbert space. Vovk (2007) extends the analysis to the more general framework of Banach spaces, described through the decay rate of the so-called modulus of convexity (originally introduced by Clarkson (1936)); it includes, as a particular example, Sobolev spaces, with the parameter p of the modulus of convexity being the parameter p from the definition of W_p^β(X). All these approaches have the disadvantage of either suboptimal regret bounds or prohibitive computational complexity. Notice that in the framework of online nonparametric regression, a minimax regret analysis in terms of the (sequential) entropy growth rates of the underlying functional classes was provided by Rakhlin and Sridharan (2014). In particular, the optimal rates of order n^{d/(2β+d)} (up to logarithmic terms) when the reference class is a Sobolev RKHS (β > d/2), and of order n^{1−β/d} on the classes of Hölder balls (which correspond to the classes B_{W_∞^β(X)}(0, R)), can be achieved by using the generic forecaster with a Rademacher-complexity relaxation (for more details see Example 2, Theorems 2 and 3, and Section 6 in Rakhlin and Sridharan (2014)). Although the relaxation procedure ensures minimax optimality, it is not constructive in general. An explicit forecaster, based on a multi-scale exponentially weighted average algorithm (called Chaining EWA), has been provided in Gaillard and Gerchinovitz (2015).
The latter achieves an optimal rate when competing against functional classes of uniformly bounded functions which satisfy a certain (sharp) growth condition on the sequential entropy (see Rakhlin and Sridharan (2014)). This condition implies optimal rates, for example, on classes where the sequential entropy is of the order of the metric entropy (see Appendix F for the definition of the notion of entropy). Chaining EWA has been shown to be computationally efficient on the class of Hölder balls (p = ∞) with d = 1. In general, the Chaining EWA forecaster is computationally prohibitive (as it has exponential time complexity in the number of rounds).

Comparison with the Exponential Weighted Average (EWA) forecaster.
The idea of using the EWA forecaster in the nonparametric setting over a bounded benchmark functional class W is to consider an ε-net W_ε of smallest cardinality (i.e. a set W_ε ⊂ W such that for all f ∈ W there exists f₀ ∈ W_ε with ‖f − f₀‖_∞ ≤ ε) and to use the (finite) exponentially weighted average forecaster (see Cesa-Bianchi and Lugosi (2006)) on the set W_ε. This approach was introduced in Vovk (2006a) (see also the discussions in Rakhlin and Sridharan (2014) and Gaillard and Gerchinovitz (2015)) and leads to a composed regret upper bound of order nε + log(N_∞(ε, F)), where the last term is the metric entropy of the class F at scale ε. It is known (see Edmunds and Triebel (1996)) that for the benchmark classes of Sobolev balls in W_p^β(X) (with p ≥ 2 and β > d/p) the metric entropy is of order ε^{−d/β}. Balancing the two terms by a proper choice of ε results in an upper bound of order n^{d/(β+d)} (see also Corollary 8 in Vovk (2006b)). As illustrated in Figure 1, in the (1/p, β/d)-plane the regret upper bounds of KAAR are smaller than those of EWA as soon as β/d is large enough. More precisely, EWA outperforms KAAR when β/d ∈ [1/p, 1/√p]. The latter is not surprising, since KAAR, which outputs prediction rules in a Sobolev RKHS (i.e. functions of sufficiently high regularity), performs worse when competing against functions of small regularity. EWA does not have this drawback, as it acts through a discretization of the space. In the case p ≥ 2 and β/d ≤ 1/p it is generally not true that there exists a continuous representative for each equivalence class in W_p^β(X). Under an additional continuity assumption (i.e. considering W_p^β(X) ∩ C(X) as a benchmark class instead), the best (known) upper bound for the minimax regret (and thus for the regret itself) is of order n^{1−1/p} (see Example 2 in Rakhlin and Sridharan (2014)). It is achieved by a non-constructive algorithm based on the notion of relaxation of the sequential Rademacher complexity. Notice that EWA can also be applied over classes of bounded balls in W_p^β(X) ∩ C(X), β/d ≤ 1/p; in this case it yields the same rate n^{d/(β+d)}, which is then worse than n^{1−1/p}.
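To make the balancing explicit (our quick computation): with log N_∞(ε, F) of order ε^{−d/β}, the bound nε + ε^{−d/β} is minimized at

    ε = n^{−β/(β+d)},   giving   nε = n^{1−β/(β+d)} = n^{d/(β+d)},

which is the EWA rate quoted above; normalizing by n gives n^{−β/(β+d)}, the EWA entry of Table 1.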
Comparison with the defensive forecaster of Vovk (2007). In Vovk (2007), the author describes algorithms based on defensive forecasting schemes in general Banach spaces. The benchmark classes are irregular but continuous functions; in particular, they include Sobolev spaces. Transferring the results given in Equations (6) and (11) of Vovk (2007) to the setting of this work, the defensive forecaster BBK29 (see pages 19-20 in Vovk (2007)) achieves, for a unit ball F = B_{W_p^β(X)}(0, 1), the regret bound

    R_n(F)/n ≤ C n^{−β/d+ε}   if p = ∞,
    R_n(F)/n ≤ C n^{−1/p}     if 2 ≤ p < ∞ and d/p ≤ β.

Therefore, in the first case, which corresponds to Hölder balls in W_∞^β(X) with 0 < β ≤ 1, we recover the same rate as Theorem 6, which however also covers the range β > 1. The rate is optimal, as stated in Theorem 9. In the second case (p ≥ 2 and d/p < β), the upper bounds provided by Theorem 6 (if β > d/p) or Theorem 4 (if d = 1 and β > 1/2) are always better than the corresponding bounds of Vovk (2007).

Computational complexity. Here we consider an optimal computational scheme for KAAR and compare its costs to those of known nonparametric algorithms (in terms of both runtime and storage complexity). Recall that for any x_t ∈ X and (x_s, y_s)_{s≤t−1} ∈ (X × Y)^{t−1}, KAAR computes

    ŷ_t = f̂_{τ,t}(x_t) = ⟨f̂_{τ,t}, k_{x_t}⟩_{H_k} = Σ_{s=1}^t k(x_t, x_s) c_s,

where c ∈ R^t, c = (K_t + τI)^{−1} ỹ_t, ỹ_t^T = (Y_{t−1}^T, 0), and K_t = (k(x_i, x_j))_{i,j≤t} is the kernel matrix at step t. A naive way to compute the value of KAAR at the input x_t is to invert the matrix K_t + τI_t. This requires O(t³) operations in round t and implies O(n⁴) cumulative time complexity over n rounds. The latter can be improved by using the Cholesky decomposition and rank-one updates of the kernel matrix, in the spirit of Algorithm 1 of Rudi et al. (2015) for a general RKHS. More precisely, at time t we maintain the Cholesky decomposition R_{t−1}R_{t−1}^T = K_{t−1} + τI; next we denote the quantities

    b_t := (k(x_t, x_1), ..., k(x_t, x_{t−1}))^T,
    α_t := (K_{t−1} + τI)^{−1} b_t,
    γ_t := k(x_t, x_t) + τ − b_t^T α_t,
    g_t := √γ_t,

and u_t = (α_t g_t, g_t), v_t = (α_t g_t, −g_t). Using these, we compute an update of R_t:
    R_t := [ R_{t−1}  0 ; 0  0 ],
    R_t := CHOLUPDATE(R_t, u_t, '+'),
    R_t := CHOLUPDATE(R_t, v_t, '−'),

and calculate the solution's coefficients c_t = R_t^{−1}(R_t^T)^{−1} ỹ_t. Here the procedure CHOLUPDATE(R, a, '+') returns the upper-triangular Cholesky factor of R^T R + aa^T, whereas CHOLUPDATE(R, a, '−') returns the upper-triangular factor of R^T R − aa^T. At round t (t ≤ n) this has a computational cost of at most O(t²). Taking into account that we also compute the kernel matrix K_n = (k(x_i, x_j))_{i,j≤n} for d-dimensional inputs x_t, which adds n²d to the total computational complexity, we obtain a total computational cost of order O(n³ + n²d) operations. This complexity can be further improved when β > d/(2(√2 − 1)) (which implies β > d/2) to O(n^{1+(d/β)(1−(d/(2β))²)}) by using Nyström projections (Jézéquel et al., 2019), while retaining the optimal regret. In particular, the run-time complexity converges to linear as β → ∞. Jézéquel et al. (2019) also provide additional improvements to the complexity when the features x_t are revealed to the learner beforehand.

As mentioned before, most of the existing work on online nonparametric regression over Sobolev spaces (in particular Rakhlin and Sridharan (2014); Vovk (2006a,b, 2007)) does not provide efficient (i.e., polynomial-time) algorithms. The work of Rakhlin and Sridharan (2014) provides an optimal minimax analysis, but does not develop constructive procedures. More precisely, it requires knowledge of (tight) upper bounds for the so-called relaxations. To obtain the latter, in general, one has to compute the offset Rademacher complexity, which is numerically infeasible. The approach of using EWA in the nonparametric setting (Vovk (2006a)) has non-optimal rates and suffers from prohibitive computational complexity, since it needs to update the weights of the experts in the ε-net. For Sobolev balls its size is of order O(exp(n)) (since the number of experts scales as N_∞(ε, F), whose logarithm, the metric entropy of the class F at the optimal scale, is polynomial in the number of rounds), so that the total time complexity is O(exp(n) + nd) (where nd comes from the aggregation of the observations x_t ∈ X ⊂ R^d over n rounds). The defensive forecasting approaches of Vovk (2006b, 2007) require knowledge of the so-called Banach feature map, which is typically inaccessible in the computational design of the algorithm.

To the best of our knowledge, the only other algorithm that addresses the problem of computational cost in online nonparametric regression is the Chaining EWA forecaster (Gaillard and Gerchinovitz (2015)). On the class W_∞^β(X) with β = r + α, α ∈ (0, 1], r ∈ N_*, the Chaining EWA forecaster can be efficiently implemented through piecewise-polynomial approximation; see Lemma 12 and Appendix C in Gaillard and Gerchinovitz (2015). Its total time and storage complexities are of order

    Storage: O(n^{r+4+β(r−β+1)} log(n)),    Time: O(n^{(r+1)(2+2β/(2β+1))} log(n)).

Notice that the storage complexity of KAAR is O(n²), which is uniformly better, for any β = r + α > 0, than that of Chaining EWA. Furthermore, the time complexity of KAAR is better for all β ≥ 1 (and worse for 0 < β < 1) than that of the efficient implementation of Chaining EWA. As mentioned in Gaillard and Gerchinovitz (2015), in most cases the direct implementation of the Chaining EWA forecaster requires exp(d·poly(n)) time (due to the exponentially many updates of the experts' coefficients).
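The O(t²)-per-round idea can be sketched in Python as follows. This is our illustration, using the standard bordered-Cholesky update (a direct variant equivalent in cost to the CHOLUPDATE formulation above, not the authors' exact routine); `kernel` is any kernel function such as (3).

import numpy as np
from scipy.linalg import solve_triangular

class IncrementalKAAR:
    """O(t^2)-per-round KAAR predictions via a growing Cholesky factor.

    Maintains a lower-triangular L_t with L_t L_t^T = K_t + tau * I, so each new
    point costs triangular solves instead of a full re-factorization.
    """

    def __init__(self, kernel, tau):
        self.kernel, self.tau = kernel, tau
        self.xs, self.ys = [], []
        self.L = np.zeros((0, 0))

    def predict_then_observe(self, x_t, y_t):
        # Border the factor: K_t + tau*I = [[K_{t-1}+tau*I, b], [b^T, k(x,x)+tau]].
        b = np.array([self.kernel(x, x_t) for x in self.xs])
        if len(self.xs) == 0:
            a = np.zeros(0)
        else:
            a = solve_triangular(self.L, b, lower=True)        # a = L^{-1} b, O(t^2)
        g = np.sqrt(self.kernel(x_t, x_t) + self.tau - a @ a)  # Schur complement, > 0
        t = len(self.xs) + 1
        L_new = np.zeros((t, t))
        L_new[:t - 1, :t - 1] = self.L
        L_new[t - 1, :t - 1], L_new[t - 1, t - 1] = a, g
        self.L = L_new
        # KAAR forecast: y_hat = y_tilde^T (K_t + tau*I)^{-1} k_tilde(x_t).
        y_tilde = np.append(self.ys, 0.0)
        k_tilde = np.append(b, self.kernel(x_t, x_t))
        w = solve_triangular(self.L, k_tilde, lower=True)
        coef = solve_triangular(self.L.T, w, lower=False)      # (K_t + tau*I)^{-1} k_tilde
        y_hat = float(y_tilde @ coef)
        self.xs.append(x_t); self.ys.append(y_t)               # y_t revealed after forecast
        return y_hat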
Acknowledgements
Oleksandr Zadorozhnyi would like to acknowledge the full support of the Deutsche Forschungsgemeinschaft (DFG) SFB 1294 and the mobility support of the UFA-DFH through the French-German Doktorandenkolleg CDFA 01-18. The authors acknowledge the Franco-German University (UFA) for its support through the binational Collège Doctoral Franco-Allemand CDFA 01-18.
References
R. Adams and J. Fournier. Sobolev Spaces. Academic Press, 2003.

C. Amat, T. Michalski, and G. Stoltz. Fundamentals and exchange rate forecastability with simple machine learning methods. Journal of International Money and Finance, 88:1-24, 2018.

K. Azoury and M. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211-246, 2001.

G. Blanchard and N. Muecke. Optimal rates of regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18:971-1013, 2017.

G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66:259-294, 2007.

H. Brezis and P. Mironescu. Gagliardo-Nirenberg inequalities and non-inequalities. Annales de l'Institut Henri Poincaré, 35:1355-1376, 2018.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, pages 331-368, 2006.

N. Cesa-Bianchi. Analysis of two gradient-based algorithms for on-line regression. Journal of Computer and System Sciences, 59:392-411, 1999.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

N. Cesa-Bianchi, P. Gaillard, C. Gentile, and S. Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. arXiv preprint arXiv:1702.08211, 2017.

J. A. Clarkson. Uniformly convex spaces. Transactions of the American Mathematical Society, 40:396-414, 1936.

M. Devaine, P. Gaillard, Y. Goude, and G. Stoltz. Forecasting electricity consumption by aggregating specialized experts - a review of the sequential aggregation of specialized experts, with an application to Slovakian and French country-wide one-day-ahead (half-)hourly predictions. Machine Learning, 90(2):231-260, 2013.

E. Di Nezza, G. Palatucci, and E. Valdinoci. Hitchhiker's guide to the fractional Sobolev spaces. Bulletin des Sciences Mathématiques, 136:521-573, 2012.

M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In Advances in Neural Information Processing Systems 24, pages 1539-1547. Curran Associates, Inc., 2011.

D. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, 1996.

H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Springer Netherlands, 2000. ISBN 978-0-7923-4157-4.

L. C. Evans. Partial Differential Equations. American Mathematical Society, 1998.

S. Fischer and I. Steinwart. Sobolev norm learning rates for regularized least-squares algorithms. Arxiv, pages 1-26, 2017. URL https://arxiv.org/pdf/1702.07254.pdf.

D. Foster. Prediction in the worst case. Annals of Statistics, 19:1084-1090, 1991.

P. Gaillard and S. Gerchinovitz. A chaining algorithm for online nonparametric regression. In Proceedings of The 28th Conference on Learning Theory, volume 40, pages 764-796, 2015.

P. Gaillard, S. Gerchinovitz, M. Huard, and G. Stoltz. Uniform regret bounds over R^d for the sequential linear regression problem with the square loss. In Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 404-432, 2019. URL http://proceedings.mlr.press/v98/gaillard19a.html.

A. Gammerman, Y. Kalnishkan, and V. Vovk. On-line prediction with kernels and the complexity approximation principle. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 170-176, 2004.

L. Györfi. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.

E. Hazan, W. Hu, Y. Li, and Z. Li. Online improper learning with an approximation oracle. In Advances in Neural Information Processing Systems, volume 31, pages 5652-5660, 2018. URL https://proceedings.neurips.cc/paper/2018/file/ad47a008a2f806aa6eb1b53852cd8b37-Paper.pdf.

R. Jézéquel, P. Gaillard, and A. Rudi. Efficient online learning with kernels for adversarial large scale problems. In Advances in Neural Information Processing Systems, pages 9427-9436, 2019.

J. Lin and V. Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral regularization algorithms. Arxiv, pages 1-53, 2018. URL https://arxiv.org/pdf/1801.07226.pdf.

J. Lin, A. Rudi, L. Rosasco, and V. Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 48:868-890, 2020.

L. W. Tu. An Introduction to Manifolds. Springer, 2011.

V. Mallet, G. Stoltz, and B. Mauricette. Ozone ensemble forecast with machine learning algorithms. Journal of Geophysical Research: Atmospheres, 114(D5), 2009.

F. Narcowich and J. Ward. Scattered-data interpolation on R^n: error estimates for radial basis and band-limited functions. SIAM Journal on Mathematical Analysis, 36:284-300, 2004.

F. Narcowich, J. Ward, and H. Wendland. Sobolev bounds on functions with scattered zeros, with applications to radial basis function surface fitting. Mathematics of Computation, 74:743-763, 2004.

E. Novak, M. Ullrich, H. Woźniakowski, and S. Zhang. Reproducing kernels of Sobolev spaces on R^d and applications to embedding constants and tractability. Arxiv, 2017. URL https://arxiv.org/pdf/1709.02568.pdf.

N. Pagliana, A. Rudi, E. De Vito, and L. Rosasco. Interpolation and learning with scale-dependent kernels. Arxiv, 2020. URL https://arxiv.org/pdf/2006.09984.pdf.

L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, 2018.

A. Rakhlin and K. Sridharan. Online nonparametric regression. Journal of Machine Learning Research, pages 1-27, 2014.

A. Rakhlin, K. Sridharan, and A. Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161:111-153, 2014.

A. Rakhlin, K. Sridharan, and A. Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, pages 155-186, 2015.

A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657-1665, 2015.

R. Schaback. Kernel-based meshless methods. Lecture notes, 2007.

A. Smola and B. Schölkopf. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA, 2002.

E. M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton University Press, 1970.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for least-squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79-93, 2009.

A. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

V. Vovk. Competitive on-line linear regression. In Advances in Neural Information Processing Systems 10, pages 364-370, 1998.

V. Vovk. Competitive on-line statistics. International Statistical Review, 69:213-248, 2001.

V. Vovk. Metric entropy in competitive on-line prediction. Arxiv, 2006a.

V. Vovk. On-line regression competitive with reproducing kernel Hilbert spaces. In International Conference on Theory and Applications of Models of Computation, pages 452-463, 2006b.

V. Vovk. Competing with wild prediction rules. Machine Learning, 69:193-212, 2007.

H. Wendland. Scattered Data Approximation. Cambridge University Press, 2005.

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077-2098, 2005.

F. Zhdanov and Y. Kalnishkan. An identity for kernel ridge regression. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, pages 405-419, 2010.
Algorithmic LearningTheory , pages 405–419. Springer, 2010. NLINE NONPARAMETRIC REGRESSION WITH KERNELS
Appendices
Appendix A. Notation on kernels and linear operators over reproducing kernel Hilbert spaces
We complete Section 2.1 by providing additional notation on kernels that is used in the proofs. We consider kernel methods that choose the forecaster $\widehat f_t$ in a reproducing kernel Hilbert space $\mathcal H_k$ associated with a reproducing kernel $k : \mathcal X \times \mathcal X \to \mathbb R$. The prediction rule $f_t$ at round $t$ then forecasts $\widehat f_t(x_t) = \langle f_t, k_{x_t} \rangle_{\mathcal H_k}$. We use the following notation, which is common in the setting of kernel learning.

Integral and covariance operators.
Let $(\mathcal X, \mathcal B(\mathcal X))$ be a measurable space and $\mu$ be some measure on the Borel $\sigma$-algebra $\mathcal B(\mathcal X)$. We define $S : \mathcal H_k \to L_2(\mathcal X, \mu)$ to be the restriction operator mapping a function $f \in \mathcal H_k$ to its equivalence class in $L_2(\mathcal X, \mu)$. We drop the dependence of $S$ on the measure $\mu$ to simplify the notation. The corresponding adjoint $S^* : L_2(\mathcal X, \mu) \to \mathcal H_k$ is then well defined and has the form $S^* f = \int_{x \in \mathcal X} f(x)\, k_x \, d\mu(x)$ for any $f \in L_2(\mathcal X, \mu)$. We define the (kernel) integral operator $L = S S^* : L_2(\mathcal X, \mu) \to L_2(\mathcal X, \mu)$ such that for any $f \in L_2(\mathcal X, \mu)$ we have, for $\mu$-almost all $x \in \mathcal X$,
\[ L(f)(x) = \int_{z \in \mathcal X} k(x, z)\, f(z) \, d\mu(z). \qquad (7) \]
The (kernel) covariance operator $T = S^* S : \mathcal H_k \to \mathcal H_k$ is defined as
\[ T = \int_{x \in \mathcal X} k_x \otimes k_x \, d\mu(x). \qquad (8) \]
It is known (see e.g. Theorems 2.2 and 2.3 in Blanchard et al. (2007)) that the operators $T$ and $L$ are both positive, self-adjoint and trace-class. Moreover, they have the same non-zero spectrum.

Evaluation and empirical covariance operators.
Analogously to the population case, based on the data sequence $(x_s, y_s)_{1 \le s \le t}$, for each $t \in \{1, \dots, n\}$ we define the evaluation operator $S_t : \mathcal H_k \to \mathbb R^t$ such that for any $j \in \{1, \dots, t\}$, $(S_t f)_j = \langle f, k_{x_j} \rangle = f(x_j)$. Let $S_t^* : \mathbb R^t \to \mathcal H_k$ be the corresponding adjoint. Then, for any $y \in \mathbb R^t$, $S_t^* y = \sum_{i=1}^t y_i\, k_{x_i}$. Note that the kernel matrix $K_t := \big(k(x_i, x_j)\big)_{1 \le i, j \le t}$ satisfies $K_t = S_t S_t^*$. We also define the empirical covariance operator $T_t : \mathcal H_k \to \mathcal H_k$ for $t \ge 1$ as $T_t := S_t^* S_t = \sum_{i=1}^t k_{x_i} \otimes k_{x_i}$, so that for any $f \in \mathcal H_k$, $T_t f = \sum_{i=1}^t k_{x_i} \langle k_{x_i}, f \rangle = \sum_{i=1}^t f(x_i)\, k_{x_i}$. For a given $\tau > 0$, we define the regularized covariance operator $A_t = T_t + \tau I$, where $I : \mathcal H_k \to \mathcal H_k$ is the identity operator. Finally, we call $\lambda_j(A)$ the $j$-th largest eigenvalue of an operator $A$ (i.e. $\lambda_1(A) \ge \lambda_2(A) \ge \dots \ge \lambda_n(A) \ge \dots$). It is worth pointing out that both $T_t$ and $K_t$ are positive semi-definite for all $t \in \{1, 2, \dots, n\}$. Since the kernel $k(\cdot, \cdot)$ is bounded, $T_t$ is a trace-class operator. In other words, $T_t$ is a compact operator for which a trace may be defined; i.e., in some orthonormal basis $(\varphi_k)_{k \in \mathbb N^*}$, the trace norm $\|T_t\|_1 := \mathrm{Tr}|T_t| := \sum_k \big\langle (T_t^* T_t)^{1/2} \varphi_k, \varphi_k \big\rangle = \sum_k \sqrt{\lambda_k(T_t^* T_t)}$ is finite. With a slight abuse of notation, we write $S_n f = (f(x_i))_{i=1}^n$ for any function $f : \mathcal X \to \mathbb R$ and data sample $\{x_s\}_{s \ge 1}$.
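To make the notation above concrete, the following minimal Python sketch builds the kernel matrix $K_t = S_t S_t^*$ for a Matérn-type kernel (a kernel of the Bessel-function form recalled later in Appendix B.3, whose RKHS is norm-equivalent to $W_2^s(\mathbb R^d)$ when $s > d/2$), evaluates the effective dimension $d^n_{\mathrm{eff}}(\tau) = \sum_j \lambda_j(K_n)/(\lambda_j(K_n) + \tau)$ used in Appendix C, and applies a regularized inverse of the kind $(K_t + \tau I)^{-1}$. The function names, the data, the value of $\tau$ and the kernel normalization are illustrative placeholders and not the exact objects of the paper.

import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def sobolev_kernel(X1, X2, s, d):
    """Matern-type kernel r^{s-d/2} K_{s-d/2}(r) (up to constants); its RKHS is
    norm-equivalent to the Sobolev space W_2^s(R^d) when s > d/2."""
    nu = s - d / 2.0
    r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    r = np.maximum(r, 1e-12)                      # avoid evaluating kv at exactly 0
    K = (2.0 ** (1 - nu) / gamma(nu)) * (r ** nu) * kv(nu, r)
    K[r <= 1e-12] = 1.0                           # limit value of r^nu K_nu(r) as r -> 0
    return K

def effective_dimension(K, tau):
    """d_eff(tau) = sum_j lambda_j(K) / (lambda_j(K) + tau)."""
    lam = np.maximum(np.linalg.eigvalsh(K), 0.0)  # K is PSD up to rounding errors
    return np.sum(lam / (lam + tau))

rng = np.random.default_rng(0)
d, s, n, tau = 2, 2.0, 200, 1.0
X = rng.uniform(-1.0, 1.0, size=(n, d))           # inputs in X = [-1, 1]^d
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

K = sobolev_kernel(X, X, s, d)
print("effective dimension:", effective_dimension(K, tau))

# Regularized kernel prediction at the observed points, (K + tau I)^{-1} y, the
# finite-dimensional counterpart of applying A_t^{-1} = (T_t + tau I)^{-1}.
alpha = np.linalg.solve(K + tau * np.eye(n), y)
y_hat = K @ alpha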
Appendix B. Preliminary results on Sobolev spaces

In this part, we recall known results on Sobolev spaces that will be useful for our analysis. We refer the curious reader to Adams and Fournier (2003) for an extensive survey on Sobolev spaces and to Di Nezza et al. (2012) for the specific case of non-integer exponents.
B.1 Definition and notation
Let $\mathcal X \subseteq \mathbb R^d$, $p \in [1, \infty)$, and denote by $L_p(\mathcal X)$ the equivalence classes of $p$-integrable functions with respect to the Lebesgue measure $\lambda$ on $\mathcal X$. We recall the definition of the Sobolev spaces $W^r_p(\mathcal X)$ when $r \ge 1$ is an integer.

Definition of Sobolev spaces with integer $r \in \mathbb N^*$. We recall (see Section 2.2) that the Sobolev spaces $W^r_p(\mathcal X)$ and $W^r_\infty(\mathcal X)$ are the vector spaces of equivalence classes of functions defined as
\[ W^r_p(\mathcal X) := \Big\{ f : \mathcal X \to \mathbb R \ \text{s.t.} \ \|f\|_{W^r_p(\mathcal X)} := \Big( \sum_{|\gamma| \le r} \|D^\gamma f\|^p_{L_p(\mathcal X)} \Big)^{1/p} < \infty \Big\}, \]
and
\[ W^r_\infty(\mathcal X) := \Big\{ f : \mathcal X \to \mathbb R \ \text{s.t.} \ \|f\|_{W^r_\infty(\mathcal X)} := \sup_{|\gamma| \le r} \|D^\gamma f\|_{L_\infty(\mathcal X)} < \infty \Big\}. \]
We also define the Sobolev semi-norm $|f|_{W^j_p(\mathcal X)} := \sum_{\gamma : |\gamma| = j} \|D^\gamma f\|_{L_p(\mathcal X)}$.

Definition of Sobolev spaces with non-integer smoothness exponent $\beta$. Let $\beta \in \mathbb R_+$; for our purposes we write $\beta = r + \sigma$ with $r \in \mathbb N$ and $\sigma \in (0, 1]$. Let $u : \mathcal X \to \mathbb R$ be some fixed measurable function. We define the map $\varphi_u : \mathcal X \times \mathcal X \to \mathbb R \cup \{\infty\}$ such that for $1 \le p < \infty$ and all $(x, y) \in \mathcal X \times \mathcal X$,
\[ \varphi_u(x, y) = \frac{|u(x) - u(y)|}{\|x - y\|^{\frac{d}{p} + \sigma}}, \]
and denote $\widetilde W^\sigma_p(\mathcal X) := \{ u \in L_p(\mathcal X) : \|\varphi_u\|_{L_p(\mathcal X \times \mathcal X)} < \infty \}$. The space $\widetilde W^\sigma_p(\mathcal X)$, equipped with the norm $\|u\|_{\widetilde W^\sigma_p(\mathcal X)} := \big( \|u\|^p_{L_p(\mathcal X)} + \|\varphi_u\|^p_{L_p(\mathcal X \times \mathcal X)} \big)^{1/p}$, can be shown to be a Banach space. With this notation, the Sobolev space $W^\beta_p(\mathcal X)$, $\beta = r + \sigma$, can be defined as
\[ W^\beta_p(\mathcal X) := \big\{ u \in W^r_p(\mathcal X) : D^\gamma u \in \widetilde W^\sigma_p(\mathcal X) \ \text{for any } \gamma \text{ such that } |\gamma| = r \big\}. \qquad (9) \]
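As a quick numerical illustration of the quantities entering this definition, the sketch below estimates $\|u\|_{L_p(\mathcal X)}$ and the Gagliardo-type seminorm $\|\varphi_u\|_{L_p(\mathcal X \times \mathcal X)}$ by Monte Carlo for a smooth function on $\mathcal X = [-1, 1]$ (so $d = 1$). The function $u$, the values of $p$, $\sigma$ and the sample size are arbitrary choices made only to make the fractional norm concrete; this is not part of the paper's analysis.

import numpy as np

rng = np.random.default_rng(0)
d, p, sigma = 1, 2.0, 0.5           # illustrative: beta = r + sigma with r = 0
u = lambda t: np.sin(np.pi * t)     # a smooth function on X = [-1, 1]

m = 200_000
x = rng.uniform(-1.0, 1.0, size=m)
y = rng.uniform(-1.0, 1.0, size=m)
vol = 2.0                           # Lebesgue measure of [-1, 1]

# ||u||_{L_p(X)}^p  ~  vol * E|u(X)|^p  with X uniform on [-1, 1]
u_norm_p = vol * np.mean(np.abs(u(x)) ** p)

# phi_u(x, y) = |u(x) - u(y)| / |x - y|^{d/p + sigma}; drop (measure-zero) ties x = y
mask = np.abs(x - y) > 1e-12
phi = np.abs(u(x[mask]) - u(y[mask])) / np.abs(x[mask] - y[mask]) ** (d / p + sigma)
phi_norm_p = vol ** 2 * np.mean(phi ** p)

# Norm of the fractional space \tilde W^sigma_p, as in the definition above
print("||u||_{W~^sigma_p} approx", (u_norm_p + phi_norm_p) ** (1.0 / p))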
The space $W^\beta_p(\mathcal X)$, equipped with the norm
\[ \|u\|_{W^\beta_p(\mathcal X)} := \Big( \|u\|^p_{W^r_p(\mathcal X)} + \sum_{\gamma : |\gamma| = r} \|D^\gamma u\|^p_{\widetilde W^\sigma_p(\mathcal X)} \Big)^{1/p}, \qquad (10) \]
becomes a Banach space. In the case $\beta = m \in \mathbb N^*$ it matches the definition of the Sobolev space $W^m_p(\mathcal X)$ (up to a re-scaling of the norm). If $m = 0$ (i.e. $r = 0$, $\sigma \in [0, 1)$), we find that $W^m_p(\mathcal X) = L_p(\mathcal X)$, so that the norm in $W^\sigma_p(\mathcal X)$ is given by
\[ \|u\|_{W^\sigma_p(\mathcal X)} = \|u\|_{\widetilde W^\sigma_p(\mathcal X)} := \big( \|u\|^p_{L_p(\mathcal X)} + \|\varphi_u\|^p_{L_p(\mathcal X \times \mathcal X)} \big)^{1/p}. \qquad (11) \]
In accordance with the above definition of the class $W^\beta_p(\mathcal X)$, for any $\beta = r + \sigma$, $\sigma \in [0, 1)$, we set
\[ \widetilde W^\sigma_\infty(\mathcal X) := \Big\{ u \in L_\infty(\mathcal X) : \sup_{x, y \in \mathcal X,\, x \ne y} \frac{|u(x) - u(y)|}{\|x - y\|^\sigma} < \infty \Big\}. \qquad (12) \]
Now, for $\beta = r + \sigma \in \mathbb R_+$, the Sobolev space $W^\beta_\infty(\mathcal X)$ can be defined as the functional space
\[ W^\beta_\infty(\mathcal X) := \big\{ u \in W^r_\infty(\mathcal X) : D^\gamma u \in \widetilde W^\sigma_\infty(\mathcal X) \ \text{for any } \gamma \text{ such that } |\gamma| = r \big\}, \qquad (13) \]
equipped with the norm
\[ \|u\|_{W^\beta_\infty(\mathcal X)} := \max \Big\{ \|u\|_{W^r_\infty(\mathcal X)},\ \max_{\gamma : |\gamma| = r} \|D^\gamma u\|_{\widetilde W^\sigma_\infty(\mathcal X)} \Big\}. \qquad (14) \]

B.2 Approximation properties of the Sobolev spaces.
We recall that W s ( X ) is Sobolev RKHS, a space of continuous representatives from equivalenceclasses of functions from the Sobolev space W s ( X ) provided s > d . The goal of this section is tocontrol the regret with respect to a ball in an arbitrary Sobolev space W βp ( X ) with p ≥ and β (cid:54) = s .To do so, we need to control the approximation error of f ∈ W βp ( X ) by the elements from somesubset G ⊂ W s ( X ) uniformly over f ∈ W βp ( X ) . This can be achieved by considering the subset ofthe functions with limited bandwidth (see ex. Narcowich et al. (2004)) which is in W s ( X ) for any s > . Namely for σ ∈ R + \ { } we define B σ to be B σ := { f ∈ L (cid:0) R d (cid:1) ∩ C ∞ (cid:0) R d (cid:1) : supp ( F ( f )) ⊂ B (0 , σ ) } , (15)where we denote F ( f ) for the Fourier transform of f and recall that B (0 , σ ) is an open ball in R d with radius σ .Next result is the consequence of the Proposition 3.7 in Narcowich and Ward (2004) (see also theproof of Lemma 3.7 in Narcowich et al. (2004)). To be able to apply the aforementioned Propositionwe need to extend functions f : X (cid:55)→ R , f ∈ W s ( X ) to functions ˜ f : R d (cid:55)→ R such that ˜ f ∈ W s (cid:0) R d (cid:1) . By Stein’s Extension Theorem (see Stein (1970), page. 181) since X is a boundedLipschitz domain there exists a linear operator C : W s ( X ) (cid:55)→ W s (cid:0) R d (cid:1) which is continuous, i.e. (cid:107) C f (cid:107) W s ( R d ) ≤ ˜ C (cid:107) f (cid:107) W s ( X ) . For this operator C , every f ∈ W s ( X ) and g σ ∈ B σ by definitionof the norm in W s ( X ) we have (cid:107) f − g σ (cid:107) W s ( X ) ≤ (cid:107) C f − g σ (cid:107) W s ( R d ) . Applying Lemma 3.7 in arcowich et al. (2004) to C f ∈ W s (cid:0) R d (cid:1) , and using the argument as in the proof of Theorem 3.8in Narcowich et al. (2004) for g σ given by Lemma 3.7 therein, we have (cid:107) f − g σ (cid:107) W r ( R d ) ≤ cσ r − s (cid:107) g σ (cid:107) W s ( R d ) and (cid:107) g σ (cid:107) W s ( X ) ≤ (cid:107) g σ (cid:107) W s ( R d ) ≤ c (cid:107) C f (cid:107) W s ( R d ) ≤ c (cid:107) f (cid:107) W s ( X ) . Thus we obtain the following statement.
Proposition 11
Let s ≥ r ≥ . For every f ∈ W s ( X ) and σ > there exist a function g σ ∈ B σ and constants C and C which are independent of σ such that for every σ > , (cid:107) f − g σ (cid:107) W r ( X ) ≤ C σ r − s (cid:107) f (cid:107) W s ( X ) and (cid:107) g σ (cid:107) W r ( X ) ≤ C σ r − s (cid:107) f (cid:107) W s ( X ) . We now state an upper-bound of (cid:107) f (cid:107) W rp ( X ) when f belongs to the intermediate Sobolev spaces W s p ( X ) and W s p ( X ) for some p , p , s , s . This result is a Gagliardo-Nirenberg type inequalityand follows from the result originally stated in Theorem 1 in Brezis and Mironescu (2018). Proposition 12 (Theorem 1, Brezis and Mironescu (2018))
Let
X ⊆ R d be a Lipschitz boundeddomain. Let ≤ r, s , s < ∞ and ≤ p , p , p ≤ ∞ be real numbers such that there exists θ ∈ (0 , with r = θs + (1 − θ ) s and p = θp + 1 − θp . Let A := (cid:8) ( s , s , p , p ) s.t. s ∈ N ∗ , p = 1 , s − s ≤ − p (cid:9) . If ( s , s , p , p ) / ∈ A ,then there exists a constant C > which depends on s , s , p , p , θ and X such that (cid:107) f (cid:107) W rp ( X ) ≤ C (cid:107) f (cid:107) θW s p ( X ) (cid:107) f (cid:107) − θW s p ( X ) , for all f ∈ W s p ( X ) ∩ W s p ( X ) . In the next corollary we state two particular cases of Proposition 12 that will prove useful.
Corollary 13
For the domain X = [ − , d and any ε > , all p ≥ and β > d/p there exists aconstant C > depending on p , d , ε and β such that (cid:107) g (cid:107) W dp + εp ( X ) ≤ C (cid:107) g (cid:107) dβp + εβ W βp ( X ) (cid:107) g (cid:107) − dβp − εβ L p ( X ) , (16) for all function g ∈ W βp ( X ) . Furthermore, for all β > , p ≥ and ε > , there exists a constant C > depending on β , p , d , and ε such that (cid:107) g (cid:107) W d ε ( X ) ≤ C (cid:107) g (cid:107) d +2 εβp W βp/ ( X ) (cid:107) g (cid:107) − d +2 εβp L ( X ) , (17) for any function g ∈ W β ( X ) . NLINE NONPARAMETRIC REGRESSION WITH KERNELS
Proof
First, notice that X = [ − , d is a Lipschitz bounded domain. The first inequality is ob-tained by choosing p = p = p ≥ , r = d/p + ε , s = β , and s = 0 in Proposition 12; checkingthat ( s , s , p , p ) / ∈ A ; and noting that for any β > we have W β ( X ) ∩ L ( X ) = W β ( X ) . Thesecond inequality stems from the choice p = p = p = 2 (note that this is for the p in the Propo-sition which is different from the p in the inequality), s = 0 , s = βp and noting the inclusion W β ( X ) ⊂ W βp ( X ) ⊆ W βp/ ( X ) = L ( X ) ∩ W βp/ ( X ) which holds true since p ≥ . B.3 Results from interpolation theory on Sobolev spaces
To provide a sharp upper bound on the effective dimension (Proposition 3), we also need the following general interpolation result on Sobolev spaces (stated in Theorem 3.8 in Narcowich et al. (2004)). Recall (see Wendlandt (2005), p. 172) that the fill distance of a set of points $\mathcal Z \subset \mathcal X$ is defined as
\[ h_{\mathcal Z, \mathcal X} := \sup_{x \in \mathcal X} \inf_{z \in \mathcal Z} \|x - z\|_2 . \]
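The fill distance is easy to approximate numerically. The sketch below evaluates $h_{\mathcal Z, \mathcal X}$ for point sets $\mathcal Z \subset [-1, 1]^d$ by replacing the supremum over $\mathcal X$ with a maximum over a fine uniform grid; the grid resolution and the example point sets are arbitrary choices for illustration.

import itertools
import numpy as np

def fill_distance(Z, d, grid_per_dim=100):
    """Approximate h_{Z,X} = sup_{x in [-1,1]^d} min_{z in Z} ||x - z||_2 by a
    maximum over a uniform grid of X = [-1, 1]^d (a grid approximation of the sup)."""
    axes = [np.linspace(-1.0, 1.0, grid_per_dim)] * d
    grid = np.array(list(itertools.product(*axes)))                 # (grid_per_dim^d, d)
    dists = np.linalg.norm(grid[:, None, :] - Z[None, :, :], axis=-1)
    return dists.min(axis=1).max()

rng = np.random.default_rng(0)
T, d = 100, 2
Z_random = rng.uniform(-1.0, 1.0, size=(T, d))                      # T random centres
m = int(round(T ** (1.0 / d)))                                       # regular grid with ~T points
Z_grid = np.array(list(itertools.product(np.linspace(-1.0, 1.0, m), repeat=d)))

print("fill distance, random points:", fill_distance(Z_random, d))  # typically larger
print("fill distance, regular grid :", fill_distance(Z_grid, d))    # of order T^{-1/d}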
Proposition 14 (Theorem 3.8 in Narcowich et al. (2004))
Suppose
Φ : R d → R to be a positivedefinite function such that its Fourier transform F (Φ) satisfies c (cid:0) (cid:107) ω (cid:107) (cid:1) − q ≤ F (Φ)( ω ) ≤ c (cid:0) (cid:107) ω (cid:107) (cid:1) − q (18) where q ≥ s ≥ r ≥ and c , c are some constants. Assume that X ⊂ R d is bounded domain, hasLipschitz boundary and satisfies the interior cone condition (see Chapter 4 in Adams and Fournier(2003)) with parameters ( ϕ, R ) . Let k = (cid:98) q (cid:99) and Z ⊂ X be such that its mesh norm h := h Z , X satisfies h Z , X ≤ k − Q ( ϕ ) R , where Q ( ϕ ) := sin( ϕ ) sin( θ )8(1 + sin( θ ))(1 + sin( ϕ )) (19) and θ = 2 arcsin (cid:0) sin( ϕ ) / (4(1 + sin ϕ )) (cid:1) . If f ∈ W s ( X ) then there exists a function v ∈ span { Φ( · − x j ) , x j ∈ Z} such that for every real ≤ r ≤ s (cid:107) f − v (cid:107) W r ( X ) ≤ Ch s − r Z , X (cid:107) f (cid:107) W s ( X ) , (20) where C is some constant independent of h Z, X and f . Let us now instantiate the above Proposition to the specific cases we are interested in by choos-ing X , Φ , Z , and r . Let T ∈ N be fixed; set X := [ − , d , Φ being the feature map of SobolevRKHS W s (cid:0) R d (cid:1) . In this case (see 3.1 in Narcowich et al. (2004)) Φ satisfies decay rate 18 with q = s . Choose Z to be the set of points of size T such that h Z , X (cid:46) T − d . To control when thencondition 19 is fulfilled, we firstly notice that X is star-shaped (see Definition 11.25 in Wendlandt(2005), also Proposition 2.1 of Narcowich et al. (2004) ); it includes (cid:96) ball centered at origin withradius r = 1 and can be included in the (cid:96) ball centered at of radius √ d . Thus, by Proposition 2.1in Narcowich et al. (2004) we obtain that X satisfies interior cone condition with the radius R = 1 and angle ϕ = 2 arcsin √ d . A straightforward calculation shows that in this case Q ( ϕ ) = Q ( u ( ϕ )) = u (cid:18) −
88 + u √ − u (cid:19) = (cid:16) u (cid:17) √ − u u √ − u , here u := sin ϕ ϕ = √ d − d + √ d − . Notice that in this case we have that √ d ≤ u ≤ √ d . We caneasily check this by simple inequalities: u = √ d − d + √ d − ≥ d − d ≥ √ d , and from the other side u ≤ d − √ d − √ d − ≤ √ d . From these conditions we deduce Q ( u ) ≥ d . Since h X , Z = sup x ∈X inf z ∈Z (cid:107) x − z (cid:107) (cid:46) T − d so to satisfy condition (19) we need to have T ≥ (cid:16) k Q ( u ) (cid:17) d where we take k = (cid:98) s (cid:99) and R = 1 .Notice that the choice T ≥ (cid:0) s d (cid:1) d ensures the last condition, therefore in order to satisfycondition (19) the size T of the grid Z should be of order (cid:0) s d (cid:1) d .Recall (see Wendlandt (2005)) that the kernel k ( · ) of the Sobolev space W s (cid:0) R d (cid:1) can be repre-sented by means of Bessel functions of second kind as: k ( x , x ) = 2 − s Γ( s ) (cid:107) x − x (cid:107) s − d K d − s ( (cid:107) x − x (cid:107) ) (21)Notice that by Corollary 10.13 in Wendlandt (2005) the norm (cid:107)·(cid:107) W s ( R d ) is equivalent to (cid:107)·(cid:107) W s ( R d ) .By Theorem 7.13 in Schaback (2007) (see also Corollary 10.48 on p. 170 in Wendlandt (2005) )a restriction of RKHS W s (cid:0) R d (cid:1) to the domain X := [ − , d is itself a RKHS W s ( X ) such thatit is continuously embedded into W s ( X ) and its kernel k is a restriction of kernel k to the space X . Thus, we can always consider W s ( X ) as a RKHS with reproducing kernel k ( · ) obtained bythe restriction of the kernel k ( · ) given by (21) to the domain X . Notice that it can be written as k ( x , x ) = Φ ( x − x ) and since Φ( · ) satisfies Assumption 18 so also Φ ( · ) .Then, applying Proposition 14 twice, with r = 0 and r = s and the above choices of X , Φ and Z entails the following corollary. Corollary 15
Let X := [ − , d , s > d/ and Z ⊂ X T be a set of points such that fill dis-tance h Z , X (cid:46) T − d , T ≥ T , T = (cid:0) s d (cid:1) d . Then, for any f ∈ W s ( X ) , there exists (cid:98) f ∈ span { k ( x, · ) , x ∈ Z } , such that (cid:13)(cid:13) f − (cid:98) f (cid:13)(cid:13) L ( X ) ≤ C T − sd (cid:107) f (cid:107) W s ( X ) , (cid:13)(cid:13) f − (cid:98) f (cid:13)(cid:13) W s ( X ) ≤ C (cid:107) f (cid:107) W s ( X ) , and (cid:98) f ( x ) = f ( x ) for any x ∈ Z , where the constants C and C depend on d and s but areindependent of the set Z and function f . The latter proposition together with Gagliardo-Nierenberg inequality yield the following ap-proximation result of functions f ∈ W s ( X ) by low ranked projections P f . Lemma 16 (Projection approximation)
Let X := [ − , d , s > d/ , T > T , T is given asin Lemma 15 and Z ⊂ X T be a set of points T points { x , . . . , x T } such that the fill distance NLINE NONPARAMETRIC REGRESSION WITH KERNELS h Z , X (cid:46) T − d and P Z : W s ( X ) → W s ( X ) be the orthogonal projection on span { k x : x ∈ Z} .Then, for any f ∈ W s ( X ) and for any ε > (cid:107) f − P Z f (cid:107) L ∞ ( X ) = sup x ∈X | f ( x ) − ( P Z f )( x ) | ≤ CT − s − εd + (cid:107) f (cid:107) W s ( X ) , (22) where C is a constant independent of f and T . Furthermore, if s ∈ N ∗ then Equation (22) holdswith ε = 0 . Proof
Let f ∈ W s ( X ) and ε > . The first inequality follows from inclusion f − P Z f ∈ W s ( X ) ⊂ C ( X ) when s > d . Define (cid:98) f Z := Arg Min g ∈ span { k x ,x ∈Z} (cid:107) f − g (cid:107) W s ( X ) . (23)Since W s ( X ) is a Hilbert space, (cid:98) f Z = P Z f ∈ W s ( X ) . Furthermore, through reproducing propertyin RKHS W s ( X ) and from the definition of an orthogonal projector we have for any x ∈ Z that P Z f ( x ) = (cid:104) P Z f, k x (cid:105) = (cid:104) f, P Z k x (cid:105) = (cid:104) f, k x (cid:105) = f ( x ) . By using Sobolev embedding Theorembetween the spaces W d/ ε ( X ) and L ∞ ( X ) (Equation (9) on page 60 in Edmunds and Triebel(1996) applied with s = d/ ε , s = 0 , n = d , p = 2 , and p = ∞ ) and by using Gagliardo-Nierenberg Inequality (16) we get (cid:107) f − P Z f (cid:107) L ∞ ( X ) ≤ C (cid:107) f − P Z f (cid:107) W d ε ( X ) ≤ C (cid:107) f − P Z f (cid:107) d s + εs W s ( X ) (cid:107) f − P Z f (cid:107) − d s − εs L ( X ) ← from Ineq. (16) ≤ C (cid:0) (cid:107) f − P Z f (cid:107) W s ( X ) (cid:1) d/ εs T − sd + + εd (cid:107) f (cid:107) − d s − εs W s ( X ) ← from Cor. (15) ≤ C T − sd + + εd (cid:107) f (cid:107) W s ( X ) , ← from Cor. (15)where the constants C , C , C , and C are independent of f and T . Finally in the specific case s ∈ N we apply directly Corollary 11.33 from Wendlandt (2005) with m = 0 , τ = s , q = ∞ to f − P Z f and obtain directly bound (22) with ε = 0 . Appendix C. Proof of Theorem. 3. Effective dimension upper-bound for the SobolevRKHS
Notice that the effective dimension can be rewritten as $d^n_{\mathrm{eff}}(\tau) = \mathrm{Tr}\big( (T_n + \tau I)^{-1} T_n \big)$, where $T_n$ is the (empirical) covariance operator. We provide below some auxiliary results that control the tail of the trace of the kernel integral operator. These results are given in Lemmata 2 and 3 of Pagliana et al. (2020) and are formulated here for completeness of the narrative.

Lemma 17
Let H k be some RKHS over domain X ⊆ R d with continuous reproducing kernel k : X × X → R . Let A : H k → H k be a bounded linear operator and A ∗ be its adjoint. Then sup x ∈X (cid:107) Ak x (cid:107) H k ≤ sup (cid:107) f (cid:107) H k ≤ (cid:107) A ∗ f (cid:107) L ∞ ( X ) . emma 18 Let H k be some RKHS over domain X ⊆ R d with reproducing kernel k : X × X → R and µ be any σ − finite measure on X . Let (cid:96) ∈ N + and P : H k (cid:55)→ H k be a projection operator withrank less than or equal to (cid:96) ∈ N + . Then (cid:88) t>(cid:96) λ t ( L ) ≤ (cid:90) X (cid:107) ( I − P ) k x (cid:107) H k dµ ( x ) ≤ sup x ∈X (cid:107) ( I − P ) k x (cid:107) H k , where L : L ( X , µ ) (cid:55)→ L ( X , µ ) is the kernel integral operator as defined in Equation (7) and λ t ( L ) are its t -th eigenvalues. We are now ready to prove our upper-bound of the effective dimension. Notice that it can bealso recovered from a more general result of Lemma 4 in Pagliana et al. (2020) when taking scale γ = n s − d therein. We provide the proof for completeness. Proof of Theorem 3.
Let s > d/ , t ≥ T . By Lemma 16 for the orthogonal projector P on the setof t points Z = { x , . . . , x t } ∈ X t such that fill distance h Z , X (cid:46) t − d for any ε (cid:48) ∈ R + holds sup (cid:107) f (cid:107) Ws ( X ) ≤ (cid:107) f − P f (cid:107) L ∞ ( X ) ≤ Ct − s (cid:48) d + , where s (cid:48) = s − ε (cid:48) and C is a constant that depends on X , d, s, ε , but not on t . Applying Lemma 17with A = I − P we obtain: sup x ∈X (cid:107) ( I − P ) k x (cid:107) H k ≤ Ct − s (cid:48) d + . Let { x i } ni ≥ be the sequence of inputs in X . Then with the choice µ := (1 /n ) (cid:80) ni =1 δ x i , the kernelintegral operator L equals K n /n ; combining Lemma 18 with the last inequality yields (cid:88) (cid:96)>t λ t (cid:0) K n /n (cid:1) ≤ n n (cid:88) i =1 (cid:13)(cid:13) ( I − P ) k x i (cid:13)(cid:13) H k ≤ sup x ∈X (cid:13)(cid:13) ( I − P ) k x (cid:13)(cid:13) H k ≤ Ct − s (cid:48) d +1 . (24)From the definition of the effective dimension (see Def. 1), we can upper-bound d neff ( τ ) := n (cid:88) j =1 λ j ( K n ) λ j ( K n ) + τ ≤ t (cid:88) j =1 λ j ( K n ) λ j ( K n ) + τ + τ − (cid:88) j ≥ t λ j ( K n ) , (25)where we used that since since K n is positive semidefinite, λ j ( K n ) ≥ for all j ≥ . Furthermore, λ j ( K n ) / ( λ j ( K n ) + τ ) ≤ for all j ≥ , which implies t (cid:88) j =1 λ j ( K n ) λ j ( K n ) + τ ≤ t . By homogeneity of the eigenvalues we have λ j ( K n ) = nλ j ( K n /n ) , and therefore τ − (cid:88) j ≥ t λ j ( K n ) = nτ − (cid:88) j ≥ t λ j ( K n /n ) . NLINE NONPARAMETRIC REGRESSION WITH KERNELS
Combining the last two inequalities with Inequalities (24) and (25), we upper-bound the effectivedimension as d neff ( τ ) ≤ t + Cnτ − t − s (cid:48) /d +1 . Choosing t to balance the terms in the above equation, i.e. t = n d s (cid:48) τ − d s (cid:48) , we get d neff ( τ ) ≤ C (cid:16) nτ (cid:17) d s (cid:48) = C (cid:16) nτ (cid:17) d ( s − ε (cid:48) ) . Then assuming ε (cid:48) < s/ , and using / (1 − x ) ≤ x for ≤ x ≤ / , we have d neff ( τ ) ≤ C (cid:16) nτ (cid:17) d s − ε (cid:48) s ≤ C (cid:16) nτ (cid:17) d s (1+ ε (cid:48) s ) = C (cid:16) nτ (cid:17) d s + ds ε (cid:48) ≤ C (cid:16) nτ (cid:17) d s + ε (cid:48) s . For any ε ∈ (0 , , the choice ε (cid:48) = εs/ concludes the proof in the case s ∈ R .Finally, to satisfy condition t ≥ T it is sufficient to have n, τ such that nτ ≥ CT s (cid:48) d . The lattercan be alleviated by additional additive constant in the final bound. The result for s ∈ R + follows.Lastly, the result implies also the particular case with s ∈ N by taking ε = 0 . Appendix D. Proof of Theorem 4
Proof
Recall that KAAR, when competing against some function $f$ in an arbitrary RKHS $\mathcal H_k$ with a bounded reproducing kernel, attains the general regret upper bound given in Equation (5). Plugging the bound on the effective dimension of Theorem 3 with $\mathcal H_k = W_2^s(\mathcal X)$ into the regret upper bound (5) gives
\[ R_n(\mathcal F) \le \tau \|f\|^2_{\mathcal H_k} + M^2 C \log\Big( e + \frac{e n \kappa^2}{\tau} \Big) \Big( \widetilde C \big( n \tau^{-1} \big)^{\frac{d}{2s} + \varepsilon} + 1 \Big), \qquad (26) \]
for any $\varepsilon > 0$. Balancing the first and second terms in order to minimize the right-hand side (by choosing an appropriate value of $\tau$), i.e. by setting $\tau := n^{\frac{d}{2s+d}}$, yields
\[ R_n(\mathcal F) \le C\, n^{\frac{d}{2s+d} + \varepsilon} \log(n), \]
where the constant $C$ depends only on $d, s, R, M, \mathcal X$ and does not depend on $n$.
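The tuning used in this proof is easy to state programmatically. The sketch below computes $\tau = n^{d/(2s+d)}$ and the exponent of the resulting regret bound, and runs a toy online kernelized ridge forecaster with this $\tau$ on synthetic data. It is only an illustration of the parameter choice: the Gaussian kernel, the generating function and the regret proxy are placeholders, and the exact KAAR forecaster (4) and its constants are not reproduced here.

import numpy as np

def regret_tuning(n, d, s):
    """tau = n^{d/(2s+d)}, the choice made in the proof, and the exponent d/(2s+d)
    of the resulting regret bound n^{d/(2s+d)} log n."""
    expo = d / (2.0 * s + d)
    return n ** expo, expo

def gaussian_kernel(X1, X2, width=0.5):
    # Placeholder kernel; the paper works with the Sobolev kernel of W_2^s(X).
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * width ** 2))

rng = np.random.default_rng(0)
n, d, s = 200, 1, 1.0
tau, expo = regret_tuning(n, d, s)
print(f"tau = n^{expo:.3f} = {tau:.2f}")

X = rng.uniform(-1.0, 1.0, size=(n, d))
f_star = lambda x: np.sin(2.0 * np.pi * x[:, 0])        # a smooth comparator
y = f_star(X) + 0.1 * rng.standard_normal(n)

# Online kernelized ridge regression: at round t, predict y_t from the first t-1 pairs.
cum_loss_alg, cum_loss_f = 0.0, 0.0
for t in range(n):
    if t == 0:
        y_hat = 0.0
    else:
        K_past = gaussian_kernel(X[:t], X[:t])
        k_t = gaussian_kernel(X[:t], X[t:t + 1])[:, 0]
        y_hat = float(k_t @ np.linalg.solve(K_past + tau * np.eye(t), y[:t]))
    cum_loss_alg += (y[t] - y_hat) ** 2
    cum_loss_f += (y[t] - f_star(X[t:t + 1])[0]) ** 2

# A proxy only: regret against the generating function rather than against the
# best element of the Sobolev ball.
print("cumulative regret (proxy):", cum_loss_alg - cum_loss_f)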
Appendix E. Proof of Theorem 6

We start by introducing a general lemma for the regret of KAAR when competing against a continuous function, and then proceed with the proof of the main theorem.
Lemma 19
Let f ∈ C ( X ) and g ∈ H k . Assume that ( x i ) ni =1 ∈ X n and y i ∈ [ − M, M ] , for some M > . Then the regret of algorithm (4) when competing against function f is bounded by R n ( f ) ≤ τ (cid:107) g (cid:107) H k + M (cid:32) (cid:32) n (cid:107) k (cid:107) ∞ τ (cid:33)(cid:33) d neff ( τ ) + 2 n (cid:107) f − g (cid:107) L ∞ ( X ) (cid:0) M + (cid:107) g (cid:107) L ∞ ( X ) (cid:1) roof Let ε ∈ (0 , and let g ∈ H k be some function which is to be chosen later. Denote by v the vector v = ( f ( x ) , . . . , f ( x n )) ∈ R n and w = S n g = ( g ( x ) , . . . , g ( x n )) ∈ R n . We candecompose the regret in the following way: R n ( f ) = (cid:13)(cid:13) Y n − (cid:98) Y n (cid:13)(cid:13) − (cid:107) Y n − v (cid:107) = (cid:13)(cid:13) Y n − (cid:98) Y n (cid:13)(cid:13) − (cid:107) Y n − w (cid:107) − (cid:107) v − w (cid:107) + 2 (cid:104) Y n − w, v − w (cid:105)≤ (cid:13)(cid:13) Y n − (cid:98) Y n (cid:13)(cid:13) − (cid:107) Y n − w (cid:107) + 2 (cid:104) Y n − w, v − w (cid:105)≤ R n ( g ) + 2 (cid:104) Y n − w, v − w (cid:105) . (27)Applying the regret upper bound (5) to the element g we get: R n ( g ) ≤ τ (cid:107) g (cid:107) H k + M (cid:32) (cid:32) n (cid:107) k (cid:107) ∞ τ (cid:33)(cid:33) d neff ( τ ) , where we recall that d neff ( τ ) is the effective dimension of the RKHS H k with respect to the sample D ⊂ X n . For the second term on the right hand side in inequality (27) we have: (cid:104) Y n − w, v − w (cid:105) ≤ n (cid:88) t =1 | ( y t − g ( x t ))( f ( x t ) − g ( x t )) |≤ n (cid:88) t =1 ( | y t | + | g ( x t ) | ) | f ( x t ) − g ( x t ) | ← by triangle inequality ≤ n (cid:107) f − g (cid:107) L ∞ ( X ) (cid:0) M + (cid:107) g (cid:107) L ∞ ( X ) (cid:1) ← by continuity (28)Putting together the aforementioned bounds we obtain our final result. Proof of Theorem 6.
Let σ > be some fixed bandwidth. By Proposition 11 for any function f ∈ W βp ( X ) ⊂ W β ( X ) , p ≥ and σ > there exists f σ ∈ B σ such that for ≤ r ≤ β we have: (cid:107) f − f σ (cid:107) L ( X ) ≤ C σ − β (cid:107) f (cid:107) W β ( X ) , (cid:107) f σ (cid:107) W r ( X ) ≤ C σ ( r − β ) (cid:107) f (cid:107) W β ( X ) . (29)Since f ∈ W βp ( X ) and p ≥ so the inclusion implies that we have (cid:107) f (cid:107) W β ( X ) ≤ C (cid:107) f (cid:107) W βp ( X ) with some constant C . Let ε > be any positive number. Applying Sobolev embedding Theorem(see Equation (9) on page 60 in Edmunds and Triebel (1996) with s = d/ ε , s = 0 , n = d , p = 2 , and p = ∞ ), Proposition 12 for a function f − f σ ∈ W βp ( X ) and the fact that for p ≥ NLINE NONPARAMETRIC REGRESSION WITH KERNELS W βp ( X ) ⊂ W βp/ ( X ) , W βp ( X ) ⊂ W β ( X ) we get (cid:107) f − f σ (cid:107) L ∞ ( X ) ≤ C (cid:107) f − f σ (cid:107) W d ε ( X ) ← by Sobolev embedding Theorem ≤ C (cid:107) f − f σ (cid:107) d +2 ε βp W βp/ ( X ) (cid:107) f − f σ (cid:107) − d +2 ε βp L ( X ) ← by Inequality (17) ≤ C (cid:107) f − f σ (cid:107) d +2 ε βp W βp/ ( X ) (cid:16) σ − β (cid:107) f (cid:107) W β ( X ) (cid:17) − d +2 ε βp ← by Proposition 11 ≤ C (cid:107) f − f σ (cid:107) d +2 ε βp W βp ( X ) σ − β + dp + ε p (cid:107) f (cid:107) − d +2 ε βp W βp ( X ) , ← by inclusion (30)with a constant C which does not depend on f, f σ or σ . Since f σ satisfies (29) we obtain for any r ∈ R + , r ≥ β : (cid:107) f σ (cid:107) W r ( X ) ≤ ˜ C σ ( r − β ) (cid:107) f (cid:107) W β ( X ) ≤ ˜ C σ ( r − β ) (cid:107) f (cid:107) W βp ( X ) , (31)where we obtain the second inequality by inclusion of the Sobolev spaces ( W βp ( X ) ⊂ W β ( X ) ) andthe constant ˜ C depends only on X , d, β but not σ . Notice that by the triangle inequality and (31)with r = β we have: (cid:107) f − f σ (cid:107) W βp ( X ) ≤ (cid:107) f (cid:107) W βp ( X ) + (cid:107) f σ (cid:107) W βp ( X ) ≤ (cid:16) C p (cid:17) (cid:107) f (cid:107) W βp ( X ) . (32)Thus, plugging (32) in the Equation (30) we deduce: (cid:107) f − f σ (cid:107) L ∞ ( X ) ≤ C σ − β + dp + ε (cid:107) f (cid:107) W βp ( X ) . (33)Note also that by using (31) with r = s ≥ β we have: (cid:107) f σ (cid:107) W s ( X ) ≤ C σ ( s − β ) (cid:107) f (cid:107) W β ( X ) ≤ C σ ( s − β ) (cid:107) f (cid:107) W βp ( X ) , (34)where the last inequality holds since W βp ( X ) ⊂ W β ( X ) . Notice that f σ as in Proposition (11)is of limited bandwidth and is continuous on X , therefore (cid:107) f σ (cid:107) L ∞ ( X ) = (cid:107) f σ (cid:107) C ( X ) . Now since f ∈ W βp ( X ) and β > dp , so by Sobolev Embedding Theorem f ∈ C ( X ) ; for the f σ chosen as inProposition (11) we have (cid:107) f σ (cid:107) L ∞ ( X ) ≤ (cid:107) f σ (cid:107) W βp ( X ) ≤ ˜ C /p (cid:107) f (cid:107) W βp ( X ) , where the last step is true due to (31).Thus from Lemma 19 with g = f σ ∈ C ( X ) , H k = W s ( X ) , we have for the regret of any f ∈ W βp ( X ) it holds that: R n ( f ) ≤ τ (cid:107) f σ (cid:107) W s ( X ) + M log (cid:32) e + en (cid:107) k (cid:107) ∞ τ (cid:33) d neff ( τ )+ 2 n (cid:107) f − f σ (cid:107) L ∞ ( X ) (cid:0) M + (cid:107) f σ (cid:107) L ∞ ( X ) (cid:1) . 
(35) enote ε = σ − , s (cid:48) = s − ε , β (cid:48) = β − ε and plugging (33), (34), and the bound for d neff ( τ ) fromTheorem 3 in (35) while noticing that s (cid:48) − β (cid:48) = s − β we obtain for any f : R n ( f ) ≤ ˜ C τ ε − (cid:16) s (cid:48) − β (cid:48) (cid:17) (cid:107) f (cid:107) W βp ( X ) + ˜ C M (cid:0) (cid:0) n (cid:107) k (cid:107) ∞ τ (cid:1)(cid:1) n d s (cid:48) τ − d s (cid:48) + ˜ C nε β (cid:48) − d/p (cid:107) f (cid:107) W βp (cid:0) X (cid:1)(cid:0) M + (cid:107) f (cid:107) W βp ( X ) (cid:1) where s (cid:48) = s − ε , β (cid:48) = β − ε and ˜ C , ˜ C , ˜ C are constants depend on d, β, s, d , but not n, M, τ, ε, f . By setting ε = n − s (cid:48) s (cid:48) ( β (cid:48) + d − d/p ) − d ( β (cid:48) + d/p ) , τ = nε s (cid:48) − β (cid:48) − d/p = n − s (cid:48) (2 s (cid:48)− β (cid:48)− d/p )2 s (cid:48) ( β (cid:48) + d − d/p ) − d ( β (cid:48) + d/p ) and noticing that with such choice of τ, ε for any f ∈ F we have R n ( f ) ≤ Cτ ε − (cid:0) s (cid:48) − β (cid:48) (cid:1) = nε β (cid:48) − dp we obtain for any f ∈ F := { f ∈ W βp ( X ) : (cid:107) f (cid:107) W βp ( X ) ≤ R } R n ( F ) = sup f ∈F R n ( f ) ≤ Cn − s (cid:48) ( β (cid:48)− d/p )2 s (cid:48) ( β (cid:48) + d − d/p ) − d ( β (cid:48) + d/p ) = Cn − β (cid:48) p − d ( β (cid:48) p + d ) (cid:18) − d s (cid:48) (cid:19) + d ( p − , where C depends on d, β, s, d, R, M, X , but not n . Now to obtain the final claim we choose s = d + ε thus s (cid:48) = d and we have: − β (cid:48) p − d ( β (cid:48) p + d ) (cid:16) − d s (cid:48) (cid:17) + d ( p − = 1 − βd p − dβ p − + ε pd ( p − , fromwhich the final claim follows. Appendix F. Proof of the lower bounds (Theorem 9)
To prove the lower bounds we use the notion of the sequential fat-shattering dimension (see Defini-tion 12 in Rakhlin and Sridharan (2014)). Recall (see Rakhlin et al. (2014)) that a Z -valued tree z of depth n is a complete rooted binary tree with nodes labeled by the elements of the set Z . Morerigorously, z is a set of labeling functions ( z , . . . , z n ) such that z t : {− , } t − (cid:55)→ Z for every t ≤ n . For any ε ∈ {− , } n , we denote { z t ( ε ) := z t ( ε , . . . , ε t − ) } to be the label of the node atthe level t which is obtained by following the path ε . Definition 20 (Fat-shattering dimension, see Definition 7 in Rakhlin et al. (2014) )
Let γ > .An X -valued tree x of depth d is said to be γ -shattered by F = { f : X (cid:55)→ R } if there exists an R − valued tree s of depth d such that ∀ ε ∈ {− , } d , ∃ f ε ∈ F , s.t. ε t ( f ε ( x t ( ε )) − s t ( ε )) ≥ γ , for all t ∈ { , . . . , d } . The tree s is called a witness. The largest d such that there exists a γ -shattered tree x is called the (sequential) fat-shattering dimension of F and is denoted by fat γ ( F ) . If the last inequality becomes equality, we say that the tree x is exactly shattered by the elementsof F or (alternatively) that class F exactly shatters the tree x .We recall also the notion of sequential covering numbers and the sequential entropy of class F . NLINE NONPARAMETRIC REGRESSION WITH KERNELS
Definition 21
A set V of R − valued trees of depth n forms a γ − cover (with respect to the (cid:96) q norm, ≤ q < ∞ ) of a function class F ⊂ R X on a given X − valued tree x of depth d if ∀ f ∈ F , ∀ ε ∈ {± } d , ∃ v ∈ V, s.t. (cid:32) n d (cid:88) t =1 | f ( x t ( ε )) − v t ( ε ) | q (cid:33) /q ≤ γ. In the case q = ∞ , we have that | f ( x t ( ε )) − v t ( ε ) | ≤ γ for all t ∈ { , . . . , d } . The size of thesmallest γ -cover of a tree x is denoted by N q ( γ, F , x ) ; and N q ( γ, F , d ) = sup x N ( γ, F , x ) wherethe last supremum is taken over all trees of depth d . Finally, the sequential entropy of class F is sup x log N q ( γ, F , x ) . To derive the main results of Theorem 9 we use the following consequences of Lemmata 14,15 inSection 5, Rakhlin and Sridharan (2014).
Lemma 22 (Variant of Lemma 14 in Rakhlin and Sridharan (2014))
Let n ∈ N ∗ , Y = [ − M, M ] and F ⊆ (cid:8) f : X → [ − M/ , M/ (cid:9) for some M > . If γ > such that n ≤ fat γ ( F ) then ˜ R n ( F ) ≥ M nγ. Proof
Since, γ > such that n ≤ fat γ ( F ) , by definition of the fat-shattering dimension there existsan X − valued tree x of depth n (and a witness of shattering µ ) which is shattered by the elementsof F . Further proof follows the same lines as in the original argument of Lemma 14 of Rakhlin andSridharan (2014) with the tree x , witness of shattering µ , β := γ and functions (as well as witnessof shattering bounded in [ − M , M ] instead of [ − , ) therein. Lemma 23 (Variant of Lemma 15 by Rakhlin and Sridharan (2014))
Let n ∈ N ∗ , γ > , and F (cid:48) be a class of functions from X to [ − M/ , M/ which exactly γ -shatters some tree x of depth fat γ ( F (cid:48) ) < n . Then the minimax regret with respect to F (cid:48) is lower-bounded as ˜ R n (cid:0) F (cid:48) (cid:1) ≥ M C (cid:16) √ γ (cid:113) n fat γ ( F (cid:48) ) − nγ (cid:17) . (36) Proof
The lemma is proved in the same way as Lemma 15 in Rakhlin and Sridharan (2014), bynoting that since F (cid:48) exactly shatters x , so we can consider F = F (cid:48) in the original proof. Theargument follows then the same lines by noticing that the target functional class is a subset of { f : X (cid:55)→ [ − M , M ] } (instead of { f : X (cid:55)→ [ − , } as in the original argument).To prove the lower bounds we provide a tight control of fat γ ( F ) (in terms of the scale γ , whileconstants may depend on the range Y , d , β ) for F being the bounded ball in Sobolev space W βp ( X ) .We recall the notion of sequential Rademacher complexity (see Rakhlin and Sridharan (2014)): R n ( F ) = sup x E ε (cid:34) n − sup f ∈F n (cid:88) t =1 ε t f ( x t ( ε )) (cid:35) , where E ε [ · ] denotes the expectation under the product measure P = ( δ − + δ ) ⊗ n , the supremumis over all X − valued trees of depth n . Firstly we provide an auxiliary Lemma which provides anupper bound of the Sobolev ball B W βp ( X ) (0 , . emma 24 Let n ∈ N , n ≥ , M > an let F := B W βp ( X ) (0 , M/ , p ≥ . For the fat-shatteringdimension fat γ ( F ) on the scale γ > when β (cid:54) = d it holds fat γ ( F ) ≤ max { ˜ C γ − (cid:16) dβ ∨ (cid:17) , } , where C is some constant which depends on β, d, M but not on γ . In the case βd = 1 / we have fat γ ( F ) ≤ max { ˜ C (cid:18) γ log( γ ) (cid:19) − , } , where C is some constant which depends on β, d, M . but not on γ . Proof of Lemma 24
Following from Definition (20) if x of depth n is γ − shattered by the elementsof F then n ≤ fat γ ( F ) . For an arbitrary functional class F from the definition of the fat-shatteringdimension for any γ > such that fat γ ( F ) > n we have that R n ( F ) ≥ γ (one readily checks thisby considering Rademacher complexity over the set of n shattered points). Therefore R n ( F ) ≥ sup { γ : fat γ ( F ) > n } , which is equivalent to fat γ ( F ) ≤ min { n : R n ( F ) ≤ γ } . By Proposition 1and Definition 3 in Rakhlin et al. (2014) for all c ∈ R we have R n ( c F ) = | c |R n ( F ) . where c F = { cf : f ∈ F } . Taking c = M we have for F (cid:48) = B W β ∞ (cid:0) X (cid:1) (0 , M ) that R n (cid:0) F (cid:48) (cid:1) = M R n (cid:0) F (cid:1) ,where F = B W β ∞ (cid:0) X (cid:1) (0 , . From the definition of (cid:107)·(cid:107) W β ∞ ( X ) it follows that if f ∈ B W β ∞ ( X ) (0 , then max x ∈X | f ( x ) | ≤ . By Theorem 3 in Rakhlin et al. (2014) we have for any functional class F ⊂ [ − , X R n (cid:0) F (cid:1) ≤ sup x inf ρ ∈ (0 , (cid:18) ρ + 12 √ n (cid:90) ρ (cid:112) log N ( δ, F , x ) dδ (cid:19) . (37)It is straightforward to check that for any tree z it holds that N ( γ, F , z ) ≤ N ∞ ( γ, F , z ) . (38)Furthermore, if N ∞ ( F , γ ) is a metric entropy of class F on scale γ > then it is easy to checkthat for any tree z of depth d ≥ and any scale γ > , N ∞ ( γ, F , z ) ≤ N ∞ ( γ, F ) . Indeed, thisfollows trivially by taking for any tree z witness v ( · ) = g ( z ( · )) , where g ( · ) is the element of γ − netsuch that (cid:107) f − g (cid:107) ∞ ≤ γ . Furthermore, for F = B W βp ( X ) (0 , , β > d/p the metric entropy of F on the scale δ is (up to some constant C which does not depend on δ ) upper bounded by δ − dβ . Thelatter bound is a well-known result and it can be deduced from the general result for Besov spacesstated in Theorem 3.5 in Edmunds and Triebel (1996) (see also Equation (38) on page 19 in Vovk(2006b)). Thus, using Equations (37),(38), the fact that metric entropy uniformly bounds sequentialentropy, properties of Rademacher complexity (see Lemma 3 in Rakhlin et al. (2015)) and the upperbound on the metric entropy of the Sobolev ball F = B W β ∞ ( X ) (0 , M/ we get: R n ( F ) = M R n (cid:18) M B W β ∞ (cid:18) , M (cid:19)(cid:19) ≤ M R n (cid:16) B W β ∞ ( X ) (0 , (cid:17) ≤ M ρ ∈ (0 , (cid:18) ρ + 12 C √ n (cid:90) ρ δ − d β dδ (cid:19) ≤ C inf ρ ∈ (0 , (cid:18) ρ + 12 √ n (cid:90) ρ δ − d β dδ (cid:19) , (39) NLINE NONPARAMETRIC REGRESSION WITH KERNELS where we use C = M max { , C } for completeness. Notice that if β > d then integral (cid:82) t − d β dt is finite, thus in this case in (39) we can take ρ = 0 which implies R n ( F ) ≤ C √ n − d β . When β < d then the choice ρ = ρ min = (9 n − ) βd leads to the bound R n ( F ) ≤ C n − βd − βd . Finally,in case when β = d with the choice ρ = √ n , one gets R n ( F ) ≤ C ln( n ) √ n .Thus we obtain R n (cid:0) F (cid:1) ≤ C Kn − ( βd ∧ ) (40)where in Equation (40) K = − ( βd ∧ d β ) if β (cid:54) = d otherwise K = ln( n )2 . If βd (cid:54) = then we have fat γ ( G ) ≤ fat γ ( F ) ≤ min { n : R n ( F ) ≤ γ }≤ min (cid:26) n : 12 C Kn − ( βd ∧ ) ≤ γ (cid:27) ≤ (cid:100) (cid:18) γ C K (cid:19) − (cid:16) dβ ∨ (cid:17) (cid:101)≤ max { C γ − (cid:16) dβ ∨ (cid:17) , } . with C = 2 · (24 C K ) dβ ∨ . 
In case when βd = we have that by any n ≥ (cid:100) (cid:16) γ/ C log( γ/ C ) (cid:17) − (cid:101) ensures that C ln( n ) √ n ≤ γ from which we deduce fat γ ( G ) ≤ max { C (cid:16) γ log( γ ) (cid:17) − , } .To derive the first statement of Theorem 9 we construct a class G ⊂ B W βp ( X ) (0 , M ) whichsatisfies Lemmata 22,23 and deduce the final bound for the minimax regret ˜ R n (cid:16) B W βp ( X ) (0 , M ) (cid:17) by inclusion argument. Class construction.
We provide a class construction taking inspiration from the nonparametricregression in the statistical learning scenario (see for example Theorem 3.2 in Györfi (2002)). Recallthat X = [ − , d ; for a given n ∈ N denote b := n − d . Consider the following set of half openintervals A = { A (cid:96) = [ − (cid:96)b, − (cid:96) + 1) b ) , ≤ (cid:96) ≤ (cid:98) n /d (cid:99) − } , and let P = A d be its d − th power. Let I := { , . . . , (cid:98) n d (cid:99) − } d , N = | I | = (cid:98) n d (cid:99) d and π : I (cid:55)→ { , . . . , N } be a function which maps an element k ∈ I to its index in the lexicographicorder among the elements in I . Since lexicographic order is a total order, we have that π ( · ) is abijection. For each k ∈ I := { , . . . , (cid:98) n d (cid:99) − } d such that π ( k ) = j we denote B j = (cid:81) di =1 [ − k i b, − k i + 1) b ) . Notice that ∪ Nj =1 B j ⊂ X and for i (cid:54) = j obviously B i ∩ B j = ∅ . Fora cube B t , t ∈ { , . . . , N } we denote a t ∈ R d to be its center. One can show explicitly that a t = (cid:0) b (cid:0) + (cid:0) π − ( t ) (cid:1) (cid:1) − , . . . , b (cid:0) + (cid:0) π − ( t ) (cid:1) d (cid:1) − (cid:1) . Consider the following set of functions: F β,d,n = (cid:26) f : f ( x ) = M n − βd (cid:107) g (cid:107) W β ∞ ( X ) N (cid:88) t =1 c t g n,t ( x ) , c j ∈ {− , } (cid:27) , (41) here g n,t ( x ) = g (cid:0) n d ( x − a t ) (cid:1) , and g such that g ( x ) = (cid:16) − σ ( (cid:107) x (cid:107) − a c − a ) (cid:17) , c = , a = and σ ( t ) = h ( t ) h ( t )+ h (1 − t ) , h ( t ) = e − /t I t> for t ∈ R , x ∈ R d . We need the following Lemma whichshows that the functional class F β,d,n defined by Equation (41) is included in the ball of the space W β ∞ ( X ) . Lemma 25
Let β > , d ≥ , n ∈ N ; consider N := (cid:98) n d (cid:99) d and the class F β,d,n , as defined in (41) . It holds that F β,d,n ⊂ B L ∞ ( X ) (cid:16) , M (cid:17) . Moreover, an even stronger inclusion holds, namely that: F β,d,n ⊂ B W β ∞ ( X ) (cid:16) , M (cid:17) . Proof
Firstly, notice that g ( ) = (cid:0) − σ ( − a c − a ) (cid:1) . Since t := − a c − a < so h ( t ) = 0 andconsequently σ ( t ) = 0 from which we have g (0) = (1 − σ ( t )) = . For a cube B j if x / ∈ B j thenwe have g n,j ( x ) = 0 . Indeed, since x / ∈ B j so for a j center of B j holds max i ≤ d (cid:12)(cid:12)(cid:12) x ( i ) − a ( i ) j (cid:12)(cid:12)(cid:12) ≥ n − d .Therefore, since (cid:13)(cid:13) n d (cid:16) x ( i ) − a ( i ) j (cid:17)(cid:13)(cid:13) ≥ n d max i ≤ d | x − a j | ≥ and because g ( · ) as constructedabove is a mollifier from R d to R with non-zero support on B R d (0 , / (see paragraph 13 in Loring(2011)) we have g n,j ( x ) = g (cid:16) n d ( x − a j ) (cid:17) = 0 .From the definition of the norm in the functional class W β ∞ ( X ) it follows that for any x ∈ X we have | g ( x ) | ≤ (cid:107) g (cid:107) L ∞ ( X ) ≤ (cid:107) g (cid:107) W β ∞ ( X ) . Furthermore, for any element f ∈ F β,d,n , for any x ∈ X \ ∪ Nk =1 B k we have f ( x ) = 0 . If x ∈ ∪ Nk =1 B k then there exists some cube B j with x ∈ B j .Thus we get | f ( x ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M n − βd (cid:107) g (cid:107) W β ∞ ( X ) N (cid:88) t =1 c j g n,t ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ M n − βd | g n,j ( x ) |(cid:107) g (cid:107) W β ∞ ( X ) ≤ M n − βd (cid:107) g (cid:107) L ∞ ( X ) (cid:107) g (cid:107) W β ∞ ( X ) ≤ M , so that F β,d,n ⊂ B L ∞ ( X ) (cid:0) , M (cid:1) and the first part of the claim is proved.Let β = m + σ . For every r ≤ m , r ∈ N and x ∈ X we notice that if x ∈ X \ ∪ Nk =1 B k thensince it is a finite linear combination of mollifiers we have D r f ( x ) = 0 . By a chain rule for every NLINE NONPARAMETRIC REGRESSION WITH KERNELS f ∈ F β,d,n , x ∈ X , k ≤ N such that x ∈ B j : sup x ∈X | D r f ( x ) | = sup B j ∈P sup x ∈ B j | D r f ( x ) | = sup B j ∈P sup x ∈ B j M (cid:107) g (cid:107) W β ∞ ( X ) (cid:12)(cid:12)(cid:12) D r n − βd g n,j ( x ) (cid:12)(cid:12)(cid:12) = sup B j ∈P sup x ∈ B j M n − βd (cid:107) g (cid:107) W β ∞ ( X ) (cid:12)(cid:12)(cid:12) D r g (cid:16) n d ( x − a j ) (cid:17)(cid:12)(cid:12)(cid:12) ← definition of g , x := n d ( x − a j )= M (cid:107) g (cid:107) W β ∞ ( X ) n r − βd sup B j ∈P sup x ∈ B j | D r g ( x ) | ← chain rule ≤ M x ∈X | D r g ( x ) |(cid:107) g (cid:107) W β ∞ ( X ) = M (cid:107) D r g (cid:107) L ∞ ( X ) (cid:107) g (cid:107) W β ∞ ( X ) ≤ M ← by the definition of (cid:107)·(cid:107) W β ∞ ( X ) . Consider the D γ f — derivative of order | γ | = m of function f ∈ F β,d,n . 
For some ≤ j ≤ N we have for any x, z, ∈ B j ( here B j = B j ∪ ∂B j ) that it holds | D γ f ( x ) − D γ f ( z ) |(cid:107) x − z (cid:107) σ = M n − βd (cid:107) g (cid:107) W β ∞ ( X ) | D γ g n,j ( x ) − D γ g n,j ( z ) |(cid:107) x − z (cid:107) σ = M n − βd (cid:107) g (cid:107) W β ∞ ( X ) (cid:12)(cid:12)(cid:12) D γ g (cid:0) n d ( x − a j ) (cid:1) − D γ g (cid:0) n d ( z − a j ) (cid:1)(cid:12)(cid:12)(cid:12) (cid:107) x − z (cid:107) σ = M n − βd (cid:107) g (cid:107) W β ∞ ( X ) (cid:12)(cid:12)(cid:12) D γ g ( x ) ∂ γ ∂x ...∂x d n d ( x − a j ) − D γ g ( z ) ∂ γ ∂z ...∂z d n d ( z − a j ) (cid:12)(cid:12)(cid:12) n − σd (cid:107) x − z (cid:107) σ ← chain rule ≤ M n − βd + md + σd (cid:107) g (cid:107) W β ∞ ( X ) sup x,z ∈X ,x (cid:54) = z | D γ g ( x ) − D γ g ( z ) |(cid:107) x − z (cid:107) σ ← taking sup over the cube X = M (cid:107) g (cid:107) W β ∞ ( X ) sup x,z ∈X ,x (cid:54) = z | D γ g ( x ) − D γ g ( z ) |(cid:107) x − z (cid:107) σ ≤ M ← from definition of (cid:107)·(cid:107) W β ∞ ( X ) Furthermore, if B j , B k ∈ P are two different cubes then for x ∈ B j and z ∈ B k consider elements x ∈ ∂B j , z ∈ ∂B k which lie on the line between x, z . Notice that if B j and B k have common d − hyperplane (i.e. they are the neighbour cells) then x = z . In all cases it follows from the onstruction of f ∈ F β,d,n that D γ f ( x ) = D γ f ( z ) = 0 . Therefore we have | D γ f ( x ) − D γ f ( z ) |(cid:107) x − z (cid:107) σ = M n − βd (cid:107) g (cid:107) W β ∞ ( X ) | D γ g n,j ( x ) − D γ g n,j ( x ) − D γ g n,k ( z ) + D γ g n,k ( z ) |(cid:107) x − z (cid:107) σ ≤ M n − βd (cid:107) g (cid:107) W β ∞ ( X ) | D γ g n,j ( x ) − D γ g n,j ( x ) | + | D γ g n,k ( z ) − D γ g n,k ( z ) |(cid:107) x − z (cid:107) σ ← by triangle ineq. ≤ M n − βd (cid:107) g (cid:107) W β ∞ ( X ) (cid:107) g (cid:107) W β ∞ ( X ) n βd ( (cid:107) x − x (cid:107) σ + (cid:107) z − z (cid:107) σ ) (cid:107) x − z (cid:107) σ ← using the result for x, x ∈ B j ≤ M σ (cid:107) x − x (cid:107) σ + (cid:107) z − z (cid:107) σ (cid:107) x − z (cid:107) σ ← since < σ < ≤ M σ (cid:18) (cid:107) x − x (cid:107) + (cid:107) z − z (cid:107) (cid:19) σ (cid:107) z − x (cid:107) σ ← by Jensen’s inequality ≤ M (cid:107) x − z (cid:107) σ (cid:107) x − z (cid:107) σ = M ← since x , z lie on the line between x, z. If for any pair ( x, z ) ∈ X , x (cid:54) = z one (without losing of generality let it be z ) does not belongto the union of the cubes ∪ B ∈P B then we can substitute this point by the point z , which is theintersection of the segment [ x, z ] and the boundary of the closest cube to the point z . Notice thatin this case D γ f ( z ) = D γ f ( z ) = 0 by construction of f and (cid:107) x − z (cid:107) σ ≥ (cid:107) x − z (cid:107) σ . Applyingaforementioned analysis to a pair ( x, z ) which lies in some (different) cubes B j , B k we get | D γ f ( x ) − D γ f ( z ) |(cid:107) x − z (cid:107) σ ≤ | D γ f ( x ) − D γ f ( z ) |(cid:107) x − z (cid:107) σ ≤ M . Finally case ( x, z ) ∈ X where none of the points belong to the union of the cubes is trivial.Considering these cases together we have sup x,y ∈X ,x (cid:54) = y | D γ f ( x ) − D γ f ( y ) |(cid:107) x − y (cid:107) σ ≤ M for any f ∈F β,d,n . Therefore, F β,d,n ⊂ B W β ∞ ( X ) (cid:0) , M (cid:1) . Proof of Theorem 9 . 
For n ≥ consider the functional class F β,d,n as given by Equation 41.Consider a X − valued tree x of depth N := (cid:98) n d (cid:99) d constructed as follows: for any ε ∈ {− , } N ,any t ≤ N we set x t ( ε ) = a t , where a t is the center of the correspondent cube. Now, for any ε ∈{− , } n consider f ε ( · ) ∈ F β,d,n where F β,d,n as in (41) and f ε ( x ) = M (cid:107) g (cid:107) Wβ ∞ ( X ) (cid:80) Nj =1 ε j n − βd g n,j ( x ) .Then for the tree x , for every ε ∈ {− , } N , ≤ t ≤ N and a real-valued (witness of shattering) s t ( · ) := 0 we have NLINE NONPARAMETRIC REGRESSION WITH KERNELS ε t ( f ε ( x t ( ε )) − s t ( ε )) = ε t f ε ( x t ( ε )) = ε t f ε ( a t ) = M (cid:107) g (cid:107) W β ∞ ( X ) ε t N (cid:88) j =1 ε j n − βd g n,j ( a t )= M (cid:107) g (cid:107) W β ∞ ( X ) n − βd g n,t ( a t )= M (cid:107) g (cid:107) W β ∞ ( X ) n − βd g (0) = C M,g n − βd , (42)where C M,g := M (cid:107) g (cid:107) Wβ ∞ ( X ) . Thus class F β,d,n with ˜ γ = ˜ γ ( n ) := C M,g n − βd (exactly) shatters thetree x . Notice that N = (cid:98) n d (cid:99) d ≤ d n ; from the other side we have N ≥ (cid:0) n d (cid:1) d ≥ n . Thus, fromthe definition of fat-shattering dimension it follows fat ˜ γ (cid:16) F β,d,n (cid:17) ≥ N ≥ n. (43)All conditions of Lemma 22 are fulfilled for the class F β,d,n ; by Lemma 25 F β,d,n ⊂ B W β ∞ ( X ) (cid:0) , M (cid:1) .Applying Lemma 22 to the class F β,d,n , using Lemma 25 and simple inclusion B W β ∞ ( X ) (cid:0) , M (cid:1) ⊂ B W βp ( X ) (cid:0) , M (cid:1) ⊂ B W βp ( X ) (0 , M ) we obtain for the Sobolev ball F := B W βp ( X ) (0 , M )˜ R n ( F ) ≥ ˜ R n ( F β,d,n ) ≥ M n ˜ γ ≥ · M (cid:107) g (cid:107) W β ∞ ( X ) n − βd , so that the case dp < β ≤ d is proved.To prove the second bound, notice that by Lemma (25) for any n ∈ N ∗ , F β,d,n ⊂ B W β ∞ ( X ) (cid:0) , M (cid:1) which implies that fat γ (cid:16) B W β ∞ (cid:0) , M (cid:1)(cid:17) ≥ fat γ ( F β,d,n ) . In particular,this holds if we choose n := (cid:98) (cid:16) γC M,g (cid:17) − dβ ∨ (cid:99) then n < (cid:16) γC M,g (cid:17) − dβ ∨ which is equivalent to C M,g n − βd ≤ γ . Notice thatif γ < γ then fat γ ( F ) ≥ fat γ ( F ) . Applying the first property to the classes (cid:16) B W β ∞ (cid:0) , M (cid:1)(cid:17) and F β,d,n on the scale γ and the second property for the class F β,d,n o on the scales γ and C M,g n − βd we consequently get: fat γ (cid:18) B W β ∞ ( X ) (cid:18) , M (cid:19)(cid:19) ≥ fat γ ( F β,d,n ) ≥ fat C M,g n − βd ( F β,d,n ) ≥ n . (44)Finally, since n ≥ so by using elementary (cid:98) a (cid:99) ≥ a we have n ≥ (cid:16) γC M,g − dβ ∨ (cid:17) therefore fat γ (cid:16) B W β ∞ (0 , M/ (cid:17) ≥ fat γ ( F β,d,n ) ≥ (cid:32)(cid:18) γC M,g (cid:19) − dβ ∨ (cid:33) . hoose γ := C dβ + d M,g n − β β + d , n := (cid:98) (cid:16) γC M,g (cid:17) − dβ (cid:99) where C is a constant as in Lemma 24 and C M,g is a constant as in Equation (42). For β > d we have by inclusion and by Lemma 24 that forany n with the choice of γ as before it holds: fat γ ( F β,d,n ) ≤ fat γ (cid:0) B W β ∞ ( X ) (cid:0) , M (cid:1)(cid:1) ≤ ˜ Cn β β + d < ˜ Cn . Furthermore, as for any n ∈ N , B W β ∞ ( X ) (cid:0) , M (cid:1) ⊃ F β,d,n so, in particular, B W β ∞ ( X ) (cid:0) , M (cid:1) ⊃ F β,d,n which implies R n (cid:16) B W β ∞ ( X ) (cid:0) , M (cid:1)(cid:17) ≥ R n (cid:0) F β,d,n (cid:1) . 
Thus, applying Lemma 23 to theclass F β,d,n with any n and γ, n as above we obtain: ˜ R n ( F β,d,n ) ≥ Cγ √ n (cid:0) √ (cid:113) fat γ ( F β,d,n ) − √ nγ (cid:1) ← by Lemma 24 ≥ Cn − β β + d √ n (cid:18) C dd + β M,g n d β + d ) − C dβ + d M,g n − β β + d (cid:19) ← plugin the bound for fat γ ( F β,d,n ) and γ ≥ ˜ C n − β β + d + + d β + d ) = ˜ C n d β + d , where C is some constant independent of n . Now the final bound for β < d follows fromthe inclusion F β,d,n ⊂ B W β ∞ (cid:0) , M (cid:1) ⊂ B W βp ( X ) (cid:0) , M (cid:1) which implies ˜ R n (cid:16) B W β ∞ ( X ) (0 , M ) (cid:17) ≥ ˜ R n (cid:0) B W βp ( X ) (0 , M ) (cid:1) ≥ ˜ R n (cid:0) F β,d,n (cid:1) . Appendix G. Regret rates comparison
Here we provide a short comparison of the exponents of the theoretical regret rates of KAAR (4) and EWA (Vovk (2006a)). One can check that when βd < √p − p, EWA provides a better rate than KAAR given by (4) with $s = d/2 + \varepsilon$, $\varepsilon > 0$, and $\tau_n$ chosen as in Theorem 6. For a fixed pair $(\beta, d)$ this means that, as the regularity of the function $f$ in terms of its integral $p$-norm increases, KAAR estimates its behaviour better than EWA over a larger range of possible values $(\beta, d)$. This effect is illustrated in Figure 2.

Figure 2: Exponent of the regret in the case $W^\beta_p(\mathcal X)$, $1/p < \beta/d \le 1$, $p = 4, \dots$